Project Information:
Project 1 involves analyzing data on all US colleges and universities that primarily grant bachelor’s degrees.
Intent:
The goal of this project is to apply the methods from Chapter 6: Descriptive Statistics to these data. The methods include calculating the mean, median, variance, standard deviation, and correlation. The data are also represented with the following graphs: histograms, box plots, bar plots, and pie charts.
Steps:
mean(college$AdmitRate, na.rm = TRUE)
## [1] 0.6702025
Comment:
On average, a school admits approximately 67% of its applicants (an unweighted mean across schools).
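A mean of school-level rates weights every school equally, so it can differ from the share of all applicants who are admitted. The following is a minimal sketch on made-up numbers; the `Applicants` column is hypothetical and is not confirmed to exist in the `college` data.

```r
# Made-up example: three schools with very different applicant pools.
# `Applicants` is a hypothetical column, used here only for illustration.
toy <- data.frame(
  AdmitRate  = c(0.90, 0.80, 0.10),
  Applicants = c(1000, 2000, 50000)
)

# Unweighted mean: every school counts equally (what mean() reports).
unweighted <- mean(toy$AdmitRate, na.rm = TRUE)

# Weighted mean: schools count in proportion to their applicants,
# which gives the share of all applicants who were admitted.
weighted <- weighted.mean(toy$AdmitRate, w = toy$Applicants, na.rm = TRUE)

print(c(unweighted = unweighted, weighted = weighted))
```

In this toy case the large, selective school pulls the weighted figure well below the unweighted one.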
sd(college$AdmitRate, na.rm = TRUE)
## [1] 0.208179
Comment:
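The standard deviation of admission rates is approximately 0.208, or 20.8 percentage points. This indicates considerable variation in selectivity: one standard deviation on either side of the mean spans roughly 46% to 88%, a range that would contain about two-thirds of schools if the distribution were approximately normal.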
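How many schools actually fall within one standard deviation of the mean can be checked directly rather than assumed. The helper below, `share_within_one_sd()`, is a hypothetical name introduced for illustration, and the rates are made up.

```r
# Share of non-missing values within one standard deviation of the mean.
# Hypothetical helper, not part of the original analysis.
share_within_one_sd <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sd(x)
  mean(x >= m - s & x <= m + s)
}

# Made-up admission rates, for illustration only.
toy_rates <- c(0.20, 0.45, 0.55, 0.65, 0.70, 0.75, 0.80, 0.90, 1.00, NA)
share <- share_within_one_sd(toy_rates)
print(share)
```

Applied to the project data, the call would be `share_within_one_sd(college$AdmitRate)`.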
# Histogram of average total cost (stray trailing comma removed)
hist(college$Cost,
     main = "Distribution of Average Total Cost",
     col = "red",
     xlab = "Average Total Cost",
     ylab = "Number of Schools")
Comment:
The histogram shows the distribution of average total cost across all schools. Most schools fall at the lower end of the cost range, with fewer schools at the higher end. Low-to-moderate cost schools are therefore more common, while the more expensive schools form a right tail that pulls the mean upward.
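One quick way to see a long right tail in numbers is to compare the mean with the median. The sketch below uses made-up costs, not values from the `college` data.

```r
# Made-up costs with a long right tail, for illustration only.
toy_cost <- c(12000, 14000, 15000, 16000, 18000, 20000, 65000)

cost_mean   <- mean(toy_cost, na.rm = TRUE)
cost_median <- median(toy_cost, na.rm = TRUE)

# In a right-skewed distribution the mean typically exceeds the median.
print(c(mean = cost_mean, median = cost_median))
```

Running the same two calls on `college$Cost` would show whether the mean sits above the median there as well.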
sd(college$NetPrice, na.rm = TRUE)
## [1] 7854.096
Comment:
The standard deviation of the average net price across all schools is approximately $7,854. This indicates considerable variation in the average out-of-pocket cost for students after financial aid.
median(college$Debt, na.rm = TRUE)
## [1] 713.5
Comment:
The median debt for students who complete their program is $713.50 across all schools. Half of the schools have a value at or below this figure, and half have a value at or above it.
cor(college$Enrollment, college$CompRate, use = "complete.obs")
## [1] 0.1678195
# Scatter plot of enrollment against completion rate (stray trailing comma removed)
plot(college$Enrollment, college$CompRate,
     main = "Undergraduate Enrollment vs. Completion Rate",
     xlab = "Undergraduate Enrollment",
     ylab = "Completion Rate",
     pch = 16,
     col = "blue")
Comment:
There is a weak positive correlation \((r \approx 0.168)\) between undergraduate enrollment and completion rate. Larger schools have slightly higher completion rates on average, but the relationship is not strong.
The scatter plot shows a slight upward trend: completion rate tends to rise with undergraduate enrollment, but enrollment alone is a poor predictor of it.
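Squaring the correlation gives a rough sense of how weak the relationship is. As a worked step, using the reported value and assuming a simple linear relationship between the two variables:

\[
r^2 \approx (0.168)^2 \approx 0.028
\]

Under that assumption, enrollment would account for only about 2.8% of the variation in completion rate.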
sd(college$FacSalary, na.rm = TRUE)
## [1] 2563.004
var(college$FacSalary, na.rm = TRUE)
## [1] 6568988
library(ggplot2)  # provides ggplot(), geom_histogram(), labs(), theme_minimal()

ggplot(college, aes(x = FacSalary)) +
  geom_histogram(binwidth = 1000, fill = "darkgreen", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Average Faculty Salaries",
    x = "Average Faculty Salary ($)",
    y = "Number of Schools"
  ) +
  theme_minimal()
## Warning: Removed 54 rows containing non-finite outside the scale range
## (`stat_bin()`).
Comment:
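The histogram shows how average faculty salaries are distributed across schools. The standard deviation is about $2,563. The variance (about 6.57 million) is the square of the standard deviation and is expressed in squared dollars, so the standard deviation is the easier of the two to interpret. Together they indicate a moderate spread that is meaningful but not extreme.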
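The two outputs can be checked against each other, since the standard deviation is the square root of the variance. Using the values reported above:

\[
s = \sqrt{s^2} = \sqrt{6{,}568{,}988} \approx 2563.00
\]

This agrees with the `sd()` output, so the variance adds no information beyond what the standard deviation already conveys.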
# Calculate frequencies and percentages
control_table <- table(college$Control)
percentages <- round(100 * control_table / sum(control_table), 1)
labels <- paste(names(control_table), "\n", control_table, " (", percentages, "%)", sep="")
# Create pie chart with percentages
pie(control_table,
    main = "Distribution of School Control",
    labels = labels,
    col = rainbow(length(control_table)))
Comment:
The pie chart shows the distribution of schools across control types. Private schools make up the largest share, followed by public and then for-profit schools.
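The ranking read off the pie chart can also be confirmed numerically. The sketch below uses `prop.table()` on a made-up factor; the counts are illustrative and do not come from the `college` data.

```r
# Made-up control types, for illustration only.
toy_control <- factor(c("Private", "Private", "Private",
                        "Public", "Public", "Profit"))

counts <- table(toy_control)

# prop.table() converts counts to proportions that sum to 1.
pct <- round(100 * prop.table(counts), 1)

# Sort from largest to smallest share to make the ranking explicit.
ranked <- sort(pct, decreasing = TRUE)
print(ranked)
```

With the project data, `table(college$Control)` would take the place of `counts`.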
tapply(college$Female, college$Control, mean, na.rm = TRUE)
##  Private   Profit   Public 
## 58.61426 68.65613 58.10881
means <- tapply(college$Female, college$Control, mean, na.rm = TRUE)
means <- round(means, 1)
# barplot() returns the bar midpoints; keep them to position the labels
bp <- barplot(means,
              main = "Average % Female Students by School Control",
              ylab = "Mean % Female",
              col = c("lightblue", "lightcoral", "lightgreen"),
              ylim = c(0, max(means) * 1.15))
text(x = bp,
     y = means + 1,
     labels = paste0(means, "%"),
     cex = 1.1, font = 2)
Comment:
The bar plot indicates that for-profit schools have the highest average percentage of female students, approximately 68.7%. Private (58.6%) and public (58.1%) schools are nearly identical. The gap could reflect variables not examined here, such as differences in the programs that for-profit schools offer compared with public and private schools.
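Group means are easier to judge alongside the number of schools behind each one, since a mean based on few schools is less stable. The sketch below pairs each mean with its non-missing count, using made-up data rather than the `college` data frame.

```r
# Made-up data, for illustration only.
toy <- data.frame(
  Control = c("Private", "Private", "Profit", "Public", "Public", "Public"),
  Female  = c(60, 56, 70, 58, NA, 59)
)

group_mean <- tapply(toy$Female, toy$Control, mean, na.rm = TRUE)

# Count only the non-missing values that actually enter each mean.
group_n <- tapply(!is.na(toy$Female), toy$Control, sum)

summary_tbl <- data.frame(
  Control     = names(group_mean),
  mean_female = as.vector(round(group_mean, 1)),
  n           = as.vector(group_n)
)
print(summary_tbl)
```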
ggplot(college, aes(x = Region, y = PartTime)) +
  geom_boxplot(fill = "lightblue", color = "darkblue") +
  labs(title = "Distribution of Part-Time Students by Region",
       x = "Region", y = "% Part-Time Students") +
  theme_minimal()
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
Comment:
The box plot shows how the percentage of part-time students varies by region. Territories and the Southeast tend to have higher medians and greater spread in part-time percentages, while the Northeast and West show lower medians and more compact distributions.
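The medians and spreads read off a box plot can be backed up with numbers: the center line is the median and the box height is the interquartile range. The sketch below computes both per group with `tapply()`, using made-up data with only two regions.

```r
# Made-up data, for illustration only.
toy <- data.frame(
  Region   = c("Northeast", "Northeast", "Northeast",
               "Southeast", "Southeast", "Southeast"),
  PartTime = c(5, 8, 11, 10, 25, 40)
)

# Median (box center line) and IQR (box height) for each region.
region_median <- tapply(toy$PartTime, toy$Region, median, na.rm = TRUE)
region_iqr    <- tapply(toy$PartTime, toy$Region, IQR, na.rm = TRUE)

print(rbind(median = region_median, IQR = region_iqr))
```

With the project data, `college$PartTime` and `college$Region` would replace the toy columns.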