Introduction:

Project Information:

Project 1 involves analyzing data on all US colleges and universities that primarily grant bachelor’s degrees.

Intent:

The goal of this project is to apply the methods from Chapter 6: Descriptive Statistics to the data. These methods include calculating the mean, median, variance, standard deviation, and correlation. The data are also represented with the following graphs: histograms, box plots, bar plots, and pie charts.

Steps:

  A. Suggest 10 simple questions that can be addressed using Chapter 6 methods.
     1. What is the average admission rate across all schools?
     2. What is the median SAT score for schools that offer a bachelor’s degree?
     3. How much does the average net price across all schools vary?
     4. What is the standard deviation of average net price by region?
     5. Is there a correlation between Name of the School and Completion Rate?
     6. What is the distribution of in-state tuition?
     7. What is the median family income across regions?
     8. How does the percentage of part-time students vary among regions?
     9. What proportion of schools are public, private, or for-profit?
     10. What is the average percent of Asian students in public vs. private schools?
  B. Use ChatGPT to suggest 10 simple questions that can be addressed using Chapter 6 methods.
     1. What is the mean instructional spending per FTE student?
     2. What is the median debt for students across all schools?
     3. What is the variance in net tuition revenue per FTE?
     4. What is the standard deviation of admission rate?
     5. Is there a correlation between the percent Pell grant users and Completion Rate?
     6. What is the distribution of average total cost among all schools?
     7. How variable are average faculty salaries?
     8. How does the average percent of black students differ by region?
     9. What proportion of schools offer a graduate degree?
     10. What proportion of schools offer a graduate degree?
  C. Select 5 questions each from (A) and (B), and analyze them using R.
     1. What is the average admission rate across all schools?
     2. What is the standard deviation of admission rate?
     3. What is the distribution of average total cost among all schools?
     4. How much does the average net price across all schools vary?
     5. What is the median debt for students across all schools who complete their program?
     6. Is there a correlation between undergraduate enrollment and completion rate?
     7. How variable are average faculty salaries?
     8. What proportion of schools are public, private, or for-profit?
     9. What is the average percent of female students in public vs. private schools?
     10. How does the percentage of part-time students vary among regions?

Analysis:

Question 1: What is the average admission rate across all schools?

mean(college$AdmitRate, na.rm = TRUE)
## [1] 0.6702025

Comment:

On average, a school admits about 67% of its applicants. This is an unweighted mean of school-level admission rates, so it is not the share of all applicants nationwide who are admitted.

Question 2: What is the standard deviation of admission rate?

sd(college$AdmitRate, na.rm = TRUE)
## [1] 0.208179

Comment:

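The standard deviation of admission rates is approximately 0.208, or 20.8 percentage points. This indicates considerable variation in selectivity across schools. One standard deviation either side of the mean spans roughly 46% to 88%; if the distribution is roughly bell-shaped, about two-thirds of schools fall in that range.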
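The one-standard-deviation range can be checked against the data rather than assumed. The following is a minimal sketch on made-up values: the vector `admit_rate` is hypothetical (it is not the `college` data), and `share_within_one_sd` is an illustrative helper name, not a base R function.

```r
# Hypothetical helper: share of non-missing values within one SD of the mean.
share_within_one_sd <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sd(x)
  mean(x >= m - s & x <= m + s)
}

# Made-up admission rates for illustration only (not the project data).
admit_rate <- c(0.35, 0.48, 0.55, 0.62, 0.67, 0.70, 0.74, 0.81, 0.90, NA)
share_within_one_sd(admit_rate)
```

Applied to `college$AdmitRate`, a result well away from about two-thirds would suggest the admission rates are not bell-shaped, in which case the 46% to 88% range should be read only as a rough guide.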

Question 3: What is the distribution of average total cost among all schools?

hist(college$Cost,
     main = "Distribution of Average Total Cost",
     col = "red",
     xlab = "Average Total Cost",
     ylab = "Number of Schools")

Comment:

The histogram shows the distribution of the average total cost across all schools. Most schools sit at the lower end of the cost range, with fewer schools at the higher end. The distribution is therefore right-skewed: low-to-moderate cost schools are more common, and the smaller number of expensive schools pulls the mean upward.
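A quick numeric check of right skew is to compare the mean with the median. The sketch below uses made-up costs (the vector `cost` is hypothetical, not `college$Cost`) to show the pattern.

```r
# Made-up right-skewed costs in dollars, for illustration only.
cost <- c(12000, 14000, 15000, 16000, 18000, 21000, 25000, 48000, 62000, NA)

cost_mean   <- mean(cost, na.rm = TRUE)
cost_median <- median(cost, na.rm = TRUE)

# A mean well above the median is consistent with a long right tail.
c(mean = cost_mean, median = cost_median, gap = cost_mean - cost_median)
```

Running the same two calls on `college$Cost` would show whether the histogram's visual impression holds numerically.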

Question 4: How much does the average net price across all schools vary?

sd(college$NetPrice, na.rm = TRUE)
## [1] 7854.096

Comment:

The standard deviation of the average net price across all schools is approximately $7,854. This indicates considerable variation in the average out-of-pocket cost for students after financial aid.

Question 5: What is the median debt for students across all schools who complete their program?

median(college$Debt, na.rm = TRUE)
## [1] 713.5

Comment:

The median debt for students who complete their program, taken across all schools, is $713.50. Half of the schools have a value below this and half have a value above it.

Question 6: Is there a correlation between undergraduate enrollment and completion rate?

cor(college$Enrollment, college$CompRate, use = "complete.obs")
## [1] 0.1678195
plot(college$Enrollment, college$CompRate,
     main = "Undergraduate Enrollment vs. Completion Rate",
     xlab = "Undergraduate Enrollment",
     ylab = "Completion Rate",
     pch = 16,
     col = "blue")

Comment:

There is a weak positive correlation \((r \approx 0.168)\) between undergraduate enrollment and completion rate. Larger schools have slightly higher completion rates on average, although the relationship is not strong.

The scatter plot shows a slight upward trend: completion rate tends to rise with undergraduate enrollment, but with a correlation this weak the points are widely scattered around that trend.
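One way to gauge how weak this relationship is: in a simple linear fit of completion rate on enrollment, the squared correlation is the share of variance in completion rate accounted for by enrollment. Using the rounded value reported above:

\[
r^2 \approx (0.168)^2 \approx 0.028
\]

So enrollment alone would account for only about 2.8% of the variation in completion rates.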

Question 7: How variable are average faculty salaries?

sd(college$FacSalary, na.rm = TRUE)
## [1] 2563.004
var(college$FacSalary, na.rm = TRUE)
## [1] 6568988
ggplot(college, aes(x = FacSalary)) +
  geom_histogram(binwidth = 1000, fill = "darkgreen", color = "white", alpha = 0.8) +
  labs(
    title = "Distribution of Average Faculty Salaries",
    x = "Average Faculty Salary ($)",
    y = "Number of Schools"
  ) +
  theme_minimal()
## Warning: Removed 54 rows containing non-finite outside the scale range
## (`stat_bin()`).

Comment:

The histogram indicates that average faculty salaries vary moderately across schools, with a standard deviation of about $2,563. The variance (about 6.57 million, in squared dollars) is the square of the standard deviation, so it describes the same moderate spread on a less interpretable scale.
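The two reported values agree with each other: 2563.004 squared is about 6.569 million, which matches the printed variance up to rounding. The sketch below shows the same relationship on made-up values (the vector `salary` is hypothetical, not `college$FacSalary`).

```r
# Made-up salaries for illustration only (not the project data).
salary <- c(5200, 6100, 6800, 7400, 8300, 9900, NA)

s  <- sd(salary, na.rm = TRUE)
s2 <- var(salary, na.rm = TRUE)

# var() is sd() squared, so it is in squared dollars; sd() is in dollars.
all.equal(s^2, s2)
```

Because the standard deviation is in the same units as the data, it is usually the easier of the two to interpret.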

Question 8: What proportion of schools are public, private, or for-profit?

# Calculate frequencies and percentages
control_table <- table(college$Control)
percentages <- round(100 * control_table / sum(control_table), 1)
labels <- paste(names(control_table), "\n", control_table, " (", percentages, "%)", sep="")

# Create pie chart with percentages
pie(control_table,
    main = "Distribution of School Control",
    labels = labels,
    col = rainbow(length(control_table)))

Comment:

The pie chart shows the distribution of schools across ownership types. Private schools make up the largest share, followed by public schools and then for-profit schools.
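The proportions behind the pie chart can also be read off as numbers with `prop.table()`. This is a minimal sketch on made-up labels: the vector `control` is hypothetical and only mimics the category names that appear in the output for `college$Control`.

```r
# Made-up control types for illustration only (not the project data).
control <- c("Private", "Public", "Private", "Profit", "Private", "Public", NA)

# table() drops NA by default; prop.table() converts counts to proportions.
control_share <- prop.table(table(control))
round(100 * control_share, 1)
```

A table of percentages makes it easier to state exact shares than estimating slice sizes by eye.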

Question 9: What is the average percent of female students in public vs. private schools?

tapply(college$Female, college$Control, mean, na.rm = TRUE)
##  Private   Profit   Public 
## 58.61426 68.65613 58.10881
means <- tapply(college$Female, college$Control, mean, na.rm = TRUE)
means <- round(means, 1)

barplot(means,
        main = "Average % Female Students by School Control",
        ylab = "Mean % Female",
        col = c("lightblue", "lightcoral", "lightgreen"),
        ylim = c(0, max(means) * 1.15))

text(x = barplot(means, plot = FALSE),
     y = means + 1,
     labels = paste0(means, "%"),
     cex = 1.1, font = 2)

Comment:

The bar plot indicates that for-profit schools have the highest average percentage of female students, at approximately 68.7%, followed by private schools (58.6%) and public schools (58.1%). One possible explanation, not tested here, is that for-profit schools offer programs that public and private schools don’t offer.

Question 10: How does the percentage of part-time students vary among regions?

ggplot(college, aes(x = Region, y = PartTime)) +
  geom_boxplot(fill = "lightblue", color = "darkblue") +
  labs(title = "Distribution of Part-Time Students by Region",
       x = "Region", y = "% Part-Time Students") +
  theme_minimal()
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).

Comment:

The box plot shows how the percentage of part-time students varies by region. Territories and the Southeast tend to have higher medians and a greater spread in part-time percentages. The Northeast and West show lower medians and more compact distributions.
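The medians and spreads read from the box plot can be backed with numbers by computing the median and interquartile range per region. The sketch below uses made-up values: `region` and `part_time` are hypothetical vectors, not the `college` columns.

```r
# Made-up regions and part-time percentages, for illustration only.
region    <- c("Northeast", "Northeast", "Northeast",
               "Southeast", "Southeast", "Southeast")
part_time <- c(5, 8, 11, 10, 25, 40)

# Median (centre line of each box) and IQR (roughly the height of each box).
med_by_region <- tapply(part_time, region, median, na.rm = TRUE)
iqr_by_region <- tapply(part_time, region, IQR, na.rm = TRUE)
cbind(median = med_by_region, IQR = iqr_by_region)
```

Applied to `college$PartTime` and `college$Region`, the same two `tapply()` calls would give a table that can be quoted alongside the plot.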