Introduction

In this project, I use the CollegeScores4yr dataset from Lock5Stat.com.

Based on the dataset, I wrote 10 questions of my own.

  1. What is the minimum average SAT score among the colleges in the data?
  2. What is the maximum cost of a college in the data?
  3. What would a stem and leaf diagram look like of the average ACT of colleges in the data?
  4. What is the mean enrollment of students in the data?
  5. What is the median net price of the colleges in the data?
  6. What is the mean faculty salary in the data?
  7. What is the correlation between cost and completion rate of colleges in the data?
  8. What is the variance of first generation students in the data?
  9. What is the standard deviation of tuition of colleges in the data?
  10. What does a histogram of the enrollment of colleges in the data look like?

I asked ChatGPT to produce 10 questions based on the dataset. Here are 10 questions from ChatGPT.

  1. What is the mean proportion of female students in colleges?
  2. What is the median number of full-time faculty members across institutions?
  3. What is the variance in the percentage of first-generation students across colleges?
  4. What is the standard deviation in median income levels of students’ families?
  5. What does a histogram of student loan debt amounts look like?
  6. What does a boxplot of admittance rates across different regions look like?
  7. What does a boxplot of student debt amounts look like across different regions?
  8. How does a histogram of completion rates vary across institutions?
  9. What is the correlation between faculty salary and tuition fees?
  10. What is the correlation between the percentage of first-generation students and Pell Grant recipients?

Analysis

Based on both sets of questions, I chose 10 to answer.
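The commands below refer to a data frame named `college`, but the step that creates it is not shown in this report. The following is a minimal sketch of one way to load it. The URL and the helper name `load_college` are assumptions for illustration, not taken from this report.

```r
# Hypothetical helper: read the CollegeScores4yr data into a data frame.
# The default URL is an assumption; replace it with a local path if needed.
load_college <- function(path = "https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv") {
  read.csv(path)
}

# Usage (needs internet access when the default URL is used):
# college <- load_college()
# str(college)
```

Passing a local file path to `load_college()` avoids the download if the CSV has already been saved.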

1. What is the maximum cost of a college in the data?

max(college$Cost, na.rm = TRUE)
## [1] 72717

The maximum cost of a college in the data is $72,717.

2. What is the mean proportion of female students in colleges?

mean(college$Female, na.rm = TRUE)
## [1] 59.29588

Across the colleges in the data, the average percentage of female students is 59.3%.

3. What is the correlation between cost and completion rate of colleges in the data?

cor(college$TuitionFTE, college$CompRate, use = "complete.obs")
## [1] 0.4556305

The correlation between tuition (measured here by TuitionFTE) and completion rate is 0.456, meaning that there is a moderate positive correlation: colleges with higher tuition tend to have higher completion rates.
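The `use = "complete.obs"` argument matters because `cor()` otherwise returns `NA` when either variable has missing values. The sketch below shows what it does, using made-up toy vectors rather than the college data.

```r
# Toy vectors (made up for illustration, not from the dataset)
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, NA, 8, 10)

# Pairs with any NA are dropped; only (1, 2), (2, 4), (5, 10) remain
r_complete <- cor(x, y, use = "complete.obs")

# Equivalent manual version: keep positions where both values are present
keep <- complete.cases(x, y)
r_manual <- cor(x[keep], y[keep])

print(c(r_complete, r_manual))
```

Both values agree because the same three complete pairs are used. To answer the question exactly as worded, the same call could be made with `college$Cost` in place of `college$TuitionFTE`.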

4. What is the standard deviation in median income levels of students’ families?

sd(college$MedIncome, na.rm = TRUE)
## [1] 22.85785

The standard deviation of the median income levels of students’ families is about $22,858 (22.858 in thousands of dollars).
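The dollar figure comes from a unit conversion. Assuming MedIncome is recorded in thousands of dollars (an assumption about the dataset's units), the reported output converts as in this short sketch:

```r
sd_thousands <- 22.85785            # value reported by sd(college$MedIncome, na.rm = TRUE)
sd_dollars <- sd_thousands * 1000   # assumes MedIncome is in $1,000s

# Round to whole dollars and add a thousands separator for reporting
format(round(sd_dollars), big.mark = ",")
```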

5. What is the variance of first generation students in the data?

var(college$FirstGen, na.rm = TRUE)
## [1] 122.8821

The variance of the percentage of first-generation students is 122.88 (in squared percentage points).
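Because variance is in squared units, it can be easier to interpret after taking the square root, which gives the standard deviation in the variable's original units. A small sketch using the value reported above:

```r
var_firstgen <- 122.8821           # value reported by var(college$FirstGen, na.rm = TRUE)
sd_firstgen <- sqrt(var_firstgen)  # standard deviation is the square root of the variance

round(sd_firstgen, 2)
```

The result is roughly 11, on the same scale as the FirstGen values themselves.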

6. How does a histogram of completion rates vary across institutions?

hist(college$CompRate, main = "Completion rate across institutions", xlab = "Completion Rate", ylab = "Frequency", col = "green")

The histogram above shows the distribution of completion rates across colleges. The distribution appears roughly bell-shaped, centered around a completion rate of 50-60%.

7. What is the minimum average SAT score among the colleges in the data?

min(college$AvgSAT, na.rm = TRUE)
## [1] 564

The minimum average SAT among these colleges is 564.

8. What is the median net price of the colleges in the data?

median(college$NetPrice, na.rm = TRUE)
## [1] 19337.5

The median net price of colleges in the data is $19,337.50.

9. What would a stem and leaf diagram look like of the average ACT of colleges in the data?

stem(college$MidACT)
## 
##   The decimal point is at the |
## 
##    6 | 0
##    8 | 0
##   10 | 0
##   12 | 
##   14 | 00
##   16 | 00000000000000000000000
##   18 | 00000000000000000000000000000000000000000000000000000000000000000000+4
##   20 | 00000000000000000000000000000000000000000000000000000000000000000000+182
##   22 | 00000000000000000000000000000000000000000000000000000000000000000000+251
##   24 | 00000000000000000000000000000000000000000000000000000000000000000000+174
##   26 | 00000000000000000000000000000000000000000000000000000000000000000000+37
##   28 | 00000000000000000000000000000000000000000000000000000000000000000000
##   30 | 00000000000000000000000000000000000000000000000
##   32 | 00000000000000000000000000000000000000000
##   34 | 0000000000000

The stem-and-leaf plot above takes into account all of the colleges in the data. There is a large concentration of colleges with an average ACT of around 20-26.
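Several rows of the plot are truncated (for example, `+182` means 182 more leaves are not shown), so exact counts are hard to read from it. One alternative is a frequency table, sketched here on made-up toy scores rather than the college data:

```r
# Toy ACT values (made up for illustration, not from the dataset)
act <- c(18, 20, 20, 21, 22, 22, 22, 24, NA)

# Count how often each score occurs; NA values are dropped by default
act_counts <- table(act)
print(act_counts)
```

Applied to the real data, the equivalent call would be `table(college$MidACT)`.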

10. What does a boxplot of admittance rates across different regions look like?

boxplot(AdmitRate ~ Region, data = college, main = "Admittance Rate Across Regions", ylab = "Admittance Rate", col = "green")

The boxplot above shows admittance rates across different regions in the United States. We can see the first quartile, median, and third quartile of the admittance rate for each region.
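The quartiles drawn in a boxplot can also be computed numerically for each group. The sketch below uses a made-up toy data frame with the same column names, not the college data:

```r
# Toy data (made up for illustration, not from the dataset)
toy <- data.frame(
  Region    = c("A", "A", "A", "A", "A", "B", "B", "B"),
  AdmitRate = c(0.2, 0.4, 0.6, 0.8, 1.0, 0.1, 0.3, 0.5)
)

# First quartile, median, and third quartile of AdmitRate within each region
quartiles <- tapply(
  toy$AdmitRate, toy$Region,
  quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE
)
print(quartiles)
```

Note that `boxplot()` draws its box from hinges, which can differ slightly from the default `quantile()` values for some sample sizes, so small discrepancies between the plot and these numbers are possible.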

Conclusion

In this project, I examined the CollegeScores4yr dataset from Lock5Stat.com. I chose questions of my own and from ChatGPT to demonstrate a broad understanding of the methods in Chapter 6.

Appendix

R code
Q1: max(college$Cost, na.rm = TRUE)
Q2: mean(college$Female, na.rm = TRUE)
Q3: cor(college$TuitionFTE, college$CompRate, use = "complete.obs")
Q4: sd(college$MedIncome, na.rm = TRUE)
Q5: var(college$FirstGen, na.rm = TRUE)
Q6: hist(college$CompRate, main = "Completion rate across institutions", xlab = "Completion Rate", ylab = "Frequency", col = "green")
Q7: min(college$AvgSAT, na.rm = TRUE)
Q8: median(college$NetPrice, na.rm = TRUE)
Q9: stem(college$MidACT)
Q10: boxplot(AdmitRate ~ Region, data = college, main = "Admittance Rate Across Regions", ylab = "Admittance Rate", col = "green")