Introduction

Using the data from the “CollegeScores4yr” spreadsheet, ChatGPT and I each proposed 10 questions.

My 10 questions:
1. What is the mean of Black (students)? ✓
2. What is the correlation between TuitionIn and AdmitRate?
3. What is the mean of Cost?
4. What is the distribution of NetPrice?
5. What is the standard deviation of Female? ✓
6. What is the average FacSalary? ✓
7. What is the correlation between Longitude and Latitude? ✓
8. What is the median Enrollment? ✓
9. What is the mean of Hispanic (students)?
10. What is the correlation between AvgSAT and MidACT?

ChatGPT’s questions:
1. What is the standard deviation of MidACT scores? ✓
2. What is the variance in Faculty salaries (FacSalary) between colleges?
3. What is the average enrollment (Enrollment)?
4. Make a histogram of Debt, do schools have a similar amount? ✓
5. What is the correlation between AdmitRate and AvgSAT? ✓
6. Is there a relationship between TuitionIn and CompletionRate? ✓
7. What proportion of schools offer only online programs?
8. Which school has the highest diversity, measured by (1 − max(race percentage))?
9. How does Median Family Income (MedIncome) relate to NetPrice?
10. What is the racial composition of students at [School Name]? ✓

Analysis

Using the data and R code, we will answer 10 of the 20 proposed questions. The first six rows of the data set are shown below.

college <- read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
##                                  Name State     ID Main
## 1            Alabama A & M University    AL 100654    1
## 2 University of Alabama at Birmingham    AL 100663    1
## 3                  Amridge University    AL 100690    1
## 4 University of Alabama in Huntsville    AL 100706    1
## 5            Alabama State University    AL 100724    1
## 6           The University of Alabama    AL 100751    1
##                                                                Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
##   MainDegree HighDegree Control    Region Locale Latitude Longitude AdmitRate
## 1          3          4  Public Southeast   City 34.78337 -86.56850    0.9027
## 2          3          4  Public Southeast   City 33.50570 -86.79935    0.9181
## 3          3          4 Private Southeast   City 32.36261 -86.17401        NA
## 4          3          4  Public Southeast   City 34.72456 -86.64045    0.8123
## 5          3          4  Public Southeast   City 32.36432 -86.29568    0.9787
## 6          3          4  Public Southeast   City 33.21187 -87.54598    0.5330
##   MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1     18    929      0       4824   2.5  90.7      0.9   0.2   5.6      6.6
## 2     25   1195      0      12866  57.8  25.9      3.3   5.9   7.1     25.2
## 3     NA     NA      1        322   7.1  14.3      0.6   0.3  77.6     54.4
## 4     28   1322      0       6917  74.2  10.7      4.6   4.0   6.5     15.0
## 5     18    935      0       4189   1.5  93.8      1.0   0.3   3.5      7.7
## 6     28   1278      0      32387  78.5  10.1      4.7   1.2   5.6      7.9
##   NetPrice  Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1    15184 22886      9857     18236       9227        7298      6983
## 2    17535 24129      8328     19032      11612       17235     10640
## 3     9649 15080      6900      6900      14738        5265      3866
## 4    19986 22108     10280     21480       8727        9748      9391
## 5    12874 19413     11068     19396       9003        7983      7399
## 6    21973 28836     10780     28100      13574       10894     10016
##   FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1        71.3 71.0    23.96 1068   56.4     36.6      23.6
## 2        89.9 35.3    52.92 3755   63.9     34.1      34.5
## 3       100.0 74.2    18.18  109   64.9     51.3      15.0
## 4        64.6 27.7    48.62 1347   47.6     31.0      44.8
## 5        54.2 73.8    27.69 1294   61.3     34.3      22.1
## 6        74.0 18.0    67.87 6430   61.5     22.6      66.7

Question 1: What is the mean of Black (students)?

mean(college$Black, na.rm = TRUE)
## [1] 13.92342

Across the colleges in the data set, the mean percentage of students who are Black is about 13.92%.

Question 2: Make a histogram of Debt, do schools have a similar amount?

hist(college$Debt, main = "Histogram of Debt", xlab = "Debt", col = "orange")

The histogram shows that most schools have an average debt below $10,000 for students who complete their program. Only a very small percentage of schools have more debt than that.
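A numeric summary can back up what the histogram suggests. The sketch below is only an illustration: the `debt` vector holds the six Debt values printed by head(college) above, and the $10,000 cutoff is the one mentioned in the text. With the full data, college$Debt would take the place of `debt`.

```r
# Debt values copied from the six rows printed by head(college);
# with the full data set, use college$Debt instead.
debt <- c(1068, 3755, 109, 1347, 1294, 6430)

# Minimum, quartiles, median, mean, and maximum in one call.
summary(debt)

# Share of schools below the $10,000 cutoff.
# The mean of a logical vector is the proportion of TRUE values.
share_below <- mean(debt < 10000, na.rm = TRUE)
share_below
```

Reporting this proportion alongside the histogram makes "most schools" a specific number rather than a visual impression.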

Question 3: What is the median Enrollment?

median(college$Enrollment, na.rm = TRUE)
## [1] 1722

The median undergraduate enrollment across the data set is 1,722 students.
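Comparing the median with the mean is a quick way to check for skew. The sketch below uses only the six Enrollment values printed by head(college), so its numbers are illustrative and will not match the full data set.

```r
# Enrollment values copied from the six rows printed by head(college);
# with the full data set, use college$Enrollment instead.
enrollment <- c(4824, 12866, 322, 6917, 4189, 32387)

enroll_median <- median(enrollment, na.rm = TRUE)
enroll_mean <- mean(enrollment, na.rm = TRUE)

# A mean well above the median suggests a few very large schools
# are pulling the average up (right skew).
c(median = enroll_median, mean = enroll_mean)
```

When the two differ a lot, the median is usually the better description of a typical school.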

Question 4: What is the standard deviation of MidACT scores?

sd(college$MidACT, na.rm = TRUE)
## [1] 3.653612
mean(college$MidACT, na.rm = TRUE)
## [1] 23.53514

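The standard deviation of the median ACT scores (MidACT) across the data set is about 3.7, and the mean is about 23.5. If the scores are roughly bell-shaped, about two-thirds of schools fall within one standard deviation of the mean, between roughly 19.9 and 27.2.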
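The "mean plus or minus one standard deviation" range can be computed and checked directly. This is a sketch on the six MidACT values printed by head(college), including the one NA; the variable names are my own, and the results for the full data set will differ.

```r
# MidACT values copied from the six rows printed by head(college);
# with the full data set, use college$MidACT instead.
act <- c(18, 25, NA, 28, 18, 28)

act_mean <- mean(act, na.rm = TRUE)
act_sd <- sd(act, na.rm = TRUE)

# One standard deviation on either side of the mean.
lower <- act_mean - act_sd
upper <- act_mean + act_sd

# Proportion of non-missing scores that fall inside that range.
within_one_sd <- mean(act >= lower & act <= upper, na.rm = TRUE)
c(lower = lower, upper = upper, within = within_one_sd)
```

Checking the actual proportion avoids relying on the bell-shaped assumption.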

Question 5: What is the correlation between Longitude and Latitude?

cor(college$Longitude, college$Latitude, use = "complete.obs")
## [1] -0.06285682

There is almost no linear correlation between the longitudes and latitudes of the colleges in the data set. This makes sense because the schools are spread across the country rather than lying along a line.

Question 6: What is the correlation between AdmitRate and AvgSAT?

cor(college$AdmitRate, college$AvgSAT, use = "complete.obs")
## [1] -0.4221255

The correlation is -0.42, which suggests a moderate negative correlation. As the admission rate increases, the average SAT score tends to decrease.
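To judge whether a correlation like this could be due to chance, base R's cor.test() reports a confidence interval and p-value along with the estimate. The sketch below runs it on the six rows printed by head(college), so with only five complete pairs the interval will be very wide; it is meant to show the call, not to reproduce the result above.

```r
# AdmitRate and AvgSAT copied from the six rows printed by head(college);
# with the full data set, use college$AdmitRate and college$AvgSAT.
admit <- c(0.9027, 0.9181, NA, 0.8123, 0.9787, 0.5330)
sat <- c(929, 1195, NA, 1322, 935, 1278)

# Same calculation as in the report, on the small sample.
r <- cor(admit, sat, use = "complete.obs")

# cor.test() drops incomplete pairs itself and adds a Pearson test.
test <- cor.test(admit, sat)
test$estimate   # sample correlation
test$conf.int   # 95% confidence interval
test$p.value    # p-value for the null hypothesis of zero correlation
```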

Question 7: What is the standard deviation of Female?

sd(college$Female, na.rm = TRUE)
## [1] 12.34421
mean(college$Female, na.rm = TRUE)
## [1] 59.29588

The mean percentage of female students is about 59.3%, with a standard deviation of roughly 12.3 percentage points.

max(college$Female, na.rm = TRUE)
## [1] 98

The maximum in our data set (98%) is more than three standard deviations above the mean, which makes it an unusually high value, likely from a women’s college.
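One way to put a number on how unusual the maximum is: convert it to a z-score using the mean and standard deviation reported above. The variable names here are my own, and the "beyond 3" threshold is a common rule of thumb, not a formal test.

```r
# Values reported above for the Female column.
female_mean <- 59.29588
female_sd <- 12.34421
female_max <- 98

# z-score: how many standard deviations the maximum lies above the mean.
z_max <- (female_max - female_mean) / female_sd
z_max

# With the full data set, boxplot.stats(college$Female)$out would list
# every value flagged by the 1.5 * IQR rule, which is another common check.
```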

Question 8: What is the racial composition of students at [School Name]?

For this question I decided to pick our very own St. Cloud State University.

scsu_row <- which(college$Name == "Saint Cloud State University")
scsu <- college[scsu_row, ]
race_percent <- c(scsu$White, scsu$Black, scsu$Hispanic, scsu$Asian, scsu$Other)
race_labels <- c("White", "Black", "Hispanic", "Asian", "Other")
pie(race_percent, labels = race_labels,
    main = "Racial Composition of SCSU",
    col = rainbow(length(race_percent)))

Based on the pie chart, over 60% of the students at SCSU appear to be white, making them the majority. This makes sense considering that roughly 67% of St. Cloud’s population is white.
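Exact percentages are hard to read off a pie chart, so printing the values or using a bar chart can help. Since the SCSU values are not printed in this report, the sketch below uses the first row shown by head(college) (Alabama A & M University) purely as a stand-in; with the full data, the `scsu` row from the code above would supply the numbers.

```r
# Race percentages from row 1 of head(college), used as a stand-in;
# replace with scsu$White, scsu$Black, ... for the SCSU row.
race_percent <- c(White = 2.5, Black = 90.7, Hispanic = 0.9,
                  Asian = 0.2, Other = 5.6)

# The categories should add up to roughly 100 percent.
total <- sum(race_percent)
total

# Sort from largest to smallest so the majority group comes first.
ordered <- sort(race_percent, decreasing = TRUE)
ordered

# Bar heights are easier to compare than pie slices.
barplot(ordered, ylab = "Percent of students",
        main = "Racial composition (row 1 of head(college))")
```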

Question 9: What is the average FacSalary?

mean(college$FacSalary, na.rm = TRUE)
## [1] 7465.778

The average monthly salary for full-time faculty is about $7,466.

Question 10: Is there a relationship between TuitionIn and CompletionRate?

cor(college$TuitionIn, college$CompRate, use = "complete.obs")
## [1] 0.5477039

There is a moderate positive correlation (about 0.55) between in-state tuition and completion rate. Schools with higher tuition tend to have higher completion rates, although the correlation alone does not show that higher tuition causes higher completion.
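A correlation describes how tightly the points follow a line, not how steep the line is. To see how much completion rate changes per dollar of tuition, a simple linear regression with lm() is one option. The sketch below uses only the six rows printed by head(college), so its slope is illustrative and will not match the full data set.

```r
# TuitionIn and CompRate copied from the six rows printed by head(college);
# with the full data set, use lm(CompRate ~ TuitionIn, data = college).
tuition <- c(9857, 8328, 6900, 10280, 11068, 10780)
comp_rate <- c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)

r <- cor(tuition, comp_rate)

# Least-squares line: comp_rate = intercept + slope * tuition.
fit <- lm(comp_rate ~ tuition)

# Estimated change in completion rate (percentage points) per extra dollar.
slope <- coef(fit)[["tuition"]]

# For a one-predictor model, R-squared equals the squared correlation.
r_squared <- summary(fit)$r.squared
c(r = r, slope = slope, r_squared = r_squared)
```

The slope answers "how fast", while the correlation answers "how consistently".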

Conclusion

Overall, I learned how to use R Markdown to analyze different statistical properties. It was cool to see how such a large data set could be easily manipulated with a few lines of code.

Appendix

# Question 1:
mean(college$Black, na.rm = TRUE)
## [1] 13.92342
# Question 2:
hist(college$Debt, main = "Histogram of Debt", xlab = "Debt", col = "orange")

# Question 3:
median(college$Enrollment, na.rm = TRUE)
## [1] 1722
# Question 4:
sd(college$MidACT, na.rm = TRUE)
## [1] 3.653612
mean(college$MidACT, na.rm = TRUE)
## [1] 23.53514
# Question 5:
cor(college$Longitude, college$Latitude, use = "complete.obs")
## [1] -0.06285682
# Question 6:
cor(college$AdmitRate, college$AvgSAT, use = "complete.obs")
## [1] -0.4221255
# Question 7:
sd(college$Female, na.rm = TRUE)
## [1] 12.34421
mean(college$Female, na.rm = TRUE)
## [1] 59.29588
# Question 8:
scsu_row <- which(college$Name == "Saint Cloud State University")
scsu <- college[scsu_row, ]
race_percent <- c(scsu$White, scsu$Black, scsu$Hispanic, scsu$Asian, scsu$Other)
race_labels <- c("White", "Black", "Hispanic", "Asian", "Other")
pie(race_percent, labels = race_labels,
    main = "Racial Composition of SCSU",
    col = rainbow(length(race_percent)))

# Question 9:
mean(college$FacSalary, na.rm = TRUE)
## [1] 7465.778
# Question 10:
cor(college$TuitionIn, college$CompRate, use = "complete.obs")
## [1] 0.5477039