Introduction

We will use the “CollegeScores4yr” dataset, which includes a variety of information about universities and colleges in the US.

I developed the following 10 questions based on my understanding of the data, with some additional guidance from ChatGPT to enhance them.

  1. What is the mean cost across all colleges in the data?

  2. What is the correlation between cost and AvgSAT?

  3. What is the distribution of cost?

  4. What is the correlation between enrollment and cost?

  5. Is there a significant relationship between the cost of
    tuition (both in-state and out-of-state) and the rate at
    which students complete their programs?

  6. How does faculty salary relate to the average debt of students who complete the program?

  7. What is the standard deviation of cost?

  8. Do colleges with higher percentages of part-time students
    have lower completion rates?

  9. How does median family income affect net price for students?

  10. What is the distribution of average net price for students across colleges?

Analysis

We will explore the questions in detail.

college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
##                                  Name State     ID Main
## 1            Alabama A & M University    AL 100654    1
## 2 University of Alabama at Birmingham    AL 100663    1
## 3                  Amridge University    AL 100690    1
## 4 University of Alabama in Huntsville    AL 100706    1
## 5            Alabama State University    AL 100724    1
## 6           The University of Alabama    AL 100751    1
##                                                                Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
##   MainDegree HighDegree Control    Region Locale Latitude Longitude AdmitRate
## 1          3          4  Public Southeast   City 34.78337 -86.56850    0.9027
## 2          3          4  Public Southeast   City 33.50570 -86.79935    0.9181
## 3          3          4 Private Southeast   City 32.36261 -86.17401        NA
## 4          3          4  Public Southeast   City 34.72456 -86.64045    0.8123
## 5          3          4  Public Southeast   City 32.36432 -86.29568    0.9787
## 6          3          4  Public Southeast   City 33.21187 -87.54598    0.5330
##   MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1     18    929      0       4824   2.5  90.7      0.9   0.2   5.6      6.6
## 2     25   1195      0      12866  57.8  25.9      3.3   5.9   7.1     25.2
## 3     NA     NA      1        322   7.1  14.3      0.6   0.3  77.6     54.4
## 4     28   1322      0       6917  74.2  10.7      4.6   4.0   6.5     15.0
## 5     18    935      0       4189   1.5  93.8      1.0   0.3   3.5      7.7
## 6     28   1278      0      32387  78.5  10.1      4.7   1.2   5.6      7.9
##   NetPrice  Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1    15184 22886      9857     18236       9227        7298      6983
## 2    17535 24129      8328     19032      11612       17235     10640
## 3     9649 15080      6900      6900      14738        5265      3866
## 4    19986 22108     10280     21480       8727        9748      9391
## 5    12874 19413     11068     19396       9003        7983      7399
## 6    21973 28836     10780     28100      13574       10894     10016
##   FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1        71.3 71.0    23.96 1068   56.4     36.6      23.6
## 2        89.9 35.3    52.92 3755   63.9     34.1      34.5
## 3       100.0 74.2    18.18  109   64.9     51.3      15.0
## 4        64.6 27.7    48.62 1347   47.6     31.0      44.8
## 5        54.2 73.8    27.69 1294   61.3     34.3      22.1
## 6        74.0 18.0    67.87 6430   61.5     22.6      66.7

Q1. What is the mean cost across all colleges in the data?

mean(college$Cost, na.rm = TRUE)
## [1] 34277.31

The mean cost across all colleges in the dataset is $34,277.31.
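The `na.rm = TRUE` argument matters if any Cost values are missing, because `mean()` returns `NA` as soon as it meets one. The following is a minimal sketch of that behavior. It uses four Cost values from the first rows printed above plus one artificial missing value, so the vector `cost` is illustrative and not the full column.

```r
# Four Cost values from the head() output, plus one artificial NA.
cost <- c(22886, 24129, NA, 22108, 19413)

mean_with_na    <- mean(cost)                # NA: a missing value propagates
mean_without_na <- mean(cost, na.rm = TRUE)  # mean of the four observed values
n_missing       <- sum(is.na(cost))          # number of values that were dropped

mean_with_na
mean_without_na
n_missing
```

Reporting the number of dropped values alongside the mean makes clear how many colleges the average is actually based on.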

Q2. What is the correlation between cost and AvgSAT?

cor(college$Cost, college$AvgSAT, use="complete.obs")
## [1] 0.5373884

The correlation between cost and average SAT score is 0.537, a moderate positive correlation: colleges with higher costs tend to have higher average SAT scores, and vice versa.
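With `use = "complete.obs"`, `cor()` keeps only the rows where both variables are observed. The sketch below illustrates this on a small data frame named `toy`, which is an assumption made for illustration: it is adapted from the six rows printed above, with one extra missing Cost value introduced artificially.

```r
# Toy data adapted from the head() output; the NA in Cost is artificial.
toy <- data.frame(
  Cost   = c(22886, 24129, 15080, 22108, NA, 28836),
  AvgSAT = c(929, 1195, NA, 1322, 935, 1278)
)

# Rows where both Cost and AvgSAT are present
keep   <- complete.cases(toy$Cost, toy$AvgSAT)
n_used <- sum(keep)

# cor() with complete.obs should match cor() on the manually filtered rows
r_complete <- cor(toy$Cost, toy$AvgSAT, use = "complete.obs")
r_manual   <- cor(toy$Cost[keep], toy$AvgSAT[keep])

n_used
all.equal(r_complete, r_manual)
```

Checking `n_used` on the real data would show how many colleges contribute to each reported correlation, which can differ from question to question.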

Q3. What is the distribution of cost?

hist(college$Cost, main="Histogram of Cost", xlab = "Cost", col = "blue")

The histogram shows the distribution of college costs. Most colleges fall within a moderate cost range; for reference, the mean is $34,277.31 (Q1) and the standard deviation is $15,278.54 (Q7).
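A histogram can be complemented with a few numbers that describe center, spread, and skew. The sketch below shows one way to do this. The vector `cost` holds made-up values, not the dataset; on the real data the same calls would be applied to `college$Cost`.

```r
# Hypothetical cost values with one missing entry (not from the dataset).
cost <- c(12000, 15000, 18000, 21000, 24000, 30000, 45000, 62000, NA)

# Minimum, quartiles, and maximum
five_num    <- quantile(cost, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
cost_mean   <- mean(cost, na.rm = TRUE)
cost_median <- median(cost, na.rm = TRUE)

five_num
# A mean above the median is a rough sign of a right-skewed distribution.
cost_mean > cost_median
```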

Q4. What is the correlation between enrollment and cost?

cor(college$Enrollment, college$Cost, use="complete.obs")
## [1] -0.1914363

The correlation between enrollment and cost is -0.191, a weak negative correlation: colleges with larger enrollments tend to have slightly lower costs, and colleges with higher costs tend to have slightly smaller enrollments.

Q5. Is there a significant relationship between the cost of tuition (both in-state and out-of-state) and the rate at which students complete their programs?

cor(college$TuitionIn, college$CompRate, use="complete.obs")
## [1] 0.5477039
cor(college$TuitonOut, college$CompRate, use="complete.obs")
## [1] 0.6636967
plot(college$TuitionIn, college$CompRate, main="Completion Rate vs In-State Tuition",
     xlab="In-State Tuition", ylab="Completion Rate", pch=19, col="blue")

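Both correlations are moderately positive (0.548 for in-state tuition and 0.664 for out-of-state tuition), so colleges with higher tuition tend to have higher completion rates. This could suggest that higher costs are associated with better resources, programs, or academic rigor, potentially leading to higher completion rates. A correlation coefficient alone, however, establishes neither causation nor statistical significance.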
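Because the question asks about significance, one option is a formal test such as `cor.test()`, which reports a p-value and a confidence interval in addition to the correlation. The sketch below runs it on simulated data: `tuition` and `comp_rate` are hypothetical stand-ins for `college$TuitionIn` and `college$CompRate`, and the simulated relationship is an assumption chosen only for illustration.

```r
# Simulated data: hypothetical stand-ins for tuition and completion rate.
set.seed(1)
tuition   <- runif(50, min = 5000, max = 50000)
comp_rate <- 20 + 0.001 * tuition + rnorm(50, sd = 10)

# Pearson correlation test; the null hypothesis is a true correlation of 0
result <- cor.test(tuition, comp_rate)

result$estimate   # sample correlation
result$p.value    # p-value for the test
result$conf.int   # 95% confidence interval for the correlation
```

On the real data, the same call with the two dataset columns would show whether the observed correlations are distinguishable from zero.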

Q6. How does faculty salary relate to the average debt of students who complete the program?

cor(college$FacSalary, college$Debt, use="complete.obs")
## [1] 0.1707668
plot(college$FacSalary, college$Debt, main="Student Debt vs Faculty Salary",
     xlab="Faculty Salary (Average Monthly)", ylab="Student Debt", pch=19, col="purple")

The correlation between faculty salary and student debt is weakly positive (0.17). As faculty salaries increase, student debt tends to increase slightly, but the relationship is weak, so faculty salary is not a major predictor of student debt. The scatter plot is consistent with this: the points are widely scattered with only a slight upward trend.

Q7. What is the standard deviation of cost?

sd(college$Cost, na.rm = TRUE)
## [1] 15278.54

The standard deviation of cost across the colleges in the dataset is $15,278.54. This is large relative to the mean cost of $34,277.31, indicating substantial variation in the cost of attending college.
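One way to express how large the spread is relative to the center is the coefficient of variation, the standard deviation divided by the mean. The short calculation below uses only the two values reported in Q1 and Q7.

```r
cost_mean <- 34277.31   # mean cost reported in Q1
cost_sd   <- 15278.54   # standard deviation of cost reported in Q7

# Coefficient of variation: spread as a fraction of the mean
cv <- cost_sd / cost_mean
round(cv, 2)
```

The ratio is a little under one half, meaning the standard deviation is nearly half the size of the mean.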

Q8. Do colleges with higher percentages of part-time students have lower completion rates?

cor(college$PartTime, college$CompRate, use="complete.obs")
## [1] -0.4190961
plot(college$PartTime, college$CompRate, main="Completion Rate vs Part-Time Students",
     xlab="Percent Part-Time Students", ylab="Completion Rate", pch=19, col="green")

There is a moderate negative correlation of -0.42 between the percentage of part-time students and the completion rate. This suggests that as the percentage of part-time students increases, the completion rate tends to decrease.

Q9. How does median family income affect net price for students?

cor(college$MedIncome, college$NetPrice, use="complete.obs")
## [1] 0.5151298
plot(college$MedIncome, college$NetPrice, main="Net Price vs Median Family Income",
     xlab="Median Family Income ($1,000s)", ylab="Net Price", pch=19, col="orange")

The correlation between median family income and net price is moderately positive (0.515). This suggests that as median family income increases, the net price for students also tends to increase.
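A correlation describes the strength of a linear association but not its size in dollars. A simple linear regression is one way to express that, sketched below on simulated data. The variables `med_income` and `net_price` are hypothetical stand-ins for `college$MedIncome` and `college$NetPrice`, and the simulated slope is an assumption, not an estimate from the dataset.

```r
# Simulated data: hypothetical stand-ins for the two dataset columns.
set.seed(42)
med_income <- runif(40, min = 15, max = 90)   # in $1,000s
net_price  <- 10000 + 150 * med_income + rnorm(40, sd = 3000)

# Least-squares line: net price as a function of median family income
fit       <- lm(net_price ~ med_income)
slope     <- coef(fit)[["med_income"]]   # change in net price per $1,000 of income
r_squared <- summary(fit)$r.squared

slope
# With a single predictor, R-squared equals the squared correlation.
all.equal(r_squared, cor(med_income, net_price)^2)
```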

Q10. What is the distribution of average net price for students across colleges?

hist(college$NetPrice, 
     main = "Distribution of Average Net Price for Students",
     xlab = "Net Price (Cost Minus Aid)",
     ylab = "Frequency",
     col = "skyblue",
     border = "black",
     breaks = 20)  # You can adjust 'breaks' to control the number of bins

The histogram of the average net price for students across colleges shows that the distribution is centered around $20,000, with most colleges charging net prices in this range.

Appendix

The following R code was used for data analysis and visualization for the different questions in this report.

college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)

mean(college$Cost, na.rm = TRUE)

cor(college$Cost, college$AvgSAT, use="complete.obs")

hist(college$Cost, main="Histogram of Cost", xlab = "Cost", col = "blue")

cor(college$Enrollment, college$Cost, use="complete.obs")

cor(college$TuitionIn, college$CompRate, use="complete.obs")
cor(college$TuitonOut, college$CompRate, use="complete.obs")
plot(college$TuitionIn, college$CompRate, main="Completion Rate vs In-State Tuition",
     xlab="In-State Tuition", ylab="Completion Rate", pch=19, col="blue")

cor(college$FacSalary, college$Debt, use="complete.obs")
plot(college$FacSalary, college$Debt, main="Student Debt vs Faculty Salary",
     xlab="Faculty Salary (Average Monthly)", ylab="Student Debt", pch=19, col="purple")

sd(college$Cost, na.rm = TRUE)

cor(college$PartTime, college$CompRate, use="complete.obs")
plot(college$PartTime, college$CompRate, main="Completion Rate vs Part-Time Students",
     xlab="Percent Part-Time Students", ylab="Completion Rate", pch=19, col="green")

cor(college$MedIncome, college$NetPrice, use="complete.obs")
plot(college$MedIncome, college$NetPrice, main="Net Price vs Median Family Income",
     xlab="Median Family Income ($1,000s)", ylab="Net Price", pch=19, col="orange")

hist(college$NetPrice, 
     main = "Distribution of Average Net Price for Students",
     xlab = "Net Price (Cost Minus Aid)",
     ylab = "Frequency",
     col = "skyblue",
     border = "black",
     breaks = 20)