We will use the data from “CollegeScores4yr”, which includes a variety of information about four-year universities and colleges in the US.
I developed the following 10 questions based on my understanding of the data, with some additional guidance from ChatGPT to enhance them.
1. What is the mean of cost for all the colleges in the data?
2. What is the correlation between cost and AvgSAT?
3. What is the distribution of cost?
4. What is the correlation between enrollment and cost?
5. Is there a significant relationship between the cost of tuition (both in-state and out-of-state) and the rate at which students complete their programs?
6. How does faculty salary relate to the average debt of students who complete the program?
7. What is the standard deviation of cost?
8. Do colleges with higher percentages of part-time students have lower completion rates?
9. How does median family income affect net price for students?
10. What is the distribution of average net price for students across colleges?
We will explore the questions in detail.
college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
mean(college$Cost, na.rm = TRUE)
## [1] 34277.31
The mean cost across all colleges in the dataset with a recorded cost is $34,277.31 (missing values are excluded by na.rm = TRUE).
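To make the role of `na.rm = TRUE` concrete, here is a minimal sketch using the six `AvgSAT` values printed by `head(college)` above, one of which is `NA`. The vector name `sat_first6` is made up for illustration, and six rows are not representative of the full dataset; the same behavior applies to `Cost` wherever it has missing values.

```r
# AvgSAT for the first six rows shown by head(college); row 3 is missing.
sat_first6 <- c(929, 1195, NA, 1322, 935, 1278)

# Without na.rm, a single NA makes the whole mean NA.
mean_with_na <- mean(sat_first6)

# With na.rm = TRUE, the NA is dropped and the mean uses the remaining values.
mean_no_na <- mean(sat_first6, na.rm = TRUE)

# Count how many values were dropped.
n_missing <- sum(is.na(sat_first6))

print(mean_with_na)
print(mean_no_na)
print(n_missing)
```

Reporting the number of dropped values alongside the mean makes it clear how many colleges the average is actually based on.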
cor(college$Cost, college$AvgSAT, use="complete.obs")
## [1] 0.5373884
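The correlation between cost and average SAT score is about 0.54, a moderate positive correlation: colleges with higher average SAT scores tend to have higher costs, and vice versa. A correlation describes an association only and does not show that one variable causes the other.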
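As a sketch of what `cor(..., use = "complete.obs")` computes, the block below rebuilds the Pearson correlation from its definition on the six rows printed by `head(college)`. The vector names are made up for illustration, and the result for six rows will differ from the full-data value of 0.54.

```r
# Cost and AvgSAT for the first six rows shown by head(college).
cost_first6 <- c(22886, 24129, 15080, 22108, 19413, 28836)
sat_first6  <- c(929, 1195, NA, 1322, 935, 1278)

# use = "complete.obs" keeps only rows where both variables are present.
keep <- complete.cases(cost_first6, sat_first6)
x <- cost_first6[keep]
y <- sat_first6[keep]

# Pearson correlation from its definition:
# sum of cross-deviations divided by the square root of the
# product of the two sums of squared deviations.
r_by_hand <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The built-in call used in this report.
r_builtin <- cor(cost_first6, sat_first6, use = "complete.obs")

print(c(by_hand = r_by_hand, builtin = r_builtin))
```

The two numbers agree, which confirms that the row with the missing SAT score is simply left out of the calculation.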
hist(college$Cost, main="Histogram of Cost", xlab = "Cost", col = "blue")
The histogram shows the distribution of college costs. Most colleges fall within a moderate cost range, consistent with the mean of about $34,277 computed above.
cor(college$Enrollment, college$Cost, use="complete.obs")
## [1] -0.1914363
The correlation between enrollment and cost is about -0.19, a weak negative correlation: colleges with larger enrollment tend to have somewhat lower costs, and colleges with higher costs tend to have somewhat smaller enrollment. Because the correlation is weak, enrollment accounts for little of the variation in cost.
cor(college$TuitionIn, college$CompRate, use="complete.obs")
## [1] 0.5477039
cor(college$TuitonOut, college$CompRate, use="complete.obs")
## [1] 0.6636967
plot(college$TuitionIn, college$CompRate, main="Completion Rate vs In-State Tuition",
xlab="In-State Tuition", ylab="Completion Rate", pch=19, col="blue")
Completion rate is positively correlated with both in-state tuition (about 0.55) and out-of-state tuition (about 0.66), so colleges with higher tuition tend to have higher completion rates. One possible explanation is that higher costs are associated with better resources, programs, or academic rigor, but a correlation alone cannot establish this.
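The question asks whether the relationship is significant, which `cor()` alone does not answer. A minimal sketch of a significance test with `cor.test()` follows, run on the six rows printed by `head(college)` so that it is self-contained. The vector names are made up for illustration, and with only six rows the p-value says nothing about the full dataset; the full-data call is shown as a comment.

```r
# In-state tuition and completion rate for the first six rows of head(college).
tuition_in <- c(9857, 8328, 6900, 10280, 11068, 10780)
comp_rate  <- c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)

# Test H0: the true Pearson correlation is zero.
res <- cor.test(tuition_in, comp_rate, method = "pearson")

print(res$estimate)   # sample correlation
print(res$parameter)  # degrees of freedom, n - 2
print(res$p.value)    # two-sided p-value
print(res$conf.int)   # 95% confidence interval for the correlation

# On the full data the equivalent calls would be:
# cor.test(college$TuitionIn, college$CompRate)
# cor.test(college$TuitonOut, college$CompRate)
```

A small p-value on the full data would support calling the relationship statistically significant; the confidence interval gives a sense of how precisely the correlation is estimated.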
cor(college$FacSalary, college$Debt, use="complete.obs")
## [1] 0.1707668
plot(college$FacSalary, college$Debt, main="Student Debt vs Faculty Salary",
xlab="Faculty Salary (Average Monthly)", ylab="Student Debt", pch=19, col="purple")
The correlation between faculty salary and student debt is weakly positive (0.17). This suggests that while there is a slight relationship between these variables, it is not strong. In general, as faculty salaries increase, student debt tends to increase slightly as well. However, the correlation is weak, so faculty salary is not a major predictor of student debt. The scatter plot shows a scattered distribution of points with a slight upward trend, which confirms the weak relationship.
sd(college$Cost, na.rm = TRUE)
## [1] 15278.54
The standard deviation of cost across the colleges in the dataset is $15,278.54. Relative to the mean cost of $34,277.31, this indicates a relatively high level of variation in the cost of attending college.
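To put "relatively high" on a scale, the sketch below compares the standard deviation to the mean, using the two values reported in this report. The variable names are made up for illustration.

```r
# Values reported earlier in this report for college$Cost.
cost_mean <- 34277.31
cost_sd   <- 15278.54

# Coefficient of variation: the SD expressed as a fraction of the mean.
cost_cv <- cost_sd / cost_mean

# Interval covering one SD on either side of the mean.
one_sd_range <- c(lower = cost_mean - cost_sd, upper = cost_mean + cost_sd)

print(round(cost_cv, 3))
print(one_sd_range)
```

The standard deviation is roughly 45% of the mean, and one standard deviation on either side of the mean spans about $19,000 to $49,600.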
cor(college$PartTime, college$CompRate, use="complete.obs")
## [1] -0.4190961
plot(college$PartTime, college$CompRate, main="Completion Rate vs Part-Time Students",
xlab="Percent Part-Time Students", ylab="Completion Rate", pch=19, col="green")
There is a moderate negative correlation of -0.42 between the percentage of part-time students and the completion rate. This suggests that as the percentage of part-time students increases, the completion rate tends to decrease.
cor(college$MedIncome, college$NetPrice, use="complete.obs")
## [1] 0.5151298
plot(college$MedIncome, college$NetPrice, main="Net Price vs Median Family Income",
xlab="Median Family Income ($1,000s)", ylab="Net Price", pch=19, col="orange")
The correlation between median family income and net price is moderately positive (0.515). This suggests that as median family income increases, the net price for students also tends to increase.
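One way to compare the strength of the correlations in this report is to square them: for a simple linear relationship, r squared is the share of variation in one variable accounted for by the other. The sketch below collects the correlations printed above; the object names and row labels are made up for illustration.

```r
# Correlations as printed in this report.
r_values <- c(
  Cost_AvgSAT        = 0.5373884,
  Enrollment_Cost    = -0.1914363,
  TuitionIn_CompRate = 0.5477039,
  TuitonOut_CompRate = 0.6636967,
  FacSalary_Debt     = 0.1707668,
  PartTime_CompRate  = -0.4190961,
  MedIncome_NetPrice = 0.5151298
)

# r squared: proportion of variance explained in a simple linear fit.
r_table <- data.frame(
  r         = r_values,
  r_squared = round(r_values^2, 3)
)

print(r_table)
```

For example, the correlation of 0.515 between median family income and net price corresponds to about 27% of the variation being shared, which leaves most of the variation in net price unexplained by income alone.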
hist(college$NetPrice,
main = "Distribution of Average Net Price for Students",
xlab = "Net Price (Cost Minus Aid)",
ylab = "Frequency",
col = "skyblue",
border = "black",
breaks = 20) # You can adjust 'breaks' to control the number of bins
The histogram of the average net price for students across colleges shows that the distribution is centered around $20,000, with most colleges charging net prices in this range.
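A reading of the center taken from a histogram can be checked against numeric summaries. The sketch below shows the idea on the six `NetPrice` values printed by `head(college)`; the vector name is made up for illustration, and the numbers for six rows are not the full-data results. On the full data the corresponding call would be `summary(college$NetPrice)`.

```r
# NetPrice for the first six rows shown by head(college).
net_first6 <- c(15184, 17535, 9649, 19986, 12874, 21973)

# Five-number summary plus the mean.
net_summary <- summary(net_first6)
net_median  <- median(net_first6)

# Bin edges and counts behind a histogram, without drawing it.
h <- hist(net_first6, breaks = 5, plot = FALSE)

print(net_summary)
print(h$breaks)
print(h$counts)
```

Comparing the median and mean with the tallest bins gives a quick check on where the distribution is centered and whether it is skewed.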
# Appendix
The following R code was used for data analysis and visualization for the different questions in this report.
college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
mean(college$Cost, na.rm = TRUE)
cor(college$Cost, college$AvgSAT, use="complete.obs")
hist(college$Cost, main="Histogram of Cost", xlab = "Cost", col = "blue")
cor(college$Enrollment, college$Cost, use="complete.obs")
cor(college$TuitionIn, college$CompRate, use="complete.obs")
cor(college$TuitonOut, college$CompRate, use="complete.obs")
plot(college$TuitionIn, college$CompRate, main="Completion Rate vs In-State Tuition",
xlab="In-State Tuition", ylab="Completion Rate", pch=19, col="blue")
cor(college$FacSalary, college$Debt, use="complete.obs")
plot(college$FacSalary, college$Debt, main="Student Debt vs Faculty Salary",
xlab="Faculty Salary (Average Monthly)", ylab="Student Debt", pch=19, col="purple")
sd(college$Cost, na.rm = TRUE)
cor(college$PartTime, college$CompRate, use="complete.obs")
plot(college$PartTime, college$CompRate, main="Completion Rate vs Part-Time Students",
xlab="Percent Part-Time Students", ylab="Completion Rate", pch=19, col="green")
cor(college$MedIncome, college$NetPrice, use="complete.obs")
plot(college$MedIncome, college$NetPrice, main="Net Price vs Median Family Income",
xlab="Median Family Income ($1,000s)", ylab="Net Price", pch=19, col="orange")
hist(college$NetPrice,
main = "Distribution of Average Net Price for Students",
xlab = "Net Price (Cost Minus Aid)",
ylab = "Frequency",
col = "skyblue",
border = "black",
breaks = 20)