We use the “CollegeScores4yr” dataset, which includes a variety of information about universities and colleges in the US.
I developed the following 10 questions based on my understanding of the data, with some additional guidance from ChatGPT to refine them.
1. What is the mean cost across all colleges in the data?
2. What is the correlation between cost and faculty salary (FacSalary)?
3. What is the correlation between enrollment and cost?
4. What is the distribution of cost?
5. Is there a significant relationship between the cost of tuition (both in-state and out-of-state) and the rate at which students complete their programs?
6. What is the standard deviation of cost?
7. How does faculty salary relate to the average debt of students who complete the program?
8. Do colleges with higher percentages of part-time students have lower completion rates?
9. What is the distribution of average net price for students across colleges?
10. How does median family income relate to net price for students?
college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
mean(college$Cost, na.rm = TRUE)
## [1] 34277.31
The mean cost across all colleges in the dataset is $34,277.31.
cor(college$Cost, college$FacSalary, use="complete.obs")
## [1] 0.424201
The correlation between Cost and FacSalary is 0.424201, a moderate positive correlation: colleges with higher costs tend to pay higher faculty salaries.
cor(college$Enrollment, college$Cost, use="complete.obs")
## [1] -0.1914363
The correlation between enrollment and cost is -0.1914363, a weak negative correlation: colleges with larger enrollment tend to have somewhat lower costs, and vice versa. A correlation alone does not show that one causes the other.
hist(college$Cost, main="Histogram of Cost", xlab = "Cost", col = "red")
cor(college$TuitionIn, college$CompRate, use="complete.obs")
## [1] 0.5477039
cor(college$TuitonOut, college$CompRate, use="complete.obs")
## [1] 0.6636967
plot(college$TuitionIn, college$CompRate, main="Completion Rate vs In-State Tuition",
     xlab="In-State Tuition", ylab="Completion Rate", pch=19, col="red")
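Both correlations are positive (0.548 for in-state tuition and 0.664 for
out-of-state tuition), which indicates that colleges with higher tuition
tend to have higher completion rates. This could suggest that higher
costs are associated with better resources, programs, or academic rigor,
potentially leading to higher completion rates. The correlations alone do
not establish statistical significance or causation.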
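The question asks whether the relationship is significant, which `cor()` by itself does not answer. One common way to check is `cor.test()`, which reports the Pearson correlation along with a p-value. The sketch below is only an illustration: `demo` is a hypothetical data frame holding the six rows printed by `head(college)`, so its results will not match the full dataset. Replacing `demo` with `college` would run the test on all colleges.

```r
# Illustrative subset: the six rows printed by head(college) above.
# 'demo' is an assumed name used only for this sketch.
demo <- data.frame(
  TuitionIn = c(9857, 8328, 6900, 10280, 11068, 10780),
  TuitonOut = c(18236, 19032, 6900, 21480, 19396, 28100),  # spelled as in the dataset
  CompRate  = c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)
)

# cor.test() gives the Pearson correlation together with a t-test of the
# null hypothesis that the true correlation is zero. Pairs with missing
# values are dropped automatically.
test_in  <- cor.test(demo$TuitionIn, demo$CompRate)
test_out <- cor.test(demo$TuitonOut, demo$CompRate)

# Correlation estimate and p-value for each tuition measure
c(r = unname(test_in$estimate),  p = test_in$p.value)
c(r = unname(test_out$estimate), p = test_out$p.value)
```

A small p-value (conventionally below 0.05) would indicate that a correlation this large is unlikely to arise by chance alone if the true correlation were zero.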
sd(college$Cost, na.rm = TRUE)
## [1] 15278.54
The standard deviation of cost across the colleges in the dataset is $15,278.54. This is large relative to the mean cost of $34,277.31, indicating substantial variation in the cost of attending college.
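To put the size of the standard deviation in context, it can be expressed as a fraction of the mean (the coefficient of variation). The short sketch below uses the two values reported earlier in this report; the variable names are chosen here for illustration.

```r
# Values reported above for college$Cost
cost_mean <- 34277.31
cost_sd   <- 15278.54

# Coefficient of variation: standard deviation as a fraction of the mean.
# Equivalent to sd(college$Cost, na.rm = TRUE) / mean(college$Cost, na.rm = TRUE)
cost_cv <- cost_sd / cost_mean
round(cost_cv, 3)
## [1] 0.446
```

The standard deviation is therefore roughly 45% of the mean cost.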
cor(college$FacSalary, college$Debt, use="complete.obs")
## [1] 0.1707668
plot(college$FacSalary, college$Debt, main="Student Debt vs Faculty Salary",
     xlab="Faculty Salary (Average Monthly)", ylab="Student Debt", pch=19, col="red")
The correlation between faculty salary and student debt is weakly positive (0.17). As faculty salaries increase, student debt tends to increase slightly as well, but the relationship is too weak for faculty salary to be a useful predictor of student debt. The scatter plot shows widely scattered points with a slight upward trend, consistent with this weak relationship.
cor(college$PartTime, college$CompRate, use="complete.obs")
## [1] -0.4190961
plot(college$PartTime, college$CompRate, main="Completion Rate vs Part-Time Students",
     xlab="Percent Part-Time Students", ylab="Completion Rate", pch=19, col="pink")
There is a moderate negative correlation of -0.42 between the percentage of part-time students and the completion rate. This suggests that as the percentage of part-time students increases, the completion rate tends to decrease.
hist(college$NetPrice,
     main = "Distribution of Average Net Price for Students",
     xlab = "Net Price (Cost Minus Aid)",
     ylab = "Frequency",
     col = "green",
     border = "black",
     breaks = 20) # You can adjust 'breaks' to control the number of bins
The histogram of the average net price for students across colleges shows that the distribution is centered around $20,000, with most colleges' net prices falling near that value.
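The center and spread read off a histogram can be checked numerically with the median and quartiles. The sketch below is an illustration only: `net_price` is a hypothetical vector holding the six NetPrice values printed by `head(college)`, so the numbers it produces do not describe the full dataset. Using `college$NetPrice` in its place would summarize all colleges.

```r
# Illustrative subset: the six NetPrice values printed by head(college).
# 'net_price' is an assumed name used only for this sketch.
net_price <- c(15184, 17535, 9649, 19986, 12874, 21973)

# Median: the middle of the distribution
net_median <- median(net_price, na.rm = TRUE)

# First and third quartiles: the range containing the middle half of values
net_quartiles <- quantile(net_price, probs = c(0.25, 0.75), na.rm = TRUE)

net_median
net_quartiles
```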
cor(college$MedIncome, college$NetPrice, use="complete.obs")
## [1] 0.5151298
plot(college$MedIncome, college$NetPrice, main="Net Price vs Median Family Income",
     xlab="Median Family Income ($1,000s)", ylab="Net Price", pch=19, col="green")
The correlation between median family income and net price is moderately positive (0.515). This suggests that as median family income increases, the net price for students also tends to increase.
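A correlation describes the direction and strength of the relationship but not how much net price changes per unit of income. A simple linear regression with `lm()` gives that slope. The sketch below is an illustration only: `demo` is a hypothetical data frame built from the six rows printed by `head(college)`, so its coefficients will differ from those of the full dataset. Passing `data = college` instead would fit the model to all colleges.

```r
# Illustrative subset: the six rows printed by head(college).
# 'demo' is an assumed name used only for this sketch.
demo <- data.frame(
  MedIncome = c(23.6, 34.5, 15.0, 44.8, 22.1, 66.7),      # in $1,000s
  NetPrice  = c(15184, 17535, 9649, 19986, 12874, 21973)
)

# Least-squares line: NetPrice = intercept + slope * MedIncome
fit <- lm(NetPrice ~ MedIncome, data = demo)

# The slope estimates the change in net price (dollars) for each
# additional $1,000 of median family income
coef(fit)
```

Calling `abline(fit)` right after the `plot()` call would overlay the fitted line on the scatter plot.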
Appendix
The following R code was used for data analysis and visualization for the different questions in this report.
college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
mean(college$Cost, na.rm = TRUE)
## [1] 34277.31
cor(college$Cost, college$FacSalary, use="complete.obs")
## [1] 0.424201
cor(college$Enrollment, college$Cost, use="complete.obs")
## [1] -0.1914363
hist(college$Cost, main="Histogram of Cost", xlab = "Cost", col = "red")
cor(college$TuitionIn, college$CompRate, use="complete.obs")
## [1] 0.5477039
cor(college$TuitonOut, college$CompRate, use="complete.obs")
## [1] 0.6636967
plot(college$TuitionIn, college$CompRate, main="Completion Rate vs In-State Tuition",
     xlab="In-State Tuition", ylab="Completion Rate", pch=19, col="red")
sd(college$Cost, na.rm = TRUE)
## [1] 15278.54
cor(college$FacSalary, college$Debt, use="complete.obs")
## [1] 0.1707668
plot(college$FacSalary, college$Debt, main="Student Debt vs Faculty Salary",
     xlab="Faculty Salary (Average Monthly)", ylab="Student Debt", pch=19, col="red")
cor(college$PartTime, college$CompRate, use="complete.obs")
## [1] -0.4190961
plot(college$PartTime, college$CompRate, main="Completion Rate vs Part-Time Students",
     xlab="Percent Part-Time Students", ylab="Completion Rate", pch=19, col="pink")
hist(college$NetPrice,
     main = "Distribution of Average Net Price for Students",
     xlab = "Net Price (Cost Minus Aid)",
     ylab = "Frequency",
     col = "green",
     border = "black",
     breaks = 20) # You can adjust 'breaks' to control the number of bins
cor(college$MedIncome, college$NetPrice, use="complete.obs")
## [1] 0.5151298
plot(college$MedIncome, college$NetPrice, main="Net Price vs Median Family Income",
     xlab="Median Family Income ($1,000s)", ylab="Net Price", pch=19, col="green")