We use the College Scores 4yr data set (CollegeScores4yr).
I propose the following 10 questions based on my own understanding of the data:
Q1. What is the average undergraduate enrollment among the universities in the data?
Q2. How much variation exists in in-state tuition and fees among the universities in the data?
Q3. What is the average total cost of attendance across universities, and how much does it vary in the data?
Q4. What does the distribution of median family income look like across universities in the data?
Q5. Do universities with a higher percentage of white students tend to have a higher or lower percentage of female students in the data?
Q6. Do universities with higher percentages of Pell-grant students tend to have lower median family income in the data?
Q7. Is there a relationship between average SAT score and completion rate in the data?
Q8. Do schools that are online-only have different average undergraduate enrollment than schools that are not in the data?
Q9. Do universities with a higher average total cost also have a higher average student debt in the data?
Q10. Do universities with higher median family income tend to have higher average net prices in the data?
We will explore the questions in detail.
college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
mean(college$Enrollment, na.rm = TRUE)
## [1] 4484.831
The average undergraduate enrollment among the universities in the data is about 4,485 students.
var(college$TuitionIn, use = "complete.obs")
## [1] 199665280
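The variance of in-state tuition and fees in the data is 199,665,280 (dollars²). Taking the square root gives a standard deviation of about 14,130 dollars, so in-state tuition varies widely from school to school.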
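Because the variance is in squared dollars, measures on the original dollar scale are easier to read. The following is a minimal sketch, not part of the original analysis. It uses only the six `TuitionIn` values printed by `head(college)` above as stand-in data, and the helper name `spread_summary` is hypothetical. Passing `college$TuitionIn` instead would give the full-data figures.

```r
# Stand-in data: the six TuitionIn values shown in the head(college) output.
tuition_in <- c(9857, 8328, 6900, 10280, 11068, 10780)

# Hypothetical helper: spread measures in the variable's own units (dollars).
spread_summary <- function(x) {
  c(
    sd  = sd(x, na.rm = TRUE),   # standard deviation = square root of the variance
    iqr = IQR(x, na.rm = TRUE),  # interquartile range (middle 50% of schools)
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE)
  )
}

tuition_spread <- spread_summary(tuition_in)
print(tuition_spread)
```

The standard deviation and interquartile range answer the "how much variation" question in dollars rather than dollars².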
hist(college$Cost, main = "Distribution of Total Cost of Attendance", xlab = "Total Cost ($)", col = "lightblue", breaks = 30)
The histogram of total cost of attendance shows that the tallest bar falls between 20,000 dollars and 22,000 dollars, so costs in that range are the most common. The distribution is right-skewed, with a few schools having much higher costs.
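Q3 also asks for the average cost and how much it varies, which a histogram alone does not report. A minimal sketch of the numeric summary follows. It uses only the six `Cost` values printed by `head(college)` as stand-in data, so the numbers it prints are not the full-data answer. Replacing `cost` with `college$Cost` would give that.

```r
# Stand-in data: the six Cost values shown in the head(college) output.
cost <- c(22886, 24129, 15080, 22108, 19413, 28836)

# Center and spread; na.rm = TRUE drops missing values, as the full data has NAs.
cost_stats <- c(
  mean   = mean(cost, na.rm = TRUE),
  median = median(cost, na.rm = TRUE),
  sd     = sd(cost, na.rm = TRUE)
)
print(cost_stats)
```

On the full data, a mean noticeably above the median would be consistent with the right skew seen in the histogram.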
hist(college$MedIncome, main = "Histogram of Income", xlab = "Income")
The histogram of median family income (recorded in thousands of dollars) shows that most universities have a median family income between 20,000 dollars and 40,000 dollars.
cor(college$White, college$Female, use = "complete.obs")
## [1] -0.09688832
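The correlation is about -0.10, which is negative but very weak. Schools with a higher percentage of white students have only a very slight tendency toward a lower percentage of female students, so the two percentages are close to unrelated.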
plot(college$Pell, college$MedIncome, main = "Pell Grant Percentage vs Median Family Income", xlab = "Percent Receiving Pell Grants", ylab = "Median Family Income ($1000s)",col = "darkgreen")
The scatterplot shows that as the percentage of students receiving Pell Grants increases, the median family income of students tends to decrease.
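The downward trend in the scatterplot can also be put into numbers. The sketch below is an illustration rather than part of the original analysis. It uses only the six `Pell` and `MedIncome` values printed by `head(college)` as stand-in data, and the variable names are chosen for this example. Using `college$Pell` and `college$MedIncome` would give the full-data values.

```r
# Stand-in data: the six rows shown in the head(college) output.
pell       <- c(71.0, 35.3, 74.2, 27.7, 73.8, 18.0)  # percent receiving Pell Grants
med_income <- c(23.6, 34.5, 15.0, 44.8, 22.1, 66.7)  # median family income ($1000s)

# Correlation: the sign gives the direction, the magnitude gives the strength.
pell_income_r <- cor(pell, med_income, use = "complete.obs")

# Least-squares line: the slope is the change in median income ($1000s)
# associated with a one-point increase in the Pell percentage.
fit <- lm(med_income ~ pell)
pell_slope <- unname(coef(fit)["pell"])

print(pell_income_r)
print(pell_slope)
```

A negative correlation and a negative slope would both agree with the pattern described for the scatterplot.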
cor(college$AvgSAT, college$CompRate, use = "complete.obs")
## [1] 0.8189495
The correlation is about 0.82, a strong positive relationship, so schools with higher average SAT scores tend to have higher completion rates.
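A correlation coefficient by itself does not say how precisely it is estimated. One possible follow-up, sketched here under the assumption that a Pearson test is appropriate, is `cor.test()`, which also returns a confidence interval and a p-value. The sketch uses only the six `AvgSAT` and `CompRate` values printed by `head(college)` as stand-in data (one SAT value is missing), so its output is illustrative only.

```r
# Stand-in data: the six rows shown in the head(college) output.
avg_sat   <- c(929, 1195, NA, 1322, 935, 1278)           # NA: no SAT average reported
comp_rate <- c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)

# cor.test() drops incomplete pairs itself, so the NA row is excluded.
sat_test <- cor.test(avg_sat, comp_rate, method = "pearson")

r_estimate <- unname(sat_test$estimate)  # same value cor() gives on complete pairs
r_interval <- sat_test$conf.int          # 95% confidence interval for the correlation
print(r_estimate)
print(r_interval)
print(sat_test$p.value)
```

With the full `college` data the interval would be much narrower than with five complete rows.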
boxplot(college$Enrollment ~ college$Online, main = "Undergraduate Enrollment: Online vs On-Campus Schools", xlab = "Online-Only School", ylab = "Undergraduate Enrollment", names = c("No", "Yes"), col = c("lightblue", "lightgreen"))
The boxplot comparing undergraduate enrollment for online-only and on-campus schools shows that online-only institutions generally have a smaller median enrollment than traditional on-campus universities.
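Since Q8 asks about average enrollment, the boxplot can be backed up with a per-group summary. A minimal sketch using `tapply()` follows. It uses only the six `Enrollment` and `Online` values printed by `head(college)` as stand-in data, which include a single online-only school, so the result is illustrative only.

```r
# Stand-in data: the six rows shown in the head(college) output.
enrollment <- c(4824, 12866, 322, 6917, 4189, 32387)
online     <- c(0, 0, 1, 0, 0, 0)  # 1 = online-only, 0 = not

# Mean and median enrollment within each Online group.
enroll_mean   <- tapply(enrollment, online, mean, na.rm = TRUE)
enroll_median <- tapply(enrollment, online, median, na.rm = TRUE)

print(enroll_mean)
print(enroll_median)
```

Reporting both is useful because enrollment is likely to include a few very large schools, which pull the mean above the median.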
cor(college$Cost, college$Debt, use = "complete.obs")
## [1] -0.2144525
plot(college$Cost, college$Debt, main = "Average Total Cost vs Average Student Debt", xlab = "Average Total Cost ($)", ylab = "Average Student Debt ($)", col = "darkblue")
The correlation between average total cost and average student debt is about -0.21, a weak negative relationship. Universities with higher total costs tend to have somewhat lower average student debt, but cost alone explains little of the variation in debt.
plot(college$MedIncome, college$NetPrice,main = "Median Family Income vs Average Net Price", xlab = "Median Family Income ($1000s)", ylab = "Average Net Price ($)", col = "darkorange")
cor(college$MedIncome, college$NetPrice, use = "complete.obs")
## [1] 0.5151298
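The correlation is about 0.52, a moderate positive relationship, meaning schools whose students come from higher-income families tend to have higher average net prices. This may reflect students at those schools receiving less financial aid, although the data here do not show that directly.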
# Q1 code: mean(college$Enrollment, na.rm = TRUE)
# Q2 code: var(college$TuitionIn, use = "complete.obs")
# Q3 code: hist(college$Cost, main = "Distribution of Total Cost of Attendance", xlab = "Total Cost ($)", col = "lightblue", breaks = 30)
# Q4 code: hist(college$MedIncome, main = "Histogram of Income", xlab = "Income")
# Q5 code: cor(college$White, college$Female, use = "complete.obs")
# Q6 code: plot(college$Pell, college$MedIncome, main = "Pell Grant Percentage vs Median Family Income", xlab = "Percent Receiving Pell Grants", ylab = "Median Family Income ($1000s)",col = "darkgreen")
# Q7 code: cor(college$AvgSAT, college$CompRate, use = "complete.obs")
# Q8 code: boxplot(college$Enrollment ~ college$Online, main = "Undergraduate Enrollment: Online vs On-Campus Schools", xlab = "Online-Only School", ylab = "Undergraduate Enrollment", names = c("No", "Yes"), col = c("lightblue", "lightgreen"))
# Q9 code: cor(college$Cost, college$Debt, use = "complete.obs")
# Q9 code: plot(college$Cost, college$Debt, main = "Average Total Cost vs Average Student Debt", xlab = "Average Total Cost ($)", ylab = "Average Student Debt ($)", col = "darkblue")
# Q10 code: plot(college$MedIncome, college$NetPrice,main = "Median Family Income vs Average Net Price", xlab = "Median Family Income ($1000s)", ylab = "Average Net Price ($)", col = "darkorange")
# Q10 code: cor(college$MedIncome, college$NetPrice, use = "complete.obs")