In this project, we explore the CollegeScores4yr data set to better understand different characteristics of colleges in the United States. We focus on variables such as tuition, enrollment, faculty salaries, first-generation students, and more. Using techniques from Chapter 6, we answer ten simple questions by applying basic statistical measures (mean, median, standard deviation, variance, and correlation) and graphical tools (histograms and boxplots).
All analyses are performed using RStudio, allowing us to visualize the data and summarize important patterns clearly and effectively.
We propose ten questions based on our own reading of the data and explore each one in turn below.
college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
mean(college$TuitionIn, na.rm = TRUE)
## [1] 21948.55
The mean in-state tuition and fees across the colleges is $21,948.55.
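The `na.rm = TRUE` argument matters here because `mean()` returns `NA` as soon as any value is missing. A minimal sketch on a small made-up vector (the values are hypothetical, not taken from CollegeScores4yr) shows the difference:

```r
# Hypothetical tuition values with one missing entry (not from the data set)
tuition <- c(9000, NA, 12000, 15000)

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(tuition)

# With na.rm = TRUE, the NA is dropped and the mean uses the remaining 3 values
mean_without_na <- mean(tuition, na.rm = TRUE)

print(mean_with_na)
print(mean_without_na)
```

The same argument is accepted by `median()`, `sd()`, and `var()`, which is why it appears in those calls as well.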
median(college$Enrollment, na.rm = TRUE)
## [1] 1722
The median undergraduate enrollment is 1722.
hist(college$FacSalary, main = "Distribution of Faculty Salaries", xlab = "Average Monthly Salary", col = "yellow")
The histogram above shows the distribution of average monthly salary for full-time faculty.
cor(college$AvgSAT, college$CompRate, use = "complete.obs")
## [1] 0.8189495
The correlation between average SAT score and completion rate is 0.8189495, which indicates a strong positive linear association.
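To make explicit what `cor()` computes, the sketch below evaluates the Pearson correlation formula by hand and compares it with the built-in function. The vectors `sat` and `comp` are small made-up examples, not values from the data set. The second part illustrates how `use = "complete.obs"` drops any pair that contains an `NA`.

```r
# Hypothetical SAT scores and completion rates (illustration only)
sat  <- c(900, 1000, 1100, 1200, 1300)
comp <- c(30, 45, 50, 65, 80)

# Pearson correlation from its definition:
# sum of cross-deviations divided by the root of the product of squared deviations
dx <- sat - mean(sat)
dy <- comp - mean(comp)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in result for comparison
r_builtin <- cor(sat, comp)
print(all.equal(r_manual, r_builtin))

# Add one pair with a missing SAT value; "complete.obs" keeps only complete pairs
sat_na  <- c(sat, NA)
comp_na <- c(comp, 70)
r_complete <- cor(sat_na, comp_na, use = "complete.obs")
print(all.equal(r_complete, r_builtin))
```

Because the incomplete pair is discarded, the last comparison gives the same correlation as the five complete pairs alone.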
boxplot(college$FirstGen, main = "Boxplot of First-Generation Students", ylab = "Percent of First-Gen Students", col = "orange")
The boxplot above shows the distribution of the percentage of first-generation students across colleges.
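A boxplot is essentially a picture of a five-number summary plus any flagged outliers. As a rough sketch, `boxplot.stats()` returns the numbers that `boxplot()` draws. The vector `first_gen` below is made up for illustration and does not come from the data set.

```r
# Hypothetical percentages of first-generation students (illustration only)
first_gen <- c(12, 25, 28, 30, 33, 35, 38, 41, 45, 90)

bs <- boxplot.stats(first_gen)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# out: points lying more than 1.5 box lengths beyond the hinges
print(bs$out)
```

In this toy example the value 90 is flagged as an outlier, so the upper whisker stops at 45 rather than at the maximum.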
sd(college$FacSalary, na.rm = TRUE)
## [1] 2563.004
The standard deviation of faculty salary is 2563.004.
hist(college$NetPrice, main = "Distribution of Net Price", xlab = "Net Price", col = "blue")
The histogram above shows the distribution of average net price.
cor(college$MedIncome, college$NetPrice, use = "complete.obs")
## [1] 0.5151298
The correlation between median family income and average net price is 0.5151298, which indicates a moderate positive linear association.
mean(college$Female, na.rm = TRUE)
## [1] 59.29588
Across colleges, the mean percentage of female students is 59.29588%.
var(college$InstructFTE, na.rm = TRUE)
## [1] 123100321
The variance in instructional spending per FTE student is 123100321.
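Variance is expressed in squared units, so its square root (the standard deviation) is usually easier to interpret. As a quick sketch using only the variance printed above:

```r
# Variance of InstructFTE as reported in the output above
var_instruct <- 123100321

# The standard deviation is the square root of the variance,
# and is in the same units as InstructFTE itself
sd_instruct <- sqrt(var_instruct)
print(sd_instruct)  # roughly 11095
```

This should agree with `sd(college$InstructFTE, na.rm = TRUE)` up to the rounding of the printed variance.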
Through this project, we applied statistical methods from Chapter 6 to examine various aspects of U.S. colleges using the CollegeScores4yr data set. We used tools such as the mean, median, standard deviation, variance, correlation, histograms, and boxplots to answer ten questions, each focused on one or two variables.
Our analysis revealed several useful insights. For example, we found the average in-state tuition across colleges, explored how faculty salaries are distributed, and identified the median enrollment size. We also found relationships between variables: a strong positive correlation between average SAT scores and completion rates, and a moderate positive correlation between family income and net price. By analyzing percentages of first-generation and female students, as well as instructional spending per student, we gained a broader understanding of both costs and demographics in higher education.
Working with this data set allowed us to apply the concepts from Chapter 6 in a practical context. We were able to explore real trends in college-related variables and interpret the results using basic statistical methods. This experience not only helped us improve our R skills but also deepened our understanding of how data can be used to support analysis and decision-making in education.
# Q1
mean(college$TuitionIn, na.rm = TRUE)
# Q2
median(college$Enrollment, na.rm = TRUE)
# Q3
hist(college$FacSalary, main = "Distribution of Faculty Salaries", xlab = "Average Monthly Salary", col = "yellow")
# Q4
cor(college$AvgSAT, college$CompRate, use = "complete.obs")
# Q5
boxplot(college$FirstGen, main = "Boxplot of First-Generation Students", ylab = "Percent of First-Gen Students", col = "orange")
# Q6
sd(college$FacSalary, na.rm = TRUE)
# Q7
hist(college$NetPrice, main = "Distribution of Net Price", xlab = "Net Price", col = "blue")
# Q8
cor(college$MedIncome, college$NetPrice, use = "complete.obs")
# Q9
mean(college$Female, na.rm = TRUE)
# Q10
var(college$InstructFTE, na.rm = TRUE)