As the first project for STAT 351, this report analyzes the CollegeScores4yr dataset from Lock5Data to better understand the factors that characterize four-year colleges in the United States. We explore characteristics related to college accessibility, affordability, and student success. By examining key variables in the dataset, we gain insight into admission rates, student demographics, completion rates, tuition costs, and other metrics that may affect students' educational experiences and outcomes.
Because this is the first project of the course, our knowledge and toolset are limited. Our analysis is therefore restricted to the material in Chapter 6 of the course, which covers simple descriptive statistics such as the mean, median, and range. We also use graphical representations to aid our understanding of the dataset.
To begin, we propose 10 questions that guide our analysis and establish the foundation of this project. These questions not only aid our data analysis but also help in developing hypotheses to test in later stages. Once the questions are established, we implement the analysis in R, which supports data processing, cleaning, and manipulation so that we can address each question clearly.
Our proposed questions are as follows:
In this section, we explore the proposed questions in detail. We analyze the CollegeScores4yr dataset, addressing ten key questions that shed light on various aspects of US four-year colleges. Using descriptive statistics and visual representations, we aim to uncover insights into college admissions, costs, and academic indicators such as SAT scores.
Note: Throughout the analysis, we use appropriate statistical measures and visualizations to interpret the data, which helps us draw more accurate conclusions from the results.
college <- read.csv("CollegeScores4yr.csv")
head(college)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
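The preview above already shows missing values (the third row has NA for AdmitRate, MidACT, and AvgSAT), which is why the later calls use `na.rm = TRUE`. As a minimal sketch, the missing values can be counted before computing any summaries. The data frame `college_head` is a hypothetical stand-in that copies four columns from the six rows printed above, so its counts do not describe the full dataset.

```r
# Six rows of selected columns, copied from the head(college) output above.
# college_head is an illustrative name, not part of the original analysis.
college_head <- data.frame(
  Control   = c("Public", "Public", "Private", "Public", "Public", "Public"),
  AdmitRate = c(0.9027, 0.9181, NA, 0.8123, 0.9787, 0.5330),
  AvgSAT    = c(929, 1195, NA, 1322, 935, 1278),
  Cost      = c(22886, 24129, 15080, 22108, 19413, 28836)
)

# Number of missing values in each column
na_counts <- colSums(is.na(college_head))
print(na_counts)

# Number of rows with no missing value in any selected column
n_complete <- sum(complete.cases(college_head))
print(n_complete)
```

Running the same two calls on the full `college` data frame would show how many colleges drop out of each summary.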
mean(college$AdmitRate, na.rm = TRUE)
## [1] 0.6702025
hist(college$AdmitRate,
main = "Distribution of Admission Rates Across Colleges",
xlab = "Admission Rate",
ylab = "Frequency",
col = "lightblue",
border = "black",
breaks = 20) # Adjust breaks for better granularity
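The mean is only one of the descriptive measures mentioned in the introduction. The sketch below shows how the median and range could be computed alongside it. It uses only the six AdmitRate values visible in the head output, so the numbers illustrate the technique and are not the results for the full dataset.

```r
# AdmitRate for the first six colleges, copied from the head(college) output
admit_rate <- c(0.9027, 0.9181, NA, 0.8123, 0.9787, 0.5330)

# na.rm = TRUE drops the missing value before each summary is computed
admit_mean   <- mean(admit_rate, na.rm = TRUE)
admit_median <- median(admit_rate, na.rm = TRUE)
admit_range  <- range(admit_rate, na.rm = TRUE)  # c(minimum, maximum)

print(admit_mean)
print(admit_median)
print(admit_range)
```

In this small sample the median lies above the mean because a single low value (0.5330) pulls the mean down, which is the kind of skew a histogram also makes visible.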
mean(college$AvgSAT, na.rm = TRUE)
## [1] 1135.25
# The average SAT scores for public colleges
mean(college$AvgSAT[college$Control == "Public"], na.rm = TRUE)
## [1] 1118.91
# The average SAT scores for private colleges
mean(college$AvgSAT[college$Control == "Private"], na.rm = TRUE)
## [1] 1145.839
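The two subsetting calls above can also be written as a single grouped summary. The following is a sketch using `tapply()` on a hypothetical six-row data frame (`sat_head`) built from the head output. It is one possible alternative, not the method used in the analysis.

```r
# Control and AvgSAT for the first six colleges, copied from head(college)
sat_head <- data.frame(
  Control = c("Public", "Public", "Private", "Public", "Public", "Public"),
  AvgSAT  = c(929, 1195, NA, 1322, 935, 1278)
)

# Mean AvgSAT within each level of Control, ignoring missing values
group_means <- tapply(sat_head$AvgSAT, sat_head$Control, mean, na.rm = TRUE)
print(group_means)

# Number of non-missing SAT values behind each group mean
group_n <- tapply(!is.na(sat_head$AvgSAT), sat_head$Control, sum)
print(group_n)
```

In this toy sample the only private college has a missing AvgSAT, so its group mean is NaN. Checking the non-missing count per group guards against reading too much into a mean that rests on few or no observations.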
# Graphical Presentation as a box plot
boxplot(AvgSAT ~ Control, data = college[college$Control %in% c("Public", "Private"), ],
main = "Average SAT Scores by College Type",
xlab = "College Type",
ylab = "Average SAT Score",
col = c("lightblue", "lightgreen"))
hist(college$AvgSAT,
main = "Distribution of SAT Scores Across Colleges",
xlab = "Average SAT Score",
ylab = "Frequency",
col = "lightblue",
border = "black",
breaks = 20) # Adjust breaks for better granularity
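# Create categories for the SAT score ranges
# Note: cut() uses right-closed intervals by default, so a score of exactly 1000 falls in "Below 1000"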
college$SAT_Category <- cut(college$AvgSAT,
breaks = c(0, 1000, 1200, Inf),
labels = c("Below 1000", "1000-1200", "Above 1200"))
# Calculate the frequency of each category
SAT_category_table <- table(college$SAT_Category)
# Calculate the percentage for each category
SAT_percentages <- prop.table(SAT_category_table) * 100
# Create the pie chart with percentages as labels
pie(SAT_category_table,
labels = paste(names(SAT_category_table), "\n", round(SAT_percentages, 1), "%"),
main = "Proportion of Colleges by SAT Score Range",
col = c("darkblue", "darkgreen", "lightcoral"))
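How `cut()` treats scores that fall exactly on a break affects the category counts. The sketch below uses made-up boundary scores (not values from the dataset) to compare the default right-closed intervals with left-closed intervals obtained through `right = FALSE`. The labels in the second call are hypothetical alternatives.

```r
# Made-up scores chosen to sit on and around the break points
sat_scores <- c(999, 1000, 1001, 1200, 1201)

# Default (right = TRUE): intervals are (0, 1000], (1000, 1200], (1200, Inf]
default_bins <- cut(sat_scores,
                    breaks = c(0, 1000, 1200, Inf),
                    labels = c("Below 1000", "1000-1200", "Above 1200"))

# right = FALSE: intervals are [0, 1000), [1000, 1200), [1200, Inf)
left_closed <- cut(sat_scores,
                   breaks = c(0, 1000, 1200, Inf),
                   labels = c("Below 1000", "1000-1199", "1200 and above"),
                   right = FALSE)

print(data.frame(sat_scores, default_bins, left_closed))
```

With the default settings a score of exactly 1000 is labelled "Below 1000", whereas with `right = FALSE` it moves into the middle category. Either convention works as long as the labels match the intervals.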
mean(college$Cost, na.rm = TRUE)
## [1] 34277.31
cor(college$Cost, college$NetPrice, use = "complete.obs")
## [1] 0.8021118
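To make explicit what `cor()` computes, the sketch below calculates the Pearson correlation by hand and compares it with the built-in function. The helper `pearson_r` is a hypothetical name introduced for illustration, and the data are the six rows from the head output, so the resulting value differs from the full-dataset correlation reported above.

```r
# Cost, NetPrice, and AdmitRate for the first six colleges (from head(college))
cost      <- c(22886, 24129, 15080, 22108, 19413, 28836)
net_price <- c(15184, 17535, 9649, 19986, 12874, 21973)
admit     <- c(0.9027, 0.9181, NA, 0.8123, 0.9787, 0.5330)

# Pearson correlation from its definition (illustrative helper)
pearson_r <- function(x, y) {
  keep <- complete.cases(x, y)  # keep only pairs where both values are present
  x <- x[keep]
  y <- y[keep]
  sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
}

r_manual  <- pearson_r(cost, net_price)
r_builtin <- cor(cost, net_price, use = "complete.obs")
print(c(manual = r_manual, builtin = r_builtin))

# With a missing value present, use = "complete.obs" drops the whole pair
r_admit_manual  <- pearson_r(admit, cost)
r_admit_builtin <- cor(admit, cost, use = "complete.obs")
print(c(manual = r_admit_manual, builtin = r_admit_builtin))
```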
# Scatter plot with a linear regression line
plot(college$Cost, college$NetPrice,
main = "College Cost vs Net Price",
xlab = "Cost (Total Cost)",
ylab = "Net Price (Price After Aid)",
pch = 19, col = rgb(0.5, 0.6, 1, 0.8),
cex = 0.6)
# Add linear regression line
abline(lm(NetPrice ~ Cost, data = college), col = rgb(0.3, 0.9, 0.8, 1))
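The regression line drawn by `abline()` is tied directly to the correlation: in simple linear regression the slope equals r multiplied by the ratio of the standard deviations. The sketch below checks this on a hypothetical six-row data frame (`toy`) built from the head output. The fitted values are illustrative only and do not reproduce the line in the plot.

```r
# Cost and NetPrice for the first six colleges, copied from head(college)
toy <- data.frame(
  Cost     = c(22886, 24129, 15080, 22108, 19413, 28836),
  NetPrice = c(15184, 17535, 9649, 19986, 12874, 21973)
)

# Least-squares fit of NetPrice on Cost
fit <- lm(NetPrice ~ Cost, data = toy)
slope_from_fit <- unname(coef(fit)["Cost"])

# The same slope recovered from the correlation and the two standard deviations
slope_from_r <- cor(toy$Cost, toy$NetPrice) * sd(toy$NetPrice) / sd(toy$Cost)

print(c(fit = slope_from_fit, from_r = slope_from_r))
```

The slope can be read as the average change in net price associated with a one-unit increase in cost, which complements the unit-free correlation coefficient.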
cor(college$AdmitRate, college$Cost, use = "complete.obs")
## [1] -0.3036798
# Scatter plot for Admission Rate vs Cost
plot(college$AdmitRate, college$Cost,
main = "Admission Rate vs College Cost",
xlab = "Admission Rate",
ylab = "College Cost",
pch = 19, # Solid circle for points
col = rgb(0, 0, 1, 0.5), # Blue with some transparency
cex = 0.6) # Point size
# Calculate the standard deviation of the Admission Rate
sd(college$AdmitRate, na.rm = TRUE)
## [1] 0.208179
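For reference, `sd()` returns the sample standard deviation, which divides the sum of squared deviations by n - 1 rather than n. The sketch below reproduces the calculation step by step on the six AdmitRate values from the head output, so the value differs from the full-dataset result above.

```r
# AdmitRate for the first six colleges, copied from the head(college) output
admit_rate <- c(0.9027, 0.9181, NA, 0.8123, 0.9787, 0.5330)

# Drop the missing value, as na.rm = TRUE does inside sd()
x <- admit_rate[!is.na(admit_rate)]
n <- length(x)

# Sample standard deviation: square root of squared deviations over (n - 1)
sd_manual  <- sqrt(sum((x - mean(x))^2) / (n - 1))
sd_builtin <- sd(admit_rate, na.rm = TRUE)

print(c(manual = sd_manual, builtin = sd_builtin))
```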
# Create a frequency histogram for Admission Rates (showing the actual counts)
hist(college$AdmitRate,
main = "Frequency of Admission Rates",
xlab = "Admission Rate",
col = "lightblue",
border = "black",
breaks = 15,
freq = TRUE) # This makes it a frequency plot instead of density
cor(college$FacSalary, college$InstructFTE, use = "complete.obs")
## [1] 0.4998836
# Scatter plot of Faculty Salary vs Instructional Spending per FTE Student
plot(college$FacSalary, college$InstructFTE,
main = "Faculty Salary vs Instructional Spending",
xlab = "Faculty Salary (Monthly)",
ylab = "Instructional Spending per FTE Student",
col = "blue",
pch = 16) # pch=16 gives solid circles
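This analysis explored a subset of key questions about US four-year colleges using the CollegeScores4yr dataset. We examined admission rates, SAT scores, college costs, faculty salaries, and instructional spending, and the relationships among them. From the calculations above, we draw the following conclusions. The mean admission rate is about 0.67, with a standard deviation of about 0.21. The mean average SAT score is about 1135, and private colleges (about 1146) score slightly higher on average than public colleges (about 1119). The mean cost is about 34,277, and cost is strongly positively correlated with net price (r ≈ 0.80). Admission rate and cost show a weak-to-moderate negative correlation (r ≈ -0.30), while faculty salary and instructional spending per FTE student show a moderate positive correlation (r ≈ 0.50).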
Although we relied on a limited set of descriptive statistical techniques, the thoughtful formulation of questions and the power of R software enabled us to gain valuable insights into the world of US four-year colleges. From understanding the relationships between SAT scores, admission rates, and costs, to analyzing faculty salaries and instructional spending, this project has demonstrated how even simple statistical tools can provide a clearer view of complex educational trends. As we continue to explore data, the ability to ask the right questions and apply the right analytical methods will help guide our future inquiries.
References
Lock5Data. (n.d.). CollegeScores4yr dataset. Retrieved from Lock5Data.
R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Retrieved from https://www.r-project.org