1. Introduction

This project, the first for STAT 351, analyzes the CollegeScores4yr dataset from Lock5Data to gain a deeper understanding of the factors affecting four-year colleges in the United States. Through this analysis, we explore a range of characteristics that influence college performance, accessibility, affordability, and student success. By examining key variables within the dataset, we can gain insights into college rankings, student demographics, graduation rates, tuition costs, and other metrics that might affect students' educational experiences and outcomes.

Since this is the first project of the class, our understanding and our tool set are limited. Our analysis and findings are therefore limited to Chapter 6 of the course, which covers simple descriptive statistics such as the mean, median, and range. We also use graphical representations to aid our understanding of the data set.
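
As a minimal sketch of the descriptive statistics mentioned above, the following computes a mean, median, and range in R. The vector `admit_rate` and its values are made up for illustration and are not taken from the dataset; `na.rm = TRUE` tells each function to ignore the missing value.

```r
# Hypothetical admission rates for six colleges (illustrative values, one missing)
admit_rate <- c(0.90, 0.92, NA, 0.81, 0.98, 0.53)

# na.rm = TRUE drops the missing value before each summary is computed
rate_mean   <- mean(admit_rate, na.rm = TRUE)
rate_median <- median(admit_rate, na.rm = TRUE)
rate_range  <- range(admit_rate, na.rm = TRUE)  # smallest and largest value

print(rate_mean)
print(rate_median)
print(rate_range)
```

Without `na.rm = TRUE`, each of these calls would return `NA` because the vector contains a missing value.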

To begin, we propose 10 questions that guide our analysis and establish the foundation of this project. These questions not only aid our data analysis but also help in developing hypotheses to test in later stages. Once the questions are established, we carry out the analysis in R, which we use for data processing, cleaning, and manipulation so that each question can be answered clearly.

Our proposed questions are as follows:

  1. What is the mean admissions rate for US 4 Year Colleges?
  2. What is the average/mean SAT score of US 4 Year colleges?
  3. What are the average SAT scores for public vs private colleges?
  4. How is the SAT score distributed across the colleges?
  5. What percentages of colleges meet or exceed a certain SAT score threshold?
  6. What is the mean cost for US 4 Year colleges?
  7. What is the correlation between college cost and net price?
  8. Is there correlation between admissions rate and cost of college?
  9. What is the standard deviation of the admission rates?
  10. Is there a correlation between faculty salary and instructional spending per student?

2. Analysis

In this section, we explore the proposed questions in detail. We analyze the CollegeScores4yr dataset by addressing ten questions that shed light on various aspects of US four-year colleges. Using descriptive statistics and visual representations, we aim to uncover insights into college admissions, costs, and academic measures such as SAT scores.

Note: Throughout the analysis, we use statistical measures and visualizations suited to each question, which helps us draw more accurate conclusions from the results.

college = read.csv("CollegeScores4yr.csv")
head(college)
##                                  Name State     ID Main
## 1            Alabama A & M University    AL 100654    1
## 2 University of Alabama at Birmingham    AL 100663    1
## 3                  Amridge University    AL 100690    1
## 4 University of Alabama in Huntsville    AL 100706    1
## 5            Alabama State University    AL 100724    1
## 6           The University of Alabama    AL 100751    1
##                                                                Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
##   MainDegree HighDegree Control    Region Locale Latitude Longitude AdmitRate
## 1          3          4  Public Southeast   City 34.78337 -86.56850    0.9027
## 2          3          4  Public Southeast   City 33.50570 -86.79935    0.9181
## 3          3          4 Private Southeast   City 32.36261 -86.17401        NA
## 4          3          4  Public Southeast   City 34.72456 -86.64045    0.8123
## 5          3          4  Public Southeast   City 32.36432 -86.29568    0.9787
## 6          3          4  Public Southeast   City 33.21187 -87.54598    0.5330
##   MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1     18    929      0       4824   2.5  90.7      0.9   0.2   5.6      6.6
## 2     25   1195      0      12866  57.8  25.9      3.3   5.9   7.1     25.2
## 3     NA     NA      1        322   7.1  14.3      0.6   0.3  77.6     54.4
## 4     28   1322      0       6917  74.2  10.7      4.6   4.0   6.5     15.0
## 5     18    935      0       4189   1.5  93.8      1.0   0.3   3.5      7.7
## 6     28   1278      0      32387  78.5  10.1      4.7   1.2   5.6      7.9
##   NetPrice  Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1    15184 22886      9857     18236       9227        7298      6983
## 2    17535 24129      8328     19032      11612       17235     10640
## 3     9649 15080      6900      6900      14738        5265      3866
## 4    19986 22108     10280     21480       8727        9748      9391
## 5    12874 19413     11068     19396       9003        7983      7399
## 6    21973 28836     10780     28100      13574       10894     10016
##   FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1        71.3 71.0    23.96 1068   56.4     36.6      23.6
## 2        89.9 35.3    52.92 3755   63.9     34.1      34.5
## 3       100.0 74.2    18.18  109   64.9     51.3      15.0
## 4        64.6 27.7    48.62 1347   47.6     31.0      44.8
## 5        54.2 73.8    27.69 1294   61.3     34.3      22.1
## 6        74.0 18.0    67.87 6430   61.5     22.6      66.7
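
The preview shows `NA` entries in row 3 for `AdmitRate`, `MidACT`, and `AvgSAT`, which is why the later calls pass `na.rm = TRUE` or `use = "complete.obs"`. A minimal sketch of how missing values could be counted before analysis is given below; the data frame `toy` is a small stand-in with illustrative values, not the full dataset.

```r
# Small stand-in data frame with three of the dataset's column names
# (illustrative values only)
toy <- data.frame(
  AdmitRate = c(0.90, 0.92, NA, 0.81),
  AvgSAT    = c(929, 1195, NA, 1322),
  Cost      = c(22886, 24129, 15080, 22108)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Running the same two calls on `college` would show how many colleges are dropped from each calculation.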

Question 1. What is the mean admissions rate for US 4 Year Colleges?

mean(college$AdmitRate, na.rm = TRUE)
## [1] 0.6702025
hist(college$AdmitRate,
     main = "Distribution of Admission Rates Across Colleges", 
     xlab = "Average Admission Rates", 
     ylab = "Frequency", 
     col = "lightblue", 
     border = "black", 
     breaks = 20)  # Adjust breaks for better granularity

Question 2. What is the average SAT score of US 4 Year colleges?

mean(college$AvgSAT, na.rm = TRUE)
## [1] 1135.25

Question 3. What are the average SAT scores for public vs private colleges?

# The average SAT scores for public colleges
mean(college$AvgSAT[college$Control == "Public"], na.rm = TRUE)
## [1] 1118.91
# The average SAT scores for private colleges
mean(college$AvgSAT[college$Control == "Private"], na.rm = TRUE)
## [1] 1145.839
# Graphical Presentation as a box plot
# Keep only rows whose Control is "Public" or "Private" before plotting
boxplot(AvgSAT ~ Control, data = college[college$Control %in% c("Public", "Private"), ],
        main = "Average SAT Scores by College Type",
        xlab = "College Type",
        ylab = "Average SAT Score",
        col = c("lightblue", "lightgreen"))
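
The two group means above can also be computed in a single call. The sketch below uses `tapply()` on a small made-up data frame; the `"Other"` category is hypothetical and only stands in for any `Control` value that is neither public nor private.

```r
# Illustrative data: Control type and average SAT (made-up values, one missing)
toy <- data.frame(
  Control = c("Public", "Public", "Private", "Private", "Other"),
  AvgSAT  = c(1000, 1100, 1200, NA, 900)
)

# Apply mean() to AvgSAT within each Control group, ignoring missing values
group_means <- tapply(toy$AvgSAT, toy$Control, mean, na.rm = TRUE)
print(group_means)
```

With the real data, `tapply(college$AvgSAT, college$Control, mean, na.rm = TRUE)` should return one mean per control type, which makes it easy to check the two values reported above side by side.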

Question 4. How is the SAT score distributed across the colleges?

hist(college$AvgSAT, 
     main = "Distribution of SAT Scores Across Colleges", 
     xlab = "Average SAT Score", 
     ylab = "Frequency", 
     col = "lightblue", 
     border = "black", 
     breaks = 20)  # Adjust breaks for better granularity

Question 5. What percentages of colleges meet or exceed a certain SAT score threshold?

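# Create categories for the SAT score ranges.
# cut() uses right-closed intervals by default, so a score of exactly 1000
# falls in the first category and a score of exactly 1200 in the second.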
college$SAT_Category <- cut(college$AvgSAT, 
                            breaks = c(0, 1000, 1200, Inf), 
                            labels = c("Below 1000", "1000-1200", "Above 1200"))

# Calculate the frequency of each category
SAT_category_table <- table(college$SAT_Category)

# Calculate the percentage for each category
SAT_percentages <- prop.table(SAT_category_table) * 100

# Create the pie chart with percentages as labels
pie(SAT_category_table, 
    labels = paste(names(SAT_category_table), "\n", round(SAT_percentages, 1), "%"), 
    main = "Proportion of Colleges by SAT Score Range",
    col = c("darkblue", "darkgreen", "lightcoral"))
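
The question asks about meeting or exceeding a threshold, which can also be answered directly without categories. The sketch below assumes a hypothetical threshold of 1200 and a made-up vector of scores; it relies on the fact that the mean of a logical vector is the proportion of `TRUE` values. It also shows how `right = FALSE` changes which category a boundary score falls into.

```r
# Illustrative average SAT scores (made-up values, one missing)
sat <- c(929, 1195, NA, 1322, 935, 1278)
threshold <- 1200  # hypothetical threshold chosen for illustration

# Percentage of colleges with a known score at or above the threshold
pct_at_or_above <- mean(sat >= threshold, na.rm = TRUE) * 100
print(pct_at_or_above)

# Boundary behaviour of cut(): default (right-closed) vs right = FALSE
sat_labels <- c("Below 1000", "1000-1200", "Above 1200")
sat_breaks <- c(0, 1000, 1200, Inf)
default_bin <- as.character(cut(1000, breaks = sat_breaks, labels = sat_labels))
left_closed_bin <- as.character(cut(1000, breaks = sat_breaks, labels = sat_labels,
                                    right = FALSE))
print(default_bin)
print(left_closed_bin)
```

If the labels should be read literally, `right = FALSE` is the closer match, since a score of exactly 1000 is then counted in the "1000-1200" group.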

Question 6. What is the mean cost for US 4 Year colleges?

mean(college$Cost, na.rm = TRUE)
## [1] 34277.31

Question 7. What is the correlation between college cost and net price?

cor(college$Cost, college$NetPrice, use = "complete.obs")
## [1] 0.8021118
# Scatter plot with a linear regression line
plot(college$Cost, college$NetPrice, 
     main = "College Cost vs Net Price",
     xlab = "Cost (Total Cost)", 
     ylab = "Net Price (Price After Aid)",
     pch = 19, col = rgb(0.5, 0.6, 1, 0.8),
     cex = 0.6)

# Add linear regression line
abline(lm(college$NetPrice ~ college$Cost), col = rgb(0.3, 0.9, 0.8, 1))
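
The `use = "complete.obs"` argument in the `cor()` call above means that only colleges with both values present are used. A minimal sketch of that behaviour, with made-up vectors `cost` and `net_price`, is:

```r
# Illustrative cost and net price values with missing entries (made-up data)
cost      <- c(20000, 25000, NA, 40000, 55000)
net_price <- c(15000, 17000, 9000, NA, 30000)

# Built-in handling: pairs with a missing value are dropped
r_builtin <- cor(cost, net_price, use = "complete.obs")

# The same result by removing incomplete pairs by hand
keep <- complete.cases(cost, net_price)
r_manual <- cor(cost[keep], net_price[keep])

print(sum(keep))   # number of pairs actually used
print(r_builtin)
print(r_manual)
```

Reporting the number of complete pairs alongside the correlation makes clear how many colleges the coefficient is based on.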

Question 8. Is there correlation between admissions rate and cost of college?

cor(college$AdmitRate, college$Cost, use = "complete.obs")
## [1] -0.3036798
# Scatter plot for Admission Rate vs Cost
plot(college$AdmitRate, college$Cost, 
     main = "Admission Rate vs College Cost",
     xlab = "Admission Rate", 
     ylab = "College Cost",
     pch = 19, # Solid circle for points
     col = rgb(0, 0, 1, 0.5), # Blue with some transparency
     cex = 0.6)  # Point size

Question 9. What is the standard deviation of the admission rates?

# Calculate the standard deviation of the Admission Rate
sd(college$AdmitRate, na.rm = TRUE)
## [1] 0.208179
# Create a frequency histogram for Admission Rates (showing the actual counts)
hist(college$AdmitRate, 
     main = "Frequency of Admission Rates", 
     xlab = "Admission Rate", 
     col = "lightblue", 
     border = "black",
     breaks = 15,
     freq = TRUE)  # This makes it a frequency plot instead of density
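
For reference, `sd()` computes the sample standard deviation, which divides the sum of squared deviations by n - 1. The sketch below checks this by hand on a made-up vector `x`; the values are illustrative only.

```r
# Illustrative admission rates (made-up values, no missing entries)
x <- c(0.90, 0.92, 0.81, 0.98, 0.53)
n <- length(x)

# Sample standard deviation: square root of squared deviations over (n - 1)
sd_manual  <- sqrt(sum((x - mean(x))^2) / (n - 1))
sd_builtin <- sd(x)

print(sd_manual)
print(sd_builtin)
```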

Question 10. What is the correlation between faculty salary and instructional spending per student?

cor(college$FacSalary, college$InstructFTE, use = "complete.obs")
## [1] 0.4998836
# Scatter plot of Faculty Salary vs Instructional Spending per FTE Student
plot(college$FacSalary, college$InstructFTE, 
     main = "Faculty Salary vs Instructional Spending",
     xlab = "Faculty Salary (Monthly)",
     ylab = "Instructional Spending per FTE Student",
     col = "blue", 
     pch = 16)  # pch=16 gives solid circles

3. Summary

This analysis explored a small set of questions about US four-year colleges using the CollegeScores4yr dataset. We examined admission rates, SAT scores, college costs, faculty salaries, and instructional spending, and looked at the relationships between several of these factors. From the calculations and plots above, we draw the following conclusions.

Admission Rate and SAT Scores

  • The average admission rate for US four-year colleges is 67%, with a standard deviation of about 0.21.
  • The histogram shows that the majority of colleges have an admission rate over 50%.
  • The average SAT score among these colleges is 1135.25.
  • Private colleges have a slightly higher mean SAT score (1145.84) than public colleges (1118.91).
  • SAT scores are mostly concentrated in the 1000 - 1200 range.
  • According to our pie chart, 10.7% of colleges score below 1000, 64.8% score between 1000 and 1200, and 24.5% score above 1200.

College Costs

  • The average cost of attending a US four-year college is $34,277.31, and total cost and net price are strongly positively correlated (r = 0.80). Colleges with a higher total cost therefore tend to have a higher net price even after financial aid: aid lowers what students pay, but it does not offset the difference in cost between colleges. This agrees with the common-sense expectation that students at higher-cost colleges pay more out of pocket.
  • The correlation between admission rate and college cost is weak and negative (r = -0.30). More expensive colleges tend to have somewhat lower admission rates, but the relationship is not strong, so cost alone says little about how selective a college is. Because most US four-year colleges have high admission rates, as our earlier analysis of admission rates showed, most of the data points sit toward the right of the plot. It is also worth noting that a few very expensive colleges have very low admission rates. This is consistent with the small group of elite schools that cost more while being highly selective in their admissions.

Faculty Salaries and Instructional Spending

  • The correlation coefficient is 0.50, which indicates a moderate positive correlation between faculty salary and instructional spending per student. Schools with higher faculty salaries tend to have somewhat higher instructional spending. However, the correlation is not very strong, so other factors not directly tied to faculty salary likely influence how schools allocate their instructional budgets. Most of the data points also cluster low on the graph, which indicates that many schools spend relatively little on instruction per student regardless of whether faculty salary is high or low.
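
One hedged way to compare the strength of the three correlations reported above is to square them. For a straight-line relationship between two variables, the squared correlation is the share of variation in one variable accounted for by the other. This step goes slightly beyond the descriptive statistics used in the analysis and is offered only as a sketch; the names in `r_values` are labels introduced here.

```r
# Correlation coefficients reported in the analysis above
r_values <- c(cost_vs_netprice     = 0.8021118,
              admitrate_vs_cost    = -0.3036798,
              salary_vs_instruct   = 0.4998836)

# Squared correlations: share of variation accounted for by a linear fit
r_squared <- r_values^2
print(round(r_squared, 2))
```

Squaring removes the sign, so this comparison speaks only to strength and not to direction.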

Although we relied on a limited set of descriptive statistical techniques, careful formulation of questions and the use of R enabled us to gain useful insights into US four-year colleges. From summarizing SAT scores, admission rates, and costs to examining the relationship between faculty salaries and instructional spending, this project has shown how even simple statistical tools can provide a clearer view of complex educational trends. As we continue to explore data, asking the right questions and applying suitable analytical methods will guide our future inquiries.

References

Lock5Data. (n.d.). CollegeScores4yr dataset. Retrieved from Lock5Data.

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Retrieved from https://www.r-project.org.