The purpose of this report is to analyze the CollegeScores4yr dataset, which contains various metrics from higher education institutions across the United States. These metrics include admission rates, completion rates, student debt, faculty salaries, student demographics, and SAT/ACT scores, among others. By examining these variables, we aim to uncover relationships and trends that provide insight into how different types of schools (public vs. private) and regions (West, Northeast, Midwest, Southeast, and Territories) compare in terms of educational outcomes, faculty compensation, and student financial burdens.
In this analysis, a series of statistical techniques and visualizations will be employed to explore questions related to faculty salary differences, student debt, completion rates, and more. Various types of plots, including scatter plots, box plots, and histograms, will illustrate these findings. The report will also discuss key descriptive statistics, such as means, medians, and percentiles, to provide a comprehensive overview of the dataset. The ultimate goal is to derive meaningful insights that can help in understanding the factors that influence outcomes in higher education.
We begin by describing how the data was collected. The CollegeScores4yr
dataset was sourced from https://www.lock5stat.com/datapage3e.html. It includes
various metrics from public and private institutions, such as admission
rates, completion rates, faculty salaries, and student demographics. The
sample covers a diverse range of schools across different regions, which
makes it suitable for the comparisons in this report.
Once the data was collected, ten questions were posed and analyzed using R. The first question was “What is the relationship between admission rates and completion rates across different colleges?” We analyze this question by computing the correlation coefficient and building a data frame of the two variables, which we then plot as a scatter plot. The second question was “How does the average debt of students vary by region and type of school control (public, private, for-profit)?” To analyze this question, we created a scatter plot of average student debt by region, with different colors representing public and private institutions. The third question was “Do colleges with a lower percentage of part-time students have lower average completion rates?” To analyze this question, we calculated the correlation coefficient between the percentage of part-time students and their completion rates, and drew a scatter plot to visualize the result. The fourth question was “How do faculty salaries and the percentage of full-time faculty correlate with the average instructional spending per student?” We explore this using box plots of FacSalary by Region and Control. The fifth question was “What are the mean and median of the completion rate among colleges?” These are computed with simple R commands. The sixth question was “What are the variance, standard deviation, and 95th percentile of faculty salaries?” These are calculated and then interpreted. The seventh question was “What is the distribution of the average SAT scores among colleges?” This question is addressed with a histogram of average SAT scores.
The eighth question was “What is the racial/ethnic distribution of students across regions, and how does it compare to the national average?” This question is explored using a histogram of the race/ethnicity variables. The ninth question was “What is the common relationship between the averages of SAT/ACT scores and the percentage of students who receive Pell Grants?” A scatter plot using AvgSAT, MidACT, and Pell explores this. The final question was “How do faculty salaries compare between different types of schools (private & public)?” For this question, we use side-by-side box plots of FacSalary by Control to compare salaries between school types.
We begin with our first question, “What is the relationship between admission rates and completion rates across different colleges?” We started by computing the correlation between the two variables. In our R code, blank or non-numeric entries were excluded from the calculation so that the coefficient rests on complete observations only. The resulting correlation coefficient is -0.3482341, which indicates a moderate negative correlation: as admission rates increase, completion rates tend to decrease. In other words, colleges with more selective admission policies (lower admission rates) tend to have higher completion rates. While the correlation is not strong, it provides some evidence that selectivity in admissions is associated with better student outcomes in terms of completion rates. A scatter plot of the two variables was also created.
##   AdmitRate CompRate
## 1    0.9027    23.96
## 2    0.9181    52.92
## 3        NA    18.18
## 4    0.8123    48.62
## 5    0.9787    27.69
## 6    0.5330    67.87
The scatter plot shows the relationship between admission rates (AdmitRate) and completion rates (CompRate) across various colleges. Each point represents a college, with the x-axis depicting the admission rate and the y-axis showing the completion rate. There is a visible trend where higher admission rates are associated with lower completion rates, suggesting an inverse relationship. This aligns with the correlation coefficient of -0.348, indicating a moderate negative correlation between the two variables. While there is a lot of variability, it can be seen that colleges with lower admission rates (indicating higher selectivity) tend to have higher completion rates, and those with higher admission rates generally have more spread-out and lower completion rates. This plot reinforces the finding that selectivity in admissions could be related to higher success in terms of student completion.
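As a minimal sketch of the calculation described above, the following R code computes a Pearson correlation while skipping incomplete rows. It uses only the six rows printed in the output above, not the full dataset, so its result will not match the reported -0.348. The object names (`admit_rate`, `comp_rate`, `complete`, `r_manual`) are illustrative assumptions rather than names from the original analysis.

```r
# Six rows printed in the report output (not the full dataset)
admit_rate <- c(0.9027, 0.9181, NA, 0.8123, 0.9787, 0.5330)
comp_rate  <- c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)

# Keep only rows where both values are present
complete   <- !is.na(admit_rate) & !is.na(comp_rate)
n_complete <- sum(complete)

# Pearson correlation using R's built-in casewise deletion
r_builtin <- cor(admit_rate, comp_rate, use = "complete.obs")

# The same coefficient written out from its definition
x <- admit_rate[complete]
y <- comp_rate[complete]
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

print(n_complete)
print(r_builtin)
```

Writing the coefficient out by hand makes it clear that `use = "complete.obs"` simply drops any row with a missing value before the usual formula is applied.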
Our next question asks how the average debt of students varies by region and type of school control (public, private, for-profit). We analyze this question using a scatter plot, which is shown below.
## # A tibble: 6 × 3
## # Groups:   Region [2]
##   Region    Control AvgDebt
##   <chr>     <chr>     <dbl>
## 1 Midwest   Private    634.
## 2 Midwest   Profit    9041.
## 3 Midwest   Public    3147.
## 4 Northeast Private    932.
## 5 Northeast Profit    6830.
## 6 Northeast Public    4956.
The scatter plot displays average student debt by region, categorized by school control. The x-axis shows the U.S. regions (Midwest, Northeast, Southeast, Territory, and West), and the y-axis shows average student debt. Each point is the average debt for one type of school in one region, with public schools marked in blue and private schools in red. The summary table also contains a third category, Profit (for-profit schools), which the plotting code does not assign a color.
The summary table shows a clear distinction between types of control. In the Midwest and Northeast rows shown above, private schools have the lowest average debt (634 and 932 USD), public schools fall in the middle (3,147 and 4,956 USD), and for-profit schools have the highest (9,041 and 6,830 USD). The highest averages in the plot appear in the West and Midwest, where debt can surpass 9,000 USD; in the Midwest, the table attributes that value to for-profit schools rather than private ones.
Regional differences also affect debt levels. Among the rows shown, for-profit debt is higher in the Midwest than in the Northeast, while public and private debt are both higher in the Northeast. This suggests that, in these regions, the heaviest financial burden falls on students at for-profit institutions, so comparisons between public and private schools should keep the for-profit category separate.
We continue our analysis with our third question, “Do colleges with a lower percentage of part-time students have lower average completion rates?” We analyzed this using the same correlation method as in question 1. We started by removing any non-numerical data. From there, we calculated the correlation, which came to -0.4190961. This indicates a moderate inverse relationship: as the percentage of part-time students at an institution increases, completion rates tend to decrease. In other words, schools with a higher proportion of part-time students are more likely to have lower completion rates.
This finding aligns with the assumption that part-time students often face challenges that full-time students may not, such as balancing work or family responsibilities with their studies. These additional responsibilities could extend the time needed to complete their degrees or lead to higher dropout rates. The negative correlation implies that institutions with a larger part-time student population might need to offer more support or flexibility to help these students complete their programs successfully. A scatter plot of the two variables lets us check visually that the pattern in the graph agrees with the correlation.
##   PartTime CompRate
## 1      6.6    23.96
## 2     25.2    52.92
## 3     54.4    18.18
## 4     15.0    48.62
## 5      7.7    27.69
## 6      7.9    67.87
The scatter plot visualizes the relationship between the percentage of part-time students (x-axis) and the completion rates (y-axis) at various institutions. As shown in the graph, there is a noticeable trend where schools with a lower percentage of part-time students tend to have higher completion rates. As the percentage of part-time students increases, the completion rates become more spread out, with a greater number of institutions showing lower completion rates. This supports the earlier finding of a negative correlation (-0.4190961), indicating that a higher percentage of part-time students is associated with lower completion rates. The clustering at lower part-time percentages suggests that many institutions have relatively few part-time students, while those with higher percentages tend to have more variable outcomes in completion rates.
Question 4 explores using box plots to compare faculty salaries by region and type of college. Box plots are a useful tool in statistics because they display the spread, skewness, and distribution of data through quartiles. These box plots reveal significant insights into faculty salary behavior across different regions and types of control (Private, Profit, Public). After eliminating 54 rows of insufficient or non-numerical data, we generated these box plots, which highlight clear distinctions in salary distributions.
## [1] 6983 10640 3866 9391 7399 10016
From the plots, we observe that Public institutions tend to offer higher faculty salaries in most regions, particularly in the West and Southeast. Private institutions generally exhibit higher salary ranges compared to for-profit colleges, which display a lower and tighter spread across most regions. Notably, there is a consistent trend of faculty at for-profit institutions earning lower salaries across all regions, while public institutions, especially in the Northeast and West, show larger interquartile ranges, indicating more variability in their salaries. These insights are crucial for understanding regional and control-based disparities in faculty compensation.
Question 5 asked for the mean and median of the completion rate among colleges. The mean and median are useful because they describe the central tendency of a dataset. The mean is the average of all the values in the dataset. Its downside is that it is sensitive to outliers, so a few extreme values can pull it away from the bulk of the data. The median is the middle value of the sorted data, which makes it useful when the dataset contains outliers because it is not affected by extreme values. After removing any missing or non-numerical values from the completion rate data, we obtained a mean of 52.1352358 and a median of 52.45. These two values being so close together suggests that the distribution of completion rates is roughly symmetrical, with no significant skew in either direction. The similarity between the mean and median also indicates that there are likely no extreme outliers significantly affecting the data. This suggests that completion rates are distributed in a balanced manner around the center, so either statistic gives a good representation of the typical completion rate in the dataset.
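The point about outliers can be illustrated with a short sketch. It uses only the six completion rates printed earlier in this report, and the value 999 is a hypothetical miscoded entry introduced purely for illustration; the object names are assumptions as well.

```r
# Six completion rates printed earlier in the report (not the full dataset)
comp_rate <- c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)

mean_rate   <- mean(comp_rate, na.rm = TRUE)
median_rate <- median(comp_rate, na.rm = TRUE)

# Hypothetical data-entry error: replace the largest value with 999
miscoded <- comp_rate
miscoded[which.max(miscoded)] <- 999

# How far each statistic moves because of the single bad value
mean_shift   <- mean(miscoded) - mean_rate
median_shift <- median(miscoded) - median_rate

print(c(mean = mean_rate, median = median_rate))
print(c(mean_shift = mean_shift, median_shift = median_shift))
```

In this small example the median does not move at all, because the two middle values are unchanged, while the mean shifts substantially.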
Question 6 used R to find the variance, standard deviation, range, and 95th percentile of faculty salaries. These measures provide a deeper understanding of the distribution and variability in compensation. The variance shows how widely salaries spread around the average, while the standard deviation expresses this spread in the same units as the salaries, which makes it easier to interpret. Together they help assess whether salaries are fairly consistent or vary significantly. The 95th percentile highlights the top-earning faculty members, offering insight into the higher end of the salary distribution.
After data cleaning, we obtained a variance of 6.568988 × 10^6, a standard deviation of 2563.0037098, a range from 56 to 22,924 (2.2924 × 10^4), and a 95th percentile of 11,971 (1.1971 × 10^4). The variance reveals that salaries are dispersed quite broadly around the mean, which suggests significant variability in compensation across different institutions or regions. The standard deviation of about 2,563 is the square root of the variance and offers a more digestible figure for how much salaries typically deviate from the average. The range, from 56 to 22,924, shows the extent between the lowest and highest salaries. Such a large range might reflect differences in faculty salaries based on factors like location, institution type, or seniority.
The 95th percentile value of 11,971 (approximately $11,971 per month) is the salary that separates the top-earning 5% of faculty members from the rest, providing insight into the compensation structure at the higher end. These measures of distribution, variability, and extremes are useful for institutional leaders and policymakers. By identifying disparities, they can address potential equity issues and make more informed decisions about recruitment, retention, budget planning, and compensation. Institutions can also use this data to allocate resources so that salaries remain competitive and equitable enough to attract and retain top talent.
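The relationship between these measures can be sketched in a few lines of R. The example below uses only the six faculty salary values printed earlier in this report, so its results differ from the full-dataset figures; the object names are illustrative assumptions. The last line checks that the reported standard deviation is consistent with the reported variance.

```r
# Six faculty salaries printed earlier in the report (not the full dataset)
fac_salary <- c(6983, 10640, 3866, 9391, 7399, 10016)

salary_var   <- var(fac_salary, na.rm = TRUE)            # sample variance
salary_sd    <- sd(fac_salary, na.rm = TRUE)             # square root of the variance
salary_range <- range(fac_salary, na.rm = TRUE)          # c(minimum, maximum)
salary_p95   <- quantile(fac_salary, 0.95, na.rm = TRUE) # 95th percentile

print(salary_var)
print(salary_sd)
print(salary_range)
print(salary_p95)

# Consistency check on the values reported in the text:
# the square root of 6.568988e6 should be close to 2563.0037098
reported_sd_check <- sqrt(6.568988e6)
print(reported_sd_check)
```

Using names such as `salary_sd` instead of `sd` avoids reusing the names of the built-in functions, which keeps later calls easier to read.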
Question 7 investigated the distribution of average SAT scores. This can provide useful insight into the typical academic preparation of students across colleges. Schools and colleges could use this information to shape curricula that better align with students' learning needs. To explore this, we checked our dataset and then created a histogram, shown below, of the distribution of average SAT scores.
## [1] 929 1195 NA 1322 935 1278
From this data, it is clear that the majority of average SAT scores fall within the range of 1000 to 1200, with the highest concentration around 1100 to 1200. The distribution appears roughly normal, with fewer colleges at the lower (below 800) and higher (above 1400) ends of the spectrum. This suggests that most colleges have average scores near the middle of the range, with few reporting extremely low or extremely high averages. Understanding this distribution helps schools tailor their programs, addressing the majority of students who fall into this average scoring range while also creating opportunities for those at the extremes.
Question 8 asked “What is the racial/ethnic distribution of students across regions, and how does it compare to the national average?” This data is relevant because the racial/ethnic distribution of students across regions, compared to the national average, reveals disparities in educational access and representation. It can guide policy decisions such as resource allocation, curriculum development, and diversity initiatives. By understanding regional differences, schools and policymakers can address educational inequities, ensure inclusive environments, and tailor support services to meet the needs of underrepresented groups, ultimately promoting equity and better outcomes across all student populations. A histogram covering the White, Black, Hispanic, Asian, and Other categories, in that order, was created to represent the data graphically. Data cleaning was also conducted for this question to eliminate missing values and ensure an accurate depiction.
## [1] 2.5 57.8 7.1 74.2 1.5 78.5
## [1] 90.7 25.9 14.3 10.7 93.8 10.1
## [1] 0.9 3.3 0.6 4.6 1.0 4.7
## [1] 0.2 5.9 0.3 4.0 0.3 1.2
Based on the graph, we can see the distribution of students by race/ethnicity, with the largest group being White, followed by smaller groups in the following order: Black, Hispanic, Asian, and “Other.” The White student population has a significantly higher count, while the counts for the other groups are much lower, indicating a clear disparity in representation across these racial/ethnic categories. This suggests that White students make up the majority in this dataset, with minority groups (Black, Hispanic, Asian, and “Other”) being much less represented. This could be due to a variety of factors, such as financial access to college, access to college in general, and the overall population being predominantly White. According to 2020 U.S. Census data (http://www.census.gov/library/stories/2021/08/2020-united-states-population-more-racially-ethnically-diverse-than-2010.html), White (non-Hispanic) people make up about 57.8% of the U.S. population, compared with 12.1% of the population reported as Black or African American. This could be a major contributing factor to the pattern in our data. Nonetheless, if we compare our data with figures from https://nces.ed.gov/, we can see that it roughly aligns with the national averages of 44% White, 29% Hispanic, 15% Black, 5% Asian, and 6% other total enrollment.
Question 9 explored the question, “What is the common relationship between the averages of SAT/ACT scores and the percentage of students who receive Pell Grants?” For this, another scatter plot was generated to examine the data graphically and attempt to answer the question.
## # A tibble: 6 × 3
##   AvgSAT MidACT  Pell
##    <dbl>  <dbl> <dbl>
## 1    929     18  71
## 2   1195     25  35.3
## 3     NA     NA  74.2
## 4   1322     28  27.7
## 5    935     18  73.8
## 6   1278     28  18
The graph illustrates a clear inverse relationship between SAT/ACT scores and the percentage of Pell Grant recipients. As SAT and ACT scores increase, the percentage of Pell Grant recipients generally decreases. This suggests that students with lower standardized test scores are more likely to qualify for Pell Grants, which are typically awarded to students from lower-income families. The color gradient on the plot indicates that schools with a higher proportion of Pell Grant recipients (shown in red) tend to have lower average SAT/ACT scores, while schools with fewer Pell Grant recipients (shown in blue) have higher test scores.
This relationship may reflect socioeconomic factors, as students from lower-income backgrounds often have less access to resources like test preparation, contributing to lower standardized test scores. It underscores the financial barriers many students face in pursuing higher education and the importance of Pell Grants in supporting these students.
The final question was “How do faculty salaries compare between different types of schools (private & public)?” A similar question was addressed above, but this one specifically compares faculty salaries between private colleges and public colleges. Once data cleaning was conducted, side-by-side box plots were created and examined to gain a better understanding and attempt to answer our research question.
## # A tibble: 6 × 2
##   FacSalary Control
##       <dbl> <chr>
## 1      6983 Public
## 2     10640 Public
## 3      3866 Private
## 4      9391 Public
## 5      7399 Public
## 6     10016 Public
The comparison of faculty salaries across different types of schools (private, public, and for-profit) reveals distinct trends. Public school faculty generally earn higher salaries, with a higher median salary compared to both private and for-profit institutions. Public schools also exhibit more consistency in salary distribution, although there is some variability, as indicated by the broader range. In contrast, private school faculty salaries show significant variation, with more extreme outliers at the higher end, suggesting that while the median salary is slightly lower than public schools, some faculty members earn much higher salaries. For-profit schools demonstrate the lowest median salary with less variation and fewer outliers compared to both public and private schools. These trends highlight the differences in compensation structures across school types, with public institutions generally offering higher and more consistent salaries.
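A box plot comparison can be complemented by computing the group medians directly. The sketch below is an illustration only: it uses the six rows printed above, which contain five public schools and a single private school, so it cannot support any conclusion about school types. The data frame name `salaries` and the result names are assumptions.

```r
# Six rows printed in the report output (not the full dataset)
salaries <- data.frame(
  FacSalary = c(6983, 10640, 3866, 9391, 7399, 10016),
  Control   = c("Public", "Public", "Private", "Public", "Public", "Public")
)

# Median and interquartile range of faculty salary within each control type
group_medians <- tapply(salaries$FacSalary, salaries$Control, median, na.rm = TRUE)
group_iqr     <- tapply(salaries$FacSalary, salaries$Control, IQR, na.rm = TRUE)

print(group_medians)
print(group_iqr)
```

The medians correspond to the center line of each box in a box plot, and the interquartile range corresponds to the height of the box.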
The analysis of the CollegeScores4yr dataset provided several insights into higher education institutions across the United States. Key findings included an inverse relationship between admission and completion rates, where more selective schools (with lower admission rates) generally had higher completion rates. We also observed significant regional and institutional differences in student debt, with average debt varying considerably by region and by type of school control. Faculty salaries similarly showed disparities, with public institutions offering higher median salaries compared to private and for-profit institutions. Additionally, the racial and ethnic distribution of students revealed disparities in representation, aligning closely with national demographic trends. Finally, the relationship between SAT/ACT scores and the percentage of Pell Grant recipients highlighted the financial barriers that students from lower-income backgrounds face, as evidenced by the correlation between lower test scores and higher Pell Grant eligibility. These findings are crucial for informing policy decisions, promoting equity, and improving educational outcomes across diverse institutions.
# Load the data
library(readxl)
CollegeScores4yr <- read_excel("CollegeScores4yr.xlsx")
View(CollegeScores4yr)

# Question 1: admission rate vs. completion rate
AdmitRate <- CollegeScores4yr$AdmitRate
CompRate <- CollegeScores4yr$CompRate
D <- data.frame(AdmitRate, CompRate)
# Correlation over rows where both values are present
cor(AdmitRate, CompRate, use = "complete.obs")
plot(D)

# Question 2: average debt by region and control
install.packages("ggplot2")  # only needed once
install.packages("dplyr")    # only needed once
library(ggplot2)
library(dplyr)
avg_debt <- CollegeScores4yr %>%
  group_by(Region, Control) %>%
  summarise(AvgDebt = mean(Debt, na.rm = TRUE))
ggplot(avg_debt, aes(x = Region, y = AvgDebt, color = Control)) +
  geom_point(size = 3) +
  labs(title = "Average Student Debt by School Type and Region",
       x = "Region", y = "Average Debt") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  # "Profit" has no entry here, so those points fall back to the scale's NA color
  scale_color_manual(values = c("Public" = "blue", "Private" = "red"))

# Question 3: part-time share vs. completion rate
PartTime <- CollegeScores4yr$PartTime
CompRate <- CollegeScores4yr$CompRate
E <- data.frame(PartTime, CompRate)
cor(PartTime, CompRate, use = "complete.obs")
plot(E)

# Question 4: faculty salary by region and control
ggplot(CollegeScores4yr, aes(x = Control, y = FacSalary, fill = Control)) +
  geom_boxplot() +
  facet_wrap(~ Region) +
  labs(title = "Faculty Salary by Region and Control",
       x = "School Type (Control)", y = "Faculty Salary") +
  theme_minimal()

# Question 5: mean and median completion rate
mean_CompletionRate <- mean(CompRate, na.rm = TRUE)
print(mean_CompletionRate)
median_CompletionRate <- median(CompRate, na.rm = TRUE)
print(median_CompletionRate)
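
# Question 6: spread of faculty salaries
# (result names chosen so they do not reuse the built-in names sd and range)
salary_variance <- var(CollegeScores4yr$FacSalary, na.rm = TRUE)
print(salary_variance)
salary_sd <- sd(CollegeScores4yr$FacSalary, na.rm = TRUE)
print(salary_sd)
salary_range <- range(CollegeScores4yr$FacSalary, na.rm = TRUE)
print(salary_range)
ninety_fifth_percentile <- quantile(CollegeScores4yr$FacSalary, 0.95, na.rm = TRUE)
print(ninety_fifth_percentile)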

# Question 7: distribution of average SAT scores
hist(CollegeScores4yr$AvgSAT,
     main = "Distribution of Average SAT Scores",
     col = "blue", xlab = "Average SAT Score", ylab = "Count")

# Question 8: racial/ethnic distribution
# Stack the five race/ethnicity columns into one long data frame
racial_data <- data.frame(
  Race = factor(
    c(rep("White", length(CollegeScores4yr$White)),
      rep("Black", length(CollegeScores4yr$Black)),
      rep("Hispanic", length(CollegeScores4yr$Hispanic)),
      rep("Asian", length(CollegeScores4yr$Asian)),
      rep("Other", length(CollegeScores4yr$Other))),
    levels = c("White", "Black", "Hispanic", "Asian", "Other")),
  Count = c(CollegeScores4yr$White, CollegeScores4yr$Black,
            CollegeScores4yr$Hispanic, CollegeScores4yr$Asian,
            CollegeScores4yr$Other)
)
# Note: hist() bins the pooled numeric values in Count; it does not group by Race
hist(racial_data$Count, breaks = length(unique(racial_data$Race)),
     main = "Distribution of Racial/Ethnic Students",
     col = "blue", xlab = "Race/Ethnicity", ylab = "Count")

# Question 9: SAT/ACT scores vs. share of Pell Grant recipients
ggplot(CollegeScores4yr, aes(x = AvgSAT, y = MidACT, color = Pell)) +
  geom_point(alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Relationship between SAT/ACT Scores and Pell Grant Recipients",
       x = "Average SAT Score", y = "Midpoint ACT Score",
       color = "Percentage of Pell Grant Recipients") +
  theme_minimal()

# Question 10: faculty salary by school type
ggplot(CollegeScores4yr, aes(x = Control, y = FacSalary, fill = Control)) +
  geom_boxplot() +
  labs(title = "Comparison of Faculty Salaries by School Type (Private vs Public)",
       x = "School Type", y = "Faculty Salary") +
  theme_minimal() +
  scale_fill_manual(values = c("Private" = "lightblue", "Public" = "lightgreen"))