For our first project in STAT 353, Statistical Methods I for Engineering, we’re tasked with analyzing a dataset provided by the U.S. Department of Education’s College Scorecard project. The dataset includes a variety of attributes for colleges that participated in Title IV. However, the version we’re using is a smaller, truncated dataset that only includes four-year schools, with 2012 entries and 37 attributes. For the project, we need to come up with two sets of ten questions. The first set should be our own, and the second set should be generated with the help of AI. Afterward, we’ll compare both sets of questions, pick ten of them, and analyze them using the descriptive statistical methods we’ve learned so far.
college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
Based on my understanding, 10 simple questions from diverse perspectives are listed below to inquiry.
Affordability & Accessibility: What is the relationship between Net Price (NetPrice) and Percentage of First-Generation Students (FirstGen)?
Financial Burden: What is the median Average Debt (Debt) for students graduating from Public vs. Private vs. For-Profit institutions?
Institutional Spending: Is there a correlation between Instructional Spending and Completion Rate (CompRate)?
Geographical Trends: What is the distribution of Faculty Salaries (FacSalary) across different Regions?
Student Demographics: How does the Percentage of First-Generation Students (FirstGen) compare across different Locales (City, Suburb, Rural, Town)?
Selectivity & Student Outcomes: Do schools with higher standardized test scores have better graduation rates?
Debt & Earnings: Do schools with higher-paid faculty have students with lower or higher debt levels?
Enrollment Patterns: What is the distribution of Undergraduate Enrollment (Enrollment) across different Highest Degree Offered (HighDegree) categories?
Financial Aid & Affordability: How does Net Price (NetPrice) compare across different school types (Public, Private, For-Profit) for students from the lowest income bracket?
Academic Performance: How does Admission Rate (AdmitRate) vary by Region?
What is the average cost of tuition for each state?
Is there a positive correlation between tuition costs and student debt?
How does a student’s location affect the percentage of students who receive a Pell Grant?
What is the average ratio of tuition to faculty salary?
How common is Asian ethnicity for each locale?
Is there a correlation between Instructional Spending and Completion Rate (CompRate)?
Do schools with higher standardized test scores have better graduation rates?
What is the relationship between Net Price (NetPrice) and Percentage of First-Generation Students (FirstGen)
Do schools with higher-paid faculty have students with lower or higher debt levels?
10.What is the distribution of Undergraduate Enrollment (Enrollment) across different Highest Degree Offered (HighDegree) categories?
Below are all the necessary packages needed in R to run our program successfully.
install.packages("dplyr") # Install if not installed
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(dplyr) # Load the dplyr package
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
cleaned_data <- college
head(cleaned_data)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
install.packages("ggplot2") # Install ggplot2 (if not installed)
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(ggplot2) # Load the ggplot2 package
We will explore the chosen questions for analysis in detail below:
mean(college$Cost, na.rm = TRUE)
## [1] 34277.31
cor(college$Cost, college$Debt, use = "complete.obs")
## [1] -0.2144525
ggplot(cleaned_data, aes(x = Cost, y = Debt)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", color = "red") +
labs(title = "Tuition Costs vs. Student Debt",
x = "Tuition Costs",
y = "Student Debt") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 261 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 261 rows containing missing values or values outside the scale range
## (`geom_point()`).
Based on the scatterplot and the computed correlation coefficient of -0.2144525, there is a weak negative correlation between tuition costs and student debt. The scatterplot shows a wide spread of data points, with no strong upward or downward trend, but the slight downward slope of the red regression line suggests that as tuition costs increase, student debt tends to decrease slightly. However, since the correlation is close to zero, the relationship is weak and not substantial. This indicates that higher tuition costs do not necessarily lead to higher student debt, possibly due to factors such as financial aid, scholarships, or differences in borrowing habits among students attending more expensive institutions. While intuition might suggest that students at higher-cost institutions accumulate more debt, this data suggests that they may have alternative funding sources that offset their costs. Ultimately, the weak negative correlation implies that tuition cost alone is not a strong predictor of student debt levels.
# Removing rows with missing values in 'Pell' and 'Locale'
# Grouping by location and calculating the mean percentage of Pell recipients
location_pell <- cleaned_data %>%
group_by(Locale) %>%
summarize(mean_pell = mean(Pell, na.rm = TRUE))
location_pell
## # A tibble: 4 × 2
## Locale mean_pell
## <chr> <dbl>
## 1 City 38.2
## 2 Rural 43.2
## 3 Suburb 37.2
## 4 Town 36.1
cleaned_data <- college %>%
filter(!is.na(Pell) & !is.na(Locale))
ggplot(location_pell, aes(x = Locale, y = mean_pell)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Effect of Student Location on Pell Grant Percentage",
x = "Location Type",
y = "Mean Percentage of Pell Grant Recipients") +
theme_minimal()
The bar chart illustrates the relationship between student location and the percentage of students receiving Pell Grants. The data indicates that students from rural areas have the highest percentage of Pell Grant recipients, with an average of 43.23%, followed by students from cities at 38.24%. Suburban and town areas have slightly lower Pell Grant recipient rates, averaging 37.17% and 36.11%, respectively. This suggests that students from rural areas are more likely to qualify for Pell Grants, which are typically awarded based on financial need. The trend may be attributed to the economic conditions in rural communities, where household incomes tend to be lower compared to urban and suburban areas. In contrast, students from suburban and town locations may have relatively higher family incomes, resulting in fewer Pell Grant recipients. Overall, while location appears to have some impact on Pell Grant distribution, other factors, such as tuition costs and financial aid availability, may also influence the observed trends.
# Calculate the average ratio of tuition to faculty salary
average_ratio <- college %>%
summarise(avg_ratio = mean(TuitionIn / FacSalary, na.rm = TRUE))
# Show the result
average_ratio
## avg_ratio
## 1 3.211055
The average ratio of tuition to faculty salary is approximately 3.21, meaning that, on average, tuition costs are 3.21 times higher than the faculty salary at these institutions. This suggests that a significant portion of tuition revenue is allocated to areas other than direct faculty compensation, such as administrative costs, campus facilities, and student services. While faculty salaries are an important investment in education quality, this ratio highlights the broader financial structure of higher education institutions.
# Group by Locale and calculate average Asian percentage
asian_by_locale <- college %>%
group_by(Locale) %>%
summarize(Avg_Asian_Pct = mean(Asian, na.rm = TRUE))
# Print the summary table
asian_by_locale
## # A tibble: 4 × 2
## Locale Avg_Asian_Pct
## <chr> <dbl>
## 1 City 5.48
## 2 Rural 2.21
## 3 Suburb 4.58
## 4 Town 1.97
# Create a bar plot of Asian percentage by locale
ggplot(asian_by_locale, aes(x = reorder(Locale, -Avg_Asian_Pct), y = Avg_Asian_Pct, fill = Locale)) +
geom_bar(stat = "identity") +
labs(title = "Average Percentage of Asian Students by Locale",
x = "Locale",
y = "Average Asian Student Percentage") +
theme_minimal()
The data indicates that the percentage of Asian students varies significantly by locale, with the highest concentration in cities (5.48%) and the lowest in towns (1.97%). Suburban areas (4.58%) also have a relatively high proportion of Asian students, though slightly lower than cities. In contrast, rural areas (2.21%) have a significantly lower representation, similar to towns. This trend suggests that Asian students are more likely to attend institutions in urban and suburban areas, potentially due to larger immigrant communities, better educational opportunities, and access to resources in these could be related to smaller Asian populations in those areas or fewer higher education institutions catering to diverse student demographics.
# Removing rows with missing values in 'InstructFTE' and 'CompRate'
cleaned_data <- college %>%
filter(!is.na(InstructFTE) & !is.na(CompRate))
correlation <- cor(cleaned_data$InstructFTE, cleaned_data$CompRate)
correlation
## [1] 0.4894415
ggplot(cleaned_data, aes(x = InstructFTE, y = CompRate)) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
labs(title = "Instructional Spending vs. Completion Rate",
x = "Instructional Spending per FTE",
y = "Completion Rate (%)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The correlation coefficient of 0.489 suggests a moderate positive relationship between instructional spending per FTE and completion rate. This means that, generally, as instructional spending increases, completion rates also tend to rise. However, the correlation is not extremely strong, indicating that other factors likely influence completion rates as well. The scatterplot visually supports this, showing a trend where higher spending is associated with higher completion rates, though there are still many variations in the data.
# Assuming the dataset is named `college_data`
# Filtering out missing values for a cleaner analysis
clean_data <- college %>%
filter(!is.na(AvgSAT), !is.na(CompRate))
# Correlation between SAT scores and graduation rates
correlation <- cor(clean_data$AvgSAT, clean_data$CompRate, use = "complete.obs")
# Scatter plot
ggplot(clean_data, aes(x = AvgSAT, y = CompRate)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", col = "red") +
labs(title = "Graduation Rate vs. Average SAT Score",
x = "Average SAT Score",
y = "Graduation Rate") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatterplot and trendline suggest a strong positive correlation between average SAT scores and graduation rates. This means that schools with higher SAT scores generally have higher graduation rates. The data points follow an upward trend, indicating that students at institutions with higher standardized test scores are more likely to graduate. However, while the correlation is strong, it doesn’t necessarily mean that SAT scores directly cause higher graduation rates—other factors like institutional resources and student support could also play a role.
# Calculate the correlation between NetPrice and FirstGen
cor(college$NetPrice, college$FirstGen, use = "complete.obs")
## [1] -0.4270689
ggplot(college, aes(x = NetPrice, y = FirstGen)) +
geom_point(alpha = 0.6) + # Scatter plot points with transparency
geom_smooth(method = "lm", col = "blue") + # Regression line
labs(title = "Net Price vs. Percentage of First-Generation Students",
x = "Net Price (Cost Minus Aid)",
y = "Percentage of First-Generation Students") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 315 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 315 rows containing missing values or values outside the scale range
## (`geom_point()`).
The correlation value of -0.43 suggests a moderate negative relationship between net price and the percentage of first-generation students. This means that as the net price of a college increases, the proportion of first-generation students tends to decrease. The scatterplot also supports this, showing a downward trend. This pattern makes sense because higher costs can be a barrier for first-generation students, who may have fewer financial resources or support systems to afford expensive schools. However, while the correlation is noticeable, other factors like financial aid availability and institutional policies could also influence this trend.
# Calculate the correlation between FacSalary and Debt
cor(college$FacSalary, college$Debt, use = "complete.obs")
## [1] 0.1707668
The correlation value of 0.17 suggests a weak positive relationship between faculty salary and student debt. This means that, on average, schools that pay their faculty more tend to have students with slightly higher debt levels. However, the relationship is not very strong, so other factors likely play a bigger role in determining student debt.
ggplot(college, aes(x = as.factor(HighDegree), y = Enrollment)) +
geom_boxplot() +
labs(title = "Distribution of Undergraduate Enrollment by Highest Degree Offered",
x = "Highest Degree Offered",
y = "Undergraduate Enrollment") +
theme_minimal()
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
The boxplot shows the distribution of undergraduate enrollment across different levels of the highest degree offered by colleges. From the graph, it’s clear that schools offering a bachelor’s degree (category 4) tend to have significantly higher undergraduate enrollment compared to institutions offering lower degrees. The median enrollment for these schools is much higher, and they also have a wider spread of values, with many outliers representing large universities with tens of thousands of students. In contrast, insttutions offering associate degrees or lower tend to have much smaller enrollments, with relatively compact distributions and fewer extreme values. This suggests that colleges granting higher degrees generally attract more students, likely due to their broader academic offerings and resources.
This report brought out a lot of surprising insights about four-year colleges in the U.S. One of the most unexpected findings was that tuition and student debt aren’t as closely linked as you’d think. In fact, the data shows a negative correlation between the two, which goes against common assumptions. Overall, working on this report was a great way to practice using R for descriptive statistics, data analysis, and creating visualizations.
Lock, R., Lock, P., Morgan, K., Lock, E., & Lock, D. (2020). CollegeScores4yr [Dataset]. https://www.lock5stat.com/datapage3e.html Lock, R., Lock, P., Morgan, K., Lock, E., & Lock, D. (2020). Dataset documentaton for the third edition of “Statistics: UnLocking the Power of Data.” Wiley. https://www.lock5stat.com/datasets3e/Lock5DataGuide3e.pdf OpenAI. (2025, April 1). College Data Analysis Questions, ChatGPT [Large language model]. https://chatgpt.com/c/67ec7767-95fc-8007-aae8-677f6a1b63e4
Description Information on all US colleges and universities that primarily granting bachelor’s degrees, collected by the Department of Education for the College Scoreboard Format A data frame with 2012 observations on the following 37 variables. Name Name of the school State State where school is located ID ID number for school Main Main campus? (1=yes, 0=branch campus) Accred Accreditation agency MainDegree Predominant undergrad degree (3=bachelors) HighDegree Highest degree (0=no degrees, 1=certificate, 2=associate, 3=bachelors, 4= graduate) Control Control of school (Private, Profit, Public) Region Region of country (Midwest, Northeast, Southeast, Territory, West) Locale Locale (City, Rural, Suburb, Town) Latitude Latitude Longitude Longitude AdmitRate Admission rate CommuteAtlanta 21 MidACT Median of ACT scores AvgSAT Average combined SAT scores Online Only online (distance) programs Enrollment Undergraduate enrollment White Percent of undergraduates who report being white Black Percent of undergraduates who report being black Hispanic Percent of undergraduates who report being Hispanic Asian Percent of undergraduates who report being Asian Other Percent of undergraduates who don’t report one of the above PartTime Percent of undergraduates who are part-time students NetPrice Average net price (cost minus aid) Cost Average total cost for tuition, room, board, etc. TuitionIn In-state tuition and fees TuitonOut Out-of-state tuition and fees TuitionFTE Net Tuition revenue per FTE student InstructFTE Instructional spending per FTE student FacSalary Average monthly salary for full-time faculty FullTimeFac Percent of faculty that are full-time Pell Percent of students receiving Pell grants CompRate Completion rate (percent who finish program within 150% of normal time) Debt Average debt for students who complete program Female Percent of female students FirstGen Percent of first-generation students MedIncome Median family income (in $1,000)