INTRODUCTION

For our first project in STAT 353, Statistical Methods I for Engineering, we’re tasked with analyzing a dataset provided by the U.S. Department of Education’s College Scorecard project. The dataset includes a variety of attributes for colleges that participated in Title IV. However, the version we’re using is a smaller, truncated dataset that only includes four-year schools, with 2012 entries and 37 attributes. For the project, we need to come up with two sets of ten questions. The first set should be our own, and the second set should be generated with the help of AI. Afterward, we’ll compare both sets of questions, pick ten of them, and analyze them using the descriptive statistical methods we’ve learned so far.

LOADING CollegeScores4yr DATASET

college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
##                                  Name State     ID Main
## 1            Alabama A & M University    AL 100654    1
## 2 University of Alabama at Birmingham    AL 100663    1
## 3                  Amridge University    AL 100690    1
## 4 University of Alabama in Huntsville    AL 100706    1
## 5            Alabama State University    AL 100724    1
## 6           The University of Alabama    AL 100751    1
##                                                                Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
##   MainDegree HighDegree Control    Region Locale Latitude Longitude AdmitRate
## 1          3          4  Public Southeast   City 34.78337 -86.56850    0.9027
## 2          3          4  Public Southeast   City 33.50570 -86.79935    0.9181
## 3          3          4 Private Southeast   City 32.36261 -86.17401        NA
## 4          3          4  Public Southeast   City 34.72456 -86.64045    0.8123
## 5          3          4  Public Southeast   City 32.36432 -86.29568    0.9787
## 6          3          4  Public Southeast   City 33.21187 -87.54598    0.5330
##   MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1     18    929      0       4824   2.5  90.7      0.9   0.2   5.6      6.6
## 2     25   1195      0      12866  57.8  25.9      3.3   5.9   7.1     25.2
## 3     NA     NA      1        322   7.1  14.3      0.6   0.3  77.6     54.4
## 4     28   1322      0       6917  74.2  10.7      4.6   4.0   6.5     15.0
## 5     18    935      0       4189   1.5  93.8      1.0   0.3   3.5      7.7
## 6     28   1278      0      32387  78.5  10.1      4.7   1.2   5.6      7.9
##   NetPrice  Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1    15184 22886      9857     18236       9227        7298      6983
## 2    17535 24129      8328     19032      11612       17235     10640
## 3     9649 15080      6900      6900      14738        5265      3866
## 4    19986 22108     10280     21480       8727        9748      9391
## 5    12874 19413     11068     19396       9003        7983      7399
## 6    21973 28836     10780     28100      13574       10894     10016
##   FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1        71.3 71.0    23.96 1068   56.4     36.6      23.6
## 2        89.9 35.3    52.92 3755   63.9     34.1      34.5
## 3       100.0 74.2    18.18  109   64.9     51.3      15.0
## 4        64.6 27.7    48.62 1347   47.6     31.0      44.8
## 5        54.2 73.8    27.69 1294   61.3     34.3      22.1
## 6        74.0 18.0    67.87 6430   61.5     22.6      66.7

Based on my understanding, 10 simple questions from diverse perspectives are listed below to inquiry.

  1. What is the average cost of tuition for each state?
  2. What is the relationship between proportion of female and first-generation students?
  3. What is the average ratio of NetPrice to faculty salary?
  4. What proportion of undergraduate enrollment have an average SAT score higher than 1400?
  5. How common is Asian ethnicity for each locale?
  6. How does a student’s location affect the percentage of students who receive a Pell Grant?
  7. Is there a positive correlation between tuition costs and student debt?
  8. What is the relationship between partTime and Hispanic students?
  9. What is the average cost of online programs for each locale? 10.How are median family income related to the percent of students who are black?

AI GENERATED QUESTIONS

  1. Affordability & Accessibility: What is the relationship between Net Price (NetPrice) and Percentage of First-Generation Students (FirstGen)?

  2. Financial Burden: What is the median Average Debt (Debt) for students graduating from Public vs. Private vs. For-Profit institutions?

  3. Institutional Spending: Is there a correlation between Instructional Spending and Completion Rate (CompRate)?

  4. Geographical Trends: What is the distribution of Faculty Salaries (FacSalary) across different Regions?

  5. Student Demographics: How does the Percentage of First-Generation Students (FirstGen) compare across different Locales (City, Suburb, Rural, Town)?

  6. Selectivity & Student Outcomes: Do schools with higher standardized test scores have better graduation rates?

  7. Debt & Earnings: Do schools with higher-paid faculty have students with lower or higher debt levels?

  8. Enrollment Patterns: What is the distribution of Undergraduate Enrollment (Enrollment) across different Highest Degree Offered (HighDegree) categories?

  9. Financial Aid & Affordability: How does Net Price (NetPrice) compare across different school types (Public, Private, For-Profit) for students from the lowest income bracket?

  10. Academic Performance: How does Admission Rate (AdmitRate) vary by Region?

QUESTIONS FOR ANALYSIS

  1. What is the average cost of tuition for each state?

  2. Is there a positive correlation between tuition costs and student debt?

  3. How does a student’s location affect the percentage of students who receive a Pell Grant?

  4. What is the average ratio of tuition to faculty salary?

  5. How common is Asian ethnicity for each locale?

  6. Is there a correlation between Instructional Spending and Completion Rate (CompRate)?

  7. Do schools with higher standardized test scores have better graduation rates?

  8. What is the relationship between Net Price (NetPrice) and Percentage of First-Generation Students (FirstGen)

  9. Do schools with higher-paid faculty have students with lower or higher debt levels?

10.What is the distribution of Undergraduate Enrollment (Enrollment) across different Highest Degree Offered (HighDegree) categories?

GETTING THE DATA & PACKAGES

Below are all the necessary packages needed in R to run our program successfully.

install.packages("dplyr")  # Install if not installed
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(dplyr)  # Load the dplyr package
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
cleaned_data <- college 
head(cleaned_data)
##                                  Name State     ID Main
## 1            Alabama A & M University    AL 100654    1
## 2 University of Alabama at Birmingham    AL 100663    1
## 3                  Amridge University    AL 100690    1
## 4 University of Alabama in Huntsville    AL 100706    1
## 5            Alabama State University    AL 100724    1
## 6           The University of Alabama    AL 100751    1
##                                                                Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
##   MainDegree HighDegree Control    Region Locale Latitude Longitude AdmitRate
## 1          3          4  Public Southeast   City 34.78337 -86.56850    0.9027
## 2          3          4  Public Southeast   City 33.50570 -86.79935    0.9181
## 3          3          4 Private Southeast   City 32.36261 -86.17401        NA
## 4          3          4  Public Southeast   City 34.72456 -86.64045    0.8123
## 5          3          4  Public Southeast   City 32.36432 -86.29568    0.9787
## 6          3          4  Public Southeast   City 33.21187 -87.54598    0.5330
##   MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1     18    929      0       4824   2.5  90.7      0.9   0.2   5.6      6.6
## 2     25   1195      0      12866  57.8  25.9      3.3   5.9   7.1     25.2
## 3     NA     NA      1        322   7.1  14.3      0.6   0.3  77.6     54.4
## 4     28   1322      0       6917  74.2  10.7      4.6   4.0   6.5     15.0
## 5     18    935      0       4189   1.5  93.8      1.0   0.3   3.5      7.7
## 6     28   1278      0      32387  78.5  10.1      4.7   1.2   5.6      7.9
##   NetPrice  Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1    15184 22886      9857     18236       9227        7298      6983
## 2    17535 24129      8328     19032      11612       17235     10640
## 3     9649 15080      6900      6900      14738        5265      3866
## 4    19986 22108     10280     21480       8727        9748      9391
## 5    12874 19413     11068     19396       9003        7983      7399
## 6    21973 28836     10780     28100      13574       10894     10016
##   FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1        71.3 71.0    23.96 1068   56.4     36.6      23.6
## 2        89.9 35.3    52.92 3755   63.9     34.1      34.5
## 3       100.0 74.2    18.18  109   64.9     51.3      15.0
## 4        64.6 27.7    48.62 1347   47.6     31.0      44.8
## 5        54.2 73.8    27.69 1294   61.3     34.3      22.1
## 6        74.0 18.0    67.87 6430   61.5     22.6      66.7
install.packages("ggplot2")  # Install ggplot2 (if not installed)
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(ggplot2)  # Load the ggplot2 package

ANALYSIS OF 10 QUESTIONS

We will explore the chosen questions for analysis in detail below:

Q1 What is the average cost of tuition for each state?

mean(college$Cost, na.rm = TRUE)
## [1] 34277.31

Q2: Is there a positive correlation between tuition costs and student debt?

cor(college$Cost, college$Debt, use = "complete.obs")
## [1] -0.2144525
ggplot(cleaned_data, aes(x = Cost, y = Debt)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Tuition Costs vs. Student Debt",
       x = "Tuition Costs",
       y = "Student Debt") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 261 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 261 rows containing missing values or values outside the scale range
## (`geom_point()`).

Based on the scatterplot and the computed correlation coefficient of -0.2144525, there is a weak negative correlation between tuition costs and student debt. The scatterplot shows a wide spread of data points, with no strong upward or downward trend, but the slight downward slope of the red regression line suggests that as tuition costs increase, student debt tends to decrease slightly. However, since the correlation is close to zero, the relationship is weak and not substantial. This indicates that higher tuition costs do not necessarily lead to higher student debt, possibly due to factors such as financial aid, scholarships, or differences in borrowing habits among students attending more expensive institutions. While intuition might suggest that students at higher-cost institutions accumulate more debt, this data suggests that they may have alternative funding sources that offset their costs. Ultimately, the weak negative correlation implies that tuition cost alone is not a strong predictor of student debt levels.

Q3 How does a student’s location affect the percentage of students who receive a Pell Grant?

# Removing rows with missing values in 'Pell' and 'Locale'
# Grouping by location and calculating the mean percentage of Pell recipients
location_pell <- cleaned_data %>%
  group_by(Locale) %>%
  summarize(mean_pell = mean(Pell, na.rm = TRUE))

location_pell
## # A tibble: 4 × 2
##   Locale mean_pell
##   <chr>      <dbl>
## 1 City        38.2
## 2 Rural       43.2
## 3 Suburb      37.2
## 4 Town        36.1
cleaned_data <- college %>%
  filter(!is.na(Pell) & !is.na(Locale))
ggplot(location_pell, aes(x = Locale, y = mean_pell)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Effect of Student Location on Pell Grant Percentage",
       x = "Location Type",
       y = "Mean Percentage of Pell Grant Recipients") +
  theme_minimal()

The bar chart illustrates the relationship between student location and the percentage of students receiving Pell Grants. The data indicates that students from rural areas have the highest percentage of Pell Grant recipients, with an average of 43.23%, followed by students from cities at 38.24%. Suburban and town areas have slightly lower Pell Grant recipient rates, averaging 37.17% and 36.11%, respectively. This suggests that students from rural areas are more likely to qualify for Pell Grants, which are typically awarded based on financial need. The trend may be attributed to the economic conditions in rural communities, where household incomes tend to be lower compared to urban and suburban areas. In contrast, students from suburban and town locations may have relatively higher family incomes, resulting in fewer Pell Grant recipients. Overall, while location appears to have some impact on Pell Grant distribution, other factors, such as tuition costs and financial aid availability, may also influence the observed trends.

Q4 What is the average ratio of tuition to faculty salary?

# Calculate the average ratio of tuition to faculty salary
average_ratio <- college %>%
  summarise(avg_ratio = mean(TuitionIn / FacSalary, na.rm = TRUE))

# Show the result
average_ratio
##   avg_ratio
## 1  3.211055

The average ratio of tuition to faculty salary is approximately 3.21, meaning that, on average, tuition costs are 3.21 times higher than the faculty salary at these institutions. This suggests that a significant portion of tuition revenue is allocated to areas other than direct faculty compensation, such as administrative costs, campus facilities, and student services. While faculty salaries are an important investment in education quality, this ratio highlights the broader financial structure of higher education institutions.

Q5 How common is Asian ethnicity for each locale?

# Group by Locale and calculate average Asian percentage
asian_by_locale <- college %>%
  group_by(Locale) %>%
  summarize(Avg_Asian_Pct = mean(Asian, na.rm = TRUE))

# Print the summary table
asian_by_locale
## # A tibble: 4 × 2
##   Locale Avg_Asian_Pct
##   <chr>          <dbl>
## 1 City            5.48
## 2 Rural           2.21
## 3 Suburb          4.58
## 4 Town            1.97
# Create a bar plot of Asian percentage by locale
ggplot(asian_by_locale, aes(x = reorder(Locale, -Avg_Asian_Pct), y = Avg_Asian_Pct, fill = Locale)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Percentage of Asian Students by Locale",
       x = "Locale",
       y = "Average Asian Student Percentage") +
  theme_minimal()

The data indicates that the percentage of Asian students varies significantly by locale, with the highest concentration in cities (5.48%) and the lowest in towns (1.97%). Suburban areas (4.58%) also have a relatively high proportion of Asian students, though slightly lower than cities. In contrast, rural areas (2.21%) have a significantly lower representation, similar to towns. This trend suggests that Asian students are more likely to attend institutions in urban and suburban areas, potentially due to larger immigrant communities, better educational opportunities, and access to resources in these could be related to smaller Asian populations in those areas or fewer higher education institutions catering to diverse student demographics.

Q6 Is there a correlation between Instructional Spending and Completion Rate(CompRate)?

# Removing rows with missing values in 'InstructFTE' and 'CompRate'
cleaned_data <- college %>%
  filter(!is.na(InstructFTE) & !is.na(CompRate))
correlation <- cor(cleaned_data$InstructFTE, cleaned_data$CompRate)
correlation
## [1] 0.4894415
ggplot(cleaned_data, aes(x = InstructFTE, y = CompRate)) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") +
  labs(title = "Instructional Spending vs. Completion Rate",
       x = "Instructional Spending per FTE",
       y = "Completion Rate (%)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The correlation coefficient of 0.489 suggests a moderate positive relationship between instructional spending per FTE and completion rate. This means that, generally, as instructional spending increases, completion rates also tend to rise. However, the correlation is not extremely strong, indicating that other factors likely influence completion rates as well. The scatterplot visually supports this, showing a trend where higher spending is associated with higher completion rates, though there are still many variations in the data.

Q7 Do schools with higher standardized test scores have better graduation rates?

# Assuming the dataset is named `college_data`
# Filtering out missing values for a cleaner analysis
clean_data <- college %>%
  filter(!is.na(AvgSAT), !is.na(CompRate))

# Correlation between SAT scores and graduation rates
correlation <- cor(clean_data$AvgSAT, clean_data$CompRate, use = "complete.obs")

# Scatter plot
ggplot(clean_data, aes(x = AvgSAT, y = CompRate)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", col = "red") +
  labs(title = "Graduation Rate vs. Average SAT Score",
       x = "Average SAT Score",
       y = "Graduation Rate") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The scatterplot and trendline suggest a strong positive correlation between average SAT scores and graduation rates. This means that schools with higher SAT scores generally have higher graduation rates. The data points follow an upward trend, indicating that students at institutions with higher standardized test scores are more likely to graduate. However, while the correlation is strong, it doesn’t necessarily mean that SAT scores directly cause higher graduation rates—other factors like institutional resources and student support could also play a role.

Q8 What is the relationship between Net Price (NetPrice) and Percentage of First-Generation Students (FirstGen)?

# Calculate the correlation between NetPrice and FirstGen
cor(college$NetPrice, college$FirstGen, use = "complete.obs")
## [1] -0.4270689
ggplot(college, aes(x = NetPrice, y = FirstGen)) +
  geom_point(alpha = 0.6) +  # Scatter plot points with transparency
  geom_smooth(method = "lm", col = "blue") +  # Regression line
  labs(title = "Net Price vs. Percentage of First-Generation Students",
       x = "Net Price (Cost Minus Aid)",
       y = "Percentage of First-Generation Students") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 315 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 315 rows containing missing values or values outside the scale range
## (`geom_point()`).

The correlation value of -0.43 suggests a moderate negative relationship between net price and the percentage of first-generation students. This means that as the net price of a college increases, the proportion of first-generation students tends to decrease. The scatterplot also supports this, showing a downward trend. This pattern makes sense because higher costs can be a barrier for first-generation students, who may have fewer financial resources or support systems to afford expensive schools. However, while the correlation is noticeable, other factors like financial aid availability and institutional policies could also influence this trend.

Q9 Do schools with higher-paid faculty have students with lower or higher debt levels?

# Calculate the correlation between FacSalary and Debt
cor(college$FacSalary, college$Debt, use = "complete.obs")
## [1] 0.1707668

The correlation value of 0.17 suggests a weak positive relationship between faculty salary and student debt. This means that, on average, schools that pay their faculty more tend to have students with slightly higher debt levels. However, the relationship is not very strong, so other factors likely play a bigger role in determining student debt.

Q10 What is the distribution of Undergraduate Enrollment (Enrollment) across different Highest Degree Offered (HighDegree) categories?

ggplot(college, aes(x = as.factor(HighDegree), y = Enrollment)) +
  geom_boxplot() +  
  labs(title = "Distribution of Undergraduate Enrollment by Highest Degree Offered",
       x = "Highest Degree Offered",
       y = "Undergraduate Enrollment") +
  theme_minimal()
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).

The boxplot shows the distribution of undergraduate enrollment across different levels of the highest degree offered by colleges. From the graph, it’s clear that schools offering a bachelor’s degree (category 4) tend to have significantly higher undergraduate enrollment compared to institutions offering lower degrees. The median enrollment for these schools is much higher, and they also have a wider spread of values, with many outliers representing large universities with tens of thousands of students. In contrast, insttutions offering associate degrees or lower tend to have much smaller enrollments, with relatively compact distributions and fewer extreme values. This suggests that colleges granting higher degrees generally attract more students, likely due to their broader academic offerings and resources.

CONCLUSION

This report brought out a lot of surprising insights about four-year colleges in the U.S. One of the most unexpected findings was that tuition and student debt aren’t as closely linked as you’d think. In fact, the data shows a negative correlation between the two, which goes against common assumptions. Overall, working on this report was a great way to practice using R for descriptive statistics, data analysis, and creating visualizations.

REFERENCES

Lock, R., Lock, P., Morgan, K., Lock, E., & Lock, D. (2020). CollegeScores4yr [Dataset]. https://www.lock5stat.com/datapage3e.html Lock, R., Lock, P., Morgan, K., Lock, E., & Lock, D. (2020). Dataset documentaton for the third edition of “Statistics: UnLocking the Power of Data.” Wiley. https://www.lock5stat.com/datasets3e/Lock5DataGuide3e.pdf OpenAI. (2025, April 1). College Data Analysis Questions, ChatGPT [Large language model]. https://chatgpt.com/c/67ec7767-95fc-8007-aae8-677f6a1b63e4

APPENDIX

Description Information on all US colleges and universities that primarily granting bachelor’s degrees, collected by the Department of Education for the College Scoreboard Format A data frame with 2012 observations on the following 37 variables. Name Name of the school State State where school is located ID ID number for school Main Main campus? (1=yes, 0=branch campus) Accred Accreditation agency MainDegree Predominant undergrad degree (3=bachelors) HighDegree Highest degree (0=no degrees, 1=certificate, 2=associate, 3=bachelors, 4= graduate) Control Control of school (Private, Profit, Public) Region Region of country (Midwest, Northeast, Southeast, Territory, West) Locale Locale (City, Rural, Suburb, Town) Latitude Latitude Longitude Longitude AdmitRate Admission rate CommuteAtlanta 21 MidACT Median of ACT scores AvgSAT Average combined SAT scores Online Only online (distance) programs Enrollment Undergraduate enrollment White Percent of undergraduates who report being white Black Percent of undergraduates who report being black Hispanic Percent of undergraduates who report being Hispanic Asian Percent of undergraduates who report being Asian Other Percent of undergraduates who don’t report one of the above PartTime Percent of undergraduates who are part-time students NetPrice Average net price (cost minus aid) Cost Average total cost for tuition, room, board, etc. TuitionIn In-state tuition and fees TuitonOut Out-of-state tuition and fees TuitionFTE Net Tuition revenue per FTE student InstructFTE Instructional spending per FTE student FacSalary Average monthly salary for full-time faculty FullTimeFac Percent of faculty that are full-time Pell Percent of students receiving Pell grants CompRate Completion rate (percent who finish program within 150% of normal time) Debt Average debt for students who complete program Female Percent of female students FirstGen Percent of first-generation students MedIncome Median family income (in $1,000)