1. Introduction

I used the data from the CollegeScores4yr dataset, which provides detailed information on U.S. colleges and universities that primarily grant bachelor’s degrees. The data, collected by the U.S. Department of Education through the College Scorecard project, includes variables such as tuition, admission rates, student demographics, faculty salaries, and completion rates.

To guide my analysis, I developed 10 research questions through a combination of my own inquiries and AI-generated suggestions. I first proposed questions based on my understanding of the data, then used ChatGPT to generate additional questions, and finally selected the most relevant combination to explore key patterns in college characteristics, costs, and student outcomes.

I proposed the following 10 questions based on my own understanding of the data:

- 1. What is the mean in-state tuition (TuitionIn) for four-year colleges in the dataset?
- 2. What is the sample variance of the total cost (Cost) among all institutions?
- 3. What are the quantiles of student debt (Debt) for graduates across the dataset?
- 4. What are the percentiles of median family income (MedIncome) for the listed colleges?
- 5. Create a stem-and-leaf plot to display the distribution of completion rate (CompRate).
- 6. What is the mean monthly salary for full-time faculty (FacSalary)?
- 7. Create a histogram showing the distribution of average net price (NetPrice) across colleges.
- 8. Create a boxplot to visualize the spread and outliers of total cost (Cost).
- 9. Create a boxplot to display the distribution of the percentage of female students (Female).
- 10. Create a histogram showing the distribution of first-generation students (FirstGen) across institutions.

These are questions generated by ChatGPT:

- 1. What is the median admission rate (AdmitRate) across all four-year colleges?
- 2. What is the standard deviation of the average SAT score (AvgSAT) among all institutions?
- 3. What is the mean percentage of part-time students (PartTime) in Midwest vs. West regions?
- 4. Compare median instructional spending (InstructFTE) between public and private colleges using a boxplot.
- 5. What is the correlation between tuition cost (Cost) and average debt (Debt)?
- 6. Create a boxplot to compare completion rates (CompRate) across different types of school control (Control).
- 7. What is the average completion rate (CompRate) among colleges where more than 50% of students receive Pell grants?
- 8. Calculate the correlation between in-state tuition (TuitionIn) and median family income (MedIncome).
- 9. What is the correlation between median family income (MedIncome) and net price (NetPrice)?
- 10. Create a scatter diagram to display the relationship between average total cost (Cost) and completion rate (CompRate).

2. Analysis

Here I will explore the questions in detail. I first load the dataset and preview its first six rows.

college <- read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
##                                  Name State     ID Main
## 1            Alabama A & M University    AL 100654    1
## 2 University of Alabama at Birmingham    AL 100663    1
## 3                  Amridge University    AL 100690    1
## 4 University of Alabama in Huntsville    AL 100706    1
## 5            Alabama State University    AL 100724    1
## 6           The University of Alabama    AL 100751    1
##                                                                Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
##   MainDegree HighDegree Control    Region Locale Latitude Longitude AdmitRate
## 1          3          4  Public Southeast   City 34.78337 -86.56850    0.9027
## 2          3          4  Public Southeast   City 33.50570 -86.79935    0.9181
## 3          3          4 Private Southeast   City 32.36261 -86.17401        NA
## 4          3          4  Public Southeast   City 34.72456 -86.64045    0.8123
## 5          3          4  Public Southeast   City 32.36432 -86.29568    0.9787
## 6          3          4  Public Southeast   City 33.21187 -87.54598    0.5330
##   MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1     18    929      0       4824   2.5  90.7      0.9   0.2   5.6      6.6
## 2     25   1195      0      12866  57.8  25.9      3.3   5.9   7.1     25.2
## 3     NA     NA      1        322   7.1  14.3      0.6   0.3  77.6     54.4
## 4     28   1322      0       6917  74.2  10.7      4.6   4.0   6.5     15.0
## 5     18    935      0       4189   1.5  93.8      1.0   0.3   3.5      7.7
## 6     28   1278      0      32387  78.5  10.1      4.7   1.2   5.6      7.9
##   NetPrice  Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1    15184 22886      9857     18236       9227        7298      6983
## 2    17535 24129      8328     19032      11612       17235     10640
## 3     9649 15080      6900      6900      14738        5265      3866
## 4    19986 22108     10280     21480       8727        9748      9391
## 5    12874 19413     11068     19396       9003        7983      7399
## 6    21973 28836     10780     28100      13574       10894     10016
##   FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1        71.3 71.0    23.96 1068   56.4     36.6      23.6
## 2        89.9 35.3    52.92 3755   63.9     34.1      34.5
## 3       100.0 74.2    18.18  109   64.9     51.3      15.0
## 4        64.6 27.7    48.62 1347   47.6     31.0      44.8
## 5        54.2 73.8    27.69 1294   61.3     34.3      22.1
## 6        74.0 18.0    67.87 6430   61.5     22.6      66.7

I selected 10 questions from the combination of my own questions and those generated by ChatGPT.

Q1. What is the mean in-state tuition (TuitionIn) for four-year colleges in the dataset?

mean(college$TuitionIn, na.rm= TRUE)
## [1] 21948.55

The mean in-state tuition is $21,948.55.

Q2. What is the sample variance of the total cost (Cost) among all institutions?

var(college$Cost, na.rm= TRUE)
## [1] 233433900

The sample variance of total cost is 233,433,900 (in squared dollars), which corresponds to a standard deviation of about $15,279.
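Because the variance is in squared dollars, it can help to report the standard deviation next to it. The following is a minimal sketch, assuming the six Cost values printed by head(college) above as toy data. The name cost_head is hypothetical, and the numbers describe only those six rows, not the full dataset.

```r
# Toy data: the six Cost values shown in the head(college) output above
cost_head <- c(22886, 24129, 15080, 22108, 19413, 28836)

cost_var <- var(cost_head)  # sample variance (divides by n - 1), in squared dollars
cost_sd  <- sd(cost_head)   # sample standard deviation, in dollars

# The standard deviation is the square root of the variance
all.equal(cost_sd, sqrt(cost_var))
```

On the full data, the same relationship would hold for var(college$Cost, na.rm = TRUE) and sd(college$Cost, na.rm = TRUE).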

Q3: What are the quantiles of student debt (Debt) for graduates across the dataset?

quantile(college$Debt, 0.95, na.rm= TRUE)
##   95% 
## 10243

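The 95th percentile of student debt (Debt) is 10,243, meaning that 95% of the institutions in the dataset have a Debt value at or below this amount. This is a single quantile; the code above does not compute the quartiles.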
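Since Q3 asks for quantiles in general, several probabilities can be passed to quantile() in one call. This is a minimal sketch, assuming the six Debt values printed by head(college) above as toy data. The names debt_head and debt_q are hypothetical, and the results apply only to those six rows.

```r
# Toy data: the six Debt values shown in the head(college) output above
debt_head <- c(1068, 3755, 109, 1347, 1294, 6430)

# Quartiles plus the 95th percentile in a single call
debt_q <- quantile(debt_head, probs = c(0.25, 0.50, 0.75, 0.95), na.rm = TRUE)
debt_q
```

For the full dataset, debt_head would be replaced by college$Debt.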

Q4: Create a boxplot to compare completion rates (CompRate) across different types of school control (Control).

boxplot(CompRate ~ Control, data = college,
        main = "Completion Rate by School Control",
        xlab = "School Control",
        ylab = "Completion Rate (%)",
        col = c("lightblue", "lightgreen", "lightpink"))

Private schools show the highest median completion rate at approximately 55%, followed by public schools at around 48%, while for-profit schools have the lowest median completion rate at roughly 25%.
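Medians read off a boxplot are approximate, so it can be useful to compute them per group. This is a minimal sketch, assuming the six rows printed by head(college) above as toy data. The names toy and med_by_control are hypothetical, and these six rows contain no for-profit schools, so the output does not reproduce the figures quoted above.

```r
# Toy data: Control and CompRate for the six rows shown in head(college)
toy <- data.frame(
  Control  = c("Public", "Public", "Private", "Public", "Public", "Public"),
  CompRate = c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)
)

# Median completion rate within each level of Control
med_by_control <- tapply(toy$CompRate, toy$Control, median, na.rm = TRUE)
med_by_control
```

Applying the same tapply() call to college$CompRate and college$Control would give the exact medians behind the boxplot.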

Q5: Create a stem-and-leaf plot to display the distribution of completion rate (CompRate).

stem(college$CompRate)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    0 | 000000000000000000000003345566677778889999
##    1 | 00011111111111222333333444444444444455555555666666666677777777888888+1
##    2 | 00000000000001111111111111222222222222222333333333334444444444444445+78
##    3 | 00000000000000000111111111111111111222222222222222222222222222333333+131
##    4 | 00000000000000000000000000000001111111111111111111111111111111111222+241
##    5 | 00000000000000000000000000000000000000000000001111111111111111111112+276
##    6 | 00000000000000000000000000000000001111111111111111111111112222222222+218
##    7 | 00000000000000000000111111111111111111111111222222222222222222222233+99
##    8 | 00000000000000000111111111111111222222222222333333333333333333344444+40
##    9 | 0000000011111111111112222222222233333334444444445555555555666677
##   10 | 000000000000000

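Because the decimal point is one digit to the right of the bar, each stem represents tens of percentage points (for example, 5 | 2 is a completion rate of about 52%). The stem-and-leaf plot shows that most completion rates are between 20% and 80%, with fewer institutions having very low or very high rates. Overall, the distribution is unimodal and roughly symmetric, with the largest number of institutions in the 50s.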
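The stem output truncates long rows (the "+276" suffixes), so a table of counts per ten-point band is one way to compare stems. This is a minimal sketch, assuming the six CompRate values printed by head(college) above as toy data. The names comp_head, decade, and decade_counts are hypothetical, and stem() may place values near a stem boundary differently, so the counts should be treated as approximate.

```r
# Toy data: the six CompRate values shown in the head(college) output above
comp_head <- c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)

# Assign each value to a ten-point band: [0,10), [10,20), ..., [90,100]
decade <- cut(comp_head, breaks = seq(0, 100, by = 10),
              right = FALSE, include.lowest = TRUE)

# Number of values per band, analogous to the number of leaves per stem
decade_counts <- table(decade)
decade_counts
```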

Q6: Create a histogram showing the distribution of first-generation students (FirstGen) across institutions.

hist(college$FirstGen,
     main= "Histogram of first-generation students",
     col= "orange",
     xlab="FirstGen"
     )

The distribution of first-generation students across institutions is approximately normal, with the majority of institutions having between 20% and 50% first-generation students, and a peak frequency around 35-40%.
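The bin counts behind a histogram can be extracted without drawing it, which makes it easier to check where the peak is. This is a minimal sketch, assuming the six FirstGen values printed by head(college) above as toy data and ten-point bins chosen for illustration. The names firstgen_head, h, and peak_mid are hypothetical.

```r
# Toy data: the six FirstGen values shown in the head(college) output above
firstgen_head <- c(36.6, 34.1, 51.3, 31.0, 34.3, 22.6)

# Compute the histogram bins without plotting
h <- hist(firstgen_head, breaks = seq(0, 100, by = 10), plot = FALSE)
h$counts                                  # number of values in each bin

# Midpoint of the bin with the highest count
peak_mid <- h$mids[which.max(h$counts)]
peak_mid
```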

Q7: What is the correlation between median family income (MedIncome) and net price (NetPrice)?

cor(college$MedIncome, college$NetPrice, use= "complete.obs")
## [1] 0.5151298

The correlation between median family income and net price is about 0.515, which indicates a moderate positive linear association.
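The value returned by cor() is the Pearson correlation, that is, the covariance divided by the product of the two standard deviations. This is a minimal sketch, assuming the six MedIncome and NetPrice values printed by head(college) above as toy data. The names income, price, r_builtin, and r_manual are hypothetical, and the result for six rows will differ from 0.515.

```r
# Toy data: MedIncome and NetPrice for the six rows shown in head(college)
income <- c(23.6, 34.5, 15.0, 44.8, 22.1, 66.7)
price  <- c(15184, 17535, 9649, 19986, 12874, 21973)

# Built-in Pearson correlation; "complete.obs" drops rows with missing values
r_builtin <- cor(income, price, use = "complete.obs")

# Same quantity from its definition (no values are missing in this toy data)
r_manual <- cov(income, price) / (sd(income) * sd(price))

all.equal(r_builtin, r_manual)
```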

Q8: What is the median admission rate (AdmitRate) across all four-year colleges?

median(college$AdmitRate, na.rm=TRUE)
## [1] 0.69505

The median admission rate across all four-year colleges with a reported admission rate is 0.69505, or about 69.5%.
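The na.rm = TRUE argument matters here because AdmitRate contains missing values, as row 3 of the head(college) output shows. This is a minimal sketch, assuming the six AdmitRate values from that output as toy data. The names admit_head, med_with_na, and med_admit are hypothetical.

```r
# Toy data: the six AdmitRate values shown in head(college), including one NA
admit_head <- c(0.9027, 0.9181, NA, 0.8123, 0.9787, 0.5330)

med_with_na <- median(admit_head)                # NA: missing values propagate
med_admit   <- median(admit_head, na.rm = TRUE)  # median of the five reported rates

# How many values actually entered the calculation
n_used <- sum(!is.na(admit_head))
c(median = med_admit, n = n_used)
```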

Q9: Create a boxplot to display the distribution of the percentage of female students (Female).

boxplot(college$Female,
        main="Distribution of the percentage of female students",
        ylab="Female students (%)"
        )

The percentage of female students across institutions has a median of approximately 60%, with the middle 50% of institutions ranging from about 55% to 65%.
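The numbers a boxplot draws (whisker ends, hinges, median, and outliers) can be obtained directly with boxplot.stats() rather than read off the figure. This is a minimal sketch, assuming the six Female values printed by head(college) above as toy data. The names female_head and bs are hypothetical, and the results describe only those six rows.

```r
# Toy data: the six Female values shown in the head(college) output above
female_head <- c(56.4, 63.9, 64.9, 47.6, 61.3, 61.5)

bs <- boxplot.stats(female_head)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # points beyond 1.5 * IQR from the hinges (plotted as outliers)
```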

Q10. Create a scatter diagram to display the relationship between average total cost (Cost) and completion rate (CompRate).

plot(college$Cost, college$CompRate,
     main= "Completion Rate vs Total Cost",
     xlab="Total Cost",
     ylab="Completion Rate",
     col="black")

The scatter plot reveals a weak positive relationship between total cost and completion rate, with institutions ranging from approximately $10,000 to $75,000 in total cost and completion rates spanning from near 0% to 100%.
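The strength of a relationship seen in a scatter plot can be quantified with a correlation and a least-squares line. This is a minimal sketch, assuming the six Cost and CompRate values printed by head(college) above as toy data. The names toy, fit, r, and r_squared are hypothetical, and six rows say nothing reliable about the full dataset.

```r
# Toy data: Cost and CompRate for the six rows shown in head(college)
toy <- data.frame(
  Cost     = c(22886, 24129, 15080, 22108, 19413, 28836),
  CompRate = c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)
)

# Least-squares line predicting completion rate from total cost
fit <- lm(CompRate ~ Cost, data = toy)
coef(fit)  # intercept and slope (percentage points per dollar)

# For simple linear regression, R-squared equals the squared correlation
r         <- cor(toy$Cost, toy$CompRate)
r_squared <- summary(fit)$r.squared
all.equal(r^2, r_squared)
```

With the full data, calling abline() on a model fitted the same way, right after the plot() call above, would overlay the line on the scatter plot.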

3. Summary

This analysis of the CollegeScores4yr dataset revealed several important patterns in U.S. four-year colleges:

Cost and Affordability: The mean in-state tuition is approximately $21,949. Total cost varies substantially across institutions (sample variance of 233,433,900 squared dollars, or a standard deviation of about $15,279). For student debt, 95% of institutions have a Debt value at or below $10,243.

Completion Rates: Completion rates vary substantially by institutional type, with private schools achieving the highest median rate (~55%), followed by public schools (~48%), and for-profit institutions showing considerably lower rates (~25%). The relationship between cost and completion rate shows only a weak positive association, suggesting that higher costs do not guarantee better completion outcomes.

Student Demographics: Most institutions have between 20% and 50% first-generation students, with an approximately normal distribution that peaks around 35-40%. The median share of female students is about 60%, and the middle 50% of institutions fall between roughly 55% and 65%.

Key Findings: The moderate positive correlation between median family income and net price (r = 0.515) suggests that institutions whose students come from higher-income families tend to have higher net prices. However, the weak relationship between cost and completion rates suggests that factors beyond institutional cost, such as student support services, academic preparation, and institutional resources, may play more important roles in student success, although this analysis did not examine them.

These findings highlight the complexity of higher education outcomes and suggest that prospective students should consider multiple factors beyond cost when evaluating institutions.