The Task (Introduction)

For this project, we are to refer to the data “CollegeScores4yr” from https://www.lock5stat.com/datapage3e.html.

From this data, we are to propose ten simple questions from diverse perspectives that can be addressed using Chapter 6 methods. These methods can be mean, median, variance, standard deviation, correlation, histogram, boxplot, barplot, or pie chart. We are not to choose a question that involves more than two variables.

We then need to use ChatGPT to help propose ten more questions, independent of the original ten that we came up with but still restricted to the information learned in Chapter 6.

Finally, we are required to choose a final ten questions, some from our proposed questions and some from ChatGPT’s questions. With these final ten questions, we are to use the relevant descriptive statistics methods that we learned to analyze the data and create a full report.

Question Proposal

Questions proposed by me:

  1. Which states have the highest average ACT score for college students? (and lowest)

  2. Do higher female percentages correlate to higher completion rates?

  3. How does the faculty salary compare to the total cost?

  4. Is there a correlation between student debt and family income?

  5. What percentage of all colleges are public, private, or profit, and what is the average total cost across these types of institutions?

  6. Does the Pell grant have any effect on student debt?

  7. What percentage of colleges have a higher tuition rate for out of state students compared to in state students?

  8. What is the correlation between white students and college locality (town, city, suburb, rural)?

  9. By how much is the admittance rate higher for public than private and profit?

  10. What’s the completion rate difference for public, private, and profit schools?

Questions proposed by ChatGPT:

  1. What is the average tuition cost across different types of institutions (public vs. private)?

  2. How does the average faculty salary correlate with the completion rate across different regions?

  3. Is there a relationship between the percentage of first-generation students and the median family income at each college?

  4. How does the average debt of graduates vary between states?

  5. What is the correlation between the number of full-time faculty and student-to-faculty ratio across different locales (e.g., city, suburban)?

  6. Which colleges (over 500 student enrollment) have the highest and lowest percentage of students receiving Pell grants?

  7. How does the proportion of female students compare across public and private institutions in different states?

  8. What factors most strongly correlate with higher completion rates across colleges (e.g., tuition cost, faculty salary, median income)?

  9. How do instructional expenditures per student differ across colleges with varying completion rates?

  10. Which regions have colleges with the highest average levels of student debt?

Of the above 20 questions, 10 were chosen for data analysis; they are the ones addressed in the Data Analysis section below.

Data Analysis

#1: Which states have the highest average ACT score for college students? (and also lowest)

## # A tibble: 54 × 8
##    State Count MeanACT SdACT RangeACT FirstQuartile MedianACT ThirdQuartile
##    <fct> <int>   <dbl> <dbl>    <dbl>         <dbl>     <dbl>         <dbl>
##  1 DC        5    26.6  6.02       15          25        28            31  
##  2 RI        5    26.6  5.13       14          25        28            28  
##  3 NH        4    26.2  4.5        11          24.8      26            27.5
##  4 MA       46    26.1  4.83       18          22.2      25            30.8
##  5 UT        5    25.8  2.49        6          24        25            26  
##  6 NY       82    25.5  3.56       13          23        24.5          27.8
##  7 WY        1    25   NA           0          25        25            25  
##  8 CA       72    24.7  5.00       29          21        24            29  
##  9 VT        6    24.7  4.97       13          20.8      24.5          27.5
## 10 CT       15    24.6  4.03       13          21.5      23            27  
## # ℹ 44 more rows

This plot and table summarize the ACT scores of college students for each U.S. state and territory, ordered from highest to lowest average score. The box plot shows the same data as the table, except for the mean and standard deviation of the ACT scores. For me, the table is much easier to read than the graph, since it gives exact values for the box plot and highlights the top 10 states. An important statistic to note for each state is the count, which is the number of colleges in that state that reported an average ACT score. This matters because some states have only 1 college reporting an ACT score. One such case is Wyoming, whose single reporting college lands it a spot in the top 10 (and leaves its standard deviation as NA), while every other top 10 state has at least 4.
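The point about the count can be illustrated with a minimal Python sketch. It is not the code used to produce the table above; the state codes and scores are made up, and `summarize_by_state` is a hypothetical helper. It groups scores by state and reports no standard deviation when a state has only one college.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (state, average ACT) pairs; not values from CollegeScores4yr.
rows = [("AA", 24.0), ("AA", 28.0), ("AA", 26.0), ("BB", 25.0)]

def summarize_by_state(rows):
    """Return {state: (count, mean, sd)}; sd is None when count < 2."""
    groups = defaultdict(list)
    for state, act in rows:
        groups[state].append(act)
    summary = {}
    for state, scores in groups.items():
        # The sample standard deviation needs at least two observations.
        sd = stdev(scores) if len(scores) > 1 else None
        summary[state] = (len(scores), mean(scores), sd)
    return summary

summary = summarize_by_state(rows)
# Order states by mean ACT, highest first, as in the table.
ranked = sorted(summary.items(), key=lambda kv: kv[1][1], reverse=True)
for state, (count, avg, sd) in ranked:
    print(state, count, avg, sd)
```

A state with a single college still gets a mean, so it can rank highly on very little evidence, which is why the count column is worth reading alongside the mean.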

#2: How does the faculty salary compare to the total cost?

##   MeanFacSalary SdFacSalary VarSalary MeanCost   SdCost   VarCost
## 1      7465.778    2563.004   6568988 34277.31 15278.54 233433900
## [1] "Correlation between Faculty Salary and Total Cost: 0.424200977940778"

I am not very surprised by the comparison of monthly faculty salary vs. total cost, as I expected these two to go somewhat hand-in-hand. The correlation coefficient of about 0.42 indicates a moderate positive correlation. The data does vary quite a bit, but I imagine that if the values above the 95th or even 99th percentile were removed, we would see a noticeably smaller standard deviation, possibly leading to a stronger correlation. Looking at the graph, there seem to be more outliers above the line than below it, possibly meaning that these outlier colleges put more funding towards facilities than faculty.
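As a rough illustration of how a few extreme values can change a correlation coefficient, here is a minimal Python sketch. The salary and cost values are invented, and `pearson_r` is a hypothetical helper written out by hand; whether the same effect appears in the real data is not checked here.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical monthly salary and total cost values; the last pair is an outlier.
salary = [5000, 6000, 7000, 8000, 20000]
cost = [20000, 26000, 30000, 36000, 25000]

r_all = pearson_r(salary, cost)                 # all five colleges
r_trimmed = pearson_r(salary[:-1], cost[:-1])   # outlier removed
print(r_all, r_trimmed)
```

In this toy data the single extreme salary pulls the coefficient down sharply, and removing it leaves a strong positive correlation.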

#3: Is there a correlation between student debt and family income?

##   MeanDebt   SdDebt VarDebt MeanMedIncome SdMedIncome VarMedIncome
## 1 1918.566 3116.754 9714156      47.80809    22.46382      504.623
## [1] "Correlation between Debt and Family Income: -0.0351028240706128"

The results for student debt and family income surprised me the most out of all of these. Originally, I thought that higher family income would have a moderate or strong negative correlation with student debt, but the correlation coefficient of about -0.04 shows that this is not the case: these two variables have essentially no linear correlation.

#4: What percentage of colleges have a higher tuition rate for out of state students compared to in state students?

##   MeanDifference SdDifference RangeDifference FirstQuartile MedianDifference
## 1       11687.77     5528.126           32620          8041            11244
##   ThirdQuartile
## 1       14682.5

The data for the first graph was created by subtracting the in-state tuition from the out-of-state tuition. The issue with showing this data on a graph is that many colleges do not charge extra for out-of-state students, making a tuition difference of $0 a prominent value. So that the histogram is not dominated by this high frequency of $0, I filtered this value out and created a pie chart alongside the histogram to show the number of colleges with no difference in tuition compared to those with higher out-of-state tuition. The 29% of colleges with higher out-of-state tuition could be concentrated in certain states or regions, but that is not the takeaway of this data.
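The steps described above (take the difference, drop the $0 values for the histogram, and compute the share of colleges that charge more out of state) can be sketched in Python as follows. The tuition pairs are made up for illustration and are not rows from the dataset.

```python
# Hypothetical (in-state, out-of-state) tuition pairs, in dollars.
tuition = [(10000, 10000), (9000, 21000), (30000, 30000), (8000, 16000)]

# Out-of-state tuition minus in-state tuition for each college.
differences = [out_state - in_state for in_state, out_state in tuition]

# Values kept for the histogram: colleges with a $0 difference are dropped.
nonzero = [d for d in differences if d != 0]

# Percentage of all colleges whose out-of-state tuition is higher.
share_higher = 100 * sum(d > 0 for d in differences) / len(differences)
print(differences, nonzero, share_higher)
```

Note that the percentage is computed from all colleges, while the histogram uses only the nonzero differences, so the two graphics answer slightly different questions.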

#5: What’s the completion rate difference for public, private, and profit schools?

## # A tibble: 3 × 7
##   Control MeanCompletion SdCompletion RangeCompletion FirstQuartile
##   <chr>            <dbl>        <dbl>           <dbl>         <dbl>
## 1 Private           55.7         21.1           100            42.2
## 2 Profit            29.4         20.4           100            15.1
## 3 Public            50.2         17.5            87.6          37.7
## # ℹ 2 more variables: MedianCompletion <dbl>, ThirdQuartile <dbl>

I mostly expected these results for completion rate by institution type. I originally thought that private colleges would have a higher completion rate than public colleges, since private colleges usually cost more, giving students more financial incentive. The graph and table confirm this. However, I thought that the gap would be somewhere around 10-15 percentage points, and this is not the case: comparing the mean completion rates of public and private colleges, the gap is 5.5 percentage points. I had no expectations for profit institutions, since I did not know what they were prior to this project.

#6: How does the average debt of graduates vary between states?

## # A tibble: 54 × 7
##    State MeanDebt SdDebt RangeDebt FirstQuartile MedianDebt ThirdQuartile
##    <fct>    <dbl>  <dbl>     <int>         <dbl>      <dbl>         <dbl>
##  1 AZ       3540.  3899.     10200          144       1824.         7050 
##  2 UT       2576.  1668.      4707         1268       2943          3375 
##  3 NV       2442.  3390.     10162          455        768          4098 
##  4 DE       2405   2310.      5045          501       1766.         4489.
##  5 WY       2394     NA          0         2394       2394          2394 
##  6 RI       2357.  2050.      5204         1012.      1258.         3889 
##  7 CT       2296.  2531.      7035          443       1313          2539 
##  8 FL       2122.  2684.     10226          311.       822.         2855 
##  9 NJ       2017.  2129.     10159          509       1693          2141 
## 10 TX       1947.  2485.     10223          372.       900          2438.
## # ℹ 44 more rows

This graph, which shows average debt of graduates by state, is formatted much like the one in question #1. It is far less clear than the one in question #1, since there are many more outliers, even after removing the values above the 95th percentile. In the table, we can see the average debt for the top 10 states. Ranks #2-10 are all within a range of 650, but something that seems odd is the gap from rank #1 to rank #2. Arizona’s average college debt (rank #1) is almost 1000 more than Utah’s (rank #2). I’m not sure why this is the case, but it is an interesting statistic to note.

#7: Which regions have colleges with the highest average levels of student debt?

Comparing the average student debt by region shows what outliers can do to a graph. The graph with the values above the 99th percentile removed is similar to the one with all data included, except for the West and Southeast regions. Removing those values makes a difference of nearly $1000 in the West region and nearly $500 in the Southeast region.
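To show how much a single extreme value can move a mean, here is a minimal Python sketch that compares the mean of all values with the mean after dropping everything above the 99th percentile. The debt values are invented, `trimmed_mean` is a hypothetical helper, and the percentile is computed with Python's default quantile method, which may differ slightly from the method used for the graphs.

```python
from statistics import mean, quantiles

def trimmed_mean(values, pct=99):
    """Mean of the values at or below the pct-th percentile."""
    # quantiles(..., n=100) returns the 1st through 99th percentile cut points.
    cutoff = quantiles(values, n=100)[pct - 1]
    kept = [v for v in values if v <= cutoff]
    return mean(kept)

# Hypothetical debts: 199 colleges at 1000 and one extreme college at 50000.
debts = [1000] * 199 + [50000]

full_mean = mean(debts)
mean_without_top = trimmed_mean(debts)
print(full_mean, mean_without_top)
```

Here one college out of 200 raises the average by roughly a quarter, which is the kind of shift seen between the two versions of the regional graph.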

#8: What percentage of all colleges are public, private, or profit, and what is the average total cost across these types of institutions?

## # A tibble: 3 × 7
##   Control MeanCost SdCost RangeCost FirstQuartile MedianTuition ThirdQuartile
##   <chr>      <dbl>  <dbl>     <int>         <dbl>         <dbl>         <dbl>
## 1 Private   41350. 14928.     64817        30695          41488         52122
## 2 Profit    28862.  7622.     44627        24711.         27935         33024
## 3 Public    21339.  4656.     30041        18490          21148         23999

The pie chart, which shows the distribution of college types, is generally simple, but it is very important for interpreting the data and graphs that compare the different institution types. One such graph is the one below the pie chart, which shows the total cost across institution types. The takeaway from this box plot is that public colleges are often the cheapest, profit colleges are a middle ground, and private colleges are often the most expensive.

#9: Which colleges (over 500 student enrollment) have the highest and lowest percentage of students receiving Pell grants?

##   MeanPell   SdPell RangePell FirstQuartile MedianPell ThirdQuartile
## 1 36.37747 16.44895      94.6          25.1      34.25          44.7

This graph shows the top 10 and bottom 10 colleges by percentage of students that receive Pell Grants. Only colleges with an enrollment over 500 are included, since there are a few schools with under 500 students that skew the data more than I would prefer. When searching for these top 10 and bottom 10 colleges on the spreadsheet, most of them are not concentrated in a specific state/territory, but on the chart it can be seen that a few of the highest percentages belong to Puerto Rican schools. One possible reason is that the U.S. government is pushing for more learning in its territories.
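The filtering and ranking described above can be sketched in Python as follows. The college names, enrollments, and percentages are made up, and `k` is set to 2 only to keep the example short (the chart uses 10).

```python
# Hypothetical (name, enrollment, percent receiving Pell Grants) records.
colleges = [
    ("College A", 1200, 62.0),
    ("College B", 300, 95.0),   # excluded below: enrollment is not over 500
    ("College C", 5400, 12.5),
    ("College D", 800, 40.1),
]

# Keep only colleges with enrollment over 500.
large = [c for c in colleges if c[1] > 500]

# Sort by Pell percentage, highest first, then take both ends of the list.
by_pell = sorted(large, key=lambda c: c[2], reverse=True)
k = 2
top = by_pell[:k]
bottom = by_pell[-k:]
print(top, bottom)
```

In this toy data the small college has the highest percentage of all, so the enrollment filter changes which college appears at the top of the list.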

#10: Is there a relationship between the percentage of first-generation students and the median family income at each college?

##   MeanFirstGen SdFirstGen VarFirstGen MeanIncome SdIncome VarIncome
## 1     33.55713   11.08522    122.8821   46.51453 22.85785  522.4814
## [1] "Correlation between First-Gen Percentage and Median Income: -0.771508704243445"

This scatter plot shows the percentage of first-generation students vs. median family income. The data for this graph does not vary that much, and there are not very many outliers compared to the other graphs above. The correlation coefficient of about -0.77 indicates a strong negative correlation, meaning that colleges with a higher median family income tend to have a lower percentage of first-generation students.

Summary

This report aims to explore financial, academic, and demographic trends in United States colleges. It starts by laying out the task at hand, which is to find questions to answer and to then analyze the data that is given. The analyses contained in this report offer graphs and tables that pertain to the questions that are to be answered. The analyses also offer personal insights and takeaways for each set of data analyzed.

Reference

The data used for this project comes from Lock5’s Statistics: Unlocking the Power of Data, 3rd Edition. The spreadsheet can be found at https://www.lock5stat.com/datapage3e.html under the Dataname ‘CollegeScores4yr.’ Information about this spreadsheet can be found on Pg. 20-21 of Lock5’s Dataset Documentation for the third edition (https://www.lock5stat.com/datasets3e/Lock5DataGuide3e.pdf). We can be reasonably confident that this data is accurate and reliable, since the Dataset Documentation PDF states: “Data downloaded from the US Department of Education’s College Scorecard at https://collegescorecard.ed.gov/data/ (November 2019).”