Introduction

This project uses data from the provided “CollegeScores4yr” dataset. Using this dataset, I wrote 10 problems that draw on lessons from chapter 6, ‘Descriptive Statistics’. ChatGPT was then used to generate 10 additional questions. From the resulting 20 questions, 10 were selected for further analysis in R, some from my list and some AI generated.

The 10 questions that I created are listed here:
1) What is the correlation between cost and average SAT scores?
2) What is the mean admission rate of all colleges?
3) What is the distribution of race amongst students at the colleges in the dataset? (pie chart)
4) What is the spread of average student debt? (variance)
5) What is the median ACT average score?
6) Where are the schools located? (histogram of rural, suburb, urban)
7) What is the correlation between in-state and out-of-state tuition?
8) What is the spread of faculty salary?
9) What is the mean completion rate?
10) What is the distribution of net price? (box plot)

The 10 AI generated questions are listed here:
1) What is the distribution of admission rates (AdmitRate) among 4-year universities? → Use: Histogram or boxplot to show spread and outliers.
2) What is the average and median student debt (Debt) for students who complete their programs? → Use: Mean and median to summarize debt.
3) How much variation exists in undergraduate enrollment (Enrollment) across universities? → Use: Variance and standard deviation.
4) What proportion of universities fall under each control type (Control: Private, Public, Profit)? → Use: Barplot or pie chart.
5) How does the average faculty salary (FacSalary) differ between public and private universities (Control)? → Use: Boxplot or compare mean salaries by control type.
6) Is there a relationship between average net price (NetPrice) and median family income (MedIncome)? → Use: Correlation coefficient and scatterplot.
7) How does the average admission rate (AdmitRate) vary by region (Region)? → Use: Boxplot or barplot of mean/median by region.
8) What is the distribution of the percentage of female students (Female) across universities? → Use: Histogram or boxplot.
9) Do universities offering only online programs (Online) have different median ACT scores (MidACT) than traditional universities? → Use: Boxplot or compare means.
10) Is there a correlation between completion rate (CompRate) and average total cost (Cost)? → Use: Correlation coefficient and scatterplot.

Analysis

Here, we will explore 10 of the above questions in detail.

collegeData = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(collegeData)
##                                  Name State     ID Main
## 1            Alabama A & M University    AL 100654    1
## 2 University of Alabama at Birmingham    AL 100663    1
## 3                  Amridge University    AL 100690    1
## 4 University of Alabama in Huntsville    AL 100706    1
## 5            Alabama State University    AL 100724    1
## 6           The University of Alabama    AL 100751    1
##                                                                Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
##   MainDegree HighDegree Control    Region Locale Latitude Longitude AdmitRate
## 1          3          4  Public Southeast   City 34.78337 -86.56850    0.9027
## 2          3          4  Public Southeast   City 33.50570 -86.79935    0.9181
## 3          3          4 Private Southeast   City 32.36261 -86.17401        NA
## 4          3          4  Public Southeast   City 34.72456 -86.64045    0.8123
## 5          3          4  Public Southeast   City 32.36432 -86.29568    0.9787
## 6          3          4  Public Southeast   City 33.21187 -87.54598    0.5330
##   MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1     18    929      0       4824   2.5  90.7      0.9   0.2   5.6      6.6
## 2     25   1195      0      12866  57.8  25.9      3.3   5.9   7.1     25.2
## 3     NA     NA      1        322   7.1  14.3      0.6   0.3  77.6     54.4
## 4     28   1322      0       6917  74.2  10.7      4.6   4.0   6.5     15.0
## 5     18    935      0       4189   1.5  93.8      1.0   0.3   3.5      7.7
## 6     28   1278      0      32387  78.5  10.1      4.7   1.2   5.6      7.9
##   NetPrice  Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1    15184 22886      9857     18236       9227        7298      6983
## 2    17535 24129      8328     19032      11612       17235     10640
## 3     9649 15080      6900      6900      14738        5265      3866
## 4    19986 22108     10280     21480       8727        9748      9391
## 5    12874 19413     11068     19396       9003        7983      7399
## 6    21973 28836     10780     28100      13574       10894     10016
##   FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1        71.3 71.0    23.96 1068   56.4     36.6      23.6
## 2        89.9 35.3    52.92 3755   63.9     34.1      34.5
## 3       100.0 74.2    18.18  109   64.9     51.3      15.0
## 4        64.6 27.7    48.62 1347   47.6     31.0      44.8
## 5        54.2 73.8    27.69 1294   61.3     34.3      22.1
## 6        74.0 18.0    67.87 6430   61.5     22.6      66.7

Q1 (student generated): What is the mean admission rate of all the colleges in the dataset?

mean(collegeData$AdmitRate, na.rm = TRUE)
## [1] 0.6702025

The mean admission rate is about 67% (0.670).

Q2 (AI generated): What is the distribution of admission rates (AdmitRate) among 4-year universities?

AdmitRate = c(collegeData$AdmitRate)
boxplot(AdmitRate,
        main = "Distribution of Admittance Rates",
        col = "blue",
        xlab = "Admittance Rates",
        horizontal = TRUE)

The distribution of admittance rates at 4-year colleges is skewed to the left, as the left whisker is much longer than the right whisker. There are outliers at the lower end of the admittance rates, but most values are in the 50% to 80% range.
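One way to back up this reading of the boxplot with numbers is to compare the mean with the median (a mean below the median is consistent with left skew) and to list the points that would be drawn as outliers. The sketch below uses a small made-up vector, `toy_rates`, rather than the real AdmitRate column; all of its values are hypothetical.

```r
# Hypothetical admission rates (NOT taken from CollegeScores4yr)
toy_rates <- c(0.05, 0.12, 0.55, 0.62, 0.66, 0.70, 0.73, 0.78, 0.81, 0.90, NA)

# A mean that falls below the median is consistent with left skew
toy_mean   <- mean(toy_rates, na.rm = TRUE)
toy_median <- median(toy_rates, na.rm = TRUE)

# boxplot.stats() returns, in $out, the points lying beyond the whiskers
toy_outliers <- boxplot.stats(toy_rates)$out

toy_mean
toy_median
toy_outliers
```

With the dataset loaded, the same calls could be applied to collegeData$AdmitRate in place of the toy vector.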

Q3 (student generated): What is the spread of average student debt?

debt = c(collegeData$Debt)
var(debt, na.rm = TRUE)
## [1] 28740171

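The sample variance of average student debt is 28,740,171. Because variance is expressed in squared units, it is easier to interpret through its square root, the standard deviation, which is about 5,361. That is large relative to the debt values shown in the data preview above, so the spread of average student debt among 4-year colleges is large.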
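Since variance is measured in squared units, the standard deviation is usually easier to compare with the data values themselves. A minimal sketch of that relationship, using a made-up vector `toy_debt` (hypothetical values, not the Debt column):

```r
# Hypothetical debt values (NOT taken from CollegeScores4yr)
toy_debt <- c(1000, 3500, 200, 1500, 6000, NA)

toy_var <- var(toy_debt, na.rm = TRUE)  # sample variance, in squared units
toy_sd  <- sd(toy_debt, na.rm = TRUE)   # standard deviation, in the original units

# The standard deviation is the square root of the variance
all.equal(toy_sd, sqrt(toy_var))
```

With the dataset loaded, sd(collegeData$Debt, na.rm = TRUE) should match the square root of the variance computed above.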

Q4 (AI generated): What is the distribution of the percentage of female students (Female) across universities?

femaleStudents = c(collegeData$Female)
hist(femaleStudents,
     main = "Distribution of Female Students in Colleges",
     col = "red",
     xlab = "Percent Students Female",
     ylab = "Number of Schools")

The histogram shows that the majority of universities have between 50% and 70% female students.
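A claim like this can also be checked directly by computing the share of schools whose value falls inside the range. The sketch below does so on a made-up vector, `toy_female` (hypothetical percentages, not the Female column).

```r
# Hypothetical percentages of female students (NOT taken from CollegeScores4yr)
toy_female <- c(35, 48, 52, 55, 58, 61, 63, 66, 72, 90, NA)

# TRUE where the school falls in the 50% to 70% range; NA stays NA
in_range <- toy_female >= 50 & toy_female <= 70

# The mean of a logical vector is the proportion of TRUE values
share_in_range <- mean(in_range, na.rm = TRUE)
share_in_range
```

A share above 0.5 would support the word "majority"; the same expression could be applied to collegeData$Female.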

Q5 (student generated): What is the mean completion rate?

complRate = c(collegeData$CompRate)
mean(complRate, na.rm = TRUE)
## [1] 52.13524

The mean degree completion rate at 4-year colleges is about 52.1%.

Q6 (AI generated): Is there a correlation between completion rate (CompRate) and average total cost (Cost)?

cor(collegeData$CompRate, collegeData$Cost, use = "complete.obs")
## [1] 0.5870019

The correlation between completion rate and average total cost of colleges is 0.587, which means that there is a moderate positive linear relationship between the two.
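A correlation coefficient only summarizes the linear trend, so it is usually paired with a scatterplot. The sketch below shows both on a small made-up data frame, `toy` (hypothetical values; the column names simply mirror Cost and CompRate).

```r
# Hypothetical costs and completion rates (NOT taken from CollegeScores4yr)
toy <- data.frame(
  Cost     = c(15000, 20000, 25000, 30000, 45000, 60000),
  CompRate = c(20, 35, 30, 50, 65, 80)
)

# Pearson correlation, skipping any rows with missing values
r <- cor(toy$CompRate, toy$Cost, use = "complete.obs")
r

# Scatterplot to check whether the relationship looks roughly linear
plot(toy$Cost, toy$CompRate,
     main = "Completion Rate vs. Cost (toy data)",
     xlab = "Average Total Cost",
     ylab = "Completion Rate (%)")
```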

Q7 (student generated): Where are the schools located?

barplot(table(collegeData$Locale),
     main = "Locations of Colleges",
     col = "green",
     xlab = "Locales",
     ylab = "Number of Colleges")

The barplot shows that most colleges in the dataset are located in cities.

Q8 (AI generated): What proportion of universities fall under each control type (Control: Private, Public, Profit)?

barplot(table(collegeData$Control),
     main = "Control Types of Colleges",
     col = "purple",
     xlab = "Control Type",
     ylab = "Number of Colleges")

The barplot shows that most colleges in the dataset are private, with nearly half as many public colleges.
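The question asks for proportions, while a barplot of table() output shows counts. A minimal sketch of converting counts to proportions with prop.table(), using a made-up vector `toy_control` (hypothetical values, not the Control column):

```r
# Hypothetical control types (NOT taken from CollegeScores4yr)
toy_control <- c("Private", "Private", "Public", "Profit", "Private", "Public")

control_counts <- table(toy_control)           # number of schools per type
control_props  <- prop.table(control_counts)   # share of schools per type

round(control_props, 2)
```

Passing the proportions to barplot() instead of the counts would put the shares on the vertical axis; with the dataset loaded, prop.table(table(collegeData$Control)) is the analogous call.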

Q9 (student generated): What is the distribution of net price?

boxplot(collegeData$NetPrice,
        main = "Distribution of Net Price (cost - aid)",
        col = "orange",
        xlab = "Net Price",
        horizontal = TRUE)

The boxplot shows that the distribution of net price (cost minus aid) in colleges is right skewed with many outliers on the higher end.

Q10 (AI generated): Do universities offering only online programs (Online) have different median ACT scores (MidACT) than traditional universities?

tapply(collegeData$MidACT, collegeData$Online, median, na.rm = TRUE)
##  0  1 
## 23 25

According to the data, the median ACT score for online-only programs (25) is slightly higher than for traditional universities (23).
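A group median based on only a few schools is less reliable, so it can help to check how many non-missing ACT values each group contributes. The sketch below does this on a made-up data frame, `toy` (hypothetical values; the column names mirror Online and MidACT).

```r
# Hypothetical data (NOT taken from CollegeScores4yr)
toy <- data.frame(
  Online = c(0, 0, 0, 0, 1, 1),
  MidACT = c(21, 23, 25, NA, 25, NA)
)

# Median ACT score per group, ignoring missing values
group_medians <- tapply(toy$MidACT, toy$Online, median, na.rm = TRUE)

# Number of non-missing ACT scores behind each median
group_n <- tapply(!is.na(toy$MidACT), toy$Online, sum)

group_medians
group_n
```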

Summary

In conclusion, I used critical thinking to come up with questions that apply concepts covered in chapter 6. When generating questions with AI, I noticed that it could easily produce detailed questions with very little prompting, which suggests how useful it can be in this type of scenario. Through this project, I became more familiar with the topics covered in chapter 6 and more comfortable using Posit.