This project uses the provided “CollegeScores4yr” dataset. Using this dataset, I wrote 10 questions that apply lessons from Chapter 6, ‘Descriptive Statistics’. I then used ChatGPT to generate 10 additional questions. From the resulting 20 questions, I selected a sample of 10, some from my list and some AI-generated, for further analysis in R.
The 10 questions that I created are listed here:
1) What is the correlation between cost and average SAT scores?
2) What is the mean admission rate of all colleges?
3) What is the distribution of race among students at the colleges in the dataset? (pie chart)
4) What is the spread of average student debt? (variance)
5) What is the median ACT average score?
6) Where are the schools located? (histogram of rural, suburb, urban)
7) What is the correlation between in-state and out-of-state tuition?
8) What is the spread of faculty salary?
9) What is the mean completion rate?
10) What is the distribution of net price? (box plot)
The 10 AI-generated questions are listed here:
1) What is the distribution of admission rates (AdmitRate) among 4-year universities? → Use: Histogram or boxplot to show spread and outliers.
2) What is the average and median student debt (Debt) for students who complete their programs? → Use: Mean and median to summarize debt.
3) How much variation exists in undergraduate enrollment (Enrollment) across universities? → Use: Variance and standard deviation.
4) What proportion of universities fall under each control type (Control: Private, Public, Profit)? → Use: Barplot or pie chart.
5) How does the average faculty salary (FacSalary) differ between public and private universities (Control)? → Use: Boxplot or compare mean salaries by control type.
6) Is there a relationship between average net price (NetPrice) and median family income (MedIncome)? → Use: Correlation coefficient and scatterplot.
7) How does the average admission rate (AdmitRate) vary by region (Region)? → Use: Boxplot or barplot of mean/median by region.
8) What is the distribution of the percentage of female students (Female) across universities? → Use: Histogram or boxplot.
9) Do universities offering only online programs (Online) have different median ACT scores (MidACT) than traditional universities? → Use: Boxplot or compare means.
10) Is there a correlation between completion rate (CompRate) and average total cost (Cost)? → Use: Correlation coefficient and scatterplot.
Here, we will explore 10 of the above questions in detail.
collegeData = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(collegeData)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
mean(collegeData$AdmitRate, na.rm = TRUE)
## [1] 0.6702025
The mean admission rate is 67%.
AdmitRate = c(collegeData$AdmitRate)
boxplot(AdmitRate,
main = "Distribution of Admittance Rates",
col = "blue",
xlab = "Admittance Rates",
horizontal = TRUE)
The distribution of admission rates at 4-year colleges is skewed to the
left, as the left whisker is much longer than the right whisker. There
are outliers at the lower end of the admission rates, but most values
fall in the 50% to 80% range.
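The skew and the outliers read off the boxplot can also be checked numerically. The following is a minimal sketch that uses only the six AdmitRate values visible in the head() output above as a stand-in (the name admitSample is introduced here for illustration); for the real analysis, collegeData$AdmitRate would take its place.

```r
# Stand-in data: the six AdmitRate values from the head() output above
# (one is missing). This is NOT the full collegeData$AdmitRate column.
admitSample <- c(0.9027, 0.9181, NA, 0.8123, 0.9787, 0.5330)

# Five-number summary: min, lower hinge, median, upper hinge, max
# (fivenum() drops missing values by default)
admitFive <- fivenum(admitSample)
admitFive

# A mean that sits below the median is one sign of left skew
admitMean   <- mean(admitSample, na.rm = TRUE)
admitMedian <- median(admitSample, na.rm = TRUE)
admitMean < admitMedian

# Values more than 1.5 * IQR beyond the hinges, the same rule
# boxplot() uses by default to draw points as outliers
admitOutliers <- boxplot.stats(admitSample)$out
admitOutliers
```

With only six rows the result says little about the full dataset, but the same three calls apply unchanged to the complete column.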
debt = c(collegeData$Debt)
var(debt, na.rm = TRUE)
## [1] 28740171
The sample variance of average student debt is 28,740,171. Because variance is measured in squared units, it is easier to interpret through its square root: a standard deviation of roughly 5,361, which indicates that average student debt varies widely across 4-year colleges.
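Since variance is expressed in squared units, the standard deviation is often easier to compare against the data values themselves. A minimal sketch, using only the six Debt values shown in the head() output above as a stand-in (debtSample is a name introduced here for illustration, not part of the original analysis):

```r
# Stand-in data: the six Debt values from the head() output above,
# NOT the full collegeData$Debt column.
debtSample <- c(1068, 3755, 109, 1347, 1294, 6430)

debtVar <- var(debtSample, na.rm = TRUE)  # sample variance (squared units)
debtSd  <- sd(debtSample, na.rm = TRUE)   # standard deviation (original units)

# sd() is the square root of var(), so the two always agree
all.equal(debtSd, sqrt(debtVar))

# Coefficient of variation: spread expressed relative to the mean
debtCv <- debtSd / mean(debtSample, na.rm = TRUE)
debtCv
```

For the full dataset, replacing debtSample with collegeData$Debt gives the standard deviation that corresponds to the variance reported above.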
femaleStudents = c(collegeData$Female)
hist(femaleStudents,
main = "Distribution of Female Students in Colleges",
col = "red",
xlab = "Percent Students Female",
ylab = "Number of Schools")
The histogram shows that the majority of universities have between 50%
and 70% female students.
complRate = c(collegeData$CompRate)
mean(complRate, na.rm = TRUE)
## [1] 52.13524
The mean completion rate at 4-year colleges is about 52.1%.
cor(collegeData$CompRate, collegeData$Cost, use = "complete.obs")
## [1] 0.5870019
The correlation between completion rate and average total cost is 0.587, which means that there is a moderate positive linear relationship between the two.
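A correlation coefficient is usually read alongside a scatterplot and a count of how many complete pairs went into it. The sketch below illustrates this with the CompRate and Cost values from the six rows shown in the head() output above; compSample and costSample are illustrative names, and the result for six rows is not expected to match the full-dataset value.

```r
# Stand-in data: CompRate and Cost for the six rows in the head() output,
# NOT the full columns.
compSample <- c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)
costSample <- c(22886, 24129, 15080, 22108, 19413, 28836)

# Number of rows where both values are present; these are the rows
# that use = "complete.obs" keeps
nPairs <- sum(complete.cases(compSample, costSample))
nPairs

# Pearson correlation on the complete pairs
r <- cor(compSample, costSample, use = "complete.obs")
r

# Scatterplot to check that the relationship looks roughly linear
plot(costSample, compSample,
     main = "Completion Rate vs. Cost",
     xlab = "Average Total Cost",
     ylab = "Completion Rate (%)")
```

Running the same calls on collegeData$CompRate and collegeData$Cost would show how many colleges the reported correlation is based on.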
barplot(table(collegeData$Locale),
main = "Locations of Colleges",
col = "green",
xlab = "Locales",
ylab = "Number of Colleges")
This barplot shows that most colleges in the dataset are located in cities.
barplot(table(collegeData$Control),
main = "Control Type of Colleges",
col = "purple",
xlab = "Control Type",
ylab = "Number of Colleges")
This barplot shows that most colleges in the dataset are private, with nearly half as many being public.
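Because the underlying question asks for proportions, the counts behind the barplot can be converted to shares with prop.table(). A minimal sketch, using only the Control values from the six rows shown in the head() output above (controlSample is an illustrative name):

```r
# Stand-in data: Control for the six rows in the head() output,
# NOT the full collegeData$Control column.
controlSample <- c("Public", "Public", "Private", "Public", "Public", "Public")

controlCounts <- table(controlSample)        # counts per control type
controlProps  <- prop.table(controlCounts)   # shares that sum to 1

# Percentages rounded to one decimal place
round(100 * controlProps, 1)
```

Applied to table(collegeData$Control), the same two steps give the share of each control type in the full dataset.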
boxplot(collegeData$NetPrice,
main = "Distribution of Net Price (cost - aid)",
col = "orange",
xlab = "Net Price",
horizontal = TRUE)
The boxplot shows that the distribution of net price (cost minus aid) in
colleges is right skewed with many outliers on the higher end.
tapply(collegeData$MidACT, collegeData$Online, median, na.rm = TRUE)
## 0 1
## 23 25
According to the data, the median ACT score for online-only programs (25) is slightly higher than for traditional universities (23).
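A group comparison like this depends on how many schools in each group actually report an ACT score. The sketch below counts non-missing values per group before computing the medians, using only the MidACT and Online values from the six rows shown in the head() output above (actSample and onlineSample are illustrative names). In those six rows, the single online-only school has a missing MidACT, so its group median comes back as NA.

```r
# Stand-in data: MidACT and Online for the six rows in the head() output,
# NOT the full columns.
actSample    <- c(18, 25, NA, 28, 18, 28)
onlineSample <- c(0, 0, 1, 0, 0, 0)

# How many schools in each group report an ACT score
nonMissing <- tapply(!is.na(actSample), onlineSample, sum)
nonMissing

# Median ACT per group, ignoring missing values
groupMedians <- tapply(actSample, onlineSample, median, na.rm = TRUE)
groupMedians
```

On the full dataset, a small non-missing count in the online-only group would suggest treating the difference in medians with caution.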
In conclusion, I used critical thinking to come up with questions that apply concepts covered in Chapter 6. When generating questions with AI, I noticed that it could easily produce detailed questions with very little prompting, which suggests it can be useful in these types of scenarios. Through this project, I became more familiar with the topics covered in Chapter 6 and with using Posit.