This document analyzes the data provided at https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv. The data contains information about four-year colleges, including several statistics for each school. The goal is to explore different aspects of these colleges and gain an understanding of the metrics that describe them.
I propose the following 10 questions based on my understanding of the data.
We will explore the questions in detail and summarize the results.
college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
The mean debt from all the colleges in the data is $2365.65.
The correlation between the cost of attendance (the Cost column) and the median income is 0.589288. This is a moderate positive correlation, meaning that as cost increases, median income tends to increase as well.
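One way to make a label such as "weak" or "moderate" reproducible is to map the coefficient to a category with fixed cut-offs. The sketch below uses a hypothetical helper, describe_correlation(), with cut-offs of 0.3 and 0.7. These thresholds are a common rule of thumb and an assumption here, not something defined by the dataset.

```r
# Hypothetical helper: label a correlation coefficient by strength and direction.
# The 0.3 / 0.7 cut-offs are a rule-of-thumb assumption, not a fixed standard.
describe_correlation <- function(r) {
  stopifnot(is.numeric(r), length(r) == 1, !is.na(r), abs(r) <= 1)
  if (r == 0) {
    return("no linear correlation")
  }
  strength <- if (abs(r) < 0.3) {
    "weak"
  } else if (abs(r) < 0.7) {
    "moderate"
  } else {
    "strong"
  }
  direction <- if (r > 0) "positive" else "negative"
  paste(strength, direction)
}

r <- 0.589288              # value reported for Cost vs. MedIncome
describe_correlation(r)    # label under the assumed cut-offs
r^2                        # share of variance a straight-line fit would explain
```

Squaring the coefficient gives roughly 0.35, so a linear relationship with cost would account for about a third of the variation in median income.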
The histogram shows the distribution of faculty salaries for the schools in the data. The median appears to be around $7,500, with some schools below that and some above. The graph roughly resembles a normal distribution, meaning that most schools have faculty salaries near the median, with a few outliers in both directions.
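A visual reading of a histogram can be backed up with a few numbers. The sketch below defines a hypothetical helper, shape_summary(), that reports the median, the mean, and a moment-based skewness. The sample values are toy numbers for illustration only; with the real data the input would be college$FacSalary.

```r
# Hypothetical helper: numeric summary to support a visual reading of a histogram.
shape_summary <- function(x) {
  x <- x[!is.na(x)]          # drop missing values before computing anything
  m <- mean(x)
  c(
    median   = median(x),
    mean     = m,
    # Moment-based skewness: near 0 when the distribution is symmetric,
    # positive when the right tail is longer, negative when the left tail is.
    skewness = mean((x - m)^3) / mean((x - m)^2)^1.5
  )
}

# Toy values for illustration only (not taken from the dataset).
toy_salary <- c(5200, 6100, 6900, 7400, 7500, 7800, 8300, 9100, 10400, NA)
shape_summary(toy_salary)
```

A mean close to the median and a skewness near zero are consistent with a roughly symmetric, bell-shaped distribution. A large gap between the two suggests a skewed one.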
The correlation between SAT scores and admission rate is -0.4221255, a moderate negative correlation. This means that as average SAT scores go up, admission rates tend to go down. This is probably because schools with lower acceptance rates have higher standards and admit students with higher SAT scores.
The histogram shows the distribution of part-time students across schools. Most schools have between 0% and 20% part-time attendance, with some having more. This suggests that at most schools in the dataset, the large majority of students are full-time.
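The share of schools in a given range can also be computed directly rather than read off the chart. The sketch below uses a hypothetical helper, share_at_or_below(), and applies it to the six PartTime values shown in the head(college) output; with the full data the input would be college$PartTime.

```r
# Hypothetical helper: fraction of non-missing values at or below a cutoff.
share_at_or_below <- function(x, cutoff) {
  mean(x <= cutoff, na.rm = TRUE)
}

# PartTime values for the first six schools, copied from head(college).
part_time_head <- c(6.6, 25.2, 54.4, 15.0, 7.7, 7.9)

# Proportion of these schools with at most 20% part-time students.
share_at_or_below(part_time_head, 20)
```

Four of these six schools fall at or below 20%, so the call returns about 0.67 for this small subset.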
The median completion rate among schools is 52.45%.
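The pie chart shows the distribution of ethnicities among students attending the schools in the data. Over half of the students in the study population are White, while several other ethnicities make up smaller portions of the student population. The chart sums each school's percentages for four categories (White, Black, Hispanic, and Asian), so every school counts equally regardless of its enrollment.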
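Because the columns hold per-school percentages, an alternative is to weight each school by its enrollment so that the result describes students rather than schools. The sketch below is one possible approach, not the method used in this report. The helper weighted_shares() is hypothetical, and the data frame contains only the first six rows shown in head(college).

```r
# Hypothetical helper: enrollment-weighted mean of percentage columns.
weighted_shares <- function(df, cols, weight_col) {
  sapply(cols, function(col) {
    # Keep only rows where both the percentage and the weight are present.
    ok <- !is.na(df[[col]]) & !is.na(df[[weight_col]])
    weighted.mean(df[[col]][ok], df[[weight_col]][ok])
  })
}

# First six rows of the relevant columns, copied from head(college).
first_six <- data.frame(
  Enrollment = c(4824, 12866, 322, 6917, 4189, 32387),
  White      = c(2.5, 57.8, 7.1, 74.2, 1.5, 78.5),
  Black      = c(90.7, 25.9, 14.3, 10.7, 93.8, 10.1),
  Hispanic   = c(0.9, 3.3, 0.6, 4.6, 1.0, 4.7),
  Asian      = c(0.2, 5.9, 0.3, 4.0, 0.3, 1.2)
)
groups <- c("White", "Black", "Hispanic", "Asian")

weighted_shares(first_six, groups, "Enrollment")  # weighted by enrollment
colMeans(first_six[, groups])                     # every school counts equally
```

Comparing the two results shows how much large schools shift the picture relative to the unweighted view.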
The net price of schools has a standard deviation of $7,854.10, which shows how much net price varies among schools.
Enrollment has a standard deviation of 7473.072 students, indicating the spread in enrollment across schools.
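The two standard deviations are in different units (dollars and students), so they cannot be compared directly. One common option is the coefficient of variation, the standard deviation divided by the mean. The sketch below uses a hypothetical helper, coef_variation(), on the six NetPrice and Enrollment values shown in head(college); with the full data the inputs would be college$NetPrice and college$Enrollment.

```r
# Hypothetical helper: coefficient of variation (unitless measure of spread).
coef_variation <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# Values for the first six schools, copied from head(college).
net_price_head  <- c(15184, 17535, 9649, 19986, 12874, 21973)
enrollment_head <- c(4824, 12866, 322, 6917, 4189, 32387)

coef_variation(net_price_head)   # spread of net price relative to its mean
coef_variation(enrollment_head)  # spread of enrollment relative to its mean
```

A larger value indicates more variation relative to the typical size of the quantity.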
The pie chart shows the distribution of schools among different regions. The distribution of schools is relatively even among the four main regions of the U.S., while only a few schools are located in the U.S. territories.
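Slice sizes in a pie chart are hard to compare by eye, so a table of percentages can support the claim. The sketch below uses a hypothetical helper, region_shares(), on a toy vector of region labels. The toy values are for illustration only; with the real data the input would be college$Region.

```r
# Hypothetical helper: percentage of observations in each category.
region_shares <- function(x) {
  100 * prop.table(table(x))
}

# Toy region labels for illustration only (not taken from the dataset).
toy_region <- c("Southeast", "West", "Midwest", "Northeast",
                "Territory", "Midwest", "Southeast", "West")

round(region_shares(toy_region), 1)
```

Note that table() orders categories alphabetically, so the output lists Midwest first and West last.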
mean(college$Debt, na.rm = TRUE)
## [1] 2365.655
cor(college$Cost, college$MedIncome, use = "complete.obs")
## [1] 0.589288
hist(college$FacSalary, main = "Distribution of Faculty Salary", ylab = "# of Schools", xlab = "Salary in Dollars", col = "lightblue")
cor(college$AvgSAT, college$AdmitRate, use = "complete.obs")
## [1] -0.4221255
hist(college$PartTime,
     main = "Distribution of Part-Time Students",
     xlab = "Percentage of Part-Time Students",
     col = "orange")
median(college$CompRate, na.rm = TRUE)
## [1] 52.45
ethnicity <- colSums(college[, c("White", "Black", "Hispanic", "Asian")], na.rm = TRUE)
# pie() labels slices through `labels`; `names.arg` belongs to barplot().
# A pie chart has no axes, so xlab and ylab are omitted.
pie(ethnicity,
    labels = names(ethnicity),
    main = "Distribution of Ethnicities Across Colleges",
    col = c("lightblue", "lightgreen", "purple", "yellow"))
sd(college$NetPrice, na.rm = TRUE)
## [1] 7854.096
sd(college$Enrollment, na.rm = TRUE)
## [1] 7473.072
region_counts <- table(college$Region)
# table() orders regions alphabetically, so the labels are taken from the
# table itself instead of a hand-typed vector that could mislabel slices.
pie(region_counts,
    labels = names(region_counts),
    main = "Distribution of Schools Among Regions",
    col = c("lightblue", "lightgreen", "purple", "yellow", "lightcoral"))