In this project, a data set containing various statistics for all four year colleges and universities will be analyzed. I will be analyzing the data by proposing ten questions regarding specific entities of the data set and using statistical methods to answer these questions.
These are the ten questions that I have chosen given my understanding of the data:
The first step is to access the data set. For convenience, I have removed all empty values from the initial, raw data set. This way, most chunks of code will not need to have empty values removed. The chunks of code can use the already “cleaned” data.
# call in data set
data_raw <- read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
# remove all NA values
data_clean <- na.omit(data_raw)
# display first six rows
head(data_clean)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## 8 Auburn University at Montgomery AL 100830 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## 8 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## 8 3 4 Public Southeast City 32.36736 -86.17754 0.8254
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## 8 22 1083 0 4211 49.6 38.3 1.3 2.6 8.2 23.2
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## 8 15310 19892 8020 17140 8709 7487 7518
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
## 8 97.4 44.2 27.65 959 69.3 38.2 29.7
# create scatter plot
plot(data_clean$CompRate ~ data_clean$AvgSAT,
main = "Average SAT Score vs Completion Rate",
xlab = "Average SAT Score",
ylab = "Completion Rate (%)",
pch = 19,
col = "royalblue")
The plot indicates that a strong positive relationship exists between average SAT scores and the completion rate of a university.
# create scatter plot
plot(data_clean$FacSalary, data_clean$MedIncome,
main = "Average Faculty Salary vs Median Student Income",
xlab = "Average Faculty Salary ($)",
ylab = "Median Student Income ($)",
pch = 19,
col = "purple")
The plot indicates that a weak positive relationship exists between average faculty salary and student median income.
# create scatter plot
plot(data_clean$FullTimeFac, data_clean$CompRate,
main = "Full-Time Faculty vs Completion Rate",
xlab = "Full-Time Faculty (%)",
ylab = "Completion Rate (%)",
pch = 19,
col = "darkgreen")
The plot indicates that a no relationship exists between the percentage of full-time faculty impact the completion rate.
# set up region counter
Counter <- table(data_raw$Region)
# display region count
print(Counter)
##
## Midwest Northeast Southeast Territory West
## 492 552 475 48 445
# create counter data frame
CounterDF <- as.data.frame(Counter)
# set up data frame column names
colnames(CounterDF) <- c("Region", "Count")
# create bar plot
barplot(Counter,
main = "Number of Universities by Region",
xlab = "Region",
ylab = "Number of Universities",
col = "aquamarine",
las = 1,
ylim = c(0, 600))
The plot indicates that the Northeast region has the largest number of universities with 552, while the Territories region has the smallest number universities with 48.
# select desired data
TuitionIn <- c(data_raw$TuitionIn)
TuitionOut <- c(data_raw$TuitonOut)
PartTime <- c(data_raw$PartTime)
# calculate correlation values
CorrelationIn <- cor(PartTime, TuitionIn, use = "complete.obs")
CorrelationOut <- cor(PartTime, TuitionOut, use = "complete.obs")
# display correlation values
cat("Correlation value between Part-Time Percentage and In-State Tuition:", CorrelationIn, "\n")
## Correlation value between Part-Time Percentage and In-State Tuition: -0.3482614
cat("Correlation value between Part-Time Percentage and Out-of-State Tuition:", CorrelationOut, "\n")
## Correlation value between Part-Time Percentage and Out-of-State Tuition: -0.412103
There is a greater correlation between the percentage of part-time students and out-of-state tuition fees.
# calculate mean values
mean_Public <- mean(data_clean$Female[data_clean$Control == "Public"])
mean_Private <- mean(data_clean$Female[data_clean$Control == "Private"])
mean_Profit <- mean(data_clean$Female[data_clean$Control == "Profit"])
# set values to display one decimal place
form_Public <- sprintf("%.1f", mean_Public)
form_Private <- sprintf("%.1f", mean_Private)
form_Profit <- sprintf("%.1f", mean_Profit)
# display mean values
cat("Mean Percentage Information: \n")
## Mean Percentage Information:
cat("The mean percentage of female students in public institutions is", form_Public, "% \n")
## The mean percentage of female students in public institutions is 57.1 %
cat("The mean percentage of female students in private institutions is", form_Private, "% \n")
## The mean percentage of female students in private institutions is 58.4 %
cat("The mean percentage of female students in profit institutions is", form_Profit, "% \n")
## The mean percentage of female students in profit institutions is 56.7 %
# create box plot
boxplot(data_clean$Female ~ data_clean$Control,
outline = FALSE,
main = "Percentage of Female Students vs Institution Type",
xlab = "Type of Instituition",
ylab = "Percentage of Female Students (%)",
col = "pink",
horizontal = FALSE,
ylim = c(0, 100))
The plot indicates that private institutions have the highest percentage of female students with 58.4%, while profit institutions have the lowest percentage of female students with 56.7%.
# calculate mean values
mean_City <- mean(data_clean$TuitionIn[data_clean$Locale == "City"])
mean_Rural <- mean(data_clean$TuitionIn[data_clean$Locale == "Rural"])
mean_Suburb <- mean(data_clean$TuitionIn[data_clean$Locale == "Suburb"])
mean_Town <- mean(data_clean$TuitionIn[data_clean$Locale == "Town"])
# set values to display two decimal places
form_City <- sprintf("%.2f", mean_City)
form_Rural <- sprintf("%.2f", mean_Rural)
form_Suburb <- sprintf("%.2f", mean_Suburb)
form_Town <- sprintf("%.2f", mean_Town)
# display mean values
cat("Mean Tuition Information: \n")
## Mean Tuition Information:
cat("The mean tuition of universities in cities is $", form_City, "\n")
## The mean tuition of universities in cities is $ 24324.51
cat("The mean tuition of universities in rural areas is $", form_Rural, "\n")
## The mean tuition of universities in rural areas is $ 19701.22
cat("The mean tuition of universities in suburbs is $", form_Suburb, "\n")
## The mean tuition of universities in suburbs is $ 25981.50
cat("The mean tuition of universities in towns is $", form_Town, "\n")
## The mean tuition of universities in towns is $ 20452.37
# create box plot
boxplot(data_clean$TuitionIn ~ data_clean$Locale,
outline = FALSE,
main = "Tuition Costs by Locale",
xlab = "Locale",
ylab = "Tuition Cost ($)",
col = "gold",
horizontal = FALSE)
The plot indicates that universities in suburbs have the highest average tuition at $25,981.50 while universities in rural areas have the lowest average tuition at $19,701.22.
# select desired data
PrivMedIncome <- data_clean$MedIncome[data_clean$Control == "Private"]
# calculate percentile values
Q1 <- quantile(PrivMedIncome, 0.25)
Q2 <- quantile(PrivMedIncome, 0.50)
Q3 <- quantile(PrivMedIncome, 0.75)
# display percentile values
cat("IQR Information: \n")
## IQR Information:
cat("The 25th Percentile value is $", Q1, "\n")
## The 25th Percentile value is $ 39.8
cat("The 50th Percentile value is $", Q2, "\n")
## The 50th Percentile value is $ 52.7
cat("The 75th Percentile value is $", Q3, "\n")
## The 75th Percentile value is $ 69.5
# create box plot
boxplot(PrivMedIncome,
outline = FALSE,
main = "Median Income of Students at Private Universities",
xlab = "Median Income (Thousand $)",
col = "orange",
horizontal = TRUE)
The plot indicates that the interquartile range of median incomes for students attending private universities spans from $39,800 to $69,500 with a mean value of $52,700.
# select desired data
DebtData <- c(data_raw$Debt[data_clean$Control == "Private"])
# calculate mean value
DebtMean <- mean(DebtData, na.rm = TRUE)
# calculate uncertainty value
DebtUnc <- quantile(order(DebtData), 0.95)
# set values to display two decimal places
formatted_DebtMean <- sprintf("%.2f", DebtMean)
formatted_DebtUnc <- sprintf("%.2f", DebtUnc)
# display values
cat("Mean value of Debt is $", formatted_DebtMean,"\n")
## Mean value of Debt is $ 2404.05
cat("Margin of error with 95% confidence level is $", formatted_DebtUnc,"\n")
## Margin of error with 95% confidence level is $ 1168.55
# select desired data
PublicWhite <- c(data_clean$White[data_clean$Control == "Public"])
PublicBlack <- c(data_clean$Black[data_clean$Control == "Public"])
PublicHispanic <- c(data_clean$Hispanic[data_clean$Control == "Public"])
PublicAsian <- c(data_clean$White[data_clean$Control == "Public"])
# calculate mean values
AvgWhite <- c(mean(PublicWhite))
AvgBlack <- c(mean(PublicBlack))
AvgHispanic <- c(mean(PublicHispanic))
AvgAsian <- c(mean(PublicAsian))
# set values to display one decimal place
form_AvgWhite <- sprintf("%.1f", AvgWhite)
form_AvgBlack <- sprintf("%.1f", AvgBlack)
form_AvgHispanic <- sprintf("%.1f", AvgHispanic)
form_AvgAsian <- sprintf("%.1f", AvgAsian)
# display mean values
cat("Mean Percentage Information: \n")
## Mean Percentage Information:
cat("The average percentage of White students is", form_AvgWhite, "% \n")
## The average percentage of White students is 58.4 %
cat("The average percentage of Black students is", form_AvgBlack, "% \n")
## The average percentage of Black students is 14.3 %
cat("The average percentage of Hispanic students is", form_AvgHispanic, "% \n")
## The average percentage of Hispanic students is 11.9 %
cat("The average percentage of Asian students is", form_AvgAsian, "% \n")
## The average percentage of Asian students is 58.4 %
# create set of mean values
PublicData <- c(AvgWhite, AvgBlack, AvgHispanic, AvgAsian)
# create bar plot
barplot(PublicData,
main = "Mean Percentage of Students by Ethnicity in Public Universities",
names.arg = c("White", "Black", "Hispanic", "Asian"),
xlab = "Ethnicities",
ylab = "Mean Percentage (%)",
col = "maroon",
las = 1,
ylim = c(0, 100))
The plot indicates that both White and Asian students have the highest percentages in public universities at 58.4%, while Hispanic students have the lowest percentages in public universities at 11.9%.