Introduction

In this project, a data set containing various statistics for all four year colleges and universities will be analyzed. I will be analyzing the data by proposing ten questions regarding specific entities of the data set and using statistical methods to answer these questions.

These are the ten questions that I have chosen given my understanding of the data:

  1. Does the average SAT score correlate with the completion rate of a university?
  2. What is the relationship between average faculty salary and student median income?
  3. How does the percentage of full-time faculty impact the completion rate?
  4. What is the distribution of universities across all regions?
  5. Is there a correlation between the percentage of part-time students and the tuition fees (in-state vs. out-of-state)?
  6. Is the female student percentage higher in any particular type of institution (e.g., public vs. private)?
  7. Are universities in urban areas (Locale data) associated with higher tuition costs compared to rural ones?
  8. What is distribution of student median incomes for students attending private universities?
  9. Calculate the mean value of debt for private institutions. What is the margin of error at a 95% confidence level?
  10. Find the mean percentage of each ethnicity in public universities.

Analysis

The first step is to access the data set. For convenience, I have removed all empty values from the initial, raw data set. This way, most chunks of code will not need to have empty values removed. The chunks of code can use the already “cleaned” data.

# call in data set
data_raw <- read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")

# remove all NA values
data_clean <- na.omit(data_raw)

# display first six rows
head(data_clean)
##                                  Name State     ID Main
## 1            Alabama A & M University    AL 100654    1
## 2 University of Alabama at Birmingham    AL 100663    1
## 4 University of Alabama in Huntsville    AL 100706    1
## 5            Alabama State University    AL 100724    1
## 6           The University of Alabama    AL 100751    1
## 8     Auburn University at Montgomery    AL 100830    1
##                                                                Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## 8 Southern Association of Colleges and Schools Commission on Colleges
##   MainDegree HighDegree Control    Region Locale Latitude Longitude AdmitRate
## 1          3          4  Public Southeast   City 34.78337 -86.56850    0.9027
## 2          3          4  Public Southeast   City 33.50570 -86.79935    0.9181
## 4          3          4  Public Southeast   City 34.72456 -86.64045    0.8123
## 5          3          4  Public Southeast   City 32.36432 -86.29568    0.9787
## 6          3          4  Public Southeast   City 33.21187 -87.54598    0.5330
## 8          3          4  Public Southeast   City 32.36736 -86.17754    0.8254
##   MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1     18    929      0       4824   2.5  90.7      0.9   0.2   5.6      6.6
## 2     25   1195      0      12866  57.8  25.9      3.3   5.9   7.1     25.2
## 4     28   1322      0       6917  74.2  10.7      4.6   4.0   6.5     15.0
## 5     18    935      0       4189   1.5  93.8      1.0   0.3   3.5      7.7
## 6     28   1278      0      32387  78.5  10.1      4.7   1.2   5.6      7.9
## 8     22   1083      0       4211  49.6  38.3      1.3   2.6   8.2     23.2
##   NetPrice  Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1    15184 22886      9857     18236       9227        7298      6983
## 2    17535 24129      8328     19032      11612       17235     10640
## 4    19986 22108     10280     21480       8727        9748      9391
## 5    12874 19413     11068     19396       9003        7983      7399
## 6    21973 28836     10780     28100      13574       10894     10016
## 8    15310 19892      8020     17140       8709        7487      7518
##   FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1        71.3 71.0    23.96 1068   56.4     36.6      23.6
## 2        89.9 35.3    52.92 3755   63.9     34.1      34.5
## 4        64.6 27.7    48.62 1347   47.6     31.0      44.8
## 5        54.2 73.8    27.69 1294   61.3     34.3      22.1
## 6        74.0 18.0    67.87 6430   61.5     22.6      66.7
## 8        97.4 44.2    27.65  959   69.3     38.2      29.7

Question 1: Does the average SAT score correlate with the completion rate of a university?

# create scatter plot
plot(data_clean$CompRate ~ data_clean$AvgSAT, 
     main = "Average SAT Score vs Completion Rate",
     xlab = "Average SAT Score", 
     ylab = "Completion Rate (%)",
     pch = 19, 
     col = "royalblue")

The plot indicates that a strong positive relationship exists between average SAT scores and the completion rate of a university.

Question 2: What is the relationship between average faculty salary and student median income?

# create scatter plot
plot(data_clean$FacSalary, data_clean$MedIncome,
     main = "Average Faculty Salary vs Median Student Income",
     xlab = "Average Faculty Salary ($)",
     ylab = "Median Student Income ($)",
     pch = 19, 
     col = "purple")

The plot indicates that a weak positive relationship exists between average faculty salary and student median income.

Question 3: How does the percentage of full-time faculty impact the completion rate?

# create scatter plot
plot(data_clean$FullTimeFac, data_clean$CompRate,
     main = "Full-Time Faculty vs Completion Rate",
     xlab = "Full-Time Faculty (%)",
     ylab = "Completion Rate (%)",
     pch = 19, 
     col = "darkgreen")

The plot indicates that a no relationship exists between the percentage of full-time faculty impact the completion rate.

Question 4: What is the distribution of universities across all regions?

# set up region counter
Counter <- table(data_raw$Region)

# display region count
print(Counter)
## 
##   Midwest Northeast Southeast Territory      West 
##       492       552       475        48       445
# create counter data frame
CounterDF <- as.data.frame(Counter)

# set up data frame column names
colnames(CounterDF) <- c("Region", "Count")

# create bar plot
barplot(Counter,
        main = "Number of Universities by Region",
        xlab = "Region",
        ylab = "Number of Universities",
        col = "aquamarine",
        las = 1,
        ylim = c(0, 600))

The plot indicates that the Northeast region has the largest number of universities with 552, while the Territories region has the smallest number universities with 48.

Question 5: What are the correlations between the percentage of part-time students and the tuition fees (in-state vs. out-of-state)?

# select desired data
TuitionIn <- c(data_raw$TuitionIn)
TuitionOut <- c(data_raw$TuitonOut)
PartTime <- c(data_raw$PartTime)

# calculate correlation values
CorrelationIn <- cor(PartTime, TuitionIn, use = "complete.obs")
CorrelationOut <- cor(PartTime, TuitionOut, use = "complete.obs")

# display correlation values
cat("Correlation value between Part-Time Percentage and In-State Tuition:", CorrelationIn, "\n")
## Correlation value between Part-Time Percentage and In-State Tuition: -0.3482614
cat("Correlation value between Part-Time Percentage and Out-of-State Tuition:", CorrelationOut, "\n")
## Correlation value between Part-Time Percentage and Out-of-State Tuition: -0.412103

There is a greater correlation between the percentage of part-time students and out-of-state tuition fees.

Question 6: Is the female student percentage higher in any particular type of institution?

# calculate mean values
mean_Public <- mean(data_clean$Female[data_clean$Control == "Public"])
mean_Private <- mean(data_clean$Female[data_clean$Control == "Private"])
mean_Profit <- mean(data_clean$Female[data_clean$Control == "Profit"])

# set values to display one decimal place
form_Public <- sprintf("%.1f", mean_Public)
form_Private <- sprintf("%.1f", mean_Private)
form_Profit <- sprintf("%.1f", mean_Profit)

# display mean values
cat("Mean Percentage Information: \n")
## Mean Percentage Information:
cat("The mean percentage of female students in public institutions is", form_Public, "% \n")
## The mean percentage of female students in public institutions is 57.1 %
cat("The mean percentage of female students in private institutions is", form_Private, "% \n")
## The mean percentage of female students in private institutions is 58.4 %
cat("The mean percentage of female students in profit institutions is", form_Profit, "% \n")
## The mean percentage of female students in profit institutions is 56.7 %
# create box plot
boxplot(data_clean$Female ~ data_clean$Control, 
     outline = FALSE,
     main = "Percentage of Female Students vs Institution Type", 
     xlab = "Type of Instituition",
     ylab = "Percentage of Female Students (%)",
     col = "pink",
     horizontal = FALSE,
     ylim = c(0, 100))

The plot indicates that private institutions have the highest percentage of female students with 58.4%, while profit institutions have the lowest percentage of female students with 56.7%.

Question 7: Are universities in certain areas associated with higher tuition costs compared to others?

# calculate mean values
mean_City <- mean(data_clean$TuitionIn[data_clean$Locale == "City"])
mean_Rural <- mean(data_clean$TuitionIn[data_clean$Locale == "Rural"])
mean_Suburb <- mean(data_clean$TuitionIn[data_clean$Locale == "Suburb"])
mean_Town <- mean(data_clean$TuitionIn[data_clean$Locale == "Town"])

# set values to display two decimal places
form_City <- sprintf("%.2f", mean_City)
form_Rural <- sprintf("%.2f", mean_Rural)
form_Suburb <- sprintf("%.2f", mean_Suburb)
form_Town <- sprintf("%.2f", mean_Town)

# display mean values
cat("Mean Tuition Information: \n")
## Mean Tuition Information:
cat("The mean tuition of universities in cities is $", form_City, "\n")
## The mean tuition of universities in cities is $ 24324.51
cat("The mean tuition of universities in rural areas is $", form_Rural, "\n")
## The mean tuition of universities in rural areas is $ 19701.22
cat("The mean tuition of universities in suburbs is $", form_Suburb, "\n")
## The mean tuition of universities in suburbs is $ 25981.50
cat("The mean tuition of universities in towns is $", form_Town, "\n")
## The mean tuition of universities in towns is $ 20452.37
# create box plot
boxplot(data_clean$TuitionIn ~ data_clean$Locale,
     outline = FALSE,
     main = "Tuition Costs by Locale", 
     xlab = "Locale",
     ylab = "Tuition Cost ($)",
     col = "gold",
     horizontal = FALSE)

The plot indicates that universities in suburbs have the highest average tuition at $25,981.50 while universities in rural areas have the lowest average tuition at $19,701.22.

Question 8: What is distribution of median incomes for students attending private universities?

# select desired data
PrivMedIncome <- data_clean$MedIncome[data_clean$Control == "Private"]

# calculate percentile values
Q1 <- quantile(PrivMedIncome, 0.25)
Q2 <- quantile(PrivMedIncome, 0.50)
Q3 <- quantile(PrivMedIncome, 0.75)

# display percentile values
cat("IQR Information: \n")
## IQR Information:
cat("The 25th Percentile value is $", Q1, "\n")
## The 25th Percentile value is $ 39.8
cat("The 50th Percentile value is $", Q2, "\n")
## The 50th Percentile value is $ 52.7
cat("The 75th Percentile value is $", Q3, "\n")
## The 75th Percentile value is $ 69.5
# create box plot
boxplot(PrivMedIncome, 
     outline = FALSE,
     main = "Median Income of Students at Private Universities", 
     xlab = "Median Income (Thousand $)",
     col = "orange",
     horizontal = TRUE)

The plot indicates that the interquartile range of median incomes for students attending private universities spans from $39,800 to $69,500 with a mean value of $52,700.

Question 9: Calculate the mean value of debt for private institutions. What is the margin of error at a 95% confidence level?

# select desired data
DebtData <- c(data_raw$Debt[data_clean$Control == "Private"])

# calculate mean value
DebtMean <- mean(DebtData, na.rm = TRUE)

# calculate uncertainty value
DebtUnc <- quantile(order(DebtData), 0.95)

# set values to display two decimal places
formatted_DebtMean <- sprintf("%.2f", DebtMean)
formatted_DebtUnc <- sprintf("%.2f", DebtUnc)

# display values
cat("Mean value of Debt is $", formatted_DebtMean,"\n")
## Mean value of Debt is $ 2404.05
cat("Margin of error with 95% confidence level is $", formatted_DebtUnc,"\n")
## Margin of error with 95% confidence level is $ 1168.55

Question 10: Find the mean percentage of each ethnicity in public universities.

# select desired data
PublicWhite <- c(data_clean$White[data_clean$Control == "Public"])
PublicBlack <- c(data_clean$Black[data_clean$Control == "Public"])
PublicHispanic <- c(data_clean$Hispanic[data_clean$Control == "Public"])
PublicAsian <- c(data_clean$White[data_clean$Control == "Public"])

# calculate mean values
AvgWhite <- c(mean(PublicWhite))
AvgBlack <- c(mean(PublicBlack))
AvgHispanic <- c(mean(PublicHispanic))
AvgAsian <- c(mean(PublicAsian))

# set values to display one decimal place
form_AvgWhite <- sprintf("%.1f", AvgWhite)
form_AvgBlack <- sprintf("%.1f", AvgBlack)
form_AvgHispanic <- sprintf("%.1f", AvgHispanic)
form_AvgAsian <- sprintf("%.1f", AvgAsian)

# display mean values
cat("Mean Percentage Information: \n")
## Mean Percentage Information:
cat("The average percentage of White students is", form_AvgWhite, "% \n")
## The average percentage of White students is 58.4 %
cat("The average percentage of Black students is", form_AvgBlack, "% \n")
## The average percentage of Black students is 14.3 %
cat("The average percentage of Hispanic students is", form_AvgHispanic, "% \n")
## The average percentage of Hispanic students is 11.9 %
cat("The average percentage of Asian students is", form_AvgAsian, "% \n")
## The average percentage of Asian students is 58.4 %
# create set of mean values
PublicData <- c(AvgWhite, AvgBlack, AvgHispanic, AvgAsian)

# create bar plot
barplot(PublicData,
        main = "Mean Percentage of Students by Ethnicity in Public Universities",
        names.arg = c("White", "Black", "Hispanic", "Asian"),
        xlab = "Ethnicities",
        ylab = "Mean Percentage (%)",
        col = "maroon",
        las = 1,
        ylim = c(0, 100))

The plot indicates that both White and Asian students have the highest percentages in public universities at 58.4%, while Hispanic students have the lowest percentages in public universities at 11.9%.