I would like to explore the following 10 questions based on the data:
1. Which region has the highest percentage of first-generation students on average?
2. Are ACT/SAT scores higher on average at public or private universities?
3. What are the mean and distribution of in-state tuition and of out-of-state tuition?
4. How does school control affect admission rates?
5. Is there a relationship between the percentage of first-generation students and completion rate?
6. How does median family income affect the cost?
7. Are there differences in instructional spending per FTE based on whether it is a main campus or a branch campus?
8. Do public and private universities have differences in faculty salary?
9. Does student population affect completion rate?
10. Do universities with higher cost have a higher completion rate?
I chose these questions because they are interesting to me, and the results are meaningful in my mind. For example, standard deviation is not a value that is easy to conceptualize, and in most cases it is hard to draw conclusions from it alone. Because of that, I chose not to explore any questions where standard deviation was the most important statistic. Correlation is a useful statistic for measuring how strongly two variables are linearly related, so a lot of my questions use correlation as the main statistic.
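To make the correlation statistic concrete, here is a minimal sketch of what `cor()` computes by default (the Pearson correlation): the sum of the products of the two variables' z-scores, divided by n - 1. The vectors `x` and `y` are made-up illustrative values, not taken from the college data.

```r
# Hypothetical toy data (NOT from CollegeScores4yr)
x <- c(10, 20, 30, 40, 50)
y <- c(82, 75, 70, 61, 58)

# Standardize each variable: subtract the mean, divide by the standard deviation
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Pearson correlation = sum of z-score products divided by (n - 1)
r_manual <- sum(zx * zy) / (length(x) - 1)

# Built-in equivalent
r_builtin <- cor(x, y)

r_manual
r_builtin
```

The two values should agree up to floating-point rounding. The result always falls between -1 and 1, and the sign shows the direction of the relationship.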
Here is a breakdown of how I will be exploring each question:
1. Calculate the average percentage of first-generation students, grouped by region, and sort the results. This gives one average for each region.
2. Calculate the mean ACT and SAT scores, grouped by school control (public, private, or profit).
3. Calculate the mean in-state tuition and the mean out-of-state tuition. Then make two histograms to show the distributions.
4. Find the average admission rate for each of the three types of school control: Public, Private, and Profit.
5. Find the correlation between first-generation student percentage and completion rate. Then make a scatterplot to visually check for a correlation.
6. Find the correlation between median family income and cost. Then make a scatterplot to visually check for a correlation.
7. Find the mean instructional spending per FTE, grouped by whether the campus is a main campus or a branch campus.
8. Find the mean faculty salary by school control, and compare public and private universities.
9. Find the correlation between student population and completion rate. Then make a scatterplot to visually check for a correlation.
10. Find the correlation between cost and completion rate. Then make a scatterplot to visually check for a correlation.
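Four of the steps above share the same pattern: compute a correlation, then draw a plot of the same two variables. A minimal sketch of that pattern as a reusable function is shown here. The helper name `cor_and_plot` and the data frame `toy` are hypothetical, introduced only for illustration, and the values in `toy` are made up.

```r
# Hypothetical helper: correlation plus scatterplot for two numeric columns.
# `data` is a data frame; `x` and `y` are column names given as strings.
cor_and_plot <- function(data, x, y, ...) {
  # Keep only rows where both variables are present
  ok <- complete.cases(data[, c(x, y)])
  r <- cor(data[[x]][ok], data[[y]][ok])
  plot(data[[x]][ok], data[[y]][ok],
       xlab = x, ylab = y,
       main = sprintf("r = %.3f", r), ...)
  invisible(r)
}

# Made-up example data with one missing value (NOT from the college data)
toy <- data.frame(
  FirstGen = c(20, 35, 50, NA, 45),
  CompRate = c(80, 60, 40, 55, 50)
)
r_toy <- cor_and_plot(toy, "FirstGen", "CompRate")
r_toy
```

Dropping incomplete rows inside the function mirrors what `use = "complete.obs"` does in `cor()`, so the plotted points and the reported correlation are based on the same rows.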
college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
# Load necessary library
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Calculate the average percentage of first generation students and sort by region
firstgenByRegion <- college %>%
group_by(Region) %>%
summarize(AvgFirstGen = mean(FirstGen, na.rm = TRUE)) %>%
arrange(desc(AvgFirstGen))
#Display the result
firstgenByRegion
## # A tibble: 5 × 2
## Region AvgFirstGen
## <chr> <dbl>
## 1 West 37.7
## 2 Southeast 34.6
## 3 Territory 33.7
## 4 Midwest 31.9
## 5 Northeast 30.7
# Calculate the average ACT and SAT scores by school control
testScores <- college %>%
group_by(Control) %>%
summarize(
AvgACT = mean(MidACT, na.rm = TRUE),
AvgSAT = mean(AvgSAT, na.rm = TRUE)
)
# Display the result
testScores
## # A tibble: 3 × 3
## Control AvgACT AvgSAT
## <chr> <dbl> <dbl>
## 1 Private 23.8 1146.
## 2 Profit 23.4 1124.
## 3 Public 23.0 1119.
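Because `na.rm = TRUE` silently drops schools that do not report a score, a group mean may rest on only a few schools. The sketch below reports how many values went into each mean. It uses base R only so that it runs without extra packages, and the data frame `toy` contains made-up rows that merely reuse the column names `Control` and `MidACT`.

```r
# Made-up rows using the same column names as the college data
toy <- data.frame(
  Control = c("Public", "Public", "Private", "Private", "Profit", "Profit"),
  MidACT  = c(22, 24, 25, NA, 23, NA)
)

# For each control type: mean of the reported scores and how many were reported
actSummary <- do.call(rbind, lapply(split(toy, toy$Control), function(g) {
  data.frame(
    Control   = g$Control[1],
    AvgACT    = mean(g$MidACT, na.rm = TRUE),
    NReported = sum(!is.na(g$MidACT)),
    NSchools  = nrow(g)
  )
}))
actSummary
```

In a dplyr pipeline the same count could be added inside `summarize()` with `sum(!is.na(MidACT))`.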
# Calculate and display mean in and out of state tuition
meanIN = mean(college$TuitionIn, na.rm = TRUE)
meanOut = mean(college$TuitonOut, na.rm = TRUE)
meanIN
## [1] 21948.55
meanOut
## [1] 25336.66
# Make distribution diagrams to show the differences visually
hist(college$TuitionIn,
breaks = 20,
main = "Distribution of in-state tuition",
col = "blue",
xlab = "In-state tuition"
)
hist(college$TuitonOut,
breaks = 20,
main = "Distribution of out of state tuition",
col = 'red',
xlab = "Out of state tuition"
)
# Calculate the average admission rate by school control
admissionByControl <- college %>%
group_by(Control) %>%
summarize(AvgAdmitRate = mean(AdmitRate, na.rm = TRUE))
# Display the result
admissionByControl
## # A tibble: 3 × 2
## Control AvgAdmitRate
## <chr> <dbl>
## 1 Private 0.650
## 2 Profit 0.745
## 3 Public 0.701
# Calculate the correlation between First generation students and completion rate
correlation <- cor(college$FirstGen, college$CompRate, use = "complete.obs")
correlation
## [1] -0.6643909
# Create a scatterplot with all the data points to show any trends
plot(college$FirstGen, college$CompRate, xlab = "Percent of First Generation Students", ylab = "Percent of students who finish within 150% of normal time")
# Calculate the correlation between median family income and cost
correlation_income_cost <- cor(college$MedIncome, college$Cost, use = "complete.obs")
correlation_income_cost
## [1] 0.589288
plot(college$MedIncome, college$Cost, xlab = "Median Family Income (Thousands)", ylab = "Total cost to attend")
# Compare average instructional spending per FTE for main and branch campuses
spendingByCampus <- college %>%
group_by(Main) %>%
summarize(AvgInstructFTE = mean(InstructFTE, na.rm = TRUE))
# Display the results
spendingByCampus
## # A tibble: 2 × 2
## Main AvgInstructFTE
## <int> <dbl>
## 1 0 5803.
## 2 1 11418.
# Compare average faculty salary between public and private universities
salaryByControl <- college %>%
group_by(Control) %>%
summarize(AvgFacSalary = mean(FacSalary, na.rm = TRUE))
# Display the results
salaryByControl
## # A tibble: 3 × 2
## Control AvgFacSalary
## <chr> <dbl>
## 1 Private 7091.
## 2 Profit 6234.
## 3 Public 8520.
# Creating Subsets to make Boxplots
collegePublic <- subset(college, Control == "Public")
collegePrivate <- subset(college, Control == "Private")
collegeProfit <- subset(college, Control == "Profit")
boxplot(FacSalary ~ Control,
data = collegePublic,
main = "Faculty Monthly Salary for Public Universities",
col = "red", # Color for the public universities
ylab = "Faculty Monthly Salary",
horizontal = FALSE)
boxplot(FacSalary ~ Control,
data = collegePrivate,
main = "Faculty Monthly Salary for Private Universities",
col = "blue", # Color for the private universities
ylab = "Faculty Monthly Salary",
horizontal = FALSE)
boxplot(FacSalary ~ Control,
data = collegeProfit,
main = "Faculty Monthly Salary for Profit Universities",
col = "green", # Color for the profit universities
ylab = "Faculty Monthly Salary",
horizontal = FALSE)
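As an alternative to three separate plots, the formula `FacSalary ~ Control` can draw all groups in one boxplot on a shared axis, which makes the groups easier to compare directly. This is a sketch on made-up salaries. The data frame `toy` is hypothetical and only reuses the column names from the college data.

```r
# Made-up salaries using the same column names as the college data
toy <- data.frame(
  Control   = rep(c("Public", "Private", "Profit"), each = 4),
  FacSalary = c(8000, 8500, 9000, 8700,
                6500, 7000, 7500, 7200,
                5800, 6100, 6400, 6300)
)

# One call draws a box per control type on a common y-axis
bp <- boxplot(FacSalary ~ Control,
              data = toy,
              main = "Faculty Monthly Salary by School Control",
              col  = c("blue", "green", "red"),  # Private, Profit, Public (alphabetical)
              ylab = "Faculty Monthly Salary")

# Group labels in plotting order
bp$names
```

The groups are plotted in alphabetical order of `Control`, so the colors are listed in that order as well.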
# Calculate the correlation between Enrollment and Completion Rate
correlation <- cor(college$Enrollment, college$CompRate, use = "complete.obs")
correlation
## [1] 0.1678195
plot(college$Enrollment, college$CompRate, xlab = "Student Population", ylab = "Completion Rate")
# Calculate the correlation between Cost and Completion Rate
correlation_cost_comprate <- cor(college$Cost, college$CompRate, use = "complete.obs")
correlation_cost_comprate
## [1] 0.5870019
plot(college$Cost, college$CompRate, xlab = "Cost", ylab = "Completion Rate")
The region with the highest percent of first generation students is the West with 37.7 percent of students being first generation. Next is the Southeast with 34.6 percent. Third is US territories with 33.7 percent. In fourth is the Midwest with 31.9 percent of students being first generation. And last is the Northeast with 30.7 percent of students being first generation.
The average ACT score is highest at private universities at 23.8, then profit at 23.4, and then public at 23.0. The average SAT score is also highest at private universities at 1146, then profit at 1124, and finally public at 1119.
The mean in-state tuition across all colleges is 21948.55. The mean out-of-state tuition is 25336.66. The histograms show how in-state and out-of-state tuition are each distributed and help visualize the difference between them.
The highest average admission rate comes from profit schools at 0.745. Next is public schools at 0.701. Last is private schools at 0.650.
There is a moderately strong negative trend: colleges with more first-generation students tend to have lower completion rates. The correlation value is -0.664.
There is a moderate positive correlation of 0.589 between median family income and total cost to attend college. This could imply that families with more wealth are more willing to spend more money for college.
Instructional spending per FTE is nearly twice as high at main campuses as at branch campuses. The average is 11418 at main campuses and 5803 at branch campuses.
The average faculty salary is highest at public universities (8520), second highest at private universities (7091), and lowest at profit universities (6234). This was surprising because it is well known that cost is higher at private universities, but here we see that faculty are paid more on average at public universities.
There is a weak positive correlation of 0.168 between student population and completion rate. It is too weak to suggest a meaningful linear relationship between the two.
There is a clear trend that colleges with higher cost tend to have a higher completion rate, with a correlation of 0.587. The plot is interesting, as it appears to show two separate trend lines. It would be very interesting to see if there is a certain factor that causes these trends, or if it just happened by chance.
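Whether a correlation is statistically significant depends on the sample size as well as its magnitude, so it is a separate question from whether the correlation is strong. One way this could be checked for the student population question is `cor.test()`, sketched here on made-up numbers. The data frame `toy` is hypothetical, and no such test was run on the college data in this report.

```r
# Made-up enrollment and completion values (NOT from the college data)
toy <- data.frame(
  Enrollment = c(1200, 3400, 560, 15000, 8000, 2300, 950, 21000),
  CompRate   = c(48, 55, 60, 62, 51, 45, 58, 66)
)

# Test the null hypothesis that the true correlation is zero
result <- cor.test(toy$Enrollment, toy$CompRate)

result$estimate   # sample correlation
result$p.value    # p-value for the test
result$conf.int   # 95% confidence interval for the correlation
```

A small p-value would indicate that the correlation is unlikely to be zero in the population, even if the correlation itself is weak.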
It is important when drawing conclusions about data to remember the difference between correlation and causation. For example, there is a strong trend that colleges with more first generation students have a lower completion rate, as we explored in question 5. Someone might look at this and conclude that first generation students may not have as much motivation to finish college as other students. We do not have enough information to make this conclusion. All we can conclude from the data we have is what is shown in the calculations and plots.
All data was sourced from this website: https://www.lock5stat.com/datapage3e.html