Title: Stat 353 Project 1: Exploring College Data
Author: Sagar Neupane
Date: 2024-11-10
This assignment examines the features and trends of US universities using the descriptive statistics and visualization techniques described in Chapter 6 of the course. The “CollegeScores4yr” dataset contains a wide range of college-related information, such as admission rates, tuition expenses, student demographics, graduation rates, and faculty salaries. The analysis investigates several areas of higher education to gain insight into the factors that influence college affordability, accessibility, and achievement. It is organized around ten research questions, each answered with descriptive statistics or a visualization.
For the analysis, we aim to explore key aspects of the “CollegeScores4yr” dataset by addressing ten diverse questions. We will calculate averages, medians, variances, and proportions to understand admission rates, tuition costs, and student demographics. Visualizations such as histograms, boxplots, and pie charts will help us examine the distribution of SAT and ACT scores, compare net prices between public and private colleges, and visualize the proportion of colleges by control type. Additionally, we will analyze regional differences in Black student enrollment, explore the relationship between part-time enrollment and graduation rates, and compare faculty salaries across institution types. These analyses will provide meaningful insights into U.S. colleges, highlighting affordability, diversity, and institutional characteristics.
college <- read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
## Name State ID Main
## 1 Alabama A & M University AL 100654 1
## 2 University of Alabama at Birmingham AL 100663 1
## 3 Amridge University AL 100690 1
## 4 University of Alabama in Huntsville AL 100706 1
## 5 Alabama State University AL 100724 1
## 6 The University of Alabama AL 100751 1
## Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
## MainDegree HighDegree Control Region Locale Latitude Longitude AdmitRate
## 1 3 4 Public Southeast City 34.78337 -86.56850 0.9027
## 2 3 4 Public Southeast City 33.50570 -86.79935 0.9181
## 3 3 4 Private Southeast City 32.36261 -86.17401 NA
## 4 3 4 Public Southeast City 34.72456 -86.64045 0.8123
## 5 3 4 Public Southeast City 32.36432 -86.29568 0.9787
## 6 3 4 Public Southeast City 33.21187 -87.54598 0.5330
## MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1 18 929 0 4824 2.5 90.7 0.9 0.2 5.6 6.6
## 2 25 1195 0 12866 57.8 25.9 3.3 5.9 7.1 25.2
## 3 NA NA 1 322 7.1 14.3 0.6 0.3 77.6 54.4
## 4 28 1322 0 6917 74.2 10.7 4.6 4.0 6.5 15.0
## 5 18 935 0 4189 1.5 93.8 1.0 0.3 3.5 7.7
## 6 28 1278 0 32387 78.5 10.1 4.7 1.2 5.6 7.9
## NetPrice Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1 15184 22886 9857 18236 9227 7298 6983
## 2 17535 24129 8328 19032 11612 17235 10640
## 3 9649 15080 6900 6900 14738 5265 3866
## 4 19986 22108 10280 21480 8727 9748 9391
## 5 12874 19413 11068 19396 9003 7983 7399
## 6 21973 28836 10780 28100 13574 10894 10016
## FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1 71.3 71.0 23.96 1068 56.4 36.6 23.6
## 2 89.9 35.3 52.92 3755 63.9 34.1 34.5
## 3 100.0 74.2 18.18 109 64.9 51.3 15.0
## 4 64.6 27.7 48.62 1347 47.6 31.0 44.8
## 5 54.2 73.8 27.69 1294 61.3 34.3 22.1
## 6 74.0 18.0 67.87 6430 61.5 22.6 66.7
mean_admission_rate <- mean(college$AdmitRate, na.rm = TRUE)
print(paste("Average Admission Rate:", mean_admission_rate))
## [1] "Average Admission Rate: 0.670202481840194"
hist(college$AvgSAT, main = "Distribution of Average SAT Scores", xlab = "Average SAT Score", col = "lightblue", border = "black")
boxplot(NetPrice ~ Control, data = college, main = "Net Price by College Type", ylab = "Net Price", col = c("orange", "skyblue", "lightgreen"))
var_hispanic <- var(college$Hispanic, na.rm = TRUE)
sd_hispanic <- sd(college$Hispanic, na.rm = TRUE)
print(paste("Variance of Hispanic Student Percentage:", var_hispanic))
## [1] "Variance of Hispanic Student Percentage: 312.714534804842"
print(paste("Standard Deviation of Hispanic Student Percentage:", sd_hispanic))
## [1] "Standard Deviation of Hispanic Student Percentage: 17.6837364492022"
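As a check on what `var()` and `sd()` report, the sketch below recomputes both by hand on a small made-up vector. The values in `x` are illustrative assumptions, not taken from the dataset. R's `var()` computes the sample variance, dividing by n - 1, and `sd()` is its square root.

```r
# Toy percentages, made up for illustration (not from CollegeScores4yr)
x <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)

# Drop missing values, mirroring na.rm = TRUE
x_obs <- x[!is.na(x)]
n <- length(x_obs)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x_obs - mean(x_obs))^2) / (n - 1)

# Standard deviation: square root of the variance
manual_sd <- sqrt(manual_var)

# Both comparisons should agree with the built-in functions
print(all.equal(manual_var, var(x, na.rm = TRUE)))
print(all.equal(manual_sd, sd(x, na.rm = TRUE)))
```

Because the standard deviation is in the same units as the data (percentage points here), it is usually the easier of the two to interpret.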
# Calculate the proportion of colleges by control type
control_distribution <- table(college$Control)
# Convert to percentages
control_percentage <- prop.table(control_distribution) * 100
# Create a data frame for visualization
control_df <- data.frame(
ControlType = names(control_percentage),
Percentage = as.vector(control_percentage)
)
# Plot the pie chart
library(ggplot2)
ggplot(control_df, aes(x = "", y = Percentage, fill = ControlType)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
labs(title = "Proportion of Colleges by Control Type", fill = "Control Type") +
theme_void() +
scale_fill_manual(values = c("Public" = "lightblue", "Private" = "orange", "Profit" = "green"))
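The percentage step above can be illustrated on toy data. This is a minimal sketch using a made-up vector named `control` (an assumption for illustration, not the real column): `table()` counts each category and `prop.table()` turns the counts into shares that sum to 1.

```r
# Made-up control types for eight hypothetical colleges
control <- c("Public", "Private", "Private", "Profit",
             "Public", "Private", "Private", "Public")

# Count colleges per category; table() orders the categories alphabetically
counts <- table(control)

# Convert counts to percentages of the total
pct <- prop.table(counts) * 100

print(names(pct))
print(round(pct, 1))

# The percentages should account for every college
print(all.equal(sum(pct), 100))
```

Since `table()` sorts categories alphabetically, mapping colors by name, as the named vector passed to `scale_fill_manual()` does, is safer than relying on position.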
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
median_black_by_region <- college %>% group_by(Region) %>% summarize(MedianBlack = median(Black, na.rm = TRUE))
barplot(median_black_by_region$MedianBlack, names.arg = median_black_by_region$Region, main = "Median Black Student Percentage by Region", col = "lightcoral")
hist(college$MidACT,
main = "Distribution of Mid-range ACT Scores Across Colleges",
xlab = "Mid-range ACT Score",
col = "lightblue",
border = "black")
8. How does the average graduation rate differ between colleges with high and low part-time enrollment rates?
# Label each college "High" or "Low" relative to the median part-time percentage
college <- college %>% mutate(HighPartTime = ifelse(PartTime > median(PartTime, na.rm = TRUE), "High", "Low"))
# Drop colleges with no part-time data so they do not form a separate NA group
avg_grad_rate_by_enrollment <- college %>% filter(!is.na(HighPartTime)) %>% group_by(HighPartTime) %>% summarize(AverageGradRate = mean(CompRate, na.rm = TRUE))
barplot(avg_grad_rate_by_enrollment$AverageGradRate, names.arg = avg_grad_rate_by_enrollment$HighPartTime, main = "Average Graduation Rate by Part-Time Enrollment Level", col = c("lightblue", "lightgreen"))
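To make the median split concrete, here is a small sketch on made-up part-time percentages. The vector `part_time` is hypothetical. It shows two details of the rule used above: values equal to the median are labeled "Low", and a missing value stays `NA` rather than receiving a label, so it would form its own group unless dropped.

```r
# Made-up part-time percentages, including a tie at the median and one NA
part_time <- c(5, 10, 20, 20, 40, 60, NA)

# Median of the observed values
cutoff <- median(part_time, na.rm = TRUE)

# Strictly above the median is "High"; at or below it is "Low"
group <- ifelse(part_time > cutoff, "High", "Low")

print(cutoff)
print(table(group, useNA = "ifany"))
```

With ties at the median, the two groups need not be the same size, which is worth keeping in mind when comparing their averages.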
avg_faculty_salary <- college %>% group_by(Control) %>% summarize(AvgSalary = mean(FacSalary, na.rm = TRUE))
barplot(avg_faculty_salary$AvgSalary, names.arg = avg_faculty_salary$Control, main = "Average Faculty Salary by College Type", col = c("orange", "skyblue", "lightgreen"))
# Load necessary libraries
library(ggplot2)
library(dplyr)
# Load the dataset
college <- read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
# Filter the dataset for non-missing values
college_filtered <- college %>%
filter(!is.na(AvgSAT) & !is.na(AdmitRate))
# Create the scatter plot
ggplot(college_filtered, aes(x = AvgSAT, y = AdmitRate)) +
geom_point(color = "blue", alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Relationship Between Average SAT Scores and Admission Rates",
x = "Average SAT Score",
y = "Admission Rate") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Calculate the correlation coefficient
correlation <- cor(college_filtered$AvgSAT, college_filtered$AdmitRate)
print(paste("Correlation Coefficient:", round(correlation, 2)))
## [1] "Correlation Coefficient: -0.42"
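The correlation step can be checked on toy data. The sketch below uses made-up values in a hypothetical data frame `toy`. It illustrates that Pearson's r equals the covariance divided by the product of the standard deviations, and that passing `use = "complete.obs"` to `cor()` gives the same result as filtering out incomplete rows first.

```r
# Made-up SAT scores and admission rates, with one NA in each column
toy <- data.frame(
  AvgSAT    = c(900, 1000, 1100, NA, 1300, 1400, 1200),
  AdmitRate = c(0.90, 0.85, 0.70, 0.60, 0.50, NA, 0.75)
)

# Approach 1: keep only rows where both variables are observed
complete <- toy[complete.cases(toy), ]
r_filtered <- cor(complete$AvgSAT, complete$AdmitRate)

# Approach 2: let cor() drop incomplete rows itself
r_complete_obs <- cor(toy$AvgSAT, toy$AdmitRate, use = "complete.obs")

# Pearson's r from its definition: covariance over the product of the SDs
r_manual <- cov(complete$AvgSAT, complete$AdmitRate) /
  (sd(complete$AvgSAT) * sd(complete$AdmitRate))

print(all.equal(r_filtered, r_complete_obs))
print(all.equal(r_filtered, r_manual))
print(r_filtered < 0)
```

A coefficient of -0.42, as reported above, indicates a moderate negative linear association: colleges with higher average SAT scores tend to admit a smaller share of applicants. Correlation alone does not establish that one causes the other.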
This project used descriptive statistics and charts to answer ten questions about U.S. colleges. We examined admission rates, tuition costs, student demographics, and institutional characteristics. Key findings include an average admission rate of about 67%, regional differences in Black student enrollment, and differences in net price between public and private colleges. Histograms, boxplots, bar charts, and a pie chart helped us examine SAT and ACT score distributions, the relationship between part-time enrollment and graduation rates, and differences in faculty salaries across institution types. Average SAT scores and admission rates were moderately negatively correlated (r = -0.42). Together, these results offer insight into college cost, diversity, and student success, and show how useful descriptive data analysis can be.