Title: Stat 353 Project 1: Exploring College Data

Author: “Sagar Neupane”

Date: “2024-11-10”

Output: html_document


Introduction

This assignment examines the features and trends of U.S. universities using the descriptive statistics and visualization techniques covered in Chapter 6 of the course. The “CollegeScores4yr” dataset contains a wide range of college-related information, including admission rates, tuition expenses, student demographics, graduation rates, and faculty salaries. The analysis explores several areas of higher education to gain insight into factors related to college affordability, accessibility, and achievement.

For this project, I propose the following 10 research questions, to be answered with descriptive statistics and visualizations of the “CollegeScores4yr” dataset:

  1. What is the average admission rate for colleges across all states?
  2. What is the distribution of the average SAT scores across all colleges?
  3. How does the average net price vary between public and private colleges?
  4. What is the variance and standard deviation of the percentage of Hispanic students enrolled across all colleges?
  5. What is the proportion of colleges by control type (Control)?
  6. How does the median percentage of Black students vary by region?
  7. What is the distribution of mid-range ACT scores across colleges?
  8. How does the average graduation rate differ between colleges with high and low part-time enrollment rates?
  9. What is the average faculty salary across public and private institutions?
  10. Is there a relationship between the average SAT scores of colleges and their respective admission rates?

Analysis

For the analysis, we aim to explore key aspects of the “CollegeScores4yr” dataset by addressing ten diverse questions. We will calculate averages, medians, variances, and proportions to understand admission rates, tuition costs, and student demographics. Visualizations such as histograms, boxplots, and pie charts will help us examine the distribution of SAT and ACT scores, compare net prices between public and private colleges, and visualize the proportion of colleges by control type. Additionally, we will analyze regional differences in Black student enrollment, explore the relationship between part-time enrollment and graduation rates, and compare faculty salaries across institution types. These analyses will provide meaningful insights into U.S. colleges, highlighting affordability, diversity, and institutional characteristics.

college = read.csv("https://www.lock5stat.com/datasets3e/CollegeScores4yr.csv")
head(college)
##                                  Name State     ID Main
## 1            Alabama A & M University    AL 100654    1
## 2 University of Alabama at Birmingham    AL 100663    1
## 3                  Amridge University    AL 100690    1
## 4 University of Alabama in Huntsville    AL 100706    1
## 5            Alabama State University    AL 100724    1
## 6           The University of Alabama    AL 100751    1
##                                                                Accred
## 1 Southern Association of Colleges and Schools Commission on Colleges
## 2 Southern Association of Colleges and Schools Commission on Colleges
## 3 Southern Association of Colleges and Schools Commission on Colleges
## 4 Southern Association of Colleges and Schools Commission on Colleges
## 5 Southern Association of Colleges and Schools Commission on Colleges
## 6 Southern Association of Colleges and Schools Commission on Colleges
##   MainDegree HighDegree Control    Region Locale Latitude Longitude AdmitRate
## 1          3          4  Public Southeast   City 34.78337 -86.56850    0.9027
## 2          3          4  Public Southeast   City 33.50570 -86.79935    0.9181
## 3          3          4 Private Southeast   City 32.36261 -86.17401        NA
## 4          3          4  Public Southeast   City 34.72456 -86.64045    0.8123
## 5          3          4  Public Southeast   City 32.36432 -86.29568    0.9787
## 6          3          4  Public Southeast   City 33.21187 -87.54598    0.5330
##   MidACT AvgSAT Online Enrollment White Black Hispanic Asian Other PartTime
## 1     18    929      0       4824   2.5  90.7      0.9   0.2   5.6      6.6
## 2     25   1195      0      12866  57.8  25.9      3.3   5.9   7.1     25.2
## 3     NA     NA      1        322   7.1  14.3      0.6   0.3  77.6     54.4
## 4     28   1322      0       6917  74.2  10.7      4.6   4.0   6.5     15.0
## 5     18    935      0       4189   1.5  93.8      1.0   0.3   3.5      7.7
## 6     28   1278      0      32387  78.5  10.1      4.7   1.2   5.6      7.9
##   NetPrice  Cost TuitionIn TuitonOut TuitionFTE InstructFTE FacSalary
## 1    15184 22886      9857     18236       9227        7298      6983
## 2    17535 24129      8328     19032      11612       17235     10640
## 3     9649 15080      6900      6900      14738        5265      3866
## 4    19986 22108     10280     21480       8727        9748      9391
## 5    12874 19413     11068     19396       9003        7983      7399
## 6    21973 28836     10780     28100      13574       10894     10016
##   FullTimeFac Pell CompRate Debt Female FirstGen MedIncome
## 1        71.3 71.0    23.96 1068   56.4     36.6      23.6
## 2        89.9 35.3    52.92 3755   63.9     34.1      34.5
## 3       100.0 74.2    18.18  109   64.9     51.3      15.0
## 4        64.6 27.7    48.62 1347   47.6     31.0      44.8
## 5        54.2 73.8    27.69 1294   61.3     34.3      22.1
## 6        74.0 18.0    67.87 6430   61.5     22.6      66.7
  1. What is the average admission rate for colleges across all states?
mean_admission_rate <- mean(college$AdmitRate, na.rm = TRUE)
print(paste("Average Admission Rate:", mean_admission_rate))
## [1] "Average Admission Rate: 0.670202481840194"
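
Because `na.rm = TRUE` silently drops colleges without a reported admission rate, it is worth checking how many values the mean is actually based on. A minimal sketch, using only the six `AdmitRate` values printed by `head(college)` above (the vector name `admit_rate` is introduced here for illustration):

```r
# AdmitRate for the six colleges shown by head(college); row 3 is missing
admit_rate <- c(0.9027, 0.9181, NA, 0.8123, 0.9787, 0.5330)

n_missing <- sum(is.na(admit_rate))          # values dropped by na.rm = TRUE
n_used    <- sum(!is.na(admit_rate))         # values the mean is based on
mean_rate <- mean(admit_rate, na.rm = TRUE)  # mean of the non-missing values

# Report the mean as a percentage together with the sample size
cat(sprintf("Average admission rate: %.1f%% (n = %d, %d missing)\n",
            100 * mean_rate, n_used, n_missing))
```

Applied to the full `college$AdmitRate` column, the same calls would show how many colleges contribute to the reported average.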
  2. What is the distribution of the average SAT scores across all colleges?
hist(college$AvgSAT, main = "Distribution of Average SAT Scores", xlab = "Average SAT Score", col = "lightblue", border = "black")

  3. How does the average net price vary between public and private colleges?
boxplot(NetPrice ~ Control, data = college, main = "Net Price by College Type", ylab = "Net Price", col = c("orange", "skyblue", "lightgreen"))
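
A boxplot shows the comparison visually; a table of group medians gives the numbers behind it. The sketch below uses base R's `tapply()` on the six rows printed by `head(college)` (the data frame name `college_head` is an assumption made for this example), so the values illustrate the technique rather than the full dataset:

```r
# Control and NetPrice for the six colleges shown by head(college)
college_head <- data.frame(
  Control  = c("Public", "Public", "Private", "Public", "Public", "Public"),
  NetPrice = c(15184, 17535, 9649, 19986, 12874, 21973)
)

# Median net price within each control type
median_price <- tapply(college_head$NetPrice, college_head$Control,
                       median, na.rm = TRUE)
print(median_price)

# Number of colleges behind each median
print(table(college_head$Control))
```

Replacing `college_head` with `college` would give the medians for every control type in the dataset.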

  4. What is the variance and standard deviation of the percentage of Hispanic students enrolled across all colleges?
var_hispanic <- var(college$Hispanic, na.rm = TRUE)
sd_hispanic <- sd(college$Hispanic, na.rm = TRUE)
print(paste("Variance of Hispanic Student Percentage:", var_hispanic))
## [1] "Variance of Hispanic Student Percentage: 312.714534804842"
print(paste("Standard Deviation of Hispanic Student Percentage:", sd_hispanic))
## [1] "Standard Deviation of Hispanic Student Percentage: 17.6837364492022"
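
The two numbers above are linked: the standard deviation is the square root of the variance, and R's `var()` uses the sample formula with an n - 1 denominator. The following sketch checks both facts, first on the reported results and then on the six `Hispanic` values printed by `head(college)` (the variable names are chosen for illustration):

```r
# Reported results for the full dataset (copied from the output above)
reported_var <- 312.714534804842
reported_sd  <- 17.6837364492022
print(sqrt(reported_var))  # should agree with reported_sd

# Hispanic percentages for the six colleges shown by head(college)
hispanic <- c(0.9, 3.3, 0.6, 4.6, 1.0, 4.7)
n <- length(hispanic)

# Sample variance by hand: squared deviations from the mean over n - 1
manual_var <- sum((hispanic - mean(hispanic))^2) / (n - 1)
manual_sd  <- sqrt(manual_var)

# Both comparisons return TRUE when the hand calculation matches R
print(all.equal(manual_var, var(hispanic)))
print(all.equal(manual_sd, sd(hispanic)))
```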
  5. What is the proportion of colleges by control type (Control)?
# Calculate the proportion of colleges by control type
control_distribution <- table(college$Control)

# Convert to percentages
control_percentage <- prop.table(control_distribution) * 100

# Create a data frame for visualization
control_df <- data.frame(
  ControlType = names(control_percentage),
  Percentage = as.vector(control_percentage)
)

# Plot the pie chart
library(ggplot2)
ggplot(control_df, aes(x = "", y = Percentage, fill = ControlType)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Proportion of Colleges by Control Type", fill = "Control Type") +
  theme_void() +
  scale_fill_manual(values = c("Public" = "lightblue", "Private" = "orange", "Profit" = "green"))

  6. How does the median percentage of Black students vary by region?
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
median_black_by_region <- college %>% group_by(Region) %>% summarize(MedianBlack = median(Black, na.rm = TRUE))
barplot(median_black_by_region$MedianBlack, names.arg = median_black_by_region$Region, main = "Median Black Student Percentage by Region", col = "lightcoral")

  7. What is the distribution of mid-range ACT scores across colleges?
hist(college$MidACT, 
     main = "Distribution of Mid-range ACT Scores Across Colleges", 
     xlab = "Mid-range ACT Score", 
     col = "lightblue", 
     border = "black")

  8. How does the average graduation rate differ between colleges with high and low part-time enrollment rates?

college <- college %>% mutate(HighPartTime = ifelse(PartTime > median(college$PartTime, na.rm = TRUE), "High", "Low"))
avg_grad_rate_by_enrollment <- college %>% group_by(HighPartTime) %>% summarize(AverageGradRate = mean(CompRate, na.rm = TRUE))
barplot(avg_grad_rate_by_enrollment$AverageGradRate, names.arg = avg_grad_rate_by_enrollment$HighPartTime, main = "Average Graduation Rate by Part-Time Enrollment Level", col = c("lightblue", "lightgreen"))
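
The median split assigns "High" only to colleges strictly above the median, so a college exactly at the median falls into "Low", and a missing `PartTime` value produces a missing label. A base R sketch of the same grouping on the six rows printed by `head(college)` (the data frame name `college_head` is hypothetical):

```r
# PartTime and CompRate for the six colleges shown by head(college)
college_head <- data.frame(
  PartTime = c(6.6, 25.2, 54.4, 15.0, 7.7, 7.9),
  CompRate = c(23.96, 52.92, 18.18, 48.62, 27.69, 67.87)
)

# Split at the median part-time percentage (strictly above = "High")
cutoff <- median(college_head$PartTime, na.rm = TRUE)
college_head$HighPartTime <- ifelse(college_head$PartTime > cutoff,
                                    "High", "Low")

# Mean completion rate and group size for each level
avg_comp <- tapply(college_head$CompRate, college_head$HighPartTime,
                   mean, na.rm = TRUE)
print(cutoff)
print(avg_comp)
print(table(college_head$HighPartTime, useNA = "ifany"))
```

Printing the group sizes alongside the means makes it easy to see whether the two groups are balanced and whether any colleges were left unlabeled.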

  9. What is the average faculty salary across public and private institutions?
avg_faculty_salary <- college %>% group_by(Control) %>% summarize(AvgSalary = mean(FacSalary, na.rm = TRUE))
barplot(avg_faculty_salary$AvgSalary, names.arg = avg_faculty_salary$Control, main = "Average Faculty Salary by College Type", col = c("orange", "skyblue", "lightgreen"))

  10. Is there a relationship between the average SAT scores of colleges and their respective admission rates?
# ggplot2 and dplyr are already attached and `college` is already loaded above,
# so the libraries and dataset do not need to be loaded again here.

# Filter the dataset for non-missing values
college_filtered <- college %>%
  filter(!is.na(AvgSAT) & !is.na(AdmitRate))

# Create the scatter plot
ggplot(college_filtered, aes(x = AvgSAT, y = AdmitRate)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Relationship Between Average SAT Scores and Admission Rates",
       x = "Average SAT Score",
       y = "Admission Rate") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Calculate the correlation coefficient
correlation <- cor(college_filtered$AvgSAT, college_filtered$AdmitRate)
print(paste("Correlation Coefficient:", round(correlation, 2)))
## [1] "Correlation Coefficient: -0.42"
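
Filtering out incomplete rows before calling `cor()` is equivalent to passing `use = "complete.obs"`, and squaring the correlation gives the share of variance in admission rate accounted for by a straight-line fit. A short sketch, using the six rows printed by `head(college)` for the first part and the reported coefficient for the second (variable names are illustrative):

```r
# AvgSAT and AdmitRate for the six colleges shown by head(college)
avg_sat    <- c(929, 1195, NA, 1322, 935, 1278)
admit_rate <- c(0.9027, 0.9181, NA, 0.8123, 0.9787, 0.5330)

# Option 1: drop incomplete pairs by hand, then correlate
keep <- !is.na(avg_sat) & !is.na(admit_rate)
r_filtered <- cor(avg_sat[keep], admit_rate[keep])

# Option 2: let cor() drop incomplete pairs
r_complete <- cor(avg_sat, admit_rate, use = "complete.obs")
print(all.equal(r_filtered, r_complete))

# Squared correlation from the reported (rounded) full-data coefficient
reported_r <- -0.42
r_squared  <- reported_r^2
print(round(r_squared, 2))
```

Because the reported coefficient is rounded to two decimals, the squared value is approximate: a little under one fifth of the variation in admission rates is accounted for by the linear relationship with average SAT score.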

Summary

This project examined U.S. colleges, using descriptive statistics and charts to answer ten questions about college data. We looked at admission rates, college costs, student demographics, and institutional characteristics. Key findings include an average admission rate of about 67%, a negative correlation of -0.42 between average SAT score and admission rate, differences in Black student enrollment across regions, and differences in net price between public and private colleges. Histograms, boxplots, bar charts, and a pie chart helped us examine the distributions of SAT and ACT scores, how graduation rates compare between colleges with high and low part-time enrollment, and differences in faculty salaries. These results provide insight into college cost, diversity, and student success, and show how useful descriptive data analysis can be.