CollegeScores4yr Data Analysis

Introduction

This report provides an analysis of the “CollegeScores4yr” dataset, which contains data about universities, including variables such as tuition, faculty salary, and student demographics. The aim is to explore these variables using descriptive statistics and visualizations to gain insight.

Table of Contents

1. Mean of Faculty Salaries.
2. Median of Completion Rate.
3. Variance of Student Debt.
4. Standard Deviation of Average SAT Scores.
5. Correlation Between Median Family Income and Student Debt.
6. Barplot of Universities by Control Type.
7. Histogram of Percentage of Pell Grant Recipients.
8. Range of Tuition Fees per Student.
9. Boxplot of Faculty Salaries by Region.
10. Pie Chart of Universities with More than 50% Female Students.

1. Mean of Faculty Salaries

# 1. Mean of Faculty Salaries 
# Calculate the mean of faculty salaries
mean_fac_salary <- mean(CollegeScores4yr$FacSalary, na.rm = TRUE)
print(paste("Mean Faculty Salary:", mean_fac_salary))

## [1] "Mean Faculty Salary: 7465.77834525026"

Insight on question 1:

The code calculates the mean of the FacSalary variable in the CollegeScores4yr dataset. Passing na.rm = TRUE to mean() drops missing (NA) values before averaging; without it, a single NA would make the result NA. The result, about 7,466, is the average faculty salary across the institutions that report one. This figure gives a baseline for benchmarking and for comparing salaries across regions or types of institutions. A group mean above this baseline might indicate better-funded institutions or regions with higher costs of living, while a group mean below it could reflect more budget-constrained environments or regions with lower living costs.
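To illustrate what na.rm = TRUE changes, here is a minimal sketch on a small made-up vector. The name toy_salary and its values are hypothetical and are not taken from the dataset.

```r
# Hypothetical toy values, not taken from CollegeScores4yr
toy_salary <- c(6000, 7500, NA, 9000)

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(toy_salary)

# With na.rm = TRUE, the NA is dropped and the remaining values are averaged
mean_without_na <- mean(toy_salary, na.rm = TRUE)

# Number of values that actually entered the average
n_used <- sum(!is.na(toy_salary))

print(mean_with_na)
print(mean_without_na)
print(n_used)
```

Reporting the count of non-missing values alongside the mean makes clear how many institutions the average is based on.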

2. Median of Completion Rate

# Calculate the median completion rate
median_comp_rate <- median(CollegeScores4yr$CompRate, na.rm = TRUE)
print(paste("Median Completion Rate:", median_comp_rate))

## [1] "Median Completion Rate: 52.45"

Insight on question 2:

The provided code calculates the median completion rate from the CompRate variable in the CollegeScores4yr dataset. By using the median() function with na.rm = TRUE, the code excludes any missing (NA) values to ensure an accurate calculation. The result is the median value, which represents the midpoint of the completion rate data, meaning half of the universities have a completion rate below this value and half have a rate above it. This metric is useful for understanding the typical completion rate among the institutions in the dataset, offering a measure that is less affected by extreme values compared to the mean.
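The claim that the median is less affected by extreme values than the mean can be checked with a short sketch. The vectors below are hypothetical toy values, not completion rates from the dataset.

```r
# Hypothetical completion rates (percent), not taken from the dataset
toy_rate <- c(40, 45, 50, 55, 60)

# Same values, but the largest one is replaced by an extreme value
toy_rate_outlier <- c(40, 45, 50, 55, 100)

median_base <- median(toy_rate)
median_outlier <- median(toy_rate_outlier)

mean_base <- mean(toy_rate)
mean_outlier <- mean(toy_rate_outlier)

# The median does not move, while the mean shifts toward the extreme value
print(c(median_base = median_base, median_outlier = median_outlier))
print(c(mean_base = mean_base, mean_outlier = mean_outlier))
```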

3. Variance of Student Debt

# Calculate the variance of student debt
var_debt <- var(CollegeScores4yr$Debt, na.rm = TRUE)
print(paste("Variance of Student Debt:", var_debt))

## [1] "Variance of Student Debt: 28740170.9285079"

Insight on question 3:

The code calculates the variance of student debt (the Debt variable) in the CollegeScores4yr dataset. Passing na.rm = TRUE to var() excludes missing (NA) values, so the calculation uses only institutions with a recorded debt figure. The variance, about 28.7 million, measures how spread out debt amounts are around their mean. Because variance is expressed in squared units, its square root (the standard deviation, about 5,361 in the original units of Debt) is easier to interpret. A higher variance means that student debt levels differ more across institutions, while a lower variance indicates more consistent debt levels. This metric helps quantify how much student financial burdens vary across universities in the dataset.
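As a sketch of what var() computes, the sample variance can be reproduced by hand: sum the squared deviations from the mean and divide by n - 1. The vector toy_debt is hypothetical and only serves as an illustration.

```r
# Hypothetical debt values, not taken from the dataset
toy_debt <- c(15000, 20000, NA, 25000, 30000)

# Drop missing values, as na.rm = TRUE does
x <- toy_debt[!is.na(toy_debt)]

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

# Built-in equivalent
builtin_var <- var(toy_debt, na.rm = TRUE)

print(manual_var)
print(builtin_var)
```

The division by n - 1 rather than n is what makes this the sample variance, which is the definition R's var() uses.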

4. Standard Deviation of Average SAT Scores

# Calculate the standard deviation of average SAT scores
sd_avg_sat <- sd(CollegeScores4yr$AvgSAT, na.rm = TRUE)
print(paste("Standard Deviation of Average SAT Scores:", sd_avg_sat))

## [1] "Standard Deviation of Average SAT Scores: 128.90771927285"

Insight on question 4:

The code calculates the standard deviation of average SAT scores (the AvgSAT variable) in the CollegeScores4yr dataset. Passing na.rm = TRUE to sd() excludes missing (NA) values from the calculation. The result, about 128.9 points, quantifies how far institutions' average SAT scores typically fall from the overall mean. This describes the spread and consistency of SAT scores across universities, indicating the level of variation in academic competitiveness within the dataset.
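One way to make a standard deviation concrete is to count how many values fall within one standard deviation of the mean. The sketch below does this on a hypothetical vector (toy_sat is made up for illustration) and also confirms that sd() is the square root of var().

```r
# Hypothetical average SAT scores, not taken from the dataset
toy_sat <- c(950, 1050, 1100, 1200, NA, 1300)

sd_builtin <- sd(toy_sat, na.rm = TRUE)

# The standard deviation is the square root of the variance
sd_from_var <- sqrt(var(toy_sat, na.rm = TRUE))

# Count the non-missing values within one standard deviation of the mean
center <- mean(toy_sat, na.rm = TRUE)
within_one_sd <- sum(abs(toy_sat - center) <= sd_builtin, na.rm = TRUE)

print(sd_builtin)
print(within_one_sd)
```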

5. Correlation Between Median Family Income and Student Debt

# Calculate the correlation between median family income and student debt
cor_med_income_debt <- cor(CollegeScores4yr$MedIncome, CollegeScores4yr$Debt, use = "complete.obs")
print(paste("Correlation between Median Family Income and Student Debt:", cor_med_income_debt))

## [1] "Correlation between Median Family Income and Student Debt: -0.120722092197048"

Insight on question 5:

The correlation between median family income and student debt indicates how socioeconomic status may relate to the financial obligations of students. The computed value, about -0.12, is a weak negative linear association: institutions whose students come from higher-income families tend to have slightly lower student debt, but the relationship is far from strong. A correlation does not establish causation, so this result is best read as a pointer to where educational financial policies and support systems might merit closer study to ensure equitable access to higher education and reduce the financial burden on students.
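The use = "complete.obs" argument keeps only the rows where both variables are present. A minimal sketch of that behavior, using hypothetical toy vectors rather than the dataset's columns:

```r
# Hypothetical values (in thousands), not taken from the dataset
toy_income <- c(30, 45, NA, 60, 80, 100)
toy_debt   <- c(26, 25, 24, NA, 22, 21)

# Correlation using only pairs where both values are present
r_complete <- cor(toy_income, toy_debt, use = "complete.obs")

# The same filtering done explicitly
keep <- complete.cases(toy_income, toy_debt)
r_manual <- cor(toy_income[keep], toy_debt[keep])

# Number of pairs that entered the calculation
n_pairs <- sum(keep)

print(r_complete)
print(n_pairs)
```

A pair is dropped if either value is missing, so the number of pairs used can be smaller than the number of non-missing values in either variable alone.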

6. Barplot of Universities by Control Type

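# Bar chart of the number of universities in each control type
library(ggplot2)  # provides ggplot(); harmless if already loaded
ggplot(CollegeScores4yr, aes(x = Control)) +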
  geom_bar(fill = "lightblue") +
  ggtitle("Distribution of Universities by Control Type") +
  xlab("Control Type") +
  ylab("Number of Universities")

Insight on question 6:

The bar chart provides a foundational overview of how universities are distributed by control type in the CollegeScores4yr dataset. The counts can guide further research into financial, demographic, and policy-related aspects of higher education. Expanding the analysis with additional metrics, for example by comparing the other variables in this report across control types, would give a more comprehensive picture of the trends and their implications.
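The bar heights are simply category counts, so they can also be reported as numbers. The sketch below uses base R on a hypothetical vector (toy_control and its labels are illustrative, not the dataset's actual values).

```r
# Hypothetical control types, not taken from the dataset
toy_control <- c("Public", "Private", "Profit", "Private", "Public", "Private")

# Counts per category: these are the heights a bar chart would show
control_counts <- table(toy_control)

# The same information as percentages of all institutions
control_share <- round(100 * prop.table(control_counts), 1)

print(control_counts)
print(control_share)
```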

7. Histogram of Percentage of Pell Grant Recipients

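# Histogram of Pell Grant percentages in 5-point bins; rows with NA in Pell are dropped with a warning
ggplot(CollegeScores4yr, aes(x = Pell)) +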
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "black") +
  ggtitle("Histogram of Percentage of Pell Grant Recipients") +
  xlab("Percentage of Pell Grant Recipients") +
  ylab("Frequency")

## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_bin()`).

Insight on question 7:

The histogram of the percentage of Pell Grant recipients shows that while a large proportion of universities have moderate levels of grant support, a non-negligible number cater to a significantly higher percentage of students who rely on financial aid. This analysis could guide further investigations into institutional characteristics, outcomes for students receiving grants, and potential policy initiatives.
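To back a reading of the histogram with numbers, the bin counts can be extracted directly. This is a minimal base R sketch on hypothetical values (toy_pell is made up); the bin edges chosen here may differ from those geom_histogram() picks for binwidth = 5.

```r
# Hypothetical Pell Grant percentages, not taken from the dataset
toy_pell <- c(12, 18, 22, 23, 27, 31, 34, 36, 41, NA, 58, 72)

# Missing values cannot be binned, which is what the plotting warning reports
n_dropped <- sum(!is.finite(toy_pell))
pell_clean <- toy_pell[is.finite(toy_pell)]

# Bins of width 5 covering 0 to 100; plot = FALSE returns the counts only
bins <- hist(pell_clean, breaks = seq(0, 100, by = 5), plot = FALSE)

# Pair each bin's lower edge with its count
bin_table <- data.frame(lower = head(bins$breaks, -1), count = bins$counts)

print(n_dropped)
print(bin_table[bin_table$count > 0, ])
```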

8. Range of Tuition Fees per Student

# Minimum and maximum of tuition per student, ignoring missing values
range_tuition <- range(CollegeScores4yr$TuitionFTE, na.rm = TRUE)
print(paste("Range of Tuition Fees per Student:", paste(range_tuition, collapse = " - ")))

## [1] "Range of Tuition Fees per Student: 0 - 58553"

Insight on question 8:

The code calculates the range of tuition fees per student using the TuitionFTE variable in the CollegeScores4yr dataset. Setting na.rm = TRUE excludes NA values from the calculation. The range() function returns the minimum and maximum, here 0 and 58,553, which are joined with " - " and printed in a single string. A minimum of 0 is worth a closer look, since it may reflect institutions that charge no tuition or values recorded as zero for other reasons.
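Since range() returns the two endpoints rather than a single number, the width of the range and the number of zero values have to be computed separately. A short sketch on hypothetical values (toy_tuition is made up for illustration):

```r
# Hypothetical tuition values, not taken from the dataset
toy_tuition <- c(0, 4200, NA, 15800, 30500)

# range() returns c(minimum, maximum)
rng <- range(toy_tuition, na.rm = TRUE)

# Width of the range: maximum minus minimum
span <- diff(rng)

# How many institutions report exactly zero
n_zero <- sum(toy_tuition == 0, na.rm = TRUE)

print(rng)
print(span)
print(n_zero)
```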

9. Boxplot of Faculty Salaries by Region

# Side-by-side boxplots of faculty salary for each region
ggplot(CollegeScores4yr, aes(x = Region, y = FacSalary)) +
  geom_boxplot(fill = "lightblue") +
  ggtitle("Boxplot of Faculty Salaries by Region") +
  xlab("Region") +
  ylab("Faculty Salary")

## Warning: Removed 54 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Insight on question 9:

This box plot analysis highlights significant regional differences in faculty salaries, with the Northeast and West leading in median pay. The presence of outliers and variability in certain regions suggests a mix of institutional practices and external economic factors influencing pay. Further analysis could provide more detailed insights into the drivers of these differences and their implications for faculty recruitment and retention.
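Statements about which regions lead in median pay can be supported by computing the medians the boxplot draws. The sketch below uses tapply() on a hypothetical data frame (toy and its values are illustrative, not the dataset's).

```r
# Hypothetical regions and salaries, not taken from the dataset
toy <- data.frame(
  Region = c("Northeast", "Northeast", "Northeast",
             "West", "West",
             "Midwest", "Midwest", "Midwest"),
  FacSalary = c(9000, 8000, 8600, 8500, NA, 7000, 6500, 7200)
)

# Median salary per region: the center line of each box
region_median <- tapply(toy$FacSalary, toy$Region, median, na.rm = TRUE)

# Interquartile range per region: the height of each box
region_iqr <- tapply(toy$FacSalary, toy$Region, IQR, na.rm = TRUE)

# Regions ordered from highest to lowest median
print(sort(region_median, decreasing = TRUE))
print(region_iqr)
```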

10. Pie Chart of Universities with More than 50% Female Students

# Count universities above and at-or-below 50% female enrollment, ignoring missing values
above_50_female <- sum(CollegeScores4yr$Female > 50, na.rm = TRUE)
below_50_female <- sum(CollegeScores4yr$Female <= 50, na.rm = TRUE)
pie_values <- c(above_50_female, below_50_female)
pie_labels <- c("More than 50% Female", "50% or Less Female")
# Draw the pie chart of the two groups
pie(pie_values,
    labels = pie_labels,
    main = "Percentage of Universities with More than 50% Female Students",
    col = c("lightcoral", "lightblue"))

Insight on question 10:

The pie chart illustrates the distribution of universities based on the proportion of female students enrolled. Specifically:

More than 50% Female: This category, represented by the larger section of the chart (~82.8%), indicates that a significant majority of universities have a student body composed of more than 50% female students.

50% or Less Female: The smaller section of the chart (~17.2%) represents the universities where the proportion of female students is 50% or less.

This visualization highlights that the vast majority of universities have a predominantly female student population, suggesting gender distribution trends in higher education enrollment.
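The pie chart itself does not print the percentages, so they need to be computed from the two counts. A minimal sketch of that step on hypothetical values (toy_female is made up for illustration):

```r
# Hypothetical percentages of female students, not taken from the dataset
toy_female <- c(62, 55, 48, NA, 71, 50, 58)

# Counts in each group; NA values fall into neither group
above <- sum(toy_female > 50, na.rm = TRUE)
below <- sum(toy_female <= 50, na.rm = TRUE)

# Share of each group among institutions with a recorded value
pct <- round(100 * c(above, below) / (above + below), 1)

# Labels that include the percentage, suitable for pie(labels = ...)
pct_labels <- paste0(c("More than 50% Female", "50% or Less Female"),
                     " (", pct, "%)")

print(pct_labels)
```

Putting the percentages in the labels lets a reader verify the shares without estimating slice sizes by eye.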

Summary

Data Collection: The data set was sourced directly from a public URL, ensuring easy access and reproducibility.

Initial Observations: Basic data inspection was performed, including checking dimensions and previewing the first few rows.

Data Cleaning: Missing values were assessed, and the na.omit() function was used to exclude rows containing NA values to ensure the analysis was based on complete observations.

Summary of Approach: The data set’s collection and initial cleaning process focused on ensuring data integrity and preparing it for meaningful analysis. Missing values were handled appropriately, and the data was structured for further statistical exploration and visualization.
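Removing incomplete rows with na.omit() and dropping missing values one variable at a time with na.rm = TRUE are not equivalent, and the two approaches can give different statistics. The following sketch shows the difference on a hypothetical data frame (toy and its columns a and b are made up for illustration).

```r
# Hypothetical data frame with missing values in different rows
toy <- data.frame(a = c(1, 2, NA, 4), b = c(10, NA, 30, 40))

# Missing values per column
na_per_column <- colSums(is.na(toy))

# Row-wise removal: keeps only rows with no NA in any column
toy_complete <- na.omit(toy)
mean_rowwise <- mean(toy_complete$a)

# Per-variable removal: drops only the NA values of the column being summarized
mean_varwise <- mean(toy$a, na.rm = TRUE)

print(na_per_column)
print(nrow(toy_complete))
print(c(rowwise = mean_rowwise, varwise = mean_varwise))
```

Row-wise removal discards a row even when the variable of interest is present, so it generally leaves fewer observations than per-variable removal.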