This report provides an analysis of the “CollegeScores4yr” dataset, which contains data about universities, including variables such as tuition, faculty salary, and student demographics. The aim is to explore these variables using descriptive statistics and visualizations to gain insight.
# 1. Mean of Faculty Salaries
# Calculate the mean of faculty salaries
mean_fac_salary <- mean(CollegeScores4yr$FacSalary, na.rm = TRUE)
print(paste("Mean Faculty Salary:", mean_fac_salary))
## [1] "Mean Faculty Salary: 7465.77834525026"
Insight on question 1:
The provided code calculates the mean of faculty salaries from the
FacSalary variable in the CollegeScores4yr
dataset. By using the mean() function with
na.rm = TRUE, any missing (NA) values are
excluded from the calculation; without this argument a single NA would
make the result NA. The mean
value represents the average salary for faculty across the institutions
in the dataset. This metric provides insight into the overall
compensation level for faculty members, which can be useful for
benchmarking and comparing salaries across different regions or types of
institutions. A higher mean salary might indicate better-funded
institutions or regions with higher costs of living, while a lower mean
salary could reflect more budget-constrained environments or regions
with lower living costs.
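The effect of na.rm = TRUE described above can be illustrated with a minimal sketch. The vector below is a toy example with hypothetical values, not data from CollegeScores4yr.

```r
# Toy vector (hypothetical values, not taken from CollegeScores4yr)
salaries <- c(6000, 7500, NA, 9000)

mean_with_na    <- mean(salaries)                # NA: one missing value propagates
mean_without_na <- mean(salaries, na.rm = TRUE)  # average of the three observed values
n_used          <- sum(!is.na(salaries))         # number of values that entered the mean

print(mean_with_na)
print(mean_without_na)
print(n_used)
```

Reporting the number of non-missing values alongside the mean makes it clear how many institutions the average is actually based on.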
# Calculate the median completion rate
median_comp_rate <- median(CollegeScores4yr$CompRate, na.rm = TRUE)
print(paste("Median Completion Rate:", median_comp_rate))
## [1] "Median Completion Rate: 52.45"
Insight on question 2:
The provided code calculates the median completion rate from the
CompRate variable in the CollegeScores4yr
dataset. By using the median() function with
na.rm = TRUE, the code excludes any missing
(NA) values, which would otherwise make the result NA. The result (52.45)
is the median value, which represents the midpoint of the completion
rate data, meaning half of the universities have a completion rate below
this value and half have a rate above it. This metric is useful for
understanding the typical completion rate among the institutions in the
dataset, offering a measure that is less affected by extreme values
compared to the mean.
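The claim that the median is less affected by extreme values than the mean can be checked with a small sketch. The completion rates below are hypothetical and chosen only for illustration.

```r
# Hypothetical completion rates (percent), not taken from CollegeScores4yr
rates         <- c(40, 50, 55, 60, 65)
rates_outlier <- c(2, 50, 55, 60, 65)  # same data with one extreme low value

mean_before   <- mean(rates)
mean_after    <- mean(rates_outlier)
median_before <- median(rates)
median_after  <- median(rates_outlier)

# The mean shifts noticeably; the median does not move
print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```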
# Calculate the variance of student debt
var_debt <- var(CollegeScores4yr$Debt, na.rm = TRUE)
print(paste("Variance of Student Debt:", var_debt))
## [1] "Variance of Student Debt: 28740170.9285079"
Insight on question 3:
The provided code calculates the variance of student debt in the
CollegeScores4yr dataset. By using the var()
function with na.rm = TRUE, any missing (NA)
values are excluded, ensuring that the calculation reflects only
complete data. The output, which is the variance of student debt,
indicates how spread out the student debt amounts are around their mean.
A higher variance means that student debt levels differ significantly
across institutions, while a lower variance indicates that the debt
levels are more consistent. This metric helps understand the degree of
variability in student financial burdens across universities in the data
set.
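Because variance is expressed in squared units, it can be easier to read after taking its square root, which gives the standard deviation in the same units as the Debt variable. The sketch below reuses the variance printed above; the variable names are introduced here for illustration only.

```r
# Variance of student debt as printed in the output above
var_debt_reported <- 28740170.9285079

# The standard deviation is the square root of the variance
sd_debt <- sqrt(var_debt_reported)
print(round(sd_debt))  # roughly 5361, in the units of Debt

# The same relationship holds for any numeric vector (toy values)
x <- c(10, 12, 15, 21)
print(all.equal(sd(x), sqrt(var(x))))
```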
# Calculate the standard deviation of average SAT scores
sd_avg_sat <- sd(CollegeScores4yr$AvgSAT, na.rm = TRUE)
print(paste("Standard Deviation of Average SAT Scores:", sd_avg_sat))
## [1] "Standard Deviation of Average SAT Scores: 128.90771927285"
Insight on question 4:
The provided code calculates the standard deviation of average SAT
scores in the CollegeScores4yr data set. By using the
sd() function with the na.rm = TRUE argument,
any missing (NA) values are excluded, so the result is based only on
institutions that report an average SAT score. The code outputs the
standard deviation, about 128.9 points, which quantifies
how much the SAT scores vary around the mean. This helps in
understanding the spread and consistency of SAT scores across different
universities, indicating the level of variation in academic
competitiveness within the data set.
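To make the meaning of "variation around the mean" concrete, the sketch below recomputes a standard deviation by hand and compares it with sd(). R's sd() uses the sample formula with an n - 1 denominator. The scores are hypothetical toy values.

```r
# Hypothetical average SAT scores, not taken from CollegeScores4yr
scores <- c(1000, 1100, 1200, 1300)

deviations <- scores - mean(scores)   # distance of each score from the mean
manual_sd  <- sqrt(sum(deviations^2) / (length(scores) - 1))

print(manual_sd)
print(all.equal(manual_sd, sd(scores)))
```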
# Calculate the correlation between median family income and student debt
cor_med_income_debt <- cor(CollegeScores4yr$MedIncome, CollegeScores4yr$Debt, use = "complete.obs")
print(paste("Correlation between Median Family Income and Student Debt:", cor_med_income_debt))
## [1] "Correlation between Median Family Income and Student Debt: -0.120722092197048"
Insight on question 5:
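The correlation between median family income and student debt is about -0.12, a weak negative linear association: institutions whose students come from higher-income families tend to report slightly lower student debt, but family income accounts for only a small share of the variation in debt. Because cor() was called with use = "complete.obs", only institutions with both values present enter the calculation. A correlation this weak does not establish that income drives debt; it suggests that other institutional factors matter more, and further analysis would be needed before drawing conclusions about financial aid policy or equitable access to higher education.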
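As a rough sketch of what use = "complete.obs" does, the example below compares it with filtering the pairs by hand using complete.cases(). It also squares the correlation reported above to express it as a share of variance in a simple linear fit. The income and debt vectors are hypothetical toy values.

```r
# Hypothetical paired values with gaps, not taken from CollegeScores4yr
income <- c(30, 45, NA, 80, 60)
debt   <- c(25, 22, 20, NA, 18)

# cor() with use = "complete.obs" drops any pair with a missing value
r_complete <- cor(income, debt, use = "complete.obs")

# Equivalent manual filtering
keep     <- complete.cases(income, debt)
r_manual <- cor(income[keep], debt[keep])
print(all.equal(r_complete, r_manual))

# Correlation printed in the output above, squared
r_reported <- -0.120722092197048
print(round(r_reported^2, 4))  # a little under 1.5% of the variance
```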
# Bar chart of the number of universities by control type
library(ggplot2)
ggplot(CollegeScores4yr, aes(x = Control)) +
  geom_bar(fill = "lightblue") +
  ggtitle("Distribution of Universities by Control Type") +
  xlab("Control Type") +
  ylab("Number of Universities")
Insight on question 6:
The bar chart gives a foundational overview of how universities in the CollegeScores4yr dataset are distributed by control type: each bar's height is the number of institutions in that category. These counts can guide further research into financial, demographic, and policy-related aspects of higher education, for example by comparing tuition, debt, or completion rates across control types.
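The counts that the bars represent can also be tabulated directly, which makes exact numbers and proportions available alongside the chart. This is a minimal sketch on a hypothetical vector; the category labels are assumptions and may not match the levels of Control in the actual dataset.

```r
# Hypothetical control types, not taken from CollegeScores4yr
control <- c("Public", "Private", "Private", "Profit", "Public", "Private")

counts <- table(control)        # number of institutions per category
shares <- prop.table(counts)    # the same counts as proportions

print(counts)
print(round(shares, 2))
```

On the real data, the equivalent call would be table(CollegeScores4yr$Control).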
# Histogram of the percentage of Pell Grant recipients, in 5-point bins
ggplot(CollegeScores4yr, aes(x = Pell)) +
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "black") +
  ggtitle("Histogram of Percentage of Pell Grant Recipients") +
  xlab("Percentage of Pell Grant Recipients") +
  ylab("Frequency")
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_bin()`).
Insight on question 7:
The histogram of the percentage of Pell Grant recipients shows that while a large proportion of universities have moderate levels of grant support, a non-negligible number cater to a significantly higher percentage of students who rely on financial aid. This analysis could guide further investigations into institutional characteristics, outcomes for students receiving grants, and potential policy initiatives.
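The bin counts behind a histogram like this one can be approximated with cut() and table(). The sketch below uses hypothetical percentages and assumes bin edges at multiples of 5; ggplot2 may place its bin edges differently, so the counts are illustrative rather than a reproduction of the plot.

```r
# Hypothetical Pell percentages with one missing value
pell <- c(12, 18, 33, 37, 41, 44, 68, NA)

# Assign each value to a 5-point bin; NA values are left out of the table
bins       <- cut(pell, breaks = seq(0, 100, by = 5))
bin_counts <- table(bins)

print(bin_counts[bin_counts > 0])  # show only non-empty bins
print(sum(is.na(pell)))            # values dropped, analogous to the warning above
```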
# Calculate the range (minimum and maximum) of tuition fees per student
range_tuition <- range(CollegeScores4yr$TuitionFTE, na.rm = TRUE)
print(paste("Range of Tuition Fees per Student:", paste(range_tuition, collapse = " - ")))
## [1] "Range of Tuition Fees per Student: 0 - 58553"
Insight on question 8:
The code calculates the range of tuition fees per student using the TuitionFTE variable in the CollegeScores4yr dataset. Setting na.rm = TRUE excludes NA values from the calculation. The range() function returns the minimum and maximum, here 0 and 58553, which are then collapsed into a single formatted string for printing. A minimum of 0 is worth checking, since it may reflect institutions that charge no tuition or values recorded as zero.
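Since range() returns two values rather than a single spread, diff() can turn them into one number. The sketch below applies this to the minimum and maximum printed above and shows, on hypothetical toy values, one way to count entries equal to the minimum.

```r
# Minimum and maximum as printed in the output above
range_reported <- c(0, 58553)
spread <- diff(range_reported)   # maximum minus minimum
print(spread)

# Toy example (hypothetical values): how many entries sit at the minimum?
tuition <- c(0, 12000, 0, 30500, NA, 58553)
n_at_min <- sum(tuition == min(tuition, na.rm = TRUE), na.rm = TRUE)
print(n_at_min)
```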
# Boxplot of faculty salaries by region
ggplot(CollegeScores4yr, aes(x = Region, y = FacSalary)) +
  geom_boxplot(fill = "lightblue") +
  ggtitle("Boxplot of Faculty Salaries by Region") +
  xlab("Region") +
  ylab("Faculty Salary")
## Warning: Removed 54 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Insight on question 9:
This box plot analysis highlights significant regional differences in faculty salaries, with the Northeast and West leading in median pay. The presence of outliers and variability in certain regions suggests a mix of institutional practices and external economic factors influencing pay. Further analysis could provide more detailed insights into the drivers of these differences and their implications for faculty recruitment and retention.
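Regional comparisons read from a box plot can be backed with numbers by computing the median per group. The sketch below uses tapply() on a small hypothetical data frame; the region names and salaries are assumptions for illustration and are not values from CollegeScores4yr.

```r
# Hypothetical data frame, not taken from CollegeScores4yr
toy <- data.frame(
  Region    = c("Northeast", "Northeast", "West", "West", "Midwest", "Midwest"),
  FacSalary = c(9000, 8000, 8200, NA, 6500, 7000)
)

# Median salary per region, ignoring missing values within each group
medians <- tapply(toy$FacSalary, toy$Region, median, na.rm = TRUE)
print(sort(medians, decreasing = TRUE))
```

On the real data, the same call with CollegeScores4yr$FacSalary and CollegeScores4yr$Region would give the medians that the center lines of the boxes represent.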
# Count universities above and at or below 50% female enrollment
above_50_female <- sum(CollegeScores4yr$Female > 50, na.rm = TRUE)
below_50_female <- sum(CollegeScores4yr$Female <= 50, na.rm = TRUE)
pie_values <- c(above_50_female, below_50_female)
pie_labels <- c("More than 50% Female", "50% or Less Female")
# Pie chart of the two groups
pie(pie_values, labels = pie_labels,
    main = "Percentage of Universities with More than 50% Female Students",
    col = c("lightcoral", "lightblue"))
Insight on question 10:
The pie chart illustrates the distribution of universities based on the proportion of female students enrolled. Specifically:
More than 50% Female: This category, represented by the larger section of the chart (~82.8%), indicates that a significant majority of universities have a student body composed of more than 50% female students.
50% or Less Female: The smaller section of the chart (~17.2%) represents the universities where the proportion of female students is 50% or less.
This visualization highlights that the vast majority of universities have a predominantly female student population, suggesting gender distribution trends in higher education enrollment.
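The percentages shown on a pie chart like this one can be derived from the two counts with prop.table(). The sketch below uses hypothetical enrollment percentages, so its output differs from the 82.8% and 17.2% reported above.

```r
# Hypothetical percentages of female students, with one missing value
female <- c(55, 62, 48, 71, 50, NA, 58)

above <- sum(female > 50, na.rm = TRUE)    # universities above 50% female
below <- sum(female <= 50, na.rm = TRUE)   # universities at or below 50% female

# Convert the two counts to percentages of the non-missing total
pct <- round(100 * prop.table(c(above = above, below = below)), 1)
print(pct)
```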
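Data Collection: The data set was sourced directly from a public URL, ensuring easy access and reproducibility.
Initial Observations: Basic data inspection was performed, including checking dimensions and previewing the first few rows.
Data Cleaning: Missing values were assessed, and the na.omit() function was used to exclude rows containing NA values. The ggplot2 warnings above (5 and 54 rows removed) show that the data frame used for the plots still contained NA values, so the statistics in this report rely on na.rm = TRUE and use = "complete.obs" to drop missing values one variable at a time rather than on a fully complete-case data set.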
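The difference between removing whole rows with na.omit() and skipping missing values per variable with na.rm = TRUE can be sketched on a small hypothetical data frame. The two approaches can give different answers because na.omit() also discards rows that are missing only in other columns.

```r
# Hypothetical data frame, not taken from CollegeScores4yr
toy <- data.frame(
  FacSalary = c(7000, 8600, NA, 9000),
  Debt      = c(20000, NA, 25000, 30000)
)

# Row-wise deletion: keep only rows with no NA in any column
complete_rows <- na.omit(toy)
n_complete    <- nrow(complete_rows)

# Per-variable deletion: use every non-missing FacSalary value
mean_per_variable  <- mean(toy$FacSalary, na.rm = TRUE)
mean_complete_case <- mean(complete_rows$FacSalary)

print(n_complete)
print(c(per_variable = mean_per_variable, complete_case = mean_complete_case))
```

Whichever approach is used, stating it alongside each result keeps the number of observations behind each statistic clear.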
Summary of Approach: The data set’s collection and initial cleaning process focused on ensuring data integrity and preparing it for meaningful analysis. Missing values were handled appropriately, and the data was structured for further statistical exploration and visualization.