This project analyzes the 2026 US Top 50 Universities dataset to evaluate the relationship between institutional characteristics, focusing on research impact and employment outcomes. By comparing research impact scores and post-graduation employment rates, we aim to identify performance trends across public and private institutions for the 2026 academic outlook.
APPROACH
To analyze the dataset, I will implement a structured data science pipeline using the “tidyverse” framework, as follows:
Load the CSV dataset in R and commit it to a GitHub repository so that it is accessible at any time.
Rename some variables to standardize the naming convention.
Transform the dataset into a long format, which allows for more efficient faceted plotting and statistical comparison.
Construct plots to compare research impact score distributions between public and private institutions, identifying variability and outliers.
Compute a correlation analysis to determine whether higher research impact scores are associated with higher employment rates within each institution type.
Load useful libraries
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Warning: package 'janitor' was built under R version 4.5.2
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Rows: 50 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): University_Name, Institution_Type, State
dbl (5): National_Rank, Founded_Year, Research_Impact_Score, Intl_Student_Ra...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
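The import call that printed the column specification above is not shown in the report. A minimal sketch of how such a load might look with readr::read_csv() follows. The temporary file and the three sample rows are hypothetical stand-ins for the real CSV: the header matches the column names in the output, but the last three values in each row are placeholders, not figures from the dataset.

```r
library(readr)

# Hypothetical stand-in for the real file. The header mirrors the column
# specification printed above; the scores, ratios and rates are made-up
# placeholder values used only so the example runs on its own.
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  paste("University_Name", "National_Rank", "Founded_Year",
        "Institution_Type", "State", "Research_Impact_Score",
        "Intl_Student_Ratio", "Employment_Rate", sep = ","),
  "Columbia University,2,1754,Private,NY,90,20,95",
  "Princeton University,3,1746,Private,NJ,91,18,96",
  "Stanford University,4,1891,Private,CA,92,19,97"
), sample_csv)

# show_col_types = FALSE silences the column specification message
US_Top_50_Universities_2026 <- read_csv(sample_csv, show_col_types = FALSE)
```

In the real pipeline the path passed to read_csv() would point at the CSV committed to the GitHub repository instead of a temporary file.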
head(US_Top_50_Universities_2026,10)
# A tibble: 10 × 8
University_Name National_Rank Founded_Year Institution_Type State
<chr> <dbl> <dbl> <chr> <chr>
1 Massachusetts Institute of… 1 1861 Private MA
2 Columbia University 2 1754 Private NY
3 Princeton University 3 1746 Private NJ
4 Stanford University 4 1891 Private CA
5 University of California, … 5 1868 Public CA
6 Harvard University 6 1636 Private MA
7 Williams College 7 1793 Private MA
8 Johns Hopkins University 8 1876 Private MD
9 Yale University 9 1701 Private CT
10 University of Pennsylvania 10 1740 Private PA
# ℹ 3 more variables: Research_Impact_Score <dbl>, Intl_Student_Ratio <dbl>,
# Employment_Rate <dbl>
# A tibble: 10 × 8
university_name national_rank founded_year institution_type state_code
<chr> <dbl> <dbl> <chr> <chr>
1 Massachusetts Institu… 1 1861 Private MA
2 Columbia University 2 1754 Private NY
3 Princeton University 3 1746 Private NJ
4 Stanford University 4 1891 Private CA
5 University of Califor… 5 1868 Public CA
6 Harvard University 6 1636 Private MA
7 Williams College 7 1793 Private MA
8 Johns Hopkins Univers… 8 1876 Private MD
9 Yale University 9 1701 Private CT
10 University of Pennsyl… 10 1740 Private PA
# ℹ 3 more variables: research_impact_score <dbl>, intl_student_ratio <dbl>,
# employment_rate <dbl>
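The second preview shows lower-case snake_case column names and a state_code column in place of State, but the renaming code itself is not shown. One way to get that result, sketched here on the assumption that janitor::clean_names() and dplyr::rename() were used, is below. The toy tibble and its numeric values are hypothetical.

```r
library(dplyr)
library(janitor)

# Hypothetical two-row sample with the original column names;
# the last three columns hold placeholder values, not real data.
toy_raw <- tibble::tibble(
  University_Name       = c("Columbia University", "Princeton University"),
  National_Rank         = c(2, 3),
  Founded_Year          = c(1754, 1746),
  Institution_Type      = c("Private", "Private"),
  State                 = c("NY", "NJ"),
  Research_Impact_Score = c(90, 91),
  Intl_Student_Ratio    = c(20, 18),
  Employment_Rate       = c(95, 96)
)

toy_clean <- toy_raw %>%
  clean_names() %>%             # University_Name -> university_name, etc.
  rename(state_code = state)    # make explicit that the column holds a code

names(toy_clean)
```

clean_names() handles the case conversion for every column at once, so only the one semantic rename needs to be written by hand.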
Transform the dataset from wide to long (tidy) format
This step is important because it allows for more efficient faceted plotting and statistical analysis. We use the pivot_longer() function for this purpose.
# A tibble: 10 × 7
university_name national_rank founded_year institution_type state_code metric
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 Massachusetts … 1 1861 Private MA Resea…
2 Massachusetts … 1 1861 Private MA Intl.…
3 Massachusetts … 1 1861 Private MA Emplo…
4 Columbia Unive… 2 1754 Private NY Resea…
5 Columbia Unive… 2 1754 Private NY Intl.…
6 Columbia Unive… 2 1754 Private NY Emplo…
7 Princeton Univ… 3 1746 Private NJ Resea…
8 Princeton Univ… 3 1746 Private NJ Intl.…
9 Princeton Univ… 3 1746 Private NJ Emplo…
10 Stanford Unive… 4 1891 Private CA Resea…
# ℹ 1 more variable: value <dbl>
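The reshaping call is not shown, so the following is only a sketch of how pivot_longer() could produce a seven-column table like the one above. The toy tibble and its values are hypothetical. The metric labels in the report's output are truncated, so this sketch keeps the column names as the metric values instead of guessing the full labels.

```r
library(dplyr)
library(tidyr)

# Hypothetical cleaned sample; the three metric columns hold placeholder values.
toy_clean <- tibble::tibble(
  university_name       = c("Columbia University", "Princeton University"),
  national_rank         = c(2, 3),
  founded_year          = c(1754, 1746),
  institution_type      = c("Private", "Private"),
  state_code            = c("NY", "NJ"),
  research_impact_score = c(90, 91),
  intl_student_ratio    = c(20, 18),
  employment_rate       = c(95, 96)
)

# Stack the three metric columns into metric/value pairs:
# each university goes from one row to three.
toy_long <- toy_clean %>%
  pivot_longer(
    cols      = c(research_impact_score, intl_student_ratio, employment_rate),
    names_to  = "metric",
    values_to = "value"
  )

toy_long
```

With the data in this shape, a single facet_wrap(~ metric) call can draw one panel per metric.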
Generate a boxplot that compares the Research Impact Score by Institution Type
I used Claude Sonnet 4.6 with the following prompt: “Generate a box plot to describe and compare the Research Impact Score by Institution Type”
# Create a plot that shows Research Impact Score by Institution Type
ggplot(Clean_US_Top_50_Universities_2026,
       aes(x = institution_type, y = research_impact_score, fill = institution_type)) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA, width = 0.5) +
  geom_jitter(aes(colour = institution_type),
              width = 0.18, size = 2.2, alpha = 0.75, show.legend = FALSE) +
  stat_summary(fun = mean, geom = "point", shape = 23,
               size = 4, fill = "white", colour = "black") +
  scale_fill_manual(values = c("Public" = "blue", "Private" = "orange"),
                    name = "Institution Type") +
  scale_colour_manual(values = c("Public" = "#1565C0", "Private" = "red")) +
  labs(title = "Research Impact Score by Institution Type",
       subtitle = "US Top 50 Universities 2026 | Diamond = group mean",
       x = "Institution Type",
       y = "Research Impact Score (0–100)",
       caption = "Source: US Top 50 Universities 2026 dataset") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold", size = 15),
        plot.subtitle = element_text(colour = "grey40"),
        legend.position = "top",
        panel.grid.minor = element_blank())
Interpretation
Among private institutions, the mean research impact score falls below the median because several institutions score well below the top tier (Ivy League, MIT, etc.), which produces a left-skewed distribution. Public institutions in this Top 50, by contrast, show a very tight distribution with a high median. This indicates that major public research universities maintain a consistently high level of research output and impact.
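The reading of the boxplot can be cross-checked numerically by comparing the mean, median and interquartile range per group. The sketch below shows one way to do this with dplyr. The scores are hypothetical toy values chosen to mimic the described shapes, not values from the dataset.

```r
library(dplyr)

# Hypothetical scores: one low private value drags the private mean down,
# while the public values sit close together.
toy_scores <- tibble::tibble(
  institution_type      = c(rep("Private", 5), rep("Public", 4)),
  research_impact_score = c(70, 92, 95, 96, 97, 88, 89, 90, 91)
)

score_summary <- toy_scores %>%
  group_by(institution_type) %>%
  summarise(
    mean_score   = mean(research_impact_score),
    median_score = median(research_impact_score),
    iqr_score    = IQR(research_impact_score),
    .groups      = "drop"
  ) %>%
  # A mean below the median is a simple indicator of left skew
  mutate(mean_below_median = mean_score < median_score)

score_summary
```

Running the same summarise() call on the cleaned dataset would show whether the mean really sits below the median for private institutions and how much narrower the public spread is.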
Let’s find the correlation between Employment Rate and Research Impact Score and create a graph to visualize the result
We compute the Pearson correlation coefficient separately for each institution type.
Correlation_EmpRate_vs_RImpactScore <- Clean_US_Top_50_Universities_2026 %>%
  group_by(institution_type) %>%
  summarise(
    r = cor(employment_rate, research_impact_score, method = "pearson"),
    n = n(),
    .groups = "drop"
  )
cat("\nCorrelation (Employment Rate vs Research Impact Score):\n")
Correlation (Employment Rate vs Research Impact Score):
print(Correlation_EmpRate_vs_RImpactScore)
# A tibble: 2 × 3
institution_type r n
<chr> <dbl> <int>
1 Private 0.575 35
2 Public 0.336 15
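The section heading mentions a graph, but no plotting code appears in the report. One plausible visualization, offered only as a sketch, is a scatter plot with a fitted line per institution type. The data frame below uses hypothetical toy values, and the plot styling is an assumption loosely modeled on the boxplot earlier in the report.

```r
library(ggplot2)

# Hypothetical toy values for illustration only; not taken from the dataset.
toy_universities <- data.frame(
  institution_type      = rep(c("Private", "Public"), each = 4),
  research_impact_score = c(82, 88, 93, 97, 85, 88, 90, 92),
  employment_rate       = c(88, 91, 94, 97, 90, 89, 93, 91)
)

correlation_plot <- ggplot(
  toy_universities,
  aes(x = research_impact_score, y = employment_rate, colour = institution_type)
) +
  geom_point(size = 2.2, alpha = 0.75) +
  # One least-squares line per panel to show the direction of the association
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ institution_type) +
  labs(
    title = "Employment Rate vs Research Impact Score",
    x = "Research Impact Score",
    y = "Employment Rate",
    colour = "Institution Type"
  ) +
  theme_minimal(base_size = 13)
```

Calling print(correlation_plot) renders the figure. Swapping toy_universities for the cleaned dataset would give one panel per institution type, matching the grouped coefficients in the table.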
Private institutions show a moderate positive correlation, with a coefficient of 0.575. This suggests that, among private institutions, research prestige is a reasonably good indicator of graduate employability.
Public institutions show a weaker positive correlation of approximately 0.34. This is surprising, since we expected their higher research impact scores to have a greater effect on employment rates for public university graduates. It appears that a broader set of factors beyond research metrics is needed to predict the employment rate.
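The coefficients describe the strength of each association but not whether it is statistically distinguishable from zero. As a rough worked check, the t statistic for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)), can be evaluated from the rounded r and n values in the table. This sketch assumes the usual conditions for the test (independent observations, roughly bivariate normal data) and uses rounded inputs, so the results are approximate. Running cor.test() on the full data within each group would be the more direct route.

```r
# r and n copied from the correlation table (rounded values)
r <- c(Private = 0.575, Public = 0.336)
n <- c(Private = 35, Public = 15)

# t statistic for H0: true correlation is zero, with n - 2 degrees of freedom
t_stat <- r * sqrt((n - 2) / (1 - r^2))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

significance_check <- data.frame(
  institution_type = names(r),
  r = r,
  n = n,
  t_stat = round(t_stat, 2),
  p_value = signif(p_value, 2),
  row.names = NULL
)
print(significance_check)
```

With these inputs, t is about 4.04 for private institutions and about 1.29 for public institutions. Under the stated assumptions, only the private correlation clears the conventional 0.05 threshold, and with just 15 public institutions the weaker coefficient cannot be distinguished from zero.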
CONCLUSION
In conclusion, the analysis suggests that the employment rate cannot be summarized by a single trend across all institution types and that it is influenced by factors other than research impact. The stronger correlation among private institutions makes sense given the dual nature of elite private universities: flagship research institutions (MIT, Stanford, Caltech, Carnegie Mellon) simultaneously lead in research output and in industry placement, particularly in tech and finance. In contrast, the weaker pattern among public institutions may partly reflect that highly research-intensive public flagships (e.g., UC San Diego, UC Santa Barbara) channel large proportions of their graduates into graduate study rather than direct employment, modestly suppressing headline employment figures.