US_Top_50_Universities_2026 Analysis

Author

Pascal Hermann Kouogang Tafo

INTRODUCTION

This project analyzes the 2026 US Top 50 Universities dataset to evaluate the relationship between institutional characteristics, focusing on research impact and employment outcomes. By comparing research impact scores and post-graduation employment rates, we aim to identify performance trends across public and private institutions for the 2026 academic outlook.

APPROACH

To analyze the dataset, I will implement a structured data science pipeline using the “tidyverse” framework, as follows:

Commit the CSV dataset to a GitHub repository so that it is accessible at any time, then load it into R.
Rename the variables to snake_case to standardize the naming convention.
Transform the dataset into a long format, which allows for more efficient faceted plotting and statistical comparison.
Construct plots to compare research impact score distributions between public and private institutions, identifying variability and outliers.
Compute a correlation analysis to determine whether higher research impact scores are associated with higher employment rates for each institution type.

Load useful libraries

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'tibble' was built under R version 4.5.2

Warning: package 'tidyr' was built under R version 4.5.2

Warning: package 'readr' was built under R version 4.5.2

Warning: package 'purrr' was built under R version 4.5.2

Warning: package 'dplyr' was built under R version 4.5.2

Warning: package 'stringr' was built under R version 4.5.2

Warning: package 'forcats' was built under R version 4.5.2

Warning: package 'lubridate' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)

Warning: package 'janitor' was built under R version 4.5.2


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

Load the CSV file and clean it

url <- "https://raw.githubusercontent.com/Pascaltafo2025/PROJECT-2--TIDY-DATA-ANALYSIS/refs/heads/main/US_Top_50_Universities_2026.csv"

US_Top_50_Universities_2026 <- read_csv(url)

Rows: 50 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): University_Name, Institution_Type, State
dbl (5): National_Rank, Founded_Year, Research_Impact_Score, Intl_Student_Ra...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(US_Top_50_Universities_2026,10)

# A tibble: 10 × 8
   University_Name             National_Rank Founded_Year Institution_Type State
   <chr>                               <dbl>        <dbl> <chr>            <chr>
 1 Massachusetts Institute of…             1         1861 Private          MA   
 2 Columbia University                     2         1754 Private          NY   
 3 Princeton University                    3         1746 Private          NJ   
 4 Stanford University                     4         1891 Private          CA   
 5 University of California, …             5         1868 Public           CA   
 6 Harvard University                      6         1636 Private          MA   
 7 Williams College                        7         1793 Private          MA   
 8 Johns Hopkins University                8         1876 Private          MD   
 9 Yale University                         9         1701 Private          CT   
10 University of Pennsylvania             10         1740 Private          PA   
# ℹ 3 more variables: Research_Impact_Score <dbl>, Intl_Student_Ratio <dbl>,
#   Employment_Rate <dbl>
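The readr message above suggests specifying the column types. A minimal sketch of what that could look like follows. It assumes the eight column names shown in the output; the two data rows are invented for illustration, and storing the ranks and years as integers and the institution type as a factor is a choice made here, not something the original analysis does.

```r
library(readr)

# Hypothetical two-row excerpt with the dataset's column names;
# the values are made up and only serve to make the sketch runnable.
csv_text <- I(
"University_Name,National_Rank,Founded_Year,Institution_Type,State,Research_Impact_Score,Intl_Student_Ratio,Employment_Rate
Example University A,1,1861,Private,MA,99.0,30.0,95.0
Example University B,2,1868,Public,CA,90.0,15.0,92.0
")

# Declaring col_types up front silences the column-specification message
# and makes a malformed column surface as a parsing problem.
universities <- read_csv(
  csv_text,
  col_types = cols(
    University_Name       = col_character(),
    National_Rank         = col_integer(),
    Founded_Year          = col_integer(),
    Institution_Type      = col_factor(levels = c("Private", "Public")),
    State                 = col_character(),
    Research_Impact_Score = col_double(),
    Intl_Student_Ratio    = col_double(),
    Employment_Rate       = col_double()
  )
)

str(universities)
```

For the real file, the same `col_types` argument could be passed to the `read_csv(url)` call in place of the inline text.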

# Rename columns → snake_case, consistent convention

Clean_US_Top_50_Universities_2026 <- US_Top_50_Universities_2026 %>%
  clean_names() %>%
  rename(state_code = state)

head(Clean_US_Top_50_Universities_2026,10)

# A tibble: 10 × 8
   university_name        national_rank founded_year institution_type state_code
   <chr>                          <dbl>        <dbl> <chr>            <chr>     
 1 Massachusetts Institu…             1         1861 Private          MA        
 2 Columbia University                2         1754 Private          NY        
 3 Princeton University               3         1746 Private          NJ        
 4 Stanford University                4         1891 Private          CA        
 5 University of Califor…             5         1868 Public           CA        
 6 Harvard University                 6         1636 Private          MA        
 7 Williams College                   7         1793 Private          MA        
 8 Johns Hopkins Univers…             8         1876 Private          MD        
 9 Yale University                    9         1701 Private          CT        
10 University of Pennsyl…            10         1740 Private          PA        
# ℹ 3 more variables: research_impact_score <dbl>, intl_student_ratio <dbl>,
#   employment_rate <dbl>

write_csv(Clean_US_Top_50_Universities_2026, "Clean_US_Top_50_Universities_2026.csv")
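To make the renaming step easier to follow, here is a small sketch of what `clean_names()` does to column names. The tibble `raw` and its values are hypothetical; the column with spaces is included only to show that `clean_names()` handles more than capitalization.

```r
library(tibble)
library(dplyr)
library(janitor)

# Hypothetical input with mixed naming styles (not the real dataset)
raw <- tibble(
  University_Name         = c("Example University A", "Example University B"),
  `Research Impact Score` = c(99, 90),
  State                   = c("MA", "CA")
)

cleaned <- raw %>%
  clean_names() %>%              # lower-case, underscores instead of spaces
  rename(state_code = state)     # same explicit rename as in the analysis

names(cleaned)
```

After this step every column can be referenced without backticks, which keeps the later `pivot_longer()` and `ggplot()` calls short.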

Transform the dataset from wide to long or tidy format

This step is crucial because it allows for more efficient faceted plotting and statistical analysis. We use the pivot_longer() function for this purpose.

long_Clean_dataset <- Clean_US_Top_50_Universities_2026 %>%
  pivot_longer(
    cols      = c(research_impact_score, intl_student_ratio, employment_rate),
    names_to  = "metric",
    values_to = "value"
  ) %>%
  mutate(
    metric = recode(metric,
      "research_impact_score"  = "Research Impact Score",
      "intl_student_ratio" = "Intl. Student Ratio (%)",
      "employment_rate"    = "Employment Rate (%)"
    )
  )

head(long_Clean_dataset,10)

# A tibble: 10 × 7
   university_name national_rank founded_year institution_type state_code metric
   <chr>                   <dbl>        <dbl> <chr>            <chr>      <chr> 
 1 Massachusetts …             1         1861 Private          MA         Resea…
 2 Massachusetts …             1         1861 Private          MA         Intl.…
 3 Massachusetts …             1         1861 Private          MA         Emplo…
 4 Columbia Unive…             2         1754 Private          NY         Resea…
 5 Columbia Unive…             2         1754 Private          NY         Intl.…
 6 Columbia Unive…             2         1754 Private          NY         Emplo…
 7 Princeton Univ…             3         1746 Private          NJ         Resea…
 8 Princeton Univ…             3         1746 Private          NJ         Intl.…
 9 Princeton Univ…             3         1746 Private          NJ         Emplo…
10 Stanford Unive…             4         1891 Private          CA         Resea…
# ℹ 1 more variable: value <dbl>
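As an illustration of why the long format helps with faceting, the sketch below draws one panel per metric from a single `ggplot()` call. The tibble `wide` and its four rows are hypothetical stand-ins for the cleaned dataset; the column names match the ones used above.

```r
library(tidyverse)

# Hypothetical wide-format rows (values invented for illustration)
wide <- tibble(
  university_name       = c("Example A", "Example B", "Example C", "Example D"),
  institution_type      = c("Private", "Private", "Public", "Public"),
  research_impact_score = c(99, 60, 92, 88),
  intl_student_ratio    = c(30, 12, 15, 18),
  employment_rate       = c(96, 90, 93, 92)
)

# One row per university and metric
long <- wide %>%
  pivot_longer(
    cols      = c(research_impact_score, intl_student_ratio, employment_rate),
    names_to  = "metric",
    values_to = "value"
  )

# Because the metric name is now a column, facet_wrap() can split on it
facet_plot <- ggplot(long, aes(x = institution_type, y = value,
                               fill = institution_type)) +
  geom_boxplot(alpha = 0.6) +
  facet_wrap(~ metric, scales = "free_y") +
  labs(x = "Institution Type", y = "Value") +
  theme_minimal()

facet_plot
```

With the wide format, the same figure would need three separate plots combined by hand; `scales = "free_y"` is used because the three metrics are not on a common scale.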

Generate a boxplot that compares the Research Impact Score by Institution Type

I used Claude Sonnet 4.6 with the following prompt: “Generate a box plot to describe and compare the Research Impact Score by Institution Type”

# Summary statistics

research_Impact_Score_summary <- Clean_US_Top_50_Universities_2026 %>%
  group_by(institution_type) %>%
  summarise(
    n           = n(),
    mean_score  = mean(research_impact_score),
    median_score = median(research_impact_score),
    sd_score    = sd(research_impact_score),
    min_score   = min(research_impact_score),
    max_score   = max(research_impact_score),
    .groups     = "drop"
  )

print(research_Impact_Score_summary)

# A tibble: 2 × 7
  institution_type     n mean_score median_score sd_score min_score max_score
  <chr>            <int>      <dbl>        <dbl>    <dbl>     <dbl>     <dbl>
1 Private             35       76.0         88.2    26.6       28.5     100  
2 Public              15       91.5         90.4     4.26      85.3      98.9

# Create a plot that shows Research Impact Score by Institution Type


ggplot(Clean_US_Top_50_Universities_2026, aes(x = institution_type,
                         y = research_impact_score,
                         fill = institution_type)) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA, width = 0.5) +
  geom_jitter(aes(colour = institution_type),
              width = 0.18, size = 2.2, alpha = 0.75, show.legend = FALSE) +
  stat_summary(fun = mean, geom = "point", shape = 23,
               size = 4, fill = "white", colour = "black") +
  scale_fill_manual(
    values = c("Public" = "blue", "Private" = "orange"),
    name   = "Institution Type"
  ) +
  scale_colour_manual(
    values = c("Public" = "#1565C0", "Private" = "red")
  ) +
  labs(
    title    = "Research Impact Score by Institution Type",
    subtitle = "US Top 50 Universities 2026 | Diamond = group mean",
    x        = "Institution Type",
    y        = "Research Impact Score (0–100)",
    caption  = "Source: US Top 50 Universities 2026 dataset"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title    = element_text(face = "bold", size = 15),
    plot.subtitle = element_text(colour = "grey40"),
    legend.position = "top",
    panel.grid.minor = element_blank()
  )

Interpretation

Among private institutions, the mean research impact score (76.0) is well below the median (88.2) because several institutions score far lower than the top tier (Ivy League, MIT, etc.), with a minimum of 28.5. This produces a left-skewed distribution and a large standard deviation (26.6). Public institutions in this Top 50, by contrast, show a very tight distribution (standard deviation 4.26, range 85.3 to 98.9) with a high median of 90.4. This indicates that the major public research universities in the list maintain a consistently high level of research output and impact.
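The boxplot above hides its outlier markers (`outlier.shape = NA`) because the jittered points already show every observation. To list low-scoring institutions explicitly, one option is the conventional 1.5 × IQR rule applied within each institution type. The sketch below uses a hypothetical `scores` tibble with invented values; it is not output from the real dataset.

```r
library(dplyr)

# Hypothetical scores (invented values, not the real dataset)
scores <- tibble::tibble(
  institution_type      = c(rep("Private", 6), rep("Public", 5)),
  research_impact_score = c(30, 85, 88, 90, 95, 100, 86, 89, 90, 93, 98)
)

flagged <- scores %>%
  group_by(institution_type) %>%
  mutate(
    q1  = quantile(research_impact_score, 0.25),
    q3  = quantile(research_impact_score, 0.75),
    iqr = q3 - q1,
    # A point is flagged if it lies more than 1.5 IQRs outside the quartiles
    is_outlier = research_impact_score < q1 - 1.5 * iqr |
                 research_impact_score > q3 + 1.5 * iqr
  ) %>%
  ungroup()

flagged %>% filter(is_outlier)
```

In this toy data only the private score of 30 is flagged. Running the same pipeline on `Clean_US_Top_50_Universities_2026` would show which private institutions drive the gap between mean and median.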

Find the correlation between Employment Rate and Research Impact Score and create a graph to visualize the result

We compute the Pearson correlation coefficient between the two variables separately for each institution type.

Correlation_EmpRate_vs_RImpactScore <- Clean_US_Top_50_Universities_2026 %>%
  group_by(institution_type) %>%
  summarise(
    r   = cor(employment_rate, research_impact_score, method = "pearson"),
    n   = n(),
    .groups = "drop"
  )
cat("\nCorrelation (Employment Rate vs Research Impact Score):\n")


Correlation (Employment Rate vs Research Impact Score):

print(Correlation_EmpRate_vs_RImpactScore)

# A tibble: 2 × 3
  institution_type     r     n
  <chr>            <dbl> <int>
1 Private          0.575    35
2 Public           0.336    15
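`cor()` returns only the coefficient. To gauge whether a correlation of a given size could plausibly arise by chance in groups of 35 and 15 institutions, `cor.test()` adds a p-value and a 95% confidence interval. The sketch below runs it per group on simulated data: the `toy` tibble, its seed, and the linear relationship built into it are all assumptions made for illustration, so its numbers say nothing about the real dataset.

```r
library(dplyr)

set.seed(2026)

# Simulated stand-in for the cleaned dataset (same group sizes, invented values)
toy <- tibble::tibble(
  institution_type      = rep(c("Private", "Public"), times = c(35, 15)),
  research_impact_score = runif(50, min = 30, max = 100)
) %>%
  mutate(employment_rate = 85 + 0.1 * research_impact_score + rnorm(50, sd = 2))

# Run a Pearson correlation test within each institution type
cor_tests <- toy %>%
  group_by(institution_type) %>%
  group_modify(~ {
    ct <- cor.test(.x$employment_rate, .x$research_impact_score,
                   method = "pearson")
    tibble::tibble(
      n         = nrow(.x),
      r         = unname(ct$estimate),
      conf_low  = ct$conf.int[1],
      conf_high = ct$conf.int[2],
      p_value   = ct$p.value
    )
  }) %>%
  ungroup()

print(cor_tests)
```

Replacing `toy` with `Clean_US_Top_50_Universities_2026` would give the intervals for the coefficients reported above. With only 15 public institutions, a wide interval is to be expected.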

# Scatter plot with regression lines

ggplot(Clean_US_Top_50_Universities_2026, aes(x = research_impact_score,
                          y = employment_rate,
                          colour = institution_type,
                          fill   = institution_type)) +
  geom_point(size = 3, alpha = 0.80) +
  geom_smooth(method = "lm", se = TRUE, alpha = 0.12, linewidth = 1.1) +
  geom_text(
    aes(label = state_code),
    size = 2.5, vjust = -0.6, show.legend = FALSE
  ) +
  scale_colour_manual(
    values = c("Public" = "#2196F3", "Private" = "#FF5722"),
    name   = "Institution Type"
  ) +
  scale_fill_manual(
    values = c("Public" = "#2196F3", "Private" = "#FF5722"),
    name   = "Institution Type"
  ) +
  # Annotate correlation coefficients
  annotate("text", x = 30, y = 97.5,
           label = sprintf("r (Public)  = %.2f", Correlation_EmpRate_vs_RImpactScore$r[Correlation_EmpRate_vs_RImpactScore$institution_type == "Public"]),
           colour = "#1565C0", hjust = 0, size = 3.8, fontface = "bold") +
  annotate("text", x = 30, y = 96.2,
           label = sprintf("r (Private) = %.2f", Correlation_EmpRate_vs_RImpactScore$r[Correlation_EmpRate_vs_RImpactScore$institution_type == "Private"]),
           colour = "#BF360C", hjust = 0, size = 3.8, fontface = "bold") +
  labs(
    title    = "Employment Rate vs Research Impact Score by Institution Type",
    subtitle = "US Top 50 Universities 2026 | Lines = OLS regression with 95% CI",
    x        = "Research Impact Score (0–100)",
    y        = "Employment Rate (%)",
    caption  = "Source: US Top 50 Universities 2026 dataset"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey40"),
    legend.position = "top",
    panel.grid.minor = element_blank()
  )

`geom_smooth()` using formula = 'y ~ x'

Interpretation

Private institutions show a moderate positive correlation, with a correlation coefficient of 0.575. This suggests that, among private institutions, research prestige is a reasonably good indicator of graduate employability.
Public institutions show a weaker positive correlation of approximately 0.34, an estimate that rests on only 15 institutions. That is surprising, since we expected their higher research impact scores to have a greater effect on the employment rates of public university graduates. It appears that a broader set of factors, beyond research metrics alone, is needed to predict the employment rate.
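Whether the two coefficients really differ can be checked with Fisher's z-transformation, which is not part of the original analysis and is offered here only as a sketch. It uses the r and n values from the correlation table above and assumes the two groups are independent samples with roughly bivariate-normal data. Those assumptions may not hold for a ranked list of 50 universities.

```r
# Values taken from the correlation table reported above
r_private <- 0.575; n_private <- 35
r_public  <- 0.336; n_public  <- 15

# Fisher's z-transformation makes the sampling distribution of r
# approximately normal with variance 1 / (n - 3)
z_private <- atanh(r_private)
z_public  <- atanh(r_public)

se_diff <- sqrt(1 / (n_private - 3) + 1 / (n_public - 3))
z_stat  <- (z_private - z_public) / se_diff

# Two-sided p-value for the difference between the two correlations
p_value <- 2 * pnorm(-abs(z_stat))

round(c(z = z_stat, p = p_value), 3)
```

With these inputs the statistic is roughly 0.9, which is well short of conventional significance thresholds. Under this sketch's assumptions, the gap between 0.575 and 0.336 could therefore be due to sampling variation, particularly given the small public group.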

CONCLUSION

In conclusion, the analysis suggests that the relationship between research impact and employment rate cannot be summarized by a single trend across all institution types, and that the employment rate is influenced by other factors. The stronger correlation among private institutions makes sense given the dual nature of elite private universities: flagship research institutions (MIT, Stanford, Caltech, Carnegie Mellon) simultaneously lead in research output and in industry placement, particularly in tech and finance. In contrast, the weaker pattern among public institutions may partly reflect that highly research-intensive public flagships (e.g., UC San Diego, UC Santa Barbara) channel large proportions of their graduates into graduate study rather than direct employment, modestly suppressing headline employment figures.