# PROJECT OVERVIEW

Overview of Project

Exploring the differences between American research colleges and universities through data visualization.

# RESEARCH QUESTIONS/TOPICS, AND NARRATIVE
# Communications - Mia Leandri

Research Question/Topic

  1. Descriptive and Comparative Question: How do institutional characteristics compare across the classifications of American research colleges and universities (“R1”, “R2”, “Research Colleges and Universities (Unranked)”)?
  • Exploring features such as location, admissions, academic offerings, and doctoral degrees awarded
  2. Analytical and Inferential Question: How do institutional characteristics predict or correlate with research designation tiers among American colleges and universities?
  • Exploring features such as location, admissions, academic offerings, and doctoral degrees awarded

Research Narrative

Our chosen datasets should be well-suited to address both our descriptive/comparative and analytical/inferential research questions. For descriptive and comparative purposes, we have access to data on most of the key factors we aim to explore, allowing us to describe and compare differences across universities. For instance, we can analyze the geographic distribution of institutions by state and city, admissions data on who is being admitted, academic offerings across universities, and the number of doctoral degrees awarded at each institution. By combining this information with institutional research spending data from the carnegie_spending_doctorate_awards dataset and basic school characteristics from carnegie_classification_stats, we can compare which factors may contribute to higher research spending across institutions.

For our analytical and inferential questions, we can use the data in several ways to explore what factors—beyond financial expenditures—are associated with a higher research classification. For example, admissions and academic offerings data, when examined alongside research classifications, may help us determine whether specific academic or demographic characteristics play a role in an institution’s likelihood of achieving Research 1 or Research 2 status. This approach goes beyond simply examining research spending and instead investigates what factors may influence institutional research standing.

Our primary audience includes college and university administrators. We hope this analysis will be especially useful for those seeking to understand which institutional characteristics are associated with higher research rankings. For instance, if institutions with more diverse admissions profiles or broader academic offerings tend to rank higher in research classifications, administrators at other institutions may consider incorporating similar changes. This information could support strategic planning and resource allocation to strengthen their institutions’ research profile.

# DATA AND RESEARCH SOURCES
# Data Engineers - Colin Thompson (finding, describing) and Lily Gates (cleaning, combining datasets)

Data and Research Sources Overview

Cleaned CSV versions of the “Carnegie Classification of Institutions of Higher Education” https://carnegieclassifications.acenet.edu/

carnegie_spending_doctorate_awards (See carnegie_research_activity_desig_factsheet.pdf for more info)
  • School
  • Location (City, State)
  • Classification (R1, R2, Other), from 2021 and 2025
  • Expenses (2021, 2022, 2023)
  • Research Doctorates Awarded (2020, 2021, 2022)

carnegie_classification_stats (See carnegie_variable_descr_flowchart.pdf for more info)
  • School
      • Filtering main school if applicable (e.g., UMBC, UMD, UMES all are “University of Maryland” sub schools)
  • Location (City, State, residential or not)
  • Public or private (non-profit, for-profit)
  • Description of types of degree programs offered
  • Enrollment information
  • Classification on tier (R1, R2, Other)

Description of Variables

  • classification_df is from the carnegie_classification_stats
  • spending_df is from the carnegie_spending_doctorate_awards

Basic School Information

NAME
  • unitid - unique ID number
  • name.x - name of school (orig from classification_df; NOTE: later renamed to name, while name.y is dropped)
  • name.y - name (orig from spending_df)
  • core_name - name of the main university, e.g., “University of Maryland” (orig from classification_df)
  • spec_name - name of the specific campus/satellite, e.g., “College Park” (orig from classification_df)

SCHOOL TYPE
  • public_private_profit - public, private (for-profit, non-profit) (orig from classification_df)

SCHOOL LOCATION
  • city - city
  • state - state abbreviation
  • size_setting - size, 4-year vs. graduate/professional, how residential the school is (orig from classification_df)

Enrollment and Admissions
  • level - four or more years, as opposed to 2-year associate degree programs such as community colleges (orig from classification_df)
  • enrollment_profile - distribution of undergrad vs. graduate students (orig from classification_df)
  • size_setting - student body population (orig from classification_df)
      • size:
          • very small (fewer than 1,000)
          • small (1,000 to 2,999)
          • medium (3,000 to 9,999)
          • large (10,000 or greater)
      • residential: percentage of full-time (FT) students living on campus
          • primarily non-res: less than 25% of FT students
          • primarily residential: 25 to 50% of FT students
          • highly residential: more than 50% of FT students
  • undergrad_profile - for 4-year universities (orig from classification_df)
      • full/part-time (PT = part-time):
          • higher part-time (over 40% PT)
          • medium-full-time (21-39% PT)
          • full-time (less than 20% PT)
      • selective:
          • inclusive (ACT equivalent <19)
          • selective (ACT equivalent 19 to 23)
          • more selective (ACT equivalent over 23)
      • transfer in:
          • lower (<20%)
          • higher (greater than 20%)
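The size bands described for size_setting can be illustrated with a short sketch. It assumes the bands are contiguous (very small below 1,000, large at 10,000 and above) and uses a hypothetical `enrollment` vector, since the dataset itself stores the resulting label rather than the raw head count.

```r
# Hypothetical enrollment counts; the dataset stores the label, not the count
enrollment <- c(850, 1000, 2999, 3000, 9999, 10000)

size_label <- cut(
  enrollment,
  breaks = c(-Inf, 1000, 3000, 10000, Inf),
  labels = c("very small", "small", "medium", "large"),
  right  = FALSE   # intervals are [lower, upper), so 1000 falls in "small"
)

print(data.frame(enrollment, size_label))
```

Setting `right = FALSE` keeps each boundary value (1,000, 3,000, 10,000) in the upper band, which matches the ranges listed above.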

Research Classification Ranking
  • research_tier_2025 - 2025 tier ranking (orig from classification_df)
  • classific_2021 - 2021 tier ranking (orig from spending_df)
  • classific_2025 - 2025 tier ranking (orig from spending_df)

Finances
orig from spending_df
  • herd_fy21 - 2021 fiscal year expenses on research
  • herd_fy22 - 2022 fiscal year expenses on research
  • herd_fy23 - 2023 fiscal year expenses on research
  • herd_avg_fy21_to_fy23 - average expenses on research from 2021 to 2023

Types of Academic Programs and Degrees Offered
orig from classification_df
  • degree_focuses - degree level and academic focus overview
      • top degree level offered: bacc, masters, doctoral, special, tribal
      • focus: arts and science, diverse, engineering and tech, medical, other health, special focus, research, or NULL
      • size: of program (only for Masters Universities)
      • research activity: only for Doctoral Universities
  • undergrad_program - distribution of undergrad degrees in certain academic fields
      • dominant subj: “arts and sciences” OR “professions” OR NULL
      • balanced vs plus vs focus: measured by the percentage of bachelor’s degrees awarded in arts and sciences (rather than professions)
      • graduate degree coexistence: percentage of graduate degrees in undergraduate fields
          • e.g., a BS in Computer Science is offered at the same school as an MS or PhD in Computer Science
          • measured by “none” (0%) OR “some” (0 to 50%) OR “high” (greater than 50%)
  • grad_program - distribution of grad degrees in certain academic fields
      • program level: undergrad only, postbac, research doctorate, NULL
      • single or comprehensive: multiple subject fields or single subject
      • focus: education, business, humanities, STEM, professional

Research Doctorates Awarded
orig from spending_df

Number of doctoral research degrees conferred for…
  • num_doc_degrees_2020_2021 - the 2020-2021 school year
  • num_doc_degrees_2021_2022 - the 2021-2022 school year
  • num_doc_degrees_2022_2023 - the 2022-2023 school year

Miscellaneous
orig from classification_df
Not important for our analysis; used for Carnegie’s own engagement purposes and special programs.
  • community_engage - whether they were classified by Carnegie
  • leadership_for_public_prax - whether they were classified by Carnegie

Narrative Overview

The datasets we intend to use for this project are carnegie_classification_stats and carnegie_spending_doctorate_awards, both sourced from the Carnegie Classification of Institutions of Higher Education—a framework maintained by the American Council on Education and the Carnegie Foundation for the Advancement of Teaching. The purpose of this classification system is to provide a standardized way of categorizing U.S. colleges and universities based on a range of institutional characteristics, including degree offerings, enrollment profiles, and research activity.

The carnegie_classification_stats dataset includes a wide range of institutional data. It provides general school information such as the official name of the university, the name of the main university, and the name of the specific campus (e.g., “University of Maryland” as the main university and “College Park” as the specific campus). It also classifies institutions by their sector, indicating whether they are public, private non-profit, or private for-profit. Additional variables detail the types of degrees offered, from bachelor’s to doctoral levels, as well as the institution’s academic focus, such as liberal arts, professional fields, or STEM. The dataset includes enrollment profiles, offering insight into whether an institution primarily serves undergraduate or graduate students, and whether its student body is full-time, part-time, or transfer-heavy. It also captures information about the size of the institution, the residential setting, and undergraduate selectivity. Importantly, it includes the university’s most recent research classification label (e.g., Research 1 or Research 2) as of 2025.

The carnegie_spending_doctorate_awards dataset complements this information by including similar institutional identifiers and research classifications for 2021 and 2025. It also tracks financial data, reporting annual research expenses for fiscal years 2021 through 2023. This provides a window into how institutions allocate resources toward research over time. Additionally, this dataset documents the number of research doctorates awarded by each institution in three consecutive academic years: 2020–2021, 2021–2022, and 2022–2023. Together, these two datasets allow for a comprehensive analysis of how institutional characteristics, research activity, and academic program offerings intersect across the higher education landscape in the United States.

# CODE BEGINS

Installing Packages

Note: These packages only need to be installed once.

# install.packages("patchwork")
# install.packages("scales")
# install.packages("RColorBrewer")

Importing Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(tidyr)
library(dplyr)
library(lubridate)  # for Datetime formatting
library(stringr)  # for extracting URL
library(patchwork)  # for putting multiple graphs on one fig
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(RColorBrewer)  # for color palettes

Reading in Data

classification_stats <- read_csv("carnegie_classification_stats.csv")
## Rows: 542 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): name, core_name, spec_name, city, state, level, public_private_pro...
## dbl  (1): unitid
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spending_doctorate_awards <- read_csv("carnegie_spending_doctorate_awards.csv")
## Rows: 542 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, city, state, classific_2025, classific_2021, herd_fy21, herd_...
## dbl (5): unitid, num_doc_degrees_2020_2021, num_doc_degrees_2021_2022, num_d...
## num (2): herd_fy23, herd_avg_fy21_to_fy23
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
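The column specification above shows herd_fy21 and herd_fy22 being read as character, presumably because of the thousands separators and the "N/A" entries that surface later during cleaning. As a rough sketch, the types could instead be declared at read time. The inline sample below is hypothetical (its values are illustrative, not taken from the Carnegie files), and it assumes "N/A" is the only non-standard missing-value marker.

```r
library(readr)

# Hypothetical three-row sample mimicking the HERD columns; the values are
# illustrative only and are not taken from the Carnegie files.
sample_csv <- I('unitid,herd_fy21,herd_fy22
1,"4,597,000","6,310,000"
2,N/A,"44,391,000"
3,"9,180,000",N/A
')

sample_df <- read_csv(
  sample_csv,
  na = c("", "NA", "N/A"),          # treat "N/A" as missing at read time
  col_types = cols(
    unitid    = col_double(),
    herd_fy21 = col_number(),       # col_number() drops the grouping commas
    herd_fy22 = col_number()
  )
)

print(sample_df)
```

Supplying `col_types` also silences the column-specification message, and the HERD columns arrive as doubles without a separate parsing step.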

Preview and Summarize Data

#head(classification_stats)
colnames(classification_stats)
##  [1] "unitid"                     "name"                      
##  [3] "core_name"                  "spec_name"                 
##  [5] "city"                       "state"                     
##  [7] "level"                      "public_private_profit"     
##  [9] "undergrad_program"          "grad_program"              
## [11] "enrollment_profile"         "undergrad_profile"         
## [13] "size_setting"               "degree_focuses"            
## [15] "community_engage"           "leadership_for_public_prax"
## [17] "research_tier_2025"
summary(classification_stats)
##      unitid           name            core_name          spec_name        
##  Min.   :100654   Length:542         Length:542         Length:542        
##  1st Qu.:147953   Class :character   Class :character   Class :character  
##  Median :187289   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :190540                                                           
##  3rd Qu.:217853                                                           
##  Max.   :495767                                                           
##      city              state              level           public_private_profit
##  Length:542         Length:542         Length:542         Length:542           
##  Class :character   Class :character   Class :character   Class :character     
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character     
##                                                                                
##                                                                                
##                                                                                
##  undergrad_program  grad_program       enrollment_profile undergrad_profile 
##  Length:542         Length:542         Length:542         Length:542        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  size_setting       degree_focuses     community_engage  
##  Length:542         Length:542         Length:542        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  leadership_for_public_prax research_tier_2025
##  Length:542                 Length:542        
##  Class :character           Class :character  
##  Mode  :character           Mode  :character  
##                                               
##                                               
## 
head(spending_doctorate_awards)
## # A tibble: 6 × 14
##   unitid name      city  state classific_2025 classific_2021 herd_fy21 herd_fy22
##    <dbl> <chr>     <chr> <chr> <chr>          <chr>          <chr>     <chr>    
## 1 222178 Abilene … Abil… TX    Research 2: H… Doctoral/Prof… 4,597,00… 6,310,00…
## 2 200697 Air Forc… Wrig… OH    Research 2: H… Doctoral Univ… 45,703,0… 44,391,0…
## 3 100654 Alabama … Norm… AL    Research Coll… Master's Coll… 9,180,00… 9,611,00…
## 4 100724 Alabama … Mont… AL    Research Coll… Doctoral/Prof… 2,631,00… 3,178,00…
## 5 188526 Albany C… Alba… NY    Research Coll… Special Focus… 3,687,00… 2,276,00…
## 6 188580 Albany M… Alba… NY    Research Coll… Special Focus… 22,443,0… 22,297,0…
## # ℹ 6 more variables: herd_fy23 <dbl>, herd_avg_fy21_to_fy23 <dbl>,
## #   num_doc_degrees_2020_2021 <dbl>, num_doc_degrees_2021_2022 <dbl>,
## #   num_doc_degrees_2022_2023 <dbl>, avg_num_doc_degrees_2020_2023 <dbl>
colnames(spending_doctorate_awards)
##  [1] "unitid"                        "name"                         
##  [3] "city"                          "state"                        
##  [5] "classific_2025"                "classific_2021"               
##  [7] "herd_fy21"                     "herd_fy22"                    
##  [9] "herd_fy23"                     "herd_avg_fy21_to_fy23"        
## [11] "num_doc_degrees_2020_2021"     "num_doc_degrees_2021_2022"    
## [13] "num_doc_degrees_2022_2023"     "avg_num_doc_degrees_2020_2023"
summary(spending_doctorate_awards)
##      unitid           name               city              state          
##  Min.   :100654   Length:542         Length:542         Length:542        
##  1st Qu.:147953   Class :character   Class :character   Class :character  
##  Median :187289   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :190022                                                           
##  3rd Qu.:217744                                                           
##  Max.   :492689                                                           
##  classific_2025     classific_2021      herd_fy21          herd_fy22        
##  Length:542         Length:542         Length:542         Length:542        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    herd_fy23         herd_avg_fy21_to_fy23 num_doc_degrees_2020_2021
##  Min.   :0.000e+00   Min.   :1.215e+06     Min.   :  0.0            
##  1st Qu.:6.320e+06   1st Qu.:5.790e+06     1st Qu.:  0.0            
##  Median :2.560e+07   Median :2.363e+07     Median : 32.0            
##  Mean   :1.971e+08   Mean   :1.790e+08     Mean   :108.1            
##  3rd Qu.:2.135e+08   3rd Qu.:1.959e+08     3rd Qu.:132.5            
##  Max.   :3.802e+09   Max.   :3.468e+09     Max.   :900.0            
##  num_doc_degrees_2021_2022 num_doc_degrees_2022_2023
##  Min.   :  0.0             Min.   :  0.0            
##  1st Qu.:  0.0             1st Qu.:  3.0            
##  Median : 36.0             Median : 36.0            
##  Mean   :120.1             Mean   :122.3            
##  3rd Qu.:144.8             3rd Qu.:151.2            
##  Max.   :942.0             Max.   :930.0            
##  avg_num_doc_degrees_2020_2023
##  Min.   :  0.0                
##  1st Qu.:  2.0                
##  Median : 34.5                
##  Mean   :116.8                
##  3rd Qu.:146.5                
##  Max.   :924.0
# CREATING DATAFRAMES

Convert Both Imported CSVs to Dataframes

classification_df <- data.frame(classification_stats)
class(classification_df)
## [1] "data.frame"
spending_df <- data.frame(spending_doctorate_awards)
class(spending_df)
## [1] "data.frame"
colnames(spending_df)
##  [1] "unitid"                        "name"                         
##  [3] "city"                          "state"                        
##  [5] "classific_2025"                "classific_2021"               
##  [7] "herd_fy21"                     "herd_fy22"                    
##  [9] "herd_fy23"                     "herd_avg_fy21_to_fy23"        
## [11] "num_doc_degrees_2020_2021"     "num_doc_degrees_2021_2022"    
## [13] "num_doc_degrees_2022_2023"     "avg_num_doc_degrees_2020_2023"
# CLEANING DATA

Convert and Ensure Financial Values are Numeric

For the spending_df columns herd_fyXX, strip the number formatting and convert the values to numeric. Entries that cannot be parsed (e.g., “N/A”) become NA.

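# Clean and convert HERD columns to integer
# Note: as.integer() returns NA for values above .Machine$integer.max
# (2,147,483,647), so very large expenditures are lost along with "N/A" entries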
spending_df <- spending_df %>%
  mutate(
    herd_fy21 = as.integer(parse_number(as.character(herd_fy21))),
    herd_fy22 = as.integer(parse_number(as.character(herd_fy22))),
    herd_fy23 = as.integer(parse_number(as.character(herd_fy23))),
    herd_avg_fy21_to_fy23 = as.integer(parse_number(as.character(herd_avg_fy21_to_fy23)))
  )
## Warning: There were 6 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `herd_fy21 = as.integer(parse_number(as.character(herd_fy21)))`.
## Caused by warning:
## ! 1 parsing failure.
## row col expected actual
## 162  -- a number    N/A
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.
# Inspect rows where parsing failed (NAs introduced)
problematic_rows <- spending_df %>%
  filter(if_any(c(herd_fy21, herd_fy22, herd_fy23, herd_avg_fy21_to_fy23), is.na)) %>%
  distinct()

# View problematic rows – check for Johns Hopkins, Kent State, or other issues
print(problematic_rows)
##   unitid                          name      city state
## 1 162928      Johns Hopkins University Baltimore    MD
## 2 203517 Kent State University at Kent      Kent    OH
##                                            classific_2025
## 1 Research 1: Very High Spending and Doctorate Production
## 2 Research 1: Very High Spending and Doctorate Production
##                                       classific_2021 herd_fy21 herd_fy22
## 1 Doctoral Universities: Very High Research Activity        NA        NA
## 2 Doctoral Universities: Very High Research Activity        NA        NA
##   herd_fy23 herd_avg_fy21_to_fy23 num_doc_degrees_2020_2021
## 1        NA                    NA                       498
## 2  57758000              57758000                       134
##   num_doc_degrees_2021_2022 num_doc_degrees_2022_2023
## 1                       605                       672
## 2                       145                       165
##   avg_num_doc_degrees_2020_2023
## 1                           592
## 2                           148
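Johns Hopkins shows NA for herd_fy23 and herd_avg_fy21_to_fy23 even though the earlier summary of those columns lists no missing values and a maximum of 3.802e+09. R integers top out at 2,147,483,647, so this pattern is consistent with overflow in as.integer() rather than missing source data (an inference from the output, not something the source states). A minimal sketch of keeping the values as doubles follows; the helper name `to_dollars` and the input strings are hypothetical.

```r
library(readr)

# Hypothetical helper: parse formatted dollar strings to doubles.
# Doubles represent whole numbers exactly up to 2^53, so no overflow here.
to_dollars <- function(x) {
  parse_number(as.character(x), na = c("", "NA", "N/A"))
}

# Illustrative inputs only (not values from the dataset)
raw <- c("4,597,000", "N/A", "3,802,000,000")

as_int <- suppressWarnings(as.integer(parse_number(raw, na = "N/A")))
as_dbl <- to_dollars(raw)

print(as_int)  # third value is NA: it exceeds .Machine$integer.max
print(as_dbl)  # third value is kept as a double
```

Passing `na = "N/A"` to parse_number() also avoids the parsing-failure warnings, so any NA that remains reflects a genuinely missing entry.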

Combine Both into Single Dataframe

Note: After the join, the columns shared by both dataframes (name, city, state) appear twice, with .x and .y suffixes.

# Confirming column names on what to join by
colnames(classification_df)
##  [1] "unitid"                     "name"                      
##  [3] "core_name"                  "spec_name"                 
##  [5] "city"                       "state"                     
##  [7] "level"                      "public_private_profit"     
##  [9] "undergrad_program"          "grad_program"              
## [11] "enrollment_profile"         "undergrad_profile"         
## [13] "size_setting"               "degree_focuses"            
## [15] "community_engage"           "leadership_for_public_prax"
## [17] "research_tier_2025"
colnames(spending_df)
##  [1] "unitid"                        "name"                         
##  [3] "city"                          "state"                        
##  [5] "classific_2025"                "classific_2021"               
##  [7] "herd_fy21"                     "herd_fy22"                    
##  [9] "herd_fy23"                     "herd_avg_fy21_to_fy23"        
## [11] "num_doc_degrees_2020_2021"     "num_doc_degrees_2021_2022"    
## [13] "num_doc_degrees_2022_2023"     "avg_num_doc_degrees_2020_2023"
# Perform full join on 'unitid'
combined_df <- full_join(classification_df, spending_df, by = "unitid")

# Confirm join and notice duplicate column names
colnames(combined_df)
##  [1] "unitid"                        "name.x"                       
##  [3] "core_name"                     "spec_name"                    
##  [5] "city.x"                        "state.x"                      
##  [7] "level"                         "public_private_profit"        
##  [9] "undergrad_program"             "grad_program"                 
## [11] "enrollment_profile"            "undergrad_profile"            
## [13] "size_setting"                  "degree_focuses"               
## [15] "community_engage"              "leadership_for_public_prax"   
## [17] "research_tier_2025"            "name.y"                       
## [19] "city.y"                        "state.y"                      
## [21] "classific_2025"                "classific_2021"               
## [23] "herd_fy21"                     "herd_fy22"                    
## [25] "herd_fy23"                     "herd_avg_fy21_to_fy23"        
## [27] "num_doc_degrees_2020_2021"     "num_doc_degrees_2021_2022"    
## [29] "num_doc_degrees_2022_2023"     "avg_num_doc_degrees_2020_2023"
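Both inputs have 542 rows, yet the later summary of the joined data reports 543, which suggests that at least one unitid appears in only one of the two files. One way to check is an anti-join in each direction. The sketch below uses toy stand-ins for the two dataframes (hypothetical values), and the coalesce() step is an illustrative variation, not the approach used in this report, which keeps name.x and drops name.y.

```r
library(dplyr)

# Toy stand-ins for classification_df and spending_df (hypothetical values)
left_df  <- data.frame(unitid = c(1, 2, 3), name = c("A", "B", "C"))
right_df <- data.frame(unitid = c(2, 3, 4), name = c("B", "C", "D"))

# Rows whose unitid has no partner in the other table
only_left  <- anti_join(left_df,  right_df, by = "unitid")
only_right <- anti_join(right_df, left_df,  by = "unitid")

# Full join with explicit suffixes, then collapse the paired name columns,
# preferring the left-hand value and falling back to the right-hand one
joined <- full_join(left_df, right_df, by = "unitid",
                    suffix = c(".x", ".y")) %>%
  mutate(name = coalesce(name.x, name.y)) %>%
  select(-name.x, -name.y)

print(only_left)
print(only_right)
print(joined)
```

With a full join, the row count equals matched rows plus the unmatched rows from each side, so one unmatched unitid per file would account for 543 rows.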

Debugging - Checking for Mismatched Values in Duplicate Column Names

name, city, and state each have .x and .y versions. The .x suffix marks the column coming from classification_df; the .y suffix marks the column coming from spending_df.

# Total issues = 13

# Mismatched school names (there are 12)
combined_df %>%
  filter(name.x != name.y) %>%
  select(unitid, name.x, name.y)
##    unitid                                            name.x
## 1  104151                          Arizona State University
## 2  138354                    The University of West Florida
## 3  151111 Indiana University-Purdue University-Indianapolis
## 4  163259                 University of Maryland, Baltimore
## 5  163286               University of Maryland-College Park
## 6  164155                                  US Naval Academy
## 7  195049                            Rockefeller University
## 8  196060                                    SUNY at Albany
## 9  199111         University of North Carolina at Asheville
## 10 201885                          University of Cincinnati
## 11 207388                         Oklahoma State University
## 12 224554                                  East Texas A & M
##                                               name.y
## 1          Arizona State University Campus Immersion
## 2                         University of West Florida
## 3  Indiana University–Purdue University-Indianapolis
## 4                 University of Maryland - Baltimore
## 5              University of Maryland - College Park
## 6                        United States Naval Academy
## 7                         The Rockefeller University
## 8                               University at Albany
## 9             University of North Carolina Asheville
## 10              University of Cincinnati-Main Campus
## 11             Oklahoma State University-Main Campus
## 12                       East Texas A & M University
# Mismatched city names (there is 1: 'Des Moines' vs. 'West Des Moines')
combined_df %>%
  filter(city.x != city.y ) %>%
  select(unitid, city.x, city.y)
##   unitid     city.x          city.y
## 1 154156 Des Moines West Des Moines
# Mismatched state names (there are none)
combined_df %>%
  filter(state.x != state.y) %>%
  select(unitid, state.x, state.y)
## [1] unitid  state.x state.y
## <0 rows> (or 0-length row.names)
# Create a new dataframe for mismatched rows
mismatched <- combined_df %>%
  filter(name.x != name.y | city.x != city.y | state.x != state.y) %>%
  select(unitid, name.x, name.y, city.x, city.y, state.x, state.y)

# View the mismatched rows
print(mismatched)
##    unitid                                            name.x
## 1  104151                          Arizona State University
## 2  138354                    The University of West Florida
## 3  151111 Indiana University-Purdue University-Indianapolis
## 4  154156  Des Moines University-Osteopathic Medical Center
## 5  163259                 University of Maryland, Baltimore
## 6  163286               University of Maryland-College Park
## 7  164155                                  US Naval Academy
## 8  195049                            Rockefeller University
## 9  196060                                    SUNY at Albany
## 10 199111         University of North Carolina at Asheville
## 11 201885                          University of Cincinnati
## 12 207388                         Oklahoma State University
## 13 224554                                  East Texas A & M
##                                               name.y       city.x
## 1          Arizona State University Campus Immersion        Tempe
## 2                         University of West Florida    Pensacola
## 3  Indiana University–Purdue University-Indianapolis Indianapolis
## 4   Des Moines University-Osteopathic Medical Center   Des Moines
## 5                 University of Maryland - Baltimore    Baltimore
## 6              University of Maryland - College Park College Park
## 7                        United States Naval Academy    Annapolis
## 8                         The Rockefeller University     New York
## 9                               University at Albany       Albany
## 10            University of North Carolina Asheville    Asheville
## 11              University of Cincinnati-Main Campus   Cincinnati
## 12             Oklahoma State University-Main Campus   Stillwater
## 13                       East Texas A & M University     Commerce
##             city.y state.x state.y
## 1            Tempe      AZ      AZ
## 2        Pensacola      FL      FL
## 3     Indianapolis      IN      IN
## 4  West Des Moines      IA      IA
## 5        Baltimore      MD      MD
## 6     College Park      MD      MD
## 7        Annapolis      MD      MD
## 8         New York      NY      NY
## 9           Albany      NY      NY
## 10       Asheville      NC      NC
## 11      Cincinnati      OH      OH
## 12      Stillwater      OK      OK
## 13        Commerce      TX      TX
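One caveat about the comparisons above: in R, `!=` returns NA when either side is NA, and filter() drops rows where the condition is NA. Rows that exist in only one source after the full join would therefore never appear in the mismatch list. A small sketch with a toy frame (hypothetical values) shows the difference.

```r
library(dplyr)

# Toy joined frame (hypothetical values); row 3 has no match on the .y side
toy <- data.frame(
  unitid = c(1, 2, 3),
  name.x = c("Alpha University", "The Beta College", "Gamma Institute"),
  name.y = c("Alpha University", "Beta College", NA)
)

# `!=` yields NA when either side is NA, and filter() drops those rows
strict <- toy %>% filter(name.x != name.y)

# Treat a missing partner as a mismatch too
with_missing <- toy %>%
  filter(is.na(name.x) | is.na(name.y) | name.x != name.y)

print(strict)
print(with_missing)
```

Here `strict` contains only unitid 2, while `with_missing` also flags unitid 3, whose name is absent from one source.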

Cleaning Data – Fill any missing values with “NA”

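# Function to replace NULL with NA
# Note: the columns of combined_df are atomic vectors, which cannot hold NULL
# elements, so applying this function returns every column unchanged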

replace_null_with_na <- function(x) {
  if (is.list(x)) {
    # Apply recursively if it's a list (to handle nested data frames)
    return(lapply(x, replace_null_with_na))
  } else if (is.null(x)) {
    return(NA)  # Replace NULL with NA
  } else {
    return(x)  # Otherwise, return the value unchanged
  }
}

# Use it to apply to a data frame
combined_df_na <- combined_df %>%
  mutate(across(everything(), replace_null_with_na))

# View the result
summary(combined_df_na)
##      unitid          name.x           core_name          spec_name        
##  Min.   :100654   Length:543         Length:543         Length:543        
##  1st Qu.:148139   Class :character   Class :character   Class :character  
##  Median :187444   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :190585                                                           
##  3rd Qu.:217842                                                           
##  Max.   :495767                                                           
##                                                                           
##     city.x            state.x             level           public_private_profit
##  Length:543         Length:543         Length:543         Length:543           
##  Class :character   Class :character   Class :character   Class :character     
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character     
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  undergrad_program  grad_program       enrollment_profile undergrad_profile 
##  Length:543         Length:543         Length:543         Length:543        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  size_setting       degree_focuses     community_engage  
##  Length:543         Length:543         Length:543        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  leadership_for_public_prax research_tier_2025    name.y         
##  Length:543                 Length:543         Length:543        
##  Class :character           Class :character   Class :character  
##  Mode  :character           Mode  :character   Mode  :character  
##                                                                  
##                                                                  
##                                                                  
##                                                                  
##     city.y            state.y          classific_2025     classific_2021    
##  Length:543         Length:543         Length:543         Length:543        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    herd_fy21           herd_fy22           herd_fy23        
##  Min.   :0.000e+00   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:4.844e+06   1st Qu.:5.722e+06   1st Qu.:6.320e+06  
##  Median :1.982e+07   Median :2.205e+07   Median :2.558e+07  
##  Mean   :1.571e+08   Mean   :1.716e+08   Mean   :1.905e+08  
##  3rd Qu.:1.682e+08   3rd Qu.:1.868e+08   3rd Qu.:2.094e+08  
##  Max.   :1.710e+09   Max.   :1.806e+09   Max.   :2.047e+09  
##  NA's   :3           NA's   :3           NA's   :2          
##  herd_avg_fy21_to_fy23 num_doc_degrees_2020_2021 num_doc_degrees_2021_2022
##  Min.   :1.215e+06     Min.   :  0.0             Min.   :  0.0            
##  1st Qu.:5.778e+06     1st Qu.:  0.0             1st Qu.:  0.0            
##  Median :2.330e+07     Median : 32.0             Median : 36.0            
##  Mean   :1.730e+08     Mean   :108.1             Mean   :120.1            
##  3rd Qu.:1.900e+08     3rd Qu.:132.5             3rd Qu.:144.8            
##  Max.   :1.854e+09     Max.   :900.0             Max.   :942.0            
##  NA's   :2             NA's   :1                 NA's   :1                
##  num_doc_degrees_2022_2023 avg_num_doc_degrees_2020_2023
##  Min.   :  0.0             Min.   :  0.0                
##  1st Qu.:  3.0             1st Qu.:  2.0                
##  Median : 36.0             Median : 34.5                
##  Mean   :122.3             Mean   :116.8                
##  3rd Qu.:151.2             3rd Qu.:146.5                
##  Max.   :930.0             Max.   :924.0                
##  NA's   :1                 NA's   :1
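
For context on what the helper above does: columns of an ordinary data frame are atomic vectors, which cannot hold `NULL`, so the function only changes list-like input. The following is a minimal sketch on a hypothetical nested list (`nested` and `atomic_col` are illustrative names, not part of the project data).

```r
# Same helper as above, repeated so the example is self-contained
replace_null_with_na <- function(x) {
  if (is.list(x)) {
    lapply(x, replace_null_with_na)  # recurse into list elements
  } else if (is.null(x)) {
    NA                               # NULL becomes NA
  } else {
    x                                # anything else is returned unchanged
  }
}

# Hypothetical nested input containing NULL entries
nested <- list(a = 1, b = NULL, c = list(d = NULL, e = "x"))
cleaned <- replace_null_with_na(nested)
str(cleaned)

# An atomic column passes through untouched, including existing NAs
atomic_col <- c(1, NA, 3)
identical(replace_null_with_na(atomic_col), atomic_col)
```

If every column in the data frame is atomic, the `mutate(across(...))` step leaves the values as they were, and the `NA` counts in the summary reflect values that were already missing.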

Debugging - Checking which columns have missing values (NA)

Note: spec_name may have many missing values, and that is okay because it is optional (e.g., University of Maryland, College Park would have “College Park” as the spec_name, but Johns Hopkins only has one campus, so there is no “main campus” or any other special name).

# Display a table of the count of missing values (NAs) per column
missing_data <- combined_df_na %>% 
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  gather(key = "column_name", value = "NAs")

print(missing_data)
##                      column_name NAs
## 1                         unitid   0
## 2                         name.x   1
## 3                      core_name   1
## 4                      spec_name 371
## 5                         city.x   1
## 6                        state.x   1
## 7                          level   1
## 8          public_private_profit   1
## 9              undergrad_program   1
## 10                  grad_program   1
## 11            enrollment_profile   1
## 12             undergrad_profile   1
## 13                  size_setting   1
## 14                degree_focuses   1
## 15              community_engage   1
## 16    leadership_for_public_prax   1
## 17            research_tier_2025   1
## 18                        name.y   1
## 19                        city.y   1
## 20                       state.y   1
## 21                classific_2025   1
## 22                classific_2021   1
## 23                     herd_fy21   3
## 24                     herd_fy22   3
## 25                     herd_fy23   2
## 26         herd_avg_fy21_to_fy23   2
## 27     num_doc_degrees_2020_2021   1
## 28     num_doc_degrees_2021_2022   1
## 29     num_doc_degrees_2022_2023   1
## 30 avg_num_doc_degrees_2020_2023   1
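
The same per-column tally can be cross-checked in base R with `colSums(is.na(...))`. This is a small sketch on a hypothetical three-row data frame (`toy` is illustrative only; the column names are borrowed from the table above).

```r
# Hypothetical data with a few missing values
toy <- data.frame(
  unitid    = c(1, 2, 3),
  spec_name = c(NA, "College Park", NA),
  herd_fy21 = c(0, NA, 5e6),
  stringsAsFactors = FALSE
)

# Count NAs per column, then reshape into a two-column table
na_counts <- colSums(is.na(toy))
missing_data_check <- data.frame(
  column_name = names(na_counts),
  NAs = unname(na_counts)
)
print(missing_data_check)
```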
# Filter rows where name.x or name.y is NA, and select unitid, name.x, name.y
# Should be referring to Penn State, with 2 different unitid AND different school name spelling

rows_with_na_names <- combined_df_na %>%
  filter(is.na(name.x) | is.na(name.y)) %>%
  select(unitid, name.x, name.y, city.x, city.y)

print(rows_with_na_names)
##   unitid                            name.x
## 1 495767 The Pennsylvania State University
## 2 214777                              <NA>
##                                      name.y          city.x          city.y
## 1                                      <NA> University Park            <NA>
## 2 Pennsylvania State University-Main Campus            <NA> University Park
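
The two half-empty Penn State rows are what a full outer join produces when the same institution carries a different `unitid` in each source. The sketch below reproduces the effect on toy data and shows one possible reconciliation. It assumes the combined data came from a full join on `unitid`, which the `.x`/`.y` suffixes suggest, and the choice to keep 495767 as the canonical ID is an assumption for illustration. "Example University" and its ID are hypothetical.

```r
# Toy versions of the two source tables
stats <- data.frame(
  unitid = c(100001, 495767),
  name   = c("Example University", "The Pennsylvania State University"),
  stringsAsFactors = FALSE
)
spending <- data.frame(
  unitid = c(100001, 214777),
  name   = c("Example University", "Pennsylvania State University-Main Campus"),
  stringsAsFactors = FALSE
)

# Full outer join: Penn State appears twice, each row half NA
merged <- merge(stats, spending, by = "unitid", all = TRUE)
print(merged)

# Align the ID in one table before joining (assumed canonical ID: 495767)
spending_fixed <- spending
spending_fixed$unitid[spending_fixed$unitid == 214777] <- 495767
merged_fixed <- merge(stats, spending_fixed, by = "unitid", all = TRUE)
print(merged_fixed)
```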
# VISUALIZATIONS


Number of Institutions by Research Classification 2021

2 Versions (Raw count and relative proportion)

By Diamond Andy

This bar graph shows the number of institutions in each 2021 Carnegie classification category. It compares how many universities fall into each group and highlights the overall distribution of research classifications among U.S. institutions. This visualization also helps emphasize the tiers that are less common.

Filter and Assign Color Palette

# Recode main college type and subcategory
combined_df_na_cleaned <- combined_df_na %>%
  mutate(
    college_type = case_when(
      grepl("Baccalaureate Colleges", classific_2021) ~ "Baccalaureate",
      grepl("Master's Colleges & Universities", classific_2021) ~ "Master's",
      grepl("Doctoral", classific_2021) ~ "Doctoral",  # also catches "Doctoral/Professional Universities"
      grepl("Tribal Colleges and Universities", classific_2021) ~ "Tribal",
      grepl("Special Focus Four-Year", classific_2021) ~ "Special Focus",
      TRUE ~ NA_character_
    ),
    subcategory = case_when(
      grepl("Master's Colleges & Universities: Small Programs", classific_2021) ~ "Master's: Small Programs",
      grepl("Master's Colleges & Universities: Medium Programs", classific_2021) ~ "Master's: Medium Programs",
      grepl("Master's Colleges & Universities: Larger Programs", classific_2021) ~ "Master's: Larger Programs",
      grepl("Doctoral Universities: High Research Activity", classific_2021) ~ "Doctoral: High Research Activity",
      grepl("Doctoral Universities: Very High Research Activity", classific_2021) ~ "Doctoral: Very High Research Activity",
      grepl("Doctoral/Professional Universities", classific_2021) ~ "Doctoral: Professional Universities",
      grepl("Baccalaureate Colleges: Diverse Fields", classific_2021) ~ "Baccalaureate: Diverse Fields",
      grepl("Baccalaureate Colleges: Arts & Sciences Focus", classific_2021) ~ "Baccalaureate: Arts & Sciences Focus",
      grepl("Special Focus Four-Year: Research Institution", classific_2021) ~ "Special Focus: Research Institution",
      grepl("Special Focus Four-Year: Other Health Professions Schools", classific_2021) ~ "Special Focus: Health Professions",
      grepl("Special Focus Four-Year: Medical Schools & Centers", classific_2021) ~ "Special Focus: Medical Schools",
      grepl("Special Focus Four-Year: Engineering and Other Technology-Related Schools", classific_2021) ~ "Special Focus: Engineering/Technology",
      grepl("Special Focus Four-Year: Other Special Focus Institutions", classific_2021) ~ "Special Focus: Other Institutions",
      grepl("Tribal Colleges and Universities", classific_2021) ~ "Tribal Colleges",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(college_type) & !is.na(subcategory))

# Set factor levels for custom order of college_type
combined_df_na_cleaned$college_type <- factor(combined_df_na_cleaned$college_type,
  levels = c("Baccalaureate", "Master's", "Doctoral", "Special Focus", "Tribal")
)

# Set factor levels for custom order of subcategories
combined_df_na_cleaned$subcategory <- factor(combined_df_na_cleaned$subcategory, levels = c(
  "Baccalaureate: Diverse Fields",
  "Baccalaureate: Arts & Sciences Focus",
  "Master's: Small Programs",
  "Master's: Medium Programs",
  "Master's: Larger Programs",
  "Doctoral: High Research Activity",
  "Doctoral: Very High Research Activity",
  "Doctoral: Professional Universities",
  "Special Focus: Research Institution",
  "Special Focus: Health Professions",
  "Special Focus: Medical Schools",
  "Special Focus: Engineering/Technology",
  "Special Focus: Other Institutions",
  "Tribal Colleges"
))

# Custom shaded color palette
color_palette_shaded <- c(
  # Reds for Baccalaureate
  "Baccalaureate: Diverse Fields" = "#ff9999",
  "Baccalaureate: Arts & Sciences Focus" = "#cc0000",
  
  # Oranges for Master's
  "Master's: Small Programs" = "#ffcc99",
  "Master's: Medium Programs" = "#ff9933",
  "Master's: Larger Programs" = "#cc6600",
  
  # Blues for Doctoral
  "Doctoral: High Research Activity" = "#99ccff",
  "Doctoral: Very High Research Activity" = "#3399ff",
  "Doctoral: Professional Universities" = "#003366",
  
  # Greens for Special Focus
  "Special Focus: Research Institution" = "#b2d8b2",
  "Special Focus: Health Professions" = "#66cc66",
  "Special Focus: Medical Schools" = "#339966",
  "Special Focus: Engineering/Technology" = "#26734d",
  "Special Focus: Other Institutions" = "#145c33",
  
  # Purple for Tribal
  "Tribal Colleges" = "#9966cc"
)
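
Because the recode relies on substring matching, it helps to confirm which labels each pattern actually catches. The check below uses three label strings taken from the patterns above. The string "Doctoral Universities" does not occur inside "Doctoral/Professional Universities", while the shorter "Doctoral" matches both.

```r
# Three classification labels that appear in the recode patterns
labels_2021 <- c(
  "Doctoral Universities: Very High Research Activity",
  "Doctoral/Professional Universities",
  "Master's Colleges & Universities: Larger Programs"
)

# fixed = TRUE treats the pattern as a literal substring, not a regex
matches_strict <- grepl("Doctoral Universities", labels_2021, fixed = TRUE)
matches_broad  <- grepl("Doctoral", labels_2021, fixed = TRUE)

print(data.frame(label = labels_2021, matches_strict, matches_broad))
```

Any row whose `college_type` pattern misses while its `subcategory` pattern hits ends up with `NA` in `college_type` and is dropped by the final `filter()`.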

Raw Count of Institutions by College Type and Subcategory

p1 <- ggplot(combined_df_na_cleaned, aes(x = college_type, fill = subcategory)) + 
  geom_bar(position = "stack") +
  scale_fill_manual(values = color_palette_shaded) +
  labs(title = "Institution Counts by College Type and Subcategory",
       x = "College Type", y = "Count of Institutions") +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    legend.title = element_blank(),
    legend.box = "horizontal",
    legend.text = element_text(size = 10),
    legend.key.size = unit(0.5, "cm"),
    legend.box.just = "center",
    legend.spacing.x = unit(0.5, 'cm'),
    legend.spacing.y = unit(0.25, 'cm'),
    legend.key.height = unit(0.5, "cm"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  guides(fill = guide_legend(ncol = 3, title = "Subcategory"))

p1

Relative Proportion of Institutions by College Type and Subcategory

# Ensure college_type factor order is set (if not already done)
combined_df_na_cleaned$college_type <- factor(combined_df_na_cleaned$college_type,
  levels = c("Baccalaureate", "Master's", "Doctoral", "Special Focus", "Tribal")
)

# Proportion stacked bar plot (horizontal) with a single-column legend
p2 <- ggplot(combined_df_na_cleaned, aes(y = college_type, fill = subcategory)) + 
  geom_bar(position = "fill") +  # This scales the bars to 100% for each college type
  scale_fill_manual(values = color_palette_shaded) +  # Apply the custom color palette
  labs(title = "Proportional Counts by College Type and Subcategory",
       x = "Proportion of Institutions",
       y = "College Type") +              
  theme_minimal() +
  theme(axis.text.y = element_text(angle = 0),    # Keep y-axis labels horizontal
        axis.text.x = element_text(angle = 45, hjust = 1)) + # Rotate x-axis labels for readability
  guides(fill = guide_legend(ncol = 1, title = "Subcategory"))

# Print the plot
p2
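
The proportions drawn by `position = "fill"` are within-group shares: each subcategory count divided by the total for its college type. A base R sketch of that calculation on hypothetical rows (the `toy` data frame is illustrative only) could look like this.

```r
# Hypothetical institutions: four Doctoral, two Master's
toy <- data.frame(
  college_type = c("Doctoral", "Doctoral", "Doctoral", "Doctoral",
                   "Master's", "Master's"),
  subcategory = c("Doctoral: High Research Activity",
                  "Doctoral: High Research Activity",
                  "Doctoral: Very High Research Activity",
                  "Doctoral: Professional Universities",
                  "Master's: Small Programs",
                  "Master's: Larger Programs"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then divide each row by its row total (margin = 1)
counts <- table(toy$college_type, toy$subcategory)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Each row of `shares` sums to 1, which is why every bar in the proportional plot has the same length regardless of how many institutions it represents.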


Research Activity Designations (2025)

By Lily Gates

This graph visualizes the raw count and relative distribution of institutions by their research activity designation in 2025. The categories displayed include “Research Colleges and Universities”, “Research 1”, and “Research 2”. Data excludes institutions without a research designation.

The graph shows both the raw count and the relative distribution of institutions across three research activity categories: Research Colleges and Universities (RCU), Research 1 (R1), and Research 2 (R2).

- RCU institutions spend at least $2.5 million on research annually but do not meet the criteria for R1 or R2.
- R1 institutions spend at least $50 million on research and produce at least 70 research doctorates annually.
- R2 institutions spend at least $5 million on research and produce at least 20 research doctorates annually.
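
The thresholds described above can be written as a small decision rule. This is only a sketch that encodes the cutoffs as stated in this section; `classify_tier` is a hypothetical helper, and the official designation process may involve details (such as which years are averaged) that are not captured here.

```r
# Hypothetical helper: assign a tier from annual research spending (USD)
# and annual research doctorates, using the thresholds stated above
classify_tier <- function(spending, doctorates) {
  ifelse(spending >= 50e6 & doctorates >= 70, "R1",
    ifelse(spending >= 5e6 & doctorates >= 20, "R2",
      ifelse(spending >= 2.5e6, "Research Colleges and Universities",
             NA_character_)))
}

# Illustrative inputs, not real institutions
spending   <- c(60e6, 60e6, 3e6, 1e6)
doctorates <- c(80, 30, 0, 0)
tiers <- classify_tier(spending, doctorates)
print(tiers)
```

The second example shows that high spending alone is not enough for R1 under this rule: an institution also needs the doctorate count, otherwise it falls to R2.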

# Prepare the data and drop NA
plot_data <- combined_df_na %>%
  filter(!is.na(research_tier_2025)) %>%
  mutate(
    research_tier_2025 = factor(
      research_tier_2025,
      levels = c("Research Colleges and Universities",
                 "Research 1: Very High Research Spending and Doctorate Production", 
                 "Research 2: High Research Spending and Doctorate Production")
    )
  ) %>%
  count(research_tier_2025) %>%
  mutate(percentage = n / sum(n) * 100)  # Percentage share for the proportional bar plot

# Raw count bar plot
raw_count_plot <- ggplot(plot_data, aes(x = research_tier_2025, y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(
    x = "Research Activity Designation",
    y = "Raw Count"
  ) +
  theme_minimal() +
  scale_x_discrete(labels = c(
    "Research Colleges and Universities" = "Research Colleges\nand Universities", 
    "Research 1: Very High Research Spending and Doctorate Production" = "R1", 
    "Research 2: High Research Spending and Doctorate Production" = "R2"
  ))

# Proportional bar plot (percentage)
proportional_plot <- ggplot(plot_data, aes(x = research_tier_2025, y = percentage)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(
    x = "Research Activity Designation",
    y = "Proportion (%)"
  ) +
  theme_minimal() +
  scale_x_discrete(labels = c(
    "Research Colleges and Universities" = "Research Colleges\nand Universities", 
    "Research 1: Very High Research Spending and Doctorate Production" = "R1", 
    "Research 2: High Research Spending and Doctorate Production" = "R2"
  ))

# Combine the two plots and add a single title, subtitle, and caption for the whole figure
combined_plot <- raw_count_plot + proportional_plot + 
  plot_layout(guides = 'collect') +
  plot_annotation(
    title = "Raw Count and Proportion of Institutions by Research Activity Designation (2025)",
    subtitle = "Excludes institutions without a research designation",
    caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
  )

# Print the combined plot
combined_plot


Average Doctoral Degrees by Research Tier (2020-2023)

By Lily Gates

This boxplot displays the distribution of average doctoral degrees awarded from 2020 to 2023 across three research activity designations: “Research Colleges and Universities,” “Research 1 (R1),” and “Research 2 (R2).” The x-axis represents the average number of doctoral degrees awarded, while the y-axis categorizes institutions by their research tier.

The plot highlights the range, median, and quartiles for each research tier. Research Colleges and Universities, which are considered research-focused but do not meet the rigorous criteria for R1 or R2 designation, show a lower average number of doctoral degrees compared to R1 and R2 institutions. The “Research 1” and “Research 2” categories display higher and more variable averages, reflecting their very high and high research spending, respectively. This visualization provides insight into how research activity designation correlates with doctoral production.

# Clean data and relabel tiers
clean_df <- combined_df_na %>%
  filter(!is.na(avg_num_doc_degrees_2020_2023), !is.na(research_tier_2025)) %>%
  mutate(
    research_tier_2025 = recode(
      research_tier_2025,
      "Research 1: Very High Research Spending and Doctorate Production" = "R1",
      "Research 2: High Research Spending and Doctorate Production" = "R2",
      "Research Colleges and Universities" = "Research Colleges\nand Universities"
    )
  )

# Plot with horizontal box and whiskers
ggplot(clean_df, aes(x = avg_num_doc_degrees_2020_2023, y = research_tier_2025)) +
  geom_boxplot(fill = "slateblue", alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0, height = 0.2, color = "darkslateblue", alpha = 0.5, size = 1.8) +  # jitter vertically only so x values stay exact
  labs(
    title = "Average Doctoral Degrees by Research Tier (2020–2023)",
    x = "Avg. Doctoral Degrees (2020–2023)",
    y = "Research Tier (2025)",
    caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
  ) +
  theme_minimal() +
  theme(
    plot.caption = element_text(hjust = 0, size = 9, color = "gray30"),
    axis.text.y = element_text(size = 10)
  )
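
The medians and quartile ranges that the boxplot draws can also be computed directly, which is useful when quoting numbers in the narrative. A base R sketch on hypothetical values (`toy` is illustrative only and does not reflect the real data) could be:

```r
# Hypothetical average doctoral degree counts for two tiers
toy <- data.frame(
  tier = c("R1", "R1", "R1", "R2", "R2", "R2"),
  avg_degrees = c(200, 300, 400, 20, 40, 60),
  stringsAsFactors = FALSE
)

# Median and interquartile range per tier
medians <- tapply(toy$avg_degrees, toy$tier, median)
iqrs    <- tapply(toy$avg_degrees, toy$tier, IQR)
print(data.frame(tier = names(medians),
                 median = as.numeric(medians),
                 iqr = as.numeric(iqrs)))
```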


Doctoral Degrees Conferred by Research Tier and Academic Year (2020-2023)

By Lily Gates

This stacked bar graph visualizes the count of doctoral degrees conferred across different academic years (2020–2023), broken down by research tier. The research tiers are represented by three categories: “Research Colleges and Universities,” “Research 1 (R1),” and “Research 2 (R2).” The bars for each academic year are stacked to show the distribution of doctoral degrees within each research tier, allowing for an understanding of how the number of degrees varies across tiers and over time. The y-axis represents the raw count of doctoral degrees, and the x-axis represents the academic years.

In general, the number of doctoral degrees conferred by R2 institutions and “Research Colleges and Universities” has remained relatively stable over the years. The smallest proportion is represented by “Research Colleges and Universities,” with R2 institutions showing a modest amount. However, R1 institutions have experienced a significant increase in both the count and proportion of degrees conferred, especially between the 2020-2021 and 2021-2022 academic years. This highlights the growing trend of doctoral degree production within high-research activity institutions. The plot also includes a note that “Research Colleges and Universities” are research-focused but do not meet the criteria for R1 or R2 designation.

# Reshape the data into long format for easier plotting and drop NA values
long_combined_df <- combined_df_na_cleaned %>%
  select(
    research_tier_2025,
    num_doc_degrees_2020_2021,
    num_doc_degrees_2021_2022,
    num_doc_degrees_2022_2023
  ) %>%
  pivot_longer(
    cols = c(
      "num_doc_degrees_2020_2021",
      "num_doc_degrees_2021_2022",
      "num_doc_degrees_2022_2023"
    ),
    names_to = "academic_year",
    values_to = "num_doc_degrees"
  ) %>%
  # Clean up the research tier labels and academic year labels
  mutate(
    research_tier_2025 = recode(
      research_tier_2025, 
      "Research 1: Very High Research Spending and Doctorate Production" = "Research 1", 
      "Research 2: High Research Spending and Doctorate Production" = "Research 2",
      "Research Colleges and Universities" = "Research Colleges and Universities"
    ),
    academic_year = recode(
      academic_year,
      "num_doc_degrees_2020_2021" = "2020-2021",
      "num_doc_degrees_2021_2022" = "2021-2022",
      "num_doc_degrees_2022_2023" = "2022-2023"
    )
  ) %>%
  # Filter out any rows with NA values in num_doc_degrees or research_tier_2025
  filter(!is.na(num_doc_degrees), !is.na(research_tier_2025))

# Create a stacked bar graph of doctoral degrees by research tier and academic year
ggplot(long_combined_df, aes(x = academic_year, y = num_doc_degrees, fill = research_tier_2025)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
    title = "Doctoral Degrees Conferred by Research Tier and Academic Year (2020–2023)",
    x = "Academic Year",
    y = "Count of Doctoral Degrees",
    fill = "Research Tier",
    caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0),
    axis.text.x = element_text(angle = 0, hjust = 0.5),  # Center the x-axis labels
    plot.caption = element_text(hjust = 0, size = 9, color = "gray30"),
    legend.position = "right",  # Keep the legend on the right
    legend.box = "vertical"  # Arrange legend vertically
  ) +
  guides(
    fill = guide_legend(ncol = 1)  # Set the legend to 1 column
  ) +
  scale_fill_manual(
    values = c("Research 1" = "#4C79A1", "Research 2" = "#F1A340", "Research Colleges and Universities" = "#6D9D4B"),  # New colors
    labels = c("Research 1", "Research 2", "Research Colleges and Universities")  # Set the legend labels
  ) +
  scale_y_continuous(
    labels = scales::comma  # Adds commas to the y-axis labels for better readability
  )
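
The totals behind each stacked segment can also be tabulated directly, which makes year-over-year changes easier to quantify than reading them off the chart. The sketch below uses base R on hypothetical counts (`toy` is illustrative only; the column names mirror the ones used above).

```r
# Hypothetical per-institution doctoral degree counts
toy <- data.frame(
  research_tier_2025 = c("Research 1", "Research 1", "Research 2"),
  num_doc_degrees_2020_2021 = c(100, 200, 30),
  num_doc_degrees_2021_2022 = c(120, 210, 30),
  num_doc_degrees_2022_2023 = c(125, 220, 35),
  stringsAsFactors = FALSE
)

# Sum each academic-year column within each tier
year_cols <- grep("^num_doc_degrees_", names(toy), value = TRUE)
totals <- aggregate(toy[year_cols],
                    by = list(tier = toy$research_tier_2025),
                    FUN = sum, na.rm = TRUE)
print(totals)

# Percent change from the first to the second academic year
pct_change <- 100 * (totals[[year_cols[2]]] / totals[[year_cols[1]]] - 1)
names(pct_change) <- totals$tier
print(round(pct_change, 1))
```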


Count of Institutions by Type and Research Activity Designation (2025) – 2 graphs

By Lily Gates

This pair of grouped bar graphs displays the count of institutions by type (public, private non-profit, and private for-profit) and research activity designation (R1, R2, and Research Colleges and Universities) for 2025. Both graphs show the same data, but the fill and x-axis variables are swapped between them.

The majority of institutions with any research activity designation in this dataset are public institutions, with far fewer private non-profit and private for-profit institutions in each research activity category. Notably, Research Colleges and Universities is the only category that contains a private for-profit institution, and it has just one.

In terms of distribution across research activity designations, Research Colleges and Universities appear to have a higher number of private non-profit institutions compared to R1 and R2 institutions. Specifically, Research Colleges and Universities are predominantly composed of private non-profit institutions, whereas R1 and R2 categories show a much stronger representation of public institutions.

Interestingly, the number of public institutions in the R1 category is almost equal to that of the Research Colleges and Universities category, which further suggests that Research Colleges and Universities have a more balanced composition of private non-profit institutions.

This analysis highlights key trends in how research activity designation relates to institutional type, with public institutions overwhelmingly dominating higher research tiers, and Research Colleges and Universities being the outlier with more private non-profit representation.

Institutions Grouped by Institution Type (Public, Private (non-profit), Private (for-profit))

# Ensure all combinations of 'public_private_profit' and 'research_tier_2025' are represented
plot_df_clean <- combined_df_na %>%
  filter(!is.na(public_private_profit), !is.na(research_tier_2025)) %>%
  count(research_tier_2025, public_private_profit) %>%
  complete(research_tier_2025, public_private_profit, fill = list(n = 0))  # Fill missing combinations with 0

# Create the plot
ggplot(plot_df_clean, aes(x = public_private_profit, y = n, fill = research_tier_2025)) +
  geom_bar(position = "dodge", stat = "identity", width = 0.7) +  # Adjust width for equal bar widths
  labs(
    title = "Count of Institutions by Type and Research Activity Designation (2025)",
    subtitle = "Excludes institutions without a research designation",
    x = "Institution Type",
    y = "Count of Institutions",
    fill = "Research Activity Designation",
    caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
  ) +
  theme_minimal() +
  scale_x_discrete(labels = c(
    "Public" = "Public", 
    "Private not-for-profit" = "Private (non-profit)",  # key matches the value used in the fill scale of the second graph
    "Private for-profit" = "Private (for-profit)"
  )) +
  scale_y_continuous(limits = c(0, 150), expand = c(0, 0)) +
  scale_fill_manual(
    values = c(
      "Research Colleges and Universities" = "palevioletred",  # Research Colleges stand out
      "Research 1: Very High Research Spending and Doctorate Production" = "steelblue",  # R1 color
      "Research 2: High Research Spending and Doctorate Production" = "cornflowerblue"  # R2 color
    ),
    labels = c(
      "Research Colleges and Universities" = "Research Colleges\nand Universities", 
      "Research 1: Very High Research Spending and Doctorate Production" = "R1", 
      "Research 2: High Research Spending and Doctorate Production" = "R2"
    )
  ) +
  theme(
    axis.text.y = element_text(size = 10),
    legend.position = "bottom",  # Move the legend to the bottom
    legend.box = "horizontal",   # Makes the legend horizontal
    legend.title = element_text(size = 12),  # Adding legend title size
    axis.text.x = element_text(angle = 0, hjust = 0.5)  # Set x-axis labels to 0 degrees (normal)
  )

Institutions Grouped by Research Activity Designation

ggplot(plot_df_clean, aes(x = research_tier_2025, y = n, fill = public_private_profit)) +
  geom_bar(position = "dodge", stat = "identity", width = 0.7) +
  labs(
    title = "Count of Institutions by Research Activity Designation and Type (2025)",
    subtitle = "Excludes institutions without a research designation",
    x = "Research Activity Designation",
    y = "Count of Institutions",
    fill = "Institution Type",
    caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
  ) +
  theme_minimal() +
  scale_x_discrete(labels = c(
    "Research Colleges and Universities" = "Research Colleges\nand Universities", 
    "Research 1: Very High Research Spending and Doctorate Production" = "R1", 
    "Research 2: High Research Spending and Doctorate Production" = "R2"
  )) +
  scale_y_continuous(limits = c(0, 150), expand = c(0, 0)) +
  scale_fill_manual(
    values = c(
      "Public" = "seagreen", 
      "Private not-for-profit" = "darkorange", 
      "Private for-profit" = "saddlebrown"
    ),
    labels = c(
      "Public" = "Public",
      "Private not-for-profit" = "Private (non-profit)",
      "Private for-profit" = "Private (for-profit)"
    )
  ) +
  theme(
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(angle = 0, hjust = 0.5),
    legend.position = "bottom",
    legend.box = "horizontal",
    legend.title = element_text(size = 12)
  )
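
Both graphs depend on every tier-by-type combination being present, including combinations with no institutions, so that dodged bars keep equal widths. The `count()` plus `complete()` step handles this; a base R cross-tabulation gives the same zero-filled grid. The sketch below uses hypothetical rows (`toy` is illustrative only).

```r
# Hypothetical institutions covering only four of nine combinations
toy <- data.frame(
  research_tier_2025 = c("R1", "R1", "R2",
                         "Research Colleges and Universities"),
  public_private_profit = c("Public", "Private not-for-profit", "Public",
                            "Private for-profit"),
  stringsAsFactors = FALSE
)

# table() includes empty cells, so absent combinations get n = 0
counts <- as.data.frame(
  table(tier = toy$research_tier_2025, type = toy$public_private_profit),
  responseName = "n"
)
print(counts)
```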


Distribution of HERD Spending for Fiscal Years 2021, 2022, and 2023

By Colin Thompson

This histogram displays the distribution of HERD spending across three fiscal years: 2021, 2022, and 2023. It highlights the range of research spending institutions report, with the peak around 0.05 billion dollars in fiscal year 2023, suggesting that many institutions are clustered in this spending range. Over the years, spending has gradually increased, especially in FY2023, indicating a growing trend in research investment. By examining this distribution, we can identify the most common spending ranges, which can help provide insight into what is considered typical or desirable for research spending. The graph also reveals the highest amounts of spending, useful for understanding the research budgets of institutions with the largest research investments.

# Clean and reshape HERD data for all three fiscal years (FY21, FY22, FY23)

combined_df_long <- combined_df_na_cleaned %>%
  select(research_tier_2025, herd_fy21, herd_fy22, herd_fy23) %>%
  pivot_longer(
    cols = starts_with("herd_fy"),
    names_to = "fiscal_year",
    values_to = "herd_spending"
  ) %>%
  filter(!is.na(herd_spending)) %>%
  mutate(
    fiscal_year = recode(
      fiscal_year,
      herd_fy21 = "HERD Spending FY2021",
      herd_fy22 = "HERD Spending FY2022",
      herd_fy23 = "HERD Spending FY2023"
    )
  )

# Plot distribution of HERD spending for FY21, FY22, and FY23 with log scale
ggplot(combined_df_long, aes(x = herd_spending)) +
  geom_histogram(bins = 20, color = "black", fill = "seagreen", alpha = 0.85) +
  scale_x_log10(labels = scales::dollar_format(scale = 1e-9, suffix = "B")) +
  facet_wrap(~ fiscal_year, nrow = 1, ncol = 3) + # Arrange FY21, FY22, FY23 side by side
  labs(
    title = "Distribution of HERD Spending for Fiscal Years 2021, 2022, and 2023",
    x = "HERD Spending (Log Scale, in Billions)",
    y = "Number of Institutions"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    strip.text = element_text(face = "bold"),
    legend.position = "none"
  )
## Warning in scale_x_log10(labels = scales::dollar_format(scale = 1e-09, suffix =
## "B")): log-10 transformation introduced infinite values.
## Warning: Removed 9 rows containing non-finite outside the scale range
## (`stat_bin()`).
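
The warnings above arise because a log scale cannot represent zero: `log10(0)` is `-Inf`, and ggplot2 drops those rows before binning. This is consistent with the minimum of 0 reported for the HERD columns in the earlier summary. A small sketch of how zero-spending records could be counted and set aside before plotting follows; the `herd_spending` vector is hypothetical.

```r
# Hypothetical spending values, including zeros and a missing value
herd_spending <- c(0, 0, 4.8e6, 2.0e7, 1.7e9, NA)

# Zeros become -Inf on a log10 scale
print(log10(herd_spending))

# Count the zero records so they can be reported alongside the plot
n_zero <- sum(herd_spending == 0, na.rm = TRUE)

# Keep only strictly positive, non-missing values for the log-scale plot
positive_spending <- herd_spending[!is.na(herd_spending) & herd_spending > 0]
print(n_zero)
print(positive_spending)
```

Filtering explicitly and stating how many institutions were excluded keeps the omission visible to readers instead of leaving it in a console warning.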


HERD Spending by Fiscal Year and Research Tier

By Lily Gates

This histogram displays research spending (HERD) across three types of institutions for fiscal years 2021, 2022, and 2023:

- Research Colleges and Universities (R Colleges): These institutions generally have the lowest research spending, concentrated on the left side of the graph.
- Research 2 (R2): R2 institutions fall in the middle range, with some overlapping with both R Colleges and Research 1 institutions.
- Research 1 (R1): These institutions have the highest research spending, appearing on the far right.

The log scale on the x-axis highlights differences in spending, particularly at the higher end. The overlap between R2 and the other categories suggests that some R2 institutions have spending comparable to R1 or R Colleges.

# Reshape HERD data to long format and drop rows missing a research tier or spending value
combined_df_long <- combined_df_na_cleaned %>%
  select(research_tier_2025, herd_fy21, herd_fy22, herd_fy23) %>%
  pivot_longer(
    cols = starts_with("herd_fy"),
    names_to = "fiscal_year",
    values_to = "herd_spending"
  ) %>%
  filter(!is.na(herd_spending), !is.na(research_tier_2025)) %>%
  mutate(
    fiscal_year = recode(
      fiscal_year,
      herd_fy21 = "HERD Spending FY2021",
      herd_fy22 = "HERD Spending FY2022",
      herd_fy23 = "HERD Spending FY2023"
    ),
    # Simplify research tier labels
    research_tier_simple = recode(
      research_tier_2025,
      "Research 1: Very High Research Spending and Doctorate Production" = "Research 1",
      "Research 2: High Research Spending and Doctorate Production" = "Research 2",
      "Research Colleges and Universities" = "Research Colleges and Universities"
    )
  ) %>%
  filter(!is.na(research_tier_simple), !is.na(herd_spending))  # Remove NAs

# Plot with simplified legend labels and without NAs
ggplot(combined_df_long, aes(x = herd_spending, fill = research_tier_simple)) +
  geom_histogram(bins = 30, color = "black", alpha = 0.6, position = "identity") +
  scale_x_log10(labels = scales::dollar_format(scale = 1e-9, suffix = "B")) +
  facet_wrap(~ fiscal_year, nrow = 1, ncol = 3) +
  scale_fill_manual(
    values = c(
      "Research 1" = brewer.pal(3, "Set2")[1],
      "Research 2" = brewer.pal(3, "Set2")[2],
      "Research Colleges and Universities" = brewer.pal(3, "Set2")[3]
    ),
    labels = c("Research 1", "Research 2", "Research Colleges and Universities")  # Simplify the legend labels
  ) +
  labs(
    title = "HERD Spending by Fiscal Year and Research Tier (Log Scale)",
    subtitle = "Excludes institutions without a research designation",
    x = "HERD Spending (Log Scale, in Billions)",
    y = "Number of Institutions",
    fill = "Research Tier",
    caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    plot.caption = element_text(hjust = 0.5, size = 9),
    strip.text = element_text(face = "bold"),
    legend.position = "bottom",
    legend.direction = "horizontal"
  )
## Warning in scale_x_log10(labels = scales::dollar_format(scale = 1e-09, suffix =
## "B")): log-10 transformation introduced infinite values.
## Warning: Removed 9 rows containing non-finite outside the scale range
## (`stat_bin()`).
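
The two warnings above are what ggplot2 reports when a log-scaled axis receives values whose logarithm is not finite. A likely cause, stated here as an assumption because the underlying rows are not shown, is institutions with zero HERD spending in a given year, since log10(0) is -Inf. The sketch below uses made-up values in base R to show the effect and one possible way to avoid it:

```r
# Hypothetical spending values in dollars (not taken from the dataset);
# the zero stands in for an institution reporting no HERD spending.
herd_spending <- c(0, 2.5e6, 4.0e7, 1.2e9)

# log10(0) is -Inf, which a log-scaled axis cannot place
log_spending <- log10(herd_spending)
print(log_spending)

# Count the values a log10 axis would have to drop
n_dropped <- sum(!is.finite(log_spending))
print(n_dropped)

# Keeping only positive values before plotting avoids the warning
herd_positive <- herd_spending[herd_spending > 0]
print(length(herd_positive))
```

In the pipeline above, the equivalent step would be a condition such as herd_spending > 0 in the existing filter() call, if zero-spending rows are indeed the cause.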


Average Research Spending vs. Doctoral Degrees (2020–2023) by Research Tier

Two graphs: Raw and Log Scale

By Mia Leandri

Both scatter plots show the relationship between average research spending (FY2021 to FY2023) and the average number of doctoral degrees awarded (2020–2023) across American higher education institutions, once on a log scale and once on the raw scale.

The x-axis shows average research spending in U.S. dollars, while the y-axis indicates the average number of doctoral degrees conferred. Institutions with greater research spending tend to award more doctoral degrees, illustrating a positive correlation between institutional investment in research and doctoral productivity.

This visualization supports the analytical research question by showing how research tier lines up with both research spending and doctoral degree production, the two measures named in the R1 and R2 tier labels.
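
The positive association described above can also be summarized numerically. The following is a minimal sketch on invented values, not the report's data; the data frame toy and its column names are hypothetical. It computes a rank correlation, which is unaffected by the log transform, and a Pearson correlation on log10 values, which corresponds to how linear the points look on the log-scale plot.

```r
# Hypothetical institution-level values (illustrative only, not from the report's data)
toy <- data.frame(
  herd_avg_billions = c(0.002, 0.010, 0.050, 0.200, 0.900, 1.500),
  avg_doc_degrees   = c(5, 12, 40, 150, 400, 700)
)

# Spearman's rank correlation depends only on ordering, so it is the same
# whether the axes are raw or log-scaled
rho <- cor(toy$herd_avg_billions, toy$avg_doc_degrees, method = "spearman")

# Pearson correlation on log10 values measures linearity on the log-log plot
r_log <- cor(log10(toy$herd_avg_billions), log10(toy$avg_doc_degrees))

print(rho)
print(r_log)
```

On the real data, the same calls could be applied to herd_avg_fy21_to_fy23_billions and avg_num_doc_degrees_2020_2023 after removing non-positive values.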

Log Scale - Average Research Spending vs. Doctoral Degrees (2020–2023) by Research Tier

# Clean the spending column (convert to billions)
clean_df_long <- combined_df %>%
  mutate(
    herd_avg_fy21_to_fy23_billions = herd_avg_fy21_to_fy23 / 1e9  # Convert to billions
  )
# Rename tiers for cleaner display
clean_df_long <- clean_df_long %>%
  mutate(research_tier_2025 = case_when(
    grepl("Research 1", research_tier_2025) ~ "Research 1 (R1)",
    grepl("Research 2", research_tier_2025) ~ "Research 2 (R2)",
    TRUE ~ "Research Colleges and Universities"  # Catch-all: also labels NA or unrecognized tiers
  ))

# Filter out rows where y-values are <= 0
clean_df_long_filtered <- clean_df_long %>%
  filter(avg_num_doc_degrees_2020_2023 > 0)

# Scatter plot using log10 scales with a lower limit for the y-axis
ggplot(clean_df_long_filtered, aes(
  x = herd_avg_fy21_to_fy23_billions,
  y = avg_num_doc_degrees_2020_2023,
  color = research_tier_2025
)) +
  geom_point(alpha = 0.6, size = 1.2) +
  scale_color_brewer(palette = "Set2") +
  scale_x_log10(
    labels = scales::label_dollar(scale = 1, suffix = "B"),
    breaks = scales::log_breaks()
  ) +
  scale_y_log10(
    breaks = scales::log_breaks(),
    limits = c(1, NA)  # Set the lower limit of y-axis to 1
  ) +
  labs(
    title = "Avg. Research Spending vs. Doctoral Degrees (2020–2023)",
    x = "Avg. Research Spending (Log Scale, Billions USD)",
    y = "Avg. Number of Doctoral Degrees\n(Log Scale)",
    color = "Research Tier"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    legend.key.width = unit(1, "cm"),
    legend.key.height = unit(0.5, "cm"),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  ) +
  guides(color = guide_legend(ncol = 3))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

Raw Counts - Average Research Spending vs. Doctoral Degrees (2020–2023) by Research Tier

# Clean the spending column (remove $ and commas and convert to numeric)
clean_df_long <- combined_df %>%
  mutate(
    herd_avg_fy21_to_fy23 = as.numeric(gsub("[$,]", "", herd_avg_fy21_to_fy23)),
    herd_avg_fy21_to_fy23_billions = herd_avg_fy21_to_fy23 / 1e9  # Convert to billions
  )

# Optional: Rename tiers for cleaner display (if needed)
clean_df_long <- clean_df_long %>%
  mutate(research_tier_2025 = case_when(
    grepl("Research 1", research_tier_2025) ~ "Research 1 (R1)",
    grepl("Research 2", research_tier_2025) ~ "Research 2 (R2)",
    TRUE ~ "Research Colleges and Universities"  # Catch-all: also labels NA or unrecognized tiers
  ))

# Scatter plot of research spending vs doctoral degrees
ggplot(clean_df_long, aes(
  x = herd_avg_fy21_to_fy23_billions,
  y = avg_num_doc_degrees_2020_2023,
  color = research_tier_2025
)) +
  geom_point(alpha = 0.5, size = 1.5) +
  scale_color_brewer(palette = "Set2") +
  scale_x_continuous(
    labels = scales::label_dollar(scale = 1, suffix = "B"),  # Format in billions
    breaks = scales::pretty_breaks(n = 10)  # More breaks on the x-axis
  ) +
  scale_y_continuous(
    breaks = scales::pretty_breaks(n = 10)  # More breaks on the y-axis
  ) +
  labs(
    title = "Avg. Research Spending vs. Doctoral Degrees (2020–2023)",
    x = "Avg. Research Spending (Billions USD)",
    y = "Avg. Number of Doctoral Degrees",
    color = "Research Tier"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",  # Position the legend below the x-axis
    legend.key.width = unit(1, "cm"),  # Adjust the width of the legend keys
    legend.key.height = unit(0.5, "cm"),  # Adjust the height of the legend keys
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  ) +
  guides(color = guide_legend(ncol = 3))  # Set the legend to 3 columns
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).


Distribution of Size and Residential Type by Research Type

By Mia Leandri

This grouped bar chart displays the distribution of research-designated institutions (R1, R2, and Research Colleges and Universities) across different campus size and residential categories, focusing only on 4-year universities. Institutions are categorized by size (Very Small to Large) and residential character (Highly Residential to Primarily Nonresidential), providing a detailed look at how research activity is spread across different campus environments. The chart reveals trends in how institutional research tiers align with campus scale and student living dynamics, offering insights into the structural diversity of U.S. higher education.

# Clean and categorize the data
clean_combined_df <- combined_df_na %>%
  mutate(
    size_category = case_when(
      grepl("very small", size_setting) ~ "Very Small",
      grepl("small", size_setting) ~ "Small",
      grepl("medium", size_setting) ~ "Medium",
      grepl("large", size_setting) ~ "Large",
      TRUE ~ NA_character_  # No size keyword matched; dropped by the filter below
    ),
    residential_category = case_when(
      grepl("highly residential", size_setting) ~ "Highly Residential",
      grepl("primarily residential", size_setting) ~ "Primarily Residential",
      grepl("primarily nonresidential", size_setting) ~ "Primarily Nonresidential",
      TRUE ~ NA_character_
    ),
    research_tier_2025 = factor(
      research_tier_2025,
      levels = c(
        "Research 1: Very High Research Spending and Doctorate Production",
        "Research 2: High Research Spending and Doctorate Production",
        "Research Colleges and Universities"
      )
    )
  ) %>%
  filter(!is.na(size_category) & !is.na(residential_category))

# Define label shorteners for x-axis and legend
research_labels <- c(
  "Research 1: Very High Research Spending and Doctorate Production" = "R1",
  "Research 2: High Research Spending and Doctorate Production" = "R2",
  "Research Colleges and Universities" = "Research\nColleges"
)

# Plot: Grouped bar chart faceted by size and residential categories
ggplot(clean_combined_df, aes(x = research_tier_2025, fill = research_tier_2025)) +
  geom_bar(position = "dodge") +
  facet_grid(size_category ~ residential_category, scales = "free_x") +
  scale_x_discrete(labels = research_labels) +
  scale_fill_manual(
    values = c(
      "Research 1: Very High Research Spending and Doctorate Production" = "#1b9e77",
      "Research 2: High Research Spending and Doctorate Production" = "#d95f02",
      "Research Colleges and Universities" = "#7570b3"
    ),
    labels = research_labels
  ) +
  labs(
    title = "Research Tiers by Size and Residential Type (4-Year Universities Only)",
    x = "Research Tier",
    y = "Number of Institutions",
    fill = "Research Tier",
    caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = .5, size = 10),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12),
    legend.position = "bottom",
    legend.title = element_text(size = 11),
    legend.text = element_text(size = 10),
    strip.text.x = element_text(size = 11),
    strip.text.y = element_text(size = 11),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.caption = element_text(size = 9, face = "italic", hjust = 1)
  )
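
The size categorization above depends on the order of the pattern tests, because "small" is a substring of "very small". The sketch below illustrates this first-match-wins behavior using nested ifelse() from base R in place of dplyr::case_when(), so it runs without extra packages. The size_setting strings are hypothetical examples written in the lowercase style the patterns expect.

```r
# Hypothetical size_setting labels, written in the style the patterns above expect
size_setting <- c(
  "Four-year, very small, highly residential",
  "Four-year, small, primarily residential",
  "Four-year, large, primarily nonresidential"
)

# Nested ifelse() mirrors case_when(): the first matching test wins.
# "very small" is tested before "small" because "small" also matches it.
size_ok <- ifelse(grepl("very small", size_setting), "Very Small",
           ifelse(grepl("small", size_setting), "Small",
           ifelse(grepl("medium", size_setting), "Medium",
           ifelse(grepl("large", size_setting), "Large", NA_character_))))

# Same tests with the first two swapped: "very small" is mislabeled "Small"
size_swapped <- ifelse(grepl("small", size_setting), "Small",
                ifelse(grepl("very small", size_setting), "Very Small",
                ifelse(grepl("medium", size_setting), "Medium",
                ifelse(grepl("large", size_setting), "Large", NA_character_))))

# "primarily residential" is not a substring of "primarily nonresidential"
matches_residential <- grepl("primarily residential", size_setting)

print(size_ok)
print(size_swapped)
print(matches_residential)
```

Note that grepl() is case-sensitive by default, so labels that capitalize the size words would fall through to NA unless ignore.case = TRUE is set.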