| # PROJECT OVERVIEW |
Exploring the differences between American research colleges and universities through data visualization.
| # RESEARCH QUESTIONS/TOPICS, AND NARRATIVE |
| # Communications - Mia Leandri |
Our chosen datasets should be well-suited to address both our
descriptive/comparative and analytical/inferential research questions.
For descriptive and comparative purposes, we have access to data on most
of the key factors we aim to explore, allowing us to describe and
compare differences across universities. For instance, we can analyze
the geographic distribution of institutions by state and city,
admissions data on who is being admitted, academic offerings across
universities, and the number of doctoral degrees awarded at each
institution. By combining this information with institutional research
spending data from the carnegie_spending_doctorate_awards
dataset and basic school characteristics from
carnegie_classification_stats, we can compare which factors
may contribute to higher research spending across institutions.
For our analytical and inferential questions, we can use the data in several ways to explore what factors—beyond financial expenditures—are associated with a higher research classification. For example, admissions and academic offerings data, when examined alongside research classifications, may help us determine whether specific academic or demographic characteristics play a role in an institution’s likelihood of achieving Research 1 or Research 2 status. This approach goes beyond simply examining research spending and instead investigates what factors may influence institutional research standing.
Our primary audience includes college and university administrators. We hope this analysis will be especially useful for those seeking to understand which institutional characteristics are associated with higher research rankings. For instance, if institutions with more diverse admissions profiles or broader academic offerings tend to rank higher in research classifications, administrators at other institutions may consider incorporating similar changes. This information could support strategic planning and resource allocation to strengthen their institutions’ research profile.
| # DATA AND RESEARCH SOURCES |
| # Data Engineers - Colin Thompson (finding, describing) and Lily Gates (cleaning, combining datasets) |
Cleaned CSV versions of the “Carnegie Classification of Institutions of Higher Education” https://carnegieclassifications.acenet.edu/
carnegie_spending_doctorate_awards (See
carnegie_research_activity_desig_factsheet.pdf for more info)
* School
* Location (City, State)
* Classification (R1, R2, Other), from 2021 and 2025
* Expenses (2021, 2022, 2023)
* Research Doctorates Awarded (2020, 2021, 2022)

carnegie_classification_stats (See
carnegie_variable_descr_flowchart.pdf for more info)
* School
* Filtering main school if applicable (e.g., UMBC, UMD, UMES all are “University of Maryland” sub schools)
* Location (City, State, residential or not)
* Public or private (non-profit, for-profit)
* Description of types of degree programs offered
* Enrollment information
* Classification on tier (R1, R2, Other)
classification_df is from carnegie_classification_stats
spending_df is from carnegie_spending_doctorate_awards

Basic School Information

NAME
- unitid - unique ID number
- name.x - name of school - orig from classification_df
  - NOTE: is later renamed to name while name.y is dropped
- name.y - name - orig from spending_df
- core_name - name of the main university (e.g., “University of Maryland”) - orig from the classification_df
- spec_name - name of the specific campus/satellite (e.g., “College Park”) - orig from the classification_df

SCHOOL TYPE
- public_private_profit - public, private (for-profit, non-profit) - orig from classification_df

SCHOOL LOCATION
- city - city
- state - state abbreviation
- size_setting - size, 4-year vs. graduate/professional, how residential the school is - orig from classification_df
Enrollment and Admissions
- level - four or more years as opposed to associate degree 2-year programs such as community colleges - orig from classification_df
- enrollment_profile - distribution of undergrad vs. graduate students - orig from classification_df
- size_setting - student body population - orig from classification_df
  - size:
    - very small (less than 999)
    - small (1000 to 2999)
    - medium (3000 to 9999)
    - large (10000 or greater)
  - residential: percentage of living on-campus students
    - primarily non-res: less than 25% FT students
    - primarily residential: 25 to 50% FT students
    - highly residential: more than 50% FT students
- undergrad_profile - for 4-year universities - orig from classification_df
  - full/part-time:
    - higher part-time (over 40% PT)
    - medium-full-time (21-39% PT)
    - full-time (less than 20% PT)
  - selective:
    - inclusive (ACT equivalent <19)
    - more selective (ACT equivalent 19 to 23)
    - selective (ACT equivalent over 23)
  - transfer in:
    - lower (<20%)
    - higher (greater than 20%)
Research Classification Ranking
- research_tier_2025 - 2025 tier ranking - orig from classification_df
- classific_2021 - 2021 tier ranking - orig from spending_df
- classific_2025 - 2025 tier ranking - orig from spending_df

Finances
orig from spending_df
- herd_fy21 - 2021 fiscal year expenses on research
- herd_fy22 - 2022 fiscal year expenses on research
- herd_fy23 - 2023 fiscal year expenses on research
- herd_avg_fy21_to_fy23 - Average expenses on research from 2021 to 2023
Types of Academic Programs and Degrees Offered
orig from classification_df
- degree_focuses - degree level and academic focus overview
  - top degree level offered: bacc, masters, doctoral, special, tribal
  - focus: arts and science, diverse, engineering and tech, medical, other health, special focus, research, or NULL
  - size: of program (only for Masters Universities)
  - research activity: only for Doctoral Universities
- undergrad_program - distribution of undergrad degrees in certain academic fields
  - dominant subj: “arts and sciences” OR “professions” OR NULL
  - balanced vs plus vs focus: measured by the percentage of bachelor’s degrees awarded in arts and sciences (rather than professions)
  - graduate degree coexistence: percentage of graduate degrees in undergraduate fields
    - e.g., a BS in Computer Science AND a MS or PHD in Computer Science is also offered at the same school
    - measured by “none” (0%) OR “some” (0 to 50%) or “high” (greater than 50%)
- grad_program - distribution of grad degrees in certain academic fields
  - program level: undergrad only, postbac, research doctorate, NULL
  - single or comprehensive: multiple subject fields or single subject
  - focus: education, business, humanities, STEM, professional
Research Doctorates Awarded
orig from spending_df
Number of doctoral research degrees conferred for…
- num_doc_degrees_2020_2021 - the 2020-2021 school year
- num_doc_degrees_2021_2022 - the 2021-2022 school year
- num_doc_degrees_2022_2023 - the 2022-2023 school year

Miscellaneous
orig from classification_df
Not important, used for Carnegie’s specific engagement and purposes,
special program with Carnegie
- community_engage - whether they were classified by Carnegie
- leadership_for_public_prax - whether they were classified by Carnegie
The datasets we intend to use for this project are
carnegie_classification_stats and
carnegie_spending_doctorate_awards, both sourced from the
Carnegie Classification of Institutions of Higher Education—a framework
maintained by the American Council on Education and the Carnegie
Foundation for the Advancement of Teaching. The purpose of this
classification system is to provide a standardized way of categorizing
U.S. colleges and universities based on a range of institutional
characteristics, including degree offerings, enrollment profiles, and
research activity.
The carnegie_classification_stats dataset includes a
wide range of institutional data. It provides general school information
such as the official name of the university and the name of specific
campuses—e.g., the University of Maryland as the flagship institution
and College Park as its campus. It also classifies institutions by their
sector, indicating whether they are public, private non-profit, or
private for-profit. Additional variables detail the types of degrees
offered, from bachelor’s to doctoral levels, as well as the
institution’s academic focus—such as liberal arts, professional fields,
or STEM. The dataset includes enrollment profiles, offering insight into
whether an institution primarily serves undergraduate or graduate
students, and whether its student body is full-time, part-time, or
transfer-heavy. It also captures information about the size of the
institution, the residential setting, and undergraduate selectivity.
Importantly, it includes the university’s most recent research
classification label (e.g., Research 1 or Research 2) as of 2025.
The carnegie_spending_doctorate_awards dataset
complements this information by including similar institutional
identifiers and research classifications for 2021 and 2025. It also
tracks financial data, reporting annual research expenses for fiscal
years 2021 through 2023. This provides a window into how institutions
allocate resources toward research over time. Additionally, this dataset
documents the number of research doctorates awarded by each institution
in three consecutive academic years: 2020–2021, 2021–2022, and
2022–2023. Together, these two datasets allow for a comprehensive
analysis of how institutional characteristics, research activity, and
academic program offerings intersect across the higher education
landscape in the United States.
| # CODE BEGINS |
Note: The packages below only need to be installed once.
# install.packages("patchwork")
# install.packages("scales")
# install.packages("RColorBrewer")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(tidyr)
library(dplyr)
library(lubridate) # for Datetime formatting
library(stringr) # for extracting URL
library(patchwork) # for putting multiple graphs on one fig
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(RColorBrewer) # for color palettes
classification_stats <- read_csv("carnegie_classification_stats.csv")
## Rows: 542 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): name, core_name, spec_name, city, state, level, public_private_pro...
## dbl (1): unitid
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spending_doctorate_awards <- read_csv("carnegie_spending_doctorate_awards.csv")
## Rows: 542 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, city, state, classific_2025, classific_2021, herd_fy21, herd_...
## dbl (5): unitid, num_doc_degrees_2020_2021, num_doc_degrees_2021_2022, num_d...
## num (2): herd_fy23, herd_avg_fy21_to_fy23
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
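The column specification above shows herd_fy21 and herd_fy22 arriving as character columns, which is why they are cleaned later on. One alternative, sketched here under assumptions, is to declare the types at read time. The inline CSV is a made-up two-row stand-in for the real file, and treating “N/A” as a missing-value marker is an assumption based on the parsing warning that appears in the cleaning step.

```r
library(readr)

# Hypothetical sample standing in for carnegie_spending_doctorate_awards.csv
sample_csv <- I('unitid,name,herd_fy21
1,"School A","4,597,000"
2,"School B",N/A
')

sample_df <- read_csv(
  sample_csv,
  col_types = cols(
    unitid    = col_double(),
    name      = col_character(),
    herd_fy21 = col_number()   # strips grouping commas while parsing
  ),
  na = c("", "NA", "N/A")      # assumed missing-value markers
)

# Inspect the parsed column type
str(sample_df$herd_fy21)
```

Declaring col_types also quiets the column specification message, since nothing has to be guessed.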
#head(classification_stats)
colnames(classification_stats)
## [1] "unitid" "name"
## [3] "core_name" "spec_name"
## [5] "city" "state"
## [7] "level" "public_private_profit"
## [9] "undergrad_program" "grad_program"
## [11] "enrollment_profile" "undergrad_profile"
## [13] "size_setting" "degree_focuses"
## [15] "community_engage" "leadership_for_public_prax"
## [17] "research_tier_2025"
summary(classification_stats)
## unitid name core_name spec_name
## Min. :100654 Length:542 Length:542 Length:542
## 1st Qu.:147953 Class :character Class :character Class :character
## Median :187289 Mode :character Mode :character Mode :character
## Mean :190540
## 3rd Qu.:217853
## Max. :495767
## city state level public_private_profit
## Length:542 Length:542 Length:542 Length:542
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## undergrad_program grad_program enrollment_profile undergrad_profile
## Length:542 Length:542 Length:542 Length:542
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## size_setting degree_focuses community_engage
## Length:542 Length:542 Length:542
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## leadership_for_public_prax research_tier_2025
## Length:542 Length:542
## Class :character Class :character
## Mode :character Mode :character
##
##
##
head(spending_doctorate_awards)
## # A tibble: 6 × 14
## unitid name city state classific_2025 classific_2021 herd_fy21 herd_fy22
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 222178 Abilene … Abil… TX Research 2: H… Doctoral/Prof… 4,597,00… 6,310,00…
## 2 200697 Air Forc… Wrig… OH Research 2: H… Doctoral Univ… 45,703,0… 44,391,0…
## 3 100654 Alabama … Norm… AL Research Coll… Master's Coll… 9,180,00… 9,611,00…
## 4 100724 Alabama … Mont… AL Research Coll… Doctoral/Prof… 2,631,00… 3,178,00…
## 5 188526 Albany C… Alba… NY Research Coll… Special Focus… 3,687,00… 2,276,00…
## 6 188580 Albany M… Alba… NY Research Coll… Special Focus… 22,443,0… 22,297,0…
## # ℹ 6 more variables: herd_fy23 <dbl>, herd_avg_fy21_to_fy23 <dbl>,
## # num_doc_degrees_2020_2021 <dbl>, num_doc_degrees_2021_2022 <dbl>,
## # num_doc_degrees_2022_2023 <dbl>, avg_num_doc_degrees_2020_2023 <dbl>
colnames(spending_doctorate_awards)
## [1] "unitid" "name"
## [3] "city" "state"
## [5] "classific_2025" "classific_2021"
## [7] "herd_fy21" "herd_fy22"
## [9] "herd_fy23" "herd_avg_fy21_to_fy23"
## [11] "num_doc_degrees_2020_2021" "num_doc_degrees_2021_2022"
## [13] "num_doc_degrees_2022_2023" "avg_num_doc_degrees_2020_2023"
summary(spending_doctorate_awards)
## unitid name city state
## Min. :100654 Length:542 Length:542 Length:542
## 1st Qu.:147953 Class :character Class :character Class :character
## Median :187289 Mode :character Mode :character Mode :character
## Mean :190022
## 3rd Qu.:217744
## Max. :492689
## classific_2025 classific_2021 herd_fy21 herd_fy22
## Length:542 Length:542 Length:542 Length:542
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## herd_fy23 herd_avg_fy21_to_fy23 num_doc_degrees_2020_2021
## Min. :0.000e+00 Min. :1.215e+06 Min. : 0.0
## 1st Qu.:6.320e+06 1st Qu.:5.790e+06 1st Qu.: 0.0
## Median :2.560e+07 Median :2.363e+07 Median : 32.0
## Mean :1.971e+08 Mean :1.790e+08 Mean :108.1
## 3rd Qu.:2.135e+08 3rd Qu.:1.959e+08 3rd Qu.:132.5
## Max. :3.802e+09 Max. :3.468e+09 Max. :900.0
## num_doc_degrees_2021_2022 num_doc_degrees_2022_2023
## Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 3.0
## Median : 36.0 Median : 36.0
## Mean :120.1 Mean :122.3
## 3rd Qu.:144.8 3rd Qu.:151.2
## Max. :942.0 Max. :930.0
## avg_num_doc_degrees_2020_2023
## Min. : 0.0
## 1st Qu.: 2.0
## Median : 34.5
## Mean :116.8
## 3rd Qu.:146.5
## Max. :924.0
| # CREATING DATAFRAMES |
classification_df <- data.frame(classification_stats)
class(classification_df)
## [1] "data.frame"
spending_df <- data.frame(spending_doctorate_awards)
class(spending_df)
## [1] "data.frame"
colnames(spending_df)
## [1] "unitid" "name"
## [3] "city" "state"
## [5] "classific_2025" "classific_2021"
## [7] "herd_fy21" "herd_fy22"
## [9] "herd_fy23" "herd_avg_fy21_to_fy23"
## [11] "num_doc_degrees_2020_2021" "num_doc_degrees_2021_2022"
## [13] "num_doc_degrees_2022_2023" "avg_num_doc_degrees_2020_2023"
| # CLEANING DATA |
For the herd_fyXX columns in spending_df, convert the text to numbers and
replace any values that cannot be parsed (e.g., “N/A”) with NA.
# Clean and convert HERD columns to integer
# (note: as.integer() returns NA for values above .Machine$integer.max, about 2.147 billion)
spending_df <- spending_df %>%
mutate(
herd_fy21 = as.integer(parse_number(as.character(herd_fy21))),
herd_fy22 = as.integer(parse_number(as.character(herd_fy22))),
herd_fy23 = as.integer(parse_number(as.character(herd_fy23))),
herd_avg_fy21_to_fy23 = as.integer(parse_number(as.character(herd_avg_fy21_to_fy23)))
)
## Warning: There were 6 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `herd_fy21 = as.integer(parse_number(as.character(herd_fy21)))`.
## Caused by warning:
## ! 1 parsing failure.
## row col expected actual
## 162 -- a number N/A
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.
# Inspect rows where parsing failed (NAs introduced)
problematic_rows <- spending_df %>%
filter(if_any(c(herd_fy21, herd_fy22, herd_fy23, herd_avg_fy21_to_fy23), is.na)) %>%
distinct()
# View problematic rows – check for Johns Hopkins, Kent State, or other issues
print(problematic_rows)
## unitid name city state
## 1 162928 Johns Hopkins University Baltimore MD
## 2 203517 Kent State University at Kent Kent OH
## classific_2025
## 1 Research 1: Very High Spending and Doctorate Production
## 2 Research 1: Very High Spending and Doctorate Production
## classific_2021 herd_fy21 herd_fy22
## 1 Doctoral Universities: Very High Research Activity NA NA
## 2 Doctoral Universities: Very High Research Activity NA NA
## herd_fy23 herd_avg_fy21_to_fy23 num_doc_degrees_2020_2021
## 1 NA NA 498
## 2 57758000 57758000 134
## num_doc_degrees_2021_2022 num_doc_degrees_2022_2023
## 1 605 672
## 2 145 165
## avg_num_doc_degrees_2020_2023
## 1 592
## 2 148
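One thing worth checking in the output above: R integers top out at .Machine$integer.max (2,147,483,647), so as.integer() turns any larger value into NA with a warning. That may explain why the herd_fy23 maximum is 3.802e+09 in the raw summary but 2.047e+09 in the summary after cleaning. A minimal sketch of the difference, using made-up values rather than the real data:

```r
library(readr)

# Illustrative values only; the middle one exceeds the 32-bit integer range
raw_spending <- c("4,597,000", "3,802,000,000", "N/A")

# parse_number() returns doubles; "N/A" is treated as missing here
parsed <- parse_number(raw_spending, na = c("", "NA", "N/A"))

as_int <- suppressWarnings(as.integer(parsed))  # out-of-range value becomes NA
as_dbl <- parsed                                # doubles keep the full value

.Machine$integer.max
as_int
as_dbl
```

Keeping the columns as doubles (that is, dropping the as.integer() wrapper) would avoid the overflow, though the downstream summaries would then differ from the ones shown here.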
classification_df and spending_df both have a “name” column, but it was
raising an error (“Error: unexpected symbol”) for that column when trying
to do a full join, so the join below is done on unitid.
# Confirming column names on what to join by
colnames(classification_df)
## [1] "unitid" "name"
## [3] "core_name" "spec_name"
## [5] "city" "state"
## [7] "level" "public_private_profit"
## [9] "undergrad_program" "grad_program"
## [11] "enrollment_profile" "undergrad_profile"
## [13] "size_setting" "degree_focuses"
## [15] "community_engage" "leadership_for_public_prax"
## [17] "research_tier_2025"
colnames(spending_df)
## [1] "unitid" "name"
## [3] "city" "state"
## [5] "classific_2025" "classific_2021"
## [7] "herd_fy21" "herd_fy22"
## [9] "herd_fy23" "herd_avg_fy21_to_fy23"
## [11] "num_doc_degrees_2020_2021" "num_doc_degrees_2021_2022"
## [13] "num_doc_degrees_2022_2023" "avg_num_doc_degrees_2020_2023"
# Perform full join on 'unitid'
combined_df <- full_join(classification_df, spending_df, by = "unitid")
# Confirm join and notice duplicate column names
colnames(combined_df)
## [1] "unitid" "name.x"
## [3] "core_name" "spec_name"
## [5] "city.x" "state.x"
## [7] "level" "public_private_profit"
## [9] "undergrad_program" "grad_program"
## [11] "enrollment_profile" "undergrad_profile"
## [13] "size_setting" "degree_focuses"
## [15] "community_engage" "leadership_for_public_prax"
## [17] "research_tier_2025" "name.y"
## [19] "city.y" "state.y"
## [21] "classific_2025" "classific_2021"
## [23] "herd_fy21" "herd_fy22"
## [25] "herd_fy23" "herd_avg_fy21_to_fy23"
## [27] "num_doc_degrees_2020_2021" "num_doc_degrees_2021_2022"
## [29] "num_doc_degrees_2022_2023" "avg_num_doc_degrees_2020_2023"
name, city, and state each appear twice after the join, with .x and .y
suffixes. The .x suffix marks the column coming from classification_df;
the .y suffix marks the column coming from spending_df.
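If the default .x/.y suffixes are hard to keep straight, full_join() accepts a suffix argument that names the source of each duplicated column. A small sketch on toy data (the data frames and the suffix labels are illustrative choices, not what the project uses):

```r
library(dplyr)

# Toy stand-ins for classification_df and spending_df
class_toy <- data.frame(
  unitid = c(1, 2),
  name = c("A University", "B College")
)
spend_toy <- data.frame(
  unitid = c(1, 2),
  name = c("A University", "B College-Main Campus")
)

# Label duplicated columns by their source instead of .x / .y
joined_toy <- full_join(
  class_toy, spend_toy,
  by = "unitid",
  suffix = c("_class", "_spend")
)

colnames(joined_toy)
```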
# Total mismatches = 13 (12 school names + 1 city)
# Mismatched school names (there are 12)
combined_df %>%
filter(name.x != name.y) %>%
select(unitid, name.x, name.y)
## unitid name.x
## 1 104151 Arizona State University
## 2 138354 The University of West Florida
## 3 151111 Indiana University-Purdue University-Indianapolis
## 4 163259 University of Maryland, Baltimore
## 5 163286 University of Maryland-College Park
## 6 164155 US Naval Academy
## 7 195049 Rockefeller University
## 8 196060 SUNY at Albany
## 9 199111 University of North Carolina at Asheville
## 10 201885 University of Cincinnati
## 11 207388 Oklahoma State University
## 12 224554 East Texas A & M
## name.y
## 1 Arizona State University Campus Immersion
## 2 University of West Florida
## 3 Indiana University–Purdue University-Indianapolis
## 4 University of Maryland - Baltimore
## 5 University of Maryland - College Park
## 6 United States Naval Academy
## 7 The Rockefeller University
## 8 University at Albany
## 9 University of North Carolina Asheville
## 10 University of Cincinnati-Main Campus
## 11 Oklahoma State University-Main Campus
## 12 East Texas A & M University
# Mismatched city names (there is 1: 'Des Moines' vs. 'West Des Moines')
combined_df %>%
filter(city.x != city.y ) %>%
select(unitid, city.x, city.y)
## unitid city.x city.y
## 1 154156 Des Moines West Des Moines
# Mismatched state names (there are none)
combined_df %>%
filter(state.x != state.y) %>%
select(unitid, state.x, state.y)
## [1] unitid state.x state.y
## <0 rows> (or 0-length row.names)
# Create a new dataframe for mismatched rows
mismatched <- combined_df %>%
filter(name.x != name.y | city.x != city.y | state.x != state.y) %>%
select(unitid, name.x, name.y, city.x, city.y, state.x, state.y)
# View the mismatched rows
print(mismatched)
## unitid name.x
## 1 104151 Arizona State University
## 2 138354 The University of West Florida
## 3 151111 Indiana University-Purdue University-Indianapolis
## 4 154156 Des Moines University-Osteopathic Medical Center
## 5 163259 University of Maryland, Baltimore
## 6 163286 University of Maryland-College Park
## 7 164155 US Naval Academy
## 8 195049 Rockefeller University
## 9 196060 SUNY at Albany
## 10 199111 University of North Carolina at Asheville
## 11 201885 University of Cincinnati
## 12 207388 Oklahoma State University
## 13 224554 East Texas A & M
## name.y city.x
## 1 Arizona State University Campus Immersion Tempe
## 2 University of West Florida Pensacola
## 3 Indiana University–Purdue University-Indianapolis Indianapolis
## 4 Des Moines University-Osteopathic Medical Center Des Moines
## 5 University of Maryland - Baltimore Baltimore
## 6 University of Maryland - College Park College Park
## 7 United States Naval Academy Annapolis
## 8 The Rockefeller University New York
## 9 University at Albany Albany
## 10 University of North Carolina Asheville Asheville
## 11 University of Cincinnati-Main Campus Cincinnati
## 12 Oklahoma State University-Main Campus Stillwater
## 13 East Texas A & M University Commerce
## city.y state.x state.y
## 1 Tempe AZ AZ
## 2 Pensacola FL FL
## 3 Indianapolis IN IN
## 4 West Des Moines IA IA
## 5 Baltimore MD MD
## 6 College Park MD MD
## 7 Annapolis MD MD
## 8 New York NY NY
## 9 Albany NY NY
## 10 Asheville NC NC
## 11 Cincinnati OH OH
## 12 Stillwater OK OK
## 13 Commerce TX TX
# Function to replace NULL with NA
# (note: data frame columns are atomic vectors and never NULL, so across() returns them unchanged)
replace_null_with_na <- function(x) {
if (is.list(x)) {
# Apply recursively if it's a list (to handle nested data frames)
return(lapply(x, replace_null_with_na))
} else if (is.null(x)) {
return(NA) # Replace NULL with NA
} else {
return(x) # Otherwise, return the value unchanged
}
}
# Use it to apply to a data frame
combined_df_na <- combined_df %>%
mutate(across(everything(), replace_null_with_na))
# View the result
summary(combined_df_na)
## unitid name.x core_name spec_name
## Min. :100654 Length:543 Length:543 Length:543
## 1st Qu.:148139 Class :character Class :character Class :character
## Median :187444 Mode :character Mode :character Mode :character
## Mean :190585
## 3rd Qu.:217842
## Max. :495767
##
## city.x state.x level public_private_profit
## Length:543 Length:543 Length:543 Length:543
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## undergrad_program grad_program enrollment_profile undergrad_profile
## Length:543 Length:543 Length:543 Length:543
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## size_setting degree_focuses community_engage
## Length:543 Length:543 Length:543
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## leadership_for_public_prax research_tier_2025 name.y
## Length:543 Length:543 Length:543
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## city.y state.y classific_2025 classific_2021
## Length:543 Length:543 Length:543 Length:543
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## herd_fy21 herd_fy22 herd_fy23
## Min. :0.000e+00 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.:4.844e+06 1st Qu.:5.722e+06 1st Qu.:6.320e+06
## Median :1.982e+07 Median :2.205e+07 Median :2.558e+07
## Mean :1.571e+08 Mean :1.716e+08 Mean :1.905e+08
## 3rd Qu.:1.682e+08 3rd Qu.:1.868e+08 3rd Qu.:2.094e+08
## Max. :1.710e+09 Max. :1.806e+09 Max. :2.047e+09
## NA's :3 NA's :3 NA's :2
## herd_avg_fy21_to_fy23 num_doc_degrees_2020_2021 num_doc_degrees_2021_2022
## Min. :1.215e+06 Min. : 0.0 Min. : 0.0
## 1st Qu.:5.778e+06 1st Qu.: 0.0 1st Qu.: 0.0
## Median :2.330e+07 Median : 32.0 Median : 36.0
## Mean :1.730e+08 Mean :108.1 Mean :120.1
## 3rd Qu.:1.900e+08 3rd Qu.:132.5 3rd Qu.:144.8
## Max. :1.854e+09 Max. :900.0 Max. :942.0
## NA's :2 NA's :1 NA's :1
## num_doc_degrees_2022_2023 avg_num_doc_degrees_2020_2023
## Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.0 1st Qu.: 2.0
## Median : 36.0 Median : 34.5
## Mean :122.3 Mean :116.8
## 3rd Qu.:151.2 3rd Qu.:146.5
## Max. :930.0 Max. :924.0
## NA's :1 NA's :1
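Because across() hands each column to the function as an atomic vector, is.null() is never TRUE there, so the helper above returns the columns as they were. If the CSV encodes missing values as the literal text “NULL” (an assumption; the variable descriptions list NULL as a possible value for some fields), dplyr::na_if() is one way to convert them. A sketch on toy data:

```r
library(dplyr)

# Toy data; the "NULL" strings are an assumed encoding of missing values
toy_df <- data.frame(
  unitid = c(1, 2, 3),
  grad_program = c("STEM", "NULL", "business"),
  stringsAsFactors = FALSE
)

# Convert the literal string "NULL" to NA in every character column
toy_df_na <- toy_df %>%
  mutate(across(where(is.character), ~ na_if(.x, "NULL")))

sum(is.na(toy_df_na$grad_program))
```

Whether “NULL” should count as missing or as a meaningful category is a decision for the analysis; the conversion changes the NA counts reported below it.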
Note: spec_name may have many missing values, and that is okay because
it is optional (e.g., University of Maryland, College Park would have
“College Park” as the spec_name, but Johns Hopkins only has one
campus, so there is no “main campus” or any other special name).
# Display a table of the count of missing values (NAs) per column
missing_data <- combined_df_na %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
gather(key = "column_name", value = "NAs")
print(missing_data)
## column_name NAs
## 1 unitid 0
## 2 name.x 1
## 3 core_name 1
## 4 spec_name 371
## 5 city.x 1
## 6 state.x 1
## 7 level 1
## 8 public_private_profit 1
## 9 undergrad_program 1
## 10 grad_program 1
## 11 enrollment_profile 1
## 12 undergrad_profile 1
## 13 size_setting 1
## 14 degree_focuses 1
## 15 community_engage 1
## 16 leadership_for_public_prax 1
## 17 research_tier_2025 1
## 18 name.y 1
## 19 city.y 1
## 20 state.y 1
## 21 classific_2025 1
## 22 classific_2021 1
## 23 herd_fy21 3
## 24 herd_fy22 3
## 25 herd_fy23 2
## 26 herd_avg_fy21_to_fy23 2
## 27 num_doc_degrees_2020_2021 1
## 28 num_doc_degrees_2021_2022 1
## 29 num_doc_degrees_2022_2023 1
## 30 avg_num_doc_degrees_2020_2023 1
# Filter rows where name.x or name.y is NA, and select unitid, name.x, name.y
# Should be referring to Penn State, with 2 different unitid AND different school name spelling
rows_with_na_names <- combined_df_na %>%
filter(is.na(name.x) | is.na(name.y)) %>%
select(unitid, name.x, name.y, city.x, city.y)
print(rows_with_na_names)
## unitid name.x
## 1 495767 The Pennsylvania State University
## 2 214777 <NA>
## name.y city.x city.y
## 1 <NA> University Park <NA>
## 2 Pennsylvania State University-Main Campus <NA> University Park
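The two rows above look like the same institution listed under different unitid values, which is why the full join could not pair them. One possible fix is to map one ID onto the other before joining. Which ID to keep is a judgment call; the sketch below arbitrarily keeps 214777 and uses toy stand-ins for the two data frames.

```r
library(dplyr)

# Toy stand-ins for classification_df and spending_df
class_toy <- data.frame(
  unitid = c(495767, 100654),
  name = c("The Pennsylvania State University", "School A")
)
spend_toy <- data.frame(
  unitid = c(214777, 100654),
  herd_fy23 = c(1000, 2000)   # made-up amounts
)

# Assumption: 495767 and 214777 refer to the same school
class_fixed <- class_toy %>%
  mutate(unitid = if_else(unitid == 495767, 214777, unitid))

joined_toy <- full_join(class_fixed, spend_toy, by = "unitid")
nrow(joined_toy)
```

After the recode, each school occupies a single row with both its name and its spending values filled in.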
| # VISUALIZATIONS |
This bar graph shows the number of institutions in each 2021 Carnegie classification tier. It compares how many universities fall into each group, highlights the overall distribution of research activity across U.S. institutions, and draws attention to the tiers that are less common.
# Recode main college type and subcategory
combined_df_na_cleaned <- combined_df_na %>%
mutate(
college_type = case_when(
grepl("Baccalaureate Colleges", classific_2021) ~ "Baccalaureate",
grepl("Master's Colleges & Universities", classific_2021) ~ "Master's",
grepl("Doctoral", classific_2021) ~ "Doctoral",  # also covers "Doctoral/Professional Universities"
grepl("Tribal Colleges and Universities", classific_2021) ~ "Tribal",
grepl("Special Focus Four-Year", classific_2021) ~ "Special Focus",
TRUE ~ NA_character_
),
subcategory = case_when(
grepl("Master's Colleges & Universities: Small Programs", classific_2021) ~ "Master's: Small Programs",
grepl("Master's Colleges & Universities: Medium Programs", classific_2021) ~ "Master's: Medium Programs",
grepl("Master's Colleges & Universities: Larger Programs", classific_2021) ~ "Master's: Larger Programs",
grepl("Doctoral Universities: High Research Activity", classific_2021) ~ "Doctoral: High Research Activity",
grepl("Doctoral Universities: Very High Research Activity", classific_2021) ~ "Doctoral: Very High Research Activity",
grepl("Doctoral/Professional Universities", classific_2021) ~ "Doctoral: Professional Universities",
grepl("Baccalaureate Colleges: Diverse Fields", classific_2021) ~ "Baccalaureate: Diverse Fields",
grepl("Baccalaureate Colleges: Arts & Sciences Focus", classific_2021) ~ "Baccalaureate: Arts & Sciences Focus",
grepl("Special Focus Four-Year: Research Institution", classific_2021) ~ "Special Focus: Research Institution",
grepl("Special Focus Four-Year: Other Health Professions Schools", classific_2021) ~ "Special Focus: Health Professions",
grepl("Special Focus Four-Year: Medical Schools & Centers", classific_2021) ~ "Special Focus: Medical Schools",
grepl("Special Focus Four-Year: Engineering and Other Technology-Related Schools", classific_2021) ~ "Special Focus: Engineering/Technology",
grepl("Special Focus Four-Year: Other Special Focus Institutions", classific_2021) ~ "Special Focus: Other Institutions",
grepl("Tribal Colleges and Universities", classific_2021) ~ "Tribal Colleges",
TRUE ~ NA_character_
)
) %>%
filter(!is.na(college_type) & !is.na(subcategory))
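Since the filter above silently drops any row whose label matches none of the patterns, it may be worth counting unmatched labels before filtering. A small sketch with made-up rows; the helper name label_to_type is hypothetical, and it deliberately uses a narrow pattern to show what an unmatched label looks like:

```r
library(dplyr)

# Made-up rows using label text in the style of classific_2021
toy_labels <- data.frame(
  classific_2021 = c(
    "Doctoral Universities: High Research Activity",
    "Doctoral/Professional Universities",
    "Baccalaureate Colleges: Diverse Fields"
  )
)

# Hypothetical helper: substring patterns mapped to a broad type
label_to_type <- function(x) {
  case_when(
    grepl("Baccalaureate Colleges", x) ~ "Baccalaureate",
    grepl("Doctoral Universities", x) ~ "Doctoral",
    TRUE ~ NA_character_
  )
}

# Labels that no pattern matched, with how often each occurs
unmatched <- toy_labels %>%
  mutate(college_type = label_to_type(classific_2021)) %>%
  filter(is.na(college_type)) %>%
  count(classific_2021)

unmatched
```

An empty result from a check like this would suggest that the recode covers every label present in the data.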
# Set factor levels for custom order of college_type
combined_df_na_cleaned$college_type <- factor(combined_df_na_cleaned$college_type,
levels = c("Baccalaureate", "Master's", "Doctoral", "Special Focus", "Tribal")
)
# Set factor levels for custom order of subcategories
combined_df_na_cleaned$subcategory <- factor(combined_df_na_cleaned$subcategory, levels = c(
"Baccalaureate: Diverse Fields",
"Baccalaureate: Arts & Sciences Focus",
"Master's: Small Programs",
"Master's: Medium Programs",
"Master's: Larger Programs",
"Doctoral: High Research Activity",
"Doctoral: Very High Research Activity",
"Doctoral: Professional Universities",
"Special Focus: Research Institution",
"Special Focus: Health Professions",
"Special Focus: Medical Schools",
"Special Focus: Engineering/Technology",
"Special Focus: Other Institutions",
"Tribal Colleges"
))
# Custom shaded color palette
color_palette_shaded <- c(
# Reds for Baccalaureate
"Baccalaureate: Diverse Fields" = "#ff9999",
"Baccalaureate: Arts & Sciences Focus" = "#cc0000",
# Oranges for Master's
"Master's: Small Programs" = "#ffcc99",
"Master's: Medium Programs" = "#ff9933",
"Master's: Larger Programs" = "#cc6600",
# Blues for Doctoral
"Doctoral: High Research Activity" = "#99ccff",
"Doctoral: Very High Research Activity" = "#3399ff",
"Doctoral: Professional Universities" = "#003366",
# Greens for Special Focus
"Special Focus: Research Institution" = "#b2d8b2",
"Special Focus: Health Professions" = "#66cc66",
"Special Focus: Medical Schools" = "#339966",
"Special Focus: Engineering/Technology" = "#26734d",
"Special Focus: Other Institutions" = "#145c33",
# Purple for Tribal
"Tribal Colleges" = "#9966cc"
)
p1 <- ggplot(combined_df_na_cleaned, aes(x = college_type, fill = subcategory)) +
geom_bar(position = "stack") +
scale_fill_manual(values = color_palette_shaded) +
labs(title = "Institution Counts by College Type and Subcategory",
x = "College Type", y = "Count of Institutions") +
theme_minimal() +
theme(
legend.position = "bottom",
legend.title = element_blank(),
legend.box = "horizontal",
legend.text = element_text(size = 10),
legend.key.size = unit(0.5, "cm"),
legend.box.just = "center",
legend.spacing.x = unit(0.5, 'cm'),
legend.spacing.y = unit(0.25, 'cm'),
legend.key.height = unit(0.5, "cm"),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
guides(fill = guide_legend(ncol = 3, title = "Subcategory"))
p1
# Ensure college_type factor order is set (if not already done)
combined_df_na_cleaned$college_type <- factor(combined_df_na_cleaned$college_type,
levels = c("Baccalaureate", "Master's", "Doctoral", "Special Focus", "Tribal")
)
# Proportion stacked bar plot (horizontal) with a single-column legend
p2 <- ggplot(combined_df_na_cleaned, aes(y = college_type, fill = subcategory)) +
geom_bar(position = "fill") + # This scales the bars to 100% for each college type
scale_fill_manual(values = color_palette_shaded) + # Apply the custom color palette
labs(title = "Proportional Counts by College Type and Subcategory",
x = "Proportion of Institutions",
y = "College Type") +
theme_minimal() +
theme(axis.text.y = element_text(angle = 0), # Keep y-axis labels horizontal
axis.text.x = element_text(angle = 45, hjust = 1)) + # Rotate x-axis labels for readability
guides(fill = guide_legend(ncol = 1, title = "Subcategory"))
# Print the plot
p2
This graph visualizes the raw count and relative distribution of institutions by their research activity designation in 2025. The categories displayed include “Research Colleges and Universities”, “Research 1”, and “Research 2”. Data excludes institutions without a research designation.
The graph shows both the raw count and the relative distribution of institutions across three research activity categories: Research Colleges and Universities (RCU), Research 1 (R1), and Research 2 (R2).
- RCU institutions spend at least $2.5 million on research annually but do not meet the criteria for R1 or R2.
- R1 institutions spend at least $50 million on research and produce at least 70 research doctorates annually.
- R2 institutions spend at least $5 million on research and produce at least 20 research doctorates annually.
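The thresholds listed above can be expressed as a small rule. The following is only a sketch in base R: `classify_research_tier()` and the toy institutions are hypothetical and are not part of the project code, and the rule uses only the two criteria quoted in the text.

```r
# Hypothetical helper: assign a 2025 research tier from the thresholds quoted above.
# `spending` is annual research spending in USD; `doctorates` is research doctorates per year.
classify_research_tier <- function(spending, doctorates) {
  ifelse(
    spending >= 50e6 & doctorates >= 70, "Research 1",
    ifelse(
      spending >= 5e6 & doctorates >= 20, "Research 2",
      ifelse(spending >= 2.5e6, "Research Colleges and Universities", NA_character_)
    )
  )
}

# Made-up institutions, for illustration only
toy <- data.frame(
  spending   = c(120e6, 30e6, 3e6, 1e6, 60e6),
  doctorates = c(150, 25, 0, 0, 10)
)
toy$tier <- classify_research_tier(toy$spending, toy$doctorates)
print(toy)
```

The last toy row spends more than $50 million but awards too few doctorates for R1 or R2, so under this rule it falls into Research Colleges and Universities.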
# Prepare the data and drop NA
plot_data <- combined_df_na %>%
filter(!is.na(research_tier_2025)) %>%
mutate(
research_tier_2025 = factor(
research_tier_2025,
levels = c("Research Colleges and Universities",
"Research 1: Very High Research Spending and Doctorate Production",
"Research 2: High Research Spending and Doctorate Production")
)
) %>%
count(research_tier_2025) %>%
mutate(percentage = n / sum(n) * 100) # Calculate each tier's share as a percentage
# Raw count bar plot
raw_count_plot <- ggplot(plot_data, aes(x = research_tier_2025, y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(
x = "Research Activity Designation",
y = "Raw Count"
) +
theme_minimal() +
scale_x_discrete(labels = c(
"Research Colleges and Universities" = "Research Colleges\nand Universities",
"Research 1: Very High Research Spending and Doctorate Production" = "R1",
"Research 2: High Research Spending and Doctorate Production" = "R2"
))
# Proportional bar plot (percentage)
proportional_plot <- ggplot(plot_data, aes(x = research_tier_2025, y = percentage)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(
x = "Research Activity Designation",
y = "Proportion (%)"
) +
theme_minimal() +
scale_x_discrete(labels = c(
"Research Colleges and Universities" = "Research Colleges\nand Universities",
"Research 1: Very High Research Spending and Doctorate Production" = "R1",
"Research 2: High Research Spending and Doctorate Production" = "R2"
))
# Combine the two plots and add a single title, subtitle, and caption for the whole figure
combined_plot <- raw_count_plot + proportional_plot +
plot_layout(guides = 'collect') +
plot_annotation(
title = "Raw Count and Proportion of Institutions by Research Activity Designation (2025)",
subtitle = "Excludes institutions without a research designation",
caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
)
# Print the combined plot
combined_plot
This boxplot displays the distribution of average doctoral degrees awarded from 2020 to 2023 across three research activity designations: “Research Colleges and Universities,” “Research 1 (R1),” and “Research 2 (R2).” The x-axis represents the average number of doctoral degrees awarded, while the y-axis categorizes institutions by their research tier.
The plot highlights the range, median, and quartiles for each research tier. Research Colleges and Universities, which are considered research-focused but do not meet the rigorous criteria for R1 or R2 designation, show a lower average number of doctoral degrees compared to R1 and R2 institutions. The “Research 1” and “Research 2” categories display higher and more variable averages, reflecting their very high and high research spending, respectively. This visualization provides insight into how research activity designation correlates with doctoral production.
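The medians and quartiles that a boxplot draws can also be tabulated directly. This is a minimal sketch in base R with made-up numbers; the `toy` data frame and its column names are assumptions for illustration and do not come from `combined_df_na`.

```r
# Made-up averages of doctoral degrees, for illustration only
toy <- data.frame(
  tier    = rep(c("R1", "R2", "RCU"), each = 3),
  avg_doc = c(300, 450, 600, 40, 60, 80, 2, 5, 8)
)

# Median and interquartile range per tier (the box centre and box width)
medians <- tapply(toy$avg_doc, toy$tier, median)
iqrs    <- tapply(toy$avg_doc, toy$tier, IQR)

print(data.frame(tier = names(medians), median = as.numeric(medians), iqr = as.numeric(iqrs)))
```

A table like this makes it easier to state how much higher and how much more variable one tier is than another, rather than reading it off the plot.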
# Clean data and relabel tiers
clean_df <- combined_df_na %>%
filter(!is.na(avg_num_doc_degrees_2020_2023), !is.na(research_tier_2025)) %>%
mutate(
research_tier_2025 = recode(
research_tier_2025,
"Research 1: Very High Research Spending and Doctorate Production" = "R1",
"Research 2: High Research Spending and Doctorate Production" = "R2",
"Research Colleges and Universities" = "Research Colleges\nand Universities"
)
)
# Plot with horizontal box and whiskers
ggplot(clean_df, aes(x = avg_num_doc_degrees_2020_2023, y = research_tier_2025)) +
geom_boxplot(fill = "slateblue", alpha = 0.7, outlier.shape = NA) +
geom_jitter(color = "darkslateblue", alpha = 0.5, size = 1.8) +
labs(
title = "Average Doctoral Degrees by Research Tier (2020–2023)",
x = "Avg. Doctoral Degrees (2020–2023)",
y = "Research Tier (2025)",
caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
) +
theme_minimal() +
theme(
plot.caption = element_text(hjust = 0, size = 9, color = "gray30"),
axis.text.y = element_text(size = 10)
)
This stacked bar graph visualizes the count of doctoral degrees conferred across different academic years (2020–2023), broken down by research tier. The research tiers are represented by three categories: “Research Colleges and Universities,” “Research 1 (R1),” and “Research 2 (R2).” The bars for each academic year are stacked to show the distribution of doctoral degrees within each research tier, allowing for an understanding of how the number of degrees varies across tiers and over time. The y-axis represents the raw count of doctoral degrees, and the x-axis represents the academic years.
In general, the number of doctoral degrees conferred by R2 institutions and “Research Colleges and Universities” has remained relatively stable over the years. “Research Colleges and Universities” account for the smallest share of degrees, and R2 institutions for a modest share. R1 institutions, however, show a marked increase in both the count and the proportion of degrees conferred, especially between the 2020-2021 and 2021-2022 academic years. This points to growing doctoral degree production at the institutions with the highest research activity. The plot also includes a note that “Research Colleges and Universities” are research-focused but do not meet the criteria for R1 or R2 designation.
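Statements about stability or growth across years can be checked by totalling degrees per tier and year and then differencing. The sketch below uses base R and invented numbers; the `toy` data frame only mimics the long format used for the stacked bar graph.

```r
# Made-up long-format data, for illustration only
toy <- data.frame(
  tier            = rep(c("Research 1", "Research 2"), each = 3),
  academic_year   = rep(c("2020-2021", "2021-2022", "2022-2023"), times = 2),
  num_doc_degrees = c(100, 130, 135, 40, 41, 42)
)

# Total degrees per tier (rows) and academic year (columns)
totals <- tapply(toy$num_doc_degrees, list(toy$tier, toy$academic_year), sum)

# Year-over-year change: each year minus the previous year
yoy_change <- totals[, -1] - totals[, -ncol(totals)]

# Share of each year's degrees contributed by each tier
share_by_year <- prop.table(totals, margin = 2)

print(yoy_change)
print(round(share_by_year, 3))
```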
# Reshape the data into long format for easier plotting and drop NA values
long_combined_df <- combined_df_na_cleaned %>%
select(
research_tier_2025,
num_doc_degrees_2020_2021,
num_doc_degrees_2021_2022,
num_doc_degrees_2022_2023
) %>%
pivot_longer(
cols = c(
"num_doc_degrees_2020_2021",
"num_doc_degrees_2021_2022",
"num_doc_degrees_2022_2023"
),
names_to = "academic_year",
values_to = "num_doc_degrees"
) %>%
# Clean up the research tier labels and academic year labels
mutate(
research_tier_2025 = recode(
research_tier_2025,
"Research 1: Very High Research Spending and Doctorate Production" = "Research 1",
"Research 2: High Research Spending and Doctorate Production" = "Research 2",
"Research Colleges and Universities" = "Research Colleges and Universities"
),
academic_year = recode(
academic_year,
"num_doc_degrees_2020_2021" = "2020-2021",
"num_doc_degrees_2021_2022" = "2021-2022",
"num_doc_degrees_2022_2023" = "2022-2023"
)
) %>%
# Filter out any rows with NA values in num_doc_degrees or research_tier_2025
filter(!is.na(num_doc_degrees), !is.na(research_tier_2025))
# Create a stacked bar graph of doctoral degrees by research tier and academic year
ggplot(long_combined_df, aes(x = academic_year, y = num_doc_degrees, fill = research_tier_2025)) +
geom_bar(stat = "identity", position = "stack") +
labs(
title = "Doctoral Degrees Conferred by Research Tier and Academic Year (2020–2023)",
x = "Academic Year",
y = "Count of Doctoral Degrees",
fill = "Research Tier",
caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0),
axis.text.x = element_text(angle = 0, hjust = 0.5), # Center the x-axis labels
plot.caption = element_text(hjust = 0, size = 9, color = "gray30"),
legend.position = "right", # Keep the legend on the right
legend.box = "vertical" # Arrange legend vertically
) +
guides(
fill = guide_legend(ncol = 1) # Set the legend to 1 column
) +
scale_fill_manual(
values = c("Research 1" = "#4C79A1", "Research 2" = "#F1A340", "Research Colleges and Universities" = "#6D9D4B"), # New colors
labels = c("Research 1", "Research 2", "Research Colleges and Universities") # Set the legend labels
) +
scale_y_continuous(
labels = scales::comma # Adds commas to the y-axis labels for better readability
)
This set of grouped bar graphs displays the count of institutions by their type (public, private non-profit, and private for-profit) and their research activity designation (R1, R2, and Research Colleges and Universities) for the year 2025. They represent the same data, but the fill and x-axis variables are swapped between the two graphs.
The majority of institutions with any research activity designation in this dataset are public institutions, with far fewer private non-profit and private for-profit institutions in each research activity category. Notably, the Research Colleges and Universities category is the only one that contains private for-profit institutions, with only one school in this group.
In terms of distribution across research activity designations, Research Colleges and Universities appear to have a higher number of private non-profit institutions compared to R1 and R2 institutions. Specifically, Research Colleges and Universities are predominantly composed of private non-profit institutions, whereas R1 and R2 categories show a much stronger representation of public institutions.
Interestingly, the number of public institutions in the R1 category is almost equal to the number in the Research Colleges and Universities category, which further suggests that Research Colleges and Universities have a more balanced mix of public and private non-profit institutions.
This analysis highlights key trends in how research activity designation relates to institutional type, with public institutions overwhelmingly dominating higher research tiers, and Research Colleges and Universities being the outlier with more private non-profit representation.
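Because the tiers differ in size, comparing raw counts can obscure how each tier is composed. One way to check claims about composition is to compute the share of each institution type within a tier. The sketch below uses base R and made-up counts; the category labels are assumptions and may not match the values in `public_private_profit`.

```r
# Made-up counts for illustration only; not values from the project data
toy <- data.frame(
  tier = c("R1", "R1", "R2", "R2", "RCU", "RCU"),
  type = rep(c("Public", "Private not-for-profit"), times = 3),
  n    = c(90, 30, 80, 20, 40, 60)
)

# Cross-tabulate counts, then convert each row (tier) to proportions
counts <- xtabs(n ~ tier + type, data = toy)
shares <- prop.table(counts, margin = 1)

print(round(shares, 2))
```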
# Ensure all combinations of 'public_private_profit' and 'research_tier_2025' are represented
plot_df_clean <- combined_df_na %>%
filter(!is.na(public_private_profit), !is.na(research_tier_2025)) %>%
count(research_tier_2025, public_private_profit) %>%
complete(research_tier_2025, public_private_profit, fill = list(n = 0)) # Fill missing combinations with 0
# Create the plot
ggplot(plot_df_clean, aes(x = public_private_profit, y = n, fill = research_tier_2025)) +
geom_bar(position = "dodge", stat = "identity", width = 0.7) + # Adjust width for equal bar widths
labs(
title = "Count of Institutions by Type and Research Activity Designation (2025)",
subtitle = "Excludes institutions without a research designation",
x = "Institution Type",
y = "Count of Institutions",
fill = "Research Activity Designation",
caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
) +
theme_minimal() +
scale_x_discrete(labels = c(
"Public" = "Public",
"Private not-for-profit" = "Private (non-profit)",
"Private for-profit" = "Private (for-profit)"
)) +
scale_y_continuous(limits = c(0, 150), expand = c(0, 0)) +
scale_fill_manual(
values = c(
"Research Colleges and Universities" = "palevioletred", # Research Colleges stand out
"Research 1: Very High Research Spending and Doctorate Production" = "steelblue", # R1 color
"Research 2: High Research Spending and Doctorate Production" = "cornflowerblue" # R2 color
),
labels = c(
"Research Colleges and Universities" = "Research Colleges\nand Universities",
"Research 1: Very High Research Spending and Doctorate Production" = "R1",
"Research 2: High Research Spending and Doctorate Production" = "R2"
)
) +
theme(
axis.text.y = element_text(size = 10),
legend.position = "bottom", # Move the legend to the bottom
legend.box = "horizontal", # Makes the legend horizontal
legend.title = element_text(size = 12), # Adding legend title size
axis.text.x = element_text(angle = 0, hjust = 0.5) # Set x-axis labels to 0 degrees (normal)
)
ggplot(plot_df_clean, aes(x = research_tier_2025, y = n, fill = public_private_profit)) +
geom_bar(position = "dodge", stat = "identity", width = 0.7) +
labs(
title = "Count of Institutions by Research Activity Designation and Type (2025)",
subtitle = "Excludes institutions without a research designation",
x = "Research Activity Designation",
y = "Count of Institutions",
fill = "Institution Type",
caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
) +
theme_minimal() +
scale_x_discrete(labels = c(
"Research Colleges and Universities" = "Research Colleges\nand Universities",
"Research 1: Very High Research Spending and Doctorate Production" = "R1",
"Research 2: High Research Spending and Doctorate Production" = "R2"
)) +
scale_y_continuous(limits = c(0, 150), expand = c(0, 0)) +
scale_fill_manual(
values = c(
"Public" = "seagreen",
"Private not-for-profit" = "darkorange",
"Private for-profit" = "saddlebrown"
),
labels = c(
"Public" = "Public",
"Private not-for-profit" = "Private (non-profit)",
"Private for-profit" = "Private (for-profit)"
)
) +
theme(
axis.text.y = element_text(size = 10),
axis.text.x = element_text(angle = 0, hjust = 0.5),
legend.position = "bottom",
legend.box = "horizontal",
legend.title = element_text(size = 12)
)
This histogram displays the distribution of HERD spending across three fiscal years: 2021, 2022, and 2023. It highlights the range of research spending institutions report, with the peak around 0.05 billion dollars in fiscal year 2023, suggesting that many institutions are clustered in this spending range. Over the years, spending has gradually increased, especially in FY2023, indicating a growing trend in research investment. By examining this distribution, we can identify the most common spending ranges, which can help provide insight into what is considered typical or desirable for research spending. The graph also reveals the highest amounts of spending, useful for understanding the research budgets of institutions with the largest research investments.
# Clean and reshape HERD data for all three fiscal years (FY21, FY22, FY23)
combined_df_long <- combined_df_na_cleaned %>%
select(research_tier_2025, herd_fy21, herd_fy22, herd_fy23) %>%
pivot_longer(
cols = starts_with("herd_fy"),
names_to = "fiscal_year",
values_to = "herd_spending"
) %>%
filter(!is.na(herd_spending)) %>%
mutate(
fiscal_year = recode(
fiscal_year,
herd_fy21 = "HERD Spending FY2021",
herd_fy22 = "HERD Spending FY2022",
herd_fy23 = "HERD Spending FY2023"
)
)
# Plot distribution of HERD spending for FY21, FY22, and FY23 with log scale
ggplot(combined_df_long, aes(x = herd_spending)) +
geom_histogram(bins = 20, color = "black", fill = "seagreen", alpha = 0.85) +
scale_x_log10(labels = scales::dollar_format(scale = 1e-9, suffix = "B")) +
facet_wrap(~ fiscal_year, nrow = 1, ncol = 3) + # Arrange FY21, FY22, FY23 side by side
labs(
title = "Distribution of HERD Spending for Fiscal Years 2021, 2022, and 2023",
x = "HERD Spending (Log Scale, in Billions)",
y = "Number of Institutions"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
strip.text = element_text(face = "bold"),
legend.position = "none"
)
## Warning in scale_x_log10(labels = scales::dollar_format(scale = 1e-09, suffix =
## "B")): log-10 transformation introduced infinite values.
## Warning: Removed 9 rows containing non-finite outside the scale range
## (`stat_bin()`).
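The warnings above are what a log scale produces when some values are zero, since `log10(0)` is `-Inf` and such rows are dropped before binning. A minimal sketch in base R, using a made-up spending vector, shows how one might count and exclude those values explicitly; whether the 9 removed rows are all zeros in the project data is an assumption, not something confirmed here.

```r
# Made-up spending values in USD, for illustration only
herd <- c(0, 0, 2.5e6, 5e7, 1.2e9, NA)

# Zeros become -Inf on a log10 scale, which is what triggers the warning
print(log10(herd))

# Count the zeros, then keep only strictly positive, non-missing values
n_zero        <- sum(herd == 0, na.rm = TRUE)
herd_positive <- herd[!is.na(herd) & herd > 0]

print(n_zero)
print(herd_positive)
```

Filtering before plotting makes the exclusion explicit and gives a count that can be reported alongside the figure.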
This histogram displays research spending (HERD) across three types of institutions for fiscal years 2021, 2022, and 2023:
- Research Colleges and Universities (R Colleges): These institutions generally have the lowest research spending, concentrated on the left side of the graph.
- Research 2 (R2): R2 institutions fall in the middle range, with some overlapping with both R Colleges and Research 1 institutions.
- Research 1 (R1): These institutions have the highest research spending, appearing on the far right.
The log scale on the x-axis highlights differences in spending, particularly at the higher end. The overlap between R2 and the other categories suggests that some R2 institutions have spending comparable to R1 or R Colleges.
# Reshape HERD data to long format, dropping rows with missing spending or research tier
combined_df_long <- combined_df_na_cleaned %>%
select(research_tier_2025, herd_fy21, herd_fy22, herd_fy23) %>%
pivot_longer(
cols = starts_with("herd_fy"),
names_to = "fiscal_year",
values_to = "herd_spending"
) %>%
filter(!is.na(herd_spending), !is.na(research_tier_2025)) %>%
mutate(
fiscal_year = recode(
fiscal_year,
herd_fy21 = "HERD Spending FY2021",
herd_fy22 = "HERD Spending FY2022",
herd_fy23 = "HERD Spending FY2023"
),
# Simplify research tier labels
research_tier_simple = recode(
research_tier_2025,
"Research 1: Very High Research Spending and Doctorate Production" = "Research 1",
"Research 2: High Research Spending and Doctorate Production" = "Research 2",
"Research Colleges and Universities" = "Research Colleges and Universities"
)
) %>%
filter(!is.na(research_tier_simple), !is.na(herd_spending)) # Remove NAs
# Plot with simplified legend labels and without NAs
ggplot(combined_df_long, aes(x = herd_spending, fill = research_tier_simple)) +
geom_histogram(bins = 30, color = "black", alpha = 0.6, position = "identity") +
scale_x_log10(labels = scales::dollar_format(scale = 1e-9, suffix = "B")) +
facet_wrap(~ fiscal_year, nrow = 1, ncol = 3) +
scale_fill_manual(
values = c(
"Research 1" = brewer.pal(3, "Set2")[1],
"Research 2" = brewer.pal(3, "Set2")[2],
"Research Colleges and Universities" = brewer.pal(3, "Set2")[3]
),
labels = c("Research 1", "Research 2", "Research Colleges and Universities") # Simplify the legend labels
) +
labs(
title = "HERD Spending by Fiscal Year and Research Tier (Log Scale)",
subtitle = "Excludes institutions without a research designation",
x = "HERD Spending (Log Scale, in Billions)",
y = "Number of Institutions",
fill = "Research Tier",
caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0.5, size = 9),
strip.text = element_text(face = "bold"),
legend.position = "bottom",
legend.direction = "horizontal"
)
## Warning in scale_x_log10(labels = scales::dollar_format(scale = 1e-09, suffix =
## "B")): log-10 transformation introduced infinite values.
## Warning: Removed 9 rows containing non-finite outside the scale range
## (`stat_bin()`).
This scatter plot visualizes the relationship between average research spending (from FY2021 to FY2023) and the average number of doctoral degrees awarded (2020–2023) across American higher education institutions.
The x-axis shows average research spending in billions of U.S. dollars, while the y-axis indicates the average number of doctoral degrees conferred; the first version of the plot uses log scales on both axes and the second uses linear scales. Institutions with greater research spending tend to award more doctoral degrees, illustrating a positive association between institutional investment in research and doctoral productivity.
This visualization supports the analytical research question by highlighting how non-financial institutional characteristics (like research tier) correlate with tangible research outcomes such as doctoral degree production.
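The positive relationship described above can be quantified with a correlation coefficient. The sketch below uses base R and made-up values. It reports a Pearson correlation on log10-transformed values, which matches a log-log plot, and a Spearman rank correlation, which does not depend on the scale. Neither figure is a result from the project data.

```r
# Made-up values for illustration only
spend_billions <- c(0.001, 0.01, 0.05, 0.2, 1.0)
doc_degrees    <- c(3, 12, 40, 150, 600)

# Pearson correlation on the log-log scale used in the first scatter plot
r_log <- cor(log10(spend_billions), log10(doc_degrees))

# Spearman rank correlation: depends only on the ordering of the values
rho <- cor(spend_billions, doc_degrees, method = "spearman")

print(c(pearson_log = r_log, spearman = rho))
```

Both measures require strictly positive values on the log scale, so zero-degree institutions would need to be filtered out first, as the plotting code already does.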
# Clean the spending column (convert to billions)
clean_df_long <- combined_df %>%
mutate(
herd_avg_fy21_to_fy23_billions = herd_avg_fy21_to_fy23 / 1e9 # Convert to billions
)
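# Rename tiers for cleaner display; rows without a designation stay NA and are dropped
# (a catch-all TRUE branch would otherwise label them "Research Colleges and Universities")
clean_df_long <- clean_df_long %>%
mutate(research_tier_2025 = case_when(
grepl("Research 1", research_tier_2025) ~ "Research 1 (R1)",
grepl("Research 2", research_tier_2025) ~ "Research 2 (R2)",
grepl("Research Colleges and Universities", research_tier_2025) ~ "Research Colleges and Universities",
TRUE ~ NA_character_
)) %>%
filter(!is.na(research_tier_2025))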
# Filter out rows where y-values are <= 0
clean_df_long_filtered <- clean_df_long %>%
filter(avg_num_doc_degrees_2020_2023 > 0)
# Scatter plot using log10 scales with a lower limit for the y-axis
ggplot(clean_df_long_filtered, aes(
x = herd_avg_fy21_to_fy23_billions,
y = avg_num_doc_degrees_2020_2023,
color = research_tier_2025
)) +
geom_point(alpha = 0.6, size = 1.2) +
scale_color_brewer(palette = "Set2") +
scale_x_log10(
labels = scales::label_dollar(scale = 1, suffix = "B"),
breaks = scales::log_breaks()
) +
scale_y_log10(
breaks = scales::log_breaks(),
limits = c(1, NA) # Set the lower limit of y-axis to 1
) +
labs(
title = "Avg. Research Spending vs. Doctoral Degrees (2020–2023)",
x = "Avg. Research Spending (Log Scale, Billions USD)",
y = "Avg. Number of Doctoral Degrees\n(Log Scale)",
color = "Research Tier"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
legend.key.width = unit(1, "cm"),
legend.key.height = unit(0.5, "cm"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
) +
guides(color = guide_legend(ncol = 3))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
# Clean the spending column (remove $ and commas and convert to numeric)
clean_df_long <- combined_df %>%
mutate(
herd_avg_fy21_to_fy23 = as.numeric(gsub("[$,]", "", herd_avg_fy21_to_fy23)),
herd_avg_fy21_to_fy23_billions = herd_avg_fy21_to_fy23 / 1e9 # Convert to billions
)
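# Rename tiers for cleaner display; rows without a designation stay NA and are dropped
# (a catch-all TRUE branch would otherwise label them "Research Colleges and Universities")
clean_df_long <- clean_df_long %>%
mutate(research_tier_2025 = case_when(
grepl("Research 1", research_tier_2025) ~ "Research 1 (R1)",
grepl("Research 2", research_tier_2025) ~ "Research 2 (R2)",
grepl("Research Colleges and Universities", research_tier_2025) ~ "Research Colleges and Universities",
TRUE ~ NA_character_
)) %>%
filter(!is.na(research_tier_2025))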
# Scatter plot of research spending vs doctoral degrees
ggplot(clean_df_long, aes(
x = herd_avg_fy21_to_fy23_billions,
y = avg_num_doc_degrees_2020_2023,
color = research_tier_2025
)) +
geom_point(alpha = 0.5, size = 1.5) +
scale_color_brewer(palette = "Set2") +
scale_x_continuous(
labels = scales::label_dollar(scale = 1, suffix = "B"), # Format in billions
breaks = scales::pretty_breaks(n = 10) # More breaks on the x-axis
) +
scale_y_continuous(
breaks = scales::pretty_breaks(n = 10) # More breaks on the y-axis
) +
labs(
title = "Avg. Research Spending vs. Doctoral Degrees (2020–2023)",
x = "Avg. Research Spending (Billions USD)",
y = "Avg. Number of Doctoral Degrees",
color = "Research Tier"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom", # Position the legend below the x-axis
legend.key.width = unit(1, "cm"), # Adjust the width of the legend keys
legend.key.height = unit(0.5, "cm"), # Adjust the height of the legend keys
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
) +
guides(color = guide_legend(ncol = 3)) # Set the legend to 3 columns
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
This grouped bar chart displays the distribution of research-designated institutions (R1, R2, and Research Colleges and Universities) across different campus size and residential categories, focusing only on 4-year universities. Institutions are categorized by size (Very Small to Large) and residential character (Highly Residential to Primarily Nonresidential), providing a detailed look at how research activity is spread across different campus environments. The chart reveals trends in how institutional research tiers align with campus scale and student living dynamics, offering insights into the structural diversity of U.S. higher education.
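Splitting a combined size-and-setting label into a size category depends on the order of the pattern checks, because the string "very small" also contains "small". The sketch below illustrates this in base R; `parse_size()` and the example labels are hypothetical, and the exact wording of the values in `size_setting` is an assumption.

```r
# Hypothetical helper: extract a size category from a combined label.
# "very small" must be tested before "small", otherwise it would be
# captured by the more general pattern.
parse_size <- function(x) {
  ifelse(grepl("very small", x), "Very Small",
    ifelse(grepl("small", x), "Small",
      ifelse(grepl("medium", x), "Medium",
        ifelse(grepl("large", x), "Large", NA_character_))))
}

# Made-up labels in an assumed format, for illustration only
labels <- c(
  "Four-year, very small, highly residential",
  "Four-year, small, primarily residential",
  "Four-year, large, primarily nonresidential",
  "Not applicable"
)

print(parse_size(labels))
```

Labels that match none of the patterns come back as `NA`, so they can be dropped with a single `!is.na()` filter.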
# Clean and categorize the data
clean_combined_df <- combined_df_na %>%
mutate(
size_category = case_when(
grepl("very small", size_setting) ~ "Very Small",
grepl("small", size_setting) ~ "Small",
grepl("medium", size_setting) ~ "Medium",
grepl("large", size_setting) ~ "Large",
TRUE ~ NA_character_ # No size keyword found; these rows are dropped by the filter below
),
residential_category = case_when(
grepl("highly residential", size_setting) ~ "Highly Residential",
grepl("primarily residential", size_setting) ~ "Primarily Residential",
grepl("primarily nonresidential", size_setting) ~ "Primarily Nonresidential",
TRUE ~ NA_character_
),
research_tier_2025 = factor(
research_tier_2025,
levels = c(
"Research 1: Very High Research Spending and Doctorate Production",
"Research 2: High Research Spending and Doctorate Production",
"Research Colleges and Universities"
)
)
) %>%
filter(!is.na(size_category) & !is.na(residential_category))
# Define label shorteners for x-axis and legend
research_labels <- c(
"Research 1: Very High Research Spending and Doctorate Production" = "R1",
"Research 2: High Research Spending and Doctorate Production" = "R2",
"Research Colleges and Universities" = "Research\nColleges"
)
# Plot: Grouped bar chart faceted by size and residential categories
ggplot(clean_combined_df, aes(x = research_tier_2025, fill = research_tier_2025)) +
geom_bar(position = "dodge") +
facet_grid(size_category ~ residential_category, scales = "free_x") +
scale_x_discrete(labels = research_labels) +
scale_fill_manual(
values = c(
"Research 1: Very High Research Spending and Doctorate Production" = "#1b9e77",
"Research 2: High Research Spending and Doctorate Production" = "#d95f02",
"Research Colleges and Universities" = "#7570b3"
),
labels = research_labels
) +
labs(
title = "Research Tiers by Size and Residential Type (4-Year Universities Only)",
x = "Research Tier",
y = "Number of Institutions",
fill = "Research Tier",
caption = "Note: 'Research Colleges and Universities' are research-focused but do not meet R1 or R2 criteria"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = .5, size = 10),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 12),
legend.position = "bottom",
legend.title = element_text(size = 11),
legend.text = element_text(size = 10),
strip.text.x = element_text(size = 11),
strip.text.y = element_text(size = 11),
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
plot.caption = element_text(size = 9, face = "italic", hjust = 1)
)