Introduction

My interest lies in the post-secondary pathways and college admissions of high school students from rural and small-town areas. While low population density is a common measure of rurality, those of us interested in access to higher education are often more concerned about the resources available to high school students, and thus our definition of ‘rural and small town’ will differ from a simple measure of population density or the categorization of zip codes. Private high schools and boarding schools are just two examples of secondary institutions that may be located in low-population areas but may also have vast post-secondary-related resources available to them, such as rigorous college counseling, strong family finances, and socio-cultural capital. Conversely, there are high schools located on the outskirts of more densely populated areas (just close enough to urbanity to be excluded from a ‘rural’ count based on population or zip code) that face transportation issues, receive few college admissions counselor visits, or have limited curricular options, all of which are challenges faced by many high schools in rural areas. 1 Accordingly, the definition of rural is not always clear, and data are not always available to explore these topics.

In addition to the challenge that defining rural presents, we also encounter difficulty when it comes to acquiring useful data, and these two limitations can coincide. In order to examine these topics, we need a way to determine which high schoolers are rural and small town students, and we need data about post-secondary pathways in a format that allows us to compare the postsecondary educations of rural and small town students with those of other students. Currently, there is no standard reference for a definition of rural and small town high schools that is not based on geography or population density. 2

Instead, organizations and groups have come up with their own identifications. The National Association for College Admission Counseling (NACAC)’s Rural and Small Town Special Interest Group (SIG) has a spreadsheet of 12,000+ hand-compiled high schools that fit the SIG’s definition of ‘rural and small town’. Swarthmore College Admissions has its own scheme for defining rural and small town, and its list overlaps with the SIG’s list somewhat, but not entirely. Both lists are based in part on a few categories of the NCES Urban-Centric Locales (see footnote 2). That answers the question “Which high schoolers are rural and small town students?”: students from rural and small town high schools, by whichever definition you choose.

In order for these lists of rural and small town high schools to be of use in data analysis, we need high school level data about graduates’ postsecondary experiences. Such data is tricky to find and often varies from state to state. Postsecondary institutions are more nationally accountable, and we have access to standard data in formats such as the Common Data Set, but high schools are more responsible to state-level organizations and so the data available is hit or miss. When data on high schools is available, it often only reports the number or percent of a graduating class that enrolled in their first year of postsecondary education without information on type of institution, persistence, eventual degrees, or additional information about the students’ experiences at those institutions. After all, high schools don’t generally follow their graduates beyond graduation. For this reason, it is often easier to obtain and examine nation-wide, school-level data about postsecondary institutions.

This is not to say that institutional level data cannot be of use in exploring these topics. There are many dimensions of access to higher education that can be explored in postsecondary education data, for concerns about issues of access do not end when a student is accepted into and/or attends college. We should also investigate whether and how collegiate resources and experiences are accessible in useful ways to students for whom such institutions were not created. Are first generation, low income, minority, and other students who face barriers to college education benefiting from the resources on their campuses, or are there additional obstacles to access that must be addressed? While high school level datasets might help us understand which students apply to and enroll in which colleges and post-secondary programs, we should also want to know which students find the support and resources they need in formats that cultivate success and allow them to benefit throughout and after college. It is for these very reasons that I have decided to look at variables about selected post-secondary institutions in the United States using data from the Integrated Postsecondary Education Data System (IPEDS), specifically the 2017 complete data files. 3

The IPEDS complete data files are marvelously detailed and contain years’ worth of data visualization potential. I chose some variables that I thought would be interesting to look at, including the state the institution is in, student-faculty ratio, percent of students formally registered as students with disabilities, and percent of full-time, first-year undergraduate students offered financial aid in the form of an institutional grant. Two data files, one containing the states of residency of first-time, first-year degree/certificate-seeking undergraduates at each institution and the other about the demographics of institutions’ new hires, also caught my attention, so my visualizations aim to explore these areas as well.

The IPEDS complete data files contain the responses to twelve interrelated survey components. 4 Each postsecondary institution is given a unique identifier called a UNITID, and each complete data file contains somewhere between 2,000 and 7,000 unique UNITIDs. I downloaded a selection of the complete data files and pulled variables from several to create my own collection of data. I used three data files on their own because they had more than one row per UNITID. Those three data files were ef2016c_rv.csv, s2017_is_rv.csv, and s2017_nh_rv.csv. (See footnote 3 to understand why you won’t see _rv when you go to download these data files.) I chose the institutions that would be included in my visuals by finding the UNITIDs that were present in every file I had downloaded. Therefore, a major limitation of my data is that it is not necessarily representative of the entire set of postsecondary institutions in the United States, nor of the entire 2017 IPEDS survey. For further discussion of limitations and a comparison of these 1,672 institutions with the 7,153 listed in the 2017 directory file, see the Data Limitations section under Discussion.
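
As a rough illustration of the selection step described above, here is a minimal sketch in base R. The three toy data frames (file_a, file_b, file_c) and their contents are hypothetical stand-ins, not the actual IPEDS files; the only thing they share with the real data is a UNITID column.

```r
# Hypothetical toy stand-ins for downloaded IPEDS files; only UNITID matters here.
file_a <- data.frame(UNITID = c(100, 200, 300, 400))
file_b <- data.frame(UNITID = c(200, 300, 300, 500))  # more than one row per UNITID
file_c <- data.frame(UNITID = c(300, 200, 600))

files <- list(file_a, file_b, file_c)

# Pull the unique UNITIDs out of each file ...
ids_per_file <- lapply(files, function(df) unique(df$UNITID))

# ... and keep only the ones that appear in every file.
common_ids <- sort(Reduce(intersect, ids_per_file))

# Restrict any one file to the shared institutions (all of its rows are kept).
file_b_common <- file_b[file_b$UNITID %in% common_ids, , drop = FALSE]
```

Because the intersection is taken over every file, adding one more file to the list can only shrink the set of institutions, which is one way to see why the final dataset ends up so much smaller than the full directory.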

I set out to visualize relationships between student demographics (such as the states that students lived in when they applied and the percent of students with registered disabilities) and institution characteristics (such as the size of institutions, the financial aid awarded in institutional grants, and the student faculty ratio). Wherever student totals or total undergraduates are given, they are totals of full-time, degree/certificate-seeking students. Given the scope and length of this project and the limitations of the data, I focused on finding interesting patterns and creating strong graphics, which led me to several interactive visualizations that allow the viewer to explore and perhaps ask and investigate questions on their own during that exploration. I hope you enjoy the results.

Exploration

Each visualization is presented below. See the Data Work section at the end for details about data wrangling and creation of the plots.

Home Regions Alluvial Plot

To explore the demographics of these institutions and their students, I developed four graphical representations of the data. Studying the postsecondary pathways of rural and small-town high school students, on a very simple level, involves understanding where students go to high school and where they go to college. Thus, the first question that guided my work pertained to location. The first graphical representation I created is an alluvial plot showing students’ home regions of residence and the regions where they pursued postsecondary education. The plot was very busy when each state was separate, so I grouped the states into regions using the Census Bureau-designated regions and divisions listed under ‘Interstate regions’ on Wikipedia.5 The resulting graphic visualizes the movement of first-year students between high school and the post-secondary institution they enroll in as first-time degree/certificate-seeking undergraduates. (Just assume for the remainder of this paper that if I write “first-year undergraduates” or “first-year students” or “first-years” or “freshmen” I mean “first-time degree/certificate-seeking undergraduate students.”) Alluvial plots give us a sense of summarized movement while also showing the relative magnitude (number of students in each band) of that movement.
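
The grouping of states into regions could be sketched roughly as follows. This is an assumption-laden illustration in base R, not the wrangling I actually ran: the lookup vector state_to_region is hypothetical and deliberately partial, and the toy flows data frame only borrows the column names STABBR, EFCSTATE, and num_students.

```r
# Hypothetical, partial lookup; a real one would cover every state plus the
# extra IPEDS categories (e.g. "Foreign countries", "Outlying areas").
state_to_region <- c(PA = "Northeast", NY = "Northeast",
                     TX = "South", GA = "South",
                     OH = "Midwest", CA = "West")

# Toy flows: one row per (institution state, home state) pair.
flows <- data.frame(
  STABBR       = c("PA", "PA", "PA", "TX", "CA"),
  EFCSTATE     = c("PA", "NY", "OH", "GA", "NY"),
  num_students = c(120, 40, 15, 200, 30),
  stringsAsFactors = FALSE
)

# Translate both state columns into regions via the lookup.
flows$college_region <- unname(state_to_region[flows$STABBR])
flows$home_region    <- unname(state_to_region[flows$EFCSTATE])

# Collapse state-level flows into region-level flows.
region_flows <- aggregate(num_students ~ home_region + college_region,
                          data = flows, FUN = sum)
```

Collapsing to regions trades detail for legibility: the PA-to-PA and NY-to-PA rows above merge into a single Northeast-to-Northeast band.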


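## Load the packages used by the code in this section
library(tidyverse)   # read_csv, dplyr verbs, ggplot2, forcats, stringr
library(ggalluvial)  # geom_flow, geom_stratum
library(viridis)     # scale_color_viridis, scale_fill_viridis
## Import the data file
regions_data <- read_csv("Data2017/regions_data.csv")

## Get the data into an alluvial format
## select just the id, number of students, and location column
## (STABBR for the institution, EFCSTATE for the student's home)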
COLLEGE <- regions_data %>% select(id, num_students, STABBR)
HOME <- regions_data %>% select(id, num_students, EFCSTATE)
COLLEGE <- COLLEGE %>% mutate(WHERE = "College")
HOME <- HOME %>% mutate(WHERE = "Home")
COLLEGE <- COLLEGE %>% rename(REGION = STABBR)
HOME <- HOME %>% rename(REGION = EFCSTATE)
# Bind into one dataframe
alluvial_data <- bind_rows(COLLEGE, HOME) %>%
  # factor WHERE
  mutate(WHERE = factor(WHERE, levels = c('Home', 'College'))) %>%
  mutate(REGION = factor(REGION, levels = c('Foreign countries', 'Northeast', 
'State unknown', 'South', 'Not reported', 'Midwest', 'Outlying areas', 'West')))
#write_csv(alluvial_data, "Data2017/home_state_alluvial.csv")
#write_csv(alluvial_data, "Shiny App/home_state_alluvial_app.csv")
 ggplot(alluvial_data, aes(x = WHERE, 
           stratum = REGION, 
           alluvium = id,
           y = num_students,
           fill = REGION, 
           label = REGION)) +
  geom_flow(width = 1/4) +
  geom_stratum(col = "black", fill = "black", width = 1/4, alpha = .8) +
  geom_text(stat = "stratum", size = 3, col = "white", angle = 0, family = "Avenir Next") +
  labs(title = "Home regions and institution regions of first-year undergraduates, 2016",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2016",
       x = "",
       y = "Number of students") +
  scale_x_discrete(expand = c(.1, .1), labels = c("Home", "Postsecondary Institution")) +
  scale_y_continuous(minor_breaks = NULL) +
  scale_fill_viridis_d() +
  project_theme +
  theme(legend.position = "none") +
  guides(fill = "none") 

The ‘Outlying areas’ category includes: American Samoa, Federated States of Micronesia, Guam, Marshall Islands, Northern Marianas, Palau, Puerto Rico, and Virgin Islands.


Interpretation:

We immediately see that the majority of first-years begin postsecondary education in the same region where they live when they enroll. (In fact, when I made the same plot using states instead of regions, it was clear that the vast majority of first-years begin postsecondary education in the same state where they live when they enroll.) Why might this be? Could it have something to do with tuition? It is widely known that state schools tend to be less expensive for in-state than out-of-state students.

The ‘South’ region appears to have the greatest number of students, both as a home region and an institution region. The three other primary regions are more difficult to compare in this format. One limitation of this alluvial plot is that bands representing larger numbers of students are wider and thus more visible. Therefore, we should draw conclusions carefully and be mindful of what we might not be able to learn from the plot. (For example, 18 students who began postsecondary education in the ‘Outlying areas’ region lived in ‘Foreign countries’ when they enrolled. Looking at the plot, we can only see four bands leaving ‘Foreign countries’ and only one band entering ‘Outlying areas.’)
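
Since thin bands are hard to read off the plot, a small tabulation can complement it. The sketch below uses made-up numbers (only the 18 students moving from ‘Foreign countries’ to ‘Outlying areas’ comes from my data) and hypothetical column names; it shows one way to compute the share of students who stay in their home region and to list the smallest flows.

```r
# Toy region-level flows; all values except the 18 are invented for illustration.
region_flows <- data.frame(
  home_region    = c("South", "South", "Northeast", "Foreign countries"),
  college_region = c("South", "West", "Northeast", "Outlying areas"),
  num_students   = c(900, 100, 700, 18),
  stringsAsFactors = FALSE
)

# Share of students whose institution is in their home region.
stayed <- region_flows$home_region == region_flows$college_region
share_stayed <- sum(region_flows$num_students[stayed]) /
  sum(region_flows$num_students)

# The smallest flows, which are the ones most likely to be invisible in the plot.
smallest <- region_flows[order(region_flows$num_students), ][1:2, ]
```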

Tuition Scatterplots

To further explore the topic of in-state versus out-of-state postsecondary institutions, I plotted the in-state tuition against the out-of-state tuition for the institutions in my dataset. These figures are the full student charges for the entire academic year. Given the context that public institutions are known to sometimes offer lower tuition to in-state students, I faceted the plots by CONTROL, or type of institution, so we can compare just that. I included the number of enrolled first-year students as a color gradient in the points to see if there is any relationship between tuition and enrollment.

I created a tab in my Shiny App where users can select a state and see the institutions from that state overlaid atop a grayed-out plot of the entire dataset. The view when Pennsylvania is selected is shown below as a static example of this functionality.


### Import data
institutions <- read_csv("Data2017/institutions.csv", col_types = 
cols(.default = "n", STABBR = "c", SECTOR = "f", CONTROL = "f", HBCU = "f", 
LOCALE = "f", INSTCAT = "f", INSTSIZE = "f", F1SYSTYP = "f", PEO2ISTR = "f", 
FT_UG = "f", DISTNCED = "f", DISAB = "f", ROOM = "f"))
## In-state tuition lower for primarily public institutions
ggplot(data = institutions, mapping = aes(x = TUITION2, y = TUITION3, 
                                          color = ENRLT)) +
  geom_point(alpha = 0.5) +
  labs(title = "In- and Out-of-state tuition by type of institution",
       subtitle = "In-state tuition lower for some public institutions",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       x = "In-state tuition (USD)",
       y = "Out-of-state\ntuition (USD)",
       color = "# of\nenrolled\nfirst-years") +
  scale_color_viridis(option = "A", end = 0.85, direction = -1) + 
  scale_y_continuous(breaks = c(0, 10000, 20000, 30000, 40000, 50000), minor_breaks = NULL) +
  scale_x_continuous(minor_breaks = NULL) +
  ## use labeller to give new names to the different levels of CONTROL
  facet_wrap(~ CONTROL, labeller = as_labeller(c(`1` = "Public", 
                                                 `2` = "Private not-for-profit",
                                                 `3` = "Private for-profit", 
                                                 `-3` = "Not available"))) +
  project_theme +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5))


Interpretation:

As we predicted, there are many institutions for which in-state tuition is lower than out-of-state tuition. Specifically, in-state tuition is lower primarily for public institutions. Notably, in-state tuition at public institutions does not exceed $20,000, while out-of-state tuition reaches nearly $50,000 and tuition at some private not-for-profit institutions exceeds $50,000. We see that two private not-for-profit institutions offer slightly lower in-state than out-of-state tuition, and there seem to be three institutions (one private not-for-profit and at least two public) with free tuition. The private not-for-profit institution with free tuition is located in Pennsylvania, as it appears on the state plot below. The majority of private not-for-profit, and all of the private for-profit, institutions charge the same tuition to in-state students as to out-of-state students. (I checked by plotting a line with intercept \(0\) and slope \(1\) behind the scatterplot, and the clear lines of points in these plots do in fact lie on the line \(y = x\).) Finally, we see that many institutions have fewer than 3,000 first-year students. It seems that institutions with larger numbers of first-years tend to charge higher tuition, but a closer look is necessary before we can draw conclusions.
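
The \(y = x\) check can also be done numerically rather than by eye. Here is a minimal sketch with invented tuition values; it reuses the column names CONTROL, TUITION2, and TUITION3 from my dataset, but the toy data frame is hypothetical.

```r
# Toy data: CONTROL 1 = public, 2 = private not-for-profit, 3 = private for-profit.
toy <- data.frame(
  CONTROL  = c(1, 1, 1, 2, 2, 3),
  TUITION2 = c(9000, 11000, 0, 45000, 30000, 15000),  # in-state
  TUITION3 = c(25000, 30000, 0, 45000, 31000, 15000)  # out-of-state
)

# TRUE where an institution charges the same tuition to both groups,
# i.e. where its point sits exactly on the line y = x.
toy$same_tuition <- toy$TUITION2 == toy$TUITION3

# Share of institutions on the line, by type of institution.
share_same <- tapply(toy$same_tuition, toy$CONTROL, mean)
```

A share of 1 for a category would correspond to every point in that facet lying on the diagonal.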

Below is a static version of what my Shiny App shows when the state “Pennsylvania” is selected:


## In- and out-of- state tuition for PA
institutions %>% filter(STABBR == "PA") %>%
ggplot(., mapping = aes(x = TUITION2, y = TUITION3, 
                                          color = ENRLT)) +
  #geom_abline(slope = 1, intercept = 0, alpha = 0.25) +
  geom_point(data = institutions, aes(x = TUITION2, y = TUITION3), color = "grey", alpha = 0.025) +
  geom_point(alpha = 1) +
  labs(title = "In- and Out-of-state tuition by type of institution",
       subtitle = "Pennsylvania",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       x = "In-state tuition (USD)",
       y = "Out-of-state\ntuition (USD)",
       color = "# of\nenrolled\nfirst-years") +
  scale_color_viridis(option = "A", end = 0.85, direction = -1) + 
  scale_y_continuous(breaks = c(0, 10000, 20000, 30000, 40000, 50000), minor_breaks = NULL) +
  scale_x_continuous(minor_breaks = NULL) +
  ## use labeller to give new names to the different levels of CONTROL
  facet_wrap(~ CONTROL, labeller = as_labeller(c(`1` = "Public", 
                                                 `2` = "Private not-for-profit",
                                                 `3` = "Private for-profit", 
                                                 `-3` = "Not available"))) +
  project_theme +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5))


Another readily available variable in the dataset is whether or not an institution is a historically Black college or university, so I plotted tuition by this category as well.


## Tuition by HBCU status
ggplot(data = institutions, mapping = aes(x = TUITION2, y = TUITION3, 
                                          color = ENRLT)) +
  #geom_abline(slope = 1, intercept = 0, alpha = 0.25) +
  geom_point(alpha = 0.5) +
  labs(title = "In- and Out-of-state tuition",
       subtitle = "Historically Black Colleges and Universities (HBCU)",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       x = "In-state tuition (USD)",
       y = "Out-of-state\ntuition (USD)",
       color = "# of\nenrolled\nfirst-years") +
  scale_color_viridis(option = "A", end = 0.85, direction = -1) + 
  scale_y_continuous(breaks = c(0, 10000, 20000, 30000, 40000, 50000), minor_breaks = NULL) +
  scale_x_continuous(minor_breaks = NULL) +
  ## use labeller to give new names to the different levels of HBCU
  facet_wrap(~ HBCU, labeller = as_labeller(c(`1` = "HBCU", 
                                                 `2` = "Not an HBCU"))) +
  project_theme + 
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5))


Interpretation:

I was not sure what to expect from this plot, since I do not have much prior knowledge of historically Black colleges and universities (HBCUs). Overall, we see that tuition at HBCUs does not exceed $30,000, and that there is a mix of schools that charge equal tuition and schools that charge lower in-state tuition. It also appears that HBCUs tend to have about 3,000 or fewer first-years.

Faculty and Students by Race/Ethnicity

The third institutional characteristic that I examine is the population of instructional staff, specifically professors, by race/ethnicity. The column charts I have created show the numbers of full-time full professors, associate professors, and assistant professors (as of November 1, 2017) and full-time new hires of instructional staff (hired between November 1, 2016 and October 31, 2017) employed by the institution. For comparison, I also replicated this plot with the numbers of full-time, degree/certificate-seeking undergraduate students enrolled in fall 2017. It is important that students have mentors who understand their experiences and that students see themselves in the role models and successful adults around them, and so it is important for faculty and staff to represent diverse identities including, but not limited to, first-generation status, race/ethnicity, gender identity, and having English as a second language. Looking at race and ethnicity is a beginning; I hope that more data becomes available to fully represent and learn about additional identities.

This visualization is also included in my Shiny App, where you can select an institution to be displayed from a list of Swarthmore College and peer liberal arts colleges. Below are the charts for Swarthmore College.


##### SET UNITID (Swarthmore College, as shown in the plot subtitles)
unitid <- 216287
### GET NEW HIRES DATA READY
new_hires_bd <- read_csv("Data2017/new_hires_bd.csv")
## Just the race columns (has male and female columns)
#new_hires_bd <- new_hires_bd %>% select(UNITID, SNHCAT, HRTOTLT, HRTOTLM, 
#HRTOTLW, HRAIANT, HRAIANM, HRAIANW, HRASIAT, HRASIAM, HRASIAW, HRBKAAT, HRBKAAM, 
#HRBKAAW, HRHISPT, HRHISPM, HRHISPW, HRNHPIT, HRNHPIM, HRNHPIW, HRWHITT, HRWHITM, 
#HRWHITW, HR2MORT, HR2MORM, HR2MORW, HRUNKNT, HRUNKNM, HRUNKNW, HRNRALT, HRNRALM, 
#HRNRALW)
## Just the total race columns
new_hires_bd <- new_hires_bd %>% select(UNITID, SNHCAT, HRTOTLT, HRAIANT, HRASIAT,
HRBKAAT, HRHISPT, HRNHPIT, HRWHITT, HR2MORT, HRUNKNT, HRNRALT)

new_hires <- new_hires_bd %>% 
            # Select just the one institution
            filter(UNITID == unitid) %>%
            # Get just the values of total new hires:
            filter(SNHCAT == 10000) %>% 
            # Remove redundant column
            select(-SNHCAT) %>%
            # Gather columns into rows
            gather(RACE, NUMFAC, -UNITID) %>% 
            # Add a tag for the type of faculty
            mutate(PROFTYPE = "New Hires")

### GET CURRENT FAC DATA READY
instructional_staff_bd <- read_csv("Data2017/s2017_is_rv.csv")

## Just the total race columns
instructional_staff_bd <- instructional_staff_bd %>% select(UNITID, SISCAT, ARANK, HRTOTLT, 
HRAIANT, HRASIAT, HRBKAAT, HRHISPT, HRNHPIT, HRWHITT, HR2MORT, HRUNKNT, HRNRALT)
# select just one institution
faculty_by_rank <- instructional_staff_bd %>% filter(UNITID == unitid)
## get just the values of the current faculty that are:
# Full professors (1) 
# Associate professors (2)
# Assistant professors (3)
# Helper: total the race/ethnicity count columns for one academic rank
sum_rank <- function(data, rank) {
  data %>%
    filter(ARANK == rank) %>%
    select(-ARANK) %>%
    group_by(UNITID) %>%
    # Every count column selected above starts with "HR"
    summarize(across(starts_with("HR"), sum))
}
professor <- sum_rank(faculty_by_rank, 1)
associate <- sum_rank(faculty_by_rank, 2)
assistant <- sum_rank(faculty_by_rank, 3)
# Gather columns into rows
professor <- gather(professor, RACE, NUMFAC, -UNITID) %>% 
  mutate(PROFTYPE = "Professors")
associate <- gather(associate, RACE, NUMFAC, -UNITID) %>% 
  mutate(PROFTYPE = "Associate Professors")
assistant <- gather(assistant, RACE, NUMFAC, -UNITID) %>%
  mutate(PROFTYPE = "Assistant Professors")

# Create the dataset we will use to make the chart
faculty_by_rank <- bind_rows(professor, associate) %>%
            bind_rows(assistant) %>%
            bind_rows(new_hires) %>%
            # Factor the race variable
            mutate(RACE = factor(RACE)) %>%
            # Factor the professorship variable
            mutate(PROFTYPE = factor(PROFTYPE)) %>%
            mutate(PROFTYPE = forcats::fct_relevel(PROFTYPE, c("New Hires", 
            "Professors", "Associate Professors", "Assistant Professors"))) 

## Split into two datasets so we can plot all races except the total,
## but have the total as a text geom
column_data <- faculty_by_rank %>% filter(RACE != "HRTOTLT") %>%
  mutate(RACE = factor(RACE))
column_data$RACE <- forcats::fct_relevel(column_data$RACE, c("HRWHITT", "HRAIANT",
"HRASIAT", "HRBKAAT", "HRHISPT", "HRNHPIT", "HR2MORT", "HRNRALT", "HRUNKNT"))
annotate_data <- faculty_by_rank %>% 
  filter(RACE == "HRTOTLT") %>%
  # Build the "Total: n" label shown on each facet
  mutate(total_text = str_c("Total: ", NUMFAC))
# The faculty plot
ggplot() +
  geom_col(data = column_data, aes(x = RACE, y = NUMFAC, fill = RACE), 
           size = 1.5) +
  geom_label(data = annotate_data, 
             aes(x = "HRUNKNT", y = max(NUMFAC/2), label = total_text), 
             family = "Avenir Next") +
  labs(title = "Full-time new hires and full-time instructional staff",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       subtitle = "Swarthmore College",
       x = "",
       y = "Number of Faculty") +
  scale_x_discrete(labels = c("White", "American Indian or Alaska Native", 
  "Asian", "Black or African American", "Hispanic or Latino", 
  "Native Hawaiian or Other Pacific Islander", "Two or more races", 
  "Nonresident alien", "Race/ethnicity unknown")) +
  scale_fill_viridis(option = "D", end = 0.75, discrete = TRUE, direction = -1) +
  project_theme +
  guides(fill = "none") +
  coord_flip() +
  facet_wrap(~ PROFTYPE)

### Code to get the undergrads data for column chart in shiny app
directory <- read_csv("Data2017/hd2017.csv", col_types = cols_only(UNITID = "n",
 INSTNM = "c"))
undergrads <- read_csv("Data2017/ef2017a_rv.csv") %>% 
  # Grab only rows counting full-time degree-seeking undergrads
  filter(EFALEVEL == 23) %>% 
  # Grab only the columns counting the total number in total and for each race
  select(UNITID, EFTOTLT, EFAIANT, EFASIAT, EFBKAAT, EFHISPT, EFNHPIT, EFWHITT, EF2MORT, EFUNKNT, EFNRALT)
undergrads <- left_join(undergrads, directory, by = "UNITID")
#write_csv(undergrads, "Data2017/undergrads_bd.csv")
undergrads_bd <- read_csv("Data2017/undergrads_bd.csv")
student_data <- undergrads_bd %>% 
  # Select just the one institution
  filter(UNITID == unitid) %>%
  # Gather columns into rows
  gather(RACE, NUMSTU, -UNITID) %>% 
  # Factor the race variable
  mutate(RACE = factor(RACE)) %>%
  # Every race/ethnicity besides total and the row with institution name
  filter(RACE != "EFTOTLT" & RACE != "INSTNM") %>%
  # Relevel the levels of Race so it plots nicely
  mutate(RACE = forcats::fct_relevel(RACE, c("EFWHITT", "EFAIANT", 
  "EFASIAT", "EFBKAAT", "EFHISPT", "EFNHPIT", "EF2MORT", "EFNRALT", 
  "EFUNKNT"))) %>%
  # Make the number of students, NUMSTU, a numeric variable
  mutate(NUMSTU = as.numeric(NUMSTU))

ggplot() +
  geom_col(data = student_data, aes(x = RACE, y = NUMSTU, fill = RACE), 
           size = 1.5) +
  labs(title = "Undergraduates enrolled in fall 2017",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       subtitle = "Swarthmore College",
       x = "",
       y = "Number of Students") +
  scale_x_discrete(labels = c("White", "American Indian or Alaska Native",
  "Asian", "Black or African American", "Hispanic or Latino", 
  "Native Hawaiian or Other Pacific Islander", "Two or more races", 
  "Nonresident alien", "Race/ethnicity unknown")) +
  scale_fill_viridis(option = "D", end = 0.75, discrete = TRUE, direction = -1) +
  project_theme +
  guides(fill = "none") +
  coord_flip() 


Interpretation: I want to emphasize that these column charts have NUMBERS of students and faculty on the \(x\)-axes, not percentages. Because of this, we have to be careful when considering the charts together, but we can still gain meaningful insights from them. What is nice about the four faculty plots is that they all have the same scale on the \(x\)-axis, so longer bars mean larger numbers across those four plots. Overall, we see that the longest bars on all five plots are light green, so in every plot there are more white professors, new hires, or students than there are of any other race/ethnicity. Among Swarthmore’s professors in 2017, assistant professors had the highest numbers of non-white faculty members. (Note that the top two bars on each chart are “Race/ethnicity unknown” and “Nonresident alien”, so while our eye might like to clump every bar above the “white” (light green) one into a “non-white” group, we cannot make assumptions about those individuals in the top two bars.) American Indian or Alaska Native and Native Hawaiian or Other Pacific Islander are the least represented races/ethnicities in all five charts. Finally, this may be entirely coincidental (because, again, these charts show COUNTS, not percentages), but the relative lengths of the race/ethnicity bars in the assistant professor chart most closely mirror the relative lengths of the bars in the student chart.
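
One way to make the charts directly comparable would be to convert the counts to within-group shares. The sketch below is only an illustration with invented numbers and a hypothetical toy_counts data frame; it is not the data plotted above.

```r
# Toy counts for two groups; the values are made up for illustration.
toy_counts <- data.frame(
  PROFTYPE = c("Assistant Professors", "Assistant Professors",
               "Students", "Students"),
  RACE     = c("White", "Asian", "White", "Asian"),
  N        = c(30, 10, 600, 200),
  stringsAsFactors = FALSE
)

# Divide each count by the total for its own group, so that the shares
# within each PROFTYPE sum to 1.
toy_counts$share <- toy_counts$N /
  ave(toy_counts$N, toy_counts$PROFTYPE, FUN = sum)
```

In this toy case the two groups differ by a factor of twenty in size but have identical shares, which is exactly the kind of pattern that raw counts can hide or exaggerate.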

Disabilities

The fourth visualization deals with the percent of students with registered disabilities by institution size (categories based on the number of undergraduate students). The institutions plotted here are only those for which the percent of students with registered disabilities is greater than three percent. The values are reported as “the percentage of all undergraduates enrolled in Fall 2016 who are formally registered as students with disabilities with the institution’s office of disability services (or the equivalent office).” (Quoted from ic2017.xlsx, the data information document accompanying the data file ic2017.csv.)

There are two institutions for which 100% of their students have registered disabilities, so there is a second plot excluding those two data points for a closer look.

## disability percent scatterplot
disability_data <- institutions #%>% mutate(DISABPCT = ifelse(DISAB == "1", -20, DISABPCT))
disability_data$INSTSIZE <- factor(disability_data$INSTSIZE)
disability_data$INSTSIZE <- forcats::fct_relevel(disability_data$INSTSIZE, c("1", "2", "3", "4", "5", "-1", "-2"))

# First plot: all institutions reporting more than 3% (DISAB == 2)
ggplot(disability_data, aes(x = INSTSIZE, y = DISABPCT)) +
  geom_jitter(data = disability_data %>% filter(DISAB == 2), alpha = 0.25) +
  geom_violin(data = disability_data %>% filter(DISAB == 2), alpha = 0, color = "purple", scale = "area") +
  labs(title = "Percent of students with registered disabilities (when greater than 3%)",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
x = "Total students enrolled for credit, Fall 2017",
y = "% of students w/ registered disabilities") +
  scale_x_discrete(labels = c(`1` = "Under 1,000", 
                              `2` = "1,000 - 4,999",
                              `3` = "5,000 - 9,999", 
                              `4` = "10,000 - 19,999",
                              `5` = "20,000 and above",
                              `-1` = "Not reported",
                              `-2` = "Not applicable")) +
  project_theme

# Second plot: the same data without the two institutions at 100%
ggplot(disability_data %>% filter(DISABPCT < 100), aes(x = INSTSIZE, y = DISABPCT)) +
  geom_jitter(alpha = 0.25) +
  geom_violin(alpha = 0, color = "purple", scale = "area") +
  labs(title = "Percent of students with registered disabilities (when greater than 3%)",
       subtitle = "Excluding 2 institutions at 100% (both in the Under 1,000 category)",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       x = "Total students enrolled for credit, Fall 2017",
       y = "% of students w/ registered disabilities") +
  scale_x_discrete(labels = c(`1` = "Under 1,000", 
                              `2` = "1,000 - 4,999",
                              `3` = "5,000 - 9,999", 
                              `4` = "10,000 - 19,999",
                              `5` = "20,000 and above",
                              `-1` = "Not reported",
                              `-2` = "Not applicable")) +
  project_theme

Interpretation: Students with registered disabilities attend institutions of all sizes, from fewer than 1,000 students to more than 20,000. The size category containing the most institutions in the dataset is the 1,000 - 4,999 group.

Discussion

Data Limitations

As I mentioned in the introduction, the data I used in this project comes from the Integrated Postsecondary Education Data System (IPEDS) complete data files (see footnotes 3 and 4). The complete data files I downloaded each contain somewhere between 2,000 and 7,000 unique UNITIDs, or unique postsecondary institutions. I chose the year 2017 because it is long enough ago for many of the data files to have been revised and edited, but also recent enough for the data to be relevant. The tuition scatterplots, faculty and student race/ethnicity column charts, and the registered disabilities plot all use 2017 data. The home region alluvial plot uses 2016 data because the state of residence survey question is only required in even-numbered years. I deemed the 2016 file more appropriate since it was revised in 2018, whereas the 2018 file is still a preliminary release.

To ensure that I could use any variables I wished while maintaining continuity between visualizations, I created a list of the UNITIDs that are present in every file I downloaded (the 2017 complete files I chose and the 2016 state of residence file) and visualized data from only those 1,672 institutions. However, there were 7,153 institutions in the 2017 directory file. What I hope to do in this section is to give the curious reader, and myself as the researcher, a sense of how the institutions in my dataset compare to the institutions that participate in the survey. The IPEDS survey is actually part of the 1965 Higher Education Act, and it is mandatory for all postsecondary institutions that participate in federal financial aid programs (also known as Title IV-eligible institutions).6 Each plot below comes in three “flavors”: the entire survey (all 7,153 institutions in the full directory), the project dataset (the 1,672 institutions in the data I plotted in all my visualizations above), and the complement of the project dataset (the 5,481 institutions in the full directory but not in the project dataset). I chose to show all three for a more comprehensive overview. I will not comment on every plot, but instead leave many of the comparisons to speak for themselves.
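The UNITID intersection and the three-way split described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual code: the data frames `hd`, `ic`, and `ef`, the names `common_ids` and `divided`, and the tiny UNITID values are all hypothetical. Only `UNITID`, `INSTSIZE`, and the three `DIREC` labels come from the text and code in this paper.

```r
library(dplyr)

# Toy stand-ins for three downloaded files (hypothetical values).
hd <- tibble(UNITID = c(1, 2, 3, 4, 5), INSTSIZE = c(1, 2, 2, 3, 1))  # directory file
ic <- tibble(UNITID = c(1, 2, 3, 5))
ef <- tibble(UNITID = c(2, 3, 5, 6))

# UNITIDs that are present in every file.
common_ids <- Reduce(intersect, list(hd$UNITID, ic$UNITID, ef$UNITID))

# Stack the three "flavors" of the directory so plots can be faceted by DIREC.
divided <- bind_rows(
  hd %>% mutate(DIREC = "Full Survey"),
  hd %>% filter(UNITID %in% common_ids) %>% mutate(DIREC = "Project Dataset"),
  hd %>% filter(!UNITID %in% common_ids) %>% mutate(DIREC = "Complement of Project Dataset")
)

count(divided, DIREC)
```

Stacking the rows this way means every institution appears once under "Full Survey" and once more under either the project dataset or its complement, which is what lets a single `facet_wrap(~ DIREC)` call draw all three panels.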

# The data
divided_directory <- read_csv("Data2017/divided_directory.csv") %>%
  mutate(DIREC = forcats::fct_relevel(DIREC, c("Full Survey", 
                          "Project Dataset", "Complement of Project Dataset")))

## A map of just the continental US
us_states <- map_data("state")

## Plotting every institution
ggplot(data = us_states, mapping = aes(x = long, y = lat)) +
  #Draw US map lines
  geom_polygon(mapping = aes(group = group), color = "black", 
               fill = "beige", alpha = 0.4, size = 0.1) +
  #Use Albers projection
  coord_map(projection = "albers", lat0 = 0, lat1 = 60) + 
  labs(title = "Institution Locations on a Map",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       fill = NULL) +
  #A theme that removes background and other unneeded elements (& use theme font)
  theme_map(base_family = "Avenir Next") +
  geom_point(data = divided_directory, mapping = aes(x = LONGITUD, y = LATITUDE, 
                      text = as.character(UNITID)), size = 0.25, alpha = 0.25) +
  facet_wrap(~ DIREC, nrow = 3)

ggplot(data = us_states, mapping = aes(x = long, y = lat)) +
  #Draw US map lines
  geom_polygon(mapping = aes(group = group), color = "black", fill = "#FAECB3", 
               alpha = 0.25, size = 0.1) +
  #Use Albers projection
  coord_map(projection = "albers", lat0 = 39, lat1 = 45) + 
  labs(title = "Institution Locations on a Map of Continental US", 
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       fill = NULL) +
  #A theme that removes background and other unneeded elements (& use theme font)
  theme_map(base_family = "Avenir Next") +
  geom_point(data = divided_directory %>% filter(LATITUDE > 22 & LATITUDE < 58), 
             mapping = aes(x = LONGITUD, y = LATITUDE, text = as.character(UNITID)), 
             size = .8, alpha = 0.25) +
  scale_color_viridis(option = "magma", begin = 0.15, end = 0.85, discrete = TRUE) +
  facet_wrap(~DIREC, nrow = 3)

ggplot(data = divided_directory, aes(x = factor(INSTSIZE), fill = INSTSIZE)) +
  geom_bar() +
  labs(title = "Institution Size",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       x = "Number of students enrolled for credit, fall 2017",
       y = "Count") +
  scale_x_discrete(labels = c(`1` = "Under 1,000", 
                              `2` = "1,000 - 4,999",
                              `3` = "5,000 - 9,999", 
                              `4` = "10,000 - 19,999",
                              `5` = "20,000 and above",
                              `-1` = "Not reported",
                              `-2` = "Not applicable")) +
  project_theme +
  scale_fill_viridis(option = "E", end = 0.3) +
  guides(fill = "none") +
  facet_wrap(~ DIREC, nrow = 3)

From this plot, we see that the project dataset has far fewer “Under 1,000” institutions than the full directory. It contains about half of the full directory's institutions with 1,000-4,999 students, and a little less than half of those in each larger category.
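For readers who would rather check these proportions numerically than read them off the bars, one possible approach is sketched below. It assumes a data frame shaped like divided_directory (one row per institution per `DIREC` group). The toy counts and the names `toy_directory`, `size_share`, and `share_in_project` are hypothetical and do not reflect the real data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for divided_directory (hypothetical counts, two DIREC groups only).
toy_directory <- tibble(
  DIREC    = c(rep("Full Survey", 6), rep("Project Dataset", 2)),
  INSTSIZE = c(1, 1, 1, 1, 2, 2, 1, 2)
)

# Count institutions per size category in each group, then compare the groups.
size_share <- toy_directory %>%
  count(DIREC, INSTSIZE) %>%
  pivot_wider(names_from = DIREC, values_from = n, values_fill = 0) %>%
  mutate(share_in_project = `Project Dataset` / `Full Survey`)

size_share
```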


ggplot(data = divided_directory, aes(x = factor(SECTOR), fill = factor(SECTOR))) +
  geom_bar() +
  labs(title = "Institution Sector",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       x = "",
       y = "Count") +
   scale_x_discrete(labels = c(`0` = "Administrative Unit",
                              `1` = "Public, 4-year or above", 
                              `2` = "Private not-for-profit, 4-year or above",
                              `3` = "Private for-profit, 4-year or above", 
                              `4` = "Public, 2-year",
                              `5` = "Private not-for-profit, 2-year",
                              `6` = "Private for-profit, 2-year",
                              `7` = "Public, less-than 2-year",
                              `8` = "Private not-for-profit, less-than 2-year",
                              `9` = "Private for-profit, less-than 2-year",
                              `99` = "Sector unknown (not active)")) +
  coord_flip() +
  project_theme +
  scale_fill_viridis(option = "inferno", discrete = TRUE, begin = 0.7, end = 0.4) +
  guides(fill = "none") +
  facet_wrap(~ DIREC, nrow = 3)

It appears that most of the institutions in the project dataset are four-year or above. Our dataset contains a little more than half of all the private not-for-profit and public 4-year or above institutions.


ggplot(data = divided_directory, aes(x = factor(ICLEVEL), fill = factor(ICLEVEL))) +
  geom_bar() +
  labs(title = "Program Years",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       x = "",
       y = "Count") +
   scale_x_discrete(labels = c(`1` = "Four or more years", 
                              `2` = "At least 2 but less than 4 years",
                              `3` = "Less than 2 years (below associate)", 
                              `-3` = "Not available")) +
  coord_flip() +
  project_theme +
  scale_fill_viridis(option = "inferno", discrete = TRUE, begin = 0.4, end = 0.3) +
  guides(fill = "none") +
  facet_wrap(~ DIREC, nrow = 3)

These plots confirm what the last set indicated: most of the institutions in the project dataset are four-year or above, and our dataset contains a little more than half of the four-year or above institutions in the whole directory.


ggplot(data = divided_directory, aes(x = factor(CONTROL), fill = factor(CONTROL))) +
  geom_bar() +
  labs(title = "Operation / Control",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       x = "",
       y = "Count") +
   scale_x_discrete(labels = c(`1` = "Public", 
                              `2` = "Private not-for-profit",
                              `3` = "Private for-profit", 
                              `-3` = "Not available")) +
  coord_flip() +
  project_theme +
  scale_fill_viridis(option = "plasma", discrete = TRUE, begin = 0.4, end = 0.3) +
  guides(fill = "none") +
  facet_wrap(~ DIREC, nrow = 3)

ggplot(data = divided_directory, aes(x = factor(HLOFFER), fill = factor(HLOFFER))) +
  geom_bar() +
  labs(title = "Highest Level of Degree Offering",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       x = "",
       y = "Count",
       fill = "Highest Degree\nOffered") +
   scale_x_discrete(labels = c(`0` = "Other",
`1` = "Postsecondary award, certificate or diploma\nof less than one academic year", 
`2` = "Postsecondary award, certificate or diploma\nof at least one but less than two academic years",
`3` = "Associate's degree",
`4` = "Postsecondary award, certificate or diploma\nof at least two but less than four academic years",
`5` = "Bachelor's degree",
`6` = "Postbaccalaureate certificate", 
`7` = "Master's degree",
`8` = "Post-master's certificate",
`9` = "Doctor's degree",
`b` = "None of the above or no answer",
`-2` = "Not applicable, first-professional only",
`-3` = "Not available")) +
  project_theme +
  scale_fill_viridis(option = "plasma", discrete = TRUE, begin = 0.7, end = 0.9) +
  guides(fill = "none") +
  coord_flip() +
  facet_wrap(~ DIREC, ncol = 3)

hbcu <- ggplot(data = divided_directory, aes(x = factor(HBCU), fill = factor(HBCU))) +
  geom_bar() +
  labs(title = "Historically Black Colleges\nand Universities",
       caption = "Data from The Integrated Postsecondary Education\nData System (IPEDS) Complete data files 2017",
       x = "",
       y = "Count") +
   scale_x_discrete(labels = c(`0` = "Other", `1` = "HBCU", `2` = "Not an HBCU")) +
  project_theme +
  scale_fill_viridis(option = "viridis", discrete = TRUE, begin = 0.4, end = 0.3) +
  guides(fill = "none") +
  facet_wrap(~ DIREC, nrow = 3)

tribal <- ggplot(data = divided_directory, aes(x = factor(TRIBAL), fill = factor(TRIBAL))) +
  geom_bar() +
  labs(title = "Tribal Institutions",
       caption = "Data from The Integrated Postsecondary Education\nData System (IPEDS) Complete data files 2017",
       x = "",
       y = "Count") +
   scale_x_discrete(labels = c(`0` = "Other", `1` = "Tribal", `2` = "Not Tribal")) +
  project_theme +
  scale_fill_viridis(option = "viridis", discrete = TRUE, begin = 0.5, end = 0.6) +
  guides(fill = "none") +
  facet_wrap(~ DIREC, nrow = 3)

gridExtra::grid.arrange(hbcu, tribal, nrow = 1)

ggplot(data = divided_directory, aes(x = factor(C15ENPRF), fill = factor(C15ENPRF))) +
  geom_bar() +
  labs(title = "Enrollment Profile Classification",
       caption = "Data from The Integrated Postsecondary Education Data System (IPEDS)\nComplete data files 2017",
       x = "",
       y = "Count") +
   scale_x_discrete(labels = c(`1` = "Exclusively undergraduate\ntwo-year", 
`2` = "Exclusively undergraduate\nfour-year",
`3` = "Very high undergraduate",
`4` = "High undergraduate",
`5` = "Majority undergraduate",
`6` = "Majority graduate", 
`7` = "Exclusively graduate",
`-2` = "Not applicable, not\nin Carnegie universe\n(not accredited or\nnondegree-granting)")) +
  project_theme +
  scale_fill_viridis(option = "inferno", discrete = TRUE, begin = 0.5, end = 0.55) +
  guides(fill = "none") +
  coord_flip() +
  facet_wrap(~ DIREC, ncol = 3)

In summary, the institutions in the project dataset are mostly four-year, private not-for-profit and public institutions serving majority undergraduate student bodies and offering Bachelor’s, Master’s, and Doctor’s degrees.

Another limitation of this project and these data files is that we are only looking at first-time students in a single year, which excludes information on transfer and continuing students.

Future Potential

As I said, the data I am working with has enough material to make visualizations for at least a year. Just the few variables I have highlighted in these four plots offer the potential for elaboration.

Even as I wrap up this project, there are more visualizations I hope to make and other areas I would like to explore. First, I would like to expand the race/ethnicity column charts to show percentages in addition to counts. I would also like to map the color in the tuition plots to total number of undergraduates instead of only the number of first-years. I would like to dig deeper into the state-of-residence data and further explore the awarding of institutional grants, the student-faculty ratio, and the percentages of students who attend in-state versus out-of-state postsecondary education.

Another data file that I did not download contains information about the libraries and library collections at each institution. I would be very interested to know how library collections differ across postsecondary institutions.

While I have focused on colleges and universities, I would love to do another project on vocational programs which do not get as much attention in the higher education world, but which are still attended by many students.

Finally, my most hopeful future direction is to have data about students at the high school level that also follows them all the way through postsecondary education to compare rural and small town students to their non-rural and small town peers.

Data Work

Home Region Alluvial Plot

Extensive data rearrangement was necessary for the alluvial plot. For each institution (each UNITID), the dataset ef2016c_rv.csv contains many rows, one for each home state (EFCSTATE) with a corresponding column giving the number of first-time degree/certificate-seeking undergraduate students at the institution from that state (EFRES01). The home states are coded with integers. I removed all rows with EFCSTATE values of 58, 89, and 99, which correspond to the total numbers of first-years from the US, the total number from outlying areas, and the total number at the institution. Next, I added a column (STABBR) from the hd2017.csv file containing the state where each institution is. Then, I grouped the dataset by home state and institution state (EFCSTATE and STABBR) and summarized the number of students (EFRES01) before ungrouping. At this point, the dataset consists of a column that holds the state where an institution is, a column for home state, and a column for the number of students from the home state attending an institution in the institution state.
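The filter, join, and summarise steps in this paragraph can be sketched as follows. This is a minimal illustration, not the original wrangling script: the toy rows and the names `ef`, `hd`, and `state_flows` are hypothetical, while `UNITID`, `EFCSTATE`, `EFRES01`, `STABBR`, and the codes 58, 89, and 99 come from the description above.

```r
library(dplyr)

# Toy residence rows: one row per institution and home-state code (hypothetical counts).
ef <- tibble(
  UNITID   = c(1, 1, 1, 2, 2, 3),
  EFCSTATE = c(42, 36, 99, 42, 99, 36),
  EFRES01  = c(10, 5, 15, 7, 7, 4)
)

# Toy directory rows giving the state where each institution is located.
hd <- tibble(UNITID = c(1, 2, 3), STABBR = c("PA", "PA", "NY"))

state_flows <- ef %>%
  filter(!EFCSTATE %in% c(58, 89, 99)) %>%   # drop the three kinds of total rows
  left_join(hd, by = "UNITID") %>%           # attach the institution's state
  group_by(EFCSTATE, STABBR) %>%             # home state x institution state
  summarise(EFRES01 = sum(EFRES01), .groups = "drop")

state_flows
```

Dropping the total rows before summing matters: leaving code 99 in would double-count every student once the rows are aggregated.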

I then added a column that contained a unique integer ID, one for each combination of institution state and home state (and thus one for each row). This ID is used later to match students from each home state to the proper institution state. I created and used a csv file (called efcstate_codes.csv) to match each integer in EFCSTATE to the full name of the state or category (such as ‘State unknown’) based on the legend in the information file accompanying the data. Finally, I used the state_codes dataset from the USAboundaries package (with a few of my own rows appended for special categories) to match the full names to two-letter abbreviations. This ensures that the institution states and the home states are in the same format (two-letter abbreviations), as they end up in the same column once the data is in alluvial form.

At this point, I wrote the dataset into the file states_data.csv which I read in again and manipulated further.

I brought in a dataset called regions.csv (which I created from the Wikipedia page on Census Bureau-designated regions and divisions; see footnote 5) containing each state abbreviation and its corresponding Census Bureau-designated region, and used this to convert the home state and institution state columns into columns of region names. I grouped the dataset by home state (EFCSTATE, which by now was actually a column of region names) and institution state (STABBR; also a column of region names) and summarized the number of students (EFRES01) before ungrouping. At this point, the dataset consists of a column that holds the region where an institution is, a column for home region, and a column for the number of students from the home region attending an institution in the institution region.
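A rough sketch of the state-to-region conversion follows, again on toy data. The lookup column names `state` and `region`, the object names, and the counts are assumptions made for illustration. The actual layout of regions.csv is not shown in this paper.

```r
library(dplyr)

# Toy state-level flows with both columns already as two-letter abbreviations (hypothetical counts).
state_flows <- tibble(
  EFCSTATE = c("NY", "PA", "CA", "NY"),   # home state
  STABBR   = c("PA", "PA", "PA", "CA"),   # institution state
  EFRES01  = c(5, 17, 3, 2)
)

# Toy lookup table; the column names are assumptions.
regions <- tibble(
  state  = c("NY", "PA", "CA"),
  region = c("Northeast", "Northeast", "West")
)

region_flows <- state_flows %>%
  mutate(
    # Overwrite each state column with its region name.
    EFCSTATE = regions$region[match(EFCSTATE, regions$state)],
    STABBR   = regions$region[match(STABBR, regions$state)]
  ) %>%
  group_by(EFCSTATE, STABBR) %>%
  summarise(EFRES01 = sum(EFRES01), .groups = "drop")

region_flows
```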

The final step was to select the columns id, num_students, and STABBR into a new dataset and add a column called WHERE with “College” in every row. Similarly, I selected id, num_students, and EFCSTATE into a second dataset and added a column called WHERE with “Home” in every row. I renamed the columns STABBR and EFCSTATE both to REGION, bound the two datasets together, and converted WHERE to a factor. I then wrote the result to a file called home_state_alluvial.csv.
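The select, rename, and bind step might look something like the sketch below. The toy `region_flows` table and its values are hypothetical, and the factor level order (Home before College) is my assumption rather than something stated above.

```r
library(dplyr)

# Toy region-level flows with the unique id described earlier (hypothetical counts).
region_flows <- tibble(
  id           = 1:3,
  EFCSTATE     = c("Northeast", "Northeast", "West"),   # home region
  STABBR       = c("Northeast", "West", "Northeast"),   # institution region
  num_students = c(22, 2, 3)
)

# One dataset per "side" of the alluvial plot, with the region column renamed to REGION.
college <- region_flows %>%
  select(id, num_students, REGION = STABBR) %>%
  mutate(WHERE = "College")

home <- region_flows %>%
  select(id, num_students, REGION = EFCSTATE) %>%
  mutate(WHERE = "Home")

# Bind the two sides; each id now appears twice, once per value of WHERE.
alluvial_data <- bind_rows(home, college) %>%
  mutate(WHERE = factor(WHERE, levels = c("Home", "College")))

alluvial_data
```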

In the alluvial plot code, I set x = WHERE, stratum = REGION, alluvium = id, fill = REGION, and label = REGION.
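As a hedged sketch of how those aesthetics fit together, the example below uses the ggalluvial package, which I am assuming here because the paper does not name the plotting package. Mapping `y = num_students` is also an assumption (it is not in the list of aesthetics above), and the toy data is hypothetical.

```r
library(ggplot2)
library(ggalluvial)

# Toy data in the long "Home"/"College" form (hypothetical counts).
alluvial_data <- data.frame(
  id           = rep(1:3, times = 2),
  num_students = rep(c(22, 2, 3), times = 2),
  REGION       = c("Northeast", "Northeast", "West",    # home regions
                   "Northeast", "West", "Northeast"),   # college regions
  WHERE        = factor(rep(c("Home", "College"), each = 3),
                        levels = c("Home", "College"))
)

p <- ggplot(alluvial_data,
            aes(x = WHERE, stratum = REGION, alluvium = id,
                y = num_students,                 # assumed: flow width = student count
                fill = REGION, label = REGION)) +
  geom_flow() +                                   # ribbons connecting Home to College
  geom_stratum(alpha = 0.5) +                     # stacked region boxes on each axis
  geom_text(stat = "stratum")                     # region names inside the boxes

p
```

The `alluvium = id` mapping is what ties each "Home" row to its matching "College" row, which is why the unique ID had to be created before the data was split and bound.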

Tuition Scatterplots

The tuition scatterplots were straightforward: I just used the TUITION2, TUITION3, ENRLT, CONTROL, and HBCU variables. I used facets to produce the separate plots.
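A minimal sketch of a faceted tuition scatterplot using these five variables is shown below. Which variable goes on which axis and how the facets are arranged are assumptions made for illustration, and the toy values are hypothetical.

```r
library(ggplot2)

# Toy institutions (hypothetical values; CONTROL and HBCU use integer codes).
toy_tuition <- data.frame(
  TUITION2 = c(9000, 12000, 45000, 30000),
  TUITION3 = c(25000, 30000, 45000, 30000),
  ENRLT    = c(4000, 1500, 400, 900),
  CONTROL  = c(1, 1, 2, 2),
  HBCU     = c(2, 1, 2, 1)
)

p <- ggplot(toy_tuition, aes(x = TUITION2, y = TUITION3, color = ENRLT)) +
  geom_point(alpha = 0.6) +
  # Assumed layout: one row per HBCU code, one column per CONTROL code.
  facet_grid(HBCU ~ CONTROL, labeller = label_both)

p
```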

Faculty Demographics Column Charts

For the column charts of faculty and student race/ethnicity, I selected the UNITID, the staff category (SNHCAT), and the total counts for each race/ethnicity from the file s2017_nh_rv.csv (the new hires file). From the file s2017_is.csv (the current faculty file) I selected UNITID, SISCAT, ARANK, and the total counts for each race/ethnicity. For each plot, I filtered the dataset by a single UNITID and then made four datasets: one for all new hires (SNHCAT == 10000) from the new hires data file, and three from the current faculty file: one each for full professors, associate professors, and assistant professors (ARANK == 1, 2, and 3, respectively). I used tidyr::gather() to reshape each wide dataset into a long one and then added a column (PROFTYPE) to indicate which professorship the dataset held. I then bound the rows together using bind_rows(). After binding, I split the data into two datasets for the Shiny App: one without HRTOTLT (the grand total for each professor type) and one with only this grand total, so that I could exclude the total from the column chart and instead add it as a label geom over top of the plot. (There may be a more straightforward way to do this without further rearranging the datasets.) I faceted by PROFTYPE to get the four separate plots.
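The reshape, bind, and split sequence can be sketched as follows. This example swaps in `tidyr::pivot_longer()`, the newer replacement for `gather()`, and shows only two of the four professor types. The helper `to_long()`, the column names `race_eth` and `count`, and the toy counts are hypothetical. `HRWHITT` and `HRBKAAT` stand in for the full set of race/ethnicity columns and are assumed names.

```r
library(dplyr)
library(tidyr)

# Toy wide rows for a single institution (hypothetical counts).
new_hires <- tibble(UNITID = 1, HRTOTLT = 10, HRWHITT = 6,  HRBKAAT = 4)
full_prof <- tibble(UNITID = 1, HRTOTLT = 20, HRWHITT = 15, HRBKAAT = 5)

# Hypothetical helper: wide to long, tagged with the professor type.
to_long <- function(df, prof_type) {
  df %>%
    pivot_longer(-UNITID, names_to = "race_eth", values_to = "count") %>%
    mutate(PROFTYPE = prof_type)
}

faculty_long <- bind_rows(
  to_long(new_hires, "New hires"),
  to_long(full_prof, "Full professors")
)

# Split so the grand total can be drawn as a label rather than as a column.
totals   <- faculty_long %>% filter(race_eth == "HRTOTLT")
by_group <- faculty_long %>% filter(race_eth != "HRTOTLT")
```

With this split, `by_group` would feed the columns and `totals` would feed a text layer in each PROFTYPE facet.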

Registered Disabilities

To prepare the data for plotting, I factored INSTSIZE and releveled the levels so they would plot in a logical order. Then I plotted INSTSIZE against DISABPCT.

The Shiny App

The Shiny app that accompanies this paper can be found at this link, or used below.


Acknowledgements

The biggest thank you to Amanda Luby for her guidance and feedback, and for teaching in Stat 041! This was a super fun course and I learned so much.

Thank you to Andrew Moe for answering my many questions this summer when I was working on rural and small town research. He is the reason I understand rural and small town labeling as well as I do (especially NACAC’s Rural and Small Town SIG’s and Swarthmore Admissions’ definitions). Also thank you to Christina Danberg and Josh Throckmorton who welcomed me into the Admissions Office as a student and took an interest in my interest in Admissions.

Footnotes

Fun Fact: Clicking on the little arrow at the end of a footnote returns you to the location in the text where the footnote is referenced.

  1. These examples of specific challenges faced by rural students are paraphrased from this page.↩︎

  2. In circles of rural and small town education, one of the better known systems is the National Center for Education Statistics’ (NCES) list of Urban-Centric Locales, which classifies high schools based on how geographically close/far they are to/from various types of urban centers. Alas, these labels suffer from the issues I already mentioned. For example, Killington Mountain School is classified as Distant Rural, or 42, but a quick visit to their website immediately lets you know that these students have great post-secondary-related resources.↩︎

  3. You can download them for yourself here! If multiple versions were available, I always used the revised file (indicated by _rv at the end of the file name). I used the 2016 file for ef<year>c.csv because that survey question is only required in even-numbered years and the 2016 data file was revised in 2018, whereas the 2018 data file is still a preliminary release.↩︎

  4. If you want to nerd out over the survey components, here you go: IPEDS Survey Components. You can even see the exact survey that is sent to each institution!↩︎

  5. The Census Bureau-designated regions and divisions from Wikipedia.↩︎

  6. This information comes from slide 3 of the first module on the Overview of IPEDS Data webpage.↩︎