Loading Data Chunk

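# Packages used throughout this report
library(dplyr)
library(ggplot2)
library(forcats)
library(leaflet)

medill_county <- read.csv("data/medill_counties.csv") |>
  janitor::clean_names() |>
  filter(!(state_abbr %in% c('AK', 'HI'))) |>  # drop Alaska and Hawaii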
  dplyr::select(
    county_fips, county_name, state_abbr, watchlist_2024, acp_category, 
    median_household_income_2022, median_age_2020, 
    percentage_below_poverty_line_2022, broadband_access,
    total
  ) |> 
  mutate(
    nd_or_wl = ifelse(total == 0 | watchlist_2024 == 'Yes', 1, 0)
  ) |> 
  rename(medill_news_outlets = total)

civic <- read.csv("data/civic.csv") |> 
  dplyr::select(fips, news_and_information_index, overall_index) |> 
  rename(county_fips = fips)

universities <- read.csv("data/ccn2023.csv")  |> # treat negative values as missing, coerce FIPS to integer
  mutate(
    enrollment = ifelse(enrollment < 0, NA, enrollment),
    endowment = ifelse(endowment < 0, NA, endowment),
    county_fips = as.integer(county_fips) 
  )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `county_fips = as.integer(county_fips)`.
## Caused by warning:
## ! NAs introduced by coercion
election_county2020 <- read.csv('data/2020_election_county.csv')

Cleaning Data Chunk

medill_county <- medill_county |> left_join(civic, by = 'county_fips')

# cleaning universities
universities <- universities |> mutate(
  news_partnership = ifelse(is.na(news_partnership), 'None or No Data', news_partnership))

universities$news_partnership <- factor(
  universities$news_partnership, 
  levels = c("None or No Data", 'Exploring','News Partnership'))

universities$instsize2 <- factor(
  universities$instsize2, 
  levels = c("Very Small", "Small", "Medium", "Large", "Very Large"))

# combining prez results and medill
medill_county_prez <- medill_county |> 
  left_join(election_county2020 |>  
              dplyr::select(county_fips, percent_votes_democrat_2020, percent_votes_republican_2020, 
                     margin_of_victory, county_winner), 
            by = 'county_fips'
  ) |> 
  mutate(
    in_need_of_news = ifelse(medill_news_outlets == 0 | watchlist_2024 == 'Yes', 'In Need of a News Outlet', 'Not In Need')
  )

counties_in_need <- medill_county_prez |> 
  filter(in_need_of_news == 'In Need of a News Outlet')

# combining universities with medill_county_prez and filtering to get the ones in need
impact_universities <- universities |> 
  inner_join(medill_county_prez, by = "county_fips") |> 
  filter(adj_county_min == 0 | medill_news_outlets == 0 | watchlist_2024 == 'Yes')

universities_counties <- universities |> 
  inner_join(medill_county_prez, by = "county_fips")

Introduction

Research Question/Question of Interest

What areas most need a news outlet, and what schools are in the best position to help these areas?

What is a News Desert?

The rapid decline of local newspapers across the U.S. has led to the emergence of what are now widely referred to as news deserts. News deserts are defined as counties with one or zero local newspapers; note that the code in this report flags a county as a news desert only when it has zero recorded outlets. As traditional news outlets disappear, communities lose access to trusted, local sources of information, which weakens civic engagement and democratic participation.

According to the State of Local News report from Northwestern University’s Medill School of Journalism, more than 55 million people in the U.S. live in areas with minimal or no access to local journalism. These deserts are most commonly found in high-poverty, rural areas, especially in the South and Midwest.

However, the core question behind the Medill study remains largely unanswered: Why are these counties news deserts? What makes them different from other counties?

Medill identifies counties that are at risk of becoming news deserts in the State of Local News report and refers to them as Watchlist Counties. Listed below are a few watchlist counties from Medill’s 2024 report. These counties come from data provided by Medill’s News Deserts project, which is used extensively in this report.

medill_county_wl <- medill_county |> filter(watchlist_2024 == 'Yes')

medill_county_wl[sample(nrow(medill_county_wl), 6),]
##     county_fips    county_name state_abbr watchlist_2024           acp_category
## 266       51580 Covington city         VA            Yes       Evangelical Hubs
## 252       48405  San Augustine         TX            Yes  Working Class Country
## 72        13253       Seminole         GA            Yes African American South
## 12         4012         La Paz         AZ            Yes        Graying America
## 253       48437        Swisher         TX            Yes       Hispanic Centers
## 88        17151           Pope         IL            Yes        Graying America
##     median_household_income_2022 median_age_2020
## 266                        45737            42.1
## 252                        45888            48.8
## 72                         46063            45.1
## 12                         46634            57.4
## 253                        40290            37.0
## 88                         57582            56.8
##     percentage_below_poverty_line_2022 broadband_access medill_news_outlets
## 266                               16.9            97.44                   1
## 252                               27.3            19.29                   1
## 72                                21.0            44.89                   1
## 12                                21.0             0.93                   1
## 253                               29.5            93.23                   1
## 88                                18.7            46.24                   1
##     nd_or_wl news_and_information_index overall_index
## 266        1                         35            44
## 252        1                         17             1
## 72         1                         22            14
## 12         1                         19             7
## 253        1                          5             3
## 88         1                         27            40

How Can Universities Help?

Universities, particularly those with journalism programs, are uniquely positioned to help close the growing gap in local news coverage. A partnership can meet both students’ need for journalism experience and a community’s need for local news, but establishing one is a challenge many universities have not overcome.

# making labels for the chart
label_data <- universities |>
  group_by(news_partnership) |>
  summarise(count = n(), .groups = 'drop')

# make a chart showing number of schools with partnerships vs not partnerships
ggplot(
  data = universities,
  mapping = aes(
    x = news_partnership,
  )
) + 
  geom_bar( fill = '#154734') + 
  geom_text(data = label_data, aes(x = news_partnership, y = count, label = count), 
            vjust = -0.3, color = "black", size = 4) +
  theme_classic() + 
  labs(title = "The Staggering Number of Universities Without a News-Academic Partnership",
       caption = "Source: UVM CCN 2023 Data\nBy: Joey Gilmartin", 
       x = 'News-Academic Partnership Status',
       y = 'Number of Institutions'
  ) +
  theme(
    panel.grid.major.y = element_line(color = "gray", linewidth = 0.25, linetype = "dashed")
  ) +
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0)
  )

The growth of news-academic partnerships has been tracked throughout the nation. Last year student reporters published over 12,000 stories, reaching an estimated 25 million readers. These programs not only serve public interest but also give students real-world reporting experience and help universities achieve their public service missions.

Despite this potential, many schools, even large, well-funded institutions, do not currently contribute to local news coverage. For instance, the University of Notre Dame, despite its size and resources, does not have a community news partnership in place. This raises a key question: What prevents some schools from forming a news-academic partnership?

An Introduction to the Data

To begin answering these questions, we compiled and joined several key datasets:

  1. Medill News Deserts: This dataset from the Medill State of Local News Report provides an overview of local news across U.S. counties. It identifies “news deserts,” or areas where residents have limited or no access to credible and comprehensive local news sources. It includes details such as the number of local newspapers, ownership type, and publishing details like frequency. This data is essential for understanding the geography of local news decline and its potential impact on communities. The columns that we are most interested in can be seen below.
head(medill_county)
##   county_fips county_name state_abbr watchlist_2024           acp_category
## 1        1001     Autauga         AL             No  Working Class Country
## 2        1003     Baldwin         AL             No   Rural Middle America
## 3        1005     Barbour         AL             No African American South
## 4        1007        Bibb         AL             No       Evangelical Hubs
## 5        1009      Blount         AL             No       Evangelical Hubs
## 6        1011     Bullock         AL            Yes African American South
##   median_household_income_2022 median_age_2020
## 1                        68315            38.6
## 2                        71039            43.2
## 3                        39712            40.1
## 4                        50669            39.9
## 5                        57440            41.0
## 6                        36136            39.7
##   percentage_below_poverty_line_2022 broadband_access medill_news_outlets
## 1                               11.4            93.64                   0
## 2                               10.2            74.19                   2
## 3                               24.2            59.21                   2
## 4                               20.6            14.46                   1
## 5                               14.2            33.91                   1
## 6                               27.9            78.10                   1
##   nd_or_wl news_and_information_index overall_index
## 1        1                         18            53
## 2        0                         86            66
## 3        0                         18            11
## 4        0                         20            13
## 5        0                         29            37
## 6        1                          1             6
  2. CCN News-Academic Partners: The UVM Center for Community News (CCN) keeps data on all universities in the United States. This dataset lists these universities along with other important features of each school, such as institution endowment, enrollment, and CBSA type (urban/rural), as well as some information about the surrounding counties’ news ecosystems. Most importantly, there is a column indicating whether the university has established, is exploring, or does not have a news partnership. A sample of schools can be seen below.
universities[sample(nrow(universities), 6), c(1, 2, 4, 5)]
##      unitid                           instnm                        addr
## 1448 136516               Polk State College             999 Avenue H NE
## 2371 158431 Bossier Parish Community College             6220 East Texas
## 4890 215682       Rosedale Technical College          215 Beecham  Drive
## 4332 389860      Mid-EastCTC-Adult Education             400 Richards Rd
## 6023 233684      Strayer University-Virginia          2121 15th Street N
## 1408 480851               Miami Media School 7955 NW 12 Street Suite 119
##              city
## 1448 Winter Haven
## 2371 Bossier City
## 4890   Pittsburgh
## 4332   Zanesville
## 6023    Arlington
## 1408        Doral
  3. Civic Information Index: This index is a county-level measure that reflects both access to civic information and the overall level of civic engagement. More information is available on the Civic Information Index website. A sample of the dataset can be seen here.
civic[sample(nrow(civic), 6),]
##      county_fips news_and_information_index overall_index
## 1593       29007                         53            37
## 1805       26085                         35             6
## 1869       24035                         58            82
## 1260       36029                         75            64
## 2790        8065                         42            49
## 1464       30041                         33            38
  4. 2020 Election Results Data: This data was taken from the MIT Election Data and Science Lab’s county-level presidential returns on the Harvard Dataverse: https://dataverse.Harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ and then reformatted to match the structure of the rest of our data. Below you will see six random counties and their election results.
election_county2020[sample(nrow(election_county2020), 6),]
##      year state_po county_name county_fips percent_votes_democrat_2020
## 411  2020       GA    BLECKLEY       13023                        0.23
## 2810 2020       TN     JOHNSON       47091                        0.16
## 1071 2020       KY     LETCHER       21133                        0.20
## 1194 2020       ME    KENNEBEC       23011                        0.48
## 2429 2020       MT     SANDERS       30089                        0.24
## 2293 2020       WV    BERKELEY       54003                        0.33
##      percent_votes_republican_2020 margin_of_victory county_winner
## 411                           0.77             -0.54    Republican
## 2810                          0.83             -0.67    Republican
## 1071                          0.79             -0.59    Republican
## 1194                          0.48              0.00    Republican
## 2429                          0.74             -0.50    Republican
## 2293                          0.65             -0.32    Republican
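
The sample rows above suggest that `margin_of_victory` is the Democratic vote share minus the Republican vote share, so negative values mark Republican-won counties. A minimal sketch of that reading, using three of the rows printed above; the formula is an assumption inferred from the output, not something the source file documents, and the short column names `dem` and `rep` are hypothetical:

```r
# Three rows copied from the sample output above
sample_rows <- data.frame(
  county_name = c("BLECKLEY", "JOHNSON", "BERKELEY"),
  dem = c(0.23, 0.16, 0.33),
  rep = c(0.77, 0.83, 0.65)
)

# Assumed definition: Democratic share minus Republican share
sample_rows$margin <- round(sample_rows$dem - sample_rows$rep, 2)
sample_rows
```

The computed margins match the `margin_of_victory` values printed for these three counties.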

Final Merged Set

For the most part, these datasets can be combined into a single, unified dataset while maintaining interpretability. The Medill News Deserts, Civic Information Index, and 2020 Election Results datasets are all recorded at the county level and can be joined using a unique geographic identifier known as the FIPS code.

To incorporate university partnerships from the CCN News–Academic Partners dataset, we linked schools to the counties in which they are located. This allows us to match civic and news ecosystem characteristics at the county level with the presence (or absence) of academic news partnerships. The resulting dataset enables us to explore how civic health, local news infrastructure, and educational institutions interact across the U.S.
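
A join on FIPS codes only matches when both tables store the code the same way. Read as integers, codes lose their leading zero (Autauga County appears as 1001 in the output above), while other files may keep zero-padded strings such as "01001". The sketch below shows one way to align the two before joining. The school names are placeholders and `normalize_fips` is a hypothetical helper, not part of the report's code:

```r
library(dplyr)

# Hypothetical helper: coerce any FIPS representation to an integer key
normalize_fips <- function(x) as.integer(as.character(x))

# County rows taken from the head(medill_county) output above
counties <- data.frame(
  county_fips = c(1001L, 1003L, 1011L),
  county_name = c("Autauga", "Baldwin", "Bullock")
)

# Toy school table whose FIPS codes were read as zero-padded strings
schools <- data.frame(
  instnm = c("School A", "School B"),
  county_fips = c("01001", "01011")
)

# Align the key type, then keep only schools that match a county
linked <- schools |>
  mutate(county_fips = normalize_fips(county_fips)) |>
  inner_join(counties, by = "county_fips")

linked
```

Because `inner_join()` drops unmatched rows silently, comparing `nrow(linked)` with `nrow(schools)` is a quick check for schools lost to missing or malformed codes.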

Exploring the Data

To better understand where student journalism partnerships are most needed and where they are most likely to succeed, we conducted exploratory data analysis across a range of county-level and institution-level data. This included the Civic Information Index scores, socioeconomic indicators like poverty, the number of local news outlets, and characteristics of colleges such as enrollment size, endowment, and location type (urban vs rural).

One of the clearest trends emerged in relation to university size. As expected, the number of universities with a news-academic partnership rises with institution size. However, what stood out was the presence of smaller schools exploring news-academic partnerships.

This trend suggests that while larger institutions may have more resources and infrastructure to support established partnerships, smaller schools are still actively seeking ways to engage with their local news ecosystems, highlighting the potential impact of these schools.

universities |> filter(news_partnership != 'None or No Data') |> 
  ggplot(
    mapping = aes(
      x = instsize2,
      fill =  news_partnership
    )
  ) + 
  geom_bar(position = 'dodge2') +
  labs(title = "News Partnerships are Strongest at Larger Universities, \nBut Smaller Schools Are Exploring",
       subtitle = 'Including Only Colleges With or Exploring a Student News Partnership',
       caption = "Source: UVM CCN 2023 Data\nBy: Joey Gilmartin", 
       x = 'Institution Size',
       y = 'Number of Institutions',
       fill = "News-Academic\nPartnership Status"
  ) + 
  theme_classic() + 
  theme(
    panel.grid.major.y = element_line(color = "gray", linewidth = 0.25, linetype = "dashed")
  ) +
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0)
  ) +
  scale_fill_manual(
    labels = c("Exploring" = "Exploring a Partnership", "News Partnership" = "Has Partnership"),
    values = c("#FFD100", "#154734"))

Ultimately, the broad analysis suggests that the presence or absence of student news partnerships is shaped by a complex mix of the institutional and county-level factors we examined. In many cases, the data was crowded, making it difficult to identify a clear and consistent trend. This limitation led us to narrow our focus to schools that have the largest opportunity for impact - what we refer to as ‘Impact Schools’ - and counties that are in need of local news - what we refer to as ‘Counties in Need’.

Narrowing the Focus: “Impact Schools” and “Counties In Need”

While early data exploration included all universities, our work has increasingly focused on a subset of universities we refer to as Impact Schools, which we defined as universities that have one or more of the following characteristics:

  - Located in a county that is classified as a news desert
  - Located in a county adjacent to a news desert
  - Located in a county that is on Medill’s 2024 Watchlist
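
The three criteria can be sketched as a single logical test. The rows below are toy data, and `adj_county_min` is assumed to hold the smallest outlet count among adjacent counties, which is how the filter earlier in this report appears to use it:

```r
# Toy rows; column names mirror those used in the report's filter
schools <- data.frame(
  instnm = c("School A", "School B", "School C", "School D"),
  medill_news_outlets = c(0, 3, 2, 5),
  adj_county_min = c(1, 0, 2, 1),
  watchlist_2024 = c("No", "No", "Yes", "No")
)

# A school qualifies if any one of the three criteria holds
schools$impact_uni <- ifelse(
  schools$medill_news_outlets == 0 |   # located in a news desert
    schools$adj_county_min == 0 |      # adjacent to a news desert (assumed meaning)
    schools$watchlist_2024 == "Yes",   # county is on the 2024 watchlist
  "Potential for Impact", "Other"
)

schools[, c("instnm", "impact_uni")]
```

Schools A, B, and C each meet exactly one criterion and are flagged; School D meets none.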

# Flag each school in universities_counties as "Potential for Impact" if it
# appears in impact_universities, and as "Other" if it does not
universities_counties <- universities_counties %>%
  mutate(impact_uni = if_else(instnm %in% impact_universities$instnm,
                              "Potential for Impact", "Other"))
# Count schools by impact group and partnership status, then convert the
# counts to within-group proportions for a 100% stacked bar chart
df_plot <- universities_counties %>%
  count(impact_uni, news_partnership) %>%
  group_by(impact_uni) %>%
  mutate(proportion = n / sum(n))

impact_totals <- df_plot %>%
  group_by(impact_uni) %>%
  summarise(total = sum(n), .groups = "drop")


ggplot(df_plot, 
       aes(
         x = impact_uni,
         y = proportion, 
         fill = news_partnership)
  ) +
  geom_bar(stat = "identity", 
           position = "fill", 
           width = 0.7
  ) + 
  geom_text(data = impact_totals, 
            aes(
              x = impact_uni,
              y = 1.05,
              label = paste0("n = ", total)), 
              inherit.aes = FALSE, size = 4
  ) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "University Type",
       y = "Proportion",
       fill = "News Partnership",
       title = "Proportion of News Partnerships by Impact Category"
       ) +
  theme_minimal() + 
  scale_fill_manual(
    values = c("None or No Data" = "tomato1", 
               "Exploring" = "royalblue1", 
               "News Partnership" = "springgreen4")
  )

The chart above shows the proportion of universities with different levels of news-academic partnerships, separated by whether they are classified as an Impact University. The distribution of partnership types — None or No Data, Exploring, and News Partnership — is similar between the two groups. This suggests that being an Impact University doesn’t make a school more or less likely to have a news partnership, which means the patterns we see in these schools probably reflect what’s happening across all universities.
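
One way to check whether the two distributions really are similar is a chi-squared test of independence. A minimal sketch follows. The counts are invented placeholders for illustration only, not figures from this report, so the printed p-value says nothing about the real data. With the report's own objects, the table could instead be built with `table(universities_counties$impact_uni, universities_counties$news_partnership)`.

```r
# Hypothetical counts, invented for illustration only (not the report's data)
counts <- matrix(
  c(900, 60, 140,     # Potential for Impact
    1800, 110, 290),  # Other
  nrow = 2, byrow = TRUE,
  dimnames = list(
    impact_uni = c("Potential for Impact", "Other"),
    news_partnership = c("None or No Data", "Exploring", "News Partnership")
  )
)

# Row-wise proportions: the quantities a 100% stacked bar chart displays
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)

# Chi-squared test of independence between impact group and partnership status
test <- chisq.test(counts)
test$p.value
```

A large p-value would be consistent with the visual impression that the two groups share the same mix of partnership statuses; a small one would suggest the similarity is only approximate.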

In addition to narrowing our focus to these schools, we limited our county research to counties that were considered news deserts or were on the Medill News Desert Watchlist. These counties were labeled “counties in need”. We focused on places where access to local news is weakest, with the idea that if we identified trends in the most at-risk counties and the schools located near them, those insights could be applied to all schools and counties.

While focusing on these counties, we noticed an important trend in the ACP Category, a classification system that groups U.S. counties based on shared cultural, demographic, and geographic characteristics. Some ACP categories, such as African American South, Evangelical Hubs, and Graying America, appeared far more often among the in-need counties than other categories. After analyzing these counties, we concluded that they are typically more rural, older, and economically disadvantaged. Categories that are urban, younger, and have more economic opportunity, such as Urban Burbs and College Towns, appeared less frequently among the in-need counties.
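
Raw counts can overstate a category simply because it contains many counties, so it helps to look at the share of each category that is in need as well. A minimal sketch using only the six rows printed by `head(medill_county)` earlier in this report; six rows are far too few to draw conclusions from and serve only to show the calculation:

```r
library(dplyr)

# The six rows printed by head(medill_county) earlier in this report
counties <- data.frame(
  acp_category = c("Working Class Country", "Rural Middle America",
                   "African American South", "Evangelical Hubs",
                   "Evangelical Hubs", "African American South"),
  nd_or_wl = c(1, 0, 0, 0, 0, 1)
)

# Count and share of in-need counties within each ACP category
in_need_rates <- counties |>
  group_by(acp_category) |>
  summarise(
    n_counties = n(),
    n_in_need = sum(nd_or_wl),
    share_in_need = mean(nd_or_wl),
    .groups = "drop"
  ) |>
  arrange(desc(share_in_need))

in_need_rates
```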

# bar chart of counties in need (news deserts or watchlist) by ACP category
medill_county_prez |>  filter(nd_or_wl == 1) |> 
  ggplot(
    mapping = aes(
      y = fct_infreq(acp_category),
    )
  ) + geom_bar( fill = '#154734') +
  labs(
    title = 'Geography of News Deserts by Community Type',
    y = 'ACP Category',
    x = 'Number of News Deserts'
  ) +
  theme_bw() + scale_x_continuous(expand = c(0, 0, 0, 0))


This pattern suggests that news deserts are not evenly distributed across the United States. There may be clusters in certain regions that are not identifiable in a static, generalizing graph like the ones shown above.

Additionally, at this point during the research process, we were introduced to a recent study from Medill titled “Trump Wins News Deserts in a Landslide”. The article explores the political outcomes of news deserts, showing that the counties with limited access to local news favored Donald Trump in the 2020 election. This insight inspired us to incorporate presidential election data into our analysis.

Together, these two factors directly influenced the next steps of this analysis: building interactive maps.

Exploring Areas in Need - Interactive Maps

To help visualize these schools, we built interactive maps (using the Leaflet package in R) that allow users to explore the geographic relationships between schools, counties, and news desert classifications. The maps also allowed further exploration of schools that share characteristics such as location, American Communities Project (ACP) Category, and proximity to other schools. We explored variables like election results, county civic health, and university size to understand what characteristics make a school more likely to host a news partnership.

Community Type in News Deserts Map

# Load the US counties GeoJSON file (hosted as a GitHub gist)
geojson <- sf::read_sf("https://gist.githubusercontent.com/sdwfrost/d1c73f91dd9d175998ed166eb216994a/raw/e89c35f308cee7e2e5a784e1d3afc5d449e9e4bb/counties.geojson") |> janitor::clean_names()

# Merge the county data with GeoJSON properties
geojson2 <- geojson |> 
  mutate(geoid = as.integer(geoid)) |> 
  dplyr::select(geoid, name, aland, awater, geometry) |>
  left_join(medill_county_prez, by = c("geoid" = "county_fips"))

geojson_refined <- geojson |> 
  mutate(geoid = as.integer(geoid)) |> 
  dplyr::select(geoid, name, aland, awater, geometry) |>
  left_join(medill_county_prez, by = c("geoid" = "county_fips")) |>
  filter(in_need_of_news == 'In Need of a News Outlet')

county_labels <- sprintf(
  "<strong>%s, %s</strong><br/>ACP Category: %s",
  geojson_refined$name, 
  geojson_refined$state_abbr, 
  geojson_refined$acp_category
) |>  lapply(htmltools::HTML)

school_labels <- sprintf(
  "<strong>%s</strong><br/>Partnership Status: %s<br/>Endowment: %s<br/>Enrollment: %s",
  impact_universities$instnm, 
  impact_universities$news_partnership, 
  sapply(impact_universities$endowment, scales::label_dollar(scale = 1, prefix = "$")), 
  formatC(impact_universities$enrollment, big.mark = ",", format = 'd')
) |>  lapply(htmltools::HTML)

custom_pal <- c(
  "#FF6347", # Tomato (warm red-orange)
  "#FFD700", # Gold (bright yellow)
  "#32CD32", # Lime Green (vibrant green)
  "#8A2BE2", # Blue Violet (purplish-blue)
  "#FF8C00", # Dark Orange (orange)
  "#8B4513", # SaddleBrown (rich brown)
  "#ADFF2F", # Green Yellow (yellowish-green)
  "#FF1493", # Deep Pink (vivid pink)
  "#00CED1", # Dark Turquoise (teal blue)
  "#D2691E", # Chocolate (brownish-red)
  "#B8860B", # Dark Goldenrod (golden yellow-brown)
  "#6A5ACD", # Slate Blue (bluish purple)
  "#FF4500", # Orange Red (bright orange-red)
  "#2E8B57", # Sea Green (dark green)
  "#FF00FF"  # Magenta (vivid purple-pink)
)

# Set up color scale using pals
pal <- colorFactor(
  custom_pal,
  geojson_refined$acp_category
)
pal2 <- colorFactor(c("red", 'blue', 'darkgreen'), impact_universities$news_partnership)


map1 <- leaflet() |> 
  setView(-96, 37.8, 4) |> 
  addTiles() |> 
  addPolygons(
    data = geojson_refined,
    stroke = FALSE, 
    smoothFactor = 0.3, 
    fillColor = ~pal(acp_category),
    fillOpacity = 0.8,
    label = county_labels,
    group = ~acp_category
  ) |> 
  
  # Add circles for colleges
  addCircleMarkers(data = impact_universities,
             lat = ~latitude,
             lng = ~longitud,
             stroke = TRUE, # set up black /dark grey stroke color
             weight = 2,
             color = 'black',
             radius = ~sqrt(endowment) / 4000,
             fillColor = ~pal2(news_partnership),
             fillOpacity = 1,
             label = school_labels
  ) |>
  
  # Add a legend for the choropleth
  addLegend("bottomright", 
            pal = pal, 
            values = geojson_refined$acp_category,
            title = "ACP Category", opacity = 1
  ) |> 
  addLegend("bottomleft", 
            pal = pal2, 
            values = universities$news_partnership,
            title = NULL, opacity = 1) |> 
  addLayersControl(
    # creating groups
    overlayGroups = unique(geojson_refined$acp_category),
    options = layersControlOptions(collapsed = FALSE),
    position = 'topleft'
  )

map1

This map focuses only on counties identified as “in need of a news outlet” as defined in the previous section. Each county is shaded by its ACP Category to visualize areas with similar demographics. Users can toggle individual ACP Categories on and off to isolate patterns within certain categories of interest.

Layered on top of these colored zones are universities, marked by circles whose size represents endowment (larger circles = larger endowment, wealthier school) and whose color indicates news partnership status. This layering captures both topics of interest at the same time, which many previous graphs have not allowed us to do.
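
The map code sets each circle's radius to `sqrt(endowment) / 4000`. Taking the square root makes circle area, rather than radius, proportional to endowment, which keeps very wealthy schools from swamping the map. A small worked sketch; the endowment figures are illustrative round numbers, not values from the dataset, and `endowment_to_radius` is a hypothetical wrapper around the same formula:

```r
# Same scaling as the `radius` argument in the map code
endowment_to_radius <- function(endowment) sqrt(endowment) / 4000

# Illustrative endowments in dollars (not taken from the dataset)
endowments <- c(1e8, 4e8, 1.6e9)
radii <- endowment_to_radius(endowments)
radii
# 2.5  5.0 10.0

# Area scales with radius squared, so area ratios equal endowment ratios
area_ratio <- radii^2 / radii[1]^2
area_ratio
# 1  4 16
```

A school with 16 times the endowment gets a circle with 4 times the radius and 16 times the area. Schools with a missing endowment produce an `NA` radius under this formula.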

We initially hoped to find clear trends within specific ACP Categories - for example, a category whose schools tended to have partnerships. No clear pattern was found on a large scale. However, we gained a lot of insight from exploring local examples: investigating a handful of schools with similar characteristics but different partnership statuses. The case studies below highlight some of these smaller explorations.

Case Study 1: Unexpected Outliers

Before diving into more granular detail, it’s worth addressing the two major outliers that stand out immediately. As expected, most universities with large endowments either have a news-academic partnership or are exploring one. However, two exceptions stand out: Dartmouth College and the University of Virginia. Despite their size, wealth, and overall high status among institutions, neither has a documented news partnership. This was unexpected. While we took note of it and tried to identify a reason, there was no explanation in the data as to why these schools do not have a news-academic partnership.

Map 1 Case Study 2: Northern Utah

map1.2 <- map1 |> 
  setView(lng = -111.9, lat = 41.2, zoom = 7) 
  
map1.2

One of the more interesting local comparisons on the map occurs in northern Utah, where Weber State University and Utah State University are located in neighboring counties. Both schools have large enrollments of around 30,000 students and are located in the same type of area. However, only Utah State has an established news partnership, while Weber State is currently only exploring one. So what makes them different?

At first glance, Weber State might seem like the more likely candidate. It’s located closer to a major city (Salt Lake City) while being similar in size. The only difference visible on the map is Utah State’s slightly higher endowment.

We conducted research to explore factors not captured in our dataset, such as student-faculty ratio and the presence of a journalism program. Few meaningful differences were found in this research. Weber State University has a student-faculty ratio of 21:1, only slightly higher than Utah State’s 19:1 ratio. Both schools also have strong journalism programs, which could potentially be a factor in news partnership status.

Due to time and data constraints, we couldn’t fully integrate these factors into our already large dataset. Still, this comparison highlights a common trend in our analyses: institution size and resources are the most consistent predictors of partnership status.

Political Geography of News Deserts

This map shifts from ACP category to the relationship between political outcomes and access to local news. Counties are shaded according to their 2020 presidential election margin of victory, with a deeper red or blue indicating stronger support for the Republican or Democratic candidate. The user can toggle counties on and off to display all counties or just those in need of local news. Overlaid on this map is the same set of universities from Map 1, sized by endowment and colored by news partnership status.
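
In the map code that follows, `colorNumeric()` is given the observed margins as its domain, so white falls at the midpoint of the observed range, which need not be zero. The sketch below shows one alternative, assuming a fixed symmetric domain of -1 to 1 so that white corresponds to a tied county. `political_pal_sym` is a hypothetical name, and this is not the palette used for the map in this report:

```r
library(leaflet)

# Assumed symmetric domain: margins are vote-share differences in [-1, 1]
political_pal_sym <- colorNumeric(
  palette = c("red", "white", "blue"),
  domain = c(-1, 1)
)

# Two margins from the sample election rows shown earlier, plus a tie
margins <- c(-0.54, 0, 0.32)
cols <- political_pal_sym(margins)
cols
```

The trade-off is contrast: with a fixed domain, counties with modest margins look paler than they would on a scale stretched to the observed range.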

# Merge the county data with GeoJSON properties
geojson2 <- geojson |> 
  mutate(geoid = as.integer(geoid)) |> 
  dplyr::select(geoid, name, aland, awater, geometry) |>
  left_join(medill_county_prez, by = c("geoid" = "county_fips")) |> 
  filter(!is.na(margin_of_victory))

# using counties_in_need df, make a column in geojson2 that has a 1 if 
# the county is in need, and a 0 if not, so we can use grouping

geojson_refined <- geojson |> 
  mutate(geoid = as.integer(geoid)) |> 
  dplyr::select(geoid, name, aland, awater, geometry) |>
  left_join(counties_in_need, by = c("geoid" = "county_fips"))

county_prez_labels <- sprintf(
  "<strong>%s, %s</strong><br/>2020 Presidential Election Result: %s <br/>Percentage of Votes: %.0f%%<br/>News Outlets in County: %s",
  geojson2$name,
  geojson2$state_abbr,
  geojson2$county_winner,
  round(ifelse(geojson2$county_winner == 'Democrat', 
         geojson2$percent_votes_democrat_2020,
         geojson2$percent_votes_republican_2020) * 100, 1),
  formatC(geojson2$medill_news_outlets, big.mark = ",", format = 'd')
) |> lapply(htmltools::HTML)

school_label2 <- sprintf(
  "<strong>%s</strong><br/>Endowment: $%s<br/>Enrollment: %s",
  impact_universities$instnm, 
  formatC(impact_universities$endowment, big.mark = ",", format = 'd'), 
  formatC(impact_universities$enrollment, big.mark = ",", format = 'd')
) |> lapply(htmltools::HTML)
## Warning in storage.mode(x) <- "integer": NAs introduced by coercion to integer
## range
school_pal2 <- colorFactor(c("#FF4500", '#FFD700', 'darkgreen'), impact_universities$news_partnership)
geojson_politic <- geojson2 |> filter(!is.na(margin_of_victory))

# setting up political scale
political_pal <- colorNumeric(palette = c("red", "white", "blue"), geojson_politic$margin_of_victory)
#already done so leaving it commented
#pal2 <- colorFactor(c("darkgreen", 'blue', 'red'), impact_universities$news_partnership)


map2 <- leaflet() |> 
  setView(-96, 37.8, 4) |> 
  addTiles() |> 
  addPolygons(
    data = geojson_politic,
    stroke = FALSE, 
    smoothFactor = 0.3, 
    fillColor = ~political_pal(margin_of_victory),
    fillOpacity = 0.7,
    label = county_prez_labels,
    group = ~in_need_of_news
  ) |>
  # Add circles for colleges
  addCircleMarkers(data = impact_universities,
             lat = ~latitude,
             lng = ~longitud,
             stroke = TRUE, # set up black /dark grey stroke color
             weight = 2,
             color = 'black',
             radius = ~sqrt(endowment) / 4000,
             fillColor = ~school_pal2(news_partnership),
             fillOpacity = 1,
             label = school_label2
  ) |>
  # Add a legend for the choropleth
  addLegend("bottomright", 
            pal = political_pal, 
            values = geojson_politic$margin_of_victory,
            title = "2020 Presidential Results", opacity = 1
  ) |> 
  addLegend("bottomleft", 
            pal = school_pal2, 
            values = impact_universities$news_partnership,
            title = NULL, opacity = 1) |> 
  addLayersControl(
    overlayGroups = unique(geojson2$in_need_of_news),
    options = layersControlOptions(collapsed = FALSE),
    position = 'topright'
  )


map2
#saveWidget(map2, 'map2.html')

At a national level, a few trends stand out. Many of the in-need counties appear in shades of red, consistent with the finding in Medill’s report that news deserts are more common in Republican-leaning areas.

As we zoomed in on the map, we were asking the question: Since more news deserts vote red, are schools in red counties less likely to have a news-academic partnership?

What we found is that in some cases, yes: schools in Republican counties are less likely to have a news-academic partnership. But as with most other analyses in this report, political lean doesn’t explain everything. The following two case studies illustrate one scenario where the hypothesis fits and one where it does not.

Map 2 Case Study 1: Penn State and Neighboring Institutions

map2.1 <- map2 |>  
  setView(lng = -77.86, lat = 40.80, zoom = 7) 
map2.1

Pennsylvania State University is a textbook case of what is expected: a large, well-funded university, located in a Democratic-leaning county, with a fully established news-partnership. The counties surrounding Penn State tend to be Republican-leaning. In these surrounding counties, all the schools lack a news-academic partnership.

That said, while this case supports the hypothesis that institutions are less likely to have a news-academic partnership in Republican counties, it doesn’t confirm it as a rule. The institutions neighboring Penn State are much smaller than Penn State, so other variables may be driving this trend.

Map 2 Case Study 2: Southern Ohio

map2.2 <- map2 |> 
  setView(lng = -84.74, lat = 39.51, zoom = 7) 
  
map2.2
#saveWidget(map2, 'map2.html')

In southern Ohio, the data does not support the hypothesis as cleanly as it did in the previous example. Miami University–Oxford (the large green circle) is located in a Republican county and has a news partnership. Slightly south, the University of Cincinnati (the large yellow circle) and Xavier University (red) are located in a Democratic-leaning county. Neither school has an established partnership.

This contradiction is especially surprising given that UC is larger than Miami University and located closer to a major city. If political context alone were a strong predictor of news partnership status, the pattern observed would be reversed.

This case reinforces the idea that while political context may offer some indication of whether a school has a news-academic partnership, it is not a reliable indicator on its own.

Interactive Maps Conclusion

While ACP categories helped surface some compelling local contrasts, they didn’t provide a reliable way to predict partnerships nationwide. Instead, they served as a starting point, revealing where more detailed analysis was necessary.

After exploring geographic and political patterns, we found that local context matters — but not enough to generalize confidently across the country.

The interactive maps helped uncover local patterns, but still couldn’t reveal a single, consistent trend that can explain partnership presence or absence across the whole country. Still, they were crucial for identifying potential predictors. Ultimately, the limitations of data visualization, especially when dealing with a dataset as large as ours, led us to a more systematic approach to answering the research questions. To better understand the relationship between institutions and news deserts, we turn to machine learning.

Model Building

After exploring patterns in the data through static visualizations and maps, the next step was to build a predictive model to help identify universities that are most likely to form news-academic partnerships. The model has two goals: to predict partnership status, and to use feature importance (a by-product of fitting the model) to better understand the factors associated with schools forming a partnership. To do this, we used machine learning, specifically a method called Random Forests.

Random Forests and Machine Learning

A Single Decision Tree

A random forest is an extension of decision trees - a basic machine learning method that splits data into different groups based on certain characteristics.

For example, a decision tree might begin by evaluating whether a university is classified as large or small. From there, it may consider other characteristics, such as the poverty rate in the surrounding county or the number of local newspapers available. At each step (or split), the tree selects the variable and corresponding threshold that results in the greatest improvement in classification purity — that is, the split that best separates the data into distinct categories like “Has Partnership” or “No Partnership.”
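As a rough illustration of how a split is chosen, the sketch below scores candidate thresholds on a small made-up dataset using Gini impurity, a common purity measure and the default splitting criterion for classification trees in rpart. The `toy` data frame and the helper names `gini` and `split_gini` are hypothetical and are not part of our analysis.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini impurity of the two groups created by splitting x at a threshold
split_gini <- function(x, labels, threshold) {
  left  <- labels[x <  threshold]
  right <- labels[x >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Hypothetical toy data (not taken from the report's dataset)
toy <- data.frame(
  enrollment  = c(500, 1200, 3000, 8000, 15000, 40000),
  partnership = c("No", "No", "No", "Yes", "No", "Yes")
)

# Candidate thresholds: midpoints between consecutive sorted values
vals <- sort(unique(toy$enrollment))
candidates <- (head(vals, -1) + tail(vals, -1)) / 2

# Score every candidate and keep the one with the lowest weighted impurity
scores <- sapply(candidates, function(t) {
  split_gini(toy$enrollment, toy$partnership, t)
})
best_threshold <- candidates[which.min(scores)]
best_threshold
```

On this toy data the lowest weighted impurity is reached at an enrollment threshold of 5,500, which places the three smallest schools (all “No”) in one group. A real tree repeats this search over every variable and then again within each resulting group.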

An example of a single decision tree using the data we used can be seen below.

#prepping the data
universities_model <- universities_counties |> dplyr::select(instnm, county_fips, news_partnership, instsize, endowment, enrollment, acp_category, median_age_2020, medill_news_outlets, percent_votes_democrat_2020, percentage_below_poverty_line_2022, news_and_information_index) |> filter(instsize > 0)

universities_model_full <- universities_model[complete.cases(universities_model),]


library(rpart)
set.seed(2)
tree_model <- rpart(news_partnership ~ instsize + endowment + enrollment + 
                    acp_category + median_age_2020 + medill_news_outlets +
                    percent_votes_democrat_2020 + 
                    percentage_below_poverty_line_2022 + 
                    news_and_information_index, 
                   data = universities_model_full, 
                   method = "class")  # keep it small and readable

# Plot the tree
library(rpart.plot)
rpart.plot(tree_model)

However, a single tree doesn’t do a good enough job of capturing all aspects of this data. This is a common issue with single decision trees. By nature, they are often too general or too specific to the data they are trained on.

This is where a random forest helps. A random forest builds hundreds of trees, each trained on a random sample of the data. At each split, a tree may only choose from a random subset of the variables. The forest then takes a majority vote across all the trees to decide the final prediction. Combining many varied trees reduces over-fitting (the model memorizing its training data instead of learning general patterns), which typically leads to more accurate predictions. It also allows us to measure variable importance.
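The bootstrap-and-vote idea can be sketched with rpart on R’s built-in iris data. This is only an illustration under simplifying assumptions: it shows the random sampling of rows and the majority vote, but not the random selection of variables at each split, which rpart does not do. The names `trees`, `votes`, and `forest_pred` are our own for this example.

```r
library(rpart)

set.seed(1)
n_trees <- 25

# Fit each tree on a bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  boot_rows <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[boot_rows, ], method = "class")
})

# Each tree predicts every row; the result has one column of votes per tree
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, newdata = iris, type = "class"))
})

# Majority vote across trees for each row
forest_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# Share of rows where the vote matches the true label (in-sample)
mean(forest_pred == iris$Species)
```

Because the vote is evaluated on the same rows the trees were trained on, the resulting share is optimistic. The randomForest package handles this with out-of-bag predictions instead.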

Variables of Interest

The variables used in the model were selected based on patterns identified during earlier analysis and visualizations. Here’s a quick overview of the main variables we used:

- Institution Size (instsize): A numeric categorical measure of how big a school is (scale 1-5 representing Very Small - Very Large).
- Endowment: A measure of the university’s financial resources.
- Enrollment: Number of students enrolled at the university.
- ACP Category (acp_category): A classification that captures cultural and demographic county characteristics.
- Median Age (median_age_2020): Age of residents in the surrounding county.
- Number of News Outlets (medill_news_outlets): Count of local news outlets in the county.
- Percent Democratic Vote (2020) (percent_votes_democrat_2020): Political leaning of the county.
- Poverty Rate (percentage_below_poverty_line_2022): Poverty rate of the county.
- Civic Information Index (news_and_information_index): A measure from the Civic Information Index, indicating the strength of local news and information.

# set up testing data ( universities in need )
universities_test <- impact_universities |> 
  dplyr::select(instnm, county_fips, news_partnership, instsize, endowment, enrollment, 
         acp_category, median_age_2020, medill_news_outlets, 
         percent_votes_democrat_2020, percentage_below_poverty_line_2022, news_and_information_index)

# lose roughly 465 obs again, but will make testing possible
universities_test_full <- universities_test[complete.cases(universities_test),]

universities_model_full <- universities_model_full |> 
  filter(!instnm %in% universities_test_full$instnm)

rm(universities_model, universities_test)

#table(universities_model_full$news_partnership)
# lose out on roughly 10 exploring partnerships, 9 News Partnerships, but 3000 None or No Data
# this is good because it most likely weeds out colleges we do not have full data on,
# making our model more accurate when we do have all the data present :)

Making the Model

To train the random forest model, we used universities for which we had complete data and that were not part of our ‘Impact Schools’ subset. This approach helped ensure that we could later test the model’s predictions on the universities we’re most interested in without introducing too much bias.

Some basic model parameters:

- mtry = 3: This tells the model to consider 3 variables at each split.
- ntree = 1000: This sets the number of trees in the forest to 1000.

These parameters are chosen by the user and can influence both accuracy and run-time. In a different setting, a grid search or cross-validation may be used to find optimal parameter choices. However, in this case, the model accuracy was stable across different settings, suggesting the values chosen for each parameter are sufficient.
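For readers who want to check parameter stability themselves, one simple approach is to compare the out-of-bag (OOB) error across several values of mtry. The sketch below does this on the built-in iris data rather than our dataset. The grid `mtry_grid` and the choice of 500 trees are assumptions made for the example.

```r
library(randomForest)

set.seed(1)
mtry_grid <- 1:4  # iris has 4 predictors

oob_error <- sapply(mtry_grid, function(m) {
  fit <- randomForest(Species ~ ., data = iris, mtry = m, ntree = 500)
  # err.rate has one row per tree; the "OOB" entry of the last row is the
  # out-of-bag error rate of the full forest
  fit$err.rate[fit$ntree, "OOB"]
})

data.frame(mtry = mtry_grid, oob_error = oob_error)
```

If the error barely changes across the grid, as we observed with our own settings, the exact choice of mtry matters little.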

set.seed(3295247)
# creating a rf using the variables mentioned, 1000 trees, and 3 variables at each split
rf.uni = randomForest(news_partnership ~ instsize + endowment + enrollment + 
          acp_category + median_age_2020 + medill_news_outlets +
          percent_votes_democrat_2020 + percentage_below_poverty_line_2022 + 
          news_and_information_index, 
         data = universities_model_full, 
         mtry = 3, 
         ntree = 1000, 
         importance = TRUE)

tab_train <- table(rf.uni$predicted, universities_model_full$news_partnership)
(tab_train[1,1]+tab_train[2,2]+tab_train[3,3])/sum(tab_train) #% of correct predictions
## [1] 0.9093432

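Once the model was trained, we assessed its accuracy by comparing its out-of-bag predictions (each school is predicted only by trees that did not see it during training) with the actual partnership statuses. Out of 2,162 institutions, the model correctly classified 1,966, resulting in an accuracy rate of approximately 91%. This is strong overall performance, although most institutions fall in the “None or No Data” class, so a high overall accuracy is expected.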

Testing the Model on Universities in Need

Next, we tested its ability to predict on institutions the random forest was not trained on: the set of Impact Universities. These schools were excluded from training to ensure that the model’s predictions would reflect its ability to generalize to new, unseen data.

rf.uni.pred = predict(rf.uni, universities_test_full)

universities_test_full$predicted_news_partnership <- rf.uni.pred

tab_test <- table(rf.uni.pred, universities_test_full$news_partnership)
tab_test
##                   
## rf.uni.pred        None or No Data Exploring News Partnership
##   None or No Data              350        18               22
##   Exploring                      0         0                0
##   News Partnership               2         2                3
misclassifications <- universities_test_full[(rf.uni.pred != universities_test_full$news_partnership),] |> 
  dplyr::select(instnm, news_partnership, predicted_news_partnership)

# misclassifications |> filter(news_partnership == 'Exploring' & predicted_news_partnership == 'News Partnership')


(tab_test[1,1]+tab_test[2,2]+tab_test[3,3])/sum(tab_test) #% of correct predictions
## [1] 0.8891688

The model achieved an overall accuracy of 89%, meaning it correctly classified the partnership status for the vast majority of universities in the test group. This is notably high, especially given the variability in these schools.
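Overall accuracy can hide how the model performs on each class, so it helps to break the confusion matrix down further. The sketch below re-enters the test table printed above by hand (the name `tab` and the derived quantities are our own for this example) and computes the accuracy, a majority-class baseline, and the share of each actual class that was predicted correctly.

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted class, columns = actual class)
classes <- c("None or No Data", "Exploring", "News Partnership")
tab <- matrix(
  c(350, 18, 22,
      0,  0,  0,
      2,  2,  3),
  nrow = 3, byrow = TRUE,
  dimnames = list(predicted = classes, actual = classes)
)

# Share of all schools on the diagonal (correct predictions)
accuracy <- sum(diag(tab)) / sum(tab)

# Accuracy of always predicting the most common actual class
baseline <- max(colSums(tab)) / sum(tab)

# Share of each actual class that the model identified correctly
recall <- diag(tab) / colSums(tab)

round(c(accuracy = accuracy, baseline = baseline), 3)
round(recall, 3)
```

From the table, always predicting “None or No Data” would already be right for 352 of 397 schools (about 88.7%). The model correctly identifies 3 of the 25 schools with a News Partnership and none of the 20 Exploring schools.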

However, no model is perfect: 11% of the Impact Schools were misclassified. Specifically, there are two cases where the model predicted a news-academic partnership for a university reported as not having one. Both cases can be seen in the table below.

misclassifications |> filter(news_partnership == 'None or No Data')
##                           instnm news_partnership predicted_news_partnership
## 1 University of California-Davis  None or No Data           News Partnership
## 2            Marshall University  None or No Data           News Partnership

Rather than treating these as traditional errors, we tried to derive some meaning from them. They are being misclassified because they may share many of the characteristics of schools with active news-academic partnerships. Essentially, these cases are confusing the model into thinking they have news-academic partnerships. For this reason, we consider these schools ‘high-potential universities’.
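One way to make the ‘high-potential’ idea more graded is to look at the share of trees voting for each class rather than only the final label. The sketch below uses `predict(..., type = "prob")` from the randomForest package on the built-in iris data. Treating "virginica" as the class of interest is purely an assumption for the example, standing in for “News Partnership”.

```r
library(randomForest)

set.seed(1)
train_rows <- sample(nrow(iris), 100)
fit <- randomForest(Species ~ ., data = iris[train_rows, ], ntree = 500)

held_out <- iris[-train_rows, ]

# One column per class; each row holds the share of trees voting for that class
vote_share <- predict(fit, newdata = held_out, type = "prob")

# Among held-out rows NOT labeled "virginica", rank by how strongly
# the forest nevertheless votes for "virginica"
candidates <- held_out$Species != "virginica"
ranked <- data.frame(
  actual = held_out$Species[candidates],
  virginica_share = vote_share[candidates, "virginica"]
)
head(ranked[order(-ranked$virginica_share), ])
```

Applied to our setting, the same ranking would order schools without a reported partnership by how closely they resemble schools that have one.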

Additionally, there were more misclassifications in the other direction: schools that do have a news-academic partnership but were predicted not to have one. We considered these schools to be ‘against-the-odds’. The model believes that a partnership is unlikely, yet one is present in the real world. This suggests that some factors that explain news-academic partnerships are not captured in our data. These schools can be seen below.

misclassifications |> filter(news_partnership == 'News Partnership' & predicted_news_partnership == 'None or No Data')
##                                  instnm news_partnership
## 1              Alabama State University News Partnership
## 2             Arkansas State University News Partnership
## 3     California State University-Chico News Partnership
## 4         Florida Gulf Coast University News Partnership
## 5    Georgia College & State University News Partnership
## 6                     Mercer University News Partnership
## 7                       Simpson College News Partnership
## 8          Northern Kentucky University News Partnership
## 9                Bowie State University News Partnership
## 10  University of Maryland-College Park News Partnership
## 11 University of New Mexico-Main Campus News Partnership
## 12            University of Nevada-Reno News Partnership
## 13            University of Nevada-Reno News Partnership
## 14              Miami University-Oxford News Partnership
## 15    The Pennsylvania State University News Partnership
## 16                   Claflin University News Partnership
## 17                         Lane College News Partnership
## 18          Lincoln Memorial University News Partnership
## 19        Southern Adventist University News Partnership
## 20                Utah State University News Partnership
## 21                Utah State University News Partnership
## 22          Washington State University News Partnership
##    predicted_news_partnership
## 1             None or No Data
## 2             None or No Data
## 3             None or No Data
## 4             None or No Data
## 5             None or No Data
## 6             None or No Data
## 7             None or No Data
## 8             None or No Data
## 9             None or No Data
## 10            None or No Data
## 11            None or No Data
## 12            None or No Data
## 13            None or No Data
## 14            None or No Data
## 15            None or No Data
## 16            None or No Data
## 17            None or No Data
## 18            None or No Data
## 19            None or No Data
## 20            None or No Data
## 21            None or No Data
## 22            None or No Data

These misclassifications provide valuable insights. Rather than being failures, they highlight opportunities for future outreach, resource targeting, or further investigation. They may help identify universities that are ready for, but not yet active in, local journalism efforts.

Variable Importance

A random forest keeps track of variable importance as it is built. Here, variable importance is measured by each variable’s mean decrease in the Gini Index. The Gini Index is a measure used in classification problems to determine how “pure” a group is. A group is pure if it contains only one type of outcome (e.g., all “Has Partnership”). The lower the Gini value, the purer the group. In other words, a variable that leads to big drops in the Gini value is doing a good job of helping the model sort universities into the correct categories.

importance_df <- data.frame(importance(rf.uni)) #|> arrange(desc(MeanDecreaseGini))
importance_df <- importance_df |> dplyr::select(MeanDecreaseAccuracy, MeanDecreaseGini)
#importance_df
importance_df2 <-  data.frame( predictor_variable = c('instsize', 'endowment', 'enrollment', 'acp_category', 'median_age_2020', 'medill_news_outlets', 'percent_votes_democrat_2020', 'percentage_below_poverty_line_2022', 'news_and_information_index'), meanDecreaseGini = importance_df$MeanDecreaseGini)
importance_df2|> arrange(desc(meanDecreaseGini))
##                   predictor_variable meanDecreaseGini
## 1                         enrollment         82.54570
## 2                          endowment         77.73912
## 3                    median_age_2020         37.82052
## 4        percent_votes_democrat_2020         36.68538
## 5 percentage_below_poverty_line_2022         36.59336
## 6         news_and_information_index         34.57103
## 7                medill_news_outlets         27.14505
## 8                           instsize         21.42309
## 9                       acp_category         13.96087

From our model, the most important variables were:

1) Enrollment
2) Endowment
3) Median Age 2020
4) Percent Votes Democrat 2020
5) Percentage Below Poverty Line 2022
6) News And Information Index

Although the variables above provided a significant decrease in Gini Index, enrollment and endowment dominated this measure - being more than double any other variable.
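Gini-based importance can favor continuous variables with many possible split points, so it is worth cross-checking against permutation importance, which the randomForest package also reports when `importance = TRUE`. The sketch below shows how to pull both measures side by side. It uses the built-in iris data, and the `comparison` data frame is our own naming for this example.

```r
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

# type = 1: mean decrease in accuracy (permutation-based)
# type = 2: mean decrease in node impurity (Gini, for classification)
perm_imp <- importance(fit, type = 1)
gini_imp <- importance(fit, type = 2)

comparison <- data.frame(
  variable = rownames(gini_imp),
  mean_decrease_accuracy = perm_imp[, 1],
  mean_decrease_gini = gini_imp[, 1],
  row.names = NULL
)

# Sort by Gini importance, largest first
comparison[order(-comparison$mean_decrease_gini), ]
```

If the two rankings broadly agree, the ordering is less likely to be an artifact of how one measure is computed.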

Taken together with our earlier findings, these results suggest that schools with more resources and those located in areas with fewer local news sources are more likely to form or explore partnerships. Note that Gini importance shows how useful a variable is for prediction, not the direction of its effect. The model and its results align with our earlier findings and help to provide a data-driven foundation for identifying ‘high-potential universities’.

Where the Model Might Fall Short

One key limitation of this model is class imbalance. The majority of schools in the dataset are labeled as “None or No Data”. When a model is trained on classes with unequal proportions, it becomes harder for the model to accurately predict less common outcomes, especially the “Exploring” category, which the model never predicted on the test set. Since random forests choose splits and make final predictions by majority, underrepresented classes can be easily overlooked.
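One possible mitigation, which we did not apply in this report, is to have each tree draw an equal number of rows from every class using the `strata` and `sampsize` arguments of randomForest. The sketch below uses simulated data. The `toy` data frame, the class names, and the sample sizes are all assumptions chosen for illustration.

```r
library(randomForest)

set.seed(1)
# Hypothetical imbalanced data: 300 "none" rows and 20 "partner" rows
n_major <- 300
n_minor <- 20
toy <- data.frame(
  x1 = c(rnorm(n_major, mean = 0), rnorm(n_minor, mean = 2)),
  x2 = c(rnorm(n_major, mean = 0), rnorm(n_minor, mean = 2)),
  status = factor(rep(c("none", "partner"), c(n_major, n_minor)))
)

# Each tree samples 20 rows from each class instead of mirroring the imbalance
balanced_fit <- randomForest(
  status ~ x1 + x2,
  data = toy,
  strata = toy$status,
  sampsize = c(20, 20),  # one entry per class, in factor-level order
  ntree = 500
)

# Out-of-bag confusion matrix with per-class error rates
balanced_fit$confusion
```

Balanced sampling usually trades some accuracy on the majority class for better detection of the rare classes, so the per-class error rates are the figures to watch.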

Another concern is the use of Impact Schools as a test set to judge accuracy. While these were excluded from model training, they may not reflect the same distribution of the news partnership column that is present in the full dataset. For instance, since these schools are in areas that we identified to be in need of a news outlet, they may be more likely not to have a news-academic partnership.

Lastly, the random forest implementation we used requires complete data, so any row with a missing value was dropped from both the training and testing sets. These schools could be part of an important trend that the model does not recognize because of their exclusion.
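A quick audit of what `complete.cases()` removes can show whether the dropped rows are concentrated in particular columns or classes. The sketch below runs on a small made-up data frame. The `toy` data and its values are hypothetical and do not come from our dataset.

```r
# Hypothetical data with missing values (not the report's dataset)
toy <- data.frame(
  status     = c("None", "None", "Partner", "Exploring", "None", "Partner"),
  endowment  = c(1e6, NA, 5e8, 2e7, NA, 9e8),
  enrollment = c(800, 1200, NA, 5000, 300, 40000)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

# Which rows survive complete.cases(), broken down by class
kept <- complete.cases(toy)
dropped_by_class <- table(status = toy$status, kept = kept)

missing_per_column
dropped_by_class
```

If one class loses a much larger share of its rows than the others, the remaining data may no longer represent that class well.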

Conclusion

This project set out to understand the complex relationship between universities and local news ecosystems. Specifically, we sought to identify which universities are best positioned to help address the growing crisis of news deserts and what puts a county at risk of becoming a news desert. Through exploratory analysis, geographic visualizations, and machine learning, we identified several factors that appear to influence the presence of student-led news partnerships.

While no single variable explains why certain schools have partnerships and others do not, some trends emerged. University enrollment and endowment were consistently the strongest predictors. County-level data, like poverty rate, political results, and demographics, added some insights but failed to generalize to the whole dataset.

Interactive maps helped illustrate meaningful local patterns that were hidden in large-scale exploratory data analysis. The machine learning model quantified these trends and identified ‘high-potential’ institutions.

In short, the presence of student-led news partnerships may not follow one neat formula. Still, the analysis provides a framework for identifying opportunity and need for local news, and it brings us much closer to understanding where partnerships are likely to thrive and where they’re needed most. As the decline of local news continues, these insights could guide outreach to institutions.

Limitations

The model building provided us with insightful predictions, but these came with limitations. The limitations that directly influenced model building are discussed in the Where the Model Might Fall Short section. The broader report is also constrained by several data and methodological limitations that should be acknowledged.

One major limitation is the treatment of missing values in the ‘news_partnership’ variable. Due to limited and inconsistent reporting, many universities lacked a value in this column. These values as well as values indicating no partnership were altered to be labeled as “None or No Data”. This introduced the risk of misclassification because some of these institutions may be engaged in a news-academic partnership. This could lead to potential skew in the whole report, but it was a necessary assumption to make in order for the data to be useful.

Another limitation, due to geographical reasons, was the exclusion of Hawaiian and Alaskan counties and universities.

Finally, due to time and resource constraints, we were unable to access certain data. During the map building phase of our analysis, we attempted to use census tract-level data to investigate racial diversity in counties in need of a news outlet. However, the extreme granularity of the data made it computationally expensive and difficult to work with efficiently. This limited any takeaways from the data. Additionally, we were unable to find a reliable, comprehensive source of university data containing student-faculty ratio and journalism major data - variables we believed would have provided valuable context to the complex state of news-academic partnerships.