Project 1

Author

Ryan Seabold

Project 1: Exploring the Racial Makeup of the Southern Mid-Atlantic Region

This project will explore the racial diversity in Delaware, the District of Columbia, Maryland, Virginia, and West Virginia — the southern Mid-Atlantic region of the United States. The project utilizes a report from the decennial census (2020) focused on race, divided by census tract — a small political subdivision. The variables are the census tract geoID, the population of each race by census tract (from which percentages are determined), and a calculated diversity score, based on the percentage of a census tract’s population which the most populous race occupies. The races are classified thusly: White, Black or African American, American Indian or Alaska Native, Asian, Pacific Islander or Native Hawaiian, other, and multiple races. The project also uses TIGER/Line shapefiles from the census to map the racial diversity using Tableau. The first map shows the most populous race in each census tract with colors and the second contrasts the levels of diversity by census tract through shading.

Dataset source: decennial census report P1 “RACE” (2020), collected from data.census.gov Shapefiles source: TIGER/Line Shapefiles, collected from www.census.gov

Set up the project

# Set the working directory
setwd("C:/Users/dodgy/OneDrive/School/DATA 110/Projects/Project 1")

# Open the necessary libraries
library(tidyverse)
library(ggplot2)
library(janitor)
library(stringr)
library(tigris)

# Open the data file
censusdata <- read_csv("P1 RACE 2020.csv")

New names:
Rows: 85396 Columns: 74
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(73): GEO_ID, NAME, P1_001N, P1_002N, P1_003N, P1_004N, P1_005N, P1_006N... lgl
(1): ...74
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...74`

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 74
  GEO_ID   NAME  P1_001N P1_002N P1_003N P1_004N P1_005N P1_006N P1_007N P1_008N
  <chr>    <chr> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
1 Geograp… Geog… !!Tota… !!Tota… !!Tota… !!Tota… !!Tota… !!Tota… !!Tota… !!Tota…
2 1400000… Cens… 1775    1653    1389    213     5       8       3       35     
3 1400000… Cens… 2055    1984    842     1104    2       12      4       20     
4 1400000… Cens… 3216    3039    2244    714     12      14      6       49     
5 1400000… Cens… 4246    3993    3578    327     17      32      1       38     
6 1400000… Cens… 4322    4055    3241    632     29      93      2       58     
# ℹ 64 more variables: P1_009N <chr>, P1_010N <chr>, P1_011N <chr>,
#   P1_012N <chr>, P1_013N <chr>, P1_014N <chr>, P1_015N <chr>, P1_016N <chr>,
#   P1_017N <chr>, P1_018N <chr>, P1_019N <chr>, P1_020N <chr>, P1_021N <chr>,
#   P1_022N <chr>, P1_023N <chr>, P1_024N <chr>, P1_025N <chr>, P1_026N <chr>,
#   P1_027N <chr>, P1_028N <chr>, P1_029N <chr>, P1_030N <chr>, P1_031N <chr>,
#   P1_032N <chr>, P1_033N <chr>, P1_034N <chr>, P1_035N <chr>, P1_036N <chr>,
#   P1_037N <chr>, P1_038N <chr>, P1_039N <chr>, P1_040N <chr>, …

Clean/prepare the data

# Clean the headers
censusdata <- censusdata |>
  row_to_names(1, remove_row = TRUE)
censusdata <- censusdata |>
  clean_names()

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 74
  geography            geographic_area_name         total total_population_of_…¹
  <chr>                <chr>                        <chr> <chr>                 
1 1400000US01001020100 Census Tract 201, Autauga C… 1775  1653                  
2 1400000US01001020200 Census Tract 202, Autauga C… 2055  1984                  
3 1400000US01001020300 Census Tract 203, Autauga C… 3216  3039                  
4 1400000US01001020400 Census Tract 204, Autauga C… 4246  3993                  
5 1400000US01001020501 Census Tract 205.01, Autaug… 4322  4055                  
6 1400000US01001020502 Census Tract 205.02, Autaug… 3284  3092                  
# ℹ abbreviated name: ¹total_population_of_one_race
# ℹ 70 more variables: total_population_of_one_race_white_alone <chr>,
#   total_population_of_one_race_black_or_african_american_alone <chr>,
#   total_population_of_one_race_american_indian_and_alaska_native_alone <chr>,
#   total_population_of_one_race_asian_alone <chr>,
#   total_population_of_one_race_native_hawaiian_and_other_pacific_islander_alone <chr>,
#   total_population_of_one_race_some_other_race_alone <chr>, …

# Remove unnecessary columns
censusdata[,12:ncol(censusdata)] <- NULL
censusdata$total_population_of_one_race <- NULL
censusdata$geography <- NULL

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 9
  geographic_area_name       total total_population_of_…¹ total_population_of_…²
  <chr>                      <chr> <chr>                  <chr>                 
1 Census Tract 201, Autauga… 1775  1389                   213                   
2 Census Tract 202, Autauga… 2055  842                    1104                  
3 Census Tract 203, Autauga… 3216  2244                   714                   
4 Census Tract 204, Autauga… 4246  3578                   327                   
5 Census Tract 205.01, Auta… 4322  3241                   632                   
6 Census Tract 205.02, Auta… 3284  1894                   907                   
# ℹ abbreviated names: ¹total_population_of_one_race_white_alone,
#   ²total_population_of_one_race_black_or_african_american_alone
# ℹ 5 more variables:
#   total_population_of_one_race_american_indian_and_alaska_native_alone <chr>,
#   total_population_of_one_race_asian_alone <chr>,
#   total_population_of_one_race_native_hawaiian_and_other_pacific_islander_alone <chr>,
#   total_population_of_one_race_some_other_race_alone <chr>, …

# Create new columns for state, county, and census tract number
censusdata <- censusdata |>
  separate(geographic_area_name, into = c("census_tract", "county", "state"), sep = ", ")

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 11
  census_tract  county state total total_population_of_…¹ total_population_of_…²
  <chr>         <chr>  <chr> <chr> <chr>                  <chr>                 
1 Census Tract… Autau… Alab… 1775  1389                   213                   
2 Census Tract… Autau… Alab… 2055  842                    1104                  
3 Census Tract… Autau… Alab… 3216  2244                   714                   
4 Census Tract… Autau… Alab… 4246  3578                   327                   
5 Census Tract… Autau… Alab… 4322  3241                   632                   
6 Census Tract… Autau… Alab… 3284  1894                   907                   
# ℹ abbreviated names: ¹total_population_of_one_race_white_alone,
#   ²total_population_of_one_race_black_or_african_american_alone
# ℹ 5 more variables:
#   total_population_of_one_race_american_indian_and_alaska_native_alone <chr>,
#   total_population_of_one_race_asian_alone <chr>,
#   total_population_of_one_race_native_hawaiian_and_other_pacific_islander_alone <chr>,
#   total_population_of_one_race_some_other_race_alone <chr>, …

# Improve column names
colnames(censusdata) <-
  c("census_tract", "county", "state", "total", "total_white",
    "total_black", "total_native", "total_asian", "total_pacific",
    "total_other", "total_multiple")

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 11
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <chr> <chr>       <chr>       <chr>       
1 Census Tract 201    Autauga … Alab… 1775  1389        213         5           
2 Census Tract 202    Autauga … Alab… 2055  842         1104        2           
3 Census Tract 203    Autauga … Alab… 3216  2244        714         12          
4 Census Tract 204    Autauga … Alab… 4246  3578        327         17          
5 Census Tract 205.01 Autauga … Alab… 4322  3241        632         29          
6 Census Tract 205.02 Autauga … Alab… 3284  1894        907         15          
# ℹ 4 more variables: total_asian <chr>, total_pacific <chr>,
#   total_other <chr>, total_multiple <chr>

# Reformat columns as numeric
censusdata[4:ncol(censusdata)] <-
  lapply(censusdata[4:ncol(censusdata)], as.numeric)

# Get the relevant data using tigris
countiesdata <- counties(cb = TRUE, year = 2020, progress_bar = FALSE) |>
  select(STATEFP, COUNTYFP, GEOID, NAMELSAD, STATE_NAME)

# Remove automatic, but unnecessary geometry column
countiesdata$geometry <- NULL

# Create a column for state FIPS codes
censusdata <- censusdata |>
  left_join(countiesdata |>
    distinct(STATEFP, STATE_NAME), by = c("state" = "STATE_NAME")) |>
      rename(statefp = STATEFP)

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 12
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <dbl>       <dbl>       <dbl>        <dbl>
1 Census Tract 201    Autauga … Alab…  1775        1389         213            5
2 Census Tract 202    Autauga … Alab…  2055         842        1104            2
3 Census Tract 203    Autauga … Alab…  3216        2244         714           12
4 Census Tract 204    Autauga … Alab…  4246        3578         327           17
5 Census Tract 205.01 Autauga … Alab…  4322        3241         632           29
6 Census Tract 205.02 Autauga … Alab…  3284        1894         907           15
# ℹ 5 more variables: total_asian <dbl>, total_pacific <dbl>,
#   total_other <dbl>, total_multiple <dbl>, statefp <chr>

# Create a column for county FIPS codes
censusdata <- censusdata |>
  left_join(countiesdata |>
              select(STATEFP, COUNTYFP, NAMELSAD), by = c("statefp" = "STATEFP", "county" = "NAMELSAD")) |>
    rename(countyfp = COUNTYFP)

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 13
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <dbl>       <dbl>       <dbl>        <dbl>
1 Census Tract 201    Autauga … Alab…  1775        1389         213            5
2 Census Tract 202    Autauga … Alab…  2055         842        1104            2
3 Census Tract 203    Autauga … Alab…  3216        2244         714           12
4 Census Tract 204    Autauga … Alab…  4246        3578         327           17
5 Census Tract 205.01 Autauga … Alab…  4322        3241         632           29
6 Census Tract 205.02 Autauga … Alab…  3284        1894         907           15
# ℹ 6 more variables: total_asian <dbl>, total_pacific <dbl>,
#   total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>

# Create a column for county geoID codes
censusdata <- censusdata |>
  left_join(countiesdata |>
              select(STATEFP, NAMELSAD, GEOID), by = c("statefp" = "STATEFP", "county" = "NAMELSAD")) |>
    rename(county_geoid = GEOID)

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 14
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <dbl>       <dbl>       <dbl>        <dbl>
1 Census Tract 201    Autauga … Alab…  1775        1389         213            5
2 Census Tract 202    Autauga … Alab…  2055         842        1104            2
3 Census Tract 203    Autauga … Alab…  3216        2244         714           12
4 Census Tract 204    Autauga … Alab…  4246        3578         327           17
5 Census Tract 205.01 Autauga … Alab…  4322        3241         632           29
6 Census Tract 205.02 Autauga … Alab…  3284        1894         907           15
# ℹ 7 more variables: total_asian <dbl>, total_pacific <dbl>,
#   total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
#   county_geoid <chr>

# Create a column for padded_census_tract with "Census Tract " removed
censusdata <- censusdata |>
  mutate(padded_census_tract = str_remove(census_tract, "Census Tract ")) 

# Ensure padded_census_tract has two digits after the period by padding with zeroes
censusdata <- censusdata |>
  mutate(padded_census_tract = if_else(
    str_detect(padded_census_tract, "\\."),
    if_else( # If there is a decimal point, check how many digits are after it
      str_length(str_extract(padded_census_tract, "(?<=\\.)\\d+")) == 2,
      padded_census_tract, # If there are already 2 digits after the decimal point, keep it as is
      str_replace(padded_census_tract, "(\\d+)(\\.\\d+)?", "\\1\\20")), # If there's only 1 digit after the decimal point, add a 0
    paste0(padded_census_tract, ".00"))) # If there's no decimal point, just append ".00"

# Remove the decimal point from padded_census_tract
censusdata <- censusdata |>
  mutate(padded_census_tract = str_remove(padded_census_tract, "\\."))

# Ensure padded_census_tract is six digits total by padding with zeroes
censusdata <- censusdata |>
  mutate(padded_census_tract = str_pad(padded_census_tract, width = 6, pad = "0", side = "left"))

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 15
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <dbl>       <dbl>       <dbl>        <dbl>
1 Census Tract 201    Autauga … Alab…  1775        1389         213            5
2 Census Tract 202    Autauga … Alab…  2055         842        1104            2
3 Census Tract 203    Autauga … Alab…  3216        2244         714           12
4 Census Tract 204    Autauga … Alab…  4246        3578         327           17
5 Census Tract 205.01 Autauga … Alab…  4322        3241         632           29
6 Census Tract 205.02 Autauga … Alab…  3284        1894         907           15
# ℹ 8 more variables: total_asian <dbl>, total_pacific <dbl>,
#   total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
#   county_geoid <chr>, padded_census_tract <chr>

# Create a new column called census_tract_geoid that combines county_geoid and padded_census_tract
censusdata <- censusdata |>
  mutate(census_tract_geoid = paste0(county_geoid, padded_census_tract))

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 16
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <dbl>       <dbl>       <dbl>        <dbl>
1 Census Tract 201    Autauga … Alab…  1775        1389         213            5
2 Census Tract 202    Autauga … Alab…  2055         842        1104            2
3 Census Tract 203    Autauga … Alab…  3216        2244         714           12
4 Census Tract 204    Autauga … Alab…  4246        3578         327           17
5 Census Tract 205.01 Autauga … Alab…  4322        3241         632           29
6 Census Tract 205.02 Autauga … Alab…  3284        1894         907           15
# ℹ 9 more variables: total_asian <dbl>, total_pacific <dbl>,
#   total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
#   county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>

Create columns for the percentages of each race

# Define the races and columns
race_columns <- c("total_white", "total_black", "total_native", "total_asian", "total_pacific", "total_other", "total_multiple")
races_simple <- c("white", "black", "native", "asian", "pacific", "other", "multiple")
races <- c("White", "Black or African American", "American Indian or Alaska Native", "Asian", "Pacific Islander", "Other", "Multiple Races")

# Create columns for the percentage (out of 100) of the total population each race occupies
censusdata <- censusdata |>
  mutate(across(all_of(race_columns), 
                 ~ if_else(total == 0, 0, . / total * 100), 
                 .names = "pct_{str_remove(.col, '^total_')}"))

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 23
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <dbl>       <dbl>       <dbl>        <dbl>
1 Census Tract 201    Autauga … Alab…  1775        1389         213            5
2 Census Tract 202    Autauga … Alab…  2055         842        1104            2
3 Census Tract 203    Autauga … Alab…  3216        2244         714           12
4 Census Tract 204    Autauga … Alab…  4246        3578         327           17
5 Census Tract 205.01 Autauga … Alab…  4322        3241         632           29
6 Census Tract 205.02 Autauga … Alab…  3284        1894         907           15
# ℹ 16 more variables: total_asian <dbl>, total_pacific <dbl>,
#   total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
#   county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
#   pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
#   pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>

# Create a column for most populous race
censusdata <- censusdata |>
  rowwise() |>
    mutate(most_populous_race = races[which.max(c_across(all_of(race_columns)))]) |>
      ungroup()
# Create a column for most populous race, simplified
censusdata <- censusdata |>
  rowwise() |>
    mutate(most_populous_race_simple = races_simple[which.max(c_across(all_of(race_columns)))]) |>
      ungroup()

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 25
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <dbl>       <dbl>       <dbl>        <dbl>
1 Census Tract 201    Autauga … Alab…  1775        1389         213            5
2 Census Tract 202    Autauga … Alab…  2055         842        1104            2
3 Census Tract 203    Autauga … Alab…  3216        2244         714           12
4 Census Tract 204    Autauga … Alab…  4246        3578         327           17
5 Census Tract 205.01 Autauga … Alab…  4322        3241         632           29
6 Census Tract 205.02 Autauga … Alab…  3284        1894         907           15
# ℹ 18 more variables: total_asian <dbl>, total_pacific <dbl>,
#   total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
#   county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
#   pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
#   pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>,
#   most_populous_race <chr>, most_populous_race_simple <chr>

Create columns for measuring the diversity of a census tract

# Define thresholds
thresholds <- c(33.33333333, 50, 66.66666667, 70, 75, 80, 90, 95, 98)
# Create Boolean columns for each percentage threshold
for (threshold in thresholds) {
  censusdata <- censusdata |> 
    mutate(!!paste0("is_pct_above_", threshold) := rowSums(across(all_of(paste0("pct_", races_simple))) > threshold) > 0)
}

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 34
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <dbl>       <dbl>       <dbl>        <dbl>
1 Census Tract 201    Autauga … Alab…  1775        1389         213            5
2 Census Tract 202    Autauga … Alab…  2055         842        1104            2
3 Census Tract 203    Autauga … Alab…  3216        2244         714           12
4 Census Tract 204    Autauga … Alab…  4246        3578         327           17
5 Census Tract 205.01 Autauga … Alab…  4322        3241         632           29
6 Census Tract 205.02 Autauga … Alab…  3284        1894         907           15
# ℹ 27 more variables: total_asian <dbl>, total_pacific <dbl>,
#   total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
#   county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
#   pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
#   pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>,
#   most_populous_race <chr>, most_populous_race_simple <chr>,
#   is_pct_above_33.33333333 <lgl>, is_pct_above_50 <lgl>, …

# Define race percentage columns
race_percentages <- c("pct_white", "pct_black", "pct_native", "pct_asian", "pct_pacific", "pct_other", "pct_multiple")

# Create a column for the percentage of the most populous race
censusdata <- censusdata |>
  rowwise() |>
    mutate(most_populous_pct = max(c_across(all_of(race_percentages)))) |>
      ungroup()

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 35
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <dbl>       <dbl>       <dbl>        <dbl>
1 Census Tract 201    Autauga … Alab…  1775        1389         213            5
2 Census Tract 202    Autauga … Alab…  2055         842        1104            2
3 Census Tract 203    Autauga … Alab…  3216        2244         714           12
4 Census Tract 204    Autauga … Alab…  4246        3578         327           17
5 Census Tract 205.01 Autauga … Alab…  4322        3241         632           29
6 Census Tract 205.02 Autauga … Alab…  3284        1894         907           15
# ℹ 28 more variables: total_asian <dbl>, total_pacific <dbl>,
#   total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
#   county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
#   pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
#   pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>,
#   most_populous_race <chr>, most_populous_race_simple <chr>,
#   is_pct_above_33.33333333 <lgl>, is_pct_above_50 <lgl>, …

# Create a column for the diversity score
censusdata <- censusdata |>
  mutate(diversity_score = if_else(most_populous_pct == 0, NA_real_,
                                   1 - (most_populous_pct / 100)^2))

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 36
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <dbl>       <dbl>       <dbl>        <dbl>
1 Census Tract 201    Autauga … Alab…  1775        1389         213            5
2 Census Tract 202    Autauga … Alab…  2055         842        1104            2
3 Census Tract 203    Autauga … Alab…  3216        2244         714           12
4 Census Tract 204    Autauga … Alab…  4246        3578         327           17
5 Census Tract 205.01 Autauga … Alab…  4322        3241         632           29
6 Census Tract 205.02 Autauga … Alab…  3284        1894         907           15
# ℹ 29 more variables: total_asian <dbl>, total_pacific <dbl>,
#   total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
#   county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
#   pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
#   pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>,
#   most_populous_race <chr>, most_populous_race_simple <chr>,
#   is_pct_above_33.33333333 <lgl>, is_pct_above_50 <lgl>, …

Give all rows with 0 population null data

# All rows with 0 population get null for most columns
censusdata <- censusdata |>
  mutate(across(4:11, ~ ifelse(total == 0, NA, .))) |> # Columns total through total_multiple
  mutate(across(17:36, ~ ifelse(total == 0, NA, .))) # Columns pct_white through diversity_score

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 36
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <dbl>       <dbl>       <dbl>        <dbl>
1 Census Tract 201    Autauga … Alab…  1775        1389         213            5
2 Census Tract 202    Autauga … Alab…  2055         842        1104            2
3 Census Tract 203    Autauga … Alab…  3216        2244         714           12
4 Census Tract 204    Autauga … Alab…  4246        3578         327           17
5 Census Tract 205.01 Autauga … Alab…  4322        3241         632           29
6 Census Tract 205.02 Autauga … Alab…  3284        1894         907           15
# ℹ 29 more variables: total_asian <dbl>, total_pacific <dbl>,
#   total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
#   county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
#   pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
#   pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>,
#   most_populous_race <chr>, most_populous_race_simple <chr>,
#   is_pct_above_33.33333333 <lgl>, is_pct_above_50 <lgl>, …

Create and visualize four different linear regressions

# Create a regression model for total versus most_populous_pct
model1 <- lm(total ~ most_populous_pct, data = na.omit(censusdata))
summary(model1)


Call:
lm(formula = total ~ most_populous_pct, data = na.omit(censusdata))

Residuals:
   Min     1Q Median     3Q    Max 
 -4519  -1174   -172    976  33988 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       4848.7436    21.8210  222.21   <2e-16 ***
most_populous_pct  -13.0257     0.3045  -42.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1670 on 84779 degrees of freedom
Multiple R-squared:  0.02112,   Adjusted R-squared:  0.02111 
F-statistic:  1829 on 1 and 84779 DF,  p-value: < 2.2e-16

# Create a regression model for total versus diversity_score
model2 <- lm(total ~ diversity_score, data = na.omit(censusdata))
summary(model2)


Call:
lm(formula = total ~ diversity_score, data = na.omit(censusdata))

Residuals:
   Min     1Q Median     3Q    Max 
 -4395  -1173   -171    978  33956 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      3462.16      12.55  275.88   <2e-16 ***
diversity_score   998.97      22.94   43.54   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1670 on 84779 degrees of freedom
Multiple R-squared:  0.02188,   Adjusted R-squared:  0.02186 
F-statistic:  1896 on 1 and 84779 DF,  p-value: < 2.2e-16

# Create a regression model for pct_white versus most_populous_pct, not including rows wherein white is the dominant race
model3 <- na.omit(censusdata) |>
  filter(most_populous_race != "White") %>%
    lm(pct_white ~ most_populous_pct, data = .)
summary(model3)


Call:
lm(formula = pct_white ~ most_populous_pct, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.537  -5.705  -1.528   5.642  27.045 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       39.624584   0.210347   188.4   <2e-16 ***
most_populous_pct -0.363495   0.003615  -100.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.872 on 18089 degrees of freedom
Multiple R-squared:  0.3585,    Adjusted R-squared:  0.3585 
F-statistic: 1.011e+04 on 1 and 18089 DF,  p-value: < 2.2e-16

# Create a regression model for pct_white versus diversity_score, not including rows wherein white is the dominant race
model4 <- na.omit(censusdata) |>
  filter(most_populous_race != "White") %>%
    lm(pct_white ~ diversity_score, data = .)
summary(model4)


Call:
lm(formula = pct_white ~ diversity_score, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-27.9692  -5.8163  -0.8111   5.4187  26.3145 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -0.6525     0.1981  -3.295 0.000987 ***
diversity_score  30.5298     0.2832 107.793  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.644 on 18089 degrees of freedom
Multiple R-squared:  0.3911,    Adjusted R-squared:  0.3911 
F-statistic: 1.162e+04 on 1 and 18089 DF,  p-value: < 2.2e-16

# Visualize model1 with a scatterplot and regression line
na.omit(censusdata) |>
  ggplot(aes(x = most_populous_pct, y = total)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Regression of Total Population by Most Populous Percentage",
       x = "Most Populous Percentage",
       y = "Total Population",
       caption = "Source: U.S. Decennial Census 2020") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

# Visualize model2 with a scatterplot and regression line
na.omit(censusdata) |>
  ggplot(aes(x = diversity_score, y = total)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Regression of Total Population by Diversity Score",
       x = "Diversity Score",
       y = "Total Population",
       caption = "Source: U.S. Decennial Census 2020") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

# Visualize model3 with a scatterplot and regression line
na.omit(censusdata) |>
  filter(most_populous_race != "White") |>
    ggplot(aes(x = most_populous_pct, y = pct_white)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = "Regression of White Percentage by Most Populous Percentage",
         x = "Most Populous Percentage",
         y = "Percent White",
         caption = "Source: U.S. Decennial Census 2020") +
    theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

# Visualize model4 with a scatterplot and regression line
na.omit(censusdata) |>
  filter(most_populous_race != "White") |>
    ggplot(aes(x = diversity_score, y = pct_white)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = "Regression of White Percentage by Diversity Score",
         x = "Diversity Score",
         y = "Percent White",
         caption = "Source: U.S. Decennial Census 2020") +
    theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

Try linear regressions for just Maryland

# Create a new tibble for only Maryland's data
marylanddata <- censusdata |>
  filter(state == "Maryland")

# Create a regression model for total versus most_populous_pct
marylandmodel1 <- lm(total ~ most_populous_pct, data = na.omit(marylanddata))
summary(marylandmodel1)


Call:
lm(formula = total ~ most_populous_pct, data = na.omit(marylanddata))

Residuals:
    Min      1Q  Median      3Q     Max 
-4230.9 -1219.0  -131.7  1100.4 10744.3 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5585.306    163.496  34.162   <2e-16 ***
most_populous_pct  -20.226      2.353  -8.595   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1666 on 1458 degrees of freedom
Multiple R-squared:  0.04822,   Adjusted R-squared:  0.04757 
F-statistic: 73.87 on 1 and 1458 DF,  p-value: < 2.2e-16

# Create a regression model for total versus diversity_score
marylandmodel2 <- lm(total ~ diversity_score, data = na.omit(marylanddata))
summary(marylandmodel2)


Call:
lm(formula = total ~ diversity_score, data = na.omit(marylanddata))

Residuals:
    Min      1Q  Median      3Q     Max 
-4286.3 -1200.6  -135.5  1087.9 10686.9 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3402.0      103.9  32.728   <2e-16 ***
diversity_score   1602.6      182.5   8.783   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1664 on 1458 degrees of freedom
Multiple R-squared:  0.05025,   Adjusted R-squared:  0.0496 
F-statistic: 77.14 on 1 and 1458 DF,  p-value: < 2.2e-16

# Create a regression model for pct_white versus most_populous_pct, not including rows wherein white is the dominant race
marylandmodel3 <- na.omit(marylanddata) |>
  filter(most_populous_race != "White") %>%
    lm(pct_white ~ most_populous_pct, data = .)
summary(marylandmodel3)


Call:
lm(formula = pct_white ~ most_populous_pct, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.8587  -3.8365  -0.7349   4.0721  24.8221 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       40.95290    1.06251   38.54   <2e-16 ***
most_populous_pct -0.41083    0.01566  -26.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.277 on 520 degrees of freedom
Multiple R-squared:  0.5696,    Adjusted R-squared:  0.5687 
F-statistic: 688.1 on 1 and 520 DF,  p-value: < 2.2e-16

# Create a regression model for pct_white versus diversity_score, not including rows wherein white is the dominant race
marylandmodel4 <- na.omit(marylanddata) |>
  filter(most_populous_race != "White") %>%
    lm(pct_white ~ diversity_score, data = .)
summary(marylandmodel4)


Call:
lm(formula = pct_white ~ diversity_score, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.4073  -3.9988   0.1995   3.2814  24.2460 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -3.3346     0.7237  -4.607 5.13e-06 ***
diversity_score  32.7879     1.2094  27.111  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.139 on 520 degrees of freedom
Multiple R-squared:  0.5857,    Adjusted R-squared:  0.5849 
F-statistic:   735 on 1 and 520 DF,  p-value: < 2.2e-16

# Visualize marylandmodel1 with a scatterplot and regression line
na.omit(marylanddata) |>
  ggplot(aes(x = most_populous_pct, y = total)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Regression of Total Population of Maryland by Most Populous Percentage",
       x = "Most Populous Percentage",
       y = "Total Population",
       caption = "Source: U.S. Decennial Census 2020") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

# Visualize marylandmodel2 with a scatterplot and regression line
na.omit(marylanddata) |>
  filter(most_populous_race != "White") |>
  ggplot(aes(x = diversity_score, y = total)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Regression of Total Population of Maryland by Diversity Score",
       x = "Diversity Score",
       y = "Total Population",
       caption = "Source: U.S. Decennial Census 2020") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

# Visualize marylandmodel3 with a scatterplot and regression line
na.omit(marylanddata) |>
  filter(most_populous_race != "White") |>
    ggplot(aes(x = most_populous_pct, y = pct_white)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = "Regression of White Percentage of Maryland by Most Populous Percentage",
         x = "Most Populous Percentage",
         y = "Percent White",
         caption = "Source: U.S. Decennial Census 2020") +
    theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

# Visualize marylandmodel4 with a scatterplot and regression line
na.omit(marylanddata) |>
  filter(most_populous_race != "White") |>
    ggplot(aes(x = diversity_score, y = pct_white)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = "Regression of White Percentage of Maryland by Diversity Score",
         x = "Diversity Score",
         y = "Percent White",
         caption = "Source: U.S. Decennial Census 2020") +
    theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

Try linear regressions for just the District of Columbia

# Create a new tibble for only DC's data
dcdata <- censusdata |>
  filter(state == "District of Columbia")

# Create a regression model for total versus most_populous_pct
dcmodel1 <- lm(total ~ most_populous_pct, data = na.omit(dcdata))
summary(dcmodel1)


Call:
lm(formula = total ~ most_populous_pct, data = na.omit(dcdata))

Residuals:
    Min      1Q  Median      3Q     Max 
-3187.5  -710.8  -116.8   632.8  3513.2 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3778.198    296.524   12.74   <2e-16 ***
most_populous_pct   -6.402      4.268   -1.50    0.135    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1053 on 204 degrees of freedom
Multiple R-squared:  0.01091,   Adjusted R-squared:  0.006058 
F-statistic: 2.249 on 1 and 204 DF,  p-value: 0.1352

# Create a regression model for total versus diversity_score
dcmodel2 <- lm(total ~ diversity_score, data = na.omit(dcdata))
summary(dcmodel2)


Call:
lm(formula = total ~ diversity_score, data = na.omit(dcdata))

Residuals:
    Min      1Q  Median      3Q     Max 
-3200.7  -699.1  -120.5   627.5  3500.1 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3106.1      179.7  17.287   <2e-16 ***
diversity_score    466.3      317.0   1.471    0.143    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1054 on 204 degrees of freedom
Multiple R-squared:  0.0105,    Adjusted R-squared:  0.005645 
F-statistic: 2.164 on 1 and 204 DF,  p-value: 0.1428

# Create a regression model for pct_white versus most_populous_pct, not including rows wherein white is the dominant race
dcmodel3 <- na.omit(dcdata) |>
  filter(most_populous_race != "White") %>%
    lm(pct_white ~ most_populous_pct, data = .)
summary(dcmodel3)


Call:
lm(formula = pct_white ~ most_populous_pct, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.3237  -1.5041   0.2876   1.3497  12.2462 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       57.22383    1.98178   28.88   <2e-16 ***
most_populous_pct -0.60093    0.02633  -22.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.189 on 95 degrees of freedom
Multiple R-squared:  0.8457,    Adjusted R-squared:  0.8441 
F-statistic: 520.7 on 1 and 95 DF,  p-value: < 2.2e-16

# Create a regression model for pct_white versus diversity_score, not including rows wherein white is the dominant race
dcmodel4 <- na.omit(dcdata) |>
  filter(most_populous_race != "White") %>%
    lm(pct_white ~ diversity_score, data = .)
summary(dcmodel4)


Call:
lm(formula = pct_white ~ diversity_score, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.121  -1.591   0.372   1.853  13.019 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5.230      1.002   -5.22 1.05e-06 ***
diversity_score   43.485      1.950   22.30  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.291 on 95 degrees of freedom
Multiple R-squared:  0.8396,    Adjusted R-squared:  0.8379 
F-statistic: 497.3 on 1 and 95 DF,  p-value: < 2.2e-16

# Visualize dcmodel1 with a scatterplot and regression line
na.omit(dcdata) |>
  ggplot(aes(x = most_populous_pct, y = total)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Regression of Total Population of the District of Columbia by Most Populous Percentage",
       x = "Most Populous Percentage",
       y = "Total Population",
       caption = "Source: U.S. Decennial Census 2020") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

# Visualize dcmodel2 with a scatterplot and regression line
na.omit(dcdata) |>
  ggplot(aes(x = diversity_score, y = total)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Regression of Total Population of the District of Colubmia by Diversity Score",
       x = "Diversity Score",
       y = "Total Population",
       caption = "Source: U.S. Decennial Census 2020") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

# Visualize dcmodel3 with a scatterplot and regression line
na.omit(dcdata) |>
  filter(most_populous_race != "White") |>
    ggplot(aes(x = most_populous_pct, y = pct_white)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = "Regression of White Percentage of the District of Colubmia by Most Populous Percentage",
         x = "Most Populous Percentage",
         y = "Percent White",
       caption = "Source: U.S. Decennial Census 2020") +
    theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

# Visualize dcmodel4 with a scatterplot and regression line
na.omit(dcdata) |>
  filter(most_populous_race != "White") |>
    ggplot(aes(x = diversity_score, y = pct_white)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = "Regression of White Percentage of the District of Colubmia by Diversity Score",
         x = "Diversity Score",
         y = "Percent White",
       caption = "Source: U.S. Decennial Census 2020") +
    theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

Use linear regression to predict diversity based on race percentages

# Create a linear model that predicts the diversity score based on the percentages of each race
# The predictions are not useful for very small census tracts, so the minimum population will be 50
diversity_model <- na.omit(censusdata) %>%
  filter(total >= 50) %>% # Ensure at least 50 population
    lm(diversity_score ~ pct_white + pct_black + pct_native + pct_asian + pct_pacific + pct_other + pct_multiple, data = .)
summary(diversity_model)


Call:
lm(formula = diversity_score ~ pct_white + pct_black + pct_native + 
    pct_asian + pct_pacific + pct_other + pct_multiple, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.97297 -0.09346  0.00610  0.09683  0.42282 

Coefficients: (1 not defined because of singularities)
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.357e+00  7.146e-03  189.94   <2e-16 ***
pct_white    -1.164e-02  7.703e-05 -151.18   <2e-16 ***
pct_black    -7.995e-03  7.694e-05 -103.91   <2e-16 ***
pct_native   -8.693e-03  1.328e-04  -65.47   <2e-16 ***
pct_asian    -2.498e-03  9.269e-05  -26.95   <2e-16 ***
pct_pacific  -6.797e-03  4.713e-04  -14.42   <2e-16 ***
pct_other    -2.731e-03  1.171e-04  -23.32   <2e-16 ***
pct_multiple         NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1475 on 84447 degrees of freedom
Multiple R-squared:  0.6512,    Adjusted R-squared:  0.6512 
F-statistic: 2.628e+04 on 6 and 84447 DF,  p-value: < 2.2e-16

# Generate predictions for the visualization
censusdata <- censusdata %>%
  mutate(predicted_diversity = predict(diversity_model, newdata = .),
         predicted_diversity = pmax(0, pmin(predicted_diversity, 1))) # Constrain values between 0 and 1

# Check the header of the tibble
head(censusdata)

# A tibble: 6 × 37
  census_tract        county    state total total_white total_black total_native
  <chr>               <chr>     <chr> <dbl>       <dbl>       <dbl>        <dbl>
1 Census Tract 201    Autauga … Alab…  1775        1389         213            5
2 Census Tract 202    Autauga … Alab…  2055         842        1104            2
3 Census Tract 203    Autauga … Alab…  3216        2244         714           12
4 Census Tract 204    Autauga … Alab…  4246        3578         327           17
5 Census Tract 205.01 Autauga … Alab…  4322        3241         632           29
6 Census Tract 205.02 Autauga … Alab…  3284        1894         907           15
# ℹ 30 more variables: total_asian <dbl>, total_pacific <dbl>,
#   total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
#   county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
#   pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
#   pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>,
#   most_populous_race <chr>, most_populous_race_simple <chr>,
#   is_pct_above_33.33333333 <lgl>, is_pct_above_50 <lgl>, …

# Create a long tibble for the plot
# The predictions are not useful for very small census tracts, so the minimum population will be 50
predictions_long <- na.omit(censusdata) |>
  filter(total >= 50) |>  # Ensure at least 50 population
  select(1:11, 17:24, 35:37) |>  # Select specific columns
  pivot_longer(cols = starts_with("pct_"), names_to = "race", values_to = "percentage") |>
  mutate(race = recode(race,
    "pct_white"   = "White",
    "pct_black"   = "Black or African American",
    "pct_native"  = "American Indian or Alaska Native",
    "pct_asian"   = "Asian",
    "pct_pacific" = "Pacific Islander",
    "pct_other"   = "Other",
    "pct_multiple"= "Multiple Races"))

Check the accuracy of the predicted diversity scores

# Create a clean dataset
censusdata_clean <- na.omit(censusdata) |>
  filter(total >= 50)

# Create a column for residuals when comparing predicted and actual diversity scores
censusdata_clean <- censusdata_clean |>
  mutate(residuals = diversity_score - predicted_diversity)

# Visualize the residuals
ggplot(censusdata_clean, aes(x = residuals)) +
  geom_histogram(binwidth = 0.05, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Residuals (Actual - Predicted Diversity Score)",
       x = "Residuals",
       y = "Count",
       caption = "Source: U.S. Decennial Census 2020") +
  theme_minimal()

# Get a summary of the residuals
summary(censusdata_clean$residuals)

      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-0.8649076 -0.0934634  0.0060953  0.0005969  0.0968255  0.4228240

With a median of only 0.006 and low first and third quartiles, we can see that the predictions are fairly accurate.

Visualize the effects of the prevalence of each race on the census tract’s diversity, based on the predictions

# Plot the predicted diversity scores versus the most populous race percentage by race
ggplot(predictions_long, aes(x = percentage, y = predicted_diversity, color = race)) +
  geom_point() +
  geom_smooth(method = "lm", aes(group = race), color = "black") +
  labs(title = "Predicted Diversity Score by Race Percentage",
       x = "Percentage of Race",
       y = "Predicted Diversity Score",
       color = "Race",
       caption = "Source: U.S. Decennial Census 2020") +
  theme_minimal() +
  theme(legend.position = "none") +
  facet_wrap(~ race, scales = "free_x") +
  ylim(0, 1)

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 161 rows containing missing values or values outside the scale range
(`geom_smooth()`).

Ensure all columns are the correct format before exporting

# Reformat all columns to avoid errors
censusdata <- censusdata |>
  mutate(
    census_tract = as.character(census_tract),
    county = as.character(county),
    state = as.character(state),
    total = as.numeric(total),
    total_white = as.numeric(total_white),
    total_black = as.numeric(total_black),
    total_native = as.numeric(total_native),
    total_asian = as.numeric(total_asian),
    total_pacific = as.numeric(total_pacific),
    total_other = as.numeric(total_other),
    total_multiple = as.numeric(total_multiple),
    statefp = as.character(statefp),
    countyfp = as.character(countyfp),
    county_geoid = as.character(county_geoid),
    padded_census_tract = as.character(padded_census_tract),
    census_tract_geoid = as.character(census_tract_geoid),
    pct_white = as.numeric(pct_white),
    pct_black = as.numeric(pct_black),
    pct_native = as.numeric(pct_native),
    pct_asian = as.numeric(pct_asian),
    pct_pacific = as.numeric(pct_pacific),
    pct_other = as.numeric(pct_other),
    pct_multiple = as.numeric(pct_multiple),
    most_populous_race = as.character(most_populous_race),
    most_populous_race_simple = as.character(most_populous_race_simple),
    is_pct_above_33.33333333 = as.logical(is_pct_above_33.33333333),
    is_pct_above_50 = as.logical(is_pct_above_50),
    is_pct_above_66.66666667 = as.logical(is_pct_above_66.66666667),
    is_pct_above_75 = as.logical(is_pct_above_70),
    is_pct_above_80 = as.logical(is_pct_above_80),
    is_pct_above_90 = as.logical(is_pct_above_90),
    is_pct_above_95 = as.logical(is_pct_above_95),
    is_pct_above_98 = as.logical(is_pct_above_98),
    most_populous_pct = as.numeric(most_populous_pct),
    diversity_score = as.numeric(diversity_score),
    predicted_diversity = as.numeric(predicted_diversity)
  )

Write the prepared data file to the directory

# Write the changed file to the directory folder as "P1 RACE 2020 NEW.csv"
censusdata |>
  write_csv(file = "P1 RACE 2020 NEW.csv")

The process of cleaning the dataset was rather simple. As it comes from the decennial census, the data was already clean. As such, I simply fixed the headers with row_to_names() and clean_names(). However, the process of preparing and processing the data took much longer. I did the following things to prepare and process the data prior to visualization and exporting: removing the unnecessary columns, which came with the dataset; separating the geographic_area_name column into census_tract, county, and state; renaming the columns to be more useful; reformatting the numeric columns as numeric; getting FIPS codes for the states and counties; creating padded census tract codes; using the state and counties FIPS codes and the padded census tract codes to create geoIDs for each census tract, which will be used to connect the census data to TIGER/Line Shapefiles; creating columns for the percentage of the total population each race occupies; creating columns for the most populous race of each census tract; create columns for measuring the overall racial diversity of a census tract; and giving the unpopulated counties NAs in most columns.

The linear regression visualizations turned out to be less useful than I wanted. However, the section after that, using the predicted diversity scores to determine the influence each race’s prevalence has on overall diversity, proved useful. Through linear regression, I found that the prevalence of Asians, Pacific Islanders or Hawaiian Natives, and those who report multiple or other races has a positive correlation with the diversity of an area. Additionally, the prevalence of Blacks or African Americans and American Indians or Alaska Natives also have a positive correlation with increased diversity, though not as much of an increase as the prior four. Conversely, the prevalence of Whites has a strong negative correlation with decreased diversity, which is unsurprising, given the dominance of the White race in most areas of the United States. I also checked the accuracy of the predicted diversity scores to ensure that this analysis would be accurate. The resulting residuals had a median close to zero and small first and third quartiles, meaning that it is accurate enough for this purpose.

The two maps on Tableau consist of: a map of the Southern Mid-Atlantic Region census tracts, colorized to represent the most populous race in each census tract and a map of the Southern Mid-Atlantic Region shaded to represent the overall racial diversity (diversity score — (1 - (most_populous_pct / 100)^2)) of each census tract. The prevalence of Whites in most areas, especially in West Virginia, Virginia, and Delaware, as well as the prevalence of Blacks in Baltimore City, Prince George’s County, the District of Columbia, and Southern and Southeastern Virginia, were unsurprising. However, the lack of racial diversity in certain areas of the District of Columbia, Baltimore City, and Prince George’s County was startling because urban areas tend to be more diverse. Additionally, the racial diversity in Southern and Southeastern Virginia and Delaware was surprising because those areas are (or, I at least imagine them to be) largely rural, and rural areas are most often less diverse than more urban areas. This dichotomy was an interesting and unexpected discovery.

I had attempted to use a linear curve for the line of best fit in the diversity score versus White percentage/total population, but my several tries were unsuccessful, so I opted to leave it as a linear vector.

Note: ChatGPT was used in the creation of this project