# Set the working directory
setwd("C:/Users/dodgy/OneDrive/School/DATA 110/Projects/Project 1")
Project 1
Project 1: Exploring the Racial Makeup of the Southern Mid-Atlantic Region
This project will explore the racial diversity in Delaware, the District of Columbia, Maryland, Virginia, and West Virginia — the southern Mid-Atlantic region of the United States. The project utilizes a report from the decennial census (2020) focused on race, divided by census tract — a small political subdivision. The variables are the census tract geoID, the population of each race by census tract (from which percentages are determined), and a calculated diversity score, based on the percentage of a census tract’s population which the most populous race occupies. The races are classified thusly: White, Black or African American, American Indian or Alaska Native, Asian, Pacific Islander or Native Hawaiian, other, and multiple races. The project also uses TIGER/Line shapefiles from the census to map the racial diversity using Tableau. The first map shows the most populous race in each census tract with colors and the second contrasts the levels of diversity by census tract through shading.
Dataset source: decennial census report P1 “RACE” (2020), collected from data.census.gov Shapefiles source: TIGER/Line Shapefiles, collected from www.census.gov
Set up the project
# Open the necessary libraries
library(tidyverse)
library(ggplot2)
library(janitor)
library(stringr)
library(tigris)
# Open the data file
<- read_csv("P1 RACE 2020.csv") censusdata
New names:
Rows: 85396 Columns: 74
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(73): GEO_ID, NAME, P1_001N, P1_002N, P1_003N, P1_004N, P1_005N, P1_006N... lgl
(1): ...74
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...74`
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 74
GEO_ID NAME P1_001N P1_002N P1_003N P1_004N P1_005N P1_006N P1_007N P1_008N
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Geograp… Geog… !!Tota… !!Tota… !!Tota… !!Tota… !!Tota… !!Tota… !!Tota… !!Tota…
2 1400000… Cens… 1775 1653 1389 213 5 8 3 35
3 1400000… Cens… 2055 1984 842 1104 2 12 4 20
4 1400000… Cens… 3216 3039 2244 714 12 14 6 49
5 1400000… Cens… 4246 3993 3578 327 17 32 1 38
6 1400000… Cens… 4322 4055 3241 632 29 93 2 58
# ℹ 64 more variables: P1_009N <chr>, P1_010N <chr>, P1_011N <chr>,
# P1_012N <chr>, P1_013N <chr>, P1_014N <chr>, P1_015N <chr>, P1_016N <chr>,
# P1_017N <chr>, P1_018N <chr>, P1_019N <chr>, P1_020N <chr>, P1_021N <chr>,
# P1_022N <chr>, P1_023N <chr>, P1_024N <chr>, P1_025N <chr>, P1_026N <chr>,
# P1_027N <chr>, P1_028N <chr>, P1_029N <chr>, P1_030N <chr>, P1_031N <chr>,
# P1_032N <chr>, P1_033N <chr>, P1_034N <chr>, P1_035N <chr>, P1_036N <chr>,
# P1_037N <chr>, P1_038N <chr>, P1_039N <chr>, P1_040N <chr>, …
Clean/prepare the data
# Clean the headers
<- censusdata |>
censusdata row_to_names(1, remove_row = TRUE)
<- censusdata |>
censusdata clean_names()
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 74
geography geographic_area_name total total_population_of_…¹
<chr> <chr> <chr> <chr>
1 1400000US01001020100 Census Tract 201, Autauga C… 1775 1653
2 1400000US01001020200 Census Tract 202, Autauga C… 2055 1984
3 1400000US01001020300 Census Tract 203, Autauga C… 3216 3039
4 1400000US01001020400 Census Tract 204, Autauga C… 4246 3993
5 1400000US01001020501 Census Tract 205.01, Autaug… 4322 4055
6 1400000US01001020502 Census Tract 205.02, Autaug… 3284 3092
# ℹ abbreviated name: ¹total_population_of_one_race
# ℹ 70 more variables: total_population_of_one_race_white_alone <chr>,
# total_population_of_one_race_black_or_african_american_alone <chr>,
# total_population_of_one_race_american_indian_and_alaska_native_alone <chr>,
# total_population_of_one_race_asian_alone <chr>,
# total_population_of_one_race_native_hawaiian_and_other_pacific_islander_alone <chr>,
# total_population_of_one_race_some_other_race_alone <chr>, …
# Remove unnecessary columns
12:ncol(censusdata)] <- NULL
censusdata[,$total_population_of_one_race <- NULL
censusdata$geography <- NULL
censusdata
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 9
geographic_area_name total total_population_of_…¹ total_population_of_…²
<chr> <chr> <chr> <chr>
1 Census Tract 201, Autauga… 1775 1389 213
2 Census Tract 202, Autauga… 2055 842 1104
3 Census Tract 203, Autauga… 3216 2244 714
4 Census Tract 204, Autauga… 4246 3578 327
5 Census Tract 205.01, Auta… 4322 3241 632
6 Census Tract 205.02, Auta… 3284 1894 907
# ℹ abbreviated names: ¹total_population_of_one_race_white_alone,
# ²total_population_of_one_race_black_or_african_american_alone
# ℹ 5 more variables:
# total_population_of_one_race_american_indian_and_alaska_native_alone <chr>,
# total_population_of_one_race_asian_alone <chr>,
# total_population_of_one_race_native_hawaiian_and_other_pacific_islander_alone <chr>,
# total_population_of_one_race_some_other_race_alone <chr>, …
# Create new columns for state, county, and census tract number
<- censusdata |>
censusdata separate(geographic_area_name, into = c("census_tract", "county", "state"), sep = ", ")
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 11
census_tract county state total total_population_of_…¹ total_population_of_…²
<chr> <chr> <chr> <chr> <chr> <chr>
1 Census Tract… Autau… Alab… 1775 1389 213
2 Census Tract… Autau… Alab… 2055 842 1104
3 Census Tract… Autau… Alab… 3216 2244 714
4 Census Tract… Autau… Alab… 4246 3578 327
5 Census Tract… Autau… Alab… 4322 3241 632
6 Census Tract… Autau… Alab… 3284 1894 907
# ℹ abbreviated names: ¹total_population_of_one_race_white_alone,
# ²total_population_of_one_race_black_or_african_american_alone
# ℹ 5 more variables:
# total_population_of_one_race_american_indian_and_alaska_native_alone <chr>,
# total_population_of_one_race_asian_alone <chr>,
# total_population_of_one_race_native_hawaiian_and_other_pacific_islander_alone <chr>,
# total_population_of_one_race_some_other_race_alone <chr>, …
# Improve column names
colnames(censusdata) <-
c("census_tract", "county", "state", "total", "total_white",
"total_black", "total_native", "total_asian", "total_pacific",
"total_other", "total_multiple")
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 11
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 4 more variables: total_asian <chr>, total_pacific <chr>,
# total_other <chr>, total_multiple <chr>
# Reformat columns as numeric
4:ncol(censusdata)] <-
censusdata[lapply(censusdata[4:ncol(censusdata)], as.numeric)
# Get the relevant data using tigris
<- counties(cb = TRUE, year = 2020, progress_bar = FALSE) |>
countiesdata select(STATEFP, COUNTYFP, GEOID, NAMELSAD, STATE_NAME)
# Remove automatic, but unnecessary geometry column
$geometry <- NULL
countiesdata
# Create a column for state FIPS codes
<- censusdata |>
censusdata left_join(countiesdata |>
distinct(STATEFP, STATE_NAME), by = c("state" = "STATE_NAME")) |>
rename(statefp = STATEFP)
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 12
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 5 more variables: total_asian <dbl>, total_pacific <dbl>,
# total_other <dbl>, total_multiple <dbl>, statefp <chr>
# Create a column for county FIPS codes
<- censusdata |>
censusdata left_join(countiesdata |>
select(STATEFP, COUNTYFP, NAMELSAD), by = c("statefp" = "STATEFP", "county" = "NAMELSAD")) |>
rename(countyfp = COUNTYFP)
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 13
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 6 more variables: total_asian <dbl>, total_pacific <dbl>,
# total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>
# Create a column for county geoID codes
<- censusdata |>
censusdata left_join(countiesdata |>
select(STATEFP, NAMELSAD, GEOID), by = c("statefp" = "STATEFP", "county" = "NAMELSAD")) |>
rename(county_geoid = GEOID)
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 14
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 7 more variables: total_asian <dbl>, total_pacific <dbl>,
# total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
# county_geoid <chr>
# Create a column for padded_census_tract with "Census Tract " removed
<- censusdata |>
censusdata mutate(padded_census_tract = str_remove(census_tract, "Census Tract "))
# Ensure padded_census_tract has two digits after the period by padding with zeroes
<- censusdata |>
censusdata mutate(padded_census_tract = if_else(
str_detect(padded_census_tract, "\\."),
if_else( # If there is a decimal point, check how many digits are after it
str_length(str_extract(padded_census_tract, "(?<=\\.)\\d+")) == 2,
# If there are already 2 digits after the decimal point, keep it as is
padded_census_tract, str_replace(padded_census_tract, "(\\d+)(\\.\\d+)?", "\\1\\20")), # If there's only 1 digit after the decimal point, add a 0
paste0(padded_census_tract, ".00"))) # If there's no decimal point, just append ".00"
# Remove the decimal point from padded_census_tract
<- censusdata |>
censusdata mutate(padded_census_tract = str_remove(padded_census_tract, "\\."))
# Ensure padded_census_tract is six digits total by padding with zeroes
<- censusdata |>
censusdata mutate(padded_census_tract = str_pad(padded_census_tract, width = 6, pad = "0", side = "left"))
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 15
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 8 more variables: total_asian <dbl>, total_pacific <dbl>,
# total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
# county_geoid <chr>, padded_census_tract <chr>
# Create a new column called census_tract_geoid that combines county_geoid and padded_census_tract
<- censusdata |>
censusdata mutate(census_tract_geoid = paste0(county_geoid, padded_census_tract))
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 16
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 9 more variables: total_asian <dbl>, total_pacific <dbl>,
# total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
# county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>
Create columns for the percentages of each race
# Define the races and columns
<- c("total_white", "total_black", "total_native", "total_asian", "total_pacific", "total_other", "total_multiple")
race_columns <- c("white", "black", "native", "asian", "pacific", "other", "multiple")
races_simple <- c("White", "Black or African American", "American Indian or Alaska Native", "Asian", "Pacific Islander", "Other", "Multiple Races")
races
# Create columns for the percentage (out of 100) of the total population each race occupies
<- censusdata |>
censusdata mutate(across(all_of(race_columns),
~ if_else(total == 0, 0, . / total * 100),
.names = "pct_{str_remove(.col, '^total_')}"))
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 23
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 16 more variables: total_asian <dbl>, total_pacific <dbl>,
# total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
# county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
# pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
# pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>
# Create a column for most populous race
<- censusdata |>
censusdata rowwise() |>
mutate(most_populous_race = races[which.max(c_across(all_of(race_columns)))]) |>
ungroup()
# Create a column for most populous race, simplified
<- censusdata |>
censusdata rowwise() |>
mutate(most_populous_race_simple = races_simple[which.max(c_across(all_of(race_columns)))]) |>
ungroup()
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 25
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 18 more variables: total_asian <dbl>, total_pacific <dbl>,
# total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
# county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
# pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
# pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>,
# most_populous_race <chr>, most_populous_race_simple <chr>
Create columns for measuring the diversity of a census tract
# Define thresholds
<- c(33.33333333, 50, 66.66666667, 70, 75, 80, 90, 95, 98)
thresholds # Create Boolean columns for each percentage threshold
for (threshold in thresholds) {
<- censusdata |>
censusdata mutate(!!paste0("is_pct_above_", threshold) := rowSums(across(all_of(paste0("pct_", races_simple))) > threshold) > 0)
}
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 34
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 27 more variables: total_asian <dbl>, total_pacific <dbl>,
# total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
# county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
# pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
# pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>,
# most_populous_race <chr>, most_populous_race_simple <chr>,
# is_pct_above_33.33333333 <lgl>, is_pct_above_50 <lgl>, …
# Define race percentage columns
<- c("pct_white", "pct_black", "pct_native", "pct_asian", "pct_pacific", "pct_other", "pct_multiple")
race_percentages
# Create a column for the percentage of the most populous race
<- censusdata |>
censusdata rowwise() |>
mutate(most_populous_pct = max(c_across(all_of(race_percentages)))) |>
ungroup()
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 35
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 28 more variables: total_asian <dbl>, total_pacific <dbl>,
# total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
# county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
# pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
# pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>,
# most_populous_race <chr>, most_populous_race_simple <chr>,
# is_pct_above_33.33333333 <lgl>, is_pct_above_50 <lgl>, …
# Create a column for the diversity score
<- censusdata |>
censusdata mutate(diversity_score = if_else(most_populous_pct == 0, NA_real_,
1 - (most_populous_pct / 100)^2))
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 36
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 29 more variables: total_asian <dbl>, total_pacific <dbl>,
# total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
# county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
# pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
# pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>,
# most_populous_race <chr>, most_populous_race_simple <chr>,
# is_pct_above_33.33333333 <lgl>, is_pct_above_50 <lgl>, …
Give all rows with 0 population null data
# All rows with 0 population get null for most columns
<- censusdata |>
censusdata mutate(across(4:11, ~ ifelse(total == 0, NA, .))) |> # Columns total through total_multiple
mutate(across(17:36, ~ ifelse(total == 0, NA, .))) # Columns pct_white through diversity_score
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 36
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 29 more variables: total_asian <dbl>, total_pacific <dbl>,
# total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
# county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
# pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
# pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>,
# most_populous_race <chr>, most_populous_race_simple <chr>,
# is_pct_above_33.33333333 <lgl>, is_pct_above_50 <lgl>, …
Create and visualize four different linear regressions
# Create a regression model for total versus most_populous_pct
<- lm(total ~ most_populous_pct, data = na.omit(censusdata))
model1 summary(model1)
Call:
lm(formula = total ~ most_populous_pct, data = na.omit(censusdata))
Residuals:
Min 1Q Median 3Q Max
-4519 -1174 -172 976 33988
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4848.7436 21.8210 222.21 <2e-16 ***
most_populous_pct -13.0257 0.3045 -42.77 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1670 on 84779 degrees of freedom
Multiple R-squared: 0.02112, Adjusted R-squared: 0.02111
F-statistic: 1829 on 1 and 84779 DF, p-value: < 2.2e-16
# Create a regression model for total versus diversity_score
<- lm(total ~ diversity_score, data = na.omit(censusdata))
model2 summary(model2)
Call:
lm(formula = total ~ diversity_score, data = na.omit(censusdata))
Residuals:
Min 1Q Median 3Q Max
-4395 -1173 -171 978 33956
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3462.16 12.55 275.88 <2e-16 ***
diversity_score 998.97 22.94 43.54 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1670 on 84779 degrees of freedom
Multiple R-squared: 0.02188, Adjusted R-squared: 0.02186
F-statistic: 1896 on 1 and 84779 DF, p-value: < 2.2e-16
# Create a regression model for pct_white versus most_populous_pct, not including rows wherein white is the dominant race
<- na.omit(censusdata) |>
model3 filter(most_populous_race != "White") %>%
lm(pct_white ~ most_populous_pct, data = .)
summary(model3)
Call:
lm(formula = pct_white ~ most_populous_pct, data = .)
Residuals:
Min 1Q Median 3Q Max
-30.537 -5.705 -1.528 5.642 27.045
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.624584 0.210347 188.4 <2e-16 ***
most_populous_pct -0.363495 0.003615 -100.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.872 on 18089 degrees of freedom
Multiple R-squared: 0.3585, Adjusted R-squared: 0.3585
F-statistic: 1.011e+04 on 1 and 18089 DF, p-value: < 2.2e-16
# Create a regression model for pct_white versus diversity_score, not including rows wherein white is the dominant race
<- na.omit(censusdata) |>
model4 filter(most_populous_race != "White") %>%
lm(pct_white ~ diversity_score, data = .)
summary(model4)
Call:
lm(formula = pct_white ~ diversity_score, data = .)
Residuals:
Min 1Q Median 3Q Max
-27.9692 -5.8163 -0.8111 5.4187 26.3145
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.6525 0.1981 -3.295 0.000987 ***
diversity_score 30.5298 0.2832 107.793 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.644 on 18089 degrees of freedom
Multiple R-squared: 0.3911, Adjusted R-squared: 0.3911
F-statistic: 1.162e+04 on 1 and 18089 DF, p-value: < 2.2e-16
# Visualize model1 with a scatterplot and regression line
na.omit(censusdata) |>
ggplot(aes(x = most_populous_pct, y = total)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Regression of Total Population by Most Populous Percentage",
x = "Most Populous Percentage",
y = "Total Population",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
# Visualize model2 with a scatterplot and regression line
na.omit(censusdata) |>
ggplot(aes(x = diversity_score, y = total)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Regression of Total Population by Diversity Score",
x = "Diversity Score",
y = "Total Population",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
# Visualize model3 with a scatterplot and regression line
na.omit(censusdata) |>
filter(most_populous_race != "White") |>
ggplot(aes(x = most_populous_pct, y = pct_white)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Regression of White Percentage by Most Populous Percentage",
x = "Most Populous Percentage",
y = "Percent White",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
# Visualize model4 with a scatterplot and regression line
na.omit(censusdata) |>
filter(most_populous_race != "White") |>
ggplot(aes(x = diversity_score, y = pct_white)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Regression of White Percentage by Diversity Score",
x = "Diversity Score",
y = "Percent White",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Try linear regressions for just Maryland
# Create a new tibble for only Maryland's data
<- censusdata |>
marylanddata filter(state == "Maryland")
# Create a regression model for total versus most_populous_pct
<- lm(total ~ most_populous_pct, data = na.omit(marylanddata))
marylandmodel1 summary(marylandmodel1)
Call:
lm(formula = total ~ most_populous_pct, data = na.omit(marylanddata))
Residuals:
Min 1Q Median 3Q Max
-4230.9 -1219.0 -131.7 1100.4 10744.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5585.306 163.496 34.162 <2e-16 ***
most_populous_pct -20.226 2.353 -8.595 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1666 on 1458 degrees of freedom
Multiple R-squared: 0.04822, Adjusted R-squared: 0.04757
F-statistic: 73.87 on 1 and 1458 DF, p-value: < 2.2e-16
# Create a regression model for total versus diversity_score
<- lm(total ~ diversity_score, data = na.omit(marylanddata))
marylandmodel2 summary(marylandmodel2)
Call:
lm(formula = total ~ diversity_score, data = na.omit(marylanddata))
Residuals:
Min 1Q Median 3Q Max
-4286.3 -1200.6 -135.5 1087.9 10686.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3402.0 103.9 32.728 <2e-16 ***
diversity_score 1602.6 182.5 8.783 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1664 on 1458 degrees of freedom
Multiple R-squared: 0.05025, Adjusted R-squared: 0.0496
F-statistic: 77.14 on 1 and 1458 DF, p-value: < 2.2e-16
# Create a regression model for pct_white versus most_populous_pct, not including rows wherein white is the dominant race
<- na.omit(marylanddata) |>
marylandmodel3 filter(most_populous_race != "White") %>%
lm(pct_white ~ most_populous_pct, data = .)
summary(marylandmodel3)
Call:
lm(formula = pct_white ~ most_populous_pct, data = .)
Residuals:
Min 1Q Median 3Q Max
-17.8587 -3.8365 -0.7349 4.0721 24.8221
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.95290 1.06251 38.54 <2e-16 ***
most_populous_pct -0.41083 0.01566 -26.23 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.277 on 520 degrees of freedom
Multiple R-squared: 0.5696, Adjusted R-squared: 0.5687
F-statistic: 688.1 on 1 and 520 DF, p-value: < 2.2e-16
# Create a regression model for pct_white versus diversity_score, not including rows wherein white is the dominant race
<- na.omit(marylanddata) |>
marylandmodel4 filter(most_populous_race != "White") %>%
lm(pct_white ~ diversity_score, data = .)
summary(marylandmodel4)
Call:
lm(formula = pct_white ~ diversity_score, data = .)
Residuals:
Min 1Q Median 3Q Max
-17.4073 -3.9988 0.1995 3.2814 24.2460
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.3346 0.7237 -4.607 5.13e-06 ***
diversity_score 32.7879 1.2094 27.111 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.139 on 520 degrees of freedom
Multiple R-squared: 0.5857, Adjusted R-squared: 0.5849
F-statistic: 735 on 1 and 520 DF, p-value: < 2.2e-16
# Visualize marylandmodel1 with a scatterplot and regression line
na.omit(marylanddata) |>
ggplot(aes(x = most_populous_pct, y = total)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Regression of Total Population of Maryland by Most Populous Percentage",
x = "Most Populous Percentage",
y = "Total Population",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
# Visualize marylandmodel2 with a scatterplot and regression line
na.omit(marylanddata) |>
filter(most_populous_race != "White") |>
ggplot(aes(x = diversity_score, y = total)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Regression of Total Population of Maryland by Diversity Score",
x = "Diversity Score",
y = "Total Population",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
# Visualize marylandmodel3 with a scatterplot and regression line
na.omit(marylanddata) |>
filter(most_populous_race != "White") |>
ggplot(aes(x = most_populous_pct, y = pct_white)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Regression of White Percentage of Maryland by Most Populous Percentage",
x = "Most Populous Percentage",
y = "Percent White",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
# Visualize marylandmodel4 with a scatterplot and regression line
na.omit(marylanddata) |>
filter(most_populous_race != "White") |>
ggplot(aes(x = diversity_score, y = pct_white)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Regression of White Percentage of Maryland by Diversity Score",
x = "Diversity Score",
y = "Percent White",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Try linear regressions for just the District of Columbia
# Create a new tibble for only DC's data
<- censusdata |>
dcdata filter(state == "District of Columbia")
# Create a regression model for total versus most_populous_pct
<- lm(total ~ most_populous_pct, data = na.omit(dcdata))
dcmodel1 summary(dcmodel1)
Call:
lm(formula = total ~ most_populous_pct, data = na.omit(dcdata))
Residuals:
Min 1Q Median 3Q Max
-3187.5 -710.8 -116.8 632.8 3513.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3778.198 296.524 12.74 <2e-16 ***
most_populous_pct -6.402 4.268 -1.50 0.135
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1053 on 204 degrees of freedom
Multiple R-squared: 0.01091, Adjusted R-squared: 0.006058
F-statistic: 2.249 on 1 and 204 DF, p-value: 0.1352
# Create a regression model for total versus diversity_score
<- lm(total ~ diversity_score, data = na.omit(dcdata))
dcmodel2 summary(dcmodel2)
Call:
lm(formula = total ~ diversity_score, data = na.omit(dcdata))
Residuals:
Min 1Q Median 3Q Max
-3200.7 -699.1 -120.5 627.5 3500.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3106.1 179.7 17.287 <2e-16 ***
diversity_score 466.3 317.0 1.471 0.143
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1054 on 204 degrees of freedom
Multiple R-squared: 0.0105, Adjusted R-squared: 0.005645
F-statistic: 2.164 on 1 and 204 DF, p-value: 0.1428
# Create a regression model for pct_white versus most_populous_pct, not including rows wherein white is the dominant race
<- na.omit(dcdata) |>
dcmodel3 filter(most_populous_race != "White") %>%
lm(pct_white ~ most_populous_pct, data = .)
summary(dcmodel3)
Call:
lm(formula = pct_white ~ most_populous_pct, data = .)
Residuals:
Min 1Q Median 3Q Max
-17.3237 -1.5041 0.2876 1.3497 12.2462
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.22383 1.98178 28.88 <2e-16 ***
most_populous_pct -0.60093 0.02633 -22.82 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.189 on 95 degrees of freedom
Multiple R-squared: 0.8457, Adjusted R-squared: 0.8441
F-statistic: 520.7 on 1 and 95 DF, p-value: < 2.2e-16
# Create a regression model for pct_white versus diversity_score, not including rows wherein white is the dominant race
<- na.omit(dcdata) |>
dcmodel4 filter(most_populous_race != "White") %>%
lm(pct_white ~ diversity_score, data = .)
summary(dcmodel4)
Call:
lm(formula = pct_white ~ diversity_score, data = .)
Residuals:
Min 1Q Median 3Q Max
-18.121 -1.591 0.372 1.853 13.019
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.230 1.002 -5.22 1.05e-06 ***
diversity_score 43.485 1.950 22.30 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.291 on 95 degrees of freedom
Multiple R-squared: 0.8396, Adjusted R-squared: 0.8379
F-statistic: 497.3 on 1 and 95 DF, p-value: < 2.2e-16
# Visualize dcmodel1 with a scatterplot and regression line
na.omit(dcdata) |>
ggplot(aes(x = most_populous_pct, y = total)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Regression of Total Population of the District of Columbia by Most Populous Percentage",
x = "Most Populous Percentage",
y = "Total Population",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
# Visualize dcmodel2 with a scatterplot and regression line
na.omit(dcdata) |>
ggplot(aes(x = diversity_score, y = total)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Regression of Total Population of the District of Colubmia by Diversity Score",
x = "Diversity Score",
y = "Total Population",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
# Visualize dcmodel3 with a scatterplot and regression line
na.omit(dcdata) |>
filter(most_populous_race != "White") |>
ggplot(aes(x = most_populous_pct, y = pct_white)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Regression of White Percentage of the District of Colubmia by Most Populous Percentage",
x = "Most Populous Percentage",
y = "Percent White",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
# Visualize dcmodel4 with a scatterplot and regression line
na.omit(dcdata) |>
filter(most_populous_race != "White") |>
ggplot(aes(x = diversity_score, y = pct_white)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Regression of White Percentage of the District of Colubmia by Diversity Score",
x = "Diversity Score",
y = "Percent White",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Use linear regression to predict diversity based on race percentages
# Create a linear model that predicts the diversity score based on the percentages of each race
# The predictions are not useful for very small census tracts, so the minimum population will be 50
<- na.omit(censusdata) %>%
diversity_model filter(total >= 50) %>% # Ensure at least 50 population
lm(diversity_score ~ pct_white + pct_black + pct_native + pct_asian + pct_pacific + pct_other + pct_multiple, data = .)
summary(diversity_model)
Call:
lm(formula = diversity_score ~ pct_white + pct_black + pct_native +
pct_asian + pct_pacific + pct_other + pct_multiple, data = .)
Residuals:
Min 1Q Median 3Q Max
-0.97297 -0.09346 0.00610 0.09683 0.42282
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.357e+00 7.146e-03 189.94 <2e-16 ***
pct_white -1.164e-02 7.703e-05 -151.18 <2e-16 ***
pct_black -7.995e-03 7.694e-05 -103.91 <2e-16 ***
pct_native -8.693e-03 1.328e-04 -65.47 <2e-16 ***
pct_asian -2.498e-03 9.269e-05 -26.95 <2e-16 ***
pct_pacific -6.797e-03 4.713e-04 -14.42 <2e-16 ***
pct_other -2.731e-03 1.171e-04 -23.32 <2e-16 ***
pct_multiple NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1475 on 84447 degrees of freedom
Multiple R-squared: 0.6512, Adjusted R-squared: 0.6512
F-statistic: 2.628e+04 on 6 and 84447 DF, p-value: < 2.2e-16
# Generate predictions for the visualization
<- censusdata %>%
censusdata mutate(predicted_diversity = predict(diversity_model, newdata = .),
predicted_diversity = pmax(0, pmin(predicted_diversity, 1))) # Constrain values between 0 and 1
# Check the header of the tibble
head(censusdata)
# A tibble: 6 × 37
census_tract county state total total_white total_black total_native
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Census Tract 201 Autauga … Alab… 1775 1389 213 5
2 Census Tract 202 Autauga … Alab… 2055 842 1104 2
3 Census Tract 203 Autauga … Alab… 3216 2244 714 12
4 Census Tract 204 Autauga … Alab… 4246 3578 327 17
5 Census Tract 205.01 Autauga … Alab… 4322 3241 632 29
6 Census Tract 205.02 Autauga … Alab… 3284 1894 907 15
# ℹ 30 more variables: total_asian <dbl>, total_pacific <dbl>,
# total_other <dbl>, total_multiple <dbl>, statefp <chr>, countyfp <chr>,
# county_geoid <chr>, padded_census_tract <chr>, census_tract_geoid <chr>,
# pct_white <dbl>, pct_black <dbl>, pct_native <dbl>, pct_asian <dbl>,
# pct_pacific <dbl>, pct_other <dbl>, pct_multiple <dbl>,
# most_populous_race <chr>, most_populous_race_simple <chr>,
# is_pct_above_33.33333333 <lgl>, is_pct_above_50 <lgl>, …
# Create a long tibble for the plot
# The predictions are not useful for very small census tracts, so the minimum population will be 50
<- na.omit(censusdata) |>
predictions_long filter(total >= 50) |> # Ensure at least 50 population
select(1:11, 17:24, 35:37) |> # Select specific columns
pivot_longer(cols = starts_with("pct_"), names_to = "race", values_to = "percentage") |>
mutate(race = recode(race,
"pct_white" = "White",
"pct_black" = "Black or African American",
"pct_native" = "American Indian or Alaska Native",
"pct_asian" = "Asian",
"pct_pacific" = "Pacific Islander",
"pct_other" = "Other",
"pct_multiple"= "Multiple Races"))
Check the accuracy of the predicted diversity scores
# Create a clean dataset
<- na.omit(censusdata) |>
censusdata_clean filter(total >= 50)
# Create a column for residuals when comparing predicted and actual diversity scores
<- censusdata_clean |>
censusdata_clean mutate(residuals = diversity_score - predicted_diversity)
# Visualize the residuals
ggplot(censusdata_clean, aes(x = residuals)) +
geom_histogram(binwidth = 0.05, fill = "blue", color = "black", alpha = 0.7) +
labs(title = "Residuals (Actual - Predicted Diversity Score)",
x = "Residuals",
y = "Count",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal()
# Get a summary of the residuals
summary(censusdata_clean$residuals)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.8649076 -0.0934634 0.0060953 0.0005969 0.0968255 0.4228240
With a median of only 0.006 and low first and third quartiles, we can see that the predictions are fairly accurate.
Visualize the effects of the prevalence of each race on the census tract’s diversity, based on the predictions
# Plot the predicted diversity scores versus the most populous race percentage by race
ggplot(predictions_long, aes(x = percentage, y = predicted_diversity, color = race)) +
geom_point() +
geom_smooth(method = "lm", aes(group = race), color = "black") +
labs(title = "Predicted Diversity Score by Race Percentage",
x = "Percentage of Race",
y = "Predicted Diversity Score",
color = "Race",
caption = "Source: U.S. Decennial Census 2020") +
theme_minimal() +
theme(legend.position = "none") +
facet_wrap(~ race, scales = "free_x") +
ylim(0, 1)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 161 rows containing missing values or values outside the scale range
(`geom_smooth()`).
Ensure all columns are the correct format before exporting
# Reformat all columns to avoid errors
<- censusdata |>
censusdata mutate(
census_tract = as.character(census_tract),
county = as.character(county),
state = as.character(state),
total = as.numeric(total),
total_white = as.numeric(total_white),
total_black = as.numeric(total_black),
total_native = as.numeric(total_native),
total_asian = as.numeric(total_asian),
total_pacific = as.numeric(total_pacific),
total_other = as.numeric(total_other),
total_multiple = as.numeric(total_multiple),
statefp = as.character(statefp),
countyfp = as.character(countyfp),
county_geoid = as.character(county_geoid),
padded_census_tract = as.character(padded_census_tract),
census_tract_geoid = as.character(census_tract_geoid),
pct_white = as.numeric(pct_white),
pct_black = as.numeric(pct_black),
pct_native = as.numeric(pct_native),
pct_asian = as.numeric(pct_asian),
pct_pacific = as.numeric(pct_pacific),
pct_other = as.numeric(pct_other),
pct_multiple = as.numeric(pct_multiple),
most_populous_race = as.character(most_populous_race),
most_populous_race_simple = as.character(most_populous_race_simple),
is_pct_above_33.33333333 = as.logical(is_pct_above_33.33333333),
is_pct_above_50 = as.logical(is_pct_above_50),
is_pct_above_66.66666667 = as.logical(is_pct_above_66.66666667),
is_pct_above_75 = as.logical(is_pct_above_70),
is_pct_above_80 = as.logical(is_pct_above_80),
is_pct_above_90 = as.logical(is_pct_above_90),
is_pct_above_95 = as.logical(is_pct_above_95),
is_pct_above_98 = as.logical(is_pct_above_98),
most_populous_pct = as.numeric(most_populous_pct),
diversity_score = as.numeric(diversity_score),
predicted_diversity = as.numeric(predicted_diversity)
)
Write the prepared data file to the directory
# Write the changed file to the directory folder as "P1 RACE 2020 NEW.csv"
|>
censusdata write_csv(file = "P1 RACE 2020 NEW.csv")
The process of cleaning the dataset was rather simple. As it comes from the decennial census, the data was already clean. As such, I simply fixed the headers with row_to_names() and clean_names(). However, the process of preparing and processing the data took much longer. I did the following things to prepare and process the data prior to visualization and exporting: removing the unnecessary columns, which came with the dataset; separating the geographic_area_name column into census_tract, county, and state; renaming the columns to be more useful; reformatting the numeric columns as numeric; getting FIPS codes for the states and counties; creating padded census tract codes; using the state and counties FIPS codes and the padded census tract codes to create geoIDs for each census tract, which will be used to connect the census data to TIGER/Line Shapefiles; creating columns for the percentage of the total population each race occupies; creating columns for the most populous race of each census tract; create columns for measuring the overall racial diversity of a census tract; and giving the unpopulated counties NAs in most columns.
The linear regression visualizations turned out to be less useful than I wanted. However, the section after that, using the predicted diversity scores to determine the influence each race’s prevalence has on overall diversity, proved useful. Through linear regression, I found that the prevalence of Asians, Pacific Islanders or Hawaiian Natives, and those who report multiple or other races has a positive correlation with the diversity of an area. Additionally, the prevalence of Blacks or African Americans and American Indians or Alaska Natives also have a positive correlation with increased diversity, though not as much of an increase as the prior four. Conversely, the prevalence of Whites has a strong negative correlation with decreased diversity, which is unsurprising, given the dominance of the White race in most areas of the United States. I also checked the accuracy of the predicted diversity scores to ensure that this analysis would be accurate. The resulting residuals had a median close to zero and small first and third quartiles, meaning that it is accurate enough for this purpose.
The two maps on Tableau consist of: a map of the Southern Mid-Atlantic Region census tracts, colorized to represent the most populous race in each census tract and a map of the Southern Mid-Atlantic Region shaded to represent the overall racial diversity (diversity score — (1 - (most_populous_pct / 100)^2)) of each census tract. The prevalence of Whites in most areas, especially in West Virginia, Virginia, and Delaware, as well as the prevalence of Blacks in Baltimore City, Prince George’s County, the District of Columbia, and Southern and Southeastern Virginia, were unsurprising. However, the lack of racial diversity in certain areas of the District of Columbia, Baltimore City, and Prince George’s County was startling because urban areas tend to be more diverse. Additionally, the racial diversity in Southern and Southeastern Virginia and Delaware was surprising because those areas are (or, I at least imagine them to be) largely rural, and rural areas are most often less diverse than more urban areas. This dichotomy was an interesting and unexpected discovery.
I had attempted to use a linear curve for the line of best fit in the diversity score versus White percentage/total population, but my several tries were unsuccessful, so I opted to leave it as a linear vector.
Note: ChatGPT was used in the creation of this project