Analysis of Presidential Elections Data: County-Level in USA
Project Link
Use the below link that was published in rPubs.
Dashboard Link
Use the below link for the visualization dashboard.
Loading packages
In my project, I utilize a variety of powerful libraries for data manipulation, visualization, and statistical analysis. The tidyverse suite, including dplyr, ggplot2, and tidyr, enables seamless data wrangling and plotting, while lubridate helps with date and time manipulation. Additionally, I leverage libraries like psych and skimr for descriptive statistics and concise data summarization. For creating publication-ready plots and interactive visualizations, ggpubr, plotly, and patchwork play key roles. Moreover, I employ specialized packages like datadictionary for documenting datasets and gtrendsR to access Google Trends data. Lastly, maps and spatial data are effectively handled using maps and sf libraries.
# Data manipulation and visualization libraries
library(tidyverse) # Comprehensive data manipulation and visualization tools
library(magrittr) # Allows using the %>% operator for cleaner code
library(stringr) # Provides string manipulation functions
library(dplyr) # Provides data manipulation functions (part of tidyverse)
library(tidyr) # Provides tools for data tidying (part of tidyverse)
library(lubridate) # Helps with date and time manipulation (part of tidyverse)
# Descriptive statistics and data summarization
library(skimr) # Provides concise summary statistics for data
library(psych) # Helps with more advanced descriptive statistics
# Data visualization libraries
library(ggplot2) # Widely used for creating static data visualizations
library(ggpubr) # Extends ggplot2 for creating publication-ready plots
library(patchwork) # Allows combining multiple ggplots into one plot
# Data import/export
library(readr) # Helps with reading and writing data in various formats
# Miscellaneous visualization and formatting
library(flextable) # Creates flexible and nice-looking tables
library(viridis) # Provides color palettes for visualizations
library(hrbrthemes) # Offers additional ggplot2 themes
# Categorical data manipulation
library(forcats) # Useful for manipulating categorical variables
# Google Trends API
library(gtrendsR) # Provides an interface for accessing Google Trends data
# Data Dictionary
library(datadictionary) # Helps create data dictionaries for datasets
# Interactive plotting
library(plotly) # Allows creating interactive plots
# Maps and geographic data
library(maps) # Offers map data for visualizations
library(sf) # Provides functions for working with spatial dataImporting the data
These are the datasets that were used in this project. First, there is “Presidential_Elections_Data_count_level.csv”. This data contains constituency county-level returns for elections to the U.S. presidency from 1976 to 2020. Data was obtained from the Massachusetts Institute of Technology website(https://electionlab.mit.edu/data).
# Importing CSV file
election_data <- read_csv("Presidential_Elections_Data_count_level.csv")
census_dataset <- read_csv("census.csv")
# Downloading county-level shape files from US Census Bureau
if(!file.exists("cb_2018_us_county_500k.zip")) {
download.file(url = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_500k.zip",
destfile = "cb_2018_us_county_500k.zip")
}
if(!dir.exists("GeoFiles")) {
dir.create("GeoFiles/")
}
# Unzipping files
utils::unzip("cb_2018_us_county_500k.zip",
exdir = "GeoFiles")
# Loading the shapefiles
county_shape <- st_read("GeoFiles//cb_2018_us_county_500k.shp",
quiet = TRUE)Exploring the data
Exploring and display of high-level characteristics of the data set, e.g., important variables and their types, different levels for factor variables, any patterns of missing values.
## exploring the data
glimpse(election_data)## Rows: 72,617
## Columns: 12
## $ year <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2…
## $ state <chr> "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", …
## $ state_po <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
## $ county_name <chr> "AUTAUGA", "AUTAUGA", "AUTAUGA", "AUTAUGA", "BALDWIN", …
## $ county_fips <chr> "01001", "01001", "01001", "01001", "01003", "01003", "…
## $ office <chr> "US PRESIDENT", "US PRESIDENT", "US PRESIDENT", "US PRE…
## $ candidate <chr> "AL GORE", "GEORGE W. BUSH", "RALPH NADER", "OTHER", "A…
## $ party <chr> "DEMOCRAT", "REPUBLICAN", "GREEN", "OTHER", "DEMOCRAT",…
## $ candidatevotes <dbl> 4942, 11993, 160, 113, 13997, 40872, 1033, 578, 5188, 5…
## $ totalvotes <dbl> 17208, 17208, 17208, 17208, 56480, 56480, 56480, 56480,…
## $ version <dbl> 20220315, 20220315, 20220315, 20220315, 20220315, 20220…
## $ mode <chr> "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "…
skim(election_data)| Name | election_data |
| Number of rows | 72617 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1 | 4 | 20 | 0 | 51 | 0 |
| state_po | 0 | 1 | 2 | 2 | 0 | 51 | 0 |
| county_name | 0 | 1 | 3 | 21 | 0 | 1892 | 0 |
| county_fips | 57 | 1 | 5 | 5 | 0 | 3156 | 0 |
| office | 0 | 1 | 12 | 12 | 0 | 1 | 0 |
| candidate | 0 | 1 | 5 | 17 | 0 | 13 | 0 |
| party | 0 | 1 | 5 | 11 | 0 | 5 | 0 |
| mode | 0 | 1 | 4 | 20 | 0 | 16 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2011.30 | 7.52 | 2000 | 2004 | 2012 | 2020 | 2020 | ▇▃▃▃▇ |
| candidatevotes | 0 | 1 | 10781.83 | 46034.94 | 0 | 115 | 1278 | 5848 | 3028885 | ▇▁▁▁▁ |
| totalvotes | 0 | 1 | 42514.21 | 121951.40 | 0 | 5175 | 11194 | 29855 | 4264365 | ▇▁▁▁▁ |
| version | 0 | 1 | 20220315.00 | 0.00 | 20220315 | 20220315 | 20220315 | 20220315 | 20220315 | ▁▁▇▁▁ |
glimpse(census_dataset)## Rows: 52,612
## Columns: 26
## $ geoid <chr> "01003", "01015", "01043", "01049", "01051", …
## $ county_state <chr> "Baldwin County, Alabama", "Calhoun County, A…
## $ year <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 200…
## $ population <dbl> 174439, 113419, 81324, 68515, 78106, 103303, …
## $ median_income <dbl> 52320, 40056, 39229, 34357, 54601, 37642, 426…
## $ median_monthly_rent_cost <dbl> 790, 533, 562, 536, 620, 569, 570, 737, 537, …
## $ median_monthly_home_cost <dbl> 884, 686, 653, 548, 787, 595, 673, 975, 655, …
## $ prop_female <dbl> 0.5072088, 0.5213941, 0.5008853, 0.5028388, 0…
## $ prop_male <dbl> 0.4927912, 0.4786059, 0.4991147, 0.4971612, 0…
## $ prop_white <dbl> 0.8650990, 0.7668380, 0.9676971, 0.9294607, 0…
## $ prop_black <dbl> 0.08835180, 0.20128021, 0.01090699, 0.0140115…
## $ prop_native <dbl> 0.006896394, 0.003720717, 0.007463971, 0.0164…
## $ prop_asian <dbl> 0.0072632840, 0.0081291494, 0.0024592986, 0.0…
## $ prop_hawaiin_islander <dbl> 0.0017312642, 0.0000000000, 0.0000000000, 0.0…
## $ prop_other_race <dbl> 0.0146813499, 0.0092312575, 0.0004303773, 0.0…
## $ prop_multi_racial <dbl> 0.015976932, 0.010800660, 0.011042251, 0.0155…
## $ prop_highschool <dbl> 0.2234336, 0.2642586, 0.2830509, 0.2945016, 0…
## $ prop_GED <dbl> 0.04600955, 0.07705351, 0.07444981, 0.0797542…
## $ prop_some_college <dbl> 0.06240842, 0.06737910, 0.05683327, 0.0587472…
## $ prop_college_no_degree <dbl> 0.1687486, 0.1806985, 0.1389848, 0.1287847, 0…
## $ prop_associates <dbl> 0.08350924, 0.05767840, 0.09165914, 0.0494131…
## $ prop_bachelors <dbl> 0.20768040, 0.09130223, 0.08016855, 0.0449906…
## $ prop_masters <dbl> 0.06303756, 0.04145800, 0.04394398, 0.0302134…
## $ prop_professional <dbl> 0.013004859, 0.008675421, 0.008834829, 0.0026…
## $ prop_doctoral <dbl> 0.0108111688, 0.0155106011, 0.0012570599, 0.0…
## $ prop_poverty <dbl> 0.09132561, 0.18246331, 0.15272973, 0.2357663…
skim(census_dataset)| Name | census_dataset |
| Number of rows | 52612 |
| Number of columns | 26 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 24 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| geoid | 0 | 1 | 5 | 5 | 0 | 3225 | 0 |
| county_state | 0 | 1 | 16 | 42 | 0 | 3229 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2014.82 | 3.78 | 2008.00 | 2012.00 | 2015.00 | 2018.00 | 2021.00 | ▆▇▅▇▇ |
| population | 0 | 1.00 | 145758.59 | 395597.70 | 41.00 | 14429.00 | 40119.50 | 122855.25 | 10170292.00 | ▇▁▁▁▁ |
| median_income | 4 | 1.00 | 49709.77 | 15079.85 | 10499.00 | 40264.50 | 47527.00 | 56565.25 | 156821.00 | ▃▇▁▁▁ |
| median_monthly_rent_cost | 30 | 1.00 | 741.63 | 235.85 | 99.00 | 587.00 | 688.00 | 841.00 | 2599.00 | ▃▇▁▁▁ |
| median_monthly_home_cost | 18 | 1.00 | 837.68 | 388.18 | 103.00 | 572.00 | 743.00 | 1002.00 | 3254.00 | ▇▇▁▁▁ |
| prop_female | 0 | 1.00 | 0.50 | 0.02 | 0.19 | 0.50 | 0.51 | 0.51 | 0.63 | ▁▁▁▇▁ |
| prop_male | 0 | 1.00 | 0.50 | 0.02 | 0.37 | 0.49 | 0.49 | 0.50 | 0.81 | ▁▇▁▁▁ |
| prop_white | 42443 | 0.19 | 0.78 | 0.15 | 0.08 | 0.71 | 0.82 | 0.89 | 0.98 | ▁▁▂▅▇ |
| prop_black | 42443 | 0.19 | 0.11 | 0.12 | 0.00 | 0.03 | 0.06 | 0.15 | 0.73 | ▇▂▁▁▁ |
| prop_native | 42443 | 0.19 | 0.01 | 0.05 | 0.00 | 0.00 | 0.00 | 0.01 | 0.80 | ▇▁▁▁▁ |
| prop_asian | 42443 | 0.19 | 0.03 | 0.04 | 0.00 | 0.01 | 0.02 | 0.04 | 0.44 | ▇▁▁▁▁ |
| prop_hawaiin_islander | 42443 | 0.19 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.14 | ▇▁▁▁▁ |
| prop_other_race | 42443 | 0.19 | 0.03 | 0.05 | 0.00 | 0.01 | 0.02 | 0.04 | 0.80 | ▇▁▁▁▁ |
| prop_multi_racial | 42443 | 0.19 | 0.04 | 0.04 | 0.00 | 0.02 | 0.03 | 0.04 | 0.69 | ▇▁▁▁▁ |
| prop_highschool | 41916 | 0.20 | 0.25 | 0.06 | 0.05 | 0.21 | 0.25 | 0.29 | 0.49 | ▁▅▇▂▁ |
| prop_GED | 41916 | 0.20 | 0.04 | 0.02 | 0.00 | 0.03 | 0.04 | 0.05 | 0.18 | ▆▇▁▁▁ |
| prop_some_college | 41916 | 0.20 | 0.07 | 0.02 | 0.00 | 0.06 | 0.07 | 0.08 | 0.15 | ▁▅▇▂▁ |
| prop_college_no_degree | 41916 | 0.20 | 0.15 | 0.03 | 0.05 | 0.13 | 0.15 | 0.17 | 0.28 | ▁▆▇▂▁ |
| prop_associates | 41916 | 0.20 | 0.09 | 0.02 | 0.02 | 0.07 | 0.09 | 0.10 | 0.29 | ▅▇▁▁▁ |
| prop_bachelors | 41916 | 0.20 | 0.18 | 0.06 | 0.04 | 0.13 | 0.17 | 0.21 | 0.40 | ▂▇▆▂▁ |
| prop_masters | 41916 | 0.20 | 0.08 | 0.03 | 0.01 | 0.05 | 0.07 | 0.09 | 0.29 | ▇▇▂▁▁ |
| prop_professional | 41916 | 0.20 | 0.02 | 0.01 | 0.00 | 0.01 | 0.02 | 0.02 | 0.10 | ▇▃▁▁▁ |
| prop_doctoral | 41916 | 0.20 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.02 | 0.11 | ▇▁▁▁▁ |
| prop_poverty | 2 | 1.00 | 0.16 | 0.08 | 0.00 | 0.11 | 0.15 | 0.19 | 0.67 | ▆▇▁▁▁ |
After Exploring the election data by the use of skim function, I can observe that there are 57 missing values for county_fips variable in the datasets I have used for this project and in census data there is more 40 thousand missing values in prop_white, prop_black, prop_native, prop_asian, prop_hawaiin_islander, prop_other_race etc and 30 missing values in median_monthly_rent_cost and 18 missing values in median_monthly_home_cost column might impact while merging the data with election data.
1. Data Dictionary for the datasets used
# Loading election data
data(election_data)
# Creating data dictionary.
# Note penguins will need to be replaced with your own data set, and
# the variable descriptions should be updated as well.
dataDictionary1 <- tibble(Variable = colnames(election_data),
Description = c(
"year of election",
"state name",
"state postal code abbreviation",
"county name",
"county fips code",
"name of the public office to which the candidate is seeking election",
"candidate name",
"party",
"number of votes for candidate",
"total votes cast in the election",
"version",
"mode"
),
Type = map_chr(election_data, .f = function(x){typeof(x)[1]}),
Class = map_chr(election_data, .f = function(x){class(x)[1]}))# Printing nicely in R Markdown (option 1)
flextable::flextable(dataDictionary1, cwidth = 2)Variable | Description | Type | Class |
|---|---|---|---|
year | year of election | double | numeric |
state | state name | character | character |
state_po | state postal code abbreviation | character | character |
county_name | county name | character | character |
county_fips | county fips code | character | character |
office | name of the public office to which the candidate is seeking election | character | character |
candidate | candidate name | character | character |
party | party | character | character |
candidatevotes | number of votes for candidate | double | numeric |
totalvotes | total votes cast in the election | double | numeric |
version | version | double | numeric |
mode | mode | character | character |
# Printing nicely in R Markdown (option 2)
knitr::kable(dataDictionary1)| Variable | Description | Type | Class |
|---|---|---|---|
| year | year of election | double | numeric |
| state | state name | character | character |
| state_po | state postal code abbreviation | character | character |
| county_name | county name | character | character |
| county_fips | county fips code | character | character |
| office | name of the public office to which the candidate is seeking election | character | character |
| candidate | candidate name | character | character |
| party | party | character | character |
| candidatevotes | number of votes for candidate | double | numeric |
| totalvotes | total votes cast in the election | double | numeric |
| version | version | double | numeric |
| mode | mode | character | character |
# Loading census data
data(census_dataset)
# Creating data dictionary.
# Note penguins will need to be replaced with your own data set, and
# the variable descriptions should be updated as well.
dataDictionary2 <- tibble(Variable = colnames(census_dataset),
Description = c(
"geographic region ID",
"geographic region",
"year",
"population",
"median income in dollars",
"median monthly housing costs for homeowners in dollars",
"median monthly rent costs for renters in dollars",
"proportion of people who are female",
"proportion of people who are male",
"proportion of people who are white alone",
"proportion of people who are black or African American alone",
"proportion of people who are American Indian and Alaska Native alone",
"proportion of people who are Asian alone",
"proportion of people who are Native Hawaiian and Other Pacific Islander alone",
"proportion of people who are some other race alone",
"proportion of people who are two or more races",
"proportion of people 25 and older whose highest education-level is high school",
"proportion of people 25 and older whose highest education-level is a GED",
"proportion of people 25 and older whose highest education-level is some, but less than 1 year of college",
"proportion of people 25 and older whose highest education-level is greater than 1 year of college but no degree",
"proportion of people 25 and older whose highest education-level is an Associates degree",
"proportion of people 25 and older whose highest education-level is a Bachelors degree",
"proportion of people 25 and older whose highest education-level is a Masters degree",
"proportion of people 25 and older whose highest education-level is a Professional degree",
"proportion of people 25 and older whose highest education-level is a Doctoral degree",
"proportion of people 25 and older living in poverty"
),
Type = map_chr(census_dataset, .f = function(x){typeof(x)[1]}),
Class = map_chr(census_dataset, .f = function(x){class(x)[1]}))# Printing nicely in R Markdown (option 1)
flextable::flextable(dataDictionary2, cwidth = 2)Variable | Description | Type | Class |
|---|---|---|---|
geoid | geographic region ID | character | character |
county_state | geographic region | character | character |
year | year | double | numeric |
population | population | double | numeric |
median_income | median income in dollars | double | numeric |
median_monthly_rent_cost | median monthly housing costs for homeowners in dollars | double | numeric |
median_monthly_home_cost | median monthly rent costs for renters in dollars | double | numeric |
prop_female | proportion of people who are female | double | numeric |
prop_male | proportion of people who are male | double | numeric |
prop_white | proportion of people who are white alone | double | numeric |
prop_black | proportion of people who are black or African American alone | double | numeric |
prop_native | proportion of people who are American Indian and Alaska Native alone | double | numeric |
prop_asian | proportion of people who are Asian alone | double | numeric |
prop_hawaiin_islander | proportion of people who are Native Hawaiian and Other Pacific Islander alone | double | numeric |
prop_other_race | proportion of people who are some other race alone | double | numeric |
prop_multi_racial | proportion of people who are two or more races | double | numeric |
prop_highschool | proportion of people 25 and older whose highest education-level is high school | double | numeric |
prop_GED | proportion of people 25 and older whose highest education-level is a GED | double | numeric |
prop_some_college | proportion of people 25 and older whose highest education-level is some, but less than 1 year of college | double | numeric |
prop_college_no_degree | proportion of people 25 and older whose highest education-level is greater than 1 year of college but no degree | double | numeric |
prop_associates | proportion of people 25 and older whose highest education-level is an Associates degree | double | numeric |
prop_bachelors | proportion of people 25 and older whose highest education-level is a Bachelors degree | double | numeric |
prop_masters | proportion of people 25 and older whose highest education-level is a Masters degree | double | numeric |
prop_professional | proportion of people 25 and older whose highest education-level is a Professional degree | double | numeric |
prop_doctoral | proportion of people 25 and older whose highest education-level is a Doctoral degree | double | numeric |
prop_poverty | proportion of people 25 and older living in poverty | double | numeric |
# Printing nicely in R Markdown (option 2)
knitr::kable(dataDictionary2)| Variable | Description | Type | Class |
|---|---|---|---|
| geoid | geographic region ID | character | character |
| county_state | geographic region | character | character |
| year | year | double | numeric |
| population | population | double | numeric |
| median_income | median income in dollars | double | numeric |
| median_monthly_rent_cost | median monthly housing costs for homeowners in dollars | double | numeric |
| median_monthly_home_cost | median monthly rent costs for renters in dollars | double | numeric |
| prop_female | proportion of people who are female | double | numeric |
| prop_male | proportion of people who are male | double | numeric |
| prop_white | proportion of people who are white alone | double | numeric |
| prop_black | proportion of people who are black or African American alone | double | numeric |
| prop_native | proportion of people who are American Indian and Alaska Native alone | double | numeric |
| prop_asian | proportion of people who are Asian alone | double | numeric |
| prop_hawaiin_islander | proportion of people who are Native Hawaiian and Other Pacific Islander alone | double | numeric |
| prop_other_race | proportion of people who are some other race alone | double | numeric |
| prop_multi_racial | proportion of people who are two or more races | double | numeric |
| prop_highschool | proportion of people 25 and older whose highest education-level is high school | double | numeric |
| prop_GED | proportion of people 25 and older whose highest education-level is a GED | double | numeric |
| prop_some_college | proportion of people 25 and older whose highest education-level is some, but less than 1 year of college | double | numeric |
| prop_college_no_degree | proportion of people 25 and older whose highest education-level is greater than 1 year of college but no degree | double | numeric |
| prop_associates | proportion of people 25 and older whose highest education-level is an Associates degree | double | numeric |
| prop_bachelors | proportion of people 25 and older whose highest education-level is a Bachelors degree | double | numeric |
| prop_masters | proportion of people 25 and older whose highest education-level is a Masters degree | double | numeric |
| prop_professional | proportion of people 25 and older whose highest education-level is a Professional degree | double | numeric |
| prop_doctoral | proportion of people 25 and older whose highest education-level is a Doctoral degree | double | numeric |
| prop_poverty | proportion of people 25 and older living in poverty | double | numeric |
2. Data Cleaning ( Using String modification, Merging dataset, Lubridate package)
Cleaning the data set and modifying the variables and data to match our requirements for the Project
a. String modification
# Converting values in "state" variable to capitalized form
election_data <- election_data %>%
mutate(state = str_to_title(state))
# Converting values in "candidate" variable to capitalized form
election_data <- election_data %>%
mutate(candidate = str_to_title(candidate))
# Converting values in "party" variable to capitalized form
election_data <- election_data %>%
mutate(party = str_to_title(party))
# Converting values in "county_name" variable to capitalized form
election_data <- election_data %>%
mutate(county_name = str_to_title(county_name))
election_data <- election_data %>%
rename(county = county_name)
# Now, Cleaning the candidate using mutate() and str_replace() function
election_data <- election_data %>%
mutate(candidate = str_replace(candidate, "Donald Trump|Donald J Trump", "Donald Trump")) %>%
mutate(candidate = str_replace(candidate, "Joseph R Biden Jr", "Joe Biden"))
# Separate "county_state" column into "county" and "state"
if ("county_state" %in% names(census_dataset)) {
# Separate "county_state" column into "county" and "state"
census_data <- census_dataset %>%
separate(county_state, into = c("county", "state"), sep = ",\\s*", extra = "merge")
}
# now, remove the substring "county" in every county column value, example: "Houston County" > "Houston"
census_data$county <- str_replace(census_data$county, " County", "")
# Converting values in "state" variable in census data to capitalized form
census_data <- census_data %>%
mutate(state = str_to_title(state))
# Converting values in "county" variable in census data to capitalized form
census_data <- census_data %>%
mutate(county = str_to_title(county))
# renaming the geoId column name to county_fips
census_data <- census_data %>%
rename(county_fips = geoid)b. Merging Data Sets
# Merge the two data frames based on county_fips as the unique identifier
election_census_merged_data <- left_join(census_data, election_data, by = c("county_fips", "state", "year", "county"))
# Update the dataset with the count of females and male for each county
election_census_merged_data$male_count <- round(election_census_merged_data$population * election_census_merged_data$prop_male)
election_census_merged_data$female_count <- round(election_census_merged_data$population * election_census_merged_data$prop_female)
# Filter the dataset to show Republican candidate votes for each county_fips
election_census_merged_data_republican_2016 <- election_census_merged_data %>%
filter(party == "Republican", year == 2016) %>%
distinct(county_fips, .keep_all = TRUE)Note: Lubridate package is used in Data Visualizations in plot 3
3. Exploratory Data Analysis
a. Tables of Summary Statistics (group-level and Frequency)
Table 1 : Group-level summary statistics based on state wise candidate votes
summary_state<- election_data %>% group_by(state) %>%
summarise(mean_v=round(mean(candidatevotes),2),
median_v=round(median(candidatevotes),2),
sd_v=round(sd(candidatevotes),2),
max_v=round(max(candidatevotes),2),
min_v=round(min(candidatevotes),2),.groups = 'drop') %>%
as.data.frame()
summary_state_table <- flextable::qflextable(summary_state)
summary_state_tablestate | mean_v | median_v | sd_v | max_v | min_v |
|---|---|---|---|---|---|
Alabama | 9,562.28 | 3,555.0 | 20,036.95 | 181,688 | 2 |
Alaska | 2,412.17 | 1,939.0 | 2,334.27 | 10,717 | 0 |
Arizona | 30,384.94 | 1,508.0 | 112,684.35 | 978,457 | 0 |
Arkansas | 2,400.19 | 201.0 | 6,834.58 | 89,574 | 0 |
California | 67,021.72 | 8,106.5 | 203,790.55 | 3,028,885 | 2 |
Colorado | 11,103.56 | 1,135.5 | 31,472.45 | 313,293 | 0 |
Connecticut | 52,802.04 | 22,058.0 | 74,918.46 | 297,505 | 0 |
Delaware | 41,267.58 | 30,838.0 | 49,206.40 | 195,034 | 294 |
District Of Columbia | 78,320.05 | 17,367.0 | 113,325.05 | 317,323 | 1,322 |
Florida | 36,193.16 | 3,897.0 | 77,794.32 | 624,146 | 1 |
Georgia | 5,128.96 | 786.5 | 17,792.05 | 297,051 | 0 |
Hawaii | 31,985.51 | 12,239.0 | 52,880.07 | 238,866 | 133 |
Idaho | 4,505.79 | 1,032.5 | 12,104.62 | 130,699 | 1 |
Illinois | 15,114.35 | 2,066.0 | 88,520.33 | 1,725,973 | 0 |
Indiana | 8,183.73 | 2,682.5 | 19,101.97 | 247,772 | 0 |
Iowa | 3,573.57 | 1,152.0 | 9,276.25 | 128,465 | 0 |
Kansas | 3,613.44 | 693.0 | 13,483.38 | 184,259 | 3 |
Kentucky | 4,592.88 | 1,530.0 | 13,653.98 | 228,288 | 1 |
Louisiana | 9,249.22 | 2,986.5 | 18,781.17 | 152,610 | 11 |
Maine | 12,366.85 | 7,165.5 | 18,244.04 | 128,759 | 0 |
Maryland | 15,822.72 | 1,229.5 | 42,364.39 | 357,837 | 0 |
Massachusetts | 63,935.49 | 15,048.0 | 94,365.68 | 617,196 | 41 |
Michigan | 16,720.18 | 3,518.0 | 51,486.44 | 660,085 | 2 |
Minnesota | 9,488.48 | 2,153.0 | 32,290.92 | 532,623 | 4 |
Mississippi | 4,183.36 | 1,790.5 | 7,462.33 | 76,112 | 0 |
Missouri | 6,817.37 | 1,543.5 | 23,002.74 | 333,123 | 0 |
Montana | 2,765.24 | 717.0 | 6,199.47 | 50,772 | 5 |
Nebraska | 2,754.28 | 460.0 | 10,890.56 | 150,350 | 0 |
Nevada | 17,504.76 | 998.5 | 63,992.63 | 521,777 | 6 |
New Hampshire | 22,193.54 | 12,792.0 | 28,790.04 | 122,344 | 191 |
New Jersey | 51,471.90 | 24,212.0 | 62,124.22 | 285,967 | 93 |
New Mexico | 6,769.28 | 1,398.0 | 18,473.54 | 193,757 | 0 |
New York | 34,742.93 | 9,105.0 | 83,084.69 | 703,310 | 7 |
North Carolina | 7,081.89 | 348.0 | 20,627.13 | 302,736 | 0 |
North Dakota | 1,836.63 | 501.0 | 4,864.71 | 42,619 | 0 |
Ohio | 18,766.28 | 5,745.0 | 43,724.48 | 458,204 | 13 |
Oklahoma | 3,946.76 | 431.5 | 13,808.50 | 174,741 | 0 |
Oregon | 15,031.14 | 2,947.0 | 34,565.99 | 367,249 | 1 |
Pennsylvania | 27,830.84 | 7,303.0 | 60,148.15 | 603,790 | 12 |
Rhode Island | 22,876.00 | 9,098.0 | 36,662.68 | 167,442 | 0 |
South Carolina | 5,436.18 | 89.5 | 13,823.51 | 127,832 | 0 |
South Dakota | 1,785.25 | 571.0 | 4,620.91 | 49,249 | 0 |
Tennessee | 8,380.75 | 2,517.0 | 21,679.46 | 256,297 | 7 |
Texas | 9,406.51 | 663.0 | 43,432.78 | 918,193 | 0 |
Utah | 6,058.90 | 86.0 | 24,074.89 | 271,576 | 0 |
Vermont | 6,519.92 | 3,510.0 | 9,348.12 | 72,938 | 12 |
Virginia | 5,878.28 | 899.0 | 18,452.44 | 355,133 | 0 |
Washington | 22,963.08 | 3,947.0 | 69,060.14 | 907,310 | 2 |
West Virginia | 3,904.89 | 1,606.0 | 6,125.35 | 46,398 | 6 |
Wisconsin | 11,854.21 | 3,820.5 | 28,906.19 | 332,438 | 0 |
Wyoming | 3,263.54 | 1,101.0 | 4,821.71 | 27,891 | 13 |
Table 2 : Aggregate votes for each candidate for each term year
aggregate_table <- aggregate(candidatevotes ~ year + candidate, data = election_data, FUN = sum)
# Order the aggregate_table by year and candidatevotes
aggregate_table_ordered <- arrange(aggregate_table, year, desc(candidatevotes))
# Spread the aggregate_table to have each unique year as a separate row
spread_table <- spread(aggregate_table_ordered, key = year, value = candidatevotes)
spread_table_filled <- spread_table %>%
mutate(across(-candidate, ~ replace(., is.na(.), "-")))
# Printing the spread_table
flextable_spread_table <- flextable::qflextable(spread_table_filled)
flextable_spread_tablecandidate | 2000 | 2004 | 2008 | 2012 | 2016 | 2020 |
|---|---|---|---|---|---|---|
Al Gore | 51001968 | - | - | - | - | - |
Barack Obama | - | - | 69448278 | 65753019 | - | - |
Donald Trump | - | - | - | - | 62979031 | 74218999 |
George W. Bush | 50457896 | 61931981 | - | - | - | - |
Hillary Clinton | - | - | - | - | 65844241 | - |
Jo Jorgensen | - | - | - | - | - | 1810407 |
Joe Biden | - | - | - | - | - | 81264648 |
John Kerry | - | 58899063 | - | - | - | - |
John Mccain | - | - | 59928695 | - | - | - |
Mitt Romney | - | - | - | 60666387 | - | - |
Other | 1068557 | 1489505 | 1810364 | 2629510 | 7674891 | 1183656 |
Ralph Nader | 2882954 | - | - | - | - | - |
Table 3 : Percentage of votes each candidate recieved for each term year
# Filteing the election_data for the years 2000 to 2020
year_data <- election_data[election_data$year >= 2000 & election_data$year <= 2020, ]
# Group by year and candidate, and calculate the total votes for each candidate in each year
candidate_votes <- year_data %>%
group_by(year, candidate) %>%
summarize(total_votes = sum(candidatevotes))
# Calculating the total votes for each year
total_votes_per_year <- year_data %>%
group_by(year) %>%
summarize(total_votes = sum(candidatevotes))
# Merging the two data frames to get the total votes as percentages
candidate_votes_percentage <- merge(candidate_votes, total_votes_per_year, by = "year") %>%
mutate(percentage_votes = (total_votes.x / total_votes.y) * 100) %>%
select(-total_votes.x, -total_votes.y) %>%
spread(year, percentage_votes, fill = 0)
# Converting percentage values to numeric, round to two decimals, and add "%" symbol
candidate_votes_percentage[, -1] <- lapply(candidate_votes_percentage[, -1], function(x) paste0(format(round(as.numeric(x), 2), nsmall = 2), "%"))
# Printing the candidate_votes_percentage data frame
candidate_votes_percentage_table <- flextable::qflextable(candidate_votes_percentage)
candidate_votes_percentage_tablecandidate | 2000 | 2004 | 2008 | 2012 | 2016 | 2020 |
|---|---|---|---|---|---|---|
Al Gore | 48.38% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
Barack Obama | 0.00% | 0.00% | 52.94% | 50.95% | 0.00% | 0.00% |
Donald Trump | 0.00% | 0.00% | 0.00% | 0.00% | 46.14% | 46.83% |
George W. Bush | 47.87% | 50.63% | 0.00% | 0.00% | 0.00% | 0.00% |
Hillary Clinton | 0.00% | 0.00% | 0.00% | 0.00% | 48.24% | 0.00% |
Jo Jorgensen | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 1.14% |
Joe Biden | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 51.28% |
John Kerry | 0.00% | 48.15% | 0.00% | 0.00% | 0.00% | 0.00% |
John Mccain | 0.00% | 0.00% | 45.68% | 0.00% | 0.00% | 0.00% |
Mitt Romney | 0.00% | 0.00% | 0.00% | 47.01% | 0.00% | 0.00% |
Other | 1.01% | 1.22% | 1.38% | 2.04% | 5.62% | 0.75% |
Ralph Nader | 2.73% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
Frequency table for two categorical variables
From the below table we can observe frequency of candidates participated from year 2000 - 2020
# Filtering the election_data for the years 2000 to 2020
year_data <- election_data[election_data$year >= 2000 & election_data$year <= 2020, ]
# Obtaining the unique candidates for each year
candidate_occurrence <- with(year_data, tapply(candidate, year, function(x) unique(x)))
# Creating a data frame to display the occurrence of candidates
occurrence_data_frame <- as.data.frame(candidate_occurrence)
# Replacing empty cells with zero
occurrence_data_frame[is.na(occurrence_data_frame)] <- 0
# Printing the candidate occurrence data frame
occurrence_data_frame## candidate_occurrence
## 2000 Al Gore, George W. Bush, Ralph Nader, Other
## 2004 John Kerry, George W. Bush, Other
## 2008 Barack Obama, John Mccain, Other
## 2012 Barack Obama, Mitt Romney, Other
## 2016 Hillary Clinton, Donald Trump, Other
## 2020 Joe Biden, Other, Donald Trump, Jo Jorgensen
Now, we could look at election outcomes (which party won the overall election?) and whether or not a candidate won a majority of the popular vote and In order to calculate the state_winners, total electoral seats won by each party we need know the details of electoral college seats. For that I referred link https://www.archives.gov/electoral-college/allocation and used in below step 1 to load into data frame.
# Creating a lookup table for electoral college votes allocation
default_electoral_college_seats_state_wise <- data.frame(
state = c("Alabama", "Kentucky", "North Dakota", "Alaska", "Louisiana", "Ohio", "Arizona", "Maine", "Oklahoma", "Arkansas",
"Maryland", "Oregon", "California", "Massachusetts", "Pennsylvania", "Colorado", "Michigan", "Rhode Island", "Connecticut",
"Minnesota", "South Carolina", "Delaware", "Mississippi", "South Dakota", "District of Columbia", "Missouri", "Tennessee",
"Florida", "Montana", "Texas", "Georgia", "Nebraska", "Utah", "Hawaii", "Nevada", "Vermont", "Idaho", "New Hampshire",
"Virginia", "Illinois", "New Jersey", "Washington", "Indiana", "New Mexico", "West Virginia", "Iowa", "New York", "Wisconsin",
"Kansas", "North Carolina", "Wyoming"),
electoral_seats = c(9, 8, 3, 3, 8, 17, 11, 4, 7, 6, 10, 8, 54, 11, 19, 10, 15, 4, 7, 10, 9, 3, 6, 3, 3, 10, 11, 30, 4, 40, 16, 5, 6,
4, 4, 3, 4, 4, 13, 19, 14, 12, 11, 5, 4, 6, 28, 10, 6, 16, 3)
)
# Merging election_data with the electoral_seats lookup table based on state
merged_election_data <- merge(election_data, default_electoral_college_seats_state_wise, by = "state", all.x = TRUE)
# Calculating total votes for each party in each state for each year
merged_election_data <- merged_election_data %>%
group_by(year, state, party) %>%
summarise(total_votes_state_party = sum(candidatevotes))
# Determining the winner party in each state for each year based on total votes majority
state_winners <- merged_election_data %>%
group_by(year, state) %>%
filter(total_votes_state_party == max(total_votes_state_party)) %>%
ungroup()
# Merging the state_winners with the electoral_seats table to assign electoral votes to each party
state_winners_with_votes <- state_winners %>%
left_join(default_electoral_college_seats_state_wise, by = c("state" = "state"))
# Calculating the total electoral votes won by each party for each year
electoral_college_seats <- state_winners_with_votes %>%
group_by(year, party) %>%
summarise(total_electoral_seats = sum(electoral_seats, na.rm = TRUE))
electoral_college_seats## # A tibble: 12 × 3
## # Groups: year [6]
## year party total_electoral_seats
## <dbl> <chr> <dbl>
## 1 2000 Democrat 276
## 2 2000 Republican 257
## 3 2004 Democrat 239
## 4 2004 Republican 294
## 5 2008 Democrat 351
## 6 2008 Republican 182
## 7 2012 Democrat 324
## 8 2012 Republican 209
## 9 2016 Democrat 227
## 10 2016 Republican 306
## 11 2020 Democrat 298
## 12 2020 Republican 235
# Determining the winning party based on total electoral votes for each year
winning_party <- electoral_college_seats %>%
group_by(year) %>%
filter(total_electoral_seats == max(total_electoral_seats)) %>%
ungroup()
# Select the final columns to display and rename the columns
winning_party_table_each_term <- winning_party %>%
select(year, party, total_electoral_seats) %>%
rename("Total Electoral College Seats Won" = total_electoral_seats, "Winning Party" = party)
winning_party_table_each_term$year <- gsub(",", "", winning_party_table_each_term$year)
winning_party_table_each_term## # A tibble: 6 × 3
## year `Winning Party` `Total Electoral College Seats Won`
## <chr> <chr> <dbl>
## 1 2000 Democrat 276
## 2 2004 Republican 294
## 3 2008 Democrat 351
## 4 2012 Democrat 324
## 5 2016 Republican 306
## 6 2020 Democrat 298
b. Data Visualizations
Plot 1
From the below graph we can observe the metrics of number of votings took place over each term between period (2000-2020)
# plot 1: Line chart showing in which year most votings took place
# Calculating total votes per year
total_votes_per_year <- election_data %>%
group_by(year) %>%
summarize(total_votes = sum(candidatevotes))
# Creating the line graph
total_votes_per_year_line_graph <- ggplot(total_votes_per_year, aes(x = year, y = total_votes)) +
geom_line(size=1,color="Red") +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5),
labels = scales::comma) +
labs(title = "Total Number of Polling in the United States from 2000-2020",
x = "Year",
y = "Number of Votes (in numbers)",
caption = "Data source: https://electionlab.mit.edu/data") +
theme(plot.title = element_text(hjust = 0.5))+
theme_get()
total_votes_per_year_line_graphPlot 2
From the below graph we can observe the metrics of perentage number of votings took place over each term by each party between period (2000-2020)
# First, Calculate the percentage of votes for each party in each year
votes_of_party_in_percentage <- election_data %>%
group_by(year, party) %>%
summarize(percentage_votes = sum(candidatevotes) / sum(totalvotes) * 100) %>%
ungroup()
# Then, Reorder the levels of the party variable based on average percentage of votes across all years
votes_of_party_in_percentage <- votes_of_party_in_percentage %>%
mutate(party = fct_reorder(party, percentage_votes, .desc = TRUE))
# Now, Creating the stacked bar chart with facet_wrap by year
ggplot(votes_of_party_in_percentage, aes(fill = party, y = percentage_votes, x = party)) +
geom_bar(position = position_dodge(width = 0.7), stat = "identity") +
scale_fill_viridis(discrete = TRUE, option = "E") +
facet_wrap(~ year, ncol = 3, scales = "free_x") +
theme_minimal() +
xlab("Party") +
labs(title = "Percentage of Votes Received by Each Party Each Year",
x = "Party",
y = "Percentage of Votes",
caption = "Data source: https://electionlab.mit.edu/data") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "bottom") +
scale_y_continuous(breaks = seq(0, 100, by = 10))Plot 3
The plot(using plotly) visualizes the search interest for the term “Election Fraud” and “US Election Results 2020” during the 2020 US Election Season. The y-axis represents the relative interest in the search term, ranging from 0 to 100, with 100 being the peak popularity. The x-axis shows the months in the year 2020, spanning from August 2nd to December 3rd.
The plot also provides insights into how public interest in the topic of “Election Fraud” and “US Election Results 2020” evolved during the specified period, helping to understand its popularity and fluctuations throughout the election season.
# Function to load data from gtrends, if the gtrends fetching fails, will load the local RDS file
load_gtrends_data <- function() {
tryCatch({
# Attempt to fetch data from gtrends
data <- gtrends(c("US Election Results 2020", "Election Fraud"),
time = "2020-08-02 2020-12-03", geo = "US")
# Manipulating date using lubridate function
data$interest_over_time$date <- ymd(data$interest_over_time$date) # Converting to proper date format
write_rds(data, file = "importedDataGtrends.rds") # Save data to RDS file
return(data) # Return the data
}, error = function(err) {
# If gtrends call fails, attempt to read data from RDS file
if (file.exists("importedDataGtrends.rds")) {
data <- read_rds("importedDataGtrends.rds")
# Manipulating date using lubridate function
data$interest_over_time$date <- ymd(data$interest_over_time$date) # Converting to proper date format
return(data) # Return the data from RDS file
} else {
message("Failed to fetch data from gtrends and no RDS file found.")
return(NULL) # Return NULL or any other default value if both attempts fail
}
})
}
# Loading google trends data using the load_gtrends_data function
gTrendsElectionSearchData <- load_gtrends_data()
if (!is.null(data)) {
plot_trend <- function(keyword_string, data) {
time_trend <- data$interest_over_time %>%
mutate(hits = ifelse(hits == "<1", 0.5, as.numeric(hits)),
date = as.Date(date),
keyword = factor(keyword, levels = keyword_string))
plot <- plot_ly(data = time_trend, x = ~date, y = ~hits, color = ~keyword,
type = "scatter", mode = "lines", fill = "tozeroy") %>%
layout(title = "Exploring the Search Interest for 'Election Fraud' and 'US Election Results 2020' during the 2020 US Election Season",
xaxis = list(title = "Months in year 2020", standoff = 20),
yaxis = list(title = "Hits (Relative to Peak from 0 - 100)", standoff = 20),
legend = list(orientation = "v"),
showlegend = TRUE,
margin = list(l = 100, r = 100, b = 100, t = 100), # inorder to Adjust margins
padding = list(r = 10, b = 50) # inorder to Adjust padding
)
return(plot)
}
election_fraud_trend_plot <- plot_trend(keyword_string = c("US Election Results 2020", "Election Fraud"),gTrendsElectionSearchData)
election_fraud_trend_plot
}Plot 4
The plot displays the winners in different states for the year 2016 and 2020 US Presidential Election. The map provides a clear visual representation of the electoral outcome across the United States during the specified election year.
# Loading the us_states shapefile from the maps package
us_states <- map_data("state")
# Renamed the "region" column to "state" in us_states
us_states <- us_states %>%
rename(state = region)
us_states <- us_states %>%
mutate(state = str_to_title(state))
# Filtering the state_winners_with_votes dataset for the year 2020
state_winners_2020 <- state_winners_with_votes %>%
filter(year == 2020)
state_winners_2016 <- state_winners_with_votes %>%
filter(year == 2016)
# Left join the us_states dataset with state_winners_2020 based on the "state" column
map_data_combined_2020 <- left_join(us_states, state_winners_2020, by = "state", copy = TRUE)
map_data_combined_2016 <- left_join(us_states, state_winners_2016, by = "state", copy = TRUE)
# Displaying the map view which show the winners of different states by the parties
election_outcome_map_2016 <- ggplot(map_data_combined_2016, aes(x = long, y = lat, group = group, fill = party)) +
geom_polygon(color = "white", size = 0.25) + # Add thick white borders between states
coord_quickmap() +
scale_fill_manual(values = c("Democrat" = "blue", "Republican" = "red", "Other" = "gray")) +
labs(title = "2016 Election outcome by state",
caption = "Data source: https://electionlab.mit.edu/data") +
theme_void()
election_outcome_map_2016election_outcome_map_2020 <- ggplot(map_data_combined_2020, aes(x = long, y = lat, group = group, fill = party)) +
geom_polygon(color = "white", size = 0.25) + # Add thick white borders between states
coord_quickmap() +
scale_fill_manual(values = c("Democrat" = "blue", "Republican" = "red", "Other" = "gray")) +
labs(title = "2020 Election outcome by state",
caption = "Data source: https://electionlab.mit.edu/data") +
theme_void()
election_outcome_map_2020Plot 5
The plot displays county level median income in USA in the year 2021
# Merge shape and census data
ggMapData <- county_shape %>%
full_join(census_data, by = c("GEOID" = "county_fips")) %>%
dplyr::filter(year == 2021)
# Fixing issue with Alaska and Hawaii
ggMapDataFix <- ggMapData %>%
tigris::shift_geometry()
# Plotting by thresholding max poverty level at 50%
povertyGG <- ggMapDataFix %>% rowwise() %>%
mutate(pov_plot = min(median_income, 156821)) %>%
ggplot(aes(fill = pov_plot)) +
geom_sf(color = "#C0C0C0",
size = 0.1) +
scale_fill_gradient(low = "white", high = "dodgerblue",
limits = c(0, 180000)) +
labs( title = "County-level median income, 2021",
caption = "Data source: https://electionlab.mit.edu/data",
fill = "Median Income") +
theme_void()
povertyGG4. Monte Carlo Methods of Inference
Implementing a Permutation test
# Filtering data for California and Texas in year 2020
california_data <- election_census_merged_data %>%
filter(state == "California", year == 2020)
michigan_data <- election_census_merged_data %>%
filter(state == "Michigan", year == 2020)
# Calculating the observed median income difference between California and Texas
observed_median_diff <- median(california_data$median_income) - median(michigan_data$median_income)
# Number of permutations
n_permutations <- 10000
# Creating a function to calculate the median income difference in each permutation
calculate_permutation_diff <- function(data) {
california_counties <- data %>%
filter(state == "California") %>%
sample_n(size = nrow(california_data), replace = TRUE)
michigan_counties <- data %>%
filter(state == "Texas") %>%
sample_n(size = nrow(michigan_data), replace = TRUE)
return(median(california_counties$median_income) - median(michigan_counties$median_income))
}
# Performing the permutation test
set.seed(1995) # For reproducibility
permutation_diffs <- replicate(n_permutations, calculate_permutation_diff(election_census_merged_data))
# Removing non-finite values from permutation_diffs
permutation_diffs <- permutation_diffs[is.finite(permutation_diffs)]
# Removing non-finite values from observed_median_diff
observed_median_diff <- ifelse(is.finite(observed_median_diff), observed_median_diff, 0)# Ploting the null distribution
ggplot() +
geom_histogram(aes(x = permutation_diffs), bins = 30, color = "black", fill = "lightblue") +
geom_vline(xintercept = observed_median_diff, color = "red", linetype = "dashed", size = 1) +
labs(title = "Null Distribution of Median Income Difference between California and Michigan (2020)",
x = "Median Income Difference",
y = "Frequency") +
theme_minimal()# Now, Calculating the p-value
p_value <- mean(permutation_diffs >= observed_median_diff, na.rm = TRUE)
p_value## [1] 0.01227287
Hence, With a p-value of 0.0122729, we have strong evidence to reject the null hypothesis, indicating that there is a statistically significant difference in median income between counties in California and Michigan for the year 2020.
The permutation test indicates that there is a statistically significant difference in median income between counties in California and Michigan for the year 2020. The observed median income difference between the two states is higher than expected by random chance, suggesting a genuine disparity in income levels. This finding highlights important economic differences between California and Michigan and may be relevant for policymaking and economic development decisions.
5. Bootstrap Methods of Inference
a. Non-parametric Bootstrapping
# Simulating data from distribution
set.seed(1989)
n <- 135
observedVal <- election_census_merged_data_republican_2016$candidatevotes
# Creating a raincloud plot to visualize the simulated data. Describe the data in terms of its number of peaks (e.g., unimodal, bimodal, etc.) and its symmetry.
tibble(value=observedVal) %>% ggplot(aes(y = value)) +
ggdist::stat_halfeye(adjust = .5, width = 2*.3, .width = c(0.5, 1)) +
geom_boxplot(width = .3, outlier.shape = NA) +
ggdist::stat_dots(side = "left", dotsize = 6, justification = 1.05, binwidth = .1,
color = "black") +
coord_flip() +
labs(y = "Value",
title = "simulated values from non-normal distribution") +
theme_bw() +
theme(legend.position = "none")# What is the sample median for this data?
median(observedVal)## [1] 7239
# Number of bootstrap samples
Bsamples <- 10000
# Instantiating matrix for bootstrap samples
bootsMatrix <- matrix(NA, nrow = n, ncol = Bsamples)
# Sampling with replacement B times
for(b in 1:Bsamples) {
bootsMatrix[, b] <- observedVal[sample(1:n, size = n, replace = TRUE)]
}
# Using the generated bootstrap samples, will create a bootstrap distribution of sample medians, and will visualize this distribution using a histogram.
# Installing vector for bootstrap medians
bootmediansVector <- vector(length=Bsamples)
#sampling with replacement Bsamples times
for (b in 1:Bsamples){
bootmediansVector[b]<-median(bootsMatrix[,b])
}
# creating a histogram
tibble(Median=bootmediansVector) %>% ggplot(aes(x = Median)) +
geom_histogram(color="white")+
labs(y = "frequency",
title = "Distribution of bootstrap medians") +
theme_bw()# Now, use the bootstrap samples to obtain a nonparametric estimate of the standard error of the sample median.
SEestimates <- sd(bootmediansVector)
SEestimates## [1] 4255.382
# Now, Use the bootstrap samples to obtain a nonparametric 95% confidence interval for the population median.
lowerBoundMedianVal <- quantile(bootmediansVector,probs=0.025)
upperBoundMedianVal <- quantile(bootmediansVector,probs=0.975)We are 95% confident that the true median is in between 41396 and 58970.
# creating a histogram
tibble(Median=bootmediansVector) %>% ggplot(aes(x = Median)) +
geom_histogram(color="white")+
geom_vline(xintercept= lowerBoundMedianVal,size=1,
color="dodgerblue",linetype="solid")+
geom_vline(xintercept= upperBoundMedianVal,size=1,
color="dodgerblue",linetype="solid")+
labs(y = "frequency",
title = "Distribution of bootstrap medians") +
theme_bw()b. Parametric Bootstrapping
# First, Generate B=10,000 samples each of size 30 from a normal distribution using the estimated mean and standard deviation from the observed data set, and visualize this distribution using a histogram.
BsamplesForParametric <- 10000
observedMaleCountVal <- election_census_merged_data_republican_2016$male_count
# Then, Instantiating matrix for bootstrap samples
parammetricBootMatrix <- matrix(NA, nrow = n, ncol = BsamplesForParametric)
xBar<-mean(observedMaleCountVal)
s<-sd(observedMaleCountVal)
# Now, Sampling a normal set of n values, B times
for(b in 1:BsamplesForParametric) {
parammetricBootMatrix[, b] <- rnorm(n=n,mean=xBar,sd=s)
}
# Now, Installing vector for bootstrap medians
bootParametricMedians <- vector(length=BsamplesForParametric)
# Next, caluculating median for each simulated data set
for (b in 1:BsamplesForParametric){
bootParametricMedians[b]<-median(parammetricBootMatrix[,b])
}
# creating a histogram
tibble(Median=bootParametricMedians) %>% ggplot(aes(x = Median)) +
geom_histogram(color="white")+
labs(y = "frequency",
title = "Distribution of Parametric bootstrap medians") +
theme_bw()# Obtain a parametric bootstrap estimate of the standard error of the sample median.
SEparametricestimate <- sd(bootParametricMedians)
SEparametricestimate## [1] 17906.09
# Obtain a parametric bootstrap 95% confidence interval for the sample median.
lowerBoundParmetricMed <- quantile(bootParametricMedians,probs=0.025)
upperBoundParametricMed <- quantile(bootParametricMedians,probs=0.975)
# creating a histogram
tibble(Median=bootParametricMedians) %>% ggplot(aes(x = Median)) +
geom_histogram(color="white")+
geom_vline(xintercept= lowerBoundParmetricMed,size=1,
color="dodgerblue",linetype="solid")+
geom_vline(xintercept= upperBoundParametricMed,size=1,
color="dodgerblue",linetype="solid")+
labs(y = "frequency",
title = "Distribution of parametric bootstrap medians") +
theme_bw()6. Conclusions / Main Takeaways
In conclusion, the analysis of presidential elections data at the county level in the USA provides valuable insights into voting trends, party voting patterns, and candidate performance. It also highlights the importance of search interest in specific topics during the election season and sheds light on income disparities across regions. These findings can be utilized to better understand the dynamics of the electoral process and contribute to discussions on improving voting practices and policies.
From the above data analysis, I even got ideas of future work like,
Voter Turnout Analysis: Explore factors influencing voter turnout at the county level, and identify strategies to increase voter participation.
Temporal Analysis: Extend the analysis to include more recent election data beyond 2020, if available, to observe any emerging trends or changes in voting patterns over time.
Comparative Studies: Conduct comparative studies with elections from other countries to identify similarities and differences in voting behaviors and electoral processes.
Interactive Visualization: Create interactive data visualizations and dashboards to allow users to explore the data and draw their own insights.
Acknowledgment to Professor:
I would like to extend my sincere gratitude to Professor Andrew Dilernia for his invaluable guidance and support throughout this project. His expertise in data selection and research methodologies has been instrumental in shaping the direction of this analysis. I am grateful for the opportunity to learn under his mentorship is truly inspiring.