Analysis of US Presidential Elections Data at the County Level

Loading packages

This project makes use of the following libraries. We used "readr" and "readxl" to read the data, "skimr" to explore the data and summarize its characteristics, "datadictionary" and "flextable" to create and display a data dictionary containing a description of each variable in the dataset, and "lubridate" and "stringr" to manipulate the data. These are some of the primary tasks carried out with the assistance of the libraries loaded below.

library(tidyverse)
library(magrittr)
library(stringr)
library(skimr)
library(ggpubr)
library(readr)
library(readxl)
library(dplyr)
library(tidyr)
library(lubridate)
library(psych)
library(ggplot2)
library(forcats)  # for fct_reorder
library(flextable)
library(viridis)
library(hrbrthemes)
library(datadictionary)
library(gtrendsR)
library(knitr)
library(patchwork)
library(plotly)
library(maps)
library(sf)

Importing the data

These are the datasets that were used in this project. First, "Presidential_Elections_Data_count_level.csv" contains county-level returns for elections to the U.S. presidency from 2000 to 2020, obtained from the MIT Election Lab website (https://electionlab.mit.edu/data). Second, "census.csv" contains county-level demographic and socioeconomic measures (population, income, sex, race, education, and poverty proportions) for the years 2008 to 2021.

# Importing CSV file
election_data <- read_csv("Presidential_Elections_Data_count_level.csv")
census_data <- read_csv("census.csv")


# Downloading county-level shape files from US Census Bureau
if(!file.exists("cb_2018_us_county_500k.zip")) {
  download.file(url = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_500k.zip",
                destfile = "cb_2018_us_county_500k.zip")
}

if(!dir.exists("GeoFiles")) {
  dir.create("GeoFiles/")
}

# Unzipping files
utils::unzip("cb_2018_us_county_500k.zip",
             exdir = "GeoFiles")

# Loading the shapefiles
county_shape <- st_read("GeoFiles/cb_2018_us_county_500k.shp",
                        quiet = TRUE)
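
A quick sanity check that the shapefile loaded as expected, printing the number of county polygons and the coordinate reference system (an added check, not part of the original script):

# Checking the number of county polygons and the CRS of the shapefile
nrow(county_shape)
st_crs(county_shape)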

Exploring the data

In this section we explore and display high-level characteristics of the data sets, e.g., important variables and their types, the different levels of factor variables, and any patterns of missing values.

After exploring the data with the skim() function, we can observe that there are 57 missing values for the county_fips variable in the election data; the census data also shows large blocks of missingness in the race and education proportion variables (complete rates of roughly 0.19-0.20).
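
To see where those missing FIPS codes occur, a quick filter can list the affected state and county combinations; a small added check (not part of the original script), using the columns as imported:

# Listing rows with a missing county_fips, counted by state and county
election_data %>%
  filter(is.na(county_fips)) %>%
  count(state, county_name, sort = TRUE)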

## exploring the data
glimpse(election_data)
## Rows: 72,617
## Columns: 12
## $ year           <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2…
## $ state          <chr> "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", …
## $ state_po       <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
## $ county_name    <chr> "AUTAUGA", "AUTAUGA", "AUTAUGA", "AUTAUGA", "BALDWIN", …
## $ county_fips    <chr> "01001", "01001", "01001", "01001", "01003", "01003", "…
## $ office         <chr> "US PRESIDENT", "US PRESIDENT", "US PRESIDENT", "US PRE…
## $ candidate      <chr> "AL GORE", "GEORGE W. BUSH", "RALPH NADER", "OTHER", "A…
## $ party          <chr> "DEMOCRAT", "REPUBLICAN", "GREEN", "OTHER", "DEMOCRAT",…
## $ candidatevotes <dbl> 4942, 11993, 160, 113, 13997, 40872, 1033, 578, 5188, 5…
## $ totalvotes     <dbl> 17208, 17208, 17208, 17208, 56480, 56480, 56480, 56480,…
## $ version        <dbl> 20220315, 20220315, 20220315, 20220315, 20220315, 20220…
## $ mode           <chr> "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "…
skim(election_data)
Data summary
Name election_data
Number of rows 72617
Number of columns 12
_______________________
Column type frequency:
character 8
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 4 20 0 51 0
state_po 0 1 2 2 0 51 0
county_name 0 1 3 21 0 1892 0
county_fips 57 1 5 5 0 3156 0
office 0 1 12 12 0 1 0
candidate 0 1 5 17 0 13 0
party 0 1 5 11 0 5 0
mode 0 1 4 20 0 16 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2011.30 7.52 2000 2004 2012 2020 2020 ▇▃▃▃▇
candidatevotes 0 1 10781.83 46034.94 0 115 1278 5848 3028885 ▇▁▁▁▁
totalvotes 0 1 42514.21 121951.40 0 5175 11194 29855 4264365 ▇▁▁▁▁
version 0 1 20220315.00 0.00 20220315 20220315 20220315 20220315 20220315 ▁▁▇▁▁
glimpse(census_data)
## Rows: 52,612
## Columns: 26
## $ geoid                    <chr> "01003", "01015", "01043", "01049", "01051", …
## $ county_state             <chr> "Baldwin County, Alabama", "Calhoun County, A…
## $ year                     <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 200…
## $ population               <dbl> 174439, 113419, 81324, 68515, 78106, 103303, …
## $ median_income            <dbl> 52320, 40056, 39229, 34357, 54601, 37642, 426…
## $ median_monthly_rent_cost <dbl> 790, 533, 562, 536, 620, 569, 570, 737, 537, …
## $ median_monthly_home_cost <dbl> 884, 686, 653, 548, 787, 595, 673, 975, 655, …
## $ prop_female              <dbl> 0.5072088, 0.5213941, 0.5008853, 0.5028388, 0…
## $ prop_male                <dbl> 0.4927912, 0.4786059, 0.4991147, 0.4971612, 0…
## $ prop_white               <dbl> 0.8650990, 0.7668380, 0.9676971, 0.9294607, 0…
## $ prop_black               <dbl> 0.08835180, 0.20128021, 0.01090699, 0.0140115…
## $ prop_native              <dbl> 0.006896394, 0.003720717, 0.007463971, 0.0164…
## $ prop_asian               <dbl> 0.0072632840, 0.0081291494, 0.0024592986, 0.0…
## $ prop_hawaiin_islander    <dbl> 0.0017312642, 0.0000000000, 0.0000000000, 0.0…
## $ prop_other_race          <dbl> 0.0146813499, 0.0092312575, 0.0004303773, 0.0…
## $ prop_multi_racial        <dbl> 0.015976932, 0.010800660, 0.011042251, 0.0155…
## $ prop_highschool          <dbl> 0.2234336, 0.2642586, 0.2830509, 0.2945016, 0…
## $ prop_GED                 <dbl> 0.04600955, 0.07705351, 0.07444981, 0.0797542…
## $ prop_some_college        <dbl> 0.06240842, 0.06737910, 0.05683327, 0.0587472…
## $ prop_college_no_degree   <dbl> 0.1687486, 0.1806985, 0.1389848, 0.1287847, 0…
## $ prop_associates          <dbl> 0.08350924, 0.05767840, 0.09165914, 0.0494131…
## $ prop_bachelors           <dbl> 0.20768040, 0.09130223, 0.08016855, 0.0449906…
## $ prop_masters             <dbl> 0.06303756, 0.04145800, 0.04394398, 0.0302134…
## $ prop_professional        <dbl> 0.013004859, 0.008675421, 0.008834829, 0.0026…
## $ prop_doctoral            <dbl> 0.0108111688, 0.0155106011, 0.0012570599, 0.0…
## $ prop_poverty             <dbl> 0.09132561, 0.18246331, 0.15272973, 0.2357663…
skim(census_data)
Data summary
Name census_data
Number of rows 52612
Number of columns 26
_______________________
Column type frequency:
character 2
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
geoid 0 1 5 5 0 3225 0
county_state 0 1 16 42 0 3229 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2014.82 3.78 2008.00 2012.00 2015.00 2018.00 2021.00 ▆▇▅▇▇
population 0 1.00 145758.59 395597.70 41.00 14429.00 40119.50 122855.25 10170292.00 ▇▁▁▁▁
median_income 4 1.00 49709.77 15079.85 10499.00 40264.50 47527.00 56565.25 156821.00 ▃▇▁▁▁
median_monthly_rent_cost 30 1.00 741.63 235.85 99.00 587.00 688.00 841.00 2599.00 ▃▇▁▁▁
median_monthly_home_cost 18 1.00 837.68 388.18 103.00 572.00 743.00 1002.00 3254.00 ▇▇▁▁▁
prop_female 0 1.00 0.50 0.02 0.19 0.50 0.51 0.51 0.63 ▁▁▁▇▁
prop_male 0 1.00 0.50 0.02 0.37 0.49 0.49 0.50 0.81 ▁▇▁▁▁
prop_white 42443 0.19 0.78 0.15 0.08 0.71 0.82 0.89 0.98 ▁▁▂▅▇
prop_black 42443 0.19 0.11 0.12 0.00 0.03 0.06 0.15 0.73 ▇▂▁▁▁
prop_native 42443 0.19 0.01 0.05 0.00 0.00 0.00 0.01 0.80 ▇▁▁▁▁
prop_asian 42443 0.19 0.03 0.04 0.00 0.01 0.02 0.04 0.44 ▇▁▁▁▁
prop_hawaiin_islander 42443 0.19 0.00 0.01 0.00 0.00 0.00 0.00 0.14 ▇▁▁▁▁
prop_other_race 42443 0.19 0.03 0.05 0.00 0.01 0.02 0.04 0.80 ▇▁▁▁▁
prop_multi_racial 42443 0.19 0.04 0.04 0.00 0.02 0.03 0.04 0.69 ▇▁▁▁▁
prop_highschool 41916 0.20 0.25 0.06 0.05 0.21 0.25 0.29 0.49 ▁▅▇▂▁
prop_GED 41916 0.20 0.04 0.02 0.00 0.03 0.04 0.05 0.18 ▆▇▁▁▁
prop_some_college 41916 0.20 0.07 0.02 0.00 0.06 0.07 0.08 0.15 ▁▅▇▂▁
prop_college_no_degree 41916 0.20 0.15 0.03 0.05 0.13 0.15 0.17 0.28 ▁▆▇▂▁
prop_associates 41916 0.20 0.09 0.02 0.02 0.07 0.09 0.10 0.29 ▅▇▁▁▁
prop_bachelors 41916 0.20 0.18 0.06 0.04 0.13 0.17 0.21 0.40 ▂▇▆▂▁
prop_masters 41916 0.20 0.08 0.03 0.01 0.05 0.07 0.09 0.29 ▇▇▂▁▁
prop_professional 41916 0.20 0.02 0.01 0.00 0.01 0.02 0.02 0.10 ▇▃▁▁▁
prop_doctoral 41916 0.20 0.01 0.01 0.00 0.01 0.01 0.02 0.11 ▇▁▁▁▁
prop_poverty 2 1.00 0.16 0.08 0.00 0.11 0.15 0.19 0.67 ▆▇▁▁▁

1. Data Dictionary for the datasets used

# Creating a data dictionary for the election data, listing a short
# description, storage type, and class for each variable.
dataDictionary <- tibble(Variable = colnames(election_data),
                         Description = c(
                            "year of election",
                            "state name",
                            "state postal code abbreviation",
                            "county name",
                            "county fips code",
                            "name of the public office to which the candidate is seeking election",
                            "candidate name",
                            "party of the candidate",
                            "number of votes for candidate",
                            "total votes cast in the election",
                            "date-stamped version of the dataset (YYYYMMDD)",
                            "mode of voting (e.g., total, absentee)"
),
                         Type = map_chr(election_data, .f = function(x){typeof(x)[1]}),
                         Class = map_chr(election_data, .f = function(x){class(x)[1]}))
# Printing nicely in R Markdown (option 1)
flextable::flextable(dataDictionary, cwidth = 2)

Variable        Description                                                            Type       Class
year            year of election                                                       double     numeric
state           state name                                                             character  character
state_po        state postal code abbreviation                                         character  character
county_name     county name                                                            character  character
county_fips     county fips code                                                       character  character
office          name of the public office to which the candidate is seeking election   character  character
candidate       candidate name                                                         character  character
party           party of the candidate                                                 character  character
candidatevotes  number of votes for candidate                                          double     numeric
totalvotes      total votes cast in the election                                       double     numeric
version         date-stamped version of the dataset (YYYYMMDD)                         double     numeric
mode            mode of voting (e.g., total, absentee)                                 character  character

# Printing nicely in R Markdown (option 2)
knitr::kable(dataDictionary)
Variable Description Type Class
year year of election double numeric
state state name character character
state_po state postal code abbreviation character character
county_name county name character character
county_fips county fips code character character
office name of the public office to which the candidate is seeking election character character
candidate candidate name character character
party party of the candidate character character
candidatevotes number of votes for candidate double numeric
totalvotes total votes cast in the election double numeric
version date-stamped version of the dataset (YYYYMMDD) double numeric
mode mode of voting (e.g., total, absentee) character character
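
Since the datadictionary package is loaded above, a dictionary can also be generated directly from the data frame; a minimal sketch, assuming the package's create_dictionary() interface:

# Generating a data dictionary automatically with the datadictionary package
election_dictionary <- datadictionary::create_dictionary(election_data)
head(election_dictionary)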

2. Data Cleaning (string modification and merging data sets)

Cleaning the data sets and modifying the variables to match the requirements of the project.

a. String modification

# Converting values in "state" variable to capitalized form
election_data <- election_data %>%
  mutate(state = str_to_title(state))

# Converting values in "candidate" variable to capitalized form
election_data <- election_data %>%
  mutate(candidate = str_to_title(candidate))

# Converting values in "party" variable to capitalized form
election_data <- election_data %>%
  mutate(party = str_to_title(party))

# Converting values in "county_name" variable to capitalized form
election_data <- election_data %>%
  mutate(county_name = str_to_title(county_name))

election_data <- election_data %>% 
  rename(county = county_name)

# Standardizing candidate names using mutate() and str_replace()
election_data <- election_data %>% 
  mutate(candidate = str_replace(candidate, "Donald J Trump", "Donald Trump")) %>%
  mutate(candidate = str_replace(candidate, "Joseph R Biden Jr", "Joe Biden"))

# Separate "county_state" column into "county" and "state"
if ("county_state" %in% names(census_data)) {
  # Separate "county_state" column into "county" and "state"
  census_data <- census_data %>%
    separate(county_state, into = c("county", "state"), sep = ",\\s*", extra = "merge")
}
# now, remove the substring "county" in every county column value, example: "Houston County" > "Houston"
census_data$county <- str_replace(census_data$county, " County", "")

# Converting values in "state" variable in census data to capitalized form
census_data <- census_data %>%
  mutate(state = str_to_title(state))

# Converting values in "county" variable in census data to capitalized form
census_data <- census_data %>%
  mutate(county = str_to_title(county))

# Renaming the geoid column to county_fips
census_data <- census_data %>% 
  rename(county_fips = geoid)

b. Merging Data Sets

# Filtering the census_data for the year 2016
census_data_2016 <- subset(census_data, year == 2016)

# Filtering the election_data for the year 2016
election_data_2016 <- subset(election_data, year == 2016)

# Merging the two data frames on county_fips (together with state, year, and county)
election_census_merged_data <- left_join(census_data_2016, election_data_2016, by = c("county_fips", "state", "year", "county"))

# Updating the dataset with the female and male counts for each county
election_census_merged_data$male_count <- round(election_census_merged_data$population * election_census_merged_data$prop_male)

election_census_merged_data$female_count <- round(election_census_merged_data$population * election_census_merged_data$prop_female)

# Filter the dataset to show Republican candidate votes for each county_fips
election_census_merged_data_republican <- election_census_merged_data %>%
  filter(party == "Republican") %>%
  distinct(county_fips, .keep_all = TRUE)
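
A left join silently fills unmatched rows with NAs, so it is worth checking how many census counties found no 2016 election match; a quick added check (not part of the original analysis):

# Counting merged rows with no matching election record (NA candidate)
sum(is.na(election_census_merged_data$candidate))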

3. Exploratory Data Analysis

a. Tables of Summary Statistics (Grouped and Frequency)

Table 1: Grouped summary statistics of candidate votes by state

summary_state <- election_data %>%
  group_by(state) %>%
  summarise(mean_v = round(mean(candidatevotes), 2),
            median_v = round(median(candidatevotes), 2),
            sd_v = round(sd(candidatevotes), 2),
            max_v = round(max(candidatevotes), 2),
            min_v = round(min(candidatevotes), 2),
            .groups = "drop") %>%
  as.data.frame()

summary_state_table <- flextable::qflextable(summary_state)
summary_state_table

state                  mean_v     median_v   sd_v        max_v      min_v
Alabama                9,562.28   3,555.0    20,036.95   181,688    2
Alaska                 2,412.17   1,939.0    2,334.27    10,717     0
Arizona                30,384.94  1,508.0    112,684.35  978,457    0
Arkansas               2,400.19   201.0      6,834.58    89,574     0
California             67,021.72  8,106.5    203,790.55  3,028,885  2
Colorado               11,103.56  1,135.5    31,472.45   313,293    0
Connecticut            52,802.04  22,058.0   74,918.46   297,505    0
Delaware               41,267.58  30,838.0   49,206.40   195,034    294
District Of Columbia   78,320.05  17,367.0   113,325.05  317,323    1,322
Florida                36,193.16  3,897.0    77,794.32   624,146    1
Georgia                5,128.96   786.5      17,792.05   297,051    0
Hawaii                 31,985.51  12,239.0   52,880.07   238,866    133
Idaho                  4,505.79   1,032.5    12,104.62   130,699    1
Illinois               15,114.35  2,066.0    88,520.33   1,725,973  0
Indiana                8,183.73   2,682.5    19,101.97   247,772    0
Iowa                   3,573.57   1,152.0    9,276.25    128,465    0
Kansas                 3,613.44   693.0      13,483.38   184,259    3
Kentucky               4,592.88   1,530.0    13,653.98   228,288    1
Louisiana              9,249.22   2,986.5    18,781.17   152,610    11
Maine                  12,366.85  7,165.5    18,244.04   128,759    0
Maryland               15,822.72  1,229.5    42,364.39   357,837    0
Massachusetts          63,935.49  15,048.0   94,365.68   617,196    41
Michigan               16,720.18  3,518.0    51,486.44   660,085    2
Minnesota              9,488.48   2,153.0    32,290.92   532,623    4
Mississippi            4,183.36   1,790.5    7,462.33    76,112     0
Missouri               6,817.37   1,543.5    23,002.74   333,123    0
Montana                2,765.24   717.0      6,199.47    50,772     5
Nebraska               2,754.28   460.0      10,890.56   150,350    0
Nevada                 17,504.76  998.5      63,992.63   521,777    6
New Hampshire          22,193.54  12,792.0   28,790.04   122,344    191
New Jersey             51,471.90  24,212.0   62,124.22   285,967    93
New Mexico             6,769.28   1,398.0    18,473.54   193,757    0
New York               34,742.93  9,105.0    83,084.69   703,310    7
North Carolina         7,081.89   348.0      20,627.13   302,736    0
North Dakota           1,836.63   501.0      4,864.71    42,619     0
Ohio                   18,766.28  5,745.0    43,724.48   458,204    13
Oklahoma               3,946.76   431.5      13,808.50   174,741    0
Oregon                 15,031.14  2,947.0    34,565.99   367,249    1
Pennsylvania           27,830.84  7,303.0    60,148.15   603,790    12
Rhode Island           22,876.00  9,098.0    36,662.68   167,442    0
South Carolina         5,436.18   89.5       13,823.51   127,832    0
South Dakota           1,785.25   571.0      4,620.91    49,249     0
Tennessee              8,380.75   2,517.0    21,679.46   256,297    7
Texas                  9,406.51   663.0      43,432.78   918,193    0
Utah                   6,058.90   86.0       24,074.89   271,576    0
Vermont                6,519.92   3,510.0    9,348.12    72,938     12
Virginia               5,878.28   899.0      18,452.44   355,133    0
Washington             22,963.08  3,947.0    69,060.14   907,310    2
West Virginia          3,904.89   1,606.0    6,125.35    46,398     6
Wisconsin              11,854.21  3,820.5    28,906.19   332,438    0
Wyoming                3,263.54   1,101.0    4,821.71    27,891     13

Table 2: Aggregate votes for each candidate in each election year

aggregate_table <- aggregate(candidatevotes ~ year + candidate, data = election_data, FUN = sum)

# Order the aggregate_table by year and candidatevotes
aggregate_table_ordered <- arrange(aggregate_table, year, desc(candidatevotes))


# Spreading the aggregate_table so each unique year becomes a separate column
spread_table <- spread(aggregate_table_ordered, key = year, value = candidatevotes)

spread_table_filled <- spread_table %>%
  mutate(across(-candidate, ~ replace(., is.na(.), "-")))

# Printing the spread_table
flextable_spread_table <- flextable::qflextable(spread_table_filled)
flextable_spread_table

candidate        2000      2004      2008      2012      2016      2020
Al Gore          51001968  -         -         -         -         -
Barack Obama     -         -         69448278  65753019  -         -
Donald Trump     -         -         -         -         62979031  74218999
George W. Bush   50457896  61931981  -         -         -         -
Hillary Clinton  -         -         -         -         65844241  -
Jo Jorgensen     -         -         -         -         -         1810407
Joe Biden        -         -         -         -         -         81264648
John Kerry       -         58899063  -         -         -         -
John Mccain      -         -         59928695  -         -         -
Mitt Romney      -         -         -         60666387  -         -
Other            1068557   1489505   1810364   2629510   7674891   1183656
Ralph Nader      2882954   -         -         -         -         -

Table 3: Percentage of votes each candidate received in each election year

# Filtering the election_data for the years 2000 to 2020
year_data <- election_data[election_data$year >= 2000 & election_data$year <= 2020, ]

# Group by year and candidate, and calculate the total votes for each candidate in each year
candidate_votes <- year_data %>%
  group_by(year, candidate) %>%
  summarize(total_votes = sum(candidatevotes))

# Calculating the total votes for each year
total_votes_per_year <- year_data %>%
  group_by(year) %>%
  summarize(total_votes = sum(candidatevotes))

# Merging the two data frames to get the total votes as percentages
candidate_votes_percentage <- merge(candidate_votes, total_votes_per_year, by = "year") %>%
  mutate(percentage_votes = (total_votes.x / total_votes.y) * 100) %>%
  select(-total_votes.x, -total_votes.y) %>%
  spread(year, percentage_votes, fill = 0)

# Converting percentage values to numeric, round to two decimals, and add "%" symbol
candidate_votes_percentage[, -1] <- lapply(candidate_votes_percentage[, -1], function(x) paste0(format(round(as.numeric(x), 2), nsmall = 2), "%"))

# Printing the candidate_votes_percentage data frame
candidate_votes_percentage_table <- flextable::qflextable(candidate_votes_percentage)
candidate_votes_percentage_table

candidate        2000    2004    2008    2012    2016    2020
Al Gore          48.38%  0.00%   0.00%   0.00%   0.00%   0.00%
Barack Obama     0.00%   0.00%   52.94%  50.95%  0.00%   0.00%
Donald Trump     0.00%   0.00%   0.00%   0.00%   46.14%  46.83%
George W. Bush   47.87%  50.63%  0.00%   0.00%   0.00%   0.00%
Hillary Clinton  0.00%   0.00%   0.00%   0.00%   48.24%  0.00%
Jo Jorgensen     0.00%   0.00%   0.00%   0.00%   0.00%   1.14%
Joe Biden        0.00%   0.00%   0.00%   0.00%   0.00%   51.28%
John Kerry       0.00%   48.15%  0.00%   0.00%   0.00%   0.00%
John Mccain      0.00%   0.00%   45.68%  0.00%   0.00%   0.00%
Mitt Romney      0.00%   0.00%   0.00%   47.01%  0.00%   0.00%
Other            1.01%   1.22%   1.38%   2.04%   5.62%   0.75%
Ralph Nader      2.73%   0.00%   0.00%   0.00%   0.00%   0.00%

Frequency table for two categorical variables

From the table below we can observe which candidates participated in each election year from 2000 to 2020.

# Filtering the election_data for the years 2000 to 2020
year_data <- election_data[election_data$year >= 2000 & election_data$year <= 2020, ]

# Obtaining the unique candidates for each year
candidate_occurrence <- with(year_data, tapply(candidate, year, function(x) unique(x)))

# Creating a data frame to display the occurrence of candidates
occurrence_data_frame <- as.data.frame(candidate_occurrence)

# Replacing empty cells with zero
occurrence_data_frame[is.na(occurrence_data_frame)] <- 0

# Printing the candidate occurrence data frame
occurrence_data_frame
##                              candidate_occurrence
## 2000  Al Gore, George W. Bush, Ralph Nader, Other
## 2004            John Kerry, George W. Bush, Other
## 2008             Barack Obama, John Mccain, Other
## 2012             Barack Obama, Mitt Romney, Other
## 2016         Hillary Clinton, Donald Trump, Other
## 2020 Joe Biden, Other, Donald Trump, Jo Jorgensen
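
The listing above shows which candidates appeared each year; for an actual two-way frequency table (counts of county-level rows for each year-candidate pair), base R's table() is a compact option, sketched below:

# Two-way frequency table of year by candidate (row counts, not vote totals)
with(year_data, table(year, candidate))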

Now we can look at election outcomes (which party won the overall election?) and whether a candidate won a majority of the popular vote. To calculate the state winners and the total electoral seats won by each party, we need the allocation of electoral college seats. For this I referred to https://www.archives.gov/electoral-college/allocation and loaded the allocation into a data frame in the step below. Note that this page lists the current (post-2020 census) allocation, so seat totals computed for earlier elections are approximations.

# Creating a lookup table for electoral college votes allocation
default_electoral_college_seats_state_wise <- data.frame(
  state = c("Alabama", "Kentucky", "North Dakota", "Alaska", "Louisiana", "Ohio", "Arizona", "Maine", "Oklahoma", "Arkansas",
            "Maryland", "Oregon", "California", "Massachusetts", "Pennsylvania", "Colorado", "Michigan", "Rhode Island", "Connecticut",
            "Minnesota", "South Carolina", "Delaware", "Mississippi", "South Dakota", "District of Columbia", "Missouri", "Tennessee",
            "Florida", "Montana", "Texas", "Georgia", "Nebraska", "Utah", "Hawaii", "Nevada", "Vermont", "Idaho", "New Hampshire",
            "Virginia", "Illinois", "New Jersey", "Washington", "Indiana", "New Mexico", "West Virginia", "Iowa", "New York", "Wisconsin",
            "Kansas", "North Carolina", "Wyoming"),
  electoral_seats = c(9, 8, 3, 3, 8, 17, 11, 4, 7, 6, 10, 8, 54, 11, 19, 10, 15, 4, 7, 10, 9, 3, 6, 3, 3, 10, 11, 30, 4, 40, 16, 5, 6,
                      4, 4, 3, 4, 4, 13, 19, 14, 12, 11, 5, 4, 6, 28, 10, 6, 16, 3)
)
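
Before merging, two quick checks are worth running (these are added checks, not part of the original analysis): the seats should sum to 538, and any state spelled differently in the two tables will silently fail to join. In particular, str_to_title() produced "District Of Columbia" in election_data, while the lookup uses "District of Columbia".

# The electoral seats should total 538; a different sum indicates an entry error
sum(default_electoral_college_seats_state_wise$electoral_seats)

# States present in election_data but absent from the lookup will join to NA seats
setdiff(unique(election_data$state), default_electoral_college_seats_state_wise$state)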

# Merging election_data with the electoral_seats lookup table based on state
merged_election_data <- merge(election_data, default_electoral_college_seats_state_wise, by = "state", all.x = TRUE)

# Calculating total votes for each party in each state for each year
merged_election_data <- merged_election_data %>%
  group_by(year, state, party) %>%
  summarise(total_votes_state_party = sum(candidatevotes))

# Determining the winner party in each state for each year based on total votes majority
state_winners <- merged_election_data %>%
  group_by(year, state) %>%
  filter(total_votes_state_party == max(total_votes_state_party)) %>%
  ungroup()

# Merging the state_winners with the electoral_seats table to assign electoral votes to each party
state_winners_with_votes <- state_winners %>%
  left_join(default_electoral_college_seats_state_wise, by = c("state" = "state"))

# Calculating the total electoral votes won by each party for each year
electoral_college_seats <- state_winners_with_votes %>%
  group_by(year, party) %>%
  summarise(total_electoral_seats = sum(electoral_seats, na.rm = TRUE))

electoral_college_seats
## # A tibble: 12 × 3
## # Groups:   year [6]
##     year party      total_electoral_seats
##    <dbl> <chr>                      <dbl>
##  1  2000 Democrat                     276
##  2  2000 Republican                   257
##  3  2004 Democrat                     239
##  4  2004 Republican                   294
##  5  2008 Democrat                     351
##  6  2008 Republican                   182
##  7  2012 Democrat                     324
##  8  2012 Republican                   209
##  9  2016 Democrat                     227
## 10  2016 Republican                   306
## 11  2020 Democrat                     298
## 12  2020 Republican                   235
# Determining the winning party based on total electoral votes for each year
winning_party <- electoral_college_seats %>%
  group_by(year) %>%
  filter(total_electoral_seats == max(total_electoral_seats)) %>%
  ungroup()

# Select the final columns to display and rename the columns
winning_party_table_each_term <- winning_party %>%
  select(year, party, total_electoral_seats) %>%
  rename("Total Electoral College Seats Won" = total_electoral_seats, "Winning Party" = party)

winning_party_table_each_term
## # A tibble: 6 × 3
##    year `Winning Party` `Total Electoral College Seats Won`
##   <dbl> <chr>                                         <dbl>
## 1  2000 Democrat                                        276
## 2  2004 Republican                                      294
## 3  2008 Democrat                                        351
## 4  2012 Democrat                                        324
## 5  2016 Republican                                      306
## 6  2020 Democrat                                        298

b. Data Visualizations

Plot 1

The graph below shows the total number of votes cast in each presidential election between 2000 and 2020.

# Plot 1: Line chart of total votes cast in each election year

# Calculating total votes per year
total_votes_per_year <- election_data %>%
  group_by(year) %>%
  summarize(total_votes = sum(candidatevotes))

# Creating the line graph
line_graph <- ggplot(total_votes_per_year, aes(x = year, y = total_votes)) +
  geom_line(size = 1, color = "red") +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 5),
                     labels = scales::comma) +
  labs(title = "Total Votes Cast in US Presidential Elections, 2000-2020",
       x = "Year",
       y = "Number of Votes",
       caption = "Data source: https://electionlab.mit.edu/data") +
  theme(plot.title = element_text(hjust = 0.5))

line_graph

Plot 2

The graph below shows the percentage of votes received by each party in each election year between 2000 and 2020.

# Calculate the percentage of votes for each party in each year
votes_of_party_in_percentage <- election_data %>%
  group_by(year, party) %>%
  summarize(percentage_votes = sum(candidatevotes) / sum(totalvotes) * 100) %>%
  ungroup()

# Reorder the levels of the party variable based on average percentage of votes across all years
votes_of_party_in_percentage <- votes_of_party_in_percentage %>%
  mutate(party = fct_reorder(party, percentage_votes, .desc = TRUE))

# Creating a dodged bar chart of party vote percentages, faceted by year
ggplot(votes_of_party_in_percentage, aes(fill = party, y = percentage_votes, x = party)) +
  geom_bar(position = position_dodge(width = 0.7), stat = "identity") +
  scale_fill_viridis(discrete = TRUE, option = "E") +
  facet_wrap(~ year, ncol = 3, scales = "free_x") +
  theme_minimal() +
  xlab("Party") +
  labs(title = "Percentage of Votes Received by Each Party Each Year",
       x = "Party",
       y = "Percentage of Votes",
       caption = "Data source: https://electionlab.mit.edu/data") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "bottom") +
  scale_y_continuous(breaks = seq(0, 100, by = 10))

Plot 3

The plot (using plotly) visualizes the search interest for the terms "Election Fraud" and "US Election Results 2020" during the 2020 US election season. The y-axis represents the relative interest in the search term, ranging from 0 to 100, with 100 being the peak popularity. The x-axis shows the months in 2020, spanning from August 2nd to December 3rd.

The plot also provides insights into how public interest in these topics evolved during the specified period, helping to understand their popularity and fluctuations throughout the election season.

# Function to load data from gtrends, if the gtrends fetching fails, will load the local RDS file
load_gtrends_data <- function() {
  tryCatch({
    # Attempt to fetch data from gtrends
    data <- gtrends(c("US Election Results 2020", "Election Fraud"), 
                    time = "2020-08-02 2020-12-03", geo = "US")
    write_rds(data, file = "importedDataGtrends.rds")  # Save data to RDS file
    return(data)  # Return the data
  }, error = function(err) {
    # If gtrends call fails, attempt to read data from RDS file
    if (file.exists("importedDataGtrends.rds")) {
      data <- read_rds("importedDataGtrends.rds")
      return(data)  # Return the data from RDS file
    } else {
      message("Failed to fetch data from gtrends and no RDS file found.")
      return(NULL)  # Return NULL or any other default value if both attempts fail
    }
  })
}
# Load data using the function
gTrendsElectionSearchData <- load_gtrends_data()

if (!is.null(gTrendsElectionSearchData)) {
  plot_trend <- function(keyword_string, data) {
    time_trend <- data$interest_over_time %>%
      mutate(hits = ifelse(hits == "<1", 0.5, as.numeric(hits)),
             date = as.Date(date),
             keyword = factor(keyword, levels = keyword_string))
    
    plot <- plot_ly(data = time_trend, x = ~date, y = ~hits, color = ~keyword,
                    type = "scatter", mode = "lines", fill = "tozeroy") %>%
      layout(title = "Exploring the Search Interest for 'Election Fraud' and 'US Election Results 2020' during the 2020 US Election Season",
             xaxis = list(title = "Months in year 2020", standoff = 20),
             yaxis = list(title = "Hits (Relative to Peak from 0 - 100)", standoff = 20),
             legend = list(orientation = "v"),
             showlegend = TRUE,
             margin = list(l = 100, r = 100, b = 100, t = 100),  # Adjust margins
             padding = list(r = 10, b = 50)  # Adjust padding
      )
    return(plot)
  }
  election_fraud_trend_plot <- plot_trend(keyword_string = c("US Election Results 2020", "Election Fraud"),gTrendsElectionSearchData)
  election_fraud_trend_plot
}

Plot 4

The maps below display the statewide winners of the 2016 and 2020 US presidential elections, providing a clear visual representation of the electoral outcome across the United States in each election year.

# Loading the us_states shapefile from the maps package
us_states <- map_data("state")

# Renamed the "region" column to "state" in us_states
us_states <- us_states %>% 
  rename(state = region)

us_states <- us_states %>%
  mutate(state = str_to_title(state))

# Filtering the state_winners_with_votes dataset for the year 2020
state_winners_2020 <- state_winners_with_votes %>%
  filter(year == 2020)

state_winners_2016 <- state_winners_with_votes %>%
  filter(year == 2016)

# Left join the us_states dataset with state_winners_2020 based on the "state" column
map_data_combined_2020 <- left_join(us_states, state_winners_2020, by = "state", copy = TRUE)
map_data_combined_2016 <- left_join(us_states, state_winners_2016, by = "state", copy = TRUE)

# Displaying the map view which show the winners of different states by the parties

election_outcome_map_2016 <- ggplot(map_data_combined_2016, aes(x = long, y = lat, group = group, fill = party)) +
  geom_polygon(color = "white", size = 0.25) +  # Add thick white borders between states
  coord_quickmap() +
  scale_fill_manual(values = c("Democrat" = "blue", "Republican" = "red", "Other" = "gray")) +
  labs(title = "2016 Election outcome by state",
       caption = "Data source:  https://electionlab.mit.edu/data") +
  theme_void()

election_outcome_map_2016

election_outcome_map_2020 <- ggplot(map_data_combined_2020, aes(x = long, y = lat, group = group, fill = party)) +
  geom_polygon(color = "white", size = 0.25) +  # Add thick white borders between states
  coord_quickmap() +
  scale_fill_manual(values = c("Democrat" = "blue", "Republican" = "red", "Other" = "gray")) +
  labs(title = "2020 Election outcome by state",
       caption = "Data source:  https://electionlab.mit.edu/data") +
  theme_void()

election_outcome_map_2020

Plot 5

The plot displays county-level median income in the USA for the year 2021.

# Merge shape and census data
ggMapData <- county_shape %>% 
  full_join(census_data, by = c("GEOID" = "county_fips")) %>% 
  dplyr::filter(year == 2021)

# Fixing issue with Alaska and Hawaii
ggMapDataFix <- ggMapData %>% 
  tigris::shift_geometry()

# Plotting county-level median income; the cap at 156,821 (the observed maximum) has no effect here
incomeGG <- ggMapDataFix %>% rowwise() %>% 
  mutate(income_plot = min(median_income, 156821)) %>% 
  ggplot(aes(fill = income_plot)) +
  geom_sf(color = "#C0C0C0",
          size = 0.1) +
  scale_fill_gradient(low = "white", high = "dodgerblue",
                      limits = c(0, 180000)) +
  labs(title = "County-level median income, 2021",
       caption = "Data source: US Census Bureau",
       fill = "Median Income") +
  theme_void()

incomeGG
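
The same shapefile can also be joined with the election returns to map county-level outcomes, a natural extension of the state maps in Plot 4. A minimal sketch for the 2020 Republican vote share follows; the rep_share calculation and color scale are illustrative assumptions, not part of the original analysis:

# County-level map of the 2020 Republican share of candidate votes
rep_share_2020 <- election_data %>%
  filter(year == 2020) %>%
  group_by(county_fips) %>%
  summarise(rep_share = sum(candidatevotes[party == "Republican"]) / sum(candidatevotes))

county_shape %>%
  inner_join(rep_share_2020, by = c("GEOID" = "county_fips")) %>%
  tigris::shift_geometry() %>%
  ggplot(aes(fill = rep_share)) +
  geom_sf(color = NA) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0.5, labels = scales::percent) +
  labs(title = "Republican share of the county-level vote, 2020",
       fill = "Vote share") +
  theme_void()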

4. Monte Carlo Methods of Inference

Implementing a permutation test: we test whether the mean state-level party vote totals differ between Republicans and Democrats by repeatedly shuffling the party labels and recomputing the difference in means, which simulates the null hypothesis of no difference between the parties.

quantitative_variable <- merged_election_data$total_votes_state_party

observed_diff <- mean(quantitative_variable[merged_election_data$party == "Republican"]) - 
                mean(quantitative_variable[merged_election_data$party == "Democrat"])

num_permutations <- 100000
perm_diffs <- replicate(num_permutations, {
  shuffled_party <- sample(merged_election_data$party)
  mean(quantitative_variable[shuffled_party == "Republican"]) - 
  mean(quantitative_variable[shuffled_party == "Democrat"])
})

# Calculating the p-value
p_value <- sum(abs(perm_diffs) >= abs(observed_diff)) / num_permutations
p_value
## [1] 0.44745
# Step 6: Assess the statistical significance
if (p_value < 0.05) {
  print("The difference in means between Republican and Democrat parties is statistically significant.")
} else {
  print("There is no significant difference in means between Republican and Democrat parties.")
}
## [1] "There is no significant difference in means between Republican and Democrat parties."

5. Bootstrap Methods of Inference

a. Non-parametric Bootstrapping

# Observed data: Republican candidate votes per county from the 2016 merged data
set.seed(1989)
n <- 135   # bootstrap sample size; only the first n values are resampled below
observedVal <- election_census_merged_data_republican$candidatevotes

# Raincloud plot of the observed data; note its number of peaks (unimodal, bimodal, etc.) and its symmetry.

tibble(value = observedVal) %>% ggplot(aes(y = value)) +
  ggdist::stat_halfeye(adjust = .5, width = 2*.3, .width = c(0.5, 1)) + 
  geom_boxplot(width = .3, outlier.shape = NA) +
  ggdist::stat_dots(side = "left", dotsize = 6, justification = 1.05, binwidth = .1,
                    color = "black") +
  coord_flip() +
  labs(y = "Value",
       title = "Republican candidate votes by county, 2016") + 
  theme_bw() +
  theme(legend.position = "none")

# What is the sample median for this data?

median(observedVal)
## [1] 7239
# Number of bootstrap samples
Bsamples <- 10000

# Instantiating matrix for bootstrap samples
bootsMatrix <- matrix(NA, nrow = n, ncol = Bsamples)

# Sampling with replacement B times
for(b in 1:Bsamples) {
bootsMatrix[, b] <- observedVal[sample(1:n, size = n, replace = TRUE)]
}

# Using the generated bootstrap samples to create a bootstrap distribution of sample medians, visualized below with a histogram.

# Initializing vector for bootstrap medians
bootmediansVector <- vector(length = Bsamples)

# Computing the median of each bootstrap sample
for (b in 1:Bsamples){
  bootmediansVector[b] <- median(bootsMatrix[, b])
}


# creating a histogram

tibble(Median=bootmediansVector) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  labs(y = "frequency",
       title = "Distribution of bootstrap medians") + 
  theme_bw()

# Use the bootstrap samples to obtain a nonparametric estimate of the standard error of the sample median.

SEestimates <- sd(bootmediansVector)
SEestimates
## [1] 4255.382
# Using the bootstrap samples to obtain a nonparametric 95% confidence interval for the population median.

lowerBoundMedianVal <- quantile(bootmediansVector, probs = 0.025)
upperBoundMedianVal <- quantile(bootmediansVector, probs = 0.975)
c(lowerBoundMedianVal, upperBoundMedianVal)

We are 95% confident that the true population median lies between the lower and upper bounds printed above, which are marked on the histogram below.

# creating a histogram

tibble(Median=bootmediansVector) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  geom_vline(xintercept= lowerBoundMedianVal,size=1,
             color="dodgerblue",linetype="solid")+
  geom_vline(xintercept= upperBoundMedianVal,size=1,
             color="dodgerblue",linetype="solid")+
  labs(y = "frequency",
       title = "Distribution of bootstrap medians") + 
  theme_bw()
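
As a cross-check, the boot package can produce the same nonparametric interval in a few lines; a minimal sketch, assuming the percentile method (not part of the original analysis):

# Nonparametric bootstrap of the sample median using the boot package
library(boot)
medianBoot <- boot(data = observedVal,
                   statistic = function(d, i) median(d[i]),
                   R = 10000)
boot.ci(medianBoot, type = "perc")  # percentile 95% confidence interval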

b. Parametric Bootstrapping

# Generating B = 10,000 samples, each of size n, from a normal distribution with the mean and standard deviation estimated from the observed data, and visualizing the resulting distribution of medians with a histogram.

BsamplesForParametric <- 10000
observedMaleCountVal <- election_census_merged_data_republican$male_count

# Initializing matrix for parametric bootstrap samples
parametricBootMatrix <- matrix(NA, nrow = n, ncol = BsamplesForParametric)
xBar <- mean(observedMaleCountVal)
s <- sd(observedMaleCountVal)

# Sampling a normal set of n values, B times
for(b in 1:BsamplesForParametric) {
  parametricBootMatrix[, b] <- rnorm(n = n, mean = xBar, sd = s)
}


# Initializing vector for parametric bootstrap medians
bootParametricMedians <- vector(length = BsamplesForParametric)

# Calculating the median of each simulated data set
for (b in 1:BsamplesForParametric){
  bootParametricMedians[b] <- median(parametricBootMatrix[, b])
}


# creating a histogram

tibble(Median=bootParametricMedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  labs(y = "frequency",
       title = "Distribution of Parametric bootstrap medians") + 
  theme_bw()

# Obtain a parametric bootstrap estimate of the standard error of the sample median.

SEparametricestimate <- sd(bootParametricMedians)
SEparametricestimate
## [1] 17906.09
# Obtaining a parametric bootstrap 95% confidence interval for the population median.

lowerBoundParmetricMed <- quantile(bootParametricMedians,probs=0.025)
upperBoundParametricMed <- quantile(bootParametricMedians,probs=0.975)


# creating a histogram

tibble(Median=bootParametricMedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  geom_vline(xintercept= lowerBoundParmetricMed,size=1,
             color="dodgerblue",linetype="solid")+
   geom_vline(xintercept= upperBoundParametricMed,size=1,
             color="dodgerblue",linetype="solid")+
  labs(y = "frequency",
       title = "Distribution of parametric bootstrap medians") + 
  theme_bw()

6. Conclusions / Main Takeaways

In conclusion, the analysis of presidential elections data at the county level in the USA provides valuable insights into voting trends, party voting patterns, and candidate performance. It also highlights the importance of search interest in specific topics during the election season and sheds light on income disparities across regions. These findings can be utilized to better understand the dynamics of the electoral process and contribute to discussions on improving voting practices and policies.

From the above data analysis, I also identified several ideas for future work:

Voter Turnout Analysis: Explore factors influencing voter turnout at the county level, and identify strategies to increase voter participation.

Temporal Analysis: Extend the analysis to include more recent election data beyond 2020, if available, to observe any emerging trends or changes in voting patterns over time.

Comparative Studies: Conduct comparative studies with elections from other countries to identify similarities and differences in voting behaviors and electoral processes.

Interactive Visualization: Create interactive data visualizations and dashboards to allow users to explore the data and draw their own insights.

Acknowledgment:

I would like to extend my sincere gratitude to Professor Andrew Dilernia for his invaluable guidance and support throughout this project. His expertise in data selection and research methodology has been instrumental in shaping the direction of this analysis, and I am grateful for the opportunity to learn under his mentorship.