Analysis of Presidential Elections Data: County-Level in USA

Project Link

Use the below link that was published in rPubs.

https://rpubs.com/haripoorna2208/1069024

Dashboard Link

Use the below link for the visualization dashboard.

https://rpubs.com/haripoorna2208/1069039

Loading packages

In my project, I utilize a variety of powerful libraries for data manipulation, visualization, and statistical analysis. The tidyverse suite, including dplyr, ggplot2, and tidyr, enables seamless data wrangling and plotting, while lubridate helps with date and time manipulation. Additionally, I leverage libraries like psych and skimr for descriptive statistics and concise data summarization. For creating publication-ready plots and interactive visualizations, ggpubr, plotly, and patchwork play key roles. Moreover, I employ specialized packages like datadictionary for documenting datasets and gtrendsR to access Google Trends data. Lastly, maps and spatial data are effectively handled using maps and sf libraries.

# Data manipulation and visualization libraries
library(tidyverse)   # Comprehensive data manipulation and visualization tools
library(magrittr)    # Allows using the %>% operator for cleaner code
library(stringr)     # Provides string manipulation functions
library(dplyr)       # Provides data manipulation functions (part of tidyverse)
library(tidyr)       # Provides tools for data tidying (part of tidyverse)
library(lubridate)   # Helps with date and time manipulation (part of tidyverse)

# Descriptive statistics and data summarization
library(skimr)       # Provides concise summary statistics for data
library(psych)       # Helps with more advanced descriptive statistics

# Data visualization libraries
library(ggplot2)     # Widely used for creating static data visualizations
library(ggpubr)      # Extends ggplot2 for creating publication-ready plots
library(patchwork)   # Allows combining multiple ggplots into one plot

# Data import/export
library(readr)       # Helps with reading and writing data in various formats

# Miscellaneous visualization and formatting
library(flextable)   # Creates flexible and nice-looking tables
library(viridis)     # Provides color palettes for visualizations
library(hrbrthemes)  # Offers additional ggplot2 themes

# Categorical data manipulation
library(forcats)     # Useful for manipulating categorical variables

# Google Trends API
library(gtrendsR)    # Provides an interface for accessing Google Trends data

# Data Dictionary
library(datadictionary)  # Helps create data dictionaries for datasets

# Interactive plotting
library(plotly)      # Allows creating interactive plots

# Maps and geographic data
library(maps)        # Offers map data for visualizations
library(sf)          # Provides functions for working with spatial data

Importing the data

These are the datasets that were used in this project. First, there is “Presidential_Elections_Data_count_level.csv”. This data contains constituency county-level returns for elections to the U.S. presidency from 1976 to 2020. Data was obtained from the Massachusetts Institute of Technology website(https://electionlab.mit.edu/data).

# Importing CSV file
election_data <- read_csv("Presidential_Elections_Data_count_level.csv")
census_dataset <- read_csv("census.csv")


# Downloading county-level shape files from US Census Bureau
if(!file.exists("cb_2018_us_county_500k.zip")) {
download.file(url = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_500k.zip",
              destfile = "cb_2018_us_county_500k.zip")
}

if(!dir.exists("GeoFiles")) {
  dir.create("GeoFiles/")
}

# Unzipping files
utils::unzip("cb_2018_us_county_500k.zip",
             exdir = "GeoFiles")

# Loading the shapefiles
county_shape <- st_read("GeoFiles//cb_2018_us_county_500k.shp",
                        quiet = TRUE)

Exploring the data

Exploring and display of high-level characteristics of the data set, e.g., important variables and their types, different levels for factor variables, any patterns of missing values.

## exploring the data
glimpse(election_data)

## Rows: 72,617
## Columns: 12
## $ year           <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2…
## $ state          <chr> "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", …
## $ state_po       <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
## $ county_name    <chr> "AUTAUGA", "AUTAUGA", "AUTAUGA", "AUTAUGA", "BALDWIN", …
## $ county_fips    <chr> "01001", "01001", "01001", "01001", "01003", "01003", "…
## $ office         <chr> "US PRESIDENT", "US PRESIDENT", "US PRESIDENT", "US PRE…
## $ candidate      <chr> "AL GORE", "GEORGE W. BUSH", "RALPH NADER", "OTHER", "A…
## $ party          <chr> "DEMOCRAT", "REPUBLICAN", "GREEN", "OTHER", "DEMOCRAT",…
## $ candidatevotes <dbl> 4942, 11993, 160, 113, 13997, 40872, 1033, 578, 5188, 5…
## $ totalvotes     <dbl> 17208, 17208, 17208, 17208, 56480, 56480, 56480, 56480,…
## $ version        <dbl> 20220315, 20220315, 20220315, 20220315, 20220315, 20220…
## $ mode           <chr> "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "…

skim(election_data)

Data summary
Name	election_data
Number of rows	72617
Number of columns	12
_______________________
Column type frequency:
character	8
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
state	0	1	4	20	51
state_po	0	1	2	2	51
county_name	0	1	3	21	1892
county_fips	57	1	5	5	3156
office	0	1	12	12	1
candidate	0	1	5	17	13
party	0	1	5	11	5
mode	0	1	4	20	16

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2011.30	7.52	2000	2004	2012	2020	2020	▇▃▃▃▇
candidatevotes	1	10781.83	46034.94	0	115	1278	5848	3028885	▇▁▁▁▁
totalvotes	1	42514.21	121951.40	0	5175	11194	29855	4264365	▇▁▁▁▁
version	1	20220315.00	0.00	20220315	20220315	20220315	20220315	20220315	▁▁▇▁▁

glimpse(census_dataset)

## Rows: 52,612
## Columns: 26
## $ geoid                    <chr> "01003", "01015", "01043", "01049", "01051", …
## $ county_state             <chr> "Baldwin County, Alabama", "Calhoun County, A…
## $ year                     <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 200…
## $ population               <dbl> 174439, 113419, 81324, 68515, 78106, 103303, …
## $ median_income            <dbl> 52320, 40056, 39229, 34357, 54601, 37642, 426…
## $ median_monthly_rent_cost <dbl> 790, 533, 562, 536, 620, 569, 570, 737, 537, …
## $ median_monthly_home_cost <dbl> 884, 686, 653, 548, 787, 595, 673, 975, 655, …
## $ prop_female              <dbl> 0.5072088, 0.5213941, 0.5008853, 0.5028388, 0…
## $ prop_male                <dbl> 0.4927912, 0.4786059, 0.4991147, 0.4971612, 0…
## $ prop_white               <dbl> 0.8650990, 0.7668380, 0.9676971, 0.9294607, 0…
## $ prop_black               <dbl> 0.08835180, 0.20128021, 0.01090699, 0.0140115…
## $ prop_native              <dbl> 0.006896394, 0.003720717, 0.007463971, 0.0164…
## $ prop_asian               <dbl> 0.0072632840, 0.0081291494, 0.0024592986, 0.0…
## $ prop_hawaiin_islander    <dbl> 0.0017312642, 0.0000000000, 0.0000000000, 0.0…
## $ prop_other_race          <dbl> 0.0146813499, 0.0092312575, 0.0004303773, 0.0…
## $ prop_multi_racial        <dbl> 0.015976932, 0.010800660, 0.011042251, 0.0155…
## $ prop_highschool          <dbl> 0.2234336, 0.2642586, 0.2830509, 0.2945016, 0…
## $ prop_GED                 <dbl> 0.04600955, 0.07705351, 0.07444981, 0.0797542…
## $ prop_some_college        <dbl> 0.06240842, 0.06737910, 0.05683327, 0.0587472…
## $ prop_college_no_degree   <dbl> 0.1687486, 0.1806985, 0.1389848, 0.1287847, 0…
## $ prop_associates          <dbl> 0.08350924, 0.05767840, 0.09165914, 0.0494131…
## $ prop_bachelors           <dbl> 0.20768040, 0.09130223, 0.08016855, 0.0449906…
## $ prop_masters             <dbl> 0.06303756, 0.04145800, 0.04394398, 0.0302134…
## $ prop_professional        <dbl> 0.013004859, 0.008675421, 0.008834829, 0.0026…
## $ prop_doctoral            <dbl> 0.0108111688, 0.0155106011, 0.0012570599, 0.0…
## $ prop_poverty             <dbl> 0.09132561, 0.18246331, 0.15272973, 0.2357663…

skim(census_dataset)

Data summary
Name	census_dataset
Number of rows	52612
Number of columns	26
_______________________
Column type frequency:
character	2
numeric	24
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
geoid	0	1	5	5	0	3225	0
county_state	0	1	16	42	0	3229	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2014.82	3.78	2008.00	2012.00	2015.00	2018.00	2021.00	▆▇▅▇▇
population	0	1.00	145758.59	395597.70	41.00	14429.00	40119.50	122855.25	10170292.00	▇▁▁▁▁
median_income	4	1.00	49709.77	15079.85	10499.00	40264.50	47527.00	56565.25	156821.00	▃▇▁▁▁
median_monthly_rent_cost	30	1.00	741.63	235.85	99.00	587.00	688.00	841.00	2599.00	▃▇▁▁▁
median_monthly_home_cost	18	1.00	837.68	388.18	103.00	572.00	743.00	1002.00	3254.00	▇▇▁▁▁
prop_female	0	1.00	0.50	0.02	0.19	0.50	0.51	0.51	0.63	▁▁▁▇▁
prop_male	0	1.00	0.50	0.02	0.37	0.49	0.49	0.50	0.81	▁▇▁▁▁
prop_white	42443	0.19	0.78	0.15	0.08	0.71	0.82	0.89	0.98	▁▁▂▅▇
prop_black	42443	0.19	0.11	0.12	0.00	0.03	0.06	0.15	0.73	▇▂▁▁▁
prop_native	42443	0.19	0.01	0.05	0.00	0.00	0.00	0.01	0.80	▇▁▁▁▁
prop_asian	42443	0.19	0.03	0.04	0.00	0.01	0.02	0.04	0.44	▇▁▁▁▁
prop_hawaiin_islander	42443	0.19	0.00	0.01	0.00	0.00	0.00	0.00	0.14	▇▁▁▁▁
prop_other_race	42443	0.19	0.03	0.05	0.00	0.01	0.02	0.04	0.80	▇▁▁▁▁
prop_multi_racial	42443	0.19	0.04	0.04	0.00	0.02	0.03	0.04	0.69	▇▁▁▁▁
prop_highschool	41916	0.20	0.25	0.06	0.05	0.21	0.25	0.29	0.49	▁▅▇▂▁
prop_GED	41916	0.20	0.04	0.02	0.00	0.03	0.04	0.05	0.18	▆▇▁▁▁
prop_some_college	41916	0.20	0.07	0.02	0.00	0.06	0.07	0.08	0.15	▁▅▇▂▁
prop_college_no_degree	41916	0.20	0.15	0.03	0.05	0.13	0.15	0.17	0.28	▁▆▇▂▁
prop_associates	41916	0.20	0.09	0.02	0.02	0.07	0.09	0.10	0.29	▅▇▁▁▁
prop_bachelors	41916	0.20	0.18	0.06	0.04	0.13	0.17	0.21	0.40	▂▇▆▂▁
prop_masters	41916	0.20	0.08	0.03	0.01	0.05	0.07	0.09	0.29	▇▇▂▁▁
prop_professional	41916	0.20	0.02	0.01	0.00	0.01	0.02	0.02	0.10	▇▃▁▁▁
prop_doctoral	41916	0.20	0.01	0.01	0.00	0.01	0.01	0.02	0.11	▇▁▁▁▁
prop_poverty	2	1.00	0.16	0.08	0.00	0.11	0.15	0.19	0.67	▆▇▁▁▁

After Exploring the election data by the use of skim function, I can observe that there are 57 missing values for county_fips variable in the datasets I have used for this project and in census data there is more 40 thousand missing values in prop_white, prop_black, prop_native, prop_asian, prop_hawaiin_islander, prop_other_race etc and 30 missing values in median_monthly_rent_cost and 18 missing values in median_monthly_home_cost column might impact while merging the data with election data.

1. Data Dictionary for the datasets used

# Loading election data
data(election_data)

# Creating data dictionary. 
# Note penguins will need to be replaced with your own data set, and 
# the variable descriptions should be updated as well.
dataDictionary1 <- tibble(Variable = colnames(election_data),
                         Description = c(
                            "year of election",
                            "state name",
                            "state postal code abbreviation",
                            "county name",
                            "county fips code",
                            "name of the public office to which the candidate is seeking election",
                            "candidate name",
                            "party",
                            "number of votes for candidate",
                            "total votes cast in the election",
                            "version",
                            "mode"
),
                         Type = map_chr(election_data, .f = function(x){typeof(x)[1]}),
                         Class = map_chr(election_data, .f = function(x){class(x)[1]}))

# Printing nicely in R Markdown (option 1)
flextable::flextable(dataDictionary1, cwidth = 2)

Variable	Description	Type	Class
year	year of election	double	numeric
state	state name	character	character
state_po	state postal code abbreviation	character	character
county_name	county name	character	character
county_fips	county fips code	character	character
office	name of the public office to which the candidate is seeking election	character	character
candidate	candidate name	character	character
party	party	character	character
candidatevotes	number of votes for candidate	double	numeric
totalvotes	total votes cast in the election	double	numeric
version	version	double	numeric
mode	mode	character	character

# Printing nicely in R Markdown (option 2)
knitr::kable(dataDictionary1)

Variable	Description	Type	Class
year	year of election	double	numeric
state	state name	character	character
state_po	state postal code abbreviation	character	character
county_name	county name	character	character
county_fips	county fips code	character	character
office	name of the public office to which the candidate is seeking election	character	character
candidate	candidate name	character	character
party	party	character	character
candidatevotes	number of votes for candidate	double	numeric
totalvotes	total votes cast in the election	double	numeric
version	version	double	numeric
mode	mode	character	character

# Loading census data
data(census_dataset)
# Creating data dictionary. 
# Note penguins will need to be replaced with your own data set, and 
# the variable descriptions should be updated as well.

dataDictionary2 <- tibble(Variable = colnames(census_dataset),
                        Description = c(
                          "geographic region ID",
                          "geographic region",
                          "year",
                          "population",
                          "median income in dollars",
                          "median monthly housing costs for homeowners in dollars",
                          "median monthly rent costs for renters in dollars",
                          "proportion of people who are female",
                          "proportion of people who are male",
                          "proportion of people who are white alone",
                          "proportion of people who are black or African American alone",
                          "proportion of people who are American Indian and Alaska Native alone",
                          "proportion of people who are Asian alone",
                          "proportion of people who are Native Hawaiian and Other Pacific Islander alone",
                          "proportion of people who are some other race alone",
                          "proportion of people who are two or more races",
                          "proportion of people 25 and older whose highest education-level is high school",
                          "proportion of people 25 and older whose highest education-level is a GED",
                          "proportion of people 25 and older whose highest education-level is some, but less than 1 year of college",
                          "proportion of people 25 and older whose highest education-level is greater than 1 year of college but no degree",
                          "proportion of people 25 and older whose highest education-level is an Associates degree",
                          "proportion of people 25 and older whose highest education-level is a Bachelors degree",
                          "proportion of people 25 and older whose highest education-level is a Masters degree",
                          "proportion of people 25 and older whose highest education-level is a Professional degree",
                          "proportion of people 25 and older whose highest education-level is a Doctoral degree",
                          "proportion of people 25 and older living in poverty"
  ),
                         Type = map_chr(census_dataset, .f = function(x){typeof(x)[1]}),
                         Class = map_chr(census_dataset, .f = function(x){class(x)[1]}))

# Printing nicely in R Markdown (option 1)
flextable::flextable(dataDictionary2, cwidth = 2)

Variable	Description	Type	Class
geoid	geographic region ID	character	character
county_state	geographic region	character	character
year	year	double	numeric
population	population	double	numeric
median_income	median income in dollars	double	numeric
median_monthly_rent_cost	median monthly housing costs for homeowners in dollars	double	numeric
median_monthly_home_cost	median monthly rent costs for renters in dollars	double	numeric
prop_female	proportion of people who are female	double	numeric
prop_male	proportion of people who are male	double	numeric
prop_white	proportion of people who are white alone	double	numeric
prop_black	proportion of people who are black or African American alone	double	numeric
prop_native	proportion of people who are American Indian and Alaska Native alone	double	numeric
prop_asian	proportion of people who are Asian alone	double	numeric
prop_hawaiin_islander	proportion of people who are Native Hawaiian and Other Pacific Islander alone	double	numeric
prop_other_race	proportion of people who are some other race alone	double	numeric
prop_multi_racial	proportion of people who are two or more races	double	numeric
prop_highschool	proportion of people 25 and older whose highest education-level is high school	double	numeric
prop_GED	proportion of people 25 and older whose highest education-level is a GED	double	numeric
prop_some_college	proportion of people 25 and older whose highest education-level is some, but less than 1 year of college	double	numeric
prop_college_no_degree	proportion of people 25 and older whose highest education-level is greater than 1 year of college but no degree	double	numeric
prop_associates	proportion of people 25 and older whose highest education-level is an Associates degree	double	numeric
prop_bachelors	proportion of people 25 and older whose highest education-level is a Bachelors degree	double	numeric
prop_masters	proportion of people 25 and older whose highest education-level is a Masters degree	double	numeric
prop_professional	proportion of people 25 and older whose highest education-level is a Professional degree	double	numeric
prop_doctoral	proportion of people 25 and older whose highest education-level is a Doctoral degree	double	numeric
prop_poverty	proportion of people 25 and older living in poverty	double	numeric

# Printing nicely in R Markdown (option 2)
knitr::kable(dataDictionary2)

Variable	Description	Type	Class
geoid	geographic region ID	character	character
county_state	geographic region	character	character
year	year	double	numeric
population	population	double	numeric
median_income	median income in dollars	double	numeric
median_monthly_rent_cost	median monthly housing costs for homeowners in dollars	double	numeric
median_monthly_home_cost	median monthly rent costs for renters in dollars	double	numeric
prop_female	proportion of people who are female	double	numeric
prop_male	proportion of people who are male	double	numeric
prop_white	proportion of people who are white alone	double	numeric
prop_black	proportion of people who are black or African American alone	double	numeric
prop_native	proportion of people who are American Indian and Alaska Native alone	double	numeric
prop_asian	proportion of people who are Asian alone	double	numeric
prop_hawaiin_islander	proportion of people who are Native Hawaiian and Other Pacific Islander alone	double	numeric
prop_other_race	proportion of people who are some other race alone	double	numeric
prop_multi_racial	proportion of people who are two or more races	double	numeric
prop_highschool	proportion of people 25 and older whose highest education-level is high school	double	numeric
prop_GED	proportion of people 25 and older whose highest education-level is a GED	double	numeric
prop_some_college	proportion of people 25 and older whose highest education-level is some, but less than 1 year of college	double	numeric
prop_college_no_degree	proportion of people 25 and older whose highest education-level is greater than 1 year of college but no degree	double	numeric
prop_associates	proportion of people 25 and older whose highest education-level is an Associates degree	double	numeric
prop_bachelors	proportion of people 25 and older whose highest education-level is a Bachelors degree	double	numeric
prop_masters	proportion of people 25 and older whose highest education-level is a Masters degree	double	numeric
prop_professional	proportion of people 25 and older whose highest education-level is a Professional degree	double	numeric
prop_doctoral	proportion of people 25 and older whose highest education-level is a Doctoral degree	double	numeric
prop_poverty	proportion of people 25 and older living in poverty	double	numeric

2. Data Cleaning ( Using String modification, Merging dataset, Lubridate package)

Cleaning the data set and modifying the variables and data to match our requirements for the Project

a. String modification

# Converting values in "state" variable to capitalized form
election_data <- election_data %>%
  mutate(state = str_to_title(state))

# Converting values in "candidate" variable to capitalized form
election_data <- election_data %>%
  mutate(candidate = str_to_title(candidate))

# Converting values in "party" variable to capitalized form
election_data <- election_data %>%
  mutate(party = str_to_title(party))

# Converting values in "county_name" variable to capitalized form
election_data <- election_data %>%
  mutate(county_name = str_to_title(county_name))

election_data <- election_data %>% 
  rename(county = county_name)

# Now, Cleaning the candidate using mutate() and str_replace() function
election_data <- election_data %>% 
  mutate(candidate = str_replace(candidate, "Donald Trump|Donald J Trump", "Donald Trump")) %>%
  mutate(candidate = str_replace(candidate, "Joseph R Biden Jr", "Joe Biden"))

# Separate "county_state" column into "county" and "state"
if ("county_state" %in% names(census_dataset)) {
  # Separate "county_state" column into "county" and "state"
  census_data <- census_dataset %>%
    separate(county_state, into = c("county", "state"), sep = ",\\s*", extra = "merge")
}
# now, remove the substring "county" in every county column value, example: "Houston County" > "Houston"
census_data$county <- str_replace(census_data$county, " County", "")

# Converting values in "state" variable in census data to capitalized form
census_data <- census_data %>%
  mutate(state = str_to_title(state))

# Converting values in "county" variable in census data to capitalized form
census_data <- census_data %>%
  mutate(county = str_to_title(county))

# renaming the geoId column name to county_fips
census_data <- census_data %>% 
  rename(county_fips = geoid)

b. Merging Data Sets

# Merge the two data frames based on county_fips as the unique identifier
election_census_merged_data <- left_join(census_data, election_data, by = c("county_fips", "state", "year", "county"))

# Update the dataset with the count of females and male for each county
election_census_merged_data$male_count <- round(election_census_merged_data$population * election_census_merged_data$prop_male)

election_census_merged_data$female_count <- round(election_census_merged_data$population * election_census_merged_data$prop_female)

# Filter the dataset to show Republican candidate votes for each county_fips
election_census_merged_data_republican_2016 <- election_census_merged_data %>%
  filter(party == "Republican", year == 2016) %>%
  distinct(county_fips, .keep_all = TRUE)

Note: Lubridate package is used in Data Visualizations in plot 3

3. Exploratory Data Analysis

a. Tables of Summary Statistics (group-level and Frequency)

Table 1 : Group-level summary statistics based on state wise candidate votes

summary_state<- election_data %>% group_by(state) %>% 
  summarise(mean_v=round(mean(candidatevotes),2),
            median_v=round(median(candidatevotes),2),
            sd_v=round(sd(candidatevotes),2),
            max_v=round(max(candidatevotes),2),
            min_v=round(min(candidatevotes),2),.groups = 'drop') %>%
  as.data.frame()

summary_state_table <- flextable::qflextable(summary_state)
summary_state_table

state	mean_v	median_v	sd_v	max_v	min_v
Alabama	9,562.28	3,555.0	20,036.95	181,688	2
Alaska	2,412.17	1,939.0	2,334.27	10,717	0
Arizona	30,384.94	1,508.0	112,684.35	978,457	0
Arkansas	2,400.19	201.0	6,834.58	89,574	0
California	67,021.72	8,106.5	203,790.55	3,028,885	2
Colorado	11,103.56	1,135.5	31,472.45	313,293	0
Connecticut	52,802.04	22,058.0	74,918.46	297,505	0
Delaware	41,267.58	30,838.0	49,206.40	195,034	294
District Of Columbia	78,320.05	17,367.0	113,325.05	317,323	1,322
Florida	36,193.16	3,897.0	77,794.32	624,146	1
Georgia	5,128.96	786.5	17,792.05	297,051	0
Hawaii	31,985.51	12,239.0	52,880.07	238,866	133
Idaho	4,505.79	1,032.5	12,104.62	130,699	1
Illinois	15,114.35	2,066.0	88,520.33	1,725,973	0
Indiana	8,183.73	2,682.5	19,101.97	247,772	0
Iowa	3,573.57	1,152.0	9,276.25	128,465	0
Kansas	3,613.44	693.0	13,483.38	184,259	3
Kentucky	4,592.88	1,530.0	13,653.98	228,288	1
Louisiana	9,249.22	2,986.5	18,781.17	152,610	11
Maine	12,366.85	7,165.5	18,244.04	128,759	0
Maryland	15,822.72	1,229.5	42,364.39	357,837	0
Massachusetts	63,935.49	15,048.0	94,365.68	617,196	41
Michigan	16,720.18	3,518.0	51,486.44	660,085	2
Minnesota	9,488.48	2,153.0	32,290.92	532,623	4
Mississippi	4,183.36	1,790.5	7,462.33	76,112	0
Missouri	6,817.37	1,543.5	23,002.74	333,123	0
Montana	2,765.24	717.0	6,199.47	50,772	5
Nebraska	2,754.28	460.0	10,890.56	150,350	0
Nevada	17,504.76	998.5	63,992.63	521,777	6
New Hampshire	22,193.54	12,792.0	28,790.04	122,344	191
New Jersey	51,471.90	24,212.0	62,124.22	285,967	93
New Mexico	6,769.28	1,398.0	18,473.54	193,757	0
New York	34,742.93	9,105.0	83,084.69	703,310	7
North Carolina	7,081.89	348.0	20,627.13	302,736	0
North Dakota	1,836.63	501.0	4,864.71	42,619	0
Ohio	18,766.28	5,745.0	43,724.48	458,204	13
Oklahoma	3,946.76	431.5	13,808.50	174,741	0
Oregon	15,031.14	2,947.0	34,565.99	367,249	1
Pennsylvania	27,830.84	7,303.0	60,148.15	603,790	12
Rhode Island	22,876.00	9,098.0	36,662.68	167,442	0
South Carolina	5,436.18	89.5	13,823.51	127,832	0
South Dakota	1,785.25	571.0	4,620.91	49,249	0
Tennessee	8,380.75	2,517.0	21,679.46	256,297	7
Texas	9,406.51	663.0	43,432.78	918,193	0
Utah	6,058.90	86.0	24,074.89	271,576	0
Vermont	6,519.92	3,510.0	9,348.12	72,938	12
Virginia	5,878.28	899.0	18,452.44	355,133	0
Washington	22,963.08	3,947.0	69,060.14	907,310	2
West Virginia	3,904.89	1,606.0	6,125.35	46,398	6
Wisconsin	11,854.21	3,820.5	28,906.19	332,438	0
Wyoming	3,263.54	1,101.0	4,821.71	27,891	13

Table 2 : Aggregate votes for each candidate for each term year

aggregate_table <- aggregate(candidatevotes ~ year + candidate, data = election_data, FUN = sum)

# Order the aggregate_table by year and candidatevotes
aggregate_table_ordered <- arrange(aggregate_table, year, desc(candidatevotes))


# Spread the aggregate_table to have each unique year as a separate row
spread_table <- spread(aggregate_table_ordered, key = year, value = candidatevotes)

spread_table_filled <- spread_table %>%
  mutate(across(-candidate, ~ replace(., is.na(.), "-")))

# Printing the spread_table
flextable_spread_table <- flextable::qflextable(spread_table_filled)
flextable_spread_table

candidate	2000	2004	2008	2012	2016	2020
Al Gore	51001968	-	-	-	-	-
Barack Obama	-	-	69448278	65753019	-	-
Donald Trump	-	-	-	-	62979031	74218999
George W. Bush	50457896	61931981	-	-	-	-
Hillary Clinton	-	-	-	-	65844241	-
Jo Jorgensen	-	-	-	-	-	1810407
Joe Biden	-	-	-	-	-	81264648
John Kerry	-	58899063	-	-	-	-
John Mccain	-	-	59928695	-	-	-
Mitt Romney	-	-	-	60666387	-	-
Other	1068557	1489505	1810364	2629510	7674891	1183656
Ralph Nader	2882954	-	-	-	-	-

Table 3 : Percentage of votes each candidate recieved for each term year

# Filteing the election_data for the years 2000 to 2020
year_data <- election_data[election_data$year >= 2000 & election_data$year <= 2020, ]

# Group by year and candidate, and calculate the total votes for each candidate in each year
candidate_votes <- year_data %>%
  group_by(year, candidate) %>%
  summarize(total_votes = sum(candidatevotes))

# Calculating the total votes for each year
total_votes_per_year <- year_data %>%
  group_by(year) %>%
  summarize(total_votes = sum(candidatevotes))

# Merging the two data frames to get the total votes as percentages
candidate_votes_percentage <- merge(candidate_votes, total_votes_per_year, by = "year") %>%
  mutate(percentage_votes = (total_votes.x / total_votes.y) * 100) %>%
  select(-total_votes.x, -total_votes.y) %>%
  spread(year, percentage_votes, fill = 0)

# Converting percentage values to numeric, round to two decimals, and add "%" symbol
candidate_votes_percentage[, -1] <- lapply(candidate_votes_percentage[, -1], function(x) paste0(format(round(as.numeric(x), 2), nsmall = 2), "%"))

# Printing the candidate_votes_percentage data frame
candidate_votes_percentage_table <- flextable::qflextable(candidate_votes_percentage)
candidate_votes_percentage_table

candidate	2000	2004	2008	2012	2016	2020
Al Gore	48.38%	0.00%	0.00%	0.00%	0.00%	0.00%
Barack Obama	0.00%	0.00%	52.94%	50.95%	0.00%	0.00%
Donald Trump	0.00%	0.00%	0.00%	0.00%	46.14%	46.83%
George W. Bush	47.87%	50.63%	0.00%	0.00%	0.00%	0.00%
Hillary Clinton	0.00%	0.00%	0.00%	0.00%	48.24%	0.00%
Jo Jorgensen	0.00%	0.00%	0.00%	0.00%	0.00%	1.14%
Joe Biden	0.00%	0.00%	0.00%	0.00%	0.00%	51.28%
John Kerry	0.00%	48.15%	0.00%	0.00%	0.00%	0.00%
John Mccain	0.00%	0.00%	45.68%	0.00%	0.00%	0.00%
Mitt Romney	0.00%	0.00%	0.00%	47.01%	0.00%	0.00%
Other	1.01%	1.22%	1.38%	2.04%	5.62%	0.75%
Ralph Nader	2.73%	0.00%	0.00%	0.00%	0.00%	0.00%

Frequency table for two categorical variables

From the below table we can observe frequency of candidates participated from year 2000 - 2020

# Filtering the election_data for the years 2000 to 2020
year_data <- election_data[election_data$year >= 2000 & election_data$year <= 2020, ]

# Obtaining the unique candidates for each year
candidate_occurrence <- with(year_data, tapply(candidate, year, function(x) unique(x)))

# Creating a data frame to display the occurrence of candidates
occurrence_data_frame <- as.data.frame(candidate_occurrence)

# Replacing empty cells with zero
occurrence_data_frame[is.na(occurrence_data_frame)] <- 0

# Printing the candidate occurrence data frame
occurrence_data_frame

##                              candidate_occurrence
## 2000  Al Gore, George W. Bush, Ralph Nader, Other
## 2004            John Kerry, George W. Bush, Other
## 2008             Barack Obama, John Mccain, Other
## 2012             Barack Obama, Mitt Romney, Other
## 2016         Hillary Clinton, Donald Trump, Other
## 2020 Joe Biden, Other, Donald Trump, Jo Jorgensen

Now, we could look at election outcomes (which party won the overall election?) and whether or not a candidate won a majority of the popular vote and In order to calculate the state_winners, total electoral seats won by each party we need know the details of electoral college seats. For that I referred link https://www.archives.gov/electoral-college/allocation and used in below step 1 to load into data frame.

# Creating a lookup table for electoral college votes allocation
default_electoral_college_seats_state_wise <- data.frame(
  state = c("Alabama", "Kentucky", "North Dakota", "Alaska", "Louisiana", "Ohio", "Arizona", "Maine", "Oklahoma", "Arkansas",
            "Maryland", "Oregon", "California", "Massachusetts", "Pennsylvania", "Colorado", "Michigan", "Rhode Island", "Connecticut",
            "Minnesota", "South Carolina", "Delaware", "Mississippi", "South Dakota", "District of Columbia", "Missouri", "Tennessee",
            "Florida", "Montana", "Texas", "Georgia", "Nebraska", "Utah", "Hawaii", "Nevada", "Vermont", "Idaho", "New Hampshire",
            "Virginia", "Illinois", "New Jersey", "Washington", "Indiana", "New Mexico", "West Virginia", "Iowa", "New York", "Wisconsin",
            "Kansas", "North Carolina", "Wyoming"),
  electoral_seats = c(9, 8, 3, 3, 8, 17, 11, 4, 7, 6, 10, 8, 54, 11, 19, 10, 15, 4, 7, 10, 9, 3, 6, 3, 3, 10, 11, 30, 4, 40, 16, 5, 6,
                      4, 4, 3, 4, 4, 13, 19, 14, 12, 11, 5, 4, 6, 28, 10, 6, 16, 3)
)

# Merging election_data with the electoral_seats lookup table based on state
merged_election_data <- merge(election_data, default_electoral_college_seats_state_wise, by = "state", all.x = TRUE)

# Calculating total votes for each party in each state for each year
merged_election_data <- merged_election_data %>%
  group_by(year, state, party) %>%
  summarise(total_votes_state_party = sum(candidatevotes))

# Determining the winner party in each state for each year based on total votes majority
state_winners <- merged_election_data %>%
  group_by(year, state) %>%
  filter(total_votes_state_party == max(total_votes_state_party)) %>%
  ungroup()

# Merging the state_winners with the electoral_seats table to assign electoral votes to each party
state_winners_with_votes <- state_winners %>%
  left_join(default_electoral_college_seats_state_wise, by = c("state" = "state"))

# Calculating the total electoral votes won by each party for each year
electoral_college_seats <- state_winners_with_votes %>%
  group_by(year, party) %>%
  summarise(total_electoral_seats = sum(electoral_seats, na.rm = TRUE))

electoral_college_seats

## # A tibble: 12 × 3
## # Groups:   year [6]
##     year party      total_electoral_seats
##    <dbl> <chr>                      <dbl>
##  1  2000 Democrat                     276
##  2  2000 Republican                   257
##  3  2004 Democrat                     239
##  4  2004 Republican                   294
##  5  2008 Democrat                     351
##  6  2008 Republican                   182
##  7  2012 Democrat                     324
##  8  2012 Republican                   209
##  9  2016 Democrat                     227
## 10  2016 Republican                   306
## 11  2020 Democrat                     298
## 12  2020 Republican                   235

# Determining the winning party based on total electoral votes for each year
winning_party <- electoral_college_seats %>%
  group_by(year) %>%
  filter(total_electoral_seats == max(total_electoral_seats)) %>%
  ungroup()

# Select the final columns to display and rename the columns
winning_party_table_each_term <- winning_party %>%
  select(year, party, total_electoral_seats) %>%
  rename("Total Electoral College Seats Won" = total_electoral_seats, "Winning Party" = party)

winning_party_table_each_term$year <- gsub(",", "", winning_party_table_each_term$year)

winning_party_table_each_term

## # A tibble: 6 × 3
##   year  `Winning Party` `Total Electoral College Seats Won`
##   <chr> <chr>                                         <dbl>
## 1 2000  Democrat                                        276
## 2 2004  Republican                                      294
## 3 2008  Democrat                                        351
## 4 2012  Democrat                                        324
## 5 2016  Republican                                      306
## 6 2020  Democrat                                        298

b. Data Visualizations

Plot 1

From the below graph we can observe the metrics of number of votings took place over each term between period (2000-2020)

# plot 1: Line chart showing in which year most votings took place

# Calculating total votes per year
total_votes_per_year <- election_data %>%
  group_by(year) %>%
  summarize(total_votes = sum(candidatevotes))

# Creating the line graph
total_votes_per_year_line_graph <- ggplot(total_votes_per_year, aes(x = year, y = total_votes)) +
  geom_line(size=1,color="Red") +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 5),
                     labels = scales::comma) +
  labs(title = "Total Number of Polling in the United States from 2000-2020",
       x = "Year",
       y = "Number of Votes (in numbers)",
       caption = "Data source: https://electionlab.mit.edu/data") +
  theme(plot.title = element_text(hjust = 0.5))+
  theme_get()

total_votes_per_year_line_graph

Plot 2

From the below graph we can observe the metrics of perentage number of votings took place over each term by each party between period (2000-2020)

# First, Calculate the percentage of votes for each party in each year
votes_of_party_in_percentage <- election_data %>%
  group_by(year, party) %>%
  summarize(percentage_votes = sum(candidatevotes) / sum(totalvotes) * 100) %>%
  ungroup()

# Then, Reorder the levels of the party variable based on average percentage of votes across all years
votes_of_party_in_percentage <- votes_of_party_in_percentage %>%
  mutate(party = fct_reorder(party, percentage_votes, .desc = TRUE))

# Now, Creating the stacked bar chart with facet_wrap by year
ggplot(votes_of_party_in_percentage, aes(fill = party, y = percentage_votes, x = party)) +
  geom_bar(position = position_dodge(width = 0.7), stat = "identity") +
  scale_fill_viridis(discrete = TRUE, option = "E") +
  facet_wrap(~ year, ncol = 3, scales = "free_x") +
  theme_minimal() +
  xlab("Party") +
  labs(title = "Percentage of Votes Received by Each Party Each Year",
       x = "Party",
       y = "Percentage of Votes",
       caption = "Data source: https://electionlab.mit.edu/data") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "bottom") +
  scale_y_continuous(breaks = seq(0, 100, by = 10))

Plot 3

The plot(using plotly) visualizes the search interest for the term “Election Fraud” and “US Election Results 2020” during the 2020 US Election Season. The y-axis represents the relative interest in the search term, ranging from 0 to 100, with 100 being the peak popularity. The x-axis shows the months in the year 2020, spanning from August 2nd to December 3rd.

The plot also provides insights into how public interest in the topic of “Election Fraud” and “US Election Results 2020” evolved during the specified period, helping to understand its popularity and fluctuations throughout the election season.

# Function to load data from gtrends, if the gtrends fetching fails, will load the local RDS file
load_gtrends_data <- function() {
  tryCatch({
    # Attempt to fetch data from gtrends
    data <- gtrends(c("US Election Results 2020", "Election Fraud"), 
                    time = "2020-08-02 2020-12-03", geo = "US")
    
    # Manipulating date using lubridate function
    data$interest_over_time$date <- ymd(data$interest_over_time$date) # Converting to proper date format
    
    write_rds(data, file = "importedDataGtrends.rds")  # Save data to RDS file
    return(data)  # Return the data
  }, error = function(err) {
    # If gtrends call fails, attempt to read data from RDS file
    if (file.exists("importedDataGtrends.rds")) {
      data <- read_rds("importedDataGtrends.rds")
      
      # Manipulating date using lubridate function
      data$interest_over_time$date <- ymd(data$interest_over_time$date) # Converting to proper date format
      
      return(data)  # Return the data from RDS file
    } else {
      message("Failed to fetch data from gtrends and no RDS file found.")
      return(NULL)  # Return NULL or any other default value if both attempts fail
    }
  })
}
# Loading google trends data using the load_gtrends_data function
gTrendsElectionSearchData <- load_gtrends_data()

if (!is.null(data)) {
  plot_trend <- function(keyword_string, data) {
    time_trend <- data$interest_over_time %>%
      mutate(hits = ifelse(hits == "<1", 0.5, as.numeric(hits)),
             date = as.Date(date),
             keyword = factor(keyword, levels = keyword_string))
    
    plot <- plot_ly(data = time_trend, x = ~date, y = ~hits, color = ~keyword,
                    type = "scatter", mode = "lines", fill = "tozeroy") %>%
      layout(title = "Exploring the Search Interest for 'Election Fraud' and 'US Election Results 2020' during the 2020 US Election Season",
             xaxis = list(title = "Months in year 2020", standoff = 20),
             yaxis = list(title = "Hits (Relative to Peak from 0 - 100)", standoff = 20),
             legend = list(orientation = "v"),
             showlegend = TRUE,
             margin = list(l = 100, r = 100, b = 100, t = 100),  # inorder to Adjust margins
             padding = list(r = 10, b = 50)  # inorder to Adjust padding
      )
    return(plot)
  }
  election_fraud_trend_plot <- plot_trend(keyword_string = c("US Election Results 2020", "Election Fraud"),gTrendsElectionSearchData)
  election_fraud_trend_plot
}

Plot 4

The plot displays the winners in different states for the year 2016 and 2020 US Presidential Election. The map provides a clear visual representation of the electoral outcome across the United States during the specified election year.

# Loading the us_states shapefile from the maps package
us_states <- map_data("state")

# Renamed the "region" column to "state" in us_states
us_states <- us_states %>% 
  rename(state = region)

us_states <- us_states %>%
  mutate(state = str_to_title(state))

# Filtering the state_winners_with_votes dataset for the year 2020
state_winners_2020 <- state_winners_with_votes %>%
  filter(year == 2020)

state_winners_2016 <- state_winners_with_votes %>%
  filter(year == 2016)

# Left join the us_states dataset with state_winners_2020 based on the "state" column
map_data_combined_2020 <- left_join(us_states, state_winners_2020, by = "state", copy = TRUE)
map_data_combined_2016 <- left_join(us_states, state_winners_2016, by = "state", copy = TRUE)

# Displaying the map view which show the winners of different states by the parties

election_outcome_map_2016 <- ggplot(map_data_combined_2016, aes(x = long, y = lat, group = group, fill = party)) +
  geom_polygon(color = "white", size = 0.25) +  # Add thick white borders between states
  coord_quickmap() +
  scale_fill_manual(values = c("Democrat" = "blue", "Republican" = "red", "Other" = "gray")) +
  labs(title = "2016 Election outcome by state",
       caption = "Data source:  https://electionlab.mit.edu/data") +
  theme_void()

election_outcome_map_2016

election_outcome_map_2020 <- ggplot(map_data_combined_2020, aes(x = long, y = lat, group = group, fill = party)) +
  geom_polygon(color = "white", size = 0.25) +  # Add thick white borders between states
  coord_quickmap() +
  scale_fill_manual(values = c("Democrat" = "blue", "Republican" = "red", "Other" = "gray")) +
  labs(title = "2020 Election outcome by state",
       caption = "Data source:  https://electionlab.mit.edu/data") +
  theme_void()

election_outcome_map_2020

Plot 5

The plot displays county level median income in USA in the year 2021

# Merge shape and census data
ggMapData <- county_shape %>% 
  full_join(census_data, by = c("GEOID" = "county_fips")) %>% 
  dplyr::filter(year == 2021)

# Fixing issue with Alaska and Hawaii
ggMapDataFix <- ggMapData %>% 
  tigris::shift_geometry()

# Plotting by thresholding max poverty level at 50%
povertyGG <- ggMapDataFix %>% rowwise() %>% 
  mutate(pov_plot = min(median_income, 156821)) %>% 
  ggplot(aes(fill = pov_plot)) +
  geom_sf(color = "#C0C0C0",
          size = 0.1) +
  scale_fill_gradient(low = "white", high = "dodgerblue",
                      limits = c(0, 180000)) +
  labs( title = "County-level median income, 2021",
        caption = "Data source: https://electionlab.mit.edu/data",
        fill = "Median Income") +
  theme_void()

povertyGG

4. Monte Carlo Methods of Inference

Implementing a Permutation test

# Filtering data for California and Texas in year 2020
california_data <- election_census_merged_data %>%
  filter(state == "California", year == 2020)

michigan_data <- election_census_merged_data %>%
  filter(state == "Michigan", year == 2020)

# Calculating the observed median income difference between California and Texas
observed_median_diff <- median(california_data$median_income) - median(michigan_data$median_income)

# Number of permutations
n_permutations <- 10000

# Creating a function to calculate the median income difference in each permutation
calculate_permutation_diff <- function(data) {
  california_counties <- data %>%
    filter(state == "California") %>%
    sample_n(size = nrow(california_data), replace = TRUE)
  
  michigan_counties <- data %>%
    filter(state == "Texas") %>%
    sample_n(size = nrow(michigan_data), replace = TRUE)
  
  return(median(california_counties$median_income) - median(michigan_counties$median_income))
}

# Performing the permutation test
set.seed(1995) # For reproducibility
permutation_diffs <- replicate(n_permutations, calculate_permutation_diff(election_census_merged_data))

# Removing non-finite values from permutation_diffs
permutation_diffs <- permutation_diffs[is.finite(permutation_diffs)]

# Removing non-finite values from observed_median_diff
observed_median_diff <- ifelse(is.finite(observed_median_diff), observed_median_diff, 0)

# Ploting the null distribution
ggplot() +
  geom_histogram(aes(x = permutation_diffs), bins = 30, color = "black", fill = "lightblue") +
  geom_vline(xintercept = observed_median_diff, color = "red", linetype = "dashed", size = 1) +
  labs(title = "Null Distribution of Median Income Difference between California and Michigan (2020)",
       x = "Median Income Difference",
       y = "Frequency") +
  theme_minimal()

# Now, Calculating the p-value
p_value <- mean(permutation_diffs >= observed_median_diff, na.rm = TRUE)

p_value

## [1] 0.01227287

Hence, With a p-value of 0.0122729, we have strong evidence to reject the null hypothesis, indicating that there is a statistically significant difference in median income between counties in California and Michigan for the year 2020.

The permutation test indicates that there is a statistically significant difference in median income between counties in California and Michigan for the year 2020. The observed median income difference between the two states is higher than expected by random chance, suggesting a genuine disparity in income levels. This finding highlights important economic differences between California and Michigan and may be relevant for policymaking and economic development decisions.

5. Bootstrap Methods of Inference

a. Non-parametric Bootstrapping

# Simulating data from distribution
set.seed(1989)
n <- 135
observedVal <- election_census_merged_data_republican_2016$candidatevotes

# Creating a raincloud plot to visualize the simulated data. Describe the data in terms of its number of peaks (e.g., unimodal, bimodal, etc.) and its symmetry.

tibble(value=observedVal) %>% ggplot(aes(y = value)) +
  ggdist::stat_halfeye(adjust = .5, width = 2*.3, .width = c(0.5, 1)) + 
    geom_boxplot(width = .3, outlier.shape = NA) +
  ggdist::stat_dots(side = "left", dotsize = 6, justification = 1.05, binwidth = .1,
                    color = "black") +
    coord_flip() +
    labs(y = "Value",
         title = "simulated values from non-normal distribution") + 
  theme_bw() +
  theme(legend.position = "none")

# What is the sample median for this data?
median(observedVal)

## [1] 7239

# Number of bootstrap samples
Bsamples <- 10000

# Instantiating matrix for bootstrap samples
bootsMatrix <- matrix(NA, nrow = n, ncol = Bsamples)

# Sampling with replacement B times
for(b in 1:Bsamples) {
bootsMatrix[, b] <- observedVal[sample(1:n, size = n, replace = TRUE)]
}

#  Using the generated bootstrap samples, will create a bootstrap distribution of sample medians, and will visualize this distribution using a histogram.

# Installing vector for bootstrap medians
bootmediansVector <- vector(length=Bsamples)

#sampling with replacement Bsamples times
for (b in 1:Bsamples){
  bootmediansVector[b]<-median(bootsMatrix[,b])
}

# creating a histogram

tibble(Median=bootmediansVector) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  labs(y = "frequency",
       title = "Distribution of bootstrap medians") + 
  theme_bw()

# Now, use the bootstrap samples to obtain a nonparametric estimate of the standard error of the sample median.

SEestimates <- sd(bootmediansVector)
SEestimates

## [1] 4255.382

# Now, Use the bootstrap samples to obtain a nonparametric 95% confidence interval for the population median.

lowerBoundMedianVal <- quantile(bootmediansVector,probs=0.025)
upperBoundMedianVal <- quantile(bootmediansVector,probs=0.975)

We are 95% confident that the true median is in between 41396 and 58970.

# creating a histogram

tibble(Median=bootmediansVector) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  geom_vline(xintercept= lowerBoundMedianVal,size=1,
             color="dodgerblue",linetype="solid")+
  geom_vline(xintercept= upperBoundMedianVal,size=1,
             color="dodgerblue",linetype="solid")+
  labs(y = "frequency",
       title = "Distribution of bootstrap medians") + 
  theme_bw()

b. Parametric Bootstrapping

# First, Generate B=10,000 samples each of size 30 from a normal distribution using the estimated mean and standard deviation from the observed data set, and visualize this distribution using a histogram.

BsamplesForParametric <- 10000
observedMaleCountVal <- election_census_merged_data_republican_2016$male_count

# Then, Instantiating matrix for bootstrap samples
parammetricBootMatrix <- matrix(NA, nrow = n, ncol = BsamplesForParametric)
xBar<-mean(observedMaleCountVal)
s<-sd(observedMaleCountVal)

# Now, Sampling a normal set of n values, B times
for(b in 1:BsamplesForParametric) {
parammetricBootMatrix[, b] <- rnorm(n=n,mean=xBar,sd=s)
}


# Now, Installing vector for bootstrap medians
bootParametricMedians <- vector(length=BsamplesForParametric)

# Next, caluculating median for each simulated data set
for (b in 1:BsamplesForParametric){
  bootParametricMedians[b]<-median(parammetricBootMatrix[,b])
}

# creating a histogram

tibble(Median=bootParametricMedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  labs(y = "frequency",
       title = "Distribution of Parametric bootstrap medians") + 
  theme_bw()

# Obtain a parametric bootstrap estimate of the standard error of the sample median.

SEparametricestimate <- sd(bootParametricMedians)
SEparametricestimate

## [1] 17906.09

# Obtain a parametric bootstrap 95% confidence interval for the sample median.

lowerBoundParmetricMed <- quantile(bootParametricMedians,probs=0.025)
upperBoundParametricMed <- quantile(bootParametricMedians,probs=0.975)

# creating a histogram

tibble(Median=bootParametricMedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  geom_vline(xintercept= lowerBoundParmetricMed,size=1,
             color="dodgerblue",linetype="solid")+
   geom_vline(xintercept= upperBoundParametricMed,size=1,
             color="dodgerblue",linetype="solid")+
  labs(y = "frequency",
       title = "Distribution of parametric bootstrap medians") + 
  theme_bw()

6. Conclusions / Main Takeaways

In conclusion, the analysis of presidential elections data at the county level in the USA provides valuable insights into voting trends, party voting patterns, and candidate performance. It also highlights the importance of search interest in specific topics during the election season and sheds light on income disparities across regions. These findings can be utilized to better understand the dynamics of the electoral process and contribute to discussions on improving voting practices and policies.

From the above data analysis, I even got ideas of future work like,

Voter Turnout Analysis: Explore factors influencing voter turnout at the county level, and identify strategies to increase voter participation.

Temporal Analysis: Extend the analysis to include more recent election data beyond 2020, if available, to observe any emerging trends or changes in voting patterns over time.

Comparative Studies: Conduct comparative studies with elections from other countries to identify similarities and differences in voting behaviors and electoral processes.

Interactive Visualization: Create interactive data visualizations and dashboards to allow users to explore the data and draw their own insights.

Acknowledgment to Professor:

I would like to extend my sincere gratitude to Professor Andrew Dilernia for his invaluable guidance and support throughout this project. His expertise in data selection and research methodologies has been instrumental in shaping the direction of this analysis. I am grateful for the opportunity to learn under his mentorship is truly inspiring.

7. References

(https://www.archives.gov/electoral-college/allocation)

(https://ggplot2-book.org/maps.html)

(https://r-graph-gallery.com/interactive-charts.html)

(https://dplyr.tidyverse.org/index.html)