Analysis of US Presidential Elections Data at the County Level

Loading packages

This project makes use of the following libraries. We used "readr" and "readxl" to read the data, "skimr" to explore the data and summarize its characteristics, "datadictionary" and "flextable" to create and display a data dictionary containing a description of each variable in the dataset, and "lubridate" and "stringr" to manipulate the data. These are some of the primary tasks carried out with the assistance of the libraries loaded below.

library(tidyverse)
library(magrittr)
library(stringr)
library(skimr)
library(ggpubr)
library(readr)
library(readxl)
library(dplyr)
library(tidyr)
library(lubridate)
library(psych)
library(ggplot2)
library(forcats)  # for fct_reorder
library(flextable)
library(viridis)
library(hrbrthemes)
library(datadictionary)
library(gtrendsR)
library(knitr)
library(patchwork)
library(plotly)
library(maps)
library(sf)

Importing the data

These are the datasets that were used in this project. First, "Presidential_Elections_Data_count_level.csv" contains county-level returns for elections to the U.S. presidency from 2000 to 2020, obtained from the MIT Election Lab website (https://electionlab.mit.edu/data). Second, "census.csv" contains county-level demographic and socioeconomic measures (population, income, sex, race, education, and poverty proportions) for the years 2008 to 2021.

# Importing CSV file
election_data <- read_csv("Presidential_Elections_Data_count_level.csv")
census_data <- read_csv("census.csv")


# Downloading county-level shape files from US Census Bureau
if(!file.exists("cb_2018_us_county_500k.zip")) {
  download.file(url = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_500k.zip",
                destfile = "cb_2018_us_county_500k.zip")
}

if(!dir.exists("GeoFiles")) {
  dir.create("GeoFiles/")
}

# Unzipping files
utils::unzip("cb_2018_us_county_500k.zip",
             exdir = "GeoFiles")

# Loading the shapefiles
county_shape <- st_read("GeoFiles/cb_2018_us_county_500k.shp",
                        quiet = TRUE)
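
A quick sanity check that the shapefile loaded as expected, printing the number of county polygons and the coordinate reference system (an added check, not part of the original script):

# Checking the number of county polygons and the CRS of the shapefile
nrow(county_shape)
st_crs(county_shape)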

Exploring the data

In this section we explore and display high-level characteristics of the data sets, e.g., important variables and their types, the different levels of factor variables, and any patterns of missing values.

After exploring the data with the skim() function, we can observe that there are 57 missing values for the county_fips variable in the election data; the census data also shows large blocks of missingness in the race and education proportion variables (complete rates of roughly 0.19-0.20).
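
To see where those missing FIPS codes occur, a quick filter can list the affected state and county combinations; a small added check (not part of the original script), using the columns as imported:

# Listing rows with a missing county_fips, counted by state and county
election_data %>%
  filter(is.na(county_fips)) %>%
  count(state, county_name, sort = TRUE)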

## exploring the data
glimpse(election_data)
## Rows: 72,617
## Columns: 12
## $ year           <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2…
## $ state          <chr> "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", "ALABAMA", …
## $ state_po       <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
## $ county_name    <chr> "AUTAUGA", "AUTAUGA", "AUTAUGA", "AUTAUGA", "BALDWIN", …
## $ county_fips    <chr> "01001", "01001", "01001", "01001", "01003", "01003", "…
## $ office         <chr> "US PRESIDENT", "US PRESIDENT", "US PRESIDENT", "US PRE…
## $ candidate      <chr> "AL GORE", "GEORGE W. BUSH", "RALPH NADER", "OTHER", "A…
## $ party          <chr> "DEMOCRAT", "REPUBLICAN", "GREEN", "OTHER", "DEMOCRAT",…
## $ candidatevotes <dbl> 4942, 11993, 160, 113, 13997, 40872, 1033, 578, 5188, 5…
## $ totalvotes     <dbl> 17208, 17208, 17208, 17208, 56480, 56480, 56480, 56480,…
## $ version        <dbl> 20220315, 20220315, 20220315, 20220315, 20220315, 20220…
## $ mode           <chr> "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "…
skim(election_data)
Data summary
Name election_data
Number of rows 72617
Number of columns 12
_______________________
Column type frequency:
character 8
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 4 20 0 51 0
state_po 0 1 2 2 0 51 0
county_name 0 1 3 21 0 1892 0
county_fips 57 1 5 5 0 3156 0
office 0 1 12 12 0 1 0
candidate 0 1 5 17 0 13 0
party 0 1 5 11 0 5 0
mode 0 1 4 20 0 16 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2011.30 7.52 2000 2004 2012 2020 2020 ▇▃▃▃▇
candidatevotes 0 1 10781.83 46034.94 0 115 1278 5848 3028885 ▇▁▁▁▁
totalvotes 0 1 42514.21 121951.40 0 5175 11194 29855 4264365 ▇▁▁▁▁
version 0 1 20220315.00 0.00 20220315 20220315 20220315 20220315 20220315 ▁▁▇▁▁
glimpse(census_data)
## Rows: 52,612
## Columns: 26
## $ geoid                    <chr> "01003", "01015", "01043", "01049", "01051", …
## $ county_state             <chr> "Baldwin County, Alabama", "Calhoun County, A…
## $ year                     <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 200…
## $ population               <dbl> 174439, 113419, 81324, 68515, 78106, 103303, …
## $ median_income            <dbl> 52320, 40056, 39229, 34357, 54601, 37642, 426…
## $ median_monthly_rent_cost <dbl> 790, 533, 562, 536, 620, 569, 570, 737, 537, …
## $ median_monthly_home_cost <dbl> 884, 686, 653, 548, 787, 595, 673, 975, 655, …
## $ prop_female              <dbl> 0.5072088, 0.5213941, 0.5008853, 0.5028388, 0…
## $ prop_male                <dbl> 0.4927912, 0.4786059, 0.4991147, 0.4971612, 0…
## $ prop_white               <dbl> 0.8650990, 0.7668380, 0.9676971, 0.9294607, 0…
## $ prop_black               <dbl> 0.08835180, 0.20128021, 0.01090699, 0.0140115…
## $ prop_native              <dbl> 0.006896394, 0.003720717, 0.007463971, 0.0164…
## $ prop_asian               <dbl> 0.0072632840, 0.0081291494, 0.0024592986, 0.0…
## $ prop_hawaiin_islander    <dbl> 0.0017312642, 0.0000000000, 0.0000000000, 0.0…
## $ prop_other_race          <dbl> 0.0146813499, 0.0092312575, 0.0004303773, 0.0…
## $ prop_multi_racial        <dbl> 0.015976932, 0.010800660, 0.011042251, 0.0155…
## $ prop_highschool          <dbl> 0.2234336, 0.2642586, 0.2830509, 0.2945016, 0…
## $ prop_GED                 <dbl> 0.04600955, 0.07705351, 0.07444981, 0.0797542…
## $ prop_some_college        <dbl> 0.06240842, 0.06737910, 0.05683327, 0.0587472…
## $ prop_college_no_degree   <dbl> 0.1687486, 0.1806985, 0.1389848, 0.1287847, 0…
## $ prop_associates          <dbl> 0.08350924, 0.05767840, 0.09165914, 0.0494131…
## $ prop_bachelors           <dbl> 0.20768040, 0.09130223, 0.08016855, 0.0449906…
## $ prop_masters             <dbl> 0.06303756, 0.04145800, 0.04394398, 0.0302134…
## $ prop_professional        <dbl> 0.013004859, 0.008675421, 0.008834829, 0.0026…
## $ prop_doctoral            <dbl> 0.0108111688, 0.0155106011, 0.0012570599, 0.0…
## $ prop_poverty             <dbl> 0.09132561, 0.18246331, 0.15272973, 0.2357663…
skim(census_data)
Data summary
Name census_data
Number of rows 52612
Number of columns 26
_______________________
Column type frequency:
character 2
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
geoid 0 1 5 5 0 3225 0
county_state 0 1 16 42 0 3229 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2014.82 3.78 2008.00 2012.00 2015.00 2018.00 2021.00 ▆▇▅▇▇
population 0 1.00 145758.59 395597.70 41.00 14429.00 40119.50 122855.25 10170292.00 ▇▁▁▁▁
median_income 4 1.00 49709.77 15079.85 10499.00 40264.50 47527.00 56565.25 156821.00 ▃▇▁▁▁
median_monthly_rent_cost 30 1.00 741.63 235.85 99.00 587.00 688.00 841.00 2599.00 ▃▇▁▁▁
median_monthly_home_cost 18 1.00 837.68 388.18 103.00 572.00 743.00 1002.00 3254.00 ▇▇▁▁▁
prop_female 0 1.00 0.50 0.02 0.19 0.50 0.51 0.51 0.63 ▁▁▁▇▁
prop_male 0 1.00 0.50 0.02 0.37 0.49 0.49 0.50 0.81 ▁▇▁▁▁
prop_white 42443 0.19 0.78 0.15 0.08 0.71 0.82 0.89 0.98 ▁▁▂▅▇
prop_black 42443 0.19 0.11 0.12 0.00 0.03 0.06 0.15 0.73 ▇▂▁▁▁
prop_native 42443 0.19 0.01 0.05 0.00 0.00 0.00 0.01 0.80 ▇▁▁▁▁
prop_asian 42443 0.19 0.03 0.04 0.00 0.01 0.02 0.04 0.44 ▇▁▁▁▁
prop_hawaiin_islander 42443 0.19 0.00 0.01 0.00 0.00 0.00 0.00 0.14 ▇▁▁▁▁
prop_other_race 42443 0.19 0.03 0.05 0.00 0.01 0.02 0.04 0.80 ▇▁▁▁▁
prop_multi_racial 42443 0.19 0.04 0.04 0.00 0.02 0.03 0.04 0.69 ▇▁▁▁▁
prop_highschool 41916 0.20 0.25 0.06 0.05 0.21 0.25 0.29 0.49 ▁▅▇▂▁
prop_GED 41916 0.20 0.04 0.02 0.00 0.03 0.04 0.05 0.18 ▆▇▁▁▁
prop_some_college 41916 0.20 0.07 0.02 0.00 0.06 0.07 0.08 0.15 ▁▅▇▂▁
prop_college_no_degree 41916 0.20 0.15 0.03 0.05 0.13 0.15 0.17 0.28 ▁▆▇▂▁
prop_associates 41916 0.20 0.09 0.02 0.02 0.07 0.09 0.10 0.29 ▅▇▁▁▁
prop_bachelors 41916 0.20 0.18 0.06 0.04 0.13 0.17 0.21 0.40 ▂▇▆▂▁
prop_masters 41916 0.20 0.08 0.03 0.01 0.05 0.07 0.09 0.29 ▇▇▂▁▁
prop_professional 41916 0.20 0.02 0.01 0.00 0.01 0.02 0.02 0.10 ▇▃▁▁▁
prop_doctoral 41916 0.20 0.01 0.01 0.00 0.01 0.01 0.02 0.11 ▇▁▁▁▁
prop_poverty 2 1.00 0.16 0.08 0.00 0.11 0.15 0.19 0.67 ▆▇▁▁▁

1. Data Dictionary for the datasets used

# Creating a data dictionary for the election data, listing a short
# description, storage type, and class for each variable.
dataDictionary <- tibble(Variable = colnames(election_data),
                         Description = c(
                            "year of election",
                            "state name",
                            "state postal code abbreviation",
                            "county name",
                            "county fips code",
                            "name of the public office to which the candidate is seeking election",
                            "candidate name",
                            "party of the candidate",
                            "number of votes for candidate",
                            "total votes cast in the election",
                            "date-stamped version of the dataset (YYYYMMDD)",
                            "mode of voting (e.g., total, absentee)"
),
                         Type = map_chr(election_data, .f = function(x){typeof(x)[1]}),
                         Class = map_chr(election_data, .f = function(x){class(x)[1]}))
# Printing nicely in R Markdown (option 1)
flextable::flextable(dataDictionary, cwidth = 2)

Variable        Description                                                            Type       Class
year            year of election                                                       double     numeric
state           state name                                                             character  character
state_po        state postal code abbreviation                                         character  character
county_name     county name                                                            character  character
county_fips     county fips code                                                       character  character
office          name of the public office to which the candidate is seeking election   character  character
candidate       candidate name                                                         character  character
party           party of the candidate                                                 character  character
candidatevotes  number of votes for candidate                                          double     numeric
totalvotes      total votes cast in the election                                       double     numeric
version         date-stamped version of the dataset (YYYYMMDD)                         double     numeric
mode            mode of voting (e.g., total, absentee)                                 character  character

# Printing nicely in R Markdown (option 2)
knitr::kable(dataDictionary)
Variable Description Type Class
year year of election double numeric
state state name character character
state_po state postal code abbreviation character character
county_name county name character character
county_fips county fips code character character
office name of the public office to which the candidate is seeking election character character
candidate candidate name character character
party party of the candidate character character
candidatevotes number of votes for candidate double numeric
totalvotes total votes cast in the election double numeric
version date-stamped version of the dataset (YYYYMMDD) double numeric
mode mode of voting (e.g., total, absentee) character character
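
Since the datadictionary package is loaded above, a dictionary can also be generated directly from the data frame; a minimal sketch, assuming the package's create_dictionary() interface:

# Generating a data dictionary automatically with the datadictionary package
election_dictionary <- datadictionary::create_dictionary(election_data)
head(election_dictionary)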

2. Data Cleaning (string modification and merging data sets)

Cleaning the data sets and modifying the variables to match the requirements of the project.

a. String modification

# Converting values in "state" variable to capitalized form
election_data <- election_data %>%
  mutate(state = str_to_title(state))

# Converting values in "candidate" variable to capitalized form
election_data <- election_data %>%
  mutate(candidate = str_to_title(candidate))

# Converting values in "party" variable to capitalized form
election_data <- election_data %>%
  mutate(party = str_to_title(party))

# Converting values in "county_name" variable to capitalized form
election_data <- election_data %>%
  mutate(county_name = str_to_title(county_name))

election_data <- election_data %>% 
  rename(county = county_name)

# Standardizing candidate names using mutate() and str_replace()
election_data <- election_data %>% 
  mutate(candidate = str_replace(candidate, "Donald J Trump", "Donald Trump")) %>%
  mutate(candidate = str_replace(candidate, "Joseph R Biden Jr", "Joe Biden"))

# Separate "county_state" column into "county" and "state"
if ("county_state" %in% names(census_data)) {
  # Separate "county_state" column into "county" and "state"
  census_data <- census_data %>%
    separate(county_state, into = c("county", "state"), sep = ",\\s*", extra = "merge")
}
# now, remove the substring "county" in every county column value, example: "Houston County" > "Houston"
census_data$county <- str_replace(census_data$county, " County", "")

# Converting values in "state" variable in census data to capitalized form
census_data <- census_data %>%
  mutate(state = str_to_title(state))

# Converting values in "county" variable in census data to capitalized form
census_data <- census_data %>%
  mutate(county = str_to_title(county))

# Renaming the geoid column to county_fips
census_data <- census_data %>% 
  rename(county_fips = geoid)

b. Merging Data Sets

# Filtering the census_data for the year 2016
census_data_2016 <- subset(census_data, year == 2016)

# Filtering the election_data for the year 2016
election_data_2016 <- subset(election_data, year == 2016)

# Merging the two data frames on county_fips (together with state, year, and county)
election_census_merged_data <- left_join(census_data_2016, election_data_2016, by = c("county_fips", "state", "year", "county"))

# Updating the dataset with the female and male counts for each county
election_census_merged_data$male_count <- round(election_census_merged_data$population * election_census_merged_data$prop_male)

election_census_merged_data$female_count <- round(election_census_merged_data$population * election_census_merged_data$prop_female)

# Filter the dataset to show Republican candidate votes for each county_fips
election_census_merged_data_republican <- election_census_merged_data %>%
  filter(party == "Republican") %>%
  distinct(county_fips, .keep_all = TRUE)
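
A left join silently fills unmatched rows with NAs, so it is worth checking how many census counties found no 2016 election match; a quick added check (not part of the original analysis):

# Counting merged rows with no matching election record (NA candidate)
sum(is.na(election_census_merged_data$candidate))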

3. Exploratory Data Analysis

a. Tables of Summary Statistics (Grouped and Frequency)

Table 1: Grouped summary statistics of candidate votes by state

summary_state <- election_data %>%
  group_by(state) %>%
  summarise(mean_v = round(mean(candidatevotes), 2),
            median_v = round(median(candidatevotes), 2),
            sd_v = round(sd(candidatevotes), 2),
            max_v = round(max(candidatevotes), 2),
            min_v = round(min(candidatevotes), 2),
            .groups = "drop") %>%
  as.data.frame()

summary_state_table <- flextable::qflextable(summary_state)
summary_state_table

state                  mean_v     median_v   sd_v        max_v      min_v
Alabama                9,562.28   3,555.0    20,036.95   181,688    2
Alaska                 2,412.17   1,939.0    2,334.27    10,717     0
Arizona                30,384.94  1,508.0    112,684.35  978,457    0
Arkansas               2,400.19   201.0      6,834.58    89,574     0
California             67,021.72  8,106.5    203,790.55  3,028,885  2
Colorado               11,103.56  1,135.5    31,472.45   313,293    0
Connecticut            52,802.04  22,058.0   74,918.46   297,505    0
Delaware               41,267.58  30,838.0   49,206.40   195,034    294
District Of Columbia   78,320.05  17,367.0   113,325.05  317,323    1,322
Florida                36,193.16  3,897.0    77,794.32   624,146    1
Georgia                5,128.96   786.5      17,792.05   297,051    0
Hawaii                 31,985.51  12,239.0   52,880.07   238,866    133
Idaho                  4,505.79   1,032.5    12,104.62   130,699    1
Illinois               15,114.35  2,066.0    88,520.33   1,725,973  0
Indiana                8,183.73   2,682.5    19,101.97   247,772    0
Iowa                   3,573.57   1,152.0    9,276.25    128,465    0
Kansas                 3,613.44   693.0      13,483.38   184,259    3
Kentucky               4,592.88   1,530.0    13,653.98   228,288    1
Louisiana              9,249.22   2,986.5    18,781.17   152,610    11
Maine                  12,366.85  7,165.5    18,244.04   128,759    0
Maryland               15,822.72  1,229.5    42,364.39   357,837    0
Massachusetts          63,935.49  15,048.0   94,365.68   617,196    41
Michigan               16,720.18  3,518.0    51,486.44   660,085    2
Minnesota              9,488.48   2,153.0    32,290.92   532,623    4
Mississippi            4,183.36   1,790.5    7,462.33    76,112     0
Missouri               6,817.37   1,543.5    23,002.74   333,123    0
Montana                2,765.24   717.0      6,199.47    50,772     5
Nebraska               2,754.28   460.0      10,890.56   150,350    0
Nevada                 17,504.76  998.5      63,992.63   521,777    6
New Hampshire          22,193.54  12,792.0   28,790.04   122,344    191
New Jersey             51,471.90  24,212.0   62,124.22   285,967    93
New Mexico             6,769.28   1,398.0    18,473.54   193,757    0
New York               34,742.93  9,105.0    83,084.69   703,310    7
North Carolina         7,081.89   348.0      20,627.13   302,736    0
North Dakota           1,836.63   501.0      4,864.71    42,619     0
Ohio                   18,766.28  5,745.0    43,724.48   458,204    13
Oklahoma               3,946.76   431.5      13,808.50   174,741    0
Oregon                 15,031.14  2,947.0    34,565.99   367,249    1
Pennsylvania           27,830.84  7,303.0    60,148.15   603,790    12
Rhode Island           22,876.00  9,098.0    36,662.68   167,442    0
South Carolina         5,436.18   89.5       13,823.51   127,832    0
South Dakota           1,785.25   571.0      4,620.91    49,249     0
Tennessee              8,380.75   2,517.0    21,679.46   256,297    7
Texas                  9,406.51   663.0      43,432.78   918,193    0
Utah                   6,058.90   86.0       24,074.89   271,576    0
Vermont                6,519.92   3,510.0    9,348.12    72,938     12
Virginia               5,878.28   899.0      18,452.44   355,133    0
Washington             22,963.08  3,947.0    69,060.14   907,310    2
West Virginia          3,904.89   1,606.0    6,125.35    46,398     6
Wisconsin              11,854.21  3,820.5    28,906.19   332,438    0
Wyoming                3,263.54   1,101.0    4,821.71    27,891     13

Table 2: Aggregate votes for each candidate in each election year

aggregate_table <- aggregate(candidatevotes ~ year + candidate, data = election_data, FUN = sum)

# Order the aggregate_table by year and candidatevotes
aggregate_table_ordered <- arrange(aggregate_table, year, desc(candidatevotes))


# Spreading the aggregate_table so each unique year becomes a separate column
spread_table <- spread(aggregate_table_ordered, key = year, value = candidatevotes)

spread_table_filled <- spread_table %>%
  mutate(across(-candidate, ~ replace(., is.na(.), "-")))

# Printing the spread_table
flextable_spread_table <- flextable::qflextable(spread_table_filled)
flextable_spread_table

candidate        2000      2004      2008      2012      2016      2020
Al Gore          51001968  -         -         -         -         -
Barack Obama     -         -         69448278  65753019  -         -
Donald Trump     -         -         -         -         62979031  74218999
George W. Bush   50457896  61931981  -         -         -         -
Hillary Clinton  -         -         -         -         65844241  -
Jo Jorgensen     -         -         -         -         -         1810407
Joe Biden        -         -         -         -         -         81264648
John Kerry       -         58899063  -         -         -         -
John Mccain      -         -         59928695  -         -         -
Mitt Romney      -         -         -         60666387  -         -
Other            1068557   1489505   1810364   2629510   7674891   1183656
Ralph Nader      2882954   -         -         -         -         -

Table 3: Percentage of votes each candidate received in each election year

# Filtering the election_data for the years 2000 to 2020
year_data <- election_data[election_data$year >= 2000 & election_data$year <= 2020, ]

# Group by year and candidate, and calculate the total votes for each candidate in each year
candidate_votes <- year_data %>%
  group_by(year, candidate) %>%
  summarize(total_votes = sum(candidatevotes))

# Calculating the total votes for each year
total_votes_per_year <- year_data %>%
  group_by(year) %>%
  summarize(total_votes = sum(candidatevotes))

# Merging the two data frames to get the total votes as percentages
candidate_votes_percentage <- merge(candidate_votes, total_votes_per_year, by = "year") %>%
  mutate(percentage_votes = (total_votes.x / total_votes.y) * 100) %>%
  select(-total_votes.x, -total_votes.y) %>%
  spread(year, percentage_votes, fill = 0)

# Converting percentage values to numeric, round to two decimals, and add "%" symbol
candidate_votes_percentage[, -1] <- lapply(candidate_votes_percentage[, -1], function(x) paste0(format(round(as.numeric(x), 2), nsmall = 2), "%"))

# Printing the candidate_votes_percentage data frame
candidate_votes_percentage_table <- flextable::qflextable(candidate_votes_percentage)
candidate_votes_percentage_table

candidate        2000    2004    2008    2012    2016    2020
Al Gore          48.38%  0.00%   0.00%   0.00%   0.00%   0.00%
Barack Obama     0.00%   0.00%   52.94%  50.95%  0.00%   0.00%
Donald Trump     0.00%   0.00%   0.00%   0.00%   46.14%  46.83%
George W. Bush   47.87%  50.63%  0.00%   0.00%   0.00%   0.00%
Hillary Clinton  0.00%   0.00%   0.00%   0.00%   48.24%  0.00%
Jo Jorgensen     0.00%   0.00%   0.00%   0.00%   0.00%   1.14%
Joe Biden        0.00%   0.00%   0.00%   0.00%   0.00%   51.28%
John Kerry       0.00%   48.15%  0.00%   0.00%   0.00%   0.00%
John Mccain      0.00%   0.00%   45.68%  0.00%   0.00%   0.00%
Mitt Romney      0.00%   0.00%   0.00%   47.01%  0.00%   0.00%
Other            1.01%   1.22%   1.38%   2.04%   5.62%   0.75%
Ralph Nader      2.73%   0.00%   0.00%   0.00%   0.00%   0.00%

Frequency table for two categorical variables

From the table below we can observe which candidates participated in each election year from 2000 to 2020.

# Filtering the election_data for the years 2000 to 2020
year_data <- election_data[election_data$year >= 2000 & election_data$year <= 2020, ]

# Obtaining the unique candidates for each year
candidate_occurrence <- with(year_data, tapply(candidate, year, function(x) unique(x)))

# Creating a data frame to display the occurrence of candidates
occurrence_data_frame <- as.data.frame(candidate_occurrence)

# Replacing empty cells with zero
occurrence_data_frame[is.na(occurrence_data_frame)] <- 0

# Printing the candidate occurrence data frame
occurrence_data_frame
##                              candidate_occurrence
## 2000  Al Gore, George W. Bush, Ralph Nader, Other
## 2004            John Kerry, George W. Bush, Other
## 2008             Barack Obama, John Mccain, Other
## 2012             Barack Obama, Mitt Romney, Other
## 2016         Hillary Clinton, Donald Trump, Other
## 2020 Joe Biden, Other, Donald Trump, Jo Jorgensen
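
The listing above shows which candidates appeared each year; for an actual two-way frequency table (counts of county-level rows for each year-candidate pair), base R's table() is a compact option, sketched below:

# Two-way frequency table of year by candidate (row counts, not vote totals)
with(year_data, table(year, candidate))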

Now we can look at election outcomes (which party won the overall election?) and whether a candidate won a majority of the popular vote. To calculate the state winners and the total electoral seats won by each party, we need the allocation of electoral college seats. For this I referred to https://www.archives.gov/electoral-college/allocation and loaded the allocation into a data frame in the step below. Note that this page lists the current (post-2020 census) allocation, so seat totals computed for earlier elections are approximations.

# Creating a lookup table for electoral college votes allocation
default_electoral_college_seats_state_wise <- data.frame(
  state = c("Alabama", "Kentucky", "North Dakota", "Alaska", "Louisiana", "Ohio", "Arizona", "Maine", "Oklahoma", "Arkansas",
            "Maryland", "Oregon", "California", "Massachusetts", "Pennsylvania", "Colorado", "Michigan", "Rhode Island", "Connecticut",
            "Minnesota", "South Carolina", "Delaware", "Mississippi", "South Dakota", "District of Columbia", "Missouri", "Tennessee",
            "Florida", "Montana", "Texas", "Georgia", "Nebraska", "Utah", "Hawaii", "Nevada", "Vermont", "Idaho", "New Hampshire",
            "Virginia", "Illinois", "New Jersey", "Washington", "Indiana", "New Mexico", "West Virginia", "Iowa", "New York", "Wisconsin",
            "Kansas", "North Carolina", "Wyoming"),
  electoral_seats = c(9, 8, 3, 3, 8, 17, 11, 4, 7, 6, 10, 8, 54, 11, 19, 10, 15, 4, 7, 10, 9, 3, 6, 3, 3, 10, 11, 30, 4, 40, 16, 5, 6,
                      4, 4, 3, 4, 4, 13, 19, 14, 12, 11, 5, 4, 6, 28, 10, 6, 16, 3)
)
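
Before merging, two quick checks are worth running (these are added checks, not part of the original analysis): the seats should sum to 538, and any state spelled differently in the two tables will silently fail to join. In particular, str_to_title() produced "District Of Columbia" in election_data, while the lookup uses "District of Columbia".

# The electoral seats should total 538; a different sum indicates an entry error
sum(default_electoral_college_seats_state_wise$electoral_seats)

# States present in election_data but absent from the lookup will join to NA seats
setdiff(unique(election_data$state), default_electoral_college_seats_state_wise$state)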

# Merging election_data with the electoral_seats lookup table based on state
merged_election_data <- merge(election_data, default_electoral_college_seats_state_wise, by = "state", all.x = TRUE)

# Calculating total votes for each party in each state for each year
merged_election_data <- merged_election_data %>%
  group_by(year, state, party) %>%
  summarise(total_votes_state_party = sum(candidatevotes))

# Determining the winner party in each state for each year based on total votes majority
state_winners <- merged_election_data %>%
  group_by(year, state) %>%
  filter(total_votes_state_party == max(total_votes_state_party)) %>%
  ungroup()

# Merging the state_winners with the electoral_seats table to assign electoral votes to each party
state_winners_with_votes <- state_winners %>%
  left_join(default_electoral_college_seats_state_wise, by = c("state" = "state"))

# Calculating the total electoral votes won by each party for each year
electoral_college_seats <- state_winners_with_votes %>%
  group_by(year, party) %>%
  summarise(total_electoral_seats = sum(electoral_seats, na.rm = TRUE))

electoral_college_seats
## # A tibble: 12 × 3
## # Groups:   year [6]
##     year party      total_electoral_seats
##    <dbl> <chr>                      <dbl>
##  1  2000 Democrat                     276
##  2  2000 Republican                   257
##  3  2004 Democrat                     239
##  4  2004 Republican                   294
##  5  2008 Democrat                     351
##  6  2008 Republican                   182
##  7  2012 Democrat                     324
##  8  2012 Republican                   209
##  9  2016 Democrat                     227
## 10  2016 Republican                   306
## 11  2020 Democrat                     298
## 12  2020 Republican                   235
# Determining the winning party based on total electoral votes for each year
winning_party <- electoral_college_seats %>%
  group_by(year) %>%
  filter(total_electoral_seats == max(total_electoral_seats)) %>%
  ungroup()

# Select the final columns to display and rename the columns
winning_party_table_each_term <- winning_party %>%
  select(year, party, total_electoral_seats) %>%
  rename("Total Electoral College Seats Won" = total_electoral_seats, "Winning Party" = party)

winning_party_table_each_term
## # A tibble: 6 × 3
##    year `Winning Party` `Total Electoral College Seats Won`
##   <dbl> <chr>                                         <dbl>
## 1  2000 Democrat                                        276
## 2  2004 Republican                                      294
## 3  2008 Democrat                                        351
## 4  2012 Democrat                                        324
## 5  2016 Republican                                      306
## 6  2020 Democrat                                        298

b. Data Visualizations

Plot 1

The graph below shows the total number of votes cast in each presidential election between 2000 and 2020.

# Plot 1: Line chart of total votes cast in each election year

# Calculating total votes per year
total_votes_per_year <- election_data %>%
  group_by(year) %>%
  summarize(total_votes = sum(candidatevotes))

# Creating the line graph
line_graph <- ggplot(total_votes_per_year, aes(x = year, y = total_votes)) +
  geom_line(size = 1, color = "red") +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 5),
                     labels = scales::comma) +
  labs(title = "Total Votes Cast in US Presidential Elections, 2000-2020",
       x = "Year",
       y = "Number of Votes",
       caption = "Data source: https://electionlab.mit.edu/data") +
  theme(plot.title = element_text(hjust = 0.5))

line_graph

Plot 2

The graph below shows the percentage of votes received by each party in each election year between 2000 and 2020.

# Calculate the percentage of votes for each party in each year
votes_of_party_in_percentage <- election_data %>%
  group_by(year, party) %>%
  summarize(percentage_votes = sum(candidatevotes) / sum(totalvotes) * 100) %>%
  ungroup()

# Reorder the levels of the party variable based on average percentage of votes across all years
votes_of_party_in_percentage <- votes_of_party_in_percentage %>%
  mutate(party = fct_reorder(party, percentage_votes, .desc = TRUE))

# Creating a dodged bar chart of party vote percentages, faceted by year
ggplot(votes_of_party_in_percentage, aes(fill = party, y = percentage_votes, x = party)) +
  geom_bar(position = position_dodge(width = 0.7), stat = "identity") +
  scale_fill_viridis(discrete = TRUE, option = "E") +
  facet_wrap(~ year, ncol = 3, scales = "free_x") +
  theme_minimal() +
  xlab("Party") +
  labs(title = "Percentage of Votes Received by Each Party Each Year",
       x = "Party",
       y = "Percentage of Votes",
       caption = "Data source: https://electionlab.mit.edu/data") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "bottom") +
  scale_y_continuous(breaks = seq(0, 100, by = 10))

Plot 3

The plot (using plotly) visualizes the search interest for the terms "Election Fraud" and "US Election Results 2020" during the 2020 US election season. The y-axis represents the relative interest in the search term, ranging from 0 to 100, with 100 being the peak popularity. The x-axis shows the months in 2020, spanning from August 2nd to December 3rd.

The plot also provides insights into how public interest in these topics evolved during the specified period, helping to understand their popularity and fluctuations throughout the election season.

# Function to load data from gtrends, if the gtrends fetching fails, will load the local RDS file
load_gtrends_data <- function() {
  tryCatch({
    # Attempt to fetch data from gtrends
    data <- gtrends(c("US Election Results 2020", "Election Fraud"), 
                    time = "2020-08-02 2020-12-03", geo = "US")
    write_rds(data, file = "importedDataGtrends.rds")  # Save data to RDS file
    return(data)  # Return the data
  }, error = function(err) {
    # If gtrends call fails, attempt to read data from RDS file
    if (file.exists("importedDataGtrends.rds")) {
      data <- read_rds("importedDataGtrends.rds")
      return(data)  # Return the data from RDS file
    } else {
      message("Failed to fetch data from gtrends and no RDS file found.")
      return(NULL)  # Return NULL or any other default value if both attempts fail
    }
  })
}
# Load data using the function
gTrendsElectionSearchData <- load_gtrends_data()

if (!is.null(gTrendsElectionSearchData)) {
  plot_trend <- function(keyword_string, data) {
    time_trend <- data$interest_over_time %>%
      mutate(hits = ifelse(hits == "<1", 0.5, as.numeric(hits)),
             date = as.Date(date),
             keyword = factor(keyword, levels = keyword_string))
    
    plot <- plot_ly(data = time_trend, x = ~date, y = ~hits, color = ~keyword,
                    type = "scatter", mode = "lines", fill = "tozeroy") %>%
      layout(title = "Exploring the Search Interest for 'Election Fraud' and 'US Election Results 2020' during the 2020 US Election Season",
             xaxis = list(title = "Months in year 2020", standoff = 20),
             yaxis = list(title = "Hits (Relative to Peak from 0 - 100)", standoff = 20),
             legend = list(orientation = "v"),
             showlegend = TRUE,
             margin = list(l = 100, r = 100, b = 100, t = 100),  # Adjust margins
             padding = list(r = 10, b = 50)  # Adjust padding
      )
    return(plot)
  }
  election_fraud_trend_plot <- plot_trend(keyword_string = c("US Election Results 2020", "Election Fraud"),gTrendsElectionSearchData)
  election_fraud_trend_plot
}

Plot 4

The maps below display the statewide winners of the 2016 and 2020 US presidential elections, providing a clear visual representation of the electoral outcome across the United States in each election year.

# Loading the us_states shapefile from the maps package
us_states <- map_data("state")

# Renamed the "region" column to "state" in us_states
us_states <- us_states %>% 
  rename(state = region)

us_states <- us_states %>%
  mutate(state = str_to_title(state))

# Filtering the state_winners_with_votes dataset for the year 2020
state_winners_2020 <- state_winners_with_votes %>%
  filter(year == 2020)

state_winners_2016 <- state_winners_with_votes %>%
  filter(year == 2016)

# Left join the us_states dataset with state_winners_2020 based on the "state" column
map_data_combined_2020 <- left_join(us_states, state_winners_2020, by = "state", copy = TRUE)
map_data_combined_2016 <- left_join(us_states, state_winners_2016, by = "state", copy = TRUE)

# Displaying the map view which show the winners of different states by the parties

election_outcome_map_2016 <- ggplot(map_data_combined_2016, aes(x = long, y = lat, group = group, fill = party)) +
  geom_polygon(color = "white", size = 0.25) +  # Add thick white borders between states
  coord_quickmap() +
  scale_fill_manual(values = c("Democrat" = "blue", "Republican" = "red", "Other" = "gray")) +
  labs(title = "2016 Election outcome by state",
       caption = "Data source:  https://electionlab.mit.edu/data") +
  theme_void()

election_outcome_map_2016

election_outcome_map_2020 <- ggplot(map_data_combined_2020, aes(x = long, y = lat, group = group, fill = party)) +
  geom_polygon(color = "white", size = 0.25) +  # Add thick white borders between states
  coord_quickmap() +
  scale_fill_manual(values = c("Democrat" = "blue", "Republican" = "red", "Other" = "gray")) +
  labs(title = "2020 Election outcome by state",
       caption = "Data source:  https://electionlab.mit.edu/data") +
  theme_void()

election_outcome_map_2020

Plot 5

The plot displays county-level median income in the USA for the year 2021.

# Merge shape and census data
ggMapData <- county_shape %>% 
  full_join(census_data, by = c("GEOID" = "county_fips")) %>% 
  dplyr::filter(year == 2021)

# Fixing issue with Alaska and Hawaii
ggMapDataFix <- ggMapData %>% 
  tigris::shift_geometry()

# Plotting county-level median income; the cap at 156,821 (the observed maximum) has no effect here
incomeGG <- ggMapDataFix %>% rowwise() %>% 
  mutate(income_plot = min(median_income, 156821)) %>% 
  ggplot(aes(fill = income_plot)) +
  geom_sf(color = "#C0C0C0",
          size = 0.1) +
  scale_fill_gradient(low = "white", high = "dodgerblue",
                      limits = c(0, 180000)) +
  labs(title = "County-level median income, 2021",
       caption = "Data source: US Census Bureau",
       fill = "Median Income") +
  theme_void()

incomeGG
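
The same shapefile can also be joined with the election returns to map county-level outcomes, a natural extension of the state maps in Plot 4. A minimal sketch for the 2020 Republican vote share follows; the rep_share calculation and color scale are illustrative assumptions, not part of the original analysis:

# County-level map of the 2020 Republican share of candidate votes
rep_share_2020 <- election_data %>%
  filter(year == 2020) %>%
  group_by(county_fips) %>%
  summarise(rep_share = sum(candidatevotes[party == "Republican"]) / sum(candidatevotes))

county_shape %>%
  inner_join(rep_share_2020, by = c("GEOID" = "county_fips")) %>%
  tigris::shift_geometry() %>%
  ggplot(aes(fill = rep_share)) +
  geom_sf(color = NA) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0.5, labels = scales::percent) +
  labs(title = "Republican share of the county-level vote, 2020",
       fill = "Vote share") +
  theme_void()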

4. Monte Carlo Methods of Inference

Implementing a permutation test: we test whether the mean state-level party vote totals differ between Republicans and Democrats by repeatedly shuffling the party labels and recomputing the difference in means, which simulates the null hypothesis of no difference between the parties.

quantitative_variable <- merged_election_data$total_votes_state_party

observed_diff <- mean(quantitative_variable[merged_election_data$party == "Republican"]) - 
                mean(quantitative_variable[merged_election_data$party == "Democrat"])

num_permutations <- 100000
perm_diffs <- replicate(num_permutations, {
  shuffled_party <- sample(merged_election_data$party)
  mean(quantitative_variable[shuffled_party == "Republican"]) - 
  mean(quantitative_variable[shuffled_party == "Democrat"])
})

# Calculating the p-value
p_value <- sum(abs(perm_diffs) >= abs(observed_diff)) / num_permutations
p_value
## [1] 0.44745
# Step 6: Assess the statistical significance
if (p_value < 0.05) {
  print("The difference in means between Republican and Democrat parties is statistically significant.")
} else {
  print("There is no significant difference in means between Republican and Democrat parties.")
}
## [1] "There is no significant difference in means between Republican and Democrat parties."

5. Bootstrap Methods of Inference

a. Non-parametric Bootstrapping

# Observed data: Republican candidate votes per county from the 2016 merged data
set.seed(1989)
n <- 135   # bootstrap sample size; only the first n values are resampled below
observedVal <- election_census_merged_data_republican$candidatevotes

# Raincloud plot of the observed data; note its number of peaks (unimodal, bimodal, etc.) and its symmetry.

tibble(value = observedVal) %>% ggplot(aes(y = value)) +
  ggdist::stat_halfeye(adjust = .5, width = 2*.3, .width = c(0.5, 1)) + 
  geom_boxplot(width = .3, outlier.shape = NA) +
  ggdist::stat_dots(side = "left", dotsize = 6, justification = 1.05, binwidth = .1,
                    color = "black") +
  coord_flip() +
  labs(y = "Value",
       title = "Republican candidate votes by county, 2016") + 
  theme_bw() +
  theme(legend.position = "none")

# What is the sample median for this data?

median(observedVal)
## [1] 7239
# Number of bootstrap samples
Bsamples <- 10000

# Instantiating matrix for bootstrap samples
bootsMatrix <- matrix(NA, nrow = n, ncol = Bsamples)

# Sampling with replacement B times
for(b in 1:Bsamples) {
bootsMatrix[, b] <- observedVal[sample(1:n, size = n, replace = TRUE)]
}

# Using the generated bootstrap samples to create a bootstrap distribution of sample medians, visualized below with a histogram.

# Initializing vector for bootstrap medians
bootmediansVector <- vector(length = Bsamples)

# Computing the median of each bootstrap sample
for (b in 1:Bsamples){
  bootmediansVector[b] <- median(bootsMatrix[, b])
}


# creating a histogram

tibble(Median=bootmediansVector) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  labs(y = "frequency",
       title = "Distribution of bootstrap medians") + 
  theme_bw()

# Use the bootstrap samples to obtain a nonparametric estimate of the standard error of the sample median.

SEestimates <- sd(bootmediansVector)
SEestimates
## [1] 4255.382
# Using the bootstrap samples to obtain a nonparametric 95% confidence interval for the population median.

lowerBoundMedianVal <- quantile(bootmediansVector, probs = 0.025)
upperBoundMedianVal <- quantile(bootmediansVector, probs = 0.975)
c(lowerBoundMedianVal, upperBoundMedianVal)

We are 95% confident that the true population median lies between the lower and upper bounds printed above, which are marked on the histogram below.

# creating a histogram

tibble(Median=bootmediansVector) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  geom_vline(xintercept= lowerBoundMedianVal,size=1,
             color="dodgerblue",linetype="solid")+
  geom_vline(xintercept= upperBoundMedianVal,size=1,
             color="dodgerblue",linetype="solid")+
  labs(y = "frequency",
       title = "Distribution of bootstrap medians") + 
  theme_bw()
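
As a cross-check, the boot package can produce the same nonparametric interval in a few lines; a minimal sketch, assuming the percentile method (not part of the original analysis):

# Nonparametric bootstrap of the sample median using the boot package
library(boot)
medianBoot <- boot(data = observedVal,
                   statistic = function(d, i) median(d[i]),
                   R = 10000)
boot.ci(medianBoot, type = "perc")  # percentile 95% confidence interval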

b. Parametric Bootstrapping

# Generating B = 10,000 samples, each of size n, from a normal distribution with the mean and standard deviation estimated from the observed data, and visualizing the resulting distribution of medians with a histogram.

BsamplesForParametric <- 10000
observedMaleCountVal <- election_census_merged_data_republican$male_count

# Initializing matrix for parametric bootstrap samples
parametricBootMatrix <- matrix(NA, nrow = n, ncol = BsamplesForParametric)
xBar <- mean(observedMaleCountVal)
s <- sd(observedMaleCountVal)

# Sampling a normal set of n values, B times
for(b in 1:BsamplesForParametric) {
  parametricBootMatrix[, b] <- rnorm(n = n, mean = xBar, sd = s)
}


# Initializing vector for parametric bootstrap medians
bootParametricMedians <- vector(length = BsamplesForParametric)

# Calculating the median of each simulated data set
for (b in 1:BsamplesForParametric){
  bootParametricMedians[b] <- median(parametricBootMatrix[, b])
}


# creating a histogram

tibble(Median=bootParametricMedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  labs(y = "frequency",
       title = "Distribution of Parametric bootstrap medians") + 
  theme_bw()

# Obtain a parametric bootstrap estimate of the standard error of the sample median.

SEparametricestimate <- sd(bootParametricMedians)
SEparametricestimate
## [1] 17906.09
# Obtaining a parametric bootstrap 95% confidence interval for the population median.

lowerBoundParmetricMed <- quantile(bootParametricMedians,probs=0.025)
upperBoundParametricMed <- quantile(bootParametricMedians,probs=0.975)


# creating a histogram

tibble(Median=bootParametricMedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  geom_vline(xintercept= lowerBoundParmetricMed,size=1,
             color="dodgerblue",linetype="solid")+
   geom_vline(xintercept= upperBoundParametricMed,size=1,
             color="dodgerblue",linetype="solid")+
  labs(y = "frequency",
       title = "Distribution of parametric bootstrap medians") + 
  theme_bw()

6. Conclusions / Main Takeaways

In conclusion, the analysis of presidential elections data at the county level in the USA provides valuable insights into voting trends, party voting patterns, and candidate performance. It also highlights the importance of search interest in specific topics during the election season and sheds light on income disparities across regions. These findings can be utilized to better understand the dynamics of the electoral process and contribute to discussions on improving voting practices and policies.

From the above data analysis, I also identified several ideas for future work:

Voter Turnout Analysis: Explore factors influencing voter turnout at the county level, and identify strategies to increase voter participation.

Temporal Analysis: Extend the analysis to include more recent election data beyond 2020, if available, to observe any emerging trends or changes in voting patterns over time.

Comparative Studies: Conduct comparative studies with elections from other countries to identify similarities and differences in voting behaviors and electoral processes.

Interactive Visualization: Create interactive data visualizations and dashboards to allow users to explore the data and draw their own insights.

Acknowledgment:

I would like to extend my sincere gratitude to Professor Andrew Dilernia for his invaluable guidance and support throughout this project. His expertise in data selection and research methodology has been instrumental in shaping the direction of this analysis, and I am grateful for the opportunity to learn under his mentorship.