A lot has been said about the Presidential elections in America over the last one and a half years, seriously we should spend more time discussing governance than a road show for the masses, anyways I digress. In this analysis I have a dataset containing the results of the election primaries with vote share for each candidate across counties. Apart from this I have another dataset which has facts about each county like their population, median income, demographics, etc. We will use these datasets in conjunction to understand how the American people voted in the primaries and try to understand why they voted for a particular candidate.
This data is obtained from an open competition about the elections over on kaggle There is a potential for a lot of interesting insights and correlations in the data, and we will look to explore them all. Here is a snapshot of a few ideas we will be exploring:
list.of.packages <- c('plyr', 'ggplot2','reshape2','tidyr','dplyr','grid','gridExtra')
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,'Package'])]
if(length(new.packages)) install.packages(new.packages)
library(knitr) # To allow the use of code chunks in the Rmd File
library(tibble) # To allow creation of tibbles from data frames
library(readr) # To input data into R as tibbles
library(tidyr) # For Data Manipulation
library(dplyr) # For primary Data Manipuation using the %>% operator
library(ggplot2) # To create visualizations
library(scales)# To manipulate the scales on each of the ggplot graphs
library(grid) # To manipulate details about the grid containing multiple scatter plots
library(gridExtra) # Manipulation of the grid
library(magrittr) # To use other facets & operators working with dplyr
library(DT) #To produce outputs in a readable format on the server side
library(cowplot) #An add-on to ggplot to help with secondary formatting of plots
#Data has been hosted on github in order to make code sharing easier and reproducible across users
url <- "https://raw.githubusercontent.com/singhster-05/electiondata/master/primary_results.csv"
election <- read_csv(url)
facts <- read_csv("https://raw.githubusercontent.com/singhster-05/electiondata/master/county_facts.csv")
So the first thing we do is look at the structure of the data. This will give us an idea about the variables available in both the datasets and also the variables we will be using.
The original county facts dataset contains 54 columns. Because we will only be using a certain set of demographic and census information from the dataset, I will not bore you with the details for each variable. More details about the county data can be found here
The variables which I will be using are:
Also for the ease of use and understanding I will be recoding these variables to easily understandable names. Lets extract these variables from the county facts dataset.
#Lets select and recode the variable names
county_facts <- facts %>% select(fips,county=area_name,state_abbreviation,population=PST045214,age_18=AGE295214,age_65=AGE775214,female_pop=SEX255214,white_pop=RHI125214,afr_pop=RHI225214,
hisp_pop=RHI725214,highschool_grads=EDU635213,college_grads=EDU685213,
median_income=INC110213,veterans=VET605213) %>% mutate(vet_per = veterans/population)
sapply(county_facts, function(x)length(unique(x)))
## fips county state_abbreviation
## 3195 1929 52
## population age_18 age_65
## 3133 222 253
## female_pop white_pop afr_pop
## 164 574 482
## hisp_pop highschool_grads college_grads
## 456 334 418
## median_income veterans vet_per
## 3047 2584 3193
datatable(county_facts,class = 'compact')
sapply(election, function(x)sum(is.na(x)))
## state state_abbreviation county
## 0 0 0
## fips party candidate
## 100 0 0
## votes fraction_votes
## 0 0
unique(election$county[is.na(election$fips)])
## [1] "Belknap" "Carroll" "Cheshire" "Coos"
## [5] "Grafton" "Hillsborough" "Merrimack" "Rockingham"
## [9] "Strafford" "Sullivan"
We also found out the counties for which the FIPS are unavailable. A quick google search tells us that each of these counties have FIPS codes but the data was not generated in this dataset.
sapply(county_facts, function(x)sum(is.na(x)))
## fips county state_abbreviation
## 0 0 52
## population age_18 age_65
## 0 0 0
## female_pop white_pop afr_pop
## 0 0 0
## hisp_pop highschool_grads college_grads
## 0 0 0
## median_income veterans vet_per
## 0 0 0
datatable(county_facts[is.na(county_facts$state_abbreviation),],class = 'compact')
Post that we will also join both the datasets based on county and state abbreviations, because certain county names are repeated across states. I had to clean up the county name column in the facts dataset as it contained a trailing " County" for every entry. I replaced that with a blank space and joined both the tables.
Post joining I found ~ 2300 NA values in the table, because the county facts were not available for certain counties. The values for those counties was replaced by the mean value for the county. Even after this imputation I found that 263 entries had NA values. The values for these could not be imputed with the Nation average because that would be incorrect to compare against. These data entries were deleted.
election <- election %>% select(-fips)
county_facts <- county_facts %>% select(-fips)
maps_base <- election
county_facts$county <- gsub(" County","",county_facts$county)
election <- election %>% group_by(state,state_abbreviation,county,party) %>%
summarize(candidate = candidate[which.max(votes)],
fraction_votes = max(fraction_votes), votes = max(votes)) %>%
filter(fraction_votes > 0.25)
election_final <- left_join(election,county_facts,by=c("state_abbreviation","county"))
impute <- function(x) replace(x, is.na(x), mean(x, na.rm = T))
election_final <- election_final %>% group_by(state_abbreviation) %>%
mutate(population=impute(population),age_18=impute(age_18),
age_65=impute(age_65),female_pop=impute(female_pop),
white_pop=impute(white_pop),afr_pop=impute(afr_pop),
hisp_pop=impute(hisp_pop),highschool_grads=impute(highschool_grads),
college_grads=impute(college_grads),median_income=impute(median_income),
vet_per=impute(vet_per)) %>% select(-veterans)
election_final <- election_final[complete.cases(election_final),]
datatable(election_final,class = 'compact')
Each candidate who won the primary nomination won on an average 57.19% of the votes in each county.
On an average for a Democrat to win the nomination they needed 63.24% of the votes, while the Republican candidate needed 50.28% of votes for each county.
election_final %>% group_by(candidate) %>% filter(votes==max(votes)) %>% select(state,county,party,candidate,votes,fraction_votes) %>% arrange(desc(votes))
## Source: local data frame [6 x 6]
## Groups: candidate [6]
##
## state county party candidate votes fraction_votes
## <chr> <chr> <chr> <chr> <int> <dbl>
## 1 California Los Angeles Democrat Hillary Clinton 590502 0.570
## 2 California Los Angeles Republican Donald Trump 179130 0.698
## 3 Texas Harris Republican Ted Cruz 147721 0.453
## 4 Florida Miami-Dade Republican Marco Rubio 111898 0.627
## 5 Wisconsin Dane Democrat Bernie Sanders 102585 0.626
## 6 Ohio Franklin Republican John Kasich 101217 0.637
ggplot(data = election_final, aes(fraction_votes)) + geom_histogram(binwidth=0.05,fill = "#56B4E9") + facet_wrap(~ candidate, nrow = 2) + ggtitle("Votes won in Winning Counties") + xlab("Fraction of Votes in Winning Counties") + ylab("Count")
election_final %>% group_by(party,candidate) %>% summarize(Counties_Won = unique(n()),Total_Votes = sum(votes)) %>% arrange(desc(Counties_Won))
## Source: local data frame [6 x 4]
## Groups: party [2]
##
## party candidate Counties_Won Total_Votes
## <chr> <chr> <int> <int>
## 1 Republican Donald Trump 2661 10448070
## 2 Democrat Bernie Sanders 2191 3929359
## 3 Democrat Hillary Clinton 1772 12585115
## 4 Republican Ted Cruz 608 2710399
## 5 Republican John Kasich 157 890617
## 6 Republican Marco Rubio 48 501394
From the histograms & the table shown it is clear that Donald Trump swept the Republican primaries winning almost 4 times the number of counties which Ted Cruz won. The Democratic primaries were however a much closer affair with Bernie Sanders edging out Hillary Clinton by ~400 counties, but Clinton winning the popular vote by a huge margin. In the next section we will discuss why this result came about and what could have been the deciding factors.
1. Voter demographics across candidates
ggplot(data=election_final,aes(candidate,median_income)) + geom_boxplot(aes(fill=candidate)) + stat_boxplot(geom = "errorbar") + ggtitle("Counties won against their Median Income") + xlab("Candidate") + ylab("Median Income of County Won") + guides(fill=F) + xlim("Bernie Sanders","Donald Trump","Hillary Clinton","John Kasich","Marco Rubio","Ted Cruz")
ggplot(data=election_final,aes(candidate,college_grads)) + geom_boxplot(aes(fill=candidate)) + stat_boxplot(geom = "errorbar") + ggtitle("Counties won against Proportion of College Graduates") + xlab("Candidate") + ylab("Proportion of College Graduates") + guides(fill=F) + xlim("Bernie Sanders","Donald Trump","Hillary Clinton","John Kasich","Marco Rubio","Ted Cruz")
ggplot(data=election_final,aes(candidate,afr_pop)) + geom_boxplot(aes(fill=candidate)) + stat_boxplot(geom = "errorbar") + ggtitle("Counties won against Proportion of African-American Population") + xlab("Candidate") + ylab("Proportion of African-American Population") + guides(fill=F) + xlim("Bernie Sanders","Donald Trump","Hillary Clinton","John Kasich","Marco Rubio","Ted Cruz")
From the three charts given above it is clear that Marco Rubio won the majority of the counties with a high median income & higher proportion of college graduates among the Republicans, while Hillary Clinton won the counties with a high African American population.
A simple chart to illustrate Hillary’s dominance with the African American Population
election_final %>% filter(party=="Democrat") %>% ggplot(aes(y=fraction_votes,x=afr_pop)) + geom_point(aes(colour=candidate),alpha=0.3) + xlab("Proportion of African American Population") + ylab("Fraction of Votes in County Won") + ggtitle("Votes won against African American Population of the County") + labs(colour = "Candidate Name") + scale_colour_manual(values = c("#D55E00", "#0072B2"))
Hillary was supposed to dominate counties with a higher proportion of women’s population. But did it really happen?
election_final %>% filter(party=="Democrat") %>% ggplot(aes(y=fraction_votes,x=female_pop)) + geom_point(aes(colour=candidate),alpha=0.3) + xlab("Proportion of Female Population") + ylab("Fraction of Votes in County Won") + ggtitle("Votes won against Female Population of the County") + labs(colour = "Candidate Name") + scale_colour_manual(values = c("#D55E00", "#0072B2"))
Even though she did win the majority of counties with high female population, the dominance wasn’t as strong as the media made it out to be.
Now let’s look at some patterns within the Republican candidates
election_final %>% filter(party=="Republican") %>% ggplot(aes(y=fraction_votes,x=college_grads)) + geom_point(aes(colour=candidate),alpha=0.3) + xlab("Proportion of College Graduates") + ylab("Fraction of Votes in County Won") + ggtitle("Votes won against proportion of College Graduates in the County") + labs(colour="Candidate Name") + scale_colour_manual(values = c("#D55E00", "#009E73","#999999","#0072B2","#E69F00"))
It is clearly visible that Donald Trump & Ted Cruz won the majority of the counties amongst the Republican candidates, although in terms of college graduates there is no discernible pattern.
During the entire campaign Donald Trump claimed that he had the support of the veterans and the Army serviceman. Lets see if the data reflects this claim.
election_final %>% filter(party=="Republican") %>% ggplot(aes(y=fraction_votes,x=vet_per)) + geom_point(aes(colour=candidate),alpha=0.3) + xlab("Proportion of Veterans") + ylab("Fraction of Votes in County Won") + ggtitle("Votes won against proportion of Veterans in the County") + labs(colour="Candidate Name") + scale_colour_manual(values = c("#D55E00", "#009E73","#999999","#0072B2","#E69F00"))
From the graph it is clearly evident that veterans did actually vote for Trump amongst the other Republican candidates.
Donald Trump enjoyed a lot of support amongst the white male demographic with the final results claiming that ~70% of them voted for him. Lets see if this pattern was discernible during the primaries as well.
election_final %>% filter(party=="Republican") %>% ggplot(aes(y=(100-female_pop),x=white_pop)) + geom_point(aes(colour=candidate),alpha=0.3) + xlab("Proportion of White Population") + ylab("Male Population in the County") + ggtitle("Performance in the White Male Demographic") + labs(colour="Candidate Name",fraction_votes="Fraction of Votes") + scale_colour_manual(values = c("#D55E00", "#009E73","#999999","#0072B2","#E69F00"))
election_final %>% filter(party=="Republican" & female_pop<50) %>% group_by(candidate) %>%
summarize(Counties = n()) %>% arrange(desc(Counties))
## # A tibble: 4 × 2
## candidate Counties
## <chr> <int>
## 1 Donald Trump 646
## 2 Ted Cruz 234
## 3 John Kasich 12
## 4 Marco Rubio 4
It is clealy visible from the graph that Donald Trump dominated counties with a high white male population. From the table we can see that in counties with higher than 50% male population, Donald Trump outperformed other Republicans almost 3:1.
2. Mapping Candidate Performance across states
Because Maps are memory intensive and require time to render in R, we will only be mapping the performance of the winners of the two parties across the United States to compare their performance.
usa <- map_data("state")
format <- function(x) {
substr(x, 1, 1) <- toupper(substr(x, 1, 1))
x
}
usa$state <- format(usa$region)
usa$region <- NULL
mapping <- function(x){
data_map <- maps_base %>% filter(candidate==x) %>% group_by(state) %>%
summarize(Votes=mean(fraction_votes))
data_map_fin <- inner_join(data_map,usa,by="state")
usa_base <- ggplot(data = usa, mapping = aes(x = long, y = lat, group = group)) +
coord_fixed(1.3) +
geom_polygon(color = "black", fill = "gray")
fin_map <- usa_base + geom_polygon(data = data_map_fin, aes(fill = Votes), color = "white")
ditch_the_axes <- theme(
axis.text = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.title = element_blank()
)
fin_map + scale_fill_distiller(palette = "Blues", labels = percent,
breaks = pretty_breaks(n = 10)) +
guides(fill = guide_legend(reverse = TRUE)) + ditch_the_axes +
ggtitle(paste0("Performance of ",x," across the USA in the primaries"))
}
mapping("Donald Trump")
mapping("Hillary Clinton")
There are quite a few interesting things to learn from these maps:
Donald Trump dominated states considered strongly Democratic during the Republican primaries winning ~65% of the votes in these states. What’s interesting is none of these states then effectively backed him in the presidential polls. He won on the back of a strong performance in the midwest states which he won by much smaller margins.
Hillary Clinton on the other hand showed a strong performance in the midwest states when competing against Bernie Sanders but these were the states that failed her in the presidential elections.
3. Correlation analysis between the Winning Candidates
hillary <- election_final %>% filter(candidate=="Hillary Clinton") %>% select(state_abbreviation,state,county,fraction_votes)
donald <- election_final %>% filter(candidate=="Donald Trump") %>% select(state_abbreviation,state,county,fraction_votes)
hil_don <- inner_join(hillary,donald,by=c("county","state"))
func <- function(hil_don)
{
return(data.frame(COR = cor(hil_don$fraction_votes.x, hil_don$fraction_votes.y)))
}
corr <- hil_don %>% group_by(state) %>% summarize(corr=cor(fraction_votes.x,fraction_votes.y))
corr <- corr[complete.cases(corr),]
ggplot(corr, aes(x=reorder(state, corr), y=corr)) +
geom_bar(stat='identity',fill="#56B4E9") + coord_flip() +
scale_y_continuous(breaks=c(-0.5,-0.25,0,0.25,0.5)) +
xlab("States") + ylab("Correlation in County Voting") +
ggtitle("Relative Performance of Clinton & Trump across states")
From the chart we learn that Clinton & Trump performed similarly in counties in the states of Nevada, South Carolina, Indiana, etc. i.e. counties in the top half of the graph have the same opinion of them(positive correlation), while those in the lower half have a diverging opinion(negative correlation).
The Primary results from the American elections were analyzed in order to understand the variables which impact the results.
Data was downloaded from an active kaggle dataset which was originally obtained by CNN and scraped off their website. I carried out primarily graphical analysis in order to understand the relationship between certain demographic and census variables with the voting patterns in a particular county.
election_final %>% filter(party=="Republican" & college_grads>25) %>% group_by(candidate) %>% summarize(College_Grads_Counties=unique(n()),Total_Votes=sum(votes)) %>% arrange(desc(College_Grads_Counties))
## # A tibble: 4 × 3
## candidate College_Grads_Counties Total_Votes
## <chr> <int> <int>
## 1 Donald Trump 1027 5305614
## 2 Ted Cruz 120 1622854
## 3 John Kasich 116 566053
## 4 Marco Rubio 35 477801
#In counties with college graduate proportion being greater than 25%, Donald Trump outperforms Ted Cruz almost 3:1 in terms of total votes won.
election_final %>% filter(party=="Democrat" & afr_pop>25) %>% group_by(candidate) %>% summarize(African_American_Counties=unique(n()))
## # A tibble: 1 × 2
## candidate African_American_Counties
## <chr> <int>
## 1 Hillary Clinton 344
#The table clearly shows that of counties having greater than 25% African American population, Bernie Sanders did not win a single one while Hillary Clinton won all 344.
election_final %>% filter(party=="Democrat" & college_grads>25) %>% group_by(candidate) %>% summarize(College_Grads_Counties=unique(n()),Total_Votes=sum(votes))
## # A tibble: 2 × 3
## candidate College_Grads_Counties Total_Votes
## <chr> <int> <int>
## 1 Bernie Sanders 1289 2588008
## 2 Hillary Clinton 460 8290332
#It is interesting to note that even though Bernie Sanders won almost 3 times as many counties with greater than 25% of the proportion being college graduates, Hillary Clinton won the important ones winning almost 4 times the number of votes.