Data Manipulation Final Project

Packages Required

list.of.packages <- c('plyr', 'ggplot2','reshape2','tidyr','dplyr','grid','gridExtra')
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,'Package'])]
if(length(new.packages)) install.packages(new.packages)
library(knitr) # To allow the use of code chunks in the Rmd File
library(tibble) # To allow creation of tibbles from data frames
library(readr) # To input data into R as tibbles
library(tidyr) # For Data Manipulation 
library(dplyr) # For primary Data Manipuation using the %>% operator
library(ggplot2) # To create visualizations
library(scales)# To manipulate the scales on each of the ggplot graphs
library(grid) # To manipulate details about the grid containing multiple scatter plots
library(gridExtra) # Manipulation of the grid
library(magrittr) # To use other facets & operators working with dplyr
library(DT) #To produce outputs in a readable format on the server side
library(cowplot) #An add-on to ggplot to help with secondary formatting of plots
#Data has been hosted on github in order to make code sharing easier and reproducible across users
url <- "https://raw.githubusercontent.com/singhster-05/electiondata/master/primary_results.csv"
election <- read_csv(url)
facts <- read_csv("https://raw.githubusercontent.com/singhster-05/electiondata/master/county_facts.csv")

Data Preparation

Data Dictionary

So the first thing we do is look at the structure of the data. This will give us an idea about the variables available in both the datasets and also the variables we will be using.

Vote Share Data

str(election)

## Classes 'tbl_df', 'tbl' and 'data.frame':    24611 obs. of  8 variables:
##  $ state             : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ state_abbreviation: chr  "AL" "AL" "AL" "AL" ...
##  $ county            : chr  "Autauga" "Autauga" "Baldwin" "Baldwin" ...
##  $ fips              : num  1001 1001 1003 1003 1005 ...
##  $ party             : chr  "Democrat" "Democrat" "Democrat" "Democrat" ...
##  $ candidate         : chr  "Bernie Sanders" "Hillary Clinton" "Bernie Sanders" "Hillary Clinton" ...
##  $ votes             : int  544 2387 2694 5290 222 2567 246 942 395 564 ...
##  $ fraction_votes    : num  0.182 0.8 0.329 0.647 0.078 0.906 0.197 0.755 0.386 0.551 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 8
##   .. ..$ state             : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ state_abbreviation: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ county            : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ fips              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ party             : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ candidate         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ votes             : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ fraction_votes    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

sapply(election, function(x)length(unique(x)))

##              state state_abbreviation             county 
##                 49                 49               2633 
##               fips              party          candidate 
##               4208                  2                 16 
##              votes     fraction_votes 
##               5548               1043

The variables present in the data are:

State: This contains 49 states of the 50 US states.
state_abbreviation: It is just the name of the states abbreviated.
County: This contains information about the counties that voted. Has data for a total of 2633 unique counties
FIPS: The FIPS county code are 5 digit codes which uniquely identified each county in the continental United States. To learn more about FIPS, use this link
Party: This is a character variable which talks about which party the candidate belonged to.
Candidate: This variable names the candidates for each of the two parties. We have 16 candidates in total.
Votes: This is an integer variable which tells us the number of votes each candidate scored amongst their party in each county in the primaries
Fraction_Votes: This is a secondary variable derived from the votes variable. It gives us a percentage of the number of votes obtained in each county

County Facts

The original county facts dataset contains 54 columns. Because we will only be using a certain set of demographic and census information from the dataset, I will not bore you with the details for each variable. More details about the county data can be found here

The variables which I will be using are:

FIPS: The FIPS county code are 5 digit codes which uniquely identified each county in the continental United States. To learn more about FIPS, use this link
area_name: Name of the county related to the FIPS code
state_abbreviation: It is just the name of the states abbreviated.
PST045214: Population of the county from the 2014 census
AGE295214: People under the age of 18 i.e. not eligible to vote
AGE775214: People over the age of 65
SEX255214: Percentage of female population
RHI125214: Percentage of white population
RHI225214: Percentage of black/african-american population
RHI725214: Percentage of Hispanic/Latino population
EDU635213: Percentage of high school graduates among people over the age of 25
EDU685213: Percentage of college graduates among people over the age of 25
INC110213: Median Household Income

Also for the ease of use and understanding I will be recoding these variables to easily understandable names. Lets extract these variables from the county facts dataset.

#Lets select and recode the variable names
county_facts <- facts %>% select(fips,county=area_name,state_abbreviation,population=PST045214,age_18=AGE295214,age_65=AGE775214,female_pop=SEX255214,white_pop=RHI125214,afr_pop=RHI225214,
hisp_pop=RHI725214,highschool_grads=EDU635213,college_grads=EDU685213,
median_income=INC110213,veterans=VET605213) %>% mutate(vet_per = veterans/population)

sapply(county_facts, function(x)length(unique(x)))

##               fips             county state_abbreviation 
##               3195               1929                 52 
##         population             age_18             age_65 
##               3133                222                253 
##         female_pop          white_pop            afr_pop 
##                164                574                482 
##           hisp_pop   highschool_grads      college_grads 
##                456                334                418 
##      median_income           veterans            vet_per 
##               3047               2584               3193

datatable(county_facts,class = 'compact')

Missing Values

Voter Data

sapply(election, function(x)sum(is.na(x)))

##              state state_abbreviation             county 
##                  0                  0                  0 
##               fips              party          candidate 
##                100                  0                  0 
##              votes     fraction_votes 
##                  0                  0

unique(election$county[is.na(election$fips)])

##  [1] "Belknap"      "Carroll"      "Cheshire"     "Coos"        
##  [5] "Grafton"      "Hillsborough" "Merrimack"    "Rockingham"  
##  [9] "Strafford"    "Sullivan"

Both the methods tell us that the votes data contains missing values only for the FIPS code variable. This isnt too important for us as we will not be using this variable in any subsequent analysis.

We also found out the counties for which the FIPS are unavailable. A quick google search tells us that each of these counties have FIPS codes but the data was not generated in this dataset.

County Facts Data

sapply(county_facts, function(x)sum(is.na(x)))

##               fips             county state_abbreviation 
##                  0                  0                 52 
##         population             age_18             age_65 
##                  0                  0                  0 
##         female_pop          white_pop            afr_pop 
##                  0                  0                  0 
##           hisp_pop   highschool_grads      college_grads 
##                  0                  0                  0 
##      median_income           veterans            vet_per 
##                  0                  0                  0

In the county facts dataset, only the state abbreviation column contains exactly 52 NA values. The entry has an NA value when the corresponding county entry is the aggregated state name instead of the actual county name. We can clearly see this here.

datatable(county_facts[is.na(county_facts$state_abbreviation),],class = 'compact')

Data Cleaning

From what we saw from the structure and summary of the data we know that the data is pretty tidy and doesnt need any cleaning. The only thing we will be doing is dropping the fips and the state abbreviation variable from the voter dataset.

Post that we will also join both the datasets based on county and state abbreviations, because certain county names are repeated across states. I had to clean up the county name column in the facts dataset as it contained a trailing " County" for every entry. I replaced that with a blank space and joined both the tables.
Post joining I found ~ 2300 NA values in the table, because the county facts were not available for certain counties. The values for those counties was replaced by the mean value for the county. Even after this imputation I found that 263 entries had NA values. The values for these could not be imputed with the Nation average because that would be incorrect to compare against. These data entries were deleted.

election <- election %>% select(-fips)
county_facts <- county_facts %>% select(-fips)
maps_base <- election

county_facts$county <- gsub(" County","",county_facts$county)

election <- election %>% group_by(state,state_abbreviation,county,party) %>% 
            summarize(candidate = candidate[which.max(votes)],
            fraction_votes = max(fraction_votes), votes = max(votes)) %>%
            filter(fraction_votes > 0.25)

election_final <- left_join(election,county_facts,by=c("state_abbreviation","county"))

impute <- function(x) replace(x, is.na(x), mean(x, na.rm = T))

election_final <- election_final %>% group_by(state_abbreviation) %>%
                  mutate(population=impute(population),age_18=impute(age_18),
                  age_65=impute(age_65),female_pop=impute(female_pop),
                  white_pop=impute(white_pop),afr_pop=impute(afr_pop),
                  hisp_pop=impute(hisp_pop),highschool_grads=impute(highschool_grads),
                  college_grads=impute(college_grads),median_income=impute(median_income),
                  vet_per=impute(vet_per)) %>% select(-veterans)

election_final <- election_final[complete.cases(election_final),]

datatable(election_final,class = 'compact')

Initial Data Summary

Each candidate who won the primary nomination won on an average 57.19% of the votes in each county.
On an average for a Democrat to win the nomination they needed 63.24% of the votes, while the Republican candidate needed 50.28% of votes for each county.

election_final %>% group_by(candidate) %>% filter(votes==max(votes)) %>% select(state,county,party,candidate,votes,fraction_votes) %>% arrange(desc(votes))

## Source: local data frame [6 x 6]
## Groups: candidate [6]
## 
##        state      county      party       candidate  votes fraction_votes
##        <chr>       <chr>      <chr>           <chr>  <int>          <dbl>
## 1 California Los Angeles   Democrat Hillary Clinton 590502          0.570
## 2 California Los Angeles Republican    Donald Trump 179130          0.698
## 3      Texas      Harris Republican        Ted Cruz 147721          0.453
## 4    Florida  Miami-Dade Republican     Marco Rubio 111898          0.627
## 5  Wisconsin        Dane   Democrat  Bernie Sanders 102585          0.626
## 6       Ohio    Franklin Republican     John Kasich 101217          0.637

From the table above we can see that both Donald Trump & Hillary Clinton won the county of Los Angeles which is the largest county in the United States.

ggplot(data = election_final, aes(fraction_votes)) + geom_histogram(binwidth=0.05,fill = "#56B4E9") + facet_wrap(~ candidate, nrow = 2) + ggtitle("Votes won in Winning Counties") + xlab("Fraction of Votes in Winning Counties") + ylab("Count")

election_final %>% group_by(party,candidate) %>% summarize(Counties_Won = unique(n()),Total_Votes = sum(votes)) %>% arrange(desc(Counties_Won))

## Source: local data frame [6 x 4]
## Groups: party [2]
## 
##        party       candidate Counties_Won Total_Votes
##        <chr>           <chr>        <int>       <int>
## 1 Republican    Donald Trump         2661    10448070
## 2   Democrat  Bernie Sanders         2191     3929359
## 3   Democrat Hillary Clinton         1772    12585115
## 4 Republican        Ted Cruz          608     2710399
## 5 Republican     John Kasich          157      890617
## 6 Republican     Marco Rubio           48      501394

From the histograms & the table shown it is clear that Donald Trump swept the Republican primaries winning almost 4 times the number of counties which Ted Cruz won. The Democratic primaries were however a much closer affair with Bernie Sanders edging out Hillary Clinton by ~400 counties, but Clinton winning the popular vote by a huge margin. In the next section we will discuss why this result came about and what could have been the deciding factors.

Exploratory Data Analysis

1. Voter demographics across candidates

ggplot(data=election_final,aes(candidate,median_income)) + geom_boxplot(aes(fill=candidate)) + stat_boxplot(geom = "errorbar") + ggtitle("Counties won against their Median Income") + xlab("Candidate") + ylab("Median Income of County Won") + guides(fill=F) + xlim("Bernie Sanders","Donald Trump","Hillary Clinton","John Kasich","Marco Rubio","Ted Cruz")

ggplot(data=election_final,aes(candidate,college_grads)) + geom_boxplot(aes(fill=candidate)) + stat_boxplot(geom = "errorbar") + ggtitle("Counties won against Proportion of College Graduates") + xlab("Candidate") + ylab("Proportion of College Graduates") + guides(fill=F) + xlim("Bernie Sanders","Donald Trump","Hillary Clinton","John Kasich","Marco Rubio","Ted Cruz")

ggplot(data=election_final,aes(candidate,afr_pop)) + geom_boxplot(aes(fill=candidate)) + stat_boxplot(geom = "errorbar") + ggtitle("Counties won against Proportion of African-American Population") + xlab("Candidate") + ylab("Proportion of African-American Population") + guides(fill=F) + xlim("Bernie Sanders","Donald Trump","Hillary Clinton","John Kasich","Marco Rubio","Ted Cruz")

From the three charts given above it is clear that Marco Rubio won the majority of the counties with a high median income & higher proportion of college graduates among the Republicans, while Hillary Clinton won the counties with a high African American population.

A simple chart to illustrate Hillary’s dominance with the African American Population

election_final %>% filter(party=="Democrat") %>% ggplot(aes(y=fraction_votes,x=afr_pop)) + geom_point(aes(colour=candidate),alpha=0.3) + xlab("Proportion of African American Population") + ylab("Fraction of Votes in County Won") + ggtitle("Votes won against African American Population of the County") + labs(colour = "Candidate Name") + scale_colour_manual(values = c("#D55E00", "#0072B2"))

Hillary was supposed to dominate counties with a higher proportion of women’s population. But did it really happen?

election_final %>% filter(party=="Democrat") %>% ggplot(aes(y=fraction_votes,x=female_pop)) + geom_point(aes(colour=candidate),alpha=0.3) + xlab("Proportion of Female Population") + ylab("Fraction of Votes in County Won") + ggtitle("Votes won against Female Population of the County") + labs(colour = "Candidate Name") + scale_colour_manual(values = c("#D55E00", "#0072B2"))

Even though she did win the majority of counties with high female population, the dominance wasn’t as strong as the media made it out to be.

Now let’s look at some patterns within the Republican candidates

election_final %>% filter(party=="Republican") %>% ggplot(aes(y=fraction_votes,x=college_grads)) + geom_point(aes(colour=candidate),alpha=0.3) + xlab("Proportion of College Graduates") + ylab("Fraction of Votes in County Won") + ggtitle("Votes won against proportion of College Graduates in the County") + labs(colour="Candidate Name") + scale_colour_manual(values = c("#D55E00", "#009E73","#999999","#0072B2","#E69F00"))

It is clearly visible that Donald Trump & Ted Cruz won the majority of the counties amongst the Republican candidates, although in terms of college graduates there is no discernible pattern.

During the entire campaign Donald Trump claimed that he had the support of the veterans and the Army serviceman. Lets see if the data reflects this claim.

election_final %>% filter(party=="Republican") %>% ggplot(aes(y=fraction_votes,x=vet_per)) + geom_point(aes(colour=candidate),alpha=0.3) + xlab("Proportion of Veterans") + ylab("Fraction of Votes in County Won") + ggtitle("Votes won against proportion of Veterans in the County") + labs(colour="Candidate Name") + scale_colour_manual(values = c("#D55E00", "#009E73","#999999","#0072B2","#E69F00"))

From the graph it is clearly evident that veterans did actually vote for Trump amongst the other Republican candidates.

Donald Trump enjoyed a lot of support amongst the white male demographic with the final results claiming that ~70% of them voted for him. Lets see if this pattern was discernible during the primaries as well.

election_final %>% filter(party=="Republican") %>% ggplot(aes(y=(100-female_pop),x=white_pop)) + geom_point(aes(colour=candidate),alpha=0.3) + xlab("Proportion of White Population") + ylab("Male Population in the County") + ggtitle("Performance in the White Male Demographic") + labs(colour="Candidate Name",fraction_votes="Fraction of Votes") + scale_colour_manual(values = c("#D55E00", "#009E73","#999999","#0072B2","#E69F00"))

election_final %>% filter(party=="Republican" & female_pop<50) %>% group_by(candidate) %>%
summarize(Counties = n()) %>% arrange(desc(Counties))

## # A tibble: 4 × 2
##      candidate Counties
##          <chr>    <int>
## 1 Donald Trump      646
## 2     Ted Cruz      234
## 3  John Kasich       12
## 4  Marco Rubio        4

It is clealy visible from the graph that Donald Trump dominated counties with a high white male population. From the table we can see that in counties with higher than 50% male population, Donald Trump outperformed other Republicans almost 3:1.

2. Mapping Candidate Performance across states

Because Maps are memory intensive and require time to render in R, we will only be mapping the performance of the winners of the two parties across the United States to compare their performance.

usa <- map_data("state")
format <- function(x) {
   substr(x, 1, 1) <- toupper(substr(x, 1, 1))
x
}

usa$state <- format(usa$region)
usa$region <- NULL

mapping <- function(x){
data_map <- maps_base %>% filter(candidate==x) %>% group_by(state) %>%
summarize(Votes=mean(fraction_votes))

data_map_fin <- inner_join(data_map,usa,by="state")

usa_base <- ggplot(data = usa, mapping = aes(x = long, y = lat, group = group)) + 
coord_fixed(1.3) + 
geom_polygon(color = "black", fill = "gray")
  
fin_map <- usa_base + geom_polygon(data = data_map_fin, aes(fill = Votes), color = "white")
        
ditch_the_axes <- theme(
  axis.text = element_blank(),
  axis.line = element_blank(),
  axis.ticks = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.title = element_blank()
)
fin_map + scale_fill_distiller(palette = "Blues", labels = percent,
breaks = pretty_breaks(n = 10)) +
guides(fill = guide_legend(reverse = TRUE)) + ditch_the_axes + 
ggtitle(paste0("Performance of ",x," across the USA in the primaries"))
}
mapping("Donald Trump")

mapping("Hillary Clinton")

There are quite a few interesting things to learn from these maps:

Donald Trump dominated states considered strongly Democratic during the Republican primaries winning ~65% of the votes in these states. What’s interesting is none of these states then effectively backed him in the presidential polls. He won on the back of a strong performance in the midwest states which he won by much smaller margins.
Hillary Clinton on the other hand showed a strong performance in the midwest states when competing against Bernie Sanders but these were the states that failed her in the presidential elections.

3. Correlation analysis between the Winning Candidates

hillary <- election_final %>% filter(candidate=="Hillary Clinton") %>% select(state_abbreviation,state,county,fraction_votes)
donald  <- election_final %>% filter(candidate=="Donald Trump") %>% select(state_abbreviation,state,county,fraction_votes)

hil_don <- inner_join(hillary,donald,by=c("county","state"))

func <- function(hil_don)
{
return(data.frame(COR = cor(hil_don$fraction_votes.x, hil_don$fraction_votes.y)))
}

corr <- hil_don %>% group_by(state) %>% summarize(corr=cor(fraction_votes.x,fraction_votes.y))

corr <- corr[complete.cases(corr),]

ggplot(corr, aes(x=reorder(state, corr), y=corr)) +
geom_bar(stat='identity',fill="#56B4E9") + coord_flip() + 
scale_y_continuous(breaks=c(-0.5,-0.25,0,0.25,0.5)) +
xlab("States") + ylab("Correlation in County Voting") + 
ggtitle("Relative Performance of Clinton & Trump across states")

From the chart we learn that Clinton & Trump performed similarly in counties in the states of Nevada, South Carolina, Indiana, etc. i.e. counties in the top half of the graph have the same opinion of them(positive correlation), while those in the lower half have a diverging opinion(negative correlation).

Summary

Problem Statement

The Primary results from the American elections were analyzed in order to understand the variables which impact the results.

Data & EDA

Data was downloaded from an active kaggle dataset which was originally obtained by CNN and scraped off their website. I carried out primarily graphical analysis in order to understand the relationship between certain demographic and census variables with the voting patterns in a particular county.

Insights

On the back of the insights from the maps it is interesting to note that America votes for candidates in the primaries, but during the presidential elections it primarily votes for the party it leans towards even though it may not have supported the candidate in the past.
The box plots and the scatter plots clearly show that Trump enjoyed immense support amongst the white, male population while he floundered with the college graduates and the high income demographic.

election_final %>% filter(party=="Republican" & college_grads>25) %>% group_by(candidate) %>% summarize(College_Grads_Counties=unique(n()),Total_Votes=sum(votes)) %>% arrange(desc(College_Grads_Counties))

## # A tibble: 4 × 3
##      candidate College_Grads_Counties Total_Votes
##          <chr>                  <int>       <int>
## 1 Donald Trump                   1027     5305614
## 2     Ted Cruz                    120     1622854
## 3  John Kasich                    116      566053
## 4  Marco Rubio                     35      477801

#In counties with college graduate proportion being greater than 25%, Donald Trump outperforms Ted Cruz almost 3:1 in terms of total votes won.

Hillary Clinton on the other hand enjoyed overwhelming support within the African American Community, while Bernie Sanders enjoyed support from college & high school graduates.

election_final %>% filter(party=="Democrat" & afr_pop>25) %>% group_by(candidate) %>% summarize(African_American_Counties=unique(n()))

## # A tibble: 1 × 2
##         candidate African_American_Counties
##             <chr>                     <int>
## 1 Hillary Clinton                       344

#The table clearly shows that of counties having greater than 25% African American population, Bernie Sanders did not win a single one while Hillary Clinton won all 344.

election_final %>% filter(party=="Democrat" & college_grads>25) %>% group_by(candidate) %>% summarize(College_Grads_Counties=unique(n()),Total_Votes=sum(votes))

## # A tibble: 2 × 3
##         candidate College_Grads_Counties Total_Votes
##             <chr>                  <int>       <int>
## 1  Bernie Sanders                   1289     2588008
## 2 Hillary Clinton                    460     8290332

#It is interesting to note that even though Bernie Sanders won almost 3 times as many counties with greater than 25% of the proportion being college graduates, Hillary Clinton won the important ones winning almost 4 times the number of votes.