Donald Trump won the presidential election held on November 8, 2016, defeating Hillary Clinton. We use data from the 2015 American Community Survey, which contains census data for every county in the USA, together with the winning presidential candidate in each county. The data covers ethnicity, income, age, profession, modes of transport, languages spoken and other census measures.
Our primary goal is to identify whether the percentage of English-speaking population in a county had a direct influence on the winner of the election in that county.
Based on our results, we will also try to determine whether ethnicity played a major role in deciding the winner of the election in each county.
Understanding the Question
Trump’s pro-American policies won him support from many Americans but also drew opposition from parts of the population belonging to different cultures. Our aim in this exercise is to determine whether this support or opposition from various communities was reflected in the voting pattern, or was just a media-hyped phenomenon. We want to understand which counties in the USA have a high percentage of people who do not have English as their first language, and in turn get an idea of the voting preferences of populations with ethnicities other than white American.
The Plan
Future Scope
Once we reach a conclusion and establish a voting pattern based on language spoken and ethnicity, we can fit a regression model and use the analysis to predict future election outcomes, keeping in mind that other factors will always play a role in elections. Even so, we can at least get an overview of how ethnicity can play a role in the selection of a presidential candidate.
library(readr) # To import csv files
library(rmarkdown) # to create HTML tables
library(tidyr) # for tidying the data
library(dplyr) # for data transformation
library(ggplot2) # for visualization of data
The data originates from the 2015 American Community Survey, which comprises multiple datasets with census information about counties. We concentrate on the county dataset, which has population information for each county and its presidential winner, and the acs2015_S1601_percent-speak-onlyenglish-otherthanenglish dataset, which contains the percentage of the population having English as their first language.
# Import Data
county_data <- read_csv("acs2015_county_data.csv")
dim(county_data)
## [1] 3142 38
language_data <- read_csv("acs2015_S1601_percent-speak-onlyenglish-otherthanenglish.csv")
dim(language_data)
## [1] 3142 5
acs2015_county_data has 38 columns and 3142 observations, one for each county.
acs2015_S1601_percent-speak-onlyenglish-otherthanenglish has 5 columns and 3142 observations, one for each county. We observe that there are missing values in the columns Income, IncomeErr and ChildPoverty; however, these columns are outside the scope of our analysis and will not be included in the final joined dataset.
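The missing-value observation above can be reproduced with a per-column count of NA entries. The following is a minimal sketch on a made-up stand-in for county_data (the object toy_county and all of its values are hypothetical); on the real data the same call would be applied to county_data.

```r
# Toy stand-in for county_data; every value here is made up for illustration.
toy_county <- data.frame(
  CensusId     = c(1001, 1003, 1005),
  Income       = c(51000, NA, 33000),
  IncomeErr    = c(2400, 1300, NA),
  ChildPoverty = c(18.6, NA, 45.3)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy_county))
print(na_counts)

# Names of the columns that contain at least one missing value
cols_with_na <- names(na_counts)[na_counts > 0]
print(cols_with_na)
```

Columns that appear in cols_with_na and are not needed downstream can simply be left out of the final select().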
Joining the datasets
We join the two datasets on their common county identifier, which is CensusId in the county data and Id2 in the language data. We rename the last column of the second dataset, Other_Than_English, to NonEnglish.
Selecting the required columns
After joining the datasets we keep only the columns that are necessary for our analysis and relevant to our question. We create a new dataset final_data that includes County (combined with State), TotalPop, Hispanic, White, Black, Native, Asian, Pacific, PerCitizen, PerNonCitizen, English_Only, NonEnglish and PresWinner. We keep these columns to know what percentage of the population belongs to each ethnicity, which will help us answer our question.
Creating customized columns
We create two new columns, PerCitizen and PerNonCitizen, to express citizens and non-citizens as a percentage of each county’s total population. PerNonCitizen is computed as 100 - PerCitizen, so it covers everyone not counted in the Citizen column. Percentages make for clearer visualizations and easier interpretation than raw counts.
final_data <- county_data %>%
  inner_join(language_data, by = c("CensusId" = "Id2")) %>% # Joining the two datasets
  rename(NonEnglish = Other_Than_English) %>% # Rename Other_Than_English to NonEnglish
  mutate(PerCitizen = round((Citizen/TotalPop)*100, 2), # Create customized variable Citizens Percentage
         PerNonCitizen = 100 - PerCitizen) %>% # Create customized variable NonCitizens percentage
  unite(County, County, State, sep = ",") %>% # Join County and State.
  select(County, TotalPop, Hispanic, White, Black, Native, # Select variables required for our analysis
         Asian, Pacific, PerCitizen, PerNonCitizen,
         English_Only, NonEnglish, PresWinner) %>%
  arrange(desc(English_Only)) # Arrange counties with largest English Speaking population
paged_table(final_data)
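To make the join and the derived columns easier to follow, here is a small sketch of the same steps on two made-up tables. The objects toy_county, toy_language, toy_final and unmatched are hypothetical, and the values are invented; the anti_join() check is an addition that was not part of the original analysis and simply lists counties without a partner row.

```r
library(dplyr)
library(tidyr)

# Made-up miniature versions of the two datasets
toy_county <- tibble(
  CensusId = c(1001, 1003),
  State    = c("Alabama", "Alabama"),
  County   = c("Autauga", "Baldwin"),
  TotalPop = c(1000, 2000),
  Citizen  = c(750, 1400)
)
toy_language <- tibble(
  Id2                = c(1001, 1003),
  English_Only       = c(96.5, 94.0),
  Other_Than_English = c(3.5, 6.0)
)

# Same steps as the main pipeline: join, rename, derive percentages, unite
toy_final <- toy_county %>%
  inner_join(toy_language, by = c("CensusId" = "Id2")) %>%
  rename(NonEnglish = Other_Than_English) %>%
  mutate(PerCitizen = round((Citizen / TotalPop) * 100, 2),
         PerNonCitizen = 100 - PerCitizen) %>%
  unite(County, County, State, sep = ",")

# Extra sanity check: county rows with no match in the language table
unmatched <- anti_join(toy_county, toy_language, by = c("CensusId" = "Id2"))

print(toy_final)
print(nrow(unmatched))
```

If unmatched has zero rows and the joined table has as many rows as the county table, the inner join has not silently dropped any county.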
Structure and summary of the final dataset
summary(final_data)
## County TotalPop Hispanic White
## Length:3142 Min. : 85 Min. : 0.000 Min. : 0.90
## Class :character 1st Qu.: 11028 1st Qu.: 1.900 1st Qu.:65.60
## Mode :character Median : 25768 Median : 3.700 Median :84.60
## Mean : 100737 Mean : 8.826 Mean :77.28
## 3rd Qu.: 67552 3rd Qu.: 9.000 3rd Qu.:93.30
## Max. :10038388 Max. :98.700 Max. :99.80
## Black Native Asian Pacific
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00000
## 1st Qu.: 0.600 1st Qu.: 0.100 1st Qu.: 0.200 1st Qu.: 0.00000
## Median : 2.100 Median : 0.300 Median : 0.500 Median : 0.00000
## Mean : 8.879 Mean : 1.766 Mean : 1.258 Mean : 0.08475
## 3rd Qu.:10.175 3rd Qu.: 0.600 3rd Qu.: 1.200 3rd Qu.: 0.00000
## Max. :85.900 Max. :92.100 Max. :41.600 Max. :35.30000
## PerCitizen PerNonCitizen English_Only NonEnglish
## Min. :44.25 Min. : 5.88 Min. : 3.90 Min. : 0.00
## 1st Qu.:72.86 1st Qu.:22.10 1st Qu.: 89.50 1st Qu.: 2.80
## Median :75.73 Median :24.27 Median : 94.90 Median : 5.10
## Mean :74.69 Mean :25.31 Mean : 90.64 Mean : 9.36
## 3rd Qu.:77.90 3rd Qu.:27.14 3rd Qu.: 97.20 3rd Qu.:10.50
## Max. :94.12 Max. :55.75 Max. :100.00 Max. :96.10
## PresWinner
## Length:3142
## Class :character
## Mode :character
##
##
##
We observe that PresWinner is a character variable but should be a factor, as there are only two values, Trump and Clinton; we convert it with as.factor() before fitting the model. The summary shows that all columns in the final dataset are numeric except County and PresWinner. There are no empty or null values in our dataset.
Exploratory data analysis is used to see how the data is distributed and whether there are any outliers. Here, the exercise is to identify whether the language spoken by the majority of the population influences the presidential winner in a county. We also explore other factors, such as ethnicity and citizenship, to assess whether any particular culture or ethnicity shows a preference for either candidate.
We use boxplots and bar charts (referred to as histograms below) to visualize the data. These help identify the distribution of the data and any outliers.
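As a quick illustration of the factor conversion, the sketch below uses a made-up character vector and assumes the labels are exactly "Trump" and "Clinton", as the text describes. It shows that as.factor() orders the levels alphabetically, which matters later because a binomial glm() treats the first level as the reference outcome.

```r
# Made-up winner labels; the real column may be spelled differently
winner <- c("Trump", "Clinton", "Trump")

winner_f <- as.factor(winner)

# Levels are sorted alphabetically, so "Clinton" comes first
print(levels(winner_f))

# Frequency of each level
winner_counts <- table(winner_f)
print(winner_counts)
```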
# Boxplot
final_data %>%
  ggplot(aes(y = English_Only, x = PresWinner, fill = PresWinner)) + # fill is mapped so that scale_fill_manual() takes effect
  geom_boxplot() +
  xlab("Presidential Winner") + ylab("Percentage of Population Speaking English") +
  scale_fill_manual(values = c("Blue","Red")) +
  labs(title = "2016 Presidential Winner By Counties",
       subtitle = "Boxplots showing distribution of English Speaking population by counties",
       fill = "Presidential Winner")
We observe that both presidential candidates win counties with a high percentage of English-speaking population. Clinton wins counties across a wide range of English-speaking percentages, but Trump wins more of the counties where English is the major language. We also observe outliers in both cases, meaning that some counties with a large non-English-speaking population were won by Trump and some by Clinton. As discussed, other factors may play an important role in deciding the winner of a particular county.
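The boxplot reading can be backed up with a numeric summary per winner. This is a minimal sketch on made-up rows (the object toy and its values are hypothetical); on the real data the same group_by() and summarise() calls would be applied to final_data.

```r
library(dplyr)

# Made-up county rows for illustration only
toy <- tibble(
  PresWinner   = c("Clinton", "Clinton", "Clinton", "Trump", "Trump", "Trump"),
  English_Only = c(60, 85, 95, 90, 96, 99)
)

# Median and spread of the English-only percentage for each winner
by_winner <- toy %>%
  group_by(PresWinner) %>%
  summarise(n              = n(),
            median_english = median(English_Only),
            iqr_english    = IQR(English_Only))

print(by_winner)
```

A wider interquartile range for one candidate corresponds to the taller box in the plot.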
Histograms
As we are trying to find out whether language, citizenship or ethnicity are significant factors in deciding the presidential winner of each county, we plot a histogram of presidential winner against each of these variables for the most relevant counties. To avoid overcrowding the plots and keep them easy to read, we show only the top 40 counties for each variable, that is, the counties with the highest percentage for that variable.
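One detail worth keeping in mind when selecting a "top 40" is how ties are handled. The sketch below, on made-up rows, shows that dplyr's top_n() keeps all tied rows, so it can return more rows than requested, while slice_max() with with_ties = FALSE returns exactly the requested number. The object names are hypothetical and the example uses a top 2 only to keep it short.

```r
library(dplyr)

# Made-up rows; three counties are tied at 100
toy <- tibble(
  County       = paste0("County", 1:6),
  English_Only = c(100, 100, 100, 99, 98, 97),
  PresWinner   = c("Trump", "Trump", "Clinton", "Trump", "Trump", "Clinton")
)

# top_n() keeps ties, so asking for 2 rows returns all 3 tied rows
n_with_ties <- toy %>% top_n(2, English_Only) %>% nrow()

# slice_max() can be told to break ties and return exactly 2 rows
top2 <- toy %>% slice_max(English_Only, n = 2, with_ties = FALSE)

print(n_with_ties)
print(top2)
```

When many counties sit at or near 100%, this difference affects how many bars appear in a plot.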
# Histogram for Presidential Winner for Top 40 Counties with English Speaking Population
final_data %>%
  top_n(40, English_Only) %>%
  ggplot(aes(fill = PresWinner, y = English_Only, x = County)) +
  geom_bar(stat = "identity") +
  xlab("Top 40 Counties") + ylab("English Speaking population in percentage") +
  scale_fill_manual(values = c("Blue","Red")) +
  labs(title = "Figure 1a",
       subtitle = "Top 40 counties with English Speaking Population",
       fill = "Presidential Winner") +
  coord_flip()
# Histogram for Presidential Winner for Top 40 Counties with Non-English Speaking Population
final_data %>%
  top_n(40, NonEnglish) %>%
  ggplot(aes(fill = PresWinner, y = NonEnglish, x = County)) +
  geom_bar(stat = "identity") +
  xlab("Top 40 Counties") + ylab("Non-English Speaking population in percentage") +
  scale_fill_manual(values = c("Blue","Red")) +
  labs(title = "Figure 1b",
       subtitle = "Top 40 counties with Non-English Speaking population",
       fill = "Presidential Winner") +
  coord_flip()
Observation
The two histograms above show the top 40 counties with the highest percentage of English-speaking and non-English-speaking population respectively. In Figure 1a, all of the top 40 counties have close to 100% of the population speaking primarily English, and only two of them, Jefferson Davis and Jefferson, have Hillary Clinton as their presidential winner. On the other hand, Figure 1b shows that only Imperial, Maverick and Hudspeth counties among the top 40 non-English-speaking counties have Donald Trump as their winner. These counties are the outliers seen in the boxplot plotted earlier.
Thus we can conclude that, at the county level, whether English was the first language of most residents was strongly associated with which candidate won. And since 2293 of the 3142 counties have an English-speaking population of more than 90%, Trump won significantly more counties than Clinton.
Since the vast majority of the non-English-speaking population is of a different ethnicity, we will also look at some visualizations of the ethnic composition of each county to further support our finding that language influenced voters’ preferences. Before that, let us check whether citizenship influenced voters, since it also ties in with immigrants of different ethnicities.
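A count such as the number of counties above 90% English-only, and how the winners split on either side of that threshold, can be obtained with a logical flag and a cross-tabulation. The sketch below uses made-up rows (the object toy, the flag HighEnglish and all values are hypothetical); on the real data the same expressions would be applied to final_data.

```r
# Made-up county rows for illustration only
toy <- data.frame(
  English_Only = c(97, 95, 92, 88, 60, 45),
  PresWinner   = c("Trump", "Trump", "Clinton", "Trump", "Clinton", "Clinton")
)

# Flag counties where more than 90% of the population speaks only English
toy$HighEnglish <- toy$English_Only > 90

# Number of counties above the threshold
n_high <- sum(toy$HighEnglish)
print(n_high)

# Winners on each side of the threshold
cross <- table(HighEnglish = toy$HighEnglish, PresWinner = toy$PresWinner)
print(cross)
```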
# Histogram for Presidential Winner for Top 40 Counties with High Percentage of Citizens
final_data %>%
  top_n(40, PerCitizen) %>%
  ggplot(aes(fill = PresWinner, y = PerCitizen, x = County)) +
  geom_bar(stat = "identity") +
  xlab("Top 40 Counties") + ylab("Citizen Population in percentage") +
  scale_fill_manual(values = c("Blue","Red")) +
  labs(title = "Fig 2a",
       subtitle = "Top 40 counties with Highest percentage of citizens",
       fill = "Presidential Winner") +
  coord_flip()
# Histogram for Presidential Winner for Top 40 Counties with High Percentage of Non-Citizens
final_data %>%
  top_n(40, PerNonCitizen) %>%
  ggplot(aes(fill = PresWinner, y = PerNonCitizen, x = County)) +
  geom_bar(stat = "identity") +
  xlab("Top 40 Counties") + ylab("Non-Citizen Population in percentage") +
  scale_fill_manual(values = c("Blue","Red")) +
  labs(title = "Fig 2b",
       subtitle = "Top 40 counties with Highest percentage of Non-citizens",
       fill = "Presidential Winner") +
  coord_flip()
Observation
As we can see, even though some counties in both figures have Hillary Clinton as their presidential winner, more than half of the counties in both histograms have Donald Trump as their winner. Thus, we observe no strong correlation between the share of citizens in a county and its preference for either presidential candidate.
Next, we will check the distribution of each ethnicity across the counties.
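One way to put a number on "no strong correlation" would be to correlate the citizen percentage with a 0/1 indicator for a Trump win. This was not done in the original analysis; the sketch below only illustrates the idea on made-up rows, and assumes the winner label is exactly "Trump".

```r
# Made-up county rows for illustration only
toy <- data.frame(
  PerCitizen = c(70, 72, 75, 78, 80, 74),
  PresWinner = c("Trump", "Clinton", "Trump", "Clinton", "Trump", "Clinton")
)

# 1 if Trump won the county, 0 otherwise
trump_won <- as.numeric(toy$PresWinner == "Trump")

# Pearson correlation between a numeric variable and a 0/1 indicator
r <- cor(toy$PerCitizen, trump_won)
print(round(r, 3))
```

A value near zero on the full dataset would support the visual impression from Fig 2a and Fig 2b, while a value far from zero would call for a closer look.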
# Histogram for Presidential Winner for Top 40 Counties with High Percentage of Hispanic Population
final_data %>%
  top_n(40, Hispanic) %>%
  ggplot(aes(fill = PresWinner, y = Hispanic, x = County)) +
  geom_bar(stat = "identity") +
  xlab("Top 40 Counties") + ylab("Hispanic Population in percentage") +
  scale_fill_manual(values = c("Blue","Red")) +
  labs(title = "Fig. 3a",
       subtitle = "Top 40 counties with Hispanic Population",
       fill = "Presidential Winner") +
  coord_flip()
# Histogram for Presidential Winner for Top 40 Counties with High Percentage of White Population
final_data %>%
  top_n(40, White) %>%
  ggplot(aes(fill = PresWinner, y = White, x = County)) +
  geom_bar(stat = "identity") +
  xlab("Top 40 Counties") + ylab("White Population in percentage") +
  scale_fill_manual(values = c("Red")) +
  labs(title = "Fig. 3b",
       subtitle = "Top 40 counties with White Population",
       fill = "Presidential Winner") +
  coord_flip()
# Histogram for Presidential Winner for Top 40 Counties with High Percentage of Black Population
final_data %>%
  top_n(40, Black) %>%
  ggplot(aes(fill = PresWinner, y = Black, x = County)) +
  geom_bar(stat = "identity") +
  xlab("Top 40 Counties") + ylab("Black Population in percentage") +
  scale_fill_manual(values = c("Blue","Red")) +
  labs(title = "Fig. 3c",
       subtitle = "Top 40 counties with Black Population",
       fill = "Presidential Winner") +
  coord_flip()
# Histogram for Presidential Winner for Top 40 Counties with High Percentage of Native Population
final_data %>%
  top_n(40, Native) %>%
  ggplot(aes(fill = PresWinner, y = Native, x = County)) +
  geom_bar(stat = "identity") +
  xlab("Top 40 Counties") + ylab("Native Population in percentage") +
  scale_fill_manual(values = c("Blue","Red")) +
  labs(title = "Fig. 3d",
       subtitle = "Top 40 counties with Native Population",
       fill = "Presidential Winner") +
  coord_flip()
# Histogram for Presidential Winner for Top 40 Counties with High Percentage of Asian Population
final_data %>%
  top_n(40, Asian) %>%
  ggplot(aes(fill = PresWinner, y = Asian, x = County)) +
  geom_bar(stat = "identity") +
  xlab("Top 40 Counties") + ylab("Asian Population in percentage") +
  scale_fill_manual(values = c("Blue","Red")) +
  labs(title = "Fig. 3e",
       subtitle = "Top 40 counties with Asian Population",
       fill = "Presidential Winner") +
  coord_flip()
Observation
In Figure 3a, we observe that the majority of the top 40 counties by Hispanic population voted for Clinton. Even though there are some outliers, in general counties with a large Hispanic population preferred Clinton over Trump in the 2016 presidential election.
In Figure 3b, we see a clear mandate: all of the top 40 counties by White population elected Donald Trump as their presidential winner.
In Figure 3c, we observe the exact opposite mandate, with counties with a majority Black population preferring Hillary Clinton over Donald Trump.
In Figure 3d, we see a mix of both winners in the top 40 counties by Native population. 19 out of 40 counties had Donald Trump as their winner, which is almost 50%.
Finally, in Figure 3e, we see a trend favouring Clinton in the counties where the Asian population is highest.
In general, we observe that, apart from the Native population, counties with large non-White populations supported Hillary Clinton in the presidential race. The policies both candidates promoted during their campaigns and the media’s focus on Trump’s views about the non-White population may have been the driving force behind these statistics. Even so, since the White and English-speaking population was above 90% in most counties, and since some counties with a large Native population voted for Trump, the majority of counties had Trump as the presidential winner.
We will now take these variables and fit a logistic regression model to test our hypothesis that language and ethnicity affect the presidential winner of a county.
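The plots above repeat the same pattern with a different variable each time, so they could be generated by one small helper. The function plot_top_counties() below is hypothetical and not part of the original analysis; it assumes the winner labels are exactly "Clinton" and "Trump" and uses slice_max() rather than top_n(), so ties are broken and exactly n rows are plotted.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: bar chart of the n counties with the highest value of `var`
plot_top_counties <- function(data, var, n = 40, fig_title = "") {
  data %>%
    slice_max(.data[[var]], n = n, with_ties = FALSE) %>%
    ggplot(aes(x = County, y = .data[[var]], fill = PresWinner)) +
    geom_bar(stat = "identity") +
    # Named values tie each colour to a label (labels are an assumption)
    scale_fill_manual(values = c("Clinton" = "blue", "Trump" = "red")) +
    labs(title = fig_title,
         x = paste("Top", n, "Counties"),
         y = paste(var, "population in percentage"),
         fill = "Presidential Winner") +
    coord_flip()
}

# Made-up rows to show the call
toy <- tibble(
  County     = c("A,X", "B,X", "C,Y"),
  Hispanic   = c(80, 40, 10),
  PresWinner = c("Clinton", "Trump", "Trump")
)
p <- plot_top_counties(toy, "Hispanic", n = 2, fig_title = "Fig. 3a (toy data)")
```

Naming the colours explicitly keeps each candidate on the same colour even when only one of them appears in a given top 40.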
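# Convert the outcome to a factor (a binomial glm treats the first level as "failure" and the rest as "success")
final_data$PresWinner <- as.factor(final_data$PresWinner)
# Logistic Model
model <- glm(PresWinner ~ Hispanic + White + Black + Native + Asian + Pacific + PerCitizen + English_Only,
             data = final_data, family = "binomial")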
# Summary of the Model
summary(model)
##
## Call:
## glm(formula = PresWinner ~ Hispanic + White + Black + Native +
## Asian + Pacific + PerCitizen + English_Only, family = "binomial",
## data = final_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7939 0.1876 0.2656 0.4307 5.0915
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.96057 3.97854 -2.504 0.0123 *
## Hispanic 0.15921 0.03782 4.210 2.56e-05 ***
## White 0.14607 0.03522 4.147 3.36e-05 ***
## Black 0.06695 0.03462 1.934 0.0531 .
## Native 0.09541 0.03834 2.489 0.0128 *
## Asian -0.23777 0.05872 -4.049 5.14e-05 ***
## Pacific -0.18035 0.33739 -0.535 0.5930
## PerCitizen -0.16363 0.01746 -9.371 < 2e-16 ***
## English_Only 0.12492 0.01600 7.807 5.85e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2786.9 on 3141 degrees of freedom
## Residual deviance: 1772.2 on 3133 degrees of freedom
## AIC: 1790.2
##
## Number of Fisher Scoring iterations: 6
Observation
The summary statistics of our model show that the variables Hispanic, White, Asian, PerCitizen and English_Only significantly affect the presidential winner, as their p-values are well below 0.05. Native is also significant at the 0.05 level (p = 0.0128), though less strongly, while Black falls just short (p = 0.0531). The percentage of the population that is Pacific is very small and is not significant (p = 0.593), so it does not appear to affect the result in any county. The weaker effect of the Native and Black variables may be the reason the scales tip towards Donald Trump. This model was fitted only to see which factors affect the outcome variable PresWinner in a logistic regression. Transformations and checking the accuracy of this model are out of scope for this project and may be covered in future work.
In conclusion, our aim in this exercise was to identify whether language, citizenship or ethnicity affects the presidential winner of any given county in the United States.
We used two datasets: the ACS election dataset, containing the percentage of the population belonging to different ethnicities in each county, and the ACS English/non-English dataset, containing the percentage of the population speaking English in each county. We joined these datasets on their common county identifier (CensusId in the first, Id2 in the second).
We created new variables for the percentage of citizens in each county and created a new dataset containing only the information about ethnicity, citizenship and language for each county. Since some counties share the same name, we joined county and state using unite() to make each row unique.
For the data visualizations we took the top 40 counties with the highest percentage for each variable and compared the presidential winner across them.
We found that counties where English is the major language clearly supported Donald Trump, while counties where other languages are widely spoken preferred Hillary Clinton. This outcome is reflected again in ethnicity: with the exception of the Native population, counties with large non-White populations clearly favoured Hillary Clinton, whereas counties with large White populations favoured Donald Trump. It might be that counties with a high Native population still have a large White population, so the Native share was not decisive in the poll.
Plotting the citizen and non-citizen percentages showed that counties with the highest share of citizens mostly favoured Donald Trump, while a high share of non-citizens (who cannot vote) did not reduce Donald Trump’s county wins.
This analysis could help identify how much ethnicity and language affect poll results and help campaign managers devise their strategy accordingly. Counties with a more diverse population may need different campaign strategies, and have different problems to be resolved, than counties where the majority of the population belongs to one group or culture. Language can be a huge barrier when communicating ideas, promises and progress. Understanding what the people of the nation want at a micro level can make a huge impact on one’s political career.
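Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, and a simple accuracy check compares predicted and actual winners. Neither step was part of the original analysis. The sketch below uses simulated data in which a higher English_Only value makes a Trump win more likely by construction; all names and numbers are assumptions for illustration.

```r
set.seed(1)
n <- 200

# Simulated counties: English-only percentage between 40 and 100
toy <- data.frame(English_Only = runif(n, 40, 100))

# Simulated outcome; the relationship is built in for illustration only
p_trump <- plogis(-8 + 0.1 * toy$English_Only)
toy$PresWinner <- factor(ifelse(runif(n) < p_trump, "Trump", "Clinton"),
                         levels = c("Clinton", "Trump"))

# With levels (Clinton, Trump), glm models the probability of "Trump"
toy_model <- glm(PresWinner ~ English_Only, data = toy, family = "binomial")

# Odds ratios: multiplicative change in the odds of a Trump win per unit increase
odds_ratios <- exp(coef(toy_model))
print(odds_ratios)

# In-sample confusion matrix and accuracy at a 0.5 cut-off
pred <- ifelse(predict(toy_model, type = "response") > 0.5, "Trump", "Clinton")
confusion <- table(Predicted = pred, Actual = toy$PresWinner)
accuracy <- mean(pred == toy$PresWinner)
print(confusion)
print(accuracy)
```

Because the sign of each coefficient depends on which factor level is the reference, checking levels(final_data$PresWinner) before interpreting the real model is a sensible first step.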
This project highlights the significance of language and ethnicity in the 2016 presidential election. We could advance this project by fitting the dataset to a model and training it to predict, given the census data of any county, whether candidates or policies similar to Trump’s or Clinton’s are likely to emerge victorious.
The analysis is limited to identifying individual factors affecting the poll results; more research could be done on combinations of factors that affect the polls.
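A predictive extension along these lines would usually hold out part of the data to check how well the model generalises. The following is a minimal sketch under stated assumptions: the data is simulated, the 70/30 split and the 0.5 cut-off are arbitrary choices, and none of the object names come from the original analysis.

```r
set.seed(42)
n <- 300

# Simulated counties with two predictors
toy <- data.frame(
  English_Only = runif(n, 40, 100),
  White        = runif(n, 10, 100)
)
p_trump <- plogis(-7 + 0.05 * toy$English_Only + 0.04 * toy$White)
toy$PresWinner <- factor(ifelse(runif(n) < p_trump, "Trump", "Clinton"),
                         levels = c("Clinton", "Trump"))

# Random 70/30 split into training and test rows
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows only
fit <- glm(PresWinner ~ English_Only + White, data = train, family = "binomial")

# Predict the probability of a Trump win for the held-out counties
test_prob <- predict(fit, newdata = test, type = "response")
test_pred <- ifelse(test_prob > 0.5, "Trump", "Clinton")

# Share of held-out counties predicted correctly
test_accuracy <- mean(test_pred == test$PresWinner)
print(test_accuracy)
```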