The dataset we propose contains county-level data on the 2016 US elections. You will analyze the percentage of votes for one of the presidential candidates (you choose which one to study: Candidate A). You have access to a number of potential predictors (incl. income, gender, education level, …; see the attached list) and are asked to study one continuous explanatory variable in more detail (your team chooses this key explanatory variable), while adjusting for a selected set of other covariates. You are to preselect this set of 6 additional explanatory variables from the available list based on their meaning only. No data snooping or statistical evaluation involving the outcome is allowed at this stage to decide on the set you will study. You may consider the descriptives of the covariates for this purpose if you wish, but you may not use any observed link with the outcome at this stage. This is quite important.
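One way to keep the preselection step free of data snooping is to work on a copy of the data from which every outcome column has been dropped, and to look only at covariate descriptives there. The sketch below is a minimal illustration under stated assumptions: it uses a small simulated data frame (`toy`) rather than the real file, the outcome column names are taken from the `head()` output in this document, and the vector `preselected` is a purely illustrative choice, not a recommendation.

```r
# Simulated stand-in for data_us (assumption: the real data has these columns)
set.seed(1)
n <- 50
toy <- data.frame(
  votes_dem_2016 = rpois(n, 5000),
  votes_gop_2016 = rpois(n, 6000),
  RHI125214      = runif(n, 40, 100),  # White alone, percent, 2014
  AGE775214      = runif(n, 8, 30)     # Persons 65 years and over, percent, 2014
)
toy$per_dem_2016 <- toy$votes_dem_2016 / (toy$votes_dem_2016 + toy$votes_gop_2016)

# Drop every outcome-related column before exploring anything
outcome_cols    <- c("votes_dem_2016", "votes_gop_2016", "per_dem_2016")
covariates_only <- toy[, !(names(toy) %in% outcome_cols), drop = FALSE]

# Illustrative preselection, chosen on meaning alone (hypothetical choice)
preselected <- c("RHI125214", "AGE775214")

summary(covariates_only[, preselected])  # marginal descriptives
cor(covariates_only[, preselected])      # covariate-to-covariate correlation only
```

Because `covariates_only` no longer contains the vote columns, no plot or summary made from it can reveal a link with the outcome.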
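library(naniar)  # provides vis_miss() for missing-data plots
# Read the county-level election data and the dictionary of covariate codes
data_us <- read.csv('~/mastats_ugent/continuous/data/data.csv')
dict <- read.csv('~/mastats_ugent/continuous/data/county_facts_dictionary.csv')
# First six rows, first ten columns
head(data_us)[, 1:10]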
## combined_fips votes_dem_2016 votes_gop_2016 total_votes_2016 per_dem_2016
## 1 2013 93003 130413 246588 0.3771595
## 2 2016 93003 130413 246588 0.3771595
## 3 2020 93003 130413 246588 0.3771595
## 4 2050 93003 130413 246588 0.3771595
## 5 2060 93003 130413 246588 0.3771595
## 6 2068 93003 130413 246588 0.3771595
## per_gop_2016 diff_2016 per_point_diff_2016 state_abbr county_name
## 1 0.52887 37410 -0.1517105 AK Alaska
## 2 0.52887 37410 -0.1517105 AK Alaska
## 3 0.52887 37410 -0.1517105 AK Alaska
## 4 0.52887 37410 -0.1517105 AK Alaska
## 5 0.52887 37410 -0.1517105 AK Alaska
## 6 0.52887 37410 -0.1517105 AK Alaska
vis_miss(data_us[,1:20])
vis_miss(data_us[,21:40])
vis_miss(data_us[,41:60])
hist(data_us[,'RHI125214'],xlab='White alone, percent, 2014',main='White alone, percent, 2014')
library(usmap)
library(ggplot2)
plot_usmap(data = data_us, values = "RHI125214") +
  scale_fill_continuous(name = "White alone, percent, 2014", labels = scales::comma) +
  theme(legend.position = "right")
In most counties the population is predominantly white, according to the 2014 data.
Example: age
hist(data_us[,'AGE775214'],main='Persons 65 years and over, percent, 2014')
gain some insight into the association of this key predictor variable with the percentage of votes for Candidate A (fitting a simple linear regression model first and then a multiple linear regression model, adjusting for the variables from the list of six; note that you may need more than 6 covariates to incorporate the 6 variables in a way that seems fitting), and
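The two-stage analysis described above can be sketched with `lm()`. This is only an illustration under stated assumptions: the data are simulated, the outcome name `per_dem` and the coefficients used to generate it are made up, and `RHI125214` is treated as the key predictor purely for the sake of the example.

```r
# Simulated data (assumption: stands in for the real county data)
set.seed(42)
n <- 200
toy <- data.frame(
  RHI125214 = runif(n, 40, 100),  # key predictor in this sketch
  AGE775214 = runif(n, 8, 30)     # one adjustment covariate
)
# Made-up data-generating model, outcome on the percentage scale
toy$per_dem <- 80 - 0.5 * toy$RHI125214 - 0.3 * toy$AGE775214 + rnorm(n, sd = 5)

# Step 1: simple linear regression on the key predictor
fit_simple <- lm(per_dem ~ RHI125214, data = toy)

# Step 2: multiple linear regression, adjusting for the other covariate(s)
fit_adjusted <- lm(per_dem ~ RHI125214 + AGE775214, data = toy)

# Compare the estimated effect of the key predictor before and after adjustment
coef(summary(fit_simple))["RHI125214", ]
coef(summary(fit_adjusted))["RHI125214", ]
confint(fit_adjusted, "RHI125214")
```

A term such as `I(AGE775214^2)`, or a factor with several levels, contributes more than one coefficient for a single variable, which is one way a model built from six variables can end up with more than six covariates.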
predict whether the outcome is close to ‘undecided’ in the county (plus or minus 2% of the votes). If this is the case, the county deserves more focus and action in the campaign leading up to the (next) elections.
Hints
# Example
data_us['dem_new_per'] = data_us$votes_dem_2016/(data_us$votes_dem_2016+data_us$votes_gop_2016)
data_us['dem_new_per_cat'] = cut(data_us$dem_new_per,c(0,.42,.52,1))
plot_usmap(data = data_us, values = "dem_new_per_cat") +
theme(legend.position = "right")
Hints
data_us['dem_new_per_cat'] = cut(data_us$dem_new_per,c(0,.40,.50,1))
plot_usmap(data = data_us, values = "dem_new_per_cat") +
theme(legend.position = "right")
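The "plus or minus 2%" rule from the task description can also be coded directly as a binary indicator. A minimal sketch, assuming that "undecided" means a two-party Democratic share of 50% (this is an assumption; the hints above use other cut points) and using a handful of made-up vote counts:

```r
# Made-up vote counts for five hypothetical counties
toy <- data.frame(
  votes_dem_2016 = c(4900, 5100, 3000, 7000, 4790),
  votes_gop_2016 = c(5100, 4900, 7000, 3000, 5210)
)
# Two-party Democratic share, as in the hint code
toy$dem_new_per <- toy$votes_dem_2016 / (toy$votes_dem_2016 + toy$votes_gop_2016)

# TRUE when the share lies within 2 percentage points of 50% (assumed midpoint)
toy$close_call <- abs(toy$dem_new_per - 0.5) <= 0.02

# Roughly the same window as a three-level factor, e.g. for mapping
toy$dem_new_per_cat <- cut(
  toy$dem_new_per,
  breaks = c(0, 0.48, 0.52, 1),
  labels = c("gop", "close", "dem"),
  include.lowest = TRUE
)

table(toy$close_call)
```

The logical column `close_call` could then serve as the outcome of a classification model, while the factor version is convenient for colouring a map.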
In more detail, the following steps should be taken and reflected in both the protocol and your final report: