An overview of the assignment

The proposed dataset contains county-level data on the 2016 US elections. You will analyze the percentage of votes for one of the presidential candidates (you choose which one; we refer to this candidate as Candidate A). You have access to a number of potential predictors (including income, gender and education level; see the attached list) and are asked to study one continuous explanatory variable in more detail (your team chooses this key explanatory variable), while adjusting for a selected set of other covariates. Preselect this set of 6 additional explanatory variables from the available list based on their meaning only. No data snooping or statistical evaluation involving the outcome is allowed at this stage when deciding on the set you will study. You may consider descriptive statistics of the covariates for this purpose if you wish, but not any observed link with the outcome. This is quite important.

Hints

  • Choose one continuous variable based on its non-missingness.
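library(naniar)  # provides vis_miss()
# Load the county-level election data and the data dictionary
data_us <- read.csv('~/mastats_ugent/continuous/data/data.csv')
dict <- read.csv('~/mastats_ugent/continuous/data/county_facts_dictionary.csv')
# Show the first 10 columns of the first rows
head(data_us)[, 1:10]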
##   combined_fips votes_dem_2016 votes_gop_2016 total_votes_2016 per_dem_2016
## 1          2013          93003         130413           246588    0.3771595
## 2          2016          93003         130413           246588    0.3771595
## 3          2020          93003         130413           246588    0.3771595
## 4          2050          93003         130413           246588    0.3771595
## 5          2060          93003         130413           246588    0.3771595
## 6          2068          93003         130413           246588    0.3771595
##   per_gop_2016 diff_2016 per_point_diff_2016 state_abbr county_name
## 1      0.52887     37410          -0.1517105         AK      Alaska
## 2      0.52887     37410          -0.1517105         AK      Alaska
## 3      0.52887     37410          -0.1517105         AK      Alaska
## 4      0.52887     37410          -0.1517105         AK      Alaska
## 5      0.52887     37410          -0.1517105         AK      Alaska
## 6      0.52887     37410          -0.1517105         AK      Alaska
vis_miss(data_us[,1:20])

vis_miss(data_us[,21:40])

vis_miss(data_us[,41:60])

  • There are few missing values.
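
The missingness plots can be complemented with a numeric summary. The following is a minimal sketch that assumes a small made-up data frame `toy` in place of `data_us`; the column names are taken from the dataset, but the values are invented for illustration.

```r
# Toy data frame standing in for data_us (values are made up for illustration)
toy <- data.frame(
  RHI125214    = c(85.1, NA, 60.3, 92.7),
  AGE775214    = c(14.2, 17.8, NA, NA),
  per_dem_2016 = c(0.38, 0.51, 0.47, 0.22)
)

# Share of missing values per column, sorted from most to least complete
miss_share <- sort(colMeans(is.na(toy)))
print(miss_share)
```

Applied to the real data, the same call on `data_us` would rank all candidate variables by completeness in one step.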

My choice

hist(data_us[,'RHI125214'],xlab='White alone, percent, 2014',main='White alone, percent, 2014')

library(usmap)
library(ggplot2)

plot_usmap(data = data_us, values = "RHI125214") + 
  scale_fill_continuous(name = "White alone, percent, 2014", label = scales::comma) + 
  theme(legend.position = "right")

Example age

hist(data_us[,'AGE775214'],main='Persons 65 years and over, percent, 2014')

The goal is to

  1. gain some insight into the association of this key predictor variable with the percentage of votes for Candidate A (fit a simple linear regression model first, then a multiple linear regression model that adjusts for the six preselected variables; note that you may need more than 6 covariates to incorporate the 6 variables in a way that seems fitting), and

    Hints

  2. predict whether the outcome is close to ‘undecided’ in the county (within plus or minus 2% of the votes). If this is the case, the county deserves more focus and action in the campaign leading up to the (next) elections.

    Hints

# Example
data_us['dem_new_per'] = data_us$votes_dem_2016/(data_us$votes_dem_2016+data_us$votes_gop_2016)
data_us['dem_new_per_cat'] = cut(data_us$dem_new_per,c(0,.42,.52,1))
plot_usmap(data = data_us, values = "dem_new_per_cat") + 
  theme(legend.position = "right")

Hints

data_us['dem_new_per_cat'] = cut(data_us$dem_new_per,c(0,.40,.50,1))
plot_usmap(data = data_us, values = "dem_new_per_cat") + 
  theme(legend.position = "right")
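
One possible reading of "plus minus 2%" is a two-party share within 2 percentage points of 50%. The sketch below applies that reading to a small made-up data frame `toy`; the vote counts, the `undecided` column and the category labels are assumptions for illustration, not part of the dataset.

```r
# Toy vote counts (made up); column names follow data_us
toy <- data.frame(
  votes_dem_2016 = c(4900, 3000, 5150, 7000),
  votes_gop_2016 = c(5100, 7000, 4850, 3000)
)

# Two-party Democratic share, computed as in the example above
toy$dem_new_per <- toy$votes_dem_2016 /
  (toy$votes_dem_2016 + toy$votes_gop_2016)

# Flag counties whose share lies within 2 percentage points of 50%
toy$undecided <- abs(toy$dem_new_per - 0.5) <= 0.02

# The same idea as a three-level factor (labels are hypothetical)
toy$dem_cat <- cut(toy$dem_new_per,
                   breaks = c(0, 0.48, 0.52, 1),
                   labels = c("gop lead", "undecided", "dem lead"),
                   include.lowest = TRUE)
print(toy)
```

If your team defines "undecided" on the raw vote percentage rather than the two-party share, the same logic applies with a different input column.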

Execution steps - in order

In more detail, the following steps should be taken and reflected in both the protocol and your final report:

  1. Perform a descriptive analysis. This is to be performed only on the outcome and the 1 + 6 predictors (not on the entire variable set). Use simple statistics and graphical representations to get a view of the distribution of the measured variables in the study population and of the scope of your model. Also examine bivariate associations, and consider any possible outliers in the dimensions examined. Don’t forget to consider the study design at this point.

Hints
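
A descriptive analysis along these lines could start from the sketch below. It uses simulated data in a data frame `toy`; the column names come from the dataset, but the values and the simulated relationship are assumptions made only so that the code runs on its own.

```r
# Simulated stand-in for the outcome and two predictors (not real data)
set.seed(1)
n <- 200
toy <- data.frame(
  RHI125214 = runif(n, 20, 99),
  AGE775214 = runif(n, 8, 30)
)
toy$per_dem_2016 <- pmin(pmax(
  0.9 - 0.006 * toy$RHI125214 + rnorm(n, sd = 0.08), 0), 1)

# Univariate summaries: five-number summary and mean per variable
print(summary(toy))

# Bivariate associations: pairwise Pearson correlations
cor_mat <- round(cor(toy), 2)
print(cor_mat)

# Scatterplot matrix to inspect shape and spot unusual points
pairs(toy)

# Values flagged by the boxplot rule as potential outliers in the outcome
out_vals <- boxplot.stats(toy$per_dem_2016)$out
print(out_vals)
```

With the real data, restrict the data frame to the outcome and the 1 + 6 preselected predictors before running these summaries.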

  2. Randomly select 50% of the data as training data, as mentioned in the introductory paragraphs.

Hints
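
A random 50/50 split can be sketched as follows. The data frame `toy`, its columns and the seed value are placeholders chosen for illustration; fixing a seed simply makes the split reproducible.

```r
# Placeholder data frame with 10 rows (stands in for data_us)
set.seed(2016)  # arbitrary seed, chosen only for reproducibility
toy <- data.frame(id = 1:10, x = rnorm(10))

# Draw half of the row indices at random, without replacement
train_idx <- sample(nrow(toy), size = floor(0.5 * nrow(toy)))

# Training rows are the sampled ones; the rest form the test set
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(nrow(train))
print(nrow(test))
```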

  3. First fit a simple linear model, then build a multiple regression model, both on the training dataset, to answer the key question.

Hints
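
The two model fits could look like the sketch below. It uses a simulated training set `train` with `RHI125214` as the key predictor and `AGE775214` as a single adjustment covariate; the simulated coefficients are assumptions, and a real analysis would include all preselected covariates.

```r
# Simulated training data (not real data); column names follow data_us
set.seed(42)
n <- 300
train <- data.frame(
  RHI125214 = runif(n, 20, 99),
  AGE775214 = runif(n, 8, 30)
)
train$per_dem_2016 <- 0.9 - 0.006 * train$RHI125214 +
  0.002 * train$AGE775214 + rnorm(n, sd = 0.05)

# Simple linear regression: outcome on the key predictor only
fit_simple <- lm(per_dem_2016 ~ RHI125214, data = train)

# Multiple linear regression: adjust for the additional covariate
fit_multi <- lm(per_dem_2016 ~ RHI125214 + AGE775214, data = train)

# Compare the estimated effect of the key predictor across both models
print(summary(fit_simple)$coefficients)
print(summary(fit_multi)$coefficients)

# 95% confidence interval for the adjusted effect of the key predictor
print(confint(fit_multi)["RHI125214", ])
```

Comparing the coefficient of the key predictor between the two fits shows how much the adjustment changes the estimated association.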