Project continuous Data Analysis

An overview of the assignment

The proposed dataset contains county-level data on the 2016 US elections. You will analyze the percentage of votes for one of the presidential candidates (you choose which one; we call this candidate Candidate A). You have access to a number of potential predictors (incl. income, gender, education level,… see list attached) and are asked to study one continuous explanatory variable in more detail (your team chooses this key explanatory variable), while adjusting for a selected set of other covariates. You must preselect this set of 6 additional explanatory variables from the available list based on their meaning only. At this stage, no data snooping or statistical evaluation involving the outcome is allowed when deciding on this set. You may consider the descriptive statistics of the covariates for this purpose if you wish, but not any observed link with the outcome. This is quite important.

Hints

Choose one continuous variable based on its non-missingness.

library(naniar)
data_us<-read.csv('~/mastats_ugent/continuous/data/data.csv')
dict<-read.csv('~/mastats_ugent/continuous/data/county_facts_dictionary.csv')
head(data_us)[,1:10]

##   combined_fips votes_dem_2016 votes_gop_2016 total_votes_2016 per_dem_2016
## 1          2013          93003         130413           246588    0.3771595
## 2          2016          93003         130413           246588    0.3771595
## 3          2020          93003         130413           246588    0.3771595
## 4          2050          93003         130413           246588    0.3771595
## 5          2060          93003         130413           246588    0.3771595
## 6          2068          93003         130413           246588    0.3771595
##   per_gop_2016 diff_2016 per_point_diff_2016 state_abbr county_name
## 1      0.52887     37410          -0.1517105         AK      Alaska
## 2      0.52887     37410          -0.1517105         AK      Alaska
## 3      0.52887     37410          -0.1517105         AK      Alaska
## 4      0.52887     37410          -0.1517105         AK      Alaska
## 5      0.52887     37410          -0.1517105         AK      Alaska
## 6      0.52887     37410          -0.1517105         AK      Alaska

vis_miss(data_us[,1:20])

vis_miss(data_us[,21:40])

vis_miss(data_us[,41:60])

Few values are missing.
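If you prefer numbers to the `vis_miss` plots, the share of missing values per column can also be computed in base R. The following is only a sketch on a made-up toy data frame: the column names and the 5% cutoff are assumptions for illustration, not part of the assignment data.

```r
# Hypothetical toy data standing in for data_us; the column names are invented.
toy <- data.frame(
  pct_white  = c(91.2, NA, 78.5, 88.0, 95.1),
  pct_age65  = c(18.3, 21.0, NA, NA, 16.4),
  population = c(1200, 54000, 830, 9100, 300)
)

# Share of missing values per column, sorted from most to least complete.
miss_share <- sort(colMeans(is.na(toy)))
print(miss_share)

# Columns with at most 5% missing values (the cutoff is an arbitrary assumption).
candidates <- names(miss_share)[miss_share <= 0.05]
print(candidates)
```

On the real data you would replace `toy` with `data_us` and then pick your key variable among the continuous columns that pass the cutoff.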

My choice

RHI125214: White alone, percent, 2014

hist(data_us[,'RHI125214'],xlab='White alone, percent, 2014',main='White alone, percent, 2014')

The distribution of White alone, percent, 2014 is heavily skewed: most counties are more than 90% White.

library(usmap)
library(ggplot2)

plot_usmap(data = data_us, values = "RHI125214") + 
  scale_fill_continuous(name = "White alone, percent, 2014", label = scales::comma) + 
  theme(legend.position = "right")

Most counties are predominantly White, as indicated by the 2014 data.
For the other variables (make sure the variables you pick refer to the same year):
- Pick just one variable related to age; if you pick two age-related variables, they will be correlated.
- Pick something related to population, e.g., Population, 2014 estimate.
- Pick only one variable related to education, and check its distribution.
- Maybe pick a variable related to language; it might be correlated with education, so check this.
- Pick a variable related to income; it might be correlated with age, language, and education, so check this.
- For the other variables, pick one that is actionable. You can’t really change people’s ages or education during a campaign, so look for a variable that politicians can actually influence over the course of their campaign.
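The "check them" advice in the list above can be sketched as a correlation matrix of the candidate covariates only (the outcome is deliberately left out, in line with the no-snooping rule). The data are simulated, and the variable names and the 0.7 cutoff are assumptions for illustration.

```r
set.seed(1)
n <- 200

# Simulated, hypothetical covariates: two age variables that are strongly
# related by construction, and an unrelated income variable.
age65       <- rnorm(n, mean = 18, sd = 4)
age_under18 <- 40 - age65 + rnorm(n, mean = 0, sd = 1)
income      <- rnorm(n, mean = 50, sd = 10)
covs <- data.frame(age65, age_under18, income)

# Pairwise correlations between the covariates (no outcome involved).
cor_mat <- cor(covs, use = "pairwise.complete.obs")
print(round(cor_mat, 2))

# List each pair once (upper triangle) whose |r| exceeds an arbitrary 0.7 cutoff.
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(flagged)
```

A flagged pair suggests keeping only one of the two variables in your preselected set.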

Example age

hist(data_us[,'AGE775214'],main='Persons 65 years and over, percent, 2014')

In most counties, about 20 percent of the population is aged 65 or over.

The goal is to

gain some insight into the association of this key predictor variable with the percentage of votes for Candidate A (using a simple linear regression model first and then a multiple linear regression analysis, adjusting for the variables from the list of six; note that you may need more than 6 covariates to incorporate the 6 variables in a way that seems fitting), and

Hints

Please quantify the association with a p-value: the p-value of the beta coefficient in the simple linear regression model. For the p-value to be valid, make sure that you have checked the goodness of fit.
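As a sketch of where this p-value comes from in R, the example below fits a simple linear regression on simulated data and reads the slope estimate and its p-value from the coefficient table. The variables `x` and `y` and the numbers used to simulate them are purely hypothetical.

```r
set.seed(42)
n <- 300

# Simulated, hypothetical data: x plays the role of the key predictor,
# y the role of the (transformed) outcome.
x <- runif(n, min = 40, max = 100)
y <- 1.5 - 0.03 * x + rnorm(n, sd = 0.5)

fit <- lm(y ~ x)

# Coefficient table: estimate, standard error, t value and p-value per term.
coef_table <- summary(fit)$coefficients
beta_hat <- coef_table["x", "Estimate"]
p_value  <- coef_table["x", "Pr(>|t|)"]
print(coef_table)

# Numeric look at the residuals; they should be centred on zero.
print(summary(resid(fit)))
```

For the goodness-of-fit check, `plot(fit, which = 1:2)` draws the residuals-versus-fitted and normal Q-Q plots of the fitted model.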

predict whether the outcome in the county is close to ‘undecided’ (within plus or minus 2% of the votes). If this is the case, the county deserves more focus and action in the campaign leading up to the (next) elections.

Hints

First recompute the percentages so that 50% really means undecided and within 2% means the interval [48%, 52%].
How do you recompute? For example: \(\text{per\_votes\_dem} = \frac{\text{number of votes of dem}}{\text{number of votes of dem} + \text{number of votes of rep}}\)

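# Example: two-party Democratic share, cut into three classes;
# the middle class (0.48, 0.52] is the 'undecided' band of plus or minus 2%
data_us['dem_new_per'] = data_us$votes_dem_2016/(data_us$votes_dem_2016+data_us$votes_gop_2016)
data_us['dem_new_per_cat'] = cut(data_us$dem_new_per,c(0,.48,.52,1))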

plot_usmap(data = data_us, values = "dem_new_per_cat") + 
  theme(legend.position = "right")

Hints

If we only look at the 2016 results after the elections, then the action counties are the green counties. Unfortunately, we only know this after the elections. So our prediction model will predict which counties will be green in, say, the 2020 elections, using data from 2016.
Also note that there is spatial correlation in the data: as you can see, red counties cluster together, blue counties tend to occur together, and so do green ones. How can we remove this spatial correlation during the analysis? (Mercy mentioned stratified sampling.)
There are more red states in 2016; of course, this is why Trump won the elections. If the Democrats manage to act only on the green counties, do you think they will win the elections? If not, maybe the 2% range is not enough. What about, say, 10%?

# Wider action band: Democratic two-party share in (0.40, 0.50]
data_us['dem_new_per_cat'] = cut(data_us$dem_new_per,c(0,.40,.50,1))
plot_usmap(data = data_us, values = "dem_new_per_cat") + 
  theme(legend.position = "right")

All of Alaska now becomes an action area. I leave it to you to make recommendations in your conclusions.

Execution steps - in order

In more detail, the following steps should be taken and reflected in both the protocol and your final report:

Perform a descriptive analysis - this is only to be performed on the outcome + 1 + 6 predictors (not on the entire variable set). Use simple statistics and graphical representations to get a view of the distribution of the measured variables in the study population and of the scope of your model. Also examine bivariate associations, and consider any possible outliers in the dimensions examined. Don’t forget to consider the study design at this point.

Hints

Compute the correlations between the features.
Look at their distributions (histograms for continuous data, bar plots for categorical data).
Is there a need to transform the variables? State this clearly, e.g.,
- dummy coding or one-hot encoding for categorical variables,
- mean centering and dividing by the standard deviation for continuous variables.
Investigate outliers with box plots; remove outliers using the IQR rule.
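The transformations and the IQR rule mentioned in these hints can be sketched in base R as follows. The values and the `region` factor are invented for illustration, and the usual 1.5 × IQR fences are assumed as the "IQR formula".

```r
# Hypothetical covariate values; 60 is an extreme value on purpose.
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 21, 60)

# Standardize: subtract the mean and divide by the standard deviation.
z <- as.numeric(scale(x))

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q     <- unname(quantile(x, c(0.25, 0.75)))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
is_outlier <- x < lower | x > upper
print(x[is_outlier])

# Dummy coding of a hypothetical categorical variable; the first level
# (alphabetically "Midwest") becomes the reference category.
region  <- factor(c("South", "West", "South", "Midwest"))
dummies <- model.matrix(~ region)
print(dummies)
```

`boxplot(x)` shows the same flagged points graphically. Whether a flagged county is then dropped or kept should be stated and justified in the report.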

Randomly select 50% of the data as training data, as mentioned in the introductory paragraphs.
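A random 50% split can be sketched as below. The toy data frame, its column names, and the seed are assumptions; the point is only that fixing the seed makes the split reproducible and that the two halves do not overlap.

```r
set.seed(2016)  # arbitrary seed so that the split can be reproduced

# Hypothetical toy data standing in for data_us.
toy <- data.frame(
  county_id = 1:10,
  state     = rep(c("AK", "AL"), each = 5)
)

n <- nrow(toy)

# Draw half of the row indices at random, without replacement.
train_idx <- sample(seq_len(n), size = floor(0.5 * n))

train   <- toy[train_idx, ]   # used to fit the models
holdout <- toy[-train_idx, ]  # kept aside to evaluate predictions

print(train)
print(holdout)
```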

Hints

First fit a simple linear model, then build a multiple regression model, both on the training dataset, to answer the key question.

Hints

First transform your vote percentages \(\tilde Y\) (note that \(\tilde Y\in [0,1]\)) using \(Y = \log(\frac{\tilde Y}{1-\tilde Y})\) (so that \(Y\in (-\infty,\infty)\)), which makes them easier to work with.
At the end you will have to back-transform: \(\tilde Y = \frac{1}{1+\exp(-Y)}\).
Investigate the association by looking at the p-value of the \(\beta\) coefficient.
Look at the \(R^2\); it indicates how well the model predicts.
Look at the goodness of fit and the distribution of the residuals.
Check for homoskedasticity.
Compute the RMSE and compare it with that of the null model, a model that has only \(\beta_0\), i.e., a model that simply predicts the mean of the percentage of votes.
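The hints above can be strung together in one small sketch: logit transform, fit on the training half, back-transform the predictions, and compare the RMSE with the null model. Everything here is simulated; `x`, `share`, and the simulation parameters are hypothetical. The null model is taken to be the intercept-only model on the logit scale, which is one possible reading of "only \(\beta_0\)".

```r
set.seed(7)
n <- 400

# Simulated, hypothetical data: x is the key predictor, share a vote share in (0, 1).
x     <- runif(n, min = 40, max = 100)
share <- plogis(2 - 0.04 * x + rnorm(n, sd = 0.4))
dat   <- data.frame(x, share)

# 50% training / 50% holdout split.
train_idx <- sample(seq_len(n), size = n / 2)
train   <- dat[train_idx, ]
holdout <- dat[-train_idx, ]

logit     <- function(p) log(p / (1 - p))     # Y = log(Y~ / (1 - Y~))
inv_logit <- function(y) 1 / (1 + exp(-y))    # back-transform to (0, 1)

train$y  <- logit(train$share)
fit      <- lm(y ~ x, data = train)           # simple linear regression
null_fit <- lm(y ~ 1, data = train)           # null model: intercept only

# Predict on the holdout half and back-transform to the share scale.
pred      <- inv_logit(predict(fit, newdata = holdout))
pred_null <- inv_logit(predict(null_fit, newdata = holdout))

rmse <- function(obs, est) sqrt(mean((obs - est)^2))
rmse_model <- rmse(holdout$share, pred)
rmse_null  <- rmse(holdout$share, pred_null)
print(c(model = rmse_model, null = rmse_null))
print(summary(fit)$r.squared)
```

For the residual checks, `plot(fit, which = 1:3)` draws the residuals-versus-fitted, normal Q-Q, and scale-location plots; the last one is the usual visual check for homoskedasticity. Note that the logit is undefined for shares of exactly 0 or 1, so such counties would need separate handling.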

11/24/2020
