Confounders are variables that correlate (positively or negatively) with both the dependent variable and the independent variable. The presence of a confounding variable distorts the relationship being studied, so the results do not reflect the true exposure-outcome relationship. Correcting for this is known as "adjustment". The main methods for overcoming it are described below.
Figure: confounding variable diagram [src: "https://www.med.uottawa.ca/sim/data/Images/Confounding.jpg"]
Confounding is a type of bias that, at the analysis stage of a project, is adjusted for after the data have been gathered, using statistical models. There are two key methods for dealing with confounding variables at this stage:
Stratification fixes the level of the confounder, producing new groups within which the confounder does not vary. Each group is then evaluated to determine the exposure-outcome association within that stratum of the confounder. In essence, within each stratum the confounder cannot confound, because it does not vary across the exposure-outcome relationship. Once stratification is complete, a Mantel-Haenszel (M-H) estimator can be used to pool the stratum-specific results into a single adjusted estimate, as in the sketch below.
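R's stats package provides mantelhaen.test() for exactly this pooling step. The sketch below uses hypothetical exposure, outcome, and stratum variables; none of these come from the data set analysed later.
# hypothetical data: an exposure-outcome association stratified by a confounder
set.seed(1)
exposure <- factor(sample(c("exposed", "unexposed"), 200, replace = TRUE))
outcome  <- factor(sample(c("case", "control"), 200, replace = TRUE))
stratum  <- factor(sample(c("young", "old"), 200, replace = TRUE))  # the confounder
# Mantel-Haenszel test pools the 2x2 exposure-outcome tables across strata
mantelhaen.test(table(exposure, outcome, stratum))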
Multivariate models. Stratification only works well when there are only one or two confounders to control for. If there are more, multivariate models offer the best solution.
Multivariate models handle multiple covariates at the same time; analysis of variance (ANOVA), applied in the rest of this section, is one example.
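As a general illustration of the multivariate idea (entirely hypothetical variables, and linear regression rather than the ANOVA used later), confounders can be adjusted for by including them as covariates:
# hypothetical example: adjust an exposure-outcome estimate for two
# confounders by adding them as covariates in a linear model
set.seed(2)
n <- 200
confounder1 <- rnorm(n)
confounder2 <- rnorm(n)
exposure2   <- 0.5 * confounder1 + rnorm(n)  # exposure correlates with confounder1
outcome2    <- 2 * exposure2 + confounder1 + confounder2 + rnorm(n)
summary(lm(outcome2 ~ exposure2 + confounder1 + confounder2))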
The purpose of this section is to apply ANOVA to a real-world data set. Analysis of variance (ANOVA) analyses the variance in an outcome and can be used to control for potential confounding variables.
ANOVA models are generally specified as one-way or two-way. This simply identifies whether the model uses a single factor, or includes a second factor and the interaction between the two factors, as the syntax sketch below shows.
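A minimal sketch of the two specifications in R, using a hypothetical toy data frame (y, a and b are placeholders, not columns of the data set analysed below):
# toy data to illustrate the one-way and two-way formula syntax
toy <- data.frame(
  y = rnorm(40),
  a = factor(rep(c("g1", "g2"), each = 20)),
  b = factor(rep(c("x", "z"), times = 20))
)
aov(y ~ a, data = toy)      # one-way: a single factor
aov(y ~ a * b, data = toy)  # two-way with interaction (same as y ~ a + b + a:b)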
As an example, let's look at some financial data that records whether people default in the next month.
# load relevant libraries
library(tidyverse)
library(ggplot2)
library(stats)
library(dplyr)
# import the data
training.raw <- read_csv('AT2_credit_train_STUDENT.csv')
# remove incomplete rows (2 rows where SEX was recorded as cat/dog)
training.raw <- training.raw[complete.cases(training.raw), ]
glimpse(training.raw)
## Observations: 23,098
## Variables: 17
## $ ID <int> 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ LIMIT_BAL <dbl> 20000, 120000, 90000, 50000, 50000, 500000, 100000, ...
## $ SEX <int> 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2...
## $ EDUCATION <int> 2, 2, 2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2, 1, 3, 1, 1, 3...
## $ MARRIAGE <int> 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 3, 1, 2, 2...
## $ AGE <int> 24, 26, 34, 37, 37, 29, 23, 28, 35, 34, 51, 41, 30, ...
## $ PAY_PC1 <dbl> 0.47746305, -1.46161240, -0.39330764, -0.39330764, -...
## $ PAY_PC2 <dbl> -3.224589961, 0.853866590, 0.175554996, 0.175554996,...
## $ PAY_PC3 <dbl> 0.145040802, -0.360863449, 0.004885522, 0.004885522,...
## $ AMT_PC1 <dbl> -1.7522077, -1.6633432, -1.1348380, -0.3971748, -0.3...
## $ AMT_PC2 <dbl> -0.22434761, -0.14388558, -0.17660872, -0.45109777, ...
## $ AMT_PC3 <dbl> -0.077841055, -0.054600404, 0.015954854, -0.09978950...
## $ AMT_PC4 <dbl> 0.006957244, -0.002851947, -0.129071306, -0.03533896...
## $ AMT_PC5 <dbl> -0.041356958, 0.043889122, 0.098245278, -0.055306582...
## $ AMT_PC6 <dbl> 0.0008865935, -0.0261897987, -0.0223825102, 0.050465...
## $ AMT_PC7 <dbl> -0.05626505, -0.09997756, -0.06898686, -0.02820475, ...
## $ default <chr> "Y", "Y", "N", "N", "N", "N", "N", "Y", "N", "N", "N...
Let's clean our data and take a first look at it.
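The group summaries later in this section show SEX as a labelled factor (Male/Female), so a recoding step along these lines is implied; this sketch assumes the usual coding of 1 = male and 2 = female, as in the UCI credit default data:
# recode SEX from integer codes to a labelled factor
# (assumes 1 = male, 2 = female, as in the UCI credit default data)
training.raw <- training.raw %>%
  mutate(SEX = factor(SEX, levels = c(1, 2), labels = c("Male", "Female")))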
ggplot(training.raw, aes(LIMIT_BAL, AGE, colour = SEX)) +
geom_point()
The plot shows the general relationship between LIMIT_BAL and AGE, coloured by SEX.
The R function aov(), part of the stats package, helps us analyse the variance in the model.
aov1 <- aov(LIMIT_BAL ~ AGE+SEX, data = training.raw)
summary(aov1)
## Df Sum Sq Mean Sq F value Pr(>F)
## AGE 1 5.655e+12 5.655e+12 368.42 < 2e-16 ***
## SEX 1 4.202e+11 4.202e+11 27.37 1.69e-07 ***
## Residuals 23095 3.545e+14 1.535e+10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary above shows that both AGE and SEX are statistically significant, with AGE the stronger of the two (it has the larger F value). This means that changes in AGE or SEX are associated with significant changes in the mean LIMIT_BAL.
The fitted model above is an additive model: it assumes that the two explanatory variables are independent. To test whether they might interact, we replace the + symbol with *.
# two-way ANOVA with an interaction effect
aov2 <- aov(LIMIT_BAL ~ AGE*SEX, data = training.raw)
summary(aov2)
## Df Sum Sq Mean Sq F value Pr(>F)
## AGE 1 5.655e+12 5.655e+12 368.71 < 2e-16 ***
## SEX 1 4.202e+11 4.202e+11 27.40 1.67e-07 ***
## AGE:SEX 1 2.977e+11 2.977e+11 19.41 1.06e-05 ***
## Residuals 23094 3.542e+14 1.534e+10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
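To check formally whether the interaction term improves the fit, the additive and interaction models fitted above can be compared with a nested-model F-test (this comparison is a sketch, not part of the original write-up):
# F-test comparing the additive model (aov1) with the interaction model (aov2);
# a small p-value favours retaining the AGE:SEX interaction term
anova(aov1, aov2)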
The same model can also be written with the interaction term spelled out explicitly:
aov3 <- aov(LIMIT_BAL ~ AGE + SEX + AGE:SEX, data = training.raw)
summary(aov3)
## Df Sum Sq Mean Sq F value Pr(>F)
## AGE 1 5.655e+12 5.655e+12 368.71 < 2e-16 ***
## SEX 1 4.202e+11 4.202e+11 27.40 1.67e-07 ***
## AGE:SEX 1 2.977e+11 2.977e+11 19.41 1.06e-05 ***
## Residuals 23094 3.542e+14 1.534e+10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the summary above, based on the p-values at the 0.05 significance level, we can tell that the two main effects AGE and SEX are statistically significant, as is their interaction.
Let's see what happens when we compute the group means:
training.raw %>%
group_by(AGE, SEX) %>%
summarise(
count = n(),
mean = mean(LIMIT_BAL, na.rm = TRUE),
sd = sd(LIMIT_BAL, na.rm = TRUE)
)
## # A tibble: 127 x 5
## # Groups: AGE [?]
## AGE SEX count mean sd
## <dbl> <fct> <int> <dbl> <dbl>
## 1 21 Male 14 23571. 12157.
## 2 21 Female 35 24857. 11472.
## 3 22 Male 106 33302. 25773.
## 4 22 Female 304 38618. 23520.
## 5 23 Male 200 39400 34021.
## 6 23 Female 515 66777. 50129.
## 7 24 Male 286 47273. 42089.
## 8 24 Female 615 87138. 69289.
## 9 25 Male 296 71318. 64734.
## 10 25 Female 603 117081. 90582.
## # ... with 117 more rows
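One way to see the interaction these group means suggest is to plot mean LIMIT_BAL against AGE with one line per SEX; diverging or crossing lines are visual evidence of an AGE:SEX interaction. A sketch using the tools already loaded:
# plot mean LIMIT_BAL by AGE, one line per SEX
training.raw %>%
  group_by(AGE, SEX) %>%
  summarise(mean_limit = mean(LIMIT_BAL, na.rm = TRUE)) %>%
  ungroup() %>%
  ggplot(aes(AGE, mean_limit, colour = SEX)) +
  geom_line()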
ANOVA assumes that the data are normally distributed and that the variance is homogeneous across groups. Let's check these assumptions.
# diagnostic plot 1: residuals versus fitted values
plot(aov3, 1)
We can see some outliers at the top of the plot that have the potential to affect normality and homogeneity of variance. It would be recommended to remove these.
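One way to act on that recommendation is to drop observations with large standardised residuals and refit; the |r| > 3 cut-off below is a common rule of thumb, not a choice from the original analysis:
# drop observations whose standardised residuals exceed 3 in absolute
# value, then refit the interaction model on the trimmed data
std_res <- rstandard(aov3)
training.trim <- training.raw[abs(std_res) <= 3, ]
aov3.trim <- aov(LIMIT_BAL ~ AGE * SEX, data = training.trim)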
The next plot, a normal Q-Q plot, checks the assumption that the residuals are normally distributed. For the residuals to be considered approximately normal, the points should follow an approximately straight line.
# diagnostic plot 2: normal Q-Q plot of the residuals
plot(aov3, 2)
The points do not follow a straight line, and there are substantial departures from normality: both the upper and lower tails depart from the fitted line. This S-like pattern is common with both long and short tails; here the departure is quite erratic, spanning roughly -1 to 1 on the theoretical quantile axis. We can conclude that the normal distribution is a poor model for these data. The next step might be a Tukey Lambda PPCC plot to identify an appropriate distributional family.
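Base R has no built-in Tukey Lambda PPCC plot; as a quick substitute check (my substitution, not the method named above), a Shapiro-Wilk test can be run on a random subsample of the residuals, since shapiro.test() accepts at most 5000 observations:
# quick normality check on a random subsample of residuals
# (shapiro.test() is limited to 5000 observations)
set.seed(42)
shapiro.test(sample(residuals(aov3), 5000))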
This exercise showed that both AGE and SEX have a significant effect on LIMIT_BAL, and that the two variables interact in that effect.