Confounders are variables that correlate (positively or negatively) with both the dependent variable and the independent variable. A confounding variable's presence distorts the relationship being studied, so the results do not reflect the true relationship. Correcting for this is known as "adjustment". At the design stage, the methods for overcoming confounding are:

  1. Randomization;
  2. Restriction; and
  3. Matching
eg. Confounding Variable Diagram

[src: https://www.med.uottawa.ca/sim/data/Images/Confounding.jpg]
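
To make the problem concrete, below is a minimal simulation (the variable names are hypothetical and the data are invented purely for illustration): a confounder drives both the exposure and the outcome, creating a spurious association that disappears once the confounder is included in the model.

# simulate a confounder that drives both exposure and outcome
set.seed(1)
n <- 1000
confounder <- rnorm(n)
exposure   <- confounder + rnorm(n)  # exposure is driven by the confounder
outcome    <- confounder + rnorm(n)  # outcome is driven by the confounder too

cor(exposure, outcome)                     # a spurious association appears
coef(lm(outcome ~ exposure))               # biased exposure estimate
coef(lm(outcome ~ exposure + confounder))  # adjusting removes the bias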

Using statistical analysis to remove confounding effects

Beyond the design stage, confounding is a type of bias that can be adjusted for after the data has been gathered, using statistical models. There are two key methods for dealing with confounding variables at the analysis stage:

  1. Stratification; and
  2. Multivariate Methods

Stratification fixes the level of the confounder, producing new groups within which the confounder does not vary. Each group is then evaluated to estimate the exposure-outcome association within that stratum of the confounder. In essence, within each stratum the confounder cannot confound, because it does not vary across the exposure-outcome relationship. Once stratification is complete, a Mantel-Haenszel (M-H) estimator can be used to pool the stratum-specific results into a single adjusted estimate, as sketched below.
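
A minimal sketch of the M-H approach in R, using a hypothetical 2 x 2 x 2 contingency table (exposure x outcome x stratum) with invented counts, purely for illustration:

# hypothetical exposure x outcome counts, stratified by a binary confounder
tab <- array(c(10, 5, 20, 40,   # stratum 1
               15, 8, 12, 30),  # stratum 2
             dim = c(2, 2, 2),
             dimnames = list(Exposure = c("Yes", "No"),
                             Outcome  = c("Case", "Control"),
                             Stratum  = c("1", "2")))

# mantelhaen.test() from the stats package pools the stratum-specific
# odds ratios into a single estimate adjusted for the stratifying variable
mantelhaen.test(tab)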

Multivariate Models

Stratification only works well when there are just one or two confounders to control. If there are more, multivariate methods offer the best solution.

Multivariate models handle multiple covariates at the same time. These models, each sketched briefly below, include:

  1. Logistic Regression;
  2. Linear Regression; and
  3. Analysis of Covariance (ANCOVA) and Analysis of Variance (ANOVA)
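
Below is a minimal sketch of all three model families on simulated data (the variables y, y_bin, x for the exposure and z for the confounder are all hypothetical):

# simulate a confounder z, an exposure x and two outcomes
set.seed(7)
df <- data.frame(z = rnorm(200))
df$x     <- df$z + rnorm(200)               # exposure, driven by z
df$y     <- df$z + 0.3 * df$x + rnorm(200)  # continuous outcome
df$y_bin <- rbinom(200, 1, plogis(df$y))    # binary outcome

summary(glm(y_bin ~ x + z, family = binomial, data = df))  # logistic regression
summary(lm(y ~ x + z, data = df))                          # linear regression
summary(aov(y ~ z + x, data = df))                         # ANCOVA-style model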

The purpose of this exercise is to apply ANOVA to a real-world data set. Analysis of Variance (ANOVA) partitions the variance in a response variable and can be used to control for potential confounding variables.

ANOVA models are generally one-way or two-way. This simply identifies whether the model uses a single factor or includes a second factor and the interaction between the two factors.

As an example, let's look at some financial data that shows whether people default in the next month.

# load the relevant packages (ggplot2, dplyr and readr are attached via
# tidyverse; stats is attached by default in R)
library(tidyverse)

# import the data
training.raw <- read_csv('AT2_credit_train_STUDENT.csv')

# remove NA rows - 2 rows where SEX had been recorded as cat/dog
training.raw <- training.raw[complete.cases(training.raw), ]

glimpse(training.raw)
## Observations: 23,098
## Variables: 17
## $ ID        <int> 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ LIMIT_BAL <dbl> 20000, 120000, 90000, 50000, 50000, 500000, 100000, ...
## $ SEX       <int> 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2...
## $ EDUCATION <int> 2, 2, 2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2, 1, 3, 1, 1, 3...
## $ MARRIAGE  <int> 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 3, 1, 2, 2...
## $ AGE       <int> 24, 26, 34, 37, 37, 29, 23, 28, 35, 34, 51, 41, 30, ...
## $ PAY_PC1   <dbl> 0.47746305, -1.46161240, -0.39330764, -0.39330764, -...
## $ PAY_PC2   <dbl> -3.224589961, 0.853866590, 0.175554996, 0.175554996,...
## $ PAY_PC3   <dbl> 0.145040802, -0.360863449, 0.004885522, 0.004885522,...
## $ AMT_PC1   <dbl> -1.7522077, -1.6633432, -1.1348380, -0.3971748, -0.3...
## $ AMT_PC2   <dbl> -0.22434761, -0.14388558, -0.17660872, -0.45109777, ...
## $ AMT_PC3   <dbl> -0.077841055, -0.054600404, 0.015954854, -0.09978950...
## $ AMT_PC4   <dbl> 0.006957244, -0.002851947, -0.129071306, -0.03533896...
## $ AMT_PC5   <dbl> -0.041356958, 0.043889122, 0.098245278, -0.055306582...
## $ AMT_PC6   <dbl> 0.0008865935, -0.0261897987, -0.0223825102, 0.050465...
## $ AMT_PC7   <dbl> -0.05626505, -0.09997756, -0.06898686, -0.02820475, ...
## $ default   <chr> "Y", "Y", "N", "N", "N", "N", "N", "Y", "N", "N", "N...

Let's explore our data with a quick plot.

# SEX is stored as an integer, so wrap it in factor() for a discrete colour scale
ggplot(training.raw, aes(LIMIT_BAL, AGE, colour = factor(SEX))) +
  geom_point()

The plot shows the general relationship between LIMIT_BAL and AGE, coloured by SEX.

The R function aov(), part of the stats package, helps us analyse the variance in the model.

aov1 <- aov(LIMIT_BAL ~ AGE+SEX, data = training.raw)
summary(aov1)
##                Df    Sum Sq   Mean Sq F value   Pr(>F)    
## AGE             1 5.655e+12 5.655e+12  368.42  < 2e-16 ***
## SEX             1 4.202e+11 4.202e+11   27.37 1.69e-07 ***
## Residuals   23095 3.545e+14 1.535e+10                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary above concludes that both SEX and AGE are statistically significant, with AGE the more significant of the two. This means that changes in AGE or SEX are associated with significant changes in the mean LIMIT_BAL.

The fitted model above is an additive model. It assumes that the two variables are independent. To see whether they might interact, we replace the + symbol with *.

#two way ANOVA with interaction effect

aov2 <- aov(LIMIT_BAL ~ AGE*SEX, data = training.raw)
summary(aov2)
##                Df    Sum Sq   Mean Sq F value   Pr(>F)    
## AGE             1 5.655e+12 5.655e+12  368.71  < 2e-16 ***
## SEX             1 4.202e+11 4.202e+11   27.40 1.67e-07 ***
## AGE:SEX         1 2.977e+11 2.977e+11   19.41 1.06e-05 ***
## Residuals   23094 3.542e+14 1.534e+10                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction model can also be written out explicitly, as in the code below.

aov3 <- aov(LIMIT_BAL ~ AGE + SEX + AGE:SEX, data = training.raw)
summary(aov3)
##                Df    Sum Sq   Mean Sq F value   Pr(>F)    
## AGE             1 5.655e+12 5.655e+12  368.71  < 2e-16 ***
## SEX             1 4.202e+11 4.202e+11   27.40 1.67e-07 ***
## AGE:SEX         1 2.977e+11 2.977e+11   19.41 1.06e-05 ***
## Residuals   23094 3.542e+14 1.534e+10                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the summary above, we can tell that the two main effects AGE and SEX are statistically significant as well as their interaction.

Interpreting the results

From the results above, based on the p-values and a significance level of 0.05:

  1. the p-value of AGE is < 2e-16 (significant), which indicates that the levels of AGE are associated with significantly different LIMIT_BAL;
  2. the p-value of SEX is 1.67e-07 (significant), which indicates that the levels of SEX are associated with significantly different LIMIT_BAL; and
  3. the p-value of the AGE:SEX interaction is 1.06e-05 (significant), which indicates that the relationship between SEX and LIMIT_BAL depends on the value of AGE (visualised in the sketch below).
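
One way to see this interaction is to plot group means: non-parallel lines for the two SEX groups across AGE are consistent with an interaction. A quick sketch, binning AGE into decades purely for illustration:

# mean LIMIT_BAL by AGE band and SEX; non-parallel lines suggest an interaction
training.raw %>%
  mutate(AGE_BAND = cut(AGE, breaks = seq(20, 80, by = 10))) %>%
  group_by(AGE_BAND, SEX) %>%
  summarise(mean_limit = mean(LIMIT_BAL)) %>%
  ungroup() %>%
  ggplot(aes(AGE_BAND, mean_limit, colour = factor(SEX), group = SEX)) +
  geom_line() +
  geom_point()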

Summary Statistics

Let's see what happens when we compute the group means and standard deviations.
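
The output below shows SEX as a labelled factor (Male/Female). One way to get there is to recode it first, assuming the usual credit-data coding of 1 = male and 2 = female (an assumption worth checking against the data dictionary):

# recode SEX as a labelled factor for readable output
# (assumes 1 = male, 2 = female; verify against the data dictionary)
training.raw <- training.raw %>%
  mutate(SEX = factor(SEX, levels = c(1, 2), labels = c("Male", "Female")))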

training.raw %>%
  group_by(AGE, SEX) %>%
  summarise(
    count = n(),
    mean = mean(LIMIT_BAL, na.rm = TRUE),
    sd = sd(LIMIT_BAL, na.rm = TRUE)
  )
## # A tibble: 127 x 5
## # Groups:   AGE [?]
##      AGE SEX    count    mean     sd
##    <dbl> <fct>  <int>   <dbl>  <dbl>
##  1    21 Male      14  23571. 12157.
##  2    21 Female    35  24857. 11472.
##  3    22 Male     106  33302. 25773.
##  4    22 Female   304  38618. 23520.
##  5    23 Male     200  39400  34021.
##  6    23 Female   515  66777. 50129.
##  7    24 Male     286  47273. 42089.
##  8    24 Female   615  87138. 69289.
##  9    25 Male     296  71318. 64734.
## 10    25 Female   603 117081. 90582.
## # ... with 117 more rows

ANOVA Assumptions - Validity

ANOVA assumes that the residuals are normally distributed and that the variance is homogeneous across groups. Let's check this.
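
Alongside the diagnostic plots below, a formal check of the homogeneity assumption is possible with Bartlett's test from the stats package (shown here across SEX groups only, as a supplementary check):

# Bartlett's test of homogeneity of variance in LIMIT_BAL across SEX groups
bartlett.test(LIMIT_BAL ~ factor(SEX), data = training.raw)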

plot(aov3, 1)  # residuals vs fitted values

The residuals vs fitted plot shows some outlying points at the top that have the potential to affect normality and homogeneity of variance. It would be advisable to inspect and potentially remove these, as sketched below.
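
A sketch of flagging the most extreme points for inspection (the cutoff of |standardised residual| > 3 is a common rule of thumb, not a fixed rule):

# flag observations with large standardised residuals for inspection
std_res  <- rstandard(aov3)
outliers <- which(abs(std_res) > 3)
training.raw[outliers, c("ID", "LIMIT_BAL", "AGE", "SEX")]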

Check Normality

This plot checks the assumption that the residuals are normally distributed. If they are approximately normal, the points should follow a roughly straight line.

plot(aov3, 2)  # normal Q-Q plot of the residuals

The points do not follow a straight line, and there are many departures from normality; in particular, the upper and lower tails depart from the fitted line.

This S-like pattern is common when the tails are longer or shorter than normal. Here the pattern is quite erratic, with departures beginning at roughly -1 and 1 on the theoretical quantiles.

We can conclude that the normal distribution can be improved upon as a model for this data. The next step might be a Tukey Lambda PPCC plot to identify an appropriate distributional family.

This exercise showed that AGE and SEX are each significantly associated with LIMIT_BAL, and that the effect of SEX on LIMIT_BAL depends on AGE (an AGE:SEX interaction).