Research Question:

Is there a relationship between Sex and Survival?


Simple model:

model1 <- glm(Survived ~ Sex, family=binomial(link='logit'), data=data)
model1$coefficients


Detecting Confounding Variables

Example: Is Age a confounder of Sex?

Example: Is Pclass a confounder of Sex?

Let’s settle on considering Pclass a confounding variable and include it in the model alongside Sex.

More analysis is needed though…


Bivariate Association Analysis

Let’s check the bivariate associations between Pclass and Sex and between Pclass and Survived to see if the theory is supported by the data.

Given that both are categorical variables we must use a Chi-square test. This test allows us to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies of the relevant contingency tables:

table(Pclass, Sex)
chisq.test(Pclass, Sex)
table(Pclass, Survived)
chisq.test(Pclass, Survived)

Interpretation of the above outputs: Given that in both cases p<0.5, Pclass is significantly associated with both Sex and Survived, thus the variable seems to meet the definition of a confounding variable.


Greater than 10% change in x1 coefficient?

Now let’[s addd Pclass to the model:

model2 <- glm(Survived ~ Sex + Pclass, family=binomial(link='logit'), data=data)
model2$coefficients

The coefficient of Sexmale does not change much (only 5%) which suggests that Pclass is not confounder. This contradicts the previous finding.


Is the model better when we include the confounding variable?


The residual deviance of the model that includes Pclass is lower. Based on this we can reject \(H_0\) that there is no difference betwwen full and reduced model i.e. that the full model is not significantly better than the reduced model. Thus we conclude that we have evidence to support \(H_a\) that the model that includes Pclass is significantly better. This is confirmed by the ‘lrtest()’ function below:

lrtest(model1, model2)
#??lr.test

So, keep it or not? It depends…Keep Pclass for now….


Assessing Collinearity

Let’s look at adding collinear variables AgeGroup and AgeGroup2029.

We will assess which one is better by looking at AIC. Not that we don’t simply model say Survived and AgeGroup but rather look at how their respective additions to the existing model impact the results.

model3 <- glm(Survived ~ Sex + Pclass + AgeGroup, family=binomial(link='logit'), data=data)

model4 <- glm(Survived ~ Sex + Pclass + AgeGroup2029, family=binomial(link='logit'), data=data)

AIC for model with AgeGroup: 965.929 AIC for model with AgeGroup2029: 965.929

AIC is slightly lower for the model with *AgeGroup** so moving forward the model will include Sex, Pclass, AgeGroup explanatory variables.

careful, this only temporary, AgeGroup results in missing data.


Assess Interaction via Likelihood Ratio Test

If some variable does not meet the criteria of a confounder, it could still be an effect modifier! Does the effect that \(x1\) has on the odds of \(y\) depend on some other variable \(x2\)? If so then we call the variable \(x2\) an effect modifier

Thus now we try to answer the question:

Does the effect that Sex has on Survived depend on Pclass?

An interaction terms is the multiplication of two variable with each other e.g. Sex * Pclass.

We can use a Likelihood Ratio Test to compare models with and without a \(Sex \times Pclass\) interaction term.

To compare two models we use the anova() function.

model3bis <- glm(Survived ~ Sex + Pclass + AgeGroup + Sex * Pclass, family=binomial(link='logit'), data=data)
anova(model3, model3bis, test='Chisq')

We can see that the p-value is < 0.5 and we reject the null hypothesis that the residual deviance is the same for models that do and do not include the interaction term. The model that includes the interaction term IS significantly better.

This suggests that Pclass is an effect modifier. The effect of Sex on the odds of Survived varies, or differs, depending on Pclass.

There is significant interaction between Sex and Pclass.

This means that we want to report on the effect of Pclass for different Sex categories: anything we say about the effect of Sex on Survived needs to be Pclass-specific.

Can a class be both a confounder and an effect modifier??

model3bis <- glm(Survived ~ Sex + Pclass + AgeGroup + Sex * Pclass, family=binomial(link='logit'), data=data)
summary(model3)
#summary(model3bis)

Note that if the effect modifier was continuous e.g. Age we need to do the calculations for the Pclass-specific odds ratios by hand and assign possible values to compare groups e.g. Age=50, Age=60, etc. Example: \(b_1 + 50 \times (age \times survived interaction term)\) (see video 17:44).

Extra step (comparfed to LR) to make log-odds more interpretable is to compute the odd-ratios by exponentiating the log-odds ratios.

---
title: "Interpreting the Output of a Logistic Regression Model (Part 2)"
output: html_notebook
---

```{r setup, include=FALSE, message=FALSE, error=TRUE}
library(epiDisplay) # likelihood ratio test
options(digits=3)
attach(data) # run titanic_train.R first
```

<br>

### Research Question: 

>  Is there a relationship between Sex and Survival?

* Because we have a binary outcome, we can use logistic regression.
* When we are interested in the relationship between a single X variable and Y it is referred to as a comparative, or *effect size*, model. 
* We want to include other explanatory variables in the model in order to get the least biased estimate of the relationship between Y (Survived) and X (Sex).

<br>

### Simple model:

```{r}
model1 <- glm(Survived ~ Sex, family=binomial(link='logit'), data=data)
model1$coefficients
```

* The log-odds of Survival is `r model1$coefficients[2]` times higher for men compared to women. 
* Odds-ratio: Men have `r exp(model1$coefficients[2])` times the odds of survival (i.e. `r round((exp(model1$coefficients[2]) - 1) *100,0)`%)  than women.
* 95% Confidence Interval for OR is `exp(b1 +/- (St.Error of b1 * 1.96))`: 
(
`r exp(model1$coefficient[2] - (summary(model1)$coefficients[, 2][2] * 1.96))`
, `r exp(model1$coefficient[2] + (summary(model1)$coefficients[, 2][2] * 1.96))`
)  

<br>

### Detecting Confounding Variables

* Same principles as with linear regression.
* Does variable meet the definition of a confounder?
  * Does it make sense that the variable is associated with $x_1$ and is a predictor of $y$ and is not on the causal pathway?
  * How are these variables associated when we look at their **bivariate associations**?
  * Is there a greater than 10% change in $b_1$?
  
**Example**: Is *Age* a confounder of *Sex*?

* *Age* is likely a predictor of *Survived*: children and elderly evacuated first.
* BUT *Age* could be on the causal pathway between Sex and Survival: e.g. young woman evacuated first?

**Example**: Is *Pclass* a confounder of *Sex*?

* *Pclass* is likely a predictor of *Survived*: Social status and wealth may give some advantage during evacuation.
* It wouldn't make sense that *Pclass* is on the causal pathway between *Sex* and *Survived*.
* But could there be any form of association between *Sex* and *Pclass*? Working class men travelling 3rd class to emigrate to the USA for a new life?

Let's settle on considering *Pclass* a confounding variable and include it in the model alongside *Sex*.

More analysis is needed though...

<br>

**Bivariate Association Analysis**

Let's check the bivariate associations between *Pclass* and *Sex* and between *Pclass* and *Survived* to see if the theory is supported by the data.

Given that both are categorical variables we must use a Chi-square test. This test allows us to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies of the relevant contingency tables:

* *Pclass* vs. *Sex*
```{r}
table(Pclass, Sex)
chisq.test(Pclass, Sex)
```
* *Pclass* vs. *Survived*

```{r}
table(Pclass, Survived)
chisq.test(Pclass, Survived)
```

Interpretation of the above outputs: Given that in both cases p<0.5, *Pclass* is significantly associated with both *Sex* and *Survived*, thus the variable seems to meet the definition of a confounding variable.

<br>

**Greater than 10% change in x1 coefficient?**

* Recall that in the simple model `Survived ~ Sex`, the coefficient of *Sexmale* was equal to `r model1$coefficients[2]`.

Now let'[s addd *Pclass* to the model:

```{r}
model2 <- glm(Survived ~ Sex + Pclass, family=binomial(link='logit'), data=data)
model2$coefficients
```

The coefficient of *Sexmale* does not change much (only 5%) which suggests that *Pclass* is not confounder. This contradicts the previous finding. 

<br>

**Is the model better when we include the confounding variable?**

* AIC

  * AIC for model 1: `r AIC(model1)` 
  * AIC for model 2: `r AIC(model2)`

<br>

* Likelihood Ratio Test

  * SSResid reduced model =  `r model1$deviance`
  * SSResid full model = `r model2$deviance`


   The residual deviance of the model that includes *Pclass* is lower. Based on this we can reject $H_0$ that there is no difference betwwen full and reduced model i.e. that the full model is not significantly better than the reduced model. Thus we conclude that we have evidence to support $H_a$ that the model that includes *Pclass* is significantly better. This is confirmed by the 'lrtest()' function below:

```{r}
lrtest(model1, model2)
#??lr.test
```

So, keep it or not? It depends...Keep *Pclass* for now....

<br>

**Assessing Collinearity**

Let's look at adding collinear variables AgeGroup and AgeGroup2029.

We will assess which one is better by looking at AIC. Not that we don't simply model say Survived and AgeGroup but rather look at how their respective additions to the existing model impact the results.

```{r}
model3 <- glm(Survived ~ Sex + Pclass + AgeGroup, family=binomial(link='logit'), data=data)

model4 <- glm(Survived ~ Sex + Pclass + AgeGroup2029, family=binomial(link='logit'), data=data)
```
AIC for model with AgeGroup: `r AIC(model3)`
AIC for model with AgeGroup2029: `r AIC(model3)`

AIC is slightly lower for the model with *AgeGroup** so moving forward the model will include Sex, Pclass, AgeGroup explanatory variables.

> careful, this only temporary, AgeGroup results in missing data.

<br>

### Assess Interaction via Likelihood Ratio Test

**If some variable does not meet the criteria of a confounder, it could still be an effect modifier!** Does the effect that $x1$ has on the odds of $y$ depend on some other variable $x2$? If so then we call the variable $x2$ an **effect modifier** 

Thus now we try to answer the question:

> Does the effect that *Sex* has on *Survived* **depend on** *Pclass*? 

An interaction terms is the multiplication of two variable with each other e.g. `Sex * Pclass`.

We can use a Likelihood Ratio Test to compare models with and without a $Sex \times Pclass$ [interaction term](https://stats.stackexchange.com/questions/127549/creating-interaction-term-for-dummy-variables-and-categorical-variables).

To compare two models we use the `anova()` function.

```{r}
model3bis <- glm(Survived ~ Sex + Pclass + AgeGroup + Sex * Pclass, family=binomial(link='logit'), data=data)
anova(model3, model3bis, test='Chisq')
```

We can see that the p-value is < 0.5 and we reject the null hypothesis that the residual deviance is the same for models that do and do not include the interaction term. The model that includes the interaction term IS significantly better.

This suggests that *Pclass* is an effect modifier. The effect of Sex on the odds of Survived varies, or differs, depending on Pclass.

There is significant interaction between Sex and Pclass. 

This means that we want to report on the effect of Pclass for different Sex categories: anything we say about the effect of Sex on Survived needs to be Pclass-specific.

> Can a class be both a confounder and an effect modifier??

```{r}
model3bis <- glm(Survived ~ Sex + Pclass + AgeGroup + Sex * Pclass, family=binomial(link='logit'), data=data)
summary(model3)
```

* Pclass = 1: Among passengers that travelled 1st class, the log-odds of survival are x times higher/lower for men compared to women, adjusting for AgeGroup.


* Pclass = 2:
* Pclass = 3:

```{r}
#summary(model3bis)
```

Note that if the effect modifier was continuous e.g. Age we need to do the calculations for the Pclass-specific odds ratios by hand and assign possible values to compare groups e.g. Age=50, Age=60, etc. Example: $b_1 + 50 \times (age \times survived interaction term)$ (see video 17:44).

Extra step (comparfed to LR) to make log-odds more interpretable is to compute the odd-ratios by exponentiating the log-odds ratios.  

