Load some potentially useful packages:

library(ggplot2)

library(dplyr)


Question 1

Consider two models involving 3 variables: A, B, and C (and built using the same data):

Model 1: A ~ B

Model 2: A ~ B + C

Explain in your own words when we would say that Simpson’s Paradox is occurring.

Simpson’s paradox would occur if the addition of variable C changes the sign of the coefficient of variable B



Question 2

Consider the Diamonds.csv data that we’ve used in some previous activities. Recall that this file has data on the price, carat, color, certification, and clarity of 308 diamonds. Read the data into RStudio:

Diamonds = read.csv( "https://raw.githubusercontent.com/vittorioaddona/data/main/Diamonds.csv" )


(a) Suppose we’re interested in exploring the relationship between price and clarity. Fit 2 models, one that uses clarity as the only explanatory variable, and another that also controls for the carat size of the diamond.

lm(formula=price~clarity,data=Diamonds)
## 
## Call:
## lm(formula = price ~ clarity, data = Diamonds)
## 
## Coefficients:
## (Intercept)   clarityVS1   clarityVS2  clarityVVS1  clarityVVS2  
##        2695         2362         3163         2873         2662
lm(formula=price~clarity+carat,data=Diamonds)
## 
## Call:
## lm(formula = price ~ clarity + carat, data = Diamonds)
## 
## Coefficients:
## (Intercept)   clarityVS1   clarityVS2  clarityVVS1  clarityVVS2        carat  
##     -1851.2      -1001.2      -1561.9       -403.7       -958.8      12226.4


(b) Examine the output of the models you created in (a), and describe in what sense Simpson’s Paradox has occurred.

when the only explanatory variable is clarity, we have a positive correlation or connection between clarity and price, but when you add carat as an additional explanatory variable, the connection between price and clarity flips, in this model the correlation between price and clarity is negative, showing clarity is a confounding variable and fits the simpson’s paradox


(c) Which variable is the “confounding variable” in this example?

clarity


(d) Investigate the relationship between: 1. carat and price and 2. carat and clarity. Explain how these relationships lead to the results seen in the (price ~ clarity) model.

lm(formula=carat~price,data=Diamonds)
## 
## Call:
## lm(formula = carat ~ price, data = Diamonds)
## 
## Coefficients:
## (Intercept)        price  
##   2.447e-01    7.695e-05
lm(formula=carat~clarity,data=Diamonds)
## 
## Call:
## lm(formula = carat ~ clarity, data = Diamonds)
## 
## Coefficients:
## (Intercept)   clarityVS1   clarityVS2  clarityVVS1  clarityVVS2  
##      0.3718       0.2751       0.3865       0.2680       0.2961

The coefficients for these models show that there is a relationship of a similar nature between carat and clarity, and carat and price, showing that carat is the confounding variable that has the same connection to both of the other two variables



Question 3

A study evaluating the effectiveness of a hypothetical drug yields the following results, both overall and stratified by (or grouped by) another categorical variable (which could either equal A or B):

Overall:

Drug Recovered Didn’t Recover Total
Not Given 2000 2000 4000
Given 1600 2400 4000
Total 3600 4400 8000


Category A:

Drug Recovered Didn’t Recover Total
Not Given 200 800 1000
Given 900 2100 3000
Total 1100 2900 4000


Category B:

Drug Recovered Didn’t Recover Total
Not Given 1800 1200 3000
Given 700 300 1000
Total 2500 1500 4000


(a) Using the Overall table, find the odds ratio for recovery based on whether the drug was given or not (place the odds of recovery in the “Not Given” group in the numerator).

1.5


(b) Now, calculate the analogous odds ratios among category A patients, and among category B patients.

Cat A=0.583333 Cat B=0.6428


(c) The data used to create these tables is contained in drug_data.csv. Read in this data, and show how to reproduce the odds ratio you found in (a) by fitting an appropriate model.

NOTE: If you are fitting a logistic regression model, use the Outcome01 version of the outcome (where 1=Recovered and 0=Not Recovered), since the response variable in glm needs to be numerical.

Drug_Data=read.csv("https://raw.githubusercontent.com/vittorioaddona/data/main/drug_data.csv")
glm(Outcome01~Drug, family=binomial, data=Drug_Data)
## 
## Call:  glm(formula = Outcome01 ~ Drug, family = binomial, data = Drug_Data)
## 
## Coefficients:
##   (Intercept)  DrugNot Given  
##       -0.4055         0.4055  
## 
## Degrees of Freedom: 7999 Total (i.e. Null);  7998 Residual
## Null Deviance:       11010 
## Residual Deviance: 10930     AIC: 10930

the DrugNot Given coefficient to the power of e is the odds ratio

(d) Now fit another model to illustrate that a Simpson’s Paradox is occurring.

glm(Outcome01~Drug+Category,family=binomial,data=Drug_Data)
## 
## Call:  glm(formula = Outcome01 ~ Drug + Category, family = binomial, 
##     data = Drug_Data)
## 
## Coefficients:
##   (Intercept)  DrugNot Given      CategoryB  
##       -0.8584        -0.4849         1.7391  
## 
## Degrees of Freedom: 7999 Total (i.e. Null);  7997 Residual
## Null Deviance:       11010 
## Residual Deviance: 9926  AIC: 9932

When a model using only Outcome01 and Drug as variables, you have a positive relationship between the two, however when you control for category as well, you see the opposite relationship between Outcome01 and Drug

(e) If we want to reproduce the odds ratios from (b), we will need an interaction model. Fit the interaction model, and show how its coefficients can be used to reproduce the odds ratios you found in (b).

glm(Outcome01~Drug+Category+Drug:Category,family=binomial,data=Drug_Data)
## 
## Call:  glm(formula = Outcome01 ~ Drug + Category + Drug:Category, family = binomial, 
##     data = Drug_Data)
## 
## Coefficients:
##             (Intercept)            DrugNot Given                CategoryB  
##                -0.84730                 -0.53900                  1.69460  
## DrugNot Given:CategoryB  
##                 0.09716  
## 
## Degrees of Freedom: 7999 Total (i.e. Null);  7996 Residual
## Null Deviance:       11010 
## Residual Deviance: 9926  AIC: 9934

in order to find the odds ratio for category A, we need to do e to the power of the coefficient for DrugNot Given, for category B, we need to do e to the power of the coefficient for DrugNot Given subtracted by the coefficient for DrugNot Given:Category B


(f) In the end, we want to address the research question on the effectiveness of this drug: Does the drug improve the recovery rate? Briefly justify your response.

The drug did improve recovery rate. This is because when you control for the categories, the odds ratio favors the drug in both categories.


(g) Carefully explain why the Overall table and the stratified tables lead to odds ratios that are on opposite sides of 1. Be as specific as possible in your response.

It is an example of the Simpson’s Paradox. More specifically, the reason why “not given” is favored in the overall recovery rate but not in either category is because “not given” has a higher occurrence in the category with the higher aggregate recovery rate (category B). This leads to the result that more people who recovered did not take the drug, but this is simply a percentage of a total rather than a rate of recovery, because a rate of recovery requires that all possible confounding variables be constrained, when this occurs, the odds ratio favors the recovery rate of people who took the drug over those who did not.

Question 4

(a) Consider the following 2 directed acyclic graphs (DAGs):

DAG 1:


DAG 2:


Based on the following model outputs, which of the 2 above DAGs (DAG 1 or DAG 2) are you more inclined to believe as the truth? Explain your reasoning.

DAG 2 because the first model shows that there is a connection between y and x1, but when x2 is constrained as in the 2nd model, the connection between x1 and y is no longer present, this shows that DAG 2 is correct because it shows that when x2 is constrained or removed, no connection exists between x1 and y. ### lm( y ~ x1 , data=...)

(Intercept) x1
10 5

lm( y ~ x1 + x2 , data=...)

(Intercept) x1 x2
10 0 7


(b) Consider the following 2 directed acyclic graphs (DAGs):

DAG 1:


DAG 2:


Based on the following model outputs, which of the 2 above DAGs (DAG 1 or DAG 2) are you more inclined to believe as the truth? Explain your reasoning.

DAG 2, because when you don’t control for z1, there is a direct connection between x1 and y as seen in model 1. However, when z1 is constrained, that connection only exists as indirect, from x1 to z1 to y. ### lm( y ~ x1 + z2 , data=...)

(Intercept) x1 z2
10 7 9

lm( y ~ x1 + z2 + z1 , data=...)

(Intercept) x1 z2 z1
10 4 9 3



Question 5

You and a friend are taking 2 different classes. You meet up after your respective mid-term exams, and agree that they were tough. You got 65% and your friend got 70%. You meet up after your respective final exams, and agree that they went better! You got 85% and your friend got 90%.

Can you end up with a higher numerical grade than your friend? (assume that the mid-term and final exam are the only graded items in the classes).


NOTE: The point of this question is to illustrate how Simpson’s Paradox operates; it is entirely plausible and not a paradox at all!



Question 6

One measure of fitness is percent body fat. Unlike body mass index (BMI), percent body fat is not easy to calculate. BodyFat.csv has estimates of percent body fat (determined by underwater weighing) and various body circumference measurements for 252 men. The variables are:

BF: percent body fat from Siri’s equation (percentage, 0-100)

Age: individual’s age in years

Weight: individual’s weight in pounds (lbs)

Height: individual’s height in inches

Neck: individual’s neck circumference in cm

Chest: individual’s chest circumference in cm

Abdomen: individual’s abdomen circumference in cm

Hip: individual’s hip circumference in cm

Thigh: individual’s thigh circumference in cm

Knee: individual’s knee circumference in cm

Ankle: individual’s ankle circumference in cm

Biceps: individual’s bicep circumference in cm

Forearm: individual’s forearm circumference in cm

Wrist: individual’s wrist circumference in cm


Read in the data:

BodyFat = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/BodyFat.csv')


One data point is very different from the others. It doesn’t affect the point of this question, but let’s remove it:

BodyFat = BodyFat %>% filter( row_number() != 42 )


(b) Investigate the relationship between height and percent body fat with a graph.


(c) How does your intuition from (a) compare with what you see from your graph in (b)?


(d) Fit a model for percent body fat using height. The graph from (b) should tell you something about the magnitude of one of the coefficients; which one, and what can you infer?


(e) Now, control for Weight in the model from (d). What happens to the coefficient on Height? Explain in your own words how this makes sense.


(f) Re-make the graph from (b), and add the model from (d) to it. Then, add 3 lines to this graph corresponding to the model from (e) for people who weigh 150, 200, and 250 pounds, respectively.


You should notice a dramatic change in the relationship between percent body fat and height at any fixed weight.


(g) Consider another example: Make a model for BF that uses Weight. What is the direction of this relationship?


(h) Do you think it is possible for the relationship between BF and Weight to reverse when holding some other variable fixed?

Think about where body fat might be stored on the body. Look through the list of variables and control for something based on your thoughts.

What happens? Try to think of a plausible story for how this result could make sense.



Question 7

During the early 2000s, the U.S. experienced a boom in the housing industry. The question, How much is a house worth?, is difficult to answer because it boils down to what the market is willing to pay. A real estate agent gathered data on 28 home sales and wishes to develop a model that relates sale price Price, (in thousands of dollars) to other aspects of the house: size of the lot (Acres), number of bedrooms (Bedrooms), number of bathrooms (Bathrooms), square footage (SqFt), age in years (Age), and total number of rooms (Rooms). The data is in Houseprices.csv, but you don’t need it to answer this question.

The SqFt variable leads to a few interesting results.

(a) Interpret the coefficient on Age in each of the two models below. Then, try to explain why controlling for SqFt led to a change in the direction of the relationship between Price and Age.

the coefficient for age in the first model shows that as a house grows older, the price drops, but in the second model, it shows that as a house of a certain square feet gets older, the price rises. This could be because of a difference between the average square feet of older houses and newer houses. This means that when square feet is controlled for, the price of houses goes up over time. Square feet is an example of a confounding variable. ### lm( Price ~ Age , data=Houseprices )

(Intercept) Age
373.792 -1.922

lm( Price ~ Age + SqFt , data=Houseprices )

(Intercept) Age SqFt
-201.7278 1.4531 0.2196


(b) Repeat this process for Bedrooms:

lm(Price ~ Bedrooms , data=Houseprices)

the first model shows that as price goes up, so do the number of bedrooms. However, the second model shows that in two houses with the same square footage, a greater number of bedrooms will force the price down. The reason for the change is the same as for age, Square footage is a confounding variable. This means that when Bedrooms and Price are compared by themselves, the price will rise with the number of bedrooms, but when Square Footage is fixed, the price decreases as the number of bedrooms rises. | (Intercept) | Bedrooms | | :———–: | :———–: | | -209.2 | 143.2 |

lm(Price ~ Bedrooms + SqFt , data=Houseprices)

(Intercept) Bedrooms SqFt
50.7588 -76.5550 0.2618



Question 8

Consider the following directed acyclic graph (DAG):



Suppose our research question is, Does chronic kidney disease lead to elevated mortality risk? This question is depicted by the “?” in the DAG. Based on this DAG, should we control for age in our investigation? If you answer ‘no’, explain why not, and if you answer ‘yes’, explain why we might be led astray by not controlling for age.

We should control for age because as you age, your immune system strength and general internal functions decline. If age is not controlled for, then the data will show that chronic kidney disease leads to higher mortality rates. This is not a good conclusion because it is saying that the chance of dying from kidney disease is the same in a 45 year old and a 70 year old, which is very much a false conclusion.