library(ggplot2)
library(dplyr)
Simpson’s paradox would occur if the addition of variable C changes the sign of the coefficient of variable B
Diamonds = read.csv( "https://raw.githubusercontent.com/vittorioaddona/data/main/Diamonds.csv" )
price and clarity. Fit 2 models, one that uses
clarity as the only explanatory variable, and another that
also controls for the carat size of the diamond.lm(formula=price~clarity,data=Diamonds)
##
## Call:
## lm(formula = price ~ clarity, data = Diamonds)
##
## Coefficients:
## (Intercept) clarityVS1 clarityVS2 clarityVVS1 clarityVVS2
## 2695 2362 3163 2873 2662
lm(formula=price~clarity+carat,data=Diamonds)
##
## Call:
## lm(formula = price ~ clarity + carat, data = Diamonds)
##
## Coefficients:
## (Intercept) clarityVS1 clarityVS2 clarityVVS1 clarityVVS2 carat
## -1851.2 -1001.2 -1561.9 -403.7 -958.8 12226.4
when the only explanatory variable is clarity, we have a positive correlation or connection between clarity and price, but when you add carat as an additional explanatory variable, the connection between price and clarity flips, in this model the correlation between price and clarity is negative, showing clarity is a confounding variable and fits the simpson’s paradox
clarity
carat and
price and 2. carat and clarity.
Explain how these relationships lead to the results seen in the
(price ~ clarity) model.lm(formula=carat~price,data=Diamonds)
##
## Call:
## lm(formula = carat ~ price, data = Diamonds)
##
## Coefficients:
## (Intercept) price
## 2.447e-01 7.695e-05
lm(formula=carat~clarity,data=Diamonds)
##
## Call:
## lm(formula = carat ~ clarity, data = Diamonds)
##
## Coefficients:
## (Intercept) clarityVS1 clarityVS2 clarityVVS1 clarityVVS2
## 0.3718 0.2751 0.3865 0.2680 0.2961
The coefficients for these models show that there is a relationship of a similar nature between carat and clarity, and carat and price, showing that carat is the confounding variable that has the same connection to both of the other two variables
| Drug | Recovered | Didn’t Recover | Total |
|---|---|---|---|
| Not Given | 2000 | 2000 | 4000 |
| Given | 1600 | 2400 | 4000 |
| Total | 3600 | 4400 | 8000 |
| Drug | Recovered | Didn’t Recover | Total |
|---|---|---|---|
| Not Given | 200 | 800 | 1000 |
| Given | 900 | 2100 | 3000 |
| Total | 1100 | 2900 | 4000 |
| Drug | Recovered | Didn’t Recover | Total |
|---|---|---|---|
| Not Given | 1800 | 1200 | 3000 |
| Given | 700 | 300 | 1000 |
| Total | 2500 | 1500 | 4000 |
1.5
Cat A=0.583333 Cat B=0.6428
Outcome01 version of the outcome (where 1=Recovered and
0=Not Recovered), since the response variable in glm needs
to be numerical.Drug_Data=read.csv("https://raw.githubusercontent.com/vittorioaddona/data/main/drug_data.csv")
glm(Outcome01~Drug, family=binomial, data=Drug_Data)
##
## Call: glm(formula = Outcome01 ~ Drug, family = binomial, data = Drug_Data)
##
## Coefficients:
## (Intercept) DrugNot Given
## -0.4055 0.4055
##
## Degrees of Freedom: 7999 Total (i.e. Null); 7998 Residual
## Null Deviance: 11010
## Residual Deviance: 10930 AIC: 10930
the DrugNot Given coefficient to the power of e is the odds ratio
glm(Outcome01~Drug+Category,family=binomial,data=Drug_Data)
##
## Call: glm(formula = Outcome01 ~ Drug + Category, family = binomial,
## data = Drug_Data)
##
## Coefficients:
## (Intercept) DrugNot Given CategoryB
## -0.8584 -0.4849 1.7391
##
## Degrees of Freedom: 7999 Total (i.e. Null); 7997 Residual
## Null Deviance: 11010
## Residual Deviance: 9926 AIC: 9932
When a model using only Outcome01 and Drug as variables, you have a
positive relationship between the two, however when you control for
category as well, you see the opposite relationship between Outcome01
and Drug
glm(Outcome01~Drug+Category+Drug:Category,family=binomial,data=Drug_Data)
##
## Call: glm(formula = Outcome01 ~ Drug + Category + Drug:Category, family = binomial,
## data = Drug_Data)
##
## Coefficients:
## (Intercept) DrugNot Given CategoryB
## -0.84730 -0.53900 1.69460
## DrugNot Given:CategoryB
## 0.09716
##
## Degrees of Freedom: 7999 Total (i.e. Null); 7996 Residual
## Null Deviance: 11010
## Residual Deviance: 9926 AIC: 9934
in order to find the odds ratio for category A, we need to do e to the power of the coefficient for DrugNot Given, for category B, we need to do e to the power of the coefficient for DrugNot Given subtracted by the coefficient for DrugNot Given:Category B
The drug did improve recovery rate. This is because when you control for the categories, the odds ratio favors the drug in both categories.
It is an example of the Simpson’s Paradox. More specifically, the
reason why “not given” is favored in the overall recovery rate but not
in either category is because “not given” has a higher occurrence in the
category with the higher aggregate recovery rate (category B). This
leads to the result that more people who recovered did not take the
drug, but this is simply a percentage of a total rather than a rate of
recovery, because a rate of recovery requires that all possible
confounding variables be constrained, when this occurs, the odds ratio
favors the recovery rate of people who took the drug over those who did
not.
DAG 2 because the first model shows that there is a connection
between y and x1, but when x2 is constrained as in the 2nd model, the
connection between x1 and y is no longer present, this shows that DAG 2
is correct because it shows that when x2 is constrained or removed, no
connection exists between x1 and y. ###
lm( y ~ x1 , data=...)
| (Intercept) | x1 |
|---|---|
| 10 | 5 |
lm( y ~ x1 + x2 , data=...)| (Intercept) | x1 | x2 |
|---|---|---|
| 10 | 0 | 7 |
DAG 2, because when you don’t control for z1, there is a direct
connection between x1 and y as seen in model 1. However, when z1 is
constrained, that connection only exists as indirect, from x1 to z1 to
y. ### lm( y ~ x1 + z2 , data=...)
| (Intercept) | x1 | z2 |
|---|---|---|
| 10 | 7 | 9 |
lm( y ~ x1 + z2 + z1 , data=...)| (Intercept) | x1 | z2 | z1 |
|---|---|---|---|
| 10 | 4 | 9 | 3 |
BodyFat = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/BodyFat.csv')
BodyFat = BodyFat %>% filter( row_number() != 42 )
Weight in the model from (d). What
happens to the coefficient on Height? Explain in your own
words how this makes sense.BF that
uses Weight. What is the direction of this
relationship?BF and Weight to reverse when holding some
other variable fixed?
Price, (in
thousands of dollars) to other aspects of the house: size of the lot
(Acres), number of bedrooms (Bedrooms), number
of bathrooms (Bathrooms), square footage
(SqFt), age in years (Age), and total number
of rooms (Rooms). The data is in Houseprices.csv,
but you don’t need it to answer this question.SqFt variable leads to a few interesting
results.Age in each of the two
models below. Then, try to explain why controlling for SqFt
led to a change in the direction of the relationship between
Price and Age.the coefficient for age in the first model shows that as a house
grows older, the price drops, but in the second model, it shows that as
a house of a certain square feet gets older, the price rises. This could
be because of a difference between the average square feet of older
houses and newer houses. This means that when square feet is controlled
for, the price of houses goes up over time. Square feet is an example of
a confounding variable. ###
lm( Price ~ Age , data=Houseprices )
| (Intercept) | Age |
|---|---|
| 373.792 | -1.922 |
lm( Price ~ Age + SqFt , data=Houseprices )| (Intercept) | Age | SqFt |
|---|---|---|
| -201.7278 | 1.4531 | 0.2196 |
Bedrooms:lm(Price ~ Bedrooms , data=Houseprices)the first model shows that as price goes up, so do the number of bedrooms. However, the second model shows that in two houses with the same square footage, a greater number of bedrooms will force the price down. The reason for the change is the same as for age, Square footage is a confounding variable. This means that when Bedrooms and Price are compared by themselves, the price will rise with the number of bedrooms, but when Square Footage is fixed, the price decreases as the number of bedrooms rises. | (Intercept) | Bedrooms | | :———–: | :———–: | | -209.2 | 143.2 |
lm(Price ~ Bedrooms + SqFt , data=Houseprices)| (Intercept) | Bedrooms | SqFt |
|---|---|---|
| 50.7588 | -76.5550 | 0.2618 |
age in our investigation? If you answer ‘no’, explain why
not, and if you answer ‘yes’, explain why we might be led astray by not
controlling for age.We should control for age because as you age, your immune system
strength and general internal functions decline. If age is not
controlled for, then the data will show that chronic kidney disease
leads to higher mortality rates. This is not a good conclusion because
it is saying that the chance of dying from kidney disease is the same in
a 45 year old and a 70 year old, which is very much a false conclusion.