Download the dataset, import it into R, and recode variables as instructed in Homework 1.
install.packages("table1")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("splines")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
## Warning: package 'splines' is a base package, and should not be updated
Download the dataset, import it into R, and recode variables as instructed in Homework 1.
library(readr)
nhefs <- read_csv("https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/1268/20/nhefs.csv",
col_types = cols(qsmk = col_factor(levels = c("0",
"1")), death = col_factor(levels = c("0",
"1")), sex = col_factor(levels = c("0",
"1")), race = col_factor(levels = c("0",
"1")), income = col_factor(levels = c("11",
"12", "13", "14", "15", "16", "17",
"18", "19", "20", "21", "22")), marital = col_factor(levels = c("1",
"2", "3", "4", "5", "6", "8")), education = col_factor(levels = c("1",
"2", "3", "4", "5")), asthma = col_factor(levels = c("0",
"1")), bronch = col_factor(levels = c("0",
"1")), tb = col_factor(levels = c("0",
"1")), hf = col_factor(levels = c("0",
"1")), hbp = col_factor(levels = c("0",
"1")), pepticulcer = col_factor(levels = c("0",
"1")), colitis = col_factor(levels = c("0",
"1")), hepatitis = col_factor(levels = c("0",
"1")), chroniccough = col_factor(levels = c("0",
"1")), hayfever = col_factor(levels = c("0",
"1")), diabetes = col_factor(levels = c("0",
"1")), polio = col_factor(levels = c("0",
"1")), tumor = col_factor(levels = c("0",
"1")), nervousbreak = col_factor(levels = c("0",
"1")), alcoholpy = col_factor(levels = c("0",
"1")), alcoholfreq = col_factor(levels = c("1",
"2", "3", "4", "5")), alcoholtype = col_factor(levels = c("1",
"2", "3", "4")), pica = col_factor(levels = c("0",
"1", "2")), headache = col_factor(levels = c("0",
"1")), otherpain = col_factor(levels = c("0",
"1")), weakheart = col_factor(levels = c("0",
"1")), allergies = col_factor(levels = c("0",
"1")), nerves = col_factor(levels = c("0",
"1")), lackpep = col_factor(levels = c("0",
"1")), hbpmed = col_factor(levels = c("0",
"1", "2")), boweltrouble = col_factor(levels = c("0",
"1","2")), active = col_factor(levels = c("0",
"1","2")),exercise = col_factor(levels = c("0",
"1","2"))))
## `curl` package not installed, falling back to using `url()`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
nhefs2<-select(nhefs,c(death,active,age,education,exercise,race,sex,smokeintensity,smokeyrs,wt82_71,qsmk))
library(dplyr)
nhefs2<-nhefs2 %>% mutate(active=case_match(death,"0"~"NO","1"~"YES",.ptype = factor(c("NO","YES"))))
#nhefs2<-nhefs2 %>% mutate(active=case_match(active,"0"~"inactive","1"~"moderately active","2"~"very active",.ptype = factor(c("inactive","moderately active","very active"))))
nhefs2<-nhefs2 %>% mutate(exercise=case_match(exercise,"0"~"little or no exercise","1"~"moderate exercise","2"~"much exercise",.ptype = factor(c("little or no exercise","moderate exercise","much exercise"))))
nhefs2<-nhefs2 %>% mutate(sex=case_match(sex,"0"~"Male","1"~"Female",.ptype = factor(c("Male","Female"))))
nhefs2<-nhefs2 %>% mutate(education=case_match(education,"1"~"8TH GRADE OR LESS","2"~"HS DROPOUT","3"~"HS","4"~"COLLEGE DROPOUT","5"~"COLLEGE OR MORE",.ptype = factor(c("8TH GRADE OR LESS","HS DROPOUT","HS","COLLEGE DROPOUT","COLLEGE OR MORE"))))
nhefs2<-nhefs2 %>% mutate(race=case_match(race,"0"~"WHITE","1"~"BLACK OR OTHER",.ptype = factor(c("WHITE", "BLACK OR OTHER"))))
nhefs2<-nhefs2 %>% mutate(qsmk=case_match(qsmk,"0"~"NO",
"1"~"YES",.ptype = factor(c("NO", "YES"))))
nhefs2
#nhefs2<-nhefs2 %>% mutate(death=relevel(death,ref="NO"))
#nhefs2<-nhefs2 %>% mutate(active=relevel(active,ref="moderately active"))
nhefs2<-nhefs2 %>% mutate(education=relevel(education,ref="HS"))
nhefs2<-nhefs2 %>% mutate(exercise=relevel(exercise,ref="moderate exercise"))
nhefs2<-nhefs2 %>% mutate(race=relevel(race,ref="WHITE"))
summary(nhefs2)
Remove all participants that were lost to follow-up or “censored”, whose value of the cens variable is 0. You would ordinarily want to perform a sensitivity analysis and consider the possibility of bias from informative censoring, but you can ignore that possibility for the purposes of this assignment.
Also remove observations that are missing any of the following variables:
death cens active age education exercise qsmk race sex smokeintensity smkyrs wt82_71
nhefsclean<-na.omit(nhefs2)
is.na(nhefsclean)
Create a table of descriptive statistics like in HW1, but this time stratify it by the exposure variable qsmk
library(table1)
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
## units, units<-
nhefs2_table<-
table1::table1(
~death+active+age+education+exercise+race+smokeintensity+smokeyrs+wt82_71|qsmk,
data= nhefsclean,
overall="Total",
topclass="Rtable1-times",
caption="Table1:Epi Table-characterisitics of data",
)
nhefs2_table
| NO (N=1163) |
YES (N=403) |
Total (N=1566) |
|
|---|---|---|---|
| death | |||
| 0 | 963 (82.8%) | 312 (77.4%) | 1275 (81.4%) |
| 1 | 200 (17.2%) | 91 (22.6%) | 291 (18.6%) |
| active | |||
| NO | 963 (82.8%) | 312 (77.4%) | 1275 (81.4%) |
| YES | 200 (17.2%) | 91 (22.6%) | 291 (18.6%) |
| age | |||
| Mean (SD) | 42.8 (11.8) | 46.2 (12.2) | 43.7 (12.0) |
| Median [Min, Max] | 42.0 [25.0, 72.0] | 46.0 [25.0, 74.0] | 43.0 [25.0, 74.0] |
| education | |||
| HS | 480 (41.3%) | 157 (39.0%) | 637 (40.7%) |
| 8TH GRADE OR LESS | 210 (18.1%) | 81 (20.1%) | 291 (18.6%) |
| COLLEGE DROPOUT | 92 (7.9%) | 29 (7.2%) | 121 (7.7%) |
| COLLEGE OR MORE | 115 (9.9%) | 62 (15.4%) | 177 (11.3%) |
| HS DROPOUT | 266 (22.9%) | 74 (18.4%) | 340 (21.7%) |
| exercise | |||
| moderate exercise | 485 (41.7%) | 176 (43.7%) | 661 (42.2%) |
| little or no exercise | 237 (20.4%) | 63 (15.6%) | 300 (19.2%) |
| much exercise | 441 (37.9%) | 164 (40.7%) | 605 (38.6%) |
| race | |||
| WHITE | 993 (85.4%) | 367 (91.1%) | 1360 (86.8%) |
| BLACK OR OTHER | 170 (14.6%) | 36 (8.9%) | 206 (13.2%) |
| smokeintensity | |||
| Mean (SD) | 21.2 (11.5) | 18.6 (12.4) | 20.5 (11.8) |
| Median [Min, Max] | 20.0 [1.00, 60.0] | 20.0 [1.00, 80.0] | 20.0 [1.00, 80.0] |
| smokeyrs | |||
| Mean (SD) | 24.1 (11.7) | 26.0 (12.7) | 24.6 (12.0) |
| Median [Min, Max] | 23.0 [1.00, 64.0] | 26.0 [1.00, 60.0] | 24.0 [1.00, 64.0] |
| wt82_71 | |||
| Mean (SD) | 1.98 (7.45) | 4.53 (8.75) | 2.64 (7.88) |
| Median [Min, Max] | 2.15 [-41.3, 48.5] | 3.97 [-22.2, 47.5] | 2.60 [-41.3, 48.5] |
crude coefficient: log odds = 0.05384, Probability = 1 / (1 + exp(-0.05384)) ≈ 0.51344
For those that quit smoking between the 1st questionnaire and 1982, the log odds that they died by 1992 is 0.05384 or 51.34%
adjusted model coefficient: log odds = -0.02877, Probability = 1 / (1 + exp(0.02877)) ≈ 0.492
For those that quit smoking between the 1st questionnaire and 1982, the odds for the adjusted model that they died by 1992 is 0.05384 or 49.2%
The adjusted coefficent results are smaller than the crude coefficient results which indicates the possibility of confounding.The idea of confounding is also supported by the fact that the crude model coefficient is signficant and the adjusted model coefficient is not.
#simple
fit0<-glm(formula=death~qsmk,data=nhefsclean, family=binomial)
summary(fit0)
##
## Call:
## glm(formula = death ~ qsmk, family = binomial, data = nhefsclean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.57174 0.07771 -20.226 <2e-16 ***
## qsmkYES 0.33959 0.14224 2.387 0.017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1498.2 on 1564 degrees of freedom
## AIC: 1502.2
##
## Number of Fisher Scoring iterations: 4
#multiple logistic
library(splines)
fit<-glm(formula=death~qsmk + sex + race +bs(age) + education+ exercise, data=nhefsclean,family = binomial)
summary(fit)
##
## Call:
## glm(formula = death ~ qsmk + sex + race + bs(age) + education +
## exercise, family = binomial, data = nhefsclean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.37902 0.49694 -8.812 < 2e-16 ***
## qsmkYES -0.02877 0.16952 -0.170 0.865238
## sexMale 0.59777 0.15772 3.790 0.000151 ***
## raceBLACK OR OTHER 0.13358 0.22586 0.591 0.554248
## bs(age)1 1.13606 1.19849 0.948 0.343175
## bs(age)2 3.11016 0.71530 4.348 1.37e-05 ***
## bs(age)3 5.42471 0.83822 6.472 9.69e-11 ***
## education8TH GRADE OR LESS 0.67584 0.19969 3.384 0.000713 ***
## educationCOLLEGE DROPOUT 0.23876 0.34155 0.699 0.484516
## educationCOLLEGE OR MORE 0.06469 0.28637 0.226 0.821281
## educationHS DROPOUT 0.31473 0.20739 1.518 0.129120
## exerciselittle or no exercise 0.05748 0.22478 0.256 0.798177
## exercisemuch exercise 0.21118 0.17053 1.238 0.215572
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1122.8 on 1553 degrees of freedom
## AIC: 1148.8
##
## Number of Fisher Scoring iterations: 6
Create a binned residual plot using the binnedplot function from the arm library. Interpret the plot.
Interpretation: The light lines represent theoretical 95% error bounds. In a well specified model, the majority of points should lie within the error bounds, there should be randomness with no patterns.I think this model is overall well-specified because the pattern is random and the majority of points fall within the 95% error bounds. But there are concerns. While the majority of points fall within the 95% error bounds, there are still several that fall outside of it.
library(arm)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: Matrix
## Loading required package: lme4
##
## arm (Version 1.13-1, built: 2022-8-25)
## Working directory is /cloud/project
x<-predict(fit)
y<-resid(fit)
binnedplot(x,y)
a.(2 points)) Create a model to calculate the propensity of exposure given the values of the hypothesized confounders (propensity score). This is your “denominator” model of Lecture 11, Slide 15.
#denominator model
denom.fit<-glm(qsmk~sex+race+bs(age)+education+exercise,family = "binomial",data=nhefsclean)
#numerator model
num.fit<-glm(qsmk~1,data=nhefsclean,family="binomial")
pdenom <- predict(denom.fit, type = "response")
pnum <- predict(num.fit, type = "response")
nhefs$iptw <- ifelse(nhefs$qsmk == 0, ((1-pnum)/(1-pdenom)),
(pnum/pdenom))
nhefs$iptw
nhefs
fitweight<-glm(formula=death~qsmk, weights=nhefs$iptw, data=nhefs,family = binomial)
## Warning: Unknown or uninitialised column: `iptw`.
fitweight
##
## Call: glm(formula = death ~ qsmk, family = binomial, data = nhefs,
## weights = nhefs$iptw)
##
## Coefficients:
## (Intercept) qsmk1
## -1.5174 0.3554
##
## Degrees of Freedom: 1628 Total (i.e. Null); 1627 Residual
## Null Deviance: 1608
## Residual Deviance: 1602 AIC: 1606
summary(fitweight)
##
## Call:
## glm(formula = death ~ qsmk, family = binomial, data = nhefs,
## weights = nhefs$iptw)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.51736 0.07513 -20.196 <2e-16 ***
## qsmk1 0.35544 0.13607 2.612 0.009 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1608.5 on 1628 degrees of freedom
## Residual deviance: 1601.8 on 1627 degrees of freedom
## AIC: 1605.8
##
## Number of Fisher Scoring iterations: 4
summary(fit)
##
## Call:
## glm(formula = death ~ qsmk + sex + race + bs(age) + education +
## exercise, family = binomial, data = nhefsclean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.37902 0.49694 -8.812 < 2e-16 ***
## qsmkYES -0.02877 0.16952 -0.170 0.865238
## sexMale 0.59777 0.15772 3.790 0.000151 ***
## raceBLACK OR OTHER 0.13358 0.22586 0.591 0.554248
## bs(age)1 1.13606 1.19849 0.948 0.343175
## bs(age)2 3.11016 0.71530 4.348 1.37e-05 ***
## bs(age)3 5.42471 0.83822 6.472 9.69e-11 ***
## education8TH GRADE OR LESS 0.67584 0.19969 3.384 0.000713 ***
## educationCOLLEGE DROPOUT 0.23876 0.34155 0.699 0.484516
## educationCOLLEGE OR MORE 0.06469 0.28637 0.226 0.821281
## educationHS DROPOUT 0.31473 0.20739 1.518 0.129120
## exerciselittle or no exercise 0.05748 0.22478 0.256 0.798177
## exercisemuch exercise 0.21118 0.17053 1.238 0.215572
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1503.7 on 1565 degrees of freedom
## Residual deviance: 1122.8 on 1553 degrees of freedom
## AIC: 1148.8
##
## Number of Fisher Scoring iterations: 6
d.(1 point)) Compare the coefficient and standard error for quitting smoking to Question 2, and comment. Recall the discussion in lecture about geepack standard errors being too conservative.
Question 2 adjusted model coefficient: log odds = -0.02877, Probability = 1 / (1 + exp(0.02877)) ≈ 0.492 or 49% SE = 0.16952
Question 4 model coefficient: log odds = -1.5126, Probability = 1/(1 + exp (-1.5126)) ≈ 0.1808 or 18% SE = 0.3441
While, both the adjusted model coefficient from q2 and the IPTW coefficient from q4 seek to address confounding, there are noteworthy differences between their coefficients and SE. The magnitude of the q4 coefficient is larger than the magnitude of the q2 coefficient. The q4 SE is more than double that of q2. Keeping in mind that both models address confounding using different methods and with different assumptions, I don’t think it’s unusal to see differences. I guess I can comment that the differences show that there is an effect of weighting in the model on confounding. An increase in standard error may mean an increase in variation suggesting more uncertainity with the inverse probability weighting.