Data import and recoding

Download the dataset, import it into R, and recode variables as instructed in Homework 1.

install.packages("table1")
install.packages("dplyr")
install.packages("ggplot2")

install.packages("splines")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)

## Warning: package 'splines' is a base package, and should not be updated

Data import and recoding

Download the dataset, import it into R, and recode variables as instructed in Homework 1.

library(readr)
nhefs <- read_csv("https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/1268/20/nhefs.csv", 
    col_types = cols(qsmk = col_factor(levels = c("0", 
         "1")), death = col_factor(levels = c("0", 
         "1")), sex = col_factor(levels = c("0", 
         "1")), race = col_factor(levels = c("0", 
         "1")), income = col_factor(levels = c("11", 
         "12", "13", "14", "15", "16", "17", 
         "18", "19", "20", "21", "22")), marital = col_factor(levels = c("1", 
         "2", "3", "4", "5", "6", "8")), education = col_factor(levels = c("1", 
         "2", "3", "4", "5")), asthma = col_factor(levels = c("0", 
         "1")), bronch = col_factor(levels = c("0", 
         "1")), tb = col_factor(levels = c("0", 
         "1")), hf = col_factor(levels = c("0", 
         "1")), hbp = col_factor(levels = c("0", 
         "1")), pepticulcer = col_factor(levels = c("0", 
         "1")), colitis = col_factor(levels = c("0", 
         "1")), hepatitis = col_factor(levels = c("0", 
         "1")), chroniccough = col_factor(levels = c("0", 
         "1")), hayfever = col_factor(levels = c("0", 
         "1")), diabetes = col_factor(levels = c("0", 
         "1")), polio = col_factor(levels = c("0", 
         "1")), tumor = col_factor(levels = c("0", 
         "1")), nervousbreak = col_factor(levels = c("0", 
         "1")), alcoholpy = col_factor(levels = c("0", 
         "1")), alcoholfreq = col_factor(levels = c("1", 
         "2", "3", "4", "5")), alcoholtype = col_factor(levels = c("1", 
         "2", "3", "4")), pica = col_factor(levels = c("0", 
         "1", "2")), headache = col_factor(levels = c("0", 
         "1")), otherpain = col_factor(levels = c("0", 
         "1")), weakheart = col_factor(levels = c("0", 
         "1")), allergies = col_factor(levels = c("0", 
         "1")), nerves = col_factor(levels = c("0", 
         "1")), lackpep = col_factor(levels = c("0", 
         "1")), hbpmed = col_factor(levels = c("0", 
         "1", "2")), boweltrouble = col_factor(levels = c("0", 
         "1","2")), active = col_factor(levels = c("0", 
         "1","2")),exercise = col_factor(levels = c("0", 
         "1","2"))))

## `curl` package not installed, falling back to using `url()`

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

nhefs2<-select(nhefs,c(death,active,age,education,exercise,race,sex,smokeintensity,smokeyrs,wt82_71,qsmk))

library(dplyr)

nhefs2<-nhefs2 %>% mutate(active=case_match(death,"0"~"NO","1"~"YES",.ptype = factor(c("NO","YES"))))

#nhefs2<-nhefs2 %>% mutate(active=case_match(active,"0"~"inactive","1"~"moderately active","2"~"very active",.ptype = factor(c("inactive","moderately active","very active"))))

nhefs2<-nhefs2 %>% mutate(exercise=case_match(exercise,"0"~"little or no exercise","1"~"moderate exercise","2"~"much exercise",.ptype = factor(c("little or no exercise","moderate exercise","much exercise"))))

nhefs2<-nhefs2 %>% mutate(sex=case_match(sex,"0"~"Male","1"~"Female",.ptype = factor(c("Male","Female"))))

nhefs2<-nhefs2 %>% mutate(education=case_match(education,"1"~"8TH GRADE OR LESS","2"~"HS DROPOUT","3"~"HS","4"~"COLLEGE DROPOUT","5"~"COLLEGE OR MORE",.ptype = factor(c("8TH GRADE OR LESS","HS DROPOUT","HS","COLLEGE DROPOUT","COLLEGE OR MORE"))))

nhefs2<-nhefs2 %>% mutate(race=case_match(race,"0"~"WHITE","1"~"BLACK OR OTHER",.ptype = factor(c("WHITE", "BLACK OR OTHER"))))

nhefs2<-nhefs2 %>% mutate(qsmk=case_match(qsmk,"0"~"NO",
           "1"~"YES",.ptype = factor(c("NO", "YES"))))

nhefs2

#nhefs2<-nhefs2 %>% mutate(death=relevel(death,ref="NO"))

#nhefs2<-nhefs2 %>% mutate(active=relevel(active,ref="moderately active"))

nhefs2<-nhefs2 %>% mutate(education=relevel(education,ref="HS"))

nhefs2<-nhefs2 %>% mutate(exercise=relevel(exercise,ref="moderate exercise"))

nhefs2<-nhefs2 %>% mutate(race=relevel(race,ref="WHITE"))

summary(nhefs2)

Question 1 (2 points): filter any censored or missing data

Remove all participants that were lost to follow-up or “censored”, whose value of the cens variable is 0. You would ordinarily want to perform a sensitivity analysis and consider the possibility of bias from informative censoring, but you can ignore that possibility for the purposes of this assignment.

Also remove observations that are missing any of the following variables:

death cens active age education exercise qsmk race sex smokeintensity smkyrs wt82_71

nhefsclean<-na.omit(nhefs2)

is.na(nhefsclean)

Question 2 (4 points): Table 1

Create a table of descriptive statistics like in HW1, but this time stratify it by the exposure variable qsmk

library(table1)

## 
## Attaching package: 'table1'

## The following objects are masked from 'package:base':
## 
##     units, units<-

nhefs2_table<-
  table1::table1(
    ~death+active+age+education+exercise+race+smokeintensity+smokeyrs+wt82_71|qsmk,
    data= nhefsclean,
    overall="Total",
    topclass="Rtable1-times",
    caption="Table1:Epi Table-characterisitics of data",
    )

nhefs2_table

Table1:Epi Table-characterisitics of data
	NO (N=1163)	YES (N=403)	Total (N=1566)
death
0	963 (82.8%)	312 (77.4%)	1275 (81.4%)
1	200 (17.2%)	91 (22.6%)	291 (18.6%)
active
NO	963 (82.8%)	312 (77.4%)	1275 (81.4%)
YES	200 (17.2%)	91 (22.6%)	291 (18.6%)
age
Mean (SD)	42.8 (11.8)	46.2 (12.2)	43.7 (12.0)
Median [Min, Max]	42.0 [25.0, 72.0]	46.0 [25.0, 74.0]	43.0 [25.0, 74.0]
education
HS	480 (41.3%)	157 (39.0%)	637 (40.7%)
8TH GRADE OR LESS	210 (18.1%)	81 (20.1%)	291 (18.6%)
COLLEGE DROPOUT	92 (7.9%)	29 (7.2%)	121 (7.7%)
COLLEGE OR MORE	115 (9.9%)	62 (15.4%)	177 (11.3%)
HS DROPOUT	266 (22.9%)	74 (18.4%)	340 (21.7%)
exercise
moderate exercise	485 (41.7%)	176 (43.7%)	661 (42.2%)
little or no exercise	237 (20.4%)	63 (15.6%)	300 (19.2%)
much exercise	441 (37.9%)	164 (40.7%)	605 (38.6%)
race
WHITE	993 (85.4%)	367 (91.1%)	1360 (86.8%)
BLACK OR OTHER	170 (14.6%)	36 (8.9%)	206 (13.2%)
smokeintensity
Mean (SD)	21.2 (11.5)	18.6 (12.4)	20.5 (11.8)
Median [Min, Max]	20.0 [1.00, 60.0]	20.0 [1.00, 80.0]	20.0 [1.00, 80.0]
smokeyrs
Mean (SD)	24.1 (11.7)	26.0 (12.7)	24.6 (12.0)
Median [Min, Max]	23.0 [1.00, 64.0]	26.0 [1.00, 60.0]	24.0 [1.00, 64.0]
wt82_71
Mean (SD)	1.98 (7.45)	4.53 (8.75)	2.64 (7.88)
Median [Min, Max]	2.15 [-41.3, 48.5]	3.97 [-22.2, 47.5]	2.60 [-41.3, 48.5]

Question 2 (4 points): fit crude and adjusted logistic regression models. Interpret the coefficient of qsmk in the crude model and in the adjusted model. Compare them.

crude coefficient: log odds = 0.05384, Probability = 1 / (1 + exp(-0.05384)) ≈ 0.51344

For those that quit smoking between the 1st questionnaire and 1982, the log odds that they died by 1992 is 0.05384 or 51.34%

adjusted model coefficient: log odds = -0.02877, Probability = 1 / (1 + exp(0.02877)) ≈ 0.492

For those that quit smoking between the 1st questionnaire and 1982, the odds for the adjusted model that they died by 1992 is 0.05384 or 49.2%

The adjusted coefficent results are smaller than the crude coefficient results which indicates the possibility of confounding.The idea of confounding is also supported by the fact that the crude model coefficient is signficant and the adjusted model coefficient is not.

#simple
fit0<-glm(formula=death~qsmk,data=nhefsclean, family=binomial)
summary(fit0)

## 
## Call:
## glm(formula = death ~ qsmk, family = binomial, data = nhefsclean)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.57174    0.07771 -20.226   <2e-16 ***
## qsmkYES      0.33959    0.14224   2.387    0.017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1498.2  on 1564  degrees of freedom
## AIC: 1502.2
## 
## Number of Fisher Scoring iterations: 4

#multiple logistic
library(splines)
fit<-glm(formula=death~qsmk + sex + race +bs(age) + education+ exercise, data=nhefsclean,family = binomial)
summary(fit)

## 
## Call:
## glm(formula = death ~ qsmk + sex + race + bs(age) + education + 
##     exercise, family = binomial, data = nhefsclean)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -4.37902    0.49694  -8.812  < 2e-16 ***
## qsmkYES                       -0.02877    0.16952  -0.170 0.865238    
## sexMale                        0.59777    0.15772   3.790 0.000151 ***
## raceBLACK OR OTHER             0.13358    0.22586   0.591 0.554248    
## bs(age)1                       1.13606    1.19849   0.948 0.343175    
## bs(age)2                       3.11016    0.71530   4.348 1.37e-05 ***
## bs(age)3                       5.42471    0.83822   6.472 9.69e-11 ***
## education8TH GRADE OR LESS     0.67584    0.19969   3.384 0.000713 ***
## educationCOLLEGE DROPOUT       0.23876    0.34155   0.699 0.484516    
## educationCOLLEGE OR MORE       0.06469    0.28637   0.226 0.821281    
## educationHS DROPOUT            0.31473    0.20739   1.518 0.129120    
## exerciselittle or no exercise  0.05748    0.22478   0.256 0.798177    
## exercisemuch exercise          0.21118    0.17053   1.238 0.215572    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1122.8  on 1553  degrees of freedom
## AIC: 1148.8
## 
## Number of Fisher Scoring iterations: 6

Question 3 (3 points): model diagnostics

Create a binned residual plot using the binnedplot function from the arm library. Interpret the plot.

Interpretation: The light lines represent theoretical 95% error bounds. In a well specified model, the majority of points should lie within the error bounds, there should be randomness with no patterns.I think this model is overall well-specified because the pattern is random and the majority of points fall within the 95% error bounds. But there are concerns. While the majority of points fall within the 95% error bounds, there are still several that fall outside of it.

library(arm)

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## Loading required package: Matrix

## Loading required package: lme4

## 
## arm (Version 1.13-1, built: 2022-8-25)

## Working directory is /cloud/project

x<-predict(fit)
y<-resid(fit)
binnedplot(x,y)

Question 4 (7 points): Inverse probability weighing

a.(2 points)) Create a model to calculate the propensity of exposure given the values of the hypothesized confounders (propensity score). This is your “denominator” model of Lecture 11, Slide 15.

#denominator model
denom.fit<-glm(qsmk~sex+race+bs(age)+education+exercise,family = "binomial",data=nhefsclean)

#numerator model
num.fit<-glm(qsmk~1,data=nhefsclean,family="binomial")

(2 points)) Calculate IPTW weights using this model. Hint - you can use code like this:

pdenom <- predict(denom.fit, type = "response")
pnum <- predict(num.fit, type = "response")
nhefs$iptw <- ifelse(nhefs$qsmk == 0, ((1-pnum)/(1-pdenom)),
                     (pnum/pdenom))

nhefs$iptw

nhefs

(2 points)) Use the geeglm function from the geepack library to use these weights to estimate the causal effect of quitting smoking on death.

fitweight<-glm(formula=death~qsmk, weights=nhefs$iptw, data=nhefs,family = binomial)

## Warning: Unknown or uninitialised column: `iptw`.

fitweight

## 
## Call:  glm(formula = death ~ qsmk, family = binomial, data = nhefs, 
##     weights = nhefs$iptw)
## 
## Coefficients:
## (Intercept)        qsmk1  
##     -1.5174       0.3554  
## 
## Degrees of Freedom: 1628 Total (i.e. Null);  1627 Residual
## Null Deviance:       1608 
## Residual Deviance: 1602  AIC: 1606

summary(fitweight)

## 
## Call:
## glm(formula = death ~ qsmk, family = binomial, data = nhefs, 
##     weights = nhefs$iptw)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.51736    0.07513 -20.196   <2e-16 ***
## qsmk1        0.35544    0.13607   2.612    0.009 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1608.5  on 1628  degrees of freedom
## Residual deviance: 1601.8  on 1627  degrees of freedom
## AIC: 1605.8
## 
## Number of Fisher Scoring iterations: 4

summary(fit)

## 
## Call:
## glm(formula = death ~ qsmk + sex + race + bs(age) + education + 
##     exercise, family = binomial, data = nhefsclean)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -4.37902    0.49694  -8.812  < 2e-16 ***
## qsmkYES                       -0.02877    0.16952  -0.170 0.865238    
## sexMale                        0.59777    0.15772   3.790 0.000151 ***
## raceBLACK OR OTHER             0.13358    0.22586   0.591 0.554248    
## bs(age)1                       1.13606    1.19849   0.948 0.343175    
## bs(age)2                       3.11016    0.71530   4.348 1.37e-05 ***
## bs(age)3                       5.42471    0.83822   6.472 9.69e-11 ***
## education8TH GRADE OR LESS     0.67584    0.19969   3.384 0.000713 ***
## educationCOLLEGE DROPOUT       0.23876    0.34155   0.699 0.484516    
## educationCOLLEGE OR MORE       0.06469    0.28637   0.226 0.821281    
## educationHS DROPOUT            0.31473    0.20739   1.518 0.129120    
## exerciselittle or no exercise  0.05748    0.22478   0.256 0.798177    
## exercisemuch exercise          0.21118    0.17053   1.238 0.215572    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1503.7  on 1565  degrees of freedom
## Residual deviance: 1122.8  on 1553  degrees of freedom
## AIC: 1148.8
## 
## Number of Fisher Scoring iterations: 6

d.(1 point)) Compare the coefficient and standard error for quitting smoking to Question 2, and comment. Recall the discussion in lecture about geepack standard errors being too conservative.

Question 2 adjusted model coefficient: log odds = -0.02877, Probability = 1 / (1 + exp(0.02877)) ≈ 0.492 or 49% SE = 0.16952

Question 4 model coefficient: log odds = -1.5126, Probability = 1/(1 + exp (-1.5126)) ≈ 0.1808 or 18% SE = 0.3441

While, both the adjusted model coefficient from q2 and the IPTW coefficient from q4 seek to address confounding, there are noteworthy differences between their coefficients and SE. The magnitude of the q4 coefficient is larger than the magnitude of the q2 coefficient. The q4 SE is more than double that of q2. Keeping in mind that both models address confounding using different methods and with different assumptions, I don’t think it’s unusal to see differences. I guess I can comment that the differences show that there is an effect of weighting in the model on confounding. An increase in standard error may mean an increase in variation suggesting more uncertainity with the inverse probability weighting.

Homework 2

2023-05-11

Data import and recoding

Data import and recoding

Question 1 (2 points): filter any censored or missing data

Question 2 (4 points): Table 1

Question 2 (4 points): fit crude and adjusted logistic regression models. Interpret the coefficient of qsmk in the crude model and in the adjusted model. Compare them.

Question 3 (3 points): model diagnostics

Question 4 (7 points): Inverse probability weighing