Secondhand smoke (SHS) can cause numerous serious health problems in adults and children; according to CDC statistics, SHS causes more than 7,300 lung cancer deaths among nonsmokers each year. In this project, we introduce interrupted time series regression through a real-world example of a public health intervention. We first lay out the causal question, including the proposed cause and the proposed effect. We then discuss the sampling strategies that bear on the external validity of our pilot data. The measurements underlying the construct validity of the study are defined before the quasi-experimental design is implemented. Finally, we describe the main method, interrupted time series analysis, and interpret the results.
An interrupted time series design is a quasi-experimental design that can be used to assess the impact of a policy implemented at a specific point in time. First, we identify the underlying trend in the pre-treatment period and then test whether the intervention interrupts that trend at a known point in time. The counterfactual scenario assumes the underlying trend would have continued unchanged had the intervention not occurred.
Effects come in two forms: a change in level (an immediate jump in the outcome at the time of the intervention) and a change in trend (a change in the slope of the outcome over time).
Model formulation
\(Y_{t} = \beta_{0} + \beta_{1}T + \beta_{2}X_{t} + \beta_{3}TX_{t} + \epsilon_{t}\)
\(T\): the time elapsed since the start of the study period
\(X_{t}\): a dummy variable equal to 0 in the pre-intervention period and 1 in the post-intervention period
\(Y_{t}\): the outcome at time \(t\)
\(\epsilon_{t}\): the random error at time \(t\)
\(\beta_{0}=\) the baseline level
\(\beta_{1}=\) the baseline trend
\(\beta_{2}=\) the level change
\(\beta_{3}=\) the trend change
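To make the roles of the coefficients concrete, here is a minimal sketch (with made-up coefficient values, not from the analysis below) that builds the fitted and counterfactual trajectories by hand; the effect of the intervention at time \(t\) is the gap between them.
# Minimal sketch with made-up coefficients: fitted trajectory versus the
# counterfactual obtained by extending the pre-intervention trend
b0 <- 2; b1 <- 0.5; b2 <- 4; b3 <- 0.3    # baseline level/trend, level/trend change
t <- 1:20                                 # time since the start of the study
X <- as.numeric(t > 10)                   # 0 before the intervention, 1 after
y_fit <- b0 + b1*t + b2*X + b3*t*X        # what the model predicts
y_cf  <- b0 + b1*t                        # counterfactual: pre-trend continued
effect <- y_fit - y_cf                    # equals (b2 + b3*t)*X: level jump plus a growing trend gap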
To demonstrate how the model works, consider the following examples, in which we randomly generate time series data and attempt to detect level and trend changes via segmented regression. As expected, the model performs better when the effect is large or the data is less noisy.
library(broom)
library(tidyverse)
toy_its <- function(B0, B1, B2, B3, epochs, noise){
  # Create synthetic data: a linear pre-trend plus a level (B2) and
  # trend (B3) change after the midpoint, with uniform noise
  t <- seq(1, epochs)
  X <- c(rep(0, epochs/2), rep(1, epochs/2))
  t0 <- epochs/2 + 1 # time of intervention
  Y <- B0 + B1*t + B2*X + B3*t*X + noise*runif(length(X))
  df <- tibble(t = t, X = X, Y = Y, t0 = t0)
  # Fit a segmented regression model
  model <- lm(Y ~ t + X + t*X, data = df)
  results <- tidy(model)
  # Label significant slope (t:X) and level (X) changes for the plot title
  slope_change <- results %>% filter(term == "t:X", p.value <= 0.05) %>% nrow()
  if (slope_change == 1){
    slope_change <- paste0("significant change in slope (",
                           round(results$estimate[results$term == "t:X"], 2), ")")
  } else {
    slope_change <- " "
  }
  level_change <- results %>% filter(term == "X", p.value <= 0.05) %>% nrow()
  if (level_change == 1){
    level_change <- paste0("significant change in level (",
                           round(results$estimate[results$term == "X"], 2), ")")
  } else {
    level_change <- " "
  }
  # Plot the observations, the fitted segments, and a shaded post-intervention period
  pred <- predict(model, type = "response", df)
  plot(df$Y, type = "n", ylim = c(0, max(df$Y) + 4),
       xlab = "Time", ylab = "Response", bty = "l", xaxt = "n")
  rect(epochs/2, 0, epochs, 100, col = grey(0.9), border = F)
  points(df$Y, cex = 0.7)
  lines(1:epochs, pred, col = 2)
  axis(1, at = (0:epochs)*2, labels = F)
  axis(1, at = (0:(epochs/2))*2, tick = F, labels = (0:(epochs/2))*2)
  title(paste0(level_change, "\n", slope_change))
  summary(model)
}
set.seed(1)
# Detect level changes in low-noise data as the effect size (B2) grows
toy_its(B0=2, B1=0.5, B2=0, B3=0, epochs = 20, noise=1)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52276 -0.16406 0.00888 0.24377 0.38215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.51114 0.19584 12.823 7.83e-10 ***
## t 0.50734 0.03156 16.074 2.70e-11 ***
## X -0.78090 0.53470 -1.460 0.164
## t:X 0.04612 0.04464 1.033 0.317
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2867 on 16 degrees of freedom
## Multiple R-squared: 0.9924, Adjusted R-squared: 0.991
## F-statistic: 697.1 on 3 and 16 DF, p-value: < 2.2e-16
toy_its(B0=2, B1=0.5, B2=1, B3=0, epochs = 20, noise=1)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42103 -0.18237 -0.02009 0.19578 0.50038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.49534 0.19960 12.502 1.13e-09 ***
## t 0.48600 0.03217 15.108 6.86e-11 ***
## X 1.03702 0.54496 1.903 0.0752 .
## t:X 0.01381 0.04549 0.304 0.7653
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2922 on 16 degrees of freedom
## Multiple R-squared: 0.994, Adjusted R-squared: 0.9929
## F-statistic: 885.7 on 3 and 16 DF, p-value: < 2.2e-16
toy_its(B0=2, B1=0.5, B2=2, B3=0, epochs = 20, noise=1)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55049 -0.08353 0.01891 0.14280 0.42226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.71870 0.17008 15.985 2.93e-11 ***
## t 0.47930 0.02741 17.486 7.51e-12 ***
## X 1.82102 0.46436 3.922 0.00122 **
## t:X 0.01230 0.03876 0.317 0.75511
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.249 on 16 degrees of freedom
## Multiple R-squared: 0.9963, Adjusted R-squared: 0.9957
## F-statistic: 1454 on 3 and 16 DF, p-value: < 2.2e-16
toy_its(B0=2, B1=0.5, B2=3, B3=0, epochs = 20, noise=1)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42218 -0.19083 -0.04299 0.19339 0.39571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.51851 0.18508 13.608 3.26e-10 ***
## t 0.49866 0.02983 16.718 1.49e-11 ***
## X 2.40990 0.50532 4.769 0.000209 ***
## t:X 0.04609 0.04218 1.093 0.290750
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2709 on 16 degrees of freedom
## Multiple R-squared: 0.9969, Adjusted R-squared: 0.9963
## F-statistic: 1698 on 3 and 16 DF, p-value: < 2.2e-16
toy_its(B0=2, B1=0.5, B2=4, B3=0, epochs = 20, noise=1)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37815 -0.16217 -0.09326 0.17877 0.36646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.62806 0.16624 15.809 3.47e-11 ***
## t 0.45952 0.02679 17.151 1.01e-11 ***
## X 3.36207 0.45389 7.407 1.48e-06 ***
## t:X 0.07773 0.03789 2.052 0.057 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2434 on 16 degrees of freedom
## Multiple R-squared: 0.9979, Adjusted R-squared: 0.9976
## F-statistic: 2595 on 3 and 16 DF, p-value: < 2.2e-16
set.seed(1)
# Detect trend (slope) changes in low-noise data as the effect size (B3) grows
toy_its(B0=2, B1=0.5, B2=0, B3=0, epochs = 20, noise=1)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52276 -0.16406 0.00888 0.24377 0.38215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.51114 0.19584 12.823 7.83e-10 ***
## t 0.50734 0.03156 16.074 2.70e-11 ***
## X -0.78090 0.53470 -1.460 0.164
## t:X 0.04612 0.04464 1.033 0.317
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2867 on 16 degrees of freedom
## Multiple R-squared: 0.9924, Adjusted R-squared: 0.991
## F-statistic: 697.1 on 3 and 16 DF, p-value: < 2.2e-16
toy_its(B0=2, B1=0.5, B2=0, B3=0.1, epochs = 20, noise=1)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42103 -0.18237 -0.02009 0.19578 0.50038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.49534 0.19960 12.502 1.13e-09 ***
## t 0.48600 0.03217 15.108 6.86e-11 ***
## X 0.03702 0.54496 0.068 0.9467
## t:X 0.11381 0.04549 2.502 0.0236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2922 on 16 degrees of freedom
## Multiple R-squared: 0.995, Adjusted R-squared: 0.994
## F-statistic: 1058 on 3 and 16 DF, p-value: < 2.2e-16
toy_its(B0=2, B1=0.5, B2=0, B3=0.2, epochs = 20, noise=1)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55049 -0.08353 0.01891 0.14280 0.42226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.71870 0.17008 15.985 2.93e-11 ***
## t 0.47930 0.02741 17.486 7.51e-12 ***
## X -0.17898 0.46436 -0.385 0.705
## t:X 0.21230 0.03876 5.477 5.07e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.249 on 16 degrees of freedom
## Multiple R-squared: 0.9973, Adjusted R-squared: 0.9968
## F-statistic: 1994 on 3 and 16 DF, p-value: < 2.2e-16
toy_its(B0=2, B1=0.5, B2=0, B3=0.3, epochs = 20, noise=1)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42218 -0.19083 -0.04299 0.19339 0.39571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.51851 0.18508 13.608 3.26e-10 ***
## t 0.49866 0.02983 16.718 1.49e-11 ***
## X -0.59010 0.50532 -1.168 0.26
## t:X 0.34609 0.04218 8.204 3.99e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2709 on 16 degrees of freedom
## Multiple R-squared: 0.9979, Adjusted R-squared: 0.9975
## F-statistic: 2524 on 3 and 16 DF, p-value: < 2.2e-16
toy_its(B0=2, B1=0.5, B2=0, B3=0.4, epochs = 20, noise=1)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37815 -0.16217 -0.09326 0.17877 0.36646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.62806 0.16624 15.809 3.47e-11 ***
## t 0.45952 0.02679 17.151 1.01e-11 ***
## X -0.63793 0.45389 -1.405 0.179
## t:X 0.47773 0.03789 12.608 1.00e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2434 on 16 degrees of freedom
## Multiple R-squared: 0.9987, Adjusted R-squared: 0.9985
## F-statistic: 4139 on 3 and 16 DF, p-value: < 2.2e-16
set.seed(1)
# Detect a combined level and trend change as the noise grows
toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 20, noise=1)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52276 -0.16406 0.00888 0.24377 0.38215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.51114 0.19584 12.823 7.83e-10 ***
## t 0.50734 0.03156 16.074 2.70e-11 ***
## X 3.21910 0.53470 6.020 1.78e-05 ***
## t:X 0.34612 0.04464 7.754 8.29e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2867 on 16 degrees of freedom
## Multiple R-squared: 0.9987, Adjusted R-squared: 0.9985
## F-statistic: 4112 on 3 and 16 DF, p-value: < 2.2e-16
toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 20, noise=2)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.84205 -0.36475 -0.04017 0.39157 1.00075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.99067 0.39920 7.492 1.29e-06 ***
## t 0.47200 0.06434 7.336 1.67e-06 ***
## X 4.07405 1.08992 3.738 0.00179 **
## t:X 0.32763 0.09099 3.601 0.00239 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5844 on 16 degrees of freedom
## Multiple R-squared: 0.9947, Adjusted R-squared: 0.9938
## F-statistic: 1009 on 3 and 16 DF, p-value: < 2.2e-16
toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 20, noise=3)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65146 -0.25059 0.05672 0.42841 1.26677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.15611 0.51023 8.146 4.38e-07 ***
## t 0.43791 0.08223 5.325 6.83e-05 ***
## X 3.46307 1.39307 2.486 0.0244 *
## t:X 0.33690 0.11629 2.897 0.0105 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7469 on 16 degrees of freedom
## Multiple R-squared: 0.9904, Adjusted R-squared: 0.9886
## F-statistic: 548.9 on 3 and 16 DF, p-value: 2.442e-16
toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 20, noise=4)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6887 -0.7633 -0.1720 0.7736 1.5828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0741 0.7403 5.503 4.82e-05 ***
## t 0.4946 0.1193 4.146 0.00076 ***
## X 1.6396 2.0213 0.811 0.42918
## t:X 0.4844 0.1687 2.871 0.01110 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.084 on 16 degrees of freedom
## Multiple R-squared: 0.9831, Adjusted R-squared: 0.9799
## F-statistic: 310 on 3 and 16 DF, p-value: 2.216e-14
toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 20, noise=5)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8908 -0.8108 -0.4663 0.8939 1.8323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.1403 0.8312 6.184 1.31e-05 ***
## t 0.2976 0.1340 2.221 0.04110 *
## X 0.8104 2.2694 0.357 0.72570
## t:X 0.6887 0.1894 3.635 0.00223 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.217 on 16 degrees of freedom
## Multiple R-squared: 0.9795, Adjusted R-squared: 0.9757
## F-statistic: 255.1 on 3 and 16 DF, p-value: 1.023e-13
set.seed(1)
# Detect a combined level and trend change as the series gets shorter
toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 20, noise=3)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.56829 -0.49219 0.02664 0.73132 1.14645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.53342 0.58751 6.014 1.80e-05 ***
## t 0.52202 0.09469 5.513 4.72e-05 ***
## X 1.65730 1.60409 1.033 0.31689
## t:X 0.43835 0.13391 3.274 0.00478 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.86 on 16 degrees of freedom
## Multiple R-squared: 0.9887, Adjusted R-squared: 0.9866
## F-statistic: 465.6 on 3 and 16 DF, p-value: 8.986e-16
toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 16, noise=3)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11834 -0.59958 -0.03557 0.58887 0.99980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0013 0.6107 6.552 2.72e-05 ***
## t 0.3030 0.1209 2.506 0.0276 *
## X 3.6592 1.6539 2.213 0.0471 *
## t:X 0.4981 0.1710 2.912 0.0130 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7838 on 12 degrees of freedom
## Multiple R-squared: 0.9886, Adjusted R-squared: 0.9857
## F-statistic: 345.6 on 3 and 12 DF, p-value: 6.545e-12
toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 14, noise=3)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5576 -0.2234 0.1701 0.4586 0.9119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3485 0.6952 4.816 0.000706 ***
## t 0.6223 0.1555 4.003 0.002506 **
## X 3.6755 1.8719 1.963 0.077989 .
## t:X 0.2326 0.2198 1.058 0.315014
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8226 on 10 degrees of freedom
## Multiple R-squared: 0.9843, Adjusted R-squared: 0.9796
## F-statistic: 208.8 on 3 and 10 DF, p-value: 2.578e-09
toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 12, noise=3)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.84877 -0.38509 -0.07159 0.23439 1.07883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4326 0.6421 6.904 0.000124 ***
## t 0.1181 0.1649 0.716 0.494211
## X 2.4595 1.7160 1.433 0.189681
## t:X 0.7517 0.2332 3.224 0.012166 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6897 on 8 degrees of freedom
## Multiple R-squared: 0.9887, Adjusted R-squared: 0.9844
## F-statistic: 232.9 on 3 and 8 DF, p-value: 4.024e-08
toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 10, noise=3)
##
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36957 -0.47219 0.09908 0.61772 0.88333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3392 0.9679 3.450 0.0136 *
## t 0.4894 0.2918 1.677 0.1446
## X 3.4408 2.5608 1.344 0.2276
## t:X 0.4310 0.4127 1.044 0.3366
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9228 on 6 degrees of freedom
## Multiple R-squared: 0.9781, Adjusted R-squared: 0.9672
## F-statistic: 89.53 on 3 and 6 DF, p-value: 2.264e-05
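These single runs illustrate the pattern. A natural extension, sketched below under the same data-generating assumptions (this code is an addition, not part of the original analysis), is to repeat each simulation many times and estimate how often the trend change is detected.
# Sketch: estimated detection rate of the trend change (t:X) at alpha = 0.05,
# using the same data-generating process as toy_its (assumed setup)
detect_rate <- function(B3, noise, epochs = 20, n_sims = 500){
  hits <- replicate(n_sims, {
    t <- seq(1, epochs)
    X <- c(rep(0, epochs/2), rep(1, epochs/2))
    Y <- 2 + 0.5*t + 4*X + B3*t*X + noise*runif(epochs)
    fit <- lm(Y ~ t + X + t:X)
    summary(fit)$coefficients["t:X", "Pr(>|t|)"] <= 0.05
  })
  mean(hits)
}
# e.g., compare detect_rate(B3 = 0.3, noise = 3) with detect_rate(B3 = 0.3, noise = 5)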
The data were collected from the National Health and Nutrition Examination Survey (NHANES) and cover the period 1999-2016.
NHANES, the pilot data source for this project, is a national survey that combines interviews and physical examinations. The interview covers demographic, socioeconomic, dietary, and health-related questions.
The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel. We chose NHANES as the pilot data source because it is a national program designed to assess the health and nutrition status of children and adults in the United States.
It is also part of the National Center for Health Statistics program, which aims to produce vital and health statistics for the nation. NHANES uses a complex, multistage probability sampling design to select survey participants representative of the US population. The sampling procedure consists of four stages (a design-aware analysis would also incorporate the survey weights, as sketched below):
Counties (primary sampling units)
Segments within counties (blocks of households)
Households within segments
Individuals within households
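Here is a minimal sketch (not used in the analysis below) of how this complex, multistage design could be honored with the survey package, assuming the analysis data frame built later retains the standard NHANES design variables SDMVPSU (PSU), SDMVSTRA (stratum), and WTMEC2YR (examination weight).
# Sketch of a design-adjusted estimate, assuming the design variables are present
library(survey)
des <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
                 weights = ~WTMEC2YR, nest = TRUE, data = NHANES)
svymean(~LBXCOT, des, na.rm = TRUE)  # design-adjusted mean serum cotinine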
Smokers vs Nonsmokers
Nonsmokers in this study are defined as people who have not smoked at least 100 cigarettes in their lives. This corresponds to a yes/no NHANES question (SMQ020): participants who answer yes are considered smokers, and those who answer no are considered nonsmokers.
The study population of primary interest in this project is nonsmokers.
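The sketch below mirrors this classification rule as a hypothetical helper (`classify_smoking` is not part of the original pipeline; the analysis code later filters on SMQ020 directly).
# Hypothetical helper mirroring the smoker/nonsmoker rule used in the analysis code
classify_smoking <- function(smq020){
  dplyr::case_when(smq020 == 1 ~ "smoker",      # yes, smoked at least 100 cigarettes
                   smq020 == 2 ~ "nonsmoker",   # no
                   TRUE        ~ NA_character_) # refused / don't know / missing
}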
Poor vs Rich
Income groups are defined by the family income-to-poverty ratio (INDFMPIR): participants with a ratio below 1 are classified as poor, and those with a ratio of 1 or above as rich.
How does raising the federal tax rate on cigarettes affect the level of secondhand smoke exposure among smokers and nonsmokers?
Under the policy, do people with different income levels experience different effects on their SHS exposure?
Here, we develop a hypothesis about the intervention's impact on the outcome and apply the regression model to analyze the results. The analysis is conducted in R using the statistical packages listed below.
# R Markdown formatting packages
library(formattable)
library(knitr)
library(kableExtra)
# general R packages
library(dplyr)
library(data.table)
library(tidyverse)
library(ggplot2)
library(readxl)
library(ISLR)
library(Amelia)
library(mlbench)
library(pscl)
library(Matrix)
library(xgboost)
library(mltools)
library(caret)
library(Hmisc)
library(smbinning)
library(InformationValue)
library(car)
library(broom)
library(modelr)
library(randomForest)
library(rsample)
library(e1071)
library(glue)
library(DataExplorer)
library(reshape2)
library(dummies)
library(table1)
library(AICcmodavg)
library(readr)
library(fitdistrplus)
library(EnvStats)
library(olsrr)
library(nortest)
library(grid)
library(gridExtra)
library(plotly)
library(DT)
# DecisionTree
library(mlr)
library(parallel)
library(parallelMap)
library(rpart.plot)
library(MASS)
source("http://www.sthda.com/upload/rquery_cormat.r")
First, we import the required R packages.
Then we filter the NHANES data set to keep only observations with a valid outcome value (serum cotinine).
# Load and read data from NHANES
data <- read.csv('nhanesallrev_subset_rev.csv')
# Load the codebook spreadsheet describing the variables
code_book <- read_excel('NHANES public data 1999-2016 codebook.xlsx', skip = 5)
# Keep only rows with a valid serum cotinine value (LBXCOT) and a
# definite yes/no answer to the smoking question (SMQ020)
NHANES <- data %>%
  dplyr::filter(!(is.na(LBXCOT) | LBXCOT == 9999)) %>%
  dplyr::filter(SMQ020 %in% c(1, 2))
The total number of observations for the filtered data set is: 44667
The total number of columns of the filtered data set is: 442
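These counts can be reproduced directly from the filtered data frame:
nrow(NHANES)  # 44667 observations
ncol(NHANES)  # 442 columns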
In this section, the interrupted time series model is fit to the pilot data.
The observations are divided into four study groups:
Poor Smokers
Poor Nonsmokers
Rich Smokers
Rich Nonsmokers
# Split the sample into poor and rich groups by income-to-poverty ratio
df_poor <- NHANES %>%
  filter(INDFMPIR < 1)
df_rich <- NHANES %>%
  filter(INDFMPIR >= 1)
# Poor smokers
df1 <- df_poor %>%
  filter(SMQ020 == 1) %>%
  dplyr::group_by(SDDSRVYR, RIDEXMON) %>%
  dplyr::summarise(Serum_Cotinine = mean(LBXCOT))
df1$Time <- 1:nrow(df1)
df1$treatment <- ifelse(df1$SDDSRVYR <= 5, 0, 1)
df1$timesince <- df1$Time * df1$treatment
df1$year <- c(1999:2016)
# Poor nonsmokers
df2 <- df_poor %>%
  filter(SMQ020 == 2) %>%
  dplyr::group_by(SDDSRVYR, RIDEXMON) %>%
  dplyr::summarise(Serum_Cotinine = mean(LBXCOT))
df2$Time <- 1:nrow(df2)
df2$treatment <- ifelse(df2$SDDSRVYR <= 5, 0, 1)
df2$timesince <- df2$Time * df2$treatment
df2$year <- c(1999:2016)
# Rich smokers
df3 <- df_rich %>%
  filter(SMQ020 == 1) %>%
  dplyr::group_by(SDDSRVYR, RIDEXMON) %>%
  dplyr::summarise(Serum_Cotinine = mean(LBXCOT))
df3$Time <- 1:nrow(df3)
df3$treatment <- ifelse(df3$SDDSRVYR <= 5, 0, 1)
df3$timesince <- df3$Time * df3$treatment
df3$year <- c(1999:2016)
# Rich nonsmokers
df4 <- df_rich %>%
  filter(SMQ020 == 2) %>%
  dplyr::group_by(SDDSRVYR, RIDEXMON) %>%
  dplyr::summarise(Serum_Cotinine = mean(LBXCOT))
df4$Time <- 1:nrow(df4)
df4$treatment <- ifelse(df4$SDDSRVYR <= 5, 0, 1)
df4$timesince <- df4$Time * df4$treatment
df4$year <- c(1999:2016)
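The four blocks above differ only in the income subset and the smoking filter; a hypothetical helper like the sketch below (equivalent to, but not used to produce, the results that follow) would remove the duplication.
# Sketch of a helper that builds one group data frame
make_group <- function(data, smq){
  out <- data %>%
    filter(SMQ020 == smq) %>%
    dplyr::group_by(SDDSRVYR, RIDEXMON) %>%
    dplyr::summarise(Serum_Cotinine = mean(LBXCOT))
  out$Time <- 1:nrow(out)
  out$treatment <- ifelse(out$SDDSRVYR <= 5, 0, 1)
  out$timesince <- out$Time * out$treatment
  out$year <- 1999:2016
  out
}
# df1 <- make_group(df_poor, 1); df2 <- make_group(df_poor, 2)
# df3 <- make_group(df_rich, 1); df4 <- make_group(df_rich, 2)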
Below, we plot the series for each of the four groups.
The red dashed line marks the public policy intervention.
datatable(df1[,],
rownames = FALSE,
options=list(pageLength = 5))
ggplot(data=df1, aes(x=year, y=Serum_Cotinine, group=1)) +
geom_line()+
geom_point()+
ggtitle("Serum Cotinine Level in Poor Smokers, 1999-2016") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Time") +
ylab("Serum Cotinine Rate")+
#coord_cartesian(xlim = c(1999, 2016)) +
geom_vline(xintercept=2008, linetype="longdash", color='red')
datatable(df2[,],
rownames = FALSE,
options=list(pageLength = 5))
ggplot(data=df2, aes(x=year, y=Serum_Cotinine, group=1)) +
geom_line()+
geom_point()+
ggtitle("Serum Cotinine Level in Poor Non-smokers, 1999-2016") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Time") +
ylab("Serum Cotinine Rate")+
#coord_cartesian(xlim = c(1999, 2016)) +
geom_vline(xintercept=2008, linetype="longdash", color='red')
datatable(df3[,],
rownames = FALSE,
options=list(pageLength = 5))
ggplot(data=df3, aes(x=year, y=Serum_Cotinine, group=1)) +
geom_line()+
geom_point()+
ggtitle("Serum Cotinine Level in Rich Smokers, 1999-2016") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Time") +
ylab("Serum Cotinine Rate")+
#coord_cartesian(xlim = c(1999, 2016)) +
geom_vline(xintercept=2008, linetype="longdash", color='red')
datatable(df4[,],
rownames = FALSE,
options=list(pageLength = 5))
ggplot(data=df4, aes(x=year, y=Serum_Cotinine, group=1)) +
geom_line()+
geom_point()+
ggtitle("Serum Cotinine Level in Rich Non-smokers, 1999-2016") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Time") +
ylab("Serum Cotinine Rate")+
#coord_cartesian(xlim = c(1999, 2016)) +
geom_vline(xintercept=2008, linetype="longdash", color='red')
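Equivalently (a sketch, producing the same four panels in one figure), the group data frames could be stacked with a label column and drawn with facet_wrap:
# Sketch: one faceted plot instead of four separate plots
all_groups <- bind_rows(
  df1 %>% mutate(group = "Poor Smokers"),
  df2 %>% mutate(group = "Poor Nonsmokers"),
  df3 %>% mutate(group = "Rich Smokers"),
  df4 %>% mutate(group = "Rich Nonsmokers"))
ggplot(all_groups, aes(x = year, y = Serum_Cotinine)) +
  geom_line() + geom_point() +
  geom_vline(xintercept = 2008, linetype = "longdash", color = "red") +
  facet_wrap(~group, scales = "free_y") +
  labs(x = "Time", y = "Serum Cotinine Level")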
# Prediction grid shared by all four models; treatment switches on after cycle 10
temp <- tibble(Time = seq(1, 18, by = 0.1),
               treatment = c(rep(0, 91), rep(1, 80)))
# Fit the poor smokers group (note: treatment + Time estimates a level
# change only; no interaction/trend-change term is included)
model1 <- glm(Serum_Cotinine ~ treatment + Time, family=gaussian, df1)
pred1 <- predict(model1,temp)
plot(df1$Serum_Cotinine,type="n",ylim=c(0,250),xlab="Year", ylab="Serum Cotinine Level", bty="l",xaxt="n")
rect(10,0,18.5,250,col=grey(0.9),border=F)
points(df1$Serum_Cotinine,cex=0.7)
axis(1,at=0:1*0.5,labels=F)
axis(1,at=0:1*10,labels=F)
axis(1,at=0:1*18,labels=F)
axis(1,at=0:1*10+4.5,tick=F,labels=c('1999-2008','2009-2016'))
lines(temp$Time,pred1,col="dodgerblue4")
title("Serum Cotinine Level in Poor Smokers, 1999-2016")
summary(model1)
##
## Call:
## glm(formula = Serum_Cotinine ~ treatment + Time, family = gaussian,
## data = df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -52.526 -16.903 5.028 15.981 40.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 110.583 15.246 7.253 2.82e-06 ***
## treatment -32.539 24.361 -1.336 0.2016
## Time 5.165 2.333 2.214 0.0428 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 677.7476)
##
## Null deviance: 14352 on 17 degrees of freedom
## Residual deviance: 10166 on 15 degrees of freedom
## AIC: 173.14
##
## Number of Fisher Scoring iterations: 2
model2 <- glm(Serum_Cotinine ~ treatment + Time, family=gaussian, df2)
pred2 <- predict(model2,temp)
plot(df2$Serum_Cotinine,type="n",ylim=c(0,30),xlab="Year", ylab="Serum Cotinine Level", bty="l",xaxt="n")
rect(10,0,18.5,30,col=grey(0.9),border=F)
points(df2$Serum_Cotinine,cex=0.7)
axis(1,at=0:1*0.5,labels=F)
axis(1,at=0:1*10,labels=F)
axis(1,at=0:1*18,labels=F)
axis(1,at=0:1*10+4.5,tick=F,labels=c('1999-2008','2009-2016'))
lines(temp$Time,pred2,col="dodgerblue4")
title("Serum Cotinine Level in Poor Non-Smokers, 1999-2016")
summary(model2)
##
## Call:
## glm(formula = Serum_Cotinine ~ treatment + Time, family = gaussian,
## data = df2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.336 -5.281 -1.167 3.688 12.921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9761 4.0697 2.206 0.0434 *
## treatment -10.0634 6.5025 -1.548 0.1426
## Time 0.9442 0.6228 1.516 0.1503
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 48.29017)
##
## Null deviance: 846.24 on 17 degrees of freedom
## Residual deviance: 724.35 on 15 degrees of freedom
## AIC: 125.59
##
## Number of Fisher Scoring iterations: 2
model3 <- glm(Serum_Cotinine ~ treatment + Time, family=gaussian, df3)
pred3 <- predict(model3,temp)
plot(df3$Serum_Cotinine,type="n",ylim=c(0,150),xlab="Year", ylab="Serum Cotinine Level", bty="l",xaxt="n")
rect(10,0,18.5,150,col=grey(0.9),border=F)
points(df3$Serum_Cotinine,cex=0.7)
axis(1,at=0:1*0.5,labels=F)
axis(1,at=0:1*10,labels=F)
axis(1,at=0:1*18,labels=F)
axis(1,at=0:1*10+4.5,tick=F,labels=c('1999-2008','2009-2016'))
lines(temp$Time,pred3,col="dodgerblue4")
title("Serum Cotinine Level in Rich Smokers, 1999-2016")
summary(model3)
##
## Call:
## glm(formula = Serum_Cotinine ~ treatment + Time, family = gaussian,
## data = df3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -20.080 -6.162 1.331 6.219 17.390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.0060 6.2770 13.224 1.13e-09 ***
## treatment -17.1230 10.0294 -1.707 0.1084
## Time 2.4167 0.9606 2.516 0.0237 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 114.8797)
##
## Null deviance: 2545.5 on 17 degrees of freedom
## Residual deviance: 1723.2 on 15 degrees of freedom
## AIC: 141.19
##
## Number of Fisher Scoring iterations: 2
model4 <- glm(Serum_Cotinine ~ treatment + Time, family=gaussian, df4)
pred4 <- predict(model4,temp)
plot(df4$Serum_Cotinine,type="n",ylim=c(0,20),xlab="Year", ylab="Serum Cotinine Level", bty="l",xaxt="n")
rect(10,0,18.5,20,col=grey(0.9),border=F)
points(df4$Serum_Cotinine,cex=0.7)
axis(1,at=0:1*0.5,labels=F)
axis(1,at=0:1*10,labels=F)
axis(1,at=0:1*18,labels=F)
axis(1,at=0:1*10+4.5,tick=F,labels=c('1999-2008','2009-2016'))
lines(temp$Time,pred4,col="dodgerblue4")
title("Serum Cotinine Level in Rich Non-Smokers, 1999-2016")
summary(model4)
##
## Call:
## glm(formula = Serum_Cotinine ~ treatment + Time, family = gaussian,
## data = df4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0763 -1.1345 -0.0506 0.5144 4.1416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8773 1.3143 4.472 0.000448 ***
## treatment -6.2589 2.0999 -2.981 0.009335 **
## Time 0.6300 0.2011 3.133 0.006846 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 5.036159)
##
## Null deviance: 126.501 on 17 degrees of freedom
## Residual deviance: 75.542 on 15 degrees of freedom
## AIC: 84.9
##
## Number of Fisher Scoring iterations: 2
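Before interpreting the results, the policy (treatment) coefficients can be compared across the four fitted models in one table; the sketch below uses broom::tidy over the models already fit above.
# Sketch: side-by-side comparison of the treatment coefficient in the four groups
list(`Poor Smokers` = model1, `Poor Nonsmokers` = model2,
     `Rich Smokers` = model3, `Rich Nonsmokers` = model4) %>%
  purrr::map_dfr(broom::tidy, .id = "group") %>%
  dplyr::filter(term == "treatment") %>%
  dplyr::select(group, estimate, std.error, p.value)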
Based on these preliminary results, there is clear evidence that rich nonsmokers are the only group whose SHS exposure fell significantly with the policy intervention. There is also evidence that the intervention did not have an impact on smokers, whether rich or poor. It is not surprising that the federal tax increase on cigarettes does not affect smokers' serum cotinine levels, since smokers tend not to quit smoking easily.
The results further suggest that poverty may increase the chance of being exposed to SHS. However, this interrupted time series study has limitations and important threats to validity. One of the biggest limitations is the lack of statistical power in the regression analysis. Because of the nature of the public survey, many other useful variables (e.g., month and location information) are restricted and not accessible.
I would like to thank Prof. David Broniatowski, Dan Koban, Dr. MyDzung T. Chu, and Dr. Ami Zota for their great support and for providing the NHANES data set used in this project.
This project was supported by US HUD grant DCHHU0054-19. The content of this manuscript is solely the responsibility of the grantee and does not represent the official views of any funding entity. Further, no parties involved endorse the purchase of any commercial products or services discussed in this manuscript.
RDC Research Project Title: The Impact of Housing Assistance on Residential Environmental Exposures
Funded by: US Department of Housing and Urban Development (HUD) Healthy Homes Program Homes Technical Research (DCHHU0054-19)
Organization: Environmental Occupational Health at The George Washington University, The Milken Institute School of Public Health
Supervisors: Dr. MyDzung T. Chu and Dr. Ami Zota
Victora CG, Habicht J-P, Bryce J. Evidence-based public health: moving beyond randomized trials. Am J Public Health 2004;94:400–05.
“Health Effects of Secondhand Smoke” Feb. 27, 2020. Accessed on: Dec. 13, 2021. [Online]. Available: https://www.cdc.gov/tobacco/data_statistics/fact_sheets/secondhand_smoke/health_effects/index.htm
Richter PA, Bishop EE, Wang J, Swahn MH. Tobacco Smoke Exposure and Levels of Urinary Metals in the U.S. Youth and Adult Population: The National Health and Nutrition Examination Survey (NHANES) 1999–2004. International Journal of Environmental Research and Public Health. 2009; 6(7):1930-1946.
Gaurang P. Nazar, MSc, John Tayu Lee, PhD, Monika Arora, PhD, Christopher Millett, PhD, Socioeconomic Inequalities in Secondhand Smoke Exposure at Home and at Work in 15 Low- and Middle-Income Countries, Nicotine & Tobacco Research, Volume 18, Issue 5, May 2016, Pages 1230–1239
“About the National Health and Nutrition Examination Survey” Sep. 15, 2017. Accessed on: Dec. 13, 2021. [Online]. Available: https://www.cdc.gov/nchs/nhanes/about_nhanes.htm
“Module 2: Sample Design” Sep. 15, 2017. Accessed on: Dec. 13, 2021. [Online]. Available: https://wwwn.cdc.gov/nchs/nhanes/tutorials/module2.aspx
Bernal JL, Cummins S, Gasparrini A. Interrupted time series regression for the evaluation of public health interventions: a tutorial. Int J Epidemiol. 2017 Feb 1;46(1):348-355. doi: 10.1093/ije/dyw098. Erratum in: Int J Epidemiol. 2020 Aug 1;49(4):1414. PMID: 27283160; PMCID: PMC5407170.
Bush T, Zbikowski S, Mahoney L, Deprey M, Mowery PD, Magnusson B. The 2009 US federal cigarette tax increase and quitline utilization in 16 states. J Environ Public Health. 2012;2012:314740. doi: 10.1155/2012/314740. Epub 2012 May 8. PMID: 22649463; PMCID: PMC3356941.