Introduction

Secondhand smoke (SHS) causes numerous serious health problems in adults and children. According to CDC statistics, SHS causes more than 7,300 lung cancer deaths among nonsmokers each year. In this project, we introduce interrupted time series regression through a real-world example of a public health intervention. We first describe the causal inference framing of the project, including the proposed cause and the proposed effect. We then discuss the sampling strategies that bear on the external validity of our pilot data. The measurements underpinning the construct validity of the study are clearly defined before the quasi-experimental design is implemented. Finally, we describe the main method, interrupted time series analysis, and interpret the results.

Research Method

An interrupted time series (ITS) design is a quasi-experimental design that can be used to assess the impact of a policy implemented at a specific point in time. First, we identify the underlying trend in the pre-treatment period and determine whether the intervention interrupts that trend at a known point in time. The counterfactual scenario assumes the underlying trend would have continued unchanged had the intervention not occurred.

Effects come in two forms:

  1. Change in level or intercept - the time series jumps up or down immediately after the interruption, or the pre- and post-treatment segments have different intercepts.
  2. Change in slope or trend - the rate of change per time interval differs after the interruption.

Justification for using an ITS

  1. Intervention begins at a specific, known point in time
  2. We have sequential observations before and after the intervention
  3. Ordinary statistics (e.g., a t-test comparing pre- and post-intervention means) are inappropriate because:
  • observations are not independent
  • the mean and variance are not constant through time, i.e., the observations are not identically distributed
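
Because of these violations, it is worth checking residual diagnostics after fitting. A minimal sketch, assuming a fitted lm object named model (the object name is illustrative):

# Inspect residual autocorrelation; spikes outside the confidence band suggest dependence
acf(resid(model))

# Durbin-Watson test from the lmtest package; statistics near 2 indicate little lag-1 autocorrelation
lmtest::dwtest(model)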

Common threats to validity

  • History: specific events occurring between the first and second measurement in addition to the experimental variable
  • Instrumentation: Changes in the calibration of a measuring instrument (e.g., surveys, interviews, etc.).

Segmented Regression Model

Model formulation

\(Y_{t} = \beta_{0} + \beta_{1}T + \beta_{2}X_{t} + \beta_{3}TX_{t}\)

\(T\): the time elapsed since the start of the study period
\(X_{t}\): a dummy variable indicating the pre- and post-intervention period
\(Y_{t}\): the outcome at time \(t\)
\(\beta_{0}=\) the baseline level
\(\beta_{1}=\) the baseline trend
\(\beta_{2}=\) the level change
\(\beta_{3}=\) the trend change
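
Under this parameterization, the pre-intervention (and counterfactual) prediction is \(\beta_{0} + \beta_{1}T\), while the post-intervention prediction is \((\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3})T\); the estimated intervention effect at time \(T\) is therefore \(\beta_{2} + \beta_{3}T\). A minimal fitting sketch, assuming a data frame df with columns Y, T, and X as defined above:

# Fit the segmented regression; T * X expands to T + X + T:X
fit <- lm(Y ~ T * X, data = df)

# Counterfactual: predictions with the intervention dummy switched off
counterfactual <- predict(fit, newdata = transform(df, X = 0))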

Detecting effects

To demonstrate how the model works, consider the following examples, in which we randomly generate time series data and attempt to detect level and trend changes via segmented regression. As expected, the model performs better when the effect is large or the data are less noisy. (Note that the noise term below is uniform on \([0, noise]\), so its mean is \(noise/2\); this is why the estimated intercepts drift upward as the noise scale increases.)

library(broom)
library(tidyverse)

toy_its <- function(B0, B1, B2, B3, epochs, noise){

# Create synthetic data: the dummy X switches from 0 to 1 at the midpoint (the intervention)
t <- seq(1, epochs)
X <- c(rep(0, epochs/2), rep(1, epochs/2))
t0 <- epochs/2 + 1 # time of intervention
Y <- B0 + B1*t + B2*X + B3*t*X + noise*runif(length(X))
df <- tibble(t = t, X = X, Y = Y, t0 = t0)

# Fit the segmented regression model
model <- lm(Y ~ t + X + t*X, data = df)
results <- tidy(model)

# Flag a significant trend change (the interaction term t:X)
slope_change <- results %>% filter(term == "t:X", p.value <= 0.05) %>% nrow()
if (slope_change == 1){
  slope_change <- paste0("significant change in slope (", 
                         round(results$estimate[results$term == "t:X"], 2), ")") 
}
if (slope_change == 0){
  slope_change <- " "
}

# Flag a significant level change (the dummy term X)
level_change <- results %>% filter(term == "X", p.value <= 0.05) %>% nrow()
if (level_change == 1){
  level_change <- paste0("significant change in level (", 
                         round(results$estimate[results$term == "X"], 2), ")") 
}
if (level_change == 0){
  level_change <- " "
}

# Plot the observations, the fitted segments, and a shaded post-intervention region
pred <- predict(model, newdata = df, type = "response")

plot(df$Y, type="n", ylim=c(0, max(df$Y) + 4),
     xlab="Time", ylab="Response", bty="l", xaxt="n")
rect(epochs/2, 0, epochs, 100, col=grey(0.9), border=F)
points(df$Y, cex=0.7)
lines(1:epochs, pred, col=2)
axis(1, at=(0:epochs)*2, labels=F)
axis(1, at=(0:(epochs/2))*2, tick=F, labels=(0:(epochs/2))*2)
title(paste0(level_change, "\n", slope_change))
summary(model)
}

Level changes

\(\beta_{2}=0\)

set.seed(1)
# Null case: no level or trend change (B2 = B3 = 0), low-noise data
toy_its(B0=2, B1=0.5, B2=0, B3=0, epochs = 20, noise=1)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52276 -0.16406  0.00888  0.24377  0.38215 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.51114    0.19584  12.823 7.83e-10 ***
## t            0.50734    0.03156  16.074 2.70e-11 ***
## X           -0.78090    0.53470  -1.460    0.164    
## t:X          0.04612    0.04464   1.033    0.317    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2867 on 16 degrees of freedom
## Multiple R-squared:  0.9924, Adjusted R-squared:  0.991 
## F-statistic: 697.1 on 3 and 16 DF,  p-value: < 2.2e-16

\(\beta_{2}=1\)

toy_its(B0=2, B1=0.5, B2=1, B3=0, epochs = 20, noise=1)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42103 -0.18237 -0.02009  0.19578  0.50038 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.49534    0.19960  12.502 1.13e-09 ***
## t            0.48600    0.03217  15.108 6.86e-11 ***
## X            1.03702    0.54496   1.903   0.0752 .  
## t:X          0.01381    0.04549   0.304   0.7653    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2922 on 16 degrees of freedom
## Multiple R-squared:  0.994,  Adjusted R-squared:  0.9929 
## F-statistic: 885.7 on 3 and 16 DF,  p-value: < 2.2e-16

\(\beta_{2}=2\)

toy_its(B0=2, B1=0.5, B2=2, B3=0, epochs = 20, noise=1)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55049 -0.08353  0.01891  0.14280  0.42226 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.71870    0.17008  15.985 2.93e-11 ***
## t            0.47930    0.02741  17.486 7.51e-12 ***
## X            1.82102    0.46436   3.922  0.00122 ** 
## t:X          0.01230    0.03876   0.317  0.75511    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.249 on 16 degrees of freedom
## Multiple R-squared:  0.9963, Adjusted R-squared:  0.9957 
## F-statistic:  1454 on 3 and 16 DF,  p-value: < 2.2e-16

\(\beta_{2}=3\)

toy_its(B0=2, B1=0.5, B2=3, B3=0, epochs = 20, noise=1)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42218 -0.19083 -0.04299  0.19339  0.39571 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.51851    0.18508  13.608 3.26e-10 ***
## t            0.49866    0.02983  16.718 1.49e-11 ***
## X            2.40990    0.50532   4.769 0.000209 ***
## t:X          0.04609    0.04218   1.093 0.290750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2709 on 16 degrees of freedom
## Multiple R-squared:  0.9969, Adjusted R-squared:  0.9963 
## F-statistic:  1698 on 3 and 16 DF,  p-value: < 2.2e-16

\(\beta_{2}=4\)

toy_its(B0=2, B1=0.5, B2=4, B3=0, epochs = 20, noise=1)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37815 -0.16217 -0.09326  0.17877  0.36646 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.62806    0.16624  15.809 3.47e-11 ***
## t            0.45952    0.02679  17.151 1.01e-11 ***
## X            3.36207    0.45389   7.407 1.48e-06 ***
## t:X          0.07773    0.03789   2.052    0.057 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2434 on 16 degrees of freedom
## Multiple R-squared:  0.9979, Adjusted R-squared:  0.9976 
## F-statistic:  2595 on 3 and 16 DF,  p-value: < 2.2e-16

Trend changes

\(\beta_{3}=0\)

set.seed(1)
# Null case: no trend change (B3 = 0), low-noise data
toy_its(B0=2, B1=0.5, B2=0, B3=0, epochs = 20, noise=1)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52276 -0.16406  0.00888  0.24377  0.38215 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.51114    0.19584  12.823 7.83e-10 ***
## t            0.50734    0.03156  16.074 2.70e-11 ***
## X           -0.78090    0.53470  -1.460    0.164    
## t:X          0.04612    0.04464   1.033    0.317    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2867 on 16 degrees of freedom
## Multiple R-squared:  0.9924, Adjusted R-squared:  0.991 
## F-statistic: 697.1 on 3 and 16 DF,  p-value: < 2.2e-16

\(\beta_{3}=0.1\)

toy_its(B0=2, B1=0.5, B2=0, B3=0.1, epochs = 20, noise=1)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42103 -0.18237 -0.02009  0.19578  0.50038 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.49534    0.19960  12.502 1.13e-09 ***
## t            0.48600    0.03217  15.108 6.86e-11 ***
## X            0.03702    0.54496   0.068   0.9467    
## t:X          0.11381    0.04549   2.502   0.0236 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2922 on 16 degrees of freedom
## Multiple R-squared:  0.995,  Adjusted R-squared:  0.994 
## F-statistic:  1058 on 3 and 16 DF,  p-value: < 2.2e-16

\(\beta_{3}=0.2\)

toy_its(B0=2, B1=0.5, B2=0, B3=0.2, epochs = 20, noise=1)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55049 -0.08353  0.01891  0.14280  0.42226 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.71870    0.17008  15.985 2.93e-11 ***
## t            0.47930    0.02741  17.486 7.51e-12 ***
## X           -0.17898    0.46436  -0.385    0.705    
## t:X          0.21230    0.03876   5.477 5.07e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.249 on 16 degrees of freedom
## Multiple R-squared:  0.9973, Adjusted R-squared:  0.9968 
## F-statistic:  1994 on 3 and 16 DF,  p-value: < 2.2e-16

\(\beta_{3}=0.3\)

toy_its(B0=2, B1=0.5, B2=0, B3=0.3, epochs = 20, noise=1)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42218 -0.19083 -0.04299  0.19339  0.39571 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.51851    0.18508  13.608 3.26e-10 ***
## t            0.49866    0.02983  16.718 1.49e-11 ***
## X           -0.59010    0.50532  -1.168     0.26    
## t:X          0.34609    0.04218   8.204 3.99e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2709 on 16 degrees of freedom
## Multiple R-squared:  0.9979, Adjusted R-squared:  0.9975 
## F-statistic:  2524 on 3 and 16 DF,  p-value: < 2.2e-16

\(\beta_{3}=0.4\)

toy_its(B0=2, B1=0.5, B2=0, B3=0.4, epochs = 20, noise=1)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37815 -0.16217 -0.09326  0.17877  0.36646 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.62806    0.16624  15.809 3.47e-11 ***
## t            0.45952    0.02679  17.151 1.01e-11 ***
## X           -0.63793    0.45389  -1.405    0.179    
## t:X          0.47773    0.03789  12.608 1.00e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2434 on 16 degrees of freedom
## Multiple R-squared:  0.9987, Adjusted R-squared:  0.9985 
## F-statistic:  4139 on 3 and 16 DF,  p-value: < 2.2e-16

Changes with noise

\(noise=1\)

set.seed(1)
# Detect simultaneous level and trend changes with low-noise data
toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 20, noise=1)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52276 -0.16406  0.00888  0.24377  0.38215 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.51114    0.19584  12.823 7.83e-10 ***
## t            0.50734    0.03156  16.074 2.70e-11 ***
## X            3.21910    0.53470   6.020 1.78e-05 ***
## t:X          0.34612    0.04464   7.754 8.29e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2867 on 16 degrees of freedom
## Multiple R-squared:  0.9987, Adjusted R-squared:  0.9985 
## F-statistic:  4112 on 3 and 16 DF,  p-value: < 2.2e-16

\(noise=2\)

toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 20, noise=2)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84205 -0.36475 -0.04017  0.39157  1.00075 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.99067    0.39920   7.492 1.29e-06 ***
## t            0.47200    0.06434   7.336 1.67e-06 ***
## X            4.07405    1.08992   3.738  0.00179 ** 
## t:X          0.32763    0.09099   3.601  0.00239 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5844 on 16 degrees of freedom
## Multiple R-squared:  0.9947, Adjusted R-squared:  0.9938 
## F-statistic:  1009 on 3 and 16 DF,  p-value: < 2.2e-16

\(noise=3\)

toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 20, noise=3)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65146 -0.25059  0.05672  0.42841  1.26677 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.15611    0.51023   8.146 4.38e-07 ***
## t            0.43791    0.08223   5.325 6.83e-05 ***
## X            3.46307    1.39307   2.486   0.0244 *  
## t:X          0.33690    0.11629   2.897   0.0105 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7469 on 16 degrees of freedom
## Multiple R-squared:  0.9904, Adjusted R-squared:  0.9886 
## F-statistic: 548.9 on 3 and 16 DF,  p-value: 2.442e-16

\(noise=4\)

toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 20, noise=4)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6887 -0.7633 -0.1720  0.7736  1.5828 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.0741     0.7403   5.503 4.82e-05 ***
## t             0.4946     0.1193   4.146  0.00076 ***
## X             1.6396     2.0213   0.811  0.42918    
## t:X           0.4844     0.1687   2.871  0.01110 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.084 on 16 degrees of freedom
## Multiple R-squared:  0.9831, Adjusted R-squared:  0.9799 
## F-statistic:   310 on 3 and 16 DF,  p-value: 2.216e-14

\(noise=5\)

toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 20, noise=5)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8908 -0.8108 -0.4663  0.8939  1.8323 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.1403     0.8312   6.184 1.31e-05 ***
## t             0.2976     0.1340   2.221  0.04110 *  
## X             0.8104     2.2694   0.357  0.72570    
## t:X           0.6887     0.1894   3.635  0.00223 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.217 on 16 degrees of freedom
## Multiple R-squared:  0.9795, Adjusted R-squared:  0.9757 
## F-statistic: 255.1 on 3 and 16 DF,  p-value: 1.023e-13
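
Across these fits, the level-change estimate degrades first as the noise scale grows: at noise = 4 and noise = 5 the change in level (true B2 = 4) is no longer statistically significant, while the trend change remains detectable throughout.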

Short Time Series

\(epochs=20\)

set.seed(1)
# Baseline: both level and trend changes with 20 observations
toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 20, noise=3)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.56829 -0.49219  0.02664  0.73132  1.14645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.53342    0.58751   6.014 1.80e-05 ***
## t            0.52202    0.09469   5.513 4.72e-05 ***
## X            1.65730    1.60409   1.033  0.31689    
## t:X          0.43835    0.13391   3.274  0.00478 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.86 on 16 degrees of freedom
## Multiple R-squared:  0.9887, Adjusted R-squared:  0.9866 
## F-statistic: 465.6 on 3 and 16 DF,  p-value: 8.986e-16

\(epochs=16\)

toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 16, noise=3)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11834 -0.59958 -0.03557  0.58887  0.99980 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.0013     0.6107   6.552 2.72e-05 ***
## t             0.3030     0.1209   2.506   0.0276 *  
## X             3.6592     1.6539   2.213   0.0471 *  
## t:X           0.4981     0.1710   2.912   0.0130 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7838 on 12 degrees of freedom
## Multiple R-squared:  0.9886, Adjusted R-squared:  0.9857 
## F-statistic: 345.6 on 3 and 12 DF,  p-value: 6.545e-12

\(epochs=14\)

toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 14, noise=3)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5576 -0.2234  0.1701  0.4586  0.9119 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.3485     0.6952   4.816 0.000706 ***
## t             0.6223     0.1555   4.003 0.002506 ** 
## X             3.6755     1.8719   1.963 0.077989 .  
## t:X           0.2326     0.2198   1.058 0.315014    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8226 on 10 degrees of freedom
## Multiple R-squared:  0.9843, Adjusted R-squared:  0.9796 
## F-statistic: 208.8 on 3 and 10 DF,  p-value: 2.578e-09

\(epochs=12\)

toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 12, noise=3)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84877 -0.38509 -0.07159  0.23439  1.07883 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.4326     0.6421   6.904 0.000124 ***
## t             0.1181     0.1649   0.716 0.494211    
## X             2.4595     1.7160   1.433 0.189681    
## t:X           0.7517     0.2332   3.224 0.012166 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6897 on 8 degrees of freedom
## Multiple R-squared:  0.9887, Adjusted R-squared:  0.9844 
## F-statistic: 232.9 on 3 and 8 DF,  p-value: 4.024e-08

\(epochs=10\)

toy_its(B0=2, B1=0.5, B2=4, B3=0.3, epochs = 10, noise=3)

## 
## Call:
## lm(formula = Y ~ t + X + t * X, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36957 -0.47219  0.09908  0.61772  0.88333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   3.3392     0.9679   3.450   0.0136 *
## t             0.4894     0.2918   1.677   0.1446  
## X             3.4408     2.5608   1.344   0.2276  
## t:X           0.4310     0.4127   1.044   0.3366  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9228 on 6 degrees of freedom
## Multiple R-squared:  0.9781, Adjusted R-squared:  0.9672 
## F-statistic: 89.53 on 3 and 6 DF,  p-value: 2.264e-05
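
These fits illustrate the cost of short series. With the true effects fixed (B2 = 4, B3 = 0.3, noise = 3), standard errors grow as observations are removed, detection becomes erratic at 12 to 14 epochs, and with only 10 epochs neither the level change nor the trend change reaches significance.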

The Pilot Data

The data were collected from the National Health and Nutrition Examination Survey (NHANES) and cover the period 1999-2016.

  • The pilot data source for this project is the National Health and Nutrition Examination Survey (NHANES). NHANES is a national survey that combines interviews and physical examinations. The interview covers demographic, socioeconomic, dietary, and health-related questions.

  • The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel. We chose NHANES as the pilot data source because it is a national program designed to assess the health and nutrition status of children and adults in the United States.

  • It is also part of the National Center for Health Statistics program, which aims to produce vital and health statistics for the nation. NHANES uses a complex, multistage probability sampling design to select participants representative of the US population. The sampling procedure consists of four stages:

  1. Selection of primary sampling units (PSUs), which are counties or small groups of contiguous counties
  2. Selection of segments within PSUs (a block or group of blocks containing a cluster of households)
  3. Selection of specific households within segments
  4. Selection of individuals within a household

Smokers vs Nonsmokers

  • Nonsmokers in this study are defined as people who have not smoked at least 100 cigarettes in their lives. This corresponds to the NHANES question SMQ020, which survey participants answer with yes or no; participants who answer yes are considered smokers, and those who answer no are considered nonsmokers.

  • The primary study population in this project will be nonsmokers.

Poor vs Rich

  • The next important construct is the Federal Poverty Income Ratio (PIR). The PIR is total family income divided by the family poverty threshold. If the PIR is less than 1 (total family income below the poverty threshold), the family and everyone in it is considered to be in poverty. If the PIR is equal to or greater than 1, the family is not considered to be in poverty.
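
A minimal coding sketch for both constructs, assuming the filtered NHANES data frame built in the Load and Clean Data section below (SMQ020 and INDFMPIR are the actual NHANES variable names; the labels are illustrative):

# Smoking status: SMQ020 == 1 means "smoked at least 100 cigarettes" (smoker)
NHANES$smoker <- ifelse(NHANES$SMQ020 == 1, "smoker", "nonsmoker")

# Poverty status: PIR < 1 means family income is below the poverty threshold
NHANES$poverty <- ifelse(NHANES$INDFMPIR < 1, "poor", "rich")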

Research Questions

  • How does raising the federal cigarette tax affect the level of secondhand smoke exposure among smokers and nonsmokers?

  • Under the policy, do people at different income levels experience different effects on their SHS exposure?

Here, we develop a hypothesis about the intervention’s impact on the outcome and apply the regression model to analyze the results. All analyses are conducted in R with standard statistical packages.

Load and Clean Data

# R Markdown formatting packages
library(formattable)
library(knitr)
library(kableExtra)

# general R packages
library(dplyr)
library(data.table)
library(tidyverse)
library(ggplot2)
library(readxl)
library(ISLR)
library(Amelia)
library(mlbench)
library(pscl)
library(Matrix)
library(xgboost)
library(mltools)
library(caret)
library(Hmisc)
library(smbinning)
library(InformationValue)
library(car)
library(broom)
library(modelr)
library(randomForest)
library(rsample)
library(e1071)
library(glue)
library(DataExplorer)
library(reshape2)
library(dummies)
library(table1)
library(AICcmodavg)
library(readr)
library(fitdistrplus)
library(EnvStats)
library(olsrr)
library(nortest)
library(grid)
library(gridExtra)
library(plotly)
library(DT)
# DecisionTree
library(mlr)
library(parallel)
library(parallelMap)
library(rpart.plot)
library(MASS)
source("http://www.sthda.com/upload/rquery_cormat.r")

  • First, import the required R packages.

  • Then filter the NHANES data set to keep only rows with a valid outcome value (serum cotinine).

# Load and read data from NHANES
data <- read.csv('nhanesallrev_subset_rev.csv')

# Load the description excel
code_book <- read_excel('NHANES public data 1999-2016 codebook.xlsx', skip = 5)

# Only select the rows with valid value of Serum Cotinine
NHANES <- data %>% 
  dplyr::filter(!(is.na(LBXCOT)|LBXCOT==9999)) %>% 
  dplyr::filter(SMQ020 %in% c(1,2))
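
The counts reported below can be reproduced directly from the filtered data:

nrow(NHANES)   # number of observations
ncol(NHANES)   # number of columns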

The total number of observations in the filtered data set is 44,667.

The total number of columns in the filtered data set is 442.

Analysis

  • In this section, an interrupted time series model is fitted to the pilot data.

  • The study population is divided into four groups:

  • Poor Smokers

  • Poor Nonsmokers

  • Rich Smokers

  • Rich Nonsmokers

# Filter the poor and rich into two groups
df_poor<-NHANES %>%  
  filter(INDFMPIR<1) 
df_rich<-NHANES %>%  
  filter(INDFMPIR>=1)

# poor smokers
df1 <- df_poor %>%
  filter(SMQ020==1) %>%
  dplyr::group_by(SDDSRVYR,RIDEXMON) %>% 
  dplyr::summarise(Serum_Cotinine = mean(LBXCOT))

df1$Time <- rep(1 : nrow(df1))
df1$treatment  <- ifelse(df1$SDDSRVYR<=5,0,1)
df1$timesince <- df1$Time*df1$treatment
df1$year <- c(1999:2016)

# Poor nonsmokers
df2 <- df_poor %>%
  filter(SMQ020==2) %>%
  dplyr::group_by(SDDSRVYR,RIDEXMON) %>% 
  dplyr::summarise(Serum_Cotinine = mean(LBXCOT))

df2$Time <- rep(1 : nrow(df2))
df2$treatment  <- ifelse(df2$SDDSRVYR<=5,0,1)
df2$timesince <- df2$Time*df2$treatment
df2$year <- c(1999:2016)

# Rich smokers
df3 <- df_rich %>%
  filter(SMQ020==1) %>%
  dplyr::group_by(SDDSRVYR,RIDEXMON) %>% 
  dplyr::summarise(Serum_Cotinine = mean(LBXCOT))

df3$Time <- rep(1 : nrow(df3))
df3$treatment  <- ifelse(df3$SDDSRVYR<=5,0,1)
df3$timesince <- df3$Time*df3$treatment
df3$year <- c(1999:2016)

# Rich nonsmokers
df4 <- df_rich %>%
  filter(SMQ020==2) %>%
  dplyr::group_by(SDDSRVYR,RIDEXMON) %>% 
  dplyr::summarise(Serum_Cotinine = mean(LBXCOT))

df4$Time <- rep(1 : nrow(df4))
df4$treatment  <- ifelse(df4$SDDSRVYR<=5,0,1)
df4$timesince <- df4$Time*df4$treatment
df4$year <- c(1999:2016)
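
The four blocks above repeat the same steps. As an optional refactor (a sketch, not part of the original analysis), the shared logic could live in a single helper:

make_group <- function(data, smoker_code){
  out <- data %>%
    dplyr::filter(SMQ020 == smoker_code) %>%
    dplyr::group_by(SDDSRVYR, RIDEXMON) %>%
    dplyr::summarise(Serum_Cotinine = mean(LBXCOT), .groups = "drop")
  out$Time      <- seq_len(nrow(out))               # sequential time index
  out$treatment <- ifelse(out$SDDSRVYR <= 5, 0, 1)  # cycles 1-5 are pre-intervention
  out$timesince <- out$Time * out$treatment         # time index in the post-period, 0 before
  out$year      <- 1999:2016                        # 18 half-cycle rows map to 1999-2016
  out
}

df1 <- make_group(df_poor, 1)   # poor smokers
df2 <- make_group(df_poor, 2)   # poor nonsmokers
df3 <- make_group(df_rich, 1)   # rich smokers
df4 <- make_group(df_rich, 2)   # rich nonsmokers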

Data Distribution

  • Plotting the distribution of the four groups over time

  • The red dashed line marks the public policy intervention

Poor Smokers

datatable(df1,
          rownames = FALSE,
          options=list(pageLength = 5))
ggplot(data=df1, aes(x=year, y=Serum_Cotinine, group=1)) +
  geom_line()+
  geom_point()+
  ggtitle("Serum Cotinine Level in Poor Smokers, 1999-2016") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  xlab("Time") +
  ylab("Serum Cotinine Rate")+
  #coord_cartesian(xlim = c(1999, 2016)) +
  geom_vline(xintercept=2008, linetype="longdash", color='red')

Poor Nonsmokers

datatable(df2,
          rownames = FALSE,
          options=list(pageLength = 5))
ggplot(data=df2, aes(x=year, y=Serum_Cotinine, group=1)) +
  geom_line()+
  geom_point()+
  ggtitle("Serum Cotinine Level in Poor Non-smokers, 1999-2016") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  xlab("Time") +
  ylab("Serum Cotinine Rate")+
  #coord_cartesian(xlim = c(1999, 2016)) +
  geom_vline(xintercept=2008, linetype="longdash", color='red')

Rich Smokers

datatable(df3,
          rownames = FALSE,
          options=list(pageLength = 5))
ggplot(data=df3, aes(x=year, y=Serum_Cotinine, group=1)) +
  geom_line()+
  geom_point()+
  ggtitle("Serum Cotinine Level in Rich Smokers, 1999-2016") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  xlab("Time") +
  ylab("Serum Cotinine Rate")+
  #coord_cartesian(xlim = c(1999, 2016)) +
  geom_vline(xintercept=2008, linetype="longdash", color='red')

Rich Nonsmokers

datatable(df4,
          rownames = FALSE,
          options=list(pageLength = 5))
ggplot(data=df4, aes(x=year, y=Serum_Cotinine, group=1)) +
  geom_line()+
  geom_point()+
  ggtitle("Serum Cotinine Level in Rich Non-smokers, 1999-2016") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  xlab("Time") +
  ylab("Serum Cotinine Rate")+
  #coord_cartesian(xlim = c(1999, 2016)) +
  geom_vline(xintercept=2008, linetype="longdash", color='red')

Models with ITS

  • This section fits an interrupted time series model to each of the four groups.
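
Note that these models regress serum cotinine on treatment and Time only; the timesince interaction constructed earlier is omitted, so each fit estimates a level change against a single shared trend. A full segmented model including a trend change would be, as a sketch using the same variables:

# Level change (treatment) plus trend change (timesince) for the poor smokers group
model_full <- glm(Serum_Cotinine ~ Time + treatment + timesince,
                  family = gaussian, data = df1)
summary(model_full)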

Poor Smokers

# Fit the poor smokers group
# Prediction grid: Time runs from 1 to 18 in steps of 0.1; treatment switches on after Time = 10
temp <- tibble(Time = seq(1, 18, by = 0.1),
               treatment = c(rep(0, 91), rep(1, 80)))

model1 <- glm(Serum_Cotinine ~ treatment + Time, family=gaussian, df1)
pred1 <- predict(model1,temp)
plot(df1$Serum_Cotinine,type="n",ylim=c(0,250),xlab="Year", ylab="Serum Cotinine Rate", bty="l",xaxt="n")
rect(10,0,18.5,250,col=grey(0.9),border=F)
points(df1$Serum_Cotinine,cex=0.7)
axis(1,at=0:1*0.5,labels=F)
axis(1,at=0:1*10,labels=F)
axis(1,at=0:1*18,labels=F)
axis(1,at=0:1*10+4.5,tick=F,labels=c('1999-2008','2009-2016'))
lines(temp$Time,pred1,col="dodgerblue4")
title("Serum Cotinine Level in Poor Smokers, 1999-2016")

summary(model1)
## 
## Call:
## glm(formula = Serum_Cotinine ~ treatment + Time, family = gaussian, 
##     data = df1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -52.526  -16.903    5.028   15.981   40.655  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  110.583     15.246   7.253 2.82e-06 ***
## treatment    -32.539     24.361  -1.336   0.2016    
## Time           5.165      2.333   2.214   0.0428 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 677.7476)
## 
##     Null deviance: 14352  on 17  degrees of freedom
## Residual deviance: 10166  on 15  degrees of freedom
## AIC: 173.14
## 
## Number of Fisher Scoring iterations: 2

Poor Nonsmokers

model2 <- glm(Serum_Cotinine ~ treatment + Time, family=gaussian, df2)
pred2 <- predict(model2,temp)
plot(df2$Serum_Cotinine,type="n",ylim=c(0,30),xlab="Year", ylab="Serum Cotinine Rate", bty="l",xaxt="n")
rect(10,0,18.5,30,col=grey(0.9),border=F)
points(df2$Serum_Cotinine,cex=0.7)
axis(1,at=0:1*0.5,labels=F)
axis(1,at=0:1*10,labels=F)
axis(1,at=0:1*18,labels=F)
axis(1,at=0:1*10+4.5,tick=F,labels=c('1999-2008','2009-2016'))
lines(temp$Time,pred2,col="dodgerblue4")
title("Serum Cotinine Level in Poor Non-Smokers, 1999-2016")

summary(model2)
## 
## Call:
## glm(formula = Serum_Cotinine ~ treatment + Time, family = gaussian, 
##     data = df2)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -9.336  -5.281  -1.167   3.688  12.921  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   8.9761     4.0697   2.206   0.0434 *
## treatment   -10.0634     6.5025  -1.548   0.1426  
## Time          0.9442     0.6228   1.516   0.1503  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 48.29017)
## 
##     Null deviance: 846.24  on 17  degrees of freedom
## Residual deviance: 724.35  on 15  degrees of freedom
## AIC: 125.59
## 
## Number of Fisher Scoring iterations: 2

Rich Smokers

model3 <- glm(Serum_Cotinine ~ treatment + Time, family=gaussian, df3)
pred3 <- predict(model3,temp)
plot(df3$Serum_Cotinine,type="n",ylim=c(0,150),xlab="Year", ylab="Serum Cotinine Rate", bty="l",xaxt="n")
rect(10,0,18.5,150,col=grey(0.9),border=F)
points(df3$Serum_Cotinine,cex=0.7)
axis(1,at=0:1*0.5,labels=F)
axis(1,at=0:1*10,labels=F)
axis(1,at=0:1*18,labels=F)
axis(1,at=0:1*10+4.5,tick=F,labels=c('1999-2008','2009-2016'))
lines(temp$Time,pred3,col="dodgerblue4")
title("Serum Cotinine Level in Rich Smokers, 1999-2016")

summary(model3)
## 
## Call:
## glm(formula = Serum_Cotinine ~ treatment + Time, family = gaussian, 
##     data = df3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -20.080   -6.162    1.331    6.219   17.390  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  83.0060     6.2770  13.224 1.13e-09 ***
## treatment   -17.1230    10.0294  -1.707   0.1084    
## Time          2.4167     0.9606   2.516   0.0237 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 114.8797)
## 
##     Null deviance: 2545.5  on 17  degrees of freedom
## Residual deviance: 1723.2  on 15  degrees of freedom
## AIC: 141.19
## 
## Number of Fisher Scoring iterations: 2

Rich Nonsmokers

model4 <- glm(Serum_Cotinine ~ treatment + Time, family=gaussian, df4)
pred4 <- predict(model4,temp)
plot(df4$Serum_Cotinine,type="n",ylim=c(0,20),xlab="Year", ylab="Serum Cotinine Rate", bty="l",xaxt="n")
rect(10,0,18.5,20,col=grey(0.9),border=F)
points(df4$Serum_Cotinine,cex=0.7)
axis(1,at=0:1*0.5,labels=F)
axis(1,at=0:1*10,labels=F)
axis(1,at=0:1*18,labels=F)
axis(1,at=0:1*10+4.5,tick=F,labels=c('1999-2008','2009-2016'))
lines(temp$Time,pred4,col="dodgerblue4")
title("Serum Cotinine Level in Rich Non-Smokers, 1999-2016")

summary(model4)
## 
## Call:
## glm(formula = Serum_Cotinine ~ treatment + Time, family = gaussian, 
##     data = df4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0763  -1.1345  -0.0506   0.5144   4.1416  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.8773     1.3143   4.472 0.000448 ***
## treatment    -6.2589     2.0999  -2.981 0.009335 ** 
## Time          0.6300     0.2011   3.133 0.006846 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 5.036159)
## 
##     Null deviance: 126.501  on 17  degrees of freedom
## Residual deviance:  75.542  on 15  degrees of freedom
## AIC: 84.9
## 
## Number of Fisher Scoring iterations: 2

Conclusion

  • Based on the preliminary results, there is clear evidence that rich nonsmokers are the only group whose SHS exposure dropped significantly after the policy intervention. There is also evidence that the intervention did not have an impact on smokers, in either the rich or the poor group. It is not surprising that the federal cigarette tax increase does not affect smokers’ serum cotinine levels, since smokers tend not to quit smoking easily.

  • The results further suggest that poverty may increase the chance of being exposed to SHS. However, there are some limitations and important threats to the validity of this interrupted time series study. One of the biggest limitations is the lack of statistical power in the regression analysis. Because of the nature of the public survey, many other useful variables (e.g., information about month and location) are restricted and not accessible.

Acknowledgment

  • I would like to thank Prof. David Broniatowski, Dan Koban, Dr. MyDzung T. Chu, and Dr. Ami Zota for their great support and for providing the NHANES data set used in this project.

  • This project was supported by a US HUD grant (DCHHU0054-19). The content of this manuscript is solely the responsibility of the grantee and does not represent the official views of any funding entity. Further, no parties involved endorse the purchase of any commercial products or services discussed in this manuscript.

    • RDC Research Project Title: The Impact of Housing Assistance on Residential Environmental Exposures

    • Funded by: US Department of Housing and Urban Development (HUD) Healthy Homes Program Homes Technical Research (DCHHU0054-19)

    • Organization: Environmental Occupational Health at The George Washington University, The Milken Institute School of Public Health

    • Supervisors: Dr. MyDzung T. Chu and Dr. Ami Zota

References

  1. Victora CG, Habicht J-P, Bryce J. Evidence-based public health: moving beyond randomized trials. Am J Public Health 2004;94:400–05.

  2. “Health Effects of Secondhand Smoke,” Feb. 27, 2020. Accessed: Dec. 13, 2021. [Online]. Available: https://www.cdc.gov/tobacco/data_statistics/fact_sheets/secondhand_smoke/health_effects/index.htm

  3. Richter PA, Bishop EE, Wang J, Swahn MH. Tobacco Smoke Exposure and Levels of Urinary Metals in the U.S. Youth and Adult Population: The National Health and Nutrition Examination Survey (NHANES) 1999–2004. International Journal of Environmental Research and Public Health. 2009; 6(7):1930-1946.

  4. Gaurang P. Nazar, MSc, John Tayu Lee, PhD, Monika Arora, PhD, Christopher Millett, PhD, Socioeconomic Inequalities in Secondhand Smoke Exposure at Home and at Work in 15 Low- and Middle-Income Countries, Nicotine & Tobacco Research, Volume 18, Issue 5, May 2016, Pages 1230–1239

  5. “About the National Health and Nutrition Examination Survey,” Sep. 15, 2017. Accessed: Dec. 13, 2021. [Online]. Available: https://www.cdc.gov/nchs/nhanes/about_nhanes.htm

  6. “Module 2: Sample Design,” Sep. 15, 2017. Accessed: Dec. 13, 2021. [Online]. Available: https://wwwn.cdc.gov/nchs/nhanes/tutorials/module2.aspx

  7. Bernal JL, Cummins S, Gasparrini A. Interrupted time series regression for the evaluation of public health interventions: a tutorial. Int J Epidemiol. 2017 Feb 1;46(1):348-355. doi: 10.1093/ije/dyw098. Erratum in: Int J Epidemiol. 2020 Aug 1;49(4):1414. PMID: 27283160; PMCID: PMC5407170.

  8. Bush T, Zbikowski S, Mahoney L, Deprey M, Mowery PD, Magnusson B. The 2009 US federal cigarette tax increase and quitline utilization in 16 states. J Environ Public Health. 2012;2012:314740. doi: 10.1155/2012/314740. Epub 2012 May 8. PMID: 22649463; PMCID: PMC3356941.