Questions:

1. What are the potential ways that one might exit a cohort before the end of the study date?

Some ways that participants might exit a cohort before the end of the study include: death, cessation of smoking, and emigration.

2. Why do we include the person-time contributed by individuals who are lost to follow-up?

We include the person-time contributed by individuals who are lost to follow-up in order accurately represent the dynamic nature of study populations in rate calculations.

3. Import the data.

# setup
setwd("/Users/shmoops/Desktop/UNC 2019-2020/FALL/EPID 600/")

# read in data
library(readxl)
data = read_xlsx("Data+and+Codebook+-+anti-smoking+advertisement+evaluation+data.xlsx")

# making headers lower case
colnames(data) = tolower(colnames(data))

# frequency of cohort demographic variables
library(dplyr)
a = tableone::CreateTableOne(vars = colnames(data), data = data, factorVars = c("sex", "race", "ethnicity", "ses", "education", "advert exposure", "smoking cessation", "ltf"))

# printing table one summary
print(summary(a))
## 
##      ### Summary of continuous variables ###
## 
## strata: Overall
##                n miss p.miss mean  sd median p25 p75 min  max skew  kurt
## id          1000    0      0  500 289    500 251 750   1 1000  0.0 -1.20
## age         1000    0      0   47  13     48  39  57   5   78 -0.3 -0.61
## person days 1000    0      0  454 130    540 293 540  14  624 -1.1 -0.07
## 
## =======================================================================================
## 
##      ### Summary of categorical variables ### 
## 
## strata: Overall
##                var    n miss p.miss level freq percent cum.percent
##                sex 1000    0    0.0     0  355    35.5        35.5
##                                         1  645    64.5       100.0
##                                                                   
##               race 1000    0    0.0     1  685    68.5        68.5
##                                         2  230    23.0        91.5
##                                         3   10     1.0        92.5
##                                         4   25     2.5        95.0
##                                         5   50     5.0       100.0
##                                                                   
##          ethnicity 1000    0    0.0     0  775    77.5        77.5
##                                         1  225    22.5       100.0
##                                                                   
##                ses 1000    0    0.0     0  715    71.5        71.5
##                                         1  285    28.5       100.0
##                                                                   
##          education 1000    0    0.0     1  175    17.5        17.5
##                                         2  465    46.5        64.0
##                                         3  359    35.9        99.9
##                                         4    1     0.1       100.0
##                                                                   
##    advert exposure 1000    1    0.1     0  730    73.1        73.1
##                                         1  269    26.9       100.0
##                                                                   
##  smoking cessation 1000    0    0.0     0  912    91.2        91.2
##                                         1   88     8.8       100.0
##                                                                   
##                ltf 1000    0    0.0     0  681    68.1        68.1
##                                         1  319    31.9       100.0
##                                                                   
## NULL

4. Using your data set, calculate the prevalence of seeing the “stages” advertisement. Does everyone have information on this exposure?

prevalence = (sum(data$`advert exposure`, na.rm = T)/nrow(data))*100
  
print(prevalence)
## [1] 26.9
No, not everyone has information on this exposure.

5. Using your data set, calculate the “risk” of smoking cessation in the cohort, both including and excluding those lost to follow up.

a) Including subjects that were lost to follow up:

# total cohort  
(sum(data$`smoking cessation`)/nrow(data))*100
## [1] 8.8

b) Excluding those lost to follow up:

# Drop those who end dates are less than the max end date
followup_pop = data %>% filter(ltf == 0) 

# total cohort  
(sum(followup_pop$`smoking cessation`)/nrow(followup_pop))*100
## [1] 12.92217

6. Calculate the “rate” of smoking cessation in the total cohort.

# total cohort 
(sum(data$`smoking cessation`)/(sum(data$`person days`))/365.25)
## [1] 5.309268e-07

7. Compare your calculated risk estimates

a) including and excluding those lost with each other

By excluding those lost to follow up from the risk calculation, the “risk” of smoking cessation in the population of interest increases (from 8.8% to 12.9% in the total cohort). This happens because those lost to follow up are not able to report smoking cessation, so they artificially decrease the risk of smoking cessation by making the denominator larger than it should be in the first calculation.

b) with your calculated rate. How are the estimates different from each other?

Compared to the risk calculations, the rate value is much smaller because its denominator represents the cumulative participitory time of all people in the study.

8. Calculate the risk for both those individuals that saw the advertisement and for those that did not. Include those lost to follow up in your calculations.

# risk in exposed pop (saw advertisments)
exposed_pop = data %>% filter(`advert exposure` == 1)
(sum(exposed_pop$`smoking cessation`)/nrow(exposed_pop))*100
## [1] 11.89591
# risk unexposed pop (did not see advertisement)
unexposed_pop = data %>% filter(`advert exposure` == 0)
(sum(unexposed_pop$`smoking cessation`)/nrow(unexposed_pop))*100
## [1] 7.671233

What would you do with the individual with missing data?

I will exclude the individual with missing data
data = data%>% filter(!is.na(`advert exposure`))

Does the advertisement seem to help increase smoking cessation?

The advertisement does seem to help increase smoking cessation.

9. What happens to the risk estimates and your conclusions in the previous question when you exclude those lost to follow up? With respect to smoking cessation, how were those that stayed in the study different from those that were lost to follow up?

# risk in exposed pop (saw advertisments)
exposed_pop = followup_pop %>% filter(`advert exposure` == 1)
(sum(exposed_pop$`smoking cessation`)/nrow(exposed_pop))*100
## [1] 18.82353
# risk unexposed pop (did not see advertisement)
unexposed_pop = followup_pop %>% filter(`advert exposure` == 0)
(sum(unexposed_pop$`smoking cessation`)/nrow(unexposed_pop))*100
## [1] 10.98039
The difference between the risk of smoking cessation in the exposed versus unexposed populations increases after excluding those lost to followup.

10. Calculate the rates of smoking cessation for the advertisement exposed and the non exposed. What would you conclude about the relationship between advertisement viewing and smoking cessation now?

# Redifining follow up pop
followup_pop = data %>% filter(ltf == 0) 

# rate in exposed pop (saw advertisments)
exposed_pop = followup_pop %>% filter(`advert exposure` == 1)
(sum(exposed_pop$`smoking cessation`)/(sum(exposed_pop$`person days`))/365.25)
## [1] 9.560474e-07
# rate in unexposed pop (did not see advertisement)
unexposed_pop = followup_pop %>% filter(`advert exposure` == 0)
(sum(unexposed_pop$`smoking cessation`)/(sum(unexposed_pop$`person days`))/365.25)
## [1] 5.584744e-07
I would conclude that advertisement viewing has a direct relationship with smoking cessation based off of the differenes in cessation rates between the exposed and unexposed populations.