Do deaths caused by smoking and lung cancer increase as a population gets more older? In other words, are there more deaths caused by this disease among older populations versus a more younger population?

The dataset used for this assignment contains information on lung cancer deaths prompted by a persons’ age and their smoking status within a population. Researchers frequently make an association between smoking and Lung Cancer. Furthermore, they also make a connection between lung cancer and death. Instead of looking at these types of relationships, I will observe whether there are more deaths from lung cancer, for example, in a population of older people versus younger people. Using the count data model, more specifically, the poisson regression to to see which population will have more deaths. Due to larger different age groups, I believe that the older the population, the more deaths will occur due to lung cancer.

library(Zelig)
library(ZeligChoice)
library(faraway)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(survival)
## 
## Attaching package: 'survival'
## The following object is masked from 'package:faraway':
## 
##     rats

The following dataset was procured from the open data portal from Princeton University. The dataset was transferred to a csv file and was then uploaded into R using the readr library for further analyses.

library(readr)

SALC <- read_csv("C:/Users/AHMED FAMILY/Desktop/salc.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   Age = col_integer(),
##   SmokingStatus = col_integer(),
##   Population = col_integer(),
##   Deaths = col_integer()
## )
head(SALC)
## # A tibble: 6 × 4
##     Age SmokingStatus Population Deaths
##   <int>         <int>      <int>  <int>
## 1     1             1        656     18
## 2     2             1        359     22
## 3     3             1        249     19
## 4     4             1        632     55
## 5     5             1       1067    117
## 6     6             1        897    170

The dataset has four variables. Three of the variables known as SmokingStatus, Population and Age are the independent variables. They measure and tell us the Age group of the participant (For example those identified as # 1 are the youngest from 40-44 years of age while those that identify with nine are 80 years and above), whether they smoked or did not smoke, and the population respectively. The Deaths variable is the dependent variable and tells us how many people died by lung cancer.

z <- zelig(Deaths ~ SmokingStatus + Population + Age, model = "poisson", data = SALC, cite = F)
summary(z)
## Model: 
## 
## Call:
## z5$zelig(formula = Deaths ~ SmokingStatus + Population + Age, 
##     data = SALC)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -19.691   -4.753   -0.051    3.008   10.398  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)   2.426e+00  5.671e-02   42.78   <2e-16
## SmokingStatus 2.690e-01  1.212e-02   22.19   <2e-16
## Population    4.433e-04  6.290e-06   70.48   <2e-16
## Age           2.618e-01  5.741e-03   45.61   <2e-16
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 8434.4  on 35  degrees of freedom
## Residual deviance: 1303.8  on 32  degrees of freedom
## AIC: 1551.8
## 
## Number of Fisher Scoring iterations: 4
## 
## Next step: Use 'setx' method

As observed from the coefficients, there is a positive relationship between the smoking status of a participant and death caused by lung cancer. Moreover, the relationship is statistically significant (P<0.01). Furthermore, there is also a positive relationship between age and deaths from lung cancer. The coefficent (2.618e-01) indicates that the older the person, the more likely they are to die from lung cancer. In addition to the first relationship, the relationship between Age and Death is also statistically significant.

a.range = min(SALC$Age):max(SALC$Age)
x <- setx(z, Age = a.range)
s <- sim(z, x = x)
ci.plot(s)

As seen in the graph above, as age increases, the death toll rises from lung cancer. Aforementioned above, there are levels of age groups, and each number represents a range of ages. The graph clearly shows that death by cancer is positively affected by one’s age. There is an upward movement in the graph for everytime the age group increases by one level (Starting from 1 all the way till 9).

x <- setx(z, Age = mean(SALC$Age))
x1 <- setx(z, Age = mean(SALC$Age) + sd(SALC$Age))
s <- sim(z, x = x, x1 = x1)
summary(s)
## 
##  sim x :
##  -----
## ev
##          mean       sd      50%     2.5%    97.5%
## [1,] 163.7068 2.516416 163.7391 158.7473 168.6248
## pv
##         mean       sd 50% 2.5% 97.5%
## [1,] 164.462 13.14593 164  139   190
## 
##  sim x1 :
##  -----
## ev
##          mean       sd      50%     2.5%    97.5%
## [1,] 325.2884 4.912795 325.1419 315.6411 335.2655
## pv
##        mean       sd 50% 2.5% 97.5%
## [1,] 325.85 18.72908 326  290   363
## fd
##          mean       sd     50%     2.5%    97.5%
## [1,] 161.5816 4.216867 161.501 153.1722 169.9577

As shown in the simulation above, Age increases the deaths caused by lung cancer. (Mean = 162)

x <- setx(z, Age = 100)
x1 <- setx(z, Age = 100 + 100)
s <- sim(z, x = x, x1 = x1)
## Warning in rpois(nrow(theta.local), lambda = theta.local[, i]): NAs
## produced

## Warning in rpois(nrow(theta.local), lambda = theta.local[, i]): NAs
## produced
summary(s)
## 
##  sim x :
##  -----
## ev
##              mean           sd          50%         2.5%        97.5%
## [1,] 1.232503e+13 7.098617e+12 1.064434e+13 3.703164e+12 3.193882e+13
## pv
##      mean sd 50% 2.5% 97.5%
## [1,]  NaN NA  NA   NA    NA
## 
##  sim x1 :
##  -----
## ev
##            mean           sd          50%         2.5%        97.5%
## [1,] 4.7631e+24 6.479938e+24 2.569519e+24 2.821243e+23 2.483832e+25
## pv
##      mean sd 50% 2.5% 97.5%
## [1,]  NaN NA  NA   NA    NA
## fd
##            mean           sd          50%         2.5%        97.5%
## [1,] 4.7631e+24 6.479938e+24 2.569519e+24 2.821243e+23 2.483832e+25
xl <- setx(z, Age = quantile(SALC$Age, .25))
xl1 <- setx(z, Age = quantile(SALC$Age, .25) + sd(SALC$Age))
sl <- sim(z, x = xl, x1 = xl1)
fdl <- sl$get_qi(xvalue="x1", qi="fd")

xm <- setx(z, Age = quantile(SALC$Age, .50))
xm1 <- setx(z, Age = quantile(SALC$Age, .50) + sd(SALC$Age))
sm <- sim(z, x = xm, x1 = xm1)
fdm <- sm$get_qi(xvalue="x1", qi="fd")

xh <- setx(z, Age = quantile(SALC$Age, .75))
xh1 <- setx(z, Age = quantile(SALC$Age, .75) + sd(SALC$Age))
sh <- sim(z, x = xh, x1 = xh1)
fdh <- sh$get_qi(xvalue="x1", qi="fd")
d <- as.data.frame(cbind(fdl, fdm, fdh))
head(d)
##         V1       V2       V3
## 1 93.41548 162.7977 275.6811
## 2 95.40640 163.4334 260.6363
## 3 94.30121 158.9976 262.2324
## 4 96.12208 162.2955 273.7232
## 5 94.94268 150.0220 271.4582
## 6 97.57438 160.5090 254.9349
tidd <- d %>% 
  gather(quantile, dif, 1:3)
head(tidd)
##   quantile      dif
## 1       V1 93.41548
## 2       V1 95.40640
## 3       V1 94.30121
## 4       V1 96.12208
## 5       V1 94.94268
## 6       V1 97.57438
tidd %>% 
  group_by(quantile) %>% 
  summarise(mean = mean(dif), sd = sd(dif))
## # A tibble: 3 × 3
##   quantile      mean       sd
##      <chr>     <dbl>    <dbl>
## 1       V1  95.43435 1.632829
## 2       V2 161.52024 4.293177
## 3       V3 271.87715 9.993180

The code below creates three histograms based off of the 25th, 50th and 75th percentiles.

library(ggplot2)

ggplot(tidd, aes(dif)) + geom_histogram() + facet_grid(~ quantile)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

After creating the histograms using ggplot, we can see that all three graphs are considerably distinct. The mean difference in the first histogram is 80 while the mean differences for the middle histogram is 175 and 300 for the last histogram respectively. The mean difference increases from the first histogram, then again in the second and even more on the third.