The dataset used for this assignment contains information on lung cancer deaths prompted by a persons’ age and their smoking status within a population. Researchers frequently make an association between smoking and Lung Cancer. Furthermore, they also make a connection between lung cancer and death. Instead of looking at these types of relationships, I will observe whether there are more deaths from lung cancer, for example, in a population of older people versus younger people. Using the count data model, more specifically, the poisson regression to to see which population will have more deaths. Due to larger different age groups, I believe that the older the population, the more deaths will occur due to lung cancer.
library(Zelig)
library(ZeligChoice)
library(faraway)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(survival)
##
## Attaching package: 'survival'
## The following object is masked from 'package:faraway':
##
## rats
The following dataset was procured from the open data portal from Princeton University. The dataset was transferred to a csv file and was then uploaded into R using the readr library for further analyses.
library(readr)
SALC <- read_csv("C:/Users/AHMED FAMILY/Desktop/salc.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
## Age = col_integer(),
## SmokingStatus = col_integer(),
## Population = col_integer(),
## Deaths = col_integer()
## )
head(SALC)
## # A tibble: 6 × 4
## Age SmokingStatus Population Deaths
## <int> <int> <int> <int>
## 1 1 1 656 18
## 2 2 1 359 22
## 3 3 1 249 19
## 4 4 1 632 55
## 5 5 1 1067 117
## 6 6 1 897 170
The dataset has four variables. Three of the variables known as SmokingStatus, Population and Age are the independent variables. They measure and tell us the Age group of the participant (For example those identified as # 1 are the youngest from 40-44 years of age while those that identify with nine are 80 years and above), whether they smoked or did not smoke, and the population respectively. The Deaths variable is the dependent variable and tells us how many people died by lung cancer.
z <- zelig(Deaths ~ SmokingStatus + Population + Age, model = "poisson", data = SALC, cite = F)
summary(z)
## Model:
##
## Call:
## z5$zelig(formula = Deaths ~ SmokingStatus + Population + Age,
## data = SALC)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -19.691 -4.753 -0.051 3.008 10.398
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.426e+00 5.671e-02 42.78 <2e-16
## SmokingStatus 2.690e-01 1.212e-02 22.19 <2e-16
## Population 4.433e-04 6.290e-06 70.48 <2e-16
## Age 2.618e-01 5.741e-03 45.61 <2e-16
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 8434.4 on 35 degrees of freedom
## Residual deviance: 1303.8 on 32 degrees of freedom
## AIC: 1551.8
##
## Number of Fisher Scoring iterations: 4
##
## Next step: Use 'setx' method
As observed from the coefficients, there is a positive relationship between the smoking status of a participant and death caused by lung cancer. Moreover, the relationship is statistically significant (P<0.01). Furthermore, there is also a positive relationship between age and deaths from lung cancer. The coefficent (2.618e-01) indicates that the older the person, the more likely they are to die from lung cancer. In addition to the first relationship, the relationship between Age and Death is also statistically significant.
a.range = min(SALC$Age):max(SALC$Age)
x <- setx(z, Age = a.range)
s <- sim(z, x = x)
ci.plot(s)
As seen in the graph above, as age increases, the death toll rises from lung cancer. Aforementioned above, there are levels of age groups, and each number represents a range of ages. The graph clearly shows that death by cancer is positively affected by one’s age. There is an upward movement in the graph for everytime the age group increases by one level (Starting from 1 all the way till 9).
x <- setx(z, Age = mean(SALC$Age))
x1 <- setx(z, Age = mean(SALC$Age) + sd(SALC$Age))
s <- sim(z, x = x, x1 = x1)
summary(s)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 163.7068 2.516416 163.7391 158.7473 168.6248
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 164.462 13.14593 164 139 190
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 325.2884 4.912795 325.1419 315.6411 335.2655
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 325.85 18.72908 326 290 363
## fd
## mean sd 50% 2.5% 97.5%
## [1,] 161.5816 4.216867 161.501 153.1722 169.9577
As shown in the simulation above, Age increases the deaths caused by lung cancer. (Mean = 162)
x <- setx(z, Age = 100)
x1 <- setx(z, Age = 100 + 100)
s <- sim(z, x = x, x1 = x1)
## Warning in rpois(nrow(theta.local), lambda = theta.local[, i]): NAs
## produced
## Warning in rpois(nrow(theta.local), lambda = theta.local[, i]): NAs
## produced
summary(s)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 1.232503e+13 7.098617e+12 1.064434e+13 3.703164e+12 3.193882e+13
## pv
## mean sd 50% 2.5% 97.5%
## [1,] NaN NA NA NA NA
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 4.7631e+24 6.479938e+24 2.569519e+24 2.821243e+23 2.483832e+25
## pv
## mean sd 50% 2.5% 97.5%
## [1,] NaN NA NA NA NA
## fd
## mean sd 50% 2.5% 97.5%
## [1,] 4.7631e+24 6.479938e+24 2.569519e+24 2.821243e+23 2.483832e+25
xl <- setx(z, Age = quantile(SALC$Age, .25))
xl1 <- setx(z, Age = quantile(SALC$Age, .25) + sd(SALC$Age))
sl <- sim(z, x = xl, x1 = xl1)
fdl <- sl$get_qi(xvalue="x1", qi="fd")
xm <- setx(z, Age = quantile(SALC$Age, .50))
xm1 <- setx(z, Age = quantile(SALC$Age, .50) + sd(SALC$Age))
sm <- sim(z, x = xm, x1 = xm1)
fdm <- sm$get_qi(xvalue="x1", qi="fd")
xh <- setx(z, Age = quantile(SALC$Age, .75))
xh1 <- setx(z, Age = quantile(SALC$Age, .75) + sd(SALC$Age))
sh <- sim(z, x = xh, x1 = xh1)
fdh <- sh$get_qi(xvalue="x1", qi="fd")
d <- as.data.frame(cbind(fdl, fdm, fdh))
head(d)
## V1 V2 V3
## 1 93.41548 162.7977 275.6811
## 2 95.40640 163.4334 260.6363
## 3 94.30121 158.9976 262.2324
## 4 96.12208 162.2955 273.7232
## 5 94.94268 150.0220 271.4582
## 6 97.57438 160.5090 254.9349
tidd <- d %>%
gather(quantile, dif, 1:3)
head(tidd)
## quantile dif
## 1 V1 93.41548
## 2 V1 95.40640
## 3 V1 94.30121
## 4 V1 96.12208
## 5 V1 94.94268
## 6 V1 97.57438
tidd %>%
group_by(quantile) %>%
summarise(mean = mean(dif), sd = sd(dif))
## # A tibble: 3 × 3
## quantile mean sd
## <chr> <dbl> <dbl>
## 1 V1 95.43435 1.632829
## 2 V2 161.52024 4.293177
## 3 V3 271.87715 9.993180
The code below creates three histograms based off of the 25th, 50th and 75th percentiles.
library(ggplot2)
ggplot(tidd, aes(dif)) + geom_histogram() + facet_grid(~ quantile)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
After creating the histograms using ggplot, we can see that all three graphs are considerably distinct. The mean difference in the first histogram is 80 while the mean differences for the middle histogram is 175 and 300 for the last histogram respectively. The mean difference increases from the first histogram, then again in the second and even more on the third.