This manual was created by Catherine D’Este and updated by Tambri Housen and Alice Richardson. The manual was originally created for use with Stata data analysis software, the conversion to R was conducted by Nidhi Menon and Arminder Deol. The manual was created for the Australian Field Epidemiology Training Program - Masters of Philosophy (Applied Epidemiology), Australian National University. CRICOS Provider No. 00120C
After successfully completing this session, students will be able to:
In this exercise, you will compare lung cancer rates between four Danish cities: Frederica, Horsens, Kolding and Vejle.
There are two data files needed for this exercise.
Numbers of lung cancer cases were obtained from each city between 1968 and 1971 and saved in the DenmarkLungCancer.csv file. Age-specific population counts for these cities are saved in the DenmarkPopulations.csv file.
# Create a new project and save the 'DenmarkPopulations.csv' file in the project folder on your computer.
data1 <- read.csv("DenmarkPopulations.csv")
# Describe DenmarkPopulations.csv which we've now named as "data1":
str(data1)
## 'data.frame': 24 obs. of 3 variables:
## $ age : chr "40-54" "40-54" "40-54" "40-54" ...
## $ city: chr "Fredericia" "Horsens" "Kolding" "Vejle" ...
## $ pop : int 3059 2879 3142 2520 800 1083 1050 878 710 923 ...
# View the data (if you want to view the data in a separate tab, then type View(data1))
data1
## age city pop
## 1 40-54 Fredericia 3059
## 2 40-54 Horsens 2879
## 3 40-54 Kolding 3142
## 4 40-54 Vejle 2520
## 5 55-59 Fredericia 800
## 6 55-59 Horsens 1083
## 7 55-59 Kolding 1050
## 8 55-59 Vejle 878
## 9 60-64 Fredericia 710
## 10 60-64 Horsens 923
## 11 60-64 Kolding 895
## 12 60-64 Vejle 839
## 13 65-69 Fredericia 581
## 14 65-69 Horsens 834
## 15 65-69 Kolding 702
## 16 65-69 Vejle 631
## 17 70-74 Fredericia 509
## 18 70-74 Horsens 634
## 19 70-74 Kolding 535
## 20 70-74 Vejle 539
## 21 >74 Fredericia 605
## 22 >74 Horsens 782
## 23 >74 Kolding 659
## 24 >74 Vejle 619
Notice the way the data have been entered - is this consistent with the structure of the cancer data? Note the names of any common variables - do they have the same name across the two data sets?
# Save the 'DenmarkLungCancer.csv' file in the project folder on your computer.
data2 <- read.csv("DenmarkLungCancer.csv")
# Describe DenmarkLungCancer.csv which we've now named as "data2":
str(data2)
## 'data.frame': 24 obs. of 3 variables:
## $ cases: int 11 11 11 10 11 10 13 6 15 10 ...
## $ age : chr "40-54" "55-59" "60-64" "65-69" ...
## $ city : chr "Fredericia" "Fredericia" "Fredericia" "Fredericia" ...
# View the data (if you want to view the data in a separate tab, then type View(data2))
data2
## cases age city
## 1 11 40-54 Fredericia
## 2 11 55-59 Fredericia
## 3 11 60-64 Fredericia
## 4 10 65-69 Fredericia
## 5 11 70-74 Fredericia
## 6 10 >74 Fredericia
## 7 13 40-54 Horsens
## 8 6 55-59 Horsens
## 9 15 60-64 Horsens
## 10 10 65-69 Horsens
## 11 12 70-74 Horsens
## 12 2 >74 Horsens
## 13 4 40-54 Kolding
## 14 8 55-59 Kolding
## 15 7 60-64 Kolding
## 16 11 65-69 Kolding
## 17 9 70-74 Kolding
## 18 12 >74 Kolding
## 19 5 40-54 Vejle
## 20 7 55-59 Vejle
## 21 10 60-64 Vejle
## 22 14 65-69 Vejle
## 23 8 70-74 Vejle
## 24 7 >74 Vejle
# This is a 1-to-1 merg. To merge data1 onto data2 (to merge population and cancer data), we reopen the cancer data. The variables that define the merge are 'age' and 'city'. Note that unlike STATA, we define a new dataset that is the result of merging both data1 and data2
final <- merge(data1, data2, by = c("age", "city"))
# View merged dataset that we've now named 'final'
final
## age city pop cases
## 1 >74 Fredericia 605 10
## 2 >74 Horsens 782 2
## 3 >74 Kolding 659 12
## 4 >74 Vejle 619 7
## 5 40-54 Fredericia 3059 11
## 6 40-54 Horsens 2879 13
## 7 40-54 Kolding 3142 4
## 8 40-54 Vejle 2520 5
## 9 55-59 Fredericia 800 11
## 10 55-59 Horsens 1083 6
## 11 55-59 Kolding 1050 8
## 12 55-59 Vejle 878 7
## 13 60-64 Fredericia 710 11
## 14 60-64 Horsens 923 15
## 15 60-64 Kolding 895 7
## 16 60-64 Vejle 839 10
## 17 65-69 Fredericia 581 10
## 18 65-69 Horsens 834 10
## 19 65-69 Kolding 702 11
## 20 65-69 Vejle 631 14
## 21 70-74 Fredericia 509 11
## 22 70-74 Horsens 634 12
## 23 70-74 Kolding 535 9
## 24 70-74 Vejle 539 8
new_variable <- factor(old_variable)Confirm the recoding by obtaining a frequency table of the new variable: table (new_variable)
table(final$age)
##
## >74 40-54 55-59 60-64 65-69 70-74
## 4 4 4 4 4 4
table(final$city)
##
## Fredericia Horsens Kolding Vejle
## 6 6 6 6
# Sort the dataset by age group to reflect 40-54 as the first group.
final$age <- factor(final$age, levels = c("40-54","55-59","60-64", "65-69", "70-74", ">74" ))
final <- final[order(final$age),]
final$rate <- final$cases / final$pop *10000
These are 4-year rates of lung cancer per 10,000 people. Annual rates could be obtained by dividing the rates by 4, but this would not change the relative results (i.e. IRRs) from the regression modelling.
boxplot(rate~age, data = final,
ylab = "rates", xlab = "Age groups",
main = "Boxplot1", col = "yellow")
boxplot(rate~ city, data = final,
ylab = "rates", xlab = "City",
main = "Boxplot2", col = "green")
There is a clear difference in lung cancer rates by age, with rates increasing until the oldest age group. There does not appear to be a large difference by city, however the rate in Fredericia appears larger than the other three cities.
pois <- glm(cases ~ age +city, offset=log(pop), data=final,
family = poisson(link = log))
#to calculate IRR values from the beta coefficients
exp(coef(pois))
## (Intercept) age55-59 age60-64 age65-69 age70-74 age>74
## 0.003581174 3.007213795 4.565884929 5.857402508 6.403619032 4.135686847
## cityHorsens cityKolding cityVejle
## 0.718880610 0.689667168 0.761612264
#to calculate CI for IRR values
exp(confint(pois))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.002373629 0.005212346
## age55-59 1.842951851 4.901008833
## age60-64 2.907180919 7.236296972
## age65-69 3.748295295 9.248885425
## age70-74 4.043044796 10.211923083
## age>74 2.522891909 6.762422572
## cityHorsens 0.502694733 1.025912422
## cityKolding 0.475568043 0.995045687
## cityVejle 0.525131867 1.098950868
The Poisson regression model is fitted using the numeric versions of age and city, and population size as the exposure.
The IRR for age group 55-59 can be interpreted as:
After adjusting for city, the incidence rate of lung cancer in those aged 55-59 was 3.0 (95% CI: 1.8 to 4.9)
times higher than in those aged 40-54.
For Horsens:
After adjusting for age, the incidence rate of lung cancer for those living in Horsens was 0.72 (95% CI: 0.50 to 1.03) times that in Fredericia.
Inference can easily be performed using the summary() method for assessing the regression coefficients via partial Wald tests or the anova() method for comparing nested models via an analysis of deviance.
# Install the "lmtest" package before the next step.
library(lmtest)
pois.full <- glm(cases ~ age + city, data=final, family = poisson(link = log))
pois.reduced <- glm(cases ~ age, data=final, family = poisson(link = log))
waldtest(pois.full, pois.reduced, test="Chisq")
## Wald test
##
## Model 1: cases ~ age + city
## Model 2: cases ~ age
## Res.Df Df Chisq Pr(>Chisq)
## 1 15
## 2 18 -3 2.0999 0.5519
We conclude that there is no evidence that the lung cancer rate differs between the four cities (P=0.55).
with(pois, cbind(res.deviance = deviance, df = df.residual, p=pchisq(deviance, df.residual, lower.tail = FALSE)))
## res.deviance df p
## [1,] 23.44748 15 0.07509017
From the deviance goodness-of-fit test, we conclude that there is no evidence against this Poisson regression model (P=0.08). Note that if this test were significant, we should consider whether there were omitted variables (for example, sex), or if there was over dispersion present.
#Run a regression model omitting the variable of interest (here, city)
pois1 <- glm(cases ~ age + log(pop), family = poisson(link = log), data=final)
pois1
##
## Call: glm(formula = cases ~ age + log(pop), family = poisson(link = log),
## data = final)
##
## Coefficients:
## (Intercept) age55-59 age60-64 age65-69 age70-74 age>74
## 6.6663 -0.6711 -0.4441 -0.5185 -0.7539 -0.9050
## log(pop)
## -0.5719
##
## Degrees of Freedom: 23 Total (i.e. Null); 17 Residual
## Null Deviance: 27.7
## Residual Deviance: 21.91 AIC: 132.3
# And a regression model with the variable of interest
pois_original <- glm(cases ~ age + city + log(pop), family = poisson(link = log), data=final)
pois_original
##
## Call: glm(formula = cases ~ age + city + log(pop), family = poisson(link = log),
## data = final)
##
## Coefficients:
## (Intercept) age55-59 age60-64 age65-69 age70-74 age>74
## 11.7496 -1.3842 -1.2366 -1.4378 -1.8049 -1.8383
## cityHorsens cityKolding cityVejle log(pop)
## 0.1833 -0.0483 -0.1679 -1.2096
##
## Degrees of Freedom: 23 Total (i.e. Null); 14 Residual
## Null Deviance: 27.7
## Residual Deviance: 19.5 AIC: 135.9
lrtest(pois1,pois_original)
## Likelihood ratio test
##
## Model 1: cases ~ age + log(pop)
## Model 2: cases ~ age + city + log(pop)
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 7 -59.151
## 2 10 -57.943 3 2.4154 0.4908
# The AIC and the BIC can be seen in the above model outputs but can otherwise be shown as below
AIC(pois1,pois_original)
## df AIC
## pois1 7 132.3017
## pois_original 10 135.8862
BIC(pois1,pois_original)
## df BIC
## pois1 7 140.5480
## pois_original 10 147.6668
# Our conclusion is unchanged from the Wald test.
Note that the output might differ from that obtained using STATA.