This manual was created by Catherine D’Este and updated by Tambri Housen and Alice Richardson. The manual was originally created for use with Stata data analysis software, the conversion to R was conducted by Nidhi Menon and Arminder Deol. The manual was created for the Australian Field Epidemiology Training Program - Masters of Philosophy (Applied Epidemiology), Australian National University. CRICOS Provider No. 00120C


Learning objectives:

After successfully completing this session, students will be able to:


Scenario

In this exercise, you will compare lung cancer rates between four Danish cities: Frederica, Horsens, Kolding and Vejle.

There are two data files needed for this exercise.

  • DenmarkLungCancer.csv
  • DenmarkPopulations.csv

Numbers of lung cancer cases were obtained from each city between 1968 and 1971 and saved in the DenmarkLungCancer.csv file. Age-specific population counts for these cities are saved in the DenmarkPopulations.csv file.

Exercises

  1. Open the Population Data and familiarise yourself with the data set. How are age group and city stored?
# Create a new project and save the 'DenmarkPopulations.csv' file in the project folder on your computer. 
data1 <- read.csv("DenmarkPopulations.csv")

# Describe DenmarkPopulations.csv which we've now named as "data1":
str(data1)
## 'data.frame':    24 obs. of  3 variables:
##  $ age : chr  "40-54" "40-54" "40-54" "40-54" ...
##  $ city: chr  "Fredericia" "Horsens" "Kolding" "Vejle" ...
##  $ pop : int  3059 2879 3142 2520 800 1083 1050 878 710 923 ...
# View the data (if you want to view the data in a separate tab, then type View(data1))
data1
##      age       city  pop
## 1  40-54 Fredericia 3059
## 2  40-54    Horsens 2879
## 3  40-54    Kolding 3142
## 4  40-54      Vejle 2520
## 5  55-59 Fredericia  800
## 6  55-59    Horsens 1083
## 7  55-59    Kolding 1050
## 8  55-59      Vejle  878
## 9  60-64 Fredericia  710
## 10 60-64    Horsens  923
## 11 60-64    Kolding  895
## 12 60-64      Vejle  839
## 13 65-69 Fredericia  581
## 14 65-69    Horsens  834
## 15 65-69    Kolding  702
## 16 65-69      Vejle  631
## 17 70-74 Fredericia  509
## 18 70-74    Horsens  634
## 19 70-74    Kolding  535
## 20 70-74      Vejle  539
## 21   >74 Fredericia  605
## 22   >74    Horsens  782
## 23   >74    Kolding  659
## 24   >74      Vejle  619
  1. Open the Lung Cancer data.

Notice the way the data have been entered - is this consistent with the structure of the cancer data? Note the names of any common variables - do they have the same name across the two data sets?

# Save the 'DenmarkLungCancer.csv' file in the project folder on your computer. 
data2 <- read.csv("DenmarkLungCancer.csv")

# Describe DenmarkLungCancer.csv which we've now named as "data2":
str(data2)
## 'data.frame':    24 obs. of  3 variables:
##  $ cases: int  11 11 11 10 11 10 13 6 15 10 ...
##  $ age  : chr  "40-54" "55-59" "60-64" "65-69" ...
##  $ city : chr  "Fredericia" "Fredericia" "Fredericia" "Fredericia" ...
# View the data (if you want to view the data in a separate tab, then type View(data2))
data2
##    cases   age       city
## 1     11 40-54 Fredericia
## 2     11 55-59 Fredericia
## 3     11 60-64 Fredericia
## 4     10 65-69 Fredericia
## 5     11 70-74 Fredericia
## 6     10   >74 Fredericia
## 7     13 40-54    Horsens
## 8      6 55-59    Horsens
## 9     15 60-64    Horsens
## 10    10 65-69    Horsens
## 11    12 70-74    Horsens
## 12     2   >74    Horsens
## 13     4 40-54    Kolding
## 14     8 55-59    Kolding
## 15     7 60-64    Kolding
## 16    11 65-69    Kolding
## 17     9 70-74    Kolding
## 18    12   >74    Kolding
## 19     5 40-54      Vejle
## 20     7 55-59      Vejle
## 21    10 60-64      Vejle
## 22    14 65-69      Vejle
## 23     8 70-74      Vejle
## 24     7   >74      Vejle
  1. You will need to merge the population data onto the cancer data. What type of merge is this: a 1-to-1, many-to-1, 1-to-many or many-to-many? Perform the merge and examine the merged data. What do you conclude from this table? Examine the merged data. If you are satisfied with the results of the merge, continue.
# This is a 1-to-1 merg. To merge data1 onto data2 (to merge population and cancer data), we reopen the cancer data.  The variables that define the merge are 'age' and 'city'. Note that unlike STATA, we define a new dataset that is the result of merging both data1 and data2

final <- merge(data1, data2, by = c("age", "city"))

# View merged dataset that we've now named 'final'
final
##      age       city  pop cases
## 1    >74 Fredericia  605    10
## 2    >74    Horsens  782     2
## 3    >74    Kolding  659    12
## 4    >74      Vejle  619     7
## 5  40-54 Fredericia 3059    11
## 6  40-54    Horsens 2879    13
## 7  40-54    Kolding 3142     4
## 8  40-54      Vejle 2520     5
## 9  55-59 Fredericia  800    11
## 10 55-59    Horsens 1083     6
## 11 55-59    Kolding 1050     8
## 12 55-59      Vejle  878     7
## 13 60-64 Fredericia  710    11
## 14 60-64    Horsens  923    15
## 15 60-64    Kolding  895     7
## 16 60-64      Vejle  839    10
## 17 65-69 Fredericia  581    10
## 18 65-69    Horsens  834    10
## 19 65-69    Kolding  702    11
## 20 65-69      Vejle  631    14
## 21 70-74 Fredericia  509    11
## 22 70-74    Horsens  634    12
## 23 70-74    Kolding  535     9
## 24 70-74      Vejle  539     8
  1. You will need to recode the string variables age and city into numeric variables to be used in a regression model. This can be done most easily using the factor command: new_variable <- factor(old_variable)

Confirm the recoding by obtaining a frequency table of the new variable: table (new_variable)

table(final$age)
## 
##   >74 40-54 55-59 60-64 65-69 70-74 
##     4     4     4     4     4     4
table(final$city)
## 
## Fredericia    Horsens    Kolding      Vejle 
##          6          6          6          6
# Sort the dataset by age group to reflect 40-54 as the first group.
final$age <- factor(final$age, levels = c("40-54","55-59","60-64", "65-69", "70-74", ">74" ))
final <- final[order(final$age),]
  1. Calculate rates of cancer per 10,000. What do these rates represent?
final$rate <- final$cases / final$pop *10000

These are 4-year rates of lung cancer per 10,000 people. Annual rates could be obtained by dividing the rates by 4, but this would not change the relative results (i.e. IRRs) from the regression modelling.

  1. Create a graph showing boxplots of cancer rates by age group, and separately by city. Describe these relationships.
boxplot(rate~age, data = final, 
ylab = "rates", xlab = "Age groups", 
main = "Boxplot1", col = "yellow")

boxplot(rate~ city, data = final, 
ylab = "rates", xlab = "City", 
main = "Boxplot2", col = "green")

There is a clear difference in lung cancer rates by age, with rates increasing until the oldest age group. There does not appear to be a large difference by city, however the rate in Fredericia appears larger than the other three cities.

  1. Fit a Poisson regression model modelling rate of cancer against age and city, requesting incidence rate ratios. Interpret these incidence rate ratios.
pois <- glm(cases ~ age +city, offset=log(pop), data=final, 
family = poisson(link = log))

#to calculate IRR values from the beta coefficients
exp(coef(pois))
## (Intercept)    age55-59    age60-64    age65-69    age70-74      age>74 
## 0.003581174 3.007213795 4.565884929 5.857402508 6.403619032 4.135686847 
## cityHorsens cityKolding   cityVejle 
## 0.718880610 0.689667168 0.761612264
#to calculate CI for IRR values
exp(confint(pois))
## Waiting for profiling to be done...
##                   2.5 %       97.5 %
## (Intercept) 0.002373629  0.005212346
## age55-59    1.842951851  4.901008833
## age60-64    2.907180919  7.236296972
## age65-69    3.748295295  9.248885425
## age70-74    4.043044796 10.211923083
## age>74      2.522891909  6.762422572
## cityHorsens 0.502694733  1.025912422
## cityKolding 0.475568043  0.995045687
## cityVejle   0.525131867  1.098950868

The Poisson regression model is fitted using the numeric versions of age and city, and population size as the exposure.

The IRR for age group 55-59 can be interpreted as:
After adjusting for city, the incidence rate of lung cancer in those aged 55-59 was 3.0 (95% CI: 1.8 to 4.9) 
times higher than in those aged 40-54.

For Horsens:
After adjusting for age, the incidence rate of lung cancer for those living in Horsens was 0.72 (95% CI: 0.50 to 1.03) times that in Fredericia.
  1. Test the joint effect of city after adjusting for age using a Wald test.

Inference can easily be performed using the summary() method for assessing the regression coefficients via partial Wald tests or the anova() method for comparing nested models via an analysis of deviance.

# Install the "lmtest" package before the next step.

library(lmtest)
pois.full <- glm(cases ~ age + city,  data=final, family = poisson(link = log))
pois.reduced <- glm(cases ~ age,  data=final, family = poisson(link = log))
waldtest(pois.full, pois.reduced, test="Chisq")
## Wald test
## 
## Model 1: cases ~ age + city
## Model 2: cases ~ age
##   Res.Df Df  Chisq Pr(>Chisq)
## 1     15                     
## 2     18 -3 2.0999     0.5519

We conclude that there is no evidence that the lung cancer rate differs between the four cities (P=0.55).

  1. Perform a goodness of fit test to assess the how well this model fits the observed data.
with(pois, cbind(res.deviance = deviance, df = df.residual, p=pchisq(deviance, df.residual, lower.tail = FALSE)))
##      res.deviance df          p
## [1,]     23.44748 15 0.07509017

From the deviance goodness-of-fit test, we conclude that there is no evidence against this Poisson regression model (P=0.08). Note that if this test were significant, we should consider whether there were omitted variables (for example, sex), or if there was over dispersion present.

  1. If you have time, repeat (8) using a likelihood ratio test. Does your conclusion change?
#Run a regression model omitting the variable of interest (here, city)

pois1 <- glm(cases ~ age + log(pop), family = poisson(link = log), data=final)
pois1
## 
## Call:  glm(formula = cases ~ age + log(pop), family = poisson(link = log), 
##     data = final)
## 
## Coefficients:
## (Intercept)     age55-59     age60-64     age65-69     age70-74       age>74  
##      6.6663      -0.6711      -0.4441      -0.5185      -0.7539      -0.9050  
##    log(pop)  
##     -0.5719  
## 
## Degrees of Freedom: 23 Total (i.e. Null);  17 Residual
## Null Deviance:       27.7 
## Residual Deviance: 21.91     AIC: 132.3
# And a regression model with the variable of interest
pois_original <- glm(cases ~ age + city + log(pop), family = poisson(link = log), data=final)
pois_original
## 
## Call:  glm(formula = cases ~ age + city + log(pop), family = poisson(link = log), 
##     data = final)
## 
## Coefficients:
## (Intercept)     age55-59     age60-64     age65-69     age70-74       age>74  
##     11.7496      -1.3842      -1.2366      -1.4378      -1.8049      -1.8383  
## cityHorsens  cityKolding    cityVejle     log(pop)  
##      0.1833      -0.0483      -0.1679      -1.2096  
## 
## Degrees of Freedom: 23 Total (i.e. Null);  14 Residual
## Null Deviance:       27.7 
## Residual Deviance: 19.5  AIC: 135.9
lrtest(pois1,pois_original)
## Likelihood ratio test
## 
## Model 1: cases ~ age + log(pop)
## Model 2: cases ~ age + city + log(pop)
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   7 -59.151                     
## 2  10 -57.943  3 2.4154     0.4908
# The AIC and the BIC can be seen in the above model outputs but can otherwise be shown as below
AIC(pois1,pois_original)
##               df      AIC
## pois1          7 132.3017
## pois_original 10 135.8862
BIC(pois1,pois_original)
##               df      BIC
## pois1          7 140.5480
## pois_original 10 147.6668
# Our conclusion is unchanged from the Wald test.

Note that the output might differ from that obtained using STATA.