This data analysis exercise was designed by Martyn Kirk MAppEpid PhD, National Centre for Epidemiology and Population Health in August 2013 and updated by Tambri Housen and Alice Richardson. The manual was originally created for use with Stata data analysis software, the conversion to R was conducted by Nidhi Menon and Arminder Deol. The manual was created for the Australian Field Epidemiology Training Program - Masters of Philosophy (Applied Epidemiology), Australian National University. CRICOS Provider No. 00120C


Learning objectives:

By the end of this session participants will have the skills and ability to:


This data analysis exercise was designed by Martyn Kirk MAppEpid PhD, National Centre for Epidemiology and Population Health in August 2013 and was last reviewed in September 2017. This case study on introductory analysis of case control study data requires use of the following dataset:

This practical exercise is based on chapters and examples from the following textbooks:

Jewell N. Statistics for Epidemiology. 2004. Chapman & Hall/CRC, New York.

Schlesselman JJ. Case-Control Studies Design, Conduct, Analysis. 1982. Oxford University Press, New York.

Hosmer DW & Lemeshow S. Applied Logistic Regression. Third Edition. 2013 John Wiley and Sons, New York.

An Introduction to R

R packages are bundles of functions which extend the capabilities of R. Thousands of add-on packages are available in the main online repository (CRAN) and several more packages can be found on GitHub. These packages can be found and installed online.

We will mainly use packages which are already installed in R. Add-on packages sometimes help make things easier. In such cases, we will install the add-on packages. Run the following code at the start to ensure that you have all the available packages and functions you need for analysis:

install.packages (“package_name1”, “package_name2”)

And to to load an installed package:

library(“package name”)

R and STATA have minor difference in default setting and methods. In this document, we will follow the STATA analysis as closely as possible, but small and usually important differences between R and STATA will be noted. Unlike STATA, R can hold multiple datasets in memory simultaneously and so, there is no need to save intermediate files or close and re-open datasets.

Introduction to Study

Good case control studies usually have study hypotheses that focus specifically on one or more exposures that are causally related to a given disease. Epidemiologists collect data relating the disease to the exposures, along with information on potential confounders. The key distinguishing feature of case control studies is that both cases and controls are selected into the study according to well-defined criteria on the basis of whether they have disease or not. Investigators often incorporate individual or frequency matching of cases and controls in study design to ensure statistical efficiency and to control potential confounding.

Prior to descriptive analysis of any case control study dataset, it is important for analysts to check coding of all variables and recode data. The second step is a thorough review of all variables to examine the data. For case control studies specifically, we are interested in ensuring that cases and controls are similar in terms of demographic variables and other factors that are not thought to be related to disease. This may include a comparison of matching variables to ensure that study recruitment has worked to plan. We would then conduct simple cross tabulations and stratified analyses, before moving onto more sophisticated multivariable analysis—according to a predetermined data analysis plan of course!

In this exercise we will use a case control for an outbreak of Shiga toxin producing E. coli infection to conduct a simple descriptive analysis, and estimate measures of association. The study will allow us to interpret findings of a case control study that employs frequency matching of controls. We will also analyze the data employing pair matching of cases and controls.

1. Introduction to the R Studio environment

Setting your working directory

Just as in STATA, you can set a folder to be your working directory. This is done using the setwd command (set working directory), e.g. setwd("C:/Users/Documents/myexample"). Ideally however, you’ll want to create a new project, which will create a project file on your computer. Locate this and move all of your data files for the session into this project folder. This project folder will become your working directory.

Reading your datasets.

To import data from a csv file: This can be done by saving your excel or STATA file as a csv file and importing in directly into R.

#to import a csv file
#Select separator as comma (sep = “,”)
#Do not import “string”variables as “Factors” (stringsAsFactors = FALSE)
#greencountry.csv is read and saved in R as gc
gc <- read.csv("greencountry.csv", sep = "," ,stringsAsFactors    = FALSE)
Browsing your data

R Studio has a nice feature where everything is in one browser window, so you can browse your dataset and your code without having to switch between windows.

#to view your dataset 
View(gc)

Alternatively, you can also view your dataset by clicking on “gc” in on the top right panel in R studio browser (see Fig 1)

Figure 1: Browsing your data in R studio

Saving your codes in R Scripts

This is the equivalent of creating a “do-file” in STATA. You can save your comments here using “#”. Your codes are saved in the R-Scripts (See Fig 2)

Figure 2: Creating a new R-Script in R studio

2. Checking Your Data

About the study

Greencountry is a case control study of an outbreak of Shiga toxin producing E. coli (STEC) that occurred in the Tasmanian town of Greencountry (population 5,500 persons) in the first six months of 1998. Investigators were concerned about exposures in the farm environment, as Greencountry was a rural area and several cases lived on a farm. To identify potential sources of infection, the Tasmanian Department of Health conducted a case control study of 25 cases and 25 controls. Controls were selected from the same medical practice as cases and were matched to a case by age (±5 years).

Why use a case control study in this instance? What are some of the benefits and challenges in doing so?

Initially, we are going to conduct simple univariate analyses. Then we will analyze the dataset using frequency matching in three age groups: 0-9 years, 10-64 years and 65 years and above. Finally, we will show you how to analyze the data using pairwise (i.e. individual) matching. The aim of this analysis of data from the case control study is to identify possible causes of STEC infection.

Task 1: You have been asked to analyze the case control study data by your new boss—Dr Robert Hogspen, Director of the Greencountry Council.

The first thing you do is:

  • Create our R-Script and write comments and name it

  • Open your dataset “greencountry.csv”

  • Inspect the dataset using commands, such assummary

  • You can also review the dataset in browse. The dataset isn’t particularly wide with only 24 variables. A convenient way to examine the data is to use the table and summary command that allows you to prepare summaries of variables. The command table will give you a very basic frequency table. The command summary will give you a more detailed summary as in STATA.

Below you will see the data dictionary for “greencountry.csv”. There is clearly a need for some recoding and cleaning of these data.

Question 1. Discuss what changes you need to make to get data ready for your analysis. Write down which variables need to be recoded.

The following changes need to be made to the data:

  1. status needs changing to a numeric variable: 0 control and 1 case
  2. agematch needs changing to a numeric variable containing 3 levels
  3. sex needs changing to a numeric variable: 0 female and 1 male and
  4. Exposure variables all need a label.

We will use the assignment operator <- to create a new variable. R’s ability to handle variable labels is somewhat unsatisfying. If you use the Hmisc package, you can take advantage of some labelling features. This can be done using the label command. To understand value labels in R, you need to understand the data structure “factor”. You can use the factor function to create your own value labels.

Table 1: Data dictionary for greencounty.csv data

We need to construct multi-line commands, so go into the R script you have opened. In the first section, we are going to develop a new numeric variable which will be automatically labeled. Then we will use the label values command to label data for several variables.

Task 2: In the R script window type the below commands

We use syntax dataset$varname to call (or define) a variable from a specific dataset, e.g.: gc$status

#a. 

gc$case1<- ifelse(gc$status == "CONTROL", 0,  1) # Creates new variable called 'case'. Now we are relabelling case as a factor where: 0= Control and 1 = Case. 

#b. 

gc$case <- factor(gc$case1, levels = c(0,1),labels = c("Control", "Case")) # Now we are re-labelling case as a factor where: 0= Control and 1 = Case. 

#c.

gc$agegrp <- factor(gc$agematch, 
            levels = c("0-9", "10-64", "65+"),
            labels = c("0-9", "10-64", "65+"))

#d. 

gc[c(7,9,11:25)] <- lapply(gc[c(7,9,11:25)], factor, 
                      levels = c(0, 1), 
                      labels = c("No", "Yes"))#Since we want to assign value labels only certain columns (exposure variables), we specify the column numbers here. 

#e. Hit run in the R-Script and review the results.

#f. Do some tabulations using these new and labeled variables to check that things have worked properly.
Question 2. What has happened to the data and values in the dataset?

The first of these commands has generated a new variable for designating cases and controls, which was then labeled with ‘No’ for 0 and ‘Yes’ for 1. In the second step we used the factor command to generate a labeled numeric variable for agegroup.

Factors are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed. The factor function is used to create a factor.

The last step here set up a label for yes/no variables and applied it to the values of multiple variables. lappy/sapply and apply functions are handy to perform these tasks.


Hint: It is good to make tabulations of newly created and recoded variables part of your R-script when manipulating data so you can review that they have worked. It is best to do it immediately after the command, so you don’t lose track of what you have done.


Descriptive analysis

For epidemiological studies, it is important to describe the clinical characteristics of cases. In case control studies in particular, it is important to compare the features of cases and controls to ensure that they are similar. If the study has been matched, then it is important to check that the matching during study recruitment occurred according to plan.

Question 3. Describe the nature of the STEC cases illness. What commands might be useful to do this? Write them into your R-script, run them and review the results.

To describe cases’ illness, we can use a few different commands:

#summary gives you more detailed statistics as in STATA
#command “which” helps subset the data so that only values for cases are summarized
summary(gc[which(gc$case == "Case"),])
##        ID           status             matchid        agematch        
##  Min.   : 1.00   Length:25          Min.   : 1.00   Length:25         
##  1st Qu.:18.00   Class :character   1st Qu.: 7.00   Class :character  
##  Median :29.00   Mode  :character   Median :13.00   Mode  :character  
##  Mean   :27.16                      Mean   :13.48                     
##  3rd Qu.:37.00                      3rd Qu.:20.00                     
##  Max.   :49.00                      Max.   :26.00                     
##                                                                       
##       age            sex            diarrhoea  diarrdate         vomit   
##  Min.   : 0.00   Length:25          No : 0    Length:25          No :13  
##  1st Qu.:29.00   Class :character   Yes:25    Class :character   Yes:12  
##  Median :47.00   Mode  :character             Mode  :character           
##  Mean   :46.72                                                           
##  3rd Qu.:69.00                                                           
##  Max.   :85.00                                                           
##                                                                          
##     stoolno       hus     duration  travel   animal    pets    townwat 
##  Min.   : 2.00   No :21   No  : 0   No :18   No :19   No :10   No :17  
##  1st Qu.: 5.75   Yes: 4   Yes : 0   Yes: 7   Yes: 6   Yes:15   Yes: 8  
##  Median : 9.50            NA's:25                                      
##  Mean   :10.38                                                         
##  3rd Qu.:15.00                                                         
##  Max.   :20.00                                                         
##  NA's   :1                                                             
##  watersport chicken   eggs    spinach  lettuce  cucumber milkraw  farmlive
##  No :16     No :13   No : 6   No :13   No : 5   No : 4   No : 4   No : 8  
##  Yes: 9     Yes:12   Yes:19   Yes:12   Yes:20   Yes:21   Yes:21   Yes:17  
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##  milkbottle     case1        case      agegrp  
##  No : 4     Min.   :1   Control: 0   0-9  : 3  
##  Yes:21     1st Qu.:1   Case   :25   10-64:12  
##             Median :1                65+  :10  
##             Mean   :1                          
##             3rd Qu.:1                          
##             Max.   :1                          
## 
# to look at specific variables
#define the variables which you want to look at and save as “vars”
#We use a for loop such that for all variables in “vars”, do as follows.
vars <- c("vomit", "hus", "diarrhoea")

for(var in vars){
  
  #create a table for one variable at a time among cases
  b <- table(gc[,var],  gc$case)
  #print variable name to as a header in the output
  print(var)
  # table of proportions
  print(prop.table(b))
  #rename column names as per your output
  colnames(b) <- c("n", "%")
  #print table with (n & %)
  print(b)
  # repeats the loop till all variables in “vars” are complete
}
## [1] "vomit"
##      
##       Control Case
##   No     0.50 0.26
##   Yes    0.00 0.24
##      
##        n  %
##   No  25 13
##   Yes  0 12
## [1] "hus"
##      
##       Control Case
##   No     0.50 0.42
##   Yes    0.00 0.08
##      
##        n  %
##   No  25 21
##   Yes  0  4
## [1] "diarrhoea"
##      
##       Control Case
##   No      0.5  0.0
##   Yes     0.0  0.5
##      
##        n  %
##   No  25  0
##   Yes  0 25
Question 4: Compare age and sex between cases and controls. How similar are cases to control in terms of the 
matching variable?

To compare cases with controls, we can compare means, and use some simple cross-tabulations that incorporate a chi-square or exact test:

# First install the 'plyr' package before the next step
#to summarize age across case and control & obtain mean/median/quartiles for age
library(plyr)
ddply(gc,.(gc$case), summarize, M=mean(age), Med=median(age), Q=matrix(quantile (age, probs=c(0.25,0.50,0.75)) ,ncol=3) )

#to calculate Chi Square and Fisher’s Exact Test

#Pearson’s Chi-Sq test 
#correct = FALSE to turn off Yates’ correction factor
x <- table(gc$case, gc$sex)
chisq.test(x, correct=FALSE)

#Fisher’ Exact Test
y<- table(gc$agegrp, gc$case)
fisher.test(y)

# RESULTS: No significant difference between cases and controls by age or sex

Hint: The table command is very useful for making tables of statistics. It can calculate multiple statistics on the same variable.


4. Measures of Association

Crude analysis

In case control studies, we have selected study subjects on the basis of disease and cannot estimate incidence. In case control studies we estimate odds ratios by cross tabulating disease against exposure in a two-by-two table.

Table 2: A two-by-two table

Odds = probability event occurs/probability event doesn’t occur \[=\frac{p}{(1-p)}\] OR = odds of exposure among cases/odds exposure among controls

Task 4: In the R script type the below commands

  1. make a 2x2 table of exposure and outcome
table(gc$case, gc$travel)
##          
##           No Yes
##   Control 16   9
##   Case    18   7
  1. To calculate odds ratio, first install and load package “epiR”. Then use the following command. You can also estimate Relative Risk by changing the method to “cohort.count”.
library(epiR)
#create a 2x2 table of exposure and outcome
counts <- table(gc$travel, gc$case)
#to view table
counts
##      
##       Control Case
##   No       16   18
##   Yes       9    7
#Applu epi.2by2 function to table. 
cc <- epi.2by2(counts, method = "case.control")
#to view output
cc
##              Outcome +    Outcome -      Total        Prevalence *        Odds
## Exposed +           16           18         34                47.1       0.889
## Exposed -            9            7         16                56.2       1.286
## Total               25           25         50                50.0       1.000
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Odds ratio (W)                               0.69 (0.21, 2.28)
## Attrib prevalence *                          -9.19 (-38.73, 20.34)
## Attrib prevalence in population *            -6.25 (-34.23, 21.73)
## Attrib fraction (est) in exposed  (%)        -43.57 (-471.72, 62.59)
## Attrib fraction (est) in population (%)      -28.57 (-191.19, 43.23)
## -------------------------------------------------------------------
##  Test that OR = 1: chi2(1) = 0.368 Pr>chi2 = 0.54
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 population units
Question 5: What do you think about the association between illness and travel? How would you interpret the statistics that R produces?

There is no statistically significant association between travel in the incubation period and illness. The odds ratio is less than one and the 95% confidence interval includes 1.

The table presents proportions of diseased and non-diseased exposed, as well as the population attributable fractions. There is an associated Chi-square test statistic and associated p-value

Frequency matching

Because the study investigators frequency-matched during study recruitment, we should adjust odds ratios for age.

Task 5: In the R- Script type the below commands

  1. Make a 3 way table stating exposure, outcome and stratifying variable in that order:
x1 <- table(gc$travel, gc$case, gc$agegrp)
# to view table
x1
## , ,  = 0-9
## 
##      
##       Control Case
##   No        2    2
##   Yes       1    1
## 
## , ,  = 10-64
## 
##      
##       Control Case
##   No        5    8
##   Yes       6    4
## 
## , ,  = 65+
## 
##      
##       Control Case
##   No        9    8
##   Yes       2    2
#Use epi.2by2 function to calculate ORs.
stcc <- epi.2by2(x1, method = "case.control")
#to view output
stcc
##              Outcome +    Outcome -      Total        Prevalence *        Odds
## Exposed +           16           18         34                47.1       0.889
## Exposed -            9            7         16                56.2       1.286
## Total               25           25         50                50.0       1.000
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Odds ratio (crude)                           0.69 (0.21, 2.28)
## Odds ratio (M-H)                             0.65 (0.19, 2.21)
## Odds ratio (crude:M-H)                       1.07
## Attrib prevalence (crude) *                  -9.19 (-38.73, 20.34)
## Attrib prevalence (M-H) *                    -10.98 (-59.92, 37.97)
## Attrib prevalence (crude:M-H)                0.84
## -------------------------------------------------------------------
##  M-H test of homogeneity of ORs: chi2(2) = 0.568 Pr>chi2 = 0.75
##  Test that M-H adjusted OR = 1:  chi2(2) = 0.463 Pr>chi2 = 0.50
##  Wald confidence limits
##  M-H: Mantel-Haenszel; CI: confidence interval
##  * Outcomes per 100 population units
  1. To extract various outputs from the output table stcc
#note that $ is used twice to imply two levels within output
#Crude OR
stcc$massoc$crude.wald
#Stratum specific ORs
stcc$massoc$OR.strata.wald
#To combine results of all elements into a single table
results <- rbind(stcc$massoc$crude.wald,   stcc$massoc$OR.strata.wald)  #rbind and cbind are commands used to combine R objects by rows or columns respectively

From the first command you can see the three separate tables that have given rise to the odds ratios in the second command. The second command produces a crude odds ratio that is the same as what was estimated earlier. A combined Mantel-Haenszel odds ratio is produced that takes into account the effect of the third variable, which in this case is age group. A small difference between the crude and summary odds ratio may indicate that age is a potential confounder for the relationship between STEC and the exposure.


Hint: The key thing to remember with data that have been matched during design of a study, is that you must take account of that matching during analysis. If you do not stratify or conduct a matched analysis, the results may be biased.


The next step is to repeat this analysis for all exposure variables and examine the association with illness. This could be quite a laborious process….

Question 6: How would you do this analysis of multiple exposures simply?

In this instance, we are going to demonstrate the use of the epi2.by2 command and a for loop, which allows us to repeat analyses easily.

library("epiR")
vars <- c("animal", "travel", "pets", "watersport")
#for each exposure, run the analysis
for (var in vars){
  b <- table(gc[,var], gc$case)
  test <- epi.2by2(b, method = "case.control")
  #resultstable <- rbind(mh$massoc$OR.crude.wald, mh$massoc$OR.strata.wald, mh$mas)
  
  print(b)
  print(test)
}
Question 7: Fill in the table below and discuss if there any exposures that might be the risk factor causing the outbreak?

Table 3: Potential risk factors

The main risk factor causing illness is drinking raw milk. Raw or unpasteurized milk is a known cause of enteric infections, including STEC. Was there any evidence of confounding or effect modification by age? Imagine if no frequency matching by age was incorporated in design. What might the age profile of controls look like in comparison to cases?

The other option is to use the cctable option in R that is available in package EpiStats. This command calculates number of cases, attack rate among cases and controls, odds ratio, 95%CI interval, p-value for each variable of varlist. The result table could be sorted by p-value, odds ratio, attack rates.

library(EpiStats)
x <- cctable(gc, "case", exposure=c(14:21), exact = TRUE)
x
## $df
##    Tot.Cases Exposed     % Tot.Ctrls Exposed     %   OR CI ll CI ul p(Fisher)
## 17        25       9 36.00        25       5 20.00 2.25  0.54 10.21     0.345
## 16        25       8 32.00        25       6 24.00 1.49  0.36  6.34     0.754
## 15        25      15 60.00        25      17 68.00 0.71  0.19  2.61     0.769
## 18        25      12 48.00        25      10 40.00 1.38  0.39  4.92     0.776
## 14        25       6 24.00        25       7 28.00 0.81  0.19  3.46     1.000
## 19        25      19 76.00        25      18 72.00 1.23  0.29  5.37     1.000
## 20        25      12 48.00        25      11 44.00 1.17  0.34  4.12     1.000
## 21        25      20 80.00        25      20 80.00 1.00  0.20  5.10     1.000
Stratified analysis

There are two exposures that are statistically associated with STEC infection. We need to sort out the relationship to identify what is truly causing infection. In this instance, we are going to ignore the frequency matching for simplicity sake. When we observe two or more variables associated with the outcome, we need to stratify or control for the other variables our analysis.

Task 7: In the command window type the below commands

# task 7a.  
mytable <- xtabs(~gc$farmlive+gc$case+gc$milkraw)
ftable(mytable) # print table 
##                     gc$milkraw No Yes
## gc$farmlive gc$case                  
## No          Control            14   3
##             Case                3   5
## Yes         Control             6   2
##             Case                1  16
summary(mytable) # chi-square test of independence
## Call: xtabs(formula = ~gc$farmlive + gc$case + gc$milkraw)
## Number of cases in table: 50 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 35.56, df = 4, p-value = 3.557e-07
# task 7b.  
library(epiR)    
cc <- epi.2by2(mytable, method = "case.control")
cc
##              Outcome +    Outcome -      Total        Prevalence *        Odds
## Exposed +           17            8         25                  68       2.125
## Exposed -            8           17         25                  32       0.471
## Total               25           25         50                  50       1.000
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Odds ratio (crude)                           4.52 (1.38, 14.82)
## Odds ratio (M-H)                             2.14 (0.49, 9.29)
## Odds ratio (crude:M-H)                       2.11
## Attrib prevalence (crude) *                  36.00 (10.14, 61.86)
## Attrib prevalence (M-H) *                    12.34 (-29.09, 53.76)
## Attrib prevalence (crude:M-H)                2.92
## -------------------------------------------------------------------
##  M-H test of homogeneity of ORs: chi2(1) = 1.041 Pr>chi2 = 0.31
##  Test that M-H adjusted OR = 1:  chi2(1) = 1.039 Pr>chi2 = 0.31
##  Wald confidence limits
##  M-H: Mantel-Haenszel; CI: confidence interval
##  * Outcomes per 100 population units
# task 7c.  
tab1 <- xtabs(~gc$milkraw+gc$case+gc$farmlive)
ftable(tab1) # print table 
##                    gc$farmlive No Yes
## gc$milkraw gc$case                   
## No         Control             14   6
##            Case                 3   1
## Yes        Control              3   2
##            Case                 5  16
summary(tab1) # chi-square test of independence
## Call: xtabs(formula = ~gc$milkraw + gc$case + gc$farmlive)
## Number of cases in table: 50 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 35.56, df = 4, p-value = 3.557e-07
# task 7d.  
cc <- epi.2by2(tab1, method = "case.control")
cc
##              Outcome +    Outcome -      Total        Prevalence *        Odds
## Exposed +           20            4         24                83.3       5.000
## Exposed -            5           21         26                19.2       0.238
## Total               25           25         50                50.0       1.000
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Odds ratio (crude)                           21.00 (4.92, 89.56)
## Odds ratio (M-H)                             15.09 (3.47, 65.69)
## Odds ratio (crude:M-H)                       1.39
## Attrib prevalence (crude) *                  64.10 (42.85, 85.36)
## Attrib prevalence (M-H) *                    59.16 (1.99, 116.33)
## Attrib prevalence (crude:M-H)                1.08
## -------------------------------------------------------------------
##  M-H test of homogeneity of ORs: chi2(1) = 1.041 Pr>chi2 = 0.31
##  Test that M-H adjusted OR = 1:  chi2(1) = 16.182 Pr>chi2 = <0.001
##  Wald confidence limits
##  M-H: Mantel-Haenszel; CI: confidence interval
##  * Outcomes per 100 population units
Question 8: What is your assessment of the association between STEC infection and living on a farm after 
controlling for raw milk consumption? What about your conclusions about raw milk consumption? Is there evidence 
of confounding?

After controlling for raw milk consumption, the association with living on a farm is negligible. There is a distinct difference between the Mantel-Haenzel pooled odds ratio and the crude odds ratio, which indicates that raw milk is a confounder for the association between living on a farm and STEC. The association between raw milk consumption and STEC infection is very strong regardless of whether a person lives on a farm or not.

5. Matched Analysis

In case control studies where you have individually matched controls to your cases on one or more factors that are likely to correspond to potential confounders, you need to ensure that you maintain that matching during analysis. In R there are a couple of choices for analysis:

Conduct conditional logistic regression using `clogit` from the 'survival' package in R.

Task 8: We are going to analyse the association between STEC infection and raw milk consumption by typing in the below commands

#table(gc$matchid, gc$case)
#tab2 <- xtabs(~gc$milkraw+gc$case+gc$agegrp)
#ftable(tab2)
# below is using the epiR package
#cc <- epi.2by2(tab2, method = "case.control")
#cc

# task 8a 
table(gc$matchid, gc$case)
##     
##      Control Case
##   1        1    1
##   2        1    1
##   3        1    1
##   4        1    1
##   5        1    1
##   6        1    1
##   7        1    1
##   8        1    1
##   9        1    1
##   10       1    1
##   11       1    1
##   12       1    1
##   13       1    1
##   15       1    1
##   16       1    1
##   17       1    1
##   18       1    1
##   19       1    1
##   20       1    1
##   21       1    1
##   22       1    1
##   23       1    1
##   24       1    1
##   25       1    1
##   26       1    1
# task 8b 
tab2 <- xtabs(~gc$milkraw+gc$case+gc$agegrp)
ftable(tab2) # "print" table 
##                    gc$agegrp 0-9 10-64 65+
## gc$milkraw gc$case                        
## No         Control             2     8  10
##            Case                0     1   3
## Yes        Control             1     3   1
##            Case                3    11   7
cc <- epi.2by2(tab2, method = "case.control")
cc 
##              Outcome +    Outcome -      Total        Prevalence *        Odds
## Exposed +           20            4         24                83.3       5.000
## Exposed -            5           21         26                19.2       0.238
## Total               25           25         50                50.0       1.000
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Odds ratio (crude)                           21.00 (4.92, 89.56)
## Odds ratio (M-H)                             29.86 (5.39, 165.47)
## Odds ratio (crude:M-H)                       0.70
## Attrib prevalence (crude) *                  64.10 (42.85, 85.36)
## Attrib prevalence (M-H) *                    67.04 (20.54, 113.53)
## Attrib prevalence (crude:M-H)                0.96
## -------------------------------------------------------------------
##  M-H test of homogeneity of ORs: chi2(2) = 0.053 Pr>chi2 = 0.97
##  Test that M-H adjusted OR = 1:  chi2(2) = 19.894 Pr>chi2 = <0.001
##  Wald confidence limits
##  M-H: Mantel-Haenszel; CI: confidence interval
##  * Outcomes per 100 population units
# task 8c
tab3 <- xtabs(~gc$milkraw+gc$case+ gc$farmlive)
ftable(tab3) 
##                    gc$farmlive No Yes
## gc$milkraw gc$case                   
## No         Control             14   6
##            Case                 3   1
## Yes        Control              3   2
##            Case                 5  16
cc <- epi.2by2(tab3, method = "case.control") 
cc
##              Outcome +    Outcome -      Total        Prevalence *        Odds
## Exposed +           20            4         24                83.3       5.000
## Exposed -            5           21         26                19.2       0.238
## Total               25           25         50                50.0       1.000
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Odds ratio (crude)                           21.00 (4.92, 89.56)
## Odds ratio (M-H)                             15.09 (3.47, 65.69)
## Odds ratio (crude:M-H)                       1.39
## Attrib prevalence (crude) *                  64.10 (42.85, 85.36)
## Attrib prevalence (M-H) *                    59.16 (1.99, 116.33)
## Attrib prevalence (crude:M-H)                1.08
## -------------------------------------------------------------------
##  M-H test of homogeneity of ORs: chi2(1) = 1.041 Pr>chi2 = 0.31
##  Test that M-H adjusted OR = 1:  chi2(1) = 16.182 Pr>chi2 = <0.001
##  Wald confidence limits
##  M-H: Mantel-Haenszel; CI: confidence interval
##  * Outcomes per 100 population units
# task 8d

tab4 <- xtabs(~gc$milkraw+gc$case+gc$matchid)
ftable(tab4) 
##                    gc$matchid 1 2 3 4 5 6 7 8 9 10 11 12 13 15 16 17 18 19 20 21 22 23 24 25 26
## gc$milkraw gc$case                                                                             
## No         Control            1 0 1 1 0 1 1 1 1  1  1  1  1  1  1  1  1  1  1  0  0  1  1  1  0
##            Case               1 0 0 0 0 0 0 0 0  0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0
## Yes        Control            0 1 0 0 1 0 0 0 0  0  0  0  0  0  0  0  0  0  0  1  1  0  0  0  1
##            Case               0 1 1 1 1 1 1 1 1  1  0  1  1  1  1  1  1  1  1  1  0  1  0  1  1

6. Pairwise matched and Conditional Logistic regression analysis in R

This short exercise shows how to analyze matched data in R using a 1:1 pair matched case control study. We continue to use the greencountry.csv dataset, but will need to reshape it to get it into the right format. For matched pairs, data are arranged differently in two-by-two tables: where exposed and non-exposed cases are shown in columns, while their matched exposed and non-exposed controls are shown in the rows.

Table 4: two-by-two table (showing matched pairs)

The odds ratio in this instance is given by f/g. The concordant pairs (exposed case + exposed control; unexposed case + unexposed control) do not contribute anything to the estimation of the odds ratio. This means that the sample size for analysis can become small in some instances and make confidence intervals wider.

Task 9: To see how this works we will first perform a simple conditional logistic regression, followed by the reshape command from the ‘reshape’ package. In the command window, type the below commands

#In order to run this patch of code, we need to convert matchid to a factor variable and the outcome needs to be defined as a numeric variable. 

# task 9a.  
gc$case <- as.numeric(gc$case)
# task 9b.  
gc$matchid <- as.factor(gc$matchid)
# task 9c.  
logit <- clogit(case ~ milkraw + strata(matchid), data = gc)

# task 9d.  Now try the conditional logistic regression with the variable farmlive, and then repeat the analysis incorporating both milkraw and farmlive. 

# task 9e.  We will reshape the data from long to wide to generate a table to examine the variable milkraw and a couple of other variables in the format above. To do this we will drop several variables and re-shape the data from long (one study subject per line) to wide (data on case:control pairs on the same line). To do this type:
#i. 
gc1 <-  gc[c(1,3,4:28)]
View(gc1)
#ii
library(reshape)
gc_wide<- reshape(gc1, idvar= "matchid", direction = "wide", timevar= "case")

#iii.   If you browse your dataset you will see that you now have only 26 observations and the variables have either a suffix of 1 for a control or 2 for a case. You are now ready to do a pairwise matched analysis!

Firstly, you should get a data table that looks like this:

x <- table(gc_wide$milkraw.1, gc_wide$milkraw.2)
x
# To perform McNemar’s Chi Sq test we get, 

mcnemar.test(x)

Repeat the analysis for other exposure variables. Please note that that there are several packages in R which allow you to do the same analysis in multiple ways. You will find many differences between the output generated by R and STATA for such specific analysis.

Question 10: What do the outputs from these analyses tell you about the association between these farm and milk drinking behaviours on STEC infection? 

The results show that the odds of an STEC case drinking raw milk was 17 times higher than for people who weren’t infected. The other variables of living on a farm and drinking bottled milk had elevated odds ratios, but were not statistically significant. All three methods estimated the same odds ratio, with similar 95% confidence intervals. The mcnemar.test commands are impractical because of the need to reshape the data. It is much simpler to use clogit.

Summary

Now the only thing left to do is to clear your data and make notes in your RScipts (if you haven’t done so already) and save it. Make sure you put explanatory text in RScript so you will know what you were thinking during this session- you will forget when you go back to your code in a couple of month time.

Well done!!! Hope you feel comfortable conducting descriptive and analytical analyses of case control study data after completing this exercise.