Introduction

Pairs of categorical variables

Contingency tables

  • Consider an example with two study groups: a control group and a treatment group
  • The patients are also classified by grade of disease: grade I, II, or III
  • The count for each combination of grade and group can be entered row by row using rbind(), as in the code below
# Row form using rbind()
rbind(c(22, 29), c(28, 23), c(27, 24))
##      [,1] [,2]
## [1,]   22   29
## [2,]   28   23
## [3,]   27   24
  • In this example there are 77 patients in the control arm (sum of the first column) and 76 in the treatment arm (sum of the second column)
  • The rows represent the three disease grades: 51 patients with grade I disease, 51 with grade II, and 51 with grade III (the row sums)
  • The contingency table expresses the division of the patients into the six possible groups
  • The same can be achieved with cbind(), where each numeric vector supplies the values of one column
# Column form using cbind()
cbind(c(22, 28, 27), c(29, 23, 24))
##      [,1] [,2]
## [1,]   22   29
## [2,]   28   23
## [3,]   27   24
  • Yet another way to create the same table is the matrix() command
    • The number of rows is specified with the nrow argument
    • By default the values fill the table column by column
matrix(c(22, 28, 27, 29, 23, 24),
       nrow = 3)
##      [,1] [,2]
## [1,]   22   29
## [2,]   28   23
## [3,]   27   24
  • Setting byrow = TRUE fills the table row by row instead
  • Note the change in the order of the values needed to achieve this
matrix(c(22, 29, 28, 23, 27, 24),
       byrow = TRUE,
       nrow = 3)
##      [,1] [,2]
## [1,]   22   29
## [2,]   28   23
## [3,]   27   24
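  • As a quick check that the four constructions above really produce the same object, the results can be compared with identical(). This is a minimal sketch; the variable names byRows, byColumns, colFill, rowFill, and allSame are illustrative and not part of the original example.

```r
# Build the same 3 x 2 table of counts in four different ways
byRows    <- rbind(c(22, 29), c(28, 23), c(27, 24))
byColumns <- cbind(c(22, 28, 27), c(29, 23, 24))
colFill   <- matrix(c(22, 28, 27, 29, 23, 24),
                    nrow = 3)
rowFill   <- matrix(c(22, 29, 28, 23, 27, 24),
                    byrow = TRUE,
                    nrow = 3)

# identical() returns TRUE only when two objects match exactly
allSame <- identical(byRows, byColumns) &&
  identical(byRows, colFill) &&
  identical(byRows, rowFill)
allSame
```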
  • Storing the table in a variable allows row and column names to be added
contingencyTable <- matrix(c(22, 29, 28, 23, 27, 24),
                           byrow = TRUE,
                           nrow = 3)
rownames(contingencyTable) <- c("Grade I",
                                "Grade II",
                                "Grade III")
colnames(contingencyTable) <- c("Control",
                                "Treatment")
contingencyTable
##           Control Treatment
## Grade I        22        29
## Grade II       28        23
## Grade III      27        24
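  • The group and grade totals quoted earlier (77 and 76 per arm, 51 per grade) can be recomputed from the named table. The sketch below rebuilds the table so that it runs on its own; groupTotals, gradeTotals, grandTotal, and withTotals are illustrative names. It uses colSums(), rowSums(), and addmargins() from base R.

```r
# Rebuild the named contingency table
contingencyTable <- matrix(c(22, 29, 28, 23, 27, 24),
                           byrow = TRUE,
                           nrow = 3)
rownames(contingencyTable) <- c("Grade I", "Grade II", "Grade III")
colnames(contingencyTable) <- c("Control", "Treatment")

# Totals per treatment group (columns) and per grade (rows)
groupTotals <- colSums(contingencyTable)
gradeTotals <- rowSums(contingencyTable)
grandTotal  <- sum(contingencyTable)

# addmargins() appends a "Sum" row and a "Sum" column to the table
withTotals <- addmargins(as.table(contingencyTable))
withTotals
```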
  • Data in a data.frame can also be expressed as a contingency table
set.seed(123)
df <- data.frame(Grade = sample(c("Grade I",
                                  "Grade II",
                                  "Grade III"),
                                size = 153,
                                replace = TRUE),
                 Group = sample(c("Control",
                                  "Treatment"),
                                size = 153,
                                replace = TRUE))
head(df)
##       Grade     Group
## 1   Grade I   Control
## 2 Grade III   Control
## 3  Grade II   Control
## 4 Grade III Treatment
## 5 Grade III   Control
## 6   Grade I   Control
  • The table() command cross-tabulates two columns, counting the patients in each combination of grade and group
table(df$Grade,
      df$Group)
##            
##             Control Treatment
##   Grade I        22        29
##   Grade II       28        23
##   Grade III      27        24
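  • Because sample() draws the labels at random, the counts returned by table() depend on the seed and on the R version. When a patient-level data.frame must match a known table of counts exactly, one option is to repeat each grade and group combination as many times as its count. This is a sketch under that assumption; counts, long, patients, and rebuilt are illustrative names.

```r
# Known counts, with named dimensions
counts <- matrix(c(22, 29, 28, 23, 27, 24),
                 byrow = TRUE,
                 nrow = 3,
                 dimnames = list(Grade = c("Grade I", "Grade II", "Grade III"),
                                 Group = c("Control", "Treatment")))

# One row per combination, with its count in the Freq column
long <- as.data.frame(as.table(counts))

# Repeat each combination Freq times to get one row per patient
patients <- long[rep(seq_len(nrow(long)), times = long$Freq),
                 c("Grade", "Group")]
rownames(patients) <- NULL

# Cross-tabulating the patient-level data recovers the original counts
rebuilt <- table(patients$Grade,
                 patients$Group)
rebuilt
```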

Stacked bar chart

  • A stacked bar chart is a simple way of visualizing the data
barplot(table(df$Grade,
              df$Group),
        legend.text = TRUE,
        main = "Number of patients with grade of disease (by treatment group)",
        xlab = "Treatment group",
        ylab = "Grade count",
        col = c("deepskyblue",
                "orange",
                "gray"),
        border = NA,
        las = 1)
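  • When the groups differ in size, plotting proportions instead of counts can make the bars easier to compare. The sketch below uses prop.table() with margin = 2 so that each column (treatment group) sums to 1; the counts are entered directly so the block runs on its own, and the names counts and proportions are illustrative.

```r
# Counts by grade (rows) and treatment group (columns)
counts <- matrix(c(22, 29, 28, 23, 27, 24),
                 byrow = TRUE,
                 nrow = 3,
                 dimnames = list(c("Grade I", "Grade II", "Grade III"),
                                 c("Control", "Treatment")))

# Divide each column by its total, so every bar has height 1
proportions <- prop.table(counts,
                          margin = 2)

barplot(proportions,
        legend.text = TRUE,
        main = "Proportion of patients with grade of disease (by treatment group)",
        xlab = "Treatment group",
        ylab = "Proportion",
        col = c("deepskyblue",
                "orange",
                "gray"),
        border = NA,
        las = 1)
```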

  • Swapping the order of the two arguments to table() transposes the table, so the bars now represent the grades and the stacked segments the treatment groups
barplot(table(df$Group,
              df$Grade),
        legend.text = TRUE,
        main = "Number of patients in each group (by grade of disease)",
        xlab = "Grade of disease",
        ylab = "Treatment group count",
        col = c("deepskyblue",
                "orange"),
        border = NA,
        las = 1)
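  • The transposed table does not have to be rebuilt from the data.frame: applying t() to the original table gives the same counts. A minimal sketch, regenerating df as above so that it runs on its own; byGrade, byGroup, and sameCounts are illustrative names.

```r
# Regenerate the example data
set.seed(123)
df <- data.frame(Grade = sample(c("Grade I", "Grade II", "Grade III"),
                                size = 153,
                                replace = TRUE),
                 Group = sample(c("Control", "Treatment"),
                                size = 153,
                                replace = TRUE))

byGrade <- table(df$Grade, df$Group)  # grades as rows, groups as columns
byGroup <- table(df$Group, df$Grade)  # groups as rows, grades as columns

# t() swaps rows and columns, so the two tables hold the same counts
sameCounts <- all(t(byGrade) == byGroup)
sameCounts
```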

Numerical data

set.seed(123)
df <- data.frame(Temp = round(rnorm(100,
                                    mean = 37,
                                    sd = 2),
                              digits = 1),
                 WCC = round(rnorm(100,
                                   mean = 12,
                                   sd = 3),
                             digits = 1))
head(df)
##   Temp  WCC
## 1 35.9  9.9
## 2 36.5 12.8
## 3 40.1 11.3
## 4 37.1 11.0
## 5 37.3  9.1
## 6 40.4 11.9

Visualizing the data

  • A scatter plot draws one point per row (patient), pairing that patient's temperature with their white cell count
plot(df$Temp,
     df$WCC,
     main = "Temperature vs white cell count",
     xlab = "Temperature (deg Celsius)",
     ylab = "White cell count",
     las = 1)
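  • One way to put a number on what the scatter plot shows is the correlation coefficient from cor(), which uses Pearson's method by default. This is a sketch rather than part of the original example; the name r is illustrative. Because Temp and WCC are simulated independently here, a value close to zero would be expected.

```r
# Regenerate the simulated temperature and white cell count data
set.seed(123)
df <- data.frame(Temp = round(rnorm(100,
                                    mean = 37,
                                    sd = 2),
                              digits = 1),
                 WCC = round(rnorm(100,
                                   mean = 12,
                                   sd = 3),
                             digits = 1))

# Pearson correlation between the two numerical variables
r <- cor(df$Temp,
         df$WCC)
r
```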