Dealing with Categorical variables (without Chi-Squared)

The One Way ANOVA

Simply put, a one way ANOVA compares the means of multiple (x>2) groups of data with one independent variable that is continuous.

In health, this is useful for testing the effectiveness of an intervention, or comparing groups of people for one thing, let’s say blood pressure, when they are mutually exclusive or distinct.

Let’s go ahead and run through an example. Fortunatelly R has perfect data for us to do so.

anova <- PlantGrowth
head(anova, 10)
##    weight group
## 1    4.17  ctrl
## 2    5.58  ctrl
## 3    5.18  ctrl
## 4    6.11  ctrl
## 5    4.50  ctrl
## 6    4.61  ctrl
## 7    5.17  ctrl
## 8    4.53  ctrl
## 9    5.33  ctrl
## 10   5.14  ctrl
summary(anova)
##      weight       group   
##  Min.   :3.590   ctrl:10  
##  1st Qu.:4.550   trt1:10  
##  Median :5.155   trt2:10  
##  Mean   :5.073            
##  3rd Qu.:5.530            
##  Max.   :6.310

Looks like its the weight of some plants split into three groups: control, treatment 1, and treatment 2.

Is there a way to summarize these groups? Maybe give these means an eye test before conducting a more robust mathematical test?

Let’s rely on dplyr to help us conduct some stats.

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
group_by(anova, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
## # A tibble: 3 x 4
##   group count  mean    sd
##   <fct> <int> <dbl> <dbl>
## 1 ctrl     10  5.03 0.583
## 2 trt1     10  4.66 0.794
## 3 trt2     10  5.53 0.443

So there are 10 observations per group, and from the looks of it, each group’s means are quite different, but the standard deviation seems to overlap.

Seeing this in numbers won’t do, a full blow visualization is necessary here.

library(ggpubr)
## Warning: package 'ggpubr' was built under R version 3.5.3
## Loading required package: ggplot2
## Loading required package: magrittr
## Warning: package 'magrittr' was built under R version 3.5.2
#I'll also shameless admit that the 6 digit color codes are copies, who has time to fish out those colors anyway? 
ggboxplot(anova, x = "group", y = "weight",
          color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")

We see aside from the two outliers on treatment one, there is a clear different between the means of all three groups with only minor overlaps of the deviation between treatment 2 and the control group.

What if we only want to see the mean and the first sigma of the deviation? fear not.

ggline(anova, x = "group", y = "weight",
       add = c("mean_se", "jitter"),
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "treatment")

With this the difference between treatment 2 ad the other plots is a litte less clear, we see that the first two categories cover a lot more ground.

But enough visualization, what does the statistics tell us?

anova_done <- aov(weight ~ group, data = anova)
summary(anova_done)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## group        2  3.766  1.8832   4.846 0.0159 *
## Residuals   27 10.492  0.3886                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So the p value is indeed less than 0.05, and does show that there is significant difference between the mean of each group, despite some overlap with the standard deviations. Perhaps if we had more data a more accurate comparison can be made.

Two Way ANOVA

So what if we want to compare two different factors with multiple groups of data with a continuous independent variable. An example would be that not only do we want to check the IQ of those who’s highest education is high school, college, or grade school, but also want to see if there is a difference between sexes as well. If we have an adequate amount of data on these factors then this is possible.

twoway <- ToothGrowth
head(twoway, 10)
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
str(twoway)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

We see that we have our two factors in the “supp” variable, and can move forward with our analysis.

Here this data set describes tooth length and compares them to a supplement type and a dosage of that supplement. Again let’s just check the visuals to see if we can find a difference.

library("ggpubr")
ggboxplot(twoway, x = "dose", y = "len", color = "supp",
          palette = c("#00AFBB", "#E7B800"))

Looks like there is a difference between the two factors, especially at dose 1. Let’s re-visualize this with a line graph.

ggline(twoway, x = "dose", y = "len", color = "supp",
       add = c("mean_se", "dotplot"),
       palette = c("#00AFBB", "#E7B800"))
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

This makes the difference between dose 0.5 and 1 much more apparent, seeing the IQR physically seperated.

anova2 <- aov(len ~ supp + dose, data = twoway)
summary(anova2)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## supp         1  205.4   205.4   11.45   0.0013 ** 
## dose         1 2224.3  2224.3  123.99 6.31e-16 ***
## Residuals   57 1022.6    17.9                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Would you look at that, both supplement type and dose differ significantly and it is clear they affect the outcome.

However, does the dose affect the supplement? or vice versa? What if these two effects interact with each other? After all, they could work synergistically. Fortuantely, the easiest way to test this is to multiply the effect.

anova2_int <- aov(len ~ supp + dose + supp:dose, data = twoway)
summary(anova2_int)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## supp         1  205.4   205.4  12.317 0.000894 ***
## dose         1 2224.3  2224.3 133.415  < 2e-16 ***
## supp:dose    1   88.9    88.9   5.333 0.024631 *  
## Residuals   56  933.6    16.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Seems like the interaction is also statistically significant, which means they work synergistically and their product should also be included in a predictive model.

While ANOVA is one way of testing for statistical significance, it’s often good practice to back up or validate your findings with another examination. For ANOVA it’s the Tukey Honest Significant Differences test. It basically tests all the possible means and compares them against their range distribution. It also compares all possible pairs of means, making it less precise than ANOVA.

It’s also good practice to check for homogeneity of variances, after all, we’re performing an analysis of variances, so you need to make sure your assumption that the variances within each population are roughly equal otherwise it’s not a good comparison.

plot(anova2_int,1)

uh oh, it seems as though we have several outliers that may affect normality/homogeneity. Meaning that if this holds true, the data must be cleaned of these outliers or more data points must popular the data set to validate them.

Let’s see if this is statistically worrisome with the Levene’s Test for Homogeneity of Variance.

library(car)
## Warning: package 'car' was built under R version 3.5.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.5.2
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
leveneTest(len~supp, data = twoway, center=median)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  1.2136 0.2752
##       58

Looks like we should be good for the categorical variable, how about everything else? Let’s just check a QQ plot and call it a day.,

plot(anova2_int, 2)

Yeah….that’s a bit of something off with three outliers, but overall this isn’t aggregious enough to not be able to assume normality. Looks like we’re good!

MAVNOVA

Unforunately for MANOVA, it’s difficult to visualize, since we’re going to be comparing a factor between multiple continuous outcomes.

Let’s use R’s iris database to perform this.

iris <- iris
head(iris, 10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

Let’s see if Sepal Length, and Petal Length differ amongst species.

sepl <- iris$Sepal.Length
petl <- iris$Petal.Length
manova <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary (manova)
##            Df Pillai approx F num Df den Df    Pr(>F)    
## Species     2 0.9885   71.829      4    294 < 2.2e-16 ***
## Residuals 147                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(manova)
##  Response Sepal.Length :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Species       2 63.212  31.606  119.26 < 2.2e-16 ***
## Residuals   147 38.956   0.265                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Petal.Length :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Species       2 437.10 218.551  1180.2 < 2.2e-16 ***
## Residuals   147  27.22   0.185                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looks like they certainly do differ