For the examples in this lab, we will again return to the Titanic data set. We’ll group passengers by the passenger class they travelled under (a categorical variable so we will use the option stringsAsFActors = True option when loading the Titanic data) and ask whether different passenger classes differed in their mean age (a numerical variable). Let’s load the data.
titanicData <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\titanic.csv", stringsAsFactors = TRUE)
Let’s first look at the data to get a sense of how well it fits the assumptions of ANOVA.
Recall that the assumptions of ANOVA (Analysis of Variance) are:
Independence: The observations in each group must be independent of each other.
Normality: The data in each group should be approximately normally distributed.
Homogeneity of Variances: The variance among groups should be roughly equal (homoscedasticity); equal spread.
Violations of these assumptions may impact the validity of ANOVA results.
Multiple histogram are useful for this purpose. As we saw in the last lab, we can use ggplot() and facets to make this plot:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
ggplot(titanicData, aes(x = age)) +
geom_histogram() +
facet_wrap(~ passenger_class, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_bin()`).
These data look sufficiently normal and with similar spreads that ANOVA would be appropriate.
To confirm these visual impressions, it would be useful to construct a table of the means and standard deviations of each group. There are numerous ways to do this in R, but one of the neatest is to use functions from the package dplyr. If you have not done so yet, install the dplyr package from the “Packages” tab in RStudio. Then load the dplyr package with library().
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The package dplyr has several useful features for manipulating data sets. For our current purposes, we will find two functions particularly useful: group_by() and summarise(). These two functions are well named and work together well, first to organize the data by groups, and second to summarize the results for each group.
First, use group_by() to organize your data frame by the appropriate grouping variable. For example, here we want to organize the titanicData by passenger_class:
titanic_by_passenger_class <- group_by(titanicData,passenger_class)
After applying group_by() to a data frame, we can summarize the data using summarise(). (The “s” in summarise() is not a typo—the creator of the package is from New Zealand.) With summarise(), we can apply any type of function that summarizes data (e.g. mean(), median(), var(), etc.), and receive that summary group by group. For example, to calculate the mean age of each passenger_class, we can use:
summarise(titanic_by_passenger_class, group_mean = mean(age, na.rm=TRUE))
## # A tibble: 3 × 2
## passenger_class group_mean
## <fct> <dbl>
## 1 1st 39.7
## 2 2nd 28.3
## 3 3rd 24.5
As input, we give the name of the grouped table created by group_by() and the function we want to apply to each group. In this case we used mean(age, na.rm=TRUE). “group_mean” is a name we give to that summary variable (it could have been any name we wanted). The output looks like a table and includes the names of the groups being summarized. (A “tibble” is not how New Zealanders spell “table”, but is a type of table like a data frame.) “3 x 2” here refers to the number of rows x columns in the “tibble” output.
We can give summarise() many summary functions at once, and it will create columns in the output table for each one. For example, if we want to output both the mean and the standard deviation, we can add sd = sd(age, na.rm=TRUE) to the function above.
summarise(titanic_by_passenger_class, group_mean = mean(age, na.rm=TRUE), group_sd = sd(age, na.rm=TRUE))
## # A tibble: 3 × 3
## passenger_class group_mean group_sd
## <fct> <dbl> <dbl>
## 1 1st 39.7 14.9
## 2 2nd 28.3 13.0
## 3 3rd 24.5 11.3
Note that the standard deviations are very similar, which means that these data fit the equal variance assumption of ANOVA.
Analysis of variance (or ANOVA) is not quite as simple in R as one might hope. Doing ANOVA takes at least two steps:
First, we fit the ANOVA model to the data using the function lm(). This step carries out a bunch of intermediate calculations.
Second, we use the results of first step to do the ANOVA calculations and place them in an ANOVA table using the function anova().
The function name lm() stands for “linear model”; this is actually a very powerful function that allows a variety of calculations. One-way ANOVA is a type of linear model.
The function lm() needs a formula and a data frame as arguments. The formula is a statement specifying the “model” that we are asking R to fit to the data. A model formula always takes the form of a response variable, followed by a tilde(~), and then at least one explanatory variable. In the case of a one-way ANOVA, this model statement will take the form
numerical_variable ~ categorical_variable
## numerical_variable ~ categorical_variable
For example, to compare differences in mean age among passenger classes on the Titanic, this formula is
age ~ passenger_class
## age ~ passenger_class
This formula tells R to “fit” a model in which the ages of passengers are grouped by the variable passenger_class.
The name of the data frame containing the variables stated in the formula is the second argument of lm(). Finally, to complete the lm() command, it is necessary to save the intermediate results by assigning them to a new object, which anova() can then use to make the ANOVA table. For example, here we assign the results of lm() to a new object named “titanicANOVA”:
titanicANOVA <- lm(age ~ passenger_class, data = titanicData)
The function anova() takes the results of lm() as input and returns an ANOVA table as output:
anova(titanicANOVA)
## Analysis of Variance Table
##
## Response: age
## Df Sum Sq Mean Sq F value Pr(>F)
## passenger_class 2 26690 13344.8 75.903 < 2.2e-16 ***
## Residuals 630 110764 175.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This table shows the results of a test of the null hypothesis that the mean ages are the same among the three groups. The P-value is very small, and so we reject the null hypothesis of no differences in mean age among the passenger classes; meaning THERE IS a significant difference in mean age among the passenger classes.
A single-factor ANOVA can tell us that at least one group has a different mean from another group, but it DOES NOT inform us which group means are different from which other group means. A Tukey-Kramer test lets us test the null hypothesis of no difference between the population means for all pairs of groups. The Tukey-Kramer test (also known as a Tukey Honest Significance Test, or Tukey HSD), is implemented in R in the function TukeyHSD().
We will use the results of an ANOVA done with lm() as above, that we stored in the variable titanicANOVA. To do a Tukey-Kramer test on these data, we need to first apply the function aov() to titanicANOVA, and then we need to apply the function TukeyHSD to the result. We can do this in a single command:
TukeyHSD(aov(titanicANOVA))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = titanicANOVA)
##
## $passenger_class
## diff lwr upr p adj
## 2nd-1st -11.367459 -14.345803 -8.389115 0.0000000
## 3rd-1st -15.148115 -18.192710 -12.103521 0.0000000
## 3rd-2nd -3.780656 -6.871463 -0.689849 0.0116695
The key part of this output is the table at the bottom. It estimates the difference between the means of groups (for example, the 2nd passenger class compared to the 1st passenger class) and calculates a 95% confidence interval for the difference between the corresponding population means. (“lwr” and “upr” correspond to the lower and upper bounds of that confidence interval for the difference in means.) Finally, it give the P-value from a test of the null hypothesis of no difference between the means (the column headed with “p adj”). In the case of the Titanic data, P is less than 0.05 in all pairs, and we therefore reject every null hypothesis. We conclude that the population mean ages of all passenger classes are significantly different from each other.
A Kruskal-Wallis test is a non-parametric analog of a one-way ANOVA (essentially, it’s an alternative to one-way ANOVA which again tells you whether or not if there is a significant difference between groups using medians instead of means, which ANOVA compares). It does not assume that the variable has a normal distribution. (Instead, it tests whether the variable has the same distribution with the same mean in each group.)
To run a Kruskal-Wallis test, use the R function kruskal.test(). The input for this function is the same as we used for lm() above. It includes a model formula statement and the name of the data frame to be used.
kruskal.test(age ~ passenger_class, data = titanicData)
##
## Kruskal-Wallis rank sum test
##
## data: age by passenger_class
## Kruskal-Wallis chi-squared = 116.08, df = 2, p-value < 2.2e-16
You can see for the output that a Kruskal-Wallis test also strongly rejects the null hypothesis of equality of age for all passenger class groups with the Titanic data.