As we’ve done before in previous labs, we can read in a csv file of a data set and read categorical variables as a factor in order for us to “group” the responses in this categorical variable into categories. We use this method so R knows to get the counts/frequencies for each group in a categorical variable (i.e. when we run summaries for gender, we can see how many males and females there are separately. We do this by adding the stringsAsFactors = TRUE option to the read.csv() function.
titanicData <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\titanic.csv", stringsAsFactors = TRUE)
levels(titanicData$survive)
## [1] "no" "yes"
Here, we’ve asked R to list the factors/categories in our categorical variable survive in our titanic data set using the levels() function. Notice that “no” is listed first then “yes”.
When we “factor” a categorical variable, by default R will list the categories in alphabetical order (i.e. same example with gender, R will group female as a the first group and male as the second group, as done in alphabetical order). Sometimes, we might be interested in changing the order of these categories when we run other analyses like finding the odds ratio. We can tell R to change the order in which we desire the category order to be in by using the levels argument argument of the factor() command like so:
titanicData$survive <- factor(titanicData$survive, levels = c("yes", "no"))
Here, we’re telling R to list our categories in our categorical variable survive with yes first then no. Let’s run our overwritten categorical variable survive and see how it outputs now:
levels(titanicData$survive)
## [1] "yes" "no"
A contingency table is an effective method to see the association between two categorical variables. Moreover, other R functions we will use in this exercise require a contingency table as input.
A frequency table can be created using a function we saw in the last lab, called table(). To create a contingency table that shows the relationship between two categorical variables, we simply give table() the vectors that contain those two variables. Put the explanatory variable first, and then the response variable after a comma.
sex_survive_table <- table(titanicData$sex, titanicData$survive)
sex_survive_table
##
## yes no
## female 307 156
## male 142 708
Here, our explanatory variable is gender (sex) and our response variable is whether they survived or not (survive).
This shows us that in the Titanic data set, there are 156 female passengers who did not survive, 307 females who did survive, and similar information for the males.
It is useful to keep this frequency table as a named object, as we have done here (sex_survive_table). We shall use this table several times in later analyses.
Sometimes for contingency analysis we have already summarized the counts for each case. In these cases it is useful to be able to create a data table directly from these counts. The following syntax will allow you to create a data table directly:
sex_survive_table_direct <- data.frame(yes = c(307,142), no = c(156,708),
row.names = c("female","male"))
We learned in week 3 how to show associations between categorical variables like this using grouped bar graphs. The mosaic plot is another graphical technique for showing the association between two categorical variables. Here, each combination of the variables is represented by a rectangle, and the size of the rectangle is proportional to the number of individuals in that combination.
R has a function to calculate mosaic plots, with the sensible name mosaicplot(). In its most basic form, you just give it a frequency table as input.
mosaicplot(sex_survive_table_direct)
This shows the basic pattern. However, this plot can be greatly improved by adding a couple of extra options, to specify color and axes labels. We can add the option color = c(“darkred”, “gold”) to tell R which colors to use for the different response variables. This is a vector of color names that R assigns in the same order as the order of the categories of the variable plotted on the vertical axis (the response variable) starting with the topmost. (R has many named colors, including all the basics like “red”, “orange”, “blue”, etc.)
We would also like the axes to have labels. We can specify these with xlab and ylab as options. Let’s simply call the x-axis “Sex” and the y-axis “Survival”. Here’s what the command would now look like:
mosaicplot(sex_survive_table, color = c("darkred", "gold"), xlab ="Gender", ylab = "Survival", main = " Mosiac plot of gender and survival")
It is much easier now to see in the graph that the majority of females survived whereas the majority of males did not.
One of the ways to measure the strength of the association between two categorical variables is an odds ratio.
In R, the simplest way to estimate an odds ratio is to use the command fisher.test(). This function will also perform a Fisher’s exact test (which we will see more on that later). The input to this function is a contingency table like the one we calculated above. We’ll use the results in a couple of ways, so let’s save the results in an object. (Here we called it sex_survive_fisher.)
sex_survive_fisher <- fisher.test(sex_survive_table_direct)
The output of this function has several parts, two of which we’ll want to look at now for estimating an odds ratio. We can see the specific elements of the output by using the $ notation.
Add $estimate after the results of the fisher.test() function call to get the odds ratio estimate. For example, if we want to know the odds ratio for survival as a function of sex for the Titanic data, we write:
sex_survive_fisher$estimate
## odds ratio
## 9.792284
This shows that the odds ratio is about 10—the odds of a female surviving were about ten times the odds of a male surviving. This is (female survival / female death) / (male survival / male death). The order of the values in the odds ratios is determined by the order of the values of each variable; again, by default R uses alphabetical order.
This fisher.test() function also calculates the 95% confidence interval for the odds ratio, and assigns it to an output variable called conf.int. We can see the 95% confidence interval for the odds ratio with a command like:
sex_survive_fisher$conf.int
## [1] 7.471136 12.887759
## attr(,"conf.level")
## [1] 0.95
The confidence interval for this odds ratio ranges from about 7.47 to about 12.89.
A χχ2 contingency analysis allows us to test the null hypothesis that two categorical variables are independent of each other. As a reminder, the null hypothesis for a test for independence is that there is no association between the two categorical variables. When you p-val < 0.05, we reject the null.
Because this scenario appears frequently, it is one of the most common tests used in biology. However, remember that the χχ2 test is an approximation, and requires that all of the expected values are greater than 1 and that at least 80% are greater than 5. When doing such a test of independence on a computer, it is probably better to use Fisher’s exact test, which doesn’t have this restriction.
The χχ2 contingency test can be done with a function we saw last week, chisq.test(). If we give a frequency table as input, this function will calculate the χχ2 test for us.
Before we do the test, though, we need to make sure that the assumptions of the χχ2 test are met by our data. Fortunately, the chisq.test() function also provides a way for us to look at the expected values. If we give a frequency table as input, and then add $expected at the end of the function, it will show us the expected values for a test of independence, like this:
chisq.test(sex_survive_table_direct)$expected
## yes no
## female 158.3298 304.6702
## male 290.6702 559.3298
In this case all the expected values are greater than 5, so we have no problem meeting this assumption. Therefore, it is appropriate to do a χχ2 contingency test. Just give a frequency table as input to the chisq.test() function to do this test. We’ve added the option “correct = FALSE” to tell R to not do a Yate’s correction, which can be overly conservative.
chisq.test(sex_survive_table_direct, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: sex_survive_table_direct
## X-squared = 327.7, df = 1, p-value < 2.2e-16
This output shows that the χχ2 value for this test is 327.7, with 1 degree of freedom and a P-value less than 2.2 x 10-16. So we can reject the null hypothesis of no association between sex and survival on the Titanic.
Another, more exact, option for testing for the independence of two categorical variables is Fisher’s exact test. This is a test that is tedious to calculate by hand, but R can do it in a flash. This test makes no approximations and sets no minimum threshold for the sizes of the expected values (no assumptions to be met).
To implement Fisher’s exact test in R, use fisher.test(). This function is easy to use; just give it a frequency table as input.
fisher.test(sex_survive_table)
##
## Fisher's Exact Test for Count Data
##
## data: sex_survive_table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 7.471136 12.887759
## sample estimates:
## odds ratio
## 9.792284
Here there is little uncertainty about the presence of an association between sex and survival; the p-value is less that 2.2 x 10-16. (This strange value is the smallest number that can be recorded using the default numerical precision in R.)
Also given in the output here is information about the estimate of the odds ratio and its 95% confidence interval.
Unlike the version of Fisher’s test described in the text, fisher.test() is able to calculate contingency tests even when there are more than two possible values for each variable. In such cases, though, it cannot calculate the odds ratios. (i.e. if you have more categories than just “yes” and “no” or “male” and “female”).
We’ll do an experiment on ourselves. The point of the experiment needs to remain obscure until after the data is collected, so as to not bias the data collection process.
Please read through this paragraph, and circle every letter “t”. Please proceed at a normal reading speed. If you ever realize that you missed a “t” in a previous word, do not retrace your steps to encircle the “t”. You are not expected to get every “t”, so don’t slow down your reading to get the “t”s.
It is interesting to contemplate an entangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp earth, and to reflect that these elaborately constructed forms, so different from each other, and dependent on each other in so complex a manner, have all been produced by laws acting around us. These laws, taken in the largest sense, being Growth with Reproduction; inheritance which is almost implied by reproduction; Variability from the indirect and direct action of the external conditions of life, and from use and disuse; a Ratio of Increase so high as to lead to a Struggle for Life, and as a consequence to Natural Selection, entailing Divergence of Character and the Extinction of less-improved forms. Thus, from the war of nature, from famine and death, the most exalted object which we are capable of conceiving, namely, the production of the higher animals, directly follows. There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.
Question number 5 in your report will return to this exercise. Please don’t read Question 5 until you have completed this procedure.