Chi-square analysis

A chi-square analysis is used when our data are in the form of raw counts for two or more categorical groups eg yellow peas or green pea seeds, survival rate of mice if they took drug A or took drug B, etc. Each independent observation must definitely belong to either one group or the other, and there must be no replicates. That is, for each category we just have one count.

What we do is compare the counts we got to some expected value according either to chance or to some prior theory.

For example:

If we were tossing a fair coin 1000 times we would expect 500 heads and 500 tails, ie head and tails in the proportion 1:1.
If we threw a fair dice a large number of times we would expect each possible score, from 1 to 6, to occur the same number of times. ie each score would occur 1/6th of the time.
If the basic idea of Mendelian inheritance is correct, then we would expect that if we crossed two pea plants that were heterozygous for yellow and green seed colour, with yellow being dominant, then the offspring would have yellow:green seeds in the ratio 3:1 (YY, YG, GY would all be yellow, GG would be green).

In a chi-square test, our null hypothesis is that the ideas/theory behind us getting the expected proportions are correct. We give it these expected proportions and also the numbers we actually got. It then gives us a p-value, a probability, for how likely it is that we would get those values if that null hypothesis were correct. If this p-value is too small, and usually, by that we mean less than 0.05, then we reject the null hypothesis.

Example

Suppose we have crossed two pea plants that were heterozygous for yellow/green seed colour. We get 176 offspring , of which 130 were yellow and 46 were green.

The data here are raw counts, and an individual pea plant offspring contributes either to the yellow count or the green count, but not both.

Our expected counts of yellow and green are found by simply dividing the total count, 76, in the ratio 3:1, giving us an expected 132 yellow plants and an expected 44 green plants.

Doing the chi-Square test in R

What we do in R is use the chisq.test() command to see how likely it is we would have got counts of 130 and 46 if the null hypothesis, with its expected counts in the ratio 3:1, were true.

We do it like this:

chisq.test(c(130,46),p=c(0.75,0.25))

There are two arguments. The first is the counts we got, which we enter as a ‘vector’ c(z,y,....), so we write c(130,46). The second is a vector of the proportions we expect for the two counts, where these proportions should add up to one. So for our expected 3:1 ratio we enter c(0.75,0.25).

Let’s do it: type the above command into the console window (bottom left). You will get an output something like this:

## 
##  Chi-squared test for given probabilities
## 
## data:  c(130, 46)
## X-squared = 0.12121, df = 1, p-value = 0.7277

This output is typical of tests done in R. We get the ‘test statistic’ whose name varies depending on the test. Here it is called X-squared, pronounced chi-squared. This is a number that the test calculates, based on the data you have given it. For the most part, we don’t need to worry about how it does that. Then there is the p-value, which is the probability of getting this test statistic if the null hypothesis were true.

In this case, we see that the p-value is 0.73, which is large. We could very plausibly have got yellow:green numbers of 130 and 46 if the null hypothesis were true, so we cannot reject that null hypothesis. In other words, our data are consistent with the predictions of simple Mendelian inheritance.

Reporting the result in English

In English, we might report this result as:

We found counts of 130 yellow plants and 46 green plant, which are consistent with the predictions of Mendelian inheritance (chi-squared test, X-squared = 0.12, p=0.73).

Note that we do not say we have proved Mendelian inheritance to be correct. We haven’t. We never prove things in science. We haven’t said anything about the truth of the null hypothesis. All we can say is whether our data are or are not consistent with the null hypothesis. In this case they are. We then report the test we used and the values of the test statistic and p-value. Other tests might give you other details to report too.

Exercise.

Suppose you tossed a fair coin 100 times and got 45 heads and 55 tails.
Under a null hypothesis that the coin is fair, what would the expected proportions of heads and tails be?
Use R to do a chi-square test of that null hypothesis.
What do you conclude?
How would you report your result?

(Answer at the bottom ,but don’t peek until you have thought about and tried this_)

Chi-Square test with real data.

Now we will do a chi-square test on some real data that we read into R.

Before we start, have you set up somewhere on your machine a project folder called RStuff, and in there created two sub-folders, one called scripts and the other called data? If not, do that now.

You need to make sure that your RStuff folder is recognised by R as a Project. The reasons for this may seem obscure right now, but it will pay dividends later.

Gather the data.

Normally, you would head off into the field and return weeks later clutching your hard-won data. Today we will just shake some dice. Shake a die 180 times (no, really) and enter the score in one long column in a spreadsheet. Call the column score. (Top tip: never use spaces in a column heading. R really does not like that. I also avoid using any capitals, ever, anywhere. It saves me lots of headaches later.) Save this spreadsheet as a .csv file in the folder RStuff/data.

Start a new script

Once we move beyond doing one-liners and have several steps to what we want to do, we normally save those steps in a script.

Using File/New File/R Script, open a blank script. This will appear in your script window (top-left).

Save this script into your RStuff/scripts folder. Called it dice.

You should find that you now have a file in your scripts folder called dice.R.

Be a good scientist and add comments to your script.

It is good practice to annotate R scripts. We denote a line of code as a comment by starting it with a #. A good way to begin your script is with something like this:

# My name
# The date
# What this script is for

Load packages

Like many other applications, including QGIS, the R that you download can be enhanced with extra features. In R these are called packages. In QGIS they are called plugins. It is normal to use a few packages.

We will use two. You will probably use these two very often. They are called tidyverse and here. tidyverse is just a big goody bag of other packages that are all very useful. So useful that I always load tidyverse into any script, whatever that script is for. hereis a little thing, but a handy thing. It makes the business of telling R where to go to find files or where to put them much less techy and clunky than it would otherwise be. I now always use it in my scripts.

If you have never used a particular package on your machine before, you need to install it. You do this only once, so don’t put any command for doing this in your script. Otherwise you will be trying to reinstall the package every time you ran the script. That would be bad. It would be like trying to download an app from your favourite app store every time you wanted to use it. Why would you do that? So, if you have done this before, skip the next step. If you haven’t, then in the console window, (bottom left), type

install.packages("tidyverse")

Stuff will happen, and eventually the console window will return to the command prompt >. If so, good. Now do the same for the here package:

install.packages("here")

All good? Now we can make these packages available to our script. We do this by using the library() command. In your script, type:

library(tidyverse)
library(here)

Read in the data

Now add these lines to your script:

filepath<-here("data","dice.csv")
my_data<-read_csv(filepath)

The first line uses the command here from the (you gueesed it) here package to tell R where to find the data file.

The second line reads that data into an object called my_data. All the things that R stores in its head during a session can be thought of as objects. There are different kinds of object and you will come across them in time. Do not be afraid. This particular object is what R calls a data frame. You will use them often. Think of them as being like a sheet in a spreadsheet, with columns of numbers, text and so on.

We should have one column with 180 rows. To see if that is the case, we can use

glimpse(my_data)

What does that tell you? Do you see what you expect? If not, sort it out.

Tabulate our scores

We now need to make a table that shows how many times we scored 1, 2, 3 and so on. We’d like to store this table as a new object called dice_table. This is easily done using the table() command.

dice_table<-table(my_data)

Now we can run the chi-square test to see if the dice used were fair. If they were, then we would expect each of the scores to appear about 1/6 of the time. So we can use the chisq.test command like this:

chisq.test(dice_table, p = c(1/6,1/6,1/6,1/6,1/6,1/6))

In fact, since the expected proportions are all the same, we can leave out the p=... argument, since uniform proportions are the default for this argument and that is what we have. Most functions in R have one or two arguments which are mandatory, and many others which are optional and are normally left out, in which case they take on their default values. We can do that here with the dice p=... argument since we expect each of the scores to occur with equal likelihood, if the dice is fair.

So we could write, more simply.

chisq.test(dice_table)

Output of the test

If we ran that command we might see something like this:

## 
##  Chi-squared test for given probabilities
## 
## data:  dice_table
## X-squared = 3.4667, df = 5, p-value = 0.6284

Conclusion

What do we conclude about the dice, and how would we report this result?

The whole script

# my name
# the date
# Chi-square test on tossing a die

# load packages
library(tidyverse)
library(here)

# load data
filepath<-here("data","dice.csv")
my_data<- read_csv(filepath)

# check data
glimpse(data)

# tabulate data
dice_table<-table(my_data)

# test data
chisq.test(dice_table)

Answer to Exercise.

Suppose you tossed a fair coin 100 times and got 45 heads and 55 tails.
Under a null hypothesis that the coin is fair, what would the expected proportions of heads and tails be? 0.5 and 0.5

Use R to do a chi-square test of that null hypothesis.

chisq.test(c(45,55),p=c(0.5,0.5)) # we could leave out the second argument here

## 
##  Chi-squared test for given probabilities
## 
## data:  c(45, 55)
## X-squared = 1, df = 1, p-value = 0.3173

What do you conclude?
The p value is much greater than 0.5 so we cannot reject the null hypothesis.

How would you report your result?
We find no evidence that the coin is not fair (chi-square, X-squared = 1, p=0.32).