BI412L Lab 6: Frequency Data

Confidence Interval (CI) for a Proportion

Last lab, we have found the CI for a mean of a continous variable. In this lab, we will calculate the CI for a proportion. The main difference between the two is the variable type that were finding the CI for: CI for mean is used with a continuous variable and CI for a proportion (if you might have guessed) is for categorical variables. To calculate a confidence interval (CI) for an estimate of a proportion (of a categorical variable), we suggest using the Agresti-Coull method. This method is available in the R package “binom”, which you can install and load with the following:

#install.packages("binom", dependencies = TRUE)

*Note: I have put the # (hashtag) in front of the above code because I already have the package installed and put it for the purpose of producing the html file for you to view the notes. DO NOT incude the # when running the code. Once installed, load the package into R to use the package:

library(binom)

Calculating a CI for a proportion is fairly straightforward. We will use the function binom.confint(x, n, method) where x: the number of successes, n: the total sample size, and method: “ac” for Agresti-Coull.

Let’s say we have 87 data points and 30 of them are “successses”, we can find the 95% confidence interval for the proportion of success with the following command:

binom.confint(x = 30, n = 87, method = "ac")

##          method  x  n      mean     lower     upper
## 1 agresti-coull 30 87 0.3448276 0.2532164 0.4495625

This tells us that the observed proportion (called “mean” here) is 0.3448, and the lower and upper bounds of the 95% confidence interval are 0.2532 and 0.4496.

Binomial Tests

Doing a binomial test in R is much easier than doing one by hand. To conduct a binomial test, we use the binom.test(x, n, p) function where x: the number of “successes” observed in the data, n: the total snumebr of data points, and p: the proportion given by the null hypothesis.

For example, if we have 18 toads that have been measured for a left-right preference and 14 are right-handed, we can test the null hypothesis of equal probabilities of left- and right-handedness with a binomial test. In this case, x = 14, n = 18, and p = 0.5. This is how we would apply the binomial test for this case in R:

binom.test(x = 14, n = 18, p = 0.5)

## 
##  Exact binomial test
## 
## data:  14 and 18
## number of successes = 14, number of trials = 18, p-value = 0.03088
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.5236272 0.9359080
## sample estimates:
## probability of success 
##              0.7777778

Summary of Testing Method and Interpretation of Results:

Null Hypothesis: The null hypothesis is that there is an equal probability (50%) of toads being either left- or right-handed (i.e., p=0.5).
Test Results:
- P-value: The p-value is 0.03088, which is below the commonly used significance level of 0.05. This suggests that the observed proportion of right-handed toads (14 out of 18) is significantly different from what would be expected if left- and right-handedness were equally likely.
- Conclusion: With a p-value of 0.03088, we reject the null hypothesis at the 5% significance level. This means there is evidence to suggest that the toads do not have equal probabilities of being left- or right-handed and may have a preference for right-handedness.
Confidence Interval: The 95% confidence interval for the probability of right-handedness is (0.5236, 0.9359). This interval does not include 0.5, reinforcing the finding that the probability of right-handedness is likely higher than 50%. (Note:
Sample Estimate: The sample estimate of the probability of right-handedness is approximately 0.778 (14 out of 18), indicating that about 77.8% of the toads were right-handed in this sample.

In conclusion, the results indicate a statistically significant tendency for right-handedness among the toads, with a confidence interval that suggests the true probability of right-handedness is likely above 50%.

Goodness of Fit Test

A χχ2 goodness-of-fit test compares the frequencies of values of a categorical variable to the frequencies predicted by a null hypothesis. For example, the file “MandMlist.csv” contains a list of all the colors of M&M candies from one package, under the variable “color”. Let’s ask whether this fits the proportions of colors advertised by the candy company.

First lets read in the dataset (we use the read.csv() function):

MMlist <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\MandMlist.csv", stringsAsFactors = TRUE)

MMlist$color contains the color of each of 55 M&Ms.

Summarizing Frequency Data in a Table

Summarizing frequency data in a table is useful to see the data more concisely, and such a table is also necessary as input to the χχ2 test function. We can summarize the counts (frequencies) of a categorical variable with table() function:

MMtable <- table(MMlist$color)

MMtable

## 
##   blue  brown  green orange    red yellow 
##     10      5     12     11      6     11

This shows that in this list of M&M colors, 10 were blue, 5 were brown, 12 were green, 11 were orange, 6 were red, and 11 were yellow.

χχ2 Goodness-of-Fit Test

We can use a χχ² goodness-of-fit test to compare the frequencies in a set of data to a null hypothesis that specifies the probabilities of each category. In R, this can be calculated using the function chisq.test().

Lets say the company says that the percentages are 24% blue, 14% brown, 16% green, 20% orange, 13% red and 13% yellow. This is our null hypothesis. Let’s test whether these proportions are consistent with the frequencies of the colors in this bag. R wants the proportions expected by the null hypothesis in a vector, like this:

expected_proportions <- c(0.24, 0.14, 0.16, 0.20, 0.13, 0.13)

The first thing we need to do is check whether the expected frequencies (expected by the null hypothesis) are large enough to justify using a χχ2 goodness of fit test. (In other words, that no more than 25% of the expected frequencies are less than 5 and none is less than 1.)

To obtain these expected counts, multiply the vector of the expected proportions by the total sample size. We can find the total sample size by taking the sum() of the frequency table:

sum(MMtable)

## [1] 55

So for this example, the expected frequencies from our null hypothesis are 55 times the list of expected proportions:

55 * expected_proportions

## [1] 13.20  7.70  8.80 11.00  7.15  7.15

Notice that all of these expected values are greater than 5, so we have no problem with the assumptions of the χχ2 goodness of fit test. (If there were a problem here, we’d have to combine categories. We’ll see an example of that in the next section.)

The function chisq.test() requires two arguments in its input: a frequency table from the data and a vector with the expected proportions from the null hypothesis. The name of the vector of expected proportions in this function is p. So here is the format of the R command to do a χχ2 test:

chisq.test(MMtable, p = expected_proportions)

## 
##  Chi-squared test for given probabilities
## 
## data:  MMtable
## X-squared = 5.1442, df = 5, p-value = 0.3985

“X–squared” here in the output means χχ2 (Chi-square). The χχ2 value for this test is 5.1442. There are 5 (6-1 = 5) degrees of freedom (df) for this test, and the P-value turned out to be 0.3985.

Summary of Testing Methond and Interpretation of Results:

Null Hypothesis: The null hypothesis is that the distribution of colors in this sample of M&Ms matches the company’s specified color proportions: 24% blue, 14% brown, 16% green, 20% orange, 13% red, and 13% yellow.
Observed vs. Expected Frequencies:
- Observed Frequencies: From the sample, there were 10 blue, 5 brown, 12 green, 11 orange, 6 red, and 11 yellow M&Ms.
- Expected Frequencies: Based on the company’s proportions and a total sample size of 55, the expected counts are:
  - Blue: $55 \times 0.24 = 13.2$
  - Brown: $55 \times 0.14 = 7.7$
  - Green: $55 \times 0.16 = 8.8$
  - Orange: $55 \times 0.20 = 11.0$
  - Red: $55 \times 0.13 = 7.15$
  - Yellow: $55 \times 0.13 = 7.15$
Since all expected frequencies are greater than 5, the chi-square test is appropriate.
Chi-Square Test Results:
- Test Statistic: The calculated chi-square value is 5.1442.
- Degrees of Freedom: With 6 categories (colors) and no parameters estimated from the data, the degrees of freedom are 5.
- P-Value: The p-value is 0.3985.
Conclusion:
- Since the p-value (0.3985) is greater than the commonly used significance level of 0.05, we do not reject the null hypothesis.
- Interpretation: There is no statistically significant difference between the observed and expected color distributions. This suggests that the sample color distribution is consistent with the company’s specified proportions.

Fit to a Poisson Distribution

A χχ² goodness of fit test can be used to ask whether a frequency distribution of a variable is consistent with a Poisson distribution. This is a useful way to determine if events in space or time are not “random” but instead are clumped or dispersed.

It turns out that this is not a straightforward process in R. In this section, we’ll highlight a couple of functions in R that will streamline doing the calculations by hand. We’ll use the data on the numbers of extinctions per prehistoric time period described in Example 8.4 in the Whitlock and Schluter textbook. Let’s read this dataset in and create a summary table for the number of extinctions variable in this dataset:

extinctData <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\MassExtinctions.csv", stringsAsFactors = TRUE)

number_of_extinctions <- extinctData$numberOfExtinctions

table(number_of_extinctions)

## number_of_extinctions
##  1  2  3  4  5  6  7  8  9 10 11 14 16 20 
## 13 15 16  7 10  4  2  1  2  1  1  1  2  1

The first row lists each possible number of extinctions in a time period, ranging from 1 to 20. The second row shows the number of time periods that had each number of extinctions. (so there were 13 time periods with one extinction, 15 time periods with 2 extinctions, etc.)

The mean number of extinctions per unit of time is not specified by the null hypothesis, so we need to estimate it from the data. This mean turns out to be 4.21 extinctions per time period.

mean(number_of_extinctions)

## [1] 4.210526

We can use this mean to calculate the probability according to a Poisson distribution of getting each specific number of extinctions.

Fortunately, R has a function to calculate the probability of a given value under the Poisson distribution. This function is dpois(), which stands for “density of the Poisson”. dpois() requires two values for input, the mean (which is called lambda in this function) and x, the value of the outcome we are interested in (here, the number of extinctions). For example, the probability of getting x = 3 from a Poisson distribution with mean 4.21 can be found from the following:

dpois(x = 3, lambda = 4.21)

## [1] 0.1846355

The probability of getting exactly 3 successes from a Poisson distribution with mean 4.21 is about 18.5%.

dpois() will accept a vector of x’s as input, and return a vector of the probabilities of those x’s. A convenient short hand to know here is that R will create a vector with a range of values with a colon. For example, 0:20 is a vector with the integers from 0 to 20.

We’ll need the probabilities for all possible values of x from 0 to 20 (because this was the largest value observed in the data). We can create a vector with the Poisson probabilities of each possibility from 0 to 20 like this:

expected_probability <- dpois(x = 0:20, lambda = 4.21)

expected_probability

##  [1] 1.484637e-02 6.250321e-02 1.315693e-01 1.846355e-01 1.943289e-01
##  [6] 1.636249e-01 1.148102e-01 6.905011e-02 3.633762e-02 1.699793e-02
## [11] 7.156129e-03 2.738846e-03 9.608784e-04 3.111768e-04 9.357530e-05
## [16] 2.626347e-05 6.910575e-06 1.711384e-06 4.002736e-07 8.869220e-08
## [21] 1.866971e-08

The odd notation here is a shorthand for scientific notation. 1.48e-02 means 1.48 x 10^-2. The e stands for exponent.

Remember that we asked R to output the probability of values from 0 to 20. So the first value in this output corresponds to the probability of getting a zero, the second to the probability of getting a 1, etc.

In order to convert these probabilities into expected values for a χχ² test, we need to multiply them by the total sample size.

length(number_of_extinctions) * expected_probability

##  [1] 1.128324e+00 4.750244e+00 9.999264e+00 1.403230e+01 1.476900e+01
##  [6] 1.243549e+01 8.725572e+00 5.247808e+00 2.761659e+00 1.291843e+00
## [11] 5.438658e-01 2.081523e-01 7.302676e-02 2.364943e-02 7.111723e-03
## [16] 1.996023e-03 5.252037e-04 1.300651e-04 3.042079e-05 6.740607e-06
## [21] 1.418898e-06

Most of these values would cause problems for the χχ2 test because they are too small (less than 1). We need to combine categories in some reasonable way so that the expected frequencies match the assumptions of the χχ2 test. Let’s combine the categories for 0 and 1 successes, and combine everything greater than or equal to 8 into one category. It is possible to do this all in R, but you can also do it by hand, using the expected frequencies we just calculated.

Here is a table of the expected frequencies for these new combined categories, for both the observed and expected, summing over all the possibilities within each of these groupings. (E.g., the expected frequency for the category “0 or 1” is the sum of 1.128324 and 4.750244.)

Note that these expected values are large enough to match the conditions of the χχ² test.

Make vectors (using c()) for the observed and expected values for these new groups:

expected_combined <- c(5.878568, 9.999264, 14.032300, 14.768996, 12.435494,  8.725572, 5.247808, 4.911998)

observed_combined <- c(13, 15, 16, 7, 10, 4, 2, 9)

From these we can calculate the χχ² test statistic using chisq.test()$statistic.

We give it first the list of observed values, then the expected values as p. Because we are giving the list of expected frequencies rather the expected probabilities, we need to give it the option rescale.p = TRUE. Finally, by adding $statistic at the end, R will only give us the χχ² value as output. (We don’t yet want the full results of the test because R does not know to correct for the reduction in degrees of freedom caused by estimating the mean.)

chisq.test(observed_combined, p = expected_combined, rescale.p = TRUE)$statistic

## Warning in chisq.test(observed_combined, p = expected_combined, rescale.p =
## TRUE): Chi-squared approximation may be incorrect

## X-squared 
##  23.93919

This correctly calculates our χχ² value as 23.939. The warning message in red is because one of the expected values is under 5, but because this occurred in less than 20% of the categories, it is fine to proceed.

The degrees of freedom for this test are the number of categories minus the number of parameters estimated from the data minus one. We ended up using 8 categories and estimated one parameter from the data, so we have df = 8 – 1 – 1(est. one parameter) = 6.

To calculate the correct P-value, we need to know probability of getting a χχ2 value greater than 23.939 from a distribution that has 6 degrees of freedom. R has a function we can use for this, called pchisq(). We are interested here in the probability above the observed χχ2 value, so we want the probability in the right (or upper) tail. To use this function, we need to specify three values: our observed χχ2 (confusingly called q in this function), the degrees of freedom (called df, thank goodness), and an option that tells us to use the upper tail (lower.tail = FALSE).

pchisq(q = 23.939, df = 6, lower.tail = FALSE)

## [1] 0.0005359236

Thus, our P-value for this test of the fit of these data to a Poisson distribution is approximately 0.0005. We reject the null hypothesis that these extinction data follow a Poisson distribution.