General Instructions

Each lab in this course will have multiple components. First, there will a piece like the document below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in the Friday that you did your lab.

Your lab write-up should be typed in RMarkdown. All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.

Lab Overview

In this lab, we will conduct analyze the impact that bin choice has on the chi square goodness of fit test. We will do this in the context of the goodness of fit example from the Week 11 discussion problems, number 3. We will then apply the goodness of fit test to other examples of fitting distributions, using built-in R functions.

Analysis of Bin Choice

In this part of the lab, we will work with problem 3 from the Week 11 in-class problem sets. You may find it helpful to review that problem before continuing.

The raw data set can be stored in R as a vector, and the sample size computed.

raw_data <- c(7, 4, 3, 6, 4, 4, 5, 3, 5, 3,
              5, 5, 3, 2, 5, 4, 3, 3, 7, 6,
              6, 4, 3, 11, 9, 6, 7, 4, 5, 4,
              7, 3, 2, 8, 7, 6, 4, 1, 9, 8,
              4, 8, 9, 3, 9, 7, 7, 9, 3, 10)
n <- length(raw_data)

In the problem we discussed in class, we broke the raw data into the following categories \[A_1=\{0, 1, 2, 3\}, \,\, A_2=\{4\}, \,\, A_3=\{5\}, \,\, A_4=\{6\}, \,\, A_5=\{7\}, \,\, A_6=\{8, 9, \ldots\}\] We’ll code this into R as a list of vectors. Note that we only use the values 1 through 11 since those are the only values in the data set. We can deal with the missing values in \(A_1\) and \(A_6\) later. It will also be useful to have the number of total categories stored as a variable.

cats <- list(c(1,2,3), c(4), c(5), c(6), c(7), c(8,9,10,11))
n_cats <- length(cats)

Instead of this raw data, it will be important to have the frequencies of each value in the data set. The function below will compute these frequencies for us.

count_freq <- function(data, cat) {
  n <- length(cat)
  count <- rep(0, length = n)
  for (i in 1:n) {
    count[i] <- sum(data %in% cat[[i]])
  }
  return(count)
}

As part of the process of carrying out this chi-square goodness of fit test, we need to fit a Poisson distribution to the data and compute the null probabilities of observing each category. Note that we use the MLE \(\overline{Y}\) to estimate the population parameter \(\lambda\) for the Poisson distribution.

lambda <- mean(raw_data)
lambda
## [1] 5.4

The function below computes the Poisson probabilities for each category we defined and outputs the probabilities as a vector.

pois_p <- function(cat, lambda) {
  n <- length(cat)
  probs <- rep(0, length = n)
  for (i in 1:n) {
    x <- cat[[i]]
    prob <- sum(dpois(x, lambda))
    if (1 %in% x) {
      prob <- prob + dpois(0, lambda)
    }
    if (11 %in% x) {
      prob <- prob + ppois(11, lambda, lower.tail = FALSE)
    }
    probs[i] <- prob
  }
  return(probs)
}

We are ready to make a data frame of the counts of each category, the Poisson probabilities, and the expected number of observations in each category.

fit <- data.frame(category = 1:n_cats) %>%
  mutate(count = count_freq(raw_data, cats)) %>%
  mutate(pois_probs = pois_p(cats, lambda)) %>%
  mutate(pois_exp = n * pois_probs)

Viewing this table allows us to compare to our work from the in-class discussion.

category count pois_probs pois_exp
1 13 0.2132910 10.664551
2 9 0.1600198 8.000988
3 6 0.1728213 8.641067
4 5 0.1555392 7.776960
5 7 0.1199874 5.999369
6 10 0.1783413 8.917066

Now, we can compute our test statistic \(X^2\) and our chi-square score.

xsq <- sum((fit$count - fit$pois_exp)^2 / fit$pois_exp)
xsq
## [1] 2.733396
chisqscore <- qchisq(0.05, df = n_cats - 2, lower.tail = FALSE)
chisqscore
## [1] 9.487729

Just as we observed in class, since the value of \(X^2\) is smaller than \(\chi^2_{0.05}(4)\), we do not reject the null hypothesis of the data coming from a Poisson distribution.

  1. In your R script, carry out a chi-square goodness of fit test for at least four different choices of categories.

  2. (\(\star\)) In your write-up, show the categories you chose along with why you chose them. For each, discuss the result of the chi-square test. What observations can you make after comparing these different results?

  3. (\(\star\)) In your R script, compute the \(p\)-values of the tests you conducted in problem 2. Report and interpret your values in your write-up.

  4. (\(\star\)) There is a built-in R function, chisq.test, that performs a chi-square test. Read the documentation for that function and explain why it cannot be used for our test here (hence why we did everything by hand).

Contingency Tables

We have also applied chi-square tests to determine whether or not two categorical variables are independent in a contingency table. In this part of the lab, we’ll walk through an example of how this method works and then apply it to a new data set.

In the example below, we will look at a data set from OpenIntro Statistics that contains the results of a 2012 SurveyUSA poll of opinions on the DREAM act, broken down by political ideology. We can see the structure of the data set below.

ideology stance
Conservative Yes
Conservative Yes
Conservative Yes
Conservative Yes
Conservative Yes
Conservative Yes

Since this data set is not formatted as a contingency table, we use the R function table to tally all combinations of the two factor variables.

dream_ct <- table(dream)
No Not sure Yes
Conservative 151 35 186
Liberal 52 9 114
Moderate 161 28 174

Finally, we can use the built-in R function chisq.test to compute the chi-square test for independence of the two variables. Read the documentation of this function to see what the argument correct does.

chisq.test(dream_ct, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  dream_ct
## X-squared = 16.387, df = 4, p-value = 0.002541

You will now run the same test with a different data set, still from OpenIntro Statistics. This time, the data set is data from aircraft-wildlife collisions between 1990 and 1997, collected by the FAA. Specifically, we will look at the variables effect and birds_struck to see if the effect (e.g., engine shutoff) is independent of the number of birds struck.

  1. (\(\star\)) In your R script, conduct a chi-square test for independence of the effect and birds_struck variables. Report your results in your write-up.

  2. (\(\star\)) Interpret your results from the test you conducted in problem 7. In particular, discuss the structure of your contingency table in the context of the result of your hypothesis test.