Each lab in this course will have multiple components. First, there will a piece like the document below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in the Friday that you did your lab.
Your lab write-up should be typed in RMarkdown. All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.
In this lab, we will conduct analyze the impact that bin choice has on the chi square goodness of fit test. We will do this in the context of the goodness of fit example from the Week 11 discussion problems, number 3. We will then apply the goodness of fit test to other examples of fitting distributions, using built-in R functions.
In this part of the lab, we will work with problem 3 from the Week 11 in-class problem sets. You may find it helpful to review that problem before continuing.
The raw data set can be stored in R as a vector, and the sample size computed.
raw_data <- c(7, 4, 3, 6, 4, 4, 5, 3, 5, 3,
5, 5, 3, 2, 5, 4, 3, 3, 7, 6,
6, 4, 3, 11, 9, 6, 7, 4, 5, 4,
7, 3, 2, 8, 7, 6, 4, 1, 9, 8,
4, 8, 9, 3, 9, 7, 7, 9, 3, 10)
n <- length(raw_data)
In the problem we discussed in class, we broke the raw data into the following categories \[A_1=\{0, 1, 2, 3\}, \,\, A_2=\{4\}, \,\, A_3=\{5\}, \,\, A_4=\{6\}, \,\, A_5=\{7\}, \,\, A_6=\{8, 9, \ldots\}\] We’ll code this into R as a list of vectors. Note that we only use the values 1 through 11 since those are the only values in the data set. We can deal with the missing values in \(A_1\) and \(A_6\) later. It will also be useful to have the number of total categories stored as a variable.
cats <- list(c(1,2,3), c(4), c(5), c(6), c(7), c(8,9,10,11))
n_cats <- length(cats)
Instead of this raw data, it will be important to have the frequencies of each value in the data set. The function below will compute these frequencies for us.
count_freq <- function(data, cat) {
n <- length(cat)
count <- rep(0, length = n)
for (i in 1:n) {
count[i] <- sum(data %in% cat[[i]])
}
return(count)
}
As part of the process of carrying out this chi-square goodness of fit test, we need to fit a Poisson distribution to the data and compute the null probabilities of observing each category. Note that we use the MLE \(\overline{Y}\) to estimate the population parameter \(\lambda\) for the Poisson distribution.
lambda <- mean(raw_data)
lambda
## [1] 5.4
The function below computes the Poisson probabilities for each category we defined and outputs the probabilities as a vector.
pois_p <- function(cat, lambda) {
n <- length(cat)
probs <- rep(0, length = n)
for (i in 1:n) {
x <- cat[[i]]
prob <- sum(dpois(x, lambda))
if (1 %in% x) {
prob <- prob + dpois(0, lambda)
}
if (11 %in% x) {
prob <- prob + ppois(11, lambda, lower.tail = FALSE)
}
probs[i] <- prob
}
return(probs)
}
We are ready to make a data frame of the counts of each category, the Poisson probabilities, and the expected number of observations in each category.
fit <- data.frame(category = 1:n_cats) %>%
mutate(count = count_freq(raw_data, cats)) %>%
mutate(pois_probs = pois_p(cats, lambda)) %>%
mutate(pois_exp = n * pois_probs)
Viewing this table allows us to compare to our work from the in-class discussion.
| category | count | pois_probs | pois_exp |
|---|---|---|---|
| 1 | 13 | 0.2132910 | 10.664551 |
| 2 | 9 | 0.1600198 | 8.000988 |
| 3 | 6 | 0.1728213 | 8.641067 |
| 4 | 5 | 0.1555392 | 7.776960 |
| 5 | 7 | 0.1199874 | 5.999369 |
| 6 | 10 | 0.1783413 | 8.917066 |
Now, we can compute our test statistic \(X^2\) and our chi-square score.
xsq <- sum((fit$count - fit$pois_exp)^2 / fit$pois_exp)
xsq
## [1] 2.733396
chisqscore <- qchisq(0.05, df = n_cats - 2, lower.tail = FALSE)
chisqscore
## [1] 9.487729
Just as we observed in class, since the value of \(X^2\) is smaller than \(\chi^2_{0.05}(4)\), we do not reject the null hypothesis of the data coming from a Poisson distribution.
In your R script, carry out a chi-square goodness of fit test for at least four different choices of categories.
(\(\star\)) In your write-up, show the categories you chose along with why you chose them. For each, discuss the result of the chi-square test. What observations can you make after comparing these different results?
(\(\star\)) In your R script, compute the \(p\)-values of the tests you conducted in problem 2. Report and interpret your values in your write-up.
(\(\star\)) There is a
built-in R function, chisq.test, that performs a chi-square
test. Read the documentation for that function and explain why it
cannot be used for our test here (hence why we did
everything by hand).
We have also applied chi-square tests to determine whether or not two categorical variables are independent in a contingency table. In this part of the lab, we’ll walk through an example of how this method works and then apply it to a new data set.
In the example below, we will look at a data set from OpenIntro Statistics that contains the results of a 2012 SurveyUSA poll of opinions on the DREAM act, broken down by political ideology. We can see the structure of the data set below.
| ideology | stance |
|---|---|
| Conservative | Yes |
| Conservative | Yes |
| Conservative | Yes |
| Conservative | Yes |
| Conservative | Yes |
| Conservative | Yes |
Since this data set is not formatted as a contingency table, we use
the R function table to tally all combinations of the two
factor variables.
dream_ct <- table(dream)
| No | Not sure | Yes | |
|---|---|---|---|
| Conservative | 151 | 35 | 186 |
| Liberal | 52 | 9 | 114 |
| Moderate | 161 | 28 | 174 |
Finally, we can use the built-in R function chisq.test
to compute the chi-square test for independence of the two variables.
Read the documentation of this function to see what the argument
correct does.
chisq.test(dream_ct, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: dream_ct
## X-squared = 16.387, df = 4, p-value = 0.002541
You will now run the same test with a different data set, still from
OpenIntro
Statistics. This time, the data set is data from aircraft-wildlife
collisions between 1990 and 1997, collected by the FAA. Specifically, we
will look at the variables effect and
birds_struck to see if the effect (e.g., engine shutoff) is
independent of the number of birds struck.
(\(\star\)) In your R script,
conduct a chi-square test for independence of the effect
and birds_struck variables. Report your results in your
write-up.
(\(\star\)) Interpret your results from the test you conducted in problem 7. In particular, discuss the structure of your contingency table in the context of the result of your hypothesis test.