We first load the data on physical activity and fruit consumption of college students described in the textbook on p. 537 and following:
load(file="Colhealth.Rdata")
If we double-click on the data to view it, we can see that it is not in the form needed for a chi-squared test. The data must be a table like the one at the top of p. 537 in the textbook, so we first create the table we need for the test.
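We can also inspect the data from the console instead of double-clicking, for example:
head(Colhealth) # first few rows of the data frame: each row is one PhysAct/Fruit combination with its Count
str(Colhealth)  # variable names and types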
table1 <- matrix(Colhealth$Count, ncol=3, byrow=TRUE)
# The line above fills the table with the Count values from our dataset, telling R to arrange them in 3 columns (ncol=3) and to fill the matrix row by row (byrow=TRUE). The default is byrow=FALSE, which would fill the matrix column by column and give a different ordering (see the illustration after the table below).
table1 #to see what the table looks like.
## [,1] [,2] [,3]
## [1,] 69 206 294
## [2,] 25 126 170
## [3,] 14 111 169
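To see why byrow=TRUE matters, we can compare with the default filling order (for illustration only; we keep table1 as built above):
matrix(Colhealth$Count, ncol=3, byrow=FALSE) # fills column by column, so the first three counts (69, 206, 294) would run down the first column instead of across the first row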
colnames(table1) <- Colhealth$PhysAct[1:3] # This sets the column names, taking them from the PhysAct variable in our data. We only need 3 names, not 9 (the original dataset has 9 rows), so we take just the first three values of PhysAct by indicating PhysAct[1:3].
rownames(table1) <- Colhealth$Fruit[c(1,4,7)] # This sets the row names, taking them from the Fruit variable in our data. Fruit is ordered Low, Low, Low, and so on, so we do not want the first three values; instead we want the first, fourth and seventh values, c(1,4,7), as can be seen by looking at the data.
table1
## Low Moderate Vigorous
## Low 69 206 294
## Medium 25 126 170
## High 14 111 169
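As an aside, the same contingency table can be built in one step with xtabs(), which uses the variable values themselves as row and column labels (a sketch; the rows and columns may appear in a different order depending on the factor levels):
table1b <- xtabs(Count ~ Fruit + PhysAct, data=Colhealth)
table1b # same counts as table1, possibly with rows and columns ordered differently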
chisq.test(table1)
##
## Pearson's Chi-squared test
##
## data: table1
## X-squared = 14.152, df = 4, p-value = 0.006824
results <- chisq.test(table1)
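Storing the test result in the results object lets us pull out individual parts of the output, for example:
results$statistic # the X-squared value
results$p.value   # the p-value
results$observed  # the observed counts (our table1)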
Expected counts
#If we want to see the expected counts, we can type:
results$expected
## Low Moderate Vigorous
## Low 51.90203 212.8944 304.2035
## Medium 29.28041 120.1039 171.6157
## High 26.81757 110.0017 157.1807
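Each expected count is (row total * column total) / grand total, which we can verify directly:
outer(rowSums(table1), colSums(table1)) / sum(table1) # reproduces results$expected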
Conditional distribution
prop.table(table1,2) #To see the conditional distribution by column (second dimension of the table):
## Low Moderate Vigorous
## Low 0.6388889 0.4650113 0.4644550
## Medium 0.2314815 0.2844244 0.2685624
## High 0.1296296 0.2505643 0.2669826
#This yields the table from the top of p. 538.
#If we wanted the conditional distribution by row (first dimension of the table), we could type:
prop.table(table1,1)
## Low Moderate Vigorous
## Low 0.12126538 0.3620387 0.5166960
## Medium 0.07788162 0.3925234 0.5295950
## High 0.04761905 0.3775510 0.5748299
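Note that prop.table(table1, 2) makes each column sum to 1 (proportions conditional on physical activity), while prop.table(table1, 1) makes each row sum to 1 (proportions conditional on fruit consumption). We can check this:
colSums(prop.table(table1, 2)) # each column sums to 1
rowSums(prop.table(table1, 1)) # each row sums to 1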
Contribution to chi-squared
#We can also look at which cells in our table contribute the most to a high chi-squared by typing:
results$residuals
## Low Moderate Vigorous
## Low 2.3732991 -0.47251538 -0.5850178
## Medium -0.7910362 0.53800636 -0.1233345
## High -2.4751181 0.09518447 0.9427369
#This gives (observed - expected) / sqrt(expected) for each cell, known as the Pearson residual.
#A positive number means that we have more observations than expected in the cell, while a negative number means that we have fewer observations than expected.
#Since we are summing the squares of these residuals to arrive at the chi-squared statistic,
#it is the absolute value that matters for the contribution: both large negative and large positive residuals make large contributions to the chi-squared.
#In this specific example, the biggest contribution comes from the cell combining low physical activity with high fruit consumption (residual -2.48): among low physical activity students, far fewer report high fruit consumption than expected.
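We can make the contributions explicit by squaring the residuals; they sum to the chi-squared statistic:
results$residuals^2      # contribution of each cell to the chi-squared statistic
sum(results$residuals^2) # equals the X-squared value reported above (about 14.15)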
#Let’s create a smaller table from our big table that is just two by two:
smalltable1 <- table1[1:2,1:2]
smalltable1
## Low Moderate
## Low 69 206
## Medium 25 126
# Run the two-proportion test (from week 10) on this table:
prop.test(smalltable1)
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: smalltable1
## X-squared = 3.6474, df = 1, p-value = 0.05616
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.001857855 0.168834499
## sample estimates:
## prop 1 prop 2
## 0.2509091 0.1655629
# And now let’s run the chi-squared test on this smaller table:
chisq.test(smalltable1)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: smalltable1
## X-squared = 3.6474, df = 1, p-value = 0.05616
# Compare the results of the two tests: the X-squared statistic and the p-value are identical.
# For a 2x2 table, the chi-squared test (with Yates' continuity correction) is equivalent to the two-sample proportion test, which also applies a continuity correction by default.
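If we wanted the uncorrected versions, we could turn the continuity correction off in either function, and the two results would again match each other:
prop.test(smalltable1, correct=FALSE)
chisq.test(smalltable1, correct=FALSE)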
Testing against a pre-specified probability
# Let’s create a smaller table from our big table that is just one column for the low physical activity students:
smalltable2 <- table1[1:3] # takes the first three elements of the matrix; since R stores matrices column by column, these are the first column (low physical activity)
smalltable2
## [1] 69 25 14
# What are the proportions for the low physical activity students in terms of their fruit consumption?
prop.table(smalltable2)
## [1] 0.6388889 0.2314815 0.1296296
# Let’s test this against two possible probability distributions specified with p=:
chisq.test(smalltable2,p=c(0.2,0.3,0.5))
##
## Chi-squared test for given probabilities
##
## data: smalltable2
## X-squared = 135.34, df = 2, p-value < 2.2e-16
chisq.test(smalltable2,p=c(0.6,0.3,0.1))
##
## Chi-squared test for given probabilities
##
## data: smalltable2
## X-squared = 2.9105, df = 2, p-value = 0.2333
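The second chi-squared statistic can be reproduced by hand: with n = 108 low-activity students, the expected counts under p = (0.6, 0.3, 0.1) are 64.8, 32.4 and 10.8, and the statistic is the sum of (observed - expected)^2 / expected:
expected2 <- sum(smalltable2) * c(0.6, 0.3, 0.1) # expected counts: 64.8, 32.4, 10.8
sum((smalltable2 - expected2)^2 / expected2)     # about 2.91, matching the output above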
load(file="beer.Rdata")
beer$Catcal <- cut(beer$Calories, breaks=quantile(beer$Calories, probs=seq(0,1, by=1/3)), include.lowest=TRUE, labels = c("low","medium", "high"))
summary(beer$Catcal)
## low medium high
## 72 78 62
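To see where the calorie groups are split, we can print the quantiles used as break points (the group sizes are only roughly equal, which can happen when there are ties in Calories):
quantile(beer$Calories, probs=seq(0,1, by=1/3)) # the 0%, 33%, 67% and 100% points of Calories used as breaks by cut()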
table(beer$Type, beer$Catcal)
##
## low medium high
## Domestic 57 49 53
## Imported 15 29 9
chisq.test(beer$Type, beer$Catcal)
##
## Pearson's Chi-squared test
##
## data: beer$Type and beer$Catcal
## X-squared = 10.472, df = 2, p-value = 0.005321
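As before, we can store the result and inspect the expected counts and the residuals to see which cells drive the association (a sketch, using a new object name beerresults):
beerresults <- chisq.test(beer$Type, beer$Catcal)
beerresults$expected  # check that the expected counts are large enough (e.g. at least 5 in each cell)
beerresults$residuals # (observed - expected) / sqrt(expected) for each cell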