Creating the table we need to perform the chi-squared test and manipulations

We first load the data on physical activity and fruit consumption of college students described in the textbook on p. 537 and following:

load(file="Colhealth.Rdata")

If we open the data to look at it, we can see that it is not in the form needed for a chi-squared test: it has one row per combination of categories, while the test needs a two-way table like the one at the top of p. 537 in the textbook. So, we first create the table we need for the test.

table1 <- matrix(Colhealth$Count, ncol=3, byrow=TRUE)
# The line above fills the table with the Count values from our dataset, arranged in 3 columns (ncol=3) and filled row by row (byrow=TRUE). The default, byrow=FALSE, would fill the table column by column and put the counts in the wrong cells.
table1  #to see what the table looks like.
##      [,1] [,2] [,3]
## [1,]   69  206  294
## [2,]   25  126  170
## [3,]   14  111  169
# The column names come from the PhysAct variable in our data. We only need 3 names, not 9 (the original dataset has 9 rows), so we take just the first three values with PhysAct[1:3].
colnames(table1) <- Colhealth$PhysAct[1:3]
# The row names come from the Fruit variable in our data. Fruit is ordered as Low, Low, Low, etc., so we don’t want the first three values. Instead, we want the first, fourth and seventh values, c(1,4,7), which we can see by looking at the data.
rownames(table1) <- Colhealth$Fruit[c(1,4,7)]
table1
##        Low Moderate Vigorous
## Low     69      206      294
## Medium  25      126      170
## High    14      111      169
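The same table can also be built in one step with xtabs(), which sets the row and column names automatically. The sketch below does not load Colhealth.Rdata; it uses a stand-in data frame, colhealth_demo, reconstructed from the counts printed above, so its column layout is an assumption about what the real data set looks like.

```r
# Hypothetical stand-in for the Colhealth data, rebuilt from the printed table.
fruit_levels   <- c("Low", "Medium", "High")
physact_levels <- c("Low", "Moderate", "Vigorous")

colhealth_demo <- data.frame(
  Fruit   = factor(rep(fruit_levels, each = 3), levels = fruit_levels),
  PhysAct = factor(rep(physact_levels, times = 3), levels = physact_levels),
  Count   = c(69, 206, 294, 25, 126, 170, 14, 111, 169)
)

# Cross-tabulate the counts: rows are Fruit, columns are PhysAct.
table_demo <- xtabs(Count ~ Fruit + PhysAct, data = colhealth_demo)
table_demo
```

Setting the factor levels explicitly keeps the rows and columns in Low, Medium, High order rather than alphabetical order.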

Chi-squared test

chisq.test(table1)
## 
##  Pearson's Chi-squared test
## 
## data:  table1
## X-squared = 14.152, df = 4, p-value = 0.006824
results <- chisq.test(table1)

Expected counts

#If we want to see the expected counts, we can type:
results$expected
##             Low Moderate Vigorous
## Low    51.90203 212.8944 304.2035
## Medium 29.28041 120.1039 171.6157
## High   26.81757 110.0017 157.1807
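To see where these numbers come from, each expected count is (row total × column total) / grand total. The following is a minimal sketch that rebuilds the table from the counts printed earlier (rather than loading the data file) and reproduces the expected counts by hand; expected_by_hand is a name introduced here for illustration.

```r
# Rebuild the observed table from the counts shown earlier.
table1 <- matrix(c(69, 206, 294,
                   25, 126, 170,
                   14, 111, 169),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(c("Low", "Medium", "High"),
                                 c("Low", "Moderate", "Vigorous")))

# Expected count in each cell = row total * column total / grand total.
expected_by_hand <- outer(rowSums(table1), colSums(table1)) / sum(table1)
expected_by_hand

# Compare with what chisq.test() reports.
all.equal(expected_by_hand, chisq.test(table1)$expected,
          check.attributes = FALSE)
```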

Conditional distribution

#To see the conditional distribution by column (second dimension of the table):
prop.table(table1,2)
##              Low  Moderate  Vigorous
## Low    0.6388889 0.4650113 0.4644550
## Medium 0.2314815 0.2844244 0.2685624
## High   0.1296296 0.2505643 0.2669826
#This yields the table from the top of p. 538.
#If we wanted the conditional distribution by row (first dimension of the table), we could type:
prop.table(table1,1)
##               Low  Moderate  Vigorous
## Low    0.12126538 0.3620387 0.5166960
## Medium 0.07788162 0.3925234 0.5295950
## High   0.04761905 0.3775510 0.5748299
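A conditional distribution by column is just each count divided by its column total. As a rough illustration, the sketch below does that division explicitly with sweep() on a copy of the table rebuilt from the printed counts; col_cond is a name chosen for this example.

```r
# Rebuild the observed table from the counts shown earlier.
table1 <- matrix(c(69, 206, 294,
                   25, 126, 170,
                   14, 111, 169),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(c("Low", "Medium", "High"),
                                 c("Low", "Moderate", "Vigorous")))

# Divide every column (margin 2) by its column total.
col_cond <- sweep(table1, 2, colSums(table1), "/")
col_cond

# Each column of a column-conditional distribution sums to 1.
colSums(col_cond)
```

Replacing the margin 2 and colSums() with 1 and rowSums() gives the conditional distribution by row.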

Contribution to chi-squared

#We can also look at which cells in our table contribute the most to a high chi-squared by typing:
results$residuals
##               Low    Moderate   Vigorous
## Low     2.3732991 -0.47251538 -0.5850178
## Medium -0.7910362  0.53800636 -0.1233345
## High   -2.4751181  0.09518447  0.9427369
#These are the Pearson residuals, (observed – expected) / sqrt(expected), one per cell.
#A positive number means that we have more observations than expected in the cell, while a negative number means that we have fewer observations than expected.
#Since we sum the squares of these residuals to arrive at the chi-squared statistic,
#it is the absolute value that matters: large negative and large positive residuals both make large contributions to the chi-squared.
#In this specific example, the biggest contribution comes from the cell with high fruit consumption and low physical activity (residual -2.48): among low physical activity students, far fewer have high fruit consumption than expected. The low fruit / low physical activity cell (2.37) is a close second.
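The link between the residuals and the test statistic can be checked directly. This is a minimal sketch, using a copy of the table rebuilt from the printed counts, that computes the Pearson residuals by hand, squares them to get each cell's contribution, and confirms that the contributions add up to the chi-squared statistic. The names pearson_resid and contributions are introduced here for illustration.

```r
# Rebuild the observed table from the counts shown earlier.
table1 <- matrix(c(69, 206, 294,
                   25, 126, 170,
                   14, 111, 169),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(c("Low", "Medium", "High"),
                                 c("Low", "Moderate", "Vigorous")))

results <- chisq.test(table1)

# Pearson residual for each cell: (observed - expected) / sqrt(expected).
pearson_resid <- (results$observed - results$expected) / sqrt(results$expected)

# Squared residuals are the per-cell contributions to the statistic.
contributions <- pearson_resid^2
round(contributions, 3)

# Their sum equals the chi-squared statistic reported by the test.
sum(contributions)
results$statistic

# Row and column position of the largest contribution.
which(contributions == max(contributions), arr.ind = TRUE)
```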

Showing that the prop test for proportions and the chi-squared test give the same result for a two by two table

#Let’s create a smaller table from our big table that is just two by two:
smalltable1 <- table1[1:2,1:2]
smalltable1
##        Low Moderate
## Low     69      206
## Medium  25      126
# Run the two-proportion test (from week 10) on this table:
prop.test(smalltable1)
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  smalltable1
## X-squared = 3.6474, df = 1, p-value = 0.05616
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.001857855 0.168834499
## sample estimates:
##    prop 1    prop 2 
## 0.2509091 0.1655629
# And now let’s run the chi-squared test on this smaller table:
chisq.test(smalltable1)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  smalltable1
## X-squared = 3.6474, df = 1, p-value = 0.05616
# Compare the results of the two tests.
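The comparison can also be done programmatically instead of by eye. The sketch below rebuilds the two by two table from the printed counts, stores both test results, and checks that the statistic and p-value agree; pt and ct are names chosen for this example.

```r
# Rebuild the two by two table from the counts shown above.
smalltable1 <- matrix(c(69, 206,
                        25, 126),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(c("Low", "Medium"),
                                      c("Low", "Moderate")))

pt <- prop.test(smalltable1)   # two-proportion test
ct <- chisq.test(smalltable1)  # chi-squared test

# Both apply a continuity correction by default, so the results match.
all.equal(unname(pt$statistic), unname(ct$statistic))
all.equal(pt$p.value, ct$p.value)
```

Both functions accept correct=FALSE to turn the continuity correction off; the two tests should still agree with each other as long as the same setting is used in both.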

Testing against a pre-specified probability

# Let’s create a smaller table from our big table that is just one column for the low physical activity students:
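# table1[1:3] takes the first three elements of the matrix, which R stores column by column, so this is the first column (low physical activity). The row names are dropped.
smalltable2 <- table1[1:3]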
smalltable2
## [1] 69 25 14
# What are the proportions for the low physical activity students in terms of their fruit consumption?
prop.table(smalltable2)
## [1] 0.6388889 0.2314815 0.1296296
# Let’s test this against two possible probability distributions specified with p=:
chisq.test(smalltable2,p=c(0.2,0.3,0.5))
## 
##  Chi-squared test for given probabilities
## 
## data:  smalltable2
## X-squared = 135.34, df = 2, p-value < 2.2e-16
chisq.test(smalltable2,p=c(0.6,0.3,0.1))
## 
##  Chi-squared test for given probabilities
## 
## data:  smalltable2
## X-squared = 2.9105, df = 2, p-value = 0.2333
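For a test against given probabilities, the expected counts are the sample size times each hypothesized probability, and the degrees of freedom are the number of categories minus one. The following sketch reproduces the second test above by hand; observed, p0, stat and p_value are names introduced here for illustration.

```r
# Observed fruit-consumption counts for the low physical activity students.
observed <- c(69, 25, 14)

# Hypothesized probabilities (the second distribution tested above).
p0 <- c(0.6, 0.3, 0.1)

# Expected counts under the hypothesized distribution.
expected <- sum(observed) * p0

# Chi-squared statistic and its p-value with (categories - 1) degrees of freedom.
stat <- sum((observed - expected)^2 / expected)
p_value <- pchisq(stat, df = length(observed) - 1, lower.tail = FALSE)

stat
p_value
```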

Chi-squared test with beer data (two categorical variables)

load(file="beer.Rdata")
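# Turn the numeric Calories variable into a categorical one with three groups, cutting at the 0, 1/3, 2/3 and 1 quantiles. include.lowest=TRUE keeps the smallest value in the "low" group.
beer$Catcal <- cut(beer$Calories, breaks=quantile(beer$Calories, probs=seq(0,1, by=1/3)), include.lowest=TRUE, labels = c("low","medium", "high"))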
summary(beer$Catcal)
##    low medium   high 
##     72     78     62
table(beer$Type, beer$Catcal)
##           
##            low medium high
##   Domestic  57     49   53
##   Imported  15     29    9
chisq.test(beer$Type, beer$Catcal)
## 
##  Pearson's Chi-squared test
## 
## data:  beer$Type and beer$Catcal
## X-squared = 10.472, df = 2, p-value = 0.005321
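The cut() and quantile() step used to create Catcal may be easier to follow on a tiny example. The sketch below uses a made-up vector, calories_demo, that is not taken from the beer data; it only illustrates how the breaks are computed and how values are assigned to groups.

```r
# Hypothetical calorie values, for illustration only (not from beer.Rdata).
calories_demo <- c(95, 110, 120, 140, 150, 155, 170, 200, 210)

# Break points at the minimum, the two tertiles, and the maximum.
breaks_demo <- quantile(calories_demo, probs = seq(0, 1, by = 1/3))
breaks_demo

# Assign each value to a group; include.lowest = TRUE keeps the minimum
# (95) in the first group instead of turning it into NA.
calorie_group <- cut(calories_demo, breaks = breaks_demo,
                     include.lowest = TRUE,
                     labels = c("low", "medium", "high"))

table(calorie_group)
```

Without include.lowest = TRUE the intervals are open on the left, so the smallest value would fall outside every group and be recorded as missing.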