Data 606 - Presentation

Pavan Akula

Chapter 5: Inference For Numerical Data

Question 5.45:

Coffee, depression, and physical activity. Caffeine is the world's most widely used stimulant, with approximately 80% consumed in the form of coffee. Participants in a study investigating the relationship between coffee consumption and exercise were asked to report the number of hours they spent per week on moderate (e.g., brisk walking) and vigorous (e.g., strenuous sports and jogging) exercise. Based on these data the researchers estimated the total hours of metabolic equivalent tasks (MET) per week, a value always greater than 0. The table below gives summary statistics of MET for women in this study based on the amount of coffee consumed.

ANOVA Summary

Questions

a. Write the hypotheses for evaluating if the average physical activity level varies among the different levels of coffee consumption.

b. Check conditions and describe any assumptions you must make to proceed with the test.

Questions continued ...

c. Below is part of the output associated with this test. Fill in the empty cells.

d. What is the conclusion of the test?

Answers - a

a. ANOVA uses a single hypothesis test to check whether the means across many groups are equal.

Null hypothesis: The average physical activity level of women is identical across all caffeine consumption groups.

\( H_{0}: \mu_{(\le 1 \space cup/week)} = \mu_{(2-6 \space cups/week)} = \mu_{(1 \space cup/day)} = \mu_{(2-3 \space cups/day)} = \mu_{(\ge 4 \space cups/day)} \)

Alternative hypothesis: The average physical activity level of women differs for at least one caffeine consumption group.

\( H_{A}: \text{At least one mean is different.} \)

Answers - b

Independence: The sample of 50739 women is less than 10% of the population of women in the US. Assuming the participants were sampled randomly, the observations can be treated as independent within and across groups.
Approximate normality: The standard deviation of each group is larger than its mean, and MET is always greater than 0, which suggests the data are strongly right-skewed. Because the sample size in each group is large, this skew is acceptable.
Constant variance: The standard deviations are about equal across the groups, so the constant variance condition is satisfied.
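
The skew and variance checks above can be sketched numerically from the group summaries. This is a minimal sketch, not part of the original solution: the variable names `sd.exceeds.mean`, `sd.ratio`, and `min.group.size` are illustrative, and the ratio of largest to smallest standard deviation is only an informal indicator of similar variability.

```r
# Group summaries as given in the exercise (same values used later in the deck)
coffee.mean   <- c(18.7, 19.6, 19.3, 18.9, 17.5)
coffee.sd     <- c(21.1, 25.5, 22.5, 22.0, 22.0)
coffee.sample <- c(12215, 6617, 17234, 12290, 2383)

# Skew indicator: MET cannot be negative, so sd > mean points to right skew
sd.exceeds.mean <- coffee.sd > coffee.mean

# Constant-variance indicator: ratio of largest to smallest group sd
sd.ratio <- max(coffee.sd) / min(coffee.sd)

# Smallest group size, relevant to how much skew can be tolerated
min.group.size <- min(coffee.sample)

print(sd.exceeds.mean)
print(round(sd.ratio, 2))
print(min.group.size)
```

Every group has a standard deviation above its mean, the largest standard deviation is only about 1.2 times the smallest, and even the smallest group has more than 2000 observations.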

Answers - c

This can be solved either with the R package rpsychi or manually.

# Using library Statistics for psychiatric research
library(rpsychi)
library(dplyr)
library(knitr)
options("scipen"=100, "digits"=4)

# Create vectors for summary values
coffee.mean <- c(18.7,19.6,19.3,18.9,17.5)
coffee.sd <- c(21.1,25.5,22.5,22.0,22.0) 
coffee.sample <- c(12215,6617,17234,12290,2383)

# Create data frame
coffee.data.frame <- data.frame(coffee.mean, coffee.sd, coffee.sample)

Answers - c

options("scipen"=100, "digits"=4)
# Get ANOVA details
coffee.details <- with(coffee.data.frame, ind.oneway.second(coffee.mean, coffee.sd, coffee.sample))

# Extract ANOVA table
coffee.anovatable <- coffee.details$anova.table

# Calculate p-value
p <- pf(coffee.anovatable$F[1], coffee.anovatable$df[1], coffee.anovatable$df[2], lower.tail = FALSE)
p <- c(p, NA, NA)

Answers - c

options("scipen"=100, "digits"=4)
# Bind p-value to ANOVA table
coffee.anovatable <- cbind(coffee.anovatable, p)

#Rename column name
coffee.anovatable <- rename(coffee.anovatable, `Df` = `df`, `Sum Sq` = SS, `Mean Sq` = MS, `F value` = `F`, `Pr(>F)` = `p`)
coffee.anovatable <- coffee.anovatable %>% select(Df,`Sum Sq`,`Mean Sq`,`F value`,`Pr(>F)`)

#Rename row names
rownames(coffee.anovatable) <- NULL
rnames <- c("coffee","Residuals","Total")
coffee.anovatable <- cbind(" " = rnames, coffee.anovatable)

Answers - c

options("scipen"=100, "digits"=4)
kable(coffee.anovatable, format="html", align="r", digits = 4, row.names = FALSE)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
coffee	4	10508	2627.1	5.213	0.0003
Residuals	50734	25564819	503.9	NA	NA
Total	50738	25575327	NA	NA	NA

Answers - c

Manually
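
The manual route rests on the standard one-way ANOVA identities, written here with the same names as the code that follows (m groups, n total observations):

\( SSB = SST - SSE, \quad MSB = \frac{SSB}{m-1}, \quad MSE = \frac{SSE}{n-m}, \quad F = \frac{MSB}{MSE} \)

The p-value is the upper tail area of the F distribution with \( m-1 \) and \( n-m \) degrees of freedom beyond the observed F.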

Answers - c

library(knitr)
options("scipen"=100, "digits"=4)

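m <- 5                                    # number of groups
n <- 12215 + 6617 + 17234 + 12290 + 2383  # total observations
df <- c(m - 1, n - m, n - 1)              # between groups, residuals, total
sst <- 25575327                           # total sum of squares (given)
sse <- 25564819                           # residual sum of squares (given)
ssb <- sst - sse                          # between-group sum of squares
ss <- c(ssb, sse, sst)
msb <- ssb / (m - 1)                      # mean square between groups
mse <- sse / (n - m)                      # mean square error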

Answers - c

options("scipen"=100, "digits"=4)
ms <- c(msb, mse, NA)
f <- msb / mse
p <- pf(f, m - 1, n - m, lower.tail = FALSE)

c.anova <- data.frame(df, ss, ms, f = c(f, NA, NA), p = c(p, NA, NA))
rnames <- c("coffee", "Residuals", "Total")
c.anova <- cbind(" " = rnames, c.anova)
kable(c.anova, format="html", align="r", digits = 4, row.names = FALSE)

	df	ss	ms	f	p
coffee	4	10508	2627.0	5.213	0.0003
Residuals	50734	25564819	503.9	NA	NA
Total	50738	25575327	NA	NA	NA
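
As a cross-check on the given sums of squares, they can be recomputed from the group means, standard deviations, and sample sizes using base R only. This is a sketch under the assumption that the reported standard deviations are sample standard deviations (divisor n - 1); the names ending in `.check` are illustrative and do not appear in the original solution.

```r
# Group summaries as given in the exercise
coffee.mean   <- c(18.7, 19.6, 19.3, 18.9, 17.5)
coffee.sd     <- c(21.1, 25.5, 22.5, 22.0, 22.0)
coffee.sample <- c(12215, 6617, 17234, 12290, 2383)

m.check <- length(coffee.mean)   # number of groups
n.check <- sum(coffee.sample)    # total observations

# Grand mean is the sample-size-weighted average of the group means
grand.mean <- sum(coffee.sample * coffee.mean) / n.check

# Between-group sum of squares: weighted squared deviations from the grand mean
ssb.check <- sum(coffee.sample * (coffee.mean - grand.mean)^2)

# Residual sum of squares: (n_i - 1) * sd_i^2 summed over groups
sse.check <- sum((coffee.sample - 1) * coffee.sd^2)

# F statistic from the recomputed mean squares
f.check <- (ssb.check / (m.check - 1)) / (sse.check / (n.check - m.check))

print(round(c(ssb = ssb.check, sse = sse.check, f = f.check), 3))
```

The recomputed values should agree with the table above up to rounding of the reported means and standard deviations.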

Answers - d

Since the p-value (0.0003) is smaller than 0.05, there is strong evidence that the differences in average physical activity level among the caffeine consumption groups are not due to chance alone. Hence, the null hypothesis is rejected: at least one group mean differs from the others.
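
The same decision can be reached by comparing the F statistic with its critical value. This is a minimal sketch that takes the F value and degrees of freedom from the ANOVA table and assumes a 0.05 significance level, as in the conclusion; the names `f.crit` and `reject.null` are illustrative.

```r
alpha  <- 0.05    # significance level assumed in the conclusion
f.stat <- 5.213   # F value from the ANOVA table
df1    <- 4       # between-group degrees of freedom
df2    <- 50734   # residual degrees of freedom

# Critical value: the F quantile that leaves alpha in the upper tail
f.crit <- qf(1 - alpha, df1, df2)

# Upper-tail p-value for the observed F statistic
p.value <- pf(f.stat, df1, df2, lower.tail = FALSE)

# Reject the null hypothesis when the statistic exceeds the critical value
reject.null <- f.stat > f.crit

print(c(f.crit = f.crit, p.value = p.value))
print(reject.null)
```

Rejecting the null hypothesis only indicates that at least one group mean differs; it does not identify which groups differ.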