This is week 7 of your BIO205 class. Previously, we learned how to compare a dependent variable between two samples using a two-sample t test. This week, we will explore the grip data we collected in lab, as well as other datasets, to better understand comparisons of two or more groups. When we have two groups, we use a t test. When we have more than two groups, we use a test called Analysis of Variance (ANOVA).
To work with datasets, we need to read them in, usually from a csv file. To do this, we use the read.csv() function. The file (“Grip_part1.csv”) is loaded into your RStudio Cloud workspace. Create an object name (grip1) and assign the file to it.
grip1 <- read.csv("Grip_part1.csv")
str(grip1)
## 'data.frame': 17 obs. of 19 variables:
## $ name : chr "Student49" "Student59" "Student51" "Student47" ...
## $ semester : chr "2022_spring" "2022_Spring" "2022_Spring" "2022_Spring" ...
## $ dom.hand : chr "Right" "Right" "Left" "Left" ...
## $ height.in : num 62 64 66 65 64 70 65 65 65 68 ...
## $ arm.circumference.in : num 12.5 9.6 9.25 9.4 9.1 9.75 10.5 9.3 13.8 9.4 ...
## $ dom.max.grip.lbs : num 42.8 54.8 73 52.5 67 75 43.8 79.6 82.8 64 ...
## $ non.dom.max.grip.lbs : num 47 62 56 57.8 53 65 32.4 43.2 59.6 72 ...
## $ dom.fatigue.secs : num 2.5 29.4 44.7 21.3 22.3 ...
## $ non.dom.fatigue.secs : num NA 21.2 28.1 16.2 42.5 ...
## $ athlete.status : chr "No" "Yes" "Yes" "Yes" ...
## $ shoe.size.US : num 9 7 8.5 7.5 9 9 6 8.5 10.5 9 ...
## $ Avg.sleep.per.night.hrs : int 6 6 7 7 7 7 7 7 7 7 ...
## $ age.yrs : int 19 20 19 20 19 19 19 20 20 20 ...
## $ birth.month : chr "June" "December" "July" "February" ...
## $ step.length.heeltoheel.in : num 24.1 24.8 24.8 24.7 26.3 33.3 22.1 27.3 29.8 24.1 ...
## $ step.length.heeltoheel.backward.in: num 15.3 19.2 23.2 14.6 24.7 26.2 16 19.3 24.5 19.5 ...
## $ drinkscaffiene : chr "No" "Yes" "No" "Yes" ...
## $ head.circumference.in : num 23.2 22.8 22.5 22.9 23.5 ...
## $ horoscopesign : chr "Gemini" "" "Cancer" "Aquarius" ...
head(grip1)
## name semester dom.hand height.in arm.circumference.in
## 1 Student49 2022_spring Right 62 12.50
## 2 Student59 2022_Spring Right 64 9.60
## 3 Student51 2022_Spring Left 66 9.25
## 4 Student47 2022_Spring Left 65 9.40
## 5 Student50 2022_Spring Right 64 9.10
## 6 Student52 2022_spring Right 70 9.75
## dom.max.grip.lbs non.dom.max.grip.lbs dom.fatigue.secs non.dom.fatigue.secs
## 1 42.8 47.0 2.50 NA
## 2 54.8 62.0 29.35 21.25
## 3 73.0 56.0 44.70 28.07
## 4 52.5 57.8 21.33 16.25
## 5 67.0 53.0 22.31 42.55
## 6 75.0 65.0 26.98 39.83
## athlete.status shoe.size.US Avg.sleep.per.night.hrs age.yrs birth.month
## 1 No 9.0 6 19 June
## 2 Yes 7.0 6 20 December
## 3 Yes 8.5 7 19 July
## 4 Yes 7.5 7 20 February
## 5 No 9.0 7 19 June
## 6 No 9.0 7 19 May
## step.length.heeltoheel.in step.length.heeltoheel.backward.in drinkscaffiene
## 1 24.1 15.3 No
## 2 24.8 19.2 Yes
## 3 24.8 23.2 No
## 4 24.7 14.6 Yes
## 5 26.3 24.7 Yes
## 6 33.3 26.2 Yes
## head.circumference.in horoscopesign
## 1 23.25 Gemini
## 2 22.80
## 3 22.50 Cancer
## 4 22.90 Aquarius
## 5 23.50 Gemini
## 6 23.00 Taurus
We wanted to compare dominant hand fatigue among groups with different amounts of sleep.
What was your biological null hypothesis?
Maybe it was something like: The amount of rest does not affect muscle recovery.
To test this, maybe you collected data on people’s grip fatigue time, and also polled them on their hours of sleep. Nice, we have that data.
What is the statistical null?
Maybe it is something like: Students with 6, 7, or 8 hours of sleep will have the same mean fatigue time on a grip strength test.
For this, sleep is your independent variable and dominant hand fatigue is your measurement, or dependent variable. Sleep is stored as numeric information in the table, but because it is the independent variable that defines our groups, we want to change it to nominal information (a factor in R). Use the as.factor() function.
grip1$Avg.sleep.per.night.hrs <- as.factor(grip1$Avg.sleep.per.night.hrs)
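To see what as.factor() changes, here is a minimal sketch. The sleep values below are made up for illustration and are not the class data.

```r
# Hypothetical sleep values, made up for illustration (not the class data)
sleep_hrs <- c(6, 7, 7, 8, 6, 7)

class(sleep_hrs)                    # "numeric": R treats these as measurements

sleep_fac <- as.factor(sleep_hrs)
class(sleep_fac)                    # "factor": R now treats these as group labels
levels(sleep_fac)                   # the distinct groups, stored as text
table(sleep_fac)                    # how many observations fall in each group
```

Once the column is a factor, functions such as lm() and ggplot() treat each distinct value as its own group instead of as a point on a number line.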
If we want to see summary stats, we can use shortcuts instead of subsetting functions. For example, let’s use the summarySE() function.
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
gripSum <- summarySE(data = grip1,
measurevar = "dom.fatigue.secs",
groupvars = "Avg.sleep.per.night.hrs")
## Warning in qt(conf.interval/2 + 0.5, datac$N - 1): NaNs produced
gripSum
## Avg.sleep.per.night.hrs N dom.fatigue.secs sd se ci
## 1 6 2 15.92500 18.98582 13.425000 170.580799
## 2 7 9 20.37333 11.82596 3.941987 9.090237
## 3 8 5 30.55800 27.14187 12.138215 33.701088
## 4 <NA> 1 6.50000 NA NA NaN
Notice the extra row (row 4) for the one student whose sleep value is missing (NA). Let’s get rid of that by keeping only rows 1 to 3.
gripSum <- gripSum[1:3,]
gripSum
## Avg.sleep.per.night.hrs N dom.fatigue.secs sd se ci
## 1 6 2 15.92500 18.98582 13.425000 170.580799
## 2 7 9 20.37333 11.82596 3.941987 9.090237
## 3 8 5 30.55800 27.14187 12.138215 33.701088
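If you are curious where the se and ci columns come from, the sketch below recomputes them for the 6-hour group from the N and sd printed in the table. It assumes the usual formulas (standard error = sd / sqrt(N), and a 95% confidence interval half-width from the t distribution). The variable names are made up for this example.

```r
# Values copied from the 6-hour row of the summary table
grp_n  <- 2
grp_sd <- 18.98582

# Standard error of the mean
grp_se <- grp_sd / sqrt(grp_n)

# Half-width of the 95% confidence interval, using N - 1 degrees of freedom
grp_ci <- qt(0.975, df = grp_n - 1) * grp_se

grp_se   # compare with the se column
grp_ci   # compare with the ci column

# With only one observation there are 0 degrees of freedom, so qt() gives NaN
one_obs <- suppressWarnings(qt(0.975, df = 0))
one_obs
```

The last lines suggest why the warning about NaNs appeared: the group with a single student has no degrees of freedom left for a confidence interval. The huge ci for the 6-hour group is a reminder that two students tell us very little about a mean.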
Now, we can run the one-factor ANOVA. Recall from lecture that this works off the lm() function, which fits a linear model. Then, we use the anova() function on that model to get the ANOVA table.
grip1.lm <- lm(grip1$dom.fatigue.secs ~ grip1$Avg.sleep.per.night.hrs)
anova(grip1.lm)
## Analysis of Variance Table
##
## Response: grip1$dom.fatigue.secs
## Df Sum Sq Mean Sq F value Pr(>F)
## grip1$Avg.sleep.per.night.hrs 2 447.8 223.91 0.6577 0.5345
## Residuals 13 4426.0 340.46
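The numbers in the ANOVA table are connected to each other. As a rough check, this sketch rebuilds the F value and p value from the Df and Sum Sq columns printed above. The inputs are the rounded values from the table, so the results only match to about three decimal places. The variable names are made up for this example.

```r
# Values copied from the ANOVA table (rounded as printed)
ss_sleep <- 447.8    # Sum Sq for the sleep groups
ss_resid <- 4426.0   # Sum Sq for the residuals
df_sleep <- 2        # number of groups - 1
df_resid <- 13       # number of observations - number of groups

# Mean squares are sums of squares divided by their degrees of freedom
ms_sleep <- ss_sleep / df_sleep
ms_resid <- ss_resid / df_resid

# F is the ratio of between-group to within-group variation
f_value <- ms_sleep / ms_resid

# p value: probability of an F this large or larger if the null is true
p_value <- pf(f_value, df_sleep, df_resid, lower.tail = FALSE)

f_value
p_value
```

An F value below 1 means the groups differ from each other less than the students within a group differ from one another, which is why the p value is so large.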
The ANOVA test only allows us to examine whether at least one group mean differs from the others. It doesn’t tell us whether those who slept 6 hours are different from those who slept 8. For this, we need to do post-hoc tests. The most common post-hoc test is the Tukey HSD. This does pairwise comparisons (similar to what we did for chi square tests before) of all groups, and gives us an adjusted p value for each pair. Remember, the independent variable needs to be nominal data.
TukeyHSD(aov(grip1$dom.fatigue.secs ~ grip1$Avg.sleep.per.night.hrs))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = grip1$dom.fatigue.secs ~ grip1$Avg.sleep.per.night.hrs)
##
## $`grip1$Avg.sleep.per.night.hrs`
## diff lwr upr p adj
## 7-6 4.448333 -33.63813 42.53479 0.9491276
## 8-6 14.633000 -26.12938 55.39538 0.6209803
## 8-7 10.184667 -16.99025 37.35959 0.5959667
This gives us all comparisons. If we look for the one I just mentioned (6 vs 8), it is the second row of the list (8-6). The adjusted p value is 0.62, which is far above 0.05, so we have no evidence that these two groups differ. Looking at the data, that is not much of a surprise.
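For practice reading Tukey output, here is a small sketch on PlantGrowth, a dataset that ships with R (it is not our grip data). It uses the formula plus data = style, which is an alternative to writing the data frame name with $ inside the formula. The object names are made up for this example.

```r
# PlantGrowth ships with R: plant weight for a control and two treatment groups
plant_aov   <- aov(weight ~ group, data = PlantGrowth)
plant_tukey <- TukeyHSD(plant_aov)
plant_tukey

# The result holds one matrix per factor; rows are the pairwise comparisons
rownames(plant_tukey$group)

# Pull out the adjusted p value for a single comparison by its row name
plant_tukey$group["trt2-trt1", "p adj"]
```

Each row name is read as "second group minus first group", so a positive diff means the group named first has the larger mean.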
OK. Graphing for two or multiple groups is the same. Let’s first try bar graphs. Let’s do a basic graph first of the sleep vs grip example above.
For bar graphs in ggplot, we can use the result of the summarySE function.
library(ggplot2)
gripSum # this is from above, we can use the same table
## Avg.sleep.per.night.hrs N dom.fatigue.secs sd se ci
## 1 6 2 15.92500 18.98582 13.425000 170.580799
## 2 7 9 20.37333 11.82596 3.941987 9.090237
## 3 8 5 30.55800 27.14187 12.138215 33.701088
gripbase <- ggplot(data = gripSum,
aes(x=Avg.sleep.per.night.hrs,
y=dom.fatigue.secs)) # this is the first layer of mapping
gripbase
Notice that we added a basic graph background.
gripbase +
geom_bar(stat = "identity")
Let’s clean it up now.
gripbase +
geom_bar(stat = "identity",
fill = c(rainbow(3))) + # this adds color to the bars
geom_errorbar(aes(ymin = dom.fatigue.secs-se,
ymax = dom.fatigue.secs+se),
width=0.5) # this adds your error bars, using another element
We can also change titles. Each thing we are adding with the + sign is another element to the graph.
gripbase +
geom_bar(stat = "identity",
fill = c(rainbow(3))) + # this adds color to the bars
geom_errorbar(aes(ymin = dom.fatigue.secs-se,
ymax = dom.fatigue.secs+se),
width=0.5) +
xlab("hours of sleep") +
ylab("dominant hand fatigue time (secs)")+ # y is dom.fatigue.secs, so label it as fatigue
ggtitle("dominant hand fatigue by hours of sleep")
Voilà! Much nicer than our old graphs.
Let’s just quickly do the boxplot version of this graph. Remember that boxplots use all the data points, since they need the raw values to work out the quartiles. Thus, we don’t need a summarySE table.
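# we need a new base, built from the raw data
# note: na.omit() drops every row that has an NA in any column
grip2 <- na.omit(grip1)
grip2$Avg.sleep.per.night.hrs <- as.factor(grip2$Avg.sleep.per.night.hrs)
boxgrip <- ggplot(data = grip2,
aes(x=Avg.sleep.per.night.hrs,
y=dom.fatigue.secs)) # first layer, same variables as the bar graph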
boxgrip +
geom_boxplot(fill=c(rainbow(3))) # add boxplot, add some color
Many times, we want to show the real data points. It shows transparency in science. Let’s see how you do this.
boxgrip +
geom_boxplot(fill=c(rainbow(3))) +
geom_jitter(shape=16,
position=position_jitter(0.2)) # add some points
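To see the quartiles a boxplot is built from, this sketch uses the six dom.fatigue.secs values shown in the head() output earlier. Treat it as an illustration of the idea, using R's default quantile() method, not as the exact calculation for the full dataset. The object names are made up for this example.

```r
# First six dominant hand fatigue times, copied from the head() output
fatigue6 <- c(2.50, 29.35, 44.70, 21.33, 22.31, 26.98)

# Minimum, first quartile, median, third quartile, maximum
fatigue_q <- quantile(fatigue6, probs = c(0, 0.25, 0.5, 0.75, 1))
fatigue_q

# The box spans the first to third quartile; the line inside is the median
fatigue_median <- median(fatigue6)
fatigue_iqr    <- IQR(fatigue6)
fatigue_median
fatigue_iqr
```

Because every value feeds into these numbers, a boxplot needs the raw data frame, while the bar graph only needed the means and standard errors.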
# Part 3. Assignment
On your own, choose any other question, using our dataset, where you can measure 1 dependent variable comparing only 2 groups. Note, please try to come up with your own question, and not the exact same one as someone else. Give your objects different names. Then share with others after.
For these questions, use the gripclean.csv datafile. It is already in your working directory. This dataset includes last year’s information, which makes the dataset bigger and more interesting to work with.
For two groups,
* write the biological null
* write the statistical null
* run the test
* graph your data using a bargraph AND boxplot. Add points to your boxplot!
* explain your results
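As a reminder of what "run the test" looks like for two groups, here is a sketch on mtcars, a dataset that ships with R. It is deliberately not the grip data, so your own question and column names will be different. The object names here are made up for this example.

```r
# mtcars ships with R; am records the transmission type (0 = automatic, 1 = manual)
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))  # make the groups nominal

# Two-sample t test: is mean mpg the same in the two transmission groups?
cars_t <- t.test(mpg ~ am, data = cars)
cars_t

cars_t$estimate   # the two group means
cars_t$p.value    # the p value for the comparison
```

By default, t.test() runs the Welch version of the test, which does not assume the two groups have equal variances.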
Then, try to find a question that would examine 1 dependent variable comparing more than 2 groups.
For more than two groups,
* write the biological null
* write the statistical null
* run the test
* graph your data using a bargraph AND boxplot. Add points to your boxplot!
* explain your results