First, let’s look at the dataset you collected! It’s awesome. Note that I did clean it up, including making all No/Yes responses the same capitalization/spelling and removing stray characters from numeric columns. This consistency is important when reading files into R. I named mine grip1; name yours something else!
grip1 <- read.csv("Grip_part1.csv")
head(grip1)
## name semester dom.hand height.in arm.circumference.in
## 1 Student49 2022_spring Right 62 12.50
## 2 Student59 2022_Spring Right 64 9.60
## 3 Student51 2022_Spring Left 66 9.25
## 4 Student47 2022_Spring Left 65 9.40
## 5 Student50 2022_Spring Right 64 9.10
## 6 Student52 2022_spring Right 70 9.75
## dom.max.grip.lbs non.dom.max.grip.lbs dom.fatigue.secs non.dom.fatigue.secs
## 1 42.8 47.0 2.50 NA
## 2 54.8 62.0 29.35 21.25
## 3 73.0 56.0 44.70 28.07
## 4 52.5 57.8 21.33 16.25
## 5 67.0 53.0 22.31 42.55
## 6 75.0 65.0 26.98 39.83
## athlete.status shoe.size.US Avg.sleep.per.night.hrs age.yrs birth.month
## 1 No 9.0 6 19 June
## 2 Yes 7.0 6 20 December
## 3 Yes 8.5 7 19 July
## 4 Yes 7.5 7 20 February
## 5 No 9.0 7 19 June
## 6 No 9.0 7 19 May
## step.length.heeltoheel.in step.length.heeltoheel.backward.in drinkscaffiene
## 1 24.1 15.3 No
## 2 24.8 19.2 Yes
## 3 24.8 23.2 No
## 4 24.7 14.6 Yes
## 5 26.3 24.7 Yes
## 6 33.3 26.2 Yes
## head.circumference.in horoscopesign X X.1 X.2 X.3 X.4 X.5
## 1 23.25 Gemini NA NA NA NA NA NA
## 2 22.80 NA NA NA NA NA NA
## 3 22.50 Cancer NA NA NA NA NA NA
## 4 22.90 Aquarius NA NA NA NA NA NA
## 5 23.50 Gemini NA NA NA NA NA NA
## 6 23.00 Taurus NA NA NA NA NA NA
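Before analyzing, it never hurts to double-check that the cleanup worked. Here is an optional quick check (just a sketch; you can skip it) using base R’s str() and table() to confirm the numeric columns read in as numbers and the Yes/No columns contain only the two spellings.
str(grip1)                  # column types: numeric columns should be numbers, not characters
table(grip1$dom.hand)       # should show only "Left" and "Right"
table(grip1$drinkscaffiene) # should show only "No" and "Yes"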
In the first in-class activity, we are doing a t-test review. Let’s assume we wanted to test this statistical null hypothesis:
Mean fatigue time of left hand dominant students will be the same as that of right hand dominant students.
Here is how we would do the t-test the long way, based on this formula for the two-sample t statistic:
t = (Xbar1 - Xbar2) / sqrt(s1^2/N1 + s2^2/N2)
* Xbar is the mean of your group (1 or 2)
* s is the standard deviation of your group (1 or 2)
* N is the sample size for your group (1 or 2)
# first, subset the dominant-hand fatigue values for the left- and right-handed students
gripleft <- grip1$dom.fatigue.secs[grip1$dom.hand=="Left"]
gripright <- grip1$dom.fatigue.secs[grip1$dom.hand=="Right"]
# solve for the t statistic using the formula
(mean(gripleft) - mean(gripright)) /
(sd(gripleft)^2/length(gripleft) + sd(gripright)^2/length(gripright))^0.5
## [1] 1.691536
Of course, we can just ask R to do it. Remember, since the groups are unbalanced (3 lefties and 14 righties), we want to run a test that does not assume equal variances or equal group sizes. This is the Welch t-test, which t.test() runs by default.
Worksheet question 1c
t.test(grip1$dom.fatigue.secs~grip1$dom.hand)
##
## Welch Two Sample t-test
##
## data: grip1$dom.fatigue.secs by grip1$dom.hand
## t = 1.6915, df = 4.4433, p-value = 0.1588
## alternative hypothesis: true difference in means between group Left and group Right is not equal to 0
## 95 percent confidence interval:
## -8.185592 36.471782
## sample estimates:
## mean in group Left mean in group Right
## 33.67667 19.53357
Analysis of Variance (ANOVA)
We collected a lot of data on Thursday! Go us! That means we can do a lot of analysis. For example, …
* What if we wanted to know whether dominant hand muscle fatigue varies by the number of hours of sleep we get (reported sleep)?
* What if we wanted to know whether our heights vary by our horoscope sign (thanks for that suggestion!)?
The dataset offers many possibilities to compare a dependent variable (such as fatigue or height) across various independent variables. However, when there are more than two groups in the independent variable, we can no longer do a Student or Welch t-test. We have to perform an analysis of variance, or ANOVA.
There are two main types of ANOVA: a one factor ANOVA and a two factor ANOVA (sometimes also called one way or two way ANOVA). One factor ANOVAs have one independent variable, whereas two factor ANOVAs have two independent variables.
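To preview what that difference looks like in R’s formula syntax, here is a small sketch with placeholder names (y, A, B, and mydata are hypothetical, not columns in our data):
# one factor ANOVA: one independent variable
# aov(y ~ A, data = mydata)
# two factor ANOVA: two independent variables (plus, optionally, their interaction)
# aov(y ~ A + B, data = mydata)   # main effects only
# aov(y ~ A * B, data = mydata)   # main effects and interaction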
First, let’s just tackle a one factor ANOVA. Here is an example. This dataset contains birth metrics of many babies, and some information about their mothers.
baby <- read.csv("Birthweight2.csv")
head(baby)
## id headcirumference length Birthweight Gestation smoker motherage mnocig
## 1 822 13 19 7.5 38 0 20 0
## 2 1081 14 21 8.0 38 0 18 0
## 3 1636 14 20 8.6 38 0 29 0
## 4 1107 14 20 7.1 38 0 31 0
## 5 1023 13 20 6.6 38 1 30 12
## 6 1369 13 19 7.0 38 1 31 25
## mheight mppwt fage fedyrs fnocig fheight lowbwt mage35 LowBirthWeight
## 1 62 103 22 14 0 70 0 0 Normal
## 2 67 109 20 12 7 67 0 0 Normal
## 3 64 135 31 16 0 70 0 0 Normal
## 4 64 125 35 16 0 72 0 0 Normal
## 5 64 140 38 14 50 70 0 0 Normal
## 6 63 124 32 16 50 76 0 0 Normal
In the example in class, we are talking about whether a baby’s birthweight differs depending on the gestation period.
Thus, there is one dependent variable = birthweight
And one independent variable = gestation (with 4 groups)
First, let’s look at the summary statistics. In the past, I would have you summarize by subsetting and then getting means. Now we can use some shortcuts. Here is one:
# first, load the Rmisc package. It has been installed for you.
# install.packages("Rmisc")
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
Then, we will run a function called summarySE(). This provides summary stats broken down by an independent variable. BUT, the independent variable has to be a factor (categorical information, such as character or nominal data). Our Gestation data are numbers (38, 39, 40, 41), so we have to turn them into a factor. To do that, we use the as.factor() function.
baby$Gestation <- as.factor(baby$Gestation)
# now run the summarySE function
babySum <- summarySE(data = baby,
measurevar = "Birthweight",
groupvars = "Gestation")
babySum
## Gestation N Birthweight sd se ci
## 1 38 6 7.466667 0.7312090 0.2985148 0.7673568
## 2 39 7 6.814286 1.0205041 0.3857143 0.9438089
## 3 40 9 7.644444 0.9773831 0.3257944 0.7512832
## 4 41 7 7.857143 1.1970201 0.4524311 1.1070590
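If you are curious, the same numbers can be pulled out the long way with base R; this is roughly what summarySE() is doing for the means, standard deviations, and sample sizes:
# group means, standard deviations, and sample sizes the long way
tapply(baby$Birthweight, baby$Gestation, mean)  # mean birthweight per gestation group
tapply(baby$Birthweight, baby$Gestation, sd)    # standard deviation per group
table(baby$Gestation)                           # sample size (N) per group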
Now, let’s run the actual ANOVA. For this, we use linear modeling functions, such as the lm() function. First, name your model (I named mine results.lm), and then run the ANOVA on that model with anova().
results.lm <- lm(baby$Birthweight ~ baby$Gestation)
anova(results.lm)
## Analysis of Variance Table
##
## Response: baby$Birthweight
## Df Sum Sq Mean Sq F value Pr(>F)
## baby$Gestation 3 4.3291 1.4430 1.4338 0.2565
## Residuals 25 25.1613 1.0065
Here, we find an F ratio of 1.4338, which gives us a p-value of 0.2565. Thus, it looks like we would fail to reject our null hypothesis.
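As a side note, the same ANOVA table can also be produced with the aov() function directly; you should get the same F ratio and p-value either way. A quick sketch:
# an equivalent way to run the same one factor ANOVA
results.aov <- aov(Birthweight ~ Gestation, data = baby)
summary(results.aov)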
Sometimes, if our result IS significant (but not here), we might also want to see whether there are significant differences between specific groups. In other words, the ANOVA tells you whether there are differences among groups, not which groups differ. To determine which groups, we need what is called a post-hoc test. A very common one is the Tukey HSD. For this function, it is important that your independent variable is a factor (which we changed above).
TukeyHSD(aov(baby$Birthweight ~ baby$Gestation))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = baby$Birthweight ~ baby$Gestation)
##
## $`baby$Gestation`
## diff lwr upr p adj
## 39-38 -0.6523810 -2.1876256 0.8828637 0.6514550
## 40-38 0.1777778 -1.2766070 1.6321625 0.9865903
## 41-38 0.3904762 -1.1447685 1.9257209 0.8962529
## 40-39 0.8301587 -0.5604991 2.2208166 0.3744453
## 41-39 1.0428571 -0.4321582 2.5178725 0.2356823
## 41-40 0.2126984 -1.1779594 1.6033562 0.9744138
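If you prefer a visual, R can also plot those pairwise confidence intervals directly from the Tukey result (intervals that cross zero correspond to non-significant comparisons):
# plot the Tukey confidence intervals for each pairwise difference
plot(TukeyHSD(aov(baby$Birthweight ~ baby$Gestation)))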
Now you analyze your own data: examine dominant hand fatigue across the groups reporting different amounts of sleep.
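Here is one possible starting point, mirroring the birthweight example above. The column names come from the head() output at the top of this page; adjust them if yours differ.
# sketch: does dominant hand fatigue differ by reported hours of sleep?
grip1$Avg.sleep.per.night.hrs <- as.factor(grip1$Avg.sleep.per.night.hrs)  # sleep hours as groups
sleep.lm <- lm(grip1$dom.fatigue.secs ~ grip1$Avg.sleep.per.night.hrs)
anova(sleep.lm)
# if the ANOVA were significant, a post-hoc test could follow:
# TukeyHSD(aov(grip1$dom.fatigue.secs ~ grip1$Avg.sleep.per.night.hrs))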
Let’s just briefly explore the awesomeness of ggplot2.
# install.packages("ggplot2")
library(ggplot2)
This makes a simple bar graph. ggplot works off a base and adds elements to it.
base <- ggplot(data=babySum, aes(x = Gestation, y=Birthweight))
base
Notice it gives you the framework for a graph. Now, we can add elements to it. First, let’s add the simple bars.
base +
geom_bar(stat = "identity",
fill = c(rainbow(4)))
Now, we can make the bar graph a little better, with error bars and color.
base +
geom_bar(stat = "identity",
fill = c(rainbow(4))) +
geom_errorbar(aes(ymin = Birthweight-ci,
ymax = Birthweight+ci),
width=0.5)
Here is a simple bar graph, with error bars.
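As an optional final touch-up, you can add axis labels and a cleaner background with labs() and theme_classic(). A sketch (the label text is just a suggestion):
# same plot with axis labels and a cleaner theme
base +
  geom_bar(stat = "identity",
           fill = c(rainbow(4))) +
  geom_errorbar(aes(ymin = Birthweight - ci,
                    ymax = Birthweight + ci),
                width = 0.5) +
  labs(x = "Gestation (weeks)",
       y = "Mean birthweight") +
  theme_classic()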