First, let’s look at the dataset you collected! It’s awesome. Note that I did clean it up, including making all No/Yes responses the same capitalization/spelling and removing stray characters from numeric columns. This consistency is important when reading files into R. I named mine grip1; name yours something else!
grip1 <- read.csv("Grip_part1.csv")
head(grip1)
## name semester dom.hand height.in arm.circumference.in
## 1 Student49 2022_spring Right 62 12.50
## 2 Student59 2022_Spring Right 64 9.60
## 3 Student51 2022_Spring Left 66 9.25
## 4 Student47 2022_Spring Left 65 9.40
## 5 Student50 2022_Spring Right 64 9.10
## 6 Student52 2022_spring Right 70 9.75
## dom.max.grip.lbs non.dom.max.grip.lbs dom.fatigue.secs non.dom.fatigue.secs
## 1 42.8 47.0 2.50 NA
## 2 54.8 62.0 29.35 21.25
## 3 73.0 56.0 44.70 28.07
## 4 52.5 57.8 21.33 16.25
## 5 67.0 53.0 22.31 42.55
## 6 75.0 65.0 26.98 39.83
## athlete.status shoe.size.US Avg.sleep.per.night.hrs age.yrs birth.month
## 1 No 9.0 6 19 June
## 2 Yes 7.0 6 20 December
## 3 Yes 8.5 7 19 July
## 4 Yes 7.5 7 20 February
## 5 No 9.0 7 19 June
## 6 No 9.0 7 19 May
## step.length.heeltoheel.in step.length.heeltoheel.backward.in drinkscaffiene
## 1 24.1 15.3 No
## 2 24.8 19.2 Yes
## 3 24.8 23.2 No
## 4 24.7 14.6 Yes
## 5 26.3 24.7 Yes
## 6 33.3 26.2 Yes
## head.circumference.in horoscopesign X X.1 X.2 X.3 X.4 X.5
## 1 23.25 Gemini NA NA NA NA NA NA
## 2 22.80 NA NA NA NA NA NA
## 3 22.50 Cancer NA NA NA NA NA NA
## 4 22.90 Aquarius NA NA NA NA NA NA
## 5 23.50 Gemini NA NA NA NA NA NA
## 6 23.00 Taurus NA NA NA NA NA NA
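Before analyzing, it never hurts to double-check that the cleanup worked. Here is an optional quick check (just a sketch; you can skip it) using base R’s str() and table() to confirm the numeric columns read in as numbers and the Yes/No columns contain only the two spellings.
str(grip1)                  # column types: numeric columns should be numbers, not characters
table(grip1$dom.hand)       # should show only "Left" and "Right"
table(grip1$drinkscaffiene) # should show only "No" and "Yes"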
In the first in-class activity, we are doing a t-test review. Let’s assume we wanted to test this statistical null hypothesis:
Mean fatigue time of left hand dominant students will be the same as that of right hand dominant students.
Here is how we would do the t-test the long way, based on this formula for the two-sample t statistic:
t = (Xbar1 - Xbar2) / sqrt(s1^2/N1 + s2^2/N2)
* Xbar is the mean of your group (1 or 2)
* s is the standard deviation of your group (1 or 2)
* N is the sample size for your group (1 or 2)
# first, subset the dominant-hand fatigue values for the left- and right-handed students
gripleft <- grip1$dom.fatigue.secs[grip1$dom.hand=="Left"]
gripright <- grip1$dom.fatigue.secs[grip1$dom.hand=="Right"]
# solve for the t statistic using the formula
(mean(gripleft) - mean(gripright)) /
(sd(gripleft)^2/length(gripleft) + sd(gripright)^2/length(gripright))^0.5
## [1] 1.691536
Of course, we can just ask R to do it. Remember, since the groups are unbalanced (3 lefties and 14 righties), we want to run a test that does not assume equal variances or equal group sizes. This is the Welch t-test, which t.test() runs by default.
Worksheet question 1c
t.test(grip1$dom.fatigue.secs~grip1$dom.hand)
##
## Welch Two Sample t-test
##
## data: grip1$dom.fatigue.secs by grip1$dom.hand
## t = 1.6915, df = 4.4433, p-value = 0.1588
## alternative hypothesis: true difference in means between group Left and group Right is not equal to 0
## 95 percent confidence interval:
## -8.185592 36.471782
## sample estimates:
## mean in group Left mean in group Right
## 33.67667 19.53357
Analysis of Variance (ANOVA)
We collected a lot of data on Thursday! Go us! That means we can do a lot of analysis. For example, …
* What if we wanted to know whether dominant hand muscle fatigue varies by the number of hours of sleep we get (reported sleep)?
* What if we wanted to know whether our heights vary by our horoscope sign (thanks for that suggestion!)?
The dataset offers many possibilities to compare a dependent variable (such as fatigue or height) across various independent variables. However, when there are more than two groups in the independent variable, we can no longer do a Student or Welch t-test. We have to perform an analysis of variance, or ANOVA.
There are two main types of ANOVA: a one factor ANOVA and a two factor ANOVA (sometimes also called one way or two way ANOVA). One factor ANOVAs have one independent variable, whereas two factor ANOVAs have two independent variables.
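To preview what that difference looks like in R’s formula syntax, here is a small sketch with placeholder names (y, A, B, and mydata are hypothetical, not columns in our data):
# one factor ANOVA: one independent variable
# aov(y ~ A, data = mydata)
# two factor ANOVA: two independent variables (plus, optionally, their interaction)
# aov(y ~ A + B, data = mydata)   # main effects only
# aov(y ~ A * B, data = mydata)   # main effects and interaction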
First, let’s just tackle a one factor ANOVA. Here is an example. This dataset contains birth metrics of many babies, and some information about their mothers.
baby <- read.csv("Birthweight2.csv")
head(baby)
## id headcirumference length Birthweight Gestation smoker motherage mnocig
## 1 822 13 19 7.5 38 0 20 0
## 2 1081 14 21 8.0 38 0 18 0
## 3 1636 14 20 8.6 38 0 29 0
## 4 1107 14 20 7.1 38 0 31 0
## 5 1023 13 20 6.6 38 1 30 12
## 6 1369 13 19 7.0 38 1 31 25
## mheight mppwt fage fedyrs fnocig fheight lowbwt mage35 LowBirthWeight
## 1 62 103 22 14 0 70 0 0 Normal
## 2 67 109 20 12 7 67 0 0 Normal
## 3 64 135 31 16 0 70 0 0 Normal
## 4 64 125 35 16 0 72 0 0 Normal
## 5 64 140 38 14 50 70 0 0 Normal
## 6 63 124 32 16 50 76 0 0 Normal
In the example in class, we are talking about whether a baby’s birthweight differs depending on the gestation period.
Thus, there is one dependent variable = birthweight
And one independent variable = gestation (with 4 groups)
First, let’s look at the summary statistics. In the past, I would have you summarize by subsetting and then getting means. Now we can use some shortcuts. Here is one:
# first, load the Rmisc package. It has been installed for you.
# install.packages("Rmisc")
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
Then, we will run a function called summarySE(). This provides summary stats broken down by an independent variable. BUT, the independent variable has to be a factor (categorical information, such as character or nominal data). Our Gestation data are numbers (38, 39, 40, 41), so we have to turn them into a factor. To do that, we use the as.factor() function.
baby$Gestation <- as.factor(baby$Gestation)
# now run the summarySE function
babySum <- summarySE(data = baby,
measurevar = "Birthweight",
groupvars = "Gestation")
babySum
## Gestation N Birthweight sd se ci
## 1 38 6 7.466667 0.7312090 0.2985148 0.7673568
## 2 39 7 6.814286 1.0205041 0.3857143 0.9438089
## 3 40 9 7.644444 0.9773831 0.3257944 0.7512832
## 4 41 7 7.857143 1.1970201 0.4524311 1.1070590
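If you are curious, the same numbers can be pulled out the long way with base R; this is roughly what summarySE() is doing for the means, standard deviations, and sample sizes:
# group means, standard deviations, and sample sizes the long way
tapply(baby$Birthweight, baby$Gestation, mean)  # mean birthweight per gestation group
tapply(baby$Birthweight, baby$Gestation, sd)    # standard deviation per group
table(baby$Gestation)                           # sample size (N) per group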
Now, let’s run the actual ANOVA. For this, we use linear modeling functions, such as the lm() function. First, name your model (I named mine results.lm), and then run the ANOVA on that model with anova().
results.lm <- lm(baby$Birthweight ~ baby$Gestation)
anova(results.lm)
## Analysis of Variance Table
##
## Response: baby$Birthweight
## Df Sum Sq Mean Sq F value Pr(>F)
## baby$Gestation 3 4.3291 1.4430 1.4338 0.2565
## Residuals 25 25.1613 1.0065
Here, we find an F ratio of 1.4338, which gives us a p-value of 0.2565. Thus, it looks like we would fail to reject our null hypothesis.
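As a side note, the same ANOVA table can also be produced with the aov() function directly; you should get the same F ratio and p-value either way. A quick sketch:
# an equivalent way to run the same one factor ANOVA
results.aov <- aov(Birthweight ~ Gestation, data = baby)
summary(results.aov)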
Sometimes, if our result IS significant (but not here), we might also want to see whether there are significant differences between specific groups. In other words, the ANOVA tells you whether there are differences among groups, not which groups differ. To determine which groups, we need what is called a post-hoc test. A very common one is the Tukey HSD. For this function, it is important that your independent variable is a factor (which we changed above).
TukeyHSD(aov(baby$Birthweight ~ baby$Gestation))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = baby$Birthweight ~ baby$Gestation)
##
## $`baby$Gestation`
## diff lwr upr p adj
## 39-38 -0.6523810 -2.1876256 0.8828637 0.6514550
## 40-38 0.1777778 -1.2766070 1.6321625 0.9865903
## 41-38 0.3904762 -1.1447685 1.9257209 0.8962529
## 40-39 0.8301587 -0.5604991 2.2208166 0.3744453
## 41-39 1.0428571 -0.4321582 2.5178725 0.2356823
## 41-40 0.2126984 -1.1779594 1.6033562 0.9744138
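If you prefer a visual, R can also plot those pairwise confidence intervals directly from the Tukey result (intervals that cross zero correspond to non-significant comparisons):
# plot the Tukey confidence intervals for each pairwise difference
plot(TukeyHSD(aov(baby$Birthweight ~ baby$Gestation)))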
Now you analyze your own data: examine dominant hand fatigue across the groups reporting different amounts of sleep.
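Here is one possible starting point, mirroring the birthweight example above. The column names come from the head() output at the top of this page; adjust them if yours differ.
# sketch: does dominant hand fatigue differ by reported hours of sleep?
grip1$Avg.sleep.per.night.hrs <- as.factor(grip1$Avg.sleep.per.night.hrs)  # sleep hours as groups
sleep.lm <- lm(grip1$dom.fatigue.secs ~ grip1$Avg.sleep.per.night.hrs)
anova(sleep.lm)
# if the ANOVA were significant, a post-hoc test could follow:
# TukeyHSD(aov(grip1$dom.fatigue.secs ~ grip1$Avg.sleep.per.night.hrs))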
Let’s just briefly explore the awesomeness of ggplot2.
# install.packages("ggplot2")
library(ggplot2)
This makes a simple bar graph. ggplot works off a base and adds elements to it.
base <- ggplot(data=babySum, aes(x = Gestation, y=Birthweight))
base
Notice it gives you the framework for a graph. Now, we can add elements to it. First, let’s add the simple bars.
base +
geom_bar(stat = "identity",
fill = c(rainbow(4)))
Now, we can make the bar graph a little better, with error bars and color.
base +
geom_bar(stat = "identity",
fill = c(rainbow(4))) +
geom_errorbar(aes(ymin = Birthweight-ci,
ymax = Birthweight+ci),
width=0.5)
Here is a simple bar graph, with error bars.
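As an optional final touch-up, you can add axis labels and a cleaner background with labs() and theme_classic(). A sketch (the label text is just a suggestion):
# same plot with axis labels and a cleaner theme
base +
  geom_bar(stat = "identity",
           fill = c(rainbow(4))) +
  geom_errorbar(aes(ymin = Birthweight - ci,
                    ymax = Birthweight + ci),
                width = 0.5) +
  labs(x = "Gestation (weeks)",
       y = "Mean birthweight") +
  theme_classic()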