Code and results from module 9 on inference tests, using flower data, deer data, and LOTR.
Run a t-test to compare the Legolas actors to the set of Aragorns and then the set of Gimlis. Do you find evidence for significant differences?
My hypothesis is that there will be a significant difference in height between Legolas actors and Gimli actors, and that there will also be a significant difference in height between Legolas actors and Aragorn actors. The null hypothesis is that there is no significant difference for either of them.
aragorn = rnorm(50, mean=180, sd=10)
gimli = rnorm(50, mean=132, sd=15)
legolas = rnorm(50, mean=195, sd=15)
t.test(aragorn, legolas, alternative ="two.sided")
##
## Welch Two Sample t-test
##
## data: aragorn and legolas
## t = -6.4024, df = 88.27, p-value = 7.183e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -22.10987 -11.63573
## sample estimates:
## mean of x mean of y
## 179.4580 196.3308
t.test(gimli, legolas, alternative ="two.sided")
##
## Welch Two Sample t-test
##
## data: gimli and legolas
## t = -22.01, df = 96.826, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -69.26706 -57.80782
## sample estimates:
## mean of x mean of y
## 132.7933 196.3308
Both of these t-values are relatively large, although the t-value for the Legolas/Gimli comparison is much larger indicating that the difference in the means is greater. Both of these differences are significant, as both p-values are below the commonly used threshold of 0.05. However, the Legolas/Gimli p-value is much smaller, indicating stronger evidence against the null hypothesis.
Re-run the variance test (F-test) to compare the group of Gimli and Legolas actors. Do these groups have different variance?
gimli = rnorm(50, mean=132, sd=15)
legolas = rnorm(50, mean=195, sd=15)
actors = c(rep("Gimli", 50), rep("Legolas", 50))
heights = c(gimli, legolas)
fit = aov(heights ~ actors)
fit
## Call:
## aov(formula = heights ~ actors)
##
## Terms:
## actors Residuals
## Sum of Squares 94878.78 19512.60
## Deg. of Freedom 1 98
##
## Residual standard error: 14.11057
## Estimated effects may be unbalanced
summary(aov(fit))
## Df Sum Sq Mean Sq F value Pr(>F)
## actors 1 94879 94879 476.5 <2e-16 ***
## Residuals 98 19513 199
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is very low, so these groups do have different variance.
Redo the correlation for the Sepal Length and Sepal Width for the Iris dataset, but for the three individual species. Are these correlated?
iris = read.csv("iris (2).csv")
#Setosa:
cor.test(iris$Sepal.Length[iris$Species == "setosa"],
iris$Sepal.Width[iris$Species == "setosa"])
##
## Pearson's product-moment correlation
##
## data: iris$Sepal.Length[iris$Species == "setosa"] and iris$Sepal.Width[iris$Species == "setosa"]
## t = 7.6807, df = 48, p-value = 6.71e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5851391 0.8460314
## sample estimates:
## cor
## 0.7425467
#Versicolor:
cor.test(iris$Sepal.Length[iris$Species == "versicolor"],
iris$Sepal.Width[iris$Species == "versicolor"])
##
## Pearson's product-moment correlation
##
## data: iris$Sepal.Length[iris$Species == "versicolor"] and iris$Sepal.Width[iris$Species == "versicolor"]
## t = 4.2839, df = 48, p-value = 8.772e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2900175 0.7015599
## sample estimates:
## cor
## 0.5259107
#Virginica
cor.test(iris$Sepal.Length[iris$Species == "virginica"],
iris$Sepal.Width[iris$Species == "virginica"])
##
## Pearson's product-moment correlation
##
## data: iris$Sepal.Length[iris$Species == "virginica"] and iris$Sepal.Width[iris$Species == "virginica"]
## t = 3.5619, df = 48, p-value = 0.0008435
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049657 0.6525292
## sample estimates:
## cor
## 0.4572278
All of these correlation tests produce p-values less than 0.05, indicating that for all species, sepal length is positively correlated with sepal width. There is the most evidence for this correlation in the Setosa species, as that was the species with the smallest p-value.
Using the deer dataset and the chisq.test() function, test:
If there are significant differences in the number of deer caught per month:
deer = read.csv("deer.csv")
table(deer$Month)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 256 165 27 3 2 35 11 19 58 168 189 188
chisq.test(table(deer$Month))
##
## Chi-squared test for given probabilities
##
## data: table(deer$Month)
## X-squared = 997.07, df = 11, p-value < 2.2e-16
The p-value is very small, so there are significant differences in the number of deer caught per month.
If the cases of tuberculosis are uniformly distributed across all farms:
deer = read.csv("deer.csv")
table(deer$Farm, deer$Tb)
##
## 0 1
## AL 10 3
## AU 23 0
## BA 67 5
## BE 7 0
## CB 88 3
## CRC 4 0
## HB 22 1
## LCV 0 1
## LN 28 6
## MAN 27 24
## MB 16 5
## MO 186 31
## NC 24 4
## NV 18 1
## PA 11 0
## PN 39 0
## QM 67 7
## RF 23 1
## RN 21 0
## RO 31 0
## SAL 0 1
## SAU 3 0
## SE 16 10
## TI 9 0
## TN 16 2
## VISO 13 1
## VY 15 4
chisq.test(table(deer$Farm, deer$Tb))
## Warning in chisq.test(table(deer$Farm, deer$Tb)): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: table(deer$Farm, deer$Tb)
## X-squared = 129.09, df = 26, p-value = 1.243e-15
The P value is very small, so there are significant differences in the cases of tuberculosis across farms. The warning message may be because some expected cell counts are too small.
The end!