Simple Inference Tests

Code and results from module 9 on inference tests, using flower data, deer data, and LOTR.

Run a t-test to compare the Legolas actors to the set of Aragorns and then the set of Gimlis. Do you find evidence for significant differences?

My hypothesis is that there will be a significant difference in height between Legolas actors and Gimli actors, and that there will also be a significant difference in height between Legolas actors and Aragorn actors. The null hypothesis is that there is no significant difference for either of them.

aragorn = rnorm(50, mean=180, sd=10)
gimli = rnorm(50, mean=132, sd=15)
legolas = rnorm(50, mean=195, sd=15)
t.test(aragorn, legolas, alternative ="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  aragorn and legolas
## t = -6.4024, df = 88.27, p-value = 7.183e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -22.10987 -11.63573
## sample estimates:
## mean of x mean of y 
##  179.4580  196.3308
t.test(gimli, legolas, alternative ="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  gimli and legolas
## t = -22.01, df = 96.826, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -69.26706 -57.80782
## sample estimates:
## mean of x mean of y 
##  132.7933  196.3308

Both of these t-values are relatively large, although the t-value for the Legolas/Gimli comparison is much larger indicating that the difference in the means is greater. Both of these differences are significant, as both p-values are below the commonly used threshold of 0.05. However, the Legolas/Gimli p-value is much smaller, indicating stronger evidence against the null hypothesis.

Re-run the variance test (F-test) to compare the group of Gimli and Legolas actors. Do these groups have different variance?

gimli = rnorm(50, mean=132, sd=15)
legolas = rnorm(50, mean=195, sd=15)
actors = c(rep("Gimli", 50), rep("Legolas", 50))
heights = c(gimli, legolas)
fit = aov(heights ~ actors)
fit
## Call:
##    aov(formula = heights ~ actors)
## 
## Terms:
##                   actors Residuals
## Sum of Squares  94878.78  19512.60
## Deg. of Freedom        1        98
## 
## Residual standard error: 14.11057
## Estimated effects may be unbalanced
summary(aov(fit))
##             Df Sum Sq Mean Sq F value Pr(>F)    
## actors       1  94879   94879   476.5 <2e-16 ***
## Residuals   98  19513     199                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is very low, so these groups do have different variance.

Redo the correlation for the Sepal Length and Sepal Width for the Iris dataset, but for the three individual species. Are these correlated?

iris = read.csv("iris (2).csv")
#Setosa: 
cor.test(iris$Sepal.Length[iris$Species == "setosa"],
         iris$Sepal.Width[iris$Species == "setosa"])
## 
##  Pearson's product-moment correlation
## 
## data:  iris$Sepal.Length[iris$Species == "setosa"] and iris$Sepal.Width[iris$Species == "setosa"]
## t = 7.6807, df = 48, p-value = 6.71e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5851391 0.8460314
## sample estimates:
##       cor 
## 0.7425467
#Versicolor:
cor.test(iris$Sepal.Length[iris$Species == "versicolor"],
         iris$Sepal.Width[iris$Species == "versicolor"])
## 
##  Pearson's product-moment correlation
## 
## data:  iris$Sepal.Length[iris$Species == "versicolor"] and iris$Sepal.Width[iris$Species == "versicolor"]
## t = 4.2839, df = 48, p-value = 8.772e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2900175 0.7015599
## sample estimates:
##       cor 
## 0.5259107
#Virginica
cor.test(iris$Sepal.Length[iris$Species == "virginica"],
         iris$Sepal.Width[iris$Species == "virginica"])
## 
##  Pearson's product-moment correlation
## 
## data:  iris$Sepal.Length[iris$Species == "virginica"] and iris$Sepal.Width[iris$Species == "virginica"]
## t = 3.5619, df = 48, p-value = 0.0008435
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049657 0.6525292
## sample estimates:
##       cor 
## 0.4572278

All of these correlation tests produce p-values less than 0.05, indicating that for all species, sepal length is positively correlated with sepal width. There is the most evidence for this correlation in the Setosa species, as that was the species with the smallest p-value.

Using the deer dataset and the chisq.test() function, test:

If there are significant differences in the number of deer caught per month:

deer = read.csv("deer.csv")
table(deer$Month)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 256 165  27   3   2  35  11  19  58 168 189 188
chisq.test(table(deer$Month))
## 
##  Chi-squared test for given probabilities
## 
## data:  table(deer$Month)
## X-squared = 997.07, df = 11, p-value < 2.2e-16

The p-value is very small, so there are significant differences in the number of deer caught per month.

If the cases of tuberculosis are uniformly distributed across all farms:

deer = read.csv("deer.csv")
table(deer$Farm, deer$Tb)
##       
##          0   1
##   AL    10   3
##   AU    23   0
##   BA    67   5
##   BE     7   0
##   CB    88   3
##   CRC    4   0
##   HB    22   1
##   LCV    0   1
##   LN    28   6
##   MAN   27  24
##   MB    16   5
##   MO   186  31
##   NC    24   4
##   NV    18   1
##   PA    11   0
##   PN    39   0
##   QM    67   7
##   RF    23   1
##   RN    21   0
##   RO    31   0
##   SAL    0   1
##   SAU    3   0
##   SE    16  10
##   TI     9   0
##   TN    16   2
##   VISO  13   1
##   VY    15   4
chisq.test(table(deer$Farm, deer$Tb))
## Warning in chisq.test(table(deer$Farm, deer$Tb)): Chi-squared approximation may
## be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(deer$Farm, deer$Tb)
## X-squared = 129.09, df = 26, p-value = 1.243e-15

The P value is very small, so there are significant differences in the cases of tuberculosis across farms. The warning message may be because some expected cell counts are too small.

The end!