Module 09 - Simple inference tests in R

Exercises

  • Make up a vector of 50 random Legolas actors, with a mean height of 195 cm and a standard deviation of 15 cm. Run a t-test to compare this sample of actors to the set of Aragorns and then to the set of Gimlis.
aragorn = rnorm(50, mean = 180, sd = 10)
gimli = rnorm(50, mean = 132, sd = 15)
legolas = rnorm(50, mean = 195, sd = 15)
t.test(legolas, aragorn, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  legolas and aragorn
## t = 6.6858, df = 87.098, p-value = 2.099e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  12.01283 22.17695
## sample estimates:
## mean of x mean of y 
##  195.6145  178.5196
t.test(legolas, gimli, alternative = "two.sided")  
## 
##  Welch Two Sample t-test
## 
## data:  legolas and gimli
## t = 20.826, df = 97.721, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  57.61429 69.75091
## sample estimates:
## mean of x mean of y 
##  195.6145  131.9319
  • Do you find evidence for significant differences?

Since the p-values for both t-tests are close to zero (p = 2.099e-09 and p < 2.2e-16), there is strong evidence of a significant difference in mean height between the Legolas actors and both the Aragorn and the Gimli actors.
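
Because the samples are drawn at random, the numbers above change on every run. The sketch below shows one way to make the comparison reproducible and to pull the p-values out of the test objects rather than reading them off the printout. The seed value and the names `vs_aragorn`, `vs_gimli` and `p_values` are illustrative assumptions, not part of the original exercise.

```r
# Fix the seed so the simulated heights are reproducible (the value 42 is arbitrary).
set.seed(42)
aragorn = rnorm(50, mean = 180, sd = 10)
gimli = rnorm(50, mean = 132, sd = 15)
legolas = rnorm(50, mean = 195, sd = 15)

# t.test() returns a list of class "htest"; store it instead of only printing it.
vs_aragorn = t.test(legolas, aragorn, alternative = "two.sided")
vs_gimli = t.test(legolas, gimli, alternative = "two.sided")

# Collect both p-values and compare them with a 5% significance level.
p_values = c(aragorn = vs_aragorn$p.value, gimli = vs_gimli$p.value)
p_values < 0.05

# 95% confidence interval for the difference in means (legolas minus aragorn).
vs_aragorn$conf.int
```

Storing the result also gives access to `$statistic`, `$parameter` (the Welch degrees of freedom) and `$estimate` if they are needed later.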

  • Re-run the variance test (F-test) to compare the group of Gimli and Legolas actors.
var.test(gimli, legolas)
## 
##  F test to compare two variances
## 
## data:  gimli and legolas
## F = 1.1128, num df = 49, denom df = 49, p-value = 0.7098
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6314953 1.9609876
## sample estimates:
## ratio of variances 
##           1.112814
  • Do these groups have different variance?

The p-value is high (p = 0.7098), so there is no evidence of a significant difference in variance between the two groups.
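
As a rough illustration of what the F-test reports, the sketch below checks that the F statistic is simply the ratio of the two sample variances. It then shows one possible follow-up: when the variances look similar, `t.test()` can be asked for the pooled-variance (Student) version with `var.equal = TRUE`. The seed and the names `f_result` and `pooled` are assumptions made for this example.

```r
# Re-create the two groups with a fixed (arbitrary) seed for reproducibility.
set.seed(42)
gimli = rnorm(50, mean = 132, sd = 15)
legolas = rnorm(50, mean = 195, sd = 15)

# F-test for equality of variances.
f_result = var.test(gimli, legolas)
f_result$p.value

# The F statistic is the ratio of the two sample variances.
var(gimli) / var(legolas)
f_result$statistic

# With similar variances, a pooled-variance t-test is an option.
pooled = t.test(legolas, gimli, var.equal = TRUE)
pooled$parameter  # degrees of freedom: 50 + 50 - 2 = 98
```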

  • Redo the correlation for the Sepal Length and Sepal Width for the Iris dataset, but for the three individual species.
library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
iris = read.csv("iris.csv") 
iris %>%
  group_by(Species)
## # A tibble: 150 x 6
## # Groups:   Species [3]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species  Code
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>   <int>
##  1          5.1         3.5          1.4         0.2 setosa      1
##  2          4.9         3            1.4         0.2 setosa      1
##  3          4.7         3.2          1.3         0.2 setosa      1
##  4          4.6         3.1          1.5         0.2 setosa      1
##  5          5           3.6          1.4         0.2 setosa      1
##  6          5.4         3.9          1.7         0.4 setosa      1
##  7          4.6         3.4          1.4         0.3 setosa      1
##  8          5           3.4          1.5         0.2 setosa      1
##  9          4.4         2.9          1.4         0.2 setosa      1
## 10          4.9         3.1          1.5         0.1 setosa      1
## # … with 140 more rows
setosa = iris %>%
  filter(Species == "setosa")
versicolor = iris %>%
  filter(Species == "versicolor")
virginica = iris %>%
  filter(Species == "virginica")
cor.test(setosa$Sepal.Length, setosa$Sepal.Width)
## 
##  Pearson's product-moment correlation
## 
## data:  setosa$Sepal.Length and setosa$Sepal.Width
## t = 7.6807, df = 48, p-value = 6.71e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5851391 0.8460314
## sample estimates:
##       cor 
## 0.7425467
cor.test(versicolor$Sepal.Length, versicolor$Sepal.Width)
## 
##  Pearson's product-moment correlation
## 
## data:  versicolor$Sepal.Length and versicolor$Sepal.Width
## t = 4.2839, df = 48, p-value = 8.772e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2900175 0.7015599
## sample estimates:
##       cor 
## 0.5259107
cor.test(virginica$Sepal.Length, virginica$Sepal.Width)
## 
##  Pearson's product-moment correlation
## 
## data:  virginica$Sepal.Length and virginica$Sepal.Width
## t = 3.5619, df = 48, p-value = 0.0008435
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049657 0.6525292
## sample estimates:
##       cor 
## 0.4572278
  • Are these correlated?

The low p-values for all three species indicate a significant positive correlation between Sepal Length and Sepal Width within each species.
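
The three separate `filter()` and `cor.test()` calls can also be written as a single loop over species. The sketch below uses base R's `split()` and `sapply()` on the built-in `datasets::iris` data frame, assuming it holds the same measurements as `iris.csv`. The names `by_species` and `results` are illustrative.

```r
# Split the built-in iris data into one data frame per species.
by_species = split(datasets::iris, datasets::iris$Species)

# Run cor.test() on each subset and keep the estimate and the p-value.
results = sapply(by_species, function(d) {
  ct = cor.test(d$Sepal.Length, d$Sepal.Width)
  c(cor = unname(ct$estimate), p.value = ct$p.value)
})

# One column per species, one row per quantity.
round(results, 4)

# Correlation with all species pooled together, for comparison.
cor(datasets::iris$Sepal.Length, datasets::iris$Sepal.Width)
```

Comparing the per-species values with the pooled one shows why grouping matters when the groups differ in their average measurements.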

  • Using the deer dataset and the chisq.test() function, test if there are significant differences in the number of deer caught per month.
deer = read.csv("deer.csv")
str(deer)
## 'data.frame':    1182 obs. of  9 variables:
##  $ Farm   : chr  "AL" "AL" "AL" "AL" ...
##  $ Month  : int  10 10 10 10 10 10 10 10 10 10 ...
##  $ Year   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Sex    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ clas1_4: int  4 4 3 4 4 4 4 4 4 4 ...
##  $ LCT    : num  191 180 192 196 204 190 196 200 197 208 ...
##  $ KFI    : num  20.4 16.4 15.9 17.3 NA ...
##  $ Ecervi : num  0 0 2.38 0 0 0 1.21 0 0.8 0 ...
##  $ Tb     : int  0 0 0 0 NA 0 NA 1 0 0 ...
table(deer$Month)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 256 165  27   3   2  35  11  19  58 168 189 188
chisq.test(table(deer$Month))
## 
##  Chi-squared test for given probabilities
## 
## data:  table(deer$Month)
## X-squared = 997.07, df = 11, p-value < 2.2e-16

Since the p-value for this test is very low (p < 2.2e-16), there is a significant difference in the number of deer caught per month: catches are not spread evenly across the year.
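
To see what the test is doing, the sketch below re-enters the monthly counts printed by `table(deer$Month)` above, so it runs without `deer.csv`. With no `p` argument, `chisq.test()` compares the counts against equal probabilities for every month. The names `month_counts`, `month_test` and `by_hand` are assumptions made for this example.

```r
# Monthly counts copied from the table(deer$Month) output above.
month_counts = c(256, 165, 27, 3, 2, 35, 11, 19, 58, 168, 189, 188)
names(month_counts) = 1:12

# Goodness-of-fit test against equal probabilities (the default).
month_test = chisq.test(month_counts)

# Under the null hypothesis every month has the same expected count,
# sum(month_counts) / 12.
month_test$expected[1]

# The statistic is sum((observed - expected)^2 / expected).
by_hand = sum((month_counts - month_test$expected)^2 / month_test$expected)
by_hand

# Pearson residuals show which months deviate most from the expectation.
round(month_test$residuals, 1)
```

Large positive residuals mark months with more deer than expected under a uniform distribution, and large negative residuals mark months with fewer.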

  • Using the deer dataset and the chisq.test() function, test if the cases of tuberculosis are uniformly distributed across all farms.
table(deer$Tb, deer$Farm)
##    
##      AL  AU  BA  BE  CB CRC  HB LCV  LN MAN  MB  MO  NC  NV  PA  PN  QM  RF  RN
##   0  10  23  67   7  88   4  22   0  28  27  16 186  24  18  11  39  67  23  21
##   1   3   0   5   0   3   0   1   1   6  24   5  31   4   1   0   0   7   1   0
##    
##      RO SAL SAU  SE  TI  TN VISO  VY
##   0  31   0   3  16   9  16   13  15
##   1   0   1   0  10   0   2    1   4
chisq.test(table(deer$Tb, deer$Farm))
## Warning in chisq.test(table(deer$Tb, deer$Farm)): Chi-squared approximation may
## be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(deer$Tb, deer$Farm)
## X-squared = 129.09, df = 26, p-value = 1.243e-15

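Since the p-value is so low (p = 1.243e-15), it is unlikely that the relationship between tuberculosis cases and farms occurred due to random chance: cases do not appear to be uniformly distributed across farms. The warning indicates that some expected counts are small, so the chi-squared approximation should be treated with caution.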
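
One way to follow up on the warning is to count how many expected cell counts are small and to re-run the test with a Monte Carlo p-value, which does not rely on the chi-squared approximation. The sketch below re-enters the counts printed by `table(deer$Tb, deer$Farm)` above, so it runs without `deer.csv`. The seed, the number of replicates `B` and the names `farms`, `tb_by_farm`, `asymptotic` and `simulated` are assumptions made for this example.

```r
# Counts copied from the table(deer$Tb, deer$Farm) output above.
farms = c("AL", "AU", "BA", "BE", "CB", "CRC", "HB", "LCV", "LN", "MAN",
          "MB", "MO", "NC", "NV", "PA", "PN", "QM", "RF", "RN",
          "RO", "SAL", "SAU", "SE", "TI", "TN", "VISO", "VY")
tb_by_farm = rbind(
  "0" = c(10, 23, 67, 7, 88, 4, 22, 0, 28, 27, 16, 186, 24, 18, 11, 39,
          67, 23, 21, 31, 0, 3, 16, 9, 16, 13, 15),
  "1" = c(3, 0, 5, 0, 3, 0, 1, 1, 6, 24, 5, 31, 4, 1, 0, 0,
          7, 1, 0, 0, 1, 0, 10, 0, 2, 1, 4)
)
colnames(tb_by_farm) = farms

# Standard test; the warning is suppressed here because it is examined below.
asymptotic = suppressWarnings(chisq.test(tb_by_farm))

# Number of cells with an expected count below 5, the usual rule of thumb.
sum(asymptotic$expected < 5)

# Monte Carlo p-value based on B simulated tables with the same margins.
set.seed(1)  # arbitrary seed, only for reproducibility
simulated = chisq.test(tb_by_farm, simulate.p.value = TRUE, B = 10000)
simulated$p.value

# Proportion of Tb-positive animals per farm.
round(tb_by_farm["1", ] / colSums(tb_by_farm), 2)
```

The simulated p-value cannot be smaller than 1 / (B + 1), so it will look much larger than the asymptotic one even when both point to the same conclusion.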