R Programming for Biologists: Workshop 4

Gavin Douglas
Aug. 28th, 2018

Workshop outline: Basic statistics in R


  • Statistical tests

  • Simulations and random sampling

  • Building and applying models

Example statistical tests


  • t.test

  • wilcox.test

  • fisher.test

  • cor.test

Iris dataset statistical testing examples

?iris

First, subset the data into the two species being compared:

iris_virginica <- iris[which(iris$Species == "virginica"), ]

iris_versicolor <- iris[which(iris$Species == "versicolor"), ]

Testing for differences between 2 distributions


t.test(iris_virginica$Sepal.Length, iris_versicolor$Sepal.Length)

wilcox.test(iris_virginica$Sepal.Length, iris_versicolor$Sepal.Length)
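
Both calls return "htest" result objects, so the output can be stored and individual elements pulled out. A minimal sketch (the name sepal_t is just for illustration):

sepal_t <- t.test(iris_virginica$Sepal.Length, iris_versicolor$Sepal.Length)
sepal_t$p.value   # just the p-value
sepal_t$estimate  # the two group means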

Power calculations for certain tests can be run in R


power.t.test(n = 10, delta = 1, sig.level=0.05)

     Two-sample t test power calculation 

              n = 10
          delta = 1
             sd = 1
      sig.level = 0.05
          power = 0.5619846
    alternative = two.sided

NOTE: n is number in *each* group
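
power.t.test will solve for whichever argument is left out. For example, supplying the desired power instead of n returns the sample size needed per group (a sketch, not from the original slides; here n comes back as roughly 16.7, i.e. 17 per group):

power.t.test(delta = 1, sig.level = 0.05, power = 0.8)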

Simple correlations in R

petal_pearson_cor <- cor.test(iris$Petal.Width, iris$Petal.Length, method="pearson")

sepal_spearman_cor <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method="spearman")
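
The stored results are again "htest" objects, so the correlation coefficient and p-value can be pulled out directly (a sketch):

petal_pearson_cor$estimate  # Pearson's r
petal_pearson_cor$p.value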

Fisher's exact test example

TeaTasting <- matrix(c(3, 1, 1, 3), nrow = 2,
                     dimnames = list(Guess = c("Milk", "Tea"),
                                     Truth = c("Milk", "Tea")))
TeaTasting
      Truth
Guess  Milk Tea
  Milk    3   1
  Tea     1   3
fisher.test(TeaTasting, alternative = "greater")

    Fisher's Exact Test for Count Data

data:  TeaTasting
p-value = 0.2429
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.3135693       Inf
sample estimates:
odds ratio 
  6.408309 

Adjusting for multiple tests with p.adjust


set1 <- c(0.4, 0.01, 0.001, 0.005)
p.adjust(set1, "bonferroni")
[1] 1.000 0.040 0.004 0.020
p.adjust(set1, "fdr")
[1] 0.40000000 0.01333333 0.00400000 0.01000000
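
In practice the raw p-values usually come from a batch of related tests. A sketch (not from the slides) using the species subsets from earlier, testing all four iris measurements and correcting the results together:

measure_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
raw_p <- sapply(measure_cols, function(col) {
  t.test(iris_virginica[, col], iris_versicolor[, col])$p.value
})
p.adjust(raw_p, method = "fdr")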

Functions for probability distributions in R


  • p for probability (cdf)
  • q for quantile
  • d for density (pdf)
  • r for random sampling
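
These prefixes combine with a distribution name, e.g. norm for the normal distribution. A quick sketch of all four:

dnorm(0)      # density at 0 (~0.399)
pnorm(1.96)   # P(X <= 1.96) (~0.975)
qnorm(0.975)  # quantile: the inverse of pnorm (~1.96)
rnorm(3)      # three random draws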

Setting the random seed for reproducibility

set.seed(15)
rnorm(5)
[1]  0.2588229  1.8311207 -0.3396186  0.8971982  0.4880163
rnorm(5)
[1] -1.2553858  0.0227882  1.0907732 -0.1321224 -1.0750013
set.seed(15)
rnorm(5)
[1]  0.2588229  1.8311207 -0.3396186  0.8971982  0.4880163

Simulating a linear model part 1 (Coursera example)

y = B0 + B1*x + e

Where:

  • B0 = 0.5
  • B1 = 2
  • e ~ N(0, 2^2)
  • x ~ N(0, 1)

Simulating a linear model part 2 (Coursera example)

set.seed(20); x <- rnorm(100); e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e
plot(x, y)

[Scatterplot of the simulated x and y values]
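
As a quick check (not in the original slides), fitting a linear model to the simulated data should recover coefficients close to the true values of 0.5 and 2:

coef(lm(y ~ x))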

Other common distributions used in R


  • binom
  • chisq
  • nbinom
  • pois
  • unif
  • and many others!
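
Each of these works with the same four prefixes. A couple of sketches:

rbinom(5, size = 10, prob = 0.5)  # successes in 10 coin flips, 5 draws
rpois(5, lambda = 3)              # 5 draws from a Poisson with mean 3
runif(5, min = 0, max = 1)        # 5 uniform draws between 0 and 1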

Taking random samples in R


my_vec <- c(20, 1, 233, 45, 7138)
sample(my_vec, size=2)
[1]    1 7138
sample(my_vec, size=2)
[1] 20 45
sample(my_vec, size=10, replace=TRUE)
 [1]   45    1   20   45 7138   20    1  233 7138   20

Linear regression in R

The simplest version is a one-factor model of the form:

y = a0 + a1 * x

a1 is the slope and a0 is the intercept.

For these examples we'll be using the widely used mtcars dataset. We'll be investigating which factors are associated with miles per gallon.

Miles per gallon against weight

plot(mtcars$wt, mtcars$mpg)

[Scatterplot of mtcars$wt against mtcars$mpg]

Building a linear model with 1 factor

mtcars_lm1 <- lm(mpg ~ wt, data=mtcars)
mtcars_lm1

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344  

Typing summary(mtcars_lm1) returns more details
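
Pieces of the fit can also be pulled out programmatically (a sketch):

coef(mtcars_lm1)               # intercept and slope
confint(mtcars_lm1)            # 95% confidence intervals
summary(mtcars_lm1)$r.squared  # proportion of variance explained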

Plotting fitted line

plot(mtcars$wt, mtcars$mpg)
abline(mtcars_lm1)

[Scatterplot of mtcars$wt against mtcars$mpg with the fitted line]

Evaluating model fit


  • Check whether the standard error estimates are ~1/10th the size of the corresponding coefficients
  • Check whether each coefficient is significant
  • Check whether the residuals are normally distributed (run plot(mtcars_lm1))
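
The residual check can be done with the built-in diagnostic plots; a sketch showing all four panels at once:

par(mfrow = c(2, 2))  # 2x2 grid of plots
plot(mtcars_lm1)      # residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(1, 1))  # reset the plotting layout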

Building a linear model with multiple factors

y = a0 + a1*x1 + a2*x2 + ... + an*xn

You could plot all pairwise scatterplots between columns of mtcars with pairs(mtcars, gap=0.9).

Splitting into training and test datasets

We'll partition the data into training and test sets to have an independent way to evaluate our final models.

mtcars_train_rows <- sample(1:32, 24)
mtcars_train <- mtcars[mtcars_train_rows,]
mtcars_test <- mtcars[-mtcars_train_rows,]

Specifying a multiple regression model using all columns in the data frame

mtcars_lm2 <- lm(mpg ~ ., data=mtcars_train)

We'll keep track of this model for later:

mtcars_lm2_orig <- mtcars_lm2

Now let's prune unnecessary variables from this model with backward selection.

Excluding variables with backward selection

Repeatedly run summary(mtcars_lm2) and remove the variable with the highest p-value, refitting each time, until all remaining variables are significant.

Use this syntax to remove variables from the model:

mtcars_lm2 <- update(mtcars_lm2, .~. - VARIABLE, data = mtcars_train)
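
The held-out test set gives an independent check on the final model. A sketch (not in the original slides) comparing the full and pruned models by their prediction error on mtcars_test:

full_pred <- predict(mtcars_lm2_orig, newdata = mtcars_test)
pruned_pred <- predict(mtcars_lm2, newdata = mtcars_test)
sqrt(mean((mtcars_test$mpg - full_pred)^2))    # RMSE of the full model
sqrt(mean((mtcars_test$mpg - pruned_pred)^2))  # RMSE of the pruned model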

Useful and free reference for linear regression in R

Workshop feedback

Any anonymous feedback would be appreciated here: https://www.suggestionox.com/r/JP3K9k