Gavin Douglas
Aug. 28th, 2018
Statistical tests
Simulations and random sampling
Building and applying models
t.test
wilcox.test
fisher.test
cor.test
?iris  # documentation for the built-in iris dataset
First, subset the data by species:
iris_virginica <- iris[which(iris$Species == "virginica"), ]
iris_versicolor <- iris[which(iris$Species == "versicolor"), ]
t.test(iris_virginica$Sepal.Length, iris_versicolor$Sepal.Length)
wilcox.test(iris_virginica$Sepal.Length, iris_versicolor$Sepal.Length)
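Both tests return an htest object, so individual components can be pulled out directly. A minimal sketch (sepal_t is just an illustrative name):
sepal_t <- t.test(iris_virginica$Sepal.Length, iris_versicolor$Sepal.Length)  # save the result
sepal_t$p.value   # just the p-value
sepal_t$estimate  # the two group means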
power.t.test(n = 10, delta = 1, sig.level=0.05)
     Two-sample t test power calculation

              n = 10
          delta = 1
             sd = 1
      sig.level = 0.05
          power = 0.5619846
    alternative = two.sided

NOTE: n is number in *each* group
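You can instead leave n out and solve for the sample size needed to reach a desired power:
power.t.test(power = 0.8, delta = 1, sig.level = 0.05)  # solves for n per group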
petal_pearson_cor <- cor.test(iris$Petal.Width, iris$Petal.Length, method="pearson")
sepal_spearman_cor <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method="spearman")
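The returned objects again hold the key components:
petal_pearson_cor$estimate  # Pearson correlation coefficient
petal_pearson_cor$p.value   # p-value for the correlation
Note that with tied values (as in these columns) the Spearman test warns that an exact p-value cannot be computed; that warning is expected.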
TeaTasting <- matrix(c(3, 1, 1, 3), nrow = 2, dimnames = list(Guess = c("Milk", "Tea"),
Truth = c("Milk", "Tea")))
TeaTasting
       Truth
Guess  Milk Tea
  Milk    3   1
  Tea     1   3
fisher.test(TeaTasting, alternative = "greater")
Fisher's Exact Test for Count Data
data: TeaTasting
p-value = 0.2429
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
0.3135693 Inf
sample estimates:
odds ratio
6.408309
When running multiple tests, correct the p-values for multiple comparisons with p.adjust:
set1 <- c(0.4, 0.01, 0.001, 0.005)
p.adjust(set1, "bonferroni")
[1] 1.000 0.040 0.004 0.020
p.adjust(set1, "fdr")
[1] 0.40000000 0.01333333 0.00400000 0.01000000
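A common follow-up is to keep only the tests that remain significant after correction, e.g. at alpha = 0.05:
set1[p.adjust(set1, "bonferroni") < 0.05]  # raw p-values that survive Bonferroni
set1[p.adjust(set1, "fdr") < 0.05]         # raw p-values that survive FDR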
Setting a seed with set.seed makes random number generation reproducible:
set.seed(15)
rnorm(5)
[1] 0.2588229 1.8311207 -0.3396186 0.8971982 0.4880163
rnorm(5)
[1] -1.2553858 0.0227882 1.0907732 -0.1321224 -1.0750013
set.seed(15)
rnorm(5)
[1] 0.2588229 1.8311207 -0.3396186 0.8971982 0.4880163
We can also simulate data from a linear model of the form:
y = B0 + B1*x + e
where B0 is the intercept, B1 is the slope, and e is normally distributed noise.
set.seed(20); x <- rnorm(100); e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e
plot(x, y)
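Since we set the true coefficients ourselves (B0 = 0.5, B1 = 2), we can check that lm recovers them from the simulated data:
lm(y ~ x)  # estimates should be close to 0.5 (intercept) and 2 (slope)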
Other distributions can be sampled from as well, including binom, chisq, nbinom, pois, and unif (via rbinom, rchisq, rnbinom, rpois, and runif; a short sketch follows the sample() examples below).
You can also sample from an existing vector with sample():
my_vec <- c(20, 1, 233, 45, 7138)
sample(my_vec, size=2)
[1] 1 7138
sample(my_vec, size=2)
[1] 20 45
sample(my_vec, size=10, replace=TRUE)
[1] 45 1 20 45 7138 20 1 233 7138 20
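The distribution generators mentioned above work the same way; a few quick sketches (the parameter values are arbitrary):
rbinom(5, size = 10, prob = 0.5)  # 5 draws from Binomial(10, 0.5)
rpois(5, lambda = 3)              # 5 draws from Poisson with mean 3
runif(5, min = 0, max = 1)        # 5 draws from Uniform(0, 1)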
The simplest version is a one-factor model of the form:
y = a0 + a1 * x
a1 is the slope and a0 is the intercept.
For these examples we'll be using the widely used mtcars dataset.
We'll be investigating which factors are associated with miles per gallon.
plot(mtcars$wt, mtcars$mpg)
mtcars_lm1 <- lm(mpg ~ wt, data=mtcars)
mtcars_lm1
Call:
lm(formula = mpg ~ wt, data = mtcars)
Coefficients:
(Intercept)           wt
     37.285       -5.344
Typing summary(mtcars_lm1) returns more details, such as coefficient p-values and R-squared.
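For instance, components of the summary can be extracted directly:
summary(mtcars_lm1)$coefficients  # estimates, std. errors, t values, p-values
summary(mtcars_lm1)$r.squared     # proportion of variance explained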
plot(mtcars$wt, mtcars$mpg)
abline(mtcars_lm1)
You can also view diagnostic plots for the fitted model with plot(mtcars_lm1).
Multiple linear regression extends this to several predictors:
y = a0 + a1*x1 + a2*x2 + ... + an*xn
You could plot all pairwise scatterplots between columns of mtcars with pairs(mtcars, gap=0.9).
We'll partition the data into training and test sets to have an independent way to evaluate our final models.
mtcars_train_rows <- sample(1:32, 24)  # mtcars has 32 rows; use 24 (75%) for training
mtcars_train <- mtcars[mtcars_train_rows,]
mtcars_test <- mtcars[-mtcars_train_rows,]
mtcars_lm2 <- lm(mpg ~ ., data=mtcars_train)
We'll keep track of this model for later:
mtcars_lm2_orig <- mtcars_lm2
Now let's prune unnecessary variables from this model with backward selection.
Repeatedly run summary(mtcars_lm2) and remove the variable with the highest p-value until all remaining variables are significant.
Use this syntax to remove variables from the model:
mtcars_lm2 <- update(mtcars_lm2, .~. - VARIABLE, data = mtcars_train)
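Which variable gets dropped first depends on your random training split, so this step is only illustrative (cyl is a placeholder):
# mtcars_lm2 <- update(mtcars_lm2, .~. - cyl, data = mtcars_train)
Once pruning is finished, the held-out test set gives an independent comparison of the full and pruned models, e.g. via root-mean-square error:
pred_orig <- predict(mtcars_lm2_orig, newdata = mtcars_test)  # full model predictions
pred_final <- predict(mtcars_lm2, newdata = mtcars_test)      # pruned model predictions
sqrt(mean((mtcars_test$mpg - pred_orig)^2))   # RMSE of full model (lower is better)
sqrt(mean((mtcars_test$mpg - pred_final)^2))  # RMSE of pruned model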
Library website: https://conservancy.umn.edu/handle/11299/189222
Any anonymous feedback would be appreciated here: https://www.suggestionox.com/r/JP3K9k