Two-sample Tests

Wilcoxon Rank-Sum Test

Load the tidyverse package so that we can call its functions.

library(tidyverse)
attach(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
box <- ggplot(data=iris, aes(x=Species, y=Sepal.Length))
box + geom_boxplot(aes(fill=Species)) +
  ylab("Sepal Length") + ggtitle("Iris Boxplot") +
  stat_summary(fun=mean, geom="point", shape=5, size=4)  # diamonds mark the group means

histogram <- ggplot(data=iris, aes(x=Sepal.Width))
histogram + geom_histogram(binwidth=0.2, color="black", aes(fill=Species)) +
  xlab("Sepal Width") + ylab("Frequency") + ggtitle("Histogram of Sepal Width") +
  facet_grid(cols=vars(Species))  # one panel per species

iris.sub <- iris %>% filter(Species != "virginica")  # keep only setosa and versicolor

wilcox.test(Sepal.Width ~ Species, data = iris.sub)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Sepal.Width by Species
## W = 2312, p-value = 2.143e-13
## alternative hypothesis: true location shift is not equal to 0
t.test(Sepal.Width ~ Species, data = iris.sub)
## 
##  Welch Two Sample t-test
## 
## data:  Sepal.Width by Species
## t = 9.455, df = 94.698, p-value = 2.484e-15
## alternative hypothesis: true difference in means between group setosa and group versicolor is not equal to 0
## 95 percent confidence interval:
##  0.5198348 0.7961652
## sample estimates:
##     mean in group setosa mean in group versicolor 
##                    3.428                    2.770
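
Both tests agree: setosa sepals are significantly wider than versicolor sepals. As an optional extra, wilcox.test() can also report the Hodges-Lehmann estimate of the location shift, together with a confidence interval, if we set conf.int = TRUE:

wilcox.test(Sepal.Width ~ Species, data = iris.sub, conf.int = TRUE)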

Paired t-Test

data("sleep")

t.test(extra~group, data=sleep, paired=TRUE)
## 
##  Paired t-test
## 
## data:  extra by group
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.4598858 -0.7001142
## sample estimates:
## mean of the differences 
##                   -1.58
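
Aside: in recent versions of R the formula method of t.test() deprecates paired = TRUE (and the newest versions drop it entirely). The replacement shown in ?t.test reshapes the data to one row per subject and wraps the two measurements in Pair():

# one row per subject, with columns extra.1 and extra.2
sleep2 <- reshape(sleep, direction = "wide", idvar = "ID", timevar = "group")
t.test(Pair(extra.1, extra.2) ~ 1, data = sleep2)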

What happens if we run the test without taking the pairing into account?

t.test(extra~group, data=sleep)
## 
##  Welch Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33
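
Note how much the pairing matters: treating the two groups as independent inflates the standard error, and the same mean difference of 1.58 is no longer significant at the 5% level (p = 0.079).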

Simple Linear Regression

Please download the data file income.data.csv (right-click the link and save it to your working directory).

indata <- read.csv("income.data.csv", header = TRUE)
head(indata)
##   X   income happiness
## 1 1 3.862647  2.314489
## 2 2 4.979381  3.433490
## 3 3 4.923957  4.599373
## 4 4 3.214372  2.791114
## 5 5 7.196409  5.596398
## 6 6 3.729643  2.458556
hist(indata$happiness)  # check the distribution of the response

plot(indata$income, indata$happiness)  # scatter plot: is the relationship roughly linear?

Linear regression

Here, we will explore whether the relationship between the two variables can be described by a straight regression line.

\[ y = \alpha + \beta x \]

The slope of the line (the regression coefficient) is β, the increase in y per unit increase in x. The line intersects the y-axis at the intercept α.
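
For reference, the least-squares estimates that lm() computes have a simple closed form:

\[ \hat{\beta} = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} \]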

Here is the linear model, fitted with lm():

income.happiness.lm <- lm(happiness ~ income, data = indata)

summary(income.happiness.lm)
## 
## Call:
## lm(formula = happiness ~ income, data = indata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02479 -0.48526  0.04078  0.45898  2.37805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.20427    0.08884   2.299   0.0219 *  
## income       0.71383    0.01854  38.505   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7181 on 496 degrees of freedom
## Multiple R-squared:  0.7493, Adjusted R-squared:  0.7488 
## F-statistic:  1483 on 1 and 496 DF,  p-value: < 2.2e-16
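
As a quick sanity check, we can reproduce these coefficients by hand from the closed-form formulas given above:

# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
b <- cov(indata$income, indata$happiness) / var(indata$income)
a <- mean(indata$happiness) - b * mean(indata$income)
c(intercept = a, slope = b)  # should match coef(income.happiness.lm)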

So, the best-fitting straight line is happiness = 0.20427 + 0.71383 × income.

The most important thing to note is the p-value of the slope (here < 2.2e-16, essentially zero), which indicates whether income has a statistically significant effect on happiness; how well the line fits the data is summarised by the R-squared (about 0.75).

From these results, we can say that there is a significant positive relationship between income and happiness (p < 0.001), with a 0.714-unit (standard error ± 0.019) increase in happiness for every unit increase in income.
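
To visualise the fit, we can redraw the scatter plot from above and overlay the fitted line with abline():

plot(indata$income, indata$happiness, xlab = "Income", ylab = "Happiness")
abline(income.happiness.lm, col = "red")  # fitted regression line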

Disclaimer:

For more details on linear regression, please see https://www.scribbr.com/statistics/linear-regression-in-r/