Wilcoxon Rank Sum test
Load the tidyverse package, which provides the ggplot2 plotting functions and the %>% pipe used below.
library(tidyverse)
attach(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
box <- ggplot(data=iris, aes(x=Species, y=Sepal.Length))
box + geom_boxplot(aes(fill=Species)) +
ylab("Sepal Length") + ggtitle("Iris Boxplot") +
stat_summary(fun=mean, geom="point", shape=5, size=4)
histogram <- ggplot(data=iris, aes(x=Sepal.Width))
histogram + geom_histogram(binwidth=0.2, color="black", aes(fill=Species)) +
xlab("Sepal Width") + ylab("Frequency") + ggtitle("Histogram of Sepal Width")+
facet_grid(cols=vars(Species))
iris.sub <- iris %>% filter(Species != "virginica")
wilcox.test(Sepal.Width~Species, iris.sub)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Sepal.Width by Species
## W = 2312, p-value = 2.143e-13
## alternative hypothesis: true location shift is not equal to 0
t.test(Sepal.Width~Species, iris.sub)
##
## Welch Two Sample t-test
##
## data: Sepal.Width by Species
## t = 9.455, df = 94.698, p-value = 2.484e-15
## alternative hypothesis: true difference in means between group setosa and group versicolor is not equal to 0
## 95 percent confidence interval:
## 0.5198348 0.7961652
## sample estimates:
## mean in group setosa mean in group versicolor
## 3.428 2.770
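To see where the W statistic comes from, here is a minimal sketch that recomputes it by hand from the pooled ranks. It uses base R only; the names `x`, `y`, `r`, `W_manual` and `W_builtin` are ours and are introduced just for illustration.

```r
# Sepal widths of the two species being compared.
x <- iris$Sepal.Width[iris$Species == "setosa"]      # first group in the formula
y <- iris$Sepal.Width[iris$Species == "versicolor"]  # second group

# Rank all observations together; tied values receive the average rank.
r <- rank(c(x, y))

# W = (sum of the ranks of the first group) - n1 * (n1 + 1) / 2
n1 <- length(x)
W_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Statistic reported by wilcox.test() through its two-vector interface.
# exact = FALSE asks for the normal approximation explicitly, which is
# appropriate here because the data contain ties.
W_builtin <- unname(wilcox.test(x, y, exact = FALSE)$statistic)

print(c(manual = W_manual, builtin = W_builtin))
```

Both values should agree with the W shown in the output above. A large W means that setosa widths tend to rank above versicolor widths.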
data("sleep")
t.test(extra~group, data=sleep, paired=TRUE)
##
## Paired t-test
##
## data: extra by group
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4598858 -0.7001142
## sample estimates:
## mean of the differences
## -1.58
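A paired t-test is the same as a one-sample t-test on the within-subject differences. The sketch below shows this with the two-vector interface of `t.test()`. The vector names `g1`, `g2` and `d` are ours. As far as we are aware, newer R versions (from 4.4.0) reject the `paired` argument in the formula interface, so the two-vector form is also the more portable way to write the call.

```r
data("sleep")

# Split the long-format data into one vector per drug.
# Within each group the rows are ordered by subject ID, so positions match.
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]

# Paired test via the two-vector interface.
paired_fit <- t.test(g1, g2, paired = TRUE)

# Equivalent one-sample test on the within-subject differences.
d <- g1 - g2
diff_fit <- t.test(d, mu = 0)

# The same t statistic computed by hand: mean difference / its standard error.
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

print(c(paired = unname(paired_fit$statistic),
        one_sample = unname(diff_fit$statistic),
        manual = t_manual))
```

All three numbers should match the t value reported above, with 9 degrees of freedom (10 subjects minus 1).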
What happens if we run the test without taking the pairing into account?
t.test(extra~group, data=sleep)
##
## Welch Two Sample t-test
##
## data: extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -3.3654832 0.2054832
## sample estimates:
## mean in group 1 mean in group 2
## 0.75 2.33
Please download the data file income.data.csv (right-click the link and save it to your working directory), then read it in.
indata <- read.csv("income.data.csv", header = TRUE)
head(indata)
## X income happiness
## 1 1 3.862647 2.314489
## 2 2 4.979381 3.433490
## 3 3 4.923957 4.599373
## 4 4 3.214372 2.791114
## 5 5 7.196409 5.596398
## 6 6 3.729643 2.458556
hist(indata$happiness)
plot(indata$income, indata$happiness)
Here, we will explore whether the relationship between the two variables can be described by a regression line.
\[ y = \alpha + \beta x \] The slope of the line (the regression coefficient) is β, the increase in y per unit change in x. The line intersects the y-axis at the intercept α.
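For a single predictor, the least-squares estimates have a simple closed form: the slope β is cov(x, y) / var(x), and the intercept α is mean(y) minus β times mean(x). The sketch below checks this against `lm()`. The data are simulated purely for illustration (they are not the income data), and all variable names are ours.

```r
# Simulated stand-in data; the true values 0.2 and 0.7 are arbitrary choices.
set.seed(1)
x <- runif(100, min = 1, max = 8)
y <- 0.2 + 0.7 * x + rnorm(100, sd = 0.7)

# Closed-form least-squares estimates.
beta_hat  <- cov(x, y) / var(x)            # slope
alpha_hat <- mean(y) - beta_hat * mean(x)  # intercept

# The same estimates from lm(); coef() returns (Intercept), then the slope.
fit <- lm(y ~ x)

print(rbind(by_hand = c(alpha_hat, beta_hat), lm = unname(coef(fit))))
```

The two rows should be identical up to floating-point rounding.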
Fit the linear model with lm() and inspect it with summary():
income.happiness.lm <- lm(happiness ~ income, data = indata)
summary(income.happiness.lm)
##
## Call:
## lm(formula = happiness ~ income, data = indata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.02479 -0.48526 0.04078 0.45898 2.37805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.20427 0.08884 2.299 0.0219 *
## income 0.71383 0.01854 38.505 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7181 on 496 degrees of freedom
## Multiple R-squared: 0.7493, Adjusted R-squared: 0.7488
## F-statistic: 1483 on 1 and 496 DF, p-value: < 2.2e-16
So, the best-fitting straight line is happiness = 0.20427 + 0.71383 × income.
The most important value to check is the p value for income (here < 2e-16, effectively zero), which indicates whether the relationship between income and happiness is statistically significant. How well the model fits the data is summarised by R-squared: here 0.7493, so income explains about 75% of the variance in happiness.
From these results, we can say that there is a significant positive relationship between income and happiness (p value < 0.001), with a 0.714-unit (standard error 0.019) increase in happiness for every unit increase in income.
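The numbers quoted in this kind of summary can be pulled out of the fitted model directly rather than read off the printout. Here is a minimal sketch. To keep it runnable without the downloaded file, it fits a model to simulated stand-in data: the data frame `demo` and the model `demo.lm` are hypothetical, and in the session above you would use `income.happiness.lm` instead.

```r
# Hypothetical stand-in data with the same column names as the income data.
set.seed(42)
demo <- data.frame(income = runif(200, min = 1.5, max = 7.5))
demo$happiness <- 0.2 + 0.7 * demo$income + rnorm(200, sd = 0.7)

demo.lm <- lm(happiness ~ income, data = demo)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- coef(summary(demo.lm))
slope_est <- coefs["income", "Estimate"]
slope_se  <- coefs["income", "Std. Error"]
slope_p   <- coefs["income", "Pr(>|t|)"]

# 95% confidence interval for the slope.
slope_ci <- confint(demo.lm, "income", level = 0.95)

# Proportion of variance in happiness explained by income.
r_squared <- summary(demo.lm)$r.squared

print(round(c(estimate = slope_est, se = slope_se,
              lower = slope_ci[1], upper = slope_ci[2],
              r2 = r_squared), 3))
```

Reporting the confidence interval next to the estimate is usually more informative than a bare "+/-" figure, because it states which level of uncertainty is meant.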
Disclaimer:
For more details on linear regression, please see https://www.scribbr.com/statistics/linear-regression-in-r/