I am going to try to help with the homework and project by giving you some coding examples for these non-parametric tests.
To keep this assignment simple, we are going to use the built-in dataset diamonds that is included with ggplot2.
library(ggplot2)  # diamonds ships with ggplot2, so the package must be loaded first
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
table(diamonds$color)
##
## D E F G H I J
## 6775 9797 9542 11292 8304 5422 2808
I am going to look at the color of the diamonds. Let’s ask whether the median price of a D-colored diamond is equal to that of a J-colored diamond. First I’ll trim the data frame down to just those diamonds and then compare the prices.
df <- diamonds[which(diamonds$color %in% c("D","J")),]
summary(df)
## carat cut color clarity depth
## Min. :0.2000 Fair : 282 D:6775 SI1 :2833 Min. :43.00
## 1st Qu.:0.4000 Good : 969 E: 0 VS2 :2428 1st Qu.:61.00
## Median :0.7000 Very Good:2191 F: 0 SI2 :1849 Median :61.90
## Mean :0.8056 Premium :2411 G: 0 VS1 :1247 Mean :61.75
## 3rd Qu.:1.0300 Ideal :3730 H: 0 VVS2 : 684 3rd Qu.:62.60
## Max. :5.0100 I: 0 VVS1 : 326 Max. :73.60
## J:2808 (Other): 216
## table price x y
## Min. :51.60 Min. : 335.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 976.5 1st Qu.: 4.73 1st Qu.: 4.740
## Median :57.00 Median : 2310.0 Median : 5.67 Median : 5.680
## Mean :57.52 Mean : 3801.1 Mean : 5.74 Mean : 5.743
## 3rd Qu.:59.00 3rd Qu.: 5084.5 3rd Qu.: 6.53 3rd Qu.: 6.530
## Max. :73.00 Max. :18710.0 Max. :10.74 Max. :10.540
##
## z
## Min. :0.000
## 1st Qu.:2.930
## Median :3.500
## Mean :3.545
## 3rd Qu.:4.030
## Max. :6.980
##
wilcox.test(price~color, data = df)
##
## Wilcoxon rank sum test with continuity correction
##
## data: price by color
## W = 6557122, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
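To see what the W statistic in this output measures, here is a minimal sketch on a toy example. The vectors x and y are made up for illustration and are not taken from diamonds. It assumes the convention R uses for the rank sum test: W is the sum of the ranks of the first sample, minus the smallest value that sum could take.

```r
# Hypothetical toy samples (not taken from diamonds)
x <- c(1.1, 2.3, 3.8, 5.0)
y <- c(2.9, 4.4, 6.1, 7.5, 8.2)

# Rank all observations together, then add up the ranks that belong to x
r <- rank(c(x, y))
rank_sum_x <- sum(r[seq_along(x)])

# Subtract the smallest possible rank sum for a sample of this size
W_manual <- rank_sum_x - length(x) * (length(x) + 1) / 2

# Compare with the statistic reported by wilcox.test
W_builtin <- unname(wilcox.test(x, y)$statistic)
print(c(manual = W_manual, builtin = W_builtin))
```

A small W means the first sample mostly holds the lower ranks; a W near length(x) * length(y) means it mostly holds the higher ones.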
by(diamonds$price,diamonds$color,median)
## diamonds$color: D
## [1] 1838
## ------------------------------------------------------------
## diamonds$color: E
## [1] 1739
## ------------------------------------------------------------
## diamonds$color: F
## [1] 2343.5
## ------------------------------------------------------------
## diamonds$color: G
## [1] 2242
## ------------------------------------------------------------
## diamonds$color: H
## [1] 3460
## ------------------------------------------------------------
## diamonds$color: I
## [1] 3730
## ------------------------------------------------------------
## diamonds$color: J
## [1] 4234
I did not know the by command, but it gives me a nice way to build these tables. It is clear that the D and J medians (1838 vs. 4234) are very different! It might be more interesting to compare two colors whose prices are closer, like F and G.
df2 <- diamonds[which(diamonds$color %in% c("F","G")),]
wilcox.test(price ~ color, data = df2)
##
## Wilcoxon rank sum test with continuity correction
##
## data: price by color
## W = 53235136, p-value = 0.1396
## alternative hypothesis: true location shift is not equal to 0
Here I would fail to reject the null hypothesis (which is what I was shooting for). There is no evidence to suggest that the median prices are different.
I need matched data for the Wilcoxon signed rank test. This is the tricky thing! I think I am going to look at the iris dataset, one of the most famous datasets in the world.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
I am going to look at the difference between the sepal length and width and see if the species makes a difference.
df3 <- iris[which(iris$Species %in% c("setosa","versicolor")),]
df3["Sepal.Difference"] = df3$Sepal.Length - df3$Sepal.Width
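With that all cleaned up we run the test. Keep in mind that paired = TRUE matches the setosa and versicolor rows by their order in the data frame; these flowers are not truly matched, so treat this as an illustration of the syntax rather than a proper paired design.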
wilcox.test(Sepal.Difference ~ Species, data = df3, paired = TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: Sepal.Difference by Species
## V = 0, p-value = 7.755e-10
## alternative hypothesis: true location shift is not equal to 0
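A note in case the formula call does not run for you: my understanding is that recent versions of R (4.4.0 and later) no longer accept paired = TRUE in the formula interface, so this is an assumption worth checking against your own installation. A minimal sketch of the same test, passing the two groups as separate vectors (the names x and y are mine), would look like this:

```r
# Split the two species into separate data frames (same row order as iris)
setosa     <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]

# Sepal length minus sepal width for each flower
x <- setosa$Sepal.Length - setosa$Sepal.Width
y <- versicolor$Sepal.Length - versicolor$Sepal.Width

# Paired signed rank test; row i of setosa is paired with row i of versicolor.
# exact = FALSE asks for the normal approximation, which R falls back to
# anyway when the differences contain ties.
res <- wilcox.test(x, y, paired = TRUE, exact = FALSE)
print(res)
```

The vector form makes the pairing explicit: both vectors must have the same length, and element i of x is compared with element i of y.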
So we are able to reject the null hypothesis that the paired differences are centered at zero, which is to say that the two medians are the same. We will look at the medians just because it is nice to see why we rejected the null hypothesis.
by(df3$Sepal.Difference,df3$Species,median)
## df3$Species: setosa
## [1] 1.55
## ------------------------------------------------------------
## df3$Species: versicolor
## [1] 3.1
## ------------------------------------------------------------
## df3$Species: virginica
## [1] NA
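Yes, the two medians are clearly different, so it is no surprise that we rejected the null hypothesis. (The NA for virginica appears only because that species is still a level of the factor even though its rows were filtered out.)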
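One way to keep empty groups such as virginica out of the table is to drop unused factor levels right after subsetting. This is a minimal sketch using droplevels and tapply from base R; the variable med is just a name I picked for the result.

```r
# Keep two species and drop the now-empty "virginica" level
df3 <- droplevels(iris[iris$Species %in% c("setosa", "versicolor"), ])
df3$Sepal.Difference <- df3$Sepal.Length - df3$Sepal.Width

# Median difference per species; only the two remaining levels appear
med <- tapply(df3$Sepal.Difference, df3$Species, median)
print(med)
```

The same trick also removes the empty virginica slot from the boxplot.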
boxplot(df3$Sepal.Difference~df3$Species)
I’ll repeat the Kruskal-Wallis test on the iris dataset. I’ll just look at the sepal length versus the species.
boxplot(iris$Sepal.Length ~ iris$Species)
by(iris$Sepal.Length,iris$Species,median)
## iris$Species: setosa
## [1] 5
## ------------------------------------------------------------
## iris$Species: versicolor
## [1] 5.9
## ------------------------------------------------------------
## iris$Species: virginica
## [1] 6.5
kruskal.test(Sepal.Length ~ Species , data = iris)
##
## Kruskal-Wallis rank sum test
##
## data: Sepal.Length by Species
## Kruskal-Wallis chi-squared = 96.937, df = 2, p-value < 2.2e-16
Of course we reject the null hypothesis; these flowers are very different.
To finish, we examine the correlation using the Spearman rank test. My guess is that the sepal lengths and widths should be related.
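The Kruskal-Wallis test only says that at least one species differs; it does not say which ones. If you want to follow up, one common option (my suggestion, not something the assignment requires) is a set of pairwise rank sum tests with a multiple-comparison adjustment. A minimal sketch using pairwise.wilcox.test from base R, with the Holm adjustment chosen as an assumption:

```r
# Pairwise Wilcoxon rank sum tests between species, Holm-adjusted p-values.
# exact = FALSE is passed through to wilcox.test because the data have ties.
pw <- pairwise.wilcox.test(iris$Sepal.Length, iris$Species,
                           p.adjust.method = "holm", exact = FALSE)
print(pw)
```

The result holds a lower-triangular matrix of adjusted p-values in pw$p.value, one for each pair of species.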
cor.test(iris$Sepal.Length, iris$Sepal.Width, method = "spearman")
## Warning in cor.test.default(iris$Sepal.Length, iris$Sepal.Width, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: iris$Sepal.Length and iris$Sepal.Width
## S = 656283, p-value = 0.04137
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1667777
With this p-value (0.04137) we still reject the null hypothesis at the 0.05 level, though not by much.
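If it helps to see where rho comes from: Spearman's correlation is the ordinary Pearson correlation computed on the ranks of the data, with ties given average ranks. A minimal sketch that checks this on iris (the names rho_builtin and rho_manual are mine):

```r
# Spearman's rho as reported by R
rho_builtin <- cor(iris$Sepal.Length, iris$Sepal.Width, method = "spearman")

# Same quantity by hand: Pearson correlation of the ranks
rho_manual <- cor(rank(iris$Sepal.Length), rank(iris$Sepal.Width))

print(c(builtin = rho_builtin, manual = rho_manual))
```

Because only the ranks matter, rho does not change under any monotone transformation of either variable, such as taking logs.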
plot(iris$Sepal.Length, iris$Sepal.Width)
abline(lm(Sepal.Width ~ Sepal.Length, data = iris),col = "Blue")
We see that this relationship is not strong (rho is about -0.17), but as the sepal gets longer it also tends to get narrower.
Okay, I hope this helps you get your homework done!