I am going to try to help with the homework and project by giving you some examples of how to code these non-parametric tests.

To keep this assignment simple, we are going to use the built-in diamonds dataset that is included with ggplot2.

library(ggplot2)
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Wilcoxon Rank Sum Test

table(diamonds$color)
## 
##     D     E     F     G     H     I     J 
##  6775  9797  9542 11292  8304  5422  2808

I am going to look at the color of the diamonds. Let’s ask whether the median price of a D-colored diamond is equal to that of a J-colored diamond. First I’ll trim the data frame down to just those diamonds and then compare the prices.

df <- diamonds[which(diamonds$color %in% c("D","J")),]
summary(df)
##      carat               cut       color       clarity         depth      
##  Min.   :0.2000   Fair     : 282   D:6775   SI1    :2833   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 969   E:   0   VS2    :2428   1st Qu.:61.00  
##  Median :0.7000   Very Good:2191   F:   0   SI2    :1849   Median :61.90  
##  Mean   :0.8056   Premium  :2411   G:   0   VS1    :1247   Mean   :61.75  
##  3rd Qu.:1.0300   Ideal    :3730   H:   0   VVS2   : 684   3rd Qu.:62.60  
##  Max.   :5.0100                    I:   0   VVS1   : 326   Max.   :73.60  
##                                    J:2808   (Other): 216                  
##      table           price               x               y         
##  Min.   :51.60   Min.   :  335.0   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  976.5   1st Qu.: 4.73   1st Qu.: 4.740  
##  Median :57.00   Median : 2310.0   Median : 5.67   Median : 5.680  
##  Mean   :57.52   Mean   : 3801.1   Mean   : 5.74   Mean   : 5.743  
##  3rd Qu.:59.00   3rd Qu.: 5084.5   3rd Qu.: 6.53   3rd Qu.: 6.530  
##  Max.   :73.00   Max.   :18710.0   Max.   :10.74   Max.   :10.540  
##                                                                    
##        z        
##  Min.   :0.000  
##  1st Qu.:2.930  
##  Median :3.500  
##  Mean   :3.545  
##  3rd Qu.:4.030  
##  Max.   :6.980  
## 
wilcox.test(price~color, data = df)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  price by color
## W = 6557122, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
by(diamonds$price,diamonds$color,median)
## diamonds$color: D
## [1] 1838
## ------------------------------------------------------------ 
## diamonds$color: E
## [1] 1739
## ------------------------------------------------------------ 
## diamonds$color: F
## [1] 2343.5
## ------------------------------------------------------------ 
## diamonds$color: G
## [1] 2242
## ------------------------------------------------------------ 
## diamonds$color: H
## [1] 3460
## ------------------------------------------------------------ 
## diamonds$color: I
## [1] 3730
## ------------------------------------------------------------ 
## diamonds$color: J
## [1] 4234

I did not know the by command, but it gives me a nice way to build these tables. It is clear that the D and J prices are very different (medians of 1838 and 4234). It might be more interesting to compare two colors whose prices are closer, like F and G.

df2 <- diamonds[which(diamonds$color %in% c("F","G")),]
wilcox.test(price ~ color, data = df2)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  price by color
## W = 53235136, p-value = 0.1396
## alternative hypothesis: true location shift is not equal to 0

Here I would fail to reject the null hypothesis (which is what I was shooting for). There is not enough evidence to suggest that the median prices are different.
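If you are curious where the W statistic comes from, here is a minimal sketch of the calculation done by hand. The two price vectors are made-up numbers chosen only for illustration (they are not taken from diamonds), and the variable names are my own.

```r
# Hypothetical prices for two small groups (made-up numbers, no ties)
x <- c(320, 410, 455, 510, 620)
y <- c(300, 350, 400, 430, 480, 700)

# Rank all observations together, then sum the ranks that belong to x
ranks <- rank(c(x, y))
rank_sum_x <- sum(ranks[seq_along(x)])

# R reports W as that rank sum minus its smallest possible value
w_manual <- rank_sum_x - length(x) * (length(x) + 1) / 2

# Compare with the statistic that wilcox.test() reports
w_builtin <- unname(wilcox.test(x, y)$statistic)
c(manual = w_manual, builtin = w_builtin)
```

The two numbers should agree, which shows that the test only ever uses the ranks of the prices, not the prices themselves.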

Wilcoxon Signed Rank Test

I need matched data for this test. This is the tricky thing! I think I am going to look at the iris dataset. It is one of the most famous datasets in the world.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

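I am going to look at the difference between the Sepal Length and Sepal Width and see if the species makes a difference. Keep in mind that the setosa and versicolor flowers are not truly matched; the test pairs them by row order, so treat this as a demonstration of the syntax.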

df3 <- iris[which(iris$Species %in% c("setosa","versicolor")),]
df3$Sepal.Difference <- df3$Sepal.Length - df3$Sepal.Width

With that all cleaned up we run the test.

wilcox.test(Sepal.Difference ~ Species, data = df3, paired = TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Sepal.Difference by Species
## V = 0, p-value = 7.755e-10
## alternative hypothesis: true location shift is not equal to 0
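One caution about the call above: as far as I know, recent versions of R (4.4.0 and later, if I am reading the release notes correctly) refuse paired = TRUE in the formula interface. Here is a sketch of an equivalent call that passes the two groups as separate vectors. The names setosa, versicolor and res are my own.

```r
# Sepal length minus sepal width, computed separately for each species
setosa <- with(iris[iris$Species == "setosa", ],
               Sepal.Length - Sepal.Width)
versicolor <- with(iris[iris$Species == "versicolor", ],
                   Sepal.Length - Sepal.Width)

# Pairs are formed by position (1st setosa with 1st versicolor, and so on).
# That pairing is arbitrary and is used only to demonstrate the call.
res <- wilcox.test(setosa, versicolor, paired = TRUE)
res
```

This should give the same V and p-value as the output shown above. R may warn that it cannot compute an exact p-value because of ties.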

So we are able to reject the null hypothesis that the median of the paired differences is zero. We will look at the group medians just because it is nice to see why we rejected the null hypothesis.

by(df3$Sepal.Difference,df3$Species,median)
## df3$Species: setosa
## [1] 1.55
## ------------------------------------------------------------ 
## df3$Species: versicolor
## [1] 3.1
## ------------------------------------------------------------ 
## df3$Species: virginica
## [1] NA

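Yes, it is clear that we would reject that those are equal. (The NA for virginica appears only because virginica is still a level of the Species factor, even though df3 has no virginica rows.)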
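If the empty virginica row bothers you, one way to get rid of it is to drop the unused factor levels after subsetting. This is a small sketch that rebuilds df3 from scratch; I use tapply here instead of by only because it prints a more compact table.

```r
# Keep two species and drop the now-empty "virginica" level
df3 <- droplevels(iris[iris$Species %in% c("setosa", "versicolor"), ])
df3$Sepal.Difference <- df3$Sepal.Length - df3$Sepal.Width

# Median difference per species, with no NA entry for virginica
meds <- tapply(df3$Sepal.Difference, df3$Species, median)
meds
```

The same trick also cleans up the boxplot, which otherwise reserves an empty slot for virginica.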

boxplot(df3$Sepal.Difference~df3$Species)

Kruskal-Wallis

I’ll run the Kruskal-Wallis test on the iris dataset as well. I’ll just look at Sepal Length by species.

boxplot(iris$Sepal.Length ~ iris$Species)

by(iris$Sepal.Length,iris$Species,median)
## iris$Species: setosa
## [1] 5
## ------------------------------------------------------------ 
## iris$Species: versicolor
## [1] 5.9
## ------------------------------------------------------------ 
## iris$Species: virginica
## [1] 6.5
kruskal.test(Sepal.Length ~ Species , data = iris)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Sepal.Length by Species
## Kruskal-Wallis chi-squared = 96.937, df = 2, p-value < 2.2e-16

Of course we reject the null hypothesis; at least one of these species has a very different Sepal Length from the others.
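To see what kruskal.test is doing under the hood, here is a sketch that rebuilds the chi-squared statistic from the ranks, including the usual correction for ties. The variable names are my own, and this is meant as an illustration of the textbook formula, not a replacement for the built-in function.

```r
x <- iris$Sepal.Length
g <- iris$Species
n <- length(x)

# Rank every observation together, then total the ranks within each species
r <- rank(x)
rank_sums <- tapply(r, g, sum)
group_sizes <- tapply(r, g, length)

# Kruskal-Wallis H before any correction for ties
h_raw <- 12 / (n * (n + 1)) * sum(rank_sums^2 / group_sizes) - 3 * (n + 1)

# Tie correction: based on how many times each distinct value occurs
tie_counts <- table(x)
tie_correction <- 1 - sum(tie_counts^3 - tie_counts) / (n^3 - n)
h_manual <- h_raw / tie_correction

# Compare with the built-in test
h_builtin <- unname(kruskal.test(Sepal.Length ~ Species, data = iris)$statistic)
c(manual = h_manual, builtin = h_builtin)
```

Both values should match the 96.937 reported above.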

Spearman

To finish, we examine the correlation using the Spearman rank correlation test. My guess is that the Sepal Lengths and Widths should be related.

cor.test(iris$Sepal.Length, iris$Sepal.Width, method = "spearman")
## Warning in cor.test.default(iris$Sepal.Length, iris$Sepal.Width, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  iris$Sepal.Length and iris$Sepal.Width
## S = 656283, p-value = 0.04137
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1667777

With this p-value (0.04137) we still reject the null hypothesis of no correlation at the 5% level, though not by much.
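It may help to know that Spearman's rho is just the ordinary Pearson correlation computed on the ranks. Here is a short sketch that checks this. It also passes exact = FALSE, which asks for the approximate p-value up front and so avoids the warning about ties. The variable names are my own.

```r
# Pearson correlation of the ranks
rho_manual <- cor(rank(iris$Sepal.Length), rank(iris$Sepal.Width))

# Built-in Spearman test, requesting the approximate p-value directly
res <- cor.test(iris$Sepal.Length, iris$Sepal.Width,
                method = "spearman", exact = FALSE)
rho_builtin <- unname(res$estimate)

c(manual = rho_manual, builtin = rho_builtin)
```

Both values should agree with the rho of about -0.167 shown above.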

plot(iris$Sepal.Length, iris$Sepal.Width)
abline(lm(Sepal.Width ~ Sepal.Length, data = iris),col = "Blue")

We see that this relationship is not strong (rho is about -0.17), but as the sepal gets longer it also tends to get narrower.

Okay, I hope this helps you get your homework done!