I am going to try to help you complete the homework and project by giving you some examples of how to code these non-parametric tests.

To keep this assignment simple, we are going to use a football match dataset, loaded into the data frame data. The dates and stadiums shown below correspond to the 2018/19 English Premier League season.

head(data)
##                date_GMT         referee total_goal_count
## 1  Aug 10 2018 - 7:00pm  Andre Marriner                3
## 2 Aug 11 2018 - 11:30am Martin Atkinson                3
## 3  Aug 11 2018 - 2:00pm    Kevin Friend                2
## 4  Aug 11 2018 - 2:00pm       Mike Dean                2
## 5  Aug 11 2018 - 2:00pm  Chris Kavanagh                3
## 6  Aug 11 2018 - 2:00pm   Jonathan Moss                2
##   total_goals_at_half_time total_minute
## 1                        1           90
## 2                        3           90
## 3                        1           90
## 4                        1           90
## 5                        2           90
## 6                        1           90
##                                          stadium_name
## 1                           Old Trafford (Manchester)
## 2               St. James' Park (Newcastle upon Tyne)
## 3              Vitality Stadium (Bournemouth- Dorset)
## 4                             Craven Cottage (London)
## 5 John Smith's Stadium (Huddersfield- West Yorkshire)
## 6                             Vicarage Road (Watford)

Wilcoxon Rank Sum Test

table(data$stadium_name)
## 
##                                          Anfield (Liverpool) 
##                                                           19 
##                    Cardiff City Stadium (Cardiff (Caerdydd)) 
##                                                           19 
##                                      Craven Cottage (London) 
##                                                           19 
##                                    Emirates Stadium (London) 
##                                                           19 
##                                  Etihad Stadium (Manchester) 
##                                                           19 
##                                    Goodison Park (Liverpool) 
##                                                           19 
##          John Smith's Stadium (Huddersfield- West Yorkshire) 
##                                                           19 
##               King Power Stadium (Leicester- Leicestershire) 
##                                                           19 
##                                      London Stadium (London) 
##                                                           19 
##              Molineux Stadium (Wolverhampton- West Midlands) 
##                                                           19 
##                                    Old Trafford (Manchester) 
##                                                           19 
##                                       Selhurst Park (London) 
##                                                           19 
##                        St. James' Park (Newcastle upon Tyne) 
##                                                           19 
##                  St. Mary's Stadium (Southampton- Hampshire) 
##                                                           19 
##                                     Stamford Bridge (London) 
##                                                           19 
## The American Express Community Stadium (Falmer- East Sussex) 
##                                                           19 
##                           Tottenham Hotspur Stadium (London) 
##                                                            5 
##                                          Turf Moor (Burnley) 
##                                                           19 
##                                      Vicarage Road (Watford) 
##                                                           19 
##                       Vitality Stadium (Bournemouth- Dorset) 
##                                                           19 
##                                     Wembley Stadium (London) 
##                                                           14

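I am going to look at stadium name. Let's ask whether the median total goal count of a match is equal to the match length in minutes. First I'll trim the data frame down to the stadiums of interest and then compare the two columns. Two things to note about the code below: no stadium is actually named "D" or "J", so the subset df is empty (every summary entry is NA), and because the test is passed data$... vectors directly, the data = df argument is ignored and the test runs on the full data set. With paired=TRUE, R also reports a signed rank test rather than a rank sum test.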

df <- data[which(data$stadium_name %in% c("D","J")),]
summary(df)
##    date_GMT           referee          total_goal_count
##  Length:0           Length:0           Min.   : NA     
##  Class :character   Class :character   1st Qu.: NA     
##  Mode  :character   Mode  :character   Median : NA     
##                                        Mean   :NaN     
##                                        3rd Qu.: NA     
##                                        Max.   : NA     
##  total_goals_at_half_time  total_minute stadium_name      
##  Min.   : NA              Min.   : NA   Length:0          
##  1st Qu.: NA              1st Qu.: NA   Class :character  
##  Median : NA              Median : NA   Mode  :character  
##  Mean   :NaN              Mean   :NaN                     
##  3rd Qu.: NA              3rd Qu.: NA                     
##  Max.   : NA              Max.   : NA
wilcox.test(data$total_goal_count, data$total_minute, data = df, paired=TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data$total_goal_count and data$total_minute
## V = 0, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
by(data$total_goal_count,data$total_minute, median)
## data$total_minute: 90
## [1] 3

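It is clear here that these are very different: every match lasts 90 minutes, while the median total goal count is 3. It might be more interesting to compare two columns that are closer, like total goals and goals at half time. (As before, no stadium is named "F" or "G", so df2 is empty and the test below runs on the full data set.)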

df2 <- data[which(data$stadium_name %in% c("F","G")),]
wilcox.test(data$total_goal_count, data$total_goals_at_half_time, data = df2, paired=TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data$total_goal_count and data$total_goals_at_half_time
## V = 46665, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Here I am able to reject the null hypothesis that the location shift between total goals and goals at half time is zero.
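
As a hedged sketch of how an unpaired rank sum comparison between two stadiums could look, the block below builds a small made-up data frame (toy is a hypothetical name, and its goal counts are invented for illustration, not taken from the real data). It uses the formula interface so that the data argument is actually honoured.

```r
# Hypothetical toy data: goal counts are invented for illustration only.
toy <- data.frame(
  stadium_name = rep(c("Anfield (Liverpool)", "Turf Moor (Burnley)"), each = 6),
  total_goal_count = c(4, 3, 5, 2, 4, 3, 1, 2, 0, 3, 2, 1)
)

# Keep only the two stadiums being compared.
sub <- toy[toy$stadium_name %in% c("Anfield (Liverpool)", "Turf Moor (Burnley)"), ]

# Formula interface: the data argument is used, and paired is left at its
# default (FALSE), so this is the rank sum (Mann-Whitney) test.
# exact = FALSE requests the normal approximation, since the data have ties.
res <- wilcox.test(total_goal_count ~ stadium_name, data = sub, exact = FALSE)
print(res$method)
print(res$p.value)
```

The difference from the calls above is that the response and grouping columns are looked up inside sub, so subsetting the data frame really changes what is tested.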

Wilcoxon Signed Rank Test

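I am going to look at the difference between the total goal count and the goals at half time for each match and see if it is different from zero. Note that "setosa" and "versicolor" are not stadium names, so df3 below is empty.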

df3 <- data[which(data$stadium_name %in% c("setosa","versicolor")),]
df3["Tournament.Difference"] = df3$total_goal_count - df3$total_goals_at_half_time

With that all cleaned up we run the test.

wilcox.test(data$total_goal_count, data$total_goals_at_half_time, data = df3, paired=TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data$total_goal_count and data$total_goals_at_half_time
## V = 46665, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

So we are able to reject the null hypothesis.
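
To illustrate why the difference column matters, here is a minimal sketch, assuming a small invented data frame (toy and goal_difference are hypothetical names, and the values are made up). It shows that the paired signed rank test is the same as a one-sample signed rank test on the per-match differences.

```r
# Hypothetical toy data: values are invented for illustration only.
toy <- data.frame(
  total_goal_count         = c(3, 3, 2, 2, 3, 2, 5, 1),
  total_goals_at_half_time = c(1, 3, 1, 1, 2, 1, 2, 0)
)
toy$goal_difference <- toy$total_goal_count - toy$total_goals_at_half_time

# Paired test on the two columns.
paired <- wilcox.test(toy$total_goal_count, toy$total_goals_at_half_time,
                      paired = TRUE, exact = FALSE)

# One-sample test on the differences against a median of 0.
one_sample <- wilcox.test(toy$goal_difference, mu = 0, exact = FALSE)

# Both calls rank the same differences, so V and the p-value agree.
print(unname(paired$statistic) == unname(one_sample$statistic))
print(all.equal(paired$p.value, one_sample$p.value))
```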

by(data$total_goal_count, data$total_goals_at_half_time, median)
## data$total_goals_at_half_time: 0
## [1] 1
## ------------------------------------------------------------ 
## data$total_goals_at_half_time: 1
## [1] 2.5
## ------------------------------------------------------------ 
## data$total_goals_at_half_time: 2
## [1] 3
## ------------------------------------------------------------ 
## data$total_goals_at_half_time: 3
## [1] 4
## ------------------------------------------------------------ 
## data$total_goals_at_half_time: 4
## [1] 6
## ------------------------------------------------------------ 
## data$total_goals_at_half_time: 5
## [1] 6
## ------------------------------------------------------------ 
## data$total_goals_at_half_time: 6
## [1] 6

Visualize the data by boxplot

boxplot(data$total_goal_count ~ data$stadium_name)

Kruskal-Wallis

data$total_minute[is.na(data$total_minute)] <- 0
data$total_goal_count[is.na(data$total_goal_count)] <- 0
data[which(data$total_minute < data$total_goal_count),"Games"] = "Less"
data[which(data$total_minute > data$total_goal_count),"Games"] = "More"
data[which(data$total_minute == data$total_goal_count),"Games"] = "Equal"

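Looking at the median half-time goals in each group to see whether it stays the same. Since every match lasts 90 minutes, which is more than any goal count, every row falls in the "More" group.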

by(data$total_goals_at_half_time,data$Games, median)
## data$Games: More
## [1] 1
kruskal.test(total_goals_at_half_time ~ total_goal_count, data = data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  total_goals_at_half_time by total_goal_count
## Kruskal-Wallis chi-squared = 161.2, df = 8, p-value < 2.2e-16

Here I am able to reject the null hypothesis that goals at half time have the same distribution across the total goal count groups.
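
For comparison, a Kruskal-Wallis test is more commonly run against a categorical grouping variable with three or more levels. The sketch below assumes a made-up data frame (toy is a hypothetical name, and the goal counts are invented) with three of the stadiums listed earlier as groups.

```r
# Hypothetical toy data: goal counts are invented for illustration only.
toy <- data.frame(
  stadium_name = factor(rep(c("Anfield (Liverpool)",
                              "Turf Moor (Burnley)",
                              "Vicarage Road (Watford)"), each = 5)),
  total_goal_count = c(4, 3, 5, 2, 4,
                       1, 2, 0, 3, 2,
                       2, 3, 1, 4, 2)
)

# Null hypothesis: total goals have the same distribution at every stadium.
res <- kruskal.test(total_goal_count ~ stadium_name, data = toy)

# Degrees of freedom are (number of groups - 1), here 3 - 1 = 2.
print(res$parameter)
print(res$p.value)
```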

ggplot to visualize the data

library(ggplot2)
ggplot(data = data, aes(x = total_goal_count, y = stadium_name)) +
  geom_boxplot()

Spearman

cor.test(data$total_goal_count,data$total_goals_at_half_time, method = "spearman")
## Warning in cor.test.default(data$total_goal_count,
## data$total_goals_at_half_time, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  data$total_goal_count and data$total_goals_at_half_time
## S = 3410149, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.6271134

With this p-value we again reject the null hypothesis, which here is that rho is equal to 0.

plot(data$total_goals_at_half_time, data$total_goal_count)
abline(lm(total_goal_count ~ total_goals_at_half_time, data = data),col = "Blue")

We see that the relationship is moderate and positive (rho is about 0.63): matches with more goals at half time tend to finish with more total goals.
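
As a small sketch of what Spearman's rho measures, the block below assumes an invented data frame (toy is a hypothetical name, and the values are made up) and checks that rho is simply the Pearson correlation computed on the ranks of the two columns.

```r
# Hypothetical toy data: values are invented for illustration only.
toy <- data.frame(
  total_goals_at_half_time = c(1, 3, 1, 1, 2, 1, 0, 4),
  total_goal_count         = c(3, 3, 2, 2, 3, 2, 1, 6)
)

# Built-in Spearman correlation.
rho_builtin <- cor(toy$total_goals_at_half_time, toy$total_goal_count,
                   method = "spearman")

# Same value by hand: Pearson correlation of the ranks
# (rank() assigns average ranks to ties by default).
rho_manual <- cor(rank(toy$total_goals_at_half_time),
                  rank(toy$total_goal_count))

print(all.equal(rho_builtin, rho_manual))
```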