To complete the homework and project, I will work through some examples of coding these non-parametric tests.
To keep this assignment simple, we are going to use a football match dataset, loaded here as the data frame data.
head(data)
## date_GMT referee total_goal_count
## 1 Aug 10 2018 - 7:00pm Andre Marriner 3
## 2 Aug 11 2018 - 11:30am Martin Atkinson 3
## 3 Aug 11 2018 - 2:00pm Kevin Friend 2
## 4 Aug 11 2018 - 2:00pm Mike Dean 2
## 5 Aug 11 2018 - 2:00pm Chris Kavanagh 3
## 6 Aug 11 2018 - 2:00pm Jonathan Moss 2
## total_goals_at_half_time total_minute
## 1 1 90
## 2 3 90
## 3 1 90
## 4 1 90
## 5 2 90
## 6 1 90
## stadium_name
## 1 Old Trafford (Manchester)
## 2 St. James' Park (Newcastle upon Tyne)
## 3 Vitality Stadium (Bournemouth- Dorset)
## 4 Craven Cottage (London)
## 5 John Smith's Stadium (Huddersfield- West Yorkshire)
## 6 Vicarage Road (Watford)
table(data$stadium_name)
##
## Anfield (Liverpool)
## 19
## Cardiff City Stadium (Cardiff (Caerdydd))
## 19
## Craven Cottage (London)
## 19
## Emirates Stadium (London)
## 19
## Etihad Stadium (Manchester)
## 19
## Goodison Park (Liverpool)
## 19
## John Smith's Stadium (Huddersfield- West Yorkshire)
## 19
## King Power Stadium (Leicester- Leicestershire)
## 19
## London Stadium (London)
## 19
## Molineux Stadium (Wolverhampton- West Midlands)
## 19
## Old Trafford (Manchester)
## 19
## Selhurst Park (London)
## 19
## St. James' Park (Newcastle upon Tyne)
## 19
## St. Mary's Stadium (Southampton- Hampshire)
## 19
## Stamford Bridge (London)
## 19
## The American Express Community Stadium (Falmer- East Sussex)
## 19
## Tottenham Hotspur Stadium (London)
## 5
## Turf Moor (Burnley)
## 19
## Vicarage Road (Watford)
## 19
## Vitality Stadium (Bournemouth- Dorset)
## 19
## Wembley Stadium (London)
## 14
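I am going to look at stadium name. Let's ask whether a game's total goal count and its total minutes differ in location, using a paired test on their difference. First I'll trim the data frame down to a subset of stadiums and then compare the two columns. Note that the labels D and J below match no value of stadium_name, so df has no rows, as the summary shows.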
df <- data[which(data$stadium_name %in% c("D","J")),]
summary(df)
## date_GMT referee total_goal_count
## Length:0 Length:0 Min. : NA
## Class :character Class :character 1st Qu.: NA
## Mode :character Mode :character Median : NA
## Mean :NaN
## 3rd Qu.: NA
## Max. : NA
## total_goals_at_half_time total_minute stadium_name
## Min. : NA Min. : NA Length:0
## 1st Qu.: NA 1st Qu.: NA Class :character
## Median : NA Median : NA Mode :character
## Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA
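The empty summary arises because %in% compares against the full strings stored in stadium_name. A minimal sketch, using a small made-up data frame (toy and its goal counts are hypothetical, not the real match data), of how a placeholder label returns no rows while full names return the expected subset:

```r
# Hypothetical toy data: stadium names copied from the table above, goal counts invented
toy <- data.frame(
  stadium_name = c("Anfield (Liverpool)", "Turf Moor (Burnley)",
                   "Anfield (Liverpool)", "Vicarage Road (Watford)"),
  total_goal_count = c(3, 1, 4, 2),
  stringsAsFactors = FALSE
)

# Labels that match no stadium give a zero-row data frame
empty <- toy[toy$stadium_name %in% c("D", "J"), ]

# Full names, exactly as they appear in the column, give the expected subset
kept <- toy[toy$stadium_name %in% c("Anfield (Liverpool)", "Turf Moor (Burnley)"), ]

nrow(empty)
nrow(kept)
```

Checking nrow() right after a filter like this is a cheap way to catch a subset that silently came back empty.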
wilcox.test(data$total_goal_count, data$total_minute, data = df, paired=TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: data$total_goal_count and data$total_minute
## V = 0, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
by(data$total_goal_count,data$total_minute, median)
## data$total_minute: 90
## [1] 3
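It is clear here that these are very different: every game lasts 90 minutes, while the median goal count is 3. Note that the test ran on the full data rather than on df, because the call passes the data$ columns directly. It might be more interesting to compare two quantities that are closer, such as total goals and goals at half time, filtering on the labels F and G.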
df2 <- data[which(data$stadium_name %in% c("F","G")),]
wilcox.test(data$total_goal_count, data$total_goals_at_half_time, data = df2, paired=TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: data$total_goal_count and data$total_goals_at_half_time
## V = 46665, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
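Here I am able to reject the null hypothesis that the median difference between total goals and half-time goals is zero (p-value < 2.2e-16). This is expected, since a game's total goal count can never be below its half-time count.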
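When wilcox.test is called with two vectors, a data = argument is absorbed by ... and has no effect, so the test uses whatever vectors are passed. The sketch below illustrates this on invented, tie-free numbers (toy, grp, a and b are hypothetical names, not columns of the match data) and shows one way to restrict the test to a subset: pass the subset's own columns.

```r
# Hypothetical paired measurements; values are invented and chosen to avoid ties
toy <- data.frame(
  grp = c("F", "F", "F", "G", "G", "H"),
  a   = c(5.1, 3.2, 4.8, 6.0, 2.9, 7.3),
  b   = c(4.0, 3.9, 2.5, 4.4, 3.5, 5.2)
)
sub <- toy[toy$grp %in% c("F", "G"), ]

# data = sub is ignored here: this call still tests all six rows
ignored <- wilcox.test(toy$a, toy$b, data = sub, paired = TRUE)
full    <- wilcox.test(toy$a, toy$b, paired = TRUE)

# Passing the subset's columns restricts the test to the five F and G rows
on_subset <- wilcox.test(sub$a, sub$b, paired = TRUE)

unname(ignored$statistic) == unname(full$statistic)
unname(on_subset$statistic)
```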
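Next I am going to look at the difference between the total goal count and the goals at half time, and see whether the stadium makes a difference. Note that the labels setosa and versicolor below are iris species, not stadium names, so df3 has no rows.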
df3 <- data[which(data$stadium_name %in% c("setosa","versicolor")),]
df3["Tournament.Difference"] = df3$total_goal_count - df3$total_goals_at_half_time
With that all cleaned up we run the test.
wilcox.test(data$total_goal_count, data$total_goals_at_half_time, data = df3, paired=TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: data$total_goal_count and data$total_goals_at_half_time
## V = 46665, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
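So we are again able to reject the null hypothesis. The output is identical to the previous test (V = 46665) because both calls pass the same full data$ columns.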
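A difference column like Tournament.Difference can be tested directly: a paired signed rank test is the same as a one-sample signed rank test on the differences against zero. A minimal sketch on invented, tie-free values (x, y and d are hypothetical names):

```r
# Hypothetical paired values, invented and chosen to avoid ties and zero differences
x <- c(5.1, 3.2, 4.8, 6.0, 2.9, 7.3)
y <- c(4.0, 3.9, 2.5, 4.4, 3.5, 5.2)

# The difference column, analogous to Tournament.Difference
d <- x - y

# Paired test on the two columns
paired_test <- wilcox.test(x, y, paired = TRUE)

# One-sample test on the differences, null hypothesis: location equals 0
one_sample <- wilcox.test(d, mu = 0)

# Both calls report the same statistic and p-value
c(paired = unname(paired_test$statistic), one_sample = unname(one_sample$statistic))
```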
by(data$total_goal_count, data$total_goals_at_half_time, median)
## data$total_goals_at_half_time: 0
## [1] 1
## ------------------------------------------------------------
## data$total_goals_at_half_time: 1
## [1] 2.5
## ------------------------------------------------------------
## data$total_goals_at_half_time: 2
## [1] 3
## ------------------------------------------------------------
## data$total_goals_at_half_time: 3
## [1] 4
## ------------------------------------------------------------
## data$total_goals_at_half_time: 4
## [1] 6
## ------------------------------------------------------------
## data$total_goals_at_half_time: 5
## [1] 6
## ------------------------------------------------------------
## data$total_goals_at_half_time: 6
## [1] 6
Visualize the data with a boxplot of total goals by stadium
boxplot(data$total_goal_count ~ data$stadium_name)
data$total_minute[is.na(data$total_minute)] <- 0
data$total_goal_count[is.na(data$total_goal_count)] <- 0
data[which(data$total_minute < data$total_goal_count),"Games"] = "Less"
data[which(data$total_minute > data$total_goal_count),"Games"] = "More"
data[which(data$total_minute == data$total_goal_count),"Games"] = "Equal"
Look at the median half-time goals within each group. Because every game lasts 90 minutes, all games fall into the More group.
by(data$total_goals_at_half_time,data$Games, median)
## data$Games: More
## [1] 1
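The three assignments above can also be written as one vectorised labelling step. This is a sketch on invented values (minutes, goals and games are hypothetical stand-ins for the columns), and it shows why only one label appears when minutes is always 90:

```r
# Hypothetical stand-ins for total_minute and total_goal_count
minutes <- c(90, 90, 90, 90)
goals   <- c(3, 0, 6, 2)

# Nested ifelse assigns exactly one label per game
games <- ifelse(minutes < goals, "Less",
         ifelse(minutes > goals, "More", "Equal"))

# With 90 minutes in every row and single-digit goal counts, every label is "More"
table(games)
```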
kruskal.test(total_goals_at_half_time ~ total_goal_count, data = data)
##
## Kruskal-Wallis rank sum test
##
## data: total_goals_at_half_time by total_goal_count
## Kruskal-Wallis chi-squared = 161.2, df = 8, p-value < 2.2e-16
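Here I am able to reject the null hypothesis that half-time goals have the same distribution across the total goal count groups (p-value < 2.2e-16).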
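To see what the Kruskal-Wallis chi-squared value measures, the statistic can be reproduced by hand from the ranks. The sketch below uses invented, tie-free data (x and g are hypothetical) and the textbook formula without the tie correction, which kruskal.test applies when ties are present:

```r
# Hypothetical measurements in three groups, invented and tie-free
x <- c(2.1, 3.4, 1.9, 5.6, 4.8, 6.2, 7.7, 8.1, 6.9)
g <- factor(rep(c("A", "B", "C"), each = 3))

# Rank all observations together, then sum the ranks within each group
r   <- rank(x)
N   <- length(x)
R_j <- tapply(r, g, sum)
n_j <- tapply(r, g, length)

# H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1), valid when there are no ties
H <- 12 / (N * (N + 1)) * sum(R_j^2 / n_j) - 3 * (N + 1)

# Built-in test for comparison
kt <- kruskal.test(x, g)
c(by_hand = H, built_in = unname(kt$statistic))
```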
Use ggplot to visualize the data
ggplot(data = data, aes(x = total_goal_count, y = stadium_name)) +
  geom_boxplot()
cor.test(data$total_goal_count,data$total_goals_at_half_time, method = "spearman")
## Warning in cor.test.default(data$total_goal_count,
## data$total_goals_at_half_time, : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: data$total_goal_count and data$total_goals_at_half_time
## S = 3410149, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.6271134
With this p-value we again reject the null hypothesis, here that rho is equal to 0.
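Spearman's rho is the Pearson correlation computed on ranks, with tied values given their average rank. A minimal sketch on invented goal counts (half and total are hypothetical vectors, not the real columns):

```r
# Hypothetical half-time and full-time goal counts, with ties as in real match data
half  <- c(0, 1, 1, 2, 0, 3, 2, 1)
total <- c(1, 2, 3, 3, 0, 5, 4, 1)

# Spearman correlation as reported by cor()
rho_spearman <- cor(half, total, method = "spearman")

# Pearson correlation of the average ranks gives the same number
rho_from_ranks <- cor(rank(half), rank(total))

c(spearman = rho_spearman, pearson_on_ranks = rho_from_ranks)
```

Because only ranks enter the calculation, rho measures monotonic association and is unaffected by any increasing transformation of either variable.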
plot(data$total_goals_at_half_time, data$total_goal_count)
abline(lm(total_goal_count ~ total_goals_at_half_time, data = data),col = "Blue")
We see that this relationship is positive and moderately strong (rho = 0.63): games with more goals at half time tend to end with more total goals.