library(DATA606)
library(ggplot2)
library(tidyr)
library(dplyr)
library(knitr)
url <- "https://raw.githubusercontent.com/wheremagichappens/an.dy/master/DATA606/final%20project/I1.csv"
seriea <- read.csv(url, header = TRUE)
str(seriea)
head(seriea)
# Use gather() to reshape the HY and AY columns into rows. YC holds the number of yellow cards received, and team_h_a records which side received them (HY = yellow cards to the home team, AY = yellow cards to the away team).
seriea_yc <- gather(seriea, team_h_a, YC, HY:AY)
seriea_yc <- seriea_yc[c("team_h_a","YC")]
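`gather()` still works, but the tidyr documentation marks it as superseded by `pivot_longer()`. As a minimal sketch of the same wide-to-long reshape, assuming a small made-up data frame (`toy` and its `match_id` column are hypothetical and not part of the Serie A file):

```r
library(tidyr)

# Hypothetical three-match example; only the HY/AY column names mirror the real data.
toy <- data.frame(
  match_id = 1:3,
  HY = c(2, 0, 3),
  AY = c(1, 4, 2)
)

# Stack HY and AY into one column of counts (YC) plus a label column (team_h_a).
toy_long <- pivot_longer(
  toy,
  cols = c(HY, AY),
  names_to = "team_h_a",
  values_to = "YC"
)

print(toy_long)
```

Each match contributes two rows, one per side, which is the layout that grouped summaries and grouped tests expect.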
There is a common belief that home teams enjoy a “home advantage” in any sporting competition. I want to examine whether that holds by running a hypothesis test on the number of yellow cards received by home teams and away teams in each Serie A match of the 2016-2017 season.
The research question is whether playing at home or away affects the number of yellow cards a team receives, on average, in Serie A (the Italian league) during the 2016-2017 season. My hypothesis is that away teams receive more yellow cards than home teams, both in total and on average per match. I will use a one-tailed hypothesis test to assess this (alternative hypothesis: mean of yellow cards received by away teams > mean of yellow cards received by home teams).
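Stated formally, with $\mu_{\text{away}}$ and $\mu_{\text{home}}$ denoting the mean number of yellow cards per match for away and home teams (these symbols are introduced here only for illustration), the one-tailed test can be written as:

$$
H_0: \mu_{\text{away}} = \mu_{\text{home}} \qquad \text{versus} \qquad H_A: \mu_{\text{away}} > \mu_{\text{home}}
$$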
Each case represents the statistics for one Serie A match in the 2016-2017 season. The data set contains 380 observations.
The data are collected by Football-Data and updated monthly.
This is an observational study.
The data are collected by Football-Data and are freely available at http://www.football-data.co.uk/italym.php. For this project, the CSV file for the 2016-2017 season was downloaded and then loaded into R.
The response variable is the number of yellow cards received; it is numerical.
The explanatory variable indicates whether the yellow cards were received by the home team or the away team; it is categorical.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(seriea$HY)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 380 2.07 1.29 2 2.02 1.48 0 6 6 0.49 0.03 0.07
describe(seriea$AY)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 380 2.33 1.28 2 2.28 1.48 0 6 6 0.35 -0.24 0.07
table(seriea$HY, useNA = 'ifany')
##
## 0 1 2 3 4 5 6
## 37 99 111 86 30 13 4
table(seriea$AY, useNA = 'ifany')
##
## 0 1 2 3 4 5 6
## 23 79 122 88 46 19 3
prop.table(table(seriea$HY, useNA='ifany')) * 100
##
## 0 1 2 3 4 5 6
## 9.736842 26.052632 29.210526 22.631579 7.894737 3.421053 1.052632
prop.table(table(seriea$AY, useNA='ifany')) * 100
##
## 0 1 2 3 4 5
## 6.0526316 20.7894737 32.1052632 23.1578947 12.1052632 5.0000000
## 6
## 0.7894737
describeBy(seriea_yc$YC, group = seriea_yc$team_h_a, mat = TRUE)
## item group1 vars n mean sd median trimmed mad min max
## X11 1 AY 1 380 2.326316 1.280601 2 2.276316 1.4826 0 6
## X12 2 HY 1 380 2.073684 1.291271 2 2.019737 1.4826 0 6
## range skew kurtosis se
## X11 6 0.3462785 -0.24464584 0.06569344
## X12 6 0.4866522 0.02524958 0.06624078
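The group means above differ by about 0.25 cards per match. One way to check whether that gap is larger than chance alone would produce is a one-sided Welch two-sample t-test. The sketch below is an assumption about how such a test could be run, not necessarily the procedure used elsewhere in this project. It rebuilds the two samples from the frequency tables printed earlier so that it runs without downloading the CSV, and it treats home and away counts as independent samples even though each match contributes one of each.

```r
# Rebuild the per-match counts from the frequency tables of HY and AY shown above.
cards <- 0:6
hy <- rep(cards, times = c(37, 99, 111, 86, 30, 13, 4))   # home teams
ay <- rep(cards, times = c(23, 79, 122, 88, 46, 19, 3))   # away teams

# One-sided Welch t-test: the alternative is mean(away) > mean(home).
# Assumption: the two groups are treated as independent samples.
yc_test <- t.test(ay, hy, alternative = "greater")

print(yc_test)
```

With the full data frame loaded, the same call could be written as `t.test(seriea$AY, seriea$HY, alternative = "greater")`; adding `paired = TRUE` would instead compare the two sides match by match.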
# Histogram of the number of yellow cards received by home teams (one bin per card count)
ggplot(seriea, aes(HY)) + geom_histogram(binwidth = 1) + ggtitle("Home Yellow Card Counts") + xlab("Yellow cards received by home teams")
# Histogram of the number of yellow cards received by away teams (one bin per card count)
ggplot(seriea, aes(AY)) + geom_histogram(binwidth = 1) + ggtitle("Away Yellow Card Counts") + xlab("Yellow cards received by away teams")
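Because the card counts are small integers, a dodged bar chart with both groups on one axis can make the home/away comparison easier to read than two separate histograms. This is a sketch only: the data frame `yc_counts` is a hypothetical name, and its values are copied from the frequency tables printed earlier rather than read from the CSV.

```r
library(ggplot2)

# Frequencies copied from table(seriea$HY) and table(seriea$AY) above.
yc_counts <- data.frame(
  cards   = rep(0:6, times = 2),
  team    = rep(c("Home", "Away"), each = 7),
  matches = c(37, 99, 111, 86, 30, 13, 4,
              23, 79, 122, 88, 46, 19, 3)
)

# Side-by-side bars: one pair of bars per yellow-card count.
p <- ggplot(yc_counts, aes(x = factor(cards), y = matches, fill = team)) +
  geom_col(position = "dodge") +
  labs(
    title = "Yellow cards per match, Serie A 2016-2017",
    x = "Yellow cards received in a match",
    y = "Number of matches",
    fill = "Team"
  )

print(p)
```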