Data Preparation

library(DATA606)
library(ggplot2)
library(tidyr)
library(dplyr)
library(knitr)

url <- "https://raw.githubusercontent.com/wheremagichappens/an.dy/master/DATA606/final%20project/I1.csv"
seriea <- read.csv(url, sep = ",", header = TRUE)

str(seriea)
head(seriea)

# Use gather() to reshape the HY and AY columns from wide to long format.
# YC is the number of yellow cards received; team_h_a records whether they
# were received by the home team (HY) or the away team (AY).
seriea_yc <- gather(seriea, team_h_a, YC, HY:AY)
seriea_yc <- seriea_yc[c("team_h_a","YC")]
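The same wide-to-long reshape can be sketched with `pivot_longer()`, which tidyr now recommends in place of `gather()`. The toy data frame below is a made-up stand-in for the match table (team names and counts are illustrative only), so this is a minimal sketch and not the project's actual data step.

```r
library(tidyr)

# Toy stand-in for the match table; team names and counts are made up
toy <- data.frame(
  HomeTeam = c("Team A", "Team B", "Team C"),
  AwayTeam = c("Team D", "Team E", "Team F"),
  HY = c(2, 1, 3),
  AY = c(3, 2, 2)
)

# Stack HY and AY into one column of counts plus a label column
toy_long <- pivot_longer(
  toy,
  cols = c(HY, AY),
  names_to = "team_h_a",
  values_to = "YC"
)

# Keep only the label and the count, as in the chunk above
toy_long <- toy_long[c("team_h_a", "YC")]
print(toy_long)
```

Unlike `gather()`, which stacks one whole column after the other, `pivot_longer()` orders the result match by match. The set of rows is the same, so group summaries are unaffected.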

Introduction

It is commonly believed that home teams enjoy a “home advantage” in sports competitions. I want to examine whether this is true by performing a hypothesis test on the number of yellow cards received by the home and away teams in each Serie A match of the 2016-2017 season.

Research question

Does playing at home or away affect the number of yellow cards a team receives, on average, in Serie A (the Italian league) during the 2016-2017 season? My expectation is that away teams receive more yellow cards than home teams, both in total and on average per match. A one-tailed hypothesis test is needed to assess this (alternative hypothesis: the mean number of yellow cards received by away teams is greater than the mean received by home teams).
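Stated formally, the one-tailed test could be written as follows. The symbols $\mu_{\text{away}}$ and $\mu_{\text{home}}$ are notation introduced here for the mean number of yellow cards per match received by away and home teams:

$$
H_0: \mu_{\text{away}} = \mu_{\text{home}}
\qquad
H_A: \mu_{\text{away}} > \mu_{\text{home}}
$$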

Cases

Each case represents the statistics of one match in the 2016-2017 Serie A season. There are 380 observations in the data set.

Data collection

The data is collected by Football-Data and updated monthly.

Type of study

It is an observational study.

Data Source

The data is collected by Football-Data and is freely available: http://www.football-data.co.uk/italym.php. For this project, the CSV file for the 2016-2017 season was downloaded and then loaded into R.

Response

The response variable is the number of yellow cards received and is numerical.

Explanatory

The explanatory variable indicates whether the yellow cards were received by the home team or the away team and is categorical.
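As a minimal sketch of how these two variable types might be encoded in R, the toy data below mimics the shape of the long-form table. The values and the labels "Away" and "Home" are illustrative assumptions, not part of the original data.

```r
# Toy long-form data in the same shape as seriea_yc; values are made up
toy_yc <- data.frame(
  team_h_a = c("HY", "AY", "HY", "AY"),
  YC = c(2, 3, 1, 2)
)

# Encode the explanatory variable as a factor with an explicit level order
# (the labels "Away" and "Home" are illustrative, not from the original)
toy_yc$team_h_a <- factor(toy_yc$team_h_a,
                          levels = c("AY", "HY"),
                          labels = c("Away", "Home"))

# The response is numeric and the explanatory variable is a two-level factor
str(toy_yc)

# Mean yellow cards per group, reported in the level order set above
group_means <- tapply(toy_yc$YC, toy_yc$team_h_a, mean)
print(group_means)
```

Setting the level order explicitly makes it clear which group comes first in later summaries and in any one-sided comparison.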

Relevant summary statistics

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(seriea$HY)
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 380 2.07 1.29      2    2.02 1.48   0   6     6 0.49     0.03 0.07
describe(seriea$AY)
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 380 2.33 1.28      2    2.28 1.48   0   6     6 0.35    -0.24 0.07
table(seriea$HY, useNA = 'ifany')
## 
##   0   1   2   3   4   5   6 
##  37  99 111  86  30  13   4
table(seriea$AY, useNA = 'ifany')
## 
##   0   1   2   3   4   5   6 
##  23  79 122  88  46  19   3
prop.table(table(seriea$HY, useNA='ifany')) * 100
## 
##         0         1         2         3         4         5         6 
##  9.736842 26.052632 29.210526 22.631579  7.894737  3.421053  1.052632
prop.table(table(seriea$AY, useNA='ifany')) * 100
## 
##          0          1          2          3          4          5 
##  6.0526316 20.7894737 32.1052632 23.1578947 12.1052632  5.0000000 
##          6 
##  0.7894737
describeBy(seriea_yc$YC, group = seriea_yc$team_h_a, mat = TRUE)
##     item group1 vars   n     mean       sd median  trimmed    mad min max
## X11    1     AY    1 380 2.326316 1.280601      2 2.276316 1.4826   0   6
## X12    2     HY    1 380 2.073684 1.291271      2 2.019737 1.4826   0   6
##     range      skew    kurtosis         se
## X11     6 0.3462785 -0.24464584 0.06569344
## X12     6 0.4866522  0.02524958 0.06624078
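The one-tailed comparison named in the research question could be sketched directly from the grouped means, standard deviations, and sample sizes printed above. This sketch assumes the two groups are treated as independent samples (Welch's t-test). That is a simplifying assumption, because each match contributes one home count and one away count, so a paired test on the per-match differences would be a reasonable alternative.

```r
# Grouped summary values copied from the describeBy output above
n_away <- 380; mean_away <- 2.326316; sd_away <- 1.280601
n_home <- 380; mean_home <- 2.073684; sd_home <- 1.291271

# Assumption: the groups are independent samples (Welch's t-test)
v_away <- sd_away^2 / n_away   # squared standard error, away teams
v_home <- sd_home^2 / n_home   # squared standard error, home teams

# Test statistic for the difference in means (away minus home)
t_stat <- (mean_away - mean_home) / sqrt(v_away + v_home)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_away + v_home)^2 /
  (v_away^2 / (n_away - 1) + v_home^2 / (n_home - 1))

# One-sided p-value for the alternative: mean away > mean home
p_value <- pt(t_stat, df = df_welch, lower.tail = FALSE)

print(c(t = t_stat, df = df_welch, p = p_value))
```

With the full data loaded, `t.test()` with `alternative = "greater"` performs the same calculation. In its formula interface the direction refers to the first factor level minus the second, so the level order should be checked before interpreting the result.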
# Histogram of the number of yellow cards received by home teams
ggplot(seriea, aes(HY)) + geom_histogram(binwidth = 1) + ggtitle("Home Yellow Cards count") + xlab("Yellow cards received by home teams")

# Histogram of the number of yellow cards received by away teams
ggplot(seriea, aes(AY)) + geom_histogram(binwidth = 1) + ggtitle("Away Yellow Cards count") + xlab("Yellow cards received by away teams")