Data Preparation

library(DATA606)
library(ggplot2)
library(tidyr)
library(dplyr)
library(knitr)

url <- "https://raw.githubusercontent.com/wheremagichappens/an.dy/master/DATA606/final%20project/I1.csv"
seriea <- read.csv(url, sep = ",", header = TRUE)

str(seriea)
head(seriea)

#by(seriea$FTR, nrow(seriea))
#group_by(seriea, FTR) %>% summarise(n=n())

#Use gather() to stack the two yellow card columns into rows. YC is the number of yellow cards received and team_h_a records whether the cards went to the home or the away team (HY = Home Yellow Cards, AY = Away Yellow Cards).
seriea_yc <- gather(seriea, team_h_a, YC, HY:AY)
seriea_yc <- seriea_yc[c("team_h_a","YC")]
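The reshaping step above can also be written with pivot_longer(), which tidyr now recommends over gather(). The following is a minimal sketch on a made-up three-match table (the toy data frame and its values are assumptions for illustration, not rows from the Serie A file). Row order differs from gather(), but the resulting team_h_a and YC columns carry the same information.

```r
library(tidyr)

# Toy stand-in for the match table (values are made up for illustration)
toy <- data.frame(
  HomeTeam = c("A", "B", "C"),
  HY = c(2, 0, 3),
  AY = c(1, 4, 2)
)

# One row per (match, side): team_h_a holds "HY"/"AY", YC holds the card count
toy_long <- pivot_longer(
  toy,
  cols = c(HY, AY),
  names_to = "team_h_a",
  values_to = "YC"
)

# Keep only the two columns used in the rest of the analysis
toy_long <- toy_long[c("team_h_a", "YC")]
print(toy_long)
```

Each match contributes two rows, so a table with 380 matches yields 760 rows in the long format.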

Introduction

There is a common belief that home teams enjoy a “home advantage” in any sport. I want to examine whether this is true by measuring the number of yellow cards received by home teams and away teams and then performing a hypothesis test.

Research question

Does playing at home or away affect the number of yellow cards a team receives, on average, in Serie A (the Italian league) during the 2016-2017 season? In addition, I want to know whether the number of yellow cards received is related to full time goals and shots on target.

Cases

Each case represents a match in Serie A. There are 380 observations in the data set.

Data collection

Data is submitted monthly. It is collected by Football-Data.

Type of study

It is an observational study.

Data Source

Data is collected by Football-Data and is freely available: http://www.football-data.co.uk/italym.php. For this project, the CSV file for the 2016-2017 season was downloaded and then loaded into R.

Response

The response variables are the number of yellow cards received, shots on target and full time goals by home and away teams. All of them are numerical.

Explanatory

One of the explanatory variables is a variable that classifies whether yellow cards were received by home teams or away teams and is categorical. The other variables are the number of yellow cards received by home and away teams.

Relevant summary statistics

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(seriea$HY)
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 380 2.07 1.29      2    2.02 1.48   0   6     6 0.49     0.03 0.07
describe(seriea$AY)
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 380 2.33 1.28      2    2.28 1.48   0   6     6 0.35    -0.24 0.07
table(seriea$HY, useNA = 'ifany')
## 
##   0   1   2   3   4   5   6 
##  37  99 111  86  30  13   4
table(seriea$AY, useNA = 'ifany')
## 
##   0   1   2   3   4   5   6 
##  23  79 122  88  46  19   3
prop.table(table(seriea$HY, useNA='ifany')) * 100
## 
##         0         1         2         3         4         5         6 
##  9.736842 26.052632 29.210526 22.631579  7.894737  3.421053  1.052632
prop.table(table(seriea$AY, useNA='ifany')) * 100
## 
##          0          1          2          3          4          5 
##  6.0526316 20.7894737 32.1052632 23.1578947 12.1052632  5.0000000 
##          6 
##  0.7894737
describeBy(seriea_yc$YC, group = seriea_yc$team_h_a, mat = TRUE)
##     item group1 vars   n     mean       sd median  trimmed    mad min max
## X11    1     AY    1 380 2.326316 1.280601      2 2.276316 1.4826   0   6
## X12    2     HY    1 380 2.073684 1.291271      2 2.019737 1.4826   0   6
##     range      skew    kurtosis         se
## X11     6 0.3462785 -0.24464584 0.06569344
## X12     6 0.4866522  0.02524958 0.06624078
#Histogram of the number of yellow cards received by home teams (one bin per card count)
ggplot(seriea, aes(HY)) + geom_histogram(binwidth = 1) + ggtitle("Home Yellow Cards count") + xlab("Yellow Cards received by Home Teams")

#Histogram of the number of yellow cards received by away teams (one bin per card count)
ggplot(seriea, aes(AY)) + geom_histogram(binwidth = 1) + ggtitle("Away Yellow Cards count") + xlab("Yellow Cards received by Away Teams")

Final Project – Things to add

#Visualize in boxplot grouping by Home/Away team
boxplot(YC ~ team_h_a, seriea_yc, col=c("red","blue"),ylab="# of Yellow Cards received",main="YC by away team VS YC by home team")

#A grouped bar plot gives a clearer picture of the two distributions than separate histograms
counts <- table(seriea_yc$team_h_a,seriea_yc$YC)
barplot(counts, main="Yellow Card Distribution by Home Teams and Away Teams",
  xlab="Number of Yellow Card received", col=c("red","blue"),
    legend = rownames(counts), beside=TRUE)

#Let's take a look at summary statistics
by(seriea_yc$YC, seriea_yc$team_h_a, mean)
## seriea_yc$team_h_a: AY
## [1] 2.326316
## -------------------------------------------------------- 
## seriea_yc$team_h_a: HY
## [1] 2.073684
by(seriea_yc$YC, seriea_yc$team_h_a, median)
## seriea_yc$team_h_a: AY
## [1] 2
## -------------------------------------------------------- 
## seriea_yc$team_h_a: HY
## [1] 2
by(seriea_yc$YC, seriea_yc$team_h_a, min)
## seriea_yc$team_h_a: AY
## [1] 0
## -------------------------------------------------------- 
## seriea_yc$team_h_a: HY
## [1] 0
by(seriea_yc$YC, seriea_yc$team_h_a, max)
## seriea_yc$team_h_a: AY
## [1] 6
## -------------------------------------------------------- 
## seriea_yc$team_h_a: HY
## [1] 6
by(seriea_yc$YC, seriea_yc$team_h_a, sd)
## seriea_yc$team_h_a: AY
## [1] 1.280601
## -------------------------------------------------------- 
## seriea_yc$team_h_a: HY
## [1] 1.291271
#Let's put all of the summary statistics into one table
stat_table <- seriea_yc %>%
  group_by(team_h_a) %>%
  summarise(
    Min = min(YC, na.rm = TRUE),
    Q1 = quantile(YC, probs = .25, na.rm = TRUE),
    Median = median(YC, na.rm = TRUE),
    Q3 = quantile(YC, probs = .75, na.rm = TRUE),
    Max = max(YC, na.rm = TRUE),
    Mean = mean(YC, na.rm = TRUE),
    SD = sd(YC, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(YC)),
    Total = sum(YC)
  )

kable(stat_table)
team_h_a  Min  Q1  Median  Q3  Max      Mean        SD    n  Missing  Total
AY          0   1       2   3    6  2.326316  1.280601  380        0    884
HY          0   1       2   3    6  2.073684  1.291270  380        0    788
#Note from the table that the mean number of yellow cards received is higher for away teams than for home teams (2.33 versus 2.07). The total number of yellow cards received is also higher for away teams (884 versus 788). Let's turn to other relevant statistics to find out why.


#Let's examine the number of fouls committed by home and away teams. My thinking is that teams that commit more fouls tend to receive more yellow cards, so away teams might have committed more fouls.

summary(seriea$HF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   11.00   14.00   13.98   17.00   28.00
sum(seriea$HF)
## [1] 5312
summary(seriea$AF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   11.00   14.00   13.95   17.00   32.00
sum(seriea$AF)
## [1] 5300
#It turns out that home teams committed slightly more fouls than away teams, both on average (13.98 versus 13.95 per match) and in total (5312 versus 5300), so fouls alone do not explain why away teams received more yellow cards. This suggests that there might be a "home advantage". Let's test whether it exists, using a one-tailed hypothesis test of whether the mean YC for away teams is greater than the mean YC for home teams.
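One way to make the comparison of fouls and cards concrete is to compute yellow cards per foul for each side. The sketch below is only an illustration that hard-codes the season totals printed in the output above; the variable names are my own.

```r
# Season totals as printed in the output above
cards <- c(home = 788, away = 884)
fouls <- c(home = 5312, away = 5300)

# Yellow cards per foul committed, by side
cards_per_foul <- cards / fouls
print(round(cards_per_foul, 3))

# Away rate relative to the home rate
ratio <- unname(cards_per_foul["away"] / cards_per_foul["home"])
print(round(ratio, 3))
```

The rates come out at roughly 0.148 cards per foul for home teams and 0.167 for away teams. This is a descriptive ratio only; it says nothing about why the rates differ.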

Inference

#From the summary statistics, we know that home teams committed slightly more fouls than away teams, both on average and in total, yet received fewer yellow cards on average and in total. A hypothesis test will tell us whether the difference in mean yellow cards is larger than chance alone would explain. On its own, it cannot show that referees are the cause.

load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/inference.Rdata'))


inference(y = seriea_yc$YC, x = seriea_yc$team_h_a, est = "mean", type = "ht", null = 0, 
          alternative = "greater", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_AY = 380, mean_AY = 2.3263, sd_AY = 1.2806
## n_HY = 380, mean_HY = 2.0737, sd_HY = 1.2913
## Observed difference between means (AY-HY) = 0.2526
## H0: mu_AY - mu_HY = 0 
## HA: mu_AY - mu_HY > 0 
## Standard error = 0.093 
## Test statistic: Z =  2.708 
## p-value =  0.0034
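The standard error and test statistic reported above can be checked by hand from the summary statistics in the same output. This assumes the usual unpooled standard error for a difference of two means, which matches the printed value of 0.093.

```latex
\begin{align*}
SE &= \sqrt{\frac{s_{AY}^2}{n_{AY}} + \frac{s_{HY}^2}{n_{HY}}}
    = \sqrt{\frac{1.2806^2}{380} + \frac{1.2913^2}{380}}
    \approx \sqrt{0.0087} \approx 0.093 \\
Z  &= \frac{(\bar{x}_{AY} - \bar{x}_{HY}) - 0}{SE}
    = \frac{2.3263 - 2.0737}{0.0933}
    \approx 2.71 \\
p  &= P(Z > 2.71) \approx 0.0034
\end{align*}
```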

#Linear regressions of full time goals and of shots on target on the opposing team's HC and AC counts. Note that Football-Data's column key lists HC and AC as corners; yellow cards are HY and AY.

ddd <- lm(FTAG ~ HC, data =seriea)
summary(ddd)
## 
## Call:
## lm(formula = FTAG ~ HC, data = seriea)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4773 -1.1326 -0.2893  0.6794  5.6167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.47731    0.12739  11.596   <2e-16 ***
## HC          -0.03134    0.01912  -1.639    0.102    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.205 on 378 degrees of freedom
## Multiple R-squared:  0.007058,   Adjusted R-squared:  0.004431 
## F-statistic: 2.687 on 1 and 378 DF,  p-value: 0.102
eee <- lm(FTHG ~ AC, data =seriea)
summary(eee)
## 
## Call:
## lm(formula = FTHG ~ AC, data = seriea)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7938 -0.7606 -0.5283  0.5381  5.4053 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.82704    0.13870  13.172   <2e-16 ***
## AC          -0.03320    0.02394  -1.387    0.166    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.353 on 378 degrees of freedom
## Multiple R-squared:  0.00506,    Adjusted R-squared:  0.002428 
## F-statistic: 1.922 on 1 and 378 DF,  p-value: 0.1664
fff <- lm(AST ~ HC, data =seriea)
summary(fff)
## 
## Call:
## lm(formula = AST ~ HC, data = seriea)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2704 -1.5746 -0.3579  1.2734  8.7296 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.03080    0.23403   21.50  < 2e-16 ***
## HC          -0.15208    0.03512   -4.33 1.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.214 on 378 degrees of freedom
## Multiple R-squared:  0.04726,    Adjusted R-squared:  0.04474 
## F-statistic: 18.75 on 1 and 378 DF,  p-value: 1.911e-05
ggg <- lm(HST ~ AC, data =seriea)
summary(ggg)
## 
## Call:
## lm(formula = HST ~ AC, data = seriea)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0763 -1.9900 -0.4037  1.5787  8.4636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.76640    0.26296  21.929   <2e-16 ***
## AC          -0.11502    0.04539  -2.534   0.0117 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.565 on 378 degrees of freedom
## Multiple R-squared:  0.0167, Adjusted R-squared:  0.0141 
## F-statistic: 6.421 on 1 and 378 DF,  p-value: 0.01168
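The text describes regressions on yellow cards, while the models above use HC and AC, which Football-Data's column key lists as corners. A minimal sketch of how the yellow card versions could be fitted with HY and AY follows. It runs on simulated data, so the helper fit_card_models() is hypothetical and the numbers it prints mean nothing for Serie A; only the model formulas are the point.

```r
# Simulated stand-in for the match table; every value here is made up
set.seed(1)
n <- 50
toy <- data.frame(
  HY = rpois(n, 2), AY = rpois(n, 2),
  FTHG = rpois(n, 1.5), FTAG = rpois(n, 1.2),
  HST = rpois(n, 5), AST = rpois(n, 4)
)

# Hypothetical helper: regress one side's outcome on the other side's yellow cards
fit_card_models <- function(df) {
  list(
    away_goals = lm(FTAG ~ HY, data = df),
    home_goals = lm(FTHG ~ AY, data = df),
    away_shots = lm(AST ~ HY, data = df),
    home_shots = lm(HST ~ AY, data = df)
  )
}

models <- fit_card_models(toy)

# Slope estimate and p-value for each model, one row per model
slopes <- t(sapply(models, function(m) {
  coef(summary(m))[2, c("Estimate", "Pr(>|t|)")]
}))
print(slopes)
```

Calling fit_card_models(seriea) on the real table would give the yellow card counterparts of the four models above, assuming the columns carry the same names.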
Since the p-value (0.0034) from the one-tailed hypothesis test is less than 0.05, we reject the null hypothesis in favor of the alternative hypothesis that the mean number of yellow cards received by away teams is higher than that received by home teams. Although home teams committed slightly more fouls than away teams (5312 versus 5300), away teams received more yellow cards (884 versus 788). This is consistent with a "home advantage" from referees. However, because this is an observational study, we cannot conclude that refereeing causes the difference. We might be missing a confounding variable, such as the number of tackles or any other measure of "aggressive" play by home and away teams. Such a variable could lead to more yellow cards for away teams without any "home advantage".

The linear regressions need a caveat. They use HC and AC, which Football-Data's column key lists as corners, whereas yellow cards are HY and AY. As fitted, the models show no statistically significant linear relationship between one team's count and the opponent's full time goals (p = 0.102 and p = 0.166). They do show that a higher count for one team is associated with fewer shots on target for the opponent (p < 0.001 for AST ~ HC and p = 0.012 for HST ~ AC), although R-squared is below 0.05 in both models. The relationship between yellow cards and goals or shots on target therefore remains untested here.

It might be true that away teams tend to play harder and are therefore more likely to commit dangerous actions during the game, which can lead to more yellow cards. Unfortunately, we do not have the data to support this claim. We will need that data for more accurate inference in the future.