This page is part of a series on EPL Away Wins.

Exploratory data analysis for English Premier League results, for seasons 2000/01 to 2018/19. How well can Away wins be predicted from individual match variables?

Summary of Selected Variables

For each game, we have home team and away team data for:

We have data for the previous 4 games for the home team and away team. We also have data for:

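We can look at summary statistics and distributions of some of these variables. The column names in the output below appear to follow the common football results convention: HS (home shots), HST (home shots on target), HC (home corners), HF (home fouls), HY (home yellow cards) and HomexG (home expected goals).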

       HS             HST               HC               HF              HY            HomexG     
 Min.   : 0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.00   Min.   :0.000   Min.   :0.000  
 1st Qu.:10.00   1st Qu.: 4.000   1st Qu.: 4.000   1st Qu.: 9.00   1st Qu.:0.000   1st Qu.:0.850  
 Median :13.00   Median : 6.000   Median : 6.000   Median :11.00   Median :1.000   Median :1.370  
 Mean   :13.53   Mean   : 6.363   Mean   : 6.167   Mean   :11.37   Mean   :1.376   Mean   :1.505  
 3rd Qu.:17.00   3rd Qu.: 8.000   3rd Qu.: 8.000   3rd Qu.:14.00   3rd Qu.:2.000   3rd Qu.:2.030  
 Max.   :43.00   Max.   :24.000   Max.   :19.000   Max.   :33.00   Max.   :7.000   Max.   :5.730  
                                                                                   NA's   :4620   
      Dist             winpc_H         top6_perfH    
 Min.   :  0.9521   Min.   :  0.00   Min.   :0.0000  
 1st Qu.:108.4414   1st Qu.: 24.00   1st Qu.:0.1482  
 Median :174.1738   Median : 33.33   Median :0.2667  
 Mean   :187.5508   Mean   : 36.70   Mean   :0.3031  
 3rd Qu.:278.3909   3rd Qu.: 48.15   3rd Qu.:0.4444  
 Max.   :472.2527   Max.   :100.00   Max.   :1.0000  
                                     NA's   :61      

We only have xG data for a couple of seasons, so HomexG has a large number of missing values (4,620). The top 6 performance variable has a small number of missing values (61 for top6_perfH), and all the other variables shown are complete.

Yellow card counts are highly discrete (0 to 7 per match for the home team, with a median of 1), so they will not be investigated further at this stage.

No normalising transformations are considered necessary for these distributions.

Feature Correlations

We now look at how strongly the feature variables are correlated with one another. For brevity, only the Home variables for the previous game are compared, as indicators.

Only variables with at least one correlation coefficient outside ±0.65 are included in the above plot.
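The filtering rule described here can be sketched as follows. This is a minimal illustration in plain Python, not the code used to produce the plot: the function names (pearson, strongly_correlated) and the toy values are assumptions made for the example, and only the 0.65 threshold comes from the text.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def strongly_correlated(columns, threshold=0.65):
    """Names of columns whose |r| exceeds threshold against at least one other column."""
    names = list(columns)
    keep = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) > threshold:
                keep.update((a, b))
    # Preserve the original column order in the result.
    return [name for name in names if name in keep]

# Made-up values for illustration only; these are not the EPL data.
toy = {
    "HS":  [10, 13, 17, 8, 20, 12],
    "HST": [4, 6, 8, 3, 10, 5],
    "HF":  [11, 9, 14, 12, 10, 13],
}
selected = strongly_correlated(toy)
print(selected)
```

In this toy data HS and HST move together closely and HF does not, so only the first two pass the filter. Nothing here should be read as a statement about the correlations in the real data.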

This plot shows that:

Away Wins

What proportion of results are Away Wins and how does this vary?

Overall, how many games are Away wins (1 = Away win, 0 = otherwise):
   0    1 
4514 1756 

by season:

Therefore, approximately 28% of these games (1,756 of 6,270) are Away wins. By season, the proportion of Away wins varies from roughly 22% to 33%. The change over time suggests that the proportion could be increasing; however, the p-value for the season component of a linear fit to this data is 0.104, so the trend is not significant at the conventional 5% level.
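As a quick check of the headline figure, the overall proportion can be recomputed from the counts in the table above. The second half of the sketch shows one plausible reading of "linear fit": an ordinary least-squares regression of seasonal proportion on season index. That reading, the helper name slope_t_statistic and the toy seasonal values are all assumptions, since the original fitting code and per-season figures are not shown.

```python
from math import sqrt

# Counts copied from the table above (0 = not an Away win, 1 = Away win).
counts = {0: 4514, 1: 1756}
total = sum(counts.values())
away_win_rate = counts[1] / total
print(f"{counts[1]} of {total} games are Away wins ({away_win_rate:.1%})")

def slope_t_statistic(x, y):
    """Least-squares slope of y on x and its t-statistic (slope / standard error)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares, with n - 2 degrees of freedom.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    se = sqrt(sse / (n - 2) / sxx)
    return slope, slope / se

# Made-up seasonal proportions for illustration only; not the EPL figures.
toy_season = [0, 1, 2, 3]
toy_rate = [0.25, 0.27, 0.26, 0.30]
slope, t_stat = slope_t_statistic(toy_season, toy_rate)
print(f"slope = {slope:.4f}, t = {t_stat:.2f}")
```

The two-sided p-value would then come from a t distribution with n - 2 degrees of freedom, for example via scipy.stats.t.sf if SciPy is available. It is left out here to keep the sketch dependency-free.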

Two Variable Plots

We will now create plots to illustrate how the feature variables relate to the observed proportion of Away wins. The match variables are summed over the previous 4 matches, which perhaps gives a better representation of current form.
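The form calculation and the per-bin proportion behind plots of this kind can be sketched as below. This is a minimal illustration under stated assumptions: the helper names (previous_n_sum, away_win_rate_by_bin), the bin width and all the values are hypothetical, and leaving matches with fewer than 4 earlier games undefined is one possible choice, not necessarily the one made in the original analysis.

```python
from collections import deque

def previous_n_sum(values, n=4):
    """For each match, sum the variable over the n preceding matches.

    Returns None until n earlier matches exist, so the current match
    never contributes to its own form figure.
    """
    window = deque(maxlen=n)
    out = []
    for v in values:
        out.append(sum(window) if len(window) == n else None)
        window.append(v)
    return out

def away_win_rate_by_bin(feature, away_win, width):
    """Away win proportion per feature bin, keyed by the bin's lower edge."""
    totals = {}
    for x, won in zip(feature, away_win):
        if x is None:
            continue  # no form figure available for this match
        start = (x // width) * width
        wins, games = totals.get(start, (0, 0))
        totals[start] = (wins + won, games + 1)
    return {start: wins / games for start, (wins, games) in sorted(totals.items())}

# Made-up shots-on-target figures for one team, in date order.
shots_on_target = [5, 7, 3, 6, 8, 4]
form = previous_n_sum(shots_on_target)
print(form)

# Made-up form values and results (1 = Away win), binned in steps of 5.
rates = away_win_rate_by_bin([None, 21, 24, 18, 22], [1, 0, 1, 1, 1], width=5)
print(rates)
```

Plotting the bin edges against the returned proportions gives the kind of two-variable view described here. In practice the rolling sum would be computed separately for each team within a season.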

Some conclusions about the relationship with Away win proportion that we can draw from these plots:

It is therefore recommended to take the following forward for predictive purposes:



