What is the relationship between the number of shots and the actual goals scored by a team? An analysis of EPL data from the last four seasons, up to and including 2018-19
Do teams with the most shots on goal score more goals than their opponents?
NB: In the end it is the quality, not the quantity, of shots that counts, hence expected goals (xG).
## Let's load the data
Data<-read.csv("epl.csv",header = TRUE)
head(Data,n=10)
## X Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR HS
## 1 1 08/08/15 Bournemouth Aston Villa 0 1 A 0 0 D 11
## 2 2 08/08/15 Chelsea Swansea 2 2 D 2 1 H 11
## 3 3 08/08/15 Everton Watford 2 2 D 0 1 A 10
## 4 4 08/08/15 Leicester Sunderland 4 2 H 3 0 H 19
## 5 5 08/08/15 Man United Tottenham 1 0 H 1 0 H 9
## 6 6 08/08/15 Norwich Crystal Palace 1 3 A 0 1 A 17
## 7 7 09/08/15 Arsenal West Ham 0 2 A 0 1 A 22
## 8 8 09/08/15 Newcastle Southampton 2 2 D 1 1 D 9
## 9 9 09/08/15 Stoke Liverpool 0 1 A 0 0 D 7
## 10 10 10/08/15 West Brom Man City 0 3 A 0 2 A 9
## AS HST AST PSH PSA PSD
## 1 7 2 3 1.95 4.27 3.65
## 2 18 3 10 1.39 10.39 4.92
## 3 11 5 5 1.70 5.62 3.95
## 4 10 8 5 1.99 4.34 3.48
## 5 9 1 4 1.65 5.90 4.09
## 6 11 6 7 2.52 3.08 3.35
## 7 8 6 4 1.31 12.00 5.75
## 8 15 4 5 2.88 2.69 3.33
## 9 8 1 3 3.48 2.25 3.46
## 10 19 2 7 5.75 1.68 3.98
We are interested in the FTHG (full-time home goals), FTAG (full-time away goals), HST (home shots on goal) and AST (away shots on goal) columns, so we create a new data frame, STG.
## Load packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.0 v readr 1.3.1
## v tibble 2.1.3 v purrr 0.3.2
## v tidyr 0.8.3 v stringr 1.4.0
## v ggplot2 3.2.0 v forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
STG<-select(Data,FTHG,FTAG,HST,AST)
head(STG,n=5)
## FTHG FTAG HST AST
## 1 0 1 2 3
## 2 2 2 3 10
## 3 2 2 5 5
## 4 4 2 8 5
## 5 1 0 1 4
Now let's sum each variable.
sum_HG<-sum(STG$FTHG)
print(sum_HG)
## [1] 2352
sum_AG<-sum(STG$FTAG)
print(sum_AG)
## [1] 1828
sum_HS<-sum(STG$HST)
print(sum_HS)
## [1] 7162
sum_AS<-sum(STG$AST)
print(sum_AS)
## [1] 5863
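The four totals can also be computed in a single call. The following is a minimal sketch, assuming a data frame with the same four columns. `STG_head` is a hypothetical stand-in built from the five rows printed by head(STG, n = 5) above, so its totals cover those five matches only, not the full data set.

```r
# Hypothetical stand-in: the five rows shown by head(STG, n = 5) above
STG_head <- data.frame(
  FTHG = c(0, 2, 2, 4, 1),
  FTAG = c(1, 2, 2, 2, 0),
  HST  = c(2, 3, 5, 8, 1),
  AST  = c(3, 10, 5, 5, 4)
)

# colSums() returns one named total per column
totals <- colSums(STG_head)
print(totals)
```

On the full data, colSums(STG) should reproduce the four sums printed above.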
Average goals per game = (home goals + away goals) / number of matches (1520)
AG<-(sum_HG+sum_AG)/nrow(STG)
print(AG)
## [1] 2.75
Average shots on goal per game = (home shots on goal + away shots on goal) / number of matches
AS<-(sum_HS+sum_AS)/nrow(STG)
print(AS)
## [1] 8.569079
English Premier League matches produce an average of 2.75 goals per game. To score them, the two teams combine for an average of 8.569 shots on goal per match, which means EPL teams convert about 32.09% of their shots on goal (4180 goals from 13025 shots on goal).
Shots-to-goal ratio = total shots on goal / total goals
SGR<-(sum_HS+sum_AS)/(sum_HG+sum_AG)
print(SGR)
## [1] 3.116029
On average, it takes an English Premier League team 3.116 shots on goal to score a goal.
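The conversion rate (goals per shot on goal) and the shots-to-goal ratio are reciprocals of each other, which gives a quick consistency check on the figures. This is a small sketch that recomputes both from the totals printed earlier; the variable names are illustrative and not part of the original analysis.

```r
# Totals copied from the printed sums earlier in this analysis
goals         <- 2352 + 1828   # home goals + away goals
shots_on_goal <- 7162 + 5863   # home + away shots on goal
matches       <- 1520          # matches in the four seasons

goals_per_match <- goals / matches
shots_per_match <- shots_on_goal / matches

# Share of shots on goal that end up as goals, and its reciprocal
conversion_rate <- goals / shots_on_goal
shots_per_goal  <- shots_on_goal / goals

print(round(100 * conversion_rate, 2))  # conversion rate in percent
print(round(shots_per_goal, 3))         # shots on goal needed per goal
```

Because both quantities come from the same two totals, their product is always 1.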
Model the relationship between full-time home goals and home shots on goal
Let's find the correlation between home shots on goal (HST) and FTHG.
cor(STG$HST,STG$FTHG)
## [1] 0.5920373
A coefficient of 0.5920373 indicates a moderate positive linear relationship between home shots on goal and home goals scored.
Let's fit a regression model to predict home goals from home shots on goal.
model<-lm(STG$FTHG~STG$HST)
## Model summary
summary(model)
##
## Call:
## lm(formula = STG$FTHG ~ STG$HST)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4943 -0.7706 -0.0570 0.6565 4.0837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.19773 0.05432 3.64 0.000282 ***
## STG$HST 0.28643 0.01001 28.62 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.052 on 1518 degrees of freedom
## Multiple R-squared: 0.3505, Adjusted R-squared: 0.3501
## F-statistic: 819.2 on 1 and 1518 DF, p-value: < 2.2e-16
coef(model)
## (Intercept) STG$HST
## 0.1977348 0.2864344
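The fitted line can be turned into a simple predictor: expected home goals = intercept + slope x home shots on goal. This is a minimal sketch using the two coefficients printed above; `predict_home_goals` is a hypothetical helper name, not something from the original analysis.

```r
# Coefficients copied from the coef(model) output above
intercept <- 0.1977348
slope     <- 0.2864344

# Hypothetical helper: expected home goals for a given number of
# home shots on goal, according to the fitted straight line
predict_home_goals <- function(hst) {
  intercept + slope * hst
}

# Expected home goals for 2, 5 and 10 home shots on goal
print(round(predict_home_goals(c(2, 5, 10)), 2))
```

The slope says that each extra home shot on goal is associated with roughly 0.29 extra home goals. Note that predict() with newdata only works as expected when the model is fitted with column names and a data argument, for example lm(FTHG ~ HST, data = STG), rather than with STG$ terms in the formula.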
plot(STG$HST,STG$FTHG,main = "Scatterplot")
abline(model) # add the fitted regression line to the scatterplot
plot(model)
Residuals vs Fitted plot
Most of the dots appear to lie above the dotted line, and the red line stays close to the dotted line. We can therefore assume an approximately linear relationship between home shots on goal and home goals scored.
Normal Q-Q
The residual points follow the dotted line closely until the theoretical quantile exceeds about 2.75, where they pull away from it (this 2.75 is in standard normal units and is unrelated to the 2.75 goals-per-match average). This indicates that the residuals are not normally distributed in the upper tail.
Residuals vs Leverage: Cook's distance estimates the influence of each data point
This plot flags 3 points (774, 834 and 901) that could potentially have a large influence on the model; the regression results might change depending on whether these points are included or excluded.
Most of the dots lie well inside the Cook's distance lines, apart from these three points, which have a Cook's distance greater than 0.4 and therefore a large influence on the model.
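The influential points can also be listed numerically instead of being read off the plot. The sketch below uses cooks.distance() on a small made-up data set (`toy` is hypothetical and is not the EPL data) in which the last point sits far from the trend of the others.

```r
# Toy data (hypothetical, NOT the EPL data): ten points on a clear
# linear trend plus one extreme point far from that trend
set.seed(1)
toy <- data.frame(shots = c(1:10, 25))
toy$goals <- c(0.3 * (1:10) + rnorm(10, sd = 0.2), 1)

fit <- lm(goals ~ shots, data = toy)

# Cook's distance for every observation; larger means more influence
cd <- cooks.distance(fit)

# The three most influential observations, largest first
top3 <- head(sort(cd, decreasing = TRUE), 3)
print(top3)
```

Applied to the home model, head(sort(cooks.distance(model), decreasing = TRUE), 3) should list the same observations that the Residuals vs Leverage plot labels.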
Model the relationship between full-time away goals and away shots on goal
Let's find the correlation between away shots on goal (AST) and FTAG.
cor(STG$AST,STG$FTAG)
## [1] 0.59071
Away shots on goal have a correlation of 0.59071 with away goals scored, marginally lower than the home figure of 0.5920373.
## Away model
mod<-lm(STG$FTAG~STG$AST)
summary(mod)
##
## Call:
## lm(formula = STG$FTAG ~ STG$AST)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6065 -0.6353 -0.0243 0.6702 3.5318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02432 0.04802 0.506 0.613
## STG$AST 0.30548 0.01071 28.523 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9548 on 1518 degrees of freedom
## Multiple R-squared: 0.3489, Adjusted R-squared: 0.3485
## F-statistic: 813.6 on 1 and 1518 DF, p-value: < 2.2e-16
coef(mod)
## (Intercept) STG$AST
## 0.02431812 0.30548123
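In a simple linear regression with one predictor, R-squared equals the squared correlation between the predictor and the response, so the cor() and summary() outputs can be cross-checked against each other. A short sketch using the printed values (the variable names are illustrative):

```r
# Correlations copied from the cor() outputs in this analysis
r_home <- 0.5920373
r_away <- 0.59071

# Squared correlations, to compare with the Multiple R-squared
# values reported by summary(model) and summary(mod)
r_squared <- round(c(home = r_home^2, away = r_away^2), 4)
print(r_squared)

# Difference between the home and away correlations
print(r_home - r_away)
```

The squared values match the reported Multiple R-squared of 0.3505 (home) and 0.3489 (away), and the gap between the two correlations is only about 0.001.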
plot(STG$AST,STG$FTAG,main = "Scatterplot")
abline(mod) # add the fitted regression line to the scatterplot
plot(mod)
Conclusion
So does a high number of shots on goal result in more goals?
Based on the analysis above, there is a positive linear relationship between shots on goal and actual goals scored, with nearly identical correlations for home teams (0.592) and away teams (0.591). It is not a strong linear relationship, however: shots on goal explain only about 35% of the variance in goals scored (R-squared of 0.3505 at home and 0.3489 away).