What is the relationship between the number of shots and the actual goals scored by a team? An analysis of EPL data from the last 4 seasons, including 2018-19

Do teams with the most shots on goal score more goals than their opponents?

NB: In the end it is the quality, not the quantity, of shots that counts, hence expected goals (xG).

##Load the Data

Data<-read.csv("epl.csv",header = TRUE)
head(Data,n=10)
##     X     Date    HomeTeam       AwayTeam FTHG FTAG FTR HTHG HTAG HTR HS
## 1   1 08/08/15 Bournemouth    Aston Villa    0    1   A    0    0   D 11
## 2   2 08/08/15     Chelsea        Swansea    2    2   D    2    1   H 11
## 3   3 08/08/15     Everton        Watford    2    2   D    0    1   A 10
## 4   4 08/08/15   Leicester     Sunderland    4    2   H    3    0   H 19
## 5   5 08/08/15  Man United      Tottenham    1    0   H    1    0   H  9
## 6   6 08/08/15     Norwich Crystal Palace    1    3   A    0    1   A 17
## 7   7 09/08/15     Arsenal       West Ham    0    2   A    0    1   A 22
## 8   8 09/08/15   Newcastle    Southampton    2    2   D    1    1   D  9
## 9   9 09/08/15       Stoke      Liverpool    0    1   A    0    0   D  7
## 10 10 10/08/15   West Brom       Man City    0    3   A    0    2   A  9
##    AS HST AST  PSH   PSA  PSD
## 1   7   2   3 1.95  4.27 3.65
## 2  18   3  10 1.39 10.39 4.92
## 3  11   5   5 1.70  5.62 3.95
## 4  10   8   5 1.99  4.34 3.48
## 5   9   1   4 1.65  5.90 4.09
## 6  11   6   7 2.52  3.08 3.35
## 7   8   6   4 1.31 12.00 5.75
## 8  15   4   5 2.88  2.69 3.33
## 9   8   1   3 3.48  2.25 3.46
## 10 19   2   7 5.75  1.68 3.98

We are interested in the FTHG (full-time home goals), FTAG (full-time away goals), HST (home shots on goal) and AST (away shots on goal) columns, so we create a new data frame, STG, that contains only those four.

##Load Packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.0     v readr   1.3.1
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v stringr 1.4.0
## v ggplot2 3.2.0     v forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
STG<-select(Data,FTHG,FTAG,HST,AST)
head(STG,n=5)
##   FTHG FTAG HST AST
## 1    0    1   2   3
## 2    2    2   3  10
## 3    2    2   5   5
## 4    4    2   8   5
## 5    1    0   1   4

Now let's sum each variable over all matches

sum_HG<-sum(STG$FTHG)
print(sum_HG)
## [1] 2352
sum_AG<-sum(STG$FTAG)
print(sum_AG)
## [1] 1828
sum_HS<-sum(STG$HST)
print(sum_HS)
## [1] 7162
sum_AS<-sum(STG$AST)
print(sum_AS)
## [1] 5863

Average goals per game = (sum of home goals + sum of away goals) / number of matches

AG<-(sum_HG+sum_AG)/nrow(STG)  # 4180 goals over 1520 matches
print(AG)
## [1] 2.75

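Average shots on goal per game = (sum of home shots on goal + sum of away shots on goal) / number of matches

AS<-(sum_HS+sum_AS)/nrow(STG)  # 13025 shots on goal over 1520 matches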
print(AS)
## [1] 8.569079

An English Premier League match produces an average of 2.75 goals (both teams combined) from an average of 8.569 shots on goal, which means EPL teams have scored from 32.09% of their shots on goal (4180 / 13025).

Shots-to-Goal Ratio = Total Shots on Goal / Total Goals

SGR<-(13025/4180)
print(SGR)
## [1] 3.116029

It takes an English Premier League team an average of 3.116 shots on goal to score a goal.
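As a cross-check on the figures above, here is a minimal sketch that recomputes the per-match averages, the shots-to-goal ratio and its reciprocal (the share of shots on goal that end up as goals) from the four printed totals. The variable names are illustrative and are not part of the original analysis; the totals and the 1520 matches are copied from the output above.

```r
# Totals copied from the printed sums above
home_goals <- 2352
away_goals <- 1828
home_shots_on_goal <- 7162
away_shots_on_goal <- 5863
matches <- 1520  # number of matches used as the divisor above

total_goals <- home_goals + away_goals
total_shots <- home_shots_on_goal + away_shots_on_goal

goals_per_match <- total_goals / matches      # both teams combined
shots_per_match <- total_shots / matches      # both teams combined
shots_per_goal  <- total_shots / total_goals  # shots-to-goal ratio
conversion_rate <- total_goals / total_shots  # reciprocal of the ratio

print(round(c(goals_per_match = goals_per_match,
              shots_per_match = shots_per_match,
              shots_per_goal = shots_per_goal,
              conversion_pct = 100 * conversion_rate), 3))
```

Because the conversion rate is simply one divided by the shots-to-goal ratio, the two figures always have to agree with each other.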

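Modelling the relationship between Full-Time Home Goals and Home Shots on Goal

Let's find the correlation between home shots on goal (HST) and FTHG

cor(STG$HST,STG$FTHG)  # use the full column name; STG$HS only works through partial matching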
## [1] 0.5920373

A correlation coefficient of 0.5920373 indicates a moderate positive linear relationship between home shots on goal and home goals scored.

Let's fit a regression model to predict home goals from home shots on goal
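To put that coefficient in context: for a straight-line fit with a single predictor, the square of the correlation is the share of the variation in the response that the line accounts for. A small sketch using the printed value (the variable names are illustrative):

```r
# Correlation between home shots on goal and home goals, copied from the output above
r_home <- 0.5920373

# Share of the variation in home goals accounted for by a straight line on shots
r_squared_home <- r_home^2
print(round(r_squared_home, 4))
```

This comes to roughly 0.35, so about 35% of the match-to-match variation in home goals is associated with home shots on goal, and the rest is left unexplained by this single variable.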

model<-lm(STG$FTHG~STG$HST)

##Model summary

summary(model)
## 
## Call:
## lm(formula = STG$FTHG ~ STG$HST)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4943 -0.7706 -0.0570  0.6565  4.0837 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.19773    0.05432    3.64 0.000282 ***
## STG$HST      0.28643    0.01001   28.62  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.052 on 1518 degrees of freedom
## Multiple R-squared:  0.3505, Adjusted R-squared:  0.3501 
## F-statistic: 819.2 on 1 and 1518 DF,  p-value: < 2.2e-16
coef(model)
## (Intercept)     STG$HST 
##   0.1977348   0.2864344
plot(STG$HST,STG$FTHG,main = "Scatterplot")
abline(model)  # draw the fitted regression line on top of the scatterplot

plot(model)

Residuals vs Fitted Plot

Most of the dots are above the dotted line, and the red line stays close to the dotted line. It is therefore reasonable to assume an approximately linear relationship between home shots on goal and home goals scored.

Normal Q-Q

The residual points follow the dotted line closely until the theoretical quantile exceeds about 2.75 (coincidentally the same figure as the average goals per match in the EPL), after which they move away from the line. This suggests the residuals are not normally distributed in the upper tail.

Residuals vs Leverage: Cook's distance estimates the influence of each data point

This plot flags 3 points (774, 834 and 901) that could potentially have a large influence on the model; the regression results might change depending on whether these points are included or excluded.

Most of the dots lie within the Cook's distance lines, apart from these three points, which have a Cook's distance greater than 0.4 and therefore a large influence on the model.
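The influential points can also be listed numerically rather than read off the plot. The sketch below is only an illustration on simulated data (the `sim_shots` and `sim_goals` vectors are hypothetical stand-ins, not the EPL file), and the 4/n cutoff is a common rule of thumb rather than anything used in the analysis above. It relies on `cooks.distance()` from base R's stats package.

```r
# Simulated stand-in data (hypothetical, NOT the EPL data set)
set.seed(1)
n <- 200
sim_shots <- rpois(n, lambda = 4.7)                    # shots on goal per team
sim_goals <- rbinom(n, size = sim_shots, prob = 0.32)  # each shot scores ~32% of the time

sim_model <- lm(sim_goals ~ sim_shots)

# Cook's distance for every observation in the fitted model
cook_d <- cooks.distance(sim_model)

# Rule-of-thumb cutoff (an assumption here, not a hard rule): 4 / n
threshold <- 4 / length(cook_d)
flagged <- which(cook_d > threshold)

# The three most influential observations, named by row index
top_three <- head(sort(cook_d, decreasing = TRUE), 3)
print(top_three)
print(length(flagged))
```

Applied to the home model, the same two calls (`cooks.distance(model)` followed by `sort()`) would show how far rows 774, 834 and 901 stand out from the rest.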

Modelling the relationship between Full-Time Away Goals and Away Shots on Goal

Let's find the correlation between away shots on goal (AST) and FTAG

cor(STG$AST,STG$FTAG)
## [1] 0.59071

Away shots on goal have almost the same correlation with away goals scored, at 0.59071, which is marginally lower than the 0.5920373 found for home teams.

##Away model

mod<-lm(STG$FTAG~STG$AST)
summary(mod)
## 
## Call:
## lm(formula = STG$FTAG ~ STG$AST)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6065 -0.6353 -0.0243  0.6702  3.5318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02432    0.04802   0.506    0.613    
## STG$AST      0.30548    0.01071  28.523   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9548 on 1518 degrees of freedom
## Multiple R-squared:  0.3489, Adjusted R-squared:  0.3485 
## F-statistic: 813.6 on 1 and 1518 DF,  p-value: < 2.2e-16
coef(mod)
## (Intercept)     STG$AST 
##  0.02431812  0.30548123
plot(STG$AST,STG$FTAG,main = "Scatterplot")
abline(mod)  # draw the fitted regression line on top of the scatterplot

plot(mod)
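To make the two fitted lines easier to compare, the sketch below plugs a few shot counts into the printed coefficients of the home and away models. The helper `predict_goals()` and the chosen shot counts are illustrative assumptions; only the coefficients come from the output above.

```r
# Coefficients copied from the coef() output of the two models above
home_coef <- c(intercept = 0.1977348, slope = 0.2864344)
away_coef <- c(intercept = 0.02431812, slope = 0.30548123)

# Hypothetical helper: expected goals on the fitted line for given shots on goal
predict_goals <- function(coefs, shots_on_goal) {
  unname(coefs["intercept"] + coefs["slope"] * shots_on_goal)
}

shots <- c(2, 5, 8)  # example shot counts, chosen for illustration
comparison <- data.frame(
  shots_on_goal = shots,
  home_goals = predict_goals(home_coef, shots),
  away_goals = predict_goals(away_coef, shots)
)
print(comparison)

# Shot count at which the two fitted lines cross
crossover <- unname((home_coef["intercept"] - away_coef["intercept"]) /
                    (away_coef["slope"] - home_coef["slope"]))
print(crossover)
```

The away line has the steeper slope but the lower intercept, so on these coefficients the home prediction stays higher until roughly 9 shots on goal.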

Conclusion

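So does a high number of shots on goal result in more goals?

Based on the analysis above, there is a positive linear relationship between shots on goal and actual goals scored, for both home teams (correlation 0.592) and away teams (correlation 0.591). Away shots on goal have the steeper regression slope (0.305 goals per additional shot on goal, against 0.286 at home), although the home correlation is marginally higher. It is not a strong linear relationship, however: shots on goal account for only about 35% of the variation in goals scored (R-squared of 0.3505 for home teams and 0.3489 for away teams).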