The data set came from:
http://www.football-data.co.uk/spainm.php
It covers the La Liga season up to match week 36, and the most important variables are the following:
Div = League Division
Date = Match Date (dd/mm/yy)
HomeTeam = Home Team
AwayTeam = Away Team
FTHG = Full Time Home Team Goals
FTAG = Full Time Away Team Goals
FTR = Full Time Result (H=Home Win, D=Draw, A=Away Win)
HTHG = Half Time Home Team Goals
HTAG = Half Time Away Team Goals
HTR = Half Time Result (H=Home Win, D=Draw, A=Away Win)
HS = Home Team Shots
AS = Away Team Shots
HST = Home Team Shots on Target
AST = Away Team Shots on Target
HHW = Home Team Hit Woodwork
AHW = Away Team Hit Woodwork
HC = Home Team Corners
AC = Away Team Corners
HF = Home Team Fouls Committed
AF = Away Team Fouls Committed
HO = Home Team Offsides
AO = Away Team Offsides
HY = Home Team Yellow Cards
AY = Away Team Yellow Cards
HR = Home Team Red Cards
AR = Away Team Red Cards
In this part we compute basic statistics, such as the mean and the median, for the away team goals (a quantitative variable).
x<-taula$FTAG
meanx<-mean(x)
medianx<-median(x)
quartilsx<-quantile(x,probs = c(0.25,0.75))
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.122 2.000 8.000
The variable FTAG (Full Time Away Team Goals) has a mean of 1.122, which means that the away team scores around 1 goal per match.
The median is exactly 1 because it is a value taken from the data, and all the values in the data are natural numbers. The median is also very close to the mean, so a typical match has 1 goal for the away team.
The first quartile is 0 goals and the third quartile is 2 goals: at least 25% of the matches have no away goals and at least 75% have 2 or fewer.
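These summary figures can be cross-checked without the original file. The following is a minimal sketch, assuming that the column totals of the FTHG/FTAG contingency table printed further down in this report (the `Sum` row of `tFreqAbs`) give the number of matches for each away-goal count from 0 to 8. The names `goals`, `matches` and `ftag` are introduced here for illustration only.

```r
# Away goals per match (0..8) and the number of matches with that many away goals,
# read off the "Sum" row of tFreqAbs (assumption: each interval holds at most one integer).
goals   <- 0:8
matches <- c(124, 126, 73, 26, 5, 4, 1, 0, 1)

# Rebuild a vector equivalent to taula$FTAG. The order of the matches is lost,
# which does not affect the summary statistics.
ftag <- rep(goals, times = matches)

length(ftag)                             # number of matches
mean(ftag)                               # arithmetic mean
median(ftag)                             # median
quantile(ftag, probs = c(0.25, 0.75))    # first and third quartiles
```

The reconstructed vector has 360 values and gives the same mean (about 1.122), median (1) and quartiles (0 and 2) as the `summary(x)` output above.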
library(ggplot2)
x<-as.data.frame(x)
ggplot(x,aes(x))+geom_histogram(binwidth = 1, fill = "grey", colour = "blue") + xlab("Goals") + ylab("Matches")
As the histogram shows, the away team scores between 0 and 2 goals in most matches. There are some exceptions, though: big teams can score up to 8 goals in a match even when playing away.
x<-as.data.frame(x)
ggplot(x,aes(x=factor(1),y=x))+geom_boxplot(fill = "grey", colour = "blue")+coord_flip()+xlab("")+ylab("Goals")
The lower limit of the box is 0 (first quartile) and the upper limit is 2 (third quartile). As can be seen, the boxplot is consistent with the histogram.
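To see which matches fall outside the whiskers, the numbers behind the boxplot can also be obtained directly. This is a sketch with base R's `boxplot.stats()`, assuming the away-goal counts read off the `Sum` row of `tFreqAbs` later in this report. The vector `ftag` is a reconstructed stand-in for `taula$FTAG`, not the original column.

```r
# Reconstructed away goals: each value 0..8 repeated by its number of matches
# (assumption: counts taken from the column totals of tFreqAbs).
ftag <- rep(0:8, times = c(124, 126, 73, 26, 5, 4, 1, 0, 1))

bs <- boxplot.stats(ftag)   # default coef = 1.5, the usual 1.5 * IQR rule
bs$stats                    # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out                      # values drawn as individual outlier points
```

With the 1.5 × IQR rule the upper whisker ends at 5 goals, so the matches with 6 and 8 away goals are drawn as isolated points.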
The following plot represents the 360 matches and their results: A for an away team victory, D for a draw and H for a home team victory.
result <- taula$FTR
table(result)
## result
## A D H
## 100 89 171
barplot(table(result), main="Barplot", xlab="Result", ylab= "Matches", col = "orange")
x<-taula$FTAG
y<-taula$FTHG
cor(x,y)
## [1] -0.1710941
plot(x,y, ylab ="FTHG", xlab = "FTAG",col="blue")
The correlation is computed between FTAG (Full Time Away Team Goals) and FTHG (Full Time Home Team Goals). The coefficient is negative, so the two variables tend to move in opposite directions: matches with more away goals tend to have fewer home goals. With a value of -0.1710941, however, the linear association is weak.
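For reference, Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations. Below is a minimal sketch on made-up toy vectors, showing that the manual computation agrees with `cor()`. The values of `a` and `b` are hypothetical and are not taken from the data set.

```r
# Hypothetical toy data, for illustration only (not from taula).
a <- c(0, 1, 1, 2, 3, 0, 2, 1)   # e.g. away goals in eight imaginary matches
b <- c(3, 1, 2, 0, 1, 2, 1, 1)   # e.g. home goals in the same matches

# Pearson correlation computed by hand: cov(a, b) / (sd(a) * sd(b))
r_manual  <- cov(a, b) / (sd(a) * sd(b))
r_builtin <- cor(a, b)            # method = "pearson" is the default

all.equal(r_manual, r_builtin)    # both routes give the same value
```

On the real data, `cor.test(x, y)` could additionally be used to check whether a coefficient of this size differs significantly from zero.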
FTR <- taula$FTR
FTHG <- taula$FTHG
# FTR (categorical) is the grouping variable; coord_flip() keeps the goals on the horizontal axis
p1 <- ggplot(taula, aes(x = FTR, y = FTHG))
p1 + geom_boxplot(col="blue") + coord_flip()
The box plot shows the relation between the goals scored by the home team and the result of the match. If the home team scores more than about 1.25 goals (in practice 2 or more, since goals are whole numbers), it tends to win the match. With between 0 and 2.5 goals the match could end in a draw, and with between 0 and 1.25 goals a defeat is about as probable as a draw.
awayfoults <- taula$AF
Fabs <- table(awayfoults)
Frel <- Fabs/margin.table(Fabs)
cumFabs <- cumsum(Fabs)
cumFrel <- cumsum(Frel)
The table and plot of the absolute frequencies:
Fabs
## awayfoults
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 25 26 27
## 1 1 3 1 3 4 17 28 25 36 31 38 34 37 23 13 15 12 10 15 1 5 3 2 2
plot(Fabs, col="blue")
The table and plot of the cumulative absolute frequencies:
cumFabs
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 1 2 5 6 9 13 30 58 83 119 150 188 222 259 282 295 310 322
## 20 21 22 23 25 26 27
## 332 347 348 353 356 358 360
plot(as.numeric(names(cumFabs)), cumFabs, col="blue", xlab="Away fouls", ylab="Cumulative matches")
The tables of the relative frequencies and the cumulative relative frequencies (the plots have the same shape as before):
Frel
## awayfoults
## 2 3 4 5 6 7
## 0.002777778 0.002777778 0.008333333 0.002777778 0.008333333 0.011111111
## 8 9 10 11 12 13
## 0.047222222 0.077777778 0.069444444 0.100000000 0.086111111 0.105555556
## 14 15 16 17 18 19
## 0.094444444 0.102777778 0.063888889 0.036111111 0.041666667 0.033333333
## 20 21 22 23 25 26
## 0.027777778 0.041666667 0.002777778 0.013888889 0.008333333 0.005555556
## 27
## 0.005555556
cumFrel
## 2 3 4 5 6 7
## 0.002777778 0.005555556 0.013888889 0.016666667 0.025000000 0.036111111
## 8 9 10 11 12 13
## 0.083333333 0.161111111 0.230555556 0.330555556 0.416666667 0.522222222
## 14 15 16 17 18 19
## 0.616666667 0.719444444 0.783333333 0.819444444 0.861111111 0.894444444
## 20 21 22 23 25 26
## 0.922222222 0.963888889 0.966666667 0.980555556 0.988888889 0.994444444
## 27
## 1.000000000
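The four frequency tables are linked by simple identities: relative frequencies are the counts divided by the total, and the cumulative versions are running sums. Here is a short sketch that rebuilds them from the absolute frequencies printed above. The object names `fouls`, `counts` and the `_chk` suffixes are introduced here for illustration.

```r
# Away fouls per match and their absolute frequencies, copied from the Fabs output.
fouls  <- c(2:23, 25:27)
counts <- c(1, 1, 3, 1, 3, 4, 17, 28, 25, 36, 31, 38, 34,
            37, 23, 13, 15, 12, 10, 15, 1, 5, 3, 2, 2)
Fabs_chk <- setNames(counts, fouls)

Frel_chk    <- Fabs_chk / sum(Fabs_chk)   # relative frequencies
cumFabs_chk <- cumsum(Fabs_chk)           # cumulative absolute frequencies
cumFrel_chk <- cumsum(Frel_chk)           # cumulative relative frequencies

sum(Fabs_chk)                 # total number of matches
round(cumFrel_chk["13"], 4)   # share of matches with at most 13 away fouls
```

The last cumulative absolute frequency equals the number of matches, and the last cumulative relative frequency equals 1, which is a quick way to validate tables of this kind.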
Ten groups of equal width:
resCut<-cut(awayfoults,10)
str(resCut)
## Factor w/ 10 levels "(1.98,4.5]","(4.5,7]",..: 4 6 5 4 4 4 10 6 1 9 ...
The percentage of the data in each group:
table(resCut)*100/360
## resCut
## (1.98,4.5] (4.5,7] (7,9.5] (9.5,12] (12,14.5] (14.5,17]
## 1.388889 2.222222 12.500000 25.555556 20.000000 20.277778
## (17,19.5] (19.5,22] (22,24.5] (24.5,27]
## 7.500000 7.222222 1.388889 1.944444
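`cut(x, 10)` splits the observed range into ten right-closed intervals of equal width and slightly extends the two outer breaks so that the minimum and the maximum are included. The sketch below reproduces the grouping from the absolute frequencies printed earlier. The vector `af` is a reconstructed stand-in for `awayfoults`, and `groups` and `width` are illustrative names.

```r
# Rebuild the away-foul observations from the Fabs table (each value repeated by its count).
af <- rep(c(2:23, 25:27),
          times = c(1, 1, 3, 1, 3, 4, 17, 28, 25, 36, 31, 38, 34,
                    37, 23, 13, 15, 12, 10, 15, 1, 5, 3, 2, 2))

groups <- cut(af, 10)              # ten equal-width, right-closed intervals
width  <- diff(range(af)) / 10     # width of each interval: (27 - 2) / 10

levels(groups)                                 # interval labels
round(100 * table(groups) / length(af), 2)     # percentage of matches per interval
```

Each interval is 2.5 fouls wide, which is why the breaks fall on values such as 4.5, 7 and 9.5 even though the number of fouls is always an integer.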
resCut1<-cut(taula$FTHG,10)
resCut2<-cut(taula$FTAG,10)
resTAbs <- table(resCut1,resCut2)
resTFreq <- resTAbs / 360
tFreqAbs <- addmargins(resTAbs)
tFreqRel <- addmargins(resTFreq)
# Marginals are taken from resTFreq: tFreqRel already contains the Sum row and column
SumCol <- colSums(resTFreq)
SumRow <- rowSums(resTFreq)
UnionProb <- SumCol[1]+SumRow[1]-resTFreq[1,1]
ComplProb <- sum(SumCol[2:10])
tFreqAbs
## resCut2
## resCut1 (-0.008,0.8] (0.8,1.6] (1.6,2.4] (2.4,3.2] (3.2,4] (4,4.8]
## (-0.01,1] 62 67 43 18 4 0
## (1,2] 28 32 19 4 1 0
## (2,3] 18 15 6 3 0 0
## (3,4] 9 5 2 1 0 0
## (4,5] 3 5 2 0 0 0
## (5,6] 4 1 0 0 0 0
## (6,7] 0 1 0 0 0 0
## (7,8] 0 0 0 0 0 0
## (8,9] 0 0 0 0 0 0
## (9,10] 0 0 1 0 0 0
## Sum 124 126 73 26 5 0
## resCut2
## resCut1 (4.8,5.6] (5.6,6.4] (6.4,7.2] (7.2,8.01] Sum
## (-0.01,1] 4 1 0 1 200
## (1,2] 0 0 0 0 84
## (2,3] 0 0 0 0 42
## (3,4] 0 0 0 0 17
## (4,5] 0 0 0 0 10
## (5,6] 0 0 0 0 5
## (6,7] 0 0 0 0 1
## (7,8] 0 0 0 0 0
## (8,9] 0 0 0 0 0
## (9,10] 0 0 0 0 1
## Sum 4 1 0 1 360
tFreqRel
## resCut2
## resCut1 (-0.008,0.8] (0.8,1.6] (1.6,2.4] (2.4,3.2] (3.2,4]
## (-0.01,1] 0.172222222 0.186111111 0.119444444 0.050000000 0.011111111
## (1,2] 0.077777778 0.088888889 0.052777778 0.011111111 0.002777778
## (2,3] 0.050000000 0.041666667 0.016666667 0.008333333 0.000000000
## (3,4] 0.025000000 0.013888889 0.005555556 0.002777778 0.000000000
## (4,5] 0.008333333 0.013888889 0.005555556 0.000000000 0.000000000
## (5,6] 0.011111111 0.002777778 0.000000000 0.000000000 0.000000000
## (6,7] 0.000000000 0.002777778 0.000000000 0.000000000 0.000000000
## (7,8] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## (8,9] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## (9,10] 0.000000000 0.000000000 0.002777778 0.000000000 0.000000000
## Sum 0.344444444 0.350000000 0.202777778 0.072222222 0.013888889
## resCut2
## resCut1 (4,4.8] (4.8,5.6] (5.6,6.4] (6.4,7.2] (7.2,8.01]
## (-0.01,1] 0.000000000 0.011111111 0.002777778 0.000000000 0.002777778
## (1,2] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## (2,3] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## (3,4] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## (4,5] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## (5,6] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## (6,7] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## (7,8] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## (8,9] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## (9,10] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## Sum 0.000000000 0.011111111 0.002777778 0.000000000 0.002777778
## resCut2
## resCut1 Sum
## (-0.01,1] 0.555555556
## (1,2] 0.233333333
## (2,3] 0.116666667
## (3,4] 0.047222222
## (4,5] 0.027777778
## (5,6] 0.013888889
## (6,7] 0.002777778
## (7,8] 0.000000000
## (8,9] 0.000000000
## (9,10] 0.002777778
## Sum 1.000000000
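Some probabilities (taking V as the FTAG intervals, the columns of the table, and W as the FTHG intervals, the rows):
Joint probability: P(V1∩W1) = 0.1722222
Conditional probability: P(V1|W1) = P(V1∩W1)/P(W1) = 0.1722222/0.5555556 = 0.31
Intersection probability: P(V2∩W2) = 0.0888889
Union probability: P(V1∪W1) = 0.3444444 + 0.5555556 - 0.1722222 = 0.7277778
Complement probability: P(V1 compl.) = 1 - 0.3444444 = 0.6555556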
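Quantities of this kind follow directly from the counts in `tFreqAbs`. The sketch below uses only three numbers from that table plus the total. It assumes that V indexes the FTAG intervals (columns) and W the FTHG intervals (rows), which is an interpretation of the notation rather than something stated explicitly, and all object names are illustrative.

```r
n      <- 360   # total number of matches
n_V1   <- 124   # column total of the first FTAG interval (0 away goals)
n_W1   <- 200   # row total of the first FTHG interval (0 or 1 home goals)
n_V1W1 <- 62    # matches in the first row and first column

p_joint <- n_V1W1 / n                    # P(V1 and W1)
p_cond  <- n_V1W1 / n_W1                 # P(V1 | W1) = P(V1 and W1) / P(W1)
p_union <- (n_V1 + n_W1 - n_V1W1) / n    # P(V1 or W1), by inclusion-exclusion
p_compl <- 1 - n_V1 / n                  # P(complement of V1)

round(c(joint = p_joint, cond = p_cond, union = p_union, compl = p_compl), 4)
```

Every probability must lie between 0 and 1, which is a useful sanity check: marginals computed from a table that already includes its `Sum` row and column are counted twice and can exceed 1.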
taulaFTR <- table(taula$FTR)
percFTR <- taulaFTR*100/360
taulaFTR
##
## A D H
## 100 89 171
percFTR
##
## A D H
## 27.77778 24.72222 47.50000
PA <- percFTR[1]
PD <- percFTR[2]
PH <- percFTR[3]
Over the season, 24.72% of the matches ended in a draw and 27.78% ended in a defeat for the home team.
golsf<-taula$FTHG
golsc<-taula$FTAG
golstotals <-golsc+golsf
table(golsc+golsf)
##
## 0 1 2 3 4 5 6 7 8 12
## 24 62 89 82 58 19 19 4 2 1
tauleta <- table(taula$FTR,golstotals)
tauleta
## golstotals
## 0 1 2 3 4 5 6 7 8 12
## A 0 24 18 32 15 5 5 0 1 0
## D 24 0 43 0 19 0 3 0 0 0
## H 0 38 28 50 24 14 11 4 1 1
A3gols <- sum(tauleta[1,4:10])
Agols <- sum(tauleta[1,1:10])
PA3gols <- A3gols*100/Agols
D3gols <- sum(tauleta[2,4:10])
Dgols <- sum(tauleta[2,1:10])
PD3gols <- D3gols*100/Dgols
H3gols <- sum(tauleta[3,4:10])
Hgols <- sum(tauleta[3,1:10])
PH3gols <- H3gols*100/Hgols
In 61.40% of the home victories 3 or more goals were scored. The same holds for 24.72% of the draws and for 58% of the away team victories. Suppose that 3 or more goals are scored in a match: what is the probability that it is a home victory?
SOLUTION
victory3gols <- PH*PH3gols
draw3gols <- PD*PD3gols
defeat3goals <- PA*PA3gols
num <- victory3gols
denom <- victory3gols+draw3gols+defeat3goals
solution <- num/denom
If 3 or more goals are scored in a match, the probability of a home victory is 0.5675676.
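The same result can be checked straight from the contingency table of results by total goals. This minimal sketch re-enters the printed `tauleta` counts by hand and compares the direct conditional frequency with the Bayes-formula route used in the solution. The object `tab` and the other names are illustrative.

```r
# Counts copied from the tauleta output: rows = result, columns = total goals.
tab <- matrix(c( 0, 24, 18, 32, 15,  5,  5, 0, 1, 0,    # A (away win)
                24,  0, 43,  0, 19,  0,  3, 0, 0, 0,    # D (draw)
                 0, 38, 28, 50, 24, 14, 11, 4, 1, 1),   # H (home win)
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "D", "H"),
                              as.character(c(0:8, 12))))

three_plus <- as.numeric(colnames(tab)) >= 3   # columns with 3 or more goals

# Direct route: among matches with 3+ goals, the share won by the home team.
direct <- sum(tab["H", three_plus]) / sum(tab[, three_plus])

# Bayes route: P(H | 3+) = P(3+ | H) P(H) / sum over results r of P(3+ | r) P(r)
prior <- rowSums(tab) / sum(tab)                    # P(A), P(D), P(H)
lik   <- rowSums(tab[, three_plus]) / rowSums(tab)  # P(3+ | result)
bayes <- unname(lik["H"] * prior["H"] / sum(lik * prior))

c(direct = direct, bayes = bayes)   # both equal 105/185
```

Both routes give 105/185, because 105 of the 185 matches with 3 or more goals were home victories.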