Authors: 49881ZAI, 27489TZM

taula <- read.csv("SP1.csv")

ASSIGNMENT 1

Part 1

1. Where did the data set come from?

The data set came from:

http://www.football-data.co.uk/spainm.php

It covers the La Liga season up to week 36 (360 matches). The most important variables are the following:

Div = League Division

Date = Match Date (dd/mm/yy)

HomeTeam = Home Team

AwayTeam = Away Team

FTHG = Full Time Home Team Goals

FTAG = Full Time Away Team Goals

FTR = Full Time Result (H=Home Win, D=Draw, A=Away Win)

HTHG = Half Time Home Team Goals

HTAG = Half Time Away Team Goals

HTR = Half Time Result (H=Home Win, D=Draw, A=Away Win)

HS = Home Team Shots

AS = Away Team Shots

HST = Home Team Shots on Target

AST = Away Team Shots on Target

HHW = Home Team Hit Woodwork

AHW = Away Team Hit Woodwork

HC = Home Team Corners

AC = Away Team Corners

HF = Home Team Fouls Committed

AF = Away Team Fouls Committed

HO = Home Team Offsides

AO = Away Team Offsides

HY = Home Team Yellow Cards

AY = Away Team Yellow Cards

HR = Home Team Red Cards

AR = Away Team Red Cards

2. Basic statistics of a quantitative variable

In this part we are going to compute basic statistics, such as the mean and the median, for the away team goals (a quantitative variable).

x<-taula$FTAG
meanx<-mean(x)
medianx<-median(x)
quartilsx<-quantile(x,probs = c(0.25,0.75))
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.122   2.000   8.000

The variable FTAG (Full Time Away Team Goals) gives us:

  1. Mean = 1.1222222

This means that the away team scores around 1 goal per match on average.

  2. Median = 1

This value is exact: the goal counts are whole numbers and the middle observations of the sorted data are all equal to 1. The median is also very close to the mean, so in a typical match the away team scores 1 goal.

  3. Quartiles = 0, 2

At least 25% of the matches have 0 away goals, and at least 75% of the matches have 2 away goals or fewer.
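As a cross-check, the same statistics can be recomputed without the CSV file. The sketch below rebuilds the away goals from the "Sum" row of the FTHG/FTAG contingency table in Part 2 (124 matches with 0 away goals, 126 with 1, and so on). The names `away_goal_counts` and `goals_rebuilt` are introduced here only for illustration.

```r
# Away goals rebuilt from the "Sum" row of the contingency table in Part 2.
# Both variable names are hypothetical and used only in this sketch.
away_goal_counts <- c(124, 126, 73, 26, 5, 4, 1, 0, 1)  # matches with 0..8 away goals
goals_rebuilt <- rep(0:8, times = away_goal_counts)

length(goals_rebuilt)                            # number of matches
mean(goals_rebuilt)                              # arithmetic mean
median(goals_rebuilt)                            # middle value of the sorted data
quantile(goals_rebuilt, probs = c(0.25, 0.75))   # first and third quartiles
```

The results should agree with the `summary(x)` output above, since only the order of the matches is lost in the reconstruction.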

Histogram
library(ggplot2)
x<-as.data.frame(x)
ggplot(x,aes(x))+geom_histogram(fill = "grey", colour = "blue") + xlab("Goals") + ylab("Matches")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As the histogram shows, in most matches the away team scores between 0 and 2 goals. There are a few exceptions, presumably involving the big teams, in which the away team scores as many as 8 goals.
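The `stat_bin()` message appears because no bin width was given. Since goals are whole numbers, one bin per integer is a natural choice; in ggplot2 that would be `geom_histogram(binwidth = 1)`. A minimal base R sketch of the same idea follows, using goal counts rebuilt from the Part 2 contingency table rather than read from the CSV (an assumption made to keep the example self-contained).

```r
# Away goals rebuilt from the contingency table totals (illustrative names)
counts_by_goals <- c(124, 126, 73, 26, 5, 4, 1, 0, 1)   # matches with 0..8 away goals
goals <- rep(0:8, times = counts_by_goals)

# Breaks at -0.5, 0.5, ..., 8.5 so that each bin is centred on one integer
h <- hist(goals, breaks = seq(-0.5, 8.5, by = 1), plot = FALSE)
h$counts

# Share of matches with at most 2 away goals
sum(h$counts[1:3]) / sum(h$counts)
```

Calling `plot(h)` would draw the histogram with these bins.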

Box plot
x<-as.data.frame(x)
ggplot(x,aes(x=factor(1),y=x))+geom_boxplot(fill = "grey", colour = "blue")+coord_flip()+xlab("")+ylab("Goals")

The lower edge of the box is at 0 (first quartile) and the upper edge is at 2 (third quartile). As expected, the box plot agrees with the histogram.

3. Basic statistics of one qualitative variable

The following plot represents the results of the 360 matches: A for an away team victory, D for a draw and H for a home team victory.

result <- taula$FTR
table(result)
## result
##   A   D   H 
## 100  89 171
barplot(table(result), main="Barplot",  xlab="Result", ylab= "Matches", col = "orange")

4. Basic statistics of two quantitative variables

x<-taula$FTAG
y<-taula$FTHG
cor(x,y)
## [1] -0.1710941
plot(x,y, ylab ="FTHG", xlab = "FTAG",col="blue")

The correlation is computed between FTAG (Full Time Away Team Goals) and FTHG (Full Time Home Team Goals). It is negative, so the two variables tend to move in opposite directions: matches in which the away team scores more tend to be matches in which the home team scores less. With a value of -0.1710941, this linear relation is weak.
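For reference, `cor()` computes Pearson's coefficient by default: the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this on a small made-up data set (the six score pairs are hypothetical and are not taken from SP1.csv).

```r
# Hypothetical scores for six matches, for illustration only
away <- c(0, 1, 2, 0, 3, 1)
home <- c(3, 1, 0, 2, 1, 2)

# Pearson correlation written out from its definition
r_manual <- sum((away - mean(away)) * (home - mean(home))) /
  sqrt(sum((away - mean(away))^2) * sum((home - mean(home))^2))

r_manual
cor(away, home)   # same value as r_manual
```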

5. One quantitative variable and one qualitative variable

# FTR (qualitative) goes on the x axis and FTHG (quantitative) on the y axis,
# so that one box is drawn per match result; coord_flip() keeps goals horizontal.
p1 <- ggplot(taula, aes(x = FTR, y = FTHG))
p1 + geom_boxplot(col="blue") + coord_flip()

The box plot shows the relation between the goals scored by the home team and the result of the match. When the home team scores more than about 1.25 goals (an approximate threshold read from the plot, since 1.25 goals is not possible) it usually wins the match. With between 0 and 2.5 goals the match can end in a draw, and with between 0 and 1.25 goals the home team will probably lose or draw.
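The medians and quartiles that a box plot draws can also be read as numbers, one set per result. A minimal sketch with a made-up data frame follows (`toy` and its values are hypothetical; with the real data the same calls would take `taula$FTHG` and `taula$FTR`).

```r
# Hypothetical results and home goals for eight matches
toy <- data.frame(
  FTR  = c("H", "H", "H", "D", "D", "A", "A", "A"),
  FTHG = c(3, 2, 1, 1, 0, 0, 1, 0)
)

# Median home goals for each match result
by_result <- tapply(toy$FTHG, toy$FTR, median)
by_result

# First and third quartiles for each match result
tapply(toy$FTHG, toy$FTR, quantile, probs = c(0.25, 0.75))
```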

Part 2

1. One variable coded as qualitative

awayfoults <- taula$AF
Fabs <- table(awayfoults)
Frel <- Fabs/margin.table(Fabs)
cumFabs <- cumsum(Fabs)
cumFrel <- cumsum(Frel)

The table and plot of the absolute frequencies:

Fabs
## awayfoults
##  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 25 26 27 
##  1  1  3  1  3  4 17 28 25 36 31 38 34 37 23 13 15 12 10 15  1  5  3  2  2
plot(Fabs, col="blue")

The table and plot of the cumulative absolute frequencies:

cumFabs
##   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
##   1   2   5   6   9  13  30  58  83 119 150 188 222 259 282 295 310 322 
##  20  21  22  23  25  26  27 
## 332 347 348 353 356 358 360
plot(cumFabs, col="blue", xlab="Away fouls")

The tables of the relative frequencies and the cumulative relative frequencies (the plots have the same shape as the previous ones, only the scale changes):

Frel
## awayfoults
##           2           3           4           5           6           7 
## 0.002777778 0.002777778 0.008333333 0.002777778 0.008333333 0.011111111 
##           8           9          10          11          12          13 
## 0.047222222 0.077777778 0.069444444 0.100000000 0.086111111 0.105555556 
##          14          15          16          17          18          19 
## 0.094444444 0.102777778 0.063888889 0.036111111 0.041666667 0.033333333 
##          20          21          22          23          25          26 
## 0.027777778 0.041666667 0.002777778 0.013888889 0.008333333 0.005555556 
##          27 
## 0.005555556
cumFrel
##           2           3           4           5           6           7 
## 0.002777778 0.005555556 0.013888889 0.016666667 0.025000000 0.036111111 
##           8           9          10          11          12          13 
## 0.083333333 0.161111111 0.230555556 0.330555556 0.416666667 0.522222222 
##          14          15          16          17          18          19 
## 0.616666667 0.719444444 0.783333333 0.819444444 0.861111111 0.894444444 
##          20          21          22          23          25          26 
## 0.922222222 0.963888889 0.966666667 0.980555556 0.988888889 0.994444444 
##          27 
## 1.000000000
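The four frequency tables are linked by simple identities: relative frequencies are the absolute ones divided by the total, and the cumulative versions are running sums. A short sketch of this follows, rebuilding the fouls from the absolute frequency table above (the name `fouls` is introduced here for illustration). `prop.table()` is a base R equivalent of dividing by `margin.table()`.

```r
# Fouls rebuilt from the absolute frequency table (no match has 24 fouls)
foul_values <- c(2:23, 25:27)
foul_counts <- c(1, 1, 3, 1, 3, 4, 17, 28, 25, 36, 31, 38, 34, 37,
                 23, 13, 15, 12, 10, 15, 1, 5, 3, 2, 2)
fouls <- rep(foul_values, times = foul_counts)

Fabs    <- table(fouls)        # absolute frequencies
Frel    <- prop.table(Fabs)    # relative frequencies, same as Fabs / sum(Fabs)
cumFabs <- cumsum(Fabs)        # cumulative absolute frequencies
cumFrel <- cumsum(Frel)        # cumulative relative frequencies

tail(cumFabs, 1)   # last cumulative count equals the number of matches
tail(cumFrel, 1)   # last cumulative relative frequency equals 1
```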

Ten groups of equal width:

resCut<-cut(awayfoults,10)
str(resCut)
##  Factor w/ 10 levels "(1.98,4.5]","(4.5,7]",..: 4 6 5 4 4 4 10 6 1 9 ...

The percentage of the data in each group:

table(resCut)*100/360
## resCut
## (1.98,4.5]    (4.5,7]    (7,9.5]   (9.5,12]  (12,14.5]  (14.5,17] 
##   1.388889   2.222222  12.500000  25.555556  20.000000  20.277778 
##  (17,19.5]  (19.5,22]  (22,24.5]  (24.5,27] 
##   7.500000   7.222222   1.388889   1.944444
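`cut(x, 10)` splits the range of the data into 10 intervals of equal width and slightly widens the outer limits so that the extremes are included, which is why the first label reads `(1.98,4.5]`. The sketch below reproduces the same grouping with explicit breaks, again on fouls rebuilt from the absolute frequency table (an assumption made to keep the block self-contained; the variable names are illustrative).

```r
# Fouls rebuilt from the absolute frequency table (no match has 24 fouls)
foul_values <- c(2:23, 25:27)
foul_counts <- c(1, 1, 3, 1, 3, 4, 17, 28, 25, 36, 31, 38, 34, 37,
                 23, 13, 15, 12, 10, 15, 1, 5, 3, 2, 2)
fouls <- rep(foul_values, times = foul_counts)

# 11 equally spaced break points from the minimum (2) to the maximum (27)
breaks <- seq(min(fouls), max(fouls), length.out = 11)

# include.lowest = TRUE keeps the minimum value in the first interval
groups <- cut(fouls, breaks = breaks, include.lowest = TRUE)

group_counts <- as.vector(table(groups))
group_counts
round(100 * group_counts / length(fouls), 2)   # percentage per group
```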

2. Two quantitative variables coded as qualitative

resCut1<-cut(taula$FTHG,10)
resCut2<-cut(taula$FTAG,10)
resTAbs <- table(resCut1,resCut2)
resTFreq <- resTAbs / 360
tFreqAbs <- addmargins(resTAbs)
tFreqRel <- addmargins(resTFreq)
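# Marginals come from the table without margins: summing tFreqRel would count the "Sum" row and column twice
SumCol <- colSums(resTFreq)
SumRow <- rowSums(resTFreq)
CondProb <- resTFreq[1,1]/SumRow[1]
UnionProb <- SumCol[1]+SumRow[1]-resTFreq[1,1]
ComplProb <- 1-SumCol[1]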
tFreqAbs
##            resCut2
## resCut1     (-0.008,0.8] (0.8,1.6] (1.6,2.4] (2.4,3.2] (3.2,4] (4,4.8]
##   (-0.01,1]           62        67        43        18       4       0
##   (1,2]               28        32        19         4       1       0
##   (2,3]               18        15         6         3       0       0
##   (3,4]                9         5         2         1       0       0
##   (4,5]                3         5         2         0       0       0
##   (5,6]                4         1         0         0       0       0
##   (6,7]                0         1         0         0       0       0
##   (7,8]                0         0         0         0       0       0
##   (8,9]                0         0         0         0       0       0
##   (9,10]               0         0         1         0       0       0
##   Sum                124       126        73        26       5       0
##            resCut2
## resCut1     (4.8,5.6] (5.6,6.4] (6.4,7.2] (7.2,8.01] Sum
##   (-0.01,1]         4         1         0          1 200
##   (1,2]             0         0         0          0  84
##   (2,3]             0         0         0          0  42
##   (3,4]             0         0         0          0  17
##   (4,5]             0         0         0          0  10
##   (5,6]             0         0         0          0   5
##   (6,7]             0         0         0          0   1
##   (7,8]             0         0         0          0   0
##   (8,9]             0         0         0          0   0
##   (9,10]            0         0         0          0   1
##   Sum               4         1         0          1 360
tFreqRel
##            resCut2
## resCut1     (-0.008,0.8]   (0.8,1.6]   (1.6,2.4]   (2.4,3.2]     (3.2,4]
##   (-0.01,1]  0.172222222 0.186111111 0.119444444 0.050000000 0.011111111
##   (1,2]      0.077777778 0.088888889 0.052777778 0.011111111 0.002777778
##   (2,3]      0.050000000 0.041666667 0.016666667 0.008333333 0.000000000
##   (3,4]      0.025000000 0.013888889 0.005555556 0.002777778 0.000000000
##   (4,5]      0.008333333 0.013888889 0.005555556 0.000000000 0.000000000
##   (5,6]      0.011111111 0.002777778 0.000000000 0.000000000 0.000000000
##   (6,7]      0.000000000 0.002777778 0.000000000 0.000000000 0.000000000
##   (7,8]      0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   (8,9]      0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   (9,10]     0.000000000 0.000000000 0.002777778 0.000000000 0.000000000
##   Sum        0.344444444 0.350000000 0.202777778 0.072222222 0.013888889
##            resCut2
## resCut1         (4,4.8]   (4.8,5.6]   (5.6,6.4]   (6.4,7.2]  (7.2,8.01]
##   (-0.01,1] 0.000000000 0.011111111 0.002777778 0.000000000 0.002777778
##   (1,2]     0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   (2,3]     0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   (3,4]     0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   (4,5]     0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   (5,6]     0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   (6,7]     0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   (7,8]     0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   (8,9]     0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   (9,10]    0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   Sum       0.000000000 0.011111111 0.002777778 0.000000000 0.002777778
##            resCut2
## resCut1             Sum
##   (-0.01,1] 0.555555556
##   (1,2]     0.233333333
##   (2,3]     0.116666667
##   (3,4]     0.047222222
##   (4,5]     0.027777778
##   (5,6]     0.013888889
##   (6,7]     0.002777778
##   (7,8]     0.000000000
##   (8,9]     0.000000000
##   (9,10]    0.002777778
##   Sum       1.000000000

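Some probabilities, taking V as the groups of FTAG (columns) and W as the groups of FTHG (rows), so that V1 is the column (-0.008,0.8] and W1 is the row (-0.01,1]:

Conditional probability: P(V1|W1) = P(V1∩W1) / P(W1) = 0.1722222 / 0.5555556 = 0.31

Intersection probability: P(V2∩W2) = 0.0888889

Union probability: P(V1∪W1) = P(V1) + P(W1) - P(V1∩W1) = 0.3444444 + 0.5555556 - 0.1722222 = 0.7277778

Complement probability: P(V1 compl.) = 1 - P(V1) = 1 - 0.3444444 = 0.6555556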
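Probabilities of this kind follow directly from the counts in the contingency table. A minimal sketch follows, assuming V1 is the first FTAG group (first column) and W1 is the first FTHG group (first row). Only the four counts needed are typed in by hand from `tFreqAbs`, and the variable names are hypothetical.

```r
# Counts read from tFreqAbs (assumption: V = FTAG columns, W = FTHG rows)
n_total <- 360   # all matches
n_V1    <- 124   # column total of (-0.008,0.8]
n_W1    <- 200   # row total of (-0.01,1]
n_both  <- 62    # cell in the first row and first column

p_V1   <- n_V1 / n_total
p_W1   <- n_W1 / n_total
p_both <- n_both / n_total         # P(V1 and W1)

p_cond  <- p_both / p_W1           # P(V1 | W1)
p_union <- p_V1 + p_W1 - p_both    # P(V1 or W1)
p_compl <- 1 - p_V1                # P(not V1)

c(conditional = p_cond, union = p_union, complement = p_compl)
```

A useful sanity check is that every result lies between 0 and 1; a union above 1 usually means that a margin has been counted twice.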

Part 3

Bayes Theorem

taulaFTR <- table(taula$FTR)
percFTR <- taulaFTR*100/360
taulaFTR
## 
##   A   D   H 
## 100  89 171
percFTR
## 
##        A        D        H 
## 27.77778 24.72222 47.50000
PA <- percFTR[1]
PD <- percFTR[2]
PH <- percFTR[3]

In this season, 24.7222222 % of the matches ended in a draw, 27.7777778 % were a defeat for the home team and the remaining 47.5 % were home victories.

golsf<-taula$FTHG
golsc<-taula$FTAG
golstotals <-golsc+golsf
table(golsc+golsf)
## 
##  0  1  2  3  4  5  6  7  8 12 
## 24 62 89 82 58 19 19  4  2  1
tauleta <- table(taula$FTR,golstotals)
tauleta
##    golstotals
##      0  1  2  3  4  5  6  7  8 12
##   A  0 24 18 32 15  5  5  0  1  0
##   D 24  0 43  0 19  0  3  0  0  0
##   H  0 38 28 50 24 14 11  4  1  1
A3gols <- sum(tauleta[1,4:10])
Agols <- sum(tauleta[1,1:10])
PA3gols <- A3gols*100/Agols
D3gols <- sum(tauleta[2,4:10])
Dgols <- sum(tauleta[2,1:10])
PD3gols <- D3gols*100/Dgols
H3gols <- sum(tauleta[3,4:10])
Hgols <- sum(tauleta[3,1:10])
PH3gols <- H3gols*100/Hgols

In 61.4035088 % of the home victories 3 or more goals were scored, in 24.7191011 % of the draws 3 or more goals were scored, and in 58 % of the away team victories 3 or more goals were scored. Suppose that 3 or more goals are scored in a match: what is the probability that it is a home victory?

SOLUTION

victory3gols <- PH*PH3gols
draw3gols <- PD*PD3gols
defeat3goals <- PA*PA3gols
num <- victory3gols
denom <- victory3gols+draw3gols+defeat3goals
solution <- num/denom

If 3 or more goals are scored in a match, the probability of a home victory is 0.5675676.
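The computation above is Bayes' theorem: the prior probability of a home victory times the probability of 3 or more goals given a home victory, divided by the sum of the same product over the three results. A compact sketch follows that redoes it from the counts in `tauleta` (the vector names are introduced here for illustration).

```r
# Matches per result, and matches with 3 or more goals per result, read from tauleta
n_result <- c(A = 100, D = 89, H = 171)
n_3plus  <- c(A = 58,  D = 22, H = 105)

prior      <- n_result / sum(n_result)   # P(result)
likelihood <- n_3plus / n_result         # P(3 or more goals | result)

# Bayes' theorem: posterior is proportional to prior times likelihood
posterior <- prior * likelihood / sum(prior * likelihood)
posterior
```

Working with proportions instead of percentages gives the same answer, because the factors of 100 cancel between numerator and denominator.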