En el presente trabajo, se trata de observar los resultados de la Premier League, desde la temporada 1993-94 hasta la 2017-18. En primer lugar, observamos la estructura principal de los datos y los paquetes utilizados. A continuacion veremos los ajustes correspondientes. En segundo lugar, se enseñaran los distintos algoritmos empleados, los cuales nos podran ser utiles para sacar predicciones de dichos datos. El objetivo principal del trabajo, va a tratar de predecir, como hubiese sido la clasificacion de le EPL durante la temporada 2017-18. Centrandose, principalmente en los partidos del Liverpool. Finalmente, sacaremos conclusiones, de los resultados obtenidos.
El conjunto de datos, para el proyecto, se han sacadado de la pagina web de Kaggle. La estructura se compone de: 9665 observaciones por 11 columnas. Las variables incluyen: Division: denominado como E0 HomeTeam: Equipo Local AwayTeam: Equipo visitante FTHG : total de goles en casa FTAG: total de goles fuera de casa FTR: resultado final HTHG: goles en casa, a mitad de partido HTAG: goles fuera de casa, a mitad de partido HTR: resultados a mitad de partido Season:temporada
Utilizando los algoritmos, comprobando si funcionan en los datasets, veremos si es capaz de encontrar predicciones razonables para futuros partidos en la Premier League, los cuales podrian utilizarse para casas de apuestas deportivas.
# Comenzamos cargando los datos en R
url= "https://raw.githubusercontent.com/ferandoases/machinelearning1/master/EPL_Set.csv"
df=read.csv(url)
#Listado de paquetes empleados:
library(tidyverse)
## -- Attaching packages --------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.8
## v tidyr 0.8.2 v stringr 1.3.1
## v readr 1.2.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(rlang)
##
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
##
## %@%, %||%, as_function, flatten, flatten_chr, flatten_dbl,
## flatten_int, flatten_lgl, invoke, list_along, modify, prepend,
## rep_along, splice
library(ggpubr)
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:rlang':
##
## set_names
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(C50)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ranger)
library(e1071)
library(klaR)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Transformamos los datos de DATE que eran factores, a un formato de fecha.
# Ademas borramos los NA, que no interesan.
df$Date <- dmy(df$Date)
df <- df[complete.cases(df),]
str(df)
## 'data.frame': 8740 obs. of 11 variables:
## $ ï..Div : Factor w/ 1 level "E0": 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "1995-08-19" "1995-08-19" ...
## $ HomeTeam: Factor w/ 50 levels "Arsenal","Aston Villa",..: 2 5 14 25 26 30 39 47 49 1 ...
## $ AwayTeam: Factor w/ 50 levels "Arsenal","Aston Villa",..: 27 35 18 38 44 15 32 23 7 29 ...
## $ FTHG : int 3 1 0 1 1 3 3 1 3 1 ...
## $ FTAG : int 1 0 0 0 1 0 4 2 2 1 ...
## $ FTR : Factor w/ 3 levels "A","D","H": 3 3 2 3 2 3 1 1 3 2 ...
## $ HTHG : int 3 1 0 0 0 1 1 1 2 1 ...
## $ HTAG : int 0 0 0 0 1 0 3 0 2 1 ...
## $ HTR : Factor w/ 4 levels "","A","D","H": 4 4 3 3 2 4 2 4 3 3 ...
## $ Season : Factor w/ 25 levels "1993-94","1994-95",..: 3 3 3 3 3 3 3 3 3 3 ...
# Borramos la Var Div, ya que los cuales son iguales en todos los años.
# Son equipos de primera Division.
df <- df[-1]
str(df)
## 'data.frame': 8740 obs. of 10 variables:
## $ Date : Date, format: "1995-08-19" "1995-08-19" ...
## $ HomeTeam: Factor w/ 50 levels "Arsenal","Aston Villa",..: 2 5 14 25 26 30 39 47 49 1 ...
## $ AwayTeam: Factor w/ 50 levels "Arsenal","Aston Villa",..: 27 35 18 38 44 15 32 23 7 29 ...
## $ FTHG : int 3 1 0 1 1 3 3 1 3 1 ...
## $ FTAG : int 1 0 0 0 1 0 4 2 2 1 ...
## $ FTR : Factor w/ 3 levels "A","D","H": 3 3 2 3 2 3 1 1 3 2 ...
## $ HTHG : int 3 1 0 0 0 1 1 1 2 1 ...
## $ HTAG : int 0 0 0 0 1 0 3 0 2 1 ...
## $ HTR : Factor w/ 4 levels "","A","D","H": 4 4 3 3 2 4 2 4 3 3 ...
## $ Season : Factor w/ 25 levels "1993-94","1994-95",..: 3 3 3 3 3 3 3 3 3 3 ...
# Procedemos a ordenar las filas del dataframe por una columna.
# Vamos a centrarnos en las variables: FTR,HTR.
# en orden creciente.
dfn <- df[, 1:10]
dfn$HTR <- droplevels(df$HTR)
dfn$FTR <- droplevels(df$FTR)
str(dfn)
## 'data.frame': 8740 obs. of 10 variables:
## $ Date : Date, format: "1995-08-19" "1995-08-19" ...
## $ HomeTeam: Factor w/ 50 levels "Arsenal","Aston Villa",..: 2 5 14 25 26 30 39 47 49 1 ...
## $ AwayTeam: Factor w/ 50 levels "Arsenal","Aston Villa",..: 27 35 18 38 44 15 32 23 7 29 ...
## $ FTHG : int 3 1 0 1 1 3 3 1 3 1 ...
## $ FTAG : int 1 0 0 0 1 0 4 2 2 1 ...
## $ FTR : Factor w/ 3 levels "A","D","H": 3 3 2 3 2 3 1 1 3 2 ...
## $ HTHG : int 3 1 0 0 0 1 1 1 2 1 ...
## $ HTAG : int 0 0 0 0 1 0 3 0 2 1 ...
## $ HTR : Factor w/ 3 levels "A","D","H": 3 3 2 2 1 3 1 3 2 2 ...
## $ Season : Factor w/ 25 levels "1993-94","1994-95",..: 3 3 3 3 3 3 3 3 3 3 ...
#Como queremos estudiar la temporada 2017-18.
#y especialmente los partidos del Liverpool.
#realizamos lo siguiente:
temporada <- dfn %>% filter(Season == '2017-18')
sapply(temporada, function(x) sum(is.na(x)))
## Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG
## 0 0 0 0 0 0 0 0
## HTR Season
## 0 0
str(dfn)
## 'data.frame': 8740 obs. of 10 variables:
## $ Date : Date, format: "1995-08-19" "1995-08-19" ...
## $ HomeTeam: Factor w/ 50 levels "Arsenal","Aston Villa",..: 2 5 14 25 26 30 39 47 49 1 ...
## $ AwayTeam: Factor w/ 50 levels "Arsenal","Aston Villa",..: 27 35 18 38 44 15 32 23 7 29 ...
## $ FTHG : int 3 1 0 1 1 3 3 1 3 1 ...
## $ FTAG : int 1 0 0 0 1 0 4 2 2 1 ...
## $ FTR : Factor w/ 3 levels "A","D","H": 3 3 2 3 2 3 1 1 3 2 ...
## $ HTHG : int 3 1 0 0 0 1 1 1 2 1 ...
## $ HTAG : int 0 0 0 0 1 0 3 0 2 1 ...
## $ HTR : Factor w/ 3 levels "A","D","H": 3 3 2 2 1 3 1 3 2 2 ...
## $ Season : Factor w/ 25 levels "1993-94","1994-95",..: 3 3 3 3 3 3 3 3 3 3 ...
str(temporada)
## 'data.frame': 380 obs. of 10 variables:
## $ Date : Date, format: "2017-08-11" "2017-08-12" ...
## $ HomeTeam: Factor w/ 50 levels "Arsenal","Aston Villa",..: 1 10 14 16 18 39 45 46 27 30 ...
## $ AwayTeam: Factor w/ 50 levels "Arsenal","Aston Villa",..: 24 26 11 20 40 42 25 8 47 44 ...
## $ FTHG : int 4 0 2 0 1 0 3 1 4 0 ...
## $ FTAG : int 3 2 3 3 0 0 3 0 0 2 ...
## $ FTR : Factor w/ 3 levels "A","D","H": 3 1 1 1 3 2 2 3 3 1 ...
## $ HTHG : int 2 0 0 0 1 0 2 1 1 0 ...
## $ HTAG : int 2 0 3 2 0 0 1 0 0 0 ...
## $ HTR : Factor w/ 3 levels "A","D","H": 2 2 1 1 3 2 3 3 3 2 ...
## $ Season : Factor w/ 25 levels "1993-94","1994-95",..: 25 25 25 25 25 25 25 25 25 25 ...
dfn1 <- temporada [, 4:9]
str(dfn1)
## 'data.frame': 380 obs. of 6 variables:
## $ FTHG: int 4 0 2 0 1 0 3 1 4 0 ...
## $ FTAG: int 3 2 3 3 0 0 3 0 0 2 ...
## $ FTR : Factor w/ 3 levels "A","D","H": 3 1 1 1 3 2 2 3 3 1 ...
## $ HTHG: int 2 0 0 0 1 0 2 1 1 0 ...
## $ HTAG: int 2 0 3 2 0 0 1 0 0 0 ...
## $ HTR : Factor w/ 3 levels "A","D","H": 2 2 1 1 3 2 3 3 3 2 ...
# Podemos hacer predicciones, utilizando árboles, después de haber cargado
# previamente los paquetes. Utilizamos 4 modelos.
# Despues diremos cual es el mejor:
#MODELO 1
m <- ctree(FTR ~ ., data=dfn1)
m
##
## Model formula:
## FTR ~ FTHG + FTAG + HTHG + HTAG + HTR
##
## Fitted party:
## [1] root
## | [2] FTAG <= 1
## | | [3] FTHG <= 0
## | | | [4] FTAG <= 0: D (n = 32, err = 0.0%)
## | | | [5] FTAG > 0: A (n = 23, err = 0.0%)
## | | [6] FTHG > 0
## | | | [7] FTAG <= 0: H (n = 104, err = 0.0%)
## | | | [8] FTAG > 0
## | | | | [9] FTHG <= 1: D (n = 45, err = 0.0%)
## | | | | [10] FTHG > 1: H (n = 59, err = 0.0%)
## | [11] FTAG > 1
## | | [12] FTHG <= 2
## | | | [13] FTHG <= 1: A (n = 72, err = 0.0%)
## | | | [14] FTHG > 1
## | | | | [15] FTAG <= 2: D (n = 19, err = 0.0%)
## | | | | [16] FTAG > 2: A (n = 13, err = 0.0%)
## | | [17] FTHG > 2: H (n = 13, err = 23.1%)
##
## Number of inner nodes: 8
## Number of terminal nodes: 9
plot(m)
#MODELO 2
m2 <- C5.0(FTR ~ ., data=dfn1)
m2
##
## Call:
## C5.0.formula(formula = FTR ~ ., data = dfn1)
##
## Classification Tree
## Number of samples: 380
## Number of predictors: 5
##
## Tree size: 11
##
## Non-standard options: attempt to group attributes
plot(m2)
#MODELO 3
#m3 <- cforest(FTR ~ FTHG + HTHG + HTAG + HTR + FTAG, data=dfn1)
#m3
#He ejecutado en oculto el codigo del modelo 3, ya que su output es
#demasiado grande.
#MODELO 4
m4 <- ranger(FTR ~ FTHG + HTHG + HTAG + HTR + FTAG, data=dfn1, importance="permutation")
m4
## Ranger result
##
## Call:
## ranger(FTR ~ FTHG + HTHG + HTAG + HTR + FTAG, data = dfn1, importance = "permutation")
##
## Type: Classification
## Number of trees: 500
## Sample size: 380
## Number of independent variables: 5
## Mtry: 2
## Target node size: 1
## Variable importance mode: permutation
## Splitrule: gini
## OOB prediction error: 1.84 %
#Finalmente vemos que el mejor modelo es el 2.
#Cargamos ademas el plot
m2
##
## Call:
## C5.0.formula(formula = FTR ~ ., data = dfn1)
##
## Classification Tree
## Number of samples: 380
## Number of predictors: 5
##
## Tree size: 11
##
## Non-standard options: attempt to group attributes
plot(m2)
str(dfn1)
## 'data.frame': 380 obs. of 6 variables:
## $ FTHG: int 4 0 2 0 1 0 3 1 4 0 ...
## $ FTAG: int 3 2 3 3 0 0 3 0 0 2 ...
## $ FTR : Factor w/ 3 levels "A","D","H": 3 1 1 1 3 2 2 3 3 1 ...
## $ HTHG: int 2 0 0 0 1 0 2 1 1 0 ...
## $ HTAG: int 2 0 3 2 0 0 1 0 0 0 ...
## $ HTR : Factor w/ 3 levels "A","D","H": 2 2 1 1 3 2 3 3 3 2 ...
# El numero de la columna con el target:
names(dfn1)
## [1] "FTHG" "FTAG" "FTR" "HTHG" "HTAG" "HTR"
cl <- 3
# El nombre del target variable en formula:
f <- as.formula(paste(colnames(dfn1)[cl], "~", "."))
#Asignamos un Id, en mi caso el codigo de alumno:
id <- 128502
# y separamos en train y test poniendo un seed para que sea reproducible
set.seed(id)
spl <- createDataPartition(dfn1[,3], p=0.7, list=FALSE)
tr <- dfn1[ spl,]
te <- dfn1[-spl,]
# Entrenamos los tres algoritmos knn, nb y rf1:
m.knn <- train(f, data=tr, method="knn")
m.nb <- train(f, data=tr, method="nb")
m.rf1 <- train(f, data=tr, method="ranger")
# juntamos los resultados y lo llamamos res
res <- resamples(list(KNN=m.knn, NB=m.nb, RF=m.rf1))
# y vemos los resultados en numeros y graficamente
summary(res)
##
## Call:
## summary.resamples(object = res)
##
## Models: KNN, NB, RF
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## KNN 0.7659574 0.8163265 0.8617021 0.8475269 0.8686869 0.9215686 0
## NB 0.5940594 0.6451613 0.6880734 0.6886992 0.7142857 0.8235294 0
## RF 0.9369369 0.9887640 0.9898990 0.9844629 0.9906542 1.0000000 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## KNN 0.6380233 0.7147477 0.7838316 0.7623911 0.7989063 0.8671875 0
## NB 0.3969719 0.4680187 0.5301128 0.5290736 0.5688623 0.7260519 0
## RF 0.9021657 0.9829730 0.9839169 0.9757651 0.9858353 1.0000000 0
dotplot(res)
# Observammos que el algoritmo,con mas precision es el random forest,
# despues vemos que tanto en el Kappa como en el Accuracy, el Knn esta
# por el estilo. Vemos que el algoritmo nb es descartado.
# Pasamos con las predicciones.
#PREDICCION 1
prediccion1 <- predict(m.rf1, te)
confusionMatrix(prediccion1,te[,cl])
## Confusion Matrix and Statistics
##
## Reference
## Prediction A D H
## A 32 0 0
## D 0 29 0
## H 0 0 51
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9676, 1)
## No Information Rate : 0.4554
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: D Class: H
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.2857 0.2589 0.4554
## Detection Rate 0.2857 0.2589 0.4554
## Detection Prevalence 0.2857 0.2589 0.4554
## Balanced Accuracy 1.0000 1.0000 1.0000
str(prediccion1)
## Factor w/ 3 levels "A","D","H": 1 1 2 3 1 3 1 1 3 3 ...
summary(prediccion1)
## A D H
## 32 29 51
#PREDICCION 2
prediccion2 <- predict(m.knn, te)
confusionMatrix(prediccion2,te[,cl])
## Confusion Matrix and Statistics
##
## Reference
## Prediction A D H
## A 29 2 0
## D 3 22 0
## H 0 5 51
##
## Overall Statistics
##
## Accuracy : 0.9107
## 95% CI : (0.8419, 0.9564)
## No Information Rate : 0.4554
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8595
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: D Class: H
## Sensitivity 0.9062 0.7586 1.0000
## Specificity 0.9750 0.9639 0.9180
## Pos Pred Value 0.9355 0.8800 0.9107
## Neg Pred Value 0.9630 0.9195 1.0000
## Prevalence 0.2857 0.2589 0.4554
## Detection Rate 0.2589 0.1964 0.4554
## Detection Prevalence 0.2768 0.2232 0.5000
## Balanced Accuracy 0.9406 0.8612 0.9590
str(prediccion2)
## Factor w/ 3 levels "A","D","H": 1 1 3 3 1 3 1 1 3 3 ...
summary(prediccion2)
## A D H
## 31 25 56
#PREDICCION 3
prediccion3<- predict(m.nb, te)
confusionMatrix(prediccion3,te[,cl])
## Confusion Matrix and Statistics
##
## Reference
## Prediction A D H
## A 23 7 0
## D 5 17 14
## H 4 5 37
##
## Overall Statistics
##
## Accuracy : 0.6875
## 95% CI : (0.593, 0.7717)
## No Information Rate : 0.4554
## P-Value [Acc > NIR] : 5.953e-07
##
## Kappa : 0.5216
## Mcnemar's Test P-Value : 0.03517
##
## Statistics by Class:
##
## Class: A Class: D Class: H
## Sensitivity 0.7188 0.5862 0.7255
## Specificity 0.9125 0.7711 0.8525
## Pos Pred Value 0.7667 0.4722 0.8043
## Neg Pred Value 0.8902 0.8421 0.7879
## Prevalence 0.2857 0.2589 0.4554
## Detection Rate 0.2054 0.1518 0.3304
## Detection Prevalence 0.2679 0.3214 0.4107
## Balanced Accuracy 0.8156 0.6786 0.7890
str(prediccion3)
## Factor w/ 3 levels "A","D","H": 1 1 3 3 2 3 1 1 2 3 ...
summary(prediccion3)
## A D H
## 30 36 46
#Procedemos a estudiar como se ha desenvuelto el Liverpool en la temporada:
liverpool_temporada <- temporada %>% filter(HomeTeam == 'Liverpool' | AwayTeam == 'Liverpool')
# Separamos los partidos en casa y visitante
liverpool_casa <- liverpool_temporada %>% filter(HomeTeam == 'Liverpool')
liverpool_visitante <- liverpool_temporada %>% filter(AwayTeam == 'Liverpool')
# Estudiaremos mediante la función count, para poder observar los partidos
# totales fuera de casa y en casa.
# Seguidamente, cargaremos un gráfico de barras para reflejar los resultados,
# de forma sencilla.
full_time_home <- liverpool_casa %>% count(FTR)
half_time_home <- liverpool_casa %>% count(HTR)
full_time_away <- liverpool_visitante %>% count(FTR)
half_time_away <- liverpool_visitante %>% count(HTR)
home_results_breakdown <- left_join(full_time_home, half_time_home, by = c('FTR' = 'HTR')) %>%
rename(result = FTR, FTR = n.x, HTR = n.y) %>%
gather(time, count, -result) %>%
mutate(location = "Home",
result = ordered(ifelse(result == 'A', 'Lose', ifelse(result == 'H', 'Win', 'Draw')), levels = c('Win', 'Draw', 'Lose')),
time = ifelse(time == 'HTR', 'Half Time Result', 'Full Time Result'),
time = ordered(time, levels = c('Half Time Result', 'Full Time Result')))
away_results_breakdown <- left_join(full_time_away, half_time_away, by = c('FTR' = 'HTR')) %>%
rename(result = FTR, FTR = n.x, HTR = n.y) %>%
gather(time, count, -result) %>%
mutate(location = "Away",
result = ordered(ifelse(result == 'A', 'Win', ifelse(result == 'H', 'Lose', 'Draw')), levels = c('Win', 'Draw', 'Lose')),
time = ifelse(time == 'HTR', 'Half Time Result', 'Full Time Result'),
time = ordered(time, levels = c('Half Time Result', 'Full Time Result')))
results_breakdown <- bind_rows(home_results_breakdown, away_results_breakdown) %>% mutate(location = ordered(location, c('Home', 'Away')))
home_points = c(1,3)
total_points_full_time <- liverpool_casa %>% count(FTR) %>% mutate(point_value = home_points, total_points = n * point_value)
total_points_full_time
## # A tibble: 2 x 4
## FTR n point_value total_points
## <fct> <int> <dbl> <dbl>
## 1 D 7 1 7
## 2 H 12 3 36
home_points = c(0,1,3)
total_points_full_time_half <- liverpool_casa %>% count(HTR) %>% mutate(point_value = home_points, total_points = n * point_value)
total_points_full_time_half
## # A tibble: 3 x 4
## HTR n point_value total_points
## <fct> <int> <dbl> <dbl>
## 1 A 1 0 0
## 2 D 8 1 8
## 3 H 10 3 30
#si observamos, solo los partidos jugados en casa el Liverpool, a tiempo completo,
#el equipo obtendria 43puntos y si los partidos solo durasen los primeros 45 min,
# hubiese obtenido 38 puntos.
## # A tibble: 1 x 2
## team final_points
## <chr> <dbl>
## 1 Liverpool 43
## # A tibble: 20 x 2
## team final_points
## <chr> <dbl>
## 1 Man City 50
## 2 Arsenal 47
## 3 Man United 47
## 4 Liverpool 43
## 5 Tottenham 43
## 6 Chelsea 37
## 7 Everton 34
## 8 Brighton 29
## 9 Newcastle 28
## 10 Watford 27
## 11 Leicester 27
## 12 West Ham 27
## 13 Crystal Palace 26
## 14 Bournemouth 26
## 15 Burnley 26
## 16 Huddersfield 23
## 17 Swansea 21
## 18 Stoke 20
## 19 Southampton 19
## 20 West Brom 18
## # A tibble: 1 x 2
## team final_points
## <chr> <dbl>
## 1 Liverpool 38
## # A tibble: 20 x 2
## team final_points
## <chr> <dbl>
## 1 Man City 43
## 2 Man United 38
## 3 Liverpool 38
## 4 Arsenal 37
## 5 Chelsea 34
## 6 Tottenham 31
## 7 Leicester 25
## 8 Huddersfield 25
## 9 Crystal Palace 24
## 10 Burnley 24
## 11 Southampton 23
## 12 Watford 23
## 13 Newcastle 23
## 14 Stoke 23
## 15 Bournemouth 22
## 16 West Ham 22
## 17 Brighton 21
## 18 Everton 21
## 19 West Brom 21
## 20 Swansea 16