Utilizando el paquete “Lahman” y “pitchRx”
Lahman es una base de datos creada por Sean Lahman la cual nos da las estadĂsticas por pitcher, bateador, y en general de los jugadores.
Por otro lado, pitchRx fue creada por Brookes Baseball (Dan Brooks y Harry Pavlidis) y muestra los datos por dĂa y por partido. Es decir, muestra cada lanzamiento, batazo, etc. de cada partido desde el 2008.
Para ello requerimos de las siguientes librerias que se deben de cargar.
library(Lahman)
## Warning: package 'Lahman' was built under R version 3.2.5
#library(pitchRx)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(sqldf)
## Warning: package 'sqldf' was built under R version 3.2.5
## Loading required package: gsubfn
## Warning: package 'gsubfn' was built under R version 3.2.5
## Loading required package: proto
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 3.2.5
## Loading required package: DBI
library(plotly)
## Warning: package 'plotly' was built under R version 3.2.5
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.5
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Al cargar la librerĂa de Lahman podemos encontrar en ella la tabla llamada “Pitching” que la convertiremos en un objeto de R debido a que esta es la tabla que nos interesa para el análisis.
pitch <- Pitching
#Se filtrará unicamente los últimos 4 años hasta el 2015. (2012, 2013, 2014 y 2015)
d <- pitch %>% filter(yearID>=2012)
#Se eliminan los NA del dataset
d <- d %>% filter(is.na(IPouts)==FALSE)
d <- d %>% filter(is.na(BAOpp)==FALSE)
#Visualizar el contenido
str(d)
## 'data.frame': 2999 obs. of 30 variables:
## $ playerID: chr "aardsda01" "abadfe01" "accarje01" "accarje01" ...
## $ yearID : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ stint : int 1 1 1 2 1 1 1 1 1 1 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 93 58 45 96 16 94 131 66 117 3 ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 2 5 2 2 2 5 2 2 5 5 ...
## $ W : int 0 0 0 0 2 1 5 0 1 0 ...
## $ L : int 0 6 0 0 10 3 3 3 2 0 ...
## $ G : int 1 37 26 1 69 45 61 12 67 3 ...
## $ GS : int 0 6 0 0 0 0 0 2 0 0 ...
## $ CG : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SHO : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SV : int 0 0 0 0 25 1 1 0 3 0 ...
## $ IPouts : int 3 138 106 6 252 142 157 104 190 9 ...
## $ H : int 1 57 38 4 80 48 56 37 57 5 ...
## $ ER : int 1 26 18 2 50 34 19 9 19 3 ...
## $ HR : int 1 6 3 0 11 7 4 4 1 1 ...
## $ BB : int 1 19 16 0 31 25 17 13 23 0 ...
## $ SO : int 1 38 28 1 75 46 45 22 57 2 ...
## $ BAOpp : num 0.25 0.311 0.288 0.444 0.254 0.259 0.269 0.282 0.241 0.385 ...
## $ ERA : num 9 5.09 4.58 9 5.36 6.46 3.27 2.34 2.7 9 ...
## $ IBB : int 0 1 1 0 2 1 1 2 1 0 ...
## $ WP : int 0 4 2 0 3 2 2 0 5 0 ...
## $ HBP : int 0 3 0 0 6 3 3 1 3 1 ...
## $ BK : int 0 0 0 0 1 0 0 0 0 1 ...
## $ BFP : int 5 208 152 10 361 216 228 148 267 15 ...
## $ GF : int 1 8 13 0 55 7 7 3 10 0 ...
## $ R : int 1 27 19 2 51 38 21 13 23 3 ...
## $ SH : int 0 2 1 0 2 1 0 1 4 0 ...
## $ SF : int 0 1 3 1 7 2 0 2 0 1 ...
## $ GIDP : int NA NA NA NA NA NA NA NA NA NA ...
sqldf("select playerID, yearID, teamID, W, L, IPouts, H, HR, SO, BAOpp from d limit 10")
## Loading required package: tcltk
## playerID yearID teamID W L IPouts H HR SO BAOpp
## 1 aardsda01 2012 NYA 0 0 3 1 1 1 0.250
## 2 abadfe01 2012 HOU 0 6 138 57 6 38 0.311
## 3 accarje01 2012 CLE 0 0 106 38 3 28 0.288
## 4 accarje01 2012 OAK 0 0 6 4 0 1 0.444
## 5 aceveal01 2012 BOS 2 10 252 80 11 75 0.254
## 6 acostma01 2012 NYN 1 3 142 48 7 46 0.259
## 7 adamsmi03 2012 TEX 5 3 157 56 4 45 0.269
## 8 adcocna01 2012 KCA 0 3 104 37 4 22 0.282
## 9 affelje01 2012 SFN 1 2 190 57 1 57 0.241
## 10 albaljo01 2012 ARI 0 0 9 5 1 2 0.385
Modelo y análisis
- Para realizar este análisis es importante resaltar que se utilizará IPouts como variable dependiente durante todo el modelo y sobre ésta se haran 4 modelos de regresiones lineales.
- Para cada regresión lineal, se utilizará su coeficiente y con éste crearemos una clave o fórmula para crear la columna “predicted_outs variable”. Luego, se creó otra columna llamada “variable out” que contiene la resta entre (IPouts - predicted outs variable).
- Finalmente se utilizará la columna de “variable out” para encontrar los puntos.
- Su máximo tendrá 25 puntos (25 cada modelo = 100 pts.). Si saca 0 o negativo tiene 0 puntos y de esto se sacan los puntos de cada pitcher.
Modelo 1: Outs ~ Promedio de bateo del oponente
out_baopp <- lm(data=d, IPouts~BAOpp)
summary(out_baopp)
##
## Call:
## lm(formula = IPouts ~ BAOpp, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -250.70 -119.93 -58.16 39.75 546.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 251.70 10.35 24.308 < 2e-16 ***
## BAOpp -294.14 37.36 -7.872 4.83e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 173.4 on 2997 degrees of freedom
## Multiple R-squared: 0.02026, Adjusted R-squared: 0.01993
## F-statistic: 61.97 on 1 and 2997 DF, p-value: 4.828e-15
d <- d %>% mutate(predicted_outs = ((1-BAOpp)*294.4))
d <- d %>% mutate(bat_out = IPouts-predicted_outs)
d <- d %>% mutate(pts_bat = 0)
j=1
while(j<=nrow(d)){
if(d$bat_out[j]>0){
d$pts_bat[j] <- (d$bat_out[j]/(max(d[,32])/25))
}
j=j+1
}
Modelo 2: Outs ~ Hits
out_hits <- lm(data=d, IPouts~H)
summary(out_hits)
##
## Call:
## lm(formula = IPouts ~ H, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -161.693 -16.584 -3.151 14.848 227.218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.03655 0.92648 4.357 1.36e-05 ***
## H 3.03830 0.01166 260.558 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.01 on 2997 degrees of freedom
## Multiple R-squared: 0.9577, Adjusted R-squared: 0.9577
## F-statistic: 6.789e+04 on 1 and 2997 DF, p-value: < 2.2e-16
d <- d %>% mutate(predicted_outs_H = H*3.03848)
d <- d %>% mutate(hit_out = IPouts - predicted_outs_H)
d <- d %>% mutate(pts_hit = 0)
k=1
while(k<=nrow(d)){
if(d$hit_out[k]>0){
d$pts_hit[k] <- (d$hit_out[k]/(max(d$hit_out)/25))
}
k=k+1
}
Modelo 3: Outs ~ Partidos ganados + Partidos perdidos
out_wl <- lm(data=d, IPouts~W+L)
summary(out_wl)
##
## Call:
## lm(formula = IPouts ~ W + L, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -237.166 -25.934 -6.818 23.984 267.458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.1323 1.1424 19.37 <2e-16 ***
## W 26.1824 0.2941 89.04 <2e-16 ***
## L 20.7016 0.3342 61.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.96 on 2996 degrees of freedom
## Multiple R-squared: 0.9312, Adjusted R-squared: 0.9311
## F-statistic: 2.026e+04 on 2 and 2996 DF, p-value: < 2.2e-16
d <- d %>% mutate(predicted_outs_WL = ((W*26.1846)+(L*20.7071)))
d <- d %>% mutate(wl_out = IPouts - predicted_outs_WL)
d <- d %>% mutate(pts_wl = 0)
l=1
while(l<=nrow(d)){
if(d$wl_out[l]>0){
d$pts_wl[l] <- (d$wl_out[l]/(max(d$wl_out)/25))
}
l=l+1
}
Modelo 4: Outs ~ Home Runs + Strikes
out_wl <- lm(data=d, IPouts~W+L)
summary(out_wl)
##
## Call:
## lm(formula = IPouts ~ W + L, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -237.166 -25.934 -6.818 23.984 267.458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.1323 1.1424 19.37 <2e-16 ***
## W 26.1824 0.2941 89.04 <2e-16 ***
## L 20.7016 0.3342 61.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.96 on 2996 degrees of freedom
## Multiple R-squared: 0.9312, Adjusted R-squared: 0.9311
## F-statistic: 2.026e+04 on 2 and 2996 DF, p-value: < 2.2e-16
d <- d %>% mutate(predicted_outs_HS = ((HR/8.18944)+(SO/2.39536)))
d <- d %>% mutate(hs_out = IPouts - predicted_outs_HS)
d <- d %>% mutate(pts_hs = 0)
m=1
while(m<=nrow(d)){
if(d$hs_out[m]>0){
d$pts_hs[m] <- (d$hs_out[m]/(max(d$hs_out)/25))
}
m=m+1
}
Finalmente su sumaron las 4 columnas llamadas “pts_variable” para encontrar la nota sobre 100 de cada pitcher y asi obtener los mejores 6. TambiĂ©n se utilzĂł una nueva columna para ver como les irĂa a los pitchers con una temporada perfecta.
d <- d %>% mutate(puntos_totales = pts_bat + pts_hit + pts_wl + pts_hs)
####Predecir los nuevos outs del jugador
d <- d %>% mutate(IPouts_simulado = bat_out + hit_out + wl_out + hs_out)
Graficas
set.seed(100)
#Distribucion en boxplot de BAOpp
plot_ly(d, y = ~BAOpp, type = "box")
#Puntos totales por equipo boxplot
plot_ly(d, y = ~puntos_totales, color = ~teamID, type = "box")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
#Relacion entre outs y puntos totales
plot_ly(d, x = ~IPouts, y = ~puntos_totales)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
#Distribucion de los outs conforme a BAOpp en scatterplot
plot_ly(d, x = ~BAOpp, y = ~IPouts, color = ~BAOpp,
size = ~BAOpp, text = ~paste("puntos_totales: ", puntos_totales))
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
#Distribucion de los outs conforme a BAOPP pero por equipo
p <- ggplot(data = d, aes(x = BAOpp, y = puntos_totales)) +
geom_point(aes(text = paste("IPouts:", IPouts))) +
geom_smooth(aes(colour = teamID, fill = teamID)) +
facet_wrap(~ teamID)
## Warning: Ignoring unknown aesthetics: text
ggplotly(p)
## `geom_smooth()` using method = 'loess'
#Relacion entre partidos jugados y puntos totales
plot_ly(d, x = ~G, y = ~puntos_totales)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
#Relacion entre partidos ganados y puntos totales
plot_ly(d, x = ~W, y = ~puntos_totales)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
#Relacion entre partidos perdidos y puntos totales
plot_ly(d, x = ~L, y = ~puntos_totales)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
#Puntos totales por equipo en porcentajes *pie
plot_ly(d, labels = ~teamID, values = ~puntos_totales, type = 'pie') %>%
layout(title = 'Puntos totales por equipo',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))