MLB Fantasy - Pitcher

Análisis de datos de la MLB

Utilizando el paquete “Lahman” y “pitchRx”

Lahman es una base de datos creada por Sean Lahman la cual nos da las estadísticas por pitcher, bateador, y en general de los jugadores.

Por otro lado, pitchRx fue creada por Brookes Baseball (Dan Brooks y Harry Pavlidis) y muestra los datos por día y por partido. Es decir, muestra cada lanzamiento, batazo, etc. de cada partido desde el 2008.

Para ello requerimos de las siguientes librerias que se deben de cargar.

library(Lahman)

## Warning: package 'Lahman' was built under R version 3.2.5

#library(pitchRx)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(sqldf)

## Warning: package 'sqldf' was built under R version 3.2.5

## Loading required package: gsubfn

## Warning: package 'gsubfn' was built under R version 3.2.5

## Loading required package: proto

## Loading required package: RSQLite

## Warning: package 'RSQLite' was built under R version 3.2.5

## Loading required package: DBI

library(plotly)

## Warning: package 'plotly' was built under R version 3.2.5

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.2.5

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

Al cargar la librería de Lahman podemos encontrar en ella la tabla llamada “Pitching” que la convertiremos en un objeto de R debido a que esta es la tabla que nos interesa para el análisis.

pitch <- Pitching
#Se filtrará unicamente los últimos 4 años hasta el 2015. (2012, 2013, 2014 y 2015)
d <- pitch %>% filter(yearID>=2012)
#Se eliminan los NA del dataset
d <- d %>% filter(is.na(IPouts)==FALSE)
d <- d %>% filter(is.na(BAOpp)==FALSE)
#Visualizar el contenido
str(d)

## 'data.frame':    2999 obs. of  30 variables:
##  $ playerID: chr  "aardsda01" "abadfe01" "accarje01" "accarje01" ...
##  $ yearID  : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ stint   : int  1 1 1 2 1 1 1 1 1 1 ...
##  $ teamID  : Factor w/ 149 levels "ALT","ANA","ARI",..: 93 58 45 96 16 94 131 66 117 3 ...
##  $ lgID    : Factor w/ 7 levels "AA","AL","FL",..: 2 5 2 2 2 5 2 2 5 5 ...
##  $ W       : int  0 0 0 0 2 1 5 0 1 0 ...
##  $ L       : int  0 6 0 0 10 3 3 3 2 0 ...
##  $ G       : int  1 37 26 1 69 45 61 12 67 3 ...
##  $ GS      : int  0 6 0 0 0 0 0 2 0 0 ...
##  $ CG      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SHO     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SV      : int  0 0 0 0 25 1 1 0 3 0 ...
##  $ IPouts  : int  3 138 106 6 252 142 157 104 190 9 ...
##  $ H       : int  1 57 38 4 80 48 56 37 57 5 ...
##  $ ER      : int  1 26 18 2 50 34 19 9 19 3 ...
##  $ HR      : int  1 6 3 0 11 7 4 4 1 1 ...
##  $ BB      : int  1 19 16 0 31 25 17 13 23 0 ...
##  $ SO      : int  1 38 28 1 75 46 45 22 57 2 ...
##  $ BAOpp   : num  0.25 0.311 0.288 0.444 0.254 0.259 0.269 0.282 0.241 0.385 ...
##  $ ERA     : num  9 5.09 4.58 9 5.36 6.46 3.27 2.34 2.7 9 ...
##  $ IBB     : int  0 1 1 0 2 1 1 2 1 0 ...
##  $ WP      : int  0 4 2 0 3 2 2 0 5 0 ...
##  $ HBP     : int  0 3 0 0 6 3 3 1 3 1 ...
##  $ BK      : int  0 0 0 0 1 0 0 0 0 1 ...
##  $ BFP     : int  5 208 152 10 361 216 228 148 267 15 ...
##  $ GF      : int  1 8 13 0 55 7 7 3 10 0 ...
##  $ R       : int  1 27 19 2 51 38 21 13 23 3 ...
##  $ SH      : int  0 2 1 0 2 1 0 1 4 0 ...
##  $ SF      : int  0 1 3 1 7 2 0 2 0 1 ...
##  $ GIDP    : int  NA NA NA NA NA NA NA NA NA NA ...

sqldf("select playerID, yearID, teamID, W, L, IPouts, H, HR, SO, BAOpp from d limit 10")

## Loading required package: tcltk

##     playerID yearID teamID W  L IPouts  H HR SO BAOpp
## 1  aardsda01   2012    NYA 0  0      3  1  1  1 0.250
## 2   abadfe01   2012    HOU 0  6    138 57  6 38 0.311
## 3  accarje01   2012    CLE 0  0    106 38  3 28 0.288
## 4  accarje01   2012    OAK 0  0      6  4  0  1 0.444
## 5  aceveal01   2012    BOS 2 10    252 80 11 75 0.254
## 6  acostma01   2012    NYN 1  3    142 48  7 46 0.259
## 7  adamsmi03   2012    TEX 5  3    157 56  4 45 0.269
## 8  adcocna01   2012    KCA 0  3    104 37  4 22 0.282
## 9  affelje01   2012    SFN 1  2    190 57  1 57 0.241
## 10 albaljo01   2012    ARI 0  0      9  5  1  2 0.385

Variables a considerar

Variable	Descripcion
playerID	Nombre y apellido del jugador.
yearID	Año.
teamID	Equipo en el que jugó durante la temporada.
W	Partidos ganados.
L	Partidos perdidos.
IPouts	Outs en la temporada.
H	Número de hits.
HR	Número de Home Runs.
SO	Número de strikes.
BAOpp	Promedio de bateo del oponente.

Modelo y análisis

- Para realizar este análisis es importante resaltar que se utilizará IPouts como variable dependiente durante todo el modelo y sobre ésta se haran 4 modelos de regresiones lineales.

- Para cada regresión lineal, se utilizará su coeficiente y con éste crearemos una clave o fórmula para crear la columna “predicted_outs variable”. Luego, se creó otra columna llamada “variable out” que contiene la resta entre (IPouts - predicted outs variable).

- Finalmente se utilizará la columna de “variable out” para encontrar los puntos.

- Su máximo tendrá 25 puntos (25 cada modelo = 100 pts.). Si saca 0 o negativo tiene 0 puntos y de esto se sacan los puntos de cada pitcher.

Modelo 1: Outs ~ Promedio de bateo del oponente

out_baopp <- lm(data=d, IPouts~BAOpp)
summary(out_baopp)

## 
## Call:
## lm(formula = IPouts ~ BAOpp, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -250.70 -119.93  -58.16   39.75  546.25 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   251.70      10.35  24.308  < 2e-16 ***
## BAOpp        -294.14      37.36  -7.872 4.83e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 173.4 on 2997 degrees of freedom
## Multiple R-squared:  0.02026,    Adjusted R-squared:  0.01993 
## F-statistic: 61.97 on 1 and 2997 DF,  p-value: 4.828e-15

d <- d %>% mutate(predicted_outs = ((1-BAOpp)*294.4))
d <- d %>% mutate(bat_out = IPouts-predicted_outs)
d <- d %>% mutate(pts_bat = 0)

j=1
while(j<=nrow(d)){
  if(d$bat_out[j]>0){
    d$pts_bat[j] <- (d$bat_out[j]/(max(d[,32])/25)) 
  }
  j=j+1
}

Modelo 2: Outs ~ Hits

out_hits <- lm(data=d, IPouts~H)
summary(out_hits)

## 
## Call:
## lm(formula = IPouts ~ H, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -161.693  -16.584   -3.151   14.848  227.218 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.03655    0.92648   4.357 1.36e-05 ***
## H            3.03830    0.01166 260.558  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.01 on 2997 degrees of freedom
## Multiple R-squared:  0.9577, Adjusted R-squared:  0.9577 
## F-statistic: 6.789e+04 on 1 and 2997 DF,  p-value: < 2.2e-16

d <- d %>% mutate(predicted_outs_H = H*3.03848)
d <- d %>% mutate(hit_out = IPouts - predicted_outs_H)
d <- d %>% mutate(pts_hit = 0)

k=1
while(k<=nrow(d)){
  if(d$hit_out[k]>0){
    d$pts_hit[k] <- (d$hit_out[k]/(max(d$hit_out)/25)) 
  }
  k=k+1
}

Modelo 3: Outs ~ Partidos ganados + Partidos perdidos

out_wl <- lm(data=d, IPouts~W+L)
summary(out_wl)

## 
## Call:
## lm(formula = IPouts ~ W + L, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -237.166  -25.934   -6.818   23.984  267.458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.1323     1.1424   19.37   <2e-16 ***
## W            26.1824     0.2941   89.04   <2e-16 ***
## L            20.7016     0.3342   61.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.96 on 2996 degrees of freedom
## Multiple R-squared:  0.9312, Adjusted R-squared:  0.9311 
## F-statistic: 2.026e+04 on 2 and 2996 DF,  p-value: < 2.2e-16

d <- d %>% mutate(predicted_outs_WL = ((W*26.1846)+(L*20.7071)))
d <- d %>% mutate(wl_out = IPouts - predicted_outs_WL)
d <- d %>% mutate(pts_wl = 0)

l=1
while(l<=nrow(d)){
  if(d$wl_out[l]>0){
    d$pts_wl[l] <- (d$wl_out[l]/(max(d$wl_out)/25)) 
  }
  l=l+1
}

Modelo 4: Outs ~ Home Runs + Strikes

out_wl <- lm(data=d, IPouts~W+L)
summary(out_wl)

## 
## Call:
## lm(formula = IPouts ~ W + L, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -237.166  -25.934   -6.818   23.984  267.458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.1323     1.1424   19.37   <2e-16 ***
## W            26.1824     0.2941   89.04   <2e-16 ***
## L            20.7016     0.3342   61.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.96 on 2996 degrees of freedom
## Multiple R-squared:  0.9312, Adjusted R-squared:  0.9311 
## F-statistic: 2.026e+04 on 2 and 2996 DF,  p-value: < 2.2e-16

d <- d %>% mutate(predicted_outs_HS = ((HR/8.18944)+(SO/2.39536)))
d <- d %>% mutate(hs_out = IPouts - predicted_outs_HS)
d <- d %>% mutate(pts_hs = 0)

m=1
while(m<=nrow(d)){
  if(d$hs_out[m]>0){
    d$pts_hs[m] <- (d$hs_out[m]/(max(d$hs_out)/25)) 
  }
  m=m+1
}

Finalmente su sumaron las 4 columnas llamadas “pts_variable” para encontrar la nota sobre 100 de cada pitcher y asi obtener los mejores 6. También se utilzó una nueva columna para ver como les iría a los pitchers con una temporada perfecta.

d <- d %>% mutate(puntos_totales = pts_bat + pts_hit + pts_wl + pts_hs)

####Predecir los nuevos outs del jugador
d <- d %>% mutate(IPouts_simulado = bat_out + hit_out + wl_out + hs_out)

Resultados

La siguiente tabla muestra los resultados en puntos para los mejores 6 pitchers de la MLB entre 2012-2015.

resultados <- sqldf("select playerID, yearID, teamID, IPouts, IPouts_simulado, pts_bat, pts_hit, pts_wl, pts_hs,puntos_totales from d order by puntos_totales desc limit 6")
resultados

##    playerID yearID teamID IPouts IPouts_simulado  pts_bat  pts_hit    pts_wl   pts_hs puntos_totales
## 1 hernafe02   2014    SEA    708        1457.440 23.45460 20.70017 16.491330 23.84332       84.48943
## 2 kershcl01   2013    LAN    708        1393.183 23.38153 22.67127  8.866355 24.13182       79.05097
## 3 kershcl01   2015    LAN    698        1368.047 22.87050 21.91860 11.578886 22.57682       78.94481
## 4 greinza01   2015    LAN    668        1338.125 21.27895 23.60278  9.357568 23.06306       77.30236
## 5 kershcl01   2012    LAN    683        1332.379 22.35971 17.99721 11.229611 23.16789       74.75442
## 6 cuetojo01   2014    CIN    731        1359.440 24.50867 23.51541  1.808466 24.82364       74.65618

Graficas

set.seed(100)

#Distribucion en boxplot de BAOpp
plot_ly(d, y = ~BAOpp, type = "box")

#Puntos totales por equipo boxplot
plot_ly(d, y = ~puntos_totales, color = ~teamID, type = "box")

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

#Relacion entre outs y puntos totales
plot_ly(d, x = ~IPouts, y = ~puntos_totales)

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

#Distribucion de los outs conforme a BAOpp en scatterplot
plot_ly(d, x = ~BAOpp, y = ~IPouts, color = ~BAOpp,
        size = ~BAOpp, text = ~paste("puntos_totales: ", puntos_totales))

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

#Distribucion de los outs conforme a BAOPP pero por equipo
p <- ggplot(data = d, aes(x = BAOpp, y = puntos_totales)) +
  geom_point(aes(text = paste("IPouts:", IPouts))) +
  geom_smooth(aes(colour = teamID, fill = teamID)) + 
  facet_wrap(~ teamID)

## Warning: Ignoring unknown aesthetics: text

ggplotly(p)

## `geom_smooth()` using method = 'loess'

#Relacion entre partidos jugados y puntos totales
plot_ly(d, x = ~G, y = ~puntos_totales)

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

#Relacion entre partidos ganados y puntos totales
plot_ly(d, x = ~W, y = ~puntos_totales)

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

#Relacion entre partidos perdidos y puntos totales
plot_ly(d, x = ~L, y = ~puntos_totales)

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

#Puntos totales por equipo en porcentajes *pie
plot_ly(d, labels = ~teamID, values = ~puntos_totales, type = 'pie') %>%
  layout(title = 'Puntos totales por equipo',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

MLB Fantasy - Pitcher

Melissa Kiehnle, Alba Montepeque y José Rolando Josué

November 17, 2016

Objetivo del Proyecto

Encontrar los 6 mejores pitchers de la MLB durante 2012 a 2015.

Pitchers en la MLB

En la Mayor League Baseball, en Estados Unidos, un pitcher debe de tener las siguientes cualidades: buena resistencia muscular, fuerza, agilidad y destreza.También se busca su velocidad, variedad en sus lanzamientos o su control.

Análisis de datos de la MLB

Utilizando el paquete “Lahman” y “pitchRx”

Lahman es una base de datos creada por Sean Lahman la cual nos da las estadísticas por pitcher, bateador, y en general de los jugadores.

Por otro lado, pitchRx fue creada por Brookes Baseball (Dan Brooks y Harry Pavlidis) y muestra los datos por día y por partido. Es decir, muestra cada lanzamiento, batazo, etc. de cada partido desde el 2008.

Para ello requerimos de las siguientes librerias que se deben de cargar.

Al cargar la librería de Lahman podemos encontrar en ella la tabla llamada “Pitching” que la convertiremos en un objeto de R debido a que esta es la tabla que nos interesa para el análisis.

Variables a considerar

Modelo y análisis

- Para realizar este análisis es importante resaltar que se utilizará IPouts como variable dependiente durante todo el modelo y sobre ésta se haran 4 modelos de regresiones lineales.

- Para cada regresión lineal, se utilizará su coeficiente y con éste crearemos una clave o fórmula para crear la columna “predicted_outs variable”. Luego, se creó otra columna llamada “variable out” que contiene la resta entre (IPouts - predicted outs variable).

- Finalmente se utilizará la columna de “variable out” para encontrar los puntos.

- Su máximo tendrá 25 puntos (25 cada modelo = 100 pts.). Si saca 0 o negativo tiene 0 puntos y de esto se sacan los puntos de cada pitcher.

Modelo 1: Outs ~ Promedio de bateo del oponente

Modelo 2: Outs ~ Hits

Modelo 3: Outs ~ Partidos ganados + Partidos perdidos

Modelo 4: Outs ~ Home Runs + Strikes

Finalmente su sumaron las 4 columnas llamadas “pts_variable” para encontrar la nota sobre 100 de cada pitcher y asi obtener los mejores 6. También se utilzó una nueva columna para ver como les iría a los pitchers con una temporada perfecta.

Resultados

La siguiente tabla muestra los resultados en puntos para los mejores 6 pitchers de la MLB entre 2012-2015.

Graficas