Data607-Assignment1

Introducation

I choose the MLB baseball data set which shows the elo ratings for all the teams and their prediction for games. The specific article is whether the New York Yankees have gotten better since the deadline or just luckier. I was interested to see whether the trade deadline had an impact of the elo ratings and whether it was able to change the ratings based on the trades and therefore the predictions. Below is the link to the article. https://fivethirtyeight.com/features/have-the-streaking-yankees-gotten-better-or-just-luckier/

Importing Libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(naniar)
library(ggplot2)
library(ggplot2)

Importing Data from my github

df= read.csv("https://github.com/mianshariq/SPS/raw/75be027d76a4942451095a93bc5ebc81915f16bb/Data%20607/Assignments/Week%201%20Assignment/mlb_elo_latest.csv")

Exploratory Analysis

head(df)

##         date season neutral playoff team1 team2 elo1_pre elo2_pre elo_prob1
## 1 2021-10-03   2021       0      NA   ATL   NYM 1552.848 1490.045 0.6223837
## 2 2021-10-03   2021       0      NA   STL   CHC 1497.264 1454.828 0.5944605
## 3 2021-10-03   2021       0      NA   SEA   ANA 1499.937 1478.156 0.5655044
## 4 2021-10-03   2021       0      NA   LAD   MIL 1595.241 1546.777 0.6027980
## 5 2021-10-03   2021       0      NA   KCR   MIN 1469.411 1480.613 0.5184097
## 6 2021-10-03   2021       0      NA   HOU   OAK 1565.150 1529.594 0.5848783
##   elo_prob2 elo1_post elo2_post rating1_pre rating2_pre pitcher1 pitcher2
## 1 0.3776163        NA        NA    1548.097    1512.210                  
## 2 0.4055395        NA        NA    1496.455    1462.279                  
## 3 0.4344956        NA        NA    1486.547    1489.910                  
## 4 0.3972020        NA        NA    1602.420    1543.978                  
## 5 0.4815903        NA        NA    1467.791    1492.561                  
## 6 0.4151217        NA        NA    1567.557    1521.187                  
##   pitcher1_rgs pitcher2_rgs pitcher1_adj pitcher2_adj rating_prob1 rating_prob2
## 1           NA           NA           NA           NA    0.5884631    0.4115369
## 2           NA           NA           NA           NA    0.5750243    0.4249757
## 3           NA           NA           NA           NA    0.5392963    0.4607037
## 4           NA           NA           NA           NA    0.6191841    0.3808159
## 5           NA           NA           NA           NA    0.4978022    0.5021978
## 6           NA           NA           NA           NA    0.6091022    0.3908978
##   rating1_post rating2_post score1 score2
## 1           NA           NA     NA     NA
## 2           NA           NA     NA     NA
## 3           NA           NA     NA     NA
## 4           NA           NA     NA     NA
## 5           NA           NA     NA     NA
## 6           NA           NA     NA     NA

summary(df)

##      date               season        neutral          playoff       
##  Length:2428        Min.   :2021   Min.   :0.0000000   Mode:logical  
##  Class :character   1st Qu.:2021   1st Qu.:0.0000000   NA's:2428     
##  Mode  :character   Median :2021   Median :0.0000000                 
##                     Mean   :2021   Mean   :0.0008237                 
##                     3rd Qu.:2021   3rd Qu.:0.0000000                 
##                     Max.   :2021   Max.   :1.0000000                 
##                                                                      
##     team1              team2              elo1_pre       elo2_pre   
##  Length:2428        Length:2428        Min.   :1417   Min.   :1424  
##  Class :character   Class :character   1st Qu.:1480   1st Qu.:1481  
##  Mode  :character   Mode  :character   Median :1504   Median :1504  
##                                        Mean   :1505   Mean   :1505  
##                                        3rd Qu.:1530   3rd Qu.:1529  
##                                        Max.   :1598   Max.   :1596  
##                                                                     
##    elo_prob1        elo_prob2        elo1_post      elo2_post     rating1_pre  
##  Min.   :0.3252   Min.   :0.2538   Min.   :1417   Min.   :1423   Min.   :1410  
##  1st Qu.:0.4797   1st Qu.:0.4150   1st Qu.:1483   1st Qu.:1483   1st Qu.:1471  
##  Median :0.5336   Median :0.4664   Median :1507   Median :1505   Median :1508  
##  Mean   :0.5331   Mean   :0.4669   Mean   :1505   Mean   :1505   Mean   :1505  
##  3rd Qu.:0.5850   3rd Qu.:0.5203   3rd Qu.:1529   3rd Qu.:1528   3rd Qu.:1533  
##  Max.   :0.7462   Max.   :0.6748   Max.   :1597   Max.   :1598   Max.   :1607  
##                                    NA's   :501    NA's   :501                  
##   rating2_pre     pitcher1           pitcher2          pitcher1_rgs  
##  Min.   :1415   Length:2428        Length:2428        Min.   :37.70  
##  1st Qu.:1471   Class :character   Class :character   1st Qu.:47.85  
##  Median :1507   Mode  :character   Mode  :character   Median :50.75  
##  Mean   :1505                                         Mean   :51.18  
##  3rd Qu.:1533                                         3rd Qu.:54.01  
##  Max.   :1610                                         Max.   :67.84  
##                                                       NA's   :369    
##   pitcher2_rgs    pitcher1_adj      pitcher2_adj      rating_prob1   
##  Min.   :39.77   Min.   :-52.456   Min.   :-61.675   Min.   :0.2862  
##  1st Qu.:47.85   1st Qu.:-14.231   1st Qu.:-13.789   1st Qu.:0.4713  
##  Median :50.73   Median : -1.831   Median : -2.262   Median :0.5346  
##  Mean   :51.15   Mean   : -0.892   Mean   : -1.130   Mean   :0.5339  
##  3rd Qu.:53.97   3rd Qu.: 10.886   3rd Qu.: 10.695   3rd Qu.:0.5977  
##  Max.   :67.60   Max.   : 59.047   Max.   : 58.738   Max.   :0.7738  
##  NA's   :367     NA's   :369       NA's   :367                       
##   rating_prob2     rating1_post   rating2_post      score1      
##  Min.   :0.2262   Min.   :1410   Min.   :1414   Min.   : 0.000  
##  1st Qu.:0.4023   1st Qu.:1474   1st Qu.:1473   1st Qu.: 2.000  
##  Median :0.4654   Median :1509   Median :1508   Median : 4.000  
##  Mean   :0.4661   Mean   :1505   Mean   :1505   Mean   : 4.613  
##  3rd Qu.:0.5287   3rd Qu.:1532   3rd Qu.:1531   3rd Qu.: 6.000  
##  Max.   :0.7138   Max.   :1606   Max.   :1610   Max.   :22.000  
##                   NA's   :501    NA's   :501    NA's   :501     
##      score2      
##  Min.   : 0.000  
##  1st Qu.: 2.000  
##  Median : 4.000  
##  Mean   : 4.403  
##  3rd Qu.: 6.000  
##  Max.   :24.000  
##  NA's   :501

gg_miss_fct(df, date)+ labs(title = "     Nan or Missing Data")

Removing column 3 and 4 as data as columns missing data

df=df[c(1,2,5:26)]

Filtering data set by games played so far this season

df1=filter(df, date < "2021-08-28")
summary(df1)

##      date               season        team1              team2          
##  Length:1927        Min.   :2021   Length:1927        Length:1927       
##  Class :character   1st Qu.:2021   Class :character   Class :character  
##  Mode  :character   Median :2021   Mode  :character   Mode  :character  
##                     Mean   :2021                                        
##                     3rd Qu.:2021                                        
##                     Max.   :2021                                        
##     elo1_pre       elo2_pre      elo_prob1        elo_prob2        elo1_post   
##  Min.   :1417   Min.   :1424   Min.   :0.3284   Min.   :0.2538   Min.   :1417  
##  1st Qu.:1483   1st Qu.:1483   1st Qu.:0.4858   1st Qu.:0.4184   1st Qu.:1483  
##  Median :1506   Median :1505   Median :0.5347   Median :0.4653   Median :1507  
##  Mean   :1505   Mean   :1505   Mean   :0.5338   Mean   :0.4662   Mean   :1505  
##  3rd Qu.:1529   3rd Qu.:1528   3rd Qu.:0.5816   3rd Qu.:0.5142   3rd Qu.:1529  
##  Max.   :1598   Max.   :1596   Max.   :0.7462   Max.   :0.6716   Max.   :1597  
##    elo2_post     rating1_pre    rating2_pre     pitcher1        
##  Min.   :1423   Min.   :1410   Min.   :1415   Length:1927       
##  1st Qu.:1483   1st Qu.:1474   1st Qu.:1473   Class :character  
##  Median :1505   Median :1509   Median :1508   Mode  :character  
##  Mean   :1505   Mean   :1505   Mean   :1505                     
##  3rd Qu.:1528   3rd Qu.:1532   3rd Qu.:1531                     
##  Max.   :1598   Max.   :1607   Max.   :1610                     
##    pitcher2          pitcher1_rgs    pitcher2_rgs    pitcher1_adj     
##  Length:1927        Min.   :37.70   Min.   :39.77   Min.   :-52.4557  
##  Class :character   1st Qu.:47.84   1st Qu.:47.79   1st Qu.:-14.4309  
##  Mode  :character   Median :50.73   Median :50.67   Median : -2.0547  
##                     Mean   :51.19   Mean   :51.12   Mean   : -0.9336  
##                     3rd Qu.:54.01   3rd Qu.:53.91   3rd Qu.: 10.9103  
##                     Max.   :67.84   Max.   :67.60   Max.   : 59.0470  
##   pitcher2_adj      rating_prob1     rating_prob2     rating1_post 
##  Min.   :-61.675   Min.   :0.2862   Min.   :0.2262   Min.   :1410  
##  1st Qu.:-14.020   1st Qu.:0.4726   1st Qu.:0.4021   1st Qu.:1474  
##  Median : -2.591   Median :0.5355   Median :0.4645   Median :1509  
##  Mean   : -1.235   Mean   :0.5350   Mean   :0.4650   Mean   :1505  
##  3rd Qu.: 10.700   3rd Qu.:0.5979   3rd Qu.:0.5274   3rd Qu.:1532  
##  Max.   : 58.738   Max.   :0.7738   Max.   :0.7138   Max.   :1606  
##   rating2_post      score1           score2      
##  Min.   :1414   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:1473   1st Qu.: 2.000   1st Qu.: 2.000  
##  Median :1508   Median : 4.000   Median : 4.000  
##  Mean   :1505   Mean   : 4.613   Mean   : 4.403  
##  3rd Qu.:1531   3rd Qu.: 6.000   3rd Qu.: 6.000  
##  Max.   :1610   Max.   :22.000   Max.   :24.000

Renaming columns

names(df1)[names(df1) == "team1"] <- "HomeTeam1"
names(df1)[names(df1) == "team2"] <- "AwayTeam2"

Filtering for New York Yankees games only

df1=filter(df1, HomeTeam1 =="NYY" | AwayTeam2=="NYY")

Separating Home and Away games in separate Tables and then concatenating the tables vertically to get all the games under 1 column

df2=filter(df1,HomeTeam1 =="NYY")

df2=df2[c('date', 'HomeTeam1', 'rating1_pre', 'rating_prob1')]
df2 <- df2 %>% rename( Team = HomeTeam1, rating_pre = rating1_pre, rating_prob = rating_prob1)


df3=filter(df1,AwayTeam2 =="NYY")
df3=df3[c('date', 'AwayTeam2', 'rating2_pre', 'rating_prob2')]
df3 <- df3 %>% rename(Team = AwayTeam2, rating_pre = rating2_pre, rating_prob = rating_prob2)

df4=rbind(df2,df3)

df4=arrange(df4, date)

Removing Double Header games for Graphing purposes

df4=distinct(df4, date, .keep_all = TRUE)
str(df4)

## 'data.frame':    125 obs. of  4 variables:
##  $ date       : chr  "2021-04-01" "2021-04-03" "2021-04-04" "2021-04-05" ...
##  $ Team       : chr  "NYY" "NYY" "NYY" "NYY" ...
##  $ rating_pre : num  1572 1571 1572 1570 1572 ...
##  $ rating_prob: num  0.625 0.609 0.571 0.722 0.749 ...

df4$date=as.Date(df4$date)

head(df4)

##         date Team rating_pre rating_prob
## 1 2021-04-01  NYY   1572.329   0.6249555
## 2 2021-04-03  NYY   1570.975   0.6089046
## 3 2021-04-04  NYY   1571.942   0.5711109
## 4 2021-04-05  NYY   1570.368   0.7224693
## 5 2021-04-06  NYY   1571.541   0.7485128
## 6 2021-04-07  NYY   1572.359   0.6633895

Question1: How does the New York Yankees stack up in this ELO model?

If we filter for probability of winning a game > .5, we get the results that according to the model, the NYY had 101 games where prob >.5 and 24 games with prob <.5 . The mean of the probabilities is .5627, meaning that 56% of the time the NYY will win the game. The record according to this model should be 101-24 for the season so far. Their actual record is 76-53. there is a 4 game variation due to double headers. This gives an accuracy of 56.4 % for the elo model. Which is not a great model, almost close to 50/50.

mean(df4[["rating_prob"]])

## [1] 0.5627149

df4 %>% count(rating_prob>.5)

##   rating_prob > 0.5   n
## 1             FALSE  24
## 2              TRUE 101

Here is histogram showing the spread of the probabilities. According to the model, the NYY should be winning majorities of their games.

ch1=ggplot(df4, aes(x=rating_prob, fill=rating_prob>.50)) +
geom_histogram(position="identity", bins = 10)
ch1

Here I want to see whether the trade deadline had an effect on the elo model and whether the mean probability and the record has increased or decreased. Below is the average of the probabilities before the deadline

df41=filter(df4, date < "2021-07-04")
print(mean(df41[["rating_prob"]]))

## [1] 0.5772281

df41 %>% count(rating_prob>.5)

##   rating_prob > 0.5  n
## 1             FALSE 10
## 2              TRUE 70

Below is the histogram of the probailities before the deadline

ch41=ggplot(df41, aes(x=rating_prob, fill=rating_prob>.50)) +
  geom_histogram(position="identity", bins = 10)
ch41

Below is the average of the probabilities after the deadline. According to this the NYY have a lower chance of winning games after the trade deadline. Keep in mind that its possible that other factors such as schedule is much harder for them after the deadline which might have decreased the probability.

df42=filter(df4, date > "2021-07-04")
mean(df42[["rating_prob"]])

## [1] 0.5373444

df42 %>% count(rating_prob>.5)

##   rating_prob > 0.5  n
## 1             FALSE 14
## 2              TRUE 30

Below is the histogram of the probailities before the deadline

ch42=ggplot(df42, aes(x=rating_prob, fill=rating_prob>.50)) +
  geom_histogram(position="identity", bins = 10)
ch42

Question 2: Does the model capture the current NYY 13 game winning streak? Is the model too linear?

Below is the chart which shows the team rating for NYY from the beginning of the season. As you can see, the NYY started off as of of the highest ranking with 1572 rating score, ranked top 5. the lowest rating being 1410 and highest being 1607.As the season went on filled with injuries, they were close to .500 record beofre the trade dealine with 2% chance of making the playoffs. Now they are riding a 13 game winning streak and their chances of making playoffs have increased to 98%.

chart=ggplot(data=df4, aes(x=date, y=rating_pre))+
  geom_line(color="#69b3a2")+
  annotate(geom="text", x=as.Date("2021-07-04"), y=1535.387, 
           label="Trade Deadline") +
  annotate(geom="point", x=as.Date("2021-07-04"), y=1540.387, size=10, shape=21, fill="transparent") +
  geom_hline(yintercept=1530, linetype='dotted', col = 'red')+
  annotate("text", x =as.Date("2021-05-01"), y = 1530, label = "Average League rating ", vjust = -0.5)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
chart

Below we are filtering Gerrit Cole

df5=df1[c(1:24)]
df6=filter(df5,pitcher1 == "Gerrit Cole")
df7=filter(df5,pitcher2 == "Gerrit Cole")
df6=df6[c('date', 'HomeTeam1', 'rating1_pre','pitcher1', 'pitcher1_adj', 'rating_prob1')]
df6 <- df6 %>% rename( Team = HomeTeam1, rating_pre = rating1_pre,pitcher = pitcher1, pitcher_adj = pitcher1_adj, rating_prob= rating_prob1 )
df7=df7[c('date', 'AwayTeam2', 'rating2_pre','pitcher2', 'pitcher2_adj', 'rating_prob2')]
df7 <- df7 %>% rename( Team = AwayTeam2, rating_pre = rating2_pre,pitcher = pitcher2, pitcher_adj = pitcher2_adj, rating_prob= rating_prob2 )
df8=rbind(df6,df7)
head(df8)

##         date Team rating_pre     pitcher pitcher_adj rating_prob
## 1 2021-08-21  NYY   1560.021 Gerrit Cole    31.82722   0.6413225
## 2 2021-08-16  NYY   1552.549 Gerrit Cole    31.71516   0.7110867
## 3 2021-07-17  NYY   1544.101 Gerrit Cole    40.27376   0.5904432
## 4 2021-07-04  NYY   1542.818 Gerrit Cole    41.66872   0.6141796
## 5 2021-06-22  NYY   1551.475 Gerrit Cole    47.14698   0.7050599
## 6 2021-06-03  NYY   1557.251 Gerrit Cole    48.41435   0.6282824

df8=mutate(df8, ttl_rating = rating_pre+pitcher_adj)
df8$date=as.Date(df8$date)
str(df8)

## 'data.frame':    24 obs. of  7 variables:
##  $ date       : Date, format: "2021-08-21" "2021-08-16" ...
##  $ Team       : chr  "NYY" "NYY" "NYY" "NYY" ...
##  $ rating_pre : num  1560 1553 1544 1543 1551 ...
##  $ pitcher    : chr  "Gerrit Cole" "Gerrit Cole" "Gerrit Cole" "Gerrit Cole" ...
##  $ pitcher_adj: num  31.8 31.7 40.3 41.7 47.1 ...
##  $ rating_prob: num  0.641 0.711 0.59 0.614 0.705 ...
##  $ ttl_rating : num  1592 1584 1584 1584 1599 ...

Question 3: How much does difference NYY best pitcher (Gerrit Cole) makes to the team ratings.

Its amazing the impact he has on the ratings of the team. He adds ~ 45 points to all his starts. That shows that, he can beat the best team, given they are starting an average pitcher.

ch=ggplot()+
  geom_step(data=df8, aes(x=date, y=rating_pre, color="pre_rating"))+
  geom_step(data=df8, aes(x=date, y=ttl_rating, color="Pre+Cole"))+
  ylab('rating')
ch

Conclusion:

After looking more into elo models of fivethirtyeight, baseball has the lowest model accuracy in all sports with accuracy of 59%. I think one of the main problem with the baseball elo model is that the model seems too linear and need more inputs such as Injury factor, which can compensate for the ratings as well as logarithmic approach towards win streak or hot hand, which should increase the accuracy of the models.