I choose the MLB baseball data set which shows the elo ratings for all the teams and their prediction for games. The specific article is whether the New York Yankees have gotten better since the deadline or just luckier. I was interested to see whether the trade deadline had an impact of the elo ratings and whether it was able to change the ratings based on the trades and therefore the predictions. Below is the link to the article. https://fivethirtyeight.com/features/have-the-streaking-yankees-gotten-better-or-just-luckier/
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(naniar)
library(ggplot2)
library(ggplot2)
df= read.csv("https://github.com/mianshariq/SPS/raw/75be027d76a4942451095a93bc5ebc81915f16bb/Data%20607/Assignments/Week%201%20Assignment/mlb_elo_latest.csv")
head(df)
## date season neutral playoff team1 team2 elo1_pre elo2_pre elo_prob1
## 1 2021-10-03 2021 0 NA ATL NYM 1552.848 1490.045 0.6223837
## 2 2021-10-03 2021 0 NA STL CHC 1497.264 1454.828 0.5944605
## 3 2021-10-03 2021 0 NA SEA ANA 1499.937 1478.156 0.5655044
## 4 2021-10-03 2021 0 NA LAD MIL 1595.241 1546.777 0.6027980
## 5 2021-10-03 2021 0 NA KCR MIN 1469.411 1480.613 0.5184097
## 6 2021-10-03 2021 0 NA HOU OAK 1565.150 1529.594 0.5848783
## elo_prob2 elo1_post elo2_post rating1_pre rating2_pre pitcher1 pitcher2
## 1 0.3776163 NA NA 1548.097 1512.210
## 2 0.4055395 NA NA 1496.455 1462.279
## 3 0.4344956 NA NA 1486.547 1489.910
## 4 0.3972020 NA NA 1602.420 1543.978
## 5 0.4815903 NA NA 1467.791 1492.561
## 6 0.4151217 NA NA 1567.557 1521.187
## pitcher1_rgs pitcher2_rgs pitcher1_adj pitcher2_adj rating_prob1 rating_prob2
## 1 NA NA NA NA 0.5884631 0.4115369
## 2 NA NA NA NA 0.5750243 0.4249757
## 3 NA NA NA NA 0.5392963 0.4607037
## 4 NA NA NA NA 0.6191841 0.3808159
## 5 NA NA NA NA 0.4978022 0.5021978
## 6 NA NA NA NA 0.6091022 0.3908978
## rating1_post rating2_post score1 score2
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
summary(df)
## date season neutral playoff
## Length:2428 Min. :2021 Min. :0.0000000 Mode:logical
## Class :character 1st Qu.:2021 1st Qu.:0.0000000 NA's:2428
## Mode :character Median :2021 Median :0.0000000
## Mean :2021 Mean :0.0008237
## 3rd Qu.:2021 3rd Qu.:0.0000000
## Max. :2021 Max. :1.0000000
##
## team1 team2 elo1_pre elo2_pre
## Length:2428 Length:2428 Min. :1417 Min. :1424
## Class :character Class :character 1st Qu.:1480 1st Qu.:1481
## Mode :character Mode :character Median :1504 Median :1504
## Mean :1505 Mean :1505
## 3rd Qu.:1530 3rd Qu.:1529
## Max. :1598 Max. :1596
##
## elo_prob1 elo_prob2 elo1_post elo2_post rating1_pre
## Min. :0.3252 Min. :0.2538 Min. :1417 Min. :1423 Min. :1410
## 1st Qu.:0.4797 1st Qu.:0.4150 1st Qu.:1483 1st Qu.:1483 1st Qu.:1471
## Median :0.5336 Median :0.4664 Median :1507 Median :1505 Median :1508
## Mean :0.5331 Mean :0.4669 Mean :1505 Mean :1505 Mean :1505
## 3rd Qu.:0.5850 3rd Qu.:0.5203 3rd Qu.:1529 3rd Qu.:1528 3rd Qu.:1533
## Max. :0.7462 Max. :0.6748 Max. :1597 Max. :1598 Max. :1607
## NA's :501 NA's :501
## rating2_pre pitcher1 pitcher2 pitcher1_rgs
## Min. :1415 Length:2428 Length:2428 Min. :37.70
## 1st Qu.:1471 Class :character Class :character 1st Qu.:47.85
## Median :1507 Mode :character Mode :character Median :50.75
## Mean :1505 Mean :51.18
## 3rd Qu.:1533 3rd Qu.:54.01
## Max. :1610 Max. :67.84
## NA's :369
## pitcher2_rgs pitcher1_adj pitcher2_adj rating_prob1
## Min. :39.77 Min. :-52.456 Min. :-61.675 Min. :0.2862
## 1st Qu.:47.85 1st Qu.:-14.231 1st Qu.:-13.789 1st Qu.:0.4713
## Median :50.73 Median : -1.831 Median : -2.262 Median :0.5346
## Mean :51.15 Mean : -0.892 Mean : -1.130 Mean :0.5339
## 3rd Qu.:53.97 3rd Qu.: 10.886 3rd Qu.: 10.695 3rd Qu.:0.5977
## Max. :67.60 Max. : 59.047 Max. : 58.738 Max. :0.7738
## NA's :367 NA's :369 NA's :367
## rating_prob2 rating1_post rating2_post score1
## Min. :0.2262 Min. :1410 Min. :1414 Min. : 0.000
## 1st Qu.:0.4023 1st Qu.:1474 1st Qu.:1473 1st Qu.: 2.000
## Median :0.4654 Median :1509 Median :1508 Median : 4.000
## Mean :0.4661 Mean :1505 Mean :1505 Mean : 4.613
## 3rd Qu.:0.5287 3rd Qu.:1532 3rd Qu.:1531 3rd Qu.: 6.000
## Max. :0.7138 Max. :1606 Max. :1610 Max. :22.000
## NA's :501 NA's :501 NA's :501
## score2
## Min. : 0.000
## 1st Qu.: 2.000
## Median : 4.000
## Mean : 4.403
## 3rd Qu.: 6.000
## Max. :24.000
## NA's :501
gg_miss_fct(df, date)+ labs(title = " Nan or Missing Data")
Removing column 3 and 4 as data as columns missing data
df=df[c(1,2,5:26)]
Filtering data set by games played so far this season
df1=filter(df, date < "2021-08-28")
summary(df1)
## date season team1 team2
## Length:1927 Min. :2021 Length:1927 Length:1927
## Class :character 1st Qu.:2021 Class :character Class :character
## Mode :character Median :2021 Mode :character Mode :character
## Mean :2021
## 3rd Qu.:2021
## Max. :2021
## elo1_pre elo2_pre elo_prob1 elo_prob2 elo1_post
## Min. :1417 Min. :1424 Min. :0.3284 Min. :0.2538 Min. :1417
## 1st Qu.:1483 1st Qu.:1483 1st Qu.:0.4858 1st Qu.:0.4184 1st Qu.:1483
## Median :1506 Median :1505 Median :0.5347 Median :0.4653 Median :1507
## Mean :1505 Mean :1505 Mean :0.5338 Mean :0.4662 Mean :1505
## 3rd Qu.:1529 3rd Qu.:1528 3rd Qu.:0.5816 3rd Qu.:0.5142 3rd Qu.:1529
## Max. :1598 Max. :1596 Max. :0.7462 Max. :0.6716 Max. :1597
## elo2_post rating1_pre rating2_pre pitcher1
## Min. :1423 Min. :1410 Min. :1415 Length:1927
## 1st Qu.:1483 1st Qu.:1474 1st Qu.:1473 Class :character
## Median :1505 Median :1509 Median :1508 Mode :character
## Mean :1505 Mean :1505 Mean :1505
## 3rd Qu.:1528 3rd Qu.:1532 3rd Qu.:1531
## Max. :1598 Max. :1607 Max. :1610
## pitcher2 pitcher1_rgs pitcher2_rgs pitcher1_adj
## Length:1927 Min. :37.70 Min. :39.77 Min. :-52.4557
## Class :character 1st Qu.:47.84 1st Qu.:47.79 1st Qu.:-14.4309
## Mode :character Median :50.73 Median :50.67 Median : -2.0547
## Mean :51.19 Mean :51.12 Mean : -0.9336
## 3rd Qu.:54.01 3rd Qu.:53.91 3rd Qu.: 10.9103
## Max. :67.84 Max. :67.60 Max. : 59.0470
## pitcher2_adj rating_prob1 rating_prob2 rating1_post
## Min. :-61.675 Min. :0.2862 Min. :0.2262 Min. :1410
## 1st Qu.:-14.020 1st Qu.:0.4726 1st Qu.:0.4021 1st Qu.:1474
## Median : -2.591 Median :0.5355 Median :0.4645 Median :1509
## Mean : -1.235 Mean :0.5350 Mean :0.4650 Mean :1505
## 3rd Qu.: 10.700 3rd Qu.:0.5979 3rd Qu.:0.5274 3rd Qu.:1532
## Max. : 58.738 Max. :0.7738 Max. :0.7138 Max. :1606
## rating2_post score1 score2
## Min. :1414 Min. : 0.000 Min. : 0.000
## 1st Qu.:1473 1st Qu.: 2.000 1st Qu.: 2.000
## Median :1508 Median : 4.000 Median : 4.000
## Mean :1505 Mean : 4.613 Mean : 4.403
## 3rd Qu.:1531 3rd Qu.: 6.000 3rd Qu.: 6.000
## Max. :1610 Max. :22.000 Max. :24.000
Renaming columns
names(df1)[names(df1) == "team1"] <- "HomeTeam1"
names(df1)[names(df1) == "team2"] <- "AwayTeam2"
Filtering for New York Yankees games only
df1=filter(df1, HomeTeam1 =="NYY" | AwayTeam2=="NYY")
Separating Home and Away games in separate Tables and then concatenating the tables vertically to get all the games under 1 column
df2=filter(df1,HomeTeam1 =="NYY")
df2=df2[c('date', 'HomeTeam1', 'rating1_pre', 'rating_prob1')]
df2 <- df2 %>% rename( Team = HomeTeam1, rating_pre = rating1_pre, rating_prob = rating_prob1)
df3=filter(df1,AwayTeam2 =="NYY")
df3=df3[c('date', 'AwayTeam2', 'rating2_pre', 'rating_prob2')]
df3 <- df3 %>% rename(Team = AwayTeam2, rating_pre = rating2_pre, rating_prob = rating_prob2)
df4=rbind(df2,df3)
df4=arrange(df4, date)
Removing Double Header games for Graphing purposes
df4=distinct(df4, date, .keep_all = TRUE)
str(df4)
## 'data.frame': 125 obs. of 4 variables:
## $ date : chr "2021-04-01" "2021-04-03" "2021-04-04" "2021-04-05" ...
## $ Team : chr "NYY" "NYY" "NYY" "NYY" ...
## $ rating_pre : num 1572 1571 1572 1570 1572 ...
## $ rating_prob: num 0.625 0.609 0.571 0.722 0.749 ...
df4$date=as.Date(df4$date)
head(df4)
## date Team rating_pre rating_prob
## 1 2021-04-01 NYY 1572.329 0.6249555
## 2 2021-04-03 NYY 1570.975 0.6089046
## 3 2021-04-04 NYY 1571.942 0.5711109
## 4 2021-04-05 NYY 1570.368 0.7224693
## 5 2021-04-06 NYY 1571.541 0.7485128
## 6 2021-04-07 NYY 1572.359 0.6633895
If we filter for probability of winning a game > .5, we get the results that according to the model, the NYY had 101 games where prob >.5 and 24 games with prob <.5 . The mean of the probabilities is .5627, meaning that 56% of the time the NYY will win the game. The record according to this model should be 101-24 for the season so far. Their actual record is 76-53. there is a 4 game variation due to double headers. This gives an accuracy of 56.4 % for the elo model. Which is not a great model, almost close to 50/50.
mean(df4[["rating_prob"]])
## [1] 0.5627149
df4 %>% count(rating_prob>.5)
## rating_prob > 0.5 n
## 1 FALSE 24
## 2 TRUE 101
Here is histogram showing the spread of the probabilities. According to the model, the NYY should be winning majorities of their games.
ch1=ggplot(df4, aes(x=rating_prob, fill=rating_prob>.50)) +
geom_histogram(position="identity", bins = 10)
ch1
Here I want to see whether the trade deadline had an effect on the elo model and whether the mean probability and the record has increased or decreased. Below is the average of the probabilities before the deadline
df41=filter(df4, date < "2021-07-04")
print(mean(df41[["rating_prob"]]))
## [1] 0.5772281
df41 %>% count(rating_prob>.5)
## rating_prob > 0.5 n
## 1 FALSE 10
## 2 TRUE 70
Below is the histogram of the probailities before the deadline
ch41=ggplot(df41, aes(x=rating_prob, fill=rating_prob>.50)) +
geom_histogram(position="identity", bins = 10)
ch41
Below is the average of the probabilities after the deadline. According to this the NYY have a lower chance of winning games after the trade deadline. Keep in mind that its possible that other factors such as schedule is much harder for them after the deadline which might have decreased the probability.
df42=filter(df4, date > "2021-07-04")
mean(df42[["rating_prob"]])
## [1] 0.5373444
df42 %>% count(rating_prob>.5)
## rating_prob > 0.5 n
## 1 FALSE 14
## 2 TRUE 30
Below is the histogram of the probailities before the deadline
ch42=ggplot(df42, aes(x=rating_prob, fill=rating_prob>.50)) +
geom_histogram(position="identity", bins = 10)
ch42
Below is the chart which shows the team rating for NYY from the beginning of the season. As you can see, the NYY started off as of of the highest ranking with 1572 rating score, ranked top 5. the lowest rating being 1410 and highest being 1607.As the season went on filled with injuries, they were close to .500 record beofre the trade dealine with 2% chance of making the playoffs. Now they are riding a 13 game winning streak and their chances of making playoffs have increased to 98%.
chart=ggplot(data=df4, aes(x=date, y=rating_pre))+
geom_line(color="#69b3a2")+
annotate(geom="text", x=as.Date("2021-07-04"), y=1535.387,
label="Trade Deadline") +
annotate(geom="point", x=as.Date("2021-07-04"), y=1540.387, size=10, shape=21, fill="transparent") +
geom_hline(yintercept=1530, linetype='dotted', col = 'red')+
annotate("text", x =as.Date("2021-05-01"), y = 1530, label = "Average League rating ", vjust = -0.5)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
chart
Below we are filtering Gerrit Cole
df5=df1[c(1:24)]
df6=filter(df5,pitcher1 == "Gerrit Cole")
df7=filter(df5,pitcher2 == "Gerrit Cole")
df6=df6[c('date', 'HomeTeam1', 'rating1_pre','pitcher1', 'pitcher1_adj', 'rating_prob1')]
df6 <- df6 %>% rename( Team = HomeTeam1, rating_pre = rating1_pre,pitcher = pitcher1, pitcher_adj = pitcher1_adj, rating_prob= rating_prob1 )
df7=df7[c('date', 'AwayTeam2', 'rating2_pre','pitcher2', 'pitcher2_adj', 'rating_prob2')]
df7 <- df7 %>% rename( Team = AwayTeam2, rating_pre = rating2_pre,pitcher = pitcher2, pitcher_adj = pitcher2_adj, rating_prob= rating_prob2 )
df8=rbind(df6,df7)
head(df8)
## date Team rating_pre pitcher pitcher_adj rating_prob
## 1 2021-08-21 NYY 1560.021 Gerrit Cole 31.82722 0.6413225
## 2 2021-08-16 NYY 1552.549 Gerrit Cole 31.71516 0.7110867
## 3 2021-07-17 NYY 1544.101 Gerrit Cole 40.27376 0.5904432
## 4 2021-07-04 NYY 1542.818 Gerrit Cole 41.66872 0.6141796
## 5 2021-06-22 NYY 1551.475 Gerrit Cole 47.14698 0.7050599
## 6 2021-06-03 NYY 1557.251 Gerrit Cole 48.41435 0.6282824
df8=mutate(df8, ttl_rating = rating_pre+pitcher_adj)
df8$date=as.Date(df8$date)
str(df8)
## 'data.frame': 24 obs. of 7 variables:
## $ date : Date, format: "2021-08-21" "2021-08-16" ...
## $ Team : chr "NYY" "NYY" "NYY" "NYY" ...
## $ rating_pre : num 1560 1553 1544 1543 1551 ...
## $ pitcher : chr "Gerrit Cole" "Gerrit Cole" "Gerrit Cole" "Gerrit Cole" ...
## $ pitcher_adj: num 31.8 31.7 40.3 41.7 47.1 ...
## $ rating_prob: num 0.641 0.711 0.59 0.614 0.705 ...
## $ ttl_rating : num 1592 1584 1584 1584 1599 ...
Its amazing the impact he has on the ratings of the team. He adds ~ 45 points to all his starts. That shows that, he can beat the best team, given they are starting an average pitcher.
ch=ggplot()+
geom_step(data=df8, aes(x=date, y=rating_pre, color="pre_rating"))+
geom_step(data=df8, aes(x=date, y=ttl_rating, color="Pre+Cole"))+
ylab('rating')
ch
After looking more into elo models of fivethirtyeight, baseball has the lowest model accuracy in all sports with accuracy of 59%. I think one of the main problem with the baseball elo model is that the model seems too linear and need more inputs such as Injury factor, which can compensate for the ratings as well as logarithmic approach towards win streak or hot hand, which should increase the accuracy of the models.