The following is an R markdown in which I am cleaning and subsetting some MLB playoff projections from the 538 blog.

The page I am pulling this dataset from is: https://projects.fivethirtyeight.com/2020-mlb-predictions/

Loading and Preparing the data from 538

I loaded the data using the method outlined in “Automated Data Collection- Chapter 1” and the relevant table into a data frame.

I selected all of the columns except “1-week change” because I didn’t think time dependant relative rankings would be an important variable for quantifying the overall strength of the team.

library(stringr)
library(XML)
library(rvest)
library(tibble)

url<-"https://projects.fivethirtyeight.com/2020-mlb-predictions/"

predict_parsed<-read_html(url)

tables<-html_table(predict_parsed, fill=TRUE)

mlb<-tables[[1]]                   

mlb<-mlb[,  c(1,2,3,5,6,7,8,9)]

Renaming and Cleaning the Data

I followed the methods outlined in the text to rename the columns and extract the team name and current win total from a value that looked like this.

colnames(mlb)<-c("name", "division", "rating", "avgSim", "runDiff", "playoffs", "division", "WS")

mlb<-mlb[3:32 ,]

print(mlb$name[1])
## [1] "Dodgers24-10Dodgers"
name_clean<-unlist(mlb$name)
name_clean<-str_extract(mlb$name, "[^[:digit:]]+")

win_clean<-unlist(mlb$name)
win_clean<-str_extract(mlb$name,"[[:digit:]]{2}+")

loss_clean<-unlist(mlb$name)
loss_clean<-str_extract(mlb$name, "[-][[:digit:]]{2}+")
#because the Dodgers have lost less than 10 games, a null gets entered. I will manually input it now
loss_clean[1]<-"-09"
loss_clean<-str_sub(loss_clean, 2 , 3)


mlb<-add_column(mlb, win_clean, .after = "name")
## Warning: The `.data` argument of `add_column()` must have unique names as of tibble 3.0.0.
## Use `.name_repair = "minimal"`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
mlb<-add_column(mlb,loss_clean, .after ="win_clean")
mlb$name<-name_clean

This is the end result

library(htmlTable)


htmlTable(head(mlb))
name win_clean loss_clean division rating avgSim runDiff playoffs division.1 WS
3 Dodgers 24 09 NL West 1595 40-20 +111
99%
94% 25%
4 Yankees 17 13 AL East 1568 36-24 +55 98% 30% 12%
5 Astros 18 14 AL West 1566 34-26 +42 98% 29% 10%
6 Rays 22 11 AL East 1550 37-23 +47
99%
68% 8%
7 Twins 20 14 AL Central 1545 35-25 +45 97% 30% 5%
8 Indians 21 12 AL Central 1542 37-23 +63
99%
54% 6%

Conclusions

Having completed this assignment, there are several things I would like to research further. The most glaring is why the formatting of my htmlTable for this database seems like the fonts are different sizes in different columns. I have not used the library ‘htmlTable’ before, but I’m sure I will learn the nuances as I grow with it.

If I were to do some analysis on this, I think my first step would be to find how the current record and the rating or run differential related to the final projected record. Once this relationship was established, I would then try to figure out how they parlay that into the odds to make the playoffs.