The following is an R markdown in which I am cleaning and subsetting some MLB playoff projections from the 538 blog.
The page I am pulling this dataset from is: https://projects.fivethirtyeight.com/2020-mlb-predictions/
I loaded the data using the method outlined in “Automated Data Collection- Chapter 1” and the relevant table into a data frame.
I selected all of the columns except “1-week change” because I didn’t think time dependant relative rankings would be an important variable for quantifying the overall strength of the team.
library(stringr)
library(XML)
library(rvest)
library(tibble)
url<-"https://projects.fivethirtyeight.com/2020-mlb-predictions/"
predict_parsed<-read_html(url)
tables<-html_table(predict_parsed, fill=TRUE)
mlb<-tables[[1]]
mlb<-mlb[, c(1,2,3,5,6,7,8,9)]
I followed the methods outlined in the text to rename the columns and extract the team name and current win total from a value that looked like this.
colnames(mlb)<-c("name", "division", "rating", "avgSim", "runDiff", "playoffs", "division", "WS")
mlb<-mlb[3:32 ,]
print(mlb$name[1])
## [1] "Dodgers24-10Dodgers"
name_clean<-unlist(mlb$name)
name_clean<-str_extract(mlb$name, "[^[:digit:]]+")
win_clean<-unlist(mlb$name)
win_clean<-str_extract(mlb$name,"[[:digit:]]{2}+")
loss_clean<-unlist(mlb$name)
loss_clean<-str_extract(mlb$name, "[-][[:digit:]]{2}+")
#because the Dodgers have lost less than 10 games, a null gets entered. I will manually input it now
loss_clean[1]<-"-09"
loss_clean<-str_sub(loss_clean, 2 , 3)
mlb<-add_column(mlb, win_clean, .after = "name")
## Warning: The `.data` argument of `add_column()` must have unique names as of tibble 3.0.0.
## Use `.name_repair = "minimal"`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
mlb<-add_column(mlb,loss_clean, .after ="win_clean")
mlb$name<-name_clean
This is the end result
library(htmlTable)
htmlTable(head(mlb))
| name | win_clean | loss_clean | division | rating | avgSim | runDiff | playoffs | division.1 | WS | |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Dodgers | 24 | 09 | NL West | 1595 | 40-20 | +111 |
99% |
94% | 25% |
| 4 | Yankees | 17 | 13 | AL East | 1568 | 36-24 | +55 | 98% | 30% | 12% |
| 5 | Astros | 18 | 14 | AL West | 1566 | 34-26 | +42 | 98% | 29% | 10% |
| 6 | Rays | 22 | 11 | AL East | 1550 | 37-23 | +47 |
99% |
68% | 8% |
| 7 | Twins | 20 | 14 | AL Central | 1545 | 35-25 | +45 | 97% | 30% | 5% |
| 8 | Indians | 21 | 12 | AL Central | 1542 | 37-23 | +63 |
99% |
54% | 6% |
Having completed this assignment, there are several things I would like to research further. The most glaring is why the formatting of my htmlTable for this database seems like the fonts are different sizes in different columns. I have not used the library ‘htmlTable’ before, but I’m sure I will learn the nuances as I grow with it.
If I were to do some analysis on this, I think my first step would be to find how the current record and the rating or run differential related to the final projected record. Once this relationship was established, I would then try to figure out how they parlay that into the odds to make the playoffs.