This graph explores the trend of home field advantage that is often discussed in MLB. The data come from every game played by each team in MLB history. Team names are listed on the y-axis, and each team's corresponding "added run" value extends across the x-axis. Here, a team's added run value is the average number of runs the team adds to its score when playing at home compared with playing elsewhere. For example, suppose the Cardinals won a game by 4 runs; had they replayed that game at their own home field, they would, on average, have won by 0.2 more runs. The largest home field advantage discovered was that of the Colorado Rockies, at an average of +0.5 added runs each time they play at home, almost double the MLB-wide average (0.254). Because this representation is based on past data and is not a predictive model, it only shows the after-the-fact influence that home field advantage has had on teams' runs.
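For reference, a minimal sketch of how the per-team added run values could be computed is shown below, assuming the Home_Away_Difference data used later also carries a `Home Team` column (a hypothetical name) alongside `Home Team Margin of Victory`.

# Sketch: average home margin of victory per team ("added runs"); the
# `Home Team` column name is an assumption about the data layout.
library(dplyr)

added_runs_by_team <- Home_Away_Difference %>%
  group_by(`Home Team`) %>%
  summarise(`Added Runs` = mean(`Home Team Margin of Victory`, na.rm = TRUE)) %>%
  arrange(desc(`Added Runs`))

head(added_runs_by_team)  # the Rockies should sit near the top at about +0.5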

mean(Home_Away_Difference$`Home Team Margin of Victory`)
## [1] 0.2541399
# 99% confidence interval for a mean, using the normal critical value 2.58
confidence.interval.99 <- function(mean, sd, n){
  c(mean - 2.58*sd/(n^.5), mean + 2.58*sd/(n^.5))
}

confidence.interval.99(mean(Home_Away_Difference$`Home Team Margin of Victory`), sd(Home_Away_Difference$`Home Team Margin of Victory`), 210876)
## [1] 0.2294199 0.2788599

The MLB-wide average home field advantage is 0.254 runs, with a 99% confidence interval of 0.229 to 0.279. Because the entire interval lies above zero, this advantage is unlikely to be an artifact of chance.

print(boruta.train)
## Boruta performed 99 iterations in 14.45779 secs.
##  12 attributes confirmed important: `Home Team At-Bats`, `Home
## Team Errors`, `Home Team Hits`, `Home Team Homeruns`, `Home Team
## Intentional Walks` and 7 more.
##  9 attributes confirmed unimportant: `Home Team Assists`, `Home
## Team Double Plays`, `Home Team Doubles`, `Home Team Left on Base`,
## `Visiting Team Assists` and 4 more.
##  3 tentative attributes left: `Home Team Stolen Bases`, `Home Team
## Strikeouts`, `Visiting Team Stolen Bases`.

This visualization was created using the Boruta package, and it is a graphical depiction of the features that matter for predicting a home-team win. We chose this schematic because it gives us a very clear idea of which features to incorporate in our NaiveBayes prediction models. The blue box-and-whisker plots are randomly generated "shadow" predictors used as a basis of comparison, red box plots mark features deemed not useful, and green box plots mark useful features. Yellow box plots represent features that Boruta left undecided as to whether they help predict a home-team win.

How does Boruta work? Loosely, it is based on the idea that by deliberately adding randomness to the data and then collecting results, one can account for the misleading impact of chance during feature selection. Boruta fits a random forest classifier, which provides an intrinsic importance measure for each feature, and compares those importances against randomly permuted "shadow" copies of the variables; a feature is confirmed only if it scores reliably higher than the random variables (seen here in blue).

https://www.r-bloggers.com/feature-selection-all-relevant-selection-with-the-boruta-package/
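The exact call that produced boruta.train is not echoed in this report; a minimal sketch of how such a run could be set up is below, assuming the training data frame data.no.ties.25.years.train used in the NaiveBayes model that follows and Boruta's default settings.

# Sketch: Boruta feature selection for predicting a home-team win; the
# predictor set (everything else in the data frame) is an assumption.
library(Boruta)

set.seed(123)
boruta.train <- Boruta(`Home Team Win?` ~ ., data = data.no.ties.25.years.train,
                       maxRuns = 100, doTrace = 0)
print(boruta.train)
plot(boruta.train, las = 2, cex.axis = 0.6)  # green/red/yellow/blue box plots described above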

nb_class_best_no_cheating <- naiveBayes(
  `Home Team Win?` ~ `Home Team Hits` + `Visiting Team Hits` +
    `Home Team Homeruns` + `Visiting Team Homeruns` +
    `Home Team Intentional Walks` + `Visiting Team Walks` +
    `Home Team Strikeouts` + `Home Team At-Bats` +
    `Visiting Team Errors` + `Visiting Team Double Plays` +
    `Visiting Team Assists` + `Visiting Team Strikeouts`,
  data = data.no.ties.25.years.train)
## [1] 0.8331477
## [1] 0.8322907
## [1] 0.8428449
## [1] 0.8286204
## [1] 0.8419745

Using NaiveBayes with the 12 features listed above, we predicted whether the home team won. With 5-fold cross-validation, every fold's test accuracy fell between roughly 82.9% and 84.3%.
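The five accuracies above come from a fold-by-fold evaluation; a minimal sketch of what that 5-fold loop could look like is below, with the fold assignment, seed, and helper names as illustrative assumptions rather than the exact code used.

# Sketch: 5-fold cross-validation of the NaiveBayes win classifier.
library(e1071)

set.seed(123)
folds <- sample(rep(1:5, length.out = nrow(data.no.ties.25.years.train)))

for (k in 1:5) {
  train.k <- data.no.ties.25.years.train[folds != k, ]
  test.k  <- data.no.ties.25.years.train[folds == k, ]
  nb.k <- naiveBayes(`Home Team Win?` ~ `Home Team Hits` + `Visiting Team Hits` +
                       `Home Team Homeruns` + `Visiting Team Homeruns` +
                       `Home Team Intentional Walks` + `Visiting Team Walks` +
                       `Home Team Strikeouts` + `Home Team At-Bats` +
                       `Visiting Team Errors` + `Visiting Team Double Plays` +
                       `Visiting Team Assists` + `Visiting Team Strikeouts`,
                     data = train.k)
  preds <- predict(nb.k, test.k)
  print(mean(preds == test.k$`Home Team Win?`))  # per-fold accuracy
}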

## Boruta performed 99 iterations in 30.23867 secs.
##  6 attributes confirmed important: `Home Team At-Bats`, `Home Team
## Doubles`, `Home Team Hits`, `Home Team Homeruns`, `Home Team
## Walks` and 1 more.
##  3 attributes confirmed unimportant: `Home Team Intentional
## Walks`, `Home Team Stolen Bases`, `Visiting Team Double Plays`.
##  3 tentative attributes left: `Home Team Strikeouts`, `Home Team
## Triples`, `Visiting Team Errors`.

This visualization was also created using the Boruta package; it is a graphical depiction of the features that matter for predicting the home team's score. We chose this schematic because it gives us a very clear idea of which features to incorporate in our NaiveBayes score-prediction model. As before, blue box-and-whisker plots are the randomly generated shadow predictors used as a basis of comparison, red box plots mark features deemed not useful, green box plots mark useful features, and yellow box plots mark features left undecided.

nb_class_runs_segmented_best <- naiveBayes(
  `Home Team Score` ~ `Home Team Hits` + `Home Team Homeruns` +
    `Home Team Walks` + `Home Team Doubles` + `Visiting Team Errors`,
  data = data.runs.segmented.25.train)
## [1] 0.6645244
## [1] 0.6750364
## [1] 0.6657241
## [1] 0.6687232
## [1] 0.6774083

Using NaiveBayes with the 5 features listed above, we predicted the range into which the home team's score fell (0-2, 3-6, 7-9, or 10+ runs). With 5-fold cross-validation, every fold's test accuracy fell between roughly 66.5% and 67.7%.
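The segmented response implies the raw home-team run totals were binned beforehand; a minimal sketch of that binning with cut() follows, where `Home Team Runs` is a hypothetical name for the raw run-total column.

# Sketch: bin the home-team score into the ranges used by the classifier.
data.runs.segmented.25.train$`Home Team Score` <- cut(
  data.runs.segmented.25.train$`Home Team Runs`,  # assumed raw run-total column
  breaks = c(-Inf, 2, 6, 9, Inf),
  labels = c("0-2", "3-6", "7-9", "10+"))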

##      Batting Average Slugging Percentage       OBP       OPS
## [1,]       0.3433703           0.7550007 0.4918746 0.7752924

This visualization is a parallel coordinate plot of offensive data from the 2015 season. Our intention was to find any trends linking common offensive statistics to a team's classification as a high-, average-, or low-scoring team. Moving from left to right along the x-axis are the measures we considered important to the analysis: batting average, slugging percentage, on-base percentage, and OPS (on-base percentage plus slugging). Each line represents a 2015 team, colored by scoring level: low scoring (red), average scoring (black), and high scoring (green). Focusing on the green band, we see that high-scoring teams consistently post high values for batting average, slugging percentage, on-base percentage, and OPS. Conversely, teams that sit in the lower percentiles of these measures (red) score less on average. This visualization is useful for determining which offensive aspects a team should improve to raise its scoring.
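For reference, a minimal sketch of how such a parallel coordinate plot could be drawn with MASS::parcoord follows; offense.2015 and its Scoring.Level column are hypothetical names standing in for the 2015 team-offense data used here.

# Sketch: one line per team, colored by scoring level (red/black/green).
library(MASS)

cols <- c(low = "red", average = "black", high = "green")
parcoord(offense.2015[, c("Batting Average", "Slugging Percentage", "OBP", "OPS")],
         col = cols[as.character(offense.2015$Scoring.Level)],
         var.label = TRUE)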

This visualization shows the average change in home attendance that each team produces when it plays on the road. The y-axis gives the percent change, computed from historical fan-attendance data for each team's home games. Comparing these values across teams shows that some visiting opponents raise your home attendance while others lower it. For example, hosting the Boston Red Sox increases a team's home attendance by 8% on average, while hosting the Tampa Bay Rays decreases it by 5%.
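A minimal sketch of how the opponent-driven percent change could be computed follows; attendance.games and its column names are illustrative assumptions about the attendance data, not the exact objects used in this analysis.

# Sketch: average percent change in home attendance, grouped by visiting opponent.
library(dplyr)

attendance.change <- attendance.games %>%
  group_by(`Home Team`) %>%
  mutate(baseline = mean(Attendance, na.rm = TRUE)) %>%   # each team's typical home crowd
  group_by(`Visiting Team`) %>%
  summarise(`Pct Change` = mean((Attendance - baseline) / baseline, na.rm = TRUE) * 100) %>%
  arrange(desc(`Pct Change`))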

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at “www.retrosheet.org”.