Introduction

The 2021 Major League Baseball regular season returned to normal after the ugly 2020 season, better known as the “COVID season.” That season saw all kinds of rule changes due to COVID-19: teams played only 60 games, and over half the league qualified for the playoffs. Fast forward a year and the 2021 season has just ended, with the Atlanta Braves winning the World Series! I wanted to see how the Braves pulled that off despite opening the season at +1000 odds to win it all (a 100 dollar bet at the beginning of the season would have won you 1,000 dollars). I also wanted to see how a team like the San Francisco Giants could lead the league in wins despite having the 9th worst odds to win the World Series! How can the wise guys in Vegas be so far off with their preseason odds? What were teams like the Braves and Giants doing during the regular season to win so many games? Is there a model that can explain what the good teams did right? This analysis also scrapes reviews from the site “Stadium Dude” to put together a sentiment analysis of his reviews across all 30 MLB stadiums.

At the end I run a regression on the sentiment scores to see whether they could predict wins.

To understand what made teams like the Braves, Giants, Astros, and the other playoff qualifiers so successful, I gathered team-level data from the popular site Baseball Reference. The data includes basic and advanced statistics used in modern-day baseball. My goal was to analyze the data and target specific variables that contributed to a team’s success. At the end of this analysis is a predictive model that I put together using specific variables to predict several outcomes: runs scored, runs allowed, and most importantly wins.

Below is a glossary of statistics that will be helpful to lean on throughout this analysis.

Pitching Glossary

ERA  Earned Run Average. Earned runs allowed per nine innings pitched.
WHIP  Walks plus hits per inning pitched: (Walks + Hits) / Innings Pitched.
R  Total runs allowed.
LOB  Left on base. Number of runners stranded on base when an inning ended.
IP  Innings pitched.
BB  Walks allowed.
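As a quick illustration of the two rate statistics above, here is a minimal Python sketch (the analysis itself was done in R). It assumes the standard definitions of ERA and WHIP; the function names and the season totals are hypothetical and are not taken from the data set.

```python
def era(earned_runs: float, innings_pitched: float) -> float:
    """Earned Run Average: earned runs allowed, scaled to nine innings."""
    return 9 * earned_runs / innings_pitched


def whip(walks: float, hits: float, innings_pitched: float) -> float:
    """Walks plus hits allowed per inning pitched."""
    return (walks + hits) / innings_pitched


# Hypothetical season totals for one team (illustrative only).
# Innings pitched is assumed to be a plain decimal number here.
team = {"ER": 640, "BB": 500, "H": 1300, "IP": 1440}

team_era = era(team["ER"], team["IP"])
team_whip = whip(team["BB"], team["H"], team["IP"])
print(f"ERA  = {team_era:.2f}")
print(f"WHIP = {team_whip:.2f}")
```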

Hitting Glossary

H  Total hits: singles, doubles, triples, and home runs.
HR  Home runs.
OBP  On-base percentage. How often a batter reaches base per plate appearance.
SLG  Slugging percentage: (1B + 2 x 2B + 3 x 3B + 4 x HR) / AB.
OPS  On-base percentage plus slugging percentage.
GDP  Grounded into double play. Number of times the hitting team made two outs on one ground ball.
AB  At-bats. Number of times a batter came up to hit, not counting walks, hit-by-pitches, or sacrifices.
BA  Batting average. Share of at-bats that end in a hit (H / AB).
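To make the hitting formulas concrete, here is a small Python sketch of batting average, slugging percentage, and OPS as defined in the glossary. The hit counts and the OBP value are made up for illustration, and the function names are my own.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Hits divided by at-bats."""
    return hits / at_bats


def slugging(singles: int, doubles: int, triples: int,
             home_runs: int, at_bats: int) -> float:
    """Total bases divided by at-bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def ops(obp: float, slg: float) -> float:
    """On-base percentage plus slugging percentage."""
    return obp + slg


# Hypothetical team totals (illustrative only).
counts = {"1B": 900, "2B": 270, "3B": 20, "HR": 210, "AB": 5400}
hits = counts["1B"] + counts["2B"] + counts["3B"] + counts["HR"]

ba = batting_average(hits, counts["AB"])
slg = slugging(counts["1B"], counts["2B"], counts["3B"],
               counts["HR"], counts["AB"])
team_ops = ops(0.320, slg)  # 0.320 is an assumed OBP

print(f"BA  = {ba:.3f}")
print(f"SLG = {slg:.3f}")
print(f"OPS = {team_ops:.3f}")
```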

Before I dive into specific variables and build a model to help predict runs scored, runs allowed, and wins, here is a quick summary table highlighting the hitting and pitching data sets on a team level:

Hitting Summary

Pitching Summary

Descriptive Analysis

Hitting Descriptives

Let’s take a look at some of the most popular baseball statistics on a team level, across all 30 MLB teams. Let’s look at two very popular hitting statistics first: batting average and runs.

This graph shows a positive relationship between batting average and runs, with a correlation of 0.69 between the two variables. Most teams that made the playoffs scored a lot of runs. The Blue Jays (TOR) are an exception in the other direction: they scored plenty of runs but missed the playoffs. The Nationals (WSN) also stand out, since their high team batting average didn’t translate into many runs, which is odd given the overall correlation between the two.
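For readers who want to see how a correlation like this is computed, here is a minimal Python sketch of the Pearson correlation coefficient. The batting averages and run totals below are invented for illustration, so the result will not match the 0.69 reported for the real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical team batting averages and runs scored (not real 2021 data).
batting_avg = [0.232, 0.240, 0.244, 0.250, 0.258, 0.264]
runs = [640, 700, 690, 760, 740, 820]

r = pearson(batting_avg, runs)
print(f"correlation = {r:.2f}")
```

A value near 1 means the two statistics rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.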

Let’s look at two other variables: OBP and Wins.

The correlation between OBP and Wins is moderate at 0.59. The teams that made the playoffs sit mostly toward the right of the chart (higher OBP).

Now let’s look at the relationship between OBP and Batting Average.

I found the above graph very interesting. There is a positive relationship between batting average and OBP (correlation of 0.77). This makes sense because a hit raises both batting average and OBP. However, these statistics do not dictate a team’s playoff chances as strongly as I originally thought they would. Teams like the Washington Nationals (WSN) and the Toronto Blue Jays (TOR) really stick out. Both had a very high batting average and OBP but failed to make the playoffs. In fact, the Nationals had the 5th worst record in MLB, yet only one team (Houston) got on base more often.

Now, let’s take a break from specific stats and look at how many Wins it took for a team to make the Playoffs.

The above table lists the National League teams by wins, from most to fewest. The top 5 teams made the playoffs. The Braves took the last playoff spot with 88 wins, although 84 wins would have been enough in the National League. Oddly enough the Braves, who had the fewest wins of any playoff team, won the World Series.

Below is the table for the American League. We see that it took a lot more wins to make the playoffs in this league. The Blue Jays and Mariners won 91 and 90 games, respectively, but failed to make the playoffs.

Let’s now look at whether there was any difference between the National and American League in the hitting categories mentioned above.

The box plot above shows that the American League had a wider range of team on-base percentages from top to bottom, while the National League teams were bunched closer together.

Pitching Descriptives

Now let’s look at pitching statistics on a team level, starting with the two most commonly cited measurements, WHIP and ERA.

This one is easy! The lower the WHIP, the lower the ERA. This makes sense: the fewer batters a pitcher puts on base, the lower the likelihood of the other team scoring. It is clear that the teams that made the playoffs all had very low ERAs and WHIPs.

What about strikeouts? Do teams that strike out a ton of batters win more games? Is there any correlation between these two variables?

This is another interesting finding. Yes, the 4 teams that struck out the most opposing batters made the playoffs and won a lot of games. However, several teams won a lot of games without striking out many batters. Look at STL: every other team in MLB struck out more batters, yet STL won 90 games and made the playoffs! SFG won the most games in MLB, yet their pitchers land right in the middle of the pack in strikeouts. Putting up visuals and viewing the relationships between the different variables helped me establish which variables played a role, and maybe even more importantly which didn’t, in predicting my target variables, runs and wins.

Finally let’s look at one last measurement. Let’s compare a team’s run differential to wins.

For me this is the most fascinating visual so far. There is a clear relationship between a team’s wins and its run differential. At the end I put together a regression model predicting wins for an MLB team, and the relationship between these two variables helped me build that model.
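Here is a rough Python sketch of the idea behind that visual: compute each team’s run differential (runs scored minus runs allowed) and fit a straight line to wins. The team rows are hypothetical, and `fit_line` is a helper written for this example rather than anything from the original R code.

```python
def fit_line(xs, ys):
    """Simple least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical (team, runs scored, runs allowed, wins) rows, not real data.
season = [
    ("AAA", 850, 650, 101),
    ("BBB", 780, 700, 90),
    ("CCC", 730, 730, 81),
    ("DDD", 690, 760, 74),
    ("EEE", 640, 820, 62),
]

run_diff = [row[1] - row[2] for row in season]
wins = [row[3] for row in season]

slope, intercept = fit_line(run_diff, wins)
print(f"wins = {intercept:.1f} + {slope:.3f} * run_differential")
```

A positive slope says that each extra run of differential is associated with a fraction of an extra win.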

Ballpark Ratings

Let’s throw in a variable that is entirely subjective to see if there is any correlation between a team’s ballpark review and its results on the field. I am a big fan of reading “Stadium Dude’s” reviews of stadiums. This guy has seen every MLB stadium multiple times. Here is a table showing the rating system he used to give an overall ballpark rating score. There are ratings for the Stadium, Food, Beer, Neighborhood, Price, Accessibility, Weather, and Vibe, plus the total score.

Along with the ratings, he also wrote a review of each team’s stadium. I thought it would be really cool to perform a sentiment analysis to see if there is any relationship between a team’s success and the sentiment score for its ballpark.

First, I ran a sentiment analysis on each ballpark review. The sentiment score is the number of positive words minus the number of negative words in each review.
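The scoring rule can be sketched in a few lines of Python. The two word lists here are tiny hypothetical stand-ins for a real sentiment lexicon, and the review text is invented, so treat this as an illustration of the positive-minus-negative count rather than the exact procedure used for the real reviews.

```python
import re

# Tiny hypothetical lexicon; a real analysis would use a published word list.
POSITIVE = {"great", "beautiful", "friendly", "fun", "love"}
NEGATIVE = {"bad", "expensive", "ugly", "crowded", "boring"}


def sentiment_score(review: str) -> int:
    """Count of positive words minus count of negative words."""
    words = re.findall(r"[a-z']+", review.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return positive - negative


# Invented review text (illustrative only).
review = "Beautiful park and friendly staff, but the beer is expensive."
score = sentiment_score(review)
print(score)
```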

Now let’s add the sentiment scores to the statistics for each team and see if there is any correlation between our target variable, Wins, and the sentiment score or the ratings for each category.

The above chart shows the correlations between variables. I want to focus on the target variable, wins, which is at the bottom right of the chart. The variables Price and Sentiment (Positive - Negative) have the highest correlation with wins.

Hitting Regression

Predicting Runs Scored from Team Hitting Stats

Multiple regression is one of the best tools for predicting the number of runs a team scores.

The multiple regression equation that I fit in RStudio is:

\[\text{RunsScored} = -0.001307 + 1889.4518\,\text{OPS} - 0.7575\,\text{GDP} + 0.1390\,\text{AB}\]

Below is the regression model plotted along with the residuals for all 30 MLB teams. Teams that fall above the regression line outperformed the model, meaning they scored more runs than the model predicted. Teams that fall below the regression line perhaps got unlucky, as the model predicted them to score more runs than they actually did. The regression accounts for nearly 92% of the variance in runs scored by MLB teams during the 2021 season.
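To show how such a fit and its residuals can be produced, here is a self-contained Python sketch of ordinary least squares with three predictors. It is not the R code behind the model above: the team rows are hypothetical, `solve`, `fit_ols`, and `predict` are helpers written for this example, and the fitted coefficients will differ from the published equation.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, ys):
    """Least-squares fit; returns [intercept, slope_1, ..., slope_k]."""
    n, k = len(rows), len(rows[0])
    x_means = [sum(r[j] for r in rows) / n for j in range(k)]
    y_mean = sum(ys) / n
    # Centre the data so the intercept drops out of the normal equations.
    xc = [[r[j] - x_means[j] for j in range(k)] for r in rows]
    yc = [y - y_mean for y in ys]
    xtx = [[sum(r[i] * r[j] for r in xc) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(k)]
    slopes = solve(xtx, xty)
    intercept = y_mean - sum(s * m for s, m in zip(slopes, x_means))
    return [intercept] + slopes


def predict(coefs, row):
    """Apply fitted coefficients to one row of predictors."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


# Hypothetical team rows: (OPS, GDP, AB) and runs scored. Not real 2021 data.
teams = ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG", "HHH"]
predictors = [
    (0.700, 120, 5400), (0.715, 105, 5450), (0.730, 115, 5380),
    (0.745, 98, 5500), (0.760, 110, 5520), (0.775, 101, 5470),
    (0.690, 125, 5350), (0.750, 108, 5430),
]
runs = [660, 700, 720, 770, 790, 830, 640, 765]

coefs = fit_ols(predictors, runs)
# Residual = actual minus predicted; positive means more runs than predicted.
residuals = {t: y - predict(coefs, p)
             for t, p, y in zip(teams, predictors, runs)}
for team, res in sorted(residuals.items(), key=lambda kv: -kv[1]):
    label = "outperformed" if res > 0 else "underperformed"
    print(f"{team}: {res:+.1f} runs ({label} the model)")
```

Solving the normal equations by hand keeps the sketch dependency-free; a real analysis would normally lean on a statistics library instead.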

The same approach works for runs allowed, using team pitching stats. Here is a visual showing which teams outperformed and underperformed the regression model for runs allowed:

\[\text{RunsAllowed} = -1442.5067 + 1239.98\,\text{WHIP} - 0.7169\,\text{LOB} + 0.9369\,\text{IP}\]

This model had an R² of 0.96 using just three variables.
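Since R² comes up for every model in this analysis, here is a short Python sketch of how it is computed from actual and predicted values: one minus the ratio of the residual sum of squares to the total sum of squares. The numbers are invented for illustration.

```python
def r_squared(actual, predicted):
    """Share of variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical runs allowed and model predictions (not real 2021 data).
actual = [600, 650, 700, 750, 800]
predicted = [610, 640, 705, 745, 800]

r2 = r_squared(actual, predicted)
print(f"R^2 = {r2:.2f}")
```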

Predicting Wins

After predicting runs and runs allowed I went ahead and created a multiple regression model to predict wins.

This model accounts for 88% of the variance for the target variable, Wins.

\[\text{Wins} = 98.64 + 0.08893\,\text{RunsScored} - 0.10490\,\text{RunsAllowed}\]

The model uses only two variables: runs scored and runs allowed. As expected, runs scored has a positive coefficient while runs allowed has a negative coefficient. Now let’s visualize!

It is interesting to see which teams outperformed the model and which underperformed. The Seattle Mariners (SEA) jump off the chart: they were predicted to win just 76 games but ended up with 90! This was the biggest difference between predicted and actual wins. Below is the data table that shows the difference between predicted wins and actual wins.
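As a rough illustration of how a predicted-versus-actual comparison can be computed, this Python sketch plugs runs scored and runs allowed into the wins equation, using the coefficients exactly as printed above. The two teams and their totals are hypothetical, so these rows are not part of the real table.

```python
# Coefficients copied from the wins equation printed in the text.
INTERCEPT = 98.64
COEF_RUNS_SCORED = 0.08893
COEF_RUNS_ALLOWED = -0.10490


def predict_wins(runs_scored: float, runs_allowed: float) -> float:
    """Predicted wins from the two-variable regression equation."""
    return (INTERCEPT
            + COEF_RUNS_SCORED * runs_scored
            + COEF_RUNS_ALLOWED * runs_allowed)


# Hypothetical teams: (label, runs scored, runs allowed, actual wins).
teams = [("Team A", 800, 700, 95), ("Team B", 700, 750, 84)]

for label, rs, ra, actual in teams:
    predicted = predict_wins(rs, ra)
    diff = actual - predicted  # positive means the team beat the model
    print(f"{label}: predicted {predicted:.1f}, actual {actual}, "
          f"difference {diff:+.1f}")
```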

Final Regression

Ballpark Review - Predicting Wins from Review Emotion

The plot above shows an adjusted R² of just 0.30 with two variables.

## Subset selection object
## Call: regsubsets.formula(W ~ ., data = NRC_Numeric, nvmax = 10)
## 11 Variables  (and intercept)
##              Forced in Forced out
## anger            FALSE      FALSE
## anticipation     FALSE      FALSE
## disgust          FALSE      FALSE
## fear             FALSE      FALSE
## joy              FALSE      FALSE
## negative         FALSE      FALSE
## positive         FALSE      FALSE
## sadness          FALSE      FALSE
## surprise         FALSE      FALSE
## trust            FALSE      FALSE
## positivity       FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           anger anticipation disgust fear joy negative positive sadness
## 1  ( 1 )  " "   " "          "*"     " "  " " " "      " "      " "    
## 2  ( 1 )  " "   " "          "*"     " "  " " " "      " "      " "    
## 3  ( 1 )  " "   " "          "*"     " "  " " " "      " "      " "    
## 4  ( 1 )  " "   " "          "*"     " "  " " " "      "*"      "*"    
## 5  ( 1 )  " "   " "          "*"     "*"  " " " "      " "      "*"    
## 6  ( 1 )  " "   " "          "*"     "*"  " " " "      "*"      "*"    
## 7  ( 1 )  " "   " "          "*"     "*"  "*" " "      "*"      "*"    
## 8  ( 1 )  "*"   " "          "*"     "*"  "*" " "      "*"      "*"    
## 9  ( 1 )  "*"   "*"          "*"     "*"  "*" " "      "*"      "*"    
## 10  ( 1 ) "*"   "*"          "*"     "*"  "*" "*"      "*"      "*"    
##           surprise trust positivity
## 1  ( 1 )  " "      " "   " "       
## 2  ( 1 )  "*"      " "   " "       
## 3  ( 1 )  "*"      " "   "*"       
## 4  ( 1 )  "*"      " "   " "       
## 5  ( 1 )  "*"      " "   "*"       
## 6  ( 1 )  "*"      "*"   " "       
## 7  ( 1 )  "*"      "*"   " "       
## 8  ( 1 )  "*"      "*"   " "       
## 9  ( 1 )  "*"      "*"   " "       
## 10  ( 1 ) "*"      "*"   " "
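The output above comes from an exhaustive best-subset search: for each model size, every combination of predictors is fit and the one with the lowest residual sum of squares is kept. The Python sketch below mimics that idea on a tiny hypothetical data set with three emotion counts. It is an illustration of the technique, not a reimplementation of `regsubsets`, and the helper names are my own.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns, ys):
    """Residual sum of squares of a least-squares fit with an intercept."""
    n, k = len(ys), len(columns)
    y_mean = sum(ys) / n
    # Centre predictors and response so the intercept drops out.
    xc = [[v - sum(col) / n for v in col] for col in columns]
    yc = [y - y_mean for y in ys]
    xtx = [[sum(p * q for p, q in zip(xc[i], xc[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * y for p, y in zip(xc[i], yc)) for i in range(k)]
    slopes = solve(xtx, xty)
    fitted = [sum(s * xc[j][r] for j, s in enumerate(slopes))
              for r in range(n)]
    return sum((y - f) ** 2 for y, f in zip(yc, fitted))


def best_subsets(data, ys, max_size):
    """For each subset size, the predictor names with the lowest RSS."""
    names = list(data)
    return {
        size: min(combinations(names, size),
                  key=lambda subset: rss([data[name] for name in subset], ys))
        for size in range(1, max_size + 1)
    }


# Hypothetical emotion counts for eight reviews and the teams' wins.
emotions = {
    "disgust": [2, 5, 1, 7, 3, 6, 4, 8],
    "surprise": [6, 3, 7, 2, 5, 4, 8, 1],
    "joy": [10, 12, 9, 11, 14, 8, 13, 10],
}
wins = [74, 88, 70, 97, 80, 90, 79, 101]

best = best_subsets(emotions, wins, max_size=3)
for size, subset in best.items():
    print(size, subset)
```

Trying every combination is only practical for a small number of predictors, which is the case here.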

Below is the summary for the two variables (disgust and surprise) that we will use to build a weak model.

## 
## Regression Results
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                  W             
## -----------------------------------------------
## disgust                  2.091*** (0.682)      
## surprise                 -1.392** (0.515)      
## Constant                 81.224*** (8.982)     
## -----------------------------------------------
## Observations                    30             
## R2                             0.348           
## Adjusted R2                    0.300           
## Residual Std. Error      12.105 (df = 27)      
## F Statistic            7.217*** (df = 2; 27)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Both disgust and surprise have p-values below the 0.05 significance level.
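The significance stars and the adjusted R² in the table can be checked by hand. This Python sketch divides each estimate by its standard error to get a t statistic, compares it with two-sided critical values of Student's t for 27 degrees of freedom (taken from a standard t table), and recomputes adjusted R² from R². The variable names are my own.

```python
# Estimates and standard errors copied from the regression table above.
coefficients = {"disgust": (2.091, 0.682), "surprise": (-1.392, 0.515)}
N_OBS, N_PREDICTORS, R2 = 30, 2, 0.348

# Two-sided critical values of Student's t with 27 degrees of freedom.
T_CRIT_05 = 2.052  # p < 0.05
T_CRIT_01 = 2.771  # p < 0.01

t_stats = {name: est / se for name, (est, se) in coefficients.items()}
for name, t in t_stats.items():
    if abs(t) > T_CRIT_01:
        level = "p < 0.01"
    elif abs(t) > T_CRIT_05:
        level = "p < 0.05"
    else:
        level = "not significant at 0.05"
    print(f"{name}: t = {t:.3f} ({level})")

# Adjusted R2 penalises R2 for the number of predictors in the model.
adj_r2 = 1 - (1 - R2) * (N_OBS - 1) / (N_OBS - N_PREDICTORS - 1)
print(f"adjusted R2 = {adj_r2:.3f}")
```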

As we can see, the model based on review sentiment is not a good predictor of wins. We will stick with our model that predicts wins from runs allowed and runs scored instead of Stadium Dude’s sentiment scores.