Introduction to the data

This document analyzes player statistics from the 2021 MLB season. The data covers the regular season only and includes all position players who qualified, meaning they had at least 3.1 plate appearances per team game played. More information and data can be found here.

Details of the Data

The data for this analysis comes from Major League Baseball for the 2021 season. It contains 20 variables (described below) and 132 rows. Each row is the record of an individual player who had at least 3.1 plate appearances per team game played in 2021.
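To make the qualification rule concrete, here is a minimal R sketch of the threshold. It assumes the standard 162-game schedule, for which the usual qualifying mark is 502 plate appearances (3.1 × 162, rounded to a whole number). The helper `qualifies()` and its argument names are hypothetical and are not part of the original analysis.

```r
# Minimum plate appearances (PA) needed to qualify.
# Assumption: each team plays a 162-game schedule.
pa_per_team_game <- 3.1
team_games <- 162
min_pa <- round(pa_per_team_game * team_games)  # 502

# Hypothetical helper: TRUE when a PA total meets the threshold.
qualifies <- function(pa, team_games = 162, pa_per_team_game = 3.1) {
  pa >= round(pa_per_team_game * team_games)
}

qualifies(c(450, 502, 650))  # FALSE TRUE TRUE
```

Note that plate appearances are not one of the 20 variables in this data set, so this check would need a PA column from another source.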


Data Dictionary

Variable Type Explanation
Player Character Full name of the player
Position Character Main position for the player
Team Character Last team the player played for in 2021
League Character League the player last played in 2021
G Numeric Number of games the player played in 2021
AB Numeric Number of at bats the player had in 2021
R Numeric Number of runs the player scored in 2021
H Numeric Number of hits the player had in 2021
2B Numeric Number of doubles the player had in 2021
3B Numeric Number of triples the player had in 2021
HR Numeric Number of home runs the player had in 2021
RBI Numeric Number of runs batted in from the 2021 season
BB Numeric Number of walks the player had in 2021
SO Numeric Number of strikeouts the player had in 2021
SB Numeric Number of stolen bases the player had in 2021
CS Numeric Number of times the player was caught stealing in 2021
AVG Numeric The player’s batting average for the 2021 season
OBP Numeric The player’s on-base percentage for the 2021 season
SLG Numeric The player’s slugging percentage for the 2021 season
OPS Numeric The player’s on-base percentage plus slugging percentage for the 2021 season
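
For reference, the four rate statistics at the bottom of the dictionary are conventionally defined as shown here, where 1B denotes singles (H minus 2B, 3B, and HR). HBP (hit by pitch) and SF (sacrifice flies) are not among the 20 variables listed above, so OBP could not be recomputed from this data set alone.

\[AVG = \frac{H}{AB}, \qquad OBP = \frac{H + BB + HBP}{AB + BB + HBP + SF}\]

\[SLG = \frac{\text{1B} + 2 \cdot \text{2B} + 3 \cdot \text{3B} + 4 \cdot \text{HR}}{AB}, \qquad OPS = OBP + SLG\]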


Top 10 batting averages

This table shows the top 10 batting averages from the 2021 season.

Player Team Position AVG
Trea Turner LAD 2B 0.328
Yuli Gurriel HOU 1B 0.319
Juan Soto WSH RF 0.313
Vladimir Guerrero Jr. TOR 1B 0.311
Michael Brantley HOU LF 0.311
Starling Marte OAK CF 0.310
Bryce Harper PHI RF 0.309
Nick Castellanos CIN RF 0.309
Tim Anderson CWS SS 0.309
Adam Frazier SD 2B 0.305
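
A ranking table like this one could be produced along the following lines. This is a minimal base R sketch, not the code used for the report. The data frame name `mlb` matches the regression output later in this document and the column names follow the data dictionary, but the five-row excerpt below is copied from the table above rather than loaded from the full data, and `top_n_by()` is a hypothetical helper.

```r
# Small excerpt built from rows of the table above (not the full 132-row data).
mlb <- data.frame(
  Player   = c("Bryce Harper", "Trea Turner", "Juan Soto",
               "Adam Frazier", "Yuli Gurriel"),
  Team     = c("PHI", "LAD", "WSH", "SD", "HOU"),
  Position = c("RF", "2B", "RF", "2B", "1B"),
  AVG      = c(0.309, 0.328, 0.313, 0.305, 0.319),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n rows with the largest values of `column`.
top_n_by <- function(data, column, n = 10) {
  ranked <- data[order(data[[column]], decreasing = TRUE), ]
  head(ranked, n)
}

top3 <- top_n_by(mlb, "AVG", n = 3)
print(top3[, c("Player", "Team", "Position", "AVG")], row.names = FALSE)
```

The same helper would work for the other rankings by passing "SLG", "HR", "2B", or "3B" as the column. Ties at the cutoff are not handled specially, so a player tied with the last listed row may be left out.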

Top 10 slugging percentage

This table shows the top 10 players ranked by slugging percentage in the 2021 season.

Player Team Position SLG
Bryce Harper PHI RF 0.615
Fernando Tatis SD SS 0.611
Vladimir Guerrero Jr. TOR 1B 0.601
Shohei Ohtani LAA DH 0.592
Nick Castellanos CIN RF 0.576
Joey Votto CIN 1B 0.563
Tyler O’Neill STL LF 0.560
Kyle Tucker HOU RF 0.557
Aaron Judge NYY RF 0.544
Salvador Perez KC C 0.544

Top 10 home run hitting players

This table shows the top 10 players ranked by home runs in the 2021 season.

Player Team Position HR
Vladimir Guerrero Jr. TOR 1B 48
Salvador Perez KC C 48
Shohei Ohtani LAA DH 46
Marcus Semien TOR 2B 45
Fernando Tatis SD SS 42
Aaron Judge NYY RF 39
Matt Olson OAK 1B 39
Brandon Lowe TB 2B 39
Mitch Haniger SEA RF 39
Rafael Devers BOS 3B 38

Top 10 doubles hitting players

This table shows the top 10 players ranked by doubles in the 2021 season.

Player Team Position 2B
Bryce Harper PHI RF 42
J.D. Martinez BOS DH 42
Jeimer Candelario DET 3B 42
Whit Merrifield KC 2B 42
Tommy Edman STL 2B 41
Ozzie Albies ATL 2B 40
Marcus Semien TOR 2B 39
Nick Castellanos CIN RF 38
Kyle Tucker HOU RF 37
Rafael Devers BOS 3B 37

Top 10 triples hitting players

This table shows the top 10 players ranked by triples in the 2021 season.

Player Team Position 3B
Shohei Ohtani LAA DH 8
Bryan Reynolds PIT CF 8
David Peralta ARI LF 8
Jake Cronenworth SD 2B 7
Ozzie Albies ATL 2B 7
Nicky Lopez KC SS 6
Amed Rosario CLE SS 6
Hunter Dozier KC RF 6
Jose Ramirez CLE 3B 5
Cedric Mullins BAL CF 5


Descriptive Analysis

For this portion of the analysis, I present five visualizations or tables that explore how the league a player plays in (American or National) affects how well the player performs over the season. It is important to remember that the data only includes qualified players, meaning those with at least 3.1 plate appearances per team game.

How does the average slugging percentage change by league?

Based on the visual here, the National League has a higher average slugging percentage among its players. Although the margin is very slim, it is worth keeping in mind as we look at more statistics throughout this portion of the analysis.

How does the batting average change by league?

Based on the table below, the mean batting average for players in the American League is slightly higher than for players in the National League. The gap is only about 0.0006, so it is hard to conclude from this table which league is more difficult for hitters.

League batting_average
American 0.2649589
National 0.2643559
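
A per-league mean like the one in this table could be computed as sketched here in base R. The six rows are an excerpt taken from the top 10 batting average table, with League filled in from each player's team (an assumption on my part, since that table does not show the League column), so the output will not match the values above.

```r
# Excerpt only: six players from the top 10 batting average table.
# League values are inferred from each player's team.
mlb <- data.frame(
  Player = c("Trea Turner", "Yuli Gurriel", "Juan Soto",
             "Vladimir Guerrero Jr.", "Michael Brantley", "Bryce Harper"),
  League = c("National", "American", "National",
             "American", "American", "National"),
  AVG    = c(0.328, 0.319, 0.313, 0.311, 0.311, 0.309)
)

# Unweighted mean of the player-level AVG values within each league.
league_avg <- aggregate(AVG ~ League, data = mlb, FUN = mean)
names(league_avg)[2] <- "batting_average"
league_avg
```

This is a mean of player averages, so every qualified player counts equally regardless of at-bats. A league-wide batting average would instead divide total hits by total at-bats, and the two numbers can differ slightly.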

How does the number of home runs hit change by league?

Based on this chart, the American League hit considerably more home runs than the National League did.

How does the number of doubles hit change by league?

Once again based on the chart, the American League hit considerably more doubles than the National League. I was expecting the reverse: if field sizes played a role in the American League’s higher home run total, I thought the National League would have hit more doubles. That was not the case.

How does the number of triples hit change by league?

As you can see from the table below, the National League hit more triples than the American League in the 2021 season. This is the pattern I had expected for doubles. After being wrong there, I did not expect to see it for triples.

How many players qualified for the data for each league?

After doing these analyses, I suspected that the comparisons could be unbalanced because of the number of players who qualified in each league. That suspicion was correct: as the table below shows, 14 more players qualified from the American League than from the National League. League totals for home runs, doubles, and triples are therefore not directly comparable, and per-player averages would give a fairer comparison.

League n
American 73
National 59
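
Because the two groups differ in size, one way to compare them is to put counts, totals, and per-player averages side by side. The sketch below uses only the ten rows of the top 10 home run table earlier in this report, with League inferred from each player's team (my assumption). It illustrates the calculation only and is not representative of the full 132-player data.

```r
# Excerpt only: the ten rows of the top 10 home run table.
# League values are inferred from each player's team.
hr <- data.frame(
  Player = c("Vladimir Guerrero Jr.", "Salvador Perez", "Shohei Ohtani",
             "Marcus Semien", "Fernando Tatis", "Aaron Judge", "Matt Olson",
             "Brandon Lowe", "Mitch Haniger", "Rafael Devers"),
  League = c("American", "American", "American", "American", "National",
             "American", "American", "American", "American", "American"),
  HR     = c(48, 48, 46, 45, 42, 39, 39, 39, 39, 38)
)

# Players and total home runs per league.
n_players <- tapply(hr$HR, hr$League, length)
total_hr  <- tapply(hr$HR, hr$League, sum)

# Totals depend on group size; the per-player column does not.
league_summary <- data.frame(
  League        = names(n_players),
  n             = as.vector(n_players),
  total_HR      = as.vector(total_hr),
  HR_per_player = as.vector(total_hr / n_players)
)
league_summary
```

In this excerpt the American League total is about nine times the National League total, yet the per-player figures are nearly equal, which shows how much group size alone can drive a difference in totals.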


Secondary Data Source

For this portion of my analysis, I scraped the 1,000 most recent tweets from the MLB Twitter account. I wanted to see which words a professional sports league account uses most often and what it generally tweets about.

Does MLB’s Twitter use more positive or negative words?

As I expected, MLB’s Twitter account uses mostly positive words, since it is a professional sports league account and needs to maintain a good image of the league. The one word that came up as negative in the latest 1,000 tweets was “Dusty”. This is the first name of Dusty Baker, a manager who had just reached the milestone of 2,000 wins, so it is not a negative word in this context.

Word cloud of the most used words by the MLB Twitter account

For this visual, I wanted to see which words the MLB Twitter account uses most often. The most used word was “MLB”, which makes sense as it is their own account. Next came “Game”, “Tonight”, “Hit”, and “Jackie”. Most of these make sense, since the account promotes games happening that night, which is what I would expect from the MLB account. The one that initially surprised me was “Jackie”, but further down the word count list I saw “Robinson” and realized that the tweets were about Jackie Robinson Day.
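
The word counts behind a word cloud can be sketched in base R as follows. The three strings are made-up placeholders standing in for scraped tweets, not real tweets, and the short stop-word list is only illustrative. The original analysis may well have used a text-mining package instead.

```r
# Made-up placeholder strings standing in for scraped tweets (not real tweets).
tweets <- c(
  "Game tonight at 7",
  "What a hit in the game tonight",
  "Jackie Robinson Day tonight"
)

# Lowercase, split on anything that is not a letter, and drop empty tokens.
words <- unlist(strsplit(tolower(tweets), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Remove a few common filler words (a tiny illustrative stop-word list).
stop_words <- c("a", "at", "in", "the", "what")
words <- words[!(words %in% stop_words)]

# Count each word and sort from most to least frequent.
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 5)
```

Looking at the sorted counts rather than only the cloud is what makes pairs such as “Jackie” and “Robinson” easy to spot, since related words tend to appear with similar frequencies.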


Predictive Analysis

In this predictive analysis, I attempt to predict a player’s slugging percentage from the number of doubles, triples, and home runs the player hit.

The general regression equation we begin with is: \[SLG_i = \alpha + \beta_1 \, doubles_i + \beta_2 \, triples_i + \beta_3 \, homeruns_i + \varepsilon_i\]

## 
## Call:
## lm(formula = SLG ~ `2B` + `3B` + HR, data = mlb)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.079420 -0.015927 -0.000336  0.013638  0.068512 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2611246  0.0119632  21.827  < 2e-16 ***
## `2B`        0.0026164  0.0003852   6.792 3.73e-10 ***
## `3B`        0.0021719  0.0013846   1.569    0.119    
## HR          0.0050928  0.0002357  21.610  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02743 on 128 degrees of freedom
## Multiple R-squared:  0.8143, Adjusted R-squared:  0.8099 
## F-statistic: 187.1 on 3 and 128 DF,  p-value: < 2.2e-16
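
To show how the fitted coefficients translate into a prediction, the sketch below plugs season totals into the estimated equation. The coefficients are copied from the output above. `predict_slg()` is a hypothetical helper, and the stat line of 30 doubles, 3 triples, and 25 home runs is an illustrative input, not a real player.

```r
# Coefficients copied from the regression summary output.
coefs <- c(intercept = 0.2611246, doubles = 0.0026164,
           triples = 0.0021719, home_runs = 0.0050928)

# Hypothetical helper: fitted slugging percentage for given season totals.
predict_slg <- function(doubles, triples, home_runs) {
  unname(coefs["intercept"] +
           coefs["doubles"] * doubles +
           coefs["triples"] * triples +
           coefs["home_runs"] * home_runs)
}

# Illustrative stat line (not a real player).
predict_slg(doubles = 30, triples = 3, home_runs = 25)  # about 0.473
```

Read this way, each additional home run is associated with roughly 0.005 more slugging percentage, and each additional double with roughly 0.0026, holding the other counts fixed. With the fitted model object available, R's `predict()` would give the same kind of result directly.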

Looking at the results of the linear regression model, I was surprised to see that triples is the only variable that is not significant (p = 0.119). The two significant variables are doubles and home runs. I was expecting all three variables to be significant, based on my prior knowledge of how slugging percentage is calculated. The last thing I want to address is the adjusted R-squared of 0.8099, which means the variables in the model account for about 81% of the variation in slugging percentage. Overall, this model does a good job of predicting the slugging percentage a player may end the season with.
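
One possible reading of these results comes from the definition of slugging percentage, which is total bases divided by at-bats. Written with the variables in this data set:

\[SLG_i = \frac{H_i + \text{2B}_i + 2 \cdot \text{3B}_i + 3 \cdot \text{HR}_i}{AB_i} = AVG_i + \frac{\text{2B}_i + 2 \cdot \text{3B}_i + 3 \cdot \text{HR}_i}{AB_i}\]

The model uses raw counts of doubles, triples, and home runs and leaves out hits and at-bats, so some unexplained variation is to be expected. Triples are also rare (the league leaders in the table above had only 8), which may leave too little variation to estimate their effect precisely. Both points are interpretations rather than results tested in this analysis.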