This document analyzes player statistics from the 2021 MLB season. The data covers the regular season only and includes all position players. To qualify for this data, a player must have at least 3.1 plate appearances per team game played. More information and data can be found here.
The data for this analysis comes from Major League Baseball for the 2021 season. It contains 20 variables (described below) and 132 rows. Each row records one player from the 2021 season who had at least 3.1 plate appearances per team game played.
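To make the qualification rule concrete, the sketch below converts 3.1 plate appearances per team game into a season threshold. It assumes a full 162-game schedule and rounds to a whole number of plate appearances; neither detail comes from the dataset itself. The helper `is_qualified` is hypothetical and is not part of the original analysis. The dataset described below has no plate-appearance column, so the `pa` values here are made up.

```r
# Qualification rule from the text: 3.1 plate appearances per team game.
# Assumption: a 162-game schedule, rounded to whole plate appearances.
pa_per_game <- 3.1
team_games <- 162
min_pa <- round(pa_per_game * team_games)

# Hypothetical helper: TRUE when a plate-appearance count meets the threshold.
is_qualified <- function(pa, games = team_games) {
  pa >= round(pa_per_game * games)
}

print(min_pa)
print(is_qualified(c(450, 502, 650)))
```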
| Variables in Dataset | Variable Type | Explanation |
|---|---|---|
| Player | Character | Full name of the player |
| Position | Character | Main position for the player |
| Team | Character | Last team the player played for in 2021 |
| League | Character | League the player last played in 2021 |
| G | Numeric | Number of games the player played in 2021 |
| AB | Numeric | Number of at bats the player had in 2021 |
| R | Numeric | Number of runs the player scored in 2021 |
| H | Numeric | Number of hits the player had in 2021 |
| 2B | Numeric | Number of doubles the player had in 2021 |
| 3B | Numeric | Number of triples the player had in 2021 |
| HR | Numeric | Number of home runs the player had in 2021 |
| RBI | Numeric | Number of runs the player batted in during 2021 |
| BB | Numeric | Number of walks the player had in 2021 |
| SO | Numeric | Number of strikeouts the player had in 2021 |
| SB | Numeric | Number of stolen bases the player had in 2021 |
| CS | Numeric | Number of times the player was caught stealing in 2021 |
| AVG | Numeric | The player's batting average for the 2021 season |
| OBP | Numeric | The player's on-base percentage for the 2021 season |
| SLG | Numeric | The player's slugging percentage for the 2021 season |
| OPS | Numeric | The player's on-base percentage plus slugging percentage for the 2021 season |
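
The rate statistics in the table are derived from the counting statistics. As a minimal sketch, batting average is hits divided by at bats, and slugging percentage is total bases divided by at bats, where a single counts as one base, a double as two, a triple as three, and a home run as four. The function name `batting_rates` and the input numbers are made up for illustration.

```r
# Compute AVG and SLG from counting statistics.
# Singles are hits that were not doubles, triples, or home runs.
batting_rates <- function(ab, h, doubles, triples, hr) {
  singles <- h - doubles - triples - hr
  total_bases <- singles + 2 * doubles + 3 * triples + 4 * hr
  c(AVG = h / ab, SLG = total_bases / ab)
}

# Hypothetical season line, not a real player.
example <- batting_rates(ab = 500, h = 150, doubles = 30, triples = 5, hr = 25)
print(round(example, 3))
```

Because slugging percentage divides by at bats, two players with the same number of extra-base hits can have different values of SLG.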
This table shows the top 10 batting averages from the 2021 season.
| Player | Team | Position | AVG |
|---|---|---|---|
| Trea Turner | LAD | 2B | 0.328 |
| Yuli Gurriel | HOU | 1B | 0.319 |
| Juan Soto | WSH | RF | 0.313 |
| Vladimir Guerrero Jr. | TOR | 1B | 0.311 |
| Michael Brantley | HOU | LF | 0.311 |
| Starling Marte | OAK | CF | 0.310 |
| Bryce Harper | PHI | RF | 0.309 |
| Nick Castellanos | CIN | RF | 0.309 |
| Tim Anderson | CWS | SS | 0.309 |
| Adam Frazier | SD | 2B | 0.305 |
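
A leaderboard like the one above can be produced by sorting on one statistic and keeping the first rows. The sketch below uses base R and a small made-up data frame; `mlb_toy` and `top_n_by` are hypothetical names, and the original analysis may have built its tables differently.

```r
# Toy data with made-up players; the real dataset has 132 rows.
mlb_toy <- data.frame(
  Player = c("Player A", "Player B", "Player C", "Player D"),
  Team = c("AAA", "BBB", "CCC", "DDD"),
  Position = c("1B", "SS", "RF", "C"),
  AVG = c(0.281, 0.305, 0.259, 0.298),
  stringsAsFactors = FALSE
)

# Sort descending on one statistic and keep the top n rows.
top_n_by <- function(data, stat, n = 10) {
  ranked <- data[order(-data[[stat]]), ]
  head(ranked[, c("Player", "Team", "Position", stat)], n)
}

top_avg <- top_n_by(mlb_toy, "AVG", n = 3)
print(top_avg)
```

The same helper works for the other leaderboards by passing a different column name, for example `"HR"` or `"2B"`.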
This table shows the top 10 players ranked by slugging percentage in the 2021 season.
| Player | Team | Position | SLG |
|---|---|---|---|
| Bryce Harper | PHI | RF | 0.615 |
| Fernando Tatis | SD | SS | 0.611 |
| Vladimir Guerrero Jr. | TOR | 1B | 0.601 |
| Shohei Ohtani | LAA | DH | 0.592 |
| Nick Castellanos | CIN | RF | 0.576 |
| Joey Votto | CIN | 1B | 0.563 |
| Tyler O’Neill | STL | LF | 0.560 |
| Kyle Tucker | HOU | RF | 0.557 |
| Aaron Judge | NYY | RF | 0.544 |
| Salvador Perez | KC | C | 0.544 |
This table shows the top 10 players ranked by home runs in the 2021 season.
| Player | Team | Position | HR |
|---|---|---|---|
| Vladimir Guerrero Jr. | TOR | 1B | 48 |
| Salvador Perez | KC | C | 48 |
| Shohei Ohtani | LAA | DH | 46 |
| Marcus Semien | TOR | 2B | 45 |
| Fernando Tatis | SD | SS | 42 |
| Aaron Judge | NYY | RF | 39 |
| Matt Olson | OAK | 1B | 39 |
| Brandon Lowe | TB | 2B | 39 |
| Mitch Haniger | SEA | RF | 39 |
| Rafael Devers | BOS | 3B | 38 |
This table shows the top 10 players ranked by doubles in the 2021 season.
| Player | Team | Position | 2B |
|---|---|---|---|
| Bryce Harper | PHI | RF | 42 |
| J.D. Martinez | BOS | DH | 42 |
| Jeimer Candelario | DET | 3B | 42 |
| Whit Merrifield | KC | 2B | 42 |
| Tommy Edman | STL | 2B | 41 |
| Ozzie Albies | ATL | 2B | 40 |
| Marcus Semien | TOR | 2B | 39 |
| Nick Castellanos | CIN | RF | 38 |
| Kyle Tucker | HOU | RF | 37 |
| Rafael Devers | BOS | 3B | 37 |
This table shows the top 10 players ranked by triples in the 2021 season.
| Player | Team | Position | 3B |
|---|---|---|---|
| Shohei Ohtani | LAA | DH | 8 |
| Bryan Reynolds | PIT | CF | 8 |
| David Peralta | ARI | LF | 8 |
| Jake Cronenworth | SD | 2B | 7 |
| Ozzie Albies | ATL | 2B | 7 |
| Nicky Lopez | KC | SS | 6 |
| Amed Rosario | CLE | SS | 6 |
| Hunter Dozier | KC | RF | 6 |
| Jose Ramirez | CLE | 3B | 5 |
| Cedric Mullins | BAL | CF | 5 |
For this portion of the analysis, I present five separate visualizations or tables exploring how the league a player plays in (American or National) affects how well the player does over the season. It is important to remember that the data only includes qualified players, meaning those with at least 3.1 plate appearances per team game.
Based on the visual here, you can see that the National League has a higher average slugging percentage among its players. The margin is very slim, but it is worth keeping in mind as we look at more statistics in this portion of the analysis.
Based on the table below, you can see that the average batting average for players in the American League is slightly higher than for players in the National League. With a difference this small (about 0.0006), it is hard to conclude which league is more difficult for hitters.
| League | batting_average |
|---|---|
| American | 0.2649589 |
| National | 0.2643559 |
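
A per-league figure like `batting_average` can be computed in more than one way. The sketch below, on made-up data, contrasts the unweighted mean of individual batting averages with a pooled figure (total hits over total at bats). Which one the original table uses is not stated; the column name `batting_average` suggests a mean of the AVG column, but that is an assumption.

```r
# Toy data with made-up values; the real analysis uses the 132-row dataset.
mlb_toy <- data.frame(
  League = c("American", "American", "National", "National"),
  AB = c(500, 600, 560, 450),
  H = c(135, 156, 140, 126)
)
mlb_toy$AVG <- mlb_toy$H / mlb_toy$AB

# Unweighted mean of individual batting averages, one row per league.
mean_of_avgs <- aggregate(AVG ~ League, data = mlb_toy, FUN = mean)
names(mean_of_avgs)[2] <- "batting_average"

# Pooled alternative: total hits divided by total at bats in each league.
pooled <- aggregate(cbind(H, AB) ~ League, data = mlb_toy, FUN = sum)
pooled$pooled_average <- pooled$H / pooled$AB

print(mean_of_avgs)
print(pooled)
```

The unweighted mean gives every player equal weight, while the pooled figure gives more weight to players with more at bats, so the two can rank the leagues differently when the gap is small.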
Based on this chart, you can tell that the American League hit considerably more home runs than the National League did.
Once again, based on the chart, the American League hit considerably more doubles than the National League. I had expected the opposite: if field sizes helped explain why the American League had more home runs, the National League should have hit more doubles. That was not the case.
As you can see from the table below, the National League hit more triples than the American League did in the 2021 season. This is the pattern I had expected for doubles. Since that expectation turned out to be wrong, I did not expect to see it for triples either.
These results made me suspect that the data could be unbalanced in the number of players who qualified from each league. That suspicion was correct: as the table below shows, 14 more players qualified from the American League than from the National League.
| League | n |
|---|---|
| American | 73 |
| National | 59 |
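
With 73 qualified players against 59, the American League group is about 24% larger (73 / 59 ≈ 1.24), so league totals of home runs or doubles favor it before any difference in hitting is considered. One way to adjust, sketched below on made-up data, is to divide each league total by its number of qualified players. The data frame and column names are hypothetical.

```r
# Toy data with made-up values: three American League and two National League players.
mlb_toy <- data.frame(
  League = c("American", "American", "American", "National", "National"),
  HR = c(30, 20, 10, 35, 25)
)

# Count of players and total home runs in each league.
players <- table(mlb_toy$League)
hr_total <- tapply(mlb_toy$HR, mlb_toy$League, sum)

# Per-player rate removes the effect of unequal group sizes.
summary_by_league <- data.frame(
  League = names(hr_total),
  n = as.vector(players),
  HR_total = as.vector(hr_total),
  HR_per_player = as.vector(hr_total) / as.vector(players)
)
print(summary_by_league)
```

In this toy example both leagues hit the same total number of home runs, yet the per-player rate is higher in the smaller group.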
For this portion of my analysis, I scraped the 1,000 most recent tweets from the MLB Twitter account. I wanted to see which words a professional sports league account uses most often and what it generally tweets about.
As I expected, MLB’s Twitter account uses mostly positive words, since a professional sports league account needs to maintain a positive image of its league. The one word that came up as negative in its latest 1,000 tweets was “Dusty”, which is the first name of a coach who had just reached the milestone of 2,000 wins.
For this visual, I wanted to see which words the MLB Twitter account uses most often. The most used word was “MLB”, which makes sense as it is their account. Next came “Game”, “Tonight”, “Hit”, and “Jackie”. Most of these make sense, since the account promotes games happening that night, which is what I would expect from the MLB account. The one that initially surprised me was “Jackie”, but further down the word count list I saw “Robinson” and realized the tweets were about Jackie Robinson Day.
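A word count of this kind can be sketched in base R by lowercasing the text, splitting it into words, dropping common filler words, and tabulating what remains. The tweets and the short stop-word list below are made up, and the original analysis may have used a text-mining package and a fuller stop-word list instead.

```r
# Made-up example tweets; the real analysis used 1,000 scraped tweets.
tweets <- c(
  "Game tonight! Big hit wins the game.",
  "What a hit tonight at the game"
)

# A tiny illustrative stop-word list (an assumption, not a standard lexicon).
stop_words <- c("the", "a", "at", "what")

# Lowercase, split on anything that is not a letter or apostrophe, then filter.
words <- unlist(strsplit(tolower(tweets), "[^a-z']+"))
words <- words[nchar(words) > 0 & !(words %in% stop_words)]

# Tabulate and sort so the most frequent words come first.
word_counts <- sort(table(words), decreasing = TRUE)
print(head(word_counts, 5))
```

Lowercasing before counting means “Game” and “game” are treated as the same word.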
In this predictive analysis, I attempt to predict a player’s slugging percentage from the number of doubles, triples, and home runs the player hit.
The generalized regression equation we begin with is: \[\text{Slugging Percentage}_i = \alpha + \beta_1\,\text{doubles}_i + \beta_2\,\text{triples}_i + \beta_3\,\text{homeruns}_i + \varepsilon_i\]
##
## Call:
## lm(formula = SLG ~ `2B` + `3B` + HR, data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.079420 -0.015927 -0.000336 0.013638 0.068512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2611246 0.0119632 21.827 < 2e-16 ***
## `2B` 0.0026164 0.0003852 6.792 3.73e-10 ***
## `3B` 0.0021719 0.0013846 1.569 0.119
## HR 0.0050928 0.0002357 21.610 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02743 on 128 degrees of freedom
## Multiple R-squared: 0.8143, Adjusted R-squared: 0.8099
## F-statistic: 187.1 on 3 and 128 DF, p-value: < 2.2e-16
After looking at the results of the linear regression model, I was surprised to see that triples is the only variable that is not significant (p = 0.119). The two significant variables are doubles and home runs. I had expected all three variables to be significant, given my previous knowledge of how slugging percentage is calculated. The last thing I wanted to address is the adjusted R-squared of 0.8099, which means that the variables in our model account for about 81% of the variation in slugging percentage. Overall, this model does a good job of predicting the slugging percentage a player may end the season with.
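To show how the fitted model is used, the sketch below plugs the reported coefficients into the regression equation for a made-up player with 30 doubles, 3 triples, and 25 home runs. It also computes an approximate 95% confidence interval for the triples coefficient from its reported estimate and standard error. The function `predict_slg` and the example inputs are hypothetical; with the fitted `lm` object available, `predict()` and `confint()` would be the usual route.

```r
# Coefficients copied from the regression output above.
coefs <- c(intercept = 0.2611246, doubles = 0.0026164,
           triples = 0.0021719, hr = 0.0050928)

# Predicted slugging percentage for given counts of extra-base hits.
predict_slg <- function(doubles, triples, hr) {
  unname(coefs["intercept"] + coefs["doubles"] * doubles +
           coefs["triples"] * triples + coefs["hr"] * hr)
}

# Hypothetical player: 30 doubles, 3 triples, 25 home runs.
pred <- predict_slg(doubles = 30, triples = 3, hr = 25)
print(round(pred, 3))

# Approximate 95% interval for the triples coefficient (128 residual df).
triples_se <- 0.0013846
triples_ci <- coefs[["triples"]] + c(-1, 1) * qt(0.975, df = 128) * triples_se
print(round(triples_ci, 4))
```

The interval for triples spans zero, which matches the non-significant p-value in the output.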