This data came from MLB.com, which has MLB player stats dating all the way back to 1876. I chose to focus on active players. The data set covers players who start almost every game, averaging at least 3.1 plate appearances per game, so these are basically the starting/regular players of each of the 30 MLB teams. For each player you can see position, batting average, RBIs, and home runs, as well as many other stats. The top 1000 rankings are included in our dataset, and ranking is determined solely by AVG, or batting average. Each row is the stats for one player for one year, so players who have played multiple years have multiple rows. It’s interesting to note that there are 1001 rows: at one point two players share the same rank, so 1001 player-seasons fill 1000 ranks.
RK: rank. The rank of the player (in a given year) based on their batting average.
Player: Name of the player.
Year: Year in which the player’s stats are from.
Team: The abbreviation of the 30 MLB team names. In the Stadiums file, “Team Name” shows the full name of the teams.
Pos: Player’s position.
- 1B: first base
- SS: shortstop
- 2B: second base
- 3B: third base
- CF: center field
- RF: right field
- LF: left field
- C: catcher
- DH: designated hitter (only in the AL)
- OF: outfielder
G: number of games in which the player appeared (in a given year).
AB: number of official at bats by a batter. (This is plate appearances minus sacrifices, walks, and “hit by pitches”.)
R: runs. The number of times a baserunner safely reaches home plate.
H: hits. The number of times a batter hits the ball and reaches a base safely (without the aid of an error.)
2B: doubles. Number of times a batter hits the ball and safely reaches second base on the hit.
3B: triples. Number of times a batter hits the ball and safely reaches third base on the hit.
HR: home runs. Number of times a batter hits a home run.
RBI: runs batted in. The number of runs that score as a result of the batter’s plate appearance. (If the bases are loaded and the batter hits a HR, that is 4 RBI.)
BB: walks. Four balls during a plate appearance.
SO: strikeouts. Three strikes during an at bat.
SB: stolen base. Number of times a player has stolen a base.
CS: caught stealing. Number of times a player has gotten out while trying to steal a base.
AVG: batting average. Hits divided by at bats (H/AB); roughly, the chance a player gets a hit during an at bat.
OBP: on base percentage. How frequently a player gets on base per plate appearance.
SLG: slugging percentage. Total bases divided by at bats. It is like batting average, but it weights singles, doubles, triples, and HRs by the number of bases, so a higher SLG means a player is more “productive” when hitting.
OPS: on base plus slugging. OBP + SLG, a measure of a player’s ability to get on base AND hit for power.
SF: sacrifice flies. Number of times a runner tags up and scores after a batter’s fly out.
AO: fly outs. Total number of times a batter hit the ball and it was caught in the air, resulting in an out.
GO: ground outs. Number of times a batter has gotten out on a ground ball.
PA: plate appearances.
NP: number of pitches thrown during all of the batter’s plate appearances.
RBIAB: runs batted in per at bat.
HRAB: home runs per at bat.
BABIP: batting average on balls in play. When a player makes contact with the ball, what’s the chance they’ll get a hit? This does not account for strikeouts (because the ball is not put into play).
NPPA: number of pitches per plate appearance.
NPAB: number of pitches per at bat.
SOAB: number of strikeouts per at bat.
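To make the rate-stat definitions above concrete, here is a minimal sketch in Python (the original analysis appears to have been done in R, so this is an illustration, not the author's code). It uses the standard formulas for AVG, OBP, SLG, and OPS. The function name `rate_stats` and the `hbp` (hit by pitch) argument are assumptions: HBP is not a column in this dataset, so it defaults to 0 here, which will slightly misstate OBP for real players.

```python
def rate_stats(ab, h, doubles, triples, hr, bb, sf, hbp=0):
    """Return (AVG, OBP, SLG, OPS) for one player-season of counting stats.

    hbp is an assumed input: the dataset described here has no HBP column.
    """
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    avg = h / ab                                   # hits per at bat
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)    # times on base per opportunity
    slg = total_bases / ab                         # total bases per at bat
    return avg, obp, slg, obp + slg


# Made-up season line for illustration only (not a row from the dataset).
avg, obp, slg, ops = rate_stats(ab=500, h=150, doubles=30, triples=5,
                                hr=25, bb=60, sf=5)
print(f"AVG={avg:.3f} OBP={obp:.3f} SLG={slg:.3f} OPS={ops:.3f}")
```

Because SLG counts a home run as four bases and a single as one, two players with the same AVG can have very different SLG and OPS.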
What do the batting average and OPS look like for all active players?
Which position has the best batting average?
What stats have strong correlations to one another?
Does getting more pitches in an at bat increase the odds of hitting a home run?
What teams have the best batting averages?
How did switching teams affect Albert Pujols’s stats?
Why is Mike Trout considered such a well-rounded player (possibly the greatest of all time)?
Do home run hitters have higher strikeout percentages?
What makes the Yankees and Red Sox such a good rivalry?
Does the average number of pitches in an at-bat correlate with batting average? Does it correlate with HRs?
Where does Buster Posey fall in terms of average BABIP?
Which teams hit the most home runs?
Where are the stadiums of the 30 MLB teams?
I scraped the data using a tool called SelectorGadget. The scraping was automated, so it was quite easy. To clean the data, I removed the * before player names, deleted the columns Player2 and Player3 (which just repeated the same info as Player), and converted a few variables from character to numeric. In order to answer some of these questions, I created a few variables by mutating existing columns: RBIAB, HRAB, BABIP, NP, NPPA, NPAB, and SOAB.
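The cleaning and mutating steps described above might look roughly like the following. This is a sketch in Python under stated assumptions, not the original code (which appears to have been written in R). The function name `clean_row`, the dict-of-strings input, and the exact raw column names are hypothetical, and it assumes NP and PA are already available as scraped columns.

```python
def clean_row(raw):
    """Clean one scraped row (a dict of strings) and add per-at-bat rates.

    Column names are assumptions modeled on the variable list in this write-up.
    """
    numeric_cols = ("AB", "PA", "HR", "RBI", "SO", "NP")
    # Building a fresh dict keeps only the columns we name,
    # which drops the redundant Player2/Player3 columns.
    row = {"Player": raw["Player"].lstrip("*").strip()}
    for col in numeric_cols:
        row[col] = int(raw[col])          # character -> numeric
    row["RBIAB"] = row["RBI"] / row["AB"]  # runs batted in per at bat
    row["HRAB"] = row["HR"] / row["AB"]    # home runs per at bat
    row["SOAB"] = row["SO"] / row["AB"]    # strikeouts per at bat
    row["NPPA"] = row["NP"] / row["PA"]    # pitches per plate appearance
    row["NPAB"] = row["NP"] / row["AB"]    # pitches per at bat
    return row


# Made-up scraped row for illustration only (not from the dataset).
raw = {"Player": "*Example Player", "Player2": "Example Player",
       "Player3": "Example Player", "AB": "500", "PA": "570",
       "HR": "25", "RBI": "90", "SO": "100", "NP": "2280"}
cleaned = clean_row(raw)
print(cleaned["Player"], round(cleaned["HRAB"], 3), round(cleaned["NPPA"], 2))
```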
I began by getting some simple visualizations of the basic stats in the data:
What do the batting average and OPS look like for all active players?
This shows the batting averages for all active players (with each data point being a certain player in a certain year). If an AVG is .300 (or 300), that means the player gets a hit in 30% of at bats. For reference, anything over .300 is considered a really good batting average. The outliers are players who played just a few games in a season and happened to play really well during those games (thus making them very high outliers). This alone does not tell us how good a player is, as different players have different approaches: some hit to get on base, some hit for HRs. The distribution would be bell-shaped if the lower-ranked players (those with AVGs less than .250) were also included.
Both batting average and OPS are bell-shaped for active players.
Which position has the best batting average?
I then wanted to look at which position had the best batting average. My assumption was that Designated Hitters would have the highest AVG, considering their only job is to hit. I got a completely different result. This is probably because I do not have much data for the DH, so we have to take these results with a grain of salt.
| Pos | MEANAVG |
|---|---|
| 1B | 0.293 |
| SS | 0.284 |
| 2B | 0.289 |
| 3B | 0.285 |
| CF | 0.286 |
| RF | 0.284 |
| LF | 0.290 |
| C | 0.288 |
| DH | 0.275 |
| OF | 0.288 |
Using this data, I found that Designated Hitters actually have the WORST batting averages. This could possibly be because we have less data for DH, since they’re only in the American League. If I had data for ALL players in the MLB, not just the ones averaging 3.1+ plate appearances per game, I would expect a different story: the averages for all the positions would likely be much closer to one another, and DH would probably be among the positions with the higher batting averages.
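Since the DH result may come down to sample size, it can help to report the number of rows behind each mean. The following is a minimal Python sketch of a grouped mean with counts; the original table appears to have been built in R, and the function name `mean_avg_by_group` and the list-of-dicts layout are assumptions for illustration.

```python
from collections import defaultdict


def mean_avg_by_group(rows, key="Pos"):
    """Return {group: (mean AVG rounded to 3 places, number of rows)}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["AVG"])
    return {g: (round(sum(v) / len(v), 3), len(v)) for g, v in groups.items()}


# Made-up rows for illustration only (not from the dataset).
rows = [
    {"Pos": "1B", "AVG": 0.300},
    {"Pos": "1B", "AVG": 0.280},
    {"Pos": "1B", "AVG": 0.290},
    {"Pos": "DH", "AVG": 0.275},
]
for pos, (mean_avg, n) in mean_avg_by_group(rows).items():
    print(pos, mean_avg, f"(n={n})")
```

A group mean backed by only a handful of rows, as may be the case for DH here, is much noisier than one backed by hundreds.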
What stats have strong correlations to one another?
To find what stats have strong correlations to one another I created a correlation matrix using several of the variables in the data and produced a heatmap to better visualize the relationships.
I started by graphing a bunch of scatterplots, going through variables two at a time to see their relationships, until I decided to just use a correlation matrix and see all the stats at the same time. As you can see, there is a lot of red, and not a lot of green/grey. This is because most hitting stats in baseball are positive towards the hitter and therefore will have positive correlations. The only 2 stats that are inverses are AO and SO (fly outs and strikeouts), and this is shown by these being the only grey points on the matrix. The strongest inverse relationship is between fly outs and BABIP. This makes sense, as BABIP only takes into account balls that are in play, so a ball that is caught in the air for an out negatively affects BABIP. You can see a lot of very strong correlations, because a lot of the stats measure very similar things, for example OBP and AVG.
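For readers who want to see what goes into such a matrix, here is a small Python sketch that computes pairwise Pearson correlations. The write-up does not say which correlation method was used, so Pearson is an assumption, and the helper names `pearson` and `correlation_matrix` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Map each pair of column names to its Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Made-up values for illustration only (not from the dataset).
columns = {
    "AVG": [0.310, 0.295, 0.280, 0.270],
    "OBP": [0.390, 0.370, 0.350, 0.345],
    "AO": [110, 125, 140, 150],
}
matrix = correlation_matrix(columns)
for a in columns:
    print(a, [round(matrix[a][b], 2) for b in columns])
```

Each cell of the heatmap corresponds to one entry of this matrix: values near +1 or -1 are strong relationships, and values near 0 are weak ones.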
Does getting more pitches in an at bat increase the odds of hitting a home run?
After using a scatterplot to visualize the relationship between HRPA and NPPA and finding the R^2, I was able to conclude that there is no real evidence of a correlation between the two in the data set that I have.
What teams have the best batting averages?
| Team | MEANAVG |
|---|---|
| FLA | 0.303 |
| STL | 0.297 |
| WSH | 0.296 |
| COL | 0.295 |
| DET | 0.294 |
I found that the teams with the top 5 best batting averages in our dataset were (in order): the Florida Marlins (now the Miami Marlins), the St. Louis Cardinals, the Washington Nationals, the Colorado Rockies, and the Detroit Tigers. This is obviously skewed by the data that I have. Upon looking at the players for these teams, I saw that Albert Pujols holds 2 of the top 10 rankings in our data (so his batting average was super high in these years) and was playing for the Cardinals at the time, which is a large contributing factor as to why they have one of the best team batting averages. This leads into our next question…
How did switching teams affect Albert Pujols’s stats?
Why is Mike Trout considered such a well-rounded player (possibly the greatest of all time)?
Mike Trout is currently the best baseball player in MLB, with the highest WAR in all of baseball, and some people think he will be considered the best player who has ever played the game. Just like with Albert Pujols, pitchers are scared to pitch to him, and you can see he accumulates a lot of walks each season. I chose to look at stolen bases as well with Trout, as he is known to be a good base stealer, unlike Pujols, who is quite slow. The vertical black line in this case represents 2017. That year he had a thumb injury that caused him to miss about a third of the season: Trout played only 114 games out of 162, missing 48. In both 2016 and 2018, Trout played the majority of the season, not missing more than 20 games each year. You can see with the bottom three lines (HR, SB, and XBH) that he came back after being injured in 2017 and had an incredible season, almost matching his prior stats and even hitting more home runs than the year before. This is a testament to what kind of player Mike Trout is; the stats are proof of his greatness.
What makes the Yankees and Red Sox such a good rivalry?
When trying to answer our question about why the Red Sox and Yankees rivalry is so intense, I wanted to see if there was a difference in batting styles. The Yankees are always thought of as big home run hitters, and the Red Sox are thought to have very consistent, solid average hitters. So I wanted to see if this was the case, and when I graphed it, I could see right away that my assumptions held up. The Red Sox have a much higher batting average, and the Yankees hit almost 5 more home runs per player each season. To answer the question, the reason these games are so fun to watch is the different hitting styles each team uses, one with an emphasis on batting average, the other with an emphasis on home runs. Because the teams have different strengths, it makes for a very close match-up and an exciting rivalry.
Where does Buster Posey fall in terms of average BABIP?
To answer this question, I created the BABIP variable (batting average on balls in play) using the formula: BABIP = (H – HR)/(AB – SO – HR + SF). From this, I wanted to see where one of my favorite players stood versus the rest of the players in the data. Highlighted in orange is Buster Posey’s average BABIP using this data, and you can see that he is above the average of all the players, which fits his reputation as one of the best hitting catchers to ever play. This is still an accomplishment for Posey, as he is getting old but continues to hit well enough to remain above the average.
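The BABIP formula above translates directly into code. Here is a minimal Python sketch (the original variable appears to have been created in R); the function name `babip` and the sample numbers are made up for illustration.

```python
def babip(h, hr, ab, so, sf):
    """Batting average on balls in play: (H - HR) / (AB - SO - HR + SF)."""
    return (h - hr) / (ab - so - hr + sf)


# Made-up season line for illustration only (not a row from the dataset).
print(round(babip(h=150, hr=25, ab=500, so=100, sf=5), 3))
```

Home runs are removed from both the numerator and the denominator because they are not fielded, strikeouts are removed because the ball never enters play, and sacrifice flies are added back because they are balls in play that do not count as at bats.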
Which teams hit the most home runs?
| Team | MeanHR |
|---|---|
| TOR | 0.049 |
| BAL | 0.045 |
| MIL | 0.045 |
| LAD | 0.044 |
| STL | 0.044 |
| FLA | 0.042 |
| CHC | 0.041 |
| COL | 0.040 |
| SEA | 0.040 |
| WSH | 0.039 |
| DET | 0.039 |
| ARI | 0.039 |
| NYY | 0.038 |
| CIN | 0.038 |
| TEX | 0.037 |
| HOU | 0.037 |
| TB | 0.036 |
| CLE | 0.035 |
| LAA | 0.035 |
| MIN | 0.034 |
| BOS | 0.034 |
| OAK | 0.034 |
| CWS | 0.033 |
| PHI | 0.033 |
| NYM | 0.032 |
| ATL | 0.032 |
| SD | 0.031 |
| MIA | 0.030 |
| SF | 0.029 |
| PIT | 0.029 |
| KC | 0.029 |
| ANA | 0.028 |
| MON | 0.010 |
As you can see in the table, the top 5 highest HR hitting teams are: the Toronto Blue Jays, the Baltimore Orioles, the Milwaukee Brewers, the Los Angeles Dodgers, and the St. Louis Cardinals. If you remember our question about team batting averages, the Cardinals were in the top 5 there as well, so they have both a high batting average and a high number of home runs hit. This is probably due to Albert Pujols again, as he was an incredible all-around player.
Where are the stadiums of the 30 MLB teams?
I recently learned how to use choropleth maps, so I decided to make a map showing where the teams are located, with each point colored to match that team’s colors.