Power Hitting in Major League Baseball
Baseball has always been a sport deeply rooted in data, but in recent years the game has shifted toward an increased focus on power hitting. Players who can consistently hit home runs are now among the most valued in the league, as teams prioritize offensive production that can quickly change the course of a game. This trend has grown alongside the rise of advanced analytics, which have changed the way teams evaluate performance and make decisions. I’ve always been interested in the relationship between data and performance in sports, and this shift in baseball toward power has made me curious about how much home runs influence other key metrics like runs batted in (RBIs) and games played. My hypothesis is that players who hit more home runs also tend to drive in more runs and appear in more games, making them key contributors to team success.
Introduction to the Lahman Data Set
To explore this topic, I selected a data set that includes player-level batting statistics from Major League Baseball. The data comes from the well-known Lahman Baseball Database, which tracks historical performance across many seasons. I filtered the data to include only seasons from 1950 onward and players who had at least 50 at-bats in a given year, in order to focus on players who had a meaningful presence in the lineup. This data set allows for a clear analysis of home run totals alongside other metrics like RBIs, games played, and at-bats. The data set I used can be accessed at the following link:
https://myxaviermy.sharepoint.com/:x:/g/personal/rushs2_xavier_edu/EUQ_GBXB6CNBsxxfizKKLp0BWriMSAL3CUHN5buXNPerjw?download=1
Lahman Data Set (1950–Present) – Data Dictionary
| Column | Description |
| --- | --- |
| playerID | Unique identifier for the player across all seasons |
| yearID | The year of the season |
| stint | Number indicating the player’s stint with a team in a given year (e.g., 1 for first stint, 2 for second if traded mid-season) |
| teamID | Team abbreviation for the team the player was on that season |
| lgID | League ID (e.g., AL for American League, NL for National League) |
| G | Total games played by the player in the season |
| G_batting | Games played as a batter (can differ from total games if the player also pitched or fielded) |
| AB | At-bats (official plate appearances excluding walks, hit-by-pitches, etc.) |
| R | Runs scored by the player |
| H | Hits recorded by the player |
| 2B | Doubles hit by the player |
| 3B | Triples hit by the player |
| HR | Home runs hit by the player |
| RBI | Runs batted in (runs that scored as a result of the player’s plate appearances) |
| SB | Stolen bases |
| CS | Times caught stealing |
| BB | Walks (base on balls) drawn |
| SO | Strikeouts |
| IBB | Intentional walks received |
| HBP | Times hit by a pitch |
| SH | Sacrifice hits (bunts) |
| SF | Sacrifice flies |
| GIDP | Grounded into double plays |
Introduction to the Official MLB Data Set
In addition to the historical data set from the Lahman Baseball Database, I also created a second data set using web scraping techniques to collect player statistics from the official MLB website. This data focuses specifically on the 2016 Major League Baseball season and includes detailed batting statistics for individual players. I used the R programming language along with the rvest, polite, and tidyverse packages to extract and clean the data from multiple paginated web pages. This process allowed me to automate the collection of structured player-level statistics directly from the MLB site, ensuring that the data is both accurate and specific to one complete season.
The data set includes a wide range of variables that help evaluate player performance. These columns include the team a player was on (TEAM), games played (G), at-bats (AB), runs scored (R), hits (H), doubles (2B), triples (3B), home runs (HR), and runs batted in (RBI). It also tracks walks (BB), strikeouts (SO), stolen bases (SB), and times caught stealing (CS). In addition to these counting stats, the data set includes batting average (AVG), on-base percentage (OBP), slugging percentage (SLG), and on-base plus slugging (OPS), which are common efficiency metrics used in modern baseball analysis. Each row is also labeled with the player’s name (PLAYER) and their position (POSITION).
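To make the efficiency metrics concrete, here is a small worked example of how AVG, SLG, and OPS follow from the counting stats. The player line is made up for illustration and is not taken from the data set. OBP is supplied directly as an assumed value, because its usual formula needs hit-by-pitch and sacrifice-fly counts that are not among the scraped columns.

```r
# Made-up season line for one hypothetical player (not from the data set).
player <- data.frame(
  AB = 500, H = 150, `2B` = 30, `3B` = 5, HR = 25, OBP = 0.380,
  check.names = FALSE  # keep column names such as 2B and 3B as written
)

# Batting average: hits per at-bat.
player$AVG <- player$H / player$AB

# Slugging percentage: total bases per at-bat.
singles <- player$H - player$`2B` - player$`3B` - player$HR
total_bases <- singles + 2 * player$`2B` + 3 * player$`3B` + 4 * player$HR
player$SLG <- total_bases / player$AB

# OPS is the sum of on-base percentage and slugging percentage.
player$OPS <- player$OBP + player$SLG

print(player[, c("AVG", "OBP", "SLG", "OPS")])
```

For this hypothetical line, the 150 hits break down into 90 singles, 30 doubles, 5 triples, and 25 home runs, which gives 265 total bases in 500 at-bats.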
Web-Scraped 2016 MLB Data Set – Data Dictionary
| Column | Description |
| --- | --- |
| TEAM | Team abbreviation for the player’s 2016 team |
| G | Games played by the player in the 2016 season |
| AB | At-bats during the season |
| R | Runs scored |
| H | Hits |
| 2B | Doubles hit |
| 3B | Triples hit |
| HR | Home runs hit |
| RBI | Runs batted in |
| BB | Walks (bases on balls) drawn |
| SO | Strikeouts |
| SB | Stolen bases |
| CS | Caught stealing |
| AVG | Batting average (Hits ÷ At-Bats) |
| OBP | On-base percentage |
| SLG | Slugging percentage |
| OPS | On-base plus slugging (OBP + SLG) |
| PLAYER | Full name of the player |
| POSITION | Defensive position played (e.g., OF, 1B, SS) |
Setting Up
I loaded all the necessary libraries for data analysis, visualization, cleaning, and web scraping. These included packages like tidyverse and ggplot2 for data manipulation and plotting, readr and janitor for reading and cleaning data, and several others like rvest, polite, and xml2 to support web scraping and HTML parsing.
[1] "playerID" "yearID" "stint" "teamID" "lgID" "G"
[7] "G_batting" "AB" "R" "H" "2B" "3B"
[13] "HR" "RBI" "SB" "CS" "BB" "SO"
[19] "IBB" "HBP" "SH" "SF" "GIDP" "G_old"
I loaded the Lahman batting data set from a SharePoint link and filtered it to include only players from the year 1950 onward who had at least 50 at-bats. This ensured the analysis focused on modern-era players with a meaningful amount of playing time.
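The filtering step can be sketched as follows. This is a minimal illustration rather than the exact code behind the report: it assumes the batting table has already been read into a data frame, and it uses a few made-up rows in place of the real file. The function name `filter_batting` is hypothetical.

```r
library(dplyr)

# Keep player-seasons from min_year onward with at least min_ab at-bats.
filter_batting <- function(batting, min_year = 1950, min_ab = 50) {
  batting %>%
    filter(yearID >= min_year, AB >= min_ab)
}

# Made-up rows standing in for the Lahman batting table.
toy_batting <- data.frame(
  playerID = c("a01", "b02", "c03", "d04"),
  yearID   = c(1949, 1950, 1998, 2015),
  AB       = c(400, 49, 509, 50),
  HR       = c(10, 1, 22, 3)
)

# a01 is dropped by the year cutoff, b02 by the at-bat cutoff.
batting_1950 <- filter_batting(toy_batting)
print(batting_1950)
```

Keeping the thresholds as function arguments makes it easy to check how sensitive the later charts are to the 50 at-bat cutoff.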
For the MLB database, I used R to web scrape player batting statistics from the official MLB website for the 2016 season. I first defined the base URL and created a function to generate a list of all 33 paginated stat pages. Then, I built a custom scraping function that reads each page, extracts the main stats table, and parses the player names and positions using their respective HTML classes. The function combines first and last names to create full player names and captures the position information for each player. I then created a loop to apply this scraper across all 33 pages, storing the results in one consolidated data frame. To ensure clean and usable data, I renamed repetitive or improperly formatted columns (like “HRHR” and “TEAMTEAM”) into standard labels such as HR, RBI, and TEAM. Finally, I converted key performance columns to numeric format so they could be used for analysis and visualization. This process allowed me to automate the collection of detailed, structured MLB player data directly from the league’s website.
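The outline below is a hedged sketch of that workflow, not the code used for the report. The base URL and the `?page=` parameter are placeholders, since the real address and paging scheme are not given here. The helper names `page_urls`, `scrape_page`, and `clean_stats` are hypothetical. Parsing player names and positions is left out because it depends on page-specific HTML classes.

```r
library(dplyr)

# Placeholder base URL and paging parameter (assumptions, not the real site).
base_url <- "https://example.com/stats/2016"

# Build the list of paginated URLs.
page_urls <- function(base, n_pages = 33) {
  paste0(base, "?page=", seq_len(n_pages))
}

# Fetch one page politely and return its first HTML table as a data frame.
# Defined but not called here, because it needs network access.
scrape_page <- function(url) {
  session <- polite::bow(url)
  page <- polite::scrape(session)
  page %>%
    rvest::html_element("table") %>%
    rvest::html_table()
}

# Rename doubled headers and convert stat columns to numeric.
clean_stats <- function(raw) {
  doubled <- c(HR = "HRHR", TEAM = "TEAMTEAM")  # new name = scraped name
  raw %>%
    rename(any_of(doubled)) %>%
    mutate(across(any_of(c("HR", "RBI", "AB")), as.numeric))
}

# Made-up rows that mimic the raw scraped headers.
raw_toy <- data.frame(
  TEAMTEAM = c("AAA", "BBB"),
  HRHR     = c("12", "30"),
  AB       = c("410", "575")
)
stats_toy <- clean_stats(raw_toy)
str(stats_toy)
```

With network access, the pages could be combined with something like `bind_rows(lapply(page_urls(base_url), scrape_page))` before calling `clean_stats()`.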
Home Run Trends Over the Years (Since 1950)
This line chart shows the total number of home runs hit by all players in the data set in each year from 1950 onward. The totals fluctuate from year to year, but the overall upward trend in recent decades suggests that baseball has increasingly favored power hitters. This shift can be attributed to evolving game strategies, such as the emphasis on launch angle and exit velocity, as well as changes in training, player development, and even ball design. The chart pairs a bold line with distinct data points so that both the long-term trend and individual seasons are easy to read.
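A chart like this one can be built by summing home runs per season and drawing a line with points on top. The sketch below uses a handful of made-up rows. The object names `hr_by_year` and `hr_trend_plot` are illustrative, and `linewidth` assumes ggplot2 3.4 or later.

```r
library(dplyr)
library(ggplot2)

# Made-up player-seasons (not real statistics).
toy_batting <- data.frame(
  yearID = c(1950, 1950, 1951, 1951, 1952),
  HR     = c(10, 5, 20, 7, 12)
)

# Total home runs per season.
hr_by_year <- toy_batting %>%
  group_by(yearID) %>%
  summarise(total_hr = sum(HR, na.rm = TRUE), .groups = "drop")

# Bold line for the trend, points for the individual seasons.
hr_trend_plot <- ggplot(hr_by_year, aes(x = yearID, y = total_hr)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  labs(x = "Season", y = "Total home runs")

print(hr_by_year)
```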
Home Runs vs Runs Batted In (RBIs)
This scatter plot explores the relationship between home runs (HR) and runs batted in (RBI) for players with at least 50 at-bats. Each dot represents a player in a specific season. The dashed regression line slopes upward, indicating a strong positive correlation: players who hit more home runs also tend to drive in more runs, which reinforces the link between power hitting and run production.
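The strength of this relationship can also be checked numerically. The following sketch computes the Pearson correlation and fits the same kind of straight line that the dashed trend line represents. It uses made-up values, so the numbers it prints say nothing about the real data.

```r
library(ggplot2)

# Made-up player-seasons (not real statistics).
toy_batting <- data.frame(
  HR  = c(2, 5, 12, 20, 31, 40),
  RBI = c(15, 30, 48, 70, 95, 110)
)

# Pearson correlation between home runs and RBIs.
hr_rbi_cor <- cor(toy_batting$HR, toy_batting$RBI)

# Simple linear regression of RBI on HR; the slope estimates the
# additional RBIs associated with one more home run.
fit <- lm(RBI ~ HR, data = toy_batting)

# Scatter plot with a dashed least-squares line.
hr_rbi_plot <- ggplot(toy_batting, aes(x = HR, y = RBI)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              linetype = "dashed", color = "black") +
  labs(x = "Home runs", y = "Runs batted in")

print(hr_rbi_cor)
print(coef(fit))
```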
Games Played vs Home Runs
This hexbin plot visualizes how frequently different combinations of games played (G) and home runs (HR) occur. Each hexagon represents a group of players with similar values, and darker colors indicate a higher concentration of players. This format is well suited to dense data: it shows that most players hit few home runs even when they play many games, and that high home run totals are rare.
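A hexbin chart of this kind can be sketched with `geom_hex()` from ggplot2, which needs the hexbin package installed when the plot is drawn. The data below are simulated purely for illustration, and the bin count of 30 is an arbitrary choice.

```r
library(ggplot2)

# Simulated player-seasons for illustration only (not real statistics).
set.seed(1)
toy_batting <- data.frame(G = sample(20:162, 500, replace = TRUE))
toy_batting$HR <- rpois(500, lambda = toy_batting$G / 15)

# Each hexagon is shaded by the number of player-seasons that fall inside it.
hex_plot <- ggplot(toy_batting, aes(x = G, y = HR)) +
  geom_hex(bins = 30) +
  labs(x = "Games played", y = "Home runs", fill = "Player-seasons")
```

Binning avoids the overplotting that a plain scatter plot would suffer from when thousands of player-seasons share the same low home run totals.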
Games Played vs Runs Batted In (Colored by Home Runs)
This scatter plot examines the relationship between games played and RBIs, with each point color-coded by the number of home runs. A clear positive trend shows that players who appear in more games tend to accumulate more RBIs. The color gradient adds insight by showing how players with higher home run totals often contribute more significantly to RBI totals, reinforcing the value of power hitters in extended play.
Home Runs vs RBIs (Bubble Plot: Games Played as Size)
This multi-dimensional bubble plot compares home runs (x-axis) and RBIs (y-axis), with the size of each bubble representing games played (G) and color representing home run totals. Larger bubbles generally fall higher and to the right, confirming that consistent playing time contributes to both HR and RBI accumulation. This visualization combines three variables, with home runs encoded twice (by position and by color), to highlight how high-volume, power-hitting players impact offensive production.
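The encoding described here maps directly onto ggplot2 aesthetics. A minimal sketch with made-up values follows. The maximum bubble size and transparency are arbitrary choices, not settings taken from the report.

```r
library(ggplot2)

# Made-up player-seasons (not real statistics).
toy_batting <- data.frame(
  HR  = c(3, 8, 15, 24, 33, 45),
  RBI = c(20, 41, 60, 82, 101, 120),
  G   = c(60, 95, 120, 140, 155, 160)
)

# x = home runs, y = RBIs, bubble size = games played, color = home runs.
bubble_plot <- ggplot(toy_batting,
                      aes(x = HR, y = RBI, size = G, color = HR)) +
  geom_point(alpha = 0.6) +
  scale_size_area(max_size = 8) +  # bubble area proportional to games played
  labs(x = "Home runs", y = "Runs batted in",
       size = "Games played", color = "Home runs")
```

`scale_size_area()` is used so that a bubble's area, rather than its radius, scales with games played, which keeps large values from looking exaggerated.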
Hits vs Runs (Colored by AVG)
This visualization uses the MLB 2016 season data set to explore the relationship between total hits (H) and total runs scored (R) among players with at least 50 at-bats. Each point on the scatter plot represents an individual player, and the color of each point reflects their batting average (AVG), with a gradient scale applied using the viridis color palette. A black dashed trend line shows the linear relationship between hits and runs, indicating that players who accumulate more hits also tend to score more runs. By including batting average as a color dimension, the plot highlights how more efficient hitters contribute more effectively to run production. This chart helps visualize how contact ability and overall hitting efficiency impact a player’s scoring output.
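One way to assemble such a chart is sketched below with made-up rows. In the scraped data AVG already exists as a column, so it is recomputed here only because the toy data frame lacks it.

```r
library(dplyr)
library(ggplot2)

# Made-up 2016-style rows (not real statistics).
toy_2016 <- data.frame(
  AB = c(30, 210, 480, 550, 610),
  H  = c(6, 50, 130, 160, 190),
  R  = c(2, 24, 61, 80, 102)
)

# Apply the 50 at-bat cutoff and compute batting average for the color scale.
plot_data <- toy_2016 %>%
  filter(AB >= 50) %>%
  mutate(AVG = H / AB)

# Points colored by AVG on a viridis gradient, with a dashed linear trend.
hits_runs_plot <- ggplot(plot_data, aes(x = H, y = R, color = AVG)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              color = "black", linetype = "dashed") +
  scale_color_viridis_c() +
  labs(x = "Hits", y = "Runs scored", color = "AVG")
```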
Total Home Runs by Position
This visualization shows the total number of home runs hit by players at each position during the 2016 MLB season, including only those with at least 50 at-bats. The bar chart makes it clear that third basemen and first basemen were the top two positions in terms of total home runs. This aligns with the traditional offensive expectations of these roles, as both positions are often filled by power hitters known for their ability to drive the ball and produce runs. On the other end of the spectrum, pitchers ranked at the very bottom, contributing the fewest home runs. This is expected, as pitchers generally focus on their defensive duties and rarely get significant at-bats, especially in leagues using the designated hitter rule. Overall, the chart reflects how a player’s defensive position often correlates with their offensive responsibilities and output.
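The position totals behind a bar chart like this come from a grouped sum. The sketch below shows the aggregation and plotting steps on made-up rows. The positions and numbers are illustrative only and do not reproduce the 2016 results.

```r
library(dplyr)
library(ggplot2)

# Made-up 2016-style rows (not real statistics).
toy_2016 <- data.frame(
  POSITION = c("1B", "1B", "3B", "SS", "P", "P"),
  AB       = c(500, 320, 610, 450, 60, 20),
  HR       = c(30, 12, 35, 9, 1, 2)
)

# Total home runs per position among players with at least 50 at-bats.
hr_by_position <- toy_2016 %>%
  filter(AB >= 50) %>%
  group_by(POSITION) %>%
  summarise(total_hr = sum(HR), .groups = "drop") %>%
  arrange(desc(total_hr))

# Bars ordered from most to fewest home runs.
position_plot <- ggplot(hr_by_position,
                        aes(x = reorder(POSITION, -total_hr), y = total_hr)) +
  geom_col() +
  labs(x = "Position", y = "Total home runs")

print(hr_by_position)
```

Totals also reflect how many players are listed at each position, so comparing average home runs per player would be a natural follow-up check.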
Conclusion
This project explored the relationship between power hitting and offensive contribution in Major League Baseball using both a long-term data set from the Lahman Baseball Database and a detailed 2016 season data set collected through web scraping. By examining key metrics like home runs, RBIs, hits, and games played, the analysis confirmed that players who hit more home runs tend to drive in more runs and play more games, reinforcing their value as core offensive contributors. Visualizations showed strong correlations between home runs and RBIs, as well as the impact of consistent hitters on run production. The breakdown of home runs by position highlighted how offensive expectations vary by role, with first and third basemen leading in home run totals while pitchers contributed the least. Using both historical and single-season data added depth to the findings and allowed for a broader understanding of trends over time and player performance in a specific year. Overall, the project demonstrated how statistical analysis and data visualization can be used to uncover patterns in sports performance and support data-driven insights in baseball.