Purpose of Analysis

As a BAIS major and baseball player at Xavier, I wanted to combine these interests by analyzing Major League Baseball data from the past decade. Specifically, I am interested in trends in home runs and strikeouts, and I hypothesized that both have increased as the game has shifted its focus toward power hitting.

Data Scraping and Cleaning

The data was scraped from Baseball Reference, which provides comprehensive statistics on Major League Baseball. To collect the data, I wrote a script that automatically fetched team statistics for each year from 2013 to 2023. The script accessed the webpage for each year, extracted the table with team statistics, and saved this information. This process was repeated for each year to build a complete dataset spanning over a decade.
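The scraping loop described above can be sketched roughly as follows. This is a minimal Python illustration using only the standard library, not the script used for this report. The URL pattern, the `table_id` argument, and the helper names (`season_url`, `TableRows`, `fetch_html`, `scrape_seasons`) are assumptions made for the example and should be checked against the live site and its terms of use.

```python
import time
import urllib.request
from html.parser import HTMLParser

# Assumed URL pattern for one season's league page; verify before relying on it.
BASE_URL = "https://www.baseball-reference.com/leagues/majors/{year}.shtml"


def season_url(year):
    """Build the (assumed) URL of the league page for one season."""
    return BASE_URL.format(year=year)


class TableRows(HTMLParser):
    """Collect the cell text of every row in the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.rows = []
        self._row = None
        self._cell = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self._row = []
        elif self.in_table and tag in ("td", "th") and self._row is not None:
            self.in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.in_table:
            self.in_table = False
        elif self.in_table and tag in ("td", "th") and self.in_cell:
            self._row.append("".join(self._cell).strip())
            self.in_cell = False
        elif self.in_table and tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self.in_cell:
            self._cell.append(data)


def fetch_html(url):
    """Download one page and return its text."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_seasons(first_year, last_year, table_id, delay_seconds=5.0):
    """Fetch each season's page and tag every extracted row with its year."""
    rows = []
    for year in range(first_year, last_year + 1):
        parser = TableRows(table_id)
        parser.feed(fetch_html(season_url(year)))
        rows.extend([year] + row for row in parser.rows)
        time.sleep(delay_seconds)  # pause between requests to be polite
    return rows
```

Under these assumptions, a call such as `scrape_seasons(2013, 2023, table_id)` would return one list per table row, with the season as the first element. The header row of each table is included as well, which is one way a placeholder value like ‘Tm’ can end up in the team column.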

My dataset initially included a range of team statistics, but I focused specifically on cleaning and structuring data related to home runs (HR) and strikeouts (SO).

First, I encountered an issue with the HR and SO columns being read as character strings due to the presence of non-numeric characters. To resolve this, I applied a function that stripped out any non-digit characters and then converted these columns to numeric data types. This step was necessary for accurate statistical calculations later in the analysis.
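A minimal Python sketch of this kind of cleaning step is shown here. It is an illustration rather than the function used for this report, and the name `to_number` is hypothetical. Stripping every non-digit is only safe for whole-number counts such as HR and SO, since it would also remove decimal points and minus signs.

```python
import re


def to_number(raw):
    """Strip every non-digit character and return an int, or None if nothing is left."""
    digits = re.sub(r"\D", "", str(raw))
    return int(digits) if digits else None
```

For example, `to_number("1,234")` gives `1234`, while a value with no digits at all, such as a repeated column header, gives `None` and can be dropped afterwards.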

Next, I filtered out rows where the team information was missing or contained the placeholder ‘Tm’ instead of a team name. This ensured that all data used in the analysis was associated with actual MLB teams. Additionally, I excluded the 2020 season because it was shortened to 60 games instead of the usual 162, which would distort year-over-year comparisons of totals.
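The filtering rules above could be expressed along these lines. This is a hedged Python sketch, assuming each row is a dictionary with a `"Tm"` key for the team and an integer `"Year"` key; the `"Year"` column name and the function names `keep_row` and `clean_rows` are assumptions for illustration.

```python
def keep_row(row, excluded_years=(2020,)):
    """Return True for rows with a real team name from a season that is not excluded."""
    team = row.get("Tm")
    # Drop rows with no team or with the placeholder 'Tm' in the team column.
    if team is None or team.strip() in ("", "Tm"):
        return False
    # Drop excluded seasons (assumes "Year" is stored as an int).
    return row.get("Year") not in excluded_years


def clean_rows(rows):
    """Keep only the rows that pass keep_row."""
    return [row for row in rows if keep_row(row)]
```

Keeping the excluded seasons as a parameter makes it easy to rerun the analysis with 2020 included and compare the results.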

Here is a link to my scraped data: https://myxavier-my.sharepoint.com/:x:/g/personal/depreym_xavier_edu/EbycesGrxDNDiUvBm8vQspwBPTqNf_nQ3GDtPgD1d9m0fQ

Analysis and Visualization

For the analysis, I visualized the trends in home runs (HR) and strikeouts (SO) across Major League Baseball over the last decade. Creating these visualizations allowed me to better understand and communicate the patterns and relationships within the data.

I started by plotting the annual totals of home runs and strikeouts. This was achieved through a line graph, where each line represented one of the variables over the years. The choice of a line graph was intentional, as it clearly illustrates the trend and progression of each statistic over time, making it easy to identify any significant changes or anomalies.
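The annual totals behind a line graph like this can be computed by summing the team rows for each season. The Python sketch below assumes cleaned rows with integer `"Year"`, `"HR"`, and `"SO"` fields; the function name `annual_totals` is hypothetical, and the original report may have derived its totals differently.

```python
from collections import defaultdict


def annual_totals(rows):
    """Sum HR and SO over all team rows of each season, ordered by year."""
    totals = defaultdict(lambda: {"HR": 0, "SO": 0})
    for row in rows:
        totals[row["Year"]]["HR"] += row["HR"]
        totals[row["Year"]]["SO"] += row["SO"]
    return dict(sorted(totals.items()))
```

The result maps each year to its HR and SO totals, so the years can serve as the x-axis and each statistic as one line. Because strikeout totals are far larger than home run totals, a plot of the two may need separate axes or panels to keep both trends readable.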

Next, I included a scatter plot with a regression line to explore the relationship between home runs and strikeouts. This visualization was particularly insightful as it highlighted whether increases in home runs could be associated with increases in strikeouts, providing a visual representation of the correlation between these two variables.
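The two quantities behind such a plot, the correlation coefficient and the fitted line, can be computed directly. The Python sketch below assumes a Pearson correlation and an ordinary least-squares line, which is an assumption about the method rather than something stated in this report; the function names `pearson_r` and `fit_line` are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x
```

Here `xs` would hold the home run values and `ys` the matching strikeout values. A coefficient close to 1 means the two statistics rise together almost linearly, while the slope estimates how many additional strikeouts accompany each additional home run.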

The correlation between home runs and strikeouts was approximately 0.85 (0.8475).

Interpretation and Conclusion

The increase in both home runs and strikeouts suggests a shift toward power hitting in baseball strategy. The correlation analysis and scatter plot support this, indicating that as teams hit more home runs, strikeouts also tend to increase. Correlation alone does not establish cause, but one plausible explanation is that hitters are taking more risks at the plate, swinging for home runs at the cost of higher strikeout rates.

Taken together, this analysis shows how home runs and strikeouts have moved over the last decade and offers insight into how the game’s strategy has evolved.