2025-10-30

Data Sets

I will be using two data sets from Kaggle for this project. One contains MLB hitting statistics for players (hitting_data) and the other contains MLB pitching statistics for players (pitching_data). I will use these data sets to look for trends based on certain statistics and players' positions.

Ggplot Boxplot code:

I wanted to see how batting average varies by position, so I used a boxplot to compare the distribution of AVG at each position. Here is the code I used to accomplish this:

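library(dplyr)    # provides %>% and filter()
library(ggplot2)

# Order positions in standard scorecard order; values not listed become NA
hitting_data$Position <- factor(hitting_data$Position,
                       levels = c("P", "C", "1B", "2B", "3B", "SS", 
                                  "LF", "CF", "RF", "DH"))
# Drop rows whose position is missing or not one of the levels above
hitting_data <- hitting_data %>% 
  filter(!is.na(Position))

# One box per position showing the distribution of batting average
ggplot(data = hitting_data, aes(x = Position, y = AVG)) +
  geom_boxplot(width = 0.25, outlier.alpha = 0.5) +
  labs(title = "Batting Average Based On Position",
       x = "Position", y = "AVG") + 
  theme(text = element_text(size = 10),
        plot.title = element_text(size = 12, face = "bold"))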

Ggplot Boxplot

Aside from pitchers, we can see that catchers tend to have the lowest batting averages. This might be because of the defensive and physical demands of their position. Designated hitters have higher averages, possibly because they don't have to play defense and can focus entirely on hitting.

Ggplot Scatter Plot

This scatter plot shows a clear positive relationship between pitchers' ERA and WHIP. This makes sense because WHIP measures how many baserunners a pitcher allows per inning pitched, so the more runners a pitcher allows, the more runs he tends to give up.
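The code for this plot is not shown in the report, so the following is only a minimal sketch of how a plot like it could be built with ggplot2. The column names `ERA` and `WHIP` are assumptions about pitching_data, and `toy_pitching` is a made-up stand-in so the example runs on its own.

```r
library(ggplot2)

# Toy stand-in for pitching_data: made-up values, NOT real player statistics.
# The column names WHIP and ERA are assumed.
toy_pitching <- data.frame(
  WHIP = c(0.95, 1.10, 1.20, 1.32, 1.45, 1.60),
  ERA  = c(2.40, 3.10, 3.55, 4.20, 4.80, 5.60)
)

# Scatter plot with a least-squares line to make the trend easier to see
era_whip_plot <- ggplot(toy_pitching, aes(x = WHIP, y = ERA)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "ERA vs. WHIP", x = "WHIP", y = "ERA")

# Pearson correlation puts a number on the visual trend
era_whip_cor <- cor(toy_pitching$WHIP, toy_pitching$ERA)

print(era_whip_plot)
```

With the real data, replacing `toy_pitching` with pitching_data (and adjusting the column names if they differ) should be the only change needed.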

3D Scatter Plot

The following is a 3D scatter plot of pitchers' strikeouts, walks, and home runs allowed.

3D Scatter Plot Analysis

This 3D scatter plot visualizes the relationship between walks (BB), strikeouts (SO), and home runs allowed (HR) for pitchers, with color representing WHIP (walks plus hits per inning pitched). A few trends I notice are:

-Pitchers with higher strikeout totals generally allow fewer walks and home runs, indicating stronger command and pitch effectiveness.

-Higher WHIP values are clustered where pitchers allow more walks and home runs, reinforcing that higher WHIP usually correlates with poor run prevention.

-The majority of pitchers fall near the lower ends of HR and BB, suggesting that while strikeouts vary widely, extreme outliers in home runs or walks are rare.
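A plot like the one analyzed above could be sketched with plotly roughly as follows. This is not the original code: the column names `BB`, `SO`, `HR`, and `WHIP` are assumptions about pitching_data, and `toy_pitching` holds made-up values so the example is self-contained.

```r
library(plotly)

# Toy stand-in for pitching_data: made-up values, NOT real player statistics.
# The column names BB, SO, HR, and WHIP are assumed.
toy_pitching <- data.frame(
  BB   = c(20, 35, 50, 62, 75),
  SO   = c(210, 180, 150, 120, 90),
  HR   = c(12, 18, 22, 27, 31),
  WHIP = c(0.98, 1.12, 1.25, 1.38, 1.52)
)

# One marker per pitcher; marker color encodes WHIP
pitching_3d <- plot_ly(
  toy_pitching,
  x = ~BB, y = ~SO, z = ~HR,
  color = ~WHIP,
  type = "scatter3d", mode = "markers"
)

# Label the three axes of the 3D scene
pitching_3d <- layout(
  pitching_3d,
  scene = list(
    xaxis = list(title = "Walks (BB)"),
    yaxis = list(title = "Strikeouts (SO)"),
    zaxis = list(title = "Home runs allowed (HR)")
  )
)

pitching_3d
```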

Plotly Bar Plot

Plotly Pie Charts

I wanted to show the distribution of player positions within the major home run clubs (specifically the 300+, 500+, and 600+ home run clubs). To do that, I used one pie chart per club, with each slice showing the share of club members who played a given position.
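The pie chart code is not included in the report. One possible way to build the charts, sketched under assumptions, is to count positions among players at or above each home run threshold and pass the counts to plotly. The helper names `club_counts` and `club_pie` are hypothetical, the `HR` column is assumed, and `toy_hitting` contains made-up totals rather than real players.

```r
library(plotly)

# Toy stand-in for hitting_data: made-up totals, NOT real players.
# The column names Position and HR are assumed.
toy_hitting <- data.frame(
  Position = c("1B", "1B", "LF", "RF", "3B", "SS", "CF", "DH"),
  HR       = c(310, 520, 610, 450, 340, 305, 660, 505)
)

# Hypothetical helper: count positions among players with at least min_hr home runs
club_counts <- function(data, min_hr) {
  members <- data[data$HR >= min_hr, ]
  as.data.frame(table(Position = members$Position), responseName = "n")
}

# Hypothetical helper: one pie chart for one home run club
club_pie <- function(data, min_hr) {
  pie <- plot_ly(club_counts(data, min_hr),
                 labels = ~Position, values = ~n, type = "pie")
  layout(pie, title = paste0(min_hr, "+ Home Run Club by Position"))
}

# Build the three charts discussed in this section
club_pies <- lapply(c(300, 500, 600), function(k) club_pie(toy_hitting, k))

club_pies[[1]]
```

Keeping the counting step in its own function makes it easy to check the numbers behind each chart before plotting them.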

Statistical Analysis

## 
##  Welch Two Sample t-test
## 
## data:  corner and middle
## t = 3.1018, df = 1078, p-value = 0.001973
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.001540304 0.006844134
## sample estimates:
## mean of x mean of y 
## 0.2654578 0.2612656

This Welch two-sample t-test compared the mean batting averages of corner infielders (1B and 3B) and middle infielders (2B and SS). The results show a statistically significant difference between the two groups (t = 3.10, p = 0.001973).

-The mean batting average for corner infielders was 0.265, while middle infielders averaged 0.261. The gap is about 0.004, or roughly four points of batting average, so the difference is statistically significant but small in practical terms.

-The 95% confidence interval for the difference in means (0.0015 to 0.0068) does not include zero, which indicates that the difference is unlikely to be due to random chance alone.
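The line "data: corner and middle" in the output suggests that two vectors named `corner` and `middle` were passed to `t.test()`. How those vectors were built is not shown, so the subsetting below is an assumption, and `toy_hitting` is a made-up stand-in for hitting_data. A minimal sketch:

```r
# Toy stand-in for hitting_data: made-up values, NOT real player statistics.
toy_hitting <- data.frame(
  Position = c("1B", "3B", "1B", "3B", "2B", "SS", "2B", "SS"),
  AVG      = c(0.271, 0.262, 0.268, 0.259, 0.258, 0.264, 0.255, 0.266)
)

# Assumed grouping: corner infielders are 1B/3B, middle infielders are 2B/SS
corner <- toy_hitting$AVG[toy_hitting$Position %in% c("1B", "3B")]
middle <- toy_hitting$AVG[toy_hitting$Position %in% c("2B", "SS")]

# t.test() defaults to var.equal = FALSE, which gives Welch's test
result <- t.test(corner, middle)

print(result)
```

Welch's version does not assume the two groups have equal variances, which is a reasonable default when the group sizes or spreads may differ.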

Conclusion

This analysis explored trends in Major League Baseball hitting and pitching performance using multiple visualizations and statistical methods. The batting average by position chart showed clear variation across defensive roles, with pitchers and catchers posting the lowest averages and corner infielders and designated hitters showing the strongest offensive output.

For pitchers, the ERA vs. WHIP scatter plot revealed a strong positive relationship: as pitchers allow more baserunners (higher WHIP), their earned run averages also rise, a logical connection between control and performance. The 3D plot of BB, SO, and HR reinforced this, showing that pitchers with high strikeout totals generally limit walks and home runs, which aligns with lower WHIP values.

Finally, the home run analyses, including total home runs by position and positional breakdowns of the 300+, 500+, and 600+ HR clubs, showed that power hitters are overwhelmingly concentrated in corner positions and the outfield. Defense-first positions like shortstop and second base contribute minimally to these milestones, reflecting the specialization of offensive power in baseball. Overall, these results illustrate how closely a player's position is tied to offensive production and how closely pitching metrics such as WHIP and ERA track one another, underscoring the balance between defensive value and offensive production in constructing a successful baseball team.