Introduction

The purpose of this document is to analyze the season history of the Cincinnati Reds baseball franchise from 1882 to 2019. As a die-hard Reds fan and a follower of baseball statistics, I enjoyed researching the season history of my favorite team to find interesting trends and revelations. In addition, I took a Sabermetrics class at Xavier that inspired me to continue playing around with baseball data in my free time. I scraped a data table from Baseball-Reference that lists every season the Cincinnati Reds have played to date. Here is the link:

https://www.baseball-reference.com/teams/CIN/index.shtml

Information found in the table includes wins, losses, runs scored, runs allowed, winning percentage, average batter age, average pitcher age, etc.

Packages Required

The packages in R I loaded up and used are listed below along with their purpose in my analysis:

XML: Working with XML table objects on HTML webpages

rvest: Useful tools for working with HTML and XML

RCurl: Convenient HTTP forms

magrittr: Adding piping functionality

httr: Useful for web authentication

tidyverse: The tidyverse is a collection of open source R packages that help model, transform, and visualize data. In the tidyverse, I used the ggplot2 package throughout my analysis to create my visualizations, and I used the dplyr package to filter and manipulate the data.

scales: I used the scales library to prevent R from labeling my axes in a scientific format.

To install these packages in R, use the following command: install.packages(c("XML", "rvest", "RCurl", "magrittr", "httr", "tidyverse", "scales"))

Analysis

After scraping the data table from baseball-reference, I asked a series of 5 questions I wanted to investigate.
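As a rough illustration of the scraping step, here is a minimal sketch using rvest. It is not necessarily the exact code behind this analysis. The helper name first_table and the tiny inline HTML page are made up for illustration, and the numbers in it are placeholders rather than real Reds results, so that the example runs without network access.

```r
library(rvest)

# Hypothetical helper: parse a page (a URL or a literal HTML string) and
# return its first <table> as a data frame.
first_table <- function(page_source) {
  page <- read_html(page_source)
  tbl <- html_table(html_element(page, "table"))
  as.data.frame(tbl)
}

# Toy stand-in for the real page so the example runs offline.
# These values are placeholders, not real Reds results.
toy_html <- "<html><body><table>
  <tr><th>Year</th><th>W</th><th>L</th></tr>
  <tr><td>2001</td><td>10</td><td>5</td></tr>
  <tr><td>2002</td><td>8</td><td>7</td></tr>
</table></body></html>"

seasons <- first_table(toy_html)
print(seasons)

# For the live table (requires network access):
# seasons <- first_table("https://www.baseball-reference.com/teams/CIN/index.shtml")
```

On the real page, the first table may not be the one you want, so it is worth checking the column names of the result before going further.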

Question #1 Is the Season Winning Percentage Correlated with the Season Attendance?

To determine how strong the bandwagon effect is for the Reds, I wanted to find out whether the Reds’ season attendance increases when their winning percentage for the season is high. I filtered for seasons between 1973 and 2019 because 1973 was when the Reds first broke the 2 million mark for attendance. Including the seasons prior to 1973 would not have given a good representation of the modern population of Cincinnati.


As indicated by the above graph, there appears to be a slight positive correlation between winning percentage and attendance, which suggests that Cincinnati fans are more likely to fill up the stadium when the team is winning. This is unlike some other major sports teams, such as the Yankees or Dodgers, whose fans will show up in droves regardless of the team’s performance. To further validate my findings, I would use a correlation analysis.
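The correlation analysis mentioned above could look something like the following sketch, which uses base R's cor.test. The helper name attendance_cor and the column names Year, win_pct, and attendance are assumptions, and the toy rows are made up purely to show the call. On the scraped table, the attendance column may arrive as text with commas and need converting to numbers first.

```r
# Hypothetical helper: Pearson correlation test between winning percentage
# and attendance for seasons in [from, to]. Column names are assumptions.
attendance_cor <- function(seasons, from = 1973, to = 2019) {
  modern <- seasons[seasons$Year >= from & seasons$Year <= to, ]
  cor.test(modern$win_pct, modern$attendance, method = "pearson")
}

# Made-up toy rows, only to show the call; not real Reds data.
toy <- data.frame(
  Year       = c(1970, 1975, 1985, 1995, 2005, 2015),
  win_pct    = c(0.600, 0.650, 0.550, 0.590, 0.450, 0.400),
  attendance = c(1.8e6, 2.3e6, 1.9e6, 1.8e6, 1.9e6, 2.4e6)
)

result <- attendance_cor(toy)
result$estimate  # correlation coefficient r
result$p.value   # p-value for the null hypothesis that r = 0
```

The estimate gives the strength and direction of the relationship, while the p-value gives a sense of whether a correlation that size could plausibly be chance.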

Question #2 Is the Season Run Margin Correlated with the Season Wins?

In Sabermetrics, I learned that, among other items, a team’s run margin (runs scored - runs allowed) is one of the best indicators of how many games a team will win in a season. I filtered for seasons from 1962 to 2019 because the National League, where the Reds play, moved to a 162-game schedule in 1962.

There appears to be a very strong correlation between run margin and season wins, with only two outliers in the data. In one season, the Reds had a positive run margin but won just 66 games. In another, the Reds had a whopping run margin of over 100 while also posting only 66 wins. It is very uncommon for a team to have that high of a run margin and win so few games, so I was surprised to see this. The most likely explanation is that these are the strike-shortened 1981 and 1994 seasons, when the Reds played far fewer than 162 games, so comparing winning percentage instead of raw wins would probably pull these points back in line. To further validate my findings, I would use a correlation analysis.
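One way to pick out outliers like these without eyeballing the plot is to fit a simple line and look at the largest residuals. This is only a sketch of that idea, not the method used for the graph. The helper name flag_outliers is hypothetical, the columns R, RA, and W are assumed to hold runs scored, runs allowed, and wins, and the toy rows are invented.

```r
# Hypothetical helper: add run margin, fit wins ~ run margin, and return the
# n seasons whose wins differ most from the fitted line.
flag_outliers <- function(seasons, n = 1) {
  seasons$run_margin <- seasons$R - seasons$RA
  fit <- lm(W ~ run_margin, data = seasons)
  seasons$residual <- residuals(fit)
  seasons[order(abs(seasons$residual), decreasing = TRUE)[seq_len(n)], ]
}

# Made-up toy seasons (Year is just a row label here); not real Reds data.
# Season 5 has a large positive run margin but very few wins.
toy <- data.frame(
  Year = 1:6,
  R    = c(700, 750, 650, 800, 720, 600),
  RA   = c(700, 650, 750, 650, 600, 580),
  W    = c(81, 91, 71, 96, 66, 83)
)

flag_outliers(toy)
```

A large negative residual means the team won far fewer games than its run margin would suggest, which is the pattern described above.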

Question #3 Is the Steroid Era Evident in the Data?

The Steroid Era is a dark chapter in baseball history where many players used PEDs to gain an advantage. As a result, we witnessed spectacular home run numbers from many players throughout the ’90s and early ’00s, such as Barry Bonds, Mark McGwire, and Sammy Sosa. I’m curious to see if the Steroid Era is reflected in the Reds’ history too. I will compare the average number of runs scored and allowed per season in each era by grouping the seasons by era. I identified 7 eras in baseball: Deadball (1901-1919), Lively Ball (1920-1945), Post-War (1946-1960), Expansion (1961-1976), Free Agency (1977-1993), Steroid (1994-2005), and Modern (2006-2019).
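The era grouping described above can be sketched in base R with cut() and aggregate(). The helper names era_of and runs_by_era are hypothetical, the columns Year, R, and RA are assumed to hold the season, runs scored, and runs allowed, and the toy rows are invented. Seasons before 1901 fall outside the seven eras and are dropped from the averages in this sketch.

```r
# Era boundaries taken from the list above; each era covers (lower, upper].
era_breaks <- c(1900, 1919, 1945, 1960, 1976, 1993, 2005, 2019)
era_labels <- c("Deadball", "Lively Ball", "Post-War", "Expansion",
                "Free Agency", "Steroid", "Modern")

# Map a season year to its era; years outside 1901-2019 become NA.
era_of <- function(year) {
  cut(year, breaks = era_breaks, labels = era_labels, right = TRUE)
}

# Average runs scored (R) and runs allowed (RA) per season within each era.
# Rows whose era is NA are dropped by aggregate()'s default na.action.
runs_by_era <- function(seasons) {
  seasons$era <- era_of(seasons$Year)
  aggregate(cbind(R, RA) ~ era, data = seasons, FUN = mean)
}

# Made-up toy seasons, only to show the call; not real Reds data.
toy <- data.frame(
  Year = c(1910, 1915, 1998, 2000, 2010),
  R    = c(600, 620, 800, 820, 700),
  RA   = c(610, 590, 790, 810, 690)
)

runs_by_era(toy)
```

The resulting data frame has one row per era, which is the shape a bar chart of average runs by era needs.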

Based on the two bar graphs, the Steroid Era had the highest average runs scored and allowed per season in Reds history. This is consistent with Reds hitters using PEDs to increase their chances of hitting long balls, and with Reds pitchers facing players from other teams who frequently used PEDs, although run totals alone can’t prove either. All in all, while the Steroid Era is remembered for condemning certain individuals like Barry Bonds, it’s important to note that PED use was probably widespread at the time, because MLB didn’t test its players for PEDs until 2003.

Question #4 Is the Average Batter Age Correlated with Runs Scored in a Season?

Typically, as a batter ages, their performance will decline. If a team has older batters, you’d expect them to score fewer runs throughout the season. I’m curious to see if this applies to the Reds throughout their history. I filtered for the seasons between 1962 and 2019 because 1962 is when the National League adopted a 162-game season. I then plotted the data points on a scatter plot to see if there’s a correlation.
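The filter-and-plot step just described might look like the sketch below, using ggplot2 with the scales package for plain-number axis labels. The helper name plot_age_vs_runs and the column names Year, bat_age, and R are assumptions, and the toy rows are invented, so this is an illustration rather than the code behind the original graph.

```r
library(ggplot2)
library(scales)

# Hypothetical helper: scatter plot of runs scored against average batter age
# for seasons in [from, to]. Column names are assumptions.
plot_age_vs_runs <- function(seasons, from = 1962, to = 2019) {
  recent <- seasons[seasons$Year >= from & seasons$Year <= to, ]
  ggplot(recent, aes(x = bat_age, y = R)) +
    geom_point() +
    scale_y_continuous(labels = label_comma()) +  # plain numbers, not scientific
    labs(x = "Average batter age", y = "Runs scored")
}

# Made-up toy seasons, only to show the call; not real Reds data.
toy <- data.frame(
  Year    = c(1950, 1970, 1990, 2010),
  bat_age = c(28.0, 27.5, 29.1, 28.4),
  R       = c(650, 775, 693, 790)
)

p <- plot_age_vs_runs(toy)
print(p)
```

Swapping in the pitcher age and runs allowed columns gives the equivalent plot for the pitching staff.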

To my genuine surprise, there doesn’t appear to be any correlation between the average batter age and the number of runs scored in a season for the Reds. It appears that, in the Reds’ case, Father Time doesn’t play a major role in determining how many runs the batters will score. To further validate my findings, I would use a correlation analysis.

Question #5 Is the Average Pitcher Age Correlated with Runs Allowed in a Season?

After finding no correlation between average batter age and runs scored, I wanted to see whether the same held for average pitcher age and runs allowed. As before, I filtered for the seasons between 1962 and 2019 and plotted the data points on a scatter plot.

To my surprise once again, there appears to be a somewhat strong correlation between pitcher age and runs allowed. To summarize, as the Reds’ pitching staff gets older, it tends to give up more runs in a season. To further validate my findings, I would use a correlation analysis.