Sample MLB Batting Statistics
Introduction
This assignment uses web scraping to examine a sample of MLB batting statistics. The sample includes batting statistics from the Reds, Cardinals, Braves, Mets, and Cubs. The analysis examines how player age relates to stolen bases and home runs.
Required Packages
This R Markdown document requires the following packages:
| Package | Summary |
|---|---|
| readr | Read csv files |
| dplyr | Data processing and analysis |
| sqldf | Using SQL in R |
| knitr | RMarkdown documents |
| rmdformats | RMarkdown themes |
| httr | HTTP requests, including web authentication |
| rvest | Scraping and parsing HTML and XML |
Sample of MLB Batting Data
The dataset was built by combining the batting statistics from the following team pages.
Reds: https://www.baseball-reference.com/teams/CIN/2020.shtml
Mets: https://www.baseball-reference.com/teams/NYM/2020.shtml
Braves: https://www.baseball-reference.com/teams/ATL/2020.shtml
Cardinals: https://www.baseball-reference.com/teams/STL/2020.shtml
Cubs: https://www.baseball-reference.com/teams/CHC/2020.shtml
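The scraping step for the team pages listed above might look like the following sketch. The helpers `scrape_batting` and `scrape_all` are hypothetical names, and the `#team_batting` selector is an assumption about the page markup that should be checked against the page source. The scraped table will likely need further cleaning (repeated header rows, summary rows) before analysis.

```r
# Team codes are taken from the URLs listed above.
team_codes <- c(Reds = "CIN", Mets = "NYM", Braves = "ATL",
                Cardinals = "STL", Cubs = "CHC")

# Build the team page URL for each code.
team_urls <- sprintf("https://www.baseball-reference.com/teams/%s/2020.shtml",
                     team_codes)
names(team_urls) <- names(team_codes)

# Hypothetical helper: download one team page and return its batting table.
# ASSUMPTION: the batting table can be selected with "#team_batting".
scrape_batting <- function(url, team, selector = "#team_batting") {
  page <- rvest::read_html(url)
  node <- rvest::html_element(page, selector)
  tbl <- rvest::html_table(node)
  tbl$Team <- team   # remember which team each row came from
  Sys.sleep(3)       # pause between requests as a courtesy to the server
  tbl
}

# Hypothetical helper: scrape every team page and stack the results.
scrape_all <- function(urls) {
  tables <- Map(scrape_batting, urls, names(urls))
  dplyr::bind_rows(tables)
}

# batting <- scrape_all(team_urls)  # requires network access
```

The scraping call is left commented out so the block can be run without network access. Only the URL construction executes as written.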
Batting Analysis
The batting dataset has 121 rows and 29 columns. The first analysis looks at home runs by player age. The table below shows, for each age, the number of players, their combined home runs, and that total as a percentage of all home runs in the dataset (2301 home runs). The percentage is the age group's home runs divided by the total home runs, rounded to the nearest whole percent, so each age group can be read as a share of the whole dataset.
| Age | total_players | total_homeruns | total_HR_percent |
|---|---|---|---|
| 21 | 2 | 32 | 1 |
| 22 | 2 | 17 | 1 |
| 23 | 6 | 131 | 6 |
| 24 | 2 | 38 | 2 |
| 25 | 7 | 240 | 10 |
| 26 | 9 | 208 | 9 |
| 27 | 12 | 379 | 16 |
| 28 | 18 | 300 | 13 |
| 29 | 15 | 176 | 8 |
| 30 | 11 | 272 | 12 |
| 31 | 13 | 183 | 8 |
| 32 | 8 | 74 | 3 |
| 33 | 7 | 83 | 4 |
| 34 | 3 | 31 | 1 |
| 35 | 1 | 0 | 0 |
| 36 | 4 | 83 | 4 |
| 37 | 1 | 54 | 2 |
Younger players do not account for more home runs. Players aged 27 hit 16% of the home runs in this dataset, the largest share of any age group, followed by ages 28 (13%) and 30 (12%).
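The age-group percentages above can be reproduced with a short grouping step. The following is a minimal sketch in base R, not necessarily how the original table was produced. The helper name `summarise_by_age` is hypothetical, and the toy rows are made up for illustration.

```r
# Hypothetical helper: summarise one counting stat by age group.
# `data` needs an `Age` column plus the column named in `stat`.
summarise_by_age <- function(data, stat) {
  # Sum of the stat within each age group (result columns: Age, x).
  totals <- aggregate(data[[stat]], by = list(Age = data$Age), FUN = sum)
  names(totals)[2] <- "total"
  # Number of players (rows) within each age group, in the same Age order.
  counts <- aggregate(data[[stat]], by = list(Age = data$Age), FUN = length)
  totals$total_players <- counts$x
  # Share of the dataset-wide total, rounded to a whole percent.
  totals$percent <- round(100 * totals$total / sum(totals$total))
  totals[, c("Age", "total_players", "total", "percent")]
}

# Made-up rows for illustration only; not taken from the scraped data.
toy <- data.frame(Age = c(25, 25, 27, 27, 30),
                  HR  = c(10, 5, 20, 10, 5))
summarise_by_age(toy, "HR")
```

Applied to the real figures, the same arithmetic gives `round(100 * 379 / 2301)`, which is 16 for the age 27 group.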
The next graph plots games played against home runs.
The graph suggests that players who appear in more games hit more home runs in total.
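A correlation coefficient is one way to put a number on the relationship the graph suggests. The sketch below assumes the scraped table keeps games in a column named `G` and home runs in `HR`. The helper name `games_correlation` is hypothetical, and the toy rows are made up for illustration.

```r
# Hypothetical helper: Pearson correlation between games played and a stat.
# ASSUMPTION: games played are stored in a column named `G`.
games_correlation <- function(data, stat, games = "G") {
  # "complete.obs" drops rows where either value is missing.
  cor(data[[games]], data[[stat]], use = "complete.obs")
}

# Made-up rows for illustration only; not taken from the scraped data.
toy <- data.frame(G  = c(10, 40, 80, 120, 150),
                  HR = c(0, 3, 9, 15, 30))
hr_cor <- games_correlation(toy, "HR")
hr_cor
```

A value near 1 would support the reading of the graph, while a value near 0 would indicate no linear relationship.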
Stolen Base Analysis
The first graph plots games played against stolen bases. Stolen bases are then broken down by age group, with each group's total also shown as a percentage of all stolen bases in the dataset.
| Age | total_players | total_stolen_bases | total_SB_percent |
|---|---|---|---|
| 21 | 2 | 8 | 7 |
| 22 | 2 | 1 | 1 |
| 23 | 6 | 9 | 8 |
| 24 | 2 | 0 | 0 |
| 25 | 7 | 8 | 7 |
| 26 | 9 | 7 | 6 |
| 27 | 12 | 17 | 15 |
| 28 | 18 | 6 | 5 |
| 29 | 15 | 20 | 18 |
| 30 | 11 | 13 | 12 |
| 31 | 13 | 6 | 5 |
| 32 | 8 | 7 | 6 |
| 33 | 7 | 9 | 8 |
| 34 | 3 | 0 | 0 |
| 35 | 1 | 0 | 0 |
| 36 | 4 | 0 | 0 |
| 37 | 1 | 0 | 0 |
The graph suggests no obvious relationship between games played and stolen bases. Interestingly, the 29-year-old players have the most combined stolen bases of any age group (20, or 18% of the total). No player older than 33 in this dataset stole a base.
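Because the age groups differ in size (15 players at age 29 but only 2 at age 21), group totals partly reflect how many players are in each group. One possible check, offered here as a sketch and not as part of the original analysis, is to divide each total by the number of players. The values are copied from the stolen base table above, and the column name `sb_per_player` is introduced for illustration.

```r
# Values copied from the stolen base table above.
sb_by_age <- data.frame(
  Age = 21:37,
  total_players = c(2, 2, 6, 2, 7, 9, 12, 18, 15, 11, 13, 8, 7, 3, 1, 4, 1),
  total_stolen_bases = c(8, 1, 9, 0, 8, 7, 17, 6, 20, 13, 6, 7, 9, 0, 0, 0, 0)
)

# Average stolen bases per player in each age group.
sb_by_age$sb_per_player <- round(
  sb_by_age$total_stolen_bases / sb_by_age$total_players, 2
)

# Age groups ordered from highest to lowest per-player average.
sb_by_age[order(-sb_by_age$sb_per_player), ]
```

Averages for very small groups, such as the two players aged 21, rest on too few players to support firm conclusions.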
To extend this analysis, I would gather data from every major league team to see whether similar trends appear. A larger dataset would also support further analysis, such as slicing the data by fielding position. The sample data provides a snapshot of potential trends in home runs and stolen bases and how they relate to age.