Introduction

This assignment uses web scraping to examine a sample of MLB batting statistics. The sample includes batting statistics from the Reds, Cardinals, Braves, Mets, and Cubs. The analysis examines how player age relates to stolen bases and home runs.

Required Packages

This R Markdown document requires the following packages:

Package     Summary
readr       Read CSV files
dplyr       Data processing and analysis
sqldf       Using SQL in R
knitr       Rendering R Markdown documents
rmdformats  R Markdown themes
httr        HTTP requests, including web authentication
rvest       Tools for scraping and parsing HTML and XML

Sample of Batting MLB Data

The dataset combines the 2020 batting statistics from the following team pages.

Reds: https://www.baseball-reference.com/teams/CIN/2020.shtml

Mets: https://www.baseball-reference.com/teams/NYM/2020.shtml

Braves: https://www.baseball-reference.com/teams/ATL/2020.shtml

Cardinals: https://www.baseball-reference.com/teams/STL/2020.shtml

Cubs: https://www.baseball-reference.com/teams/CHC/2020.shtml
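
As a rough sketch of how the five pages might be combined, the R code below builds each team URL from its abbreviation and defines a helper that reads one batting table with rvest. The helper name read_team_batting and the table id team_batting are assumptions for illustration, not details taken from the original analysis, and the page markup may differ.

```r
# Team abbreviations as they appear in the URLs listed above.
teams <- c(Reds = "CIN", Mets = "NYM", Braves = "ATL",
           Cardinals = "STL", Cubs = "CHC")

# Build one URL per team for the 2020 season.
team_urls <- sprintf("https://www.baseball-reference.com/teams/%s/2020.shtml",
                     teams)
names(team_urls) <- names(teams)

# Hypothetical helper: read the batting table from one team page.
# The CSS selector "#team_batting" is an assumption about the page markup.
read_team_batting <- function(url, selector = "#team_batting") {
  page <- rvest::read_html(url)                # download and parse the page
  node <- rvest::html_element(page, selector)  # locate the batting table
  rvest::html_table(node)                      # convert it to a data frame
}

# Not run here because it needs network access, and the scraped tables
# may need cleaning before they can be stacked:
# batting <- dplyr::bind_rows(lapply(team_urls, read_team_batting))

print(team_urls)
```

Defining the helper does not contact the site. Only the commented-out call would download the pages.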

Batting Analysis

The batting dataset has 121 rows and 29 columns. The first analysis looks at home runs by player age. The first table shows, for each age, the number of players, their combined home runs, and that age group's share of all home runs. The share is the age group's home run total divided by the total for the whole dataset (2301 home runs), expressed as a percentage. This lets us see each age group as a portion of the whole dataset.

Batting Statistics
Age  total_players  total_homeruns  total_HR_percent
21   2              32              1
22   2              17              1
23   6              131             6
24   2              38              2
25   7              240             10
26   9              208             9
27   12             379             16
28   18             300             13
29   15             176             8
30   11             272             12
31   13             183             8
32   8              74              3
33   7              83              4
34   3              31              1
35   1              0               0
36   4              83              4
37   1              54              2
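
The percentage column can be reproduced from the age-group totals in the table above. The following is a minimal base R sketch that starts from those totals rather than from the player-level rows. The names batting_by_age and all_homeruns are chosen here for illustration, and the original analysis may have computed the column differently (for example with dplyr or sqldf).

```r
# Age-group totals copied from the Batting Statistics table.
batting_by_age <- data.frame(
  Age = 21:37,
  total_players = c(2, 2, 6, 2, 7, 9, 12, 18, 15, 11, 13, 8, 7, 3, 1, 4, 1),
  total_homeruns = c(32, 17, 131, 38, 240, 208, 379, 300, 176, 272, 183,
                     74, 83, 31, 0, 83, 54)
)

# Denominator: all home runs in the dataset.
all_homeruns <- sum(batting_by_age$total_homeruns)

# Share of each age group, rounded to a whole percent.
batting_by_age$total_HR_percent <-
  round(100 * batting_by_age$total_homeruns / all_homeruns)

print(all_homeruns)
print(batting_by_age)
```

The player counts sum to the 121 rows of the dataset and the home run totals sum to 2301, which matches the denominator stated in the text.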

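Being younger does not correspond to more home runs. Players aged 27 account for 16% of the home runs in this dataset, the largest share of any age group. Ages 28 and 30 follow with 13% and 12%, respectively. These shares also depend on group size: the sample has 12 players aged 27 and 18 aged 28, but only 2 each at ages 21, 22, and 24.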

The next graph plots games played against home runs.

The graph suggests that players who appear in more games tend to hit more home runs.

Stolen Base Analysis

The next graph plots games played against stolen bases. It is followed by a breakdown of stolen bases by age group.

Stolen Base Stats
Age  total_players  total_stolen_bases  total_SB_percent
21   2              8                   7
22   2              1                   1
23   6              9                   8
24   2              0                   0
25   7              8                   7
26   9              7                   6
27   12             17                  15
28   18             6                   5
29   15             20                  18
30   11             13                  12
31   13             6                   5
32   8              7                   6
33   7              9                   8
34   3              0                   0
35   1              0                   0
36   4              0                   0
37   1              0                   0

The graph suggests that there is no obvious relationship between games played and stolen bases. Interestingly, 29-year-old players have the most combined stolen bases of any age group (20, or 18% of the total). No player older than 33 has a stolen base in this dataset.
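
The age-group observations above can be checked directly against the Stolen Base Stats table. This is a small base R sketch using the table's totals. The variable names (stolen_by_age, top_age, top_share, sb_after_33) are illustrative and do not come from the original analysis.

```r
# Stolen base totals by age, copied from the Stolen Base Stats table.
stolen_by_age <- data.frame(
  Age = 21:37,
  total_stolen_bases = c(8, 1, 9, 0, 8, 7, 17, 6, 20, 13, 6, 7, 9,
                         0, 0, 0, 0)
)

sb <- stolen_by_age$total_stolen_bases

# Age group with the most combined stolen bases.
top_age <- stolen_by_age$Age[which.max(sb)]

# That group's share of all stolen bases, as a whole percent.
top_share <- round(100 * max(sb) / sum(sb))

# Combined stolen bases for players older than 33.
sb_after_33 <- sum(sb[stolen_by_age$Age > 33])

print(c(top_age = top_age, top_share = top_share, sb_after_33 = sb_after_33))
```

With the table's values this gives age 29 as the top group with an 18% share, and zero stolen bases after age 33.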

To extend this analysis, I would gather data from every major league team to see whether similar trends hold. More data would also allow further analysis, such as slicing the results by fielding position. The sample offers a snapshot of potential trends in home runs and stolen bases and how they relate to age.