Comparison

We will use the ggplot2 library, which lets us compare statistics and draw a scatter plot of the comparison. The Lahman data set contains the information we need for the plot. The data is taken from the 2014 season, and I am comparing at-bats to home runs hit.

First, we create the data frame to work with. The initial plot is simple: a basic scatter plot with at-bats on the x-axis and home runs on the y-axis. Each point in the first plot represents a professional baseball player who hit a home run in 2014.
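A minimal sketch of this step might look like the following. It assumes the `Batting` table from the Lahman package and uses dplyr for filtering; the name `batting_2014` and the `HR > 0` filter are my own choices for illustration, not necessarily what produced the original plot. Note that `Batting` has one row per player per stint, so a player traded mid-season would show up as more than one point.

```r
library(Lahman)   # provides the Batting table
library(dplyr)
library(ggplot2)

# Assumption: keep only 2014 rows where the player hit at least one home run.
batting_2014 <- Batting %>%
  filter(yearID == 2014, HR > 0)

# Basic scatter plot: at-bats on the x-axis, home runs on the y-axis.
p <- ggplot(batting_2014, aes(x = AB, y = HR)) +
  geom_point() +
  labs(x = "At-bats", y = "Home runs")

print(p)
```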

Now, let's add an extra layer. I want to add a line of best fit (a regression line) to show the type of relationship these two statistics have. A linear regression was chosen because it seemed to fit the scatter plot best. A confidence interval is also drawn around the regression line (the gray band around the line of best fit).
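One way to add this layer is with `geom_smooth(method = "lm")`, sketched below under the same assumptions as before (the `batting_2014` name and the `HR > 0` filter are illustrative). The confidence band is drawn by default because `se = TRUE`. Since no formula is passed, ggplot2 prints a message reporting the formula it used.

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Assumption: same 2014 subset as in the basic scatter plot.
batting_2014 <- Batting %>%
  filter(yearID == 2014, HR > 0)

# Scatter plot plus a linear fit; the gray band is the confidence interval.
p_lm <- ggplot(batting_2014, aes(x = AB, y = HR)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "At-bats", y = "Home runs")

print(p_lm)
```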

## `geom_smooth()` using formula 'y ~ x'

Additional Components

Let's add more components to the ggplot call. Color is added to represent the two leagues in Major League Baseball: the American League in red and the National League in blue. I also changed the color of the regression line to make it easier to notice.
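A sketch of the league coloring, assuming the `lgID` column of `Batting` identifies the league. The red and blue values come from the description above; the black regression line is only a guess at a color that stands out, since the original color is not stated.

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Assumption: same 2014 subset; droplevels() removes leagues not present in 2014.
batting_2014 <- Batting %>%
  filter(yearID == 2014, HR > 0) %>%
  droplevels()

# Color points by league; the color is mapped only in geom_point so that
# geom_smooth still fits a single line across both leagues.
p_league <- ggplot(batting_2014, aes(x = AB, y = HR)) +
  geom_point(aes(color = lgID)) +
  scale_color_manual(values = c(AL = "red", NL = "blue"), name = "League") +
  geom_smooth(method = "lm", color = "black") +  # line color is an assumption
  labs(x = "At-bats", y = "Home runs")

print(p_league)
```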

Instead of coloring each point by one of the two leagues, let's use a separate color for every MLB team. This information also comes from the Lahman data set.
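Switching to per-team colors could be as simple as mapping `teamID` instead of `lgID`, as in this sketch (same illustrative subset and the same assumed black regression line as before). ggplot2 picks the 30 colors automatically here.

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Assumption: same 2014 subset; droplevels() keeps only the teams active in 2014.
batting_2014 <- Batting %>%
  filter(yearID == 2014, HR > 0) %>%
  droplevels()

# One color per team, with a single regression line over all players.
p_team <- ggplot(batting_2014, aes(x = AB, y = HR)) +
  geom_point(aes(color = teamID)) +
  geom_smooth(method = "lm", color = "black") +  # line color is an assumption
  labs(x = "At-bats", y = "Home runs", color = "Team")

print(p_team)
```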

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

Although it is interesting to see all 30 teams represented, there is too much information on the plot, and it is difficult to tell the teams apart. Let's make another scatter plot that is easier to read but still conveys the same information.
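One way to simplify the plot is to total at-bats and home runs per team with `group_by()` and `summarise()`, then plot one point per team. This is a sketch only: the name `team_totals` is hypothetical, and summing over all 2014 batters (rather than only those with a home run) is my assumption about how the team totals were computed.

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Assumption: team totals are summed over every 2014 batting row.
team_totals <- Batting %>%
  filter(yearID == 2014) %>%
  group_by(teamID) %>%
  summarise(AB = sum(AB, na.rm = TRUE),
            HR = sum(HR, na.rm = TRUE))

# One point per team, with a regression line and its confidence band.
p_totals <- ggplot(team_totals, aes(x = AB, y = HR)) +
  geom_point(aes(color = teamID), size = 3) +
  geom_smooth(method = "lm", color = "black") +  # line color is an assumption
  labs(x = "Team at-bats", y = "Team home runs", color = "Team")

print(p_totals)
```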

## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using formula 'y ~ x'

The new scatter plot shows home runs vs. at-bats for each team as a whole, instead of for individual players. This is a more interesting comparison since there are far fewer points to compare. The scatter plot is now much less cluttered, and it is easier to pick out individual teams. The only issue is that some of the colors look very similar, but we can easily see outliers as well as the new regression line (with its confidence interval).