For this project, I want to visualize how the number of home runs a baseball player hits relates to their salary. While a simple scatter plot comes to mind first, I am interested in making sure that the points are clearly visible and that each player's league is highlighted. The data set I will be using is the Hitters data frame from the ISLR2 library. As I work toward the best plot, I will consider new features to add and explain how each one supports the insight I am after as well as data visualization principles.

Visual 1:

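# Packages assumed from the functions used below: ISLR2 (Hitters data),
# tidyverse (%>%, drop_na, ggplot2), viridis (color scale), patchwork (combining plots)
library(ISLR2)
library(tidyverse)
library(viridis)
library(patchwork)

df = Hitters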

df %>%
  ggplot(aes(x = HmRun, y = Salary, color = League)) + 
  geom_point() +
  labs(x = "# Of Home Runs", y = "Salary In Thousands of Dollars",
       title = "Relationship Between Home Runs & Salaries of Baseball Players") + 
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 59 rows containing missing values (`geom_point()`).

In this first visual, I show a simple scatter plot of the number of home runs against salary for each player. In addition, I colored the points by league. The warning above indicates that 59 players have no recorded salary and are left out of the plot. This simple visual still leaves a lot of room for improvement, and we can draw more information from it for our research interest.
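
To see where the 59 removed rows come from, a quick check of missing values per column can help. This is a minimal sketch that assumes the ISLR2 package is installed; the name `missing_per_column` is only illustrative.

```r
library(ISLR2)

# Count missing values in every column of the Hitters data frame
missing_per_column <- colSums(is.na(Hitters))

# Show only the columns that actually contain missing values
missing_per_column[missing_per_column > 0]

# Number of rows that have at least one missing value
nrow(Hitters) - sum(complete.cases(Hitters))
```

If the count for `Salary` matches the number in the warning, the dropped points are players without a recorded salary rather than a plotting problem.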

Visual 2:

df %>%
  ggplot(aes(x = HmRun, y = Salary, color = League)) + 
  geom_point(alpha = 0.7, size = 3) +
  scale_color_viridis(discrete = TRUE, option = "C") +
  labs(x = "# Of Home Runs", y = "Salary In Thousands of Dollars",
       title = "Relationship Between Home Runs & Salaries of Baseball Players") +
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 59 rows containing missing values (`geom_point()`).

Continuing from the first visual, I noticed some problems that I wanted to address. First, the points were small and hard to follow, which made it difficult to tell the two leagues apart. Because of this, I added transparency to the points and increased their size. Finally, I changed the color scale so that the two leagues are shown in two clearly distinguishable colors.

Visual 3:

# Calculate the correlation coefficient:

df = df %>%
  drop_na()

cor_coef = cor(df$HmRun, df$Salary)

df %>%
  ggplot(aes(x = HmRun, y = Salary, color = League)) + 
  geom_point(alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_viridis(discrete = TRUE, option = "C") + theme_minimal() +
  labs(x = "# Of Home Runs", y = "Salary In Thousands of Dollars",
       title = "Relationship Between Home Runs & Salaries of Baseball Players") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  # annotate() draws the label once; geom_text() with aes() redraws it for every row
  annotate(
    "text",
    label = paste("Correlation: r =", round(cor_coef, 2)),
    x = min(df$HmRun), y = max(df$Salary), hjust = 0, vjust = 1,
    color = "black", fontface = "bold"
  )
## `geom_smooth()` using formula = 'y ~ x'

In the graphic above, I added a regression line to show the association between the number of home runs and player salaries. From the linear trend, we can conclude that there is a positive linear association between the two variables, with a correlation coefficient of 0.34, shown in the top left corner. In addition, to make the visual more pleasing to the eye, I applied a minimal theme as the last modification.
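
To make the link between the correlation coefficient and the regression line concrete, here is a minimal sketch that computes Pearson's r three ways on the complete cases. It assumes the ISLR2 package is installed; the names `hitters_complete`, `r_manual`, `r_builtin`, and `r_from_slope` are illustrative and not part of the analysis above.

```r
library(ISLR2)

# Keep only rows without missing values (same effect as drop_na())
hitters_complete <- Hitters[complete.cases(Hitters), ]
x <- hitters_complete$HmRun
y <- hitters_complete$Salary

# Pearson's r from its definition: covariation divided by the
# product of the two spreads
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The built-in function used in the plot code
r_builtin <- cor(x, y)

# The slope of the simple regression line, rescaled by the ratio of
# standard deviations, gives the same value
fit <- lm(Salary ~ HmRun, data = hitters_complete)
r_from_slope <- coef(fit)[["HmRun"]] * sd(x) / sd(y)

round(c(manual = r_manual, builtin = r_builtin, from_slope = r_from_slope), 2)
```

The three values agree, which is why the sign of r always matches the direction of the fitted line, while the steepness of the line also depends on the units of the two variables.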

Visual 4:

# Split the data by league and compute each league's correlation
league_a = df %>%
  filter(League == "A")

league_a_corr = cor(league_a$HmRun, league_a$Salary)

league_n = df %>%
  filter(League == "N")

league_n_corr = cor(league_n$HmRun, league_n$Salary)


# American League panel
league_a_visual = league_a %>%
  ggplot(aes(x = HmRun, y = Salary, color = League)) +
  geom_point(alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  # guide = "none" hides the legend (guide = FALSE is deprecated)
  scale_color_manual(values = c("A" = "#39568cff"), guide = "none") +
  theme_minimal() +
  labs(x = "# Of Home Runs", y = "Salary In Thousands of Dollars") +
  # annotate() draws the label once rather than once per row
  annotate(
    "text",
    label = paste("Correlation: r =", round(league_a_corr, 2)),
    x = min(league_a$HmRun), y = max(league_a$Salary), hjust = 0, vjust = 1,
    color = "black", fontface = "bold"
  )

# National League panel
league_n_visual = league_n %>%
  ggplot(aes(x = HmRun, y = Salary, color = League)) +
  geom_point(alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_manual(values = c("N" = "#dce319ff"), guide = "none") +
  theme_minimal() +
  labs(x = "# Of Home Runs", y = "Salary In Thousands of Dollars") +
  annotate(
    "text",
    label = paste("Correlation: r =", round(league_n_corr, 2)),
    x = min(league_n$HmRun), y = max(league_n$Salary), hjust = 0, vjust = 1,
    color = "black", fontface = "bold"
  )

# Combine both panels under one shared, centered title
side_by_side_plots = league_a_visual + league_n_visual +
  plot_layout(guides = "collect") +
  plot_annotation(
    title = "Home Runs By Salaries in Both American (Left) and National (Right) Leagues",
    theme = theme(plot.title = element_text(hjust = 0.5))
  )

side_by_side_plots
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

While the last visual contained a numeric value for the correlation between home runs and salary, it did not separate that correlation coefficient by league. In this final visual, I have separated the players by league and mapped out a regression line along with the correlation coefficient for each. In addition, I removed the legend, as it may have caused some confusion for the viewer, and grouped both plots under one title to help the reader interpret them quickly. Both plots show a positive linear relationship between the number of home runs and player salaries. However, one interesting observation that we could not make before is that the relationship looks stronger in the American League: players there who hit more home runs appear more likely to earn a higher salary than their National League counterparts. Because of this further insight, I am confident that this is the best plot of the batch.
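
The league comparison can also be checked numerically rather than read off the two panels. The sketch below is one possible way to do that, assuming the ISLR2 package is installed; `hitters_complete`, `league_summary`, and `interaction_fit` are illustrative names, and the interaction model is an addition for illustration, not part of the plots above.

```r
library(ISLR2)

# Keep players with a recorded salary
hitters_complete <- Hitters[!is.na(Hitters$Salary), ]

# For each league: sample size, correlation, and regression slope
# (slope = change in salary, in thousands of dollars, per extra home run)
by_league <- split(hitters_complete, hitters_complete$League)
league_summary <- sapply(by_league, function(d) {
  fit <- lm(Salary ~ HmRun, data = d)
  c(n = nrow(d),
    r = cor(d$HmRun, d$Salary),
    slope = coef(fit)[["HmRun"]])
})
round(league_summary, 2)

# One model with an interaction term: the HmRun:LeagueN coefficient is the
# difference between the National and American League slopes
interaction_fit <- lm(Salary ~ HmRun * League, data = hitters_complete)
summary(interaction_fit)$coefficients
```

Comparing the `slope` row speaks to how much more salary goes with each extra home run in each league, while the `r` row speaks to how tightly the points follow the line; the two can differ, so it is worth looking at both before drawing a conclusion.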