Hitters and Salary

Aadhya Sridharan

Baseball Salary Analysis

This project uses the Hitters baseball dataset to explore which factors are associated with player salary.

Data Description

The dataset contains performance statistics and salary information for baseball players.

Some of the variables that are used are:

• Salary
• Hits
• Runs
• League
• Division

The data comes from the Hitters.csv file used for this project.

Load Packages and Data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
Hitters <- read.csv("Hitters.csv")

Here I load the packages and the dataset needed for the project.

Clean the Data

Hitters <- na.omit(Hitters)

na.omit() removes every row that has a missing value in any column, which includes the rows with missing salary values, so the analysis can run correctly.
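To see which columns actually contain missing values before dropping rows, a quick check like the one below could be run first. This is a minimal sketch on a small made-up data frame (`toy` is hypothetical and only stands in for Hitters), so the counts are not results from the project data.

```r
# Hypothetical stand-in for Hitters, with two missing salaries
toy <- data.frame(
  Hits   = c(81, 130, 141, 87, 169),
  Runs   = c(24, 66, 65, 39, 74),
  Salary = c(NA, 475, 480, NA, 500)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# na.omit() drops any row that has an NA in any column
toy_clean <- na.omit(toy)
nrow(toy_clean)
```

Running `colSums(is.na(Hitters))` on the real data in the same way would show whether Salary is the only column with gaps.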

Summary Statistics

summary(Hitters$Salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    67.5   190.0   425.0   535.9   750.0  2460.0

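I decided to show the distribution of salary here for some general insight before viewing the graphs. The mean (535.9) is well above the median (425.0), an early sign that a few very high salaries pull the average up.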

Average Salary by League

Hitters %>%
  group_by(League) %>%
  summarise(mean_salary = mean(Salary))
## # A tibble: 2 × 2
##   League mean_salary
##   <chr>        <dbl>
## 1 A             542.
## 2 N             529.

This shows how the average salary differs between leagues. The two are fairly similar: league A averages about 542 and league N about 529. Because the gap is small, league by itself is probably not a strong influence on player salaries.
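One way to check whether a gap between two group means is larger than chance variation would be a two-sample t-test. The sketch below uses a made-up data frame (`toy` and its salaries are hypothetical), so its p-value says nothing about the real leagues. Because salaries are skewed, the test is only an approximate check.

```r
# Hypothetical salaries for two leagues (made-up numbers, not the project data)
toy <- data.frame(
  League = rep(c("A", "N"), each = 6),
  Salary = c(475, 480, 91.5, 750, 70, 100,
             500, 110, 75, 1100, 517, 512)
)

# Welch two-sample t-test: is the difference in mean salary larger than
# what random variation alone would typically produce?
league_test <- t.test(Salary ~ League, data = toy)
league_test$estimate   # the two group means
league_test$p.value    # a large p-value means no clear evidence of a difference
```

The same call with `data = Hitters` would apply the test to the project data.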

Scatterplot of Hits and Salary

ggplot(Hitters, aes(x = Hits, y = Salary)) +
  geom_point() +
  labs(title = "Hits and Salary", x = "Hits", y = "Salary")

This plot shows whether players with more hits tend to earn more. In this case they do, as we can see from the general upward trend: players with fewer hits tend to have lower salaries. However, the points are widely spread, so players with similar hits can still have very different salaries.
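The strength of an upward trend like this can be summarised with a correlation coefficient. The sketch below is a minimal example on made-up numbers (`toy` is hypothetical, not the project data). It also shows that, for a regression with one predictor, the squared correlation equals the model's R-squared.

```r
# Hypothetical hits and salaries (made-up numbers, not the project data)
toy <- data.frame(
  Hits   = c(60, 85, 100, 120, 140, 165, 180),
  Salary = c(150, 90, 400, 300, 900, 450, 1200)
)

# Pearson correlation: +1 is a perfect upward line, 0 is no linear trend
r <- cor(toy$Hits, toy$Salary)
r

# For a one-predictor regression, r squared equals the model's R-squared
r_squared <- summary(lm(Salary ~ Hits, data = toy))$r.squared
all.equal(r^2, r_squared)
```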

Boxplot of Salary by League

ggplot(Hitters, aes(x = League, y = Salary)) +
  geom_boxplot() +
  labs(title = "Salary by League", x = "League", y = "Salary")

This boxplot shows how salary is spread out within each league. Most players are clustered in the lower to middle salary range, and a small number have very high salaries. This makes the distribution right-skewed, with a large and noticeable gap between typical players and the top earners.
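A boxplot draws points as outliers when they lie more than 1.5 times the interquartile range beyond the box, which is the default for `geom_boxplot()`. As a rough worked example, the quartiles from the overall salary summary give the upper cutoff below. The plot itself computes quartiles separately for each league, so its cutoffs will differ somewhat.

```r
# Quartiles copied from the summary(Hitters$Salary) output above
q1 <- 190
q3 <- 750

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # salaries above this count as outliers
upper_fence

# The maximum salary in the summary output is 2460
2460 > upper_fence
```

The cutoff works out to 1590, so the top salary of 2460 sits far above it, which matches the long upper tail seen in the plot.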

Interactive Plotly Scatterplot

plot_ly(Hitters, x = ~Hits, y = ~Salary, color = ~League, type = 'scatter', mode = 'markers')

This uses the same idea as before, but the plot is interactive, which makes it easier to inspect individual points. There is a clear upward trend, where more hits are linked to higher salary. The two leagues are mixed together throughout, so league alone does not explain salary. Some players have similar hits but different salaries, which suggests that other factors also affect salary.

Three Variable Plotly Plot

plot_ly(Hitters, x = ~Hits, y = ~Runs, z = ~Salary, type = 'scatter3d', mode = 'markers')

This 3D plot shows the relationship between hits, runs, and salary. Players with more runs and hits tend to have a higher salary, with a clear upward pattern when both performance variables increase. Some players, though, have high performance and a low salary, which shows how much salaries vary between players. Once again, salary appears to be related to more than one variable, here both hits and runs.
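A pattern involving two predictors could be checked numerically with a regression that includes both. The sketch below fits one on made-up numbers (`toy` and `fit2` are hypothetical names, and the values are not the project data), so its coefficients are illustrative only.

```r
# Hypothetical hits, runs, and salaries (made-up numbers, not the project data)
toy <- data.frame(
  Hits   = c(60, 85, 100, 120, 140, 165, 180, 95),
  Runs   = c(25, 30, 55, 50, 80, 70, 95, 40),
  Salary = c(150, 90, 400, 300, 900, 450, 1200, 250)
)

# Regression with two predictors: each coefficient is the estimated change
# in salary for a one-unit change in that variable, holding the other fixed
fit2 <- lm(Salary ~ Hits + Runs, data = toy)
coef(fit2)
```

Hits and runs usually rise together, so their individual coefficients can be unstable. Such a model is best read as a whole rather than one coefficient at a time.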

Linear Regression

model <- lm(Salary ~ Hits, data = Hitters)
summary(model)
## 
## Call:
## lm(formula = Salary ~ Hits, data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -893.99 -245.63  -59.08  181.12 2059.90 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63.0488    64.9822   0.970    0.333    
## Hits          4.3854     0.5561   7.886 8.53e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 406.2 on 261 degrees of freedom
## Multiple R-squared:  0.1924, Adjusted R-squared:  0.1893 
## F-statistic: 62.19 on 1 and 261 DF,  p-value: 8.531e-14

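This regression tests whether hits help explain salary. The Hits coefficient is positive and statistically significant (about 4.39, p < 0.001), so each additional hit is associated with a salary about 4.39 units higher on average. However, the R-squared of about 0.19 means that hits alone account for only about 19% of the variation in salary.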
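As a worked example of reading the fitted line, the coefficients in the output above can be plugged in by hand. The player with 100 hits is hypothetical, and the names `intercept`, `slope`, and `predicted` are only for illustration.

```r
# Coefficients copied from the summary(model) output above
intercept <- 63.0488
slope     <- 4.3854

# Predicted salary for a hypothetical player with 100 hits
hits <- 100
predicted <- intercept + slope * hits
predicted
```

With the fitted model object available, `predict(model, newdata = data.frame(Hits = 100))` performs the same calculation. The residual standard error of 406.2 shows that an individual prediction like this is quite rough.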

Overall Interpretation

Conclusion

The dataset that I chose works well because it has a mix of numerical and categorical variables.

I mainly wanted to focus on salary, hits, and runs, and on how the leagues compare to each other.

This presentation is an attempt to organize code, output, plots, and statistical analysis in a clear flow and order.