Aadhya Sridharan
This project uses the Hitters baseball dataset to examine which factors affect player salaries.
The dataset has statistics and salary information for the baseball players.
Some of the variables that are used are:
• Salary
• Hits
• Runs
• League
• Division
The data comes from the Hitters.csv file used for this project.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Here I load the packages and the dataset needed for the project.
This removes rows with missing salary values so the analysis can run correctly.
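A minimal sketch of this filtering step, using a small made-up data frame in place of Hitters.csv. The name `toy` and its values are illustrative assumptions, not the project's data, and the project's own code may filter differently.

```r
# Toy stand-in for the Hitters data; the values are made up for illustration.
toy <- data.frame(
  Hits   = c(81, 130, 141, 87, 169),
  Salary = c(NA, 475, 480, NA, 750)
)

# Keep only the rows where Salary is present.
toy_clean <- toy[!is.na(toy$Salary), ]

nrow(toy_clean)            # number of rows left after the filter
summary(toy_clean$Salary)  # same kind of summary as the salary output in this report
```

With dplyr loaded, `filter(toy, !is.na(Salary))` keeps the same rows.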
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 67.5 190.0 425.0 535.9 750.0 2460.0
I decided to show the distribution of salary here to give some general insight before viewing the graphs.
## # A tibble: 2 × 2
## League mean_salary
## <chr> <dbl>
## 1 A 542.
## 2 N 529.
This shows how the average salary differs between leagues. The two means are similar: League A's (about 542) is only about 13 higher than League N's (about 529). Because the gap is small, league is probably not a strong driver of player salaries.
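A per-league mean like the one in the table can be sketched in base R as follows. The data frame `toy` is made up for illustration, and `aggregate()` is used here as an assumption about one way to do it. The tibble output above suggests the report itself used dplyr.

```r
# Toy stand-in for the Hitters data; the values are made up for illustration.
toy <- data.frame(
  League = c("A", "A", "N", "N", "N"),
  Salary = c(400, 600, 300, 500, 640)
)

# Mean salary within each league.
mean_by_league <- aggregate(Salary ~ League, data = toy, FUN = mean)
names(mean_by_league)[2] <- "mean_salary"
mean_by_league

# Difference between the two league means (N minus A).
gap <- diff(mean_by_league$mean_salary)
gap
```

A dplyr version would be `toy %>% group_by(League) %>% summarise(mean_salary = mean(Salary))`.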
ggplot(Hitters, aes(x = Hits, y = Salary)) +
  geom_point() +
  labs(title = "Hits and Salary", x = "Hits", y = "Salary")
This plot shows whether players with more hits tend to earn more. In this case they do: there is a general upward trend, and players with fewer hits tend to have lower salaries. The points are widely spread, though, so players with similar hits can still have quite different salaries.
ggplot(Hitters, aes(x = League, y = Salary)) +
  geom_boxplot() +
  labs(title = "Salary by League", x = "League", y = "Salary")
This box plot shows how salary is spread out within each league. Most players seem to be clustered in the lower to middle salary range, and a very small number have very high salaries. This makes the distribution right-skewed, with a large and noticeable gap between typical players and the top earners.
This plot follows the same idea as the earlier hits-and-salary plot, but it is interactive, which makes the details of the graph easier to inspect. There is a clear upward trend, where more hits are linked to higher salary. The leagues are mixed together, though, so league alone does not explain salary. Some players have similar hits but different salaries, which suggests that other factors also affect salary.
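The code for the interactive plot is not shown in this output. One way to build a plot like it is to wrap a ggplot in `ggplotly()` from plotly. This is only a sketch: the data frame `toy` is made up, and colouring the points by League is an assumption based on the description above.

```r
library(ggplot2)
library(plotly)

# Toy stand-in for the Hitters data; the values are made up for illustration.
toy <- data.frame(
  Hits   = c(60, 90, 120, 150, 180, 100),
  Salary = c(150, 300, 450, 400, 900, 700),
  League = c("A", "N", "A", "N", "A", "N")
)

# Static scatter plot, coloured by league.
p <- ggplot(toy, aes(x = Hits, y = Salary, color = League)) +
  geom_point() +
  labs(title = "Hits and Salary by League", x = "Hits", y = "Salary")

# Convert the static plot into an interactive one with hover details.
fig <- ggplotly(p)
fig
```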
This 3D plot shows the relationship between hits, runs, and salary. Players with more runs and hits tend to have a higher salary, with a clear upward pattern when both performance variables increase. Some players, though, have high performance and a low salary, which shows how much salaries vary between players. Once again, salary is not linked to just one variable, but to both hits and runs.
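The code for the 3D plot is not shown in this output either. A plot of this kind can be sketched with plotly's `plot_ly()` and a `scatter3d` trace. The data frame `toy` and the plot title are illustrative assumptions.

```r
library(plotly)

# Toy stand-in for the Hitters data; the values are made up for illustration.
toy <- data.frame(
  Hits   = c(60, 90, 120, 150, 180),
  Runs   = c(25, 40, 55, 70, 95),
  Salary = c(150, 300, 450, 400, 900)
)

# One marker per player, positioned by hits, runs, and salary.
fig <- plot_ly(
  toy,
  x = ~Hits, y = ~Runs, z = ~Salary,
  type = "scatter3d", mode = "markers"
)
fig <- layout(fig, title = "Hits, Runs, and Salary")
fig
```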
##
## Call:
## lm(formula = Salary ~ Hits, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -893.99 -245.63 -59.08 181.12 2059.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.0488 64.9822 0.970 0.333
## Hits 4.3854 0.5561 7.886 8.53e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 406.2 on 261 degrees of freedom
## Multiple R-squared: 0.1924, Adjusted R-squared: 0.1893
## F-statistic: 62.19 on 1 and 261 DF, p-value: 8.531e-14
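This regression tests whether hits help explain salary. The Hits coefficient is positive (about 4.39 in salary units per additional hit) and statistically significant (p < 0.001), but the R-squared of 0.19 means that hits alone explain only about 19% of the variation in salary.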
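To make the fitted line concrete, the coefficients from the lm() summary above can be plugged in by hand. This is a small worked sketch: `predict_salary` is a hypothetical helper written for illustration, and the numbers are copied from the printed output, not refitted.

```r
# Coefficients copied from the lm() summary printed above.
intercept  <- 63.0488
slope_hits <- 4.3854
se_hits    <- 0.5561

# Hypothetical helper: fitted salary for a given number of hits.
predict_salary <- function(hits) intercept + slope_hits * hits

predict_salary(100)                        # about 501.59
predict_salary(150) - predict_salary(100)  # 50 extra hits: about 219.27 more

# The t value in the table is the estimate divided by its standard error.
t_hits <- slope_hits / se_hits
round(t_hits, 3)                           # 7.886, matching the summary
```

With the model object available, `predict(model, newdata = data.frame(Hits = 100))` would give the same fitted value, up to rounding of the printed coefficients.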
Overall Interpretation
Players with more hits generally tend to have higher salaries.
There is variation in salary between the leagues, but not a large one.
Runs and hits together show that overall performance is linked to salary.
The regression shows that there is in fact a positive relationship between salary and hits.
The dataset that I chose works well because it has a mix of numerical and categorical variables.
I mainly wanted to focus on salary, hits, and runs, and on how the leagues compare to each other.
This presentation is an attempt to organize code, output, plots, and statistical analysis in a clear flow and order.