MLB Player Data

This data comes from the Lahman data set, which has MLB player stats dating all the way back to 1871. I chose to focus only on active players, so the players here appeared between 2005 and 2019. The set is restricted to players who start almost every game, averaging at least 3.1 plate appearances per game. In other words, these are the starting (regular) players on each of the 30 MLB teams. We have 130 observations and 35 variables.

Variable Definitions

RK: rank. The rank of the player (in a given year) based on their batting average.
Player: Name of the player.
Pos: Player's position.
- 1B: first base
- SS: shortstop
- 2B: second base
- 3B: third base
- CF: center field
- RF: right field
- LF: left field
- C: catcher
- DH: designated hitter (only in the AL)
- OF: outfielder
AB: number of official at bats by a batter. (This is plate appearances minus walks, "hit by pitches", and sacrifices.)
HR: number of home runs hit by the batter.
RBI: runs batted in. The number of runs that score as a result of the batter's plate appearance. (If the bases are loaded and the batter hits a HR, the RBI is 4.)
AVG: batting average. Hits divided by at bats, i.e. how often a player gets a hit per at bat.
OBP: on-base percentage. How frequently a player gets on base per plate appearance.
SLG: slugging percentage. Total bases divided by at bats. Unlike batting average, it weights singles, doubles, triples, and HRs by the number of bases gained, so a higher SLG means a player is more "productive" when hitting.
OPS: on-base plus slugging. The sum of OBP and SLG, which measures the ability of a player to get on base AND hit for power.
PA: plate appearances.
NP: number of pitches thrown during all of the batter's plate appearances.
Pitches_Per_PA: number of pitches a player sees per plate appearance (NP divided by PA).
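
To make the rate definitions above concrete, here is a minimal Python sketch (the analysis itself was run in R) that derives AVG, OBP, SLG, OPS, and pitches per plate appearance from raw counts. The `BattingLine` class, its field names, and the example numbers are hypothetical and are not taken from the data set; the OBP formula used is the standard one, (H + BB + HBP) / (AB + BB + HBP + SF).

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """Raw counting stats for one player (hypothetical field names)."""
    ab: int       # at bats
    h: int        # hits
    doubles: int
    triples: int
    hr: int       # home runs
    bb: int       # walks
    hbp: int      # hit by pitches
    sf: int       # sacrifice flies
    pa: int       # plate appearances
    np: int       # pitches seen


def rate_stats(line: BattingLine) -> dict:
    """Compute the rate statistics defined in the variable list."""
    singles = line.h - line.doubles - line.triples - line.hr
    total_bases = singles + 2 * line.doubles + 3 * line.triples + 4 * line.hr
    avg = line.h / line.ab
    obp = (line.h + line.bb + line.hbp) / (line.ab + line.bb + line.hbp + line.sf)
    slg = total_bases / line.ab
    return {
        "AVG": avg,
        "OBP": obp,
        "SLG": slg,
        "OPS": obp + slg,
        "Pitches_Per_PA": line.np / line.pa,
    }


# Made-up season line, for illustration only.
example = BattingLine(ab=500, h=150, doubles=30, triples=5, hr=20,
                      bb=50, hbp=5, sf=5, pa=560, np=2184)
example_stats = rate_stats(example)
for name, value in example_stats.items():
    print(f"{name}: {value:.3f}")
```

Because OPS is simply OBP plus SLG, it is strongly tied to SLG by construction, which is worth keeping in mind when both appear in the same model.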

1. Are player AVG, HR, OPS, and SLG normally distributed?

Test for normality of individual variables:

## $AVG
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.98555, p-value = 0.1855
## 
## 
## $HR
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.90484, p-value = 1.409e-07
## 
## 
## $OPS
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.96792, p-value = 0.003582
## 
## 
## $SLG
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.97975, p-value = 0.04916
## 
##  Generalized Shapiro-Wilk test for Multivariate Normality by
##  Villasenor-Alva and Gonzalez-Estrada
## 
## data:  as.matrix(player_data_1[, 3:6])
## MVW = 0.9579, p-value = 7.298e-10

Looking at the histograms (the mean is shown by the blue dashed line), AVG looks roughly normal, although it also looks multimodal. Home runs (HR) and slugging (SLG) are skewed right. OPS is slightly skewed right, but not as much as HR and SLG.
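
One way to put a number on "skewed right" is the sample skewness, which is positive when the right tail is longer. The following is a small Python sketch of that calculation; the function name and the home run totals are made up for illustration and are not values from the data set.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical home run totals with one large value in the right tail.
hypothetical_hr = [5, 8, 10, 12, 15, 18, 22, 45]
# A perfectly symmetric sample, for comparison.
symmetric = [1, 2, 3, 4, 5]

hr_skew = sample_skewness(hypothetical_hr)
symmetric_skew = sample_skewness(symmetric)
print(f"hypothetical HR skewness: {hr_skew:.3f}")
print(f"symmetric sample skewness: {symmetric_skew:.3f}")
```

A clearly positive value supports what the histogram suggests visually, while a value near zero is what a symmetric distribution would give.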

The individual tests for normality show that home runs (HR) and OPS are not normally distributed, given their very low p-values (1.409e-07 and 0.003582). For batting average (AVG), the p-value of 0.1855 is large, so we fail to reject the null hypothesis of normality. Slugging (SLG) is close to the .05 threshold, but its p-value of 0.04916 is below it, so we reject the null hypothesis and conclude that SLG is not normally distributed.

Looking at the multivariate test for normality, we have an extremely low p-value of 7.298e-10, meaning that we reject the null hypothesis and conclude that AVG, HR, OPS, and SLG are not jointly (multivariate) normal.
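
The decisions above all follow the same rule: reject normality when the p-value falls below the significance level. As a small Python sketch, assuming a level of .05 (the threshold mentioned for SLG), the rule applied to the p-values copied from the output looks like this. The names `reported_p` and `decision` are illustrative.

```python
ALPHA = 0.05  # assumed significance level, matching the .05 threshold in the text

# p-values copied from the Shapiro-Wilk output above
reported_p = {
    "AVG": 0.1855,
    "HR": 1.409e-07,
    "OPS": 0.003582,
    "SLG": 0.04916,
    "multivariate": 7.298e-10,
}


def decision(p_value, alpha=ALPHA):
    """Return the test decision for the null hypothesis of normality."""
    return "reject normality" if p_value < alpha else "fail to reject normality"


decisions = {name: decision(p) for name, p in reported_p.items()}
for name, outcome in decisions.items():
    print(f"{name}: p = {reported_p[name]:.4g} -> {outcome}")
```

Note that failing to reject is not proof of normality; it only means the test found no evidence against it.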

2. Is there a difference between outfield hitters and infield hitters in terms of batting average, slugging, pitches per plate appearance, home runs per at bat, and RBIs per at bat?

MANOVA Test between Infielders and Outfielders

## 
##  Generalized Shapiro-Wilk test for Multivariate Normality by
##  Villasenor-Alva and Gonzalez-Estrada
## 
## data:  as.matrix(player_data_2[, 2:6])
## MVW = 0.97601, p-value = 0.5259
## 
##  Generalized Shapiro-Wilk test for Multivariate Normality by
##  Villasenor-Alva and Gonzalez-Estrada
## 
## data:  as.matrix(player_data_3[, 2:6])
## MVW = 0.9869, p-value = 0.1486

Checking multivariate normality of the dependent variables within each group, we have a p-value of 0.5259 for the outfielders and a p-value of 0.1486 for the infielders. Neither generalized Shapiro-Wilk test gives evidence against the null hypothesis of multivariate normality, so we treat both groups as multivariate normal. Our data is also independent.

## 
## Type II MANOVA Tests:
## 
## Sum of squares and products for error:
##                         AVG        SLG Pitches_Per_PA    HR_Per_AB  RBI_Per_AB
## AVG             0.050785048 0.04769144     -0.1742461 -0.004056867 0.008226136
## SLG             0.047691440 0.28892817      0.4028525  0.075994742 0.144691019
## Pitches_Per_PA -0.174246075 0.40285246      5.5767763  0.187092969 0.251097138
## HR_Per_AB      -0.004056867 0.07599474      0.1870930  0.027197865 0.045953598
## RBI_Per_AB      0.008226136 0.14469102      0.2510971  0.045953598 0.094510425
## 
## ------------------------------------------
##  
## Term: Pos_Num 
## 
## Sum of squares and products for the hypothesis:
##                          AVG           SLG Pitches_Per_PA     HR_Per_AB
## AVG             4.597290e-04  1.112183e-03   0.0061540943  3.706795e-05
## SLG             1.112183e-03  2.690611e-03   0.0148880798  8.967536e-05
## Pitches_Per_PA  6.154094e-03  1.488808e-02   0.0823808772  4.962047e-04
## HR_Per_AB       3.706795e-05  8.967536e-05   0.0004962047  2.988789e-06
## RBI_Per_AB     -5.200859e-04 -1.258200e-03  -0.0069620539 -4.193453e-05
##                   RBI_Per_AB
## AVG            -5.200859e-04
## SLG            -1.258200e-03
## Pitches_Per_PA -6.962054e-03
## HR_Per_AB      -4.193453e-05
## RBI_Per_AB      5.883671e-04
## 
## Multivariate Tests: Pos_Num
##                  Df test stat approx F num Df den Df    Pr(>F)   
## Pillai            1 0.1324599 3.786575      5    124 0.0031682 **
## Wilks             1 0.8675401 3.786575      5    124 0.0031682 **
## Hotelling-Lawley  1 0.1526845 3.786575      5    124 0.0031682 **
## Roy               1 0.1526845 3.786575      5    124 0.0031682 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null Hypothesis: The mean vectors of AVG, SLG, Pitches_Per_PA, HR_Per_AB, and RBI_Per_AB (the five response variables in the output above) are equal for infielders and outfielders.

The value of the Wilks criterion is 0.8675401 and the corresponding F-statistic is 3.786575 with (5, 124) degrees of freedom. The p-value is 0.0031682, which is small, meaning we have strong evidence against the null hypothesis that the mean vectors of AVG, SLG, Pitches_Per_PA, HR_Per_AB, and RBI_Per_AB are equal for infielders and outfielders.
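
With only two groups the hypothesis has one degree of freedom, and in that case the four multivariate statistics are functions of one another and share the same exact F value. This is why all four rows of the table report the same F and p-value. The Python sketch below checks these relationships using the numbers printed in the output; the function name is illustrative, and the p-value itself is not recomputed here.

```python
def wilks_to_f(wilks_lambda, num_df, den_df):
    """Exact F statistic from Wilks' lambda when the hypothesis has 1 degree of freedom."""
    return (1.0 - wilks_lambda) / wilks_lambda * den_df / num_df


# Values copied from the MANOVA output above.
wilks = 0.8675401
num_df = 5      # number of response variables
den_df = 124    # 130 observations - 5 responses - 1

f_stat = wilks_to_f(wilks, num_df, den_df)
pillai = 1.0 - wilks                     # Pillai's trace in the two-group case
hotelling_lawley = (1.0 - wilks) / wilks  # also Roy's statistic as printed in the table

print(f"F = {f_stat:.6f}")
print(f"Pillai = {pillai:.7f}")
print(f"Hotelling-Lawley = {hotelling_lawley:.7f}")
```

The recomputed values agree with the table (F of about 3.7866, Pillai of about 0.1325, Hotelling-Lawley of about 0.1527), which is a quick sanity check on the reported output.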

3. Is there a significant difference between the top 10 hitters and the rest of the everyday players?

Inference for Means:

## 
##  Generalized Shapiro-Wilk test for Multivariate Normality by
##  Villasenor-Alva and Gonzalez-Estrada
## 
## data:  as.matrix(player_data_Top_10[, 1:3])
## MVW = 0.98474, p-value = 0.07137
## 
##  Hotelling's one sample T2-test
## 
## data:  player_data_Top_10[1:10, 1:3]
## T.2 = 109.89, df1 = 3, df2 = 7, p-value = 2.993e-06
## alternative hypothesis: true location is not equal to c(0.268851955870199,0.437690784474327,3.84767287480817)

Checking the generalized Shapiro-Wilk test, we can see that when looking at batting average (AVG), slugging percentage (SLG), and pitches per plate appearance, we have no evidence against multivariate normality, because the p-value of 0.07137 is greater than .05. Our data is also independent, so both conditions for using Hotelling's T2 test are met.

Null hypothesis: The mean vector of AVG, SLG, and pitches per plate appearance for the top 10 players by rank is equal to the rest of the league's mean vector (about 0.269, 0.438, and 3.85, the hypothesized location in the output above).

Alternative hypothesis: The top 10 players have a different mean vector for those three statistics than the rest of the league.

T.2 = 109.89, df1 = 3, df2 = 7

p-value = 2.993e-06

Due to our extremely low p-value of 2.993e-06, we have overwhelming evidence against the null hypothesis. Answering our question, the top 10 players have a statistically different mean vector of AVG, SLG, and pitches per plate appearance than the rest of the league.
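
The degrees of freedom in the output come from the standard conversion of a one-sample Hotelling's T2 statistic to an F statistic: with n observations and p variables, F = T2 (n - p) / (p (n - 1)) on (p, n - p) degrees of freedom. Here is a short Python sketch of that conversion using the reported T2, n = 10 players, and p = 3 variables. The function name is illustrative, and the p-value is not recomputed.

```python
def hotelling_t2_to_f(t2, n, p):
    """Convert a one-sample Hotelling T2 statistic to its F statistic and degrees of freedom."""
    df1 = p
    df2 = n - p
    f_stat = t2 * (n - p) / (p * (n - 1))
    return f_stat, df1, df2


# T2 copied from the output above; 10 top players, 3 variables.
f_stat, df1, df2 = hotelling_t2_to_f(109.89, n=10, p=3)
print(f"F = {f_stat:.2f} on ({df1}, {df2}) degrees of freedom")
```

This reproduces df1 = 3 and df2 = 7 from the output, with an F statistic of about 28.49.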

4. Are the career stats of Most Valuable Player (MVP) winners different from those of the rest of the league?

Bonferroni confidence intervals for the difference of means (Non-MVP vs MVP)

##                         AVG         HR         OPS         SLG
## lower_limit_1v2 -0.04156749 -190.02826 -0.17847485 -0.11831569
## upper_limit_1v2 -0.01242121  -50.83897 -0.09386978 -0.05402047

We have our Non-MVP group as group 1 and our MVP winners as group 2, so each interval is for the Non-MVP mean minus the MVP mean. Looking at our 95 percent Bonferroni confidence intervals for the difference between the mean vectors of Non-MVP and MVP winners, we can see that none of the intervals contains 0 and all of them are entirely negative, meaning that MVP winners have higher stats in all 4 categories.

When looking at AVG, the confidence interval shows a difference of only about .01 to .04 in batting average, which is not a huge difference. When looking at SLG, the confidence interval shows a difference of only about .05 to .12 in slugging percentage, which again is not a huge difference. The same goes for OPS, where there is only a small difference of about .09 to .18.

When looking at the number of home runs, there is a large difference, with an interval of about 51 to 190 more career home runs hit by MVP winners than by non-MVP players. This might be because MVP winners tend to have longer careers, as they are very talented.
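
The Bonferroni adjustment behind these intervals splits the overall error rate across the simultaneous comparisons. As a Python sketch, assuming the 5 percent overall rate is divided equally among the m = 4 intervals, the per-interval level and the zero check on the reported limits look like this. The helper names are illustrative; the limits are copied from the output above.

```python
ALPHA = 0.05  # overall error rate implied by "95 percent"

# (lower, upper) limits for Non-MVP minus MVP, copied from the output above
intervals = {
    "AVG": (-0.04156749, -0.01242121),
    "HR": (-190.02826, -50.83897),
    "OPS": (-0.17847485, -0.09386978),
    "SLG": (-0.11831569, -0.05402047),
}

m = len(intervals)
per_interval_alpha = ALPHA / m       # error rate allotted to each interval
per_tail = per_interval_alpha / 2    # two-sided intervals split it again


def excludes_zero(lower, upper):
    """True when the interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0


def flip(lower, upper):
    """Re-express a (group 1 - group 2) interval as (group 2 - group 1)."""
    return (-upper, -lower)


all_exclude_zero = all(excludes_zero(lo, hi) for lo, hi in intervals.values())
mvp_minus_non_mvp = {name: flip(lo, hi) for name, (lo, hi) in intervals.items()}

print(f"per-interval alpha = {per_interval_alpha}, per-tail = {per_tail}")
print(f"all intervals exclude zero: {all_exclude_zero}")
for name, (lo, hi) in mvp_minus_non_mvp.items():
    print(f"{name}: MVP - Non-MVP between {lo:.4g} and {hi:.4g}")
```

Flipping the sign gives the intervals in the more natural "MVP minus Non-MVP" direction, which is how the differences are described in the text.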

5. Creating two classifiers that identify whether a player is in the top 30 by AVG.

## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## [1] "LDA:"
##             
##              Not_top_30 top_30
##   Not_top_30         97      3
##   top_30              5     25
## [1] "LDA with CSV:"
##             
##              Not_top_30 top_30
##   Not_top_30         95      5
##   top_30              5     25
## [1] "QDA:"
##             
##              Not_top_30 top_30
##   Not_top_30         99      1
##   top_30              1     29
## [1] "QDA with CSV:"
##             
##              Not_top_30 top_30
##   Not_top_30         98      2
##   top_30              5     25
## [1] 0.010142602 0.015472371 0.026702317 0.008377897 0.011390374 0.014937611

We can see that the LDA and QDA models have similar error when the priors are .9 and .1. However, the QDA model does better at all three sets of priors, especially .85/.15 and .8/.2. Our best QDA model has a mean error of about .01. This suggests that we have a really strong model and that OPS, SLG, Pitches_Per_PA, XBH_Per_AB, RBI_Per_AB, and HR_Per_AB are good predictors of whether a player ranks in the top 30 by batting average.
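
For reference, the overall misclassification rate of each printed confusion table is the share of off-diagonal counts. The Python sketch below computes it for the four tables copied from the output. The helper name is illustrative, and these per-table rates are a different quantity from the six values printed after the tables, whose computation is not shown in this report.

```python
# Confusion tables copied from the output above (rows and columns: Not_top_30, top_30).
tables = {
    "LDA": [[97, 3], [5, 25]],
    "LDA with CSV": [[95, 5], [5, 25]],
    "QDA": [[99, 1], [1, 29]],
    "QDA with CSV": [[98, 2], [5, 25]],
}


def misclassification_rate(table):
    """Share of observations that fall off the diagonal of a square confusion table."""
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    return (total - correct) / total


rates = {name: misclassification_rate(table) for name, table in tables.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.4f}")
```

On these tables QDA has the lowest rate (2 of 130 players misclassified), which points in the same direction as the comparison in the text.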

6. Clustering using Batting Average, On-Base Percentage, Slugging Percentage, OPS, Pitches Per Plate Appearance, HR Per At Bat, and RBI Per At Bat

From the dendrograms, I decided to look at 4 clusters for Euclidean distances and 3 clusters for Canberra distances. The graphs of the clusters are shown, and both look like good clusterings: the points in each cluster do not overlap each other very much, the clusters have fairly uniform shapes, and the different colors are clearly identifiable in the plots.
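
The two distances can group players differently because they treat scale differently: Euclidean distance is dominated by variables with large raw differences, while Canberra distance sums relative differences. Here is a minimal Python sketch of both, using two made-up players described only by (AVG, Pitches_Per_PA); the function names and values are illustrative and not taken from the data.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|), skipping terms where both values are zero."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


# Hypothetical players: (AVG, Pitches_Per_PA)
player_a = (0.300, 3.5)
player_b = (0.250, 4.0)

d_euclid = euclidean(player_a, player_b)
d_canberra = canberra(player_a, player_b)
print(f"Euclidean: {d_euclid:.5f}")
print(f"Canberra:  {d_canberra:.5f}")
```

In this example the AVG gap contributes about 1 percent of the squared Euclidean distance but more than half of the Canberra distance, so the choice of metric changes which variables drive the clustering.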

Looking at the last 2 graphs, we can see that the Calinski-Harabasz (CH) index is highest at 3 clusters for both the Euclidean and the Canberra distances. This agrees with what I saw earlier in the dendrogram for Canberra distances, although not with the 4 clusters I picked from the Euclidean dendrogram. We have a good clustering with 3 clusters for either Euclidean or Canberra distances when using Batting Average, On-Base Percentage, Slugging Percentage, OPS, Pitches Per Plate Appearance, HRs Per At Bat, and RBIs Per At Bat.
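
The CH index compares between-cluster spread to within-cluster spread: CH = [B / (k - 1)] / [W / (n - k)], where B and W are the between- and within-cluster sums of squares, k is the number of clusters, and n is the number of points. Larger values indicate tighter, better-separated clusters. The following Python sketch computes it for a tiny made-up data set; the function name, points, and labels are illustrative and not the report's data.

```python
def ch_index(points, labels):
    """Calinski-Harabasz index for points (tuples of floats) and their cluster labels."""
    n = len(points)
    dims = len(points[0])

    # Group the points by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)

    overall = [sum(p[d] for p in points) / n for d in range(dims)]
    between = 0.0
    within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        between += len(members) * sum((c - o) ** 2 for c, o in zip(centroid, overall))
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    return (between / (k - 1)) / (within / (n - k))


# Six made-up one-dimensional points forming three tight pairs.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]
ch_three = ch_index(points, [0, 0, 1, 1, 2, 2])  # one cluster per pair
ch_two = ch_index(points, [0, 0, 1, 1, 1, 1])    # last two pairs merged

print(f"CH with 3 clusters: {ch_three:.2f}")
print(f"CH with 2 clusters: {ch_two:.2f}")
```

Here the three-cluster labeling scores far higher than the two-cluster one, mirroring how the CH plots are read: the number of clusters with the highest index is preferred.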