STAT 5230: Homework 1 Solutions

MLB Batting Data

The dataset mlb batting.csv contains the 2023 season batting results for the 314 MLB players who played in at least 50 games. There are 10 variables in the data. The first is the player number (row_num), which you can ignore for now.

Question 1: Summary Stats

Mean vector \(\bar{\textbf{y}}\)

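library(tidyverse)  # dplyr, tidyr, forcats, ggplot2 (mlb assumed already loaded)
library(GGally)     # ggpairs()
library(rstatix)    # mahalanobis_distance(), shapiro_test()
library(MVN)        # mvn()
mlb |> 
  # removing row_num
  select(-row_num) |> 
  colMeans() |> 
  signif(digits = 3)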

##      atbats     singles     doubles     triples    homeruns       walks 
##    268.0000      0.1600      0.0509      0.0042      0.0349      0.0990 
##    hit_outs   struckout batting_avg 
##      0.4990      0.2520      0.2500

Covariance Matrix \(\textbf{S}\)

mlb |> 
  # removing row_num
  select(-row_num) |> 
  cov() |> 
  round(digits = 4)

##                atbats singles doubles triples homeruns   walks hit_outs
## atbats      5552.9210  0.3592  0.2103   0.045   0.3011 -0.0184   0.3160
## singles        0.3592  0.0011  0.0000   0.000  -0.0003 -0.0003   0.0005
## doubles        0.2103  0.0000  0.0002   0.000   0.0000  0.0000  -0.0002
## triples        0.0450  0.0000  0.0000   0.000   0.0000  0.0000   0.0000
## homeruns       0.3011 -0.0003  0.0000   0.000   0.0003  0.0002  -0.0005
## walks         -0.0184 -0.0003  0.0000   0.000   0.0002  0.0015  -0.0005
## hit_outs       0.3160  0.0005 -0.0002   0.000  -0.0005 -0.0005   0.0041
## struckout     -1.2315 -0.0013 -0.0001   0.000   0.0005  0.0007  -0.0039
## batting_avg    0.9169  0.0008  0.0003   0.000   0.0001 -0.0001  -0.0002
##             struckout batting_avg
## atbats        -1.2315      0.9169
## singles       -0.0013      0.0008
## doubles       -0.0001      0.0003
## triples        0.0000      0.0000
## homeruns       0.0005      0.0001
## walks          0.0007     -0.0001
## hit_outs      -0.0039     -0.0002
## struckout      0.0048     -0.0009
## batting_avg   -0.0009      0.0011

Correlation Matrix \(\textbf{R}\)

mlb |> 
  # removing row_num
  select(-row_num) |> 
  cor() |> 
  round(digits = 3)

##             atbats singles doubles triples homeruns  walks hit_outs struckout
## atbats       1.000   0.148   0.180   0.121    0.216 -0.006    0.066    -0.239
## singles      0.148   1.000   0.001   0.069   -0.479 -0.239    0.235    -0.565
## doubles      0.180   0.001   1.000  -0.048    0.034  0.013   -0.193    -0.054
## triples      0.121   0.069  -0.048   1.000   -0.111 -0.033    0.022    -0.085
## homeruns     0.216  -0.479   0.034  -0.111    1.000  0.231   -0.436     0.360
## walks       -0.006  -0.239   0.013  -0.033    0.231  1.000   -0.216     0.249
## hit_outs     0.066   0.235  -0.193   0.022   -0.436 -0.216    1.000    -0.877
## struckout   -0.239  -0.565  -0.054  -0.085    0.360  0.249   -0.877     1.000
## batting_avg  0.368   0.719   0.483   0.133    0.091 -0.102   -0.101    -0.389
##             batting_avg
## atbats            0.368
## singles           0.719
## doubles           0.483
## triples           0.133
## homeruns          0.091
## walks            -0.102
## hit_outs         -0.101
## struckout        -0.389
## batting_avg       1.000
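
The link between \(\textbf{S}\) and \(\textbf{R}\) can be checked numerically. The sketch below is only an illustration: it uses four numeric columns of R's built-in iris data as a stand-in (an assumption, since the MLB file is not reproduced here) and rescales \(\textbf{S}\) by the reciprocal standard deviations, \(\textbf{R} = \textbf{D}^{-1/2}\textbf{S}\textbf{D}^{-1/2}\), where \(\textbf{D}\) holds the diagonal of \(\textbf{S}\).

```r
# Stand-in data: four numeric columns of the built-in iris data (not the MLB data).
Y <- as.matrix(iris[, 1:4])

S <- cov(Y)                             # sample covariance matrix
D_inv_sqrt <- diag(1 / sqrt(diag(S)))   # D^{-1/2}: reciprocal standard deviations

# Rescale S into a correlation matrix by hand
R_manual <- D_inv_sqrt %*% S %*% D_inv_sqrt
dimnames(R_manual) <- dimnames(S)

all.equal(R_manual, cor(Y))     # TRUE up to floating-point error
all.equal(cov2cor(S), cor(Y))   # base R shortcut for the same rescaling
```

The same rescaling explains why the atbats row dominates \(\textbf{S}\) but not \(\textbf{R}\): dividing by the standard deviations removes the difference in measurement scale.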

Question 2) Density plots, scatter plots, and correlation plot

Create a set of density plots, a scatterplot matrix, and the correlation plot for the 9 variables (not row_num).

Part 2i) Density plots

Which variables appear to be non-normal?

mlb |> 
  pivot_longer(
    cols = -row_num
  ) |> 
  mutate(
    name = as_factor(name)
  ) |> 
  ggplot(
    mapping = aes(
      x = value,
      fill = name
    )
  ) + 
  
  geom_density(
    show.legend = FALSE
  ) + 
  
  facet_wrap(
    facets = vars(name),
    scales = "free",
    ncol = 3
  ) + 
  
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0)
  ) + 
  
  theme_bw() + 
  
  labs(
    x = NULL
  )

The most clearly non-normal variables are atbats, triples, homeruns, and walks. The others aren’t perfectly bell-shaped, but they are roughly so.

Part 2ii) Scatter plots

Do any of the variables appear to have an obvious non-linear relationship?

mlb |> 
  select(-row_num) |> 
  ggpairs() + 
  theme_bw()

From the scatterplots, there isn’t an obvious non-linear trend.

Part 2iii) Correlation plot

Which variable appears to have the weakest correlations with the other eight variables?

mlb |> 
  # removing row_num
  select(-row_num) |> 
  cor() |> 
  ggcorrplot::ggcorrplot(
    lab = TRUE,
    colors = c("red", "white", "blue"),
    type = "lower",
    outline.color = "white",
    ggtheme = theme_void,
    hc.order = TRUE
  )

triples has the weakest correlations overall; its strongest is only 0.13, with batting_avg.

Which pair of variables has the strongest correlation?

The strongest correlation is between struckout and hit_outs, at -0.88: the more often a player strikes out, the less often their at bats end in an out on a ball they hit.

Question 3) Generalized and Total Variance of \(\textbf{S}\)

Does there appear to be at least one linear dependency? Briefly explain your answer.

c(
  "Generalized Variance" = mlb |> select(-row_num) |> cov() |> det(),
  "Total Variance" = mlb |> select(-row_num) |> cov() |> diag() |> sum()
)

## Generalized Variance       Total Variance 
##         4.045203e-45         5.552934e+03

Since the generalized variance is essentially 0 (about \(4 \times 10^{-45}\)), there is at least one linear dependency in the data.

Question 4) Generalized and Total Variance of \(\textbf{R}\)

c(
  "Generalized Variance" = mlb |> select(-row_num) |> cor() |> det(),
  "Total Variance" = mlb |> select(-row_num) |> cor() |> diag() |> sum()
)

## Generalized Variance       Total Variance 
##         3.027458e-21         9.000000e+00

The total variance is 9 because each variable’s “variance” on the diagonal of the correlation matrix is 1. The total variance is the trace of the matrix (the sum of the diagonal), so we are adding nine 1s together, for a total of 9.

Question 5) Eigenvalues of \(\textbf{R}\)

eigenR <- 
  mlb |> 
  select(-row_num) |> 
  cor() |> 
  eigen()

# Eigenvalues
round(eigenR$values, 5)

## [1] 2.84035 1.89004 1.08922 1.02000 0.85284 0.78715 0.52037 0.00003 0.00000

Since the last eigenvalue is 0, there is one exact linear dependency. The second-to-last eigenvalue is very small (\(3 \times 10^{-5}\)), which indicates there might be a close, but not perfect, linear dependency in the data that we would want to be on the lookout for.
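
As a consistency check on Questions 4 and 5, the generalized and total variance of \(\textbf{R}\) can be written in terms of its eigenvalues \(\lambda_1, \dots, \lambda_9\). These are standard identities for any symmetric matrix, stated here for reference:

\[
|\textbf{R}| = \prod_{i=1}^{9} \lambda_i, \qquad \operatorname{tr}(\textbf{R}) = \sum_{i=1}^{9} \lambda_i = 9
\]

The rounded eigenvalues printed above do sum to 9.00000. The product of the first eight is roughly \(6 \times 10^{-5}\), so the reported determinant of about \(3 \times 10^{-21}\) suggests the last eigenvalue is on the order of \(10^{-17}\), which is zero up to floating-point error.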

Question 6) Find the Eigenvectors of \(\textbf{S}\)

Using the last eigenvectors corresponding to your answers in question 5), what set(s) of variables are linearly dependent?

eigenS <- 
  mlb |> 
  select(-row_num) |> 
  cov() |> 
  eigen()



# Eigenvectors
Evecs <- eigenS$vectors
row.names(Evecs) <- colnames(mlb)[-1]
colnames(Evecs) <- paste0("e", 1:(ncol(mlb)-1))
round(Evecs[,8:9], 3)

##                 e8     e9
## atbats       0.000  0.000
## singles      0.217 -0.408
## doubles      0.219 -0.408
## triples      0.219 -0.408
## homeruns     0.218 -0.408
## walks        0.000  0.000
## hit_outs    -0.437 -0.408
## struckout   -0.437 -0.408
## batting_avg -0.654  0.000

The exact linear dependency is among singles, doubles, triples, homeruns, hit_outs, and struckout: these proportions have to add up to 1 because together they cover every possible outcome of an at bat. This shows up in e9 as six equal loadings of -0.408.

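A close-to-exact linear association is among singles, doubles, triples, homeruns, hit_outs, struckout, and batting_avg. The e8 loadings are roughly proportional to (1, 1, 1, 1, -2, -2, -3); combined with the sum-to-1 constraint above, this says batting_avg is nearly equal to singles + doubles + triples + homeruns (the two presumably differ only by rounding).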
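
A small simulation can illustrate how the eigenvector of a zero eigenvalue exposes a sum-to-a-constant constraint. This is a sketch on simulated data, not the MLB file: the outcome probabilities, the 200 rows, and the 300 trials per row are all made-up assumptions chosen only to mimic six mutually exclusive batting outcomes.

```r
set.seed(5230)  # arbitrary seed for reproducibility

# Simulated stand-in for the MLB data: 200 "players", each with counts over six
# mutually exclusive outcomes, converted to proportions that sum to 1.
probs  <- c(single = 0.16, double = 0.05, triple = 0.01,
            homerun = 0.03, hit_out = 0.50, struckout = 0.25)
counts <- t(rmultinom(n = 200, size = 300, prob = probs))
props  <- counts / rowSums(counts)

eig <- eigen(cov(props))

# The smallest eigenvalue is numerically zero ...
signif(eig$values, 3)

# ... and its eigenvector has equal loadings (about +/- 1/sqrt(6) = 0.408),
# which encodes "the six proportions add up to a constant".
last_vec <- eig$vectors[, ncol(props)]
names(last_vec) <- colnames(props)
round(last_vec, 3)
```

Variables with a zero loading in such an eigenvector (atbats, walks, and batting_avg in e9 above) are not part of that dependency.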

Question 7) Outlier detection

mlb |> 
  mahalanobis_distance(-row_num, -atbats, -triples) |> 
  
  ggplot(
    mapping = aes(
      x = row_num,
      y = mahal.dist
    )
  ) + 
  geom_segment(
    aes(xend = row_num, 
        yend = 0)
  ) +  
  geom_point(
    mapping = aes(color = is.outlier),
    show.legend = FALSE
  ) + 
  
  geom_hline(
    yintercept = qchisq(0.999, df = 7),
    color = "red", 
    linetype = 2
  ) + 
  
  scale_color_manual(
    values = c("black", "red")
  ) + 
  
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0)
  ) + 
  
  scale_x_continuous(
    expand = c(0.025, 0)
  ) + 
  
  theme_bw() + 
  
  labs(
    x = NULL,
    y = "Mahalanobis Distance"
  )
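
The red dashed line in the plot is the 0.999 quantile of a chi-square distribution with 7 degrees of freedom, one per variable kept. The same calculation can be sketched in base R. The example below again uses the built-in iris data as a stand-in (an assumption), and it is meant as an illustration of the idea rather than a reproduction of what rstatix does internally.

```r
# Stand-in data: four numeric columns of the built-in iris data (not the MLB data).
Y <- as.matrix(iris[, 1:4])
p <- ncol(Y)

# Squared Mahalanobis distance of each row from the sample mean vector.
d2 <- mahalanobis(Y, center = colMeans(Y), cov = cov(Y))

# Under multivariate normality d2 is approximately chi-square with p df,
# so the 0.999 quantile gives a conservative cutoff.
cutoff <- qchisq(0.999, df = p)

which(d2 > cutoff)  # row indices flagged as potential outliers
```

Because the covariance matrix must be inverted, variables involved in an exact linear dependency cannot all be included, which is consistent with dropping triples above.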

Question 8) Multivariate Normality

Part 8a) Univariate Normality

Univariate Tests

mlb |> 
  shapiro_test(singles, doubles, homeruns, walks, hit_outs, struckout, batting_avg) |> 
  mutate(p = round(p, digits = 5)) |> 
  arrange(p)

## # A tibble: 7 × 3
##   variable    statistic       p
##   <chr>           <dbl>   <dbl>
## 1 walks           0.967 0      
## 2 homeruns        0.976 0.00005
## 3 doubles         0.989 0.0155 
## 4 singles         0.991 0.0507 
## 5 batting_avg     0.992 0.113  
## 6 struckout       0.995 0.422  
## 7 hit_outs        0.996 0.663

Univariate Normality Plots

mlb |> 
  select(-row_num, -atbats, -triples) |> 
  pivot_longer(
    cols = everything()
  ) |> 
  mutate(name = as_factor(name)) |> 
  ggplot(
    mapping = aes(sample = value)
  ) + 
  stat_qq() + 
  stat_qq_line() + 
  facet_wrap(
    facets = vars(name),
    scales = "free_y"
    ) + 
  
  theme_bw() + 
  
  labs(
    x = NULL,
    y = NULL
  )

Part 8b) Multivariate Normality

mvn(
  data = mlb |> select(-row_num, -atbats, -triples),
  mvnTest = "mardia",
  multivariatePlot = "qq",
  univariateTest = "SW",
  desc = FALSE
)

## $multivariateNormality
##              Test        Statistic              p value Result
## 1 Mardia Skewness 254.650012181646 4.43563927061862e-19     NO
## 2 Mardia Kurtosis 2.72141084146074  0.00650039174991912     NO
## 3             MVN             <NA>                 <NA>     NO
## 
## $univariateNormality
##           Test    Variable Statistic   p value Normality
## 1 Shapiro-Wilk   singles      0.9910  0.0507      YES   
## 2 Shapiro-Wilk   doubles      0.9887  0.0155      NO    
## 3 Shapiro-Wilk  homeruns      0.9764  <0.001      NO    
## 4 Shapiro-Wilk    walks       0.9672  <0.001      NO    
## 5 Shapiro-Wilk  hit_outs      0.9962  0.6631      YES   
## 6 Shapiro-Wilk  struckout     0.9951  0.4221      YES   
## 7 Shapiro-Wilk batting_avg    0.9925   0.113      YES
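
For reference, Mardia's skewness and kurtosis statistics are conventionally defined as below. This is the standard textbook form; whether the package applies any further adjustment is not visible in the output, so treat the match as an assumption. Let \(g_{ij} = (\textbf{y}_i - \bar{\textbf{y}})' \textbf{S}_n^{-1} (\textbf{y}_j - \bar{\textbf{y}})\), where \(\textbf{S}_n\) is the covariance matrix with divisor \(n\).

\[
b_{1,p} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} g_{ij}^3, \qquad b_{2,p} = \frac{1}{n} \sum_{i=1}^{n} g_{ii}^2
\]

\[
\frac{n \, b_{1,p}}{6} \sim \chi^2_{p(p+1)(p+2)/6}, \qquad \frac{b_{2,p} - p(p+2)}{\sqrt{8p(p+2)/n}} \sim N(0, 1)
\]

Both distributions hold approximately under multivariate normality. With \(p = 7\) variables, the skewness statistic is compared to a chi-square distribution with \(7 \cdot 8 \cdot 9 / 6 = 84\) degrees of freedom.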

Question 9) Multivariate Normal after transformation

mvn(
  data = mlb |> select(-row_num, -atbats, -triples, -homeruns),
  mvnTest = "mardia",
  multivariatePlot = "qq",
  univariateTest = "SW",
  desc = FALSE,
  bc = TRUE
)

## $multivariateNormality
##              Test          Statistic            p value Result
## 1 Mardia Skewness   75.4022635290638 0.0429324497842212     NO
## 2 Mardia Kurtosis 0.0276096897637102  0.977973453298814    YES
## 3             MVN               <NA>               <NA>     NO
## 
## $univariateNormality
##           Test    Variable Statistic   p value Normality
## 1 Shapiro-Wilk   singles      0.9910    0.0507    YES   
## 2 Shapiro-Wilk   doubles      0.9914    0.0629    YES   
## 3 Shapiro-Wilk    walks       0.9949    0.3897    YES   
## 4 Shapiro-Wilk  hit_outs      0.9963    0.6735    YES   
## 5 Shapiro-Wilk  struckout     0.9951    0.4221    YES   
## 6 Shapiro-Wilk batting_avg    0.9925    0.1130    YES   
## 
## $BoxCoxPowerTransformation
##     singles     doubles       walks    hit_outs   struckout batting_avg 
##        1.00        0.50        0.50        1.01        1.00        1.00
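
The estimated powers above are 0.5 for doubles and walks and about 1 for the other four variables. To show what such powers mean, here is a sketch of the textbook Box-Cox family. The helper `box_cox()` is hypothetical and written only for this illustration; the exact form the MVN package applies internally is not shown in the output.

```r
# Hypothetical helper (not part of the MVN package): the textbook Box-Cox transform.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))  # Box-Cox requires strictly positive data
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

y <- c(1, 4, 9)
box_cox(y, lambda = 0.5)  # square-root-type transform: 0 2 4
box_cox(y, lambda = 1)    # only shifts the data: 0 3 8
box_cox(y, lambda = 0)    # limiting case: log(y)
```

A power of 1 changes only the location of a variable, not its shape, so the Shapiro-Wilk results for singles, struckout, and batting_avg are identical before and after the transformation.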