STAT 5230: Homework 1 Solutions

MLB Batting Data

The dataset mlb batting.csv contains the batting results of the 314 MLB players who played in at least 50 games during the 2023 season. There are 10 variables in the data. The first is the player number (row_num), which you can ignore for now.

Question 1: Summary Stats

Mean vector \(\bar{\textbf{y}}\)

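library(tidyverse)  # select(), pivot_longer(), as_factor(), and ggplot2 used throughout

mlb |> 
  # drop row_num before computing the column means
  select(-row_num) |> 
  colMeans() |> 
  signif(digits = 3)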

##      atbats     singles     doubles     triples    homeruns       walks 
##    268.0000      0.1600      0.0509      0.0042      0.0349      0.0990 
##    hit_outs   struckout batting_avg 
##      0.4990      0.2520      0.2500

Covariance Matrix \(\textbf{S}\)

mlb |> 
  # drop row_num
  select(-row_num) |> 
  cov() |> 
  round(digits = 4)

##                atbats singles doubles triples homeruns   walks hit_outs
## atbats      5552.9210  0.3592  0.2103   0.045   0.3011 -0.0184   0.3160
## singles        0.3592  0.0011  0.0000   0.000  -0.0003 -0.0003   0.0005
## doubles        0.2103  0.0000  0.0002   0.000   0.0000  0.0000  -0.0002
## triples        0.0450  0.0000  0.0000   0.000   0.0000  0.0000   0.0000
## homeruns       0.3011 -0.0003  0.0000   0.000   0.0003  0.0002  -0.0005
## walks         -0.0184 -0.0003  0.0000   0.000   0.0002  0.0015  -0.0005
## hit_outs       0.3160  0.0005 -0.0002   0.000  -0.0005 -0.0005   0.0041
## struckout     -1.2315 -0.0013 -0.0001   0.000   0.0005  0.0007  -0.0039
## batting_avg    0.9169  0.0008  0.0003   0.000   0.0001 -0.0001  -0.0002
##             struckout batting_avg
## atbats        -1.2315      0.9169
## singles       -0.0013      0.0008
## doubles       -0.0001      0.0003
## triples        0.0000      0.0000
## homeruns       0.0005      0.0001
## walks          0.0007     -0.0001
## hit_outs      -0.0039     -0.0002
## struckout      0.0048     -0.0009
## batting_avg   -0.0009      0.0011

Correlation Matrix \(\textbf{R}\)

mlb |> 
  # drop row_num
  select(-row_num) |> 
  cor() |> 
  round(digits = 3)

##             atbats singles doubles triples homeruns  walks hit_outs struckout
## atbats       1.000   0.148   0.180   0.121    0.216 -0.006    0.066    -0.239
## singles      0.148   1.000   0.001   0.069   -0.479 -0.239    0.235    -0.565
## doubles      0.180   0.001   1.000  -0.048    0.034  0.013   -0.193    -0.054
## triples      0.121   0.069  -0.048   1.000   -0.111 -0.033    0.022    -0.085
## homeruns     0.216  -0.479   0.034  -0.111    1.000  0.231   -0.436     0.360
## walks       -0.006  -0.239   0.013  -0.033    0.231  1.000   -0.216     0.249
## hit_outs     0.066   0.235  -0.193   0.022   -0.436 -0.216    1.000    -0.877
## struckout   -0.239  -0.565  -0.054  -0.085    0.360  0.249   -0.877     1.000
## batting_avg  0.368   0.719   0.483   0.133    0.091 -0.102   -0.101    -0.389
##             batting_avg
## atbats            0.368
## singles           0.719
## doubles           0.483
## triples           0.133
## homeruns          0.091
## walks            -0.102
## hit_outs         -0.101
## struckout        -0.389
## batting_avg       1.000

Question 2) Density plots, scatter plots, and correlation plot

Create a set of density plots, a scatterplot matrix, and a correlation plot for the 9 variables (excluding row_num).

Part 2i) Density plots

Which variables appear to be non-normal?

mlb |> 
  pivot_longer(
    cols = -row_num
  ) |> 
  mutate(
    name = as_factor(name)
  ) |> 
  ggplot(
    mapping = aes(
      x = value,
      fill = name
    )
  ) + 
  
  geom_density(
    show.legend = F
  ) + 
  
  facet_wrap(
    facets = vars(name),
    scales = "free",
    ncol = 3
  ) + 
  
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0)
  ) + 
  
  theme_bw() + 
  
  labs(
    x = NULL
  )

The most non-normal variables are atbats, triples, homeruns, and walks. The others aren’t perfectly bell-shaped, but they are roughly so.

Scatter plots

Do any of the variables appear to have an obvious non-linear relationship?

mlb |> 
  select(-row_num) |> 
  ggpairs() + 
  theme_bw()

From the scatterplots, there isn’t an obvious non-linear trend.

Correlation plot

Which variable appears to have the weakest correlations with the other eight variables?

mlb |> 
  # drop row_num
  select(-row_num) |> 
  cor() |> 
  ggcorrplot::ggcorrplot(
    lab = T,
    colors = c("red", "white", "blue"),
    type = "lower",
    outline.color = "white",
    ggtheme = theme_void,
    hc.order = T
  )

triples has the weakest overall correlations; its strongest is only 0.13, with batting_avg.

Which pair of variables has the strongest correlation?

The strongest correlation is between struckout and hit_outs, at -0.88. This indicates that the more likely a player is to strike out, the less likely they are to put the ball in play and be put out.
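As a rough check, the same value can be recovered from the rounded entries of \(\textbf{S}\) printed in Question 1, using the usual definition of a correlation. The subscripts \(j\) and \(k\) below are just labels for hit_outs and struckout, and the inputs are the 4-decimal values from the printout, so only about two digits should be expected to agree:

\[
r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}\, s_{kk}}}
\approx \frac{-0.0039}{\sqrt{(0.0041)(0.0048)}}
\approx \frac{-0.0039}{0.00444}
\approx -0.88
\]

This matches the -0.877 reported by cor() up to the rounding of the covariance entries.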

Question 3) Generalized and Total Variance of \(\textbf{S}\)

Does there appear to be at least one linear dependency? Briefly explain your answer.

c(
  "Generalized Variance" = mlb |> select(-row_num) |> cov() |> det(),
  "Total Variance" = mlb |> select(-row_num) |> cov() |> diag() |> sum()
)

## Generalized Variance       Total Variance 
##         4.045203e-45         5.552934e+03

Since the generalized variance is essentially 0 (about \(4 \times 10^{-45}\)), there is at least one linear dependency in the data.
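The link between a linear dependency and a zero generalized variance can be illustrated with a small sketch. It uses simulated data, not the mlb data, and the names toy, S_toy, gen_var, and gen_var_reduced are made up for illustration. Three columns are built to sum to 1, so the determinant of their covariance matrix collapses to numerical zero, while dropping one of the dependent columns restores a clearly positive determinant.

```r
# Simulated stand-in for outcome proportions (NOT the mlb data):
# columns a, b, c always sum to 1; column d is unrelated noise.
set.seed(1)
n <- 200
a <- runif(n, 0.1, 0.4)
b <- runif(n, 0.1, 0.4)
toy <- data.frame(a = a, b = b, c = 1 - a - b, d = rnorm(n))

S_toy <- cov(toy)

gen_var   <- det(S_toy)        # generalized variance: numerically zero here
total_var <- sum(diag(S_toy))  # total variance: trace of the covariance matrix

# Dropping one column from the dependent set removes the dependency
gen_var_reduced <- det(cov(toy[, c("a", "b", "d")]))

c(full = gen_var, reduced = gen_var_reduced, total = total_var)
```

Note that the total variance is unaffected by the dependency; only the determinant signals it.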

Question 4) Generalized and Total Variance of \(\textbf{R}\)

c(
  "Generalized Variance" = mlb |> select(-row_num) |> cor() |> det(),
  "Total Variance" = mlb |> select(-row_num) |> cor() |> diag() |> sum()
)

## Generalized Variance       Total Variance 
##         3.027458e-21         9.000000e+00

The total variance is 9 because the “variance” of each variable in the correlation matrix is 1. The total variance is the trace of the matrix (the sum of the diagonal), so we are adding nine 1s together, for a total of 9.

Question 5) Eigenvalues of \(\textbf{R}\)

eigenR <- 
  mlb |> 
  select(-row_num) |> 
  cor() |> 
  eigen()

# Eigenvalues
round(eigenR$values, 5)

## [1] 2.84035 1.89004 1.08922 1.02000 0.85284 0.78715 0.52037 0.00003 0.00000

Since the last eigenvalue is 0 (to five decimal places), there is one exact linear dependency. The second-to-last eigenvalue is very small (about \(3 \times 10^{-5}\)), which indicates a close but not perfect linear dependency in the data that we would want to be on the lookout for.
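A quick arithmetic check on the printed eigenvalues uses two standard identities: the eigenvalues of \(\textbf{R}\) sum to its trace and multiply to its determinant. The symbol \(\lambda_i\) is introduced here for the \(i\)-th eigenvalue.

\[
\sum_{i=1}^{9} \lambda_i = 2.84035 + 1.89004 + 1.08922 + 1.02000 + 0.85284 + 0.78715 + 0.52037 + 0.00003 + 0.00000 = 9.00000 = \operatorname{tr}(\textbf{R})
\]

\[
\prod_{i=1}^{9} \lambda_i = |\textbf{R}|
\]

The sum reproduces the total variance from Question 4. The product identity explains why the generalized variance there is so close to 0: a single eigenvalue at numerical zero drives the whole product to (numerical) zero.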

Question 6) Find the Eigenvectors of \(\textbf{S}\)

Using the last eigenvectors corresponding to your answers in question 5), what set(s) of variables are linearly dependent?

eigenS <- 
  mlb |> 
  select(-row_num) |> 
  cov() |> 
  eigen()



# Eigenvectors
Evecs <- eigenS$vectors
row.names(Evecs) <- colnames(mlb)[-1]
colnames(Evecs) <- paste0("e", 1:(ncol(mlb)-1))
round(Evecs[,8:9], 3)

##                 e8     e9
## atbats       0.000  0.000
## singles      0.217 -0.408
## doubles      0.219 -0.408
## triples      0.219 -0.408
## homeruns     0.218 -0.408
## walks        0.000  0.000
## hit_outs    -0.437 -0.408
## struckout   -0.437 -0.408
## batting_avg -0.654  0.000

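The exact linear dependency is among singles, doubles, triples, homeruns, hit_outs, and struckout, which all have the same loading in e9. These proportions must add up to 1 because together they cover every possible result of an at bat.

A close-to-exact linear association involves singles, doubles, triples, homeruns, hit_outs, struckout, and batting_avg, the variables with nonzero entries in e8. Combined with the exact dependency, it says that batting_avg is nearly equal to singles + doubles + triples + homeruns.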
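The way an exact dependency shows up in the last eigenvector can be sketched on simulated data. This is not the mlb data, and toy, eig_toy, and last_vec are illustrative names. Columns a, b, and c sum to 1, so the eigenvector for the zero eigenvalue should load equally on those three columns, with magnitude \(1/\sqrt{3} \approx 0.577\) and an arbitrary overall sign, and load 0 on the unrelated column d.

```r
# Simulated data (NOT the mlb data): a + b + c = 1 exactly, d is unrelated.
set.seed(2)
n <- 200
a <- runif(n, 0.1, 0.4)
b <- runif(n, 0.1, 0.4)
toy <- data.frame(a = a, b = b, c = 1 - a - b, d = rnorm(n))

eig_toy <- eigen(cov(toy))

# eigen() sorts eigenvalues in decreasing order, so the last column of
# $vectors belongs to the (numerically) zero eigenvalue.
last_vec <- eig_toy$vectors[, ncol(toy)]
names(last_vec) <- colnames(toy)

round(eig_toy$values, 5)
round(last_vec, 3)
```

The same reading applies to e9 above: six equal loadings of magnitude 0.408, which is \(1/\sqrt{6}\), identify the six variables that sum to a constant.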

Question 7) Outlier detection

mlb |> 
  mahalanobis_distance(-row_num, -atbats, -triples) |> 
  
  ggplot(
    mapping = aes(
      x = row_num,
      y = mahal.dist
    )
  ) + 
  geom_segment(
    aes(xend = row_num, 
        yend = 0)
  ) +  
  geom_point(
    mapping = aes(color = is.outlier),
    show.legend = F
  ) + 
  
  geom_hline(
    yintercept = qchisq(0.999, df = 7),
    color = "red", 
    linetype = 2
  ) + 
  
  scale_color_manual(
    values = c("black", "red")
  ) + 
  
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0)
  ) + 
  
  scale_x_continuous(
    expand = c(0.025, 0)
  ) + 
  
  theme_bw() + 
  
  labs(
    x = NULL,
    y = "Mahalanobis Distance"
  )

There appear to be three outliers in the data, using a 99.9% threshold for outlier detection.
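For reference, the same kind of check can be sketched in base R, without the mahalanobis_distance() helper used above. The data below are simulated with one planted outlier (not the mlb data), and X, d2, cutoff, and outliers are illustrative names. stats::mahalanobis() returns squared distances, which is why they are compared directly to a chi-squared quantile with degrees of freedom equal to the number of variables.

```r
# Simulated data (NOT the mlb data): 300 trivariate normal rows,
# with row 1 overwritten by a deliberately extreme point.
set.seed(3)
n <- 300
p <- 3
X <- matrix(rnorm(n * p), ncol = p)
X[1, ] <- c(6, -6, 6)

# Squared Mahalanobis distance of every row from the sample mean
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# 99.9% chi-squared cutoff with df = number of variables
cutoff <- qchisq(0.999, df = p)

outliers <- which(d2 > cutoff)
outliers
```

In the homework, seven variables enter the distance, which is why the reference line uses df = 7.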

Question 8) Multivariate Normality

Part 8a) Univariate Normality

Univariate Tests

mlb |> 
  shapiro_test(singles, doubles, homeruns, walks, hit_outs, struckout, batting_avg) |> 
  mutate(p = round(p, digits = 5)) |> 
  arrange(p)

## # A tibble: 7 × 3
##   variable    statistic       p
##   <chr>           <dbl>   <dbl>
## 1 walks           0.967 0      
## 2 homeruns        0.976 0.00005
## 3 doubles         0.989 0.0155 
## 4 singles         0.991 0.0507 
## 5 batting_avg     0.992 0.113  
## 6 struckout       0.995 0.422  
## 7 hit_outs        0.996 0.663

Univariate Normality Plots

mlb |> 
  select(-row_num, -atbats, -triples) |> 
  pivot_longer(
    cols = everything()
  ) |> 
  mutate(name = as_factor(name)) |> 
  ggplot(
    mapping = aes(sample = value)
  ) + 
  stat_qq() + 
  stat_qq_line() + 
  facet_wrap(
    facets = vars(name),
    scales = "free_y"
    ) + 
  
  theme_bw() + 
  
  labs(
    x = NULL,
    y = NULL
  )

From the QQ plots, walks and homeruns strongly appear to be non-normal. The other variables are at least roughly normal, and the QQ plots for singles, doubles, and batting_avg are about as good as you can hope to see.

Part 8b) Multivariate Normality

baseball_mvn <- 
  mvn(
    data = mlb |> select(-row_num, -atbats, -triples),
    mvn_test = "mardia",
    univariate_test = 'SW',
    desc = F
  )

# Checking MVN
baseball_mvn$multivariate_normality

##              Test Statistic p.value     Method          MVN
## 1 Mardia Skewness   254.650  <0.001 asymptotic ✗ Not normal
## 2 Mardia Kurtosis     2.721   0.007 asymptotic ✗ Not normal

# Creating the chi-squared QQ plot
plot(baseball_mvn)

The Mardia tests (skewness p < 0.001, kurtosis p = 0.007) and the chi-squared QQ plot, where the points do not follow the line closely, both indicate that the data are not MVN.
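A chi-squared QQ plot can also be built by hand, which makes clear what is being compared. The sketch below is not the MVN package's implementation. It uses simulated multivariate normal data (not the mlb data), and X, d2, theo, and qq_cor are illustrative names. The ordered squared Mahalanobis distances are plotted against chi-squared quantiles with degrees of freedom equal to the number of variables.

```r
# Simulated MVN data (NOT the mlb data)
set.seed(4)
n <- 300
p <- 3
X <- matrix(rnorm(n * p), ncol = p)

# Ordered squared Mahalanobis distances from the sample mean
d2 <- sort(mahalanobis(X, center = colMeans(X), cov = cov(X)))

# Matching theoretical chi-squared quantiles
theo <- qchisq(ppoints(n), df = p)

plot(theo, d2,
     xlab = "Chi-squared quantiles",
     ylab = "Ordered squared Mahalanobis distances")
abline(0, 1, col = "red", lty = 2)  # reference line y = x

# A rough numeric summary of how straight the QQ plot is
qq_cor <- cor(theo, d2)
```

For data that really are MVN, the points should hug the reference line. Systematic curvature in the upper tail, as in the plot above for the baseball data, points to non-normality.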

Question 9) Multivariate Normal after transformation

baseball_mvn9 <- 
  mvn(
    data = mlb |> select(-row_num, -atbats, -triples, -homeruns),
    mvn_test = "mardia",
    univariate_test = "SW",
    desc = F,
    power_family = 'bcPower',
    power_transform_type = 'rounded'
  )

# Multivariate test
baseball_mvn9$multivariate_normality

##              Test Statistic p.value     Method      MVN
## 1 Mardia Skewness    70.816   0.088 asymptotic ✓ Normal
## 2 Mardia Kurtosis    -0.089   0.929 asymptotic ✓ Normal

# Univariate test
baseball_mvn9$univariate_normality

##           Test    Variable Statistic p.value Normality
## 1 Shapiro-Wilk     singles     0.991   0.051  ✓ Normal
## 2 Shapiro-Wilk     doubles     0.991   0.063  ✓ Normal
## 3 Shapiro-Wilk       walks     0.995   0.390  ✓ Normal
## 4 Shapiro-Wilk    hit_outs     0.996   0.671  ✓ Normal
## 5 Shapiro-Wilk   struckout     0.995   0.422  ✓ Normal
## 6 Shapiro-Wilk batting_avg     0.992   0.113  ✓ Normal

# Plot
plot(baseball_mvn9)

# transformations
baseball_mvn9$power_transform_lambda

##     singles     doubles       walks    hit_outs   struckout batting_avg 
##    1.000000    0.500000    0.500000    1.007836    1.000000    1.000000

After the transformation, the data appear to be MVN: the points in the QQ plot follow the line very closely, and neither the univariate nor the multivariate tests reject the null hypothesis of normality at the 0.05 level. The only two variables that were transformed (after dropping homeruns) are doubles and walks, each with a square root transformation (power = 0.5). The remaining variables are essentially untransformed (power ≈ 1; hit_outs has an estimated power of 1.008).
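For reference, and assuming the 'bcPower' option corresponds to the usual Box-Cox power family (as in car::bcPower), the transformation applied to a positive variable \(y\) with power \(\lambda\) is:

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[1.5ex]
\log y, & \lambda = 0
\end{cases}
\]

With \(\lambda = 0.5\) this gives \(y^{(0.5)} = 2(\sqrt{y} - 1)\), which is a square root followed by a linear rescaling. With \(\lambda = 1\) it gives \(y^{(1)} = y - 1\), a simple shift. Neither a shift nor a linear rescaling affects the normality tests, so describing these as a "square root" and "no transformation" is accurate.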