The dataset mlb batting.csv contains the batting results for the 314 MLB players who played in at least 50 games during the 2023 season. There are 10 variables in the data. The first is the player number (row_num), which you can ignore for now.
mlb |>
  # removing row_num
  select(-row_num) |>
  colMeans() |>
  signif(digits = 3)
##      atbats     singles     doubles     triples    homeruns       walks
##    268.0000      0.1600      0.0509      0.0042      0.0349      0.0990
##    hit_outs   struckout batting_avg
##      0.4990      0.2520      0.2500
mlb |>
  # removing row_num
  select(-row_num) |>
  cov() |>
  round(digits = 4)
##                atbats singles doubles triples homeruns   walks hit_outs
## atbats      5552.9210  0.3592  0.2103   0.045   0.3011 -0.0184   0.3160
## singles        0.3592  0.0011  0.0000   0.000  -0.0003 -0.0003   0.0005
## doubles        0.2103  0.0000  0.0002   0.000   0.0000  0.0000  -0.0002
## triples        0.0450  0.0000  0.0000   0.000   0.0000  0.0000   0.0000
## homeruns       0.3011 -0.0003  0.0000   0.000   0.0003  0.0002  -0.0005
## walks         -0.0184 -0.0003  0.0000   0.000   0.0002  0.0015  -0.0005
## hit_outs       0.3160  0.0005 -0.0002   0.000  -0.0005 -0.0005   0.0041
## struckout     -1.2315 -0.0013 -0.0001   0.000   0.0005  0.0007  -0.0039
## batting_avg    0.9169  0.0008  0.0003   0.000   0.0001 -0.0001  -0.0002
##             struckout batting_avg
## atbats        -1.2315      0.9169
## singles       -0.0013      0.0008
## doubles       -0.0001      0.0003
## triples        0.0000      0.0000
## homeruns       0.0005      0.0001
## walks          0.0007     -0.0001
## hit_outs      -0.0039     -0.0002
## struckout      0.0048     -0.0009
## batting_avg   -0.0009      0.0011
mlb |>
  # removing row_num
  select(-row_num) |>
  cor() |>
  round(digits = 3)
##             atbats singles doubles triples homeruns  walks hit_outs struckout
## atbats       1.000   0.148   0.180   0.121    0.216 -0.006    0.066    -0.239
## singles      0.148   1.000   0.001   0.069   -0.479 -0.239    0.235    -0.565
## doubles      0.180   0.001   1.000  -0.048    0.034  0.013   -0.193    -0.054
## triples      0.121   0.069  -0.048   1.000   -0.111 -0.033    0.022    -0.085
## homeruns     0.216  -0.479   0.034  -0.111    1.000  0.231   -0.436     0.360
## walks       -0.006  -0.239   0.013  -0.033    0.231  1.000   -0.216     0.249
## hit_outs     0.066   0.235  -0.193   0.022   -0.436 -0.216    1.000    -0.877
## struckout   -0.239  -0.565  -0.054  -0.085    0.360  0.249   -0.877     1.000
## batting_avg  0.368   0.719   0.483   0.133    0.091 -0.102   -0.101    -0.389
##             batting_avg
## atbats            0.368
## singles           0.719
## doubles           0.483
## triples           0.133
## homeruns          0.091
## walks            -0.102
## hit_outs         -0.101
## struckout        -0.389
## batting_avg       1.000
Create a set of density plots, a scatterplot matrix, and a correlation plot for the 9 variables (not row_num).
Which variables appear to be non-normal?
mlb |>
  pivot_longer(
    cols = -row_num
  ) |>
  mutate(
    name = as_factor(name)
  ) |>
  ggplot(
    mapping = aes(
      x = value,
      fill = name
    )
  ) +
  geom_density(
    show.legend = FALSE
  ) +
  facet_wrap(
    facets = vars(name),
    scales = "free",
    ncol = 3
  ) +
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0)
  ) +
  theme_bw() +
  labs(
    x = NULL
  )
The most clearly non-normal variables are atbats, triples, homeruns, and walks. The others aren’t perfectly bell-shaped, but they are roughly so.
Do any of the variables appear to have an obvious non-linear relationship?
mlb |>
  select(-row_num) |>
  ggpairs() +
  theme_bw()
From the scatterplots, there isn’t an obvious non-linear trend.
Which variable appears to have the weakest correlations with the other eight variables?
mlb |>
  # removing row_num
  select(-row_num) |>
  cor() |>
  ggcorrplot::ggcorrplot(
    lab = TRUE,
    colors = c("red", "white", "blue"),
    type = "lower",
    outline.color = "white",
    ggtheme = theme_void,
    hc.order = TRUE
  )
triples has the weakest correlations overall: its strongest is only 0.13, with batting_avg.
Which pair of variables have the strongest correlation?
The strongest correlation is between struckout and hit_outs, at -0.88: the more often a player strikes out, the less often their at-bats end in an out on a batted ball.
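The two answers above can also be read off programmatically rather than by eye. The following is only an illustrative sketch: it hand-copies a five-variable subset of the correlation matrix printed earlier (the objects `vars`, `R`, `A`, and the result names are assumptions introduced here, not part of the original analysis), zeroes the diagonal, and then finds the largest absolute correlation per variable and overall. Results apply to this subset only.

```r
# Five-variable subset of the correlation matrix reported above,
# typed in by hand for illustration (not read from the data file)
vars <- c("triples", "homeruns", "hit_outs", "struckout", "batting_avg")
R <- matrix(
  c( 1.000, -0.111,  0.022, -0.085,  0.133,
    -0.111,  1.000, -0.436,  0.360,  0.091,
     0.022, -0.436,  1.000, -0.877, -0.101,
    -0.085,  0.360, -0.877,  1.000, -0.389,
     0.133,  0.091, -0.101, -0.389,  1.000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars)
)

# Use absolute values and ignore each variable's correlation with itself
A <- abs(R)
diag(A) <- 0

# Strongest absolute correlation each variable has with any other variable
strongest_per_var <- apply(A, 1, max)
weakest_var <- names(which.min(strongest_per_var))

# Largest off-diagonal entry (first of the two mirror-image cells)
idx <- which(A == max(A), arr.ind = TRUE)[1, ]
strongest_pair <- rownames(R)[idx]
strongest_r <- R[idx[1], idx[2]]

print(strongest_per_var)
print(weakest_var)
print(strongest_pair)
print(strongest_r)
```

With the full 9 x 9 matrix, the same steps would work on the output of `cor()` directly.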
Does there appear to be at least one linear dependency? Briefly explain your answer.
c(
  "Generalized Variance" = mlb |> select(-row_num) |> cov() |> det(),
  "Total Variance" = mlb |> select(-row_num) |> cov() |> diag() |> sum()
)
## Generalized Variance       Total Variance
##         4.045203e-45         5.552934e+03
Since the generalized variance is effectively 0 (about 4 × 10^{-45}), there appears to be at least 1 linear dependency in the data.
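To see why a (near-)zero determinant signals a dependency, here is a small simulated sketch. It uses toy data, not the MLB file, and the names `props`, `gen_var`, and the `1e-12` tolerance are illustrative assumptions. Three proportions are forced to sum to 1, so their covariance matrix is singular and its determinant is zero up to floating-point error. Dropping one of the three removes the dependency.

```r
# Toy data (hypothetical): three proportions that sum to 1 in every row
set.seed(1)
raw   <- matrix(rexp(300), ncol = 3)
props <- raw / rowSums(raw)
colnames(props) <- c("p1", "p2", "p3")

S         <- cov(props)
gen_var   <- det(S)          # generalized variance
total_var <- sum(diag(S))    # total variance (trace of S)

# det(S) is zero in exact arithmetic; floating point leaves a tiny residue,
# so compare against a tolerance instead of testing `== 0`
is_singular <- abs(gen_var) < 1e-12

# With one proportion dropped, the remaining two are no longer tied together
gen_var_reduced <- det(cov(props[, 1:2]))

print(c(gen_var = gen_var, total_var = total_var,
        gen_var_reduced = gen_var_reduced))
print(is_singular)
```

The determinant depends on the units of the variables, which is one reason to repeat the check on the correlation matrix.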
c(
  "Generalized Variance" = mlb |> select(-row_num) |> cor() |> det(),
  "Total Variance" = mlb |> select(-row_num) |> cor() |> diag() |> sum()
)
## Generalized Variance       Total Variance
##         3.027458e-21         9.000000e+00
The total variance is 9 because, in the correlation matrix, the “variance” of each variable is 1. The total variance is the trace of the matrix (the sum of the diagonal), so we are adding nine 1s together, for a total of 9.
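A quick way to convince yourself of this is to standardize some data and compare. The sketch below uses simulated columns on very different scales (the data and object names are made up for illustration). The covariance matrix of the standardized columns matches the correlation matrix of the originals, so every diagonal entry is 1 and the trace equals the number of variables.

```r
# Simulated columns on deliberately different scales (illustrative only)
set.seed(10)
X <- cbind(
  a = rnorm(50, mean = 100, sd = 20),
  b = rexp(50),
  c = runif(50)
)

R <- cor(X)
Z <- scale(X)   # center each column, then divide by its standard deviation

# Covariance of standardized data equals correlation of the raw data
same_matrix <- isTRUE(all.equal(cov(Z), R, check.attributes = FALSE))

# Trace of the correlation matrix = number of variables
trace_R <- sum(diag(R))

print(same_matrix)
print(trace_R)
print(ncol(X))
```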
eigenR <-
  mlb |>
  select(-row_num) |>
  cor() |>
  eigen()
# Eigenvalues
round(eigenR$values, 5)
## [1] 2.84035 1.89004 1.08922 1.02000 0.85284 0.78715 0.52037 0.00003 0.00000
Since the last eigenvalue is 0, there is at least one exact linear dependency. The second-to-last eigenvalue is very small (3 × 10^{-5}), which indicates there may be a close, but not exact, linear dependency in the data that we would want to be on the lookout for.
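The distinction between an exact and a near dependency can be reproduced on simulated data. In this sketch (toy variables `x1` to `x4`; the noise level and both tolerances are arbitrary assumptions chosen for illustration), `x3` is an exact linear combination of `x1` and `x2`, while `x4` is a linear combination plus a little noise. The correlation matrix then has one eigenvalue at zero up to floating-point error and one that is small but clearly nonzero.

```r
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2                          # exact dependency: x3 - x1 - x2 = 0
x4 <- x1 - x2 + rnorm(n, sd = 0.005)   # near dependency: tiny added noise
X  <- cbind(x1, x2, x3, x4)

ev <- eigen(cor(X), symmetric = TRUE)$values
print(round(ev, 5))

# Illustrative cutoffs (assumptions, not standard constants)
exact_tol <- 1e-10
near_tol  <- 1e-3

n_exact <- sum(abs(ev) < exact_tol)                     # numerically zero
n_near  <- sum(abs(ev) >= exact_tol & ev < near_tol)    # small but nonzero

print(c(exact = n_exact, near = n_near))
```

The eigenvalues of a correlation matrix always sum to the number of variables, so the near-zero ones stand out against the rest.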
Using the last eigenvectors corresponding to your answers in question 5), what set(s) of variables are linearly dependent?
eigenS <-
  mlb |>
  select(-row_num) |>
  cov() |>
  eigen()
# Eigenvectors
Evecs <- eigenS$vectors
# label rows with the variable names (assumes row_num is the first column of mlb)
row.names(Evecs) <- colnames(mlb)[-1]
colnames(Evecs) <- paste0("e", 1:(ncol(mlb)-1))
# the last two eigenvectors go with the two smallest eigenvalues
round(Evecs[,8:9], 3)
##                 e8     e9
## atbats       0.000  0.000
## singles      0.217 -0.408
## doubles      0.219 -0.408
## triples      0.219 -0.408
## homeruns     0.218 -0.408
## walks        0.000  0.000
## hit_outs    -0.437 -0.408
## struckout   -0.437 -0.408
## batting_avg -0.654  0.000
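The exact linear dependency (e9) is among singles, doubles, triples, homeruns, hit_outs, and struckout: these proportions have to add up to 1 because together they cover every possible result of an at-bat.
The close-to-exact linear dependency (e8) involves singles, doubles, triples, homeruns, hit_outs, struckout, and batting_avg. The e8 loadings suggest that batting_avg is almost exactly singles + doubles + triples + homeruns, with the small discrepancy presumably due to rounding.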
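The pattern in e9 (six equal loadings of magnitude 0.408, which is about 1/sqrt(6), and zeros elsewhere) is what a sum-to-a-constant constraint looks like in an eigenvector. The sketch below reproduces it on simulated data. The four toy proportions and the unrelated `extra` column are assumptions made up for illustration. The last eigenvector of the covariance matrix should put equal weight, 1/sqrt(4) = 0.5 in magnitude, on the constrained columns and essentially none on the unrelated one. The sign of an eigenvector is arbitrary, so only magnitudes are compared.

```r
# Toy data (hypothetical): four proportions summing to 1, plus one unrelated column
set.seed(3)
raw   <- matrix(rexp(400), ncol = 4)
props <- raw / rowSums(raw)
colnames(props) <- paste0("p", 1:4)
X <- cbind(props, extra = rnorm(100))

e <- eigen(cov(X), symmetric = TRUE)

# Eigenvector paired with the smallest eigenvalue (eigen() sorts descending)
last_vec <- e$vectors[, ncol(X)]
names(last_vec) <- colnames(X)

print(round(e$values, 5))
print(round(last_vec, 3))

# Variables with clearly nonzero loadings are the ones tied together
involved <- names(last_vec)[abs(last_vec) > 0.01]
print(involved)
```

Because the constraint is p1 + p2 + p3 + p4 = 1, the loadings come out equal. Unequal loadings, as in e8, point to a weighted combination instead.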
mlb |>
  mahalanobis_distance(-row_num, -atbats, -triples) |>
  ggplot(
    mapping = aes(
      x = row_num,
      y = mahal.dist
    )
  ) +
  geom_segment(
    aes(xend = row_num,
        yend = 0)
  ) +
  geom_point(
    mapping = aes(color = is.outlier),
    show.legend = FALSE
  ) +
  geom_hline(
    yintercept = qchisq(0.999, df = 7),
    color = "red",
    linetype = 2
  ) +
  scale_color_manual(
    values = c("black", "red")
  ) +
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0)
  ) +
  scale_x_continuous(
    expand = c(0.025, 0)
  ) +
  theme_bw() +
  labs(
    x = NULL,
    y = "Mahalanobis Distance"
  )
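The idea behind the plot above can be sketched in base R. This sketch uses simulated data and assumes that the plotted distances are the squared Mahalanobis distances that `stats::mahalanobis()` returns, compared against a chi-square quantile with degrees of freedom equal to the number of variables, as in the dashed line. The planted outlier and all object names are illustrative.

```r
# Simulated data (hypothetical): 200 rows, 3 roughly independent variables
set.seed(4)
X <- matrix(rnorm(200 * 3), ncol = 3)
X[1, ] <- c(6, -6, 6)   # plant one obvious multivariate outlier in row 1

# Squared Mahalanobis distance of each row from the sample centroid
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Same style of cutoff as the dashed line: 99.9th percentile of chi-square,
# with degrees of freedom equal to the number of variables
cutoff  <- qchisq(0.999, df = ncol(X))
flagged <- which(d2 > cutoff)

print(round(cutoff, 2))
print(flagged)
print(round(d2[flagged], 2))
```

Under multivariate normality, roughly 0.1% of rows would exceed this cutoff by chance, so a handful of flagged points in a few hundred rows is not unusual.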
mlb |>
  shapiro_test(singles, doubles, homeruns, walks, hit_outs, struckout, batting_avg) |>
  mutate(p = round(p, digits = 5)) |>
  arrange(p)
## # A tibble: 7 × 3
##   variable    statistic       p
##   <chr>           <dbl>   <dbl>
## 1 walks           0.967 0
## 2 homeruns        0.976 0.00005
## 3 doubles         0.989 0.0155
## 4 singles         0.991 0.0507
## 5 batting_avg     0.992 0.113
## 6 struckout       0.995 0.422
## 7 hit_outs        0.996 0.663
mlb |>
  select(-row_num, -atbats, -triples) |>
  pivot_longer(
    cols = everything()
  ) |>
  mutate(name = as_factor(name)) |>
  ggplot(
    mapping = aes(sample = value)
  ) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(
    facets = vars(name),
    scales = "free_y"
  ) +
  theme_bw() +
  labs(
    x = NULL,
    y = NULL
  )
mvn(
  data = mlb |> select(-row_num, -atbats, -triples),
  mvnTest = "mardia",
  multivariatePlot = "qq",
  univariateTest = "SW",
  desc = FALSE
)
## $multivariateNormality
##              Test        Statistic              p value Result
## 1 Mardia Skewness 254.650012181646 4.43563927061862e-19     NO
## 2 Mardia Kurtosis 2.72141084146074  0.00650039174991912     NO
## 3             MVN             <NA>                 <NA>     NO
##
## $univariateNormality
##           Test    Variable Statistic p value Normality
## 1 Shapiro-Wilk     singles    0.9910  0.0507       YES
## 2 Shapiro-Wilk     doubles    0.9887  0.0155        NO
## 3 Shapiro-Wilk    homeruns    0.9764  <0.001        NO
## 4 Shapiro-Wilk       walks    0.9672  <0.001        NO
## 5 Shapiro-Wilk    hit_outs    0.9962  0.6631       YES
## 6 Shapiro-Wilk   struckout    0.9951  0.4221       YES
## 7 Shapiro-Wilk batting_avg    0.9925   0.113       YES
mvn(
  data = mlb |> select(-row_num, -atbats, -triples, -homeruns),
  mvnTest = "mardia",
  multivariatePlot = "qq",
  univariateTest = "SW",
  desc = FALSE,
  bc = TRUE
)
## $multivariateNormality
##              Test          Statistic            p value Result
## 1 Mardia Skewness   75.4022635290638 0.0429324497842212     NO
## 2 Mardia Kurtosis 0.0276096897637102  0.977973453298814    YES
## 3             MVN               <NA>               <NA>     NO
##
## $univariateNormality
##           Test    Variable Statistic p value Normality
## 1 Shapiro-Wilk     singles    0.9910  0.0507       YES
## 2 Shapiro-Wilk     doubles    0.9914  0.0629       YES
## 3 Shapiro-Wilk       walks    0.9949  0.3897       YES
## 4 Shapiro-Wilk    hit_outs    0.9963  0.6735       YES
## 5 Shapiro-Wilk   struckout    0.9951  0.4221       YES
## 6 Shapiro-Wilk batting_avg    0.9925  0.1130       YES
##
## $BoxCoxPowerTransformation
##     singles     doubles       walks    hit_outs   struckout batting_avg
##        1.00        0.50        0.50        1.01        1.00        1.00
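For reading the `$BoxCoxPowerTransformation` row: in the standard Box-Cox family, a power of 1 leaves a variable unchanged apart from a shift, 0.5 is a square-root-type transform, and 0 corresponds to the log. The sketch below implements that textbook formula. `box_cox` is a hypothetical helper written for illustration, and whether `mvn()` applies exactly this scaled form internally is an assumption that the output above does not confirm.

```r
# Textbook Box-Cox transform for strictly positive data (hypothetical helper)
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) {
    log(y)                       # limiting case as lambda approaches 0
  } else {
    (y^lambda - 1) / lambda
  }
}

y <- c(0.04, 0.09, 0.16)         # made-up proportions

# lambda = 1: values are only shifted by -1, so the shape is unchanged
print(box_cox(y, 1))

# lambda = 0.5: a rescaled square root, which pulls in a long right tail
print(box_cox(y, 0.5))           # -1.6 -1.4 -1.2

# lambda = 0: the natural log
print(box_cox(y, 0))
```

Read this way, the estimated powers say that doubles and walks were given a square-root-type transform, while the other four variables were left essentially as they were.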