The dataset mlb batting.csv contains the batting results of the 314 MLB players who appeared in at least 50 games during the 2023 season. There are 10 variables in the data. The first is the player number (row_num), which you can ignore for now.
mlb |>
# removing row_num
select(-row_num) |>
colMeans() |>
signif(digits = 3)
## atbats singles doubles triples homeruns walks
## 268.0000 0.1600 0.0509 0.0042 0.0349 0.0990
## hit_outs struckout batting_avg
## 0.4990 0.2520 0.2500
mlb |>
# removing row_num
select(-row_num) |>
cov() |>
round(digits = 4)
## atbats singles doubles triples homeruns walks hit_outs
## atbats 5552.9210 0.3592 0.2103 0.045 0.3011 -0.0184 0.3160
## singles 0.3592 0.0011 0.0000 0.000 -0.0003 -0.0003 0.0005
## doubles 0.2103 0.0000 0.0002 0.000 0.0000 0.0000 -0.0002
## triples 0.0450 0.0000 0.0000 0.000 0.0000 0.0000 0.0000
## homeruns 0.3011 -0.0003 0.0000 0.000 0.0003 0.0002 -0.0005
## walks -0.0184 -0.0003 0.0000 0.000 0.0002 0.0015 -0.0005
## hit_outs 0.3160 0.0005 -0.0002 0.000 -0.0005 -0.0005 0.0041
## struckout -1.2315 -0.0013 -0.0001 0.000 0.0005 0.0007 -0.0039
## batting_avg 0.9169 0.0008 0.0003 0.000 0.0001 -0.0001 -0.0002
## struckout batting_avg
## atbats -1.2315 0.9169
## singles -0.0013 0.0008
## doubles -0.0001 0.0003
## triples 0.0000 0.0000
## homeruns 0.0005 0.0001
## walks 0.0007 -0.0001
## hit_outs -0.0039 -0.0002
## struckout 0.0048 -0.0009
## batting_avg -0.0009 0.0011
mlb |>
# removing row_num
select(-row_num) |>
cor() |>
round(digits = 3)
## atbats singles doubles triples homeruns walks hit_outs struckout
## atbats 1.000 0.148 0.180 0.121 0.216 -0.006 0.066 -0.239
## singles 0.148 1.000 0.001 0.069 -0.479 -0.239 0.235 -0.565
## doubles 0.180 0.001 1.000 -0.048 0.034 0.013 -0.193 -0.054
## triples 0.121 0.069 -0.048 1.000 -0.111 -0.033 0.022 -0.085
## homeruns 0.216 -0.479 0.034 -0.111 1.000 0.231 -0.436 0.360
## walks -0.006 -0.239 0.013 -0.033 0.231 1.000 -0.216 0.249
## hit_outs 0.066 0.235 -0.193 0.022 -0.436 -0.216 1.000 -0.877
## struckout -0.239 -0.565 -0.054 -0.085 0.360 0.249 -0.877 1.000
## batting_avg 0.368 0.719 0.483 0.133 0.091 -0.102 -0.101 -0.389
## batting_avg
## atbats 0.368
## singles 0.719
## doubles 0.483
## triples 0.133
## homeruns 0.091
## walks -0.102
## hit_outs -0.101
## struckout -0.389
## batting_avg 1.000
Create a set of density plots, a scatterplot matrix, and a correlation plot for the 9 variables (excluding row_num).
Which variables appear to be non-normal?
mlb |>
pivot_longer(
cols = -row_num
) |>
mutate(
name = as_factor(name)
) |>
ggplot(
mapping = aes(
x = value,
fill = name
)
) +
geom_density(
show.legend = F
) +
facet_wrap(
facets = vars(name),
scales = "free",
ncol = 3
) +
scale_y_continuous(
expand = c(0, 0, 0.05, 0)
) +
theme_bw() +
labs(
x = NULL
)
The most non-normal variables are atbats, triples, homeruns, and walks. The others aren’t perfectly bell-shaped, but they are roughly so.
Do any of the variables appear to have an obvious non-linear relationship?
mlb |>
select(-row_num) |>
ggpairs() +
theme_bw()
From the scatterplots, there isn’t an obvious non-linear trend.
Which variable appears to have the weakest correlations with the other eight variables?
mlb |>
# removing row_num
select(-row_num) |>
cor() |>
ggcorrplot::ggcorrplot(
lab = T,
colors = c("red", "white", "blue"),
type = "lower",
outline.color = "white",
ggtheme = theme_void,
hc.order = T
)
triples has the weakest correlations overall: its strongest is only 0.13, with batting_avg.
Which pair of variables have the strongest correlation?
The strongest correlation is between struckout and hit_outs, at -0.88, indicating that the more often a player strikes out, the less often they hit a ball that results in an out.
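The strongest pair can also be located programmatically rather than by eye. The following is a minimal base-R sketch; the small matrix R holds three variables whose correlations are copied from the output above, and the object names (R_lower, idx, strongest) are illustrative only.

```r
# Toy 3 x 3 correlation matrix, values copied from the cor() output above
vars <- c("singles", "hit_outs", "struckout")
R <- matrix(
  c( 1.000,  0.235, -0.565,
     0.235,  1.000, -0.877,
    -0.565, -0.877,  1.000),
  nrow = 3, byrow = TRUE,
  dimnames = list(vars, vars)
)

# Blank out the diagonal and the duplicated upper triangle
R_lower <- R
R_lower[upper.tri(R_lower, diag = TRUE)] <- NA

# Position of the largest absolute correlation that remains
idx <- which(abs(R_lower) == max(abs(R_lower), na.rm = TRUE), arr.ind = TRUE)
strongest   <- c(rownames(R)[idx[1, "row"]], colnames(R)[idx[1, "col"]])
strongest_r <- R_lower[idx]
```

On the full data, the same steps could be applied to the 9 x 9 matrix returned by cor().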
Does there appear to be at least one linear dependency? Briefly explain your answer.
c(
"Generalized Variance" = mlb |> select(-row_num) |> cov() |> det(),
"Total Variance" = mlb |> select(-row_num) |> cov() |> diag() |> sum()
)
## Generalized Variance Total Variance
## 4.045203e-45 5.552934e+03
Since the generalized variance is essentially 0 (about 4 × 10^{-45}, numerically indistinguishable from 0), there is at least 1 linear dependency in the data.
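To see why a linear dependency forces the generalized variance to zero, here is a small base-R sketch. The vectors a, b, and c3 are made-up proportions (not the mlb data), constructed so that every row sums to 1; the tolerance 1e-12 is an arbitrary choice for the illustration.

```r
# Hypothetical proportions (not from mlb): a + b + c3 = 1 in every row
a  <- c(0.20, 0.30, 0.25, 0.10, 0.40)
b  <- c(0.50, 0.30, 0.35, 0.60, 0.20)
c3 <- 1 - a - b

# Generalized variance = determinant of the covariance matrix
gv_dependent   <- det(cov(cbind(a, b, c3)))  # redundant column included
gv_independent <- det(cov(cbind(a, b)))      # redundant column dropped

# Floating-point arithmetic rarely returns an exact 0,
# so compare against a small tolerance instead of testing == 0
is_singular <- abs(gv_dependent) < 1e-12
```

Dropping one variable from the dependent set gives a clearly positive determinant, which is why a variable is removed before computing Mahalanobis distances later on.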
c(
"Generalized Variance" = mlb |> select(-row_num) |> cor() |> det(),
"Total Variance" = mlb |> select(-row_num) |> cor() |> diag() |> sum()
)
## Generalized Variance Total Variance
## 3.027458e-21 9.000000e+00
The total variance is 9 because the “variance” of each variable in the correlation matrix is 1. The total variance is the trace of the matrix (the sum of the diagonal), so we are adding nine 1s together, for a total of 9.
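The same fact holds for any correlation matrix: its trace equals the number of variables. A quick check, using the built-in iris data purely as a stand-in with 4 numeric columns:

```r
# iris ships with R; only its 4 numeric columns are used here
R <- cor(iris[, 1:4])

total_variance <- sum(diag(R))  # trace of the correlation matrix
p <- ncol(R)                    # number of variables
```

Here total_variance equals p (4), just as it equals 9 for the nine batting variables.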
eigenR <-
mlb |>
select(-row_num) |>
cor() |>
eigen()
# Eigenvalues
round(eigenR$values, 5)
## [1] 2.84035 1.89004 1.08922 1.02000 0.85284 0.78715 0.52037 0.00003 0.00000
Since the last eigenvalue is 0, there is 1 exact linear dependency. The second-to-last eigenvalue is very small (about 3 × 10^{-5}), which indicates a close, though not perfect, linear dependency in the data that we would want to be on the lookout for.
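Counting the eigenvalues that are numerically zero gives the number of exact dependencies. A minimal sketch on made-up proportions (not the mlb data) that sum to 1 by construction; the cutoff tol is an assumption chosen for the illustration:

```r
# Hypothetical proportions: a + b + c3 = 1 in every row
a <- c(0.20, 0.30, 0.25, 0.10, 0.40, 0.15)
b <- c(0.50, 0.30, 0.35, 0.60, 0.20, 0.45)
X <- cbind(a = a, b = b, c3 = 1 - a - b)

# Eigenvalues of the correlation matrix, largest first
vals <- eigen(cor(X), symmetric = TRUE)$values

# Eigenvalues below the tolerance are treated as exact zeros
tol <- 1e-8
n_exact <- sum(abs(vals) < tol)
```

The eigenvalues still sum to the number of variables (3 here), so a zero eigenvalue means the remaining ones carry all of the total variance.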
Using the last eigenvectors corresponding to your answers in question 5), what set(s) of variables are linearly dependent?
eigenS <-
mlb |>
select(-row_num) |>
cov() |>
eigen()
# Eigenvectors
Evecs <- eigenS$vectors
row.names(Evecs) <- colnames(mlb)[-1]
colnames(Evecs) <- paste0("e", 1:(ncol(mlb)-1))
round(Evecs[,8:9], 3)
## e8 e9
## atbats 0.000 0.000
## singles 0.217 -0.408
## doubles 0.219 -0.408
## triples 0.219 -0.408
## homeruns 0.218 -0.408
## walks 0.000 0.000
## hit_outs -0.437 -0.408
## struckout -0.437 -0.408
## batting_avg -0.654 0.000
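The exact linear dependency (e9) is among singles, doubles, triples, homeruns, hit_outs, and struckout: these proportions have to add up to 1, since together they cover every result of an at bat.
A close-to-exact linear association (e8) is among singles, doubles, triples, homeruns, hit_outs, struckout, and batting_avg. Combined with the exact dependency above, it says that batting_avg is very nearly the sum of singles, doubles, triples, and homeruns.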
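The way a dependency is read off the last eigenvector can be sketched on toy data. Below, a, b, and c3 are made-up proportions that sum to 1, and d is an unrelated made-up column; none of these values come from mlb.

```r
# Hypothetical data: a + b + c3 = 1 in every row, d is unrelated
a <- c(0.20, 0.30, 0.25, 0.10, 0.40, 0.15)
b <- c(0.50, 0.30, 0.35, 0.60, 0.20, 0.45)
X <- cbind(a = a, b = b, c3 = 1 - a - b, d = c(3, 1, 4, 1, 5, 9))

# Eigenvector paired with the smallest (zero) eigenvalue
eig    <- eigen(cov(X), symmetric = TRUE)
v_last <- eig$vectors[, ncol(X)]
names(v_last) <- colnames(X)

# Rescale so the first coefficient is 1; this also removes the arbitrary sign
coefs <- round(v_last / v_last["a"], 3)

# The linear combination defined by v_last has (numerically) zero variance
combo_var <- as.numeric(var(X %*% v_last))
```

The rescaled coefficients are 1 for a, b, and c3 and 0 for d: variables with equal nonzero weights form the constant sum, and variables with zero weight are not part of the dependency. This is the same pattern as e9 above, where six variables share the weight -0.408.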
mlb |>
mahalanobis_distance(-row_num, -atbats, -triples) |>
ggplot(
mapping = aes(
x = row_num,
y = mahal.dist
)
) +
geom_segment(
aes(xend = row_num,
yend = 0)
) +
geom_point(
mapping = aes(color = is.outlier),
show.legend = F
) +
geom_hline(
yintercept = qchisq(0.999, df = 7),
color = "red",
linetype = 2
) +
scale_color_manual(
values = c("black", "red")
) +
scale_y_continuous(
expand = c(0, 0, 0.05, 0)
) +
scale_x_continuous(
expand = c(0.025, 0)
) +
theme_bw() +
labs(
x = NULL,
y = "Mahalanobis Distance"
)
There appear to be three outliers in the data, using a 99.9% threshold for outlier detection.
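The red cutoff line in the plot is the 0.999 quantile of a chi-squared distribution with 7 degrees of freedom (one per variable kept). A base-R sketch of the same idea with stats::mahalanobis(), on simulated data with one planted outlier; the data and object names are illustrative and are not the mlb data:

```r
set.seed(1)

# 100 simulated rows of 3 standard-normal variables plus one planted outlier
X <- rbind(matrix(rnorm(300), ncol = 3), c(10, 10, 10))

# Squared Mahalanobis distance of each row from the sample mean
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Cutoff: 0.999 quantile of chi-squared with df = number of variables
cutoff     <- qchisq(0.999, df = ncol(X))
is_outlier <- d2 > cutoff

# For the 7 variables used above, the cutoff is qchisq(0.999, df = 7), about 24.3
cutoff_7 <- qchisq(0.999, df = 7)
```

The planted row (row 101) is flagged. Note that the outlier itself inflates the sample covariance, so distances computed this way are somewhat conservative.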
mlb |>
shapiro_test(singles, doubles, homeruns, walks, hit_outs, struckout, batting_avg) |>
mutate(p = round(p, digits = 5)) |>
arrange(p)
## # A tibble: 7 × 3
## variable statistic p
## <chr> <dbl> <dbl>
## 1 walks 0.967 0
## 2 homeruns 0.976 0.00005
## 3 doubles 0.989 0.0155
## 4 singles 0.991 0.0507
## 5 batting_avg 0.992 0.113
## 6 struckout 0.995 0.422
## 7 hit_outs 0.996 0.663
mlb |>
select(-row_num, -atbats, -triples) |>
pivot_longer(
cols = everything()
) |>
mutate(name = as_factor(name)) |>
ggplot(
mapping = aes(sample = value)
) +
stat_qq() +
stat_qq_line() +
facet_wrap(
facets = vars(name),
scales = "free_y"
) +
theme_bw() +
labs(
x = NULL,
y = NULL
)
From the QQ plots, walks and homeruns
strongly appear to be non-normal. The other variables are at least
roughly normal, and the QQ plots for singles, doubles, and
batting_avg are about as good as you can hope to
see.
baseball_mvn <-
mvn(
data = mlb |> select(-row_num, -atbats, -triples),
mvn_test = "mardia",
univariate_test = 'SW',
desc = F
)
# Checking MVN
baseball_mvn$multivariate_normality
## Test Statistic p.value Method MVN
## 1 Mardia Skewness 254.650 <0.001 asymptotic ✗ Not normal
## 2 Mardia Kurtosis 2.721 0.007 asymptotic ✗ Not normal
# Creating the chi-squared QQ plot
plot(baseball_mvn)
The Mardia tests and the chi-squared QQ plot agree that the data are not MVN: both the skewness and kurtosis tests reject, and the points do not follow the line closely.
baseball_mvn9 <-
mvn(
data = mlb |> select(-row_num, -atbats, -triples, -homeruns),
mvn_test = "mardia",
univariate_test = "SW",
desc = F,
power_family = 'bcPower',
power_transform_type = 'rounded'
)
# Multivariate test
baseball_mvn9$multivariate_normality
## Test Statistic p.value Method MVN
## 1 Mardia Skewness 70.816 0.088 asymptotic ✓ Normal
## 2 Mardia Kurtosis -0.089 0.929 asymptotic ✓ Normal
# Univariate test
baseball_mvn9$univariate_normality
## Test Variable Statistic p.value Normality
## 1 Shapiro-Wilk singles 0.991 0.051 ✓ Normal
## 2 Shapiro-Wilk doubles 0.991 0.063 ✓ Normal
## 3 Shapiro-Wilk walks 0.995 0.390 ✓ Normal
## 4 Shapiro-Wilk hit_outs 0.996 0.671 ✓ Normal
## 5 Shapiro-Wilk struckout 0.995 0.422 ✓ Normal
## 6 Shapiro-Wilk batting_avg 0.992 0.113 ✓ Normal
# Plot
plot(baseball_mvn9)
# transformations
baseball_mvn9$power_transform_lambda
## singles doubles walks hit_outs struckout batting_avg
## 1.000000 0.500000 0.500000 1.007836 1.000000 1.000000
After the transformation, the data appear to be MVN. The points
in the QQ plot follow the line very closely, and none of the tests (univariate
or multivariate) reject the null hypothesis. The only two
variables that were transformed (after dropping homeruns) are
doubles and walks, each with a square root
transformation. The remaining variables are effectively untransformed (power of 1, or about 1 for hit_outs).
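A power of 0.5 can be applied as a plain square root. The sketch below assumes the usual Box-Cox definition, (x^lambda - 1) / lambda, which differs from sqrt(x) only by a shift and a rescaling, so the Shapiro-Wilk statistic is unchanged. The helper bc_power() and the vector x are hypothetical; they are not taken from the MVN package or from mlb.

```r
# Hypothetical helper: Box-Cox power transform under the usual definition
bc_power <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Made-up positive proportions standing in for a variable such as walks
x <- c(0.05, 0.08, 0.11, 0.07, 0.13, 0.09, 0.16, 0.10, 0.06, 0.12)

# Shapiro-Wilk W is invariant to shifts and positive rescaling,
# so both versions of the transform give the same statistic
w_sqrt <- shapiro.test(sqrt(x))$statistic
w_bc   <- shapiro.test(bc_power(x, 0.5))$statistic
```

In practice this means doubles and walks could be transformed with sqrt() directly before any later analysis that assumes normality.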