This project aims to predict the salary level of Major League Baseball players using performance statistics from the Hitters.csv dataset. We will build a multinomial logistic regression model to classify players into salary categories (LowSalary, MediumSalary, HighSalary) and evaluate the model using cross-validation. Multinomial logistic regression is used when the response variable \(Y\) has more than two categories. In this section, we present the generic formulation of this model and explain how it differs from the binary logistic regression model.
Let \(Y\) be a categorical response variable with \(J\) distinct categories (e.g., LowSalary, MediumSalary, HighSalary) and let = (\(x_1, x_2, \dots, x_p\)) denote the predictors. We choose one of the \(J\) categories (commonly the last one, \(J\)) as the reference category.
For each non-reference category \(j\) (with \(j = 1, 2, \dots, J-1\)), the model is defined by:
\[ \log \left(\frac{P(Y = j \mid \mathbf{x})}{P(Y = J \mid \mathbf{x})}\right) = \beta_{j0} + \beta_{j1}x_1 + \beta_{j2}x_2 + \cdots + \beta_{jp}x_p. \]
From this, the probability of observing category \(j\) is:
\[ P(Y = j \mid \mathbf{x}) = \frac{\exp\left(\beta_{j0} + \beta_{j1}x_1 + \cdots + \beta_{jp}x_p\right)}{1 + \sum_{l=1}^{J-1} \exp\left(\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p\right)}, \]
for j = 1, 2, , J-1, and the probability for the reference category \(J\) is:
\[ P(Y = J \mid \mathbf{x}) = \frac{1}{1 + \sum_{l=1}^{J-1} \exp\left(\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p\right)}. \]
Binary logistic regression is used when the response variable has only two categories. Its equation is: \[ \log \left(\frac{P(Y = 1 \mid \mathbf{x})}{1 - P(Y = 1 \mid \mathbf{x})}\right) = \beta_0 + \beta_1x_1 + \cdots + \beta_px_p. \] Here, the coefficient \(k\) represents the change in the log-odds of \(Y = 1\) for a one-unit increase in \(x_k\).
Multinomial logistic regression extends binary logistic regression to handle a response variable with more than two categories. It constructs \(J-1\) logit equations comparing each non-reference category to the reference category. The coefficients \(\beta_j,_k\) represent the change in the log-odds of being in category \(j\) (compared to the reference category) for a one-unit change in \(x_k\).
In this section, we import the Hitters dataset, inspect its structure, and perform some basic data preprocessing. We will:
# Importing the Hitters dataset
Hitters <- read.csv("Hitters.csv", header = TRUE)
# Display the dimensions of the dataset: [number of rows, number of columns]
dim(Hitters)
## [1] 317 20
# Display the structure of the dataset to understand variable types and data organization
str(Hitters)
## 'data.frame': 317 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 574 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 159 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 21 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 107 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 75 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 59 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 10 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 4631 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1300 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 90 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 702 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 504 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 488 ...
## $ League : chr "A" "N" "A" "N" ...
## $ Division : chr "E" "W" "W" "E" ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 238 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 445 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 22 ...
## $ Salary : num NA 475 480 500 91.5 ...
## $ NewLeague: chr "A" "N" "A" "N" ...
# Provide summary statistics for each variable
summary(Hitters)
## AtBat Hits HmRun Runs
## Min. : 16.0 Min. : 1.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:256.0 1st Qu.: 64.0 1st Qu.: 4.00 1st Qu.: 30.00
## Median :379.0 Median : 96.0 Median : 8.00 Median : 48.00
## Mean :381.2 Mean :101.1 Mean :10.75 Mean : 50.89
## 3rd Qu.:512.0 3rd Qu.:137.0 3rd Qu.:16.00 3rd Qu.: 69.00
## Max. :687.0 Max. :238.0 Max. :40.00 Max. :130.00
##
## RBI Walks Years CAtBat
## Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19
## 1st Qu.: 28.00 1st Qu.: 22.00 1st Qu.: 4.000 1st Qu.: 822
## Median : 44.00 Median : 35.00 Median : 6.000 Median : 1928
## Mean : 47.95 Mean : 38.51 Mean : 7.429 Mean : 2646
## 3rd Qu.: 63.00 3rd Qu.: 53.00 3rd Qu.:11.000 3rd Qu.: 3919
## Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053
##
## CHits CHmRun CRuns CRBI
## Min. : 4.0 Min. : 0.00 Min. : 1.0 Min. : 0.0
## 1st Qu.: 209.0 1st Qu.: 14.00 1st Qu.: 101.0 1st Qu.: 91.0
## Median : 506.0 Median : 37.00 Median : 247.0 Median : 219.0
## Mean : 717.3 Mean : 68.94 Mean : 358.1 Mean : 328.8
## 3rd Qu.:1051.0 3rd Qu.: 90.00 3rd Qu.: 518.0 3rd Qu.: 421.0
## Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.0
##
## CWalks League Division PutOuts
## Min. : 0 Length:317 Length:317 Min. : 0.0
## 1st Qu.: 68 Class :character Class :character 1st Qu.: 109.0
## Median : 170 Mode :character Mode :character Median : 212.0
## Mean : 258 Mean : 290.5
## 3rd Qu.: 337 3rd Qu.: 325.0
## Max. :1566 Max. :1378.0
##
## Assists Errors Salary NewLeague
## Min. : 0.0 Min. : 0.000 Min. : 67.5 Length:317
## 1st Qu.: 7.0 1st Qu.: 3.000 1st Qu.: 190.0 Class :character
## Median : 40.0 Median : 6.000 Median : 420.0 Mode :character
## Mean :107.9 Mean : 8.091 Mean : 532.8
## 3rd Qu.:166.0 3rd Qu.:11.000 3rd Qu.: 750.0
## Max. :492.0 Max. :32.000 Max. :2460.0
## NA's :58
# Counting total missing values in the dataset
total_missing <- sum(is.na(Hitters))
total_missing
## [1] 58
Before handling the missing values, we need to assess whether they occur at random or if there is a systematic pattern. In this case, we have observed that the missing values appear only in the target variable.
# Load the raw Hitters dataset with missing values
Hitters_raw <- read.csv("Hitters.csv", header = TRUE)
# Check dimensions: total rows and columns
dim(Hitters_raw)
## [1] 317 20
naniarvis_miss(Hitters_raw) +
theme(
axis.text.x = element_text(angle = 75, hjust = 0)
) +
labs(
title = "Missingness in Hitters Dataset",
)
Since:
• The missing values occur only in the target variable, and
• Our investigation shows no systematic bias in how the missing values are distributed (i.e., they appear to be Missing Completely At Random, MCAR),
it is acceptable to remove these 58 observations (out of 317) from the analysis. Imputing the target variable is generally not recommended because it can introduce further bias.
# Remove rows with any missing values
Hitters_clean <- na.omit(Hitters)
# Confirm new dimensions after removing missing values
dim(Hitters_clean)
## [1] 259 20
# Identify numeric columns in the cleaned dataset
numeric_vars <- sapply(Hitters_clean, is.numeric)
# Standardize the numeric predictors using scale()
Hitters_clean[numeric_vars] <- scale(Hitters_clean[numeric_vars])
# Check the updated summary to confirm that numeric variables now have a mean 0 and a s.d 1
describe(Hitters_clean)
## vars n mean sd median trimmed mad min max range skew
## AtBat 1 259 0.00 1.0 0.06 0.01 1.26 -2.62 1.93 4.54 -0.13
## Hits 2 259 0.00 1.0 -0.11 -0.03 1.15 -2.37 2.88 5.25 0.25
## HmRun 3 259 0.00 1.0 -0.29 -0.10 1.01 -1.32 3.23 4.55 0.80
## Runs 4 259 0.00 1.0 -0.11 -0.04 1.16 -2.15 2.95 5.10 0.36
## RBI 5 259 0.00 1.0 -0.17 -0.07 1.03 -1.99 2.70 4.68 0.57
## Walks 6 259 0.00 1.0 -0.18 -0.06 1.11 -1.91 3.00 4.91 0.54
## Years 7 259 0.00 1.0 -0.26 -0.10 0.93 -1.31 3.49 4.79 0.82
## CAtBat 8 259 0.00 1.0 -0.31 -0.14 0.82 -1.15 4.98 6.13 1.32
## CHits 9 259 0.00 1.0 -0.32 -0.15 0.81 -1.10 5.44 6.54 1.46
## CHmRun 10 259 0.00 1.0 -0.36 -0.20 0.54 -0.84 5.87 6.70 2.23
## CRuns 11 259 0.00 1.0 -0.33 -0.15 0.76 -1.08 5.45 6.53 1.53
## CRBI 12 259 0.00 1.0 -0.31 -0.18 0.70 -1.01 4.13 5.13 1.55
## CWalks 13 259 0.00 1.0 -0.32 -0.18 0.66 -0.98 5.02 6.00 1.90
## League* 14 259 1.47 0.5 1.00 1.46 0.00 1.00 2.00 1.00 0.12
## Division* 15 259 1.51 0.5 2.00 1.51 0.00 1.00 2.00 1.00 -0.02
## PutOuts 16 259 0.00 1.0 -0.24 -0.20 0.56 -1.04 3.85 4.89 2.04
## Assists 17 259 0.00 1.0 -0.51 -0.17 0.44 -0.83 2.55 3.38 1.14
## Errors 18 259 0.00 1.0 -0.25 -0.11 0.90 -1.31 3.53 4.84 0.93
## Salary 19 259 0.00 1.0 -0.25 -0.16 0.89 -1.03 4.27 5.30 1.60
## NewLeague* 20 259 1.46 0.5 1.00 1.45 0.00 1.00 2.00 1.00 0.15
## kurtosis se
## AtBat -0.99 0.06
## Hits -0.57 0.06
## HmRun -0.22 0.06
## Runs -0.60 0.06
## RBI -0.41 0.06
## Walks -0.39 0.06
## Years -0.06 0.06
## CAtBat 1.99 0.06
## CHits 2.93 0.06
## CHmRun 6.15 0.06
## CRuns 3.12 0.06
## CRBI 1.98 0.06
## CWalks 4.33 0.06
## League* -1.99 0.03
## Division* -2.01 0.03
## PutOuts 4.00 0.06
## Assists 0.00 0.06
## Errors 0.27 0.06
## Salary 3.03 0.06
## NewLeague* -1.99 0.03
Converting a variable to a factor in R means treating it as a categorical variable. Factors store categorical data along with the distinct levels (i.e., unique values) that the variable can take. This is important because many modeling functions in R (like multinomial logistic regression) need to know that a variable is categorical so they can automatically create dummy variables for each level.
# Convert character columns to factors
Hitters_clean[sapply(Hitters_clean, is.character)] <-
lapply(Hitters_clean[sapply(Hitters_clean, is.character)], as.factor)
In this section, we transform the continuous “Salary” variable into a categorical response variable. The project suggests using salary tertiles (three categories), but for the bonus, we will define four categories based on quartiles. This can be a finer approach in capturing differences in salary.
The new variable, SalaryLevel, is created using the
cut() function along with quantile() to set
the breakpoints at the 0%, 25%, 50%, 75%, and 100% percentiles. We then
label these four groups as follows:
Low: Salaries in the bottom 25%.
Medium-Low: Salaries between the 25th and 50th percentiles.
Medium-High: Salaries between the 50th and 75th percentiles.
High: Salaries in the top 25%.
# Create SalaryLevel variable using quartile-based cutoffs (bonus approach)
# Create a categorical SalaryLevel variable using quartiles on the cleaned dataset
Hitters_clean$SalaryLevel <- cut(Hitters_clean$Salary,
breaks = quantile(Hitters_clean$Salary, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE),
labels = c("Low", "Medium-Low", "Medium-High", "High"),
include.lowest = TRUE)
# Display the distribution of the new SalaryLevel variable
table(Hitters_clean$SalaryLevel)
##
## Low Medium-Low Medium-High High
## 66 64 70 59
This section outlines a thorough exploration of the Hitters dataset. Our goal is to understand the data’s structure, distribution, and relationships between variables. We follow these steps:
• Report summary statistics (means, standard deviations, quartiles, etc.)
• Visualize distributions using histograms, boxplots and bar charts.
• Analyze relationships with correlation matrices.
• Perform basic statistical tests (Chi-squared tests and t-tests) to compare groups.
We begin by loading our cleaned dataset (Hitters_clean), which has been preprocessed (missing values removed, numeric variables standardized, and non-numeric variables converted to factors when appropriate).
str(Hitters_clean)
## 'data.frame': 259 obs. of 21 variables:
## $ AtBat : num -0.604 0.511 0.627 -0.563 1.293 ...
## $ Hits : num -0.597 0.489 0.733 -0.464 1.354 ...
## $ HmRun : num -0.52 0.73 0.957 -0.179 -0.861 ...
## $ Runs : num -1.204 0.442 0.403 -0.616 0.756 ...
## $ RBI : num -0.515 0.8011 1.0333 -0.3602 -0.0118 ...
## $ Walks : num -0.0829 1.6471 -0.1764 -0.5037 -0.2699 ...
## $ Years : num 1.403 -0.889 0.778 -1.098 0.778 ...
## $ CAtBat : num 0.351 -0.446 1.302 -0.982 0.77 ...
## $ CHits : num 0.178 -0.404 1.316 -0.952 0.636 ...
## $ CHmRun : num 0.00855 -0.06484 1.91653 -0.6886 -0.60299 ...
## $ CRuns : num -0.115 -0.408 1.415 -0.939 0.428 ...
## $ CRBI : num 0.268 -0.191 1.582 -0.873 0.026 ...
## $ CWalks : num 0.4536 0.0246 0.3732 -0.8565 -0.2397 ...
## $ League : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 2 ...
## $ Division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 2 ...
## $ PutOuts : num 1.2072 2.0884 -0.3277 1.8219 -0.0363 ...
## $ Assists : num -0.532 -0.264 -0.752 -0.553 2.065 ...
## $ Errors : num 0.201 0.806 -0.858 -0.706 2.469 ...
## $ Salary : num -0.128 -0.1169 -0.0727 -0.9771 0.4809 ...
## $ NewLeague : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 2 ...
## $ SalaryLevel: Factor w/ 4 levels "Low","Medium-Low",..: 3 3 3 1 3 1 1 1 3 3 ...
## - attr(*, "na.action")= 'omit' Named int [1:58] 1 15 18 21 28 30 34 36 38 39 ...
## ..- attr(*, "names")= chr [1:58] "1" "15" "18" "21" ...
The str() output shows the structure of our dataset (number of
observations, types of variables). This initial inspection confirms that
our data is ready for further analysis.
In the snippet below, we replicate the approach shown in the
guidelines:
1. Histogram of Salary with overlaid lines for the
median and mean.
2. Summary Table displaying key statistics (mean,
standard deviation, min, and max).
# 1. Histogram of Salary
hist(Hitters_clean$Salary,
main = "Salary",
xlab = "Salary",
col = "lavender",
breaks = 10)
# Add vertical lines for the median (red) and mean (black)
abline(v = median(Hitters_clean$Salary), col = "red", lwd = 3)
abline(v = mean(Hitters_clean$Salary), col = "black", lwd = 3)
# Add a legend
legend("topright",
legend = c("median", "mean"),
lty = 1,
col = c("red", "black"))
# 2. Summary Statistics Table
summary_stats <- data.frame(
Mean = c(mean(Hitters_clean$Salary), mean(Hitters_clean$PutOuts)),
sd = c(sd(Hitters_clean$Salary), sd(Hitters_clean$PutOuts)),
Minimum = c(min(Hitters_clean$Salary), min(Hitters_clean$PutOuts)),
Maximum = c(max(Hitters_clean$Salary), max(Hitters_clean$PutOuts))
)
rownames(summary_stats) <- c("Salary", "PutOuts")
summary_stats
## Mean sd Minimum Maximum
## Salary 1.886093e-16 1 -1.030266 4.267057
## PutOuts 6.708490e-17 1 -1.038259 3.854228
Since the target variable is now categorical (SalaryLevel), we use ANOVA to test if the means of continuous predictors differ significantly across salary groups.
# Identify numeric variables, excluding Salary
numeric_vars <- setdiff(names(Hitters_clean)[sapply(Hitters_clean, is.numeric)], "Salary")
# Perform ANOVA for each numeric variable against SalaryLevel
anova_results <- lapply(numeric_vars, function(var) {
formula <- as.formula(paste(var, "~ SalaryLevel"))
aov_model <- aov(formula, data = Hitters_clean)
p_val <- summary(aov_model)[[1]][["Pr(>F)"]][1]
return(p_val)
})
# Combine results into a data frame
anova_df <- data.frame(Variable = numeric_vars,
p_value = unlist(anova_results))
anova_df <- anova_df[order(anova_df$p_value), ]
anova_df
## Variable p_value
## 9 CHits 1.267719e-26
## 11 CRuns 2.473225e-26
## 8 CAtBat 5.285011e-26
## 12 CRBI 2.180803e-24
## 7 Years 4.091916e-22
## 13 CWalks 3.277435e-19
## 10 CHmRun 2.658441e-17
## 2 Hits 9.988026e-12
## 5 RBI 4.749327e-11
## 6 Walks 1.986297e-10
## 1 AtBat 3.246702e-10
## 4 Runs 7.783730e-10
## 3 HmRun 7.770850e-07
## 14 PutOuts 2.889509e-05
## 16 Errors 1.186867e-01
## 15 Assists 4.834535e-01
Observing the ANOVA tests, we found that nearly all
numeric predictors except Errors and Assists are statistically
significant (p < 0.05).
Highly correlated predictors may be redundant. We compute and visualize the correlation matrix for the significant predictors.
# Define significant predictors (based on ANOVA results)
significant_vars <- c("CHits", "CRuns", "CAtBat", "CRBI", "Years",
"CWalks", "CHmRun", "Hits", "RBI", "Walks",
"AtBat", "Runs", "HmRun", "PutOuts")
# Compute the correlation matrix for these variables
corr_matrix <- cor(Hitters_clean[, significant_vars], use = "pairwise.complete.obs")
round(corr_matrix, 2)
## CHits CRuns CAtBat CRBI Years CWalks CHmRun Hits RBI Walks AtBat Runs
## CHits 1.00 0.98 1.00 0.95 0.90 0.89 0.79 0.23 0.28 0.26 0.22 0.18
## CRuns 0.98 1.00 0.98 0.95 0.88 0.93 0.82 0.24 0.30 0.32 0.23 0.23
## CAtBat 1.00 0.98 1.00 0.95 0.92 0.91 0.80 0.20 0.27 0.26 0.20 0.16
## CRBI 0.95 0.95 0.95 1.00 0.86 0.89 0.93 0.22 0.38 0.30 0.22 0.20
## Years 0.90 0.88 0.92 0.86 1.00 0.84 0.72 0.02 0.12 0.12 0.01 -0.02
## CWalks 0.89 0.93 0.91 0.89 0.84 1.00 0.81 0.12 0.22 0.41 0.13 0.15
## CHmRun 0.79 0.82 0.80 0.93 0.72 0.81 1.00 0.19 0.44 0.34 0.21 0.23
## Hits 0.23 0.24 0.20 0.22 0.02 0.12 0.19 1.00 0.79 0.59 0.96 0.91
## RBI 0.28 0.30 0.27 0.38 0.12 0.22 0.44 0.79 1.00 0.56 0.80 0.78
## Walks 0.26 0.32 0.26 0.30 0.12 0.41 0.34 0.59 0.56 1.00 0.62 0.70
## AtBat 0.22 0.23 0.20 0.22 0.01 0.13 0.21 0.96 0.80 0.62 1.00 0.90
## Runs 0.18 0.23 0.16 0.20 -0.02 0.15 0.23 0.91 0.78 0.70 0.90 1.00
## HmRun 0.21 0.25 0.21 0.34 0.11 0.22 0.49 0.53 0.85 0.44 0.56 0.63
## PutOuts 0.07 0.06 0.06 0.10 -0.01 0.07 0.10 0.30 0.32 0.29 0.31 0.27
## HmRun PutOuts
## CHits 0.21 0.07
## CRuns 0.25 0.06
## CAtBat 0.21 0.06
## CRBI 0.34 0.10
## Years 0.11 -0.01
## CWalks 0.22 0.07
## CHmRun 0.49 0.10
## Hits 0.53 0.30
## RBI 0.85 0.32
## Walks 0.44 0.29
## AtBat 0.56 0.31
## Runs 0.63 0.27
## HmRun 1.00 0.25
## PutOuts 0.25 1.00
corrplot(corr_matrix, method = "color", tl.cex = 0.8, addCoef.col = "black")
• Career Statistics (“C” Variables):
Variables like CHits, CRuns, CAtBat, CRBI, CHmRun, CWalks, and Years are all highly correlated with each other (correlations often above 0.85). This makes intuitive sense: players with more career hits (CHits) typically have more career at-bats (CAtBat), more career runs (CRuns), and so on.
• Single-Season Statistics (1986 Season):
Variables such as Hits, AtBat, RBI, Runs, HmRun, and Walks also show moderate to strong correlations among themselves (e.g., Hits and AtBat around 0.80). This reflects the fact that better performance in one statistic during a single season often accompanies better performance in others.
Within each cluster, many variables have correlations well above 0.80 or 0.90. Such high correlations imply multicollinearity, meaning these variables carry very similar information. For modeling, we generally do not want to include all of these highly correlated predictors at once, as it can destabilize coefficient estimates and inflate variance.
PutOuts is not strongly correlated with the other hitting or batting
statistics, suggesting it might provide distinct information. If we want
to capture fielding ability in our model, this variable could be worth
exploring, as it does not overlap heavily with the others.
v
### Visualizing Relationships with Boxplots
ggplot(Hitters_clean, aes(x = SalaryLevel, y = Years)) +
geom_boxplot(fill = "lightgreen", color = "black") +
labs(title = "Distribution of Years by Salary Level",
x = "Salary Level",
y = "Years in Major Leagues") +
theme_minimal()
• Overall Trend: Players in higher salary
categories tend to have more experience (higher “Years”).
• Distribution: The “Low” salary group shows mostly negative values (below-average years), whereas the “High” salary group centers around positive values (above-average years).
• Interpretation: This suggests that experience
(time spent in the major leagues) strongly correlates with higher
salaries.
ggplot(Hitters_clean, aes(x = SalaryLevel, y = CRBI)) +
geom_boxplot(fill = "lightblue", color = "black") +
labs(title = "Distribution of Career RBI by Salary Level",
x = "Salary Level",
y = "Career RBI") +
theme_minimal()
• Overall Trend: The median CRBI increases
notably from the “Low” to the “High” salary groups.
• Distribution: The “High” salary group shows both a higher median and a wider spread, indicating that players with above-average career RBI totals are more likely to be in the higher salary brackets.
• Interpretation: Offensive productivity, as
measured by career runsbatted in, appears to be an important factor in
salary determination.
ggplot(Hitters_clean, aes(x = SalaryLevel, y = PutOuts)) +
geom_boxplot(fill = "lightcoral", color = "black") +
labs(title = "Distribution of PutOuts by Salary Level",
x = "Salary Level",
y = "PutOuts") +
theme_minimal()
• Overall Trend: While there is an upward
shift in median PutOuts from the “Low” to the “High” group, the
separation is less dramatic than for Years or CRBI.
• Distribution: The “High” group exhibits the largest variability, with some players having substantially higher PutOuts. This may reflect differences in defensive roles or positions (e.g., first basemen vs. outfielders).
• Interpretation: PutOuts can still distinguish some
of the higher-salary players—possibly those in positions that accrue
more putouts—but the effect is not as strong as for offensive or
experience metrics.
For categorical variables, we can perform a Chi-squared test to
assess associations. For example, with Division
variable:
chisq_test <- chisq.test(table(Hitters_clean$SalaryLevel, Hitters_clean$Division))
chisq_test
##
## Pearson's Chi-squared test
##
## data: table(Hitters_clean$SalaryLevel, Hitters_clean$Division)
## X-squared = 7.3565, df = 3, p-value = 0.06136
Since the p-value (≈ 0.061) is slightly above the conventional significance threshold of 0.05, we fail to reject the null hypothesis at the 5% level. This suggests that we do not have sufficient evidence to conclude there is a statistically significant association between SalaryLevel and Division. In other words, salary category and division membership appear to be independent at this level of significance. However, the p-value is close to 0.05, which indicates a borderline result.
chisq_test <- chisq.test(table(Hitters_clean$SalaryLevel, Hitters_clean$NewLeague))
chisq_test
##
## Pearson's Chi-squared test
##
## data: table(Hitters_clean$SalaryLevel, Hitters_clean$NewLeague)
## X-squared = 2.8035, df = 3, p-value = 0.4229
With a p-value of approximately 0.4229—well above the typical 0.05
threshold—we fail to reject the null hypothesis. This indicates no
statistically significant evidence of an association between a player’s
salary category and their new league assignment. Put another way, being
in League A or League N at the beginning of 1987 does not appear to be
related to whether a player falls into the Low, Medium-Low, Medium-High,
or High salary group.
We can compare the means of a continuous variable between two groups
using a t-test. For instance, comparing Hits between two
salary groups (Low vs. High):
# Subset the data into Low and High Salary groups based on SalaryLevel
low_salary <- subset(Hitters_clean, SalaryLevel == "Low")
high_salary <- subset(Hitters_clean, SalaryLevel == "High")
# Perform a Welch Two Sample t-test on CHmRun between the two groups
t_test_hits <- t.test(low_salary$Hits, high_salary$Hits)
t_test_hits
##
## Welch Two Sample t-test
##
## data: low_salary$Hits and high_salary$Hits
## t = -6.8089, df = 115.56, p-value = 4.655e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.4303486 -0.7856984
## sample estimates:
## mean of x mean of y
## -0.4398894 0.6681341
Because the p-value is far below the conventional 0.05 threshold, we
reject the null hypothesis of equal means. The negative confidence
interval indicates that the mean Hits value in the Low
Salary group is significantly lower than that of the High Salary group.
In practical terms, this strongly suggests that players in the High
Salary category have, on average, more hits than players in the Low
Salary category. In conclusion, the t-test confirms a statistically
significant difference in performance (as measured by Hits) between the
lowest- and highest-paid players, aligning with the expectation that
higher-performing players command higher salaries.
We can further do another t-test with CHmRun variable
between two salary groups (Low vs. High):
# Subset the data into Low and High Salary groups based on SalaryLevel
low_salary <- subset(Hitters_clean, SalaryLevel == "Low")
high_salary <- subset(Hitters_clean, SalaryLevel == "High")
# Perform a Welch Two Sample t-test on CHmRun between the two groups
t_test_chmrun <- t.test(low_salary$CHmRun, high_salary$CHmRun)
t_test_chmrun
##
## Welch Two Sample t-test
##
## data: low_salary$CHmRun and high_salary$CHmRun
## t = -9.4076, df = 61.632, p-value = 1.583e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.727073 -1.121686
## sample estimates:
## mean of x mean of y
## -0.6604333 0.7639466
Since the p-value is far below the conventional threshold of 0.05, we
reject the null hypothesis that there is no difference between the two
groups. The negative confidence interval indicates that the average
CHmRun for players in the Low Salary group is significantly
lower than that for players in the High Salary group. This strongly
suggests that players with higher career home run totals tend to belong
to the High Salary category, underscoring the importance of offensive
performance in salary determination.
And lastly we double check the result we get from ANOVA Test with a
t-test with variable Assists
# Subset the data into Low and High Salary groups based on SalaryLevel
low_salary <- subset(Hitters_clean, SalaryLevel == "Low")
high_salary <- subset(Hitters_clean, SalaryLevel == "High")
# Perform a Welch Two Sample t-test on CHmRun between the two groups
t_test_assists <- t.test(low_salary$Assists, high_salary$Assists)
t_test_assists
##
## Welch Two Sample t-test
##
## data: low_salary$Assists and high_salary$Assists
## t = -0.0346, df = 122.94, p-value = 0.9725
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3515640 0.3394848
## sample estimates:
## mean of x mean of y
## -0.02093140 -0.01489181
Because the p-value (≈ 0.97) is much higher than the conventional
0.05 threshold, we fail to reject the null hypothesis. The confidence
interval includes zero, indicating no statistically significant
difference in the mean assists between the Low Salary and High Salary
groups. Practically, this suggests that defensive performance as
measured by assists does not differ meaningfully between these two
salary categories, and hence may not be a strong determinant of salary
level.
In this density plot, we compare the distribution of
Runs (the number of runs scored in the season) between two
salary categories—Low Salary (red) and High Salary (blue). The dashed
vertical lines represent each group’s mean Runs value.
# Create subsets for Low and High Salary groups based on SalaryLevel
low_salary <- Hitters_clean %>% filter(SalaryLevel == "Low")
high_salary <- Hitters_clean %>% filter(SalaryLevel == "High")
# Plot the density for the Low Salary group for 'Runs'
plot(density(low_salary$Runs),
main = "Density of Runs by Salary Level",
xlab = "Runs (Season Runs)",
col = "red",
lwd = 2,
ylim = c(0, max(density(low_salary$Runs)$y, density(high_salary$Runs)$y))
)
# Overlay the density for the High Salary group
lines(density(high_salary$Runs), col = "blue", lwd = 2)
# Add vertical dashed lines for the means of each group
abline(v = mean(low_salary$Runs), col = "red", lwd = 2, lty = 2)
abline(v = mean(high_salary$Runs), col = "blue", lwd = 2, lty = 2)
# Add a legend to distinguish the groups
legend("topright",
legend = c("Low Salary", "High Salary"),
col = c("red", "blue"),
lwd = 2)
The Low Salary group’s distribution is centered around a lower
(negative) standardized Runs value, whereas the High Salary
group’s distribution is shifted to the right (positive). This indicates
that players in the High Salary group tend to score more runs on
average.
The dashed lines show a noticeable gap between the average Runs for Low Salary (red) and High Salary (blue). This gap suggests that higher run production is associated with higher pay.
Despite some overlap in the two curves, the clear separation in their
peaks and means reinforces that Runs is a meaningful
performance metric related to salary level.
Overall, the plot supports the idea that players who score more runs are more likely to fall into the High Salary category. This aligns with the broader notion that offensive performance is rewarded in player salaries.
We have explored both individual variable distributions (using summary statistics, histograms, and boxplots) and relationships between variables (using correlation matrices and statistical tests). Focus has been placed on metrics that are most relevant to understanding player performance and salary differences. This comprehensive EDA provides a solid foundation for subsequent modeling. It informs our decisions on which variables to include in a multinomial logistic regression and helps us understand potential issues like multicollinearity.
In this section, we fit a multinomial logistic regression model to predict SalaryLevel using the full set of predictors (excluding the original continuous Salary). We then apply model selection techniques—both forward and backward selection—using stepAIC() to identify a more parsimonious model. Model comparisons are made based on the Akaike Information Criterion (AIC).
# Fit the full multinomial logistic regression model (exclude Salary since it was used to create SalaryLevel)
full_model <- multinom(SalaryLevel ~ ., data = Hitters_clean[, !names(Hitters_clean) %in% "Salary"])
summary(full_model)
## Call:
## multinom(formula = SalaryLevel ~ ., data = Hitters_clean[, !names(Hitters_clean) %in%
## "Salary"])
##
## Coefficients:
## (Intercept) AtBat Hits HmRun Runs RBI
## Medium-Low 4.159894 -5.880311 1.618117 -1.3010311 4.066619 2.1064481
## Medium-High 4.439138 -6.777712 4.135371 -0.8721292 1.868520 1.5949929
## High 4.522568 -5.548210 3.440052 -0.2720965 1.638465 0.8267061
## Walks Years CAtBat CHits CHmRun CRuns
## Medium-Low -0.06285499 1.1118485 34.69468 -18.08576 3.780273 -10.227697
## Medium-High 1.03813323 0.5753733 29.01246 -14.52271 3.556581 -5.391923
## High 1.32421271 -0.9589881 28.28527 -13.98477 2.577480 -4.489479
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## Medium-Low -7.276993 2.2125353 3.426984 -0.5208839 -0.06593199 1.2212657
## Medium-High -6.522122 0.8372687 3.462216 -0.3859751 0.14488185 1.2248027
## High -3.856274 0.1736556 2.299857 -1.4894562 0.46537095 0.8731398
## Errors NewLeagueN
## Medium-Low -1.1448409 -3.036058
## Medium-High -0.9224977 -2.625522
## High -0.5707376 -2.010882
##
## Std. Errors:
## (Intercept) AtBat Hits HmRun Runs RBI Walks
## Medium-Low 1.265759 2.089064 2.233412 1.096794 1.490986 1.298393 0.8703046
## Medium-High 1.256830 2.140168 2.280288 1.118552 1.521350 1.342576 0.8910443
## High 1.271984 2.189055 2.354850 1.141206 1.572037 1.386204 0.9270314
## Years CAtBat CHits CHmRun CRuns CRBI CWalks
## Medium-Low 1.210967 16.05048 17.31092 4.517420 7.451315 7.916279 4.308370
## Medium-High 1.245727 16.00335 17.26354 4.487129 7.464656 7.867447 4.303366
## High 1.380005 16.14438 17.47751 4.521068 7.602715 7.917532 4.330097
## LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## Medium-Low 1.653709 0.6098704 0.3644251 0.6147888 0.5342457 1.589655
## Medium-High 1.729227 0.6419983 0.3387811 0.6228910 0.5402256 1.665646
## High 1.876590 0.7172204 0.3378969 0.6476407 0.5669417 1.824070
##
## Residual Deviance: 390.6562
## AIC: 510.6562
# Fit the null model (intercept only)
null_model <- multinom(SalaryLevel ~ 1, data = Hitters_clean[, !names(Hitters_clean) %in% "Salary"])
summary(null_model)
## Call:
## multinom(formula = SalaryLevel ~ 1, data = Hitters_clean[, !names(Hitters_clean) %in%
## "Salary"])
##
## Coefficients:
## (Intercept)
## Medium-Low -0.03077005
## Medium-High 0.05884434
## High -0.11211528
##
## Std. Errors:
## (Intercept)
## Medium-Low 0.1754325
## Medium-High 0.1715728
## High 0.1791667
##
## Residual Deviance: 717.1267
## AIC: 723.1267
• Model Specification: This model includes no predictors, only intercepts for each salary category (relative to a reference category, typically “Low”).
• Interpretation of Intercepts: Each intercept represents the baseline log-odds of belonging to a particular salary category versus the reference category, without accounting for any performance or demographic variables.
• Residual Deviance: 717.1267
• AIC: 723.1267
These values are relatively high because the model offers no
explanatory power beyond the overall distribution of salary
levels.
• Model Specification: This model includes all available predictors (e.g., AtBat, Hits, HmRun, Runs, RBI, etc.), excluding the continuous Salary variable (since it was used to derive SalaryLevel).
• Coefficient Estimates: The output shows estimates for each salary category (e.g., Medium-Low, Medium-High, High) relative to the reference category (e.g., Low). A positive coefficient indicates that higher values of that predictor increase the log-odds of belonging to that salary category, whereas a negative coefficient indicates the opposite effect.
• Residual Deviance: 390.6562
• AIC: 510.6562
These values are significantly lower than those of the null model,
indicating that including all predictors substantially improves the
model’s ability to predict salary levels.
• Improvement Over Null Model:
The drop in AIC from ~723 (null model) to ~510 (full model) shows a marked improvement in fit. This confirms that many of these variables (e.g., performance stats, years of experience) help distinguish salary categories.
Overall, these results underscore the strong explanatory value of
performance metrics and experience in determining salary categories,
providing a solid foundation for model refinement and
interpretation.
In this section, we define our classification error measure and then randomly select one of the four cross-validation methods based on the birthday seed. According to the instructions, we use the birthday 20/03/2004 (friend’s name: Efe) as 20032004. We then implement the chosen method to estimate the test error (misclassification rate) of our multinomial model.
Because we have a classification problem (salary levels as
categories), a natural choice is the misclassification rate (or
equivalently,
accuracy = 1 - misclassification rate).
Misclassification Rate:
\[ \text{Misclassification Rate} = \frac{\text{Number of Incorrect Predictions}}{\text{Total Number of Observations}} \]
Accuracy:
\[
\text{Accuracy} = 1 - \text{Misclassification Rate} = \frac{\text{Number
of Correct Predictions}}{\text{Total Number of Observations}}
\]
These metrics will help us evaluate how well our
model predicts the correct salary categories.
According to the assignment, we have four possible CV methods:
1. Vanilla validation set
2. LOO-CV (Leave-One-Out Cross-Validation)
3. K-fold CV (K = 5)
4. K-fold CV (K = 10)
We choose one randomly using our birthday seed.
# Friend's name: Efe Aydın
# Birthday: 20/03/2004 -> 20032004
set.seed(20032004)
cv_methods <- c("1. Vanilla validation set",
"2. LOO-CV",
"3. K-fold CV (with K = 5)",
"4. K-fold CV (with K = 10)")
chosen_method <- sample(cv_methods, 1)
chosen_method
## [1] "3. K-fold CV (with K = 5)"
Since the chosen method is “3. K-fold CV (with K = 5)”, we use the caret package to perform 5-fold cross-validation on our multinomial logistic regression model. We exclude the original continuous Salary variable since SalaryLevel is our response.
# Define a trainControl object for 5-fold cross-validation
train_control <- trainControl(method = "cv", number = 5)
# Set a seed for reproducibility of the training process
set.seed(20032004)
# Fit the multinomial logistic regression model using caret's train() function
cv_fit <- train(
SalaryLevel ~ .,
data = Hitters_clean[, !names(Hitters_clean) %in% "Salary"],
method = "multinom",
trControl = train_control,
trace = FALSE
)
# Display the cross-validation results
print(cv_fit)
## Penalized Multinomial Regression
##
## 259 samples
## 19 predictor
## 4 classes: 'Low', 'Medium-Low', 'Medium-High', 'High'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 208, 206, 207, 208, 207
## Resampling results across tuning parameters:
##
## decay Accuracy Kappa
## 0e+00 0.5875637 0.4497898
## 1e-04 0.5913373 0.4547798
## 1e-01 0.5954011 0.4600379
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.
# Calculate accuracy and misclassification rate
accuracy_5fold <- cv_fit$results$Accuracy[1]
misclass_rate_5fold <- 1 - accuracy_5fold
cat("\n5-Fold CV Accuracy:", accuracy_5fold, "\n")
##
## 5-Fold CV Accuracy: 0.5875637
cat("5-Fold CV Misclassification Rate:", misclass_rate_5fold, "\n")
## 5-Fold CV Misclassification Rate: 0.4124363
The second CV we selected to imply is the “vanilla” validation approach (a single train-test split) to evaluate the multinomial logistic regression model. This approach splits the data (e.g., 80/20), fits the model on the training set, and then computes the misclassification rate on the test set.
# Set seed for reproducibility
set.seed(20032004)
# Create an 80/20 split using caret's createDataPartition
trainIndex <- createDataPartition(Hitters_clean$SalaryLevel, p = 0.8, list = FALSE)
trainData <- Hitters_clean[trainIndex, ]
testData <- Hitters_clean[-trainIndex, ]
# Fit the multinomial logistic regression model on the training set
# Exclude the continuous 'Salary' variable
model_vanilla <- multinom(SalaryLevel ~ ., data = trainData[, !names(trainData) %in% "Salary"], trace = FALSE)
# Predict the SalaryLevel on the test set
predictions <- predict(model_vanilla, newdata = testData)
# Calculate accuracy and misclassification rate
accuracy_val <- mean(predictions == testData$SalaryLevel)
misclass_val <- 1 - accuracy_val
cat("Vanilla CV Accuracy:", accuracy_val, "\n")
## Vanilla CV Accuracy: 0.7
cat("Vanilla CV Misclassification Rate:", misclass_val, "\n")
## Vanilla CV Misclassification Rate: 0.3
5-Fold CV:
• Accuracy: ~0.56
• Misclassification Rate: ~0.44
Vanilla Validation (80/20 split):
• Accuracy: ~0.7
• Misclassification Rate: ~0.3
Sample Variability:
• The vanilla validation approach relies on a single random train-test split. Depending on which observations end up in the training vs. test sets, the resulting accuracy can vary. If the test set happens to be “easier” to classify, the model may appear to perform better.
Multiple Folds in CV:
• The 5-fold cross-validation approach partitions the data into five
subsets, using each subset as a test set once. This typically provides a
more stable estimate of the model’s generalization performance, as it
reduces variance in the performance metric compared to a single
split.
Vanilla CV (80/20) Pros and Cons:
• Pros: Simple and fast to implement.
• Cons: Highlydependent on the particular split, so the result may not generalize well.
5-Fold CV Pros and Cons:
• Pros: More robust estimate of predictive performance since it averages results across multiple splits.
• Cons: Computationally more expensive than a single
train-test split.
Stability vs. Luck:
• The 5-fold CV metric is often preferred as it mitigates the “luck of the draw” in train-test splitting.
• A single train-test split (vanilla) may overestimate or
underestimate performance if the data split is not
representative.
• The higher accuracy (70%) observed in the vanilla validation may be partly due to a favorable split.
• The 5-fold CV accuracy (56%) is likely a more conservative and robust estimate of how well the model would perform on truly unseen data.
• Ultimately, both methods show that the multinomial logistic
regression model has meaningful predictive capability, but
cross-validation gives a more stable, generalizable assessment of its
performance.
The full multinomial logistic regression model (using all predictors) may include irrelevant or redundant variables. In this section, we identify a more parsimonious model with a lower test error using forward and backward selection via the stepAIC() function from the MASS package. We then evaluate the selected model using 5-fold cross-validation.
We start with the null model and add predictors step-by-step, choosing the predictor that most improves the model (i.e. reduces AIC).
forward_model <- stepAIC(null_model,
scope = list(lower = null_model, upper = full_model),
direction = "forward")
summary(forward_model)
## Call:
## multinom(formula = SalaryLevel ~ CRuns + AtBat + PutOuts + Years,
## data = Hitters_clean[, !names(Hitters_clean) %in% "Salary"])
##
## Coefficients:
## (Intercept) CRuns AtBat PutOuts Years
## Medium-Low 5.211497 6.548930 -0.70693400 -0.04783254 1.6744941
## Medium-High 5.457328 8.054138 -0.32438786 0.16626315 1.2615554
## High 4.755518 9.458694 -0.03233549 0.64143901 0.2549265
##
## Std. Errors:
## (Intercept) CRuns AtBat PutOuts Years
## Medium-Low 1.084234 1.858143 0.3340662 0.3137865 0.8688585
## Medium-High 1.081369 1.879150 0.3586437 0.2997162 0.9026553
## High 1.093172 1.916415 0.4037025 0.2942920 0.9828780
##
## Residual Deviance: 454.8555
## AIC: 484.8555
We also perform backward selection, starting with the full model and removing the least significant predictor at each step based on AIC.
backward_model <- stepAIC(full_model, direction = "backward")
summary(backward_model)
## Call:
## multinom(formula = SalaryLevel ~ AtBat + Hits + Runs + Walks +
## Years + CAtBat + CHits + Division + Assists + Errors, data = Hitters_clean[,
## !names(Hitters_clean) %in% "Salary"])
##
## Coefficients:
## (Intercept) AtBat Hits Runs Walks Years
## Medium-Low 4.295296 -5.739224 3.699605 2.1668138 0.342005 1.2393909
## Medium-High 4.644513 -6.058737 4.915113 0.8024354 1.224064 0.5270782
## High 4.421339 -4.581230 3.692677 1.0352795 1.288674 -0.7539692
## CAtBat CHits DivisionW Assists Errors
## Medium-Low 34.46724 -30.14107 -0.4375513 1.0478602 -0.9016800
## Medium-High 27.43086 -21.02300 -0.1934305 0.8943664 -0.5598529
## High 26.21484 -18.17980 -1.1646381 0.3739404 -0.2622754
##
## Std. Errors:
## (Intercept) AtBat Hits Runs Walks Years CAtBat
## Medium-Low 1.089078 1.717575 1.819624 0.9610325 0.4686704 1.115734 12.76839
## Medium-High 1.086585 1.754597 1.844028 0.9848865 0.4801116 1.163152 12.73991
## High 1.094026 1.787445 1.867295 1.0145532 0.5009128 1.283015 12.82023
## CHits DivisionW Assists Errors
## Medium-Low 12.79997 0.5473749 0.4970802 0.4485288
## Medium-High 12.72644 0.5787065 0.5030995 0.4548782
## High 12.76948 0.6314128 0.5290861 0.4834889
##
## Residual Deviance: 416.8386
## AIC: 482.8386
1. Process:
• Started from the null model (SalaryLevel ~ 1) and added predictors step by step.
• At each step, the predictor that most reduced the AIC was added, until no further improvement was found.
2. Final Model & AIC:
• The final forward model includes a subset of predictors (e.g., CRuns, AtBat, PutOuts, Years, etc.).
• The AIC of this final model is ~484.86.
3. Interpretation:
• Each selected predictor meaningfully contributes to distinguishing salary levels.
• Positive coefficients indicate that as the predictor increases, the log-odds of being in a higher salary category (vs. the reference) also increase; negative coefficients do the opposite.
• Because the forward model includes fewer variables than the full
model, it is typically more parsimonious while still retaining good
predictive power.
1. Process:
• Started from the full model (with all predictors) and iteratively removed the least significant predictor (or the one whose removal best improved the AIC).
• This continued until no further improvement in AIC was possible.
2. Final Model & AIC:
• The final backward model retains a different subset of predictors (e.g., AtBat, Hits, Runs, Walks, Years, etc.).
• Its AIC is ~482.84, which is marginally lower than the forward model’s AIC of ~484.86.
3. Interpretation:
• This final subset represents predictors that the backward procedure deemed most significant from the full set.
• While it may include different (or more) variables than the forward
model, the lower AIC indicates it may be more optimal in balancing model
fit and complexity compared to the forward model.
Overall, the backward selection model appears to be more parsimonious
and has a slighly better AIC than the backward selection model.
Therefore, it is likely the preferred choice for predicting
SalaryLevel.
We have obtained a backward-selected multinomial logistic regression
model stored as backward_model (fitted with multinom() from
the nnet package), we now evaluate its predictive performance using:
• 5-Fold Cross-Validation
• Vanilla Validation (80/20 train-test split)
# Extract the formula from the selected model
selected_formula <- formula(backward_model)
# Define trainControl for 5-fold cross-validation
train_control <- trainControl(method = "cv", number = 5)
# Set a seed for reproducibility
set.seed(20032004)
# Train the selected model using caret
cv_fit_selected <- train(selected_formula,
data = Hitters_clean[, !names(Hitters_clean) %in% "Salary"],
method = "multinom",
trControl = train_control,
trace = FALSE)
# Display the cross-validation results
print(cv_fit_selected)
## Penalized Multinomial Regression
##
## 259 samples
## 10 predictor
## 4 classes: 'Low', 'Medium-Low', 'Medium-High', 'High'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 208, 206, 207, 208, 207
## Resampling results across tuning parameters:
##
## decay Accuracy Kappa
## 0e+00 0.6178946 0.4900186
## 1e-04 0.6256624 0.5001525
## 1e-01 0.6023564 0.4689325
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 1e-04.
# Calculate accuracy and misclassification rate
accuracy_selected <- cv_fit_selected$results$Accuracy[1]
misclass_rate_selected <- 1 - accuracy_selected
cat("\n5-Fold CV Accuracy of Backward Model:", accuracy_selected, "\n")
##
## 5-Fold CV Accuracy of Backward Model: 0.6178946
cat("5-Fold CV Misclassification Rate of Backward Model:", misclass_rate_selected, "\n")
## 5-Fold CV Misclassification Rate of Backward Model: 0.3821054
# Set seed for reproducibility
set.seed(20032004)
# Create an 80/20 split
trainIndex <- createDataPartition(Hitters_clean$SalaryLevel, p = 0.8, list = FALSE)
trainData <- Hitters_clean[trainIndex, ]
testData <- Hitters_clean[-trainIndex, ]
# Fit the selected model on the training set using the final formula from the selected model
selected_formula <- formula(backward_model)
model_vanilla_selected <- multinom(selected_formula, data = trainData[, !names(trainData) %in% "Salary"], trace = FALSE)
# Predict on the test set
predictions_selected <- predict(model_vanilla_selected, newdata = testData)
# Calculate accuracy and misclassification rate
accuracy_val_selected <- mean(predictions_selected == testData$SalaryLevel)
misclass_val_selected <- 1 - accuracy_val_selected
cat("Vanilla CV (80/20) Accuracy for Backward Model:", accuracy_val_selected, "\n")
## Vanilla CV (80/20) Accuracy for Backward Model: 0.74
cat("Vanilla CV Misclassification Rate for Backward Model:", misclass_val_selected, "\n")
## Vanilla CV Misclassification Rate for Backward Model: 0.26
5-Fold CV:
• Accuracy: 58.76%
• Misclassification Rate: 41.24%
The 5-fold cross-validation for the full model indicates that, on average, about 58.76% of cases are correctly classified, while roughly 41.24% are misclassified. This reflects the performance when all available predictors are used, even if some may be redundant or less informative.
Vanilla CV (80/20 Split):
• Accuracy: 70%
• Misclassification Rate: 30%
With a single 80/20 split, the full model shows a higher accuracy of
70%, likely due to the specifics of the split. However, this method is
more sensitive to the particular partition used.
5-Fold CV:
• Accuracy: 61.79%
• Misclassification Rate: 38.21%
After applying backward selection, the 5-fold CV accuracy improves to about 61.79%, reducing the misclassification rate to 38.21%. This suggests that eliminating less relevant predictors enhances the model’s ability to generalize, resulting in more consistent performance across folds.
Vanilla CV (80/20 Split):
• Accuracy: 74%
• Misclassification Rate: 26%
Similarly, the backward-selected model achieves 74% accuracy on the
80/20 split, with a misclassification rate of 26%, indicating better
performance compared to the full model on this validation method as
well.
Performance Improvement:
The backward selection model consistently outperforms the full model on both validation methods. The accuracy improves by roughly 3 percentage points in the 5-fold CV (from 58.76% to 61.79%) and by 4 percentage points in the vanilla CV (from 70% to 74%). This reduction in misclassification rates (from 41.24% to 38.21% in 5-fold CV and from 30% to 26% in vanilla CV) suggests that the removal of redundant or irrelevant predictors results in a model that is better at distinguishing between the salary categories.
Cross-Validation Insights:
While the vanilla CV results are generally higher—possibly due to a
favorable train-test split—the 5-fold CV offers a more robust and less
variable performance estimate by averaging over multiple splits. The
consistent improvement seen in both validation schemes confirms the
benefit of model selection.
Overall, these findings underscore the advantage of applying backward selection. By reducing model complexity and avoiding overfitting, the backward model demonstrates improved predictive accuracy and lower error rates. This makes it a more desirable choice for predicting individual salary levels.