Part A) Formalizing the Multinomial Logistic Regression Model

This project aims to predict the salary level of Major League Baseball players using performance statistics from the Hitters.csv dataset. We will build a multinomial logistic regression model to classify players into salary categories (LowSalary, MediumSalary, HighSalary) and evaluate the model using cross-validation. Multinomial logistic regression is used when the response variable \(Y\) has more than two categories. In this section, we present the generic formulation of this model and explain how it differs from the binary logistic regression model.

Model Formulation

Let \(Y\) be a categorical response variable with \(J\) distinct categories (e.g., LowSalary, MediumSalary, HighSalary) and let = (\(x_1, x_2, \dots, x_p\)) denote the predictors. We choose one of the \(J\) categories (commonly the last one, \(J\)) as the reference category.

For each non-reference category \(j\) (with \(j = 1, 2, \dots, J-1\)), the model is defined by:

\[ \log \left(\frac{P(Y = j \mid \mathbf{x})}{P(Y = J \mid \mathbf{x})}\right) = \beta_{j0} + \beta_{j1}x_1 + \beta_{j2}x_2 + \cdots + \beta_{jp}x_p. \]

From this, the probability of observing category \(j\) is:

\[ P(Y = j \mid \mathbf{x}) = \frac{\exp\left(\beta_{j0} + \beta_{j1}x_1 + \cdots + \beta_{jp}x_p\right)}{1 + \sum_{l=1}^{J-1} \exp\left(\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p\right)}, \]

for j = 1, 2, , J-1, and the probability for the reference category \(J\) is:

\[ P(Y = J \mid \mathbf{x}) = \frac{1}{1 + \sum_{l=1}^{J-1} \exp\left(\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p\right)}. \]

Comparison with Binary Logistic Regression

Binary logistic regression is used when the response variable has only two categories. Its equation is: \[ \log \left(\frac{P(Y = 1 \mid \mathbf{x})}{1 - P(Y = 1 \mid \mathbf{x})}\right) = \beta_0 + \beta_1x_1 + \cdots + \beta_px_p. \] Here, the coefficient \(k\) represents the change in the log-odds of \(Y = 1\) for a one-unit increase in \(x_k\).

Multinomial logistic regression extends binary logistic regression to handle a response variable with more than two categories. It constructs \(J-1\) logit equations comparing each non-reference category to the reference category. The coefficients \(\beta_j,_k\) represent the change in the log-odds of being in category \(j\) (compared to the reference category) for a one-unit change in \(x_k\).

Part B) Importing and Preprocessing the Hitters Data

In this section, we import the Hitters dataset, inspect its structure, and perform some basic data preprocessing. We will:

Load the dataset.
Report the number of observations (units) and variables (measurements).
Check for and handle missing values.
Standardize numeric predictors to ensure comparability.

Importing the Data

# Importing the Hitters dataset
Hitters <- read.csv("Hitters.csv", header = TRUE)

# Display the dimensions of the dataset: [number of rows, number of columns]
dim(Hitters)

## [1] 317  20

Inspecting the Data

# Display the structure of the dataset to understand variable types and data organization
str(Hitters)

## 'data.frame':    317 obs. of  20 variables:
##  $ AtBat    : int  293 315 479 496 321 594 185 298 323 574 ...
##  $ Hits     : int  66 81 130 141 87 169 37 73 81 159 ...
##  $ HmRun    : int  1 7 18 20 10 4 1 0 6 21 ...
##  $ Runs     : int  30 24 66 65 39 74 23 24 26 107 ...
##  $ RBI      : int  29 38 72 78 42 51 8 24 32 75 ...
##  $ Walks    : int  14 39 76 37 30 35 21 7 8 59 ...
##  $ Years    : int  1 14 3 11 2 11 2 3 2 10 ...
##  $ CAtBat   : int  293 3449 1624 5628 396 4408 214 509 341 4631 ...
##  $ CHits    : int  66 835 457 1575 101 1133 42 108 86 1300 ...
##  $ CHmRun   : int  1 69 63 225 12 19 1 0 6 90 ...
##  $ CRuns    : int  30 321 224 828 48 501 30 41 32 702 ...
##  $ CRBI     : int  29 414 266 838 46 336 9 37 34 504 ...
##  $ CWalks   : int  14 375 263 354 33 194 24 12 8 488 ...
##  $ League   : chr  "A" "N" "A" "N" ...
##  $ Division : chr  "E" "W" "W" "E" ...
##  $ PutOuts  : int  446 632 880 200 805 282 76 121 143 238 ...
##  $ Assists  : int  33 43 82 11 40 421 127 283 290 445 ...
##  $ Errors   : int  20 10 14 3 4 25 7 9 19 22 ...
##  $ Salary   : num  NA 475 480 500 91.5 ...
##  $ NewLeague: chr  "A" "N" "A" "N" ...

# Provide summary statistics for each variable
summary(Hitters)

##      AtBat            Hits           HmRun            Runs       
##  Min.   : 16.0   Min.   :  1.0   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.:256.0   1st Qu.: 64.0   1st Qu.: 4.00   1st Qu.: 30.00  
##  Median :379.0   Median : 96.0   Median : 8.00   Median : 48.00  
##  Mean   :381.2   Mean   :101.1   Mean   :10.75   Mean   : 50.89  
##  3rd Qu.:512.0   3rd Qu.:137.0   3rd Qu.:16.00   3rd Qu.: 69.00  
##  Max.   :687.0   Max.   :238.0   Max.   :40.00   Max.   :130.00  
##                                                                  
##       RBI             Walks            Years            CAtBat     
##  Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19  
##  1st Qu.: 28.00   1st Qu.: 22.00   1st Qu.: 4.000   1st Qu.:  822  
##  Median : 44.00   Median : 35.00   Median : 6.000   Median : 1928  
##  Mean   : 47.95   Mean   : 38.51   Mean   : 7.429   Mean   : 2646  
##  3rd Qu.: 63.00   3rd Qu.: 53.00   3rd Qu.:11.000   3rd Qu.: 3919  
##  Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053  
##                                                                    
##      CHits            CHmRun           CRuns             CRBI       
##  Min.   :   4.0   Min.   :  0.00   Min.   :   1.0   Min.   :   0.0  
##  1st Qu.: 209.0   1st Qu.: 14.00   1st Qu.: 101.0   1st Qu.:  91.0  
##  Median : 506.0   Median : 37.00   Median : 247.0   Median : 219.0  
##  Mean   : 717.3   Mean   : 68.94   Mean   : 358.1   Mean   : 328.8  
##  3rd Qu.:1051.0   3rd Qu.: 90.00   3rd Qu.: 518.0   3rd Qu.: 421.0  
##  Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.0  
##                                                                     
##      CWalks        League            Division            PutOuts      
##  Min.   :   0   Length:317         Length:317         Min.   :   0.0  
##  1st Qu.:  68   Class :character   Class :character   1st Qu.: 109.0  
##  Median : 170   Mode  :character   Mode  :character   Median : 212.0  
##  Mean   : 258                                         Mean   : 290.5  
##  3rd Qu.: 337                                         3rd Qu.: 325.0  
##  Max.   :1566                                         Max.   :1378.0  
##                                                                       
##     Assists          Errors           Salary        NewLeague        
##  Min.   :  0.0   Min.   : 0.000   Min.   :  67.5   Length:317        
##  1st Qu.:  7.0   1st Qu.: 3.000   1st Qu.: 190.0   Class :character  
##  Median : 40.0   Median : 6.000   Median : 420.0   Mode  :character  
##  Mean   :107.9   Mean   : 8.091   Mean   : 532.8                     
##  3rd Qu.:166.0   3rd Qu.:11.000   3rd Qu.: 750.0                     
##  Max.   :492.0   Max.   :32.000   Max.   :2460.0                     
##                                   NA's   :58

Handling the Missing Values

# Counting total missing values in the dataset
total_missing <- sum(is.na(Hitters))
total_missing

## [1] 58

Investigating the Missingness Pattern in the Target Variable

Before handling the missing values, we need to assess whether they occur at random or if there is a systematic pattern. In this case, we have observed that the missing values appear only in the target variable.

Load the Raw Data

# Load the raw Hitters dataset with missing values
Hitters_raw <- read.csv("Hitters.csv", header = TRUE)

# Check dimensions: total rows and columns
dim(Hitters_raw)

## [1] 317  20

Visualize Missing Data with `naniar`

vis_miss(Hitters_raw) +
  theme(
    axis.text.x = element_text(angle = 75, hjust = 0)
  ) +
  labs(
    title = "Missingness in Hitters Dataset",
  )

Conclusion on Handling Missing Values

Since:

• The missing values occur only in the target variable, and

• Our investigation shows no systematic bias in how the missing values are distributed (i.e., they appear to be Missing Completely At Random, MCAR),

it is acceptable to remove these 58 observations (out of 317) from the analysis. Imputing the target variable is generally not recommended because it can introduce further bias.

Handling Missing Values

# Remove rows with any missing values
Hitters_clean <- na.omit(Hitters)

# Confirm new dimensions after removing missing values
dim(Hitters_clean)

## [1] 259  20

Standardizing Numeric Predictors (z-score standardization)

# Identify numeric columns in the cleaned dataset
numeric_vars <- sapply(Hitters_clean, is.numeric)

# Standardize the numeric predictors using scale()
Hitters_clean[numeric_vars] <- scale(Hitters_clean[numeric_vars])

# Check the updated summary to confirm that numeric variables now have a mean 0 and a s.d 1


describe(Hitters_clean)

##            vars   n mean  sd median trimmed  mad   min  max range  skew
## AtBat         1 259 0.00 1.0   0.06    0.01 1.26 -2.62 1.93  4.54 -0.13
## Hits          2 259 0.00 1.0  -0.11   -0.03 1.15 -2.37 2.88  5.25  0.25
## HmRun         3 259 0.00 1.0  -0.29   -0.10 1.01 -1.32 3.23  4.55  0.80
## Runs          4 259 0.00 1.0  -0.11   -0.04 1.16 -2.15 2.95  5.10  0.36
## RBI           5 259 0.00 1.0  -0.17   -0.07 1.03 -1.99 2.70  4.68  0.57
## Walks         6 259 0.00 1.0  -0.18   -0.06 1.11 -1.91 3.00  4.91  0.54
## Years         7 259 0.00 1.0  -0.26   -0.10 0.93 -1.31 3.49  4.79  0.82
## CAtBat        8 259 0.00 1.0  -0.31   -0.14 0.82 -1.15 4.98  6.13  1.32
## CHits         9 259 0.00 1.0  -0.32   -0.15 0.81 -1.10 5.44  6.54  1.46
## CHmRun       10 259 0.00 1.0  -0.36   -0.20 0.54 -0.84 5.87  6.70  2.23
## CRuns        11 259 0.00 1.0  -0.33   -0.15 0.76 -1.08 5.45  6.53  1.53
## CRBI         12 259 0.00 1.0  -0.31   -0.18 0.70 -1.01 4.13  5.13  1.55
## CWalks       13 259 0.00 1.0  -0.32   -0.18 0.66 -0.98 5.02  6.00  1.90
## League*      14 259 1.47 0.5   1.00    1.46 0.00  1.00 2.00  1.00  0.12
## Division*    15 259 1.51 0.5   2.00    1.51 0.00  1.00 2.00  1.00 -0.02
## PutOuts      16 259 0.00 1.0  -0.24   -0.20 0.56 -1.04 3.85  4.89  2.04
## Assists      17 259 0.00 1.0  -0.51   -0.17 0.44 -0.83 2.55  3.38  1.14
## Errors       18 259 0.00 1.0  -0.25   -0.11 0.90 -1.31 3.53  4.84  0.93
## Salary       19 259 0.00 1.0  -0.25   -0.16 0.89 -1.03 4.27  5.30  1.60
## NewLeague*   20 259 1.46 0.5   1.00    1.45 0.00  1.00 2.00  1.00  0.15
##            kurtosis   se
## AtBat         -0.99 0.06
## Hits          -0.57 0.06
## HmRun         -0.22 0.06
## Runs          -0.60 0.06
## RBI           -0.41 0.06
## Walks         -0.39 0.06
## Years         -0.06 0.06
## CAtBat         1.99 0.06
## CHits          2.93 0.06
## CHmRun         6.15 0.06
## CRuns          3.12 0.06
## CRBI           1.98 0.06
## CWalks         4.33 0.06
## League*       -1.99 0.03
## Division*     -2.01 0.03
## PutOuts        4.00 0.06
## Assists        0.00 0.06
## Errors         0.27 0.06
## Salary         3.03 0.06
## NewLeague*    -1.99 0.03

Converting Character Columns to Factors

Converting a variable to a factor in R means treating it as a categorical variable. Factors store categorical data along with the distinct levels (i.e., unique values) that the variable can take. This is important because many modeling functions in R (like multinomial logistic regression) need to know that a variable is categorical so they can automatically create dummy variables for each level.

# Convert character columns to factors
Hitters_clean[sapply(Hitters_clean, is.character)] <- 
    lapply(Hitters_clean[sapply(Hitters_clean, is.character)], as.factor)

Part C) Creating the Categorical Response Variable

In this section, we transform the continuous “Salary” variable into a categorical response variable. The project suggests using salary tertiles (three categories), but for the bonus, we will define four categories based on quartiles. This can be a finer approach in capturing differences in salary.

The new variable, SalaryLevel, is created using the cut() function along with quantile() to set the breakpoints at the 0%, 25%, 50%, 75%, and 100% percentiles. We then label these four groups as follows:

Low: Salaries in the bottom 25%.
Medium-Low: Salaries between the 25th and 50th percentiles.
Medium-High: Salaries between the 50th and 75th percentiles.
High: Salaries in the top 25%.

# Create SalaryLevel variable using quartile-based cutoffs (bonus approach)
# Create a categorical SalaryLevel variable using quartiles on the cleaned dataset
Hitters_clean$SalaryLevel <- cut(Hitters_clean$Salary, 
                                 breaks = quantile(Hitters_clean$Salary, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE),
                                 labels = c("Low", "Medium-Low", "Medium-High", "High"),
                                 include.lowest = TRUE)

# Display the distribution of the new SalaryLevel variable
table(Hitters_clean$SalaryLevel)

## 
##         Low  Medium-Low Medium-High        High 
##          66          64          70          59

Part D) Visual Inspection & Identifying Relevant Variables

This section outlines a thorough exploration of the Hitters dataset. Our goal is to understand the data’s structure, distribution, and relationships between variables. We follow these steps:

•   Report summary statistics (means, standard deviations, quartiles, etc.)

•   Visualize distributions using histograms, boxplots and bar charts.

•   Analyze relationships with correlation matrices.

•   Perform basic statistical tests (Chi-squared tests and t-tests) to compare groups.

Data Import and Structure

We begin by loading our cleaned dataset (Hitters_clean), which has been preprocessed (missing values removed, numeric variables standardized, and non-numeric variables converted to factors when appropriate).

str(Hitters_clean)

## 'data.frame':    259 obs. of  21 variables:
##  $ AtBat      : num  -0.604 0.511 0.627 -0.563 1.293 ...
##  $ Hits       : num  -0.597 0.489 0.733 -0.464 1.354 ...
##  $ HmRun      : num  -0.52 0.73 0.957 -0.179 -0.861 ...
##  $ Runs       : num  -1.204 0.442 0.403 -0.616 0.756 ...
##  $ RBI        : num  -0.515 0.8011 1.0333 -0.3602 -0.0118 ...
##  $ Walks      : num  -0.0829 1.6471 -0.1764 -0.5037 -0.2699 ...
##  $ Years      : num  1.403 -0.889 0.778 -1.098 0.778 ...
##  $ CAtBat     : num  0.351 -0.446 1.302 -0.982 0.77 ...
##  $ CHits      : num  0.178 -0.404 1.316 -0.952 0.636 ...
##  $ CHmRun     : num  0.00855 -0.06484 1.91653 -0.6886 -0.60299 ...
##  $ CRuns      : num  -0.115 -0.408 1.415 -0.939 0.428 ...
##  $ CRBI       : num  0.268 -0.191 1.582 -0.873 0.026 ...
##  $ CWalks     : num  0.4536 0.0246 0.3732 -0.8565 -0.2397 ...
##  $ League     : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Division   : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 2 ...
##  $ PutOuts    : num  1.2072 2.0884 -0.3277 1.8219 -0.0363 ...
##  $ Assists    : num  -0.532 -0.264 -0.752 -0.553 2.065 ...
##  $ Errors     : num  0.201 0.806 -0.858 -0.706 2.469 ...
##  $ Salary     : num  -0.128 -0.1169 -0.0727 -0.9771 0.4809 ...
##  $ NewLeague  : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 2 ...
##  $ SalaryLevel: Factor w/ 4 levels "Low","Medium-Low",..: 3 3 3 1 3 1 1 1 3 3 ...
##  - attr(*, "na.action")= 'omit' Named int [1:58] 1 15 18 21 28 30 34 36 38 39 ...
##   ..- attr(*, "names")= chr [1:58] "1" "15" "18" "21" ...

The str() output shows the structure of our dataset (number of observations, types of variables). This initial inspection confirms that our data is ready for further analysis.

Salary Distribution and Key Statistics

In the snippet below, we replicate the approach shown in the guidelines:
1. Histogram of Salary with overlaid lines for the median and mean.
2. Summary Table displaying key statistics (mean, standard deviation, min, and max).

# 1. Histogram of Salary
hist(Hitters_clean$Salary,
     main = "Salary",
     xlab = "Salary",
     col = "lavender",
     breaks = 10)

# Add vertical lines for the median (red) and mean (black)
abline(v = median(Hitters_clean$Salary), col = "red", lwd = 3)
abline(v = mean(Hitters_clean$Salary), col = "black", lwd = 3)

# Add a legend
legend("topright",
       legend = c("median", "mean"),
       lty = 1,
       col = c("red", "black"))

# 2. Summary Statistics Table
summary_stats <- data.frame(
  Mean     = c(mean(Hitters_clean$Salary), mean(Hitters_clean$PutOuts)),
  sd       = c(sd(Hitters_clean$Salary), sd(Hitters_clean$PutOuts)),
  Minimum  = c(min(Hitters_clean$Salary), min(Hitters_clean$PutOuts)),
  Maximum  = c(max(Hitters_clean$Salary), max(Hitters_clean$PutOuts))
)

rownames(summary_stats) <- c("Salary", "PutOuts")
summary_stats

##                 Mean sd   Minimum  Maximum
## Salary  1.886093e-16  1 -1.030266 4.267057
## PutOuts 6.708490e-17  1 -1.038259 3.854228

Identifying Relevant Predictors via ANOVA Tests

Since the target variable is now categorical (SalaryLevel), we use ANOVA to test if the means of continuous predictors differ significantly across salary groups.

# Identify numeric variables, excluding Salary
numeric_vars <- setdiff(names(Hitters_clean)[sapply(Hitters_clean, is.numeric)], "Salary")

# Perform ANOVA for each numeric variable against SalaryLevel
anova_results <- lapply(numeric_vars, function(var) {
  formula <- as.formula(paste(var, "~ SalaryLevel"))
  aov_model <- aov(formula, data = Hitters_clean)
  p_val <- summary(aov_model)[[1]][["Pr(>F)"]][1]
  return(p_val)
})

# Combine results into a data frame
anova_df <- data.frame(Variable = numeric_vars,
                       p_value = unlist(anova_results))
anova_df <- anova_df[order(anova_df$p_value), ]
anova_df

##    Variable      p_value
## 9     CHits 1.267719e-26
## 11    CRuns 2.473225e-26
## 8    CAtBat 5.285011e-26
## 12     CRBI 2.180803e-24
## 7     Years 4.091916e-22
## 13   CWalks 3.277435e-19
## 10   CHmRun 2.658441e-17
## 2      Hits 9.988026e-12
## 5       RBI 4.749327e-11
## 6     Walks 1.986297e-10
## 1     AtBat 3.246702e-10
## 4      Runs 7.783730e-10
## 3     HmRun 7.770850e-07
## 14  PutOuts 2.889509e-05
## 16   Errors 1.186867e-01
## 15  Assists 4.834535e-01

Observing the ANOVA tests, we found that nearly all numeric predictors except Errors and Assists are statistically significant (p < 0.05).

Assessing Multicollinearity: Correlation Matrix for Significant Predictors

Highly correlated predictors may be redundant. We compute and visualize the correlation matrix for the significant predictors.

# Define significant predictors (based on ANOVA results)
significant_vars <- c("CHits", "CRuns", "CAtBat", "CRBI", "Years", 
                      "CWalks", "CHmRun", "Hits", "RBI", "Walks", 
                      "AtBat", "Runs", "HmRun", "PutOuts")

# Compute the correlation matrix for these variables
corr_matrix <- cor(Hitters_clean[, significant_vars], use = "pairwise.complete.obs")
round(corr_matrix, 2)

##         CHits CRuns CAtBat CRBI Years CWalks CHmRun Hits  RBI Walks AtBat  Runs
## CHits    1.00  0.98   1.00 0.95  0.90   0.89   0.79 0.23 0.28  0.26  0.22  0.18
## CRuns    0.98  1.00   0.98 0.95  0.88   0.93   0.82 0.24 0.30  0.32  0.23  0.23
## CAtBat   1.00  0.98   1.00 0.95  0.92   0.91   0.80 0.20 0.27  0.26  0.20  0.16
## CRBI     0.95  0.95   0.95 1.00  0.86   0.89   0.93 0.22 0.38  0.30  0.22  0.20
## Years    0.90  0.88   0.92 0.86  1.00   0.84   0.72 0.02 0.12  0.12  0.01 -0.02
## CWalks   0.89  0.93   0.91 0.89  0.84   1.00   0.81 0.12 0.22  0.41  0.13  0.15
## CHmRun   0.79  0.82   0.80 0.93  0.72   0.81   1.00 0.19 0.44  0.34  0.21  0.23
## Hits     0.23  0.24   0.20 0.22  0.02   0.12   0.19 1.00 0.79  0.59  0.96  0.91
## RBI      0.28  0.30   0.27 0.38  0.12   0.22   0.44 0.79 1.00  0.56  0.80  0.78
## Walks    0.26  0.32   0.26 0.30  0.12   0.41   0.34 0.59 0.56  1.00  0.62  0.70
## AtBat    0.22  0.23   0.20 0.22  0.01   0.13   0.21 0.96 0.80  0.62  1.00  0.90
## Runs     0.18  0.23   0.16 0.20 -0.02   0.15   0.23 0.91 0.78  0.70  0.90  1.00
## HmRun    0.21  0.25   0.21 0.34  0.11   0.22   0.49 0.53 0.85  0.44  0.56  0.63
## PutOuts  0.07  0.06   0.06 0.10 -0.01   0.07   0.10 0.30 0.32  0.29  0.31  0.27
##         HmRun PutOuts
## CHits    0.21    0.07
## CRuns    0.25    0.06
## CAtBat   0.21    0.06
## CRBI     0.34    0.10
## Years    0.11   -0.01
## CWalks   0.22    0.07
## CHmRun   0.49    0.10
## Hits     0.53    0.30
## RBI      0.85    0.32
## Walks    0.44    0.29
## AtBat    0.56    0.31
## Runs     0.63    0.27
## HmRun    1.00    0.25
## PutOuts  0.25    1.00

corrplot(corr_matrix, method = "color", tl.cex = 0.8, addCoef.col = "black")

Correlation Plot Analysis

Two Main Cluster of Variables

•   Career Statistics (“C” Variables):

Variables like CHits, CRuns, CAtBat, CRBI, CHmRun, CWalks, and Years are all highly correlated with each other (correlations often above 0.85). This makes intuitive sense: players with more career hits (CHits) typically have more career at-bats (CAtBat), more career runs (CRuns), and so on.

•   Single-Season Statistics (1986 Season):

Variables such as Hits, AtBat, RBI, Runs, HmRun, and Walks also show moderate to strong correlations among themselves (e.g., Hits and AtBat around 0.80). This reflects the fact that better performance in one statistic during a single season often accompanies better performance in others.

High Multicollinearity

Within each cluster, many variables have correlations well above 0.80 or 0.90. Such high correlations imply multicollinearity, meaning these variables carry very similar information. For modeling, we generally do not want to include all of these highly correlated predictors at once, as it can destabilize coefficient estimates and inflate variance.

PutOuts Stands Out

PutOuts is not strongly correlated with the other hitting or batting statistics, suggesting it might provide distinct information. If we want to capture fielding ability in our model, this variable could be worth exploring, as it does not overlap heavily with the others.

v ### Visualizing Relationships with Boxplots

Boxplot for Years by Salary Level

ggplot(Hitters_clean, aes(x = SalaryLevel, y = Years)) +
  geom_boxplot(fill = "lightgreen", color = "black") +
  labs(title = "Distribution of Years by Salary Level",
       x = "Salary Level",
       y = "Years in Major Leagues") +
  theme_minimal()

• Overall Trend: Players in higher salary categories tend to have more experience (higher “Years”).

• Distribution: The “Low” salary group shows mostly negative values (below-average years), whereas the “High” salary group centers around positive values (above-average years).

• Interpretation: This suggests that experience (time spent in the major leagues) strongly correlates with higher salaries.

Boxplot for CRBI by Salary Level

ggplot(Hitters_clean, aes(x = SalaryLevel, y = CRBI)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Distribution of Career RBI by Salary Level",
       x = "Salary Level",
       y = "Career RBI") +
  theme_minimal()

• Overall Trend: The median CRBI increases notably from the “Low” to the “High” salary groups.

• Distribution: The “High” salary group shows both a higher median and a wider spread, indicating that players with above-average career RBI totals are more likely to be in the higher salary brackets.

• Interpretation: Offensive productivity, as measured by career runsbatted in, appears to be an important factor in salary determination.

Boxplot for PutOuts by Salary Level

ggplot(Hitters_clean, aes(x = SalaryLevel, y = PutOuts)) +
  geom_boxplot(fill = "lightcoral", color = "black") +
  labs(title = "Distribution of PutOuts by Salary Level",
       x = "Salary Level",
       y = "PutOuts") +
  theme_minimal()

• Overall Trend: While there is an upward shift in median PutOuts from the “Low” to the “High” group, the separation is less dramatic than for Years or CRBI.

• Distribution: The “High” group exhibits the largest variability, with some players having substantially higher PutOuts. This may reflect differences in defensive roles or positions (e.g., first basemen vs. outfielders).

• Interpretation: PutOuts can still distinguish some of the higher-salary players—possibly those in positions that accrue more putouts—but the effect is not as strong as for offensive or experience metrics.

Statictical Tests

Chi-Squared Test

For categorical variables, we can perform a Chi-squared test to assess associations. For example, with Division variable:

chisq_test <- chisq.test(table(Hitters_clean$SalaryLevel, Hitters_clean$Division))
chisq_test

## 
##  Pearson's Chi-squared test
## 
## data:  table(Hitters_clean$SalaryLevel, Hitters_clean$Division)
## X-squared = 7.3565, df = 3, p-value = 0.06136

Since the p-value (≈ 0.061) is slightly above the conventional significance threshold of 0.05, we fail to reject the null hypothesis at the 5% level. This suggests that we do not have sufficient evidence to conclude there is a statistically significant association between SalaryLevel and Division. In other words, salary category and division membership appear to be independent at this level of significance. However, the p-value is close to 0.05, which indicates a borderline result.

chisq_test <- chisq.test(table(Hitters_clean$SalaryLevel, Hitters_clean$NewLeague))
chisq_test

## 
##  Pearson's Chi-squared test
## 
## data:  table(Hitters_clean$SalaryLevel, Hitters_clean$NewLeague)
## X-squared = 2.8035, df = 3, p-value = 0.4229

With a p-value of approximately 0.4229—well above the typical 0.05 threshold—we fail to reject the null hypothesis. This indicates no statistically significant evidence of an association between a player’s salary category and their new league assignment. Put another way, being in League A or League N at the beginning of 1987 does not appear to be related to whether a player falls into the Low, Medium-Low, Medium-High, or High salary group.

t-Test

We can compare the means of a continuous variable between two groups using a t-test. For instance, comparing Hits between two salary groups (Low vs. High):

# Subset the data into Low and High Salary groups based on SalaryLevel
low_salary <- subset(Hitters_clean, SalaryLevel == "Low")
high_salary <- subset(Hitters_clean, SalaryLevel == "High")

# Perform a Welch Two Sample t-test on CHmRun between the two groups
t_test_hits <- t.test(low_salary$Hits, high_salary$Hits)
t_test_hits

## 
##  Welch Two Sample t-test
## 
## data:  low_salary$Hits and high_salary$Hits
## t = -6.8089, df = 115.56, p-value = 4.655e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4303486 -0.7856984
## sample estimates:
##  mean of x  mean of y 
## -0.4398894  0.6681341

t-statistic: -6.8889
Degrees of Freedom: 115.56
p-value: 4.655e-10 (extremely small)
95% Confidence Interval: (-1.3034, -0.7857)

Because the p-value is far below the conventional 0.05 threshold, we reject the null hypothesis of equal means. The negative confidence interval indicates that the mean Hits value in the Low Salary group is significantly lower than that of the High Salary group. In practical terms, this strongly suggests that players in the High Salary category have, on average, more hits than players in the Low Salary category. In conclusion, the t-test confirms a statistically significant difference in performance (as measured by Hits) between the lowest- and highest-paid players, aligning with the expectation that higher-performing players command higher salaries.

We can further do another t-test with CHmRun variable between two salary groups (Low vs. High):

# Subset the data into Low and High Salary groups based on SalaryLevel
low_salary <- subset(Hitters_clean, SalaryLevel == "Low")
high_salary <- subset(Hitters_clean, SalaryLevel == "High")

# Perform a Welch Two Sample t-test on CHmRun between the two groups
t_test_chmrun <- t.test(low_salary$CHmRun, high_salary$CHmRun)
t_test_chmrun

## 
##  Welch Two Sample t-test
## 
## data:  low_salary$CHmRun and high_salary$CHmRun
## t = -9.4076, df = 61.632, p-value = 1.583e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.727073 -1.121686
## sample estimates:
##  mean of x  mean of y 
## -0.6604333  0.7639466

t-statistic: -9.4076
Degrees of Freedom: 61.632
p-value: 1.583e-13 (extremely small)
95% Confidence Interval: (-1.727073, -1.121686)

Since the p-value is far below the conventional threshold of 0.05, we reject the null hypothesis that there is no difference between the two groups. The negative confidence interval indicates that the average CHmRun for players in the Low Salary group is significantly lower than that for players in the High Salary group. This strongly suggests that players with higher career home run totals tend to belong to the High Salary category, underscoring the importance of offensive performance in salary determination.

And lastly we double check the result we get from ANOVA Test with a t-test with variable Assists

# Subset the data into Low and High Salary groups based on SalaryLevel
low_salary <- subset(Hitters_clean, SalaryLevel == "Low")
high_salary <- subset(Hitters_clean, SalaryLevel == "High")

# Perform a Welch Two Sample t-test on CHmRun between the two groups
t_test_assists <- t.test(low_salary$Assists, high_salary$Assists)
t_test_assists

## 
##  Welch Two Sample t-test
## 
## data:  low_salary$Assists and high_salary$Assists
## t = -0.0346, df = 122.94, p-value = 0.9725
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3515640  0.3394848
## sample estimates:
##   mean of x   mean of y 
## -0.02093140 -0.01489181

t-statistic: -0.8346
Degrees of Freedom: 122.94
p-value: 0.9725
95% Confidence Interval: (-0.3515640, 0.3394048)

Because the p-value (≈ 0.97) is much higher than the conventional 0.05 threshold, we fail to reject the null hypothesis. The confidence interval includes zero, indicating no statistically significant difference in the mean assists between the Low Salary and High Salary groups. Practically, this suggests that defensive performance as measured by assists does not differ meaningfully between these two salary categories, and hence may not be a strong determinant of salary level.

Density Plot

In this density plot, we compare the distribution of Runs (the number of runs scored in the season) between two salary categories—Low Salary (red) and High Salary (blue). The dashed vertical lines represent each group’s mean Runs value.

# Create subsets for Low and High Salary groups based on SalaryLevel
low_salary <- Hitters_clean %>% filter(SalaryLevel == "Low")
high_salary <- Hitters_clean %>% filter(SalaryLevel == "High")

# Plot the density for the Low Salary group for 'Runs'
plot(density(low_salary$Runs),
     main = "Density of Runs by Salary Level",
     xlab = "Runs (Season Runs)",
     col = "red",
     lwd = 2,
     ylim = c(0, max(density(low_salary$Runs)$y, density(high_salary$Runs)$y))
)

# Overlay the density for the High Salary group
lines(density(high_salary$Runs), col = "blue", lwd = 2)

# Add vertical dashed lines for the means of each group
abline(v = mean(low_salary$Runs), col = "red", lwd = 2, lty = 2)
abline(v = mean(high_salary$Runs), col = "blue", lwd = 2, lty = 2)

# Add a legend to distinguish the groups
legend("topright",
       legend = c("Low Salary", "High Salary"),
       col = c("red", "blue"),
       lwd = 2)

Shift in Distributions:

The Low Salary group’s distribution is centered around a lower (negative) standardized Runs value, whereas the High Salary group’s distribution is shifted to the right (positive). This indicates that players in the High Salary group tend to score more runs on average.

Differences in Means:

The dashed lines show a noticeable gap between the average Runs for Low Salary (red) and High Salary (blue). This gap suggests that higher run production is associated with higher pay.

Overlap:

Despite some overlap in the two curves, the clear separation in their peaks and means reinforces that Runs is a meaningful performance metric related to salary level.

Overall, the plot supports the idea that players who score more runs are more likely to fall into the High Salary category. This aligns with the broader notion that offensive performance is rewarded in player salaries.

Final Remarks on EDA

We have explored both individual variable distributions (using summary statistics, histograms, and boxplots) and relationships between variables (using correlation matrices and statistical tests). Focus has been placed on metrics that are most relevant to understanding player performance and salary differences. This comprehensive EDA provides a solid foundation for subsequent modeling. It informs our decisions on which variables to include in a multinomial logistic regression and helps us understand potential issues like multicollinearity.

Part E) Multinomial Logistic Regression & Model Selection

In this section, we fit a multinomial logistic regression model to predict SalaryLevel using the full set of predictors (excluding the original continuous Salary). We then apply model selection techniques—both forward and backward selection—using stepAIC() to identify a more parsimonious model. Model comparisons are made based on the Akaike Information Criterion (AIC).

Fitting the Full and Null Models

# Fit the full multinomial logistic regression model (exclude Salary since it was used to create SalaryLevel)
full_model <- multinom(SalaryLevel ~ ., data = Hitters_clean[, !names(Hitters_clean) %in% "Salary"])

Full Model Summary

summary(full_model)

## Call:
## multinom(formula = SalaryLevel ~ ., data = Hitters_clean[, !names(Hitters_clean) %in% 
##     "Salary"])
## 
## Coefficients:
##             (Intercept)     AtBat     Hits      HmRun     Runs       RBI
## Medium-Low     4.159894 -5.880311 1.618117 -1.3010311 4.066619 2.1064481
## Medium-High    4.439138 -6.777712 4.135371 -0.8721292 1.868520 1.5949929
## High           4.522568 -5.548210 3.440052 -0.2720965 1.638465 0.8267061
##                   Walks      Years   CAtBat     CHits   CHmRun      CRuns
## Medium-Low  -0.06285499  1.1118485 34.69468 -18.08576 3.780273 -10.227697
## Medium-High  1.03813323  0.5753733 29.01246 -14.52271 3.556581  -5.391923
## High         1.32421271 -0.9589881 28.28527 -13.98477 2.577480  -4.489479
##                  CRBI    CWalks  LeagueN  DivisionW     PutOuts   Assists
## Medium-Low  -7.276993 2.2125353 3.426984 -0.5208839 -0.06593199 1.2212657
## Medium-High -6.522122 0.8372687 3.462216 -0.3859751  0.14488185 1.2248027
## High        -3.856274 0.1736556 2.299857 -1.4894562  0.46537095 0.8731398
##                 Errors NewLeagueN
## Medium-Low  -1.1448409  -3.036058
## Medium-High -0.9224977  -2.625522
## High        -0.5707376  -2.010882
## 
## Std. Errors:
##             (Intercept)    AtBat     Hits    HmRun     Runs      RBI     Walks
## Medium-Low     1.265759 2.089064 2.233412 1.096794 1.490986 1.298393 0.8703046
## Medium-High    1.256830 2.140168 2.280288 1.118552 1.521350 1.342576 0.8910443
## High           1.271984 2.189055 2.354850 1.141206 1.572037 1.386204 0.9270314
##                Years   CAtBat    CHits   CHmRun    CRuns     CRBI   CWalks
## Medium-Low  1.210967 16.05048 17.31092 4.517420 7.451315 7.916279 4.308370
## Medium-High 1.245727 16.00335 17.26354 4.487129 7.464656 7.867447 4.303366
## High        1.380005 16.14438 17.47751 4.521068 7.602715 7.917532 4.330097
##              LeagueN DivisionW   PutOuts   Assists    Errors NewLeagueN
## Medium-Low  1.653709 0.6098704 0.3644251 0.6147888 0.5342457   1.589655
## Medium-High 1.729227 0.6419983 0.3387811 0.6228910 0.5402256   1.665646
## High        1.876590 0.7172204 0.3378969 0.6476407 0.5669417   1.824070
## 
## Residual Deviance: 390.6562 
## AIC: 510.6562

# Fit the null model (intercept only)
null_model <- multinom(SalaryLevel ~ 1, data = Hitters_clean[, !names(Hitters_clean) %in% "Salary"])

Null Model Summary

summary(null_model)

## Call:
## multinom(formula = SalaryLevel ~ 1, data = Hitters_clean[, !names(Hitters_clean) %in% 
##     "Salary"])
## 
## Coefficients:
##             (Intercept)
## Medium-Low  -0.03077005
## Medium-High  0.05884434
## High        -0.11211528
## 
## Std. Errors:
##             (Intercept)
## Medium-Low    0.1754325
## Medium-High   0.1715728
## High          0.1791667
## 
## Residual Deviance: 717.1267 
## AIC: 723.1267

Commentary on Full Model vs. Null Model

Null Model (SalaryLevel ~ 1)

• Model Specification: This model includes no predictors, only intercepts for each salary category (relative to a reference category, typically “Low”).

• Interpretation of Intercepts: Each intercept represents the baseline log-odds of belonging to a particular salary category versus the reference category, without accounting for any performance or demographic variables.

Fit Statistics:

• Residual Deviance: 717.1267

• AIC: 723.1267

These values are relatively high because the model offers no explanatory power beyond the overall distribution of salary levels.

Full Model (SalaryLevel ~ .)

• Model Specification: This model includes all available predictors (e.g., AtBat, Hits, HmRun, Runs, RBI, etc.), excluding the continuous Salary variable (since it was used to derive SalaryLevel).

• Coefficient Estimates: The output shows estimates for each salary category (e.g., Medium-Low, Medium-High, High) relative to the reference category (e.g., Low). A positive coefficient indicates that higher values of that predictor increase the log-odds of belonging to that salary category, whereas a negative coefficient indicates the opposite effect.

Fit Statistics:

• Residual Deviance: 390.6562

• AIC: 510.6562

These values are significantly lower than those of the null model, indicating that including all predictors substantially improves the model’s ability to predict salary levels.

Comparison and Next Steps

• Improvement Over Null Model:

The drop in AIC from ~723 (null model) to ~510 (full model) shows a marked improvement in fit. This confirms that many of these variables (e.g., performance stats, years of experience) help distinguish salary categories.

Overall, these results underscore the strong explanatory value of performance metrics and experience in determining salary categories, providing a solid foundation for model refinement and interpretation.

Part f) Accuracy and Cross-Validation

In this section, we define our classification error measure and then randomly select one of the four cross-validation methods based on the birthday seed. According to the instructions, we use the birthday 20/03/2004 (friend’s name: Efe) as 20032004. We then implement the chosen method to estimate the test error (misclassification rate) of our multinomial model.

Defining the Error Measure

Because we have a classification problem (salary levels as categories), a natural choice is the misclassification rate (or equivalently,
accuracy = 1 - misclassification rate).

Misclassification Rate:

\[ \text{Misclassification Rate} = \frac{\text{Number of Incorrect Predictions}}{\text{Total Number of Observations}} \]

Accuracy:

\[ \text{Accuracy} = 1 - \text{Misclassification Rate} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Observations}} \]

These metrics will help us evaluate how well our model predicts the correct salary categories.

Randomly Choosing the CV Method

According to the assignment, we have four possible CV methods:

1.  Vanilla validation set
2.  LOO-CV (Leave-One-Out Cross-Validation)
3.  K-fold CV (K = 5)
4.  K-fold CV (K = 10)

We choose one randomly using our birthday seed.

# Friend's name: Efe Aydın
# Birthday: 20/03/2004 -> 20032004
set.seed(20032004)

cv_methods <- c("1. Vanilla validation set",
                "2. LOO-CV",
                "3. K-fold CV (with K = 5)",
                "4. K-fold CV (with K = 10)")

chosen_method <- sample(cv_methods, 1)
chosen_method

## [1] "3. K-fold CV (with K = 5)"

Implementing 5-Fold Cross-Validation

Since the chosen method is “3. K-fold CV (with K = 5)”, we use the caret package to perform 5-fold cross-validation on our multinomial logistic regression model. We exclude the original continuous Salary variable since SalaryLevel is our response.

# Define a trainControl object for 5-fold cross-validation
train_control <- trainControl(method = "cv", number = 5)

# Set a seed for reproducibility of the training process
set.seed(20032004)

# Fit the multinomial logistic regression model using caret's train() function
cv_fit <- train(
  SalaryLevel ~ .,
  data = Hitters_clean[, !names(Hitters_clean) %in% "Salary"],
  method = "multinom",
  trControl = train_control,
  trace = FALSE
)

# Display the cross-validation results
print(cv_fit)

## Penalized Multinomial Regression 
## 
## 259 samples
##  19 predictor
##   4 classes: 'Low', 'Medium-Low', 'Medium-High', 'High' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 208, 206, 207, 208, 207 
## Resampling results across tuning parameters:
## 
##   decay  Accuracy   Kappa    
##   0e+00  0.5875637  0.4497898
##   1e-04  0.5913373  0.4547798
##   1e-01  0.5954011  0.4600379
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.

# Calculate accuracy and misclassification rate
accuracy_5fold <- cv_fit$results$Accuracy[1]
misclass_rate_5fold <- 1 - accuracy_5fold

cat("\n5-Fold CV Accuracy:", accuracy_5fold, "\n")

## 
## 5-Fold CV Accuracy: 0.5875637

cat("5-Fold CV Misclassification Rate:", misclass_rate_5fold, "\n")

## 5-Fold CV Misclassification Rate: 0.4124363

The second CV we selected to imply is the “vanilla” validation approach (a single train-test split) to evaluate the multinomial logistic regression model. This approach splits the data (e.g., 80/20), fits the model on the training set, and then computes the misclassification rate on the test set.

# Set seed for reproducibility
set.seed(20032004)

# Create an 80/20 split using caret's createDataPartition
trainIndex <- createDataPartition(Hitters_clean$SalaryLevel, p = 0.8, list = FALSE)
trainData <- Hitters_clean[trainIndex, ]
testData  <- Hitters_clean[-trainIndex, ]

# Fit the multinomial logistic regression model on the training set
# Exclude the continuous 'Salary' variable
model_vanilla <- multinom(SalaryLevel ~ ., data = trainData[, !names(trainData) %in% "Salary"], trace = FALSE)

# Predict the SalaryLevel on the test set
predictions <- predict(model_vanilla, newdata = testData)

# Calculate accuracy and misclassification rate
accuracy_val <- mean(predictions == testData$SalaryLevel)
misclass_val <- 1 - accuracy_val

cat("Vanilla CV Accuracy:", accuracy_val, "\n")

## Vanilla CV Accuracy: 0.7

cat("Vanilla CV Misclassification Rate:", misclass_val, "\n")

## Vanilla CV Misclassification Rate: 0.3

Commentary on 5-Fold CV vs. Vanilla Validation

1. Overall Results

5-Fold CV:

• Accuracy: ~0.56

• Misclassification Rate: ~0.44

Vanilla Validation (80/20 split):

• Accuracy: ~0.7

• Misclassification Rate: ~0.3

2. Possible Reasons for the Difference

Sample Variability:

• The vanilla validation approach relies on a single random train-test split. Depending on which observations end up in the training vs. test sets, the resulting accuracy can vary. If the test set happens to be “easier” to classify, the model may appear to perform better.

Multiple Folds in CV:

• The 5-fold cross-validation approach partitions the data into five subsets, using each subset as a test set once. This typically provides a more stable estimate of the model’s generalization performance, as it reduces variance in the performance metric compared to a single split.

3. Interpretation & Practical Implications

Vanilla CV (80/20) Pros and Cons:

• Pros: Simple and fast to implement.

• Cons: Highlydependent on the particular split, so the result may not generalize well.

5-Fold CV Pros and Cons:

• Pros: More robust estimate of predictive performance since it averages results across multiple splits.

• Cons: Computationally more expensive than a single train-test split.

4. Which Result to Trust?

Stability vs. Luck:

• The 5-fold CV metric is often preferred as it mitigates the “luck of the draw” in train-test splitting.

• A single train-test split (vanilla) may overestimate or underestimate performance if the data split is not representative.

5. Conclusion

• The higher accuracy (70%) observed in the vanilla validation may be partly due to a favorable split.

• The 5-fold CV accuracy (56%) is likely a more conservative and robust estimate of how well the model would perform on truly unseen data.

• Ultimately, both methods show that the multinomial logistic regression model has meaningful predictive capability, but cross-validation gives a more stable, generalizable assessment of its performance.

Part g) Identifying a Better Model with Lower Test Error

The full multinomial logistic regression model (using all predictors) may include irrelevant or redundant variables. In this section, we identify a more parsimonious model with a lower test error using forward and backward selection via the stepAIC() function from the MASS package. We then evaluate the selected model using 5-fold cross-validation.

Forward Selection

We start with the null model and add predictors step-by-step, choosing the predictor that most improves the model (i.e. reduces AIC).

forward_model <- stepAIC(null_model, 
                         scope = list(lower = null_model, upper = full_model), 
                         direction = "forward")

Final Forward Selection Model:

summary(forward_model)

## Call:
## multinom(formula = SalaryLevel ~ CRuns + AtBat + PutOuts + Years, 
##     data = Hitters_clean[, !names(Hitters_clean) %in% "Salary"])
## 
## Coefficients:
##             (Intercept)    CRuns       AtBat     PutOuts     Years
## Medium-Low     5.211497 6.548930 -0.70693400 -0.04783254 1.6744941
## Medium-High    5.457328 8.054138 -0.32438786  0.16626315 1.2615554
## High           4.755518 9.458694 -0.03233549  0.64143901 0.2549265
## 
## Std. Errors:
##             (Intercept)    CRuns     AtBat   PutOuts     Years
## Medium-Low     1.084234 1.858143 0.3340662 0.3137865 0.8688585
## Medium-High    1.081369 1.879150 0.3586437 0.2997162 0.9026553
## High           1.093172 1.916415 0.4037025 0.2942920 0.9828780
## 
## Residual Deviance: 454.8555 
## AIC: 484.8555

Backward Selection

We also perform backward selection, starting with the full model and removing the least significant predictor at each step based on AIC.

backward_model <- stepAIC(full_model, direction = "backward")

Final Backward Selection Model:

summary(backward_model)

## Call:
## multinom(formula = SalaryLevel ~ AtBat + Hits + Runs + Walks + 
##     Years + CAtBat + CHits + Division + Assists + Errors, data = Hitters_clean[, 
##     !names(Hitters_clean) %in% "Salary"])
## 
## Coefficients:
##             (Intercept)     AtBat     Hits      Runs    Walks      Years
## Medium-Low     4.295296 -5.739224 3.699605 2.1668138 0.342005  1.2393909
## Medium-High    4.644513 -6.058737 4.915113 0.8024354 1.224064  0.5270782
## High           4.421339 -4.581230 3.692677 1.0352795 1.288674 -0.7539692
##               CAtBat     CHits  DivisionW   Assists     Errors
## Medium-Low  34.46724 -30.14107 -0.4375513 1.0478602 -0.9016800
## Medium-High 27.43086 -21.02300 -0.1934305 0.8943664 -0.5598529
## High        26.21484 -18.17980 -1.1646381 0.3739404 -0.2622754
## 
## Std. Errors:
##             (Intercept)    AtBat     Hits      Runs     Walks    Years   CAtBat
## Medium-Low     1.089078 1.717575 1.819624 0.9610325 0.4686704 1.115734 12.76839
## Medium-High    1.086585 1.754597 1.844028 0.9848865 0.4801116 1.163152 12.73991
## High           1.094026 1.787445 1.867295 1.0145532 0.5009128 1.283015 12.82023
##                CHits DivisionW   Assists    Errors
## Medium-Low  12.79997 0.5473749 0.4970802 0.4485288
## Medium-High 12.72644 0.5787065 0.5030995 0.4548782
## High        12.76948 0.6314128 0.5290861 0.4834889
## 
## Residual Deviance: 416.8386 
## AIC: 482.8386

Summary of Model Results

Forward Selection

1. Process:

• Started from the null model (SalaryLevel ~ 1) and added predictors step by step.

• At each step, the predictor that most reduced the AIC was added, until no further improvement was found.

2. Final Model & AIC:

• The final forward model includes a subset of predictors (e.g., CRuns, AtBat, PutOuts, Years, etc.).

• The AIC of this final model is ~484.86.

3. Interpretation:

• Each selected predictor meaningfully contributes to distinguishing salary levels.

• Positive coefficients indicate that as the predictor increases, the log-odds of being in a higher salary category (vs. the reference) also increase; negative coefficients do the opposite.

• Because the forward model includes fewer variables than the full model, it is typically more parsimonious while still retaining good predictive power.

Backward Selection

1. Process:

• Started from the full model (with all predictors) and iteratively removed the least significant predictor (or the one whose removal best improved the AIC).

• This continued until no further improvement in AIC was possible.

2. Final Model & AIC:

• The final backward model retains a different subset of predictors (e.g., AtBat, Hits, Runs, Walks, Years, etc.).

• Its AIC is ~482.84, which is marginally lower than the forward model’s AIC of ~484.86.

3. Interpretation:

• This final subset represents predictors that the backward procedure deemed most significant from the full set.

• While it may include different (or more) variables than the forward model, the lower AIC indicates it may be more optimal in balancing model fit and complexity compared to the forward model.

Overall, the backward selection model appears to be more parsimonious and has a slighly better AIC than the backward selection model. Therefore, it is likely the preferred choice for predicting SalaryLevel.

Cross Validation

We have obtained a backward-selected multinomial logistic regression model stored as backward_model (fitted with multinom() from the nnet package), we now evaluate its predictive performance using:

• 5-Fold Cross-Validation

• Vanilla Validation (80/20 train-test split)

5-Fold Cross-Validation

# Extract the formula from the selected model
selected_formula <- formula(backward_model)

# Define trainControl for 5-fold cross-validation
train_control <- trainControl(method = "cv", number = 5)

# Set a seed for reproducibility
set.seed(20032004)

# Train the selected model using caret
cv_fit_selected <- train(selected_formula,
                         data = Hitters_clean[, !names(Hitters_clean) %in% "Salary"],
                         method = "multinom",
                         trControl = train_control,
                         trace = FALSE)

# Display the cross-validation results
print(cv_fit_selected)

## Penalized Multinomial Regression 
## 
## 259 samples
##  10 predictor
##   4 classes: 'Low', 'Medium-Low', 'Medium-High', 'High' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 208, 206, 207, 208, 207 
## Resampling results across tuning parameters:
## 
##   decay  Accuracy   Kappa    
##   0e+00  0.6178946  0.4900186
##   1e-04  0.6256624  0.5001525
##   1e-01  0.6023564  0.4689325
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 1e-04.

# Calculate accuracy and misclassification rate
accuracy_selected <- cv_fit_selected$results$Accuracy[1]
misclass_rate_selected <- 1 - accuracy_selected

cat("\n5-Fold CV Accuracy of Backward Model:", accuracy_selected, "\n")

## 
## 5-Fold CV Accuracy of Backward Model: 0.6178946

cat("5-Fold CV Misclassification Rate of Backward Model:", misclass_rate_selected, "\n")

## 5-Fold CV Misclassification Rate of Backward Model: 0.3821054

# Set seed for reproducibility
set.seed(20032004)

# Create an 80/20 split
trainIndex <- createDataPartition(Hitters_clean$SalaryLevel, p = 0.8, list = FALSE)
trainData <- Hitters_clean[trainIndex, ]
testData  <- Hitters_clean[-trainIndex, ]

# Fit the selected model on the training set using the final formula from the selected model
selected_formula <- formula(backward_model)
model_vanilla_selected <- multinom(selected_formula, data = trainData[, !names(trainData) %in% "Salary"], trace = FALSE)

# Predict on the test set
predictions_selected <- predict(model_vanilla_selected, newdata = testData)

# Calculate accuracy and misclassification rate
accuracy_val_selected <- mean(predictions_selected == testData$SalaryLevel)
misclass_val_selected <- 1 - accuracy_val_selected

cat("Vanilla CV (80/20) Accuracy for Backward Model:", accuracy_val_selected, "\n")

## Vanilla CV (80/20) Accuracy for Backward Model: 0.74

cat("Vanilla CV Misclassification Rate for Backward Model:", misclass_val_selected, "\n")

## Vanilla CV Misclassification Rate for Backward Model: 0.26

Commentary on Model Performance

1. Full Model Performance:

5-Fold CV:

• Accuracy: 58.76%

• Misclassification Rate: 41.24%

The 5-fold cross-validation for the full model indicates that, on average, about 58.76% of cases are correctly classified, while roughly 41.24% are misclassified. This reflects the performance when all available predictors are used, even if some may be redundant or less informative.

Vanilla CV (80/20 Split):

• Accuracy: 70%

• Misclassification Rate: 30%

With a single 80/20 split, the full model shows a higher accuracy of 70%, likely due to the specifics of the split. However, this method is more sensitive to the particular partition used.

2. Backward Model Performance:

5-Fold CV:

• Accuracy: 61.79%

• Misclassification Rate: 38.21%

After applying backward selection, the 5-fold CV accuracy improves to about 61.79%, reducing the misclassification rate to 38.21%. This suggests that eliminating less relevant predictors enhances the model’s ability to generalize, resulting in more consistent performance across folds.

Vanilla CV (80/20 Split):

• Accuracy: 74%

• Misclassification Rate: 26%

Similarly, the backward-selected model achieves 74% accuracy on the 80/20 split, with a misclassification rate of 26%, indicating better performance compared to the full model on this validation method as well.

3. Overall Comparison and Interpretation:

Performance Improvement:

The backward selection model consistently outperforms the full model on both validation methods. The accuracy improves by roughly 3 percentage points in the 5-fold CV (from 58.76% to 61.79%) and by 4 percentage points in the vanilla CV (from 70% to 74%). This reduction in misclassification rates (from 41.24% to 38.21% in 5-fold CV and from 30% to 26% in vanilla CV) suggests that the removal of redundant or irrelevant predictors results in a model that is better at distinguishing between the salary categories.

Cross-Validation Insights:

While the vanilla CV results are generally higher—possibly due to a favorable train-test split—the 5-fold CV offers a more robust and less variable performance estimate by averaging over multiple splits. The consistent improvement seen in both validation schemes confirms the benefit of model selection.

Conclusion:

Overall, these findings underscore the advantage of applying backward selection. By reducing model complexity and avoiding overfitting, the backward model demonstrates improved predictive accuracy and lower error rates. This makes it a more desirable choice for predicting individual salary levels.

ProjectWork_FenersiKennler

Mazhar Can Muradoglu, Berke Akseki, Efe Aydin

2025-04-04

Part A) Formalizing the Multinomial Logistic Regression Model

Model Formulation

Comparison with Binary Logistic Regression

Part B) Importing and Preprocessing the Hitters Data

Importing the Data

Inspecting the Data

Handling the Missing Values

Investigating the Missingness Pattern in the Target Variable

Load the Raw Data

Visualize Missing Data with naniar

Conclusion on Handling Missing Values

Handling Missing Values

Standardizing Numeric Predictors (z-score standardization)

Converting Character Columns to Factors

Part C) Creating the Categorical Response Variable

Part D) Visual Inspection & Identifying Relevant Variables

Data Import and Structure

Salary Distribution and Key Statistics

Identifying Relevant Predictors via ANOVA Tests

Assessing Multicollinearity: Correlation Matrix for Significant Predictors

Correlation Plot Analysis

Boxplot for Years by Salary Level

Boxplot for CRBI by Salary Level

Boxplot for PutOuts by Salary Level

Statictical Tests

Chi-Squared Test

t-Test

Density Plot

Final Remarks on EDA

Part E) Multinomial Logistic Regression & Model Selection

Fitting the Full and Null Models

Full Model Summary

Null Model Summary

Commentary on Full Model vs. Null Model

Null Model (SalaryLevel ~ 1)

Fit Statistics:

Full Model (SalaryLevel ~ .)

Fit Statistics:

Comparison and Next Steps

Part f) Accuracy and Cross-Validation

Defining the Error Measure

Randomly Choosing the CV Method

Implementing 5-Fold Cross-Validation

Commentary on 5-Fold CV vs. Vanilla Validation

1. Overall Results

2. Possible Reasons for the Difference

3. Interpretation & Practical Implications

4. Which Result to Trust?

5. Conclusion

Part g) Identifying a Better Model with Lower Test Error

Forward Selection

Final Forward Selection Model:

Backward Selection

Final Backward Selection Model:

Summary of Model Results

Forward Selection

Backward Selection

Cross Validation

5-Fold Cross-Validation

Commentary on Model Performance

1. Full Model Performance:

2. Backward Model Performance:

3. Overall Comparison and Interpretation:

Conclusion:

Visualize Missing Data with `naniar`