authors: Zhanel Zhidelkhanova, Beksultan Satybaldiev, Damila Turta

a. Logistic Regression: The Basics

Okay, before diving into the dataset and throwing R code at it, let’s first talk about what kind of model we’re dealing with.

We’re gonna use a multinomial logistic regression model — sounds fancy, but it’s really just an extension of the logistic regression that we know.

Let’s say we’re trying to predict a variable \(Y\) that has \(K\) categories. For example, maybe a player’s salary level: “low”, “medium”, or “high”.

So, \(Y \in \{1, 2, ..., K\}\), and we’ve got some predictors \(X_1, X_2, ..., X_p\).

We pick one of the categories (say, class 1) as the reference category, and for each of the other classes \(k = 2, ..., K\), the model looks like this:

\[ \log\left( \frac{P(Y = k \mid X)}{P(Y = 1 \mid X)} \right) = \beta_{k0} + \beta_{k1}X_1 + \cdots + \beta_{kp}X_p \]

We’re modeling the log-odds of being in class \(k\) vs. the reference class (class 1). And we do this for every class except the reference one.
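Once we have these \(K-1\) log-odds equations, the class probabilities themselves follow directly (this is the standard multinomial-logit identity, nothing specific to our data):

\[ P(Y = k \mid X) = \frac{e^{\beta_{k0} + \beta_{k1}X_1 + \cdots + \beta_{kp}X_p}}{1 + \sum_{j=2}^{K} e^{\beta_{j0} + \beta_{j1}X_1 + \cdots + \beta_{jp}X_p}}, \qquad k = 2, \ldots, K, \]

and the reference class simply gets whatever probability is left over:

\[ P(Y = 1 \mid X) = \frac{1}{1 + \sum_{j=2}^{K} e^{\beta_{j0} + \beta_{j1}X_1 + \cdots + \beta_{jp}X_p}}. \]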

How’s This Different from Simple Logistic Regression?

Simple (binary) logistic regression is what you use when your outcome has only two possible options — like “yes or no”, “pass or fail”, “low or high salary”. It’s a straight-up either-or situation.

But in our case, we’ve got more than two categories — like “low”, “medium”, and “high” salary levels. So regular logistic regression can’t handle that — it’s not built for multi-choice problems.

That’s where multinomial logistic regression comes in. It’s like the upgraded version that knows how to deal with multiple options.

With simple logistic regression, you only need one formula to predict the probability of being in one class (like “yes”) vs. the other (like “no”).

With multinomial, you need a separate formula for each class, because you’re comparing each one to a “base” category (usually the first one), and figuring out the odds of being in that class instead.

So yes, multinomial logistic regression is like an upgraded version — it can handle more categories, does a bit more behind the scenes, and gives you more detailed predictions.
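To make the “one formula vs. several formulas” point concrete, here is roughly how the two model calls differ in R. This is just a sketch with a hypothetical data frame df, outcome y, and predictors x1 and x2, not our Hitters data yet:

# binary outcome (two levels): one equation, fitted with glm()
fit_bin <- glm(y ~ x1 + x2, data = df, family = binomial)

# three or more categories: K - 1 equations at once, fitted with nnet::multinom()
library(nnet)
fit_multi <- multinom(y ~ x1 + x2, data = df, trace = FALSE)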

b. Importing and cleaning the dataset.

Alright, time to dive into the Hitters.csv dataset. Most of the variables are currently read as characters (which is not ideal), and we’ve got to fix that so we can use them for modeling later. We’ll also check for missing values and get the dataset in shape without dropping important info. First, loading the dataset.

# read the raw file first, forcing every column to character so we can see what needs fixing
Data = read.csv2("Hitters.csv", header = TRUE,
                 sep = ",",
                 colClasses = "character")
                  
#Let's visualize the data structure
str(Data)
## 'data.frame':    317 obs. of  20 variables:
##  $ AtBat    : chr  "293" "315" "479" "496" ...
##  $ Hits     : chr  "66" "81" "130" "141" ...
##  $ HmRun    : chr  "1" "7" "18" "20" ...
##  $ Runs     : chr  "30" "24" "66" "65" ...
##  $ RBI      : chr  "29" "38" "72" "78" ...
##  $ Walks    : chr  "14" "39" "76" "37" ...
##  $ Years    : chr  "1" "14" "3" "11" ...
##  $ CAtBat   : chr  "293" "3449" "1624" "5628" ...
##  $ CHits    : chr  "66" "835" "457" "1575" ...
##  $ CHmRun   : chr  "1" "69" "63" "225" ...
##  $ CRuns    : chr  "30" "321" "224" "828" ...
##  $ CRBI     : chr  "29" "414" "266" "838" ...
##  $ CWalks   : chr  "14" "375" "263" "354" ...
##  $ League   : chr  "A" "N" "A" "N" ...
##  $ Division : chr  "E" "W" "W" "E" ...
##  $ PutOuts  : chr  "446" "632" "880" "200" ...
##  $ Assists  : chr  "33" "43" "82" "11" ...
##  $ Errors   : chr  "20" "10" "14" "3" ...
##  $ Salary   : chr  NA "475" "480" "500" ...
##  $ NewLeague: chr  "A" "N" "A" "N" ...

Converting character columns that should be numeric into numeric, and text-based categorical variables into factors.

# re-read the file with read.csv, then fix the column types one by one
Hitters = read.csv("Hitters.csv")
Hitters$AtBat = as.numeric(Hitters$AtBat)
Hitters$Hits = as.numeric(Hitters$Hits)
Hitters$HmRun = as.numeric(Hitters$HmRun)
Hitters$Runs = as.numeric(Hitters$Runs)
Hitters$RBI = as.numeric(Hitters$RBI)
Hitters$Walks = as.numeric(Hitters$Walks)
Hitters$Years = as.numeric(Hitters$Years)
Hitters$CAtBat = as.numeric(Hitters$CAtBat)
Hitters$CHits = as.numeric(Hitters$CHits)
Hitters$CHmRun = as.numeric(Hitters$CHmRun)
Hitters$CRuns = as.numeric(Hitters$CRuns)
Hitters$CRBI = as.numeric(Hitters$CRBI)
Hitters$CWalks = as.numeric(Hitters$CWalks)
Hitters$PutOuts = as.numeric(Hitters$PutOuts)
Hitters$Assists = as.numeric(Hitters$Assists)
Hitters$Errors = as.numeric(Hitters$Errors)
Hitters$Salary = as.numeric(Hitters$Salary)
Hitters$League = as.factor(Hitters$League)
Hitters$Division = as.factor(Hitters$Division)
Hitters$NewLeague = as.factor(Hitters$NewLeague)
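
Side note: the same conversion can be written more compactly. Here is a sketch that gives the same result as the block above, just with less copy-pasting:

# convert all the numeric stat columns in one go
num_cols <- c("AtBat", "Hits", "HmRun", "Runs", "RBI", "Walks", "Years",
              "CAtBat", "CHits", "CHmRun", "CRuns", "CRBI", "CWalks",
              "PutOuts", "Assists", "Errors", "Salary")
Hitters[num_cols] <- lapply(Hitters[num_cols], as.numeric)

# and the categorical ones
fac_cols <- c("League", "Division", "NewLeague")
Hitters[fac_cols] <- lapply(Hitters[fac_cols], as.factor)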

Checking for duplicates

anyDuplicated(Hitters)
## [1] 0
str(Hitters)
## 'data.frame':    317 obs. of  20 variables:
##  $ AtBat    : num  293 315 479 496 321 594 185 298 323 574 ...
##  $ Hits     : num  66 81 130 141 87 169 37 73 81 159 ...
##  $ HmRun    : num  1 7 18 20 10 4 1 0 6 21 ...
##  $ Runs     : num  30 24 66 65 39 74 23 24 26 107 ...
##  $ RBI      : num  29 38 72 78 42 51 8 24 32 75 ...
##  $ Walks    : num  14 39 76 37 30 35 21 7 8 59 ...
##  $ Years    : num  1 14 3 11 2 11 2 3 2 10 ...
##  $ CAtBat   : num  293 3449 1624 5628 396 ...
##  $ CHits    : num  66 835 457 1575 101 ...
##  $ CHmRun   : num  1 69 63 225 12 19 1 0 6 90 ...
##  $ CRuns    : num  30 321 224 828 48 501 30 41 32 702 ...
##  $ CRBI     : num  29 414 266 838 46 336 9 37 34 504 ...
##  $ CWalks   : num  14 375 263 354 33 194 24 12 8 488 ...
##  $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
##  $ PutOuts  : num  446 632 880 200 805 282 76 121 143 238 ...
##  $ Assists  : num  33 43 82 11 40 421 127 283 290 445 ...
##  $ Errors   : num  20 10 14 3 4 25 7 9 19 22 ...
##  $ Salary   : num  NA 475 480 500 91.5 ...
##  $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...

As you can see above, the columns changed from character to numeric and factor types. And there are no duplicates. YAY

Now we should find our missing values. As we can see, the “Salary” predictor has NA missing values, which is a no-no.

sum(is.na(Hitters))
## [1] 58

Counting the missing values, we’ve got 58 of them. Rather than dropping these rows (and losing useful info), we’ll handle this with multiple imputation. But first we need to install and load the package for it.

# install.packages("mice")
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
imputed_data <- mice(Hitters, m = 20, method = 'pmm', seed = 123)
## 
##  iter imp variable
##   1   1  Salary
##   1   2  Salary
##   1   3  Salary
##   1   4  Salary
##   1   5  Salary
##   1   6  Salary
##   1   7  Salary
##   1   8  Salary
##   1   9  Salary
##   1   10  Salary
##   1   11  Salary
##   1   12  Salary
##   1   13  Salary
##   1   14  Salary
##   1   15  Salary
##   1   16  Salary
##   1   17  Salary
##   1   18  Salary
##   1   19  Salary
##   1   20  Salary
##   2   1  Salary
##   2   2  Salary
##   2   3  Salary
##   2   4  Salary
##   2   5  Salary
##   2   6  Salary
##   2   7  Salary
##   2   8  Salary
##   2   9  Salary
##   2   10  Salary
##   2   11  Salary
##   2   12  Salary
##   2   13  Salary
##   2   14  Salary
##   2   15  Salary
##   2   16  Salary
##   2   17  Salary
##   2   18  Salary
##   2   19  Salary
##   2   20  Salary
##   3   1  Salary
##   3   2  Salary
##   3   3  Salary
##   3   4  Salary
##   3   5  Salary
##   3   6  Salary
##   3   7  Salary
##   3   8  Salary
##   3   9  Salary
##   3   10  Salary
##   3   11  Salary
##   3   12  Salary
##   3   13  Salary
##   3   14  Salary
##   3   15  Salary
##   3   16  Salary
##   3   17  Salary
##   3   18  Salary
##   3   19  Salary
##   3   20  Salary
##   4   1  Salary
##   4   2  Salary
##   4   3  Salary
##   4   4  Salary
##   4   5  Salary
##   4   6  Salary
##   4   7  Salary
##   4   8  Salary
##   4   9  Salary
##   4   10  Salary
##   4   11  Salary
##   4   12  Salary
##   4   13  Salary
##   4   14  Salary
##   4   15  Salary
##   4   16  Salary
##   4   17  Salary
##   4   18  Salary
##   4   19  Salary
##   4   20  Salary
##   5   1  Salary
##   5   2  Salary
##   5   3  Salary
##   5   4  Salary
##   5   5  Salary
##   5   6  Salary
##   5   7  Salary
##   5   8  Salary
##   5   9  Salary
##   5   10  Salary
##   5   11  Salary
##   5   12  Salary
##   5   13  Salary
##   5   14  Salary
##   5   15  Salary
##   5   16  Salary
##   5   17  Salary
##   5   18  Salary
##   5   19  Salary
##   5   20  Salary

Why 20? At first we thought of taking 5 as the “default”. But then we decided that we want more precise and solid results: m = 20 means we are creating 20 versions of our data, each with different imputed values. The method 'pmm' stands for predictive mean matching, which fills each gap with an observed Salary borrowed from a statistically similar player, so the imputed values stay realistic. AND why seed 123? It is just the most common random seed (we were thinking about 2025, but 123 is more common in tutorials ;D )

Hitters = complete(imputed_data, 1)

We use this function to extract one fully filled-in dataset from the multiple imputations created by mice(). Now the “Salary” column in the Hitters dataset has no missing values, and with the command sum(is.na(Hitters$Salary)) we can double-check that nothing is left.

sum(is.na(Hitters$Salary))  #we have 0!
## [1] 0
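
If we wanted to be extra careful, we could also peek at another completed dataset and compare the Salary summaries, just to see that the imputed values look plausible. A quick sanity-check sketch (we did not change anything based on it):

summary(complete(imputed_data, 1)$Salary)   # the version we actually kept
summary(complete(imputed_data, 2)$Salary)   # another imputed version, for comparison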

c. categorizing the continuous Salary variable

We are dividing the “Salary” values into three categories (LowSalary, MediumSalary, HighSalary), cutting at the 1/3 and 2/3 quantiles so each group ends up with roughly the same number of players.

Hitters$SalaryLevel = cut(Hitters$Salary,
                          breaks = quantile(Hitters$Salary, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE),
                          labels = c("LowSalary", "MediumSalary", "HighSalary"),
                          include.lowest = TRUE)

Now we have a new variable, “SalaryLevel”.
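
A quick sanity check (just a sketch) that the cut worked as intended, with three levels and no leftover NAs:

levels(Hitters$SalaryLevel)       # should be LowSalary, MediumSalary, HighSalary
sum(is.na(Hitters$SalaryLevel))   # should be 0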

BONUS

d. visualization

We decided to use boxplots to visualize the distribution, spread, and outliers of Salary and Years. A boxplot shows the median, quartiles, and potential extreme values.

par(mfrow = c(1, 2))   # two boxplots side by side
boxplot(Hitters$Salary, main = "Salary", col = "lightblue")
boxplot(Hitters$Years, main = "Years", col = "lightpink")

As you can see, we have two boxplots, where the blue one stands for Salary and the pink one for Years.

Plot 1: Salary

This boxplot shows how player salaries are distributed overall. What we see right away is that the distribution is hella skewed. Most of the data is clustered in the lower part of the plot, with a ton of outliers stacked way above the top whisker. The median salary is somewhere in the $300k–$400k range, but the upper end just explodes. There’s a long tail with players earning way more than the rest, which is why those dots (outliers) shoot way up. Real talk? This plot shows why using raw salary as a continuous variable kinda sucks — it’s too skewed. That’s why we turned Salary into categories (SalaryLevel), so the model can actually work with something more balanced.

Plot 2: Years

This one shows the distribution of how long players have been in the league. It’s a lot more “normal-looking” than the salary plot. The median is around 4–5 years, and most players are between 2 and 10 years of experience. You’ve got a few OGs hanging in the league for like 15+ years, but they’re rare (those are the dots past the whiskers). It’s not as wild as salary, but still kinda right-skewed. You can tell that most players are still early in their careers, which kinda makes sense for sports — it’s a competitive field, and not everyone sticks around forever.

What does this all mean? Salary is too messy and unequal to use raw. That’s why we needed to break it into LowSalary, MediumSalary, and HighSalary — way easier to work with. Years looks way more chill and can be used as a predictor — and probably a good one, tbh. We can already guess that the more years a player has, the higher the chance they’re making bank.

Now, we want to visualize how many players have high, medium, and low salary levels. We use a simple barplot for this purpose.

barplot(table(Hitters$SalaryLevel), main = "SalaryLevel", col = c("blue", "green","red"))

Since the plain barplot doesn’t make it obvious how many players are in each salary level, we add the exact counts on top of the bars and give the y-axis a bit of extra headroom.

# count how many players are in each salary group
counts <- table(Hitters$SalaryLevel)

# make a barplot showing how many players in each group
bars <- barplot(counts,
                main = "Players by Salary Level",
                col = c("blue", "green", "red"),
                ylim = c(0, 130),   # give a bit of space at the top
                ylab = "Number of Players")

# add the numbers on top of the bars
text(bars, counts + 5, labels = counts)

What we can say about this:

We created this barplot to see how players are split across the three salary levels: Low, Medium, and High. The distribution looks super balanced — each group has roughly 100 players (107, 104, 106 to be exact).

Almost perfectly balanced — all three salary levels have pretty much the same number of players. That’s actually awesome because it means our SalaryLevel variable is well distributed — so we don’t have class imbalance issues when we run the model.

There’s no dominant category — we’re not stuck with a dataset where 80% of people are low salary and only a few are high. That makes our multinomial classification way more fair and reliable.
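
For the record, the same balance check expressed as proportions instead of raw counts (a one-line sketch):

# each level should sit at roughly one third of the players
round(prop.table(table(Hitters$SalaryLevel)), 3)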

e. Multinomial Logistic Regression Model

Alright, time to get serious and build our model.

Since our target variable SalaryLevel has 3 categories (Low, Medium, High), we can’t use basic logistic regression. Instead, we’re going with multinomial logistic regression from the nnet package.

This model will help us predict which salary group a player is in, based on their stats (like Years, Hits, Walks, etc.).

# loading the package
library(nnet)

# building the full multinomial model (exclude raw Salary since we're using SalaryLevel)
model <- multinom(SalaryLevel ~ . - Salary, data = Hitters, trace = FALSE)

# printing the model summary so we can check out coefficients
summary(model)
## Call:
## multinom(formula = SalaryLevel ~ . - Salary, data = Hitters, 
##     trace = FALSE)
## 
## Coefficients:
##              (Intercept)        AtBat       Hits       HmRun        Runs
## MediumSalary   -1.325247 -0.018778883 0.03672986 -0.04434423  0.03975236
## HighSalary     -4.751447 -0.004843339 0.03204475  0.08122760 -0.02701316
##                      RBI      Walks       Years        CAtBat       CHits
## MediumSalary  0.03122883 0.02093719  0.20906986  2.165340e-03 0.002138079
## HighSalary   -0.01631227 0.06067112 -0.09238325 -2.231001e-05 0.016190996
##                   CHmRun        CRuns         CRBI      CWalks  LeagueN
## MediumSalary 0.009837024 -0.010952599 -0.005226741 -0.00523498 1.812903
## HighSalary   0.021126887 -0.006997323 -0.011369235 -0.01206908 1.977756
##               DivisionW       PutOuts     Assists      Errors NewLeagueN
## MediumSalary -0.3631175 -8.373118e-05 0.004573126 -0.13608213  -1.876300
## HighSalary   -0.8862845  7.284618e-04 0.001664258 -0.07011276  -1.151451
## 
## Std. Errors:
##              (Intercept)       AtBat       Hits      HmRun       Runs
## MediumSalary  0.02283776 0.006595679 0.02613572 0.06555356 0.03040165
## HighSalary    0.01688432 0.007136417 0.02910274 0.07417250 0.03430241
##                     RBI      Walks      Years      CAtBat       CHits
## MediumSalary 0.02708149 0.01799299 0.08752894 0.001992953 0.008937615
## HighSalary   0.03114779 0.02102112 0.12660205 0.002114264 0.009876925
##                  CHmRun       CRuns       CRBI      CWalks    LeagueN
## MediumSalary 0.02158868 0.008781428 0.01012211 0.004187329 0.10263357
## HighSalary   0.02277075 0.009140879 0.01072671 0.004676829 0.07517073
##               DivisionW      PutOuts     Assists     Errors NewLeagueN
## MediumSalary 0.01536136 0.0007921662 0.002356394 0.04824452 0.10161849
## HighSalary   0.01119219 0.0008140682 0.002568332 0.05215872 0.07503364
## 
## Residual Deviance: 415.6612 
## AIC: 495.6612

So we ran that big ol’ model to predict salary level based on player stats. Here’s what actually matters (with numbers and vibes):

Years

More years in the league generally means more money, but with a twist: the Years coefficient for MediumSalary vs. LowSalary is clearly positive (about 0.21), while for HighSalary vs. LowSalary it is actually small and negative (about -0.09) and noisy. So experience reliably lifts players out of the low tier, but extra years alone don't guarantee the top tier once the career stats are in the model.

Errors

Errors are killing the bag. The more mistakes a player makes, the lower the odds of landing in the Medium or High salary level (coefficients around -0.14 and -0.07, both negative), with the Medium-vs-Low effect being the clearly strong one.

PutOuts

Defensive plays might matter a bit. Players with more putouts (aka successful fielding) have slightly better odds of landing in the HighSalary tier: the coefficient is positive, but small relative to its standard error, so we wouldn't bet the house on it.

Other variables - Stuff like Assists, CWalks, and the career totals (CAtBat, CRuns, CRBI) got super tiny per-unit coefficients (close to 0), so on their own they probably don't move the needle much. League and NewLeague do have bigger coefficients, but they point in opposite directions and largely offset each other, so we take those with a grain of salt.
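
One caveat: multinom() doesn't print p-values, so "tiny coefficient" isn't the whole story. A quick way to judge which effects actually stand out is to compute rough Wald tests from the summary output (a sketch using the model object we already fitted):

# z = coefficient / standard error, two-sided p-value from the normal distribution
z <- summary(model)$coefficients / summary(model)$standard.errors
p <- 2 * (1 - pnorm(abs(z)))
round(p, 3)

# odds-ratio view: exp() of a coefficient; e.g. exp(0.209) is about 1.23, so each extra
# Year multiplies the odds of MediumSalary (vs LowSalary) by roughly 1.23
round(exp(coef(model)), 3)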

TL;DR: experience helps (especially for getting out of the low-salary tier), errors hurt, and most of the raw counting stats barely move the needle on their own.

f. Model Evaluation with Cross-Validation

Let’s see how well our model actually performs. We’ll test it using two common cross-validation methods: Leave-One-Out Cross-Validation (LOOCV) and 10-Fold Cross-Validation.

We’ll compare the accuracy and error rate from both.

# loading caret package for training and cross-validation
library(caret)
## Loading required package: lattice
# LOOCV — Leave-One-Out Cross-Validation

set.seed(15042001)  
loo_control <- trainControl(method = "LOOCV")

# training the model using LOOCV
loo_model <- train(SalaryLevel ~ . - Salary,
                   data = Hitters,
                   method = "multinom",
                   trControl = loo_control,
                   trace = FALSE)

# grab accuracy and error
loo_acc <- max(loo_model$results$Accuracy)
loo_err <- 1 - loo_acc

cat("LOO-CV Accuracy:", loo_acc, "\n")
## LOO-CV Accuracy: 0.659306
cat("LOO-CV Misclassification Error:", loo_err, "\n")
## LOO-CV Misclassification Error: 0.340694

With LOOCV, the model is trained on all players except one and then tested on that single left-out player, and this gets repeated once for every player in the dataset. Very slow, but it uses almost all the data for each fit. The accuracy and error we got here give us a strong baseline.

# 10-Fold Cross-Validation

cv10_control <- trainControl(method = "cv", number = 10)

# training the model using 10-Fold CV
cv10_model <- train(SalaryLevel ~ . - Salary,
                    data = Hitters,
                    method = "multinom",
                    trControl = cv10_control,
                    trace = FALSE)

# accuracy & error
cv10_acc <- max(cv10_model$results$Accuracy)
cv10_err <- 1 - cv10_acc

cat("10-Fold CV Accuracy:", cv10_acc, "\n")
## 10-Fold CV Accuracy: 0.6374289
cat("10-Fold Misclassification Error:", cv10_err, "\n")
## 10-Fold Misclassification Error: 0.3625711

10-Fold CV gives us a faster way to test the model, and it’s more common in practice. Again, we compare the accuracy and misclassification rate to see how well our model is doing.
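
To put the two estimates side by side, we can stack them in a small data frame (a sketch using the objects created above):

data.frame(method   = c("LOOCV", "10-fold CV"),
           accuracy = round(c(loo_acc, cv10_acc), 3),
           error    = round(c(loo_err, cv10_err), 3))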

Now we can move on to simplifying the model and seeing which features actually impact salary level the most.

g. Model Selection

So yeah, the full model was cool and all, but let’s be real — throwing in every single variable doesn’t always make for the best model. Some predictors barely do anything and might just add noise. Time to simplify.

Our goal here is to find a smaller model with the lowest test error, while still predicting salary level accurately.

Stepwise Regression

We used stepwise regression (specifically backward selection) to cut out the weak predictors and keep only the ones that actually matter.

This approach starts with all variables, then removes the least useful ones one at a time, based on some criterion (like AIC or a drop in cross-validated accuracy).

Code Used (caret + multinom baseline)

# control for 10-fold CV (or pick your fav)
ctrl <- trainControl(method = "cv", number = 10)

# baseline: the full model refit with 10-fold CV (the backward pruning itself is sketched below)
step_model <- train(SalaryLevel ~ . - Salary,
                    data = Hitters,
                    method = "multinom",
                    trControl = ctrl,
                    trace = FALSE,
                    preProcess = NULL)
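
For completeness, the backward-selection step itself could be run along these lines. This is only a sketch, not what produced the numbers above: MASS::stepAIC starts from the full multinom fit from section e and keeps dropping the predictor whose removal improves AIC, until nothing helps anymore.

# backward selection on the multinom fit (sketch); which variables survive depends on the data
library(MASS)
step_fit <- stepAIC(model, direction = "backward", trace = FALSE)
formula(step_fit)   # the reduced set of predictors kept by the AIC criterion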

Final Model Summary (Our Pick)

After testing a few different versions, we picked the model that kept the cross-validated error basically as low as the full model while using fewer predictors.

Basically, it did the same job as the full model, just without the extra baggage. The results made sense too — the key predictors like Years and Errors were still doing most of the work, which matched what we saw earlier.

Final Conclusion

This project was basically a full crash course in the life cycle of a predictive modeling task. We didn’t just plug stuff into a model — we had to prep, clean, analyze, model, evaluate, and interpret, all step by step.

Thanks to our professors Nina Deliu, Michela Cosa and Simone Cuonzo for making this project actually fun to do (we said what we said). Learned a lot, struggled a little (lie), but we made it through and now we can flex some real R stats skills.

We <3 R (most of the time).