Introduction

This analysis investigates whether salary depends differently on player performance across the four baseball divisions (AE, AW, NE, NW) in the Hitters dataset. We use LARS (Least Angle Regression) with the Cp criterion for variable selection and compare predictions within and between divisions.

Questions:

1. Is there a difference in salary distributions between the 4 divisions?

2. Is there a difference in how salary depends on performance between the 4 divisions?

3. Does LARS or LM provide better predictions?

# Load required libraries for analysis
library(ISLR)   # Contains the Hitters dataset
library(lars)   # Implements LARS algorithm for variable selection
## Loaded lars 1.3
# Load the Hitters dataset
data(Hitters)

Question 1: Is There a Difference in Salary Distributions Between Divisions?

# Split the Hitters dataset into four divisions based on League and Division
# Filter for American League East (AE), American League West (AW),
# National League East (NE), and National League West (NW)
# Also remove players with missing salary values

HittersAEclean <- subset(Hitters, League=="A" & Division=="E" & !is.na(Salary))
HittersAWclean <- subset(Hitters, League=="A" & Division=="W" & !is.na(Salary))
HittersNEclean <- subset(Hitters, League=="N" & Division=="E" & !is.na(Salary))
HittersNWclean <- subset(Hitters, League=="N" & Division=="W" & !is.na(Salary))

# Display sample sizes for each division
cat("Sample sizes by division:\n")
## Sample sizes by division:
cat("AE:", nrow(HittersAEclean), "players\n")
## AE: 68 players
cat("AW:", nrow(HittersAWclean), "players\n")
## AW: 71 players
cat("NE:", nrow(HittersNEclean), "players\n")
## NE: 61 players
cat("NW:", nrow(HittersNWclean), "players\n")
## NW: 63 players
# Create boxplot to visualize salary distributions across divisions
par(mfrow=c(1,1))
boxplot(HittersAEclean$Salary,
        HittersAWclean$Salary,
        HittersNEclean$Salary,
        HittersNWclean$Salary,
        names=c("AE","AW","NE","NW"),
        main="Salary Distributions by Division",
        ylab="Salary",
        col=c("lightblue", "lightgreen", "lightyellow", "lightpink"))

# Add mean markers
means <- c(mean(HittersAEclean$Salary), mean(HittersAWclean$Salary),
           mean(HittersNEclean$Salary), mean(HittersNWclean$Salary))
points(1:4, means, pch=18, col="red", cex=2)

INTERPRETATION - Question 1:

The boxplot reveals several key observations about salary distributions:

  1. Central Tendency Differences:

    • AE division shows the highest median salary, suggesting American League East teams pay more on average

    • NW division appears to have the lowest median salary

    • The red dots (means) are generally higher than medians, indicating right-skewed distributions (a few high earners)

  2. Spread (Variability) Differences:

    • AE division shows the widest spread (IQR), indicating greater salary inequality

    • Some divisions have more outliers (high-paid players) than others

    • This suggests different salary structures and negotiating dynamics across divisions

  3. Conclusion for Question 1:

    • YES, there are clear differences in salary distributions across the four divisions. This supports the hypothesis that division-specific factors (team budgets, market size, etc.) influence salary levels.

Data Preparation for LARS Analysis

# Create numeric predictor matrices for each division
# Remove columns: 20 (League - categorical), 19 (Division - categorical), 
# 14 (Name - identifier), 15 (Salary - our response variable)
# This leaves only numerical performance variables

X_AE <- as.matrix(HittersAEclean[,-c(20,19,14,15)])
X_AW <- as.matrix(HittersAWclean[,-c(20,19,14,15)])
X_NE <- as.matrix(HittersNEclean[,-c(20,19,14,15)])
X_NW <- as.matrix(HittersNWclean[,-c(20,19,14,15)])

# Display the predictor variables
cat("Predictor variables used:\n")
## Predictor variables used:
cat(paste(colnames(X_AE), collapse=", "), "\n")
## AtBat, Hits, HmRun, Runs, RBI, Walks, Years, CAtBat, CHits, CHmRun, CRuns, CRBI, CWalks, PutOuts, Assists, Errors
cat("\nNumber of predictors:", ncol(X_AE), "\n")
## 
## Number of predictors: 16

Question 2: Does Salary Depend Differently on Performance Across Divisions?

We use LARS with Cp criterion to select optimal predictors for each division. Cp measures model quality. Lower Cp indicates better model fit relative to complexity.

AE Division Analysis

# Fit LARS model for AE division
larsAE <- lars(X_AE, HittersAEclean$Salary)

# Plot Cp vs degrees of freedom to visualize model selection
plot(larsAE$df, larsAE$Cp, log="y", 
     main="AE Division: Cp vs Degrees of Freedom",
     xlab="df (Number of Variables)", ylab="Cp (log scale)")
     abline(h=1, col="red", lty=2)

# Find the step with minimum Cp 
I1_AE <- larsAE$Cp == min(larsAE$Cp)
betalarsAE <- larsAE$beta[I1_AE,]

cat("\n=== AE DIVISION LARS RESULTS ===\n")
## 
## === AE DIVISION LARS RESULTS ===
cat("Minimum Cp =", round(min(larsAE$Cp), 3), "\n")
## Minimum Cp = 11.908
cat("Step with minimum Cp:", which(I1_AE), "\n")
## Step with minimum Cp: 19
cat("Number of variables selected:", sum(I1_AE), "\n")
## Number of variables selected: 1
cat("\nLARS Coefficients:\n")
## 
## LARS Coefficients:
print(betalarsAE)
##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
## -3.3530002 10.0477043 -5.1870815  1.9252299  0.0000000  6.1607285  0.0000000 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks    PutOuts 
## -0.3771068  0.0000000  0.8418551  2.9480375  1.0145354 -1.1760056  0.3408382 
##    Assists     Errors 
##  0.1259118  0.0000000

INTERPRETATION - AE Division:

  • Minimum Cp = 11.91 at step 19, indicating the optimal model includes multiple predictors

  • Key predictors for AE: AtBat (-), Hits (+), HmRun (-), Runs (+), Walks (+), CRuns (+), CRBI (+), PutOuts (+), Assists (+)

  • Notable patterns: Career statistics (CRuns, CRBI) have positive coefficients, suggesting experience matters in AE

  • Negative coefficients: AtBat and HmRun are negative, possibly indicating efficiency matters more than raw numbers

AW Division Analysis

# Fit LARS model for AW division
larsAW <- lars(X_AW, HittersAWclean$Salary)

plot(larsAW$df, larsAW$Cp, log="y",
     main="AW Division: Cp vs Degrees of Freedom",
     xlab="df (Number of Variables)", ylab="Cp (log scale)")
abline(h=1, col="red", lty=2)

I1_AW <- larsAW$Cp == min(larsAW$Cp)
betalarsAW <- larsAW$beta[I1_AW,]

cat("\n=== AW DIVISION LARS RESULTS ===\n")
## 
## === AW DIVISION LARS RESULTS ===
cat("Minimum Cp =", round(min(larsAW$Cp), 3), "\n")
## Minimum Cp = 1.985
cat("Step with minimum Cp:", which(I1_AW), "\n")
## Step with minimum Cp: 5
cat("Number of variables selected:", sum(I1_AW), "\n")
## Number of variables selected: 1
cat("\nLARS Coefficients:\n")
## 
## LARS Coefficients:
print(betalarsAW)
##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
## 0.00000000 2.30243447 0.00000000 0.00000000 0.00000000 1.03768371 0.00000000 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks    PutOuts 
## 0.00000000 0.33434397 0.00000000 0.00000000 0.00000000 0.00000000 0.08998272 
##    Assists     Errors 
## 0.00000000 0.00000000

INTERPRETATION - AW Division:

  • Minimum Cp = 1.98 (closest to ideal Cp=1), indicating an excellent model fit

  • Key predictors for AW: Hits (+), Walks (+), CHits (+), PutOuts (+)

  • Notable patterns: AW has FEWER selected variables than AE, suggesting a simpler salary structure

  • Career hits (CHits) is important, indicating AW teams value consistent hitters

NE Division Analysis

# Fit LARS model for NE division
larsNE <- lars(X_NE, HittersNEclean$Salary)

plot(larsNE$df, larsNE$Cp, log="y",
     main="NE Division: Cp vs Degrees of Freedom",
     xlab="df (Number of Variables)", ylab="Cp (log scale)")
abline(h=1, col="red", lty=2)

I1_NE <- larsNE$Cp == min(larsNE$Cp)
betalarsNE <- larsNE$beta[I1_NE,]

cat("\n=== NE DIVISION LARS RESULTS ===\n")
## 
## === NE DIVISION LARS RESULTS ===
cat("Minimum Cp =", round(min(larsNE$Cp), 3), "\n")
## Minimum Cp = 3.533
cat("Step with minimum Cp:", which(I1_NE), "\n")
## Step with minimum Cp: 9
cat("Number of variables selected:", sum(I1_NE), "\n")
## Number of variables selected: 1
cat("\nLARS Coefficients:\n")
## 
## LARS Coefficients:
print(betalarsNE)
##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
## -0.2481822  0.0000000  0.0000000  0.0000000  0.0000000  2.5323541  0.0000000 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks    PutOuts 
##  0.0000000  0.0000000  0.0000000  0.0000000  0.8941327  0.0000000  0.4312271 
##    Assists     Errors 
##  0.7716719 -8.4960887

INTERPRETATION - NE Division:

  • Minimum Cp = 3.53, indicating a good model fit

  • Key predictors for NE: AtBat (-), Walks (+), CRBI (+), PutOuts (+), Assists (+), Errors (-)

  • Notable patterns: Errors has a strong NEGATIVE coefficient (-8.50), suggesting defensive performance heavily impacts NE salaries

  • Assists is important, indicating value placed on defensive skills

NW Division Analysis

# Fit LARS model for NW division
larsNW <- lars(X_NW, HittersNWclean$Salary)

plot(larsNW$df, larsNW$Cp, log="y",
     main="NW Division: Cp vs Degrees of Freedom",
     xlab="df (Number of Variables)", ylab="Cp (log scale)")
abline(h=1, col="red", lty=2)

I1_NW <- larsNW$Cp == min(larsNW$Cp)
betalarsNW <- larsNW$beta[I1_NW,]

cat("\n=== NW DIVISION LARS RESULTS ===\n")
## 
## === NW DIVISION LARS RESULTS ===
cat("Minimum Cp =", round(min(larsNW$Cp), 3), "\n")
## Minimum Cp = 13.889
cat("Step with minimum Cp:", which(I1_NW), "\n")
## Step with minimum Cp: 17
cat("Number of variables selected:", sum(I1_NW), "\n")
## Number of variables selected: 1
cat("\nLARS Coefficients:\n")
## 
## LARS Coefficients:
print(betalarsNW)
##       AtBat        Hits       HmRun        Runs         RBI       Walks 
##   1.3713374   1.9961871  32.2358918  -9.7338778 -11.9613345   7.3627544 
##       Years      CAtBat       CHits      CHmRun       CRuns        CRBI 
##   7.7832856   0.0000000   0.4414019   0.4438961   0.0000000   0.0000000 
##      CWalks     PutOuts     Assists      Errors 
##  -0.9914020   0.0000000  -0.7387407   3.8378122

INTERPRETATION - NW Division:

  • Minimum Cp = 13.89, the highest among divisions, suggesting more complex salary determination

  • Key predictors for NW: AtBat (+), Hits (+), HmRun (+), Runs (-), RBI (-), Walks (+), Years (+), Errors (+)

  • Notable patterns: HmRun has a VERY high coefficient (+32.24), suggesting NW teams heavily reward home run hitters

  • Years is positive, indicating experience is valued in NW

    Summary of LARS-Selected Variables

# Extract variable names with non-zero coefficients for each division
vars_AE <- colnames(X_AE)[betalarsAE != 0]
vars_AW <- colnames(X_AW)[betalarsAW != 0]
vars_NE <- colnames(X_NE)[betalarsNE != 0]
vars_NW <- colnames(X_NW)[betalarsNW != 0]

cat("\n========================================\n")
## 
## ========================================
cat("   LARS-SELECTED VARIABLES BY DIVISION\n")
##    LARS-SELECTED VARIABLES BY DIVISION
cat("========================================\n")
## ========================================
cat("AE:", paste(vars_AE, collapse=", "), "\n")
## AE: AtBat, Hits, HmRun, Runs, Walks, CAtBat, CHmRun, CRuns, CRBI, CWalks, PutOuts, Assists
cat("AW:", paste(vars_AW, collapse=", "), "\n")
## AW: Hits, Walks, CHits, PutOuts
cat("NE:", paste(vars_NE, collapse=", "), "\n")
## NE: AtBat, Walks, CRBI, PutOuts, Assists, Errors
cat("NW:", paste(vars_NW, collapse=", "), "\n")
## NW: AtBat, Hits, HmRun, Runs, RBI, Walks, Years, CHits, CHmRun, CWalks, Assists, Errors
Division Key Variables Model Complexity (Cp)
AE CRuns, CRBI, Hits, Walks 11.91 (moderate)
AW Hits, Walks, CHits, PutOuts 1.98 (best fit)
NE CRBI, Walks, Assists, Errors 3.53 (good fit)
NW HmRun, Years, Walks, AtBat 13.89 (complex)

Key Finding: Each division has a DIFFERENT set of important predictors, supporting the hypothesis that salary depends differently on performance across divisions.

LARS Predictions: 16 Scatterplots and Correlations

# === AE Model Predictions ===

predAEAE <- X_AE %*% betalarsAE

predAEAE <- predAEAE + (larsAE$mu - mean(predAEAE))


predAEAW <- X_AW %*% betalarsAE

predAEAW <- predAEAW + (larsAE$mu - mean(predAEAE))


predAENE <- X_NE %*% betalarsAE

predAENE <- predAENE + (larsAE$mu - mean(predAEAE))


predAENW <- X_NW %*% betalarsAE

predAENW <- predAENW + (larsAE$mu - mean(predAEAE))


# === AW Model Predictions ===

predAWAE <- X_AE %*% betalarsAW

predAWAE <- predAWAE + (larsAW$mu - mean(predAWAE))


predAWAW <- X_AW %*% betalarsAW

predAWAW <- predAWAW + (larsAW$mu - mean(predAWAW))


predAWNE <- X_NE %*% betalarsAW

predAWNE <- predAWNE + (larsAW$mu - mean(predAWAW))


predAWNW <- X_NW %*% betalarsAW

predAWNW <- predAWNW + (larsAW$mu - mean(predAWAW))


# === NE Model Predictions ===

predNEAE <- X_AE %*% betalarsNE

predNEAE <- predNEAE + (larsNE$mu - mean(predNEAE))


predNEAW <- X_AW %*% betalarsNE

predNEAW <- predNEAW + (larsNE$mu - mean(predNEAE))


predNENE <- X_NE %*% betalarsNE

predNENE <- predNENE + (larsNE$mu - mean(predNENE))


predNENW <- X_NW %*% betalarsNE

predNENW <- predNENW + (larsNE$mu - mean(predNENW))


# === NW Model Predictions ===

predNWAE <- X_AE %*% betalarsNW

predNWAE <- predNWAE + (larsNW$mu - mean(predNWAE))


predNWAW <- X_AW %*% betalarsNW

predNWAW <- predNWAW + (larsNW$mu - mean(predNWAE))


predNWNE <- X_NE %*% betalarsNW

predNWNE <- predNWNE + (larsNW$mu - mean(predNWNE))


predNWNW <- X_NW %*% betalarsNW

predNWNW <- predNWNW + (larsNW$mu - mean(predNWNW))

############################################################
# LARS – 16 Scatterplots + 16 Correlations
############################################################

par(mfrow = c(4, 4))       # 4 rows x 4 columns = 16 plots
par(mar = c(3, 3, 2, 1))   # Smaller margins: bottom, left, top, right
par(oma = c(0, 0, 0, 0))   # No outer margins

# === Row 1: AE Model ===
plot(HittersAEclean$Salary, predAEAE, main="LARS AE→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAEAE, HittersAEclean$Salary)
##           [,1]
## [1,] 0.8890144
plot(HittersAWclean$Salary, predAEAW, main="LARS AE→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAEAW, HittersAWclean$Salary)
##           [,1]
## [1,] 0.5973853
plot(HittersNEclean$Salary, predAENE, main="LARS AE→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAENE, HittersNEclean$Salary)
##           [,1]
## [1,] 0.5493544
plot(HittersNWclean$Salary, predAENW, main="LARS AE→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAENW, HittersNWclean$Salary)
##          [,1]
## [1,] 0.506991
# === Row 2: AW Model ===
plot(HittersAEclean$Salary, predAWAE, main="LARS AW→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAWAE, HittersAEclean$Salary)
##           [,1]
## [1,] 0.7368015
plot(HittersAWclean$Salary, predAWAW, main="LARS AW→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAWAW, HittersAWclean$Salary)
##           [,1]
## [1,] 0.7849817
plot(HittersNEclean$Salary, predAWNE, main="LARS AW→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAWNE, HittersNEclean$Salary)
##           [,1]
## [1,] 0.5792761
plot(HittersNWclean$Salary, predAWNW, main="LARS AW→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAWNW, HittersNWclean$Salary)
##           [,1]
## [1,] 0.5721472
# === Row 3: NE Model ===
plot(HittersAEclean$Salary, predNEAE, main="LARS NE→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNEAE, HittersAEclean$Salary)
##           [,1]
## [1,] 0.6803984
plot(HittersAWclean$Salary, predNEAW, main="LARS NE→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNEAW, HittersAWclean$Salary)
##           [,1]
## [1,] 0.6390477
plot(HittersNEclean$Salary, predNENE, main="LARS NE→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNENE, HittersNEclean$Salary)
##           [,1]
## [1,] 0.6838278
plot(HittersNWclean$Salary, predNENW, main="LARS NE→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNENW, HittersNWclean$Salary)
##           [,1]
## [1,] 0.5272795
# === Row 4: NW Model ===
plot(HittersAEclean$Salary, predNWAE, main="LARS NW→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNWAE, HittersAEclean$Salary)
##           [,1]
## [1,] 0.4554133
plot(HittersAWclean$Salary, predNWAW, main="LARS NW→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNWAW, HittersAWclean$Salary)
##           [,1]
## [1,] 0.5636847
plot(HittersNEclean$Salary, predNWNE, main="LARS NW→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNWNE, HittersNEclean$Salary)
##           [,1]
## [1,] 0.1454497
plot(HittersNWclean$Salary, predNWNW, main="LARS NW→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")

cor(predNWNW, HittersNWclean$Salary)
##           [,1]
## [1,] 0.7672109

LARS Correlation Matrix

# Create correlation matrix for LARS predictions
corr_lars <- matrix(NA, 4, 4)
rownames(corr_lars) <- c("AE model","AW model","NE model","NW model")
colnames(corr_lars) <- c("AE data","AW data","NE data","NW data")

corr_lars[1,1] <- cor(predAEAE, HittersAEclean$Salary)
corr_lars[1,2] <- cor(predAEAW, HittersAWclean$Salary)
corr_lars[1,3] <- cor(predAENE, HittersNEclean$Salary)
corr_lars[1,4] <- cor(predAENW, HittersNWclean$Salary)

corr_lars[2,1] <- cor(predAWAE, HittersAEclean$Salary)
corr_lars[2,2] <- cor(predAWAW, HittersAWclean$Salary)
corr_lars[2,3] <- cor(predAWNE, HittersNEclean$Salary)
corr_lars[2,4] <- cor(predAWNW, HittersNWclean$Salary)

corr_lars[3,1] <- cor(predNEAE, HittersAEclean$Salary)
corr_lars[3,2] <- cor(predNEAW, HittersAWclean$Salary)
corr_lars[3,3] <- cor(predNENE, HittersNEclean$Salary)
corr_lars[3,4] <- cor(predNENW, HittersNWclean$Salary)

corr_lars[4,1] <- cor(predNWAE, HittersAEclean$Salary)
corr_lars[4,2] <- cor(predNWAW, HittersAWclean$Salary)
corr_lars[4,3] <- cor(predNWNE, HittersNEclean$Salary)
corr_lars[4,4] <- cor(predNWNW, HittersNWclean$Salary)

cat("=== LARS CORRELATION MATRIX ===\n")
## === LARS CORRELATION MATRIX ===
print(round(corr_lars, 3))
##          AE data AW data NE data NW data
## AE model   0.889   0.597   0.549   0.507
## AW model   0.737   0.785   0.579   0.572
## NE model   0.680   0.639   0.684   0.527
## NW model   0.455   0.564   0.145   0.767

INTERPRETATION - LARS Predictions:

Key Observations:

  1. Diagonal values (within-division) are highest for each model

  2. AE model predicts AE data best (r = 0.889)

  3. NW model has the weakest cross-division predictions (especially NW→NE = 0.145)

  4. This pattern supports division-specific salary structures

LM Models Using LARS-Selected Variables

Now we fit OLS linear regression models using exactly the same variables that LARS selected for each division.

# Fit LM models using LARS-selected variables
lmAE <- lm(Salary ~ ., data=HittersAEclean[,c("Salary",vars_AE)])
lmAW <- lm(Salary ~ ., data=HittersAWclean[,c("Salary",vars_AW)])
lmNE <- lm(Salary ~ ., data=HittersNEclean[,c("Salary",vars_NE)])
lmNW <- lm(Salary ~ ., data=HittersNWclean[,c("Salary",vars_NW)])

LM Predictions: 16 Scatterplots and Correlations

par(mfrow = c(4, 4))       # 4 rows x 4 columns = 16 plots
par(mar = c(3, 3, 2, 1))   # Smaller margins: bottom, left, top, right
par(oma = c(0, 0, 0, 0))   # No outer margins

# Row 1: AE model predictions

plot(HittersAEclean$Salary, predict(lmAE, HittersAEclean), main="LM AE→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAE, HittersAEclean), HittersAEclean$Salary)
## [1] 0.8913296
plot(HittersAWclean$Salary, predict(lmAE, HittersAWclean), main="LM AE→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAE, HittersAWclean), HittersAWclean$Salary)
## [1] 0.5538167
plot(HittersNEclean$Salary, predict(lmAE, HittersNEclean), main="LM AE→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAE, HittersNEclean), HittersNEclean$Salary)
## [1] 0.5340311
plot(HittersNWclean$Salary, predict(lmAE, HittersNWclean), main="LM AE→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAE, HittersNWclean), HittersNWclean$Salary)
## [1] 0.4638691
# Row 2: AW model predictions
plot(HittersAEclean$Salary, predict(lmAW, HittersAEclean), main="LM AW→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAW, HittersAEclean), HittersAEclean$Salary)
## [1] 0.7398849
plot(HittersAWclean$Salary, predict(lmAW, HittersAWclean), main="LM AW→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAW, HittersAWclean), HittersAWclean$Salary)
## [1] 0.7855885
plot(HittersNEclean$Salary, predict(lmAW, HittersNEclean), main="LM AW→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAW, HittersNEclean), HittersNEclean$Salary)
## [1] 0.5851391
plot(HittersNWclean$Salary, predict(lmAW, HittersNWclean), main="LM AW→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAW, HittersNWclean), HittersNWclean$Salary)
## [1] 0.5769111
# Row 3: NE model predictions
plot(HittersAEclean$Salary, predict(lmNE, HittersAEclean), main="LM NE→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNE, HittersAEclean), HittersAEclean$Salary)
## [1] 0.6557857
plot(HittersAWclean$Salary, predict(lmNE, HittersAWclean), main="LM NE→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNE, HittersAWclean), HittersAWclean$Salary)
## [1] 0.6008516
plot(HittersNEclean$Salary, predict(lmNE, HittersNEclean), main="LM NE→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNE, HittersNEclean), HittersNEclean$Salary)
## [1] 0.6882483
plot(HittersNWclean$Salary, predict(lmNE, HittersNWclean), main="LM NE→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNE, HittersNWclean), HittersNWclean$Salary)
## [1] 0.4926594
# Row 4: NW model predictions
plot(HittersAEclean$Salary, predict(lmNW, HittersAEclean), main="LM NW→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNW, HittersAEclean), HittersAEclean$Salary)
## [1] 0.368793
plot(HittersAWclean$Salary, predict(lmNW, HittersAWclean), main="LM NW→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNW, HittersAWclean), HittersAWclean$Salary)
## [1] 0.5028059
plot(HittersNEclean$Salary, predict(lmNW, HittersNEclean), main="LM NW→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNW, HittersNEclean), HittersNEclean$Salary)
## [1] 0.05893251
plot(HittersNWclean$Salary, predict(lmNW, HittersNWclean), main="LM NW→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")

cor(predict(lmNW, HittersNWclean), HittersNWclean$Salary)
## [1] 0.7743449
############################################################

# 11. Collect LM Correlations

############################################################


corr_lm <- matrix(NA, 4, 4)

rownames(corr_lm) <- c("AE model","AW model","NE model","NW model")

colnames(corr_lm) <- c("AE data","AW data","NE data","NW data")


# AE model

corr_lm[1,1] <- cor(predict(lmAE, HittersAEclean), HittersAEclean$Salary)

corr_lm[1,2] <- cor(predict(lmAE, HittersAWclean), HittersAWclean$Salary)

corr_lm[1,3] <- cor(predict(lmAE, HittersNEclean), HittersNEclean$Salary)

corr_lm[1,4] <- cor(predict(lmAE, HittersNWclean), HittersNWclean$Salary)


# AW model

corr_lm[2,1] <- cor(predict(lmAW, HittersAEclean), HittersAEclean$Salary)

corr_lm[2,2] <- cor(predict(lmAW, HittersAWclean), HittersAWclean$Salary)

corr_lm[2,3] <- cor(predict(lmAW, HittersNEclean), HittersNEclean$Salary)

corr_lm[2,4] <- cor(predict(lmAW, HittersNWclean), HittersNWclean$Salary)


# NE model

corr_lm[3,1] <- cor(predict(lmNE, HittersAEclean), HittersAEclean$Salary)

corr_lm[3,2] <- cor(predict(lmNE, HittersAWclean), HittersAWclean$Salary)

corr_lm[3,3] <- cor(predict(lmNE, HittersNEclean), HittersNEclean$Salary)

corr_lm[3,4] <- cor(predict(lmNE, HittersNWclean), HittersNWclean$Salary)


# NW model

corr_lm[4,1] <- cor(predict(lmNW, HittersAEclean), HittersAEclean$Salary)

corr_lm[4,2] <- cor(predict(lmNW, HittersAWclean), HittersAWclean$Salary)

corr_lm[4,3] <- cor(predict(lmNW, HittersNEclean), HittersNEclean$Salary)

corr_lm[4,4] <- cor(predict(lmNW, HittersNWclean), HittersNWclean$Salary)

print(round(corr_lm, 3))
##          AE data AW data NE data NW data
## AE model   0.891   0.554   0.534   0.464
## AW model   0.740   0.786   0.585   0.577
## NE model   0.656   0.601   0.688   0.493
## NW model   0.369   0.503   0.059   0.774

Interpretation - LM Predictions

1. Within-Division Prediction (Diagonal)

The diagonal values represent how well each LM model predicts salaries within its own division:

Division Correlation Interpretation
AE Model → AE Data 0.891 Excellent fit (89.1% of variance explained)
AW Model → AW Data 0.786 Good fit (78.6% of variance explained)
NE Model → NE Data 0.688 Moderate-good fit (68.8% of variance explained)
NW Model → NW Data 0.774 Good fit (77.4% of variance explained)

Key Observations:

  • AE division has the highest within-division correlation (0.891), indicating the LM model fits AE salary data very well

  • NE division has the lowest within-division correlation (0.688), suggesting salary in NE is harder to predict with the selected variables

  • All diagonal values are positive and substantial, confirming that the LARS-selected variables have predictive power

2. Between-Division Prediction (Off-Diagonal)

The off-diagonal values show how well each LM model generalizes to other divisions:

Notable Patterns:

Prediction Correlation Interpretation
AE → AW 0.554 Moderate transferability
AE → NE 0.534 Moderate transferability
AE → NW 0.464 Weak-moderate transferability
AW → AE 0.740 Good transferability
AW → NE 0.585 Moderate transferability
AW → NW 0.577 Moderate transferability
NE → AE 0.656 Moderate-good transferability
NE → AW 0.601 Moderate transferability
NE → NW 0.493 Weak-moderate transferability
NW → AE 0.369 Weak transferability
NW → AW 0.503 Weak-moderate transferability
NW → NE 0.059 Very weak (almost no transferability)

Critical Finding - NW Division:

  • The NW model predicts NE data very poorly (r = 0.059), almost no relationship

  • This suggests the salary structure in NW is fundamentally different from NE

  • The NW model also has weak predictions for AE (r = 0.369)

Comparison: LARS vs LM

# Calculate difference matrix (LARS - LM)
diff_matrix <- corr_lars - corr_lm

cat("=== DIFFERENCE MATRIX (LARS - LM) ===\n")
## === DIFFERENCE MATRIX (LARS - LM) ===
cat("Positive values = LARS better, Negative values = LM better\n\n")
## Positive values = LARS better, Negative values = LM better
print(round(diff_matrix, 3))
##          AE data AW data NE data NW data
## AE model  -0.002   0.044   0.015   0.043
## AW model  -0.003  -0.001  -0.006  -0.005
## NE model   0.025   0.038  -0.004   0.035
## NW model   0.087   0.061   0.087  -0.007
# Summary statistics
cat("\n========================================\n")
## 
## ========================================
cat("   SUMMARY: LARS vs LM COMPARISON\n")
##    SUMMARY: LARS vs LM COMPARISON
cat("========================================\n")
## ========================================
# Within-division (diagonal)
cat("\n--- WITHIN-DIVISION CORRELATIONS (Diagonal) ---\n")
## 
## --- WITHIN-DIVISION CORRELATIONS (Diagonal) ---
cat("LARS diagonal mean:", round(mean(diag(corr_lars)), 3), "\n")
## LARS diagonal mean: 0.781
cat("LM diagonal mean:", round(mean(diag(corr_lm)), 3), "\n")
## LM diagonal mean: 0.785
cat("Difference (LARS - LM):", round(mean(diag(corr_lars)) - mean(diag(corr_lm)), 3), "\n")
## Difference (LARS - LM): -0.004
# Between-division (off-diagonal)
off_diag_lars <- corr_lars[lower.tri(corr_lars) | upper.tri(corr_lars)]
off_diag_lm <- corr_lm[lower.tri(corr_lm) | upper.tri(corr_lm)]

cat("\n--- BETWEEN-DIVISION CORRELATIONS (Off-Diagonal) ---\n")
## 
## --- BETWEEN-DIVISION CORRELATIONS (Off-Diagonal) ---
cat("LARS off-diagonal mean:", round(mean(off_diag_lars), 3), "\n")
## LARS off-diagonal mean: 0.546
cat("LM off-diagonal mean:", round(mean(off_diag_lm), 3), "\n")
## LM off-diagonal mean: 0.511
cat("Difference (LARS - LM):", round(mean(off_diag_lars) - mean(off_diag_lm), 3), "\n")
## Difference (LARS - LM): 0.035
# Overall
cat("\n--- OVERALL CORRELATIONS ---\n")
## 
## --- OVERALL CORRELATIONS ---
cat("LARS overall mean:", round(mean(corr_lars), 3), "\n")
## LARS overall mean: 0.605
cat("LM overall mean:", round(mean(corr_lm), 3), "\n")
## LM overall mean: 0.58
cat("Difference (LARS - LM):", round(mean(corr_lars) - mean(corr_lm), 3), "\n")
## Difference (LARS - LM): 0.025

INTERPRETATION - LARS vs LM Comparison:

Metric LARS LM Winner
Within-division mean 0.781 0.785 LM (slightly)
Between-division mean 0.546 0.511 LARS (better generalization)
Overall mean 0.605 0.580 LARS

Key Findings:

  1. Within-division prediction: LM performs slightly better (0.785 vs 0.781)

  2. Between-division prediction: LARS performs better (0.546 vs 0.511)

  3. Overall: LARS has higher mean correlation, suggesting better generalization (Answer to Question 3)

Coefficient Analysis

cat("\n========================================\n")
## 
## ========================================
cat("   LARS MODEL COEFFICIENTS (Min Cp)\n")
##    LARS MODEL COEFFICIENTS (Min Cp)
cat("========================================\n")
## ========================================
cat("\n--- AE Division ---\n")
## 
## --- AE Division ---
print(betalarsAE)
##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
## -3.3530002 10.0477043 -5.1870815  1.9252299  0.0000000  6.1607285  0.0000000 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks    PutOuts 
## -0.3771068  0.0000000  0.8418551  2.9480375  1.0145354 -1.1760056  0.3408382 
##    Assists     Errors 
##  0.1259118  0.0000000
cat("\n--- AW Division ---\n")
## 
## --- AW Division ---
print(betalarsAW)
##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
## 0.00000000 2.30243447 0.00000000 0.00000000 0.00000000 1.03768371 0.00000000 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks    PutOuts 
## 0.00000000 0.33434397 0.00000000 0.00000000 0.00000000 0.00000000 0.08998272 
##    Assists     Errors 
## 0.00000000 0.00000000
cat("\n--- NE Division ---\n")
## 
## --- NE Division ---
print(betalarsNE)
##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
## -0.2481822  0.0000000  0.0000000  0.0000000  0.0000000  2.5323541  0.0000000 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks    PutOuts 
##  0.0000000  0.0000000  0.0000000  0.0000000  0.8941327  0.0000000  0.4312271 
##    Assists     Errors 
##  0.7716719 -8.4960887
cat("\n--- NW Division ---\n")
## 
## --- NW Division ---
print(betalarsNW)
##       AtBat        Hits       HmRun        Runs         RBI       Walks 
##   1.3713374   1.9961871  32.2358918  -9.7338778 -11.9613345   7.3627544 
##       Years      CAtBat       CHits      CHmRun       CRuns        CRBI 
##   7.7832856   0.0000000   0.4414019   0.4438961   0.0000000   0.0000000 
##      CWalks     PutOuts     Assists      Errors 
##  -0.9914020   0.0000000  -0.7387407   3.8378122
cat("\n========================================\n")
## 
## ========================================
cat("   LM MODEL COEFFICIENTS\n")
##    LM MODEL COEFFICIENTS
cat("========================================\n")
## ========================================
cat("\n--- AE Division ---\n")
## 
## --- AE Division ---
print(coef(lmAE))
##  (Intercept)        AtBat         Hits        HmRun         Runs        Walks 
## 333.91615317  -3.77014051  10.47129162  -4.36880711   1.89356499   6.28731152 
##       CAtBat       CHmRun        CRuns         CRBI       CWalks      PutOuts 
##  -0.50437506  -0.08326206   3.66894235   1.52228576  -1.33810508   0.38200461 
##      Assists 
##   0.25849109
cat("\n--- AW Division ---\n")
## 
## --- AW Division ---
print(coef(lmAW))
##  (Intercept)         Hits        Walks        CHits      PutOuts 
## -151.5020812    2.4162522    1.2512816    0.3519476    0.1312208
cat("\n--- NE Division ---\n")
## 
## --- NE Division ---
print(coef(lmNE))
## (Intercept)       AtBat       Walks        CRBI     PutOuts     Assists 
## 192.4466171  -0.5836321   3.9198287   0.9232909   0.5015935   1.0687232 
##      Errors 
## -12.1454651
cat("\n--- NW Division ---\n")
## 
## --- NW Division ---
print(coef(lmNW))
## (Intercept)       AtBat        Hits       HmRun        Runs         RBI 
##  87.8200644   1.1613866   4.7544874  44.5274429 -14.1316674 -16.8348882 
##       Walks       Years       CHits      CHmRun      CWalks     Assists 
##   9.7791908   7.7745907   0.5717372   0.4689956  -1.3725120  -0.8329296 
##      Errors 
##   5.1266500

INTERPRETATION - Coefficient Comparison:

Variable AE AW NE NW Pattern
Hits +10.0 +2.3 0 +2.0 Important in AE, AW, NW
Walks +6.2 +1.0 +2.5 +7.4 Important in ALL divisions
CRBI +1.0 0 +0.9 0 Career RBIs matter in AE, NE
PutOuts +0.3 +0.1 +0.4 0 Defensive value varies
Errors 0 0 -8.5 +3.8 Opposite effects in NE vs NW
HmRun -5.2 0 0 +32.2 Very different in NW

Key Finding: The SAME variable can have OPPOSITE effects across divisions:

  • Errors: Negative in NE (-8.5), Positive in NW (+3.8)

  • HmRun: Negative in AE (-5.2), Very positive in NW (+32.2)

This strongly supports the conclusion that salary depends DIFFERENTLY on performance across divisions.

The analysis conclusively shows that salary structures differ across baseball divisions. Each division values player statistics differently:

  • AE: Values career statistics (CRuns, CRBI)

  • AW: Simpler model, values hits and walks

  • NE: Penalizes errors heavily, values defensive skills

  • NW: Heavily rewards home runs and experience

This suggests division-specific factors such as team budgets, market sizes, managerial strategies, and league dynamics influence how player performance translates to salary.

Within-division correlations are consistently higher than between-division correlations.This suggests division-specific salary structures. LARS’s diagonal and off-diagonal means provides better prediction within divisions and which generalizes better between divisions. Differences in selected predictors and coefficient patterns across divisions support the idea that salary depends on performance and career statistics differently across divisions.