Assignment 2 - Hitters Data Analysis

```{r setup, include=FALSE, echo=FALSE} knitr::opts_chunk$set(warning = FALSE, message = FALSE)

Introduction

This analysis investigates whether salary depends differently on player performance across the four baseball divisions (AE, AW, NE, NW) in the Hitters dataset. We use LARS (Least Angle Regression) with the Cp criterion for variable selection and compare predictions within and between divisions.

Questions:

1. Is there a difference in salary distributions between the 4 divisions?

2. Is there a difference in how salary depends on performance between the 4 divisions?

3. Does LARS or LM provide better predictions?

# Load required libraries for analysis
library(ISLR)   # Contains the Hitters dataset
library(lars)   # Implements LARS algorithm for variable selection

## Loaded lars 1.3

# Load the Hitters dataset
data(Hitters)

Question 1: Is There a Difference in Salary Distributions Between Divisions?

# Split the Hitters dataset into four divisions based on League and Division
# Filter for American League East (AE), American League West (AW),
# National League East (NE), and National League West (NW)
# Also remove players with missing salary values

HittersAEclean <- subset(Hitters, League=="A" & Division=="E" & !is.na(Salary))
HittersAWclean <- subset(Hitters, League=="A" & Division=="W" & !is.na(Salary))
HittersNEclean <- subset(Hitters, League=="N" & Division=="E" & !is.na(Salary))
HittersNWclean <- subset(Hitters, League=="N" & Division=="W" & !is.na(Salary))

# Display sample sizes for each division
cat("Sample sizes by division:\n")

## Sample sizes by division:

cat("AE:", nrow(HittersAEclean), "players\n")

## AE: 68 players

cat("AW:", nrow(HittersAWclean), "players\n")

## AW: 71 players

cat("NE:", nrow(HittersNEclean), "players\n")

## NE: 61 players

cat("NW:", nrow(HittersNWclean), "players\n")

## NW: 63 players

# Create boxplot to visualize salary distributions across divisions
par(mfrow=c(1,1))
boxplot(HittersAEclean$Salary,
        HittersAWclean$Salary,
        HittersNEclean$Salary,
        HittersNWclean$Salary,
        names=c("AE","AW","NE","NW"),
        main="Salary Distributions by Division",
        ylab="Salary",
        col=c("lightblue", "lightgreen", "lightyellow", "lightpink"))

# Add mean markers
means <- c(mean(HittersAEclean$Salary), mean(HittersAWclean$Salary),
           mean(HittersNEclean$Salary), mean(HittersNWclean$Salary))
points(1:4, means, pch=18, col="red", cex=2)

INTERPRETATION - Question 1:

The boxplot reveals several key observations about salary distributions:

Central Tendency Differences:
- AE division shows the highest median salary, suggesting American League East teams pay more on average
- NW division appears to have the lowest median salary
- The red dots (means) are generally higher than medians, indicating right-skewed distributions (a few high earners)
Spread (Variability) Differences:
- AE division shows the widest spread (IQR), indicating greater salary inequality
- Some divisions have more outliers (high-paid players) than others
- This suggests different salary structures and negotiating dynamics across divisions
Conclusion for Question 1:
- YES, there are clear differences in salary distributions across the four divisions. This supports the hypothesis that division-specific factors (team budgets, market size, etc.) influence salary levels.

Data Preparation for LARS Analysis

# Create numeric predictor matrices for each division
# Remove columns: 20 (League - categorical), 19 (Division - categorical), 
# 14 (Name - identifier), 15 (Salary - our response variable)
# This leaves only numerical performance variables

X_AE <- as.matrix(HittersAEclean[,-c(20,19,14,15)])
X_AW <- as.matrix(HittersAWclean[,-c(20,19,14,15)])
X_NE <- as.matrix(HittersNEclean[,-c(20,19,14,15)])
X_NW <- as.matrix(HittersNWclean[,-c(20,19,14,15)])

# Display the predictor variables
cat("Predictor variables used:\n")

## Predictor variables used:

cat(paste(colnames(X_AE), collapse=", "), "\n")

## AtBat, Hits, HmRun, Runs, RBI, Walks, Years, CAtBat, CHits, CHmRun, CRuns, CRBI, CWalks, PutOuts, Assists, Errors

cat("\nNumber of predictors:", ncol(X_AE), "\n")

## 
## Number of predictors: 16

Question 2: Does Salary Depend Differently on Performance Across Divisions?

We use LARS with Cp criterion to select optimal predictors for each division. Cp measures model quality. Lower Cp indicates better model fit relative to complexity.

AE Division Analysis

# Fit LARS model for AE division
larsAE <- lars(X_AE, HittersAEclean$Salary)

# Plot Cp vs degrees of freedom to visualize model selection
plot(larsAE$df, larsAE$Cp, log="y", 
     main="AE Division: Cp vs Degrees of Freedom",
     xlab="df (Number of Variables)", ylab="Cp (log scale)")
abline(h=1, col="red", lty=2, label="Cp=1 reference line")

## Warning in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): "label" is
## not a graphical parameter

# Find the step with minimum Cp 
I1_AE <- larsAE$Cp == min(larsAE$Cp)
betalarsAE <- larsAE$beta[I1_AE,]

cat("\n=== AE DIVISION LARS RESULTS ===\n")

## 
## === AE DIVISION LARS RESULTS ===

cat("Minimum Cp =", round(min(larsAE$Cp), 3), "\n")

## Minimum Cp = 11.908

cat("Step with minimum Cp:", which(I1_AE), "\n")

## Step with minimum Cp: 19

cat("Number of variables selected:", sum(I1_AE), "\n")

## Number of variables selected: 1

cat("\nLARS Coefficients:\n")

## 
## LARS Coefficients:

print(betalarsAE)

##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
## -3.3530002 10.0477043 -5.1870815  1.9252299  0.0000000  6.1607285  0.0000000 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks    PutOuts 
## -0.3771068  0.0000000  0.8418551  2.9480375  1.0145354 -1.1760056  0.3408382 
##    Assists     Errors 
##  0.1259118  0.0000000

INTERPRETATION - AE Division:

Minimum Cp = 11.91 at step 19, indicating the optimal model includes multiple predictors
Key predictors for AE: AtBat (-), Hits (+), HmRun (-), Runs (+), Walks (+), CRuns (+), CRBI (+), PutOuts (+), Assists (+)
Notable patterns: Career statistics (CRuns, CRBI) have positive coefficients, suggesting experience matters in AE
Negative coefficients: AtBat and HmRun are negative, possibly indicating efficiency matters more than raw numbers

AW Division Analysis

# Fit LARS model for AW division
larsAW <- lars(X_AW, HittersAWclean$Salary)

plot(larsAW$df, larsAW$Cp, log="y",
     main="AW Division: Cp vs Degrees of Freedom",
     xlab="df (Number of Variables)", ylab="Cp (log scale)")
abline(h=1, col="red", lty=2)

I1_AW <- larsAW$Cp == min(larsAW$Cp)
betalarsAW <- larsAW$beta[I1_AW,]

cat("\n=== AW DIVISION LARS RESULTS ===\n")

## 
## === AW DIVISION LARS RESULTS ===

cat("Minimum Cp =", round(min(larsAW$Cp), 3), "\n")

## Minimum Cp = 1.985

cat("Step with minimum Cp:", which(I1_AW), "\n")

## Step with minimum Cp: 5

cat("Number of variables selected:", sum(I1_AW), "\n")

## Number of variables selected: 1

cat("\nLARS Coefficients:\n")

## 
## LARS Coefficients:

print(betalarsAW)

##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
## 0.00000000 2.30243447 0.00000000 0.00000000 0.00000000 1.03768371 0.00000000 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks    PutOuts 
## 0.00000000 0.33434397 0.00000000 0.00000000 0.00000000 0.00000000 0.08998272 
##    Assists     Errors 
## 0.00000000 0.00000000

INTERPRETATION - AW Division:

Minimum Cp = 1.98 (closest to ideal Cp=1), indicating an excellent model fit
Key predictors for AW: Hits (+), Walks (+), CHits (+), PutOuts (+)
Notable patterns: AW has FEWER selected variables than AE, suggesting a simpler salary structure
Career hits (CHits) is important, indicating AW teams value consistent hitters

NE Division Analysis

# Fit LARS model for NE division
larsNE <- lars(X_NE, HittersNEclean$Salary)

plot(larsNE$df, larsNE$Cp, log="y",
     main="NE Division: Cp vs Degrees of Freedom",
     xlab="df (Number of Variables)", ylab="Cp (log scale)")
abline(h=1, col="red", lty=2)

I1_NE <- larsNE$Cp == min(larsNE$Cp)
betalarsNE <- larsNE$beta[I1_NE,]

cat("\n=== NE DIVISION LARS RESULTS ===\n")

## 
## === NE DIVISION LARS RESULTS ===

cat("Minimum Cp =", round(min(larsNE$Cp), 3), "\n")

## Minimum Cp = 3.533

cat("Step with minimum Cp:", which(I1_NE), "\n")

## Step with minimum Cp: 9

cat("Number of variables selected:", sum(I1_NE), "\n")

## Number of variables selected: 1

cat("\nLARS Coefficients:\n")

## 
## LARS Coefficients:

print(betalarsNE)

##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
## -0.2481822  0.0000000  0.0000000  0.0000000  0.0000000  2.5323541  0.0000000 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks    PutOuts 
##  0.0000000  0.0000000  0.0000000  0.0000000  0.8941327  0.0000000  0.4312271 
##    Assists     Errors 
##  0.7716719 -8.4960887

INTERPRETATION - NE Division:

Minimum Cp = 3.53, indicating a good model fit
Key predictors for NE: AtBat (-), Walks (+), CRBI (+), PutOuts (+), Assists (+), Errors (-)
Notable patterns: Errors has a strong NEGATIVE coefficient (-8.50), suggesting defensive performance heavily impacts NE salaries
Assists is important, indicating value placed on defensive skills

NW Division Analysis

# Fit LARS model for NW division
larsNW <- lars(X_NW, HittersNWclean$Salary)

plot(larsNW$df, larsNW$Cp, log="y",
     main="NW Division: Cp vs Degrees of Freedom",
     xlab="df (Number of Variables)", ylab="Cp (log scale)")
abline(h=1, col="red", lty=2)

I1_NW <- larsNW$Cp == min(larsNW$Cp)
betalarsNW <- larsNW$beta[I1_NW,]

cat("\n=== NW DIVISION LARS RESULTS ===\n")

## 
## === NW DIVISION LARS RESULTS ===

cat("Minimum Cp =", round(min(larsNW$Cp), 3), "\n")

## Minimum Cp = 13.889

cat("Step with minimum Cp:", which(I1_NW), "\n")

## Step with minimum Cp: 17

cat("Number of variables selected:", sum(I1_NW), "\n")

## Number of variables selected: 1

cat("\nLARS Coefficients:\n")

## 
## LARS Coefficients:

print(betalarsNW)

##       AtBat        Hits       HmRun        Runs         RBI       Walks 
##   1.3713374   1.9961871  32.2358918  -9.7338778 -11.9613345   7.3627544 
##       Years      CAtBat       CHits      CHmRun       CRuns        CRBI 
##   7.7832856   0.0000000   0.4414019   0.4438961   0.0000000   0.0000000 
##      CWalks     PutOuts     Assists      Errors 
##  -0.9914020   0.0000000  -0.7387407   3.8378122

INTERPRETATION - NW Division:

Minimum Cp = 13.89, the highest among divisions, suggesting more complex salary determination
Key predictors for NW: AtBat (+), Hits (+), HmRun (+), Runs (-), RBI (-), Walks (+), Years (+), Errors (+)
Notable patterns: HmRun has a VERY high coefficient (+32.24), suggesting NW teams heavily reward home run hitters
Years is positive, indicating experience is valued in NW

Summary of LARS-Selected Variables

# Extract variable names with non-zero coefficients for each division
vars_AE <- colnames(X_AE)[betalarsAE != 0]
vars_AW <- colnames(X_AW)[betalarsAW != 0]
vars_NE <- colnames(X_NE)[betalarsNE != 0]
vars_NW <- colnames(X_NW)[betalarsNW != 0]

cat("\n========================================\n")

## 
## ========================================

cat("   LARS-SELECTED VARIABLES BY DIVISION\n")

##    LARS-SELECTED VARIABLES BY DIVISION

cat("========================================\n")

## ========================================

cat("AE:", paste(vars_AE, collapse=", "), "\n")

## AE: AtBat, Hits, HmRun, Runs, Walks, CAtBat, CHmRun, CRuns, CRBI, CWalks, PutOuts, Assists

cat("AW:", paste(vars_AW, collapse=", "), "\n")

## AW: Hits, Walks, CHits, PutOuts

cat("NE:", paste(vars_NE, collapse=", "), "\n")

## NE: AtBat, Walks, CRBI, PutOuts, Assists, Errors

cat("NW:", paste(vars_NW, collapse=", "), "\n")

## NW: AtBat, Hits, HmRun, Runs, RBI, Walks, Years, CHits, CHmRun, CWalks, Assists, Errors

Division	Key Variables	Model Complexity (Cp)
AE	CRuns, CRBI, Hits, Walks	11.91 (moderate)
AW	Hits, Walks, CHits, PutOuts	1.98 (best fit)
NE	CRBI, Walks, Assists, Errors	3.53 (good fit)
NW	HmRun, Years, Walks, AtBat	13.89 (complex)

Key Finding: Each division has a DIFFERENT set of important predictors, supporting the hypothesis that salary depends differently on performance across divisions.

LARS Predictions: 16 Scatterplots and Correlations

# === AE Model Predictions ===

predAEAE <- X_AE %*% betalarsAE

predAEAE <- predAEAE + (larsAE$mu - mean(predAEAE))


predAEAW <- X_AW %*% betalarsAE

predAEAW <- predAEAW + (larsAE$mu - mean(predAEAE))


predAENE <- X_NE %*% betalarsAE

predAENE <- predAENE + (larsAE$mu - mean(predAEAE))


predAENW <- X_NW %*% betalarsAE

predAENW <- predAENW + (larsAE$mu - mean(predAEAE))


# === AW Model Predictions ===

predAWAE <- X_AE %*% betalarsAW

predAWAE <- predAWAE + (larsAW$mu - mean(predAWAE))


predAWAW <- X_AW %*% betalarsAW

predAWAW <- predAWAW + (larsAW$mu - mean(predAWAW))


predAWNE <- X_NE %*% betalarsAW

predAWNE <- predAWNE + (larsAW$mu - mean(predAWAW))


predAWNW <- X_NW %*% betalarsAW

predAWNW <- predAWNW + (larsAW$mu - mean(predAWAW))


# === NE Model Predictions ===

predNEAE <- X_AE %*% betalarsNE

predNEAE <- predNEAE + (larsNE$mu - mean(predNEAE))


predNEAW <- X_AW %*% betalarsNE

predNEAW <- predNEAW + (larsNE$mu - mean(predNEAE))


predNENE <- X_NE %*% betalarsNE

predNENE <- predNENE + (larsNE$mu - mean(predNENE))


predNENW <- X_NW %*% betalarsNE

predNENW <- predNENW + (larsNE$mu - mean(predNENW))


# === NW Model Predictions ===

predNWAE <- X_AE %*% betalarsNW

predNWAE <- predNWAE + (larsNW$mu - mean(predNWAE))


predNWAW <- X_AW %*% betalarsNW

predNWAW <- predNWAW + (larsNW$mu - mean(predNWAE))


predNWNE <- X_NE %*% betalarsNW

predNWNE <- predNWNE + (larsNW$mu - mean(predNWNE))


predNWNW <- X_NW %*% betalarsNW

predNWNW <- predNWNW + (larsNW$mu - mean(predNWNW))

############################################################
# LARS – 16 Scatterplots + 16 Correlations
############################################################

par(mfrow = c(4, 4))       # 4 rows x 4 columns = 16 plots
par(mar = c(3, 3, 2, 1))   # Smaller margins: bottom, left, top, right
par(oma = c(0, 0, 0, 0))   # No outer margins

# === Row 1: AE Model ===
plot(HittersAEclean$Salary, predAEAE, main="LARS AE→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAEAE, HittersAEclean$Salary)

##           [,1]
## [1,] 0.8890144

plot(HittersAWclean$Salary, predAEAW, main="LARS AE→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAEAW, HittersAWclean$Salary)

##           [,1]
## [1,] 0.5973853

plot(HittersNEclean$Salary, predAENE, main="LARS AE→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAENE, HittersNEclean$Salary)

##           [,1]
## [1,] 0.5493544

plot(HittersNWclean$Salary, predAENW, main="LARS AE→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAENW, HittersNWclean$Salary)

##          [,1]
## [1,] 0.506991

# === Row 2: AW Model ===
plot(HittersAEclean$Salary, predAWAE, main="LARS AW→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAWAE, HittersAEclean$Salary)

##           [,1]
## [1,] 0.7368015

plot(HittersAWclean$Salary, predAWAW, main="LARS AW→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAWAW, HittersAWclean$Salary)

##           [,1]
## [1,] 0.7849817

plot(HittersNEclean$Salary, predAWNE, main="LARS AW→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAWNE, HittersNEclean$Salary)

##           [,1]
## [1,] 0.5792761

plot(HittersNWclean$Salary, predAWNW, main="LARS AW→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAWNW, HittersNWclean$Salary)

##           [,1]
## [1,] 0.5721472

# === Row 3: NE Model ===
plot(HittersAEclean$Salary, predNEAE, main="LARS NE→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNEAE, HittersAEclean$Salary)

##           [,1]
## [1,] 0.6803984

plot(HittersAWclean$Salary, predNEAW, main="LARS NE→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNEAW, HittersAWclean$Salary)

##           [,1]
## [1,] 0.6390477

plot(HittersNEclean$Salary, predNENE, main="LARS NE→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNENE, HittersNEclean$Salary)

##           [,1]
## [1,] 0.6838278

plot(HittersNWclean$Salary, predNENW, main="LARS NE→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNENW, HittersNWclean$Salary)

##           [,1]
## [1,] 0.5272795

# === Row 4: NW Model ===
plot(HittersAEclean$Salary, predNWAE, main="LARS NW→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNWAE, HittersAEclean$Salary)

##           [,1]
## [1,] 0.4554133

plot(HittersAWclean$Salary, predNWAW, main="LARS NW→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNWAW, HittersAWclean$Salary)

##           [,1]
## [1,] 0.5636847

plot(HittersNEclean$Salary, predNWNE, main="LARS NW→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNWNE, HittersNEclean$Salary)

##           [,1]
## [1,] 0.1454497

plot(HittersNWclean$Salary, predNWNW, main="LARS NW→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")

cor(predNWNW, HittersNWclean$Salary)

##           [,1]
## [1,] 0.7672109

LARS Correlation Matrix

# Create correlation matrix for LARS predictions
corr_lars <- matrix(NA, 4, 4)
rownames(corr_lars) <- c("AE model","AW model","NE model","NW model")
colnames(corr_lars) <- c("AE data","AW data","NE data","NW data")

corr_lars[1,1] <- cor(predAEAE, HittersAEclean$Salary)
corr_lars[1,2] <- cor(predAEAW, HittersAWclean$Salary)
corr_lars[1,3] <- cor(predAENE, HittersNEclean$Salary)
corr_lars[1,4] <- cor(predAENW, HittersNWclean$Salary)

corr_lars[2,1] <- cor(predAWAE, HittersAEclean$Salary)
corr_lars[2,2] <- cor(predAWAW, HittersAWclean$Salary)
corr_lars[2,3] <- cor(predAWNE, HittersNEclean$Salary)
corr_lars[2,4] <- cor(predAWNW, HittersNWclean$Salary)

corr_lars[3,1] <- cor(predNEAE, HittersAEclean$Salary)
corr_lars[3,2] <- cor(predNEAW, HittersAWclean$Salary)
corr_lars[3,3] <- cor(predNENE, HittersNEclean$Salary)
corr_lars[3,4] <- cor(predNENW, HittersNWclean$Salary)

corr_lars[4,1] <- cor(predNWAE, HittersAEclean$Salary)
corr_lars[4,2] <- cor(predNWAW, HittersAWclean$Salary)
corr_lars[4,3] <- cor(predNWNE, HittersNEclean$Salary)
corr_lars[4,4] <- cor(predNWNW, HittersNWclean$Salary)

cat("=== LARS CORRELATION MATRIX ===\n")

## === LARS CORRELATION MATRIX ===

print(round(corr_lars, 3))

##          AE data AW data NE data NW data
## AE model   0.889   0.597   0.549   0.507
## AW model   0.737   0.785   0.579   0.572
## NE model   0.680   0.639   0.684   0.527
## NW model   0.455   0.564   0.145   0.767

INTERPRETATION - LARS Predictions:

Key Observations:

Diagonal values (within-division) are highest for each model
AE model predicts AE data best (r = 0.889)
NW model has the weakest cross-division predictions (especially NW→NE = 0.145)
This pattern supports division-specific salary structures

LM Models Using LARS-Selected Variables

Now we fit OLS linear regression models using exactly the same variables that LARS selected for each division.

# Fit LM models using LARS-selected variables
lmAE <- lm(Salary ~ ., data=HittersAEclean[,c("Salary",vars_AE)])
lmAW <- lm(Salary ~ ., data=HittersAWclean[,c("Salary",vars_AW)])
lmNE <- lm(Salary ~ ., data=HittersNEclean[,c("Salary",vars_NE)])
lmNW <- lm(Salary ~ ., data=HittersNWclean[,c("Salary",vars_NW)])

LM Predictions: 16 Scatterplots and Correlations

par(mfrow = c(4, 4))       # 4 rows x 4 columns = 16 plots
par(mar = c(3, 3, 2, 1))   # Smaller margins: bottom, left, top, right
par(oma = c(0, 0, 0, 0))   # No outer margins

# Row 1: AE model predictions

plot(HittersAEclean$Salary, predict(lmAE, HittersAEclean), main="LM AE→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAE, HittersAEclean), HittersAEclean$Salary)

## [1] 0.8913296

plot(HittersAWclean$Salary, predict(lmAE, HittersAWclean), main="LM AE→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAE, HittersAWclean), HittersAWclean$Salary)

## [1] 0.5538167

plot(HittersNEclean$Salary, predict(lmAE, HittersNEclean), main="LM AE→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAE, HittersNEclean), HittersNEclean$Salary)

## [1] 0.5340311

plot(HittersNWclean$Salary, predict(lmAE, HittersNWclean), main="LM AE→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAE, HittersNWclean), HittersNWclean$Salary)

## [1] 0.4638691

# Row 2: AW model predictions
plot(HittersAEclean$Salary, predict(lmAW, HittersAEclean), main="LM AW→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAW, HittersAEclean), HittersAEclean$Salary)

## [1] 0.7398849

plot(HittersAWclean$Salary, predict(lmAW, HittersAWclean), main="LM AW→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAW, HittersAWclean), HittersAWclean$Salary)

## [1] 0.7855885

plot(HittersNEclean$Salary, predict(lmAW, HittersNEclean), main="LM AW→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAW, HittersNEclean), HittersNEclean$Salary)

## [1] 0.5851391

plot(HittersNWclean$Salary, predict(lmAW, HittersNWclean), main="LM AW→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAW, HittersNWclean), HittersNWclean$Salary)

## [1] 0.5769111

# Row 3: NE model predictions
plot(HittersAEclean$Salary, predict(lmNE, HittersAEclean), main="LM NE→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNE, HittersAEclean), HittersAEclean$Salary)

## [1] 0.6557857

plot(HittersAWclean$Salary, predict(lmNE, HittersAWclean), main="LM NE→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNE, HittersAWclean), HittersAWclean$Salary)

## [1] 0.6008516

plot(HittersNEclean$Salary, predict(lmNE, HittersNEclean), main="LM NE→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNE, HittersNEclean), HittersNEclean$Salary)

## [1] 0.6882483

plot(HittersNWclean$Salary, predict(lmNE, HittersNWclean), main="LM NE→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNE, HittersNWclean), HittersNWclean$Salary)

## [1] 0.4926594

# Row 4: NW model predictions
plot(HittersAEclean$Salary, predict(lmNW, HittersAEclean), main="LM NW→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNW, HittersAEclean), HittersAEclean$Salary)

## [1] 0.368793

plot(HittersAWclean$Salary, predict(lmNW, HittersAWclean), main="LM NW→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNW, HittersAWclean), HittersAWclean$Salary)

## [1] 0.5028059

plot(HittersNEclean$Salary, predict(lmNW, HittersNEclean), main="LM NW→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNW, HittersNEclean), HittersNEclean$Salary)

## [1] 0.05893251

plot(HittersNWclean$Salary, predict(lmNW, HittersNWclean), main="LM NW→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")

cor(predict(lmNW, HittersNWclean), HittersNWclean$Salary)

## [1] 0.7743449

############################################################

# 11. Collect LM Correlations

############################################################


corr_lm <- matrix(NA, 4, 4)

rownames(corr_lm) <- c("AE model","AW model","NE model","NW model")

colnames(corr_lm) <- c("AE data","AW data","NE data","NW data")


# AE model

corr_lm[1,1] <- cor(predict(lmAE, HittersAEclean), HittersAEclean$Salary)

corr_lm[1,2] <- cor(predict(lmAE, HittersAWclean), HittersAWclean$Salary)

corr_lm[1,3] <- cor(predict(lmAE, HittersNEclean), HittersNEclean$Salary)

corr_lm[1,4] <- cor(predict(lmAE, HittersNWclean), HittersNWclean$Salary)


# AW model

corr_lm[2,1] <- cor(predict(lmAW, HittersAEclean), HittersAEclean$Salary)

corr_lm[2,2] <- cor(predict(lmAW, HittersAWclean), HittersAWclean$Salary)

corr_lm[2,3] <- cor(predict(lmAW, HittersNEclean), HittersNEclean$Salary)

corr_lm[2,4] <- cor(predict(lmAW, HittersNWclean), HittersNWclean$Salary)


# NE model

corr_lm[3,1] <- cor(predict(lmNE, HittersAEclean), HittersAEclean$Salary)

corr_lm[3,2] <- cor(predict(lmNE, HittersAWclean), HittersAWclean$Salary)

corr_lm[3,3] <- cor(predict(lmNE, HittersNEclean), HittersNEclean$Salary)

corr_lm[3,4] <- cor(predict(lmNE, HittersNWclean), HittersNWclean$Salary)


# NW model

corr_lm[4,1] <- cor(predict(lmNW, HittersAEclean), HittersAEclean$Salary)

corr_lm[4,2] <- cor(predict(lmNW, HittersAWclean), HittersAWclean$Salary)

corr_lm[4,3] <- cor(predict(lmNW, HittersNEclean), HittersNEclean$Salary)

corr_lm[4,4] <- cor(predict(lmNW, HittersNWclean), HittersNWclean$Salary)

print(round(corr_lm, 3))

##          AE data AW data NE data NW data
## AE model   0.891   0.554   0.534   0.464
## AW model   0.740   0.786   0.585   0.577
## NE model   0.656   0.601   0.688   0.493
## NW model   0.369   0.503   0.059   0.774

Interpretation - LM Predictions

1. Within-Division Prediction (Diagonal)

The diagonal values represent how well each LM model predicts salaries within its own division:

Division	Correlation	Interpretation
AE Model → AE Data	0.891	Excellent fit (89.1% of variance explained)
AW Model → AW Data	0.786	Good fit (78.6% of variance explained)
NE Model → NE Data	0.688	Moderate-good fit (68.8% of variance explained)
NW Model → NW Data	0.774	Good fit (77.4% of variance explained)

Key Observations:

AE division has the highest within-division correlation (0.891), indicating the LM model fits AE salary data very well
NE division has the lowest within-division correlation (0.688), suggesting salary in NE is harder to predict with the selected variables
All diagonal values are positive and substantial, confirming that the LARS-selected variables have predictive power

2. Between-Division Prediction (Off-Diagonal)

The off-diagonal values show how well each LM model generalizes to other divisions:

Notable Patterns:

Prediction	Correlation	Interpretation
AE → AW	0.554	Moderate transferability
AE → NE	0.534	Moderate transferability
AE → NW	0.464	Weak-moderate transferability
AW → AE	0.740	Good transferability
AW → NE	0.585	Moderate transferability
AW → NW	0.577	Moderate transferability
NE → AE	0.656	Moderate-good transferability
NE → AW	0.601	Moderate transferability
NE → NW	0.493	Weak-moderate transferability
NW → AE	0.369	Weak transferability
NW → AW	0.503	Weak-moderate transferability
NW → NE	0.059	Very weak (almost no transferability)

Critical Finding - NW Division:

The NW model predicts NE data very poorly (r = 0.059), almost no relationship
This suggests the salary structure in NW is fundamentally different from NE
The NW model also has weak predictions for AE (r = 0.369)

Comparison: LARS vs LM

# Calculate difference matrix (LARS - LM)
diff_matrix <- corr_lars - corr_lm

cat("=== DIFFERENCE MATRIX (LARS - LM) ===\n")

## === DIFFERENCE MATRIX (LARS - LM) ===

cat("Positive values = LARS better, Negative values = LM better\n\n")

## Positive values = LARS better, Negative values = LM better

print(round(diff_matrix, 3))

##          AE data AW data NE data NW data
## AE model  -0.002   0.044   0.015   0.043
## AW model  -0.003  -0.001  -0.006  -0.005
## NE model   0.025   0.038  -0.004   0.035
## NW model   0.087   0.061   0.087  -0.007

# Summary statistics
cat("\n========================================\n")

## 
## ========================================

cat("   SUMMARY: LARS vs LM COMPARISON\n")

##    SUMMARY: LARS vs LM COMPARISON

cat("========================================\n")

## ========================================

# Within-division (diagonal)
cat("\n--- WITHIN-DIVISION CORRELATIONS (Diagonal) ---\n")

## 
## --- WITHIN-DIVISION CORRELATIONS (Diagonal) ---

cat("LARS diagonal mean:", round(mean(diag(corr_lars)), 3), "\n")

## LARS diagonal mean: 0.781

cat("LM diagonal mean:", round(mean(diag(corr_lm)), 3), "\n")

## LM diagonal mean: 0.785

cat("Difference (LARS - LM):", round(mean(diag(corr_lars)) - mean(diag(corr_lm)), 3), "\n")

## Difference (LARS - LM): -0.004

# Between-division (off-diagonal)
off_diag_lars <- corr_lars[lower.tri(corr_lars) | upper.tri(corr_lars)]
off_diag_lm <- corr_lm[lower.tri(corr_lm) | upper.tri(corr_lm)]

cat("\n--- BETWEEN-DIVISION CORRELATIONS (Off-Diagonal) ---\n")

## 
## --- BETWEEN-DIVISION CORRELATIONS (Off-Diagonal) ---

cat("LARS off-diagonal mean:", round(mean(off_diag_lars), 3), "\n")

## LARS off-diagonal mean: 0.546

cat("LM off-diagonal mean:", round(mean(off_diag_lm), 3), "\n")

## LM off-diagonal mean: 0.511

cat("Difference (LARS - LM):", round(mean(off_diag_lars) - mean(off_diag_lm), 3), "\n")

## Difference (LARS - LM): 0.035

# Overall
cat("\n--- OVERALL CORRELATIONS ---\n")

## 
## --- OVERALL CORRELATIONS ---

cat("LARS overall mean:", round(mean(corr_lars), 3), "\n")

## LARS overall mean: 0.605

cat("LM overall mean:", round(mean(corr_lm), 3), "\n")

## LM overall mean: 0.58

cat("Difference (LARS - LM):", round(mean(corr_lars) - mean(corr_lm), 3), "\n")

## Difference (LARS - LM): 0.025

INTERPRETATION - LARS vs LM Comparison:

Metric	LARS	LM	Winner
Within-division mean	0.781	0.785	LM (slightly)
Between-division mean	0.546	0.511	LARS (better generalization)
Overall mean	0.605	0.580	LARS

Key Findings:

Within-division prediction: LM performs slightly better (0.785 vs 0.781)
Between-division prediction: LARS performs better (0.546 vs 0.511)
Overall: LARS has higher mean correlation, suggesting better generalization

Coefficient Analysis

cat("\n========================================\n")

## 
## ========================================

cat("   LARS MODEL COEFFICIENTS (Min Cp)\n")

##    LARS MODEL COEFFICIENTS (Min Cp)

cat("========================================\n")

## ========================================

cat("\n--- AE Division ---\n")

## 
## --- AE Division ---

print(betalarsAE)

##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
## -3.3530002 10.0477043 -5.1870815  1.9252299  0.0000000  6.1607285  0.0000000 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks    PutOuts 
## -0.3771068  0.0000000  0.8418551  2.9480375  1.0145354 -1.1760056  0.3408382 
##    Assists     Errors 
##  0.1259118  0.0000000

cat("\n--- AW Division ---\n")

## 
## --- AW Division ---

print(betalarsAW)

##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
## 0.00000000 2.30243447 0.00000000 0.00000000 0.00000000 1.03768371 0.00000000 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks    PutOuts 
## 0.00000000 0.33434397 0.00000000 0.00000000 0.00000000 0.00000000 0.08998272 
##    Assists     Errors 
## 0.00000000 0.00000000

cat("\n--- NE Division ---\n")

## 
## --- NE Division ---

print(betalarsNE)

##      AtBat       Hits      HmRun       Runs        RBI      Walks      Years 
## -0.2481822  0.0000000  0.0000000  0.0000000  0.0000000  2.5323541  0.0000000 
##     CAtBat      CHits     CHmRun      CRuns       CRBI     CWalks    PutOuts 
##  0.0000000  0.0000000  0.0000000  0.0000000  0.8941327  0.0000000  0.4312271 
##    Assists     Errors 
##  0.7716719 -8.4960887

cat("\n--- NW Division ---\n")

## 
## --- NW Division ---

print(betalarsNW)

##       AtBat        Hits       HmRun        Runs         RBI       Walks 
##   1.3713374   1.9961871  32.2358918  -9.7338778 -11.9613345   7.3627544 
##       Years      CAtBat       CHits      CHmRun       CRuns        CRBI 
##   7.7832856   0.0000000   0.4414019   0.4438961   0.0000000   0.0000000 
##      CWalks     PutOuts     Assists      Errors 
##  -0.9914020   0.0000000  -0.7387407   3.8378122

cat("\n========================================\n")

## 
## ========================================

cat("   LM MODEL COEFFICIENTS\n")

##    LM MODEL COEFFICIENTS

cat("========================================\n")

## ========================================

cat("\n--- AE Division ---\n")

## 
## --- AE Division ---

print(coef(lmAE))

##  (Intercept)        AtBat         Hits        HmRun         Runs        Walks 
## 333.91615317  -3.77014051  10.47129162  -4.36880711   1.89356499   6.28731152 
##       CAtBat       CHmRun        CRuns         CRBI       CWalks      PutOuts 
##  -0.50437506  -0.08326206   3.66894235   1.52228576  -1.33810508   0.38200461 
##      Assists 
##   0.25849109

cat("\n--- AW Division ---\n")

## 
## --- AW Division ---

print(coef(lmAW))

##  (Intercept)         Hits        Walks        CHits      PutOuts 
## -151.5020812    2.4162522    1.2512816    0.3519476    0.1312208

cat("\n--- NE Division ---\n")

## 
## --- NE Division ---

print(coef(lmNE))

## (Intercept)       AtBat       Walks        CRBI     PutOuts     Assists 
## 192.4466171  -0.5836321   3.9198287   0.9232909   0.5015935   1.0687232 
##      Errors 
## -12.1454651

cat("\n--- NW Division ---\n")

## 
## --- NW Division ---

print(coef(lmNW))

## (Intercept)       AtBat        Hits       HmRun        Runs         RBI 
##  87.8200644   1.1613866   4.7544874  44.5274429 -14.1316674 -16.8348882 
##       Walks       Years       CHits      CHmRun      CWalks     Assists 
##   9.7791908   7.7745907   0.5717372   0.4689956  -1.3725120  -0.8329296 
##      Errors 
##   5.1266500

INTERPRETATION - Coefficient Comparison:

Variable	AE	AW	NE	NW	Pattern
Hits	+10.0	+2.3	0	+2.0	Important in AE, AW, NW
Walks	+6.2	+1.0	+2.5	+7.4	Important in ALL divisions
CRBI	+1.0	0	+0.9	0	Career RBIs matter in AE, NE
PutOuts	+0.3	+0.1	+0.4	0	Defensive value varies
Errors	0	0	-8.5	+3.8	Opposite effects in NE vs NW
HmRun	-5.2	0	0	+32.2	Very different in NW

Key Finding: The SAME variable can have OPPOSITE effects across divisions:

Errors: Negative in NE (-8.5), Positive in NW (+3.8)
HmRun: Negative in AE (-5.2), Very positive in NW (+32.2)

This strongly supports the conclusion that salary depends DIFFERENTLY on performance across divisions.

The analysis conclusively shows that salary structures differ across baseball divisions. Each division values player statistics differently:

AE: Values career statistics (CRuns, CRBI)
AW: Simpler model, values hits and walks
NE: Penalizes errors heavily, values defensive skills
NW: Heavily rewards home runs and experience

This suggests division-specific factors such as team budgets, market sizes, managerial strategies, and league dynamics influence how player performance translates to salary.

Within-division correlations are consistently higher than between-division correlations, it suggests division-specific salary structures. Comparing LARS vs LM diagonal and off-diagonal means indicates which method provides better prediction within divisions and which generalizes better between divisions. Differences in selected predictors and coefficient patterns across divisions support the idea that salary depends on performance and career statistics differently across divisions.