```{r setup, include=FALSE, echo=FALSE} knitr::opts_chunk$set(warning = FALSE, message = FALSE)
This analysis investigates whether salary depends differently on player performance across the four baseball divisions (AE, AW, NE, NW) in the Hitters dataset. We use LARS (Least Angle Regression) with the Cp criterion for variable selection and compare predictions within and between divisions.
Questions:
1. Is there a difference in salary distributions between the 4 divisions?
2. Is there a difference in how salary depends on performance between the 4 divisions?
3. Does LARS or LM provide better predictions?
# Load required libraries for analysis
library(ISLR) # Contains the Hitters dataset
library(lars) # Implements LARS algorithm for variable selection
## Loaded lars 1.3
# Load the Hitters dataset
data(Hitters)
# Split the Hitters dataset into four divisions based on League and Division
# Filter for American League East (AE), American League West (AW),
# National League East (NE), and National League West (NW)
# Also remove players with missing salary values
HittersAEclean <- subset(Hitters, League=="A" & Division=="E" & !is.na(Salary))
HittersAWclean <- subset(Hitters, League=="A" & Division=="W" & !is.na(Salary))
HittersNEclean <- subset(Hitters, League=="N" & Division=="E" & !is.na(Salary))
HittersNWclean <- subset(Hitters, League=="N" & Division=="W" & !is.na(Salary))
# Display sample sizes for each division
cat("Sample sizes by division:\n")
## Sample sizes by division:
cat("AE:", nrow(HittersAEclean), "players\n")
## AE: 68 players
cat("AW:", nrow(HittersAWclean), "players\n")
## AW: 71 players
cat("NE:", nrow(HittersNEclean), "players\n")
## NE: 61 players
cat("NW:", nrow(HittersNWclean), "players\n")
## NW: 63 players
# Create boxplot to visualize salary distributions across divisions
par(mfrow=c(1,1))
boxplot(HittersAEclean$Salary,
HittersAWclean$Salary,
HittersNEclean$Salary,
HittersNWclean$Salary,
names=c("AE","AW","NE","NW"),
main="Salary Distributions by Division",
ylab="Salary",
col=c("lightblue", "lightgreen", "lightyellow", "lightpink"))
# Add mean markers
means <- c(mean(HittersAEclean$Salary), mean(HittersAWclean$Salary),
mean(HittersNEclean$Salary), mean(HittersNWclean$Salary))
points(1:4, means, pch=18, col="red", cex=2)
INTERPRETATION - Question 1:
The boxplot reveals several key observations about salary distributions:
Central Tendency Differences:
AE division shows the highest median salary, suggesting American League East teams pay more on average
NW division appears to have the lowest median salary
The red dots (means) are generally higher than medians, indicating right-skewed distributions (a few high earners)
Spread (Variability) Differences:
AE division shows the widest spread (IQR), indicating greater salary inequality
Some divisions have more outliers (high-paid players) than others
This suggests different salary structures and negotiating dynamics across divisions
Conclusion for Question 1:
# Create numeric predictor matrices for each division
# Remove columns: 20 (League - categorical), 19 (Division - categorical),
# 14 (Name - identifier), 15 (Salary - our response variable)
# This leaves only numerical performance variables
X_AE <- as.matrix(HittersAEclean[,-c(20,19,14,15)])
X_AW <- as.matrix(HittersAWclean[,-c(20,19,14,15)])
X_NE <- as.matrix(HittersNEclean[,-c(20,19,14,15)])
X_NW <- as.matrix(HittersNWclean[,-c(20,19,14,15)])
# Display the predictor variables
cat("Predictor variables used:\n")
## Predictor variables used:
cat(paste(colnames(X_AE), collapse=", "), "\n")
## AtBat, Hits, HmRun, Runs, RBI, Walks, Years, CAtBat, CHits, CHmRun, CRuns, CRBI, CWalks, PutOuts, Assists, Errors
cat("\nNumber of predictors:", ncol(X_AE), "\n")
##
## Number of predictors: 16
We use LARS with Cp criterion to select optimal predictors for each division. Cp measures model quality. Lower Cp indicates better model fit relative to complexity.
# Fit LARS model for AE division
larsAE <- lars(X_AE, HittersAEclean$Salary)
# Plot Cp vs degrees of freedom to visualize model selection
plot(larsAE$df, larsAE$Cp, log="y",
main="AE Division: Cp vs Degrees of Freedom",
xlab="df (Number of Variables)", ylab="Cp (log scale)")
abline(h=1, col="red", lty=2, label="Cp=1 reference line")
## Warning in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): "label" is
## not a graphical parameter
# Find the step with minimum Cp
I1_AE <- larsAE$Cp == min(larsAE$Cp)
betalarsAE <- larsAE$beta[I1_AE,]
cat("\n=== AE DIVISION LARS RESULTS ===\n")
##
## === AE DIVISION LARS RESULTS ===
cat("Minimum Cp =", round(min(larsAE$Cp), 3), "\n")
## Minimum Cp = 11.908
cat("Step with minimum Cp:", which(I1_AE), "\n")
## Step with minimum Cp: 19
cat("Number of variables selected:", sum(I1_AE), "\n")
## Number of variables selected: 1
cat("\nLARS Coefficients:\n")
##
## LARS Coefficients:
print(betalarsAE)
## AtBat Hits HmRun Runs RBI Walks Years
## -3.3530002 10.0477043 -5.1870815 1.9252299 0.0000000 6.1607285 0.0000000
## CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts
## -0.3771068 0.0000000 0.8418551 2.9480375 1.0145354 -1.1760056 0.3408382
## Assists Errors
## 0.1259118 0.0000000
INTERPRETATION - AE Division:
Minimum Cp = 11.91 at step 19, indicating the optimal model includes multiple predictors
Key predictors for AE: AtBat (-), Hits (+), HmRun (-), Runs (+), Walks (+), CRuns (+), CRBI (+), PutOuts (+), Assists (+)
Notable patterns: Career statistics (CRuns, CRBI) have positive coefficients, suggesting experience matters in AE
Negative coefficients: AtBat and HmRun are negative, possibly indicating efficiency matters more than raw numbers
# Fit LARS model for AW division
larsAW <- lars(X_AW, HittersAWclean$Salary)
plot(larsAW$df, larsAW$Cp, log="y",
main="AW Division: Cp vs Degrees of Freedom",
xlab="df (Number of Variables)", ylab="Cp (log scale)")
abline(h=1, col="red", lty=2)
I1_AW <- larsAW$Cp == min(larsAW$Cp)
betalarsAW <- larsAW$beta[I1_AW,]
cat("\n=== AW DIVISION LARS RESULTS ===\n")
##
## === AW DIVISION LARS RESULTS ===
cat("Minimum Cp =", round(min(larsAW$Cp), 3), "\n")
## Minimum Cp = 1.985
cat("Step with minimum Cp:", which(I1_AW), "\n")
## Step with minimum Cp: 5
cat("Number of variables selected:", sum(I1_AW), "\n")
## Number of variables selected: 1
cat("\nLARS Coefficients:\n")
##
## LARS Coefficients:
print(betalarsAW)
## AtBat Hits HmRun Runs RBI Walks Years
## 0.00000000 2.30243447 0.00000000 0.00000000 0.00000000 1.03768371 0.00000000
## CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts
## 0.00000000 0.33434397 0.00000000 0.00000000 0.00000000 0.00000000 0.08998272
## Assists Errors
## 0.00000000 0.00000000
INTERPRETATION - AW Division:
Minimum Cp = 1.98 (closest to ideal Cp=1), indicating an excellent model fit
Key predictors for AW: Hits (+), Walks (+), CHits (+), PutOuts (+)
Notable patterns: AW has FEWER selected variables than AE, suggesting a simpler salary structure
Career hits (CHits) is important, indicating AW teams value consistent hitters
# Fit LARS model for NE division
larsNE <- lars(X_NE, HittersNEclean$Salary)
plot(larsNE$df, larsNE$Cp, log="y",
main="NE Division: Cp vs Degrees of Freedom",
xlab="df (Number of Variables)", ylab="Cp (log scale)")
abline(h=1, col="red", lty=2)
I1_NE <- larsNE$Cp == min(larsNE$Cp)
betalarsNE <- larsNE$beta[I1_NE,]
cat("\n=== NE DIVISION LARS RESULTS ===\n")
##
## === NE DIVISION LARS RESULTS ===
cat("Minimum Cp =", round(min(larsNE$Cp), 3), "\n")
## Minimum Cp = 3.533
cat("Step with minimum Cp:", which(I1_NE), "\n")
## Step with minimum Cp: 9
cat("Number of variables selected:", sum(I1_NE), "\n")
## Number of variables selected: 1
cat("\nLARS Coefficients:\n")
##
## LARS Coefficients:
print(betalarsNE)
## AtBat Hits HmRun Runs RBI Walks Years
## -0.2481822 0.0000000 0.0000000 0.0000000 0.0000000 2.5323541 0.0000000
## CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts
## 0.0000000 0.0000000 0.0000000 0.0000000 0.8941327 0.0000000 0.4312271
## Assists Errors
## 0.7716719 -8.4960887
INTERPRETATION - NE Division:
Minimum Cp = 3.53, indicating a good model fit
Key predictors for NE: AtBat (-), Walks (+), CRBI (+), PutOuts (+), Assists (+), Errors (-)
Notable patterns: Errors has a strong NEGATIVE coefficient (-8.50), suggesting defensive performance heavily impacts NE salaries
Assists is important, indicating value placed on defensive skills
# Fit LARS model for NW division
larsNW <- lars(X_NW, HittersNWclean$Salary)
plot(larsNW$df, larsNW$Cp, log="y",
main="NW Division: Cp vs Degrees of Freedom",
xlab="df (Number of Variables)", ylab="Cp (log scale)")
abline(h=1, col="red", lty=2)
I1_NW <- larsNW$Cp == min(larsNW$Cp)
betalarsNW <- larsNW$beta[I1_NW,]
cat("\n=== NW DIVISION LARS RESULTS ===\n")
##
## === NW DIVISION LARS RESULTS ===
cat("Minimum Cp =", round(min(larsNW$Cp), 3), "\n")
## Minimum Cp = 13.889
cat("Step with minimum Cp:", which(I1_NW), "\n")
## Step with minimum Cp: 17
cat("Number of variables selected:", sum(I1_NW), "\n")
## Number of variables selected: 1
cat("\nLARS Coefficients:\n")
##
## LARS Coefficients:
print(betalarsNW)
## AtBat Hits HmRun Runs RBI Walks
## 1.3713374 1.9961871 32.2358918 -9.7338778 -11.9613345 7.3627544
## Years CAtBat CHits CHmRun CRuns CRBI
## 7.7832856 0.0000000 0.4414019 0.4438961 0.0000000 0.0000000
## CWalks PutOuts Assists Errors
## -0.9914020 0.0000000 -0.7387407 3.8378122
INTERPRETATION - NW Division:
Minimum Cp = 13.89, the highest among divisions, suggesting more complex salary determination
Key predictors for NW: AtBat (+), Hits (+), HmRun (+), Runs (-), RBI (-), Walks (+), Years (+), Errors (+)
Notable patterns: HmRun has a VERY high coefficient (+32.24), suggesting NW teams heavily reward home run hitters
Years is positive, indicating experience is valued in NW
# Extract variable names with non-zero coefficients for each division
vars_AE <- colnames(X_AE)[betalarsAE != 0]
vars_AW <- colnames(X_AW)[betalarsAW != 0]
vars_NE <- colnames(X_NE)[betalarsNE != 0]
vars_NW <- colnames(X_NW)[betalarsNW != 0]
cat("\n========================================\n")
##
## ========================================
cat(" LARS-SELECTED VARIABLES BY DIVISION\n")
## LARS-SELECTED VARIABLES BY DIVISION
cat("========================================\n")
## ========================================
cat("AE:", paste(vars_AE, collapse=", "), "\n")
## AE: AtBat, Hits, HmRun, Runs, Walks, CAtBat, CHmRun, CRuns, CRBI, CWalks, PutOuts, Assists
cat("AW:", paste(vars_AW, collapse=", "), "\n")
## AW: Hits, Walks, CHits, PutOuts
cat("NE:", paste(vars_NE, collapse=", "), "\n")
## NE: AtBat, Walks, CRBI, PutOuts, Assists, Errors
cat("NW:", paste(vars_NW, collapse=", "), "\n")
## NW: AtBat, Hits, HmRun, Runs, RBI, Walks, Years, CHits, CHmRun, CWalks, Assists, Errors
| Division | Key Variables | Model Complexity (Cp) |
|---|---|---|
| AE | CRuns, CRBI, Hits, Walks | 11.91 (moderate) |
| AW | Hits, Walks, CHits, PutOuts | 1.98 (best fit) |
| NE | CRBI, Walks, Assists, Errors | 3.53 (good fit) |
| NW | HmRun, Years, Walks, AtBat | 13.89 (complex) |
Key Finding: Each division has a DIFFERENT set of important predictors, supporting the hypothesis that salary depends differently on performance across divisions.
# === AE Model Predictions ===
predAEAE <- X_AE %*% betalarsAE
predAEAE <- predAEAE + (larsAE$mu - mean(predAEAE))
predAEAW <- X_AW %*% betalarsAE
predAEAW <- predAEAW + (larsAE$mu - mean(predAEAE))
predAENE <- X_NE %*% betalarsAE
predAENE <- predAENE + (larsAE$mu - mean(predAEAE))
predAENW <- X_NW %*% betalarsAE
predAENW <- predAENW + (larsAE$mu - mean(predAEAE))
# === AW Model Predictions ===
predAWAE <- X_AE %*% betalarsAW
predAWAE <- predAWAE + (larsAW$mu - mean(predAWAE))
predAWAW <- X_AW %*% betalarsAW
predAWAW <- predAWAW + (larsAW$mu - mean(predAWAW))
predAWNE <- X_NE %*% betalarsAW
predAWNE <- predAWNE + (larsAW$mu - mean(predAWAW))
predAWNW <- X_NW %*% betalarsAW
predAWNW <- predAWNW + (larsAW$mu - mean(predAWAW))
# === NE Model Predictions ===
predNEAE <- X_AE %*% betalarsNE
predNEAE <- predNEAE + (larsNE$mu - mean(predNEAE))
predNEAW <- X_AW %*% betalarsNE
predNEAW <- predNEAW + (larsNE$mu - mean(predNEAE))
predNENE <- X_NE %*% betalarsNE
predNENE <- predNENE + (larsNE$mu - mean(predNENE))
predNENW <- X_NW %*% betalarsNE
predNENW <- predNENW + (larsNE$mu - mean(predNENW))
# === NW Model Predictions ===
predNWAE <- X_AE %*% betalarsNW
predNWAE <- predNWAE + (larsNW$mu - mean(predNWAE))
predNWAW <- X_AW %*% betalarsNW
predNWAW <- predNWAW + (larsNW$mu - mean(predNWAE))
predNWNE <- X_NE %*% betalarsNW
predNWNE <- predNWNE + (larsNW$mu - mean(predNWNE))
predNWNW <- X_NW %*% betalarsNW
predNWNW <- predNWNW + (larsNW$mu - mean(predNWNW))
############################################################
# LARS – 16 Scatterplots + 16 Correlations
############################################################
par(mfrow = c(4, 4)) # 4 rows x 4 columns = 16 plots
par(mar = c(3, 3, 2, 1)) # Smaller margins: bottom, left, top, right
par(oma = c(0, 0, 0, 0)) # No outer margins
# === Row 1: AE Model ===
plot(HittersAEclean$Salary, predAEAE, main="LARS AE→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAEAE, HittersAEclean$Salary)
## [,1]
## [1,] 0.8890144
plot(HittersAWclean$Salary, predAEAW, main="LARS AE→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAEAW, HittersAWclean$Salary)
## [,1]
## [1,] 0.5973853
plot(HittersNEclean$Salary, predAENE, main="LARS AE→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAENE, HittersNEclean$Salary)
## [,1]
## [1,] 0.5493544
plot(HittersNWclean$Salary, predAENW, main="LARS AE→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAENW, HittersNWclean$Salary)
## [,1]
## [1,] 0.506991
# === Row 2: AW Model ===
plot(HittersAEclean$Salary, predAWAE, main="LARS AW→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAWAE, HittersAEclean$Salary)
## [,1]
## [1,] 0.7368015
plot(HittersAWclean$Salary, predAWAW, main="LARS AW→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAWAW, HittersAWclean$Salary)
## [,1]
## [1,] 0.7849817
plot(HittersNEclean$Salary, predAWNE, main="LARS AW→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAWNE, HittersNEclean$Salary)
## [,1]
## [1,] 0.5792761
plot(HittersNWclean$Salary, predAWNW, main="LARS AW→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predAWNW, HittersNWclean$Salary)
## [,1]
## [1,] 0.5721472
# === Row 3: NE Model ===
plot(HittersAEclean$Salary, predNEAE, main="LARS NE→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNEAE, HittersAEclean$Salary)
## [,1]
## [1,] 0.6803984
plot(HittersAWclean$Salary, predNEAW, main="LARS NE→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNEAW, HittersAWclean$Salary)
## [,1]
## [1,] 0.6390477
plot(HittersNEclean$Salary, predNENE, main="LARS NE→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNENE, HittersNEclean$Salary)
## [,1]
## [1,] 0.6838278
plot(HittersNWclean$Salary, predNENW, main="LARS NE→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNENW, HittersNWclean$Salary)
## [,1]
## [1,] 0.5272795
# === Row 4: NW Model ===
plot(HittersAEclean$Salary, predNWAE, main="LARS NW→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNWAE, HittersAEclean$Salary)
## [,1]
## [1,] 0.4554133
plot(HittersAWclean$Salary, predNWAW, main="LARS NW→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNWAW, HittersAWclean$Salary)
## [,1]
## [1,] 0.5636847
plot(HittersNEclean$Salary, predNWNE, main="LARS NW→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNWNE, HittersNEclean$Salary)
## [,1]
## [1,] 0.1454497
plot(HittersNWclean$Salary, predNWNW, main="LARS NW→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predNWNW, HittersNWclean$Salary)
## [,1]
## [1,] 0.7672109
# Create correlation matrix for LARS predictions
corr_lars <- matrix(NA, 4, 4)
rownames(corr_lars) <- c("AE model","AW model","NE model","NW model")
colnames(corr_lars) <- c("AE data","AW data","NE data","NW data")
corr_lars[1,1] <- cor(predAEAE, HittersAEclean$Salary)
corr_lars[1,2] <- cor(predAEAW, HittersAWclean$Salary)
corr_lars[1,3] <- cor(predAENE, HittersNEclean$Salary)
corr_lars[1,4] <- cor(predAENW, HittersNWclean$Salary)
corr_lars[2,1] <- cor(predAWAE, HittersAEclean$Salary)
corr_lars[2,2] <- cor(predAWAW, HittersAWclean$Salary)
corr_lars[2,3] <- cor(predAWNE, HittersNEclean$Salary)
corr_lars[2,4] <- cor(predAWNW, HittersNWclean$Salary)
corr_lars[3,1] <- cor(predNEAE, HittersAEclean$Salary)
corr_lars[3,2] <- cor(predNEAW, HittersAWclean$Salary)
corr_lars[3,3] <- cor(predNENE, HittersNEclean$Salary)
corr_lars[3,4] <- cor(predNENW, HittersNWclean$Salary)
corr_lars[4,1] <- cor(predNWAE, HittersAEclean$Salary)
corr_lars[4,2] <- cor(predNWAW, HittersAWclean$Salary)
corr_lars[4,3] <- cor(predNWNE, HittersNEclean$Salary)
corr_lars[4,4] <- cor(predNWNW, HittersNWclean$Salary)
cat("=== LARS CORRELATION MATRIX ===\n")
## === LARS CORRELATION MATRIX ===
print(round(corr_lars, 3))
## AE data AW data NE data NW data
## AE model 0.889 0.597 0.549 0.507
## AW model 0.737 0.785 0.579 0.572
## NE model 0.680 0.639 0.684 0.527
## NW model 0.455 0.564 0.145 0.767
INTERPRETATION - LARS Predictions:
Key Observations:
Diagonal values (within-division) are highest for each model
AE model predicts AE data best (r = 0.889)
NW model has the weakest cross-division predictions (especially NW→NE = 0.145)
This pattern supports division-specific salary structures
Now we fit OLS linear regression models using exactly the same variables that LARS selected for each division.
# Fit LM models using LARS-selected variables
lmAE <- lm(Salary ~ ., data=HittersAEclean[,c("Salary",vars_AE)])
lmAW <- lm(Salary ~ ., data=HittersAWclean[,c("Salary",vars_AW)])
lmNE <- lm(Salary ~ ., data=HittersNEclean[,c("Salary",vars_NE)])
lmNW <- lm(Salary ~ ., data=HittersNWclean[,c("Salary",vars_NW)])
par(mfrow = c(4, 4)) # 4 rows x 4 columns = 16 plots
par(mar = c(3, 3, 2, 1)) # Smaller margins: bottom, left, top, right
par(oma = c(0, 0, 0, 0)) # No outer margins
# Row 1: AE model predictions
plot(HittersAEclean$Salary, predict(lmAE, HittersAEclean), main="LM AE→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAE, HittersAEclean), HittersAEclean$Salary)
## [1] 0.8913296
plot(HittersAWclean$Salary, predict(lmAE, HittersAWclean), main="LM AE→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAE, HittersAWclean), HittersAWclean$Salary)
## [1] 0.5538167
plot(HittersNEclean$Salary, predict(lmAE, HittersNEclean), main="LM AE→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAE, HittersNEclean), HittersNEclean$Salary)
## [1] 0.5340311
plot(HittersNWclean$Salary, predict(lmAE, HittersNWclean), main="LM AE→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAE, HittersNWclean), HittersNWclean$Salary)
## [1] 0.4638691
# Row 2: AW model predictions
plot(HittersAEclean$Salary, predict(lmAW, HittersAEclean), main="LM AW→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAW, HittersAEclean), HittersAEclean$Salary)
## [1] 0.7398849
plot(HittersAWclean$Salary, predict(lmAW, HittersAWclean), main="LM AW→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAW, HittersAWclean), HittersAWclean$Salary)
## [1] 0.7855885
plot(HittersNEclean$Salary, predict(lmAW, HittersNEclean), main="LM AW→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAW, HittersNEclean), HittersNEclean$Salary)
## [1] 0.5851391
plot(HittersNWclean$Salary, predict(lmAW, HittersNWclean), main="LM AW→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmAW, HittersNWclean), HittersNWclean$Salary)
## [1] 0.5769111
# Row 3: NE model predictions
plot(HittersAEclean$Salary, predict(lmNE, HittersAEclean), main="LM NE→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNE, HittersAEclean), HittersAEclean$Salary)
## [1] 0.6557857
plot(HittersAWclean$Salary, predict(lmNE, HittersAWclean), main="LM NE→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNE, HittersAWclean), HittersAWclean$Salary)
## [1] 0.6008516
plot(HittersNEclean$Salary, predict(lmNE, HittersNEclean), main="LM NE→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNE, HittersNEclean), HittersNEclean$Salary)
## [1] 0.6882483
plot(HittersNWclean$Salary, predict(lmNE, HittersNWclean), main="LM NE→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNE, HittersNWclean), HittersNWclean$Salary)
## [1] 0.4926594
# Row 4: NW model predictions
plot(HittersAEclean$Salary, predict(lmNW, HittersAEclean), main="LM NW→AE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNW, HittersAEclean), HittersAEclean$Salary)
## [1] 0.368793
plot(HittersAWclean$Salary, predict(lmNW, HittersAWclean), main="LM NW→AW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNW, HittersAWclean), HittersAWclean$Salary)
## [1] 0.5028059
plot(HittersNEclean$Salary, predict(lmNW, HittersNEclean), main="LM NW→NE", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNW, HittersNEclean), HittersNEclean$Salary)
## [1] 0.05893251
plot(HittersNWclean$Salary, predict(lmNW, HittersNWclean), main="LM NW→NW", xlab="Actual", ylab="Predicted"); abline(0,1,col="red")
cor(predict(lmNW, HittersNWclean), HittersNWclean$Salary)
## [1] 0.7743449
############################################################
# 11. Collect LM Correlations
############################################################
corr_lm <- matrix(NA, 4, 4)
rownames(corr_lm) <- c("AE model","AW model","NE model","NW model")
colnames(corr_lm) <- c("AE data","AW data","NE data","NW data")
# AE model
corr_lm[1,1] <- cor(predict(lmAE, HittersAEclean), HittersAEclean$Salary)
corr_lm[1,2] <- cor(predict(lmAE, HittersAWclean), HittersAWclean$Salary)
corr_lm[1,3] <- cor(predict(lmAE, HittersNEclean), HittersNEclean$Salary)
corr_lm[1,4] <- cor(predict(lmAE, HittersNWclean), HittersNWclean$Salary)
# AW model
corr_lm[2,1] <- cor(predict(lmAW, HittersAEclean), HittersAEclean$Salary)
corr_lm[2,2] <- cor(predict(lmAW, HittersAWclean), HittersAWclean$Salary)
corr_lm[2,3] <- cor(predict(lmAW, HittersNEclean), HittersNEclean$Salary)
corr_lm[2,4] <- cor(predict(lmAW, HittersNWclean), HittersNWclean$Salary)
# NE model
corr_lm[3,1] <- cor(predict(lmNE, HittersAEclean), HittersAEclean$Salary)
corr_lm[3,2] <- cor(predict(lmNE, HittersAWclean), HittersAWclean$Salary)
corr_lm[3,3] <- cor(predict(lmNE, HittersNEclean), HittersNEclean$Salary)
corr_lm[3,4] <- cor(predict(lmNE, HittersNWclean), HittersNWclean$Salary)
# NW model
corr_lm[4,1] <- cor(predict(lmNW, HittersAEclean), HittersAEclean$Salary)
corr_lm[4,2] <- cor(predict(lmNW, HittersAWclean), HittersAWclean$Salary)
corr_lm[4,3] <- cor(predict(lmNW, HittersNEclean), HittersNEclean$Salary)
corr_lm[4,4] <- cor(predict(lmNW, HittersNWclean), HittersNWclean$Salary)
print(round(corr_lm, 3))
## AE data AW data NE data NW data
## AE model 0.891 0.554 0.534 0.464
## AW model 0.740 0.786 0.585 0.577
## NE model 0.656 0.601 0.688 0.493
## NW model 0.369 0.503 0.059 0.774
The diagonal values represent how well each LM model predicts salaries within its own division:
| Division | Correlation | Interpretation |
|---|---|---|
| AE Model → AE Data | 0.891 | Excellent fit (89.1% of variance explained) |
| AW Model → AW Data | 0.786 | Good fit (78.6% of variance explained) |
| NE Model → NE Data | 0.688 | Moderate-good fit (68.8% of variance explained) |
| NW Model → NW Data | 0.774 | Good fit (77.4% of variance explained) |
Key Observations:
AE division has the highest within-division correlation (0.891), indicating the LM model fits AE salary data very well
NE division has the lowest within-division correlation (0.688), suggesting salary in NE is harder to predict with the selected variables
All diagonal values are positive and substantial, confirming that the LARS-selected variables have predictive power
The off-diagonal values show how well each LM model generalizes to other divisions:
Notable Patterns:
| Prediction | Correlation | Interpretation |
|---|---|---|
| AE → AW | 0.554 | Moderate transferability |
| AE → NE | 0.534 | Moderate transferability |
| AE → NW | 0.464 | Weak-moderate transferability |
| AW → AE | 0.740 | Good transferability |
| AW → NE | 0.585 | Moderate transferability |
| AW → NW | 0.577 | Moderate transferability |
| NE → AE | 0.656 | Moderate-good transferability |
| NE → AW | 0.601 | Moderate transferability |
| NE → NW | 0.493 | Weak-moderate transferability |
| NW → AE | 0.369 | Weak transferability |
| NW → AW | 0.503 | Weak-moderate transferability |
| NW → NE | 0.059 | Very weak (almost no transferability) |
Critical Finding - NW Division:
The NW model predicts NE data very poorly (r = 0.059), almost no relationship
This suggests the salary structure in NW is fundamentally different from NE
The NW model also has weak predictions for AE (r = 0.369)
# Calculate difference matrix (LARS - LM)
diff_matrix <- corr_lars - corr_lm
cat("=== DIFFERENCE MATRIX (LARS - LM) ===\n")
## === DIFFERENCE MATRIX (LARS - LM) ===
cat("Positive values = LARS better, Negative values = LM better\n\n")
## Positive values = LARS better, Negative values = LM better
print(round(diff_matrix, 3))
## AE data AW data NE data NW data
## AE model -0.002 0.044 0.015 0.043
## AW model -0.003 -0.001 -0.006 -0.005
## NE model 0.025 0.038 -0.004 0.035
## NW model 0.087 0.061 0.087 -0.007
# Summary statistics
cat("\n========================================\n")
##
## ========================================
cat(" SUMMARY: LARS vs LM COMPARISON\n")
## SUMMARY: LARS vs LM COMPARISON
cat("========================================\n")
## ========================================
# Within-division (diagonal)
cat("\n--- WITHIN-DIVISION CORRELATIONS (Diagonal) ---\n")
##
## --- WITHIN-DIVISION CORRELATIONS (Diagonal) ---
cat("LARS diagonal mean:", round(mean(diag(corr_lars)), 3), "\n")
## LARS diagonal mean: 0.781
cat("LM diagonal mean:", round(mean(diag(corr_lm)), 3), "\n")
## LM diagonal mean: 0.785
cat("Difference (LARS - LM):", round(mean(diag(corr_lars)) - mean(diag(corr_lm)), 3), "\n")
## Difference (LARS - LM): -0.004
# Between-division (off-diagonal)
off_diag_lars <- corr_lars[lower.tri(corr_lars) | upper.tri(corr_lars)]
off_diag_lm <- corr_lm[lower.tri(corr_lm) | upper.tri(corr_lm)]
cat("\n--- BETWEEN-DIVISION CORRELATIONS (Off-Diagonal) ---\n")
##
## --- BETWEEN-DIVISION CORRELATIONS (Off-Diagonal) ---
cat("LARS off-diagonal mean:", round(mean(off_diag_lars), 3), "\n")
## LARS off-diagonal mean: 0.546
cat("LM off-diagonal mean:", round(mean(off_diag_lm), 3), "\n")
## LM off-diagonal mean: 0.511
cat("Difference (LARS - LM):", round(mean(off_diag_lars) - mean(off_diag_lm), 3), "\n")
## Difference (LARS - LM): 0.035
# Overall
cat("\n--- OVERALL CORRELATIONS ---\n")
##
## --- OVERALL CORRELATIONS ---
cat("LARS overall mean:", round(mean(corr_lars), 3), "\n")
## LARS overall mean: 0.605
cat("LM overall mean:", round(mean(corr_lm), 3), "\n")
## LM overall mean: 0.58
cat("Difference (LARS - LM):", round(mean(corr_lars) - mean(corr_lm), 3), "\n")
## Difference (LARS - LM): 0.025
INTERPRETATION - LARS vs LM Comparison:
| Metric | LARS | LM | Winner |
|---|---|---|---|
| Within-division mean | 0.781 | 0.785 | LM (slightly) |
| Between-division mean | 0.546 | 0.511 | LARS (better generalization) |
| Overall mean | 0.605 | 0.580 | LARS |
Key Findings:
Within-division prediction: LM performs slightly better (0.785 vs 0.781)
Between-division prediction: LARS performs better (0.546 vs 0.511)
Overall: LARS has higher mean correlation, suggesting better generalization
cat("\n========================================\n")
##
## ========================================
cat(" LARS MODEL COEFFICIENTS (Min Cp)\n")
## LARS MODEL COEFFICIENTS (Min Cp)
cat("========================================\n")
## ========================================
cat("\n--- AE Division ---\n")
##
## --- AE Division ---
print(betalarsAE)
## AtBat Hits HmRun Runs RBI Walks Years
## -3.3530002 10.0477043 -5.1870815 1.9252299 0.0000000 6.1607285 0.0000000
## CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts
## -0.3771068 0.0000000 0.8418551 2.9480375 1.0145354 -1.1760056 0.3408382
## Assists Errors
## 0.1259118 0.0000000
cat("\n--- AW Division ---\n")
##
## --- AW Division ---
print(betalarsAW)
## AtBat Hits HmRun Runs RBI Walks Years
## 0.00000000 2.30243447 0.00000000 0.00000000 0.00000000 1.03768371 0.00000000
## CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts
## 0.00000000 0.33434397 0.00000000 0.00000000 0.00000000 0.00000000 0.08998272
## Assists Errors
## 0.00000000 0.00000000
cat("\n--- NE Division ---\n")
##
## --- NE Division ---
print(betalarsNE)
## AtBat Hits HmRun Runs RBI Walks Years
## -0.2481822 0.0000000 0.0000000 0.0000000 0.0000000 2.5323541 0.0000000
## CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts
## 0.0000000 0.0000000 0.0000000 0.0000000 0.8941327 0.0000000 0.4312271
## Assists Errors
## 0.7716719 -8.4960887
cat("\n--- NW Division ---\n")
##
## --- NW Division ---
print(betalarsNW)
## AtBat Hits HmRun Runs RBI Walks
## 1.3713374 1.9961871 32.2358918 -9.7338778 -11.9613345 7.3627544
## Years CAtBat CHits CHmRun CRuns CRBI
## 7.7832856 0.0000000 0.4414019 0.4438961 0.0000000 0.0000000
## CWalks PutOuts Assists Errors
## -0.9914020 0.0000000 -0.7387407 3.8378122
cat("\n========================================\n")
##
## ========================================
cat(" LM MODEL COEFFICIENTS\n")
## LM MODEL COEFFICIENTS
cat("========================================\n")
## ========================================
cat("\n--- AE Division ---\n")
##
## --- AE Division ---
print(coef(lmAE))
## (Intercept) AtBat Hits HmRun Runs Walks
## 333.91615317 -3.77014051 10.47129162 -4.36880711 1.89356499 6.28731152
## CAtBat CHmRun CRuns CRBI CWalks PutOuts
## -0.50437506 -0.08326206 3.66894235 1.52228576 -1.33810508 0.38200461
## Assists
## 0.25849109
cat("\n--- AW Division ---\n")
##
## --- AW Division ---
print(coef(lmAW))
## (Intercept) Hits Walks CHits PutOuts
## -151.5020812 2.4162522 1.2512816 0.3519476 0.1312208
cat("\n--- NE Division ---\n")
##
## --- NE Division ---
print(coef(lmNE))
## (Intercept) AtBat Walks CRBI PutOuts Assists
## 192.4466171 -0.5836321 3.9198287 0.9232909 0.5015935 1.0687232
## Errors
## -12.1454651
cat("\n--- NW Division ---\n")
##
## --- NW Division ---
print(coef(lmNW))
## (Intercept) AtBat Hits HmRun Runs RBI
## 87.8200644 1.1613866 4.7544874 44.5274429 -14.1316674 -16.8348882
## Walks Years CHits CHmRun CWalks Assists
## 9.7791908 7.7745907 0.5717372 0.4689956 -1.3725120 -0.8329296
## Errors
## 5.1266500
INTERPRETATION - Coefficient Comparison:
| Variable | AE | AW | NE | NW | Pattern |
|---|---|---|---|---|---|
| Hits | +10.0 | +2.3 | 0 | +2.0 | Important in AE, AW, NW |
| Walks | +6.2 | +1.0 | +2.5 | +7.4 | Important in ALL divisions |
| CRBI | +1.0 | 0 | +0.9 | 0 | Career RBIs matter in AE, NE |
| PutOuts | +0.3 | +0.1 | +0.4 | 0 | Defensive value varies |
| Errors | 0 | 0 | -8.5 | +3.8 | Opposite effects in NE vs NW |
| HmRun | -5.2 | 0 | 0 | +32.2 | Very different in NW |
Key Finding: The SAME variable can have OPPOSITE effects across divisions:
Errors: Negative in NE (-8.5), Positive in NW (+3.8)
HmRun: Negative in AE (-5.2), Very positive in NW (+32.2)
This strongly supports the conclusion that salary depends DIFFERENTLY on performance across divisions.
The analysis conclusively shows that salary structures differ across baseball divisions. Each division values player statistics differently:
AE: Values career statistics (CRuns, CRBI)
AW: Simpler model, values hits and walks
NE: Penalizes errors heavily, values defensive skills
NW: Heavily rewards home runs and experience
This suggests division-specific factors such as team budgets, market sizes, managerial strategies, and league dynamics influence how player performance translates to salary.
Within-division correlations are consistently higher than between-division correlations, it suggests division-specific salary structures. Comparing LARS vs LM diagonal and off-diagonal means indicates which method provides better prediction within divisions and which generalizes better between divisions. Differences in selected predictors and coefficient patterns across divisions support the idea that salary depends on performance and career statistics differently across divisions.