EconometricAssn1 - Moneyball Analysis

Author

Nuobing Fan

# Load libraries
library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.3.3

Warning: package 'ggplot2' was built under R version 4.3.3

Warning: package 'tidyr' was built under R version 4.3.3

Warning: package 'readr' was built under R version 4.3.3

Warning: package 'purrr' was built under R version 4.3.3

Warning: package 'dplyr' was built under R version 4.3.3

Warning: package 'forcats' was built under R version 4.3.3

Warning: package 'lubridate' was built under R version 4.3.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(visdat)

Warning: package 'visdat' was built under R version 4.3.3

library(mice)

Warning: package 'mice' was built under R version 4.3.3

Warning in check_dep_version(): ABI version mismatch: 
lme4 was built with Matrix ABI version 1
Current Matrix ABI version is 0
Please re-install lme4 from source or restore original 'Matrix' package


Attaching package: 'mice'

The following object is masked from 'package:stats':

    filter

The following objects are masked from 'package:base':

    cbind, rbind

library(corrplot)

Warning: package 'corrplot' was built under R version 4.3.3

corrplot 0.94 loaded

library(stargazer)


Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

library(lmtest)

Warning: package 'lmtest' was built under R version 4.3.3

Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

library(car)

Warning: package 'car' was built under R version 4.3.3

Loading required package: carData

Warning: package 'carData' was built under R version 4.3.3


Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

# Load the training data
train_data <- read.csv("./moneyball-training-data-1.csv")

# View basic structure and summary of the dataset
str(train_data)

'data.frame':   2276 obs. of  17 variables:
 $ INDEX           : int  1 2 3 4 5 6 7 8 11 12 ...
 $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
 $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
 $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
 $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
 $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
 $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
 $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
 $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
 $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
 $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
 $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
 $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
 $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
 $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
 $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
 $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...

summary(train_data)

     INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
 Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
 1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
 Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
 Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
 3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
 Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
                                                                 
 TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
 1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
 Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
 Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
 3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
 Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
                                                   NA's   :102     
 TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
 Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
 1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
 Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
 Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
 3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
 Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
 NA's   :131     NA's   :772     NA's   :2085                    
 TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
 Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
 1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
 Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
 Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
 3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
 Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
                                   NA's   :102                       
 TEAM_FIELDING_DP
 Min.   : 52.0   
 1st Qu.:131.0   
 Median :149.0   
 Mean   :146.4   
 3rd Qu.:164.0   
 Max.   :228.0   
 NA's   :286

# Check for missing values
vis_dat(train_data)

1. DATA EXPLORATION (50 Points)

Describe the size and the variables in the moneyball training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.

Mean / Standard Deviation / Median
Bar Chart or Box Plot of the data
Is the data correlated to the target variable (or to other variables?)
Are any of the variables missing and need to be imputed “fixed”? There are many packages to quickly identify types of variables and missing values. EG. vis_data – read up on how create a Gallery of Missing Data and Visualizations. I would like to see some charts on this. Can try mice and Amelia packages too.

Ans: For this question, I tested and fixed the missing values, also calculated Mean, Median, and SD. Since our data is big enough that we can consider it as a normal distribution.Also it will come with some virsualizations.

In the preparation of the Moneyball dataset for regression analysis, managing outliers is crucial as they can significantly affect model accuracy and inferential statistics.

# Checking missing values before imputation
# Check missing values and summarize
missing_values <- colSums(is.na(train_data))
print(missing_values)  # Prints the number of missing values for each column

           INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
               0                0                0                0 
 TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
               0                0                0              102 
 TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H 
             131              772             2085                0 
TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
               0                0              102                0 
TEAM_FIELDING_DP 
             286

cat("Total observations:", nrow(train_data), "\n")

Total observations: 2276

cat("Missing values summary:\n")

Missing values summary:

print(missing_values[missing_values > 0])  # Only show variables with missing values

 TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP 
             102              131              772             2085 
TEAM_PITCHING_SO TEAM_FIELDING_DP 
             102              286

# Impute missing values for each column using mean/median where appropriate (for all numeric data)
train_data$TEAM_BATTING_SO <- ifelse(is.na(train_data$TEAM_BATTING_SO), 
                                     mean(train_data$TEAM_BATTING_SO, na.rm = TRUE), 
                                     train_data$TEAM_BATTING_SO)

train_data$TEAM_BASERUN_CS <- ifelse(is.na(train_data$TEAM_BASERUN_CS), 
                                     mean(train_data$TEAM_BASERUN_CS, na.rm = TRUE), 
                                     train_data$TEAM_BASERUN_CS)

train_data$TEAM_BATTING_HBP <- ifelse(is.na(train_data$TEAM_BATTING_HBP), 
                                      mean(train_data$TEAM_BATTING_HBP, na.rm = TRUE), 
                                      train_data$TEAM_BATTING_HBP)

train_data$TEAM_PITCHING_SO <- ifelse(is.na(train_data$TEAM_PITCHING_SO), 
                                      mean(train_data$TEAM_PITCHING_SO, na.rm = TRUE), 
                                      train_data$TEAM_PITCHING_SO)

# Checking missing values after imputation
cat("Missing values after imputation:\n")

Missing values after imputation:

colSums(is.na(train_data))

           INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
               0                0                0                0 
 TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
               0                0                0                0 
 TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H 
             131                0                0                0 
TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
               0                0                0                0 
TEAM_FIELDING_DP 
             286

# Visualize again after imputation to confirm changes
vis_dat(train_data)

Outliers were identified using box plots and summary statistics for each variable. These methods helped visualize distribution extremes and quantify potential outliers, particularly in variables like TEAM_BATTING_HR and TEAM_PITCHING_SO.

# Fix missing values: Impute using the mean or median for numeric columns
train_data <- train_data %>%
  mutate(TEAM_BATTING_H = ifelse(is.na(TEAM_BATTING_H), mean(TEAM_BATTING_H, na.rm = TRUE), TEAM_BATTING_H),
         TEAM_BATTING_2B = ifelse(is.na(TEAM_BATTING_2B), median(TEAM_BATTING_2B, na.rm = TRUE), TEAM_BATTING_2B),
         TEAM_BATTING_3B = ifelse(is.na(TEAM_BATTING_3B), mean(TEAM_BATTING_3B, na.rm = TRUE), TEAM_BATTING_3B),
         TEAM_BATTING_HR = ifelse(is.na(TEAM_BATTING_HR), median(TEAM_BATTING_HR, na.rm = TRUE), TEAM_BATTING_HR),
         TEAM_BASERUN_SB = ifelse(is.na(TEAM_BASERUN_SB), mean(TEAM_BASERUN_SB, na.rm = TRUE), TEAM_BASERUN_SB),
         TEAM_FIELDING_E = ifelse(is.na(TEAM_FIELDING_E), median(TEAM_FIELDING_E, na.rm = TRUE), TEAM_FIELDING_E),
         TEAM_FIELDING_DP = ifelse(is.na(TEAM_FIELDING_DP), mean(TEAM_FIELDING_DP, na.rm = TRUE), TEAM_FIELDING_DP))

# Double-check the imputation
vis_dat(train_data)

# Check the distribution before deciding on mean/median.
hist(train_data$TEAM_BATTING_HR)

# Summary statistics for numeric variables
summary_stats <- train_data %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd, median = median), na.rm = TRUE))

Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(...)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

print(summary_stats)

  INDEX_mean INDEX_sd INDEX_median TARGET_WINS_mean TARGET_WINS_sd
1   1268.464  736.349       1270.5         80.79086       15.75215
  TARGET_WINS_median TEAM_BATTING_H_mean TEAM_BATTING_H_sd
1                 82             1469.27          144.5912
  TEAM_BATTING_H_median TEAM_BATTING_2B_mean TEAM_BATTING_2B_sd
1                  1454             241.2469           46.80141
  TEAM_BATTING_2B_median TEAM_BATTING_3B_mean TEAM_BATTING_3B_sd
1                    238                55.25           27.93856
  TEAM_BATTING_3B_median TEAM_BATTING_HR_mean TEAM_BATTING_HR_sd
1                     47             99.61204           60.54687
  TEAM_BATTING_HR_median TEAM_BATTING_BB_mean TEAM_BATTING_BB_sd
1                    102             501.5589           122.6709
  TEAM_BATTING_BB_median TEAM_BATTING_SO_mean TEAM_BATTING_SO_sd
1                    512             735.6053           242.8912
  TEAM_BATTING_SO_median TEAM_BASERUN_SB_mean TEAM_BASERUN_SB_sd
1               735.6053             124.7618           85.22608
  TEAM_BASERUN_SB_median TEAM_BASERUN_CS_mean TEAM_BASERUN_CS_sd
1                    106             52.80386           18.65913
  TEAM_BASERUN_CS_median TEAM_BATTING_HBP_mean TEAM_BATTING_HBP_sd
1               52.80386              59.35602            3.747397
  TEAM_BATTING_HBP_median TEAM_PITCHING_H_mean TEAM_PITCHING_H_sd
1                59.35602              1779.21           1406.843
  TEAM_PITCHING_H_median TEAM_PITCHING_HR_mean TEAM_PITCHING_HR_sd
1                   1518              105.6986            61.29875
  TEAM_PITCHING_HR_median TEAM_PITCHING_BB_mean TEAM_PITCHING_BB_sd
1                     107              553.0079            166.3574
  TEAM_PITCHING_BB_median TEAM_PITCHING_SO_mean TEAM_PITCHING_SO_sd
1                   536.5              817.7305             540.544
  TEAM_PITCHING_SO_median TEAM_FIELDING_E_mean TEAM_FIELDING_E_sd
1                817.7305             246.4807            227.771
  TEAM_FIELDING_E_median TEAM_FIELDING_DP_mean TEAM_FIELDING_DP_sd
1                    159              146.3879            24.52252
  TEAM_FIELDING_DP_median
1                146.3879

# Correlation matrix including the target variable
cor_matrix <- cor(train_data %>% select(-INDEX), use = "complete.obs")

# Visualize correlation matrix with corrplot
corrplot(cor_matrix, method = "color", type = "lower", 
         tl.cex = 0.7, title = "Correlation Matrix", mar = c(0,0,1,0))

# Bar chart example for TARGET_WINS (discrete variable, number of wins)
ggplot(train_data, aes(x = TARGET_WINS)) +
  geom_bar(fill = "blue", color = "white") +
  labs(title = "Distribution of Wins")

# Display specific correlation with TARGET_WINS
cor_with_target <- cor_matrix[,"TARGET_WINS"]
print(cor_with_target)

     TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
      1.00000000       0.38876752       0.28910365       0.14260841 
 TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
      0.17615320       0.23255986      -0.03067847       0.12297192 
 TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR 
      0.01556444       0.01633738      -0.10993705       0.18901373 
TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
      0.12417454      -0.07578725      -0.17648476      -0.02884126

The bar-char– X-axis (TARGET_WINS): This axis displays the number of wins for the teams, which is a discrete variable. It likely ranges from 0 to 150, as inferred from the label.

Y-axis (Count): This axis indicates the number of teams that achieved each win total. The height of each bar reflects how many teams had that specific number of wins.

Bars: Each bar represents the count of teams that won a specific number of games. The blue bars help visualize the frequency of teams at different win totals.

Distribution Shape: The distribution appears to be somewhat bell-shaped or normal, indicating that most teams tend to win between a certain range (around 80 to 100 wins).

Peaks: There are peaks at specific win totals, which suggest common outcomes for teams in the dataset. It may indicate that this is a typical performance level for teams during the years covered by the dataset.

Low Counts for Extremes: The counts for very low and very high win totals are lower, indicating that extreme outcomes (very few wins or very many wins) are less common.

This graph provides insights into the competitive balance of teams within the dataset.

# Handling missing values - Impute with mean or median
train_data <- train_data %>%
  mutate(across(everything(), ~ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# Feature engineering - create new variables
# Example: Run-to-hit ratio
train_data <- train_data %>%
  mutate(Run_to_Hit_Ratio = TEAM_BATTING_H / (TEAM_BATTING_HR + 1))

# Check if imputation and new variables were added successfully
summary(train_data)

     INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
 Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
 1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
 Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
 Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
 3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
 Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
 TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
 1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 556.8  
 Median : 47.00   Median :102.00   Median :512.0   Median : 735.6  
 Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
 3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0  
 Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
 TEAM_BASERUN_SB TEAM_BASERUN_CS  TEAM_BATTING_HBP TEAM_PITCHING_H
 Min.   :  0.0   Min.   :  0.00   Min.   :29.00    Min.   : 1137  
 1st Qu.: 67.0   1st Qu.: 44.00   1st Qu.:59.36    1st Qu.: 1419  
 Median :106.0   Median : 52.80   Median :59.36    Median : 1518  
 Mean   :124.8   Mean   : 52.80   Mean   :59.36    Mean   : 1779  
 3rd Qu.:151.0   3rd Qu.: 54.25   3rd Qu.:59.36    3rd Qu.: 1682  
 Max.   :697.0   Max.   :201.00   Max.   :95.00    Max.   :30132  
 TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
 Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
 1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  626.0   1st Qu.: 127.0  
 Median :107.0    Median : 536.5   Median :  817.7   Median : 159.0  
 Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
 3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  957.0   3rd Qu.: 249.2  
 Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
 TEAM_FIELDING_DP Run_to_Hit_Ratio  
 Min.   : 52.0    Min.   :   5.854  
 1st Qu.:134.0    1st Qu.:   9.781  
 Median :146.4    Median :  13.763  
 Mean   :146.4    Mean   :  37.657  
 3rd Qu.:161.2    3rd Qu.:  36.405  
 Max.   :228.0    Max.   :2003.000

2. DATA PREPARATION (50 Points)

Describe how you have transformed the data by changing the original variables or creating new variables. If you didtransform the data or create new variables, discuss why you did this. Here are some possible transformations.

Fix missing values (maybe with a Mean or Median value).
Create flags to suggest if a variable was missing
Transform data by putting it into buckets
Mathematical transformations such as log or square root (or use Box-Cox). Can try inverse transformation (1/x) or truncation (cap the maximum value possible)
Combine variables (such as ratios or adding or multiplying) to create new variables

I have fix missing values with either mean or median, since the missing values may do negative impact on my analysis.fix value very much help me to make model in next few questions. or we can maybe create flags for missing values that help us to find situations where sometimes can be useful.

# Create missing value flags
train_data <- train_data %>%
  mutate(FLAG_MISSING_BATTING_H = ifelse(is.na(TEAM_BATTING_H), 1, 0),
         FLAG_MISSING_BATTING_2B = ifelse(is.na(TEAM_BATTING_2B), 1, 0),
         FLAG_MISSING_BATTING_HR = ifelse(is.na(TEAM_BATTING_HR), 1, 0),
         FLAG_MISSING_BASERUN_SB = ifelse(is.na(TEAM_BASERUN_SB), 1, 0))

also I did log transformation that it help stabilize variance and reduce impact of extreme values.

train_datalog <- train_data %>%
  mutate(LOG_TEAM_BATTING_HR = log(TEAM_BATTING_HR + 1),  # Adding 1 to avoid log(0)
         LOG_TEAM_BASERUN_SB = log(TEAM_BASERUN_SB + 1))

3. BUILD MODELS (50 Points)

Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection,etc.). Since you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.

Discuss the coefficients in the models, do they make sense? For example, if a team hits a lot of Home Runs, it would be reasonably expected that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

stepAIC may be helpful for backward / forward selection.

# Model 1: Simple linear regression with selected variables (no log transformation)
model1 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR, data = train_data)
summary(model1)


Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR, 
    data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-67.537  -9.269   0.701   9.602  49.679 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     13.743878   3.091731   4.445  9.2e-06 ***
TEAM_BATTING_H   0.042481   0.002065  20.568  < 2e-16 ***
TEAM_BATTING_HR  0.046493   0.004932   9.426  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.24 on 2273 degrees of freedom
Multiple R-squared:  0.1831,    Adjusted R-squared:  0.1824 
F-statistic: 254.7 on 2 and 2273 DF,  p-value: < 2.2e-16

#This model uses two key variables from the original train_data without any log transformations. It includes TEAM_BATTING_H and TEAM_BATTING_HR (hits and home runs), which are likely to impact the number of wins.

# Model 2: Multiple linear regression with more variables (no log transformation)
model2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BASERUN_SB + TEAM_FIELDING_E, data = train_data)
summary(model2)


Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + 
    TEAM_BASERUN_SB + TEAM_FIELDING_E, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-51.314  -9.141  -0.030   8.742  53.951 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      6.821827   2.940689   2.320 0.020440 *  
TEAM_BATTING_H   0.049153   0.002060  23.859  < 2e-16 ***
TEAM_BATTING_HR  0.021368   0.006108   3.499 0.000477 ***
TEAM_BASERUN_SB  0.042359   0.003709  11.421  < 2e-16 ***
TEAM_FIELDING_E -0.022978   0.001624 -14.145  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.41 on 2271 degrees of freedom
Multiple R-squared:  0.2767,    Adjusted R-squared:  0.2754 
F-statistic: 217.2 on 4 and 2271 DF,  p-value: < 2.2e-16

#This model includes more variables without applying any log transformations.And it expands on Model 1 by adding TEAM_BASERUN_SB (stolen bases) and TEAM_FIELDING_E (errors), which should provide a broader perspective on team performance.

# Model 3: Multiple linear regression with log-transformed variables
model3 <- lm(TARGET_WINS ~ log(TEAM_BATTING_H + 1) + LOG_TEAM_BATTING_HR + LOG_TEAM_BASERUN_SB + TEAM_FIELDING_E, data = train_datalog)
summary(model3)


Call:
lm(formula = TARGET_WINS ~ log(TEAM_BATTING_H + 1) + LOG_TEAM_BATTING_HR + 
    LOG_TEAM_BASERUN_SB + TEAM_FIELDING_E, data = train_datalog)

Residuals:
    Min      1Q  Median      3Q     Max 
-53.272  -9.183   0.013   8.886  55.058 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)             -4.912e+02  2.297e+01 -21.388   <2e-16 ***
log(TEAM_BATTING_H + 1)  7.554e+01  3.257e+00  23.193   <2e-16 ***
LOG_TEAM_BATTING_HR      7.315e-01  4.866e-01   1.503    0.133    
LOG_TEAM_BASERUN_SB      5.066e+00  4.848e-01  10.448   <2e-16 ***
TEAM_FIELDING_E         -2.123e-02  1.898e-03 -11.186   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.48 on 2271 degrees of freedom
Multiple R-squared:  0.2692,    Adjusted R-squared:  0.2679 
F-statistic: 209.1 on 4 and 2271 DF,  p-value: < 2.2e-16

#This model uses log-transformed variables from the train_datalog dataset, also applies log transformations to TEAM_BATTING_H, TEAM_BATTING_HR, and TEAM_BASERUN_SB to reduce the impact of extreme values and skewness.


# Compare models using stargazer
stargazer(model1, model2, model3, type = "text", title = "Model Comparison", out = "models.txt")


Model Comparison
=====================================================================================================
                                                     Dependent variable:                             
                        -----------------------------------------------------------------------------
                                                         TARGET_WINS                                 
                                   (1)                       (2)                       (3)           
-----------------------------------------------------------------------------------------------------
TEAM_BATTING_H                  0.042***                  0.049***                                   
                                 (0.002)                   (0.002)                                   
                                                                                                     
TEAM_BATTING_HR                 0.046***                  0.021***                                   
                                 (0.005)                   (0.006)                                   
                                                                                                     
TEAM_BASERUN_SB                                           0.042***                                   
                                                           (0.004)                                   
                                                                                                     
log(TEAM_BATTING_H + 1)                                                             75.536***        
                                                                                     (3.257)         
                                                                                                     
LOG_TEAM_BATTING_HR                                                                   0.732          
                                                                                     (0.487)         
                                                                                                     
LOG_TEAM_BASERUN_SB                                                                 5.066***         
                                                                                     (0.485)         
                                                                                                     
TEAM_FIELDING_E                                           -0.023***                 -0.021***        
                                                           (0.002)                   (0.002)         
                                                                                                     
Constant                        13.744***                  6.822**                 -491.209***       
                                 (3.092)                   (2.941)                  (22.967)         
                                                                                                     
-----------------------------------------------------------------------------------------------------
Observations                      2,276                     2,276                     2,276          
R2                                0.183                     0.277                     0.269          
Adjusted R2                       0.182                     0.275                     0.268          
Residual Std. Error        14.244 (df = 2273)        13.409 (df = 2271)        13.478 (df = 2271)    
F Statistic             254.692*** (df = 2; 2273) 217.186*** (df = 4; 2271) 209.140*** (df = 4; 2271)
=====================================================================================================
Note:                                                                     *p<0.1; **p<0.05; ***p<0.01

stargazer(model1, model2, model3, type = "text", title = "Model Comparison", out = "models_comparison.txt", 
          dep.var.labels = "Number of Wins", 
          covariate.labels = c("Batting Hits", "Home Runs", "Stolen Bases", "Fielding Errors"),
          model.names = FALSE)


Model Comparison
=================================================================================================
                                                 Dependent variable:                             
                    -----------------------------------------------------------------------------
                                                   Number of Wins                                
                               (1)                       (2)                       (3)           
-------------------------------------------------------------------------------------------------
Batting Hits                0.042***                  0.049***                                   
                             (0.002)                   (0.002)                                   
                                                                                                 
Home Runs                   0.046***                  0.021***                                   
                             (0.005)                   (0.006)                                   
                                                                                                 
Stolen Bases                                          0.042***                                   
                                                       (0.004)                                   
                                                                                                 
Fielding Errors                                                                 75.536***        
                                                                                 (3.257)         
                                                                                                 
LOG_TEAM_BATTING_HR                                                               0.732          
                                                                                 (0.487)         
                                                                                                 
LOG_TEAM_BASERUN_SB                                                             5.066***         
                                                                                 (0.485)         
                                                                                                 
TEAM_FIELDING_E                                       -0.023***                 -0.021***        
                                                       (0.002)                   (0.002)         
                                                                                                 
Constant                    13.744***                  6.822**                 -491.209***       
                             (3.092)                   (2.941)                  (22.967)         
                                                                                                 
-------------------------------------------------------------------------------------------------
Observations                  2,276                     2,276                     2,276          
R2                            0.183                     0.277                     0.269          
Adjusted R2                   0.182                     0.275                     0.268          
Residual Std. Error    14.244 (df = 2273)        13.409 (df = 2271)        13.478 (df = 2271)    
F Statistic         254.692*** (df = 2; 2273) 217.186*** (df = 4; 2271) 209.140*** (df = 4; 2271)
=================================================================================================
Note:                                                                 *p<0.1; **p<0.05; ***p<0.01

For model1: Both TEAM_BATTING_H and TEAM_BATTING_HR have positive and significant coefficients, as expected, since more hits and home runs should lead to more wins. The adjusted R-squared value of 0.182 indicates that the model explains about 18% of the variance in team wins. Residual Standard Error: 14.24 Interpretation: This simple model gives a basic understanding of how team batting metrics relate to team performance in terms of wins, but the explanatory power is limited due to the simplicity of the model.

for model2: TEAM_BATTING_H, TEAM_BATTING_HR, and TEAM_BASERUN_SB all have positive and significant coefficients, indicating that better offensive metrics lead to more wins. TEAM_FIELDING_E has a negative coefficient, which aligns with the expectation that more errors reduce the likelihood of winning. The adjusted R-squared value of 0.275 suggests that this model explains around 27.5% of the variance in wins, significantly improving upon Model 1. Residual Standard Error: 13.41 Interpretation: This multivariable model, without log transformation, captures the influence of offensive and defensive metrics on team wins more effectively than the simpler Model 1.

for model3: log(TEAM_BATTING_H + 1) and LOG_TEAM_BASERUN_SB are positive and significant, indicating that after accounting for skewed data, these variables still strongly influence the number of wins. LOG_TEAM_BATTING_HR is not significant in this model, suggesting that the log transformation may have reduced its impact. TEAM_FIELDING_E remains negative and significant, as expected. The adjusted R-squared value of 0.268 is slightly lower than Model 2, but still a substantial improvement over Model 1. Residual Standard Error: 13.48 Interpretation: Log-transforming certain variables reduces the influence of outliers, but it doesn’t significantly improve the model compared to Model 2. However, it helps stabilize the variance in the dataset.

4. SELECT MODELS (50 Points)

Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.

Model 1: Simple, but low explanatory power (Adjusted R-squared: 0.182).

Model 2: Best performance overall, with a higher Adjusted R-squared (0.275), making it the most comprehensive model without log transformations.

Model 3: Similar performance to Model 2 (Adjusted R-squared: 0.268), but it handles skewed variables better through log transformations.

Seems Model 2 performs the best, balancing simplicity and predictive power without requiring log transformations. However, Model 3 may be useful if there are concerns about outliers or highly skewed data.

For selecting the best multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on

mean squared error,
R2,
F-statistic, and
residual plots (see this video for details).
Residuals vs Fitted Values

Normal Q-Q (Quantile-Quantile) – tells if residuals are normally distributed by comparing them with actual normal distribution
Plot - Scale-Location / Spread-Location Plot – shows if residuals are spread equally among our predictions in order to check homoscedasticity
Residuals vs Leverage Plot – shows influential data points that have a big effect on the linear model

Make predictions using the evaluation data set

# a.
# Mean Squared Error for model2
mse_model2 <- mean(residuals(model2)^2)
cat("MSE model2: ", mse_model2)

MSE model2:  179.3956

# Mean Squared Error for model3
mse_model3 <- mean(residuals(model3)^2)
cat("MSE model3: ", mse_model3)

MSE model3:  181.2537

# b.
# R² and Adjusted R² for all models
model_comparison <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3"),
  R_squared = c(summary(model1)$r.squared, summary(model2)$r.squared, summary(model3)$r.squared),
  Adj_R_squared = c(summary(model1)$adj.r.squared, summary(model2)$adj.r.squared, summary(model3)$adj.r.squared)
)
print(model_comparison)

    Model R_squared Adj_R_squared
1 Model 1 0.1830744     0.1823556
2 Model 2 0.2766929     0.2754189
3 Model 3 0.2692012     0.2679140

# c.
# F-statistic for Model 2
f_stat2 <- summary(model2)$fstatistic
print("f stat model2:")

[1] "f stat model2:"

print(f_stat2)

    value     numdf     dendf 
 217.1863    4.0000 2271.0000

# F-statistic for Model 3
f_stat3 <- summary(model3)$fstatistic
print("f stat model3:")

[1] "f stat model3:"

print(f_stat3)

    value     numdf     dendf 
 209.1396    4.0000 2271.0000

# d.

# Residual diagnostics
# d. Residual diagnostics
par(mfrow = c(2, 2))  # Set up the plotting area for 4 plots

# Plot residual diagnostics for Model 2
plot(model2, which = 1)  # Residuals vs Fitted
title(main = "Model 2: Residuals vs Fitted")  # Add title

plot(model2, which = 2)  # Normal Q-Q
title(main = "Model 2: Normal Q-Q Plot")  # Add title

plot(model2, which = 3)  # Scale-Location
title(main = "Model 2: Scale-Location Plot")  # Add title

plot(model2, which = 4)  # Residuals vs Leverage
title(main = "Model 2: Residuals vs Leverage")  # Add title

# Plot residual diagnostics for Model 3
par(mfrow = c(2, 2))  # Reset plotting area for 4 plots

plot(model3, which = 1)  # Residuals vs Fitted
title(main = "Model 3: Residuals vs Fitted")  # Add title

plot(model3, which = 2)  # Normal Q-Q
title(main = "Model 3: Normal Q-Q Plot")  # Add title

plot(model3, which = 3)  # Scale-Location
title(main = "Model 3: Scale-Location Plot")  # Add title

plot(model3, which = 4)  # Residuals vs Leverage
title(main = "Model 3: Residuals vs Leverage")  # Add title

library(visdat)
# Check multicollinearity using VIF for Model 3. 2
vif_values3 <- vif(model3)
vif_values2 <- vif(model2)
print("VIF for model 2")

[1] "VIF for model 2"

print(vif_values2)

 TEAM_BATTING_H TEAM_BATTING_HR TEAM_BASERUN_SB TEAM_FIELDING_E 
       1.122748        1.730410        1.264306        1.732362

print("VIF for model 3")

[1] "VIF for model 3"

print(vif_values3)

log(TEAM_BATTING_H + 1)     LOG_TEAM_BATTING_HR     LOG_TEAM_BASERUN_SB 
               1.168252                2.394291                1.174813 
        TEAM_FIELDING_E 
               2.340531

MSE for Model 2: 179.3956 MSE for Model 3: 181.2537 The MSE indicates the average squared difference between predicted and actual team wins. A lower MSE for Model 2 suggests it has a better fit compared to Model 3.
Model 2 has the highest Adjusted R² (0.2754), indicating it explains more variance in team wins while accounting for the number of predictors. Model 1 has the lowest explanatory power, while Model 3 performs slightly worse than Model 2.
F-statistic for Model 2: 217.1863 (with degrees of freedom 4 and 2271) F-statistic for Model 3: 209.1396 (with degrees of freedom 4 and 2271) The F-statistic indicates that both models are statistically significant, with a p-value < 2.2e-16, showing that the predictors collectively relate to the number of wins.
Residual Diagnostics

Residuals vs Fitted Values: Model 2: Shows a random scatter, indicating good homoscedasticity, but slight curvature suggests potential non-linearity. Model 3: Similar scatter, but slight curvature may be more pronounced due to log transformations.

Normal Q-Q Plot: Model 2: Points largely follow the reference line, suggesting normality of residuals. Model 3: Also follows the reference line, but some deviations at the extremes may indicate the presence of outliers.

Scale-Location Plot: Model 2: Residuals are evenly spread, indicating homoscedasticity. Model 3: Shows similar characteristics but may have slightly more spread at lower fitted values.

Residuals vs Leverage Plot: Model 2: No major influential points. Model 3: Identifies a few potentially influential points, but they do not seem to have a significant effect on the overall model fit.

VIF: All VIF values are below the threshold of 5, indicating no significant multicollinearity issues in either model. However, the VIF for TEAM_BATTING_HR is higher in Model 3, which may indicate a potential concern in this model.

Overall Analysis:

Model 2 demonstrates superior performance across several metrics: lower MSE, higher Adjusted R², and a better F-statistic. The residual diagnostics indicate that it meets the assumptions of linear regression more robustly. Model 3, while effective in handling skewed data through log transformations, does not significantly outperform Model 2. It shows similar goodness-of-fit but indicates some potential issues with influential points and normality of residuals.

Thus:

Based on this comparison, Model 2 is preferred for predictions and further analysis due to its simplicity, better fit, and consistent residual diagnostics. Model 3 may still be valuable if addressing skewness and outliers becomes critical in the context of the analysis.

In this case, we will use Model 2 for evaluation ### Evaluation:

# Load the evaluation data
eval_data <- read.csv("./moneyball-evaluation-data-1.csv")

# Impute missing values, create new features, and make predictions
eval_data$Predicted_Wins <- predict(model2, newdata = eval_data)

# Save the predicted results
write.csv(eval_data, "./predicted_wins.csv", row.names = FALSE)

GPT comment:

In your work, you’ve effectively structured your code with clear segmentation and good use of comments, making it easy to follow your analytical process in R. You’ve correctly utilized packages like tidyverse for data manipulation, visdat for visualizing missing data, and stargazer for neatly presenting regression outputs. However, for improvement, consider refining your approach to handling missing data. While imputing missing values with mean or median is standard, exploring more sophisticated methods or providing a rationale for choosing one over the other based on the distribution of data could enhance the robustness of your analysis. Also, ensure your code adheres to recent updates in R syntax to avoid deprecated functions, enhancing script reliability and future compatibility.