Warning: package 'tidyverse' was built under R version 4.3.3
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'tidyr' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
Warning: package 'purrr' was built under R version 4.3.3
Warning: package 'dplyr' was built under R version 4.3.3
Warning: package 'forcats' was built under R version 4.3.3
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(visdat)
Warning: package 'visdat' was built under R version 4.3.3
library(mice)
Warning: package 'mice' was built under R version 4.3.3
Warning in check_dep_version(): ABI version mismatch:
lme4 was built with Matrix ABI version 1
Current Matrix ABI version is 0
Please re-install lme4 from source or restore original 'Matrix' package
Attaching package: 'mice'
The following object is masked from 'package:stats':
filter
The following objects are masked from 'package:base':
cbind, rbind
library(corrplot)
Warning: package 'corrplot' was built under R version 4.3.3
corrplot 0.94 loaded
library(stargazer)
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(lmtest)
Warning: package 'lmtest' was built under R version 4.3.3
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
library(car)
Warning: package 'car' was built under R version 4.3.3
Loading required package: carData
Warning: package 'carData' was built under R version 4.3.3
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
# Load the training datatrain_data <-read.csv("./moneyball-training-data-1.csv")# View basic structure and summary of the datasetstr(train_data)
'data.frame': 2276 obs. of 17 variables:
$ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
$ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
$ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
$ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
$ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
$ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
$ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
$ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
$ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
$ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
$ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
$ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
$ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
$ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
$ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
$ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
$ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
summary(train_data)
INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
Median :1270.5 Median : 82.00 Median :1454 Median :238.0
Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
NA's :102
TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
Median :101.0 Median : 49.0 Median :58.00 Median : 1518
Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
NA's :131 NA's :772 NA's :2085
TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
NA's :102
TEAM_FIELDING_DP
Min. : 52.0
1st Qu.:131.0
Median :149.0
Mean :146.4
3rd Qu.:164.0
Max. :228.0
NA's :286
# Check for missing valuesvis_dat(train_data)
1. DATA EXPLORATION (50 Points)
Describe the size and the variables in the moneyball training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.
Mean / Standard Deviation / Median
Bar Chart or Box Plot of the data
Is the data correlated to the target variable (or to other variables?)
Are any of the variables missing and need to be imputed “fixed”? There are many packages to quickly identify types of variables and missing values. EG. vis_data – read up on how create a Gallery of Missing Data and Visualizations. I would like to see some charts on this. Can try mice and Amelia packages too.
Ans: For this question, I tested and fixed the missing values, also calculated Mean, Median, and SD. Since our data is big enough that we can consider it as a normal distribution.Also it will come with some virsualizations.
In the preparation of the Moneyball dataset for regression analysis, managing outliers is crucial as they can significantly affect model accuracy and inferential statistics.
# Checking missing values before imputation# Check missing values and summarizemissing_values <-colSums(is.na(train_data))print(missing_values) # Prints the number of missing values for each column
# Visualize again after imputation to confirm changesvis_dat(train_data)
Outliers were identified using box plots and summary statistics for each variable. These methods helped visualize distribution extremes and quantify potential outliers, particularly in variables like TEAM_BATTING_HR and TEAM_PITCHING_SO.
# Fix missing values: Impute using the mean or median for numeric columnstrain_data <- train_data %>%mutate(TEAM_BATTING_H =ifelse(is.na(TEAM_BATTING_H), mean(TEAM_BATTING_H, na.rm =TRUE), TEAM_BATTING_H),TEAM_BATTING_2B =ifelse(is.na(TEAM_BATTING_2B), median(TEAM_BATTING_2B, na.rm =TRUE), TEAM_BATTING_2B),TEAM_BATTING_3B =ifelse(is.na(TEAM_BATTING_3B), mean(TEAM_BATTING_3B, na.rm =TRUE), TEAM_BATTING_3B),TEAM_BATTING_HR =ifelse(is.na(TEAM_BATTING_HR), median(TEAM_BATTING_HR, na.rm =TRUE), TEAM_BATTING_HR),TEAM_BASERUN_SB =ifelse(is.na(TEAM_BASERUN_SB), mean(TEAM_BASERUN_SB, na.rm =TRUE), TEAM_BASERUN_SB),TEAM_FIELDING_E =ifelse(is.na(TEAM_FIELDING_E), median(TEAM_FIELDING_E, na.rm =TRUE), TEAM_FIELDING_E),TEAM_FIELDING_DP =ifelse(is.na(TEAM_FIELDING_DP), mean(TEAM_FIELDING_DP, na.rm =TRUE), TEAM_FIELDING_DP))# Double-check the imputationvis_dat(train_data)
# Check the distribution before deciding on mean/median.hist(train_data$TEAM_BATTING_HR)
# Summary statistics for numeric variablessummary_stats <- train_data %>%summarise(across(where(is.numeric), list(mean = mean, sd = sd, median = median), na.rm =TRUE))
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(...)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
# Correlation matrix including the target variablecor_matrix <-cor(train_data %>%select(-INDEX), use ="complete.obs")# Visualize correlation matrix with corrplotcorrplot(cor_matrix, method ="color", type ="lower", tl.cex =0.7, title ="Correlation Matrix", mar =c(0,0,1,0))
# Bar chart example for TARGET_WINS (discrete variable, number of wins)ggplot(train_data, aes(x = TARGET_WINS)) +geom_bar(fill ="blue", color ="white") +labs(title ="Distribution of Wins")
# Display specific correlation with TARGET_WINScor_with_target <- cor_matrix[,"TARGET_WINS"]print(cor_with_target)
The bar-char– X-axis (TARGET_WINS): This axis displays the number of wins for the teams, which is a discrete variable. It likely ranges from 0 to 150, as inferred from the label.
Y-axis (Count): This axis indicates the number of teams that achieved each win total. The height of each bar reflects how many teams had that specific number of wins.
Bars: Each bar represents the count of teams that won a specific number of games. The blue bars help visualize the frequency of teams at different win totals.
Distribution Shape: The distribution appears to be somewhat bell-shaped or normal, indicating that most teams tend to win between a certain range (around 80 to 100 wins).
Peaks: There are peaks at specific win totals, which suggest common outcomes for teams in the dataset. It may indicate that this is a typical performance level for teams during the years covered by the dataset.
Low Counts for Extremes: The counts for very low and very high win totals are lower, indicating that extreme outcomes (very few wins or very many wins) are less common.
This graph provides insights into the competitive balance of teams within the dataset.
# Handling missing values - Impute with mean or mediantrain_data <- train_data %>%mutate(across(everything(), ~ifelse(is.na(.), mean(., na.rm =TRUE), .)))# Feature engineering - create new variables# Example: Run-to-hit ratiotrain_data <- train_data %>%mutate(Run_to_Hit_Ratio = TEAM_BATTING_H / (TEAM_BATTING_HR +1))# Check if imputation and new variables were added successfullysummary(train_data)
INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
Median :1270.5 Median : 82.00 Median :1454 Median :238.0
Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 556.8
Median : 47.00 Median :102.00 Median :512.0 Median : 735.6
Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0
Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
Min. : 0.0 Min. : 0.00 Min. :29.00 Min. : 1137
1st Qu.: 67.0 1st Qu.: 44.00 1st Qu.:59.36 1st Qu.: 1419
Median :106.0 Median : 52.80 Median :59.36 Median : 1518
Mean :124.8 Mean : 52.80 Mean :59.36 Mean : 1779
3rd Qu.:151.0 3rd Qu.: 54.25 3rd Qu.:59.36 3rd Qu.: 1682
Max. :697.0 Max. :201.00 Max. :95.00 Max. :30132
TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 626.0 1st Qu.: 127.0
Median :107.0 Median : 536.5 Median : 817.7 Median : 159.0
Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 957.0 3rd Qu.: 249.2
Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
TEAM_FIELDING_DP Run_to_Hit_Ratio
Min. : 52.0 Min. : 5.854
1st Qu.:134.0 1st Qu.: 9.781
Median :146.4 Median : 13.763
Mean :146.4 Mean : 37.657
3rd Qu.:161.2 3rd Qu.: 36.405
Max. :228.0 Max. :2003.000
2. DATA PREPARATION (50 Points)
Describe how you have transformed the data by changing the original variables or creating new variables. If you didtransform the data or create new variables, discuss why you did this. Here are some possible transformations.
Fix missing values (maybe with a Mean or Median value).
Create flags to suggest if a variable was missing
Transform data by putting it into buckets
Mathematical transformations such as log or square root (or use Box-Cox). Can try inverse transformation (1/x) or truncation (cap the maximum value possible)
Combine variables (such as ratios or adding or multiplying) to create new variables
I have fix missing values with either mean or median, since the missing values may do negative impact on my analysis.fix value very much help me to make model in next few questions. or we can maybe create flags for missing values that help us to find situations where sometimes can be useful.
Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection,etc.). Since you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.
Discuss the coefficients in the models, do they make sense? For example, if a team hits a lot of Home Runs, it would be reasonably expected that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.
stepAIC may be helpful for backward / forward selection.
# Model 1: Simple linear regression with selected variables (no log transformation)model1 <-lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR, data = train_data)summary(model1)
Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR,
data = train_data)
Residuals:
Min 1Q Median 3Q Max
-67.537 -9.269 0.701 9.602 49.679
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.743878 3.091731 4.445 9.2e-06 ***
TEAM_BATTING_H 0.042481 0.002065 20.568 < 2e-16 ***
TEAM_BATTING_HR 0.046493 0.004932 9.426 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.24 on 2273 degrees of freedom
Multiple R-squared: 0.1831, Adjusted R-squared: 0.1824
F-statistic: 254.7 on 2 and 2273 DF, p-value: < 2.2e-16
#This model uses two key variables from the original train_data without any log transformations. It includes TEAM_BATTING_H and TEAM_BATTING_HR (hits and home runs), which are likely to impact the number of wins.# Model 2: Multiple linear regression with more variables (no log transformation)model2 <-lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BASERUN_SB + TEAM_FIELDING_E, data = train_data)summary(model2)
Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
TEAM_BASERUN_SB + TEAM_FIELDING_E, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-51.314 -9.141 -0.030 8.742 53.951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.821827 2.940689 2.320 0.020440 *
TEAM_BATTING_H 0.049153 0.002060 23.859 < 2e-16 ***
TEAM_BATTING_HR 0.021368 0.006108 3.499 0.000477 ***
TEAM_BASERUN_SB 0.042359 0.003709 11.421 < 2e-16 ***
TEAM_FIELDING_E -0.022978 0.001624 -14.145 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.41 on 2271 degrees of freedom
Multiple R-squared: 0.2767, Adjusted R-squared: 0.2754
F-statistic: 217.2 on 4 and 2271 DF, p-value: < 2.2e-16
#This model includes more variables without applying any log transformations.And it expands on Model 1 by adding TEAM_BASERUN_SB (stolen bases) and TEAM_FIELDING_E (errors), which should provide a broader perspective on team performance.# Model 3: Multiple linear regression with log-transformed variablesmodel3 <-lm(TARGET_WINS ~log(TEAM_BATTING_H +1) + LOG_TEAM_BATTING_HR + LOG_TEAM_BASERUN_SB + TEAM_FIELDING_E, data = train_datalog)summary(model3)
Call:
lm(formula = TARGET_WINS ~ log(TEAM_BATTING_H + 1) + LOG_TEAM_BATTING_HR +
LOG_TEAM_BASERUN_SB + TEAM_FIELDING_E, data = train_datalog)
Residuals:
Min 1Q Median 3Q Max
-53.272 -9.183 0.013 8.886 55.058
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.912e+02 2.297e+01 -21.388 <2e-16 ***
log(TEAM_BATTING_H + 1) 7.554e+01 3.257e+00 23.193 <2e-16 ***
LOG_TEAM_BATTING_HR 7.315e-01 4.866e-01 1.503 0.133
LOG_TEAM_BASERUN_SB 5.066e+00 4.848e-01 10.448 <2e-16 ***
TEAM_FIELDING_E -2.123e-02 1.898e-03 -11.186 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.48 on 2271 degrees of freedom
Multiple R-squared: 0.2692, Adjusted R-squared: 0.2679
F-statistic: 209.1 on 4 and 2271 DF, p-value: < 2.2e-16
#This model uses log-transformed variables from the train_datalog dataset, also applies log transformations to TEAM_BATTING_H, TEAM_BATTING_HR, and TEAM_BASERUN_SB to reduce the impact of extreme values and skewness.# Compare models using stargazerstargazer(model1, model2, model3, type ="text", title ="Model Comparison", out ="models.txt")
For model1: Both TEAM_BATTING_H and TEAM_BATTING_HR have positive and significant coefficients, as expected, since more hits and home runs should lead to more wins. The adjusted R-squared value of 0.182 indicates that the model explains about 18% of the variance in team wins. Residual Standard Error: 14.24 Interpretation: This simple model gives a basic understanding of how team batting metrics relate to team performance in terms of wins, but the explanatory power is limited due to the simplicity of the model.
for model2: TEAM_BATTING_H, TEAM_BATTING_HR, and TEAM_BASERUN_SB all have positive and significant coefficients, indicating that better offensive metrics lead to more wins. TEAM_FIELDING_E has a negative coefficient, which aligns with the expectation that more errors reduce the likelihood of winning. The adjusted R-squared value of 0.275 suggests that this model explains around 27.5% of the variance in wins, significantly improving upon Model 1. Residual Standard Error: 13.41 Interpretation: This multivariable model, without log transformation, captures the influence of offensive and defensive metrics on team wins more effectively than the simpler Model 1.
for model3: log(TEAM_BATTING_H + 1) and LOG_TEAM_BASERUN_SB are positive and significant, indicating that after accounting for skewed data, these variables still strongly influence the number of wins. LOG_TEAM_BATTING_HR is not significant in this model, suggesting that the log transformation may have reduced its impact. TEAM_FIELDING_E remains negative and significant, as expected. The adjusted R-squared value of 0.268 is slightly lower than Model 2, but still a substantial improvement over Model 1. Residual Standard Error: 13.48 Interpretation: Log-transforming certain variables reduces the influence of outliers, but it doesn’t significantly improve the model compared to Model 2. However, it helps stabilize the variance in the dataset.
4. SELECT MODELS (50 Points)
Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.
Model 1: Simple, but low explanatory power (Adjusted R-squared: 0.182).
Model 2: Best performance overall, with a higher Adjusted R-squared (0.275), making it the most comprehensive model without log transformations.
Model 3: Similar performance to Model 2 (Adjusted R-squared: 0.268), but it handles skewed variables better through log transformations.
Seems Model 2 performs the best, balancing simplicity and predictive power without requiring log transformations. However, Model 3 may be useful if there are concerns about outliers or highly skewed data.
For selecting the best multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on
mean squared error,
R2,
F-statistic, and
residual plots (see this video for details).
Residuals vs Fitted Values
Normal Q-Q (Quantile-Quantile) – tells if residuals are normally distributed by comparing them with actual normal distribution
Plot - Scale-Location / Spread-Location Plot – shows if residuals are spread equally among our predictions in order to check homoscedasticity
Residuals vs Leverage Plot – shows influential data points that have a big effect on the linear model
Make predictions using the evaluation data set
# a.# Mean Squared Error for model2mse_model2 <-mean(residuals(model2)^2)cat("MSE model2: ", mse_model2)
MSE model2: 179.3956
# Mean Squared Error for model3mse_model3 <-mean(residuals(model3)^2)cat("MSE model3: ", mse_model3)
MSE model3: 181.2537
# b.# R² and Adjusted R² for all modelsmodel_comparison <-data.frame(Model =c("Model 1", "Model 2", "Model 3"),R_squared =c(summary(model1)$r.squared, summary(model2)$r.squared, summary(model3)$r.squared),Adj_R_squared =c(summary(model1)$adj.r.squared, summary(model2)$adj.r.squared, summary(model3)$adj.r.squared))print(model_comparison)
Model R_squared Adj_R_squared
1 Model 1 0.1830744 0.1823556
2 Model 2 0.2766929 0.2754189
3 Model 3 0.2692012 0.2679140
# c.# F-statistic for Model 2f_stat2 <-summary(model2)$fstatisticprint("f stat model2:")
[1] "f stat model2:"
print(f_stat2)
value numdf dendf
217.1863 4.0000 2271.0000
# F-statistic for Model 3f_stat3 <-summary(model3)$fstatisticprint("f stat model3:")
[1] "f stat model3:"
print(f_stat3)
value numdf dendf
209.1396 4.0000 2271.0000
# d.# Residual diagnostics# d. Residual diagnosticspar(mfrow =c(2, 2)) # Set up the plotting area for 4 plots# Plot residual diagnostics for Model 2plot(model2, which =1) # Residuals vs Fittedtitle(main ="Model 2: Residuals vs Fitted") # Add titleplot(model2, which =2) # Normal Q-Qtitle(main ="Model 2: Normal Q-Q Plot") # Add titleplot(model2, which =3) # Scale-Locationtitle(main ="Model 2: Scale-Location Plot") # Add titleplot(model2, which =4) # Residuals vs Leveragetitle(main ="Model 2: Residuals vs Leverage") # Add title
# Plot residual diagnostics for Model 3par(mfrow =c(2, 2)) # Reset plotting area for 4 plotsplot(model3, which =1) # Residuals vs Fittedtitle(main ="Model 3: Residuals vs Fitted") # Add titleplot(model3, which =2) # Normal Q-Qtitle(main ="Model 3: Normal Q-Q Plot") # Add titleplot(model3, which =3) # Scale-Locationtitle(main ="Model 3: Scale-Location Plot") # Add titleplot(model3, which =4) # Residuals vs Leveragetitle(main ="Model 3: Residuals vs Leverage") # Add title
library(visdat)# Check multicollinearity using VIF for Model 3. 2vif_values3 <-vif(model3)vif_values2 <-vif(model2)print("VIF for model 2")
MSE for Model 2: 179.3956 MSE for Model 3: 181.2537 The MSE indicates the average squared difference between predicted and actual team wins. A lower MSE for Model 2 suggests it has a better fit compared to Model 3.
Model 2 has the highest Adjusted R² (0.2754), indicating it explains more variance in team wins while accounting for the number of predictors. Model 1 has the lowest explanatory power, while Model 3 performs slightly worse than Model 2.
F-statistic for Model 2: 217.1863 (with degrees of freedom 4 and 2271) F-statistic for Model 3: 209.1396 (with degrees of freedom 4 and 2271) The F-statistic indicates that both models are statistically significant, with a p-value < 2.2e-16, showing that the predictors collectively relate to the number of wins.
Residual Diagnostics
Residuals vs Fitted Values: Model 2: Shows a random scatter, indicating good homoscedasticity, but slight curvature suggests potential non-linearity. Model 3: Similar scatter, but slight curvature may be more pronounced due to log transformations.
Normal Q-Q Plot: Model 2: Points largely follow the reference line, suggesting normality of residuals. Model 3: Also follows the reference line, but some deviations at the extremes may indicate the presence of outliers.
Scale-Location Plot: Model 2: Residuals are evenly spread, indicating homoscedasticity. Model 3: Shows similar characteristics but may have slightly more spread at lower fitted values.
Residuals vs Leverage Plot: Model 2: No major influential points. Model 3: Identifies a few potentially influential points, but they do not seem to have a significant effect on the overall model fit.
VIF: All VIF values are below the threshold of 5, indicating no significant multicollinearity issues in either model. However, the VIF for TEAM_BATTING_HR is higher in Model 3, which may indicate a potential concern in this model.
Overall Analysis:
Model 2 demonstrates superior performance across several metrics: lower MSE, higher Adjusted R², and a better F-statistic. The residual diagnostics indicate that it meets the assumptions of linear regression more robustly. Model 3, while effective in handling skewed data through log transformations, does not significantly outperform Model 2. It shows similar goodness-of-fit but indicates some potential issues with influential points and normality of residuals.
Thus:
Based on this comparison, Model 2 is preferred for predictions and further analysis due to its simplicity, better fit, and consistent residual diagnostics. Model 3 may still be valuable if addressing skewness and outliers becomes critical in the context of the analysis.
In this case, we will use Model 2 for evaluation ### Evaluation:
# Load the evaluation dataeval_data <-read.csv("./moneyball-evaluation-data-1.csv")# Impute missing values, create new features, and make predictionseval_data$Predicted_Wins <-predict(model2, newdata = eval_data)# Save the predicted resultswrite.csv(eval_data, "./predicted_wins.csv", row.names =FALSE)
GPT comment:
In your work, you’ve effectively structured your code with clear segmentation and good use of comments, making it easy to follow your analytical process in R. You’ve correctly utilized packages like tidyverse for data manipulation, visdat for visualizing missing data, and stargazer for neatly presenting regression outputs. However, for improvement, consider refining your approach to handling missing data. While imputing missing values with mean or median is standard, exploring more sophisticated methods or providing a rationale for choosing one over the other based on the distribution of data could enhance the robustness of your analysis. Also, ensure your code adheres to recent updates in R syntax to avoid deprecated functions, enhancing script reliability and future compatibility.