DATA 621 HW 1

1. DATA EXPLORATION

Overview

The moneyball-training-data set contains 17 columns (INDEX, TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_PITCHING_H, TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_FIELDING_E, TEAM_FIELDING_DP) and 2276 rows. This is an observational study. All variables are quantitative: TARGET_WINS is the response variable, INDEX is a row identifier, and the remaining 15 columns are candidate predictors.

Load the moneyball-training-data and moneyball-evaluation-data CSV files from GitHub into RStudio.
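A minimal sketch of the loading step. The report does not give the file location, so the path below is an assumption; the data frame name dftrain matches the name used later in this report. To keep the sketch runnable without the real file, it first writes a tiny stand-in CSV with made-up rows to a temporary file.

```r
# Hypothetical stand-in for moneyball-training-data.csv (made-up rows),
# written to a temporary file so the example runs anywhere.
train_path <- tempfile(fileext = ".csv")
writeLines(c("INDEX,TARGET_WINS,TEAM_BATTING_H",
             "1,39,1445",
             "2,70,1339"),
           train_path)

# With the real data, replace train_path with the path or URL of the CSV.
dftrain <- read.csv(train_path)

dim(dftrain)   # number of rows and columns
str(dftrain)   # column names and types
```

The same call with the evaluation file would produce the second data frame.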

Summary

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286

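The mean and median of TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO, TEAM_BASERUN_CS, TEAM_PITCHING_HR, and TEAM_FIELDING_DP are close to each other, which suggests that these variables are roughly symmetric and approximately normally distributed. For TEAM_BATTING_3B (mean 55.25, median 47) and TEAM_BASERUN_SB (mean 124.8, median 101) the mean is noticeably above the median, so these two are likely somewhat right-skewed.

The medians of TEAM_PITCHING_H (1518 vs. a mean of 1779) and TEAM_FIELDING_E (159 vs. a mean of 246.5) are much lower than their means. This indicates that these distributions are positively skewed and that there could be some outliers. TEAM_PITCHING_SO has a similar mean and median (817.7 vs. 813.5), but its maximum of 19278 is far above the third quartile of 968, so it also contains extreme outliers. TEAM_BATTING_HBP has a mean close to its median (59.36 vs. 58), but 2085 of its 2276 values are missing, so its summary is based on few observations.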
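The mean-versus-median comparison can be sketched as a small helper. This is an illustration on made-up vectors, not the moneyball data, and the helper name mean_median_gap is hypothetical.

```r
# Toy vectors (made-up values), not the moneyball data.
symmetric    <- c(90, 95, 100, 105, 110)
right_skewed <- c(90, 95, 100, 105, 1000)   # one extreme high value

# Relative gap between mean and median: near 0 suggests symmetry,
# a clearly positive value suggests a long right tail or high outliers.
mean_median_gap <- function(x) {
  m  <- mean(x, na.rm = TRUE)
  md <- median(x, na.rm = TRUE)
  (m - md) / md
}

mean_median_gap(symmetric)
mean_median_gap(right_skewed)
```

A gap near zero is only a rough symmetry check; a histogram or Q-Q plot is still needed to judge normality.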

Box Plot of the data

Correlation

##                  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## TARGET_WINS       1.00000000    0.388767521      0.28910365     0.142608411
## TEAM_BATTING_H    0.38876752    1.000000000      0.56284968     0.427696575
## TEAM_BATTING_2B   0.28910365    0.562849678      1.00000000    -0.107305824
## TEAM_BATTING_3B   0.14260841    0.427696575     -0.10730582     1.000000000
## TEAM_BATTING_HR   0.17615320   -0.006544685      0.43539729    -0.635566946
## TEAM_BATTING_BB   0.23255986   -0.072464013      0.25572610    -0.287235841
## TEAM_BATTING_SO  -0.03175071   -0.463853571      0.16268519    -0.669781188
## TEAM_BASERUN_SB   0.13513892    0.123567797     -0.19975724     0.533506448
## TEAM_BASERUN_CS   0.02240407    0.016705668     -0.09981406     0.348764919
## TEAM_BATTING_HBP  0.07350424   -0.029112176      0.04608475    -0.174247154
## TEAM_PITCHING_H  -0.10993705    0.302693709      0.02369219     0.194879411
## TEAM_PITCHING_HR  0.18901373    0.072853119      0.45455082    -0.567836679
## TEAM_PITCHING_BB  0.12417454    0.094193027      0.17805420    -0.002224148
## TEAM_PITCHING_SO -0.07843609   -0.252656790      0.06479231    -0.258818931
## TEAM_FIELDING_E  -0.17648476    0.264902478     -0.23515099     0.509778447
## TEAM_FIELDING_DP -0.03485058    0.155383321      0.29087998    -0.323074847
##                  TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## TARGET_WINS          0.176153200      0.23255986     -0.03175071
## TEAM_BATTING_H      -0.006544685     -0.07246401     -0.46385357
## TEAM_BATTING_2B      0.435397293      0.25572610      0.16268519
## TEAM_BATTING_3B     -0.635566946     -0.28723584     -0.66978119
## TEAM_BATTING_HR      1.000000000      0.51373481      0.72706935
## TEAM_BATTING_BB      0.513734810      1.00000000      0.37975087
## TEAM_BATTING_SO      0.727069348      0.37975087      1.00000000
## TEAM_BASERUN_SB     -0.453578426     -0.10511564     -0.25448923
## TEAM_BASERUN_CS     -0.433793868     -0.13698837     -0.21788137
## TEAM_BATTING_HBP     0.106181160      0.04746007      0.22094219
## TEAM_PITCHING_H     -0.250145481     -0.44977762     -0.37568637
## TEAM_PITCHING_HR     0.969371396      0.45955207      0.66717889
## TEAM_PITCHING_BB     0.136927564      0.48936126      0.03700514
## TEAM_PITCHING_SO     0.184707564     -0.02075682      0.41623330
## TEAM_FIELDING_E     -0.587339098     -0.65597081     -0.58466444
## TEAM_FIELDING_DP     0.448985348      0.43087675      0.15488939
##                  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## TARGET_WINS           0.13513892      0.02240407       0.07350424
## TEAM_BATTING_H        0.12356780      0.01670567      -0.02911218
## TEAM_BATTING_2B      -0.19975724     -0.09981406       0.04608475
## TEAM_BATTING_3B       0.53350645      0.34876492      -0.17424715
## TEAM_BATTING_HR      -0.45357843     -0.43379387       0.10618116
## TEAM_BATTING_BB      -0.10511564     -0.13698837       0.04746007
## TEAM_BATTING_SO      -0.25448923     -0.21788137       0.22094219
## TEAM_BASERUN_SB       1.00000000      0.65524480      -0.06400498
## TEAM_BASERUN_CS       0.65524480      1.00000000      -0.07051390
## TEAM_BATTING_HBP     -0.06400498     -0.07051390       1.00000000
## TEAM_PITCHING_H       0.07328505     -0.05200781      -0.02769699
## TEAM_PITCHING_HR     -0.41651072     -0.42256605       0.10675878
## TEAM_PITCHING_BB      0.14641513     -0.10696124       0.04785137
## TEAM_PITCHING_SO     -0.13712861     -0.21022274       0.22157375
## TEAM_FIELDING_E       0.50963090      0.04832189       0.04178971
## TEAM_FIELDING_DP     -0.49707763     -0.21424801      -0.07120824
##                  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## TARGET_WINS          -0.10993705       0.18901373      0.124174536
## TEAM_BATTING_H        0.30269371       0.07285312      0.094193027
## TEAM_BATTING_2B       0.02369219       0.45455082      0.178054204
## TEAM_BATTING_3B       0.19487941      -0.56783668     -0.002224148
## TEAM_BATTING_HR      -0.25014548       0.96937140      0.136927564
## TEAM_BATTING_BB      -0.44977762       0.45955207      0.489361263
## TEAM_BATTING_SO      -0.37568637       0.66717889      0.037005141
## TEAM_BASERUN_SB       0.07328505      -0.41651072      0.146415134
## TEAM_BASERUN_CS      -0.05200781      -0.42256605     -0.106961236
## TEAM_BATTING_HBP     -0.02769699       0.10675878      0.047851371
## TEAM_PITCHING_H       1.00000000      -0.14161276      0.320676162
## TEAM_PITCHING_HR     -0.14161276       1.00000000      0.221937505
## TEAM_PITCHING_BB      0.32067616       0.22193750      1.000000000
## TEAM_PITCHING_SO      0.26724807       0.20588053      0.488498653
## TEAM_FIELDING_E       0.66775901      -0.49314447     -0.022837561
## TEAM_FIELDING_DP     -0.22865059       0.43917040      0.324457226
##                  TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## TARGET_WINS           -0.07843609     -0.17648476      -0.03485058
## TEAM_BATTING_H        -0.25265679      0.26490248       0.15538332
## TEAM_BATTING_2B        0.06479231     -0.23515099       0.29087998
## TEAM_BATTING_3B       -0.25881893      0.50977845      -0.32307485
## TEAM_BATTING_HR        0.18470756     -0.58733910       0.44898535
## TEAM_BATTING_BB       -0.02075682     -0.65597081       0.43087675
## TEAM_BATTING_SO        0.41623330     -0.58466444       0.15488939
## TEAM_BASERUN_SB       -0.13712861      0.50963090      -0.49707763
## TEAM_BASERUN_CS       -0.21022274      0.04832189      -0.21424801
## TEAM_BATTING_HBP       0.22157375      0.04178971      -0.07120824
## TEAM_PITCHING_H        0.26724807      0.66775901      -0.22865059
## TEAM_PITCHING_HR       0.20588053     -0.49314447       0.43917040
## TEAM_PITCHING_BB       0.48849865     -0.02283756       0.32445723
## TEAM_PITCHING_SO       1.00000000     -0.02329178       0.02615804
## TEAM_FIELDING_E       -0.02329178      1.00000000      -0.49768495
## TEAM_FIELDING_DP       0.02615804     -0.49768495       1.00000000

The table shows the Pearson’s correlation coefficients between different baseball statistics. The values in the table indicate the strength and direction of the linear relationship between the two variables.

For example, the value of -0.10993705 between TARGET_WINS and TEAM_PITCHING_H indicates that there is a weak negative linear relationship between the two variables. This means that as the number of TEAM_PITCHING_H increases, the number of TARGET_WINS is likely to decrease, but the relationship is weak.

There is a moderate negative correlation (-0.635566946) between “TEAM_BATTING_3B” and “TEAM_BATTING_HR”. This means that as the number of “TEAM_BATTING_3B” increases, the number of “TEAM_BATTING_HR” is likely to decrease.

On the other hand, the value of 0.96937140 between TEAM_BATTING_HR and TEAM_PITCHING_HR indicates that there is a strong positive linear relationship between the two variables. This means that as the number of TEAM_BATTING_HR increases, the number of TEAM_PITCHING_HR is also likely to increase, and the relationship is strong.
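A minimal sketch of how such a correlation matrix can be computed with missing values present. The report does not state how NA's were handled, so use = "pairwise.complete.obs" is an assumption, and the toy data frame holds made-up values.

```r
# Toy data frame (made-up values) with one missing entry.
toy <- data.frame(
  wins = c(60, 70, 80, 90, 100),
  hr   = c(50, 80, 90, 120, 150),
  sb   = c(200, NA, 120, 90, 60)
)

# Pearson correlation matrix; "pairwise.complete.obs" uses, for each pair
# of columns, every row in which both values are present.
cor_mat <- cor(toy, use = "pairwise.complete.obs")
round(cor_mat, 3)

# Correlations of every variable with the response, sorted.
sort(cor_mat[, "wins"], decreasing = TRUE)
```

With pairwise deletion, different cells of the matrix can be based on different numbers of rows, which matters for columns with many NA's.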

Missing Data

It is always important to consider the context and characteristics of the data before deciding how to handle missing values. Six columns have missing data: TEAM_BATTING_SO (102 NA’s), TEAM_BASERUN_SB (131), TEAM_BASERUN_CS (772), TEAM_BATTING_HBP (2085), TEAM_PITCHING_SO (102), and TEAM_FIELDING_DP (286). In general, if a variable is not normally distributed or has outliers, the median is a good replacement value. If it is approximately normally distributed, the mean is a good choice.
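Missing values per column can be counted as in the sketch below. The toy data frame uses made-up values and only borrows three column names for illustration.

```r
# Toy data frame (made-up values) with some missing entries.
toy <- data.frame(
  BATTING_SO  = c(842, NA, 1075, 917),
  BATTING_HBP = c(NA, NA, NA, 58),
  FIELDING_E  = c(1011, 193, 175, 164)
)

# Number and share of missing values in each column.
na_count <- colSums(is.na(toy))
na_share <- round(na_count / nrow(toy), 2)
data.frame(na_count, na_share)
```

A column that is mostly missing, like BATTING_HBP in the summary above, may be a candidate for dropping rather than imputing.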

2. DATA PREPARATION

Missing Data

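I replaced the missing values in BATTING_HBP and FIELDING_DP with the column mean because these variables are approximately normally distributed, and the missing values in BATTING_SO, BASERUN_SB, and BASERUN_CS with the column median because these variables are not normally distributed.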
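A minimal sketch of mean and median imputation on made-up values. The helper name impute is hypothetical and this is not necessarily the code used for this report.

```r
# Toy data frame (made-up values) with one NA per column.
toy <- data.frame(
  FIELDING_DP = c(140, 150, NA, 160),   # roughly symmetric
  BASERUN_SB  = c(60, 100, NA, 700)     # skewed by one large value
)

# Replace the NA entries of x with a summary f of the observed values.
impute <- function(x, f) {
  x[is.na(x)] <- f(x, na.rm = TRUE)
  x
}

toy$FIELDING_DP <- impute(toy$FIELDING_DP, mean)     # fills in 150
toy$BASERUN_SB  <- impute(toy$BASERUN_SB, median)    # fills in 100
toy
```

In the skewed column the median (100) is a more typical fill value than the mean, which the value 700 would pull upward.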

The presence of missing values (NA’s) can also affect the distribution of the data and correlation calculation.

Drop and Replace

The INDEX column and the TEAM_ prefix in the column names are unnecessary, so the column is dropped and the prefix is removed.
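One way to do this in base R is sketched below on a toy data frame with made-up values.

```r
# Toy data frame (made-up values) with the original naming scheme.
toy <- data.frame(
  INDEX           = 1:2,
  TARGET_WINS     = c(39, 70),
  TEAM_BATTING_H  = c(1445, 1339),
  TEAM_PITCHING_H = c(9364, 1347)
)

toy$INDEX <- NULL                             # drop the identifier column
names(toy) <- sub("^TEAM_", "", names(toy))   # strip the leading TEAM_ prefix

names(toy)
```

The anchor ^ restricts the replacement to the start of the name, so TARGET_WINS is left unchanged.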

New Variables

In baseball, a team that allows fewer runs and scores more runs has a higher chance of winning. That is why I focused on the idea that the more often a team gets on base and the more extra-base hits it gets, the more runs it scores. From the pitcher’s point of view, the opposite holds: the fewer baserunners and the fewer extra-base hits allowed, the better the odds of giving up fewer runs.

For a pitcher, a strikeout is the surest way to get an out. If a pitcher strikes out three batters in a row without allowing a baserunner, the opponent scores no runs. The key factor is clear: the more batters a pitcher strikes out, the less likely the opponent is to score. However, many strikeouts do not necessarily mean fewer runs allowed. Even with many strikeouts, if the pitcher allows many baserunners and extra-base hits, the opponent’s probability of scoring goes up.

WHIP is an indicator of a pitcher’s stability. Stability matters because even if the opposing team gives up runs, we cannot win if our own pitchers give up a lot of runs. Note that a batter reaching base on a fielding error does not affect WHIP.

Even setting aside factors such as pitching with runners on base, control, and command, WHIP does not fully represent a pitcher’s ability from a sabermetric point of view. This is because a single and a home run each count as one baserunner.

For this reason, strikeout rate and walks plus hits per inning pitched (WHIP) were additionally calculated.

Based on 2022 data from https://www.teamrankings.com/mlb/stat/plate-appearances, the average number of plate appearances per team is 6167. This value was used to calculate the batting variables:
- On_base_percentage (OBP)
- OPS
- Batting_average (BA)
The strikeout rate was calculated per nine innings. The pitching variables are:
- ERA
- Strikeout_rate (K)
- WHIP
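The report does not show the formulas behind these variables. The sketch below uses common textbook definitions adapted to a fixed plate-appearance denominator, which is an assumption and only an approximation (batting average and slugging are normally divided by at-bats), so it will not necessarily reproduce the summary values that follow. The toy row holds made-up values.

```r
PA <- 6167  # average plate appearances per team (2022), from the text above

# One made-up team season.
toy <- data.frame(
  BATTING_H = 1400, BATTING_2B = 250, BATTING_3B = 30, BATTING_HR = 150,
  BATTING_BB = 500, BATTING_HBP = 60
)

# Assumed definitions with PA as the common denominator.
toy$BA  <- toy$BATTING_H / PA
toy$OBP <- (toy$BATTING_H + toy$BATTING_BB + toy$BATTING_HBP) / PA

# Total bases: every hit counts once, plus extra bases for 2B, 3B and HR.
total_bases <- toy$BATTING_H + toy$BATTING_2B +
  2 * toy$BATTING_3B + 3 * toy$BATTING_HR
toy$SLG <- total_bases / PA
toy$OPS <- toy$OBP + toy$SLG

round(toy[, c("BA", "OBP", "SLG", "OPS")], 3)
```

Using one constant denominator for every team and season ignores differences in season length and era, which is worth keeping in mind when reading the derived columns.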

##   TARGET_WINS       BATTING_H      BATTING_2B      BATTING_3B    
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##    BATTING_HR       BATTING_BB      BATTING_SO       BASERUN_SB   
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 556.8   1st Qu.: 67.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 736.3   Mean   :123.4  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:151.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##    BASERUN_CS      BATTING_HBP      PITCHING_H     PITCHING_HR   
##  Min.   :  0.00   Min.   :29.00   Min.   : 1137   Min.   :  0.0  
##  1st Qu.: 44.00   1st Qu.:59.00   1st Qu.: 1419   1st Qu.: 50.0  
##  Median : 49.00   Median :59.00   Median : 1518   Median :107.0  
##  Mean   : 51.51   Mean   :59.03   Mean   : 1779   Mean   :105.7  
##  3rd Qu.: 54.25   3rd Qu.:59.00   3rd Qu.: 1682   3rd Qu.:150.0  
##  Max.   :201.00   Max.   :95.00   Max.   :30132   Max.   :343.0  
##   PITCHING_BB      PITCHING_SO        FIELDING_E      FIELDING_DP   
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0  
##  1st Qu.: 476.0   1st Qu.:  626.0   1st Qu.: 127.0   1st Qu.:134.0  
##  Median : 536.5   Median :  813.0   Median : 159.0   Median :146.0  
##  Mean   : 553.0   Mean   :  817.5   Mean   : 246.5   Mean   :146.3  
##  3rd Qu.: 611.0   3rd Qu.:  957.0   3rd Qu.: 249.2   3rd Qu.:161.2  
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0  
##       OBP              SLG             OPS               BA        
##  Min.   :0.1540   Min.   :0.188   Min.   :0.3420   Min.   :0.1440  
##  1st Qu.:0.3120   1st Qu.:0.371   1st Qu.:0.6870   1st Qu.:0.2240  
##  Median :0.3290   Median :0.409   Median :0.7390   Median :0.2360  
##  Mean   :0.3291   Mean   :0.408   Mean   :0.7371   Mean   :0.2383  
##  3rd Qu.:0.3470   3rd Qu.:0.443   3rd Qu.:0.7860   3rd Qu.:0.2490  
##  Max.   :0.4670   Max.   :0.621   Max.   :1.0720   Max.   :0.4140  
##       ERA               K              WHIP      
##  Min.   :0.0000   Min.   :0.000   Min.   :1.190  
##  1st Qu.:0.7252   1st Qu.:3.357   1st Qu.:2.100  
##  Median :1.9450   Median :4.793   Median :2.535  
##  Mean   :1.8421   Mean   :4.606   Mean   :  Inf  
##  3rd Qu.:2.7412   3rd Qu.:5.857   3rd Qu.:3.623  
##  Max.   :4.5940   Max.   :9.137   Max.   :  Inf

#dftrain$BATTING_HR_bucket <- cut(dftrain$BATTING_HR, breaks = c(0, 10, 20, 30, 40, 50, max(dftrain$BATTING_HR)), labels = c("0-10", "10-20", "20-30", "30-40", "40-50", "50+"))

Data Transformation

Lower values of ERA and WHIP are associated with better performance. One possible way to give these variables a positive correlation with TARGET_WINS is to take the reciprocal of each variable.

I created the new variables ERA_inverse and WHIP_inverse in the dftrain data set.

ERA_inverse is a new column in the dftrain data frame that holds 1 / dftrain$ERA, the reciprocal of the ERA column. Because the minimum ERA in the summary above is 0, the reciprocal is infinite for such rows, and these values have to be replaced before modeling.
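A short sketch of the reciprocal transform on made-up ERA values, including a zero to show where infinite values come from.

```r
# Made-up ERA values; the first one is 0 on purpose.
era <- c(0, 1.5, 2.0, 4.0)

era_inverse <- 1 / era                        # 1 / 0 evaluates to Inf in R
era_inverse[is.infinite(era_inverse)] <- NA   # treat undefined reciprocals as missing

era_inverse
```

Rows with NA are then dropped by na.omit() or by the default NA handling in lm().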

3. BUILD MODELS

Model 1

I chose a single variable, BATTING_HR, to predict wins. The average number of home runs per team is about 100 (mean 99.61). A team that hits 30 more home runs than that, 130 in total, is predicted to win about 82 games.

cor(dftrain_new$TARGET_WINS, dftrain_new$BATTING_HR)

## [1] 0.1761532

## 
## Call:
## lm(formula = TARGET_WINS ~ BATTING_HR, data = dftrain_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.226  -9.909   0.520  10.218  68.445 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 76.22576    0.62599 121.768   <2e-16 ***
## BATTING_HR   0.04583    0.00537   8.534   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.51 on 2274 degrees of freedom
## Multiple R-squared:  0.03103,    Adjusted R-squared:  0.0306 
## F-statistic: 72.82 on 1 and 2274 DF,  p-value: < 2.2e-16

predict(model1, newdata = p1)

##        1 
## 82.18351
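The prediction can be checked by hand from the printed coefficients, assuming p1 holds a single row with BATTING_HR = 130 as described in the text. Because the printed coefficients are rounded, the result agrees with predict() only to about three decimal places.

```r
# Coefficients copied from the Model 1 summary (rounded as printed).
intercept <- 76.22576
slope     <- 0.04583

home_runs <- 130   # assumed content of p1
predicted_wins <- intercept + slope * home_runs
predicted_wins

# With a single predictor, R-squared equals the squared correlation.
r <- 0.1761532
r^2
```

The squared correlation is about 0.031, which matches the Multiple R-squared in the summary: home runs alone explain roughly 3% of the variance in wins.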

Model 2

I chose OPS, BA, ERA, K, and WHIP, without any inverse-transformed variables.

A team with 0.800 OPS, 0.280 BA, 1.5 ERA, 9.2 K, and 1.334 WHIP is predicted to win about 88 games.

## 
## Call:
## lm(formula = TARGET_WINS ~ OPS + BA + ERA + K + WHIP, data = na.omit(dftrain_new))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.789  -9.031   0.455   9.140  48.096 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.5050     5.4086   2.127  0.03351 *  
## OPS         104.9102    13.5728   7.729 1.62e-14 ***
## BA            1.7758    45.5282   0.039  0.96889    
## ERA          -2.2318     0.6835  -3.265  0.00111 ** 
## K            -0.4019     0.3539  -1.136  0.25623    
## WHIP         -0.7734     0.1870  -4.136 3.67e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.71 on 2250 degrees of freedom
## Multiple R-squared:  0.2036, Adjusted R-squared:  0.2019 
## F-statistic: 115.1 on 5 and 2250 DF,  p-value: < 2.2e-16

predict(model2, newdata = p2)

##        1 
## 87.85338
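As a check, the same prediction can be reproduced as a sum of coefficient times value, assuming p2 holds the team profile described above. The coefficients are copied from the summary as printed, so the result matches predict() only up to rounding.

```r
# Coefficients copied from the Model 2 summary (rounded as printed).
coefs <- c("(Intercept)" = 11.5050, OPS = 104.9102, BA = 1.7758,
           ERA = -2.2318, K = -0.4019, WHIP = -0.7734)

# Assumed content of p2; the intercept is multiplied by 1.
team <- c("(Intercept)" = 1, OPS = 0.800, BA = 0.280,
          ERA = 1.5, K = 9.2, WHIP = 1.334)

# Align by name, multiply term by term, and add up.
contributions <- coefs * team[names(coefs)]
predicted_wins <- sum(contributions)

round(contributions, 2)
predicted_wins
```

Printing the individual contributions shows that the OPS term dominates the prediction, while BA adds less than one win.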

In this model, the negative coefficients for ERA, K, and WHIP mean that, holding all other variables constant, an increase in ERA, K, or WHIP is associated with a decrease in the number of wins. For ERA and WHIP this is the expected direction, since lower values are better for a pitcher. The negative coefficient for K is counterintuitive, because more strikeouts would seem to lead to more wins. However, it is not statistically significant (p = 0.256), and neither is BA (p = 0.969). It is possible that factors not included in the model (such as the quality of the team’s defense or run support) influence the relationship between these variables and wins in a way that the model does not capture. It is also important to remember that correlation does not necessarily imply causation: while these variables may be correlated with winning, other factors may be more strongly predictive of wins.

Model 3

I chose OBP, SLG, BA, ERA_inverse, K, and WHIP to predict wins.

A team with 0.400 OBP, 0.416 SLG, 0.280 BA, 1.5 ERA (ERA_inverse of 0.667), 9.2 K, and 1.334 WHIP is predicted to win about 94 games.

# one-row data frame with the team profile to predict
p3 <- data.frame(OBP = 0.400,
                 SLG = 0.416,
                 BA = 0.280,
                 ERA_inverse = 0.667, # 1 / 1.5
                 K = 9.200,
                 WHIP = 1.334)

#cor(dftrain_new$TARGET_WINS, OBP, SLG, OPS, BA, ERA_inverse, K, WHIP)

# replace infinite values with NA
dftrain_new$OPS[is.infinite(dftrain_new$OPS)] <- NA
dftrain_new$ERA_inverse[is.infinite(dftrain_new$ERA_inverse)] <- NA
dftrain_new$K[is.infinite(dftrain_new$K)] <- NA

model3 <- lm(TARGET_WINS ~ OBP + SLG + BA + ERA_inverse + K + WHIP,  data = na.omit(dftrain_new))
summary(model3)

## 
## Call:
## lm(formula = TARGET_WINS ~ OBP + SLG + BA + ERA_inverse + K + 
##     WHIP, data = na.omit(dftrain_new))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.386  -8.873   0.468   9.201  47.153 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.1026     6.3947  -0.172 0.863126    
## OBP         163.2316    18.6870   8.735  < 2e-16 ***
## SLG          46.3902    13.5567   3.422 0.000633 ***
## BA           51.6041    39.6472   1.302 0.193193    
## ERA_inverse   1.1340     0.2689   4.217 2.57e-05 ***
## K            -0.3952     0.3318  -1.191 0.233726    
## WHIP         -0.7499     0.1888  -3.972 7.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.49 on 2241 degrees of freedom
## Multiple R-squared:  0.2089, Adjusted R-squared:  0.2068 
## F-statistic: 98.65 on 6 and 2241 DF,  p-value: < 2.2e-16

predict(model3, newdata = p3)

##        1 
## 94.05756

4. SELECT MODELS

Since Model 1 uses only one predictor and explains only about 3% of the variance in TARGET_WINS (R-squared 0.031), I exclude it from the selection. I choose Model 3.

To determine which model is better, we can compare their goodness-of-fit measures. One commonly used measure is the adjusted R-squared, which takes into account the number of predictors in the model. Another measure is the residual standard error (RSE), which estimates the standard deviation of the errors in the model.
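The adjusted R-squared values can be recomputed from the printed R-squared, the number of observations n, and the number of predictors p. In the sketch below the helper name adj_r2 is hypothetical, and n is taken as the residual degrees of freedom plus the number of estimated coefficients. Because the printed R-squared is rounded, the results agree with the summaries only to about three decimal places.

```r
# Adjusted R-squared: penalizes R-squared for the number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Model 2: 2250 residual df + 6 coefficients, 5 predictors.
adj_m2 <- adj_r2(0.2036, n = 2250 + 6, p = 5)

# Model 3: 2241 residual df + 7 coefficients, 6 predictors.
adj_m3 <- adj_r2(0.2089, n = 2241 + 7, p = 6)

round(c(model2 = adj_m2, model3 = adj_m3), 4)
```

With more than 2000 rows and only a handful of predictors, the penalty is tiny, so adjusted and unadjusted R-squared are nearly identical here.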

Comparing the two models, I can see that Model 3 has a slightly higher adjusted R-squared value (0.2068) than Model 2 (0.2019). This suggests that Model 3 explains a slightly greater proportion of the variance in the response variable (TARGET_WINS).

Furthermore, the residual standard error of Model 3 (13.49) is slightly smaller than that of Model 2 (13.71). This suggests that the errors in Model 3 are slightly smaller, on average, than the errors in Model 2.

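Therefore, based on these measures, Model 3 is slightly better than Model 2. The difference is small, and the two models were fit on slightly different numbers of rows (2256 for Model 2 and 2248 for Model 3 after na.omit), so the comparison is approximate.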

Ivan Tikhonov, Seung Min Song

2023-2-26
