Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

df = read.csv("/Users/mathew.katz/Desktop/CUNYSPS/mlb2022.txt")
df = df[-1,]
df
##    Rk Season Team Lg   W  GP W.1   L   WL.  ERA   G CG SHO SV     IP    H   R
## 2   1   2022  LAD NL 111 162 111  51 0.685 2.80 162  1   1 43 1451.1 1114 513
## 3   2   2022  HOU AL 106 162 106  56 0.654 2.90 162  3   1 53 1445.0 1121 518
## 4   3   2022  ATL NL 101 162 101  61 0.623 3.46 162  1   1 55 1448.0 1224 609
## 5   4   2022  NYM NL 101 162 101  61 0.623 3.58 162  0   0 41 1438.2 1274 606
## 6   5   2022  NYY AL  99 162  99  63 0.611 3.30 162  1   1 47 1451.2 1177 567
## 7   6   2022  STL NL  93 162  93  69 0.574 3.79 162  3   1 37 1436.0 1335 637
## 8   7   2022  TOR AL  92 162  92  70 0.568 3.89 162  0   0 46 1441.2 1356 679
## 9   8   2022  CLE AL  92 162  92  70 0.568 3.47 162  1   0 51 1456.0 1252 634
## 10  9   2022  SEA AL  90 162  90  72 0.556 3.59 162  0   0 40 1447.0 1277 623
## 11 10   2022  SDP NL  89 162  89  73 0.549 3.81 162  0   0 48 1443.1 1263 660
## 12 11   2022  PHI NL  87 162  87  75 0.537 3.98 162  3   1 42 1428.1 1330 685
## 13 12   2022  TBR AL  86 162  86  76 0.531 3.41 162  0   0 44 1435.2 1260 614
## 14 13   2022  MIL NL  86 162  86  76 0.531 3.83 162  0   0 52 1446.0 1238 688
## 15 14   2022  BAL AL  83 162  83  79 0.512 3.97 162  2   1 46 1433.2 1406 688
## 16 15   2022  CHW AL  81 162  81  81 0.500 3.94 162  2   1 48 1447.2 1330 717
## 17 16   2022  SFG NL  81 162  81  81 0.500 3.86 162  1   0 39 1433.0 1397 697
## 18 17   2022  BOS AL  78 162  78  84 0.481 4.54 162  5   2 39 1430.2 1411 787
## 19 18   2022  MIN AL  78 162  78  84 0.481 3.98 162  0   0 28 1437.1 1320 684
## 20 19   2022  ARI NL  74 162  74  88 0.457 4.26 162  0   0 33 1429.2 1345 740
## 21 20   2022  CHC NL  74 162  74  88 0.457 4.03 162  0   0 44 1444.0 1342 731
## 22 21   2022  LAA AL  73 162  73  89 0.451 3.79 162  2   2 38 1435.2 1241 668
## 23 22   2022  MIA NL  69 162  69  93 0.426 3.87 162  6   1 41 1437.2 1311 676
## 24 23   2022  COL NL  68 162  68  94 0.420 5.08 162  1   1 43 1425.1 1516 873
## 25 24   2022  TEX AL  68 162  68  94 0.420 4.22 162  1   1 37 1435.0 1345 743
## 26 25   2022  DET AL  66 162  66  96 0.407 4.04 162  0   0 38 1419.2 1336 713
## 27 26   2022  KCR AL  65 162  65  97 0.401 4.72 162  0   0 33 1416.0 1493 810
## 28 27   2022  PIT NL  62 162  62 100 0.383 4.66 162  0   0 33 1421.1 1432 817
## 29 28   2022  CIN NL  62 162  62 100 0.383 4.86 162  1   1 31 1423.2 1366 815
## 30 29   2022  OAK AL  60 162  60 102 0.370 4.54 162  0   0 34 1426.1 1394 770
## 31 30   2022  WSN NL  55 162  55 107 0.340 5.00 162  2   0 28 1411.2 1469 855
##     ER  HR  BB IBB   SO HBP BK WP   BF ERA.  FIP  WHIP  H9 HR9 BB9 SO9 SO.BB
## 2  451 152 407  13 1465  75  3 38 5865  149 3.45 1.048 6.9 0.9 2.5 9.1  3.60
## 3  465 134 458   6 1524  60  6 56 5856  134 3.28 1.093 7.0 0.8 2.9 9.5  3.33
## 4  556 148 500  21 1554  62  4 55 6031  121 3.46 1.191 7.6 0.9 3.1 9.7  3.11
## 5  573 169 428  13 1565  71  2 35 5950  108 3.50 1.183 8.0 1.1 2.7 9.8  3.66
## 6  533 157 444  10 1459  65  5 40 5938  119 3.56 1.117 7.3 1.0 2.8 9.0  3.29
## 7  605 146 489  11 1177  60  3 43 6014  101 3.94 1.270 8.4 0.9 3.1 7.4  2.41
## 8  623 180 424  15 1390  76  5 29 6053  100 3.85 1.235 8.5 1.1 2.6 8.7  3.28
## 9  562 172 435  14 1390  57  2 49 5989  110 3.75 1.159 7.7 1.1 2.7 8.6  3.20
## 10 577 186 447  24 1391  56  0 45 5986  105 3.90 1.191 7.9 1.2 2.8 8.7  3.11
## 11 611 173 468   6 1451  88  5 54 6047  100 3.82 1.199 7.9 1.1 2.9 9.0  3.10
## 12 631 150 463  16 1423  68  3 47 6006  102 3.60 1.255 8.4 0.9 2.9 9.0  3.07
## 13 544 172 384  15 1384  66  4 54 5930  108 3.68 1.145 7.9 1.1 2.4 8.7  3.60
## 14 615 190 521  12 1530  67  4 47 6057  104 3.92 1.216 7.7 1.2 3.2 9.5  2.94
## 15 633 171 443   8 1214  64  4 47 6058  102 4.03 1.290 8.8 1.1 2.8 7.6  2.74
## 16 633 166 533  15 1450  51  6 64 6145  102 3.81 1.287 8.3 1.0 3.3 9.0  2.72
## 17 615 132 441  16 1370  52  2 53 6070  106 3.43 1.283 8.8 0.8 2.8 8.6  3.11
## 18 721 185 526  17 1346  72  8 60 6167   93 4.17 1.354 8.9 1.2 3.3 8.5  2.56
## 19 636 184 468  19 1336  66  4 50 6042   98 4.03 1.244 8.3 1.2 2.9 8.4  2.85
## 20 676 191 504  18 1216  59  3 51 6065   96 4.33 1.293 8.5 1.2 3.2 7.7  2.41
## 21 646 207 540  19 1383  73  8 53 6162  102 4.33 1.303 8.4 1.3 3.4 8.6  2.56
## 22 604 168 540  23 1383  60  3 64 6038  108 3.96 1.241 7.8 1.1 3.4 8.7  2.56
## 23 618 173 511  19 1437  76  3 54 6056  105 3.90 1.267 8.2 1.1 3.2 9.0  2.81
## 24 804 184 539  12 1187  59  3 65 6240   92 4.38 1.442 9.6 1.2 3.4 7.5  2.20
## 25 673 169 581  16 1314  71  7 66 6167   94 4.17 1.342 8.4 1.1 3.6 8.2  2.26
## 26 637 167 511   9 1195  57  2 59 6047   94 4.16 1.301 8.5 1.1 3.2 7.6  2.34
## 27 742 173 589  15 1191  71  6 88 6249   86 4.42 1.470 9.5 1.1 3.7 7.6  2.02
## 28 736 164 586  23 1250  87  5 62 6263   88 4.27 1.420 9.1 1.0 3.7 7.9  2.13
## 29 769 213 612  21 1414 110  5 58 6220   93 4.59 1.389 8.6 1.3 3.9 8.9  2.31
## 30 719 195 503  37 1203  72  5 62 6121   83 4.41 1.330 8.8 1.2 3.2 7.6  2.39
## 31 785 244 558  12 1220  75  2 59 6220   78 4.98 1.436 9.4 1.6 3.6 7.8  2.19

The data set contains pitching statistics for all 30 MLB teams from the 2022 season. The columns are “Rk”, “Season”, “Team”, “Lg”, “W”, “GP”, “W.1”, “L”, “WL.”, “ERA”, “G”, “CG”, “SHO”, “SV”, “IP”, “H”, “R”, “ER”, “HR”, “BB”, “IBB”, “SO”, “HBP”, “BK”, “WP”, “BF”, “ERA.”, “FIP”, “WHIP”, “H9”, “HR9”, “BB9”, “SO9”, and “SO.BB”, plus “ERA_betterthan_average”, which is created below.

Dichotomous term:

Is the team’s ERA better than league average?

# Lower ERA is better; a team exactly at the mean is coded "No" instead of NA
df$ERA_betterthan_average <- ifelse(df$ERA < mean(df$ERA), "Yes", "No")
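
As a minimal sketch of how the indicator enters a model, the block below stores it as a factor with explicit levels, so that "No" is the reference group and the fitted coefficient is the "Yes" minus "No" difference. The data frame `teams` and the column `ERA_better` are hypothetical names, the ten rows are a subset copied from the table above, and the mean is taken over that subset only (so it differs from the league mean).

```r
# Ten teams copied from the 2022 table above (a subset, for illustration only)
teams <- data.frame(
  Team = c("LAD", "HOU", "ATL", "NYY", "TOR", "BOS", "COL", "KCR", "OAK", "WSN"),
  W    = c(111, 106, 101, 99, 92, 78, 68, 65, 60, 55),
  ERA  = c(2.80, 2.90, 3.46, 3.30, 3.89, 4.54, 5.08, 4.72, 4.54, 5.00)
)

# Lower ERA is better. Listing "No" first makes it the reference level,
# so lm() would report a single coefficient named ERA_betterYes.
teams$ERA_better <- factor(
  ifelse(teams$ERA < mean(teams$ERA), "Yes", "No"),
  levels = c("No", "Yes")
)

# Check the group sizes and confirm that no team was left uncoded
table(teams$ERA_better, useNA = "ifany")
```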

Quadratic term:

quad = I(df$FIP^2)

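Interaction term (WHIP × HR; both variables are quantitative, so this is a quantitative vs. quantitative interaction rather than a dichotomous vs. quantitative one):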

dvq = df$WHIP*df$HR
model<-lm(df$W ~ df$ERA_betterthan_average + quad + dvq)
summary(model)
## 
## Call:
## lm(formula = df$W ~ df$ERA_betterthan_average + quad + dvq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.071  -5.232   1.028   4.313  14.972 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  130.05980   12.24704  10.620 5.97e-11 ***
## df$ERA_betterthan_averageYes   5.35127    4.18021   1.280   0.2118    
## quad                          -4.15387    1.64155  -2.530   0.0178 *  
## dvq                            0.06314    0.11198   0.564   0.5777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.522 on 26 degrees of freedom
## Multiple R-squared:  0.7647, Adjusted R-squared:  0.7376 
## F-statistic: 28.17 on 3 and 26 DF,  p-value: 2.506e-08

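Each t-test in the table has the null hypothesis that the coefficient equals zero. Among the predictors, the only p-value below 0.05 belongs to the quadratic term, FIP squared (p = 0.0178). Holding the other terms fixed, each one-unit increase in FIP² is associated with about 4.15 fewer wins. Teams with a better-than-average ERA are predicted to win about 5.35 more games, but that difference is not statistically significant (p = 0.2118), and the WHIP × HR product has a small positive coefficient (0.063) that is also not significant (p = 0.5777). The intercept (130.06) is the predicted win total when every predictor is zero, which is an extrapolation with no practical baseball meaning. Overall, the model explains about 76% of the variance in wins (adjusted R² = 0.7376), and the F-test (p = 2.506e-08) indicates that the predictors are jointly useful.

FIP (Fielding Independent Pitching) takes into account only the outcomes a pitcher can control, namely strikeouts, walks, hit-by-pitches, and home runs allowed, and ignores the impact of the defense behind them. By doing so, FIP aims to provide a more accurate picture of a pitcher’s performance and ability to prevent runs. It’s the new-school counterpart to Earned Run Average (ERA).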
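
For comparison, here is a hedged sketch of how an interaction between the dichotomous indicator and a quantitative predictor could be specified with R's formula interface. It is not the model fitted above: `teams`, `ERA_better`, and `fit` are hypothetical names, the ten rows are a subset copied from the table, and the indicator is recomputed against the subset mean. The sketch also includes the linear FIP term alongside `I(FIP^2)`, which is a modelling choice assumed here for illustration.

```r
# Ten teams copied from the 2022 table above (a subset, for illustration only)
teams <- data.frame(
  W   = c(111, 106, 101, 99, 92, 78, 68, 65, 60, 55),
  ERA = c(2.80, 2.90, 3.46, 3.30, 3.89, 4.54, 5.08, 4.72, 4.54, 5.00),
  FIP = c(3.45, 3.28, 3.46, 3.56, 3.85, 4.17, 4.38, 4.42, 4.41, 4.98)
)
teams$ERA_better <- factor(
  ifelse(teams$ERA < mean(teams$ERA), "Yes", "No"),
  levels = c("No", "Yes")
)

# ERA_better * FIP expands to ERA_better + FIP + ERA_better:FIP.
# I(FIP^2) adds the quadratic term; I() protects ^ from formula syntax.
fit <- lm(W ~ ERA_better * FIP + I(FIP^2), data = teams)

# The ERA_betterYes:FIP coefficient is the difference in the FIP slope
# between the "Yes" group and the "No" reference group.
summary(fit)
```

Passing `data = teams` instead of writing `teams$W`, `teams$FIP`, and so on keeps the coefficient names short and lets `predict()` work on new data frames.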

plot(model$fitted.values, model$residuals)
abline(0,0)

The residuals don’t look great (they range from about -17 to +15 wins), but the model does capture a strong relationship between the predictors and wins (R² = 0.76), so a linear model is a reasonable starting point. More feature selection and further analysis would definitely be needed, but this is a good start!
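
A fuller residual analysis could add a normal Q-Q plot and a formal normality test to the residuals-vs-fitted plot. The sketch below shows the calls on a deliberately simple stand-in model (wins on FIP alone, ten teams copied from the table), not on the model above; `teams`, `fit`, and `sw` are hypothetical names.

```r
# Ten teams copied from the 2022 table above (a subset, for illustration only)
teams <- data.frame(
  W   = c(111, 106, 101, 99, 92, 78, 68, 65, 60, 55),
  FIP = c(3.45, 3.28, 3.46, 3.56, 3.85, 4.17, 4.38, 4.42, 4.41, 4.98)
)
fit <- lm(W ~ FIP, data = teams)  # stand-in model, assumed for illustration

# Residuals vs. fitted: look for curvature or a funnel shape
plot(fitted(fit), resid(fit), xlab = "Fitted wins", ylab = "Residual")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal residuals
qqnorm(resid(fit))
qqline(resid(fit))

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(resid(fit))
sw$p.value
```

Calling `plot(fit)` on any `lm` object produces the standard set of diagnostic plots in one step, which is a convenient alternative to drawing each one by hand.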