Introducing the dataset.
options(warn = -1)  # suppress warnings in the rendered output
library(knitr)
suppressMessages(library(RCurl))
suppressMessages(library(plyr))
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
suppressMessages(library(scatterplot3d))
training <- read.csv("https://raw.githubusercontent.com/mascotinme/MSDA-IS621/master/moneyball-training-data.csv", header = TRUE, sep = ",")
# NOTE: the URL below is identical to the training URL, so `evaluation` is a copy of `training`
evaluation <- read.csv("https://raw.githubusercontent.com/mascotinme/MSDA-IS621/master/moneyball-training-data.csv", header = TRUE, sep = ",")
str(training)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
dim(training)
## [1] 2276 17
kable(summary(training))
| INDEX | TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. : 0.00 | Min. : 891 | Min. : 69.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. :29.00 | Min. : 1137 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 65.0 | Min. : 52.0 |
| 1st Qu.: 630.8 | 1st Qu.: 71.00 | 1st Qu.:1383 | 1st Qu.:208.0 | 1st Qu.: 34.00 | 1st Qu.: 42.00 | 1st Qu.:451.0 | 1st Qu.: 548.0 | 1st Qu.: 66.0 | 1st Qu.: 38.0 | 1st Qu.:50.50 | 1st Qu.: 1419 | 1st Qu.: 50.0 | 1st Qu.: 476.0 | 1st Qu.: 615.0 | 1st Qu.: 127.0 | 1st Qu.:131.0 |
| Median :1270.5 | Median : 82.00 | Median :1454 | Median :238.0 | Median : 47.00 | Median :102.00 | Median :512.0 | Median : 750.0 | Median :101.0 | Median : 49.0 | Median :58.00 | Median : 1518 | Median :107.0 | Median : 536.5 | Median : 813.5 | Median : 159.0 | Median :149.0 |
| Mean :1268.5 | Mean : 80.79 | Mean :1469 | Mean :241.2 | Mean : 55.25 | Mean : 99.61 | Mean :501.6 | Mean : 735.6 | Mean :124.8 | Mean : 52.8 | Mean :59.36 | Mean : 1779 | Mean :105.7 | Mean : 553.0 | Mean : 817.7 | Mean : 246.5 | Mean :146.4 |
| 3rd Qu.:1915.5 | 3rd Qu.: 92.00 | 3rd Qu.:1537 | 3rd Qu.:273.0 | 3rd Qu.: 72.00 | 3rd Qu.:147.00 | 3rd Qu.:580.0 | 3rd Qu.: 930.0 | 3rd Qu.:156.0 | 3rd Qu.: 62.0 | 3rd Qu.:67.00 | 3rd Qu.: 1682 | 3rd Qu.:150.0 | 3rd Qu.: 611.0 | 3rd Qu.: 968.0 | 3rd Qu.: 249.2 | 3rd Qu.:164.0 |
| Max. :2535.0 | Max. :146.00 | Max. :2554 | Max. :458.0 | Max. :223.00 | Max. :264.00 | Max. :878.0 | Max. :1399.0 | Max. :697.0 | Max. :201.0 | Max. :95.00 | Max. :30132 | Max. :343.0 | Max. :3645.0 | Max. :19278.0 | Max. :1898.0 | Max. :228.0 |
| NA | NA | NA | NA | NA | NA | NA | NA's :102 | NA's :131 | NA's :772 | NA's :2085 | NA | NA | NA | NA's :102 | NA | NA's :286 |
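The NA's row of the summary shows that several columns are incomplete, with TEAM_BATTING_HBP missing in 2085 of the 2276 rows. A minimal sketch of how missing values and complete cases can be counted, using a small made-up data frame (`toy` and its columns are hypothetical and are not part of the moneyball data):

```r
# Hypothetical toy data: 'hbp' has missing values, like TEAM_BATTING_HBP
toy <- data.frame(
  wins = c(70, 82, 75, 90, 66, 88, 79, 85),
  hits = c(1400, 1450, 1380, 1520, 1350, 1500, 1430, 1470),
  hbp  = c(NA, 45, NA, 60, NA, 52, 48, 55)
)

na_counts  <- colSums(is.na(toy))        # missing values per column
n_complete <- sum(complete.cases(toy))   # rows with no NA in any column

# lm() drops incomplete rows by default (na.action = na.omit),
# so a model that uses every column is fitted on the complete cases only
fit_toy <- lm(wins ~ ., data = toy)
n_used  <- nobs(fit_toy)

print(na_counts)
print(c(complete = n_complete, used_by_lm = n_used))
```

Running the same two counts on `training` would show how many rows a model using all columns can actually draw on.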
The multiple linear regression equation for the data analysis is:
\(Y = B_0 + B_1 x_1 + B_2 x_2 + \dots + B_n x_n + e\)
where
\(Y\) = response (dependent) variable,
\(x_1, \dots, x_n\) = explanatory (independent) variables,
\(B_0\) = intercept,
\(B_1, \dots, B_n\) = slopes of the independent variables (model parameters),
\(e\) = error term (the difference between an actual and a predicted value of \(Y\)).
In terms of the training dataset, where \(Y\) is TARGET_WINS, this can be rewritten as:
\(Y = B_0 + B_{\text{team-batting-H}}\, x_{\text{team-batting-H}} + \dots + B_{\text{team-fielding-DP}}\, x_{\text{team-fielding-DP}} + e\)
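To make the equation concrete, the sketch below estimates the coefficients of a two-predictor model by solving the normal equations \((X^{T}X)\hat{B} = X^{T}Y\) and compares the result with `lm()`. The data are simulated, and `x1`, `x2` and the true coefficients are assumptions chosen only for illustration:

```r
set.seed(1)

# Simulated data; the true coefficients 2, 3 and -1.5 are illustrative assumptions
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)

# Design matrix with a leading column of ones for the intercept B0
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Solve the normal equations (X'X) B = X'y
b_manual <- drop(solve(crossprod(X), crossprod(X, y)))

# lm() returns the same estimates (it uses a QR decomposition internally)
b_lm <- coef(lm(y ~ x1 + x2))

print(rbind(manual = b_manual, lm = b_lm))
```

Both rows agree up to floating-point error, which is why the report can rely on `lm()` for the least-squares fit.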
A glimpse at the multiple linear regression Analysis:
fit1 <- lm(TARGET_WINS ~ . - INDEX, data = training) # INDEX is intentionally omitted: it is a row identifier, not a predictor
summary(fit1)
##
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8708 -5.6564 -0.0599 5.2545 22.9274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.28826 19.67842 3.064 0.00253 **
## TEAM_BATTING_H 1.91348 2.76139 0.693 0.48927
## TEAM_BATTING_2B 0.02639 0.03029 0.871 0.38484
## TEAM_BATTING_3B -0.10118 0.07751 -1.305 0.19348
## TEAM_BATTING_HR -4.84371 10.50851 -0.461 0.64542
## TEAM_BATTING_BB -4.45969 3.63624 -1.226 0.22167
## TEAM_BATTING_SO 0.34196 2.59876 0.132 0.89546
## TEAM_BASERUN_SB 0.03304 0.02867 1.152 0.25071
## TEAM_BASERUN_CS -0.01104 0.07143 -0.155 0.87730
## TEAM_BATTING_HBP 0.08247 0.04960 1.663 0.09815 .
## TEAM_PITCHING_H -1.89096 2.76095 -0.685 0.49432
## TEAM_PITCHING_HR 4.93043 10.50664 0.469 0.63946
## TEAM_PITCHING_BB 4.51089 3.63372 1.241 0.21612
## TEAM_PITCHING_SO -0.37364 2.59705 -0.144 0.88577
## TEAM_FIELDING_E -0.17204 0.04140 -4.155 5.08e-05 ***
## TEAM_FIELDING_DP -0.10819 0.03654 -2.961 0.00349 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.467 on 175 degrees of freedom
## (2085 observations deleted due to missingness)
## Multiple R-squared: 0.5501, Adjusted R-squared: 0.5116
## F-statistic: 14.27 on 15 and 175 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit1)  # standard diagnostic plots for the fitted model
# Current plotly versions expect columns as formulas (~column)
plot_ly(data = training, x = ~TEAM_BATTING_2B, y = ~TARGET_WINS, type = "scatter", mode = "markers",
        marker = list(color = "blue"))
plot(TARGET_WINS ~ TEAM_BATTING_2B, data = training)
fitline <- lm(TARGET_WINS ~ TEAM_BATTING_2B, data = training)
abline(fitline)
cor(training$TARGET_WINS, training$TEAM_BATTING_2B)
## [1] 0.2891036
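In a regression with a single predictor, the squared correlation equals the R-squared of the fitted line, so a correlation of 0.2891 means doubles alone explain roughly 8% of the variance in wins (0.2891036 squared is about 0.0836). A small sketch on simulated data (the variables `x` and `y` are hypothetical):

```r
set.seed(42)

# Simulated predictor and response with a weak linear relationship
x <- rnorm(100)
y <- 1 + 0.3 * x + rnorm(100)

r  <- cor(x, y)                        # Pearson correlation
r2 <- summary(lm(y ~ x))$r.squared     # R-squared of the simple regression

print(c(r_squared_from_cor = r^2, r_squared_from_lm = r2))
```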
A 3D Scatterplot display for TARGET_WINS, TEAM_BATTING_2B and TEAM_BATTING_BB
attach(training)
# Run this call to display the 3D scatterplot
scatterplot3d(TARGET_WINS, TEAM_BATTING_2B, TEAM_BATTING_BB, pch = 20, highlight.3d = TRUE, type = "h", main = "3D ScatterPlots")
hist(training$TARGET_WINS, col = "green")
hist(training$TEAM_FIELDING_DP, col = "blue")
The full model above shows that several variables do not contribute meaningfully to the analysis: most of their coefficients have large p-values. We therefore use a model selection procedure to find a smaller model.
We select the model by stepwise selection in both directions (forward and backward), with AIC as the criterion.
stepwise <- step(fit1, direction = "both") # model selection using both forward and backward steps
## Start: AIC=831.31
## TARGET_WINS ~ (INDEX + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_BASERUN_CS + TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP) -
## INDEX
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_SO 1 1.24 12547 829.33
## - TEAM_PITCHING_SO 1 1.48 12547 829.33
## - TEAM_BASERUN_CS 1 1.71 12548 829.34
## - TEAM_BATTING_HR 1 15.23 12561 829.54
## - TEAM_PITCHING_HR 1 15.79 12562 829.55
## - TEAM_PITCHING_H 1 33.63 12580 829.82
## - TEAM_BATTING_H 1 34.42 12580 829.83
## - TEAM_BATTING_2B 1 54.41 12600 830.14
## - TEAM_BASERUN_SB 1 95.22 12641 830.76
## - TEAM_BATTING_BB 1 107.84 12654 830.95
## - TEAM_PITCHING_BB 1 110.48 12656 830.99
## - TEAM_BATTING_3B 1 122.16 12668 831.16
## <none> 12546 831.31
## - TEAM_BATTING_HBP 1 198.21 12744 832.31
## - TEAM_FIELDING_DP 1 628.49 13174 838.65
## - TEAM_FIELDING_E 1 1237.79 13784 847.28
##
## Step: AIC=829.33
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS +
## TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
## TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BASERUN_CS 1 1.59 12549 827.35
## - TEAM_BATTING_HR 1 15.82 12563 827.57
## - TEAM_PITCHING_HR 1 16.39 12564 827.58
## - TEAM_BATTING_2B 1 53.47 12601 828.14
## - TEAM_PITCHING_H 1 88.45 12636 828.67
## - TEAM_BATTING_H 1 90.30 12637 828.70
## - TEAM_BASERUN_SB 1 94.19 12641 828.76
## - TEAM_BATTING_BB 1 107.95 12655 828.97
## - TEAM_PITCHING_BB 1 110.60 12658 829.01
## - TEAM_BATTING_3B 1 122.20 12669 829.18
## <none> 12547 829.33
## - TEAM_BATTING_HBP 1 197.11 12744 830.31
## + TEAM_BATTING_SO 1 1.24 12546 831.31
## - TEAM_FIELDING_DP 1 630.68 13178 836.70
## - TEAM_FIELDING_E 1 1240.80 13788 845.34
## - TEAM_PITCHING_SO 1 1312.89 13860 846.34
##
## Step: AIC=827.35
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BATTING_HBP +
## TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_HR 1 16.06 12565 825.60
## - TEAM_PITCHING_HR 1 16.64 12565 825.61
## - TEAM_BATTING_2B 1 53.05 12602 826.16
## - TEAM_PITCHING_H 1 90.24 12639 826.72
## - TEAM_BATTING_H 1 92.13 12641 826.75
## - TEAM_BATTING_BB 1 110.31 12659 827.03
## - TEAM_PITCHING_BB 1 113.00 12662 827.07
## - TEAM_BASERUN_SB 1 123.42 12672 827.22
## - TEAM_BATTING_3B 1 129.33 12678 827.31
## <none> 12549 827.35
## - TEAM_BATTING_HBP 1 197.23 12746 828.33
## + TEAM_BASERUN_CS 1 1.59 12547 829.33
## + TEAM_BATTING_SO 1 1.12 12548 829.34
## - TEAM_FIELDING_DP 1 635.62 13184 834.79
## - TEAM_PITCHING_SO 1 1311.88 13861 844.35
## - TEAM_FIELDING_E 1 1322.05 13871 844.49
##
## Step: AIC=825.6
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BATTING_HBP + TEAM_PITCHING_H +
## TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_2B 1 55.48 12620 824.44
## - TEAM_PITCHING_H 1 89.26 12654 824.95
## - TEAM_BATTING_H 1 91.97 12657 824.99
## - TEAM_BATTING_BB 1 104.58 12669 825.18
## - TEAM_PITCHING_BB 1 107.19 12672 825.22
## <none> 12565 825.60
## - TEAM_BATTING_3B 1 137.48 12702 825.68
## - TEAM_BASERUN_SB 1 146.90 12712 825.82
## - TEAM_BATTING_HBP 1 200.36 12765 826.62
## + TEAM_BATTING_HR 1 16.06 12549 827.35
## + TEAM_BASERUN_CS 1 1.83 12563 827.57
## + TEAM_BATTING_SO 1 1.67 12563 827.57
## - TEAM_FIELDING_DP 1 628.95 13194 832.93
## - TEAM_PITCHING_HR 1 853.54 13418 836.15
## - TEAM_PITCHING_SO 1 1316.68 13882 842.63
## - TEAM_FIELDING_E 1 1333.15 13898 842.86
##
## Step: AIC=824.44
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB +
## TEAM_BASERUN_SB + TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_PITCHING_H 1 84.47 12705 823.71
## - TEAM_BATTING_H 1 87.79 12708 823.76
## - TEAM_BATTING_BB 1 98.92 12719 823.93
## - TEAM_PITCHING_BB 1 101.48 12722 823.97
## - TEAM_BASERUN_SB 1 109.27 12730 824.09
## <none> 12620 824.44
## - TEAM_BATTING_3B 1 147.01 12767 824.65
## - TEAM_BATTING_HBP 1 204.39 12825 825.51
## + TEAM_BATTING_2B 1 55.48 12565 825.60
## + TEAM_BATTING_HR 1 18.48 12602 826.16
## + TEAM_BASERUN_CS 1 1.38 12619 826.42
## + TEAM_BATTING_SO 1 0.55 12620 826.43
## - TEAM_FIELDING_DP 1 649.12 13269 832.02
## - TEAM_PITCHING_HR 1 812.92 13433 834.36
## - TEAM_PITCHING_SO 1 1262.90 13883 840.66
## - TEAM_FIELDING_E 1 1379.34 14000 842.25
##
## Step: AIC=823.71
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB +
## TEAM_BASERUN_SB + TEAM_BATTING_HBP + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
## TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_BB 1 32.85 12738 822.21
## - TEAM_PITCHING_BB 1 43.42 12748 822.37
## - TEAM_BASERUN_SB 1 105.16 12810 823.29
## <none> 12705 823.71
## - TEAM_BATTING_3B 1 153.13 12858 824.00
## + TEAM_PITCHING_H 1 84.47 12620 824.44
## - TEAM_BATTING_HBP 1 183.82 12888 824.46
## + TEAM_BATTING_SO 1 62.04 12643 824.78
## + TEAM_BATTING_2B 1 50.69 12654 824.95
## + TEAM_BATTING_HR 1 12.25 12692 825.53
## + TEAM_BASERUN_CS 1 3.11 12702 825.67
## - TEAM_BATTING_H 1 504.11 13209 829.15
## - TEAM_FIELDING_DP 1 602.80 13308 830.57
## - TEAM_PITCHING_HR 1 850.25 13555 834.09
## - TEAM_PITCHING_SO 1 1259.72 13964 839.77
## - TEAM_FIELDING_E 1 1419.39 14124 841.94
##
## Step: AIC=822.21
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BASERUN_SB +
## TEAM_BATTING_HBP + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
## TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BASERUN_SB 1 109.99 12848 821.85
## <none> 12738 822.21
## - TEAM_BATTING_3B 1 156.45 12894 822.54
## - TEAM_BATTING_HBP 1 186.58 12924 822.98
## + TEAM_BATTING_2B 1 48.63 12689 823.48
## + TEAM_BATTING_BB 1 32.85 12705 823.71
## + TEAM_BATTING_HR 1 22.99 12715 823.86
## + TEAM_PITCHING_H 1 18.40 12719 823.93
## + TEAM_BATTING_SO 1 17.51 12720 823.94
## + TEAM_BASERUN_CS 1 3.86 12734 824.15
## - TEAM_BATTING_H 1 485.67 13223 827.35
## - TEAM_FIELDING_DP 1 623.19 13361 829.33
## - TEAM_PITCHING_HR 1 843.83 13581 832.46
## - TEAM_PITCHING_SO 1 1267.25 14005 838.32
## - TEAM_FIELDING_E 1 1395.02 14133 840.06
## - TEAM_PITCHING_BB 1 2364.81 15102 852.73
##
## Step: AIC=821.85
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_HBP +
## TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_3B 1 133.47 12981 821.82
## <none> 12848 821.85
## + TEAM_BASERUN_SB 1 109.99 12738 822.21
## - TEAM_BATTING_HBP 1 177.11 13025 822.46
## + TEAM_BATTING_BB 1 37.69 12810 823.29
## + TEAM_BATTING_HR 1 30.72 12817 823.39
## + TEAM_BASERUN_CS 1 23.16 12824 823.51
## + TEAM_PITCHING_H 1 22.34 12825 823.52
## + TEAM_BATTING_SO 1 21.53 12826 823.53
## + TEAM_BATTING_2B 1 14.11 12834 823.64
## - TEAM_BATTING_H 1 566.11 13414 828.09
## - TEAM_FIELDING_DP 1 737.46 13585 830.51
## - TEAM_PITCHING_HR 1 756.49 13604 830.78
## - TEAM_PITCHING_SO 1 1257.91 14106 837.69
## - TEAM_FIELDING_E 1 1330.40 14178 838.67
## - TEAM_PITCHING_BB 1 2371.12 15219 852.20
##
## Step: AIC=821.82
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## <none> 12981 821.82
## + TEAM_BATTING_3B 1 133.47 12848 821.85
## + TEAM_BASERUN_SB 1 87.02 12894 822.54
## - TEAM_BATTING_HBP 1 228.70 13210 823.16
## + TEAM_BATTING_BB 1 40.42 12941 823.23
## + TEAM_BATTING_HR 1 33.83 12947 823.33
## + TEAM_PITCHING_H 1 23.95 12957 823.47
## + TEAM_BATTING_SO 1 23.13 12958 823.48
## + TEAM_BATTING_2B 1 21.28 12960 823.51
## + TEAM_BASERUN_CS 1 7.07 12974 823.72
## - TEAM_BATTING_H 1 449.87 13431 826.33
## - TEAM_FIELDING_DP 1 813.17 13794 831.43
## - TEAM_PITCHING_HR 1 990.20 13971 833.86
## - TEAM_PITCHING_SO 1 1316.56 14298 838.27
## - TEAM_FIELDING_E 1 1334.60 14316 838.52
## - TEAM_PITCHING_BB 1 2583.00 15564 854.49
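The AIC values in the trace can be reproduced by hand. For a linear model, `step()` ranks candidates with `extractAIC()`, which computes n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients; this differs from `AIC()` only by an additive constant. A sketch using the rounded RSS values printed above (`step_aic` is a hypothetical helper name):

```r
# AIC as reported by step() for a linear model: n * log(RSS / n) + 2 * k
step_aic <- function(rss, n, k) n * log(rss / n) + 2 * k

n_obs <- 2276 - 2085   # rows left after the incomplete cases are dropped

aic_start <- step_aic(rss = 12546, n = n_obs, k = 16)  # 15 predictors + intercept
aic_final <- step_aic(rss = 12981, n = n_obs, k = 8)   #  7 predictors + intercept

print(round(c(start = aic_start, final = aic_final), 2))
```

Because the RSS values in the trace are rounded, the results match the printed AIC values only to about two decimals. The final model has a larger RSS than the full model but a lower AIC, since it pays the 2 * k penalty for far fewer coefficients.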
summary(stepwise)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP +
## TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2248 -5.6294 -0.0212 5.0439 21.3065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.95454 19.10292 3.191 0.001670 **
## TEAM_BATTING_H 0.02541 0.01009 2.518 0.012648 *
## TEAM_BATTING_HBP 0.08712 0.04852 1.796 0.074211 .
## TEAM_PITCHING_HR 0.08945 0.02394 3.736 0.000249 ***
## TEAM_PITCHING_BB 0.05672 0.00940 6.034 8.66e-09 ***
## TEAM_PITCHING_SO -0.03136 0.00728 -4.308 2.68e-05 ***
## TEAM_FIELDING_E -0.17218 0.03970 -4.338 2.38e-05 ***
## TEAM_FIELDING_DP -0.11904 0.03516 -3.386 0.000869 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.422 on 183 degrees of freedom
## (2085 observations deleted due to missingness)
## Multiple R-squared: 0.5345, Adjusted R-squared: 0.5167
## F-statistic: 30.02 on 7 and 183 DF, p-value: < 2.2e-16
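Stepwise selection stops at the model with the lowest AIC (821.82), which keeps seven predictors. Note that this model is fitted on only 191 rows, because 2085 observations were deleted due to missing values.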
# fit2 is a data frame holding the selected columns, not a fitted model
fit2 <- training[, c("TEAM_BATTING_H", "TEAM_PITCHING_HR" , "TEAM_PITCHING_BB", "TEAM_PITCHING_SO", "TEAM_FIELDING_E", "TEAM_FIELDING_DP", "TARGET_WINS", "TEAM_BATTING_HBP")]
par(mfrow=c(2,2))
plot(fit2)  # scatterplot matrix of the selected variables
fit3 <- lm(TARGET_WINS ~ . - TEAM_BATTING_HBP, data = fit2)
summary(fit3)
##
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_BATTING_HBP, data = fit2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.8415 -6.0133 -0.0886 5.3245 22.1650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.466852 19.166412 3.311 0.001117 **
## TEAM_BATTING_H 0.025806 0.010150 2.543 0.011829 *
## TEAM_PITCHING_HR 0.091740 0.024051 3.814 0.000186 ***
## TEAM_PITCHING_BB 0.056080 0.009450 5.935 1.44e-08 ***
## TEAM_PITCHING_SO -0.028885 0.007191 -4.017 8.59e-05 ***
## TEAM_FIELDING_E -0.173892 0.039923 -4.356 2.20e-05 ***
## TEAM_FIELDING_DP -0.121696 0.035340 -3.444 0.000711 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.473 on 184 degrees of freedom
## (2085 observations deleted due to missingness)
## Multiple R-squared: 0.5263, Adjusted R-squared: 0.5109
## F-statistic: 34.07 on 6 and 184 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit3)
The fitted model has the form
\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_n x_n\)
where \(\hat{y}\) is the predicted value of \(y\), and \(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_n\) are the estimated coefficients (the estimates of \(B_0, B_1, \dots, B_n\)). The residual \(\hat{e} = y - \hat{y}\) is the part of \(y\) that the fitted equation leaves unexplained.
INTERPRETATIONS:
The R Squared:
The adjusted R-squared of the full model, before model selection, was 0.5116; after stepwise selection it was 0.5167. TEAM_BATTING_HBP was not significant at the 5% level (p = 0.074) and was removed. The final adjusted R-squared is 0.5109, so dropping TEAM_BATTING_HBP costs almost nothing in fit, and the remaining coefficients barely change.
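The adjusted R-squared values can be checked from the reported R-squared, the number of observations used (191) and the number of predictors p, using 1 - (1 - R²)(n - 1)/(n - p - 1). A short sketch (`adj_r2` is a hypothetical helper, and the inputs are the rounded values from the summaries):

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_obs <- 191  # 2276 rows minus 2085 deleted due to missingness

adj <- c(
  full     = adj_r2(0.5501, n_obs, 15),  # fit1
  stepwise = adj_r2(0.5345, n_obs, 7),   # stepwise model
  final    = adj_r2(0.5263, n_obs, 6)    # fit3
)

print(round(adj, 4))
```

Small differences in the last digit against the printed summaries can arise because the R-squared inputs are already rounded to four decimals.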
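The P-Value:
The overall F-test of the final model has a p-value below 2.2e-16, and each of the six retained predictors is significant at the 5% level.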
The least squares prediction is:
\(\hat{y} = 63.4669 + 0.0258\,\text{TEAM\_BATTING\_H} + 0.0917\,\text{TEAM\_PITCHING\_HR} + 0.0561\,\text{TEAM\_PITCHING\_BB} - 0.0289\,\text{TEAM\_PITCHING\_SO} - 0.1739\,\text{TEAM\_FIELDING\_E} - 0.1217\,\text{TEAM\_FIELDING\_DP}\)
The coefficient interpretations: first-order quantitative variables
If TEAM_BATTING_H increases by one unit while the other variables are held constant, the mean value of Y increases by 0.0258. The same reading applies to the other variables, with the sign of each coefficient giving the direction of the change.
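The one-unit interpretation can be illustrated by evaluating the fitted equation twice. In the sketch below the coefficients are copied from the `summary(fit3)` output, while `predict_wins` and the example team (set to the column medians from the summary table) are hypothetical and serve only as an illustration:

```r
# Coefficients copied from the summary(fit3) output
b <- c(
  "(Intercept)"    = 63.466852,
  TEAM_BATTING_H   = 0.025806,
  TEAM_PITCHING_HR = 0.091740,
  TEAM_PITCHING_BB = 0.056080,
  TEAM_PITCHING_SO = -0.028885,
  TEAM_FIELDING_E  = -0.173892,
  TEAM_FIELDING_DP = -0.121696
)

# Hypothetical helper: intercept plus the sum of coefficient * value
predict_wins <- function(team) {
  unname(b["(Intercept)"] + sum(b[names(team)] * team))
}

# Hypothetical team, set to the column medians of the training data
team <- c(TEAM_BATTING_H = 1454, TEAM_PITCHING_HR = 107, TEAM_PITCHING_BB = 536.5,
          TEAM_PITCHING_SO = 813.5, TEAM_FIELDING_E = 159, TEAM_FIELDING_DP = 149)

# Same team with one extra hit, everything else unchanged
team_more <- team
team_more["TEAM_BATTING_H"] <- team_more["TEAM_BATTING_H"] + 1

base   <- predict_wins(team)
bumped <- predict_wins(team_more)
print(c(base = base, bumped = bumped, change = bumped - base))
```

The change equals the TEAM_BATTING_H coefficient, whatever values the other variables take, which is exactly what holding them constant means in a model without interaction terms.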
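Analysis of variance (ANOVA) is used here to test the contribution of each variable. The table reports sequential sums of squares, so each row measures what a variable adds after the variables listed above it; the model contains no interaction terms.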
anova(fit3, test = "F")
## Analysis of Variance Table
##
## Response: TARGET_WINS
## Df Sum Sq Mean Sq F value Pr(>F)
## TEAM_BATTING_H 1 6158.8 6158.8 85.787 < 2.2e-16 ***
## TEAM_PITCHING_HR 1 1853.7 1853.7 25.820 9.161e-07 ***
## TEAM_PITCHING_BB 1 2573.7 2573.7 35.849 1.095e-08 ***
## TEAM_PITCHING_SO 1 1698.6 1698.6 23.660 2.459e-06 ***
## TEAM_FIELDING_E 1 1541.1 1541.1 21.466 6.800e-06 ***
## TEAM_FIELDING_DP 1 851.3 851.3 11.858 0.0007108 ***
## Residuals 184 13209.8 71.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
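Each F value in the table is the term's mean square divided by the residual mean square, and the p-value is the upper tail of the F distribution with 1 and 184 degrees of freedom. A sketch that recomputes them from the rounded sums of squares printed above:

```r
# Sums of squares copied from the ANOVA table (each term has 1 df)
ss <- c(
  TEAM_BATTING_H   = 6158.8,
  TEAM_PITCHING_HR = 1853.7,
  TEAM_PITCHING_BB = 2573.7,
  TEAM_PITCHING_SO = 1698.6,
  TEAM_FIELDING_E  = 1541.1,
  TEAM_FIELDING_DP = 851.3
)
res_ss <- 13209.8
res_df <- 184

ms_res   <- res_ss / res_df          # residual mean square
f_values <- ss / ms_res              # Mean Sq equals Sum Sq when df = 1
p_values <- pf(f_values, df1 = 1, df2 = res_df, lower.tail = FALSE)

print(round(cbind(F = f_values, p = p_values), 6))
```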
The 95% confidence intervals of the coefficients, used to check whether any interval contains zero:
confint(fit3)
## 2.5 % 97.5 %
## (Intercept) 25.652660685 101.28104402
## TEAM_BATTING_H 0.005781302 0.04583131
## TEAM_PITCHING_HR 0.044287901 0.13919156
## TEAM_PITCHING_BB 0.037436027 0.07472318
## TEAM_PITCHING_SO -0.043071941 -0.01469730
## TEAM_FIELDING_E -0.252657614 -0.09512704
## TEAM_FIELDING_DP -0.191420320 -0.05197243
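None of the intervals contains zero. For a linear model, each interval is the estimate plus or minus the t critical value times the standard error. The sketch below rebuilds the TEAM_BATTING_H interval from the rounded estimate and standard error in the `summary(fit3)` output:

```r
# Estimate and standard error of TEAM_BATTING_H from summary(fit3)
est    <- 0.025806
se     <- 0.010150
df_res <- 184            # residual degrees of freedom

t_crit <- qt(0.975, df = df_res)         # two-sided 95% critical value
ci     <- est + c(-1, 1) * t_crit * se   # lower and upper bounds

print(round(c(lower = ci[1], upper = ci[2]), 6))
```

The result agrees with the `confint(fit3)` row for TEAM_BATTING_H up to the rounding of the inputs.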