DATA 621: HW 1
David Quarshie - Group 3
Overview
In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season. Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team.
Data Exploration
When it comes to sports, the most important thing a team can do is win. At the end of the day the fans, the coaches, the players, and the management team all want their team to get more wins. But what goes into making a team a winning team? For years coaches and scouts from various sports have looked at certain stats that they’ve determined are most important to getting wins. But now with data science we can look at historical measures and see which variables are indicative of winning.
We have been given two datasets that show us the number of baseball team wins from 1871 to 2006. The wins also come with stats for each team, such as base hits, home runs, errors, and walks. Our goal is to use one dataset to develop and train a model that will help us find which stats get us wins, and to use the other dataset to see how our model performs. Let’s take a look at the data we’ll be developing our model with.
Data Snapshot
Columns
Below we can see the column names we have in our train dataset. Every column except INDEX is a candidate for the model. INDEX is just a row identifier, so we can remove it and move on to working with the data.
## [1] "INDEX" "TARGET_WINS" "TEAM_BATTING_H"
## [4] "TEAM_BATTING_2B" "TEAM_BATTING_3B" "TEAM_BATTING_HR"
## [7] "TEAM_BATTING_BB" "TEAM_BATTING_SO" "TEAM_BASERUN_SB"
## [10] "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H"
## [13] "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_PITCHING_SO"
## [16] "TEAM_FIELDING_E" "TEAM_FIELDING_DP"
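A minimal sketch of this step, assuming the training data already sits in a data frame. The small `train` frame below is made-up stand-in data (only three of the real columns), not the actual file.

```r
# Stand-in for the training data (hypothetical values; the real data has 17 columns).
train <- data.frame(
  INDEX          = 1:3,
  TARGET_WINS    = c(39, 70, 86),
  TEAM_BATTING_H = c(1445, 1339, 1377)
)

# List the column names, then drop the row identifier.
names(train)
train <- train[, names(train) != "INDEX", drop = FALSE]

# Rows and columns that remain.
dim(train)
```

Selecting by name rather than by position keeps the step correct even if the column order changes.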
Data Sample
Here we can see a sample of the data after removing the INDEX column.
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 39 1445 194 39
## 2 70 1339 219 22
## 3 86 1377 232 35
## 4 70 1387 209 38
## 5 82 1297 186 27
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 13 143 842 NA
## 2 190 685 1075 37
## 3 137 602 917 46
## 4 96 451 922 43
## 5 102 472 920 49
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 NA NA 9364 84
## 2 28 NA 1347 191
## 3 27 NA 1377 137
## 4 30 NA 1396 97
## 5 39 NA 1297 102
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 927 5456 1011 NA
## 2 689 1082 193 155
## 3 602 917 175 153
## 4 454 928 164 156
## 5 472 920 138 168
Dimensions
The dimensions of the dataset are below. We’re working with 2,276 rows of data with 16 columns.
## [1] 2276 16
Data Summary
Let’s also take a look at the data’s summary. This will allow us to see some basic information like mean, minimum, and maximum. For example, we see that the average number of wins for a team is around 81.
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
##
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## NA's :102 NA's :131
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0
## Median : 49.0 Median :58.00 Median : 1518 Median :107.0
## Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7
## 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0
## NA's :772 NA's :2085
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
## 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
## NA's :102 NA's :286
Variable Scatterplots
We can quickly make scatterplots for each variable, allowing us to see how they are distributed and see if there is any skewness and/or outliers.
Wins Histogram
Focusing on our response variable, TARGET_WINS, we can make a histogram for the variable and investigate its distribution. Our histogram shows us that the distribution is roughly normal and centered around the mean of about 81, with a slight left skew (skew of -0.4 in the statistics below). We also see that there are some outliers, with the max wins being at 146 and the min being at 0. Seeing as how no team in the history of the MLB has ever gone winless, we can take note that there may be some errors in the data.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 2276 80.79 15.75 82 81.31 14.83 0 146 146 -0.4 1.03
## se
## X1 0.33
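The statistics above can be approximated in base R. This is a sketch on made-up win totals, and the `skewness` helper is a hypothetical name that uses the plain moment estimator, so its value may differ slightly from the one printed by whichever summary function produced the table above.

```r
# Hypothetical win totals standing in for TARGET_WINS.
wins <- c(0, 55, 68, 71, 75, 80, 82, 84, 88, 92, 95, 101, 146)

# Moment-based skewness: third central moment over the 1.5 power of the second.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

wins_mean <- mean(wins)
wins_sd   <- sd(wins)
wins_skew <- skewness(wins)

# Histogram counts without drawing; call hist(wins) to draw the plot itself.
wins_hist <- hist(wins, breaks = seq(0, 150, by = 10), plot = FALSE)
```

A negative skew means the left tail (unusually low win totals) is longer than the right one.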
Data Preparation
No dataset is perfect. So with that in mind we’ll take a look at what we have and try to edit it as much as we can before we begin to build a model.
Missing Values
When we first took a look at our data summary we could see that several fields had missing values. A field that is missing for most teams cannot tell us much about how many wins a team gets. Let’s see which variables have a large share of missing values. The output below includes the percentage of rows missing for each field.
## TEAM_BATTING_HBP TEAM_BASERUN_CS TEAM_FIELDING_DP TEAM_BASERUN_SB
## 91.608084 33.919156 12.565905 5.755712
## TEAM_PITCHING_SO TEAM_BATTING_SO TEAM_FIELDING_E TEAM_PITCHING_BB
## 4.481547 4.481547 0.000000 0.000000
## TEAM_PITCHING_HR TEAM_PITCHING_H TEAM_BATTING_BB TEAM_BATTING_HR
## 0.000000 0.000000 0.000000 0.000000
## TEAM_BATTING_3B TEAM_BATTING_2B TEAM_BATTING_H TARGET_WINS
## 0.000000 0.000000 0.000000 0.000000
Our results show us that HBP (about 92% missing), CS (about 34%), and DP (about 13%) have the highest share of missing values, so it’s in our best interest to remove those fields from the data.
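The missing-value percentages can be computed in one line. The sketch below uses a made-up four-row frame, and the 10% cutoff is an illustrative assumption that happens to separate the three dropped fields from the rest in the output above.

```r
# Made-up frame with missing values; column names follow the report.
train <- data.frame(
  TARGET_WINS      = c(70, 86, 70, 82),
  TEAM_BATTING_HBP = c(NA, NA, NA, 55),
  TEAM_BASERUN_CS  = c(NA, 28, 27, 30),
  TEAM_FIELDING_E  = c(193, 175, 164, 138)
)

# Percentage of missing values per column, largest first.
pct_missing <- sort(colMeans(is.na(train)) * 100, decreasing = TRUE)

# Keep only columns whose missing share is at or below the cutoff (assumed value).
cutoff <- 10
train_reduced <- train[, pct_missing[names(train)] <= cutoff, drop = FALSE]
```

`is.na()` returns a logical matrix, so `colMeans()` gives the fraction of missing entries in each column directly.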
Outliers
After getting rid of the fields with a high share of missing values, we still need to deal with fields that have extreme outliers. These outliers pull the fitted coefficients toward values that are far outside the norm, which distorts our predictions.
Looking at the summary and the plots below we see that PITCHING_H, PITCHING_BB, PITCHING_SO, and FIELDING_E are all skewed by their outliers. We also have some fields with a few missing values. Our plan is to replace any value that is more than 3 standard deviations above the mean with that field’s median, and to impute any leftover missing values as well.
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
##
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## NA's :102 NA's :131
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## Min. : 1137 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 1419 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0
## Median : 1518 Median :107.0 Median : 536.5 Median : 813.5
## Mean : 1779 Mean :105.7 Mean : 553.0 Mean : 817.7
## 3rd Qu.: 1682 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0
## Max. :30132 Max. :343.0 Max. :3645.0 Max. :19278.0
## NA's :102
## TEAM_FIELDING_E
## Min. : 65.0
## 1st Qu.: 127.0
## Median : 159.0
## Mean : 246.5
## 3rd Qu.: 249.2
## Max. :1898.0
##
Impute
##
## iter imp variable
## 1 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 1 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 1 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 1 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 1 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 2 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 2 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 2 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 2 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 2 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 3 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 3 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 3 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 3 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 3 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 4 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 4 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 4 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 4 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 4 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 5 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 5 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 5 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 5 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 5 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 541.8 1st Qu.: 67.0
## Median :102.00 Median :512.0 Median : 732.0 Median :105.5
## Mean : 99.61 Mean :501.6 Mean : 727.3 Mean :136.3
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:170.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## Min. : 1137 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 1419 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 611.0
## Median : 1518 Median :107.0 Median : 536.5 Median : 801.5
## Mean : 1779 Mean :105.7 Mean : 553.0 Mean : 809.8
## 3rd Qu.: 1682 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 957.2
## Max. :30132 Max. :343.0 Max. :3645.0 Max. :19278.0
## TEAM_FIELDING_E
## Min. : 65.0
## 1st Qu.: 127.0
## Median : 159.0
## Mean : 246.5
## 3rd Qu.: 249.2
## Max. :1898.0
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 541.8 1st Qu.: 67.0
## Median :102.00 Median :512.0 Median : 732.0 Median :105.5
## Mean : 99.61 Mean :501.6 Mean : 727.3 Mean :136.3
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:170.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## Min. :1137 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:1419 1st Qu.: 50.0 1st Qu.:476.0 1st Qu.: 611.0
## Median :1518 Median :107.0 Median :536.5 Median : 801.2
## Mean :1605 Mean :105.7 Mean :500.4 Mean : 787.8
## 3rd Qu.:1660 3rd Qu.:150.0 3rd Qu.:536.5 3rd Qu.: 954.0
## Max. :4134 Max. :343.0 Max. :536.5 Max. :1600.0
## TEAM_FIELDING_E
## Min. : 65.0
## 1st Qu.:127.0
## Median :159.0
## Mean :198.9
## 3rd Qu.:215.0
## Max. :681.0
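The 3-standard-deviation rule described in the Outliers section could be sketched as follows. `cap_high_outliers` is a hypothetical helper name and the data is made up; this is one way to implement the stated rule, not necessarily the code that produced the summaries above.

```r
# Replace values more than k standard deviations above the mean with the median.
# NAs are left untouched so they can be imputed separately.
cap_high_outliers <- function(x, k = 3) {
  threshold <- mean(x, na.rm = TRUE) + k * sd(x, na.rm = TRUE)
  x[!is.na(x) & x > threshold] <- median(x, na.rm = TRUE)
  x
}

# Made-up pitching-hits values with one extreme entry at the end.
pitching_h <- c(rep(c(1400, 1450, 1500, 1550, 1600), 4), 30132)
cleaned <- cap_high_outliers(pitching_h)
```

Because the mean and standard deviation are themselves inflated by the outliers, this rule only catches the most extreme values, and a second pass may flag more.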
Build Models
Now that our data is in a state we can work with, we can go ahead and build some models. Linear model building is centered around finding the variables that contribute most to the value we want to predict. In our case we’re looking for the variables that go into helping a baseball team win games. For starters, let’s take all the variables we have in our data and use R’s lm function to see what formula we come up with.
When creating a linear model for wins, our results (shown below) tell us some interesting things. Each variable comes with an estimate that will be its coefficient in the formula, and also comes with a p-value which can be used to determine how useful the variable is in our formula. The lower the p-value, the better. Looking at the coefficients we can say that when using all of the variables our formula for wins is:
\[ \mathrm{wins} = 8.698 + 0.034 \cdot \mathrm{BATTING\_H} - 0.012 \cdot \mathrm{BATTING\_2B} + 0.087 \cdot \mathrm{BATTING\_3B} + 0.098 \cdot \mathrm{BATTING\_HR} + 0.041 \cdot \mathrm{BATTING\_BB} + 0.002 \cdot \mathrm{BATTING\_SO} + 0.043 \cdot \mathrm{BASERUN\_SB} + 0.003 \cdot \mathrm{PITCHING\_H} - 0.038 \cdot \mathrm{PITCHING\_HR} - 0.017 \cdot \mathrm{PITCHING\_BB} - 0.006 \cdot \mathrm{PITCHING\_SO} - 0.024 \cdot \mathrm{FIELDING\_E} \]
We’ve also included the plot of the residuals to make sure that they pass all the assumptions, which they do. We also see that this model has an r-squared of .2855, which is not bad. But there are several variables with high p-values that can be removed, and we’ll do that for the next model.
Model 1
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.963 -8.228 0.507 8.360 52.403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.697984 5.386462 1.615 0.10650
## TEAM_BATTING_H 0.034392 0.003701 9.293 < 2e-16 ***
## TEAM_BATTING_2B -0.012467 0.009355 -1.333 0.18279
## TEAM_BATTING_3B 0.086593 0.017548 4.935 8.62e-07 ***
## TEAM_BATTING_HR 0.097610 0.027018 3.613 0.00031 ***
## TEAM_BATTING_BB 0.040839 0.004367 9.352 < 2e-16 ***
## TEAM_BATTING_SO 0.002274 0.004254 0.535 0.59303
## TEAM_BASERUN_SB 0.042814 0.004017 10.658 < 2e-16 ***
## TEAM_PITCHING_H 0.002615 0.001177 2.221 0.02643 *
## TEAM_PITCHING_HR -0.037909 0.023861 -1.589 0.11226
## TEAM_PITCHING_BB -0.017132 0.007321 -2.340 0.01936 *
## TEAM_PITCHING_SO -0.006082 0.003459 -1.759 0.07878 .
## TEAM_FIELDING_E -0.023825 0.003437 -6.931 5.41e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.35 on 2263 degrees of freedom
## Multiple R-squared: 0.2855, Adjusted R-squared: 0.2817
## F-statistic: 75.37 on 12 and 2263 DF, p-value: < 2.2e-16
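A minimal sketch of the call shown above, run on synthetic data. The two-predictor `train_final` here is made up for illustration, and the seed and coefficients used to generate it are arbitrary.

```r
set.seed(621)  # arbitrary seed so the synthetic data is reproducible

# Synthetic stand-in for train_final: wins driven by hits and errors plus noise.
n <- 200
train_final <- data.frame(
  TEAM_BATTING_H  = rnorm(n, mean = 1450, sd = 100),
  TEAM_FIELDING_E = rnorm(n, mean = 160, sd = 40)
)
train_final$TARGET_WINS <- 20 + 0.05 * train_final$TEAM_BATTING_H -
  0.03 * train_final$TEAM_FIELDING_E + rnorm(n, sd = 10)

# "TARGET_WINS ~ ." regresses the response on every other column in the data.
model1 <- lm(TARGET_WINS ~ ., data = train_final)

# Coefficient table columns: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- summary(model1)$coefficients
r_squared  <- summary(model1)$r.squared
```

The `.` shorthand is convenient for a first pass, but it silently picks up any column left in the data frame, which is one reason INDEX had to be dropped earlier.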
For this model we’ve dropped TEAM_BATTING_2B, TEAM_PITCHING_HR, and TEAM_PITCHING_BB from the first model in an effort to trim variables with high p-values and make a better fit. When limiting the variables we’re using, we come up with this formula for wins:
\[ \mathrm{wins} = 8.435 + 0.030 \cdot \mathrm{BATTING\_H} + 0.085 \cdot \mathrm{BATTING\_3B} + 0.060 \cdot \mathrm{BATTING\_HR} + 0.033 \cdot \mathrm{BATTING\_BB} + 0.005 \cdot \mathrm{BATTING\_SO} + 0.043 \cdot \mathrm{BASERUN\_SB} - 0.010 \cdot \mathrm{PITCHING\_SO} + 0.003 \cdot \mathrm{PITCHING\_H} - 0.023 \cdot \mathrm{FIELDING\_E} \]
Our residual plots once again look fine, but this time our r-squared has dropped a bit to .2823. Let’s build a third model that drops a few more variables.
Model 2
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_PITCHING_SO + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E,
## data = train_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.983 -8.359 0.529 8.611 52.254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.434924 4.702746 1.794 0.073008 .
## TEAM_BATTING_H 0.029891 0.002872 10.409 < 2e-16 ***
## TEAM_BATTING_3B 0.085491 0.017450 4.899 1.03e-06 ***
## TEAM_BATTING_HR 0.060208 0.009877 6.096 1.28e-09 ***
## TEAM_BATTING_BB 0.032593 0.002909 11.203 < 2e-16 ***
## TEAM_BATTING_SO 0.005152 0.003746 1.375 0.169222
## TEAM_BASERUN_SB 0.043455 0.004008 10.841 < 2e-16 ***
## TEAM_PITCHING_SO -0.010315 0.002933 -3.517 0.000445 ***
## TEAM_PITCHING_H 0.002774 0.001156 2.400 0.016495 *
## TEAM_FIELDING_E -0.023267 0.003407 -6.828 1.10e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.37 on 2266 degrees of freedom
## Multiple R-squared: 0.2823, Adjusted R-squared: 0.2794
## F-statistic: 99.03 on 9 and 2266 DF, p-value: < 2.2e-16
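One way to drop a variable from a fitted model without retyping the whole formula is R’s `update` function. The sketch below uses synthetic data in which TEAM_BATTING_2B has no real effect; it illustrates the technique and is not necessarily how the models above were built.

```r
set.seed(621)  # arbitrary seed for reproducibility
n <- 200
train_final <- data.frame(
  TEAM_BATTING_H  = rnorm(n, 1450, 100),
  TEAM_BATTING_2B = rnorm(n, 240, 40),   # pure noise in this synthetic example
  TEAM_FIELDING_E = rnorm(n, 160, 40)
)
train_final$TARGET_WINS <- 20 + 0.05 * train_final$TEAM_BATTING_H -
  0.03 * train_final$TEAM_FIELDING_E + rnorm(n, sd = 10)

full <- lm(TARGET_WINS ~ ., data = train_final)

# Remove one predictor from the full model; "." means "keep the rest as is".
reduced <- update(full, . ~ . - TEAM_BATTING_2B)

# Dropping a term can never raise plain R-squared, so compare adjusted R-squared.
adj_full    <- summary(full)$adj.r.squared
adj_reduced <- summary(reduced)$adj.r.squared
```

This is why a small drop in r-squared after removing variables is expected and not, by itself, a sign that the smaller model is worse.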
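After removing TEAM_PITCHING_SO and TEAM_PITCHING_H we come up with our final formula for wins:
\[ \mathrm{wins} = 9.906 + 0.031 \cdot \mathrm{BATTING\_H} + 0.083 \cdot \mathrm{BATTING\_3B} + 0.063 \cdot \mathrm{BATTING\_HR} + 0.034 \cdot \mathrm{BATTING\_BB} - 0.006 \cdot \mathrm{BATTING\_SO} + 0.043 \cdot \mathrm{BASERUN\_SB} - 0.022 \cdot \mathrm{FIELDING\_E} \]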
Model 3
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_FIELDING_E, data = train_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.269 -8.325 0.585 8.565 53.497
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.905546 4.623568 2.142 0.03227 *
## TEAM_BATTING_H 0.030996 0.002793 11.100 < 2e-16 ***
## TEAM_BATTING_3B 0.083288 0.017482 4.764 2.02e-06 ***
## TEAM_BATTING_HR 0.062976 0.009849 6.394 1.95e-10 ***
## TEAM_BATTING_BB 0.034259 0.002768 12.378 < 2e-16 ***
## TEAM_BATTING_SO -0.005721 0.002209 -2.589 0.00968 **
## TEAM_BASERUN_SB 0.043376 0.003833 11.317 < 2e-16 ***
## TEAM_FIELDING_E -0.022457 0.003402 -6.600 5.09e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.4 on 2268 degrees of freedom
## Multiple R-squared: 0.2781, Adjusted R-squared: 0.2758
## F-statistic: 124.8 on 7 and 2268 DF, p-value: < 2.2e-16
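As a worked example of how the formula is applied, the sketch below plugs row 2 of the data sample into the Model 3 coefficients printed above. It assumes that row’s values were not changed by the cleaning steps.

```r
# Model 3 coefficients as printed in the summary above.
coefs <- c(
  "(Intercept)"   = 9.905546,
  TEAM_BATTING_H  = 0.030996,
  TEAM_BATTING_3B = 0.083288,
  TEAM_BATTING_HR = 0.062976,
  TEAM_BATTING_BB = 0.034259,
  TEAM_BATTING_SO = -0.005721,
  TEAM_BASERUN_SB = 0.043376,
  TEAM_FIELDING_E = -0.022457
)

# Row 2 of the data sample (actual TARGET_WINS = 70), in the same order as
# the coefficients, with a leading 1 for the intercept.
row2 <- c(1, 1339, 22, 190, 685, 1075, 37, 193)

# The prediction is the sum of coefficient times value over all terms.
predicted_wins <- sum(coefs * row2)
round(predicted_wins, 1)  # about 79.8
```

The model predicts roughly 80 wins for a team that actually won 70, a miss of about 10 wins, which is in line with the residual standard error of 13.4.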
Select Model
After looking at the results from our 3 models we decided to go with model 3 as our final model. Its adjusted r-squared (.2758) is close to model 1’s (.2817) while using fewer variables (7 instead of 12), so we’re confident in this choice. Below are the summary results for model 3 along with the residual plot. Viewing these results, we can say that this model is statistically significant: its F-statistic of 124.8 has a p-value below 2.2e-16.
## [1] "Mean Squared Error: 179.054046475421"
## [1] "Root MSE: 13.3811078194379"
## [1] "Adjusted R-squared: 0.275841665160897"
## [1] "F-statistic: 124.796878202348"
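The four figures above can be pulled from any fitted `lm` object. The sketch below uses synthetic data and assumes the MSE is the mean of the squared residuals, which is consistent with the reported Root MSE (13.38) sitting slightly below the residual standard error (13.4).

```r
set.seed(621)  # arbitrary seed for reproducibility
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 80 + 5 * d$x1 - 3 * d$x2 + rnorm(n, sd = 13)

fit <- lm(y ~ x1 + x2, data = d)
s   <- summary(fit)

mse    <- mean(residuals(fit)^2)           # divides by n, not by residual df
rmse   <- sqrt(mse)
adj_r2 <- s$adj.r.squared
f_stat <- unname(s$fstatistic["value"])

print(paste("Mean Squared Error:", mse))
print(paste("Root MSE:", rmse))
print(paste("Adjusted R-squared:", adj_r2))
print(paste("F-statistic:", f_stat))
```

The residual standard error reported by `summary` divides by the residual degrees of freedom instead of n, so it is always a little larger than this RMSE.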
Cleaning Test Data
As previously stated, we were given two datasets to work with, one to train our model and one to test it on. We did several transformations to remove missing values and impute medians to our train data, so let’s do the same for the test data.
##
## iter imp variable
## 1 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 1 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 1 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 1 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 1 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 2 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 2 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 2 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 2 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 2 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 3 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 3 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 3 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 3 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 3 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 4 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 4 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 4 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 4 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 4 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 5 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 5 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 5 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 5 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
## 5 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_SO
Predicting Wins
With our test data in the same format as the train data we can now use R’s predict function and our model 3 to see how many wins it predicts. Below is a sample of our results for the number of wins for each row in the test data, along with the lower and upper bounds of a prediction interval.
## fit lwr upr
## 1 64.03585 37.69984 90.37186
## 2 67.38259 41.06668 93.69850
## 3 73.24273 46.93634 99.54912
## 4 85.09308 58.78365 111.40250
## 5 61.55138 35.09545 88.00732
## 6 64.20408 37.77926 90.62890
## 7 82.41352 56.03561 108.79143
## 8 67.78228 41.45574 94.10881
## 9 70.32870 44.00176 96.65564
## 10 70.62537 44.31475 96.93599
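A minimal sketch of this prediction step, using a one-predictor model on synthetic data. `test_final` is a hypothetical name for the cleaned test frame, and the values are made up.

```r
set.seed(621)  # arbitrary seed for reproducibility
n <- 200
train_final <- data.frame(TEAM_BATTING_H = rnorm(n, 1450, 100))
train_final$TARGET_WINS <- 10 + 0.05 * train_final$TEAM_BATTING_H + rnorm(n, sd = 13)
model3 <- lm(TARGET_WINS ~ TEAM_BATTING_H, data = train_final)

# Hypothetical test rows; they need the same predictor columns as the training data.
test_final <- data.frame(TEAM_BATTING_H = c(1300, 1450, 1600))

# interval = "prediction" returns columns fit, lwr, and upr (95% level by default).
pred <- predict(model3, newdata = test_final, interval = "prediction")
head(pred)
```

A prediction interval covers a single new observation, so it is much wider than a confidence interval for the mean. In the results above each bound sits about 26 wins from the fitted value, roughly twice the residual standard error.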
Conclusion
In the end, we can confidently say that we went with the right model. After cleaning our train data we saw the summary below.
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 541.8 1st Qu.: 67.0
## Median :102.00 Median :512.0 Median : 732.0 Median :105.5
## Mean : 99.61 Mean :501.6 Mean : 727.3 Mean :136.3
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:170.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## Min. :1137 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:1419 1st Qu.: 50.0 1st Qu.:476.0 1st Qu.: 611.0
## Median :1518 Median :107.0 Median :536.5 Median : 801.2
## Mean :1605 Mean :105.7 Mean :500.4 Mean : 787.8
## 3rd Qu.:1660 3rd Qu.:150.0 3rd Qu.:536.5 3rd Qu.: 954.0
## Max. :4134 Max. :343.0 Max. :536.5 Max. :1600.0
## TEAM_FIELDING_E
## Min. : 65.0
## 1st Qu.:127.0
## Median :159.0
## Mean :198.9
## 3rd Qu.:215.0
## Max. :681.0
Focusing on the wins, we see that the mean was around 81. Let’s take a look at the summary of the predicted values.
As shown below, the mean of our predicted wins is around 80, not far off from the mean of the train data.
## fit lwr upr
## Min. : 43.20 Min. :16.66 Min. : 69.75
## 1st Qu.: 74.28 1st Qu.:47.98 1st Qu.:100.59
## Median : 80.74 Median :54.41 Median :107.07
## Mean : 80.25 Mean :53.92 Mean :106.59
## 3rd Qu.: 85.77 3rd Qu.:59.45 3rd Qu.:112.10
## Max. :109.70 Max. :83.23 Max. :136.17
Appendix
Code for this project can be found here:
https://github.com/dquarshie89/Data-621/blob/master/moneyball.R