DATA 621: HW 1

David Quarshie - Group 3

Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season. Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team.

Data Exploration

When it comes to sports, the most important thing a team can do is win. At the end of the day the fans, the coaches, the players, and the management all want their team to get more wins. But what goes into making a team a winning team? For years, coaches and scouts from various sports have relied on the stats they judged most important for getting wins. With data science we can now look at historical measures and estimate which variables are actually indicative of winning.

We have been given two datasets covering baseball team seasons from 1871 to 2006. Alongside each team’s wins are stats such as base hits, home runs, errors, and walks. Our goal is to use one dataset to develop and train a model that shows which stats are associated with wins, and to use the other dataset to test that model. Let’s take a look at the data we’ll be developing our model with.

Data Snapshot

Columns

Below we can see the column names in our train dataset. INDEX is only a row identifier, so we can remove it, keep every other column, and move on to working with the data.

##  [1] "INDEX"            "TARGET_WINS"      "TEAM_BATTING_H"  
##  [4] "TEAM_BATTING_2B"  "TEAM_BATTING_3B"  "TEAM_BATTING_HR" 
##  [7] "TEAM_BATTING_BB"  "TEAM_BATTING_SO"  "TEAM_BASERUN_SB" 
## [10] "TEAM_BASERUN_CS"  "TEAM_BATTING_HBP" "TEAM_PITCHING_H" 
## [13] "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_PITCHING_SO"
## [16] "TEAM_FIELDING_E"  "TEAM_FIELDING_DP"

Data Sample

Here we can see a sample of the data after removing the INDEX column.

##   TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1          39           1445             194              39
## 2          70           1339             219              22
## 3          86           1377             232              35
## 4          70           1387             209              38
## 5          82           1297             186              27
##   TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1              13             143             842              NA
## 2             190             685            1075              37
## 3             137             602             917              46
## 4              96             451             922              43
## 5             102             472             920              49
##   TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1              NA               NA            9364               84
## 2              28               NA            1347              191
## 3              27               NA            1377              137
## 4              30               NA            1396               97
## 5              39               NA            1297              102
##   TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1              927             5456            1011               NA
## 2              689             1082             193              155
## 3              602              917             175              153
## 4              454              928             164              156
## 5              472              920             138              168

Dimensions

The dimensions of the dataset are below. We’re working with 2,276 rows of data with 16 columns.

## [1] 2276   16
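The column removal and dimension check described above can be sketched in base R. This is a minimal illustration on a toy data frame built from the first rows of the sample; the variable name `train` is an assumption, not necessarily the name used in the original script.

```r
# Toy stand-in for the training data (the real file has 2,276 rows).
# In practice this would be loaded from disk, e.g. with read.csv().
train <- data.frame(
  INDEX = 1:5,
  TARGET_WINS = c(39, 70, 86, 70, 82),
  TEAM_BATTING_H = c(1445, 1339, 1377, 1387, 1297)
)

# Drop the identifier column, which carries no predictive information.
train <- train[, names(train) != "INDEX", drop = FALSE]

names(train)  # remaining column names
dim(train)    # rows and columns after the drop
```

Selecting by name rather than by position keeps the step safe if the column order ever changes.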

Data Summary

Let’s also take a look at the data’s summary. This will allow us to see some basic information like the mean, minimum, and maximum of each field. For example, we see that the average number of wins for a team is around 81.

##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##  TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  0.0   Min.   :29.00    Min.   : 1137   Min.   :  0.0   
##  1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419   1st Qu.: 50.0   
##  Median : 49.0   Median :58.00    Median : 1518   Median :107.0   
##  Mean   : 52.8   Mean   :59.36    Mean   : 1779   Mean   :105.7   
##  3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682   3rd Qu.:150.0   
##  Max.   :201.0   Max.   :95.00    Max.   :30132   Max.   :343.0   
##  NA's   :772     NA's   :2085                                     
##  TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0   
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0   
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0   
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4   
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0   
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0   
##                   NA's   :102                        NA's   :286

Variable Scatterplots

We can quickly make scatterplots for each variable, allowing us to see how they are distributed and see if there is any skewness and/or outliers.

Wins Histogram

Focusing on our response variable, TARGET_WINS, we can make a histogram for the variable and investigate its distribution. Our histogram shows us that the distribution is approximately normal, with a slight left skew (-0.4) and a mean of about 81. We also see that there are some outliers, with the max wins being 146 and the min being 0. Seeing as no team in the history of MLB has ever gone winless, we can take note that there may be some errors in the data.

##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 2276 80.79 15.75     82   81.31 14.83   0 146   146 -0.4     1.03
##      se
## X1 0.33
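As a rough illustration of how the shape of the wins distribution can be checked, the sketch below bins a small set of hypothetical win totals and computes a moment-based skewness. The `skewness` helper is written here for illustration and may differ slightly from the estimator behind the table above.

```r
# Hypothetical win totals standing in for TARGET_WINS.
wins <- c(0, 55, 68, 71, 75, 80, 82, 84, 88, 92, 97, 146)

# Bin counts without drawing; call hist(wins) to see the actual plot.
h <- hist(wins, breaks = seq(0, 150, by = 25), plot = FALSE)
h$counts

# Moment-based sample skewness: negative values mean a longer left tail,
# positive values a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
skewness(wins)
```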

Data Preparation

No dataset is perfect. With that in mind, we’ll take a look at what we have and clean it up as much as we can before we begin to build a model.

Missing Values

When we first took a look at our data summary we could see that there were several fields with missing values. A field that is mostly missing gives the model little information about how many wins a team gets. Let’s see which variables have a large percentage of missing values.

## TEAM_BATTING_HBP  TEAM_BASERUN_CS TEAM_FIELDING_DP  TEAM_BASERUN_SB 
##        91.608084        33.919156        12.565905         5.755712 
## TEAM_PITCHING_SO  TEAM_BATTING_SO  TEAM_FIELDING_E TEAM_PITCHING_BB 
##         4.481547         4.481547         0.000000         0.000000 
## TEAM_PITCHING_HR  TEAM_PITCHING_H  TEAM_BATTING_BB  TEAM_BATTING_HR 
##         0.000000         0.000000         0.000000         0.000000 
##  TEAM_BATTING_3B  TEAM_BATTING_2B   TEAM_BATTING_H      TARGET_WINS 
##         0.000000         0.000000         0.000000         0.000000

Our results show us that HBP (91.6%), CS (33.9%), and DP (12.6%) have the highest percentages of missing values, so it’s in our best interest to remove those fields from the data.
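The missing-value percentages and the column drop can be sketched as follows. The toy data frame and the 10% cutoff are assumptions made for illustration; the cutoff is merely consistent with the three fields removed above.

```r
# Toy data with 20 rows: one field almost entirely missing, one mostly complete.
n <- 20
df <- data.frame(
  TARGET_WINS = rep(c(70, 82, 86, 91), length.out = n),
  TEAM_BATTING_HBP = c(rep(NA, 18), 55, 61),
  TEAM_BASERUN_SB = c(NA, rep(c(37, 46, 43), length.out = n - 1))
)

# Percentage of missing values per column, largest first.
pct_missing <- sort(colMeans(is.na(df)) * 100, decreasing = TRUE)
pct_missing

# Keep only the columns at or below the assumed cutoff.
cutoff <- 10
keep <- names(pct_missing)[pct_missing <= cutoff]
df_reduced <- df[, names(df) %in% keep, drop = FALSE]
names(df_reduced)
```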

Outliers

After getting rid of the fields with a high percentage of missing values, we still need to deal with fields that have extreme outliers. These outliers pull the fitted coefficients toward values that are far outside the norm, which makes our predictions less reliable.

Looking at the summary and the plots below we see that PITCHING_H, PITCHING_BB, PITCHING_SO, and FIELDING_E are all skewed by their outliers. We also have some fields with a few missing values. Our plan is to take any value that is more than 3 standard deviations above the field’s mean and replace it with the field’s median, and to replace any leftover missing values with that field’s median as well.

##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##  Min.   : 1137   Min.   :  0.0    Min.   :   0.0   Min.   :    0.0  
##  1st Qu.: 1419   1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0  
##  Median : 1518   Median :107.0    Median : 536.5   Median :  813.5  
##  Mean   : 1779   Mean   :105.7    Mean   : 553.0   Mean   :  817.7  
##  3rd Qu.: 1682   3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0  
##  Max.   :30132   Max.   :343.0    Max.   :3645.0   Max.   :19278.0  
##                                                    NA's   :102      
##  TEAM_FIELDING_E 
##  Min.   :  65.0  
##  1st Qu.: 127.0  
##  Median : 159.0  
##  Mean   : 246.5  
##  3rd Qu.: 249.2  
##  Max.   :1898.0  
## 
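The capping rule described above can be written as a small helper. This is a sketch of the rule as stated in the text, not necessarily the code that produced the results in this report; the function name `cap_high_outliers` and the sample values are hypothetical.

```r
# Replace values more than k standard deviations above the mean
# with the median of the field. Missing values are left untouched.
cap_high_outliers <- function(x, k = 3) {
  upper <- mean(x, na.rm = TRUE) + k * sd(x, na.rm = TRUE)
  med <- median(x, na.rm = TRUE)
  x[!is.na(x) & x > upper] <- med
  x
}

# Hypothetical values: twenty ordinary seasons plus one extreme record.
pitching_h <- c(seq(1300, 1680, by = 20), 30132)
capped <- cap_high_outliers(pitching_h)
max(capped)
```

Note that a single extreme value inflates both the mean and the standard deviation, so in small samples a 3 standard deviation rule can fail to flag it; a rule based on the median and interquartile range is a common alternative.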

Impute

## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 541.8   1st Qu.: 67.0  
##  Median :102.00   Median :512.0   Median : 732.0   Median :105.5  
##  Mean   : 99.61   Mean   :501.6   Mean   : 727.3   Mean   :136.3  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:170.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##  Min.   : 1137   Min.   :  0.0    Min.   :   0.0   Min.   :    0.0  
##  1st Qu.: 1419   1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  611.0  
##  Median : 1518   Median :107.0    Median : 536.5   Median :  801.5  
##  Mean   : 1779   Mean   :105.7    Mean   : 553.0   Mean   :  809.8  
##  3rd Qu.: 1682   3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  957.2  
##  Max.   :30132   Max.   :343.0    Max.   :3645.0   Max.   :19278.0  
##  TEAM_FIELDING_E 
##  Min.   :  65.0  
##  1st Qu.: 127.0  
##  Median : 159.0  
##  Mean   : 246.5  
##  3rd Qu.: 249.2  
##  Max.   :1898.0
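The iteration log above suggests that an iterative imputation routine was run on the three fields with missing values. The sketch below shows only the simpler median fallback described in the text, on toy values taken from the data sample; the helper name `impute_median` is hypothetical.

```r
# Fill missing entries of a numeric vector with the median of the observed ones.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

df <- data.frame(
  TEAM_BATTING_SO = c(842, NA, 917, 922, 920),
  TEAM_BASERUN_SB = c(NA, 37, 46, 43, 49)
)

# Apply the helper to every column while keeping the data frame structure.
df[] <- lapply(df, impute_median)
df
```

Median imputation leaves each field's median unchanged but shrinks its variance, which is one reason a model-based imputation may be preferred when many values are missing.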

##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 541.8   1st Qu.: 67.0  
##  Median :102.00   Median :512.0   Median : 732.0   Median :105.5  
##  Mean   : 99.61   Mean   :501.6   Mean   : 727.3   Mean   :136.3  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:170.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
##  Min.   :1137    Min.   :  0.0    Min.   :  0.0    Min.   :   0.0  
##  1st Qu.:1419    1st Qu.: 50.0    1st Qu.:476.0    1st Qu.: 611.0  
##  Median :1518    Median :107.0    Median :536.5    Median : 801.2  
##  Mean   :1605    Mean   :105.7    Mean   :500.4    Mean   : 787.8  
##  3rd Qu.:1660    3rd Qu.:150.0    3rd Qu.:536.5    3rd Qu.: 954.0  
##  Max.   :4134    Max.   :343.0    Max.   :536.5    Max.   :1600.0  
##  TEAM_FIELDING_E
##  Min.   : 65.0  
##  1st Qu.:127.0  
##  Median :159.0  
##  Mean   :198.9  
##  3rd Qu.:215.0  
##  Max.   :681.0

Build Models

Now that our data is in a state we can work with, we can go ahead and build some models. Linear model building is centered around finding the variables that contribute most to the value we want to predict. In our case we’re looking for the variables that go into helping a baseball team win games. For starters, let’s take all the variables we have in our data and use R’s lm function to see what formula we come up with.

When creating a linear model for wins, our results (shown below) tell us some interesting things. Each variable comes with an estimate that will be its coefficient in the formula, and also with a p-value which can be used to determine how useful the variable is in our formula. The lower the p-value, the stronger the evidence that the variable’s coefficient differs from zero. Looking at the coefficients we can say that when using all of the variables our formula for wins is:
\[ \text{wins} = 8.70 + 0.034 \cdot \text{BATTING\_H} - 0.012 \cdot \text{BATTING\_2B} + 0.087 \cdot \text{BATTING\_3B} + 0.098 \cdot \text{BATTING\_HR} + 0.041 \cdot \text{BATTING\_BB} + 0.002 \cdot \text{BATTING\_SO} + 0.043 \cdot \text{BASERUN\_SB} + 0.003 \cdot \text{PITCHING\_H} - 0.038 \cdot \text{PITCHING\_HR} - 0.017 \cdot \text{PITCHING\_BB} - 0.006 \cdot \text{PITCHING\_SO} - 0.024 \cdot \text{FIELDING\_E} \]
We also checked the residual plots against the usual regression assumptions and found no obvious violations. This model has an R-squared of 0.2855, meaning it explains roughly 29% of the variance in wins. But there are several variables with high p-values that can be removed, and we’ll do that for the next model.

Model 1

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.963  -8.228   0.507   8.360  52.403 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.697984   5.386462   1.615  0.10650    
## TEAM_BATTING_H    0.034392   0.003701   9.293  < 2e-16 ***
## TEAM_BATTING_2B  -0.012467   0.009355  -1.333  0.18279    
## TEAM_BATTING_3B   0.086593   0.017548   4.935 8.62e-07 ***
## TEAM_BATTING_HR   0.097610   0.027018   3.613  0.00031 ***
## TEAM_BATTING_BB   0.040839   0.004367   9.352  < 2e-16 ***
## TEAM_BATTING_SO   0.002274   0.004254   0.535  0.59303    
## TEAM_BASERUN_SB   0.042814   0.004017  10.658  < 2e-16 ***
## TEAM_PITCHING_H   0.002615   0.001177   2.221  0.02643 *  
## TEAM_PITCHING_HR -0.037909   0.023861  -1.589  0.11226    
## TEAM_PITCHING_BB -0.017132   0.007321  -2.340  0.01936 *  
## TEAM_PITCHING_SO -0.006082   0.003459  -1.759  0.07878 .  
## TEAM_FIELDING_E  -0.023825   0.003437  -6.931 5.41e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.35 on 2263 degrees of freedom
## Multiple R-squared:  0.2855, Adjusted R-squared:  0.2817 
## F-statistic: 75.37 on 12 and 2263 DF,  p-value: < 2.2e-16
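For readers who want to reproduce this kind of output, the sketch below fits a full model with `lm` on simulated data and pulls out the estimates, p-values, and R-squared. The data frame `toy` and its columns are invented for illustration and are not the baseball data.

```r
set.seed(1)
n <- 200

# Simulated predictors: two that drive wins and one that is pure noise.
toy <- data.frame(
  hits = rnorm(n, mean = 1450, sd = 100),
  errors = rnorm(n, mean = 160, sd = 40),
  noise = rnorm(n)
)
toy$wins <- 10 + 0.05 * toy$hits - 0.1 * toy$errors + rnorm(n, sd = 5)

# "wins ~ ." regresses wins on every other column in the data frame.
model_full <- lm(wins ~ ., data = toy)

# Coefficient table: estimate, standard error, t value, and p-value per term.
coefs <- summary(model_full)$coefficients
coefs[, c("Estimate", "Pr(>|t|)")]

summary(model_full)$r.squared
```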

For this model we’ve removed several variables from the first model (BATTING_2B, PITCHING_HR, and PITCHING_BB) in an effort to make a better fit. When limiting the variables we’re using, we come up with this formula for wins:
\[ \text{wins} = 8.43 + 0.030 \cdot \text{BATTING\_H} + 0.085 \cdot \text{BATTING\_3B} + 0.060 \cdot \text{BATTING\_HR} + 0.033 \cdot \text{BATTING\_BB} + 0.005 \cdot \text{BATTING\_SO} + 0.043 \cdot \text{BASERUN\_SB} - 0.010 \cdot \text{PITCHING\_SO} + 0.003 \cdot \text{PITCHING\_H} - 0.023 \cdot \text{FIELDING\_E} \]
Our residual plots once again look fine, but this time our R-squared has dropped a bit to 0.2823. Let’s build a third model with an even smaller set of variables.

Model 2

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_SO + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E, 
##     data = train_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.983  -8.359   0.529   8.611  52.254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.434924   4.702746   1.794 0.073008 .  
## TEAM_BATTING_H    0.029891   0.002872  10.409  < 2e-16 ***
## TEAM_BATTING_3B   0.085491   0.017450   4.899 1.03e-06 ***
## TEAM_BATTING_HR   0.060208   0.009877   6.096 1.28e-09 ***
## TEAM_BATTING_BB   0.032593   0.002909  11.203  < 2e-16 ***
## TEAM_BATTING_SO   0.005152   0.003746   1.375 0.169222    
## TEAM_BASERUN_SB   0.043455   0.004008  10.841  < 2e-16 ***
## TEAM_PITCHING_SO -0.010315   0.002933  -3.517 0.000445 ***
## TEAM_PITCHING_H   0.002774   0.001156   2.400 0.016495 *  
## TEAM_FIELDING_E  -0.023267   0.003407  -6.828 1.10e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.37 on 2266 degrees of freedom
## Multiple R-squared:  0.2823, Adjusted R-squared:  0.2794 
## F-statistic: 99.03 on 9 and 2266 DF,  p-value: < 2.2e-16
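Dropping a variable and comparing the two fits can be sketched with `update` and `anova`. The simulated data and the column names below are hypothetical; the point is only to show the mechanics of comparing a reduced model against a fuller one.

```r
set.seed(2)
n <- 200

# Simulated data in which "doubles" has no real effect on wins.
toy <- data.frame(
  hits = rnorm(n, mean = 1450, sd = 100),
  errors = rnorm(n, mean = 160, sd = 40),
  doubles = rnorm(n, mean = 240, sd = 40)
)
toy$wins <- 10 + 0.05 * toy$hits - 0.1 * toy$errors + rnorm(n, sd = 5)

full <- lm(wins ~ hits + errors + doubles, data = toy)

# Refit the same model without the "doubles" term.
reduced <- update(full, . ~ . - doubles)

# Adjusted R-squared penalizes extra predictors, so it is the fairer comparison.
c(full = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)

# Partial F-test: does the dropped term explain significant extra variance?
anova(reduced, full)
```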

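After removing two more variables (PITCHING_SO and PITCHING_H) we come up with our final formula for wins:
\[ \text{wins} = 9.91 + 0.031 \cdot \text{BATTING\_H} + 0.083 \cdot \text{BATTING\_3B} + 0.063 \cdot \text{BATTING\_HR} + 0.034 \cdot \text{BATTING\_BB} - 0.006 \cdot \text{BATTING\_SO} + 0.043 \cdot \text{BASERUN\_SB} - 0.022 \cdot \text{FIELDING\_E} \]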

Model 3

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_FIELDING_E, data = train_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.269  -8.325   0.585   8.565  53.497 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.905546   4.623568   2.142  0.03227 *  
## TEAM_BATTING_H   0.030996   0.002793  11.100  < 2e-16 ***
## TEAM_BATTING_3B  0.083288   0.017482   4.764 2.02e-06 ***
## TEAM_BATTING_HR  0.062976   0.009849   6.394 1.95e-10 ***
## TEAM_BATTING_BB  0.034259   0.002768  12.378  < 2e-16 ***
## TEAM_BATTING_SO -0.005721   0.002209  -2.589  0.00968 ** 
## TEAM_BASERUN_SB  0.043376   0.003833  11.317  < 2e-16 ***
## TEAM_FIELDING_E -0.022457   0.003402  -6.600 5.09e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.4 on 2268 degrees of freedom
## Multiple R-squared:  0.2781, Adjusted R-squared:  0.2758 
## F-statistic: 124.8 on 7 and 2268 DF,  p-value: < 2.2e-16

Select Model

After looking at the results from our 3 models we decided to go with our final model, model 3. Its adjusted R-squared (0.2758) is close to model 1’s (0.2817) while using seven predictors instead of twelve, and every remaining predictor is significant at the 0.01 level, so we’re confident in this choice. Below are the summary results for model 3. Viewing these results, the overall F-test (F = 124.8, p < 2.2e-16) indicates that the model is statistically significant.

## [1] "Mean Squared Error: 179.054046475421"
## [1] "Root MSE: 13.3811078194379"
## [1] "Adjusted R-squared: 0.275841665160897"
## [1] "F-statistic: 124.796878202348"
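The four figures above can be extracted from any fitted `lm` object as sketched below, using simulated data. The reported Root MSE (13.38) sits slightly below the residual standard error in the model 3 output (13.4), which is consistent with dividing the residual sum of squares by the number of rows rather than by the residual degrees of freedom; the sketch follows that assumption.

```r
set.seed(3)
n <- 150

# Simulated stand-in for the training data.
toy <- data.frame(hits = rnorm(n, mean = 1450, sd = 100))
toy$wins <- 10 + 0.05 * toy$hits + rnorm(n, sd = 5)

fit <- lm(wins ~ hits, data = toy)
s <- summary(fit)

# Mean squared error over all rows (divides by n, not by residual df).
mse <- mean(residuals(fit)^2)
rmse <- sqrt(mse)

adj_r2 <- s$adj.r.squared
f_stat <- s$fstatistic[["value"]]

print(paste("Mean Squared Error:", mse))
print(paste("Root MSE:", rmse))
print(paste("Adjusted R-squared:", adj_r2))
print(paste("F-statistic:", f_stat))
```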

Cleaning Test Data

As previously stated, we were given two datasets to work with, one to train our model and one to test it on. We applied several transformations to our train data, removing sparse fields, imputing missing values, and replacing outliers with medians, so let’s do the same for the test data.

## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO

Predicting Wins

With our test data in the same format as the train data we can now use R’s predict function and our model 3 to see how many wins it predicts. Below is a sample of our results for the number of wins for each row in the test data, along with the lower and upper bounds of a prediction interval.

##         fit      lwr       upr
## 1  64.03585 37.69984  90.37186
## 2  67.38259 41.06668  93.69850
## 3  73.24273 46.93634  99.54912
## 4  85.09308 58.78365 111.40250
## 5  61.55138 35.09545  88.00732
## 6  64.20408 37.77926  90.62890
## 7  82.41352 56.03561 108.79143
## 8  67.78228 41.45574  94.10881
## 9  70.32870 44.00176  96.65564
## 10 70.62537 44.31475  96.93599
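A table with `fit`, `lwr`, and `upr` columns is what `predict` returns for an `lm` object when `interval = "prediction"` is requested. The sketch below shows the call on simulated data; the data frames and column names are hypothetical. In the results above each interval spans roughly 26 wins on either side of the fitted value, about twice the residual standard error of 13.4, as expected for a 95% prediction interval.

```r
set.seed(4)

# Simulated training data and fitted model.
train <- data.frame(hits = rnorm(100, mean = 1450, sd = 100))
train$wins <- 10 + 0.05 * train$hits + rnorm(100, sd = 5)
fit <- lm(wins ~ hits, data = train)

# New rows to score; column names must match the model's predictors.
test <- data.frame(hits = c(1300, 1450, 1600))

# Point predictions plus 95% prediction interval bounds for each row.
pred <- predict(fit, newdata = test, interval = "prediction", level = 0.95)
pred

# Summarise the predictions the same way as any other numeric table.
summary(pred)
```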

Conclusion

In the end, we are confident that we went with the right model. After cleaning our train data we saw the summary below.

##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 541.8   1st Qu.: 67.0  
##  Median :102.00   Median :512.0   Median : 732.0   Median :105.5  
##  Mean   : 99.61   Mean   :501.6   Mean   : 727.3   Mean   :136.3  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:170.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
##  Min.   :1137    Min.   :  0.0    Min.   :  0.0    Min.   :   0.0  
##  1st Qu.:1419    1st Qu.: 50.0    1st Qu.:476.0    1st Qu.: 611.0  
##  Median :1518    Median :107.0    Median :536.5    Median : 801.2  
##  Mean   :1605    Mean   :105.7    Mean   :500.4    Mean   : 787.8  
##  3rd Qu.:1660    3rd Qu.:150.0    3rd Qu.:536.5    3rd Qu.: 954.0  
##  Max.   :4134    Max.   :343.0    Max.   :536.5    Max.   :1600.0  
##  TEAM_FIELDING_E
##  Min.   : 65.0  
##  1st Qu.:127.0  
##  Median :159.0  
##  Mean   :198.9  
##  3rd Qu.:215.0  
##  Max.   :681.0

Focusing on the wins, we see that the mean was around 81. Let’s take a look at the summary of the predicted values.

As shown below, the mean of our predicted wins is about 80, which is not far off from the mean of 81 in the train data.

##       fit              lwr             upr        
##  Min.   : 43.20   Min.   :16.66   Min.   : 69.75  
##  1st Qu.: 74.28   1st Qu.:47.98   1st Qu.:100.59  
##  Median : 80.74   Median :54.41   Median :107.07  
##  Mean   : 80.25   Mean   :53.92   Mean   :106.59  
##  3rd Qu.: 85.77   3rd Qu.:59.45   3rd Qu.:112.10  
##  Max.   :109.70   Max.   :83.23   Max.   :136.17

Appendix

Code for this project can be found here:
https://github.com/dquarshie89/Data-621/blob/master/moneyball.R