CUNY 621 Homework 1

Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set.

1. Data Exploration

Recruiting for talented athletes to the Major Leagues has always been a challenge for many baseball teams. The pressures and goals of winning championships had often led scouts to search throughout the country for individuals who could make a tremendous impact to their team. For years, the scouts had focused on many subjective reports regarding players’ abilities to perform. However, this was increasingly difficult for teams with a smaller payroll (like the 2002 Oakland A’s) to complete with teams with bigger payrolls (New York Yankees).

This was the focus of Michael Lewis’s book ‘Moneyball’ (published in 2003). Given the 2002 Oakland A’s revenue disadvantage, General Manager Billy Beane had to search for a competitive edge. As a result, he had turned to sabermetrics. According to Wikipedia, “the central premise of Moneyball is that the collective wisdom of baseball insiders (including players, managers, coaches, scouts, and the front office) over the past century is subjective and often flaws. Statistics such as stolen bases, runs batted in, and batting average, typically used to gauge players, are relics of a 19th-century view of the game, and the statistics available at that time. Before sabermetrics was introduced to baseball, teams were dependent on the skills of their scouts to find and evaluate players. Scouts are those who are experienced in the sport, usually been involved as players or coaches.” (https://en.wikipedia.org/wiki/Moneyball)

As data scientists, we have been given a dataset to undergo a similar exercise to help identify what factors (baseball statistics) have an impact on the overall wins for the team. (And the more wins, the better the chances for winning the World Series).

The name of this dataset is called moneyball. It contains multiple rows and columns with data containing information about the team and its corresponding wins. Let us take at the data and its visualizations.

Below is part (head) of the dataset.

##   WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB BATTING_SO
## 1   39      1445        194         39         13        143        842
## 2   70      1339        219         22        190        685       1075
## 3   86      1377        232         35        137        602        917
## 4   70      1387        209         38         96        451        922
## 5   82      1297        186         27        102        472        920
## 6   75      1279        200         36         92        443        973
##   BASERUN_SB BASERUN_CS BATTING_HBP PITCHING_H PITCHING_HR PITCHING_BB
## 1         NA         NA          NA       9364          84         927
## 2         37         28          NA       1347         191         689
## 3         46         27          NA       1377         137         602
## 4         43         30          NA       1396          97         454
## 5         49         39          NA       1297         102         472
## 6        107         59          NA       1279          92         443
##   PITCHING_SO FIELDING_E FIELDING_DP
## 1        5456       1011          NA
## 2        1082        193         155
## 3         917        175         153
## 4         928        164         156
## 5         920        138         168
## 6         973        123         149

What are the baseball variables that we are looking at in this data set?

##  [1] "WINS"        "BATTING_H"   "BATTING_2B"  "BATTING_3B"  "BATTING_HR" 
##  [6] "BATTING_BB"  "BATTING_SO"  "BASERUN_SB"  "BASERUN_CS"  "BATTING_HBP"
## [11] "PITCHING_H"  "PITCHING_HR" "PITCHING_BB" "PITCHING_SO" "FIELDING_E" 
## [16] "FIELDING_DP"

The description of these variables are described above in the first section.

What are the dimensions of this dataset?

## [1] 2276   16

This dataset contains 2276 rows and contains 16 features including the response variable WINS.

If we look at all of the data within each categories, we notice that all of these categories contain continuous values. So it may be reasonable to start with looking at the quartiles, mean, median, range, skew, minimu, maximum, etc. values.

##       WINS          BATTING_H      BATTING_2B      BATTING_3B    
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##    BATTING_HR       BATTING_BB      BATTING_SO       BASERUN_SB   
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##    BASERUN_CS     BATTING_HBP      PITCHING_H     PITCHING_HR   
##  Min.   :  0.0   Min.   :29.00   Min.   : 1137   Min.   :  0.0  
##  1st Qu.: 38.0   1st Qu.:50.50   1st Qu.: 1419   1st Qu.: 50.0  
##  Median : 49.0   Median :58.00   Median : 1518   Median :107.0  
##  Mean   : 52.8   Mean   :59.36   Mean   : 1779   Mean   :105.7  
##  3rd Qu.: 62.0   3rd Qu.:67.00   3rd Qu.: 1682   3rd Qu.:150.0  
##  Max.   :201.0   Max.   :95.00   Max.   :30132   Max.   :343.0  
##  NA's   :772     NA's   :2085                                   
##   PITCHING_BB      PITCHING_SO        FIELDING_E      FIELDING_DP   
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0  
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0  
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0  
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4  
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0  
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0  
##                   NA's   :102                        NA's   :286
##             vars    n    mean      sd median trimmed    mad  min   max
## WINS           1 2276   80.79   15.75   82.0   81.31  14.83    0   146
## BATTING_H      2 2276 1469.27  144.59 1454.0 1459.04 114.16  891  2554
## BATTING_2B     3 2276  241.25   46.80  238.0  240.40  47.44   69   458
## BATTING_3B     4 2276   55.25   27.94   47.0   52.18  23.72    0   223
## BATTING_HR     5 2276   99.61   60.55  102.0   97.39  78.58    0   264
## BATTING_BB     6 2276  501.56  122.67  512.0  512.18  94.89    0   878
## BATTING_SO     7 2174  735.61  248.53  750.0  742.31 284.66    0  1399
## BASERUN_SB     8 2145  124.76   87.79  101.0  110.81  60.79    0   697
## BASERUN_CS     9 1504   52.80   22.96   49.0   50.36  17.79    0   201
## BATTING_HBP   10  191   59.36   12.97   58.0   58.86  11.86   29    95
## PITCHING_H    11 2276 1779.21 1406.84 1518.0 1555.90 174.95 1137 30132
## PITCHING_HR   12 2276  105.70   61.30  107.0  103.16  74.13    0   343
## PITCHING_BB   13 2276  553.01  166.36  536.5  542.62  98.59    0  3645
## PITCHING_SO   14 2174  817.73  553.09  813.5  796.93 257.23    0 19278
## FIELDING_E    15 2276  246.48  227.77  159.0  193.44  62.27   65  1898
## FIELDING_DP   16 1990  146.39   26.23  149.0  147.58  23.72   52   228
##             range  skew kurtosis    se
## WINS          146 -0.40     1.03  0.33
## BATTING_H    1663  1.57     7.28  3.03
## BATTING_2B    389  0.22     0.01  0.98
## BATTING_3B    223  1.11     1.50  0.59
## BATTING_HR    264  0.19    -0.96  1.27
## BATTING_BB    878 -1.03     2.18  2.57
## BATTING_SO   1399 -0.30    -0.32  5.33
## BASERUN_SB    697  1.97     5.49  1.90
## BASERUN_CS    201  1.98     7.62  0.59
## BATTING_HBP    66  0.32    -0.11  0.94
## PITCHING_H  28995 10.33   141.84 29.49
## PITCHING_HR   343  0.29    -0.60  1.28
## PITCHING_BB  3645  6.74    96.97  3.49
## PITCHING_SO 19278 22.17   671.19 11.86
## FIELDING_E   1833  2.99    10.97  4.77
## FIELDING_DP   176 -0.39     0.18  0.59

Visualization of the dataset:

The above contains a lot of information, and can be pretty overwhelming. If we take time to dissect the information, there are some interesting points.

The above is the boxplots of all of the variables listed in the data set, which the visualization may assist us in how the data is spread (given that it’s in median and quartiles). However, as of note, some of the variables contain missing values (which we will discuss later). While it may be interesting, the absolute numbers here does not provide any immediate insight yet.

The scatterplot on the other hand provides a visualization of the independent variables against the response variables WINS. Using ggplot2, a ‘lm’ model was added to all of the scatterplots. While not completely rigorous, it does give us a general idea of how each variable contributes to the WINS column. For example, in PITCHING_H, the more hits the opposing team makes, WINS begin to decrease, which is pretty obvious. If the opposing team is hitting more balls, the opposing team will have a higher likelihood of winning.

Below are the correlations and heatmaps of the correlations between the variables.

##              WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB
## WINS         1.00      0.47       0.31      -0.12       0.42       0.47
## BATTING_H    0.47      1.00       0.56       0.21       0.40       0.20
## BATTING_2B   0.31      0.56       1.00       0.04       0.25       0.20
## BATTING_3B  -0.12      0.21       0.04       1.00      -0.22      -0.21
## BATTING_HR   0.42      0.40       0.25      -0.22       1.00       0.46
## BATTING_BB   0.47      0.20       0.20      -0.21       0.46       1.00
## BATTING_SO  -0.23     -0.34      -0.06      -0.19       0.21       0.22
## BASERUN_SB   0.01      0.07      -0.19       0.17      -0.19      -0.09
## BASERUN_CS  -0.18     -0.09      -0.20       0.23      -0.28      -0.21
## BATTING_HBP  0.07     -0.03       0.05      -0.17       0.11       0.05
## PITCHING_H   0.47      1.00       0.56       0.21       0.40       0.20
## PITCHING_HR  0.42      0.39       0.25      -0.22       1.00       0.46
## PITCHING_BB  0.47      0.20       0.20      -0.21       0.46       1.00
## PITCHING_SO -0.23     -0.34      -0.07      -0.19       0.21       0.22
## FIELDING_E  -0.39     -0.25      -0.19      -0.07       0.02      -0.08
## FIELDING_DP -0.20      0.02      -0.02       0.13      -0.06      -0.08
##             BATTING_SO BASERUN_SB BASERUN_CS BATTING_HBP PITCHING_H
## WINS             -0.23       0.01      -0.18        0.07       0.47
## BATTING_H        -0.34       0.07      -0.09       -0.03       1.00
## BATTING_2B       -0.06      -0.19      -0.20        0.05       0.56
## BATTING_3B       -0.19       0.17       0.23       -0.17       0.21
## BATTING_HR        0.21      -0.19      -0.28        0.11       0.40
## BATTING_BB        0.22      -0.09      -0.21        0.05       0.20
## BATTING_SO        1.00      -0.07      -0.06        0.22      -0.34
## BASERUN_SB       -0.07       1.00       0.62       -0.06       0.07
## BASERUN_CS       -0.06       0.62       1.00       -0.07      -0.09
## BATTING_HBP       0.22      -0.06      -0.07        1.00      -0.03
## PITCHING_H       -0.34       0.07      -0.09       -0.03       1.00
## PITCHING_HR       0.21      -0.19      -0.28        0.11       0.39
## PITCHING_BB       0.22      -0.09      -0.21        0.05       0.20
## PITCHING_SO       1.00      -0.07      -0.06        0.22      -0.34
## FIELDING_E        0.31       0.04       0.21        0.04      -0.25
## FIELDING_DP      -0.12      -0.13      -0.01       -0.07       0.01
##             PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP
## WINS               0.42        0.47       -0.23      -0.39       -0.20
## BATTING_H          0.39        0.20       -0.34      -0.25        0.02
## BATTING_2B         0.25        0.20       -0.07      -0.19       -0.02
## BATTING_3B        -0.22       -0.21       -0.19      -0.07        0.13
## BATTING_HR         1.00        0.46        0.21       0.02       -0.06
## BATTING_BB         0.46        1.00        0.22      -0.08       -0.08
## BATTING_SO         0.21        0.22        1.00       0.31       -0.12
## BASERUN_SB        -0.19       -0.09       -0.07       0.04       -0.13
## BASERUN_CS        -0.28       -0.21       -0.06       0.21       -0.01
## BATTING_HBP        0.11        0.05        0.22       0.04       -0.07
## PITCHING_H         0.39        0.20       -0.34      -0.25        0.01
## PITCHING_HR        1.00        0.46        0.21       0.02       -0.06
## PITCHING_BB        0.46        1.00        0.22      -0.08       -0.08
## PITCHING_SO        0.21        0.22        1.00       0.31       -0.12
## FIELDING_E         0.02       -0.08        0.31       1.00        0.04
## FIELDING_DP       -0.06       -0.08       -0.12       0.04        1.00

According to the correlation table and heatmap, the values that correspond most positively are BATTING_H, BATTING_2B, BATTING_HR, BATTING_BB, PITCHING_H, PITCHING_HR, and PITCHING_BB.

It should be noted that PITCHING_H, PITCHING_HR, and PITCHING_BB are positively correlated with wins, which does not seem to make much sense at all. This is in stark contrast to what was stated in the first section regarding what is considered to be positive and negative correlation (according to domain knowledge.) So instance, if your team is giving up hits, home runs, and walks, that puts the opposing team at an advantage.

Let’s try looking at each ‘supposedly’ positive and negative impact variables separately using a different package in R.

Supposed Positive Impact:

Supposed Negative Impact:

Though there appear to be significant p-values for some of these independent variables compared to WINS, it should be noted that the scatterplots are not all necessarily linear. We can see if we can work this out in our modelling section.

Now let’s take a look at the response variable, WINS.

##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 2276 80.79 15.75     82   81.31 14.83   0 146   146 -0.4     1.03
##      se
## X1 0.33

In the diagram above, the red dashed line represents the mean and the two blue dashed lines represent the two standard deviations above and below the mean. The skew is less than -0.5, and there appears to be a normal distribution. However, again, there are some interesting findings with the WINS data.

Though the data has been curated so that every team has been standardized to 162 wins, there was a target win season MAX of 146 and a MIN of 0. These numbers are most definitely outliers as this has never occurred in the history of Major League Basesball. According to Wikipedia, ‘List of best Major League Baseball season win-loss records’, since 1961 (when the games were expanded to 162 games in a season), only two teams had managed to obtain a winning percentage greater than 70%, or 114 wins. They were the 1998 New York Yankees and the 2001 Seattle Mariners.

What if we looked at the teams before 1961? Perhaps we may find more revealing data. Unfortunately, this is not the case either. According to Newday, the MLB’s winningest season of all time was the 1906 Chicago Cubs with a record of 116-36, which gives a total winning percentage of 76.3%, or if converted to a full 162 game season, an equivalent of 123.6 wins. (https://www.newsday.com/sports/baseball/mlb-s-winningest-seasons-1.3077797).

Likewise, the most losing MLB team in the history of baseball was the 1899 Cleveland Spiders with 20 wins and 134 losses, with a staggering win percentage of 0.130. Would it improve the model to eliminate the outliers? Any data including these numbers may only likely mean extrapolation and may not be accurate in its representations of wins. We must be careful about including extrapolated data, and when we perform our modelling, we really shoulder consider taking out the outliers as they are likely not representative what truly happened.

Before we head into the next section on data preparation, let’s take a look at the missing data points.

How many data points are missing from each column and what percentage of the data is composed of missing values?

## BATTING_HBP  BASERUN_CS FIELDING_DP  BASERUN_SB PITCHING_SO  BATTING_SO 
##        2085         772         286         131         102         102 
##  FIELDING_E PITCHING_BB PITCHING_HR  PITCHING_H  BATTING_BB  BATTING_HR 
##           0           0           0           0           0           0 
##  BATTING_3B  BATTING_2B   BATTING_H        WINS 
##           0           0           0           0
## BATTING_HBP  BASERUN_CS FIELDING_DP  BASERUN_SB PITCHING_SO  BATTING_SO 
##   91.608084   33.919156   12.565905    5.755712    4.481547    4.481547 
##  FIELDING_E PITCHING_BB PITCHING_HR  PITCHING_H  BATTING_BB  BATTING_HR 
##    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000 
##  BATTING_3B  BATTING_2B   BATTING_H        WINS 
##    0.000000    0.000000    0.000000    0.000000

Most of the data have no missing data. However, there are some fields like BATTING_HBP, and BASERUN_CS that contain a significant amount of missing data, which should be considered when we model our regression.

2. Data Preparation

Now that we have taken the opportunity to perform some exploratory data analysis, let’s start preparing and cleaning the data for model preparation. The first step would be to handle all of the missing values.

As we had noted in the previous section, missing values can lead to errors and bias into our model. So fixing or imputing these missing values may help deal with the situation (or possibly make it worse). Multiple scientific papers have been studied in regards to missing data, and the overall consensus is that there is NO consensus on what the cutoff of missing data should be. Some authors have argued that missing data greater than 10% can lead to bias (Bennett 2001); other authors discuss that missing data mechanisms and the missing data patterns have greater impact on research results than does the proportion of missing data (Tabachnick and Fidell 2012). (Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3701793/#CR6).

So we should consider: what is an acceptable level of missing data in baseball statistics? This is unclear, even from my own search in the literature. Using my best judgment, I will eliminate any variables that have more than 10% missing data, which in this case will be BATTING_HBP, BASERUN_CS and FIELDING_DP. BATTING_HBP and BASERUN_CS had a staggering 91.6% and 33.9% missing data, respectively. Though we could make an argument that a 12.5% missing rate in FIELDING_DP could be overlooked, the fact that the missing data rate is greater than 10% and that the Pearson Correlation between FIELDING_DP and WINS is not large at -.20, I will also remove this column as well.

Now that we have decided on which variables to remove, we are still left with BASERUN_SB, PITCHING_SO, and BATTING_SO with variables with missing data points. Typically we can use either the mean or median for imputation. Choosing one over the other will depend on our data. So for example, if there is a dataset that have great outliers, then we should probably consider the median (ex. 99% of the household income is below 100, and 1% is above 500.)

We can check this by checking the distribution of BASERUN_SB, PITCHING_SO, and BATTING_SO.

Given that the means are susceptible to outliers, we must consider its effect, particularly in BASERUN_SB and PITCHING_SO. The maximum value of PITCHING_SO is:

## [1] 19278

This is an absurd value. It would be in our best interests if we imputed the median, and not the mean.

Now that we have removed the three variables with greater than 10% missing value, and imputed all of the remaining missing values, let’s take a look again at the head and the summary of the dataset.

##   WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB BATTING_SO
## 1   39      1445        194         39         13        143        842
## 2   70      1339        219         22        190        685       1075
## 3   86      1377        232         35        137        602        917
## 4   70      1387        209         38         96        451        922
## 5   82      1297        186         27        102        472        920
## 6   75      1279        200         36         92        443        973
##   BASERUN_SB PITCHING_H PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E
## 1        101       9364          84         927        5456       1011
## 2         37       1347         191         689        1082        193
## 3         46       1377         137         602         917        175
## 4         43       1396          97         454         928        164
## 5         49       1297         102         472         920        138
## 6        107       1279          92         443         973        123
##       WINS          BATTING_H      BATTING_2B      BATTING_3B    
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##    BATTING_HR       BATTING_BB      BATTING_SO       BASERUN_SB   
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 556.8   1st Qu.: 67.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 736.3   Mean   :123.4  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:151.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##    PITCHING_H     PITCHING_HR     PITCHING_BB      PITCHING_SO     
##  Min.   : 1137   Min.   :  0.0   Min.   :   0.0   Min.   :    0.0  
##  1st Qu.: 1419   1st Qu.: 50.0   1st Qu.: 476.0   1st Qu.:  626.0  
##  Median : 1518   Median :107.0   Median : 536.5   Median :  813.5  
##  Mean   : 1779   Mean   :105.7   Mean   : 553.0   Mean   :  817.5  
##  3rd Qu.: 1682   3rd Qu.:150.0   3rd Qu.: 611.0   3rd Qu.:  957.0  
##  Max.   :30132   Max.   :343.0   Max.   :3645.0   Max.   :19278.0  
##    FIELDING_E    
##  Min.   :  65.0  
##  1st Qu.: 127.0  
##  Median : 159.0  
##  Mean   : 246.5  
##  3rd Qu.: 249.2  
##  Max.   :1898.0

When we look at the observations in the revised dataset, we now have successfully eliminated all of the NA values. However, when we inspect the data, we see a significant amount of outliers. And outliers may potentially have high leverage on our models. We need to investigate this further and deal with these outliers.

In general, what do we do about outliers? (Let’s refer to ‘Linear Models with R, Second Edition’, page 88.) Check for a data-entry error first.

Some may have been entered in error. Certainly like values of 19278 strike outs is physically impossible, so we need to remove these values and impute them with the median. Let’s plot the histogram with all of the values.

We notice that some of these histograms extend out quite far to the right from their outliers. Again, looking at the max values for these variables, such as PITCHING_H, PITCHING_BB, PITCHING_SO, and FIELDING_E, these are likely entered in error. The next step will be to impute these values with the median. We will arbitrarily choose any values above 3 standard deviations above the mean and impute them as the median.

The rest of the variables on visual inspection appear to be reasonable fits (even if they do contain outliers; we are trying to eliminate the errors that are the most obvious and likely secondary to entry error.)

##       WINS          BATTING_H      BATTING_2B      BATTING_3B    
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##    BATTING_HR       BATTING_BB      BATTING_SO       BASERUN_SB   
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 556.8   1st Qu.: 67.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 736.3   Mean   :123.4  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:151.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##    PITCHING_H    PITCHING_HR     PITCHING_BB     PITCHING_SO    
##  Min.   :1137   Min.   :  0.0   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.:1419   1st Qu.: 50.0   1st Qu.:476.0   1st Qu.: 626.0  
##  Median :1518   Median :107.0   Median :536.5   Median : 813.5  
##  Mean   :1605   Mean   :105.7   Mean   :500.4   Mean   : 795.6  
##  3rd Qu.:1660   3rd Qu.:150.0   3rd Qu.:536.5   3rd Qu.: 954.0  
##  Max.   :4134   Max.   :343.0   Max.   :536.5   Max.   :1600.0  
##    FIELDING_E   
##  Min.   : 65.0  
##  1st Qu.:127.0  
##  Median :159.0  
##  Mean   :198.9  
##  3rd Qu.:215.0  
##  Max.   :681.0

Now that we re-examine the data, we have successfully removed the extreme MAX values, and the data appears to be overall more consistent with the rest.

Now let’s take a look at the response variable, WINS. We had mentioned earlier that the WINS column may contain features that are outliers. And from before, we noted that the most winning team in baseball history had a winning percentage of 76.3%, and that the most losing team in baseball history had a winning percentage of 13.0%. Both are certainly at the extremes. However, the data does contain points with more wins than the most winning team in baseball history and the most losing team in baseball history.

How many and what percentage of WINS is either below 13.0% or greater than 76.3% (which corresponds to 21 games and 123 games respectively?

## [1] "Number of Wins that are less than 21 or greater than 123: 11"
## [1] "Percentage: 0.48%"

So only 0.48% of the data are in the extreme ranges. We will leave this for now given it accounts for such a small amount of data.

Binning Data: Now let’s talk about weak, average, and strong teams. In baseball, anything greater than 100 game win is a great season. We should consider cutting games above 100 games. And likewise, any season that is less than 0.500 win percentage, or wins less than 81 games in the season is considered to be generally non-contenders for the world series. For this dataset, we will consider any team less than 81 wins to be weak, any team greater than 100 wins to be a strong team; and the rest will be considered average teams.

How many of each teams exist in this dataset?

## Wins
##    Weak Teams Average Teams  Strong Teams 
##          1118           958           200

It appears that there are generally more weak teams (less than .500 winning percentage) in this dataset than average teams or strong teams. When we create our models, we will create a regression model with all of the teams, and then subset the data to see if we can create a regression model for the weak team, average teams, and strong teams (this will be done in the following modelling section).

Let’s revisit the scatterplots, and see what the new scatterplots appear like after dealing with the missing values and outliers. Depending on the plot, consideration needs to be made to potentially transform the data. Given that this is a multiple regression, in order to see if we need to transform the response variable, we will need to visualize the residuals and ensure that it satisfies the basic assumptions of regression.

Because visualizing in multiple dimensions is impossible for the human mind, we will simplify the process and only look at one predictor variable vs. the response variable, or in this case: BATTING_H vs. WINS.

## [1] "Pearson Correlation for Hits vs. Wins: 0.388767521072289"

Let’s take a look at the residuals and see if this response would benefit from a transformation.

The overall residual distribution does appear to be fairly normally distributed with no obvious heteroscedasticity, so a transformation may not be necessary. However, given that this is only a single predictor variable to a single response, this may certainly change the residual plot. Therefore, we will create the most common transformations, the logarithmic transformation, square-root transformation, and the BoxCox transformation and include them with our model building to see if these response transformations would improve the R-squared value.

It is also a good idea to look at the data separated into their bins (WINS). Another visualization with color separations may provide more insight into how BATTING_H may relate to WINS.

We could also look and see how BATTING_H compare to BATTING_HR binning an Average, Strong, and Weak Team.

Or BATTING_H compare to BATTING_SO binning an Average, Strong, and Weak Team.

Again, it is not easily discernible what this information means. But is interesting to see how the data separates itself out.

Before we continue onto the third section, we should go back to Moneyball. Moneyball had emphasized these two stats quite heavily, On-base Percentage and Slugging. On-Base Percentage (OBP) is “a statistic generally measugin how frequently a batter reaches base. Specifically, it records the ratio of the batter’s times-on-base (TOB) (the sum of hits, walks, and times hit by pitch) to their number of plate appearances. By factoring in only hits, walks and times hit by pitch, OBP does not credit the batter for reaching base due to fielding errors or decisions.” (https://en.wikipedia.org/wiki/On-base_percentage)

The formula is as follows:

\[OBP = \frac{H + BB + HBP}{AB + BB + HBP + SF}\]

Slugging Percentage is “a measure of the batting producitivity of a hitter. It is calculated as total bases divided by at bats, through the following formula: (https://en.wikipedia.org/wiki/Slugging_percentage)

\[SLG = \frac{(1B)+(2 * 2B) + (3 * 3B) + (4 * HR)}{AB}\]

We would like to include both of these stats into our model. However, given that we do not have ‘At-Bats (AB)’, ‘Sacrifice Fly (SF)’, ‘Hit by Pitch (HBP - as this was taken out)’, we will only be able to include a modified version of the slugging percentage into our dataframe. As of note, BATTING_H counts all types of hits, whether it was a single, double, triple, or homerun. As a result, we will use the following stat and include it in our model:

\[MODIFIED-SLG = \frac{BATTING\text{_}H + (2 * BATTING\text{_}2B) + (3 * BATTING\text{_}3B) + (4 * BATTING\text{_}HR)}{\text{Total Number of Innings in 162 games = 1458}}\]

Other statistics that we could potentially add to our model can be viewed in this following article, “Which Baseball Statistic is the Most Important When Determining Team Success?” (https://www.iwu.edu/economics/PPE13/houser.pdf). However, given the limitations of this data, we will only add the modified slugging to our dataset.

3. Build Models

Using the training data set, we will test out several different multiple linear regression models, and try to develop a model that would best accurate in terms of predicting WINS for a team. The initial approach would be to build the full model, assuming that more data and information would create the best model. Of course, this is subjected to testing, and we may ultimately find models that are better than the full model.

Full model (using WINS as the response variable; and not the transformations; and leaving out STRENGTH as that was derived from the target variable):

## 
## Call:
## lm(formula = WINS ~ . - WINS_log - WINS_sqrt - STRENGTH, data = moneyball_rev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.862  -8.582   0.353   8.578  56.341 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.414986   5.607021  -0.252  0.80079    
## BATTING_H    0.038461   0.003769  10.204  < 2e-16 ***
## BATTING_2B  -0.019783   0.009431  -2.098  0.03604 *  
## BATTING_3B   0.102623   0.017633   5.820 6.72e-09 ***
## BATTING_HR   0.075863   0.027554   2.753  0.00595 ** 
## BATTING_BB   0.035305   0.004431   7.968 2.52e-15 ***
## BATTING_SO   0.011331   0.004418   2.565  0.01039 *  
## BASERUN_SB   0.030714   0.004590   6.692 2.76e-11 ***
## PITCHING_H   0.006600   0.001173   5.625 2.08e-08 ***
## PITCHING_HR -0.036080   0.024315  -1.484  0.13798    
## PITCHING_BB -0.018563   0.007410  -2.505  0.01230 *  
## PITCHING_SO -0.007138   0.003582  -1.993  0.04643 *  
## FIELDING_E  -0.022174   0.003612  -6.138 9.82e-10 ***
## MOD_SLG            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.49 on 2263 degrees of freedom
## Multiple R-squared:   0.27,  Adjusted R-squared:  0.2661 
## F-statistic: 69.74 on 12 and 2263 DF,  p-value: < 2.2e-16

This model has an adjusted R-square value of 0.2661. While we had attempted to use the MOD_SLG as a predictor variable, R is producing an undefined value because of its likely very high correlation with the other variables within the dataset. The model otherwise does appear to fulfill the assumptions of the regression with a normally distributed residuals as evidenced by the histogram plot of the residuals and the Q-Q plot. However, an adjusted R-square value of 0.2661 is quite poor. We can try to improve this.

The model did show PITCHING_HR as a non-significant value, let’s see if we can remove this variable, and re-develop another model with this variable and MOD_SLG taken out.

## 
## Call:
## lm(formula = WINS ~ . - WINS_log - WINS_sqrt - STRENGTH - MOD_SLG - 
##     PITCHING_HR, data = moneyball_rev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.196  -8.470   0.304   8.603  55.858 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.482090   5.573141  -0.087 0.931075    
## BATTING_H    0.037819   0.003745  10.098  < 2e-16 ***
## BATTING_2B  -0.019438   0.009430  -2.061 0.039398 *  
## BATTING_3B   0.100402   0.017574   5.713 1.26e-08 ***
## BATTING_HR   0.037611   0.009732   3.865 0.000114 ***
## BATTING_BB   0.035544   0.004429   8.026 1.61e-15 ***
## BATTING_SO   0.014178   0.003980   3.562 0.000375 ***
## BASERUN_SB   0.030305   0.004582   6.613 4.68e-11 ***
## PITCHING_H   0.006706   0.001171   5.725 1.17e-08 ***
## PITCHING_BB -0.019088   0.007403  -2.578 0.009988 ** 
## PITCHING_SO -0.009777   0.003110  -3.143 0.001693 ** 
## FIELDING_E  -0.021807   0.003605  -6.049 1.70e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.5 on 2264 degrees of freedom
## Multiple R-squared:  0.2693, Adjusted R-squared:  0.2657 
## F-statistic: 75.84 on 11 and 2264 DF,  p-value: < 2.2e-16

So while all of the coefficients are now statistically significant, the adjusted R-square value really has not changed much. In fact, we can use an ANOVA to determine if there really is any difference between these two models.

## Analysis of Variance Table
## 
## Model 1: WINS ~ (BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB + 
##     BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB + 
##     PITCHING_SO + FIELDING_E + STRENGTH + WINS_log + WINS_sqrt + 
##     MOD_SLG) - WINS_log - WINS_sqrt - STRENGTH
## Model 2: WINS ~ (BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB + 
##     BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB + 
##     PITCHING_SO + FIELDING_E + STRENGTH + WINS_log + WINS_sqrt + 
##     MOD_SLG) - WINS_log - WINS_sqrt - STRENGTH - MOD_SLG - PITCHING_HR
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   2263 412102                           
## 2   2264 412503 -1   -400.97 2.2019  0.138

And ANOVA suggests that with a F-statistic of 2.2019, this has a corresponding p-value of .138, indicating no difference between the two models.

We may be able to modify the data and start looking at the terms from a different perspective. From our data exploration, we have noted that there are some collinearity in the datasets. This may be a good opportunity to include the full model with interactive terms. Particularly, in regards to batting, BATTING_H, BATTING_2B, BATTING_3B, BATTING_HR are all likely to be interactive. Let’s try it out in the model.

## 
## Call:
## lm(formula = WINS ~ . + BATTING_H * BATTING_2B * BATTING_3B * 
##     BATTING_HR, data = new_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.388  -8.373   0.412   8.340  56.319 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                -8.746e+01  3.922e+01  -2.230
## BATTING_H                                   7.479e-02  2.798e-02   2.673
## BATTING_2B                                  2.797e-01  1.571e-01   1.781
## BATTING_3B                                  1.596e+00  5.145e-01   3.103
## BATTING_HR                                  8.029e-01  5.597e-01   1.435
## BATTING_BB                                  3.497e-02  4.442e-03   7.874
## BATTING_SO                                  1.064e-02  4.463e-03   2.384
## BASERUN_SB                                  3.698e-02  4.701e-03   7.868
## PITCHING_H                                  6.923e-03  1.224e-03   5.655
## PITCHING_HR                                -3.419e-02  2.588e-02  -1.321
## PITCHING_BB                                -2.552e-02  7.463e-03  -3.420
## PITCHING_SO                                -6.838e-03  3.630e-03  -1.884
## FIELDING_E                                 -2.193e-02  3.652e-03  -6.005
## BATTING_H:BATTING_2B                       -1.079e-04  1.053e-04  -1.024
## BATTING_H:BATTING_3B                       -8.410e-04  3.446e-04  -2.441
## BATTING_2B:BATTING_3B                      -4.750e-03  1.900e-03  -2.500
## BATTING_H:BATTING_HR                       -2.659e-04  3.840e-04  -0.692
## BATTING_2B:BATTING_HR                      -3.511e-03  2.014e-03  -1.743
## BATTING_3B:BATTING_HR                      -6.663e-03  1.115e-02  -0.597
## BATTING_H:BATTING_2B:BATTING_3B             2.576e-06  1.189e-06   2.166
## BATTING_H:BATTING_2B:BATTING_HR             1.520e-06  1.345e-06   1.131
## BATTING_H:BATTING_3B:BATTING_HR             3.084e-06  7.464e-06   0.413
## BATTING_2B:BATTING_3B:BATTING_HR            3.411e-05  3.906e-05   0.873
## BATTING_H:BATTING_2B:BATTING_3B:BATTING_HR -1.843e-08  2.534e-08  -0.727
##                                            Pr(>|t|)    
## (Intercept)                                0.025830 *  
## BATTING_H                                  0.007578 ** 
## BATTING_2B                                 0.075094 .  
## BATTING_3B                                 0.001941 ** 
## BATTING_HR                                 0.151565    
## BATTING_BB                                 5.31e-15 ***
## BATTING_SO                                 0.017214 *  
## BASERUN_SB                                 5.55e-15 ***
## PITCHING_H                                 1.76e-08 ***
## PITCHING_HR                                0.186634    
## PITCHING_BB                                0.000637 ***
## PITCHING_SO                                0.059730 .  
## FIELDING_E                                 2.22e-09 ***
## BATTING_H:BATTING_2B                       0.305855    
## BATTING_H:BATTING_3B                       0.014739 *  
## BATTING_2B:BATTING_3B                      0.012495 *  
## BATTING_H:BATTING_HR                       0.488719    
## BATTING_2B:BATTING_HR                      0.081442 .  
## BATTING_3B:BATTING_HR                      0.550304    
## BATTING_H:BATTING_2B:BATTING_3B            0.030394 *  
## BATTING_H:BATTING_2B:BATTING_HR            0.258376    
## BATTING_H:BATTING_3B:BATTING_HR            0.679527    
## BATTING_2B:BATTING_3B:BATTING_HR           0.382726    
## BATTING_H:BATTING_2B:BATTING_3B:BATTING_HR 0.467066    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.31 on 2252 degrees of freedom
## Multiple R-squared:  0.2937, Adjusted R-squared:  0.2864 
## F-statistic: 40.71 on 23 and 2252 DF,  p-value: < 2.2e-16

The interactive terms approach does seem to improve the adjusted R-squared value to 0.2864, but there are quite a bit of non-statistically significant coefficients here.

## Analysis of Variance Table
## 
## Model 1: WINS ~ (BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB + 
##     BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB + 
##     PITCHING_SO + FIELDING_E + STRENGTH + WINS_log + WINS_sqrt + 
##     MOD_SLG) - WINS_log - WINS_sqrt - STRENGTH
## Model 2: WINS ~ BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB + 
##     BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB + 
##     PITCHING_SO + FIELDING_E + BATTING_H * BATTING_2B * BATTING_3B * 
##     BATTING_HR
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   2263 412102                                  
## 2   2252 398726 11     13376 6.8679 1.724e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA analysis suggests that the first model I had developed compared to this last model with the interactive terms are statistically different. However, it is unclear if a more complicated model is truly the better way to go.

Let’s determine to see if looking at any of the transformations of the predictor responses. Initially, I had attempted the logarithmic transformation and square root transformation, but this led to singularity issues, thus not allowing R to properly create a model. Therefore, those two transformations were discarded. Let’s take a look at the BoxCox transformations. In order to perform a box-cox transformation, all the response variables will need to be positive. There was one outlier with 0 wins, which we will eliminate in order to perform the boxcox transformation.

For the sake of simplicity, we will choose lambda as 1.4. Let’s transform the target variable and see if we had improved the linear regression model.

## 
## Call:
## lm(formula = WINS_BC ~ . - WINS - WINS_log - WINS_sqrt - STRENGTH - 
##     MOD_SLG, data = mod_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -421.44  -71.06   -0.23   67.44  476.39 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.380e+02  4.586e+01  -3.009  0.00265 ** 
## BATTING_H    2.968e-01  3.034e-02   9.782  < 2e-16 ***
## BATTING_2B  -1.446e-01  7.540e-02  -1.918  0.05527 .  
## BATTING_3B   7.895e-01  1.407e-01   5.613 2.23e-08 ***
## BATTING_HR   6.269e-01  2.196e-01   2.854  0.00435 ** 
## BATTING_BB   2.716e-01  3.533e-02   7.687 2.24e-14 ***
## BATTING_SO   5.981e-02  3.530e-02   1.694  0.09038 .  
## BASERUN_SB   2.564e-01  3.662e-02   7.001 3.33e-12 ***
## PITCHING_H   5.002e-02  9.352e-03   5.349 9.73e-08 ***
## PITCHING_HR -2.782e-01  1.939e-01  -1.435  0.15147    
## PITCHING_BB -1.756e-01  5.966e-02  -2.944  0.00327 ** 
## PITCHING_SO -4.464e-02  2.855e-02  -1.563  0.11808    
## FIELDING_E  -1.773e-01  2.879e-02  -6.157 8.73e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 107.6 on 2262 degrees of freedom
## Multiple R-squared:  0.261,  Adjusted R-squared:  0.2571 
## F-statistic: 66.58 on 12 and 2262 DF,  p-value: < 2.2e-16

The adjusted R-square value is 0.2571, which suggets that this is not an improved model. A transformation here may not be helpful for this particular dataset.

Before we continue onto the fourth section of this assignment, let’s go back and run the full model without any modifications I had made (including imputing out missing variables, adjusting the variables, etc.).

## 
## Call:
## lm(formula = WINS ~ ., data = moneyball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8708  -5.6564  -0.0599   5.2545  22.9274 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 60.28826   19.67842   3.064  0.00253 ** 
## BATTING_H    1.91348    2.76139   0.693  0.48927    
## BATTING_2B   0.02639    0.03029   0.871  0.38484    
## BATTING_3B  -0.10118    0.07751  -1.305  0.19348    
## BATTING_HR  -4.84371   10.50851  -0.461  0.64542    
## BATTING_BB  -4.45969    3.63624  -1.226  0.22167    
## BATTING_SO   0.34196    2.59876   0.132  0.89546    
## BASERUN_SB   0.03304    0.02867   1.152  0.25071    
## BASERUN_CS  -0.01104    0.07143  -0.155  0.87730    
## BATTING_HBP  0.08247    0.04960   1.663  0.09815 .  
## PITCHING_H  -1.89096    2.76095  -0.685  0.49432    
## PITCHING_HR  4.93043   10.50664   0.469  0.63946    
## PITCHING_BB  4.51089    3.63372   1.241  0.21612    
## PITCHING_SO -0.37364    2.59705  -0.144  0.88577    
## FIELDING_E  -0.17204    0.04140  -4.155 5.08e-05 ***
## FIELDING_DP -0.10819    0.03654  -2.961  0.00349 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.467 on 175 degrees of freedom
##   (2085 observations deleted due to missingness)
## Multiple R-squared:  0.5501, Adjusted R-squared:  0.5116 
## F-statistic: 14.27 on 15 and 175 DF,  p-value: < 2.2e-16

Incredibly, the Adjusted R-squared value is 0.5116. However, when we had performed our data munging, we noted some issues with the data. Could this model be overfitting?

As of now, we will leave this last model into consideration, but other than this model, our best bet model may likely be the model with the interactive terms. Again, taking a look at that interactive model.

## 
## Call:
## lm(formula = WINS ~ . + BATTING_H * BATTING_2B * BATTING_3B * 
##     BATTING_HR, data = new_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.388  -8.373   0.412   8.340  56.319 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                -8.746e+01  3.922e+01  -2.230
## BATTING_H                                   7.479e-02  2.798e-02   2.673
## BATTING_2B                                  2.797e-01  1.571e-01   1.781
## BATTING_3B                                  1.596e+00  5.145e-01   3.103
## BATTING_HR                                  8.029e-01  5.597e-01   1.435
## BATTING_BB                                  3.497e-02  4.442e-03   7.874
## BATTING_SO                                  1.064e-02  4.463e-03   2.384
## BASERUN_SB                                  3.698e-02  4.701e-03   7.868
## PITCHING_H                                  6.923e-03  1.224e-03   5.655
## PITCHING_HR                                -3.419e-02  2.588e-02  -1.321
## PITCHING_BB                                -2.552e-02  7.463e-03  -3.420
## PITCHING_SO                                -6.838e-03  3.630e-03  -1.884
## FIELDING_E                                 -2.193e-02  3.652e-03  -6.005
## BATTING_H:BATTING_2B                       -1.079e-04  1.053e-04  -1.024
## BATTING_H:BATTING_3B                       -8.410e-04  3.446e-04  -2.441
## BATTING_2B:BATTING_3B                      -4.750e-03  1.900e-03  -2.500
## BATTING_H:BATTING_HR                       -2.659e-04  3.840e-04  -0.692
## BATTING_2B:BATTING_HR                      -3.511e-03  2.014e-03  -1.743
## BATTING_3B:BATTING_HR                      -6.663e-03  1.115e-02  -0.597
## BATTING_H:BATTING_2B:BATTING_3B             2.576e-06  1.189e-06   2.166
## BATTING_H:BATTING_2B:BATTING_HR             1.520e-06  1.345e-06   1.131
## BATTING_H:BATTING_3B:BATTING_HR             3.084e-06  7.464e-06   0.413
## BATTING_2B:BATTING_3B:BATTING_HR            3.411e-05  3.906e-05   0.873
## BATTING_H:BATTING_2B:BATTING_3B:BATTING_HR -1.843e-08  2.534e-08  -0.727
##                                            Pr(>|t|)    
## (Intercept)                                0.025830 *  
## BATTING_H                                  0.007578 ** 
## BATTING_2B                                 0.075094 .  
## BATTING_3B                                 0.001941 ** 
## BATTING_HR                                 0.151565    
## BATTING_BB                                 5.31e-15 ***
## BATTING_SO                                 0.017214 *  
## BASERUN_SB                                 5.55e-15 ***
## PITCHING_H                                 1.76e-08 ***
## PITCHING_HR                                0.186634    
## PITCHING_BB                                0.000637 ***
## PITCHING_SO                                0.059730 .  
## FIELDING_E                                 2.22e-09 ***
## BATTING_H:BATTING_2B                       0.305855    
## BATTING_H:BATTING_3B                       0.014739 *  
## BATTING_2B:BATTING_3B                      0.012495 *  
## BATTING_H:BATTING_HR                       0.488719    
## BATTING_2B:BATTING_HR                      0.081442 .  
## BATTING_3B:BATTING_HR                      0.550304    
## BATTING_H:BATTING_2B:BATTING_3B            0.030394 *  
## BATTING_H:BATTING_2B:BATTING_HR            0.258376    
## BATTING_H:BATTING_3B:BATTING_HR            0.679527    
## BATTING_2B:BATTING_3B:BATTING_HR           0.382726    
## BATTING_H:BATTING_2B:BATTING_3B:BATTING_HR 0.467066    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.31 on 2252 degrees of freedom
## Multiple R-squared:  0.2937, Adjusted R-squared:  0.2864 
## F-statistic: 40.71 on 23 and 2252 DF,  p-value: < 2.2e-16

The reason why this model may be the most reasonable to use is that its coefficient values are probably the most intuitive out of all the models proposed here. For example, the positive coefficients for BATTING_H, BATTING_2B, BATTING_3B, BATTING_HR, BATTING_BB, etc. suggests that the higher the number associated with these coefficients, the more likely it will contribute to the WINS column. Likewise, with PITCHING_HR, PITCHING_BB, given it’s negative coefficient values, the higher these values, the more it will hurt the WINS column.

That being said, not all of the coefficient values make sense. For instance, PITCHING_SO has a negative value. In the baseball world, throwing more strikeouts should in theory increases your chances of winning.

However, that being said, the mathematics behind regression can be quite complex, and different variables often times interact with each other. This can cause the numbers to be quite counter-intuitive. That doesn’t necessarily mean that the model is incorrect.

4. Select Models

As discussed above, we have several different models to choose from. And there are several different ways to evaluate a model, including \(Adjusted^2\), RMSE, etc. The descriptions of these different methods can be read here: https://www.r-bloggers.com/assessing-the-accuracy-of-our-models-r-squared-adjusted-r-squared-rmse-mae-aic/.

A significant amount of time was looking into these models listed above. In addition to trying to make sense of the model, I have already started evaluating the models based off their Adjusted \(R^2\) values. Ultimately, I had chosen the interactive terms model as multi-collinearity is likely an issue. So for instance, a player’s ability to bat is likely going to affect his ability to hit singles, doubles, triples, or homeruns. Players who tend to be better batters, tend to hit well much better all around in all four of these categories. Therefore, they are not independent from each other. This is the big reason why I ultimately ended up choosing this particular model over the others.

Now there was the last model I had provided that had a significantly improved R-squared value over the model I had chosen. Why did I not choose this model? As we were performing our exploratory data analysis, this data contained a significant amount of missing data and outliers, which can skew the model and provide a false sense of security and assurance. Knowing that the unmodified full model contains all of these issues, it may be best not to use this model, as this may overfit the training model, but may ultimately fail to generalize to other datasets.

As requested in the question set, let’s evaluate the multiple linear regression model based on (a) mean squared error, (b) \(R^2\), (c) F-statistic, and (d) residual plots.

## [1] "Mean Squared Error: 175.187129382774"
## [1] "Root MSE: 13.2358274914255"
## [1] "Adjusted R-squared: 0.286446971302889"
## [1] "F-statistic: 40.7074086819212"

The model appears to statistically significant.

Now the to make the predictions with the evaluation dataset. We will initially have to eliminate the variables BATTING_HBP,BASERUN_CS,FIELDING_DP. If the evaluation dataset has any missing data points, we will have to impute in the median data points to remain consistent with our training model.

Let’s take a look at the head of the test set after we made these modifications.

##   BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB BATTING_SO
## 1      1209        170         33         83        447       1080
## 2      1221        151         29         88        516        929
## 3      1395        183         29         93        509        816
## 4      1539        309         29        159        486        914
## 5      1445        203         68          5         95        416
## 6      1431        236         53         10        215        377
##   BASERUN_SB PITCHING_H PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E
## 1         62       1209          83         447        1080        140
## 2         54       1221          88         516         929        135
## 3         59       1395          93         509         816        156
## 4        148       1539         159         486         914        124
## 5         92       3902          14         257        1123        616
## 6         92       2793          20         420         736        572

Let’s produce a prediction for the test set including its prediction intervals. (If you like to save this data as a .csv file and output it, please see the appendix.)

##           fit        lwr       upr
## 1    65.19033  38.963608  91.41705
## 2    65.75030  39.473535  92.02707
## 3    74.08726  47.938162 100.23636
## 4    84.90616  58.763857 111.04847
## 5    67.94395  41.316532  94.57138
## 6    62.79714  36.464737  89.12954
## 7    77.45402  51.228355 103.67968
## 8    65.51250  39.315788  91.70922
## 9    70.98443  44.815416  97.15345
## 10   69.90627  43.781607  96.03094
## 11   72.15654  46.020825  98.29225
## 12   82.60546  56.450025 108.76089
## 13   82.06223  55.834390 108.29008
## 14   80.48800  54.280972 106.69502
## 15   79.35101  53.156912 105.54510
## 16   81.84139  55.654766 108.02802
## 17   72.44740  46.251563  98.64324
## 18   78.99631  52.852399 105.14022
## 19   61.00442  34.822399  87.18644
## 20   88.54734  62.322522 114.77216
## 21   81.91433  55.705036 108.12363
## 22   84.20397  57.991239 110.41669
## 23   78.65437  52.424724 104.88402
## 24   71.56051  45.414208  97.70681
## 25   82.93302  56.788821 109.07721
## 26   86.54794  60.379938 112.71595
## 27   59.14667  30.910344  87.38299
## 28   74.11489  47.961425 100.26835
## 29   83.62548  57.378076 109.87289
## 30   76.59144  50.331029 102.85184
## 31   84.98409  58.680956 111.28723
## 32   83.41988  57.233143 109.60661
## 33   82.11643  55.851783 108.38108
## 34   83.47974  57.157876 109.80160
## 35   80.24415  54.095454 106.39286
## 36   81.81934  55.640607 107.99806
## 37   74.37683  48.249544 100.50411
## 38   89.03987  62.828234 115.25150
## 39   79.32565  52.644050 106.00725
## 40   88.27825  62.094141 114.46236
## 41   77.11910  50.885123 103.35308
## 42   85.49939  59.334723 111.66406
## 43   25.48259  -3.547424  54.51261
## 44   97.98304  71.180574 124.78551
## 45   87.34547  60.978442 113.71249
## 46   87.89786  61.517969 114.27774
## 47   90.59936  64.077126 117.12159
## 48   73.27413  47.123699  99.42456
## 49   68.33860  42.193537  94.48366
## 50   74.98734  48.853630 101.12105
## 51   77.19618  51.061838 103.33053
## 52   85.13214  58.960191 111.30409
## 53   79.18255  53.036294 105.32881
## 54   74.63227  48.467231 100.79730
## 55   76.29415  50.175469 102.41282
## 56   77.84642  51.721354 103.97149
## 57   87.19360  60.950002 113.43721
## 58   71.76554  45.590474  97.94060
## 59   56.51941  30.026999  83.01183
## 60   76.73483  50.592401 102.87725
## 61   84.09501  57.937354 110.25267
## 62   77.58656  51.266835 103.90629
## 63   85.24703  59.109578 111.38447
## 64   81.97135  55.714428 108.22828
## 65   84.91806  58.730521 111.10560
## 66   95.05652  68.729248 121.38378
## 67   74.31244  48.172371 100.45252
## 68   80.09233  53.927763 106.25689
## 69   76.73882  50.581141 102.89651
## 70   88.13985  61.896703 114.38300
## 71   85.90838  59.663667 112.15309
## 72   73.70898  47.574961  99.84300
## 73   80.76387  54.599711 106.92804
## 74   90.59474  64.284360 116.90513
## 75   83.30479  57.115692 109.49389
## 76   89.78166  63.531510 116.03181
## 77   82.46595  56.284587 108.64731
## 78   81.27257  55.145115 107.40003
## 79   67.71837  41.567232  93.86952
## 80   73.97585  47.832538 100.11916
## 81   84.36046  58.087835 110.63308
## 82   87.05392  60.803596 113.30424
## 83   91.67783  64.542796 118.81287
## 84   80.38443  54.248262 106.52060
## 85   83.86098  57.642162 110.07980
## 86   79.95696  53.782240 106.13167
## 87   80.98155  54.762846 107.20026
## 88   81.85522  55.734353 107.97608
## 89   83.22258  57.049404 109.39576
## 90   83.34236  57.117784 109.56694
## 91   74.31418  48.084138 100.54422
## 92   87.84938  45.968486 129.73027
## 93   67.71741  41.544106  93.89072
## 94   78.69073  52.511501 104.86996
## 95   79.83220  53.676285 105.98811
## 96   78.33182  52.202588 104.46104
## 97   85.74682  59.452304 112.04134
## 98   94.17179  67.888270 120.45531
## 99   87.90937  61.654843 114.16390
## 100  89.64484  63.385556 115.90412
## 101  83.06290  56.909825 109.21597
## 102  70.06731  43.916807  96.21782
## 103  81.97868  55.849528 108.10784
## 104  81.80705  55.612554 108.00154
## 105  81.23208  55.038172 107.42600
## 106  69.69927  43.316922  96.08162
## 107  51.45812  24.928120  77.98813
## 108  77.23375  51.053263 103.41425
## 109  80.72630  54.561585 106.89101
## 110  59.31439  32.997730  85.63105
## 111  82.35130  56.190169 108.51242
## 112  84.24418  58.050125 110.43824
## 113  92.58932  66.435355 118.74328
## 114  88.50692  62.353622 114.66021
## 115  81.93386  55.791334 108.07639
## 116  80.68634  54.533626 106.83905
## 117  89.33272  63.180354 115.48510
## 118  79.60421  53.460792 105.74764
## 119  76.69707  50.549293 102.84484
## 120  75.12056  48.895427 101.34569
## 121  87.40086  61.195277 113.60645
## 122  61.46132  35.266193  87.65645
## 123  64.45339  38.269788  90.63698
## 124  58.01540  31.590142  84.44067
## 125  70.33036  44.159179  96.50154
## 126  83.89022  57.728962 110.05148
## 127  86.35077  60.185744 112.51579
## 128  74.25452  48.117202 100.39184
## 129  85.57835  59.421421 111.73529
## 130  89.79343  63.526818 116.06005
## 131  85.25814  59.082260 111.43401
## 132  77.80046  51.602412 103.99851
## 133  76.32213  50.151380 102.49287
## 134  86.09441  59.882982 112.30584
## 135  89.33801  63.152281 115.52375
## 136  59.78975  33.261077  86.31843
## 137  77.31551  51.177677 103.45334
## 138  76.57652  50.455069 102.69796
## 139  86.59601  60.411472 112.78055
## 140  80.93792  54.813035 107.06281
## 141  62.77592  36.563040  88.98880
## 142  68.55715  42.282573  94.83172
## 143  91.01579  64.805980 117.22561
## 144  76.89694  50.754520 103.03937
## 145  70.91936  44.737192  97.10153
## 146  71.99100  45.849100  98.13289
## 147  78.41426  52.282760 104.54575
## 148  80.75981  54.626206 106.89341
## 149  83.77344  57.655507 109.89137
## 150  78.69675  52.569928 104.82358
## 151  82.87512  56.736292 109.01395
## 152  81.37889  55.230979 107.52681
## 153 156.99017 102.266046 211.71429
## 154  71.96646  45.787191  98.14572
## 155  75.73622  49.597692 101.87474
## 156  75.40503  49.191800 101.61826
## 157  80.29909  54.032283 106.56590
## 158  62.20562  35.886090  88.52514
## 159  86.78267  60.517509 113.04782
## 160  67.57468  41.410067  93.73930
## 161  90.08680  62.841868 117.33174
## 162  96.44026  69.739091 123.14143
## 163  85.66220  59.409389 111.91502
## 164  91.11011  63.737069 118.48316
## 165  87.64169  61.154422 114.12896
## 166  84.70623  58.409282 111.00318
## 167  85.51847  59.306794 111.73015
## 168  82.09385  55.951787 108.23591
## 169  76.40460  50.163202 102.64600
## 170  79.33048  53.175531 105.48543
## 171  89.58480  63.387303 115.78229
## 172  85.44618  59.290005 111.60236
## 173  77.92991  51.758648 104.10117
## 174  89.30595  63.126415 115.48549
## 175  79.62758  53.469350 105.78581
## 176  73.40745  47.210875  99.60403
## 177  73.88085  47.612427 100.14928
## 178  75.35809  49.239079 101.47709
## 179  73.23287  47.105102  99.36064
## 180  82.31379  56.155727 108.47186
## 181  84.06089  57.822858 110.29893
## 182  83.83366  57.678479 109.98885
## 183  85.07842  58.929591 111.22725
## 184  80.14081  53.960171 106.32146
## 185 109.54300  71.924584 147.16142
## 186  95.62474  69.225611 122.02388
## 187  86.02903  59.620017 112.43804
## 188  56.84309  29.869694  83.81648
## 189  54.60977  28.250477  80.96906
## 190 108.85223  82.026384 135.67808
## 191  65.75286  39.590759  91.91497
## 192  78.75669  52.614144 104.89923
## 193  75.31090  49.128335 101.49346
## 194  77.47357  51.338185 103.60896
## 195  82.15986  55.936617 108.38311
## 196  63.94168  37.782928  90.10044
## 197  74.59659  48.478351 100.71482
## 198  78.72536  52.455108 104.99562
## 199  77.58546  51.395559 103.77537
## 200  83.39330  57.214291 109.57231
## 201  80.98275  54.784910 107.18059
## 202  81.46477  55.326329 107.60321
## 203  76.16639  49.980054 102.35273
## 204  79.22534  52.636960 105.81372
## 205  78.52411  52.402750 104.64547
## 206  81.08151  54.916165 107.24686
## 207  78.04266  51.857006 104.22832
## 208  76.37720  50.230116 102.52428
## 209  79.46808  53.253476 105.68269
## 210  72.15377  45.929879  98.37765
## 211  95.20298  68.683771 121.72219
## 212  87.87247  61.549843 114.19510
## 213  84.08937  57.884732 110.29401
## 214  65.99910  39.833482  92.16472
## 215  72.18685  46.007990  98.36571
## 216  84.85461  58.699564 111.00967
## 217  85.26477  59.074753 111.45479
## 218  91.53384  65.354403 117.71327
## 219  76.55081  50.429003 102.67262
## 220  77.36313  51.231257 103.49499
## 221  77.51708  51.368310 103.66586
## 222  77.86387  51.688701 104.03905
## 223  80.50133  54.351219 106.65145
## 224  78.85486  52.712468 104.99724
## 225 101.78313  72.330538 131.23571
## 226  76.04254  49.883527 102.20156
## 227  80.20297  54.078127 106.32781
## 228  80.73083  54.569197 106.89247
## 229  80.05644  53.922349 106.19053
## 230  67.69091  41.294343  94.08747
## 231  68.15125  41.961485  94.34102
## 232  89.71557  63.507540 115.92360
## 233  80.88768  54.658426 107.11692
## 234  89.32436  63.076754 115.57196
## 235  78.79225  52.673496 104.91101
## 236  72.67776  46.552751  98.80278
## 237  83.37966  57.211714 109.54760
## 238  79.78433  53.623895 105.94476
## 239  85.25932  58.767660 111.75099
## 240  71.86001  45.712783  98.00724
## 241  85.99510  59.843070 112.14713
## 242  84.15897  58.000262 110.31767
## 243  82.35973  56.211391 108.50807
## 244  84.49100  58.354138 110.62787
## 245  56.64383  30.285525  83.00214
## 246  87.86669  61.683253 114.05013
## 247  79.47383  53.338449 105.60921
## 248  81.98708  55.848114 108.12605
## 249  72.04451  45.920334  98.16868
## 250  85.32581  59.174962 111.47665
## 251  78.73890  52.556754 104.92104
## 252  56.25814  29.383086  83.13320
## 253  87.56414  61.168158 113.96012
## 254 -23.30153 -84.499000  37.89594
## 255  66.46014  40.196344  92.72393
## 256  76.31004  50.061512 102.55856
## 257  79.46685  53.318807 105.61488
## 258  81.91008  55.763864 108.05630
## 259  71.54079  45.258491  97.82308

Conclusion

Appendix

Section 1

library(ggplot2)
library(reshape2)
library(ggpubr)
library(corrplot)
library(Hmisc)
library(psych)
library(PerformanceAnalytics)
library(MASS)

url <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Homework%201/moneyball-training-data.csv'
moneyball <- read.csv(url, header=TRUE)

# Removing the index column given that this is non-contributory
# Renaming the columns to make it easier to read
moneyball <- moneyball[,-1]
colnames(moneyball) <- c("WINS", "BATTING_H", "BATTING_2B", "BATTING_3B", "BATTING_HR", "BATTING_BB", "BATTING_SO", "BASERUN_SB", "BASERUN_CS", "BATTING_HBP", "PITCHING_H", "PITCHING_HR", "PITCHING_BB", "PITCHING_SO", "FIELDING_E", "FIELDING_DP")
head(moneyball)

# Data shape
colnames(moneyball)
dim(moneyball)

moneyball_sum <- summary(moneyball)
moneyball_sum
describe(moneyball)

# ggplot visualization of data
meltMoneyball <- melt(moneyball)
ggplot(meltMoneyball, aes(factor(variable), value)) + geom_boxplot() + facet_wrap(~variable, scale="free") + labs(x="", y="")

meltMoneyball_toWins <- melt(moneyball, id.vars=c('WINS'))
ggplot(meltMoneyball_toWins, aes(x=value, y=WINS)) + geom_point() + facet_wrap(~variable, scale = "free") + geom_smooth(model="lm") + labs(x="", y="Number of Wins")

moneyball_cor <- cor(moneyball, method = "pearson", use = "complete.obs")
round(moneyball_cor, 2)
corrplot(moneyball_cor, method="shade")

# Checking out positive impact according to the table supplied
pos_impact <- moneyball[,c(1,2,3,4,5,6,8,10,14,16)]
chart.Correlation(pos_impact, histogram=TRUE, pch=19)

# Negatively impact the wins

neg_impact <- moneyball[,c(1,7,9,11,12,13,15)]
chart.Correlation(neg_impact, histogram=TRUE, pch=19)

# Wins
mean_wins <- mean(moneyball$WINS, na.rm=TRUE)
sd_wins <- sd(moneyball$WINS)
ggplot(moneyball, aes(x=WINS)) + geom_histogram(binwidth=1, col = 'black', fill = "green", aes(y=..density..)) + labs(x="Number of Wins", y = "Distribution") + geom_density(alpha=.2) + geom_vline(aes(xintercept=mean(WINS, na.rm=TRUE)), color="red", linetype="dashed", size=1) + geom_vline(xintercept = (mean_wins + 2 * sd_wins), color='blue', linetype='dashed', size=1) + geom_vline(xintercept = (mean_wins - 2 * sd_wins), color='blue', linetype='dashed', size=1)

describe(moneyball$WINS)

# Checking for NA values
rev(sort(colSums(sapply(moneyball, is.na))))

# Percentage of the data that is NA for its respective variable
rev(sort(colSums(sapply(moneyball, is.na))/nrow(moneyball) * 100))

Section 2

# Creating a revised version of the data frame taking out the columns with significant amount of missing data
moneyball_rev <- moneyball[, !names(moneyball) %in% c('BATTING_HBP','BASERUN_CS','FIELDING_DP')]

# Dealing with the outliers. Visualizing the outliers.
outlier_df <- moneyball[, names(moneyball) %in% c('BASERUN_SB', 'PITCHING_SO', 'BATTING_SO')]
outlier_melt <- melt(outlier_df)
ggplot(outlier_melt, aes(x=value)) + geom_histogram(binwidth=20) + facet_wrap(~variable, scale='free') + labs(x='', y='Frequency')

# Example of an outlier
max(outlier_df$PITCHING_SO, na.rm=TRUE)

# Creating a function to replace NA values with median values
replace_median <- function(x){
  x <- as.numeric(as.character(x))
  x[is.na(x)] = median(x, na.rm=TRUE)
  return(x)
}

# Creating the revised data frame
moneyball_rev <- apply(moneyball_rev, 2, replace_median)
moneyball_rev <- as.data.frame(moneyball_rev)

head(moneyball_rev)
summary(moneyball_rev)

# Histogram plots of the independent variables
moneyball_rev_melt <- melt(moneyball_rev)
ggplot(moneyball_rev_melt, aes(x=value)) + geom_histogram() + facet_wrap(~variable, scale='free') + labs(x='', y='Frequency')

# To modify the outliers
moneyball_rev$PITCHING_H[moneyball_rev$PITCHING_H > 3*sd(moneyball_rev$PITCHING_H)] <- median(moneyball_rev$PITCHING_H)
moneyball_rev$PITCHING_BB[moneyball_rev$PITCHING_BB > 3*sd(moneyball_rev$PITCHING_BB)] <- median(moneyball_rev$PITCHING_BB)
moneyball_rev$PITCHING_SO[moneyball_rev$PITCHING_SO > 3*sd(moneyball_rev$PITCHING_SO)] <- median(moneyball_rev$PITCHING_SO)
moneyball_rev$FIELDING_E[moneyball_rev$FIELDING_E > 3*sd(moneyball_rev$FIELDING_E)] <- median(moneyball_rev$FIELDING_E)

summary(moneyball_rev)

# Finding out how many of the wins are outliers
outlier_wins <- sum(moneyball_rev$WINS > 123 | moneyball_rev$WINS < 21)
print(paste0("Number of Wins that are less than 21 or greater than 123: ", outlier_wins))
print(paste0("Percentage: ", round(outlier_wins/nrow(moneyball_rev) * 100,2), "%"))

# Creating buckets for wins
Wins <- cut(moneyball_rev[,'WINS'], breaks = c(0,81,100,162), include.lowest=TRUE, labels=c("Weak Teams", "Average Teams", "Strong Teams"))
table(Wins)

# Subsetting the team strengths for visualization
weak_team <- subset(moneyball_rev, moneyball_rev$WINS < 81)
avg_team <- subset(moneyball_rev, moneyball_rev$WINS >= 81 & moneyball_rev$WINS < 100)
strong_team <- subset(moneyball_rev, moneyball_rev$WINS >= 100)

moneyball_rev$STRENGTH <- ifelse(moneyball_rev$WINS < 81, 'WEAK', ifelse(moneyball_rev$WINS >= 100, 'STRONG', 'AVERAGE'))

# Scatterplot visualization of variables against WINS
moneyball_rev_melt_WINS <- melt(moneyball_rev, id.vars='WINS')
ggplot(moneyball_rev_melt_WINS, aes(x=value, y=WINS)) + geom_jitter() + facet_wrap(~variable, scale='free') + labs(x='', y='Number of WINS')

# Scatter plot of Hitting and Wins
ggplot(moneyball_rev, aes(x=BATTING_H, y=WINS)) + geom_jitter() + labs(x="Hits (1B, 2B, 3B, HR)", y="Number of Wins") + geom_smooth(method="lm")
print(paste0("Pearson Correlation for Hits vs. Wins: ", cor(moneyball_rev$BATTING_H, moneyball_rev$WINS)))

# Creating a sample linear model
lm_samp <- lm(WINS ~ BATTING_H, data=moneyball_rev)
par(mfrow=c(2,2))
plot(lm_samp)

# More visualizations
ggplot(moneyball_rev, aes(x=BATTING_H, y=WINS, color=STRENGTH)) + geom_jitter() + labs(x="Hits (1B, 2B, 3B, HR)", y="Number of Wins") 
ggplot(moneyball_rev, aes(x=BATTING_H, y=BATTING_HR, color=STRENGTH)) + geom_jitter() + labs(x="Hits (1B, 2B, 3B, HR)", y="Number of Home Runs") 
ggplot(moneyball_rev, aes(x=BATTING_H, y=BATTING_SO, color=STRENGTH)) + geom_jitter() + labs(x="Hits (1B, 2B, 3B, HR)", y="Number of Strike Outs") 

# Attempting transformations
moneyball_rev$WINS_log <- log(moneyball_rev$WINS)
moneyball_rev$WINS_sqrt <- sqrt(moneyball_rev$WINS)

# Attempting to create a new variable for slugging
moneyball_rev$MOD_SLG <- (moneyball_rev$BATTING_H + (2 * moneyball_rev$BATTING_2B) + (3 * moneyball_rev$BATTING_3B) + (4 * moneyball_rev$BATTING_HR))/1458

Section 3

# Creating 1st model
lmod <- lm(WINS ~ . - WINS_log - WINS_sqrt - STRENGTH, moneyball_rev)
summary(lmod)
par(mfrow=c(2,2))
plot(lmod)
hist(resid(lmod), main="Histogram of Residuals")

# Creating 2nd model
lmod2 <- lm(WINS ~ . - WINS_log - WINS_sqrt - STRENGTH - MOD_SLG - PITCHING_HR, moneyball_rev)
summary(lmod2)
par(mfrow=c(2,2))
plot(lmod2)
hist(resid(lmod2), main="Histogram of Residuals")

# ANOVA analysis between 1st and 2nd model
anova(lmod, lmod2)

# Creating 3rd model
new_df <- moneyball_rev[,!names(moneyball_rev) %in% c("WINS_log", "WINS_sqrt", "STRENGTH", "MOD_SLG")]
lmod3 <- lm(WINS ~ . + BATTING_H*BATTING_2B*BATTING_3B*BATTING_HR, new_df)
summary(lmod3)
par(mfrow=c(2,2))
plot(lmod3)
hist(resid(lmod3))

# Another ANOVA analysis between 1st and 3rd model
anova(lmod, lmod3)

# Creating 4th model
mod_df <- subset(moneyball_rev, moneyball_rev$WINS != 0)
lmod4 <- lm(WINS ~ . - WINS_log - WINS_sqrt - STRENGTH - MOD_SLG, mod_df)

# BoxCox Transformation
boxcox(lmod4, plotit=TRUE)
boxcox(lmod4, plotit=TRUE, lambda=seq(1.2,1.6, by=0.1))

mod_df$WINS_BC <- mod_df$WINS**1.4

# Creating the 5th model
lmod5 <- lm(WINS_BC ~ . -WINS - WINS_log - WINS_sqrt - STRENGTH - MOD_SLG, mod_df)
summary(lmod5)
par(mfrow=c(2,2))
plot(lmod5)
hist(resid(lmod5))

# Creating the 6th model
lmod6 <- lm(WINS ~ ., moneyball)
summary(lmod6)
par(mfrow=c(2,2))
plot(lmod6)
hist(resid(lmod6))

# Showing the summary of the 3rd model again.
summary(lmod3)

Section 4

# Showing the numbers associated with model 3
sum_lmod3 <- summary(lmod3)
RSS <- c(crossprod(lmod3$residuals))
MSE <- RSS/length(lmod3$residuals)
print(paste0("Mean Squared Error: ", MSE))
print(paste0("Root MSE: ", sqrt(MSE)))
print(paste0("Adjusted R-squared: ", sum_lmod3$adj.r.squared))
print(paste0("F-statistic: ",sum_lmod3$fstatistic[1]))
plot(resid(lmod3))
abline(h=0, col=2)

# Importing the test data set
url2 <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Homework%201/moneyball-evaluation-data.csv'
test_set <- read.csv(url2, header = TRUE)
colnames(test_set) <- c('INDEX', 'BATTING_H', 'BATTING_2B' , 'BATTING_3B' , 'BATTING_HR' , 'BATTING_BB', 'BATTING_SO' , 'BASERUN_SB', 'BASERUN_CS', 'BATTING_HBP', 'PITCHING_H', 'PITCHING_HR', 'PITCHING_BB' , 'PITCHING_SO', 'FIELDING_E', 'FIELDING_DP')
test_set <- test_set[,!names(test_set) %in% c('INDEX','BATTING_HBP','BASERUN_CS','FIELDING_DP')]

test_set <- apply(test_set, 2, replace_median)
test_set <- as.data.frame(test_set)
head(test_set)

# Creating the prediction
predict(lmod3, newdata = test_set, interval="prediction")
# You may un-comment this section if you would like to download a csv file of the predictions
# a <- predict(lmod3, newdata = test_set, interval="prediction")
# write.csv(a, file="Prediction.csv")