In this homework assignment, you will explore, analyze, and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive, and captures the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162-game season.
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set.
Recruiting talented athletes to the Major Leagues has always been a challenge for many baseball teams. The pressure to win championships has often led scouts to search throughout the country for individuals who could make a tremendous impact on their team. For years, scouts focused on largely subjective reports of players’ abilities. However, it was increasingly difficult for teams with a smaller payroll (like the 2002 Oakland A’s) to compete with teams with bigger payrolls (like the New York Yankees).
This was the focus of Michael Lewis’s book ‘Moneyball’ (published in 2003). Given the 2002 Oakland A’s revenue disadvantage, General Manager Billy Beane had to search for a competitive edge. As a result, he turned to sabermetrics. According to Wikipedia, “the central premise of Moneyball is that the collective wisdom of baseball insiders (including players, managers, coaches, scouts, and the front office) over the past century is subjective and often flawed. Statistics such as stolen bases, runs batted in, and batting average, typically used to gauge players, are relics of a 19th-century view of the game, and the statistics available at that time. Before sabermetrics was introduced to baseball, teams were dependent on the skills of their scouts to find and evaluate players. Scouts are those who are experienced in the sport, usually having been involved as players or coaches.” (https://en.wikipedia.org/wiki/Moneyball)
As data scientists, we have been given a dataset to undergo a similar exercise to help identify what factors (baseball statistics) have an impact on the overall wins for the team. (And the more wins, the better the chances for winning the World Series).
The name of this dataset is moneyball. It contains multiple rows and columns with information about each team and its corresponding wins. Let us take a look at the data and its visualizations.
Below is part (head) of the dataset.
## WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB BATTING_SO
## 1 39 1445 194 39 13 143 842
## 2 70 1339 219 22 190 685 1075
## 3 86 1377 232 35 137 602 917
## 4 70 1387 209 38 96 451 922
## 5 82 1297 186 27 102 472 920
## 6 75 1279 200 36 92 443 973
## BASERUN_SB BASERUN_CS BATTING_HBP PITCHING_H PITCHING_HR PITCHING_BB
## 1 NA NA NA 9364 84 927
## 2 37 28 NA 1347 191 689
## 3 46 27 NA 1377 137 602
## 4 43 30 NA 1396 97 454
## 5 49 39 NA 1297 102 472
## 6 107 59 NA 1279 92 443
## PITCHING_SO FIELDING_E FIELDING_DP
## 1 5456 1011 NA
## 2 1082 193 155
## 3 917 175 153
## 4 928 164 156
## 5 920 138 168
## 6 973 123 149
What are the baseball variables that we are looking at in this data set?
## [1] "WINS" "BATTING_H" "BATTING_2B" "BATTING_3B" "BATTING_HR"
## [6] "BATTING_BB" "BATTING_SO" "BASERUN_SB" "BASERUN_CS" "BATTING_HBP"
## [11] "PITCHING_H" "PITCHING_HR" "PITCHING_BB" "PITCHING_SO" "FIELDING_E"
## [16] "FIELDING_DP"
These variables are described above in the first section.
What are the dimensions of this dataset?
## [1] 2276 16
This dataset contains 2276 rows and 16 columns, including the response variable WINS.
If we look at the data within each column, we notice that all of these variables are continuous. So it may be reasonable to start by looking at the quartiles, mean, median, range, skew, minimum, and maximum values.
## WINS BATTING_H BATTING_2B BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
##
## BATTING_HR BATTING_BB BATTING_SO BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## NA's :102 NA's :131
## BASERUN_CS BATTING_HBP PITCHING_H PITCHING_HR
## Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0
## Median : 49.0 Median :58.00 Median : 1518 Median :107.0
## Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7
## 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0
## NA's :772 NA's :2085
## PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
## 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
## NA's :102 NA's :286
## vars n mean sd median trimmed mad min max
## WINS 1 2276 80.79 15.75 82.0 81.31 14.83 0 146
## BATTING_H 2 2276 1469.27 144.59 1454.0 1459.04 114.16 891 2554
## BATTING_2B 3 2276 241.25 46.80 238.0 240.40 47.44 69 458
## BATTING_3B 4 2276 55.25 27.94 47.0 52.18 23.72 0 223
## BATTING_HR 5 2276 99.61 60.55 102.0 97.39 78.58 0 264
## BATTING_BB 6 2276 501.56 122.67 512.0 512.18 94.89 0 878
## BATTING_SO 7 2174 735.61 248.53 750.0 742.31 284.66 0 1399
## BASERUN_SB 8 2145 124.76 87.79 101.0 110.81 60.79 0 697
## BASERUN_CS 9 1504 52.80 22.96 49.0 50.36 17.79 0 201
## BATTING_HBP 10 191 59.36 12.97 58.0 58.86 11.86 29 95
## PITCHING_H 11 2276 1779.21 1406.84 1518.0 1555.90 174.95 1137 30132
## PITCHING_HR 12 2276 105.70 61.30 107.0 103.16 74.13 0 343
## PITCHING_BB 13 2276 553.01 166.36 536.5 542.62 98.59 0 3645
## PITCHING_SO 14 2174 817.73 553.09 813.5 796.93 257.23 0 19278
## FIELDING_E 15 2276 246.48 227.77 159.0 193.44 62.27 65 1898
## FIELDING_DP 16 1990 146.39 26.23 149.0 147.58 23.72 52 228
## range skew kurtosis se
## WINS 146 -0.40 1.03 0.33
## BATTING_H 1663 1.57 7.28 3.03
## BATTING_2B 389 0.22 0.01 0.98
## BATTING_3B 223 1.11 1.50 0.59
## BATTING_HR 264 0.19 -0.96 1.27
## BATTING_BB 878 -1.03 2.18 2.57
## BATTING_SO 1399 -0.30 -0.32 5.33
## BASERUN_SB 697 1.97 5.49 1.90
## BASERUN_CS 201 1.98 7.62 0.59
## BATTING_HBP 66 0.32 -0.11 0.94
## PITCHING_H 28995 10.33 141.84 29.49
## PITCHING_HR 343 0.29 -0.60 1.28
## PITCHING_BB 3645 6.74 96.97 3.49
## PITCHING_SO 19278 22.17 671.19 11.86
## FIELDING_E 1833 2.99 10.97 4.77
## FIELDING_DP 176 -0.39 0.18 0.59
Visualization of the dataset:
The above contains a lot of information, and can be pretty overwhelming. If we take time to dissect the information, there are some interesting points.
Above are the boxplots of all of the variables in the data set; this visualization helps show how the data are spread (in terms of medians and quartiles). Of note, some of the variables contain missing values (which we will discuss later). While interesting, the absolute numbers here do not yet provide any immediate insight.
The scatterplots, on the other hand, plot each independent variable against the response variable WINS. Using ggplot2, an ‘lm’ smoother was added to all of the scatterplots. While not completely rigorous, this gives us a general idea of how each variable relates to the WINS column. For example, for PITCHING_H, WINS decreases as the hits allowed increase, which is intuitive: if the opposing team is getting more hits, it has a higher likelihood of winning.
Below are the correlations between the variables and a heatmap of those correlations.
## WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB
## WINS 1.00 0.47 0.31 -0.12 0.42 0.47
## BATTING_H 0.47 1.00 0.56 0.21 0.40 0.20
## BATTING_2B 0.31 0.56 1.00 0.04 0.25 0.20
## BATTING_3B -0.12 0.21 0.04 1.00 -0.22 -0.21
## BATTING_HR 0.42 0.40 0.25 -0.22 1.00 0.46
## BATTING_BB 0.47 0.20 0.20 -0.21 0.46 1.00
## BATTING_SO -0.23 -0.34 -0.06 -0.19 0.21 0.22
## BASERUN_SB 0.01 0.07 -0.19 0.17 -0.19 -0.09
## BASERUN_CS -0.18 -0.09 -0.20 0.23 -0.28 -0.21
## BATTING_HBP 0.07 -0.03 0.05 -0.17 0.11 0.05
## PITCHING_H 0.47 1.00 0.56 0.21 0.40 0.20
## PITCHING_HR 0.42 0.39 0.25 -0.22 1.00 0.46
## PITCHING_BB 0.47 0.20 0.20 -0.21 0.46 1.00
## PITCHING_SO -0.23 -0.34 -0.07 -0.19 0.21 0.22
## FIELDING_E -0.39 -0.25 -0.19 -0.07 0.02 -0.08
## FIELDING_DP -0.20 0.02 -0.02 0.13 -0.06 -0.08
## BATTING_SO BASERUN_SB BASERUN_CS BATTING_HBP PITCHING_H
## WINS -0.23 0.01 -0.18 0.07 0.47
## BATTING_H -0.34 0.07 -0.09 -0.03 1.00
## BATTING_2B -0.06 -0.19 -0.20 0.05 0.56
## BATTING_3B -0.19 0.17 0.23 -0.17 0.21
## BATTING_HR 0.21 -0.19 -0.28 0.11 0.40
## BATTING_BB 0.22 -0.09 -0.21 0.05 0.20
## BATTING_SO 1.00 -0.07 -0.06 0.22 -0.34
## BASERUN_SB -0.07 1.00 0.62 -0.06 0.07
## BASERUN_CS -0.06 0.62 1.00 -0.07 -0.09
## BATTING_HBP 0.22 -0.06 -0.07 1.00 -0.03
## PITCHING_H -0.34 0.07 -0.09 -0.03 1.00
## PITCHING_HR 0.21 -0.19 -0.28 0.11 0.39
## PITCHING_BB 0.22 -0.09 -0.21 0.05 0.20
## PITCHING_SO 1.00 -0.07 -0.06 0.22 -0.34
## FIELDING_E 0.31 0.04 0.21 0.04 -0.25
## FIELDING_DP -0.12 -0.13 -0.01 -0.07 0.01
## PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP
## WINS 0.42 0.47 -0.23 -0.39 -0.20
## BATTING_H 0.39 0.20 -0.34 -0.25 0.02
## BATTING_2B 0.25 0.20 -0.07 -0.19 -0.02
## BATTING_3B -0.22 -0.21 -0.19 -0.07 0.13
## BATTING_HR 1.00 0.46 0.21 0.02 -0.06
## BATTING_BB 0.46 1.00 0.22 -0.08 -0.08
## BATTING_SO 0.21 0.22 1.00 0.31 -0.12
## BASERUN_SB -0.19 -0.09 -0.07 0.04 -0.13
## BASERUN_CS -0.28 -0.21 -0.06 0.21 -0.01
## BATTING_HBP 0.11 0.05 0.22 0.04 -0.07
## PITCHING_H 0.39 0.20 -0.34 -0.25 0.01
## PITCHING_HR 1.00 0.46 0.21 0.02 -0.06
## PITCHING_BB 0.46 1.00 0.22 -0.08 -0.08
## PITCHING_SO 0.21 0.22 1.00 0.31 -0.12
## FIELDING_E 0.02 -0.08 0.31 1.00 0.04
## FIELDING_DP -0.06 -0.08 -0.12 0.04 1.00
According to the correlation table and heatmap, the variables most positively correlated with WINS are BATTING_H, BATTING_2B, BATTING_HR, BATTING_BB, PITCHING_H, PITCHING_HR, and PITCHING_BB.
It should be noted that PITCHING_H, PITCHING_HR, and PITCHING_BB are positively correlated with wins, which does not seem to make much sense. This is in stark contrast to what was stated in the first section regarding which correlations should be positive and which negative (according to domain knowledge). For instance, if your team is giving up hits, home runs, and walks, that puts the opposing team at an advantage.
Let’s look at the supposedly positive-impact and negative-impact variables separately, using a different package in R.
Supposed Positive Impact:
Supposed Negative Impact:
Though some of these independent variables show significant p-values against WINS, it should be noted that the scatterplots are not all necessarily linear. We can see whether we can work this out in our modelling section.
Now let’s take a look at the response variable, WINS.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 2276 80.79 15.75 82 81.31 14.83 0 146 146 -0.4 1.03
## se
## X1 0.33
In the diagram above, the red dashed line represents the mean and the two blue dashed lines represent two standard deviations above and below the mean. The skew of -0.40 is small in magnitude (less than 0.5 in absolute value), and the distribution appears approximately normal. However, again, there are some interesting findings in the WINS data.
Though the data have been curated so that every team’s record is standardized to a 162-game season, the WINS variable ranges from a maximum of 146 to a minimum of 0. These numbers are most definitely outliers, as neither has ever occurred in the history of Major League Baseball. According to Wikipedia’s ‘List of best Major League Baseball season win-loss records’, since 1961 (when the schedule was expanded to 162 games), only two teams have managed a winning percentage greater than 70%, or 114 wins: the 1998 New York Yankees and the 2001 Seattle Mariners.
What if we looked at teams before 1961? Perhaps we would find more extreme records there. Unfortunately, this is not the case either. According to Newsday, MLB’s winningest season of all time belongs to the 1906 Chicago Cubs, with a record of 116-36, a winning percentage of 76.3%, or, converted to a full 162-game season, an equivalent of about 123.6 wins. (https://www.newsday.com/sports/baseball/mlb-s-winningest-seasons-1.3077797)
Likewise, the losingest MLB team in the history of baseball was the 1899 Cleveland Spiders, with 20 wins and 134 losses, a staggeringly low winning percentage of .130. Would it improve the model to eliminate the outliers? Records beyond these extremes likely reflect extrapolation from the 162-game adjustment and may not accurately represent wins. We must be careful about including extrapolated data, and when we perform our modelling, we really should consider taking out the outliers, as they are likely not representative of what truly happened.
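As a quick check, these 162-game equivalents are simply each historical winning percentage scaled to 162 games:

# 1906 Chicago Cubs: 116-36, scaled to a 162-game season
(116 / (116 + 36)) * 162   # ~123.6 wins
# 1899 Cleveland Spiders: 20-134, scaled to a 162-game season
(20 / (20 + 134)) * 162    # ~21.0 wins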
Before we head into the next section on data preparation, let’s take a look at the missing data points.
How many data points are missing from each column and what percentage of the data is composed of missing values?
## BATTING_HBP BASERUN_CS FIELDING_DP BASERUN_SB PITCHING_SO BATTING_SO
## 2085 772 286 131 102 102
## FIELDING_E PITCHING_BB PITCHING_HR PITCHING_H BATTING_BB BATTING_HR
## 0 0 0 0 0 0
## BATTING_3B BATTING_2B BATTING_H WINS
## 0 0 0 0
## BATTING_HBP BASERUN_CS FIELDING_DP BASERUN_SB PITCHING_SO BATTING_SO
## 91.608084 33.919156 12.565905 5.755712 4.481547 4.481547
## FIELDING_E PITCHING_BB PITCHING_HR PITCHING_H BATTING_BB BATTING_HR
## 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## BATTING_3B BATTING_2B BATTING_H WINS
## 0.000000 0.000000 0.000000 0.000000
Most of the columns have no missing data. However, some fields, like BATTING_HBP and BASERUN_CS, contain a significant amount of missing data, which should be considered when we build our regression model.
Now that we have taken the opportunity to perform some exploratory data analysis, let’s start preparing and cleaning the data for modelling. The first step is to handle the missing values.
As noted in the previous section, missing values can introduce errors and bias into our model, so fixing or imputing them may help the situation (or possibly make it worse). Multiple scientific papers have examined missing data, and the overall consensus is that there is NO consensus on what the cutoff for missing data should be. Some authors have argued that more than 10% missing data can lead to bias (Bennett 2001); others argue that the missing-data mechanisms and patterns have a greater impact on research results than the proportion of missing data does (Tabachnick and Fidell 2012). (Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3701793/#CR6)
So we should consider: what is an acceptable level of missing data in baseball statistics? This is unclear, even from my own search of the literature. Using my best judgment, I will eliminate any variable with more than 10% missing data, which in this case means BATTING_HBP, BASERUN_CS, and FIELDING_DP. BATTING_HBP and BASERUN_CS have a staggering 91.6% and 33.9% missing data, respectively. Though one could argue that the 12.6% missing rate in FIELDING_DP could be overlooked, given that its missing rate exceeds 10% and that its Pearson correlation with WINS is weak at -0.20, I will remove this column as well.
Having decided which variables to remove, we are still left with BASERUN_SB, PITCHING_SO, and BATTING_SO as variables with missing data points. Typically we can impute either the mean or the median; the choice depends on the data. For example, in a dataset with extreme outliers (say, 99% of household incomes are below 100 and 1% are above 500), we should probably use the median. We can check this by examining the distributions of BASERUN_SB, PITCHING_SO, and BATTING_SO.
Given that means are susceptible to outliers, we must consider their effect, particularly in BASERUN_SB and PITCHING_SO. The maximum value of PITCHING_SO is:
## [1] 19278
This is an absurd value. It is in our best interest to impute the median rather than the mean.
Now that we have removed the three variables with more than 10% missing values and imputed the remaining missing values, let’s take another look at the head and summary of the dataset.
## WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB BATTING_SO
## 1 39 1445 194 39 13 143 842
## 2 70 1339 219 22 190 685 1075
## 3 86 1377 232 35 137 602 917
## 4 70 1387 209 38 96 451 922
## 5 82 1297 186 27 102 472 920
## 6 75 1279 200 36 92 443 973
## BASERUN_SB PITCHING_H PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E
## 1 101 9364 84 927 5456 1011
## 2 37 1347 191 689 1082 193
## 3 46 1377 137 602 917 175
## 4 43 1396 97 454 928 164
## 5 49 1297 102 472 920 138
## 6 107 1279 92 443 973 123
## WINS BATTING_H BATTING_2B BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
## BATTING_HR BATTING_BB BATTING_SO BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 556.8 1st Qu.: 67.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 736.3 Mean :123.4
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:151.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## PITCHING_H PITCHING_HR PITCHING_BB PITCHING_SO
## Min. : 1137 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 1419 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 626.0
## Median : 1518 Median :107.0 Median : 536.5 Median : 813.5
## Mean : 1779 Mean :105.7 Mean : 553.0 Mean : 817.5
## 3rd Qu.: 1682 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 957.0
## Max. :30132 Max. :343.0 Max. :3645.0 Max. :19278.0
## FIELDING_E
## Min. : 65.0
## 1st Qu.: 127.0
## Median : 159.0
## Mean : 246.5
## 3rd Qu.: 249.2
## Max. :1898.0
Looking at the observations in the revised dataset, we have now successfully eliminated all of the NA values. However, inspecting the data, we still see a significant number of outliers, and outliers can have high leverage on our models. We need to investigate this further and deal with them.
In general, what do we do about outliers? (Let’s refer to ‘Linear Models with R, Second Edition’, page 88.) Check for a data-entry error first.
Some may have been entered in error: a value like 19278 strikeouts is certainly physically impossible, so we need to remove such values and impute them with the median. Let’s plot histograms of all of the values.
We notice that some of these histograms have long right tails driven by outliers. Looking again at the maximum values of variables such as PITCHING_H, PITCHING_BB, PITCHING_SO, and FIELDING_E, these were likely entered in error. The next step is to impute these values with the median: we will arbitrarily treat any values more than three standard deviations out as errors and replace them with the median.
On visual inspection, the rest of the variables appear to be reasonable (even if they do contain outliers); we are only trying to eliminate the most obvious errors, which are likely secondary to data entry.
## WINS BATTING_H BATTING_2B BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
## BATTING_HR BATTING_BB BATTING_SO BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 556.8 1st Qu.: 67.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 736.3 Mean :123.4
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:151.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## PITCHING_H PITCHING_HR PITCHING_BB PITCHING_SO
## Min. :1137 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:1419 1st Qu.: 50.0 1st Qu.:476.0 1st Qu.: 626.0
## Median :1518 Median :107.0 Median :536.5 Median : 813.5
## Mean :1605 Mean :105.7 Mean :500.4 Mean : 795.6
## 3rd Qu.:1660 3rd Qu.:150.0 3rd Qu.:536.5 3rd Qu.: 954.0
## Max. :4134 Max. :343.0 Max. :536.5 Max. :1600.0
## FIELDING_E
## Min. : 65.0
## 1st Qu.:127.0
## Median :159.0
## Mean :198.9
## 3rd Qu.:215.0
## Max. :681.0
Re-examining the data, we have successfully removed the extreme maximum values, and the data now appear more consistent overall.
Now let’s take a look at the response variable, WINS. We mentioned earlier that the WINS column may contain outliers. From before, we noted that the most winning team in baseball history had a winning percentage of 76.3% and the most losing team a winning percentage of 13.0%, both certainly at the extremes. However, the data contain points beyond both of these historical extremes. How many records, and what percentage of them, have WINS below 13.0% or above 76.3% (corresponding to 21 and 123 games, respectively)?
## [1] "Number of Wins that are less than 21 or greater than 123: 11"
## [1] "Percentage: 0.48%"
So only 0.48% of the data are in the extreme ranges. We will leave this for now given it accounts for such a small amount of data.
Binning Data: Now let’s talk about weak, average, and strong teams. In baseball, winning more than 100 games is a great season, so 100 wins makes a natural cut point. Likewise, any season below a .500 winning percentage (fewer than 81 wins) generally makes a team a non-contender for the World Series. For this dataset, we will consider any team with fewer than 81 wins to be weak and any team with more than 100 wins to be strong; the rest will be considered average.
How many teams of each type exist in this dataset?
## Wins
## Weak Teams Average Teams Strong Teams
## 1118 958 200
It appears that there are more weak teams (below a .500 winning percentage) in this dataset than average or strong teams. When we create our models, we will first build a regression model with all of the teams, and then subset the data to see whether we can build separate regression models for the weak, average, and strong teams (this will be done in the following modelling section).
Let’s revisit the scatterplots and see what they look like after dealing with the missing values and outliers. Depending on the plots, we may need to consider transforming the data. Given that this is a multiple regression, to decide whether the response variable needs a transformation, we will need to visualize the residuals and ensure they satisfy the basic assumptions of regression.
Because visualizing multiple dimensions at once is beyond the human mind, we will simplify the process and look at a single predictor against the response, in this case BATTING_H vs. WINS.
## [1] "Pearson Correlation for Hits vs. Wins: 0.388767521072289"
Let’s take a look at the residuals and see if this response would benefit from a transformation.
The overall residual distribution does appear fairly normal, with no obvious heteroscedasticity, so a transformation may not be necessary. However, since this uses only a single predictor, the residual plot may well change in the full multiple regression. Therefore, we will create the most common transformations (logarithmic, square-root, and Box-Cox) and include them in our model building to see whether these response transformations improve the R-squared value.
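As a rough sketch of how the transformed response columns referenced in the later model calls could be constructed (the exact construction is not shown in the appendix; the +1 guard inside the logarithm is an assumption to handle the single 0-win record):

# Candidate response transformations for model building
moneyball_rev$WINS_log <- log(moneyball_rev$WINS + 1)   # +1 is an assumed guard against log(0)
moneyball_rev$WINS_sqrt <- sqrt(moneyball_rev$WINS)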
It is also a good idea to look at the data separated into their WINS bins. Another visualization with color separation may provide more insight into how BATTING_H relates to WINS.
We could also look at how BATTING_H compares to BATTING_HR, binned by average, strong, and weak teams.
Or at how BATTING_H compares to BATTING_SO, binned by average, strong, and weak teams.
Again, it is not easily discernible what this information means, but it is interesting to see how the data separate out.
Before we continue on to the third section, we should go back to Moneyball, which emphasized two stats quite heavily: On-Base Percentage and Slugging. On-Base Percentage (OBP) is “a statistic generally measuring how frequently a batter reaches base. Specifically, it records the ratio of the batter’s times-on-base (TOB) (the sum of hits, walks, and times hit by pitch) to their number of plate appearances. By factoring in only hits, walks and times hit by pitch, OBP does not credit the batter for reaching base due to fielding errors or decisions.” (https://en.wikipedia.org/wiki/On-base_percentage)
The formula is as follows:
\[OBP = \frac{H + BB + HBP}{AB + BB + HBP + SF}\]
Slugging Percentage is “a measure of the batting productivity of a hitter. It is calculated as total bases divided by at bats”, through the following formula: (https://en.wikipedia.org/wiki/Slugging_percentage)
\[SLG = \frac{(1B)+(2 * 2B) + (3 * 3B) + (4 * HR)}{AB}\]
We would like to include both of these stats in our model. However, given that we do not have At-Bats (AB), Sacrifice Flies (SF), or Hit-by-Pitch (HBP, which was removed), we can only include a modified version of the slugging percentage in our dataframe. Of note, BATTING_H counts all types of hits, whether singles, doubles, triples, or home runs. As a result, we will use the following statistic and include it in our model:
\[MODIFIED\_SLG = \frac{BATTING\_H + (2 \times BATTING\_2B) + (3 \times BATTING\_3B) + (4 \times BATTING\_HR)}{1458}\]

where 1458 is the total number of innings in a 162-game season (162 games × 9 innings).
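Translated directly into R (assuming the cleaned data frame moneyball_rev from the appendix), this modified slugging statistic can be added as a new column:

# Modified slugging: weighted hit total over the 1458 innings in a 162-game season
moneyball_rev$MOD_SLG <- (moneyball_rev$BATTING_H + 2 * moneyball_rev$BATTING_2B +
                          3 * moneyball_rev$BATTING_3B + 4 * moneyball_rev$BATTING_HR) / 1458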
Other statistics that we could potentially add to our model can be found in the following article, “Which Baseball Statistic is the Most Important When Determining Team Success?” (https://www.iwu.edu/economics/PPE13/houser.pdf). However, given the limitations of this data, we will only add the modified slugging statistic to our dataset.
Using the training dataset, we will test several different multiple linear regression models and try to develop the model that is most accurate at predicting WINS for a team. The initial approach is to build the full model, on the assumption that more data and information create the best model. Of course, this is subject to testing, and we may ultimately find models that are better than the full model.
Full model (using WINS as the response variable, not the transformed responses, and leaving out STRENGTH, as it was derived from the target variable):
##
## Call:
## lm(formula = WINS ~ . - WINS_log - WINS_sqrt - STRENGTH, data = moneyball_rev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.862 -8.582 0.353 8.578 56.341
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.414986 5.607021 -0.252 0.80079
## BATTING_H 0.038461 0.003769 10.204 < 2e-16 ***
## BATTING_2B -0.019783 0.009431 -2.098 0.03604 *
## BATTING_3B 0.102623 0.017633 5.820 6.72e-09 ***
## BATTING_HR 0.075863 0.027554 2.753 0.00595 **
## BATTING_BB 0.035305 0.004431 7.968 2.52e-15 ***
## BATTING_SO 0.011331 0.004418 2.565 0.01039 *
## BASERUN_SB 0.030714 0.004590 6.692 2.76e-11 ***
## PITCHING_H 0.006600 0.001173 5.625 2.08e-08 ***
## PITCHING_HR -0.036080 0.024315 -1.484 0.13798
## PITCHING_BB -0.018563 0.007410 -2.505 0.01230 *
## PITCHING_SO -0.007138 0.003582 -1.993 0.04643 *
## FIELDING_E -0.022174 0.003612 -6.138 9.82e-10 ***
## MOD_SLG NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.49 on 2263 degrees of freedom
## Multiple R-squared: 0.27, Adjusted R-squared: 0.2661
## F-statistic: 69.74 on 12 and 2263 DF, p-value: < 2.2e-16
This model has an adjusted R-squared value of 0.2661. While we attempted to use MOD_SLG as a predictor variable, R drops it (its coefficient is NA) because MOD_SLG is an exact linear combination of BATTING_H, BATTING_2B, BATTING_3B, and BATTING_HR, making it perfectly collinear with those predictors. The model otherwise does appear to satisfy the assumptions of regression, with normally distributed residuals as evidenced by the residual histogram and the Q-Q plot. However, an adjusted R-squared of 0.2661 is quite poor. We can try to improve this.
The model showed PITCHING_HR as non-significant, so let’s remove this variable and fit another model with both it and MOD_SLG taken out.
##
## Call:
## lm(formula = WINS ~ . - WINS_log - WINS_sqrt - STRENGTH - MOD_SLG -
## PITCHING_HR, data = moneyball_rev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.196 -8.470 0.304 8.603 55.858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.482090 5.573141 -0.087 0.931075
## BATTING_H 0.037819 0.003745 10.098 < 2e-16 ***
## BATTING_2B -0.019438 0.009430 -2.061 0.039398 *
## BATTING_3B 0.100402 0.017574 5.713 1.26e-08 ***
## BATTING_HR 0.037611 0.009732 3.865 0.000114 ***
## BATTING_BB 0.035544 0.004429 8.026 1.61e-15 ***
## BATTING_SO 0.014178 0.003980 3.562 0.000375 ***
## BASERUN_SB 0.030305 0.004582 6.613 4.68e-11 ***
## PITCHING_H 0.006706 0.001171 5.725 1.17e-08 ***
## PITCHING_BB -0.019088 0.007403 -2.578 0.009988 **
## PITCHING_SO -0.009777 0.003110 -3.143 0.001693 **
## FIELDING_E -0.021807 0.003605 -6.049 1.70e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.5 on 2264 degrees of freedom
## Multiple R-squared: 0.2693, Adjusted R-squared: 0.2657
## F-statistic: 75.84 on 11 and 2264 DF, p-value: < 2.2e-16
So while all of the coefficients are now statistically significant, the adjusted R-squared value has hardly changed. In fact, we can use an ANOVA to determine whether there really is any difference between these two models.
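A minimal sketch of the comparison, assuming the two fitted models were stored as full_model and reduced_model (hypothetical names):

# Nested-model F-test: does dropping MOD_SLG and PITCHING_HR significantly increase the RSS?
anova(full_model, reduced_model)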
## Analysis of Variance Table
##
## Model 1: WINS ~ (BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB +
## PITCHING_SO + FIELDING_E + STRENGTH + WINS_log + WINS_sqrt +
## MOD_SLG) - WINS_log - WINS_sqrt - STRENGTH
## Model 2: WINS ~ (BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB +
## PITCHING_SO + FIELDING_E + STRENGTH + WINS_log + WINS_sqrt +
## MOD_SLG) - WINS_log - WINS_sqrt - STRENGTH - MOD_SLG - PITCHING_HR
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2263 412102
## 2 2264 412503 -1 -400.97 2.2019 0.138
The ANOVA, with an F-statistic of 2.2019 and a corresponding p-value of 0.138, indicates no significant difference between the two models.
We may be able to modify the data and look at the terms from a different perspective. From our data exploration, we noted some collinearity in the dataset. This may be a good opportunity to fit the full model with interaction terms. In particular, on the batting side, BATTING_H, BATTING_2B, BATTING_3B, and BATTING_HR are all likely to interact. Let’s try it in the model.
##
## Call:
## lm(formula = WINS ~ . + BATTING_H * BATTING_2B * BATTING_3B *
## BATTING_HR, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.388 -8.373 0.412 8.340 56.319
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -8.746e+01 3.922e+01 -2.230
## BATTING_H 7.479e-02 2.798e-02 2.673
## BATTING_2B 2.797e-01 1.571e-01 1.781
## BATTING_3B 1.596e+00 5.145e-01 3.103
## BATTING_HR 8.029e-01 5.597e-01 1.435
## BATTING_BB 3.497e-02 4.442e-03 7.874
## BATTING_SO 1.064e-02 4.463e-03 2.384
## BASERUN_SB 3.698e-02 4.701e-03 7.868
## PITCHING_H 6.923e-03 1.224e-03 5.655
## PITCHING_HR -3.419e-02 2.588e-02 -1.321
## PITCHING_BB -2.552e-02 7.463e-03 -3.420
## PITCHING_SO -6.838e-03 3.630e-03 -1.884
## FIELDING_E -2.193e-02 3.652e-03 -6.005
## BATTING_H:BATTING_2B -1.079e-04 1.053e-04 -1.024
## BATTING_H:BATTING_3B -8.410e-04 3.446e-04 -2.441
## BATTING_2B:BATTING_3B -4.750e-03 1.900e-03 -2.500
## BATTING_H:BATTING_HR -2.659e-04 3.840e-04 -0.692
## BATTING_2B:BATTING_HR -3.511e-03 2.014e-03 -1.743
## BATTING_3B:BATTING_HR -6.663e-03 1.115e-02 -0.597
## BATTING_H:BATTING_2B:BATTING_3B 2.576e-06 1.189e-06 2.166
## BATTING_H:BATTING_2B:BATTING_HR 1.520e-06 1.345e-06 1.131
## BATTING_H:BATTING_3B:BATTING_HR 3.084e-06 7.464e-06 0.413
## BATTING_2B:BATTING_3B:BATTING_HR 3.411e-05 3.906e-05 0.873
## BATTING_H:BATTING_2B:BATTING_3B:BATTING_HR -1.843e-08 2.534e-08 -0.727
## Pr(>|t|)
## (Intercept) 0.025830 *
## BATTING_H 0.007578 **
## BATTING_2B 0.075094 .
## BATTING_3B 0.001941 **
## BATTING_HR 0.151565
## BATTING_BB 5.31e-15 ***
## BATTING_SO 0.017214 *
## BASERUN_SB 5.55e-15 ***
## PITCHING_H 1.76e-08 ***
## PITCHING_HR 0.186634
## PITCHING_BB 0.000637 ***
## PITCHING_SO 0.059730 .
## FIELDING_E 2.22e-09 ***
## BATTING_H:BATTING_2B 0.305855
## BATTING_H:BATTING_3B 0.014739 *
## BATTING_2B:BATTING_3B 0.012495 *
## BATTING_H:BATTING_HR 0.488719
## BATTING_2B:BATTING_HR 0.081442 .
## BATTING_3B:BATTING_HR 0.550304
## BATTING_H:BATTING_2B:BATTING_3B 0.030394 *
## BATTING_H:BATTING_2B:BATTING_HR 0.258376
## BATTING_H:BATTING_3B:BATTING_HR 0.679527
## BATTING_2B:BATTING_3B:BATTING_HR 0.382726
## BATTING_H:BATTING_2B:BATTING_3B:BATTING_HR 0.467066
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.31 on 2252 degrees of freedom
## Multiple R-squared: 0.2937, Adjusted R-squared: 0.2864
## F-statistic: 40.71 on 23 and 2252 DF, p-value: < 2.2e-16
The interaction-terms approach does improve the adjusted R-squared value to 0.2864, but quite a few of the coefficients are not statistically significant.
## Analysis of Variance Table
##
## Model 1: WINS ~ (BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB +
## PITCHING_SO + FIELDING_E + STRENGTH + WINS_log + WINS_sqrt +
## MOD_SLG) - WINS_log - WINS_sqrt - STRENGTH
## Model 2: WINS ~ BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB +
## PITCHING_SO + FIELDING_E + BATTING_H * BATTING_2B * BATTING_3B *
## BATTING_HR
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2263 412102
## 2 2252 398726 11 13376 6.8679 1.724e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA suggests that the first model and this latest model with the interaction terms are statistically different. However, it is unclear whether the more complicated model is truly the better way to go.
Let’s see whether any transformations of the response variable help. Initially, I attempted the logarithmic and square-root transformations, but these led to singularity issues that prevented R from properly fitting a model, so those two transformations were discarded. Let’s take a look at the Box-Cox transformation. In order to perform a Box-Cox transformation, all of the response values must be positive; there was one record with 0 wins, which we will eliminate first. For the sake of simplicity, we will choose lambda as 1.4. Let’s transform the target variable and see whether the linear regression model improves.
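A sketch of this step using MASS::boxcox, with the 0-win record already dropped from the working data frame mod_df; applying the transformation as (y^lambda - 1)/lambda is one common convention, not necessarily the exact form used here:

library(MASS)
# Profile the Box-Cox log-likelihood over lambda (the response must be strictly positive)
bc <- boxcox(WINS ~ BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
               BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB +
               PITCHING_SO + FIELDING_E, data = mod_df)
bc$x[which.max(bc$y)]                          # lambda with the highest log-likelihood
mod_df$WINS_BC <- (mod_df$WINS^1.4 - 1) / 1.4  # apply the chosen lambda of 1.4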
##
## Call:
## lm(formula = WINS_BC ~ . - WINS - WINS_log - WINS_sqrt - STRENGTH -
## MOD_SLG, data = mod_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -421.44 -71.06 -0.23 67.44 476.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.380e+02 4.586e+01 -3.009 0.00265 **
## BATTING_H 2.968e-01 3.034e-02 9.782 < 2e-16 ***
## BATTING_2B -1.446e-01 7.540e-02 -1.918 0.05527 .
## BATTING_3B 7.895e-01 1.407e-01 5.613 2.23e-08 ***
## BATTING_HR 6.269e-01 2.196e-01 2.854 0.00435 **
## BATTING_BB 2.716e-01 3.533e-02 7.687 2.24e-14 ***
## BATTING_SO 5.981e-02 3.530e-02 1.694 0.09038 .
## BASERUN_SB 2.564e-01 3.662e-02 7.001 3.33e-12 ***
## PITCHING_H 5.002e-02 9.352e-03 5.349 9.73e-08 ***
## PITCHING_HR -2.782e-01 1.939e-01 -1.435 0.15147
## PITCHING_BB -1.756e-01 5.966e-02 -2.944 0.00327 **
## PITCHING_SO -4.464e-02 2.855e-02 -1.563 0.11808
## FIELDING_E -1.773e-01 2.879e-02 -6.157 8.73e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 107.6 on 2262 degrees of freedom
## Multiple R-squared: 0.261, Adjusted R-squared: 0.2571
## F-statistic: 66.58 on 12 and 2262 DF, p-value: < 2.2e-16
The adjusted R-squared value is 0.2571, which suggests that this is not an improved model. A transformation may not be helpful for this particular dataset.
Before we continue on to the fourth section of this assignment, let’s go back and run the full model without any of the modifications made above (no imputation of missing values, no outlier adjustments, etc.).
##
## Call:
## lm(formula = WINS ~ ., data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8708 -5.6564 -0.0599 5.2545 22.9274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.28826 19.67842 3.064 0.00253 **
## BATTING_H 1.91348 2.76139 0.693 0.48927
## BATTING_2B 0.02639 0.03029 0.871 0.38484
## BATTING_3B -0.10118 0.07751 -1.305 0.19348
## BATTING_HR -4.84371 10.50851 -0.461 0.64542
## BATTING_BB -4.45969 3.63624 -1.226 0.22167
## BATTING_SO 0.34196 2.59876 0.132 0.89546
## BASERUN_SB 0.03304 0.02867 1.152 0.25071
## BASERUN_CS -0.01104 0.07143 -0.155 0.87730
## BATTING_HBP 0.08247 0.04960 1.663 0.09815 .
## PITCHING_H -1.89096 2.76095 -0.685 0.49432
## PITCHING_HR 4.93043 10.50664 0.469 0.63946
## PITCHING_BB 4.51089 3.63372 1.241 0.21612
## PITCHING_SO -0.37364 2.59705 -0.144 0.88577
## FIELDING_E -0.17204 0.04140 -4.155 5.08e-05 ***
## FIELDING_DP -0.10819 0.03654 -2.961 0.00349 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.467 on 175 degrees of freedom
## (2085 observations deleted due to missingness)
## Multiple R-squared: 0.5501, Adjusted R-squared: 0.5116
## F-statistic: 14.27 on 15 and 175 DF, p-value: < 2.2e-16
Incredibly, the adjusted R-squared value is 0.5116. Note, however, that 2085 observations were deleted due to missingness, so this model was fit on only 191 complete cases. And when we performed our data munging, we noted some issues with the data. Could this model be overfitting?
For now, we will keep this last model in consideration, but otherwise our best bet is likely the model with the interaction terms. Let’s take another look at that interaction model.
##
## Call:
## lm(formula = WINS ~ . + BATTING_H * BATTING_2B * BATTING_3B *
## BATTING_HR, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.388 -8.373 0.412 8.340 56.319
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -8.746e+01 3.922e+01 -2.230
## BATTING_H 7.479e-02 2.798e-02 2.673
## BATTING_2B 2.797e-01 1.571e-01 1.781
## BATTING_3B 1.596e+00 5.145e-01 3.103
## BATTING_HR 8.029e-01 5.597e-01 1.435
## BATTING_BB 3.497e-02 4.442e-03 7.874
## BATTING_SO 1.064e-02 4.463e-03 2.384
## BASERUN_SB 3.698e-02 4.701e-03 7.868
## PITCHING_H 6.923e-03 1.224e-03 5.655
## PITCHING_HR -3.419e-02 2.588e-02 -1.321
## PITCHING_BB -2.552e-02 7.463e-03 -3.420
## PITCHING_SO -6.838e-03 3.630e-03 -1.884
## FIELDING_E -2.193e-02 3.652e-03 -6.005
## BATTING_H:BATTING_2B -1.079e-04 1.053e-04 -1.024
## BATTING_H:BATTING_3B -8.410e-04 3.446e-04 -2.441
## BATTING_2B:BATTING_3B -4.750e-03 1.900e-03 -2.500
## BATTING_H:BATTING_HR -2.659e-04 3.840e-04 -0.692
## BATTING_2B:BATTING_HR -3.511e-03 2.014e-03 -1.743
## BATTING_3B:BATTING_HR -6.663e-03 1.115e-02 -0.597
## BATTING_H:BATTING_2B:BATTING_3B 2.576e-06 1.189e-06 2.166
## BATTING_H:BATTING_2B:BATTING_HR 1.520e-06 1.345e-06 1.131
## BATTING_H:BATTING_3B:BATTING_HR 3.084e-06 7.464e-06 0.413
## BATTING_2B:BATTING_3B:BATTING_HR 3.411e-05 3.906e-05 0.873
## BATTING_H:BATTING_2B:BATTING_3B:BATTING_HR -1.843e-08 2.534e-08 -0.727
## Pr(>|t|)
## (Intercept) 0.025830 *
## BATTING_H 0.007578 **
## BATTING_2B 0.075094 .
## BATTING_3B 0.001941 **
## BATTING_HR 0.151565
## BATTING_BB 5.31e-15 ***
## BATTING_SO 0.017214 *
## BASERUN_SB 5.55e-15 ***
## PITCHING_H 1.76e-08 ***
## PITCHING_HR 0.186634
## PITCHING_BB 0.000637 ***
## PITCHING_SO 0.059730 .
## FIELDING_E 2.22e-09 ***
## BATTING_H:BATTING_2B 0.305855
## BATTING_H:BATTING_3B 0.014739 *
## BATTING_2B:BATTING_3B 0.012495 *
## BATTING_H:BATTING_HR 0.488719
## BATTING_2B:BATTING_HR 0.081442 .
## BATTING_3B:BATTING_HR 0.550304
## BATTING_H:BATTING_2B:BATTING_3B 0.030394 *
## BATTING_H:BATTING_2B:BATTING_HR 0.258376
## BATTING_H:BATTING_3B:BATTING_HR 0.679527
## BATTING_2B:BATTING_3B:BATTING_HR 0.382726
## BATTING_H:BATTING_2B:BATTING_3B:BATTING_HR 0.467066
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.31 on 2252 degrees of freedom
## Multiple R-squared: 0.2937, Adjusted R-squared: 0.2864
## F-statistic: 40.71 on 23 and 2252 DF, p-value: < 2.2e-16
The reason this model may be the most reasonable to use is that its coefficients are probably the most intuitive of all the models proposed here. For example, the positive coefficients for BATTING_H, BATTING_2B, BATTING_3B, BATTING_HR, BATTING_BB, etc. suggest that the higher these values, the more they contribute to the WINS column. Likewise, PITCHING_HR and PITCHING_BB have negative coefficients: the higher these values, the more they hurt the WINS column.
That being said, not all of the coefficient values make sense. For instance, PITCHING_SO has a negative coefficient, even though in the baseball world throwing more strikeouts should in theory increase your chances of winning.
However, the mathematics behind regression can be quite complex, and variables often interact with one another, which can make individual coefficients counter-intuitive. That does not necessarily mean the model is incorrect.
As discussed above, we have several different models to choose from, and there are several ways to evaluate a model, including adjusted \(R^2\), RMSE, etc. Descriptions of these methods can be read here: https://www.r-bloggers.com/assessing-the-accuracy-of-our-models-r-squared-adjusted-r-squared-rmse-mae-aic/.
A significant amount of time was spent looking into the models listed above. In addition to trying to make sense of each model, I evaluated the models based on their adjusted \(R^2\) values. Ultimately, I chose the interaction-terms model because multicollinearity is likely an issue: a player’s batting ability affects his likelihood of hitting singles, doubles, triples, and home runs alike, and players who are better batters tend to hit better across all four of these categories, so the variables are not independent of one another. This is the big reason I ultimately chose this particular model over the others.
There was also the last model, which had a significantly better R-squared value than the model I chose. Why not choose it? As we saw during exploratory data analysis, the raw data contain a significant amount of missing values and outliers, which can skew a model and provide a false sense of security and assurance. Knowing that the unmodified full model carries all of these issues (and was fit on only 191 complete cases), it may be best not to use it: it may overfit the training data and ultimately fail to generalize to other datasets.
As requested in the question set, let’s evaluate the multiple linear regression model based on (a) mean squared error, (b) \(R^2\), (c) F-statistic, and (d) residual plots.
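A sketch of how these values can be extracted, assuming the chosen interaction model was stored as interact_model (a hypothetical name):

# (a) Mean squared error and root MSE from the model residuals
mse <- mean(residuals(interact_model)^2)
print(paste0("Mean Squared Error: ", mse))
print(paste0("Root MSE: ", sqrt(mse)))
# (b) Adjusted R-squared and (c) F-statistic from the model summary
model_sum <- summary(interact_model)
print(paste0("Adjusted R-squared: ", model_sum$adj.r.squared))
print(paste0("F-statistic: ", model_sum$fstatistic[1]))
# (d) Residual diagnostic plots
plot(interact_model)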
## [1] "Mean Squared Error: 175.187129382774"
## [1] "Root MSE: 13.2358274914255"
## [1] "Adjusted R-squared: 0.286446971302889"
## [1] "F-statistic: 40.7074086819212"
The model appears to be statistically significant.
Now to make predictions with the evaluation dataset. We will first have to eliminate the variables BATTING_HBP, BASERUN_CS, and FIELDING_DP. If the evaluation dataset has any missing data points, we will impute the median in order to remain consistent with our training pipeline.
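A sketch of that preparation, reusing the replace_median helper from the appendix (the file path is hypothetical, and the same index-column removal and column renaming applied to the training set is assumed):

# Load the evaluation set (hypothetical path) and mirror the training-set cleaning
moneyball_eval <- read.csv("moneyball-evaluation-data.csv", header = TRUE)
moneyball_eval <- moneyball_eval[, !names(moneyball_eval) %in%
                                   c('BATTING_HBP', 'BASERUN_CS', 'FIELDING_DP')]
# Impute any remaining NAs with the column medians, as in training
moneyball_eval <- as.data.frame(apply(moneyball_eval, 2, replace_median))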
Let’s take a look at the head of the test set after we made these modifications.
## BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB BATTING_SO
## 1 1209 170 33 83 447 1080
## 2 1221 151 29 88 516 929
## 3 1395 183 29 93 509 816
## 4 1539 309 29 159 486 914
## 5 1445 203 68 5 95 416
## 6 1431 236 53 10 215 377
## BASERUN_SB PITCHING_H PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E
## 1 62 1209 83 447 1080 140
## 2 54 1221 88 516 929 135
## 3 59 1395 93 509 816 156
## 4 148 1539 159 486 914 124
## 5 92 3902 14 257 1123 616
## 6 92 2793 20 420 736 572
Let’s produce predictions for the test set, including prediction intervals. (If you would like to save this data as a .csv file, please see the appendix.)
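A sketch of the prediction step, again assuming the chosen model and the cleaned evaluation set are stored as interact_model and moneyball_eval (hypothetical names):

# Point predictions with 95% prediction intervals for each evaluation record
predictions <- predict(interact_model, newdata = moneyball_eval,
                       interval = "prediction", level = 0.95)
predictions
# write.csv(predictions, 'moneyball_predictions.csv')   # optional export (see appendix)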
## fit lwr upr
## 1 65.19033 38.963608 91.41705
## 2 65.75030 39.473535 92.02707
## 3 74.08726 47.938162 100.23636
## 4 84.90616 58.763857 111.04847
## 5 67.94395 41.316532 94.57138
## 6 62.79714 36.464737 89.12954
## 7 77.45402 51.228355 103.67968
## 8 65.51250 39.315788 91.70922
## 9 70.98443 44.815416 97.15345
## 10 69.90627 43.781607 96.03094
## 11 72.15654 46.020825 98.29225
## 12 82.60546 56.450025 108.76089
## 13 82.06223 55.834390 108.29008
## 14 80.48800 54.280972 106.69502
## 15 79.35101 53.156912 105.54510
## 16 81.84139 55.654766 108.02802
## 17 72.44740 46.251563 98.64324
## 18 78.99631 52.852399 105.14022
## 19 61.00442 34.822399 87.18644
## 20 88.54734 62.322522 114.77216
## 21 81.91433 55.705036 108.12363
## 22 84.20397 57.991239 110.41669
## 23 78.65437 52.424724 104.88402
## 24 71.56051 45.414208 97.70681
## 25 82.93302 56.788821 109.07721
## 26 86.54794 60.379938 112.71595
## 27 59.14667 30.910344 87.38299
## 28 74.11489 47.961425 100.26835
## 29 83.62548 57.378076 109.87289
## 30 76.59144 50.331029 102.85184
## 31 84.98409 58.680956 111.28723
## 32 83.41988 57.233143 109.60661
## 33 82.11643 55.851783 108.38108
## 34 83.47974 57.157876 109.80160
## 35 80.24415 54.095454 106.39286
## 36 81.81934 55.640607 107.99806
## 37 74.37683 48.249544 100.50411
## 38 89.03987 62.828234 115.25150
## 39 79.32565 52.644050 106.00725
## 40 88.27825 62.094141 114.46236
## 41 77.11910 50.885123 103.35308
## 42 85.49939 59.334723 111.66406
## 43 25.48259 -3.547424 54.51261
## 44 97.98304 71.180574 124.78551
## 45 87.34547 60.978442 113.71249
## 46 87.89786 61.517969 114.27774
## 47 90.59936 64.077126 117.12159
## 48 73.27413 47.123699 99.42456
## 49 68.33860 42.193537 94.48366
## 50 74.98734 48.853630 101.12105
## 51 77.19618 51.061838 103.33053
## 52 85.13214 58.960191 111.30409
## 53 79.18255 53.036294 105.32881
## 54 74.63227 48.467231 100.79730
## 55 76.29415 50.175469 102.41282
## 56 77.84642 51.721354 103.97149
## 57 87.19360 60.950002 113.43721
## 58 71.76554 45.590474 97.94060
## 59 56.51941 30.026999 83.01183
## 60 76.73483 50.592401 102.87725
## 61 84.09501 57.937354 110.25267
## 62 77.58656 51.266835 103.90629
## 63 85.24703 59.109578 111.38447
## 64 81.97135 55.714428 108.22828
## 65 84.91806 58.730521 111.10560
## 66 95.05652 68.729248 121.38378
## 67 74.31244 48.172371 100.45252
## 68 80.09233 53.927763 106.25689
## 69 76.73882 50.581141 102.89651
## 70 88.13985 61.896703 114.38300
## 71 85.90838 59.663667 112.15309
## 72 73.70898 47.574961 99.84300
## 73 80.76387 54.599711 106.92804
## 74 90.59474 64.284360 116.90513
## 75 83.30479 57.115692 109.49389
## 76 89.78166 63.531510 116.03181
## 77 82.46595 56.284587 108.64731
## 78 81.27257 55.145115 107.40003
## 79 67.71837 41.567232 93.86952
## 80 73.97585 47.832538 100.11916
## 81 84.36046 58.087835 110.63308
## 82 87.05392 60.803596 113.30424
## 83 91.67783 64.542796 118.81287
## 84 80.38443 54.248262 106.52060
## 85 83.86098 57.642162 110.07980
## 86 79.95696 53.782240 106.13167
## 87 80.98155 54.762846 107.20026
## 88 81.85522 55.734353 107.97608
## 89 83.22258 57.049404 109.39576
## 90 83.34236 57.117784 109.56694
## 91 74.31418 48.084138 100.54422
## 92 87.84938 45.968486 129.73027
## 93 67.71741 41.544106 93.89072
## 94 78.69073 52.511501 104.86996
## 95 79.83220 53.676285 105.98811
## 96 78.33182 52.202588 104.46104
## 97 85.74682 59.452304 112.04134
## 98 94.17179 67.888270 120.45531
## 99 87.90937 61.654843 114.16390
## 100 89.64484 63.385556 115.90412
## 101 83.06290 56.909825 109.21597
## 102 70.06731 43.916807 96.21782
## 103 81.97868 55.849528 108.10784
## 104 81.80705 55.612554 108.00154
## 105 81.23208 55.038172 107.42600
## 106 69.69927 43.316922 96.08162
## 107 51.45812 24.928120 77.98813
## 108 77.23375 51.053263 103.41425
## 109 80.72630 54.561585 106.89101
## 110 59.31439 32.997730 85.63105
## 111 82.35130 56.190169 108.51242
## 112 84.24418 58.050125 110.43824
## 113 92.58932 66.435355 118.74328
## 114 88.50692 62.353622 114.66021
## 115 81.93386 55.791334 108.07639
## 116 80.68634 54.533626 106.83905
## 117 89.33272 63.180354 115.48510
## 118 79.60421 53.460792 105.74764
## 119 76.69707 50.549293 102.84484
## 120 75.12056 48.895427 101.34569
## 121 87.40086 61.195277 113.60645
## 122 61.46132 35.266193 87.65645
## 123 64.45339 38.269788 90.63698
## 124 58.01540 31.590142 84.44067
## 125 70.33036 44.159179 96.50154
## 126 83.89022 57.728962 110.05148
## 127 86.35077 60.185744 112.51579
## 128 74.25452 48.117202 100.39184
## 129 85.57835 59.421421 111.73529
## 130 89.79343 63.526818 116.06005
## 131 85.25814 59.082260 111.43401
## 132 77.80046 51.602412 103.99851
## 133 76.32213 50.151380 102.49287
## 134 86.09441 59.882982 112.30584
## 135 89.33801 63.152281 115.52375
## 136 59.78975 33.261077 86.31843
## 137 77.31551 51.177677 103.45334
## 138 76.57652 50.455069 102.69796
## 139 86.59601 60.411472 112.78055
## 140 80.93792 54.813035 107.06281
## 141 62.77592 36.563040 88.98880
## 142 68.55715 42.282573 94.83172
## 143 91.01579 64.805980 117.22561
## 144 76.89694 50.754520 103.03937
## 145 70.91936 44.737192 97.10153
## 146 71.99100 45.849100 98.13289
## 147 78.41426 52.282760 104.54575
## 148 80.75981 54.626206 106.89341
## 149 83.77344 57.655507 109.89137
## 150 78.69675 52.569928 104.82358
## 151 82.87512 56.736292 109.01395
## 152 81.37889 55.230979 107.52681
## 153 156.99017 102.266046 211.71429
## 154 71.96646 45.787191 98.14572
## 155 75.73622 49.597692 101.87474
## 156 75.40503 49.191800 101.61826
## 157 80.29909 54.032283 106.56590
## 158 62.20562 35.886090 88.52514
## 159 86.78267 60.517509 113.04782
## 160 67.57468 41.410067 93.73930
## 161 90.08680 62.841868 117.33174
## 162 96.44026 69.739091 123.14143
## 163 85.66220 59.409389 111.91502
## 164 91.11011 63.737069 118.48316
## 165 87.64169 61.154422 114.12896
## 166 84.70623 58.409282 111.00318
## 167 85.51847 59.306794 111.73015
## 168 82.09385 55.951787 108.23591
## 169 76.40460 50.163202 102.64600
## 170 79.33048 53.175531 105.48543
## 171 89.58480 63.387303 115.78229
## 172 85.44618 59.290005 111.60236
## 173 77.92991 51.758648 104.10117
## 174 89.30595 63.126415 115.48549
## 175 79.62758 53.469350 105.78581
## 176 73.40745 47.210875 99.60403
## 177 73.88085 47.612427 100.14928
## 178 75.35809 49.239079 101.47709
## 179 73.23287 47.105102 99.36064
## 180 82.31379 56.155727 108.47186
## 181 84.06089 57.822858 110.29893
## 182 83.83366 57.678479 109.98885
## 183 85.07842 58.929591 111.22725
## 184 80.14081 53.960171 106.32146
## 185 109.54300 71.924584 147.16142
## 186 95.62474 69.225611 122.02388
## 187 86.02903 59.620017 112.43804
## 188 56.84309 29.869694 83.81648
## 189 54.60977 28.250477 80.96906
## 190 108.85223 82.026384 135.67808
## 191 65.75286 39.590759 91.91497
## 192 78.75669 52.614144 104.89923
## 193 75.31090 49.128335 101.49346
## 194 77.47357 51.338185 103.60896
## 195 82.15986 55.936617 108.38311
## 196 63.94168 37.782928 90.10044
## 197 74.59659 48.478351 100.71482
## 198 78.72536 52.455108 104.99562
## 199 77.58546 51.395559 103.77537
## 200 83.39330 57.214291 109.57231
## 201 80.98275 54.784910 107.18059
## 202 81.46477 55.326329 107.60321
## 203 76.16639 49.980054 102.35273
## 204 79.22534 52.636960 105.81372
## 205 78.52411 52.402750 104.64547
## 206 81.08151 54.916165 107.24686
## 207 78.04266 51.857006 104.22832
## 208 76.37720 50.230116 102.52428
## 209 79.46808 53.253476 105.68269
## 210 72.15377 45.929879 98.37765
## 211 95.20298 68.683771 121.72219
## 212 87.87247 61.549843 114.19510
## 213 84.08937 57.884732 110.29401
## 214 65.99910 39.833482 92.16472
## 215 72.18685 46.007990 98.36571
## 216 84.85461 58.699564 111.00967
## 217 85.26477 59.074753 111.45479
## 218 91.53384 65.354403 117.71327
## 219 76.55081 50.429003 102.67262
## 220 77.36313 51.231257 103.49499
## 221 77.51708 51.368310 103.66586
## 222 77.86387 51.688701 104.03905
## 223 80.50133 54.351219 106.65145
## 224 78.85486 52.712468 104.99724
## 225 101.78313 72.330538 131.23571
## 226 76.04254 49.883527 102.20156
## 227 80.20297 54.078127 106.32781
## 228 80.73083 54.569197 106.89247
## 229 80.05644 53.922349 106.19053
## 230 67.69091 41.294343 94.08747
## 231 68.15125 41.961485 94.34102
## 232 89.71557 63.507540 115.92360
## 233 80.88768 54.658426 107.11692
## 234 89.32436 63.076754 115.57196
## 235 78.79225 52.673496 104.91101
## 236 72.67776 46.552751 98.80278
## 237 83.37966 57.211714 109.54760
## 238 79.78433 53.623895 105.94476
## 239 85.25932 58.767660 111.75099
## 240 71.86001 45.712783 98.00724
## 241 85.99510 59.843070 112.14713
## 242 84.15897 58.000262 110.31767
## 243 82.35973 56.211391 108.50807
## 244 84.49100 58.354138 110.62787
## 245 56.64383 30.285525 83.00214
## 246 87.86669 61.683253 114.05013
## 247 79.47383 53.338449 105.60921
## 248 81.98708 55.848114 108.12605
## 249 72.04451 45.920334 98.16868
## 250 85.32581 59.174962 111.47665
## 251 78.73890 52.556754 104.92104
## 252 56.25814 29.383086 83.13320
## 253 87.56414 61.168158 113.96012
## 254 -23.30153 -84.499000 37.89594
## 255 66.46014 40.196344 92.72393
## 256 76.31004 50.061512 102.55856
## 257 79.46685 53.318807 105.61488
## 258 81.91008 55.763864 108.05630
## 259 71.54079 45.258491 97.82308
Section 1
library(ggplot2)
library(reshape2)
library(ggpubr)
library(corrplot)
library(Hmisc)
library(psych)
library(PerformanceAnalytics)
library(MASS)
url <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Homework%201/moneyball-training-data.csv'
moneyball <- read.csv(url, header=TRUE)
# Removing the index column given that this is non-contributory
# Renaming the columns to make it easier to read
moneyball <- moneyball[,-1]
colnames(moneyball) <- c("WINS", "BATTING_H", "BATTING_2B", "BATTING_3B", "BATTING_HR", "BATTING_BB", "BATTING_SO", "BASERUN_SB", "BASERUN_CS", "BATTING_HBP", "PITCHING_H", "PITCHING_HR", "PITCHING_BB", "PITCHING_SO", "FIELDING_E", "FIELDING_DP")
head(moneyball)
# Data shape
colnames(moneyball)
dim(moneyball)
moneyball_sum <- summary(moneyball)
moneyball_sum
describe(moneyball)
# ggplot visualization of data
meltMoneyball <- melt(moneyball)
ggplot(meltMoneyball, aes(factor(variable), value)) + geom_boxplot() + facet_wrap(~variable, scale="free") + labs(x="", y="")
meltMoneyball_toWins <- melt(moneyball, id.vars=c('WINS'))
ggplot(meltMoneyball_toWins, aes(x=value, y=WINS)) + geom_point() + facet_wrap(~variable, scale = "free") + geom_smooth(method="lm") + labs(x="", y="Number of Wins")
moneyball_cor <- cor(moneyball, method = "pearson", use = "complete.obs")
round(moneyball_cor, 2)
corrplot(moneyball_cor, method="shade")
# Checking the variables with an expected positive impact on wins, according to the table supplied
pos_impact <- moneyball[,c(1,2,3,4,5,6,8,10,14,16)]
chart.Correlation(pos_impact, histogram=TRUE, pch=19)
# Variables with an expected negative impact on wins
neg_impact <- moneyball[,c(1,7,9,11,12,13,15)]
chart.Correlation(neg_impact, histogram=TRUE, pch=19)
# Wins
mean_wins <- mean(moneyball$WINS, na.rm=TRUE)
sd_wins <- sd(moneyball$WINS, na.rm=TRUE)
ggplot(moneyball, aes(x=WINS)) +
  geom_histogram(binwidth=1, col='black', fill="green", aes(y=..density..)) +
  geom_density(alpha=.2) +
  geom_vline(aes(xintercept=mean(WINS, na.rm=TRUE)), color="red", linetype="dashed", size=1) +
  geom_vline(xintercept = mean_wins + 2 * sd_wins, color='blue', linetype='dashed', size=1) +
  geom_vline(xintercept = mean_wins - 2 * sd_wins, color='blue', linetype='dashed', size=1) +
  labs(x="Number of Wins", y="Distribution")
describe(moneyball$WINS)
# Checking for NA values
rev(sort(colSums(sapply(moneyball, is.na))))
# Percentage of the data that is NA for its respective variable
rev(sort(colSums(sapply(moneyball, is.na))/nrow(moneyball) * 100))
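# BATTING_HBP, BASERUN_CS and FIELDING_DP show by far the highest missingness; these are dropped in Section 2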
Section 2: Data Preparation
# Creating a revised version of the data frame, removing the columns with a significant amount of missing data
moneyball_rev <- moneyball[, !names(moneyball) %in% c('BATTING_HBP','BASERUN_CS','FIELDING_DP')]
# Dealing with the outliers, starting with a visualization of a few affected variables
outlier_df <- moneyball[, names(moneyball) %in% c('BASERUN_SB', 'PITCHING_SO', 'BATTING_SO')]
outlier_melt <- melt(outlier_df)
ggplot(outlier_melt, aes(x=value)) + geom_histogram(binwidth=20) + facet_wrap(~variable, scale='free') + labs(x='', y='Frequency')
# Example of an outlier
max(outlier_df$PITCHING_SO, na.rm=TRUE)
# Creating a helper function to replace NA values with the column median
replace_median <- function(x){
x <- as.numeric(as.character(x))
x[is.na(x)] <- median(x, na.rm=TRUE)
return(x)
}
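# Illustrative check (not in the original code): replace_median(c(1, NA, 3)) returns 1 2 3, since the observed median is 2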
# Creating the revised data frame
moneyball_rev <- apply(moneyball_rev, 2, replace_median)
moneyball_rev <- as.data.frame(moneyball_rev)
head(moneyball_rev)
summary(moneyball_rev)
# Histogram plots of the independent variables
moneyball_rev_melt <- melt(moneyball_rev)
ggplot(moneyball_rev_melt, aes(x=value)) + geom_histogram() + facet_wrap(~variable, scale='free') + labs(x='', y='Frequency')
# Replacing extreme outliers (values more than 3 standard deviations above the mean) with the median
moneyball_rev$PITCHING_H[moneyball_rev$PITCHING_H > mean(moneyball_rev$PITCHING_H) + 3*sd(moneyball_rev$PITCHING_H)] <- median(moneyball_rev$PITCHING_H)
moneyball_rev$PITCHING_BB[moneyball_rev$PITCHING_BB > mean(moneyball_rev$PITCHING_BB) + 3*sd(moneyball_rev$PITCHING_BB)] <- median(moneyball_rev$PITCHING_BB)
moneyball_rev$PITCHING_SO[moneyball_rev$PITCHING_SO > mean(moneyball_rev$PITCHING_SO) + 3*sd(moneyball_rev$PITCHING_SO)] <- median(moneyball_rev$PITCHING_SO)
moneyball_rev$FIELDING_E[moneyball_rev$FIELDING_E > mean(moneyball_rev$FIELDING_E) + 3*sd(moneyball_rev$FIELDING_E)] <- median(moneyball_rev$FIELDING_E)
summary(moneyball_rev)
# Finding out how many of the wins are outliers
outlier_wins <- sum(moneyball_rev$WINS > 123 | moneyball_rev$WINS < 21)
print(paste0("Number of Wins that are less than 21 or greater than 123: ", outlier_wins))
print(paste0("Percentage: ", round(outlier_wins/nrow(moneyball_rev) * 100,2), "%"))
# Creating buckets for wins
Wins <- cut(moneyball_rev[,'WINS'], breaks = c(0,81,100,162), include.lowest=TRUE, labels=c("Weak Teams", "Average Teams", "Strong Teams"))
table(Wins)
# Subsetting the team strengths for visualization (boundaries chosen to match the buckets above)
weak_team <- subset(moneyball_rev, moneyball_rev$WINS <= 81)
avg_team <- subset(moneyball_rev, moneyball_rev$WINS > 81 & moneyball_rev$WINS <= 100)
strong_team <- subset(moneyball_rev, moneyball_rev$WINS > 100)
moneyball_rev$STRENGTH <- ifelse(moneyball_rev$WINS <= 81, 'WEAK', ifelse(moneyball_rev$WINS > 100, 'STRONG', 'AVERAGE'))
# Scatterplot visualization of variables against WINS
moneyball_rev_melt_WINS <- melt(moneyball_rev, id.vars='WINS')
ggplot(moneyball_rev_melt_WINS, aes(x=value, y=WINS)) + geom_jitter() + facet_wrap(~variable, scale='free') + labs(x='', y='Number of WINS')
# Scatter plot of Hitting and Wins
ggplot(moneyball_rev, aes(x=BATTING_H, y=WINS)) + geom_jitter() + labs(x="Hits (1B, 2B, 3B, HR)", y="Number of Wins") + geom_smooth(method="lm")
print(paste0("Pearson Correlation for Hits vs. Wins: ", cor(moneyball_rev$BATTING_H, moneyball_rev$WINS)))
# Creating a sample linear model
lm_samp <- lm(WINS ~ BATTING_H, data=moneyball_rev)
par(mfrow=c(2,2))
plot(lm_samp)
# More visualizations
ggplot(moneyball_rev, aes(x=BATTING_H, y=WINS, color=STRENGTH)) + geom_jitter() + labs(x="Hits (1B, 2B, 3B, HR)", y="Number of Wins")
ggplot(moneyball_rev, aes(x=BATTING_H, y=BATTING_HR, color=STRENGTH)) + geom_jitter() + labs(x="Hits (1B, 2B, 3B, HR)", y="Number of Home Runs")
ggplot(moneyball_rev, aes(x=BATTING_H, y=BATTING_SO, color=STRENGTH)) + geom_jitter() + labs(x="Hits (1B, 2B, 3B, HR)", y="Number of Strike Outs")
# Attempting transformations
moneyball_rev$WINS_log <- log(moneyball_rev$WINS)
moneyball_rev$WINS_sqrt <- sqrt(moneyball_rev$WINS)
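# Note: at least one record has WINS = 0, so log(WINS) yields -Inf there; the zero-win rows are excluded before the Box-Cox model in Section 3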
# Attempting to create a new variable approximating slugging, scaling the weighted hit total by a constant (1458 = 162 x 9)
moneyball_rev$MOD_SLG <- (moneyball_rev$BATTING_H + (2 * moneyball_rev$BATTING_2B) + (3 * moneyball_rev$BATTING_3B) + (4 * moneyball_rev$BATTING_HR))/1458
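# Note: BATTING_H already counts doubles, triples and home runs, so the weighting above double-counts extra-base hits.
# A conventional total-bases alternative (an assumption for illustration, not part of the original analysis) would be:
# moneyball_rev$TB <- moneyball_rev$BATTING_H + moneyball_rev$BATTING_2B + 2*moneyball_rev$BATTING_3B + 3*moneyball_rev$BATTING_HR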
Section 3: Build Models
# Creating 1st model
lmod <- lm(WINS ~ . - WINS_log - WINS_sqrt - STRENGTH, moneyball_rev)
summary(lmod)
par(mfrow=c(2,2))
plot(lmod)
hist(resid(lmod), main="Histogram of Residuals")
# Creating 2nd model
lmod2 <- lm(WINS ~ . - WINS_log - WINS_sqrt - STRENGTH - MOD_SLG - PITCHING_HR, moneyball_rev)
summary(lmod2)
par(mfrow=c(2,2))
plot(lmod2)
hist(resid(lmod2), main="Histogram of Residuals")
# ANOVA analysis between 1st and 2nd model
anova(lmod, lmod2)
# Creating 3rd model: all remaining predictors plus interactions among the batting hit variables
new_df <- moneyball_rev[,!names(moneyball_rev) %in% c("WINS_log", "WINS_sqrt", "STRENGTH", "MOD_SLG")]
lmod3 <- lm(WINS ~ . + BATTING_H*BATTING_2B*BATTING_3B*BATTING_HR, new_df)
summary(lmod3)
par(mfrow=c(2,2))
plot(lmod3)
hist(resid(lmod3))
# Another ANOVA analysis between 1st and 3rd model
anova(lmod, lmod3)
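# Note: lmod3 drops MOD_SLG while adding the interaction terms, so it is not strictly nested within lmod; this F-test should be interpreted with caution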
# Creating 4th model; rows with WINS == 0 are removed, since the Box-Cox procedure requires a strictly positive response
mod_df <- subset(moneyball_rev, moneyball_rev$WINS != 0)
lmod4 <- lm(WINS ~ . - WINS_log - WINS_sqrt - STRENGTH - MOD_SLG, mod_df)
# Box-Cox transformation to identify a suitable power transformation of WINS
boxcox(lmod4, plotit=TRUE)
boxcox(lmod4, plotit=TRUE, lambda=seq(1.2,1.6, by=0.1))
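# A small sketch (an addition, not part of the original analysis) to extract the optimal lambda programmatically rather than reading it off the plot:
bc <- boxcox(lmod4, plotit=FALSE, lambda=seq(1.2, 1.6, by=0.01))
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat  # should be close to the 1.4 used below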
mod_df$WINS_BC <- mod_df$WINS^1.4
# Creating the 5th model
lmod5 <- lm(WINS_BC ~ . - WINS - WINS_log - WINS_sqrt - STRENGTH - MOD_SLG, mod_df)
summary(lmod5)
par(mfrow=c(2,2))
plot(lmod5)
hist(resid(lmod5))
# Creating the 6th model
lmod6 <- lm(WINS ~ ., moneyball)
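# Note: lm() keeps only complete cases, and BATTING_HBP is missing for most records, so this model is fit on a small fraction of the data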
summary(lmod6)
par(mfrow=c(2,2))
plot(lmod6)
hist(resid(lmod6))
# Showing the summary of the 3rd model again.
summary(lmod3)
Section 4: Select Models
# Computing performance metrics for model 3: MSE, RMSE, adjusted R-squared and F-statistic
sum_lmod3 <- summary(lmod3)
RSS <- c(crossprod(lmod3$residuals))
MSE <- RSS/length(lmod3$residuals)
print(paste0("Mean Squared Error: ", MSE))
print(paste0("Root MSE: ", sqrt(MSE)))
print(paste0("Adjusted R-squared: ", sum_lmod3$adj.r.squared))
print(paste0("F-statistic: ",sum_lmod3$fstatistic[1]))
plot(resid(lmod3))
abline(h=0, col=2)
# Importing the test data set
url2 <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Homework%201/moneyball-evaluation-data.csv'
test_set <- read.csv(url2, header = TRUE)
colnames(test_set) <- c('INDEX', 'BATTING_H', 'BATTING_2B' , 'BATTING_3B' , 'BATTING_HR' , 'BATTING_BB', 'BATTING_SO' , 'BASERUN_SB', 'BASERUN_CS', 'BATTING_HBP', 'PITCHING_H', 'PITCHING_HR', 'PITCHING_BB' , 'PITCHING_SO', 'FIELDING_E', 'FIELDING_DP')
test_set <- test_set[,!names(test_set) %in% c('INDEX','BATTING_HBP','BASERUN_CS','FIELDING_DP')]
test_set <- apply(test_set, 2, replace_median)
test_set <- as.data.frame(test_set)
head(test_set)
# Generating predictions with 95% prediction intervals
predict(lmod3, newdata = test_set, interval="prediction")
# You may un-comment the lines below to write the predictions to a csv file
# a <- predict(lmod3, newdata = test_set, interval="prediction")
# write.csv(a, file="Prediction.csv")