Introduction to Regression

To start this lesson, let's load ba_2014_2015, which contains batting averages from the 2014 and 2015 MLB seasons. We will focus on the relationship between players' 2014 and 2015 batting averages.

library(tidyverse)

## -- Attaching packages ----------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.7
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts -------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

ba_2014_2015 <- read_csv("C:/Users/ankit/OneDrive/Desktop/Robotics Scouting/Data Sets/ba_2014_2015.csv")

## Parsed with column specification:
## cols(
##   playerID = col_character(),
##   ba_2014 = col_double(),
##   ba_2015 = col_double()
## )

ba_2014_2015

## # A tibble: 73 x 3
##    playerID  ba_2014 ba_2015
##    <chr>       <dbl>   <dbl>
##  1 abreujo02   0.317   0.290
##  2 altuvjo01   0.341   0.313
##  3 andruel01   0.263   0.258
##  4 aybarer01   0.278   0.270
##  5 bautijo02   0.286   0.250
##  6 beltrad01   0.324   0.287
##  7 blackch02   0.288   0.287
##  8 bogaexa01   0.240   0.320
##  9 brantmi02   0.327   0.310
## 10 braunry02   0.266   0.285
## # ... with 63 more rows

Predicting Averages

Let’s say there is a player who played in 2014 and 2015 but is not in our dataset. We know absolutely nothing about the player, so how could we predict their 2015 batting average using the data that we have? The simplest approach (which actually works pretty well) is to use the overall mean of the 2015 batting averages in our dataset as the prediction for the unknown player.

Now imagine that we were also able to get this mystery player’s 2014 batting average. Can we improve on this simple prediction? In this lesson, we will go through various ways of predicting 2015 batting average using 2014 batting average. To decide which method is best, we will assess how well each one predicts the actual 2015 batting averages in our tbl. We will add columns to our current tibble to compare predictions against reality. Let’s start by adding a column containing the overall mean of the 2015 batting averages. The 2015 season mean is 0.273, while the 2014 season mean is 0.272. We will use the name yhat for our predictions, which is the conventional notation for predicted values in regression.

ba_2014_2015 <- mutate(ba_2014_2015,yhat_1 = mean(ba_2015))

If the 2014 batting averages were not related to the 2015 averages at all, then the 2015 overall mean would be the best prediction available from the data we have. We can plot the 2015 batting averages against the 2014 batting averages to see whether there is any relationship between the two seasons' data. We add a dotted horizontal line at the overall 2015 mean and a dotted vertical line at the overall 2014 mean.

ggplot(ba_2014_2015) +
  geom_point(aes(x = ba_2014, y = ba_2015), col = "black", shape = 16, size = 0.9) +
  geom_hline(yintercept = 0.273, col = "blue", lty = 3) +
  geom_vline(xintercept = 0.272, col = "blue", lty = 3)

It looks like there is a relationship, so it is reasonable to use the 2014 batting averages to make predictions for 2015. We can be more accurate than the overall mean, though.

Looking at the plot, we see that most players with below-average batting averages in 2014 also tended to have below-average batting averages in 2015. Similarly, most players with above-average batting averages in 2014 tended to have above-average batting averages in 2015.

Here is one way to improve our predictions:

We can divide the players into two groups, one for those with above-average ba in 2014 and one for those with below-average ba in 2014. We can then average the 2015 ba within each of the two groups and use these averages as the prediction for each member of the group. To do this, we need to create a column indicating which group each player belongs to. Then we can pass this column to group_by() and compute the average ba_2015 within each group. To create the column we will use the powerful cut() function, which divides the range of a numerical vector into intervals and recodes the numerical values according to the interval they fall into.

The following code does what was described above, with two more steps: once we have made our predictions, we ungroup the tbl and drop the column indicating the interval in which each observation falls.

ba_2014_2015 <- ba_2014_2015 %>% mutate(bins = cut(ba_2014, breaks = c(0.15, 0.272, 0.4))) %>% 
  group_by(bins) %>% mutate(yhat_2 = mean(ba_2015)) %>% 
  ungroup() %>% select(-bins)

When we run this code and print out our tbl, we see that there is a new column called “yhat_2” that contains our new predictions. Before continuing, make sure you understand the syntax used in cut(). The first argument is the variable we want to divide into bins. The next argument, breaks =, is a vector that tells R where the endpoints of the intervals (bins) are. These endpoints are often called “cut points”. In this particular case, we wanted to divide the players into those with below-average ba in 2014 and those with above-average ba in 2014. The first element of the cut point vector, 0.15, is much less than the smallest ba_2014 value, whereas the second element, 0.272, is the overall mean of the ba_2014 values. The last element, 0.40, is much greater than the largest ba_2014 value.
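
To see what cut() returns on its own, here is a minimal sketch using base R only. The vector ba_toy is a hypothetical name; its values are the first five ba_2014 values printed at the start of the lesson, and the cut points are the same ones used above.

```r
# Toy input: the first five ba_2014 values from the printed tibble
ba_toy <- c(0.317, 0.341, 0.263, 0.278, 0.286)

# Same cut points as in the lesson: 0.272 is the overall 2014 mean
bins_toy <- cut(ba_toy, breaks = c(0.15, 0.272, 0.4))

# Each value is recoded as a factor level naming the interval it falls into
print(bins_toy)

# Count how many toy values land in each bin
print(table(bins_toy))
```

By default cut() builds intervals that are open on the left and closed on the right (right = TRUE), so a value exactly equal to 0.272 would be placed in the lower bin.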

Now let's look at this visually:

ggplot(ba_2014_2015) + geom_point(aes(x = ba_2014, y = ba_2015), col = "black", shape = 16, size = 0.9) + 
     geom_point(aes(x = ba_2014, y = yhat_1), col = "red", shape = 3, size = 0.75) + 
     geom_point(aes(x = ba_2014, y = yhat_2), col = "green", shape = 15, size = 0.75)

Visually, it appears that the green squares (corresponding to yhat_2) are a bit closer to the actual values than the red crosses (corresponding to yhat_1). This suggests that dividing the players into two bins according to their 2014 batting average and using the average 2015 value within each bin as our forecast was better than using the overall average 2015 value for all players.
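
The bin-then-average idea can also be traced by hand on a few rows. The sketch below is an illustration under assumptions, not the lesson's own code: it uses base R's ave() instead of group_by() and mutate(), the data frame toy is a hypothetical name holding only the first ten rows printed earlier, and the cut point 0.272 is still the mean from the full dataset. The resulting numbers therefore differ from the full-data predictions.

```r
# First ten rows of the printed tibble, stored as a plain data frame
toy <- data.frame(
  ba_2014 = c(0.317, 0.341, 0.263, 0.278, 0.286, 0.324, 0.288, 0.240, 0.327, 0.266),
  ba_2015 = c(0.290, 0.313, 0.258, 0.270, 0.250, 0.287, 0.287, 0.320, 0.310, 0.285)
)

# Analogue of yhat_1: every player gets the overall 2015 mean
toy$yhat_1 <- mean(toy$ba_2015)

# Analogue of yhat_2: bin on ba_2014, then give each row its bin's 2015 mean
toy$bins <- cut(toy$ba_2014, breaks = c(0.15, 0.272, 0.4))
toy$yhat_2 <- ave(toy$ba_2015, toy$bins, FUN = mean)

print(toy)
```

Every row in the same bin receives the same yhat_2 value, which is why the green squares in the plot sit on two horizontal levels.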

We can continue with this process and divide the players into even more bins. When we do that, instead of hand-coding the vector of cut points, we can use the function seq(), which generates a vector of equally spaced numbers. To demonstrate, suppose we wanted to divide the interval [0,1] into 10 equally sized intervals: (0, 0.1], (0.1, 0.2], ..., (0.9, 1]. To get the vector of cut points, we need to tell seq() either how many points we want or the spacing between the points:

seq(from = 0, to = 1, length.out = 11)

##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
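
The other option, giving the spacing between points rather than the number of points, uses the by argument of seq(). A minimal sketch; the variable name cut_points is only for illustration:

```r
# Same cut points as above, specified by spacing instead of by count
cut_points <- seq(from = 0, to = 1, by = 0.1)
print(cut_points)

# 11 cut points define 10 intervals
print(length(cut_points))
```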

Now let’s divide the 2014 batting averages into intervals of length 0.05 and predict 2015 batting averages using the average ba_2015 values within the resulting bins.

ba_2014_2015 <- ba_2014_2015 %>% mutate(bins = cut(ba_2014, breaks = seq(from = 0.15, 
     to = 0.4, by = 0.05))) %>% group_by(bins) %>% mutate(yhat_3 = mean(ba_2015)) %>% 
     ungroup() %>% select(-bins)

Then we visualize:

ggplot(ba_2014_2015) + geom_point(aes(x = ba_2014, y = ba_2015), shape = 16, size = 0.9) +
  geom_point(aes(x = ba_2014, y = yhat_1), col = "red", shape = 3, size = 0.75) +
  geom_point(aes(x = ba_2014, y = yhat_2), col = "green", shape = 15, size = 0.75) + 
  geom_point(aes(x = ba_2014, y = yhat_3), col = "blue", shape = 17, size = 0.75)

It appears that we are able to predict the 2015 batting average of the player with the lowest batting average in 2014 pretty well.

Exercise: Add a new column “yhat_4” to our tbl that contains the predictions formed when we divide the 2014 batting averages into bins of length 0.005. Plot these new predictions in purple using shape = 8. What do you notice about these predictions?

Assessing Predictive Power

We now have several different ways of predicting 2015 batting average. Qualitatively, the predictions in purple (smallest bins) appear to fit the observed data better than the blue, green, and red predictions. To assess the predictions quantitatively, we use the root mean square error, or RMSE. This is the square root of the mean square error (MSE), which is computed by averaging the squared differences between the actual values and the predicted values.

summarize(ba_2014_2015, rmse_1 = sqrt(mean((ba_2015 - yhat_1)^2)), 
          rmse_2 = sqrt(mean((ba_2015 - yhat_2)^2)), 
          rmse_3 = sqrt(mean((ba_2015 - yhat_3)^2)), 
          rmse_4 = sqrt(mean((ba_2015 - yhat_4)^2)))

## # A tibble: 1 x 4
##   rmse_1 rmse_2 rmse_3 rmse_4
##    <dbl>  <dbl>  <dbl>  <dbl>
## 1 0.0260 0.0246 0.0226 0.0183

The RMSEs confirm what we could see visually: the purple predictions fit the data much better than the blue, green, and red predictions. Moreover, we see that the predictions formed by binning into smaller intervals yielded smaller RMSEs than the predictions formed by binning into larger intervals.
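
Since the same RMSE expression is repeated for every column, it can help to wrap it in a small function. The sketch below is one possible way to do this. Note that rmse() is a hypothetical helper name, not a function provided by the tidyverse or defined earlier in this lesson, and the toy vector simply reuses the first ten ba_2015 values printed at the start.

```r
# Hypothetical helper: square root of the mean squared difference
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy data: first ten ba_2015 values from the printed tibble
ba_2015_toy <- c(0.290, 0.313, 0.258, 0.270, 0.250, 0.287, 0.287, 0.320, 0.310, 0.285)

# Predict every toy player with the mean of these ten values
yhat_toy <- rep(mean(ba_2015_toy), length(ba_2015_toy))

print(rmse(ba_2015_toy, yhat_toy))
```

With such a helper in place, each entry of the summarize() call could be written in the form rmse_1 = rmse(ba_2015, yhat_1).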

Exercise: Add another column of predictions, “yhat_5”, computed by dividing ba_2014 into intervals of length 0.001. Re-compute the RMSEs and plot all of the predictions again. Use col = "orange", shape = 10 for these new predictions.

## # A tibble: 1 x 5
##   rmse_1 rmse_2 rmse_3 rmse_4 rmse_5
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 0.0260 0.0246 0.0226 0.0183 0.0114

Ankith Kodali
