Data605_Discussion12

# Load the DATA606 package, which provides the plot_ss function
library("DATA606")

## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.

library(ggplot2)

The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average.

2011 Major League Baseball Season Data For All 30 Teams

This is the same data used in Lab 8

df <- read.csv("/Users/aaronzalki/Desktop/mlbdata.csv")
df

##     X                  team runs at_bats hits homeruns bat_avg strikeouts
## 1   1         Texas Rangers  855    5659 1599      210   0.283        930
## 2   2        Boston Red Sox  875    5710 1600      203   0.280       1108
## 3   3        Detroit Tigers  787    5563 1540      169   0.277       1143
## 4   4    Kansas City Royals  730    5672 1560      129   0.275       1006
## 5   5   St. Louis Cardinals  762    5532 1513      162   0.273        978
## 6   6         New York Mets  718    5600 1477      108   0.264       1085
## 7   7      New York Yankees  867    5518 1452      222   0.263       1138
## 8   8     Milwaukee Brewers  721    5447 1422      185   0.261       1083
## 9   9      Colorado Rockies  735    5544 1429      163   0.258       1201
## 10 10        Houston Astros  615    5598 1442       95   0.258       1164
## 11 11     Baltimore Orioles  708    5585 1434      191   0.257       1120
## 12 12   Los Angeles Dodgers  644    5436 1395      117   0.257       1087
## 13 13          Chicago Cubs  654    5549 1423      148   0.256       1202
## 14 14       Cincinnati Reds  735    5612 1438      183   0.256       1250
## 15 15    Los Angeles Angels  667    5513 1394      155   0.253       1086
## 16 16 Philadelphia Phillies  713    5579 1409      153   0.253       1024
## 17 17     Chicago White Sox  654    5502 1387      154   0.252        989
## 18 18     Cleveland Indians  704    5509 1380      154   0.250       1269
## 19 19  Arizona Diamondbacks  731    5421 1357      172   0.250       1249
## 20 20     Toronto Blue Jays  743    5559 1384      186   0.249       1184
## 21 21       Minnesota Twins  619    5487 1357      103   0.247       1048
## 22 22       Florida Marlins  625    5508 1358      149   0.247       1244
## 23 23    Pittsburgh Pirates  610    5421 1325      107   0.244       1308
## 24 24     Oakland Athletics  645    5452 1330      114   0.244       1094
## 25 25        Tampa Bay Rays  707    5436 1324      172   0.244       1193
## 26 26        Atlanta Braves  641    5528 1345      173   0.243       1260
## 27 27  Washington Nationals  624    5441 1319      154   0.242       1323
## 28 28  San Francisco Giants  570    5486 1327      121   0.242       1122
## 29 29      San Diego Padres  593    5417 1284       91   0.237       1320
## 30 30      Seattle Mariners  556    5421 1263      109   0.233       1280
##    stolen_bases wins new_onbase new_slug new_obs
## 1           143   96      0.340    0.460   0.800
## 2           102   90      0.349    0.461   0.810
## 3            49   95      0.340    0.434   0.773
## 4           153   71      0.329    0.415   0.744
## 5            57   90      0.341    0.425   0.766
## 6           130   77      0.335    0.391   0.725
## 7           147   97      0.343    0.444   0.788
## 8            94   96      0.325    0.425   0.750
## 9           118   73      0.329    0.410   0.739
## 10          118   56      0.311    0.374   0.684
## 11           81   69      0.316    0.413   0.729
## 12          126   82      0.322    0.375   0.697
## 13           69   71      0.314    0.401   0.715
## 14           97   79      0.326    0.408   0.734
## 15          135   86      0.313    0.402   0.714
## 16           96  102      0.323    0.395   0.717
## 17           81   79      0.319    0.388   0.706
## 18           89   80      0.317    0.396   0.714
## 19          133   94      0.322    0.413   0.736
## 20          131   81      0.317    0.413   0.730
## 21           92   63      0.306    0.360   0.666
## 22           95   72      0.318    0.388   0.706
## 23          108   72      0.309    0.368   0.676
## 24          117   74      0.311    0.369   0.680
## 25          155   91      0.322    0.402   0.724
## 26           77   89      0.308    0.387   0.695
## 27          106   80      0.309    0.383   0.691
## 28           85   86      0.303    0.368   0.671
## 29          170   71      0.305    0.349   0.653
## 30          125   67      0.292    0.348   0.640

I am interested in creating a linear model to predict the number of wins a team achieves in a season based on how many home runs the team hits.

homeruns as the predictor

wins as the output

Plot Data

#x axis is home runs, y axis is wins
homerun_wins <- ggplot(df, aes(homeruns, wins))
homerun_wins + geom_point()

The scatterplot suggests an approximately linear relationship. As computed below, the correlation coefficient is about 0.66, which indicates a moderately strong positive association: teams that hit more home runs tend to win more games. The association is not stronger because there is considerable scatter around the linear trend.

cor(df$wins, df$homeruns)

## [1] 0.660614
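
As a minimal sketch, the same coefficient can be reproduced from the definition of the sample correlation: the covariance divided by the product of the two standard deviations. The vectors below are retyped from the homeruns and wins columns of the table above so that the snippet runs without the CSV file; the names hr, w, and r_manual are illustrative and not part of the original analysis.

```r
# homeruns and wins for the 30 teams, retyped from the table above
hr <- c(210, 203, 169, 129, 162, 108, 222, 185, 163, 95,
        191, 117, 148, 183, 155, 153, 154, 154, 172, 186,
        103, 149, 107, 114, 172, 173, 154, 121, 91, 109)
w  <- c(96, 90, 95, 71, 90, 77, 97, 96, 73, 56,
        69, 82, 71, 79, 86, 102, 79, 80, 94, 81,
        63, 72, 72, 74, 91, 89, 80, 86, 71, 67)

# Sample correlation from its definition: cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(hr, w) / (sd(hr) * sd(w))

# Compare with the built-in cor(); the two should agree up to rounding error
r_manual
all.equal(r_manual, cor(hr, w))
```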

Sum of Squared Residuals

Similar to how we can use the mean and standard deviation to summarize a single variable, we can summarize the relationship between homeruns and wins by finding the line that best follows their association.

plot_ss(x = df$homeruns, y = df$wins)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     48.8140       0.2119  
## 
## Sum of Squares:  2129.785

There are 30 residuals shown in blue, one for each of the 30 MLB teams. Residuals are the difference between the observed values and the values predicted by the line:

\[ e_i = y_i - \hat{y}_i \]

The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, I will rerun the plot_ss command and add the argument showSquares = TRUE.

plot_ss(x = df$homeruns, y = df$wins, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     48.8140       0.2119  
## 
## Sum of Squares:  2129.785
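
The sum of squares reported by plot_ss can also be computed by hand. The sketch below assumes the 30 homeruns and wins values retyped from the table above, and uses a hypothetical helper named ssr (not part of DATA606) to evaluate the sum of squared residuals for any candidate line. It compares the least squares line with a slightly steeper line to illustrate that any other line gives a larger sum.

```r
# homeruns and wins for the 30 teams, retyped from the table above
hr <- c(210, 203, 169, 129, 162, 108, 222, 185, 163, 95,
        191, 117, 148, 183, 155, 153, 154, 154, 172, 186,
        103, 149, 107, 114, 172, 173, 154, 121, 91, 109)
w  <- c(96, 90, 95, 71, 90, 77, 97, 96, 73, 56,
        69, 82, 71, 79, 86, 102, 79, 80, 94, 81,
        63, 72, 72, 74, 91, 89, 80, 86, 71, 67)

# Hypothetical helper: sum of squared residuals for the line y = b0 + b1 * x
ssr <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# Least squares estimates of the intercept and slope
b <- unname(coef(lm(w ~ hr)))

ssr_ls    <- ssr(b[1], b[2], hr, w)         # least squares line
ssr_other <- ssr(b[1], b[2] + 0.05, hr, w)  # same intercept, steeper slope

ssr_ls
ssr_other  # larger than ssr_ls, since the least squares line minimizes this sum
```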

The output from the plot_ss function provides the slope and intercept of the line as well as the sum of squares. The same information can be obtained with the lm function in R (Textbook Page 19), which fits the linear model, that is, the regression line.

The Linear Model

linear_model <- lm(wins ~ homeruns, data = df)

The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function.

summary(linear_model)

## 
## Call:
## lm(formula = wins ~ homeruns, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2874  -6.7083   0.7708   5.6292  20.7649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.81397    7.08635   6.888 1.74e-07 ***
## homeruns     0.21190    0.04551   4.656 7.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.721 on 28 degrees of freedom
## Multiple R-squared:  0.4364, Adjusted R-squared:  0.4163 
## F-statistic: 21.68 on 1 and 28 DF,  p-value: 7.094e-05
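
Individual pieces of this summary can also be pulled out by name instead of being read off the printed output. The sketch below refits the model on the homeruns and wins values retyped from the table above (the data frame name teams and the object name fit are illustrative) and extracts the coefficient table, \(R^2\), and the residual standard error.

```r
# homeruns and wins for the 30 teams, retyped from the table above
teams <- data.frame(
  homeruns = c(210, 203, 169, 129, 162, 108, 222, 185, 163, 95,
               191, 117, 148, 183, 155, 153, 154, 154, 172, 186,
               103, 149, 107, 114, 172, 173, 154, 121, 91, 109),
  wins     = c(96, 90, 95, 71, 90, 77, 97, 96, 73, 56,
               69, 82, 71, 79, 86, 102, 79, 80, 94, 81,
               63, 72, 72, 74, 91, 89, 80, 86, 71, 67)
)

fit <- lm(wins ~ homeruns, data = teams)
s   <- summary(fit)

coef_table <- coef(s)      # estimates, standard errors, t values, p-values
r_squared  <- s$r.squared  # proportion of variability in wins explained
resid_se   <- s$sigma      # residual standard error

coef_table
r_squared
resid_se
```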

The “Coefficients” table shown is key; its first column displays the linear model’s y-intercept and the coefficient of homeruns. With this table, we can write down the least squares regression line for the linear model:

\[ \hat{y} = 48.81397 + 0.21190 \times \text{homeruns} \]

The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 43.6% of the variability in wins is explained by home runs.
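
For a simple linear regression with a single predictor, \(R^2\) equals the square of the correlation coefficient \(r\), so the value can be checked against the correlation computed earlier:

\[ R^2 = r^2 = (0.660614)^2 \approx 0.4364 \]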

Prediction and prediction errors

Let’s create a scatterplot with the least squares line laid on top.

plot(df$wins ~ df$homeruns)
abline(linear_model)

The function abline plots a line based on its slope and intercept. Here, we used a shortcut by providing the model linear_model, which contains both parameter estimates. This line can be used to predict \(y\) at any value of \(x\). Predicting at values of \(x\) beyond the range of the observed data is called extrapolation and is not usually recommended; predictions made within the range of the data are more reliable. The predicted values at the observed \(x\) are also what the residuals are computed from.
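
As an illustration of prediction within the observed range, the sketch below uses predict on a hypothetical team with 150 home runs (the observed values run from 91 to 222). The data are retyped from the table above, and the names teams, fit, and new_team are assumptions made for this example only.

```r
# homeruns and wins for the 30 teams, retyped from the table above
teams <- data.frame(
  homeruns = c(210, 203, 169, 129, 162, 108, 222, 185, 163, 95,
               191, 117, 148, 183, 155, 153, 154, 154, 172, 186,
               103, 149, 107, 114, 172, 173, 154, 121, 91, 109),
  wins     = c(96, 90, 95, 71, 90, 77, 97, 96, 73, 56,
               69, 82, 71, 79, 86, 102, 79, 80, 94, 81,
               63, 72, 72, 74, 91, 89, 80, 86, 71, 67)
)

fit <- lm(wins ~ homeruns, data = teams)

# Hypothetical team with 150 home runs
new_team <- data.frame(homeruns = 150)

# Guard against extrapolation: only predict inside the observed range
in_range <- new_team$homeruns >= min(teams$homeruns) &
            new_team$homeruns <= max(teams$homeruns)
stopifnot(in_range)

pred <- unname(predict(fit, newdata = new_team))

# The same prediction by hand: intercept + slope * homeruns
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["homeruns"]] * 150

pred
manual
```

Plugging 150 into the fitted equation gives roughly 80.6 wins, which is the value both approaches should return.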

Model diagnostics

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

Linearity: We checked if the relationship between homeruns and wins is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. homeruns.

plot(linear_model$residuals ~ df$homeruns)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

There is no obvious pattern in the residual plot, which supports the linearity condition.

Nearly normal residuals: To check this condition, we can look at a histogram or a normal probability plot of the residuals.

hist(linear_model$residuals)

qqnorm(linear_model$residuals)
qqline(linear_model$residuals)  # adds diagonal line to the normal prob plot

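The histogram and normal probability plot suggest that the residuals are nearly normal.

Constant variability: This condition is judged from the spread of the residuals rather than from their distribution. In the plot of residuals vs. homeruns above, the spread around zero appears roughly constant across the range of homeruns, so the constant variability condition also appears to be met.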
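
The visual checks can be supplemented with a residuals versus fitted values plot and a formal normality test. The sketch below is an optional addition, not part of the original analysis: it refits the model on the values retyped from the table above and applies the Shapiro-Wilk test from base R (shapiro.test), where a large p-value would be consistent with nearly normal residuals. With only 30 observations, such a test has limited power, so it should be read alongside the plots rather than in place of them.

```r
# homeruns and wins for the 30 teams, retyped from the table above
teams <- data.frame(
  homeruns = c(210, 203, 169, 129, 162, 108, 222, 185, 163, 95,
               191, 117, 148, 183, 155, 153, 154, 154, 172, 186,
               103, 149, 107, 114, 172, 173, 154, 121, 91, 109),
  wins     = c(96, 90, 95, 71, 90, 77, 97, 96, 73, 56,
               69, 82, 71, 79, 86, 102, 79, 80, 94, 81,
               63, 72, 72, 74, 91, 89, 80, 86, 71, 67)
)

fit <- lm(wins ~ homeruns, data = teams)
res <- residuals(fit)

# Constant variability: the vertical spread should look similar across fitted values
plot(fitted(fit), res, xlab = "Fitted wins", ylab = "Residuals")
abline(h = 0, lty = 3)  # dashed reference line at zero

# Nearly normal residuals: Shapiro-Wilk test (null hypothesis: residuals are normal)
sw <- shapiro.test(res)
sw
```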