Introduction

This project was built to explore the linear regression model. It plots actual wins (W) against predicted wins (expwin) for every MLB team in 1983, 1997, and 2008. The three yearly plots are arranged side by side in a grid; each shows one point per team for that year, together with the best-fit line through the points.

Pythagorean Expectation

Pythagorean expectation is a formula invented by Bill James to estimate how many games a baseball team “should” have won based on the number of runs it scored (R) and allowed (RA). Comparing a team’s actual winning percentage with its Pythagorean winning percentage gives a rough measure of how lucky that team was: a team that won more games than expected outperformed its run differential.
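For reference, the general form of the formula can be written as below. This is a sketch of the standard statement rather than a quotation from any particular source: the exponent k is 2 in James' original version, while the code in this report uses the refined value k = 1.83.

```latex
\[
  \text{W\%}_{\text{pyth}} = \frac{R^{k}}{R^{k} + \mathit{RA}^{k}},
  \qquad
  \text{expected wins} = \text{W\%}_{\text{pyth}} \times (W + L)
\]
```

Here W + L is the number of decided games the team played, which matches how expwin is computed later in this report.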

library(Lahman)     # Teams table with season-level team statistics
library(dplyr)      # select(), mutate(), filter(), %>%
library(ggplot2)    # plotting
library(grid)       # grid graphics system used by gridExtra
library(gridExtra)  # grid.arrange() for side-by-side plots

Loading Data

The Teams table that ships with the Lahman package is used. Only the columns needed for this analysis are kept.

mydata <- Teams %>% select(yearID,lgID,teamID,W,L,R,RA)

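Calculating Pythagorean Expectation

The code below calculates, for every team-year combination, the Pythagorean winning percentage (wpct) from R and RA, the expected wins (expwin), and the difference between actual and expected wins (diff). These are added to the dataset as new columns.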

mydata <- mydata %>%
  mutate(
    wpct   = R^1.83 / (R^1.83 + RA^1.83),
    expwin = round(wpct * (W + L)),
    diff   = W - expwin
  )
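To make the calculation concrete, here is a minimal base-R sketch of the same arithmetic for a single team. The helper name pyth_wpct and the team's numbers are hypothetical, chosen only for illustration; they do not come from the Lahman data.

```r
# Pythagorean winning percentage; k = 1.83 matches the exponent used above.
pyth_wpct <- function(R, RA, k = 1.83) {
  R^k / (R^k + RA^k)
}

# Hypothetical team: 800 runs scored, 700 allowed, 85-77 record.
R <- 800; RA <- 700; W <- 85; L <- 77

wpct   <- pyth_wpct(R, RA)           # about 0.56
expwin <- round(wpct * (W + L))      # expected wins over the games played
luck   <- W - expwin                 # same quantity as the diff column

print(c(wpct = wpct, expwin = expwin, luck = luck))
```

For this made-up team the formula expects 91 wins, so an 85-win season gives a diff of -6, which reads as the team underperforming its run differential.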

Data for three seasons (1983, 1997, and 2008) are extracted for closer inspection.

mydata1 <- mydata %>% filter(yearID == 1983)
mydata2 <- mydata %>% filter(yearID == 1997)
mydata3 <- mydata %>% filter(yearID == 2008)

Plotting actual wins (W) vs expected wins (expwin)

In each of the three seasons, the plots show an approximately linear relationship between expected and actual wins, with small residuals and no obvious outliers.

p1 <- ggplot(mydata1,aes(expwin,W)) + geom_point() + stat_smooth(method = "lm") 
p2 <- ggplot(mydata2,aes(expwin,W)) + geom_point() + stat_smooth(method = "lm") 
p3 <- ggplot(mydata3,aes(expwin,W)) + geom_point() + stat_smooth(method = "lm") 
grid.arrange(p1,p2,p3,ncol = 3)
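The visual impression of small residuals and no outliers can also be checked numerically. The following is a minimal sketch, assuming a small made-up data frame in place of the real season data; the cutoff of 2 for standardized residuals is a common rule of thumb, not something taken from this report.

```r
# Hypothetical expected/actual win totals for six teams (illustrative only).
teams <- data.frame(
  expwin = c(70, 78, 81, 85, 90, 98),
  W      = c(68, 80, 79, 88, 91, 97)
)

# Fit the same kind of model that stat_smooth(method = "lm") draws.
fit <- lm(W ~ expwin, data = teams)

res     <- resid(fit)       # raw residuals, in wins
std_res <- rstandard(fit)   # standardized residuals

# Flag teams whose standardized residual exceeds the rule-of-thumb cutoff.
flagged <- teams[abs(std_res) > 2, ]

print(round(res, 2))
print(flagged)
```

Running the same steps on mydata1, mydata2, or mydata3 in place of the made-up data frame would show which teams, if any, sit unusually far from the fitted line.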

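Coefficients of the Best-Fit Line

The fitted intercept and slope describe how actual wins relate to expected wins, and can be used to predict actual wins from the calculated expected wins. The model below is fit to the 1983 season (mydata1).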

mks <- lm(W ~ expwin, data = mydata1)

summary(mks)
## 
## Call:
## lm(formula = W ~ expwin, data = mydata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1805 -0.8402  0.4061  2.1579  5.3159 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.65508    6.47162  -0.565    0.577    
## expwin       1.04512    0.07945  13.155 1.82e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.489 on 24 degrees of freedom
## Multiple R-squared:  0.8782, Adjusted R-squared:  0.8731 
## F-statistic:   173 on 1 and 24 DF,  p-value: 1.824e-12
# Extract the fitted coefficients from the model object
c <- coef(mks)[[1]]  # intercept
m <- coef(mks)[[2]]  # slope
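As a rough illustration of how these coefficients might be used, the sketch below plugs the intercept and slope reported in the summary output into the line equation. The helper name predict_wins and the input values are hypothetical.

```r
# Coefficients copied from the 1983 summary output above.
intercept <- -3.65508
slope     <-  1.04512

# Hypothetical helper: predicted actual wins for a given number of expected wins.
predict_wins <- function(expwin) {
  intercept + slope * expwin
}

# Predictions for three illustrative expected-win totals.
print(predict_wins(c(70, 81, 90)))
```

Because the slope is close to 1 and the intercept is not statistically distinguishable from 0 (p = 0.577), the predictions stay close to the expected-win values themselves. If the model is fit with a formula of the form W ~ expwin and a data argument, predict(mks, newdata = data.frame(expwin = 90)) should give the same kind of result directly from the model object.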