This project was built to understand the linear regression model. For every MLB team in 1983, 1997, and 2008, actual wins (W) are plotted against predicted wins (expwin). The three seasons are shown side by side in a grid, one plot per year. Each plot contains one point per team for that season, along with the best-fit line through the points.
Pythagorean expectation is a formula invented by Bill James to estimate how many games a baseball team “should” have won based on the number of runs they scored (R) and allowed (RA). Comparing a team’s actual and Pythagorean winning percentage can be used to evaluate how lucky that team was (by examining the relation between the two winning percentages).
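For reference, the formula can be written as below. This is a sketch in this document's notation (R for runs scored, RA for runs allowed, W and L for wins and losses). The exponent k is an assumption of the notation: James' original version uses k = 2, while the code in this document uses the refined value k = 1.83.

```latex
\[
\mathrm{wpct} = \frac{R^{k}}{R^{k} + RA^{k}},
\qquad
\mathrm{expwin} = \mathrm{round}\bigl(\mathrm{wpct} \times (W + L)\bigr)
\]
```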
library(Lahman)     # Teams table with season-level team statistics
library(dplyr)      # select(), mutate(), filter() and the %>% pipe
library(ggplot2)    # plotting
library(grid)
library(gridExtra)  # grid.arrange()
We work with a subset of columns from the Teams table that ships with the Lahman package.
mydata <- Teams %>% select(yearID,lgID,teamID,W,L,R,RA)
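The code below calculates the expected wins from R and RA using James’ formula for every team-year combination and adds the results to the dataset as new columns: wpct (Pythagorean winning percentage), expwin (expected wins), and diff (actual minus expected wins).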
mydata <- mydata %>% mutate(wpct=R^1.83/(R^1.83+RA^1.83),expwin=round(wpct*(W+L)),diff=W-expwin)
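As a quick sanity check of the calculation above, here is a minimal sketch for a single team. The function name pythag_wpct and the run totals (800 scored, 700 allowed over 162 decided games) are hypothetical and are not taken from the Lahman data.

```r
# Pythagorean winning percentage with exponent k
# (k = 1.83 matches the mutate() call in this document).
pythag_wpct <- function(runs, runs_allowed, k = 1.83) {
  runs^k / (runs^k + runs_allowed^k)
}

# Hypothetical team: 800 runs scored, 700 runs allowed, 162 decided games.
wpct_example <- pythag_wpct(800, 700)
expwin_example <- round(wpct_example * 162)

print(round(wpct_example, 3))  # about 0.561
print(expwin_example)          # 91
```

A team that scores exactly as many runs as it allows gets a winning percentage of 0.5, whatever the exponent.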
The data for three seasons (1983, 1997, and 2008) is then subset for closer inspection.
mydata1 <- mydata %>% filter(yearID== 1983)
mydata2 <- mydata %>% filter(yearID== 1997)
mydata3 <- mydata %>% filter(yearID== 2008)
We observe that there is an approximately linear relationship, as expected, with small residuals and no obvious outliers.
p1 <- ggplot(mydata1,aes(expwin,W)) + geom_point() + stat_smooth(method = "lm")
p2 <- ggplot(mydata2,aes(expwin,W)) + geom_point() + stat_smooth(method = "lm")
p3 <- ggplot(mydata3,aes(expwin,W)) + geom_point() + stat_smooth(method = "lm")
grid.arrange(p1,p2,p3,ncol = 3)
The fitted coefficients can then be used to predict actual wins from the calculated (expected) wins, adjusting the Pythagorean estimate. The model below is fit to the 1983 data.
# Fit actual wins on expected wins for 1983; the formula refers to columns of mydata1
mks <- lm(W ~ expwin, data = mydata1)
summary(mks)
##
## Call:
## lm(formula = W ~ expwin, data = mydata1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.1805 -0.8402  0.4061  2.1579  5.3159
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.65508    6.47162  -0.565    0.577
## expwin       1.04512    0.07945  13.155 1.82e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.489 on 24 degrees of freedom
## Multiple R-squared:  0.8782, Adjusted R-squared:  0.8731
## F-statistic:   173 on 1 and 24 DF,  p-value: 1.824e-12
# Intercept (c) and slope (m) of the fitted line W = c + m * expwin
c <- coef(mks)[[1]]
m <- coef(mks)[[2]]
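A minimal sketch of how the intercept and slope could be used for prediction, assuming the 1983 estimates reported in the summary output above. The helper name predict_wins and the input values (70, 81, and 90 expected wins) are hypothetical.

```r
# Estimates copied from the 1983 summary output in this document.
intercept_1983 <- -3.65508
slope_1983 <- 1.04512

# Hypothetical helper: predicted actual wins for a given number of expected wins.
predict_wins <- function(expwin, intercept = intercept_1983, slope = slope_1983) {
  intercept + slope * expwin
}

predicted <- predict_wins(c(70, 81, 90))
print(round(predicted, 1))  # 69.5 81.0 90.4
```

With the fitted model object at hand, predict(mks, newdata = data.frame(expwin = 90)) should give the same kind of result, provided the model was fit with a formula on bare column names (W ~ expwin).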