loading packages
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
loading data
url = 'https://raw.githubusercontent.com/schoolkidrich/R/main/DATA_605/week12/baseball.csv'
data = read.csv(url)
data summary
summary(data)
##        Team              League            Year             RS
##  Length:1232        Length:1232        Min.   :1962   Min.   : 463.0
##  Class :character   Class :character   1st Qu.:1977   1st Qu.: 652.0
##  Mode  :character   Mode  :character   Median :1989   Median : 711.0
##                                        Mean   :1989   Mean   : 715.1
##                                        3rd Qu.:2002   3rd Qu.: 775.0
##                                        Max.   :2012   Max.   :1009.0
##
##        RA               W              OBP              SLG
##  Min.   : 472.0   Min.   : 40.0   Min.   :0.2770   Min.   :0.3010
##  1st Qu.: 649.8   1st Qu.: 73.0   1st Qu.:0.3170   1st Qu.:0.3750
##  Median : 709.0   Median : 81.0   Median :0.3260   Median :0.3960
##  Mean   : 715.1   Mean   : 80.9   Mean   :0.3263   Mean   :0.3973
##  3rd Qu.: 774.2   3rd Qu.: 89.0   3rd Qu.:0.3370   3rd Qu.:0.4210
##  Max.   :1103.0   Max.   :116.0   Max.   :0.3730   Max.   :0.4910
##
##        BA            Playoffs       RankSeason     RankPlayoffs
##  Min.   :0.2140   Min.   :0.0000   Min.   :1.000   Min.   :1.000
##  1st Qu.:0.2510   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:2.000
##  Median :0.2600   Median :0.0000   Median :3.000   Median :3.000
##  Mean   :0.2593   Mean   :0.1981   Mean   :3.123   Mean   :2.717
##  3rd Qu.:0.2680   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :0.2940   Max.   :1.0000   Max.   :8.000   Max.   :5.000
##                                    NA's   :988     NA's   :988
##        G              OOBP             OSLG
##  Min.   :158.0   Min.   :0.2940   Min.   :0.3460
##  1st Qu.:162.0   1st Qu.:0.3210   1st Qu.:0.4010
##  Median :162.0   Median :0.3310   Median :0.4190
##  Mean   :161.9   Mean   :0.3323   Mean   :0.4197
##  3rd Qu.:162.0   3rd Qu.:0.3430   3rd Qu.:0.4380
##  Max.   :165.0   Max.   :0.3840   Max.   :0.4990
##                  NA's   :812      NA's   :812
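The summary reports 988 NA's in RankSeason and RankPlayoffs and 812 in OOBP and OSLG. A quicker way to get one missing-value count per column is `colSums(is.na(...))`. The sketch below uses a small hypothetical data frame, `toy`, with invented values (not rows from baseball.csv), so it runs without downloading anything.

```r
# Hypothetical toy data frame; the values are invented for illustration only
toy <- data.frame(
  W    = c(81, 94, 66, 88),
  RS   = c(700, 780, 610, 745),
  OOBP = c(NA, 0.315, NA, 0.330)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Names of the columns that have no missing values
complete_cols <- names(na_counts)[na_counts == 0]
print(complete_cols)
```

On the real data, `colSums(is.na(data))` run right after `read.csv()` should agree with the NA's rows in the summary.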
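drop columns
# Keep the response (W) and the numeric predictors without missing values;
# Team, League, Year and the columns containing NA's are dropped.
# Playoffs is dichotomous (0/1)
data = data[c('RS','RA','W','OBP','SLG','BA','Playoffs','G')]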
model
m1 = lm(W~., data=data)
summary(m1)
##
## Call:
## lm(formula = W ~ ., data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -13.8301  -2.6783   0.0306   2.6769  11.9779
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.463882  29.752939  -0.654  0.51312
## RS            0.085187   0.004471  19.052  < 2e-16 ***
## RA           -0.098456   0.001487 -66.201  < 2e-16 ***
## OBP          47.514358  19.739414   2.407  0.01623 *
## SLG          18.141321   9.117671   1.990  0.04685 *
## BA          -14.555571  17.411238  -0.836  0.40333
## Playoffs      3.128132   0.339881   9.204  < 2e-16 ***
## G             0.557668   0.177921   3.134  0.00176 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.831 on 1224 degrees of freedom
## Multiple R-squared: 0.8888, Adjusted R-squared: 0.8882
## F-statistic: 1398 on 7 and 1224 DF, p-value: < 2.2e-16
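In the output above, BA is the only predictor with a large p-value (0.403). One common way to check whether such a term can be dropped is an F-test between nested models. The sketch below shows the mechanics on simulated stand-in data; `sim`, `x1`, `x2`, `full` and `reduced` are hypothetical names, and nothing here is a result for the baseball model.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated stand-in data (hypothetical, not the baseball data):
# y depends on x1 only; x2 is pure noise
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 + rnorm(n)
sim <- data.frame(y, x1, x2)

full    <- lm(y ~ x1 + x2, data = sim)
reduced <- update(full, . ~ . - x2)  # same model without x2

# F-test for nested models: a large p-value suggests x2 adds little
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared of the two fits, side by side
adj <- c(full    = summary(full)$adj.r.squared,
         reduced = summary(reduced)$adj.r.squared)
print(adj)
```

For the model above, the analogous calls would be `update(m1, . ~ . - BA)` followed by `anova()` on the two fits; that comparison has not been run here.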
residual plot
# residuals cluster around 0 across the fitted values
plot(fitted(m1), resid(m1))
abline(h = 0, lty = 2)  # dashed reference line at zero

qqplot
# qq plot shows that the residuals are approximately normal
qqnorm(resid(m1))
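A QQ plot is easier to read with a reference line, and a formal test can complement the visual check. The sketch below adds `qqline()` and runs a Shapiro-Wilk test on the residuals of a model fitted to simulated data; `x`, `y`, `fit` and `r` are hypothetical stand-ins for the baseball model.

```r
set.seed(42)  # reproducible simulated data

# Simulated stand-in for a fitted model (hypothetical data)
x <- runif(300)
y <- 1 + 2 * x + rnorm(300, sd = 0.5)
fit <- lm(y ~ x)
r <- resid(fit)

# QQ plot with a reference line through the first and third quartiles
qqnorm(r)
qqline(r)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value is evidence against normality
sw <- shapiro.test(r)
print(sw)
```

The same calls should work with `resid(m1)`. Note that `shapiro.test()` accepts between 3 and 5000 observations, and with over a thousand residuals even small departures from normality can produce a small p-value.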

Conclusion:
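The residuals look approximately normal and cluster around 0, and the model explains about 89% of the variance in wins (adjusted R-squared 0.8882), so I think a multiple linear model is appropriate for this dataset. BA is the only predictor that is not significant at the 0.05 level (p = 0.403).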
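A caveat on the residual average: when a linear model includes an intercept, least squares forces the residuals to average to zero, so the average on its own says little about how well the model fits. The sketch below illustrates this on simulated data with deliberately skewed errors; `x`, `y` and `fit` are hypothetical and unrelated to the baseball data.

```r
set.seed(7)  # reproducible simulated data

# Hypothetical data with deliberately skewed (non-normal) errors
x <- runif(100)
y <- 5 + 2 * x + rexp(100)
fit <- lm(y ~ x)

# With an intercept in the model, OLS residuals sum to zero by construction,
# so their mean is zero up to floating-point error
res_mean <- mean(resid(fit))
print(res_mean)
```

This is why the shape of the residual plot and the QQ plot carry more weight than the residual mean when judging the model.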