y = β0 +β1x1 +···+βkxk +ε
I'll be using the MLB 2009 revenues dataset, which I compiled back in 2011.
MLB <- read.csv("/Users/telekineticturtle/Desktop/Colorado 13/Quant Methods/Data/MLB_Revenue_2009.csv",
header = T, as.is = T)
summary(MLB$Rev09)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 144 164 184 197 200 441
# Who the hell made $441 million? Oh right, the Yankees. You know what?
# I hate the Yankees. Damn Outliers. Kill 'em.
MLB <- MLB[c(1:18, 20:30), ]
summary(MLB$Rev09)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 144 163 183 188 195 268
# Not surprisingly, that looks better.
# par() sets graphical parameters. We've used this for nfrow= before.
# Use mex= sets the interline difference in the margins
par(mex = 0.5)
# pairs() plots every variable by every other variable in a big matrix.
# Use gap=0 to remove the gaps between plots. Use cex.labels=## to set
# the font size.
pairs(MLB[, 2:14], gap = 0, cex.labels = 0.9)
# compare pairs() to plot()
plot(MLB[, 2:14], gap = 0, cex.labels = 0.9) # There is no difference.
# To make a multiple regression model instead of a simple regression,
# simply use the + sign to add another independent variable. Pretty
# simple.
names(MLB)
## [1] "Team" "AllTimeWinPer" "YrFounded"
## [4] "HOFSeasons" "StadAge" "WinPer2009"
## [7] "WinPer5Yr" "WinPer10Yr" "PostWins5Yr"
## [10] "PostWins10Yr" "AllStar09" "Pop09"
## [13] "OtherFranchises" "Rev09"
mlb_1 <- lm(Rev09 ~ Pop09 + WinPer5Yr + OtherFranchises + YrFounded + PostWins5Yr +
AllStar09 + HOFSeasons + StadAge + AllTimeWinPer, data = MLB)
# There is no immediate reward for making a multiple regression model, so
# it's best to store it as an object. Then use summary(MODEL) to see the
# details:
summary(mlb_1)
##
## Call:
## lm(formula = Rev09 ~ Pop09 + WinPer5Yr + OtherFranchises + YrFounded +
## PostWins5Yr + AllStar09 + HOFSeasons + StadAge + AllTimeWinPer,
## data = MLB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.16 -7.68 -0.08 11.85 21.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 748.7238 605.9166 1.24 0.23164
## Pop09 3.9959 1.0282 3.89 0.00099 ***
## WinPer5Yr -96.0501 115.7484 -0.83 0.41695
## OtherFranchises -0.2919 1.1645 -0.25 0.80473
## YrFounded -0.0719 0.2496 -0.29 0.77649
## PostWins5Yr 1.8232 0.7901 2.31 0.03244 *
## AllStar09 1.0504 0.3761 2.79 0.01161 *
## HOFSeasons -0.0182 0.1144 -0.16 0.87511
## StadAge -0.2585 0.1491 -1.73 0.09919 .
## AllTimeWinPer 194.1947 251.3811 0.77 0.44932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.2 on 19 degrees of freedom
## Multiple R-squared: 0.832, Adjusted R-squared: 0.752
## F-statistic: 10.4 on 9 and 19 DF, p-value: 1.18e-05
# The summary is exactly like the simple regression model except that
# there are multiple variables to consider. Each t-test asks whether the
# variable explains some of the variation when holding the other variables
# constant. But the t tests only say something about what happens if you
# remove one variable and leave in all the others. In this case,
# population has a significant relationship with revenue. The teams
# making more money are located in cities with more people. In
# particular, increasing the population by 1,000 people equates to about
# $5 million dollars more revenue. Teams with a higher win percentage
# over the past 5 years may also make more money, but that result could
# also be a result of random error. Percentages in baseball are actually
# reported as decimals, so a 1% increase equates to $2.17 mill dollars
# more revenue. The presence of other franchises in the same city and the
# year the team was founded do not have a significant impact. More
# postseason wins over the past 5 years and more all stars in 2009
# increase revenue. The age of the stadium and historical success by the
# team and players do not have a significant impact. Notice that in
# multiple regression, the difference between R-squared and adj. R-squared
# becomes greater since there is a penalty for adding more variables.
plot(mlb_1$residuals ~ mlb_1$fitted.values, main = "Residuals of Full MLB Model",
ylab = "Residuals", xlab = "Fitted Values", col = "red")
abline(lm(mlb_1$residuals ~ mlb_1$fitted.values), col = "maroon", lwd = 2)
# Our model is overpredicting more poorly than underpredicting, but the
# average is 0 at least.
ANOVA
anova(mlb_1)
## Analysis of Variance Table
##
## Response: Rev09
## Df Sum Sq Mean Sq F value Pr(>F)
## Pop09 1 17061 17061 57.54 3.7e-07 ***
## WinPer5Yr 1 2663 2663 8.98 0.0074 **
## OtherFranchises 1 26 26 0.09 0.7717
## YrFounded 1 1625 1625 5.48 0.0303 *
## PostWins5Yr 1 2072 2072 6.99 0.0160 *
## AllStar09 1 3195 3195 10.78 0.0039 **
## HOFSeasons 1 42 42 0.14 0.7094
## StadAge 1 961 961 3.24 0.0877 .
## AllTimeWinPer 1 177 177 0.60 0.4493
## Residuals 19 5633 296
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 'These tests are successive; they correspond to (reading upward from the
# bottom) a stepwise removal of terms from the model until finally only
# the first is left.' And 5-yr win percentage is now significant, as is
# the year founded. In other words, order matters in an anova. And you
# will see different test statistics, so it's worth looking at both.
# Essentially, the ANOVA is saying that it's worth adding WinPer5Yr,
# YrFounded, PostWins5Yr, and AllStar09 but not the others. But do the
# insignificant variables together contribute something? Answer this by
# making a full and reduced model and comparing them.
mlb_2 <- lm(Rev09 ~ Pop09 + AllStar09 + PostWins5Yr, data = MLB) # Reduced model with only the three best.
anova(mlb_1, mlb_2) # Put the full model first, and then the reduced or vice-versa, it doesn't matter.
## Analysis of Variance Table
##
## Model 1: Rev09 ~ Pop09 + WinPer5Yr + OtherFranchises + YrFounded + PostWins5Yr +
## AllStar09 + HOFSeasons + StadAge + AllTimeWinPer
## Model 2: Rev09 ~ Pop09 + AllStar09 + PostWins5Yr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 19 5633
## 2 25 7644 -6 -2010 1.13 0.38
anova(mlb_2, mlb_1)
## Analysis of Variance Table
##
## Model 1: Rev09 ~ Pop09 + AllStar09 + PostWins5Yr
## Model 2: Rev09 ~ Pop09 + WinPer5Yr + OtherFranchises + YrFounded + PostWins5Yr +
## AllStar09 + HOFSeasons + StadAge + AllTimeWinPer
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 25 7644
## 2 19 5633 6 2010 1.13 0.38
# This shows that there is no significant difference in explanatory power
# by adding the extra variables, despite the ANOVA table revealing some as
# significant! Statistics are weird. This begs the question, though.
# Why did WinPer5Yr show up as significant in the ANOVA? Is it just
# because of the order in the model?
par(mfrow = c(1, 2))
plot(mlb_1$residuals ~ mlb_1$fitted.values, main = "Residuals of Full MLB Model",
ylab = "Residuals", xlab = "Fitted Values", col = "red")
abline(lm(mlb_1$residuals ~ mlb_1$fitted.values), col = "maroon", lwd = 2)
plot(mlb_2$residuals ~ mlb_2$fitted.values, main = "Residuals of Reduced MLB Model",
ylab = "Residuals", xlab = "Fitted Values", col = "blue")
abline(lm(mlb_2$residuals ~ mlb_2$fitted.values), col = "navy", lwd = 2)
Find a Better Model by Eliminating Variables Manually
# To do this, just hard-code a bunch of models with different sets of
# variables.
summary(lm(Rev09 ~ AllStar09 + PostWins5Yr + Pop09 + WinPer10Yr, data = MLB))
##
## Call:
## lm(formula = Rev09 ~ AllStar09 + PostWins5Yr + Pop09 + WinPer10Yr,
## data = MLB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.90 -8.84 2.90 10.35 44.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 145.126 43.463 3.34 0.00274 **
## AllStar09 1.194 0.389 3.07 0.00528 **
## PostWins5Yr 1.618 0.698 2.32 0.02924 *
## Pop09 4.089 0.958 4.27 0.00027 ***
## WinPer10Yr -16.301 95.078 -0.17 0.86530
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.8 on 24 degrees of freedom
## Multiple R-squared: 0.772, Adjusted R-squared: 0.734
## F-statistic: 20.3 on 4 and 24 DF, p-value: 2.05e-07
summary(lm(Rev09 ~ AllStar09 + PostWins5Yr + Pop09 + WinPer5Yr, data = MLB))
##
## Call:
## lm(formula = Rev09 ~ AllStar09 + PostWins5Yr + Pop09 + WinPer5Yr,
## data = MLB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.46 -6.45 2.66 10.17 45.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 162.591 49.695 3.27 0.00323 **
## AllStar09 1.232 0.373 3.30 0.00300 **
## PostWins5Yr 1.777 0.771 2.30 0.03019 *
## Pop09 4.175 0.971 4.30 0.00025 ***
## WinPer5Yr -55.187 109.462 -0.50 0.61874
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.8 on 24 degrees of freedom
## Multiple R-squared: 0.774, Adjusted R-squared: 0.736
## F-statistic: 20.5 on 4 and 24 DF, p-value: 1.83e-07
summary(lm(Rev09 ~ AllStar09 + PostWins5Yr + Pop09, data = MLB))
##
## Call:
## lm(formula = Rev09 ~ AllStar09 + PostWins5Yr + Pop09, data = MLB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.04 -8.16 2.49 10.38 44.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.764 6.582 20.93 <2e-16 ***
## AllStar09 1.165 0.344 3.39 0.0023 **
## PostWins5Yr 1.597 0.674 2.37 0.0258 *
## Pop09 4.073 0.935 4.36 0.0002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.5 on 25 degrees of freedom
## Multiple R-squared: 0.772, Adjusted R-squared: 0.744
## F-statistic: 28.1 on 3 and 25 DF, p-value: 3.53e-08
summary(lm(Rev09 ~ AllStar09 + Pop09, data = MLB))
##
## Call:
## lm(formula = Rev09 ~ AllStar09 + Pop09, data = MLB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.06 -9.91 0.11 8.64 37.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.82 7.14 19.30 < 2e-16 ***
## AllStar09 1.50 0.34 4.42 0.00016 ***
## Pop09 4.08 1.01 4.02 0.00044 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19 on 26 degrees of freedom
## Multiple R-squared: 0.72, Adjusted R-squared: 0.699
## F-statistic: 33.5 on 2 and 26 DF, p-value: 6.45e-08
summary(lm(Rev09 ~ AllStar09 + PostWins5Yr, data = MLB))
##
## Call:
## lm(formula = Rev09 ~ AllStar09 + PostWins5Yr, data = MLB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.19 -11.94 -4.57 8.96 61.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 146.938 8.111 18.12 2.9e-16 ***
## AllStar09 1.801 0.405 4.45 0.00014 ***
## PostWins5Yr 1.604 0.877 1.83 0.07875 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.7 on 26 degrees of freedom
## Multiple R-squared: 0.598, Adjusted R-squared: 0.567
## F-statistic: 19.3 on 2 and 26 DF, p-value: 7.15e-06
summary(lm(Rev09 ~ AllStar09 + PostWins10Yr + Pop09, data = MLB))
##
## Call:
## lm(formula = Rev09 ~ AllStar09 + PostWins10Yr + Pop09, data = MLB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.58 -11.97 0.86 11.83 40.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.354 6.975 19.69 < 2e-16 ***
## AllStar09 1.155 0.402 2.87 0.00817 **
## PostWins10Yr 0.743 0.488 1.52 0.14061
## Pop09 4.120 0.990 4.16 0.00033 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.5 on 25 degrees of freedom
## Multiple R-squared: 0.744, Adjusted R-squared: 0.713
## F-statistic: 24.2 on 3 and 25 DF, p-value: 1.44e-07
*Find a Better Model using regsubsets()
library(leaps)
regs_1 <- regsubsets(Rev09 ~ ., data = MLB[2:14], nbest = 1, nvmax = 10)
summary(regs_1)
## Subset selection object
## Call: regsubsets.formula(Rev09 ~ ., data = MLB[2:14], nbest = 1, nvmax = 10)
## 12 Variables (and intercept)
## Forced in Forced out
## AllTimeWinPer FALSE FALSE
## YrFounded FALSE FALSE
## HOFSeasons FALSE FALSE
## StadAge FALSE FALSE
## WinPer2009 FALSE FALSE
## WinPer5Yr FALSE FALSE
## WinPer10Yr FALSE FALSE
## PostWins5Yr FALSE FALSE
## PostWins10Yr FALSE FALSE
## AllStar09 FALSE FALSE
## Pop09 FALSE FALSE
## OtherFranchises FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
## AllTimeWinPer YrFounded HOFSeasons StadAge WinPer2009 WinPer5Yr
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " " "
## 5 ( 1 ) "*" " " " " "*" " " " "
## 6 ( 1 ) "*" " " " " "*" " " " "
## 7 ( 1 ) "*" "*" " " "*" " " " "
## 8 ( 1 ) "*" "*" " " "*" " " " "
## 9 ( 1 ) "*" "*" " " "*" " " "*"
## 10 ( 1 ) "*" "*" "*" "*" " " "*"
## WinPer10Yr PostWins5Yr PostWins10Yr AllStar09 Pop09
## 1 ( 1 ) " " " " " " "*" " "
## 2 ( 1 ) " " " " " " "*" "*"
## 3 ( 1 ) " " "*" " " "*" "*"
## 4 ( 1 ) " " "*" " " "*" "*"
## 5 ( 1 ) " " "*" " " "*" "*"
## 6 ( 1 ) "*" "*" " " "*" "*"
## 7 ( 1 ) "*" "*" " " "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*"
## OtherFranchises
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) " "
## 5 ( 1 ) " "
## 6 ( 1 ) " "
## 7 ( 1 ) " "
## 8 ( 1 ) " "
## 9 ( 1 ) " "
## 10 ( 1 ) " "
mlb1 <- lm(Rev09 ~ AllStar09, data = MLB)
mlb2 <- lm(Rev09 ~ AllStar09 + Pop09, data = MLB)
mlb3 <- lm(Rev09 ~ AllStar09 + Pop09 + PostWins5Yr, data = MLB)
mlb4 <- lm(Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge, data = MLB)
mlb5 <- lm(Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge + AllTimeWinPer,
data = MLB)
mlb6 <- lm(Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge + AllTimeWinPer +
WinPer10Yr, data = MLB)
# Comparison:
anova(mlb1, mlb2, mlb3, mlb4, mlb5, mlb6)
## Analysis of Variance Table
##
## Model 1: Rev09 ~ AllStar09
## Model 2: Rev09 ~ AllStar09 + Pop09
## Model 3: Rev09 ~ AllStar09 + Pop09 + PostWins5Yr
## Model 4: Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge
## Model 5: Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge + AllTimeWinPer
## Model 6: Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge + AllTimeWinPer +
## WinPer10Yr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 27 15179
## 2 26 9361 1 5818 22.38 0.0001 ***
## 3 25 7644 1 1717 6.61 0.0175 *
## 4 24 6453 1 1190 4.58 0.0437 *
## 5 23 6000 1 453 1.74 0.2004
## 6 22 5719 1 281 1.08 0.3099
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Comparing several nested models at once shows that adding variables is
# useful through 4 variables. With each added variable, less and less
# difference exists between th previous and new models.
summary(mlb4)
##
## Call:
## lm(formula = Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge,
## data = MLB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.37 -7.07 1.51 12.69 24.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 709.602 271.833 2.61 0.01534 *
## AllStar09 1.015 0.330 3.08 0.00517 **
## Pop09 3.956 0.879 4.50 0.00015 ***
## PostWins5Yr 1.534 0.633 2.42 0.02323 *
## StadAge -0.286 0.136 -2.10 0.04602 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.4 on 24 degrees of freedom
## Multiple R-squared: 0.807, Adjusted R-squared: 0.775
## F-statistic: 25.1 on 4 and 24 DF, p-value: 2.84e-08
Conclusion for MLB Stats
par(mfrow = c(1, 1))
plot(MLB$Rev09, mlb4$fitted.values, main = "Best Model for Predicting 2009 MLB Revenue",
xlab = "Fitted Values (millions of dollars)", ylab = "Revenue (millions of dollars)")
lines(predict(mlb4) ~ mlb4$fitted.values, col = "red", lwd = 2)
MLB$fit <- mlb4$fitted.values
mlb_4 <- lm(Rev09 ~ fit + I(fit^2), data = MLB)
summary(mlb_4)
##
## Call:
## lm(formula = Rev09 ~ fit + I(fit^2), data = MLB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.89 -6.98 2.25 9.25 27.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 180.15960 127.86941 1.41 0.17
## fit -0.83008 1.28985 -0.64 0.53
## I(fit^2) 0.00452 0.00318 1.42 0.17
##
## Residual standard error: 15.2 on 26 degrees of freedom
## Multiple R-squared: 0.821, Adjusted R-squared: 0.807
## F-statistic: 59.6 on 2 and 26 DF, p-value: 1.93e-10
plot(MLB$Rev09, MLB$fit, main = "Quadratic Model for Predicting 2009 MLB Revenue",
xlab = "Fitted Values (millions of dollars)", ylab = "Revenue (millions of dollars)")
curve(predict(mlb_4, data.frame(fit = x), type = "resp"), add = TRUE, col = "red",
lwd = 3)
lines(predict(mlb4) ~ mlb4$fitted.values, col = "blue", lwd = 2)
par(mfrow = c(1, 2))
qqnorm(mlb4$residuals, xlab = "Residuals", main = "Best Linear Model QQ Plot")
qqline(mlb4$residuals)
qqnorm(mlb_4$residuals, xlab = "Residuals", main = "Best Quadratic Model QQ Plot")
qqline(mlb_4$residuals)
# Counterinuitive but true, polynomial regression is still linear
# regression. Add squared or higher order polynomial functions using the
# identity function I(). This 'protects' the polynomials from being
# interpreted in the wrong way.
MLB <- read.csv("/Users/telekineticturtle/Desktop/Colorado 13/Quant Methods/Data/MLB_Revenue_2009.csv",
header = T, as.is = T)
names(MLB)
## [1] "Team" "AllTimeWinPer" "YrFounded"
## [4] "HOFSeasons" "StadAge" "WinPer2009"
## [7] "WinPer5Yr" "WinPer10Yr" "PostWins5Yr"
## [10] "PostWins10Yr" "AllStar09" "Pop09"
## [13] "OtherFranchises" "Rev09"
MLB <- MLB[order(MLB[, 6]), ]
head(MLB)
## Team AllTimeWinPer YrFounded HOFSeasons StadAge
## 30 Washington Nationals 0.475 1969 26 2008
## 22 Pittsburgh Pirates 0.503 1882 263 2001
## 3 Baltimore Orioles 0.474 1901 174 1992
## 8 Cleveland Indians 0.509 1901 183 1994
## 13 Kansas City Royals 0.482 1969 24 1973
## 1 Arizona Diamondbacks 0.491 1998 1 1998
## WinPer2009 WinPer5Yr WinPer10Yr PostWins5Yr PostWins10Yr AllStar09
## 30 0.364 0.424 0.439 0 0 7
## 22 0.385 0.409 0.421 0 0 5
## 3 0.395 0.426 0.431 0 0 7
## 8 0.401 0.510 0.504 6 8 10
## 13 0.401 0.404 0.415 0 0 3
## 1 0.432 0.488 0.497 3 14 8
## Pop09 OtherFranchises Rev09
## 30 5.36 5.60 184
## 22 2.35 8.51 145
## 3 2.67 3.75 171
## 8 2.09 9.58 170
## 13 2.00 4.99 155
## 1 4.28 7.01 172
plot(MLB[, 2:14], gap = 0, cex.labels = 0.9)
plot(Rev09 ~ WinPer2009, data = MLB, main = "2009 MLB Team Revenue v. 2009 Win %",
xlab = "2009 Win %", ylab = "Revenue (millions USD)", col = "blue")
# WinPer2009 seems to have a nonlinear relationship with Rev09.
# Simple Linear Model
mod_wp09_1 <- lm(Rev09 ~ WinPer2009, data = MLB)
summary(mod_wp09_1) # adj. r^2 = 0.2577
##
## Call:
## lm(formula = Rev09 ~ WinPer2009, data = MLB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.6 -23.4 -15.3 12.6 185.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.9 65.7 -0.30 0.7638
## WinPer2009 433.1 130.2 3.33 0.0025 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.4 on 28 degrees of freedom
## Multiple R-squared: 0.283, Adjusted R-squared: 0.258
## F-statistic: 11.1 on 1 and 28 DF, p-value: 0.00247
plot(Rev09 ~ WinPer2009, data = MLB, main = "2009 MLB Team Revenue v. 2009 Win %",
xlab = "2009 Win %", ylab = "Revenue (millions USD)", col = "blue")
lines(predict(mod_wp09_1) ~ MLB$WinPer2009, col = "red", lwd = 2)
plot(mod_wp09_1$residuals ~ mod_wp09_1$fitted.values, main = "2009 MLB Team Revenue by 2009 Win % Residuals",
xlab = "Predicted 2009 Revenue (millions USD)", ylab = "Residuals")
qqnorm(mod_wp09_1$residuals, main = "Q-Q Plot of Residuals mod_wp09_1")
qqline(mod_wp09_1$residuals) # Residuals not normally distributed.
shapiro.test(mod_wp09_1$residuals)
##
## Shapiro-Wilk normality test
##
## data: mod_wp09_1$residuals
## W = 0.8046, p-value = 7.957e-05
# Polynomial Linear Model
mod_wp09_2 <- lm(MLB$Rev09 ~ MLB$WinPer2009 + I(MLB$WinPer2009^2))
summary(mod_wp09_2) # adj. R^2 = 0.5347
##
## Call:
## lm(formula = MLB$Rev09 ~ MLB$WinPer2009 + I(MLB$WinPer2009^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.75 -22.12 -2.33 5.82 107.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1374 336 4.09 0.00035 ***
## MLB$WinPer2009 -5334 1376 -3.88 0.00061 ***
## I(MLB$WinPer2009^2) 5848 1391 4.20 0.00026 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.1 on 27 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.535
## F-statistic: 17.7 on 2 and 27 DF, p-value: 1.24e-05
# Confidence Intervals -- extract the confidence intervals in order to
# plot them on a graph.
confint(mod_wp09_2)
## 2.5 % 97.5 %
## (Intercept) 685.1 2062
## MLB$WinPer2009 -8156.9 -2511
## I(MLB$WinPer2009^2) 2993.7 8702
pc <- predict(mod_wp09_2, interval = "conf", level = 0.95)
head(pc)
## fit lwr upr
## 1 206.8 160.1 253.5
## 2 186.8 152.3 221.2
## 3 179.1 149.2 209.0
## 4 175.0 147.4 202.6
## 5 175.0 147.4 202.6
## 6 160.7 139.9 181.4
MLB$modfit <- pc[, 1]
MLB$modlwr <- pc[, 2]
MLB$modupr <- pc[, 3]
# Plotting Polynomial Model with confidence intervals
plot(Rev09 ~ WinPer2009, data = MLB, main = "2009 MLB Team Revenue v. 2009 Win %",
xlab = "2009 Win %", ylab = "Revenue (millions USD)", col = "blue", cex = 1,
pch = 16)
lines(MLB$WinPer2009, MLB$modfit, type = "l", col = "red", lwd = 3)
lines(MLB$WinPer2009, MLB$modlwr, type = "l", col = "black", lty = 3, lwd = 2)
lines(MLB$WinPer2009, MLB$modupr, type = "l", col = "black", lty = 3, lwd = 2)
plot(mod_wp09_2$residuals ~ mod_wp09_2$fitted.values, main = "2009 MLB Team Revenue by 2009 Win % Residuals",
xlab = "Predicted 2009 Revenue (millions USD)", ylab = "Residuals")
qqnorm(mod_wp09_2$residuals, main = "Q-Q Plot of Residuals mod_wp09_2")
qqline(mod_wp09_2$residuals) # Residuals not normally distributed.
shapiro.test(mod_wp09_2$residuals)
##
## Shapiro-Wilk normality test
##
## data: mod_wp09_2$residuals
## W = 0.8843, p-value = 0.003541
# You can force the linear model to have an intercept of 0 by writing
# lm(Y~X-1):
mod_wp09_3 <- lm(Rev09 ~ WinPer2009 - 1, data = MLB)
summary(mod_wp09_3)
##
## Call:
## lm(formula = Rev09 ~ WinPer2009 - 1, data = MLB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.5 -24.9 -14.4 11.3 190.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## WinPer2009 394.0 17.6 22.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.6 on 29 degrees of freedom
## Multiple R-squared: 0.945, Adjusted R-squared: 0.944
## F-statistic: 502 on 1 and 29 DF, p-value: <2e-16
# Note the adj R^2 is now giant. This makes no sense! Why? Because R^2
# is no longer measuring the same thing. It's now comparing the model to
# y = 0 instead of y = intercept because you've eliminated the intercept,
# but the anova must always compare to a nested, reduced model.
anova(mod_wp09_1)
## Analysis of Variance Table
##
## Response: Rev09
## Df Sum Sq Mean Sq F value Pr(>F)
## WinPer2009 1 26980 26980 11.1 0.0025 **
## Residuals 28 68265 2438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod_wp09_3)
## Analysis of Variance Table
##
## Response: Rev09
## Df Sum Sq Mean Sq F value Pr(>F)
## WinPer2009 1 1186303 1186303 502 <2e-16 ***
## Residuals 29 68489 2362
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Notice how in the anova tables, the SST has sky-rocketed but the SSE
# hasn't really changed. It looks like you're explaining a lot more, but
# really, you're just comparing to a different reduced model (that doesn't
# work well!).
model.matrix(Rev09 ~ WinPer2009 + Pop09, data = MLB)
## (Intercept) WinPer2009 Pop09
## 30 1 0.364 5.36
## 22 1 0.385 2.35
## 3 1 0.395 2.67
## 8 1 0.401 2.09
## 13 1 0.401 2.00
## 1 1 0.432 4.28
## 18 1 0.432 19.01
## 12 1 0.457 5.73
## 20 1 0.463 4.27
## 23 1 0.463 3.00
## 29 1 0.463 2.50
## 7 1 0.481 2.16
## 6 1 0.488 9.57
## 16 1 0.494 1.55
## 5 1 0.516 9.57
## 27 1 0.519 2.73
## 25 1 0.525 3.34
## 10 1 0.528 4.43
## 2 1 0.531 5.38
## 17 1 0.534 3.23
## 11 1 0.537 5.41
## 28 1 0.537 6.30
## 24 1 0.543 4.27
## 26 1 0.562 2.82
## 9 1 0.568 2.51
## 21 1 0.574 5.84
## 4 1 0.586 4.52
## 15 1 0.586 12.87
## 14 1 0.599 12.87
## 19 1 0.636 19.01
## attr(,"assign")
## [1] 0 1 2
# Multiplying across the table (after weighting each variable by the model
# coefficient) will yield the fitted values. Works for dummy variables,
# too.
# Two Ways to Sort Data:
MLB_sort <- MLB[do.call(order, data.frame(MLB[, 5])), ]
head(MLB_sort)
## Team AllTimeWinPer YrFounded HOFSeasons StadAge
## 4 Boston Red Sox 0.517 1901 217 1912
## 5 Chicago Cubs 0.514 1876 258 1914
## 15 Los Angeles Dodgers 0.524 1884 230 1962
## 20 Oakland Athletics 0.486 1901 186 1966
## 14 Los Angeles Angels 0.498 1961 33 1966
## 13 Kansas City Royals 0.482 1969 24 1973
## WinPer2009 WinPer5Yr WinPer10Yr PostWins5Yr PostWins10Yr AllStar09
## 4 0.586 0.576 0.568 17 34 51
## 5 0.516 0.508 0.499 0 6 21
## 15 0.586 0.500 0.523 8 9 33
## 20 0.463 0.503 0.550 3 11 14
## 14 0.599 0.586 0.555 6 21 23
## 13 0.401 0.404 0.415 0 0 3
## Pop09 OtherFranchises Rev09 modfit modlwr modupr
## 4 4.52 8.84 266 256.0 230.3 281.7
## 5 9.57 5.22 246 178.3 158.8 197.7
## 15 12.87 5.44 247 256.0 230.3 281.7
## 20 4.27 14.04 155 157.5 137.5 177.6
## 14 12.87 5.44 217 276.8 245.4 308.1
## 13 2.00 4.99 155 175.0 147.4 202.6
MLB_sort2 <- MLB[order(MLB[, 5]), ]
head(MLB_sort2)
## Team AllTimeWinPer YrFounded HOFSeasons StadAge
## 4 Boston Red Sox 0.517 1901 217 1912
## 5 Chicago Cubs 0.514 1876 258 1914
## 15 Los Angeles Dodgers 0.524 1884 230 1962
## 20 Oakland Athletics 0.486 1901 186 1966
## 14 Los Angeles Angels 0.498 1961 33 1966
## 13 Kansas City Royals 0.482 1969 24 1973
## WinPer2009 WinPer5Yr WinPer10Yr PostWins5Yr PostWins10Yr AllStar09
## 4 0.586 0.576 0.568 17 34 51
## 5 0.516 0.508 0.499 0 6 21
## 15 0.586 0.500 0.523 8 9 33
## 20 0.463 0.503 0.550 3 11 14
## 14 0.599 0.586 0.555 6 21 23
## 13 0.401 0.404 0.415 0 0 3
## Pop09 OtherFranchises Rev09 modfit modlwr modupr
## 4 4.52 8.84 266 256.0 230.3 281.7
## 5 9.57 5.22 246 178.3 158.8 197.7
## 15 12.87 5.44 247 256.0 230.3 281.7
## 20 4.27 14.04 155 157.5 137.5 177.6
## 14 12.87 5.44 217 276.8 245.4 308.1
## 13 2.00 4.99 155 175.0 147.4 202.6
# Either method seems to work equally well.
# matlines(MLB$WinPer2009,pc,lty=c(1,3,3),col=c('red','black','black'),lwd=c(3,2,2))
# # matlines(XVALUES,YAVLUES,lty=LineType,col=Colors,lwd=LineWidth); but
# is sensitive to the order of data I looked up sorting in the first place
# trying to get matlines to work.