Dalgaard Chapter 11 & 12 Functions – Alex Crawford

Basic Equation for Multiple Regression

y = β0 +β1x1 +···+βkxk +ε

11.1 Plotting Multi-Variate Data

Load Data

I'll be using the MLB 2009 revenues dataset, which I compiled back in 2011.

MLB <- read.csv("/Users/telekineticturtle/Desktop/Colorado 13/Quant Methods/Data/MLB_Revenue_2009.csv", 
    header = T, as.is = T)
summary(MLB$Rev09)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     144     164     184     197     200     441

# Who the hell made $441 million?  Oh right, the Yankees.  You know what?
# I hate the Yankees.  Damn Outliers.  Kill 'em.
MLB <- MLB[c(1:18, 20:30), ]
summary(MLB$Rev09)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     144     163     183     188     195     268

# Not surprisingly, that looks better.

Generating Pair-wise plots

# par() sets graphical parameters.  We've used this for nfrow= before.
# Use mex= sets the interline difference in the margins
par(mex = 0.5)
# pairs() plots every variable by every other variable in a big matrix.
# Use gap=0 to remove the gaps between plots.  Use cex.labels=## to set
# the font size.
pairs(MLB[, 2:14], gap = 0, cex.labels = 0.9)

plot of chunk unnamed-chunk-2

# compare pairs() to plot()
plot(MLB[, 2:14], gap = 0, cex.labels = 0.9)  # There is no difference.

plot of chunk unnamed-chunk-3

11.2 Model Specification and Output

Multiple Regression Model

# To make a multiple regression model instead of a simple regression,
# simply use the + sign to add another independent variable.  Pretty
# simple.
names(MLB)

##  [1] "Team"            "AllTimeWinPer"   "YrFounded"      
##  [4] "HOFSeasons"      "StadAge"         "WinPer2009"     
##  [7] "WinPer5Yr"       "WinPer10Yr"      "PostWins5Yr"    
## [10] "PostWins10Yr"    "AllStar09"       "Pop09"          
## [13] "OtherFranchises" "Rev09"

mlb_1 <- lm(Rev09 ~ Pop09 + WinPer5Yr + OtherFranchises + YrFounded + PostWins5Yr + 
    AllStar09 + HOFSeasons + StadAge + AllTimeWinPer, data = MLB)
# There is no immediate reward for making a multiple regression model, so
# it's best to store it as an object.  Then use summary(MODEL) to see the
# details:
summary(mlb_1)

## 
## Call:
## lm(formula = Rev09 ~ Pop09 + WinPer5Yr + OtherFranchises + YrFounded + 
##     PostWins5Yr + AllStar09 + HOFSeasons + StadAge + AllTimeWinPer, 
##     data = MLB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32.16  -7.68  -0.08  11.85  21.87 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     748.7238   605.9166    1.24  0.23164    
## Pop09             3.9959     1.0282    3.89  0.00099 ***
## WinPer5Yr       -96.0501   115.7484   -0.83  0.41695    
## OtherFranchises  -0.2919     1.1645   -0.25  0.80473    
## YrFounded        -0.0719     0.2496   -0.29  0.77649    
## PostWins5Yr       1.8232     0.7901    2.31  0.03244 *  
## AllStar09         1.0504     0.3761    2.79  0.01161 *  
## HOFSeasons       -0.0182     0.1144   -0.16  0.87511    
## StadAge          -0.2585     0.1491   -1.73  0.09919 .  
## AllTimeWinPer   194.1947   251.3811    0.77  0.44932    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 17.2 on 19 degrees of freedom
## Multiple R-squared: 0.832,   Adjusted R-squared: 0.752 
## F-statistic: 10.4 on 9 and 19 DF,  p-value: 1.18e-05

# The summary is exactly like the simple regression model except that
# there are multiple variables to consider.  Each t-test asks whether the
# variable explains some of the variation when holding the other variables
# constant.  But the t tests only say something about what happens if you
# remove one variable and leave in all the others.  In this case,
# population has a significant relationship with revenue.  The teams
# making more money are located in cities with more people.  In
# particular, increasing the population by 1,000 people equates to about
# $5 million dollars more revenue.  Teams with a higher win percentage
# over the past 5 years may also make more money, but that result could
# also be a result of random error.  Percentages in baseball are actually
# reported as decimals, so a 1% increase equates to $2.17 mill dollars
# more revenue.  The presence of other franchises in the same city and the
# year the team was founded do not have a significant impact.  More
# postseason wins over the past 5 years and more all stars in 2009
# increase revenue.  The age of the stadium and historical success by the
# team and players do not have a significant impact.  Notice that in
# multiple regression, the difference between R-squared and adj. R-squared
# becomes greater since there is a penalty for adding more variables.
plot(mlb_1$residuals ~ mlb_1$fitted.values, main = "Residuals of Full MLB Model", 
    ylab = "Residuals", xlab = "Fitted Values", col = "red")
abline(lm(mlb_1$residuals ~ mlb_1$fitted.values), col = "maroon", lwd = 2)

plot of chunk unnamed-chunk-4

# Our model is overpredicting more poorly than underpredicting, but the
# average is 0 at least.

ANOVA

anova(mlb_1)

## Analysis of Variance Table
## 
## Response: Rev09
##                 Df Sum Sq Mean Sq F value  Pr(>F)    
## Pop09            1  17061   17061   57.54 3.7e-07 ***
## WinPer5Yr        1   2663    2663    8.98  0.0074 ** 
## OtherFranchises  1     26      26    0.09  0.7717    
## YrFounded        1   1625    1625    5.48  0.0303 *  
## PostWins5Yr      1   2072    2072    6.99  0.0160 *  
## AllStar09        1   3195    3195   10.78  0.0039 ** 
## HOFSeasons       1     42      42    0.14  0.7094    
## StadAge          1    961     961    3.24  0.0877 .  
## AllTimeWinPer    1    177     177    0.60  0.4493    
## Residuals       19   5633     296                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# 'These tests are successive; they correspond to (reading upward from the
# bottom) a stepwise removal of terms from the model until finally only
# the first is left.'  And 5-yr win percentage is now significant, as is
# the year founded.  In other words, order matters in an anova.  And you
# will see different test statistics, so it's worth looking at both.
# Essentially, the ANOVA is saying that it's worth adding WinPer5Yr,
# YrFounded, PostWins5Yr, and AllStar09 but not the others.  But do the
# insignificant variables together contribute something?  Answer this by
# making a full and reduced model and comparing them.
mlb_2 <- lm(Rev09 ~ Pop09 + AllStar09 + PostWins5Yr, data = MLB)  # Reduced model with only the three best.
anova(mlb_1, mlb_2)  # Put the full model first, and then the reduced or vice-versa, it doesn't matter.

## Analysis of Variance Table
## 
## Model 1: Rev09 ~ Pop09 + WinPer5Yr + OtherFranchises + YrFounded + PostWins5Yr + 
##     AllStar09 + HOFSeasons + StadAge + AllTimeWinPer
## Model 2: Rev09 ~ Pop09 + AllStar09 + PostWins5Yr
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1     19 5633                         
## 2     25 7644 -6     -2010 1.13   0.38

anova(mlb_2, mlb_1)

## Analysis of Variance Table
## 
## Model 1: Rev09 ~ Pop09 + AllStar09 + PostWins5Yr
## Model 2: Rev09 ~ Pop09 + WinPer5Yr + OtherFranchises + YrFounded + PostWins5Yr + 
##     AllStar09 + HOFSeasons + StadAge + AllTimeWinPer
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1     25 7644                         
## 2     19 5633  6      2010 1.13   0.38

# This shows that there is no significant difference in explanatory power
# by adding the extra variables, despite the ANOVA table revealing some as
# significant!  Statistics are weird.  This begs the question, though.
# Why did WinPer5Yr show up as significant in the ANOVA?  Is it just
# because of the order in the model?

par(mfrow = c(1, 2))
plot(mlb_1$residuals ~ mlb_1$fitted.values, main = "Residuals of Full MLB Model", 
    ylab = "Residuals", xlab = "Fitted Values", col = "red")
abline(lm(mlb_1$residuals ~ mlb_1$fitted.values), col = "maroon", lwd = 2)
plot(mlb_2$residuals ~ mlb_2$fitted.values, main = "Residuals of Reduced MLB Model", 
    ylab = "Residuals", xlab = "Fitted Values", col = "blue")
abline(lm(mlb_2$residuals ~ mlb_2$fitted.values), col = "navy", lwd = 2)

plot of chunk unnamed-chunk-6

11.3 Model Search

Find a Better Model by Eliminating Variables Manually

# To do this, just hard-code a bunch of models with different sets of
# variables.
summary(lm(Rev09 ~ AllStar09 + PostWins5Yr + Pop09 + WinPer10Yr, data = MLB))

## 
## Call:
## lm(formula = Rev09 ~ AllStar09 + PostWins5Yr + Pop09 + WinPer10Yr, 
##     data = MLB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30.90  -8.84   2.90  10.35  44.80 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  145.126     43.463    3.34  0.00274 ** 
## AllStar09      1.194      0.389    3.07  0.00528 ** 
## PostWins5Yr    1.618      0.698    2.32  0.02924 *  
## Pop09          4.089      0.958    4.27  0.00027 ***
## WinPer10Yr   -16.301     95.078   -0.17  0.86530    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 17.8 on 24 degrees of freedom
## Multiple R-squared: 0.772,   Adjusted R-squared: 0.734 
## F-statistic: 20.3 on 4 and 24 DF,  p-value: 2.05e-07

summary(lm(Rev09 ~ AllStar09 + PostWins5Yr + Pop09 + WinPer5Yr, data = MLB))

## 
## Call:
## lm(formula = Rev09 ~ AllStar09 + PostWins5Yr + Pop09 + WinPer5Yr, 
##     data = MLB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31.46  -6.45   2.66  10.17  45.62 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  162.591     49.695    3.27  0.00323 ** 
## AllStar09      1.232      0.373    3.30  0.00300 ** 
## PostWins5Yr    1.777      0.771    2.30  0.03019 *  
## Pop09          4.175      0.971    4.30  0.00025 ***
## WinPer5Yr    -55.187    109.462   -0.50  0.61874    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 17.8 on 24 degrees of freedom
## Multiple R-squared: 0.774,   Adjusted R-squared: 0.736 
## F-statistic: 20.5 on 4 and 24 DF,  p-value: 1.83e-07

summary(lm(Rev09 ~ AllStar09 + PostWins5Yr + Pop09, data = MLB))

## 
## Call:
## lm(formula = Rev09 ~ AllStar09 + PostWins5Yr + Pop09, data = MLB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31.04  -8.16   2.49  10.38  44.79 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  137.764      6.582   20.93   <2e-16 ***
## AllStar09      1.165      0.344    3.39   0.0023 ** 
## PostWins5Yr    1.597      0.674    2.37   0.0258 *  
## Pop09          4.073      0.935    4.36   0.0002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 17.5 on 25 degrees of freedom
## Multiple R-squared: 0.772,   Adjusted R-squared: 0.744 
## F-statistic: 28.1 on 3 and 25 DF,  p-value: 3.53e-08

summary(lm(Rev09 ~ AllStar09 + Pop09, data = MLB))

## 
## Call:
## lm(formula = Rev09 ~ AllStar09 + Pop09, data = MLB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -36.06  -9.91   0.11   8.64  37.62 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   137.82       7.14   19.30  < 2e-16 ***
## AllStar09       1.50       0.34    4.42  0.00016 ***
## Pop09           4.08       1.01    4.02  0.00044 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 19 on 26 degrees of freedom
## Multiple R-squared: 0.72,    Adjusted R-squared: 0.699 
## F-statistic: 33.5 on 2 and 26 DF,  p-value: 6.45e-08

summary(lm(Rev09 ~ AllStar09 + PostWins5Yr, data = MLB))

## 
## Call:
## lm(formula = Rev09 ~ AllStar09 + PostWins5Yr, data = MLB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -36.19 -11.94  -4.57   8.96  61.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  146.938      8.111   18.12  2.9e-16 ***
## AllStar09      1.801      0.405    4.45  0.00014 ***
## PostWins5Yr    1.604      0.877    1.83  0.07875 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 22.7 on 26 degrees of freedom
## Multiple R-squared: 0.598,   Adjusted R-squared: 0.567 
## F-statistic: 19.3 on 2 and 26 DF,  p-value: 7.15e-06

summary(lm(Rev09 ~ AllStar09 + PostWins10Yr + Pop09, data = MLB))

## 
## Call:
## lm(formula = Rev09 ~ AllStar09 + PostWins10Yr + Pop09, data = MLB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29.58 -11.97   0.86  11.83  40.51 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   137.354      6.975   19.69  < 2e-16 ***
## AllStar09       1.155      0.402    2.87  0.00817 ** 
## PostWins10Yr    0.743      0.488    1.52  0.14061    
## Pop09           4.120      0.990    4.16  0.00033 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 18.5 on 25 degrees of freedom
## Multiple R-squared: 0.744,   Adjusted R-squared: 0.713 
## F-statistic: 24.2 on 3 and 25 DF,  p-value: 1.44e-07

*Find a Better Model using regsubsets()

library(leaps)
regs_1 <- regsubsets(Rev09 ~ ., data = MLB[2:14], nbest = 1, nvmax = 10)
summary(regs_1)

## Subset selection object
## Call: regsubsets.formula(Rev09 ~ ., data = MLB[2:14], nbest = 1, nvmax = 10)
## 12 Variables  (and intercept)
##                 Forced in Forced out
## AllTimeWinPer       FALSE      FALSE
## YrFounded           FALSE      FALSE
## HOFSeasons          FALSE      FALSE
## StadAge             FALSE      FALSE
## WinPer2009          FALSE      FALSE
## WinPer5Yr           FALSE      FALSE
## WinPer10Yr          FALSE      FALSE
## PostWins5Yr         FALSE      FALSE
## PostWins10Yr        FALSE      FALSE
## AllStar09           FALSE      FALSE
## Pop09               FALSE      FALSE
## OtherFranchises     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           AllTimeWinPer YrFounded HOFSeasons StadAge WinPer2009 WinPer5Yr
## 1  ( 1 )  " "           " "       " "        " "     " "        " "      
## 2  ( 1 )  " "           " "       " "        " "     " "        " "      
## 3  ( 1 )  " "           " "       " "        " "     " "        " "      
## 4  ( 1 )  " "           " "       " "        "*"     " "        " "      
## 5  ( 1 )  "*"           " "       " "        "*"     " "        " "      
## 6  ( 1 )  "*"           " "       " "        "*"     " "        " "      
## 7  ( 1 )  "*"           "*"       " "        "*"     " "        " "      
## 8  ( 1 )  "*"           "*"       " "        "*"     " "        " "      
## 9  ( 1 )  "*"           "*"       " "        "*"     " "        "*"      
## 10  ( 1 ) "*"           "*"       "*"        "*"     " "        "*"      
##           WinPer10Yr PostWins5Yr PostWins10Yr AllStar09 Pop09
## 1  ( 1 )  " "        " "         " "          "*"       " "  
## 2  ( 1 )  " "        " "         " "          "*"       "*"  
## 3  ( 1 )  " "        "*"         " "          "*"       "*"  
## 4  ( 1 )  " "        "*"         " "          "*"       "*"  
## 5  ( 1 )  " "        "*"         " "          "*"       "*"  
## 6  ( 1 )  "*"        "*"         " "          "*"       "*"  
## 7  ( 1 )  "*"        "*"         " "          "*"       "*"  
## 8  ( 1 )  "*"        "*"         "*"          "*"       "*"  
## 9  ( 1 )  "*"        "*"         "*"          "*"       "*"  
## 10  ( 1 ) "*"        "*"         "*"          "*"       "*"  
##           OtherFranchises
## 1  ( 1 )  " "            
## 2  ( 1 )  " "            
## 3  ( 1 )  " "            
## 4  ( 1 )  " "            
## 5  ( 1 )  " "            
## 6  ( 1 )  " "            
## 7  ( 1 )  " "            
## 8  ( 1 )  " "            
## 9  ( 1 )  " "            
## 10  ( 1 ) " "

mlb1 <- lm(Rev09 ~ AllStar09, data = MLB)
mlb2 <- lm(Rev09 ~ AllStar09 + Pop09, data = MLB)
mlb3 <- lm(Rev09 ~ AllStar09 + Pop09 + PostWins5Yr, data = MLB)
mlb4 <- lm(Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge, data = MLB)
mlb5 <- lm(Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge + AllTimeWinPer, 
    data = MLB)
mlb6 <- lm(Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge + AllTimeWinPer + 
    WinPer10Yr, data = MLB)

# Comparison:
anova(mlb1, mlb2, mlb3, mlb4, mlb5, mlb6)

## Analysis of Variance Table
## 
## Model 1: Rev09 ~ AllStar09
## Model 2: Rev09 ~ AllStar09 + Pop09
## Model 3: Rev09 ~ AllStar09 + Pop09 + PostWins5Yr
## Model 4: Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge
## Model 5: Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge + AllTimeWinPer
## Model 6: Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge + AllTimeWinPer + 
##     WinPer10Yr
##   Res.Df   RSS Df Sum of Sq     F Pr(>F)    
## 1     27 15179                              
## 2     26  9361  1      5818 22.38 0.0001 ***
## 3     25  7644  1      1717  6.61 0.0175 *  
## 4     24  6453  1      1190  4.58 0.0437 *  
## 5     23  6000  1       453  1.74 0.2004    
## 6     22  5719  1       281  1.08 0.3099    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Comparing several nested models at once shows that adding variables is
# useful through 4 variables.  With each added variable, less and less
# difference exists between th previous and new models.
summary(mlb4)

## 
## Call:
## lm(formula = Rev09 ~ AllStar09 + Pop09 + PostWins5Yr + StadAge, 
##     data = MLB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28.37  -7.07   1.51  12.69  24.30 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  709.602    271.833    2.61  0.01534 *  
## AllStar09      1.015      0.330    3.08  0.00517 ** 
## Pop09          3.956      0.879    4.50  0.00015 ***
## PostWins5Yr    1.534      0.633    2.42  0.02323 *  
## StadAge       -0.286      0.136   -2.10  0.04602 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 16.4 on 24 degrees of freedom
## Multiple R-squared: 0.807,   Adjusted R-squared: 0.775 
## F-statistic: 25.1 on 4 and 24 DF,  p-value: 2.84e-08

Conclusion for MLB Stats

par(mfrow = c(1, 1))
plot(MLB$Rev09, mlb4$fitted.values, main = "Best Model for Predicting 2009 MLB Revenue", 
    xlab = "Fitted Values (millions of dollars)", ylab = "Revenue (millions of dollars)")
lines(predict(mlb4) ~ mlb4$fitted.values, col = "red", lwd = 2)

plot of chunk unnamed-chunk-9

MLB$fit <- mlb4$fitted.values
mlb_4 <- lm(Rev09 ~ fit + I(fit^2), data = MLB)
summary(mlb_4)

## 
## Call:
## lm(formula = Rev09 ~ fit + I(fit^2), data = MLB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.89  -6.98   2.25   9.25  27.77 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 180.15960  127.86941    1.41     0.17
## fit          -0.83008    1.28985   -0.64     0.53
## I(fit^2)      0.00452    0.00318    1.42     0.17
## 
## Residual standard error: 15.2 on 26 degrees of freedom
## Multiple R-squared: 0.821,   Adjusted R-squared: 0.807 
## F-statistic: 59.6 on 2 and 26 DF,  p-value: 1.93e-10

plot(MLB$Rev09, MLB$fit, main = "Quadratic Model for Predicting 2009 MLB Revenue", 
    xlab = "Fitted Values (millions of dollars)", ylab = "Revenue (millions of dollars)")
curve(predict(mlb_4, data.frame(fit = x), type = "resp"), add = TRUE, col = "red", 
    lwd = 3)
lines(predict(mlb4) ~ mlb4$fitted.values, col = "blue", lwd = 2)

plot of chunk unnamed-chunk-10

par(mfrow = c(1, 2))
qqnorm(mlb4$residuals, xlab = "Residuals", main = "Best Linear Model QQ Plot")
qqline(mlb4$residuals)
qqnorm(mlb_4$residuals, xlab = "Residuals", main = "Best Quadratic Model QQ Plot")
qqline(mlb_4$residuals)

plot of chunk unnamed-chunk-11

12.1 Polynomial Regression

# Counterinuitive but true, polynomial regression is still linear
# regression.  Add squared or higher order polynomial functions using the
# identity function I().  This 'protects' the polynomials from being
# interpreted in the wrong way.
MLB <- read.csv("/Users/telekineticturtle/Desktop/Colorado 13/Quant Methods/Data/MLB_Revenue_2009.csv", 
    header = T, as.is = T)
names(MLB)

##  [1] "Team"            "AllTimeWinPer"   "YrFounded"      
##  [4] "HOFSeasons"      "StadAge"         "WinPer2009"     
##  [7] "WinPer5Yr"       "WinPer10Yr"      "PostWins5Yr"    
## [10] "PostWins10Yr"    "AllStar09"       "Pop09"          
## [13] "OtherFranchises" "Rev09"

MLB <- MLB[order(MLB[, 6]), ]
head(MLB)

##                    Team AllTimeWinPer YrFounded HOFSeasons StadAge
## 30 Washington Nationals         0.475      1969         26    2008
## 22   Pittsburgh Pirates         0.503      1882        263    2001
## 3     Baltimore Orioles         0.474      1901        174    1992
## 8     Cleveland Indians         0.509      1901        183    1994
## 13   Kansas City Royals         0.482      1969         24    1973
## 1  Arizona Diamondbacks         0.491      1998          1    1998
##    WinPer2009 WinPer5Yr WinPer10Yr PostWins5Yr PostWins10Yr AllStar09
## 30      0.364     0.424      0.439           0            0         7
## 22      0.385     0.409      0.421           0            0         5
## 3       0.395     0.426      0.431           0            0         7
## 8       0.401     0.510      0.504           6            8        10
## 13      0.401     0.404      0.415           0            0         3
## 1       0.432     0.488      0.497           3           14         8
##    Pop09 OtherFranchises Rev09
## 30  5.36            5.60   184
## 22  2.35            8.51   145
## 3   2.67            3.75   171
## 8   2.09            9.58   170
## 13  2.00            4.99   155
## 1   4.28            7.01   172

plot(MLB[, 2:14], gap = 0, cex.labels = 0.9)

plot of chunk unnamed-chunk-12

plot(Rev09 ~ WinPer2009, data = MLB, main = "2009 MLB Team Revenue v. 2009 Win %", 
    xlab = "2009 Win %", ylab = "Revenue (millions USD)", col = "blue")

plot of chunk unnamed-chunk-12

# WinPer2009 seems to have a nonlinear relationship with Rev09.

# Simple Linear Model
mod_wp09_1 <- lm(Rev09 ~ WinPer2009, data = MLB)
summary(mod_wp09_1)  # adj. r^2 = 0.2577

## 
## Call:
## lm(formula = Rev09 ~ WinPer2009, data = MLB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -68.6  -23.4  -15.3   12.6  185.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    -19.9       65.7   -0.30   0.7638   
## WinPer2009     433.1      130.2    3.33   0.0025 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 49.4 on 28 degrees of freedom
## Multiple R-squared: 0.283,   Adjusted R-squared: 0.258 
## F-statistic: 11.1 on 1 and 28 DF,  p-value: 0.00247

plot(Rev09 ~ WinPer2009, data = MLB, main = "2009 MLB Team Revenue v. 2009 Win %", 
    xlab = "2009 Win %", ylab = "Revenue (millions USD)", col = "blue")
lines(predict(mod_wp09_1) ~ MLB$WinPer2009, col = "red", lwd = 2)

plot of chunk unnamed-chunk-13

plot(mod_wp09_1$residuals ~ mod_wp09_1$fitted.values, main = "2009 MLB Team Revenue by 2009 Win % Residuals", 
    xlab = "Predicted 2009 Revenue (millions USD)", ylab = "Residuals")

plot of chunk unnamed-chunk-13

qqnorm(mod_wp09_1$residuals, main = "Q-Q Plot of Residuals mod_wp09_1")
qqline(mod_wp09_1$residuals)  # Residuals not normally distributed.

plot of chunk unnamed-chunk-13

shapiro.test(mod_wp09_1$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  mod_wp09_1$residuals 
## W = 0.8046, p-value = 7.957e-05

# Polynomial Linear Model
mod_wp09_2 <- lm(MLB$Rev09 ~ MLB$WinPer2009 + I(MLB$WinPer2009^2))
summary(mod_wp09_2)  # adj. R^2 = 0.5347

## 
## Call:
## lm(formula = MLB$Rev09 ~ MLB$WinPer2009 + I(MLB$WinPer2009^2))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.75 -22.12  -2.33   5.82 107.35 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1374        336    4.09  0.00035 ***
## MLB$WinPer2009         -5334       1376   -3.88  0.00061 ***
## I(MLB$WinPer2009^2)     5848       1391    4.20  0.00026 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 39.1 on 27 degrees of freedom
## Multiple R-squared: 0.567,   Adjusted R-squared: 0.535 
## F-statistic: 17.7 on 2 and 27 DF,  p-value: 1.24e-05


# Confidence Intervals -- extract the confidence intervals in order to
# plot them on a graph.
confint(mod_wp09_2)

##                       2.5 % 97.5 %
## (Intercept)           685.1   2062
## MLB$WinPer2009      -8156.9  -2511
## I(MLB$WinPer2009^2)  2993.7   8702

pc <- predict(mod_wp09_2, interval = "conf", level = 0.95)
head(pc)

##     fit   lwr   upr
## 1 206.8 160.1 253.5
## 2 186.8 152.3 221.2
## 3 179.1 149.2 209.0
## 4 175.0 147.4 202.6
## 5 175.0 147.4 202.6
## 6 160.7 139.9 181.4

MLB$modfit <- pc[, 1]
MLB$modlwr <- pc[, 2]
MLB$modupr <- pc[, 3]

# Plotting Polynomial Model with confidence intervals
plot(Rev09 ~ WinPer2009, data = MLB, main = "2009 MLB Team Revenue v. 2009 Win %", 
    xlab = "2009 Win %", ylab = "Revenue (millions USD)", col = "blue", cex = 1, 
    pch = 16)
lines(MLB$WinPer2009, MLB$modfit, type = "l", col = "red", lwd = 3)
lines(MLB$WinPer2009, MLB$modlwr, type = "l", col = "black", lty = 3, lwd = 2)
lines(MLB$WinPer2009, MLB$modupr, type = "l", col = "black", lty = 3, lwd = 2)

plot of chunk unnamed-chunk-15

plot(mod_wp09_2$residuals ~ mod_wp09_2$fitted.values, main = "2009 MLB Team Revenue by 2009 Win % Residuals", 
    xlab = "Predicted 2009 Revenue (millions USD)", ylab = "Residuals")

plot of chunk unnamed-chunk-16

qqnorm(mod_wp09_2$residuals, main = "Q-Q Plot of Residuals mod_wp09_2")
qqline(mod_wp09_2$residuals)  # Residuals not normally distributed.

plot of chunk unnamed-chunk-16

shapiro.test(mod_wp09_2$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  mod_wp09_2$residuals 
## W = 0.8843, p-value = 0.003541

12.2 Regression Through the Origin

# You can force the linear model to have an intercept of 0 by writing
# lm(Y~X-1):
mod_wp09_3 <- lm(Rev09 ~ WinPer2009 - 1, data = MLB)
summary(mod_wp09_3)

## 
## Call:
## lm(formula = Rev09 ~ WinPer2009 - 1, data = MLB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -67.5  -24.9  -14.4   11.3  190.4 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## WinPer2009    394.0       17.6    22.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 48.6 on 29 degrees of freedom
## Multiple R-squared: 0.945,   Adjusted R-squared: 0.944 
## F-statistic:  502 on 1 and 29 DF,  p-value: <2e-16

# Note the adj R^2 is now giant.  This makes no sense!  Why?  Because R^2
# is no longer measuring the same thing.  It's now comparing the model to
# y = 0 instead of y = intercept because you've eliminated the intercept,
# but the anova must always compare to a nested, reduced model.
anova(mod_wp09_1)

## Analysis of Variance Table
## 
## Response: Rev09
##            Df Sum Sq Mean Sq F value Pr(>F)   
## WinPer2009  1  26980   26980    11.1 0.0025 **
## Residuals  28  68265    2438                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(mod_wp09_3)

## Analysis of Variance Table
## 
## Response: Rev09
##            Df  Sum Sq Mean Sq F value Pr(>F)    
## WinPer2009  1 1186303 1186303     502 <2e-16 ***
## Residuals  29   68489    2362                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Notice how in the anova tables, the SST has sky-rocketed but the SSE
# hasn't really changed.  It looks like you're explaining a lot more, but
# really, you're just comparing to a different reduced model (that doesn't
# work well!).

12.3 Design Matrices & Dummy Variables

model.matrix(Rev09 ~ WinPer2009 + Pop09, data = MLB)

##    (Intercept) WinPer2009 Pop09
## 30           1      0.364  5.36
## 22           1      0.385  2.35
## 3            1      0.395  2.67
## 8            1      0.401  2.09
## 13           1      0.401  2.00
## 1            1      0.432  4.28
## 18           1      0.432 19.01
## 12           1      0.457  5.73
## 20           1      0.463  4.27
## 23           1      0.463  3.00
## 29           1      0.463  2.50
## 7            1      0.481  2.16
## 6            1      0.488  9.57
## 16           1      0.494  1.55
## 5            1      0.516  9.57
## 27           1      0.519  2.73
## 25           1      0.525  3.34
## 10           1      0.528  4.43
## 2            1      0.531  5.38
## 17           1      0.534  3.23
## 11           1      0.537  5.41
## 28           1      0.537  6.30
## 24           1      0.543  4.27
## 26           1      0.562  2.82
## 9            1      0.568  2.51
## 21           1      0.574  5.84
## 4            1      0.586  4.52
## 15           1      0.586 12.87
## 14           1      0.599 12.87
## 19           1      0.636 19.01
## attr(,"assign")
## [1] 0 1 2

# Multiplying across the table (after weighting each variable by the model
# coefficient) will yield the fitted values.  Works for dummy variables,
# too.

12.4 Linearity Over Groups

Extras

Ordering and Sorting Data

# Two Ways to Sort Data:
MLB_sort <- MLB[do.call(order, data.frame(MLB[, 5])), ]
head(MLB_sort)

##                   Team AllTimeWinPer YrFounded HOFSeasons StadAge
## 4       Boston Red Sox         0.517      1901        217    1912
## 5         Chicago Cubs         0.514      1876        258    1914
## 15 Los Angeles Dodgers         0.524      1884        230    1962
## 20   Oakland Athletics         0.486      1901        186    1966
## 14  Los Angeles Angels         0.498      1961         33    1966
## 13  Kansas City Royals         0.482      1969         24    1973
##    WinPer2009 WinPer5Yr WinPer10Yr PostWins5Yr PostWins10Yr AllStar09
## 4       0.586     0.576      0.568          17           34        51
## 5       0.516     0.508      0.499           0            6        21
## 15      0.586     0.500      0.523           8            9        33
## 20      0.463     0.503      0.550           3           11        14
## 14      0.599     0.586      0.555           6           21        23
## 13      0.401     0.404      0.415           0            0         3
##    Pop09 OtherFranchises Rev09 modfit modlwr modupr
## 4   4.52            8.84   266  256.0  230.3  281.7
## 5   9.57            5.22   246  178.3  158.8  197.7
## 15 12.87            5.44   247  256.0  230.3  281.7
## 20  4.27           14.04   155  157.5  137.5  177.6
## 14 12.87            5.44   217  276.8  245.4  308.1
## 13  2.00            4.99   155  175.0  147.4  202.6

MLB_sort2 <- MLB[order(MLB[, 5]), ]
head(MLB_sort2)

##                   Team AllTimeWinPer YrFounded HOFSeasons StadAge
## 4       Boston Red Sox         0.517      1901        217    1912
## 5         Chicago Cubs         0.514      1876        258    1914
## 15 Los Angeles Dodgers         0.524      1884        230    1962
## 20   Oakland Athletics         0.486      1901        186    1966
## 14  Los Angeles Angels         0.498      1961         33    1966
## 13  Kansas City Royals         0.482      1969         24    1973
##    WinPer2009 WinPer5Yr WinPer10Yr PostWins5Yr PostWins10Yr AllStar09
## 4       0.586     0.576      0.568          17           34        51
## 5       0.516     0.508      0.499           0            6        21
## 15      0.586     0.500      0.523           8            9        33
## 20      0.463     0.503      0.550           3           11        14
## 14      0.599     0.586      0.555           6           21        23
## 13      0.401     0.404      0.415           0            0         3
##    Pop09 OtherFranchises Rev09 modfit modlwr modupr
## 4   4.52            8.84   266  256.0  230.3  281.7
## 5   9.57            5.22   246  178.3  158.8  197.7
## 15 12.87            5.44   247  256.0  230.3  281.7
## 20  4.27           14.04   155  157.5  137.5  177.6
## 14 12.87            5.44   217  276.8  245.4  308.1
## 13  2.00            4.99   155  175.0  147.4  202.6

# Either method seems to work equally well.
# matlines(MLB$WinPer2009,pc,lty=c(1,3,3),col=c('red','black','black'),lwd=c(3,2,2))
# # matlines(XVALUES,YAVLUES,lty=LineType,col=Colors,lwd=LineWidth); but
# is sensitive to the order of data I looked up sorting in the first place
# trying to get matlines to work.