Model Formulae

In many ways the R language is similar to other programming languages (Matlab, Python, Scala) but it has a few unique features due to its focus on statistics. Amongst these, model formulae are a useful shorthand for representing the relationships between data variables - in particular for specifying linear models (e.g. regression and multiple regression) - and they appear in both statistical and plotting expressions.

The simplest form of model is simple linear regression, which relates a predictor variable x to a response variable y:

\( y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i} \)

where \( \beta_{0} \) is the intercept on y, \( \beta_{1} \) is the slope, and \( \epsilon_{i} \) is an error term. Within R this type of model can be fitted (i.e. the best estimates of the coefficients \( \beta_{0} \) and \( \beta_{1} \) are found by the Ordinary Least Squares (OLS) algorithm) using the lm function and a model formula:

fit = lm(y ~ x, data = data)

where data is a data.frame containing columns of x and y. As an example we can use the built-in dataset cars, which records the stopping distances of cars from various speeds (NB the data were recorded in the 1920s):

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
fit = lm(dist ~ speed, data = cars)
fit
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##      -17.58         3.93  
## 
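
For simple regression the OLS estimates can also be computed by hand: the slope is cov(x, y)/var(x) and the intercept is mean(y) minus the slope times mean(x). A quick sketch as a check against the printed coefficients:

b1 = with(cars, cov(speed, dist) / var(speed))  # OLS slope by hand
b0 = with(cars, mean(dist) - b1 * mean(speed))  # OLS intercept by hand
c(b0, b1)  # should match the printed coefficients: -17.58 and 3.93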

The fit object is of class ‘lm’, for which R provides many functions for accessing details of the model. The coefficients printed are \( \beta_{0} \) and \( \beta_{1} \). This can be seen clearly by plotting the data and the model fit. One advantage of the base plot commands over ggplot2 is that they more closely follow the model formulae syntax:

plot(dist ~ speed, data = cars, xlim = c(0, 25), ylim = c(-20, 120))
abline(fit)

plot of chunk carPlot

The abline function recognises the ‘lm’ class of fit and extracts \( \beta_{0} \) and \( \beta_{1} \) to plot the fit line through the data automagically. However there is far more information that can be retrieved even from this simple regression.

summary(fit)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29.07  -9.53  -2.27   9.21  43.20 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.579      6.758   -2.60    0.012 *  
## speed          3.932      0.416    9.46  1.5e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 15.4 on 48 degrees of freedom
## Multiple R-squared: 0.651,   Adjusted R-squared: 0.644 
## F-statistic: 89.6 on 1 and 48 DF,  p-value: 1.49e-12 
## 

Perhaps this is confusing? Let's take it line by line.

  1. The first line is the call:
    This is just an echo of the original function call.

  2. The next part is the quartiles of the residuals:
    The OLS algorithm should produce a fit with residuals that sum to 0 (you can check all the residuals using resid(fit)). If the model assumptions hold, the residuals should be approximately normal and symmetrically distributed around zero. Here 1Q and 3Q are pretty similar but the median is slightly negative. We will look at further regression diagnostics in a moment.

  3. The coefficients:
    The Estimate column is again the OLS best estimate of the intercept \( \beta_{0} \) and slope \( \beta_{1} \). This time however we also get the standard error of each estimate and the t-value and p-value of each coefficient. Roughly speaking the t-value compares the estimate to the null hypothesis that there is no relationship between y~x (i.e. dist~speed), and the p-value is the probability of seeing an estimate at least this extreme if that null hypothesis were true (a quick way to recompute these values yourself is sketched just after this list). For the sake of clarity significant p-values are marked with asterisks.

  4. Residual Standard Error:
    Next comes the standard error of the residuals of the fit. The smaller the better of course.

  5. Coefficient of determination (Multiple and Adjusted R Squared):
    For simple linear regression this is the square of the correlation coefficient, or equivalently the fraction of the variance of y explained by the predictor x; the closer to 1, the closer to a perfect linear relationship. For multiple regression it is the fraction of variance explained jointly by the predictors u, v, w etc. (see below). For multiple regression the adjusted R-squared is a better measure as it penalises the number of predictors.

  6. F-statistic:
    The F-statistic measures the significance of the whole model. For simple regression the p-value significance of F is just the same as that of the t-value. However for multiple linear regression considered below the significance of the whole model (F) will be different to the constituent significance (t) of any single predictor variable.
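
As promised, here is a quick sketch of the checks described in points 2, 3 and 5 above: the residuals should sum to (almost) zero, each t value is simply the estimate divided by its standard error, and for simple regression the R-squared is the squared correlation between predictor and response.

sum(resid(fit))  # point 2: effectively zero, up to rounding error
coefs = summary(fit)$coefficients  # point 3: the coefficient table as a matrix
coefs[, "Estimate"] / coefs[, "Std. Error"]  # reproduces the t value column
cor(cars$speed, cars$dist)^2  # point 5: ~0.651, the Multiple R-squared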

Exercises 5A

  1. Find the commands to extract a) the residuals and b) the coefficients from the fit.
  2. What is the equation of the fit given those coefficients?

Analysis of Variance: ANOVA

Beginners are often confused by tutorials or discussions of ANOVA in R. This is because people are often talking at cross purposes about one of a number of different but related operations. To a lot of general scientists ‘the ANOVA’ is simply a test of whether the means of a number of different groups are significantly different - and in R this is:
aov(vector~factor.groups)

To statisticians however ANOVA is a more general method

“in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation” (description from Wikipedia!).

The ANOVA to statisticians then generally means a comparison of different models, usually to select the better one, and in R this is

anova(lm.fit1, lm.fit2) NB it’s a different function!

Actually the familiar ANOVA of comparing means (aov) is just the simplest form: it compares the global-mean model with the group-means model and sees which has the smaller residual sum of squares (RSS). The confusion is sometimes compounded as R, unlike some other statistical software, does not have functions for particular named forms of ANOVA such as ANCOVA, MANOVA, or even MANCOVA. These are still just linear models that are handled by lm and described using model formulae, as the illustrative formulae below show. Phewww… I thought it was best to explain that up-front. Hopefully these examples will make all the above a bit clearer.
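
For illustration only (using the built-in iris data, which is not otherwise part of this tutorial), each of those named analyses is just an lm model formula:

lm(Sepal.Length ~ Species)  # a one-way ANOVA: a single factor
lm(Sepal.Length ~ Species + Petal.Length)  # 'ANCOVA': a factor plus a continuous covariate
lm(cbind(Sepal.Length, Sepal.Width) ~ Species)  # 'MANOVA': multiple responses via cbind()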

The aov() function can be used to perform the simple ANOVA that compares the means of many groups. In the following example count is the number of insects in an agricultural experimental unit treated with different insecticide sprays, from the R built-in InsectSprays dataset.

aov(count ~ spray, InsectSprays)
## Call:
##    aov(formula = count ~ spray, data = InsectSprays)
## 
## Terms:
##                 spray Residuals
## Sum of Squares   2669      1015
## Deg. of Freedom     5        66
## 
## Residual standard error: 3.922 
## Estimated effects may be unbalanced
anova(aov(count ~ spray, InsectSprays))
## Analysis of Variance Table
## 
## Response: count
##           Df Sum Sq Mean Sq F value Pr(>F)    
## spray      5   2669     534    34.7 <2e-16 ***
## Residuals 66   1015      15                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
anova(lm(count ~ spray, InsectSprays))
## Analysis of Variance Table
## 
## Response: count
##           Df Sum Sq Mean Sq F value Pr(>F)    
## spray      5   2669     534    34.7 <2e-16 ***
## Residuals 66   1015      15                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
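
That global-mean versus group-means comparison can also be written out explicitly as two lm fits (a minimal sketch; the fit names are my own):

fit.null = lm(count ~ 1, data = InsectSprays)  # a single global mean
fit.means = lm(count ~ spray, data = InsectSprays)  # one mean per spray group
anova(fit.null, fit.means)  # reproduces the F test in the table above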

Exercises 5B

  1. The anova here gives you the RSS or Residual Sum of Squares. Can you calculate it yourself from the fit?

Really though… aov is just a wrapper around the lm command; the results are simply printed slightly differently. The anova function takes the fit and tests whether the RSS is significantly smaller when each group is fitted with its own mean than when a single global mean is used. I'm tempted to say “Seemples” here but I hate people who say that. For our purposes though this is perhaps not so great: although the p-value clearly indicates that the groups are better fitted with their own means, it doesn't tell us the relationship of the groups to each other. Actually a summary of lm is better for this.

# The Estimate column is the mean difference to the reference group sprayA
summary(lm(count ~ spray, InsectSprays))
## 
## Call:
## lm(formula = count ~ spray, data = InsectSprays)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8.33  -1.96  -0.50   1.67   9.33 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   14.500      1.132   12.81  < 2e-16 ***
## sprayB         0.833      1.601    0.52     0.60    
## sprayC       -12.417      1.601   -7.76  7.3e-11 ***
## sprayD        -9.583      1.601   -5.99  9.8e-08 ***
## sprayE       -11.000      1.601   -6.87  2.8e-09 ***
## sprayF         2.167      1.601    1.35     0.18    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 3.92 on 66 degrees of freedom
## Multiple R-squared: 0.724,   Adjusted R-squared: 0.704 
## F-statistic: 34.7 on 5 and 66 DF,  p-value: <2e-16 
## 

Really we mention aov here not because it is a particularly useful command for fitting models - it's redundant given lm (anova is useful for comparing models, as described above) - but because newcomers always look for it. Actually for this simple example of comparing many means there is a better method than aov, anova, or lm: Tukey's honest significant differences. This can be performed using the TukeyHSD() function:

tk = TukeyHSD(aov(count ~ spray, InsectSprays))
tk
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = count ~ spray, data = InsectSprays)
## 
## $spray
##         diff     lwr    upr  p adj
## B-A   0.8333  -3.866  5.533 0.9952
## C-A -12.4167 -17.116 -7.717 0.0000
## D-A  -9.5833 -14.283 -4.884 0.0000
## E-A -11.0000 -15.699 -6.301 0.0000
## F-A   2.1667  -2.533  6.866 0.7542
## C-B -13.2500 -17.949 -8.551 0.0000
## D-B -10.4167 -15.116 -5.717 0.0000
## E-B -11.8333 -16.533 -7.134 0.0000
## F-B   1.3333  -3.366  6.033 0.9603
## D-C   2.8333  -1.866  7.533 0.4921
## E-C   1.4167  -3.283  6.116 0.9489
## F-C  14.5833   9.884 19.283 0.0000
## E-D  -1.4167  -6.116  3.283 0.9489
## F-D  11.7500   7.051 16.449 0.0000
## F-E  13.1667   8.467 17.866 0.0000
## 
plot(tk)

plot of chunk tukeyHSD

The table here shows all the pairwise comparisons between the different sprays: the mean difference, lower and upper confidence bounds, and crucially a p-value adjusted for the multiplicity of pairwise tests. Rather wonderfully this TukeyHSD table can be plotted automatically - which is what most people were probably looking for.

Multiple Linear Regression

When we have a dataset with more than one predictor variable the simple model can be expanded to multiple linear regression, which is written:

\( y_{i}=\beta_{0}+\beta_{1}u_{i}+\beta_{2}v_{i}+\beta_{3}w_{i}...+\epsilon_{i} \)

or in R: fit = lm(y ~ u + v + w, data = data)

where data is now a data.frame containing columns of u, v, w and the response variable y. We can examine this using the tips dataset in the reshape package, in which one waiter recorded information about each tip he received over a period of a few months working in one restaurant. The recorded variables are:

  1. tip in dollars,
  2. bill in dollars,
  3. sex of the bill payer,
  4. whether there were smokers in the party,
  5. day of the week,
  6. time of day,
  7. size of the party.

Using lm we can try to fit all of the other variables against the tip in dollars (the `.` in the formula below is shorthand for 'all the other columns in the data'):

library(reshape)
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object(s) are masked from 'package:plyr':
## 
## rename, round_any
fit.all = lm(tip ~ ., data = tips)
summary(fit.all)
## 
## Call:
## lm(formula = tip ~ ., data = tips)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.848 -0.573 -0.103  0.476  4.108 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.8038     0.3527    2.28    0.024 *  
## total_bill    0.0945     0.0096    9.84   <2e-16 ***
## sexMale      -0.0324     0.1416   -0.23    0.819    
## smokerYes    -0.0864     0.1466   -0.59    0.556    
## daySat       -0.1215     0.3097   -0.39    0.695    
## daySun       -0.0255     0.3213   -0.08    0.937    
## dayThur      -0.1623     0.3934   -0.41    0.680    
## timeLunch     0.0681     0.4446    0.15    0.878    
## size          0.1760     0.0895    1.97    0.051 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1.02 on 235 degrees of freedom
## Multiple R-squared: 0.47,    Adjusted R-squared: 0.452 
## F-statistic: 26.1 on 8 and 235 DF,  p-value: <2e-16 
## 

There is a lot of information in this summary - but the clear message from the p-values of the coefficients is that only total_bill and possibly size (of the party) are significantly related to the tip. One way to confirm this is to use the step function.

Consider an idealised example first. When given a linear model of the form:

fit1.all = lm(y ~ u + v + w, data = data)
step(fit1.all, direction = "backward")

The step function will begin with the full model and assess the goodness of fit using the AIC metric (yeah, it's complicated). It then removes the least useful variable and rechecks the AIC. If the overall fit of the model is improved (the AIC is smaller) it will try removing another variable, and so on; if the AIC gets larger (i.e. the variable was useful) then it stops. The step procedure output for all predictors in the tips model is very verbose, so we will take it as read that only two predictors were useful: total_bill and size. Instead we will consider whether an interaction term might be useful:

\( y_{i}=\beta_{0}+\beta_{1}total_{i}+\beta_{2}size_{i}+\beta_{3}total_{i}size_{i}+\epsilon_{i} \)

fit.int = lm(tip ~ total_bill * size, data = tips)
step(fit.int, direction = "backward")
## Start:  AIC=11.24
## tip ~ total_bill * size
## 
##                   Df Sum of Sq RSS   AIC
## - total_bill:size  1     0.288 248  9.53
## <none>                         247 11.24
## 
## Step:  AIC=9.53
## tip ~ total_bill + size
## 
##              Df Sum of Sq RSS  AIC
## <none>                    248  9.5
## - size        1       5.2 253 12.6
## - total_bill  1     106.3 354 94.7
## 
## Call:
## lm(formula = tip ~ total_bill + size, data = tips)
## 
## Coefficients:
## (Intercept)   total_bill         size  
##      0.6689       0.0927       0.1926  
## 

This output simply means that the full model with all terms had an AIC (or goodness of fit) score of 11.24, and by removing the interaction term we get a better fit of 9.53; but if size or total_bill is removed completely the AIC score gets much worse. So the ‘best’ model is tip~total_bill+size, which is fairly intuitive.
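
For the record, the verbose backward elimination over all the predictors that we skipped earlier can be run quietly; a minimal sketch (trace = 0 simply suppresses the step-by-step printout):

fit.step = step(fit.all, direction = "backward", trace = 0)
formula(fit.step)  # should come out as tip ~ total_bill + size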

I said in the previous section that we would revisit the anova function, which we can also use for comparing models fitted to the same data. NB one model needs to be nested within (a subset of) the other; models with mutually distinct predictors can't be meaningfully compared, e.g.

anova( lm(y~a+b+c), lm(y~a+b) ) # this works
anova( lm(y~a+b+c), lm(y~c+d) ) # this doesn't (the models are not nested)

fit.best = lm(tip ~ total_bill + size, data = tips)
anova(fit.int, fit.best)
## Analysis of Variance Table
## 
## Model 1: tip ~ total_bill * size
## Model 2: tip ~ total_bill + size
##   Res.Df RSS Df Sum of Sq    F Pr(>F)
## 1    240 247                         
## 2    241 248 -1    -0.288 0.28    0.6

If we compare these two models (with and without the interaction), perhaps surprisingly the anova comparison does not find them significantly different. The step function, using the AIC score for comparison, does however select the simpler model. So should you - because whilst they do an equally good job of fitting the data it is always best to fit the simplest model, and this is exactly what the AIC metric in the step command captures!
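
As a quick check of that claim we can also compare the AIC of the two fits directly. NB the absolute numbers differ from those printed by step, which drops a constant term, but the ranking is the same:

AIC(fit.int, fit.best)  # the simpler model should have the lower AIC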

Exercises 5C

  1. Plot the residuals of the fits as points, then in a histogram. Do they look OK? How should they look?
  2. Load (install if necessary) the coefplot library. Compare the coefplot of the fits to the summary of the fits.

Regression Diagnostics

With multiple regression we cannot always simply plot the fit against the data and judge it by eye, so regression diagnostics become more useful. Fortunately with an lm fit this is easy: we simply pass the fit to the plot function:

par(mfrow = c(2, 2))
# plot has a plot.lm method that plots regression diagnostics
# automatically
plot(fit.best)

plot of chunk regressionDiag


# or alternatively there is a long-winded ggplot2 method that may be
# instructive. fortify() turns the lm fit into a dataframe with columns for
# the regression diagnostics.
library(ggplot2)
library(grid)
mod = fortify(fit.best)
# then you can make a qplot of each
topleft = qplot(.fitted, .stdresid, data = mod) + geom_hline(yintercept = 0) + 
    geom_smooth(se = FALSE)
topright = qplot(sample = .stdresid, data = mod, stat = "qq") + geom_abline()
bottomleft = qplot(.fitted, sqrt(abs(.stdresid)), data = mod) + geom_smooth(se = FALSE)
bottomright = qplot(.hat, .stdresid, data = mod) + geom_smooth(se = FALSE)
#  We use the grid graphics to plot them together on the same page
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))
print(topleft, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(topright, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
print(bottomleft, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
print(bottomright, vp = viewport(layout.pos.row = 2, layout.pos.col = 2))

plot of chunk regressionDiag

The top-left plot shows the residuals against the fitted values, the top-right is the Q-Q plot checking the normality of the residuals, bottom-left is the scale-location plot (the square root of the absolute standardised residuals against the fitted values), and bottom-right is the leverage plot, which shows the influence (or leverage) of points and highlights any that exceed the Cook's distance threshold, i.e. points that are too influential. In each plot a few outliers are marked (rows 171, 173, and 213) but overall this is a good regression fit. This is of necessity a fairly brief and sketchy explanation; a fuller account of the regression diagnostic plots is given in Section 7 of Faraway's book, Practical Regression and ANOVA using R.

Exercises 5D

  1. There are actually 6 plot.lm diagnostics but only 4 are shown by default. Plot the missing 4th and 6th regression diagnostics.

Generalised Linear Models

In addition to standard linear regression R has many tools for modeling other types of response variable such as binomial, Poisson etc. The simplest way is perhaps to use the ‘generalised linear models’ or glm() function which closely follows the syntax of lm. Here we will use a popular, if rather macabre, dataset on Titanic passengers compiled by Thomas Cason to demonstrate binomial regression.

load("titanic3.sav")
colnames(titanic3)
##  [1] "pclass"    "survived"  "name"      "sex"       "age"      
##  [6] "sibsp"     "parch"     "ticket"    "fare"      "cabin"    
## [11] "embarked"  "boat"      "body"      "home.dest"

The variables in the titanic3 data.frame are listed above. Essentially we are interested in modelling a passenger's chance of survival based on the other relevant variables.

fit.l = glm(survived ~ pclass + sex + age + sibsp + parch, data = titanic3, 
    family = binomial)
summary(fit.l)
## 
## Call:
## glm(formula = survived ~ pclass + sex + age + sibsp + parch, 
##     family = binomial, data = titanic3)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.702  -0.661  -0.420   0.666   2.520  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.90679    0.36192   10.79  < 2e-16 ***
## pclass2nd   -1.36676    0.23004   -5.94  2.8e-09 ***
## pclass3rd   -2.35202    0.22880  -10.28  < 2e-16 ***
## sexmale     -2.55686    0.17327  -14.76  < 2e-16 ***
## age         -0.03949    0.00663   -5.95  2.6e-09 ***
## sibsp       -0.35291    0.10535   -3.35  0.00081 ***
## parch        0.07436    0.09991    0.74  0.45670    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1414.62  on 1045  degrees of freedom
## Residual deviance:  970.12  on 1039  degrees of freedom
##   (263 observations deleted due to missingness)
## AIC: 984.1
## 
## Number of Fisher Scoring iterations: 4
## 

All of these variables except the number of parents/children (parch) significantly affected the chance of passenger survival (you can check this yourself using the step function). Of course with a model of this kind it is perhaps difficult to conceptualise its meaning. It is therefore very helpful to plot the data, with fits, and where possible to separate different factors into separate panels. Below for example we use the qplot function to facet the data by the most important predictor variables and fit a binomial model to each subset.

# the ggplot2 glm fit is nice as it automatically includes a plot of the
# confidence intervals about the fit
qplot(x = age, y = survived, facets = sex ~ pclass, data = titanic3) + 
    stat_smooth(method = "glm", family = "binomial")

plot of chunk titanicPlot

Remember the ggplot2 package has its own commands, stat_smooth() and geom_smooth(), for adding model fits and confidence intervals. Previously we have used the default loess smooth and an lm fit, but here we use it to fit a binomial regression - albeit not modelled on all variables. Nonetheless this makes the message a lot clearer: best to be a 1st class woman passenger.

Exercises 5E

  1. That last plot has 2 rows by 3 columns. Can you make it the other way round, simply?

Prediction

Given a model such as that above, with strong predictor variables, we are able to make predictions for new data. For instance we can invent a series of passengers with novel permutations of the predictor variables and then estimate their chance of survival. The same method can be used for lm and many other types of R model of the response~predictors form.

# refit the model but remove the unhelpful parch predictor
fit.l = glm(survived ~ pclass + sex + age + sibsp, data = titanic3, 
    family = binomial)
#  create new permutations of data based on random sampling
attach(titanic3)
pclass.new = sample(levels(pclass), 10, replace = T)
sex.new = sample(levels(sex), 10, replace = T)
age.new = sample(age, 10, replace = T)
sibsp.new = sample(sibsp, 10, replace = T)
newdata = data.frame(pclass = pclass.new, sex = sex.new, age = age.new, 
    sibsp = sibsp.new)
# of course if you repeat this your random data will look different to
# mine
newdata
##    pclass    sex age sibsp
## 1     2nd female  NA     0
## 2     2nd   male  32     0
## 3     3rd   male  31     0
## 4     3rd   male  18     0
## 5     2nd   male  48     0
## 6     3rd female  18     0
## 7     2nd   male  60     1
## 8     1st female  NA     0
## 9     2nd   male   2     0
## 10    2nd female  22     1
preds = predict(fit.l, newdata = newdata)
cbind(preds, newdata)
##       preds pclass    sex age sibsp
## 1        NA    2nd female  NA     0
## 2  -1.27122    2nd   male  32     0
## 3  -2.21294    3rd   male  31     0
## 4  -1.69707    3rd   male  18     0
## 5  -1.90614    2nd   male  48     0
## 6   0.88315    3rd female  18     0
## 7  -2.71153    2nd   male  60     1
## 8        NA    1st female  NA     0
## 9  -0.08075    2nd   male   2     0
## 10  1.37662    2nd female  22     1

I have printed the predictions in a column alongside the new data using cbind for the sake of clarity. The first thing to note is that there are NAs where our newdata contained randomly sampled NA predictors. Second, the prediction column values are perhaps not what you were expecting; indeed it flummoxed me for a minute. It's not the probability of survival, it's the log-odds of the probability:

\( log(\frac{p}{1-p}) \)

We want the probability of survival p.
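
Equivalently you could convert the log-odds by hand with the inverse logit function, which R provides as plogis(); a quick check using the preds computed above:

plogis(preds)  # exp(preds) / (1 + exp(preds)): the predicted survival probabilities

More simply, predict can return probabilities directly if we ask for the response scale: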

preds = predict(fit.l, newdata = newdata, type = "response")
cbind(preds, newdata)
##      preds pclass    sex age sibsp
## 1       NA    2nd female  NA     0
## 2  0.21905    2nd   male  32     0
## 3  0.09859    3rd   male  31     0
## 4  0.15485    3rd   male  18     0
## 5  0.12941    2nd   male  48     0
## 6  0.70748    3rd female  18     0
## 7  0.06230    2nd   male  60     1
## 8       NA    1st female  NA     0
## 9  0.47982    2nd   male   2     0
## 10 0.79845    2nd female  22     1

Hopefully if you look back at our earlier qplots each of these new random cases should fall within the range of the plots.

Exercises 5F

  1. Try to produce a deliberately poor model using poor predictor variables. Use coefplot to examine the model. Test the prediction using newdata and compare it to the faceted titanic plots above.

Polynomial Regression

We have looked at multiple regression with many predictor variables, then at the glm extension for non-normal response variables (binomial, Poisson, Gamma). In this now somewhat lengthy tutorial we will look at one other model type: polynomial predictors. This is an obvious choice when we plot some data y~x and it's clearly not well fit by a straight line. The polynomial model is a simple extension of linear regression to a multiple linear regression with as many nth-order polynomial terms of x as necessary to make a good fit.

\( y_{i}=\beta_{0}+\beta_{1}x_{i}+\beta_{2}x^2_{i}+\beta_{3}x^3_{i}...+\beta_{n}x^n_{i}+\epsilon_{i} \)

or

lm(y ~ poly(x, 2)) # this adds polynomial terms of x up to the 2nd degree
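
For instance, with the mtcars data used below, a quick sketch of a 2nd degree polynomial fit (note that poly() uses orthogonal polynomial terms by default):

fit.poly2 = lm(mpg ~ poly(disp, 2), data = mtcars)
coef(fit.poly2)  # an intercept plus one coefficient per polynomial term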

We'll use ggplot2 again. Although it's a bit more long-winded than plot for this example, it's good practice and looks nicer.

g = ggplot(mtcars, aes(disp, mpg))
g = g + geom_point()

p1 = g + geom_smooth(method = "lm", formula = y ~ poly(x, 1), se = F) + 
    ggtitle("poly degree = 1")
p2 = g + geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = F) + 
    ggtitle("poly degree = 2")
p3 = g + geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = F) + 
    ggtitle("poly degree = 3")
p4 = g + geom_smooth(method = "lm", formula = y ~ poly(x, 4), se = F) + 
    ggtitle("poly degree = 4")

grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
print(p3, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
print(p4, vp = viewport(layout.pos.row = 2, layout.pos.col = 2))

plot of chunk mtcars

The mtcars data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Above we have plotted engine displacement vs miles per gallon and fitted 4 different polynomial models (poly degree 1 is just a linear model). Which is best? To me poly 2 looks the most general, with poly 3 and poly 4 increasingly overfitted. Of course! This is a job for anova: model comparison.

# here's a neato little trick for creating a series of similarly named
# variables using the `assign` and `paste` commands within a loop
for (degree in 1:4) {
    fm <- lm(mpg ~ poly(disp, degree), data = mtcars)
    assign(paste("mtcars", degree, sep = ".poly"), fm)
}
anova(mtcars.poly1, mtcars.poly2, mtcars.poly3, mtcars.poly4)
## Analysis of Variance Table
## 
## Model 1: mpg ~ poly(disp, degree)
## Model 2: mpg ~ poly(disp, degree)
## Model 3: mpg ~ poly(disp, degree)
## Model 4: mpg ~ poly(disp, degree)
##   Res.Df RSS Df Sum of Sq    F Pr(>F)    
## 1     30 317                             
## 2     29 233  1      83.8 16.3  4e-04 ***
## 3     28 138  1      95.0 18.5  2e-04 ***
## 4     27 138  1       0.0  0.0  1e+00    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Exercises 5G

  1. Can you add more non-polynomial predictors (cyl, wt, gear) to the model? Make a coefplot.
  2. Would any of them be useful additions to model mpg?

According to the anova comparison the 3rd degree polynomial model is best - but I'm not so sure. In this particular sample that model may minimise the RSS, but perhaps it follows this particular sample too closely (that's overfitting) rather than capturing the real (or population) mpg~disp relationship (that's generalisation). We will finish this tutorial with a rough and ready script and plot to investigate:

# there is quite a bit of tricky code in here to test how much you have
# really picked up so far
RSSframe = data.frame(RSS.poly2 = rep(NA, 100), RSS.poly3 = rep(NA, 
    100), RSS.poly4 = rep(NA, 100))
for (i in 1:100) {
    s = sample(1:32, 16)
    train = mtcars[s, ]
    test = mtcars[-s, ]
    mod2 <- lm(mpg ~ poly(disp, 2), data = train)
    mod3 <- lm(mpg ~ poly(disp, 3), data = train)
    mod4 <- lm(mpg ~ poly(disp, 4), data = train)
    pred2 = predict(mod2, test)
    pred3 = predict(mod3, test)
    pred4 = predict(mod4, test)
    RSSframe$RSS.poly2[i] = sum((pred2 - mtcars$mpg[-s])^2)
    RSSframe$RSS.poly3[i] = sum((pred3 - mtcars$mpg[-s])^2)
    RSSframe$RSS.poly4[i] = sum((pred4 - mtcars$mpg[-s])^2)
}
boxplot(log(RSSframe), ylab = "cross-validated log of RSS")

plot of chunk xvalPolyLm

Exercises 5H

  1. Do you understand what we have done in this last R script above? Which is the best model?
  2. If you understand the code in the last script annotate it to explain each step.
  3. The relation between wt~disp of the mtcars data looks interesting. Plot it then find a good polynomial fit by eyeball, anova, and if you really feel like it use the cross-validation method.

This last script was an example of cross-validation. We'll come across this again in the machine learning tutorial.