In many ways the R language is similar to other programming languages (Matlab, Python, Scala) but it has a few unique features due to its focus on statistics. Among these, model formulae are a useful shorthand for representing the relations between data variables - in particular for specifying linear models (e.g. regression and multiple regression) - and they appear in both statistical and plotting expressions.
The simplest form of model is the simple linear regression described by a predictor variable x and a response variable y:
\( y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i} \)
where \( \beta_{0} \) is the intercept on y, \( \beta_{1} \) is the slope, and \( \epsilon_{i} \) is an error term. Within R this type of model can be fitted (i.e. the best estimates of the coefficients \( \beta_{0} \) and \( \beta_{1} \) are found by the Ordinary Least Squares (OLS) algorithm) using the lm function and model formulae:
fit = lm(y ~ x, data = data)
where data is a data.frame containing columns of x and y. As an example we can use the built-in dataset cars, which records the stopping distance of cars from various speeds (NB it was recorded in the 1920s).
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
fit = lm(dist ~ speed, data = cars)
fit
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.58 3.93
##
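As a quick sanity check (a small sketch, not part of the original analysis), the OLS estimates for a simple regression can also be computed by hand from the closed-form formulas and compared with the fitted coefficients:
# slope = cov(x, y)/var(x); intercept = mean(y) - slope * mean(x)
b1 = cov(cars$speed, cars$dist)/var(cars$speed)
b0 = mean(cars$dist) - b1 * mean(cars$speed)
c(b0, b1)  # should match the coefficients printed above: -17.58 and 3.93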
The fit object is of class ‘lm’ for which there are many R functions for accessing details of the model. The coefficients printed are \( \beta_{0} \) and \( \beta_{1} \). This can be seen clearly by plotting the data and the model fit. One advantage of the base plot commands over ggplot2 is that they more closely follow the model formulae syntax:
plot(dist ~ speed, data = cars, xlim = c(0, 25), ylim = c(-20, 120))
abline(fit)
The abline function recognises the ‘lm’ class of fit and extracts \( \beta_{0} \) and \( \beta_{1} \) to plot the fit line through the data automagically. However there is far more information that can be retrieved even from this simple regression.
summary(fit)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.07 -9.53 -2.27 9.21 43.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.579 6.758 -2.60 0.012 *
## speed 3.932 0.416 9.46 1.5e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.4 on 48 degrees of freedom
## Multiple R-squared: 0.651, Adjusted R-squared: 0.644
## F-statistic: 89.6 on 1 and 48 DF, p-value: 1.49e-12
##
Perhaps this is confusing? Let's take it line by line.
The first line is the call:
This is just an echo of the original function.
The next part is the quartiles of the residuals:
The OLS algorithm should produce a fit with residuals that sum to 0 (you can check all the residuals using resid(fit)). If the model assumptions hold the residuals should be approximately normal and symmetric about zero. Here 1Q and 3Q are pretty similar but the median is slightly negative. We will look at further regression diagnostics in a moment.
The coefficients:
The Estimate column is again the OLS best estimate of the intercept \( \beta_{0} \) and slope \( \beta_{1} \). This time however we also get the standard error of each estimate and the t-value and p-value of the coefficients. Roughly speaking the t-value compares each coefficient to the null hypothesis that there is no relationship between y and x (i.e. dist~speed), and the p-value is the probability of seeing a t-value at least this extreme if that null hypothesis were true. For the sake of clarity significant p-values are marked with asterisks. (All of these numbers can also be extracted programmatically; see the sketch after this list of terms.)
Residual Standard Error:
Next comes the standard error of those residuals of the fit. The smaller the better of course.
Coefficient of determination (Multiple and Adjusted R Squared):
For simple linear regression this is the square of the correlation coefficient, i.e. the fraction of the variance of y explained by the predictor x; the closer to 1 the closer to a perfect linear relationship. For multiple regression it is the fraction of variance explained by all the predictors u, v, w etc. together (see below). For multiple regression the adjusted R-squared is a better measure as it penalises the addition of extra predictors.
F-statistic:
The F-statistic measures the significance of the whole model. For simple regression the p-value significance of F is just the same as that of the t-value. However for multiple linear regression considered below the significance of the whole model (F) will be different to the constituent significance (t) of any single predictor variable.
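All of the numbers in this summary can also be pulled out of the fit programmatically; here is a quick sketch of some standard accessor functions (the speed value of 21 below is just an arbitrary example):
coef(fit)  # beta0 and beta1
confint(fit)  # 95% confidence intervals for the coefficients
head(resid(fit))  # the residuals (sum(resid(fit)) should be ~0)
summary(fit)$coefficients  # the coefficient table as a matrix
summary(fit)$r.squared  # Multiple R-squared
cor(cars$dist, cars$speed)^2  # the same number via the correlation coefficient
predict(fit, newdata = data.frame(speed = 21))  # predicted stopping distance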
Beginners are often confused by tutorials or discussions of ANOVA in R. This is because people are often talking at cross purposes about one of a number of different but related operations. To a lot of general scientists ‘the ANOVA’ is simply a test of whether the means of a number of different groups are significantly different - and in R this is:
aov(vector~factor.groups)
To statisticians however ANOVA is a more general method
“in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation” (description from Wikipedia!).
ANOVA to statisticians then generally means a comparison of different models, usually to select the better one, and in R this is
anova(lm.fit1, lm.fit2) NB it’s a different function!
Actually the familiar ANOVA of comparing means (aov) is just the simplest form, comparing the global mean model to the group means model and seeing which has the smaller residual sum of squares (RSS). The confusion is sometimes compounded as R, unlike some other statistical software, does not have functions for particular forms of ANOVA which go by names such as ANCOVA, MANOVA, or even MANCOVA. These are still just linear models that are handled by lm and described using the model formulae. Phewww… I thought it was best to explain that up-front. Hopefully these examples will make all the above a bit clearer.
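As a purely hypothetical sketch (mydata, y, x and group are made-up names), here is what two of those ‘named’ ANOVAs look like as ordinary model formulae:
# simulate a factor 'group' and a continuous covariate 'x'
mydata = data.frame(group = gl(3, 10), x = rnorm(30), y = rnorm(30))
lm(y ~ group, data = mydata)  # a one-way ANOVA is just this linear model
lm(y ~ group + x, data = mydata)  # an 'ANCOVA' simply adds the covariate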
The aov() function can be used to perform the simple ANOVA that compares means of many groups. In the following example count is the number of insects in an agricultural experimental unit treated with different insecticide sprays, from the R built-in InsectSprays dataset.
aov(count ~ spray, InsectSprays)
## Call:
## aov(formula = count ~ spray, data = InsectSprays)
##
## Terms:
## spray Residuals
## Sum of Squares 2669 1015
## Deg. of Freedom 5 66
##
## Residual standard error: 3.922
## Estimated effects may be unbalanced
anova(aov(count ~ spray, InsectSprays))
## Analysis of Variance Table
##
## Response: count
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 2669 534 34.7 <2e-16 ***
## Residuals 66 1015 15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(count ~ spray, InsectSprays))
## Analysis of Variance Table
##
## Response: count
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 2669 534 34.7 <2e-16 ***
## Residuals 66 1015 15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
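The equivalence can be made explicit by fitting the single global-mean model (count ~ 1) and the one-mean-per-group model ourselves and handing both to anova; this quick sketch should reproduce the same F test as above:
fit.null = lm(count ~ 1, data = InsectSprays)  # one global mean
fit.means = lm(count ~ spray, data = InsectSprays)  # one mean per spray group
anova(fit.null, fit.means)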
Really though… aov is essentially a wrapper around the lm command, with the results printed out slightly differently. The anova function takes the fit and tests whether the RSS is significantly smaller when each group is fitted with its own mean than when a single global mean is used. I'm tempted to say “Seemples” here but I hate people who say that. For our purposes though this is perhaps not so great: although the p-value clearly indicates that the groups are better fitted with their own means, it doesn't tell us the relationship of the groups to each other. Actually a summary of lm is better for this.
# The Estimate column is the mean difference to the reference group sprayA
summary(lm(count ~ spray, InsectSprays))
##
## Call:
## lm(formula = count ~ spray, data = InsectSprays)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.33 -1.96 -0.50 1.67 9.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.500 1.132 12.81 < 2e-16 ***
## sprayB 0.833 1.601 0.52 0.60
## sprayC -12.417 1.601 -7.76 7.3e-11 ***
## sprayD -9.583 1.601 -5.99 9.8e-08 ***
## sprayE -11.000 1.601 -6.87 2.8e-09 ***
## sprayF 2.167 1.601 1.35 0.18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.92 on 66 degrees of freedom
## Multiple R-squared: 0.724, Adjusted R-squared: 0.704
## F-statistic: 34.7 on 5 and 66 DF, p-value: <2e-16
##
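The (Intercept) here is the mean of the reference group sprayA and the other estimates are differences from it; if you would rather compare against a different baseline you can relevel the factor, as in this quick sketch:
# make spray C the reference group instead of spray A
InsectSprays$spray2 = relevel(InsectSprays$spray, ref = "C")
summary(lm(count ~ spray2, data = InsectSprays))$coefficients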
Really we mention aov here not because it is a particularly useful command for fitting models - it's redundant to lm (anova is useful for comparing models as I described above) - but because newcomers always look for it. Actually for this simple example of comparing many means there is a better method than aov, anova, or lm: Tukey's honest significant differences. This can be performed using the TukeyHSD() function:
tk = TukeyHSD(aov(count ~ spray, InsectSprays))
tk
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = count ~ spray, data = InsectSprays)
##
## $spray
## diff lwr upr p adj
## B-A 0.8333 -3.866 5.533 0.9952
## C-A -12.4167 -17.116 -7.717 0.0000
## D-A -9.5833 -14.283 -4.884 0.0000
## E-A -11.0000 -15.699 -6.301 0.0000
## F-A 2.1667 -2.533 6.866 0.7542
## C-B -13.2500 -17.949 -8.551 0.0000
## D-B -10.4167 -15.116 -5.717 0.0000
## E-B -11.8333 -16.533 -7.134 0.0000
## F-B 1.3333 -3.366 6.033 0.9603
## D-C 2.8333 -1.866 7.533 0.4921
## E-C 1.4167 -3.283 6.116 0.9489
## F-C 14.5833 9.884 19.283 0.0000
## E-D -1.4167 -6.116 3.283 0.9489
## F-D 11.7500 7.051 16.449 0.0000
## F-E 13.1667 8.467 17.866 0.0000
##
plot(tk)
The table here shows all the pairwise comparisons between the different sprays: the mean difference, lower and upper confidence bounds, and crucially a p-value adjusted for the multiplicity of pairwise tests. Rather wonderfully this TukeyHSD table can be plotted automatically - which is what most people were probably looking for.
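If you just want a quick matrix of adjusted p-values for the pairwise differences, the base pairwise.t.test function is another option (a small sketch, here using a Bonferroni correction rather than Tukey's method):
pairwise.t.test(InsectSprays$count, InsectSprays$spray, p.adjust.method = "bonferroni")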
When we have a dataset with more than one predictor variable the simple model can be expanded to multiple linear regression:
\( y_{i}=\beta_{0}+\beta_{1}u_{i}+\beta_{2}v_{i}+\beta_{3}w_{i}...+\epsilon_{i} \)
or in R: fit = lm(y ~ u + v + w, data = data)
where data is now a data.frame containing columns of u, v, w and the response variable y. We can examine this using the tips dataset in the reshape package: one waiter recorded information about each tip he received over a period of a few months working in one restaurant.
Using lm we can try to fit all of the other variables against the tip in dollars (in a model formula the shorthand . means "all other columns in the data"):
library(reshape)
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object(s) are masked from 'package:plyr':
##
## rename, round_any
fit.all = lm(tip ~ ., data = tips)
summary(fit.all)
##
## Call:
## lm(formula = tip ~ ., data = tips)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.848 -0.573 -0.103 0.476 4.108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8038 0.3527 2.28 0.024 *
## total_bill 0.0945 0.0096 9.84 <2e-16 ***
## sexMale -0.0324 0.1416 -0.23 0.819
## smokerYes -0.0864 0.1466 -0.59 0.556
## daySat -0.1215 0.3097 -0.39 0.695
## daySun -0.0255 0.3213 -0.08 0.937
## dayThur -0.1623 0.3934 -0.41 0.680
## timeLunch 0.0681 0.4446 0.15 0.878
## size 0.1760 0.0895 1.97 0.051 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.02 on 235 degrees of freedom
## Multiple R-squared: 0.47, Adjusted R-squared: 0.452
## F-statistic: 26.1 on 8 and 235 DF, p-value: <2e-16
##
There is a lot of information in this summary- but the clear message from the p-value of the coefficients is that only total_bill and possibly size (of the party) are significantly related to the tip. One way to confirm this is to use the step function.
Consider an idealised example first. When given a linear model of the form:
fit1.all = lm(y ~ u + v + w, data = data)
step(fit1.all, direction = "backward")
The step function will begin with the full model and assess the goodness of fit using the AIC metric (yes, it's complicated). It then removes the least significant variable and rechecks the AIC. If the overall fit of the model is improved (the AIC is smaller) it removes another variable and so on; if instead the AIC gets larger (i.e. the variable was useful) it stops. The step procedure output for all the predictors in the tips model is very verbose (setting trace = 0 suppresses it), so we will take it as read that only two predictors were useful: total_bill and size. Instead we will consider whether an interaction term might be useful:
\( y_{i}=\beta_{0}+\beta_{1}total_{i}+\beta_{2}size_{i}+\beta_{3}total_{i}size_{i}+\epsilon_{i} \)
fit.int = lm(tip ~ total_bill * size, data = tips)
step(fit.int, direction = "backward")
## Start: AIC=11.24
## tip ~ total_bill * size
##
## Df Sum of Sq RSS AIC
## - total_bill:size 1 0.288 248 9.53
## <none> 247 11.24
##
## Step: AIC=9.53
## tip ~ total_bill + size
##
## Df Sum of Sq RSS AIC
## <none> 248 9.5
## - size 1 5.2 253 12.6
## - total_bill 1 106.3 354 94.7
##
## Call:
## lm(formula = tip ~ total_bill + size, data = tips)
##
## Coefficients:
## (Intercept) total_bill size
## 0.6689 0.0927 0.1926
##
This output simply means that the first full model with all terms had an AIC (our goodness-of-fit score) of 11.24, and by removing the interaction term we get a better (smaller) AIC of 9.53; but if size or total_bill is removed completely the AIC gets much worse. So the ‘best’ model is tip~total_bill+size, which is fairly intuitive.
I said in the previous section that we would revisit the anova function, which we can also use for comparing models fitted to the same data. NB one model needs to be nested within (a subset of) the other; models with mutually distinct predictors can't be compared, e.g.
anova( lm(y~a+b+c), lm(y~a+b) ) # this works
anova( lm(y~a+b+c), lm(y~c+d) ) # this doesn't
fit.best = lm(tip ~ total_bill + size, data = tips)
anova(fit.int, fit.best)
## Analysis of Variance Table
##
## Model 1: tip ~ total_bill * size
## Model 2: tip ~ total_bill + size
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 240 247
## 2 241 248 -1 -0.288 0.28 0.6
Comparing these two models (with and without the interaction), the anova comparison surprisingly does not think they are significantly different. Yet the step function, using the AIC score for comparison, selects the simpler model. So should you - because if two models do an equally good job of fitting the data it is always best to keep the simpler one, and this is exactly what the AIC metric in the step command captures!
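You can also inspect the AIC values directly with the AIC function (note that step prints AIC on a slightly different scale, via extractAIC, but the ordering of the models is the same):
AIC(fit.int, fit.best)  # the additive model fit.best should have the lower AIC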
Exercise: compare a coefplot of the fits to the summary of the fits.

With multiple regression we cannot always simply plot the fit to the data and easily judge the fit by eye, so regression diagnostics become more useful. Fortunately with an lm fit this is easy, we simply plug it into the plot function:
par(mfrow = c(2, 2))
# plot has a plot.lm method that plots regression diagnostics
# automatically
plot(fit.best)
# or alternatively there is a long-winded ggplot2 method that may be
# instructive. fortify() turns the lm fit into a data.frame with columns for
# the regression diagnostics.
library(ggplot2)
library(grid)
mod = fortify(fit.best)
# then you can make a qplot of each
topleft = qplot(.fitted, .stdresid, data = mod) + geom_hline(yintercept = 0) +
geom_smooth(se = FALSE)
topright = qplot(sample = .stdresid, data = mod, stat = "qq") + geom_abline()
bottomleft = qplot(.fitted, sqrt(abs(.stdresid)), data = mod) + geom_smooth(se = FALSE)
bottomright = qplot(.hat, .stdresid, data = mod) + geom_smooth(se = FALSE)
# We use the grid graphics to plot them together on the same page
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))
print(topleft, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(topright, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
print(bottomleft, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
print(bottomright, vp = viewport(layout.pos.row = 2, layout.pos.col = 2))
The top-left plot is simply the residuals against the fitted values, the top-right is the Q-Q plot checking normality of the residuals, bottom left we have the scale-location plot (a skew-adjusted residuals plot), and bottom right the leverage plot, which shows the influence (or leverage) of points and highlights points which exceed the Cook's distance, i.e. they are too influential. In each plot a few outliers are marked, rows 171, 173, and 213, but overall this is a good regression fit. This is of necessity a fairly brief and sketchy explanation; a fuller explanation of the regression diagnostic plots is given in Section 7 of Faraway's book Practical Regression and ANOVA using R.
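If you want to look at the flagged observations themselves, the influence measures can be pulled out directly; a quick sketch using cooks.distance():
cd = cooks.distance(fit.best)
head(sort(cd, decreasing = TRUE), 3)  # the most influential rows
tips[names(head(sort(cd, decreasing = TRUE), 3)), ]  # and the rows themselves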
Exercise: there are six plot.lm diagnostics but only four are shown by default. Plot the missing 4th and 6th regression diagnostics.

In addition to standard linear regression R has many tools for modeling other types of response variable such as binomial, Poisson etc. The simplest way is perhaps to use the ‘generalised linear models’ or glm() function, which closely follows the syntax of lm. Here we will use a popular, if rather macabre, dataset on Titanic passengers compiled by Thomas Cason to demonstrate binomial regression.
load("titanic3.sav")
colnames(titanic3)
## [1] "pclass" "survived" "name" "sex" "age"
## [6] "sibsp" "parch" "ticket" "fare" "cabin"
## [11] "embarked" "boat" "body" "home.dest"
The variables in the titanic3 data.frame are listed above.
Essentially we are interested in modeling a passenger's chance of survival based on the other relevant variables.
fit.l = glm(survived ~ pclass + sex + age + sibsp + parch, data = titanic3,
family = binomial)
summary(fit.l)
##
## Call:
## glm(formula = survived ~ pclass + sex + age + sibsp + parch,
## family = binomial, data = titanic3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.702 -0.661 -0.420 0.666 2.520
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.90679 0.36192 10.79 < 2e-16 ***
## pclass2nd -1.36676 0.23004 -5.94 2.8e-09 ***
## pclass3rd -2.35202 0.22880 -10.28 < 2e-16 ***
## sexmale -2.55686 0.17327 -14.76 < 2e-16 ***
## age -0.03949 0.00663 -5.95 2.6e-09 ***
## sibsp -0.35291 0.10535 -3.35 0.00081 ***
## parch 0.07436 0.09991 0.74 0.45670
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1414.62 on 1045 degrees of freedom
## Residual deviance: 970.12 on 1039 degrees of freedom
## (263 observations deleted due to missingness)
## AIC: 984.1
##
## Number of Fisher Scoring iterations: 4
##
All these variables, excepting the number of parents/children (parch), significantly affected the chance of passenger survival (you can check this yourself using the step function). Of course with a model of this kind it is perhaps difficult to conceptualise its meaning. It is probably very helpful therefore to plot the data - with fits, and where possible separating different factors into separate panels. Below for example we use the qplot function to facet the data by the most important predictor variables and fit a binomial model to each subset.
# the ggplot2 glm fit is nice as it automatically includes a plot of the
# confidence intervals about the fit
qplot(x = age, y = survived, facets = sex ~ pclass, data = titanic3) +
stat_smooth(method = "glm", family = "binomial")
Remember the ggplot2 package has its own commands stat_smooth() and geom_smooth() for adding model fits and confidence intervals. Previously we used the default loess smooth and an lm fit, but here we use it to fit a binomial regression - albeit not modelled on all the variables. Nonetheless this makes the message a lot clearer: best to be a 1st class woman passenger.
Given a model such as that above with strong predictor variables we are able to make predictions about new data. For instance we can invent a series of passengers with novel permutations of the predictor variables and then estimate their chance of survival. The same method can be used for lm or many other types of R model with a response~predictors.
# refit the model but remove the unhelpful parch predictor
fit.l = glm(survived ~ pclass + sex + age + sibsp, data = titanic3,
family = binomial)
# create new permutations of data based on random sampling
attach(titanic3)
pclass.new = sample(levels(pclass), 10, replace = T)
sex.new = sample(levels(sex), 10, replace = T)
age.new = sample(age, 10, replace = T)
sibsp.new = sample(sibsp, 10, replace = T)
newdata = data.frame(pclass = pclass.new, sex = sex.new, age = age.new,
sibsp = sibsp.new)
# of course if you repeat this your random data will look different to
# mine
newdata
## pclass sex age sibsp
## 1 2nd female NA 0
## 2 2nd male 32 0
## 3 3rd male 31 0
## 4 3rd male 18 0
## 5 2nd male 48 0
## 6 3rd female 18 0
## 7 2nd male 60 1
## 8 1st female NA 0
## 9 2nd male 2 0
## 10 2nd female 22 1
preds = predict(fit.l, newdata = newdata)
cbind(preds, newdata)
## preds pclass sex age sibsp
## 1 NA 2nd female NA 0
## 2 -1.27122 2nd male 32 0
## 3 -2.21294 3rd male 31 0
## 4 -1.69707 3rd male 18 0
## 5 -1.90614 2nd male 48 0
## 6 0.88315 3rd female 18 0
## 7 -2.71153 2nd male 60 1
## 8 NA 1st female NA 0
## 9 -0.08075 2nd male 2 0
## 10 1.37662 2nd female 22 1
I have printed the predictions in a column using cbind for the sake of clarity. The first thing to note is that there are NAs where our newdata contained randomly sampled NA predictors. Second, the prediction column values are perhaps not what you were expecting; indeed it flummoxed me for a minute. It's not the probability of survival, it's the log-odds of the probability:
\( log(\frac{p}{1-p}) \)
We want the probability of survival p, which is what the type = "response" argument to predict gives us:
preds = predict(fit.l, newdata = newdata, type = "response")
cbind(preds, newdata)
## preds pclass sex age sibsp
## 1 NA 2nd female NA 0
## 2 0.21905 2nd male 32 0
## 3 0.09859 3rd male 31 0
## 4 0.15485 3rd male 18 0
## 5 0.12941 2nd male 48 0
## 6 0.70748 3rd female 18 0
## 7 0.06230 2nd male 60 1
## 8 NA 1st female NA 0
## 9 0.47982 2nd male 2 0
## 10 0.79845 2nd female 22 1
Hopefully if you look back at our earlier qplots each of these new random cases should fall within the range of the plots.
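As a quick sanity check, the link-scale (log-odds) predictions can be converted by hand with the inverse-logit function plogis(); the result should match the type = "response" predictions:
all.equal(plogis(predict(fit.l, newdata = newdata)), predict(fit.l, newdata = newdata, type = "response"))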
Exercise: use coefplot to examine the model. Test the prediction using newdata and compare it to the faceted titanic plots above.

We have looked at multiple regression with many predictor variables, then at the glm extension for non-normal response variables (binomial, Poisson, Gamma). In this now somewhat lengthy tutorial we will look at one other model type - polynomial predictors. This is an obvious choice when we plot some data y~x and it's clearly not well fit by a straight line. The polynomial model is a simple extension of linear regression to a multiple linear regression with as many nth order polynomial terms of x as necessary to make a good fit.
\( y_{i}=\beta_{0}+\beta_{1}x_{i}+\beta_{2}x^2_{i}+\beta_{3}x^3_{i}...+\beta_{n}x^n_{i}+\epsilon_{i} \)
or
lm(y ~ poly(x, 2)) - this adds polynomial terms of x up to the 2nd degree
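Note that poly() generates orthogonal polynomials by default; raw = TRUE gives plain powers of x instead. The coefficients differ between the two parameterisations but the fitted curve is identical, as this quick sketch with the mtcars data (used in the plots below) shows:
fit.orth = lm(mpg ~ poly(disp, 2), data = mtcars)
fit.raw = lm(mpg ~ poly(disp, 2, raw = TRUE), data = mtcars)
all.equal(fitted(fit.orth), fitted(fit.raw))  # same fitted values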
We'll use ggplot2 again. Although it's a bit more long winded than plot for this example it's good practice and looks nicer.
g = ggplot(mtcars, aes(disp, mpg))
g = g + geom_point()
p1 = g + geom_smooth(method = "lm", formula = y ~ poly(x, 1), se = F) +
opts(title = "poly degree = 1")
p2 = g + geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = F) +
opts(title = "poly degree = 2")
p3 = g + geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = F) +
opts(title = "poly degree = 3")
p4 = g + geom_smooth(method = "lm", formula = y ~ poly(x, 4), se = F) +
opts(title = "poly degree = 4")
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
print(p3, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
print(p4, vp = viewport(layout.pos.row = 2, layout.pos.col = 2))
The mtcars data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). Above we have plotted engine displacement vs miles per gallon and fitted 4 different polynomial models (poly 1 is just a linear model). Which is best? To me poly 2 looks the most general, with poly 3 and poly 4 increasingly overfitted. Of course! This is a job for anova: model comparison.
# here's a neato little trick for creating a series of similarly named
# variables using the `assign` and paste commands within a loop
for (degree in 1:4) {
fm <- lm(mpg ~ poly(disp, degree), data = mtcars)
assign(paste("mtcars", degree, sep = ".poly"), fm)
}
anova(mtcars.poly1, mtcars.poly2, mtcars.poly3, mtcars.poly4)
## Analysis of Variance Table
##
## Model 1: mpg ~ poly(disp, degree)
## Model 2: mpg ~ poly(disp, degree)
## Model 3: mpg ~ poly(disp, degree)
## Model 4: mpg ~ poly(disp, degree)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 317
## 2 29 233 1 83.8 16.3 4e-04 ***
## 3 28 138 1 95.0 18.5 2e-04 ***
## 4 27 138 1 0.0 0.0 1e+00
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the anova comparison the 3rd degree polynomial model is best - but I'm not so sure. In this particular sample this model may minimise the RSS, but perhaps it too closely follows this particular sample (that's overfitting) rather than the real (or population) mpg~disp relationship (that's generalisation)? We will finish this tutorial with a rough and ready script and plot to investigate:
# there is quite a bit of tricky code in here to test how much you have
# really picked up so far
RSSframe = data.frame(RSS.poly2 = rep(NA, 100), RSS.poly3 = rep(NA,
100), RSS.poly4 = rep(NA, 100))
for (i in 1:100) {
    # randomly split mtcars into a training half and a held-out test half
    s = sample(1:32, 16)
    train = mtcars[s, ]
    test = mtcars[-s, ]
    # fit each polynomial model on the training half only
    mod2 <- lm(mpg ~ poly(disp, 2), data = train)
    mod3 <- lm(mpg ~ poly(disp, 3), data = train)
    mod4 <- lm(mpg ~ poly(disp, 4), data = train)
    # predict the held-out half and record the residual sum of squares
    pred2 = predict(mod2, test)
    pred3 = predict(mod3, test)
    pred4 = predict(mod4, test)
    RSSframe$RSS.poly2[i] = sum((pred2 - mtcars$mpg[-s])^2)
    RSSframe$RSS.poly3[i] = sum((pred3 - mtcars$mpg[-s])^2)
    RSSframe$RSS.poly4[i] = sum((pred4 - mtcars$mpg[-s])^2)
}
boxplot(log(RSSframe), ylab = "cross-validated log of RSS")
Exercise: wt~disp in the mtcars data looks interesting. Plot it, then find a good polynomial fit by eyeball, by anova, and if you really feel like it by the cross-validation method.

This last script was an example of cross-validation. We'll come across this again in the machine learning tutorial.