We have moved from simple linear models to multiple regression models, where more than one predictor is tested for its ability to explain the observed variation in the data. Some of this variation is systematic (explained by a variable of theoretical importance, e.g., temperature, where evaporation ~ temperature), while other variation remains unexplained (either because we didn’t measure its source directly or because we don’t know what is causing it). We can add covariates to reduce the unexplained variance, but we need to do so with care. In an experimental setting, you design the experiment with the most likely explanatory variables included. But in an observational setting, we may need to select which variables help explain the distribution of our observations. In our simple example with evaporation, we know temperature is important, but we may also have access to other variables like humidity or wind speed. Which of these helps explain our observed evaporation rates? We would construct a model to find out.
Today we will explore some of the diagnostics associated with model fitting and touch on model selection. You are not done fitting an lm until you have assessed the output and the residuals. We can select a model based on hypothesis tests (t, F), \(R^2\), and backward/forward selection, or we can use an information-theoretic approach. Information theory helps us answer the question: what is the probability of the hypothesis given our data (e.g., P[H1 | Data])? Information-theoretic analyses, like Bayesian analyses, are grounded in likelihood estimates: we need the log-likelihood as estimated with Maximum Likelihood (ML). This is how R fits some models behind the scenes; when the errors are normally distributed, ML and OLS (Ordinary Least Squares) converge on the same solution (a lot of matrix algebra is involved that we won’t get into). At any rate, Akaike’s Information Criterion (AIC) is one I-T approach for model selection, and we will get into that. There is a fairly accessible paper on I-T approaches here.
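To make that likelihood connection concrete, here is a quick sketch using the built-in cars dataset (any lm would do): R stores the maximized log-likelihood with a fitted model, and AIC is computed directly from it.
fit <- lm(dist ~ speed, data = cars)
logLik(fit)                          # maximized log-likelihood; its "df" attribute counts parameters
k <- attr(logLik(fit), "df")         # 3 here: intercept, slope, residual variance
2 * k - 2 * as.numeric(logLik(fit))  # reproduces AIC(fit)
AIC(fit)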
Think about your workflow:
assess the data to know what you have
fit your model
check your model
interpret your findings
Sure, there is more to it than this but we’ve already covered a lot of the nuance. It’s up to you to find a system that works for you to make sure you methodically complete the analysis.
library(jtools)
## Warning: package 'jtools' was built under R version 4.3.2
library(sjPlot)
## Warning: package 'sjPlot' was built under R version 4.3.2
## Registered S3 methods overwritten by 'broom':
## method from
## tidy.glht jtools
## tidy.summary.glht jtools
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
trees
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
## 7 11.0 66 15.6
## 8 11.0 75 18.2
## 9 11.1 80 22.6
## 10 11.2 75 19.9
## 11 11.3 79 24.2
## 12 11.4 76 21.0
## 13 11.4 76 21.4
## 14 11.7 69 21.3
## 15 12.0 75 19.1
## 16 12.9 74 22.2
## 17 12.9 85 33.8
## 18 13.3 86 27.4
## 19 13.7 71 25.7
## 20 13.8 64 24.9
## 21 14.0 78 34.5
## 22 14.2 80 31.7
## 23 14.5 74 36.3
## 24 16.0 72 38.3
## 25 16.3 77 42.6
## 26 17.3 81 55.4
## 27 17.5 82 55.7
## 28 17.9 80 58.3
## 29 18.0 80 51.5
## 30 18.0 80 51.0
## 31 20.6 87 77.0
dim(trees)
## [1] 31 3
str(trees)
## 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
summary(trees)
## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
plot(trees)
hist(trees$Volume)
# maybe log-normal; not very normal
hist(trees$Girth)
# normal-ish
hist(trees$Height)
# Normal
plot(trees$Girth, trees$Volume)
# pretty linear (+)
Create a simple linear regression of volume as a function of girth:
m1 <- lm(Volume ~ Girth, data = trees)
summary(m1)
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
## Girth 5.0659 0.2474 20.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
anova(m1)
## Analysis of Variance Table
##
## Response: Volume
## Df Sum Sq Mean Sq F value Pr(>F)
## Girth 1 7581.8 7581.8 419.36 < 2.2e-16 ***
## Residuals 29 524.3 18.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Girth is a significant predictor of volume. The slope of the model is 5.066 and is significantly different from 0 (t = 20.48, p < 0.001). The high \(R^2\) value (0.935) shows that the majority of the variance in volume is explained by girth (\(F\)(1,29) = 419.36, p < 0.001).
par(mfrow = c(2,2))
plot(m1)
The residuals are drawn down a bit in the middle - not equally scattered. There is some variation in the QQ plot around the tails, but our data set is small. The standardized residuals look pretty good. There does appear to be one point that has high leverage - a high predictor value. Let’s check point 31.
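Before eyeballing the data, we can ask R directly which observation is the culprit (a quick sketch using base R's standard diagnostics):
which.max(hatvalues(m1))       # row with the highest leverage
which.max(cooks.distance(m1))  # row with the largest overall influence on the fit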
trees[21:31,] #choosing ten observations to compare with point 31
## Girth Height Volume
## 21 14.0 78 34.5
## 22 14.2 80 31.7
## 23 14.5 74 36.3
## 24 16.0 72 38.3
## 25 16.3 77 42.6
## 26 17.3 81 55.4
## 27 17.5 82 55.7
## 28 17.9 80 58.3
## 29 18.0 80 51.5
## 30 18.0 80 51.0
## 31 20.6 87 77.0
Seems a bit high.
mean(trees$Volume, na.rm = TRUE)
## [1] 30.17097
summary(trees$Volume)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.20 19.40 24.20 30.17 37.30 77.00
The 3rd quartile value is 37.3 and the max is 77. Observation 31 was a giant.
We’ve investigated: it seems like we just have a big tree. Let’s proceed for now.
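If we were worried, a quick sensitivity check is to refit without observation 31 and see whether the slope moves much (a sketch, not a recommendation to drop the point):
m1_no31 <- lm(Volume ~ Girth, data = trees[-31, ])
coef(m1)       # slope with all 31 trees
coef(m1_no31)  # slope without the giant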
Back to the lm. The summary and ANOVA for m1 are unchanged from above, so we won’t reprint them.
We have done our due diligence and followed the model through. Now we make some inference and interpret our findings:
Tree volume is a function of girth. Our linear model (Volume = -36.94 + 5.07 × Girth) showed that for each additional inch in diameter, timber volume increases by about 5 cubic feet (slope = 5.066, t = 20.48, p < 0.001). (Put “trees” in the help search and it will give you all the info on the data set.) Tree girth explains significant variation in tree volume (\(R^2\) = 0.93), and girth is a good predictor of volume (F = 419.4, p < 0.001). Thus, black cherry tree board feet can be estimated from girth (DBH).
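Having settled on an interpretation, we could also use the model to predict. For a hypothetical 15-inch tree, for example (15 is just an illustrative value within the observed girth range):
predict(m1, newdata = data.frame(Girth = 15), interval = "prediction")  # fit plus 95% prediction interval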
Okay, that was a simple example. Let’s bump it up with an additional variable:
m2 <- lm(Volume ~ Girth + Height, data = trees)
summary(m2)
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## Girth 4.7082 0.2643 17.816 < 2e-16 ***
## Height 0.3393 0.1302 2.607 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
With both girth and height in the model, both appear to have a significant effect on volume. The slope for girth is 4.708 and is significantly different from 0 (t = 17.816, p < 0.001). The slope for height is 0.339 and is also significantly different from 0 (t = 2.607, p = 0.0145). The model explains a great deal of the variance in volume and is well fit (adjusted \(R^2\) = 0.944, \(F\)(2,28) = 255, p < 0.001).
Now we can drill down into the two explanatory variables a bit more using anova(). This is different from aov(), which is used for a true ANOVA analysis. With anova() we are looking at the amount of variation explained by each variable relative to the unexplained residual error. The anova() function is almost like a post-hoc test to determine which of the variables in the model contribute to the overall model fit. If you go to the help and input “anova” you’ll see that, for this function, analysis of variance (or deviance) tables are computed for one or more fitted model objects.
anova(m2)
## Analysis of Variance Table
##
## Response: Volume
## Df Sum Sq Mean Sq F value Pr(>F)
## Girth 1 7581.8 7581.8 503.1503 < 2e-16 ***
## Height 1 102.4 102.4 6.7943 0.01449 *
## Residuals 28 421.9 15.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both girth and height are significant predictors. That is, a model with these variables is better than one without. And the MSE is very low.
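One caution worth a sketch: anova() on a single model uses sequential (Type I) sums of squares, so the order of the terms matters. Refit with the terms reversed and the Height line changes:
anova(lm(Volume ~ Height + Girth, data = trees))  # Height's SS is now computed before adjusting for Girth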
par(mfrow=c(2,2))
plot(m2)
These diagnostic plots look okay - a bit wonky in spots, but nothing systematic, especially given our small sample of 31 observations.
There is one concern here - are girth and height collinear?
cor.test(trees$Girth, trees$Height)
##
## Pearson's product-moment correlation
##
## data: trees$Girth and trees$Height
## t = 3.2722, df = 29, p-value = 0.002758
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2021327 0.7378538
## sample estimates:
## cor
## 0.5192801
Girth and height are positively correlated, but with an r of 0.519 the correlation is only moderate - not so strong that we can’t include both predictors. We can include their interaction with a *.
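Since car is already loaded, we can also put a number on the collinearity with variance inflation factors (VIFs near 1 are fine; values above roughly 5-10 are usually flagged):
vif(m2)  # how much each coefficient's variance is inflated by correlation among predictors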
What about a potential interaction between height and girth? We can evaluate that. First, let’s define an interaction: it means the slope of one continuous predictor on the response changes as the value of a second continuous predictor changes. Darn. You can probably see the problem here: with correlated predictors, that is exactly the kind of effect we need to watch for.
m3 <- lm(Volume ~ Girth * Height, data = trees)
summary(m3)
##
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5821 -1.0673 0.3026 1.5641 4.6649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.39632 23.83575 2.911 0.00713 **
## Girth -5.85585 1.92134 -3.048 0.00511 **
## Height -1.29708 0.30984 -4.186 0.00027 ***
## Girth:Height 0.13465 0.02438 5.524 7.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9728
## F-statistic: 359.3 on 3 and 27 DF, p-value: < 2.2e-16
anova(m3)
## Analysis of Variance Table
##
## Response: Volume
## Df Sum Sq Mean Sq F value Pr(>F)
## Girth 1 7581.8 7581.8 1033.469 < 2.2e-16 ***
## Height 1 102.4 102.4 13.956 0.0008867 ***
## Girth:Height 1 223.8 223.8 30.512 7.484e-06 ***
## Residuals 27 198.1 7.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It seems there is an interaction effect: the slope of volume on girth changes with height. As girth increases, volume increases, but at a rate that depends on how tall the tree is.
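We can see exactly what this means by computing the girth slope at two heights straight from the coefficients (65 and 85 are just illustrative heights near the ends of the observed range):
cf <- coef(m3)
cf["Girth"] + cf["Girth:Height"] * c(65, 85)  # slope of Volume on Girth at heights 65 and 85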
plot(trees$Girth, trees$Height)
ggplot(trees, aes(x = Girth, y = Height)) +
  geom_point(aes(col = Volume)) +
  theme_bw()
Neither of these plots really shows us this interaction very well.
library(sjPlot)
plot_model(m3, type="int")
Here you can nicely see that at different values of height, the slopes are different. But these are only two selected values of height, and height is continuous. So we need to carefully consider the values we want to plot.
summary(trees$Height) #63-87 (plotted), mean = 76
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 63 72 76 76 80 87
So we can try a different method, which uses the mean and sd:
plot_model(m3, type ="int", mdrt.values = "meansd")
This gives us a range of values from the mean +/- 1 sd; we can see that the slope for Volume ~ Girth is steepest when height is greatest. This makes intuitive sense (more board feet come from trees that are bigger in height and girth).
Back to the models. We aren’t sure which is the best model. Model 3 has the highest \(R^2\) value but it also has a pesky interaction term. Here we can interpret that, but in some cases, it may be hard to interpret the interaction so thought needs to be given to its inclusion and outcomes.
So, is m3 the best? Maybe. It also has the highest number of parameters (3 + intercept = 4).
So here is where we desire some unbiased (or less biased) way of selecting the best model. We can compare models - if they use the same exact input data - with Akaike’s Information Criterion. By now, you know something about AIC, including its equation:
\(AIC = 2k - 2\ln(\hat{L})\)
That is, two times the number of parameters (k) minus two times the log of the maximized likelihood (\(\hat{L}\)).
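You can verify this by hand for m1, which has three parameters (intercept, slope, and the residual variance):
2 * 3 - 2 * as.numeric(logLik(m1))  # should match the AIC for m1 below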
Which model would you prefer using AIC? The implementation is simple.
AIC(m1, m2, m3)
## df AIC
## m1 3 181.6447
## m2 4 176.9100
## m3 5 155.4692
Recall that AIC is a model selection tool based on “badness of fit” where lower AIC values are preferred. In general, models with \(\Delta\)AIC < 3 are similarly supported (one would not be preferred over the other). Here m3 has the lowest AIC at 155.5 and it is well below the others. So, we can say this is the best fitting model.
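The \(\Delta\)AIC values are easy to pull out, since AIC() on several models returns a data frame:
aics <- AIC(m1, m2, m3)
aics$AIC - min(aics$AIC)  # delta-AIC; the best-supported model sits at 0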
Being king of dipshits doesn’t mean you’re smart: there will always be a winner with AIC, but it still could be a bad model.
We may be able to do better. We know the response variable is lognormal-ish. Let’s fit model 3, but with ln(Volume).
m4 <- lm(log(Volume) ~ Girth * Height, data = trees)
summary(m4)
##
## Call:
## lm(formula = log(Volume) ~ Girth * Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14154 -0.05912 -0.01248 0.05700 0.14502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.1485424 0.7425484 -2.893 0.00745 **
## Girth 0.3319765 0.0598548 5.546 7.05e-06 ***
## Height 0.0453027 0.0096524 4.693 6.94e-05 ***
## Girth:Height -0.0023796 0.0007594 -3.133 0.00413 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08438 on 27 degrees of freedom
## Multiple R-squared: 0.9769, Adjusted R-squared: 0.9743
## F-statistic: 380 on 3 and 27 DF, p-value: < 2.2e-16
The \(R^2\) is nominally higher for this model, suggesting a better fit, though be careful: with a log-transformed response, the two \(R^2\) values describe fit on different scales.
par(mfrow=c(2,2))
plot(m4)
What about the QQ plot and the residuals? The leverage is still a bit off due to some influential points, but the other plots look good.
Should we check the AIC?
AIC(m4)
## [1] -59.59979
Why is this so low? This should be a sign that something is off.
With AIC, you can only compare models with the SAME input data. When we log transformed the variable, we fundamentally changed the data. Bad, very bad. Don’t do this! See below for how different the data are:
summary(trees$Volume)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.20 19.40 24.20 30.17 37.30 77.00
summary(log(trees$Volume))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.322 2.965 3.186 3.273 3.619 4.344
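For the record, there is a standard fix if you really do want to compare across the transformation: add the Jacobian of the log transform to put m4’s likelihood back on the original Volume scale. This is a sketch of the change-of-variables correction, not something AIC() does for you: if Z = log(Y), then the log-likelihood on the Y scale equals the log-likelihood on the Z scale minus sum(log(y)).
AIC(m4) + 2 * sum(log(trees$Volume))  # m4's AIC re-expressed on the original scale, comparable to m1-m3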
Stepwise regression. We are going to use another built-in dataset, mtcars. We will build a linear model of mpg ~ displacement, horsepower, weight, and cylinders, and determine which terms should stay and which should go. We start with the full model (all candidate predictors included), as opposed to the null model, which would be just the mean of y (a horizontal line).
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
dim(mtcars)
## [1] 32 11
There are 11 variables with 32 data points. Gear, carb, vs, cyl, and am are best treated as categorical, while mpg, disp, hp, drat, wt, and qsec are continuous (though str() below shows that all columns are stored as numeric).
We start with the full model.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
c1 <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)
summary(c1)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0562 -1.4636 -0.4281 1.2854 5.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.82854 2.75747 14.807 1.76e-14 ***
## cyl -1.29332 0.65588 -1.972 0.058947 .
## disp 0.01160 0.01173 0.989 0.331386
## hp -0.02054 0.01215 -1.691 0.102379
## wt -3.85390 1.01547 -3.795 0.000759 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8262
## F-statistic: 37.84 on 4 and 27 DF, p-value: 1.061e-10
anova(c1)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 1 817.71 817.71 129.5335 8.251e-12 ***
## disp 1 37.59 37.59 5.9552 0.0215174 *
## hp 1 9.37 9.37 1.4844 0.2336201
## wt 1 90.92 90.92 14.4034 0.0007589 ***
## Residuals 27 170.44 6.31
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Displacement has the least significant test of slope, but the sequential ANOVA suggests hp is actually not contributing. We’ll use the F test to sequentially drop terms. In stepwise regression, you sequentially delete terms that aren’t significant until you arrive at the most parsimonious model - the one that explains the most variation with the fewest terms.
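Base R can automate this comparison: drop1() computes the F test for removing each term from the current model (and step() will iterate the whole search using AIC). Note that drop1() tests each term adjusted for all the others, so it agrees with the t tests in summary() rather than with the sequential ANOVA.
drop1(c1, test = "F")  # marginal F test for removing each term, one at a time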
Now do that (in reality, you would evaluate each model for assumptions - in the interest of time, just fit them).
c2 <- lm(mpg ~ cyl + disp + wt, data = mtcars)
summary(c2)
##
## Call:
## lm(formula = mpg ~ cyl + disp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4035 -1.4028 -0.4955 1.3387 6.0722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.107678 2.842426 14.462 1.62e-14 ***
## cyl -1.784944 0.607110 -2.940 0.00651 **
## disp 0.007473 0.011845 0.631 0.53322
## wt -3.635677 1.040138 -3.495 0.00160 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.595 on 28 degrees of freedom
## Multiple R-squared: 0.8326, Adjusted R-squared: 0.8147
## F-statistic: 46.42 on 3 and 28 DF, p-value: 5.399e-11
Dropped hp, as it had the least significant impact on the fit of the model (\(F\)(1,27) = 1.48, p = 0.234) and on the response variable (t = -1.691, p = 0.102).