Packages used
library("tidyverse")
## -- Attaching packages ------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("corrplot")
## corrplot 0.84 loaded
library("MASS")
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
Problem:
Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Solution
I have used the mtcars dataset to do this exercise. The dataset contains the following variables:
mpg:Miles/(US) gallon cyl:Number of cylinders disp: Displacement (cu.in.) hp:Gross horsepower drat:Rear axle ratio wt:Weight (1000 lbs) qsec:1/4 mile time vs:V/S (V engine or straight engine) am:Transmission (0 = automatic, 1 = manual) gear:Number of forward gears carb:Number of carburetors
data<-mtcars
summary<-summary (mtcars)
pairs(mtcars)
Based on the scatter plots below, there might be a negative linear relationship between mile and hp, disp, and wt. However these aren’t very strong relationships. There is a positive linear relationship between mpg and mile.
I will be using hp and am as the dichotomous term, wt as the quadratic term, and disp and vs as the dichotomous vs. quantitative interaction term.
I will transfrom wt using the Boc-Cox model
lm_bc<-lm(wt ~ qsec, data = mtcars)
bc <- boxcox(lm_bc, lambda = seq(-2,2))
wtLambda <- bc$x[which.max(bc$y)]
print(wtLambda)
## [1] 0.4646465
lm1<- lm(qsec ~ hp + vs + sqrt(wt) + vs:cyl, data = mtcars)
summary(lm1)
##
## Call:
## lm(formula = qsec ~ hp + vs + sqrt(wt) + vs:cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0521 -0.3874 -0.1936 0.3262 2.4975
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.188146 1.280007 8.741 2.35e-09 ***
## hp -0.017824 0.003185 -5.596 6.17e-06 ***
## vs 4.903943 1.288518 3.806 0.000738 ***
## sqrt(wt) 4.659763 0.728179 6.399 7.44e-07 ***
## vs:cyl -0.566642 0.247907 -2.286 0.030344 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7485 on 27 degrees of freedom
## Multiple R-squared: 0.8472, Adjusted R-squared: 0.8246
## F-statistic: 37.42 on 4 and 27 DF, p-value: 1.204e-10
The equation for the model is
qsec=11.1881−0.0178∗hp+4.9039∗vs+4.6598∗wt−√−0.5666∗vs:cyl
All of the predictors are significant.
Interpretation of coefficients:
hp: For each unit increase in horsepower, quarter-mile time decreeases by 0.0178 seconds, holding all other variables constant
vs: For each unit increase in vs, quarter-mile time increases by 4.9039 seconds, holding all other variables constant
wt−−√: For each unit increase in the square root of weight, quarter-mile time increases by 4.6598 seconds, holding all other variables constant
vs∗cyl For each unit increase in the product of the type of engine and the number of cylinders, quarter-mile time decreases by 0.5666 seconds, holding all other variables constant.
Linearity
Nearly Normal Residuals
qqnorm(resid(lm1))
qqline(resid(lm1))
Most of the residuals seem to be evenly distributed around 0. The Q-Q plot displays signs of skewness with one point deviating significantly. Most of the points show small deviations from the line, so I can say the assumption of nearly normal residuals has been met.
Constant variability
plot(fitted(lm1), resid(lm1))
abline(a=0, b=0, col='red')
The no of residuals seem to increasing as we move towards the right, but the residuals seem to be evenly distributed around 0 except for one significant deviation. I am not sure the constant variability condition is met here.
Independence
data <- dplyr::select( mtcars, hp, vs, cyl, wt ) %>%
mutate( vs_X_cyl = vs * cyl, wt_sqrt = sqrt(wt) ) %>%
dplyr::select( hp, vs, vs_X_cyl, wt_sqrt )
m<-cor(data)
corrplot.mixed(m, lower.col = "black", number.cex = .7)
The linear model lm1 seems appropriate as the predictors are signficicant and the rsquare at 0.85 is high. The outlier should be investigated to see if we are able to remove it.