  • Introduction
  • Summary Statistics
  • Visualization
  • Data Preparation
  • Build Model
  • Model Diagnostics
  • Conclusion

Packages used

library("tidyverse")
## -- Attaching packages ------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library("corrplot")
## corrplot 0.84 loaded
library("MASS")
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

Introduction

Problem:

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Solution

I have used the mtcars dataset to do this exercise. The dataset contains the following variables:

mpg: Miles/(US) gallon
cyl: Number of cylinders
disp: Displacement (cu. in.)
hp: Gross horsepower
drat: Rear axle ratio
wt: Weight (1000 lbs)
qsec: 1/4 mile time
vs: Engine shape (0 = V-shaped, 1 = straight)
am: Transmission (0 = automatic, 1 = manual)
gear: Number of forward gears
carb: Number of carburetors

Summary Statistics

data <- mtcars
summary_stats <- summary(data)  # six-number summary of every variable (stored for reference)
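
Since summary() on the full data frame is fairly wide, it can help to look only at the variables that appear in the model later on. A quick sketch (the object name model_vars is my own choice):

model_vars <- mtcars %>% dplyr::select(qsec, hp, vs, cyl, wt)  # variables used in the model below
summary(model_vars)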

Visualization

pairs(mtcars)

Based on the pairs plot, there may be a negative linear relationship between qsec (the quarter-mile time) and each of hp, disp, and wt, although none of these relationships is very strong. There also appears to be a positive linear relationship between qsec and mpg.
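
The strength of these relationships can be quantified with simple correlations against qsec; a minimal sketch:

round(sort(cor(mtcars)[, "qsec"]), 2)  # correlation of every variable with quarter-mile time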

Data Preparation

I will use qsec (quarter-mile time) as the response, hp as a quantitative predictor, vs as the dichotomous term, a Box-Cox-transformed wt as the power term (in place of a quadratic term), and vs:cyl as the dichotomous vs. quantitative interaction term.

I will transform wt using the Box-Cox method.

lm_bc <- lm(wt ~ qsec, data = mtcars)     # model wt so Box-Cox can suggest a transformation for it
bc <- boxcox(lm_bc, lambda = seq(-2, 2))  # profile the log-likelihood over candidate lambdas

wtLambda <- bc$x[which.max(bc$y)]         # lambda with the highest log-likelihood
print(wtLambda)
## [1] 0.4646465
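
The estimated lambda of about 0.46 is close to 0.5, which corresponds to a square-root transformation, so sqrt(wt) is used in the model below. A small sketch to compare the raw and transformed variable visually (the plotting choices are my own):

par(mfrow = c(1, 2))                                                  # show both panels side by side
plot(mtcars$wt, mtcars$qsec, xlab = "wt", ylab = "qsec")              # raw weight vs quarter-mile time
plot(sqrt(mtcars$wt), mtcars$qsec, xlab = "sqrt(wt)", ylab = "qsec")  # transformed weight
par(mfrow = c(1, 1))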

Build Model

lm1 <- lm(qsec ~ hp + vs + sqrt(wt) + vs:cyl, data = mtcars)
summary(lm1)
## 
## Call:
## lm(formula = qsec ~ hp + vs + sqrt(wt) + vs:cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0521 -0.3874 -0.1936  0.3262  2.4975 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.188146   1.280007   8.741 2.35e-09 ***
## hp          -0.017824   0.003185  -5.596 6.17e-06 ***
## vs           4.903943   1.288518   3.806 0.000738 ***
## sqrt(wt)     4.659763   0.728179   6.399 7.44e-07 ***
## vs:cyl      -0.566642   0.247907  -2.286 0.030344 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7485 on 27 degrees of freedom
## Multiple R-squared:  0.8472, Adjusted R-squared:  0.8246 
## F-statistic: 37.42 on 4 and 27 DF,  p-value: 1.204e-10

The equation for the model is

$$\widehat{qsec} = 11.1881 - 0.0178 \cdot hp + 4.9039 \cdot vs + 4.6598 \cdot \sqrt{wt} - 0.5666 \cdot (vs \times cyl)$$

All of the predictors are significant.
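
Confidence intervals for the coefficients complement the p-values above; a one-line check:

confint(lm1)  # 95% confidence intervals for the fitted coefficients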

Interpretation of coefficients:

hp: For each unit increase in horsepower, quarter-mile time decreases by 0.0178 seconds, holding all other variables constant.

vs: A straight engine (vs = 1) has a quarter-mile time 4.9039 seconds higher than a V-shaped engine (vs = 0), holding all other variables constant (before the vs:cyl interaction is taken into account).

sqrt(wt): For each unit increase in the square root of weight, quarter-mile time increases by 4.6598 seconds, holding all other variables constant.

vs:cyl: For each unit increase in the product of the engine-type indicator and the number of cylinders, quarter-mile time decreases by 0.5666 seconds, holding all other variables constant. Equivalently, for straight-engine cars (vs = 1) each additional cylinder lowers the predicted quarter-mile time by a further 0.5666 seconds; for V-shaped engines (vs = 0) the interaction contributes nothing.
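
As a sanity check on these interpretations, the fitted model can be used to predict qsec for a hypothetical car (the values below are invented purely for illustration):

# Hypothetical car: 150 hp, straight engine (vs = 1), 6 cylinders, 3000 lbs (wt = 3.0)
new_car <- data.frame(hp = 150, vs = 1, cyl = 6, wt = 3.0)
predict(lm1, newdata = new_car)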

Model Diagnostics

Linearity
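
One simple way to check linearity is to plot the residuals against each quantitative term in the model; a minimal sketch using base graphics:

par(mfrow = c(1, 3))  # residuals vs each quantitative term; no systematic pattern supports linearity
plot(mtcars$hp, resid(lm1), xlab = "hp", ylab = "residuals"); abline(h = 0, col = "red")
plot(sqrt(mtcars$wt), resid(lm1), xlab = "sqrt(wt)", ylab = "residuals"); abline(h = 0, col = "red")
plot(mtcars$cyl, resid(lm1), xlab = "cyl", ylab = "residuals"); abline(h = 0, col = "red")
par(mfrow = c(1, 1))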

Nearly Normal Residuals

qqnorm(resid(lm1))
qqline(resid(lm1))

Most of the residuals seem to be evenly distributed around 0. The Q-Q plot displays some signs of skewness, with one point deviating noticeably from the line. Since most of the points show only small deviations, I can say the assumption of nearly normal residuals has been reasonably met.
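
A formal complement to the Q-Q plot is the Shapiro-Wilk test on the residuals (available in base R); a large p-value would be consistent with normality:

shapiro.test(resid(lm1))  # Shapiro-Wilk test of normality for the residuals of lm1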

Constant variability

plot(fitted(lm1), resid(lm1))
abline(a=0, b=0, col='red')

The spread of the residuals seems to increase as we move towards larger fitted values, but otherwise the residuals are fairly evenly distributed around 0, apart from one significant deviation. I am not sure the constant variability condition is fully met here.
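
If the lmtest package is available (it is not loaded above, so treat this as an optional extra), the Breusch-Pagan test gives a formal check of constant variance; a small p-value would indicate heteroscedasticity:

library("lmtest")  # assumed installed; not loaded earlier in this document
bptest(lm1)        # Breusch-Pagan test for non-constant error variance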

Independence

# Build a data frame of the model terms (including the interaction and the
# transformed weight) to examine how strongly they are correlated with each other
data <- dplyr::select(mtcars, hp, vs, cyl, wt) %>%
  mutate(vs_X_cyl = vs * cyl, wt_sqrt = sqrt(wt)) %>%
  dplyr::select(hp, vs, vs_X_cyl, wt_sqrt)

m <- cor(data)  # pairwise correlations among the model terms

corrplot.mixed(m, lower.col = "black", number.cex = .7)  # visualise the correlation matrix
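
The same correlation matrix can also be turned into variance inflation factors, a common summary of multicollinearity; a one-liner that reuses the matrix m computed above:

round(diag(solve(m)), 2)  # VIF for each term: the diagonal of the inverse correlation matrix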

Conclusion

The linear model lm1 seems appropriate: all of the predictors are significant and the R-squared of about 0.85 is high. The outlier should be investigated to see whether it can justifiably be removed.
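
To follow up on the outlier, the observation with the largest residual and the most influential observation can be identified directly from the fitted model; a minimal sketch:

which.max(abs(resid(lm1)))      # car with the largest absolute residual
which.max(cooks.distance(lm1))  # car with the largest Cook's distance
plot(lm1, which = 4)            # Cook's distance plot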