Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

We will use mtcars dataset which was extracted from the 1974 Motor Trend US magazine, and includes fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa

Below is a brief description of the variables in the mtcars data set:

  • mpg Miles/(US) gallon
  • cyl Number of cylinders
  • disp Displacement (cu.in.)
  • hp Gross horsepower
  • drat Rear axle ratio
  • wt Weight (1000 lbs)
  • qsec 1/4 mile time
  • vs Engine (0 = V-shaped, 1 = straight)
  • am Transmission (0 = automatic, 1 = manual)
  • gear Number of forward gears
data(mtcars)

glimpse(mtcars)
## Observations: 32
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 1…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 18…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 1…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2…

Lets first create a nice pairwise scatter plots. It shows the relationship between all the variables in this data set.

ggpairs(mtcars)

Now we will transform qsec in dichotomous term using its median value.

# dichotomous term
mtcars$qsec <- ifelse(mtcars$qsec >= median(mtcars$qsec), 1, 0)
mtcars
##                      mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620    0  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875    0  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320    1  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215    1  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440    0  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460    1  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570    0  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190    1  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150    1  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440    1  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440    1  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070    0  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730    0  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780    1  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250    1  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424    1  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345    0  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200    1  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615    1  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835    1  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465    1  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520    0  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435    0  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840    0  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845    0  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935    1  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140    0  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513    0  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170    0  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770    0  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570    0  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780    1  1  1    4    2

In next step we will create model. We have used poly() function with degree 2 for dataset variable hp. wt is a continuous variable and qsec is made dichotomous in previous step.

rel <- mpg ~ poly(hp,2, raw = TRUE) + wt + qsec
model <- lm(rel, mtcars)
summary(model)
## 
## Call:
## lm(formula = rel, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5004 -1.3725 -0.2550  0.8877  5.1543 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              42.7610200  2.7340198  15.640 4.66e-15 ***
## poly(hp, 2, raw = TRUE)1 -0.1412655  0.0432542  -3.266  0.00297 ** 
## poly(hp, 2, raw = TRUE)2  0.0002687  0.0001010   2.661  0.01296 *  
## wt                       -2.5957590  0.8059383  -3.221  0.00332 ** 
## qsec                     -1.2050580  1.2243371  -0.984  0.33373    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.388 on 27 degrees of freedom
## Multiple R-squared:  0.8632, Adjusted R-squared:  0.843 
## F-statistic: 42.61 on 4 and 27 DF,  p-value: 2.738e-11

A p-value is used in hypothesis testing to support or reject the null hypothesis. The smaller the p-value, the stronger the evidence that one should reject the null hypothesis. In our case it is less than 0.05 so we reject Null hypothesis described above.

The R2 value is a measure for how close our data is to the linear regression model. In this case the R2 value is 0.8632 which means our data fits the linear regression model. Simialrly higher F-statistic value indicates close linear relationship.

# residuals
model %>% 
  ggplot(aes(fitted(model), resid(model))) +
  geom_point() +
  geom_smooth(method = lm, se =F) +
  labs(title = "Residual Analysis",
       x = "Fitted Line", y = "Residuals") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

# residuals histogram
hist(model$residuals, xlab = "Residuals", ylab = "")

# qq plot
model %>% 
  ggplot(aes(sample = resid(model))) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot") +
  theme_minimal()

Conclusion

As described above, seeing the p-value, R-squared value and all the above graphs we can conclude the model seems pretty good.