Week 13 Discussion - Multiple Regression using R

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

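# Required packages and data (carprice is assumed to be loaded already)
library(dplyr); library(ggplot2); library(knitr); library(kableExtra)
carprice2 <- carprice %>% select(fueltype, carbody, horsepower, citympg, highwaympg, price)
print(carprice2)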
## # A tibble: 205 x 6
##    fueltype carbody     horsepower citympg highwaympg  price
##    <chr>    <chr>            <dbl>   <dbl>      <dbl>  <dbl>
##  1 gas      convertible        111      21         27 13495 
##  2 gas      convertible        111      21         27 16500 
##  3 gas      hatchback          154      19         26 16500 
##  4 gas      sedan              102      24         30 13950 
##  5 gas      sedan              115      18         22 17450 
##  6 gas      sedan              110      19         25 15250 
##  7 gas      sedan              110      19         25 17710 
##  8 gas      wagon              110      19         25 18920 
##  9 gas      sedan              140      17         20 23875 
## 10 gas      hatchback          160      16         22 17859.
## # ... with 195 more rows
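# Convert the categorical columns to factors
# (named factor_cols rather than list, to avoid masking base::list)
factor_cols <- c("fueltype", "carbody")
carprice2[, factor_cols] <- lapply(carprice2[, factor_cols], factor)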
glimpse(carprice2)
## Observations: 205
## Variables: 6
## $ fueltype   <fct> gas, gas, gas, gas, gas, gas, gas, gas, gas, gas, g...
## $ carbody    <fct> convertible, convertible, hatchback, sedan, sedan, ...
## $ horsepower <dbl> 111, 111, 154, 102, 115, 110, 110, 110, 140, 160, 1...
## $ citympg    <dbl> 21, 21, 19, 24, 18, 19, 19, 19, 17, 16, 23, 23, 21,...
## $ highwaympg <dbl> 27, 27, 26, 30, 22, 25, 25, 25, 20, 22, 29, 29, 28,...
## $ price      <dbl> 13495.00, 16500.00, 16500.00, 13950.00, 17450.00, 1...
head(carprice2) %>% kable() %>% kable_styling()
fueltype  carbody      horsepower  citympg  highwaympg  price
gas       convertible         111       21          27  13495
gas       convertible         111       21          27  16500
gas       hatchback           154       19          26  16500
gas       sedan               102       24          30  13950
gas       sedan               115       18          22  17450
gas       sedan               110       19          25  15250
model <- lm(price ~ fueltype + horsepower + citympg + highwaympg, data=carprice2)
summary(model)
## 
## Call:
## lm(formula = price ~ fueltype + horsepower + citympg + highwaympg, 
##     data = carprice2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8109.1 -2428.1  -178.4  1732.8 17327.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12443.08    3356.66   3.707 0.000272 ***
## fueltypegas -6576.79    1050.48  -6.261 2.29e-09 ***
## horsepower    142.16      12.39  11.475  < 2e-16 ***
## citympg       259.17     210.44   1.232 0.219551    
## highwaympg   -473.77     184.34  -2.570 0.010894 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4162 on 200 degrees of freedom
## Multiple R-squared:  0.7339, Adjusted R-squared:  0.7286 
## F-statistic: 137.9 on 4 and 200 DF,  p-value: < 2.2e-16

Using backward elimination, let’s remove citympg, the only predictor that is not significant at the 0.05 level (p = 0.22), and refit the model.

model <- lm(price ~ fueltype + horsepower + highwaympg, data=carprice2)
summary(model)
## 
## Call:
## lm(formula = price ~ fueltype + horsepower + highwaympg, data = carprice2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8035.2 -2271.2  -458.4  1717.3 18017.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13404.04    3268.91   4.100 5.98e-05 ***
## fueltypegas -6978.90     999.73  -6.981 4.18e-11 ***
## horsepower    136.69      11.58  11.804  < 2e-16 ***
## highwaympg   -262.14      66.83  -3.922  0.00012 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4167 on 201 degrees of freedom
## Multiple R-squared:  0.7319, Adjusted R-squared:  0.7279 
## F-statistic: 182.9 on 3 and 201 DF,  p-value: < 2.2e-16

Removing citympg did not improve the fit (adjusted R-squared moved from 0.7286 to 0.7279), but it cost almost nothing, so we keep the simpler model. The model explains 73.19% of the variance in price, which is a reasonably good fit. The F-statistic (182.9 on 3 and 201 degrees of freedom, p < 2.2e-16) indicates that at least one of the predictors has a nonzero coefficient.

According to the coefficient table, fueltypegas and highwaympg have negative coefficients, while horsepower has a positive one. Holding the other predictors fixed, each additional unit of horsepower is associated with a price about 136.69 higher, which makes sense: more powerful vehicles cost more. Fueltype is a factor with two categories (gas and diesel), with diesel as the reference level, so a gas vehicle is predicted to cost about 6978.90 less than a diesel vehicle with the same horsepower and highway MPG. Each additional highway mile per gallon is associated with a price about 262.14 lower. That looks odd at first, but a plausible reading is that expensive, high-powered vehicles tend to be less fuel efficient, and that other factors not in the model also contribute to price. The intercept (13404.04) is the predicted price of a diesel vehicle with zero horsepower and zero highway MPG, so it has no practical meaning on its own.
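To make the coefficient interpretation concrete, here is a small worked prediction. It is only a sketch: the coefficients are copied by hand from the second summary(model) output, the example vehicle is the first row of the data shown earlier (gas, 111 horsepower, 27 highway MPG, price 13495), and the names b, car and predicted_gas are made up for illustration.

```r
# Coefficients copied from the second summary(model) output (rounded as printed)
b <- c(intercept = 13404.04, fueltypegas = -6978.90,
       horsepower = 136.69, highwaympg = -262.14)

# First row of the data: gas vehicle, 111 horsepower, 27 highway MPG
car <- c(intercept = 1, fueltypegas = 1, horsepower = 111, highwaympg = 27)

# Predicted price = sum of coefficient * predictor value
predicted_gas <- sum(b * car)
print(predicted_gas)            # approximately 14519.95

# Same vehicle as a diesel: the dummy variable fueltypegas becomes 0
car_diesel <- car
car_diesel["fueltypegas"] <- 0
predicted_diesel <- sum(b * car_diesel)
print(predicted_diesel)         # approximately 21498.85

# Residual for the first row: observed price minus predicted price
print(13495 - predicted_gas)    # approximately -1024.95
```

With the fitted model object available, predict(model, newdata = ...) should give the same numbers up to rounding of the printed coefficients.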

ggplot(carprice2, aes(horsepower, price))+geom_point()+geom_smooth(method="lm", se=FALSE)+labs(title="Horsepower vs price of vehicle")
## `geom_smooth()` using formula 'y ~ x'

ggplot(carprice2, aes(fueltype, price))+geom_point()+geom_smooth(method="lm", se =FALSE)+labs(title="Fuel type (GAS) vs price of vehicle")
## `geom_smooth()` using formula 'y ~ x'

ggplot(carprice2, aes(highwaympg, price))+geom_point()+geom_smooth(method="lm", se=FALSE)+labs(title="Vehicle's price VS Highway MPG")
## `geom_smooth()` using formula 'y ~ x'

Now let’s check the residuals of the model.

qqnorm(model$residuals)
qqline(model$residuals)

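The residuals are not normally distributed: on the right side of the Q-Q plot the points fall away from the reference line. This agrees with the residual summary above, where the largest positive residual (18017.9) is more than twice the size of the largest negative one (-8035.2). The normality assumption of the linear model is therefore questionable for this data.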
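The assignment asks for a quadratic term and a dichotomous-by-quantitative interaction, which the model above does not include. The sketch below shows one way such a model could be specified and checked. It runs on a synthetic data frame (sim) that only borrows the column names, so that the block is runnable on its own; the simulated values and coefficients are invented and say nothing about the real car prices. With the real data, one would drop the simulation step and pass data = carprice2 instead. The choice of horsepower for the quadratic term and fueltype by highwaympg for the interaction is an assumption, not something the original analysis establishes.

```r
# Synthetic stand-in data (hypothetical values, real column names only)
set.seed(42)
n <- 205
sim <- data.frame(
  fueltype   = factor(sample(c("diesel", "gas"), n, replace = TRUE,
                             prob = c(0.1, 0.9))),
  horsepower = round(runif(n, 50, 250))
)
sim$highwaympg <- round(55 - 0.12 * sim$horsepower + rnorm(n, sd = 3))
sim$price <- 5000 + 60 * sim$horsepower + 0.2 * sim$horsepower^2 -
  4000 * (sim$fueltype == "gas") + rnorm(n, sd = 3000)

# I(horsepower^2) adds the quadratic term; fueltype * highwaympg expands to
# fueltype + highwaympg + fueltype:highwaympg (main effects plus interaction)
model2 <- lm(price ~ horsepower + I(horsepower^2) + fueltype * highwaympg,
             data = sim)
print(summary(model2))

# Residuals vs fitted values: curvature or a funnel shape would suggest
# a missing nonlinear term or non-constant variance
plot(fitted(model2), resid(model2),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Shapiro-Wilk test as a numeric companion to the Q-Q plot
print(shapiro.test(resid(model2)))
```

In the output, the fueltypegas:highwaympg coefficient is the difference in the highwaympg slope between gas and diesel vehicles, and the I(horsepower^2) coefficient describes how the horsepower effect bends as horsepower increases.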