Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
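# Required packages and data (carprice is assumed to be loaded already)
library(tidyverse); library(knitr); library(kableExtra)
carprice2 <- carprice %>% select(fueltype, carbody, horsepower, citympg, highwaympg, price)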
print(carprice2)
## # A tibble: 205 x 6
## fueltype carbody horsepower citympg highwaympg price
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 gas convertible 111 21 27 13495
## 2 gas convertible 111 21 27 16500
## 3 gas hatchback 154 19 26 16500
## 4 gas sedan 102 24 30 13950
## 5 gas sedan 115 18 22 17450
## 6 gas sedan 110 19 25 15250
## 7 gas sedan 110 19 25 17710
## 8 gas wagon 110 19 25 18920
## 9 gas sedan 140 17 20 23875
## 10 gas hatchback 160 16 22 17859.
## # ... with 195 more rows
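# Convert the categorical columns to factors
# (named factor_cols rather than list, which would mask base::list)
factor_cols <- c("fueltype", "carbody")
carprice2[, factor_cols] <- lapply(carprice2[, factor_cols], factor)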
glimpse(carprice2)
## Observations: 205
## Variables: 6
## $ fueltype <fct> gas, gas, gas, gas, gas, gas, gas, gas, gas, gas, g...
## $ carbody <fct> convertible, convertible, hatchback, sedan, sedan, ...
## $ horsepower <dbl> 111, 111, 154, 102, 115, 110, 110, 110, 140, 160, 1...
## $ citympg <dbl> 21, 21, 19, 24, 18, 19, 19, 19, 17, 16, 23, 23, 21,...
## $ highwaympg <dbl> 27, 27, 26, 30, 22, 25, 25, 25, 20, 22, 29, 29, 28,...
## $ price <dbl> 13495.00, 16500.00, 16500.00, 13950.00, 17450.00, 1...
head(carprice2) %>% kable() %>% kable_styling()
fueltype | carbody | horsepower | citympg | highwaympg | price |
---|---|---|---|---|---|
gas | convertible | 111 | 21 | 27 | 13495 |
gas | convertible | 111 | 21 | 27 | 16500 |
gas | hatchback | 154 | 19 | 26 | 16500 |
gas | sedan | 102 | 24 | 30 | 13950 |
gas | sedan | 115 | 18 | 22 | 17450 |
gas | sedan | 110 | 19 | 25 | 15250 |
model <- lm(price ~ fueltype + horsepower + citympg + highwaympg, data=carprice2)
summary(model)
##
## Call:
## lm(formula = price ~ fueltype + horsepower + citympg + highwaympg,
## data = carprice2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8109.1 -2428.1 -178.4 1732.8 17327.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12443.08 3356.66 3.707 0.000272 ***
## fueltypegas -6576.79 1050.48 -6.261 2.29e-09 ***
## horsepower 142.16 12.39 11.475 < 2e-16 ***
## citympg 259.17 210.44 1.232 0.219551
## highwaympg -473.77 184.34 -2.570 0.010894 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4162 on 200 degrees of freedom
## Multiple R-squared: 0.7339, Adjusted R-squared: 0.7286
## F-statistic: 137.9 on 4 and 200 DF, p-value: < 2.2e-16
citympg is not significant in this model (p = 0.22), so, following backward elimination, let’s remove it and refit the model.
model <- lm(price ~ fueltype + horsepower + highwaympg, data=carprice2)
summary(model)
##
## Call:
## lm(formula = price ~ fueltype + horsepower + highwaympg, data = carprice2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8035.2 -2271.2 -458.4 1717.3 18017.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13404.04 3268.91 4.100 5.98e-05 ***
## fueltypegas -6978.90 999.73 -6.981 4.18e-11 ***
## horsepower 136.69 11.58 11.804 < 2e-16 ***
## highwaympg -262.14 66.83 -3.922 0.00012 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4167 on 201 degrees of freedom
## Multiple R-squared: 0.7319, Adjusted R-squared: 0.7279
## F-statistic: 182.9 on 3 and 201 DF, p-value: < 2.2e-16
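The elimination step above was carried out by hand from the p-values. As a rough sketch of how the same check could be scripted, `drop1()` reports an F test for removing each term in turn, and `anova()` compares the reduced and full fits. The stand-in data frame is an assumption added only so the block runs on its own; its values are simulated and say nothing about the real car data. The names `full`, `reduced` and `drop_table` are illustrative.

```r
# Stand-in data with the same column names, used only if carprice2 is not loaded.
# The values are simulated for illustration and carry no meaning.
if (!exists("carprice2")) {
  set.seed(1)
  n <- 205
  carprice2 <- data.frame(
    fueltype   = factor(rep(c("diesel", "gas"), times = c(20, n - 20))),
    horsepower = round(runif(n, 50, 250)),
    citympg    = round(runif(n, 13, 49))
  )
  carprice2$highwaympg <- carprice2$citympg + round(runif(n, 2, 8))
  carprice2$price <- 5000 + 120 * carprice2$horsepower + rnorm(n, sd = 3000)
}

full <- lm(price ~ fueltype + horsepower + citympg + highwaympg, data = carprice2)

# One row per term: the F test for dropping that term from the full model
drop_table <- drop1(full, test = "F")
print(drop_table)

# Refit without citympg; update() keeps the rest of the formula unchanged
reduced <- update(full, . ~ . - citympg)

# Nested-model F test: does citympg add anything beyond the reduced model?
print(anova(reduced, full))
```

When only one term is removed, the `anova()` comparison gives the same p-value as the t test for that term in `summary(full)`.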
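Removing citympg did not improve the fit: adjusted R-squared moved from 0.7286 to 0.7279 and the residual standard error from 4162 to 4167. Because citympg was not significant in the first model, we still leave it out and keep the simpler model. The reduced model explains 73.19% of the variance in price, which is a reasonably good fit. The F-statistic (182.9 on 3 and 201 DF, p < 2.2e-16) shows that the predictors taken together explain significantly more variation than an intercept-only model, i.e. at least one coefficient differs from zero.
Each coefficient is interpreted holding the other predictors constant. Horsepower has a positive effect: each additional unit of horsepower is associated with a price about 136.69 higher, which makes sense because more powerful vehicles cost more. Fueltype is a factor with two levels (diesel and gas), and diesel is the reference level, so the fueltypegas coefficient means a gas vehicle is predicted to cost about 6978.90 less than a diesel vehicle with the same horsepower and highway mpg. Highwaympg has a negative effect: each additional highway mpg is associated with a price about 262.14 lower. This looks counterintuitive at first, but more powerful and more expensive vehicles tend to be less fuel efficient, so highwaympg is probably picking up other characteristics of cheaper cars that horsepower alone does not capture. The intercept (13404.04) is the predicted price of a diesel vehicle with zero horsepower and zero highway mpg, which lies outside the range of the data and has no practical meaning.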
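To make the coefficients concrete, the sketch below plugs the rounded estimates from the reduced model's summary into the regression equation for a car like the first row of the data (gas, 111 horsepower, 27 highway mpg). The helper `predict_price()` is a hypothetical name introduced for illustration. Because the coefficients are rounded, the result may differ slightly from what `predict()` would return on the fitted model.

```r
# Coefficients copied (rounded) from the summary() output of the reduced model
b <- c(intercept = 13404.04, fueltypegas = -6978.90,
       horsepower = 136.69, highwaympg = -262.14)

# Hypothetical helper: gas is 1 for a gas vehicle and 0 for diesel (the reference level)
predict_price <- function(gas, horsepower, highwaympg) {
  b[["intercept"]] + b[["fueltypegas"]] * gas +
    b[["horsepower"]] * horsepower + b[["highwaympg"]] * highwaympg
}

gas_price    <- predict_price(gas = 1, horsepower = 111, highwaympg = 27)
diesel_price <- predict_price(gas = 0, horsepower = 111, highwaympg = 27)

gas_price                 # about 14519.95
gas_price - diesel_price  # -6978.90, the fueltypegas coefficient
13495 - gas_price         # residual for the first row, about -1024.95
```

The difference between the two predictions is exactly the fueltypegas coefficient, which is what "holding the other predictors constant" means in practice.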
ggplot(carprice2, aes(horsepower, price))+geom_point()+geom_smooth(method="lm", se=FALSE)+labs(title="Horsepower vs price of vehicle")
## `geom_smooth()` using formula 'y ~ x'
ggplot(carprice2, aes(fueltype, price))+geom_point()+geom_smooth(method="lm", se =FALSE)+labs(title="Fuel type (GAS) vs price of vehicle")
## `geom_smooth()` using formula 'y ~ x'
ggplot(carprice2, aes(highwaympg, price))+geom_point()+geom_smooth(method="lm", se=FALSE)+labs(title="Vehicle's price VS Highway MPG")
## `geom_smooth()` using formula 'y ~ x'
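The assignment calls for a quadratic term and a dichotomous-by-quantitative interaction, which the model fitted above does not contain. A minimal sketch of one way to add them is shown here, using horsepower squared and a fueltype-by-horsepower interaction. The choice of terms is an assumption, not a result from this analysis. The stand-in data frame exists only so the block runs on its own; its values are simulated and carry no meaning. The names `base_model` and `ext_model` are illustrative.

```r
# Stand-in data with the same column names, used only if carprice2 is not loaded.
# The values are simulated for illustration and carry no meaning.
if (!exists("carprice2")) {
  set.seed(1)
  n <- 205
  carprice2 <- data.frame(
    fueltype   = factor(rep(c("diesel", "gas"), times = c(20, n - 20))),
    horsepower = round(runif(n, 50, 250)),
    highwaympg = round(runif(n, 16, 54))
  )
  carprice2$price <- 5000 + 120 * carprice2$horsepower + rnorm(n, sd = 3000)
}

base_model <- lm(price ~ fueltype + horsepower + highwaympg, data = carprice2)

# fueltype * horsepower expands to both main effects plus fueltype:horsepower.
# I() protects ^ so that it squares horsepower instead of acting as a formula operator.
ext_model <- lm(price ~ fueltype * horsepower + I(horsepower^2) + highwaympg,
                data = carprice2)
print(summary(ext_model))

# Nested-model F test: do the two extra terms improve on the base model?
print(anova(base_model, ext_model))
```

In such a model the `fueltypegas:horsepower` coefficient is the difference in the horsepower slope between gas and diesel vehicles. The horsepower effect itself is no longer a single number, because the slope changes with horsepower through the squared term.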
Now let’s check the residuals of the model.
qqnorm(model$residuals)
qqline(model$residuals)
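The residuals are not normally distributed: in the upper (right) tail of the Q-Q plot the points fall away from the reference line. This agrees with the residual summary, where the largest residual (18017.9) is much further from zero than the smallest (-8035.2), suggesting a right-skewed distribution. The normality assumption of the linear model is therefore questionable for this data.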
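A Q-Q plot covers only the normality assumption. As a sketch of further checks, the block below plots residuals against fitted values (to look for curvature or non-constant spread) and runs a Shapiro-Wilk test. It also refits the model with `log(price)` as the response, a common remedy for right-skewed residuals that is offered here as an assumption and was not part of the original analysis. The stand-in data frame exists only so the block runs on its own; its values are simulated and carry no meaning.

```r
# Stand-in data with the same column names, used only if carprice2 is not loaded.
# The values are simulated for illustration and carry no meaning.
if (!exists("carprice2")) {
  set.seed(1)
  n <- 205
  carprice2 <- data.frame(
    fueltype   = factor(rep(c("diesel", "gas"), times = c(20, n - 20))),
    horsepower = round(runif(n, 50, 250)),
    highwaympg = round(runif(n, 16, 54))
  )
  carprice2$price <- 5000 + 120 * carprice2$horsepower + rnorm(n, sd = 3000)
}

model <- lm(price ~ fueltype + horsepower + highwaympg, data = carprice2)
res <- residuals(model)

# Residuals vs fitted: a funnel shape suggests non-constant variance,
# a curved band suggests a missing non-linear term
plot(fitted(model), res,
     xlab = "Fitted price", ylab = "Residual", main = "Residuals vs fitted values")
abline(h = 0, lty = 2)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(res)
print(sw)

# Possible remedy (an assumption): model log(price) and re-check the Q-Q plot
log_model <- lm(log(price) ~ fueltype + horsepower + highwaympg, data = carprice2)
qqnorm(residuals(log_model))
qqline(residuals(log_model))
```

If the log model is used, its coefficients describe approximate proportional changes in price instead of absolute ones, so they are not directly comparable with the estimates above.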