Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
We use a sample cars dataset to fit a multiple linear regression model.
The data contain the response, Price, and six candidate predictors: Mileage, Type, Cylinder, Liter, Doors, and Leather.
We use these predictors to build a model that predicts the price of a car.
## Observations: 126
## Variables: 7
## $ Price <dbl> 37510.25, 37215.17, 36332.89, 36245.16, 32954.14, 32537.19...
## $ Mileage <int> 21593, 22211, 25153, 26250, 36074, 41829, 6447, 10555, 119...
## $ Type <fct> Sedan, Sedan, Sedan, Sedan, Sedan, Sedan, Sedan, Sedan, Se...
## $ Cylinder <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 6...
## $ Liter <dbl> 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6...
## $ Doors <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ Leather <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## Price Mileage Type Cylinder
## Min. :22245 Min. : 583 Convertible:50 Min. :4.000
## 1st Qu.:29338 1st Qu.:14050 Sedan :76 1st Qu.:4.000
## Median :33370 Median :21237 Median :4.000
## Mean :35667 Mean :20257 Mean :5.619
## 3rd Qu.:38275 3rd Qu.:25776 3rd Qu.:8.000
## Max. :70755 Max. :50387 Max. :8.000
## Liter Doors Leather
## Min. :2.000 Min. :2.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.0000
## Median :2.300 Median :4.000 Median :1.0000
## Mean :3.211 Mean :3.206 Mean :0.7698
## 3rd Qu.:4.600 3rd Qu.:4.000 3rd Qu.:1.0000
## Max. :6.000 Max. :4.000 Max. :1.0000
## 'data.frame': 126 obs. of 7 variables:
## $ Price : num 37510 37215 36333 36245 32954 ...
## $ Mileage : int 21593 22211 25153 26250 36074 41829 6447 10555 11975 13449 ...
## $ Type : Factor w/ 2 levels "Convertible",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Cylinder: int 8 8 8 8 8 8 8 8 8 8 ...
## $ Liter : num 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.6 ...
## $ Doors : int 4 4 4 4 4 4 4 4 4 4 ...
## $ Leather : int 1 1 1 1 1 1 1 1 1 1 ...
Checking for missing values in each column of the dataset:
## Price Mileage Type Cylinder Liter Doors Leather
## 0 0 0 0 0 0 0
There are no missing values in the dataset, so no rows need to be dropped or imputed before fitting the model.
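A per-column count of missing values like the one above can be produced with `colSums(is.na(...))`. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in, not the cars data) so that it runs on its own:

```r
# Toy data frame standing in for the cars data (values are made up)
toy <- data.frame(
  Price   = c(37510.25, NA, 36332.89),
  Mileage = c(21593L, 22211L, NA),
  Leather = c(1L, 1L, 0L)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

When every count is zero, as in the cars data, `lm()` uses all rows.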
Next, the categorical variable Type is encoded as a 0/1 dummy, with Sedan = 0 and Convertible = 1.
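The encoding step itself is not shown in this write-up. One way to do it, sketched on a made-up factor with the same two levels (the names `type` and `type_dummy` are illustrative):

```r
# Toy factor with the same two levels as cars$Type (values are made up)
type <- factor(c("Sedan", "Convertible", "Sedan"),
               levels = c("Convertible", "Sedan"))

# Recode to a 0/1 dummy: Sedan = 0, Convertible = 1
type_dummy <- as.integer(type == "Convertible")
print(type_dummy)
```

Comparing against the level name, rather than using `as.integer(type)`, avoids depending on the internal order of the factor levels.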
par(mfrow=c(1, 2))
plot(density(cars$Price), main="Density Plot: Price", ylab="Density")
plot(density(cars$Mileage), main="Density Plot: Mileage", ylab="Density")
The density plot for Mileage looks roughly symmetric and bell-shaped, whereas the plot for Price is skewed to the right.
In this section we fit a simple linear regression of Price on Mileage and examine the correlation between the two variables to see whether they are related.
##
## Call:
## lm(formula = Price ~ Mileage, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12243 -5918 -2494 2660 29346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42417.69521 1986.64967 21.351 < 2e-16 ***
## Mileage -0.33327 0.08869 -3.758 0.000263 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9518 on 124 degrees of freedom
## Multiple R-squared: 0.1022, Adjusted R-squared: 0.095
## F-statistic: 14.12 on 1 and 124 DF, p-value: 0.0002626
##
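For a one-predictor model, R-squared equals the squared Pearson correlation, so the R-squared of 0.1022 above corresponds to a correlation of about -0.32 between Price and Mileage (negative because the slope is negative). A small check of that identity on made-up data (the names `x` and `y` are illustrative and the values are simulated, not taken from the cars data):

```r
set.seed(123)
x <- rnorm(50, mean = 20000, sd = 8000)        # made-up mileage-like values
y <- 42000 - 0.3 * x + rnorm(50, sd = 9000)    # made-up price-like values

fit <- lm(y ~ x)
r  <- cor(x, y)                # Pearson correlation
r2 <- summary(fit)$r.squared   # R-squared of the simple regression

# The two quantities agree up to floating-point error
print(all.equal(r^2, r2))
```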
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 5.152718
## Prob > Chi2 = 0.02321002
First, the low p-value (0.023) in the Breusch-Pagan test lets us reject the null hypothesis of constant variance at the 5% level, so the residuals of this simple model show evidence of heteroskedasticity. The histogram of the residuals is nearly normal with a slight skew, and the Q-Q plot shows evidence of outliers in the data set. We can take a closer look at the constant-variance check.
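To make the test less of a black box, here is a minimal sketch of the classical score-test form of the Breusch-Pagan statistic, computed by hand in base R on simulated data. Whether this matches exactly the routine that produced the output above is an assumption; the variable names are illustrative.

```r
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
# Error spread grows with x, so the variance is non-constant by construction
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)
e2  <- residuals(fit)^2
u   <- e2 / mean(e2)           # squared residuals scaled by the ML variance estimate
aux <- lm(u ~ fitted(fit))     # auxiliary regression on the fitted values

# Score statistic: half the explained sum of squares of the auxiliary regression
ess     <- sum((fitted(aux) - mean(u))^2)
bp_stat <- ess / 2
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p.value = bp_p))
```

A small p-value means the squared residuals vary systematically with the fitted values, which is what the test above reports for the Price ~ Mileage model.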
ml = lm(Price ~ Mileage + Type + Cylinder + Liter + Doors + Leather + q + dq, data = cars)
summary(ml)
##
## Call:
## lm(formula = Price ~ Mileage + Type + Cylinder + Liter + Doors +
## Leather + q + dq, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6042.6 -2068.0 -60.9 1875.2 5785.8
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43871.2373 8883.4544 -4.939 0.000002618689 ***
## Mileage -0.3002 0.0287 -10.460 < 2e-16 ***
## Type -13258.2586 1919.1940 -6.908 0.000000000266 ***
## Cylinder 33731.9881 3417.8461 9.869 < 2e-16 ***
## Liter -13716.8991 949.0691 -14.453 < 2e-16 ***
## Doors NA NA NA NA
## Leather 2122.2078 746.3827 2.843 0.00526 **
## q -1895.5566 269.7468 -7.027 0.000000000146 ***
## dq 4637.5116 346.6670 13.377 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3059 on 118 degrees of freedom
## Multiple R-squared: 0.9118, Adjusted R-squared: 0.9065
## F-statistic: 174.2 on 7 and 118 DF, p-value: < 2.2e-16
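The definitions of q and dq are not shown in this write-up. Given the assignment, q is presumably the quadratic term and dq the dichotomous-by-quantitative interaction term, but which variables they are built from is unknown. The sketch below shows how such terms can be constructed, using made-up data and hypothetical names (`x` for a quantitative predictor, `d` for a 0/1 dummy):

```r
set.seed(42)
n <- 120
# Made-up data: x is a quantitative predictor, d is a 0/1 dummy
x <- runif(n, 0, 10)
d <- rbinom(n, 1, 0.4)
y <- 3 + 1.5 * x - 0.2 * x^2 + 4 * d + 0.8 * d * x + rnorm(n)
toy <- data.frame(y, x, d)

# Explicit columns, in the style of the q and dq terms
toy$q  <- toy$x^2         # quadratic term
toy$dq <- toy$d * toy$x   # dummy-by-quantitative interaction
fit_cols <- lm(y ~ x + d + q + dq, data = toy)

# The same model written with formula operators instead of extra columns
fit_formula <- lm(y ~ x + d + I(x^2) + d:x, data = toy)

print(coef(fit_cols))
```

In a model of this shape, the coefficient on the interaction column is the difference in the slope of `x` between the `d = 1` and `d = 0` groups, and the coefficient on `d` is the difference in intercepts at `x = 0`.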
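In the summary, the coefficient for Doors is NA ("1 not defined because of singularities"). This does not mean Doors is insignificant; it means Doors is perfectly collinear with the other predictors. That is consistent with the data summary, where the Doors values match every Sedan having 4 doors and every Convertible having 2, so Doors carries no information beyond Type. We therefore remove Doors from the model; the remaining estimates are unchanged.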
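An NA coefficient of this kind can be diagnosed with `alias()`, which reports predictors that are exact linear combinations of others. A minimal sketch on made-up data in which `doors` is fully determined by `type` (all names and values are illustrative):

```r
# Made-up data in which doors is fully determined by type
toy <- data.frame(
  type  = c(0, 0, 0, 1, 1, 1, 0, 1),
  miles = c(21, 25, 36, 6, 10, 12, 30, 15),
  price = c(37, 36, 33, 45, 44, 42, 34, 41)
)
toy$doors <- ifelse(toy$type == 1, 2, 4)  # convertibles 2 doors, sedans 4

fit <- lm(price ~ miles + type + doors, data = toy)
print(coef(fit))             # doors is NA: it adds nothing beyond type
print(alias(fit)$Complete)   # doors expressed via the other model terms

# A cross-tabulation makes the one-to-one relationship visible
print(table(type = toy$type, doors = toy$doors))
```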
# Refitting the model without Doors
ml2 = lm(Price ~ Mileage + Type + Cylinder + Liter + Leather + q + dq, data = cars)
summary(ml2)
##
## Call:
## lm(formula = Price ~ Mileage + Type + Cylinder + Liter + Leather +
## q + dq, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6042.6 -2068.0 -60.9 1875.2 5785.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43871.2373 8883.4544 -4.939 0.000002618689 ***
## Mileage -0.3002 0.0287 -10.460 < 2e-16 ***
## Type -13258.2586 1919.1940 -6.908 0.000000000266 ***
## Cylinder 33731.9881 3417.8461 9.869 < 2e-16 ***
## Liter -13716.8991 949.0691 -14.453 < 2e-16 ***
## Leather 2122.2078 746.3827 2.843 0.00526 **
## q -1895.5566 269.7468 -7.027 0.000000000146 ***
## dq 4637.5116 346.6670 13.377 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3059 on 118 degrees of freedom
## Multiple R-squared: 0.9118, Adjusted R-squared: 0.9065
## F-statistic: 174.2 on 7 and 118 DF, p-value: < 2.2e-16
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.2857996, Df = 1, p = 0.59292
## lag Autocorrelation D-W Statistic p-value
## 1 0.8420812 0.2909659 0
## Alternative hypothesis: rho != 0
##
## Call:
## lm(formula = Price ~ Mileage + Type + Cylinder + Liter + Leather +
## q + dq, data = cars)
##
## Coefficients:
## (Intercept) Mileage Type Cylinder Liter Leather
## -43871.2373 -0.3002 -13258.2586 33731.9882 -13716.8991 2122.2078
## q dq
## -1895.5566 4637.5115
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = ml2)
##
## Value p-value Decision
## Global Stat 43.2507 0.0000000091797 Assumptions NOT satisfied!
## Skewness 0.5275 0.4676577114265 Assumptions acceptable.
## Kurtosis 2.3188 0.1278197794003 Assumptions acceptable.
## Link Function 40.1441 0.0000000002359 Assumptions NOT satisfied!
## Heteroscedasticity 0.2603 0.6099355362465 Assumptions acceptable.
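The Durbin-Watson statistic reported above (0.29, with lag-1 autocorrelation 0.84) depends on the order of the rows, so a value this far below 2 may partly reflect how the data file is sorted. The statistic is simple to compute by hand; the sketch below does so on simulated data with autocorrelated errors (all names and values are illustrative):

```r
set.seed(7)
n <- 100
x <- seq_len(n)
# Made-up AR(1) errors with strong positive autocorrelation
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 2 + 0.5 * x + e

fit <- lm(y ~ x)
r <- residuals(fit)

# Durbin-Watson statistic: near 2 means no lag-1 autocorrelation,
# values well below 2 indicate positive autocorrelation
dw <- sum(diff(r)^2) / sum(r^2)

# Lag-1 sample autocorrelation of the residuals
rho1 <- sum(r[-1] * r[-n]) / sum(r^2)
print(c(dw = dw, rho1 = rho1))
```

The two quantities are linked approximately by DW ≈ 2(1 - rho1), which is why a lag-1 autocorrelation of 0.84 goes with a statistic near 0.3.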
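The residuals are scattered fairly uniformly about zero, and both the non-constant variance score test (p = 0.59) and the heteroscedasticity check in the global test (p = 0.61) are consistent with constant variance.
The Q-Q plot shows that the residuals follow the reference line, with some outliers at the tails.
The R-squared value is 0.9118, meaning the model explains 91.18% of the variability in Price (adjusted R-squared 0.9065). The residual plot shows mostly constant variability and no obvious pattern. Two diagnostics call for caution, however: the Durbin-Watson statistic (0.29, lag-1 autocorrelation 0.84) indicates strongly autocorrelated residuals in the current row order, and the global test rejects the link function assumption, which suggests the linear form may be misspecified. Overall, the multiple linear model (ml2) appears adequate for prediction, but these two results should be kept in mind when interpreting its coefficients and standard errors.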