QUESTION

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

ANSWER

Dataset

We use a sample cars dataset to fit a multiple linear regression model.

The response variable is Price. The data has the following predictor variables:

  • Mileage
  • Type
  • Cylinder
  • Liter
  • Doors
  • Leather

We use these predictors to build a model that predicts the price of a car. The model includes the following required terms:

  • Quadratic term - Cylinder squared
  • Dichotomous term - Type (Sedan or Convertible)
  • Dichotomous vs. quantitative interaction term - Type × Cylinder

## Observations: 126
## Variables: 7
## $ Price    <dbl> 37510.25, 37215.17, 36332.89, 36245.16, 32954.14, 32537.19...
## $ Mileage  <int> 21593, 22211, 25153, 26250, 36074, 41829, 6447, 10555, 119...
## $ Type     <fct> Sedan, Sedan, Sedan, Sedan, Sedan, Sedan, Sedan, Sedan, Se...
## $ Cylinder <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 6...
## $ Liter    <dbl> 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6...
## $ Doors    <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ Leather  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

Dataset Exploration

##      Price          Mileage               Type       Cylinder    
##  Min.   :22245   Min.   :  583   Convertible:50   Min.   :4.000  
##  1st Qu.:29338   1st Qu.:14050   Sedan      :76   1st Qu.:4.000  
##  Median :33370   Median :21237                    Median :4.000  
##  Mean   :35667   Mean   :20257                    Mean   :5.619  
##  3rd Qu.:38275   3rd Qu.:25776                    3rd Qu.:8.000  
##  Max.   :70755   Max.   :50387                    Max.   :8.000  
##      Liter           Doors          Leather      
##  Min.   :2.000   Min.   :2.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.0000  
##  Median :2.300   Median :4.000   Median :1.0000  
##  Mean   :3.211   Mean   :3.206   Mean   :0.7698  
##  3rd Qu.:4.600   3rd Qu.:4.000   3rd Qu.:1.0000  
##  Max.   :6.000   Max.   :4.000   Max.   :1.0000
## 'data.frame':    126 obs. of  7 variables:
##  $ Price   : num  37510 37215 36333 36245 32954 ...
##  $ Mileage : int  21593 22211 25153 26250 36074 41829 6447 10555 11975 13449 ...
##  $ Type    : Factor w/ 2 levels "Convertible",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Cylinder: int  8 8 8 8 8 8 8 8 8 8 ...
##  $ Liter   : num  4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.6 ...
##  $ Doors   : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ Leather : int  1 1 1 1 1 1 1 1 1 1 ...

Checking missing values in the dataset

##    Price  Mileage     Type Cylinder    Liter    Doors  Leather 
##        0        0        0        0        0        0        0

There are no missing values in the dataset, so no rows need to be dropped or imputed before modelling.
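A per-column count like the one above can be produced with `colSums(is.na(...))`. The sketch below is an illustration only: `cars_demo` is a hypothetical stand-in data frame with made-up rows, and it deliberately contains missing values so the counts are not all zero.

```r
# Hypothetical stand-in for the cars data; values are for illustration only.
cars_demo <- data.frame(
  Price   = c(37510.25, 37215.17, NA, 36245.16),
  Mileage = c(21593L, 22211L, 25153L, NA),
  Type    = factor(c("Sedan", "Sedan", "Convertible", "Sedan"))
)

# Number of missing values in each column.
na_counts <- colSums(is.na(cars_demo))
print(na_counts)

# Keep only rows with no missing values (one simple way to handle NAs).
complete_rows <- cars_demo[complete.cases(cars_demo), ]
print(nrow(complete_rows))
```

When every count is zero, as in the report, `complete.cases()` keeps all rows and no further handling is needed.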

We encode the categorical variable Type as a 0/1 indicator, with Sedan = 0 and Convertible = 1.
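One way to build such an indicator is to compare the factor against the level that should map to 1. This is a minimal sketch on a hypothetical vector `type_raw`, not the report's actual code.

```r
# Hypothetical factor with the two levels used in the report.
type_raw <- factor(c("Sedan", "Convertible", "Sedan", "Convertible"))

# TRUE/FALSE converts to 1/0, so Convertible becomes 1 and Sedan becomes 0.
type_dummy <- as.integer(type_raw == "Convertible")
print(type_dummy)
```

With this coding, Sedan is the reference category and the Type coefficient measures the Convertible minus Sedan difference.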

Build multiple regression model

In this section we first fit a simple linear regression of Price on Mileage and look at the correlation between the two variables to see whether they are related. We then test whether the residuals of this fit have constant variance.

Linear Regression Model

## 
## Call:
## lm(formula = Price ~ Mileage, data = cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12243  -5918  -2494   2660  29346 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) 42417.69521  1986.64967  21.351  < 2e-16 ***
## Mileage        -0.33327     0.08869  -3.758 0.000263 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9518 on 124 degrees of freedom
## Multiple R-squared:  0.1022, Adjusted R-squared:  0.095 
## F-statistic: 14.12 on 1 and 124 DF,  p-value: 0.0002626
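In a simple regression the R-squared equals the squared correlation between the two variables, so the reported 0.1022 corresponds to a correlation of about -0.32 (negative because the slope is negative). The sketch below checks that identity on simulated data. The data frame `demo` and its coefficients are assumptions chosen for illustration, not the cars data.

```r
# Simulated data with a negative Price-Mileage relationship (illustration only).
set.seed(1)
n <- 50
mileage <- runif(n, 500, 50000)
price <- 42000 - 0.33 * mileage + rnorm(n, sd = 5000)
demo <- data.frame(Price = price, Mileage = mileage)

# Simple linear regression of Price on Mileage.
fit <- lm(Price ~ Mileage, data = demo)

# Pearson correlation and the model's R-squared.
r <- cor(demo$Price, demo$Mileage)
r2 <- summary(fit)$r.squared

# The two agree: R-squared is the square of the correlation.
print(c(r_squared_from_cor = r^2, r_squared_from_lm = r2))
```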

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    5.152718 
##  Prob > Chi2   =    0.02321002

The Breusch-Pagan test gives p = 0.023, which is below the 0.05 level, so we reject the null hypothesis of constant variance: the residuals of this simple model show evidence of heteroskedasticity. The histogram of the residuals is nearly normal with a slight skew. The Q-Q plot shows evidence of outliers in the data set. We can take a closer look at the constant variance check.
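To make the test statistic less of a black box, here is a base-R sketch of the classical (non-studentized) Breusch-Pagan score statistic computed against the fitted values. It uses simulated data whose noise grows with `x`. The variable names are hypothetical, and package implementations may differ in detail, so treat this as an illustration of the idea rather than a reproduction of the output above.

```r
# Simulated data with non-constant variance (illustration only).
set.seed(42)
n <- 200
x <- runif(n, 1, 10)
y <- 5 + 2 * x + rnorm(n, sd = x)   # noise standard deviation grows with x
fit <- lm(y ~ x)

# Scale squared residuals by their mean (the ML estimate of the error variance).
e2 <- residuals(fit)^2
g <- e2 / mean(e2)

# Auxiliary regression of the scaled squared residuals on the fitted values.
fv <- fitted(fit)
aux <- lm(g ~ fv)

# Score statistic: half the explained sum of squares of the auxiliary regression.
ess <- sum((fitted(aux) - mean(g))^2)
bp_stat <- ess / 2

# Under constant variance the statistic is approximately chi-squared with 1 df.
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(chi2 = bp_stat, p_value = bp_p))
```

A small p-value means the squared residuals vary systematically with the fitted values, which is the situation reported for the Price ~ Mileage model.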

Quadratic variable

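A quadratic term adds the square of a predictor to the regression: $ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon $. Here x is Cylinder, and its square enters the model as the variable q.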
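A squared column can be created explicitly or written inline with `I()` in the model formula; both give the same fit. The sketch below assumes `q` is the square of Cylinder and uses a simulated data frame `demo` with made-up coefficients.

```r
# Simulated data with a curved Price-Cylinder relationship (illustration only).
set.seed(7)
demo <- data.frame(Cylinder = rep(c(4, 6, 8), each = 10))
demo$Price <- 1000 + 3000 * demo$Cylinder - 150 * demo$Cylinder^2 +
  rnorm(30, sd = 200)

# Option 1: add the squared term as its own column.
demo$q <- demo$Cylinder^2
fit_col <- lm(Price ~ Cylinder + q, data = demo)

# Option 2: square inside the formula; I() protects ^ from formula syntax.
fit_inline <- lm(Price ~ Cylinder + I(Cylinder^2), data = demo)

# Same estimates either way; only the coefficient label differs.
print(coef(fit_col))
print(coef(fit_inline))
```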

Dichotomous vs. quantitative interaction
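The interaction term lets the effect of Cylinder differ between the two car types. A minimal sketch, assuming `dq` is the product of the 0/1 Type indicator and Cylinder (the data frame `demo` is hypothetical):

```r
# Hypothetical rows: three sedans (Type = 0) and three convertibles (Type = 1).
demo <- data.frame(
  Type     = c(0, 0, 0, 1, 1, 1),
  Cylinder = c(4, 6, 8, 4, 6, 8)
)

# The interaction column is zero for sedans and equals Cylinder for convertibles.
demo$dq <- demo$Type * demo$Cylinder
print(demo)

# The formula shorthand Type * Cylinder builds the same column automatically.
mm <- model.matrix(~ Type * Cylinder, data = demo)
print(mm[, "Type:Cylinder"])
```

Because `dq` is zero for sedans, its coefficient is the extra Cylinder slope that applies only to convertibles.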

Fitting the multiple regression model

## 
## Call:
## lm(formula = Price ~ Mileage + Type + Cylinder + Liter + Doors + 
##     Leather + q + dq, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6042.6 -2068.0   -60.9  1875.2  5785.8 
## 
## Coefficients: (1 not defined because of singularities)
##                Estimate  Std. Error t value       Pr(>|t|)    
## (Intercept) -43871.2373   8883.4544  -4.939 0.000002618689 ***
## Mileage         -0.3002      0.0287 -10.460        < 2e-16 ***
## Type        -13258.2586   1919.1940  -6.908 0.000000000266 ***
## Cylinder     33731.9881   3417.8461   9.869        < 2e-16 ***
## Liter       -13716.8991    949.0691 -14.453        < 2e-16 ***
## Doors                NA          NA      NA             NA    
## Leather       2122.2078    746.3827   2.843        0.00526 ** 
## q            -1895.5566    269.7468  -7.027 0.000000000146 ***
## dq            4637.5116    346.6670  13.377        < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3059 on 118 degrees of freedom
## Multiple R-squared:  0.9118, Adjusted R-squared:  0.9065 
## F-statistic: 174.2 on 7 and 118 DF,  p-value: < 2.2e-16

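The summary reports NA for Doors, with the note "1 not defined because of singularities". This indicates perfect collinearity rather than a lack of significance. The mean of Doors (3.206) equals (76 × 4 + 50 × 2) / 126, which is consistent with every sedan having 4 doors and every convertible having 2, so Doors carries no information beyond Type. We therefore remove Doors from the model; the remaining estimates are unchanged.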
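An `NA` coefficient of this kind is what `lm()` returns when one predictor is an exact linear combination of others. The sketch below reproduces the behaviour on simulated data in which Doors is fully determined by Type. The data frame `demo` and its coefficients are assumptions for illustration.

```r
# Simulated data where Doors is fully determined by Type (illustration only).
set.seed(3)
demo <- data.frame(Type = rep(c(0, 1), each = 10))  # 0 = Sedan, 1 = Convertible
demo$Doors <- ifelse(demo$Type == 1, 2, 4)          # Doors = 4 - 2 * Type
demo$Mileage <- runif(20, 500, 50000)
demo$Price <- 30000 + 8000 * demo$Type - 0.3 * demo$Mileage +
  rnorm(20, sd = 1000)

# Doors enters after Type, so lm() cannot estimate it and reports NA.
fit <- lm(Price ~ Mileage + Type + Doors, data = demo)
print(coef(fit))

# alias() shows which terms are linear combinations of the others.
print(alias(fit))
```

Dropping the aliased predictor leaves the fitted values and the other coefficients unchanged.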

## 
## Call:
## lm(formula = Price ~ Mileage + Type + Cylinder + Liter + Leather + 
##     q + dq, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6042.6 -2068.0   -60.9  1875.2  5785.8 
## 
## Coefficients:
##                Estimate  Std. Error t value       Pr(>|t|)    
## (Intercept) -43871.2373   8883.4544  -4.939 0.000002618689 ***
## Mileage         -0.3002      0.0287 -10.460        < 2e-16 ***
## Type        -13258.2586   1919.1940  -6.908 0.000000000266 ***
## Cylinder     33731.9881   3417.8461   9.869        < 2e-16 ***
## Liter       -13716.8991    949.0691 -14.453        < 2e-16 ***
## Leather       2122.2078    746.3827   2.843        0.00526 ** 
## q            -1895.5566    269.7468  -7.027 0.000000000146 ***
## dq            4637.5116    346.6670  13.377        < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3059 on 118 degrees of freedom
## Multiple R-squared:  0.9118, Adjusted R-squared:  0.9065 
## F-statistic: 174.2 on 7 and 118 DF,  p-value: < 2.2e-16
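Because Cylinder appears in three terms, its coefficients cannot be read one at a time. Assuming `q` is Cylinder squared and `dq` is Type × Cylinder with Convertible = 1, the slope of Price with respect to Cylinder is b_Cylinder + 2 × b_q × Cylinder + b_dq × Type, and the Convertible minus Sedan gap is b_Type + b_dq × Cylinder. The sketch below evaluates both from the reported estimates; the helper function names are hypothetical.

```r
# Estimates copied from the summary above.
b_type <- -13258.2586
b_cyl  <- 33731.9881
b_q    <- -1895.5566
b_dq   <- 4637.5116

# Slope of Price with respect to Cylinder at a given cylinder count and type.
cyl_slope <- function(cylinder, type) {
  b_cyl + 2 * b_q * cylinder + b_dq * type
}

# Convertible minus Sedan price gap at a given cylinder count.
type_gap <- function(cylinder) {
  b_type + b_dq * cylinder
}

print(cyl_slope(cylinder = 4, type = 0))  # sedan, 4 cylinders
print(cyl_slope(cylinder = 8, type = 0))  # sedan, 8 cylinders
print(cyl_slope(cylinder = 4, type = 1))  # convertible, 4 cylinders
print(type_gap(4))
print(type_gap(8))
```

The negative `q` coefficient means the Cylinder slope shrinks as the cylinder count rises, and the positive `dq` coefficient means the slope is steeper for convertibles. The remaining terms read directly: holding the other predictors fixed, each additional mile lowers the predicted price by about 0.30, and leather seats raise it by about 2122.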

Residual Analysis

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.2857996, Df = 1, p = 0.59292
##  lag Autocorrelation D-W Statistic p-value
##    1       0.8420812     0.2909659       0
##  Alternative hypothesis: rho != 0
## 
## Call:
## lm(formula = Price ~ Mileage + Type + Cylinder + Liter + Leather + 
##     q + dq, data = cars)
## 
## Coefficients:
## (Intercept)      Mileage         Type     Cylinder        Liter      Leather  
## -43871.2373      -0.3002  -13258.2586   33731.9882  -13716.8991    2122.2078  
##           q           dq  
##  -1895.5566    4637.5115  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = ml2) 
## 
##                      Value         p-value                   Decision
## Global Stat        43.2507 0.0000000091797 Assumptions NOT satisfied!
## Skewness            0.5275 0.4676577114265    Assumptions acceptable.
## Kurtosis            2.3188 0.1278197794003    Assumptions acceptable.
## Link Function      40.1441 0.0000000002359 Assumptions NOT satisfied!
## Heteroscedasticity  0.2603 0.6099355362465    Assumptions acceptable.

The residuals are scattered uniformly about zero, which agrees with the non-constant variance test above (p = 0.59).

The Q-Q plot shows that the residuals follow the indicated line.
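The Durbin-Watson statistic in the output above is far below 2 (0.29), which points to positive autocorrelation in the residuals. It is roughly 2 × (1 - r), where r is the lag-1 autocorrelation (here 2 × (1 - 0.84) ≈ 0.32). The sketch below computes the statistic by hand on simulated data with autocorrelated errors. The function name `dw_stat` and the data are hypothetical.

```r
# Durbin-Watson statistic: squared successive differences over squared residuals.
dw_stat <- function(res) sum(diff(res)^2) / sum(res^2)

# Simulated regression with random-walk errors (strong positive autocorrelation).
set.seed(11)
n <- 100
x <- seq_len(n)
err <- cumsum(rnorm(n))
y <- 1 + 0.5 * x + err
fit <- lm(y ~ x)

res <- residuals(fit)
dw <- dw_stat(res)

# Lag-1 autocorrelation of the residuals, for comparison with 2 * (1 - r1).
r1 <- cor(head(res, -1), tail(res, -1))
print(c(dw = dw, approx = 2 * (1 - r1)))
```

Values near 2 suggest no first-order autocorrelation, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation. The statistic depends on row order, so it is only meaningful when the rows have a natural ordering.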

Summary

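The R-squared value is 0.9118, so the model explains 91.18% of the variability in Price. The residual plot shows roughly constant variability with no obvious pattern, in agreement with the non-constant variance test (p = 0.59), and the Q-Q plot looks good apart from some outliers at the tails. Two checks are less favourable: the Durbin-Watson test indicates strong positive autocorrelation in the residuals (D-W = 0.29, lag-1 autocorrelation 0.84), and the gvlma global and link function tests report that the assumptions are not satisfied. Overall, the multiple linear model (ml2) fits the data well and seems appropriate for prediction, but its standard errors and p-values should be read with caution until the autocorrelation and link function issues are addressed.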