HW 3 Regression Cars

importdata <- read.table("~/Multivirate Analyses/MVA_2022_2023/Homework/Homework 3 R/CarPrice_Assignment.csv", header=TRUE, sep=",", dec=".")

customdata <- importdata %>% dplyr::select(1, 26, 4, 6, 10, 14, 22, 24)
head(customdata)

##   car_ID price fueltype doornumber wheelbase curbweight horsepower citympg
## 1      1 13495      gas        two      88.6       2548        111      21
## 2      2 16500      gas        two      88.6       2548        111      21
## 3      3 16500      gas        two      94.5       2823        154      19
## 4      4 13950      gas       four      99.8       2337        102      24
## 5      5 17450      gas       four      99.4       2824        115      18
## 6      6 15250      gas        two      99.8       2507        110      19

Description of data

Source of data: Car Price Prediction Multiple Linear Regression https://www.kaggle.com/datasets/hellbuoy/car-price-prediction?select=CarPrice_Assignment.csv

Unit of Observation: Cars Sample Size: 205 observations

Variables

Car_ID: ID

Dependent variable - price: Price of car in Dollars

Independent variables - fueltype: Car Fuel Type (gas or diesel) - doornumber Number of Doors (two or four) - wheelbase Length of the wheelbase in inches (distance between front and rear wheels) - curbweight: Weight of car in pounds - citympg: MPG (miles per gallon) on the highway - horsepower: Car horsepower

Multiple Regression Model

Research Question: What variables increase the price of a car?

The dependent variable in the model is the price of a car. I expect to see diesel fueled cars be more expensive than petrol fueled cars and two door cars be more expensive than four door cars. Diesel engine parts are more complex and more expensive, whilst two door cars are generally more complex to design and usually are associated with sport models.

I expect to see the weight of the car, the horsepower and the city miles per gallon all increase the price of a car as these variable units increase. Bigger cars cost more to in raw materials to construct, greater horsepower is desired for better acceleration or torque and greater miles to the gallon are desired due to savings from fuel efficiency.

I expect to see the length of the wheelbase of a car to increase the price of a car as the length decreases. Cars with smaller wheelbase lengths are more maneuverable and better at turning, therefore likely to be associated with sport models that cost more.

#Changed any missing data to NA and removed any NA's from the data.
mydata1 <- customdata %>% 
              na_if ('') %>% 
              drop_na()

#Changed fueltype to a factor.
mydata1$fuelTypeF <- factor(mydata1$fueltype, 
                            levels = c("gas","diesel"),  
                            labels = c("Gas", "Diesel"))

#Changed doornumber to a factor.
mydata1$doornumberF <- factor(mydata1$doornumber, 
                            levels = c("two","four"),  
                            labels = c("Two", "Four"))

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:purrr':
## 
##     some

## The following object is masked from 'package:dplyr':
## 
##     recode

scatterplotMatrix(mydata1[ , c(-1,-3, -4, -9, -10)], 
                  smooth = FALSE)

Scatterplot Matrix

As we can see from this matrix, price is positively correlated with wheelbase, curb weight and horsepower. City mpg is negatively correlated with price.

stat1 <- lm(price ~ fuelTypeF + doornumberF + wheelbase + curbweight + horsepower +citympg,
           data = mydata1)

stat1

## 
## Call:
## lm(formula = price ~ fuelTypeF + doornumberF + wheelbase + curbweight + 
##     horsepower + citympg, data = mydata1)
## 
## Coefficients:
##     (Intercept)  fuelTypeFDiesel  doornumberFFour        wheelbase       curbweight       horsepower          citympg  
##      -32763.979         1362.955         -667.628          157.841            6.285          109.978          126.132

The first multivariate model, before we interpret this, we check the inflation factor.

vif(stat1)

##   fuelTypeF doornumberF   wheelbase  curbweight  horsepower     citympg 
##    1.989164    1.419714    4.042427   10.455966    4.748824    4.622313

mean(vif(stat1))

## [1] 4.546401

We can see that we have have an issue with multicollinearity as our average Variance inflation factor is almost 5. Curb weight is highly correlated with our other variables as it has a VIF of 10.5. To correctly specify the model, this variable should be removed and our average VIF should drop closer to 1.

#removed curbweight
mydata2 = select(mydata1, -6)

#estimate standardized residuals of stat1
mydata2$StdResid <- round(rstandard(stat1), 3) 

#calculate Cooks distance of stat1
mydata2$CooksD <- round(cooks.distance(stat1), 3)

hist(mydata2$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

Looking at our standard residuals we can see that there is a skew to the right caused by some outliers. We need to check for normal distribution of standardized residuals with Shapiro-Wilk normality test.

shapiro.test(mydata2$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata2$StdResid
## W = 0.93488, p-value = 6.207e-08

H0: Standard residuals are normally distributed H1: Standard residuals are not normally distributed

We reject null hypothesis at P > 0.05. The standard residuals are not normally distributed, we need to remove outliers.

hist(mydata2$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

Looking at Cooks distances we can see there are a few variables with very high impact, these should be removed.

head(mydata2[order(-mydata2$StdResid),], 10)

##     car_ID price fueltype doornumber wheelbase horsepower citympg fuelTypeF doornumberF StdResid CooksD
## 75      75 45400      gas        two     112.0        184      14       Gas         Two    4.150  0.166
## 17      17 41315      gas        two     103.5        182      16       Gas         Two    3.901  0.071
## 129    129 37028      gas        two      89.5        207      17       Gas         Two    3.650  0.176
## 73      73 35056      gas        two      96.6        155      16       Gas         Two    2.956  0.198
## 128    128 34028      gas        two      89.5        207      17       Gas         Two    2.898  0.115
## 127    127 32528      gas        two      89.5        207      17       Gas         Two    2.481  0.084
## 74      74 40960      gas       four     120.9        184      14       Gas        Four    2.462  0.097
## 18      18 36880      gas       four     110.0        182      15       Gas        Four    2.439  0.034
## 72      72 34184      gas       four     115.6        155      16       Gas        Four    1.858  0.030
## 15      15 24565      gas       four     103.5        121      20       Gas        Four    1.774  0.008

head(mydata2[order(-mydata2$CooksD),], 10)

##     car_ID   price fueltype doornumber wheelbase horsepower citympg fuelTypeF doornumberF StdResid CooksD
## 73      73 35056.0      gas        two      96.6        155      16       Gas         Two    2.956  0.198
## 129    129 37028.0      gas        two      89.5        207      17       Gas         Two    3.650  0.176
## 75      75 45400.0      gas        two     112.0        184      14       Gas         Two    4.150  0.166
## 130    130 31400.5      gas        two      98.4        288      17       Gas         Two   -1.900  0.142
## 128    128 34028.0      gas        two      89.5        207      17       Gas         Two    2.898  0.115
## 74      74 40960.0      gas       four     120.9        184      14       Gas        Four    2.462  0.097
## 127    127 32528.0      gas        two      89.5        207      17       Gas         Two    2.481  0.084
## 17      17 41315.0      gas        two     103.5        182      16       Gas         Two    3.901  0.071
## 111    111 13860.0   diesel       four     114.2         95      25    Diesel        Four   -1.995  0.042
## 18      18 36880.0      gas       four     110.0        182      15       Gas        Four    2.439  0.034

mydata3 <- subset(mydata2, !StdResid > 3 | StdResid < -3,
select=c(car_ID, price, fuelTypeF, doornumberF, wheelbase, horsepower, citympg, StdResid, CooksD))

mydata4 <- subset(mydata3, !CooksD > .12 ,
select=c(car_ID, price, fuelTypeF, doornumberF, wheelbase, horsepower, citympg, StdResid, CooksD))

We remove observations with standard residuals outside of plus or minus 3. We also remove observations with high impact, in this case, observations that have a cooks distance above .12

stat2 <- lm(price ~ fuelTypeF + doornumberF + wheelbase + horsepower + citympg,
           data = mydata4)

We estimate the model after variables with high multicollinearity, outliers and units with high impact have been removed.

vif(stat2)

##   fuelTypeF doornumberF   wheelbase  horsepower     citympg 
##    1.469650    1.420869    2.299784    3.155306    3.867298

mean(vif(stat2))

## [1] 2.442581

Average VIF has dropped substantialy. Finally we need to check homoscedasticity.

round(stat.desc(mydata4[ ,c(-1,-3, -4, -8, -9, -10 )]), 2)

##                    price wheelbase horsepower citympg
## nbr.val           200.00    200.00     200.00  200.00
## nbr.null            0.00      0.00       0.00    0.00
## nbr.na              0.00      0.00       0.00    0.00
## min              5118.00     86.60      48.00   13.00
## max             40960.00    120.90     262.00   49.00
## range           35842.00     34.30     214.00   36.00
## sum           2531526.17  19745.10   20328.00 5090.00
## median          10096.50     96.95      95.00   25.00
## mean            12657.63     98.73     101.64   25.45
## SE.mean           495.15      0.42       2.55    0.46
## CI.mean.0.95      976.42      0.83       5.02    0.90
## var          49035191.35     35.72    1298.20   41.66
## std.dev          7002.51      5.98      36.03    6.45
## coef.var            0.55      0.06       0.35    0.25

mydata4$StdResid <- round(rstandard(stat2), 3) 

mydata4$CooksD <- round(cooks.distance(stat2), 3)

mydata4$StdFittedValues <- scale(stat2$fitted.values)

library(car)
scatterplot(y = mydata4$StdResid, x = mydata4$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Our model is showing signs of heteroscedasticity. We use the Breusch Pagan Test for Heteroskedasticity

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(stat2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : price 
##  Variables: fitted values of price 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    110.3264 
##  Prob > Chi2   =    8.31152e-26

H0: The variance is constant
H1: The variance is not constant

We reject H0 at p <0.001 the variance is not constant. Therefore we need to use lm-robust for our regression model.

library(estimatr)
stat4 <- lm_robust(price ~ fuelTypeF + doornumberF + wheelbase + horsepower + citympg,
                  se_type = "HC1",
           data = mydata4)
summary(stat4)

## 
## Call:
## lm_robust(formula = price ~ fuelTypeF + doornumberF + wheelbase + 
##     horsepower + citympg, data = mydata4, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                  Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
## (Intercept)     -39068.16    9943.46 -3.9290 1.185e-04 -58679.3 -19457.0 194
## fuelTypeFDiesel   3714.34    1095.35  3.3910 8.438e-04   1554.0   5874.7 194
## doornumberFFour     88.07     499.17  0.1764 8.601e-01   -896.4   1072.6 194
## wheelbase          374.37      94.29  3.9704 1.010e-04    188.4    560.3 194
## horsepower         137.88      16.42  8.3977 9.476e-15    105.5    170.3 194
## citympg             12.96      71.18  0.1821 8.557e-01   -127.4    153.3 194
## 
## Multiple R-squared:  0.7906 ,    Adjusted R-squared:  0.7852 
## F-statistic: 75.84 on 5 and 194 DF,  p-value: < 2.2e-16

Regression coefficients:

FuelType: Given the values of all other variables, the price of diesel fueled cars will be on average $3714.34 higher than petrol cars, p < 0.001, assuming all other variables remain unchanged.

Doornumber: We cannot say if the number of doors of a car had a significant impact on the price of a car p = 0.8601..

Wheelbase: If the wheelbase length of a car increases by 1 inch the price of a car will on average increase by $374.37, p < 0.001, assuming all other variables remain unchanged.

Horespower: If the horsepower of a car increases by 1 unit the price of a car will on average increase by $137.88, p < 0.001, assuming all other variables remain unchanged.

Citympg: We cannot say if the city miles per gallon of a car had a significant impact on the price of a car p = 0.8557.

The R-squared value of 0.7906 indicates that about 79% of the variability of car prices is explained by Fuel type, Number of doors, Wheel base length, Horsepower and City mpg.

Looking at F statistics

H0: Ro2 = 0 H1: Ro2 > 0

We can reject H0 at P < 0.001, there is a relationship between the dependent and at least one of our explanatory variables.

sqrt(summary(stat4)$r.squared)

## [1] 0.8891563

Multiple Coefficient of Correlation: 0.882. The linear relationship between car Price and all 5 explanatory variables is strong.

HW 3 Regression Cars

Paul Shenstone

2023-01-22