importdata <- read.table("~/Multivirate Analyses/MVA_2022_2023/Homework/Homework 3 R/CarPrice_Assignment.csv", header=TRUE, sep=",", dec=".")
customdata <- importdata %>% dplyr::select(1, 26, 4, 6, 10, 14, 22, 24)
head(customdata)
## car_ID price fueltype doornumber wheelbase curbweight horsepower citympg
## 1 1 13495 gas two 88.6 2548 111 21
## 2 2 16500 gas two 88.6 2548 111 21
## 3 3 16500 gas two 94.5 2823 154 19
## 4 4 13950 gas four 99.8 2337 102 24
## 5 5 17450 gas four 99.4 2824 115 18
## 6 6 15250 gas two 99.8 2507 110 19
Description of data
Source of data: Car Price Prediction Multiple Linear Regression https://www.kaggle.com/datasets/hellbuoy/car-price-prediction?select=CarPrice_Assignment.csv
Unit of Observation: Cars Sample Size: 205 observations
Variables
Dependent variable - price: Price of car in Dollars
Independent variables - fueltype: Car Fuel Type (gas or diesel) - doornumber Number of Doors (two or four) - wheelbase Length of the wheelbase in inches (distance between front and rear wheels) - curbweight: Weight of car in pounds - citympg: MPG (miles per gallon) on the highway - horsepower: Car horsepower
Multiple Regression Model
Research Question: What variables increase the price of a car?
The dependent variable in the model is the price of a car. I expect to see diesel fueled cars be more expensive than petrol fueled cars and two door cars be more expensive than four door cars. Diesel engine parts are more complex and more expensive, whilst two door cars are generally more complex to design and usually are associated with sport models.
I expect to see the weight of the car, the horsepower and the city miles per gallon all increase the price of a car as these variable units increase. Bigger cars cost more to in raw materials to construct, greater horsepower is desired for better acceleration or torque and greater miles to the gallon are desired due to savings from fuel efficiency.
I expect to see the length of the wheelbase of a car to increase the price of a car as the length decreases. Cars with smaller wheelbase lengths are more maneuverable and better at turning, therefore likely to be associated with sport models that cost more.
#Changed any missing data to NA and removed any NA's from the data.
mydata1 <- customdata %>%
na_if ('') %>%
drop_na()
#Changed fueltype to a factor.
mydata1$fuelTypeF <- factor(mydata1$fueltype,
levels = c("gas","diesel"),
labels = c("Gas", "Diesel"))
#Changed doornumber to a factor.
mydata1$doornumberF <- factor(mydata1$doornumber,
levels = c("two","four"),
labels = c("Two", "Four"))
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata1[ , c(-1,-3, -4, -9, -10)],
smooth = FALSE)
Scatterplot Matrix
As we can see from this matrix, price is positively correlated with wheelbase, curb weight and horsepower. City mpg is negatively correlated with price.
stat1 <- lm(price ~ fuelTypeF + doornumberF + wheelbase + curbweight + horsepower +citympg,
data = mydata1)
stat1
##
## Call:
## lm(formula = price ~ fuelTypeF + doornumberF + wheelbase + curbweight +
## horsepower + citympg, data = mydata1)
##
## Coefficients:
## (Intercept) fuelTypeFDiesel doornumberFFour wheelbase curbweight horsepower citympg
## -32763.979 1362.955 -667.628 157.841 6.285 109.978 126.132
The first multivariate model, before we interpret this, we check the inflation factor.
vif(stat1)
## fuelTypeF doornumberF wheelbase curbweight horsepower citympg
## 1.989164 1.419714 4.042427 10.455966 4.748824 4.622313
mean(vif(stat1))
## [1] 4.546401
We can see that we have have an issue with multicollinearity as our average Variance inflation factor is almost 5. Curb weight is highly correlated with our other variables as it has a VIF of 10.5. To correctly specify the model, this variable should be removed and our average VIF should drop closer to 1.
#removed curbweight
mydata2 = select(mydata1, -6)
#estimate standardized residuals of stat1
mydata2$StdResid <- round(rstandard(stat1), 3)
#calculate Cooks distance of stat1
mydata2$CooksD <- round(cooks.distance(stat1), 3)
hist(mydata2$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
Looking at our standard residuals we can see that there is a skew to the right caused by some outliers. We need to check for normal distribution of standardized residuals with Shapiro-Wilk normality test.
shapiro.test(mydata2$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata2$StdResid
## W = 0.93488, p-value = 6.207e-08
H0: Standard residuals are normally distributed H1: Standard residuals are not normally distributed
We reject null hypothesis at P > 0.05. The standard residuals are not normally distributed, we need to remove outliers.
hist(mydata2$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
Looking at Cooks distances we can see there are a few variables with very high impact, these should be removed.
head(mydata2[order(-mydata2$StdResid),], 10)
## car_ID price fueltype doornumber wheelbase horsepower citympg fuelTypeF doornumberF StdResid CooksD
## 75 75 45400 gas two 112.0 184 14 Gas Two 4.150 0.166
## 17 17 41315 gas two 103.5 182 16 Gas Two 3.901 0.071
## 129 129 37028 gas two 89.5 207 17 Gas Two 3.650 0.176
## 73 73 35056 gas two 96.6 155 16 Gas Two 2.956 0.198
## 128 128 34028 gas two 89.5 207 17 Gas Two 2.898 0.115
## 127 127 32528 gas two 89.5 207 17 Gas Two 2.481 0.084
## 74 74 40960 gas four 120.9 184 14 Gas Four 2.462 0.097
## 18 18 36880 gas four 110.0 182 15 Gas Four 2.439 0.034
## 72 72 34184 gas four 115.6 155 16 Gas Four 1.858 0.030
## 15 15 24565 gas four 103.5 121 20 Gas Four 1.774 0.008
head(mydata2[order(-mydata2$CooksD),], 10)
## car_ID price fueltype doornumber wheelbase horsepower citympg fuelTypeF doornumberF StdResid CooksD
## 73 73 35056.0 gas two 96.6 155 16 Gas Two 2.956 0.198
## 129 129 37028.0 gas two 89.5 207 17 Gas Two 3.650 0.176
## 75 75 45400.0 gas two 112.0 184 14 Gas Two 4.150 0.166
## 130 130 31400.5 gas two 98.4 288 17 Gas Two -1.900 0.142
## 128 128 34028.0 gas two 89.5 207 17 Gas Two 2.898 0.115
## 74 74 40960.0 gas four 120.9 184 14 Gas Four 2.462 0.097
## 127 127 32528.0 gas two 89.5 207 17 Gas Two 2.481 0.084
## 17 17 41315.0 gas two 103.5 182 16 Gas Two 3.901 0.071
## 111 111 13860.0 diesel four 114.2 95 25 Diesel Four -1.995 0.042
## 18 18 36880.0 gas four 110.0 182 15 Gas Four 2.439 0.034
mydata3 <- subset(mydata2, !StdResid > 3 | StdResid < -3,
select=c(car_ID, price, fuelTypeF, doornumberF, wheelbase, horsepower, citympg, StdResid, CooksD))
mydata4 <- subset(mydata3, !CooksD > .12 ,
select=c(car_ID, price, fuelTypeF, doornumberF, wheelbase, horsepower, citympg, StdResid, CooksD))
We remove observations with standard residuals outside of plus or minus 3. We also remove observations with high impact, in this case, observations that have a cooks distance above .12
stat2 <- lm(price ~ fuelTypeF + doornumberF + wheelbase + horsepower + citympg,
data = mydata4)
We estimate the model after variables with high multicollinearity, outliers and units with high impact have been removed.
vif(stat2)
## fuelTypeF doornumberF wheelbase horsepower citympg
## 1.469650 1.420869 2.299784 3.155306 3.867298
mean(vif(stat2))
## [1] 2.442581
Average VIF has dropped substantialy. Finally we need to check homoscedasticity.
round(stat.desc(mydata4[ ,c(-1,-3, -4, -8, -9, -10 )]), 2)
## price wheelbase horsepower citympg
## nbr.val 200.00 200.00 200.00 200.00
## nbr.null 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 5118.00 86.60 48.00 13.00
## max 40960.00 120.90 262.00 49.00
## range 35842.00 34.30 214.00 36.00
## sum 2531526.17 19745.10 20328.00 5090.00
## median 10096.50 96.95 95.00 25.00
## mean 12657.63 98.73 101.64 25.45
## SE.mean 495.15 0.42 2.55 0.46
## CI.mean.0.95 976.42 0.83 5.02 0.90
## var 49035191.35 35.72 1298.20 41.66
## std.dev 7002.51 5.98 36.03 6.45
## coef.var 0.55 0.06 0.35 0.25
mydata4$StdResid <- round(rstandard(stat2), 3)
mydata4$CooksD <- round(cooks.distance(stat2), 3)
mydata4$StdFittedValues <- scale(stat2$fitted.values)
library(car)
scatterplot(y = mydata4$StdResid, x = mydata4$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Our model is showing signs of heteroscedasticity. We use the Breusch Pagan Test for Heteroskedasticity
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(stat2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : price
## Variables: fitted values of price
##
## Test Summary
## ------------------------------
## DF = 1
## Chi2 = 110.3264
## Prob > Chi2 = 8.31152e-26
H0: The variance is constant
H1: The variance is not constant
We reject H0 at p <0.001 the variance is not constant. Therefore we need to use lm-robust for our regression model.
library(estimatr)
stat4 <- lm_robust(price ~ fuelTypeF + doornumberF + wheelbase + horsepower + citympg,
se_type = "HC1",
data = mydata4)
summary(stat4)
##
## Call:
## lm_robust(formula = price ~ fuelTypeF + doornumberF + wheelbase +
## horsepower + citympg, data = mydata4, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) -39068.16 9943.46 -3.9290 1.185e-04 -58679.3 -19457.0 194
## fuelTypeFDiesel 3714.34 1095.35 3.3910 8.438e-04 1554.0 5874.7 194
## doornumberFFour 88.07 499.17 0.1764 8.601e-01 -896.4 1072.6 194
## wheelbase 374.37 94.29 3.9704 1.010e-04 188.4 560.3 194
## horsepower 137.88 16.42 8.3977 9.476e-15 105.5 170.3 194
## citympg 12.96 71.18 0.1821 8.557e-01 -127.4 153.3 194
##
## Multiple R-squared: 0.7906 , Adjusted R-squared: 0.7852
## F-statistic: 75.84 on 5 and 194 DF, p-value: < 2.2e-16
Regression coefficients:
FuelType: Given the values of all other variables, the price of diesel fueled cars will be on average $3714.34 higher than petrol cars, p < 0.001, assuming all other variables remain unchanged.
Doornumber: We cannot say if the number of doors of a car had a significant impact on the price of a car p = 0.8601..
Wheelbase: If the wheelbase length of a car increases by 1 inch the price of a car will on average increase by $374.37, p < 0.001, assuming all other variables remain unchanged.
Horespower: If the horsepower of a car increases by 1 unit the price of a car will on average increase by $137.88, p < 0.001, assuming all other variables remain unchanged.
Citympg: We cannot say if the city miles per gallon of a car had a significant impact on the price of a car p = 0.8557.
The R-squared value of 0.7906 indicates that about 79% of the variability of car prices is explained by Fuel type, Number of doors, Wheel base length, Horsepower and City mpg.
Looking at F statistics
H0: Ro2 = 0 H1: Ro2 > 0
We can reject H0 at P < 0.001, there is a relationship between the dependent and at least one of our explanatory variables.
sqrt(summary(stat4)$r.squared)
## [1] 0.8891563
Multiple Coefficient of Correlation: 0.882. The linear relationship between car Price and all 5 explanatory variables is strong.