Firstly i imported the dataset, and cleaned it in a manner that sufices my needs in at this instance, and took a randome sample of observations, since using all the observations would lead to p values being inherently low due to the size of the dataset.
CarMarket <- read.table("~/IMB/Multivariate analysis/HOMEWORK MVA/bmw.csv", header=TRUE, sep=",", dec=";")
CarMarket <- CarMarket[order(CarMarket$price, decreasing = TRUE), ]
set.seed(1)
CarMarket <- sample_n(CarMarket, 1500)
#CarMarket$model <- unclass(CarMarket$model)
CarMarket <- CarMarket %>% relocate(price)
CarMarket <- CarMarket[,c(-4, -6, -7, -9)]
CarMarket <- na.omit(CarMarket)
head(CarMarket)
## price model year mileage mpg
## 1 36500 X3 2019 5664 54.3
## 2 14995 1 Series 2017 13472 65.7
## 3 21898 3 Series 2018 6425 48.7
## 4 8991 3 Series 2013 64377 68.9
## 5 11030 1 Series 2017 35321 83.1
## 6 7990 1 Series 2010 60000 53.3
This data set regards the British used car market, or used BMW market to be precise. The unit of observation are specific cars, with their respective propreties, which are denoted as variables in the data set. The sample size consists of 483 and 9 respective variables (5 used). The variables consist of:
-Model of car: this variable entails specific models which the BMW line offers ranging from 1-8 and x1-x8 -year of production: the year the specific unit of car was produced -price: price of the unit in GBP -mileage: the distance the vehicle has already travelled in miles -tax: required tax payment upon purchase of the vehicle -mpg: fuel economy of the vehicle denoted in distance driven (in miles) per gallon of fuel (4L)
Using the latter, I will be performing multiple regression, in order to determine which of the explanatory variables impact the price, and to which extent they do so.
This data is sourced from Kaggle: https://www.kaggle.com/datasets/adityadesai13/used-car-dataset-ford-and-mercedes?select=bmw.csv
CarMarket$mpg <- as.numeric(CarMarket$mpg)
CarMarket$mileage <- as.numeric(CarMarket$mileage)
CarMarket$price <- as.numeric(CarMarket$price)
CarMarket$year <- as.numeric(CarMarket$year)
CarMarket$model <- factor(CarMarket$model, levels = c(" 3 Series" , " 5 Series" ," M5" , " 8 Series" ," X7" , " i8" , " X5" , " X6" , " X3" , " M3" , " X4" , " M4" , " 7 Series", " 6 Series" ," Z4" , " M2" , " 4 Series" ," X2" , " 1 Series" ," 2 Series", " X1"), labels = c(" 3 Series" , " 5 Series" ," M5" , " 8 Series" ," X7" , " i8" , " X5" , " X6" , " X3" , " M3" , " X4" , " M4" , " 7 Series", " 6 Series" ," Z4" , " M2" , " 4 Series" ," X2" , " 1 Series" ," 2 Series", " X1"))
CarMarket <- CarMarket[CarMarket$model %in% c(" 3 Series", " 5 Series"), ]
#Graphical presentation of the variables used in the regression model
library(car)
scatterplotMatrix(CarMarket[,-2],
smooth = FALSE)
From the scatterplot used above, we can observe a positive correlation
between the independant variable (price) and the dependant/explanatory
variables year, a negative correlation with mileage and a visually non
indicative relationship with mpg.
#Multiple regression model
In the following segment, I will create a regression model to try to identify the relationship between the dependand and independant variables, namely I will use the following: year, mileage, mpg and model to try and explain the price.
fit1 <- lm(price ~ model + year + mileage + mpg,
data = CarMarket) #creating a linear model denoted by
vif (fit1)
## model year mileage mpg
## 1.007097 2.819297 2.829575 1.021481
mean(vif(fit1))
## [1] 1.919362
We can observe that the Variance Inflation Factor (VIF) value is lower than 5 for all the numeric variables, and lower than √5 for the categorical variable (model). The mean VIF is also accaptable since it is not higher than 5.
CarMarket <- na.omit(CarMarket)
#Adding standardized residuals and cooks distances to the dataset
CarMarket$StdRes <- rstandard(fit1)
CarMarket$CooksD <- cooks.distance(fit1)
#Making histograms of distribution of standardized residuals and Cook's distances in order to detect outliers and high impact units which need to be potentially removed.
hist(CarMarket$StdRes,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(CarMarket$StdRes)
##
## Shapiro-Wilk normality test
##
## data: CarMarket$StdRes
## W = 0.87622, p-value < 2.2e-16
hist(CarMarket$CooksD,
main = "Histogram of Cooks distances",
xlab = "Cooks distance",
ylab = "Frequency")
Observing the distribution of standardized residuals and the results of
the Shapiro Wilk test (p < 0.05), we can reject the Null Hypothesis
and accept the alternative: H1 Standard residuals are not normally
distributed
Due to the absolute values of the std. res., I will be forced to remove those which posses am absolute value higher than 3.
I will also remove the units, which according to the graph have a higher impact, we can observe those to be located around the 0.4 and 0.5 region in the graph located above. For this reason, I will remove all units with Cooks D above 0.3.
head(CarMarket[order(CarMarket$StdRes),])
## price model year mileage mpg StdRes CooksD
## 1280 18550 3 Series 2019 7478 49.6 -1.779901 0.003315089
## 630 14400 3 Series 2017 14358 48.7 -1.745936 0.002875576
## 204 17480 3 Series 2019 20410 48.7 -1.717744 0.003578753
## 1151 14050 3 Series 2017 20829 48.7 -1.672030 0.002168492
## 1052 14998 3 Series 2017 13309 48.7 -1.644969 0.002654508
## 271 14000 3 Series 2017 22000 52.3 -1.631325 0.001876465
head(CarMarket[order(-CarMarket$StdRes),])
## price model year mileage mpg StdRes CooksD
## 1223 71990 3 Series 2020 150 47.1 8.829247 0.112038310
## 683 54845 5 Series 2020 450 60.1 4.704782 0.041355788
## 189 6990 5 Series 2002 78316 23.7 3.156383 0.266650612
## 593 42995 5 Series 2020 7500 156.9 3.115311 0.112618859
## 258 40980 3 Series 2019 6250 43.5 2.835540 0.009645099
## 902 42990 3 Series 2020 3698 43.5 2.824708 0.012391401
head(CarMarket[order(-CarMarket$CooksD),], 20)
## price model year mileage mpg StdRes CooksD
## 189 6990 5 Series 2002 78316 23.7 3.156383 0.26665061
## 593 42995 5 Series 2020 7500 156.9 3.115311 0.11261886
## 1223 71990 3 Series 2020 150 47.1 8.829247 0.11203831
## 460 38995 5 Series 2019 6500 156.9 2.620196 0.07957674
## 617 36499 5 Series 2019 5000 156.9 2.051077 0.04905970
## 40 5994 3 Series 2007 118000 38.2 2.557216 0.04214782
## 683 54845 5 Series 2020 450 60.1 4.704782 0.04135579
## 213 3800 3 Series 2010 156800 58.9 1.996363 0.03659728
## 1147 34500 5 Series 2019 99 156.9 1.510089 0.02722613
## 1293 4495 3 Series 2009 138026 42.8 1.969771 0.02493525
## 538 6495 3 Series 2009 122144 34.0 1.963881 0.02096822
## 437 6985 3 Series 2008 98630 39.2 1.944624 0.02033694
## 882 5990 3 Series 2008 108000 38.7 1.945053 0.02012034
## 650 3885 3 Series 2007 97284 42.2 1.665080 0.01873457
## 297 3500 3 Series 2011 149958 62.8 1.417366 0.01758398
## 603 32490 5 Series 2019 5134 156.9 1.195254 0.01665084
## 542 7795 3 Series 2014 147110 62.8 1.128884 0.01620762
## 202 2550 3 Series 2007 103000 49.6 1.562819 0.01552426
## 902 42990 3 Series 2020 3698 43.5 2.824708 0.01239140
## 1026 43990 5 Series 2020 10 39.2 2.284410 0.01233518
Judging by the ordered cooks distances, I will remove observations 126, 246, 16, 359, 147, 120 and 299
CarMarket1 <- CarMarket [c(-1223, -683, -189, -593)]
At this point, I will retry running the linear model, without the “outliers” denoted by invalid Cooks Distances and Standard Residuals.
fit2 <- lm(price ~ model + year + mileage + mpg,
data = CarMarket1)
#Following up with homoskedasticity tests
Ho: Variance is constant (we have homoskedasticity).
H1: Variance is not constant (we have heteroskedasticity).
CarMarket1 <- na.omit(CarMarket1)
CarMarket1$StdFitValues <- scale(fit1$fitted.values)
scatterplot(y = CarMarket1$StdRes, x = CarMarket1$StdFitValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : price
## Variables: fitted values of price
##
## Test Summary
## ------------------------------
## DF = 1
## Chi2 = 17.76829
## Prob > Chi2 = 2.49507e-05
Due to p-value < 0.005, we can reject the null hypothesis and accept the alternative, which mean we are detecting heteroskedasticity. This fact means we have to use the lm_robust function when describing the model.
lm_robust(price ~ model + year + mileage + mpg,
data = CarMarket1, se_type = "HC1")
## Estimate Std. Error t value Pr(>|t|) CI Lower
## (Intercept) -3.599416e+06 3.532289e+05 -10.190037 3.372308e-22 -4.293489e+06
## model 5 Series 3.067256e+03 4.764616e+02 6.437573 2.958087e-10 2.131038e+03
## year 1.797380e+03 1.750233e+02 10.269377 1.722076e-22 1.453470e+03
## mileage -1.082825e-01 1.259731e-02 -8.595689 1.186496e-16 -1.330355e-01
## mpg -3.247235e+01 1.349944e+01 -2.405459 1.653107e-02 -5.899792e+01
## CI Upper DF
## (Intercept) -2.905343e+06 478
## model 5 Series 4.003474e+03 478
## year 2.141290e+03 478
## mileage -8.352958e-02 478
## mpg -5.946775e+00 478
From the summary printed, we can observe that all the numerical variables have a statistically significant impact on the price of the vehicle, with the variables “mileage” and year having the highest impact, followed by the variable mileage. Same goes for the categorical variable, which is also significant.
Intercept: If all independant variables would be at zero, the value of the intercept or price, would be -3.599416e+06 at the p-value 3.372308e-22. (not relevant)
Model 5 Series: On average, the model 5 Series will be priced at about 3.067256e+03 GBP higher than the 3 Series.
Year: An increase in a year of production of the vehicle would increase the price by 1.797380e+03 GBP.
Mileage: An increase in a mile driven on the odometer would decrease the price by 1.082825e-01 GBP.
MPG: An increase of the economy by 1 mpg would decrease the price of the vehicle by 3.247235e+01 GBP.
The P-Values for all the variables are < 0.05.
summary(lm_robust(price ~ model + year + mileage + mpg,
data = CarMarket1, se_type = "HC1"))
##
## Call:
## lm_robust(formula = price ~ model + year + mileage + mpg, data = CarMarket1,
## se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper
## (Intercept) -3.599e+06 3.532e+05 -10.190 3.372e-22 -4.293e+06 -2.905e+06
## model 5 Series 3.067e+03 4.765e+02 6.438 2.958e-10 2.131e+03 4.003e+03
## year 1.797e+03 1.750e+02 10.269 1.722e-22 1.453e+03 2.141e+03
## mileage -1.083e-01 1.260e-02 -8.596 1.186e-16 -1.330e-01 -8.353e-02
## mpg -3.247e+01 1.350e+01 -2.405 1.653e-02 -5.900e+01 -5.947e+00
## DF
## (Intercept) 478
## model 5 Series 478
## year 478
## mileage 478
## mpg 478
##
## Multiple R-squared: 0.7333 , Adjusted R-squared: 0.7311
## F-statistic: 206.3 on 4 and 478 DF, p-value: < 2.2e-16
We will follow up by using the F-test, to determine whether the regression model is sufficient. ro^2 = 0 ro^2 > 0
We can reject the null hypothesis, due to the p < 0.001, which means there is relationship between at least one of the explanatory/independent variables and the dependent variable.
Furthermore, the multiple squared coefficient of determination or the R-Squared is at 0.733, which means that 73% of t he variability of the dependent variable (price) can be explained by the variability of the independent variables, which are in our case model, year, mileage and mpg (fuel economy).
sqrt(summary(fit2)$r.squared)
## [1] 0.8563417
The coefficient of correlation located above, also indicated that there is a high-very high correlation between the price and the independant variables.