Firstly i imported the dataset, and cleaned it in a manner that sufices my needs in at this instance, and took a randome sample of observations, since using all the observations would lead to p values being inherently low due to the size of the dataset.

CarMarket <- read.table("~/IMB/Multivariate analysis/HOMEWORK MVA/bmw.csv", header=TRUE, sep=",", dec=";")
CarMarket <- CarMarket[order(CarMarket$price, decreasing = TRUE), ] 


set.seed(1)
CarMarket <- sample_n(CarMarket, 1500)

#CarMarket$model <- unclass(CarMarket$model)

CarMarket <- CarMarket %>% relocate(price)

CarMarket <- CarMarket[,c(-4, -6, -7, -9)]

CarMarket <- na.omit(CarMarket)

head(CarMarket)

##   price     model year mileage  mpg
## 1 36500        X3 2019    5664 54.3
## 2 14995  1 Series 2017   13472 65.7
## 3 21898  3 Series 2018    6425 48.7
## 4  8991  3 Series 2013   64377 68.9
## 5 11030  1 Series 2017   35321 83.1
## 6  7990  1 Series 2010   60000 53.3

This data set regards the British used car market, or used BMW market to be precise. The unit of observation are specific cars, with their respective propreties, which are denoted as variables in the data set. The sample size consists of 483 and 9 respective variables (5 used). The variables consist of:

-Model of car: this variable entails specific models which the BMW line offers ranging from 1-8 and x1-x8 -year of production: the year the specific unit of car was produced -price: price of the unit in GBP -mileage: the distance the vehicle has already travelled in miles -tax: required tax payment upon purchase of the vehicle -mpg: fuel economy of the vehicle denoted in distance driven (in miles) per gallon of fuel (4L)

Using the latter, I will be performing multiple regression, in order to determine which of the explanatory variables impact the price, and to which extent they do so.

This data is sourced from Kaggle: https://www.kaggle.com/datasets/adityadesai13/used-car-dataset-ford-and-mercedes?select=bmw.csv

Coercing variables into usable forms

CarMarket$mpg <- as.numeric(CarMarket$mpg)
CarMarket$mileage <- as.numeric(CarMarket$mileage)
CarMarket$price <- as.numeric(CarMarket$price)
CarMarket$year <- as.numeric(CarMarket$year)

CarMarket$model <- factor(CarMarket$model, levels = c(" 3 Series" , " 5 Series" ," M5"  ,     " 8 Series" ," X7"    ,   " i8"     ,  " X5"    ,   " X6"  ,     " X3"     ,  " M3"    ,   " X4"     ,  " M4"  , " 7 Series", " 6 Series" ," Z4"    ,   " M2" ,      " 4 Series" ," X2" ,      " 1 Series" ," 2 Series", " X1"), labels = c(" 3 Series" , " 5 Series" ," M5"  ,     " 8 Series" ," X7"    ,   " i8"     ,  " X5"    ,   " X6"  ,     " X3"     ,  " M3"    ,   " X4"     ,  " M4"  , " 7 Series", " 6 Series" ," Z4"    ,   " M2" ,      " 4 Series" ," X2" ,      " 1 Series" ," 2 Series", " X1"))

CarMarket <- CarMarket[CarMarket$model %in% c(" 3 Series", " 5 Series"), ]

#Graphical presentation of the variables used in the regression model

library(car)
scatterplotMatrix(CarMarket[,-2], 
                  smooth = FALSE)

From the scatterplot used above, we can observe a positive correlation between the independant variable (price) and the dependant/explanatory variables year, a negative correlation with mileage and a visually non indicative relationship with mpg.

#Multiple regression model

In the following segment, I will create a regression model to try to identify the relationship between the dependand and independant variables, namely I will use the following: year, mileage, mpg and model to try and explain the price.

fit1 <- lm(price ~ model + year + mileage + mpg, 
                      data = CarMarket) #creating a linear model denoted by
vif (fit1)

##    model     year  mileage      mpg 
## 1.007097 2.819297 2.829575 1.021481

mean(vif(fit1))

## [1] 1.919362

We can observe that the Variance Inflation Factor (VIF) value is lower than 5 for all the numeric variables, and lower than √5 for the categorical variable (model). The mean VIF is also accaptable since it is not higher than 5.

CarMarket <- na.omit(CarMarket)

#Adding standardized residuals and cooks distances to the dataset
CarMarket$StdRes <- rstandard(fit1)
CarMarket$CooksD <- cooks.distance(fit1)

#Making histograms of distribution of standardized residuals and Cook's distances in order to detect outliers and high impact units which need to be potentially removed.
hist(CarMarket$StdRes,
     xlab = "Standardized residuals", 
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(CarMarket$StdRes)

## 
##  Shapiro-Wilk normality test
## 
## data:  CarMarket$StdRes
## W = 0.87622, p-value < 2.2e-16

hist(CarMarket$CooksD, 
      main = "Histogram of Cooks distances",
     xlab = "Cooks distance", 
     ylab = "Frequency")

Observing the distribution of standardized residuals and the results of the Shapiro Wilk test (p < 0.05), we can reject the Null Hypothesis and accept the alternative: H1 Standard residuals are not normally distributed

Due to the absolute values of the std. res., I will be forced to remove those which posses am absolute value higher than 3.

I will also remove the units, which according to the graph have a higher impact, we can observe those to be located around the 0.4 and 0.5 region in the graph located above. For this reason, I will remove all units with Cooks D above 0.3.

head(CarMarket[order(CarMarket$StdRes),])

##      price     model year mileage  mpg    StdRes      CooksD
## 1280 18550  3 Series 2019    7478 49.6 -1.779901 0.003315089
## 630  14400  3 Series 2017   14358 48.7 -1.745936 0.002875576
## 204  17480  3 Series 2019   20410 48.7 -1.717744 0.003578753
## 1151 14050  3 Series 2017   20829 48.7 -1.672030 0.002168492
## 1052 14998  3 Series 2017   13309 48.7 -1.644969 0.002654508
## 271  14000  3 Series 2017   22000 52.3 -1.631325 0.001876465

head(CarMarket[order(-CarMarket$StdRes),])

##      price     model year mileage   mpg   StdRes      CooksD
## 1223 71990  3 Series 2020     150  47.1 8.829247 0.112038310
## 683  54845  5 Series 2020     450  60.1 4.704782 0.041355788
## 189   6990  5 Series 2002   78316  23.7 3.156383 0.266650612
## 593  42995  5 Series 2020    7500 156.9 3.115311 0.112618859
## 258  40980  3 Series 2019    6250  43.5 2.835540 0.009645099
## 902  42990  3 Series 2020    3698  43.5 2.824708 0.012391401

head(CarMarket[order(-CarMarket$CooksD),], 20)

##      price     model year mileage   mpg   StdRes     CooksD
## 189   6990  5 Series 2002   78316  23.7 3.156383 0.26665061
## 593  42995  5 Series 2020    7500 156.9 3.115311 0.11261886
## 1223 71990  3 Series 2020     150  47.1 8.829247 0.11203831
## 460  38995  5 Series 2019    6500 156.9 2.620196 0.07957674
## 617  36499  5 Series 2019    5000 156.9 2.051077 0.04905970
## 40    5994  3 Series 2007  118000  38.2 2.557216 0.04214782
## 683  54845  5 Series 2020     450  60.1 4.704782 0.04135579
## 213   3800  3 Series 2010  156800  58.9 1.996363 0.03659728
## 1147 34500  5 Series 2019      99 156.9 1.510089 0.02722613
## 1293  4495  3 Series 2009  138026  42.8 1.969771 0.02493525
## 538   6495  3 Series 2009  122144  34.0 1.963881 0.02096822
## 437   6985  3 Series 2008   98630  39.2 1.944624 0.02033694
## 882   5990  3 Series 2008  108000  38.7 1.945053 0.02012034
## 650   3885  3 Series 2007   97284  42.2 1.665080 0.01873457
## 297   3500  3 Series 2011  149958  62.8 1.417366 0.01758398
## 603  32490  5 Series 2019    5134 156.9 1.195254 0.01665084
## 542   7795  3 Series 2014  147110  62.8 1.128884 0.01620762
## 202   2550  3 Series 2007  103000  49.6 1.562819 0.01552426
## 902  42990  3 Series 2020    3698  43.5 2.824708 0.01239140
## 1026 43990  5 Series 2020      10  39.2 2.284410 0.01233518

Judging by the ordered cooks distances, I will remove observations 126, 246, 16, 359, 147, 120 and 299

CarMarket1 <- CarMarket [c(-1223, -683, -189, -593)]

At this point, I will retry running the linear model, without the “outliers” denoted by invalid Cooks Distances and Standard Residuals.

fit2 <- lm(price ~ model + year + mileage + mpg, 
                      data = CarMarket1)

#Following up with homoskedasticity tests

Ho: Variance is constant (we have homoskedasticity).

H1: Variance is not constant (we have heteroskedasticity).

CarMarket1 <- na.omit(CarMarket1)

CarMarket1$StdFitValues <- scale(fit1$fitted.values)

scatterplot(y = CarMarket1$StdRes, x = CarMarket1$StdFitValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : price 
##  Variables: fitted values of price 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    17.76829 
##  Prob > Chi2   =    2.49507e-05

Due to p-value < 0.005, we can reject the null hypothesis and accept the alternative, which mean we are detecting heteroskedasticity. This fact means we have to use the lm_robust function when describing the model.

lm_robust(price ~ model + year + mileage + mpg, 
                      data = CarMarket1, se_type = "HC1")

##                     Estimate   Std. Error    t value     Pr(>|t|)      CI Lower
## (Intercept)    -3.599416e+06 3.532289e+05 -10.190037 3.372308e-22 -4.293489e+06
## model 5 Series  3.067256e+03 4.764616e+02   6.437573 2.958087e-10  2.131038e+03
## year            1.797380e+03 1.750233e+02  10.269377 1.722076e-22  1.453470e+03
## mileage        -1.082825e-01 1.259731e-02  -8.595689 1.186496e-16 -1.330355e-01
## mpg            -3.247235e+01 1.349944e+01  -2.405459 1.653107e-02 -5.899792e+01
##                     CI Upper  DF
## (Intercept)    -2.905343e+06 478
## model 5 Series  4.003474e+03 478
## year            2.141290e+03 478
## mileage        -8.352958e-02 478
## mpg            -5.946775e+00 478

From the summary printed, we can observe that all the numerical variables have a statistically significant impact on the price of the vehicle, with the variables “mileage” and year having the highest impact, followed by the variable mileage. Same goes for the categorical variable, which is also significant.

Intercept: If all independant variables would be at zero, the value of the intercept or price, would be -3.599416e+06 at the p-value 3.372308e-22. (not relevant)

Model 5 Series: On average, the model 5 Series will be priced at about 3.067256e+03 GBP higher than the 3 Series.

Year: An increase in a year of production of the vehicle would increase the price by 1.797380e+03 GBP.

Mileage: An increase in a mile driven on the odometer would decrease the price by 1.082825e-01 GBP.

MPG: An increase of the economy by 1 mpg would decrease the price of the vehicle by 3.247235e+01 GBP.

The P-Values for all the variables are < 0.05.

summary(lm_robust(price ~ model + year + mileage + mpg, 
                      data = CarMarket1, se_type = "HC1"))

## 
## Call:
## lm_robust(formula = price ~ model + year + mileage + mpg, data = CarMarket1, 
##     se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                  Estimate Std. Error t value  Pr(>|t|)   CI Lower   CI Upper
## (Intercept)    -3.599e+06  3.532e+05 -10.190 3.372e-22 -4.293e+06 -2.905e+06
## model 5 Series  3.067e+03  4.765e+02   6.438 2.958e-10  2.131e+03  4.003e+03
## year            1.797e+03  1.750e+02  10.269 1.722e-22  1.453e+03  2.141e+03
## mileage        -1.083e-01  1.260e-02  -8.596 1.186e-16 -1.330e-01 -8.353e-02
## mpg            -3.247e+01  1.350e+01  -2.405 1.653e-02 -5.900e+01 -5.947e+00
##                 DF
## (Intercept)    478
## model 5 Series 478
## year           478
## mileage        478
## mpg            478
## 
## Multiple R-squared:  0.7333 ,    Adjusted R-squared:  0.7311 
## F-statistic: 206.3 on 4 and 478 DF,  p-value: < 2.2e-16

We will follow up by using the F-test, to determine whether the regression model is sufficient. ro^2 = 0 ro^2 > 0

We can reject the null hypothesis, due to the p < 0.001, which means there is relationship between at least one of the explanatory/independent variables and the dependent variable.

Furthermore, the multiple squared coefficient of determination or the R-Squared is at 0.733, which means that 73% of t he variability of the dependent variable (price) can be explained by the variability of the independent variables, which are in our case model, year, mileage and mpg (fuel economy).

sqrt(summary(fit2)$r.squared)

## [1] 0.8563417

The coefficient of correlation located above, also indicated that there is a high-very high correlation between the price and the independant variables.

HW-3-MVA

Kristjan Puhek

2023-01-30

Coercing variables into usable forms