R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

mydata <- read.csv("C:/Users/MihaPristov/Desktop/faks/R-multiverse/audi.csv", header=TRUE, sep=",", dec=".")

A unit of observation in this dataset is a used car being sold in the UK.

The sample size in this data set is 10668 but we narrow the sample down to 300 for easier analysis pourpuses.

Definition of variables:

Sourced from: https://www.kaggle.com/datasets/adityadesai13/used-car-dataset-ford-and-mercedes

The research question for this analysis is to find out how different variables influence the used cars price.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(1)

mydata1 <- mydata %>% 
  sample_n(300)

head(mydata1)
##   model year price transmission mileage fuelType tax  mpg engineSize
## 1    A1 2016 11495       Manual   33397   Petrol   0 67.3        1.0
## 2    A6 2016 17336    Automatic   62118   Diesel 145 56.5        2.0
## 3    A3 2018 19490       Manual   23800   Petrol 145 49.6        2.0
## 4    A3 2017 20000    Automatic    8500   Petrol 145 47.9        2.0
## 5    TT 2012 14000    Semi-Auto   56441   Petrol 265 36.7        2.0
## 6    A1 2015 13250    Automatic   40000   Petrol  30 58.9        1.4
mydata1$TransmissionF <- factor(mydata1$transmission,
                            levels = c("Automatic","Semi-Auto","Manual"),
                            labels = c("Automatic","Semi-Auto","Manual") )

In the regression we will use price as the dependent variable and mileage, tax, miles per gallon and Transmission type as dependent variables. I belive the price of a used car will be influenced by these variables witch makes the data valid for regression.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(mydata1[ ,c(-1,-2,-4,-6,-9,-10)],
                  smooth = FALSE)

reg1 <- lm(price ~ mileage + tax + mpg + TransmissionF,
           data = mydata1)

summary(reg1)
## 
## Call:
## lm(formula = price ~ mileage + tax + mpg + TransmissionF, data = mydata1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12707  -4790  -1146   2223  59707 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.505e+04  3.752e+03  14.674  < 2e-16 ***
## mileage                -1.884e-01  2.121e-02  -8.881  < 2e-16 ***
## tax                    -1.680e+01  1.020e+01  -1.648    0.100    
## mpg                    -4.354e+02  5.706e+01  -7.630 3.29e-13 ***
## TransmissionFSemi-Auto  2.166e+02  1.251e+03   0.173    0.863    
## TransmissionFManual    -7.170e+03  1.207e+03  -5.938 8.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8214 on 294 degrees of freedom
## Multiple R-squared:  0.5687, Adjusted R-squared:  0.5614 
## F-statistic: 77.54 on 5 and 294 DF,  p-value: < 2.2e-16
vif(reg1)
##                   GVIF Df GVIF^(1/(2*Df))
## mileage       1.178198  1        1.085448
## tax           1.822705  1        1.350076
## mpg           2.058422  1        1.434720
## TransmissionF 1.220874  2        1.051157
mean(vif(reg1))
## [1] 1.350133

Test for multicolinareity: We compare all the variables GVIF square root to square root of 5 witch is 2,23. All the variables are under 2,23 and are okay to use. The mean is close to 1.

mydata1$StdResid <- round(rstandard(reg1),3)

mydata1$CooksD <- round(cooks.distance(reg1),3)

hist(mydata1$StdResid,
     xlab= "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

hist(mydata1$CooksD,
     xlab= "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances")

mydata2 <- mydata1[!(mydata1$StdResid >= 3.0),]

mydata2 <- mydata2[(-33),]

We get rid of values that have Standard residuals over 3 or under -3, and we also get rid of a value with high cooks distance (high impact).

reg2 <- lm(price ~ mileage + tax + mpg + TransmissionF,
           data = mydata2)

summary(reg2)
## 
## Call:
## lm(formula = price ~ mileage + tax + mpg + TransmissionF, data = mydata2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12160.9  -4159.0   -783.6   3150.6  23684.2 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            60898.0124  3274.4934  18.598  < 2e-16 ***
## mileage                   -0.1625     0.0167  -9.731  < 2e-16 ***
## tax                      -32.2436     8.4330  -3.824 0.000161 ***
## mpg                     -544.2735    51.4626 -10.576  < 2e-16 ***
## TransmissionFSemi-Auto  -331.1636   971.8898  -0.341 0.733545    
## TransmissionFManual    -6129.6335   942.0010  -6.507 3.37e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6297 on 289 degrees of freedom
## Multiple R-squared:  0.6716, Adjusted R-squared:  0.6659 
## F-statistic: 118.2 on 5 and 289 DF,  p-value: < 2.2e-16

We then make another regression after we removed the outliers and high impact units.

hist(mydata2$StdResid,
     xlab= "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

The fact that normal distirbution is violated does not bother as because of the size of the sample.

hist(mydata2$CooksD,
     xlab= "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances")

vif(reg2)
##                   GVIF Df GVIF^(1/(2*Df))
## mileage       1.235603  1        1.111577
## tax           2.053399  1        1.432969
## mpg           2.436578  1        1.560954
## TransmissionF 1.238206  2        1.054868
mean(vif(reg2))
## [1] 1.427013
mydata2FiitedValues <- scale(reg2$fitted.values)

scatterplot(y = mydata2$StdResid, x = mydata2FiitedValues,
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

We seem to have a problem with heteroskadiscity and (possible uncolinearity).

WE preform a Breusch pagan test:

H0: We assume homoskedacity H1: We asssume heteroskedacity.

We reject null hypothtesis at p value < 0.001.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(reg2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : price 
##  Variables: fitted values of price 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    51.74459 
##  Prob > Chi2   =    6.321083e-13

We try to reastimate the model, we can’t change variables to make it better.

We try to cut the model down to get rid of heteroskadicity and possible non-linearity.

Final model

mydata3 <- mydata1[c(-250,-293,-218),]
mydata3 <- mydata3[!(mydata3$StdResid >= 1),]
mydata3 <- mydata3[!(mydata3$StdResid <= -1),]
hist(mydata3$StdResid,
     xlab= "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
shapiro_test(mydata3$StdResid)
## # A tibble: 1 × 3
##   variable         statistic  p.value
##   <chr>                <dbl>    <dbl>
## 1 mydata3$StdResid     0.978 0.000568

The fact that normal distirbution is violated does not bother as because of the size of the sample.

hist(mydata3$CooksD,
     xlab= "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances")

We do Cooks distance adjustments in row 173 and we do not readjust cooks distance again.

reg10 <- lm(price ~ mileage + tax + mpg + TransmissionF,
           data = mydata3)

summary(reg10)
## 
## Call:
## lm(formula = price ~ mileage + tax + mpg + TransmissionF, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6895.4 -2676.4  -216.6  2070.1 10636.1 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.581e+04  2.189e+03  20.931   <2e-16 ***
## mileage                -1.788e-01  1.115e-02 -16.035   <2e-16 ***
## tax                    -7.849e+00  5.493e+00  -1.429    0.154    
## mpg                    -3.224e+02  3.361e+01  -9.592   <2e-16 ***
## TransmissionFSemi-Auto  7.273e+02  6.419e+02   1.133    0.258    
## TransmissionFManual    -5.596e+03  5.874e+02  -9.525   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3650 on 245 degrees of freedom
## Multiple R-squared:  0.8066, Adjusted R-squared:  0.8027 
## F-statistic: 204.4 on 5 and 245 DF,  p-value: < 2.2e-16

We reastimate the regression.

vif(reg10)
##                   GVIF Df GVIF^(1/(2*Df))
## mileage       1.219234  1        1.104189
## tax           2.098929  1        1.448768
## mpg           2.399333  1        1.548978
## TransmissionF 1.182034  2        1.042695
mean(vif(reg10))
## [1] 1.420347

Multicolinearity is not a problem (values are lower than square root of 5). Mean is close to 1.

mydata3FiitedValues <- scale(reg10$fitted.values)

scatterplot(y = mydata3$StdResid, x = mydata3FiitedValues,
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

WE preform a Breusch pagan test:

H0: We assume homoskedacity H1: We assume heteroskedacity.

library(olsrr)

ols_test_breusch_pagan(reg10)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : price 
##  Variables: fitted values of price 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    8.063652 
##  Prob > Chi2   =    0.004516207

We reject null hypothtesis at p value < 0.001. We did not resolve heteroskadacity so we use lm.robust.

#install.packages("estimatr")
library(estimatr)

reg11 <- lm_robust(price ~ mileage + tax + mpg + TransmissionF,
                  se_type = "HC1",
                  data = mydata3)

summary(reg11)
## 
## Call:
## lm_robust(formula = price ~ mileage + tax + mpg + TransmissionF, 
##     data = mydata3, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                          Estimate Std. Error t value  Pr(>|t|)   CI Lower
## (Intercept)            45808.0011  2.354e+03  19.456 1.308e-51 41170.4751
## mileage                   -0.1788  1.093e-02 -16.359 3.710e-41    -0.2003
## tax                       -7.8487  5.657e+00  -1.388 1.665e-01   -18.9904
## mpg                     -322.3611  3.461e+01  -9.314 7.353e-18  -390.5323
## TransmissionFSemi-Auto   727.3408  7.249e+02   1.003 3.167e-01  -700.5086
## TransmissionFManual    -5595.6012  5.547e+02 -10.088 3.077e-20 -6688.1827
##                          CI Upper  DF
## (Intercept)            50445.5270 245
## mileage                   -0.1573 245
## tax                        3.2930 245
## mpg                     -254.1900 245
## TransmissionFSemi-Auto  2155.1902 245
## TransmissionFManual    -4503.0198 245
## 
## Multiple R-squared:  0.8066 ,    Adjusted R-squared:  0.8027 
## F-statistic: 176.9 on 5 and 245 DF,  p-value: < 2.2e-16

Conclusion

Multiple coefficient of determination indicates that 80% of variability of price is explained of the variable price is explained by independent variables mileage, mpg and TrassmissionManual. This shows the regrresion fits our data well.

Multiple correlation coefficient is the square root of R-squared and is 0.89 this coefficient means, that the maximum degree of linear relationship between the variable of price, and independent variables is 0.89 meaning it is a strong relationship.

H0:ρ2 = 0 H1:ρ2 > 0

F-statistics explain that ρ squared is more than zero at p<0.001 this means that we explain something with the model and that the model is good.

If mileage increases for 1 mile price falls for 0.18 pounds on average. Ceteris paribus.

From the regression we can not say that tax paid on the vehicle has an effect on the price.

If miles per galon increase for 1mile per galon the price declines for 322 pounds on average. Ceteris paribus.

From the model we can’t say that Semi-automatic cars are different in price than Automatic cars.

From the regression model we can see that a car with manual transmission is 5595 pounds cheaper on average then a automatic car, everything else held constant.