This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
mydata <- read.csv("C:/Users/MihaPristov/Desktop/faks/R-multiverse/audi.csv", header=TRUE, sep=",", dec=".")
The unit of observation in this data set is a used car being sold in the UK.
The full data set contains 10,668 observations, but we narrow the sample down to 300 for easier analysis.
Definitions of the variables are available at the data source.
Sourced from: https://www.kaggle.com/datasets/adityadesai13/used-car-dataset-ford-and-mercedes
The research question for this analysis is how different variables influence the price of a used car.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(1)
mydata1 <- mydata %>%
sample_n(300)
head(mydata1)
## model year price transmission mileage fuelType tax mpg engineSize
## 1 A1 2016 11495 Manual 33397 Petrol 0 67.3 1.0
## 2 A6 2016 17336 Automatic 62118 Diesel 145 56.5 2.0
## 3 A3 2018 19490 Manual 23800 Petrol 145 49.6 2.0
## 4 A3 2017 20000 Automatic 8500 Petrol 145 47.9 2.0
## 5 TT 2012 14000 Semi-Auto 56441 Petrol 265 36.7 2.0
## 6 A1 2015 13250 Automatic 40000 Petrol 30 58.9 1.4
mydata1$TransmissionF <- factor(mydata1$transmission,
levels = c("Automatic","Semi-Auto","Manual"),
labels = c("Automatic","Semi-Auto","Manual") )
In the regression we will use price as the dependent variable and mileage, tax, miles per gallon (mpg), and transmission type as the independent variables. I believe the price of a used car is influenced by these variables, which makes the data suitable for regression.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata1[ ,c(-1,-2,-4,-6,-9,-10)],
smooth = FALSE)
reg1 <- lm(price ~ mileage + tax + mpg + TransmissionF,
data = mydata1)
summary(reg1)
##
## Call:
## lm(formula = price ~ mileage + tax + mpg + TransmissionF, data = mydata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12707 -4790 -1146 2223 59707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.505e+04 3.752e+03 14.674 < 2e-16 ***
## mileage -1.884e-01 2.121e-02 -8.881 < 2e-16 ***
## tax -1.680e+01 1.020e+01 -1.648 0.100
## mpg -4.354e+02 5.706e+01 -7.630 3.29e-13 ***
## TransmissionFSemi-Auto 2.166e+02 1.251e+03 0.173 0.863
## TransmissionFManual -7.170e+03 1.207e+03 -5.938 8.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8214 on 294 degrees of freedom
## Multiple R-squared: 0.5687, Adjusted R-squared: 0.5614
## F-statistic: 77.54 on 5 and 294 DF, p-value: < 2.2e-16
vif(reg1)
## GVIF Df GVIF^(1/(2*Df))
## mileage 1.178198 1 1.085448
## tax 1.822705 1 1.350076
## mpg 2.058422 1 1.434720
## TransmissionF 1.220874 2 1.051157
mean(vif(reg1))
## [1] 1.350133
Test for multicollinearity: we compare each variable's GVIF^(1/(2*Df)) to the square root of 5, which is approximately 2.24. All of the values are below 2.24, so the variables are acceptable to use. The mean is also close to 1.
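As a quick programmatic check of this rule of thumb (a minimal sketch; sqrt(5) ≈ 2.24 is only a convention, not a strict cutoff), we can flag any predictor whose adjusted GVIF exceeds the threshold:
gvif_tab <- vif(reg1)
# TRUE would indicate a predictor above the rule-of-thumb threshold of sqrt(5)
gvif_tab[, "GVIF^(1/(2*Df))"] > sqrt(5)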
mydata1$StdResid <- round(rstandard(reg1),3)
mydata1$CooksD <- round(cooks.distance(reg1),3)
hist(mydata1$StdResid,
xlab= "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
hist(mydata1$CooksD,
xlab= "Cook's distance",
ylab = "Frequency",
main = "Histogram of Cook's distances")
# Keep observations with |standardized residual| < 3
mydata2 <- mydata1[abs(mydata1$StdResid) < 3, ]
# Drop one additional observation with a very high Cook's distance
mydata2 <- mydata2[-33, ]
We remove observations with standardized residuals above 3 or below -3, and we also remove one observation with a high Cook's distance (high influence).
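As a complementary rule of thumb (a sketch, not the criterion used above), observations with a Cook's distance above 4/n are often flagged as influential:
# Flag observations whose Cook's distance exceeds the common 4/n cutoff
n <- nrow(mydata1)
which(mydata1$CooksD > 4 / n)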
reg2 <- lm(price ~ mileage + tax + mpg + TransmissionF,
data = mydata2)
summary(reg2)
##
## Call:
## lm(formula = price ~ mileage + tax + mpg + TransmissionF, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12160.9 -4159.0 -783.6 3150.6 23684.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60898.0124 3274.4934 18.598 < 2e-16 ***
## mileage -0.1625 0.0167 -9.731 < 2e-16 ***
## tax -32.2436 8.4330 -3.824 0.000161 ***
## mpg -544.2735 51.4626 -10.576 < 2e-16 ***
## TransmissionFSemi-Auto -331.1636 971.8898 -0.341 0.733545
## TransmissionFManual -6129.6335 942.0010 -6.507 3.37e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6297 on 289 degrees of freedom
## Multiple R-squared: 0.6716, Adjusted R-squared: 0.6659
## F-statistic: 118.2 on 5 and 289 DF, p-value: < 2.2e-16
We then estimate the regression again after removing the outliers and high-influence units.
hist(mydata2$StdResid,
xlab= "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
The fact that the normality assumption is violated does not concern us because of the size of the sample (with almost 300 observations, the central limit theorem supports the inference).
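For completeness, the normality of the re-estimated model's residuals could also be checked formally (a sketch; the conclusion above rests on the sample size rather than on this test):
# Shapiro-Wilk test on the standardized residuals of reg2
shapiro.test(rstandard(reg2))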
hist(mydata2$CooksD,
xlab= "Cook's distance",
ylab = "Frequency",
main = "Histogram of Cook's distances")
vif(reg2)
## GVIF Df GVIF^(1/(2*Df))
## mileage 1.235603 1 1.111577
## tax 2.053399 1 1.432969
## mpg 2.436578 1 1.560954
## TransmissionF 1.238206 2 1.054868
mean(vif(reg2))
## [1] 1.427013
mydata2FittedValues <- scale(reg2$fitted.values)
scatterplot(y = mydata2$StdResid, x = mydata2FittedValues,
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
We seem to have a problem with heteroskedasticity (and possibly non-linearity).
We perform a Breusch-Pagan test:
H0: homoskedasticity (constant error variance); H1: heteroskedasticity.
As the output below shows, we reject the null hypothesis at p < 0.001.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(reg2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : price
## Variables: fitted values of price
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 51.74459
## Prob > Chi2 = 6.321083e-13
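As a cross-check (a sketch assuming the lmtest package is available; its bptest by default uses the model regressors and a studentized statistic, so the numbers can differ slightly from olsrr's version):
#install.packages("lmtest")
library(lmtest)
bptest(reg2)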
We try to re-estimate the model; since we cannot change the variables themselves to improve it, we instead trim the sample further to reduce the heteroskedasticity and possible non-linearity.
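A remedy often suggested for heteroskedasticity, not adopted here since the variables are kept in their original units, is to log-transform the dependent variable; a hedged sketch:
# Sketch only (not used below): refit with log(price), which often stabilizes the variance
reg2log <- lm(log(price) ~ mileage + tax + mpg + TransmissionF, data = mydata2)
ols_test_breusch_pagan(reg2log)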
# Drop three high-influence observations (high Cook's distance),
# then keep only observations with |standardized residual| < 1
mydata3 <- mydata1[c(-250, -293, -218), ]
mydata3 <- mydata3[abs(mydata3$StdResid) < 1, ]
hist(mydata3$StdResid,
xlab= "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
shapiro_test(mydata3$StdResid)
## # A tibble: 1 × 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 mydata3$StdResid 0.978 0.000568
The Shapiro-Wilk test rejects normality of the standardized residuals, but this does not concern us because of the size of the sample.
hist(mydata3$CooksD,
xlab= "Cook's distance",
ylab = "Frequency",
main = "Histogram of Cook's distances")
The Cook's distance adjustments were made above when constructing mydata3 (removing rows 250, 293, and 218), and we do not readjust for Cook's distance again.
reg10 <- lm(price ~ mileage + tax + mpg + TransmissionF,
data = mydata3)
summary(reg10)
##
## Call:
## lm(formula = price ~ mileage + tax + mpg + TransmissionF, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6895.4 -2676.4 -216.6 2070.1 10636.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.581e+04 2.189e+03 20.931 <2e-16 ***
## mileage -1.788e-01 1.115e-02 -16.035 <2e-16 ***
## tax -7.849e+00 5.493e+00 -1.429 0.154
## mpg -3.224e+02 3.361e+01 -9.592 <2e-16 ***
## TransmissionFSemi-Auto 7.273e+02 6.419e+02 1.133 0.258
## TransmissionFManual -5.596e+03 5.874e+02 -9.525 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3650 on 245 degrees of freedom
## Multiple R-squared: 0.8066, Adjusted R-squared: 0.8027
## F-statistic: 204.4 on 5 and 245 DF, p-value: < 2.2e-16
We re-estimate the regression on the trimmed sample.
vif(reg10)
## GVIF Df GVIF^(1/(2*Df))
## mileage 1.219234 1 1.104189
## tax 2.098929 1 1.448768
## mpg 2.399333 1 1.548978
## TransmissionF 1.182034 2 1.042695
mean(vif(reg10))
## [1] 1.420347
Multicollinearity is not a problem: all GVIF^(1/(2*Df)) values are below the square root of 5 (about 2.24), and the mean is close to 1.
mydata3FittedValues <- scale(reg10$fitted.values)
scatterplot(y = mydata3$StdResid, x = mydata3FittedValues,
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
We perform a Breusch-Pagan test:
H0: homoskedasticity (constant error variance); H1: heteroskedasticity.
library(olsrr)
ols_test_breusch_pagan(reg10)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : price
## Variables: fitted values of price
##
## Test Summary
## ------------------------------
## DF = 1
## Chi2 = 8.063652
## Prob > Chi2 = 0.004516207
We reject the null hypothesis at p < 0.01. We did not resolve the heteroskedasticity, so we re-estimate the model with lm_robust, using heteroskedasticity-consistent (HC1) standard errors.
#install.packages("estimatr")
library(estimatr)
reg11 <- lm_robust(price ~ mileage + tax + mpg + TransmissionF,
se_type = "HC1",
data = mydata3)
summary(reg11)
##
## Call:
## lm_robust(formula = price ~ mileage + tax + mpg + TransmissionF,
## data = mydata3, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower
## (Intercept) 45808.0011 2.354e+03 19.456 1.308e-51 41170.4751
## mileage -0.1788 1.093e-02 -16.359 3.710e-41 -0.2003
## tax -7.8487 5.657e+00 -1.388 1.665e-01 -18.9904
## mpg -322.3611 3.461e+01 -9.314 7.353e-18 -390.5323
## TransmissionFSemi-Auto 727.3408 7.249e+02 1.003 3.167e-01 -700.5086
## TransmissionFManual -5595.6012 5.547e+02 -10.088 3.077e-20 -6688.1827
## CI Upper DF
## (Intercept) 50445.5270 245
## mileage -0.1573 245
## tax 3.2930 245
## mpg -254.1900 245
## TransmissionFSemi-Auto 2155.1902 245
## TransmissionFManual -4503.0198 245
##
## Multiple R-squared: 0.8066 , Adjusted R-squared: 0.8027
## F-statistic: 176.9 on 5 and 245 DF, p-value: < 2.2e-16
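As a cross-check of the HC1 standard errors (a sketch assuming the sandwich and lmtest packages are available), the same robust inference can be reproduced from the ordinary lm fit with coeftest:
#install.packages("sandwich")
#install.packages("lmtest")
library(sandwich)
library(lmtest)
coeftest(reg10, vcov = vcovHC(reg10, type = "HC1"))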
The multiple coefficient of determination indicates that about 80% of the variability of price is explained by the independent variables (mileage, tax, mpg, and transmission type). This shows the regression fits our data well.
The multiple correlation coefficient is the square root of R-squared, approximately 0.9 (the square root of 0.8066 is about 0.898). It measures the degree of linear relationship between price and the full set of independent variables, indicating a strong relationship.
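Computed directly (R-squared is the same for reg10 and reg11, since lm_robust changes only the standard errors):
# Multiple correlation coefficient = square root of R-squared
sqrt(summary(reg10)$r.squared)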
H0: ρ² = 0; H1: ρ² > 0
The F-statistic shows that ρ² is greater than zero at p < 0.001; the model explains a statistically significant share of the variability in price.
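As a worked check (a sketch with k = 5 model terms and n = 251 observations in mydata3), the reported F-statistic can be recovered from R-squared:
# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
r2 <- summary(reg10)$r.squared
k <- 5
n <- nrow(mydata3)
(r2 / k) / ((1 - r2) / (n - k - 1)) # approximately 204.4, matching the summary above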
If mileage increases by 1 mile, the price falls by about 0.18 pounds on average, ceteris paribus.
From the regression we cannot conclude that the road tax paid on the vehicle has an effect on the price.
If fuel efficiency increases by 1 mile per gallon, the price declines by about 322 pounds on average, ceteris paribus.
From the model we cannot say that semi-automatic cars differ in price from automatic cars.
From the regression model we can see that a car with a manual transmission is about 5,596 pounds cheaper on average than an automatic car, everything else held constant.
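To illustrate how the fitted model can be used, here is a hypothetical prediction; the example car's values below are invented for illustration and are not taken from the data set:
# Hypothetical car: manual transmission, 40,000 miles, 145 GBP tax, 50 mpg
newcar <- data.frame(mileage = 40000,
tax = 145,
mpg = 50,
TransmissionF = factor("Manual", levels = c("Automatic", "Semi-Auto", "Manual")))
predict(reg11, newdata = newcar) # predicted price in pounds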