The data set consists of 301 observations and 9 variables. It came without any description of the variables, so the descriptions below are my own:
- Car_Name: name of the car
- Year: the year the car was made
- Selling_Price: the price for which the car was initially sold (in thousand US Dollars)
- Present_Price: reseller price (in thousand US Dollars)
- Driven_kms: how many kilometers the car has been driven
- Fuel_Type: what type of fuel the car uses
- Selling_type: how the car is sold (dealer or individual)
- Transmission: what type of transmission the car has
- Owner: how many owners the car has already had
- (Age: age of the car in years, computed below)
The data set was found on the Kaggle website; the author is Mohammed Tauseef Ahmed. Retrieved Feb 2nd 2023, from https://www.kaggle.com/code/mohammedtauseefahmed/car-price-dataset/data?fbclid=IwAR0YAkuEZWEvHOMRQlx4xq9iW5sPQvs2FtyFbDnVwTG9kSeElFlO8JIZlnw .
The unit of observation is one car.
We are going to explore whether the present price of a car is related to its selling price, the kilometers driven, its age, and its selling type.
mydata <- read.csv("C:/Users/pauli/Downloads/car data.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## Car_Name Year Selling_Price Present_Price Driven_kms Fuel_Type
## 1 ritz 2014 3.35 5.59 27000 Petrol
## 2 sx4 2013 4.75 9.54 43000 Diesel
## 3 ciaz 2017 7.25 9.85 6900 Petrol
## 4 wagon r 2011 2.85 4.15 5200 Petrol
## 5 swift 2014 4.60 6.87 42450 Diesel
## 6 vitara brezza 2018 9.25 9.83 2071 Diesel
## Selling_type Transmission Owner
## 1 Dealer Manual 0
## 2 Dealer Manual 0
## 3 Dealer Manual 0
## 4 Dealer Manual 0
## 5 Dealer Manual 0
## 6 Dealer Manual 0
We remove the column we do not need (Car_Name):
mydata1 <- mydata[,c(-1) ]
head(mydata1)
## Year Selling_Price Present_Price Driven_kms Fuel_Type Selling_type
## 1 2014 3.35 5.59 27000 Petrol Dealer
## 2 2013 4.75 9.54 43000 Diesel Dealer
## 3 2017 7.25 9.85 6900 Petrol Dealer
## 4 2011 2.85 4.15 5200 Petrol Dealer
## 5 2014 4.60 6.87 42450 Diesel Dealer
## 6 2018 9.25 9.83 2071 Diesel Dealer
## Transmission Owner
## 1 Manual 0
## 2 Manual 0
## 3 Manual 0
## 4 Manual 0
## 5 Manual 0
## 6 Manual 0
Now let's calculate the age of each car (as of 2023):
mydata1$Age <- 2023 - mydata1$Year
head(mydata1)
## Year Selling_Price Present_Price Driven_kms Fuel_Type Selling_type
## 1 2014 3.35 5.59 27000 Petrol Dealer
## 2 2013 4.75 9.54 43000 Diesel Dealer
## 3 2017 7.25 9.85 6900 Petrol Dealer
## 4 2011 2.85 4.15 5200 Petrol Dealer
## 5 2014 4.60 6.87 42450 Diesel Dealer
## 6 2018 9.25 9.83 2071 Diesel Dealer
## Transmission Owner Age
## 1 Manual 0 9
## 2 Manual 0 10
## 3 Manual 0 6
## 4 Manual 0 12
## 5 Manual 0 9
## 6 Manual 0 5
# Note: these two assignments change nothing; prices stay in thousand US Dollars
mydata1$Selling_Price <- mydata1$Selling_Price
mydata1$Present_Price <- mydata1$Present_Price
head(mydata1)
## Year Selling_Price Present_Price Driven_kms Fuel_Type Selling_type
## 1 2014 3.35 5.59 27000 Petrol Dealer
## 2 2013 4.75 9.54 43000 Diesel Dealer
## 3 2017 7.25 9.85 6900 Petrol Dealer
## 4 2011 2.85 4.15 5200 Petrol Dealer
## 5 2014 4.60 6.87 42450 Diesel Dealer
## 6 2018 9.25 9.83 2071 Diesel Dealer
## Transmission Owner Age
## 1 Manual 0 9
## 2 Manual 0 10
## 3 Manual 0 6
## 4 Manual 0 12
## 5 Manual 0 9
## 6 Manual 0 5
# The data has no ready-made factor variable, but one is required, so I create it from the selling type and use it later in the regression
mydata1$Type[mydata1$Selling_type=="Individual"]<-"0"
mydata1$Type[mydata1$Selling_type=="Dealer"]<-"1"
mydata1$TypeF <- factor(mydata1$Type,
levels = c("0", "1"),
labels = c("Individual", "Dealer"))
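The same factor can also be built in one step, without the intermediate Type column. A minimal sketch on a small made-up vector (selling_type and type_f are illustrative names, not objects from the analysis above):

```r
# Hypothetical stand-in for mydata1$Selling_type
selling_type <- c("Dealer", "Individual", "Dealer", "Dealer", "Individual")

# Listing "Individual" first makes it the reference level in a regression
type_f <- factor(selling_type, levels = c("Individual", "Dealer"))

levels(type_f)
table(type_f)
```

With this level order the regression coefficient is reported for "Dealer" relative to "Individual", which is the comparison used later in the report.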
Let's compute some descriptive statistics to see whether there are any missing values or anything else wrong with the data.
summary(mydata1)
## Year Selling_Price Present_Price Driven_kms
## Min. :2003 Min. : 0.100 Min. : 0.320 Min. : 500
## 1st Qu.:2012 1st Qu.: 0.900 1st Qu.: 1.200 1st Qu.: 15000
## Median :2014 Median : 3.600 Median : 6.400 Median : 32000
## Mean :2014 Mean : 4.661 Mean : 7.628 Mean : 36947
## 3rd Qu.:2016 3rd Qu.: 6.000 3rd Qu.: 9.900 3rd Qu.: 48767
## Max. :2018 Max. :35.000 Max. :92.600 Max. :500000
## Fuel_Type Selling_type Transmission Owner
## Length:301 Length:301 Length:301 Min. :0.00000
## Class :character Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Mode :character Median :0.00000
## Mean :0.04319
## 3rd Qu.:0.00000
## Max. :3.00000
## Age Type TypeF
## Min. : 5.000 Length:301 Individual:106
## 1st Qu.: 7.000 Class :character Dealer :195
## Median : 9.000 Mode :character
## Mean : 9.372
## 3rd Qu.:11.000
## Max. :20.000
The oldest car in the data set was manufactured in 2003. The mean selling price is 4,661 US Dollars and the mean present price is 7,628 US Dollars. The minimum number of driven kilometers is 500 km and the maximum is 500,000 km. The “youngest” car is 5 years old and the “oldest” is 20 years old. 106 cars are being sold by individuals and 195 by car dealers. summary() reports no NA's for the numeric variables.
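summary() only lists NA's for numeric columns, so a direct count per column is a useful complement. A minimal sketch on a made-up data frame (toy is hypothetical and deliberately contains one missing price):

```r
# Hypothetical toy data frame with one missing price
toy <- data.frame(
  Selling_Price = c(3.35, NA, 7.25),
  Driven_kms    = c(27000, 43000, 6900),
  Fuel_Type     = c("Petrol", "Diesel", "Petrol")
)

# Number of missing values per column (works for character columns too)
na_counts <- colSums(is.na(toy))
na_counts

# TRUE only if the whole data frame is complete
all_complete <- !anyNA(toy)
all_complete
```

Applied to the car data, the same two calls would be run on mydata1 instead of toy.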
library(psych)
describe(mydata1)
## vars n mean sd median trimmed mad min
## Year 1 301 2013.63 2.89 2014.0 2014.00 2.97 2003.00
## Selling_Price 2 301 4.66 5.08 3.6 3.74 3.85 0.10
## Present_Price 3 301 7.63 8.64 6.4 6.18 6.89 0.32
## Driven_kms 4 301 36947.21 38886.88 32000.0 32133.00 25204.20 500.00
## Fuel_Type* 5 301 2.79 0.43 3.0 2.87 0.00 1.00
## Selling_type* 6 301 1.35 0.48 1.0 1.32 0.00 1.00
## Transmission* 7 301 1.87 0.34 2.0 1.96 0.00 1.00
## Owner 8 301 0.04 0.25 0.0 0.00 0.00 0.00
## Age 9 301 9.37 2.89 9.0 9.00 2.97 5.00
## Type* 10 301 1.65 0.48 2.0 1.68 0.00 1.00
## TypeF* 11 301 1.65 0.48 2.0 1.68 0.00 1.00
## max range skew kurtosis se
## Year 2018.0 15.00 -1.23 1.46 0.17
## Selling_Price 35.0 34.90 2.47 8.66 0.29
## Present_Price 92.6 92.28 4.04 30.95 0.50
## Driven_kms 500000.0 499500.00 6.37 66.94 2241.40
## Fuel_Type* 3.0 2.00 -1.65 1.44 0.02
## Selling_type* 2.0 1.00 0.62 -1.63 0.03
## Transmission* 2.0 1.00 -2.15 2.64 0.02
## Owner 3.0 3.00 7.54 71.59 0.01
## Age 20.0 15.00 1.23 1.46 0.17
## Type* 2.0 1.00 -0.62 -1.63 0.03
## TypeF* 2.0 1.00 -0.62 -1.63 0.03
40 of the cars have automatic transmission and 261 have manual transmission.
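These counts can be cross-checked against the describe() output above. Variables marked with * are non-numeric and are summarized through integer codes (1, 2, ...) assigned in level order, which is alphabetical for character columns. For a two-category variable the mean is therefore 1 plus the share of the second category. A small sketch using the counts from the text:

```r
n_automatic <- 40   # count stated in the text
n_manual    <- 261  # count stated in the text

# "Automatic" is coded 1 and "Manual" 2 (alphabetical order)
coded <- c(rep(1, n_automatic), rep(2, n_manual))

transmission_mean <- round(mean(coded), 2)
transmission_mean   # 1.87, matching the Transmission* row
```

In the actual analysis the counts themselves could be obtained with table(mydata1$Transmission).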
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydata1[ ,-c(1,5:8,10:11)],
smooth = FALSE)
From the scatterplot matrix above we can see a positive relationship
between the present price (the dependent variable) and each of the
explanatory variables (selling price, driven kms, age).
scatterplot(mydata1$Selling_Price~ mydata1$Present_Price | mydata1$TypeF,
ylab = "Selling price",
xlab = "Present price",smooth = FALSE)
fit <- lm(Present_Price ~ Selling_Price + Driven_kms + Age + TypeF,
data = mydata1)
summary(fit)
##
## Call:
## lm(formula = Present_Price ~ Selling_Price + Driven_kms + Age +
## TypeF, data = mydata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.595 -1.337 -0.024 1.112 33.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.978e+00 8.078e-01 -8.638 3.63e-16 ***
## Selling_Price 1.586e+00 4.895e-02 32.404 < 2e-16 ***
## Driven_kms 1.096e-05 6.125e-06 1.790 0.0745 .
## Age 7.226e-01 8.482e-02 8.519 8.31e-16 ***
## TypeFDealer 5.388e-02 5.023e-01 0.107 0.9146
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.448 on 296 degrees of freedom
## Multiple R-squared: 0.8429, Adjusted R-squared: 0.8408
## F-statistic: 397.1 on 4 and 296 DF, p-value: < 2.2e-16
vif(fit)
## Selling_Price Driven_kms Age TypeF
## 1.562058 1.431578 1.517730 1.456911
mean(vif(fit))
## [1] 1.492069
mydata1$StdResid <- round(rstandard(fit), 3)
mydata1$CooksD <- round(cooks.distance(fit), 3)
hist(mydata1$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
H0: The standardized residuals are normally distributed
H1: The standardized residuals are not normally distributed
shapiro.test(mydata1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata1$StdResid
## W = 0.71771, p-value < 2.2e-16
We reject H0 at p<0.001 and conclude that the standardized residuals are not normally distributed.
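To illustrate how the test behaves, the sketch below applies shapiro.test() to two simulated samples (not the car data): one drawn from a normal distribution and one from a strongly right-skewed distribution. The seed and the sample sizes are arbitrary choices.

```r
set.seed(1)
normal_sample <- rnorm(300)  # symmetric, bell-shaped
skewed_sample <- rexp(300)   # strongly right-skewed

test_normal <- shapiro.test(normal_sample)
test_skewed <- shapiro.test(skewed_sample)

# W close to 1 supports normality; a small p-value rejects H0
c(W = unname(test_normal$statistic), p = test_normal$p.value)
c(W = unname(test_skewed$statistic), p = test_skewed$p.value)
```

The W of 0.72 reported above is far below 1, which fits the long right tail seen in the histogram of standardized residuals.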
hist(mydata1$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
head(mydata1[order(-mydata1$CooksD),], 30)
## Year Selling_Price Present_Price Driven_kms Fuel_Type Selling_type
## 87 2010 35.00 92.60 78000 Diesel Dealer
## 197 2008 0.17 0.52 500000 Petrol Individual
## 65 2017 33.00 36.23 6000 Diesel Dealer
## 86 2006 2.50 23.73 142000 Petrol Individual
## 83 2017 23.00 25.39 15000 Diesel Dealer
## 38 2003 0.35 2.28 127000 Petrol Individual
## 95 2008 4.00 22.78 89000 Petrol Dealer
## 180 2010 0.31 1.05 213000 Petrol Individual
## 190 2005 0.20 0.57 55000 Petrol Individual
## 52 2015 23.00 30.61 40000 Diesel Dealer
## 94 2015 23.00 30.61 40000 Diesel Dealer
## 97 2016 20.75 25.39 29000 Diesel Dealer
## 79 2010 5.25 22.83 80000 Petrol Dealer
## 67 2017 19.75 23.15 11000 Petrol Dealer
## 53 2017 18.00 19.77 15000 Diesel Dealer
## 63 2014 18.75 35.96 78000 Diesel Dealer
## 54 2013 16.00 30.61 135000 Diesel Individual
## 91 2009 3.80 18.61 62000 Petrol Dealer
## 201 2006 0.10 0.75 92233 Petrol Individual
## 40 2003 2.25 7.98 62000 Petrol Dealer
## 98 2017 17.00 18.64 8700 Petrol Dealer
## 81 2016 14.73 14.89 23000 Diesel Dealer
## 58 2010 4.75 18.54 50000 Petrol Dealer
## 60 2014 19.99 35.96 41000 Diesel Dealer
## 80 2012 14.50 30.61 89000 Diesel Dealer
## 193 2007 0.20 0.75 49000 Petrol Individual
## 200 2007 0.12 0.58 53000 Petrol Individual
## 51 2012 14.90 30.61 104707 Diesel Dealer
## 186 2008 0.25 0.58 1900 Petrol Individual
## 48 2006 1.05 4.15 65000 Petrol Dealer
## Transmission Owner Age Type TypeF StdResid CooksD
## 87 Manual 0 13 1 Dealer 10.852 5.395
## 197 Automatic 0 15 0 Individual -4.169 5.214
## 65 Automatic 0 6 1 Dealer -4.246 0.576
## 86 Automatic 3 17 0 Individual 3.827 0.136
## 83 Automatic 0 6 1 Dealer -2.587 0.078
## 38 Manual 0 20 0 Individual -2.132 0.054
## 95 Automatic 0 15 1 Dealer 3.379 0.043
## 180 Manual 0 13 0 Individual -1.275 0.033
## 190 Manual 0 18 0 Individual -1.892 0.033
## 52 Automatic 0 8 1 Dealer -1.543 0.028
## 94 Automatic 0 8 1 Dealer -1.543 0.028
## 97 Automatic 0 7 1 Dealer -1.771 0.027
## 79 Automatic 0 13 1 Dealer 3.253 0.024
## 67 Automatic 0 6 1 Dealer -1.689 0.023
## 53 Automatic 0 6 1 Dealer -1.872 0.022
## 63 Automatic 0 9 1 Dealer 1.708 0.022
## 54 Automatic 0 10 0 Individual 1.058 0.019
## 91 Manual 0 14 1 Dealer 2.545 0.019
## 201 Manual 0 17 0 Individual -1.688 0.019
## 40 Manual 0 20 1 Dealer -1.136 0.016
## 98 Manual 0 6 1 Dealer -1.715 0.016
## 81 Manual 0 7 1 Dealer -2.008 0.014
## 58 Manual 0 13 1 Dealer 2.330 0.013
## 60 Automatic 0 9 1 Dealer 1.249 0.013
## 80 Automatic 0 11 1 Dealer 1.645 0.013
## 193 Manual 1 16 0 Individual -1.380 0.012
## 200 Manual 0 16 0 Individual -1.405 0.012
## 51 Automatic 0 11 1 Dealer 1.412 0.011
## 186 Automatic 0 15 0 Individual -1.093 0.009
## 48 Manual 0 17 1 Dealer -1.058 0.008
We remove the outliers and the units with high influence (large Cook's distances):
mydata2<-mydata1[mydata1$StdResid <= 2.5, ]
mydata3<-mydata2[mydata2$StdResid >= -2.5, ]
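We deleted the observations with standardized residuals below -2.5 or above 2.5. This removes 8 cars, including the two with Cook's distances above 1 (rows 87 and 197), and leaves 293 observations.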
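The filtering step can also be written as a single condition on the absolute value. The sketch below shows this on simulated data with one planted outlier (toy, fit_toy and the planted value are made up for illustration) and recomputes Cook's distances from the refitted model, since the diagnostics stored before the clean-up belong to the old fit.

```r
set.seed(42)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)
y[25] <- y[25] + 50          # plant one gross outlier
toy <- data.frame(x = x, y = y)

fit_toy <- lm(y ~ x, data = toy)
toy$StdResid <- rstandard(fit_toy)

# Keep rows whose standardized residual lies within [-2.5, 2.5]
toy_clean <- toy[abs(toy$StdResid) <= 2.5, ]
nrow(toy) - nrow(toy_clean)  # number of removed rows

# Refit and recompute the diagnostics on the cleaned data
fit_clean <- lm(y ~ x, data = toy_clean)
toy_clean$CooksD <- cooks.distance(fit_clean)
max(toy_clean$CooksD)
```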
head(mydata3[order(-mydata3$CooksD),], 30)
## Year Selling_Price Present_Price Driven_kms Fuel_Type Selling_type
## 38 2003 0.35 2.280 127000 Petrol Individual
## 180 2010 0.31 1.050 213000 Petrol Individual
## 190 2005 0.20 0.570 55000 Petrol Individual
## 52 2015 23.00 30.610 40000 Diesel Dealer
## 94 2015 23.00 30.610 40000 Diesel Dealer
## 97 2016 20.75 25.390 29000 Diesel Dealer
## 67 2017 19.75 23.150 11000 Petrol Dealer
## 53 2017 18.00 19.770 15000 Diesel Dealer
## 63 2014 18.75 35.960 78000 Diesel Dealer
## 54 2013 16.00 30.610 135000 Diesel Individual
## 201 2006 0.10 0.750 92233 Petrol Individual
## 40 2003 2.25 7.980 62000 Petrol Dealer
## 98 2017 17.00 18.640 8700 Petrol Dealer
## 81 2016 14.73 14.890 23000 Diesel Dealer
## 58 2010 4.75 18.540 50000 Petrol Dealer
## 60 2014 19.99 35.960 41000 Diesel Dealer
## 80 2012 14.50 30.610 89000 Diesel Dealer
## 193 2007 0.20 0.750 49000 Petrol Individual
## 200 2007 0.12 0.580 53000 Petrol Individual
## 51 2012 14.90 30.610 104707 Diesel Dealer
## 186 2008 0.25 0.580 1900 Petrol Individual
## 48 2006 1.05 4.150 65000 Petrol Dealer
## 56 2009 3.60 15.040 70000 Petrol Dealer
## 96 2012 5.85 18.610 72000 Petrol Dealer
## 185 2008 0.25 0.750 26000 Petrol Individual
## 191 2008 0.20 0.750 60000 Petrol Individual
## 251 2016 12.90 13.600 35934 Diesel Dealer
## 84 2015 12.50 13.460 38000 Diesel Dealer
## 195 2008 0.20 0.787 50000 Petrol Individual
## 61 2013 6.95 18.610 40001 Petrol Dealer
## Transmission Owner Age Type TypeF StdResid CooksD
## 38 Manual 0 20 0 Individual -2.132 0.054
## 180 Manual 0 13 0 Individual -1.275 0.033
## 190 Manual 0 18 0 Individual -1.892 0.033
## 52 Automatic 0 8 1 Dealer -1.543 0.028
## 94 Automatic 0 8 1 Dealer -1.543 0.028
## 97 Automatic 0 7 1 Dealer -1.771 0.027
## 67 Automatic 0 6 1 Dealer -1.689 0.023
## 53 Automatic 0 6 1 Dealer -1.872 0.022
## 63 Automatic 0 9 1 Dealer 1.708 0.022
## 54 Automatic 0 10 0 Individual 1.058 0.019
## 201 Manual 0 17 0 Individual -1.688 0.019
## 40 Manual 0 20 1 Dealer -1.136 0.016
## 98 Manual 0 6 1 Dealer -1.715 0.016
## 81 Manual 0 7 1 Dealer -2.008 0.014
## 58 Manual 0 13 1 Dealer 2.330 0.013
## 60 Automatic 0 9 1 Dealer 1.249 0.013
## 80 Automatic 0 11 1 Dealer 1.645 0.013
## 193 Manual 1 16 0 Individual -1.380 0.012
## 200 Manual 0 16 0 Individual -1.405 0.012
## 51 Automatic 0 11 1 Dealer 1.412 0.011
## 186 Automatic 0 15 0 Individual -1.093 0.009
## 48 Manual 0 17 1 Dealer -1.058 0.008
## 56 Automatic 0 14 1 Dealer 1.569 0.007
## 96 Manual 0 11 1 Dealer 2.188 0.007
## 185 Manual 1 15 0 Individual -1.116 0.007
## 191 Manual 0 15 0 Individual -1.198 0.007
## 251 Manual 0 7 1 Dealer -1.574 0.007
## 84 Manual 0 8 1 Dealer -1.646 0.006
## 195 Manual 0 15 0 Individual -1.156 0.006
## 61 Manual 0 10 1 Dealer 1.990 0.004
head(mydata3[order(-mydata3$StdResid),], 30)
## Year Selling_Price Present_Price Driven_kms Fuel_Type Selling_type
## 58 2010 4.75 18.54 50000 Petrol Dealer
## 96 2012 5.85 18.61 72000 Petrol Dealer
## 61 2013 6.95 18.61 40001 Petrol Dealer
## 99 2013 7.05 18.61 45000 Petrol Dealer
## 73 2013 7.45 18.61 56001 Petrol Dealer
## 63 2014 18.75 35.96 78000 Diesel Dealer
## 80 2012 14.50 30.61 89000 Diesel Dealer
## 56 2009 3.60 15.04 70000 Petrol Dealer
## 77 2013 5.50 14.68 72000 Petrol Dealer
## 51 2012 14.90 30.61 104707 Diesel Dealer
## 60 2014 19.99 35.96 41000 Diesel Dealer
## 69 2011 4.35 13.74 88000 Petrol Dealer
## 280 2014 6.25 13.60 40126 Petrol Dealer
## 54 2013 16.00 30.61 135000 Diesel Individual
## 72 2011 4.50 12.48 45000 Diesel Dealer
## 88 2012 5.90 13.74 56000 Petrol Dealer
## 68 2010 9.25 20.45 59000 Diesel Dealer
## 174 2017 0.40 0.51 1300 Petrol Individual
## 160 2017 0.45 0.51 4000 Petrol Individual
## 133 2017 0.75 0.95 3500 Petrol Individual
## 156 2017 0.48 0.51 4300 Petrol Individual
## 159 2017 0.48 0.54 8600 Petrol Individual
## 135 2017 0.65 0.81 11800 Petrol Individual
## 157 2017 0.48 0.52 15000 Petrol Individual
## 129 2017 0.80 0.87 3000 Petrol Individual
## 131 2017 0.75 0.87 11000 Petrol Individual
## 130 2017 0.78 0.84 5000 Petrol Individual
## 127 2017 0.90 0.95 1300 Petrol Individual
## 100 2010 9.65 20.45 50024 Diesel Dealer
## 110 2017 1.20 1.47 11000 Petrol Individual
## Transmission Owner Age Type TypeF StdResid CooksD
## 58 Manual 0 13 1 Dealer 2.330 0.013
## 96 Manual 0 11 1 Dealer 2.188 0.007
## 61 Manual 0 10 1 Dealer 1.990 0.004
## 99 Manual 0 10 1 Dealer 1.928 0.004
## 73 Manual 0 10 1 Dealer 1.709 0.003
## 63 Automatic 0 9 1 Dealer 1.708 0.022
## 80 Automatic 0 11 1 Dealer 1.645 0.013
## 56 Automatic 0 14 1 Dealer 1.569 0.007
## 77 Manual 0 10 1 Dealer 1.416 0.003
## 51 Automatic 0 11 1 Dealer 1.412 0.011
## 60 Automatic 0 9 1 Dealer 1.249 0.013
## 69 Manual 0 12 1 Dealer 1.204 0.003
## 280 Manual 0 9 1 Dealer 1.066 0.001
## 54 Automatic 0 10 0 Individual 1.058 0.019
## 72 Manual 0 12 1 Dealer 0.903 0.001
## 88 Manual 0 11 1 Dealer 0.798 0.001
## 68 Manual 0 13 1 Dealer 0.777 0.002
## 174 Automatic 0 6 0 Individual 0.732 0.002
## 160 Automatic 0 6 0 Individual 0.700 0.001
## 133 Manual 0 6 0 Individual 0.691 0.001
## 156 Automatic 0 6 0 Individual 0.685 0.001
## 159 Manual 0 6 0 Individual 0.680 0.001
## 135 Manual 0 6 0 Individual 0.670 0.001
## 157 Manual 0 6 0 Individual 0.654 0.001
## 129 Manual 0 6 0 Individual 0.646 0.001
## 131 Manual 0 6 0 Individual 0.643 0.001
## 130 Manual 0 6 0 Individual 0.640 0.001
## 127 Manual 0 6 0 Individual 0.628 0.001
## 100 Manual 0 13 1 Dealer 0.621 0.001
## 110 Manual 0 6 0 Individual 0.610 0.001
hist(mydata3$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
The Cook's distances after the clean-up look better: no value is higher
than 1 (the largest is 0.054). Note that these values were computed from the initial model.
Re-estimate the model
fit1 <- lm(Present_Price ~ Selling_Price + Driven_kms + Age + TypeF,
data = mydata3)
summary(fit1)
##
## Call:
## lm(formula = Present_Price ~ Selling_Price + Driven_kms + Age +
## TypeF, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9969 -1.0636 -0.0498 0.8260 9.0717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.856e+00 5.362e-01 -7.191 5.56e-12 ***
## Selling_Price 1.381e+00 3.799e-02 36.361 < 2e-16 ***
## Driven_kms 3.282e-05 6.072e-06 5.405 1.36e-07 ***
## Age 3.450e-01 6.024e-02 5.726 2.58e-08 ***
## TypeFDealer 6.370e-01 3.323e-01 1.917 0.0563 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.124 on 288 degrees of freedom
## Multiple R-squared: 0.9005, Adjusted R-squared: 0.8991
## F-statistic: 651.3 on 4 and 288 DF, p-value: < 2.2e-16
Let's check the relationship between the standardized residuals and the standardized fitted values to see whether the assumption of homoskedasticity is met.
mydata3$StdResid <- round(rstandard(fit1), 3) # standardized residuals of the re-estimated model
mydata3$StdFittedValues <- scale(fit1$fitted.values)
library(car)
scatterplot(y = mydata3$StdResid, x = mydata3$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
From the scatter plot above, we cannot tell whether the assumption is
violated or not. We perform the Breusch-Pagan test to check it
formally.
H0: the variance is constant (homoskedasticity)
H1: the variance is not constant (heteroskedasticity)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------------
## Response : Present_Price
## Variables: fitted values of Present_Price
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 152.4985
## Prob > Chi2 = 4.930335e-35
We reject the null hypothesis at p<0.001 and conclude that the variance is not constant, meaning heteroskedasticity is present in the model. Because of that we will use robust standard errors (lm_robust) to interpret the results of our regression.
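The idea behind the test can be sketched in base R: regress the squared residuals on the fitted values and check whether that auxiliary regression explains anything. The example uses simulated data (not the car data) and the Koenker (studentized) form of the statistic, so its numbers are not expected to match the olsrr output, which may be computed differently.

```r
set.seed(123)
n <- 300
x <- runif(n, min = 1, max = 10)
y <- 1 + 2 * x + rnorm(n, sd = x)   # error spread grows with x
fit_toy <- lm(y ~ x)

# Auxiliary regression: squared residuals on the fitted values
aux <- lm(resid(fit_toy)^2 ~ fitted(fit_toy))

# Koenker (studentized) version: LM = n * R^2, chi-squared with 1 df
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
c(LM = lm_stat, p = p_value)
```

A small p-value means the residual variance changes systematically with the fitted values, which is the situation found for the car data.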
vif(fit1)
## Selling_Price Driven_kms Age TypeF
## 1.807329 1.784389 1.862135 1.642679
mean(vif(fit1))
## [1] 1.774133
All VIF values are well below 5 (mean 1.77), so the explanatory variables are only moderately correlated and multicollinearity is not a problem.
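For a numeric regressor the VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing that regressor on all the others. The sketch below computes it by hand on simulated data (x1, x2, x3 are made-up variables) and compares the result with the diagonal of the inverse correlation matrix, which gives the same numbers.

```r
set.seed(99)
n <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)   # correlated with x1
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the others
vif_manual <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  1 / (1 - summary(lm(f, data = X))$r.squared)
})
vif_manual

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_shortcut <- diag(solve(cor(X)))
vif_shortcut
```

A VIF of 1 means the regressor is uncorrelated with the others; values above 5 are the usual warning sign.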
Re-estimating the model with robust standard errors
library(estimatr)
fit2 <- lm(data = mydata3, Present_Price ~ Selling_Price + Driven_kms + Age + TypeF )
summary(lm_robust(data = mydata3, Present_Price ~ Selling_Price + Driven_kms + Age + TypeF, se_type = "HC1"))
##
## Call:
## lm_robust(formula = Present_Price ~ Selling_Price + Driven_kms +
## Age + TypeF, data = mydata3, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) -3.856e+00 5.968e-01 -6.461 4.425e-10 -5.030e+00 -2.681e+00 288
## Selling_Price 1.381e+00 6.623e-02 20.859 1.614e-59 1.251e+00 1.512e+00 288
## Driven_kms 3.282e-05 1.145e-05 2.866 4.469e-03 1.028e-05 5.536e-05 288
## Age 3.450e-01 7.844e-02 4.398 1.540e-05 1.906e-01 4.993e-01 288
## TypeFDealer 6.370e-01 4.054e-01 1.571 1.172e-01 -1.609e-01 1.435e+00 288
##
## Multiple R-squared: 0.9005 , Adjusted R-squared: 0.8991
## F-statistic: 404.1 on 4 and 288 DF, p-value: < 2.2e-16
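The estimates are identical to those of fit1; only the standard errors change. As a rough illustration of what the HC1 option does, the sketch below computes HC1 standard errors by hand with the sandwich formula on simulated heteroskedastic data (not the car data; all object names are made up).

```r
set.seed(7)
n <- 200
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)   # error spread grows with x
fit_toy <- lm(y ~ x)

X <- model.matrix(fit_toy)      # n x k design matrix
e <- resid(fit_toy)
k <- ncol(X)

bread <- solve(crossprod(X))    # (X'X)^-1
meat  <- crossprod(X * e)       # X' diag(e^2) X

# HC1: White's (HC0) covariance scaled by n / (n - k)
vcov_hc1 <- n / (n - k) * bread %*% meat %*% bread
se_hc1 <- sqrt(diag(vcov_hc1))

cbind(classical = sqrt(diag(vcov(fit_toy))), HC1 = se_hc1)
```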
Explanation of the estimated regression coefficients (all except selling type are statistically significant):
The intercept is statistically significant (p<0.001), but it has no meaningful interpretation.
If the selling price increases by 1,000 US dollars, the present price of a car increases on average by 1,381 US dollars, assuming all other variables remain unchanged (p<0.001).
If the number of driven kilometers increases by 1 kilometer, the present price of a car increases on average by about 0.033 US dollars, assuming all other variables remain unchanged (p<0.01).
If the age of a car increases by 1 year, the present price increases on average by 345 US dollars, assuming all other variables remain unchanged (p<0.001).
Given that the values of the other variables remain the same, the present price of cars sold by dealers is on average 637 US dollars higher than that of cars sold by individuals, but this difference is not statistically significant (p=0.117).
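Because both prices are measured in thousand US Dollars, the coefficients have to be multiplied by 1,000 to be read in US Dollars. The sketch below does that conversion with the estimates copied from the lm_robust output and checks whether the 95% confidence interval for TypeFDealer contains zero.

```r
# Estimates from the lm_robust output, in thousand US Dollars
coefs <- c(Selling_Price = 1.381, Driven_kms = 3.282e-05,
           Age = 0.3450, TypeFDealer = 0.6370)

# In US Dollars: per 1,000 USD of selling price, per km, per year,
# and dealer versus individual
coefs_usd <- coefs * 1000
coefs_usd

# 95% confidence interval for TypeFDealer, copied from the output
ci_dealer <- c(lower = -0.1609, upper = 1.435)
ci_contains_zero <- ci_dealer[["lower"]] < 0 && ci_dealer[["upper"]] > 0
ci_contains_zero
```

An interval that contains zero agrees with the p-value of 0.117 for TypeFDealer: the dealer effect cannot be distinguished from zero at the 5% level.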
F-test, to evaluate the regression model as a whole:
H0: ρ² = 0 (none of the explanatory variables explains the dependent variable)
H1: ρ² > 0 (at least one explanatory variable explains the dependent variable)
Since the p-value is very low (p < 0.001), we reject H0 and conclude that there is a relationship between the dependent variable and at least one of the explanatory variables.
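The classical F-statistic can be recovered from R-squared. In the sketch below, n = 293 is inferred from the 288 residual degrees of freedom plus the 5 estimated coefficients, and R-squared is the rounded value from the output, so the result only approximates the 651.3 reported by summary(fit1).

```r
r2 <- 0.9005   # Multiple R-squared of the re-estimated model
n  <- 293      # observations left after removing outliers (inferred)
k  <- 4        # number of explanatory variables

f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
f_stat

# Upper-tail p-value from the F distribution with k and n - k - 1 df
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)
p_value
```

The lm_robust output reports a different value (404.1), which is expected if that statistic is computed from the robust covariance matrix rather than from R-squared.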
sqrt(0.9005)
## [1] 0.9489468
Multiple coefficient of determination: R^2=0.90, meaning that 90% of the variability of the dependent variable (present price) is explained by the linear effect of the explanatory variables (selling price, driven kms, age, selling type). The multiple correlation coefficient is 0.95, meaning that the linear relationship between the present price and the explanatory variables taken together is very strong.
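The adjusted R-squared in the output can be checked the same way. The sketch assumes n = 293 observations (288 residual degrees of freedom plus 5 coefficients) and k = 4 explanatory variables.

```r
r2 <- 0.9005   # Multiple R-squared from the output
n  <- 293      # observations in the re-estimated model (inferred)
k  <- 4        # number of explanatory variables

# Multiple correlation coefficient
multiple_r <- sqrt(r2)
round(multiple_r, 2)

# Adjusted R-squared penalizes for the number of explanatory variables
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 4)   # 0.8991, as reported
```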