The data set consists of 301 observations and 9 variables. It came without any description of the variables, so the descriptions below are my own:

- Car_Name: name of the car
- Year: the year the car was made
- Selling_Price: the price for which the car was initially sold (in thousand US dollars)
- Present_Price: reseller price (in thousand US dollars)
- Driven_kms: how many kilometers the car has been driven
- Fuel_Type: the type of fuel the car uses
- Selling_type: how the car is sold (by a dealer or by an individual)
- Transmission: the type of transmission the car has
- Owner: how many owners the car has already had
- (Age: age of the car in years, computed below)

The data set was found on the Kaggle website; the author is Mohammed Tauseef Ahmed. Retrieved Feb 2nd 2023, from https://www.kaggle.com/code/mohammedtauseefahmed/car-price-dataset/data?fbclid=IwAR0YAkuEZWEvHOMRQlx4xq9iW5sPQvs2FtyFbDnVwTG9kSeElFlO8JIZlnw .

The unit of observation is one car. We are going to explore whether there is a relationship between the present price of a car and its selling price, driven kilometers, age, and selling type.

mydata <- read.csv("C:/Users/pauli/Downloads/car data.csv", header=TRUE, sep=",", dec=".")
head(mydata)
##        Car_Name Year Selling_Price Present_Price Driven_kms Fuel_Type
## 1          ritz 2014          3.35          5.59      27000    Petrol
## 2           sx4 2013          4.75          9.54      43000    Diesel
## 3          ciaz 2017          7.25          9.85       6900    Petrol
## 4       wagon r 2011          2.85          4.15       5200    Petrol
## 5         swift 2014          4.60          6.87      42450    Diesel
## 6 vitara brezza 2018          9.25          9.83       2071    Diesel
##   Selling_type Transmission Owner
## 1       Dealer       Manual     0
## 2       Dealer       Manual     0
## 3       Dealer       Manual     0
## 4       Dealer       Manual     0
## 5       Dealer       Manual     0
## 6       Dealer       Manual     0

We remove the column we do not need (Car_Name).

mydata1 <- mydata[,c(-1) ]
head(mydata1)
##   Year Selling_Price Present_Price Driven_kms Fuel_Type Selling_type
## 1 2014          3.35          5.59      27000    Petrol       Dealer
## 2 2013          4.75          9.54      43000    Diesel       Dealer
## 3 2017          7.25          9.85       6900    Petrol       Dealer
## 4 2011          2.85          4.15       5200    Petrol       Dealer
## 5 2014          4.60          6.87      42450    Diesel       Dealer
## 6 2018          9.25          9.83       2071    Diesel       Dealer
##   Transmission Owner
## 1       Manual     0
## 2       Manual     0
## 3       Manual     0
## 4       Manual     0
## 5       Manual     0
## 6       Manual     0

Now let's calculate the age of each car (as of 2023):

mydata1$Age <- 2023 - mydata1$Year
head(mydata1)
##   Year Selling_Price Present_Price Driven_kms Fuel_Type Selling_type
## 1 2014          3.35          5.59      27000    Petrol       Dealer
## 2 2013          4.75          9.54      43000    Diesel       Dealer
## 3 2017          7.25          9.85       6900    Petrol       Dealer
## 4 2011          2.85          4.15       5200    Petrol       Dealer
## 5 2014          4.60          6.87      42450    Diesel       Dealer
## 6 2018          9.25          9.83       2071    Diesel       Dealer
##   Transmission Owner Age
## 1       Manual     0   9
## 2       Manual     0  10
## 3       Manual     0   6
## 4       Manual     0  12
## 5       Manual     0   9
## 6       Manual     0   5
mydata1$Selling_Price <- mydata1$Selling_Price 
mydata1$Present_Price <- mydata1$Present_Price
head(mydata1)
##   Year Selling_Price Present_Price Driven_kms Fuel_Type Selling_type
## 1 2014          3.35          5.59      27000    Petrol       Dealer
## 2 2013          4.75          9.54      43000    Diesel       Dealer
## 3 2017          7.25          9.85       6900    Petrol       Dealer
## 4 2011          2.85          4.15       5200    Petrol       Dealer
## 5 2014          4.60          6.87      42450    Diesel       Dealer
## 6 2018          9.25          9.83       2071    Diesel       Dealer
##   Transmission Owner Age
## 1       Manual     0   9
## 2       Manual     0  10
## 3       Manual     0   6
## 4       Manual     0  12
## 5       Manual     0   9
## 6       Manual     0   5
# The data set has no ready-made factor variable, but one is required, so I create it from the selling type and use it later in the regression
mydata1$Type[mydata1$Selling_type=="Individual"]<-"0"
mydata1$Type[mydata1$Selling_type=="Dealer"]<-"1"

mydata1$TypeF <- factor(mydata1$Type, 
                             levels = c("0", "1"), 
                             labels = c("Individual", "Dealer"))
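
As a side note, the same kind of factor can be built in one step, without the intermediate "0"/"1" column. The sketch below uses a small made-up vector (`selling_type` and `type_f` are illustrative names, not part of the data set) and only shows the idea: the first entry in `levels` becomes the reference category in a regression.

```r
# Toy stand-in for mydata1$Selling_type (values are illustrative only)
selling_type <- c("Dealer", "Individual", "Dealer", "Dealer", "Individual")

# Build the factor directly; "Individual" is listed first, so it is the reference level
type_f <- factor(selling_type, levels = c("Individual", "Dealer"))

table(type_f)       # counts per category
levels(type_f)[1]   # reference category used by lm()
```

With "Individual" as the reference level, a regression reports a single coefficient for "Dealer", which is how `TypeFDealer` appears in the model output further down.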

Let's compute some descriptive statistics to see whether there are any missing values or anything wrong with the data.

summary(mydata1)
##       Year      Selling_Price    Present_Price      Driven_kms    
##  Min.   :2003   Min.   : 0.100   Min.   : 0.320   Min.   :   500  
##  1st Qu.:2012   1st Qu.: 0.900   1st Qu.: 1.200   1st Qu.: 15000  
##  Median :2014   Median : 3.600   Median : 6.400   Median : 32000  
##  Mean   :2014   Mean   : 4.661   Mean   : 7.628   Mean   : 36947  
##  3rd Qu.:2016   3rd Qu.: 6.000   3rd Qu.: 9.900   3rd Qu.: 48767  
##  Max.   :2018   Max.   :35.000   Max.   :92.600   Max.   :500000  
##   Fuel_Type         Selling_type       Transmission           Owner        
##  Length:301         Length:301         Length:301         Min.   :0.00000  
##  Class :character   Class :character   Class :character   1st Qu.:0.00000  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.00000  
##                                                           Mean   :0.04319  
##                                                           3rd Qu.:0.00000  
##                                                           Max.   :3.00000  
##       Age             Type                  TypeF    
##  Min.   : 5.000   Length:301         Individual:106  
##  1st Qu.: 7.000   Class :character   Dealer    :195  
##  Median : 9.000   Mode  :character                   
##  Mean   : 9.372                                      
##  3rd Qu.:11.000                                      
##  Max.   :20.000

The oldest car in the data set was manufactured in 2003. The average selling price is 4,661 US dollars, and the average present price is 7,628 US dollars. The minimum number of driven kilometers is 500 km and the maximum is 500,000 km. The “youngest” car is 5 years old, and the “oldest” is 20 years old. 106 cars are being sold by individuals and 195 by car dealers.
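
`summary()` shows NA counts for numeric columns but not for character columns, so an explicit count per column can make the missing-value check more direct. A minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration):

```r
# Toy data frame with one deliberately missing value
toy <- data.frame(
  Selling_Price = c(3.35, 4.75, NA),
  Driven_kms    = c(27000, 43000, 6900),
  Fuel_Type     = c("Petrol", "Diesel", "Petrol")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# TRUE if at least one value anywhere in the data frame is missing
anyNA(toy)
```

Applied to `mydata1`, the same two calls would list any columns that contain missing values.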

library(psych)
describe(mydata1)
##               vars   n     mean       sd  median  trimmed      mad     min
## Year             1 301  2013.63     2.89  2014.0  2014.00     2.97 2003.00
## Selling_Price    2 301     4.66     5.08     3.6     3.74     3.85    0.10
## Present_Price    3 301     7.63     8.64     6.4     6.18     6.89    0.32
## Driven_kms       4 301 36947.21 38886.88 32000.0 32133.00 25204.20  500.00
## Fuel_Type*       5 301     2.79     0.43     3.0     2.87     0.00    1.00
## Selling_type*    6 301     1.35     0.48     1.0     1.32     0.00    1.00
## Transmission*    7 301     1.87     0.34     2.0     1.96     0.00    1.00
## Owner            8 301     0.04     0.25     0.0     0.00     0.00    0.00
## Age              9 301     9.37     2.89     9.0     9.00     2.97    5.00
## Type*           10 301     1.65     0.48     2.0     1.68     0.00    1.00
## TypeF*          11 301     1.65     0.48     2.0     1.68     0.00    1.00
##                    max     range  skew kurtosis      se
## Year            2018.0     15.00 -1.23     1.46    0.17
## Selling_Price     35.0     34.90  2.47     8.66    0.29
## Present_Price     92.6     92.28  4.04    30.95    0.50
## Driven_kms    500000.0 499500.00  6.37    66.94 2241.40
## Fuel_Type*         3.0      2.00 -1.65     1.44    0.02
## Selling_type*      2.0      1.00  0.62    -1.63    0.03
## Transmission*      2.0      1.00 -2.15     2.64    0.02
## Owner              3.0      3.00  7.54    71.59    0.01
## Age               20.0     15.00  1.23     1.46    0.17
## Type*              2.0      1.00 -0.62    -1.63    0.03
## TypeF*             2.0      1.00 -0.62    -1.63    0.03

40 of the cars have automatic transmission and 261 have manual transmission.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(mydata1[ ,-c(1,5:8,10:11)], 
                  smooth = FALSE)

From the scatterplot matrix above we can see a positive relationship between present price (the dependent variable) and the explanatory variables (selling price, driven kms, age).

scatterplot(mydata1$Selling_Price~ mydata1$Present_Price  | mydata1$TypeF,  
            
            ylab = "Selling price", 
            xlab = "Present price",smooth = FALSE)

fit <- lm(Present_Price ~ Selling_Price + Driven_kms + Age + TypeF,
           data = mydata1)
summary(fit)
## 
## Call:
## lm(formula = Present_Price ~ Selling_Price + Driven_kms + Age + 
##     TypeF, data = mydata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.595  -1.337  -0.024   1.112  33.755 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.978e+00  8.078e-01  -8.638 3.63e-16 ***
## Selling_Price  1.586e+00  4.895e-02  32.404  < 2e-16 ***
## Driven_kms     1.096e-05  6.125e-06   1.790   0.0745 .  
## Age            7.226e-01  8.482e-02   8.519 8.31e-16 ***
## TypeFDealer    5.388e-02  5.023e-01   0.107   0.9146    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.448 on 296 degrees of freedom
## Multiple R-squared:  0.8429, Adjusted R-squared:  0.8408 
## F-statistic: 397.1 on 4 and 296 DF,  p-value: < 2.2e-16
vif(fit)
## Selling_Price    Driven_kms           Age         TypeF 
##      1.562058      1.431578      1.517730      1.456911
mean(vif(fit))
## [1] 1.492069
mydata1$StdResid <- round(rstandard(fit), 3) 
mydata1$CooksD <- round(cooks.distance(fit), 3) 

hist(mydata1$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

H0: The errors are normally distributed.
H1: The errors are not normally distributed.

shapiro.test(mydata1$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$StdResid
## W = 0.71771, p-value < 2.2e-16

We reject H0 at p < 0.001 and conclude that the errors are not normally distributed.
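
The test pattern itself can be tried on any fitted model. The sketch below uses R's built-in `mtcars` data instead of the car-price data (an assumption made only so that the example runs on its own); `fit_demo` and `std_resid` are illustrative names.

```r
# Fit a small demonstration model on a built-in data set
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals, one per observation
std_resid <- rstandard(fit_demo)

# Shapiro-Wilk test: H0 = the residuals come from a normal distribution
sw <- shapiro.test(std_resid)
sw$statistic   # W statistic
sw$p.value     # a small p-value leads to rejecting normality
```

The test is applied to the residuals of the model, not to the original variables, which is why the hypotheses are stated in terms of the errors.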

hist(mydata1$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

head(mydata1[order(-mydata1$CooksD),], 30) 
##     Year Selling_Price Present_Price Driven_kms Fuel_Type Selling_type
## 87  2010         35.00         92.60      78000    Diesel       Dealer
## 197 2008          0.17          0.52     500000    Petrol   Individual
## 65  2017         33.00         36.23       6000    Diesel       Dealer
## 86  2006          2.50         23.73     142000    Petrol   Individual
## 83  2017         23.00         25.39      15000    Diesel       Dealer
## 38  2003          0.35          2.28     127000    Petrol   Individual
## 95  2008          4.00         22.78      89000    Petrol       Dealer
## 180 2010          0.31          1.05     213000    Petrol   Individual
## 190 2005          0.20          0.57      55000    Petrol   Individual
## 52  2015         23.00         30.61      40000    Diesel       Dealer
## 94  2015         23.00         30.61      40000    Diesel       Dealer
## 97  2016         20.75         25.39      29000    Diesel       Dealer
## 79  2010          5.25         22.83      80000    Petrol       Dealer
## 67  2017         19.75         23.15      11000    Petrol       Dealer
## 53  2017         18.00         19.77      15000    Diesel       Dealer
## 63  2014         18.75         35.96      78000    Diesel       Dealer
## 54  2013         16.00         30.61     135000    Diesel   Individual
## 91  2009          3.80         18.61      62000    Petrol       Dealer
## 201 2006          0.10          0.75      92233    Petrol   Individual
## 40  2003          2.25          7.98      62000    Petrol       Dealer
## 98  2017         17.00         18.64       8700    Petrol       Dealer
## 81  2016         14.73         14.89      23000    Diesel       Dealer
## 58  2010          4.75         18.54      50000    Petrol       Dealer
## 60  2014         19.99         35.96      41000    Diesel       Dealer
## 80  2012         14.50         30.61      89000    Diesel       Dealer
## 193 2007          0.20          0.75      49000    Petrol   Individual
## 200 2007          0.12          0.58      53000    Petrol   Individual
## 51  2012         14.90         30.61     104707    Diesel       Dealer
## 186 2008          0.25          0.58       1900    Petrol   Individual
## 48  2006          1.05          4.15      65000    Petrol       Dealer
##     Transmission Owner Age Type      TypeF StdResid CooksD
## 87        Manual     0  13    1     Dealer   10.852  5.395
## 197    Automatic     0  15    0 Individual   -4.169  5.214
## 65     Automatic     0   6    1     Dealer   -4.246  0.576
## 86     Automatic     3  17    0 Individual    3.827  0.136
## 83     Automatic     0   6    1     Dealer   -2.587  0.078
## 38        Manual     0  20    0 Individual   -2.132  0.054
## 95     Automatic     0  15    1     Dealer    3.379  0.043
## 180       Manual     0  13    0 Individual   -1.275  0.033
## 190       Manual     0  18    0 Individual   -1.892  0.033
## 52     Automatic     0   8    1     Dealer   -1.543  0.028
## 94     Automatic     0   8    1     Dealer   -1.543  0.028
## 97     Automatic     0   7    1     Dealer   -1.771  0.027
## 79     Automatic     0  13    1     Dealer    3.253  0.024
## 67     Automatic     0   6    1     Dealer   -1.689  0.023
## 53     Automatic     0   6    1     Dealer   -1.872  0.022
## 63     Automatic     0   9    1     Dealer    1.708  0.022
## 54     Automatic     0  10    0 Individual    1.058  0.019
## 91        Manual     0  14    1     Dealer    2.545  0.019
## 201       Manual     0  17    0 Individual   -1.688  0.019
## 40        Manual     0  20    1     Dealer   -1.136  0.016
## 98        Manual     0   6    1     Dealer   -1.715  0.016
## 81        Manual     0   7    1     Dealer   -2.008  0.014
## 58        Manual     0  13    1     Dealer    2.330  0.013
## 60     Automatic     0   9    1     Dealer    1.249  0.013
## 80     Automatic     0  11    1     Dealer    1.645  0.013
## 193       Manual     1  16    0 Individual   -1.380  0.012
## 200       Manual     0  16    0 Individual   -1.405  0.012
## 51     Automatic     0  11    1     Dealer    1.412  0.011
## 186    Automatic     0  15    0 Individual   -1.093  0.009
## 48        Manual     0  17    1     Dealer   -1.058  0.008

We remove outliers and units with high impact (large Cook's distances).

mydata2<-mydata1[mydata1$StdResid <= 2.5, ]
mydata3<-mydata2[mydata2$StdResid >= -2.5, ]

We deleted observations with standardized residuals lower than -2.5 or higher than 2.5. This removes 8 observations and leaves 293.
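
The two filtering steps can also be written as a single condition on the absolute value. A small sketch with made-up residuals (`toy` is an assumption for illustration) showing that both approaches keep the same rows:

```r
# Toy standardized residuals, including values beyond +/- 2.5
toy <- data.frame(id = 1:6,
                  StdResid = c(-4.2, -2.5, -0.3, 1.1, 2.5, 10.9))

# Two-step filter, as used in the analysis
step1    <- toy[toy$StdResid <= 2.5, ]
two_step <- step1[step1$StdResid >= -2.5, ]

# Equivalent one-step filter on the absolute value
one_step <- toy[abs(toy$StdResid) <= 2.5, ]

identical(two_step$id, one_step$id)   # both keep the same observations
nrow(toy) - nrow(one_step)            # number of removed observations
```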

head(mydata3[order(-mydata3$CooksD),], 30) 
##     Year Selling_Price Present_Price Driven_kms Fuel_Type Selling_type
## 38  2003          0.35         2.280     127000    Petrol   Individual
## 180 2010          0.31         1.050     213000    Petrol   Individual
## 190 2005          0.20         0.570      55000    Petrol   Individual
## 52  2015         23.00        30.610      40000    Diesel       Dealer
## 94  2015         23.00        30.610      40000    Diesel       Dealer
## 97  2016         20.75        25.390      29000    Diesel       Dealer
## 67  2017         19.75        23.150      11000    Petrol       Dealer
## 53  2017         18.00        19.770      15000    Diesel       Dealer
## 63  2014         18.75        35.960      78000    Diesel       Dealer
## 54  2013         16.00        30.610     135000    Diesel   Individual
## 201 2006          0.10         0.750      92233    Petrol   Individual
## 40  2003          2.25         7.980      62000    Petrol       Dealer
## 98  2017         17.00        18.640       8700    Petrol       Dealer
## 81  2016         14.73        14.890      23000    Diesel       Dealer
## 58  2010          4.75        18.540      50000    Petrol       Dealer
## 60  2014         19.99        35.960      41000    Diesel       Dealer
## 80  2012         14.50        30.610      89000    Diesel       Dealer
## 193 2007          0.20         0.750      49000    Petrol   Individual
## 200 2007          0.12         0.580      53000    Petrol   Individual
## 51  2012         14.90        30.610     104707    Diesel       Dealer
## 186 2008          0.25         0.580       1900    Petrol   Individual
## 48  2006          1.05         4.150      65000    Petrol       Dealer
## 56  2009          3.60        15.040      70000    Petrol       Dealer
## 96  2012          5.85        18.610      72000    Petrol       Dealer
## 185 2008          0.25         0.750      26000    Petrol   Individual
## 191 2008          0.20         0.750      60000    Petrol   Individual
## 251 2016         12.90        13.600      35934    Diesel       Dealer
## 84  2015         12.50        13.460      38000    Diesel       Dealer
## 195 2008          0.20         0.787      50000    Petrol   Individual
## 61  2013          6.95        18.610      40001    Petrol       Dealer
##     Transmission Owner Age Type      TypeF StdResid CooksD
## 38        Manual     0  20    0 Individual   -2.132  0.054
## 180       Manual     0  13    0 Individual   -1.275  0.033
## 190       Manual     0  18    0 Individual   -1.892  0.033
## 52     Automatic     0   8    1     Dealer   -1.543  0.028
## 94     Automatic     0   8    1     Dealer   -1.543  0.028
## 97     Automatic     0   7    1     Dealer   -1.771  0.027
## 67     Automatic     0   6    1     Dealer   -1.689  0.023
## 53     Automatic     0   6    1     Dealer   -1.872  0.022
## 63     Automatic     0   9    1     Dealer    1.708  0.022
## 54     Automatic     0  10    0 Individual    1.058  0.019
## 201       Manual     0  17    0 Individual   -1.688  0.019
## 40        Manual     0  20    1     Dealer   -1.136  0.016
## 98        Manual     0   6    1     Dealer   -1.715  0.016
## 81        Manual     0   7    1     Dealer   -2.008  0.014
## 58        Manual     0  13    1     Dealer    2.330  0.013
## 60     Automatic     0   9    1     Dealer    1.249  0.013
## 80     Automatic     0  11    1     Dealer    1.645  0.013
## 193       Manual     1  16    0 Individual   -1.380  0.012
## 200       Manual     0  16    0 Individual   -1.405  0.012
## 51     Automatic     0  11    1     Dealer    1.412  0.011
## 186    Automatic     0  15    0 Individual   -1.093  0.009
## 48        Manual     0  17    1     Dealer   -1.058  0.008
## 56     Automatic     0  14    1     Dealer    1.569  0.007
## 96        Manual     0  11    1     Dealer    2.188  0.007
## 185       Manual     1  15    0 Individual   -1.116  0.007
## 191       Manual     0  15    0 Individual   -1.198  0.007
## 251       Manual     0   7    1     Dealer   -1.574  0.007
## 84        Manual     0   8    1     Dealer   -1.646  0.006
## 195       Manual     0  15    0 Individual   -1.156  0.006
## 61        Manual     0  10    1     Dealer    1.990  0.004
head(mydata3[order(-mydata3$StdResid),], 30) 
##     Year Selling_Price Present_Price Driven_kms Fuel_Type Selling_type
## 58  2010          4.75         18.54      50000    Petrol       Dealer
## 96  2012          5.85         18.61      72000    Petrol       Dealer
## 61  2013          6.95         18.61      40001    Petrol       Dealer
## 99  2013          7.05         18.61      45000    Petrol       Dealer
## 73  2013          7.45         18.61      56001    Petrol       Dealer
## 63  2014         18.75         35.96      78000    Diesel       Dealer
## 80  2012         14.50         30.61      89000    Diesel       Dealer
## 56  2009          3.60         15.04      70000    Petrol       Dealer
## 77  2013          5.50         14.68      72000    Petrol       Dealer
## 51  2012         14.90         30.61     104707    Diesel       Dealer
## 60  2014         19.99         35.96      41000    Diesel       Dealer
## 69  2011          4.35         13.74      88000    Petrol       Dealer
## 280 2014          6.25         13.60      40126    Petrol       Dealer
## 54  2013         16.00         30.61     135000    Diesel   Individual
## 72  2011          4.50         12.48      45000    Diesel       Dealer
## 88  2012          5.90         13.74      56000    Petrol       Dealer
## 68  2010          9.25         20.45      59000    Diesel       Dealer
## 174 2017          0.40          0.51       1300    Petrol   Individual
## 160 2017          0.45          0.51       4000    Petrol   Individual
## 133 2017          0.75          0.95       3500    Petrol   Individual
## 156 2017          0.48          0.51       4300    Petrol   Individual
## 159 2017          0.48          0.54       8600    Petrol   Individual
## 135 2017          0.65          0.81      11800    Petrol   Individual
## 157 2017          0.48          0.52      15000    Petrol   Individual
## 129 2017          0.80          0.87       3000    Petrol   Individual
## 131 2017          0.75          0.87      11000    Petrol   Individual
## 130 2017          0.78          0.84       5000    Petrol   Individual
## 127 2017          0.90          0.95       1300    Petrol   Individual
## 100 2010          9.65         20.45      50024    Diesel       Dealer
## 110 2017          1.20          1.47      11000    Petrol   Individual
##     Transmission Owner Age Type      TypeF StdResid CooksD
## 58        Manual     0  13    1     Dealer    2.330  0.013
## 96        Manual     0  11    1     Dealer    2.188  0.007
## 61        Manual     0  10    1     Dealer    1.990  0.004
## 99        Manual     0  10    1     Dealer    1.928  0.004
## 73        Manual     0  10    1     Dealer    1.709  0.003
## 63     Automatic     0   9    1     Dealer    1.708  0.022
## 80     Automatic     0  11    1     Dealer    1.645  0.013
## 56     Automatic     0  14    1     Dealer    1.569  0.007
## 77        Manual     0  10    1     Dealer    1.416  0.003
## 51     Automatic     0  11    1     Dealer    1.412  0.011
## 60     Automatic     0   9    1     Dealer    1.249  0.013
## 69        Manual     0  12    1     Dealer    1.204  0.003
## 280       Manual     0   9    1     Dealer    1.066  0.001
## 54     Automatic     0  10    0 Individual    1.058  0.019
## 72        Manual     0  12    1     Dealer    0.903  0.001
## 88        Manual     0  11    1     Dealer    0.798  0.001
## 68        Manual     0  13    1     Dealer    0.777  0.002
## 174    Automatic     0   6    0 Individual    0.732  0.002
## 160    Automatic     0   6    0 Individual    0.700  0.001
## 133       Manual     0   6    0 Individual    0.691  0.001
## 156    Automatic     0   6    0 Individual    0.685  0.001
## 159       Manual     0   6    0 Individual    0.680  0.001
## 135       Manual     0   6    0 Individual    0.670  0.001
## 157       Manual     0   6    0 Individual    0.654  0.001
## 129       Manual     0   6    0 Individual    0.646  0.001
## 131       Manual     0   6    0 Individual    0.643  0.001
## 130       Manual     0   6    0 Individual    0.640  0.001
## 127       Manual     0   6    0 Individual    0.628  0.001
## 100       Manual     0  13    1     Dealer    0.621  0.001
## 110       Manual     0   6    0 Individual    0.610  0.001
hist(mydata3$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

The Cook's distances after the clean-up are better: the largest remaining value is 0.054, far below 1.

Re-estimate the model

fit1 <- lm(Present_Price ~ Selling_Price + Driven_kms + Age + TypeF,
           data = mydata3)
summary(fit1)
## 
## Call:
## lm(formula = Present_Price ~ Selling_Price + Driven_kms + Age + 
##     TypeF, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9969 -1.0636 -0.0498  0.8260  9.0717 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.856e+00  5.362e-01  -7.191 5.56e-12 ***
## Selling_Price  1.381e+00  3.799e-02  36.361  < 2e-16 ***
## Driven_kms     3.282e-05  6.072e-06   5.405 1.36e-07 ***
## Age            3.450e-01  6.024e-02   5.726 2.58e-08 ***
## TypeFDealer    6.370e-01  3.323e-01   1.917   0.0563 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.124 on 288 degrees of freedom
## Multiple R-squared:  0.9005, Adjusted R-squared:  0.8991 
## F-statistic: 651.3 on 4 and 288 DF,  p-value: < 2.2e-16
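
Note that the `StdResid` and `CooksD` columns carried over into `mydata3` were computed from the first model. One possible refinement is to recompute the diagnostics from the re-estimated model. The sketch below shows the pattern on the built-in `mtcars` data (an assumption so that it runs stand-alone; `fit_a`, `fit_b` and `trimmed` are illustrative names).

```r
# First fit and outlier screening on standardized residuals
fit_a <- lm(mpg ~ wt + hp, data = mtcars)
keep  <- abs(rstandard(fit_a)) <= 2.5
trimmed <- mtcars[keep, ]

# Re-estimate on the trimmed data and recompute the diagnostics from the new fit
fit_b <- lm(mpg ~ wt + hp, data = trimmed)
trimmed$StdResid <- round(rstandard(fit_b), 3)
trimmed$CooksD   <- round(cooks.distance(fit_b), 3)

# Largest Cook's distances under the re-estimated model
head(trimmed[order(-trimmed$CooksD), c("mpg", "wt", "hp", "StdResid", "CooksD")])
max(trimmed$CooksD)
```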

Let's check the relationship between the standardized residuals and the standardized fitted values to see whether the assumption of homoskedasticity is met.

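# Recompute the standardized residuals from the re-estimated model (the stored ones come from the first fit)
mydata3$StdResid <- round(rstandard(fit1), 3)
mydata3$StdFittedValues <- scale(fit1$fitted.values)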
library(car)
scatterplot(y = mydata3$StdResid, x = mydata3$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

From the scatter plot above, we cannot tell whether the assumption is violated, so we perform the Breusch-Pagan test.

H0: The variance is constant (homoskedasticity).
H1: The variance is not constant (heteroskedasticity).

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit1)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                   Data                    
##  -----------------------------------------
##  Response : Present_Price 
##  Variables: fitted values of Present_Price 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    152.4985 
##  Prob > Chi2   =    4.930335e-35

We reject the null hypothesis at p < 0.001 and conclude that the variance is not constant, meaning heteroskedasticity is present in the model. Because of that, we need to use robust standard errors (lm_robust) when interpreting the results of our regression.
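
To illustrate the idea behind the test, the sketch below computes a Breusch-Pagan-type statistic by hand in base R: the squared residuals are regressed on the fitted values, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution with 1 degree of freedom. This is the studentized (Koenker) variant, so its value is not expected to match the `olsrr` output exactly. The built-in `mtcars` data is used only so that the example runs on its own.

```r
# Demonstration model on a built-in data set
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)

# Auxiliary regression: squared residuals on fitted values
e2  <- resid(fit_demo)^2
fv  <- fitted(fit_demo)
aux <- lm(e2 ~ fv)

# LM statistic and p-value (1 degree of freedom: one regressor in the auxiliary model)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

lm_stat
p_value
```

A small p-value indicates that the residual variance changes with the fitted values, which is the situation found for `fit1`.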

vif(fit1)
## Selling_Price    Driven_kms           Age         TypeF 
##      1.807329      1.784389      1.862135      1.642679
mean(vif(fit1))
## [1] 1.774133

All VIF values are above 1 and well below 5, so the explanatory variables are only moderately correlated and multicollinearity is not a concern.
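
For intuition, the VIF of a numeric predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A minimal base-R sketch on the built-in `mtcars` data (used here as an assumption so the example is self-contained; `vif_manual` is an illustrative name):

```r
# Demonstration model with three numeric predictors
fit_demo <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Predictor matrix without the intercept column
X <- model.matrix(fit_demo)[, -1]

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_manual <- sapply(colnames(X), function(v) {
  others <- setdiff(colnames(X), v)
  aux <- lm(X[, v] ~ X[, others])
  1 / (1 - summary(aux)$r.squared)
})

round(vif_manual, 3)
```

A VIF of 1 means a predictor is uncorrelated with the others; the larger the value, the more its variance is inflated by the overlap.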

Re-estimating the model with robust standard errors

library(estimatr)
fit2 <- lm(data = mydata3, Present_Price ~ Selling_Price + Driven_kms + Age + TypeF )
summary(lm_robust(data = mydata3, Present_Price ~ Selling_Price + Driven_kms + Age + TypeF, se_type = "HC1"))
## 
## Call:
## lm_robust(formula = Present_Price ~ Selling_Price + Driven_kms + 
##     Age + TypeF, data = mydata3, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                 Estimate Std. Error t value  Pr(>|t|)   CI Lower   CI Upper  DF
## (Intercept)   -3.856e+00  5.968e-01  -6.461 4.425e-10 -5.030e+00 -2.681e+00 288
## Selling_Price  1.381e+00  6.623e-02  20.859 1.614e-59  1.251e+00  1.512e+00 288
## Driven_kms     3.282e-05  1.145e-05   2.866 4.469e-03  1.028e-05  5.536e-05 288
## Age            3.450e-01  7.844e-02   4.398 1.540e-05  1.906e-01  4.993e-01 288
## TypeFDealer    6.370e-01  4.054e-01   1.571 1.172e-01 -1.609e-01  1.435e+00 288
## 
## Multiple R-squared:  0.9005 ,    Adjusted R-squared:  0.8991 
## F-statistic: 404.1 on 4 and 288 DF,  p-value: < 2.2e-16
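
The coefficient estimates are the same as in `fit1`; only the standard errors change. As a rough illustration of what HC1 standard errors are, the sketch below computes them by hand with the sandwich formula, using the built-in `mtcars` data (an assumption so the example runs without extra packages). HC1 is the basic heteroskedasticity-consistent estimator multiplied by n / (n - k).

```r
# Demonstration model on a built-in data set
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit_demo)   # design matrix, including the intercept
e <- resid(fit_demo)          # OLS residuals
n <- nrow(X)
k <- ncol(X)

bread <- solve(crossprod(X))  # (X'X)^-1
meat  <- crossprod(X * e)     # X' diag(e^2) X, each row of X scaled by its residual

# HC1 covariance matrix: sandwich with a small-sample correction
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread
se_hc1   <- sqrt(diag(vcov_hc1))

# Same estimates, two sets of standard errors
cbind(estimate     = coef(fit_demo),
      classical_se = sqrt(diag(vcov(fit_demo))),
      hc1_se       = se_hc1)
```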

Explanations of the estimated regression coefficients: all are statistically significant except the coefficient for selling type.

The intercept is statistically significant (p<0.001), but it makes no sense to explain it.

If the selling price increases by 1,000 US dollars, the present price of a car increases on average by 1,381 US dollars, assuming all other variables remain unchanged (p < 0.001).

If the number of driven kilometers increases by 1 kilometer, the present price of a car increases on average by 0.033 US dollars, assuming all other variables remain unchanged (p = 0.004).

If the age of a car increases by 1 year, the present price increases on average by 345 US dollars, assuming all other variables remain unchanged (p < 0.001).

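Given that the values of the other variables remain the same, the present price of cars sold by dealers is on average 637 US dollars higher than that of cars sold by individuals, but this difference is not statistically significant (p = 0.117), so we cannot claim that the selling type affects the present price.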
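
Because both prices are measured in thousand US dollars, each coefficient has to be multiplied by 1,000 to be read in dollars. The short sketch below redoes that arithmetic with the estimates copied from the robust output above (`coefs` and `effect_usd` are illustrative names):

```r
# Coefficient estimates copied from the lm_robust output (in thousand US dollars)
coefs <- c(Selling_Price = 1.381,
           Driven_kms    = 3.282e-05,
           Age           = 0.3450,
           TypeFDealer   = 0.6370)

# Effect on the present price in US dollars per one-unit change of each variable
effect_usd <- coefs * 1000
round(effect_usd, 3)
```

One unit means 1,000 US dollars for the selling price, 1 kilometer for driven kms, 1 year for age, and the switch from individual to dealer for the selling type.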

F-test statistic, to evaluate how good the regression model is:

H0: ρ² = 0 (none of the explanatory variables explains the dependent one)
H1: ρ² > 0 (at least one explanatory variable explains the dependent one)

Since the p-value is very low (p < 0.001), we reject H0 and conclude that there is a relationship between the dependent variable and at least one of the explanatory variables.
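
For reference, the classical F statistic can be obtained from R² as F = (R² / k) / ((1 - R²) / (n - k - 1)), where k is the number of explanatory variables. The sketch below checks this on the built-in `mtcars` data (an assumption so the example is self-contained). The formula corresponds to the classical value reported by `summary(fit1)` (651.3); the robust output reports a different value (404.1), presumably because it is based on the robust covariance matrix.

```r
# Demonstration model on a built-in data set
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit_demo)

n  <- nrow(mtcars)   # number of observations
k  <- 2              # number of explanatory variables
r2 <- s$r.squared

# Classical F statistic computed from R-squared
f_manual <- (r2 / k) / ((1 - r2) / (n - k - 1))

f_manual
unname(s$fstatistic["value"])   # value reported by summary(), for comparison
```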

sqrt(0.9005)
## [1] 0.9489468

Multiple coefficient of determination: the multiple R^2 = 0.90, meaning that 90% of the variability of the dependent variable (present price) can be explained by the variability of the explanatory variables (selling price, driven kms, age, selling type). The multiple correlation coefficient is 0.95, meaning that the linear relationship between the present price and the explanatory variables taken together is very strong.
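
The multiple correlation coefficient is the correlation between the observed and the fitted values of the dependent variable, which is why taking the square root of R² works. A quick check of that identity on the built-in `mtcars` data (used only as a self-contained stand-in):

```r
# Demonstration model on a built-in data set
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

r2 <- summary(fit_demo)$r.squared

# Correlation between observed and fitted values of the dependent variable
multiple_r <- cor(mtcars$mpg, fitted(fit_demo))

multiple_r
sqrt(r2)   # same value, obtained from R-squared
```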