Introduction

This document compromises one of the data science project done for price prediction. Dataset source is from public domain and can be found here.

Primary objectives are to demonstrate following knowledge :

1.Explore, understand, cleanse & reorganized data with explained reasoning (Exploratory Data Analysis).

  1. Complete visualizations to identify and exploit noticeable trends and patterns.

  2. Highlight the correlation between the dependent variable (Price) with the independent and explain how they interact with each other.

  3. Apply linear regression model to our analysis,interpret the significance of our result through statistical analysis and draw a conclusion about whether or not regression analysis is a good model to explain changes in the dependent variable.

  4. Truly understand data project cycle to create a commercial data product with ability to execute statistics through R application

Content

  1. Overview
  2. Data Dictionary
  3. Data Cleansing
  4. Analysis & Visualization
  5. Correlation analysis & Interpretation
  6. Findings & Conclusion

Overview

About 6 months ago, I was willing to buy a second hand car. It is a common practice in France since the price tend to be way below the average. However, I was surprised of how expensive it was compared to a few years back then.

In this project, use case is to determine how the independent variables (Car brand, Car model, Miles, Year) explain the changes in value of the dependent variable (Price) in order forecast the Price.

Data Dictionnary

Data Cleansing

library(ggplot2)
library(dplyr)
## 
## Attachement du package : 'dplyr'
## Les objets suivants sont masqués depuis 'package:stats':
## 
##     filter, lag
## Les objets suivants sont masqués depuis 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ purrr   1.0.0
## ✔ tidyr   1.2.1     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(stringr)
library(corrplot)
## corrplot 0.92 loaded
car <- read.csv("carvana.csv")
summary(car)
##      Name                Year              Miles            Price       
##  Length:22000       Min.   :    2009   Min.   :    53   Min.   : 10990  
##  Class :character   1st Qu.:    2014   1st Qu.: 35512   1st Qu.: 16590  
##  Mode  :character   Median :    2017   Median : 51890   Median : 19590  
##                     Mean   :  194665   Mean   : 54446   Mean   : 20707  
##                     3rd Qu.:    2020   3rd Qu.: 73191   3rd Qu.: 22990  
##                     Max.   :20223500   Max.   :120167   Max.   :102990
head(car)
##              Name  Year Miles Price
## 1  Chevrolet Trax  2018 41946 16990
## 2     GMC Terrain  2020 45328 23990
## 3   Jeep Wrangler  2012 81068 21590
## 4   Jeep Renegade  2019 35372 21590
## 5           BMW X 20173 68992 22990
## 6    Buick Encore  2019 47973 18590

It seems like some dates are containing more than 4 digits, the first 4 represent the year, so we want to remove everything that comes after :

Car_Year <- as.numeric(substr(car$Year, 1, 4))
car$Year <- NULL

“Name” contains two valuable information : the constructor’s name & the model of the car. We would like to split this column in two to show both information in different columns and then merge all our freshly created column into one data frame.

Car_Brand <- word(car$Name, 2)
Car_Model <- word(car$Name,-1)
car$Name <- NULL
Car2 <- cbind(Car_Year, car, Car_Brand, Car_Model)
head(Car2)
##   Car_Year Miles Price Car_Brand Car_Model
## 1     2018 41946 16990 Chevrolet      Trax
## 2     2020 45328 23990       GMC   Terrain
## 3     2012 81068 21590      Jeep  Wrangler
## 4     2019 35372 21590      Jeep  Renegade
## 5     2017 68992 22990       BMW         X
## 6     2019 47973 18590     Buick    Encore

Now, let’s rename & change the position of the columns to make our data frame easier to read

colnames(Car2)[1] = "Year"
Car3 <- Car2 %>% relocate(Car_Brand, .before = Year)
Car4 <- Car3 %>% relocate(Car_Model, .before = Miles )
df <- Car4 %>% relocate(Year,.after = Price)
head(df)
##   Car_Brand Car_Model Miles Price Year
## 1 Chevrolet      Trax 41946 16990 2018
## 2       GMC   Terrain 45328 23990 2020
## 3      Jeep  Wrangler 81068 21590 2012
## 4      Jeep  Renegade 35372 21590 2019
## 5       BMW         X 68992 22990 2017
## 6     Buick    Encore 47973 18590 2019

Analysis & Visualisation

We first want to know how is the prices of cars distributed compared to the year release

ggplot(df, aes(x = Year, y = Price)) +
  geom_bar(color ='steelblue', stat = 'Identity') +
  theme(axis.text.x = element_text(angle = 90,hjust = 1)) +
  ggtitle("Yearly Sales of second hand cars ")

This visualization draws attention on few things :

Let’s see now how the average price changes throughout the years

ggplot(data = df, aes(x = Year, y=Price, group = Car_Brand)) + 
  geom_boxplot() +ggtitle("Distribution of resale price by years including car contrustors")+
  geom_hline(aes(yintercept = mean(Price)),color = "blue", linetype = "dashed", alpha = 1)  

- The price of cars seems to remain pretty stable from 2012 to 2019. However past 2019, the price start dramatically increased for cars released between 2020 & 2022. - Our assumption is that this higher price is explained by two things :

- Low mileage 
- state of art technology 

Which constructors is more profitable ?

We will verify the previous assumption during our correlation analysis but we can start now. Let’s look how the profit is distributed regarding each brands

ggplot(df, aes(x= Car_Brand, y = Price)) +
  geom_bar(color ='steelblue', stat = 'Identity') +
  theme(axis.text.x= element_text(angle = 90,hjust = 1)) +
  ggtitle("Total profit by Car brands ")

We can see the top 5 sellers are : 1. Toyota 2. Nissan 3. Honda 4. Chevrolet 5. Ford

However, if we take a look at the top 10 most expensive, car we realize that Tesla appears 3 times and all the models have been released between 2019 & 2023.

df_desc <- df %>% arrange(desc(Price))
head(df_desc,10)
##        Car_Brand Car_Model Miles  Price Year
## 1  Mercedes-Benz   S-Class 10880 102990 2022
## 2          Tesla         S  8454  98590 2022
## 3      Chevrolet  Corvette  2982  94590 2023
## 4           Ford Lightning  1766  92990 2022
## 5          Tesla         S 10093  89990 2021
## 6            BMW         M 15378  89590 2022
## 7           Ford   Mustang  6697  88590 2020
## 8         Toyota   CrewMax  2625  87990 2022
## 9       Maserati   Levante  5339  86990 2022
## 10         Tesla         X 29496  85590 2019

The reason why Toyota is so profitable is that Carvana is selling a lot of mid-range cars.

Toyota_Sales <- filter(df, Car_Brand == 'Toyota')
head(Toyota_Sales)
##   Car_Brand Car_Model Miles Price Year
## 1    Toyota   Corolla 37726 22990 2020
## 2    Toyota      C-HR 48626 23990 2019
## 3    Toyota     Yaris 33075 20990 2020
## 4    Toyota     Yaris 65338 16990 2017
## 5    Toyota        iM 19651 22590 2018
## 6    Toyota   Corolla 62585 17990 2017

On the 22 000 cars sold, 2646 were constructed by Toyota, which means about 12% of the car sold are Toyota. It is important to not get confused about the previous graphs, most of Toyota cars were released before 2020 and contributes a lot to grow yearly profit between 2014 & 2019.

Correlation analysis & Interpretation

In this section, how mileage of a car and the year it was released influence the resale price. In order to highlight this, we create a scatter plot diagram.

ggplot(data=df) +
  geom_point(mapping= aes(x=Miles, y=Price, color=Car_Brand)) +
  geom_smooth(mapping=aes(x=Miles, y=Price, linetype= 'dashed')) +
  ggtitle("Distribution of resale price by miles")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

With all those dots from different colors, it is indeed difficult to visualize. However, the dashed line helps us to notice that a car loses most of its value between 0 and 25 000 miles, keeps decreasing until 100 000 Miles to an average resale price of 20 000$ and surprisingly goes back up passed that point. However, regarding the low size of sample of car having over 100 000 miles, we will ignore this information

Proceed to the the same analysis with a smaller sample size

Let’s take 5 constructors with cars being released on the market during 2013 & 2022 and draw the same scatter plot diagram

Sample_car <- filter(df, Car_Brand =="Tesla"|  Car_Brand == "Nissan"| Car_Brand == "Chevrolet" | Car_Brand == "Honda" | Car_Brand == "Ford")

ggplot(data = Sample_car) +
  geom_point(mapping = aes(x = Miles, y = Price, color = Car_Brand)) +
  geom_smooth(mapping = aes(x=Miles, y=Price, linetype = Car_Brand)) +
  ggtitle("Distribution of resale price by miles (sample of 5 constructors)")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Let’s look at the trend of each curves :

  • Regardless of the brand, every cars loses value as the mileage increases
  • Chevrolet & Honda follow a similar pattern as the one with saw previously, the price falls sharply over the first 30 000 Miles and stabilizes after 90 000 Miles
  • Nissan’s cars have a declining price trend which tends to increases at some moments
  • Ford’s curve is most elastic one. On average, a ford car loses 50% of its value after the first 15 000 miles.
  • Tesla is interesting because it’s a declining straight line and turns out to be the most inelastic of the sample. After 60 000 Miles, a Tesla has last only about 20% of its value. We can suppose that this resilience is allowed by a state of art technology that somehow resists to depreciation.

How strong is the correlation between Year, Miles & Price ?

So far, our analysis has shown that Price and Miles are negatively correlated. We may also think that Year and price are negatively correlated because the oldest the car, the higher the mileage. I mean, this would make sense.

cor.result <- cor(df[,c(3,4,5)])
corrplot(cor.result, method = 'ellipse')

cor.result
##            Miles      Price       Year
## Miles  1.0000000 -0.4741244 -0.4800435
## Price -0.4741244  1.0000000  0.4917514
## Year  -0.4800435  0.4917514  1.0000000

As predicted, our correlation analysis reports the following results :

  • There is a low negative correlation between Miles and Price. As mileage increases, price is decreasing
  • There is a low negative correlation between Year and Miles. As Year increases, mileage is decreasing because the car is more recent.
  • There is a low positive correlation between Year and Price. As Year increases, Price increases as well because the car has better technology and lower mileage.

Findings & Conclusion

Let’s take a look at the Regression analysis and decide whether the linear regression model is a fit for predicting prices of cars.

regressionFloor <- lm(formula = Price ~Miles, data = df)
summary(regressionFloor)
## 
## Call:
## lm(formula = Price ~ Miles, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12400  -3114   -978   1786  77027 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.728e+04  9.093e+01  299.96   <2e-16 ***
## Miles       -1.206e-01  1.510e-03  -79.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5754 on 21998 degrees of freedom
## Multiple R-squared:  0.2248, Adjusted R-squared:  0.2248 
## F-statistic:  6379 on 1 and 21998 DF,  p-value: < 2.2e-16
regressionFloor2 <- lm(formula = Price ~Year, data = df)
summary(regressionFloor2)
## 
## Call:
## lm(formula = Price ~ Year, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12668  -3064   -928   1607  75998 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.132e+06  2.570e+04  -82.96   <2e-16 ***
## Year         1.068e+03  1.275e+01   83.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5691 on 21998 degrees of freedom
## Multiple R-squared:  0.2418, Adjusted R-squared:  0.2418 
## F-statistic:  7016 on 1 and 21998 DF,  p-value: < 2.2e-16

Based on the results :

Conclusion

We can attest the statistical significance of the independent variables Year & Miles to explain the variation of second hand car prices sold by Caravan. However, Linear Regression Model doesn’t seem to be a good fit since it can only explain about 1/4 of the results. This could be explained by the high differences in elasticity between variables as we saw earlier, which leads the “linearity” of our model to obsolescence.