Content
- Overview
- Data Dictionary
- Data Cleansing
- Analysis & Visualization
- Correlation analysis & Interpretation
- Findings & Conclusion
This document compromises one of the data science project done for price prediction. Dataset source is from public domain and can be found here.
Primary objectives are to demonstrate following knowledge :
1.Explore, understand, cleanse & reorganized data with explained reasoning (Exploratory Data Analysis).
Complete visualizations to identify and exploit noticeable trends and patterns.
Highlight the correlation between the dependent variable (Price) with the independent and explain how they interact with each other.
Apply linear regression model to our analysis,interpret the significance of our result through statistical analysis and draw a conclusion about whether or not regression analysis is a good model to explain changes in the dependent variable.
Truly understand data project cycle to create a commercial data product with ability to execute statistics through R application
About 6 months ago, I was willing to buy a second hand car. It is a common practice in France since the price tend to be way below the average. However, I was surprised of how expensive it was compared to a few years back then.
In this project, use case is to determine how the independent variables (Car brand, Car model, Miles, Year) explain the changes in value of the dependent variable (Price) in order forecast the Price.
library(ggplot2)
library(dplyr)
##
## Attachement du package : 'dplyr'
## Les objets suivants sont masqués depuis 'package:stats':
##
## filter, lag
## Les objets suivants sont masqués depuis 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ purrr 1.0.0
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(stringr)
library(corrplot)
## corrplot 0.92 loaded
car <- read.csv("carvana.csv")
summary(car)
## Name Year Miles Price
## Length:22000 Min. : 2009 Min. : 53 Min. : 10990
## Class :character 1st Qu.: 2014 1st Qu.: 35512 1st Qu.: 16590
## Mode :character Median : 2017 Median : 51890 Median : 19590
## Mean : 194665 Mean : 54446 Mean : 20707
## 3rd Qu.: 2020 3rd Qu.: 73191 3rd Qu.: 22990
## Max. :20223500 Max. :120167 Max. :102990
head(car)
## Name Year Miles Price
## 1 Chevrolet Trax 2018 41946 16990
## 2 GMC Terrain 2020 45328 23990
## 3 Jeep Wrangler 2012 81068 21590
## 4 Jeep Renegade 2019 35372 21590
## 5 BMW X 20173 68992 22990
## 6 Buick Encore 2019 47973 18590
It seems like some dates are containing more than 4 digits, the first 4 represent the year, so we want to remove everything that comes after :
Car_Year <- as.numeric(substr(car$Year, 1, 4))
car$Year <- NULL
“Name” contains two valuable information : the constructor’s name & the model of the car. We would like to split this column in two to show both information in different columns and then merge all our freshly created column into one data frame.
Car_Brand <- word(car$Name, 2)
Car_Model <- word(car$Name,-1)
car$Name <- NULL
Car2 <- cbind(Car_Year, car, Car_Brand, Car_Model)
head(Car2)
## Car_Year Miles Price Car_Brand Car_Model
## 1 2018 41946 16990 Chevrolet Trax
## 2 2020 45328 23990 GMC Terrain
## 3 2012 81068 21590 Jeep Wrangler
## 4 2019 35372 21590 Jeep Renegade
## 5 2017 68992 22990 BMW X
## 6 2019 47973 18590 Buick Encore
Now, let’s rename & change the position of the columns to make our data frame easier to read
colnames(Car2)[1] = "Year"
Car3 <- Car2 %>% relocate(Car_Brand, .before = Year)
Car4 <- Car3 %>% relocate(Car_Model, .before = Miles )
df <- Car4 %>% relocate(Year,.after = Price)
head(df)
## Car_Brand Car_Model Miles Price Year
## 1 Chevrolet Trax 41946 16990 2018
## 2 GMC Terrain 45328 23990 2020
## 3 Jeep Wrangler 81068 21590 2012
## 4 Jeep Renegade 35372 21590 2019
## 5 BMW X 68992 22990 2017
## 6 Buick Encore 47973 18590 2019
We first want to know how is the prices of cars distributed compared to the year release
ggplot(df, aes(x = Year, y = Price)) +
geom_bar(color ='steelblue', stat = 'Identity') +
theme(axis.text.x = element_text(angle = 90,hjust = 1)) +
ggtitle("Yearly Sales of second hand cars ")
This visualization draws attention on few things :
Let’s see now how the average price changes throughout the years
ggplot(data = df, aes(x = Year, y=Price, group = Car_Brand)) +
geom_boxplot() +ggtitle("Distribution of resale price by years including car contrustors")+
geom_hline(aes(yintercept = mean(Price)),color = "blue", linetype = "dashed", alpha = 1)
- The price of cars seems to remain pretty stable from 2012 to 2019.
However past 2019, the price start dramatically increased for cars
released between 2020 & 2022. - Our assumption is that this higher
price is explained by two things :
- Low mileage
- state of art technology
We will verify the previous assumption during our correlation analysis but we can start now. Let’s look how the profit is distributed regarding each brands
ggplot(df, aes(x= Car_Brand, y = Price)) +
geom_bar(color ='steelblue', stat = 'Identity') +
theme(axis.text.x= element_text(angle = 90,hjust = 1)) +
ggtitle("Total profit by Car brands ")
We can see the top 5 sellers are : 1. Toyota 2. Nissan 3. Honda 4. Chevrolet 5. Ford
However, if we take a look at the top 10 most expensive, car we realize that Tesla appears 3 times and all the models have been released between 2019 & 2023.
df_desc <- df %>% arrange(desc(Price))
head(df_desc,10)
## Car_Brand Car_Model Miles Price Year
## 1 Mercedes-Benz S-Class 10880 102990 2022
## 2 Tesla S 8454 98590 2022
## 3 Chevrolet Corvette 2982 94590 2023
## 4 Ford Lightning 1766 92990 2022
## 5 Tesla S 10093 89990 2021
## 6 BMW M 15378 89590 2022
## 7 Ford Mustang 6697 88590 2020
## 8 Toyota CrewMax 2625 87990 2022
## 9 Maserati Levante 5339 86990 2022
## 10 Tesla X 29496 85590 2019
The reason why Toyota is so profitable is that Carvana is selling a lot of mid-range cars.
Toyota_Sales <- filter(df, Car_Brand == 'Toyota')
head(Toyota_Sales)
## Car_Brand Car_Model Miles Price Year
## 1 Toyota Corolla 37726 22990 2020
## 2 Toyota C-HR 48626 23990 2019
## 3 Toyota Yaris 33075 20990 2020
## 4 Toyota Yaris 65338 16990 2017
## 5 Toyota iM 19651 22590 2018
## 6 Toyota Corolla 62585 17990 2017
On the 22 000 cars sold, 2646 were constructed by Toyota, which means about 12% of the car sold are Toyota. It is important to not get confused about the previous graphs, most of Toyota cars were released before 2020 and contributes a lot to grow yearly profit between 2014 & 2019.
In this section, how mileage of a car and the year it was released influence the resale price. In order to highlight this, we create a scatter plot diagram.
ggplot(data=df) +
geom_point(mapping= aes(x=Miles, y=Price, color=Car_Brand)) +
geom_smooth(mapping=aes(x=Miles, y=Price, linetype= 'dashed')) +
ggtitle("Distribution of resale price by miles")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
With all those dots from different colors, it is indeed difficult to visualize. However, the dashed line helps us to notice that a car loses most of its value between 0 and 25 000 miles, keeps decreasing until 100 000 Miles to an average resale price of 20 000$ and surprisingly goes back up passed that point. However, regarding the low size of sample of car having over 100 000 miles, we will ignore this information
Let’s take 5 constructors with cars being released on the market during 2013 & 2022 and draw the same scatter plot diagram
Sample_car <- filter(df, Car_Brand =="Tesla"| Car_Brand == "Nissan"| Car_Brand == "Chevrolet" | Car_Brand == "Honda" | Car_Brand == "Ford")
ggplot(data = Sample_car) +
geom_point(mapping = aes(x = Miles, y = Price, color = Car_Brand)) +
geom_smooth(mapping = aes(x=Miles, y=Price, linetype = Car_Brand)) +
ggtitle("Distribution of resale price by miles (sample of 5 constructors)")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Let’s look at the trend of each curves :
So far, our analysis has shown that Price and Miles are negatively correlated. We may also think that Year and price are negatively correlated because the oldest the car, the higher the mileage. I mean, this would make sense.
cor.result <- cor(df[,c(3,4,5)])
corrplot(cor.result, method = 'ellipse')
cor.result
## Miles Price Year
## Miles 1.0000000 -0.4741244 -0.4800435
## Price -0.4741244 1.0000000 0.4917514
## Year -0.4800435 0.4917514 1.0000000
As predicted, our correlation analysis reports the following results :
Let’s take a look at the Regression analysis and decide whether the linear regression model is a fit for predicting prices of cars.
regressionFloor <- lm(formula = Price ~Miles, data = df)
summary(regressionFloor)
##
## Call:
## lm(formula = Price ~ Miles, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12400 -3114 -978 1786 77027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.728e+04 9.093e+01 299.96 <2e-16 ***
## Miles -1.206e-01 1.510e-03 -79.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5754 on 21998 degrees of freedom
## Multiple R-squared: 0.2248, Adjusted R-squared: 0.2248
## F-statistic: 6379 on 1 and 21998 DF, p-value: < 2.2e-16
regressionFloor2 <- lm(formula = Price ~Year, data = df)
summary(regressionFloor2)
##
## Call:
## lm(formula = Price ~ Year, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12668 -3064 -928 1607 75998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.132e+06 2.570e+04 -82.96 <2e-16 ***
## Year 1.068e+03 1.275e+01 83.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5691 on 21998 degrees of freedom
## Multiple R-squared: 0.2418, Adjusted R-squared: 0.2418
## F-statistic: 7016 on 1 and 21998 DF, p-value: < 2.2e-16
Based on the results :
We can attest the statistical significance of the independent variables Year & Miles to explain the variation of second hand car prices sold by Caravan. However, Linear Regression Model doesn’t seem to be a good fit since it can only explain about 1/4 of the results. This could be explained by the high differences in elasticity between variables as we saw earlier, which leads the “linearity” of our model to obsolescence.