library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(dplyr)
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
df <- read.delim('Cars24.csv', sep = ",")
head(df)
## X Car.Brand Model Price Model.Year Location Fuel Driven..Kms.
## 1 0 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674
## 2 1 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979
## 3 2 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330
## 4 3 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662
## 5 4 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256
## 6 5 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434
## Gear Ownership EMI..monthly.
## 1 Manual 2 7350
## 2 Manual 1 7790
## 3 Manual 2 5098
## 4 Manual 1 6816
## 5 Automatic 1 4642
## 6 Manual 1 5554
In the Cars24 dataset, I have selected Price as the response variable. This makes sense in the context of used car sales, as price is the key factor for both buyers and sellers. It represents the outcome of interest, capturing the overall value of a vehicle based on various characteristics.
The categorical variable selected is Fuel. The Fuel column has five distinct values: Petrol, Diesel, Electric, Petrol+LPG, and Petrol+CNG. Let’s set up an ANOVA test to determine whether the mean of the response variable differs across the five fuel types.
df |>
filter(Price < 4000000) |>
ggplot() +
geom_boxplot(mapping = aes(y = Price, x = Fuel)) +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
labs(x = "Fuel Type",
y = "Car Price (INR)",
title = "Fuel Type Vs Price")
We can see that some fuel types typically have a somewhat higher sale price than others, but we want to know whether the differences are large enough to challenge the hypothesis that the group means are all the same. The hypotheses for the ANOVA test are stated below.
Null Hypothesis : The mean car price is the same across all fuel types.
Alternative Hypothesis : The mean car price differs for at least one fuel type.
We choose a significance level (\(\alpha\)) of 0.05.
m <- aov(Price ~ Fuel, data = df)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## Fuel 4 6.551e+13 1.638e+13 176.1 <2e-16 ***
## Residuals 5913 5.498e+14 9.298e+10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value (< 2e-16) is less than \(\alpha\), we reject the null hypothesis: there is a statistically significant difference in mean car price across fuel types.
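The F value and p-value in the table can be reproduced from the sums of squares, which makes it easier to see where the test statistic comes from. The sketch below only re-uses the numbers printed by `summary(m)`; the variable names are illustrative, and the eta-squared line is an extra effect-size calculation that is not part of the original output.

```r
# Values copied from the summary(m) table above.
ss_fuel  <- 6.551e13   # Sum Sq for Fuel
ss_resid <- 5.498e14   # Sum Sq for Residuals
df_fuel  <- 4          # number of fuel types minus 1
df_resid <- 5913       # residual degrees of freedom

ms_fuel  <- ss_fuel / df_fuel      # mean square between groups
ms_resid <- ss_resid / df_resid    # mean square within groups
f_value  <- ms_fuel / ms_resid     # F statistic

# The upper tail of the F distribution gives the p-value.
p_value <- pf(f_value, df_fuel, df_resid, lower.tail = FALSE)

# Share of the total variation in Price associated with Fuel (eta squared).
eta_sq <- ss_fuel / (ss_fuel + ss_resid)

print(c(F = f_value, eta_squared = eta_sq))
```

The recomputed F lands close to the 176.1 in the table (small differences come from rounding in the printed sums of squares), and eta squared is roughly 0.11, so fuel type accounts for about a tenth of the variation in price.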
Because we rejected the null hypothesis, there is enough evidence to conclude that mean price is not the same for every fuel type. This result suggests that a car’s fuel type is associated with its price in the used car market. Understanding this relationship can guide buyers, sellers, and dealers in setting or negotiating car prices more effectively.
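ANOVA only says that at least one group mean differs; it does not say which fuel types differ from each other. One common follow-up is Tukey's HSD test, available as `TukeyHSD()` in base R. The sketch below uses a small made-up data frame (`toy_fuel` is hypothetical and is not Cars24 data) so that it runs on its own; on the real data the equivalent call would presumably be `TukeyHSD(m)`.

```r
# Hypothetical toy data, NOT the Cars24 values: three fuel groups,
# one of which has a clearly higher mean price.
set.seed(42)
toy_fuel <- data.frame(
  Fuel  = factor(rep(c("Petrol", "Diesel", "Petrol+CNG"), each = 30)),
  Price = c(rnorm(30, mean = 400000, sd = 50000),
            rnorm(30, mean = 600000, sd = 50000),
            rnorm(30, mean = 410000, sd = 50000))
)

toy_fit <- aov(Price ~ Fuel, data = toy_fuel)

# TukeyHSD() compares every pair of groups and adjusts the p-values
# for the number of comparisons being made.
toy_tukey <- TukeyHSD(toy_fit, conf.level = 0.95)
print(toy_tukey)
```

Each row of the result is one pair of fuel types, with the difference in means, a confidence interval, and an adjusted p-value.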
First, let’s create a new column, “age”, by subtracting the car’s model year from the current year.
df$age <- year(now()) - df$Model.Year
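Because `year(now())` changes over time, re-running the analysis in a later year shifts every age by one and changes the fitted intercept. A minimal sketch of one way to pin this down is shown below; `car_age()` is a hypothetical helper, and 2024 is an assumed reference year, not a value taken from the original analysis.

```r
# Hypothetical helper: compute age against a fixed reference year so the
# result does not change when the script is re-run in a later year.
car_age <- function(model_year, reference_year) {
  stopifnot(is.numeric(model_year), is.numeric(reference_year))
  age <- reference_year - model_year
  if (any(age < 0, na.rm = TRUE)) {
    warning("Some model years are later than the reference year.")
  }
  age
}

# Model years from the six rows shown by head(df); 2024 is an assumption.
example_years <- c(2016, 2011, 2011, 2011, 2015, 2012)
example_age <- car_age(example_years, reference_year = 2024)
print(example_age)
```

On the full data this would be used as `df$age <- car_age(df$Model.Year, reference_year = 2024)`, with the reference year set to whatever year the data was collected.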
I am choosing “Age” as my continuous variable to analyze its relationship with “Price” in the Cars24 dataset. Understanding how a car’s age impacts its resale value will provide insights into consumer preferences and pricing strategies. This analysis will help identify trends in depreciation, revealing whether older vehicles significantly drop in price compared to newer models.
Let’s create a scatter plot to see the data.
df |>
ggplot(mapping = aes(x = age, y = Price)) +
geom_point(color = 'darkblue') +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
labs(x = "Age of the car",
y = "Car Price (INR)",
title = "Age Vs Price")
The scatter plot suggests that car price tends to decrease as the age of the car increases. Let’s check this by adding a line (a linear model) fitted to the data.
df |>
ggplot(mapping = aes(x = age, y = Price)) +
geom_point(color = 'darkblue') +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
geom_smooth(method = "lm", se = FALSE, color = 'black') +
labs(x = "Age of the car",
y = "Car Price (INR)",
title = "Age Vs Price")
## `geom_smooth()` using formula = 'y ~ x'
Let’s check the summary statistics of the linear model.
model <- lm(Price ~ age, data = df)
summary(model)
##
## Call:
## lm(formula = Price ~ age, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -472260 -138638 -48274 56920 6116102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1015797 12562 80.86 <2e-16 ***
## age -52658 1270 -41.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 283900 on 5916 degrees of freedom
## Multiple R-squared: 0.2251, Adjusted R-squared: 0.2249
## F-statistic: 1718 on 1 and 5916 DF, p-value: < 2.2e-16
R-squared value : 0.2251
An R-squared value close to 1 indicates a better-fitting model.
This value shows that the model is not a great fit: age alone explains only about 22.5% of the variance in price, which suggests that other factors not included in the model substantially influence price. That makes sense, because many other factors affect the price of a used car, such as kilometers driven, transmission, brand, and number of previous owners.
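To make the R-squared figure more concrete, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a small simulated data frame (`toy_cars` is hypothetical and is not the Cars24 data) so that it runs on its own; the same steps should apply to `model` and `df`.

```r
# Hypothetical toy data, NOT the Cars24 values: price falls with age,
# plus random noise.
set.seed(1)
toy_cars <- data.frame(age = rep(1:10, each = 5))
toy_cars$Price <- 1000000 - 50000 * toy_cars$age +
  rnorm(nrow(toy_cars), sd = 150000)

toy_model <- lm(Price ~ age, data = toy_cars)

# Variation left unexplained by the model.
ss_res <- sum(residuals(toy_model)^2)
# Total variation of Price around its mean.
ss_tot <- sum((toy_cars$Price - mean(toy_cars$Price))^2)

r_squared_manual <- 1 - ss_res / ss_tot

print(c(manual       = r_squared_manual,
        from_summary = summary(toy_model)$r.squared))
```

The two printed values agree, which shows that R-squared is simply the share of the variation in price that the fitted line accounts for.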
model$coefficients
## (Intercept) age
## 1015796.91 -52658.23
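Intercept (1015796.91): the predicted price, in INR, of a car with an age of 0 years.
Age (-52658.23): the change in predicted price for each additional year of age; each extra year is associated with a drop of about INR 52,658.
Putting these two coefficients together, the regression equation can be expressed as:
\[ \begin{align} \text{Price} &= 1015796.91 - 52658.23 \times \text{Age} \end{align} \]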
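As a quick illustration of how the regression equation is used, the sketch below plugs a few ages into it, using the coefficients printed by `model$coefficients` above. `predict_price()` is a hypothetical helper written for this example; with the fitted model object, `predict(model, newdata = data.frame(age = ages))` should give the same values.

```r
# Coefficients as printed by model$coefficients above.
intercept <- 1015796.91
slope_age <- -52658.23

# Hypothetical helper: predicted price (INR) for a given age in years.
predict_price <- function(age) {
  intercept + slope_age * age
}

ages <- c(0, 5, 10)
predicted <- predict_price(ages)
print(data.frame(age = ages, predicted_price = predicted))
```

Note that the fitted line crosses zero a little after 19 years of age, so the equation should not be extrapolated to very old cars.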