library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(dplyr)
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
df <- read.csv('Cars24.csv')

head(df)
##   X Car.Brand          Model  Price Model.Year  Location   Fuel Driven..Kms.
## 1 0   Hyundai    EonERA PLUS 330399       2016 Hyderabad Petrol        10674
## 2 1    Maruti Wagon R 1.0LXI 350199       2011 Hyderabad Petrol        20979
## 3 2    Maruti    Alto K10LXI 229199       2011 Hyderabad Petrol        47330
## 4 3    Maruti  RitzVXI BS IV 306399       2011 Hyderabad Petrol        19662
## 5 4      Tata  NanoTWIST XTA 208699       2015 Hyderabad Petrol        11256
## 6 5    Maruti        AltoLXI 249699       2012 Hyderabad Petrol        28434
##        Gear Ownership EMI..monthly.
## 1    Manual         2          7350
## 2    Manual         1          7790
## 3    Manual         2          5098
## 4    Manual         1          6816
## 5 Automatic         1          4642
## 6    Manual         1          5554

Response variable - “Price”

In the Cars24 dataset, I have selected Price as the response variable. This makes sense in the context of used car sales, as price is the key factor for both buyers and sellers. It represents the outcome of interest, capturing the overall value of a vehicle based on various characteristics.

Categorical variable - “Fuel”

The categorical variable selected is Fuel. The Fuel column has five distinct values: Petrol, Diesel, Electric, Petrol+LPG, and Petrol+CNG. Let’s use an ANOVA test to check whether the mean of the response variable differs across these five fuel types.

ANOVA Test

df |>
  filter(Price < 4000000) |>
  ggplot() +
  geom_boxplot(mapping = aes(y = Price, x = Fuel)) +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  labs(x = "Fuel Type",
       y = "Car Price (INR)",
       title = "Fuel Type Vs Price")

We can see that some fuel types typically have a higher sale price than others, but we want to know whether the differences are large enough to challenge the hypothesis that the group means are all the same. Note that the plot only shows cars priced below INR 4,000,000, whereas the ANOVA below uses the full dataset. The hypotheses for the ANOVA test are stated below.

Devising Hypothesis for ANOVA

Null Hypothesis : There is no significant difference in the mean car price across different fuel types.

Alternative Hypothesis : There is a significant difference in the mean car price across different fuel types.

Performing ANOVA test

We set the significance level (\(\alpha\)) to 0.05.

m <- aov(Price ~ Fuel, data = df)
summary(m)
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## Fuel           4 6.551e+13 1.638e+13   176.1 <2e-16 ***
## Residuals   5913 5.498e+14 9.298e+10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value (< 2e-16) is less than \(\alpha\), we reject the null hypothesis. This means that there is a significant difference in the mean car price across fuel types.
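As a sanity check, the F value can be reproduced by hand from the rounded figures in the ANOVA table. The sketch below copies the sums of squares and degrees of freedom from the output above; the variable names are illustrative only, and the result differs slightly from 176.1 because the inputs are rounded.

```r
# Values copied from the summary(m) output above (rounded as printed).
ss_fuel  <- 6.551e13   # Sum Sq for Fuel
ss_resid <- 5.498e14   # Sum Sq for Residuals
df_fuel  <- 4          # k - 1, with k = 5 fuel types
df_resid <- 5913       # n - k, with n = 5918 cars

# Mean squares are sums of squares divided by their degrees of freedom.
ms_fuel  <- ss_fuel / df_fuel
ms_resid <- ss_resid / df_resid

# The F statistic is the ratio of between-group to within-group mean squares.
f_value <- ms_fuel / ms_resid

# The p-value is the upper-tail probability of the F distribution.
p_value <- pf(f_value, df_fuel, df_resid, lower.tail = FALSE)

f_value          # close to the reported 176.1
p_value < 0.05   # TRUE, so the null hypothesis is rejected at alpha = 0.05
```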

Conclusion of ANOVA result

Since we rejected the null hypothesis, there is enough evidence to conclude that the mean price differs between at least two of the fuel types. This result suggests that the fuel type of a car is associated with its price in the used car market. Understanding this relationship can guide buyers, sellers, and dealers in setting or negotiating car prices more effectively.

Continuous Variable - “Age”

First, let’s create a new column “Age” (stored as age in the data frame) by subtracting the car’s model year from the current year.

df$age <- year(now()) - df$Model.Year
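Because year(now()) depends on the date the notebook is run, re-running it in a later year shifts every age by the same amount, which changes the fitted intercept (though not the slope). A minimal sketch of a reproducible alternative is to fix the reference year explicitly. The value 2024 below is a placeholder assumption, not something taken from the dataset, and the model years are the six rows shown by head(df).

```r
# Model years of the first six rows, as shown by head(df) above.
model_year <- c(2016, 2011, 2011, 2011, 2015, 2012)

# Assumption: a fixed reference year (2024 is a placeholder, not from the source).
reference_year <- 2024

# Age in years relative to the fixed reference year.
age <- reference_year - model_year
age
# 8 13 13 13 9 12
```

With a fixed reference year, the Age column and the regression coefficients no longer depend on when the analysis is executed.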

I am choosing “Age” as my continuous variable to analyze its relationship with “Price” in the Cars24 dataset. Understanding how a car’s age impacts its resale value will provide insights into consumer preferences and pricing strategies. This analysis will help identify trends in depreciation, revealing whether older vehicles significantly drop in price compared to newer models.

Building Linear Regression model

Let’s create a scatter plot to see the data.

df |>
  ggplot(mapping = aes(x = age, y = Price)) +
  geom_point(color = 'darkblue') +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  labs(x = "Age of the car",
       y = "Car Price (INR)",
       title = "Age Vs Price")

The scatter plot suggests that car price tends to decrease as the age of the car increases. Let’s add a fitted line (a linear model) to make this trend explicit.

df |>
  ggplot(mapping = aes(x = age, y = Price)) +
  geom_point(color = 'darkblue') +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  geom_smooth(method = "lm", se = FALSE, color = 'black') +
  labs(x = "Age of the car",
       y = "Car Price (INR)",
       title = "Age Vs Price")
## `geom_smooth()` using formula = 'y ~ x'

Let’s check the summary statistics of the linear model.

model <- lm(Price ~ age, data = df)
summary(model)
## 
## Call:
## lm(formula = Price ~ age, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -472260 -138638  -48274   56920 6116102 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1015797      12562   80.86   <2e-16 ***
## age           -52658       1270  -41.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 283900 on 5916 degrees of freedom
## Multiple R-squared:  0.2251, Adjusted R-squared:  0.2249 
## F-statistic:  1718 on 1 and 5916 DF,  p-value: < 2.2e-16

R-squared value: 0.2251

An R-squared value close to 1 indicates a better-fitting model.

This value shows that the model explains only about 22.5% of the variance in price, so it is not a great fit. There are likely other factors not included in the model that significantly influence price. This makes sense, because the price of a used car also depends on factors such as kilometers driven, transmission, brand, and the number of previous owners.
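To make the meaning of R-squared concrete, the sketch below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with what summary() reports. It uses simulated toy data (the data frame toy and its parameters are made up for illustration), not the Cars24 dataset.

```r
# Simulated toy data for illustration only; not the Cars24 dataset.
set.seed(1)
toy <- data.frame(age = rep(1:10, each = 5))
toy$Price <- 1000000 - 50000 * toy$age + rnorm(nrow(toy), sd = 150000)

fit <- lm(Price ~ age, data = toy)

# Residual sum of squares: variation left unexplained by the model.
ss_res <- sum(residuals(fit)^2)

# Total sum of squares: variation of Price around its mean.
ss_tot <- sum((toy$Price - mean(toy$Price))^2)

# R-squared is the share of total variation explained by the model.
r2_manual <- 1 - ss_res / ss_tot

# Should agree with the value reported by summary().
all.equal(r2_manual, summary(fit)$r.squared)
```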

Interpreting Coefficients

model$coefficients
## (Intercept)         age 
##  1015796.91   -52658.23

Intercept (1015796.91):

  • This value represents the estimated price of a car when the age is zero (i.e., a brand-new car). In this context, it indicates that a car that is 0 years old would have a predicted price of approximately INR 1,015,796.91. Since a used car dataset contains few or no such cars, the intercept is better read as a baseline reference than as a practical prediction.

Age (-52658.23):

  • This coefficient indicates that for each additional year of age, the predicted price of the car decreases by approximately INR 52,658.23. This negative relationship suggests that as cars age, their resale value declines, which is consistent with general expectations in the used car market.

Equation of the model

Putting these two coefficients together, the regression equation can be expressed as:

\[ \text{Price} = 1015796.91 - 52658.23 \times \text{Age} \]
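As a worked example, the coefficients printed by model$coefficients can be plugged into the equation to predict the price of a car of a given age. The helper predict_price below is a hypothetical name introduced for illustration. The sketch also computes the age at which the fitted line reaches zero, which shows one limitation of a straight-line model: beyond roughly 19 years it would predict negative prices.

```r
# Coefficients as printed by model$coefficients above.
intercept <- 1015796.91
slope     <- -52658.23

# Hypothetical helper: predicted price (INR) for a car of a given age in years.
predict_price <- function(age) intercept + slope * age

# Predicted price of a 5-year-old car.
predict_price(5)     # about 752505.76

# Age at which the fitted line crosses zero.
-intercept / slope   # about 19.3 years
```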