library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggplot2)
options(scipen = 6)
theme_set(theme_minimal())
cars24 <- read.delim('Cars24.csv', sep = ",")
head(cars24)
## Car.Brand Model Price Model.Year Location Fuel Driven..Kms.
## 1 Hyundai EonERA PLUS 330399 2016 Hyderabad Petrol 10674
## 2 Maruti Wagon R 1.0LXI 350199 2011 Hyderabad Petrol 20979
## 3 Maruti Alto K10LXI 229199 2011 Hyderabad Petrol 47330
## 4 Maruti RitzVXI BS IV 306399 2011 Hyderabad Petrol 19662
## 5 Tata NanoTWIST XTA 208699 2015 Hyderabad Petrol 11256
## 6 Maruti AltoLXI 249699 2012 Hyderabad Petrol 28434
## Gear Ownership EMI..monthly.
## 1 Manual 2 7350
## 2 Manual 1 7790
## 3 Manual 2 5098
## 4 Manual 1 6816
## 5 Automatic 1 4642
## 6 Manual 1 5554
We'll use Ownership to build a binary outcome variable:
multiple_owner = 1 if the car has had more than one owner, and multiple_owner = 0 otherwise.
cars24 <- cars24 |>
  mutate(multiple_owner = ifelse(Ownership > 1, 1, 0))
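The recoding above can be checked on a small example. The sketch below uses a made-up `toy` data frame (an assumption for illustration, not the Cars24 data) and suggests that `as.integer(Ownership > 1)` gives the same 0/1 coding as `ifelse()`, with missing values staying `NA` in both versions.

```r
# Hypothetical toy data, for illustration only (not from Cars24.csv)
toy <- data.frame(Ownership = c(1, 2, 3, 1, NA))

# Same rule as in the text: more than one owner -> 1, otherwise 0
toy$multiple_owner <- ifelse(toy$Ownership > 1, 1, 0)

# Equivalent, slightly shorter coding; NA stays NA in both versions
toy$multiple_owner_int <- as.integer(toy$Ownership > 1)

# Tabulate the outcome, including missing values, to check class balance
outcome_counts <- table(toy$multiple_owner, useNA = "ifany")
print(outcome_counts)
```

Tabulating the outcome before fitting is a cheap way to spot a heavily imbalanced or partly missing response.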
Next, compute each car's age in years from the Model.Year column:
cars24$age <- year(now()) - cars24$Model.Year
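Because `year(now())` depends on when the script is run, every age shifts by one each calendar year, which changes the fitted intercept. One option is to pin a reference year. In the sketch below, `reference_year` and its value are assumptions chosen for illustration; the model years are the first six rows shown in the `head()` output above.

```r
# Assumed reference year (hypothetical value, not stated in the source)
reference_year <- 2024

# Model years taken from the first six rows printed by head(cars24)
model_year <- c(2016, 2011, 2011, 2011, 2015, 2012)

# Age in years relative to the fixed reference year
age <- reference_year - model_year
print(age)
```

Shifting every age by a constant leaves the slope for age unchanged and only moves the intercept, so the choice of reference year mainly affects reproducibility of the printed output.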
model_data <- cars24 |>
  select(multiple_owner, Price, Driven..Kms., age) |>
  drop_na() # removing the rows with missing data
# Fitting logistic regression model
logit_model <- glm(multiple_owner ~ Price + Driven..Kms. + age,
                   data = model_data, family = binomial(link = 'logit'))
# Summarizing the model
summary(logit_model)
##
## Call:
## glm(formula = multiple_owner ~ Price + Driven..Kms. + age, family = binomial(link = "logit"),
## data = model_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.4467944181 0.1644170045 -20.964 < 2e-16 ***
## Price 0.0000003575 0.0000001123 3.183 0.00146 **
## Driven..Kms. -0.0000012590 0.0000008027 -1.568 0.11677
## age 0.2271827640 0.0134496308 16.891 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6634.8 on 5917 degrees of freedom
## Residual deviance: 6260.4 on 5914 degrees of freedom
## AIC: 6268.4
##
## Number of Fisher Scoring iterations: 4
logit_model$coefficients
## (Intercept) Price Driven..Kms. age
## -3.4467944180764 0.0000003575261 -0.0000012589911 0.2271827640100
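The coefficients are on the log-odds scale, so a predicted probability comes from applying the inverse logit to the linear predictor. The sketch below copies the estimates printed above and evaluates them for one car. The price and kilometers are from the first row of the data, while the age of 8 years is an assumed value for illustration.

```r
# Coefficient estimates copied from the model output above
b <- c(intercept = -3.4467944180764,
       price     =  0.0000003575261,
       driven    = -0.0000012589911,
       age       =  0.2271827640100)

# Example car: price and kilometers from row 1; age = 8 is an assumption
price <- 330399
driven <- 10674
age <- 8

# Linear predictor (log-odds), then inverse logit to get a probability
log_odds <- b[["intercept"]] + b[["price"]] * price +
  b[["driven"]] * driven + b[["age"]] * age
prob <- plogis(log_odds)
print(c(log_odds = log_odds, probability = prob))
```

With a fitted model object available, `predict(logit_model, newdata = ..., type = "response")` should give the same kind of result without copying numbers by hand.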
Price:
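The small positive coefficient for Price means that, holding the other predictors fixed, the log-odds of the car being a multiple-owner vehicle rise slightly as the price increases. Since the p-value (0.00146) is below 0.05, this effect is statistically significant. The coefficient looks tiny only because it is expressed per single unit of Price: an increase of 100,000 in Price raises the log-odds by about 0.036, which multiplies the odds by about 1.04. The practical effect is therefore modest rather than negligible.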
Driven Kms:
The negative coefficient for Driven (Kms) suggests that, holding price and age fixed, the log-odds of the car being a multiple-owner vehicle decrease as the kilometers driven increase. However, with a p-value of 0.11677 this effect is not statistically significant at the 0.05 level, so the data do not let us distinguish it from zero.
Age:
Age has a positive coefficient: each additional year raises the log-odds of multiple ownership by about 0.227, which multiplies the odds by about 1.26. The very small p-value (< 2e-16) indicates that this effect is highly statistically significant. The raw coefficients are not directly comparable in size, because Age is measured in years while Price and Driven (Kms) are measured in much finer units. The z value (16.891, against 3.183 for Price and -1.568 for Driven) is the better indication that Age carries the strongest evidence of an effect, with older cars being more likely to have multiple owners.
Age is the most significant predictor of multiple ownership status, with a clear positive effect.
Price has a small but statistically significant positive effect.
Driven (Kms), though negatively associated with multiple ownership, is not statistically significant at the 0.05 level, so it may not be a reliable predictor in this context.
Overall, older cars are more likely to have had multiple owners, and higher prices also show a slight association with multiple ownership.
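Exponentiating a logistic regression coefficient gives an odds ratio, which is often easier to read than a log-odds value. The sketch below reuses the estimates printed earlier. The step sizes in `step` (100,000 units of Price, 10,000 km, 1 year) are assumptions chosen only to put the predictors on more interpretable scales.

```r
# Coefficient estimates copied from the model output (intercept omitted)
coefs <- c(Price        =  0.0000003575261,
           Driven..Kms. = -0.0000012589911,
           age          =  0.2271827640100)

# Odds ratio for a one-unit increase in each predictor
or_per_unit <- exp(coefs)

# Odds ratios for larger, more interpretable steps (assumed step sizes):
# 100,000 units of Price, 10,000 km driven, 1 year of age
step <- c(Price = 1e5, Driven..Kms. = 1e4, age = 1)
or_per_step <- exp(coefs * step)
print(round(or_per_step, 3))
```

An odds ratio above 1 means higher odds of multiple ownership for that step, and a value below 1 means lower odds, in both cases holding the other predictors fixed.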
coef_data <- data.frame(
  Variable = names(coef(logit_model)),
  Estimate = coef(logit_model),
  StdError = summary(logit_model)$coefficients[, "Std. Error"]
)
# Calculating 95% confidence interval
z_value <- 1.96 # for 95% confidence level
coef_data <- coef_data |>
  mutate(
    LowerCI = Estimate - z_value * StdError,
    UpperCI = Estimate + z_value * StdError
  )
# Plot the coefficients with confidence intervals
ggplot(coef_data, aes(x = Variable, y = Estimate)) +
  geom_point(color = "blue") +
  geom_errorbar(aes(ymin = LowerCI, ymax = UpperCI), width = 0.2, color = "darkgray") +
  labs(
    title = "Logistic Regression Coefficients with 95% Confidence Intervals",
    x = "Predictor Variables",
    y = "Coefficient Estimate (Log-Odds)"
  ) +
  theme_minimal()
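The intervals computed by hand above (estimate plus or minus 1.96 standard errors) are Wald intervals. Base R's `confint.default()` returns the same kind of interval directly. The sketch below checks this on simulated data, since the Cars24 file is not available here; the data frame `sim` and its generating coefficients are assumptions for illustration only.

```r
set.seed(42)

# Simulated data, for illustration only (not the Cars24 data)
n <- 500
sim <- data.frame(age   = sample(1:15, n, replace = TRUE),
                  price = runif(n, 2e5, 1e6))
sim$y <- rbinom(n, 1, plogis(-3 + 0.2 * sim$age + 3e-7 * sim$price))

fit <- glm(y ~ age + price, data = sim, family = binomial(link = "logit"))

# Manual Wald interval, as in the text but with the exact normal quantile
est <- coef(fit)
se <- sqrt(diag(vcov(fit)))
z <- qnorm(0.975) # approximately 1.96
manual_ci <- cbind(lower = est - z * se, upper = est + z * se)

# Base R equivalent: confint.default() returns Wald intervals
wald_ci <- confint.default(fit, level = 0.95)
print(cbind(manual_ci, wald_ci))
```

For a `glm`, `confint(fit)` returns profile-likelihood intervals instead, which can differ slightly from the Wald intervals, especially in small samples.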
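Intercept:
The intercept is negative (about -3.45) with a CI that does not include zero. It is the log-odds of multiple ownership when Price, Driven Kms, and Age are all zero, so it is not meaningful to interpret on its own.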
Age:
The coefficient for Age is positive and has a CI that does not include zero, indicating a significant positive relationship with multiple ownership.
Interpretation: As the car’s age increases, the log-odds of it being a multiple-owner vehicle also increase. This is consistent with the idea that older cars are more likely to have had multiple owners.
Driven Kms:
The coefficient for Driven Kms is negative, but its CI includes zero, suggesting that this variable is not statistically significant in predicting multiple ownership.
Interpretation: There isn’t strong evidence from this model that the kilometers driven by a car significantly impact the likelihood of it being a multiple-owner vehicle.
Price:
The coefficient for Price is positive, with a CI that does not include zero, indicating a significant positive relationship with multiple ownership.
Interpretation: Higher prices are associated with higher log-odds of a car being a multiple-owner vehicle, holding age and kilometers driven fixed. The model does not explain why. One possibility is that higher-priced cars have features or a history that attract multiple owners.
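This plot suggests that Age and Price are significant predictors of a car being a multiple-owner vehicle, while Driven Kms does not appear to have a significant effect. Because the Price and Driven Kms coefficients are of order 1e-7 to 1e-6 while the intercept is about -3.4, their error bars are barely visible on a shared axis, so it is worth checking the numeric intervals as well. With that caveat, the visualization helps to identify which predictors contribute meaningfully to the model and which do not.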
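Since the raw coefficients differ by several orders of magnitude, one way to make them comparable on a single axis is to standardize the predictors before fitting, so each slope is per one standard deviation. The sketch below illustrates the idea on simulated data; `sim` and its generating values are assumptions, not the Cars24 data.

```r
set.seed(123)

# Simulated data, for illustration only (not the Cars24 data)
n <- 1000
sim <- data.frame(age   = sample(1:15, n, replace = TRUE),
                  price = runif(n, 2e5, 1e6))
sim$y <- rbinom(n, 1, plogis(-3 + 0.2 * sim$age + 3e-7 * sim$price))

# Fit on the raw scale
fit_raw <- glm(y ~ age + price, data = sim, family = binomial)

# Standardize predictors: subtract the mean, divide by the SD
sim_std <- transform(sim,
                     age   = as.numeric(scale(age)),
                     price = as.numeric(scale(price)))
fit_std <- glm(y ~ age + price, data = sim_std, family = binomial)

# A standardized slope equals the raw slope times the predictor's SD
raw_times_sd <- coef(fit_raw)[c("age", "price")] *
  c(sd(sim$age), sd(sim$price))
print(rbind(standardized = coef(fit_std)[c("age", "price")],
            raw_times_sd = raw_times_sd))
```

Standardizing changes only the units of the slopes, not the fit or the p-values, so the standardized estimates could be passed to the same coefficient plot (with the intercept dropped) to get error bars of comparable size.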