Week 10 Assignment

Loading necessary libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(ggplot2)
options(scipen = 6)
theme_set(theme_minimal())

Importing the cars24 data

cars24 <- read.csv('Cars24.csv')

head(cars24)
##   Car.Brand          Model  Price Model.Year  Location   Fuel Driven..Kms.
## 1   Hyundai    EonERA PLUS 330399       2016 Hyderabad Petrol        10674
## 2    Maruti Wagon R 1.0LXI 350199       2011 Hyderabad Petrol        20979
## 3    Maruti    Alto K10LXI 229199       2011 Hyderabad Petrol        47330
## 4    Maruti  RitzVXI BS IV 306399       2011 Hyderabad Petrol        19662
## 5      Tata  NanoTWIST XTA 208699       2015 Hyderabad Petrol        11256
## 6    Maruti        AltoLXI 249699       2012 Hyderabad Petrol        28434
##        Gear Ownership EMI..monthly.
## 1    Manual         2          7350
## 2    Manual         1          7790
## 3    Manual         2          5098
## 4    Manual         1          6816
## 5 Automatic         1          4642
## 6    Manual         1          5554

Convert Ownership to a Binary Variable

We’ll derive a binary outcome variable from Ownership:

multiple_owner = 1 if the car has had more than one owner, and multiple_owner = 0 otherwise.

cars24 <- cars24 |>
  mutate(multiple_owner = ifelse(Ownership > 1, 1, 0))

Calculating age using Model Year column

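# Age in years relative to the date the script is run, so values shift if it is rerun in a later year
cars24 <- cars24 |>
  mutate(age = year(now()) - Model.Year)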

Selecting relevant columns for modeling

model_data <- cars24 |>
  select(multiple_owner, Price, Driven..Kms., age) |>
  drop_na()  # removing the rows with missing data

Building a Logistic Regression model

# Fitting logistic regression model
logit_model <- glm(multiple_owner ~ Price + Driven..Kms. + age, 
                   data = model_data, family = binomial(link = 'logit'))

# Summarizing the model
summary(logit_model)
## 
## Call:
## glm(formula = multiple_owner ~ Price + Driven..Kms. + age, family = binomial(link = "logit"), 
##     data = model_data)
## 
## Coefficients:
##                   Estimate    Std. Error z value Pr(>|z|)    
## (Intercept)  -3.4467944181  0.1644170045 -20.964  < 2e-16 ***
## Price         0.0000003575  0.0000001123   3.183  0.00146 ** 
## Driven..Kms. -0.0000012590  0.0000008027  -1.568  0.11677    
## age           0.2271827640  0.0134496308  16.891  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6634.8  on 5917  degrees of freedom
## Residual deviance: 6260.4  on 5914  degrees of freedom
## AIC: 6268.4
## 
## Number of Fisher Scoring iterations: 4
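
The deviance figures printed in the summary allow a quick check of overall fit. The following is a minimal sketch that uses only the numbers shown in the output above (nothing is refit): the drop from null to residual deviance is compared against a chi-squared distribution, and the AIC is recovered from the residual deviance. The variable names are illustrative.

```r
# Values copied from the summary() output above
null_dev  <- 6634.8; null_df  <- 5917
resid_dev <- 6260.4; resid_df <- 5914

# Likelihood-ratio (deviance) test of the fitted model against the intercept-only model
lr_stat <- null_dev - resid_dev   # drop in deviance
lr_df   <- null_df - resid_df     # number of predictors added
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# For a 0/1 response, AIC = residual deviance + 2 * (number of parameters)
n_params <- 4                     # intercept + 3 slopes
aic <- resid_dev + 2 * n_params

print(c(lr_stat = lr_stat, lr_df = lr_df, aic = aic))
print(lr_p)
```

A very small p-value here indicates that the three predictors together improve on the intercept-only model; it says nothing about how well individual cars are classified.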

Interpreting the coefficients

logit_model$coefficients
##      (Intercept)            Price     Driven..Kms.              age 
## -3.4467944180764  0.0000003575261 -0.0000012589911  0.2271827640100

Analysis of Each Coefficient

Price:

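The positive coefficient for Price means that, holding kilometers driven and age constant, a higher price is associated with higher log-odds of the car being a multiple-owner vehicle. Since the p-value (0.00146) is below 0.05, this effect is statistically significant. The coefficient looks tiny only because it is expressed per single unit of Price, while prices in this data run into the hundreds of thousands: an increase of 100,000 in Price raises the log-odds by about 0.036, which multiplies the odds by roughly 1.04. The effect is modest but not negligible.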
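
To make the Price coefficient easier to read, it can be rescaled to a larger step and exponentiated into an odds ratio. This is a minimal sketch using the estimate and standard error printed in the summary; the step of 100,000 price units is an illustrative choice, not something fixed by the analysis.

```r
# Coefficient and standard error for Price, copied from the summary() output
b_price  <- 0.0000003575
se_price <- 0.0000001123

# Rescale from "per 1 price unit" to "per 100,000 price units" (illustrative step)
step <- 100000
log_odds_change <- b_price * step
odds_ratio      <- exp(log_odds_change)

# Wald 95% interval for the odds ratio on the same scale
z  <- qnorm(0.975)
ci <- exp((b_price + c(-1, 1) * z * se_price) * step)

print(round(c(log_odds = log_odds_change, odds_ratio = odds_ratio,
              lower = ci[1], upper = ci[2]), 4))
```

An odds ratio above 1 with an interval that excludes 1 corresponds to the significant positive coefficient reported in the summary.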

Driven Kms:

The negative coefficient for Driven (Kms) suggests that, holding price and age constant, the log-odds of the car being a multiple-owner vehicle decrease as the kilometers driven increase. However, with a p-value of 0.11677, this effect is not statistically significant at the 0.05 level, so the data do not distinguish it from no effect.

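Age:

Age has a positive coefficient, which means that, holding price and kilometers driven constant, the probability of a car being a multiple-owner vehicle increases with its age. The very small p-value (< 2e-16) indicates that this effect is highly statistically significant. Each additional year raises the log-odds by about 0.227, which multiplies the odds by roughly 1.25. The raw coefficient should not be compared directly with those of Price and Driven (Kms), because each is measured per unit of its own predictor (years versus price units versus kilometers); the z value (16.9, the largest of the three) is the better indication that Age carries the strongest evidence of an association.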
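
Log-odds are hard to picture, so it can help to convert them into predicted probabilities. The sketch below plugs the printed coefficients into the inverse logit for a hypothetical reference car; the price of 300,000, the 20,000 km, and the three ages are assumptions chosen for illustration, not values from the analysis.

```r
# Coefficients copied from the summary() output
coefs <- c(intercept = -3.4467944181, price = 0.0000003575,
           km = -0.0000012590, age = 0.2271827640)

# Hypothetical reference car (illustrative values only)
ref_price <- 300000
ref_km    <- 20000
ages      <- c(5, 10, 15)

# Linear predictor (log-odds) at each age, then inverse logit to get probabilities
log_odds <- coefs[["intercept"]] + coefs[["price"]] * ref_price +
  coefs[["km"]] * ref_km + coefs[["age"]] * ages
prob <- plogis(log_odds)

# Multiplicative change in the odds for each additional year of age
odds_ratio_per_year <- exp(coefs[["age"]])

print(data.frame(age = ages, probability = round(prob, 3)))
print(odds_ratio_per_year)
```

With the fitted object available, `predict(logit_model, newdata = ..., type = "response")` would give the same probabilities without copying coefficients by hand.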

Summary

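  • Age is the predictor with the strongest evidence of an association with multiple ownership (z = 16.9), with each additional year multiplying the odds by roughly 1.25.

  • Price has a statistically significant positive effect; its coefficient is small only because it is measured per single unit of price.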

  • Driven (Kms), though negatively associated with multiple ownership, is not statistically significant at the 0.05 level, suggesting that it may not be a reliable predictor in this context.

  • Overall, older cars are more likely to have had multiple owners, and higher prices also show a slight association with multiple ownership.

Building Confidence Intervals

coef_data <- data.frame(
  Variable = names(coef(logit_model)),
  Estimate = coef(logit_model),
  StdError = summary(logit_model)$coefficients[, "Std. Error"]
)

# Calculating 95% confidence interval
z_value <- qnorm(0.975)  # two-sided 95% critical value, approximately 1.96

coef_data <- coef_data |>
  mutate(
    LowerCI = Estimate - z_value * StdError,
    UpperCI = Estimate + z_value * StdError
  )

# Plot the coefficients with confidence intervals
ggplot(coef_data, aes(x = Variable, y = Estimate)) +
  geom_point(color = "blue") +  
  geom_errorbar(aes(ymin = LowerCI, ymax = UpperCI), width = 0.2, color = "darkgray") + 
  labs(
    title = "Logistic Regression Coefficients with 95% Confidence Intervals",
    x = "Predictor Variables",
    y = "Coefficient Estimate (Log-Odds)"
  ) +
  theme_minimal()
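
Because the coefficients differ by several orders of magnitude, the numeric intervals are worth printing alongside the plot. This is a minimal sketch that rebuilds the Wald intervals from the estimates and standard errors in the summary output; the `ExcludesZero` and `OddsRatio` columns are additions for illustration.

```r
# Estimates and standard errors copied from the summary() output
est <- c("(Intercept)" = -3.4467944181, Price = 0.0000003575,
         "Driven..Kms." = -0.0000012590, age = 0.2271827640)
se  <- c(0.1644170045, 0.0000001123, 0.0000008027, 0.0134496308)

z <- qnorm(0.975)  # two-sided 95% critical value
ci <- data.frame(
  Variable = names(est),
  Estimate = unname(est),
  LowerCI  = unname(est - z * se),
  UpperCI  = unname(est + z * se)
)

# TRUE when the whole interval lies on one side of zero
ci$ExcludesZero <- ci$LowerCI > 0 | ci$UpperCI < 0

# Odds ratio per one unit of each predictor
ci$OddsRatio <- exp(ci$Estimate)

print(ci, digits = 4)
```

On the fitted object itself, `confint.default(logit_model)` computes the same Wald intervals.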

Analysis of Each Predictor

  1. Intercept:

    • The intercept is significantly negative with a CI that does not include zero. It is the log-odds of multiple ownership when Price, Driven Kms, and age are all zero, a combination that does not describe any real car, so it mainly anchors the model and has little practical interpretation on its own.

  2. Age:

    • The coefficient for Age is positive and has a CI that does not include zero, indicating a significant positive relationship with multiple ownership.

    • Interpretation: As the car’s age increases, the log-odds of it being a multiple-owner vehicle also increase. This is consistent with the idea that older cars are more likely to have had multiple owners.

  3. Driven Kms:

    • The coefficient for Driven Kms is negative, but its CI includes zero, suggesting that this variable is not statistically significant in predicting multiple ownership.

    • Interpretation: There isn’t strong evidence from this model that the kilometers driven by a car significantly impact the likelihood of it being a multiple-owner vehicle.

  4. Price:

    • The coefficient for Price is positive, with a CI that does not include zero, indicating a significant positive relationship with multiple ownership.

    • Interpretation: Higher prices are associated with higher log-odds of a car being a multiple-owner vehicle, holding age and kilometers driven constant. The model does not show why; one possible, untested explanation is that higher-priced cars have features or a history that make them more likely to change hands.

Summary

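The confidence intervals indicate that Age and Price are significant predictors of a car being a multiple-owner vehicle, while Driven Kms does not appear to have a significant effect. Because all four coefficients share one axis, the intervals for Price and Driven Kms are far too narrow to be seen in the plot, so their significance is best read from the computed LowerCI and UpperCI values rather than from the picture.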
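
One way to make every interval visible is to give each coefficient its own panel with a free axis. The sketch below is a variation on the plot above, not the original figure; it rebuilds the table from the printed summary values so that it runs on its own, and the name `coef_tbl` is hypothetical.

```r
library(ggplot2)

# Estimates and standard errors copied from the summary() output
coef_tbl <- data.frame(
  Variable = c("(Intercept)", "Price", "Driven..Kms.", "age"),
  Estimate = c(-3.4467944181, 0.0000003575, -0.0000012590, 0.2271827640),
  StdError = c(0.1644170045, 0.0000001123, 0.0000008027, 0.0134496308)
)

z <- qnorm(0.975)
coef_tbl$LowerCI <- coef_tbl$Estimate - z * coef_tbl$StdError
coef_tbl$UpperCI <- coef_tbl$Estimate + z * coef_tbl$StdError

# One panel per coefficient; the dashed line marks zero in every panel
p <- ggplot(coef_tbl, aes(x = Variable, y = Estimate)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_errorbar(aes(ymin = LowerCI, ymax = UpperCI), width = 0.2, color = "darkgray") +
  geom_point(color = "blue") +
  facet_wrap(~ Variable, scales = "free") +
  labs(
    title = "Logistic Regression Coefficients with 95% Confidence Intervals",
    x = NULL,
    y = "Coefficient Estimate (Log-Odds)"
  )

print(p)
```

With free scales the panels can no longer be compared by eye for magnitude, which is appropriate here because the coefficients are measured in different units.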