library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
options(scipen = 6)
theme_set(theme_minimal())
df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")
# Convert prices and metric measurements to US units
df <- df |>
  mutate(
    years_since = year(now()) - Year,             # vehicle age in years
    PriceUSD    = Price * 0.012,                  # price to USD at a fixed 0.012 rate
    Miles       = Kilometer * 0.621371,           # kilometers to miles
    LengthInch  = Length * 0.0393701,             # millimeters to inches
    WidthInch   = Width * 0.0393701,
    HeightInch  = Height * 0.0393701,
    FuelGallons = Fuel.Tank.Capacity * 0.264172,  # liters to US gallons
    Volume      = LengthInch * WidthInch * HeightInch  # bounding-box volume, cubic inches
  )
#Cleaning up Owner attribute
df['Owner'][df['Owner'] == 'Fourth'] <- '4 or More'
I’ll keep things simple and build a model for the Transmission variable. Some buyers still look specifically for a manual transmission, even though practically everyone drives an automatic.
Since Price is strongly correlated with transmission type, I think I can comfortably use it as an explanatory variable. Volume would also be worth including, since I’d like to see whether bigger or smaller vehicles are more likely to be manual.
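# Drop unregistered cars, then encode the response as 1 = manual, 0 = anything else
df_owner <- subset(df, Owner != "UnRegistered Car")
df_owner <- df_owner |> mutate(IsManual = ifelse(Transmission == "Manual", 1, 0))
# Treat infinite values as missing
df_owner[is.na(df_owner) | df_owner == "Inf"] <- NA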
model <- lm(IsManual ~ PriceUSD, data = df_owner)
summary(model)
##
## Call:
## lm(formula = IsManual ~ PriceUSD, data = df_owner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7020 -0.4799 0.3065 0.3396 2.7901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7206000536 0.0118534607 60.79 <2e-16 ***
## PriceUSD -0.0000083588 0.0000003418 -24.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4373 on 2036 degrees of freedom
## Multiple R-squared: 0.2271, Adjusted R-squared: 0.2267
## F-statistic: 598.2 on 1 and 2036 DF, p-value: < 2.2e-16
model$coefficients
## (Intercept) PriceUSD
## 0.720600053631 -0.000008358762
The PriceUSD coefficient is very small, but it is still statistically significant, judging by its Pr(>|t|) value and its small standard error. The scale makes sense: the binary response tops out at 1 while prices run into the tens of thousands of dollars, so the change per dollar has to be tiny. The negative sign suggests that higher-priced cars tend to have automatic transmissions.
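To make the size of the coefficient easier to read, here is a minimal sketch that rescales it to a change per 10,000 USD and evaluates the fitted line at a few prices. The coefficients are copied from the output above. The three prices are hypothetical values chosen only for illustration, and the names `b0`, `b1`, `b1_per_10k`, `prices`, `fitted_manual`, and `zero_crossing` are not part of the original analysis.

```r
# Coefficients copied from the model output above
b0 <- 0.720600053631    # intercept: fitted IsManual at a price of 0 USD
b1 <- -0.000008358762   # change in fitted IsManual per 1 USD

# Rescale the slope to a more readable unit: change per 10,000 USD
b1_per_10k <- b1 * 10000
b1_per_10k              # about -0.084, i.e. roughly 8.4 percentage points

# Fitted values at a few illustrative (hypothetical) prices
prices <- c(5000, 20000, 50000)
fitted_manual <- b0 + b1 * prices
round(fitted_manual, 3) # 0.679 0.553 0.303

# Price at which the fitted line reaches 0; beyond it the fit is negative
zero_crossing <- -b0 / b1
zero_crossing           # roughly 86,000 USD
```

Read this way, each additional 10,000 USD lowers the fitted share of manual cars by about 8 percentage points. The last line also shows a limitation of fitting a straight line to a 0/1 response: above roughly 86,000 USD the fitted value drops below 0, which is consistent with the maximum residual of 2.79 in the summary.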
Next, I’ll build a confidence interval for the PriceUSD coefficient. As before, I’ll use an 80% confidence level.
confint(model, level = 0.80)
## 10 % 90 %
## (Intercept) 0.70540430214 0.735795805124
## PriceUSD -0.00000879688 -0.000007920644
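# Sanity check: rebuild the 80% interval by hand as estimate ± t * SE
model_sum <- summary(model)
t_crit <- qt(0.90, df = model_sum$df[2])  # 80% two-sided interval uses the 0.90 quantile
c("lower" = model_sum$coefficients[2, 1] - t_crit * model_sum$coefficients[2, 2],
  "upper" = model_sum$coefficients[2, 1] + t_crit * model_sum$coefficients[2, 2])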
## lower upper
## -0.000008796880 -0.000007920644
The quantile in the “sanity check” is 0.90 because an 80% two-sided interval leaves 10% in each tail, so the upper bound sits at the 90th percentile of the t distribution. That is why the result matches confint()’s calculation, and why confint() labels its columns 10 % and 90 %. As we can see here, the PriceUSD coefficient has a lower bound of about \(-8.80 \times 10^{-6}\) and an upper bound of about \(-7.92 \times 10^{-6}\). Both are very small, but as mentioned before, the whole interval lies below zero, which suggests a negative relationship between selling price and having a manual transmission. Since automatic cars are pretty much always priced higher than manual ones, this seems like a no-brainer, but it’s always good to have rigor.
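As a worked version of the calculation above, the interval can be written out by hand. This is a sketch using the rounded estimate and standard error from the summary table. The critical value \(t_{0.90,\,2036} \approx 1.282\) is an approximation, and the symbols \(\hat\beta_1\), \(\alpha\), and \(\nu\) are notation introduced here rather than taken from the original.

\[
\hat\beta_1 \pm t_{1-\alpha/2,\;\nu}\,\mathrm{SE}(\hat\beta_1),
\qquad \alpha = 0.20,\quad 1 - \alpha/2 = 0.90,\quad \nu = 2036
\]

\[
-8.359 \times 10^{-6} \;\pm\; 1.282 \times \left(3.418 \times 10^{-7}\right)
\;\approx\; -8.359 \times 10^{-6} \;\pm\; 0.438 \times 10^{-6}
\]

\[
\Rightarrow\quad \left(-8.797 \times 10^{-6},\; -7.921 \times 10^{-6}\right)
\]

These bounds agree with the confint() output to the precision shown.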