Week 10

Importing

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

options(scipen = 6)
theme_set(theme_minimal())

df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")
# Convert prices to USD and metric measurements to US customary units
df <- df |>
  mutate(
    years_since = year(now()) - Year,
    PriceUSD    = Price * 0.012,
    Miles       = Kilometer * 0.621371,
    LengthInch  = Length * 0.0393701,
    WidthInch   = Width * 0.0393701,
    HeightInch  = Height * 0.0393701,
    FuelGallons = Fuel.Tank.Capacity * 0.264172,
    Volume      = LengthInch * WidthInch * HeightInch
  )
# Clean up the Owner attribute
df$Owner[df$Owner == "Fourth"] <- "4 or More"
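
As a quick sanity check on the conversion factors, the same arithmetic can be run on a single made-up vehicle. This is only a sketch: the values are invented, and it assumes the dataset records dimensions in millimetres and tank capacity in litres, which is what the factors 0.0393701 and 0.264172 imply. It also shows that Volume is a bounding-box volume (length × width × height) in cubic inches.

```r
# Hypothetical vehicle; every value here is invented for illustration
toy <- data.frame(
  Kilometer = 100000,                          # odometer reading in km
  Length = 4000, Width = 1700, Height = 1500,  # assumed to be millimetres
  Fuel.Tank.Capacity = 40                      # assumed to be litres
)

mm_to_inch <- 0.0393701  # same factor as in the import step

toy$Miles       <- toy$Kilometer * 0.621371
toy$LengthInch  <- toy$Length * mm_to_inch
toy$WidthInch   <- toy$Width * mm_to_inch
toy$HeightInch  <- toy$Height * mm_to_inch
toy$FuelGallons <- toy$Fuel.Tank.Capacity * 0.264172
# Bounding-box volume in cubic inches
toy$Volume      <- toy$LengthInch * toy$WidthInch * toy$HeightInch

# Cross-check: cubic inches back to cubic metres (1 inch = 0.0254 m);
# this should be close to 4.0 * 1.7 * 1.5 cubic metres
toy$Volume * 0.0254^3
```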

Selecting Variables

I’ll keep things simple and model the Transmission variable. Some buyers still look for a manual transmission, even though most drivers now use an automatic.

Since Price is strongly associated with transmission type, it is a natural explanatory variable. Volume is another candidate, since I’d like to see whether bigger or smaller vehicles are more likely to be manual; the model below starts with price alone.
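
One way to check an association like this before modeling is to compare prices across the two transmission groups. The sketch below uses a handful of invented prices rather than the car data, with IsManual coded 1 for manual and 0 for automatic; the column names simply mirror the ones used in this report.

```r
# Invented prices and transmission flags, for illustration only
toy <- data.frame(
  PriceUSD = c(5000, 7000, 9000, 12000, 15000, 20000, 30000, 45000),
  IsManual = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Point-biserial correlation: Pearson correlation with a 0/1 variable
r <- cor(toy$PriceUSD, toy$IsManual)

# Median price within each transmission group (names "0" and "1")
group_medians <- tapply(toy$PriceUSD, toy$IsManual, median)

r
group_medians
```

A negative correlation together with a lower median price in the manual group would point in the same direction as the claim above.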

Logistic Regression Model

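# Drop unregistered cars, then encode the response as 1 = manual, 0 = automatic
df_owner <- subset(df, Owner != "UnRegistered Car")
df_owner <- df_owner |> mutate(IsManual = ifelse(Transmission == "Manual", 1, 0))
# Treat infinite values as missing
df_owner[is.na(df_owner) | df_owner == "Inf"] <- NA
# Note: lm() on a 0/1 response fits a linear probability model
model <- lm(IsManual ~ PriceUSD, data = df_owner)
summary(model)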

## 
## Call:
## lm(formula = IsManual ~ PriceUSD, data = df_owner)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7020 -0.4799  0.3065  0.3396  2.7901 
## 
## Coefficients:
##                  Estimate    Std. Error t value Pr(>|t|)    
## (Intercept)  0.7206000536  0.0118534607   60.79   <2e-16 ***
## PriceUSD    -0.0000083588  0.0000003418  -24.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4373 on 2036 degrees of freedom
## Multiple R-squared:  0.2271, Adjusted R-squared:  0.2267 
## F-statistic: 598.2 on 1 and 2036 DF,  p-value: < 2.2e-16

model$coefficients

##     (Intercept)        PriceUSD 
##  0.720600053631 -0.000008358762

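The PriceUSD coefficient is very small, but it is statistically significant: its p-value is below 2e-16 and its standard error is small relative to the estimate (t = -24.46). The small magnitude is a matter of scale. The response only ranges from 0 to 1, while prices run into the tens of thousands of dollars, so the change per dollar has to be tiny; per $1,000, the coefficient corresponds to a drop of about 0.008 in the fitted value of IsManual. Note that lm() fits a straight line to the 0/1 response (a linear probability model) rather than a logistic curve, so fitted values are not confined to the range 0 to 1; the maximum residual of 2.79 means at least one fitted value is negative. From the sign of the coefficient, I can conclude that higher-priced cars tend to have automatic transmissions.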
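
For comparison, a logistic regression proper would be fit with glm() and family = binomial rather than lm(). The sketch below is not a refit of the model above: it runs on simulated data with an assumed price effect, and the names sim and PriceK (price in thousands of USD) are made up for illustration, so its coefficients are not comparable to the output above.

```r
set.seed(42)  # reproducible simulation

# Simulated data, NOT the car dataset: price in thousands of USD
n <- 500
sim <- data.frame(PriceK = runif(n, min = 2, max = 60))
# Assumed true relationship: log-odds of manual fall linearly with price
sim$IsManual <- rbinom(n, size = 1, prob = plogis(1.5 - 0.08 * sim$PriceK))

# Logistic regression: binomial family with the default logit link
fit <- glm(IsManual ~ PriceK, data = sim, family = binomial)

coef(fit)        # coefficients are on the log-odds scale
exp(coef(fit))   # exponentiate to read them as odds ratios

# Predicted probabilities always lie strictly between 0 and 1
p_hat <- predict(fit, newdata = data.frame(PriceK = c(5, 30, 60)),
                 type = "response")
p_hat
```

Unlike the linear fit, the predictions here stay inside the unit interval no matter how high the price goes, at the cost of coefficients that have to be read as changes in log-odds.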

Building the Confidence Interval

Next, I’ll build a confidence interval for the PriceUSD coefficient. As before, I will use an 80% confidence level.

confint(model, level = 0.80)

##                       10 %            90 %
## (Intercept)  0.70540430214  0.735795805124
## PriceUSD    -0.00000879688 -0.000007920644

# Manual check: estimate +/- t quantile * standard error
model_sum <- summary(model)

c("lower" = model_sum$coefficients[2, 1] - qt(0.90, df = model_sum$df[2]) * model_sum$coefficients[2, 2],
  "upper" = model_sum$coefficients[2, 1] + qt(0.90, df = model_sum$df[2]) * model_sum$coefficients[2, 2])

##           lower           upper 
## -0.000008796880 -0.000007920644

The manual “sanity check” uses the 0.90 quantile because an 80% two-sided interval leaves 10% in each tail, so its result matches confint(). It’s a bit counterintuitive, but it follows from how the interval is defined. PriceUSD has a lower bound of about \(-8.80 \times 10^{-6}\) and an upper bound of about \(-7.92 \times 10^{-6}\). The bounds are very small in absolute terms, but the whole interval lies below zero, which, as mentioned before, suggests a negative association between selling price and having a manual transmission. Since cars with automatic transmissions tend to sell for more than those with manual transmissions, this is not surprising, but it’s always good to have rigor.
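
The link between the confidence level and the quantile can be made explicit: for level \(1 - \alpha\), the interval is the estimate plus or minus the \(1 - \alpha/2\) quantile of the t distribution times the standard error. The sketch below redoes that arithmetic using the rounded numbers printed in the summary output above, so it should agree with confint() only up to rounding; the variable names are chosen here for illustration.

```r
# Values copied from the summary(model) output above (rounded as printed)
est      <- -0.0000083588   # PriceUSD estimate
se       <-  0.0000003418   # its standard error
df_resid <- 2036            # residual degrees of freedom

level  <- 0.80
alpha  <- 1 - level
t_crit <- qt(1 - alpha / 2, df = df_resid)  # 0.90 quantile for an 80% interval

ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```

Changing level to 0.95 would move the quantile to 0.975 and widen the interval accordingly.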

Wyatt Van Dyke

2024-11-03
