Sarah Troyep
2025-12-05
For this project, I decided to look at a dataset that would analyze a car to understand how the different features relate to the cost of the vehicle. This dataset interested me a lot because it will be good to look at how the design of the car will affect the cost.
Question 1: which car feature have the strongest relationship with the car price?
Question 2: Can we build a simple model to see what the car price using these features?
I use the “car features and MSrP dataset that is available on this website https://www.kaggle.com/datasets/CooperUnion/cardataset
The dataset now includes the following cleaned columns:
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Rows: 11914 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Make, Model, Engine Fuel Type, Transmission Type, Driven_Wheels, Ma...
## dbl (8): Year, Engine HP, Engine Cylinders, Number of Doors, highway MPG, ci...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 11,914
## Columns: 16
## $ Make <chr> "BMW", "BMW", "BMW", "BMW", "BMW", "BMW", "BMW", "…
## $ Model <chr> "1 Series M", "1 Series", "1 Series", "1 Series", …
## $ Year <dbl> 2011, 2011, 2011, 2011, 2011, 2012, 2012, 2012, 20…
## $ `Engine Fuel Type` <chr> "premium unleaded (required)", "premium unleaded (…
## $ `Engine HP` <dbl> 335, 300, 300, 230, 230, 230, 300, 300, 230, 230, …
## $ `Engine Cylinders` <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ `Transmission Type` <chr> "MANUAL", "MANUAL", "MANUAL", "MANUAL", "MANUAL", …
## $ Driven_Wheels <chr> "rear wheel drive", "rear wheel drive", "rear whee…
## $ `Number of Doors` <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ `Market Category` <chr> "Factory Tuner,Luxury,High-Performance", "Luxury,P…
## $ `Vehicle Size` <chr> "Compact", "Compact", "Compact", "Compact", "Compa…
## $ `Vehicle Style` <chr> "Coupe", "Convertible", "Coupe", "Coupe", "Convert…
## $ `highway MPG` <dbl> 26, 28, 28, 28, 28, 28, 26, 28, 28, 27, 28, 28, 28…
## $ `city mpg` <dbl> 19, 19, 20, 18, 18, 18, 17, 20, 18, 18, 20, 19, 19…
## $ Popularity <dbl> 3916, 3916, 3916, 3916, 3916, 3916, 3916, 3916, 39…
## $ MSRP <dbl> 46135, 40650, 36350, 29450, 34500, 31200, 44100, 3…
## spc_tbl_ [11,914 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Make : chr [1:11914] "BMW" "BMW" "BMW" "BMW" ...
## $ Model : chr [1:11914] "1 Series M" "1 Series" "1 Series" "1 Series" ...
## $ Year : num [1:11914] 2011 2011 2011 2011 2011 ...
## $ Engine Fuel Type : chr [1:11914] "premium unleaded (required)" "premium unleaded (required)" "premium unleaded (required)" "premium unleaded (required)" ...
## $ Engine HP : num [1:11914] 335 300 300 230 230 230 300 300 230 230 ...
## $ Engine Cylinders : num [1:11914] 6 6 6 6 6 6 6 6 6 6 ...
## $ Transmission Type: chr [1:11914] "MANUAL" "MANUAL" "MANUAL" "MANUAL" ...
## $ Driven_Wheels : chr [1:11914] "rear wheel drive" "rear wheel drive" "rear wheel drive" "rear wheel drive" ...
## $ Number of Doors : num [1:11914] 2 2 2 2 2 2 2 2 2 2 ...
## $ Market Category : chr [1:11914] "Factory Tuner,Luxury,High-Performance" "Luxury,Performance" "Luxury,High-Performance" "Luxury,Performance" ...
## $ Vehicle Size : chr [1:11914] "Compact" "Compact" "Compact" "Compact" ...
## $ Vehicle Style : chr [1:11914] "Coupe" "Convertible" "Coupe" "Coupe" ...
## $ highway MPG : num [1:11914] 26 28 28 28 28 28 26 28 28 27 ...
## $ city mpg : num [1:11914] 19 19 20 18 18 18 17 20 18 18 ...
## $ Popularity : num [1:11914] 3916 3916 3916 3916 3916 ...
## $ MSRP : num [1:11914] 46135 40650 36350 29450 34500 ...
## - attr(*, "spec")=
## .. cols(
## .. Make = col_character(),
## .. Model = col_character(),
## .. Year = col_double(),
## .. `Engine Fuel Type` = col_character(),
## .. `Engine HP` = col_double(),
## .. `Engine Cylinders` = col_double(),
## .. `Transmission Type` = col_character(),
## .. Driven_Wheels = col_character(),
## .. `Number of Doors` = col_double(),
## .. `Market Category` = col_character(),
## .. `Vehicle Size` = col_character(),
## .. `Vehicle Style` = col_character(),
## .. `highway MPG` = col_double(),
## .. `city mpg` = col_double(),
## .. Popularity = col_double(),
## .. MSRP = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## Make Model Year Engine Fuel Type
## Length:11914 Length:11914 Min. :1990 Length:11914
## Class :character Class :character 1st Qu.:2007 Class :character
## Mode :character Mode :character Median :2015 Mode :character
## Mean :2010
## 3rd Qu.:2016
## Max. :2017
##
## Engine HP Engine Cylinders Transmission Type Driven_Wheels
## Min. : 55.0 Min. : 0.000 Length:11914 Length:11914
## 1st Qu.: 170.0 1st Qu.: 4.000 Class :character Class :character
## Median : 227.0 Median : 6.000 Mode :character Mode :character
## Mean : 249.4 Mean : 5.629
## 3rd Qu.: 300.0 3rd Qu.: 6.000
## Max. :1001.0 Max. :16.000
## NA's :69 NA's :30
## Number of Doors Market Category Vehicle Size Vehicle Style
## Min. :2.000 Length:11914 Length:11914 Length:11914
## 1st Qu.:2.000 Class :character Class :character Class :character
## Median :4.000 Mode :character Mode :character Mode :character
## Mean :3.436
## 3rd Qu.:4.000
## Max. :4.000
## NA's :6
## highway MPG city mpg Popularity MSRP
## Min. : 12.00 Min. : 7.00 Min. : 2 Min. : 2000
## 1st Qu.: 22.00 1st Qu.: 16.00 1st Qu.: 549 1st Qu.: 21000
## Median : 26.00 Median : 18.00 Median :1385 Median : 29995
## Mean : 26.64 Mean : 19.73 Mean :1555 Mean : 40595
## 3rd Qu.: 30.00 3rd Qu.: 22.00 3rd Qu.:2009 3rd Qu.: 42231
## Max. :354.00 Max. :137.00 Max. :5657 Max. :2065902
##
## [1] "Make" "Model" "Year"
## [4] "Engine Fuel Type" "Engine HP" "Engine Cylinders"
## [7] "Transmission Type" "Driven_Wheels" "Number of Doors"
## [10] "Market Category" "Vehicle Size" "Vehicle Style"
## [13] "highway MPG" "city mpg" "Popularity"
## [16] "MSRP"
library(dplyr)
df <- df %>% select(
Make,
Model,
Year,
Horsepower = `Engine HP`,
Cylinders = `Engine Cylinders`,
`Transmission Type`,
Driven_Wheels,
MPG_Highway = `highway MPG`,
MPG_City = `city mpg`,
MSRP
)
df <- df %>% filter(!is.na(MSRP), !is.na(Horsepower), !is.na(Cylinders))
head(df)## # A tibble: 6 × 10
## Make Model Year Horsepower Cylinders `Transmission Type` Driven_Wheels
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 BMW 1 Series M 2011 335 6 MANUAL rear wheel dr…
## 2 BMW 1 Series 2011 300 6 MANUAL rear wheel dr…
## 3 BMW 1 Series 2011 300 6 MANUAL rear wheel dr…
## 4 BMW 1 Series 2011 230 6 MANUAL rear wheel dr…
## 5 BMW 1 Series 2011 230 6 MANUAL rear wheel dr…
## 6 BMW 1 Series 2012 230 6 MANUAL rear wheel dr…
## # ℹ 3 more variables: MPG_Highway <dbl>, MPG_City <dbl>, MSRP <dbl>
library(ggplot2)
ggplot(df,aes(x= Horsepower, y= MSRP)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
labs(title = "Engine HP vs MSRP")## `geom_smooth()` using formula = 'y ~ x'
##
## Call:
## lm(formula = MSRP ~ Horsepower + Cylinders + MPG_Highway + Driven_Wheels,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -271484 -18056 859 12788 1709884
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.015e+05 3.294e+03 -30.825 <2e-16 ***
## Horsepower 3.179e+02 6.357e+00 49.997 <2e-16 ***
## Cylinders 7.944e+03 4.461e+02 17.810 <2e-16 ***
## MPG_Highway 8.769e+02 7.576e+01 11.575 <2e-16 ***
## Driven_Wheelsfour wheel drive -2.075e+04 1.573e+03 -13.196 <2e-16 ***
## Driven_Wheelsfront wheel drive 2.328e+03 1.241e+03 1.875 0.0608 .
## Driven_Wheelsrear wheel drive -1.284e+04 1.235e+03 -10.394 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43930 on 11809 degrees of freedom
## Multiple R-squared: 0.4698, Adjusted R-squared: 0.4696
## F-statistic: 1744 on 6 and 11809 DF, p-value: < 2.2e-16
clust_df <- df %>% select(Horsepower, Cylinders, MPG_Highway) %>% scale()
k3 <- kmeans(clust_df, centers = 3)
df$cluster <- factor(k3$cluster)
ggplot(df, aes(x = Horsepower, y = MPG_Highway, color = cluster)) +
geom_point(alpha = 0.5) +
labs(title = "Simple Car Groups Based on Features")