Car Features and Price Analysis

Sarah Troyep

2025-12-05

Issue Description

For this project, I decided to look at a dataset that would analyze a car to understand how the different features relate to the cost of the vehicle. This dataset interested me a lot because it will be good to look at how the design of the car will affect the cost.

Questions I will look at in this presentation.

Question 1: which car feature have the strongest relationship with the car price?

Question 2: Can we build a simple model to see what the car price using these features?

Data Source

I use the “car features and MSrP dataset that is available on this website https://www.kaggle.com/datasets/CooperUnion/cardataset

Documentation

The dataset now includes the following cleaned columns:

Description of the data

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df<- read_csv("data.csv")
## Rows: 11914 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Make, Model, Engine Fuel Type, Transmission Type, Driven_Wheels, Ma...
## dbl (8): Year, Engine HP, Engine Cylinders, Number of Doors, highway MPG, ci...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# First Look
glimpse(df)
## Rows: 11,914
## Columns: 16
## $ Make                <chr> "BMW", "BMW", "BMW", "BMW", "BMW", "BMW", "BMW", "…
## $ Model               <chr> "1 Series M", "1 Series", "1 Series", "1 Series", …
## $ Year                <dbl> 2011, 2011, 2011, 2011, 2011, 2012, 2012, 2012, 20…
## $ `Engine Fuel Type`  <chr> "premium unleaded (required)", "premium unleaded (…
## $ `Engine HP`         <dbl> 335, 300, 300, 230, 230, 230, 300, 300, 230, 230, …
## $ `Engine Cylinders`  <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ `Transmission Type` <chr> "MANUAL", "MANUAL", "MANUAL", "MANUAL", "MANUAL", …
## $ Driven_Wheels       <chr> "rear wheel drive", "rear wheel drive", "rear whee…
## $ `Number of Doors`   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ `Market Category`   <chr> "Factory Tuner,Luxury,High-Performance", "Luxury,P…
## $ `Vehicle Size`      <chr> "Compact", "Compact", "Compact", "Compact", "Compa…
## $ `Vehicle Style`     <chr> "Coupe", "Convertible", "Coupe", "Coupe", "Convert…
## $ `highway MPG`       <dbl> 26, 28, 28, 28, 28, 28, 26, 28, 28, 27, 28, 28, 28…
## $ `city mpg`          <dbl> 19, 19, 20, 18, 18, 18, 17, 20, 18, 18, 20, 19, 19…
## $ Popularity          <dbl> 3916, 3916, 3916, 3916, 3916, 3916, 3916, 3916, 39…
## $ MSRP                <dbl> 46135, 40650, 36350, 29450, 34500, 31200, 44100, 3…
str(df)
## spc_tbl_ [11,914 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Make             : chr [1:11914] "BMW" "BMW" "BMW" "BMW" ...
##  $ Model            : chr [1:11914] "1 Series M" "1 Series" "1 Series" "1 Series" ...
##  $ Year             : num [1:11914] 2011 2011 2011 2011 2011 ...
##  $ Engine Fuel Type : chr [1:11914] "premium unleaded (required)" "premium unleaded (required)" "premium unleaded (required)" "premium unleaded (required)" ...
##  $ Engine HP        : num [1:11914] 335 300 300 230 230 230 300 300 230 230 ...
##  $ Engine Cylinders : num [1:11914] 6 6 6 6 6 6 6 6 6 6 ...
##  $ Transmission Type: chr [1:11914] "MANUAL" "MANUAL" "MANUAL" "MANUAL" ...
##  $ Driven_Wheels    : chr [1:11914] "rear wheel drive" "rear wheel drive" "rear wheel drive" "rear wheel drive" ...
##  $ Number of Doors  : num [1:11914] 2 2 2 2 2 2 2 2 2 2 ...
##  $ Market Category  : chr [1:11914] "Factory Tuner,Luxury,High-Performance" "Luxury,Performance" "Luxury,High-Performance" "Luxury,Performance" ...
##  $ Vehicle Size     : chr [1:11914] "Compact" "Compact" "Compact" "Compact" ...
##  $ Vehicle Style    : chr [1:11914] "Coupe" "Convertible" "Coupe" "Coupe" ...
##  $ highway MPG      : num [1:11914] 26 28 28 28 28 28 26 28 28 27 ...
##  $ city mpg         : num [1:11914] 19 19 20 18 18 18 17 20 18 18 ...
##  $ Popularity       : num [1:11914] 3916 3916 3916 3916 3916 ...
##  $ MSRP             : num [1:11914] 46135 40650 36350 29450 34500 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Make = col_character(),
##   ..   Model = col_character(),
##   ..   Year = col_double(),
##   ..   `Engine Fuel Type` = col_character(),
##   ..   `Engine HP` = col_double(),
##   ..   `Engine Cylinders` = col_double(),
##   ..   `Transmission Type` = col_character(),
##   ..   Driven_Wheels = col_character(),
##   ..   `Number of Doors` = col_double(),
##   ..   `Market Category` = col_character(),
##   ..   `Vehicle Size` = col_character(),
##   ..   `Vehicle Style` = col_character(),
##   ..   `highway MPG` = col_double(),
##   ..   `city mpg` = col_double(),
##   ..   Popularity = col_double(),
##   ..   MSRP = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(df)
##      Make              Model                Year      Engine Fuel Type  
##  Length:11914       Length:11914       Min.   :1990   Length:11914      
##  Class :character   Class :character   1st Qu.:2007   Class :character  
##  Mode  :character   Mode  :character   Median :2015   Mode  :character  
##                                        Mean   :2010                     
##                                        3rd Qu.:2016                     
##                                        Max.   :2017                     
##                                                                         
##    Engine HP      Engine Cylinders Transmission Type  Driven_Wheels     
##  Min.   :  55.0   Min.   : 0.000   Length:11914       Length:11914      
##  1st Qu.: 170.0   1st Qu.: 4.000   Class :character   Class :character  
##  Median : 227.0   Median : 6.000   Mode  :character   Mode  :character  
##  Mean   : 249.4   Mean   : 5.629                                        
##  3rd Qu.: 300.0   3rd Qu.: 6.000                                        
##  Max.   :1001.0   Max.   :16.000                                        
##  NA's   :69       NA's   :30                                            
##  Number of Doors Market Category    Vehicle Size       Vehicle Style     
##  Min.   :2.000   Length:11914       Length:11914       Length:11914      
##  1st Qu.:2.000   Class :character   Class :character   Class :character  
##  Median :4.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3.436                                                           
##  3rd Qu.:4.000                                                           
##  Max.   :4.000                                                           
##  NA's   :6                                                               
##   highway MPG        city mpg        Popularity        MSRP        
##  Min.   : 12.00   Min.   :  7.00   Min.   :   2   Min.   :   2000  
##  1st Qu.: 22.00   1st Qu.: 16.00   1st Qu.: 549   1st Qu.:  21000  
##  Median : 26.00   Median : 18.00   Median :1385   Median :  29995  
##  Mean   : 26.64   Mean   : 19.73   Mean   :1555   Mean   :  40595  
##  3rd Qu.: 30.00   3rd Qu.: 22.00   3rd Qu.:2009   3rd Qu.:  42231  
##  Max.   :354.00   Max.   :137.00   Max.   :5657   Max.   :2065902  
## 
colnames(df)
##  [1] "Make"              "Model"             "Year"             
##  [4] "Engine Fuel Type"  "Engine HP"         "Engine Cylinders" 
##  [7] "Transmission Type" "Driven_Wheels"     "Number of Doors"  
## [10] "Market Category"   "Vehicle Size"      "Vehicle Style"    
## [13] "highway MPG"       "city mpg"          "Popularity"       
## [16] "MSRP"

Cleaning and Preparation

library(dplyr)

df <- df %>% select(
  Make,
  Model,
  Year,
  Horsepower = `Engine HP`,
  Cylinders = `Engine Cylinders`,
  `Transmission Type`,
  Driven_Wheels,
  MPG_Highway = `highway MPG`,
  MPG_City = `city mpg`,
  MSRP
)

df <- df %>% filter(!is.na(MSRP), !is.na(Horsepower), !is.na(Cylinders))

head(df)
## # A tibble: 6 × 10
##   Make  Model       Year Horsepower Cylinders `Transmission Type` Driven_Wheels 
##   <chr> <chr>      <dbl>      <dbl>     <dbl> <chr>               <chr>         
## 1 BMW   1 Series M  2011        335         6 MANUAL              rear wheel dr…
## 2 BMW   1 Series    2011        300         6 MANUAL              rear wheel dr…
## 3 BMW   1 Series    2011        300         6 MANUAL              rear wheel dr…
## 4 BMW   1 Series    2011        230         6 MANUAL              rear wheel dr…
## 5 BMW   1 Series    2011        230         6 MANUAL              rear wheel dr…
## 6 BMW   1 Series    2012        230         6 MANUAL              rear wheel dr…
## # ℹ 3 more variables: MPG_Highway <dbl>, MPG_City <dbl>, MSRP <dbl>

Final Results

Engine HP vs Price

library(ggplot2)
ggplot(df,aes(x= Horsepower, y= MSRP)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(title = "Engine HP vs MSRP")
## `geom_smooth()` using formula = 'y ~ x'

Model

mod <- lm(MSRP ~ Horsepower + Cylinders + MPG_Highway + Driven_Wheels, data = df)
summary(mod)
## 
## Call:
## lm(formula = MSRP ~ Horsepower + Cylinders + MPG_Highway + Driven_Wheels, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -271484  -18056     859   12788 1709884 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.015e+05  3.294e+03 -30.825   <2e-16 ***
## Horsepower                      3.179e+02  6.357e+00  49.997   <2e-16 ***
## Cylinders                       7.944e+03  4.461e+02  17.810   <2e-16 ***
## MPG_Highway                     8.769e+02  7.576e+01  11.575   <2e-16 ***
## Driven_Wheelsfour wheel drive  -2.075e+04  1.573e+03 -13.196   <2e-16 ***
## Driven_Wheelsfront wheel drive  2.328e+03  1.241e+03   1.875   0.0608 .  
## Driven_Wheelsrear wheel drive  -1.284e+04  1.235e+03 -10.394   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43930 on 11809 degrees of freedom
## Multiple R-squared:  0.4698, Adjusted R-squared:  0.4696 
## F-statistic:  1744 on 6 and 11809 DF,  p-value: < 2.2e-16

Make clusters of car by their features

clust_df <- df %>% select(Horsepower, Cylinders, MPG_Highway) %>% scale()
k3 <- kmeans(clust_df, centers = 3)
df$cluster <- factor(k3$cluster)


ggplot(df, aes(x = Horsepower, y = MPG_Highway, color = cluster)) +
geom_point(alpha = 0.5) +
labs(title = "Simple Car Groups Based on Features")