Exploratory Data Analysis: CIS Large Car Dataset Using Arrow & DuckDB

Author

Luis Tapia

Phase 1: Set Up & Quality Assessment

Objective(s):

Apply the READY framework to plan data investigation
Use the SCAN framework to systematically explore the dataset
Practice using Arrow for memory-efficient data loading
Document initial findings and develop investigation questions

Part 1: Data Setup and Loading

Loading Libraries:

# performed an installation of 'zip' package first for troubleshooting purposes

# help with initial troubleshooting
# install.packages("zip") 
# install.packages("writexl")
# install.packages("openxlsx")
# install.packages("janitor")
# install.packages("viridis")
# install.packages("corrr")
# install.packages("naniar")

# Load required libraries
library(knitr)    # in case to output pdf files

Warning: package 'knitr' was built under R version 4.5.1

library(openxlsx) # opening spreadsheet files

Warning: package 'openxlsx' was built under R version 4.5.1

library(writexl)  # for outputting/printing excel files for external analysis

Warning: package 'writexl' was built under R version 4.5.1

library(arrow)     # efficient big data handling

Warning: package 'arrow' was built under R version 4.5.1


Attaching package: 'arrow'

The following object is masked from 'package:utils':

    timestamp

library(glue)      # string interpolation

Warning: package 'glue' was built under R version 4.5.1

library(zip)       # handling zipped files

Warning: package 'zip' was built under R version 4.5.1


Attaching package: 'zip'

The following objects are masked from 'package:utils':

    unzip, zip

library(dplyr)     # data manipulation, loaded for assurance

Warning: package 'dplyr' was built under R version 4.5.1


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(readr)     # reads in files, loaded for assurance

Warning: package 'readr' was built under R version 4.5.1

library(tidyverse) # data manipulation and visualization

Warning: package 'tidyverse' was built under R version 4.5.1

Warning: package 'ggplot2' was built under R version 4.5.1

Warning: package 'tibble' was built under R version 4.5.1

Warning: package 'tidyr' was built under R version 4.5.1

Warning: package 'purrr' was built under R version 4.5.1

Warning: package 'stringr' was built under R version 4.5.1

Warning: package 'forcats' was built under R version 4.5.1

Warning: package 'lubridate' was built under R version 4.5.1

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ lubridate::duration() masks arrow::duration()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(duckdb)    # in-process analytics database

Warning: package 'duckdb' was built under R version 4.5.1

Loading required package: DBI

Warning: package 'DBI' was built under R version 4.5.1

library(janitor) # used for changing varaible naming scheme

Warning: package 'janitor' was built under R version 4.5.1


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

library(ggplot2) # create graph/chart visualizations 
library(scales)  # scale functions for visualization

Warning: package 'scales' was built under R version 4.5.1


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

library(viridis) # heat-map/accessible color palette

Warning: package 'viridis' was built under R version 4.5.2

Loading required package: viridisLite

Warning: package 'viridisLite' was built under R version 4.5.1


Attaching package: 'viridis'

The following object is masked from 'package:scales':

    viridis_pal

library(DBI)     # database interface
library(corrr)   # correlation analysis

Warning: package 'corrr' was built under R version 4.5.2

library(naniar)  # missing data visualization

Warning: package 'naniar' was built under R version 4.5.2

library(corrplot) # visualize correlation patterns

Warning: package 'corrplot' was built under R version 4.5.1

corrplot 0.95 loaded

Disclaimer

During the process of exploring my selected dataset further, I decided to change the scope of the analysis from my previously submitted work. The analysis now focuses on observing price increases of new car models based on model year and market segment shifts related to body class, rather than identifying specific features that drive a car’s price tag. I made this change due to the dataset’s age and because many variables measuring technical features would not necessarily be applicable across all model years. This approach ensured my analysis remained consistent.

In addition, since the dataset spans from 2017 to 2020, I considered that the features or “drivers” that influenced prices from those records would not necessarily apply to current times due to policy and economic shifts. As such, a combination of my general findings and thought process throughout the analysis made me simplify my scope to instead focus on car asking price forecasting for the future.

For ZIP files Containing CSV(s):

Since the large dataset I was working on only consisted of a single compressed file for unzipping, I still encountered issues in using the “open_dataset()” function directly to load in the data after the unzipping process. Specifically, when opening the dataset, there are various datatypes for each field or variable (such as categorical or numeric) that created errors in reading in the variables correctly.

Error Example:

Error: Invalid: In CSV column #9: Row #1204: CSV conversion error to null: invalid value 'Ebony'

As such, I have utilized research gathered from StackOverflow and the assistance of Claude as an AI tool to help guide the best approach to load in the data that has a large number of different types of fields/variables. This also involved converting the file early on to a Parquet file type for efficiency. That was based on what was noted in class (10/8), which would best optimize performance in working with the dataset during the cleaning process.

*Some code chunks involving loading in multiple versions of the dataset and that create Excel Sheets have been commented out for error handing

# Note: some changes are added to base template to handle 'read'-related issues

# path to zip file 
# CHANGE PATH WHEN RUNNING ON DIFFERENT SYSTEM 
#zip_path <- "C:/Univeristy_Assignments/Fall 2025/DSA 406/406_Final/Final_Project/archive.zip"

# create folder for holding extracted dataset 
#outdir <- file.path(dirname(zip_path), "Extracted Data Folder")

# extract files if needed and use if statements to ensure extracted file is not already present
#if (!dir.exists(outdir)) {
 #dir.create(outdir)
 #unzip(zip_path, exdir = outdir) # unzips the dataset
 #message("Files extracted")
 #} else{
 #message("Files already extracted")
 #}

# get list of CSV files present
#csv_files <- list.files(outdir, pattern = "\\.csv$", full.names = TRUE)

# used read_csv_arrow() to read in dataset rather than open_dataset() as a result of columm reading issues
#baseData <- read_csv_arrow(
  #csv_files[1],
  #col_types = schema(.default = string())) %>%
  #collect()

# check memory usage of dataset load <--- Intensive
#glue("Memory used by Arrow object: {format(object.size(baseData), units = 'MB')}")

Initial View of Data (without conversion only for testing purposes):

# reading in the first 5 rows of the data
#baseData %>% 
  #head(5)

Converting CSV to Parquet:

# convert and save csv file as parquet for efficiency
#parquet_path <- file.path(outdir, "baseDataAauto.parquet")

# creates parquet file and 
#write_parquet(baseData, parquet_path)
#message("Parquet file created")

Parquet File Validation and Viewing (run this to bring data into environment):

# verification of file presence
file.exists("Extracted Data Folder/baseDataAauto.parquet")

[1] TRUE

file.size("Extracted Data Folder/baseDataAauto.parquet") / 1024^3  # Size in GB

[1] 0.9410913

# loading in the dataset
autoInfo <- open_dataset("Extracted Data Folder/baseDataAauto.parquet", format = "parquet")

Using glimpse() on Parquet File (Version 1):

Note: AutoInfo is the first version of the parquet file without any other modifications implemented (vanilla). However, further into the analysis, another parquet file will be created that contains changes to better streamline data loading through partitioning and a streamlined column naming scheme for easier ‘fetching’ of data.

# prints first couple of rows from dataset
autoInfo %>%
  glimpse()

FileSystemDataset with 1 Parquet file
5,695,015 rows x 156 columns
$ vin                                    <string> "abc5f0360059cf7b6fa8368db57f2…
$ stockNum                               <string> "11701A", "9055B", "11816A", "…
$ firstSeen                         <date32[day]> 2019-05-06, 2019-05-06, 2017-0…
$ lastSeen                          <date32[day]> 2019-05-06, 2019-05-06, 2019-0…
$ msrp                                    <int32> 1498, 10589, 11992, 12387, 416…
$ askPrice                                <int32> 1498, 10589, 9940, 12387, 4165…
$ mileage                                 <int32> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ isNew                                  <string> "False", "False", "False", "Fa…
$ color                                  <string> "Gray", "Super Black", "White"…
$ interiorColor                          <string> "N/A", "N/A", "N/A", "N/A", "N…
$ brandName                              <string> "MITSUBISHI", "NISSAN", "FORD"…
$ modelName                              <string> "Eclipse Spyder", "Altima", "E…
$ dealerID                                <int32> 7514, 7514, 7514, 7514, 7514, …
$ vf_ABS                                 <string> NA, NA, NA, "Standard", "Stand…
$ vf_ActiveSafetySysNote                 <string> NA, NA, NA, NA, NA, NA, "My Ke…
$ vf_AdaptiveCruiseControl               <string> NA, NA, NA, NA, NA, "Optional"…
$ vf_AdaptiveDrivingBeam                 <string> NA, NA, NA, "Optional", "Stand…
$ vf_AdaptiveHeadlights                  <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_AdditionalErrorText                 <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_AirBagLocCurtain                    <string> NA, "1st & 2nd Rows", NA, "All…
$ vf_AirBagLocFront                      <string> "1st Row (Driver & Passenger)"…
$ vf_AirBagLocKnee                       <string> NA, NA, "Driver Seat Only", "1…
$ vf_AirBagLocSeatCushion                <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_AirBagLocSide                       <string> NA, "1st Row (Driver & Passeng…
$ vf_AutoReverseSystem                   <string> NA, NA, NA, "Standard", "Stand…
$ vf_AutomaticPedestrianAlertingSound    <string> NA, NA, NA, NA, NA, "Standard"…
$ vf_AxleConfiguration                   <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_Axles                                <int32> NA, NA, NA, 2, 2, 2, 2, 2, 2, …
$ vf_BasePrice                           <double> NA, NA, NA, 23475, NA, 26500, …
$ vf_BatteryA                             <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryA_to                           <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryCells                         <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryInfo                         <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryKWh                          <double> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryKWh_to                        <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryModules                        <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryPacks                         <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryType                         <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryV                             <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BatteryV_to                           <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BedLengthIN                          <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BedType                             <string> "Not Applicable", "Not Applica…
$ vf_BlindSpotMon                        <string> NA, NA, NA, "Optional", NA, "O…
$ vf_BodyCabType                         <string> "Not Applicable", "Not Applica…
$ vf_BodyClass                           <string> "Convertible/Cabriolet", "Seda…
$ vf_BrakeSystemDesc                     <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BrakeSystemType                     <string> NA, NA, "Hydraulic", NA, "Hydr…
$ vf_BusFloorConfigType                  <string> "Not Applicable", "Not Applica…
$ vf_BusLength                             <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_BusType                             <string> "Not Applicable", "Not Applica…
$ vf_CAN_AACN                            <string> NA, NA, NA, "Standard", "Stand…
$ vf_CIB                                 <string> NA, NA, NA, NA, "Standard", "O…
$ vf_CashForClunkers                       <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_ChargerLevel                        <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_ChargerPowerKW                       <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_CoolingType                         <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_CurbWeightLB                         <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_CustomMotorcycleType                <string> "Not Applicable", "Not Applica…
$ vf_DaytimeRunningLight                 <string> NA, NA, NA, "Standard", "Stand…
$ vf_DestinationMarket                   <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_DisplacementCC                      <double> 3000, 2500, 1600, 1400, 5000, …
$ vf_DisplacementCI                      <double> 183.07123, 152.55936, 97.63799…
$ vf_DisplacementL                       <double> 3.0, 2.5, 1.6, 1.4, 5.0, 2.0, …
$ vf_Doors                                <int32> 2, 4, 4, 4, NA, 4, 4, 4, 4, 4,…
$ vf_DriveType                           <string> NA, "4x2", "4x2", NA, "4WD/4-W…
$ vf_DriverAssist                        <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_DynamicBrakeSupport                 <string> NA, NA, NA, "Standard", "Stand…
$ vf_EDR                                 <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_ESC                                 <string> NA, NA, NA, "Standard", "Stand…
$ vf_EVDriveUnit                         <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_ElectrificationLevel                <string> NA, NA, NA, NA, NA, "Strong HE…
$ vf_EngineConfiguration                 <string> NA, "In-Line", "In-Line", "In-…
$ vf_EngineCycles                         <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_EngineCylinders                      <int32> NA, 4, 4, 4, 8, 4, 4, 6, 6, 4,…
$ vf_EngineHP                            <double> NA, NA, 178, NA, 395, 188, 171…
$ vf_EngineHP_to                         <double> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_EngineKW                            <double> NA, NA, 132.7346, NA, 294.5515…
$ vf_EngineManufacturer                  <string> NA, NA, "Ford", "GMNA", "Ford"…
$ vf_EngineModel                         <string> NA, NA, NA, "LE2 -DI: Direct I…
$ vf_EntertainmentSystem                 <string> NA, NA, NA, NA, "CD + stereo",…
$ vf_ForwardCollisionWarning             <string> NA, NA, NA, "Optional", "Stand…
$ vf_FuelInjectionType                   <string> "Multipoint Fuel Injection (MP…
$ vf_FuelTypePrimary                     <string> NA, "Gasoline", "Gasoline", "G…
$ vf_FuelTypeSecondary                   <string> NA, NA, NA, NA, NA, "Electric"…
$ vf_GCWR                                  <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_GCWR_to                               <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_GVWR                                <string> NA, NA, "Class 1C: 4001 - 5000…
$ vf_GVWR_to                               <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_KeylessIgnition                     <string> NA, NA, NA, "Standard", NA, "O…
$ vf_LaneDepartureWarning                <string> NA, NA, NA, "Optional", "Optio…
$ vf_LaneKeepSystem                      <string> NA, NA, NA, "Optional", "Optio…
$ vf_LowerBeamHeadlampLightSource        <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_Make                                <string> "MITSUBISHI", "NISSAN", "FORD"…
$ vf_MakeID                              <double> 481, 478, 460, 467, 460, 460, …
$ vf_Manufacturer                        <string> "MITSUBISHI MOTORS NORTH AMERI…
$ vf_ManufacturerId                       <int32> 1054, 997, 976, 984, 976, 979,…
$ vf_Model                               <string> "Eclipse Spyder", "Altima", "E…
$ vf_ModelID                              <int32> 2321, 1904, 1798, 1832, 1801, …
$ vf_ModelYear                            <int32> 2002, 2016, 2014, 2017, 2019, …
$ vf_MotorcycleChassisType               <string> "Not Applicable", "Not Applica…
$ vf_MotorcycleSuspensionType            <string> "Not Applicable", "Not Applica…
$ vf_NCSABodyType                          <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_NCSAMake                              <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_NCSAMapExcApprovedBy                  <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_NCSAMapExcApprovedOn                  <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_NCSAMappingException                  <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_NCSAModel                             <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_NCSANote                            <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_Note                                <string> NA, "position 6:Model change n…
$ vf_OtherBusInfo                          <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_OtherEngineInfo                     <string> "MPI", NA, "Ti-VCT ", NA, NA, …
$ vf_OtherMotorcycleInfo                 <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_OtherRestraintSystemInfo            <string> NA, "2nd row outboard and cent…
$ vf_OtherTrailerInfo                    <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_ParkAssist                          <string> NA, NA, NA, NA, NA, NA, NA, "O…
$ vf_PedestrianAutomaticEmergencyBraking <string> NA, NA, NA, NA, "Standard", "O…
$ vf_PlantCity                           <string> "BLOOMINGTON-NORMAL", "CANTON"…
$ vf_PlantCompanyName                    <string> NA, "Nissan North America Inc.…
$ vf_PlantCountry                        <string> "UNITED STATES (USA)", "UNITED…
$ vf_PlantState                          <string> "ILLINOIS", "MISSISSIPPI", "KE…
$ vf_PossibleValues                      <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_Pretensioner                        <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_RearCrossTrafficAlert               <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_RearVisibilitySystem                <string> NA, NA, NA, "Standard", "Stand…
$ vf_SAEAutomationLevel                   <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_SAEAutomationLevel_to                 <null> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_SeatBeltsAll                        <string> "Manual", "Manual", "Manual", …
$ vf_SeatRows                             <int32> NA, NA, NA, 2, 2, 2, 2, NA, 3,…
$ vf_Seats                                <int32> NA, NA, NA, 5, 6, 5, 5, 7, 7, …
$ vf_SemiautomaticHeadlampBeamSwitching  <string> NA, NA, NA, "Standard", "Stand…
$ vf_Series                              <string> "SPORTS", NA, "SE", "Premier",…
$ vf_Series2                             <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_SteeringLocation                    <string> NA, NA, NA, "Left Hand Drive (…
$ vf_SuggestedVIN                        <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_TPMS                                <string> NA, "Direct", "Direct", "Direc…
$ vf_TopSpeedMPH                          <int32> NA, NA, NA, 130, NA, 105, 114,…
$ vf_TrackWidth                          <double> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_TractionControl                     <string> NA, NA, NA, "Standard", "Stand…
$ vf_TrailerBodyType                     <string> "Not Applicable", "Not Applica…
$ vf_TrailerLength                        <int32> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_TrailerType                         <string> "Not Applicable", "Not Applica…
$ vf_TransmissionSpeeds                   <int32> NA, NA, NA, NA, NA, NA, 6, 6, …
$ vf_TransmissionStyle                   <string> NA, NA, NA, "Automatic", "Auto…
$ vf_Trim                                <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_Trim2                               <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_Turbo                               <string> NA, NA, NA, "Yes", NA, NA, NA,…
$ vf_VIN                                 <string> "abc5f0360059cf7b6fa8368db57f2…
$ vf_ValveTrainDesign                    <string> NA, NA, NA, "Dual Overhead Cam…
$ vf_VehicleType                         <string> "PASSENGER CAR", "PASSENGER CA…
$ vf_WheelBaseLong                       <double> NA, NA, NA, NA, 156.8, NA, NA,…
$ vf_WheelBaseShort                      <double> NA, NA, NA, 106.3, 145.0, 112.…
$ vf_WheelBaseType                       <string> NA, NA, NA, NA, NA, NA, NA, NA…
$ vf_WheelSizeFront                       <int32> NA, NA, NA, 17, 17, 17, 17, 19…
$ vf_WheelSizeRear                        <int32> NA, NA, NA, 17, 17, 17, 17, 19…
$ vf_Wheels                               <int32> NA, NA, NA, 4, 4, 4, 4, 4, 4, …
$ vf_Windows                              <int32> NA, NA, NA, 4, NA, NA, NA, NA,…
Call `print()` for full schema details

Part 2: READY Framework Analysis

Work through each component of READY with the dataset:

R - Representative Data

Document my Thoughts as Comments:

The overall objective in analyzing this specific dataset is to establish baseline findings related to featuresets and aspects of vehicles in dealerships that impacted pricing in a past period of time (2017-2020). In addition, visualize any sales patterns primarily associated with body class and brands before any current policy or economic changes.

What is the scope of the data?

The overall scope of the data is serving as an information record (that includes physical to pricing attributes) of new and used vehicle offerings from dealerships in a specific state. According to the company that published the dataset, Competitive Intelligence Solutions (CSI), the data is meant for reporting and analysis across multiple industries to evaluate specific metrics derived from the data. As such, the overall size of the dataset is 5,695,015 rows.

Time Period Covered:

The time period that the data covers is between September 2017 to June 2020 (Roughly a 4-year time span).

Geographic Coverage:

The specific location that the dataset covers is the state of Illinois.

Population Represented:

The population represented by the dataset is primarily vehicle dealerships across the state of Illinois.

Potential Biases or Limitations:

A form of confirmation bias going into the analysis is that primarily technology-oriented features in newer model year vehicles, such as lane departure warning or lane-keep systems, are expected to impact pricing due to the complexity of these systems. In addition, vehicle brands alone create various forms of bias and give specific impressions when viewing the dataset, as this would serve as another influencing factor based on trends when exploring new and used car options. Another potential limitation is the presence of many NA values across most of the dataset fields.

Example Questions to Consider:

Do we have complete coverage of what we’re studying?
Are there any obvious gaps in the data?
What might be missing?

E - Executive Driven Questions

Who would care about insights from the data?

Primary stakeholders: Key business/research questions they might ask: What decisions could this data inform?

Examples:

If this is sales data: “How can we optimize our sales strategy?”
If this is health data: “What patterns affect patient outcomes?”
If this is social media data: “How can we improve engagement?”

Stakeholder/Research Questions:

How much has dealer pricing among new and used offerings varied across different car model years?
What is the distribution of car body classes ( Pickups, SUVS, Sedans) across dealerships?
What car brands are the most prominent in dealership locations, and what are their most prominent models?
Does having a high volume of larger vehicles (such as SUVs) directly impact the overall market’s median asking price to be higher?
How strong are aspects of a car, such as brand, as predictors for forecasting the dealer’s asking price?

A - Analytic Framework

Exploration Strategy

Phase 1: Data Quality Assessment:

Check for missing values
Identify data types and consistency
Look for outliers or anomalies

Phase 2: Descriptive Analysis:

What are the key variables?
What’s the distribution of important metrics?
What time patterns exist?

Phase 3: Pattern Investigation:

What relationships might exist between variables?
Are there seasonal or temporal patterns?
What groupings or segments emerge?

Specific Analytical Approach:

Begin checking for any potential relationships within the data through the use of measures such as a correlation matrix to begin identifying areas to explore.
In determining key variables or fields I want to further focus on as part of the data, I begin developing visualizations or informational tables that depict anything noteworthy at first glance.
Consider utilizing strategies such as “feature engineering” to develop new variables from existing fields, such as any month-day-year fields, to make a separate year field that can be used to develop any form of comparison analysis on key variables.

D - Data Best Practices

Missing Data Assessment:

Data type verification: Are numeric columns actually numeric? Are dates properly formatted? Are categorical variables consistent?

Quality Concerns:

Based on a first glance of the dataset, there appear to be various variables/fields that have numerous amounts of NA values present, which impacts the usefulness of some fields. Other fields are completely read in as ‘NULL’; however, they only appear as variables that are not relevant to the scope of the analysis.
Inconsistent formatting between categorical variables, such as ‘vehicleType’ (all uppercase) and ‘Model’ (titlecase), would require some further cleaning to ensure consistent reporting.
There appear to be variations of the same measure, such as price (‘askPrice’, ‘msrp’, and ‘BasePrice’), that could make it difficult to perform certain analyses related to pricing/value, not recognizing which is the most optimal to use for specific cases.

Y - “Your Insights”

Initial Hypotheses About Possible Findings:

Based on my domain knowledge, what patterns do you expect? What would surprise you? What would be most valuable to discover?

**My Predictions:

There might be certain fields (related to areas such as vehicle type) that show distinct patterns based on year that could be indicative of trends or preferences demonstrated during the dataset’s timeframe.
A prominence of larger vehicle types such as SUVs and or trucks when compared to sedans in the dataset due to noted popularity of larger car types for practicality, higher driving position, as well as comfort.
There would be a distinct distribution between both the price segments of vehicles that would be leaning towards increasing price, making the entry point of vehicles move closer to around $30k for brand new offerings.

Basic Data Cleaning & Dataset Enhancement

This section is necessary to perform basic enhancements/adjustments to the loaded dataset, ensuring the data structure remains consistent during analysis without removing any variables/fields at this stage. In addition, I created a new version of the parquet file, which primarily carried over a standardized column/field naming scheme and partitioned the dataset through the “model_year” field to further improve performance when working with the data.

Get Column Names to See Format:

autoInfo %>% 
  colnames()

  [1] "vin"                                   
  [2] "stockNum"                              
  [3] "firstSeen"                             
  [4] "lastSeen"                              
  [5] "msrp"                                  
  [6] "askPrice"                              
  [7] "mileage"                               
  [8] "isNew"                                 
  [9] "color"                                 
 [10] "interiorColor"                         
 [11] "brandName"                             
 [12] "modelName"                             
 [13] "dealerID"                              
 [14] "vf_ABS"                                
 [15] "vf_ActiveSafetySysNote"                
 [16] "vf_AdaptiveCruiseControl"              
 [17] "vf_AdaptiveDrivingBeam"                
 [18] "vf_AdaptiveHeadlights"                 
 [19] "vf_AdditionalErrorText"                
 [20] "vf_AirBagLocCurtain"                   
 [21] "vf_AirBagLocFront"                     
 [22] "vf_AirBagLocKnee"                      
 [23] "vf_AirBagLocSeatCushion"               
 [24] "vf_AirBagLocSide"                      
 [25] "vf_AutoReverseSystem"                  
 [26] "vf_AutomaticPedestrianAlertingSound"   
 [27] "vf_AxleConfiguration"                  
 [28] "vf_Axles"                              
 [29] "vf_BasePrice"                          
 [30] "vf_BatteryA"                           
 [31] "vf_BatteryA_to"                        
 [32] "vf_BatteryCells"                       
 [33] "vf_BatteryInfo"                        
 [34] "vf_BatteryKWh"                         
 [35] "vf_BatteryKWh_to"                      
 [36] "vf_BatteryModules"                     
 [37] "vf_BatteryPacks"                       
 [38] "vf_BatteryType"                        
 [39] "vf_BatteryV"                           
 [40] "vf_BatteryV_to"                        
 [41] "vf_BedLengthIN"                        
 [42] "vf_BedType"                            
 [43] "vf_BlindSpotMon"                       
 [44] "vf_BodyCabType"                        
 [45] "vf_BodyClass"                          
 [46] "vf_BrakeSystemDesc"                    
 [47] "vf_BrakeSystemType"                    
 [48] "vf_BusFloorConfigType"                 
 [49] "vf_BusLength"                          
 [50] "vf_BusType"                            
 [51] "vf_CAN_AACN"                           
 [52] "vf_CIB"                                
 [53] "vf_CashForClunkers"                    
 [54] "vf_ChargerLevel"                       
 [55] "vf_ChargerPowerKW"                     
 [56] "vf_CoolingType"                        
 [57] "vf_CurbWeightLB"                       
 [58] "vf_CustomMotorcycleType"               
 [59] "vf_DaytimeRunningLight"                
 [60] "vf_DestinationMarket"                  
 [61] "vf_DisplacementCC"                     
 [62] "vf_DisplacementCI"                     
 [63] "vf_DisplacementL"                      
 [64] "vf_Doors"                              
 [65] "vf_DriveType"                          
 [66] "vf_DriverAssist"                       
 [67] "vf_DynamicBrakeSupport"                
 [68] "vf_EDR"                                
 [69] "vf_ESC"                                
 [70] "vf_EVDriveUnit"                        
 [71] "vf_ElectrificationLevel"               
 [72] "vf_EngineConfiguration"                
 [73] "vf_EngineCycles"                       
 [74] "vf_EngineCylinders"                    
 [75] "vf_EngineHP"                           
 [76] "vf_EngineHP_to"                        
 [77] "vf_EngineKW"                           
 [78] "vf_EngineManufacturer"                 
 [79] "vf_EngineModel"                        
 [80] "vf_EntertainmentSystem"                
 [81] "vf_ForwardCollisionWarning"            
 [82] "vf_FuelInjectionType"                  
 [83] "vf_FuelTypePrimary"                    
 [84] "vf_FuelTypeSecondary"                  
 [85] "vf_GCWR"                               
 [86] "vf_GCWR_to"                            
 [87] "vf_GVWR"                               
 [88] "vf_GVWR_to"                            
 [89] "vf_KeylessIgnition"                    
 [90] "vf_LaneDepartureWarning"               
 [91] "vf_LaneKeepSystem"                     
 [92] "vf_LowerBeamHeadlampLightSource"       
 [93] "vf_Make"                               
 [94] "vf_MakeID"                             
 [95] "vf_Manufacturer"                       
 [96] "vf_ManufacturerId"                     
 [97] "vf_Model"                              
 [98] "vf_ModelID"                            
 [99] "vf_ModelYear"                          
[100] "vf_MotorcycleChassisType"              
[101] "vf_MotorcycleSuspensionType"           
[102] "vf_NCSABodyType"                       
[103] "vf_NCSAMake"                           
[104] "vf_NCSAMapExcApprovedBy"               
[105] "vf_NCSAMapExcApprovedOn"               
[106] "vf_NCSAMappingException"               
[107] "vf_NCSAModel"                          
[108] "vf_NCSANote"                           
[109] "vf_Note"                               
[110] "vf_OtherBusInfo"                       
[111] "vf_OtherEngineInfo"                    
[112] "vf_OtherMotorcycleInfo"                
[113] "vf_OtherRestraintSystemInfo"           
[114] "vf_OtherTrailerInfo"                   
[115] "vf_ParkAssist"                         
[116] "vf_PedestrianAutomaticEmergencyBraking"
[117] "vf_PlantCity"                          
[118] "vf_PlantCompanyName"                   
[119] "vf_PlantCountry"                       
[120] "vf_PlantState"                         
[121] "vf_PossibleValues"                     
[122] "vf_Pretensioner"                       
[123] "vf_RearCrossTrafficAlert"              
[124] "vf_RearVisibilitySystem"               
[125] "vf_SAEAutomationLevel"                 
[126] "vf_SAEAutomationLevel_to"              
[127] "vf_SeatBeltsAll"                       
[128] "vf_SeatRows"                           
[129] "vf_Seats"                              
[130] "vf_SemiautomaticHeadlampBeamSwitching" 
[131] "vf_Series"                             
[132] "vf_Series2"                            
[133] "vf_SteeringLocation"                   
[134] "vf_SuggestedVIN"                       
[135] "vf_TPMS"                               
[136] "vf_TopSpeedMPH"                        
[137] "vf_TrackWidth"                         
[138] "vf_TractionControl"                    
[139] "vf_TrailerBodyType"                    
[140] "vf_TrailerLength"                      
[141] "vf_TrailerType"                        
[142] "vf_TransmissionSpeeds"                 
[143] "vf_TransmissionStyle"                  
[144] "vf_Trim"                               
[145] "vf_Trim2"                              
[146] "vf_Turbo"                              
[147] "vf_VIN"                                
[148] "vf_ValveTrainDesign"                   
[149] "vf_VehicleType"                        
[150] "vf_WheelBaseLong"                      
[151] "vf_WheelBaseShort"                     
[152] "vf_WheelBaseType"                      
[153] "vf_WheelSizeFront"                     
[154] "vf_WheelSizeRear"                      
[155] "vf_Wheels"                             
[156] "vf_Windows"

Create Partitioned Parquet File:

# delete old parquet file/directory if it exists
#if (dir.exists("Extracted Data Folder/modAutoData.parquet")) {
  #unlink("Extracted Data Folder/modAutoData.parquet", recursive = TRUE)
#}
#if (dir.exists("Extracted Data Folder/baseDataAauto.parquet")) {
  #unlink("Extracted Data Folder/baseDataAauto.parquet", recursive = TRUE)
#}

# write optimized partitioned version
#autoInfo %>%
  # standardize naming scheme by removing the 'vf_' attribute on some fields
  # convert naming scheme to snake case
  #rename_with(~ janitor::make_clean_names(str_replace(., "^vf_", ""))) %>%
 #write_dataset(
    #"Extracted Data Folder/modAutoData.parquet", 
    #format = "parquet",
    # partition data by model_year
    #partitioning = "model_year",
  #)

Loading & Testing the Revised Dataset “AutoInfoData”:

# data validation check
file.exists("Extracted Data Folder/modAutoData.parquet")

[1] TRUE

file.size("Extracted Data Folder/modAutoData.parquet") / 1024^3  # Size in GB

[1] 1.144409e-05

# loading in dataset
autoInfoData <- open_dataset("Extracted Data Folder/modAutoData.parquet")

# display revised column names
colnames(autoInfoData)

  [1] "vin"                                   
  [2] "stock_num"                             
  [3] "first_seen"                            
  [4] "last_seen"                             
  [5] "msrp"                                  
  [6] "ask_price"                             
  [7] "mileage"                               
  [8] "is_new"                                
  [9] "color"                                 
 [10] "interior_color"                        
 [11] "brand_name"                            
 [12] "model_name"                            
 [13] "dealer_id"                             
 [14] "abs"                                   
 [15] "active_safety_sys_note"                
 [16] "adaptive_cruise_control"               
 [17] "adaptive_driving_beam"                 
 [18] "adaptive_headlights"                   
 [19] "additional_error_text"                 
 [20] "air_bag_loc_curtain"                   
 [21] "air_bag_loc_front"                     
 [22] "air_bag_loc_knee"                      
 [23] "air_bag_loc_seat_cushion"              
 [24] "air_bag_loc_side"                      
 [25] "auto_reverse_system"                   
 [26] "automatic_pedestrian_alerting_sound"   
 [27] "axle_configuration"                    
 [28] "axles"                                 
 [29] "base_price"                            
 [30] "battery_a"                             
 [31] "battery_a_to"                          
 [32] "battery_cells"                         
 [33] "battery_info"                          
 [34] "battery_k_wh"                          
 [35] "battery_k_wh_to"                       
 [36] "battery_modules"                       
 [37] "battery_packs"                         
 [38] "battery_type"                          
 [39] "battery_v"                             
 [40] "battery_v_to"                          
 [41] "bed_length_in"                         
 [42] "bed_type"                              
 [43] "blind_spot_mon"                        
 [44] "body_cab_type"                         
 [45] "body_class"                            
 [46] "brake_system_desc"                     
 [47] "brake_system_type"                     
 [48] "bus_floor_config_type"                 
 [49] "bus_length"                            
 [50] "bus_type"                              
 [51] "can_aacn"                              
 [52] "cib"                                   
 [53] "cash_for_clunkers"                     
 [54] "charger_level"                         
 [55] "charger_power_kw"                      
 [56] "cooling_type"                          
 [57] "curb_weight_lb"                        
 [58] "custom_motorcycle_type"                
 [59] "daytime_running_light"                 
 [60] "destination_market"                    
 [61] "displacement_cc"                       
 [62] "displacement_ci"                       
 [63] "displacement_l"                        
 [64] "doors"                                 
 [65] "drive_type"                            
 [66] "driver_assist"                         
 [67] "dynamic_brake_support"                 
 [68] "edr"                                   
 [69] "esc"                                   
 [70] "ev_drive_unit"                         
 [71] "electrification_level"                 
 [72] "engine_configuration"                  
 [73] "engine_cycles"                         
 [74] "engine_cylinders"                      
 [75] "engine_hp"                             
 [76] "engine_hp_to"                          
 [77] "engine_kw"                             
 [78] "engine_manufacturer"                   
 [79] "engine_model"                          
 [80] "entertainment_system"                  
 [81] "forward_collision_warning"             
 [82] "fuel_injection_type"                   
 [83] "fuel_type_primary"                     
 [84] "fuel_type_secondary"                   
 [85] "gcwr"                                  
 [86] "gcwr_to"                               
 [87] "gvwr"                                  
 [88] "gvwr_to"                               
 [89] "keyless_ignition"                      
 [90] "lane_departure_warning"                
 [91] "lane_keep_system"                      
 [92] "lower_beam_headlamp_light_source"      
 [93] "make"                                  
 [94] "make_id"                               
 [95] "manufacturer"                          
 [96] "manufacturer_id"                       
 [97] "model"                                 
 [98] "model_id"                              
 [99] "motorcycle_chassis_type"               
[100] "motorcycle_suspension_type"            
[101] "ncsa_body_type"                        
[102] "ncsa_make"                             
[103] "ncsa_map_exc_approved_by"              
[104] "ncsa_map_exc_approved_on"              
[105] "ncsa_mapping_exception"                
[106] "ncsa_model"                            
[107] "ncsa_note"                             
[108] "note"                                  
[109] "other_bus_info"                        
[110] "other_engine_info"                     
[111] "other_motorcycle_info"                 
[112] "other_restraint_system_info"           
[113] "other_trailer_info"                    
[114] "park_assist"                           
[115] "pedestrian_automatic_emergency_braking"
[116] "plant_city"                            
[117] "plant_company_name"                    
[118] "plant_country"                         
[119] "plant_state"                           
[120] "possible_values"                       
[121] "pretensioner"                          
[122] "rear_cross_traffic_alert"              
[123] "rear_visibility_system"                
[124] "sae_automation_level"                  
[125] "sae_automation_level_to"               
[126] "seat_belts_all"                        
[127] "seat_rows"                             
[128] "seats"                                 
[129] "semiautomatic_headlamp_beam_switching" 
[130] "series"                                
[131] "series2"                               
[132] "steering_location"                     
[133] "suggested_vin"                         
[134] "tpms"                                  
[135] "top_speed_mph"                         
[136] "track_width"                           
[137] "traction_control"                      
[138] "trailer_body_type"                     
[139] "trailer_length"                        
[140] "trailer_type"                          
[141] "transmission_speeds"                   
[142] "transmission_style"                    
[143] "trim"                                  
[144] "trim2"                                 
[145] "turbo"                                 
[146] "vin_2"                                 
[147] "valve_train_design"                    
[148] "vehicle_type"                          
[149] "wheel_base_long"                       
[150] "wheel_base_short"                      
[151] "wheel_base_type"                       
[152] "wheel_size_front"                      
[153] "wheel_size_rear"                       
[154] "wheels"                                
[155] "windows"                               
[156] "model_year"

Part 3: Data Quality Assessment Summary

S - Stakeholders (Revisited)

Revised Dataset Structure Analysis:

# this section analyzes the overall structure of the data
# take a sample size from data 
# to minimize instance of loading the entire data repeatedly

# OTHER DETAILS
# this version of the data filters for specifc variables that stakeholders
# would be most interested in, that avoids heavily technical details
autoInfoSample_1 <- autoInfoData %>% 
  slice_sample(n = 100) %>% 
    select(
      model_year,
      brand_name,
      model_name,
      body_class,
      vehicle_type,
      msrp,
      ask_price,
      mileage,
      is_new,
      color,
      interior_color,
      doors,
      fuel_type_primary,
      seat_belts_all
  ) %>% 
  collect()  

# prints out structure of data
str(autoInfoSample_1)

tibble [100 × 14] (S3: tbl_df/tbl/data.frame)
 $ model_year       : int [1:100] 1988 1990 1991 1990 1991 1991 1992 1993 1991 1992 ...
 $ brand_name       : chr [1:100] "CHEVROLET" "CHEVROLET" "MAZDA" "OLDSMOBILE" ...
 $ model_name       : chr [1:100] "Astro Van" "Lumina" "MX-5" "Ninety Eight (98)" ...
 $ body_class       : chr [1:100] "Van" "Sedan/Saloon" "Convertible/Cabriolet" "Sedan/Saloon" ...
 $ vehicle_type     : chr [1:100] "MULTIPURPOSE PASSENGER VEHICLE (MPV)" "PASSENGER CAR" "PASSENGER CAR" "PASSENGER CAR" ...
 $ msrp             : int [1:100] 3995 4995 9999 0 0 6971 0 0 0 0 ...
 $ ask_price        : int [1:100] 3995 4995 7324 0 0 6971 0 0 0 0 ...
 $ mileage          : int [1:100] 199997 0 31798 154280 0 0 0 180000 0 0 ...
 $ is_new           : chr [1:100] "False" "False" "False" "False" ...
 $ color            : chr [1:100] "Beige" "Tan" "White" "Red" ...
 $ interior_color   : chr [1:100] "N/A" "Beige" "Black" "N/A" ...
 $ doors            : int [1:100] NA 4 2 4 NA 2 4 4 NA 4 ...
 $ fuel_type_primary: chr [1:100] "Gasoline" "Gasoline" "Gasoline" "Gasoline" ...
 $ seat_belts_all   : chr [1:100] NA "Automatic" "Manual" "Automatic" ...

# display dataset dimensions
glue("Dimension Information:\n") # print statement without "" showing

Dimension Information:

dim(autoInfoSample_1)

[1] 100  14

# what model year is the most recurring in the dataset
glue("Most Recurring 'ModelYear':\n")

Most Recurring 'ModelYear':

modelYearMode <- autoInfoData %>%
  count(model_year) %>% # count of 'ModelYear'
  arrange(desc(n)) %>%
  collect() %>% # needs to collect first, before outputting
  slice(1) %>% # selects first row of arranged table
  pull(model_year) # value extraction is performed

modelYearMode

[1] 2019

Timeframe of Data:

# data timeframe
autotime <- autoInfoData %>%
  summarise(
    earliest_firstSeen = min(as.Date(first_seen), na.rm = TRUE),
    recent_lastSeen = max(as.Date(last_seen), na.rm = TRUE)
  ) %>% 
  collect()

# print readable text output
glue(
  "Date Information:\n", 
  "The earliest seen date is: {autotime$earliest_firstSeen}\n",
  "The latest seen date is: {autotime$recent_lastSeen}"
)

Date Information:
The earliest seen date is: 2016-04-05
The latest seen date is: 2020-05-31

Output First 10 Rows of Dataset as an Excel Sheet to View Structure:

# extract first 10 rows
#sampleDataTen <- autoInfoData %>%
  #head(10) %>%
  #collect()  # bring data into memory

After examining the data structure, who else might be interested?

Since the main audience of the dataset is market analysts for price forecasting, this data might also further appeal to consumers, as well as manufacturers, based on what aspects of a car (such as body style or other features) are most prominent in dealerships.

What specific questions would they have?

How does the presence of specific features and different configurations affect the overall price?
How much does brand potentially impact the overall pricing of vehicles when compared to the manufacturer’s price and the dealer’s price?
What vehicles (such as SUVs or Trucks) are most commonly found at dealerships?
Are there any trends over the time span of the data that seem to indicate a shift in what types of vehicles are acquired by dealers (such as a shift from smaller vehicle types to larger vehicles)?

What concerns might they have about data quality?

The relevancy of some fields that are a part of the dataset, which would require adjustment/further cleaning to narrow the variable size of the dataset to only core fields.
Duplicated data, such as the model name of a vehicle that requires further cleaning/handling, should be interpreted to best interpret the data.
Some stakeholder-relevant variables/fields have some portion of missing/NA variables related; however, it is sizable enough not to hinder any reporting. As such, only acknowledgement is warranted for this matter.

C - Columns and Coverage

Summary Table of Variables (Arrow Approach):

# get column names directly from dataset
colNames <- names(autoInfoData)  

# get column types by sampling first row
sampleRow <- autoInfoData %>% 
  head(1) %>% 
  collect()

# output actual column types
colTypes <- sapply(sampleRow, function(x) class(x)[1])

# calculate missing counts using Arrow functions
missingCounts <- autoInfoData %>%
  summarise(across(everything(), ~sum(as.integer(is.na(.))))) %>%
  collect() %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "MissingQuantity")

# summary table out
variableSummary <- data.frame(
  Variable = colNames,
  Type = colTypes,
  MissingQuantity = missingCounts$MissingQuantity
)

# print out summary
print(variableSummary)

                                                                     Variable
vin                                                                       vin
stock_num                                                           stock_num
first_seen                                                         first_seen
last_seen                                                           last_seen
msrp                                                                     msrp
ask_price                                                           ask_price
mileage                                                               mileage
is_new                                                                 is_new
color                                                                   color
interior_color                                                 interior_color
brand_name                                                         brand_name
model_name                                                         model_name
dealer_id                                                           dealer_id
abs                                                                       abs
active_safety_sys_note                                 active_safety_sys_note
adaptive_cruise_control                               adaptive_cruise_control
adaptive_driving_beam                                   adaptive_driving_beam
adaptive_headlights                                       adaptive_headlights
additional_error_text                                   additional_error_text
air_bag_loc_curtain                                       air_bag_loc_curtain
air_bag_loc_front                                           air_bag_loc_front
air_bag_loc_knee                                             air_bag_loc_knee
air_bag_loc_seat_cushion                             air_bag_loc_seat_cushion
air_bag_loc_side                                             air_bag_loc_side
auto_reverse_system                                       auto_reverse_system
automatic_pedestrian_alerting_sound       automatic_pedestrian_alerting_sound
axle_configuration                                         axle_configuration
axles                                                                   axles
base_price                                                         base_price
battery_a                                                           battery_a
battery_a_to                                                     battery_a_to
battery_cells                                                   battery_cells
battery_info                                                     battery_info
battery_k_wh                                                     battery_k_wh
battery_k_wh_to                                               battery_k_wh_to
battery_modules                                               battery_modules
battery_packs                                                   battery_packs
battery_type                                                     battery_type
battery_v                                                           battery_v
battery_v_to                                                     battery_v_to
bed_length_in                                                   bed_length_in
bed_type                                                             bed_type
blind_spot_mon                                                 blind_spot_mon
body_cab_type                                                   body_cab_type
body_class                                                         body_class
brake_system_desc                                           brake_system_desc
brake_system_type                                           brake_system_type
bus_floor_config_type                                   bus_floor_config_type
bus_length                                                         bus_length
bus_type                                                             bus_type
can_aacn                                                             can_aacn
cib                                                                       cib
cash_for_clunkers                                           cash_for_clunkers
charger_level                                                   charger_level
charger_power_kw                                             charger_power_kw
cooling_type                                                     cooling_type
curb_weight_lb                                                 curb_weight_lb
custom_motorcycle_type                                 custom_motorcycle_type
daytime_running_light                                   daytime_running_light
destination_market                                         destination_market
displacement_cc                                               displacement_cc
displacement_ci                                               displacement_ci
displacement_l                                                 displacement_l
doors                                                                   doors
drive_type                                                         drive_type
driver_assist                                                   driver_assist
dynamic_brake_support                                   dynamic_brake_support
edr                                                                       edr
esc                                                                       esc
ev_drive_unit                                                   ev_drive_unit
electrification_level                                   electrification_level
engine_configuration                                     engine_configuration
engine_cycles                                                   engine_cycles
engine_cylinders                                             engine_cylinders
engine_hp                                                           engine_hp
engine_hp_to                                                     engine_hp_to
engine_kw                                                           engine_kw
engine_manufacturer                                       engine_manufacturer
engine_model                                                     engine_model
entertainment_system                                     entertainment_system
forward_collision_warning                           forward_collision_warning
fuel_injection_type                                       fuel_injection_type
fuel_type_primary                                           fuel_type_primary
fuel_type_secondary                                       fuel_type_secondary
gcwr                                                                     gcwr
gcwr_to                                                               gcwr_to
gvwr                                                                     gvwr
gvwr_to                                                               gvwr_to
keyless_ignition                                             keyless_ignition
lane_departure_warning                                 lane_departure_warning
lane_keep_system                                             lane_keep_system
lower_beam_headlamp_light_source             lower_beam_headlamp_light_source
make                                                                     make
make_id                                                               make_id
manufacturer                                                     manufacturer
manufacturer_id                                               manufacturer_id
model                                                                   model
model_id                                                             model_id
motorcycle_chassis_type                               motorcycle_chassis_type
motorcycle_suspension_type                         motorcycle_suspension_type
ncsa_body_type                                                 ncsa_body_type
ncsa_make                                                           ncsa_make
ncsa_map_exc_approved_by                             ncsa_map_exc_approved_by
ncsa_map_exc_approved_on                             ncsa_map_exc_approved_on
ncsa_mapping_exception                                 ncsa_mapping_exception
ncsa_model                                                         ncsa_model
ncsa_note                                                           ncsa_note
note                                                                     note
other_bus_info                                                 other_bus_info
other_engine_info                                           other_engine_info
other_motorcycle_info                                   other_motorcycle_info
other_restraint_system_info                       other_restraint_system_info
other_trailer_info                                         other_trailer_info
park_assist                                                       park_assist
pedestrian_automatic_emergency_braking pedestrian_automatic_emergency_braking
plant_city                                                         plant_city
plant_company_name                                         plant_company_name
plant_country                                                   plant_country
plant_state                                                       plant_state
possible_values                                               possible_values
pretensioner                                                     pretensioner
rear_cross_traffic_alert                             rear_cross_traffic_alert
rear_visibility_system                                 rear_visibility_system
sae_automation_level                                     sae_automation_level
sae_automation_level_to                               sae_automation_level_to
seat_belts_all                                                 seat_belts_all
seat_rows                                                           seat_rows
seats                                                                   seats
semiautomatic_headlamp_beam_switching   semiautomatic_headlamp_beam_switching
series                                                                 series
series2                                                               series2
steering_location                                           steering_location
suggested_vin                                                   suggested_vin
tpms                                                                     tpms
top_speed_mph                                                   top_speed_mph
track_width                                                       track_width
traction_control                                             traction_control
trailer_body_type                                           trailer_body_type
trailer_length                                                 trailer_length
trailer_type                                                     trailer_type
transmission_speeds                                       transmission_speeds
transmission_style                                         transmission_style
trim                                                                     trim
trim2                                                                   trim2
turbo                                                                   turbo
vin_2                                                                   vin_2
valve_train_design                                         valve_train_design
vehicle_type                                                     vehicle_type
wheel_base_long                                               wheel_base_long
wheel_base_short                                             wheel_base_short
wheel_base_type                                               wheel_base_type
wheel_size_front                                             wheel_size_front
wheel_size_rear                                               wheel_size_rear
wheels                                                                 wheels
windows                                                               windows
model_year                                                         model_year
                                                    Type MissingQuantity
vin                                            character               0
stock_num                                      character            6806
first_seen                                          Date               0
last_seen                                           Date               0
msrp                                             integer               0
ask_price                                        integer               0
mileage                                          integer               0
is_new                                         character               0
color                                          character               4
interior_color                                 character               1
brand_name                                     character            1260
model_name                                     character            5843
dealer_id                                        integer               0
abs                                            character         3364367
active_safety_sys_note                         character         4858219
adaptive_cruise_control                        character         4732470
adaptive_driving_beam                          character         4859853
adaptive_headlights                            character         5692273
additional_error_text                          character         5663198
air_bag_loc_curtain                            character         2346025
air_bag_loc_front                              character          302911
air_bag_loc_knee                               character         4038037
air_bag_loc_seat_cushion                       character         5364405
air_bag_loc_side                               character          612020
auto_reverse_system                            character         3724453
automatic_pedestrian_alerting_sound            character         5611953
axle_configuration                             character         5694492
axles                                            integer         3595438
base_price                                       numeric         3838272
battery_a                                        integer         5695013
battery_a_to                           vctrs_unspecified         5695015
battery_cells                                    integer         5695013
battery_info                                   character         5681514
battery_k_wh                                     numeric         5693362
battery_k_wh_to                                  integer         5694777
battery_modules                        vctrs_unspecified         5695015
battery_packs                                    integer         5634621
battery_type                                   character         5676852
battery_v                                        integer         5685448
battery_v_to                           vctrs_unspecified         5695015
bed_length_in                                    integer         5689399
bed_type                                       character         3628238
blind_spot_mon                                 character         4297756
body_cab_type                                  character         3011749
body_class                                     character           11559
brake_system_desc                              character         5590227
brake_system_type                              character         3409195
bus_floor_config_type                          character           29202
bus_length                             vctrs_unspecified         5695015
bus_type                                       character           29202
can_aacn                                       character         4537162
cib                                            character         4522241
cash_for_clunkers                      vctrs_unspecified         5695015
charger_level                                  character         5694818
charger_power_kw                                 integer         5693367
cooling_type                                   character         5193992
curb_weight_lb                                   integer         5577825
custom_motorcycle_type                         character            3852
daytime_running_light                          character         3719303
destination_market                             character         5551219
displacement_cc                                  numeric           46149
displacement_ci                                  numeric           46149
displacement_l                                   numeric           46149
doors                                            integer          780801
drive_type                                     character         1411574
driver_assist                                  character         5692273
dynamic_brake_support                          character         3657055
edr                                            character         5632193
esc                                            character         3507431
ev_drive_unit                                  character         5694160
electrification_level                          character         5615959
engine_configuration                           character         2198693
engine_cycles                                    integer         5299045
engine_cylinders                                 integer          416098
engine_hp                                        numeric         2517224
engine_hp_to                                     numeric         5500857
engine_kw                                        numeric         2518459
engine_manufacturer                            character         2561190
engine_model                                   character         2416151
entertainment_system                           character         5482863
forward_collision_warning                      character         4423758
fuel_injection_type                            character         4293878
fuel_type_primary                              character          166433
fuel_type_secondary                            character         5157971
gcwr                                   vctrs_unspecified         5695015
gcwr_to                                vctrs_unspecified         5695015
gvwr                                           character         1785733
gvwr_to                                vctrs_unspecified         5695015
keyless_ignition                               character         4147281
lane_departure_warning                         character         4508384
lane_keep_system                               character         4629374
lower_beam_headlamp_light_source               character         5682593
make                                           character            1260
make_id                                          numeric            1260
manufacturer                                   character            1260
manufacturer_id                                  integer            1260
model                                          character            5843
model_id                                         integer            5843
motorcycle_chassis_type                        character            3919
motorcycle_suspension_type                     character            3916
ncsa_body_type                         vctrs_unspecified         5695015
ncsa_make                              vctrs_unspecified         5695015
ncsa_map_exc_approved_by               vctrs_unspecified         5695015
ncsa_map_exc_approved_on               vctrs_unspecified         5695015
ncsa_mapping_exception                 vctrs_unspecified         5695015
ncsa_model                             vctrs_unspecified         5695015
ncsa_note                                      character         5680634
note                                           character         5043647
other_bus_info                         vctrs_unspecified         5695015
other_engine_info                              character         3805614
other_motorcycle_info                          character         5693679
other_restraint_system_info                    character         3794935
other_trailer_info                             character         5695005
park_assist                                    character         5288671
pedestrian_automatic_emergency_braking         character         5136236
plant_city                                     character          614448
plant_company_name                             character         1568615
plant_country                                  character          178494
plant_state                                    character         1522665
possible_values                                character         5687691
pretensioner                                   character         5276379
rear_cross_traffic_alert                       character         5682328
rear_visibility_system                         character         3534758
sae_automation_level                             integer         5694761
sae_automation_level_to                vctrs_unspecified         5695015
seat_belts_all                                 character          358162
seat_rows                                        integer         3769624
seats                                            integer         3590951
semiautomatic_headlamp_beam_switching          character         3705261
series                                         character         1404119
series2                                        character         5429563
steering_location                              character         3129603
suggested_vin                                  character         5663196
tpms                                           character         1002845
top_speed_mph                                    integer         4797588
track_width                                      numeric         5640611
traction_control                               character         3600224
trailer_body_type                              character            1299
trailer_length                                   integer         5694994
trailer_type                                   character            1322
transmission_speeds                              integer         4401493
transmission_style                             character         3770877
trim                                           character         3445574
trim2                                          character         5575717
turbo                                          character         4488758
vin_2                                          character               0
valve_train_design                             character         3494358
vehicle_type                                   character            1260
wheel_base_long                                  numeric         5607936
wheel_base_short                                 numeric         3632320
wheel_base_type                                character         5515108
wheel_size_front                                 integer         3964805
wheel_size_rear                                  integer         3965238
wheels                                           integer         3506892
windows                                          integer         5357885
model_year                                       integer            1282

A - Aggregates: Overall Picture

# get comprehensive dataset statistics
sample_data <- autoInfoData %>% 
  head(1) %>% 
  collect()

# get numeric column names (excluding dealer_id and model_year)
numeric_cols <- names(sample_data)[sapply(sample_data, is.numeric)]
numeric_cols <- setdiff(numeric_cols, c("dealer_id", "model_year"))

# Now summarize using explicit column names
autoNumStats <- autoInfoData %>% 
  summarise(
    total_rows = n(),
    across(all_of(numeric_cols), list(
      average = ~mean(.x, na.rm = TRUE)
    ))
  ) %>%
  collect()

# display as a formatted table
print(autoNumStats)

# A tibble: 1 × 40
  total_rows msrp_average ask_price_average mileage_average axles_average
       <int>        <dbl>             <dbl>           <dbl>         <dbl>
1    5695015      744570.           186921.          22415.          2.00
# ℹ 35 more variables: base_price_average <dbl>, battery_a_average <dbl>,
#   battery_cells_average <dbl>, battery_k_wh_average <dbl>,
#   battery_k_wh_to_average <dbl>, battery_packs_average <dbl>,
#   battery_v_average <dbl>, bed_length_in_average <dbl>,
#   charger_power_kw_average <dbl>, curb_weight_lb_average <dbl>,
#   displacement_cc_average <dbl>, displacement_ci_average <dbl>,
#   displacement_l_average <dbl>, doors_average <dbl>, …

N - Notable Segments

Analyze key categorical variables and modify based on specific data.

Calculation Formatting:

# must be run prior to conducting calculations for formatting
options(scipen = 999) # used to avoid scientific notation
# count of total records
totalRecords <- nrow(autoInfoData)

Counting by Brand:

# OVERALL BRAND COUNT
# selecting car 'brandName' to observe frequency values
brandCount <- autoInfoData %>%
  # group by 'brandName' Variable
  group_by(brand_name) %>% 
  # conducts count
  summarise(Frequency = n()) %>%   
  # calculate percentage in respect to the dataset
  mutate(Percentage = (Frequency / sum(Frequency)) * 100) %>% 
  # sort from highest to lowest frequency
  arrange(desc(Frequency)) %>% 
  collect()

# print out table
print(brandCount)

# A tibble: 111 × 3
   brand_name Frequency Percentage
   <chr>          <int>      <dbl>
 1 CHEVROLET     890213      15.6 
 2 FORD          782063      13.7 
 3 TOYOTA        398976       7.01
 4 JEEP          369067       6.48
 5 NISSAN        312876       5.49
 6 HONDA         278725       4.89
 7 HYUNDAI       264381       4.64
 8 GMC           232111       4.08
 9 DODGE         218588       3.84
10 VOLKSWAGEN    210886       3.70
# ℹ 101 more rows

Note: For conducting pricing based calculations, the median measure was used to avoid any skewness from outlier pricing from specific vehicle listings.

Pricing Differences Clarification:

“msrp” - Manufacturer’s Suggested Retail Price
“ask_price” - Last price seen before vehicle was sold

Brand & Model Median Pricing:

# MODEL NAME & VEHICLE TYPE COUNT 
modelCount <- autoInfoData %>%
  # group by 'brandName','modelName', 'VechicleType' Variable
  group_by(brand_name, model_name, vehicle_type) %>% 
  # conducts count
  summarise(
     MedianMSRP = round(median(msrp), 2), 
     MedianAskPrice = round(median(ask_price), 2), 
     Frequency = n()) %>%
  # calculate percentage in respect to the dataset
  mutate(Percentage = (Frequency /totalRecords) * 100) %>% 
  # calculates average 'askPrice'  # sort from highest to lowest frequency
  arrange(desc(Frequency)) %>% 
  collect()

Warning: median() currently returns an approximate median in Arrow
This warning is displayed once per session.

# print out table
print(modelCount)

# A tibble: 1,683 × 7
# Groups:   brand_name, model_name [1,548]
   brand_name model_name     vehicle_type    MedianMSRP MedianAskPrice Frequency
   <chr>      <chr>          <chr>                <dbl>          <dbl>     <int>
 1 FORD       F-150          "TRUCK "            30973.         29079.    175036
 2 CHEVROLET  Silverado      "TRUCK "            31910.         29953.    161960
 3 CHEVROLET  Equinox        "MULTIPURPOSE …     22080.         20290.    157648
 4 FORD       Escape         "MULTIPURPOSE …     17825.         16600.    117659
 5 JEEP       Grand Cherokee "MULTIPURPOSE …     30117.         28704.    104822
 6 CHEVROLET  Malibu         "PASSENGER CAR"     16614.         15644.     96262
 7 RAM        1500           "TRUCK "            32585.         30383.     86582
 8 FORD       Explorer       "MULTIPURPOSE …     28056.         26649.     83510
 9 GMC        Sierra         "TRUCK "            40564.         37718.     81360
10 FORD       Fusion         "PASSENGER CAR"     15831.         14704.     80275
# ℹ 1,673 more rows
# ℹ 1 more variable: Percentage <dbl>

# sum(modelCount$Percentage) ~ this is a check to make sure values
# do equal 100

Most Common Vehicle Type:

#MOST COMMON VEHICLE TYPE
vehicleTypeMode <- autoInfoData %>%
  filter(!is.na(vehicle_type)) %>%   
  count(vehicle_type) %>%        
  arrange(desc(n)) %>%
  collect() %>%
  slice(1) %>%                   
  pull(vehicle_type)                

glue("Mode of 'VehicleType':", vehicleTypeMode, "\n")

Mode of 'VehicleType':MULTIPURPOSE PASSENGER VEHICLE (MPV)

Vehicle Plant Data Based on US State:

#USA MANUFACTURING TABLE
plantTable <- autoInfoData %>%
  # only focus on data from the US
  filter(plant_country %in% c("UNITED STATES (USA)")) %>% 
  # removes NA values
  filter(!is.na(plant_state)) %>% 
  # group by Plant State
  group_by(plant_state) %>%                           
  # counts values       
  summarise(Frequency = n(), .groups = "drop") %>%   
  # arrange count from high to low            
  arrange(desc(Frequency)) %>% 
  collect()

#print out table
print(plantTable)

# A tibble: 32 × 2
   plant_state Frequency
   <chr>           <int>
 1 MICHIGAN       677742
 2 OHIO           322011
 3 INDIANA        311711
 4 KENTUCKY       288568
 5 TENNESSEE      261068
 6 ALABAMA        220984
 7 ILLINOIS       176490
 8 MISSOURI       172641
 9 KANSAS         110277
10 TEXAS          107274
# ℹ 22 more rows

Plant Data Based on Country:

# MANUFACTURING BY COUNTRY TABLE
countryPlantTable <- autoInfoData %>%
  filter(!is.na(plant_country)) %>%                   
  # group by 'PlantCountry'         
  group_by(plant_country) %>%                         
  # counts values       
  summarise(Frequency = n(), .groups = "drop") %>%   
  # arrange count from high to low         
  arrange(desc(Frequency))

#print out table
print(countryPlantTable)

FileSystemDataset (query)
plant_country: string
Frequency: int64

* Sorted by Frequency [desc]
See $.data for the source Arrow object

Comprehensive Assessment

Dataset Overview:

Records: There is a total of 5,695,015 records representing different car models present at dealerships across the state of Illinois.

Time Span: The time-frame of the data is roughly from 9/29/2017 to 5/30/2020.

Key Metrics:

Most recurring model year for vehicles is ‘2019’.
Most frequent car model across Illinois dealerships is ’Ford - F150”, specifically making up 3.07% of the data set.
- Median asking price for a “Ford F-150” is $28,995, when compared to an MSRP at the of $30,981.
However, Chevrolet is the most recurring brand (present frequently) across Illinois dealerships , resulting in a frequency percentage of 15.63%.
The most prominent vehicle type across dealerships is Multipurpose Passenger Vehicle (MPV).
The US is the most common manufacturer of vehicles in the data set, with the specific state being Michigan.

Data Completeness

Potential Relevant/Reliable Fields

Variable Name	Completeness
first_seen	100.00%
last_seen	100.00%
msrp	100.00%
ask_price	100.00%
mileage	100.00%
is_new	100.00%
color	100.00%
interior_color	100.00%
brand_name	99.98%
model_name	99.90%
body_class	99.80%
doors	86.29%
engine_cylinders	92.69%
fuel_type_primary	97.08%
model_year	99.98%
plant_city	89.21%
plant_country	96.87%
vehicle_type	99.98%
body_class	99.80%

Data Quality Strengths

What aspects are high quality?

The amount of data present in the dataset makes it overall high quality, in that there are various measures taken on the aspects or components related to a vehicle. Along with other attributes that I did not initially consider but are still highly relevant to the overall analysis geared towards price observation/forecasting.
What makes this reliable?

What makes it overall reliable is that most of the crucial/core fields that are a part of my analysis are, for the most part, complete, with some exceptions that do not have a completeness rate of at least 85%. However, those specific fields that are not at that preferred completeness rate can still be used for analysis in other areas, serving as secondary support for other aspects of the exploratory analysis.
What coverage is excellent?

The coverage that is excellent coverage is the information reporting on the exterior physical aspects of vehicles, along with any geographical-related information (such as fields related to manufacturer location) related to each vehicle, which demonstrates sufficient coverage/completeness for analysis.

Data Quality Concerns

What are the main issues?

The main issue with the data is that most technical fields have N/A or missing values that may limit performing certain observations, or not at all possible due to the amount of coverage missing within these specific fields. However, recognizing the scope of the analysis (focusing primarily on body class), the variables that are missing significant amounts of fields may not pose significant challenges during this process.
What might limit analysis?

As previously explained, there is a large absence of information on a majority of the technical variables that may impact the practicality of using these specific variables for analysis (due to not having a complete technical picture). This has limited the analysis by focusing on variables that are close to, if not 100% complete, and are relevant to the new approach analysis regarding visualizing increasing price over time, as well as segment shifts.
What needs careful handling?

Specifically, what would require the most careful handling is performing further data cleaning and manipulation of the variables to ensure the dataset only contains information relevant to the analysis. Additionally, ensuring that the other variables I focus on are still relevant to the overall objective I set with the analysis and match the revised scope of what information I want to display from the analysis.

Missing Data Impact

Most Missing:

All car-technical- feature variables such as the LaneKeepSystem’ field which has one of the lowest completeness rate of just 18.71% .
Impact on Analysis:

Since the majority of the specific field is missing, this may hinder overall reporting when performing analysis (such as skewness of values).
Handling Strategy:

The strategy for handling this specific instance is by not including the technical-heavy fields that have most of their values missing or NA. My overall approach in selecting which variables to focus on the most is based on the scope of the analysis and overall completion. If certain variables have a completeness rate of 50% or less, they will not be included in the overall analysis. This inclusion would negatively impact the reliability of the data reporting, as the majority of the field values are invalid, which further validates the consideration of a smaller group of variables for streamlined analysis.

Variables needing caution:
- *All technical car feature variables - This will be explored in Part 4

Overall Confidence Level:

My overall confidence level remains high for the data analysis and reporting.

Justification:

My justification for why I still feel highly confident about this for the analysis (despite having a mjority of car tehcnical fields with a completeness rate of less than 50%) is based on the fact that most of the selected target variables for analysis are, for the most part, complete ( at least 86% or higher) and can be relied upon during analysis. In addition, since the scope of the analysis is based on the value of vehicles and key features or aspects that may impact overall pricing, I have a clear and specific objective I can achieve through continued analysis of the dataset. I will still continue to adjust other aspects to further present findings during analysis to potentially support or reinforce findings. However, I will still maintain caution on what other fields to consider or to incorporate to best support points from the general analysis.

Summary & Next Steps:

After successfully loading the dataset, I performed a basic analysis to address the key objective: understanding vehicle pricing and basic car features. I started by assessing data validity to check the structure, then used various features to gather insights beyond just price. I identified which vehicle brands appear most frequently across Illinois dealerships, determined the most common model year, and analyzed vehicle types. I also measured how often different categorical variables and feature sets appear in the data.

The next step is to analyze columns with missing values. I need to determine whether the data is missing at random or follows a pattern (or potentially both). Afterward, I will create a streamlined version of the dataset that focuses on the primary variables used for pricing and car segment analysis. This will lead to producing relevant visualizations and creating a “hero visualization” that captures my primary findings for presentation to stakeholders.

Tool Selection Documentation

Tools Used Throughout the Analysis:

Arrow (< 5GB, simple operations)

That was essential to use since it enabled loading the dataset initially to get a general view of the dataset, then creating revised parquet files to increase efficiency in loading the dataset.

DuckDB (5-50GB, complex analytics)

This is also important to use, as it takes the cleaned (third) parquet file I created and hosts it as a database connection to further increase efficiency in performing repeated analyses and visualizations wth the dataset.

Loading the Data Into a Database

Creating and Loading a DuckDB Table Using Parquet File:

To better approach missing data handling, I created an explicit database file, along with accompanying folder for further data loading. In addition, I made added explicit adjustments to help enhance performance in loading the revised Parquet file as a DuckDB table based on my current system specifications.

# first create folder to house database file
dbContainer <- "AutoStorage"
if (!dir.exists(dbContainer)) {
  dir.create(dbContainer, recursive = TRUE)
  glue("Created folder: {dbContainer}")
}

# Create fixed database file in the folder
dbPath  <- file.path(dbContainer, "AutoInfo.db")

# explicit performance adjustments based on prior testing 
con <- dbConnect(duckdb::duckdb(), 
                  dbdir = dbPath,  
                   config = list(
                   threads = "3",
                   memory_limit = "8GB",
                   max_memory = "8GB",
                   preserve_insertion_order = "false",
                   temp_directory = tempdir()
                 ))

# check if table already exists, skips loading if it does
table_exists <- dbExistsTable(con, "autoInfo_tbl")

# if the table does not currently exist
if (!table_exists) {
  glue("Now loading Parquet files into DuckDB table")
  
  # Create table with partitioned Parquet file in mind
  dbExecute(con, "
    CREATE TABLE autoInfo_tbl AS 
    SELECT * FROM read_parquet(
      'Extracted Data Folder/modAutoData.parquet/**/*.parquet', 
      hive_partitioning = true,
      union_by_name = true,
      filename = false  -- Skip filename column if not needed
    )
  ")
  glue("Parquet loaded as table")
} else {
  glue("Using existing DuckDB table")
}

Using existing DuckDB table

# reference the table
autoInfo_tbl <- tbl(con, "autoInfo_tbl")

# print out final connection status
glue("Connected to DuckDB")

Connected to DuckDB

Phase 2: Descriptive Analysis

Objective(s):

Systematically analyze and handle missing data

Part 4: Missing Data Analysis & Strategy

The Three Questions We Must Answer:

HOW MUCH data is missing?
WHERE is data missing (patterns)?
WHY is data missing (type)?

Quantify Missing Data

What Percentage of data is Missing for Each Variable?

Based on the quantity of variables present in the dataset (156 variables) and to make the NA observation much more efficient, I categorized the amount that a variable is missing by certain percentage thresholds, which I determined to be ideal for the analysis that consisted of the following:

Complete: Variable has no missing data
Minimal: Missing less than 5% of data
Moderate: Missing greater than 5% but less than 20% of data
Substantial: Missing between 20% - 50% of data
Severe: Missing greater than 50% of data

# calculate missing data summary using the DuckDB table
missingSummary <- autoInfo_tbl %>%
  summarise(across(everything(), ~sum(as.integer(is.na(.))))) %>%
  collect() %>%
  pivot_longer(everything(), names_to = "variable", values_to = "numMissing") %>%
  mutate(
    totalRows = nrow(autoInfo_tbl %>% collect()),
    percentMissing = round((numMissing / totalRows) * 100, 2),
    
  # created a classifier that rates how much data is missing by 4 tiers
  # allowing for easier visualization of the data 
     severity = if_else(percentMissing == 0, "Complete",
        if_else(percentMissing < 5, "Minimal",
        if_else(percentMissing < 20, "Moderate",
        if_else(percentMissing < 50, "Substantial", "Severe"))))
        )%>%
  select(variable, numMissing, percentMissing, severity) %>%
  arrange(desc(percentMissing))

Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.

# print the table
print(missingSummary)

# A tibble: 156 × 4
   variable          numMissing percentMissing severity
   <chr>                  <dbl>          <dbl> <chr>   
 1 battery_a            5695013            100 Severe  
 2 battery_a_to         5695015            100 Severe  
 3 battery_cells        5695013            100 Severe  
 4 battery_k_wh_to      5694777            100 Severe  
 5 battery_modules      5695015            100 Severe  
 6 battery_v_to         5695015            100 Severe  
 7 bus_length           5695015            100 Severe  
 8 cash_for_clunkers    5695015            100 Severe  
 9 charger_level        5694818            100 Severe  
10 gcwr                 5695015            100 Severe  
# ℹ 146 more rows

Data Quantity Missing General Visualization:

# organize the order of levels
missingSummary <- missingSummary %>%
  mutate(severity = factor(severity,
         levels = c("Complete", "Minimal", "Moderate", "Substantial", "Severe")))

# count variables in each severity category
severitySummary <- missingSummary %>%
  group_by(severity) %>%
  summarise(count = n()) %>%
  ungroup()

# create bar graph by severity category
overviewPlot <- ggplot(severitySummary, aes(x = severity, y = count, fill = severity)) +
  geom_bar(stat = "identity") +
  
  # adds labels to each bar
  geom_text(aes(label = count), vjust = -0.5, size = 4, fontface = 
    "bold") +
  
  # color scheme
  scale_fill_manual(values = c(
    "Complete" = "#42f590",
    "Minimal" = "#4290f5",
    "Moderate" = "#ffdd00",
    "Substantial" = "#f5b642",
    "Severe" = "#f54579"
  )) +
  
  # better scaling for labeling bar values
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  
  # labels and theme adjustments
  labs(title = "Distribution of Variables by Missing Data Severity",
       subtitle = "Count of variables in each severity category",
       x = "Severity Category",
       y = "Number of Variables",
       fill = "Severity") +
  
  # adjustments to the theme
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14,
        hjust = 0.5),
        plot.subtitle = element_text(size = 11, hjust = 0.5),
        axis.text.x = element_text(size = 10),
        legend.position = "none")

overviewPlot

Missing Data by Categorical Variables Visualization:

# get column types
colTypes <- autoInfo_tbl %>%
  head(1) %>%
  collect() %>%
  sapply(class)

# find categorical (character) columns
categoricalVars <- names(colTypes[colTypes == "character"])

# filter missing summary for categorical only
categoricalMissing <- missingSummary %>%
  filter(variable %in% categoricalVars) %>%
  arrange(desc(percentMissing))

# count frequency of each severity level
severityFreq <- categoricalMissing %>%
  count(severity, name = "frequency") %>%
  mutate(severity = factor(severity, levels = c("Complete", "Minimal", "Moderate", "Substantial", "Severe")))

# display frequency table
print(severityFreq)

# A tibble: 5 × 2
  severity    frequency
  <fct>           <int>
1 Complete            6
2 Minimal            17
3 Moderate            5
4 Substantial         9
5 Severe             61

# create bar plot showing frequency of severity levels
categoricalMissingPlot <- ggplot(severityFreq, 
                          aes(x = severity, 
                              y = frequency, 
                              fill = severity)) + 
  # color scheme
  scale_fill_manual(values = c(
    "Complete" = "#42f590",
    "Minimal" = "#4290f5",
    "Moderate" = "#ffdd00",
    "Substantial" = "#f5b642",
    "Severe" = "#f54579"
  )) +
  
  # better scaling for labeling bar values
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  
  # add data labels to data figures
  geom_col(width = 0.7) +
  geom_text(aes(label = frequency), vjust = -0.5, size = 4, fontface = 
    "bold") + 
  
  # title information
  labs(
    title = "Severity Distribution of Missing Data: Categorical Variables",
    x = "Severity Level",
    y = "Number of Variables",
    fill = "Severity"
  ) +
  # minimal theme
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5), 
    legend.position = "none")

# display plot
categoricalMissingPlot

Missing Data by Numerical Variable Type Visualization:

# retrieve a data sample for classifying
colTypes <- autoInfo_tbl %>%
  head(1) %>%
  collect() %>%
  sapply(class)

# find numeric columns
numericVars <- names(colTypes[colTypes %in% c("numeric", "integer", "double")])

# filter missing summary for numeric only
numericMissing <- missingSummary %>%
  filter(variable %in% numericVars) %>%
  # arrange by descending
  arrange(desc(percentMissing))

# frequency count for each severity level
severityFreqNumeric <- numericMissing %>%
  count(severity, name = "frequency") %>%
  mutate(severity = factor(severity, levels = c("Complete", "Minimal", "Moderate", "Substantial", "Severe")))

# display frequency table
print(severityFreqNumeric)

# A tibble: 5 × 2
  severity    frequency
  <fct>           <int>
1 Complete            4
2 Minimal             6
3 Moderate            2
4 Substantial         2
5 Severe             42

# create bar plot showing frequency of severity levels
numericMissingPlot <- ggplot(severityFreqNumeric, 
                      aes(x = severity, 
                          y = frequency, 
                          fill = severity)) +
  # color scheme
  scale_fill_manual(values = c(
    "Complete" = "#42f590",
    "Minimal" = "#4290f5",
    "Moderate" = "#ffdd00",
    "Substantial" = "#f5b642",
    "Severe" = "#f54579"
  )) +
  
  # better scaling for labeling bar values
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  
  # add data labels to data figures
  geom_col(width = 0.7) +
  geom_text(aes(label = frequency), vjust = -0.5, size = 4, fontface = 
    "bold") + 
  
  # title information
  labs(
    title = "Severity Distribution of Missing Data: Numeric Variables",
    x = "Severity Level",
    y = "Number of Variables"
  ) +
  # minimal theme
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    legend.position = "none")

# display plot
numericMissingPlot

Questions:

Which variable has the MOST missing data?
Which variables are nearly complete (< 5% missing)?
Are any variables severely incomplete (> 50% missing)?
Does the amount of missingness concern you for the analysis?

Notes:

61 categorical variables and 42 numerical variables are classified as “Severe”. From that, there are approximately 23 fields that have 100% of its data missing.
These are following variables that are nearly complete with less than 5% of data missing:
- plant_country
- fuel_type_primary
- displacement_cc
- displacement_ci
- displacement_l
- bus_floor_config_type
- bus_type
- body_class
- stock_num
- model_name
- model
- model_id
- custom_motorcycle_type
- motorcycle_chassis_type
- motorcycle_suspension_type
- brand_name
- make
- make_id
- manufacturer
- manufacturer_id
- trailer_body_type
- trailer_type
- vehicle_type
There apppears to be 103 variables that are missing greater than 50% of it’s data (“Severe”).
The amount of missingness in the data does not concern me, as the variables that are missing large amounts of data are not apart of the scope of the ask price analysis.

Missing Data Severity Level (Complete & Minimal) ~ Table & Excel Output:

# formatting table
nonSevereList <- missingSummary %>%
  filter(severity %in% c("Minimal", "Complete")) %>%
  arrange(severity, desc(percentMissing)) %>%
  select(severity, variable, percentMissing)

# complete variables
completeVars <- nonSevereList %>% filter(severity == "Complete")
print(as.data.frame(completeVars), row.names = FALSE)

 severity       variable percentMissing
 Complete            vin              0
 Complete     first_seen              0
 Complete      last_seen              0
 Complete           msrp              0
 Complete      ask_price              0
 Complete        mileage              0
 Complete         is_new              0
 Complete          color              0
 Complete interior_color              0
 Complete      dealer_id              0
 Complete          vin_2              0
 Complete     model_year              0

# minimal variables
minimalVars <- nonSevereList %>% filter(severity == "Minimal")
print(as.data.frame(minimalVars), row.names = FALSE)

 severity                   variable percentMissing
  Minimal              plant_country           3.13
  Minimal          fuel_type_primary           2.92
  Minimal            displacement_cc           0.81
  Minimal            displacement_ci           0.81
  Minimal             displacement_l           0.81
  Minimal      bus_floor_config_type           0.51
  Minimal                   bus_type           0.51
  Minimal                 body_class           0.20
  Minimal                  stock_num           0.12
  Minimal                 model_name           0.10
  Minimal                      model           0.10
  Minimal                   model_id           0.10
  Minimal     custom_motorcycle_type           0.07
  Minimal    motorcycle_chassis_type           0.07
  Minimal motorcycle_suspension_type           0.07
  Minimal                 brand_name           0.02
  Minimal                       make           0.02
  Minimal                    make_id           0.02
  Minimal               manufacturer           0.02
  Minimal            manufacturer_id           0.02
  Minimal          trailer_body_type           0.02
  Minimal               trailer_type           0.02
  Minimal               vehicle_type           0.02

# -- WARNING: OUTPUT EXCEL FILE ---
#write_xlsx(
  #list(
    #Complete = as.data.frame(completeVars),
    #Minimal = as.data.frame(minimalVars),
    #All_NonSevere = as.data.frame(nonSevereList)
  #),
  #path = "non_severe_missing_data.xlsx"
#)

Spreadsheet Link:

https://docs.google.com/spreadsheets/d/1K_awd910r2otieFTOpU79y-r4-UFHvVG/edit?usp=sharing&ouid=102065865487880001963&rtpof=true&sd=true

Missing Severity Level (Moderate, Substantial, & Severe) ~ Table & Excel Output:

# formatting table
severityList <- missingSummary %>%
  filter(severity %in% c("Moderate", "Substantial", "Severe")) %>%
  arrange(severity, desc(percentMissing)) %>%
  select(severity, variable, percentMissing)

# moderate variables
moderateVars <- severityList %>% filter(severity == "Moderate")
print(as.data.frame(moderateVars), row.names = FALSE)

 severity          variable percentMissing
 Moderate              tpms          17.61
 Moderate             doors          13.71
 Moderate        plant_city          10.79
 Moderate  air_bag_loc_side          10.75
 Moderate  engine_cylinders           7.31
 Moderate    seat_belts_all           6.29
 Moderate air_bag_loc_front           5.32

# substantial variables
substantialVars <- severityList %>% filter(severity == "Substantial")
print(as.data.frame(substantialVars), row.names = FALSE)

    severity             variable percentMissing
 Substantial  engine_manufacturer          44.97
 Substantial            engine_kw          44.22
 Substantial            engine_hp          44.20
 Substantial         engine_model          42.43
 Substantial  air_bag_loc_curtain          41.19
 Substantial engine_configuration          38.61
 Substantial                 gvwr          31.36
 Substantial   plant_company_name          27.54
 Substantial          plant_state          26.74
 Substantial           drive_type          24.79
 Substantial               series          24.66

# severe variables
severeVars <- severityList %>% filter(severity == "Severe")
print(as.data.frame(severeVars), row.names = FALSE)

 severity                               variable percentMissing
   Severe                              battery_a         100.00
   Severe                           battery_a_to         100.00
   Severe                          battery_cells         100.00
   Severe                        battery_k_wh_to         100.00
   Severe                        battery_modules         100.00
   Severe                           battery_v_to         100.00
   Severe                             bus_length         100.00
   Severe                      cash_for_clunkers         100.00
   Severe                          charger_level         100.00
   Severe                                   gcwr         100.00
   Severe                                gcwr_to         100.00
   Severe                                gvwr_to         100.00
   Severe                         ncsa_body_type         100.00
   Severe                              ncsa_make         100.00
   Severe               ncsa_map_exc_approved_by         100.00
   Severe               ncsa_map_exc_approved_on         100.00
   Severe                 ncsa_mapping_exception         100.00
   Severe                             ncsa_model         100.00
   Severe                         other_bus_info         100.00
   Severe                     other_trailer_info         100.00
   Severe                   sae_automation_level         100.00
   Severe                sae_automation_level_to         100.00
   Severe                         trailer_length         100.00
   Severe                     axle_configuration          99.99
   Severe                          ev_drive_unit          99.98
   Severe                  other_motorcycle_info          99.98
   Severe                           battery_k_wh          99.97
   Severe                       charger_power_kw          99.97
   Severe                    adaptive_headlights          99.95
   Severe                          driver_assist          99.95
   Severe                          bed_length_in          99.90
   Severe                        possible_values          99.87
   Severe                              battery_v          99.83
   Severe       lower_beam_headlamp_light_source          99.78
   Severe               rear_cross_traffic_alert          99.78
   Severe                           battery_info          99.76
   Severe                              ncsa_note          99.75
   Severe                           battery_type          99.68
   Severe                  additional_error_text          99.44
   Severe                          suggested_vin          99.44
   Severe                            track_width          99.04
   Severe                          battery_packs          98.94
   Severe                                    edr          98.90
   Severe                  electrification_level          98.61
   Severe    automatic_pedestrian_alerting_sound          98.54
   Severe                        wheel_base_long          98.47
   Severe                      brake_system_desc          98.16
   Severe                         curb_weight_lb          97.94
   Severe                                  trim2          97.91
   Severe                     destination_market          97.48
   Severe                        wheel_base_type          96.84
   Severe                           engine_hp_to          96.59
   Severe                   entertainment_system          96.27
   Severe                                series2          95.34
   Severe               air_bag_loc_seat_cushion          94.19
   Severe                                windows          94.08
   Severe                          engine_cycles          93.05
   Severe                            park_assist          92.86
   Severe                           pretensioner          92.65
   Severe                           cooling_type          91.20
   Severe                    fuel_type_secondary          90.57
   Severe pedestrian_automatic_emergency_braking          90.19
   Severe                                   note          88.56
   Severe                  adaptive_driving_beam          85.34
   Severe                 active_safety_sys_note          85.31
   Severe                          top_speed_mph          84.24
   Severe                adaptive_cruise_control          83.10
   Severe                       lane_keep_system          81.29
   Severe                               can_aacn          79.67
   Severe                                    cib          79.41
   Severe                 lane_departure_warning          79.16
   Severe                                  turbo          78.82
   Severe              forward_collision_warning          77.68
   Severe                    transmission_speeds          77.29
   Severe                         blind_spot_mon          75.47
   Severe                    fuel_injection_type          75.40
   Severe                       keyless_ignition          72.82
   Severe                       air_bag_loc_knee          70.90
   Severe                        wheel_size_rear          69.63
   Severe                       wheel_size_front          69.62
   Severe                             base_price          67.40
   Severe                      other_engine_info          66.82
   Severe            other_restraint_system_info          66.64
   Severe                     transmission_style          66.21
   Severe                              seat_rows          66.19
   Severe                    auto_reverse_system          65.40
   Severe                  daytime_running_light          65.31
   Severe  semiautomatic_headlamp_beam_switching          65.06
   Severe                  dynamic_brake_support          64.22
   Severe                       wheel_base_short          63.78
   Severe                               bed_type          63.71
   Severe                       traction_control          63.22
   Severe                                  axles          63.13
   Severe                                  seats          63.05
   Severe                 rear_visibility_system          62.07
   Severe                                    esc          61.59
   Severe                                 wheels          61.58
   Severe                     valve_train_design          61.36
   Severe                                   trim          60.50
   Severe                      brake_system_type          59.86
   Severe                                    abs          59.08
   Severe                      steering_location          54.95
   Severe                          body_cab_type          52.88

# -- WARNING: OUTPUT EXCEL FILE ---
  #list(
    #Moderate = as.data.frame(moderateVars),
    #Substantial = as.data.frame(substantialVars),
    #Severe = as.data.frame(severeVars),
    #All_Severity = as.data.frame(severityList)
  #),
  #path = "severe_missing_data.xlsx"
#)

Spreadsheet Link:

https://docs.google.com/spreadsheets/d/1_9mT92Fr8l6y5VPSJvOD4VyvsUvm6msQ/edit?usp=sharing&ouid=102065865487880001963&rtpof=true&sd=true

Pattern 1: Missing by Categorical Groups

Does missingness vary by groups or over time?

For this, I am grouping missing data counts by brand name based on the rationale that for some brands, it could possibly be harder to view some technical specifications that are a part of the dataset.

Missing Data by Brand Name Table:

# analyze average missing data by brand_name
missingSummaryBrand <- autoInfo_tbl %>%
  group_by(brand_name) %>%
  summarise(across(everything(), ~sum(as.integer(is.na(.))))) %>%
  collect() %>%
  pivot_longer(-brand_name, names_to = "variable", values_to = "numMissing") %>%
  
  # add brand counts for calculating percentages
  left_join(
    autoInfo_tbl %>% 
      group_by(brand_name) %>% 
      count() %>% 
      collect(),
    by = "brand_name"
  ) %>%
  
  mutate(
    # use the count column for total rows per brand
    totalRows = n,
    percentMissing = round((numMissing / totalRows) * 100, 2)
  ) %>%
  
  # calculate average percentage missing per brand
  group_by(brand_name) %>%
  summarise(
    totalRows = first(totalRows),
    avgPercentMissing = round(mean(percentMissing), 2)
  ) %>%
  
  arrange(desc(avgPercentMissing))

# print the table
print(missingSummaryBrand)

# A tibble: 111 × 3
   brand_name            totalRows avgPercentMissing
   <chr>                     <dbl>             <dbl>
 1 <NA>                       1260              91.5
 2 HINO                         12              83.9
 3 MACK                          2              83.9
 4 OPEL                         14              82.6
 5 CHEVROLET GEO                 8              79.4
 6 MERCEDES-BENZ MAYBACH        35              78.9
 7 HYUNDAI KIA                  55              76.3
 8 HYUNDAI GENESIS               1              74.8
 9 NISSAN INFINITI              41              74.5
10 CHEVROLET NISSAN              1              74.2
# ℹ 101 more rows

Average Missing Data Percentage by Brand Visualization (Top 10 Brands):

# get top 10 brands by frequency in the dataset
top10_brands <- autoInfo_tbl %>%
  count(brand_name, sort = TRUE) %>%
  head(10) %>%
  collect()

# filter missing summary to only include top 10 most frequent brands
top10_missing <- missingSummaryBrand  %>%
  filter(brand_name %in% top10_brands$brand_name)

# create bar graph visualization for top 10 brands
brandMissingPlot <- ggplot(top10_missing, aes(x = reorder(brand_name, -avgPercentMissing), y = avgPercentMissing)) +
  
  # add bars with red fill to indicate missing data
  geom_bar(stat = "identity", fill = "#e74c3c") +
 
   # add percentage labels above each bar
  geom_text(aes(label = paste0(avgPercentMissing, "%")), 
            vjust = -0.5, size = 4, fontface = "bold") +
 
   # set title and axis labels
  labs(title = "Average Missing Data - Top 10 Most Frequent Brands",
       subtitle = "Average percentage of missing values across all variables",
       x = "Brand Name",
       y = "Average Percent Missing (%)") +
 
   # apply minimal theme for clean appearance
  theme_minimal() +
 
   # customize theme elements
  theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
        plot.subtitle = element_text(size = 11, hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        panel.grid.major.x = element_blank()) +
 
   # expand y-axis to prevent labels from being cut off
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))

# display plot
brandMissingPlot

Top 10 Brands Frequency Count Table:

# create table counting rows for top 10 most frequent brands
top10_brand_counts <- autoInfo_tbl %>%
  # count occurrences of each brand
  count(brand_name, sort = TRUE) %>%
  # get top 10 brands
  head(10) %>%
  collect() %>%
  # rename column for clarity
  rename(row_count = n) %>%
  # arrange by count descending
  arrange(desc(row_count))

# print the table 
print(top10_brand_counts)

# A tibble: 10 × 2
   brand_name row_count
   <chr>          <dbl>
 1 CHEVROLET     890213
 2 FORD          782063
 3 TOYOTA        398976
 4 JEEP          369067
 5 NISSAN        312876
 6 HONDA         278725
 7 HYUNDAI       264381
 8 GMC           232111
 9 DODGE         218588
10 VOLKSWAGEN    210886

Questions:

Do some groups have much more missing data than others?
Is there a logical reason why certain groups would have missing data?
Does this missingness seem random or systematic?

Notes:

Firstly, since “brand_name” has 111 different levels/brands, I filtered by the most frequent brands in the dataset by record count. I was then able to find that some brands, such as Nissan and GMC, have a higher average percentage of missing variables when compared to other brands, such as Honda and Toyota.
From that, there does not appear to be a logical reason why certain groups appear to have a higher average missing data percentage compared to other brands.
For this specific instance, I would say this pattern of missingness appears to be random due to lack of logic rationale behind why certain brands recorded in the dataset have higher missing comapred to othersThere does not appear to be a systematic pattern in missing data related to brand; however, based on the table, most of the missing data consists of records that have NA values, which is approximately 1260 records.

Pattern 2: Missing Over Time

Does data quality improve or worsen over time?

# filter out invalid model_year values 
valid_years <- autoInfo_tbl %>%
  filter(!is.na(model_year) & model_year != "__HIVE_DEFAULT_PARTITION__" & model_year != "") %>%
 
   # set year interval by every 5 years
  mutate(year_interval = floor(as.numeric(model_year) / 5) * 5)

# calculate average missing data by 5-year intervals
missingSummary_byYear <- valid_years %>%
  group_by(year_interval) %>%
  summarise(across(everything(), ~sum(as.integer(is.na(.))))) %>%
  collect() %>%
 
   # convert to long format for analysis
  pivot_longer(-year_interval, names_to = "variable", values_to = "numMissing") %>%
  
  # join with counts per interval
  left_join(
    valid_years %>% 
      group_by(year_interval) %>% 
      count() %>% 
      collect(),
    by = "year_interval"
  ) %>%
  
  # calculate percentage missing for each variable
  mutate(percentMissing = round((numMissing / n) * 100, 2)) %>%
 
   # group by interval and get average across all variables
  group_by(year_interval) %>%
  summarise(avgPercentMissing = round(mean(percentMissing), 2)) %>%
  arrange(year_interval)

# create line graph showing trend over time
missingdataTime <- ggplot(missingSummary_byYear, aes(x = year_interval, y = avgPercentMissing)) +
  
  # add connecting line
  geom_line(color = "#3498db", size = 1.2) +
 
   # add data points
  geom_point(color = "#2c3e50", size = 3) +
  
  # add percentage labels above points
  geom_text(aes(label = paste0(avgPercentMissing, "%")), vjust = -1, size = 3.5, fontface = "bold") +
  
  # set labels
  labs(title = "Average Missing Data by Model Year",
       x = "Model Year Interval",
       y = "Average Percent Missing (%)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14)) +
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.15)))

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

# displays plot
missingdataTime

# print summary table
print(missingSummary_byYear)

# A tibble: 9 × 2
  year_interval avgPercentMissing
          <dbl>             <dbl>
1          1980              72.1
2          1985              71.0
3          1990              69.2
4          1995              68.5
5          2000              68.1
6          2005              67.8
7          2010              66.6
8          2015              57.3
9          2020              60.6

Questions:

Is missingness improving, worsening, or staying constant over time?
What might explain any trends you observe?
Does this affect which time periods you should focus on?

Notes:

It does appear that there is a notable general downward trend in average missing data percentages with newer or more recent vehicle years, indicating a gradual improvement in technical documentation with newer model years. However, this progress is interrupted by a sudden increase in missing data beginning around the 2020 model year.
One of the first things I believe that led towards the downward trend is that with newer model year vehicles, it has been easier to report on car specifications, with improved data collection techniques. However, the increase in the 2020 model year could possibly be related to the effects of the COVID-19 pandemic at the time, which led to reduced recording of data.
Based on this visual alone, I would roughly only utilize records of model years from 2015-2020 since this specific range is when the dataset appeared to have the least amount of missing data.

Classify Missingness Types & Justification

Understanding MCAR, MAR, and MNAR

Three types of missingness

MCAR (Missing Completely At Random):
- Missing values are unrelated to any other variables
- Example: Random data entry errors
- Safe to delete or use simple imputation
MAR (Missing At Random):
- Missingness depends on observed variables
- Example: Older records missing digital fields
- Need to account for related variables
MNAR (Missing Not At Random):
- Missingness depends on the missing value itself
- Example: High earners not reporting income
- Most problematic - may bias results

Classification Table

Classification	Severity Level	Recommended Strategy
MAR	Severe (>50%)	Exclude variables from consideration
MAR	Substantial (20%-50%)	Exclude variables from consideration
MAR	Moderate (5%-20%)	Exclude variables from consideration

Categorical & Numerical Variable Handling Reasoning:

I ultimately decided to no longer include the numerical and categorical variables that were earlier classified under “Moderate”, “Substantial”, and “Severe”. This decision was based on viewing the varying amount of information missing in the fields that were under each category and focusing primarily on the scope of the analysis to ensure that the findings are direct, as well as primarily relevant. Removing these variables also serves to further concentrate only on a select few variables (all of which have a missingness value of less than 5%) that remain relevant to the topic of ask price observation and vehicle body class shifts.

Impact Assessment

Overall Strategy:
Based on my findings, I have decided to ultimately exclude the variables that are missing more than 5% of data, as a majority of the variables no longer fit the revised scope of the analysis.

Most Challenging Decision:
Ultimately, deciding on whether or not some of the variables that have more than 5% missing would be relevant to the scope of the analysis and handling approach. This is due to each automotive record; it would make no sense to impute, given that vehicles have direct specifications based on make, model, and especially model year. As such, imputation would result in completely wrong information that would invalidate the rest of the analysis.

Potential Limitations:
In going with this choice, some features I originally wanted to explore, such as car technology features (such as a lane assist system), would no longer be a part of the analysis. However, the overall analysis would remain focused on price observation and notable segment shifts.

Confidence in Approach: I remain highly confident in my approach to handling the variables with large amounts of missing variables, as the exclusion of these specific fields would not hinder my analysis.

Phase 3: From Data Understanding to Stakeholder Insights

Objective(s)

Select the most meaningful variables for the analysis
Create compelling visualizations that answer exploratory questions
Communicate findings to stakeholders

Part 5: Variable Selection & Focus

Variable Inventory Table:

# create variable inventory for complete and minimal missing variables
variable_inventory <- missingSummary %>%
  
  # filter for only complete and minimal severity levels
  filter(severity %in% c("Complete", "Minimal")) %>%
 
   # add a column to evaluate potential usefulness for price prediction
  mutate(
    # classify each variable type based on actual variable names
    variableType = case_when(
      variable %in% c("ask_price", "msrp") ~ "Target/Reference Price",
      variable %in% c("brand_name", "make", "make_id", "model_name", "model", "model_id", "stock_num",
                      "vin", "vin_2") ~ "Vehicle Identity",
      variable %in% c("model_year") ~ "Time/Age",
      variable %in% c("body_class", "vehicle_type", "trailer_body_type", "trailer_type", "bus_type") 
                      ~ "Vehicle Category",
      variable %in% c("mileage") ~ "Usage",
      variable %in% c("plant_country") ~ "Geographic",
      variable %in% c("dealer_id") ~ "Dealer Info",
      variable %in% c("displacement_cc", "displacement_ci", "displacement_l", "fuel_type_primary",
                      "bus_floor_config_type", "custom_motorcycle_type", 
                      "motorcycle_chassis_type", "motorcycle_suspension_type") ~ "Technical Specs",
      variable %in% c("color", "interior_color") ~ "Aesthetics",
      variable %in% c("is_new") ~ "Condition",
      variable %in% c("first_seen", "last_seen") ~ "Listing Metadata",
      variable %in% c("manufacturer", "manufacturer_id") ~ "Manufacturer Info",
      TRUE ~ "Other"
    ),
   
     # evaluate potential usefulness for predicting price
    predictionRelevance = case_when(
      variable == "ask_price" ~ "Target Variable",
      variableType %in% c("Vehicle Identity", "Time/Age", "Technical Specs", "Usage", "Condition") ~ "High",
      variableType %in% c("Vehicle Category", "Target/Reference Price") ~ "High",
      variableType %in% c("Geographic", "Aesthetics", "Manufacturer Info") ~ "Medium",
      variableType %in% c("Dealer Info", "Listing Metadata") ~ "Low",
      TRUE ~ "To Be Evaluated"
    )
  ) %>%
  
  # select and arrange columns for easy review
  select(variable, percentMissing, variableType, predictionRelevance) %>%
  arrange(predictionRelevance, percentMissing)

# print the inventory table
print(variable_inventory)

# A tibble: 35 × 4
   variable          percentMissing variableType           predictionRelevance
   <chr>                      <dbl> <chr>                  <chr>              
 1 vin                         0    Vehicle Identity       High               
 2 msrp                        0    Target/Reference Price High               
 3 mileage                     0    Usage                  High               
 4 is_new                      0    Condition              High               
 5 vin_2                       0    Vehicle Identity       High               
 6 model_year                  0    Time/Age               High               
 7 brand_name                  0.02 Vehicle Identity       High               
 8 make                        0.02 Vehicle Identity       High               
 9 make_id                     0.02 Vehicle Identity       High               
10 trailer_body_type           0.02 Vehicle Category       High               
# ℹ 25 more rows

# -- WARNING: OUTPUT EXCEL FILE ---
# optionally export to excel for detailed review
#write_xlsx(variable_inventory, path = "variable_inventory_for_pricing.xlsx")

Spreadsheet Link:

https://docs.google.com/spreadsheets/d/1w2fjuTBG5nB3vWYan3M1vdN339MtsSRu/edit?usp=sharing&ouid=102065865487880001963&rtpof=true&sd=true

Numeric Relationships

Which numeric variables are correlated with each other?

Correlation Matrix:

# disable scientific notation
options(scipen = 999)

numeric_vars <- variable_inventory %>%
  filter(predictionRelevance != "Target Variable") %>%
  pull(variable)

# collect data and select only numeric columns (including ask_price)
numeric_data <- autoInfo_tbl %>%
  select(ask_price, all_of(numeric_vars)) %>%
  collect() %>%
  # keep only numeric columns
  select(where(is.numeric))

# calculate correlation matrix
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")

# create correlation table as a data frame
correlation_table <- as.data.frame(correlation_matrix) %>%
  # add variable names as a column
  mutate(variable = rownames(.)) %>%
  # move variable column to front
  select(variable, everything()) %>%
  # round all correlation values to 4 decimal places
  mutate(across(-variable, ~round(., 4)))

# print correlation table
print(correlation_table)

                       variable ask_price    msrp mileage make_id model_id
ask_price             ask_price    1.0000  0.4778  0.0000  0.0000  -0.0002
msrp                       msrp    0.4778  1.0000 -0.0003  0.0000  -0.0032
mileage                 mileage    0.0000 -0.0003  1.0000  0.0000  -0.0041
make_id                 make_id    0.0000  0.0000  0.0000  1.0000   0.0946
model_id               model_id   -0.0002 -0.0032 -0.0041  0.0946   1.0000
displacement_cc displacement_cc    0.0002  0.0002  0.0003 -0.0001  -0.0065
displacement_ci displacement_ci    0.0002  0.0002  0.0003 -0.0001  -0.0066
displacement_l   displacement_l    0.0002  0.0002  0.0003 -0.0001  -0.0066
dealer_id             dealer_id    0.0066  0.0136 -0.0003  0.0005   0.0036
manufacturer_id manufacturer_id   -0.0010 -0.0022 -0.0024 -0.0003   0.1398
                displacement_cc displacement_ci displacement_l dealer_id
ask_price                0.0002          0.0002         0.0002    0.0066
msrp                     0.0002          0.0002         0.0002    0.0136
mileage                  0.0003          0.0003         0.0003   -0.0003
make_id                 -0.0001         -0.0001        -0.0001    0.0005
model_id                -0.0065         -0.0066        -0.0066    0.0036
displacement_cc          1.0000          0.9994         0.9994    0.0052
displacement_ci          0.9994          1.0000         1.0000    0.0051
displacement_l           0.9994          1.0000         1.0000    0.0051
dealer_id                0.0052          0.0051         0.0051    1.0000
manufacturer_id         -0.0213         -0.0213        -0.0213    0.0083
                manufacturer_id
ask_price               -0.0010
msrp                    -0.0022
mileage                 -0.0024
make_id                 -0.0003
model_id                 0.1398
displacement_cc         -0.0213
displacement_ci         -0.0213
displacement_l          -0.0213
dealer_id                0.0083
manufacturer_id          1.0000

# -- WARNING: OUTPUT EXCEL FILE ---
# export correlation table to excel
#library(writexl)
#write_xlsx(correlation_table, path = "correlation_matrix.xlsx")

Spreadsheet Link:

https://docs.google.com/spreadsheets/d/1rwWIJnaFSBPHvGcf4HUwbIMrF7VbFzLu/edit?usp=sharing&ouid=102065865487880001963&rtpof=true&sd=true

Questions:

Which variables are strongly correlated (|r| > 0.7)?
Are any of these redundant? Can we drop one?
Do any correlations surprise you?

Notes:

There does not appear to be any form of variables that are strongly correlated with one another, other than the variable itself. The only noteworthy finding is that the MSRP and ask price have a correlation value of 0.4778, although both are variations in pricing.

As a reminder:

Ask Price - Dealership Price

MSRP - Manufacturer’s Recommended Price

There are several redundant variables, such as the displacement labeled and id variables, that can be removed from the analysis, due to the appearance of variations of the same variable. MSRP, I will also consider removing it, as the scope of the analysis primarily focuses on dealership pricing (ask price).
No particular set of correlations surprised me, as I did expect most of the variables based on this set to not be highly correlated with one another.

Categorical Relationships

How do categorical variables relate to our outcome?

# get list of complete/minimal categorical variables from missingSummary
categorical_vars <- missingSummary %>%
  filter(severity %in% c("Complete", "Minimal"),
         variable != "ask_price") %>%
  pull(variable)

# collect data
categorical_data <- autoInfo_tbl %>%
  select(ask_price, all_of(categorical_vars)) %>%
  collect()

# select a categorical variable to visualize (change as needed)
selected_var <- "body_class"

# calculate average price by category
plot_data <- categorical_data %>%
  group_by(!!sym(selected_var)) %>%
  summarise(
    count = n(),
    avg_price = mean(ask_price, na.rm = TRUE),
    .groups = "drop" 
  ) %>%
  filter(count >= 10) %>%
  arrange(desc(avg_price)) %>%
  head(15)

# create bar chart
ggplot(plot_data, aes(x = reorder(!!sym(selected_var), avg_price), y = avg_price)) +
  geom_bar(stat = "identity", fill = "#3498db") +
  geom_text(aes(label = paste0("$", format(round(avg_price, 0), big.mark = ","))), 
            hjust = -0.1, size = 3.5, fontface = "bold") +
  coord_flip() +
  labs(title = paste("Average Ask Price by", tools::toTitleCase(gsub("_", " ", selected_var))),
       subtitle = "Top 15 categories with at least 10 observations",
       x = tools::toTitleCase(gsub("_", " ", selected_var)),
       y = "Average Ask Price ($)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 11)) +
  scale_y_continuous(labels = scales::dollar_format(), expand = expansion(mult = c(0, 0.15)))

Does this categorical variable show meaningful differences?

Based on using the “body_class” as one of the primary variables I intend to focus on for my analysis, it does appear there are meaningful differences when using it.

Should it be kept in the final variable set?

From these results, I strongly believe that this variable should still be kept and be a part of the main analysis.

Final Variable Selection

Selecting 3-7 variables that best answer the research questions

Based on overall data inventory, correlations, and relationships, this is the final variable set:

ask_price
is_new
body_class
model_year
brand_name
model_name

These 6 variables, based on data completion and relevance to the analysis, are the primary variables I intended to use for conducting the analysis.

Variables Excluded:

These variables are excluded from analysis based primarily on not being relevant or fitting the overall scope of the analysis, with some of the variables also being redundant or duplicates of the same information:

vin
first_seen
last_seen
msrp
mileage
color
interior_color
dealer_id
vin_2
plant_country
displacement_cc
displacement_ci
displacement_l
bus_floor_config_type
bus_type
stock_num
model
model_id
custom_motorcycle_type
motorcycle_chassis_type
motorcycle_suspension_type
make
make_id
manufacturer
manufacturer_id
trailer_body_type
trailer_type
vehicle_type

Part 6: Exploratory Visualizations & Analysis

New Data Table with Selected Variables:

auto_unique <- autoInfo_tbl %>%
  # ensure each VIN appears only once
  add_count(vin) %>%
  filter(n == 1) %>%
  select(-n) %>%
  # select only the specified fields
  select(model_year, brand_name, model_name, ask_price, is_new, body_class) %>%
   # remove invalid model_year values
  filter(model_year != "__HIVE_DEFAULT_PARTITION__" & !is.na(model_year)) %>%
  
  # filter by body class of interest
    filter(body_class %in% c(
    "Pickup",
    "Convertible/Cabriolet", 
    "Sport Utility Vehicle (SUV)/Multi-Purpose Vehicle (MPV)",
    "Minivan", 
    "Roadster", 
    "Sedan/Saloon", 
    "Coupe", 
    "Van",
    "Sport Utility Truck (SUT)", 
    "Cargo Van", 
    "Crossover Utility Vehicle (CUV)",
    "Hatchback/Liftback/Notchback"
  )) %>%
  # collect the data
  collect()

# print summary of the new dataset
print(paste("Total rows:", nrow(auto_unique)))

[1] "Total rows: 493123"

print(paste("Total columns:", ncol(auto_unique)))

[1] "Total columns: 6"

print("Column names:")

[1] "Column names:"

print(names(auto_unique))

[1] "model_year" "brand_name" "model_name" "ask_price"  "is_new"    
[6] "body_class"

# view first few rows
head(auto_unique)

# A tibble: 6 × 6
  model_year brand_name model_name     ask_price is_new body_class              
  <chr>      <chr>      <chr>              <int> <chr>  <chr>                   
1 2003       FORD       F-150               7995 False  Pickup                  
2 2014       FORD       Escape             15853 False  Sport Utility Vehicle (…
3 2014       FORD       Fiesta              7495 False  Hatchback/Liftback/Notc…
4 2006       JEEP       Grand Cherokee      8000 False  Sport Utility Vehicle (…
5 2005       NISSAN     Frontier               0 False  Pickup                  
6 2018       NISSAN     Altima             19425 True   Sedan/Saloon

Visualization 1: New vs Used Offerings Over Time

Question: How much has dealer pricing among new and used offerings varied across different car model years?

newVsUsed_plot_data <- auto_unique %>%
 
   # filter for valid prices only
  filter(!is.na(ask_price), ask_price > 0) %>%
 
   # aggregate median prices and count by model year and condition
  group_by(model_year, is_new) %>%
  summarise(
    medianPrice = median(ask_price),
    count = n(),
    .groups = "drop"
  ) %>%
  
  # filter for sufficient sample size and prepare for plotting
  filter(count >= 10) %>%
  mutate(
    condition = ifelse(is_new == "True", "New", "Used"),
    model_year_num = as.numeric(model_year)
  )

# create visualization
new_used_graph <- ggplot(newVsUsed_plot_data, aes(
  x = model_year_num,
  y = medianPrice,
  color = condition
)) +
  geom_point(aes(size = count), alpha = 0.8) +
  
  # ensure proper scaling is taken into account
  scale_x_continuous(breaks = scales::breaks_width(5)) +
  scale_y_continuous(labels = scales::dollar_format()) +
  
  scale_color_manual(
    values = c("New" = "#034efc", "Used" = "#f79b5e"),
    name = "Condition"
  ) +
  scale_size(range = c(2, 6), guide = "none") +
  labs(
    title = "Median Ask Price by Model Year: New vs Used Vehicles",
    subtitle = "Each point represents median price based on model year",
    x = "Model Year",
    y = "Median Ask Price (USD)",
    caption = "Dot Size = Sample Volume"
  ) +
  
  # theme adjustments
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 10, color = "blue", hjust = 0.5),
    legend.position = "bottom",
    legend.box = "vertical",
    legend.title = element_text(face = "bold"),
    plot.caption = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

# display graph
new_used_graph

Pattern Observed:

The scatter plot displays the changing median asking price for new and used vehicles across different model years, ranging from 1980 to 2020. In addition, each of the dots that make up the visual varies in size, indicating the volume for that specific model year and condition.

Statistical Evidence:

There is a clear trend represented that appears to be a form of exponential increase in pricing over time. Specifically, new vehicle prices (blue dots) appear to be consistently higher than used vehicle prices (orange dots) for the same model year. In general, new model year vehicles, after 2020, have been increasing at a faster rate when compared to previous model years before 2010.

Stakeholder Implications:

Decision-makers such as dealerships and market analysts (the primary audience for this analysis) can use this to better understand pricing disparities between new and used vehicles. Additionally, analysts can use this for price forecasting and long-term pricing approximations after observing the general trend moving towards higher prices.

Adjustment to Data Table to Provide Only New Offerings and Focus on 2015-2020 Model Years:

auto_unique <- auto_unique %>% 
  filter(is_new == "True") %>%
  filter(model_year %in% c("2015", "2016", "2017", "2018", "2019", "2020")) %>% 
  collect()

Visualization 2: Body Class Distribution

Question: What is the distribution of car body classes ( Pickups, SUVS, Sedans) across dealerships?

# calculate distribution of body class
body_class_distribution <- auto_unique %>%
  group_by(body_class) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count)) %>%
  mutate(percentage = round((count / sum(count)) * 100, 1))

# create bar graph with conditional labels
ggplot(body_class_distribution, aes(x = reorder(body_class, count), y = count)) +
  geom_bar(stat = "identity", fill = "#eb4034") +
  # labels for bars >= 40%: count inside, percentage outside
  geom_text(data = filter(body_class_distribution, percentage >= 40),
            aes(label = format(count, big.mark = ",")), 
            hjust = 1.1, size = 3.5, fontface = "bold", color = "white") +
  geom_text(data = filter(body_class_distribution, percentage >= 40),
            aes(label = paste0(percentage, "%")), 
            hjust = -0.2, size = 3.5, fontface = "bold") +
  # labels for bars < 40%: combined format outside with dynamic spacing
  geom_text(data = filter(body_class_distribution, percentage < 40),
            aes(label = paste0(percentage, "%, (", count, ")"),
                hjust = -0.05 - (log10(max(body_class_distribution$count)) - log10(count)) * 0.02), 
            size = 3.5, fontface = "bold") +
  coord_flip() +
  labs(
    title = "Distribution of Vehicle Body Classes",
    subtitle = "Count and percentage of each body class in dataset",
    x = "Body Class",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    panel.grid.major.y = element_blank()
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))

body_class_distribution

# A tibble: 11 × 3
   body_class                                               count percentage
   <chr>                                                    <int>      <dbl>
 1 Sport Utility Vehicle (SUV)/Multi-Purpose Vehicle (MPV) 107874       46.2
 2 Sedan/Saloon                                             53108       22.8
 3 Pickup                                                   43511       18.6
 4 Hatchback/Liftback/Notchback                             12540        5.4
 5 Minivan                                                   5556        2.4
 6 Coupe                                                     4015        1.7
 7 Van                                                       2993        1.3
 8 Convertible/Cabriolet                                     1766        0.8
 9 Cargo Van                                                 1578        0.7
10 Crossover Utility Vehicle (CUV)                            363        0.2
11 Roadster                                                    45        0

Pattern Observed:

The horizontal bar chart displays the count and accompanying percentage (proportion) for each different vehicle body class within the dataset. The body classes are ordered by their frequency, from most to least frequent, with SUVs/MPVs being the most prominent.

Statistical Evidence:

Larger vehicles, such as SUVs/MPVs, appear to take the lead in terms of body class in the dataset. Specifically, accounting for 46.2% of the vehicles, followed by Sedans/Saloons at 22.8% and Pickups at 18.6%. The remaining body classes make up a much smaller percentage of the total.

Stakeholder Implications:

From the perspective of an analyst, this helps to highlight the increasing inventory and potentially increasing demand (on the consumer side) for SUVs/MPVs. This also demonstrates that dealerships are receiving much less new inventory in body styles such as Roadsters and Crossover Utility Vehicles (CUVs). This could be due to less profit brought in from these segments when compared to SUVs/MPVs.

Visualization 3: Most Frequent Brands & Accompanying Model

Question: What car brands are the most prominent in dealership locations, and what are their most prominent models?

*Performed research to use a “Lollipop Chart” which has a similar structure to a bar graph but instead uses a line and a dot at the end of the line to represent a value for a level/figure.

# calculate top 10 brand-model combinations by frequency
top_brand_model <- auto_unique %>%
  group_by(brand_name, model_name) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count)) %>%
  head(10) %>%
  mutate(rank = row_number())

# create lollipop chart
model_count <- ggplot(top_brand_model, aes(x = reorder(paste(brand_name, "-", model_name), count), y = count)) +
  # segment (stick of lollipop)
  geom_segment(aes(x = reorder(paste(brand_name, "-", model_name), count), 
                   xend = reorder(paste(brand_name, "-", model_name), count), 
                   y = 0, 
                   yend = count),
               color = "#349e80", 
               linewidth = 1.2) +
  # point (lollipop head)
  geom_point(color = "#349e80", size = 5) +
  # count label next to point
  geom_text(aes(label = format(count, big.mark = ",")), 
            hjust = -0.3, 
            size = 4, 
            fontface = "bold", 
            color = "#2c3e50") +
  coord_flip() +
  labs(
    title = "Top 10 Most Frequent Brand-Model Combinations",
    subtitle = "Brands with their most prominent model",
    x = "Brand - Model",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_line(color = "gray90", linetype = "dashed")
  ) +
  scale_y_continuous(labels = scales::comma_format(), 
                     expand = expansion(mult = c(0, 0.15)))

# display graph
model_count

Pattern Observed:

This visualization displays the top 10 most frequent brand and model combinations (as individual bars) found in the dataset, that is organized by frequency/count of each combination.

Statistical Evidence:

The Chevrolet Silverado is the most frequent car model with 9,887 records, followed by Ford F-150 at 8,632 records. The Chevrolet and Ford brands each have multiple models in the top 10, which is indicative of strong market presence.

Stakeholder Implications:

It helps decision makers primarily identify the most popular and in-demand vehicles. This insight can provide information in multiple areas including marketing strategies and even production planning to maintain an inventory of popular models effectively.

Visualization 4: Large Vehicle Volume on Market Median Pricing

Question: Does having a high volume of larger vehicles (such as SUVs) directly impact the overall market’s median asking price to be higher?

# classify vehicles by size
vehicle_size_data <- auto_unique %>%
  mutate(
    vehicle_size = case_when(
      body_class %in% c("Sport Utility Vehicle (SUV)/Multi-Purpose Vehicle (MPV)", 
                        "Crossover Utility Vehicle (CUV)",
                        "Sport Utility Truck (SUT)",
                        "Pickup",
                        "Van",
                        "Cargo Van",
                        "Minivan") ~ "Large Vehicle",
      body_class %in% c("Sedan/Saloon",
                        "Coupe",
                        "Hatchback/Liftback/Notchback",
                        "Convertible/Cabriolet",
                        "Roadster") ~ "Small/Medium Vehicle",
      TRUE ~ "Other"
    )
  )

# calculate metrics by model year
size_impact_data <- vehicle_size_data %>%
  group_by(model_year, vehicle_size) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(model_year) %>%
  mutate(
    total_count = sum(count),
    percentage = round((count / total_count) * 100, 1)
  ) %>%
  filter(vehicle_size == "Large Vehicle") %>%
  
  # left join conducted
  left_join(
    vehicle_size_data %>%
      group_by(model_year) %>%
      summarise(median_price = median(ask_price, na.rm = TRUE), .groups = "drop"),
    by = "model_year"
  ) %>%
  mutate(model_year_num = as.numeric(model_year))

# create dual-axis visualization
median_price_analysis <- ggplot(size_impact_data, aes(x = model_year_num)) +
  
  # bar chart for large vehicle percentage
  geom_bar(aes(y = percentage * 500), stat = "identity", fill = "#3498db", alpha = 0.6) +
 
  # line chart for median price
  geom_line(aes(y = median_price), color = "#e74c3c", size = 1.5) +
  geom_point(aes(y = median_price), color = "#e74c3c", size = 3) +
  
  # add percentage labels (positioned inside/on top of bars)
  geom_text(aes(y = percentage * 500, label = paste0(percentage, "%")), 
            vjust = 1.5, size = 3.5, fontface = "bold", color = "#2c3e50") +
  
  # add price labels (positioned above the line with consistent spacing)
  geom_text(aes(y = median_price, label = paste0("$", format(round(median_price/1000, 1), nsmall = 1), "k")), 
            vjust = -1.5, size = 3.5, fontface = "bold", color = "#e74c3c") +
  
  # dual axis labels with expanded limits
  scale_y_continuous(
    name = "Median Ask Price ($)",
    labels = scales::dollar_format(),
    limits = c(0, 45000),  # expanded to prevent cutoff
    expand = expansion(mult = c(0, 0.05)),  # add 5% padding at top
    sec.axis = sec_axis(~./500, name = "% Large Vehicle Concentration")
  ) +
  scale_x_continuous(
    breaks = 2015:2020,
    expand = expansion(mult = c(0.02, 0.02))  # add small padding on sides
  ) +
  labs(
    title = "Impact of Large Vehicle Volume on Market Median Price",
    subtitle = "Blue bars show % of large vehicles (SUVs, Pickups, Vans) & Red line shows median asking price",
    x = "Model Year"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 10, hjust = 0.5),
    axis.title.y.left = element_text(color = "#e74c3c", face = "bold"),
    axis.title.y.right = element_text(color = "#3498db", face = "bold"),
    panel.grid.minor = element_blank(),
    plot.margin = margin(15, 15, 10, 10)  # Increased top and right margins
  )

median_price_analysis

Pattern Observed:

This is a combined bar and line chart that shows the percentage of large vehicles (SUVs, Pickups, Vans) from the dataset as blue bars. In addition, the median asking price is represented by the red line, with both the proportion of large vehicles and the median asking price being measured against model year. This visual helps primarily link the prevalence of large vehicles to overall market pricing.

Statistical Evidence:

There is a mobile upward trend in both the percentage of large vehicles and the median asking price from 2015 to 2020 model years. Specifically, the percentage of large vehicles increased from 58.3% in 2015 to 74.3% in 2020, while the median ask price rose from $25.0k to $38.5k in the same period. That appears to suggest some form of corresponding relationship between these 2 observations.

Stakeholder Implications:

For decision-makers in the automotive industry, what should be primarily considered is the observed relationship between the shift in large vehicle inventory and the median asking price for each model year. This can be utilized for forecasting market demand and pricing strategies based on shifting market preferences for larger vehicles, which are represented by each new model year in the dataset.

Visualization 5: Observing Car Brand as a Price Predictor

Question: How strong are aspects of a car, such as brand, as predictors for forecasting the dealer’s asking price?

# get top 20 most frequent brands (with minimum 50 count)
brand_predictor <- auto_unique %>%
  group_by(brand_name) %>%
  summarise(
    count = n(),
    median_price = median(ask_price, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(count >= 50) %>%
  arrange(desc(count)) %>%  # Changed from desc(median_price) to desc(count)
  head(20)

# calculate R-squared to measure brand's predictive power
price_model <- lm(ask_price ~ brand_name, data = auto_unique)
r_squared <- summary(price_model)$r.squared

# create simple bar chart
brand_price <- ggplot(brand_predictor, aes(x = reorder(brand_name, median_price), y = median_price)) +
  geom_bar(stat = "identity", fill = "#eb8334") +
  geom_text(aes(label = paste0("$", format(round(median_price/1000, 1), nsmall = 1), "k")), 
            hjust = -0.1, size = 3.5, fontface = "bold") +
  coord_flip() +
  labs(
    title = "Brand Strength as a Price Predictor",
    subtitle = paste0("Brand explains ", round(r_squared * 100, 1), "% of price variation (R² = ", round(r_squared, 3), ")"),
    x = "Brand Name",
    y = "Median Ask Price ($)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5, color = "#2c3e50"),
    panel.grid.major.y = element_blank()
  ) +
  
  # add sufficient scaling to y-axis
  scale_y_continuous(labels = scales::dollar_format(), expand = expansion(mult = c(0, 0.15)))

# display plot
brand_price

Pattern Observed:

This visualization displayes the median ask price for the most frequent car brands, ranked from highest to lowest in a top 20 form. It also notes that brand explains 24.5% of price variation based on performed calculations when creating the visual.

Statistical Evidence:

To address this immediately, statistically, given the R-Squared value from this data sample alone, it does not appear that just brand by itself is strongly correlated with asking price. However, there do appear to be noteworthy observations based on specific car brands included in that data. Specifically based on the brands that are the most frequent in the dataset, it appears that Land Rover, a luxury car brand, has the highest median ask price of $54.1K, followed by Cadillac with $55.3k, and Audi with $53.0k. There appears to be a gradual increase from the bottom of the list towards the top (no apparent significant jumps in pricing).

Stakeholder Implications:

As mentioned, despite “brand name” having a small statistical relationship in relation to asking price, there does appear to be notable patterns to consider. Specifically, recognizing from an analysis standpoint that other factors in combination beyond just brand name help to forecast dealership pricing trends.

Part 7: Stakeholder Communication

Executive Summary

New vehicle affordability is declining, with dealerships displaying increased inventory of larger, higher-priced new vehicles over sedans. This finding was obtained from analyzing the “Large Car” dataset from Kaggle, which contains over 5 million vehicle listings with accompanying dealer ask price and relevant physical attributes collected from 2017 to 2020 only in the state of Illinois. Why this analysis matters specifically is to facilitate understanding of inventory composition across vehicle segments in recent years to help identify pricing patterns in the automotive market for forecasting future pricing.

Key Findings:

Despite the age of the dataset, it effectively displays the increasing presence of SUVs and Pickups that are notably concentrated at the high mid-range to premium pricing segment of the market, which starts around $30K to $40K, and can extend well past this range. This is supported by the median asking price rising to $38.5k for the 2020 model year, as large vehicle volume increased when observing inventory changes from past model years. What this would mean for price forecasting is that the aspects of the automotive market are in some form influenced by evolving changes involving higher-cost body classes that are increasing in overall volume.

BMW, RAM, and GMC lead the top 10 brands in median asking price, that ranges from $42.73k to $62.40k, with their inventory being heavily concentrated in SUV or Pickup classes, between 64% and 88% specialization. This suggests that brands that have frequent inventory in large, expensive vehicles (like SUVs and Pickups) are able to achieve higher asking prices just based on volume. In addition, brands focused primarily on sedans or more budget-oriented segments generally have median prices that are roughly 30% to 65% lower when compared to more frequent brands in the dataset.

Furthermore, in terms of the budget segment, only one brand shows consistently low median pricing of $21.92k within the top 10 most frequent brands: Nissan. Its primary body class is Sedan/Saloon, which accounts for 62% of its inventory and reflects a broader market trend where non-dominant body styles are associated with lower price points. This positions Nissan’s pricing as a potential baseline for analysts evaluating the pricing changes in the budget-focused category.

The 2025 Nissian Versa, a subcompact car offering has a base price that starts at $17,190 which is considered one of the least expensive new car offerings a consumer can get from the 2025 model year:

Source: https://cars.usnews.com/cars-trucks/advice/best-cars-under-20000?onepage

Recommendations:

These findings suggest that the automotive market is influenced by the expanding presence of larger, high-priced vehicles, which directly impacts overall affordability. One of the first recommendations for future price forecasting is to use inventory concentration and weigh in external risk factors, such as policy changes, during model creation. In addition, analysts should perform continuous monitoring of sedan inventory decline and price changes in budget or non-dominant segments. Finally, analysts should develop a structured evaluation of price shifts across Budget, Mid-Range, and Premium price tiers to evaluate whether different pricing strategies are remaining constant or experiencing price shifts, such as one tier’s entry pricing moving higher or below a previous threshold amount.

*Data Quality Note:

To reinforce this disclaimer, the data itself is based on information recorded from a previous timeframe (2017-2020) and is based in a single state (Illinois). In addition, it contains several technical fields (that measure what technical components a car has) that have most of their data missing; only the general physical attributes of each car record appear close to or already complete.

Hero Visualization

Computational Code Chunk:

# container for top 10 amount
top_brands <- 10

# base filtering and brand aggregation 
base_filtered <- filter(
  auto_unique,
  !is.na(ask_price) &
  ask_price > 0 &
  !is.na(brand_name) &
  brand_name != ""
)

# get Brand Frequencies and Median Prices
brand_freq_data <- base_filtered %>%
  group_by(brand_name) %>%
  summarise(
    Count = n(),
    MedianPrice = median(ask_price, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(Count)) %>%
  head(top_brands)

# collect required frequency data
brand_freq_data_collected <- brand_freq_data %>% collect()

# find Primary Body Class ---
top_brand_names_list <- brand_freq_data_collected %>% pull(brand_name)

primary_body_class_data <- base_filtered %>%
  filter(brand_name %in% top_brand_names_list) %>%
  group_by(brand_name, body_class) %>%
  summarise(ClassCount = n(), .groups = "drop_last") %>%
  slice_max(order_by = ClassCount, n = 1) %>%
  ungroup() %>%
  collect() %>%
  mutate(body_class_short = sub(" \\(.*", "", body_class))

# final join and plotting setup
filter_label <- "New Vehicles"

# percentage and ranking handling
brand_freq_data <- brand_freq_data_collected %>% 
  inner_join(
    primary_body_class_data %>% select(brand_name, body_class_short, ClassCount),
    by = "brand_name"
  ) %>%
  
  # calculate percentage of primary body class from total count
  mutate(
    ClassPercentage = (ClassCount / Count) * 100
  ) %>%
  
  # identify top 3 brands by overall count
  mutate(
    rank_by_count = rank(desc(Count), ties.method = "min"),
    is_top_3 = rank_by_count <= 3
  ) %>%
  
  mutate(
    brand_name = reorder(brand_name, MedianPrice),
    price_midpoint = (min(MedianPrice) + max(MedianPrice)) / 2,
    text_color = ifelse(MedianPrice < price_midpoint, "white", "black"),
    # Only percentage in label, no unit count
    body_class_count_label = paste0(body_class_short, " (", round(ClassPercentage), "%)"),
    # flag for Ford and Chevrolet to have labels inside
    label_inside = brand_name %in% c("FORD", "CHEVROLET")
  )

Creating the Hero Visualization:

brand_class_comparison <- ggplot(
  brand_freq_data, 
  aes(x = brand_name, y = Count, fill = MedianPrice)
) +
  # main bars
  geom_col(color = "gray10", linewidth = 0.4) +
  
  # median price label inside bar (left side)
  geom_text(
    aes(
      label = dollar(MedianPrice, scale = 0.001, suffix = "K", prefix = "$"),
      y = 0,
      color = text_color
    ),
    hjust = -0.1,
    vjust = 0.5,
    size = 3,
    fontface = "bold"
  ) +
  
  # body class labels OUTSIDE the bar (all except Ford and Chevrolet)
  geom_text(
    data = subset(brand_freq_data, !label_inside),
    aes(
      label = body_class_count_label,
      y = Count
    ),
    hjust = -0.05,
    vjust = 0.5,
    size = 2.6,
    color = "black",
    fontface = "plain"
  ) +
  
  # body class labels INSIDE the bar (Ford and Chevrolet only) - white bold
  geom_text(
    data = subset(brand_freq_data, label_inside),
    aes(
      label = body_class_count_label,
      y = Count * 0.98
    ),
    hjust = 1,
    vjust = 0.5,
    size = 2.6,
    color = "white",
    fontface = "bold"
  ) +
  
  coord_flip(clip = "off") +
  
  # heat map color scheme
  scale_fill_viridis_c(
    option = "plasma",
    name = "Median Ask Price (USD)",
    labels = dollar_format(),
    guide = guide_colorbar(barwidth = 15)
  ) +
  
  # mapping colors to values
  scale_color_identity() +
  
  # scaling y-axis sufficiently
  scale_y_continuous(
    labels = comma_format(),
    expand = expansion(mult = c(0.02, 0.15))
  ) +
  
  labs(
    title = paste("Top", top_brands, "Vehicle Brands: Inventory Count vs. Median Price (2015 - 2020 Model Years)"),
    subtitle = "Primary Body Class and percentage of total brand count shown at each bar",
    x = "Brand Name",
    y = NULL,
    caption = "Source: CIS Automotive API | Kaggle Dataset"
  ) +
  
  theme_minimal(base_size = 10) +
  theme(
    plot.title = element_text(face = "bold", size = 11),
    plot.subtitle = element_text(size = 8.5, color = "blue"),
    legend.position = "bottom",
    legend.title = element_text(color = "gray30", size = 9),
    legend.text = element_text(color = "gray20", size = 8),
    axis.text = element_text(size = 8.5),
    panel.grid.major.y = element_blank(),
    plot.margin = margin(6, 6, 6, 6),
    plot.caption = element_text(hjust = 0.5, size = 7)
  )

brand_class_comparison

Disconnect from Database Connection:

dbDisconnect(con, shutdown = TRUE)
glue("DuckDB connection closed successfully")

DuckDB connection closed successfully

Why this visualization?

Why I went with this specific bar graph as my “hero visualization” is based on my initial mindset to best visualize the past state of new automotive inventory in dealerships. That also maintains a focused or “direct” manner in displaying values and characteristics without adding unnecessary complexity in interpretation.

This is different from the visualizations I included in Pa rt 6, as the idea to create this specific visual was from wanting to merge aspects of visualizations 5 (“Brand Strength as a Price Predictor”) and 2 (“Distribution of Vehicle Body Classes”).

What should viewers notice first?

Viewers would first notice two distinct aspects: the frequency of each brand and the most frequent body class they have across dealerships. Secondly, viewers would notice the median pricing, which is indicated by color, with dark blue representing the lowest price range and yellow representing the highest price range in the sample.

The visualization has a bell curve shape that indicates a high amount of inventory concentrated in the midrange ($30-$40k) price segment. The added labels between the brands that make up the top 10 list and body class labels help display that larger body classes, such as “SUV/MPV”, take up the majority of dealership inventory.

The “so what” factor:

This matters to stakeholders because it serves as a visual indicator that supports the bottom line statement made for the analysis: the increasing price of automobiles is closely moving past the $25K figure mentioned in the Kelly Blue Book source I included for the executive summary. The data also clearly demonstrates a segment shift from smaller body class automobiles towards larger classes, which are represented by SUVs/MPVs. This pattern is worth further investigating and researching for the purposes of price forecasting.

Limitations

Data Limitations:

The geographic scope of the dataset is limited to only dealerships from only 1 state.
The timeframe of the dataset is only from 2017 to 2020, and does not capture any policy or economic shifts that have occurred past 2020.
Most of the dataset’s fields (103 out of 156) that refer to the technical specifications of a vehicle are missing more than 50% of their data.

Methodological Limitations:

Did not heavily focus on other fields that pertained to physical aspects of a vehicle (such as interior color or fuel type), which could lead to other findings.

Scope Limitations:

One important question that was left unanswered from this analysis (due to the unreliability of some fields in the dataset) is how does the increased incorporation of technology, such as lane assist, also influences care price for newer model years?
To adress what future analysis whould strengethening/validating the findings is to perform the same form of analysis in a current or close to present day dataset that covers more geographic locations.

Actionable Recommendations (For Analysts on Price Forecasting)

Recommendation 1: Track Decreasing Sedan Inventory & Monitor Pricing

Finding it’s based on:

Visualization 1: “Median Ask Price by Model Year: New vs Used Vehicles”
Visualization 2: “Distribution of Vehicle Body Classes”

Proposed Action: Perform ongoing analysis that primarily monitors for sedan inventory levels to track the suggested decline in volume based on past data. Alongside analysis, implementing a system to track changes in average/median price based on timeframes to provide values to evaluate for any price changes (upward trend).

Expected Impact: This helps to provide signals if sedan inventory stabilizes or tightens, allowing analysts to adjust pricing values based on the changing metrics.

Implementation difficulty: Moderate - This may require some time to fully allocate resources for.

Recommendation 2: Price Forecasting by Inventory Concentration & External Risk Factors

Finding it’s based on:

Visualization 3: “Top 10 Most Frequent Brand-Model Combinations”
Visualization 4: “Impact of Large Vehicle Volume on Market Median Price”

Proposed Action: Develop price forecasting models where the impact of a volume shift on the median ask price is managed/influenced by a segment’s current concentration volume. Additionally, incorperating approaches to handle external influences beyond just inventory shifts that include economic policy changes and current market demand.

Expected Impact: Future forecasts may lead to close approximations that are based on integrating aspects concerning market structure, while also taking into account risk based on other external factors occurring over a timeframe (economic policy changes such as tariffs that occurred in 2025).

Implementation difficulty: Difficult - There are several complexities with developing such models that would prove to be increasingly difficult to manage and require much dedication to developing (this can involve developing and testing multiple forecasting models).

Recommendation 3: Evaluate Price Shifts Across Price Tiers (Budget, Mid-Range, and Premium)

Finding it’s based on:

Visualization 5: “Brand Strength as a Price Predictor”
Hero Visualization: “Top 10 Vehicle Brands: Inventory Count vs. Median Price (2015 - 2020 Model Years)”

Proposed Action: In continuing to conduct price analysis, analysts could potentially categorize vehicles into budget, mid-range, and premium tiers based on a brand’s average/median ask price. To then separately analyze the rate of price inflation over time (this can be based on a yearly or monthly timeframe) for new vehicles within each tier.

This could roughly be:

Budget: Less than $25K

Mid-Range: $30-40k

Premium: Greater than $40K

One thing to point out is that these approximate tiers would have to be adjusted yearly (or other consistent timeframe) as a result of inflation, which requires careful monitoring to then perform.

Expected Impact: Analysts can determine different pricing strategies being utilized through observing if mid-range brands or models are getting closer in price to models from premium brands. That is also applicable to prices of budget brands and models, specifically if they are also experiencing increases or remain in a stagnant position.

Implementation difficulty: Moderate ~ In theory, having systems in place to effectively categorize and analyze the new car price inflation rate should still be manageable to handle, though will still require enough resources to manage.

Hypothesis Development

After exploratory analysis, we can formulate a testable hypothesis:

Hypothesis 1: Vehicle Body Class Inventory Shifts on Car Median Asking Price

Null Hypothesis (H₀):

The increasing market share or inventory concentration of larger vehicle body classes (SUVs/MPVs, Pickups, etc) has no measurable effect on the median asking price of all vehicles in US dealerships.

Alternative Hypothesis (H₁):

The increasing market share or inventory concentration of larger vehicle body classes (SUVs/MPVs, Pickups, etc) has a positive relationship with the increase of median asking price, indicating decreasing affordability in dealerships.

Evidence from EDA:

Notably, Visualization 4 (Impact of Large Vehicle Volume on Market Median Price) demonstrates some form of positive correlation in the shifting percentages of “large vehicle” body class inventory (represented as bars) for the 2015-2020 model years, showing a general upward trend for median asking price (line graph). Specifically, from this 5-year timeframe based on model year, the proportion of large vehicles increased from 58.3% to 74.3%, with a median price increase from $25K to $38.5K. That suggests an increased focus in the automotive industry towards producing large body class vehicles that are in a mid-range to premium price point.

How to Test:

Testing for this hypothesis would involve producing some form of regression using machine learning models such as decision trees, gradient boosting, and random forest to primarily validate the presence of a relationship between these two specific fields.

Presentation Plan

4 Minutes Overall

Slide 1: Title & Hook

Title: “Visualizing the Increasing Price Tag of Automobiles & Segment Shifts”

Hook Question: When you see the prices of new or used vehicles in 2025, have you ever considered how much it’s changed from prior years?

Key Statistics (Kelly Blue Book):

In September 2025, the average car buyer paid up to $50,080. The highest average figure recorded in the US thus far (https://www.kbb.com/car-news/new-record-average-new-car-prices-surpasses-50000/)
The sales of vehicles priced at $25,000 or less have fallen by 78% in just five years (https://www.kbb.com/car-advice/is-now-the-time-to-buy-sell-or-trade-in-a-used-car)

Slide 2: Brief Executive Summary

Bottom Line: New vehicle affordability is declining, with dealerships displaying increased inventory of larger, higher-priced new vehicles over sedans.

Overview & Purpose of Analysis:

This result is obtained from analyzing the “Large Car” dataset from Kaggle, which contains over 5 million vehicle listings with accompanying dealer ask price and relevant physical attributes collected from 2017-2020.
Understanding inventory composition across vehicle segments helps identify pricing patterns in the automotive market for forecasting future pricing.

Key Findings (Hero Visualization):

Growing presence of Sports Utility Vehicle (SUV) and Pickup classes in the $30-$40k range to premium segments that start at $40K.
BMW, RAM, and GMC had the highest median ask prices ($44.50-$62.40k) with a SUV/Pickup specialization between 64-88%, while sedan-focused brands were priced roughly 30-65% lower.
Nissan is shown to be the only brand that has consistent median pricing of below $25K, with a primary body class of Sedan.

Slide 3: Hero Visualization

Highlight the distribution of body class inventory across the most frequent car brands in dealerships.
Emphasize that there is only being 1 car brand between the 2015-2020 model years that has a consistent median price below $25K (within budget range).
Emphasize that the majority of the median prices are roughly between the $30K- $40K that helps to suggest the auto market’s focus on the mid-range segment.

Modified Version of “Hero Visualization” for Presentation Purposes:

Slide 4: Key Takeaways & Recommendations

Finding 1: Budget offerings appear mostly concentrated in the Sedan class, with Toyota and Nissan each showing at least 60% sedan inventory but totaling under 20K units combined.

Recommendation: Industry analysts should track if sedan inventory continues to decline currently, and monitor any major pricing changes.

Finding 2: The SUV class appears consistently as the most common vehicle category, with 7 out of the top 10 frequent brands having it as their primary body class.

Recommendation: For price forecasting, consider the inventory shifts in specific classes and evaluate impact of changes towards ask price.

Finding 3: Top 3 brands based on inventory size have median pricing between $32-$36K that suggests a large concentration of inventory in the mid-range to premium segment.

Recommendation: Evaluate current segment concentrations and ask price shifts across budget, mid-range, and premium categories.

Slide 5: Recognizing Limitations & Next Steps

Limitations: Overall findings suggest price patterns related to vehicle body class, but cannot confirm if these specific trends are present nationally in current market conditions, based on the following limitations:

The analysis approach in using a past time frame (ending May 2020) does not capture any recent policy changes and market shifts
The geographic scope of data is limited to only dealerships in Illinois, which restricts applicability to current market trends
Several technical fields in the dataset are missing over 50% of the data

Hypothesis Development: It can be hypothesized that vehicle market shifts towards larger body classes are related to decreased vehicle affordability.

The next step would be using nationally published data (from at least 2020- present) and test using machine learning methods (gradient boosting, random forest, etc) to quantify how body class distributions explain price changes.

Final Reflection

Most Surprising Finding:

The most surprising finding from my analysis is observing just how much new car inventory the SUV/MPV body class makes up in the dataset, as I couldn’t exactly speculate on the concentration of each body class based on evolving trends. I was also surprised to see there being visual evidence presented in the shift of “large car” inventory between the 2015-2020 car model years, which shows increasing median prices each new model year. This was something I was not sure I was able to find, given the limited geographical range and age of the data.

Biggest Challenge:

The biggest challenge from this analysis was determining how to effectively measure or analyze car pricing in a relevant way, given that the dataset’s age predated June 2020. As such, I had to revise multiple times what aspects of the dataset I would focus on that are relevant to car pricing and still hold some applicability in the present time. Only until I did further research on current car pricing trends and market segments was I able to specifically create a framework for analysis that primarily looked at a car’s body class, as well as understanding the increasing price of vehicles with each new model year.

What You’d Do Differently:

What I would have done differently from an analysis standpoint was to start off small with how I wanted to approach what influenced car prices. Since I started taking into account numerous variables, even ones that were not relevant to the physical attributes of the car (such as the country where a car was assembled). Realizing this when I was still analyzing my original research question, “what drivers best influenced car pricing,” I ended up scaling things back after getting a better picture of the dataset’s quality.

Skills Developed:

I further developed my skills in R by working with the ‘tidyverse’ package to effectively manipulate the dataset and create subsets that were essential to my analysis.

The ‘Arrow’ and ‘DuckDB’ packages enabled me to better understand methods for efficiently bringing in big data (over 5GB) and conducting data and visual analysis.

Extensively using the “ggplot2” package helped me constantly revise my initial visuals to create streamlined final versions for accessible viewing by a general audience.

Confidence in Findings: High

Explanation:

I am highly confident in the findings from the analysis, as for the most part they all appear to correspond to the core sentiment that affordability in vehicles is decreasing, while there is also a shift towards larger vehicles based on the inventory values given.

As such, based on the recommendations provided, this would support the need for further investigation and analysis through an updated dataset to capture current patterns relevant to price forecasting for the long term.