Business Analytics - Task 2

Authors

Netta Amar 300232444

Grisha Rozenshtein 317302461

Hellena Ilanit Elimelech 027788215

Outline

This document contains all three deliverables required for Task 2, alongside some additional information. It is structured as follows:

Section 1: Introduction to the general process of used car trade and auctioning and the main objective of this project.

Section 2: Comprises data descriptions (variables and their meanings), appropriate transformations and a quality report (elaborate presentation of missing and unusual values in the dataset).

Section 3: States ten research questions about the lemon-cars domain we would like to address and believe that are answerable using the current dataset.

Section 4: Contains ideas and schematic sketches for dashboards that we will use to illuminate key insights from this dataset, focusing on our research questions.

1 Introduction

We present the general process of used car trade in the USA in order for the reader to gain background before emerging into our research theme. Afterwards, we state the main objective of this work.

1.1 Car auction and trading process

Vehicle Acquisition: Car dealerships, rental companies, or other vehicle owners bring used vehicles to the auction house to sell. The acquisition details, such as the auction provider (auction), purchase date (purch_date), and vehicle’s features, are recorded. Acquisition prices for the vehicle in various conditions are also documented (mmr_acquisition_auction_average_price, mmr_acquisition_auction_clean_price, mmr_acquisition_retail_average_price, and mmr_acquisition_retail_clean_price).
Auction: The auction house organizes an event where registered buyers (car dealers or other businesses) can bid on the vehicles. The auction may take place in person or online. Vehicles are assigned unique reference numbers (ref_id) and the auction house provides a level of guarantee for each vehicle’s condition (aucguart). Buyers bid on the vehicles and the highest bidder wins the auction.
Vehicle Purchase: The winning buyer completes the transaction, paying the acquisition cost (veh_bcost) for the vehicle. The buyer’s unique number (byr_no), purchase date (purch_date), and whether the sale occurred online (is_online_sale) are recorded. The vehicle’s location details, such as the state (vnst) and zip code (vnzip) where the car was purchased, are also documented.
Post-Purchase: After the purchase, the buyer may discover undisclosed issues or misrepresentation of the vehicle’s condition or features. In such cases, the vehicle may be “kicked back” to the auction house, and the return date (kick_date) is recorded.
Warranty: The buyer may also purchase a warranty for the vehicle at an additional cost (warranty_cost).

Throughout this process, various attributes of the vehicles, such as make, model, trim, color, and others, are recorded in the dataset. The data can be used to analyze trends, identify factors influencing vehicle prices or returns, and answer other research questions related to the used car auction and trade process.

1.2 Main objective

The primary focus of the research will be to identify the key factors that are associated with a car being a lemon, including the car’s make, model, year, and other relevant characteristics. The study will also aim to assess the factors that determine a car’s price, and to compare the rates of lemons across different makes and models. A desired outcome is one or a few predictive models that allow fair and accurate evaluation of lemon cars and their corresponding price.

2 Data Description & Quality Report

2.1 Data outline

2.1.1 Data source and discrepancies

There is a slight difference in the names of the data columns and their descriptions provided in the source.

The variables ref_id, acquisition_type, kick_date are present in the description file but not in the data.
The variable was found in the data but not in the description.
We corrected the zipcode field name to vnzip.
No data column is lost, only nonexistent variables reported in the description.
All string values are transformed to lowercase for consistency.

2.2 Data wrangling

In this subsection we present the data wrangling procedures to be used before fitting any model, including:

Transformations - mapping values of a variable into another set of values, hopefully facilitating anlyses and fitting, without sacrificing too much information.
Missing values - some values are not present in the dataset. We study which ones and suggest methods to handle this phenomenon.
Unusual values - unusual values constitute a problem for model fitting and interpretation so we suggest a framework to smooth them out.

2.2.1 Transformations

In this preliminary stage of our work it seems that two transformation methods are appropriate for most variables:

Dummy coding: one-hot encoding and binary coding are both special cases of dummy coding (Boehmke and Greenwell 2019). In one-hot encoding, each category of a categorical variable is represented by a separate column, with a value of 1 for observations that fall into that category and a value of 0 for observations that do not fall into that category. In binary coding, each category of a categorical variable is represented by a single column, with a value of 1 for observations that fall into that category and a value of 0 for observations that do not fall into that category.
Normalization: this general term often refers to centering and scaling, but there are more ways to make the data more normal (in the sense of “sane”, not necessarily gaussian). See Table 2 for the planned tranformations for each variable.

2.2.2 Missing values

Figure 1 illustrates the percentage of missing values per variable.
- The percentages appear in the data summary (Table 2).
- The columns primeunit and aucguart had more than 90% missing values, and are omitted from the plot.

(a) Two variables (primeunit, aucguart) are not shown due to >90% missingness.

There are not many missing values (less than 5% for all columns).
Four common methods of handling missing data are described below, and the corresponding method for each variable is listed in Table 1.
- Imputation: Replace missing values with a meaningful value, such as the mean, median, or mode of the variable. This is useful when the percentage of missing values is low and when the imputed value is not expected to significantly affect the overall analysis.

Deletion: Remove observations with missing values if the percentage of missing data is low, and removing these observations is not expected to introduce bias or reduce the representativeness of the sample.

Model-based imputation: Use machine learning algorithms (e.g., k-Nearest Neighbors or regression models) to predict and fill in missing values based on other variables. This can be a good option when there’s a significant amount of missing data and a relationship exists between the missing variable and other variables.

Removal: Discard variables with an extremely high percentage of missing values, as imputing or deleting such a large proportion of the data may introduce significant bias.

Table 1: Type of missing-variable handling by variable
Variable	Missing (%)	Handling
primeunit	95.32	removal
aucguart	95.32	removal
wheel_type	4.35	deletion
wheel_type_id	4.34	deletion
trim	3.23	imputation
mmr_current_auction_average_price	0.43	imputation
mmr_current_auction_clean_price	0.43	imputation
mmr_current_retail_average_price	0.43	imputation
mmr_current_retail_clean_price	0.43	imputation
veh_b_cost	0.09	imputation
mmr_acquisition_auction_average_price	0.02	imputation
mmr_acquisition_auction_clean_price	0.02	imputation
mmr_acquisition_retail_average_price	0.02	imputation
mmr_acquisiton_retail_clean_price	0.02	imputation
transmission	0.01	imputation
sub_model	0.01	imputation
color	0.01	imputation
nationality	0.01	imputation
size	0.01	imputation
top_three_american_name	0.01	imputation

2.2.3 Unusual values

We crudely define unusual values as follows:
- For categorical variables, values whose relative frequency is less than a subjective threshold.
  - We chose this threshold to be 0.001.
- For numerical values with use Tukey’s definition - values that are “over the fences”, which means further than 1.5 times the Interquartile range (IQR) below (above) Q1 (Q3).

There are three ways to handle these unusual values, namely:

Grouping into ‘Other’ or ‘Unknown’:

For categorical variables, unusual values could represent rare brands, models, or trims with insufficient data points. Grouping these unusual values into a separate category, such as Other or Unknown, helps reduce the dimensionality of the data, while preserving the information about the rarity of these cases.

Appropriate variables: make, model, trim, sub_model, byrno, vnst.

Mode Imputation:

For some variables, unusual values could be due to data entry errors or represent a very small fraction of the data. Replacing these values with the mode (most common value) of the variable can be a reasonable way to handle these cases without introducing significant bias.

Appropriate variables: transmission, wheel_type_id.

Winsorizing or Clipping:

For continuous variables unusual values could be extreme outliers that might influence model performance. Winsorizing or clipping these values at a specified percentile can help limit their impact on the model without completely discarding the information.

Variables: mmr_acquisition_auction_average_price, mmr_acquisition_auction_clean_price, mmr_acquisition_retail_average_price, mmr_acquisiton_retail_clean_price, mmr_current_auction_average_price, mmr_current_auction_clean_price, mmr_current_retail_average_price, mmr_current_retail_clean_price, veh_b_cost, warranty_cost.

2.3 Variable description

Table 2 lists the variables in our dataset, their description and importance/necessity for the analyses proposed in Section 3.

Table 2: Variables in the dataset, descriptions and assumed importance

(a) Unusual is checked if there are unusual values present. Importance of a variable is a subjective guess and may turn out wrong later on, it has the following hierarchical order: crucial, important, mediocre and lastly unimportant. The Hypotheses column refers to the hypotheses to which this variable is relevant.
Name	Definition	Missing(%)	Unusual	Importance	Transformations	Hypotheses
is_bad_buy	Identifies if the kicked vehicle was an avoidable purchase	0.00	✓	NA	NA	NA
purch_date	The Date the vehicle was Purchased at Auction	0.00		important	Year, Month, Day	7
auction	Auction provider at which the vehicle was purchased	0.00		important	Dummy encoding	6
veh_year	The manufacturer’s year of the vehicle	0.00		important	None	1, 2, 5, 9
vehicle_age	The Years elapsed since the manufacturer’s year	0.00		important	None	1, 2, 5, 9
make	Vehicle Manufacturer	0.00	✓	important	Dummy encoding	1, 2, 3, 4, 5, 7, 9
model	Vehicle Model	0.00	✓	important	Dummy encoding	1, 2, 3, 4, 5, 7, 9
trim	Vehicle Trim Level	3.23	✓	mediocre	Dummy encoding	1, 2, 5, 9
sub_model	Vehicle Submodel	0.01	✓	mediocre	Normalization	1, 2, 5, 9
color	Vehicle Color	0.01		unimportant	Dummy encoding	1, 2, 5, 9
transmission	Vehicles transmission type (Automatic / Manual)	0.01		mediocre	Dummy encoding	1, 2, 5, 9
wheel_type_id	The type id of the vehicle wheel	4.34		mediocre	Dummy encoding	1, 2, 5, 9
wheel_type	The vehicle wheel type description (Alloy / Covers)	4.35		mediocre	Dummy encoding	1, 2, 5, 9
veh_odo	The vehicles odometer reading	0.00		important	Normalization	1, 2, 5, 9
nationality	The Manufacturer’s country	0.01		important	Dummy encoding	1, 2, 7, 9
size	The size category of the vehicle (Compact / SUV/ etc.)	0.01		mediocre	Dummy encoding	1, 2, 7, 9
top_three_american_name	Identifies if the manufacturer is one of the top three American manufacturers	0.01		important	Dummy encoding	1, 2, 7, 9
mmr_acquisition_auction_average_price	Acquisition price for this vehicle in average condition at time of purchase	0.02	✓	crucial	Normalization	2, 5, 6, 8, 10
mmr_acquisition_auction_clean_price	Acquisition price for this vehicle in the above Average condition at time of purchase	0.02	✓	crucial	Normalization	2, 5, 6, 8, 10
mmr_acquisition_retail_average_price	Acquisition price for this vehicle in the retail market in average condition at time of purchase	0.02	✓	crucial	Normalization	2, 5, 6, 8, 10
mmr_acquisiton_retail_clean_price	Acquisition price for this vehicle in the retail market in above average condition at time of purchase	0.02	✓	crucial	Normalization	2, 5, 6, 8, 10
mmr_current_auction_average_price	Acquisition price for this vehicle in average condition as of current day	0.43	✓	crucial	Normalization	2, 5, 6, 8, 10
mmr_current_auction_clean_price	Acquisition price for this vehicle in the above condition as of current day	0.43	✓	crucial	Normalization	2, 5, 6, 8, 10
mmr_current_retail_average_price	Acquisition price for this vehicle in the retail market in average condition as of current day	0.43	✓	crucial	Normalization	2, 5, 6, 8, 10
mmr_current_retail_clean_price	Acquisition price for this vehicle in the retail market in above average condition as of current day	0.43	✓	crucial	Normalization	2, 5, 6, 8, 10
primeunit	Identifies if the vehicle would have a higher demand than a standard purchase	95.32	✓	unimportant	Dummy encoding
aucguart	The level guarantee provided by auction for the vehicle (Green light - Guaranteed/arbitratable Yellow Light - caution/issue red light - sold as is)	95.32		unimportant	Dummy encoding
byrno	Unique number assigned to the buyer that purchased the vehicle	0.00	✓	unimportant	None
vnzip	Zipcode where the car was purchased	0.00		NA	NA	NA
vnst	State where the car was purchased	0.00	✓	important	Dummy encoding	6
veh_b_cost	Acquisition cost paid for the vehicle at time of purchase	0.09	✓	crucial	Normalization	2, 5, 6, 8, 10
is_online_sale	Identifies if the vehicle was originally purchased online	0.00	✓	important	Normalization	10
warranty_cost	Warranty price (term=36month and mileage=36K)	0.00	✓	important	Normalization	8

3 Research Methodology

3.1 Research questions

What are the key factors, including make, model, year, and other vehicle characteristics, that are most strongly associated with a car being a lemon?
- Variables: is_bad_buy, vehicle_age, make, model, trim, color, transmission, veh_bcost, veh_odo, size, top_three_american_name
- Method: Model-based (Logistic Regression)
- Speculation: Vehicle age, make, and model might be strong predictors of a car being a lemon.
- Actionable insight: Dealerships can use the identified factors to avoid purchasing lemons, reducing costs and improving customer satisfaction.
Can we develop a predictive model to estimate the probability of a car being a lemon based on its features?
- Variables: is_bad_buy, vehicle_age, make, model, trim, color, transmission, veh_bcost, veh_odo, size, top_three_american_name
- Method: Model-based (Random Forest, XGBoost)
- Speculation: The predictive model could accurately estimate the probability of a car being a lemon.
- Actionable insight: Dealerships can use the model to make informed purchasing decisions, lowering the risk of acquiring lemons and reducing financial losses.
How do the lemon rates vary across different makes and models? Are there any significant differences in lemon rates between American, Japanese, and other manufacturers?
- Variables: is_bad_buy, make, model
- Method: Descriptive (Summary Statistics, Chi-squared test)
- Speculation: There may be significant differences in lemon rates between various makes and models, as well as between American, Japanese, and other manufacturers.
- Actionable insight: Consumers and dealerships can use this information to make more informed decisions when purchasing used cars, leading to increased satisfaction and reduced risks.
Are the commonly held stereotypes about car reliability supported by the data? Which specific makes and models have the highest and lowest lemon rates?
- Variables: is_bad_buy, make, model
- Method: Descriptive (Summary Statistics, Visualization)
- Speculation: Some stereotypes about car reliability might be supported by the data, while others might not hold true.
- Actionable insight: Car buyers can use this information to guide their purchasing decisions, while dealerships can adjust their inventory strategies to match consumer preferences and expectations.
Can we build a regression model to predict the acquisition cost of a car based on its features?
- Variables: veh_bcost, vehicle_age, make, model, trim, color, transmission, veh_odo
- Method: Model-based (Linear Regression, Random Forest)
- Speculation: The regression model could accurately predict the acquisition cost of a car based on its features.
- Actionable insight: Dealerships can use the model to make more accurate pricing decisions, leading to increased profitability and competitiveness in the market.
How do the acquisition costs and lemon rates of vehicles vary across different auction providers and geographical locations?
- Variables: is_bad_buy, veh_bcost, auction, vnst, vnzip
- Method: Descriptive (Summary Statistics, Visualization)
- Speculation: There might be noticeable differences in acquisition costs and lemon rates across auction providers and geographical locations.
- Actionable insight: Dealerships can use this information to optimize their sourcing strategies, minimizing costs and reducing the risk of purchasing lemons.
Are there any trends or patterns in the lemon rates over time? Do these trends vary between different makes, models, or vehicle categories?
- Variables: is_bad_buy, purch_date, make, model, size, top_three_american_name
- Method: Descriptive (Time-series analysis, Visualization)
- Speculation: There may be observable trends in lemon rates over time, with variations between different makes, models, or vehicle categories.
- Actionable insight: Dealerships and manufacturers can use this information to identify and address potential quality issues, while car buyers can consider these trends when making purchasing decisions.
How effective are various machine learning algorithms in predicting the lemon status and acquisition cost of cars in the dataset?
- Variables: is_bad_buy, veh_bcost, vehicle_age, make, model, trim, color, transmission, veh_odo, size, top_three_american_name
- Method: Model-based (Logistic Multinomial Regression, Linear Regression, Random Forest, XGBoost)
- Speculation: More advanced machine learning algorithms like Random Forest and XGBoost might outperform simpler methods like logistic and linear regression.
- Actionable insight: Choosing the most effective algorithm can improve prediction accuracy, helping dealerships make better purchasing and pricing decisions.
Can we identify any interactions between vehicle features that contribute to a car being a lemon or influence its acquisition cost?
- Variables: is_bad_buy, veh_bcost, vehicle_age, make, model, trim, color, transmission, veh_odo, size, top_three_american_name
- Method: Model-based (Logistic Regression, Linear Regression, Random Forest, XGBoost)
- Speculation: There may be interactions between certain vehicle features that contribute to a car being a lemon or influence its acquisition cost.
- Actionable insight: Understanding these interactions can help dealerships refine their purchasing strategies and optimize pricing to minimize risks and maximize profitability.
Are there any differences in the lemon rates and acquisition costs of cars sold through online auctions compared to those sold in person?
- Variables: is_bad_buy, veh_bcost, is_online_sale
- Method: Descriptive (Summary Statistics, Visualization, T-test)
- Speculation: There might be differences in lemon rates and acquisition costs between online and in-person auctions.
- Actionable insight: Dealerships can use this information to choose the most suitable auction platform for their needs, potentially reducing costs and the risk of acquiring lemons.

4 Dashboard Sketch & Visualizations

The Dashboards section should have three pages at least. Each page will have a different focus:

I. The lemon rate - plots that show the lemon rate across different variables such as make, model or age of the car.

Price/cost variables - this page will contain figures and tables related to the pricing of the car.
Other variable - other plots that illustrate the effects of different variables on the labels of interest, namely the lemon-rate and the cost.

We present crude approximations of the dashboard visualizations using ggplot2 graphics.

4.1 Page I - Lemon rate

Lemon Rates by Make (Bar Chart)
- Variables: make, is_bad_buy
- Transformations: Calculate lemon rate for each make
- Mapping variables to aesthetics:
  - X-axis: make (categorical variable representing car manufacturers)
  - Y-axis: Lemon rate (numerical value representing the proportion of lemons for each make)
- Reasoning: A bar chart facilitates comparison of numerical values across categories. In this case, it allows users to quickly identify car manufacturers with high lemon rates.
- User Interaction: Tooltip to display exact lemon rate for each make when hovering over bars.
- Importance: Lemon rates help users identify risky car manufacturers when purchasing a used vehicle.
- Title: “Lemon Rates by Car Make”
- Short explanation: “This bar chart compares the lemon rates of different car manufacturers. Higher lemon rates indicate a greater risk of purchasing a problematic vehicle.”

Lemon Rates by Car Models (Bar Plot)
- Variables: make, model, is_bad_buy.
- Transformations: Filter data for Ford make as an example, calculate lemon rate for each model.
- Mapping variables to aesthetics:
  - X-axis: model (categorical variable representing Ford car models)
  - Y-axis: Lemon rate (numerical value representing the proportion of lemons for each Ford model)
  - Color: Lemon rate (numerical value representing the proportion of lemons for each Ford model)
- Reasoning: A bar plot allows users to compare the lemon rates among Ford car models easily. This plot helps visualize the risk of purchasing a lemon for different Ford models.
- User Interaction: Tooltip to display exact lemon rate for each model when hovering over bars, selection menu for choosing other makes.
- Importance: Lemon rates for specific Ford car models help users make informed decisions when purchasing a used Ford vehicle.
- Title: “Lemon Rates by Car Models”
- Short explanation: “This bar plot shows the lemon rates for various car models. Higher bars indicate higher lemon rates, suggesting a greater risk of purchasing a problematic vehicle.”

Lemon Rates by Vehicle Age for Top Car Makes (Smoothed Line Plot)
- Variables: vehicle_age, is_bad_buy, make
- Transformations: Calculate lemon rate for each vehicle age and make combination; filter for top car makes: Chevrolet, Dodge, Ford, and Chrysler
- Mapping variables to aesthetics:
  - X-axis: vehicle_age (numerical variable representing the age of the vehicle)
  - Y-axis: lemon_rate_vehicle_age (numerical value representing the proportion of lemons for each vehicle age)
  - Color and line type: make (categorical variable representing the top car makes)
- Reasoning: A smoothed line plot allows users to see trends and patterns in the lemon rates as the vehicles age for different car makes. This visualization helps users assess how the risk of purchasing a lemon changes over time for specific car brands.
- User Interaction: Tooltip to display exact lemon rate for each car make and vehicle age when hovering over the lines.
- Importance: Understanding the lemon rates for top car makes over time helps users make informed decisions when purchasing a used vehicle.
- Title: “Lemon Rates by Vehicle Age for Top Car Makes”
- Short explanation: “This smoothed line plot shows the lemon rates for the top car makes as vehicles age. Different colored lines represent different car brands, allowing users to compare their risk of purchasing a problematic vehicle.”

Remark: in the meantime we use two smoothing methods (the method argument in geom_smooth) as an example:
- lm - linear regression fit of lemon-rate regressed onto car age and make.
- loess - which is a local polynomial regression (Harrell 2017).
- We will later choose the “best” way to present these data.

4.3 Page III - other findings

Auction Lemon Rates by State (Choropleth Map)
- Hypothesis: 2
- Variables: vnst, is_bad_buy
- Transformations: Calculate lemon rate for each state
- Mapping variables to aesthetics:
  - Map regions: States (vnst)
  - Color: Lemon rate (numerical value representing the proportion of lemons for each state)
- Reasoning: A choropleth map helps visualize geographic variations in data. In this case, it shows how lemon rates vary across different states in the U.S.
- User Interaction: Tooltip to display exact lemon rate for each state when hovering over map regions.
- Importance: State-specific lemon rates help users identify regional trends and consider the geographic location of a vehicle when purchasing.
- Title: “Auction Lemon Rates by State”
- Short explanation: “This choropleth map displays lemon rates by state for vehicles sold at auction. Darker colors indicate higher lemon rates, suggesting a greater risk of purchasing a problematic vehicle in that region.”

Median Vehicle Buying Cost by State (Choropleth Map)

Hypothesis: 4
Variables: vnst, veh_b_cost
Transformations: Calculate median vehicle buying cost for each state
Mapping variables to aesthetics:
- Map regions: States (vnst)
- Color: Median vehicle buying cost (numerical value representing the median buying cost of vehicles for each state)
Reasoning: A choropleth map helps visualize geographic variations in data. In this case, it shows how median vehicle buying costs vary across different states in the U.S.
User Interaction: Tooltip to display exact median vehicle buying cost for each state when hovering over map regions.
Importance: State-specific median vehicle buying costs help users identify regional trends and consider the geographic location of a vehicle when purchasing.
Title: “Median Vehicle Buying Cost by State”
Short explanation: “This choropleth map displays the median vehicle buying costs by state. Darker colors indicate higher costs, suggesting more expensive markets in these regions.”

References

Boehmke, Brad, and Brandon M Greenwell. 2019. Hands-on Machine Learning with r. CRC press.

Harrell, Frank E. 2017. “Regression Modeling Strategies.” Bios 330 (2018): 14.