Data source

library(devtools)
#devtools::install_github("mlr-org/farff")
library(farff) # package for reading the data set
library(readr) # for reading the csv file fast
library(summarytools) # as requested in assignment 1
library(tidyverse)
library(janitor) # for data preparation
library(gt) # recommended package for html tables
theme_set(theme_minimal())
dat <- farff::readARFF("kick.arff")
dat <- janitor::clean_names(dat)

Data description

Question 3

The topic of avoiding the purchase of malfunctioning cars, commonly referred to as “lemons,” is of great importance to almost everyone. The experience of buying a car that constantly breaks down can be incredibly frustrating and costly. In order to alleviate this problem, the research objective is to develop a statistical model that can accurately predict the likelihood of a car being a lemon.

The primary focus of the research will be to identify the key factors that are associated with a car being a lemon, including the car’s make, model, year, and other relevant characteristics. The study will also aim to assess the factors that determine a car’s price, and to compare the rates of lemons across different makes and models.

The comparison of makes and models is particularly interesting, as it has the potential to provide valuable information for anyone looking to purchase a car. The study will examine the lemon rate for each make and model, and determine if any commonly held stereotypes, such as “Japanese cars are reliable” or “American cars are often lemons,” are supported by the data.

The data set for this study comes from the business domain of car sales, and its implications are clear. Dealerships and other businesses that purchase used cars through trade-ins or auctions would greatly benefit from being able to assess the likelihood of a car being a lemon, as this information could help them to avoid costly purchases or make more informed pricing decisions.

In order to build the statistical models, a variety of machine learning algorithms will be employed, including simple methods like logistic regression as well as more advanced algorithms such as random forests and XGBoost. Regression models will also be developed to determine the price of the car, and if time permits, basic time-series methods will be used to examine changes in the prevalence of lemon cars over time.

Variables in the dataset

desc <- read.csv("data_desc.txt",header = T)
desc %>%   gt()
Field.Name Definition
RefID Unique (sequential) number assigned to vehicles
IsBadBuy Identifies if the kicked vehicle was an avoidable purchase
PurchDate The Date the vehicle was Purchased at Auction
Auction Auction provider at which the vehicle was purchased
VehYear The manufacturer's year of the vehicle
VehicleAge The Years elapsed since the manufacturer's year
Make Vehicle Manufacturer
Model Vehicle Model
Trim Vehicle Trim Level
SubModel Vehicle Submodel
Color Vehicle Color
Transmission Vehicles transmission type (Automatic / Manual)
WheelTypeID The type id of the vehicle wheel
WheelType The vehicle wheel type description (Alloy / Covers)
VehOdo The vehicles odometer reading
Nationality The Manufacturer's country
Size The size category of the vehicle (Compact / SUV/ etc.)
TopThreeAmericanName Identifies if the manufacturer is one of the top three American manufacturers
MMRAcquisitionAuctionAveragePrice Acquisition price for this vehicle in average condition at time of purchase
MMRAcquisitionAuctionCleanPrice Acquisition price for this vehicle in the above Average condition at time of purchase
MMRAcquisitionRetailAveragePrice Acquisition price for this vehicle in the retail market in average condition at time of purchase
MMRAcquisitonRetailCleanPrice Acquisition price for this vehicle in the retail market in above average condition at time of purchase
MMRCurrentAuctionAveragePrice Acquisition price for this vehicle in average condition as of current day
MMRCurrentAuctionCleanPrice Acquisition price for this vehicle in the above condition as of current day
MMRCurrentRetailAveragePrice Acquisition price for this vehicle in the retail market in average condition as of current day
MMRCurrentRetailCleanPrice Acquisition price for this vehicle in the retail market in above average condition as of current day
PRIMEUNIT Identifies if the vehicle would have a higher demand than a standard purchase
AcquisitionType Identifies how the vehicle was acquired (Auction buy or trade in or etc)
AUCGUART The level guarantee provided by auction for the vehicle (Green light - Guaranteed/arbitratable Yellow Light - caution/issue red light - sold as is)
KickDate Date the vehicle was kicked back to the auction
BYRNO Unique number assigned to the buyer that purchased the vehicle
VNZIP Zipcode where the car was purchased
VNST State where the car was purchased
VehBCost Acquisition cost paid for the vehicle at time of purchase
IsOnlineSale Identifies if the vehicle was originally purchased online
WarrantyCost Warranty price (term=36month and mileage=36K)

Data Summary

dfSummary(dat) 

Data Frame Summary
dat
Dimensions: 72983 x 33
Duplicates: 0

No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 is_bad_buy [factor] 1. 0 2. 1 64007 (87.7%) 8976 (12.3%) IIIIIIIIIIIIIIIII II 72983 (100.0%) 0 (0.0%)
2 purch_date [numeric] Mean (sd) : 1263091914 (18196025) min < med < max: 1231113600 < 1264032000 < 1293667200 IQR (CV) : 31017600 (0) 517 distinct values . : . . . : : : . : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 72983 (100.0%) 0 (0.0%)
3 auction [factor] 1. ADESA 2. MANHEIM 3. OTHER 14439 (19.8%) 41043 (56.2%) 17501 (24.0%) III IIIIIIIIIII IIII 72983 (100.0%) 0 (0.0%)
4 veh_year [numeric] Mean (sd) : 2005.3 (1.7) min < med < max: 2001 < 2005 < 2010 IQR (CV) : 3 (0) 2001 : 1481 ( 2.0%) 2002 : 3405 ( 4.7%) 2003 : 6227 ( 8.5%) 2004 : 10207 (14.0%) 2005 : 15489 (21.2%) 2006 : 17043 (23.4%) 2007 : 11423 (15.7%) 2008 : 6885 ( 9.4%) 2009 : 822 ( 1.1%) 2010 : 1 ( 0.0%) 72983 (100.0%) 0 (0.0%)
5 vehicle_age [numeric] Mean (sd) : 4.2 (1.7) min < med < max: 0 < 4 < 9 IQR (CV) : 2 (0.4) 0 : 2 ( 0.0%) 1 : 3094 ( 4.2%) 2 : 8482 (11.6%) 3 : 15902 (21.8%) 4 : 17013 (23.3%) 5 : 12956 (17.8%) 6 : 8022 (11.0%) 7 : 4646 ( 6.4%) 8 : 2220 ( 3.0%) 9 : 646 ( 0.9%) 72983 (100.0%) 0 (0.0%)
6 make [factor] 1. ACURA 2. BUICK 3. CADILLAC 4. CHEVROLET 5. CHRYSLER 6. DODGE 7. FORD 8. GMC 9. HONDA 10. HUMMER [ 23 others ] 33 ( 0.0%) 720 ( 1.0%) 33 ( 0.0%) 17248 (23.6%) 8844 (12.1%) 12912 (17.7%) 11305 (15.5%) 649 ( 0.9%) 497 ( 0.7%) 1 ( 0.0%) 20741 (28.4%) 72983 (100.0%) 0 (0.0%)
7 model [factor] 1. 1500 RAM PICKUP 2WD 2. 1500 RAM PICKUP 4WD 3. 1500 SIERRA PICKUP 2 4. 1500 SIERRA PICKUP 4 5. 1500 SILVERADO PICKU 6. 1500HD SIERRA PICKUP 7. 1500HD SILVERADO PIC 8. 2500 RAM PICKUP 2WD 9. 2500 RAM PICKUP 4WD 10. 2500 SILVERADO PICKU [ 1053 others ] 1101 ( 1.5%) 57 ( 0.1%) 132 ( 0.2%) 4 ( 0.0%) 596 ( 0.8%) 1 ( 0.0%) 8 ( 0.0%) 13 ( 0.0%) 3 ( 0.0%) 2 ( 0.0%) 71066 (97.4%) 72983 (100.0%) 0 (0.0%)
8 trim [factor] 1. 1 2. 150 3. 2 4. 250 5. 3 6. 3 R 7. Adv 8. Bas 9. C 10. Car [ 124 others ] 168 ( 0.2%) 305 ( 0.4%) 765 ( 1.1%) 3 ( 0.0%) 188 ( 0.3%) 7 ( 0.0%) 227 ( 0.3%) 13950 (19.8%) 7 ( 0.0%) 3 ( 0.0%) 55000 (77.9%) 70623 (96.8%) 2360 (3.2%)
9 sub_model [factor] 1. 2D CONVERTIBLE 2. 2D CONVERTIBLE DREAM CRUI 3. 2D CONVERTIBLE GL 4. 2D CONVERTIBLE GLS 5. 2D CONVERTIBLE GT 6. 2D CONVERTIBLE GTC 7. 2D CONVERTIBLE LIMITED 8. 2D CONVERTIBLE LX 9. 2D CONVERTIBLE LXI 10. 2D CONVERTIBLE SE [ 853 others ] 279 ( 0.4%) 10 ( 0.0%) 1 ( 0.0%) 3 ( 0.0%) 9 ( 0.0%) 34 ( 0.0%) 9 ( 0.0%) 38 ( 0.1%) 2 ( 0.0%) 3 ( 0.0%) 72587 (99.5%) 72975 (100.0%) 8 (0.0%)
10 color [factor] 1. BEIGE 2. BLACK 3. BLUE 4. BROWN 5. GOLD 6. GREEN 7. GREY 8. MAROON 9. NOT AVAIL 10. ORANGE [ 6 others ] 1584 ( 2.2%) 7627 (10.5%) 10347 (14.2%) 436 ( 0.6%) 5231 ( 7.2%) 3194 ( 4.4%) 7887 (10.8%) 2046 ( 2.8%) 94 ( 0.1%) 415 ( 0.6%) 34114 (46.7%) II II 72975 (100.0%) 8 (0.0%)
11 transmission [factor] 1. AUTO 2. Manual 3. MANUAL 70398 (96.5%) 1 ( 0.0%) 2575 ( 3.5%) IIIIIIIIIIIIIIIIIII 72974 (100.0%) 9 (0.0%)
12 wheel_type_id [factor] 1. 0 2. 1 3. 2 4. 3 5 ( 0.0%) 36050 (51.6%) 33004 (47.3%) 755 ( 1.1%) IIIIIIIIII IIIIIIIII 69814 (95.7%) 3169 (4.3%)
13 wheel_type [factor] 1. Alloy 2. Covers 3. Special 36050 (51.6%) 33004 (47.3%) 755 ( 1.1%) IIIIIIIIII IIIIIIIII 69809 (95.7%) 3174 (4.3%)
14 veh_odo [numeric] Mean (sd) : 71500 (14578.9) min < med < max: 4825 < 73361 < 115717 IQR (CV) : 20599 (0.2) 39947 distinct values : . : . . : : : : : : : : : : : : . 72983 (100.0%) 0 (0.0%)
15 nationality [factor] 1. AMERICAN 2. OTHER 3. OTHER ASIAN 4. TOP LINE ASIAN 61028 (83.6%) 195 ( 0.3%) 8033 (11.0%) 3722 ( 5.1%) IIIIIIIIIIIIIIII 72978 (100.0%) 5 (0.0%)
16 size [factor] 1. COMPACT 2. CROSSOVER 3. LARGE 4. LARGE SUV 5. LARGE TRUCK 6. MEDIUM 7. MEDIUM SUV 8. SMALL SUV 9. SMALL TRUCK 10. SPECIALTY [ 2 others ] 7205 ( 9.9%) 1759 ( 2.4%) 8850 (12.1%) 1433 ( 2.0%) 3170 ( 4.3%) 30785 (42.2%) 8090 (11.1%) 2276 ( 3.1%) 864 ( 1.2%) 1915 ( 2.6%) 6631 ( 9.1%) I 72978 (100.0%) 5 (0.0%)
17 top_three_american_name [factor] 1. CHRYSLER 2. FORD 3. GM 4. OTHER 23399 (32.1%) 12315 (16.9%) 25314 (34.7%) 11950 (16.4%) IIIIII III IIIIII III 72978 (100.0%) 5 (0.0%)
18 mmr_acquisition_auction_average_price [numeric] Mean (sd) : 6128.9 (2462) min < med < max: 0 < 6097 < 35722 IQR (CV) : 3492 (0.4) 10342 distinct values : : : : . : : : : : . 72965 (100.0%) 18 (0.0%)
19 mmr_acquisition_auction_clean_price [numeric] Mean (sd) : 7373.6 (2722.5) min < med < max: 0 < 7303 < 36859 IQR (CV) : 3615 (0.4) 11379 distinct values : . : : : : : : . : : : 72965 (100.0%) 18 (0.0%)
20 mmr_acquisition_retail_average_price [numeric] Mean (sd) : 8497 (3156.3) min < med < max: 0 < 8444 < 39080 IQR (CV) : 4371 (0.4) 12725 distinct values : : : : : : : . : : : : 72965 (100.0%) 18 (0.0%)
21 mmr_acquisiton_retail_clean_price [numeric] Mean (sd) : 9850.9 (3385.8) min < med < max: 0 < 9789 < 41482 IQR (CV) : 4595 (0.3) 13456 distinct values : . : : : : : : . : : : . 72965 (100.0%) 18 (0.0%)
22 mmr_current_auction_average_price [numeric] Mean (sd) : 6132.1 (2434.6) min < med < max: 0 < 6062 < 35722 IQR (CV) : 3461 (0.4) 10315 distinct values : : : : . : : : : : . 72668 (99.6%) 315 (0.4%)
23 mmr_current_auction_clean_price [numeric] Mean (sd) : 7390.7 (2686.2) min < med < max: 0 < 7313 < 36859 IQR (CV) : 3599 (0.4) 11265 distinct values : . : : : : : : . : : : 72668 (99.6%) 315 (0.4%)
24 mmr_current_retail_average_price [numeric] Mean (sd) : 8775.7 (3090.7) min < med < max: 0 < 8729 < 39080 IQR (CV) : 4375 (0.4) 12493 distinct values : : : : : : : . . : : : 72668 (99.6%) 315 (0.4%)
25 mmr_current_retail_clean_price [numeric] Mean (sd) : 10145.4 (3310.3) min < med < max: 0 < 10103 < 41062 IQR (CV) : 4525 (0.3) 13192 distinct values : : : : . : : : : : : . 72668 (99.6%) 315 (0.4%)
26 primeunit [factor] 1. NO 2. YES 3357 (98.2%) 62 ( 1.8%) IIIIIIIIIIIIIIIIIII 3419 (4.7%) 69564 (95.3%)
27 aucguart [factor] 1. GREEN 2. RED 3340 (97.7%) 79 ( 2.3%) IIIIIIIIIIIIIIIIIII 3419 (4.7%) 69564 (95.3%)
28 byrno [factor] 1. 835 2. 1031 3. 1035 4. 1041 5. 1045 6. 1051 7. 1055 8. 1081 9. 1082 10. 1085 [ 64 others ] 2987 ( 4.1%) 31 ( 0.0%) 10 ( 0.0%) 8 ( 0.0%) 62 ( 0.1%) 73 ( 0.1%) 12 ( 0.0%) 37 ( 0.1%) 7 ( 0.0%) 50 ( 0.1%) 69706 (95.5%) 72983 (100.0%) 0 (0.0%)
29 vnzip1 [factor] 1. 2764 2. 3106 3. 8505 4. 12552 5. 16066 6. 16137 7. 17028 8. 17406 9. 17545 10. 19440 [ 143 others ] 15 ( 0.0%) 97 ( 0.1%) 317 ( 0.4%) 6 ( 0.0%) 6 ( 0.0%) 2 ( 0.0%) 33 ( 0.0%) 139 ( 0.2%) 136 ( 0.2%) 531 ( 0.7%) 71701 (98.2%) 72983 (100.0%) 0 (0.0%)
30 vnst [factor] 1. AL 2. AR 3. AZ 4. CA 5. CO 6. FL 7. GA 8. IA 9. ID 10. IL [ 27 others ] 690 ( 0.9%) 70 ( 0.1%) 6174 ( 8.5%) 7095 ( 9.7%) 4998 ( 6.8%) 10447 (14.3%) 2450 ( 3.4%) 499 ( 0.7%) 196 ( 0.3%) 458 ( 0.6%) 39906 (54.7%) 72983 (100.0%) 0 (0.0%)
31 veh_b_cost [numeric] Mean (sd) : 6729.2 (1765) min < med < max: 1 < 6700 < 45469 IQR (CV) : 2470 (0.3) 2010 distinct values : : : : . : . 72915 (99.9%) 68 (0.1%)
32 is_online_sale [factor] 1. 0 2. 1 71138 (97.5%) 1845 ( 2.5%) IIIIIIIIIIIIIIIIIII 72983 (100.0%) 0 (0.0%)
33 warranty_cost [numeric] Mean (sd) : 1276.6 (598.8) min < med < max: 462 < 1155 < 7498 IQR (CV) : 786 (0.5) 281 distinct values : : . : : : : . : : : 72983 (100.0%) 0 (0.0%)

Exploration

Year manufactured

ggplot(dat) +
  aes(x = veh_year ) +
  geom_bar()+
  scale_x_continuous(breaks = min(dat$veh_year):max(dat$veh_year)) + 
  xlab("The manufacturer's year of the vehicle")

Auction

ggplot(dat) +
  aes(x = auction) +
  geom_bar()

Make

dat$make <- tolower(dat$make)
dat %>%
  group_by(make) %>%
  tally() %>% # count the number of instances per make
  mutate(make = if_else(n >= 1000, make, "other")) %>%  # mark cars with less than 1000 instances
  group_by(make) %>%  # group the summarized table to sum all "others"
  summarize(n = sum(n)) %>%  # keeps original n, sums only for "others" %>%
  ggplot() +
  aes(x = make, y = n) +
  geom_col()

Wheel type

ggplot(dat) +
  aes(x = wheel_type ) +
  geom_bar()

Vehicle odometer

ggplot(dat) +
  aes(x = veh_odo ) +
  geom_histogram() +
  xlab("Odometer reading")

Size category

ggplot(dat) +
  aes(x = size ) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45))

Odometer vs. Cost

  • It is hard to see the relationship of cost and odometer from this plot.
ggplot(dat) +
  aes(x = veh_odo, y = veh_b_cost) +
  geom_point(alpha = 0.1) # lower the alpha to see areas with many points

Cost by Wheel type

ggplot(dat) +
  aes(y = veh_b_cost, x = wheel_type) +
  geom_boxplot()

Cost by Nation

ggplot(dat) +
  aes(y = veh_b_cost, x = nationality) +
  geom_boxplot()

Cost by Color

  • Grey, black and silver cars have some higher cost instances. Are these luxury cars?
ggplot(dat) +
  aes(y = veh_b_cost, x = color) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45))

Bad Buy vs Cost

  • Unsurprisingly, kick cars (aka “lemons”) tend to be cheaper than good cars.
ggplot(dat) +
  aes(x = veh_b_cost, color = is_bad_buy) +
  geom_density() +
  scale_x_continuous(limits = c(0,15000)) +
  ggtitle("Densities of vehicle costs, by lemon status")