We obtained the data set from the following URL: https://www.openml.org/search?type=data&status=any&id=41162.
The data are in FARFF format, which required a package to read into R.
library(devtools)
#devtools::install_github("mlr-org/farff")
library(farff) # package for reading the data set
library(readr) # for reading the csv file fast
library(summarytools) # as requested in assignment 1
library(tidyverse)
library(janitor) # for data preparation
library(gt) # recommended package for html tables
theme_set(theme_minimal())
dat <- farff::readARFF("kick.arff")
dat <- janitor::clean_names(dat)
A brief online research revealed that this data set was provided by a company called ’Caravan”.
The data comprise 36 fields as defined in the table below.
Note: We use the R styling convention where names are uncapitalized and words are separated by underscores. The data were renamed using the function clean_names() from the janitor package.
The topic of avoiding the purchase of malfunctioning cars, commonly referred to as “lemons,” is of great importance to almost everyone. The experience of buying a car that constantly breaks down can be incredibly frustrating and costly. In order to alleviate this problem, the research objective is to develop a statistical model that can accurately predict the likelihood of a car being a lemon.
The primary focus of the research will be to identify the key factors that are associated with a car being a lemon, including the car’s make, model, year, and other relevant characteristics. The study will also aim to assess the factors that determine a car’s price, and to compare the rates of lemons across different makes and models.
The comparison of makes and models is particularly interesting, as it has the potential to provide valuable information for anyone looking to purchase a car. The study will examine the lemon rate for each make and model, and determine if any commonly held stereotypes, such as “Japanese cars are reliable” or “American cars are often lemons,” are supported by the data.
The data set for this study comes from the business domain of car sales, and its implications are clear. Dealerships and other businesses that purchase used cars through trade-ins or auctions would greatly benefit from being able to assess the likelihood of a car being a lemon, as this information could help them to avoid costly purchases or make more informed pricing decisions.
In order to build the statistical models, a variety of machine learning algorithms will be employed, including simple methods like logistic regression as well as more advanced algorithms such as random forests and XGBoost. Regression models will also be developed to determine the price of the car, and if time permits, basic time-series methods will be used to examine changes in the prevalence of lemon cars over time.
desc <- read.csv("data_desc.txt",header = T)
desc %>% gt()
| Field.Name | Definition |
|---|---|
| RefID | Unique (sequential) number assigned to vehicles |
| IsBadBuy | Identifies if the kicked vehicle was an avoidable purchase |
| PurchDate | The Date the vehicle was Purchased at Auction |
| Auction | Auction provider at which the vehicle was purchased |
| VehYear | The manufacturer's year of the vehicle |
| VehicleAge | The Years elapsed since the manufacturer's year |
| Make | Vehicle Manufacturer |
| Model | Vehicle Model |
| Trim | Vehicle Trim Level |
| SubModel | Vehicle Submodel |
| Color | Vehicle Color |
| Transmission | Vehicles transmission type (Automatic / Manual) |
| WheelTypeID | The type id of the vehicle wheel |
| WheelType | The vehicle wheel type description (Alloy / Covers) |
| VehOdo | The vehicles odometer reading |
| Nationality | The Manufacturer's country |
| Size | The size category of the vehicle (Compact / SUV/ etc.) |
| TopThreeAmericanName | Identifies if the manufacturer is one of the top three American manufacturers |
| MMRAcquisitionAuctionAveragePrice | Acquisition price for this vehicle in average condition at time of purchase |
| MMRAcquisitionAuctionCleanPrice | Acquisition price for this vehicle in the above Average condition at time of purchase |
| MMRAcquisitionRetailAveragePrice | Acquisition price for this vehicle in the retail market in average condition at time of purchase |
| MMRAcquisitonRetailCleanPrice | Acquisition price for this vehicle in the retail market in above average condition at time of purchase |
| MMRCurrentAuctionAveragePrice | Acquisition price for this vehicle in average condition as of current day |
| MMRCurrentAuctionCleanPrice | Acquisition price for this vehicle in the above condition as of current day |
| MMRCurrentRetailAveragePrice | Acquisition price for this vehicle in the retail market in average condition as of current day |
| MMRCurrentRetailCleanPrice | Acquisition price for this vehicle in the retail market in above average condition as of current day |
| PRIMEUNIT | Identifies if the vehicle would have a higher demand than a standard purchase |
| AcquisitionType | Identifies how the vehicle was acquired (Auction buy or trade in or etc) |
| AUCGUART | The level guarantee provided by auction for the vehicle (Green light - Guaranteed/arbitratable Yellow Light - caution/issue red light - sold as is) |
| KickDate | Date the vehicle was kicked back to the auction |
| BYRNO | Unique number assigned to the buyer that purchased the vehicle |
| VNZIP | Zipcode where the car was purchased |
| VNST | State where the car was purchased |
| VehBCost | Acquisition cost paid for the vehicle at time of purchase |
| IsOnlineSale | Identifies if the vehicle was originally purchased online |
| WarrantyCost | Warranty price (term=36month and mileage=36K) |
dfSummary(dat)
Data Frame Summary
dat
Dimensions: 72983 x 33
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|
| 1 | is_bad_buy [factor] | 1. 0 2. 1 | 64007 (87.7%) 8976 (12.3%) | IIIIIIIIIIIIIIIII II | 72983 (100.0%) | 0 (0.0%) |
| 2 | purch_date [numeric] | Mean (sd) : 1263091914 (18196025) min < med < max: 1231113600 < 1264032000 < 1293667200 IQR (CV) : 31017600 (0) | 517 distinct values | . : . . . : : : . : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : | 72983 (100.0%) | 0 (0.0%) |
| 3 | auction [factor] | 1. ADESA 2. MANHEIM 3. OTHER | 14439 (19.8%) 41043 (56.2%) 17501 (24.0%) | III IIIIIIIIIII IIII | 72983 (100.0%) | 0 (0.0%) |
| 4 | veh_year [numeric] | Mean (sd) : 2005.3 (1.7) min < med < max: 2001 < 2005 < 2010 IQR (CV) : 3 (0) | 2001 : 1481 ( 2.0%) 2002 : 3405 ( 4.7%) 2003 : 6227 ( 8.5%) 2004 : 10207 (14.0%) 2005 : 15489 (21.2%) 2006 : 17043 (23.4%) 2007 : 11423 (15.7%) 2008 : 6885 ( 9.4%) 2009 : 822 ( 1.1%) 2010 : 1 ( 0.0%) | 72983 (100.0%) | 0 (0.0%) | |
| 5 | vehicle_age [numeric] | Mean (sd) : 4.2 (1.7) min < med < max: 0 < 4 < 9 IQR (CV) : 2 (0.4) | 0 : 2 ( 0.0%) 1 : 3094 ( 4.2%) 2 : 8482 (11.6%) 3 : 15902 (21.8%) 4 : 17013 (23.3%) 5 : 12956 (17.8%) 6 : 8022 (11.0%) 7 : 4646 ( 6.4%) 8 : 2220 ( 3.0%) 9 : 646 ( 0.9%) | 72983 (100.0%) | 0 (0.0%) | |
| 6 | make [factor] | 1. ACURA 2. BUICK 3. CADILLAC 4. CHEVROLET 5. CHRYSLER 6. DODGE 7. FORD 8. GMC 9. HONDA 10. HUMMER [ 23 others ] | 33 ( 0.0%) 720 ( 1.0%) 33 ( 0.0%) 17248 (23.6%) 8844 (12.1%) 12912 (17.7%) 11305 (15.5%) 649 ( 0.9%) 497 ( 0.7%) 1 ( 0.0%) 20741 (28.4%) | 72983 (100.0%) | 0 (0.0%) | |
| 7 | model [factor] | 1. 1500 RAM PICKUP 2WD 2. 1500 RAM PICKUP 4WD 3. 1500 SIERRA PICKUP 2 4. 1500 SIERRA PICKUP 4 5. 1500 SILVERADO PICKU 6. 1500HD SIERRA PICKUP 7. 1500HD SILVERADO PIC 8. 2500 RAM PICKUP 2WD 9. 2500 RAM PICKUP 4WD 10. 2500 SILVERADO PICKU [ 1053 others ] | 1101 ( 1.5%) 57 ( 0.1%) 132 ( 0.2%) 4 ( 0.0%) 596 ( 0.8%) 1 ( 0.0%) 8 ( 0.0%) 13 ( 0.0%) 3 ( 0.0%) 2 ( 0.0%) 71066 (97.4%) | 72983 (100.0%) | 0 (0.0%) | |
| 8 | trim [factor] | 1. 1 2. 150 3. 2 4. 250 5. 3 6. 3 R 7. Adv 8. Bas 9. C 10. Car [ 124 others ] | 168 ( 0.2%) 305 ( 0.4%) 765 ( 1.1%) 3 ( 0.0%) 188 ( 0.3%) 7 ( 0.0%) 227 ( 0.3%) 13950 (19.8%) 7 ( 0.0%) 3 ( 0.0%) 55000 (77.9%) | 70623 (96.8%) | 2360 (3.2%) | |
| 9 | sub_model [factor] | 1. 2D CONVERTIBLE 2. 2D CONVERTIBLE DREAM CRUI 3. 2D CONVERTIBLE GL 4. 2D CONVERTIBLE GLS 5. 2D CONVERTIBLE GT 6. 2D CONVERTIBLE GTC 7. 2D CONVERTIBLE LIMITED 8. 2D CONVERTIBLE LX 9. 2D CONVERTIBLE LXI 10. 2D CONVERTIBLE SE [ 853 others ] | 279 ( 0.4%) 10 ( 0.0%) 1 ( 0.0%) 3 ( 0.0%) 9 ( 0.0%) 34 ( 0.0%) 9 ( 0.0%) 38 ( 0.1%) 2 ( 0.0%) 3 ( 0.0%) 72587 (99.5%) | 72975 (100.0%) | 8 (0.0%) | |
| 10 | color [factor] | 1. BEIGE 2. BLACK 3. BLUE 4. BROWN 5. GOLD 6. GREEN 7. GREY 8. MAROON 9. NOT AVAIL 10. ORANGE [ 6 others ] | 1584 ( 2.2%) 7627 (10.5%) 10347 (14.2%) 436 ( 0.6%) 5231 ( 7.2%) 3194 ( 4.4%) 7887 (10.8%) 2046 ( 2.8%) 94 ( 0.1%) 415 ( 0.6%) 34114 (46.7%) | II II | 72975 (100.0%) | 8 (0.0%) |
| 11 | transmission [factor] | 1. AUTO 2. Manual 3. MANUAL | 70398 (96.5%) 1 ( 0.0%) 2575 ( 3.5%) | IIIIIIIIIIIIIIIIIII | 72974 (100.0%) | 9 (0.0%) |
| 12 | wheel_type_id [factor] | 1. 0 2. 1 3. 2 4. 3 | 5 ( 0.0%) 36050 (51.6%) 33004 (47.3%) 755 ( 1.1%) | IIIIIIIIII IIIIIIIII | 69814 (95.7%) | 3169 (4.3%) |
| 13 | wheel_type [factor] | 1. Alloy 2. Covers 3. Special | 36050 (51.6%) 33004 (47.3%) 755 ( 1.1%) | IIIIIIIIII IIIIIIIII | 69809 (95.7%) | 3174 (4.3%) |
| 14 | veh_odo [numeric] | Mean (sd) : 71500 (14578.9) min < med < max: 4825 < 73361 < 115717 IQR (CV) : 20599 (0.2) | 39947 distinct values | : . : . . : : : : : : : : : : : : . | 72983 (100.0%) | 0 (0.0%) |
| 15 | nationality [factor] | 1. AMERICAN 2. OTHER 3. OTHER ASIAN 4. TOP LINE ASIAN | 61028 (83.6%) 195 ( 0.3%) 8033 (11.0%) 3722 ( 5.1%) | IIIIIIIIIIIIIIII | 72978 (100.0%) | 5 (0.0%) |
| 16 | size [factor] | 1. COMPACT 2. CROSSOVER 3. LARGE 4. LARGE SUV 5. LARGE TRUCK 6. MEDIUM 7. MEDIUM SUV 8. SMALL SUV 9. SMALL TRUCK 10. SPECIALTY [ 2 others ] | 7205 ( 9.9%) 1759 ( 2.4%) 8850 (12.1%) 1433 ( 2.0%) 3170 ( 4.3%) 30785 (42.2%) 8090 (11.1%) 2276 ( 3.1%) 864 ( 1.2%) 1915 ( 2.6%) 6631 ( 9.1%) | I | 72978 (100.0%) | 5 (0.0%) |
| 17 | top_three_american_name [factor] | 1. CHRYSLER 2. FORD 3. GM 4. OTHER | 23399 (32.1%) 12315 (16.9%) 25314 (34.7%) 11950 (16.4%) | IIIIII III IIIIII III | 72978 (100.0%) | 5 (0.0%) |
| 18 | mmr_acquisition_auction_average_price [numeric] | Mean (sd) : 6128.9 (2462) min < med < max: 0 < 6097 < 35722 IQR (CV) : 3492 (0.4) | 10342 distinct values | : : : : . : : : : : . | 72965 (100.0%) | 18 (0.0%) |
| 19 | mmr_acquisition_auction_clean_price [numeric] | Mean (sd) : 7373.6 (2722.5) min < med < max: 0 < 7303 < 36859 IQR (CV) : 3615 (0.4) | 11379 distinct values | : . : : : : : : . : : : | 72965 (100.0%) | 18 (0.0%) |
| 20 | mmr_acquisition_retail_average_price [numeric] | Mean (sd) : 8497 (3156.3) min < med < max: 0 < 8444 < 39080 IQR (CV) : 4371 (0.4) | 12725 distinct values | : : : : : : : . : : : : | 72965 (100.0%) | 18 (0.0%) |
| 21 | mmr_acquisiton_retail_clean_price [numeric] | Mean (sd) : 9850.9 (3385.8) min < med < max: 0 < 9789 < 41482 IQR (CV) : 4595 (0.3) | 13456 distinct values | : . : : : : : : . : : : . | 72965 (100.0%) | 18 (0.0%) |
| 22 | mmr_current_auction_average_price [numeric] | Mean (sd) : 6132.1 (2434.6) min < med < max: 0 < 6062 < 35722 IQR (CV) : 3461 (0.4) | 10315 distinct values | : : : : . : : : : : . | 72668 (99.6%) | 315 (0.4%) |
| 23 | mmr_current_auction_clean_price [numeric] | Mean (sd) : 7390.7 (2686.2) min < med < max: 0 < 7313 < 36859 IQR (CV) : 3599 (0.4) | 11265 distinct values | : . : : : : : : . : : : | 72668 (99.6%) | 315 (0.4%) |
| 24 | mmr_current_retail_average_price [numeric] | Mean (sd) : 8775.7 (3090.7) min < med < max: 0 < 8729 < 39080 IQR (CV) : 4375 (0.4) | 12493 distinct values | : : : : : : : . . : : : | 72668 (99.6%) | 315 (0.4%) |
| 25 | mmr_current_retail_clean_price [numeric] | Mean (sd) : 10145.4 (3310.3) min < med < max: 0 < 10103 < 41062 IQR (CV) : 4525 (0.3) | 13192 distinct values | : : : : . : : : : : : . | 72668 (99.6%) | 315 (0.4%) |
| 26 | primeunit [factor] | 1. NO 2. YES | 3357 (98.2%) 62 ( 1.8%) | IIIIIIIIIIIIIIIIIII | 3419 (4.7%) | 69564 (95.3%) |
| 27 | aucguart [factor] | 1. GREEN 2. RED | 3340 (97.7%) 79 ( 2.3%) | IIIIIIIIIIIIIIIIIII | 3419 (4.7%) | 69564 (95.3%) |
| 28 | byrno [factor] | 1. 835 2. 1031 3. 1035 4. 1041 5. 1045 6. 1051 7. 1055 8. 1081 9. 1082 10. 1085 [ 64 others ] | 2987 ( 4.1%) 31 ( 0.0%) 10 ( 0.0%) 8 ( 0.0%) 62 ( 0.1%) 73 ( 0.1%) 12 ( 0.0%) 37 ( 0.1%) 7 ( 0.0%) 50 ( 0.1%) 69706 (95.5%) | 72983 (100.0%) | 0 (0.0%) | |
| 29 | vnzip1 [factor] | 1. 2764 2. 3106 3. 8505 4. 12552 5. 16066 6. 16137 7. 17028 8. 17406 9. 17545 10. 19440 [ 143 others ] | 15 ( 0.0%) 97 ( 0.1%) 317 ( 0.4%) 6 ( 0.0%) 6 ( 0.0%) 2 ( 0.0%) 33 ( 0.0%) 139 ( 0.2%) 136 ( 0.2%) 531 ( 0.7%) 71701 (98.2%) | 72983 (100.0%) | 0 (0.0%) | |
| 30 | vnst [factor] | 1. AL 2. AR 3. AZ 4. CA 5. CO 6. FL 7. GA 8. IA 9. ID 10. IL [ 27 others ] | 690 ( 0.9%) 70 ( 0.1%) 6174 ( 8.5%) 7095 ( 9.7%) 4998 ( 6.8%) 10447 (14.3%) 2450 ( 3.4%) 499 ( 0.7%) 196 ( 0.3%) 458 ( 0.6%) 39906 (54.7%) | 72983 (100.0%) | 0 (0.0%) | |
| 31 | veh_b_cost [numeric] | Mean (sd) : 6729.2 (1765) min < med < max: 1 < 6700 < 45469 IQR (CV) : 2470 (0.3) | 2010 distinct values | : : : : . : . | 72915 (99.9%) | 68 (0.1%) |
| 32 | is_online_sale [factor] | 1. 0 2. 1 | 71138 (97.5%) 1845 ( 2.5%) | IIIIIIIIIIIIIIIIIII | 72983 (100.0%) | 0 (0.0%) |
| 33 | warranty_cost [numeric] | Mean (sd) : 1276.6 (598.8) min < med < max: 462 < 1155 < 7498 IQR (CV) : 786 (0.5) | 281 distinct values | : : . : : : : . : : : | 72983 (100.0%) | 0 (0.0%) |
ggplot(dat) +
aes(x = veh_year ) +
geom_bar()+
scale_x_continuous(breaks = min(dat$veh_year):max(dat$veh_year)) +
xlab("The manufacturer's year of the vehicle")
ggplot(dat) +
aes(x = auction) +
geom_bar()
dat$make <- tolower(dat$make)
dat %>%
group_by(make) %>%
tally() %>% # count the number of instances per make
mutate(make = if_else(n >= 1000, make, "other")) %>% # mark cars with less than 1000 instances
group_by(make) %>% # group the summarized table to sum all "others"
summarize(n = sum(n)) %>% # keeps original n, sums only for "others" %>%
ggplot() +
aes(x = make, y = n) +
geom_col()
ggplot(dat) +
aes(x = wheel_type ) +
geom_bar()
ggplot(dat) +
aes(x = veh_odo ) +
geom_histogram() +
xlab("Odometer reading")
ggplot(dat) +
aes(x = size ) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45))
ggplot(dat) +
aes(x = veh_odo, y = veh_b_cost) +
geom_point(alpha = 0.1) # lower the alpha to see areas with many points
ggplot(dat) +
aes(y = veh_b_cost, x = wheel_type) +
geom_boxplot()
ggplot(dat) +
aes(y = veh_b_cost, x = nationality) +
geom_boxplot()
ggplot(dat) +
aes(y = veh_b_cost, x = color) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45))
ggplot(dat) +
aes(x = veh_b_cost, color = is_bad_buy) +
geom_density() +
scale_x_continuous(limits = c(0,15000)) +
ggtitle("Densities of vehicle costs, by lemon status")