Craigslist is the world’s largest collection of used vehicles for sale, yet it’s very difficult to collect all of them in the same place. I built a scraper for a school project and expanded upon it later to create this dataset which includes every used vehicle entry within the United States on Craigslist.
The dataset contains around 500k used cars with its 26 associated variables (characteristics). This data is scraped every few months, it contains most all relevant information that Craigslist provides on car sales including columns like price, condition, manufacturer, latitude/longitude, and 18 other categories.
new, good, fair etc.gas, diesel, electric etc.clean, rebuilt etc.manual, automatic.rwd, fwd, 4wd.compact, full-size, etc.SUV, sedan, etc.In order to address those questions, there are some things to do such as:
suppressPackageStartupMessages({
# Load Analysis Libraries
library(dplyr)
library(tidyr)
library(plotly)
library(ggplot2)
# Load ML libraries
library(caret)
library(ranger)
library(rpart)
library(MASS)
})
df = read.csv("vehicles.csv", na.strings = c("", "NA"))
head(df)
dim(df)
[1] 426880 26
names(df)
[1] "id" "url" "region" "region_url" "price" "year"
[7] "manufacturer" "model" "condition" "cylinders" "fuel" "odometer"
[13] "title_status" "transmission" "VIN" "drive" "size" "type"
[19] "paint_color" "image_url" "description" "county" "state" "lat"
[25] "long" "posting_date"
# The structure of the dataframe
glimpse(df)
Rows: 426,880
Columns: 26
$ id <dbl> 7222695916, 7218891961, 7221797935, 7222270760, 7210384030, 7222379453, 7221952215~
$ url <chr> "https://prescott.craigslist.org/cto/d/prescott-2010-ford-ranger/7222695916.html",~
$ region <chr> "prescott", "fayetteville", "florida keys", "worcester / central MA", "greensboro"~
$ region_url <chr> "https://prescott.craigslist.org", "https://fayar.craigslist.org", "https://keys.c~
$ price <dbl> 6000, 11900, 21000, 1500, 4900, 1600, 1000, 15995, 5000, 3000, 0, 0, 0, 0, 0, 1399~
$ year <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ manufacturer <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ model <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ condition <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ cylinders <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ fuel <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ odometer <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ title_status <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ transmission <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ VIN <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ drive <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ size <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ paint_color <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ image_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ description <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ county <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ state <chr> "az", "ar", "fl", "ma", "nc", "ny", "ny", "ny", "or", "pa", "tx", "tx", "tx", "tx"~
$ lat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ long <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ posting_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
Most of the variable aren’t in the correct type and it need to be reassigned to its datatype.
# Count the nan values in the data
sum(is.na(df))
[1] 1655334
sum(complete.cases(df))
[1] 0
There’s almost 1.5M fields contains NA values across all variables, however there’s no row filled entirely with NA.
# Look if there's any duplicates
sum(duplicated(df$id))
[1] 0
First of all, Let’s get rid of image_url, url, description, county, VIN, and region_url columns that we won’t use it in our analysis.
df2 <- df[, !names(df) %in% c("image_url", "url", "description", "county", "region_url", "VIN")]
Now, we can check the percentage of NA values in each column.
nan_prop <- function(data_frame) {
prop_col = c()
for (col in names(data_frame)) {
nan_perc = round((sum(is.na(data_frame[[col]])) / dim(data_frame)[1]), 2)
prop_col = c(prop_col, nan_perc)
}
return(prop_col)
}
nan_prop(df2)
[1] 0.00 0.00 0.00 0.00 0.04 0.01 0.41 0.42 0.01 0.01 0.02 0.01 0.31 0.72 0.22 0.31 0.00 0.02 0.02 0.00
plot_ly(
x = names(df2),
y = 1 - nan_prop(df2),
type = "bar",
) %>% layout(title = "Propotions of NA values", xaxis=list(title="Column Names"), yaxis=list(title="Propotion"))
As we can see above, the most variable with NA values is size with 71%. This variable looks significant to the other variables although, that will be a problem to our analysis, so we will get rid of the entire column. The other columns such as condition and cylinders are almost the same with 40% of its values are filled with NA and we can make our analysis with it, however, it’s a big percentage.
df2 = df2[, !names(df2) %in% c("size")]
Let’s change the datatype for each variable to its corresponding one and rename some the mis-styped.
df2$year = as.integer(df2$year)
df2$price = as.numeric(df2$price)
df2$odometer = as.integer(df2$odometer)
df2$condition = as.factor(df2$condition)
df2$cylinders = as.factor(df2$cylinders)
df2$title_status = as.factor(df2$title_status)
df2$transmission = as.factor(df2$transmission)
df2$drive = as.factor(df2$drive)
df2$type = as.factor(df2$type)
df2$fuel = as.factor(df2$fuel)
df2$paint_color = as.factor(df2$paint_color)
df2$posting_date = as.POSIXct(df2$posting_date)
# rename some columns
df2 = rename(df2,status = title_status, shape = type)
glimpse(df2)
Rows: 426,880
Columns: 19
$ id <dbl> 7222695916, 7218891961, 7221797935, 7222270760, 7210384030, 7222379453, 7221952215~
$ region <chr> "prescott", "fayetteville", "florida keys", "worcester / central MA", "greensboro"~
$ price <dbl> 6000, 11900, 21000, 1500, 4900, 1600, 1000, 15995, 5000, 3000, 0, 0, 0, 0, 0, 1399~
$ year <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ manufacturer <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ model <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ condition <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ cylinders <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ fuel <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ odometer <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ status <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ transmission <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ drive <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ shape <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ paint_color <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ state <chr> "az", "ar", "fl", "ma", "nc", "ny", "ny", "ny", "or", "pa", "tx", "tx", "tx", "tx"~
$ lat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ long <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ posting_date <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
The dataset now seems fine, let’s look at the unique values for the categorical columns.
lapply(df2[,c(7,8,9,11,12,13,14,15,16)], unique)
$condition
[1] <NA> good excellent fair like new new salvage
Levels: excellent fair good like new new salvage
$cylinders
[1] <NA> 8 cylinders 6 cylinders 4 cylinders 5 cylinders other 3 cylinders
[8] 10 cylinders 12 cylinders
8 Levels: 10 cylinders 12 cylinders 3 cylinders 4 cylinders 5 cylinders 6 cylinders ... other
$fuel
[1] <NA> gas other diesel hybrid electric
Levels: diesel electric gas hybrid other
$status
[1] <NA> clean rebuilt lien salvage missing parts only
Levels: clean lien missing parts only rebuilt salvage
$transmission
[1] <NA> other automatic manual
Levels: automatic manual other
$drive
[1] <NA> rwd 4wd fwd
Levels: 4wd fwd rwd
$shape
[1] <NA> pickup truck other coupe SUV hatchback mini-van
[9] sedan offroad bus van convertible wagon
Levels: bus convertible coupe hatchback mini-van offroad other pickup sedan SUV truck van wagon
$paint_color
[1] <NA> white blue red black silver grey brown yellow orange green custom purple
Levels: black blue brown custom green grey orange purple red silver white yellow
$state
[1] "az" "ar" "fl" "ma" "nc" "ny" "or" "pa" "tx" "wa" "wi" "al" "ak" "ca" "co" "ct" "dc" "de" "ga" "hi"
[21] "id" "il" "in" "ia" "ks" "ky" "la" "me" "md" "mi" "mn" "ms" "mo" "mt" "ne" "nv" "nj" "nm" "nh" "nd"
[41] "oh" "ok" "ri" "sc" "sd" "tn" "ut" "vt" "va" "wv" "wy"
summary(df2)
id region price year manufacturer
Min. :7207408119 Length:426880 Min. : 0 Min. :1900 Length:426880
1st Qu.:7308143339 Class :character 1st Qu.: 5900 1st Qu.:2008 Class :character
Median :7312620821 Mode :character Median : 13950 Median :2013 Mode :character
Mean :7311486634 Mean : 75199 Mean :2011
3rd Qu.:7315253544 3rd Qu.: 26486 3rd Qu.:2017
Max. :7317101084 Max. :3736928711 Max. :2022
NA's :1205
model condition cylinders fuel odometer
Length:426880 excellent:101467 6 cylinders : 94169 diesel : 30062 Min. : 0
Class :character fair : 6769 4 cylinders : 77642 electric: 1698 1st Qu.: 37704
Mode :character good :121456 8 cylinders : 72062 gas :356209 Median : 85548
like new : 21178 5 cylinders : 1712 hybrid : 5170 Mean : 98043
new : 1305 10 cylinders: 1455 other : 30728 3rd Qu.: 133542
salvage : 601 (Other) : 2162 NA's : 3013 Max. :10000000
NA's :174104 NA's :177678 NA's :4400
status transmission drive shape paint_color
clean :405117 automatic:336524 4wd :131904 sedan :87056 white : 79285
lien : 1422 manual : 25118 fwd :105517 SUV :77284 black : 62861
missing : 814 other : 62682 rwd : 58892 pickup :43510 silver : 42970
parts only: 198 NA's : 2556 NA's:130567 truck :35279 blue : 31223
rebuilt : 7219 other :22110 red : 30473
salvage : 3868 (Other):68783 (Other): 49865
NA's : 8242 NA's :92858 NA's :130203
state lat long posting_date
Length:426880 Min. :-84.12 Min. :-159.83 Min. :2021-04-04 00:00:00
Class :character 1st Qu.: 34.60 1st Qu.:-111.94 1st Qu.:2021-04-17 00:00:00
Mode :character Median : 39.15 Median : -88.43 Median :2021-04-25 00:00:00
Mean : 38.49 Mean : -94.75 Mean :2021-04-23 06:08:37
3rd Qu.: 42.40 3rd Qu.: -80.83 3rd Qu.:2021-05-01 00:00:00
Max. : 82.39 Max. : 173.89 Max. :2021-05-04 00:00:00
NA's :6549 NA's :6549 NA's :68
So, as we can see there’s a lot of NA values with each column and the only variables that contain numeric values are price and odometer. Moreover, the difference between the maximum value and the 3rd quantile of price is pretty huge, so we can assume that the price has some outliers to be handled as well as the odometer variable.
In the following section we’ll look at the outliers for price and odometer variables. Any values outside of the interval below will be considered potential outliers:
\[I = [Q_{0.25} - 1.5⋅IQR; Q_{0.75} + 1.5⋅IQR]\]
I <- function(column) {
quan <- quantile(column, na.rm=TRUE)
IQR = as.integer(quan["75%"] - quan["25%"])
I = as.integer(c(quan["25%"] - 1.5 * IQR, quan["75%"] + 1.5 * IQR))
return(I)
}
I(df2$price)
[1] -24977 57363
For the sake of the data and the cars, $57k wouldn’t be enough, since some car types are actually over price. So, we’ll assume that any car with a price above 1 million will be an outlier and of course the price must be non-negative.
Let’s do the same thing with the odometer variable.
I(df2$odometer)
[1] -106053 277299
For the odometer, 277k miles makes sense, why would someone buy a car traveled more than 300k miles!
Let’s filter those values out from our dataset and replace the NA with the means of each variable.
df3 = df2 %>% filter(price < 5e5 & price >= 0 & odometer <= 3e5 & odometer >= 0)
df3[is.na(df3$price)] = mean(df3$price, na.rm = TRUE)
df3[is.na(df3$odometer)] = mean(df3$odometer, na.rm = TRUE)
Delete the manufacturer and the model if it’s NA since it has no meaning if a car has no model or a car has no manufacturer.
df3 = df3[!is.na(df3$manufacturer), ]
df3 = df3[!is.na(df3$model), ]
dim(df3)
[1] 397853 19
names(df3)
[1] "id" "region" "price" "year" "manufacturer" "model"
[7] "condition" "cylinders" "fuel" "odometer" "status" "transmission"
[13] "drive" "shape" "paint_color" "state" "lat" "long"
[19] "posting_date"
nan_prop(df3)
[1] 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.01 0.00 0.02 0.00 0.30 0.21 0.30 0.00 0.02 0.02 0.00
All the variables are categorical except two variables price and odometer so let’s look at its distributions.
price_hist <- plot_ly(nbinsx = 100) %>%
add_histogram(x = ~df3$price, marker = list(color = 'rgba(55, 128, 191, 0.8)'), name = "Price $")
odo_hist <- plot_ly(nbinsx = 50) %>%
add_histogram(x = ~df3$odometer, marker = list(color = 'rgba(219, 64, 82, 0.8)'), name = "Odometer (miles)")
fig <- subplot(price_hist, odo_hist) %>%
layout(title = "Odometer and Price Distribution", yaxis=list(title="Density"))
fig
Both price and odometer have right-skewed histograms which mean cars with high odometer are less popular for sale and cars with more price are less popular for sale as well.
# Group by each state and arrange by number of vehicles
by_state = group_by(df3, state = df3$state) %>% count() %>% arrange(n)
by_state %>%
plot_ly(x = ~n, y = ~state, type = "bar", marker = list(color = 'rgba(55, 128, 191, 0.9)')) %>%
layout(title="Used Cars by each state", xaxis=list(title="Cars"),
yaxis=list(title="State", categoryorder = "array", categoryarray = ~n))
California has the most used vehicles among all states with ~47k cars.
# Get the region in the ca state that has the most cars
ca_state = df3[df3$state == "ca",]
by_region = group_by(ca_state, region = ca_state$region)
arrange(count(by_region), desc(n))[1,]
Stockton are two cities in California which have the most vehicles with 2840 cars.
by_state_region = group_by(df3, state=df3$state, region=df3$region)
arrange(count(by_state_region), desc(n))[1,]
Pasco in Washington got the top on among all regions which has 2875 cars.
by_manf = df3 %>% group_by(manufacturer) %>% count() %>% arrange(n)
by_manf %>% plot_ly(x = ~n, y = ~manufacturer, type = "bar", marker = list(color = 'rgba(219, 64, 82, 0.9)')) %>%
layout(title="Car Used by each manufacturer", xaxis=list(title="Cars"),
yaxis=list(title="Manufacturer", categoryorder = "array", categoryarray = ~n))
It appears that, Ford, Chevrolet, and Toyota have the most vehicles. Let’s see what’s average price for each of the top 3.
# AVG price for each of top 3 Manufacturers
avg_ford = round(mean(df3[df3$manufacturer == "ford",]$price), 2)
avg_chevrolet = round(mean(df3[df3$manufacturer == "chevrolet",]$price), 2)
avg_toyota = round(mean(df3[df3$manufacturer == "toyota",]$price), 2)
list(ford=avg_ford, chevrolet=avg_chevrolet, toyota=avg_toyota)
$ford
[1] 19381.51
$chevrolet
[1] 18530.17
$toyota
[1] 16002.98
Ford got the higher average price among top 3 manufacturers with $19k.
by_manf_model = data.frame(df3 %>% group_by(manufacturer, model) %>% count())
agg = aggregate(n ~ manufacturer, data = by_manf_model, FUN=max, na.rm = TRUE)
max_by_model = data.frame(matrix(ncol=3, nrow=0, dimnames=list(NULL, c("manf", "model", "max_cars"))))
for (i in 1:length(agg$n)) {
max_by_model[nrow(max_by_model)+1, ] = as.vector(by_manf_model[by_manf_model$n == agg$n[i] &
by_manf_model$manufacturer == agg$manufacturer[i], ] %>%
mutate_all(as.character) %>% unlist(., use.names = FALSE))
}
max_by_model$max_cars = as.integer(max_by_model$max_cars)
max_by_model = arrange(max_by_model, desc(max_cars))
max_by_model[1:10,] %>%
plot_ly(x=~model, y=~max_cars, type = 'bar', color=~manf, colors="Dark2") %>%
layout(yaxis = list(title = 'Count'), xaxis=list(title='Car Model', categoryorder = "array", categoryarray=~max_cars))
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
by_year = group_by(df2, year)
by_year = count(by_year)
plot_ly(by_year, x=~year, y=~n, type = 'bar', color=~year, reversescale=TRUE, colors = "Blues") %>%
layout(yaxis = list(title = 'Count'), xaxis=list(title='Year', categoryorder = "array", categoryarray=~n))
Warning: Ignoring 1 observations
Warning: textfont.color doesn't (yet) support data arrays
Warning: textfont.color doesn't (yet) support data arrays
Warning: 'bar' objects don't have these attributes: 'reversescale'
Valid attributes include:
'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'xperiod', 'yperiod', 'xperiod0', 'yperiod0', 'xperiodalignment', 'yperiodalignment', 'text', 'texttemplate', 'hovertext', 'hovertemplate', 'textposition', 'insidetextanchor', 'textangle', 'textfont', 'insidetextfont', 'outsidetextfont', 'constraintext', 'cliponaxis', 'orientation', 'base', 'offset', 'width', 'marker', 'offsetgroup', 'alignmentgroup', 'selected', 'unselected', 'r', 't', '_deprecated', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'texttemplatesrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'basesrc', 'offsetsrc', 'widthsrc', 'rsrc', 'tsrc', 'key', 'set', 'fra [... truncated]
Warning: Ignoring 1 observations
Warning: textfont.color doesn't (yet) support data arrays
Warning: textfont.color doesn't (yet) support data arrays
Warning: 'bar' objects don't have these attributes: 'reversescale'
Valid attributes include:
'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'xperiod', 'yperiod', 'xperiod0', 'yperiod0', 'xperiodalignment', 'yperiodalignment', 'text', 'texttemplate', 'hovertext', 'hovertemplate', 'textposition', 'insidetextanchor', 'textangle', 'textfont', 'insidetextfont', 'outsidetextfont', 'constraintext', 'cliponaxis', 'orientation', 'base', 'offset', 'width', 'marker', 'offsetgroup', 'alignmentgroup', 'selected', 'unselected', 'r', 't', '_deprecated', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'texttemplatesrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'basesrc', 'offsetsrc', 'widthsrc', 'rsrc', 'tsrc', 'key', 'set', 'fra [... truncated]
It’s normal to see that the number of used cars increase since 1900 until 2018.
Let’s see what’s the most used cars in this year.
most_manu_b = df2[df3$year == 2017,] %>% group_by(manufacturer) %>% count()
most_manu_b = arrange(most_manu_b,desc(most_manu_b$n))
most_manu_b[1,]
Ford is most manufacture has a cars in 2018 with 6390 cars.
by_fuel = group_by(df2, fuel = df2$fuel) %>% count()
by_fuel$n = round(by_fuel$n / sum(by_fuel$n), 2)
by_fuel = rename(by_fuel, ratio = n) %>% drop_na()
colors <- c('rgb(128,133,133)', 'rgb(211,94,96)', 'rgb(144,103,167)', 'rgb(171,104,87)', 'rgb(114,147,203)')
plot_ly(by_fuel, labels = ~fuel, values = ~ratio, marker = list(colors = colors, line = list(color = '#FFFFFF', width = 2))) %>%
add_pie(hole = 0.5) %>%
layout(title = 'Car Fuel Type',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
The most fuel type used is Gas with ~85% among all cars.
by_drive = group_by(df2, drive) %>% count(.) %>% drop_na(.)
by_drive = mutate(by_drive, ratio =n/sum(by_drive$n))
plot_ly(by_drive, labels = ~drive, values = ~ratio, type = 'pie', marker = list(line = list(color = '#FFFFFF', width = 3))) %>%
layout(title = 'Car Drive Type',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
Cars with 4wd drive type are the most common with 44.5%.
Next, we’ll see which of the variables affect the price.
# Take 10000 random sample from the dataset in order to plot
sample_df = sample_n(df3, 1e4)
options(repr.plot.width = 7, repr.plot.height = 4)
sample_df %>% ggplot(aes(x = odometer, y = price)) +
geom_bin2d(bins=50)
cor(df3$price, df3$odometer)
[1] -0.4567531
There’s some sort of negative relationship between how long a car traveled and its price, however, that seems normal since the car price became higher whenever the odometer is lower.
Let’s add other factors may affect that relation such as drive.
ggplot(sample_df[!is.na(sample_df$drive),], aes(x=odometer, y=price, color=drive, shape=drive)) +
geom_point() +
geom_smooth(method=lm, se=FALSE) +
ggtitle("Price vs Odometer for each Drive type")
`geom_smooth()` using formula 'y ~ x'
As we noticed the 4wd models gets the higher price and higher odometer and the next one is rwd since these type of cars used a lot in car races.
Let’s see the average price for each type.
by_drive = aggregate(price ~ drive, data=df3, FUN=mean) %>% arrange(desc(price))
by_drive
As expected 4wd got the highest average price with ~$20k and rwd with $19.5k.
aggregate(price ~ cylinders + fuel, data = df3[df3$fuel != "other",], FUN=mean, na.rm = TRUE) %>%
arrange(price) %>%
plot_ly(x = ~cylinders, y = ~price, color = ~fuel, type = "bar") %>%
layout(title="Average Price per Cylinder and Fuel Type", yaxis=list(categoryorder = "array", categoryarray=~price))
It seems doesn’t, typical cars have 4, 6 or 8 cylinders and the figure above shows exactly those types got most average price for each fuel type.
by_status = group_by(df3, status)
by_status = aggregate(price ~ status, data = by_status, FUN=mean, na.rm = TRUE)
plot_ly(by_status, x=~status, y=~price, type = 'bar', colors="Dark2",
marker = list(color = 'rgba(55, 128, 191, 0.8)')) %>%
layout(yaxis = list(title = 'Average Price'), xaxis=list(title='Car Status'))
Lien has the most average used car price with $22k.
# Aggregate average car price by condition and status
aggregate(price ~ condition + status, data = df3, FUN=mean, na.rm = TRUE) %>%
arrange(price) %>% plot_ly(x = ~condition, y = ~price, color = ~status, type = "bar") %>%
layout(title="Average Car Price per Condition and Status ", yaxis=list(categoryorder = "array", categoryarray=~price))
Lien cars got the highest average price among other cars and it seems usual. In addition, new cars which are parts only got the second place, so if we filtered those out from the figure we can see that the normal average price starts with new clean cars until fair salvage.
avg_by_manuf = aggregate(price ~ manufacturer, data = df3, FUN=mean, na.rm = TRUE) %>% arrange(price)
plot_ly(x = ~avg_by_manuf$price,y = ~avg_by_manuf$manufacturer, name = "Car Used by each manufacturer",type = "bar",
marker = list(color = 'rgba(219, 64, 82, 0.8)'),
) %>% layout(title = "Average Price for each Manufacturer", xaxis=list(title="Average Price"),
yaxis=list(title="Manufacturer", categoryorder = "array", categoryarray = ~avg_by_manuf$price))
Ferrari got the 1st place with the highest average price $114.7k.
by_shape = group_by(df3, shape)
cnt_by_shape = count(by_shape) %>% drop_na()
avg_price = aggregate(price ~ shape, data = by_shape, FUN=mean, na.rm = TRUE)
avg_cnt = cbind(cnt_by_shape, price = avg_price$price)
plot_ly(avg_cnt, x = ~shape, y = ~price, type = 'bar', name = 'Price', marker = list(color = 'rgb(49,130,189)')) %>%
add_trace(y = ~n, name = 'Cars', marker = list(color = 'rgb(200,200,200)')) %>%
layout(xaxis = list(title = "Car Body Shape", tickangle = -45),
yaxis = list(title = ""), margin = list(b = 100), barmode = 'group')
From the previous plot we got two observations:
sample_df %>% plot_ly(x = ~price, y = ~odometer, type="scatter", color = ~fuel, colors="Dark2", marker = list(size = 6,
opacity = 0.6)) %>%
layout(title = 'Odometer vs Price per Fuel Type',
xaxis = list(title="Odometer (miles)"),
yaxis = list(title="Price $"))
No scatter mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
No scatter mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Most cars use gas as a fuel type and it seems each fuel type strongly affects the price of the used car along with the odometer.
Let’s see the average price by each fuel type.
by_fuel = aggregate(price ~ fuel, data = df2, FUN=mean, na.rm = TRUE) %>% arrange(price)
by_fuel
aggregate(price ~ transmission, data = df3, FUN=mean, na.rm = TRUE)
The average price is highly affected by the transmission type.
Let’s see fuel variables may contribute to the relation.
# Aggregate average car price by transmission type and fuel
aggregate(price ~ fuel + transmission, data = df3, FUN=mean, na.rm = TRUE) %>%
arrange(price) %>% plot_ly(x = ~fuel, y = ~price, color = ~transmission, type = "bar") %>%
layout(title="Average Car Price per Transmission and Fuel ", yaxis=list(categoryorder = "array", categoryarray=~price))
It’s pretty obvious that other got the highest average price, but if we filtered that we got the normal differences between automatic and manual cars.
To address this question we should extract another feature from the dataset using those variables year and posting_date, then calculate the difference.
df3[is.na(df3$year),]$year <- as.integer(mean(df3$year, na.rm=TRUE))
df4 = df3 %>% mutate(age = as.integer(format(posting_date, format="%Y")) - year)
plot_ly(nbinsx = 100) %>%
add_histogram(x = ~df4$age, marker = list(color = 'rgba(55, 128, 191, 0.8)')) %>%
layout(title = "Age of Used Cars", yaxis=list(title="Density"), xaxis=list(title="Age"))
The age of the used cars is positively skewed which means recent cars are most popular for sale. Next, let’s see the condition and status how it affects the age of the car.
aggregate(age ~ condition + status, data = df4, FUN=mean, na.rm = TRUE) %>%
arrange(age) %>% plot_ly(x = ~condition, y = ~age, color = ~status, type = "bar") %>%
layout(title="Average Car Age per Condition and Status ", yaxis=list(categoryorder = "array", categoryarray=~age))
New and parts-only cars are getting the highest average age, however, the missing cars got the highest average age among all conditions.
In this part, we’re going to test some of the above variables whether it can contribute to the price of a used vehicle or not. To do so, we can use ANOVA Test since we got one numeric variable vs other categorical variables.
To begin, let’s see some transformations on the price in order to make it normally distributed.
price_log <- plot_ly(nbinsx = 30) %>%
add_histogram(x = ~log(df3$price), marker = list(color = 'rgba(55, 128, 191, 0.8)'), name = "Price Log")
price_sqrt <- plot_ly(nbinsx = 30) %>%
add_histogram(x = ~sqrt(df3$price), marker = list(color = 'rgba(219, 64, 82, 0.8)'), name = "Price Sqrt")
fig <- subplot(price_log, price_sqrt) %>%
layout(title = "Price Transformation", yaxis=list(title="Density"))
fig