Table of contents

Introduction

Craigslist is the world’s largest collection of used vehicles for sale, yet it’s very difficult to collect all of them in the same place. I built a scraper for a school project and expanded upon it later to create this dataset which includes every used vehicle entry within the United States on Craigslist.

The dataset contains around 500k used cars with its 26 associated variables (characteristics). This data is scraped every few months, it contains most all relevant information that Craigslist provides on car sales including columns like price, condition, manufacturer, latitude/longitude, and 18 other categories.

Some Definitions:

  • Odometer: An odometer or odograph is an instrument used for measuring the distance traveled by a vehicle, such as a bicycle or car.
  • Transmission: A transmission is a machine in a power transmission system, which provides controlled application of power.
  • Drive Type: The four different types of drivetrain are all-wheel-drive AWD, front wheel drive FWD, rear wheel drive RWD, and 4WD 4 wheel drive.
  • Lien Cars: A lien sale is the sale of the claim—or a hold—placed on an asset to satisfy an unpaid debt. Typically, lien sales are conducted as public auctions, and the lien is on real estate, automobiles, and other personal property.

Data Dictionary

  • id: Entry identification.
  • url: Listing URL of the car.
  • region: Region in which car are listed for sale.
  • region_url: URL of the craigslist of that region.
  • price: Entry price of the car.
  • year: The year in which the car was manufactured.
  • manufacturer: Manufacturer of the vehicle.
  • model: Model of the vehicle.
  • condition: Condition of the vehicle e.g. new, good, fair etc.
  • cylinders: Number of cylinders in the vehicle.
  • fuel: Fuel type e.g. gas, diesel, electric etc.
  • odometer: miles traveled by vehicle.
  • title_status: Vehicle status e.g. clean, rebuilt etc.
  • transmission: Transmission type of the vehicle e.g. manual, automatic.
  • VIN: Vehicle identification number.
  • drive: Type of drive by vehicle e.g. rwd, fwd, 4wd.
  • size: Size of the vehicle e.g. compact, full-size, etc.
  • type: Body type of the vehicle e.g. SUV, sedan, etc.
  • paint_color: Color of the vehicle.
  • image_url: Image URL of the vehicle.
  • description: Description of the vehicle.
  • state: Region in which the vehicle are listed for sale.
  • lat: Latitude of listing region.
  • long: longitude of listing region.
  • posting_date: Date in which the vehicle was listed for sale.

Questions that would be able to answer during our analysis:

  • Which state has more used craigslist and for which region?
  • Which region has more used craigslist and in which state?
  • What is the top 3 manufacturer with most cars?
  • What is the top 10 model for each manufacturer?
  • How the number of used cars changes over the years since 1900?
  • What is the most common fuel types?
  • What the percentage for each of the car drive type?
  • Does the odometer affect car price?
  • Does the cylinders affect car price for each fuel type?
  • Does the car status affect the price?
  • Does the condition affect car price for each status?
  • What’s average price for each manufacture?
  • What is th most common body shapes used, and its average price?
  • Does fuel type along with the odometer affect car price?
  • Does the transmission affect car price?
  • What are the average years people spend before offering their car for sale?

In order to address those questions, there are some things to do such as:

  1. Cleaning the dataset from any incorrect datatypes or null values.
  2. Calculating some statistics about the variables and look for any relationships in the data.
  3. Drawing conclusions with some charts and communicating the findings.
suppressPackageStartupMessages({
  # Load Analysis Libraries
  library(dplyr)
  library(tidyr)
  library(plotly)
  library(ggplot2)
  # Load ML libraries
  library(caret)
  library(ranger)
  library(rpart)
  library(MASS)
})

Data Wrangling

Gathering

df = read.csv("vehicles.csv", na.strings = c("", "NA"))
head(df)

Assessing

dim(df)
[1] 426880     26
names(df)
 [1] "id"           "url"          "region"       "region_url"   "price"        "year"        
 [7] "manufacturer" "model"        "condition"    "cylinders"    "fuel"         "odometer"    
[13] "title_status" "transmission" "VIN"          "drive"        "size"         "type"        
[19] "paint_color"  "image_url"    "description"  "county"       "state"        "lat"         
[25] "long"         "posting_date"
# The structure of the dataframe
glimpse(df)
Rows: 426,880
Columns: 26
$ id           <dbl> 7222695916, 7218891961, 7221797935, 7222270760, 7210384030, 7222379453, 7221952215~
$ url          <chr> "https://prescott.craigslist.org/cto/d/prescott-2010-ford-ranger/7222695916.html",~
$ region       <chr> "prescott", "fayetteville", "florida keys", "worcester / central MA", "greensboro"~
$ region_url   <chr> "https://prescott.craigslist.org", "https://fayar.craigslist.org", "https://keys.c~
$ price        <dbl> 6000, 11900, 21000, 1500, 4900, 1600, 1000, 15995, 5000, 3000, 0, 0, 0, 0, 0, 1399~
$ year         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ manufacturer <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ model        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ condition    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ cylinders    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ fuel         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ odometer     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ title_status <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ transmission <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ VIN          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ drive        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ size         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ type         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ paint_color  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ image_url    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ description  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ county       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ state        <chr> "az", "ar", "fl", "ma", "nc", "ny", "ny", "ny", "or", "pa", "tx", "tx", "tx", "tx"~
$ lat          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ long         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ posting_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~

Most of the variable aren’t in the correct type and it need to be reassigned to its datatype.

# Count the nan values in the data
sum(is.na(df))
[1] 1655334
sum(complete.cases(df))
[1] 0

There’s almost 1.5M fields contains NA values across all variables, however there’s no row filled entirely with NA.

# Look if there's any duplicates
sum(duplicated(df$id))
[1] 0

Data Cleaning

First of all, Let’s get rid of image_url, url, description, county, VIN, and region_url columns that we won’t use it in our analysis.

df2 <- df[, !names(df) %in% c("image_url", "url", "description", "county", "region_url", "VIN")]

Now, we can check the percentage of NA values in each column.

nan_prop <- function(data_frame) {
    prop_col = c()
    for (col in names(data_frame)) {
        nan_perc = round((sum(is.na(data_frame[[col]])) / dim(data_frame)[1]), 2) 
        prop_col = c(prop_col, nan_perc)
    }
    return(prop_col)
}
nan_prop(df2)
 [1] 0.00 0.00 0.00 0.00 0.04 0.01 0.41 0.42 0.01 0.01 0.02 0.01 0.31 0.72 0.22 0.31 0.00 0.02 0.02 0.00
plot_ly(
  x = names(df2),
  y = 1 - nan_prop(df2),
  type = "bar",
) %>% layout(title = "Propotions of NA values", xaxis=list(title="Column Names"), yaxis=list(title="Propotion"))

As we can see above, the most variable with NA values is size with 71%. This variable looks significant to the other variables although, that will be a problem to our analysis, so we will get rid of the entire column. The other columns such as condition and cylinders are almost the same with 40% of its values are filled with NA and we can make our analysis with it, however, it’s a big percentage.

df2 = df2[, !names(df2) %in% c("size")]

Let’s change the datatype for each variable to its corresponding one and rename some the mis-styped.

edit the data type

df2$year = as.integer(df2$year)
df2$price = as.numeric(df2$price)
df2$odometer = as.integer(df2$odometer)
df2$condition = as.factor(df2$condition)
df2$cylinders = as.factor(df2$cylinders)
df2$title_status = as.factor(df2$title_status)
df2$transmission = as.factor(df2$transmission)
df2$drive = as.factor(df2$drive)
df2$type = as.factor(df2$type)
df2$fuel = as.factor(df2$fuel)
df2$paint_color = as.factor(df2$paint_color)
df2$posting_date = as.POSIXct(df2$posting_date)

# rename some columns
df2 = rename(df2,status = title_status, shape = type)
glimpse(df2)
Rows: 426,880
Columns: 19
$ id           <dbl> 7222695916, 7218891961, 7221797935, 7222270760, 7210384030, 7222379453, 7221952215~
$ region       <chr> "prescott", "fayetteville", "florida keys", "worcester / central MA", "greensboro"~
$ price        <dbl> 6000, 11900, 21000, 1500, 4900, 1600, 1000, 15995, 5000, 3000, 0, 0, 0, 0, 0, 1399~
$ year         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ manufacturer <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ model        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ condition    <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ cylinders    <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ fuel         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ odometer     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ status       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ transmission <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ drive        <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ shape        <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ paint_color  <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ state        <chr> "az", "ar", "fl", "ma", "nc", "ny", "ny", "ny", "or", "pa", "tx", "tx", "tx", "tx"~
$ lat          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ long         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ posting_date <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~

The dataset now seems fine, let’s look at the unique values for the categorical columns.

lapply(df2[,c(7,8,9,11,12,13,14,15,16)], unique)
$condition
[1] <NA>      good      excellent fair      like new  new       salvage  
Levels: excellent fair good like new new salvage

$cylinders
[1] <NA>         8 cylinders  6 cylinders  4 cylinders  5 cylinders  other        3 cylinders 
[8] 10 cylinders 12 cylinders
8 Levels: 10 cylinders 12 cylinders 3 cylinders 4 cylinders 5 cylinders 6 cylinders ... other

$fuel
[1] <NA>     gas      other    diesel   hybrid   electric
Levels: diesel electric gas hybrid other

$status
[1] <NA>       clean      rebuilt    lien       salvage    missing    parts only
Levels: clean lien missing parts only rebuilt salvage

$transmission
[1] <NA>      other     automatic manual   
Levels: automatic manual other

$drive
[1] <NA> rwd  4wd  fwd 
Levels: 4wd fwd rwd

$shape
 [1] <NA>        pickup      truck       other       coupe       SUV         hatchback   mini-van   
 [9] sedan       offroad     bus         van         convertible wagon      
Levels: bus convertible coupe hatchback mini-van offroad other pickup sedan SUV truck van wagon

$paint_color
 [1] <NA>   white  blue   red    black  silver grey   brown  yellow orange green  custom purple
Levels: black blue brown custom green grey orange purple red silver white yellow

$state
 [1] "az" "ar" "fl" "ma" "nc" "ny" "or" "pa" "tx" "wa" "wi" "al" "ak" "ca" "co" "ct" "dc" "de" "ga" "hi"
[21] "id" "il" "in" "ia" "ks" "ky" "la" "me" "md" "mi" "mn" "ms" "mo" "mt" "ne" "nv" "nj" "nm" "nh" "nd"
[41] "oh" "ok" "ri" "sc" "sd" "tn" "ut" "vt" "va" "wv" "wy"
summary(df2)
       id                region              price                 year      manufacturer      
 Min.   :7207408119   Length:426880      Min.   :         0   Min.   :1900   Length:426880     
 1st Qu.:7308143339   Class :character   1st Qu.:      5900   1st Qu.:2008   Class :character  
 Median :7312620821   Mode  :character   Median :     13950   Median :2013   Mode  :character  
 Mean   :7311486634                      Mean   :     75199   Mean   :2011                     
 3rd Qu.:7315253544                      3rd Qu.:     26486   3rd Qu.:2017                     
 Max.   :7317101084                      Max.   :3736928711   Max.   :2022                     
                                                              NA's   :1205                     
    model               condition             cylinders            fuel           odometer       
 Length:426880      excellent:101467   6 cylinders : 94169   diesel  : 30062   Min.   :       0  
 Class :character   fair     :  6769   4 cylinders : 77642   electric:  1698   1st Qu.:   37704  
 Mode  :character   good     :121456   8 cylinders : 72062   gas     :356209   Median :   85548  
                    like new : 21178   5 cylinders :  1712   hybrid  :  5170   Mean   :   98043  
                    new      :  1305   10 cylinders:  1455   other   : 30728   3rd Qu.:  133542  
                    salvage  :   601   (Other)     :  2162   NA's    :  3013   Max.   :10000000  
                    NA's     :174104   NA's        :177678                     NA's   :4400      
        status          transmission     drive            shape        paint_color    
 clean     :405117   automatic:336524   4wd :131904   sedan  :87056   white  : 79285  
 lien      :  1422   manual   : 25118   fwd :105517   SUV    :77284   black  : 62861  
 missing   :   814   other    : 62682   rwd : 58892   pickup :43510   silver : 42970  
 parts only:   198   NA's     :  2556   NA's:130567   truck  :35279   blue   : 31223  
 rebuilt   :  7219                                    other  :22110   red    : 30473  
 salvage   :  3868                                    (Other):68783   (Other): 49865  
 NA's      :  8242                                    NA's   :92858   NA's   :130203  
    state                lat              long          posting_date                
 Length:426880      Min.   :-84.12   Min.   :-159.83   Min.   :2021-04-04 00:00:00  
 Class :character   1st Qu.: 34.60   1st Qu.:-111.94   1st Qu.:2021-04-17 00:00:00  
 Mode  :character   Median : 39.15   Median : -88.43   Median :2021-04-25 00:00:00  
                    Mean   : 38.49   Mean   : -94.75   Mean   :2021-04-23 06:08:37  
                    3rd Qu.: 42.40   3rd Qu.: -80.83   3rd Qu.:2021-05-01 00:00:00  
                    Max.   : 82.39   Max.   : 173.89   Max.   :2021-05-04 00:00:00  
                    NA's   :6549     NA's   :6549      NA's   :68                   

So, as we can see there’s a lot of NA values with each column and the only variables that contain numeric values are price and odometer. Moreover, the difference between the maximum value and the 3rd quantile of price is pretty huge, so we can assume that the price has some outliers to be handled as well as the odometer variable.

In the following section we’ll look at the outliers for price and odometer variables. Any values outside of the interval below will be considered potential outliers:

\[I = [Q_{0.25} - 1.5⋅IQR; Q_{0.75} + 1.5⋅IQR]\]

I <- function(column) {
    quan <- quantile(column, na.rm=TRUE)
    IQR = as.integer(quan["75%"] - quan["25%"])
    I = as.integer(c(quan["25%"] - 1.5 * IQR, quan["75%"] + 1.5 * IQR))
    return(I)
}
I(df2$price)
[1] -24977  57363

For the sake of the data and the cars, $57k wouldn’t be enough, since some car types are actually over price. So, we’ll assume that any car with a price above 1 million will be an outlier and of course the price must be non-negative.

Let’s do the same thing with the odometer variable.

I(df2$odometer)
[1] -106053  277299

For the odometer, 277k miles makes sense, why would someone buy a car traveled more than 300k miles!

Let’s filter those values out from our dataset and replace the NA with the means of each variable.

df3 = df2 %>% filter(price < 5e5 & price >= 0 & odometer <= 3e5 & odometer >= 0)
df3[is.na(df3$price)] = mean(df3$price, na.rm = TRUE)
df3[is.na(df3$odometer)] = mean(df3$odometer, na.rm = TRUE)

Delete the manufacturer and the model if it’s NA since it has no meaning if a car has no model or a car has no manufacturer.

df3 = df3[!is.na(df3$manufacturer), ]
df3 = df3[!is.na(df3$model), ]
dim(df3)
[1] 397853     19
names(df3)
 [1] "id"           "region"       "price"        "year"         "manufacturer" "model"       
 [7] "condition"    "cylinders"    "fuel"         "odometer"     "status"       "transmission"
[13] "drive"        "shape"        "paint_color"  "state"        "lat"          "long"        
[19] "posting_date"
nan_prop(df3)
 [1] 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.01 0.00 0.02 0.00 0.30 0.21 0.30 0.00 0.02 0.02 0.00

Exploratory Analysis

All the variables are categorical except two variables price and odometer so let’s look at its distributions.

price_hist <- plot_ly(nbinsx = 100) %>%
  add_histogram(x = ~df3$price, marker = list(color = 'rgba(55, 128, 191, 0.8)'), name = "Price $")

odo_hist <- plot_ly(nbinsx = 50) %>%
    add_histogram(x = ~df3$odometer, marker = list(color = 'rgba(219, 64, 82, 0.8)'), name = "Odometer (miles)")

fig <- subplot(price_hist, odo_hist) %>%
    layout(title = "Odometer and Price Distribution", yaxis=list(title="Density"))
fig

Both price and odometer have right-skewed histograms which mean cars with high odometer are less popular for sale and cars with more price are less popular for sale as well.

Which state has more used craigslist and for which region?

# Group by each state and arrange by number of vehicles
by_state = group_by(df3, state = df3$state) %>% count() %>% arrange(n)
by_state %>% 
    plot_ly(x = ~n, y = ~state, type = "bar", marker = list(color = 'rgba(55, 128, 191, 0.9)')) %>% 
    layout(title="Used Cars by each state", xaxis=list(title="Cars"), 
       yaxis=list(title="State",  categoryorder = "array", categoryarray = ~n))

California has the most used vehicles among all states with ~47k cars.

# Get the region in the ca state that has the most cars
ca_state = df3[df3$state == "ca",]         
by_region = group_by(ca_state, region = ca_state$region)
arrange(count(by_region), desc(n))[1,]

Stockton are two cities in California which have the most vehicles with 2840 cars.

Which region has more used craigslist and in which state?

by_state_region = group_by(df3, state=df3$state, region=df3$region)
arrange(count(by_state_region), desc(n))[1,]

Pasco in Washington got the top on among all regions which has 2875 cars.

What is first top 3 manufacturers with more vehicles?

by_manf = df3 %>% group_by(manufacturer) %>% count() %>% arrange(n)
by_manf %>% plot_ly(x = ~n, y = ~manufacturer, type = "bar", marker = list(color = 'rgba(219, 64, 82, 0.9)')) %>%
layout(title="Car Used by each manufacturer", xaxis=list(title="Cars"), 
       yaxis=list(title="Manufacturer",  categoryorder = "array", categoryarray = ~n))

It appears that, Ford, Chevrolet, and Toyota have the most vehicles. Let’s see what’s average price for each of the top 3.

# AVG price for each of top 3 Manufacturers
avg_ford = round(mean(df3[df3$manufacturer == "ford",]$price), 2)
avg_chevrolet = round(mean(df3[df3$manufacturer == "chevrolet",]$price), 2)
avg_toyota = round(mean(df3[df3$manufacturer == "toyota",]$price), 2)
list(ford=avg_ford, chevrolet=avg_chevrolet, toyota=avg_toyota)
$ford
[1] 19381.51

$chevrolet
[1] 18530.17

$toyota
[1] 16002.98

Ford got the higher average price among top 3 manufacturers with $19k.

What is the top 10 models with each manufacturer?

by_manf_model = data.frame(df3 %>% group_by(manufacturer, model) %>% count())
agg = aggregate(n ~ manufacturer, data = by_manf_model, FUN=max, na.rm = TRUE)
max_by_model = data.frame(matrix(ncol=3, nrow=0, dimnames=list(NULL, c("manf", "model", "max_cars"))))

for (i in 1:length(agg$n)) {
    max_by_model[nrow(max_by_model)+1, ] = as.vector(by_manf_model[by_manf_model$n == agg$n[i] & 
                                                                   by_manf_model$manufacturer == agg$manufacturer[i], ] %>%
          mutate_all(as.character) %>% unlist(., use.names = FALSE))
}
max_by_model$max_cars = as.integer(max_by_model$max_cars)
max_by_model = arrange(max_by_model, desc(max_cars))
max_by_model[1:10,] %>%
    plot_ly(x=~model, y=~max_cars, type = 'bar', color=~manf, colors="Dark2") %>%
    layout(yaxis = list(title = 'Count'), xaxis=list(title='Car Model', categoryorder = "array", categoryarray=~max_cars))
Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(n, pal) :
  n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors

How the number of used cars changes over the years since 1900?

by_year = group_by(df2, year)
by_year = count(by_year)
plot_ly(by_year, x=~year, y=~n, type = 'bar', color=~year, reversescale=TRUE, colors = "Blues") %>% 
  layout(yaxis = list(title = 'Count'), xaxis=list(title='Year', categoryorder = "array", categoryarray=~n))
Warning: Ignoring 1 observations
Warning: textfont.color doesn't (yet) support data arrays
Warning: textfont.color doesn't (yet) support data arrays
Warning: 'bar' objects don't have these attributes: 'reversescale'
Valid attributes include:
'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'xperiod', 'yperiod', 'xperiod0', 'yperiod0', 'xperiodalignment', 'yperiodalignment', 'text', 'texttemplate', 'hovertext', 'hovertemplate', 'textposition', 'insidetextanchor', 'textangle', 'textfont', 'insidetextfont', 'outsidetextfont', 'constraintext', 'cliponaxis', 'orientation', 'base', 'offset', 'width', 'marker', 'offsetgroup', 'alignmentgroup', 'selected', 'unselected', 'r', 't', '_deprecated', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'texttemplatesrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'basesrc', 'offsetsrc', 'widthsrc', 'rsrc', 'tsrc', 'key', 'set', 'fra [... truncated]
Warning: Ignoring 1 observations
Warning: textfont.color doesn't (yet) support data arrays
Warning: textfont.color doesn't (yet) support data arrays
Warning: 'bar' objects don't have these attributes: 'reversescale'
Valid attributes include:
'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'xperiod', 'yperiod', 'xperiod0', 'yperiod0', 'xperiodalignment', 'yperiodalignment', 'text', 'texttemplate', 'hovertext', 'hovertemplate', 'textposition', 'insidetextanchor', 'textangle', 'textfont', 'insidetextfont', 'outsidetextfont', 'constraintext', 'cliponaxis', 'orientation', 'base', 'offset', 'width', 'marker', 'offsetgroup', 'alignmentgroup', 'selected', 'unselected', 'r', 't', '_deprecated', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'texttemplatesrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'basesrc', 'offsetsrc', 'widthsrc', 'rsrc', 'tsrc', 'key', 'set', 'fra [... truncated]

It’s normal to see that the number of used cars increase since 1900 until 2018.

Let’s see what’s the most used cars in this year.

most_manu_b = df2[df3$year == 2017,] %>% group_by(manufacturer) %>% count()
most_manu_b = arrange(most_manu_b,desc(most_manu_b$n))
most_manu_b[1,]

Ford is most manufacture has a cars in 2018 with 6390 cars.

What is the most common fuel types?

by_fuel = group_by(df2, fuel = df2$fuel) %>% count()
by_fuel$n = round(by_fuel$n / sum(by_fuel$n), 2)
by_fuel = rename(by_fuel, ratio = n) %>% drop_na()
colors <- c('rgb(128,133,133)', 'rgb(211,94,96)', 'rgb(144,103,167)', 'rgb(171,104,87)', 'rgb(114,147,203)')
plot_ly(by_fuel, labels = ~fuel, values = ~ratio, marker = list(colors = colors, line = list(color = '#FFFFFF', width = 2))) %>% 
  add_pie(hole = 0.5) %>%
  layout(title = 'Car Fuel Type',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

The most fuel type used is Gas with ~85% among all cars.

What the percentage for each of the car drive type?

by_drive = group_by(df2, drive) %>% count(.) %>% drop_na(.)
by_drive = mutate(by_drive, ratio =n/sum(by_drive$n))
plot_ly(by_drive, labels = ~drive, values = ~ratio, type = 'pie', marker = list(line = list(color = '#FFFFFF', width = 3))) %>% 
  layout(title = 'Car Drive Type',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

Cars with 4wd drive type are the most common with 44.5%.

Next, we’ll see which of the variables affect the price.

Does the odometer affect car price?

# Take 10000 random sample from the dataset in order to plot
sample_df = sample_n(df3, 1e4)
options(repr.plot.width = 7, repr.plot.height = 4) 
sample_df %>% ggplot(aes(x = odometer, y = price)) +
    geom_bin2d(bins=50)

cor(df3$price, df3$odometer)
[1] -0.4567531

There’s some sort of negative relationship between how long a car traveled and its price, however, that seems normal since the car price became higher whenever the odometer is lower.

Let’s add other factors may affect that relation such as drive.

ggplot(sample_df[!is.na(sample_df$drive),], aes(x=odometer, y=price, color=drive, shape=drive)) +
  geom_point() + 
  geom_smooth(method=lm, se=FALSE) +
  ggtitle("Price vs Odometer for each Drive type")
`geom_smooth()` using formula 'y ~ x'

As we noticed the 4wd models gets the higher price and higher odometer and the next one is rwd since these type of cars used a lot in car races.

Let’s see the average price for each type.

by_drive = aggregate(price ~ drive, data=df3, FUN=mean) %>% arrange(desc(price))
by_drive

As expected 4wd got the highest average price with ~$20k and rwd with $19.5k.

Does the cylinders affect car price for each fuel type?

aggregate(price ~ cylinders + fuel, data = df3[df3$fuel != "other",], FUN=mean, na.rm = TRUE) %>% 
    arrange(price) %>% 
    plot_ly(x = ~cylinders, y = ~price, color = ~fuel, type = "bar") %>% 
    layout(title="Average Price per Cylinder and Fuel Type", yaxis=list(categoryorder = "array", categoryarray=~price))

It seems doesn’t, typical cars have 4, 6 or 8 cylinders and the figure above shows exactly those types got most average price for each fuel type.

Does the car status affect the price?

by_status = group_by(df3, status)
by_status = aggregate(price ~ status, data = by_status, FUN=mean, na.rm = TRUE)
plot_ly(by_status, x=~status, y=~price, type = 'bar', colors="Dark2", 
               marker = list(color = 'rgba(55, 128, 191, 0.8)')) %>% 
  layout(yaxis = list(title = 'Average Price'), xaxis=list(title='Car Status'))

Lien has the most average used car price with $22k.

Does the condition affect car price for each status?

# Aggregate average car price by condition and status
aggregate(price ~ condition + status, data = df3, FUN=mean, na.rm = TRUE) %>% 
    arrange(price) %>% plot_ly(x = ~condition, y = ~price, color = ~status, type = "bar") %>%
    layout(title="Average Car Price per Condition and Status ", yaxis=list(categoryorder = "array", categoryarray=~price))

Lien cars got the highest average price among other cars and it seems usual. In addition, new cars which are parts only got the second place, so if we filtered those out from the figure we can see that the normal average price starts with new clean cars until fair salvage.

What’s average price for each manufacture?

avg_by_manuf = aggregate(price ~ manufacturer, data = df3, FUN=mean, na.rm = TRUE) %>% arrange(price)
plot_ly(x = ~avg_by_manuf$price,y = ~avg_by_manuf$manufacturer, name = "Car Used by each manufacturer",type = "bar",
   marker = list(color = 'rgba(219, 64, 82, 0.8)'),
) %>% layout(title = "Average Price for each Manufacturer", xaxis=list(title="Average Price"), 
             yaxis=list(title="Manufacturer",  categoryorder = "array", categoryarray = ~avg_by_manuf$price))

Ferrari got the 1st place with the highest average price $114.7k.

What is th most common body shapes used, and its average price?

by_shape = group_by(df3, shape)
cnt_by_shape = count(by_shape) %>% drop_na()
avg_price = aggregate(price ~ shape, data = by_shape, FUN=mean, na.rm = TRUE)
avg_cnt = cbind(cnt_by_shape, price = avg_price$price)
plot_ly(avg_cnt, x = ~shape, y = ~price, type = 'bar', name = 'Price', marker = list(color = 'rgb(49,130,189)')) %>%
  add_trace(y = ~n, name = 'Cars', marker = list(color = 'rgb(200,200,200)')) %>% 
  layout(xaxis = list(title = "Car Body Shape", tickangle = -45),
         yaxis = list(title = ""), margin = list(b = 100), barmode = 'group')

From the previous plot we got two observations:

  • Sedan body type is the most common shape among used cars with average price $13.1k.
  • Pickup body type has the most average price among all shapes with $28.6k.

Does fuel type along with the odometer affect car price?

sample_df %>% plot_ly(x = ~price, y = ~odometer, type="scatter", color = ~fuel, colors="Dark2", marker = list(size = 6,
  opacity = 0.6)) %>% 
  layout(title = 'Odometer vs Price per Fuel Type',
         xaxis = list(title="Odometer (miles)"),
         yaxis = list(title="Price $"))
No scatter mode specifed:
  Setting the mode to markers
  Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
No scatter mode specifed:
  Setting the mode to markers
  Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

Most cars use gas as a fuel type and it seems each fuel type strongly affects the price of the used car along with the odometer.

Let’s see the average price by each fuel type.

by_fuel = aggregate(price ~ fuel, data = df2, FUN=mean, na.rm = TRUE) %>% arrange(price)
by_fuel

Does the transmission affect car price?

aggregate(price ~ transmission, data = df3, FUN=mean, na.rm = TRUE)

The average price is highly affected by the transmission type.

Let’s see fuel variables may contribute to the relation.

# Aggregate average car price by transmission type and fuel
aggregate(price ~ fuel + transmission, data = df3, FUN=mean, na.rm = TRUE) %>% 
    arrange(price) %>% plot_ly(x = ~fuel, y = ~price, color = ~transmission, type = "bar") %>%
    layout(title="Average Car Price per Transmission and Fuel ", yaxis=list(categoryorder = "array", categoryarray=~price))

It’s pretty obvious that other got the highest average price, but if we filtered that we got the normal differences between automatic and manual cars.

What are the average years people spend before offering their car for sale?

To address this question we should extract another feature from the dataset using those variables year and posting_date, then calculate the difference.

df3[is.na(df3$year),]$year <- as.integer(mean(df3$year, na.rm=TRUE))
df4 = df3 %>% mutate(age = as.integer(format(posting_date, format="%Y")) - year)
plot_ly(nbinsx = 100) %>%
    add_histogram(x = ~df4$age, marker = list(color = 'rgba(55, 128, 191, 0.8)')) %>%
    layout(title = "Age of Used Cars", yaxis=list(title="Density"), xaxis=list(title="Age"))

The age of the used cars is positively skewed which means recent cars are most popular for sale. Next, let’s see the condition and status how it affects the age of the car.

aggregate(age ~ condition + status, data = df4, FUN=mean, na.rm = TRUE) %>% 
    arrange(age) %>% plot_ly(x = ~condition, y = ~age, color = ~status, type = "bar") %>%
    layout(title="Average Car Age per Condition and Status ", yaxis=list(categoryorder = "array", categoryarray=~age))

New and parts-only cars are getting the highest average age, however, the missing cars got the highest average age among all conditions.

Significance Testing

In this part, we’re going to test some of the above variables whether it can contribute to the price of a used vehicle or not. To do so, we can use ANOVA Test since we got one numeric variable vs other categorical variables.

To begin, let’s see some transformations on the price in order to make it normally distributed.

price_log <- plot_ly(nbinsx = 30) %>%
  add_histogram(x = ~log(df3$price), marker = list(color = 'rgba(55, 128, 191, 0.8)'), name = "Price Log")

price_sqrt <- plot_ly(nbinsx = 30) %>%
    add_histogram(x = ~sqrt(df3$price), marker = list(color = 'rgba(219, 64, 82, 0.8)'), name = "Price Sqrt")

fig <- subplot(price_log, price_sqrt) %>%
    layout(title = "Price Transformation", yaxis=list(title="Density"))
fig
summary(log(df4[df4$price>0,]$price))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   8.922   9.654   9.428  10.228  13.016 

The logarithmic transformation seems to be normal with mean ~= median so, let’s make the ANOVA test

# Compute the analysis of variance
res.aov <- aov(log(price) ~ drive + transmission + cylinders + condition + shape + status , data = df4[df4$price>0,])
# Summary of the analysis
summary(res.aov)
                 Df Sum Sq Mean Sq F value              Pr(>F)    
drive             2   9812    4906 4042.77 <0.0000000000000002 ***
transmission      2   7429    3714 3060.84 <0.0000000000000002 ***
cylinders         7   5596     799  658.82 <0.0000000000000002 ***
condition         5   8309    1662 1369.42 <0.0000000000000002 ***
shape            12   6356     530  436.49 <0.0000000000000002 ***
status            5    446      89   73.46 <0.0000000000000002 ***
Residuals    121423 147344       1                                
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
247625 observations deleted due to missingness

Surprisingly, all variables are seem to be statistically significant with p-value < 0.05.

Data Modelling & Predictions

In the following part, we’re going to predict the price variable based on those features (Odometer, Condition, Status, Cylinders, Drive, Fuel, Manufacturer, Transmission, Shape, Age). These features highly affected the car price as we’ve shown in the previous section.

Data Preparation

df_features = dplyr::select(df4, price, odometer, condition, cylinders, status, drive, fuel, manufacturer, transmission, shape, age)
df_features <- df_features %>%
  mutate_if(is.character, is.factor)
# Omit NA values in the data
df_features = df_features[complete.cases(df_features),]

We’ve seen some factors contributed to huge volume of data and it negatively affects the car price such as other in Fuel, and Cylinders type, so let’s remove them.

df_features = df_features %>% filter(fuel != "other", cylinders != "other")
dim(df_features)
[1] 123834     11

Data Modelling

To begin with, let’s define some regression metrics:

  1. \(R^2\) measures how much variability in dependent variable can be explained by the model.
  2. While \(R^2\) is a relative measure of how well the model fits dependent variables, Mean Square Error is an absolute measure of the goodness for the fit.
  3. Mean Absolute Error \(MAE\) is similar to \(MSE\), however, unlike MSE, MAE takes the sum of the ABSOLUTE value of error.
# Split the dataset into 80% training and 20% testing
set.seed(2312)
inTrain <- createDataPartition(y=df_features$price,p=0.8, list=FALSE, times = 1)
training <- df_features[inTrain,]
testing <- df_features[-inTrain,]

Linear Regression

model.lm <- lm(price ~.,data=training)
summary(model.lm)

Call:
lm(formula = price ~ ., data = training)

Residuals:
   Min     1Q Median     3Q    Max 
-44916  -4012   -212   3627 148549 

Coefficients: (1 not defined because of singularities)
                            Estimate     Std. Error  t value             Pr(>|t|)    
(Intercept)            36263.3720814    794.3215444   45.653 < 0.0000000000000002 ***
odometer                  -0.0869380      0.0005592 -155.458 < 0.0000000000000002 ***
conditionfair          -3131.9169088    177.6527013  -17.629 < 0.0000000000000002 ***
conditiongood            194.0974646     69.7392873    2.783             0.005384 ** 
conditionlike new       1938.3168260     98.9963371   19.580 < 0.0000000000000002 ***
conditionnew            4228.9854937    409.5164135   10.327 < 0.0000000000000002 ***
conditionsalvage       -3145.6386484    620.1764055   -5.072   0.0000003940178359 ***
cylinders12 cylinders  10908.3229919   1327.9311099    8.215 < 0.0000000000000002 ***
cylinders3 cylinders   -7003.5324489    766.4007195   -9.138 < 0.0000000000000002 ***
cylinders4 cylinders   -2501.2789525    355.1224277   -7.043   0.0000000000018878 ***
cylinders5 cylinders   -1653.3466941    482.6776826   -3.425             0.000614 ***
cylinders6 cylinders     198.7246488    350.1159043    0.568             0.570310    
cylinders8 cylinders    3740.6767849    348.9272516   10.721 < 0.0000000000000002 ***
statuslien              2598.3556964    357.6426060    7.265   0.0000000000003751 ***
statusmissing          -2178.0423843    725.3847034   -3.003             0.002677 ** 
statusparts only       -6285.2572073   1325.2579110   -4.743   0.0000021121574168 ***
statusrebuilt          -2245.2443840    171.7717300  -13.071 < 0.0000000000000002 ***
statussalvage          -2618.0715742    273.5028604   -9.572 < 0.0000000000000002 ***
drivefwd               -3779.8784690     90.7255115  -41.663 < 0.0000000000000002 ***
driverwd               -1689.0061260     90.6233569  -18.638 < 0.0000000000000002 ***
fuelelectric           -5378.6732247   1060.0234041   -5.074   0.0000003900197875 ***
fuelgas               -11410.6746897    140.6613785  -81.122 < 0.0000000000000002 ***
fuelhybrid             -8987.2137046    312.3176920  -28.776 < 0.0000000000000002 ***
manufacturerTRUE                  NA             NA       NA                   NA    
transmissionmanual      1854.0642397    124.2159467   14.926 < 0.0000000000000002 ***
transmissionother       6203.2070846    127.4316080   48.679 < 0.0000000000000002 ***
shapeconvertible        5241.6601410    758.2711120    6.913   0.0000000000047856 ***
shapecoupe              4659.3845206    746.5914396    6.241   0.0000000004368814 ***
shapehatchback           379.3054217    756.0118330    0.502             0.615866    
shapemini-van           1918.9786053    765.5431379    2.507             0.012188 *  
shapeoffroad            5246.3418904    866.9459545    6.052   0.0000000014399712 ***
shapeother              5757.9234537    761.2308785    7.564   0.0000000000000394 ***
shapepickup             4795.7058855    745.8611441    6.430   0.0000000001283871 ***
shapesedan               411.9464525    741.5786473    0.555             0.578554    
shapeSUV                1944.6721684    743.0944878    2.617             0.008872 ** 
shapetruck              5095.9422742    742.9904040    6.859   0.0000000000069900 ***
shapevan                3670.5209241    758.6784347    4.838   0.0000013131872821 ***
shapewagon              -187.2485920    761.8040140   -0.246             0.805840    
age                     -283.3513656      3.6786370  -77.026 < 0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9145 on 99032 degrees of freedom
Multiple R-squared:  0.538, Adjusted R-squared:  0.5378 
F-statistic:  3116 on 37 and 99032 DF,  p-value: < 0.00000000000000022

Decision Tree

rpart.cntrl <- rpart.control(minsplit = 300)
model.dt <- rpart(price ~.,data=training, control = rpart.cntrl)
summary(model.dt)
Call:
rpart(formula = price ~ ., data = training, control = rpart.cntrl)
  n= 99070 

           CP nsplit rel error    xerror        xstd
1  0.25729072      0 1.0000000 1.0000054 0.008737311
2  0.06233870      1 0.7427093 0.7428689 0.007424963
3  0.05700253      2 0.6803706 0.6805406 0.007315240
4  0.04973553      3 0.6233681 0.6234479 0.007044982
5  0.02949513      4 0.5736325 0.5737366 0.006941787
6  0.01486208      5 0.5441374 0.5452025 0.006979325
7  0.01409489      6 0.5292753 0.5268898 0.006951188
8  0.01364211      7 0.5151804 0.5153401 0.006917114
9  0.01355801      8 0.5015383 0.5042837 0.006873065
10 0.01000000      9 0.4879803 0.4888082 0.006813484

Variable importance
    odometer          age    cylinders        shape transmission         fuel        drive 
          35           25           10           10           10            7            3 

Node number 1: 99070 observations,    complexity param=0.2572907
  mean=15358.82, MSE=1.809353e+08 
  left son=2 (73489 obs) right son=3 (25581 obs)
  Primary splits:
      odometer     < 57411    to the right, improve=0.2572907, (0 missing)
      age          < 6.5      to the right, improve=0.2516894, (0 missing)
      transmission splits as  LLR, improve=0.1663510, (0 missing)
      shape        splits as  LRRLLLRRLLRLL, improve=0.1403136, (0 missing)
      drive        splits as  RLR, improve=0.1054470, (0 missing)
  Surrogate splits:
      age          < 4.5      to the right, agree=0.844, adj=0.398, (0 split)
      transmission splits as  LLR, agree=0.813, adj=0.274, (0 split)
      shape        splits as  LLRLLLRLLLLLL, agree=0.754, adj=0.047, (0 split)
      condition    splits as  LLLLRL, agree=0.743, adj=0.006, (0 split)
      cylinders    splits as  LLRLLLL-, agree=0.742, adj=0.001, (0 split)

Node number 2: 73489 observations,    complexity param=0.0623387
  mean=11333.31, MSE=9.379467e+07 
  left son=4 (43936 obs) right son=5 (29553 obs)
  Primary splits:
      age       < 9.5      to the right, improve=0.1621148, (0 missing)
      shape     splits as  RLLLLRRRLLRLL, improve=0.1412970, (0 missing)
      fuel      splits as  RLLL-, improve=0.1203584, (0 missing)
      drive     splits as  RLR, improve=0.1121400, (0 missing)
      cylinders splits as  RLLLLLR-, improve=0.0873691, (0 missing)
  Surrogate splits:
      odometer     < 109969.5 to the right, agree=0.695, adj=0.242, (0 split)
      transmission splits as  LLR,          agree=0.614, adj=0.039, (0 split)
      condition    splits as  LLLRRL,       agree=0.607, adj=0.022, (0 split)
      status       splits as  LRLLRL,       agree=0.601, adj=0.008, (0 split)
      cylinders    splits as  LLRRLLL-,     agree=0.600, adj=0.005, (0 split)

Node number 3: 25581 observations,    complexity param=0.05700253
  mean=26923.3, MSE=2.509824e+08 
  left son=6 (6725 obs) right son=7 (18856 obs)
  Primary splits:
      cylinders    splits as  LRLLLRR-, improve=0.15914710, (0 missing)
      shape        splits as  LRRLLLRRLLRRL, improve=0.11760650, (0 missing)
      drive        splits as  RLR, improve=0.10893040, (0 missing)
      transmission splits as  LLR, improve=0.08167025, (0 missing)
      age          < 5.5      to the right, improve=0.07631503, (0 missing)
  Surrogate splits:
      drive  splits as  RLR, agree=0.781, adj=0.168, (0 split)
      shape  splits as  LRRLRRRRLRRRL, agree=0.775, adj=0.143, (0 split)
      status splits as  RRRRLL, agree=0.746, adj=0.034, (0 split)
      fuel   splits as  RLRL-, agree=0.743, adj=0.021, (0 split)
      age    < -0.5     to the left,  agree=0.737, adj=0.000, (0 split)

Node number 4: 43936 observations,    complexity param=0.01486208
  mean=8135.213, MSE=4.793165e+07 
  left son=8 (41468 obs) right son=9 (2468 obs)
  Primary splits:
      fuel      splits as  RLLL-, improve=0.12650340, (0 missing)
      shape     splits as  RRLLLRRRLLRLL, improve=0.11610260, (0 missing)
      drive     splits as  RLR, improve=0.10136460, (0 missing)
      cylinders splits as  RRLLLLR-, improve=0.09763326, (0 missing)
      odometer  < 99999.5  to the right, improve=0.05309322, (0 missing)

Node number 5: 29553 observations,    complexity param=0.04973553
  mean=16087.86, MSE=1.241672e+08 
  left son=10 (21899 obs) right son=11 (7654 obs)
  Primary splits:
      shape     splits as  LLLLLRRRLLRLL, improve=0.24295380, (0 missing)
      drive     splits as  RLR, improve=0.21367810, (0 missing)
      cylinders splits as  R-LLLRR-, improve=0.21055150, (0 missing)
      fuel      splits as  RLLL-, improve=0.16570920, (0 missing)
      age       < 6.5      to the right, improve=0.05775407, (0 missing)
  Surrogate splits:
      cylinders splits as  R-LLLLR-,     agree=0.843, adj=0.394, (0 split)
      fuel      splits as  RLLL-,        agree=0.786, adj=0.172, (0 split)
      odometer  < 230211.5 to the left,  agree=0.743, adj=0.007, (0 split)
      age       < 2.5      to the right, agree=0.742, adj=0.005, (0 split)

Node number 6: 6725 observations
  mean=16340.52, MSE=1.125237e+08 

Node number 7: 18856 observations,    complexity param=0.02949513
  mean=30697.65, MSE=2.461749e+08 
  left son=14 (4887 obs) right son=15 (13969 obs)
  Primary splits:
      age      < 7.5      to the right, improve=0.11389970, (0 missing)
      fuel     splits as  RLLL-, improve=0.05065110, (0 missing)
      shape    splits as  LRRLLLRRLRRLL, improve=0.04367528, (0 missing)
      odometer < 34518    to the right, improve=0.04365292, (0 missing)
      drive    splits as  RLL, improve=0.03387125, (0 missing)
  Surrogate splits:
      transmission splits as  RLR, agree=0.773, adj=0.124, (0 split)
      shape        splits as  LLRRRLRRRRRRL, agree=0.762, adj=0.082, (0 split)
      odometer     < 335.5    to the left,  agree=0.759, adj=0.069, (0 split)
      condition    splits as  RLRRRL, agree=0.748, adj=0.028, (0 split)
      status       splits as  RRLLRR, agree=0.743, adj=0.007, (0 split)

Node number 8: 41468 observations
  mean=7534.485, MSE=3.792343e+07 

Node number 9: 2468 observations
  mean=18228.82, MSE=1.081481e+08 

Node number 10: 21899 observations,    complexity param=0.01409489
  mean=12840.75, MSE=6.657982e+07 
  left son=20 (11442 obs) right son=21 (10457 obs)
  Primary splits:
      drive     splits as  RLR, improve=0.17328470, (0 missing)
      cylinders splits as  L-LLLRR-, improve=0.15724690, (0 missing)
      shape     splits as  RRRLL---LR-RL, improve=0.10495730, (0 missing)
      odometer  < 98069.5  to the right, improve=0.07025000, (0 missing)
      age       < 6.5      to the right, improve=0.06608053, (0 missing)
  Surrogate splits:
      shape        splits as  RRRLL---LR-RR, agree=0.742, adj=0.460, (0 split)
      cylinders    splits as  R-LLLRR-, agree=0.714, adj=0.402, (0 split)
      transmission splits as  LLR, agree=0.545, adj=0.048, (0 split)
      odometer     < 79951    to the right, agree=0.537, adj=0.030, (0 split)
      age          < 4.5      to the right, agree=0.526, adj=0.007, (0 split)

Node number 11: 7654 observations,    complexity param=0.01364211
  mean=25378.24, MSE=1.724536e+08 
  left son=22 (6086 obs) right son=23 (1568 obs)
  Primary splits:
      fuel      splits as  RLLL-,        improve=0.18526190, (0 missing)
      drive     splits as  RLL,          improve=0.05220287, (0 missing)
      age       < 6.5      to the right, improve=0.03958335, (0 missing)
      cylinders splits as  R-LLLRR-,     improve=0.03249489, (0 missing)
      odometer  < 111580   to the right, improve=0.02653238, (0 missing)
  Surrogate splits:
      odometer < 241610.5 to the left,  agree=0.797, adj=0.008, (0 split)

Node number 14: 4887 observations
  mean=21745.15, MSE=2.261747e+08 

Node number 15: 13969 observations,    complexity param=0.01355801
  mean=33829.65, MSE=2.153232e+08 
  left son=30 (13326 obs) right son=31 (643 obs)
  Primary splits:
      fuel      splits as  RLLL-, improve=0.08079886, (0 missing)
      cylinders splits as  -R---LR-, improve=0.07298311, (0 missing)
      shape     splits as  LRRLLRRRLLRLL, improve=0.04624595, (0 missing)
      odometer  < 13286.5  to the right, improve=0.04220035, (0 missing)
      age       < 2.5      to the right, improve=0.04022975, (0 missing)

Node number 20: 11442 observations
  mean=9593.585, MSE=2.446773e+07 

Node number 21: 10457 observations
  mean=16393.78, MSE=8.849738e+07 

Node number 22: 6086 observations
  mean=22509.2, MSE=1.191975e+08 

Node number 23: 1568 observations
  mean=36514.05, MSE=2.232054e+08 

Node number 30: 13326 observations
  mean=32913.42, MSE=1.833724e+08 

Node number 31: 643 observations
  mean=52818.23, MSE=4.995308e+08 
# Plot decision tree Summary
rsq.rpart(model.dt)

Regression tree:
rpart(formula = price ~ ., data = training, control = rpart.cntrl)

Variables actually used in tree construction:
[1] age       cylinders drive     fuel      odometer  shape    

Root node error: 17925260137998/99070 = 180935300

n= 99070 

         CP nsplit rel error  xerror      xstd
1  0.257291      0   1.00000 1.00001 0.0087373
2  0.062339      1   0.74271 0.74287 0.0074250
3  0.057003      2   0.68037 0.68054 0.0073152
4  0.049736      3   0.62337 0.62345 0.0070450
5  0.029495      4   0.57363 0.57374 0.0069418
6  0.014862      5   0.54414 0.54520 0.0069793
7  0.014095      6   0.52928 0.52689 0.0069512
8  0.013642      7   0.51518 0.51534 0.0069171
9  0.013558      8   0.50154 0.50428 0.0068731
10 0.010000      9   0.48798 0.48881 0.0068135

Random Forest

model.rf <- ranger(price ~., data = training, num.threads = 10, num.trees = 500, mtry = 4)
Growing trees.. Progress: 12%. Estimated remaining time: 3 minutes, 43 seconds.
Growing trees.. Progress: 26%. Estimated remaining time: 3 minutes, 3 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 2 minutes, 22 seconds.
Growing trees.. Progress: 54%. Estimated remaining time: 1 minute, 46 seconds.
Growing trees.. Progress: 68%. Estimated remaining time: 1 minute, 12 seconds.
Growing trees.. Progress: 84%. Estimated remaining time: 36 seconds.
Growing trees.. Progress: 98%. Estimated remaining time: 4 seconds.
model.rf
Ranger result

Call:
 ranger(price ~ ., data = training, num.threads = 10, num.trees = 500,      mtry = 4) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      99070 
Number of independent variables:  10 
Mtry:                             4 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       37990355 
R squared (OOB):                  0.7900356 

Predictions

We shall predict the new testing-set we prepared above and see the accuracy for each algorithm.

testing_acc = function(model, test, rf = F) {
  x_test = test[,2:11]
  y_test = test$price
  if(rf) predictions <- predict(model, data=x_test)$predictions
  else predictions <- predict(model, newdata=x_test)
  # predictions
  actuals_preds <- data.frame(cbind(actuals=y_test, preds=predictions))
  correlation_accuracy <- cor(actuals_preds)  
  return(correlation_accuracy)
}
testing_acc(model.lm, testing)
Warning in predict.lm(model, newdata = x_test) :
  prediction from a rank-deficient fit may be misleading
          actuals     preds
actuals 1.0000000 0.7440916
preds   0.7440916 1.0000000
testing_acc(model.dt, testing)
          actuals     preds
actuals 1.0000000 0.7282761
preds   0.7282761 1.0000000
testing_acc(model.rf, testing, T)
          actuals     preds
actuals 1.0000000 0.8943955
preds   0.8943955 1.0000000

We trained our data on 3 different machine learning algorithms Linear Regression, Decision Tree, and Random Forest. Overall, Decision Tree got the least accuracy, however Random Forest seems to be the best algorithm for predicting the price of a used car with 89.4% on the testing set.

Conclusion

In this report we’ve analyzed and investigated the Used Cars Dataset in few different angles starting with gathering, assessing, and cleaning the dataset. We’ve also explored the data analytically by addressing some questions and drawing some insights about the data. Then, we tested some of the variables whether it has a significant effect on the average car price or not. In addition to modeling the data into some of the machine learning regression algorithms to predict the vehicle price.

The outcome of the analysis:

There are some limitations we’ve faced during our investigation such as:

---
title: "Used Vehicles Data Analysis"
author: "Ahmed Sayed"
output: html_notebook
---

## Table of contents

- Introduction
- Data Wrangling
- Exploratory Analysis
  - Univariate Analysis
  - Multivariate Analysis
- Significance Testing
- Data Modeling & Predictions
- Conclusion


## Introduction 

Craigslist is the world's largest collection of used vehicles for sale, yet it's very difficult to collect all of them in the same place. I built a scraper for a school project and expanded upon it later to create this dataset which includes every used vehicle entry within the United States on Craigslist. 

The dataset contains around 500k used cars with its 26 associated variables (characteristics). This data is scraped every few months, it contains most all relevant information that Craigslist provides on car sales including columns like price, condition, manufacturer, latitude/longitude, and 18 other categories.

### Some Definitions:

- **Odometer:** An odometer or odograph is an instrument used for measuring the distance traveled by a vehicle, such as a bicycle or car.
- **Transmission:** A transmission is a machine in a power transmission system, which provides controlled application of power.
- **Drive Type:** The four different types of drivetrain are all-wheel-drive **AWD**, front wheel drive **FWD**, rear wheel drive **RWD**, and **4WD** 4 wheel drive.
- **Lien Cars:** A lien sale is the sale of the claim—or a hold—placed on an asset to satisfy an unpaid debt. Typically, lien sales are conducted as public auctions, and the lien is on real estate, automobiles, and other personal property.


### Data Dictionary

- **id:** Entry identification.
- **url:** Listing URL of the car.
- **region:** Region in which car are listed for sale.
- **region_url:** URL of the craigslist of that region.
- **price:** Entry price of the car.
- **year:** The year in which the car was manufactured.
- **manufacturer:** Manufacturer of the vehicle.
- **model:** Model of the vehicle.
- **condition:** Condition of the vehicle e.g. `new`, `good`, `fair` etc.
- **cylinders:** Number of cylinders in the vehicle.
- **fuel:** Fuel type e.g. `gas`, `diesel`, `electric` etc.
- **odometer:** miles traveled by vehicle.
- **title_status:** Vehicle status e.g. `clean`, `rebuilt` etc.
- **transmission:** Transmission type of the vehicle e.g. `manual`, `automatic`.
- **VIN:** Vehicle identification number.
- **drive:** Type of drive by vehicle e.g. `rwd`, `fwd`, `4wd`.
- **size:** Size of the vehicle e.g. `compact`, `full-size`, etc.
- **type:** Body type of the vehicle e.g. `SUV`, `sedan`, etc.
- **paint_color:** Color of the vehicle.
- **image_url:** Image URL of the vehicle.
- **description:** Description of the vehicle.
- **state:** Region in which the vehicle are listed for sale.
- **lat:** Latitude of listing region.
- **long:** longitude of listing region.
- **posting_date:** Date in which the vehicle was listed for sale.


### Questions that would be able to answer during our analysis:

- Which state has more used craigslist and for which region?
- Which region has more used craigslist and in which state?
- What is the top 3 manufacturer with most cars?
- What is the top 10 model for each manufacturer?
- How the number of used cars changes over the years since 1900?
- What is the most common fuel types?
- What the percentage for each of the car drive type?
- Does the odometer affect car price?
- Does the cylinders affect car price for each fuel type?
- Does the car status affect the price?
- Does the condition affect car price for each status?
- What's average price for each manufacture?
- What is th most common body shapes used, and its average price?
- Does fuel type along with the odometer affect car price?
- Does the transmission affect car price?
- What are the average years people spend before offering their car for sale?


In order to address those questions, there are some things to do such as:

1. Cleaning the dataset from any incorrect datatypes or null values.
2. Calculating some statistics about the variables and look for any relationships in the data.
3. Drawing conclusions with some charts and communicating the findings.


```{r load-libs, results="hide"}
# turn off scientific notation like 1e+06
options(scipen=999, warn=-1)
suppressPackageStartupMessages({
  # Load Analysis Libraries
  library(dplyr)
  library(tidyr)
  library(plotly)
  library(ggplot2)
  # Load ML libraries
  library(caret)
  library(ranger)
  library(rpart)
})
```


## Data Wrangling

### Gathering

```{r load-data}
df = read.csv("vehicles.csv", na.strings = c("", "NA"))
head(df)
```

### Assessing

```{r}
dim(df)
```

```{r}
names(df)
```

```{r}
# The structure of the dataframe
glimpse(df)
```

Most of the variable aren't in the correct type and it need to be reassigned to its datatype.


```{r}
# Count the nan values in the data
sum(is.na(df))
sum(complete.cases(df))
```

There's almost **1.5M** fields contains `NA` values across all variables, however there's no row filled entirely with `NA`.

```{r}
# Look if there's any duplicates
sum(duplicated(df$id))
```


### Data Cleaning

First of all, Let's get rid of `image_url`, `url`, `description`, `county`, `VIN`, and `region_url` columns that we won't use it in our analysis.

```{r}
df2 <- df[, !names(df) %in% c("image_url", "url", "description", "county", "region_url", "VIN")]
```

Now, we can check the percentage of `NA` values in each column.

```{r}
nan_prop <- function(data_frame) {
    prop_col = c()
    for (col in names(data_frame)) {
        nan_perc = round((sum(is.na(data_frame[[col]])) / dim(data_frame)[1]), 2) 
        prop_col = c(prop_col, nan_perc)
    }
    return(prop_col)
}
nan_prop(df2)
```

```{r, fig.align='center', out.width='100%'}
plot_ly(
  x = names(df2),
  y = 1 - nan_prop(df2),
  type = "bar",
) %>% layout(title = "Propotions of NA values", xaxis=list(title="Column Names"), yaxis=list(title="Propotion"))
```

As we can see above, the most variable with `NA` values is `size` with **71%**. This variable looks significant to the other variables although, that will be a problem to our analysis, so we will get rid of the entire column. The other columns such as `condition` and `cylinders` are almost the same with **40%** of its values are filled with `NA` and we can make our analysis with it, however, it's a big percentage.

```{r}
df2 = df2[, !names(df2) %in% c("size")]
```

Let's change the datatype for each variable to its corresponding one and rename some the mis-styped.


###  edit the data type
```{r}
df2$year = as.integer(df2$year)
df2$price = as.numeric(df2$price)
df2$odometer = as.integer(df2$odometer)
df2$condition = as.factor(df2$condition)
df2$cylinders = as.factor(df2$cylinders)
df2$title_status = as.factor(df2$title_status)
df2$transmission = as.factor(df2$transmission)
df2$drive = as.factor(df2$drive)
df2$type = as.factor(df2$type)
df2$fuel = as.factor(df2$fuel)
df2$paint_color = as.factor(df2$paint_color)
df2$posting_date = as.POSIXct(df2$posting_date)

# rename some columns
df2 = rename(df2,status = title_status, shape = type)
glimpse(df2)
```

The dataset now seems fine, let's look at the unique values for the categorical columns.

```{r}
lapply(df2[,c(7,8,9,11,12,13,14,15,16)], unique)
```

```{r}
summary(df2)
```

So, as we can see there's a lot of `NA` values with each column and the only variables that contain numeric values are `price` and `odometer`. Moreover, the difference between the maximum value and the 3rd quantile of `price` is pretty huge, so we can assume that the price has some outliers to be handled as well as the `odometer` variable.


In the following section we'll look at the outliers for `price` and `odometer` variables. Any values outside of the interval below will be considered potential outliers:

$$I = [Q_{0.25} - 1.5⋅IQR; Q_{0.75} + 1.5⋅IQR]$$



```{r}
I <- function(column) {
    quan <- quantile(column, na.rm=TRUE)
    IQR = as.integer(quan["75%"] - quan["25%"])
    I = as.integer(c(quan["25%"] - 1.5 * IQR, quan["75%"] + 1.5 * IQR))
    return(I)
}
I(df2$price)
```

For the sake of the data and the cars, **$57k** wouldn't be enough, since some car types are actually over price. So, we'll assume that any car with a price above 1 million will be an outlier and of course the price must be non-negative.

Let's do the same thing with the `odometer` variable.

```{r}
I(df2$odometer)
```

For the odometer, 277k miles makes sense, why would someone buy a car traveled more than 300k miles!

Let's filter those values out from our dataset and replace the NA with the means of each variable.

```{r}
df3 = df2 %>% filter(price < 5e5 & price >= 0 & odometer <= 3e5 & odometer >= 0)
df3[is.na(df3$price)] = mean(df3$price, na.rm = TRUE)
df3[is.na(df3$odometer)] = mean(df3$odometer, na.rm = TRUE)
```

Delete the manufacturer and the model if it's `NA` since it has no meaning if a car has no model or a car has no manufacturer.

```{r}
df3 = df3[!is.na(df3$manufacturer), ]
df3 = df3[!is.na(df3$model), ]
dim(df3)
```

```{r}
names(df3)
nan_prop(df3)
```


## Exploratory Analysis

All the variables are categorical except two variables `price` and `odometer` so let's look at its distributions.


```{r fig.align='center', out.width='100%'}
price_hist <- plot_ly(nbinsx = 100) %>%
  add_histogram(x = ~df3$price, marker = list(color = 'rgba(55, 128, 191, 0.8)'), name = "Price $")

odo_hist <- plot_ly(nbinsx = 50) %>%
    add_histogram(x = ~df3$odometer, marker = list(color = 'rgba(219, 64, 82, 0.8)'), name = "Odometer (miles)")

fig <- subplot(price_hist, odo_hist) %>%
    layout(title = "Odometer and Price Distribution", yaxis=list(title="Density"))
fig
```

Both `price` and `odometer` have **right-skewed** histograms which mean cars with high odometer are less popular for sale and cars with more price are less popular for sale as well.


### Which state has more used craigslist and for which region?

```{r fig.align='center', out.width='100%'}
# Group by each state and arrange by number of vehicles
by_state = group_by(df3, state = df3$state) %>% count() %>% arrange(n)
by_state %>% 
    plot_ly(x = ~n, y = ~state, type = "bar", marker = list(color = 'rgba(55, 128, 191, 0.9)')) %>% 
    layout(title="Used Cars by each state", xaxis=list(title="Cars"), 
       yaxis=list(title="State",  categoryorder = "array", categoryarray = ~n))
```

**California** has the most used vehicles among all states with **~47k** cars.

```{r}
# Get the region in the ca state that has the most cars
ca_state = df3[df3$state == "ca",]         
by_region = group_by(ca_state, region = ca_state$region)
arrange(count(by_region), desc(n))[1,]
```

Stockton are two cities in California which have the most vehicles with **2840** cars.

### Which region has more used craigslist and in which state?

```{r}
by_state_region = group_by(df3, state=df3$state, region=df3$region)
arrange(count(by_state_region), desc(n))[1,]
```

**Pasco** in **Washington** got the top on among all regions which has **2875** cars.

### What is first top 3 manufacturers with more vehicles?

```{r fig.align='center', out.width='100%'}
by_manf = df3 %>% group_by(manufacturer) %>% count() %>% arrange(n)
by_manf %>% plot_ly(x = ~n, y = ~manufacturer, type = "bar", marker = list(color = 'rgba(219, 64, 82, 0.9)')) %>%
layout(title="Car Used by each manufacturer", xaxis=list(title="Cars"), 
       yaxis=list(title="Manufacturer",  categoryorder = "array", categoryarray = ~n))
```

It appears that, **Ford**, **Chevrolet**, and **Toyota** have the most vehicles. Let's see what's average price for each of the top 3.

```{r}
# AVG price for each of top 3 Manufacturers
avg_ford = round(mean(df3[df3$manufacturer == "ford",]$price), 2)
avg_chevrolet = round(mean(df3[df3$manufacturer == "chevrolet",]$price), 2)
avg_toyota = round(mean(df3[df3$manufacturer == "toyota",]$price), 2)
list(ford=avg_ford, chevrolet=avg_chevrolet, toyota=avg_toyota)
```

Ford got the higher average price among top 3 manufacturers with **$19k**.

### What is the top 10 models with each manufacturer?

```{r}
by_manf_model = data.frame(df3 %>% group_by(manufacturer, model) %>% count())
agg = aggregate(n ~ manufacturer, data = by_manf_model, FUN=max, na.rm = TRUE)
max_by_model = data.frame(matrix(ncol=3, nrow=0, dimnames=list(NULL, c("manf", "model", "max_cars"))))

for (i in 1:length(agg$n)) {
    max_by_model[nrow(max_by_model)+1, ] = as.vector(by_manf_model[by_manf_model$n == agg$n[i] & 
                                                                   by_manf_model$manufacturer == agg$manufacturer[i], ] %>%
          mutate_all(as.character) %>% unlist(., use.names = FALSE))
}
max_by_model$max_cars = as.integer(max_by_model$max_cars)
max_by_model = arrange(max_by_model, desc(max_cars))
```


```{r fig.align='center', out.width='100%'}
max_by_model[1:10,] %>%
    plot_ly(x=~model, y=~max_cars, type = 'bar', color=~manf, colors="Dark2") %>%
    layout(yaxis = list(title = 'Count'), xaxis=list(title='Car Model', categoryorder = "array", categoryarray=~max_cars))
```

### How the number of used cars changes over the years since 1900?

```{r fig.align='center', out.width='100%'}
by_year = group_by(df2, year)
by_year = count(by_year)
plot_ly(by_year, x=~year, y=~n, type = 'bar', color=~year, reversescale=TRUE, colors = "Blues") %>% 
  layout(yaxis = list(title = 'Count'), xaxis=list(title='Year', categoryorder = "array", categoryarray=~n))
```

It's normal to see that the number of used cars increase since 1900 until 2018.

Let's see what's the most used cars in this year.

```{r}
most_manu_b = df2[df3$year == 2017,] %>% group_by(manufacturer) %>% count()
most_manu_b = arrange(most_manu_b,desc(most_manu_b$n))
most_manu_b[1,]
```

**Ford** is most manufacture has a cars in 2018 with **6390** cars.

### What is the most common fuel types?

```{r}
by_fuel = group_by(df2, fuel = df2$fuel) %>% count()
by_fuel$n = round(by_fuel$n / sum(by_fuel$n), 2)
by_fuel = rename(by_fuel, ratio = n) %>% drop_na()
```

```{r fig.align='center', out.width='100%'}
colors <- c('rgb(128,133,133)', 'rgb(211,94,96)', 'rgb(144,103,167)', 'rgb(171,104,87)', 'rgb(114,147,203)')
plot_ly(by_fuel, labels = ~fuel, values = ~ratio, marker = list(colors = colors, line = list(color = '#FFFFFF', width = 2))) %>% 
  add_pie(hole = 0.5) %>%
  layout(title = 'Car Fuel Type',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
```

The most fuel type used is **Gas** with **~85%** among all cars. 

### What the percentage for each of the car drive type?

```{r}
by_drive = group_by(df2, drive) %>% count(.) %>% drop_na(.)
by_drive = mutate(by_drive, ratio =n/sum(by_drive$n))
```

```{r fig.align='center', out.width='100%'}
plot_ly(by_drive, labels = ~drive, values = ~ratio, type = 'pie', marker = list(line = list(color = '#FFFFFF', width = 3))) %>% 
  layout(title = 'Car Drive Type',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
```

Cars with `4wd` drive type are the most common with **44.5%**.

Next, we'll see which of the variables affect the price.

### Does the odometer affect car price?

```{r fig.align='center', out.width='100%'}
# Take 10000 random sample from the dataset in order to plot
sample_df = sample_n(df3, 1e4)
options(repr.plot.width = 7, repr.plot.height = 4) 
sample_df %>% ggplot(aes(x = odometer, y = price)) +
    geom_bin2d(bins=50)
```

```{r}
cor(df3$price, df3$odometer)
```

There's some sort of negative relationship between how long a car traveled and its price, however, that seems normal since the car price became higher whenever the odometer is lower.

Let's add other factors may affect that relation such as `drive`.

```{r fig.align='center', out.width='100%'}
ggplot(sample_df[!is.na(sample_df$drive),], aes(x=odometer, y=price, color=drive, shape=drive)) +
  geom_point() + 
  geom_smooth(method=lm, se=FALSE) +
  ggtitle("Price vs Odometer for each Drive type")
```

As we noticed the `4wd` models gets the higher price and higher odometer and the next one is `rwd` since these type of cars used a lot in car races.

Let's see the average price for each type.

```{r}
by_drive = aggregate(price ~ drive, data=df3, FUN=mean) %>% arrange(desc(price))
by_drive
```

As expected `4wd` got the highest average price with **~$20k** and `rwd` with **$19.5k**.

### Does the cylinders affect car price for each fuel type?

```{r fig.align='center',out.width='100%'}
aggregate(price ~ cylinders + fuel, data = df3[df3$fuel != "other",], FUN=mean, na.rm = TRUE) %>% 
    arrange(price) %>% 
    plot_ly(x = ~cylinders, y = ~price, color = ~fuel, type = "bar") %>% 
    layout(title="Average Price per Cylinder and Fuel Type", yaxis=list(categoryorder = "array", categoryarray=~price))
```

It seems doesn't, typical cars have 4, 6 or 8 cylinders and the figure above shows exactly those types got most average price for each fuel type.

### Does the car status affect the price?

```{r}
by_status = group_by(df3, status)
by_status = aggregate(price ~ status, data = by_status, FUN=mean, na.rm = TRUE)
```

```{r fig.align='center', out.width='100%'}
plot_ly(by_status, x=~status, y=~price, type = 'bar', colors="Dark2", 
               marker = list(color = 'rgba(55, 128, 191, 0.8)')) %>% 
  layout(yaxis = list(title = 'Average Price'), xaxis=list(title='Car Status'))
```

**Lien** has the most average used car price with **$22k**.

### Does the condition affect car price for each status?

```{r fig.align='center', out.width='100%'}
# Aggregate average car price by condition and status
aggregate(price ~ condition + status, data = df3, FUN=mean, na.rm = TRUE) %>% 
    arrange(price) %>% plot_ly(x = ~condition, y = ~price, color = ~status, type = "bar") %>%
    layout(title="Average Car Price per Condition and Status ", yaxis=list(categoryorder = "array", categoryarray=~price))
```

**Lien** cars got the highest average price among other cars and it seems usual. In addition, new cars which are parts only got the second place, so if we filtered those out from the figure we can see that the normal average price starts with new clean cars until fair salvage.

### What's average price for each manufacture?

```{r fig.align='center', out.width='100%'}
avg_by_manuf = aggregate(price ~ manufacturer, data = df3, FUN=mean, na.rm = TRUE) %>% arrange(price)
plot_ly(x = ~avg_by_manuf$price,y = ~avg_by_manuf$manufacturer, name = "Car Used by each manufacturer",type = "bar",
   marker = list(color = 'rgba(219, 64, 82, 0.8)'),
) %>% layout(title = "Average Price for each Manufacturer", xaxis=list(title="Average Price"), 
             yaxis=list(title="Manufacturer",  categoryorder = "array", categoryarray = ~avg_by_manuf$price))
```

Ferrari got the 1st place with the highest average price **$114.7k**.

### What is th most common body shapes used, and its average price?

```{r}
by_shape = group_by(df3, shape)
cnt_by_shape = count(by_shape) %>% drop_na()
avg_price = aggregate(price ~ shape, data = by_shape, FUN=mean, na.rm = TRUE)
avg_cnt = cbind(cnt_by_shape, price = avg_price$price)
```

```{r fig.align='center', out.width='100%'}
plot_ly(avg_cnt, x = ~shape, y = ~price, type = 'bar', name = 'Price', marker = list(color = 'rgb(49,130,189)')) %>%
  add_trace(y = ~n, name = 'Cars', marker = list(color = 'rgb(200,200,200)')) %>% 
  layout(xaxis = list(title = "Car Body Shape", tickangle = -45),
         yaxis = list(title = ""), margin = list(b = 100), barmode = 'group')
```

From the previous plot we got two observations:

- Sedan body type is the most common shape among used cars with average price **$13.1k**.
- Pickup body type has the most average price among all shapes with **$28.6k**.

### Does fuel type along with the odometer affect car price? 

```{r fig.align='center', out.width='100%'}
sample_df %>% plot_ly(x = ~price, y = ~odometer, type="scatter", color = ~fuel, colors="Dark2", marker = list(size = 6,
  opacity = 0.6)) %>% 
  layout(title = 'Odometer vs Price per Fuel Type',
         xaxis = list(title="Odometer (miles)"),
         yaxis = list(title="Price $"))
```

Most cars use gas as a fuel type and it seems each fuel type strongly affects the price of the used car along with the odometer. 

Let's see the average price by each fuel type.

```{r}
by_fuel = aggregate(price ~ fuel, data = df2, FUN=mean, na.rm = TRUE) %>% arrange(price)
by_fuel
```

### Does the transmission affect car price?

```{r}
aggregate(price ~ transmission, data = df3, FUN=mean, na.rm = TRUE)
```

The average price is highly affected by the transmission type. 

Let's see `fuel` variables may contribute to the relation.

```{r fig.align='center', out.width='100%'}
# Aggregate average car price by transmission type and fuel
aggregate(price ~ fuel + transmission, data = df3, FUN=mean, na.rm = TRUE) %>% 
    arrange(price) %>% plot_ly(x = ~fuel, y = ~price, color = ~transmission, type = "bar") %>%
    layout(title="Average Car Price per Transmission and Fuel ", yaxis=list(categoryorder = "array", categoryarray=~price))
```

It's pretty obvious that `other` got the highest average price, but if we filtered that we got the normal differences between `automatic` and `manual` cars.

### What are the average years people spend before offering their car for sale?

To address this question we should extract another feature from the dataset using those variables `year` and `posting_date`, then calculate the difference.

```{r}
df3[is.na(df3$year),]$year <- as.integer(mean(df3$year, na.rm=TRUE))
df4 = df3 %>% mutate(age = as.integer(format(posting_date, format="%Y")) - year)
```

```{r fig.align='center', out.width='100%'}
plot_ly(nbinsx = 100) %>%
    add_histogram(x = ~df4$age, marker = list(color = 'rgba(55, 128, 191, 0.8)')) %>%
    layout(title = "Age of Used Cars", yaxis=list(title="Density"), xaxis=list(title="Age"))
```

The age of the used cars is positively skewed which means recent cars are most popular for sale. Next, let's see the `condition` and `status` how it affects the age of the car.

```{r fig.align='center', out.width='100%'}
aggregate(age ~ condition + status, data = df4, FUN=mean, na.rm = TRUE) %>% 
    arrange(age) %>% plot_ly(x = ~condition, y = ~age, color = ~status, type = "bar") %>%
    layout(title="Average Car Age per Condition and Status ", yaxis=list(categoryorder = "array", categoryarray=~age))
```

**New** and **parts-only** cars are getting the highest average age, however, the missing cars got the highest average age among all conditions. 

## Significance Testing
 
In this part, we're going to test some of the above variables whether it can contribute to the price of a used vehicle or not. To do so, we can use [ANOVA Test](https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/anova/) since we got one numeric variable vs other categorical variables.

To begin, let's see some transformations on the `price` in order to make it normally distributed.

```{r fig.align='center', out.width='100%'}
price_log <- plot_ly(nbinsx = 30) %>%
  add_histogram(x = ~log(df3$price), marker = list(color = 'rgba(55, 128, 191, 0.8)'), name = "Price Log")

price_sqrt <- plot_ly(nbinsx = 30) %>%
    add_histogram(x = ~sqrt(df3$price), marker = list(color = 'rgba(219, 64, 82, 0.8)'), name = "Price Sqrt")

fig <- subplot(price_log, price_sqrt) %>%
    layout(title = "Price Transformation", yaxis=list(title="Density"))
fig
```

```{r}
summary(log(df4[df4$price>0,]$price))
```

The logarithmic transformation seems to be normal with `mean ~= median` so, let's make the ANOVA test

```{r}
# Compute the analysis of variance
res.aov <- aov(log(price) ~ drive + transmission + cylinders + condition + shape + status , data = df4[df4$price>0,])
# Summary of the analysis
summary(res.aov)
```

Surprisingly, all variables are seem to be statistically significant with `p-value < 0.05`.

## Data Modelling & Predictions

In the following part, we're going to predict the price variable based on those features **(Odometer, Condition, Status, Cylinders, Drive, Fuel, Manufacturer, Transmission, Shape, Age)**. These features highly affected the car price as we've shown in the previous section.

### Data Preparation

```{r}
df_features = dplyr::select(df4, price, odometer, condition, cylinders, status, drive, fuel, manufacturer, transmission, shape, age)
df_features <- df_features %>%
  mutate_if(is.character, is.factor)
# Omit NA values in the data
df_features = df_features[complete.cases(df_features),]
```

We've seen some factors contributed to huge volume of data and it negatively affects the car price such as `other` in **Fuel**, and **Cylinders** type, so let's remove them.

```{r}
df_features = df_features %>% filter(fuel != "other", cylinders != "other")
dim(df_features)
```


### Data Modelling

To begin with, let's define some regression metrics:

1. $R^2$ measures how much variability in dependent variable can be explained by the model.
2. While $R^2$ is a relative measure of how well the model fits dependent variables, Mean Square Error is an absolute measure of the goodness for the fit.
3. Mean Absolute Error $MAE$ is similar to $MSE$, however, unlike MSE, MAE takes the sum of the ABSOLUTE value of error.

```{r}
# Split the dataset into 80% training and 20% testing
set.seed(2312)
inTrain <- createDataPartition(y=df_features$price,p=0.8, list=FALSE, times = 1)
training <- df_features[inTrain,]
testing <- df_features[-inTrain,]
```

### Linear Regression

```{r}
model.lm <- lm(price ~.,data=training)
summary(model.lm)
```

### Decision Tree

```{r}
rpart.cntrl <- rpart.control(minsplit = 300)
model.dt <- rpart(price ~.,data=training, control = rpart.cntrl)
summary(model.dt)
```

```{r fig.align='center', out.width='100%'}
# Plot decision tree Summary
rsq.rpart(model.dt)
```

### Random Forest

```{r}
model.rf <- ranger(price ~., data = training, num.threads = 10, num.trees = 500, mtry = 4)
model.rf
```

## Predictions

We shall predict the new testing-set we prepared above and see the accuracy for each algorithm.

```{r}
testing_acc = function(model, test, rf = F) {
  x_test = test[,2:11]
  y_test = test$price
  if(rf) predictions <- predict(model, data=x_test)$predictions
  else predictions <- predict(model, newdata=x_test)
  # predictions
  actuals_preds <- data.frame(cbind(actuals=y_test, preds=predictions))
  correlation_accuracy <- cor(actuals_preds)  
  return(correlation_accuracy)
}
```

```{r}
testing_acc(model.lm, testing)
```

```{r}
testing_acc(model.dt, testing)
```

```{r}
testing_acc(model.rf, testing, T)
```

We trained our data on 3 different machine learning algorithms **Linear Regression**, **Decision Tree**, and **Random Forest**. Overall, Decision Tree got the least accuracy, however Random Forest seems to be the best algorithm for predicting the price of a used car with 89.4% on the testing set.

## Conclusion

In this report we've analyzed and investigated the **Used Cars Dataset** in few different angles starting with gathering, assessing, and cleaning the dataset. We've also explored the data analytically by addressing some questions and drawing some insights about the data. Then, we tested some of the variables whether it has a significant effect on the average car price or not. In addition to modeling the data into some of the machine learning regression algorithms to predict the vehicle price.

The outcome of the analysis:

- California has the most used vehicles among all states in USA.
- Ford, Chevrolet, and Toyota have the most used vehicles since 1900.
- Most of the used vehicles use Gas as a fuel type.
- 4×4 Vehicles are the most common cars in the market.
- Typical cars have 4, 6 or 8 cylinders and those types got most average price for each fuel type.
- Lien has the most average used car price with **$22k**.
- Ferrari got the 1st place with the highest average price **$108.5k**.
- Sedan body type is the most common body shape among used cars with average price **$13.1k**.
- Pickup body type has the most average price among all shapes with **$28.6k**.
- People spent on average of 10 years before offering their cars for sale.
- **Random Forest** is the best algorithm for predicting used car price with 79% accuracy on the training set and 89.4% on the testing set.

There are some limitations we've faced during our investigation such as:

- Most of the data have a lot of `NA` values which means losing a lot of useful information in our analysis.
- We've tried to used some imputation techniques in order to deal with such `NA` values such as **MICE**, but it needs lots of computations, so it was time consuming to make it work.
- The machine learning models accuracy was a little down due to the huge number of `NA` values we omitted.




