finalproject

Introduction

In this project, I explore the specs of used Volkswagens listed for sale, mostly in the Northeastern U.S. The data comes from a web-scraped CarGurus inventory that is current only as of September 2020. The dataset has 66 variables, covering nearly every detail of each listed car. I used several dplyr commands, such as filter(), to focus on Volkswagens and a handful of specific models. I then used select() to drop the columns I wouldn’t be including, and filter() with !is.na() to remove rows with missing values in the columns I kept.

Why this dataset?

I chose this dataset because it has so many variables to work with, and there were endless possibilities for combining them. I narrowed it down to Volkswagens because I drive a Jetta GLI (the color of my car is the same as the regression line in plot 1). The models I kept were simply the ones I could think of off the top of my head, which left me with well-known and widely advertised vehicles.

Source: https://blog.consumerguide.com/quick-spin-2020-volkswagen-jetta-gli/#google_vignette

Background Information

Volkswagens are known for their blend of performance, reliability, and innovation; German engineering strikes again. The brand is known as a car for the people, with iconic designs such as the Beetle and the Golf trims, and it has had a lasting cultural impact on the automobile industry. Accessibility and an enjoyable driving experience are what brought Volkswagen to where it is now.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(dplyr)
library(ggplot2)
library(readr)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Load the data set

# Set working directories
setwd("C:/Users/jfgam/Downloads/Data 101")
usedcars <- read_csv("used_cars_data.csv")
Rows: 259032 Columns: 66
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (37): vin, back_legroom, bed, bed_height, bed_length, body_type, cabin,...
dbl  (15): city_fuel_economy, daysonmarket, engine_displacement, highway_fue...
lgl  (13): combine_fuel_economy, fleet, frame_damaged, franchise_dealer, has...
date  (1): listed_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(usedcars)
 [1] "vin"                     "back_legroom"           
 [3] "bed"                     "bed_height"             
 [5] "bed_length"              "body_type"              
 [7] "cabin"                   "city"                   
 [9] "city_fuel_economy"       "combine_fuel_economy"   
[11] "daysonmarket"            "dealer_zip"             
[13] "description"             "engine_cylinders"       
[15] "engine_displacement"     "engine_type"            
[17] "exterior_color"          "fleet"                  
[19] "frame_damaged"           "franchise_dealer"       
[21] "franchise_make"          "front_legroom"          
[23] "fuel_tank_volume"        "fuel_type"              
[25] "has_accidents"           "height"                 
[27] "highway_fuel_economy"    "horsepower"             
[29] "interior_color"          "isCab"                  
[31] "is_certified"            "is_cpo"                 
[33] "is_new"                  "is_oemcpo"              
[35] "latitude"                "length"                 
[37] "listed_date"             "listing_color"          
[39] "listing_id"              "longitude"              
[41] "main_picture_url"        "major_options"          
[43] "make_name"               "maximum_seating"        
[45] "mileage"                 "model_name"             
[47] "owner_count"             "power"                  
[49] "price"                   "salvage"                
[51] "savings_amount"          "seller_rating"          
[53] "sp_id"                   "sp_name"                
[55] "theft_title"             "torque"                 
[57] "transmission"            "transmission_display"   
[59] "trimId"                  "trim_name"              
[61] "vehicle_damage_category" "wheel_system"           
[63] "wheel_system_display"    "wheelbase"              
[65] "width"                   "year"                   

Cleaning the Data

Use filter() to narrow down to Volkswagens and their well-known models

# Focus on the Volkswagen brand
usedcars1 <- usedcars |>
  filter(make_name == "Volkswagen") |>
  # Model names must match the data exactly; "Beetle" was misspelled as "Bettle"
  filter(model_name %in% c("Beetle", "Golf", "Golf R", "GTI", "Jetta", "Jetta GLI", "Passat"))
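
Because %in% silently returns nothing for a name that is not in the data, a quick check of which requested models actually matched can catch spelling mismatches. The sketch below uses a small made-up table (toy_cars) rather than the real file, and the names wanted_models, model_counts, and unmatched are only for illustration.

```r
library(dplyr)

# Toy stand-in for the real data; values are made up for illustration
toy_cars <- tibble(
  make_name  = c("Volkswagen", "Volkswagen", "Volkswagen", "Honda"),
  model_name = c("Beetle", "Jetta", "Tiguan", "Civic")
)

wanted_models <- c("Beetle", "Golf", "Golf R", "GTI", "Jetta", "Jetta GLI", "Passat")

vw_only <- toy_cars |>
  filter(make_name == "Volkswagen", model_name %in% wanted_models)

# Count listings per model to confirm what was kept
model_counts <- count(vw_only, model_name)

# Requested names that matched nothing in the data (often a spelling problem)
unmatched <- setdiff(wanted_models, unique(toy_cars$model_name))
```

On the real data, the same two checks would be run against usedcars1 and usedcars$model_name.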

Then use select(-) to drop the columns that won’t be used

# Remove variables that won't be used using select (-)
usedcars2 <- usedcars1 |>
  select(-back_legroom,
         -bed,
         -bed_height,
         -bed_length,
         -body_type,
         -cabin,
         -combine_fuel_economy,
         -daysonmarket,
         -dealer_zip,
         -description,
         -engine_cylinders,
         -engine_displacement,
         -fleet,
         -frame_damaged,
         -franchise_dealer,
         -franchise_make,
         -front_legroom,
         -fuel_tank_volume,
         -fuel_type,
         -height,
         -interior_color,
         -isCab,
         -is_certified,
         -is_cpo,
         -is_new,
         -is_oemcpo,
         -length,
         -listed_date,
         -listing_color,
         -listing_id,
         -main_picture_url,
         -major_options,
         -make_name,
         -maximum_seating,
         -salvage,
         -savings_amount,
         -seller_rating,
         -sp_id,
         -sp_name,
         -theft_title,
         -transmission_display,
         -trimId,
         -trim_name,
         -vehicle_damage_category,
         -wheel_system_display,
         -wheelbase,
         -width
  )
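
When far more columns are dropped than kept, an alternative is to list only the columns to keep. This is a minimal sketch on a made-up table, not the approach used above; toy_cars and keep_cols are hypothetical names, and all_of() is the tidyselect helper that errors if a listed column is missing.

```r
library(dplyr)

# Toy stand-in with a few of the real column names; values are made up
toy_cars <- tibble(
  vin    = c("A1", "B2"),
  price  = c(15000, 22000),
  year   = c(2015, 2019),
  width  = c("70 in", "71 in"),
  trimId = c("t1", "t2")
)

# Columns to keep, written once as a character vector
keep_cols <- c("vin", "price", "year")

# all_of() fails loudly if any name in keep_cols is not a column
toy_small <- toy_cars |>
  select(all_of(keep_cols))
```

The trade-off is that a short keep list is easier to read, while the drop list makes it explicit which variables were considered and discarded.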

Filter out any rows with missing values

#Remove any NA in the remaining columns
usedcars3 <- usedcars2 |>
  filter(!is.na(city_fuel_economy),
         !is.na(has_accidents),
         !is.na(highway_fuel_economy),
         !is.na(horsepower),
         !is.na(transmission),
         !is.na(wheel_system),
         !is.na(torque),
         !is.na(owner_count)
  )
names(usedcars3)
 [1] "vin"                  "city"                 "city_fuel_economy"   
 [4] "engine_type"          "exterior_color"       "has_accidents"       
 [7] "highway_fuel_economy" "horsepower"           "latitude"            
[10] "longitude"            "mileage"              "model_name"          
[13] "owner_count"          "power"                "price"               
[16] "torque"               "transmission"         "wheel_system"        
[19] "year"                
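
The same row filtering can be written more compactly with tidyr's drop_na(), which removes rows that have NA in any of the named columns. The sketch below runs on a made-up table (toy) and is meant as an equivalent pattern, not a replacement for the code above.

```r
library(dplyr)
library(tidyr)

# Toy stand-in; values are made up for illustration
toy <- tibble(
  horsepower  = c(147, NA, 228),
  torque      = c("184 lb-ft", "207 lb-ft", NA),
  owner_count = c(1, 2, 1)
)

# Keep only rows where every listed column has a value
complete_rows <- toy |>
  drop_na(horsepower, torque, owner_count)
```

Calling drop_na() with no column names would instead drop rows with an NA in any column, which can remove more rows than intended.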

Create Visualization

p1 <- ggplot(usedcars3, aes(x = year, y = price, color = model_name)) +
  geom_point(aes(size = mileage), alpha = .8) +
  # Dashed gray linear trend line
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#9C9A9A", linetype = "dashed") +
  labs(title = "How Model, Mileage, and Year Determine a Volkswagen's Value",
       x = "Year",
       y = "Price in $USD",
       size = "Mileage",
       color = "Volkswagen Model",
       caption = "Source: Webscraped from Cargurus Inventory as of September 2020") +
  scale_color_brewer(palette = "Set2") +
  theme_classic()

ggplotly(p1)

Interpretation

This interactive plot shows what affects a VW’s price. Naturally, the newer a car is, the pricier it tends to be. A key takeaway is that the Golf R and GTI tend to be priced higher, while the other models follow a more typical pattern. Something I wish I had done differently is put mileage on the x-axis, which would have given a more accurate picture. I could have used mutate() to turn it into a categorical variable, with groups such as 0-15k miles, 15-40k, 40-70k, and so forth. I didn’t because it created a uniform plot, with the points only moving up or down and little spread horizontally.
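
The mileage grouping described above could be sketched with mutate() and cut(). The table below is made up, the column name mileage_group is hypothetical, and the open-ended "70k+" bin is an assumption standing in for "and so forth".

```r
library(dplyr)

# Toy mileages; values are made up for illustration
toy <- tibble(mileage = c(5000, 15000, 28000, 55000, 120000))

binned <- toy |>
  mutate(
    mileage_group = cut(
      mileage,
      breaks = c(0, 15000, 40000, 70000, Inf),
      labels = c("0-15k", "15-40k", "40-70k", "70k+"),
      include.lowest = TRUE  # so a mileage of exactly 0 lands in the first bin
    )
  )
```

By default cut() closes each interval on the right, so a car with exactly 15,000 miles falls in the "0-15k" group.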

Linear Regression model

fit1 <- lm(price ~ year + mileage + horsepower, data = usedcars3)

summary(fit1)

Call:
lm(formula = price ~ year + mileage + horsepower, data = usedcars3)

Residuals:
     Min       1Q   Median       3Q      Max 
-11072.8  -1387.8   -176.7   1149.9  12128.4 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.166e+06  4.823e+04  -24.18   <2e-16 ***
year         5.825e+02  2.389e+01   24.38   <2e-16 ***
mileage     -6.233e-02  1.939e-03  -32.14   <2e-16 ***
horsepower   5.604e+01  1.932e+00   29.00   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2183 on 1967 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared:  0.7727,    Adjusted R-squared:  0.7724 
F-statistic:  2229 on 3 and 1967 DF,  p-value: < 2.2e-16

What this means

All three predictors are statistically significant (p < 2e-16 for each). Holding the others constant, each newer model year is associated with about $583 more in price, each additional mile with about 6 cents less, and each additional horsepower with about $56 more. The R-squared is 0.77, meaning the model explains about 77% of the variation in listed prices. The residual standard error is 2183 on 1967 degrees of freedom, so the model’s price estimates are typically off by roughly $2,200.
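
To see how the coefficients combine into a price estimate, here is a small worked sketch using the rounded values printed in the summary. The function estimate_price and the example car (2018, 30,000 miles, 228 hp) are hypothetical. Because the printed coefficients are rounded, especially the intercept, the result is only approximate.

```r
# Rounded coefficients copied from the summary(fit1) output
coefs <- c(
  intercept  = -1.166e6,
  year       = 5.825e2,
  mileage    = -6.233e-2,
  horsepower = 5.604e1
)

# Hypothetical helper: plug values into the fitted linear equation
estimate_price <- function(year, mileage, horsepower) {
  unname(
    coefs["intercept"] +
      coefs["year"] * year +
      coefs["mileage"] * mileage +
      coefs["horsepower"] * horsepower
  )
}

# Example car (made up): 2018 model year, 30,000 miles, 228 hp
example_price <- estimate_price(year = 2018, mileage = 30000, horsepower = 228)
```

With the fitted model available, predict(fit1, newdata = ...) would give the same kind of estimate using the full-precision coefficients.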

Create popups for map

popupvw <- paste0(
  "<b>Volkswagen Specs: </b>", "<br>",
  "<b>Model: </b>", usedcars3$model_name, "<br>",
  "<b>City: </b>", usedcars3$city, "<br>",
  "<b>Horsepower: </b>", usedcars3$horsepower, "<br>",
  "<b>Mileage: </b>", usedcars3$mileage, "<br>",
  "<b>Vehicle Year: </b>", usedcars3$year, "<br>",
  "<b>Number of Owners: </b>", usedcars3$owner_count, "<br>"
  
)

Final map Visualization

library(leaflet)
Warning: package 'leaflet' was built under R version 4.5.3
pal <- colorNumeric(
  palette = c("yellow", "orange", "darkred"),
  domain = usedcars3$owner_count
)


leaflet(data = usedcars3) |>
  setView(lng = -73.7980, lat = 43.4326, zoom = 6) |>
  addProviderTiles(providers$Esri.NatGeoWorldMap) |>
  addCircleMarkers(
    lat = ~latitude,
    lng = ~longitude,
    data = usedcars3,
    color = ~pal(owner_count),
    radius = 3,
    fillOpacity = .8,
    popup = popupvw
  ) |>
  addLegend(
    position = "bottomright",
    pal = pal,
    # Use the same filtered data that defines the palette's domain
    values = usedcars3$owner_count,
    title = "# of Past VW owners",
    opacity = .8
  )

Conclusion

The final map shows the Volkswagen listings across the Northeastern U.S. (with some scattered across the rest of the country). The interactive map displays key information a buyer would want to know, such as the model, horsepower, mileage, number of owners, and year. The legend works as a mini heat map, so when you click on a dot you can tell whether the car has had more than one owner. There weren’t many surprises, besides the fact that a vehicle with more than 150k miles is pretty rare, and even rarer if it has had fewer than 3 owners. Something I wish I had done differently is use horsepower or price for the legend and color palette, but I kept struggling and getting NAs on my scale. Looking back, I also wanted to try a treemap; I like how aesthetically pleasing they are, but I couldn’t get one to work properly or find the right commands.
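
One possible cause of NAs on a color scale (a guess, since the failing code is not shown) is a palette whose domain does not cover the values passed to it. The sketch below builds a price palette whose domain is the same vector that gets colored; toy_prices and price_pal are made-up names and values.

```r
library(leaflet)

# Toy prices; values are made up for illustration
toy_prices <- c(8500, 12000, 19950, 31000)

# The domain is taken from the same vector that will be colored,
# so every value falls inside the scale
price_pal <- colorNumeric(
  palette = c("yellow", "orange", "darkred"),
  domain  = toy_prices
)

# Hex color strings, one per listing
marker_colors <- price_pal(toy_prices)
```

In a map, the same vector would then be passed both to the marker color (for example color = ~price_pal(price)) and to the values argument of addLegend(), so the legend and the markers share one scale.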

Sources: https://www.vwofstreetsboro.com/blog/what-was-volkswagen-known-for/

Class notes: correlation scatter plots and regression

AI use: from chunk 55 to 105, I had the AI list all of the variables so I could just delete the ones I needed from the list instead of spending a long time typing each individual variable (kind of lazy, I know).

Picture: https://blog.consumerguide.com/quick-spin-2020-volkswagen-jetta-gli/#google_vignette