In this project, I explore the specs of used Volkswagens, mostly in the Northeastern U.S. The data comes from a web-scraped CarGurus inventory that is current only as of September 2020. The dataset has a large number of variables, covering nearly every detail of each listed car. I used several dplyr commands, such as filter(), to focus on Volkswagens and a few specific models. I then used select() to remove the columns I wouldn’t be including, and !is.na() to drop rows with missing values in the columns I kept.
Why this dataset?
I chose this dataset because it has so many variables to work with; there were endless possibilities for how I could incorporate several of them. I narrowed it down to Volkswagens because I drive a Jetta GLI (my car is the same color as the regression line in plot 1). The models I kept were simply the ones I could think of off the top of my head, which gave me well-known, widely advertised vehicles to work with.
Volkswagens are known for their blend of performance, reliability, and innovation; German engineering strikes again. They are known as a car for the people, with iconic designs such as the Beetle and the Golf trims, and for their cultural impact on the automobile industry. Accessibility and an enjoyable driving experience are what brought Volkswagen to where it is now.
library(tidyverse)
library(plotly)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Load the data set
# Set working directory
setwd("C:/Users/jfgam/Downloads/Data 101")
usedcars <- read_csv("used_cars_data.csv")
Rows: 259032 Columns: 66
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (37): vin, back_legroom, bed, bed_height, bed_length, body_type, cabin,...
dbl (15): city_fuel_economy, daysonmarket, engine_displacement, highway_fue...
lgl (13): combine_fuel_economy, fleet, frame_damaged, franchise_dealer, has...
date (1): listed_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
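The filter() step described in the introduction is not shown in this write-up, so here is a minimal sketch of what it could look like. It runs on a made-up four-row tibble so it works on its own. The model list and the toy rows are assumptions for illustration; only the column names make_name and model_name come from the dataset.

```r
library(dplyr)

# Toy stand-in for the full listings table (made-up rows, not real data)
toy_cars <- tibble(
  make_name  = c("Volkswagen", "Volkswagen", "Honda", "Volkswagen"),
  model_name = c("Jetta", "Golf R", "Civic", "Atlas"),
  price      = c(15000, 32000, 14000, 28000)
)

# Hypothetical model list; the project's actual list is not shown
vw_models <- c("Jetta", "Golf R", "GTI")

# Keep only Volkswagens whose model is in the chosen list
vw_only <- toy_cars |>
  filter(make_name == "Volkswagen", model_name %in% vw_models)

vw_only
```

On the real data, the result of a step like this would play the role of usedcars1 in the next chunk.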
# Remove variables that won't be used using select(-)
# usedcars1 is the Volkswagen-only subset made with filter() (that chunk is not shown here)
usedcars2 <- usedcars1 |>
  select(
    -back_legroom, -bed, -bed_height, -bed_length, -body_type, -cabin,
    -combine_fuel_economy, -daysonmarket, -dealer_zip, -description,
    -engine_cylinders, -engine_displacement, -fleet, -frame_damaged,
    -franchise_dealer, -franchise_make, -front_legroom, -fuel_tank_volume,
    -fuel_type, -height, -interior_color, -isCab, -is_certified, -is_cpo,
    -is_new, -is_oemcpo, -length, -listed_date, -listing_color, -listing_id,
    -main_picture_url, -major_options, -make_name, -maximum_seating, -salvage,
    -savings_amount, -seller_rating, -sp_id, -sp_name, -theft_title,
    -transmission_display, -trimId, -trim_name, -vehicle_damage_category,
    -wheel_system_display, -wheelbase, -width
  )
Filter out any rows with missing values
# Remove rows that have an NA in any of the remaining key columns
usedcars3 <- usedcars2 |>
  filter(
    !is.na(city_fuel_economy),
    !is.na(has_accidents),
    !is.na(highway_fuel_economy),
    !is.na(horsepower),
    !is.na(transmission),
    !is.na(wheel_system),
    !is.na(torque),
    !is.na(owner_count)
  )

# Check which columns are left
names(usedcars3)
# Scatter plot of price by year, colored by model and sized by mileage
p1 <- ggplot(usedcars3, aes(x = year, y = price, color = model_name)) +
  geom_point(aes(size = mileage), alpha = .8) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#9C9A9A", linetype = "dashed") +
  labs(
    title = "How Model, Mileage, and Year Determine a Volkswagen's Value",
    x = "Year",
    y = "Price in $USD",
    size = "[Legend]",
    color = "Volkswagen Model",
    caption = "Source: Webscraped from Cargurus Inventory as of September 2020"
  ) +
  scale_color_brewer(palette = "Set2") +
  theme_classic()

# Convert the static ggplot into an interactive plotly chart
ggplotly(p1)
Interpretation
This interactive plot shows what affects a VW’s price. Naturally, the newer a car is, the pricier it tends to be. One key takeaway is that the Golf R and GTI tend to be priced higher, while the other models follow a more typical pattern. Something I wish I had done differently is put mileage on the x-axis, which would have given a more accurate picture. I could’ve used mutate to turn mileage into a categorical variable, with groups such as 0-15k miles, 15-40k, 40-70k, and so forth. I didn’t, because it produced a uniform-looking plot where the points only moved up or down, with little horizontal spread.
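The mileage grouping idea can be sketched with mutate() and cut(). This is only an illustration on made-up mileages; the final "70k+" group is an assumption, since the write-up only names the first three groups.

```r
library(dplyr)

# Made-up mileages for illustration only
toy_cars <- tibble(mileage = c(5000, 15000, 22000, 55000, 90000))

toy_cars <- toy_cars |>
  mutate(
    mileage_group = cut(
      mileage,
      breaks = c(0, 15000, 40000, 70000, Inf),
      labels = c("0-15k", "15-40k", "40-70k", "70k+"),
      include.lowest = TRUE  # so a mileage of exactly 0 lands in the first group
    )
  )

# How many cars fall in each group
count(toy_cars, mileage_group)
```

Because cut() closes each interval on the right by default, a car with exactly 15,000 miles lands in the 0-15k group.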
Linear Regression model
fit1 <- lm(price ~ year + mileage + horsepower, data = usedcars3)
summary(fit1)
Call:
lm(formula = price ~ year + mileage + horsepower, data = usedcars3)
Residuals:
     Min       1Q   Median       3Q      Max
-11072.8  -1387.8   -176.7   1149.9  12128.4

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.166e+06  4.823e+04  -24.18   <2e-16 ***
year         5.825e+02  2.389e+01   24.38   <2e-16 ***
mileage     -6.233e-02  1.939e-03  -32.14   <2e-16 ***
horsepower   5.604e+01  1.932e+00   29.00   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2183 on 1967 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared:  0.7727,  Adjusted R-squared:  0.7724
F-statistic:  2229 on 3 and 1967 DF,  p-value: < 2.2e-16
What this means
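All 3 variables are significantly related to a vehicle’s price (each p-value is below 2e-16). Holding the other two fixed, each additional model year is associated with a price about $583 higher, each additional mile with a price about $0.06 lower, and each additional horsepower with a price about $56 higher. The R-squared is 0.77, which means about 77% of the variation in listed prices is explained by these three variables. The residual standard error is 2183 on 1967 degrees of freedom, meaning the model’s predicted price for a vehicle is typically off by roughly $2,183.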
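For anyone who wants to pull these numbers out of a model directly, here is a small sketch on simulated data. The data, the coefficients used to simulate it, and the example car are all made up for illustration; they are not the CarGurus listings or the fitted values above.

```r
# Simulated data for illustration only; these are not the CarGurus listings
set.seed(1)
n <- 200
toy <- data.frame(
  year       = sample(2010:2020, n, replace = TRUE),
  mileage    = runif(n, 5000, 150000),
  horsepower = sample(c(147, 170, 228, 288), n, replace = TRUE)
)
toy$price <- -1100000 + 560 * toy$year - 0.06 * toy$mileage +
  55 * toy$horsepower + rnorm(n, sd = 2000)

fit <- lm(price ~ year + mileage + horsepower, data = toy)

r2  <- summary(fit)$r.squared  # share of price variation explained by the model
rse <- sigma(fit)              # residual standard error, in dollars
dof <- df.residual(fit)        # residual degrees of freedom: n minus 4 coefficients

# Predicted price for one hypothetical car
new_car <- data.frame(year = 2018, mileage = 40000, horsepower = 228)
pred <- predict(fit, newdata = new_car)

c(r_squared = r2, rse = rse, df = dof, predicted = unname(pred))
```

The degrees of freedom are just the number of rows used minus the number of estimated coefficients, which is why they are a count and not a dollar amount.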
Warning in pal(c(r[1], cuts, r[2])): Some values were outside the color scale
and will be treated as NA
Conclusion
The final map shows the Volkswagen specs across the Northeastern part of the U.S. (with some listings scattered across the rest of the country). The interactive map displays key information someone would want to know, such as the model, horsepower, mileage, owner count, and year. The legend has a mini heat map, so when you click on a dot you can tell whether the car has had more than 1 owner. There weren’t many surprises, besides the fact that a vehicle with more than 150k miles is pretty rare, and even rarer if it has had fewer than 3 owners. Something I wish I had done differently is put horsepower or price on the legend and color palette, but I kept struggling and getting NAs on my scale. Looking back, I also wanted to try a tree map; I like how aesthetically pleasing they are, but I couldn’t get one to work or figure out the proper commands.
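One possible cause of NAs on a color scale is a palette whose domain does not cover all of the values. The sketch below assumes the map was built with the leaflet package (the map code is not shown, so this is a guess) and uses made-up horsepower values. It compares a palette with a too-narrow domain to one whose domain comes from the data.

```r
library(leaflet)

# Made-up horsepower values for illustration only
hp <- c(147, 170, 228, 288)

# Too-narrow domain: 288 falls outside it, so it gets the NA color (with a warning)
pal_narrow <- colorNumeric("YlOrRd", domain = c(100, 250), na.color = "#808080")
cols_narrow <- suppressWarnings(pal_narrow(hp))

# Domain taken from the data itself: every value maps to a real color
pal_full <- colorNumeric("YlOrRd", domain = range(hp, na.rm = TRUE), na.color = "#808080")
cols_full <- pal_full(hp)

cols_full
```

If this is the cause, passing the same palette function and the same column to both the markers and addLegend() should keep the legend and the dots on one shared scale.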
Class notes: correlation scatter plots and regression
AI uses: from chunk 55 to 105, I had the AI list all of the variables so I could just delete from that list the ones I needed to keep, instead of spending a long time typing each variable individually (kind of lazy, I know).