# Setup
getwd()
## [1] "C:/Users/Jerome/Documents/0000_Work_Files/0000_Montgomery_College/Data_Science_101/Data_101_Fall_2022/Homework_11_Due_21Nov22"
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(ggplot2)
library(infer)
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# Question 1 - Describe the dataset
#This dataset was downloaded from https://www.kaggle.com/datasets/akshaydattatraykhare/car-details-dataset It contains 8 variables and 4,340 records. The data items are described as follows:
#Variable Name Type Description
#name character detailed name of the car being sold
#year numeric model year of the car being sold
#selling_price numeric sales price – currency not given
#km_driven numeric number of km on the odometer when sold
#fuel character diesel, petrol, other
#seller_type character individual, dealer, other
#transmission character manual, automatic
#owner character owner who sold the car (I assume) – 1st owner, 2nd owner, other
# Question 2 - Peng's EDA Checklist
#1. Formulate your question
#2. Read in your data
#3. Check the packaging
#4. Run str()
#5. Look at the top and the bottom of your data
#6. Check your “n”s
#7. Validate with at least one external data source
#8. Try the easy solution first
#9. Challenge your solution
#10. Follow up
# Question 3 - Three research questions
# (1) what is the relationship between selling price and kilometers driven?
# (2) what is the relationship between selling price and year?
# (3) what is the relationship between owner and miles driven?
# Question 4 - Begin Peng's EDA checklist
### 1. Questions formulated as above.
### 2. Read in your data.
cars <- read_csv("CAR DETAILS FROM CAR DEKHO.csv")
## Rows: 4340 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, fuel, seller_type, transmission, owner
## dbl (3): year, selling_price, km_driven
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Peng's EDA checklist (con't.)
### 3. check the packaging
nrow(cars)
## [1] 4340
ncol(cars)
## [1] 8
# Peng's EDA checklist (con't.)
### 4. Run str()
str(cars)
## spec_tbl_df [4,340 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:4340] "Maruti 800 AC" "Maruti Wagon R LXI Minor" "Hyundai Verna 1.6 SX" "Datsun RediGO T Option" ...
## $ year : num [1:4340] 2007 2007 2012 2017 2014 ...
## $ selling_price: num [1:4340] 60000 135000 600000 250000 450000 140000 550000 240000 850000 365000 ...
## $ km_driven : num [1:4340] 70000 50000 100000 46000 141000 125000 25000 60000 25000 78000 ...
## $ fuel : chr [1:4340] "Petrol" "Petrol" "Diesel" "Petrol" ...
## $ seller_type : chr [1:4340] "Individual" "Individual" "Individual" "Individual" ...
## $ transmission : chr [1:4340] "Manual" "Manual" "Manual" "Manual" ...
## $ owner : chr [1:4340] "First Owner" "First Owner" "First Owner" "First Owner" ...
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. year = col_double(),
## .. selling_price = col_double(),
## .. km_driven = col_double(),
## .. fuel = col_character(),
## .. seller_type = col_character(),
## .. transmission = col_character(),
## .. owner = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
# Peng's EDA checklist (con't.)
### 5. Top and bottom of data
head(cars)
## # A tibble: 6 × 8
## name year selling_p…¹ km_dr…² fuel selle…³ trans…⁴ owner
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Maruti 800 AC 2007 60000 70000 Petr… Indivi… Manual Firs…
## 2 Maruti Wagon R LXI Minor 2007 135000 50000 Petr… Indivi… Manual Firs…
## 3 Hyundai Verna 1.6 SX 2012 600000 100000 Dies… Indivi… Manual Firs…
## 4 Datsun RediGO T Option 2017 250000 46000 Petr… Indivi… Manual Firs…
## 5 Honda Amaze VX i-DTEC 2014 450000 141000 Dies… Indivi… Manual Seco…
## 6 Maruti Alto LX BSIII 2007 140000 125000 Petr… Indivi… Manual Firs…
## # … with abbreviated variable names ¹selling_price, ²km_driven, ³seller_type,
## # ⁴transmission
tail(cars)
## # A tibble: 6 × 8
## name year selli…¹ km_dr…² fuel selle…³ trans…⁴ owner
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Toyota Innova 2.5 VX (Diese… 2012 600000 170000 Dies… Indivi… Manual Firs…
## 2 Hyundai i20 Magna 1.4 CRDi … 2014 409999 80000 Dies… Indivi… Manual Seco…
## 3 Hyundai i20 Magna 1.4 CRDi 2014 409999 80000 Dies… Indivi… Manual Seco…
## 4 Maruti 800 AC BSIII 2009 110000 83000 Petr… Indivi… Manual Seco…
## 5 Hyundai Creta 1.6 CRDi SX O… 2016 865000 90000 Dies… Indivi… Manual Firs…
## 6 Renault KWID RXT 2016 225000 40000 Petr… Indivi… Manual Firs…
## # … with abbreviated variable names ¹selling_price, ²km_driven, ³seller_type,
## # ⁴transmission
# Peng's EDA checklist (con't.)
### 6 Check the n's
select(cars, seller_type) %>% unique %>% nrow
## [1] 3
select(cars, fuel) %>% unique %>% nrow
## [1] 5
select(cars, transmission) %>% unique %>% nrow
## [1] 2
table(cars$year)
##
## 1992 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## 1 1 2 3 12 10 12 20 21 23 42 85 110 134 145 193
## 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
## 234 271 415 386 367 421 357 466 366 195 48
table(cars$fuel)
##
## CNG Diesel Electric LPG Petrol
## 40 2153 1 23 2123
# Peng's EDA checklist (con't.)
### 7 Validate w/ at least 1 external data source
### I can only validate these data w/ my own knowledge (however limited) of automobiles. The datashown in EDA step 6 appear to be OK.
# Peng's EDA checklist (con't.)
### 8. Try the easy solution 1st
mean(cars$selling_price)
## [1] 504127.3
median(cars$km_driven)
## [1] 60000
unique(cars$year)
## [1] 2007 2012 2017 2014 2016 2015 2018 2019 2013 2011 2010 2009 2006 1996 2005
## [16] 2008 2004 1998 2003 2002 2020 2000 1999 2001 1995 1997 1992
# Peng's EDA checklist (con't.)
### 9. Challenge your solution. This is where I will do the analysis to answer the research questions.
### Question 1: Relationship between selling price and KM driven.
cor(cars$selling_price, cars$km_driven)
## [1] -0.1922886
options(scipen = 999)
plot(cars$km_driven, cars$selling_price,main = "Selling Price as a Function of KM Driven",
xlab = "KM Driven", ylab = "Selling Price")
abline(lm(cars$selling_price ~ cars$km_driven, ), col = "red")

### Question 2: Relationship between selling price and year.
cor(cars$selling_price, cars$year)
## [1] 0.4139217
options(scipen = 999)
plot(cars$year, cars$selling_price,main = "Selling Price as a Function of Age of Car",
xlab = "Year", ylab = "Selling Price")
abline(lm(cars$selling_price ~ cars$year, ), col = "red")

### A question that arose after looking at the 2 plots.
cor(cars$km_driven, cars$year)
## [1] -0.4196881
options(scipen = 999)
plot(cars$year, cars$km_driven,main = "Age of Car vs. KM Driven",
xlab = "Year", ylab = "KM Driven")
abline(lm(cars$km_driven ~ cars$year, ), col = "red")

# The final research questions: what is the relationship between owner and selling price and owner and miles driven?
cars_anova <- aov(selling_price ~ owner, data = cars)
aov(selling_price ~ owner, data = cars)
## Call:
## aov(formula = selling_price ~ owner, data = cars)
##
## Terms:
## owner Residuals
## Sum of Squares 82709846799329 1369634332542086
## Deg. of Freedom 4 4335
##
## Residual standard error: 562092.5
## Estimated effects may be unbalanced
summary(cars_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## owner 4 82709846799329 20677461699832 65.45 <0.0000000000000002
## Residuals 4335 1369634332542086 315947942916
##
## owner ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(cars_anova, conf.level=.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = selling_price ~ owner, data = cars)
##
## $owner
## diff lwr upr p adj
## Fourth & Above Owner-First Owner -424735.77 -597590.110 -251881.4 0.0000000
## Second Owner-First Owner -254745.88 -309135.103 -200356.7 0.0000000
## Test Drive Car-First Owner 355656.97 -17485.183 728799.1 0.0703113
## Third Owner-First Owner -329162.97 -421740.112 -236585.8 0.0000000
## Second Owner-Fourth & Above Owner 169989.89 -6575.045 346554.8 0.0656637
## Test Drive Car-Fourth & Above Owner 780392.74 371183.681 1189601.8 0.0000021
## Third Owner-Fourth & Above Owner 95572.81 -96227.917 287373.5 0.6535341
## Test Drive Car-Second Owner 610402.85 235527.377 985278.3 0.0000887
## Third Owner-Second Owner -74417.09 -173750.274 24916.1 0.2449879
## Third Owner-Test Drive Car -684819.94 -1067107.739 -302532.1 0.0000105
#Create a numerical owner code for correlations
cars_own_n <- mutate (cars, owner = recode(owner, "First Owner" = 1, "Second Owner" = 2, "Third Owner" = 3, "Fourth & Above Owner" = 4, "Test Drive Car" = 5))
table(cars_own_n$owner)
##
## 1 2 3 4 5
## 2832 1106 304 81 17
table(cars$owner)
##
## First Owner Fourth & Above Owner Second Owner
## 2832 81 1106
## Test Drive Car Third Owner
## 17 304
mean(cars_own_n$owner)
## [1] 1.46659
#Calculate descriptive statistics for miles_driven by owner and correlations between owner and KM driven
cat_mode <- function(cat_var){
mode_idx <- which.max(table(cat_var))
levels(cat_var)[mode_idx]
}
cars$owner <- as.factor(cars$owner)
cars_grp <- group_by(cars, owner)
summarise(cars_grp, mode_owner = cat_mode(km_driven),
km_med = median(km_driven),
km_mean = mean(km_driven))
## # A tibble: 5 × 3
## owner km_med km_mean
## <fct> <dbl> <dbl>
## 1 First Owner 50000 56015.
## 2 Fourth & Above Owner 90000 99138.
## 3 Second Owner 80000 81784.
## 4 Test Drive Car 1010 4155
## 5 Third Owner 90000 99305.
cor(cars_own_n$owner, cars_own_n$km_driven)
## [1] 0.2744086
# Peng's EDA checklist (con't.)
### 10. Follow-up questions.
### I do have questions. What is the currency of the selling price? From what countries were the data gathered? Given the high use of diesel fuel and manual transmissions (see tables below), I assume the data are from European countries, but we don't know. So yes, other data are needed to fully explain and understand the data in this dataset.
table(cars$fuel)
##
## CNG Diesel Electric LPG Petrol
## 40 2153 1 23 2123
table(cars$transmission)
##
## Automatic Manual
## 448 3892
# Question 5. What did we learn from the data?
###The first research question is “what is the relationship between selling price and kilometers driven?” This question was answered by running a correlation between the selling price and the kilometers driven and by constructing a scatterplot. The correlation, as expected, showed a negative correlation (-0.19). As the cars were driven more kilometers, the sales price decreased. However, the correlation was much weaker than I expected, suggesting these cars hold their values, even when driven many kilometers.
###The second research question is “what is the relationship between selling price and year?” The same two techniques were used, correlation and scatterplot. The result was positive and stronger, again as would be expected (0.41). These two results led to an unexpected question, and that is “what is the relationship between kilometers driven and the age of the car?” The result, as shown on the scatterplot, was surprising. The older cars had lower mileage; the kilometers driven peaked in the center of the distribution of years, as shown on the scatterplot. I assumed older cars would have higher mileage, but this is not the case in this sample.
###The final research questions dealt with the relationship between the number of prior owners, the selling price and the kilometers driven. To answer the first question, I ran an ANOVA with a Tukey HSD post-hoc test. The ANOVA was statistically significant, but the real answers were found in the post-hoc test results. Stated simply, the test-driven cars were priced essentially the same as cars sold by first owners. This is to be expected; the test-driven cars had very low mileage and were essentially brand new cars. There was also no statistically significant difference between the selling prices of 2nd owner and 3rd owner cars, 2nd and 4th and above owner cars, and 3rd and 4th and above owner cars. It appears that after a car has passed from the first owner to a second owner, subsequent resale value is essentially the same. The correlation between owner rank (1st, 2nd, 3rd, etc.) and miles driven is weak (0.27). Fourth owner and above cars actually had fewer mean kilometers than 3rd owner vehicles. This again suggests that once a car has been sold more than once, differences in both mileage and value diminish.