# Setup
getwd()
## [1] "C:/Users/Jerome/Documents/0000_Work_Files/0000_Montgomery_College/Data_Science_101/Data_101_Fall_2022/Homework_11_Due_21Nov22"
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(ggplot2)
library(infer)
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
# Question 1 - Describe the dataset

#This dataset was downloaded from https://www.kaggle.com/datasets/akshaydattatraykhare/car-details-dataset  It contains 8 variables and 4,340 records. The data items are described as follows:

#Variable Name  Type              Description
#name             character         detailed name of the car being sold
#year             numeric           model year of the car being sold
#selling_price  numeric         sales price – currency not given
#km_driven      numeric         number of km on the odometer when sold
#fuel             character       diesel, petrol, other
#seller_type      character       individual, dealer, other
#transmission     character       manual, automatic
#owner          character         owner who sold the car (I assume) – 1st owner, 2nd owner, other
# Question 2 - Peng's EDA Checklist
#1. Formulate your question
#2. Read in your data
#3. Check the packaging
#4. Run str()
#5. Look at the top and the bottom of your data
#6. Check your “n”s
#7. Validate with at least one external data source
#8. Try the easy solution first
#9. Challenge your solution
#10. Follow up
# Question 3 - Three research questions
# (1) what is the relationship between selling price and kilometers driven?
# (2) what is the relationship between selling price and year?
# (3) what is the relationship between   owner and miles driven? 
#  Question 4 - Begin Peng's EDA checklist
### 1. Questions formulated as above.
### 2. Read in your data.
cars <- read_csv("CAR DETAILS FROM CAR DEKHO.csv")
## Rows: 4340 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, fuel, seller_type, transmission, owner
## dbl (3): year, selling_price, km_driven
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Peng's EDA checklist (con't.)
### 3. check the packaging
nrow(cars)
## [1] 4340
ncol(cars)
## [1] 8
# Peng's EDA checklist (con't.)
### 4. Run str()
str(cars)
## spec_tbl_df [4,340 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name         : chr [1:4340] "Maruti 800 AC" "Maruti Wagon R LXI Minor" "Hyundai Verna 1.6 SX" "Datsun RediGO T Option" ...
##  $ year         : num [1:4340] 2007 2007 2012 2017 2014 ...
##  $ selling_price: num [1:4340] 60000 135000 600000 250000 450000 140000 550000 240000 850000 365000 ...
##  $ km_driven    : num [1:4340] 70000 50000 100000 46000 141000 125000 25000 60000 25000 78000 ...
##  $ fuel         : chr [1:4340] "Petrol" "Petrol" "Diesel" "Petrol" ...
##  $ seller_type  : chr [1:4340] "Individual" "Individual" "Individual" "Individual" ...
##  $ transmission : chr [1:4340] "Manual" "Manual" "Manual" "Manual" ...
##  $ owner        : chr [1:4340] "First Owner" "First Owner" "First Owner" "First Owner" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   year = col_double(),
##   ..   selling_price = col_double(),
##   ..   km_driven = col_double(),
##   ..   fuel = col_character(),
##   ..   seller_type = col_character(),
##   ..   transmission = col_character(),
##   ..   owner = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Peng's EDA checklist (con't.)
### 5. Top and bottom of data
head(cars)
## # A tibble: 6 × 8
##   name                      year selling_p…¹ km_dr…² fuel  selle…³ trans…⁴ owner
##   <chr>                    <dbl>       <dbl>   <dbl> <chr> <chr>   <chr>   <chr>
## 1 Maruti 800 AC             2007       60000   70000 Petr… Indivi… Manual  Firs…
## 2 Maruti Wagon R LXI Minor  2007      135000   50000 Petr… Indivi… Manual  Firs…
## 3 Hyundai Verna 1.6 SX      2012      600000  100000 Dies… Indivi… Manual  Firs…
## 4 Datsun RediGO T Option    2017      250000   46000 Petr… Indivi… Manual  Firs…
## 5 Honda Amaze VX i-DTEC     2014      450000  141000 Dies… Indivi… Manual  Seco…
## 6 Maruti Alto LX BSIII      2007      140000  125000 Petr… Indivi… Manual  Firs…
## # … with abbreviated variable names ¹​selling_price, ²​km_driven, ³​seller_type,
## #   ⁴​transmission
tail(cars)
## # A tibble: 6 × 8
##   name                          year selli…¹ km_dr…² fuel  selle…³ trans…⁴ owner
##   <chr>                        <dbl>   <dbl>   <dbl> <chr> <chr>   <chr>   <chr>
## 1 Toyota Innova 2.5 VX (Diese…  2012  600000  170000 Dies… Indivi… Manual  Firs…
## 2 Hyundai i20 Magna 1.4 CRDi …  2014  409999   80000 Dies… Indivi… Manual  Seco…
## 3 Hyundai i20 Magna 1.4 CRDi    2014  409999   80000 Dies… Indivi… Manual  Seco…
## 4 Maruti 800 AC BSIII           2009  110000   83000 Petr… Indivi… Manual  Seco…
## 5 Hyundai Creta 1.6 CRDi SX O…  2016  865000   90000 Dies… Indivi… Manual  Firs…
## 6 Renault KWID RXT              2016  225000   40000 Petr… Indivi… Manual  Firs…
## # … with abbreviated variable names ¹​selling_price, ²​km_driven, ³​seller_type,
## #   ⁴​transmission
# Peng's EDA checklist (con't.)
### 6 Check the n's
 select(cars, seller_type) %>% unique %>% nrow
## [1] 3
 select(cars, fuel) %>% unique %>% nrow
## [1] 5
select(cars, transmission) %>% unique %>% nrow
## [1] 2
table(cars$year)
## 
## 1992 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 
##    1    1    2    3   12   10   12   20   21   23   42   85  110  134  145  193 
## 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 
##  234  271  415  386  367  421  357  466  366  195   48
table(cars$fuel)
## 
##      CNG   Diesel Electric      LPG   Petrol 
##       40     2153        1       23     2123
# Peng's EDA checklist (con't.)
### 7 Validate w/ at least 1 external data source
### I can only validate these data w/ my own knowledge (however limited) of automobiles. The datashown in EDA step 6 appear to be OK. 
# Peng's EDA checklist (con't.)
### 8. Try the easy solution 1st
mean(cars$selling_price)
## [1] 504127.3
median(cars$km_driven)
## [1] 60000
unique(cars$year)
##  [1] 2007 2012 2017 2014 2016 2015 2018 2019 2013 2011 2010 2009 2006 1996 2005
## [16] 2008 2004 1998 2003 2002 2020 2000 1999 2001 1995 1997 1992
# Peng's EDA checklist (con't.)
### 9. Challenge your solution. This is where I will do the analysis to answer the research questions. 

### Question 1: Relationship between selling price and KM driven.
cor(cars$selling_price, cars$km_driven)
## [1] -0.1922886
options(scipen = 999)
plot(cars$km_driven, cars$selling_price,main = "Selling Price as a Function of KM Driven",
     xlab = "KM Driven", ylab = "Selling Price")
abline(lm(cars$selling_price ~ cars$km_driven, ), col = "red")

### Question 2: Relationship between selling price and year.

cor(cars$selling_price, cars$year)
## [1] 0.4139217
options(scipen = 999)
plot(cars$year, cars$selling_price,main = "Selling Price as a Function of Age of Car",
     xlab = "Year", ylab = "Selling Price")
abline(lm(cars$selling_price ~ cars$year, ), col = "red")

### A question that arose after looking at the 2 plots. 
cor(cars$km_driven, cars$year)
## [1] -0.4196881
options(scipen = 999)
plot(cars$year, cars$km_driven,main = "Age of Car vs. KM Driven",
     xlab = "Year", ylab = "KM Driven")
abline(lm(cars$km_driven ~ cars$year, ), col = "red")

# The final research questions:  what is the relationship between owner and selling price and owner and miles driven? 
cars_anova <- aov(selling_price ~ owner, data = cars)
aov(selling_price ~ owner, data = cars)
## Call:
##    aov(formula = selling_price ~ owner, data = cars)
## 
## Terms:
##                            owner        Residuals
## Sum of Squares    82709846799329 1369634332542086
## Deg. of Freedom                4             4335
## 
## Residual standard error: 562092.5
## Estimated effects may be unbalanced
summary(cars_anova)
##               Df           Sum Sq        Mean Sq F value              Pr(>F)
## owner          4   82709846799329 20677461699832   65.45 <0.0000000000000002
## Residuals   4335 1369634332542086   315947942916                            
##                
## owner       ***
## Residuals      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(cars_anova, conf.level=.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = selling_price ~ owner, data = cars)
## 
## $owner
##                                           diff          lwr       upr     p adj
## Fourth & Above Owner-First Owner    -424735.77  -597590.110 -251881.4 0.0000000
## Second Owner-First Owner            -254745.88  -309135.103 -200356.7 0.0000000
## Test Drive Car-First Owner           355656.97   -17485.183  728799.1 0.0703113
## Third Owner-First Owner             -329162.97  -421740.112 -236585.8 0.0000000
## Second Owner-Fourth & Above Owner    169989.89    -6575.045  346554.8 0.0656637
## Test Drive Car-Fourth & Above Owner  780392.74   371183.681 1189601.8 0.0000021
## Third Owner-Fourth & Above Owner      95572.81   -96227.917  287373.5 0.6535341
## Test Drive Car-Second Owner          610402.85   235527.377  985278.3 0.0000887
## Third Owner-Second Owner             -74417.09  -173750.274   24916.1 0.2449879
## Third Owner-Test Drive Car          -684819.94 -1067107.739 -302532.1 0.0000105
 #Create a numerical owner code for correlations
cars_own_n <- mutate (cars, owner = recode(owner, "First Owner" = 1, "Second Owner" = 2, "Third Owner" = 3, "Fourth & Above Owner" = 4, "Test Drive Car" = 5))
table(cars_own_n$owner)
## 
##    1    2    3    4    5 
## 2832 1106  304   81   17
table(cars$owner)
## 
##          First Owner Fourth & Above Owner         Second Owner 
##                 2832                   81                 1106 
##       Test Drive Car          Third Owner 
##                   17                  304
mean(cars_own_n$owner)
## [1] 1.46659
#Calculate descriptive statistics for miles_driven by owner and correlations between owner and KM driven

cat_mode <- function(cat_var){
mode_idx <- which.max(table(cat_var))
levels(cat_var)[mode_idx]
}


cars$owner <- as.factor(cars$owner)
cars_grp <- group_by(cars, owner)
summarise(cars_grp, mode_owner = cat_mode(km_driven),
km_med = median(km_driven),
km_mean = mean(km_driven))
## # A tibble: 5 × 3
##   owner                km_med km_mean
##   <fct>                 <dbl>   <dbl>
## 1 First Owner           50000  56015.
## 2 Fourth & Above Owner  90000  99138.
## 3 Second Owner          80000  81784.
## 4 Test Drive Car         1010   4155 
## 5 Third Owner           90000  99305.
cor(cars_own_n$owner, cars_own_n$km_driven)
## [1] 0.2744086
# Peng's EDA checklist (con't.)
### 10. Follow-up questions.
### I do have questions. What is the currency of the selling price? From what countries were the data gathered? Given the high use of diesel fuel and manual transmissions (see tables below), I assume the data are from European countries, but we don't know. So yes, other data are needed to fully explain and understand the data in this dataset. 

table(cars$fuel)
## 
##      CNG   Diesel Electric      LPG   Petrol 
##       40     2153        1       23     2123
table(cars$transmission)
## 
## Automatic    Manual 
##       448      3892
# Question 5. What did we learn from the data?
###The first research question is “what is the relationship between selling price and kilometers driven?” This question was answered by running a correlation between the selling price and the kilometers driven and by constructing a scatterplot. The correlation, as expected, showed a negative correlation (-0.19). As the cars were driven more kilometers, the sales price decreased. However, the correlation was much weaker than I expected, suggesting these cars hold their values, even when driven many kilometers. 

###The second research question is “what is the relationship between selling price and year?” The same two techniques were used, correlation and scatterplot. The result was positive and stronger, again as would be expected (0.41). These two results led to an unexpected question, and that is “what is the relationship between kilometers driven and the age of the car?” The result, as shown on the scatterplot, was surprising. The older cars had lower mileage; the kilometers driven peaked in the center of the distribution of years, as shown on the scatterplot. I assumed older cars would have higher mileage, but this is not the case in this sample. 

###The final research questions dealt with the relationship between the number of prior owners, the selling price and the kilometers driven. To answer the first question, I ran an ANOVA with a Tukey HSD post-hoc test. The ANOVA was statistically significant, but the real answers were found in the post-hoc test results. Stated simply, the test-driven cars were priced essentially the same as cars sold by first owners. This is to be expected; the test-driven cars had very low mileage and were essentially brand new cars. There was also no statistically significant difference between the selling prices of 2nd owner and 3rd owner cars, 2nd and 4th and above owner cars, and 3rd and 4th and above owner cars. It appears that after a car has passed from the first owner to a second owner, subsequent resale value is essentially the same. The correlation between owner rank (1st, 2nd, 3rd, etc.) and miles driven is weak (0.27). Fourth owner and above cars actually had fewer mean kilometers than 3rd owner vehicles. This again suggests that once a car has been sold more than once, differences in both mileage and value diminish.