Pictures are taken from Google.

1. Intro

1.1. Greetings

Hi Everyone :)

Welcome to my Rmd.

This HTML document contains several visualizations of a used cars dataset, together with my analysis.

I hope you enjoy it!

1.2. Brief

This dataset is a stacked version of the 100,000 UK Used Car Data set available on Kaggle. It combines the used car information of 7 brands: Audi, BMW, Skoda, Ford, Volkswagen, Toyota and Hyundai.

Data Source: https://www.kaggle.com/datasets/aishwaryamuthukumar/cars-dataset-audi-bmw-ford-hyundai-skoda-vw.

The first thing to do is load all the packages that might be needed to visualize the dataset.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(scales)
library(tidyr)
library(colorspace)

2. Data Exploration

2.1. Data Input & Structure

Place the dataset in the ‘data_input’ folder of the R project, then read it into the ‘cars’ object.

cars <- read.csv(file="data_input/cars_mobil.csv")

Then we inspect the data.

dim(cars)
## [1] 72435    10
head(cars,10)
##    model year price transmission mileage fuelType tax  mpg engineSize Make
## 1     A1 2017 12500       Manual   15735   Petrol 150 55.4        1.4 audi
## 2     A6 2016 16500    Automatic   36203   Diesel  20 64.2        2.0 audi
## 3     A1 2016 11000       Manual   29946   Petrol  30 55.4        1.4 audi
## 4     A4 2017 16800    Automatic   25952   Diesel 145 67.3        2.0 audi
## 5     A3 2019 17300       Manual    1998   Petrol 145 49.6        1.0 audi
## 6     A1 2016 13900    Automatic   32260   Petrol  30 58.9        1.4 audi
## 7     A6 2016 13250    Automatic   76788   Diesel  30 61.4        2.0 audi
## 8     A4 2016 11750       Manual   75185   Diesel  20 70.6        2.0 audi
## 9     A3 2015 10200       Manual   46112   Petrol  20 60.1        1.4 audi
## 10    A1 2016 12000       Manual   22451   Petrol  30 55.4        1.4 audi
tail(cars)
##          model year price transmission mileage fuelType tax  mpg engineSize
## 72430 Santa Fe 2019 29995    Semi-Auto    1567   Diesel 145 39.8        2.2
## 72431      I30 2016  8680       Manual   25906   Diesel   0 78.4        1.6
## 72432      I40 2015  7830       Manual   59508   Diesel  30 65.7        1.7
## 72433      I10 2017  6830       Manual   13810   Petrol  20 60.1        1.0
## 72434   Tucson 2018 13994       Manual   23313   Petrol 145 44.8        1.6
## 72435   Tucson 2016 15999    Automatic   11472   Diesel 125 57.6        1.7
##          Make
## 72430 Hyundai
## 72431 Hyundai
## 72432 Hyundai
## 72433 Hyundai
## 72434 Hyundai
## 72435 Hyundai

The inspection above gives a short description of the dataset: the used cars dataset contains 72435 rows and 10 columns.

str(cars)
## 'data.frame':    72435 obs. of  10 variables:
##  $ model       : chr  "A1" "A6" "A1" "A4" ...
##  $ year        : int  2017 2016 2016 2017 2019 2016 2016 2016 2015 2016 ...
##  $ price       : int  12500 16500 11000 16800 17300 13900 13250 11750 10200 12000 ...
##  $ transmission: chr  "Manual" "Automatic" "Manual" "Automatic" ...
##  $ mileage     : int  15735 36203 29946 25952 1998 32260 76788 75185 46112 22451 ...
##  $ fuelType    : chr  "Petrol" "Diesel" "Petrol" "Diesel" ...
##  $ tax         : int  150 20 30 145 145 30 30 20 20 30 ...
##  $ mpg         : num  55.4 64.2 55.4 67.3 49.6 58.9 61.4 70.6 60.1 55.4 ...
##  $ engineSize  : num  1.4 2 1.4 2 1 1.4 2 2 1.4 1.4 ...
##  $ Make        : chr  "audi" "audi" "audi" "audi" ...

Looking at the type of each column, some columns have an unsuitable type, so we convert them to the right one.

cars$year <- as.character(cars$year)
cars$transmission <- as.factor(cars$transmission)
cars$fuelType <- as.factor(cars$fuelType)
cars$Make <- as.factor(cars$Make)

str(cars)
## 'data.frame':    72435 obs. of  10 variables:
##  $ model       : chr  "A1" "A6" "A1" "A4" ...
##  $ year        : chr  "2017" "2016" "2016" "2017" ...
##  $ price       : int  12500 16500 11000 16800 17300 13900 13250 11750 10200 12000 ...
##  $ transmission: Factor w/ 4 levels "Automatic","Manual",..: 2 1 2 1 2 1 1 2 2 2 ...
##  $ mileage     : int  15735 36203 29946 25952 1998 32260 76788 75185 46112 22451 ...
##  $ fuelType    : Factor w/ 5 levels "Diesel","Electric",..: 5 1 5 1 5 5 1 1 5 5 ...
##  $ tax         : int  150 20 30 145 145 30 30 20 20 30 ...
##  $ mpg         : num  55.4 64.2 55.4 67.3 49.6 58.9 61.4 70.6 60.1 55.4 ...
##  $ engineSize  : num  1.4 2 1.4 2 1 1.4 2 2 1.4 1.4 ...
##  $ Make        : Factor w/ 7 levels "audi","BMW","Ford",..: 1 1 1 1 1 1 1 1 1 1 ...
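
The same kind of conversion can be written in one step for several columns. The following is only a minimal sketch on a small made-up data frame; the object ‘toy’ and its values are illustrative assumptions and are not taken from the dataset.

```r
# Toy data frame reusing some column names of 'cars' (values are made up)
toy <- data.frame(
  year = c(2017L, 2016L, 2019L),
  transmission = c("Manual", "Automatic", "Manual"),
  fuelType = c("Petrol", "Diesel", "Petrol"),
  Make = c("audi", "audi", "BMW"),
  stringsAsFactors = FALSE
)

# Convert several character columns to factors at once
factor_cols <- c("transmission", "fuelType", "Make")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Treat year as a label rather than a quantity
toy$year <- as.character(toy$year)

str(toy)
```

Keeping the column names in one vector makes it easy to add or remove a column later without repeating the assignment.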

2.2. Missing Data

 anyNA(cars)
## [1] FALSE
colSums(is.na(cars))
##        model         year        price transmission      mileage     fuelType 
##            0            0            0            0            0            0 
##          tax          mpg   engineSize         Make 
##            0            0            0            0

Great! The dataset has no missing values, which means it is complete.
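
To see what the same checks report when values are missing, here is a small sketch on made-up data (the ‘toy’ data frame is an assumption for illustration only):

```r
# Toy data with two missing values (made up, not from the dataset)
toy <- data.frame(
  price = c(12500, NA, 11000),
  mpg = c(55.4, 64.2, NA),
  tax = c(150, 20, 30)
)

has_na <- anyNA(toy)                 # TRUE if any value is missing
na_per_column <- colSums(is.na(toy)) # number of missing values per column

# Keep only the rows without any missing value
complete_rows <- toy[complete.cases(toy), ]
```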

2.3. Practical Statistics

summary(cars)
##     model               year               price           transmission  
##  Length:72435       Length:72435       Min.   :   495   Automatic:14046  
##  Class :character   Class :character   1st Qu.: 10175   Manual   :43021  
##  Mode  :character   Mode  :character   Median : 14495   Other    :    4  
##                                        Mean   : 16580   Semi-Auto:15364  
##                                        3rd Qu.: 20361                    
##                                        Max.   :145000                    
##                                                                          
##     mileage           fuelType          tax           mpg        
##  Min.   :     1   Diesel  :28918   Min.   :  0   Min.   :  0.30  
##  1st Qu.:  7202   Electric:    5   1st Qu.: 30   1st Qu.: 47.90  
##  Median : 17531   Hybrid  : 2903   Median :145   Median : 55.40  
##  Mean   : 23177   Other   :  239   Mean   :117   Mean   : 55.85  
##  3rd Qu.: 32449   Petrol  :40370   3rd Qu.:145   3rd Qu.: 62.80  
##  Max.   :323000                    Max.   :580   Max.   :470.80  
##                                                                  
##    engineSize         Make      
##  Min.   :0.000   audi   :10668  
##  1st Qu.:1.200   BMW    :10781  
##  Median :1.600   Ford   :17964  
##  Mean   :1.636   Hyundai: 4860  
##  3rd Qu.:2.000   skoda  : 6267  
##  Max.   :6.600   toyota : 6738  
##                  vw     :15157

Summary:

  1. There are 7 companies, each with many different models.

  2. The lowest price of the used cars is 495, the highest price is 145000, and the average price is 16580.

  3. The used cars range in year from 1996 to 2020.

  4. Only 4 cars have the Other transmission type.

  5. Petrol is the most common fuel type (40370 cars).

  6. The minimum mileage is 1, which means some cars have barely been used.

  7. Miles per gallon ranges from 0.30 to 470.80.

  8. The tax of the used cars also varies widely, from a minimum of 0 to a maximum of 580.

  9. Ford has the largest number of used cars (17964).
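
Because ‘year’ was converted to character, summary(cars) does not print its minimum and maximum. One possible way to check a year range and to find the most common category is sketched below on made-up data; the ‘toy’ object and its values are assumptions for illustration.

```r
# Toy data (made up); 'year' is stored as character, as in 'cars'
toy <- data.frame(
  year = c("2017", "1996", "2020", "2016"),
  fuelType = factor(c("Petrol", "Diesel", "Petrol", "Petrol"))
)

# Convert back to integer only for the purpose of computing the range
year_range <- range(as.integer(toy$year))

# Count each fuel type and pick the most frequent one
fuel_counts <- sort(table(toy$fuelType), decreasing = TRUE)
most_common_fuel <- names(fuel_counts)[1]
```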

3. Case Study

1. We will check the relationship between price and company, overlaid with the average price.

I will use the function ‘scale_y_log10’ so that the IQR is easier to interpret.

ggplot(cars, aes(Make, price)) +
   geom_boxplot(aes(fill = Make)) +
   scale_y_log10() +
   labs(title = "Price by Company", x= "Company", y= "Price", fill = "Company",
        subtitle = "red line indicate average price") +
   theme(plot.title = element_text(hjust = 0.5)) +
   geom_hline(yintercept = mean(cars$price), color = "red", linetype = 5)

Interpretations :

  1. As the boxplot above shows, Audi has the highest prices of all companies.

  2. BMW is in second place, and Hyundai and skoda are third.

  3. The average price line crosses only the Audi, BMW, and vw boxplots.

  4. The price distributions of Ford, Hyundai, skoda, and toyota lie well below the average price.
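
Note that a boxplot marks the median of each company, while the red line is the overall mean. To compare each company's mean price with the overall mean numerically, something like the following could be used. This is a sketch on made-up numbers; the ‘toy’ data frame is an assumption and does not reproduce the dataset.

```r
# Toy data (made up, not from the dataset)
toy <- data.frame(
  Make = c("audi", "audi", "Ford", "Ford", "BMW"),
  price = c(20000, 30000, 8000, 12000, 25000)
)

# Overall mean price, the value drawn as the red line
overall_mean <- mean(toy$price)

# Mean price per company
make_mean <- tapply(toy$price, toy$Make, mean)

# Companies whose mean price is above the overall mean
above_average <- names(make_mean)[make_mean > overall_mean]
```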

2. We want to find out the correlation between price and mileage: is a low price associated with low or with high usage of used cars in the UK?

Plot price against mileage using geom_point.

ggplot(cars, aes(mileage, price)) +
  geom_point(alpha = 0.9, color = "green") + 
  geom_segment(aes(x=mileage, xend=mileage, y=0, yend=price)) +
  labs(title = " Mileage Vs Price", x ="Mileage", y= "Price")+
  theme(plot.title = element_text(hjust = 0.7))

Interpretations :

The mileage (how much a used car has been driven) affects the price. The plot shows that the cars with the highest mileage do not have high prices. Buyers can take this information into consideration.
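
The relationship read from the plot can also be summarized with a correlation coefficient. A minimal sketch on made-up values follows (the ‘toy’ data frame is an assumption; the real coefficient for ‘cars’ is not computed here):

```r
# Toy data (made up): price falls as mileage rises
toy <- data.frame(
  mileage = c(1000, 15000, 30000, 60000, 120000),
  price = c(25000, 18000, 14000, 9000, 4000)
)

# Pearson correlation measures the linear relationship
r_pearson <- cor(toy$mileage, toy$price)

# Spearman correlation uses ranks, so it is less sensitive to outliers
r_spearman <- cor(toy$mileage, toy$price, method = "spearman")
```

A negative coefficient means that price tends to fall as mileage rises. The rank-based version may be the safer choice when a few extreme prices or mileages are present.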

3. How does miles per gallon vary across companies and fuel types?

I will use a violin plot, with company on the x axis and miles per gallon on the y axis.

ggplot(cars,aes(Make,mpg))+
   geom_violin(aes(fill = fuelType)) +
   labs (title = "milepergallon in fuelType", x = "Company", y = "mpg", fill ="FuelType")+
   theme(plot.title = element_text(hjust = 0.5))
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.

Interpretation :

Most mpg values lie between about 0 and 100, and the average mpg is 55.85. However, the plot above also shows some mpg values above 100, mainly for the Hybrid and Other fuel types and, less often, for Petrol and Diesel. Besides that, the Electric fuel type is almost invisible in the plot because very few used cars are electric.
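
The warning printed with the plot suggests that some company and fuel type combinations contain fewer than two cars, so no violin can be drawn for them. A possible way to find such groups is sketched here on made-up data (the ‘toy’ data frame is an assumption for illustration):

```r
# Toy data (made up, not from the dataset)
toy <- data.frame(
  Make = c("audi", "audi", "audi", "Ford", "Ford", "Ford"),
  fuelType = c("Petrol", "Petrol", "Hybrid", "Diesel", "Diesel", "Electric")
)

# Count the cars in every Make x fuelType combination
group_sizes <- as.data.frame(table(Make = toy$Make, fuelType = toy$fuelType))

# Non-empty groups that hold a single car
too_small <- group_sizes[group_sizes$Freq == 1, ]
```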

4. Show the price distribution of each transmission type for the Ford company only!

Let's focus on the “Ford” company. Create a new object named ‘ford’ that contains only the “Ford” rows.

head(cars[cars$Make == "Ford",]) 
##        model year price transmission mileage fuelType tax  mpg engineSize Make
## 21450 Fiesta 2017 12000    Automatic   15944   Petrol 150 57.7        1.0 Ford
## 21451  Focus 2018 14000       Manual    9083   Petrol 150 57.7        1.0 Ford
## 21452  Focus 2017 13000       Manual   12456   Petrol 150 57.7        1.0 Ford
## 21453 Fiesta 2019 17500       Manual   10460   Petrol 145 40.3        1.5 Ford
## 21454 Fiesta 2019 16500    Automatic    1482   Petrol 145 48.7        1.0 Ford
## 21455 Fiesta 2015 10500       Manual   35432   Petrol 145 47.9        1.6 Ford
ford <- cars[cars$Make == "Ford",]
dim(ford)
## [1] 17964    10

After that, group the prices into four ranges (<15000, 15000<=x<=50000, 50000<x<100000, x>=100000) and store the result in a new column of the ‘ford’ object named ‘priceFord’.

# Map a single price to the label of its price range
P <- function(x) {
   if (x < 15000) "Below 15000"
   else if (x <= 50000) "15000 to 50000"
   else if (x < 100000) "between 50000-100000"
   else "above 100000"
}

ford$priceFord <- as.factor(sapply(ford$price, P))
head(ford)
##        model year price transmission mileage fuelType tax  mpg engineSize Make
## 21450 Fiesta 2017 12000    Automatic   15944   Petrol 150 57.7        1.0 Ford
## 21451  Focus 2018 14000       Manual    9083   Petrol 150 57.7        1.0 Ford
## 21452  Focus 2017 13000       Manual   12456   Petrol 150 57.7        1.0 Ford
## 21453 Fiesta 2019 17500       Manual   10460   Petrol 145 40.3        1.5 Ford
## 21454 Fiesta 2019 16500    Automatic    1482   Petrol 145 48.7        1.0 Ford
## 21455 Fiesta 2015 10500       Manual   35432   Petrol 145 47.9        1.6 Ford
##            priceFord
## 21450    Below 15000
## 21451    Below 15000
## 21452    Below 15000
## 21453 15000 to 50000
## 21454 15000 to 50000
## 21455    Below 15000
ggplot(ford,aes(transmission,price)) +
   geom_jitter(aes(col = priceFord)) +
   geom_boxplot(alpha=0.7) +
   scale_y_log10()+
   labs(title = "Ford Price by Transmission", x= "Transmission", y= "Price", col = "Price Ford") +
   theme(plot.title = element_text(hjust = 0.8))

Interpretation :

  1. For the “Ford” company, Manual transmission cars are the most widely spread in price, although we still found one Ford used car in the 50000 to 100000 range, priced at 54995.

  2. Ordered by popularity, the transmissions are Manual, Automatic, Semi-Auto, and Other. Only 4 used cars in the whole dataset have the Other transmission, so it could not be shown in the plot above.
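
The price ranges used for ‘priceFord’ can also be produced without calling a function once per row. The sketch below uses cut() on a made-up vector of prices. It assumes whole-number prices (as in the ‘price’ column, which is an integer), which is why the breaks 14999 and 99999 reproduce the <15000 and <100000 limits.

```r
# Made-up prices sitting on the edges of each range
prices <- c(12000L, 15000L, 50000L, 50001L, 99999L, 100000L)

# Intervals are closed on the right:
# (-Inf, 14999], (14999, 50000], (50000, 99999], (99999, Inf]
price_band <- cut(
  prices,
  breaks = c(-Inf, 14999, 50000, 99999, Inf),
  labels = c("Below 15000", "15000 to 50000",
             "between 50000-100000", "above 100000")
)

table(price_band)
```

cut() returns a factor directly and keeps the ranges in the order of the breaks, which also gives a sensible order in a plot legend.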

5. Show the average Tax and MPG for each fuel type within the Audi company!

Create a new data frame named ‘audi_TM’ that holds the average Tax and MPG for each fuel type.

audi <- cars[cars$Make == "audi",]
head(aggregate.data.frame(list(Tax = audi$tax, MPG = audi$mpg), by = list(Fueltype = audi$fuelType), mean))
##   Fueltype       Tax       MPG
## 1   Diesel 119.65663  54.24024
## 2   Hybrid  72.67857 150.22143
## 3   Petrol 133.30634  46.39751
audi_TM <- aggregate.data.frame(list(Tax = audi$tax, MPG = audi$mpg), by = list(Fueltype = audi$fuelType), mean)
dim(audi_TM)
## [1] 3 3
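
The same kind of per-group average can be written with the formula interface of aggregate(). This is only a sketch on made-up numbers; ‘toy_audi’ and ‘toy_TM’ are hypothetical names and the values do not come from the dataset.

```r
# Toy data (made up, not from the dataset)
toy_audi <- data.frame(
  fuelType = c("Diesel", "Diesel", "Petrol", "Petrol", "Hybrid"),
  tax = c(20, 30, 150, 130, 0),
  mpg = c(64.2, 61.4, 55.4, 49.6, 150.0)
)

# Mean of tax and mpg for each fuel type
toy_TM <- aggregate(cbind(tax, mpg) ~ fuelType, data = toy_audi, FUN = mean)

# Rename the columns to match the names used in this section
names(toy_TM) <- c("Fueltype", "Tax", "MPG")
toy_TM
```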

Gather the Tax and MPG values into one column named ‘average’, with the name of each measure stored in a column named ‘variable’, using the function ‘gather’.

audi_TM <- gather(audi_TM, key = "variable", value = "average", -Fueltype)
audi_TM
##   Fueltype variable   average
## 1   Diesel      Tax 119.65663
## 2   Hybrid      Tax  72.67857
## 3   Petrol      Tax 133.30634
## 4   Diesel      MPG  54.24024
## 5   Hybrid      MPG 150.22143
## 6   Petrol      MPG  46.39751
ggplot(audi_TM,aes(Fueltype, average))+
   geom_col(aes(fill = variable), position = "dodge") +
   coord_flip()+
   labs(title = "Average Tax and MPG for each Fuel type", x= "Fuel Type", y = "Value", fill = "Variable")+
   geom_text(aes(label=comma(average)), hjust = -0.1, size = 2.1)+
   theme(plot.title = element_text(hjust = 0.2))

Interpretation :

  1. The Audi company has no used cars with the Other or Electric fuel type. It only has Petrol, Hybrid, and Diesel.

  2. Petrol has the highest average Tax but the lowest average MPG.

  3. Hybrid has the highest average MPG.

6. Which model of Ford used cars has the highest price?

test <- ford[match(unique(ford$model), ford$model), ]
test <- test[order(test$price, decreasing = T), ]
test
##                       model year price transmission mileage fuelType tax  mpg
## 21752               Mustang 2020 42489    Automatic    3500   Petrol 145 22.1
## 21458                  Kuga 2019 25500    Automatic    6894   Diesel 145 42.2
## 21456                  Puma 2019 22500       Manual    2029   Petrol 145 50.4
## 21485                Mondeo 2019 20000       Manual      24   Diesel 145 65.7
## 21759 Grand Tourneo Connect 2019 19999       Manual    3500   Diesel 145 61.4
## 21505        Tourneo Custom 2018 19995    Automatic   24568   Diesel 145 31.7
## 21529                  Edge 2016 18640       Manual   24105   Diesel 160 48.7
## 21710                Galaxy 2016 18498       Manual   30528   Diesel 125 56.5
## 21506                 S-MAX 2017 18495    Automatic   39605   Diesel 145 54.3
## 35057                Ranger 2013 14495       Manual   88000   Diesel 240 28.3
## 21451                 Focus 2018 14000       Manual    9083   Petrol 150 57.7
## 21461              EcoSport 2018 13500       Manual   12065   Petrol 145 54.3
## 21589           Grand C-MAX 2018 13495       Manual    1030   Petrol 145 47.9
## 39190       Transit Tourneo 2014 12450       Manual   19496   Diesel 235 42.2
## 21450                Fiesta 2017 12000    Automatic   15944   Petrol 150 57.7
## 21476                 C-MAX 2018 11799       Manual   23800   Diesel 145 68.9
## 21544       Tourneo Connect 2015  9295       Manual   47904   Diesel 125 56.5
## 21489                   Ka+ 2018  8261       Manual   25000   Petrol 145 57.7
## 21512                 B-MAX 2014  7498    Semi-Auto   33023   Petrol 160 44.1
## 21672                    KA 2014  4898       Manual   22609   Petrol  30 57.7
## 34916                Fusion 2010  4750    Automatic   26588   Petrol 260 37.7
## 38328                Escort 1996  3000       Manual   50000   Petrol 265 34.4
## 36141              Streetka 2005  1999       Manual   63000   Petrol 270 35.3
##       engineSize Make      priceFord
## 21752        5.0 Ford 15000 to 50000
## 21458        2.0 Ford 15000 to 50000
## 21456        1.0 Ford 15000 to 50000
## 21485        2.0 Ford 15000 to 50000
## 21759        1.5 Ford 15000 to 50000
## 21505        2.0 Ford 15000 to 50000
## 21529        2.0 Ford 15000 to 50000
## 21710        2.0 Ford 15000 to 50000
## 21506        2.0 Ford 15000 to 50000
## 35057        3.2 Ford    Below 15000
## 21451        1.0 Ford    Below 15000
## 21461        1.0 Ford    Below 15000
## 21589        1.0 Ford    Below 15000
## 39190        2.2 Ford    Below 15000
## 21450        1.0 Ford    Below 15000
## 21476        1.5 Ford    Below 15000
## 21544        1.6 Ford    Below 15000
## 21489        1.2 Ford    Below 15000
## 21512        1.6 Ford    Below 15000
## 21672        1.2 Ford    Below 15000
## 34916        1.6 Ford    Below 15000
## 38328        1.8 Ford    Below 15000
## 36141        1.6 Ford    Below 15000
ggplot(test, aes(reorder(model, engineSize), engineSize)) +
   geom_col(fill = "maroon") +
   facet_grid(rows = vars(Make), scales = "free_y") +
   geom_point(aes(col = price)) +
   geom_text(aes(label = comma(engineSize)), hjust = -0.2, size = 2) +
   labs(x = "Model", y = "EngineSize") +
   coord_flip()
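
Note that match(unique(ford$model), ford$model) keeps the first listed row of each model, so the prices in the table above are not necessarily the highest price of each model. One way to get the maximum price per model is sketched below on made-up rows (‘toy_ford’ and its values are assumptions for illustration):

```r
# Toy data (made up): several listings per model
toy_ford <- data.frame(
  model = c("Fiesta", "Focus", "Fiesta", "Mustang", "Focus"),
  price = c(12000, 14000, 17500, 42489, 13000)
)

# Highest price found for each model
max_price <- aggregate(price ~ model, data = toy_ford, FUN = max)

# Sort from most to least expensive
max_price <- max_price[order(max_price$price, decreasing = TRUE), ]

# Model with the highest price overall
top_model <- as.character(max_price$model[1])
```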

4. Final Conclusion

From all the graphs above, we can draw some suggestions or assumptions:

  1. Ford used cars are affordably priced. This may be why Ford has more used cars than any other company.

  2. Many of Ford's Manual transmission used cars are priced below 15000.

  3. Audi and BMW are the most expensive used cars, because their prices sit above the overall average price.

  4. The used cars with the highest mileage also have low prices. This means that heavily used cars are valued at a lower price.