Pictures are taken from Google.
Hi Everyone :)
Welcome to my Rmd.
This is my HTML document, which contains some visualizations of a used cars dataset together with my analysis.
I hope you enjoy it!
This dataset is the stacked version of the 100,000 UK Used Car Data set available on Kaggle. It combines the used car information of 7 brands, namely Audi, BMW, Skoda, Ford, Volkswagen, Toyota and Hyundai.
Data Source: https://www.kaggle.com/datasets/aishwaryamuthukumar/cars-dataset-audi-bmw-ford-hyundai-skoda-vw.
The first thing I should do is load all the packages that might be needed to visualize the dataset.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(scales)
library(tidyr)
library(colorspace)
We should put the dataset in the same folder as our R project data, and then read it into the ‘cars’ object.
cars <- read.csv(file="data_input/cars_mobil.csv")
Then we inspect the data.
dim(cars)
## [1] 72435 10
head(cars,10)
## model year price transmission mileage fuelType tax mpg engineSize Make
## 1 A1 2017 12500 Manual 15735 Petrol 150 55.4 1.4 audi
## 2 A6 2016 16500 Automatic 36203 Diesel 20 64.2 2.0 audi
## 3 A1 2016 11000 Manual 29946 Petrol 30 55.4 1.4 audi
## 4 A4 2017 16800 Automatic 25952 Diesel 145 67.3 2.0 audi
## 5 A3 2019 17300 Manual 1998 Petrol 145 49.6 1.0 audi
## 6 A1 2016 13900 Automatic 32260 Petrol 30 58.9 1.4 audi
## 7 A6 2016 13250 Automatic 76788 Diesel 30 61.4 2.0 audi
## 8 A4 2016 11750 Manual 75185 Diesel 20 70.6 2.0 audi
## 9 A3 2015 10200 Manual 46112 Petrol 20 60.1 1.4 audi
## 10 A1 2016 12000 Manual 22451 Petrol 30 55.4 1.4 audi
tail(cars)
## model year price transmission mileage fuelType tax mpg engineSize
## 72430 Santa Fe 2019 29995 Semi-Auto 1567 Diesel 145 39.8 2.2
## 72431 I30 2016 8680 Manual 25906 Diesel 0 78.4 1.6
## 72432 I40 2015 7830 Manual 59508 Diesel 30 65.7 1.7
## 72433 I10 2017 6830 Manual 13810 Petrol 20 60.1 1.0
## 72434 Tucson 2018 13994 Manual 23313 Petrol 145 44.8 1.6
## 72435 Tucson 2016 15999 Automatic 11472 Diesel 125 57.6 1.7
## Make
## 72430 Hyundai
## 72431 Hyundai
## 72432 Hyundai
## 72433 Hyundai
## 72434 Hyundai
## 72435 Hyundai
From the inspection above, we get a short description of the dataset: the used cars dataset contains 72435 rows and 10 columns.
str(cars)
## 'data.frame': 72435 obs. of 10 variables:
## $ model : chr "A1" "A6" "A1" "A4" ...
## $ year : int 2017 2016 2016 2017 2019 2016 2016 2016 2015 2016 ...
## $ price : int 12500 16500 11000 16800 17300 13900 13250 11750 10200 12000 ...
## $ transmission: chr "Manual" "Automatic" "Manual" "Automatic" ...
## $ mileage : int 15735 36203 29946 25952 1998 32260 76788 75185 46112 22451 ...
## $ fuelType : chr "Petrol" "Diesel" "Petrol" "Diesel" ...
## $ tax : int 150 20 30 145 145 30 30 20 20 30 ...
## $ mpg : num 55.4 64.2 55.4 67.3 49.6 58.9 61.4 70.6 60.1 55.4 ...
## $ engineSize : num 1.4 2 1.4 2 1 1.4 2 2 1.4 1.4 ...
## $ Make : chr "audi" "audi" "audi" "audi" ...
Looking at the type of each column, we can see that some column types are not appropriate, so we convert them to the right type.
cars$year <- as.character(cars$year)
cars$transmission <- as.factor(cars$transmission)
cars$fuelType <- as.factor(cars$fuelType)
cars$Make <- as.factor(cars$Make)
str(cars)
## 'data.frame': 72435 obs. of 10 variables:
## $ model : chr "A1" "A6" "A1" "A4" ...
## $ year : chr "2017" "2016" "2016" "2017" ...
## $ price : int 12500 16500 11000 16800 17300 13900 13250 11750 10200 12000 ...
## $ transmission: Factor w/ 4 levels "Automatic","Manual",..: 2 1 2 1 2 1 1 2 2 2 ...
## $ mileage : int 15735 36203 29946 25952 1998 32260 76788 75185 46112 22451 ...
## $ fuelType : Factor w/ 5 levels "Diesel","Electric",..: 5 1 5 1 5 5 1 1 5 5 ...
## $ tax : int 150 20 30 145 145 30 30 20 20 30 ...
## $ mpg : num 55.4 64.2 55.4 67.3 49.6 58.9 61.4 70.6 60.1 55.4 ...
## $ engineSize : num 1.4 2 1.4 2 1 1.4 2 2 1.4 1.4 ...
## $ Make : Factor w/ 7 levels "audi","BMW","Ford",..: 1 1 1 1 1 1 1 1 1 1 ...
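The conversions above can also be written as one step over a vector of column names. This is only a minimal sketch on a made-up data frame (`toy` and `factor_cols` are names introduced here for illustration, not part of the original analysis):

```r
# Toy stand-in for the `cars` data frame (values are made up for illustration)
toy <- data.frame(
  year = c(2017L, 2016L, 2019L),
  transmission = c("Manual", "Automatic", "Manual"),
  fuelType = c("Petrol", "Diesel", "Petrol"),
  Make = c("audi", "audi", "BMW"),
  stringsAsFactors = FALSE
)

# Columns that hold categories rather than free text
factor_cols <- c("transmission", "fuelType", "Make")

# Convert all of them to factors in one step
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Year is treated as a label, as in the conversion above
toy$year <- as.character(toy$year)

str(toy)
```

Listing the columns once makes it easier to add or remove a categorical column later.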
anyNA(cars)
## [1] FALSE
colSums(is.na(cars))
## model year price transmission mileage fuelType
## 0 0 0 0 0 0
## tax mpg engineSize Make
## 0 0 0 0
Great! The dataset has no missing values, which means the dataset is complete.
summary(cars)
## model year price transmission
## Length:72435 Length:72435 Min. : 495 Automatic:14046
## Class :character Class :character 1st Qu.: 10175 Manual :43021
## Mode :character Mode :character Median : 14495 Other : 4
## Mean : 16580 Semi-Auto:15364
## 3rd Qu.: 20361
## Max. :145000
##
## mileage fuelType tax mpg
## Min. : 1 Diesel :28918 Min. : 0 Min. : 0.30
## 1st Qu.: 7202 Electric: 5 1st Qu.: 30 1st Qu.: 47.90
## Median : 17531 Hybrid : 2903 Median :145 Median : 55.40
## Mean : 23177 Other : 239 Mean :117 Mean : 55.85
## 3rd Qu.: 32449 Petrol :40370 3rd Qu.:145 3rd Qu.: 62.80
## Max. :323000 Max. :580 Max. :470.80
##
## engineSize Make
## Min. :0.000 audi :10668
## 1st Qu.:1.200 BMW :10781
## Median :1.600 Ford :17964
## Mean :1.636 Hyundai: 4860
## 3rd Qu.:2.000 skoda : 6267
## Max. :6.600 toyota : 6738
## vw :15157
Summary:
There were 7 companies with a wide variety of models.
The lowest price of the used cars was 495, the highest price was 145000, and the average price was 16580.
The used cars ranged in year from 1996 to 2020.
There were only 4 cars with the Other transmission type.
Petrol was the most common fuel type.
The minimum mileage was 1, which means some cars had barely been used.
Miles per gallon ranged from 0.30 to 470.80.
The tax of the used cars also varied widely, from a minimum of 0 to a maximum of 580.
Ford had the highest number of used cars.
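Because `year` was converted to character, `summary()` no longer reports its minimum and maximum. One way to check the year range is to convert the values back to integers first. This is a small sketch on made-up values (`toy_year` is a hypothetical vector, not the real column):

```r
# Made-up years stored as text, like cars$year after the conversion
toy_year <- c("2017", "2016", "1996", "2020")

# Convert back to integers, then take the smallest and largest value
year_range <- range(as.integer(toy_year))
year_range
```

On the real data the same idea would be `range(as.integer(cars$year))`.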
1. We will check the relationship between price and company, overlaid with the average price.
I will use the function ‘scale_y_log10’ for better interpretation of the IQR.
ggplot(cars, aes(Make, price)) +
geom_boxplot(aes(fill = Make)) +
scale_y_log10() +
labs(title = "Price by Company", x= "Company", y= "Price", fill = "Company",
subtitle = "red line indicate average price") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = mean(cars$price), color = "red", linetype = 5)
Interpretations :
As we can see from the boxplot above, the company with the highest prices overall is Audi.
BMW is in second place, and Hyundai and skoda are third.
The average price line only crosses the boxes of the Audi, BMW, and vw companies.
The price distributions of Ford, Hyundai, skoda, and toyota are far below the average price.
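Note that the red line is the arithmetic mean of all prices, while the centre line of each box is that company's median. To check numerically which companies sit above the line, something like the following could be used. It is a sketch on made-up prices (`toy`, `overall_mean`, `median_by_make`, and `above_line` are names assumed for this example):

```r
# Made-up prices for three companies (illustration only)
toy <- data.frame(
  Make  = c("audi", "audi", "Ford", "Ford", "BMW", "BMW"),
  price = c(20000, 30000, 8000, 12000, 25000, 19000)
)

# The value the red line would be drawn at
overall_mean <- mean(toy$price)

# The median price per company (the centre line of each box)
median_by_make <- tapply(toy$price, toy$Make, median)

# Companies whose median lies above the overall mean
above_line <- names(median_by_make)[median_by_make > overall_mean]
above_line
```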
2. We want to find out the correlation between price and mileage. Do the cheaper used cars in the UK tend to have low or high mileage?
Plot price against mileage using geom_point.
ggplot(cars, aes(mileage, price)) +
geom_point(alpha = 0.9, color = "green") +
geom_segment(aes(x=mileage, xend=mileage, y=0, yend=price)) +
labs(title = " Mileage Vs Price", x ="Mileage", y= "Price")+
theme(plot.title = element_text(hjust = 0.7))
Interpretations :
The mileage, or how much a used car has been driven, affects the price. The plot shows that the cars with the highest mileage do not have high prices. Buyers can use this information as a consideration.
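The plot gives a visual impression only. A rank correlation is one way to put a number on the relationship. The sketch below uses made-up values, so the result says nothing about the real dataset (`toy` and `rho` are names assumed here):

```r
# Made-up mileage and price pairs (illustration only)
toy <- data.frame(
  mileage = c(1000, 15000, 30000, 60000, 120000),
  price   = c(30000, 22000, 15000, 9000, 4000)
)

# Spearman correlation works on ranks, so it is less sensitive
# to a few extreme prices or mileages than Pearson correlation
rho <- cor(toy$mileage, toy$price, method = "spearman")
rho
```

A value close to -1 would mean that higher mileage goes with lower price. On the real data this would be `cor(cars$mileage, cars$price, method = "spearman")`.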
3. How is miles per gallon distributed within each company and across the different fuel types?
I will use a violin plot, with company on the x axis and miles per gallon on the y axis.
ggplot(cars,aes(Make,mpg))+
geom_violin(aes(fill = fuelType)) +
labs (title = "milepergallon in fuelType", x = "Company", y = "mpg", fill ="FuelType")+
theme(plot.title = element_text(hjust = 0.5))
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
Interpretation :
Most mpg values fall between 0 and 100, and we know that the average mpg was 55.85. However, the plot above shows some mpg values above 100, mainly for the Hybrid and Other fuel types and, less often, for Petrol and Diesel. Besides that, we can barely see the Electric fuel type in the plot because very few used cars have it.
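The warning about groups with fewer than two data points comes from company and fuel type combinations that contain a single car. Counting the combinations is one way to find them. This is a sketch on made-up rows (`toy`, `counts`, and `small` are names assumed for this example):

```r
# Made-up rows (illustration only)
toy <- data.frame(
  Make     = c("audi", "audi", "audi", "Ford", "Ford", "Ford"),
  fuelType = c("Petrol", "Petrol", "Hybrid", "Diesel", "Diesel", "Electric")
)

# Cross-tabulate company against fuel type
counts <- table(toy$Make, toy$fuelType)

# Turn the table into a data frame with one row per combination
small <- as.data.frame(counts, stringsAsFactors = FALSE)
names(small) <- c("Make", "fuelType", "n")

# Keep the combinations that exist but have fewer than two cars
small <- small[small$n > 0 & small$n < 2, ]
small
```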
4. Show me the price distribution of each type of transmission for the Ford company only!
Let's focus on the “Ford” company. Create a new object which only contains “Ford” rows, named ‘ford’.
head(cars[cars$Make == "Ford",])
## model year price transmission mileage fuelType tax mpg engineSize Make
## 21450 Fiesta 2017 12000 Automatic 15944 Petrol 150 57.7 1.0 Ford
## 21451 Focus 2018 14000 Manual 9083 Petrol 150 57.7 1.0 Ford
## 21452 Focus 2017 13000 Manual 12456 Petrol 150 57.7 1.0 Ford
## 21453 Fiesta 2019 17500 Manual 10460 Petrol 145 40.3 1.5 Ford
## 21454 Fiesta 2019 16500 Automatic 1482 Petrol 145 48.7 1.0 Ford
## 21455 Fiesta 2015 10500 Manual 35432 Petrol 145 47.9 1.6 Ford
ford <- cars[cars$Make == "Ford",]
dim(ford)
## [1] 17964 10
After that, group the price into the ranges <15000, 15000<=x<=50000, 50000<x<100000, and x>=100000, and then create a new column in the ford object named ‘priceFord’.
P <- function(x){
if (x<15000) {x<-"Below 15000"} else if (x>=15000 & x<=50000) {x <- "15000 to 50000"} else if (x>50000 & x<100000) {x <-"between 50000-100000"} else {x <- "above 100000"}}
ford$priceFord <- as.factor(sapply(ford$price, P))
head(ford)
## model year price transmission mileage fuelType tax mpg engineSize Make
## 21450 Fiesta 2017 12000 Automatic 15944 Petrol 150 57.7 1.0 Ford
## 21451 Focus 2018 14000 Manual 9083 Petrol 150 57.7 1.0 Ford
## 21452 Focus 2017 13000 Manual 12456 Petrol 150 57.7 1.0 Ford
## 21453 Fiesta 2019 17500 Manual 10460 Petrol 145 40.3 1.5 Ford
## 21454 Fiesta 2019 16500 Automatic 1482 Petrol 145 48.7 1.0 Ford
## 21455 Fiesta 2015 10500 Manual 35432 Petrol 145 47.9 1.6 Ford
## priceFord
## 21450 Below 15000
## 21451 Below 15000
## 21452 Below 15000
## 21453 15000 to 50000
## 21454 15000 to 50000
## 21455 Below 15000
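The `P` function above handles one price at a time and is applied with `sapply`. A vectorized version with nested `ifelse` calls gives the same labels without the loop. This is a sketch, and `price_band` is a hypothetical name, not part of the original code:

```r
# Vectorized version of the price grouping, using the same
# boundaries and labels as the P function above
price_band <- function(x) {
  ifelse(x < 15000, "Below 15000",
    ifelse(x <= 50000, "15000 to 50000",
      ifelse(x < 100000, "between 50000-100000", "above 100000")))
}

# Example prices, including the boundary values
bands <- price_band(c(12000, 15000, 50000, 54995, 100000))
bands
```

With this function the new column could be created as `ford$priceFord <- as.factor(price_band(ford$price))`.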
ggplot(ford,aes(transmission,price)) +
geom_jitter(aes(col = priceFord)) +
geom_boxplot(alpha=0.7) +
scale_y_log10()+
labs(title = "Ford Price by Transmission", x= "Transmission", y= "Price", col = "Price Ford") +
theme(plot.title = element_text(hjust = 0.8))
Interpretation :
Manual transmission is the most widely distributed in price for the “Ford” company, although we still found one Ford used car with a price between 50000 and 100000, at 54995.
Ordered by popularity, the transmission types are Manual, Automatic, Semi-Auto, and Other. There were only 4 used cars with Other transmission, and they could not be shown in the plot above.
5. Show the average Tax and MPG for each fuel type within the Audi company!
Create a new data frame holding the average Tax and MPG for each fuel type, named ‘audi_TM’.
audi <- cars[cars$Make == "audi",]
head(aggregate.data.frame(list(Tax = audi$tax, MPG = audi$mpg), by = list(Fueltype = audi$fuelType), mean))
## Fueltype Tax MPG
## 1 Diesel 119.65663 54.24024
## 2 Hybrid 72.67857 150.22143
## 3 Petrol 133.30634 46.39751
audi_TM <- aggregate.data.frame(list(Tax = audi$tax, MPG = audi$mpg), by = list(Fueltype = audi$fuelType), mean)
dim(audi_TM)
## [1] 3 3
Gather all values (Tax and MPG) into one column, named ‘average’, with a ‘variable’ column saying which measure each value belongs to, using the function ‘gather’.
audi_TM <- gather(audi_TM, key = "variable", value = "average", -Fueltype)
audi_TM
## Fueltype variable average
## 1 Diesel Tax 119.65663
## 2 Hybrid Tax 72.67857
## 3 Petrol Tax 133.30634
## 4 Diesel MPG 54.24024
## 5 Hybrid MPG 150.22143
## 6 Petrol MPG 46.39751
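In current versions of tidyr, `gather` is superseded by `pivot_longer`. The same reshaping could be sketched as follows, using the three averages printed above as input (`audi_avg` and `audi_long` are names introduced for this example):

```r
library(tidyr)

# The averages shown above, typed in by hand for a self-contained example
audi_avg <- data.frame(
  Fueltype = c("Diesel", "Hybrid", "Petrol"),
  Tax = c(119.65663, 72.67857, 133.30634),
  MPG = c(54.24024, 150.22143, 46.39751)
)

# Move the Tax and MPG columns into name/value pairs
audi_long <- pivot_longer(
  audi_avg,
  cols = -Fueltype,
  names_to = "variable",
  values_to = "average"
)
audi_long
```

The rows come out ordered by fuel type rather than by variable, which makes no difference to the bar chart.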
ggplot(audi_TM,aes(Fueltype, average))+
geom_col(aes(fill = variable), position = "dodge") +
coord_flip()+
labs(title = "Average Tax and MPG for each Fuel type", x= "Fuel Type", y = "Value", fill = "Variable")+
geom_text(aes(label=comma(average)), hjust = -0.1, size = 2.1)+
theme(plot.title = element_text(hjust = 0.2))
Interpretation :
In the Audi company, there were no used cars with the Other or Electric fuel type. There were only the Petrol, Hybrid, and Diesel fuel types.
Petrol has the highest average Tax but the lowest average MPG.
The Hybrid fuel type has the highest average MPG.
6. Which model of Ford used cars has the highest price?
test <- ford[match(unique(ford$model), ford$model), ]
test <- test[order(test$price, decreasing = T), ]
test
## model year price transmission mileage fuelType tax mpg
## 21752 Mustang 2020 42489 Automatic 3500 Petrol 145 22.1
## 21458 Kuga 2019 25500 Automatic 6894 Diesel 145 42.2
## 21456 Puma 2019 22500 Manual 2029 Petrol 145 50.4
## 21485 Mondeo 2019 20000 Manual 24 Diesel 145 65.7
## 21759 Grand Tourneo Connect 2019 19999 Manual 3500 Diesel 145 61.4
## 21505 Tourneo Custom 2018 19995 Automatic 24568 Diesel 145 31.7
## 21529 Edge 2016 18640 Manual 24105 Diesel 160 48.7
## 21710 Galaxy 2016 18498 Manual 30528 Diesel 125 56.5
## 21506 S-MAX 2017 18495 Automatic 39605 Diesel 145 54.3
## 35057 Ranger 2013 14495 Manual 88000 Diesel 240 28.3
## 21451 Focus 2018 14000 Manual 9083 Petrol 150 57.7
## 21461 EcoSport 2018 13500 Manual 12065 Petrol 145 54.3
## 21589 Grand C-MAX 2018 13495 Manual 1030 Petrol 145 47.9
## 39190 Transit Tourneo 2014 12450 Manual 19496 Diesel 235 42.2
## 21450 Fiesta 2017 12000 Automatic 15944 Petrol 150 57.7
## 21476 C-MAX 2018 11799 Manual 23800 Diesel 145 68.9
## 21544 Tourneo Connect 2015 9295 Manual 47904 Diesel 125 56.5
## 21489 Ka+ 2018 8261 Manual 25000 Petrol 145 57.7
## 21512 B-MAX 2014 7498 Semi-Auto 33023 Petrol 160 44.1
## 21672 KA 2014 4898 Manual 22609 Petrol 30 57.7
## 34916 Fusion 2010 4750 Automatic 26588 Petrol 260 37.7
## 38328 Escort 1996 3000 Manual 50000 Petrol 265 34.4
## 36141 Streetka 2005 1999 Manual 63000 Petrol 270 35.3
## engineSize Make priceFord
## 21752 5.0 Ford 15000 to 50000
## 21458 2.0 Ford 15000 to 50000
## 21456 1.0 Ford 15000 to 50000
## 21485 2.0 Ford 15000 to 50000
## 21759 1.5 Ford 15000 to 50000
## 21505 2.0 Ford 15000 to 50000
## 21529 2.0 Ford 15000 to 50000
## 21710 2.0 Ford 15000 to 50000
## 21506 2.0 Ford 15000 to 50000
## 35057 3.2 Ford Below 15000
## 21451 1.0 Ford Below 15000
## 21461 1.0 Ford Below 15000
## 21589 1.0 Ford Below 15000
## 39190 2.2 Ford Below 15000
## 21450 1.0 Ford Below 15000
## 21476 1.5 Ford Below 15000
## 21544 1.6 Ford Below 15000
## 21489 1.2 Ford Below 15000
## 21512 1.6 Ford Below 15000
## 21672 1.2 Ford Below 15000
## 34916 1.6 Ford Below 15000
## 38328 1.8 Ford Below 15000
## 36141 1.6 Ford Below 15000
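Note that `match(unique(ford$model), ford$model)` keeps the first listing of each model, so the table above ranks models by the price of that first listing. If the goal is the highest price per model, one option is to aggregate with `max`. This is a sketch on made-up rows (`toy_ford` and `max_price` are names assumed here):

```r
# Made-up Ford listings (illustration only)
toy_ford <- data.frame(
  model = c("Fiesta", "Focus", "Fiesta", "Mustang", "Focus"),
  price = c(12000, 14000, 17500, 42489, 21000)
)

# Highest price found for each model
max_price <- aggregate(price ~ model, data = toy_ford, FUN = max)

# Sort so that the most expensive model comes first
max_price <- max_price[order(max_price$price, decreasing = TRUE), ]
max_price
```

On the real data the same call would be `aggregate(price ~ model, data = ford, FUN = max)`.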
ggplot(test,aes(reorder(model,engineSize), engineSize))+
geom_col(fill ="maroon")+
facet_grid(rows = vars(Make), scales = "free_y")+
geom_point(aes(col=price))+
geom_text(aes(label= comma(engineSize)), hjust=-0.2, size = 2)+
labs( x="Model", y= "EngineSize")+
coord_flip()
From all the graphs above, we can make some suggestions or assumptions:
Ford used cars are affordable. Maybe that is why there are more Ford used cars than cars from any other company.
For the Ford company, many of the used cars with Manual transmission are priced below 15000.
Audi and BMW were the most expensive used cars, because their average prices were above the overall average price.
The used cars with the highest mileage also have the lowest prices. This means that the more a car has been used, the lower it is valued.