At present, the amount of transfers held by the industry is increasing year by year. Thus, transfers play an important role in the development of automobiles. In developed countries, automobiles are referred to as an “industrial industry”. According to industry experts, the automotive industry is growing significantly. Apart from being the fastest growing country in the automotive industry, it also represents a global presence,as in most other countries, cars have become very popular among local residents and expatriates working in the country.
Today, almost everyone wants to own their own car, but due to factors such as affordability or financial situation, many people opt for used cars. Accurately predicting used car prices requires expertise, as they depend on many factors and characteristics. Used car prices in the market are not stable and both buyers and sellers need an intelligent system that allows them to effectively predict the price. In this project,used car price prediction was conducted on used car-related data. The data obtained from Kaggle.
In order to predict used car price, there are several question can be ask which are:
What are the algorithms is used and which algortihm performed the best in predicting used car selling price.
Which variables influence the used car selling price?
In this regard, machine learning algorithms is required in predicting the selling price of cars based on available variables.
Buying a used car from a dealership can be a frustrating and unsatisfying experience, as some dealers have been known to use deceptive sales tactics to close the deal. The project aims to :
To build machine learning algorithms which can be implementing into used car recommendation system in predicting used car selling price.
To identify what variables influence the used car selling price.
Considering this is an interesting research topic for the research community, and by continuing their steps and hope to achieve important results using more advanced methods from previous work.
Import necessary packages:
library(tidyverse)
library(stringr)
library(purrr)
library(Amelia)
library(GGally)
library(caret)
library(relaimpo)
library(randomForest)
library(gbm)
library(broom)
library(dplyr)
library(e1071)
library(magrittr)
library(dplyr)
library(ggplot2)
library(data.table)
library(tidyr)
library(tidyselect)
library(plotly)
library(reactable)
library(htmlwidgets)
library('IRdisplay')
library("scales")
In this project, a vehicle dataset was used which consisting of four sets of data that contains the information about used cars named car data, Car details v3, Car details v4 and Car Details from Car Dekho. For acknowledgement, the Car details v3 dataset was used. To explore the dataset, R language was mainly used for this project. Firstly, import the dataset into R project and view them.
Import dataset from local storage:
car <- read.csv('../group_project/car_project.csv')
view(car)
Running the script gave us important information that there are about 8128 rows and 13 columns for this dataset. For data types, there are nine (9) column character types and four (4) column double types. Next, let’s see the dataset dimension.
Dimension of dataset:
dim(car)
## [1] 8128 13
The car data has 8128 observations and 13 variables which are the same as the number of rows and columns as above. Next, the dataset will be familiarized with the variables available within the dataset by taking a look at the top rows of the car dataset.
Name of columns:
names(car)
## [1] "name" "year" "selling_price" "km_driven"
## [5] "fuel" "seller_type" "transmission" "owner"
## [9] "mileage" "engine" "max_power" "torque"
## [13] "seats"
The description of each coloumns:
name - Name of the car models.year - Year of the car when it was bought.selling_price - Price at which the car is being
sold.km_driven - Number of kilometers the car is
driven.fuel - Fuel type of the car used.seller_type - Tells if a seller is individual or a
dealer.transmission - Gear transmission of the car.owner - Number of previous owners of the car.Summary of dataset:
summary(car)
## name year selling_price km_driven
## Length:8128 Min. :1983 Min. : 29999 Min. : 1
## Class :character 1st Qu.:2011 1st Qu.: 254999 1st Qu.: 35000
## Mode :character Median :2015 Median : 450000 Median : 60000
## Mean :2014 Mean : 638272 Mean : 69820
## 3rd Qu.:2017 3rd Qu.: 675000 3rd Qu.: 98000
## Max. :2020 Max. :10000000 Max. :2360457
##
## fuel seller_type transmission owner
## Length:8128 Length:8128 Length:8128 Length:8128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## mileage engine max_power torque
## Length:8128 Length:8128 Length:8128 Length:8128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## seats
## Min. : 2.000
## 1st Qu.: 5.000
## Median : 5.000
## Mean : 5.417
## 3rd Qu.: 5.000
## Max. :14.000
## NA's :221
As for the output of summary(), if the column is numeric type, then the summary would contain information like minimum, maximum, median, mean, etc. If the column is char type, then summary would contain information like length, class and mode. Here, as seen, there are nine character types: name, fuel, seller_type, transmission, owner, mileage, engine, max_power, torque and four character types: year, selling_price, km_driven, seats. Next, let’s see the data structure.
str(car)
## 'data.frame': 8128 obs. of 13 variables:
## $ name : chr "Maruti Swift Dzire VDI" "Skoda Rapid 1.5 TDI Ambition" "Honda City 2017-2020 EXi" "Hyundai i20 Sportz Diesel" ...
## $ year : int 2014 2014 2006 2010 2007 2017 2007 2001 2011 2013 ...
## $ selling_price: int 450000 370000 158000 225000 130000 440000 96000 45000 350000 200000 ...
## $ km_driven : int 145500 120000 140000 127000 120000 45000 175000 5000 90000 169000 ...
## $ fuel : chr "Diesel" "Diesel" "Petrol" "Diesel" ...
## $ seller_type : chr "Individual" "Individual" "Individual" "Individual" ...
## $ transmission : chr "Manual" "Manual" "Manual" "Manual" ...
## $ owner : chr "First Owner" "Second Owner" "Third Owner" "First Owner" ...
## $ mileage : chr "23.4 kmpl" "21.14 kmpl" "17.7 kmpl" "23.0 kmpl" ...
## $ engine : chr "1248 CC" "1498 CC" "1497 CC" "1396 CC" ...
## $ max_power : chr "74 bhp" "103.52 bhp" "78 bhp" "90 bhp" ...
## $ torque : chr "190Nm@ 2000rpm" "250Nm@ 1500-2500rpm" "12.7@ 2,700(kgm@ rpm)" "22.4 kgm at 1750-2750rpm" ...
## $ seats : int 5 5 5 5 5 5 5 4 5 5 ...
Data structure is a way to see an overview of the data so that it can be used effectively. In this dataset, it gives an overview of dataframes, the generic data objects of R which used to store the tabular data. Since the main objective focused on the selling price of the cars, distributions of the selling price will be seen in the first place.
Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.
Function to plot width and height of plot:
fig<-function(x,y){
options(repr.plot.width = x, repr.plot.height = y)
}
expensive_cars <- car %>% group_by(name) %>% summarise(selling_price=max(selling_price))%>%top_n(10)
## Selecting by selling_price
head(expensive_cars)
## # A tibble: 6 × 2
## name selling_price
## <chr> <int>
## 1 Audi A6 35 TFSI Matrix 6523000
## 2 BMW 5 Series 520d Luxury Line 5200000
## 3 BMW 6 Series GT 630d Luxury Line 6000000
## 4 BMW X4 M Sport X xDrive20d 5800000
## 5 BMW X7 xDrive 30d DPE 7200000
## 6 Mercedes-Benz E-Class Exclusive E 200 BSIV 5200000
print(expensive_cars)
## # A tibble: 10 × 2
## name selling_price
## <chr> <int>
## 1 Audi A6 35 TFSI Matrix 6523000
## 2 BMW 5 Series 520d Luxury Line 5200000
## 3 BMW 6 Series GT 630d Luxury Line 6000000
## 4 BMW X4 M Sport X xDrive20d 5800000
## 5 BMW X7 xDrive 30d DPE 7200000
## 6 Mercedes-Benz E-Class Exclusive E 200 BSIV 5200000
## 7 Mercedes-Benz S-Class S 350 CDI 6000000
## 8 Volvo S90 D4 Inscription BSIV 5500000
## 9 Volvo XC60 Inscription D5 BSIV 5500000
## 10 Volvo XC90 T8 Excellence BSIV 10000000
options(scipen = 999)
ggplot(data = expensive_cars, aes(y=name, x=selling_price, fill=selling_price))+
geom_bar(stat="identity", width = 0.5, fill="#E14D2A",color = 'black')+
geom_text(aes(label=selling_price), vjust=1.7, color="black", size=3.0)+
scale_x_continuous(labels = comma)+
labs(x="Car Price",
y="Car Brand",
title="Top 10 most expensive cars")+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
The most expensive car model is Volvo XC90 T8 Excellence BSIV, sold at 10,000,000. The second highest expensive car model is BMW X7 xDrive 30d DPE sold at 72,000,000. The third rank of expensive car brands is Audi A6 35 TFSI Matrix which was sold at 6,523,000.
cheapest_cars <- filter(car %>% group_by(name) %>% summarise(selling_price=max(selling_price)) %>% arrange(desc(selling_price)))
ggplot(data = tail(cheapest_cars,10), aes(y=name, x=selling_price, fill=selling_price))+
geom_bar(stat="identity", width = 0.5, fill="#001253",color = 'white')+
geom_text(aes(label=selling_price), vjust=1.9, color="black", size=3.0)+
scale_x_continuous(labels = comma)+
labs(x="Car Price",
y="Car Brand",
title="Top 10 most cheapest car model")+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
The most cheapest car model is Hyundai Santro LE, sold at 35,000. The second lowest car model sold is Tata Indica DLX sold at 40,000. The third and fourth rank share the same value of the cheapest car models sold at 45,000 which are Tata Nano LX SE and Maruti 800 DXBSII.
sold_per_year<- car %>% group_by(name, year)%>%summarise(Count=length(name))
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
sold_per_year
## # A tibble: 3,837 × 3
## # Groups: name [2,058]
## name year Count
## <chr> <int> <int>
## 1 Ambassador CLASSIC 1500 DSL AC 2000 1
## 2 Ambassador Classic 2000 DSZ AC PS 1994 1
## 3 Ambassador Grand 1500 DSZ BSIII 2008 1
## 4 Ambassador Grand 2000 DSZ PW CL 2008 1
## 5 Ashok Leyland Stile LE 2013 1
## 6 Audi A3 35 TDI Premium Plus 2017 1
## 7 Audi A3 35 TDI Premium Plus 2018 1
## 8 Audi A3 40 TFSI Premium 2017 1
## 9 Audi A4 1.8 TFSI 2010 1
## 10 Audi A4 2.0 TDI 2014 2
## # … with 3,827 more rows
ggplot(sold_per_year, aes(year))+
geom_line(stat="count", width = 0.7,color = 'black')+ # Stack for stacked chart
labs(x="Year",
y="Count",
title="Distribution of cars sold per year")+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10,angle=90),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
From the data distributions of the used cars sold over past few years from 1996 to 2020, there is an increment of the car units sold. In 2012, there was a slight reduction in the number of counts and rose back up till 2017 where it reached its peak at 434 units sold. Then, the number of used cars sold decreases till 2020.
fuel_type<- car %>% group_by(fuel) %>% summarise(Count=length(fuel)) %>% plot_ly(x=~fuel, y=~Count, color = ~fuel, colors = c("#764AF1","#F2F2F2"), type='bar')%>%layout(title="Distribution of fuel types on used car model", xaxis=list(title="Fuel type"))
fuel_type
From the data distributions of the fuel types over the total of used cars model, the highest fuel type car model was diesel, followed by petrol. While the lowest fuel type consumption of car models is Compressed Natural Gas (CNG) and Liquefied Petroleum Gas (LPG).
seller<- car %>% group_by(seller_type) %>% summarise(Count=length(seller_type)) %>% plot_ly(x=~seller_type, y=~Count, color = ~seller_type, colors = c("#00FFFF","#F2F2F2"), type='bar')%>%layout(title="Distribution of seller types on used car model", xaxis=list(title="Seller type"))
seller
From the data distributions of the seller types over the total of used cars model, the highest seller type car model was Individual, followed by Dealer and Trustmark Dealer.
transmission <- car %>% group_by(transmission) %>% summarise(Count=length(transmission)) %>% plot_ly(x=~transmission, y=~Count, color = ~transmission, colors = c("#00FF00","#F2F2F2"), type='bar')%>%layout(title="Distribution of used car model's transmission type", xaxis=list(title="transmission type"))
transmission
From the data distributions of the transmission type over the total of used cars model, the highest seller type car model is Manual over the 7000 in counts, while the lowest number is the Manual type with around 1000 in counts.
ggplot(data = car, aes(x=year, y=selling_price, fill=selling_price)) +
geom_point()+
labs(y="Selling Price",
x="Year",
title="Relationship between Selling Price and Car Manufactured")+
theme_bw()+
theme(plot.title = element_text(size=15)
,axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10),
axis.title=element_text(size=10))
From the relationship of data distributions of sold car selling price over the year, the graph shows that there is an incremental density in value. As the year in time increases, the selling price values increase.
ggplot(data = car, aes(x=km_driven, y=selling_price, fill=km_driven)) +
geom_point()+
labs(y="Selling Price",
x="KM driven",
title="Relationship between Selling Price and km driven")+
theme_bw()+
theme(plot.title = element_text(size=15)
,axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10),
axis.title=element_text(size=10))
From the relationship of data distributions of sold car selling price over the km driven, the graph shows that there is an decrement density in value. As the year in km driven increases, the selling price values decrease.
Check and visualize the missing missing data:
car[car == ''] <- NA
sapply(car, function(x) sum(is.na(x)))
## name year selling_price km_driven fuel
## 0 0 0 0 0
## seller_type transmission owner mileage engine
## 0 0 0 221 221
## max_power torque seats
## 215 222 221
missmap(car, legend = TRUE, col = c("red", "blue"))
There are 221 missing value in mileage, engine and seats while there are 215 and 222 missing value for max_power and torque respectively.
Change model name variable into company name to understand data better with just single word rather than multiple word.
car$name <- word(car$name,1)
change car mileage variable to numerical data type and impute the missing value in car mileage using the mean:
car$mileage <- parse_number(car$mileage)
car$mileage[is.na(car$mileage)]<-mean(car$mileage,na.rm=TRUE)
change car engine variable to numerical data type and impute the missing value using the mean:
car$engine <- parse_number(car$engine)
car$engine[is.na(car$engine)]<-mean(car$engine,na.rm=TRUE)
Change car max_power variable to numeric data type and impute missing value using mean:
car$max_power <- parse_number(car$max_power)
car$max_power[is.na(car$max_power)]<-mean(car$max_power,na.rm=TRUE)
Impute car seat variable missing value using median:
car$seats[is.na(car$seats)]<-median(car$seats,na.rm=TRUE)
Change car torque variable to numeric and input missing value using mean:
car$torque <- parse_number(car$torque)
car$torque[is.na(car$torque)]<-mean(car$torque,na.rm=TRUE)
Check the data after data transformation and imputation.
head(car)
## name year selling_price km_driven fuel seller_type transmission
## 1 Maruti 2014 450000 145500 Diesel Individual Manual
## 2 Skoda 2014 370000 120000 Diesel Individual Manual
## 3 Honda 2006 158000 140000 Petrol Individual Manual
## 4 Hyundai 2010 225000 127000 Diesel Individual Manual
## 5 Maruti 2007 130000 120000 Petrol Individual Manual
## 6 Hyundai 2017 440000 45000 Petrol Individual Manual
## owner mileage engine max_power torque seats
## 1 First Owner 23.40 1248 74.00 190.00 5
## 2 Second Owner 21.14 1498 103.52 250.00 5
## 3 Third Owner 17.70 1497 78.00 12.70 5
## 4 First Owner 23.00 1396 90.00 22.40 5
## 5 First Owner 16.10 1298 88.20 11.50 5
## 6 First Owner 20.14 1197 81.86 113.75 5
The data has been transformed into more suitable form.
Recheck and revisualize the missing value:
sapply(car, function(x) sum(is.na(x)))
## name year selling_price km_driven fuel
## 0 0 0 0 0
## seller_type transmission owner mileage engine
## 0 0 0 0 0
## max_power torque seats
## 0 0 0
missmap(car, legend = TRUE, col = c("red", "blue"))
There is no missing value remaining after cleaning.
Check data type for all variables:
lapply(car, class)
## $name
## [1] "character"
##
## $year
## [1] "integer"
##
## $selling_price
## [1] "integer"
##
## $km_driven
## [1] "integer"
##
## $fuel
## [1] "character"
##
## $seller_type
## [1] "character"
##
## $transmission
## [1] "character"
##
## $owner
## [1] "character"
##
## $mileage
## [1] "numeric"
##
## $engine
## [1] "numeric"
##
## $max_power
## [1] "numeric"
##
## $torque
## [1] "numeric"
##
## $seats
## [1] "integer"
Next, the categorical data will be encoded as data preparation before modelling.
The each car name variable will be encoded to the numerical value from 0 to 31. The car name data type will be changed from character to numerical data type.
car$name <- str_replace(car$name, 'Maruti', '0')
car$name <- str_replace(car$name, 'Skoda', '1')
car$name <- str_replace(car$name, 'Honda', '2')
car$name <- str_replace(car$name, 'Hyundai', '3')
car$name <- str_replace(car$name, 'Toyota', '4')
car$name <- str_replace(car$name, 'Ford', '5')
car$name <- str_replace(car$name, 'Renault', '6')
car$name <- str_replace(car$name, 'Mahindra', '7')
car$name <- str_replace(car$name, 'Tata', '8')
car$name <- str_replace(car$name, 'Chevrolet', '9')
car$name <- str_replace(car$name, 'Fiat', '10')
car$name <- str_replace(car$name, 'Datsun', '11')
car$name <- str_replace(car$name, 'Jeep', '12')
car$name <- str_replace(car$name, 'Mercedes-Benz', '13')
car$name <- str_replace(car$name, 'Mitsubishi', '14')
car$name <- str_replace(car$name, 'Audi', '15')
car$name <- str_replace(car$name, 'Volkswagen', '16')
car$name <- str_replace(car$name, 'BMW', '17')
car$name <- str_replace(car$name, 'Nissan', '18')
car$name <- str_replace(car$name, 'Lexus', '19')
car$name <- str_replace(car$name, 'Jaguar', '20')
car$name <- str_replace(car$name, 'Land', '21')
car$name <- str_replace(car$name, 'MG', '22')
car$name <- str_replace(car$name, 'Volvo', '23')
car$name <- str_replace(car$name, 'Daewoo', '24')
car$name <- str_replace(car$name, 'Kia', '25')
car$name <- str_replace(car$name, 'Force', '26')
car$name <- str_replace(car$name, 'Ambassador', '27')
car$name <- str_replace(car$name, 'Ashok', '28')
car$name <- str_replace(car$name, 'Isuzu', '29')
car$name <- str_replace(car$name, 'Opel', '30')
car$name <- str_replace(car$name, 'Peugeot', '31')
car$name <- as.numeric(car$name)
table(car$name)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2448 105 467 1415 488 397 228 772 734 230 47 65 31 54 14 40
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## 186 120 81 34 71 6 3 67 3 4 6 4 1 5 1 1
The car transmission variable will be encoded to the numerical value where “Manual” will be changed to 0 while “Auto” will be changed to 1. The transmission variable data type will be changed from character to numerical data type:
car$transmission <- str_replace(car$transmission, 'Manual', "0")
car$transmission <- str_replace(car$transmission, 'Automatic', "1")
car$transmission <- as.numeric(car$transmission)
table(car$transmission)
##
## 0 1
## 7078 1050
The car owner variables will be encoded to numerical value where “First Owner” will be changed to “0”, “Second Owner” will be changed to “1”, “Third Owner” will be changed to “2”, “Fourth & Above Owner” will be changed to “3” and “Test Drive Car” will be changed to “4”:
car$owner <- str_replace(car$owner, 'First Owner', "0")
car$owner <- str_replace(car$owner, 'Second Owner', "1")
car$owner <- str_replace(car$owner, 'Third Owner', "2")
car$owner <- str_replace(car$owner, 'Fourth & Above Owner', "3")
car$owner <- str_replace(car$owner, 'Test Drive Car', "4")
car$owner <- as.numeric(car$owner)
table(car$owner)
##
## 0 1 2 3 4
## 5289 2105 555 174 5
The car seller_type variable will be encoded to numerical value where “Trustmark Dealer” will be changed to “0”, “Dealer” will be changed to “1” and “Individual” will be changed to “2”:
car$seller_type <- str_replace(car$seller_type, "Trustmark Dealer", "0")
car$seller_type <- str_replace(car$seller_type, "Dealer", "1")
car$seller_type <- str_replace(car$seller_type, "Individual", "2")
car$seller_type <- as.numeric(car$seller_type)
table(car$seller_type)
##
## 0 1 2
## 236 1126 6766
The car fuel variable will be encoded to numerical value where “Diesel” will be changed to “0”, “Petrol” will be changed to “1” and “CNG” will be changed to “2” and “LPG” will be changed to “3”:
car$fuel <- str_replace(car$fuel, 'Diesel', "0")
car$fuel <- str_replace(car$fuel, 'Petrol', "1")
car$fuel <- str_replace(car$fuel, 'CNG', "2")
car$fuel <- str_replace(car$fuel, 'LPG', "3")
car$fuel <- as.numeric(car$fuel)
table(car$fuel)
##
## 0 1 2 3
## 4402 3631 57 38
Check the dataset after encoding and transformation:
head(car)
## name year selling_price km_driven fuel seller_type transmission owner mileage
## 1 0 2014 450000 145500 0 2 0 0 23.40
## 2 1 2014 370000 120000 0 2 0 1 21.14
## 3 2 2006 158000 140000 1 2 0 2 17.70
## 4 3 2010 225000 127000 0 2 0 0 23.00
## 5 0 2007 130000 120000 1 2 0 0 16.10
## 6 3 2017 440000 45000 1 2 0 0 20.14
## engine max_power torque seats
## 1 1248 74.00 190.00 5
## 2 1498 103.52 250.00 5
## 3 1497 78.00 12.70 5
## 4 1396 90.00 22.40 5
## 5 1298 88.20 11.50 5
## 6 1197 81.86 113.75 5
All data is encoded properly.
specify non target variables into categorical and numerical column and specify target variables for preparation for normalization.
cat_cols <- c('name', 'year', 'fuel', 'seller_type', 'transmission', 'owner', 'seats')
num_cols <- c('km_driven', 'mileage', 'engine', 'max_power', 'torque')
target_cols <- c('selling_price')
Standardized the data for non target variables, while target variable will not be standardized for easy representation of selling price.
preproc <- preProcess(car[, c(cat_cols,num_cols)], method=c("range"))
car_scaled <- predict(preproc, car)
head(car_scaled)
## name year selling_price km_driven fuel seller_type
## 1 0.00000000 0.8378378 450000 0.06164021 0.0000000 1
## 2 0.03225806 0.8378378 370000 0.05083721 0.0000000 1
## 3 0.06451613 0.6216216 158000 0.05931015 0.3333333 1
## 4 0.09677419 0.7297297 225000 0.05380274 0.0000000 1
## 5 0.00000000 0.6486486 130000 0.05083721 0.3333333 1
## 6 0.09677419 0.9189189 440000 0.01906369 0.3333333 1
## transmission owner mileage engine max_power torque seats
## 1 0 0.00 0.5571429 0.2093960 0.18500 0.236164244 0.25
## 2 0 0.25 0.5033333 0.2932886 0.25880 0.312675338 0.25
## 3 0 0.50 0.4214286 0.2929530 0.19500 0.010073961 0.25
## 4 0 0.00 0.5476190 0.2590604 0.22500 0.022443254 0.25
## 5 0 0.00 0.3833333 0.2261745 0.22050 0.008543739 0.25
## 6 0 0.00 0.4795238 0.1922819 0.20465 0.138931395 0.25
The data has been successfully processed as shown in the display of data above.
Split the data into 80% training set and 20% testing set.
trainIndex <- createDataPartition(car_scaled$selling_price, p = 0.8,
list = FALSE,
times = 1)
train <- car_scaled[ trainIndex,]
test <- car_scaled[-trainIndex,]
dim(train)
## [1] 6504 13
dim(test)
## [1] 1624 13
5 regression algorithms will be used for modelling and all of them will be evaluated and compared to find the best model. The evaluation metric will use RMSE score for evaluation. The comparison of each model will be discussed on the evaluation & conclusion section.
Linear regression training:
set.seed(123)
lr <- lm(selling_price ~ ., data = train)
summary(lr)
##
## Call:
## lm(formula = selling_price ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2420055 -213892 -8343 164594 3777283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1558144 84276 -18.489 < 0.0000000000000002 ***
## name 803566 42331 18.983 < 0.0000000000000002 ***
## year 1050260 74020 14.189 < 0.0000000000000002 ***
## km_driven -3901099 364749 -10.695 < 0.0000000000000002 ***
## fuel 50259 49313 1.019 0.308158
## seller_type -256948 27255 -9.428 < 0.0000000000000002 ***
## transmission 432129 21828 19.797 < 0.0000000000000002 ***
## owner 29933 36437 0.821 0.411398
## mileage 814081 94655 8.601 < 0.0000000000000002 ***
## engine 150792 76766 1.964 0.049539 *
## max_power 4558902 142717 31.944 < 0.0000000000000002 ***
## torque 309234 92010 3.361 0.000781 ***
## seats -149507 107191 -1.395 0.163130
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 454000 on 6491 degrees of freedom
## Multiple R-squared: 0.6873, Adjusted R-squared: 0.6867
## F-statistic: 1189 on 12 and 6491 DF, p-value: < 0.00000000000000022
The summary of the Linear Regression model is stated in the display above which give statistical information about the model.
Linear regression testing:
pred_lr <- predict(lr, newdata = test)
error_lr <- test$selling_price - pred_lr
eval_lr <- cbind(test$selling_price, pred_lr)
colnames(eval_lr) <- c("True Price", "Predicted Price")
eval_lr <- as.data.frame(eval_lr)
head(eval_lr)
## True Price Predicted Price
## 3 158000 -68063.71
## 9 350000 221854.82
## 11 500000 820865.47
## 13 280000 -17387.50
## 15 180000 -55655.91
## 23 525000 453151.51
From the model testing, some proportion of true values and predicted value has been shown above just to give rough view of comparison between true values and predicted values. As can been shown in display above, roughly, the difference between true values and predicted values in Linear Regression model is quite far and the predicted value have negative value. All the selling price values should be positive.
Scatter plot of actual values against predicted values for Linear Regression.
RMSE_lr <- round(sqrt(mean(error_lr^2)),2)
RMSE_lr
## [1] 443895.2
plot(test$selling_price,pred_lr, main="Actual Value vs Predicted Value for Linear Regression Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value is Root Mean Squared Error of the predicted and true values. The higher RMSE indicate that the higher error which indicate that lower model performance while lower RMSE indicate that the lower error which indicate that higher model performance. The RMSE value for Linear Regression is 443895.23 which are quite high. In the scatter plot the blue point represent true value while red point represent predicted value and the black line is regressed diagonal line which as benchmark where points that are closer to the line indicate the better performance of the model. In the Linear Regression model, the point of both values are dispersed away from the diagonal line which indicate that the model performance is low. In addition, some of the predicted value lies in the negative and zero value, it is impossible to have zero or negative value of selling price. There, it can be conclude that the performance of Linear regression model is worse.
Random Forest Regression training.
set.seed(123)
rf <- randomForest(selling_price~.,data = train)
rf
##
## Call:
## randomForest(formula = selling_price ~ ., data = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 16453500983
## % Var explained: 97.5
Random Forest Regression testing.
pred_rf <- predict(rf, test)
error_rf <- test$selling_price - pred_rf
eval_rf <- cbind(test$selling_price, pred_rf)
colnames(eval_rf) <- c("True Price", "Predicted Price")
eval_rf <- as.data.frame(eval_rf)
head(eval_rf)
## True Price Predicted Price
## 3 158000 176141.9
## 9 350000 334391.0
## 11 500000 553945.1
## 13 280000 228762.8
## 15 180000 136370.3
## 23 525000 563152.7
Roughly, in the Random Forest model, the difference of true value and predicted value having relatively small difference which roughly indicate that the Random Forest has better performance than Linear Regression model.
Scatter plot for actual values against predicted values for Random forest Regression.
RMSE_rf <- round(sqrt(mean(error_rf^2)),2)
RMSE_rf
## [1] 196246.2
plot(test$selling_price,pred_rf, main="Actual Value vs Predicted Value for Random Forest Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value for Random Forest model is 196246.21 which is very low compared to the Linear Regression model. Based on the scatter plot, all the point are reasonably close to the regressed diagonal line with only a few point is far from the regressed diagonal line which indicate the good performance of model. Based on the RMSE value and scatter plot, the Random Forest have good performance.
K-Nearest Neighbor training.
set.seed(123)
knn = knnreg(selling_price~.,data = train)
knn
## 5-nearest neighbor regression model
K-Nearest Neighbor testing.
pred_knn <- predict(knn, newdata = test)
error_knn <- test$selling_price - pred_knn
eval_knn <- cbind(test$selling_price, pred_knn)
colnames(eval_knn) <- c("True Price", "Predicted Price")
eval_knn <- as.data.frame(eval_knn)
head(eval_knn)
## True Price Predicted Price
## 1 158000 166000.0
## 2 350000 387199.8
## 3 500000 507000.0
## 4 280000 226666.3
## 5 180000 135600.0
## 6 525000 580833.3
Roughly, in the KNN model, the difference of true value and predicted value have noticeable difference. This roughly estimate that the KNN model will have lower performance compared the Random Forest model.
Scatter plot for actual values against predicted values for Random forest Regression.
RMSE_knn <- round(sqrt(mean(error_knn^2)),2)
RMSE_knn
## [1] 260642.2
plot(test$selling_price,pred_knn, main="Actual Value vs Predicted Value for K-Nearest Neighbor Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value for KNN model is 260642.16 which is significantly high compared to the Random Forest model. Based on the scatter plot, all the point are reasonably close to the regressed diagonal line with just a some point that are far from the regressed diagonal line which indicate the notable performance of model. However, based on RMSE and overall fit of the point, the performance of Random Forest model is still far better than KNN model.
Support Vector Machine training.
set.seed(123)
svm = svm(selling_price~.,data = train)
svm
##
## Call:
## svm(formula = selling_price ~ ., data = train)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.08333333
## epsilon: 0.1
##
##
## Number of Support Vectors: 2139
Support Vector Machine testing.
pred_svm <- predict(svm, newdata = test)
error_svm <- test$selling_price - pred_svm
eval_svm <- cbind(test$selling_price, pred_svm)
colnames(eval_svm) <- c("True Price", "Predicted Price")
eval_svm <- as.data.frame(eval_svm)
head(eval_svm)
## True Price Predicted Price
## 3 158000 144203.0
## 9 350000 252253.4
## 11 500000 549645.1
## 13 280000 243952.5
## 15 180000 184362.6
## 23 525000 563488.3
Roughly, in SVM mode, the difference of true value and predicted value having also relatively small difference same as KNN model. This roughly estimate that the SVM model will have almost the same performance with the KNN models.
Scatter plot for actual values against predicted values for Support Vector Machine.
RMSE_svm <- round(sqrt(mean(error_svm^2)),2)
RMSE_svm
## [1] 289123.8
plot(test$selling_price,pred_svm, main="Actual Value vs Predicted Value for SVM Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value for SVM model is 289123.77 which is almost the same as KNN model. Based on the scatter plot, all the point are reasonably close to the regressed diagonal line with just a some point that are far from the regressed diagonal line which indicate the notable performance of model. However, based on RMSE and overall fit of the point, the performance of Random Forest model is still far better than SVM model.
Gradient Boost training.
set.seed(123)
gbm <- gbm(
formula = selling_price ~ .,
distribution = "gaussian",
data = train,
n.trees = 6000,
interaction.depth = 3,
shrinkage = 0.1,
cv.folds = 5,
n.cores = NULL, # will use all cores by default
verbose = FALSE
)
gbm
## gbm(formula = selling_price ~ ., distribution = "gaussian", data = train,
## n.trees = 6000, interaction.depth = 3, shrinkage = 0.1, cv.folds = 5,
## verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 6000 iterations were performed.
## The best cross-validation iteration was 5743.
## There were 12 predictors of which 12 had non-zero influence.
Gradient Boost testing.
pred_gbm <- predict(gbm, test)
## Using 5743 trees...
error_gbm <- test$selling_price - pred_gbm
eval_gbm <- cbind(test$selling_price, pred_gbm)
colnames(eval_gbm) <- c("True Price", "Predicted Price")
eval_gbm <- as.data.frame(eval_gbm)
head(eval_gbm)
## True Price Predicted Price
## 1 158000 151758.7
## 2 350000 370361.7
## 3 500000 541782.5
## 4 280000 224875.1
## 5 180000 151004.5
## 6 525000 542562.7
RMSE_gbm <- round(sqrt(mean(error_gbm^2)),2)
RMSE_gbm
## [1] 198965.4
Roughly, in the Gradient Boost model, the difference of true value and predicted value having also relatively small difference same as Random Forest model. This roughly estimate that the Gradient Boost model will have almost the same performance with the Random Forest model.
Scatter plot for actual values against predicted values for Gradient Boost.
plot(test$selling_price,pred_gbm, main="Actual Value vs Predicted Value for Gradient Boost Model", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
abline(a = 0, b =1)
The RMSE value for Gradient Boost model is 198965.41 which is relatively low and almost similar with the Random Forest model. Based on the scatter plot, all the point are reasonably close to the regressed diagonal line with almost negligible points that far from regressed diagonal line which indicate the outstanding performance of model. Based on the RMSE and overall fit of the point, the Gradient Boost model have better performance compared to Random Forest model with small difference between RMSE score.
Each RMSE score for each model will be put into a table for comparison. RMSE score for all models is shown in Figure Below.
Model <- c('Linear Regression','Random Forest','Gradient Boosting', 'K-Nearest Neighbour', "Support Vector Machine")
RMSE <- c(RMSE_lr,RMSE_rf,RMSE_gbm,RMSE_knn,RMSE_svm)
res <- data.frame(Model,RMSE)
res %>% arrange(RMSE,descending =TRUE)
## Model RMSE
## 1 Random Forest 196246.2
## 2 Gradient Boosting 198965.4
## 3 K-Nearest Neighbour 260642.2
## 4 Support Vector Machine 289123.8
## 5 Linear Regression 443895.2
ggplot(data = res, aes(x= Model, y = RMSE, fill = Model)) + geom_bar(stat="identity") + theme(axis.text.x = element_blank(), axis.ticks = element_blank())
Gradient Boost model have the lowest RMSE score which is 198965.41 with only small difference with Random Forest model. Therefore, Gradient Boost model the best model among the other model while Linear Regression model have the highest RMSE score which is 443895.23 which indicate the worse model among the other model.
Using Gradient Boosting model, the features importance can be calculated to find which features or variables have greater influence in the price prediction.
summary(
gbm,
cBars = 10,
method = relative.influence, las = 2
)
## var rel.inf
## max_power max_power 64.7513779
## year year 17.4809107
## torque torque 4.8297703
## km_driven km_driven 3.9078292
## engine engine 2.9093576
## name name 2.6730714
## mileage mileage 1.9829501
## transmission transmission 0.6755642
## seller_type seller_type 0.3654856
## owner owner 0.1865162
## seats seats 0.1357598
## fuel fuel 0.1014069
From the features importance, the max_power variable is the biggest infrequence which 60% of relative influence in the predicting selling price based on this dataset and Gradient Boost model. The features importance rank from highest to lowest is max_power, year, torque, km_driven, name, engine, mileage, transmission, seller_type and owner.
Gradient Boost model and Random Forest model have less significant difference in RMSE value. Both model can be approved to be the best model, the selection of project this project are not fully utilizing the machine learning capabilities such as hyperparameter tuning. The hyperparameter tuning can not be utilized due to technical limitation.
Hyperparameter tuning script for Gradient Boost model:
rainControl <- trainControl(method = "cv",
number = 10,
returnResamp="all", ### use "all" to return all cross-validated metrics
search = "grid")
tuneGrid <- expand.grid(
n.trees = c(5000, 10000),
interaction.depth = c( 6, 13),
shrinkage = c(0.01, 0.001),
n.minobsinnode=c(5, 10)
)
gbm_hpt <- train(selling_price ~.,
data = train,
method = "gbm",
tuneGrid = tuneGrid,
trControl = trainControl,
verbose=FALSE)
pred_gbm_hpt <- predict(gbm_hpt, test)
error_gbm_hpt <- test$selling_price - pred_gbm_hpt
RMSE_gbm_hpt <- round(sqrt(mean(error_gbm_hpt^2)),2)
RMSE_gbm_hpt
plot(test$selling_price,pred_gbm_hpt, main="Scatterplot", col = c("blue","red"), xlab = "Actual Selling Price", ylab = "Predicted Selling Price")
The result of descriptive and predictive analysis will be integrated into the shiny application. The deployment will be in the form of dashboard by using shiny based on the selected machine learning algorithm (best model) to car prediction system. The selected model is built by using following code:
library(tidyverse)
library(stringr)
library(purrr)
library(Amelia)
library(GGally)
library(caret)
library(relaimpo)
library(randomForest)
library(gbm)
library(broom)
library(dplyr)
library(e1071)
library(skimr)
library(janitor)
library(tidymodels)
library(vip)
library(xgboost)
library(openxlsx)
car <- read.csv('../group_project/car_project.csv')
car[car == ''] <- NA
sapply(car, function(x) sum(is.na(x)))
# change model name to company name
car$name <- word(car$name,1)
# change car mileage to numeric & impute missing value using mean
car$mileage <- parse_number(car$mileage)
car$mileage[is.na(car$mileage)]<-mean(car$mileage,na.rm=TRUE)
#change car engine to numeric & impute missing value using mean
car$engine <- parse_number(car$engine)
car$engine[is.na(car$engine)]<-mean(car$engine,na.rm=TRUE)
#change max power to numeric & impute missing value using mean
car$max_power <- parse_number(car$max_power)
car$max_power[is.na(car$max_power)]<-mean(car$max_power,na.rm=TRUE)
# change car seat to numeric & imput missing value using median
car$seats[is.na(car$seats)]<-median(car$seats,na.rm=TRUE)
# change car torque to numeric and input missing value using mean
car$torque <- parse_number(car$torque)
car$torque[is.na(car$torque)]<-mean(car$torque,na.rm=TRUE)
set.seed(345)
split <- initial_split(car, prop = .80)
train_df <- training(split)
test_df <- testing(split)
# Create recipe and roles--preprocessing for the model
recipe <-
recipe(selling_price ~ ., data = train_df) %>%
step_dummy(all_predictors(), -all_numeric()) %>%
step_zv(all_predictors()) %>%
step_center(all_predictors(), -all_nominal()) %>%
step_scale(all_predictors(), -all_nominal()) %>%
step_impute_mean(all_predictors(), -all_nominal())
# Model, with hyperparameters chosen from best performing model after tuning models
mod <-
boost_tree(trees = 6000,
tree_depth = 3,
learn_rate = 0.1) %>%
set_engine('xgboost') %>%
set_mode("regression")
# Workflow
set.seed(345)
wf <- workflow() %>%
add_model(mod) %>%
add_recipe(recipe)
final <-wf %>%
fit(data = train_df)
# Predictions with training set
predictions_train <- predict(final, train_df) %>%
bind_cols(train_df %>% select_all)
# Test data predictions merged with test data
predictions_test <- predict(final, test_df) %>%
bind_cols(test_df %>% select_all())
The application is built by using this following code:
library(shiny)
library(shinyWidgets)
library(scales)
library(grid)
library(gridExtra)
library(shinydashboard)
library(rsconnect)
library(textyle)
# Source in model file
source('model.R')
source('eda.r')
saveRDS(final, "pricemodel.rds")
priceModel <- readRDS("pricemodel.rds")
shiny_df <- bind_rows(predictions_train, predictions_train)
priceModel
years <- unique(sold_per_year$year)
years
colnames(cardata)
plottype <- c("bar", "line", "box")
plottype
sort(years)
ui <- dashboardPage(
dashboardHeader(title = "Automobile Price"),
dashboardSidebar(
sidebarMenu(
menuItem("Dashboard", tabName = "dashboard", icon = icon("magnifying-glass-chart", lib = "font-awesome")),
menuItem("Prediction Tools", tabName = "tools", icon = icon("gear", lib = "font-awesome"))
)
),
dashboardBody(
tabItems(
# First tab content
tabItem(tabName = "dashboard",
h1("Desriptive Analysis of Automobile Price"),
fluidRow(
box(title = "Cars sold across year", width = 6, height = 300, solidHeader = TRUE, plotOutput("plot3", height = 250)),
box(title = "Number of car sold in a year", width = 6, height = 300, solidHeader = TRUE, plotOutput("plot4", height = 250))
),
fluidRow(
box(title = "Select years: ",height = 300, sliderInput("slider3", "Years:", min(years), max(years),c(min(years), max(years)))),
box(title = "Select a year ",height = 300, selectInput("slider4", "years:", choices= sort(years)))
),
fluidRow(
box(title = "Most expensive cars", width = 6, height = 300, solidHeader = TRUE, plotOutput("plot1", height = 250)),
box(title = "Most cheapest cars", width = 6, height = 300, solidHeader = TRUE, plotOutput("plot2", height = 250))
),
fluidRow(
box(title = "Select number of top cars: ",height = 300, sliderInput("slider1", "Number of expensive cars:", 3, 10, 5), selectInput("carname1", "select car model:", choices= sort(c(unique(shiny_df$name), "All")))),
box(title = "Select number of top cars: ", height = 300, sliderInput("slider2", "Number of cheapest cars:", 3, 10, 5), selectInput("carname2", "select car model:", choices= sort(c(unique(shiny_df$name), "All"))))
),
fluidRow(
box(title = "Exploration between two variables", width = 12, height = 300, solidHeader = TRUE, plotOutput("plot5", height = 250))
),
fluidRow(
box(title = "Select variables:", width = 12, height = 300, selectInput("select1", "Variable X:", choices = colnames(cardata)), selectInput("select2", "Variable Y:", choices= colnames(cardata), "All"), selectInput("select3", "Type of plot:", choices = plottype)))
),
tabItem(tabName = "tools",
h1("Predictive Analysis of Automobile Price"),
fluidRow(
box(title = "Select car name", width = 3, height = 150, solidHeader = TRUE, selectInput("carname", "name of car", choices= sort(unique(shiny_df$name)))),
box(title = "Select year", width = 3, height = 150, solidHeader = TRUE, numericInput("caryear", "year", 2011)),
box(title = "Select number of kilometer", width = 3, height = 150, solidHeader = TRUE, numericInput("carkm", "Number of kilometer driven", 100000)),
box(title = "Select fuel type", width = 3, height = 150, solidHeader = TRUE, selectInput("carfuel", "Type of fuel", choices= sort(unique(shiny_df$fuel))))
),
fluidRow(
box(title = "Select seller type", width = 3, height = 150, solidHeader = TRUE, selectInput("carseller", "Seller type", choices= sort(unique(shiny_df$seller_type)))),
box(title = "Select car transmission", width = 3, height = 150, solidHeader = TRUE, selectInput("cartransmission", "Type of transmission", choices= sort(unique(shiny_df$transmission)))),
box(title = "Select number of previous owner", width = 3, height = 150, solidHeader = TRUE, selectInput("carowner", "Number of previous owner", choices= sort(unique(shiny_df$owner)))),
box(title = "Select number of mileage", width = 3, height = 150, solidHeader = TRUE, numericInput("carmileage", "Mileage of the car",21))
),
fluidRow(
box(title = "Select engine capacity", width = 3, height = 150, solidHeader = TRUE, numericInput("carengine", "Engine capacity of the car",1400)),
box(title = "Select maximum engine power", width = 3, height = 150, solidHeader = TRUE, numericInput("carmaxpower", "Max power of the engine",80)),
box(title = "Select engine tourque", width = 3, height = 150, solidHeader = TRUE, numericInput("cartourque", "Tourque of the engine",200)),
box(title = "Select number of seat", width = 3, height = 150, solidHeader = TRUE, selectInput("carseat", "Number of seat", choices= sort(unique(shiny_df$seats))))
),
fluidRow(
box(align = "center", status = "info", title = "The vehicle price prediction", width =12, height =500,
actionButton("submitbutton", "Submit", class = "btn btn-primary"), br(), br(),
textOutput('text'), br(),#,
tags$head(tags$style("#text{color: black;
font-size: 20px;
font-style: italic;
font-weight: bold;
}"
)),
imageOutput('image', width = 210, height = 250)))
#fluidRow(
# box(width = 6, height = 300, actionButton("submitbutton", "Submit", class = "btn btn-primary")),
#box(width = 6, height = 300, textyle(tags$p("TEXTYLE", style = "font-size:7rem;font-weight:900;")),textOutput("text")))
)
)
)
)
server <- function(input, output, session) {
#set.seed(122)
output$plot1 <- renderPlot({
expensive_cars <- cardata %>% group_by(name) %>% summarise(selling_price=max(selling_price))#%>%top_n(input$slider1)
expensive_cars <- expensive_cars[order(-expensive_cars$selling_price),]#%>%top_n(10)
if (input$carname1 == "All"){
selectdata <- grepl("", expensive_cars$name)
} else {
selectdata <- grepl(input$carname1, expensive_cars$name)
}
expensive_cars <- expensive_cars[selectdata,]%>%top_n(input$slider1)
#head(expensive_cars)
#print(expensive_cars)
title <- paste("Top", input$slider1, "most expensive cars")
options(scipen = 999)
ggplot(data = expensive_cars, aes(y=name, x=selling_price, fill=selling_price))+
geom_bar(stat="identity", width = 0.5, fill="#001253",color = 'black')+
geom_text(aes(label=selling_price), vjust=1.9, color="black", size=3.0)+
scale_x_continuous(labels = comma)+
labs(x="Car Price",
y="Car Brand",
title=title)+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
})
output$plot2 <- renderPlot({
cheapest_cars <- filter(cardata %>% group_by(name) %>% summarise(selling_price=max(selling_price)) %>% arrange(desc(selling_price)))
if (input$carname2 == "All"){
selectdata <- grepl("", cheapest_cars$name)
} else {
selectdata <- grepl(input$carname2, cheapest_cars$name)
}
cheapest_cars <- cheapest_cars[selectdata,]
#head(expensive_cars)
#print(expensive_cars)
title <- paste("Top", input$slider2, "most cheapest cars")
ggplot(data = tail(cheapest_cars,input$slider2), aes(y=name, x=selling_price, fill=selling_price))+
geom_bar(stat="identity", width = 0.5, fill="#001253",color = 'black')+
geom_text(aes(label=selling_price), vjust=1.9, color="black", size=3.0)+
scale_x_continuous(labels = comma)+
labs(x="Car brand",
y="Car Price",
title=title)+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
})
my_range <- reactive({
cbind(input$slider3[1],input$slider3[2])
})
output$plot3 <- renderPlot({
sold_per_year<- cardata %>% group_by(name, year)%>%summarise(Count=length(name))
sold_per_year <- filter(sold_per_year, year>my_range()[1] & year<my_range()[2])
title<- paste("Distribution of cars sold from",input$slider3[1],"to",input$slider3[2])
ggplot(sold_per_year, aes(year))+
geom_line(stat="count", width = 0.7,color = 'blue')+ # Stack for stacked chart
labs(x="Year",
y="Count",
title=title)+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10,angle=90),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
})
output$plot4 <- renderPlot({
sold_per_year<- car %>% group_by(name, year)%>%summarise(Count=length(name))
sold_per_year <- filter(sold_per_year, year==input$slider4)
sold_per_year
title <- paste("Number of cars sold in ", input$slider4)
ggplot(data = sold_per_year, aes(y=Count, x=name))+
geom_bar(stat="identity", width = 0.5, fill="#001253", color = "blue")+
geom_text(aes(label=Count), vjust=1.7, color="black", size=3.0)+
labs(x="Car brand",
y="Count",
title=title)+
theme_bw()+
theme(plot.title = element_text(size=15),axis.text.x= element_text(size=10),
axis.text.y= element_text(size=10), axis.title=element_text(size=10))
})
output$plot5 <- renderPlot({
title <- paste("Relationhip between", input$select1, "and", input$select2)
if (input$select3 == "bar"){
ggplot(data = car, aes_string(x=input$select1, y=input$select2)) +
geom_bar(stat = 'identity',color = 'blue')+
labs(title=title)
} else if (input$select3 == "line"){
ggplot(data = car, aes_string(x=input$select1, y=input$select2)) +
geom_line(colour = "blue")+
labs(title=title)
} else if (input$select3 == "box"){
ggplot(data = car, aes_string(x=input$select1, y=input$select2)) +
geom_boxplot(colour = "blue")+
labs(title=title)
}
})
datasetInput <- reactive({
test <- data.frame(
name = c(input$carname),
year = c(input$caryear),
km_driven = c(input$carkm),
fuel = c(input$carfuel),
seller_type = c(input$carseller),
transmission = c(input$cartransmission),
owner = c(input$carowner),
mileage = c(input$carmileage),
engine = c(input$carengine),
max_power = c(input$carmaxpower),
torque = c(input$cartourque),
seats = c(as.numeric(input$carseat)))
priceModel <- readRDS("pricemodel.rds")
Output <- predict(priceModel,test)
Output <- paste("The price of car based on selected features: - USD ", round(Output), " -")
paste(" ", Output, " ", sep="\n")
})
output$image <-
renderImage({
if (input$submitbutton>0){
img <- c("image1.png",
"image2.png",
"image3.png",
"image4.png",
"image5.png",
"image6.png",
"image7.png",
"image8.png")
img_ch <- sample(img, 1)
list(src = file.path("images", img_ch),
width = "100%",
height = "100%")
} else{
list(src = file.path("images/28this.jpg"),
width = "100%",
height = "100%")
}
}, deleteFile = F)
output$text <- renderText({
if (input$submitbutton>0) {
#isolate("Calculation complete.")
isolate(datasetInput())
} else {
return("Server is ready for calculation.")
}
})
}
shinyApp(ui, server)
The deployment product was developed and can be access through this link.