Introduction

Data clustering is the assignment of objects to homogeneous groups (called clusters) while making sure that objects in different groups are not similar. Clustering is considered an unsupervised task, as it aims to describe the hidden structure of the objects. Each object is described by a set of characteristics called features. The first step of dividing objects into clusters is to define the distance between the different objects. Defining an adequate distance measure is crucial for the success of the clustering process.
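As a small illustration of why the distance measure matters, consider the following minimal sketch (toy numbers, not the Corsa data): dist() computes pairwise Euclidean distances, and scaling the features first changes which objects look similar, because otherwise the feature with the largest scale dominates the distance.

# Toy example: three objects described by two features on very different scales
toy <- data.frame(Price  = c(25000, 26000, 60000),
                  Engine = c(1.2, 1.7, 1.2))

dist(toy)         # raw distances: dominated entirely by Price
dist(scale(toy))  # standardized first: Engine differences also matter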

In the field of machine learning, it is useful to apply a process called dimensionality reduction to highly dimensional data. The purpose of this process is to reduce the number of features under consideration, where each feature is a dimension that partly represents the objects.

Dimensionality reduction can be executed using two different methods:
- Selecting from the existing features (feature selection)
- Extracting new features by combining the existing features (feature extraction)


In this paper, the Opel Corsa dataset is analyzed, and the idea is that working through several data clustering and dimensionality reduction approaches on it will give more insight into how to use these approaches together in day-to-day analysis. The analysis is divided into three main parts.

1. Exploratory Data Analysis
- Statistical Analysis, Visualization, Data Transformation
2. Dimensionality Reduction
- Drop zero-variance features
- Correlation analysis to check similarity between features
- PCA (Principal Component Analysis) for dimensionality reduction + Feature Engineering
3. Clustering
- K-Means
- PAM
- Hierarchical Clustering

Exploratory Data Analysis

In this part, you can find a statistical analysis of the given data, together with some visualizations that give an insight into the distribution of the data.

For data transformation, character-type features are converted into factors and then encoded with numeric values that represent the categories of these features.

Loading Packages

# Helper: load a package, installing it from CRAN first if it is not available yet
packages <- function(x){
  x <- as.character(match.call()[[2]])
  if (!require(x, character.only = TRUE)){
    install.packages(pkgs = x, repos = "http://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}

packages(tidyverse)
packages(dplyr)
packages(ggplot2)
packages(broom)
packages(rpart)
packages(rpart.plot) 
packages(rattle)
packages(cowplot)
packages(knitr)
packages(corrplot)
packages(gridExtra)
packages(GGally)
packages(cluster) # clustering algorithms 
packages(factoextra) # clustering algorithms & visualization
packages(funModeling) 
packages(Hmisc)
packages(FactoMineR)
packages(plot3D)

Statistical Analysis of Data

cars <- read.csv('C:/Users/ALPARSLAN/Desktop/Alparslan/MERVE/irem&Onur/Project 2/Data_3000.csv', sep=";", dec=".", header=TRUE)

head(cars)

kable(head(cars))
| Price | Brand | Model | Engine | Year | Fuel_Type | Gear_Type | Km | Engine_Power | Engine_Capacity | WD | Color | Warranty | Seller |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24900 | Opel | Corsa | 1.3 | 2004 | Diesel | Manual | 270000 | 70 | 1248 | Front_2WD | Red | No | Gallery |
| 27500 | Opel | Corsa | 1.2 | 2004 | Gasoline | Manual | 164000 | 75 | 1199 | Front_2WD | Red | No | Gallery |
| 42500 | Opel | Corsa | 1.3 | 2008 | Diesel | Semi Automatic | 300000 | 90 | 1248 | Front_2WD | Grey | No | Gallery |
| 63750 | Opel | Corsa | 1.4 | 2017 | Gasoline | Semi Automatic | 8780 | 90 | 1398 | Front_2WD | Red | Yes | Gallery |
| 25000 | Opel | Corsa | 1.3 | 2004 | Diesel | Manual | 200000 | 70 | 1248 | Front_2WD | Blue | No | Owner |
| 28500 | Opel | Corsa | 1.7 | 2003 | Diesel | Manual | 130000 | 76 | 1601 | Front_2WD | White | Yes | Gallery |

str(cars)

str(cars)
## 'data.frame':    2994 obs. of  14 variables:
##  $ Price          : int  24900 27500 42500 63750 25000 28500 29750 19250 37500 21250 ...
##  $ Brand          : chr  "Opel" "Opel" "Opel" "Opel" ...
##  $ Model          : chr  "Corsa" "Corsa" "Corsa" "Corsa" ...
##  $ Engine         : num  1.3 1.2 1.3 1.4 1.3 1.7 1.3 1.2 1.2 1.4 ...
##  $ Year           : int  2004 2004 2008 2017 2004 2003 2011 2000 2007 1999 ...
##  $ Fuel_Type      : chr  "Diesel" "Gasoline" "Diesel" "Gasoline" ...
##  $ Gear_Type      : chr  "Manual" "Manual" "Semi Automatic" "Semi Automatic" ...
##  $ Km             : int  270000 164000 300000 8780 200000 130000 197000 169000 97000 221000 ...
##  $ Engine_Power   : int  70 75 90 90 70 76 75 65 76 90 ...
##  $ Engine_Capacity: int  1248 1199 1248 1398 1248 1601 1248 1199 1201 1389 ...
##  $ WD             : chr  "Front_2WD" "Front_2WD" "Front_2WD" "Front_2WD" ...
##  $ Color          : chr  "Red" "Red" "Grey" "Red" ...
##  $ Warranty       : chr  "No" "No" "No" "Yes" ...
##  $ Seller         : chr  "Gallery" "Gallery" "Gallery" "Gallery" ...

You can observe that we have character-type features. To make the analysis easier, I converted these features to factors.

After-Factor str(cars)

cars[sapply(cars, is.character)] <- lapply(cars[sapply(cars, is.character)],as.factor)
str(cars)
## 'data.frame':    2994 obs. of  14 variables:
##  $ Price          : int  24900 27500 42500 63750 25000 28500 29750 19250 37500 21250 ...
##  $ Brand          : Factor w/ 1 level "Opel": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Model          : Factor w/ 1 level "Corsa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Engine         : num  1.3 1.2 1.3 1.4 1.3 1.7 1.3 1.2 1.2 1.4 ...
##  $ Year           : int  2004 2004 2008 2017 2004 2003 2011 2000 2007 1999 ...
##  $ Fuel_Type      : Factor w/ 3 levels "Diesel","Gasoline",..: 1 2 1 2 1 1 1 2 2 3 ...
##  $ Gear_Type      : Factor w/ 3 levels "Automatic","Manual",..: 2 2 3 3 2 2 2 2 3 2 ...
##  $ Km             : int  270000 164000 300000 8780 200000 130000 197000 169000 97000 221000 ...
##  $ Engine_Power   : int  70 75 90 90 70 76 75 65 76 90 ...
##  $ Engine_Capacity: int  1248 1199 1248 1398 1248 1601 1248 1199 1201 1389 ...
##  $ WD             : Factor w/ 3 levels "4WD","Back_2WD",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Color          : Factor w/ 18 levels "Beige","Black",..: 13 13 8 13 3 17 17 18 3 13 ...
##  $ Warranty       : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 1 1 1 ...
##  $ Seller         : Factor w/ 2 levels "Gallery","Owner": 1 1 1 1 2 1 1 1 2 2 ...

Status

status(cars)
##           variable q_zeros     p_zeros q_na p_na q_inf p_inf    type unique
## 1            Price       0 0.000000000    0    0     0     0 integer    498
## 2            Brand       0 0.000000000    0    0     0     0  factor      1
## 3            Model       0 0.000000000    0    0     0     0  factor      1
## 4           Engine       0 0.000000000    0    0     0     0 numeric      7
## 5             Year       0 0.000000000    0    0     0     0 integer     27
## 6        Fuel_Type       0 0.000000000    0    0     0     0  factor      3
## 7        Gear_Type       0 0.000000000    0    0     0     0  factor      3
## 8               Km       5 0.001670007    0    0     0     0 integer    791
## 9     Engine_Power       0 0.000000000    0    0     0     0 integer     28
## 10 Engine_Capacity       0 0.000000000    0    0     0     0 integer     18
## 11              WD       0 0.000000000    0    0     0     0  factor      3
## 12           Color       0 0.000000000    0    0     0     0  factor     18
## 13        Warranty       0 0.000000000    0    0     0     0  factor      2
## 14          Seller       0 0.000000000    0    0     0     0  factor      2

Freq

freq(cars)

##   Brand frequency percentage cumulative_perc
## 1  Opel      2994        100             100

##   Model frequency percentage cumulative_perc
## 1 Corsa      2994        100             100

##        Fuel_Type frequency percentage cumulative_perc
## 1       Gasoline      1172      39.14           39.14
## 2         Diesel      1148      38.34           77.48
## 3 Gasoline & LPG       674      22.51          100.00

##        Gear_Type frequency percentage cumulative_perc
## 1         Manual      2018       67.4            67.4
## 2      Automatic       518       17.3            84.7
## 3 Semi Automatic       458       15.3           100.0

##          WD frequency percentage cumulative_perc
## 1 Front_2WD      2987      99.77           99.77
## 2  Back_2WD         6       0.20           99.97
## 3       4WD         1       0.03          100.00

##          Color frequency percentage cumulative_perc
## 1        White       953      31.83           31.83
## 2         Grey       405      13.53           45.36
## 3  Silver Grey       382      12.76           58.12
## 4        Black       297       9.92           68.04
## 5          Red       236       7.88           75.92
## 6         Blue       226       7.55           83.47
## 7    Navy Blue        98       3.27           86.74
## 8       Smoked        94       3.14           89.88
## 9        Green        91       3.04           92.92
## 10      Yellow        62       2.07           94.99
## 11       Brown        61       2.04           97.03
## 12    Burgundy        39       1.30           98.33
## 13   Turquoise        18       0.60           98.93
## 14       Beige        10       0.33           99.26
## 15      Orange         8       0.27           99.53
## 16   Champagne         6       0.20           99.73
## 17      Purple         6       0.20           99.93
## 18        Pink         2       0.07          100.00

##   Warranty frequency percentage cumulative_perc
## 1       No      2444      81.63           81.63
## 2      Yes       550      18.37          100.00

##    Seller frequency percentage cumulative_perc
## 1   Owner      1698      56.71           56.71
## 2 Gallery      1296      43.29          100.00
## [1] "Variables processed: Brand, Model, Fuel_Type, Gear_Type, WD, Color, Warranty, Seller"

Plot_num

plot_num(cars)

Profiling_num

profiling_num(cars)
##          variable         mean      std_dev variation_coef  p_01    p_05
## 1           Price 3.530798e+04 1.274661e+04     0.36101225 14250 17750.0
## 2          Engine 1.322779e+00 1.097129e-01     0.08294120     1     1.2
## 3            Year 2.007506e+03 5.922384e+00     0.00295012  1994  1997.0
## 4              Km 1.283223e+05 7.173726e+04     0.55903958  3000 17325.0
## 5    Engine_Power 7.976119e+01 1.535335e+01     0.19249152    51    51.0
## 6 Engine_Capacity 1.288937e+03 1.013122e+02     0.07860132   998  1199.0
##      p_25     p_50     p_75     p_95     p_99   skewness  kurtosis      iqr
## 1 25000.0  33900.0  43937.5  58500.0  67552.5  0.6486791  4.348960  18937.5
## 2     1.3      1.3      1.4      1.4      1.7  0.6368272  5.785257      0.1
## 3  2004.0   2008.0   2013.0   2016.0   2017.0 -0.3453788  2.098885      9.0
## 4 71000.0 130000.0 182000.0 240000.0 300000.0  0.2302952  2.662186 111000.0
## 5    75.0     76.0     90.0    100.0    126.0  2.0245987 17.960649     15.0
## 6  1229.0   1248.0   1364.0   1401.0   1601.0  0.9501447  5.528914    135.0
##                    range_98          range_80
## 1 [14250, 67552.5000000001]    [19750, 52000]
## 2                  [1, 1.7]        [1.2, 1.4]
## 3              [1994, 2017]      [1999, 2015]
## 4             [3000, 3e+05] [30000, 218918.1]
## 5                 [51, 126]          [60, 95]
## 6               [998, 1601]      [1201, 1398]

Describe

describe(cars)
## cars 
## 
##  14  Variables      2994  Observations
## --------------------------------------------------------------------------------
## Price 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2994        0      498        1    35308    14336    17750    19750 
##      .25      .50      .75      .90      .95 
##    25000    33900    43938    52000    58500 
## 
## lowest :   7950   8250   9200   9900  10500, highest:  73750  74750  75000  75500 144750
## --------------------------------------------------------------------------------
## Brand 
##        n  missing distinct    value 
##     2994        0        1     Opel 
##                
## Value      Opel
## Frequency  2994
## Proportion    1
## --------------------------------------------------------------------------------
## Model 
##        n  missing distinct    value 
##     2994        0        1    Corsa 
##                 
## Value      Corsa
## Frequency   2994
## Proportion     1
## --------------------------------------------------------------------------------
## Engine 
##        n  missing distinct     Info     Mean      Gmd 
##     2994        0        7    0.896    1.323   0.1108 
## 
## lowest : 1.0 1.2 1.3 1.4 1.5, highest: 1.3 1.4 1.5 1.6 1.7
##                                                     
## Value        1.0   1.2   1.3   1.4   1.5   1.6   1.7
## Frequency     48   684  1052  1094    14    20    82
## Proportion 0.016 0.228 0.351 0.365 0.005 0.007 0.027
## --------------------------------------------------------------------------------
## Year 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2994        0       27    0.996     2008    6.755     1997     1999 
##      .25      .50      .75      .90      .95 
##     2004     2008     2013     2015     2016 
## 
## lowest : 1990 1992 1993 1994 1995, highest: 2013 2014 2015 2016 2017
## --------------------------------------------------------------------------------
## Fuel_Type 
##        n  missing distinct 
##     2994        0        3 
##                                                        
## Value              Diesel       Gasoline Gasoline & LPG
## Frequency            1148           1172            674
## Proportion          0.383          0.391          0.225
## --------------------------------------------------------------------------------
## Gear_Type 
##        n  missing distinct 
##     2994        0        3 
##                                                        
## Value           Automatic         Manual Semi Automatic
## Frequency             518           2018            458
## Proportion          0.173          0.674          0.153
## --------------------------------------------------------------------------------
## Km 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2994        0      791        1   128322    81774    17325    30000 
##      .25      .50      .75      .90      .95 
##    71000   130000   182000   218918   240000 
## 
## lowest :      0      1     10    160    190, highest: 360000 361000 420000 435000 460000
## --------------------------------------------------------------------------------
## Engine_Power 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2994        0       28    0.977    79.76    15.27       51       60 
##      .25      .50      .75      .90      .95 
##       75       76       90       95      100 
## 
## lowest :  45  50  51  54  58, highest: 151 176 192 201 226
## --------------------------------------------------------------------------------
## Engine_Capacity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2994        0       18     0.96     1289    101.8     1199     1201 
##      .25      .50      .75      .90      .95 
##     1229     1248     1364     1398     1401 
## 
## lowest :  973  998  999 1196 1199, highest: 1488 1598 1601 1686 1801
##                                                                             
## Value        973   998   999  1196  1199  1200  1201  1229  1248  1364  1389
## Frequency     21    17     1     9   136    98   297   407   954   339   369
## Proportion 0.007 0.006 0.000 0.003 0.045 0.033 0.099 0.136 0.319 0.113 0.123
##                                                     
## Value       1398  1401  1488  1598  1601  1686  1801
## Frequency    187    47    12    17    65    17     1
## Proportion 0.062 0.016 0.004 0.006 0.022 0.006 0.000
## --------------------------------------------------------------------------------
## WD 
##        n  missing distinct 
##     2994        0        3 
##                                         
## Value            4WD  Back_2WD Front_2WD
## Frequency          1         6      2987
## Proportion     0.000     0.002     0.998
## --------------------------------------------------------------------------------
## Color 
##        n  missing distinct 
##     2994        0       18 
## 
## lowest : Beige       Black       Blue        Brown       Burgundy   
## highest: Silver Grey Smoked      Turquoise   White       Yellow     
## 
## Beige (10, 0.003), Black (297, 0.099), Blue (226, 0.075), Brown (61, 0.020),
## Burgundy (39, 0.013), Champagne (6, 0.002), Green (91, 0.030), Grey (405,
## 0.135), Navy Blue (98, 0.033), Orange (8, 0.003), Pink (2, 0.001), Purple (6,
## 0.002), Red (236, 0.079), Silver Grey (382, 0.128), Smoked (94, 0.031),
## Turquoise (18, 0.006), White (953, 0.318), Yellow (62, 0.021)
## --------------------------------------------------------------------------------
## Warranty 
##        n  missing distinct 
##     2994        0        2 
##                       
## Value         No   Yes
## Frequency   2444   550
## Proportion 0.816 0.184
## --------------------------------------------------------------------------------
## Seller 
##        n  missing distinct 
##     2994        0        2 
##                           
## Value      Gallery   Owner
## Frequency     1296    1698
## Proportion   0.433   0.567
## --------------------------------------------------------------------------------

Data Visualization

Factor Hist

cars %>% dplyr::select(where(is.factor)) %>%
  gather(attributes, value, 1:6) %>%
  ggplot(aes(x = value)) +
  geom_bar(fill = 'lightblue2', color = 'black') +
  facet_wrap(~attributes, scales = 'free_x') +
  labs(x="Values", y="Frequency") +
  theme_bw()
## Warning: attributes are not identical across measure variables;
## they will be dropped

From the Brand and Model feature plots, we can easily observe that these features have zero variance, so we can simply drop these columns in the dimensionality reduction section.
Fuel_Type, Gear_Type and WD seem to be categorical features!

Numeric Hist

cars %>% dplyr::select(where(is.numeric)) %>%
  gather(attributes, value, 1:5) %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = 'lightblue2', color = 'black') +
  facet_wrap(~attributes, scales = 'free_x') +
  labs(x="Values", y="Frequency") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

cars$Engine <- as.factor(cars$Engine)

Engine is a numeric column, but it has only 7 unique values, so we can also treat this feature as a categorical one.

Upper Correlation Matrix

corrplot(cor(cars %>% dplyr::select(where(is.numeric))), type = 'upper', method = 'number', tl.cex = 0.9)

There is a high correlation between the Price & Year columns and between the Engine & Engine_Capacity columns. We will drop some of these highly correlated columns as a dimensionality reduction step.

Price-Year Relation

ggplot(cars, aes(x = Price, y = Year)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

You can easily observe that there is an outlier in the data… Let's treat this outlier by replacing it with the median price.

cars[cars$Price > 100000, ]$Price = median(cars$Price)
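As a side note, instead of hard-coding the 100000 threshold, an IQR-based fence could flag such prices automatically; this is only a sketch of the idea and is not applied to the data used below.

upper_fence <- quantile(cars$Price, 0.75) + 1.5 * IQR(cars$Price)  # classic 1.5*IQR upper fence
cars$Price[cars$Price > upper_fence]                                # inspect flagged prices before treating them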

Let’s check it again

Price-Year Relation After Outlier

ggplot(cars, aes(x = Price, y = Year)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

Now it seems better :)

Fuel-Type & Price Box Plot + Scatter Plot

ggplot(data = cars, aes(x=Fuel_Type, y=Price, color=Fuel_Type)) + 
  geom_boxplot()+
  scale_color_brewer(palette="Dark2") + 
  geom_jitter(shape=16, position=position_jitter(0.2))+
  labs(title = 'Price by Fuel Type',
       y='Price',x='Fuel_Type')

In this plot, we can observe that gasoline cars are more expensive than the rest.

Price & Fuel-Type Box Plot + Scatter Plot, Faceted by Gear-Type

ggplot(data = cars, aes(x=Fuel_Type,y=Price, fill=Fuel_Type)) + 
  geom_boxplot()+
  scale_fill_brewer(palette="Greens") + 
  geom_jitter(shape=16, position=position_jitter(0.2))+
  labs(title = 'Price by Fuel Type and Gear Type',
       y='Price',x='Fuel_Type')+
  facet_wrap(~Gear_Type,nrow = 1)

In this plot, we can observe that automatic and semi-automatic cars are more expensive than manual cars, while gasoline cars are still the most expensive with respect to fuel type.

Price & Fuel-Type Box Plot + Scatter Plot, Faceted by Year

ggplot(data = cars, aes(x=Fuel_Type,y=Price, fill=Fuel_Type)) + 
  geom_boxplot()+
  scale_fill_brewer(palette="Greens") + 
  geom_jitter(shape=16, position=position_jitter(0.2))+
  labs(title = 'Price by Fuel Type and Year',
       y='Price',x='Fuel_Type')+
  facet_wrap(~Year,nrow = 1)

In this plot, we can observe that Year and Price are highly positively correlated.

Create a 3D plot to visualize the most informative features: Price, Km and Year

# x, y, z variables used in the 3D scatter plot
x <- cars$Price
y <- cars$Km
z <- cars$Year

scatter3D(x, y, z,
          xlab = "Price", ylab = "Km", zlab = "Year",
          phi = 0, bty = "g", pch = 20, cex = 2,
          ticktype = "detailed", colvar = as.numeric(cars$Year))

Categorical Value Treatment & Data Normalization

Assign an index number to each level of the factor columns…

car <- cars
indx <- sapply(car, is.factor)
car[indx] <- lapply(car[indx], function(x) as.numeric(factor(x)))

To run the algorithms below, we need numerical values. These features are factors, so each factor level is simply assigned a number that identifies it.

str(car)
## 'data.frame':    2994 obs. of  14 variables:
##  $ Price          : num  24900 27500 42500 63750 25000 ...
##  $ Brand          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Model          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Engine         : num  3 2 3 4 3 7 3 2 2 4 ...
##  $ Year           : int  2004 2004 2008 2017 2004 2003 2011 2000 2007 1999 ...
##  $ Fuel_Type      : num  1 2 1 2 1 1 1 2 2 3 ...
##  $ Gear_Type      : num  2 2 3 3 2 2 2 2 3 2 ...
##  $ Km             : int  270000 164000 300000 8780 200000 130000 197000 169000 97000 221000 ...
##  $ Engine_Power   : int  70 75 90 90 70 76 75 65 76 90 ...
##  $ Engine_Capacity: int  1248 1199 1248 1398 1248 1601 1248 1199 1201 1389 ...
##  $ WD             : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ Color          : num  13 13 8 13 3 17 17 18 3 13 ...
##  $ Warranty       : num  1 1 1 2 1 2 1 1 1 1 ...
##  $ Seller         : num  1 1 1 1 2 1 1 1 2 2 ...

kable(head(car))
| Price | Brand | Model | Engine | Year | Fuel_Type | Gear_Type | Km | Engine_Power | Engine_Capacity | WD | Color | Warranty | Seller |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24900 | 1 | 1 | 3 | 2004 | 1 | 2 | 270000 | 70 | 1248 | 3 | 13 | 1 | 1 |
| 27500 | 1 | 1 | 2 | 2004 | 2 | 2 | 164000 | 75 | 1199 | 3 | 13 | 1 | 1 |
| 42500 | 1 | 1 | 3 | 2008 | 1 | 3 | 300000 | 90 | 1248 | 3 | 8 | 1 | 1 |
| 63750 | 1 | 1 | 4 | 2017 | 2 | 3 | 8780 | 90 | 1398 | 3 | 13 | 2 | 1 |
| 25000 | 1 | 1 | 3 | 2004 | 1 | 2 | 200000 | 70 | 1248 | 3 | 3 | 1 | 2 |
| 28500 | 1 | 1 | 7 | 2003 | 1 | 2 | 130000 | 76 | 1601 | 3 | 17 | 2 | 1 |
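The title of this part also mentions data normalization. Scaling is not applied to the data shown above, but a minimal sketch of standardizing the numeric columns before distance-based clustering (an optional step, shown only as an illustration) would be:

car_scaled <- as.data.frame(scale(car))  # every column rescaled to zero mean and unit variance
# note: constant columns (Brand, Model) would become NaN and should be dropped first, as done in the next section
summary(car_scaled$Price)                # quick check: the mean is now approximately 0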

Dimensionality Reduction

In the field of machine learning, it is useful to apply a process called dimensionality reduction to highly dimensional data. The purpose of this process is to reduce the number of features under consideration, where each feature is a dimension that partly represents the objects. Why is dimensionality reduction important? As more features are added, the data becomes very sparse and analysis suffers from the curse of dimensionality. Additionally, it is easier to process smaller data sets. Dimensionality reduction can be executed using two different methods:
- Selecting from the existing features (feature selection)
- Extracting new features by combining the existing features (feature extraction)
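A minimal sketch of what the two approaches look like in code (the column choices here are only illustrative; the actual steps used in this analysis follow below):

# Feature selection: keep a subset of the existing columns
selected <- car[, c("Price", "Km", "Engine_Power")]

# Feature extraction: combine existing columns into new ones, e.g. principal components
# (only non-constant numeric columns are used here, since PCA cannot rescale constant columns)
extracted <- prcomp(car[, c("Price", "Km", "Engine_Power", "Engine_Capacity")], scale. = TRUE)$x[, 1:2]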

Zero-Variance Analysis

kable(profiling_num(car))
variable mean std_dev variation_coef p_01 p_05 p_25 p_50 p_75 p_95 p_99 skewness kurtosis iqr range_98 range_80
Price 3.527096e+04 1.258863e+04 0.3569120 14250 17750 25000 33900 43900 58500 67500 0.4626585 2.667568 18900 [14250, 67500] [19750, 52000]
Brand 1.000000e+00 0.000000e+00 0.0000000 1 1 1 1 1 1 1 NaN NaN 0 [1, 1] [1, 1]
Model 1.000000e+00 0.000000e+00 0.0000000 1 1 1 1 1 1 1 NaN NaN 0 [1, 1] [1, 1]
Engine 3.243821e+00 1.056381e+00 0.3256595 1 2 3 3 4 4 7 0.9749908 5.589242 1 [1, 7] [2, 4]
Year 2.007506e+03 5.922384e+00 0.0029501 1994 1997 2004 2008 2013 2016 2017 -0.3453788 2.098885 9 [1994, 2017] [1999, 2015]
Fuel_Type 1.841683e+00 7.639903e-01 0.4148326 1 1 1 2 2 3 3 0.2754706 1.756249 1 [1, 3] [1, 3]
Gear_Type 1.979960e+00 5.706947e-01 0.2882355 1 1 2 2 2 3 3 -0.0024646 3.067447 0 [1, 3] [1, 3]
Km 1.283223e+05 7.173726e+04 0.5590396 3000 17325 71000 130000 182000 240000 300000 0.2302952 2.662186 111000 [3000, 3e+05] [30000, 218918.1]
Engine_Power 7.976119e+01 1.535335e+01 0.1924915 51 51 75 76 90 100 126 2.0245987 17.960649 15 [51, 126] [60, 95]
Engine_Capacity 1.288937e+03 1.013122e+02 0.0786013 998 1199 1229 1248 1364 1401 1601 0.9501447 5.528914 135 [998, 1601] [1201, 1398]
WD 2.997328e+00 5.774070e-02 0.0192641 3 3 3 3 3 3 3 -24.1633816 657.018707 0 [3, 3] [3, 3]
Color 1.139646e+01 5.567746e+00 0.4885505 2 2 8 14 17 17 18 -0.4894941 1.738998 9 [2, 18] [2, 17]
Warranty 1.183701e+00 3.873046e-01 0.3271981 1 1 1 1 1 2 2 1.6336087 3.668677 0 [1, 2] [1, 2]
Seller 1.567134e+00 4.955553e-01 0.3162175 1 1 1 2 2 2 2 -0.2709909 1.073436 1 [1, 2] [1, 2]

There are two columns that have zero variance -> Brand, Model.
Drop these zero variance columns for dimension reduction.

drops <- c("Brand","Model")
car <- car[ , !(names(car) %in% drops)]
kable(head(car))
| Price | Engine | Year | Fuel_Type | Gear_Type | Km | Engine_Power | Engine_Capacity | WD | Color | Warranty | Seller |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 24900 | 3 | 2004 | 1 | 2 | 270000 | 70 | 1248 | 3 | 13 | 1 | 1 |
| 27500 | 2 | 2004 | 2 | 2 | 164000 | 75 | 1199 | 3 | 13 | 1 | 1 |
| 42500 | 3 | 2008 | 1 | 3 | 300000 | 90 | 1248 | 3 | 8 | 1 | 1 |
| 63750 | 4 | 2017 | 2 | 3 | 8780 | 90 | 1398 | 3 | 13 | 2 | 1 |
| 25000 | 3 | 2004 | 1 | 2 | 200000 | 70 | 1248 | 3 | 3 | 1 | 2 |
| 28500 | 7 | 2003 | 1 | 2 | 130000 | 76 | 1601 | 3 | 17 | 2 | 1 |

Correlation Analysis

corrplot(cor(car[sapply(car, is.numeric)]), type = 'upper', method = 'number', tl.cex = 0.9)

From the correlation analysis, there is a high correlation between Price and Year, which is expected, and also between Engine and Engine_Capacity.
We can reduce dimensionality using these highly correlated features: if two features are highly correlated, they carry largely the same information and have a similar effect on a machine learning model, so keeping only one of them loses little.
The candidates for dimensionality reduction are Price & Year and Engine & Engine_Capacity. Let's remove the Engine and Year features from the data.
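If the caret package is available, findCorrelation() can suggest which columns to drop above a chosen correlation cutoff; this is just a sketch of an automated alternative to the manual drop performed below (it assumes caret can be installed with the packages() helper defined earlier):

packages(caret)                                                    # assumption: caret is installable from CRAN
high_corr <- caret::findCorrelation(cor(car), cutoff = 0.75, names = TRUE)
high_corr                                                          # candidate columns to drop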

car <- car[!names(car) %in% c("Engine","Year")]
kable(head(car))
| Price | Fuel_Type | Gear_Type | Km | Engine_Power | Engine_Capacity | WD | Color | Warranty | Seller |
|---|---|---|---|---|---|---|---|---|---|
| 24900 | 1 | 2 | 270000 | 70 | 1248 | 3 | 13 | 1 | 1 |
| 27500 | 2 | 2 | 164000 | 75 | 1199 | 3 | 13 | 1 | 1 |
| 42500 | 1 | 3 | 300000 | 90 | 1248 | 3 | 8 | 1 | 1 |
| 63750 | 2 | 3 | 8780 | 90 | 1398 | 3 | 13 | 2 | 1 |
| 25000 | 1 | 2 | 200000 | 70 | 1248 | 3 | 3 | 1 | 2 |
| 28500 | 1 | 2 | 130000 | 76 | 1601 | 3 | 17 | 2 | 1 |

PCA (Principal Component Analysis)

In this chapter we are going to use PCA for dimensionality reduction.
The main technique for feature extraction is Principal Component Analysis (PCA). PCA finds the best linear transformation for reducing the number of dimensions with a minimum loss of information. Sometimes the information that is lost is regarded as noise – information that does not represent the phenomena we are trying to model, but is rather a side effect of some usually unknown processes. The PCA process can be visualized as follows:
PCA-1
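For reference, FactoMineR::PCA() standardizes the variables by default; a minimal base-R sketch of the same idea with prcomp() (shown only for comparison, not used further) would be:

pca_base <- prcomp(car, center = TRUE, scale. = TRUE)  # standardize the columns, then rotate onto the principal axes
summary(pca_base)                                       # proportion of variance explained by each component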

Basic PCA

res.pca <- PCA(car)

In the PCA graph above, we can observe that Engine_Capacity, Km and Price are the features that contribute most to the first principal components, i.e. they summarize most of the variation in the data.

Extract Eigenvalue / Variance

Let's extract the eigenvalues / variances of the principal components and show the percentage of variance explained by each dimension.

get_eig(res.pca)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   2.1644722        21.644722                    21.64472
## Dim.2   1.6687576        16.687576                    38.33230
## Dim.3   1.1506568        11.506568                    49.83887
## Dim.4   1.0095588        10.095588                    59.93445
## Dim.5   0.9767795         9.767795                    69.70225
## Dim.6   0.9006474         9.006474                    78.70872
## Dim.7   0.8095553         8.095553                    86.80428
## Dim.8   0.7098923         7.098923                    93.90320
## Dim.9   0.4378012         4.378012                    98.28121
## Dim.10  0.1718790         1.718790                   100.00000
fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 50))

Result: We can retain ~94% of the total variance with 8 dimensions rather than 10.
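The same kind of cut-off can be read off programmatically from the eigenvalue table above; a small sketch using the res.pca object:

eig <- get_eig(res.pca)
n_comp <- which(eig$cumulative.variance.percent >= 90)[1]  # smallest number of components covering at least 90% of the variance (8 here)
n_comp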

Extract the results for variables
Now let’s merge these eigenvalue results with feature names and see which features are more important than the others.

var <- get_pca_var(res.pca)
# Coordinates of variables
kable(var$coord)
| | Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
|---|---|---|---|---|---|
| Price | 0.8718555 | -0.2988576 | -0.1361422 | 0.0353816 | 0.1660124 |
| Fuel_Type | -0.0121482 | 0.5814507 | -0.3883851 | -0.0949452 | -0.3446402 |
| Gear_Type | -0.2597633 | -0.4660614 | -0.0428322 | 0.2357834 | 0.1587561 |
| Km | -0.8257640 | 0.2318551 | 0.3166230 | -0.0158091 | -0.1115312 |
| Engine_Power | 0.4916763 | 0.5797274 | 0.0860130 | 0.1487164 | 0.2790065 |
| Engine_Capacity | 0.2112345 | 0.7433600 | 0.3590415 | -0.0137531 | 0.1458924 |
| WD | 0.0562266 | 0.0496148 | 0.1630968 | 0.9134135 | -0.1052629 |
| Color | 0.0630850 | -0.2169561 | 0.7436052 | -0.2137706 | 0.2544330 |
| Warranty | 0.5427151 | -0.1010429 | 0.1773395 | -0.1953049 | -0.3648084 |
| Seller | -0.2584240 | 0.1482482 | -0.3631621 | -0.0545596 | 0.6962536 |
# Contribution of variables
kable(var$contrib)
| | Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
|---|---|---|---|---|---|
| Price | 35.1185872 | 5.3522369 | 1.6107934 | 0.1240004 | 2.821528 |
| Fuel_Type | 0.0068182 | 20.2596767 | 13.1092968 | 0.8929233 | 12.160050 |
| Gear_Type | 3.1174798 | 13.0164648 | 0.1594393 | 5.5067415 | 2.580266 |
| Km | 31.5035808 | 3.2213666 | 8.7124279 | 0.0247561 | 1.273491 |
| Engine_Power | 11.1687989 | 20.1397679 | 0.6429574 | 2.1907155 | 7.969517 |
| Engine_Capacity | 2.0614740 | 33.1135057 | 11.2032380 | 0.0187356 | 2.179058 |
| WD | 0.1460601 | 0.1475126 | 2.3117737 | 82.6424692 | 1.134368 |
| Color | 0.1838654 | 2.8206573 | 48.0550488 | 4.5265188 | 6.627507 |
| Warranty | 13.6079195 | 0.6118120 | 2.7331609 | 3.7782829 | 13.624891 |
| Seller | 3.0854160 | 1.3169994 | 11.4618637 | 0.2948567 | 49.629324 |
# Control variable colors using their contributions
fviz_pca_var(res.pca, col.var="contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
)

# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)

# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

Extract the results for individuals
Let’s find results for individuals.

ind <- get_pca_ind(res.pca)
# Coordinates of individuals
kable(head(ind$coord))
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
|---|---|---|---|---|
| -1.8244337 | -0.6222581 | 1.4147255 | 0.0988391 | -0.7726425 |
| -0.8439338 | -0.4782340 | 0.3425005 | 0.0604904 | -1.0071373 |
| -1.1430736 | -0.7674374 | 0.7725699 | 0.9366888 | -0.1668707 |
| 3.3292843 | -0.7982184 | 0.4310606 | 0.2229979 | -0.3849839 |
| -1.7034901 | -0.2659470 | -0.8029131 | 0.3870671 | 0.2982094 |
| 1.0550205 | 0.8193233 | 2.9250885 | -0.5050588 | -0.6477896 |
# Contributions of individuals
kable(head(ind$contrib))
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
|---|---|---|---|---|
| 0.0513632 | 0.0077499 | 0.0580961 | 0.0003232 | 0.0204131 |
| 0.0109904 | 0.0045776 | 0.0034051 | 0.0001211 | 0.0346840 |
| 0.0201625 | 0.0117880 | 0.0173252 | 0.0290273 | 0.0009522 |
| 0.1710401 | 0.0127526 | 0.0053936 | 0.0016452 | 0.0050680 |
| 0.0447791 | 0.0014156 | 0.0187128 | 0.0049567 | 0.0030408 |
| 0.0171758 | 0.0134359 | 0.2483593 | 0.0084392 | 0.0143489 |
var<-get_pca_var(res.pca)
a<-fviz_contrib(res.pca, "var", axes=1, xtickslab.rt=90) # default angle=45°
b<-fviz_contrib(res.pca, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')

RESULT FOR PCA

Based on the contribution graphs above, both for the variables and for the individuals, we can simply drop the WD, Color and Seller features from the dataset. Since this reduces the complexity of the data, we should not lose much clustering accuracy, which is exactly what we want from dimensionality reduction.

car <- car[!names(car) %in% c("WD", "Color","Seller")]
kable(head(car))
| Price | Fuel_Type | Gear_Type | Km | Engine_Power | Engine_Capacity | Warranty |
|---|---|---|---|---|---|---|
| 24900 | 1 | 2 | 270000 | 70 | 1248 | 1 |
| 27500 | 2 | 2 | 164000 | 75 | 1199 | 1 |
| 42500 | 1 | 3 | 300000 | 90 | 1248 | 1 |
| 63750 | 2 | 3 | 8780 | 90 | 1398 | 2 |
| 25000 | 1 | 2 | 200000 | 70 | 1248 | 1 |
| 28500 | 1 | 2 | 130000 | 76 | 1601 | 2 |

Clustering

In this chapter we are going to use the K-Means, PAM and Hierarchical clustering methods.
Clustering is the task of dividing the data points into a number of groups such that points in the same group are more similar to each other than to points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

K-MEANS

There are many clustering algorithms, each with its advantages and disadvantages. A popular algorithm for clustering is k-means, which aims to identify the best k cluster centers in an iterative manner. Cluster centers serve as “representatives” of the objects associated with their cluster. k-means’ key features are also its drawbacks:
- The number of clusters (k) must be given explicitly. In some cases, the number of different groups is unknown.
- k-means’ iterative nature might lead to an incorrect result due to convergence to a local minimum.
- The clusters are assumed to be spherical.
Despite these drawbacks, k-means remains a popular and often suitable choice. An example of clustering spherical data with k-means can be seen in the figure below.
cluster
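The local-minimum drawback is the reason the kmeans() calls below use nstart = 25: the algorithm is rerun from 25 random starts and the solution with the lowest total within-cluster sum of squares is kept. A small sketch of the idea (the exact numbers depend on the random seed):

set.seed(1)
km_single <- kmeans(car, centers = 3, nstart = 1)   # one random start: may end up in a poor local minimum
km_multi  <- kmeans(car, centers = 3, nstart = 25)  # best solution out of 25 random starts
c(km_single$tot.withinss, km_multi$tot.withinss)    # the multi-start result is typically at least as good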

K-Means With Different Center Points

Let’s compare k-means with k = [2,3,4,5]

set.seed(123)
cars_K2 <- kmeans(car, centers = 2, nstart = 25)
cars_K3 <- kmeans(car, centers = 3, nstart = 25)
cars_K4 <- kmeans(car, centers = 4, nstart = 25)
cars_K5 <- kmeans(car, centers = 5, nstart = 25)
p1 <- fviz_cluster(cars_K2, geom = "point", data = car) + ggtitle(" K = 2")
p2 <- fviz_cluster(cars_K3, geom = "point", data = car) + ggtitle(" K = 3")
p3 <- fviz_cluster(cars_K4, geom = "point", data = car) + ggtitle(" K = 4")
p4 <- fviz_cluster(cars_K5, geom = "point", data = car) + ggtitle(" K = 5")
grid.arrange(p1, p2, p3, p4, nrow = 2)

Find Optimal Number of Clusters K

Let's compare the possible numbers of clusters using silhouette and elbow (within-cluster sum of squares) graphs.

f1 <- fviz_nbclust(car, FUNcluster = kmeans, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(car, FUNcluster = kmeans, method = "wss") + 
  ggtitle("Optimal number of clusters \n K-means")
grid.arrange(f1, f2, ncol=2)

From the silhouette and elbow graphs, we can observe that there are two candidates for the number of clusters: 2 and 3. Let's take a deeper look at both.

K-Means With 2 and 3 Number of Clusters

Let’s compare k-means with number of clusters 2 and 3.

km2 <- eclust(car, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c2 <- fviz_cluster(km2, data=car, elipse.type="convex", geom=c("point")) + ggtitle("K-means with 2 clusters")
s2 <- fviz_silhouette(km2)
##   cluster size ave.sil.width
## 1       1 1539          0.58
## 2       2 1455          0.60
grid.arrange(c2, s2, ncol=2)

km3 <- eclust(car, k=3, FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c3 <- fviz_cluster(km3, data=car, elipse.type="convex", geom=c("point")) + ggtitle("K-means with 3 clusters")
s3 <- fviz_silhouette(km3)
##   cluster size ave.sil.width
## 1       1  875          0.52
## 2       2  958          0.59
## 3       3 1161          0.47
grid.arrange(c3, s3, ncol=2)

We can conclude that the optimal number of clusters for k-means is 3.
Let's check the results for the other methods –> PAM, hierarchical clustering.

PAM

The PAM algorithm is based on the search for k representative objects or medoids among the observations of the data set.

After finding a set of k medoids, clusters are constructed by assigning each observation to the nearest medoid. Next, each selected medoid m and each non-medoid data point are swapped and the objective function is computed. The objective function corresponds to the sum of the dissimilarities of all objects to their nearest medoid.

The SWAP step attempts to improve the quality of the clustering by exchanging selected objects (medoids) and non-selected objects. If the objective function can be reduced by interchanging a selected object with an unselected object, then the swap is carried out. This is continued until the objective function can no longer be decreased. The goal is to find k representative objects which minimize the sum of the dissimilarities of the observations to their closest representative object.

pam
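The eclust() calls below use cluster::pam() under the hood; calling pam() directly also exposes the medoids, i.e. the actual observations chosen as cluster representatives (a short sketch on the reduced car data):

pam_fit <- cluster::pam(car, k = 3, metric = "euclidean")
pam_fit$medoids   # one real observation per cluster, used as its representative
pam_fit$id.med    # row indices of those medoid observations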

PAM With 2 and 3 Number of Clusters

Let’s compare PAM with number of clusters 2 and 3.

pam2 <- eclust(car, k=2 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp2 <- fviz_cluster(pam2, data=car, elipse.type="convex", geom=c("point")) + ggtitle("PAM with 2 clusters")
sp2 <- fviz_silhouette(pam2)
##   cluster size ave.sil.width
## 1       1 1566          0.57
## 2       2 1428          0.61
grid.arrange(cp2, sp2, ncol=2)

pam3 <- eclust(car, k=3 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp3 <- fviz_cluster(pam3, data=car, elipse.type="convex", geom=c("point")) + ggtitle("PAM with 3 clusters")
sp3 <- fviz_silhouette(pam3)
##   cluster size ave.sil.width
## 1       1 1136          0.47
## 2       2  863          0.58
## 3       3  995          0.52
grid.arrange(cp3, sp3, ncol=2)

We can conclude that the optimal number of clusters for PAM is 3, which is the same conclusion as for k-means.
Let's check the results for the remaining method –> hierarchical clustering.

HIERARCHICAL CLUSTERING


Hierarchical clustering is where you build a cluster tree (a dendrogram) to represent the data, in which each group (or “node”) links to two or more successor groups. The groups are nested and organized as a tree, which ideally ends up as a meaningful classification scheme. The distances between data points are calculated with a chosen distance measure and collected in a distance matrix.

Each node in the cluster tree contains a group of similar data points; nodes are placed on the graph next to other, similar nodes. Clusters at one level join with clusters at the next level up, according to their degree of similarity. The process carries on until all nodes are in the tree, which gives a visual snapshot of the data contained in the whole set. The total number of clusters is not predetermined before you start building the tree.

What is a Dendrogram?
dendrogram
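Different linkage methods (how the distance between two merged groups is computed from the pairwise distances) can produce quite different trees. One way to compare them, shown here only as a sketch, is the cophenetic correlation: how well the distances implied by the tree preserve the original distance matrix. The analysis below sticks with ward.D2.

d_euc <- dist(car, method = "euclidean")
for (m in c("average", "complete", "ward.D2")) {
  hc <- hclust(d_euc, method = m)
  cat(m, ":", round(cor(d_euc, cophenetic(hc)), 3), "\n")  # closer to 1 = the tree preserves the original distances better
}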

HIERARCHICAL CLUSTERING with Euclidean Distance

Let's compare 2 and 3 clusters using Euclidean distance in hierarchical clustering.

discar<- dist(car, method = "euclidean")
hcar<-hclust(discar, method="ward.D2")
plot(hcar)
groups <- cutree(hcar, k=2) 
rect.hclust(hcar, k=2, border="red") 

discar<- dist(car, method = "euclidean")
hcar<-hclust(discar, method="ward.D2")
plot(hcar)
groups <- cutree(hcar, k=3)
rect.hclust(hcar, k=3, border="red") 

HIERARCHICAL CLUSTERING with Manhattan Distance

Let's compare 2 and 3 clusters using Manhattan distance in hierarchical clustering.

d <- dist(car, method = "manhattan")
H.fit <- hclust(d, method="ward.D2")
plot(H.fit) 
groups <- cutree(H.fit, k=2) 
rect.hclust(H.fit, k=2, border="red") 

d <- dist(car, method = "manhattan") 
H.fit <- hclust(d, method="ward.D2")
plot(H.fit)
groups <- cutree(H.fit, k=3)
rect.hclust(H.fit, k=3, border="red") 

BEST CLUSTER

RESULT: As you can observe from the clustering methods and the search for an optimal number of clusters above, we conclude the following:
- Number of clusters: 3
- Clustering method: K-means clustering
- Distance calculation method: both Euclidean and Manhattan can be used

set.seed(123)
final <- kmeans(car, centers = 3, nstart = 25)
print(final$centers)
##      Price Fuel_Type Gear_Type        Km Engine_Power Engine_Capacity Warranty
## 1 48271.72  1.907098  1.902923  46022.23     85.21294        1295.738 1.328810
## 2 24139.60  1.896000  2.010286 213999.41     76.89486        1312.898 1.113143
## 3 32932.63  1.746770  2.020672 131660.87     77.42291        1265.267 1.117140

FINAL PLOT FOR K-MEANS WITH 3 CLUSTERS

f1 <- fviz_cluster(final, car)
f2 <- fviz_silhouette(silhouette(final$cluster,discar))
##   cluster size ave.sil.width
## 1       1  958          0.59
## 2       2  875          0.52
## 3       3 1161          0.47
grid.arrange(f1, f2, ncol=2)

PLOT FOR EACH FEATURE WITH CLUSTER

plot(car,col=final$cluster)

PLOT FOR PRICE & KM WITH CLUSTER

hcar.a<- hclust(discar,method = "average")
hcaralabel<- cutree(hcar.a,3)
hcar.labels<- cutree(hcar,3)

car<- data.frame(car)
ggplot(car, aes(Price, Km)) + geom_point(aes(col = hcar.labels), show.legend = F) +
  scale_color_gradient(low = "blue", high = "red")  ## change the colour along the scale