Data clustering is the assignment of objects to homogeneous groups (called clusters) while making sure that objects in different groups are dissimilar. Clustering is considered an unsupervised task as it aims to describe the hidden structure of the objects. Each object is described by a set of characteristics called features. The first step of dividing objects into clusters is to define the distance between the different objects. Defining an adequate distance measure is crucial for the success of the clustering process.
In the field of machine learning, it is useful to apply a process called dimensionality reduction to highly dimensional data. The purpose of this process is to reduce the number of features under consideration, where each feature is a dimension that partly represents the objects.
Dimensionality reduction can be executed using two different methods:
- Selecting from the existing features (feature selection)
- Extracting new features by combining the existing features (feature extraction)
In this paper, the Opel Corsa dataset is analyzed, and working through both data clustering and dimensionality reduction approaches should give more insight into how to use these techniques together in day-to-day analysis. The analysis is divided into three main parts.
1. Exploratory Data analysis
- Statistical Analysis, Visualization, Data Transformation
2. Dimensionality Reduction
- Drop no variance features
- Correlation analysis to check similarity between features
- PCA (Principal Component Analysis) for dimensionality reduction + Feature Engineering
3. Clustering
- K-Means
- PAM
- Hierarchical Clustering
In this part, you can observe a statistical analysis of the given data, along with some visualizations that give an insight into the distribution of the data.
For data transformation, character type features are converted into factors and later encoded with numeric values that identify their categories.
# Helper: load a package, installing it from CRAN first if it is missing
packages <- function(x){
  x <- as.character(match.call()[[2]])
  if (!require(x, character.only = TRUE)){
    install.packages(pkgs = x, repos = "http://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}
packages(tidyverse)
packages(dplyr)
packages(ggplot2)
packages(broom)
packages(rpart)
packages(rpart.plot)
packages(rattle)
packages(cowplot)
packages(knitr)
packages(corrplot)
packages(gridExtra)
packages(GGally)
packages(cluster) # clustering algorithms
packages(factoextra) # clustering algorithms & visualization
packages(funModeling)
packages(Hmisc)
packages(FactoMineR)
packages(plot3D)
cars <- read.csv('C:/Users/ALPARSLAN/Desktop/Alparslan/MERVE/irem&Onur/Project 2/Data_3000.csv', sep=";", dec=".", header=TRUE)
| Price | Brand | Model | Engine | Year | Fuel_Type | Gear_Type | Km | Engine_Power | Engine_Capacity | WD | Color | Warranty | Seller |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24900 | Opel | Corsa | 1.3 | 2004 | Diesel | Manual | 270000 | 70 | 1248 | Front_2WD | Red | No | Gallery |
| 27500 | Opel | Corsa | 1.2 | 2004 | Gasoline | Manual | 164000 | 75 | 1199 | Front_2WD | Red | No | Gallery |
| 42500 | Opel | Corsa | 1.3 | 2008 | Diesel | Semi Automatic | 300000 | 90 | 1248 | Front_2WD | Grey | No | Gallery |
| 63750 | Opel | Corsa | 1.4 | 2017 | Gasoline | Semi Automatic | 8780 | 90 | 1398 | Front_2WD | Red | Yes | Gallery |
| 25000 | Opel | Corsa | 1.3 | 2004 | Diesel | Manual | 200000 | 70 | 1248 | Front_2WD | Blue | No | Owner |
| 28500 | Opel | Corsa | 1.7 | 2003 | Diesel | Manual | 130000 | 76 | 1601 | Front_2WD | White | Yes | Gallery |
## 'data.frame': 2994 obs. of 14 variables:
## $ Price : int 24900 27500 42500 63750 25000 28500 29750 19250 37500 21250 ...
## $ Brand : chr "Opel" "Opel" "Opel" "Opel" ...
## $ Model : chr "Corsa" "Corsa" "Corsa" "Corsa" ...
## $ Engine : num 1.3 1.2 1.3 1.4 1.3 1.7 1.3 1.2 1.2 1.4 ...
## $ Year : int 2004 2004 2008 2017 2004 2003 2011 2000 2007 1999 ...
## $ Fuel_Type : chr "Diesel" "Gasoline" "Diesel" "Gasoline" ...
## $ Gear_Type : chr "Manual" "Manual" "Semi Automatic" "Semi Automatic" ...
## $ Km : int 270000 164000 300000 8780 200000 130000 197000 169000 97000 221000 ...
## $ Engine_Power : int 70 75 90 90 70 76 75 65 76 90 ...
## $ Engine_Capacity: int 1248 1199 1248 1398 1248 1601 1248 1199 1201 1389 ...
## $ WD : chr "Front_2WD" "Front_2WD" "Front_2WD" "Front_2WD" ...
## $ Color : chr "Red" "Red" "Grey" "Red" ...
## $ Warranty : chr "No" "No" "No" "Yes" ...
## $ Seller : chr "Gallery" "Gallery" "Gallery" "Gallery" ...
You can observe that we have character type features. To make the analysis easier, I factorized these features.
cars[sapply(cars, is.character)] <- lapply(cars[sapply(cars, is.character)],as.factor)
str(cars)
## 'data.frame': 2994 obs. of 14 variables:
## $ Price : int 24900 27500 42500 63750 25000 28500 29750 19250 37500 21250 ...
## $ Brand : Factor w/ 1 level "Opel": 1 1 1 1 1 1 1 1 1 1 ...
## $ Model : Factor w/ 1 level "Corsa": 1 1 1 1 1 1 1 1 1 1 ...
## $ Engine : num 1.3 1.2 1.3 1.4 1.3 1.7 1.3 1.2 1.2 1.4 ...
## $ Year : int 2004 2004 2008 2017 2004 2003 2011 2000 2007 1999 ...
## $ Fuel_Type : Factor w/ 3 levels "Diesel","Gasoline",..: 1 2 1 2 1 1 1 2 2 3 ...
## $ Gear_Type : Factor w/ 3 levels "Automatic","Manual",..: 2 2 3 3 2 2 2 2 3 2 ...
## $ Km : int 270000 164000 300000 8780 200000 130000 197000 169000 97000 221000 ...
## $ Engine_Power : int 70 75 90 90 70 76 75 65 76 90 ...
## $ Engine_Capacity: int 1248 1199 1248 1398 1248 1601 1248 1199 1201 1389 ...
## $ WD : Factor w/ 3 levels "4WD","Back_2WD",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Color : Factor w/ 18 levels "Beige","Black",..: 13 13 8 13 3 17 17 18 3 13 ...
## $ Warranty : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 1 1 1 ...
## $ Seller : Factor w/ 2 levels "Gallery","Owner": 1 1 1 1 2 1 1 1 2 2 ...
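The code chunk behind the summaries printed below is not shown in the knit; a plausible sketch of the calls, assuming the funModeling and Hmisc helpers loaded above were used (the exact chunk is an assumption):
# Variable status: zero / NA / Inf counts and type per variable (funModeling)
df_status(cars)
# Frequency tables for the factor variables (funModeling)
freq(cars)
# Distribution statistics for the numeric variables (funModeling)
profiling_num(cars)
# Per-variable summary of the whole data frame (Hmisc)
describe(cars)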
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 Price 0 0.000000000 0 0 0 0 integer 498
## 2 Brand 0 0.000000000 0 0 0 0 factor 1
## 3 Model 0 0.000000000 0 0 0 0 factor 1
## 4 Engine 0 0.000000000 0 0 0 0 numeric 7
## 5 Year 0 0.000000000 0 0 0 0 integer 27
## 6 Fuel_Type 0 0.000000000 0 0 0 0 factor 3
## 7 Gear_Type 0 0.000000000 0 0 0 0 factor 3
## 8 Km 5 0.001670007 0 0 0 0 integer 791
## 9 Engine_Power 0 0.000000000 0 0 0 0 integer 28
## 10 Engine_Capacity 0 0.000000000 0 0 0 0 integer 18
## 11 WD 0 0.000000000 0 0 0 0 factor 3
## 12 Color 0 0.000000000 0 0 0 0 factor 18
## 13 Warranty 0 0.000000000 0 0 0 0 factor 2
## 14 Seller 0 0.000000000 0 0 0 0 factor 2
## Brand frequency percentage cumulative_perc
## 1 Opel 2994 100 100
## Model frequency percentage cumulative_perc
## 1 Corsa 2994 100 100
## Fuel_Type frequency percentage cumulative_perc
## 1 Gasoline 1172 39.14 39.14
## 2 Diesel 1148 38.34 77.48
## 3 Gasoline & LPG 674 22.51 100.00
## Gear_Type frequency percentage cumulative_perc
## 1 Manual 2018 67.4 67.4
## 2 Automatic 518 17.3 84.7
## 3 Semi Automatic 458 15.3 100.0
## WD frequency percentage cumulative_perc
## 1 Front_2WD 2987 99.77 99.77
## 2 Back_2WD 6 0.20 99.97
## 3 4WD 1 0.03 100.00
## Color frequency percentage cumulative_perc
## 1 White 953 31.83 31.83
## 2 Grey 405 13.53 45.36
## 3 Silver Grey 382 12.76 58.12
## 4 Black 297 9.92 68.04
## 5 Red 236 7.88 75.92
## 6 Blue 226 7.55 83.47
## 7 Navy Blue 98 3.27 86.74
## 8 Smoked 94 3.14 89.88
## 9 Green 91 3.04 92.92
## 10 Yellow 62 2.07 94.99
## 11 Brown 61 2.04 97.03
## 12 Burgundy 39 1.30 98.33
## 13 Turquoise 18 0.60 98.93
## 14 Beige 10 0.33 99.26
## 15 Orange 8 0.27 99.53
## 16 Champagne 6 0.20 99.73
## 17 Purple 6 0.20 99.93
## 18 Pink 2 0.07 100.00
## Warranty frequency percentage cumulative_perc
## 1 No 2444 81.63 81.63
## 2 Yes 550 18.37 100.00
## Seller frequency percentage cumulative_perc
## 1 Owner 1698 56.71 56.71
## 2 Gallery 1296 43.29 100.00
## [1] "Variables processed: Brand, Model, Fuel_Type, Gear_Type, WD, Color, Warranty, Seller"
## variable mean std_dev variation_coef p_01 p_05
## 1 Price 3.530798e+04 1.274661e+04 0.36101225 14250 17750.0
## 2 Engine 1.322779e+00 1.097129e-01 0.08294120 1 1.2
## 3 Year 2.007506e+03 5.922384e+00 0.00295012 1994 1997.0
## 4 Km 1.283223e+05 7.173726e+04 0.55903958 3000 17325.0
## 5 Engine_Power 7.976119e+01 1.535335e+01 0.19249152 51 51.0
## 6 Engine_Capacity 1.288937e+03 1.013122e+02 0.07860132 998 1199.0
## p_25 p_50 p_75 p_95 p_99 skewness kurtosis iqr
## 1 25000.0 33900.0 43937.5 58500.0 67552.5 0.6486791 4.348960 18937.5
## 2 1.3 1.3 1.4 1.4 1.7 0.6368272 5.785257 0.1
## 3 2004.0 2008.0 2013.0 2016.0 2017.0 -0.3453788 2.098885 9.0
## 4 71000.0 130000.0 182000.0 240000.0 300000.0 0.2302952 2.662186 111000.0
## 5 75.0 76.0 90.0 100.0 126.0 2.0245987 17.960649 15.0
## 6 1229.0 1248.0 1364.0 1401.0 1601.0 0.9501447 5.528914 135.0
## range_98 range_80
## 1 [14250, 67552.5000000001] [19750, 52000]
## 2 [1, 1.7] [1.2, 1.4]
## 3 [1994, 2017] [1999, 2015]
## 4 [3000, 3e+05] [30000, 218918.1]
## 5 [51, 126] [60, 95]
## 6 [998, 1601] [1201, 1398]
## cars
##
## 14 Variables 2994 Observations
## --------------------------------------------------------------------------------
## Price
## n missing distinct Info Mean Gmd .05 .10
## 2994 0 498 1 35308 14336 17750 19750
## .25 .50 .75 .90 .95
## 25000 33900 43938 52000 58500
##
## lowest : 7950 8250 9200 9900 10500, highest: 73750 74750 75000 75500 144750
## --------------------------------------------------------------------------------
## Brand
## n missing distinct value
## 2994 0 1 Opel
##
## Value Opel
## Frequency 2994
## Proportion 1
## --------------------------------------------------------------------------------
## Model
## n missing distinct value
## 2994 0 1 Corsa
##
## Value Corsa
## Frequency 2994
## Proportion 1
## --------------------------------------------------------------------------------
## Engine
## n missing distinct Info Mean Gmd
## 2994 0 7 0.896 1.323 0.1108
##
## lowest : 1.0 1.2 1.3 1.4 1.5, highest: 1.3 1.4 1.5 1.6 1.7
##
## Value 1.0 1.2 1.3 1.4 1.5 1.6 1.7
## Frequency 48 684 1052 1094 14 20 82
## Proportion 0.016 0.228 0.351 0.365 0.005 0.007 0.027
## --------------------------------------------------------------------------------
## Year
## n missing distinct Info Mean Gmd .05 .10
## 2994 0 27 0.996 2008 6.755 1997 1999
## .25 .50 .75 .90 .95
## 2004 2008 2013 2015 2016
##
## lowest : 1990 1992 1993 1994 1995, highest: 2013 2014 2015 2016 2017
## --------------------------------------------------------------------------------
## Fuel_Type
## n missing distinct
## 2994 0 3
##
## Value Diesel Gasoline Gasoline & LPG
## Frequency 1148 1172 674
## Proportion 0.383 0.391 0.225
## --------------------------------------------------------------------------------
## Gear_Type
## n missing distinct
## 2994 0 3
##
## Value Automatic Manual Semi Automatic
## Frequency 518 2018 458
## Proportion 0.173 0.674 0.153
## --------------------------------------------------------------------------------
## Km
## n missing distinct Info Mean Gmd .05 .10
## 2994 0 791 1 128322 81774 17325 30000
## .25 .50 .75 .90 .95
## 71000 130000 182000 218918 240000
##
## lowest : 0 1 10 160 190, highest: 360000 361000 420000 435000 460000
## --------------------------------------------------------------------------------
## Engine_Power
## n missing distinct Info Mean Gmd .05 .10
## 2994 0 28 0.977 79.76 15.27 51 60
## .25 .50 .75 .90 .95
## 75 76 90 95 100
##
## lowest : 45 50 51 54 58, highest: 151 176 192 201 226
## --------------------------------------------------------------------------------
## Engine_Capacity
## n missing distinct Info Mean Gmd .05 .10
## 2994 0 18 0.96 1289 101.8 1199 1201
## .25 .50 .75 .90 .95
## 1229 1248 1364 1398 1401
##
## lowest : 973 998 999 1196 1199, highest: 1488 1598 1601 1686 1801
##
## Value 973 998 999 1196 1199 1200 1201 1229 1248 1364 1389
## Frequency 21 17 1 9 136 98 297 407 954 339 369
## Proportion 0.007 0.006 0.000 0.003 0.045 0.033 0.099 0.136 0.319 0.113 0.123
##
## Value 1398 1401 1488 1598 1601 1686 1801
## Frequency 187 47 12 17 65 17 1
## Proportion 0.062 0.016 0.004 0.006 0.022 0.006 0.000
## --------------------------------------------------------------------------------
## WD
## n missing distinct
## 2994 0 3
##
## Value 4WD Back_2WD Front_2WD
## Frequency 1 6 2987
## Proportion 0.000 0.002 0.998
## --------------------------------------------------------------------------------
## Color
## n missing distinct
## 2994 0 18
##
## lowest : Beige Black Blue Brown Burgundy
## highest: Silver Grey Smoked Turquoise White Yellow
##
## Beige (10, 0.003), Black (297, 0.099), Blue (226, 0.075), Brown (61, 0.020),
## Burgundy (39, 0.013), Champagne (6, 0.002), Green (91, 0.030), Grey (405,
## 0.135), Navy Blue (98, 0.033), Orange (8, 0.003), Pink (2, 0.001), Purple (6,
## 0.002), Red (236, 0.079), Silver Grey (382, 0.128), Smoked (94, 0.031),
## Turquoise (18, 0.006), White (953, 0.318), Yellow (62, 0.021)
## --------------------------------------------------------------------------------
## Warranty
## n missing distinct
## 2994 0 2
##
## Value No Yes
## Frequency 2444 550
## Proportion 0.816 0.184
## --------------------------------------------------------------------------------
## Seller
## n missing distinct
## 2994 0 2
##
## Value Gallery Owner
## Frequency 1296 1698
## Proportion 0.433 0.567
## --------------------------------------------------------------------------------
cars %>% dplyr::select(where(is.factor)) %>%
gather(attributes, value, 1:6) %>%
ggplot(aes(x = value)) +
geom_bar(fill = 'lightblue2', color = 'black') +
facet_wrap(~attributes, scales = 'free_x') +
labs(x="Values", y="Frequency") +
theme_bw()
## Warning: attributes are not identical across measure variables;
## they will be dropped
From the Brand and Model feature plots, we can easily observe that these features have zero variance, so we can simply drop these columns in the dimensionality reduction section.
Fuel_Type, Gear_Type and WD appear to be categorical features.
cars %>% dplyr::select(where(is.numeric)) %>%
gather(attributes, value, 1:5) %>%
ggplot(aes(x = value)) +
geom_histogram(fill = 'lightblue2', color = 'black') +
facet_wrap(~attributes, scales = 'free_x') +
labs(x="Values", y="Frequency") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Engine is a numeric column but it has only 7 unique values, so we can also treat this feature as a categorical one.
corrplot(cor(cars %>% dplyr::select(where(is.numeric))), type = 'upper', method = 'number', tl.cex = 0.9)
There is a high correlation between the Price & Year and the Engine & Engine_Capacity column pairs. We will drop some of these highly correlated columns as a dimensionality reduction step.
ggplot(cars, aes(x = Price, y = Year)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
You can easily observe that there is an outlier in the data (one listing at 144750, far above the rest of the prices). Let's treat this outlier with filtering.
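The filtering chunk itself is not shown in the knit; a minimal sketch, assuming the single extreme listing (Price = 144750, well above the next-highest price of 75500) is the one we want to drop; the cut-off value is an assumption:
# Drop the single extreme price outlier; 100000 is an assumed threshold
cars <- cars %>% filter(Price < 100000)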
Let’s check it again
ggplot(cars, aes(x = Price, y = Year)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
Now it seems better :)
ggplot(data = cars, aes(x=Fuel_Type, y=Price, color=Fuel_Type)) +
geom_boxplot()+
scale_color_brewer(palette="Dark2") +
geom_jitter(shape=16, position=position_jitter(0.2))+
labs(title = 'Price by Fuel Type',
y = 'Price', x = 'Fuel_Type')
In this plot, we can observe that Gasoline type cars are more expensive than the rest.
ggplot(data = cars, aes(x = Fuel_Type, y = Price, fill = Fuel_Type)) +
geom_boxplot() +
scale_fill_brewer(palette = "Greens") +
geom_jitter(shape = 16, position = position_jitter(0.2)) +
labs(title = 'Price by Fuel Type and Gear Type',
y = 'Price', x = 'Fuel_Type') +
facet_wrap(~Gear_Type, nrow = 1)
In this plot, we can observe that Automatic and Semi Automatic cars are more expensive than Manual cars, while Gasoline cars remain the most expensive ones with respect to Fuel_Type.
ggplot(data = cars, aes(x = Fuel_Type, y = Price, fill = Fuel_Type)) +
geom_boxplot() +
scale_fill_brewer(palette = "Greens") +
geom_jitter(shape = 16, position = position_jitter(0.2)) +
labs(title = 'Price by Fuel Type across Years',
y = 'Price', x = 'Fuel_Type') +
facet_wrap(~Year, nrow = 1)
In this plot, we can observe that Year and Price have a high positive correlation with each other.
Let's create a 3D plot to visualize three of the most informative features:
Price & Km & Year
# x, y, z variables
x <- cars$Price
y <- cars$Km
z <- cars$Year
scatter3D(x, y, z,
xlab = "Price", ylab = "Km", zlab = "Year",
phi = 0, bty = "g", pch = 20, cex = 2,
ticktype = "detailed", colvar = as.numeric(cars$Year))
Give index numbers to the factor columns.
car <- cars
indx <- sapply(car, is.factor)
car[indx] <- lapply(car[indx], function(x) as.numeric(factor(x)))
To run the algorithms below, we need numeric values. These features are factors, so each factor level is simply replaced by its integer index.
## 'data.frame': 2994 obs. of 14 variables:
## $ Price : num 24900 27500 42500 63750 25000 ...
## $ Brand : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Model : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Engine : num 3 2 3 4 3 7 3 2 2 4 ...
## $ Year : int 2004 2004 2008 2017 2004 2003 2011 2000 2007 1999 ...
## $ Fuel_Type : num 1 2 1 2 1 1 1 2 2 3 ...
## $ Gear_Type : num 2 2 3 3 2 2 2 2 3 2 ...
## $ Km : int 270000 164000 300000 8780 200000 130000 197000 169000 97000 221000 ...
## $ Engine_Power : int 70 75 90 90 70 76 75 65 76 90 ...
## $ Engine_Capacity: int 1248 1199 1248 1398 1248 1601 1248 1199 1201 1389 ...
## $ WD : num 3 3 3 3 3 3 3 3 3 3 ...
## $ Color : num 13 13 8 13 3 17 17 18 3 13 ...
## $ Warranty : num 1 1 1 2 1 2 1 1 1 1 ...
## $ Seller : num 1 1 1 1 2 1 1 1 2 2 ...
| Price | Brand | Model | Engine | Year | Fuel_Type | Gear_Type | Km | Engine_Power | Engine_Capacity | WD | Color | Warranty | Seller |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24900 | 1 | 1 | 3 | 2004 | 1 | 2 | 270000 | 70 | 1248 | 3 | 13 | 1 | 1 |
| 27500 | 1 | 1 | 2 | 2004 | 2 | 2 | 164000 | 75 | 1199 | 3 | 13 | 1 | 1 |
| 42500 | 1 | 1 | 3 | 2008 | 1 | 3 | 300000 | 90 | 1248 | 3 | 8 | 1 | 1 |
| 63750 | 1 | 1 | 4 | 2017 | 2 | 3 | 8780 | 90 | 1398 | 3 | 13 | 2 | 1 |
| 25000 | 1 | 1 | 3 | 2004 | 1 | 2 | 200000 | 70 | 1248 | 3 | 3 | 1 | 2 |
| 28500 | 1 | 1 | 7 | 2003 | 1 | 2 | 130000 | 76 | 1601 | 3 | 17 | 2 | 1 |
In the field of machine learning, it is useful to apply a process called dimensionality reduction to highly dimensional data. The purpose of this process is to reduce the number of features under consideration, where each feature is a dimension that partly represents the objects. Why is dimensionality reduction important? As more features are added, the data becomes very sparse and analysis suffers from the curse of dimensionality. Additionally, it is easier to process smaller data sets. Dimensionality reduction can be executed using two different methods:
- Selecting from the existing features (feature selection)
- Extracting new features by combining the existing features (feature extraction)
| variable | mean | std_dev | variation_coef | p_01 | p_05 | p_25 | p_50 | p_75 | p_95 | p_99 | skewness | kurtosis | iqr | range_98 | range_80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Price | 3.527096e+04 | 1.258863e+04 | 0.3569120 | 14250 | 17750 | 25000 | 33900 | 43900 | 58500 | 67500 | 0.4626585 | 2.667568 | 18900 | [14250, 67500] | [19750, 52000] |
| Brand | 1.000000e+00 | 0.000000e+00 | 0.0000000 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | NaN | NaN | 0 | [1, 1] | [1, 1] |
| Model | 1.000000e+00 | 0.000000e+00 | 0.0000000 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | NaN | NaN | 0 | [1, 1] | [1, 1] |
| Engine | 3.243821e+00 | 1.056381e+00 | 0.3256595 | 1 | 2 | 3 | 3 | 4 | 4 | 7 | 0.9749908 | 5.589242 | 1 | [1, 7] | [2, 4] |
| Year | 2.007506e+03 | 5.922384e+00 | 0.0029501 | 1994 | 1997 | 2004 | 2008 | 2013 | 2016 | 2017 | -0.3453788 | 2.098885 | 9 | [1994, 2017] | [1999, 2015] |
| Fuel_Type | 1.841683e+00 | 7.639903e-01 | 0.4148326 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | 0.2754706 | 1.756249 | 1 | [1, 3] | [1, 3] |
| Gear_Type | 1.979960e+00 | 5.706947e-01 | 0.2882355 | 1 | 1 | 2 | 2 | 2 | 3 | 3 | -0.0024646 | 3.067447 | 0 | [1, 3] | [1, 3] |
| Km | 1.283223e+05 | 7.173726e+04 | 0.5590396 | 3000 | 17325 | 71000 | 130000 | 182000 | 240000 | 300000 | 0.2302952 | 2.662186 | 111000 | [3000, 3e+05] | [30000, 218918.1] |
| Engine_Power | 7.976119e+01 | 1.535335e+01 | 0.1924915 | 51 | 51 | 75 | 76 | 90 | 100 | 126 | 2.0245987 | 17.960649 | 15 | [51, 126] | [60, 95] |
| Engine_Capacity | 1.288937e+03 | 1.013122e+02 | 0.0786013 | 998 | 1199 | 1229 | 1248 | 1364 | 1401 | 1601 | 0.9501447 | 5.528914 | 135 | [998, 1601] | [1201, 1398] |
| WD | 2.997328e+00 | 5.774070e-02 | 0.0192641 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | -24.1633816 | 657.018707 | 0 | [3, 3] | [3, 3] |
| Color | 1.139646e+01 | 5.567746e+00 | 0.4885505 | 2 | 2 | 8 | 14 | 17 | 17 | 18 | -0.4894941 | 1.738998 | 9 | [2, 18] | [2, 17] |
| Warranty | 1.183701e+00 | 3.873046e-01 | 0.3271981 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 1.6336087 | 3.668677 | 0 | [1, 2] | [1, 2] |
| Seller | 1.567134e+00 | 4.955553e-01 | 0.3162175 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | -0.2709909 | 1.073436 | 1 | [1, 2] | [1, 2] |
There are two columns that have zero variance: Brand and Model.
Drop these zero variance columns for dimension reduction.
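A minimal sketch of the drop, assuming car (the numerically encoded copy) is still the working data frame; the table below shows the result:
# Brand and Model have a single level each, so they carry no information
car <- car %>% dplyr::select(-Brand, -Model)
kable(head(car))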
| Price | Engine | Year | Fuel_Type | Gear_Type | Km | Engine_Power | Engine_Capacity | WD | Color | Warranty | Seller |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 24900 | 3 | 2004 | 1 | 2 | 270000 | 70 | 1248 | 3 | 13 | 1 | 1 |
| 27500 | 2 | 2004 | 2 | 2 | 164000 | 75 | 1199 | 3 | 13 | 1 | 1 |
| 42500 | 3 | 2008 | 1 | 3 | 300000 | 90 | 1248 | 3 | 8 | 1 | 1 |
| 63750 | 4 | 2017 | 2 | 3 | 8780 | 90 | 1398 | 3 | 13 | 2 | 1 |
| 25000 | 3 | 2004 | 1 | 2 | 200000 | 70 | 1248 | 3 | 3 | 1 | 2 |
| 28500 | 7 | 2003 | 1 | 2 | 130000 | 76 | 1601 | 3 | 17 | 2 | 1 |
From the correlation analysis, there is a high correlation between Price and Year, which is expected, and also between Engine and Engine_Capacity.
We can reduce dimensionality by dropping one feature of each highly correlated pair, because two highly correlated features span almost the same direction in feature space and carry largely the same information for the model.
Candidates for dimensionality reduction are Price & Year and Engine & Engine_Capacity. Let's remove the Engine and Year features from the data, as sketched below.
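A minimal sketch of this step, keeping Price and Engine_Capacity and dropping their highly correlated counterparts; the table below shows the result:
# Drop Year (correlated with Price) and Engine (correlated with Engine_Capacity)
car <- car %>% dplyr::select(-Engine, -Year)
kable(head(car))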
| Price | Fuel_Type | Gear_Type | Km | Engine_Power | Engine_Capacity | WD | Color | Warranty | Seller |
|---|---|---|---|---|---|---|---|---|---|
| 24900 | 1 | 2 | 270000 | 70 | 1248 | 3 | 13 | 1 | 1 |
| 27500 | 2 | 2 | 164000 | 75 | 1199 | 3 | 13 | 1 | 1 |
| 42500 | 1 | 3 | 300000 | 90 | 1248 | 3 | 8 | 1 | 1 |
| 63750 | 2 | 3 | 8780 | 90 | 1398 | 3 | 13 | 2 | 1 |
| 25000 | 1 | 2 | 200000 | 70 | 1248 | 3 | 3 | 1 | 2 |
| 28500 | 1 | 2 | 130000 | 76 | 1601 | 3 | 17 | 2 | 1 |
In this chapter we are going to use PCA for dimensionality reduction.
The main technique for feature extraction is Principal Component Analysis (PCA). PCA finds the linear transformation that reduces the number of dimensions with a minimal loss of information. Sometimes the information that is lost is regarded as noise: information that does not represent the phenomenon we are trying to model, but is rather a side effect of some usually unknown processes. The PCA results can be visualized as follows.
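The chunk that fits the PCA is not shown in the knit; a minimal sketch of how the res.pca object used by the factoextra calls below could be created with FactoMineR (standardizing the features is an assumption, made because they live on very different scales):
# Fit PCA on the reduced numeric data, scaling each feature to unit variance
res.pca <- PCA(car, scale.unit = TRUE, graph = FALSE)
# Correlation circle of the variables, the "PCA graph" discussed below
fviz_pca_var(res.pca, repel = TRUE)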
In the PCA variable graph above, we can observe that Engine_Capacity, Km and Price are the features that best cover the overall variation.
Let's find the eigenvalues of the data and show the percentage of variance explained by each dimension.
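A sketch of the call behind the table below, assuming the res.pca object from the sketch above:
# Eigenvalues and percentage of variance explained by each dimension
get_eigenvalue(res.pca)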
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.1644722 21.644722 21.64472
## Dim.2 1.6687576 16.687576 38.33230
## Dim.3 1.1506568 11.506568 49.83887
## Dim.4 1.0095588 10.095588 59.93445
## Dim.5 0.9767795 9.767795 69.70225
## Dim.6 0.9006474 9.006474 78.70872
## Dim.7 0.8095553 8.095553 86.80428
## Dim.8 0.7098923 7.098923 93.90320
## Dim.9 0.4378012 4.378012 98.28121
## Dim.10 0.1718790 1.718790 100.00000
Result: we can cover ~94% of the total variance with 8 dimensions rather than 10.
Extract the results for variables
Now let's look at the per-variable results (coordinates and contributions to each dimension) and see which features are more important than the others.
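A sketch of how the two tables below could be produced with factoextra (the exact chunk is not shown in the knit):
# Per-variable PCA results: coordinates on the first five dimensions
# and contributions (%) to each of those dimensions
var <- get_pca_var(res.pca)
kable(var$coord[, 1:5])
kable(var$contrib[, 1:5])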
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | |
|---|---|---|---|---|---|
| Price | 0.8718555 | -0.2988576 | -0.1361422 | 0.0353816 | 0.1660124 |
| Fuel_Type | -0.0121482 | 0.5814507 | -0.3883851 | -0.0949452 | -0.3446402 |
| Gear_Type | -0.2597633 | -0.4660614 | -0.0428322 | 0.2357834 | 0.1587561 |
| Km | -0.8257640 | 0.2318551 | 0.3166230 | -0.0158091 | -0.1115312 |
| Engine_Power | 0.4916763 | 0.5797274 | 0.0860130 | 0.1487164 | 0.2790065 |
| Engine_Capacity | 0.2112345 | 0.7433600 | 0.3590415 | -0.0137531 | 0.1458924 |
| WD | 0.0562266 | 0.0496148 | 0.1630968 | 0.9134135 | -0.1052629 |
| Color | 0.0630850 | -0.2169561 | 0.7436052 | -0.2137706 | 0.2544330 |
| Warranty | 0.5427151 | -0.1010429 | 0.1773395 | -0.1953049 | -0.3648084 |
| Seller | -0.2584240 | 0.1482482 | -0.3631621 | -0.0545596 | 0.6962536 |
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | |
|---|---|---|---|---|---|
| Price | 35.1185872 | 5.3522369 | 1.6107934 | 0.1240004 | 2.821528 |
| Fuel_Type | 0.0068182 | 20.2596767 | 13.1092968 | 0.8929233 | 12.160050 |
| Gear_Type | 3.1174798 | 13.0164648 | 0.1594393 | 5.5067415 | 2.580266 |
| Km | 31.5035808 | 3.2213666 | 8.7124279 | 0.0247561 | 1.273491 |
| Engine_Power | 11.1687989 | 20.1397679 | 0.6429574 | 2.1907155 | 7.969517 |
| Engine_Capacity | 2.0614740 | 33.1135057 | 11.2032380 | 0.0187356 | 2.179058 |
| WD | 0.1460601 | 0.1475126 | 2.3117737 | 82.6424692 | 1.134368 |
| Color | 0.1838654 | 2.8206573 | 48.0550488 | 4.5265188 | 6.627507 |
| Warranty | 13.6079195 | 0.6118120 | 2.7331609 | 3.7782829 | 13.624891 |
| Seller | 3.0854160 | 1.3169994 | 11.4618637 | 0.2948567 | 49.629324 |
# Control variable colors using their contributions
fviz_pca_var(res.pca, col.var="contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Extract the results for individuals
Let’s find results for individuals.
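A sketch of the calls behind the two tables below, assuming the second table holds the cos2 (quality of representation) values:
# Per-individual PCA results for the first observations
ind <- get_pca_ind(res.pca)
kable(head(ind$coord[, 1:5]))
kable(head(ind$cos2[, 1:5]))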
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
|---|---|---|---|---|
| -1.8244337 | -0.6222581 | 1.4147255 | 0.0988391 | -0.7726425 |
| -0.8439338 | -0.4782340 | 0.3425005 | 0.0604904 | -1.0071373 |
| -1.1430736 | -0.7674374 | 0.7725699 | 0.9366888 | -0.1668707 |
| 3.3292843 | -0.7982184 | 0.4310606 | 0.2229979 | -0.3849839 |
| -1.7034901 | -0.2659470 | -0.8029131 | 0.3870671 | 0.2982094 |
| 1.0550205 | 0.8193233 | 2.9250885 | -0.5050588 | -0.6477896 |
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
|---|---|---|---|---|
| 0.0513632 | 0.0077499 | 0.0580961 | 0.0003232 | 0.0204131 |
| 0.0109904 | 0.0045776 | 0.0034051 | 0.0001211 | 0.0346840 |
| 0.0201625 | 0.0117880 | 0.0173252 | 0.0290273 | 0.0009522 |
| 0.1710401 | 0.0127526 | 0.0053936 | 0.0016452 | 0.0050680 |
| 0.0447791 | 0.0014156 | 0.0187128 | 0.0049567 | 0.0030408 |
| 0.0171758 | 0.0134359 | 0.2483593 | 0.0084392 | 0.0143489 |
var<-get_pca_var(res.pca)
a<-fviz_contrib(res.pca, "var", axes=1, xtickslab.rt=90) # default angle=45°
b<-fviz_contrib(res.pca, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')
RESULT FOR PCA
As a result of the contribution graphs above, both for the variables and for the individuals, we can simply drop the WD, Color and Seller features from the dataset. Since this reduces the data complexity, we will not lose much accuracy when clustering the data, which is the desired outcome of dimensionality reduction.
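A minimal sketch of the drop; the table below shows the resulting data frame:
# Remove the features that contribute little to the leading components
car <- car %>% dplyr::select(-WD, -Color, -Seller)
kable(head(car))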
| Price | Fuel_Type | Gear_Type | Km | Engine_Power | Engine_Capacity | Warranty |
|---|---|---|---|---|---|---|
| 24900 | 1 | 2 | 270000 | 70 | 1248 | 1 |
| 27500 | 2 | 2 | 164000 | 75 | 1199 | 1 |
| 42500 | 1 | 3 | 300000 | 90 | 1248 | 1 |
| 63750 | 2 | 3 | 8780 | 90 | 1398 | 2 |
| 25000 | 1 | 2 | 200000 | 70 | 1248 | 1 |
| 28500 | 1 | 2 | 130000 | 76 | 1601 | 2 |
In this chapter we are going to use K-Means, PAM and Hierarchical clustering methods.
Clustering is the task of dividing the data points into a number of groups such that points in the same group are more similar to each other than to points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them to clusters.
There are many clustering algorithms, each with its advantages and disadvantages. A popular algorithm for clustering is k-means, which aims to identify the best k cluster centers in an iterative manner. Cluster centers serve as "representatives" of the objects associated with the cluster. k-means' key features are also its drawbacks:
- The number of clusters (k) must be given explicitly. In some cases, the number of different groups is unknown.
- k-means' iterative nature might lead to an incorrect result due to convergence to a local minimum.
- The clusters are assumed to be spherical.
Despite these drawbacks, k-means remains a popular and suitable choice in many cases. An example of clustering with k-means follows below.
Let’s compare k-means with k = [2,3,4,5]
set.seed(123)
cars_K2 <- kmeans(car, centers = 2, nstart = 25)
cars_K3 <- kmeans(car, centers = 3, nstart = 25)
cars_K4 <- kmeans(car, centers = 4, nstart = 25)
cars_K5 <- kmeans(car, centers = 5, nstart = 25)
p1 <- fviz_cluster(cars_K2, geom = "point", data = car) + ggtitle(" K = 2")
p2 <- fviz_cluster(cars_K3, geom = "point", data = car) + ggtitle(" K = 3")
p3 <- fviz_cluster(cars_K4, geom = "point", data = car) + ggtitle(" K = 4")
p4 <- fviz_cluster(cars_K5, geom = "point", data = car) + ggtitle(" K = 5")
grid.arrange(p1, p2, p3, p4, nrow = 2)
Let's compare the number of clusters using Silhouette and Elbow graphs.
f1 <- fviz_nbclust(car, FUNcluster = kmeans, method = "silhouette") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(car, FUNcluster = kmeans, method = "wss") +
ggtitle("Optimal number of clusters \n K-means")
grid.arrange(f1, f2, ncol=2)
From the Silhouette and Elbow graphs, we can observe that there are two candidates for the number of clusters: 2 and 3. Let's take a closer look at both.
Let’s compare k-means with number of clusters 2 and 3.
km2 <- eclust(car, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c2 <- fviz_cluster(km2, data=car, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 2 clusters")
s2 <- fviz_silhouette(km2)
## cluster size ave.sil.width
## 1 1 1539 0.58
## 2 2 1455 0.60
km3 <- eclust(car, k=3, FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c3 <- fviz_cluster(km3, data=car, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 3 clusters")
s3 <- fviz_silhouette(km3)
## cluster size ave.sil.width
## 1 1 875 0.52
## 2 2 958 0.59
## 3 3 1161 0.47
We can conclude that the optimal number of clusters for k-means is 3.
Let's check the results for the other methods: PAM and hierarchical clustering.
The PAM algorithm is based on the search for k representative objects or medoids among the observations of the data set.
After finding a set of k medoids, clusters are constructed by assigning each observation to the nearest medoid. Next, each selected medoid m and each non-medoid data point are swapped and the objective function is computed. The objective function corresponds to the sum of the dissimilarities of all objects to their nearest medoid.
The SWAP step attempts to improve the quality of the clustering by exchanging selected objects (medoids) and non-selected objects. If the objective function can be reduced by interchanging a selected object with an unselected object, then the swap is carried out. This is continued until the objective function can no longer be decreased. The goal is to find k representative objects which minimize the sum of the dissimilarities of the observations to their closest representative object.
Let’s compare PAM with number of clusters 2 and 3.
pam2 <- eclust(car, k=2 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp2 <- fviz_cluster(pam2, data=car, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 2 clusters")
sp2 <- fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 1566 0.57
## 2 2 1428 0.61
pam3 <- eclust(car, k=3 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp3 <- fviz_cluster(pam3, data=car, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 3 clusters")
sp3 <- fviz_silhouette(pam3)
## cluster size ave.sil.width
## 1 1 1136 0.47
## 2 2 863 0.58
## 3 3 995 0.52
We can conclude that the optimal number of clusters for PAM is 3, which is the same conclusion as for k-means.
Let's check the results for the remaining method: hierarchical clustering.
Hierarchical clustering is where you build a cluster tree (a dendrogram) to represent the data, where each group (or "node") links to two or more successor groups. The groups are nested and organized as a tree, which ideally ends up as a meaningful classification scheme. Distances between data points are calculated using a chosen distance measure and collected in a distance matrix.
Each node in the cluster tree contains a group of similar data points, and similar nodes sit next to each other in the tree. Clusters at one level join with clusters at the next level up according to their degree of similarity. The process carries on until all nodes are in the tree, which gives a visual snapshot of the data contained in the whole set. The total number of clusters is not predetermined before you start building the tree.
What is a Dendrogram?
Let's compare 2 and 3 clusters using Euclidean distance in hierarchical clustering.
discar<- dist(car, method = "euclidean")
hcar<-hclust(discar, method="ward.D2")
plot(hcar)
groups <- cutree(hcar, k=2)
rect.hclust(hcar, k=2, border="red")
discar<- dist(car, method = "euclidean")
hcar<-hclust(discar, method="ward.D2")
plot(hcar)
groups <- cutree(hcar, k=3)
rect.hclust(hcar, k=3, border="red")
Let's compare 2 and 3 clusters using Manhattan distance in hierarchical clustering.
d <- dist(car, method = "manhattan")
H.fit <- hclust(d, method="ward.D2")
plot(H.fit)
groups <- cutree(H.fit, k=2)
rect.hclust(H.fit, k=2, border="red")
d <- dist(car, method = "manhattan")
H.fit <- hclust(d, method="ward.D2")
plot(H.fit)
groups <- cutree(H.fit, k=3)
rect.hclust(H.fit, k=3, border="red")
RESULT: As you can observe from the clustering methods and the search for an optimal number of clusters above, we conclude the following setup:
- Number of clusters: 3
- Clustering method: K-means clustering
- Distance calculation method: either Euclidean or Manhattan can be used
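The cluster centers below are consistent with refitting the chosen model on the reduced data; a sketch of that final fit (km_final is an assumed object name):
# Refit k-means with 3 clusters on the reduced data and inspect the centers
set.seed(123)
km_final <- kmeans(car, centers = 3, nstart = 25)
km_final$centers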
## Price Fuel_Type Gear_Type Km Engine_Power Engine_Capacity Warranty
## 1 48271.72 1.907098 1.902923 46022.23 85.21294 1295.738 1.328810
## 2 24139.60 1.896000 2.010286 213999.41 76.89486 1312.898 1.113143
## 3 32932.63 1.746770 2.020672 131660.87 77.42291 1265.267 1.117140
FINAL PLOT FOR K-MEANS WITH 3 CLUSTERS
## cluster size ave.sil.width
## 1 1 958 0.59
## 2 2 875 0.52
## 3 3 1161 0.47
PLOT FOR EACH FEATURE WITH CLUSTER
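The per-feature plot chunk is not shown in the knit; a sketch of one way to draw it, assuming the cluster labels come from the final k-means fit (km_final from the sketch above):
# Boxplot of every remaining feature within each of the three clusters
car %>%
  mutate(cluster = factor(km_final$cluster)) %>%
  gather(attributes, value, -cluster) %>%
  ggplot(aes(x = cluster, y = value, fill = cluster)) +
  geom_boxplot() +
  facet_wrap(~attributes, scales = "free_y") +
  theme_bw()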
PLOT FOR PRICE & KM WITH CLUSTER
hcar.a<- hclust(discar, method = "average")
hcaralabel<- cutree(hcar.a, 3)
hcar.labels<- cutree(hcar, 3)
car<- data.frame(car)
ggplot(car, aes(Price, Km)) +
geom_point(aes(col = hcar.labels), show.legend = FALSE) +
scale_color_gradient(low = "blue", high = "red")  # change the colour via the scale