Background

The aim of this paper is to apply different taxonomic methods to data obtained by web scraping the Otomoto.pl portal in order to build dendrograms (tree diagrams produced by hierarchical clustering) and basic clusters. ‘OTOMOTO.pl’ is a Polish e-commerce site operating on the same rules as eBay. It allows private and non-private users to sell or buy cars.

Data:

As the study group, I chose cars that fit the budget of an averagely wealthy family in Poland. These are the selection criteria:

Price <15,000 - 25,000> PLN
Mileage <100,000 - 200,000> km
Capacity <1,000 - 2,000> cm3
Age <0 - 10> years
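
The criteria above can be written as a simple filter. The following is a minimal sketch in base R, not the code used for the study: the data frame `ads`, the two "Example" rows, and the reference year `ref_year` used to derive age are assumptions made for illustration. The first two rows reuse values from the sample output shown in this section.

```r
# Hypothetical raw listings. The first two rows reuse values from the sample
# output in this report; the two "Example" rows are made up to fall outside the criteria.
ads <- data.frame(
  mark     = c("Seat", "Opel", "Example A", "Example B"),
  price    = c(24900, 17900, 31000, 20000),      # PLN
  year     = c(2009, 2009, 2012, 2004),
  mileage  = c(198000, 190000, 150000, 120000),  # km
  capacity = c(2000, 1598, 1600, 1400)           # cm3
)

ref_year <- 2019                 # assumption: year used to turn production year into age
ads$age  <- ref_year - ads$year

# Keep only listings that satisfy all four criteria (bounds inclusive).
selected <- subset(
  ads,
  price    >= 15000  & price    <= 25000 &
  mileage  >= 100000 & mileage  <= 200000 &
  capacity >= 1000   & capacity <= 2000 &
  age      >= 0      & age      <= 10
)

print(selected)
```

Treating the bounds as inclusive is also an assumption; the angle-bracket notation in the list suggests closed intervals.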

As a result, I obtained 1860 observations with the following variables:

##         mark price year mileage capacity        fuel           location
## 1       Seat 24900 2009  198000     2000     Benzyna          Lubelskie
## 2 Volkswagen 25000 2009  188000     1968      Diesel        Mazowieckie
## 3       Opel 17900 2009  190000     1598 Benzyna+LPG            Łódzkie
## 4       Saab 22999 2009  180000     1910      Diesel           Lubuskie
## 5        BMW 24900 2009  163000     1995      Diesel Zachodniopomorskie
## 6       Saab 19999 2009  181000     1910      Diesel          Pomorskie

The data were grouped by mark and by location, which resulted in two data frames:
df - by mark and
df2 - by location.
For the analysis, groups with 5 or fewer observations were removed.

    library(dplyr)

    ## data grouped by mark
    ## (age and fuel2 are not among the columns printed above;
    ##  they are assumed to have been derived in data beforehand)
    df <- data %>%
      group_by(mark) %>%
      summarize(mean_price    = mean(price, na.rm = TRUE),
                mean_capacity = mean(capacity, na.rm = TRUE),
                mean_age      = mean(age, na.rm = TRUE),
                mean_mileage  = mean(mileage, na.rm = TRUE),
                mean_fuel     = mean(fuel2, na.rm = TRUE),
                count         = n()) %>%   # number of listings per mark
      filter(count > 5)                    # drop marks with 5 or fewer listings

    ## data grouped by location
    df2 <- data %>%
      group_by(location) %>%
      summarize(mean_price    = mean(price, na.rm = TRUE),
                mean_capacity = mean(capacity, na.rm = TRUE),
                mean_age      = mean(age, na.rm = TRUE),
                mean_mileage  = mean(mileage, na.rm = TRUE),
                mean_fuel     = mean(fuel2, na.rm = TRUE),
                count         = n()) %>%   # number of listings per location
      filter(count > 5)                    # drop locations with 5 or fewer listings

df

##             mark mean_price mean_capacity mean_age mean_mileage mean_fuel count
## 1            BMW   22851.52      1751.788 9.454545     165883.6  2.272727    33
## 2      Chevrolet   20665.48      1684.645 7.806452     144740.9  2.290323    31
## 3        Citroën   21116.22      1605.759 8.626437     163520.1  2.390805   174
## 4          Dacia   20976.67      1542.467 7.733333     147968.7  1.666667    15
## 5           Fiat   20213.29      1563.169 8.813559     153775.9  2.271186    59
## 6           Ford   20922.75      1609.789 8.529412     165701.4  2.279412   204
## 7          Honda   21646.67      1556.267 9.000000     144400.9  1.400000    15
## 8        Hyundai   19680.59      1524.811 8.822222     152779.4  2.144444    90
## 9            Kia   20729.34      1595.857 8.836735     157812.1  2.255102    98
## 10         Mazda   21257.53      1647.079 9.131579     161203.4  1.973684    38
## 11 Mercedes-Benz   21086.69      1518.938 9.500000     165132.7  1.937500    16
## 12    Mitsubishi   22424.80      1833.867 8.933333     155193.8  1.800000    15
## 13        Nissan   21189.03      1452.375 7.906250     132862.2  1.437500    32
## 14          Opel   20699.36      1589.388 8.380769     160998.5  2.269231   260
## 15       Peugeot   20506.55      1570.538 8.461538     156890.4  2.392308   130
## 16       Renault   20888.14      1660.415 8.740113     161348.0  2.474576   177
## 17          Seat   21499.06      1568.976 8.670588     158804.9  2.200000    85
## 18         Škoda   21338.74      1519.386 8.150327     160371.8  2.019608   153
## 19        Suzuki   21034.00      1536.368 8.842105     147968.6  1.684211    19
## 20        Toyota   21058.84      1553.922 8.843137     155325.5  2.000000    51
## 21    Volkswagen   22139.14      1562.940 9.140000     166115.0  2.260000   100
## 22         Volvo   21561.68      1654.294 9.352941     168840.1  2.823529    34

Clustering

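The data were prepared for clustering by square-root transforming and standardizing the mean values. The distances between marks were then calculated with the ‘Manhattan’ method, and the marks were grouped by agglomerative hierarchical clustering with Ward's method.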

library(cluster)  # agnes()

xxx <- df[, c("mark", "mean_price", "mean_capacity", "mean_age", "mean_mileage", "mean_fuel")]
xxx <- as.data.frame(xxx)
rownames(xxx) <- xxx$mark
rownames(xxx)

# preparing data for clustering: square root, then standardization (mean 0, sd 1)
xxx$price_norm    <- scale(sqrt(xxx$mean_price))
xxx$mileage_norm  <- scale(sqrt(xxx$mean_mileage))
xxx$capacity_norm <- scale(sqrt(xxx$mean_capacity))
xxx$fuel_norm     <- scale(sqrt(xxx$mean_fuel))
xxx$age_norm      <- scale(sqrt(xxx$mean_age))

# odleglosci = distances: Manhattan distance between every pair of marks
odleglosci <- dist(xxx[, c("price_norm", "mileage_norm", "capacity_norm", "fuel_norm", "age_norm")],
                   method = "manhattan")
as.matrix(odleglosci)

# grupy = groups: agglomerative hierarchical clustering with Ward's method
grupy <- agnes(odleglosci, method = "ward")
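
To make the distance step concrete, the sketch below checks that the Manhattan distance returned by `dist()` is the sum of absolute differences between the transformed values. It uses only two variables and three marks, with values copied from the table of means, so the standardized values (and therefore the distances) differ from those in the full analysis. The object names `m`, `z`, `d_dist` and `d_manual` are illustrative.

```r
# Mean price and mean mileage for three marks, copied from the table of means.
m <- data.frame(
  mean_price   = c(22851.52, 20976.67, 21058.84),
  mean_mileage = c(165883.6, 147968.7, 155325.5),
  row.names    = c("BMW", "Dacia", "Toyota")
)

# Same transformation as in the analysis: square root, then standardization.
# Here the standardization is over three rows only, unlike the full data set.
z <- scale(sqrt(as.matrix(m)))

# Manhattan distances between all pairs of marks.
d_dist <- as.matrix(dist(z, method = "manhattan"))

# The same distance for one pair, computed by hand as a sum of absolute differences.
d_manual <- sum(abs(z["BMW", ] - z["Dacia", ]))

print(d_dist)
print(d_manual)
```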

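The optimal number of clusters was chosen based on the gap statistic, which eclust() estimates when no number of clusters is given. Note that this call uses only the normalized price, mileage and age:

library(factoextra)  # eclust(), fviz_dend()

grupy_eclust <- eclust(xxx[,c("price_norm","mileage_norm","age_norm")], "hclust", graph = FALSE)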

Although the graph suggests that there should be 1 cluster, I chose 3 clusters as the optimal number, because the difference in the gap statistic between these solutions is very small.
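
A gap-statistic comparison of this kind can be sketched with `clusGap()` and `maxSE()` from the `cluster` package. This is an illustration on synthetic data, not a reproduction of the result above: the matrix `x`, the helper `ward_cut`, the choice of Ward linkage on Euclidean distances, and the selection rule `"firstSEmax"` are all assumptions, and the rule applied inside `eclust()` may differ.

```r
library(cluster)  # clusGap(), maxSE()

set.seed(1)
# Synthetic stand-in for three normalized variables (not the Otomoto data).
x <- matrix(rnorm(22 * 3), ncol = 3)

# Clustering function in the form clusGap() expects:
# it takes (x, k) and returns a list with a 'cluster' component.
# Assumption: Ward linkage on Euclidean distances via base R hclust().
ward_cut <- function(x, k) {
  list(cluster = cutree(hclust(dist(x), method = "ward.D2"), k))
}

# Gap statistic for k = 1..5, with B = 50 reference data sets.
gap <- clusGap(x, FUNcluster = ward_cut, K.max = 5, B = 50, verbose = FALSE)
print(gap$Tab[, c("gap", "SE.sim")])

# One possible rule for picking k from the gap values and their standard errors.
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
print(k_hat)
```

Printing the `gap` and `SE.sim` columns side by side makes it easy to see when neighbouring values of k differ by less than the simulation error, which is the situation described above.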

Cluster plot with Manhattan distances

As the graph shows, the marks have been split into 3 clusters. The first contains cars considered more luxurious (Mercedes, BMW, Volvo, Mazda), the second contains popular marks such as Toyota or Fiat, and the third contains Chevrolet, Dacia and Nissan. Let us check how this looks when we use a different method.

Ward’s dendrogram

fviz_dend(grupy, k = 3, rect = TRUE, main = "Ward's method")

The tree above has 3 clearly separated main branches, which then branch out further. We can observe which marks are most similar to each other and which clearly belong to other clusters.
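
Cutting a tree into a fixed number of groups can be sketched in base R as follows. This is an illustration under stated assumptions: it uses only six marks and three variables (values copied from the table of means), and base R's `hclust()` with `method = "ward.D2"` as a stand-in for `agnes(..., method = "ward")`, which the `hclust` documentation describes as corresponding methods. The groups it produces are therefore not the clusters reported in this paper.

```r
# Mean price, mileage and age for six marks, copied from the table of means.
m <- data.frame(
  mean_price   = c(22851.52, 21561.68, 21058.84, 20213.29, 20976.67, 21189.03),
  mean_mileage = c(165883.6, 168840.1, 155325.5, 153775.9, 147968.7, 132862.2),
  mean_age     = c(9.454545, 9.352941, 8.843137, 8.813559, 7.733333, 7.906250),
  row.names    = c("BMW", "Volvo", "Toyota", "Fiat", "Dacia", "Nissan")
)

z  <- scale(sqrt(as.matrix(m)))        # square root, then standardization
d  <- dist(z, method = "manhattan")    # Manhattan distances between marks
hc <- hclust(d, method = "ward.D2")    # stand-in for agnes(d, method = "ward")

# Cut the tree so that exactly 3 groups remain.
membership <- cutree(hc, k = 3)
print(membership)

# List the marks that fall into each group.
print(split(names(membership), membership))
```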

Point plot

The next graph is a point plot showing the relationship between mean price and mean mileage across the 3 clusters. The highest prices are in group 1 (considered the most luxurious). However, the mileage in this group is also high. The reason is that such cars with lower mileage would not be available within our price limit (25,000 PLN).

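library(ggplot2)

nazwa <- rownames(xxx)  # nazwa = mark names

# point graph
xxx$grupa <- factor(cutree(grupy, k = 3))  # grupa = cluster membership for k = 3

ggplot(xxx, aes(mean_price, mean_mileage, label = nazwa, color = grupa)) +
  geom_point(size = 3) + theme_bw() +
  coord_trans(x = "sqrt", y = "sqrt")  # square-root scaling of both axes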
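
The relationship between price and mileage can also be checked numerically by averaging within groups. The sketch below covers only the nine marks that the text names as members of each cluster, with values copied from the table of means. The data frame `cl` and the result `by_group` are illustrative names, and the averages are unweighted means over these marks only, not over whole clusters.

```r
# Marks named in the text for each cluster, with values from the table of means.
cl <- data.frame(
  mark  = c("Mercedes-Benz", "BMW", "Volvo", "Mazda",
            "Toyota", "Fiat",
            "Chevrolet", "Dacia", "Nissan"),
  grupa = factor(c(1, 1, 1, 1, 2, 2, 3, 3, 3)),  # cluster labels as described in the text
  mean_price   = c(21086.69, 22851.52, 21561.68, 21257.53,
                   21058.84, 20213.29,
                   20665.48, 20976.67, 21189.03),
  mean_mileage = c(165132.7, 165883.6, 168840.1, 161203.4,
                   155325.5, 153775.9,
                   144740.9, 147968.7, 132862.2)
)

# Unweighted average of the mark-level means within each group.
by_group <- aggregate(cbind(mean_price, mean_mileage) ~ grupa, data = cl, FUN = mean)
print(by_group)
```

For this subset, group 1 has both the highest average price and the highest average mileage, which is consistent with the pattern described for the point plot.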

Clustering on data from otomoto.pl

Piotr Borowski

31 January 2019