The aim of this paper is to apply different taxonomic methods to data obtained by web scraping the Otomoto.pl portal in order to build dendrograms (tree diagrams produced by hierarchical clustering) and basic clusters. OTOMOTO.pl is a Polish e-commerce site operating on principles similar to eBay. It allows private and business users to either sell or buy cars.
For the study group, I chose cars affordable for a family of average wealth in Poland. These are the criteria:
As a result, I obtained 1860 observations with the following variables:
## mark price year mileage capacity fuel location
## 1 Seat 24900 2009 198000 2000 Benzyna Lubelskie
## 2 Volkswagen 25000 2009 188000 1968 Diesel Mazowieckie
## 3 Opel 17900 2009 190000 1598 Benzyna+LPG Łódzkie
## 4 Saab 22999 2009 180000 1910 Diesel Lubuskie
## 5 BMW 24900 2009 163000 1995 Diesel Zachodniopomorskie
## 6 Saab 19999 2009 181000 1910 Diesel Pomorskie
The data were grouped by make (variable mark) and by location, which resulted in two data frames: df (by mark) and df2 (by location). For the analysis, groups with 5 or fewer observations were removed (the code keeps count > 5).
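## data grouped by mark
library(dplyr)
# age and fuel2 are assumed to be derived in an earlier step (not shown)
df <- data %>%
  group_by(mark) %>%
  summarize(
    mean_price = mean(price, na.rm = TRUE), mean_capacity = mean(capacity, na.rm = TRUE),
    mean_age = mean(age, na.rm = TRUE), mean_mileage = mean(mileage, na.rm = TRUE),
    mean_fuel = mean(fuel2, na.rm = TRUE),
    count = n()  # group size, computed in the same step as the means
  ) %>%
  filter(count > 5)  # keep marks with more than 5 listings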
## data grouped by location
# age and fuel2 are assumed to be derived in an earlier step (not shown)
df2 <- data %>%
  group_by(location) %>%
  summarize(
    mean_price = mean(price, na.rm = TRUE), mean_capacity = mean(capacity, na.rm = TRUE),
    mean_age = mean(age, na.rm = TRUE), mean_mileage = mean(mileage, na.rm = TRUE),
    mean_fuel = mean(fuel2, na.rm = TRUE),
    count = n()  # group size, computed in the same step as the means
  ) %>%
  filter(count > 5)  # keep locations with more than 5 listings
df
## mark mean_price mean_capacity mean_age mean_mileage mean_fuel count
## 1 BMW 22851.52 1751.788 9.454545 165883.6 2.272727 33
## 2 Chevrolet 20665.48 1684.645 7.806452 144740.9 2.290323 31
## 3 Citroën 21116.22 1605.759 8.626437 163520.1 2.390805 174
## 4 Dacia 20976.67 1542.467 7.733333 147968.7 1.666667 15
## 5 Fiat 20213.29 1563.169 8.813559 153775.9 2.271186 59
## 6 Ford 20922.75 1609.789 8.529412 165701.4 2.279412 204
## 7 Honda 21646.67 1556.267 9.000000 144400.9 1.400000 15
## 8 Hyundai 19680.59 1524.811 8.822222 152779.4 2.144444 90
## 9 Kia 20729.34 1595.857 8.836735 157812.1 2.255102 98
## 10 Mazda 21257.53 1647.079 9.131579 161203.4 1.973684 38
## 11 Mercedes-Benz 21086.69 1518.938 9.500000 165132.7 1.937500 16
## 12 Mitsubishi 22424.80 1833.867 8.933333 155193.8 1.800000 15
## 13 Nissan 21189.03 1452.375 7.906250 132862.2 1.437500 32
## 14 Opel 20699.36 1589.388 8.380769 160998.5 2.269231 260
## 15 Peugeot 20506.55 1570.538 8.461538 156890.4 2.392308 130
## 16 Renault 20888.14 1660.415 8.740113 161348.0 2.474576 177
## 17 Seat 21499.06 1568.976 8.670588 158804.9 2.200000 85
## 18 Škoda 21338.74 1519.386 8.150327 160371.8 2.019608 153
## 19 Suzuki 21034.00 1536.368 8.842105 147968.6 1.684211 19
## 20 Toyota 21058.84 1553.922 8.843137 155325.5 2.000000 51
## 21 Volkswagen 22139.14 1562.940 9.140000 166115.0 2.260000 100
## 22 Volvo 21561.68 1654.294 9.352941 168840.1 2.823529 34
The data were prepared for clustering: each variable was square-root transformed and standardized, and the pairwise distances between marks were then calculated with the Manhattan metric.
library(cluster)  # provides agnes()
xxx <- df[, c("mark", "mean_price", "mean_capacity", "mean_age", "mean_mileage", "mean_fuel")]
xxx <- as.data.frame(xxx)
rownames(xxx) <- xxx$mark  # label rows by mark so the dendrogram shows mark names
rownames(xxx)
# preparing data for clustering: square root, then standardization (z-scores)
xxx$price_norm <- scale(sqrt(xxx$mean_price))
xxx$mileage_norm <- scale(sqrt(xxx$mean_mileage))
xxx$capacity_norm <- scale(sqrt(xxx$mean_capacity))
xxx$fuel_norm <- scale(sqrt(xxx$mean_fuel))
xxx$age_norm <- scale(sqrt(xxx$mean_age))
# pairwise Manhattan distances between marks (odleglosci = distances)
odleglosci <- dist(xxx[, c("price_norm", "mileage_norm", "capacity_norm", "fuel_norm", "age_norm")], method = "manhattan")
as.matrix(odleglosci)
# agglomerative hierarchical clustering with Ward's method (grupy = groups)
grupy <- agnes(odleglosci, method = "ward")
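To make the distance step concrete, here is a minimal sketch on invented toy data. The three makes A, B, C and their values are hypothetical and are not taken from the scraped data set; the sketch only checks that the Manhattan distance returned by dist() equals the sum of absolute differences of the normalized coordinates.

```r
# Hypothetical toy data: three makes, two variables (values invented for illustration)
toy <- data.frame(
  mean_price   = c(20000, 21000, 23000),
  mean_mileage = c(150000, 160000, 165000),
  row.names    = c("A", "B", "C")
)

# Same preprocessing idea as in the text: square root, then z-score standardization
toy_norm <- scale(sqrt(as.matrix(toy)))

# Manhattan distance: sum of absolute coordinate differences
d <- dist(toy_norm, method = "manhattan")

# Manual computation for the pair (A, B)
manual_ab <- sum(abs(toy_norm["A", ] - toy_norm["B", ]))

# Compare the two results (all.equal allows for floating-point tolerance)
same <- isTRUE(all.equal(as.matrix(d)["A", "B"], manual_ab))
print(same)
```

Because every variable is standardized first, no single variable (for example mileage, with its large raw values) dominates the sum.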
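The optimal number of clusters was chosen based on the gap statistic:
# eclust() comes from factoextra; with k unset it estimates the number of clusters via the gap statistic
grupy_eclust <- eclust(xxx[, c("price_norm", "mileage_norm", "age_norm")], "hclust", graph = FALSE)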
Although the graph suggests that there should be 1 cluster, I chose 3 clusters as the optimal number, because the difference in the gap statistic between these solutions is very small.
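The comparison of gap values for different numbers of clusters can be sketched directly with clusGap() and maxSE() from the cluster package. This is only an illustration under stated assumptions: the data are random stand-ins (22 rows, 3 standardized variables), the helper hc_cut is a hypothetical name introduced here, and the settings need not match what eclust() uses internally.

```r
library(cluster)

set.seed(1)
# Hypothetical data: 22 observations, 3 standardized variables (random, for illustration only)
x <- scale(matrix(rnorm(22 * 3), ncol = 3))

# Clustering function in the form clusGap() expects:
# it takes (x, k) and returns a list with a `cluster` component
hc_cut <- function(x, k) {
  tree <- agnes(dist(x, method = "manhattan"), method = "ward")
  list(cluster = cutree(as.hclust(tree), k = k))
}

# Gap statistic for k = 1..6, with B = 50 reference data sets
gap <- clusGap(x, FUNcluster = hc_cut, K.max = 6, B = 50, verbose = FALSE)
tab <- gap$Tab  # columns: logW, E.logW, gap, SE.sim

# Two common selection rules; they can disagree when gap values are close
k_first  <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")
k_global <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "globalmax")

print(round(tab[, c("gap", "SE.sim")], 3))
print(c(firstSEmax = k_first, globalmax = k_global))
```

When the gap values for neighbouring k differ by less than their simulation standard errors, the rules above may pick different k, which is the kind of situation in which a choice based on interpretability can be defended.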
fviz_dend(grupy, k = 3, rect = TRUE, main = "Ward's method")
The tree above has three clearly separated main branches, which then split further. We can see which makes are most similar to each other and which clearly belong to different clusters.
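Cluster membership can also be read off as a plain list instead of from the plot. The sketch below uses random stand-in data with hypothetical row names (make_1 to make_8) and base R's hclust() in place of agnes(), so it illustrates the idea rather than reproducing the tree above.

```r
# Hypothetical stand-in data (random values, for illustration only)
set.seed(42)
m <- scale(matrix(rnorm(8 * 3), ncol = 3,
                  dimnames = list(paste0("make_", 1:8),
                                  c("price", "mileage", "age"))))

# Hierarchical clustering on Manhattan distances
tree <- hclust(dist(m, method = "manhattan"), method = "ward.D2")

# Cut the tree into 3 clusters: named integer vector, make -> cluster id
members <- cutree(tree, k = 3)

# List the makes that fall into each cluster
by_cluster <- split(names(members), members)
print(by_cluster)
```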
The next graph is a scatter plot showing the relationship between mean price and mean mileage across the 3 clusters. The highest prices are in group 1 (considered the most luxurious). However, the mileage for this group is also high. The reason is that such cars with lower mileage would exceed our price limit (25,000 PLN).
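library(ggplot2)
nazwa <- rownames(xxx)  # mark names (nazwa = name)
# point graph
xxx$grupa <- factor(cutree(grupy, k = 3))  # cluster membership (grupa = group)
# the label aesthetic is only drawn if a text layer such as geom_text() is added
ggplot(xxx, aes(mean_price, mean_mileage, label = nazwa, color = grupa)) +
  geom_point(size = 3) + theme_bw() +
  coord_trans(x = "sqrt", y = "sqrt")  # square-root scaled axes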
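The group comparison read from the plot can be backed by a small numeric summary per cluster. The following is a sketch on invented values: the data frame toy stands in for xxx, the make names are hypothetical, and hclust() is used in place of agnes() to keep the example in base R.

```r
# Hypothetical stand-in for `xxx` (values invented for illustration)
toy <- data.frame(
  mean_price   = c(19700, 20200, 20500, 21100, 21500, 22400, 22900),
  mean_mileage = c(152000, 153000, 157000, 133000, 144000, 155000, 166000),
  row.names    = paste0("make_", 1:7)
)

# Same preprocessing idea as in the text, then Manhattan distances and Ward linkage
toy_norm <- scale(sqrt(as.matrix(toy)))
tree <- hclust(dist(toy_norm, method = "manhattan"), method = "ward.D2")
toy$grupa <- factor(cutree(tree, k = 3))

# Mean price and mean mileage per cluster, plus cluster sizes
profile <- aggregate(cbind(mean_price, mean_mileage) ~ grupa, data = toy, FUN = mean)
profile$n <- as.vector(table(toy$grupa))
print(profile)
```

A table like this makes it possible to check statements such as "the highest prices are in group 1" against the actual cluster means.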