In the 1970s and early 1980s, the American car industry had to undergo radical changes in order to survive on the global market. In the United States, heavy cars with six- or eight-cylinder engines were visibly losing ground to European and Japanese competitors. Global brands focused mainly on delivering lighter cars with less powerful (mostly four-cylinder) but more fuel-efficient engines. For this reason, during those years many of the basic models of American car brands were redesigned in line with market demand, with a primary focus on fuel economy.
The aim of this paper is to analyze the similarities and differences in the basic characteristics of cars from the ’70s and to identify groups of similar cars. In order to find the most accurate grouping, I will use several clustering methods, including partitioning clustering (K-means, PAM) and hierarchical clustering. With regard to the information in the first paragraph, I will also check to what extent the obtained similarities were related to the origin of the cars, and thus whether American cars were significantly different from European and Japanese cars at that time. Based on the available data, I will also assess how much this picture changed over the following years.
The dataset used in this analysis was found on a private GitHub profile. It describes basic characteristics of cars produced in the United States, Europe, and Japan between 1970 and 1982. The columns of the dataset are as follows:
library(knitr)
library(dplyr)
# Loading the data and denoting as 'cars'
cars <- read.csv('https://raw.githubusercontent.com/RodolfoViana/exploratory-data-analysis-dataset-cars/master/cars_multi.csv', header=TRUE)
# Converting 'origin' column into a more descriptive variable
cars$origin <- factor(cars$origin, labels = c('USA', 'Europe', 'Japan'))
# Converting 'horsepower' column into a numerical variable (it was imported as
# a factor with 95 levels). Note: calling as.numeric() directly on a factor
# would return the underlying level codes, so we convert via as.character()
cars$horsepower <- as.numeric(as.character(cars$horsepower))
kable(head(cars[122:398,]))
| | ID | mpg | cylinders | displacement | horsepower | weight | acceleration | model | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 122 | 122 | 15 | 8 | 318 | 29 | 3399 | 11.0 | 73 | USA | dodge dart custom |
| 123 | 123 | 24 | 4 | 121 | 8 | 2660 | 14.0 | 73 | Europe | saab 99le |
| 124 | 124 | 20 | 6 | 156 | 14 | 2807 | 13.5 | 73 | Japan | toyota mark ii |
| 125 | 125 | 11 | 8 | 350 | 39 | 3664 | 11.0 | 73 | USA | oldsmobile omega |
| 126 | 126 | 20 | 6 | 198 | 91 | 3102 | 16.5 | 74 | USA | plymouth duster |
| 127 | 127 | 21 | 6 | 200 | 1 | 2875 | 17.0 | 74 | USA | ford maverick |
There are no missing values in the dataset.
any(is.na(cars))
## [1] FALSE
As mentioned in the introduction, many significant shifts in the car industry took place during the analyzed period. In order to check how the introduced changes influenced the dependence of clusters on the cars’ origin, I selected two subsets of data separated by several years. Namely, I distinguished cars produced in 1973–74 (cars74) from those produced in 1981–82 (cars82). In both cases, I omitted the variables unnecessary for clustering, as shown below.
# 1973-74 cars: keep only the six numeric characteristics (columns 2-7)
cars74 <- cars[126:182, 2:7]
# 1981-82 cars: same columns
cars82 <- cars[339:398, 2:7]
## Number of observations in cars74: 57
## Number of observations in cars82: 60
As the subject of this analysis is how the differences between cars of different origins changed over time, the tables below summarize the characteristics of the cars from 1973–1974 for each of the three origins separately.
# 1973-74 cars, this time including the 'origin' column
carss <- cars[126:182,c(2:7,9)]
summary(carss[carss$origin=='USA',]) %>% kable()
| mpg | cylinders | displacement | horsepower | weight | acceleration | origin |
|---|---|---|---|---|---|---|
| Min. :13.00 | Min. :4.000 | Min. : 90 | Min. : 1 | Min. :2125 | Min. :11.50 | USA :35 |
| 1st Qu.:15.00 | 1st Qu.:6.000 | 1st Qu.:225 | 1st Qu.: 8 | 1st Qu.:3012 | 1st Qu.:14.50 | Europe: 0 |
| Median :17.00 | Median :6.000 | Median :250 | Median :27 | Median :3432 | Median :16.00 | Japan : 0 |
| Mean :17.89 | Mean :6.343 | Mean :246 | Mean :37 | Mean :3520 | Mean :16.19 | NA |
| 3rd Qu.:20.00 | 3rd Qu.:8.000 | 3rd Qu.:302 | 3rd Qu.:71 | 3rd Qu.:4024 | 3rd Qu.:17.00 | NA |
| Max. :28.00 | Max. :8.000 | Max. :400 | Max. :93 | Max. :4699 | Max. :21.00 | NA |
summary(carss[carss$origin=='Europe',]) %>% kable()
| mpg | cylinders | displacement | horsepower | weight | acceleration | origin |
|---|---|---|---|---|---|---|
| Min. :22.00 | Min. :4 | Min. : 79.0 | Min. :11.00 | Min. :1937 | Min. :13.50 | USA : 0 |
| 1st Qu.:23.75 | 1st Qu.:4 | 1st Qu.: 90.0 | 1st Qu.:66.25 | 1st Qu.:2081 | 1st Qu.:14.38 | Europe:12 |
| Median :25.50 | Median :4 | Median : 97.5 | Median :71.00 | Median :2234 | Median :15.25 | Japan : 0 |
| Mean :25.75 | Mean :4 | Mean :101.3 | Mean :69.83 | Mean :2355 | Mean :15.21 | NA |
| 3rd Qu.:26.75 | 3rd Qu.:4 | 3rd Qu.:117.0 | 3rd Qu.:80.25 | 3rd Qu.:2677 | 3rd Qu.:16.12 | NA |
| Max. :31.00 | Max. :4 | Max. :121.0 | Max. :94.00 | Max. :2957 | Max. :17.00 | NA |
summary(carss[carss$origin=='Japan',]) %>% kable()
| mpg | cylinders | displacement | horsepower | weight | acceleration | origin |
|---|---|---|---|---|---|---|
| Min. :24.00 | Min. :4 | Min. : 71.0 | Min. :53.00 | Min. :1649 | Min. :13.50 | USA : 0 |
| 1st Qu.:24.50 | 1st Qu.:4 | 1st Qu.: 80.0 | 1st Qu.:59.00 | 1st Qu.:1864 | 1st Qu.:15.62 | Europe: 0 |
| Median :30.00 | Median :4 | Median : 94.0 | Median :67.50 | Median :2087 | Median :16.75 | Japan :10 |
| Mean :28.60 | Mean :4 | Mean : 97.8 | Mean :72.90 | Mean :2153 | Mean :17.00 | NA |
| 3rd Qu.:31.75 | 3rd Qu.:4 | 3rd Qu.:116.2 | 3rd Qu.:91.25 | 3rd Qu.:2464 | 3rd Qu.:18.62 | NA |
| Max. :33.00 | Max. :4 | Max. :134.0 | Max. :93.00 | Max. :2702 | Max. :21.00 | NA |
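For a more compact side-by-side view, the same comparison can be produced with a single dplyr pipeline; a minimal sketch (column means only):

# A compact alternative to the three summary() calls above:
# mean of each characteristic by origin (dplyr is already loaded)
carss %>%
  group_by(origin) %>%
  summarise(across(mpg:acceleration, mean))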
The most significant differences can be observed in weight and engine displacement. For these variables, cars from the USA had much higher values in 1973 and 1974 than cars from the rest of the world. In Europe and Japan, 4-cylinder engines were already dominant at that time, while in the United States producers were still mostly using 6- and 8-cylinder engines. Fuel economy (the mpg column), on the other hand, was on average lower for American cars. Furthermore, the differences between cars from Japan and Europe produced in those years are much less apparent in this summary.
Naturally, these statistics only show general trends by origin. From the perspective of this analysis, it is much more interesting how the differences in car characteristics play out at the level of individual cars.
Since it is always good to look at the relationships between the variables, the correlation matrices for both subsets are presented below.
library(gridExtra)
library(corrplot)
corrplot(cor(cars74, use="complete"), method="number", type="upper", diag=F, tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="cars74", mar=c(0,0,1,0))
corrplot(cor(cars82, use="complete"), method="number", type="upper", diag=F, tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="cars82", mar=c(0,0,1,0))
It is not difficult to notice that in the years 1973–74 the variables were generally more strongly correlated with each other. Almost every coefficient is greater (in absolute terms) for cars74 than for cars82.
The strong correlation between some characteristics should not be surprising. For example, the engine displacement is the total volume of all combustion cylinders, which is strongly related to fuel consumption (and thus MPG). Moreover, the number of cylinders appears directly in the displacement formula (see Wikipedia). In turn, a heavier car generally needs more cylinders and uses more fuel on average.
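For reference, the standard displacement formula expresses this relationship directly, with bore $b$, stroke $s$, and number of cylinders $n_c$:

$$V_d = \frac{\pi}{4} \, b^2 \, s \, n_c$$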
Since the considered variables are measured on different scales, it is recommended to standardize the data in order to make them comparable.
cars74 <- scale(cars74)
cars82 <- scale(cars82)
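scale() performs the usual z-score standardization, i.e., it subtracts each column’s mean and divides by its standard deviation. A minimal check on the mpg column (illustrative only; mpg_raw is a helper introduced here):

# z-score standardization: (x - mean(x)) / sd(x)
mpg_raw <- cars$mpg[126:182]
all.equal(unname(cars74[, "mpg"]),
          (mpg_raw - mean(mpg_raw)) / sd(mpg_raw))  # expected: TRUE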
First, I will perform the clustering analysis for the cars74 dataset. Later, I will turn to the cars82 dataset for the comparison mentioned above.
Before proceeding with any clustering method, it is worth assessing the general clustering tendency of the data. For this purpose, the Hopkins statistic and a visual assessment were used.
library(factoextra)
get_clust_tendency(cars74, 2, graph=TRUE, gradient=list(low="blue", high="white"), seed=1234)
## $hopkins_stat
## [1] 0.7935214
##
## $plot
In the context of the conducted analysis, the results are rather satisfactory. The Hopkins statistic is equal to 0.79, which is far above 0.5; thus, according to the R documentation, we can conclude that the dataset is significantly clusterable.
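In one common formulation (up to implementation details of factoextra), the Hopkins statistic compares the nearest-neighbor distances $u_i$ from $m$ uniformly generated artificial points to the real data with the nearest-neighbor distances $w_i$ among $m$ randomly sampled real points:

$$H = \frac{\sum_{i=1}^{m} u_i}{\sum_{i=1}^{m} u_i + \sum_{i=1}^{m} w_i}$$

For clustered data the real points lie close together, so the $w_i$ are small and $H$ approaches 1, while for uniformly distributed data $H$ stays close to 0.5.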
A similar conclusion can be drawn from the darker square-shaped blocks along the diagonal of the dissimilarity image above. Its visual assessment also indicates a general tendency toward clustering.
In the next step, it is necessary to determine the optimal number of clusters for each partitioning clustering method. Since the analyzed datasets (cars74, cars82) are rather small, there is no need to consider CLARA, which is intended for large datasets. However, for comparative purposes, both K-means and PAM will be implemented. The optimal number of clusters will be chosen primarily based on the average silhouette statistic.
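As a reminder, the silhouette width of observation $i$ compares its average dissimilarity $a(i)$ to the other members of its own cluster with the lowest average dissimilarity $b(i)$ to any other cluster:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

Values close to 1 indicate a well-assigned observation, values near 0 an observation lying between clusters, and negative values a likely misassignment; the number of clusters maximizing the average $s(i)$ is preferred.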
f1 <- fviz_nbclust(cars74, FUNcluster = kmeans, method = "silhouette") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(cars74, FUNcluster = cluster::pam, method = "silhouette") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
The results indicate that the K-means analysis for the cars74 dataset should be conducted with 2 clusters, and the PAM analysis with 4 clusters. However, if we look closely at the charts, we can see that in the case of PAM the average silhouette width is almost the same for 2 clusters as for 4. What is more, in both cases (K-means, PAM) the average silhouette width for 3 clusters is only slightly lower than for the optimal number. This is good news, especially because the dataset has three levels of the origin variable, which is a central concern of this analysis.
To confirm the results, it is always good to look at an alternative method. Therefore, I check the stability of the results obtained above using the WSS (within-cluster sum of squares) statistic.
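The total within-cluster sum of squares for a partition into clusters $C_1, \dots, C_K$ with centers $\mu_k$ is

$$WSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 .$$

Since WSS always decreases as the number of clusters grows, one looks for the "elbow" after which adding another cluster no longer pays off.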
f1 <- fviz_nbclust(cars74, FUNcluster = kmeans, method = "wss") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(cars74, FUNcluster = cluster::pam, method = "wss") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
Summing up, in both cases (K-means and PAM) the division into 2 clusters seems the most promising. However, due to the subject of this analysis and the results obtained, the case of 3 clusters will also be considered. In addition, as suggested by the silhouette statistic, PAM with 4 clusters will be analyzed as well.
First, the clustering will be performed using the K-means algorithm, for the cases with two and with three clusters.
It is worth noting that in the analysis below, Euclidean distance was used to calculate dissimilarities between observations. After running the calculations for several basic measures (including correlation-based ones), the results turned out to be so close to each other that I decided to stick to one measure throughout the paper.
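As a rough illustration of that comparison, the Euclidean dissimilarities can be correlated with correlation-based ones; a minimal sketch using get_dist() from factoextra:

# Compare Euclidean and Pearson-correlation-based dissimilarities on the
# scaled cars74 data; a high correlation between the two suggests that the
# choice of measure matters little here
d_euc <- get_dist(cars74, method = "euclidean")
d_cor <- get_dist(cars74, method = "pearson")
cor(as.vector(d_euc), as.vector(d_cor))

With the measures agreeing this closely, the Euclidean distance is used from here on.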
km2 <- eclust(cars74, k=2 , FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c2 <- fviz_cluster(km2, data=cars74, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 2 clusters")
s2 <- fviz_silhouette(km2)
## cluster size ave.sil.width
## 1 1 29 0.58
## 2 2 28 0.40
grid.arrange(c2, s2, ncol=2)
For each case, I provide a small table showing the distribution of cars of different origins among the clusters. In other words, I will be checking how the cluster division depends on the origin.
table(cars$origin[126:182], km2$cluster)
##
## 1 2
## USA 7 28
## Europe 12 0
## Japan 10 0
km3 <- eclust(cars74, k=3, FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c3 <- fviz_cluster(km3, data=cars74, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 3 clusters")
s3 <- fviz_silhouette(km3)
## cluster size ave.sil.width
## 1 1 18 0.32
## 2 2 11 0.56
## 3 3 28 0.53
grid.arrange(c3, s3, ncol=2)
Checking the dependence of cluster division on the origin:
table(cars$origin[126:182], km3$cluster)
##
## 1 2 3
## USA 18 11 6
## Europe 0 0 12
## Japan 0 0 10
The silhouette statistic is slightly higher in the case of 2 clusters. On the other hand, K-means with 3 clusters has no observations with negative silhouette values (which is good), while K-means with 2 clusters does. It can also be easily noticed that the 3-cluster solution was essentially created by splitting one of the clusters from the 2-cluster solution.
When it comes to the relationship between clusters and origin, it turns out that in both cases all European and Japanese cars fall into a single cluster (along with several American cars). The vast majority of American cars, on the other hand, are so different in terms of characteristics that they form separate clusters. It follows that, based on clustering, cars produced in 1973–74 in the USA differed significantly from those produced in Europe or Japan.
The tables below present the basic statistics for the characteristics of each cluster. It is worth recalling that the data has been scaled, hence negative values appear.
cars74_cl <- as.data.frame(cbind(cars74, km3$cluster))
colnames(cars74_cl) <- c(colnames(cars74),"cluster")
# Cluster 1 (red)
summary(cars74_cl[cars74_cl$cluster==1,1:6]) %>% kable()
| mpg | cylinders | displacement | horsepower | weight | acceleration |
|---|---|---|---|---|---|
| Min. :-1.11299 | Min. :0.3629 | Min. :0.09329 | Min. :-1.4863 | Min. :-0.18572 | Min. :-0.529674 |
| 1st Qu.:-0.93965 | 1st Qu.:0.3629 | 1st Qu.:0.40704 | 1st Qu.:-1.4334 | 1st Qu.: 0.09362 | 1st Qu.: 0.001035 |
| Median :-0.59298 | Median :0.3629 | Median :0.46758 | Median :-1.2749 | Median : 0.40427 | Median : 0.413808 |
| Mean :-0.65076 | Mean :0.3629 | Mean :0.51101 | Mean :-0.5886 | Mean : 0.41226 | Mean : 0.675887 |
| 3rd Qu.:-0.41965 | 3rd Qu.:0.3629 | 3rd Qu.:0.66574 | 3rd Qu.: 0.5675 | 3rd Qu.: 0.77693 | 3rd Qu.: 1.298323 |
| Max. :-0.07298 | Max. :0.3629 | Max. :0.75381 | Max. : 1.2319 | Max. : 1.01050 | Max. : 2.300773 |
# Cluster 2 (green)
summary(cars74_cl[cars74_cl$cluster==2,1:6]) %>% kable()
| mpg | cylinders | displacement | horsepower | weight | acceleration |
|---|---|---|---|---|---|
| Min. :-1.4597 | Min. :1.656 | Min. :0.7978 | Min. :-1.2749 | Min. :0.1551 | Min. :-2.18077 |
| 1st Qu.:-1.2863 | 1st Qu.:1.656 | 1st Qu.:1.2382 | 1st Qu.:-0.7916 | 1st Qu.:1.3490 | 1st Qu.:-1.23729 |
| Median :-1.2863 | Median :1.656 | Median :1.4143 | Median :-0.7010 | Median :1.6480 | Median :-1.00142 |
| Mean :-1.1130 | Mean :1.656 | Mean :1.4754 | Mean :-0.7532 | Mean :1.4181 | Mean :-1.06574 |
| 3rd Qu.:-0.9397 | 3rd Qu.:1.656 | 3rd Qu.:1.7666 | 3rd Qu.:-0.6406 | 3rd Qu.:1.8688 | 3rd Qu.:-0.76555 |
| Max. :-0.2463 | Max. :1.656 | Max. :2.3171 | Max. :-0.3990 | Max. :1.9285 | Max. :-0.05793 |
# Cluster 3 (blue)
summary(cars74_cl[cars74_cl$cluster==3,1:6]) %>% kable()
| mpg | cylinders | displacement | horsepower | weight | acceleration |
|---|---|---|---|---|---|
| Min. :-0.5930 | Min. :-0.9299 | Min. :-1.3048 | Min. :-1.1842 | Min. :-1.60682 | Min. :-1.23729 |
| 1st Qu.: 0.4470 | 1st Qu.:-0.9299 | 1st Qu.:-1.0957 | 1st Qu.: 0.4165 | 1st Qu.:-1.19735 | 1st Qu.:-0.76555 |
| Median : 0.7937 | Median :-0.9299 | Median :-0.9526 | Median : 0.6279 | Median :-0.88352 | Median :-0.05793 |
| Mean : 0.8556 | Mean :-0.8837 | Mean :-0.9081 | Mean : 0.6743 | Mean :-0.82213 | Mean :-0.01581 |
| 3rd Qu.: 1.3137 | 3rd Qu.:-0.9299 | 3rd Qu.:-0.7544 | 3rd Qu.: 1.0583 | 3rd Qu.:-0.50014 | 3rd Qu.: 0.41381 |
| Max. : 2.0070 | Max. : 0.3629 | Max. :-0.2040 | Max. : 1.3225 | Max. :-0.05938 | Max. : 2.30077 |
The third cluster differs from the other two in exactly the same way in which the cars from Europe and Japan differed from those from the USA (lower weight, lower engine displacement, fewer cylinders, and on average more miles per gallon). The American cars, on the other hand, were split between the remaining clusters mainly according to acceleration, displacement, and the number of cylinders. From the data it can be concluded that the second cluster contains the cars furthest from the market demand of that time.
In this part, the clustering is performed using the PAM algorithm. Given the interesting tendency of American cars to be clustered separately, I decided to also take into account the case with 4 clusters, as suggested by the average silhouette width.
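Unlike K-means, PAM chooses $K$ actual observations (medoids) $m_1, \dots, m_K$ and minimizes the total dissimilarity of each observation to its nearest medoid:

$$\min_{m_1, \dots, m_K} \sum_{i=1}^{n} d\big(x_i,\, m_{c(i)}\big),$$

where $c(i)$ is the index of the medoid closest to $x_i$. Using medoids rather than means makes PAM more robust to outliers.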
pam2 <- eclust(cars74, k=2 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp2 <- fviz_cluster(pam2, data=cars74, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 2 clusters")
sp2 <- fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 32 0.52
## 2 2 25 0.44
grid.arrange(cp2, sp2, ncol=2)
Checking the dependence of cluster division on the origin:
table(cars$origin[126:182], pam2$cluster)
##
## 1 2
## USA 10 25
## Europe 12 0
## Japan 10 0
pam3 <- eclust(cars74, k=3 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp3 <- fviz_cluster(pam3, data=cars74, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 3 clusters")
sp3 <- fviz_silhouette(pam3)
## cluster size ave.sil.width
## 1 1 31 0.48
## 2 2 15 0.41
## 3 3 11 0.55
grid.arrange(cp3, sp3, ncol=2)
Checking the dependence of cluster division on the origin:
table(cars$origin[126:182], pam3$cluster)
##
## 1 2 3
## USA 9 15 11
## Europe 12 0 0
## Japan 10 0 0
pam4 <- eclust(cars74, k=4 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp4 <- fviz_cluster(pam4, data=cars74, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 4 clusters")
sp4 <- fviz_silhouette(pam4)
## cluster size ave.sil.width
## 1 1 7 0.41
## 2 2 12 0.54
## 3 3 27 0.48
## 4 4 11 0.52
grid.arrange(cp4, sp4, ncol=2)
Checking the dependence of cluster division on the origin:
table(cars$origin[126:182], pam4$cluster)
##
## 1 2 3 4
## USA 7 12 5 11
## Europe 0 0 12 0
## Japan 0 0 10 0
Summarizing the PAM clustering, the average silhouette statistic is almost at the same level for all of the above numbers of clusters (about 0.49). However, only in the case of PAM with 4 clusters were no observations with negative silhouette values observed.
As for the differences in cars depending on their origin, similarly to the K-means cases, all European and Japanese cars were grouped each time into a single cluster (regardless of the number of clusters). This shows a great similarity in the characteristics of cars from these two regions. The remaining clusters consisted only of American cars, which in turn indicates their significant differentiation among themselves.
As the last clustering method, hierarchical clustering will be used. This approach builds a hierarchy of clusters based on a chosen way of calculating the similarity between clusters. In the analysis below, I will concentrate on the agglomerative hierarchical clustering technique, since it seems more appropriate for the considered dataset. In this technique, all observations initially form their own clusters, and then the most similar clusters are iteratively merged until a single cluster is formed.
In the hierarchical clustering method, it is necessary to compute the dissimilarity matrix, and the linkage method needs to be specified first. Obviously, there are multiple options, but I have decided to limit myself to two of them: Ward’s method and complete linkage. The first is frequently regarded as a sensible default, especially when there is no clear theoretical justification for any other linkage criterion. The second does well when there is some noise between clusters, which seems to be partly our case.
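For reference, the same two variants can be expressed directly in base R (eclust() used below essentially wraps these steps and adds the cluster assignment):

# Agglomerative clustering in base R: pairwise Euclidean distances,
# then bottom-up merging under the chosen linkage criterion
d <- dist(cars74, method = "euclidean")
hc_ward <- hclust(d, method = "ward.D")
hc_complete <- hclust(d, method = "complete")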
Therefore, we can plot the dendrograms of the agglomerative hierarchical clustering.
hcc <- eclust(cars74, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method="complete")
# Changing labels into origin levels
hcc$labels <- cars$origin[126:182]
plot(hcc, cex=0.5, hang=-1)
rect.hclust(hcc, k=3, border='blue')
Checking the dependence of cluster division on the origin:
cutc <- cutree(hcc, 3)
table(cars$origin[126:182], cutc)
## cutc
## 1 2 3
## USA 19 5 11
## Europe 0 12 0
## Japan 0 10 0
hcw <- eclust(cars74, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method="ward.D")
hcw$labels <- cars$origin[126:182]
plot(hcw, cex=0.5, hang=-1)
rect.hclust(hcw, k=3, border='blue')
Checking the dependence of cluster division on the origin:
cutw <- cutree(hcw, 3)
table(cars$origin[126:182], cutw)
## cutw
## 1 2 3
## USA 19 5 11
## Europe 0 12 0
## Japan 0 10 0
The results obtained using Ward’s method and complete linkage are very similar. In both cases we obtain one cluster in which all European and Japanese cars are grouped, and two clusters consisting only of American cars. These results fully confirm those previously obtained with K-means and PAM.
This time, however, the structure of the dendrogram, together with the attached origin labels, clearly shows how successive clusters merge based on the similarities between their elements. Particularly noteworthy is the fact that, according to both dendrograms, obtaining two clusters would require joining the two clusters containing exclusively American cars. This once again highlights the differences in car characteristics between American and European/Japanese cars in those years.
In order to assess the consistency of the clustering results, one can use the clValid package. It contains stability measures which check the stability of a method by comparing the clustering based on the full data with the clusterings obtained after removing each column of the data, one at a time. They include:

- APN (average proportion of non-overlap): the average proportion of observations not placed in the same cluster by the two clusterings;
- AD (average distance): the average distance between observations placed in the same cluster by both clusterings;
- ADM (average distance between means): the average distance between the cluster centers of observations placed in the same cluster by both clusterings;
- FOM (figure of merit): the average intra-cluster variance of the removed column, where the clustering is based on the remaining columns.

Note that the smaller the values of these measures, the more consistent the clustering results.
Therefore, using the clValid() function, we can check which of the used methods, and with what number of clusters, is the most stable for the data considered in this paper.
library(clValid)
clmethods <- c("hierarchical","kmeans","pam")
st <- clValid(cars74, nClust=2:6, clMethods=clmethods, validation="stability", method="complete")
optimalScores(st)
## Score Method Clusters
## APN 0.04226532 pam 3
## AD 1.23759216 pam 6
## ADM 0.16481820 pam 3
## FOM 0.48882087 pam 6
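Note that optimalScores() reports only the best method for each measure; assuming one wants the scores for every method and cluster count, the clValid object can also be summarized:

# Full table of the four stability measures for every method and
# number of clusters (output not shown here)
summary(st)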
Based on the four measures mentioned above, the PAM algorithm was clearly assessed as the most consistent of the methods used for this data. There is no consensus on the number of clusters, but considering that we have used no more than 4 clusters so far, the choice effectively falls on 3.
Finally, we are ready to analyze how the situation in the car industry changed over the next 8 years. Therefore, I will check what kind of cars were produced in 1981–82 and to what extent they differed from each other depending on their origin.
get_clust_tendency(cars82, 2, graph=TRUE, gradient=list(low="blue", high="white"), seed=1234)
## $hopkins_stat
## [1] 0.8781423
##
## $plot
The Hopkins statistic is very high (0.88), which means that the dataset is significantly clusterable. The visual assessment confirms this.
This time I limit the decision to the silhouette statistic only.
f1 <- fviz_nbclust(cars82, FUNcluster = kmeans, method = "silhouette") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(cars82, FUNcluster = cluster::pam, method = "silhouette") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
In both cases, the optimal number of clusters was assessed as 2. What is more, the average silhouette width for K-means and PAM is comparable in the case of 2 clusters. Since the results for K-means and PAM are very similar throughout this analysis, I will limit the clustering of cars82 to the PAM method, i.e., the one assessed as the most consistent for this kind of data.
pam2 <- eclust(cars82, k=2 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp2 <- fviz_cluster(pam2, data=cars82, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 2 clusters")
sp2 <- fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 48 0.55
## 2 2 12 0.27
grid.arrange(cp2, sp2, ncol=2)
Checking the dependence of cluster division on the origin:
table(cars$origin[339:398], pam2$cluster)
##
## 1 2
## USA 24 9
## Europe 5 1
## Japan 19 2
I also decided to provide a dendrogram, as it clearly shows the stages of forming the major clusters and simultaneously gives an interesting insight into the changes in the produced cars.
hcc <- eclust(cars82, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method="complete")
hcc$labels <- cars$origin[339:398]
plot(hcc, cex=0.5, hang=-1)
rect.hclust(hcc, k=3, border='blue')
Checking the dependence of cluster division on the origin:
cutc <- cutree(hcc, 3)
table(cars$origin[339:398], cutc)
## cutc
## 1 2 3
## USA 19 8 6
## Europe 3 1 2
## Japan 18 2 1
Undoubtedly, the results obtained using both of the above methods differ significantly from those based on the cars74 dataset. First of all, each cluster contains at least one car of each origin. Secondly, there are no longer clusters consisting only of American cars. What is more, based on the dendrogram, it can be concluded that, compared with eight years earlier, many more American cars are merged with European and Japanese cars already in the early stages of the hierarchical clustering.
It is also interesting that one of the obtained clusters is definitely larger and includes most of the cars. This may indicate a reduced variation in the characteristics of cars from different parts of the world. The most important result, however, is that the differences in cars produced in the years 1981-82 are not as strongly dependent on the origin as it was just a few years earlier.
Finally, for those still interested, I present graphs of the actual differences between those years in selected individual car characteristics by origin, which I already mentioned in the introduction.
library(ggplot2)
theme_update(plot.title = element_text(hjust = 0.5))
a1 <- ggplot(data = cars[126:182,], aes(x = origin, y = mpg)) +
geom_boxplot() +
coord_cartesian(ylim = c(14, 43)) +
ylab('MPG') +
xlab(' ') +
ggtitle('1973-1974')
a2 <- ggplot(data = cars[339:398,], aes(x = origin, y = mpg)) +
geom_boxplot() +
coord_cartesian(ylim = c(14, 43)) +
ylab(' ') + xlab(' ') +
ggtitle('1981-1982')
b1 <- ggplot(data = cars[126:182,], aes(x = origin, y = weight)) +
geom_boxplot() +
coord_cartesian(ylim = c(1800, 4500)) +
ylab('Weight') +
xlab(' ') +
ggtitle('1973-1974')
b2 <- ggplot(data = cars[339:398,], aes(x = origin, y = weight)) +
geom_boxplot() +
coord_cartesian(ylim = c(1800, 4500)) +
ylab(' ') + xlab(' ') +
ggtitle('1981-1982')
c1 <- ggplot(data = cars[126:182,], aes(x = origin, y = displacement)) +
geom_boxplot() +
coord_cartesian(ylim = c(75, 350)) +
ylab('Displacement') +
xlab(' ') +
ggtitle('1973-1974')
c2 <- ggplot(data = cars[339:398,], aes(x = origin, y = displacement)) +
geom_boxplot() +
coord_cartesian(ylim = c(75, 350)) +
ylab(' ') + xlab(' ') +
ggtitle('1981-1982')
grid.arrange(a1,a2, ncol=2, top='MPG by Origin')
grid.arrange(b1,b2, ncol=2, top='Weight by Origin')
grid.arrange(c1,c2, ncol=2, top='Displacement by Origin')
In this paper, I analyzed the grouping of cars produced between 1970 and 1982 based on the similarities and differences in their basic characteristics. Using fundamental clustering methods, it was shown that at the beginning of this period American cars were markedly different from European and Japanese ones. However, a number of introduced changes allowed the American brands not only to stay on the market, but also to bring the characteristics of their cars significantly closer to the more desirable cars from other parts of the world. In the context of clustering, this resulted in a significant reduction in the dependence of the cluster divisions on the origin of the cars.