In the 1970s and early 1980s, the American car industry had to undergo radical changes in order to survive on the global market. In the United States, heavy cars with six- or eight-cylinder engines were visibly losing ground to European and Japanese competitors. Global brands focused mainly on delivering lighter cars with less powerful (mostly four-cylinder) but more fuel-efficient engines. For this reason, during those years many of the basic models of American car brands were redesigned in line with market demand, with a primary focus on fuel economy.
The aim of this paper is to analyze the similarities and differences in the basic characteristics of cars from the ’70s and to identify groups of similar cars. In order to find the most accurate grouping, I will use several clustering methods, including partitioning clustering (K-means, PAM) and hierarchical clustering. With regard to the information in the first paragraph, I will also check to what extent the obtained similarities were related to the origin of the cars, and thus whether American cars were significantly different from European and Japanese cars at that time. Based on the available data, I will also assess how much this picture changed over the following years.
The dataset used in this analysis was found on a private GitHub profile. It describes basic characteristics of cars produced in the United States, Europe, and Japan between 1970 and 1982. The columns of the dataset are as follows:
library(knitr)
library(dplyr)
# Loading the data and denoting as 'cars'
cars <- read.csv('https://raw.githubusercontent.com/RodolfoViana/exploratory-data-analysis-dataset-cars/master/cars_multi.csv', header=TRUE)
# Converting 'origin' column into a more descriptive variable
cars$origin <- factor(cars$origin, labels = c('USA', 'Europe', 'Japan'))
# Converting 'horsepower' column into a numerical variable (it was imported as
# a factor with 95 levels). Note: calling as.numeric() directly on a factor
# would return the underlying level codes, so we convert via as.character()
cars$horsepower <- as.numeric(as.character(cars$horsepower))
kable(head(cars[122:398,]))
| | ID | mpg | cylinders | displacement | horsepower | weight | acceleration | model | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 122 | 122 | 15 | 8 | 318 | 29 | 3399 | 11.0 | 73 | USA | dodge dart custom |
| 123 | 123 | 24 | 4 | 121 | 8 | 2660 | 14.0 | 73 | Europe | saab 99le |
| 124 | 124 | 20 | 6 | 156 | 14 | 2807 | 13.5 | 73 | Japan | toyota mark ii |
| 125 | 125 | 11 | 8 | 350 | 39 | 3664 | 11.0 | 73 | USA | oldsmobile omega |
| 126 | 126 | 20 | 6 | 198 | 91 | 3102 | 16.5 | 74 | USA | plymouth duster |
| 127 | 127 | 21 | 6 | 200 | 1 | 2875 | 17.0 | 74 | USA | ford maverick |
There are no missing values in the dataset.
any(is.na(cars))
## [1] FALSE
As mentioned in the introduction, many significant shifts in the car industry took place during the analyzed period. In order to check how the introduced changes influenced the dependence of clusters on the cars’ origin, I selected two subsets of data separated by several years. Namely, I distinguished cars produced in 1973–74 (cars74) from those produced in 1981–82 (cars82). In both cases, I omitted the variables unnecessary for clustering, as shown below.
# 1973-74 cars: keep only the six numeric characteristics (columns 2-7)
cars74 <- cars[126:182, 2:7]
# 1981-82 cars: same columns
cars82 <- cars[339:398, 2:7]
## Number of observations in cars74: 57
## Number of observations in cars82: 60
As the subject of this analysis is how the differences between cars of different origins changed over time, the tables below summarize the characteristics of the cars from 1973–1974 for each of the three origins separately.
# 1973-74 cars, this time including the 'origin' column
carss <- cars[126:182,c(2:7,9)]
summary(carss[carss$origin=='USA',]) %>% kable()
| mpg | cylinders | displacement | horsepower | weight | acceleration | origin |
|---|---|---|---|---|---|---|
| Min. :13.00 | Min. :4.000 | Min. : 90 | Min. : 1 | Min. :2125 | Min. :11.50 | USA :35 |
| 1st Qu.:15.00 | 1st Qu.:6.000 | 1st Qu.:225 | 1st Qu.: 8 | 1st Qu.:3012 | 1st Qu.:14.50 | Europe: 0 |
| Median :17.00 | Median :6.000 | Median :250 | Median :27 | Median :3432 | Median :16.00 | Japan : 0 |
| Mean :17.89 | Mean :6.343 | Mean :246 | Mean :37 | Mean :3520 | Mean :16.19 | NA |
| 3rd Qu.:20.00 | 3rd Qu.:8.000 | 3rd Qu.:302 | 3rd Qu.:71 | 3rd Qu.:4024 | 3rd Qu.:17.00 | NA |
| Max. :28.00 | Max. :8.000 | Max. :400 | Max. :93 | Max. :4699 | Max. :21.00 | NA |
summary(carss[carss$origin=='Europe',]) %>% kable()
| mpg | cylinders | displacement | horsepower | weight | acceleration | origin |
|---|---|---|---|---|---|---|
| Min. :22.00 | Min. :4 | Min. : 79.0 | Min. :11.00 | Min. :1937 | Min. :13.50 | USA : 0 |
| 1st Qu.:23.75 | 1st Qu.:4 | 1st Qu.: 90.0 | 1st Qu.:66.25 | 1st Qu.:2081 | 1st Qu.:14.38 | Europe:12 |
| Median :25.50 | Median :4 | Median : 97.5 | Median :71.00 | Median :2234 | Median :15.25 | Japan : 0 |
| Mean :25.75 | Mean :4 | Mean :101.3 | Mean :69.83 | Mean :2355 | Mean :15.21 | NA |
| 3rd Qu.:26.75 | 3rd Qu.:4 | 3rd Qu.:117.0 | 3rd Qu.:80.25 | 3rd Qu.:2677 | 3rd Qu.:16.12 | NA |
| Max. :31.00 | Max. :4 | Max. :121.0 | Max. :94.00 | Max. :2957 | Max. :17.00 | NA |
summary(carss[carss$origin=='Japan',]) %>% kable()
| mpg | cylinders | displacement | horsepower | weight | acceleration | origin |
|---|---|---|---|---|---|---|
| Min. :24.00 | Min. :4 | Min. : 71.0 | Min. :53.00 | Min. :1649 | Min. :13.50 | USA : 0 |
| 1st Qu.:24.50 | 1st Qu.:4 | 1st Qu.: 80.0 | 1st Qu.:59.00 | 1st Qu.:1864 | 1st Qu.:15.62 | Europe: 0 |
| Median :30.00 | Median :4 | Median : 94.0 | Median :67.50 | Median :2087 | Median :16.75 | Japan :10 |
| Mean :28.60 | Mean :4 | Mean : 97.8 | Mean :72.90 | Mean :2153 | Mean :17.00 | NA |
| 3rd Qu.:31.75 | 3rd Qu.:4 | 3rd Qu.:116.2 | 3rd Qu.:91.25 | 3rd Qu.:2464 | 3rd Qu.:18.62 | NA |
| Max. :33.00 | Max. :4 | Max. :134.0 | Max. :93.00 | Max. :2702 | Max. :21.00 | NA |
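For a more compact side-by-side view, the same comparison can be produced with a single dplyr pipeline; a minimal sketch (column means only):

# A compact alternative to the three summary() calls above:
# mean of each characteristic by origin (dplyr is already loaded)
carss %>%
  group_by(origin) %>%
  summarise(across(mpg:acceleration, mean))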
The most significant differences can be observed in weight and engine displacement. For these variables, cars from the USA had much higher values in 1973 and 1974 than cars from the rest of the world. In Europe and Japan, 4-cylinder engines were already dominant at that time, while in the United States producers were still mostly using 6- and 8-cylinder engines. Fuel economy (the mpg column), on the other hand, was on average lower for American cars. Furthermore, the differences between cars from Japan and Europe produced in those years are much less apparent in this summary.
Naturally, these statistics only show general trends by origin. From the perspective of this analysis, it is much more interesting how the differences in car characteristics play out at the level of individual cars.
Since it is always good to look at the relationships between the variables, the correlation matrices for both subsets are presented below.
library(gridExtra)
library(corrplot)
corrplot(cor(cars74, use="complete"), method="number", type="upper", diag=F, tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="cars74", mar=c(0,0,1,0))
corrplot(cor(cars82, use="complete"), method="number", type="upper", diag=F, tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="cars82", mar=c(0,0,1,0))
It is not difficult to notice that in the years 1973–74 the variables were generally more strongly correlated with each other. Almost every coefficient is greater (in absolute terms) for cars74 than for cars82.
The strong correlation between some characteristics should not be surprising. For example, the engine displacement is the total volume of all combustion cylinders, which is strongly related to fuel consumption (and thus MPG). Moreover, the number of cylinders appears directly in the displacement formula (see Wikipedia). In turn, a heavier car generally needs more cylinders and uses more fuel on average.
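For reference, the standard displacement formula expresses this relationship directly, with bore $b$, stroke $s$, and number of cylinders $n_c$:

$$V_d = \frac{\pi}{4} \, b^2 \, s \, n_c$$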
Since the considered variables are measured on different scales, it is recommended to standardize the data in order to make them comparable.
cars74 <- scale(cars74)
cars82 <- scale(cars82)
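scale() performs the usual z-score standardization, i.e., it subtracts each column’s mean and divides by its standard deviation. A minimal check on the mpg column (illustrative only; mpg_raw is a helper introduced here):

# z-score standardization: (x - mean(x)) / sd(x)
mpg_raw <- cars$mpg[126:182]
all.equal(unname(cars74[, "mpg"]),
          (mpg_raw - mean(mpg_raw)) / sd(mpg_raw))  # expected: TRUE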
First, I will perform the clustering analysis for the cars74 dataset. Later, I will turn to the cars82 dataset for the comparison mentioned above.
Before proceeding with any clustering method, it is worth assessing the general clustering tendency of the data. For this purpose, the Hopkins statistic and a visual assessment were used.
library(factoextra)
get_clust_tendency(cars74, 2, graph=TRUE, gradient=list(low="blue", high="white"), seed=1234)
## $hopkins_stat
## [1] 0.7935214
##
## $plot
In the context of the conducted analysis, the results are rather satisfactory. The Hopkins statistic is equal to 0.79, which is far above 0.5; thus, according to the R documentation, we can conclude that the dataset is significantly clusterable.
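In one common formulation (up to implementation details of factoextra), the Hopkins statistic compares the nearest-neighbor distances $u_i$ from $m$ uniformly generated artificial points to the real data with the nearest-neighbor distances $w_i$ among $m$ randomly sampled real points:

$$H = \frac{\sum_{i=1}^{m} u_i}{\sum_{i=1}^{m} u_i + \sum_{i=1}^{m} w_i}$$

For clustered data the real points lie close together, so the $w_i$ are small and $H$ approaches 1, while for uniformly distributed data $H$ stays close to 0.5.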
A similar conclusion can be drawn from the darker square-shaped blocks along the diagonal of the dissimilarity image above. Its visual assessment also indicates a general tendency toward clustering.
In the next step, it is necessary to determine the optimal number of clusters for each partitioning clustering method. Since the analyzed datasets (cars74, cars82) are rather small, there is no need to consider CLARA, which is intended for large datasets. However, for comparative purposes, both K-means and PAM will be implemented. The optimal number of clusters will be chosen primarily based on the average silhouette statistic.
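As a reminder, the silhouette width of observation $i$ compares its average dissimilarity $a(i)$ to the other members of its own cluster with the lowest average dissimilarity $b(i)$ to any other cluster:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

Values close to 1 indicate a well-assigned observation, values near 0 an observation lying between clusters, and negative values a likely misassignment; the number of clusters maximizing the average $s(i)$ is preferred.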
f1 <- fviz_nbclust(cars74, FUNcluster = kmeans, method = "silhouette") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(cars74, FUNcluster = cluster::pam, method = "silhouette") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
The results indicate that the K-means analysis for the cars74 dataset should be conducted with 2 clusters, and the PAM analysis with 4 clusters. However, if we look closely at the charts, we can see that in the case of PAM the average silhouette width is almost the same for 2 clusters as for 4. What is more, in both cases (K-means, PAM) the average silhouette width for 3 clusters is only slightly lower than for the optimal number. This is good news, especially because the dataset has three levels of the origin variable, which is a central concern of this analysis.
To confirm the results, it is always good to look at an alternative method. Therefore, I check the stability of the results obtained above using the WSS (within-cluster sum of squares) statistic.
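The total within-cluster sum of squares for a partition into clusters $C_1, \dots, C_K$ with centers $\mu_k$ is

$$WSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 .$$

Since WSS always decreases as the number of clusters grows, one looks for the "elbow" after which adding another cluster no longer pays off.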
f1 <- fviz_nbclust(cars74, FUNcluster = kmeans, method = "wss") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(cars74, FUNcluster = cluster::pam, method = "wss") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
Summing up, in both cases (K-means and PAM) the division into 2 clusters seems the most promising. However, due to the subject of this analysis and the results obtained, the case of 3 clusters will also be considered. In addition, as suggested by the silhouette statistic, PAM with 4 clusters will be analyzed as well.
First, the clustering will be performed using the K-means algorithm, for the cases with two and with three clusters.
It is worth noting that in the analysis below, Euclidean distance was used to calculate dissimilarities between observations. After running the calculations for several basic measures (including correlation-based ones), the results turned out to be so close to each other that I decided to stick to one measure throughout the paper.
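As a rough illustration of that comparison, the Euclidean dissimilarities can be correlated with correlation-based ones; a minimal sketch using get_dist() from factoextra:

# Compare Euclidean and Pearson-correlation-based dissimilarities on the
# scaled cars74 data; a high correlation between the two suggests that the
# choice of measure matters little here
d_euc <- get_dist(cars74, method = "euclidean")
d_cor <- get_dist(cars74, method = "pearson")
cor(as.vector(d_euc), as.vector(d_cor))

With the measures agreeing this closely, the Euclidean distance is used from here on.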
km2 <- eclust(cars74, k=2 , FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c2 <- fviz_cluster(km2, data=cars74, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 2 clusters")
s2 <- fviz_silhouette(km2)
## cluster size ave.sil.width
## 1 1 29 0.58
## 2 2 28 0.40
grid.arrange(c2, s2, ncol=2)
For each case, I provide a small table showing the distribution of cars of different origins among the clusters. In other words, I will be checking how the cluster division depends on the origin.
table(cars$origin[126:182], km2$cluster)
##
## 1 2
## USA 7 28
## Europe 12 0
## Japan 10 0
km3 <- eclust(cars74, k=3, FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c3 <- fviz_cluster(km3, data=cars74, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 3 clusters")
s3 <- fviz_silhouette(km3)
## cluster size ave.sil.width
## 1 1 18 0.32
## 2 2 11 0.56
## 3 3 28 0.53
grid.arrange(c3, s3, ncol=2)
Checking the dependence of cluster division on the origin:
table(cars$origin[126:182], km3$cluster)
##
## 1 2 3
## USA 18 11 6
## Europe 0 0 12
## Japan 0 0 10
The silhouette statistic is slightly higher in the case of 2 clusters. On the other hand, K-means with 3 clusters has no observations with negative silhouette values (which is good), while K-means with 2 clusters does. It can also be easily noticed that the 3-cluster solution was essentially created by splitting one of the clusters from the 2-cluster solution.
When it comes to the relationship between clusters and origin, it turns out that in both cases all European and Japanese cars fall into a single cluster (along with several American cars). The vast majority of American cars, on the other hand, are so different in terms of characteristics that they form separate clusters. It follows that, based on clustering, cars produced in 1973–74 in the USA differed significantly from those produced in Europe or Japan.
The tables below present the basic statistics for the characteristics of each cluster. It is worth recalling that the data has been scaled, hence negative values appear.
cars74_cl <- as.data.frame(cbind(cars74, km3$cluster))
colnames(cars74_cl) <- c(colnames(cars74),"cluster")
# Cluster 1 (red)
summary(cars74_cl[cars74_cl$cluster==1,1:6]) %>% kable()
| mpg | cylinders | displacement | horsepower | weight | acceleration |
|---|---|---|---|---|---|
| Min. :-1.11299 | Min. :0.3629 | Min. :0.09329 | Min. :-1.4863 | Min. :-0.18572 | Min. :-0.529674 |
| 1st Qu.:-0.93965 | 1st Qu.:0.3629 | 1st Qu.:0.40704 | 1st Qu.:-1.4334 | 1st Qu.: 0.09362 | 1st Qu.: 0.001035 |
| Median :-0.59298 | Median :0.3629 | Median :0.46758 | Median :-1.2749 | Median : 0.40427 | Median : 0.413808 |
| Mean :-0.65076 | Mean :0.3629 | Mean :0.51101 | Mean :-0.5886 | Mean : 0.41226 | Mean : 0.675887 |
| 3rd Qu.:-0.41965 | 3rd Qu.:0.3629 | 3rd Qu.:0.66574 | 3rd Qu.: 0.5675 | 3rd Qu.: 0.77693 | 3rd Qu.: 1.298323 |
| Max. :-0.07298 | Max. :0.3629 | Max. :0.75381 | Max. : 1.2319 | Max. : 1.01050 | Max. : 2.300773 |
# Cluster 2 (green)
summary(cars74_cl[cars74_cl$cluster==2,1:6]) %>% kable()
| mpg | cylinders | displacement | horsepower | weight | acceleration |
|---|---|---|---|---|---|
| Min. :-1.4597 | Min. :1.656 | Min. :0.7978 | Min. :-1.2749 | Min. :0.1551 | Min. :-2.18077 |
| 1st Qu.:-1.2863 | 1st Qu.:1.656 | 1st Qu.:1.2382 | 1st Qu.:-0.7916 | 1st Qu.:1.3490 | 1st Qu.:-1.23729 |
| Median :-1.2863 | Median :1.656 | Median :1.4143 | Median :-0.7010 | Median :1.6480 | Median :-1.00142 |
| Mean :-1.1130 | Mean :1.656 | Mean :1.4754 | Mean :-0.7532 | Mean :1.4181 | Mean :-1.06574 |
| 3rd Qu.:-0.9397 | 3rd Qu.:1.656 | 3rd Qu.:1.7666 | 3rd Qu.:-0.6406 | 3rd Qu.:1.8688 | 3rd Qu.:-0.76555 |
| Max. :-0.2463 | Max. :1.656 | Max. :2.3171 | Max. :-0.3990 | Max. :1.9285 | Max. :-0.05793 |
# Cluster 3 (blue)
summary(cars74_cl[cars74_cl$cluster==3,1:6]) %>% kable()
| mpg | cylinders | displacement | horsepower | weight | acceleration |
|---|---|---|---|---|---|
| Min. :-0.5930 | Min. :-0.9299 | Min. :-1.3048 | Min. :-1.1842 | Min. :-1.60682 | Min. :-1.23729 |
| 1st Qu.: 0.4470 | 1st Qu.:-0.9299 | 1st Qu.:-1.0957 | 1st Qu.: 0.4165 | 1st Qu.:-1.19735 | 1st Qu.:-0.76555 |
| Median : 0.7937 | Median :-0.9299 | Median :-0.9526 | Median : 0.6279 | Median :-0.88352 | Median :-0.05793 |
| Mean : 0.8556 | Mean :-0.8837 | Mean :-0.9081 | Mean : 0.6743 | Mean :-0.82213 | Mean :-0.01581 |
| 3rd Qu.: 1.3137 | 3rd Qu.:-0.9299 | 3rd Qu.:-0.7544 | 3rd Qu.: 1.0583 | 3rd Qu.:-0.50014 | 3rd Qu.: 0.41381 |
| Max. : 2.0070 | Max. : 0.3629 | Max. :-0.2040 | Max. : 1.3225 | Max. :-0.05938 | Max. : 2.30077 |
The third cluster differs from the other two in exactly the same way in which the cars from Europe and Japan differed from those from the USA (lower weight, lower engine displacement, fewer cylinders, and on average more miles per gallon). The American cars, on the other hand, were split between the remaining clusters mainly according to acceleration, displacement, and the number of cylinders. From the data it can be concluded that the second cluster contains the cars furthest from the market demand of that time.
In this part, the clustering is performed using the PAM algorithm. Given the interesting tendency of American cars to be clustered separately, I decided to also take into account the case with 4 clusters, as suggested by the average silhouette width.
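Unlike K-means, PAM chooses $K$ actual observations (medoids) $m_1, \dots, m_K$ and minimizes the total dissimilarity of each observation to its nearest medoid:

$$\min_{m_1, \dots, m_K} \sum_{i=1}^{n} d\big(x_i,\, m_{c(i)}\big),$$

where $c(i)$ is the index of the medoid closest to $x_i$. Using medoids rather than means makes PAM more robust to outliers.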
pam2 <- eclust(cars74, k=2 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp2 <- fviz_cluster(pam2, data=cars74, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 2 clusters")
sp2 <- fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 32 0.52
## 2 2 25 0.44
grid.arrange(cp2, sp2, ncol=2)
Checking the dependence of cluster division on the origin:
table(cars$origin[126:182], pam2$cluster)
##
## 1 2
## USA 10 25
## Europe 12 0
## Japan 10 0
pam3 <- eclust(cars74, k=3 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp3 <- fviz_cluster(pam3, data=cars74, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 3 clusters")
sp3 <- fviz_silhouette(pam3)
## cluster size ave.sil.width
## 1 1 31 0.48
## 2 2 15 0.41
## 3 3 11 0.55
grid.arrange(cp3, sp3, ncol=2)
Checking the dependence of cluster division on the origin:
table(cars$origin[126:182], pam3$cluster)
##
## 1 2 3
## USA 9 15 11
## Europe 12 0 0
## Japan 10 0 0
pam4 <- eclust(cars74, k=4 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp4 <- fviz_cluster(pam4, data=cars74, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 4 clusters")
sp4 <- fviz_silhouette(pam4)
## cluster size ave.sil.width
## 1 1 7 0.41
## 2 2 12 0.54
## 3 3 27 0.48
## 4 4 11 0.52
grid.arrange(cp4, sp4, ncol=2)
Checking the dependence of cluster division on the origin:
table(cars$origin[126:182], pam4$cluster)
##
## 1 2 3 4
## USA 7 12 5 11
## Europe 0 0 12 0
## Japan 0 0 10 0
Summarizing the PAM clustering, the average silhouette statistic is almost at the same level for all of the above numbers of clusters (about 0.49). However, only in the case of PAM with 4 clusters were no observations with negative silhouette values observed.
As for the differences in cars depending on their origin, similarly to the K-means cases, all European and Japanese cars were grouped each time into a single cluster (regardless of the number of clusters). This shows a great similarity in the characteristics of cars from these two regions. The remaining clusters consisted only of American cars, which in turn indicates their significant differentiation among themselves.
As the last clustering method, hierarchical clustering will be used. This approach builds a hierarchy of clusters based on a chosen way of calculating the similarity between clusters. In the analysis below, I will concentrate on the agglomerative hierarchical clustering technique, since it seems more appropriate for the considered dataset. In this technique, all observations initially form their own clusters, and then the most similar clusters are iteratively merged until a single cluster is formed.
In the hierarchical clustering method, it is necessary to compute the dissimilarity matrix, and the linkage method needs to be specified first. Obviously, there are multiple options, but I have decided to limit myself to two of them: Ward’s method and complete linkage. The first is frequently regarded as a sensible default, especially when there is no clear theoretical justification for any other linkage criterion. The second does well when there is some noise between clusters, which seems to be partly our case.
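For reference, the same two variants can be expressed directly in base R (eclust() used below essentially wraps these steps and adds the cluster assignment):

# Agglomerative clustering in base R: pairwise Euclidean distances,
# then bottom-up merging under the chosen linkage criterion
d <- dist(cars74, method = "euclidean")
hc_ward <- hclust(d, method = "ward.D")
hc_complete <- hclust(d, method = "complete")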
Therefore, we can plot the dendrograms of the agglomerative hierarchical clustering.
hcc <- eclust(cars74, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method="complete")
# Changing labels into origin levels
hcc$labels <- cars$origin[126:182]
plot(hcc, cex=0.5, hang=-1)
rect.hclust(hcc, k=3, border='blue')
Checking the dependence of cluster division on the origin:
cutc <- cutree(hcc, 3)
table(cars$origin[126:182], cutc)
## cutc
## 1 2 3
## USA 19 5 11
## Europe 0 12 0
## Japan 0 10 0
hcw <- eclust(cars74, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method="ward.D")
hcw$labels <- cars$origin[126:182]
plot(hcw, cex=0.5, hang=-1)
rect.hclust(hcw, k=3, border='blue')
Checking the dependence of cluster division on the origin:
cutw <- cutree(hcw, 3)
table(cars$origin[126:182], cutw)
## cutw
## 1 2 3
## USA 19 5 11
## Europe 0 12 0
## Japan 0 10 0
The results obtained using Ward’s method and complete linkage are very similar. In both cases we obtain one cluster in which all European and Japanese cars are grouped, and two clusters consisting only of American cars. These results fully confirm those previously obtained with K-means and PAM.
This time, however, the structure of the dendrogram, together with the attached origin labels, clearly shows how successive clusters merge based on the similarities between their elements. Particularly noteworthy is the fact that, according to both dendrograms, obtaining two clusters would require joining the two clusters containing exclusively American cars. This once again highlights the differences in car characteristics between American and European/Japanese cars in those years.
In order to assess the consistency of the clustering results, one can use the clValid package. It contains stability measures which check the stability of a method by comparing the clustering based on the full data with the clusterings obtained after removing each column of the data, one at a time. They include:

- APN (average proportion of non-overlap): the average proportion of observations not placed in the same cluster by the two clusterings;
- AD (average distance): the average distance between observations placed in the same cluster by both clusterings;
- ADM (average distance between means): the average distance between the cluster centers of observations placed in the same cluster by both clusterings;
- FOM (figure of merit): the average intra-cluster variance of the removed column, where the clustering is based on the remaining columns.

Note that the smaller the values of these measures, the more consistent the clustering results.
Therefore, using the clValid() function, we can check which of the used methods, and with what number of clusters, is the most stable for the data considered in this paper.
library(clValid)
clmethods <- c("hierarchical","kmeans","pam")
st <- clValid(cars74, nClust=2:6, clMethods=clmethods, validation="stability", method="complete")
optimalScores(st)
## Score Method Clusters
## APN 0.04226532 pam 3
## AD 1.23759216 pam 6
## ADM 0.16481820 pam 3
## FOM 0.48882087 pam 6
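Note that optimalScores() reports only the best method for each measure; assuming one wants the scores for every method and cluster count, the clValid object can also be summarized:

# Full table of the four stability measures for every method and
# number of clusters (output not shown here)
summary(st)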
Based on the four measures mentioned above, the PAM algorithm was clearly assessed as the most consistent of the methods used for this data. There is no consensus on the number of clusters, but considering that we have used no more than 4 clusters so far, the choice effectively falls on 3.
Finally, we are ready to analyze how the situation in the car industry changed over the next 8 years. Therefore, I will check what kind of cars were produced in 1981–82 and to what extent they differed from each other depending on their origin.
get_clust_tendency(cars82, 2, graph=TRUE, gradient=list(low="blue", high="white"), seed=1234)
## $hopkins_stat
## [1] 0.8781423
##
## $plot
The Hopkins statistic is very high (0.88), which means that the dataset is significantly clusterable. The visual assessment confirms this.
This time I limit the decision to the silhouette statistic only.
f1 <- fviz_nbclust(cars82, FUNcluster = kmeans, method = "silhouette") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(cars82, FUNcluster = cluster::pam, method = "silhouette") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
In both cases, the optimal number of clusters was assessed as 2. What is more, the average silhouette width for K-means and PAM is comparable in the case of 2 clusters. Since the results for K-means and PAM are very similar throughout this analysis, I will limit the clustering of cars82 to the PAM method, i.e., the one assessed as the most consistent for this kind of data.
pam2 <- eclust(cars82, k=2 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp2 <- fviz_cluster(pam2, data=cars82, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 2 clusters")
sp2 <- fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 48 0.55
## 2 2 12 0.27
grid.arrange(cp2, sp2, ncol=2)
Checking the dependence of cluster division on the origin:
table(cars$origin[339:398], pam2$cluster)
##
## 1 2
## USA 24 9
## Europe 5 1
## Japan 19 2
I also decided to provide a dendrogram, as it clearly shows the stages of forming the major clusters and simultaneously gives an interesting insight into the changes in the produced cars.
hcc <- eclust(cars82, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method="complete")
hcc$labels <- cars$origin[339:398]
plot(hcc, cex=0.5, hang=-1)
rect.hclust(hcc, k=3, border='blue')
Checking the dependence of cluster division on the origin:
cutc <- cutree(hcc, 3)
table(cars$origin[339:398], cutc)
## cutc
## 1 2 3
## USA 19 8 6
## Europe 3 1 2
## Japan 18 2 1
Undoubtedly, the results obtained using both of the above methods differ significantly from those based on the cars74 dataset. First of all, each cluster contains at least one car of each origin. Secondly, there are no longer clusters consisting only of American cars. What is more, based on the dendrogram, it can be concluded that, compared with eight years earlier, many more American cars are merged with European and Japanese cars already in the early stages of the hierarchical clustering.
It is also interesting that one of the obtained clusters is definitely larger and includes most of the cars. This may indicate a reduced variation in the characteristics of cars from different parts of the world. The most important result, however, is that the differences in cars produced in the years 1981-82 are not as strongly dependent on the origin as it was just a few years earlier.
Finally, for those still interested, I present graphs of the actual differences between those years in selected individual car characteristics by origin, which I already mentioned in the introduction.
library(ggplot2)
theme_update(plot.title = element_text(hjust = 0.5))
a1 <- ggplot(data = cars[126:182,], aes(x = origin, y = mpg)) +
geom_boxplot() +
coord_cartesian(ylim = c(14, 43)) +
ylab('MPG') +
xlab(' ') +
ggtitle('1973-1974')
a2 <- ggplot(data = cars[339:398,], aes(x = origin, y = mpg)) +
geom_boxplot() +
coord_cartesian(ylim = c(14, 43)) +
ylab(' ') + xlab(' ') +
ggtitle('1981-1982')
b1 <- ggplot(data = cars[126:182,], aes(x = origin, y = weight)) +
geom_boxplot() +
coord_cartesian(ylim = c(1800, 4500)) +
ylab('Weight') +
xlab(' ') +
ggtitle('1973-1974')
b2 <- ggplot(data = cars[339:398,], aes(x = origin, y = weight)) +
geom_boxplot() +
coord_cartesian(ylim = c(1800, 4500)) +
ylab(' ') + xlab(' ') +
ggtitle('1981-1982')
c1 <- ggplot(data = cars[126:182,], aes(x = origin, y = displacement)) +
geom_boxplot() +
coord_cartesian(ylim = c(75, 350)) +
ylab('Displacement') +
xlab(' ') +
ggtitle('1973-1974')
c2 <- ggplot(data = cars[339:398,], aes(x = origin, y = displacement)) +
geom_boxplot() +
coord_cartesian(ylim = c(75, 350)) +
ylab(' ') + xlab(' ') +
ggtitle('1981-1982')
grid.arrange(a1,a2, ncol=2, top='MPG by Origin')
grid.arrange(b1,b2, ncol=2, top='Weight by Origin')
grid.arrange(c1,c2, ncol=2, top='Displacement by Origin')
In this paper, I analyzed the grouping of cars produced between 1970 and 1982 based on the similarities and differences in their basic characteristics. Using fundamental clustering methods, it was shown that at the beginning of this period American cars were markedly different from European and Japanese ones. However, a number of introduced changes allowed the American brands not only to stay on the market, but also to bring the characteristics of their cars significantly closer to the more desirable cars from other parts of the world. In the context of clustering, this resulted in a significant reduction in the dependence of the cluster divisions on the origin of the cars.