There are three major types of machine learning: unsupervised learning, whose goal is to find structure in unlabeled data (clustering); supervised learning, whose goal is to make predictions from labeled data (regression, classification); and reinforcement learning. The two main goals of unsupervised machine learning are to find homogeneous subgroups within a population (clusters) and to perform dimensionality reduction (decreasing the number of features while retaining as much information as possible). The process of finding homogeneous subgroups within a population is called clustering.
The dataset is available as a text file at the following link: https://www.dropbox.com/s/c9w2q2hlfh0y2iv/data.txt?dl=0. After loading the dataset into an R data frame, the first step was to conduct exploratory data analysis (EDA) to understand the data, check that it is well structured, and check for missing values. The results of the EDA are presented in Table 1, while the 5-number summary statistics for the quantitative variables are presented in Picture 1.
Table 1. Variables of the dataset
There are 207 observations and 12 variables (9 variables were taken as features for clustering). There are 10 observations (cities) with infrastructure, and 8 cities are close to the border. From the 5-number summary statistics it is obvious that there are outliers in the dataset. However, we will not remove the outliers, as they will not unduly influence our model. On the contrary, clustering is a good method for detecting outliers, so it is always good practice to conduct unsupervised machine learning to find patterns in the data before conducting supervised machine learning such as classification or regression. Moreover, the standard deviations of the features show that we must scale the data before clustering.
Picture 1. 5-number Summary Statistics for Quantitative Features
Picture 2. Standard deviations of features
R code for data preparation
install.packages("cluster")
install.packages("fpc")
library(cluster)
library(fpc)
set.seed(123)
#PART 1: Preparing the data
#Reading data into R data frame
df <- read.csv("data.txt", sep = "\t")
#Taking features for clustering
ikea <- as.matrix(df[,4:12]) #9 features
#Row names
rownames(ikea) <- df$Kommun_name
#PART 2: EDA
dim(df) #There are 207 observations and 12 variables
dim(ikea) #9 features are taken for modelling
sum(df$Border) #8 observations have border
sum(df$Infrast) #10 observations have infrastructure
str(df)
#5-number summary statistics
summary(ikea) #It is obvious that we have outliers in our data.
#Checking features means and standard deviations
options(scipen = 20)
colMeans(ikea) #Scaling should be done
apply(ikea,2,sd) #Computing standard deviation to inspect deviation in data
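Given the large differences in standard deviations, the features could be standardized before clustering; a minimal sketch using base R's scale() (note that the models below are fit on the unscaled matrix, as in the original code):
#Standardizing every feature to mean 0 and standard deviation 1
ikea.scaled <- scale(ikea)
apply(ikea.scaled, 2, sd) #Each standard deviation is now 1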
The main logic of the k-means algorithm is as follows. First, it breaks the population into a predefined number of clusters by randomly assigning each data point to one cluster; this is the random component of the algorithm. Then, k-means computes the center of each cluster (the average of all data points in the cluster). Third, k-means computes the distance of each data point from these centers and reassigns each point to the cluster whose center is nearest. This completes one iteration of the algorithm. K-means stops when the assignments no longer change. R provides the function kmeans() with the following parameters: x, centers, nstart, and iter.max. The first parameter, x, is the data; the second is the predefined number of clusters (which we need to decide); nstart tells how many random sets of starting centers should be tried; and iter.max sets the maximum number of iterations allowed. Because of its random component, a single run of k-means may not find the optimal solution. To overcome this, k-means can be run multiple times (via nstart), and the best outcome is kept as the final result.
R code for the k-means algorithm
#PART 3: K-MEANS
#Determining the number of clusters
wss <- numeric(20) #Total within-cluster sum of squares for each k
for(i in 1:20) {
  km.output <- kmeans(ikea[,-c(1)], centers = i, nstart = 30, iter.max = 50)
  wss[i] <- km.output$tot.withinss
}
#Plotting model quality vs. number of clusters
plot(1:20, wss, type="b", col="red", xlab="Number of clusters", ylab = "WSS",
     main = "Scree plot (Model quality vs. Number of clusters)")
#Based on the scree plot, the optimal number of clusters is between 4 and 7
k <- 6
#Building the model with 6 clusters
km.out <- kmeans(ikea[,-c(1)], centers = k, nstart = 20, iter.max = 50)
#View the resulting model (plotted on the same features used for clustering)
clusplot(ikea[,-c(1)], km.out$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=0, main = "Graphical Model Representation")
#Adding a column with the cluster numbers
ikea <- cbind(ikea, clusterNumber = km.out$cluster)
#Looking for the cities that already have IKEA stores
ikea.df <- as.data.frame(ikea)
index.has <- which(df$Kommun_name %in% c("Haparanda", "Helsingborg", "Jönköping",
                                         "Kalmar", "Karlstad", "Linköping",
                                         "Malmö", "Stockholm", "Uddevalla",
                                         "Uppsala", "Älmhult", "Örebro"))
ikea.df$HasStore <- 0
ikea.df$HasStore[index.has] <- 1
cluster.one <- ikea.df[ikea.df$clusterNumber==1,]
cluster.two <- ikea.df[ikea.df$clusterNumber==2,]
cluster.three <- ikea.df[ikea.df$clusterNumber==3,]
cluster.four <- ikea.df[ikea.df$clusterNumber==4,]
cluster.five <- ikea.df[ikea.df$clusterNumber==5,]
cluster.six <- ikea.df[ikea.df$clusterNumber==6,]
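The columns created above also allow a quick summary of how the existing stores are distributed across the clusters; a minimal sketch:
#Number of communities per cluster, split by presence of an IKEA store
table(Cluster = ikea.df$clusterNumber, HasStore = ikea.df$HasStore)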
K-means uses the total within-cluster sum of squares (WSS) as its measure of model quality. The kmeans() function produces several outputs, one of which is this WSS (tot.withinss). The best way to find the optimal number of clusters is therefore a heuristic approach: we run k-means for a range of cluster counts and record the WSS for each run. Then we make a scree plot (x: number of clusters, y: WSS). Finally, we look for the elbow in this plot, the point after which there is no large decrease in WSS as the number of clusters increases. From the picture below we can see that the optimal number of clusters is between 4 and 7; it is obvious that beyond 7 clusters, adding more clusters does not significantly change the WSS. Best practice is to run the k-means algorithm for each of these numbers of clusters and see what information each yields. For this case, we decided on 6 clusters.
Picture 3. Model Quality vs. Number of Clusters
If we do not know the number of clusters and need to determine it, we run the algorithm multiple times, each time with a different number of clusters, and observe how a measure of model quality changes with the number of clusters. In this case, we ran kmeans() for 1 to 20 clusters to see how model quality changes. The ideal plot has an elbow where the quality measure improves more slowly as the number of clusters increases, indicating that the quality of the model no longer improves substantially as model complexity (i.e., the number of clusters) grows. In other words, the elbow indicates the optimal number of clusters, which in our case is from 4 to 7.
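The elbow can also be located numerically from the wss vector computed above; a small sketch:
#Change in WSS from each additional cluster (negative = improvement);
#the elbow is where the drops flatten out
round(diff(wss))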
After deciding on the number of clusters, the last step is to run the algorithm and interpret the results. We used the iter.max argument of the kmeans() function because k-means is an iterative algorithm, i.e., it repeats until a stopping criterion is reached. The default maximum number of iterations for kmeans() is 10, which may not be enough for the algorithm to converge, so we raised it to 50 to overcome this issue.
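Whether 50 iterations were enough can be checked on the fitted object; kmeans() returns the number of iterations it actually performed:
#Convergence check on the final model
km.out$iter   #Iterations actually used (should be well below iter.max = 50)
km.out$ifault #0 means no algorithmic problem was flagged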
Picture 4. Graphical Model Representation
From the picture above it is obvious why clustering is a good method for detecting outliers. We can see that Stockholm is an outlier: it is the only observation in its cluster (cluster 6). Moreover, Malmö and Uppsala are also outliers, being the only two observations in their cluster (cluster 4).
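These singleton and two-member clusters can also be read directly from the model output; a minimal sketch:
#Observations per cluster; very small clusters flag potential outliers
km.out$size
which(km.out$size <= 2) #Clusters small enough to inspect as outliers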
Finally, to draw a conclusion about which communities would be good places for IKEA to open new stores, we created 6 tables (one per cluster) and examined the data within each cluster. We can conclude the following:
Cluster 1 has 9 observations. In this cluster, 4 communities have an IKEA store, while 5 do not. Huddinge is the community with the highest sales index and revenue, yet it has no IKEA store; we therefore think Huddinge would be a good place for an IKEA store. Norrköping, in the same cluster, would also be a good community to consider. The most likely reason there is no store there is its proximity to Linköping, which already has an IKEA store. Generally speaking, if we had to choose one cluster in which to open IKEA stores, it would be the cities without stores in this cluster.
Cluster 2 has 52 observations, and no community in it has an IKEA store. All the communities in this cluster have relatively low sales index, revenue and population compared to communities in other clusters. Still, the Falkenberg community, without an IKEA store, has a higher sales index, revenue and population than the Haparanda community, which has one; Haparanda most likely has a store because it is close to the Finnish border.
Cluster 3 has 120 observations, of which 2 communities have an IKEA store. This cluster contains the communities with the lowest sales index, revenue and population, yet it still has 2 IKEA stores: one in Haparanda, close to the Finnish border, and one in Älmhult, which lies at the center of the Halmstad, Ljungby, Växjö, Karlshamn and Hässleholm communities. We think the remaining communities in this cluster are not feasible locations for opening an IKEA store.
Cluster 4 has 2 communities, Uppsala and Malmö, which already have IKEA stores.
Cluster 5 has 23 observations, and only 3 communities have an IKEA store. The Järfälla community has the highest sales index and revenue in this cluster, followed by Mölndal, yet neither has an IKEA store. The best candidates for IKEA stores in this cluster are therefore Järfälla, followed by Mölndal.
Cluster 6 contains only the Stockholm community, which already has an IKEA store.
After analyzing the data for all the communities in each cluster, we can say that Huddinge in cluster 1 is the most suitable community for a new IKEA store, followed by Järfälla and Mölndal. We selected these communities because they have a high sales index, revenue and population. The data also contain a border feature, but we do not know how far the border is from each city; with that information, it would be worth weighing this feature more seriously in the decision about opening new IKEA stores in these communities.
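As an illustration, such candidates can be ranked programmatically; in this sketch the column name Sales_index is hypothetical and must be replaced with the actual feature name in the dataset:
#Hypothetical sketch: communities in cluster 1 without a store, by sales index
cand <- cluster.one[cluster.one$HasStore == 0, ]
cand[order(cand$Sales_index, decreasing = TRUE), ] #"Sales_index" is a placeholder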
To improve our model, we can conduct PCA. The goal of PCA is to reduce the number of dimensions while retaining as much of the information in the original data as possible. In our case PCA is not strictly necessary, but we performed it to see whether some features could be eliminated from the model. From the picture below we can see that productivity and border have different loadings compared to the other features. Moreover, the scree plot shows that PC1 explains 60% of the variance in the data, while the first two PCs explain 70%; to explain 80% of the variance, we need 3 PCs.
Picture 7. Biplot of PCA model
Picture 8. Scree plot (Cumulative Proportion of Variance Explained)
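R code for PCA (a minimal sketch, assuming prcomp() on the same nine features used above; the clusterNumber column appended earlier is excluded)
#PART 4: PCA on the nine original features
#Note: column 10 of ikea is the appended clusterNumber, so only columns 1:9 are used.
pca.out <- prcomp(ikea[, 1:9], scale. = TRUE)
#Biplot of loadings and observations (cf. Picture 7)
biplot(pca.out)
#Proportion of variance explained by each component
pve <- pca.out$sdev^2 / sum(pca.out$sdev^2)
#Cumulative scree plot (cf. Picture 8)
plot(cumsum(pve), type = "b", xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained")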