Introduction

There are three major types of machine learning: unsupervised learning, whose goal is to find structure in unlabeled data (clustering); supervised learning, whose goal is to make predictions from labeled data (regression, classification); and reinforcement learning. The two main goals of unsupervised machine learning are to find homogeneous subgroups within a population (clusters) and to reduce dimensionality (decrease the number of features while maintaining the maximum information content). The process of finding homogeneous subgroups within a population is called clustering.

Data Description

The dataset is available as a text file at the following link: https://www.dropbox.com/s/c9w2q2hlfh0y2iv/data.txt?dl=0. After loading the dataset into an R data frame, the first step was to conduct exploratory data analysis (EDA) to understand the data, check whether it is well structured, and check for missing values. The results of the EDA are presented in Table 1, while the 5-number summary statistics for the quantitative variables are presented in Picture 1.

Table 1. Variables of the dataset

There are 207 observations and 12 variables (9 variables were taken as features for clustering). There are 10 observations (cities) with infrastructure, and 8 cities are close to the border. From the 5-number summary statistics it is obvious that the dataset contains outliers. However, we will not remove them. On the contrary, clustering is itself a good method for detecting outliers, so it is good practice to run unsupervised machine learning to look for patterns in the data before conducting supervised machine learning such as classification or regression. Moreover, the standard deviations of the features show that the data must be scaled before clustering.

Picture 1. 5-number Summary Statistics for Quantitative Features

Picture 2. Standard deviations of features

R code for data preparation

install.packages("cluster")
install.packages("fpc")
library(cluster)
library(fpc)
set.seed(123)
#PART 1: Preparing the data

#Reading data into R data frame
df <- read.csv("data.txt", sep = "\t")

#Taking features for clustering
ikea <- as.matrix(df[,4:12]) #9 features

#Row names
rownames(ikea) <- df$Kommun_name

#PART 2: EDA
dim(df) #There are 207 observations and 12 features
dim(ikea) #9 features are taken for modelling
sum(df$Border) #8 cities are close to the border
sum(df$Infrast) #10 cities have infrastructure
str(df)
#5-number summary statistics

summary(ikea) #It is obvious that we have outliers in our data.

#Checking features means and standard deviations
options(scipen = 20)
colMeans(ikea) #Scaling should be done
apply(ikea,2,sd) #Computing standard deviation to inspect deviation in data
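
Since the feature standard deviations differ by orders of magnitude, the distances used by k-means would otherwise be dominated by the largest-scale features. A minimal sketch of the scaling step (shown here as a separate, hypothetical ikea.scaled object; the clustering code below uses the unscaled matrix as written):

#Standardizing every feature to zero mean and unit variance
ikea.scaled <- scale(ikea)

#Sanity check: column means are ~0 and standard deviations are 1
round(colMeans(ikea.scaled), 10)
apply(ikea.scaled, 2, sd)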

K-means algorithm

The main logic of the k-means algorithm is as follows. First, it randomly assigns each data point to one of a predefined number of clusters; this is the random component of the algorithm. Second, k-means computes the center of each cluster (the average of all data points in the cluster). Third, it computes the distance of each data point from these centers and reassigns the point to the cluster whose center is nearest. This completes one iteration of the algorithm. K-means stops when no assignments change. The R function kmeans() has the following parameters: x, centers, nstart, and iter.max. The first parameter, x, is the data; the second is the predefined number of clusters (which we need to decide); nstart tells how many random sets of centers should be tried; and iter.max sets the maximum number of iterations allowed per run. Because k-means has a random component, a single run will not necessarily find the optimal solution. To overcome this, k-means can be run multiple times (via nstart), and the best outcome is kept as the final result.
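
To see the random component in action, the following sketch (with an arbitrary, illustrative choice of 3 centers) compares a single random start with multiple starts; with nstart = 1 the total within-cluster sum of squares may differ between runs, while a larger nstart keeps the best of many starts:

#Single random start: the result may vary from run to run
km.single <- kmeans(ikea, centers = 3, nstart = 1)
km.single$tot.withinss

#Thirty random starts: kmeans() returns the solution with the
#lowest total within-cluster sum of squares among the 30 runs
km.multi <- kmeans(ikea, centers = 3, nstart = 30)
km.multi$tot.withinss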

R code for k-means algorithm

#PART 3: K-MEANS
#Determining the number of clusters
wss <- numeric(20) #total within-cluster sum of squares for each k

for(i in 1:20) {
  km.output <- kmeans(ikea[,-c(1)], centers = i, nstart = 30, iter.max = 50)
  wss[i] <- km.output$tot.withinss
}

#Plotting model quality vs. number of clusters

plot(1:20, wss, type = "b", col = "red", xlab = "Number of clusters", ylab = "WSS",
     main = "Scree plot (Model quality vs. Number of clusters)")

#Based on the scree plot, the optimal number of clusters is between 4 and 7

k <- 6

#Making the model with 6 clusters
km.out <- kmeans(ikea[,-c(1)], centers = k, nstart = 20, iter.max = 50)

#Adding column with cluster numbers
ikea <- cbind(ikea, clusterNumber = km.out$cluster)

#View the resulting model
clusplot(ikea, km.out$cluster, color=TRUE, shade=TRUE, 
         labels=2, lines=0, main = "Graphical Model Representation")

#Looking for cities in clusters

ikea.df <- as.data.frame(ikea)

index.has <- which(df$Kommun_name %in% c("Haparanda", "Helsingborg", "Jönköping",
                                         "Kalmar", "Karlstad", "Linköping",
                                         "Malmö", "Stockholm", "Uddevalla",
                                         "Uppsala", "Älmhult", "Örebro"))

ikea.df$HasStore <- 0

ikea.df$HasStore[index.has] <- 1

cluster.one <- ikea.df[ikea.df$clusterNumber==1,]

cluster.two <- ikea.df[ikea.df$clusterNumber==2,]

cluster.three <- ikea.df[ikea.df$clusterNumber==3,]

cluster.four <- ikea.df[ikea.df$clusterNumber==4,]

cluster.five <- ikea.df[ikea.df$clusterNumber==5,]

cluster.six <- ikea.df[ikea.df$clusterNumber==6,]
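
As a compact, hypothetical alternative to scanning the six tables by eye, we can cross-tabulate existing stores against cluster membership and profile each cluster by its feature means:

#How the cities with an existing store spread across the clusters
table(HasStore = ikea.df$HasStore, Cluster = ikea.df$clusterNumber)

#Average feature values per cluster, to characterize each group
aggregate(. ~ clusterNumber, data = ikea.df, FUN = mean)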

Determining optimal number of clusters

K-means uses the total within-cluster sum of squares (WSS) as a measure of model quality; the kmeans() function returns it as one of its outputs (tot.withinss). A practical way to find the optimal number of clusters is therefore a heuristic approach: we run k-means for a range of cluster counts and record the WSS for each run. Then we make a scree plot (x: number of clusters, y: WSS). Finally, we look for the elbow in this plot, the point after which there is no large decrease in WSS as the number of clusters increases. From the picture below we can see that the optimal number of clusters is between 4 and 7: after 7 clusters, adding more clusters does not significantly change the WSS. Good practice is to run the k-means algorithm for each of these candidate cluster counts and see what kind of information we get. For this case, we decided on 6 clusters.
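
To make "no large decrease" concrete, we can compute the relative drop in WSS from each k to k + 1 using the wss vector recorded above; the 10% threshold below is an arbitrary illustration, not part of the original analysis:

#Relative decrease in WSS when moving from k to k + 1 clusters
wss.drop <- -diff(wss) / wss[-length(wss)]
round(wss.drop, 3)

#Values of k after which the drop falls below 10%
which(wss.drop < 0.10)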

Picture 3. Model Quality vs. Number of Clusters

If we do not know the number of clusters and need to determine it, we run the algorithm multiple times, each time with a different number of clusters, and observe how a measure of model quality changes with the number of clusters. In this case, we ran kmeans() for every number of clusters from 1 to 20. The ideal plot has an elbow where the quality measure improves more slowly as the number of clusters increases, indicating that model quality is no longer improving substantially as model complexity (i.e., the number of clusters) grows. In other words, the elbow indicates the optimal number of clusters, which in our case is between 4 and 7.

K-means Output

After deciding on the number of clusters, the last step is to run the algorithm and interpret the results. We used the iter.max argument of the kmeans() function because k-means is an iterative algorithm, i.e., it repeats until some stopping criterion is reached. The default maximum number of iterations for kmeans() is 10, which was not enough for the algorithm to converge on our data, so we raised it to 50.
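
The fitted object lets us verify that 50 iterations were indeed enough; a short check using km.out$iter and km.out$ifault, which are standard components of the kmeans() return value:

#Number of iterations the algorithm actually needed
km.out$iter

#ifault is 0 when no algorithmic problem was detected;
#kmeans() also raises a warning if a run did not converge
km.out$ifault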

Picture 4. Graphical Model Representation

The picture above makes it obvious why clustering is a good method for detecting outliers. Stockholm is an outlier: it is the only observation in cluster 4. Likewise, Malmö and Uppsala are outliers, being the only two observations in cluster 3.
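
The same observation can be made programmatically: clusters with very few members flag outlier candidates. A short sketch using the size component of the fitted model (the threshold of 3 is arbitrary):

#Cluster sizes; very small clusters point to outliers
km.out$size

#Cities that sit in clusters with fewer than three members
small <- which(km.out$size < 3)
names(km.out$cluster)[km.out$cluster %in% small]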

CONCLUSIONS

Finally, to draw conclusions about which municipalities would be good candidates for IKEA to open new stores, we created 6 tables (one per cluster) and examined the data within each cluster. We can conclude the following:

Principal Component Analysis

To improve our model, we can conduct PCA. The goal of PCA is to reduce the number of dimensions while maintaining the maximum information content of the original data. In our case it was not strictly necessary, but we did it to see whether some features could be eliminated from the model. From the picture below we can see that productivity and border have different loadings compared to the other features. Moreover, the scree plot shows that PC1 alone explains 60% of the variance in the data, the first two PCs explain 70%, and 3 PCs are needed to explain 80% of the variance.
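
A minimal sketch of how the biplot and scree plot in Pictures 7 and 8 could be produced with prcomp(), assuming the same nine feature columns of df scaled to unit variance:

#PCA on the nine scaled features
pca.out <- prcomp(df[, 4:12], scale. = TRUE)

#Proportion and cumulative proportion of variance explained
summary(pca.out)

#Biplot of scores and loadings (Picture 7)
biplot(pca.out)

#Cumulative scree plot (Picture 8)
pve <- pca.out$sdev^2 / sum(pca.out$sdev^2)
plot(cumsum(pve), type = "b", xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained")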

Picture 7. Biplot of PCA model

Picture 8. Scree plot (Cumulative Proportion of Variance Explained)