Introduction

The objective of this project was to build classifiers that predict an indoor location based on RSSI readings from 13 iBeacons. The data was collected in Waldo Library, Western Michigan University, using an iPhone 6S, and the data sets were sourced from the UCI Machine Learning Repository. This report covers Phase II, which focuses on model building, fine-tuning, and evaluation. However, a better understanding of the data was obtained during model building, which led to further data preparation. This illustrates the value of following the CRISP-DM methodology, and even more iterations of model building and data preparation could have been done to improve the overall quality of the final model. Only the final models are shown in this report.

Methodology

Four classifiers were considered in this project: Naive Bayes, Random Forest, K-Nearest Neighbour, and Decision Tree. After splitting the data into training data (70%) and test data (30%), the hyperparameters of each classifier were fine-tuned. The split was stratified to make sure the imbalance of the data was taken into account during model building. During the fine-tuning process a stratified 5-fold cross-validation was used, with mmce as the performance measure.

For evaluation, mmce and the confusion matrix were used. The modelling was implemented using the mlr package (Bischl et al. (2016)). Feature selection was implemented using the spFSR package (Aksakalli, Abbasi, and Wong (2018)).

library(knitr)
library(mlr)
library(tidyverse)

Additional data preparation

While developing models for predicting location, further data insights were obtained, and it was therefore decided to do some extra data preparation. Also, data normalisation was deferred during Phase I and is therefore performed in this section. Lastly, the data is split into test and training sets.

Additional data cleaning

Because some target levels have a very low number of instances, it was not possible to do a stratified five-fold cross-validation; some target feature levels simply had fewer than five instances. Recalling the data visualisation from Phase I, some target levels stood out as having significantly fewer instances than others.

Frequency of X

X     D   E  F  G  I    J    K    L    M   N   O   P   Q   R   S    T   U   V  W
Freq  24  4  4  4  202  192  142  100  85  86  86  71  74  91  136  39  55  8  17

Frequency of Y

Y     01   02   03   04   05   06   07  08  09  10  13  14  15
Freq  138  155  200  238  213  187  75  58  10  20  6   4   116

For the x-location data, the target feature levels E, F, G, and V contain very few instances (4, 4, 4, and 8 respectively), which does not allow a stratified five-fold cross-validation. Furthermore, so few instances might not be enough to learn the characteristics of the given target level, and therefore these levels are removed. Similarly, for the y-location data the target levels “13” and “14” contain very few instances (6 and 4) and are also removed.

# x: remove the sparsely populated levels and drop them from the factor
data_x <- data_x[!data_x$x %in% c("E", "F", "G", "V"), ]
data_x$x <- factor(data_x$x)

# y: remove the sparsely populated levels and drop them from the factor
data_y <- data_y[!data_y$y %in% c("13", "14"), ]
data_y$y <- factor(data_y$y)

Data normalisation

In Phase I no data normalisation was done. Even though all the descriptive features are sensor readings within the same range, data normalisation is performed here.

# Normalisation of x data: row-wise min-max scaling of the 13 RSSI readings
data_norm <- as.data.frame(t(apply(data_x[, 1:13], 1, function(x) (x - min(x)) / (max(x) - min(x)))))
data_x[, 1:13] <- data_norm

# Normalisation of y data
data_norm <- as.data.frame(t(apply(data_y[, 1:13], 1, function(x) (x - min(x)) / (max(x) - min(x)))))
data_y[, 1:13] <- data_norm

Test and Training data

The data for both the x- and y-coordinate is split into test and training data: 70% is used for training and 30% for testing. Furthermore, the split is stratified.

## Splitting data into test and training: 
# 70 % training 
# 30 % test

datasplit_x <- makeResampleInstance(desc = 'Holdout', 
                     task = makeClassifTask(data = data_x,
                                            target = 'x'),
                     split = 0.7,
                     stratify = TRUE)

datasplit_y <- makeResampleInstance(desc = 'Holdout', 
                                    task = makeClassifTask(data = data_y,
                                                           target = 'y'),
                                    split = 0.7,
                                    stratify = TRUE)

                     
# For the x dataset: 
training_data_x  <- data_x[datasplit_x$train.inds[[1]], ]
test_data_x      <- data_x[datasplit_x$test.inds[[1]], ]
# For the y dataset: 
training_data_y  <- data_y[datasplit_y$train.inds[[1]], ]
test_data_y      <- data_y[datasplit_y$test.inds[[1]], ]

Model building and hyperparameter fine-tuning

In this section, a learner is created for each classifier for both the x- and y-coordinate. The hyperparameters of each learner are fine-tuned, and finally a model is built for each tuned learner.

Preliminaries

For the hyperparameter fine-tuning it is decided to use a 5-fold cross-validation resampling strategy, stratified so that the distribution of the target level instances is kept the same across folds. A task for training and a control grid are also defined.

# Creating a task for x: 
classif.task_x <- makeClassifTask(data = training_data_x, target = "x", id = "x-location_train")

# Creating a task for y: 
classif.task_y <- makeClassifTask(data = training_data_y, target = "y", id = "y-location_train")

## Making the resampling description for fine-tuning (stratified 5-fold CV): 
rdescx <- makeResampleDesc("CV", iters = 5L, stratify = TRUE)
rdescy <- makeResampleDesc("CV", iters = 5L, stratify = TRUE)

## Making the control grid: 
ctrl <- makeTuneControlGrid()

Naive Bayes

As some instances were intentionally removed from the training data, the model might produce zero-probability predictions. To reduce this, parameter tuning was performed to find the optimal value of the “laplace” parameter for the x- and y-coordinate models. The experiment used the stratified 5-fold cross-validation described above, with laplace values ranging from 0 to 25. For both coordinates, all values produced the same mean misclassification error: 0.878 for x and 0.778 for y. Laplacian smoothing therefore did not improve the performance of the model. A sketch of the tuning step is shown below.
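
The tuning call itself is not reproduced in the report. The following is a minimal sketch of how it can be expressed with mlr, using the task, resampling description, and control grid defined above; the object names and the exact grid of laplace values are assumptions consistent with the 0-to-25 range described.

# Sketch of the Laplace tuning (object names and grid are assumptions):
nb_learner <- makeLearner("classif.naiveBayes", predict.type = "prob")
nb_ps <- makeParamSet(
  makeDiscreteParam("laplace", values = seq(0, 25, by = 5))  # assumed grid over 0-25
)
nb_tune_x <- tuneParams(nb_learner, task = classif.task_x, resampling = rdescx,
                        measures = mmce, par.set = nb_ps, control = ctrl)
# Apply the optimal value and train the final model:
tunedLearner1_x <- setHyperPars(nb_learner, par.vals = nb_tune_x$x)
nbModel_x <- train(tunedLearner1_x, classif.task_x)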

Effect of hyperparameter fine-tuning for the naive Bayes models

Random Forest

Two parameters were tuned: mtry (the number of variables considered as candidates at each split) and ntree (the number of trees to grow). As this is a classification problem, the default mtry is the square root of the number of descriptive features, i.e. sqrt(13) ≈ 3.6; hence mtry values of 2, 3, 4, 5, and 6 were tried. For ntree, a sequence of values from 10 to 100 was tried. The optimal values of mtry and ntree for the x-coordinate are 6 and 20 respectively, with a misclassification error rate of 0.55. For the y-coordinate the values are mtry = 6 and ntree = 60, with a misclassification error rate of 0.526. Furthermore, the hyperparameter effect graphs were plotted to determine whether further improvement was needed: the misclassification error rate reduced significantly during the first ten iterations and did not reduce further in the remaining iterations. This indicates that there is no room for further improvement for these two parameters. A sketch of the tuning step is shown below.
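
A sketch of the corresponding tuning call, using the grids stated above (the object names are assumptions):

# Sketch of the random forest tuning for the x-coordinate:
rf_learner <- makeLearner("classif.randomForest", predict.type = "prob")
rf_ps <- makeParamSet(
  makeDiscreteParam("mtry", values = 2:6),                    # candidates around sqrt(13)
  makeDiscreteParam("ntree", values = seq(10, 100, by = 10))  # 10 to 100 trees
)
rf_tune_x <- tuneParams(rf_learner, task = classif.task_x, resampling = rdescx,
                        measures = mmce, par.set = rf_ps, control = ctrl)
tunedLearner2_x <- setHyperPars(rf_learner, par.vals = rf_tune_x$x)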

Effect of hyperparameter fine-tuning for the random forest models

K-Nearest Neighbour

Two parameters were tuned: k and distance. For the x-coordinate, the optimal value of k is 6 with distance = 1, giving a misclassification error rate of 0.676. For the y-coordinate, the optimal value of k is 21 with distance = 1, giving a misclassification error rate of 0.624. According to the hyperparameter effect graph for the x-coordinate, the misclassification error rate reduced significantly during the first iteration and did not fall further over the remaining 49 iterations, whereas the graph for the y-coordinate shows that the mmce reduced at the sixth iteration and then remained the same. There were no opportunities for further enhancement of these two parameters. A sketch of the tuning step is shown below.
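
A sketch of the corresponding tuning call for the x-coordinate, assuming the kknn learner; the grids are assumptions chosen to give the 50 grid points implied above (25 values of k times 2 distances).

# Sketch of the KNN tuning (grids are assumptions):
knn_learner <- makeLearner("classif.kknn", predict.type = "prob")
knn_ps <- makeParamSet(
  makeDiscreteParam("k", values = 1:25),       # assumed range of neighbours
  makeDiscreteParam("distance", values = 1:2)  # Minkowski distance parameter
)
knn_tune_x <- tuneParams(knn_learner, task = classif.task_x, resampling = rdescx,
                         measures = mmce, par.set = knn_ps, control = ctrl)
tunedLearner3_x <- setHyperPars(knn_learner, par.vals = knn_tune_x$x)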

Effect of hyperparameter fine-tuning for the K-Nearest Neighbour models

Decision Tree

Two parameters were tuned: “maxdepth”, the maximum depth of any node of the final tree, and “minsplit”, the minimum number of observations that must exist in a node for it to be split. For the x-coordinate, the optimal values of maxdepth and minsplit are 22 and 4 respectively, with an mmce of 0.601. For the y-coordinate, the optimal values are 22 and 1, with an mmce of 0.591. The hyperparameter effect graph for the x-coordinate shows that the mmce reduced in the first four iterations and then did not change; the same holds for the y-coordinate. A sketch of the tuning step is shown below.
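
A sketch of the corresponding tuning call; the grids are assumptions (rpart caps maxdepth at 30), and tunedLearner4_x matches the name used in the feature-selection code further below.

# Sketch of the decision tree tuning (grids are assumptions):
dt_learner <- makeLearner("classif.rpart", predict.type = "prob")
dt_ps <- makeParamSet(
  makeDiscreteParam("maxdepth", values = seq(2, 30, by = 2)),  # assumed grid
  makeDiscreteParam("minsplit", values = 1:10)                 # assumed grid
)
dt_tune_x <- tuneParams(dt_learner, task = classif.task_x, resampling = rdescx,
                        measures = mmce, par.set = dt_ps, control = ctrl)
tunedLearner4_x <- setHyperPars(dt_learner, par.vals = dt_tune_x$x)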

Effect of hyperparameter fine-tuning for the decision tree models

Decision Tree Feature selection

It is decided to use the spFSR package (Aksakalli, Abbasi, and Wong (2018)) to do feature selection on a decision tree, to see what performance can be achieved with fewer features.

## Using SPSA to perform feature selection: 
# Feature selection is run on the tuned decision tree learners
# (see the Discussion for why random forest was not used):
# x: 
spsaMod_learner4_tuned_x <- spFSR::spFeatureSelection(classif.task_x, wrapper = tunedLearner4_x, 
                                                      measure = mmce, num.features.selected = 0,
                                                      show.info = FALSE)

# y: 
spsaMod_learner4_tuned_y <- spFSR::spFeatureSelection(classif.task_y, wrapper = tunedLearner4_y, 
                                                      measure = mmce, num.features.selected = 0,
                                                      show.info = FALSE)

## Getting the best models from feature selection:
# x: 
spsaModel_x <- spsaMod_learner4_tuned_x$best.model

# y: 
spsaModel_y <- spsaMod_learner4_tuned_y$best.model

The best model predicting the x-location has 10 features, and the best model predicting the y-location also has 10 features. The feature importances are plotted for both models; a sketch of the plotting calls is shown below.
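
The plots themselves are produced from the spFSR result objects; a minimal sketch of the plotting calls, assuming the spFSR helper plotImportance():

# Plotting the feature importances found by SPSA:
spFSR::plotImportance(spsaMod_learner4_tuned_x)
spFSR::plotImportance(spsaMod_learner4_tuned_y)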

Evaluation

First, all models are evaluated individually, and afterwards the models are compared. To evaluate the performance of the models, a number of performance measures are used: mmce, multiclass.aunp, and the kappa statistic. However, for this kind of prediction problem the visualisation of the confusion matrix can give important insight into the performance of a model. The performance measures only take into account whether the model makes the right prediction or not; in this case it is more important to look at how the model makes mistakes. If the model predicts the location to be K when the actual value is J, the mistake is not critical. By inspecting the confusion matrix it is possible to evaluate how the model makes mistakes: the more the higher values are concentrated around the diagonal of the confusion matrix, the better the performance of the model. A minimal sketch of this evaluation is shown below.
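
A minimal sketch of the evaluation for a single model; rfModel_x is a hypothetical name for one of the trained models (the learner must have predict.type = "prob" for multiclass.aunp to be computable):

# Predict on the held-out test data and compute the three measures:
pred_x <- predict(rfModel_x, newdata = test_data_x)
performance(pred_x, measures = list(mmce, multiclass.aunp, kappa))
# Confusion matrix for inspecting how the model makes mistakes:
calculateConfusionMatrix(pred_x)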

For the performance measures, the following can be stated:

Naive Bayes

The naive Bayes models predicting the x- and y-location both perform poorly. Both mmce scores are above 0.8 and the kappa statistics are very close to zero; however, the multiclass aunp is fairly high. Looking at the confusion matrices in figures 6 and 7, it is clear that the models have problems predicting the locations. Especially the model predicting the y-location has trouble separating the target levels: almost all instances are predicted to be the same location.

From the data exploration in Phase I it was expected that the naive Bayes models would perform poorly due to the distribution of the descriptive features: they contained a lot of zero values plus some instances that were normally distributed around a certain value. The bad results come from the naive Bayes model's assumption that the descriptive features are normally distributed. To overcome this problem a zero-inflated Poisson distribution could be used, but it was decided not to pursue this further during this project.

Performance measures of Naive Bayes for x-coordinate

Measure            Value
mmce               0.8641686
multiclass.aunp    0.8061124
kappa              0.1014781
Confusion matrix visualised for the naive Bayes model predicting the x-coordinate

Performance measures of Naive Bayes for y-coordinate

Measure            Value
mmce               0.8337237
multiclass.aunp    0.7334375
kappa              0.0918530
Confusion matrix visualised for the naive Bayes model predicting the y-coordinate

Random Forest

The random forest models perform relatively well on the test data. Looking at the plots of the confusion matrices, the majority of predictions are centred around the diagonals, which shows that the misclassifications the models make are relatively small. The performance measures confirm that the models performed well compared to the other models, having the lowest mmce and the highest kappa statistics.

Performance measures of random forest for x-coordinate

Measure            Value
mmce               0.5667447
multiclass.aunp    0.7985298
kappa              0.3705096
Confusion matrix visualised for the random forest model predicting the x-coordinate

Performance measures of random forest for y-coordinate

Measure            Value
mmce               0.5362998
multiclass.aunp    0.7802187
kappa              0.3772696
Confusion matrix visualised for the random forest model predicting the y-coordinate

K-Nearest Neighbour

The confusion matrix for the model predicting the x-location shows a tendency towards a prediction pattern, but not a strong one. The model predicting the y-location does a satisfactory job predicting all levels except level 1. In general, the KNN models are not doing a good job predicting the locations, and the performance measures confirm this.

Performance measures of KNN for x-coordinate

Measure            Value
mmce               0.6791569
multiclass.aunp    0.7107815
kappa              0.2486819
Confusion matrix visualised for the KNN model predicting the x-coordinate

Performance measures of KNN for y-coordinate

Measure            Value
mmce               0.6323185
multiclass.aunp    0.6799668
kappa              0.2919783
Confusion matrix visualised for the KNN model predicting the y-coordinate

Decision Tree

Looking at the confusion matrices from the decision tree models, it is clear that most predictions are centred around the diagonals. The model predicting the x-location seems to group its predictions around the same locations. The performance measures indicate that the models are better than a random guess, but the mmce is still high.

Performance measures of decision tree for x-coordinate

Measure            Value
mmce               0.6393443
multiclass.aunp    0.8273512
kappa              0.2871540
Confusion matrix visualised for the decision tree model predicting the x-coordinate

Performance measures of decision tree for y-coordinate

Measure            Value
mmce               0.6182670
multiclass.aunp    0.7862724
kappa              0.2767866
Confusion matrix visualised for the decision tree model predicting the y-coordinate

Decision tree with feature selection

The models built after feature selection perform very badly on the test data. The model predicting the x-location predicts only two different target levels, and the model predicting the y-location predicts the same target level for all instances. The same is reflected in the performance measures: a kappa score of 0 and a multiclass.aunp of 0.5 show that the model performs no better than random guessing.

Performance measures of decision tree with feature selection for x-coordinate

Measure            Value
mmce               0.9391101
multiclass.aunp    0.5000000
kappa              0.0000000
Confusion matrix visualised for the decision tree with feature selection predicting the x-coordinate

Performance measures of decision tree with feature selection for y-coordinate

Measure            Value
mmce               0.7540984
multiclass.aunp    0.6012339
kappa              0.1413709
Confusion matrix visualised for the decision tree with feature selection predicting the y-coordinate

Comparison of models
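
The comparison itself is not reproduced here. As a sketch, mlr's benchmark() can run several tuned learners under the same resampling so their performance can be compared directly; the learner names below are assumptions following the naming used in the tuning sketches above.

# Sketch of a model comparison on the x-coordinate task:
bmr_x <- benchmark(learners = list(tunedLearner1_x, tunedLearner2_x,
                                   tunedLearner3_x, tunedLearner4_x),
                   tasks = classif.task_x,
                   resamplings = rdescx,
                   measures = list(mmce, multiclass.aunp, kappa))
getBMRAggrPerformances(bmr_x, as.df = TRUE)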

Discussion

In general, the different models have a hard time making correct predictions. First of all, the dataset is clearly not fit for a naive Bayes model, probably because of the many zero values. This could likely have been handled by using a zero-inflated Poisson distribution, but that is left for further work on this project. The mmce is above 50% for all models, which is quite high. However, keeping in mind that the target features (x- and y-location) have 15 and 11 levels, and looking at the multiclass.aunp and kappa measures, the models in general perform much better than a model making random guesses. This shows the importance of not relying on mmce as the primary measure of model performance; the tendency of the predictions to be distributed around the diagonals of the confusion matrices confirms this.

Especially random forest and decision trees seem to work well on this dataset compared to the other types of models. That is possibly because the tree models split on whether the signal is detected (zero or around 120) for the descriptive features. Furthermore, this type of model is not sensitive to the distribution of the descriptive features.

Initially, the goal was to do feature selection on the best model, but when random forest turned out to be the best model for this problem, it did not make sense to do feature selection on it due to how the random forest model works. Therefore it was decided to try feature selection on the decision tree. During feature selection on the training data using five-fold cross-validation, the model performed at the same level as the model without feature selection. However, on test data the model performs poorly and predicts the same target level for all test instances. This is a clear signal that the model is overfitting and does not generalise beyond the training data.

Lastly, the quality of the dataset can be discussed. The confusion matrices show that all models have trouble separating target feature levels (locations) that are next to each other. A solution could be to make the areas to be predicted larger; whether that is a feasible solution would require some domain knowledge and business understanding of the actual application of this model. Basically, one has to ask how exact a prediction of the location needs to be for the prediction to be useful. If the model must make predictions as fine-grained as the locations used to build it, it should be considered whether other descriptive features containing useful information can be found. Since the signals are collected by mobile phones, it should be possible to find more valuable data that can improve the performance of the model.

Conclusion

The random forest model performs best on the test data, followed by the decision tree. The naive Bayes model in particular performs badly on test data, as expected due to the distribution of the descriptive features. Also, the decision tree built with feature selection seemed to overfit and did not generalise beyond the training data.

The evaluation of the models shows the importance of not relying on mmce or accuracy as the primary measure: the majority of the models perform far better than a model making random guesses would. It is also important to take into account how costly a mistake is; in this case, random forest clearly makes the least costly mistakes, which can be seen by investigating the visualisations of the confusion matrices.

References

Aksakalli, Vural, Babak Abbasi, and Yong Kai Wong. 2018. SpFSR: Feature Selection and Ranking by Simultaneous Perturbation Stochastic Approximation. https://CRAN.R-project.org/package=spFSR.

Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. http://jmlr.org/papers/v17/15-066.html.