Determining a person’s physical position in a multi-building indoor space using Wi-Fi fingerprinting.
Our client is developing a system, to be deployed on large industrial campuses, in shopping malls, and similar venues, to help people navigate complex, unfamiliar interior spaces without getting lost.
While GPS works fairly reliably outdoors, it generally doesn’t work indoors, so a different technology is necessary. Our client would like us to investigate the feasibility of using “Wi-Fi fingerprinting” to determine a person’s location in indoor spaces.
Wi-Fi fingerprinting uses the signals from multiple Wi-Fi hotspots within the building to determine location, analogously to how GPS uses satellite signals.
Our job is to evaluate multiple machine learning models to see which produces the best results, enabling us to make a recommendation to the client. If the recommended model is sufficiently accurate, it will be incorporated into a smartphone app for indoor locationing.
We have been provided with a large database (about 20,000 observations and 529 variables) of Wi-Fi fingerprints for a multi-building industrial campus, with a location (building, floor, and location ID) associated with each fingerprint.
# load the libraries
library(readr)
library(caret)
library(dplyr)
library(tidyr)
# set up parallel processing
library(doParallel)
# Find how many cores are on your machine
detectCores()
# Create Cluster with desired number of cores.
cl <- makeCluster(2)
# Register Cluster
registerDoParallel(cl)
# Confirm how many cores are now "assigned" to R and RStudio
getDoParWorkers()
# Stop the cluster when done. (Run this only after all model training below
# is complete; stopping it here would disable parallel processing.)
# stopCluster(cl)
# load the dataset
training <- read.csv("trainingData.csv")
# inspect the first 8 variables for the first 3 observations
head(training, n=3)[1:8]
## WAP001 WAP002 WAP003 WAP004 WAP005 WAP006 WAP007 WAP008
## 1 100 100 100 100 100 100 100 100
## 2 100 100 100 100 100 100 100 100
## 3 100 100 100 100 100 100 100 -97
# inspect the last few variables for the first 3 observations
head(training, n=3)[520:529]
## WAP520 LONGITUDE LATITUDE FLOOR BUILDINGID SPACEID RELATIVEPOSITION USERID
## 1 100 -7541.264 4864921 2 1 106 2 2
## 2 100 -7536.621 4864934 2 1 106 2 2
## 3 100 -7519.152 4864950 2 1 103 2 2
## PHONEID TIMESTAMP
## 1 23 1371713733
## 2 23 1371713691
## 3 23 1371714095
# check the structure (list output truncated to the first 3 variables)
str(training, list.len=3)
## 'data.frame': 19937 obs. of 529 variables:
## $ WAP001 : int 100 100 100 100 100 100 100 100 100 100 ...
## $ WAP002 : int 100 100 100 100 100 100 100 100 100 100 ...
## $ WAP003 : int 100 100 100 100 100 100 100 100 100 100 ...
## [list output truncated]
Remove zero variance variables
Zero-variance variables are constant across all samples, so they carry no information. For example, a WAP whose recorded intensity is identical for every observation in our dataset (e.g., an access point that was never detected) tells us nothing about where a fingerprint was recorded. Such predictors are useless for model building and can even break some of the models we want to build, so we will find and remove all zero-variance variables.
Remove unnecessary variables
Since our goal is to predict which space/room a person is in, LONGITUDE and LATITUDE are not important for our task. USERID, PHONEID, and TIMESTAMP are also unrelated to predicting the location. We will remove them all.
Combine location-indicating variables into one
BUILDINGID, FLOOR, SPACEID, and RELATIVEPOSITION are all variables indicating the location we need to predict. We will combine them into a single variable, “LOCATION”. After combining, we will convert the data type from numeric to factor so we can train classification models later on.
Subset the data by buildings
Since we have a large amount of data spanning three buildings, data from different buildings may behave differently under different models. For both efficiency and accuracy, we will split the whole dataset into three subsets by building.
Earlier, we didn’t remove the component attributes when we built the LOCATION variable, because we still need BUILDINGID to subset the data. After subsetting, we can safely remove BUILDINGID, FLOOR, SPACEID, and RELATIVEPOSITION from all three subsets.
# find and remove zero-variance variables
rzv_training <- training[, -which(apply(training, 2, var) == 0)]
# check if all zero variance columns are removed
which(apply(rzv_training, 2, var) == 0)
## named integer(0)
The check returns an empty result, confirming that all zero-variance columns have been removed.
# remove the unnecessary variables (LONGITUDE, LATITUDE, USERID, PHONEID, TIMESTAMP)
rzv_training <- rzv_training[, -c(466,467,472:474)]
# check if they are all removed
names(rzv_training)[465:469]
## [1] "WAP519" "FLOOR" "BUILDINGID" "SPACEID"
## [5] "RELATIVEPOSITION"
# combine the location-indicating variables into one
rzv_training <- unite(rzv_training, col = "LOCATION", c("BUILDINGID",
"FLOOR", "SPACEID", "RELATIVEPOSITION"), sep = "", remove = FALSE)
# convert data type to factor
rzv_training$LOCATION <- as.factor(rzv_training$LOCATION)
# make sure the data type has been converted
str(rzv_training$LOCATION)
## Factor w/ 905 levels "001022","001062",..: 400 400 394 392 16 398 394 389 407 393 ...
# split the dataset into three subsets by building
training_b0 <- subset(rzv_training, BUILDINGID == 0)
training_b1 <- subset(rzv_training, BUILDINGID == 1)
training_b2 <- subset(rzv_training, BUILDINGID == 2)
# remove individual location variables
training_b0[, c(467:470)] <- NULL
training_b1[, c(467:470)] <- NULL
training_b2[, c(467:470)] <- NULL
# re-apply factor() to drop unused levels
training_b0$LOCATION <- factor(training_b0$LOCATION)
training_b1$LOCATION <- factor(training_b1$LOCATION)
training_b2$LOCATION <- factor(training_b2$LOCATION)
# check how many levels of LOCATION for building0
str(training_b0$LOCATION)
## Factor w/ 259 levels "001022","001062",..: 16 1 4 5 3 2 9 8 7 6 ...
After we subset the data by building, LOCATION still carries all of the original levels, far more than actually occur in each individual subset. That’s why we re-apply the factor() function to LOCATION: it drops the unused levels.
We are going to predict the location down to the space/room level, which is a categorical rather than a continuous value, so we will build classification models. We picked three classifiers to try: C5.0, Random Forest, and KNN. Ten-fold cross-validation is used to avoid overfitting.
Both C5.0 and Random Forest belong to the tree-model family. A tree model is a flowchart-like structure that splits the sample at nodes chosen on the most informative variable; each node splits again, and the process repeats until the subsamples cannot be split any further.
C5.0 is robust at processing a large number of variables, such as the hundreds of WAP predictors in our dataset, and it usually doesn’t take a long time to train.
Random Forest, on the other hand, usually takes longer to train, depending on the number of trees. It works by constructing many decision trees and, for classification problems, outputting the mode of their predicted classes. Its main advantage is that it avoids the overfitting a single decision tree is prone to.
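To make the voting idea concrete, here is a toy illustration (the votes are invented, not taken from our models): each tree predicts a class, and the forest outputs the most common vote.
tree_votes <- c("room_A", "room_B", "room_A", "room_A", "room_B")
table(tree_votes)                    # room_A: 3, room_B: 2
names(which.max(table(tree_votes)))  # forest prediction: "room_A"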
The K-Nearest Neighbors algorithm is based on the assumption that similar things exist close to each other; it captures similarity by calculating the distance between points. KNN is simple and easy to implement, but it can become time-consuming as the dataset grows.
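Below is a minimal sketch of this distance-and-vote idea, using two invented WAP signal strengths rather than our hundreds of predictors; all numbers are made up for illustration.
# toy training fingerprints: two WAP signal strengths per observation
train_x <- matrix(c(-60, -70,
                    -62, -71,
                    -90, -40), nrow = 3, byrow = TRUE)
train_y <- factor(c("room_A", "room_A", "room_B"))
new_x <- c(-61, -69)
# Euclidean distance from the new fingerprint to each training fingerprint
d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
k <- 2
nearest <- order(d)[1:k]                   # indices of the k closest points
names(which.max(table(train_y[nearest])))  # majority vote -> "room_A"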
set.seed(520)
# set up 10-fold cross-validation control
fitControl <- trainControl(method = "cv", number = 10)
BUILDING_0
# split training and testing datasets
inTraining_b0 <- createDataPartition(training_b0$LOCATION, p = .75, list = FALSE )
training_b0sp <- training_b0[inTraining_b0, ]
testing_b0sp <- training_b0[-inTraining_b0, ]
# C5.0 model
C50_b0 <- train(LOCATION~., data = training_b0sp, method = "C5.0",
trControl = fitControl)
# testing
prediction_C50_b0 <- predict(C50_b0, testing_b0sp)
# random forest
rf_b0 <- train(LOCATION~., data = training_b0sp, method = "rf",
trControl = fitControl)
prediction_rf_b0 <- predict(rf_b0, testing_b0sp)
# KNN
KNN_b0 <- train(LOCATION~., data = training_b0sp, method = "knn",
trControl = fitControl)
prediction_KNN_b0 <- predict(KNN_b0, testing_b0sp)
BUILDING_1
# split training and testing datasets
inTraining_b1 <- createDataPartition(training_b1$LOCATION, p = .75, list = FALSE )
training_b1sp <- training_b1[inTraining_b1, ]
testing_b1sp <- training_b1[-inTraining_b1, ]
# C5.0 model
C50_b1 <- train(LOCATION~., data = training_b1sp, method = "C5.0",
trControl = fitControl)
summary(C50_b1)
# testing
prediction_C50_b1 <- predict(C50_b1, testing_b1sp)
# random forest
rf_b1 <- train(LOCATION~., data = training_b1sp, method = "rf",
trControl = fitControl)
prediction_rf_b1 <- predict(rf_b1, testing_b1sp)
# KNN
KNN_b1 <- train(LOCATION~., data = training_b1sp, method = "knn",
trControl = fitControl)
prediction_KNN_b1 <- predict(KNN_b1, testing_b1sp)
BUILDING_2
# split training and testing datasets
inTraining_b2 <- createDataPartition(training_b2$LOCATION, p = .75, list = FALSE )
training_b2sp <- training_b2[inTraining_b2, ]
testing_b2sp <- training_b2[-inTraining_b2, ]
# C5.0 model
C50_b2 <- train(LOCATION~., data = training_b2sp, method = "C5.0",
trControl = fitControl)
# testing
prediction_C50_b2 <- predict(C50_b2, testing_b2sp)
# random forest
rf_b2 <- train(LOCATION~., data = training_b2sp, method = "rf",
trControl = fitControl)
prediction_rf_b2 <- predict(rf_b2, testing_b2sp)
# KNN
KNN_b2 <- train(LOCATION~., data = training_b2sp, method = "knn",
trControl = fitControl)
prediction_KNN_b2 <- predict(KNN_b2, testing_b2sp)
There are different ways to compare the performance of the models. We will use confusionMatrix(), postResample(), and resamples(), and we will evaluate the models with both Accuracy and the Kappa score.
The Kappa score compares an observed accuracy with an expected accuracy. Observed accuracy is simply the fraction of instances that were classified correctly. Expected accuracy is the accuracy any random classifier would be expected to achieve; it is determined by the marginal totals of the confusion matrix, i.e., how often each class occurs in the ground truth and in the classifier’s predictions. Because it corrects for chance agreement, Kappa is generally less misleading than accuracy alone, especially when the classes are imbalanced.
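As a toy computation (the numbers are invented, not from our models), Kappa can be derived from a confusion matrix as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance:
cm <- matrix(c(40, 5,
               10, 45), nrow = 2, byrow = TRUE)
n <- sum(cm)
p_o <- sum(diag(cm)) / n                     # observed accuracy: 0.85
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2  # chance agreement: 0.50
(p_o - p_e) / (1 - p_e)                      # kappa: 0.7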
BUILDING_0
# evaluate C5.0
cm_C50_b0 <- confusionMatrix(prediction_C50_b0, testing_b0sp$LOCATION)
postResample(prediction_C50_b0, testing_b0sp$LOCATION)
# evaluate Random Forest
cm_rf_b0 <- confusionMatrix(prediction_rf_b0, testing_b0sp$LOCATION)
postResample(prediction_rf_b0, testing_b0sp$LOCATION)
# evaluate KNN
cm_KNN_b0 <- confusionMatrix(prediction_KNN_b0, testing_b0sp$LOCATION)
postResample(prediction_KNN_b0, testing_b0sp$LOCATION)
# resample for all three models
resample_b0 <- resamples( list(C50 = C50_b0, RF = rf_b0, KNN = KNN_b0))
summary(resample_b0)
BUILDING_1
# evaluate C5.0
cm_C50_b1<- confusionMatrix(prediction_C50_b1, testing_b1sp$LOCATION)
postResample(prediction_C50_b1, testing_b1sp$LOCATION)
# evaluate Random forest
cm_rf_b1<- confusionMatrix(prediction_rf_b1, testing_b1sp$LOCATION)
postResample(prediction_rf_b1, testing_b1sp$LOCATION)
# evaluate KNN
cm_KNN_b1 <- confusionMatrix(prediction_KNN_b1, testing_b1sp$LOCATION)
postResample(prediction_KNN_b1, testing_b1sp$LOCATION)
# resample for all three models
resample_b1 <- resamples( list(C50 = C50_b1, RF = rf_b1, KNN = KNN_b1))
summary(resample_b1)
BUILDING_2
# evaluate C5.0
cm_C50_b2<- confusionMatrix(prediction_C50_b2, testing_b2sp$LOCATION)
postResample(prediction_C50_b2, testing_b2sp$LOCATION)
# evaluate Random Forest
cm_rf_b2<- confusionMatrix(prediction_rf_b2, testing_b2sp$LOCATION)
postResample(prediction_rf_b2, testing_b2sp$LOCATION)
# evaluate KNN
cm_KNN_b2 <- confusionMatrix(prediction_KNN_b2, testing_b2sp$LOCATION)
postResample(prediction_KNN_b2, testing_b2sp$LOCATION)
# resample for all three models
resample_b2 <- resamples( list(C50 = C50_b2, RF = rf_b2, KNN = KNN_b2))
summary(resample_b2)
BUILDING_0

|          | Models        | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|----------|---------------|------|---------|--------|------|---------|------|
| Accuracy | C5.0          | 0.68 | 0.70    | 0.72   | 0.72 | 0.73    | 0.75 |
|          | Random Forest | 0.73 | 0.76    | 0.77   | 0.77 | 0.78    | 0.80 |
|          | KNN           | 0.50 | 0.52    | 0.54   | 0.54 | 0.55    | 0.57 |
| Kappa    | C5.0          | 0.68 | 0.70    | 0.72   | 0.72 | 0.73    | 0.75 |
|          | Random Forest | 0.73 | 0.76    | 0.77   | 0.77 | 0.78    | 0.80 |
|          | KNN           | 0.50 | 0.52    | 0.53   | 0.53 | 0.55    | 0.57 |
BUILDING_1

|          | Models        | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|----------|---------------|------|---------|--------|------|---------|------|
| Accuracy | C5.0          | 0.76 | 0.79    | 0.79   | 0.79 | 0.80    | 0.81 |
|          | Random Forest | 0.82 | 0.83    | 0.84   | 0.84 | 0.85    | 0.87 |
|          | KNN           | 0.60 | 0.63    | 0.63   | 0.64 | 0.65    | 0.66 |
| Kappa    | C5.0          | 0.76 | 0.78    | 0.79   | 0.79 | 0.80    | 0.81 |
|          | Random Forest | 0.82 | 0.83    | 0.84   | 0.84 | 0.85    | 0.87 |
|          | KNN           | 0.60 | 0.62    | 0.63   | 0.63 | 0.64    | 0.66 |
BUILDING_2

|          | Models        | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|----------|---------------|------|---------|--------|------|---------|------|
| Accuracy | C5.0          | 0.67 | 0.71    | 0.71   | 0.71 | 0.73    | 0.74 |
|          | Random Forest | 0.78 | 0.79    | 0.80   | 0.80 | 0.80    | 0.85 |
|          | KNN           | 0.58 | 0.61    | 0.63   | 0.62 | 0.64    | 0.64 |
| Kappa    | C5.0          | 0.67 | 0.71    | 0.71   | 0.71 | 0.72    | 0.73 |
|          | Random Forest | 0.78 | 0.79    | 0.80   | 0.80 | 0.80    | 0.85 |
|          | KNN           | 0.58 | 0.60    | 0.62   | 0.63 | 0.63    | 0.64 |
The table below shows how long each model took to train, in seconds.

|                   | C5.0 | Random Forest | KNN |
|-------------------|------|---------------|-----|
| Training time (s) | 1823 | 11453         | 527 |
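The timing code isn’t shown above; one way these numbers could be reproduced (a sketch, assuming the same train() calls used earlier) is to wrap each call in system.time():
timing <- system.time(
  C50_b0 <- train(LOCATION ~ ., data = training_b0sp, method = "C5.0",
                  trControl = fitControl)
)
timing["elapsed"]  # training time in seconds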
As we can see, the KNN algorithm doesn’t perform very well on our dataset and problem. C5.0 and Random Forest achieve similar accuracy and Kappa scores, so we picked the C5.0 model based on training time: it trains roughly six times faster than Random Forest.
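As a possible next step (a sketch; the file names are our assumption, not part of the client deliverable), the chosen C5.0 models can be persisted with saveRDS() so the smartphone app’s backend can load them for prediction without retraining:
saveRDS(C50_b0, "C50_building0.rds")
saveRDS(C50_b1, "C50_building1.rds")
saveRDS(C50_b2, "C50_building2.rds")
# later, e.g. on the serving side:
model_b0 <- readRDS("C50_building0.rds")
predict(model_b0, testing_b0sp[1, ])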
Wi-Fi locationing has been around for some time, but its accuracy is hard to improve much further because of some fundamental weaknesses.
Bluetooth beacon locationing has been on the rise in recent years, and it has advantages that can offset some of the disadvantages of Wi-Fi locationing.