Determining a person’s physical position in a multi-building indoor space using Wi-Fi fingerprinting.
Our client is developing a system, to be deployed on large industrial campuses, in shopping malls, and similar venues, to help people navigate complex, unfamiliar interior spaces without getting lost.
While GPS works fairly reliably outdoors, it generally doesn’t work indoors, so a different technology is necessary. Our client would like us to investigate the feasibility of using “Wi-Fi fingerprinting” to determine a person’s location in indoor spaces.
Wi-Fi fingerprinting uses the signals from multiple Wi-Fi hotspots within the building to determine location, analogously to how GPS uses satellite signals.
Our job is to evaluate multiple machine learning models to see which produces the best results, enabling us to make a recommendation to the client. If the recommended model is sufficiently accurate, it will be incorporated into a smartphone app for indoor locationing.
We have been provided with a large database (about 20,000 observations and 529 variables) of Wi-Fi fingerprints for a multi-building industrial campus, with a location (building, floor, and location ID) associated with each fingerprint.
# load the libraries
library(readr)
library(caret)
library(dplyr)
library(tidyr)
# set up parallel processing
library(doParallel)
# Find how many cores are on your machine
detectCores()
# Create Cluster with desired number of cores.
cl <- makeCluster(2)
# Register Cluster
registerDoParallel(cl)
# Confirm how many cores are now "assigned" to R and RStudio
getDoParWorkers()
# Stop the cluster when done. (Run this only after all model training below
# is complete; stopping it here would disable parallel processing.)
# stopCluster(cl)
# load the dataset
training <- read.csv("trainingData.csv")
# inspect the first 8 variables for the first 3 observations
head(training, n=3)[1:8]
## WAP001 WAP002 WAP003 WAP004 WAP005 WAP006 WAP007 WAP008
## 1 100 100 100 100 100 100 100 100
## 2 100 100 100 100 100 100 100 100
## 3 100 100 100 100 100 100 100 -97
# inspect the last few variables for the first 3 observations
head(training, n=3)[520:529]
## WAP520 LONGITUDE LATITUDE FLOOR BUILDINGID SPACEID RELATIVEPOSITION USERID
## 1 100 -7541.264 4864921 2 1 106 2 2
## 2 100 -7536.621 4864934 2 1 106 2 2
## 3 100 -7519.152 4864950 2 1 103 2 2
## PHONEID TIMESTAMP
## 1 23 1371713733
## 2 23 1371713691
## 3 23 1371714095
# check the structure (list output truncated to the first 3 variables)
str(training, list.len=3)
## 'data.frame': 19937 obs. of 529 variables:
## $ WAP001 : int 100 100 100 100 100 100 100 100 100 100 ...
## $ WAP002 : int 100 100 100 100 100 100 100 100 100 100 ...
## $ WAP003 : int 100 100 100 100 100 100 100 100 100 100 ...
## [list output truncated]
Remove zero variance variables
Zero-variance variables are constant across all samples, so they carry no information. For example, a WAP whose recorded intensity is identical for every observation in our dataset (e.g., an access point that was never detected) tells us nothing about where a fingerprint was recorded. Such predictors are useless for model building and can even break some of the models we want to build, so we will find and remove all zero-variance variables.
Remove unnecessary variables
Since our goal is to predict which space/room a person is in, LONGITUDE and LATITUDE are not important for our task. USERID, PHONEID, and TIMESTAMP are also unrelated to predicting the location. We will remove them all.
Combine location-indicating variables into one
BUILDINGID, FLOOR, SPACEID, and RELATIVEPOSITION are all variables indicating the location we need to predict. We will combine them into a single variable, “LOCATION”. After combining, we will convert the data type from numeric to factor so we can train classification models later on.
Subset the data by buildings
Since we have a large amount of data spanning three buildings, data from different buildings may behave differently under different models. For both efficiency and accuracy, we will split the whole dataset into three subsets by building.
Earlier, we didn’t remove the component attributes when we built the LOCATION variable, because we still need BUILDINGID to subset the data. After subsetting, we can safely remove BUILDINGID, FLOOR, SPACEID, and RELATIVEPOSITION from all three subsets.
# find and remove zero-variance variables
rzv_training <- training[, -which(apply(training, 2, var) == 0)]
# check if all zero variance columns are removed
which(apply(rzv_training, 2, var) == 0)
## named integer(0)
The check returns an empty result, confirming that all zero-variance columns have been removed.
# remove the unnecessary variables (LONGITUDE, LATITUDE, USERID, PHONEID, TIMESTAMP)
rzv_training <- rzv_training[, -c(466,467,472:474)]
# check if they are all removed
names(rzv_training)[465:469]
## [1] "WAP519" "FLOOR" "BUILDINGID" "SPACEID"
## [5] "RELATIVEPOSITION"
# combine the location-indicating variables into one
rzv_training <- unite(rzv_training, col = "LOCATION", c("BUILDINGID",
"FLOOR", "SPACEID", "RELATIVEPOSITION"), sep = "", remove = FALSE)
# convert data type to factor
rzv_training$LOCATION <- as.factor(rzv_training$LOCATION)
# make sure the data type has been converted
str(rzv_training$LOCATION)
## Factor w/ 905 levels "001022","001062",..: 400 400 394 392 16 398 394 389 407 393 ...
# split the dataset into three subsets by building
training_b0 <- subset(rzv_training, BUILDINGID == 0)
training_b1 <- subset(rzv_training, BUILDINGID == 1)
training_b2 <- subset(rzv_training, BUILDINGID == 2)
# remove individual location variables
training_b0[, c(467:470)] <- NULL
training_b1[, c(467:470)] <- NULL
training_b2[, c(467:470)] <- NULL
# re-apply factor() to drop unused levels
training_b0$LOCATION <- factor(training_b0$LOCATION)
training_b1$LOCATION <- factor(training_b1$LOCATION)
training_b2$LOCATION <- factor(training_b2$LOCATION)
# check how many levels of LOCATION for building0
str(training_b0$LOCATION)
## Factor w/ 259 levels "001022","001062",..: 16 1 4 5 3 2 9 8 7 6 ...
After we subset the data by building, LOCATION still carries all of the original levels, far more than actually occur in each individual subset. That’s why we re-apply the factor() function to LOCATION: it drops the unused levels.
We are going to predict the location down to the space/room level, which is a categorical rather than a continuous value, so we will build classification models. We picked three classifiers to try: C5.0, Random Forest, and KNN. Ten-fold cross-validation is used to avoid overfitting.
Both C5.0 and Random Forest belong to the tree-model family. A tree model is a flowchart-like structure that splits the sample at nodes chosen on the most informative variable; each node splits again, and the process repeats until the subsamples cannot be split any further.
C5.0 is robust at processing a large number of variables, such as the hundreds of WAP predictors in our dataset, and it usually doesn’t take a long time to train.
Random Forest, on the other hand, usually takes longer to train, depending on the number of trees. It works by constructing many decision trees and, for classification problems, outputting the mode of their predicted classes. Its main advantage is that it avoids the overfitting a single decision tree is prone to.
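To make the voting idea concrete, here is a toy illustration (the votes are invented, not taken from our models): each tree predicts a class, and the forest outputs the most common vote.
tree_votes <- c("room_A", "room_B", "room_A", "room_A", "room_B")
table(tree_votes)                    # room_A: 3, room_B: 2
names(which.max(table(tree_votes)))  # forest prediction: "room_A"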
The K-Nearest Neighbors algorithm is based on the assumption that similar things exist close to each other; it captures similarity by calculating the distance between points. KNN is simple and easy to implement, but it can become time-consuming as the dataset grows.
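Below is a minimal sketch of this distance-and-vote idea, using two invented WAP signal strengths rather than our hundreds of predictors; all numbers are made up for illustration.
# toy training fingerprints: two WAP signal strengths per observation
train_x <- matrix(c(-60, -70,
                    -62, -71,
                    -90, -40), nrow = 3, byrow = TRUE)
train_y <- factor(c("room_A", "room_A", "room_B"))
new_x <- c(-61, -69)
# Euclidean distance from the new fingerprint to each training fingerprint
d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
k <- 2
nearest <- order(d)[1:k]                   # indices of the k closest points
names(which.max(table(train_y[nearest])))  # majority vote -> "room_A"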
set.seed(520)
# set up 10-fold cross-validation control
fitControl <- trainControl(method = "cv", number = 10)
BUILDING_0
# split training and testing datasets
inTraining_b0 <- createDataPartition(training_b0$LOCATION, p = .75, list = FALSE )
training_b0sp <- training_b0[inTraining_b0, ]
testing_b0sp <- training_b0[-inTraining_b0, ]
# C5.0 model
C50_b0 <- train(LOCATION~., data = training_b0sp, method = "C5.0",
trControl = fitControl)
# testing
prediction_C50_b0 <- predict(C50_b0, testing_b0sp)
# random forest
rf_b0 <- train(LOCATION~., data = training_b0sp, method = "rf",
trControl = fitControl)
prediction_rf_b0 <- predict(rf_b0, testing_b0sp)
# KNN
KNN_b0 <- train(LOCATION~., data = training_b0sp, method = "knn",
trControl = fitControl)
prediction_KNN_b0 <- predict(KNN_b0, testing_b0sp)
BUILDING_1
# split training and testing datasets
inTraining_b1 <- createDataPartition(training_b1$LOCATION, p = .75, list = FALSE )
training_b1sp <- training_b1[inTraining_b1, ]
testing_b1sp <- training_b1[-inTraining_b1, ]
# C5.0 model
C50_b1 <- train(LOCATION~., data = training_b1sp, method = "C5.0",
trControl = fitControl)
summary(C50_b1)
# testing
prediction_C50_b1 <- predict(C50_b1, testing_b1sp)
# random forest
rf_b1 <- train(LOCATION~., data = training_b1sp, method = "rf",
trControl = fitControl)
prediction_rf_b1 <- predict(rf_b1, testing_b1sp)
# KNN
KNN_b1 <- train(LOCATION~., data = training_b1sp, method = "knn",
trControl = fitControl)
prediction_KNN_b1 <- predict(KNN_b1, testing_b1sp)
BUILDING_2
# split training and testing datasets
inTraining_b2 <- createDataPartition(training_b2$LOCATION, p = .75, list = FALSE )
training_b2sp <- training_b2[inTraining_b2, ]
testing_b2sp <- training_b2[-inTraining_b2, ]
# C5.0 model
C50_b2 <- train(LOCATION~., data = training_b2sp, method = "C5.0",
trControl = fitControl)
# testing
prediction_C50_b2 <- predict(C50_b2, testing_b2sp)
# random forest
rf_b2 <- train(LOCATION~., data = training_b2sp, method = "rf",
trControl = fitControl)
prediction_rf_b2 <- predict(rf_b2, testing_b2sp)
# KNN
KNN_b2 <- train(LOCATION~., data = training_b2sp, method = "knn",
trControl = fitControl)
prediction_KNN_b2 <- predict(KNN_b2, testing_b2sp)
There are different ways to compare the performance of the models. We will use confusionMatrix(), postResample(), and resamples(), and we will evaluate the models with both Accuracy and the Kappa score.
The Kappa score compares an observed accuracy with an expected accuracy. Observed accuracy is simply the fraction of instances that were classified correctly. Expected accuracy is the accuracy any random classifier would be expected to achieve; it is determined by the marginal totals of the confusion matrix, i.e., how often each class occurs in the ground truth and in the classifier’s predictions. Because it corrects for chance agreement, Kappa is generally less misleading than accuracy alone, especially when the classes are imbalanced.
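As a toy computation (the numbers are invented, not from our models), Kappa can be derived from a confusion matrix as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance:
cm <- matrix(c(40, 5,
               10, 45), nrow = 2, byrow = TRUE)
n <- sum(cm)
p_o <- sum(diag(cm)) / n                     # observed accuracy: 0.85
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2  # chance agreement: 0.50
(p_o - p_e) / (1 - p_e)                      # kappa: 0.7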
BUILDING_0
# evaluate C5.0
cm_C50_b0 <- confusionMatrix(prediction_C50_b0, testing_b0sp$LOCATION)
postResample(prediction_C50_b0, testing_b0sp$LOCATION)
# evaluate Random Forest
cm_rf_b0 <- confusionMatrix(prediction_rf_b0, testing_b0sp$LOCATION)
postResample(prediction_rf_b0, testing_b0sp$LOCATION)
# evaluate KNN
cm_KNN_b0 <- confusionMatrix(prediction_KNN_b0, testing_b0sp$LOCATION)
postResample(prediction_KNN_b0, testing_b0sp$LOCATION)
# resample for all three models
resample_b0 <- resamples( list(C50 = C50_b0, RF = rf_b0, KNN = KNN_b0))
summary(resample_b0)
BUILDING_1
# evaluate C5.0
cm_C50_b1<- confusionMatrix(prediction_C50_b1, testing_b1sp$LOCATION)
postResample(prediction_C50_b1, testing_b1sp$LOCATION)
# evaluate Random forest
cm_rf_b1<- confusionMatrix(prediction_rf_b1, testing_b1sp$LOCATION)
postResample(prediction_rf_b1, testing_b1sp$LOCATION)
# evaluate KNN
cm_KNN_b1 <- confusionMatrix(prediction_KNN_b1, testing_b1sp$LOCATION)
postResample(prediction_KNN_b1, testing_b1sp$LOCATION)
# resample for all three models
resample_b1 <- resamples( list(C50 = C50_b1, RF = rf_b1, KNN = KNN_b1))
summary(resample_b1)
BUILDING_2
# evaluate C5.0
cm_C50_b2<- confusionMatrix(prediction_C50_b2, testing_b2sp$LOCATION)
postResample(prediction_C50_b2, testing_b2sp$LOCATION)
# evaluate Random Forest
cm_rf_b2<- confusionMatrix(prediction_rf_b2, testing_b2sp$LOCATION)
postResample(prediction_rf_b2, testing_b2sp$LOCATION)
# evaluate KNN
cm_KNN_b2 <- confusionMatrix(prediction_KNN_b2, testing_b2sp$LOCATION)
postResample(prediction_KNN_b2, testing_b2sp$LOCATION)
# resample for all three models
resample_b2 <- resamples( list(C50 = C50_b2, RF = rf_b2, KNN = KNN_b2))
summary(resample_b2)
BUILDING_0

|          | Models        | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|----------|---------------|------|---------|--------|------|---------|------|
| Accuracy | C5.0          | 0.68 | 0.70    | 0.72   | 0.72 | 0.73    | 0.75 |
|          | Random Forest | 0.73 | 0.76    | 0.77   | 0.77 | 0.78    | 0.80 |
|          | KNN           | 0.50 | 0.52    | 0.54   | 0.54 | 0.55    | 0.57 |
| Kappa    | C5.0          | 0.68 | 0.70    | 0.72   | 0.72 | 0.73    | 0.75 |
|          | Random Forest | 0.73 | 0.76    | 0.77   | 0.77 | 0.78    | 0.80 |
|          | KNN           | 0.50 | 0.52    | 0.53   | 0.53 | 0.55    | 0.57 |
BUILDING_1

|          | Models        | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|----------|---------------|------|---------|--------|------|---------|------|
| Accuracy | C5.0          | 0.76 | 0.79    | 0.79   | 0.79 | 0.80    | 0.81 |
|          | Random Forest | 0.82 | 0.83    | 0.84   | 0.84 | 0.85    | 0.87 |
|          | KNN           | 0.60 | 0.63    | 0.63   | 0.64 | 0.65    | 0.66 |
| Kappa    | C5.0          | 0.76 | 0.78    | 0.79   | 0.79 | 0.80    | 0.81 |
|          | Random Forest | 0.82 | 0.83    | 0.84   | 0.84 | 0.85    | 0.87 |
|          | KNN           | 0.60 | 0.62    | 0.63   | 0.63 | 0.64    | 0.66 |
BUILDING_2

|          | Models        | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|----------|---------------|------|---------|--------|------|---------|------|
| Accuracy | C5.0          | 0.67 | 0.71    | 0.71   | 0.71 | 0.73    | 0.74 |
|          | Random Forest | 0.78 | 0.79    | 0.80   | 0.80 | 0.80    | 0.85 |
|          | KNN           | 0.58 | 0.61    | 0.63   | 0.62 | 0.64    | 0.64 |
| Kappa    | C5.0          | 0.67 | 0.71    | 0.71   | 0.71 | 0.72    | 0.73 |
|          | Random Forest | 0.78 | 0.79    | 0.80   | 0.80 | 0.80    | 0.85 |
|          | KNN           | 0.58 | 0.60    | 0.62   | 0.63 | 0.63    | 0.64 |
The table below shows how long each model took to train, in seconds.

|                   | C5.0 | Random Forest | KNN |
|-------------------|------|---------------|-----|
| Training time (s) | 1823 | 11453         | 527 |
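The timing code isn’t shown above; one way these numbers could be reproduced (a sketch, assuming the same train() calls used earlier) is to wrap each call in system.time():
timing <- system.time(
  C50_b0 <- train(LOCATION ~ ., data = training_b0sp, method = "C5.0",
                  trControl = fitControl)
)
timing["elapsed"]  # training time in seconds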
As we can see, the KNN algorithm doesn’t perform very well on our dataset and problem. C5.0 and Random Forest achieve similar accuracy and Kappa scores, so we picked the C5.0 model based on training time: it trains roughly six times faster than Random Forest.
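As a possible next step (a sketch; the file names are our assumption, not part of the client deliverable), the chosen C5.0 models can be persisted with saveRDS() so the smartphone app’s backend can load them for prediction without retraining:
saveRDS(C50_b0, "C50_building0.rds")
saveRDS(C50_b1, "C50_building1.rds")
saveRDS(C50_b2, "C50_building2.rds")
# later, e.g. on the serving side:
model_b0 <- readRDS("C50_building0.rds")
predict(model_b0, testing_b0sp[1, ])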
Wi-Fi locationing has been around for some time, but its accuracy is hard to improve much further because of some fundamental weaknesses.
Bluetooth beacon locationing has been on the rise in recent years, and it has advantages that can offset some of the disadvantages of Wi-Fi locationing.