Indoor WiFi Locationing

Overview

Background Info

Our client is developing a system to be deployed on large industrial campuses, shopping malls, hospitals, and similar sites to help people navigate complex, unfamiliar interior spaces without getting lost.
While GPS works fairly reliably outdoors, it generally does not work indoors, so a different technology is necessary. Our client would like us to investigate the feasibility of using “wifi fingerprinting” to determine a person’s location in indoor spaces.
Wifi fingerprinting uses the signals from multiple wifi hotspots within the building to determine location, analogously to how GPS uses satellite signals.

Objective

Our job is to evaluate multiple machine learning algorithms to see which produces the best result, enabling us to make a recommendation to the client. If the recommended model is sufficiently accurate, it will be incorporated into a smartphone app for indoor locationing.

Dataset Info

The UJIIndoorLoc database is hosted in the UCI Machine Learning Repository.
The database covers three buildings of Universitat Jaume I, each with 4 or more floors, and almost 110,000 m² in Castellón de la Plana (Valencian Community), Spain. It was created in 2013 by more than 20 different users and 25 Android devices.
We have been provided with a large database of wifi fingerprints for a multi-building industrial campus with a location (building, floor, and location ID) associated with each fingerprint.
The database consists of 19937 training records and 1111 validation/test records.
Attributes include:
- WAP001 to WAP520: signal intensity recorded for each wireless access point (WAP). Values are negative integers from -104 to 0; the positive value +100 marks a WAP that was not detected.
- 9 additional attributes describing position and collection context:
  - FLOOR, BUILDINGID, SPACEID, RELATIVEPOSITION
  - LONGITUDE, LATITUDE
  - USERID, PHONEID, TIMESTAMP
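
To make the +100 "not detected" encoding concrete, here is a minimal sketch on toy data. The values and the variable names (toy, detected, and so on) are hypothetical and are not taken from the dataset; the sketch only shows how the sentinel can be told apart from real readings.

```r
# Toy fingerprints (hypothetical values, not taken from the dataset)
toy <- data.frame(
  WAP001 = c(100, -67, 100),
  WAP002 = c(-85, -70, 100),
  WAP003 = c(100, 100, 100)
)

# TRUE where a WAP was actually heard; +100 is the "not detected" sentinel
detected <- toy != 100

# Number of WAPs heard in each fingerprint (row)
waps_per_fingerprint <- rowSums(detected)

# Number of fingerprints in which each WAP was heard (column)
fingerprints_per_wap <- colSums(detected)

print(waps_per_fingerprint)
print(fingerprints_per_wap)
```

A WAP such as WAP003 in this toy example, which is never detected, yields a constant column with zero variance.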

Pre-process and Feature Engineering

Load libraries and import the dataset

For convenience, I moved 4 columns (FLOOR, BUILDINGID, SPACEID, RELATIVEPOSITION) to the front of the file and then imported it as the preprocessed dataset.
The tidyr package has powerful tools for data wrangling and manipulation. We used its unite() function to create a single unique identifier.
We used parallel processing across multiple cores to reduce computing time.

# Load libraries 
library(caret)
library(readr)
library(plotly)

# Import the dataset
processed_trainingData <- read_csv("processed_trainingData.csv")

# Set up parallel processing
library(doParallel)
# Find how many cores are on your machine
detectCores() 
## [1] 8
# Create Cluster with desired number of cores. 
cl <- makeCluster(5)
# Register Cluster
registerDoParallel(cl)
# Confirm how many cores are now "assigned" to R and RStudio
getDoParWorkers()  
## [1] 5

# Inspect the dataset
# Check the structure and summary of the dataset
str(processed_trainingData)
summary(processed_trainingData)

# Check the first 10 variables with the first 5 observations
head(processed_trainingData, n=5)[1:10]
## # A tibble: 5 x 10
##   FLOOR BUILDINGID SPACEID RELATIVEPOSITION WAP001 WAP002 WAP003 WAP004 WAP005
##   <dbl>      <dbl>   <dbl>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     2          1     106                2    100    100    100    100    100
## 2     2          1     106                2    100    100    100    100    100
## 3     2          1     103                2    100    100    100    100    100
## 4     2          1     102                2    100    100    100    100    100
## 5     0          0     122                2    100    100    100    100    100
## # ... with 1 more variable: WAP006 <dbl>

# Check the last 10 variables with the last 5 observations
tail(processed_trainingData, n=5)[520:529]
## # A tibble: 5 x 10
##   WAP516 WAP517 WAP518 WAP519 WAP520 LONGITUDE LATITUDE USERID PHONEID TIMESTAMP
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>     <dbl>    <dbl>  <dbl>   <dbl>     <dbl>
## 1    100    100    100    100    100    -7485. 4864875.     18      10    1.37e9
## 2    100    100    100    100    100    -7391. 4864836.     18      10    1.37e9
## 3    100    100    100    100    100    -7517. 4864889.     18      10    1.37e9
## 4    100    100    100    100    100    -7537. 4864896.     18      10    1.37e9
## 5    100    100    100    100    100    -7536. 4864898.     18      10    1.37e9

Preprocessing

Remove Zero Variance Variables

Zero-variance and near-zero-variance variables are predictors that are constant or almost constant across samples. Such predictors are not only non-informative but can also break some models, so we remove them before fitting. Note that the code in this section drops only strictly zero-variance columns.
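
For the near-zero-variance case, caret provides nearZeroVar(). The following is only a sketch on toy data: the column names and values are hypothetical, and it relies on the function's default thresholds (freqCut = 95/5, uniqueCut = 10) rather than anything tuned for this dataset.

```r
library(caret)

# Toy predictors (hypothetical): 100 fingerprints, 3 WAPs
toy <- data.frame(
  WAP_A = rep(100, 100),           # never detected: zero variance
  WAP_B = c(rep(100, 99), -90),    # detected once: near-zero variance
  WAP_C = rep(seq(-90, -41), 2)    # detected everywhere: informative
)

# saveMetrics = TRUE returns one row per column with zeroVar and nzv flags
metrics <- nearZeroVar(toy, saveMetrics = TRUE)
print(metrics)

# Names of the columns flagged as zero or near-zero variance
nzv_cols <- rownames(metrics)[metrics$nzv]

# Keep only the columns that were not flagged
toy_filtered <- toy[, !metrics$nzv, drop = FALSE]
```

Whether to drop near-zero-variance columns is a judgment call here: a WAP that is heard in only a handful of fingerprints may still identify a specific room.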

Remove irrelevant variables

LONGITUDE and LATITUDE are outdoor-style (GPS) coordinates, whereas indoor locationing here focuses on identifying which space or room a person is in. We therefore removed both variables.
USERID, PHONEID, and TIMESTAMP were also removed from the dataset, as they are irrelevant to our goal.

Create a unique identifier: LOCATION

We combined the FLOOR, BUILDINGID, SPACEID, and RELATIVEPOSITION attributes into a single unique identifier, LOCATION, as all of these variables indicate location.
Because this is a classification task, we need to convert the LOCATION dependent variable to a factor before moving on to model building.

Subset the dataset by building and refactorize the subsets

On examining the dataset, I found that it covers three different buildings, each with a large amount of data. In order to find the patterns for each building, I decided to subset the data by BUILDINGID.
We did not remove the location-related attributes when we built the LOCATION variable via the unite() function, because we need BUILDINGID to subset the data in this step. After subsetting, we can safely remove FLOOR, BUILDINGID, SPACEID, and RELATIVEPOSITION from each subset.
We need to factorize LOCATION again after subsetting in order to drop the factor levels that no longer occur in each subset.
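
The need to refactorize can be seen in a small sketch. The data frame below is hypothetical (five made-up LOCATION values in the FLOOR-BUILDINGID-SPACEID-RELATIVEPOSITION format); it only illustrates that subsetting keeps unused factor levels until factor() is called again.

```r
# Toy data (hypothetical): LOCATION is FLOOR-BUILDINGID-SPACEID-RELATIVEPOSITION
toy <- data.frame(
  BUILDINGID = c(0, 0, 1, 1, 2),
  LOCATION = factor(c("0-0-102-2", "1-0-110-1", "0-1-105-2",
                      "2-1-201-1", "3-2-120-2"))
)

# Subsetting keeps every original level, even those with no rows left
bud0 <- subset(toy, BUILDINGID == 0)
levels_before <- nlevels(bud0$LOCATION)

# Re-creating the factor keeps only the levels that still occur
bud0$LOCATION <- factor(bud0$LOCATION)
levels_after <- nlevels(bud0$LOCATION)

print(c(before = levels_before, after = levels_after))
```

Base R's droplevels() achieves the same result. Leaving the empty levels in place would make a classifier treat locations from other buildings as possible classes.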

# Remove zero-variance columns from the dataset
rzv_training <- processed_trainingData[, sapply(processed_trainingData, var) != 0]  
str(rzv_training)

# Check if all zero variance columns have been removed
which(sapply(rzv_training, var) == 0)
## named integer(0)

# Remove LONGITUDE, LATITUDE, USERID, PHONEID, TIMESTAMP (columns 470 to 474)
rzv_training <- rzv_training[, -c(470:474)]
# Check if they are all removed
names(rzv_training)[465:469]
## [1] "WAP515" "WAP516" "WAP517" "WAP518" "WAP519"

# Use tidyr to create a new attribute
library(tidyr)

# Create a single unique identifier (a new column) that combines the 4 location attributes
newDF <- unite(rzv_training, "LOCATION", c(FLOOR, BUILDINGID, SPACEID, RELATIVEPOSITION), remove = FALSE, sep ="-")

# Convert location attribute to factor
newDF$LOCATION <- as.factor(newDF$LOCATION)

# Make sure the data type has been converted
str(newDF$LOCATION)
##  Factor w/ 905 levels "0-0-102-2","0-0-106-2",..: 485 485 479 477 16 483 479 475 492 478 ...

# Subset the data by building (0 to 2)
trainingBUD0 <- subset(newDF, BUILDINGID== 0)
trainingBUD1 <- subset(newDF, BUILDINGID== 1)
trainingBUD2 <- subset(newDF, BUILDINGID== 2)

# Remove FLOOR, BUILDINGID, SPACEID, RELATIVEPOSITION from these subsets
trainingBUD0[,2:5] <- NULL
trainingBUD1[,2:5] <- NULL
trainingBUD2[,2:5] <- NULL

# Factorize again after subsetting in order to drop factor levels
trainingBUD0$LOCATION <- factor(trainingBUD0$LOCATION)
trainingBUD1$LOCATION <- factor(trainingBUD1$LOCATION)
trainingBUD2$LOCATION <- factor(trainingBUD2$LOCATION)

# Check how many levels of LOCATION for building0
str(trainingBUD0$LOCATION)
##  Factor w/ 259 levels "0-0-102-2","0-0-106-2",..: 16 1 4 5 3 2 9 8 7 6 ...

Model Building and Testing

This is a classification task, since our prediction target is the exact space/room a person is in rather than a continuous value.
We used 10-fold cross-validation to estimate out-of-sample performance and guard against overfitting.
We selected three different algorithms available through the caret package: C5.0, Random Forest, and KNN.
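
As a rough illustration of what 10-fold cross-validation does, the sketch below builds the folds by hand in base R. caret creates its folds internally when trainControl(method = "cv", number = 10) is used; this hand-rolled version is an assumption-laden stand-in that makes no attempt to balance classes across folds, and n = 50 is an arbitrary toy size.

```r
set.seed(520)

n <- 50   # toy number of observations (hypothetical)
k <- 10   # number of folds

# Assign each observation to one of k folds, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

# For fold i, the held-out rows are those assigned to fold i;
# the model would be fitted on all remaining rows
test_sets <- lapply(seq_len(k), function(i) which(fold_id == i))
train_sets <- lapply(test_sets, function(idx) setdiff(seq_len(n), idx))

print(lengths(test_sets))
print(lengths(train_sets))
```

Each observation is held out exactly once, so the accuracy averaged over the 10 folds is always measured on rows the model did not see while fitting.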

C5.0 and Random Forest

C5.0 and Random Forest both belong to the tree model family. A decision tree works by splitting the sample on the field that best separates the classes (C5.0 uses maximum information gain as its criterion). Each subsample defined by the first split is then split again, usually based on a different field, and the process repeats until the subsamples cannot be split any further. Finally, in C5.0 the lowest-level splits are reexamined, and those that do not contribute significantly to the value of the model are removed or pruned.
C5.0 copes well with a large number of variables and usually takes less time to train.
Random Forest normally requires a longer training time because it consists of a large number of relatively uncorrelated models (trees) that protect each other from their individual errors. Combining many trees also reduces the overfitting that a single decision tree is prone to.
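
The information-gain criterion mentioned above can be worked through on a toy example. This is a sketch of the textbook definition (parent entropy minus the weighted entropy of the child groups), not the internals of the C5.0 package; the helper names entropy() and info_gain() and the data are hypothetical.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by a candidate `feature`
info_gain <- function(labels, feature) {
  groups <- split(labels, feature)
  weights <- lengths(groups) / length(labels)
  child_entropy <- sum(weights * sapply(groups, entropy))
  entropy(labels) - child_entropy
}

# Toy data (hypothetical): 8 fingerprints from two rooms
room <- c("A", "A", "A", "A", "B", "B", "B", "B")
wap_x_detected <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
wap_y_detected <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

gain_x <- info_gain(room, wap_x_detected)  # separates the rooms perfectly
gain_y <- info_gain(room, wap_y_detected)  # tells us nothing about the room

print(c(gain_x = gain_x, gain_y = gain_y))
```

A tree grown on this toy data would split on wap_x_detected first, since it has the larger gain.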

KNN

The K-Nearest Neighbors (KNN) classifier requires choosing the number of nearest neighbors, k, and measures similarity by calculating the distance between two points. It is simple and easy to implement. However, it is a lazy learner: it must keep the full training data and is slow at prediction time, its results depend on the value of k, and because it relies on distances it suffers from the curse of dimensionality.
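
To show the idea behind KNN, here is a minimal base R sketch using Euclidean distance and a majority vote. It is not the implementation caret uses for method = "knn"; the function knn_predict(), the two-WAP fingerprints, and the room names are all hypothetical.

```r
# Predict the label of one query point from its k nearest training points
knn_predict <- function(train_x, train_y, query, k = 3) {
  diffs <- sweep(train_x, 2, query)        # subtract the query from every row
  distances <- sqrt(rowSums(diffs^2))      # Euclidean distance to each row
  nearest <- order(distances)[seq_len(k)]  # indices of the k closest rows
  votes <- table(train_y[nearest])         # count labels among the neighbors
  names(votes)[which.max(votes)]           # majority label (first one on ties)
}

# Toy training fingerprints (hypothetical): signal from two WAPs in two rooms
train_x <- matrix(
  c(-40, -80,
    -42, -78,
    -45, -82,
    -80, -40,
    -78, -43,
    -82, -45),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("WAP_1", "WAP_2"))
)
train_y <- c("room_A", "room_A", "room_A", "room_B", "room_B", "room_B")

pred_1 <- knn_predict(train_x, train_y, query = c(-44, -79))
pred_2 <- knn_predict(train_x, train_y, query = c(-79, -42))

print(c(pred_1, pred_2))
```

Every prediction computes a distance to every training row, which is why KNN becomes slow on large training sets with hundreds of WAP columns.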

# Set the seed for reproducibility
set.seed(520)

# 10-fold cross-validation
fitControl <- trainControl(method = "cv", number = 10)

Building_0 on C5.0 Model

# Define 75%/25% train/test split of the dataset to BUILDING 0
inTraining0 <- createDataPartition(trainingBUD0$LOCATION, p = .75, list = FALSE)
training0 <- trainingBUD0[inTraining0,]
testing0 <- trainingBUD0[-inTraining0,]

# C5.0 model 
C50_BUD0 <- train(LOCATION~., data = training0, method = "C5.0", trControl=fitControl)

# Testing 
prediction_C50BUD0 <- predict(C50_BUD0, testing0)

# Evaluate the model 
cm_C50_BUD0 <- confusionMatrix(prediction_C50BUD0, testing0$LOCATION)
postResample(prediction_C50BUD0, testing0$LOCATION)
##  Accuracy     Kappa 
## 0.7286512 0.7275444

Building_0 on Random Forest Model

# Random Forest Model
rf_BUD0 <- train(LOCATION~., data = training0, method = "rf", trControl=fitControl)
 
# Testing 
prediction_rfBUD0<- predict(rf_BUD0, testing0)

# Evaluate the model 
cm_rf_BUD0 <- confusionMatrix(prediction_rfBUD0, testing0$LOCATION)
postResample(prediction_rfBUD0, testing0$LOCATION)
##  Accuracy     Kappa 
## 0.7741421 0.7732169

Building_0 on KNN Model

# KNN Model
KNN_BUD0 <- train(LOCATION~., data = training0, method = "knn", trControl=fitControl)
        
# Testing
prediction_KNNBUD0 <- predict(KNN_BUD0, testing0)

# Evaluate the model
cm_KNN_BUD0 <- confusionMatrix(prediction_KNNBUD0, testing0$LOCATION)
postResample(prediction_KNNBUD0, testing0$LOCATION)
##  Accuracy     Kappa 
## 0.5602554 0.5584560

Resample on Building_0

resample_BUD0 <- resamples( list(C50 = C50_BUD0, RF = rf_BUD0, KNN = KNN_BUD0))
summary(resample_BUD0)
## 
## Call:
## summary.resamples(object = resample_BUD0)
## 
## Models: C50, RF, KNN 
## Number of resamples: 10 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## C50 0.6857143 0.6966973 0.7050096 0.7057123 0.7124347 0.7398990    0
## RF  0.7114914 0.7312939 0.7658117 0.7562927 0.7802988 0.7926829    0
## KNN 0.5135802 0.5300000 0.5379138 0.5411246 0.5510385 0.5760599    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## C50 0.6843920 0.6954157 0.7037252 0.7044562 0.7111684 0.7388030    0
## RF  0.7102582 0.7301419 0.7647806 0.7552386 0.7793579 0.7917936    0
## KNN 0.5115194 0.5280295 0.5359863 0.5391965 0.5491473 0.5743259    0

Model Evaluation

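There are several ways to assess the performance of a classification model, such as accuracy, kappa, and the confusion matrix. caret's postResample() and resamples() functions report accuracy and kappa on the test set and across the cross-validation folds, respectively. We will focus on kappa and accuracy in this task.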

Kappa Score

The kappa score compares an observed accuracy with an expected accuracy. Observed accuracy is simply the proportion of instances that were classified correctly, read from the diagonal of the confusion matrix. Expected accuracy is the accuracy that a random classifier would be expected to achieve given the row and column totals of the confusion matrix. In general, kappa is less misleading than accuracy alone, particularly when classes are imbalanced.
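
Written as a formula, kappa = (observed - expected) / (1 - expected). The sketch below applies it to a made-up 2 x 2 confusion matrix; the counts and class names are hypothetical and serve only to show the arithmetic.

```r
# Toy confusion matrix (hypothetical counts): rows = predictions, columns = truth
cm <- matrix(
  c(40, 10,
     5, 45),
  nrow = 2,
  dimnames = list(Prediction = c("A", "B"), Reference = c("A", "B"))
)

n <- sum(cm)

# Observed accuracy: share of cases on the diagonal
observed <- sum(diag(cm)) / n

# Expected accuracy: chance agreement implied by the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

kappa_score <- (observed - expected) / (1 - expected)

print(c(observed = observed, expected = expected, kappa = kappa_score))
```

With several hundred LOCATION classes the expected accuracy is very small, which is consistent with the kappa values in this report sitting only slightly below the corresponding accuracy values.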

Model Selection

As the results above show, the Random Forest model has the highest accuracy and kappa on Building_0. We therefore applied this model to Building_1 and Building_2 for further predictions.

Building_1 on Random Forest Model

# Define 75%/25% train/test split of the dataset to BUILDING 1
inTraining1 <- createDataPartition(trainingBUD1$LOCATION, p = .75, list = FALSE)
training1 <- trainingBUD1[inTraining1,]
testing1 <- trainingBUD1[-inTraining1,]

# Random Forest Model
rf_BUD1 <- train(LOCATION~., data = training1, method = "rf", trControl=fitControl)

# Testing 
prediction_rfBUD1<- predict(rf_BUD1, testing1)

# Evaluate the model 
cm_rf_BUD1 <- confusionMatrix(prediction_rfBUD1, testing1$LOCATION)
postResample(prediction_rfBUD1, testing1$LOCATION)
##  Accuracy     Kappa 
## 0.8595779 0.8587076

Building_2 on Random Forest Model

# Define 75%/25% train/test split of the dataset to BUILDING 2
inTraining2 <- createDataPartition(trainingBUD2$LOCATION, p = .75, list = FALSE)
training2 <- trainingBUD2[inTraining2,]
testing2 <- trainingBUD2[-inTraining2,]

# Random Forest Model
rf_BUD2 <- train(LOCATION~., data = training2, method = "rf", trControl=fitControl)

# Testing 
prediction_rfBUD2<- predict(rf_BUD2, testing2)

# Evaluate the model 
cm_rf_BUD2 <- confusionMatrix(prediction_rfBUD2, testing2$LOCATION)
postResample(prediction_rfBUD2, testing2$LOCATION)
##  Accuracy     Kappa 
## 0.8077601 0.8071570

Business Recommendation

Wifi locationing has been around for some time; however, its accuracy has been difficult to improve much because of inherent weaknesses.
Beacon technology is based on Bluetooth Low Energy proximity sensing: a beacon transmits a universally unique identifier that is picked up by a compatible app or operating system. iBeacon can be used with an application as an indoor positioning system, which helps smartphones determine their approximate location in a building or a store.
If we combine both technologies, we can use the advantages of beacons to offset the disadvantages of wifi locationing.
Bluetooth positioning can be considerably more accurate, to within about +/- 1 meter.
Both Android and iOS support Bluetooth technology. Bluetooth also works offline: it does not require internet access or a signal to enable positioning.
Taken together, the signals should be strong enough to cover more areas of the building. There might be fewer near-zero-variance attributes, and the dataset might be more meaningful and informative than the current training dataset.

# Stop Cluster 
stopCluster(cl)