Many real-world applications need to know a user's location in order to provide their services. The outdoor localization problem can be solved very accurately thanks to the GPS sensors included in mobile devices. Indoor localization, however, is still an open problem, mainly because the GPS signal is lost inside buildings. For this reason, we evaluated the application of machine learning techniques to this problem, replacing the GPS signal with WAP (wireless access point) signals.
http://archive.ics.uci.edu/ml/datasets/UJIIndoorLoc
We were provided with two datasets, training and validation, with the same set of variables: 520 WAP signal columns plus LONGITUDE, LATITUDE, FLOOR, BUILDINGID, SPACEID, RELATIVEPOSITION, USERID, PHONEID and TIMESTAMP.
A brief introduction to WAP signals
The dataset contains signal values that range from -104 to 0, plus the value +100 when a WAP was not detected.
WAP signal strength is recorded on a logarithmic (dBm) scale: around -90 is considered an unusable signal, while around -30 is considered an excellent one.
It is strange to have signals outside this range.
Also, encoding “no signal” as +100 distorts the logarithmic nature of the scale.
To solve both of these issues, I will first clamp all values outside the range [-90, -30] to -90 and -30 respectively, and then replace the “no signal” value with -100 in order to preserve the logarithmic ordering of the WAP scale. In other words, “no signal” is worse than “unusable signal”, so its numeric value should lie below -90.
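As a quick illustration of the intended mapping (the real implementation appears later in the pre-processing), here is a minimal sketch of how a few example readings would be transformed:
# Illustration only: clamp to [-90, -30] and recode “no signal” (+100) as -100
clip_signal <- function(x) ifelse(x == 100, -100, pmin(pmax(x, -90), -30))
clip_signal(c(-104, -95, -60, -25, 0, 100))
# result: -90 -90 -60 -30 -30 -100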
Load libraries
if ("pacman" %in% rownames(installed.packages()) == FALSE) {
install.packages("pacman")
} else {
pacman::p_load(readr, dplyr, magrittr, ggplot2, plotly, reshape2, tidyr, caret,
stringr, mime, ranger, gridExtra, tidyverse, svDialogs, rstudioapi)
}
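The datasets themselves come from the two CSV files in the UCI download; a minimal loading sketch, assuming the files keep their original names and sit in the working directory:
# Assumed file names from the UJIIndoorLoc archive; adjust the paths if needed
trainingData <- as.data.frame(read_csv("trainingData.csv"))
validationData <- as.data.frame(read_csv("validationData.csv"))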
Check for missing values
sum(is.na(trainingData))
## [1] 0
sum(is.na(validationData))
## [1] 0
Feature selection
The predictive model can’t depend on USERID or PHONEID, since those variables are not related to location. I’m keeping PHONEID for now to do an exploratory analysis on the phones, with the intention of removing it later.
Although TIMESTAMP could be related to location through a schedule, I’m opting to remove it because that schedule might vary for many reasons that we aren’t keeping track of.
SPACEID and RELATIVEPOSITION are mostly noise; all spaces inside the buildings will be treated the same.
Convert BUILDINGID and FLOOR into categorical variables.
Remove duplicate observations, of which there were 637 in trainingData. This could be explained by users not using the app properly, or by some issue with the capture app, since different apps were used to capture the training and validation data.
And finally, convert the “no signal” value of the WAPs from +100 to NA (not available). I mentioned earlier that I would convert that value into -100, but for now I will keep it as NA to simplify the next pre-processing steps, and convert it to -100 when needed.
preproc <- function(df) {
  # Drop variables the model should not depend on
  df$SPACEID <- NULL
  df$RELATIVEPOSITION <- NULL
  df$USERID <- NULL
  df$TIMESTAMP <- NULL
  # Convert the location identifiers into factors
  cols <- c("FLOOR", "BUILDINGID")
  df[, cols] <- lapply(df[, cols], as.factor)
  # Remove duplicated observations
  if (sum(duplicated(df)) != 0) {
    df <- df[!duplicated(df), ]
  }
  # Recode the "no signal" value (+100) of every WAP as NA
  WAP <- grep("WAP", names(df), value = TRUE)
  to_na <- function(x) ifelse(x == 100, NA, x)
  df[, WAP] <- df[, WAP] %>% mutate_all(to_na)
  return(df)
}
trainingData <- preproc(trainingData)
validationData <- preproc(validationData)
This plot compares the observations in the training and validation datasets.
We can see that the training data is more structured, clustered around the same spots, but it doesn’t cover all the areas.
The validation data is messier and lacks structure, and, more importantly, it doesn’t always overlap with the training data.
There are areas of the buildings that are present in validation but not in training; this will negatively affect our predictions, and the errors made by the predictive models will probably come from those areas.
trainingData$Set <- "Train"
validationData$Set <- "Validation"
combinedData <- bind_rows(trainingData, validationData)
combinedData$Set <- as.factor(combinedData$Set)
p <- plot_ly(combinedData, x = ~LONGITUDE, y = ~LATITUDE, z = ~FLOOR,
color = ~Set, colors = c("turquoise3", "tomato1"),
size = 1.2) %>% add_markers() %>%
layout(scene = list(xaxis = list(title = "Longitude"),
yaxis = list(title = "Latitude"),
zaxis = list(title = "Floor")))
p
Next I will look for observations that don’t receive a signal from any WAP, and for WAPs that don’t send a signal to any observation, and remove both.
drop_allna_rows <- function(df) {
  # Drop observations that did not detect any WAP (all signals NA)
  WAP <- grep("WAP", names(df), value = TRUE)
  ind <- which(apply(df[, WAP], 1, function(x) all(is.na(x))))
  if (length(ind) > 0) {df <- df[-ind, ]}
  return(df)
}
drop_allna_cols <- function(df) {
  # Drop WAPs that were not detected by any observation (all signals NA)
  WAP <- grep("WAP", names(df), value = TRUE)
  cols_to_drop <- which(apply(df[, WAP], 2, function(x) all(is.na(x))))
  if (length(cols_to_drop) > 0) {df <- df[, -cols_to_drop]}
  return(df)
}
trainingData <- drop_allna_rows(trainingData)
validationData <- drop_allna_rows(validationData)
trainingData <- drop_allna_cols(trainingData)
validationData <- drop_allna_cols(validationData)
Since WAPs that don’t send a signal to any observation in one dataset might still do so in the other, I am going to intersect the lists of detected WAPs and keep only the WAPs common to both datasets.
WAPs_train <- grep("WAP", names(trainingData), value = T)
WAPs_val <- grep("WAP", names(validationData), value = T)
WAPs <- intersect(WAPs_train, WAPs_val)
rm(WAPs_train, WAPs_val)
And then update both datasets with the common WAPs.
cols_to_keep_train <- c(which(colnames(trainingData) %in% WAPs),
which(grepl("^WAP", colnames(trainingData)) == FALSE))
cols_to_keep_val <- c(which(colnames(validationData) %in% WAPs),
which(grepl("^WAP", colnames(validationData)) == FALSE))
trainingData <- trainingData[, cols_to_keep_train]
validationData <- validationData[, cols_to_keep_val]
combinedData <- bind_rows(trainingData, validationData)
combinedData$Set <- as.factor(combinedData$Set)
This plot shows the density distribution of the signals in both datasets. The two distributions being similar is a good sign that we can use the training data to make predictions on the validation data.
The training curve resembles a staircase while the validation curve is smoother; that can be explained by the training data being captured at fixed spots and the validation data being captured at spots chosen freely by the users.
dat1 <- melt(combinedData[, c(WAPs, "Set")])
## Using Set as id variables
p0 <- ggplot(dat1, aes(x = value, fill = Set)) +
geom_density(alpha = 0.2) +
xlab("Signal value") +
ggtitle("Densities of WAP signals for each dataset")
p0
## Warning: Removed 5992776 rows containing non-finite values (stat_density).
Phone exploration
The predictive model can’t depend on PHONEID because that won’t be available if the model is put into production. But since the phone is one of the two elements that influence the recorded signal (the other being the WAP itself), we can do a little exploratory analysis to learn a bit more about the signals and maybe use that knowledge to make better predictions.
Below is a table with the PHONEIDs at the top and, underneath, the number of signals with value 0 recorded by each phone.
rowsw0 <- which(trainingData[, c(WAPs)] == 0, arr.ind = TRUE)
Phone_test <- trainingData[rowsw0, ]
table(Phone_test$PHONEID)
##
## 19 23
## 102 102
When we raise the threshold to look at signals stronger than -30, more PHONEIDs appear, but #19 and #23 are still clearly ahead.
rowsw30 <- which(trainingData[, c(WAPs)] > -30, arr.ind = TRUE)
Phone_test <- trainingData[rowsw30, ]
table(Phone_test$PHONEID)
##
## 7 8 11 14 16 19 23
## 58 3 2 3 7 609 682
And when we check for signals weaker than -90, more than half of the PHONEIDs are represented, and PHONEID #23 leads the list again.
rowsw90 <- which(trainingData[, c(WAPs)] < -90, arr.ind = TRUE)
Phone_test <- trainingData[rowsw90, ]
table(Phone_test$PHONEID)
##
## 1 3 6 7 8 10 11 13 14 16 17 18
## 1573 190 1293 3953 2074 1808 15 14118 1311 1400 219 2810
## 19 22 23 24
## 4063 2193 49217 809
When we check the documentation for PHONEID #23 we see that it is not actually a phone but a tablet: a Transformer TF101 running Android 4.0.3.
I tried removing the observations from PHONEID #23, but since that left some areas of the training data uncovered, I decided to keep them to get a better representation of the space inside the buildings.
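For reference, a minimal sketch of the kind of coverage check behind that decision (illustrative only, not kept in the pipeline):
# Training coverage without the TF101 tablet
no_tf101 <- trainingData[trainingData$PHONEID != 23, ]
plot(no_tf101$LONGITUDE, no_tf101$LATITUDE, pch = 20,
     xlab = "Longitude", ylab = "Latitude",
     main = "Training coverage without PHONEID 23")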
After this we no longer need the PHONEID variable.
trainingData$PHONEID <- NULL
validationData$PHONEID <- NULL
combinedData$PHONEID <- NULL
Earlier I mentioned that I was going to clamp signals weaker than -90 or stronger than -30, and that is the next step.
If a signal is below -90 I will replace it with -90, and if it is above -30 I will replace it with -30. The NAs (“no signal”) become -100.
replace_values <- function(df) {
  WAP <- grep("WAP", names(df), value = TRUE)
  # Clamp signals to [-90, -30] and recode "no signal" (NA) as -100
  clip_and_fill <- function(x) {
    ifelse(!is.na(x), ifelse(x < -90, -90, ifelse(x > -30, -30, x)), -100)
  }
  df[, WAP] <- df[, WAP] %>% mutate_all(clip_and_fill)
  return(df)
}
trainingData <- replace_values(trainingData)
validationData <- replace_values(validationData)
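A quick sanity check (illustration only): after clamping and recoding, every WAP reading should lie within [-100, -30].
# Overall range of the WAP signals after clamping and recoding
range(as.matrix(trainingData[, WAPs]))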
By combining BUILDINGID and FLOOR we get a BUILDINGID_FLOOR variable, which will be used as the prediction target instead of FLOOR.
If we predicted FLOOR alone, the model would count a prediction as a success for getting the FLOOR right even if it got the BUILDINGID wrong.
build_floor <- function(df) {
df <- unite(df, BUILDINGID_FLOOR, c("BUILDINGID", "FLOOR"),
sep = "_", remove = FALSE)
df$BUILDINGID_FLOOR <- as.factor(df$BUILDINGID_FLOOR)
return(df)
}
trainingData <- build_floor(trainingData)
validationData <- build_floor(validationData)
The next thing I want to do is get a list of WAPs that send strong signals to many BUILDINGIDs and BUILDINGID_FLOORs; these WAPs might be problematic for the predictive model.
This process will require a few steps.
First step: find out, for each observation, which WAP sends the strongest signal.
HighestWap_train <- WAPs[apply(trainingData[, c(WAPs)], 1, which.max)]
Second step: For each observation we find out in which BUILDINGID_FLOOR the observation was recorded.
Highest_loc_train <- data.frame("WAP" = HighestWap_train,
"BFLocation" = trainingData$BUILDINGID_FLOOR)
Third step: for each WAP, find the most frequent BUILDINGID_FLOOR in which its signal was the strongest, and store it in a data frame.
multiple <- c()
for (i in unique(Highest_loc_train$WAP)) {
  # Most frequent BUILDINGID_FLOOR among the observations where WAP i was the strongest
  freq_loc <- names(which.max(table(
    Highest_loc_train[Highest_loc_train$WAP == i, ]$BFLocation)))
  multiple <- rbind(multiple, c(i, freq_loc))
}
multiple <- as.data.frame(multiple)
colnames(multiple) <- c("WAP", "Most_Frequent_location")
multiple$Most_Frequent_location <- factor(multiple$Most_Frequent_location,
levels = levels(trainingData$BUILDINGID_FLOOR))
Fourth step: take that most frequent location as a prediction of each WAP’s location.
Highest_loc_train$Prediction <-
sapply(Highest_loc_train$WAP,
function(x) multiple[which(multiple$WAP == x), ]$Most_Frequent_location)
Do the same for BUILDINGID: we only need the first character of the BUILDINGID_FLOOR labels.
Highest_loc_train$BLocation <- str_sub(Highest_loc_train$BFLocation, start = 1, end = 1)
Highest_loc_train$BPrediction <- str_sub(Highest_loc_train$Prediction, start = 1, end = 1)
Highest_loc_train$BLocation <- as.factor(Highest_loc_train$BLocation)
Highest_loc_train$BPrediction <- as.factor(Highest_loc_train$BPrediction)
And finally we compare the predicted WAP locations with their real locations; the wrongly predicted WAPs go into the bad-WAP list. We do this for both BUILDINGID and BUILDINGID_FLOOR.
Highest_loc_train$BFCorrect <- Highest_loc_train$BFLocation == Highest_loc_train$Prediction
BadWAPsBF <- unique(Highest_loc_train[which(Highest_loc_train$BFCorrect == FALSE), ]$WAP)
Highest_loc_train$BCorrect <- Highest_loc_train$BLocation == Highest_loc_train$BPrediction
BadWAPsB <- unique(Highest_loc_train[which(Highest_loc_train$BCorrect == FALSE), ]$WAP)
Now that I have a list of bad BUILDINGID and BUILDINGID_FLOOR WAPs I’m going to explore them a bit.
Boxplots for the WAPs that caused mistakes when predicting the BUILDINGID.
Here we can clearly see why the previous process made mistakes: these WAPs are very strong and send many signals to more than one BUILDINGID.
for (i in BadWAPsB) {
x <- trainingData[[i]]
x[x == -100] <- NA
plot(trainingData$BUILDINGID_FLOOR, x, ylim = c(-100, -25), main = i,
xlab = "BUILDINGID_FLOOR", ylab = "Signal value")
}
Density plots for the WAPs that caused mistakes when predicting the BUILDINGID.
Comparing the density shapes of these WAPs with the density shape of the training data shows clear differences:
-They have more than one peak.
-They don’t decrease in the same way (fast at the beginning, slow at the end).
for (i in BadWAPsB) {
x <- trainingData[[i]]
x[x == -100] <- NA
plot(density(x, na.rm = TRUE), main = i, xlab = "Signal value")
}
Given this, I’m going to remove those WAPs from both datasets.
trainingData <- trainingData[, !(colnames(trainingData) %in% BadWAPsB), drop = FALSE]
validationData <- validationData[, !(colnames(validationData) %in% BadWAPsB), drop = FALSE]
Since I removed some WAPs I will also have to update the WAP list.
WAPs_train <- grep("WAP", names(trainingData), value = T)
WAPs_val <- grep("WAP", names(validationData), value = T)
WAPs <- intersect(WAPs_train, WAPs_val)
Another thing I want to do is add three new variables containing the names of the WAPs with the three strongest signals in each observation.
The variable TopWap1 will hold the name of the WAP with the strongest signal in each observation, TopWap2 the second strongest and TopWap3 the third strongest.
topwaps <- function(df) {
df$Top3Waps <- apply(df[, c(WAPs)], 1, function(x)
paste(names(head(sort(x, decreasing = TRUE), 3)), collapse = " "))
df <- separate(df, Top3Waps, c("TopWap1", "TopWap2", "TopWap3"))
df$TopWap1 <- factor(df$TopWap1, levels = WAPs)
df$TopWap2 <- factor(df$TopWap2, levels = WAPs)
df$TopWap3 <- factor(df$TopWap3, levels = WAPs)
return(df)
}
trainingData <- topwaps(trainingData)
validationData <- topwaps(validationData)
To avoid the model depending on differences between high-end phones and budget options (and on the natural progression of technology), I decided to normalize every row so that the signals received in each observation are on the same scale.
Since we removed some WAPs, we first check again for rows with no signal at all and remove them.
drop_null_value_rows <- function(df) {
  # Temporarily turn -100 back into NA, drop rows with no signal at all, then restore -100
  df[df == -100] <- NA
  df <- drop_allna_rows(df)
  df[is.na(df)] <- -100
  return(df)
}
trainingData <- drop_null_value_rows(trainingData)
validationData <- drop_null_value_rows(validationData)
And then rescale the signals of every row to the range [0, 1].
This way the strongest signal in each observation is always 1, regardless of whether it came from a good phone that received a strong signal or from a worse phone that received a weaker one.
normalize <- function(df) {
  # Row-wise min-max scaling of the WAP signals; the remaining columns are kept as they are
  df <- data.frame(t(apply(df[WAPs], 1,
                           function(x) (x - min(x)) / (max(x) - min(x)))),
                   df[, !colnames(df) %in% WAPs])
  return(df)
}
trainingData <- normalize(trainingData)
validationData <- normalize(validationData)
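A quick sanity check (illustration only): after row-wise min-max scaling, the signals of any given observation should span 0 to 1.
# Scaled signals of the first observation; expected to range from 0 to 1
range(unlist(trainingData[1, WAPs]))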
Since we still have too many variables in our datasets, I will run a PCA on the remaining WAPs to reduce the dimensionality of the datasets and improve the performance of the predictive models.
Transform trainingData WAPs into PCs
trainingData.pca <- prcomp(trainingData[, c(WAPs)], center = TRUE, scale = TRUE)
This plot shows the eigenvalue (variance) of each PC. I will keep every PC with an eigenvalue of 1 or higher and drop the rest, since an eigenvalue lower than 1 means the component explains less than a single original variable would. By the look of it, that means keeping the first 77 PCs.
screeplot(trainingData.pca, type = "l", npcs = 100, pch = c(20), col = c("turquoise3"),
main = "Screeplot of the first 100 PCs")
abline(h = 1, col = "tomato1", lty = 5)
legend("topright", legend = c("Eigenvalue = 1"),
col = c("tomato1"), lty = 5, cex = 0.6)
This plot shows the cumulative variance captured. Our selected 77 PCs explain about 80% of the variance, which is great: we reduced a list of 304 WAPs to 77 PCs while retaining a large part of the information those variables provide.
cumpro <- cumsum(trainingData.pca$sdev^2 / sum(trainingData.pca$sdev^2))
plot(cumpro[0:100], pch = c(20), col = c("turquoise3"),
xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 77, col = "tomato1", lty = 5)
abline(h = 0.80, col = "tomato1", lty = 5)
legend("bottom", legend = c("Cut-off @ PC77"),
col = c("tomato1"), lty = 5, cex = 0.6)
Select the top 77 PCs for trainingData.pca
training.pca <- as.data.frame(trainingData.pca$x[, 1:77])
Add the non WAP variables to the selected PCs
training.pca <-
cbind(training.pca, trainingData[, !(colnames(trainingData) %in% WAPs), drop = FALSE])
Predict PCs for validationData
validation.pca <-
as.data.frame(predict(trainingData.pca, newdata = validationData[, c(WAPs)]))
Select the top 77 PCs for validation.pca
validation.pca <- validation.pca[, 1:77]
Add the non WAP variables to the selected PCs
validation.pca <-
cbind(validation.pca, validationData[, !(colnames(validationData) %in% WAPs), drop = FALSE])
For the purposes of this problem I only need to predict LATITUDE, LONGITUDE and FLOOR, but in order to improve the LATITUDE and LONGITUDE models I will add the BUILDINGID prediction as a variable in them.
This model uses the 304 WAPs that remain of the 520 provided, after the reductions described in the previous steps.
WAPs <- grep("WAP", names(trainingData), value = T)
set.seed(123);Building_Model_allwaps <-
ranger(BUILDINGID ~., data = trainingData[, c(WAPs, "BUILDINGID")])
Building_Model_allwaps_cm_train <-
confusionMatrix(trainingData$BUILDINGID, Building_Model_allwaps$predictions)
Building_Model_allwaps_val_pred <-
predict(Building_Model_allwaps, validationData[, c(WAPs)])
Building_Model_allwaps_cm_val <-
confusionMatrix(validationData$BUILDINGID, Building_Model_allwaps_val_pred[[1]])
This model uses the three variables created earlier, holding the WAP names of the 1st, 2nd and 3rd strongest signals in each observation.
predictors_train <- grep("TopW", colnames(trainingData))
predictors_train <- c(predictors_train, which(colnames(trainingData) == "BUILDINGID"))
predictors_validation <- grep("TopW", colnames(validationData))
predictors_validation <- c(predictors_validation, which(colnames(validationData) == "BUILDINGID"))
## Train set:
set.seed(123);Building_Model_Top3waps <-
ranger(BUILDINGID ~., data = trainingData[, c(predictors_train)])
Building_Model_Top3waps_cm_train <-
confusionMatrix(trainingData$BUILDINGID, Building_Model_Top3waps$predictions)
Building_Model_Top3waps_val_pred <-
predict(Building_Model_Top3waps, validationData[, c(predictors_validation)])
Building_Model_Top3waps_cm_val <-
confusionMatrix(validationData$BUILDINGID, Building_Model_Top3waps_val_pred$predictions)
Building_Model_Top3waps_cm_val
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 535 0 1
## 1 2 304 1
## 2 0 1 267
##
## Overall Statistics
##
## Accuracy : 0.9955
## 95% CI : (0.9895, 0.9985)
## No Information Rate : 0.4833
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9929
##
## Mcnemar's Test P-Value : 0.3916
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9963 0.9967 0.9926
## Specificity 0.9983 0.9963 0.9988
## Pos Pred Value 0.9981 0.9902 0.9963
## Neg Pred Value 0.9965 0.9988 0.9976
## Prevalence 0.4833 0.2745 0.2421
## Detection Rate 0.4815 0.2736 0.2403
## Detection Prevalence 0.4824 0.2763 0.2412
## Balanced Accuracy 0.9973 0.9965 0.9957
In this table we can see the Accuracy, the Kappa, the model used and the dataset the metrics come from.
Both metrics are very good for every model/set combination.
metrics_b <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(metrics_b) <- c("Accuracy", "Kappa", "Model", "Set")
metrics_b[nrow(metrics_b) + 1, ] <- c(Building_Model_allwaps_cm_train$overall[1:2],
"AllWaps", "Training")
metrics_b[nrow(metrics_b) + 1, ] <- c(Building_Model_allwaps_cm_val$overall[1:2],
"AllWaps", "Validation")
metrics_b[nrow(metrics_b) + 1, ] <- c(Building_Model_Top3waps_cm_train$overall[1:2],
"Top3Waps", "Training")
metrics_b[nrow(metrics_b) + 1, ] <- c(Building_Model_Top3waps_cm_val$overall[1:2],
"Top3Waps", "Validation")
metrics_b[, c(1, 2)] <- mapply(as.numeric,metrics_b[, c(1, 2)])
metrics_b
## Accuracy Kappa Model Set
## 1 0.9998956 0.9998361 AllWaps Training
## 2 0.9972997 0.9957323 AllWaps Validation
## 3 0.9991644 0.9986891 Top3Waps Training
## 4 0.9954995 0.9928852 Top3Waps Validation
These bar plots show the same information as the previous table.
Since the metrics of both models are very similar, and almost equally good on training and validation, I’m going to use the Top3Waps model: it only has three variables instead of 304 and is the simpler of the two.
ggplot(metrics_b, aes(x = Set, y = Accuracy, fill = Set)) +
geom_col() +
facet_wrap( ~Model) +
ggtitle("Accuracy of models for predicting BUILDINGID")
ggplot(metrics_b, aes(x = Set, y = Kappa, fill = Set)) +
geom_col() +
facet_wrap( ~Model) +
ggtitle("Kappa of models for predicting BUILDINGID")
Here, the correctly predicted spots are shown in green and the spots where the model made a mistake in red. If we compare this plot with the train/validation plot shown before, we can see that the mistakes happened in spots where there are validation observations but no training ones.
correctness <- function(x, y){
return(ifelse(x == y, "Correct", "Wrong"))
}
validationData$Error_BPred <- mapply(correctness, validationData$BUILDINGID,
Building_Model_Top3waps_val_pred$predictions)
validationData$Error_BPred <- as.factor(validationData$Error_BPred)
p3 <- plot_ly(validationData, x = ~LONGITUDE, y = ~LATITUDE, z = ~FLOOR,
color = ~Error_BPred, colors = c("green", "red"),
size = 1.2) %>% add_markers() %>%
layout(scene = list(xaxis = list(title = "Longitude"),
yaxis = list(title = "Latitude"),
zaxis = list(title = "Floor")),
title = "Location of the BUILDINGID errors")
p3
set.seed(123);BF_Model_allwaps <-
ranger(BUILDINGID_FLOOR ~.,trainingData[, c(WAPs, "BUILDINGID_FLOOR")])
BF_Model_allwaps_cm_train <-
confusionMatrix(trainingData$BUILDINGID_FLOOR, BF_Model_allwaps$predictions)
BF_Model_allwaps_pred_val <-
predict(BF_Model_allwaps, validationData[, c(WAPs)])
BF_Model_allwaps_cm_val <-
confusionMatrix(validationData$BUILDINGID_FLOOR, BF_Model_allwaps_pred_val$predictions)
set.seed(123);BF_Model_Top3waps <-
ranger(BUILDINGID_FLOOR ~.,
trainingData[, c("TopWap1", "TopWap2", "TopWap3", "BUILDINGID_FLOOR")])
BF_Model_Top3waps_cm_train <-
confusionMatrix(trainingData$BUILDINGID_FLOOR, BF_Model_Top3waps$predictions)
BF_Model_Top3waps_pred_val <-
predict(BF_Model_Top3waps, validationData[, c("TopWap1", "TopWap2", "TopWap3")])
BF_Model_Top3waps_cm_val <-
confusionMatrix(validationData$BUILDINGID_FLOOR, BF_Model_Top3waps_pred_val$predictions)
BF_Model_Top3waps_cm_val
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0_0 0_1 0_2 0_3 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4
## 0_0 72 6 0 0 0 0 0 0 0 0 0 0 0
## 0_1 4 199 5 0 0 0 0 0 0 0 0 0 0
## 0_2 0 23 137 3 0 0 0 2 0 0 0 0 0
## 0_3 0 0 6 79 0 0 0 0 0 0 0 0 0
## 1_0 0 0 0 0 23 4 3 0 0 0 0 0 0
## 1_1 0 0 0 0 13 122 7 1 0 0 0 0 0
## 1_2 0 0 0 0 1 3 80 3 0 0 0 0 0
## 1_3 0 0 0 0 0 0 3 44 0 0 0 0 0
## 2_0 0 0 0 0 0 0 0 0 23 1 0 0 0
## 2_1 0 0 0 0 0 0 0 0 6 102 1 2 0
## 2_2 1 0 0 0 0 1 0 0 0 1 49 2 0
## 2_3 0 0 0 0 0 0 0 0 0 1 0 38 1
## 2_4 0 0 0 0 1 0 0 0 0 0 0 2 36
##
## Overall Statistics
##
## Accuracy : 0.9037
## 95% CI : (0.8848, 0.9204)
## No Information Rate : 0.2052
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.892
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0_0 Class: 0_1 Class: 0_2 Class: 0_3
## Sensitivity 0.93506 0.8728 0.9257 0.96341
## Specificity 0.99420 0.9898 0.9709 0.99417
## Pos Pred Value 0.92308 0.9567 0.8303 0.92941
## Neg Pred Value 0.99516 0.9679 0.9884 0.99708
## Prevalence 0.06931 0.2052 0.1332 0.07381
## Detection Rate 0.06481 0.1791 0.1233 0.07111
## Detection Prevalence 0.07021 0.1872 0.1485 0.07651
## Balanced Accuracy 0.96463 0.9313 0.9483 0.97879
## Class: 1_0 Class: 1_1 Class: 1_2 Class: 1_3
## Sensitivity 0.6053 0.9385 0.86022 0.8800
## Specificity 0.9935 0.9786 0.99312 0.9972
## Pos Pred Value 0.7667 0.8531 0.91954 0.9362
## Neg Pred Value 0.9861 0.9917 0.98730 0.9944
## Prevalence 0.0342 0.1170 0.08371 0.0450
## Detection Rate 0.0207 0.1098 0.07201 0.0396
## Detection Prevalence 0.0270 0.1287 0.07831 0.0423
## Balanced Accuracy 0.7994 0.9585 0.92667 0.9386
## Class: 2_0 Class: 2_1 Class: 2_2 Class: 2_3
## Sensitivity 0.7931 0.97143 0.9800 0.8636
## Specificity 0.9991 0.99105 0.9953 0.9981
## Pos Pred Value 0.9583 0.91892 0.9074 0.9500
## Neg Pred Value 0.9945 0.99700 0.9991 0.9944
## Prevalence 0.0261 0.09451 0.0450 0.0396
## Detection Rate 0.0207 0.09181 0.0441 0.0342
## Detection Prevalence 0.0216 0.09991 0.0486 0.0360
## Balanced Accuracy 0.8961 0.98124 0.9876 0.9309
## Class: 2_4
## Sensitivity 0.9730
## Specificity 0.9972
## Pos Pred Value 0.9231
## Neg Pred Value 0.9991
## Prevalence 0.0333
## Detection Rate 0.0324
## Detection Prevalence 0.0351
## Balanced Accuracy 0.9851
Here we can see that the model performs a bit better on training than on validation, for the same reason mentioned for the BUILDINGID predictions. The performance drop is bigger here because BUILDINGID had only 3 possible classes while BUILDINGID_FLOOR has 13 (3 buildings with 4-5 floors each).
metrics_bf <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(metrics_bf) <- c("Accuracy", "Kappa", "Model", "Set")
metrics_bf[nrow(metrics_bf) + 1, ] <- c(BF_Model_allwaps_cm_train$overall[1:2],
"AllWaps", "Training")
metrics_bf[nrow(metrics_bf) + 1, ] <- c(BF_Model_allwaps_cm_val$overall[1:2],
"AllWaps", "Validation")
metrics_bf[nrow(metrics_bf) + 1, ] <- c(BF_Model_Top3waps_cm_train$overall[1:2],
"Top3Waps", "Training")
metrics_bf[nrow(metrics_bf) + 1, ] <- c(BF_Model_Top3waps_cm_val$overall[1:2],
"Top3Waps", "Validation")
metrics_bf[, c(1, 2)] <- mapply(as.numeric, metrics_bf[, c(1, 2)])
metrics_bf
## Accuracy Kappa Model Set
## 1 0.9982245 0.9980570 AllWaps Training
## 2 0.9108911 0.9001983 AllWaps Validation
## 3 0.9535746 0.9492043 Top3Waps Training
## 4 0.9036904 0.8920356 Top3Waps Validation
Here I’m going to choose the Top3Waps model again, for the following reasons:
-The validation accuracies of the two models are almost the same.
-There’s less of a performance drop between training and validation with Top3Waps than with AllWaps.
-The Top3Waps model is also simpler, with only 3 variables instead of the 304 of the AllWaps model.
ggplot(metrics_bf, aes(x = Set, y = Accuracy, fill = Set)) +
geom_col() +
facet_wrap( ~Model) +
ggtitle("Accuracy of models for predicting BUILDINGID_FLOOR")
ggplot(metrics_bf, aes(x = Set, y = Kappa, fill = Set)) +
geom_col() +
facet_wrap( ~Model) +
ggtitle("Kappa of models for predicting BUILDINGID_FLOOR")
Here we can observe and conclude the same as in the previous BUILDINGID plot: most of the errors happened at locations where there is no training data available.
With the data available, this issue could only be solved by merging both datasets, but then we would lose an honest validation set: the error metrics would look great for any model and we would no longer be able to use them to choose one.
validationData$Error_BFPred <- mapply(correctness, validationData$BUILDINGID_FLOOR,
BF_Model_Top3waps_pred_val$predictions)
validationData$Error_BFPred <- as.factor(validationData$Error_BFPred)
p4 <- plot_ly(validationData, x = ~LONGITUDE, y = ~LATITUDE, z = ~FLOOR,
color = ~Error_BFPred, colors = c("green", "red"),
size = 1.2) %>% add_markers() %>%
layout(scene = list(xaxis = list(title = "Longitude"),
yaxis = list(title = "Latitude"),
zaxis = list(title = "Floor")),
title = "Location of the BUILDINGID_FLOOR errors")
p4
As I mentioned earlier, the reason for predicting BUILDINGID was to include it as a variable in the LATITUDE and LONGITUDE models, and that is what the next step is for.
validationData$BUILDINGID <- Building_Model_Top3waps_val_pred$predictions
validation.pca$BUILDINGID <- Building_Model_Top3waps_val_pred$predictions
set.seed(123);Latitude_Model <-
ranger(LATITUDE ~., data = trainingData[, c(WAPs, "LATITUDE", "BUILDINGID")])
Latitude_Model_pred_train_metrics <-
postResample(Latitude_Model$predictions, trainingData$LATITUDE)
Latitude_Model_pred_val <-
predict(Latitude_Model, validationData[, c(WAPs, "BUILDINGID")])
Latitude_Model_pred_val_metrics <-
postResample(Latitude_Model_pred_val$predictions, validationData$LATITUDE)
The following model uses the first 77 PCs, which explain about 80% of the WAP variance.
PCs <- grep("PC", names(training.pca), value = T)
set.seed(123);Latitude_Model_PC <-
ranger(LATITUDE ~., data = training.pca[, c(PCs, "LATITUDE", "BUILDINGID")])
Latitude_Model_PC_pred_train_metrics <-
postResample(Latitude_Model_PC$predictions, training.pca$LATITUDE)
Latitude_Model_PC_pred_val <-
predict(Latitude_Model_PC, validation.pca[, c(PCs, "BUILDINGID")])
Latitude_Model_PC_pred_val_metrics <-
postResample(Latitude_Model_PC_pred_val$predictions, validation.pca$LATITUDE)
Here we see that the validation metrics are quite a bit worse than the training metrics. That is called overfitting: the model is better at predicting the data it was trained on than at predicting unseen data.
That can be explained by the difference in how the training and validation data were captured.
The training data was collected at fixed spots, like marks on the ground: the users went to those marks and used the app to capture the Wi-Fi signal values.
The model is trained on those fixed spots and is quite good at predicting them, but the validation data was captured at spots freely chosen by each user, so it is harder for the model to make predictions there.
Although the metrics of the PC model are slightly worse than those of the model using the WAPs directly, I will go with the PCA model since it has over 200 fewer variables and is simpler.
metrics_lat <- data.frame(matrix(ncol = 5, nrow = 0))
colnames(metrics_lat) <- c("RMSE", "Rsquared", "MAE", "Model", "Set")
metrics_lat[nrow(metrics_lat) + 1, ] <- c(Latitude_Model_pred_train_metrics[1:3],
"AllWaps", "Training")
metrics_lat[nrow(metrics_lat) + 1, ] <- c(Latitude_Model_pred_val_metrics[1:3],
"AllWaps", "Validation")
metrics_lat[nrow(metrics_lat) + 1, ] <- c(Latitude_Model_PC_pred_train_metrics[1:3],
"PCs", "Training")
metrics_lat[nrow(metrics_lat) + 1, ] <- c(Latitude_Model_PC_pred_val_metrics[1:3],
"PCs", "Validation")
metrics_lat
## RMSE Rsquared MAE Model Set
## 1 4.09044042940572 0.996411629631604 2.78972558965396 AllWaps Training
## 2 9.41968862540729 0.982759458478635 6.17484362118037 AllWaps Validation
## 3 3.80194825138135 0.996900833087085 2.27403806185639 PCs Training
## 4 9.55126051109952 0.982171820750341 6.39228134005122 PCs Validation
Here we can see density plots of how big the errors were and how often they occurred.
Most errors are close to 0, which is good, and the frequency decreases as the errors get bigger, which is also good.
The global density plot shows that 75% of the errors were smaller than 9.
The densities by BUILDINGID show that the model was better at predicting the area inside BUILDINGID #0.
pabserror_split <- function(pred, real, building, z) {
AELAT <- abs(pred - real)
errors <- data.frame(abs = AELAT, b = building)
dens <- density(AELAT, na.rm = TRUE)
df <- data.frame(x = dens$x, y = dens$y)
probs <- c(0.75)
quantiles <- quantile(AELAT, prob = probs)
df$quant <- factor(findInterval(df$x, quantiles))
p1 <- ggplot(df, aes(x, y)) + geom_line(size = 1.5) +
geom_ribbon(aes(ymin = 0, ymax = y, fill = quant)) +
scale_fill_brewer(guide = "none") +
scale_x_continuous(breaks = seq(0, 120, 20)) +
labs(x = "Absolute Error", y = "Density", title = paste("Density of global error in", z))
p2 <- ggplot(errors, aes(abs, fill = b)) + geom_density(aes(y = ..density..)) +
facet_wrap( ~b) + xlab("Absolute error") +
ggtitle("Densities of absolute error by BUILDINGID") +
labs(x = "Absolute Error", y = "Density", fill = "BUILDINGID")
p.both <- arrangeGrob(p1, p2)
grid::grid.draw(p.both)
}
pabserror_split(Latitude_Model_PC_pred_val$predictions, validation.pca$LATITUDE,
validation.pca$BUILDINGID, "LATITUDE")
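The 75% figure can also be checked directly from the predictions (a quick sketch, outside the plotting helper):
# 75th percentile of the absolute LATITUDE error on validation
abs_err_lat <- abs(Latitude_Model_PC_pred_val$predictions - validation.pca$LATITUDE)
quantile(abs_err_lat, probs = 0.75)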
I added the LATITUDE predictions to make it easier for the model to predict LONGITUDE.
validationData$LATITUDE <- Latitude_Model_pred_val$predictions
validation.pca$LATITUDE <- Latitude_Model_PC_pred_val$predictions
set.seed(123);Longitude_Model <-
ranger(LONGITUDE ~., data = trainingData[, c(WAPs, "LONGITUDE", "LATITUDE", "BUILDINGID")])
Longitude_Model_metrics_train <-
postResample(Longitude_Model$predictions, trainingData$LONGITUDE)
Longitude_Model_pred_val <-
predict(Longitude_Model, validationData[, c(WAPs, "LATITUDE", "BUILDINGID")])
Longitude_Model_metrics_val <-
postResample(Longitude_Model_pred_val$predictions, validationData$LONGITUDE)
set.seed(123);Longitude_Model_PC <-
ranger(LONGITUDE ~., data = training.pca[, c(PCs, "LONGITUDE", "LATITUDE", "BUILDINGID")])
Longitude_Model_PC_metrics_train <-
postResample(Longitude_Model_PC$predictions, training.pca$LONGITUDE)
Longitude_Model_PC_pred_val <-
predict(Longitude_Model_PC, validation.pca[, c(PCs, "LATITUDE", "BUILDINGID")])
Longitude_Model_PC_metrics_val <-
postResample(Longitude_Model_PC_pred_val$predictions, validation.pca$LONGITUDE)
Here we see the same thing that happened with the LATITUDE metrics, and we can draw the same conclusion: the model overfits due to the differences in how the training and validation data were captured, and I will again choose the PCA model for its reduced dimensionality.
metrics_lon <- data.frame(matrix(ncol = 5, nrow = 0))
colnames(metrics_lon) <- c("RMSE", "Rsquared", "MAE", "Model", "Set")
metrics_lon[nrow(metrics_lon) + 1, ] <- c(Longitude_Model_metrics_train[1:3],
"AllWaps", "Training")
metrics_lon[nrow(metrics_lon) + 1, ] <- c(Longitude_Model_metrics_val[1:3],
"AllWaps", "Validation")
metrics_lon[nrow(metrics_lon) + 1, ] <- c(Longitude_Model_PC_metrics_train[1:3],
"PCs", "Training")
metrics_lon[nrow(metrics_lon) + 1, ] <- c(Longitude_Model_PC_metrics_val[1:3],
"PCs", "Validation")
metrics_lon
## RMSE Rsquared MAE Model Set
## 1 4.45763264460499 0.998724083093364 2.9904469671499 AllWaps Training
## 2 10.6632142357706 0.992186018674215 6.56328892070392 AllWaps Validation
## 3 3.99655645918155 0.998977755473581 2.27755594669835 PCs Training
## 4 10.3103817476841 0.992969207053521 6.62873559664271 PCs Validation
Again we see a picture similar to what we saw with LATITUDE.
The shape of the error density is good, and the model is better at predicting the area inside BUILDINGID #0.
pabserror_split(Longitude_Model_PC_pred_val$predictions, validation.pca$LONGITUDE,
validation.pca$BUILDINGID, "LONGITUDE")
The point of this analysis was to see whether it is possible to predict the location of a device based on Wi-Fi signals, and we can say that yes, it is possible and actually a good method.
But the first requirement for implementing it is good coverage of the tracked surface in the training data, which I did not have for this analysis; that lack of coverage is the main source of the predictive models’ errors.
With a focus on dimensionality reduction, I reduced the 520 WAPs to just 3 variables for BUILDINGID and BUILDINGID_FLOOR and 77 for LATITUDE and LONGITUDE, while keeping error metrics close to those obtained with more variables. For LATITUDE and LONGITUDE more variables were needed, since the weaker Wi-Fi signals are still required to increase the model’s precision; we can’t rely on the strongest signals alone.