Many real-world applications need to know a user's location in order to provide their services. The outdoor localization problem can be solved very accurately thanks to the GPS sensors included in mobile devices. Indoor localization, however, is still an open problem, mainly because the GPS signal is lost inside buildings. For this reason, we evaluated the application of machine learning techniques to this problem, replacing the GPS signal with WAP (wireless access point) signals.
http://archive.ics.uci.edu/ml/datasets/UJIIndoorLoc
We were provided with two datasets, training and validation, with the same set of variables: 520 WAP signal columns plus LONGITUDE, LATITUDE, FLOOR, BUILDINGID, SPACEID, RELATIVEPOSITION, USERID, PHONEID and TIMESTAMP.
A brief introduction to WAP signals
The dataset contains signal values that range from -104 to 0, plus the value +100 when a WAP was not detected.
WAP signal strength is recorded on a logarithmic (dBm) scale: around -90 is considered an unusable signal, while around -30 is considered an excellent one.
It is strange to have signals outside this range.
Also, encoding “no signal” as +100 distorts the logarithmic nature of the scale.
To solve both of these issues, I will first clamp all values outside the range [-90, -30] to -90 and -30 respectively, and then replace the “no signal” value with -100 in order to preserve the logarithmic ordering of the WAP scale. In other words, “no signal” is worse than “unusable signal”, so its numeric value should lie below -90.
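As a quick illustration of the intended mapping (the real implementation appears later in the pre-processing), here is a minimal sketch of how a few example readings would be transformed:
# Illustration only: clamp to [-90, -30] and recode “no signal” (+100) as -100
clip_signal <- function(x) ifelse(x == 100, -100, pmin(pmax(x, -90), -30))
clip_signal(c(-104, -95, -60, -25, 0, 100))
# result: -90 -90 -60 -30 -30 -100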
Load libraries
if ("pacman" %in% rownames(installed.packages()) == FALSE) {
install.packages("pacman")
} else {
pacman::p_load(readr, dplyr, magrittr, ggplot2, plotly, reshape2, tidyr, caret,
stringr, mime, ranger, gridExtra, tidyverse, svDialogs, rstudioapi)
}
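The datasets themselves come from the two CSV files in the UCI download; a minimal loading sketch, assuming the files keep their original names and sit in the working directory:
# Assumed file names from the UJIIndoorLoc archive; adjust the paths if needed
trainingData <- as.data.frame(read_csv("trainingData.csv"))
validationData <- as.data.frame(read_csv("validationData.csv"))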
Check for missing values
sum(is.na(trainingData))
## [1] 0
sum(is.na(validationData))
## [1] 0
Feature selection
The predictive model can’t depend on USERID or PHONEID, since those variables are not related to location. I’m keeping PHONEID for now to do an exploratory analysis on the phones, with the intention of removing it later.
Although TIMESTAMP could be related to location through a schedule, I’m opting to remove it because that schedule might vary for many reasons that we aren’t keeping track of.
SPACEID and RELATIVEPOSITION are mostly noise; all spaces inside the buildings will be treated the same.
Convert BUILDINGID and FLOOR into categorical variables.
Remove duplicate observations, of which there were 637 in trainingData. This could be explained by users not using the app properly, or by some issue with the capture app, since different apps were used to capture the training and validation data.
And finally, convert the “no signal” value of the WAPs from +100 to NA (not available). I mentioned earlier that I would convert that value into -100, but for now I will keep it as NA to simplify the next pre-processing steps, and convert it to -100 when needed.
preproc <- function(df) {
  # Drop variables the model should not depend on
  df$SPACEID <- NULL
  df$RELATIVEPOSITION <- NULL
  df$USERID <- NULL
  df$TIMESTAMP <- NULL
  # Convert the location identifiers into factors
  cols <- c("FLOOR", "BUILDINGID")
  df[, cols] <- lapply(df[, cols], as.factor)
  # Remove duplicated observations
  if (sum(duplicated(df)) != 0) {
    df <- df[!duplicated(df), ]
  }
  # Recode the "no signal" value (+100) of every WAP as NA
  WAP <- grep("WAP", names(df), value = TRUE)
  to_na <- function(x) ifelse(x == 100, NA, x)
  df[, WAP] <- df[, WAP] %>% mutate_all(to_na)
  return(df)
}
trainingData <- preproc(trainingData)
validationData <- preproc(validationData)
This plot compares the observations in the training and validation datasets.
We can see that the training data is more structured, clustered around the same spots, but it doesn’t cover all the areas.
The validation data is messier and lacks structure, and, more importantly, it doesn’t always overlap with the training data.
There are areas of the buildings that are present in validation but not in training; this will negatively affect our predictions, and the errors made by the predictive models will probably come from those areas.
trainingData$Set <- "Train"
validationData$Set <- "Validation"
combinedData <- bind_rows(trainingData, validationData)
combinedData$Set <- as.factor(combinedData$Set)
p <- plot_ly(combinedData, x = ~LONGITUDE, y = ~LATITUDE, z = ~FLOOR,
color = ~Set, colors = c("turquoise3", "tomato1"),
size = 1.2) %>% add_markers() %>%
layout(scene = list(xaxis = list(title = "Longitude"),
yaxis = list(title = "Latitude"),
zaxis = list(title = "Floor")))
p
Next I will look for observations that don’t receive a signal from any WAP, and for WAPs that don’t send a signal to any observation, and remove both.
drop_allna_rows <- function(df) {
  # Drop observations that did not detect any WAP (all signals NA)
  WAP <- grep("WAP", names(df), value = TRUE)
  ind <- which(apply(df[, WAP], 1, function(x) all(is.na(x))))
  if (length(ind) > 0) {df <- df[-ind, ]}
  return(df)
}
drop_allna_cols <- function(df) {
  # Drop WAPs that were not detected by any observation (all signals NA)
  WAP <- grep("WAP", names(df), value = TRUE)
  cols_to_drop <- which(apply(df[, WAP], 2, function(x) all(is.na(x))))
  if (length(cols_to_drop) > 0) {df <- df[, -cols_to_drop]}
  return(df)
}
trainingData <- drop_allna_rows(trainingData)
validationData <- drop_allna_rows(validationData)
trainingData <- drop_allna_cols(trainingData)
validationData <- drop_allna_cols(validationData)
Since WAPs that don’t send a signal to any observation in one dataset might still do so in the other, I am going to intersect the lists of detected WAPs and keep only the WAPs common to both datasets.
WAPs_train <- grep("WAP", names(trainingData), value = T)
WAPs_val <- grep("WAP", names(validationData), value = T)
WAPs <- intersect(WAPs_train, WAPs_val)
rm(WAPs_train, WAPs_val)
And then update both datasets with the common WAPs.
cols_to_keep_train <- c(which(colnames(trainingData) %in% WAPs),
which(grepl("^WAP", colnames(trainingData)) == FALSE))
cols_to_keep_val <- c(which(colnames(validationData) %in% WAPs),
which(grepl("^WAP", colnames(validationData)) == FALSE))
trainingData <- trainingData[, cols_to_keep_train]
validationData <- validationData[, cols_to_keep_val]
combinedData <- bind_rows(trainingData, validationData)
combinedData$Set <- as.factor(combinedData$Set)
This plot shows the density distribution of the signals in both datasets. The two distributions being similar is a good sign that we can use the training data to make predictions on the validation data.
The training curve resembles a staircase while the validation curve is smoother; that can be explained by the training data being captured at fixed spots and the validation data being captured at spots chosen freely by the users.
dat1 <- melt(combinedData[, c(WAPs, "Set")])
## Using Set as id variables
p0 <- ggplot(dat1, aes(x = value, fill = Set)) +
geom_density(alpha = 0.2) +
xlab("Signal value") +
ggtitle("Densities of WAP signals for each dataset")
p0
## Warning: Removed 5992776 rows containing non-finite values (stat_density).
Phone exploration
The predictive model can’t depend on PHONEID because that won’t be available if the model is put into production. But since the phone is one of the two elements that influence the recorded signal (the other being the WAP itself), we can do a little exploratory analysis to learn a bit more about the signals and maybe use that knowledge to make better predictions.
Below is a table with the PHONEIDs at the top and, underneath, the number of signals with value 0 recorded by each phone.
rowsw0 <- which(trainingData[, c(WAPs)] == 0, arr.ind = TRUE)
Phone_test <- trainingData[rowsw0, ]
table(Phone_test$PHONEID)
##
## 19 23
## 102 102
When we raise the threshold to look at signals stronger than -30, more PHONEIDs appear, but #19 and #23 are still clearly ahead.
rowsw30 <- which(trainingData[, c(WAPs)] > -30, arr.ind = TRUE)
Phone_test <- trainingData[rowsw30, ]
table(Phone_test$PHONEID)
##
## 7 8 11 14 16 19 23
## 58 3 2 3 7 609 682
And when we check for signals weaker than -90, more than half of the PHONEIDs are represented, and PHONEID #23 leads the list again.
rowsw90 <- which(trainingData[, c(WAPs)] < -90, arr.ind = TRUE)
Phone_test <- trainingData[rowsw90, ]
table(Phone_test$PHONEID)
##
## 1 3 6 7 8 10 11 13 14 16 17 18
## 1573 190 1293 3953 2074 1808 15 14118 1311 1400 219 2810
## 19 22 23 24
## 4063 2193 49217 809
When we check the documentation for PHONEID #23 we see that it is not actually a phone but a tablet: a Transformer TF101 running Android 4.0.3.
I tried removing the observations from PHONEID #23, but since that left some areas of the training data uncovered, I decided to keep them to get a better representation of the space inside the buildings.
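For reference, a minimal sketch of the kind of coverage check behind that decision (illustrative only, not kept in the pipeline):
# Training coverage without the TF101 tablet
no_tf101 <- trainingData[trainingData$PHONEID != 23, ]
plot(no_tf101$LONGITUDE, no_tf101$LATITUDE, pch = 20,
     xlab = "Longitude", ylab = "Latitude",
     main = "Training coverage without PHONEID 23")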
After this we no longer need the PHONEID variable.
trainingData$PHONEID <- NULL
validationData$PHONEID <- NULL
combinedData$PHONEID <- NULL
Earlier I mentioned that I was going to clamp signals weaker than -90 or stronger than -30, and that is the next step.
If a signal is below -90 I will replace it with -90, and if it is above -30 I will replace it with -30. The NAs (“no signal”) become -100.
replace_values <- function(df) {
  WAP <- grep("WAP", names(df), value = TRUE)
  # Clamp signals to [-90, -30] and recode "no signal" (NA) as -100
  clip_and_fill <- function(x) {
    ifelse(!is.na(x), ifelse(x < -90, -90, ifelse(x > -30, -30, x)), -100)
  }
  df[, WAP] <- df[, WAP] %>% mutate_all(clip_and_fill)
  return(df)
}
trainingData <- replace_values(trainingData)
validationData <- replace_values(validationData)
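A quick sanity check (illustration only): after clamping and recoding, every WAP reading should lie within [-100, -30].
# Overall range of the WAP signals after clamping and recoding
range(as.matrix(trainingData[, WAPs]))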
By combining BUILDINGID and FLOOR we get a BUILDINGID_FLOOR variable, which will be used as the prediction target instead of FLOOR.
If we predicted FLOOR alone, the model would count a prediction as a success for getting the FLOOR right even if it got the BUILDINGID wrong.
build_floor <- function(df) {
df <- unite(df, BUILDINGID_FLOOR, c("BUILDINGID", "FLOOR"),
sep = "_", remove = FALSE)
df$BUILDINGID_FLOOR <- as.factor(df$BUILDINGID_FLOOR)
return(df)
}
trainingData <- build_floor(trainingData)
validationData <- build_floor(validationData)
The next thing I want to do is get a list of WAPs that send strong signals to many BUILDINGIDs and BUILDINGID_FLOORs; these WAPs might be problematic for the predictive model.
This process will require a few steps.
First step: find out, for each observation, which WAP sends the strongest signal.
HighestWap_train <- WAPs[apply(trainingData[, c(WAPs)], 1, which.max)]
Second step: For each observation we find out in which BUILDINGID_FLOOR the observation was recorded.
Highest_loc_train <- data.frame("WAP" = HighestWap_train,
"BFLocation" = trainingData$BUILDINGID_FLOOR)
Third step: for each WAP, find the most frequent BUILDINGID_FLOOR in which its signal was the strongest, and store it in a data frame.
multiple <- c()
for (i in unique(Highest_loc_train$WAP)) {
  # Most frequent BUILDINGID_FLOOR among the observations where WAP i was the strongest
  freq_loc <- names(which.max(table(
    Highest_loc_train[Highest_loc_train$WAP == i, ]$BFLocation)))
  multiple <- rbind(multiple, c(i, freq_loc))
}
multiple <- as.data.frame(multiple)
colnames(multiple) <- c("WAP", "Most_Frequent_location")
multiple$Most_Frequent_location <- factor(multiple$Most_Frequent_location,
levels = levels(trainingData$BUILDINGID_FLOOR))
Fourth step: take that most frequent location as a prediction of each WAP’s location.
Highest_loc_train$Prediction <-
sapply(Highest_loc_train$WAP,
function(x) multiple[which(multiple$WAP == x), ]$Most_Frequent_location)
Do the same for BUILDINGID: we only need the first character of the BUILDINGID_FLOOR labels.
Highest_loc_train$BLocation <- str_sub(Highest_loc_train$BFLocation, start = 1, end = 1)
Highest_loc_train$BPrediction <- str_sub(Highest_loc_train$Prediction, start = 1, end = 1)
Highest_loc_train$BLocation <- as.factor(Highest_loc_train$BLocation)
Highest_loc_train$BPrediction <- as.factor(Highest_loc_train$BPrediction)
And finally we compare the predicted WAP locations with their real locations; the wrongly predicted WAPs go into the bad-WAP list. We do this for both BUILDINGID and BUILDINGID_FLOOR.
Highest_loc_train$BFCorrect <- Highest_loc_train$BFLocation == Highest_loc_train$Prediction
BadWAPsBF <- unique(Highest_loc_train[which(Highest_loc_train$BFCorrect == FALSE), ]$WAP)
Highest_loc_train$BCorrect <- Highest_loc_train$BLocation == Highest_loc_train$BPrediction
BadWAPsB <- unique(Highest_loc_train[which(Highest_loc_train$BCorrect == FALSE), ]$WAP)
Now that I have a list of bad BUILDINGID and BUILDINGID_FLOOR WAPs I’m going to explore them a bit.
Boxplots for the WAPs that caused mistakes when predicting the BUILDINGID.
Here we can clearly see why the previous process made mistakes: these WAPs are very strong and send many signals to more than one BUILDINGID.
for (i in BadWAPsB) {
x <- trainingData[[i]]
x[x == -100] <- NA
plot(trainingData$BUILDINGID_FLOOR, x, ylim = c(-100, -25), main = i,
xlab = "BUILDINGID_FLOOR", ylab = "Signal value")
}
Density plots for the WAPs that caused mistakes when predicting the BUILDINGID.
Comparing the density shapes of these WAPs with the density shape of the training data shows clear differences:
-They have more than one peak.
-They don’t decrease in the same way (fast at the beginning, slow at the end).
for (i in BadWAPsB) {
x <- trainingData[[i]]
x[x == -100] <- NA
plot(density(x, na.rm = TRUE), main = i, xlab = "Signal value")
}
Given this, I’m going to remove those WAPs from both datasets.
trainingData <- trainingData[, !(colnames(trainingData) %in% BadWAPsB), drop = FALSE]
validationData <- validationData[, !(colnames(validationData) %in% BadWAPsB), drop = FALSE]
Since I removed some WAPs I will also have to update the WAP list.
WAPs_train <- grep("WAP", names(trainingData), value = T)
WAPs_val <- grep("WAP", names(validationData), value = T)
WAPs <- intersect(WAPs_train, WAPs_val)
Another thing I want to do is add three new variables containing the names of the WAPs with the three strongest signals in each observation.
The variable TopWap1 will hold the name of the WAP with the strongest signal in each observation, TopWap2 the second strongest and TopWap3 the third strongest.
topwaps <- function(df) {
df$Top3Waps <- apply(df[, c(WAPs)], 1, function(x)
paste(names(head(sort(x, decreasing = TRUE), 3)), collapse = " "))
df <- separate(df, Top3Waps, c("TopWap1", "TopWap2", "TopWap3"))
df$TopWap1 <- factor(df$TopWap1, levels = WAPs)
df$TopWap2 <- factor(df$TopWap2, levels = WAPs)
df$TopWap3 <- factor(df$TopWap3, levels = WAPs)
return(df)
}
trainingData <- topwaps(trainingData)
validationData <- topwaps(validationData)
To avoid the model depending on differences between high-end phones and budget options (and on the natural progression of technology), I decided to normalize every row so that the signals received in each observation are on the same scale.
Since we removed some WAPs, we first check again for rows with no signal at all and remove them.
drop_null_value_rows <- function(df) {
  # Temporarily turn -100 back into NA, drop rows with no signal at all, then restore -100
  df[df == -100] <- NA
  df <- drop_allna_rows(df)
  df[is.na(df)] <- -100
  return(df)
}
trainingData <- drop_null_value_rows(trainingData)
validationData <- drop_null_value_rows(validationData)
And then rescale the signals of every row to the range [0, 1].
This way the strongest signal in each observation is always 1, regardless of whether it came from a good phone that received a strong signal or from a worse phone that received a weaker one.
normalize <- function(df) {
  # Row-wise min-max scaling of the WAP signals; the remaining columns are kept as they are
  df <- data.frame(t(apply(df[WAPs], 1,
                           function(x) (x - min(x)) / (max(x) - min(x)))),
                   df[, !colnames(df) %in% WAPs])
  return(df)
}
trainingData <- normalize(trainingData)
validationData <- normalize(validationData)
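A quick sanity check (illustration only): after row-wise min-max scaling, the signals of any given observation should span 0 to 1.
# Scaled signals of the first observation; expected to range from 0 to 1
range(unlist(trainingData[1, WAPs]))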
Since we still have too many variables in our datasets, I will run a PCA on the remaining WAPs to reduce the dimensionality of the datasets and improve the performance of the predictive models.
Transform trainingData WAPs into PCs
trainingData.pca <- prcomp(trainingData[, c(WAPs)], center = TRUE, scale = TRUE)
This plot shows the eigenvalue (variance) of each PC. I will keep every PC with an eigenvalue of 1 or higher and drop the rest, since an eigenvalue lower than 1 means the component explains less than a single original variable would. By the look of it, that means keeping the first 77 PCs.
screeplot(trainingData.pca, type = "l", npcs = 100, pch = c(20), col = c("turquoise3"),
main = "Screeplot of the first 100 PCs")
abline(h = 1, col = "tomato1", lty = 5)
legend("topright", legend = c("Eigenvalue = 1"),
col = c("tomato1"), lty = 5, cex = 0.6)
This plot shows the cumulative variance captured. Our selected 77 PCs explain about 80% of the variance, which is great: we reduced a list of 304 WAPs to 77 PCs while retaining a large part of the information those variables provide.
cumpro <- cumsum(trainingData.pca$sdev^2 / sum(trainingData.pca$sdev^2))
plot(cumpro[0:100], pch = c(20), col = c("turquoise3"),
xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 77, col = "tomato1", lty = 5)
abline(h = 0.80, col = "tomato1", lty = 5)
legend("bottom", legend = c("Cut-off @ PC77"),
col = c("tomato1"), lty = 5, cex = 0.6)
Select the top 77 PCs for trainingData.pca
training.pca <- as.data.frame(trainingData.pca$x[, 1:77])
Add the non WAP variables to the selected PCs
training.pca <-
cbind(training.pca, trainingData[, !(colnames(trainingData) %in% WAPs), drop = FALSE])
Predict PCs for validationData
validation.pca <-
as.data.frame(predict(trainingData.pca, newdata = validationData[, c(WAPs)]))
Select the top 77 PCs for validation.pca
validation.pca <- validation.pca[, 1:77]
Add the non WAP variables to the selected PCs
validation.pca <-
cbind(validation.pca, validationData[, !(colnames(validationData) %in% WAPs), drop = FALSE])
For the purposes of this problem I only need to predict LATITUDE, LONGITUDE and FLOOR, but in order to improve the LATITUDE and LONGITUDE models I will add the BUILDINGID prediction as a variable in them.
This model uses the 304 WAPs that remain of the 520 provided, after the reductions described in the previous steps.
WAPs <- grep("WAP", names(trainingData), value = T)
set.seed(123);Building_Model_allwaps <-
ranger(BUILDINGID ~., data = trainingData[, c(WAPs, "BUILDINGID")])
Building_Model_allwaps_cm_train <-
confusionMatrix(trainingData$BUILDINGID, Building_Model_allwaps$predictions)
Building_Model_allwaps_val_pred <-
predict(Building_Model_allwaps, validationData[, c(WAPs)])
Building_Model_allwaps_cm_val <-
confusionMatrix(validationData$BUILDINGID, Building_Model_allwaps_val_pred[[1]])
This model uses the three variables created earlier, holding the WAP names of the 1st, 2nd and 3rd strongest signals in each observation.
predictors_train <- grep("TopW", colnames(trainingData))
predictors_train <- c(predictors_train, which(colnames(trainingData) == "BUILDINGID"))
predictors_validation <- grep("TopW", colnames(validationData))
predictors_validation <- c(predictors_validation, which(colnames(validationData) == "BUILDINGID"))
## Train set:
set.seed(123);Building_Model_Top3waps <-
ranger(BUILDINGID ~., data = trainingData[, c(predictors_train)])
Building_Model_Top3waps_cm_train <-
confusionMatrix(trainingData$BUILDINGID, Building_Model_Top3waps$predictions)
Building_Model_Top3waps_val_pred <-
predict(Building_Model_Top3waps, validationData[, c(predictors_validation)])
Building_Model_Top3waps_cm_val <-
confusionMatrix(validationData$BUILDINGID, Building_Model_Top3waps_val_pred$predictions)
Building_Model_Top3waps_cm_val
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 535 0 1
## 1 2 304 1
## 2 0 1 267
##
## Overall Statistics
##
## Accuracy : 0.9955
## 95% CI : (0.9895, 0.9985)
## No Information Rate : 0.4833
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9929
##
## Mcnemar's Test P-Value : 0.3916
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9963 0.9967 0.9926
## Specificity 0.9983 0.9963 0.9988
## Pos Pred Value 0.9981 0.9902 0.9963
## Neg Pred Value 0.9965 0.9988 0.9976
## Prevalence 0.4833 0.2745 0.2421
## Detection Rate 0.4815 0.2736 0.2403
## Detection Prevalence 0.4824 0.2763 0.2412
## Balanced Accuracy 0.9973 0.9965 0.9957
In this table we can see the Accuracy, the Kappa, the model used and the dataset the metrics come from.
Both metrics are very good for every model/set combination.
metrics_b <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(metrics_b) <- c("Accuracy", "Kappa", "Model", "Set")
metrics_b[nrow(metrics_b) + 1, ] <- c(Building_Model_allwaps_cm_train$overall[1:2],
"AllWaps", "Training")
metrics_b[nrow(metrics_b) + 1, ] <- c(Building_Model_allwaps_cm_val$overall[1:2],
"AllWaps", "Validation")
metrics_b[nrow(metrics_b) + 1, ] <- c(Building_Model_Top3waps_cm_train$overall[1:2],
"Top3Waps", "Training")
metrics_b[nrow(metrics_b) + 1, ] <- c(Building_Model_Top3waps_cm_val$overall[1:2],
"Top3Waps", "Validation")
metrics_b[, c(1, 2)] <- mapply(as.numeric,metrics_b[, c(1, 2)])
metrics_b
## Accuracy Kappa Model Set
## 1 0.9998956 0.9998361 AllWaps Training
## 2 0.9972997 0.9957323 AllWaps Validation
## 3 0.9991644 0.9986891 Top3Waps Training
## 4 0.9954995 0.9928852 Top3Waps Validation
These bar plots show the same information as the previous table.
Since the metrics of both models are very similar, and almost equally good on training and validation, I’m going to use the Top3Waps model: it only has three variables instead of 304 and is the simpler of the two.
ggplot(metrics_b, aes(x = Set, y = Accuracy, fill = Set)) +
geom_col() +
facet_wrap( ~Model) +
ggtitle("Accuracy of models for predicting BUILDINGID")
ggplot(metrics_b, aes(x = Set, y = Kappa, fill = Set)) +
geom_col() +
facet_wrap( ~Model) +
ggtitle("Kappa of models for predicting BUILDINGID")
Here, the correctly predicted spots are shown in green and the spots where the model made a mistake in red. If we compare this plot with the train/validation plot shown before, we can see that the mistakes happened in spots where there are validation observations but no training ones.
correctness <- function(x, y){
return(ifelse(x == y, "Correct", "Wrong"))
}
validationData$Error_BPred <- mapply(correctness, validationData$BUILDINGID,
Building_Model_Top3waps_val_pred$predictions)
validationData$Error_BPred <- as.factor(validationData$Error_BPred)
p3 <- plot_ly(validationData, x = ~LONGITUDE, y = ~LATITUDE, z = ~FLOOR,
color = ~Error_BPred, colors = c("green", "red"),
size = 1.2) %>% add_markers() %>%
layout(scene = list(xaxis = list(title = "Longitude"),
yaxis = list(title = "Latitude"),
zaxis = list(title = "Floor")),
title = "Location of the BUILDINGID errors")
p3
set.seed(123);BF_Model_allwaps <-
ranger(BUILDINGID_FLOOR ~.,trainingData[, c(WAPs, "BUILDINGID_FLOOR")])
BF_Model_allwaps_cm_train <-
confusionMatrix(trainingData$BUILDINGID_FLOOR, BF_Model_allwaps$predictions)
BF_Model_allwaps_pred_val <-
predict(BF_Model_allwaps, validationData[, c(WAPs)])
BF_Model_allwaps_cm_val <-
confusionMatrix(validationData$BUILDINGID_FLOOR, BF_Model_allwaps_pred_val$predictions)
set.seed(123);BF_Model_Top3waps <-
ranger(BUILDINGID_FLOOR ~.,
trainingData[, c("TopWap1", "TopWap2", "TopWap3", "BUILDINGID_FLOOR")])
BF_Model_Top3waps_cm_train <-
confusionMatrix(trainingData$BUILDINGID_FLOOR, BF_Model_Top3waps$predictions)
BF_Model_Top3waps_pred_val <-
predict(BF_Model_Top3waps, validationData[, c("TopWap1", "TopWap2", "TopWap3")])
BF_Model_Top3waps_cm_val <-
confusionMatrix(validationData$BUILDINGID_FLOOR, BF_Model_Top3waps_pred_val$predictions)
BF_Model_Top3waps_cm_val
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0_0 0_1 0_2 0_3 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4
## 0_0 72 6 0 0 0 0 0 0 0 0 0 0 0
## 0_1 4 199 5 0 0 0 0 0 0 0 0 0 0
## 0_2 0 23 137 3 0 0 0 2 0 0 0 0 0
## 0_3 0 0 6 79 0 0 0 0 0 0 0 0 0
## 1_0 0 0 0 0 23 4 3 0 0 0 0 0 0
## 1_1 0 0 0 0 13 122 7 1 0 0 0 0 0
## 1_2 0 0 0 0 1 3 80 3 0 0 0 0 0
## 1_3 0 0 0 0 0 0 3 44 0 0 0 0 0
## 2_0 0 0 0 0 0 0 0 0 23 1 0 0 0
## 2_1 0 0 0 0 0 0 0 0 6 102 1 2 0
## 2_2 1 0 0 0 0 1 0 0 0 1 49 2 0
## 2_3 0 0 0 0 0 0 0 0 0 1 0 38 1
## 2_4 0 0 0 0 1 0 0 0 0 0 0 2 36
##
## Overall Statistics
##
## Accuracy : 0.9037
## 95% CI : (0.8848, 0.9204)
## No Information Rate : 0.2052
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.892
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0_0 Class: 0_1 Class: 0_2 Class: 0_3
## Sensitivity 0.93506 0.8728 0.9257 0.96341
## Specificity 0.99420 0.9898 0.9709 0.99417
## Pos Pred Value 0.92308 0.9567 0.8303 0.92941
## Neg Pred Value 0.99516 0.9679 0.9884 0.99708
## Prevalence 0.06931 0.2052 0.1332 0.07381
## Detection Rate 0.06481 0.1791 0.1233 0.07111
## Detection Prevalence 0.07021 0.1872 0.1485 0.07651
## Balanced Accuracy 0.96463 0.9313 0.9483 0.97879
## Class: 1_0 Class: 1_1 Class: 1_2 Class: 1_3
## Sensitivity 0.6053 0.9385 0.86022 0.8800
## Specificity 0.9935 0.9786 0.99312 0.9972
## Pos Pred Value 0.7667 0.8531 0.91954 0.9362
## Neg Pred Value 0.9861 0.9917 0.98730 0.9944
## Prevalence 0.0342 0.1170 0.08371 0.0450
## Detection Rate 0.0207 0.1098 0.07201 0.0396
## Detection Prevalence 0.0270 0.1287 0.07831 0.0423
## Balanced Accuracy 0.7994 0.9585 0.92667 0.9386
## Class: 2_0 Class: 2_1 Class: 2_2 Class: 2_3
## Sensitivity 0.7931 0.97143 0.9800 0.8636
## Specificity 0.9991 0.99105 0.9953 0.9981
## Pos Pred Value 0.9583 0.91892 0.9074 0.9500
## Neg Pred Value 0.9945 0.99700 0.9991 0.9944
## Prevalence 0.0261 0.09451 0.0450 0.0396
## Detection Rate 0.0207 0.09181 0.0441 0.0342
## Detection Prevalence 0.0216 0.09991 0.0486 0.0360
## Balanced Accuracy 0.8961 0.98124 0.9876 0.9309
## Class: 2_4
## Sensitivity 0.9730
## Specificity 0.9972
## Pos Pred Value 0.9231
## Neg Pred Value 0.9991
## Prevalence 0.0333
## Detection Rate 0.0324
## Detection Prevalence 0.0351
## Balanced Accuracy 0.9851
Here we can see that the model performs a bit better on training than on validation, for the same reason mentioned for the BUILDINGID predictions. The performance drop is bigger here because BUILDINGID had only 3 possible classes while BUILDINGID_FLOOR has 13 (3 buildings with 4-5 floors each).
metrics_bf <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(metrics_bf) <- c("Accuracy", "Kappa", "Model", "Set")
metrics_bf[nrow(metrics_bf) + 1, ] <- c(BF_Model_allwaps_cm_train$overall[1:2],
"AllWaps", "Training")
metrics_bf[nrow(metrics_bf) + 1, ] <- c(BF_Model_allwaps_cm_val$overall[1:2],
"AllWaps", "Validation")
metrics_bf[nrow(metrics_bf) + 1, ] <- c(BF_Model_Top3waps_cm_train$overall[1:2],
"Top3Waps", "Training")
metrics_bf[nrow(metrics_bf) + 1, ] <- c(BF_Model_Top3waps_cm_val$overall[1:2],
"Top3Waps", "Validation")
metrics_bf[, c(1, 2)] <- mapply(as.numeric, metrics_bf[, c(1, 2)])
metrics_bf
## Accuracy Kappa Model Set
## 1 0.9982245 0.9980570 AllWaps Training
## 2 0.9108911 0.9001983 AllWaps Validation
## 3 0.9535746 0.9492043 Top3Waps Training
## 4 0.9036904 0.8920356 Top3Waps Validation
Here I’m going to choose the Top3Waps model again, for the following reasons:
-The validation accuracies of the two models are almost the same.
-There’s less of a performance drop between training and validation with Top3Waps than with AllWaps.
-The Top3Waps model is also simpler, with only 3 variables instead of the 304 of the AllWaps model.
ggplot(metrics_bf, aes(x = Set, y = Accuracy, fill = Set)) +
geom_col() +
facet_wrap( ~Model) +
ggtitle("Accuracy of models for predicting BUILDINGID_FLOOR")
ggplot(metrics_bf, aes(x = Set, y = Kappa, fill = Set)) +
geom_col() +
facet_wrap( ~Model) +
ggtitle("Kappa of models for predicting BUILDINGID_FLOOR")
Here we can observe and conclude the same as in the previous BUILDINGID plot: most of the errors happened at locations where there is no training data available.
With the data available, this issue could only be solved by merging both datasets, but then we would lose an honest validation set: the error metrics would look great for any model and we would no longer be able to use them to choose one.
validationData$Error_BFPred <- mapply(correctness, validationData$BUILDINGID_FLOOR,
BF_Model_Top3waps_pred_val$predictions)
validationData$Error_BFPred <- as.factor(validationData$Error_BFPred)
p4 <- plot_ly(validationData, x = ~LONGITUDE, y = ~LATITUDE, z = ~FLOOR,
color = ~Error_BFPred, colors = c("green", "red"),
size = 1.2) %>% add_markers() %>%
layout(scene = list(xaxis = list(title = "Longitude"),
yaxis = list(title = "Latitude"),
zaxis = list(title = "Floor")),
title = "Location of the BUILDINGID_FLOOR errors")
p4
As I mentioned earlier, the reason for predicting BUILDINGID was to include it as a variable in the LATITUDE and LONGITUDE models, and that is what the next step is for.
validationData$BUILDINGID <- Building_Model_Top3waps_val_pred$predictions
validation.pca$BUILDINGID <- Building_Model_Top3waps_val_pred$predictions
set.seed(123);Latitude_Model <-
ranger(LATITUDE ~., data = trainingData[, c(WAPs, "LATITUDE", "BUILDINGID")])
Latitude_Model_pred_train_metrics <-
postResample(Latitude_Model$predictions, trainingData$LATITUDE)
Latitude_Model_pred_val <-
predict(Latitude_Model, validationData[, c(WAPs, "BUILDINGID")])
Latitude_Model_pred_val_metrics <-
postResample(Latitude_Model_pred_val$predictions, validationData$LATITUDE)
The following model uses the first 77 PCs, which explain about 80% of the WAP variance.
PCs <- grep("PC", names(training.pca), value = T)
set.seed(123);Latitude_Model_PC <-
ranger(LATITUDE ~., data = training.pca[, c(PCs, "LATITUDE", "BUILDINGID")])
Latitude_Model_PC_pred_train_metrics <-
postResample(Latitude_Model_PC$predictions, training.pca$LATITUDE)
Latitude_Model_PC_pred_val <-
predict(Latitude_Model_PC, validation.pca[, c(PCs, "BUILDINGID")])
Latitude_Model_PC_pred_val_metrics <-
postResample(Latitude_Model_PC_pred_val$predictions, validation.pca$LATITUDE)
Here we see that the validation metrics are quite a bit worse than the training metrics. That is called overfitting: the model is better at predicting the data it was trained on than at predicting unseen data.
That can be explained by the difference in how the training and validation data were captured.
The training data was collected at fixed spots, like marks on the ground: the users went to those marks and used the app to capture the Wi-Fi signal values.
The model is trained on those fixed spots and is quite good at predicting them, but the validation data was captured at spots freely chosen by each user, so it is harder for the model to make predictions there.
Although the metrics of the PC model are slightly worse than those of the model using the WAPs directly, I will go with the PCA model since it has over 200 fewer variables and is simpler.
metrics_lat <- data.frame(matrix(ncol = 5, nrow = 0))
colnames(metrics_lat) <- c("RMSE", "Rsquared", "MAE", "Model", "Set")
metrics_lat[nrow(metrics_lat) + 1, ] <- c(Latitude_Model_pred_train_metrics[1:3],
"AllWaps", "Training")
metrics_lat[nrow(metrics_lat) + 1, ] <- c(Latitude_Model_pred_val_metrics[1:3],
"AllWaps", "Validation")
metrics_lat[nrow(metrics_lat) + 1, ] <- c(Latitude_Model_PC_pred_train_metrics[1:3],
"PCs", "Training")
metrics_lat[nrow(metrics_lat) + 1, ] <- c(Latitude_Model_PC_pred_val_metrics[1:3],
"PCs", "Validation")
metrics_lat
## RMSE Rsquared MAE Model Set
## 1 4.09044042940572 0.996411629631604 2.78972558965396 AllWaps Training
## 2 9.41968862540729 0.982759458478635 6.17484362118037 AllWaps Validation
## 3 3.80194825138135 0.996900833087085 2.27403806185639 PCs Training
## 4 9.55126051109952 0.982171820750341 6.39228134005122 PCs Validation
Here we can see density plots of how big the errors were and how often they occurred.
Most errors are close to 0, which is good, and the frequency decreases as the errors get bigger, which is also good.
The global density plot shows that 75% of the errors were smaller than 9.
The densities by BUILDINGID show that the model was better at predicting the area inside BUILDINGID #0.
pabserror_split <- function(pred, real, building, z) {
AELAT <- abs(pred - real)
errors <- data.frame(abs = AELAT, b = building)
dens <- density(AELAT, na.rm = TRUE)
df <- data.frame(x = dens$x, y = dens$y)
probs <- c(0.75)
quantiles <- quantile(AELAT, prob = probs)
df$quant <- factor(findInterval(df$x, quantiles))
p1 <- ggplot(df, aes(x, y)) + geom_line(size = 1.5) +
geom_ribbon(aes(ymin = 0, ymax = y, fill = quant)) +
scale_fill_brewer(guide = "none") +
scale_x_continuous(breaks = seq(0, 120, 20)) +
labs(x = "Absolute Error", y = "Density", title = paste("Density of global error in", z))
p2 <- ggplot(errors, aes(abs, fill = b)) + geom_density(aes(y = ..density..)) +
facet_wrap( ~b) + xlab("Absolute error") +
ggtitle("Densities of absolute error by BUILDINGID") +
labs(x = "Absolute Error", y = "Density", fill = "BUILDINGID")
p.both <- arrangeGrob(p1, p2)
grid::grid.draw(p.both)
}
pabserror_split(Latitude_Model_PC_pred_val$predictions, validation.pca$LATITUDE,
validation.pca$BUILDINGID, "LATITUDE")
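The 75% figure can also be checked directly from the predictions (a quick sketch, outside the plotting helper):
# 75th percentile of the absolute LATITUDE error on validation
abs_err_lat <- abs(Latitude_Model_PC_pred_val$predictions - validation.pca$LATITUDE)
quantile(abs_err_lat, probs = 0.75)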
I added the LATITUDE predictions to make it easier for the model to predict LONGITUDE.
validationData$LATITUDE <- Latitude_Model_pred_val$predictions
validation.pca$LATITUDE <- Latitude_Model_PC_pred_val$predictions
set.seed(123);Longitude_Model <-
ranger(LONGITUDE ~., data = trainingData[, c(WAPs, "LONGITUDE", "LATITUDE", "BUILDINGID")])
Longitude_Model_metrics_train <-
postResample(Longitude_Model$predictions, trainingData$LONGITUDE)
Longitude_Model_pred_val <-
predict(Longitude_Model, validationData[, c(WAPs, "LATITUDE", "BUILDINGID")])
Longitude_Model_metrics_val <-
postResample(Longitude_Model_pred_val$predictions, validationData$LONGITUDE)
set.seed(123);Longitude_Model_PC <-
ranger(LONGITUDE ~., data = training.pca[, c(PCs, "LONGITUDE", "LATITUDE", "BUILDINGID")])
Longitude_Model_PC_metrics_train <-
postResample(Longitude_Model_PC$predictions, training.pca$LONGITUDE)
Longitude_Model_PC_pred_val <-
predict(Longitude_Model_PC, validation.pca[, c(PCs, "LATITUDE", "BUILDINGID")])
Longitude_Model_PC_metrics_val <-
postResample(Longitude_Model_PC_pred_val$predictions, validation.pca$LONGITUDE)
Here we see the same thing that happened with the LATITUDE metrics, and we can draw the same conclusion: the model overfits due to the differences in how the training and validation data were captured, and I will again choose the PCA model for its reduced dimensionality.
metrics_lon <- data.frame(matrix(ncol = 5, nrow = 0))
colnames(metrics_lon) <- c("RMSE", "Rsquared", "MAE", "Model", "Set")
metrics_lon[nrow(metrics_lon) + 1, ] <- c(Longitude_Model_metrics_train[1:3],
"AllWaps", "Training")
metrics_lon[nrow(metrics_lon) + 1, ] <- c(Longitude_Model_metrics_val[1:3],
"AllWaps", "Validation")
metrics_lon[nrow(metrics_lon) + 1, ] <- c(Longitude_Model_PC_metrics_train[1:3],
"PCs", "Training")
metrics_lon[nrow(metrics_lon) + 1, ] <- c(Longitude_Model_PC_metrics_val[1:3],
"PCs", "Validation")
metrics_lon
## RMSE Rsquared MAE Model Set
## 1 4.45763264460499 0.998724083093364 2.9904469671499 AllWaps Training
## 2 10.6632142357706 0.992186018674215 6.56328892070392 AllWaps Validation
## 3 3.99655645918155 0.998977755473581 2.27755594669835 PCs Training
## 4 10.3103817476841 0.992969207053521 6.62873559664271 PCs Validation
Again we see a picture similar to what we saw with LATITUDE.
The shape of the error density is good, and the model is better at predicting the area inside BUILDINGID #0.
pabserror_split(Longitude_Model_PC_pred_val$predictions, validation.pca$LONGITUDE,
validation.pca$BUILDINGID, "LONGITUDE")
The point of this analysis was to see whether it is possible to predict the location of a device based on Wi-Fi signals, and we can say that yes, it is possible and actually a good method.
But the first requirement for implementing it is good coverage of the tracked surface in the training data, which I did not have for this analysis; that lack of coverage is the main source of the predictive models’ errors.
With a focus on dimensionality reduction, I reduced the 520 WAPs to just 3 variables for BUILDINGID and BUILDINGID_FLOOR and 77 for LATITUDE and LONGITUDE, while keeping error metrics close to those obtained with more variables. For LATITUDE and LONGITUDE more variables were needed, since the weaker Wi-Fi signals are still required to increase the model’s precision; we can’t rely on the strongest signals alone.