The Wi-Fi positioning system (WPS) is used where GPS cannot receive a signal. To improve the user experience there is Wi-Fi fingerprinting: the position of a user is estimated from a fingerprint (or instance), the vector of signal intensities that the user's smartphone measures from the surrounding Wireless Access Points (WAPs). The UJIIndoorLoc data set from Universitat Jaume I has been used for this task. This free repository contains the following attributes: 520 WAP signal-intensity columns (WAP001 to WAP520) plus LONGITUDE, LATITUDE, FLOOR, BUILDINGID, SPACEID, RELATIVEPOSITION, USERID, PHONEID and TIMESTAMP.

In this project I aim to predict the location (longitude, latitude and floor) of a user from the WAP signals. The methods used are the following machine learning algorithms: k-Nearest Neighbours (k-NN) and Random Forest.

Both have been used to predict the longitude, latitude and floor features (regression for the coordinates, classification for the floor). The main goal has been to see which algorithm works best.

1.Load the Data

setwd("/Users/jessicagonzalez/Downloads/UJIndoorLoc")
train <- read.csv("trainingData.csv")
valid <- read.csv("validationData.csv")

2.Load required packages

library(dplyr)
library(ISLR)
library(lattice)
library(ggplot2)
library(caret)
library(ggmap)
library(caTools)
library(gridExtra)
library(ranger)
library(e1071)

3.Exploration

#Summary of some attributes
summary(train[,521:529])
##    LONGITUDE        LATITUDE           FLOOR         BUILDINGID   
##  Min.   :-7691   Min.   :4864746   Min.   :0.000   Min.   :0.000  
##  1st Qu.:-7595   1st Qu.:4864821   1st Qu.:1.000   1st Qu.:0.000  
##  Median :-7423   Median :4864852   Median :2.000   Median :1.000  
##  Mean   :-7464   Mean   :4864871   Mean   :1.675   Mean   :1.213  
##  3rd Qu.:-7359   3rd Qu.:4864930   3rd Qu.:3.000   3rd Qu.:2.000  
##  Max.   :-7301   Max.   :4865017   Max.   :4.000   Max.   :2.000  
##     SPACEID      RELATIVEPOSITION     USERID          PHONEID     
##  Min.   :  1.0   Min.   :1.000    Min.   : 1.000   Min.   : 1.00  
##  1st Qu.:110.0   1st Qu.:2.000    1st Qu.: 5.000   1st Qu.: 8.00  
##  Median :129.0   Median :2.000    Median :11.000   Median :13.00  
##  Mean   :148.4   Mean   :1.833    Mean   : 9.068   Mean   :13.02  
##  3rd Qu.:207.0   3rd Qu.:2.000    3rd Qu.:13.000   3rd Qu.:14.00  
##  Max.   :254.0   Max.   :2.000    Max.   :18.000   Max.   :24.00  
##    TIMESTAMP        
##  Min.   :1.370e+09  
##  1st Qu.:1.371e+09  
##  Median :1.372e+09  
##  Mean   :1.371e+09  
##  3rd Qu.:1.372e+09  
##  Max.   :1.372e+09
summary(valid[,521:529])
##    LONGITUDE        LATITUDE           FLOOR         BUILDINGID    
##  Min.   :-7696   Min.   :4864748   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:-7637   1st Qu.:4864843   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :-7560   Median :4864915   Median :1.000   Median :1.0000  
##  Mean   :-7529   Mean   :4864902   Mean   :1.572   Mean   :0.7588  
##  3rd Qu.:-7421   3rd Qu.:4864967   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :-7300   Max.   :4865017   Max.   :4.000   Max.   :2.0000  
##     SPACEID  RELATIVEPOSITION     USERID     PHONEID     
##  Min.   :0   Min.   :0        Min.   :0   Min.   : 0.00  
##  1st Qu.:0   1st Qu.:0        1st Qu.:0   1st Qu.: 9.00  
##  Median :0   Median :0        Median :0   Median :13.00  
##  Mean   :0   Mean   :0        Mean   :0   Mean   :11.92  
##  3rd Qu.:0   3rd Qu.:0        3rd Qu.:0   3rd Qu.:15.00  
##  Max.   :0   Max.   :0        Max.   :0   Max.   :21.00  
##    TIMESTAMP        
##  Min.   :1.380e+09  
##  1st Qu.:1.380e+09  
##  Median :1.381e+09  
##  Mean   :1.381e+09  
##  3rd Qu.:1.381e+09  
##  Max.   :1.381e+09
#Plot the locations in both data sets
plot(train$LONGITUDE, train$LATITUDE)

plot(valid$LONGITUDE, valid$LATITUDE)

#In real life
#Universitat Jaume I:

univ <- c(-0.067417, 39.992871)
map1 <- get_map(univ, zoom = 17, scale = 1)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=39.992871,-0.067417&zoom=17&size=640x640&scale=1&maptype=terrain&language=en-EN
ggmap(map1) 

In the previous plots we can see that the validation set has some areas with no instances. To work with complete information I will merge both data sets into one full data set, and then split it again into training and validation sets.

4.Preprocessing

#Merge both data sets and split the full data again (using the caTools library)
data <- rbind(train, valid)
sample <- sample.split(data$BUILDINGID, SplitRatio = 0.70)  #split on a label vector, stratified by building
training <- subset(data, sample == TRUE)
validation <- subset(data, sample == FALSE)

The k-NN and Random Forest algorithms have been implemented for latitude, longitude and floor, separately for each of the 3 buildings:

#Separate data by building in train & valid
build.0 <- filter(training, BUILDINGID == 0)
build.1 <- filter(training, BUILDINGID == 1)
build.2 <- filter(training, BUILDINGID == 2)

build.0.v <- filter(validation, BUILDINGID == 0)
build.1.v <- filter(validation, BUILDINGID == 1)
build.2.v <- filter(validation, BUILDINGID == 2)

#Create a data frame for each feature
#TRAIN
#Build 0
build.0.lat <- data.frame(build.0$LATITUDE, build.0[,1:520])
build.0.long <- data.frame(build.0$LONGITUDE, build.0[,1:520])
build.0.floor <- data.frame(build.0$FLOOR, build.0[,1:520])

#Build 1
build.1.lat <- data.frame(build.1$LATITUDE, build.1[,1:520])
build.1.long <- data.frame(build.1$LONGITUDE, build.1[,1:520])
build.1.floor <- data.frame(build.1$FLOOR, build.1[,1:520])

#Build 2
build.2.lat <- data.frame(build.2$LATITUDE, build.2[,1:520])
build.2.long <- data.frame(build.2$LONGITUDE, build.2[,1:520])
build.2.floor <- data.frame(build.2$FLOOR, build.2[,1:520])

#VALID
#Build 0
build.0.lat.v <- data.frame(build.0.v$LATITUDE, build.0.v[,1:520])
build.0.long.v <- data.frame(build.0.v$LONGITUDE, build.0.v[,1:520])
build.0.floor.v <- data.frame(build.0.v$FLOOR, build.0.v[,1:520])

#Build 1
build.1.lat.v <- data.frame(build.1.v$LATITUDE, build.1.v[,1:520])
build.1.long.v <- data.frame(build.1.v$LONGITUDE, build.1.v[,1:520])
build.1.floor.v <- data.frame(build.1.v$FLOOR, build.1.v[,1:520])

#Build 2
build.2.lat.v <- data.frame(build.2.v$LATITUDE, build.2.v[,1:520])
build.2.long.v <- data.frame(build.2.v$LONGITUDE, build.2.v[,1:520])
build.2.floor.v <- data.frame(build.2.v$FLOOR, build.2.v[,1:520])

To save computation time I used random samples of each data set. The sample sizes differ because each building has a different number of instances:

#Sample the data, taking a fixed number of random rows per building
#TRAIN
#Build 0
sample.build.0.lat <- build.0.lat[sample(1:nrow(build.0.lat), 4000, replace = FALSE),]
sample.build.0.long <- build.0.long[sample(1:nrow(build.0.long), 4000, replace = FALSE),]
sample.build.0.floor <- build.0.floor[sample(1:nrow(build.0.floor), 4000, replace = FALSE),]
#Build 1
sample.build.1.lat <- build.1.lat[sample(1:nrow(build.1.lat), 3000, replace = FALSE),]
sample.build.1.long <- build.1.long[sample(1:nrow(build.1.long), 3000, replace = FALSE),]
sample.build.1.floor <- build.1.floor[sample(1:nrow(build.1.floor), 3000, replace = FALSE),]
#Build 2
sample.build.2.lat <- build.2.lat[sample(1:nrow(build.2.lat), 6000, replace = FALSE),]
sample.build.2.long <- build.2.long[sample(1:nrow(build.2.long), 6000, replace = FALSE),]
sample.build.2.floor <- build.2.floor[sample(1:nrow(build.2.floor), 6000, replace = FALSE),]

Floor is the only feature used in a classification problem, so I have to convert it to a factor attribute:

#Convert the FLOOR feature to a factor
sample.build.0.floor$build.0.FLOOR <- as.factor(sample.build.0.floor$build.0.FLOOR)
sample.build.1.floor$build.1.FLOOR <- as.factor(sample.build.1.floor$build.1.FLOOR)
sample.build.2.floor$build.2.FLOOR <- as.factor(sample.build.2.floor$build.2.FLOOR)

5.Fitting Models

We want to see whether the WAP signals can predict the real position of a user. The main algorithm for this is k-NN, a supervised learning method chosen because it is instance based: it classifies a point from the information of its nearest neighbours, using a distance function, in my case the Euclidean distance:
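For two fingerprints p and q over the 520 WAP readings, that distance is sqrt(sum_i (p_i - q_i)^2). As a minimal illustration (the objects p and q below exist only for this sketch), it can be computed directly on two rows of one of the training samples:

#Euclidean distance between two fingerprints of building 0 (WAP columns only)
p <- unlist(sample.build.0.lat[1, -1])  #WAP readings of the first sampled record (drop LATITUDE)
q <- unlist(sample.build.0.lat[2, -1])  #WAP readings of the second
sqrt(sum((p - q)^2))                    #distance between the two fingerprints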

Another suitable algorithm for this problem is Random Forest (an ensemble of CART trees). The algorithm grows many trees, each considering a random subset of the predictors; for regression the tree predictions are averaged, and for classification the most common predicted class is taken as the final prediction.
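As a standalone illustration of that averaging (separate from the caret models fitted below; ranger is already loaded and the object names here exist only for this sketch):

#Fit a small forest on building 0 latitude data and average the per-tree predictions by hand
rf.sketch <- ranger(build.0.LATITUDE ~ ., data = sample.build.0.lat,
                    num.trees = 100, mtry = 32, seed = 1)
per.tree <- predict(rf.sketch, data = build.0.lat.v, predict.all = TRUE)$predictions
head(rowMeans(per.tree))                                   #manual ensemble average
head(predict(rf.sketch, data = build.0.lat.v)$predictions) #ranger's own averaged prediction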

In this case I use the caret package with 5-fold cross-validation, and with zero-variance filtering and median imputation as preprocessing. In all cases I first train the model and then predict each feature on the validation sample. Afterwards I compute the RMSE and R squared for latitude and longitude. For the floor feature I build a confusion matrix that crosses the real values with the predicted values.
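The two regression metrics are computed explicitly after every fit below; as a compact reference, they boil down to these two helper functions (the names are mine and are not reused in the code below):

#Helper sketch for the regression metrics used after each model fit
rmse.metric <- function(actual, predicted) sqrt(mean((predicted - actual)^2))
rsq.metric <- function(actual, predicted) {
  1 - sum((predicted - actual)^2) / sum((actual - mean(actual))^2)  #reported below multiplied by 100
}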

KNN

Build 0
#Latitude
knn.lat.0 <- train(build.0.LATITUDE ~ ., 
                 sample.build.0.lat,
                 method = "knn",
                 trControl = trainControl(method = "cv",
                                          number = 5,
                                          verboseIter = TRUE),
                preProcess = c("zv", "medianImpute"))

pred.knn.lat.0 <- predict(knn.lat.0, build.0.lat.v)
error.knn.lat.0 <- pred.knn.lat.0 - build.0.lat.v$build.0.v.LATITUDE 
rmse.knn.lat.0 <- sqrt(mean(error.knn.lat.0^2))
rmse.knn.lat.0
rsquared.knn.lat.0 <- 1 - (sum(error.knn.lat.0^2) / sum((build.0.lat.v$build.0.v.LATITUDE-mean(build.0.lat.v$build.0.v.LATITUDE))^2))
rsquared.knn.lat.0 <- rsquared.knn.lat.0 * 100
rsquared.knn.lat.0

#Longitude
knn.long.0 <- train(build.0.LONGITUDE ~ ., 
                   sample.build.0.long,
                   method = "knn",
                   trControl = trainControl(method = "cv",
                                            number = 5,
                                            verboseIter = TRUE),
                   preProcess = c("zv", "medianImpute"))

pred.knn.long.0 <- predict(knn.long.0, build.0.long.v)
error.knn.long.0 <- pred.knn.long.0 - build.0.long.v$build.0.v.LONGITUDE
rmse.knn.long.0 <- sqrt(mean(error.knn.long.0^2))
rmse.knn.long.0
rsquared.knn.long.0 <- 1 - (sum(error.knn.long.0^2) / sum((build.0.long.v$build.0.v.LONGITUDE-mean(build.0.long.v$build.0.v.LONGITUDE))^2))
rsquared.knn.long.0 <- rsquared.knn.long.0 * 100
rsquared.knn.long.0

#Floor
knn.floor.0 <- train(build.0.FLOOR ~ ., 
                   sample.build.0.floor,
                   method = "knn",
                   trControl = trainControl(method = "cv",
                                            number = 5,
                                            verboseIter = TRUE),
                   preProcess = c("zv", "medianImpute"))

pred.knn.floor.0 <- predict(knn.floor.0, build.0.floor.v)
conf.matrix.knn.floor.0 <- table(pred.knn.floor.0, build.0.floor.v$build.0.v.FLOOR)
accuracy.knn.floor.0 <- (sum(diag(conf.matrix.knn.floor.0))) / sum(conf.matrix.knn.floor.0)
accuracy.knn.floor.0 <- accuracy.knn.floor.0 * 100
accuracy.knn.floor.0
Build 1
#Latitude
knn.lat.1 <- train(build.1.LATITUDE ~ ., 
                   sample.build.1.lat,
                   method = "knn",
                   trControl = trainControl(method = "cv",
                                            number = 5,
                                            verboseIter = TRUE),
                   preProcess = c("zv", "medianImpute"))

pred.knn.lat.1 <- predict(knn.lat.1, build.1.lat.v)
error.knn.lat.1 <- pred.knn.lat.1 - build.1.lat.v$build.1.v.LATITUDE 
rmse.knn.lat.1 <- sqrt(mean(error.knn.lat.1^2))
rmse.knn.lat.1
rsquared.knn.lat.1 <- 1 - (sum(error.knn.lat.1^2) / sum((build.1.lat.v$build.1.v.LATITUDE-mean(build.1.lat.v$build.1.v.LATITUDE))^2))
rsquared.knn.lat.1 <- rsquared.knn.lat.1 * 100
rsquared.knn.lat.1

#Longitude
knn.long.1 <- train(build.1.LONGITUDE ~ ., 
                    sample.build.1.long,
                    method = "knn",
                    trControl = trainControl(method = "cv",
                                             number = 5,
                                             verboseIter = TRUE),
                    preProcess = c("zv", "medianImpute"))

pred.knn.long.1 <- predict(knn.long.1, build.1.long.v)
error.knn.long.1 <- pred.knn.long.1 - build.1.long.v$build.1.v.LONGITUDE
rmse.knn.long.1 <- sqrt(mean(error.knn.long.1^2))
rmse.knn.long.1
rsquared.knn.long.1 <- 1 - (sum(error.knn.long.1^2) / sum((build.1.long.v$build.1.v.LONGITUDE-mean(build.1.long.v$build.1.v.LONGITUDE))^2))
rsquared.knn.long.1 <- rsquared.knn.long.1 * 100
rsquared.knn.long.1

#Floor
knn.floor.1 <- train(build.1.FLOOR ~ ., 
                     sample.build.1.floor,
                     method = "knn",
                     trControl = trainControl(method = "cv",
                                              number = 5,
                                              verboseIter = TRUE),
                     preProcess = c("zv", "medianImpute"))

pred.knn.floor.1 <- predict(knn.floor.1, build.1.floor.v)
conf.matrix.knn.floor.1 <- table(pred.knn.floor.1, build.1.floor.v$build.1.v.FLOOR)
accuracy.knn.floor.1 <- (sum(diag(conf.matrix.knn.floor.1))) / sum(conf.matrix.knn.floor.1)
accuracy.knn.floor.1 <- accuracy.knn.floor.1 * 100
accuracy.knn.floor.1
Build 2
#Latitude
knn.lat.2 <- train(build.2.LATITUDE ~ ., 
                   sample.build.2.lat,
                   method = "knn",
                   trControl = trainControl(method = "cv",
                                            number = 5,
                                            verboseIter = TRUE),
                   preProcess = c("zv", "medianImpute"))

pred.knn.lat.2 <- predict(knn.lat.2, build.2.lat.v)
error.knn.lat.2 <- pred.knn.lat.2 - build.2.lat.v$build.2.v.LATITUDE 
rmse.knn.lat.2 <- sqrt(mean(error.knn.lat.2^2))
rmse.knn.lat.2
rsquared.knn.lat.2 <- 1 - (sum(error.knn.lat.2^2) / sum((build.2.lat.v$build.2.v.LATITUDE-mean(build.2.lat.v$build.2.v.LATITUDE))^2))
rsquared.knn.lat.2 <- rsquared.knn.lat.2 * 100
rsquared.knn.lat.2

#Longitude
knn.long.2 <- train(build.2.LONGITUDE ~ ., 
                    sample.build.2.long,
                    method = "knn",
                    trControl = trainControl(method = "cv",
                                             number = 5,
                                             verboseIter = TRUE),
                    preProcess = c("zv", "medianImpute"))

pred.knn.long.2 <- predict(knn.long.2, build.2.long.v)
error.knn.long.2 <- pred.knn.long.2 - build.2.long.v$build.2.v.LONGITUDE
rmse.knn.long.2 <- sqrt(mean(error.knn.long.2^2))
rmse.knn.long.2
rsquared.knn.long.2 <- 1 - (sum(error.knn.long.2^2) / sum((build.2.long.v$build.2.v.LONGITUDE-mean(build.2.long.v$build.2.v.LONGITUDE))^2))
rsquared.knn.long.2 <- rsquared.knn.long.2 * 100
rsquared.knn.long.2

#Floor
knn.floor.2 <- train(build.2.FLOOR ~ ., 
                     sample.build.2.floor,
                     method = "knn",
                     trControl = trainControl(method = "cv",
                                              number = 5,
                                              verboseIter = TRUE),
                     preProcess = c("zv", "medianImpute"))

pred.knn.floor.2 <- predict(knn.floor.2, build.2.floor.v)
conf.matrix.knn.floor.2 <- table(pred.knn.floor.2, build.2.floor.v$build.2.v.FLOOR)
accuracy.knn.floor.2 <- (sum(diag(conf.matrix.knn.floor.2))) / sum(conf.matrix.knn.floor.2)
accuracy.knn.floor.2 <- accuracy.knn.floor.2 * 100
accuracy.knn.floor.2

Random Forest

To save computational cost, for this algorithm I fixed the mtry parameter at 32, which gave the best accuracy among the values 2, 32 and 520.
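The comparison behind that choice can be reproduced with a small tuning grid; this is only a sketch on one building, mirroring the tuneGrid form used in this document (the 520 case is slow, since every split considers all WAPs):

#Sketch of the mtry comparison: 2 (very few predictors per split), 32 (intermediate), 520 (all WAPs)
mtry.check <- train(build.0.LATITUDE ~ .,
                    sample.build.0.lat,
                    method = "ranger",
                    tuneGrid = data.frame(mtry = c(2, 32, 520)),
                    trControl = trainControl(method = "cv", number = 5),
                    preProcess = c("zv", "medianImpute"))
mtry.check$results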

Build 0
#Latitude
rfor.lat.0 <- train(build.0.LATITUDE ~ ., 
                   sample.build.0.lat,
                   method = "ranger",
                   tuneGrid=data.frame(mtry=32),
                   trControl = trainControl(method = "cv",
                                            number = 5,
                                            verboseIter = TRUE),
                   preProcess = c("zv", "medianImpute"))

pred.rfor.lat.0 <- predict(rfor.lat.0, build.0.lat.v)
error.rfor.lat.0 <- pred.rfor.lat.0 - build.0.lat.v$build.0.v.LATITUDE 
rmse.rfor.lat.0 <- sqrt(mean(error.rfor.lat.0^2))
rmse.rfor.lat.0
rsquared.rfor.lat.0 <- 1 - (sum(error.rfor.lat.0^2) / sum((build.0.lat.v$build.0.v.LATITUDE-mean(build.0.lat.v$build.0.v.LATITUDE))^2))
rsquared.rfor.lat.0 <- rsquared.rfor.lat.0 * 100
rsquared.rfor.lat.0

#Longitude
rfor.long.0 <- train(build.0.LONGITUDE ~ ., 
                    sample.build.0.long,
                    method = "ranger",
                    tuneGrid=data.frame(mtry=32),
                    trControl = trainControl(method = "cv",
                                             number = 5,
                                             verboseIter = TRUE),
                    preProcess = c("zv", "medianImpute"))

pred.rfor.long.0 <- predict(rfor.long.0, build.0.long.v)
error.rfor.long.0 <- pred.rfor.long.0 - build.0.long.v$build.0.v.LONGITUDE
rmse.rfor.long.0 <- sqrt(mean(error.rfor.long.0^2))
rmse.rfor.long.0
rsquared.rfor.long.0 <- 1 - (sum(error.rfor.long.0^2) / sum((build.0.long.v$build.0.v.LONGITUDE-mean(build.0.long.v$build.0.v.LONGITUDE))^2))
rsquared.rfor.long.0 <- rsquared.rfor.long.0 * 100
rsquared.rfor.long.0

#Floor
rfor.floor.0 <- train(build.0.FLOOR ~ ., 
                     sample.build.0.floor,
                     method = "ranger",
                     tuneGrid=data.frame(mtry=32),
                     trControl = trainControl(method = "cv",
                                              number = 5,
                                              verboseIter = TRUE),
                     preProcess = c("zv", "medianImpute"))

pred.rfor.floor.0 <- predict(rfor.floor.0, build.0.floor.v)
conf.matrix.rfor.floor.0 <- table(pred.rfor.floor.0, build.0.floor.v$build.0.v.FLOOR)
accuracy.rfor.floor.0 <- (sum(diag(conf.matrix.rfor.floor.0))) / sum(conf.matrix.rfor.floor.0)
accuracy.rfor.floor.0 <- accuracy.rfor.floor.0 * 100
accuracy.rfor.floor.0
Build 1
#Latitude
rfor.lat.1 <- train(build.1.LATITUDE ~ ., 
                    sample.build.1.lat,
                    method = "ranger",
                    tuneGrid=data.frame(mtry=32),
                    trControl = trainControl(method = "cv",
                                             number = 5,
                                             verboseIter = TRUE),
                    preProcess = c("zv", "medianImpute"))

pred.rfor.lat.1 <- predict(rfor.lat.1, build.1.lat.v)
error.rfor.lat.1 <- pred.rfor.lat.1 - build.1.lat.v$build.1.v.LATITUDE 
rmse.rfor.lat.1 <- sqrt(mean(error.rfor.lat.1^2))
rmse.rfor.lat.1
rsquared.rfor.lat.1 <- 1 - (sum(error.rfor.lat.1^2) / sum((build.1.lat.v$build.1.v.LATITUDE-mean(build.1.lat.v$build.1.v.LATITUDE))^2))
rsquared.rfor.lat.1 <- rsquared.rfor.lat.1 * 100
rsquared.rfor.lat.1

#Longitude
rfor.long.1 <- train(build.1.LONGITUDE ~ ., 
                     sample.build.1.long,
                     method = "ranger",
                     tuneGrid=data.frame(mtry=32),
                     trControl = trainControl(method = "cv",
                                              number = 5,
                                              verboseIter = TRUE),
                     preProcess = c("zv", "medianImpute"))

pred.rfor.long.1 <- predict(rfor.long.1, build.1.long.v)
error.rfor.long.1 <- pred.rfor.long.1 - build.1.long.v$build.1.v.LONGITUDE
rmse.rfor.long.1 <- sqrt(mean(error.rfor.long.1^2))
rmse.rfor.long.1
rsquared.rfor.long.1 <- 1 - (sum(error.rfor.long.1^2) / sum((build.1.long.v$build.1.v.LONGITUDE-mean(build.1.long.v$build.1.v.LONGITUDE))^2))
rsquared.rfor.long.1 <- rsquared.rfor.long.1 * 100
rsquared.rfor.long.1

#Floor
rfor.floor.1 <- train(build.1.FLOOR ~ ., 
                      sample.build.1.floor,
                      method = "ranger",
                      tuneGrid=data.frame(mtry=32),
                      trControl = trainControl(method = "cv",
                                               number = 5,
                                               verboseIter = TRUE),
                      preProcess = c("zv", "medianImpute"))

pred.rfor.floor.1 <- predict(rfor.floor.1, build.1.floor.v)
conf.matrix.rfor.floor.1 <- table(pred.rfor.floor.1, build.1.floor.v$build.1.v.FLOOR)
accuracy.rfor.floor.1 <- (sum(diag(conf.matrix.rfor.floor.1))) / sum(conf.matrix.rfor.floor.1)
accuracy.rfor.floor.1 <- accuracy.rfor.floor.1 * 100
accuracy.rfor.floor.1
Build 2
#Latitude
rfor.lat.2 <- train(build.2.LATITUDE ~ ., 
                    sample.build.2.lat,
                    method = "ranger",
                    tuneGrid=data.frame(mtry=32),
                    trControl = trainControl(method = "cv",
                                             number = 5,
                                             verboseIter = TRUE),
                    preProcess = c("zv", "medianImpute"))

pred.rfor.lat.2 <- predict(rfor.lat.2, build.2.lat.v)
error.rfor.lat.2 <- pred.rfor.lat.2 - build.2.lat.v$build.2.v.LATITUDE 
rmse.rfor.lat.2 <- sqrt(mean(error.rfor.lat.2^2))
rmse.rfor.lat.2
rsquared.rfor.lat.2 <- 1 - (sum(error.rfor.lat.2^2) / sum((build.2.lat.v$build.2.v.LATITUDE-mean(build.2.lat.v$build.2.v.LATITUDE))^2))
rsquared.rfor.lat.2 <- rsquared.rfor.lat.2 * 100
rsquared.rfor.lat.2

#Longitude
rfor.long.2 <- train(build.2.LONGITUDE ~ ., 
                     sample.build.2.long,
                     method = "ranger",
                     tuneGrid=data.frame(mtry=32),
                     trControl = trainControl(method = "cv",
                                              number = 5,
                                              verboseIter = TRUE),
                     preProcess = c("zv", "medianImpute"))

pred.rfor.long.2 <- predict(rfor.long.2, build.2.long.v)
error.rfor.long.2 <- pred.rfor.long.2 - build.2.long.v$build.2.v.LONGITUDE
rmse.rfor.long.2 <- sqrt(mean(error.rfor.long.2^2))
rmse.rfor.long.2
rsquared.rfor.long.2 <- 1 - (sum(error.rfor.long.2^2) / sum((build.2.long.v$build.2.v.LONGITUDE-mean(build.2.long.v$build.2.v.LONGITUDE))^2))
rsquared.rfor.long.2 <- rsquared.rfor.long.2 * 100
rsquared.rfor.long.2

#Floor
rfor.floor.2 <- train(build.2.FLOOR ~ ., 
                      sample.build.2.floor,
                      method = "ranger",
                      tuneGrid=data.frame(mtry=32),
                      trControl = trainControl(method = "cv",
                                               number = 5,
                                               verboseIter = TRUE),
                      preProcess = c("zv", "medianImpute"))

pred.rfor.floor.2 <- predict(rfor.floor.2, build.2.floor.v)
conf.matrix.rfor.floor.2 <- table(pred.rfor.floor.2, build.2.floor.v$build.2.v.FLOOR)
accuracy.rfor.floor.2 <- (sum(diag(conf.matrix.rfor.floor.2))) / sum(conf.matrix.rfor.floor.2)
accuracy.rfor.floor.2 <- accuracy.rfor.floor.2 * 100
accuracy.rfor.floor.2

Summary

LATITUDE

Algorithm RMSE R squared
0.knn 4.830 97.741
0.R.forest 3.479 98.828
1.knn 7.219 96.052
1.R.forest 5.491 97.716
2.knn 5.694 96.021
2.R.forest 5.545 96.228

LONGITUDE

Algorithm RMSE R squared
0.knn 5.981 94.389
0.R.forest 4.297 97.103
1.knn 7.021 97.964
1.R.forest 5.507 98.748
2.knn 8.245 92.152
2.R.forest 7.282 93.878

FLOOR

Algorithm Accuracy
0.knn 93.226
0.R.forest 99.139
1.knn 95.309
1.R.forest 99.347
2.knn 97.446
2.R.forest 99.413

Floor models are classification problems, so we can see which floors were predicted better or worse with a confusion matrix:
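Beyond the overall accuracy, dividing the diagonal of each matrix by its column totals gives a per-floor hit rate; a quick sketch for the first k-NN matrix:

#Per-floor recall for building 0 with knn: correct predictions divided by the
#number of validation records actually on each floor
round(diag(conf.matrix.knn.floor.0) / colSums(conf.matrix.knn.floor.0) * 100, 3)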

#KNN
#Build 0
conf.matrix.knn.floor.0
##                 
## pred.knn.floor.0   0   1   2   3
##                0 323  26   2   0
##                1   8 378  12   1
##                2   0   7 448  24
##                3   0   0  57 449
#Build 1
conf.matrix.knn.floor.1
##                 
## pred.knn.floor.1   0   1   2   3
##                0 390  25   1   0
##                1  11 447   4   0
##                2   0   8 421   5
##                3   5   4  23 299
#Build 2
conf.matrix.knn.floor.2
##                 
## pred.knn.floor.2   0   1   2   3   4
##                0 570  23   4   2   2
##                1   2 681   6   0   0
##                2   0   8 472   3   0
##                3   0   0  14 800   3
##                4   0   0   1   6 356
#Random Forest
#Build 0
conf.matrix.rfor.floor.0
##                  
## pred.rfor.floor.0   0   1   2   3
##                 0 326   1   0   0
##                 1   5 410   1   0
##                 2   0   0 515   2
##                 3   0   0   3 472
#Build 1
conf.matrix.rfor.floor.1
##                  
## pred.rfor.floor.1   0   1   2   3
##                 0 405   1   0   0
##                 1   1 477   0   0
##                 2   0   5 445   5
##                 3   0   1   4 299
#Build 2
conf.matrix.rfor.floor.2
##                  
## pred.rfor.floor.2   0   1   2   3   4
##                 0 570   1   0   0   0
##                 1   2 709   3   1   1
##                 2   0   1 485   0   0
##                 3   0   1   9 810   4
##                 4   0   0   0   0 356

6.Resamples

To see which model works best, the results have been gathered for each accuracy metric (R squared and RMSE). In theory, the R squared value should be as high as possible and the RMSE (Root Mean Squared Error) as low as possible. For the floor feature I only use Accuracy (the percentage of correctly classified instances).
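For reference, the two regression metrics compared here are defined as follows (the R squared values in the tables above are this quantity multiplied by 100):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$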

#Create dataframes
#rmse 
latitude.rmse <- data.frame(metricas = c("KNN.0", "RFOREST.0", "KNN.1", "RFOREST.1", "KNN.2", "RFOREST.2"),
                                values = c(rmse.knn.lat.0, rmse.rfor.lat.0, rmse.knn.lat.1, rmse.rfor.lat.1,
                                     rmse.knn.lat.2, rmse.rfor.lat.2))

longitude.rmse <- data.frame(metricas = c("KNN.0", "RFOREST.0", "KNN.1", "RFOREST.1", "KNN.2", "RFOREST.2"),
                            values = c(rmse.knn.long.0, rmse.rfor.long.0, rmse.knn.long.1, rmse.rfor.long.1,
                                       rmse.knn.long.2, rmse.rfor.long.2))

#rsquared
latitude.rsquared <- data.frame(metricas = c("KNN.0", "RFOREST.0", "KNN.1", "RFOREST.1", "KNN.2", "RFOREST.2"),
                            values = c(rsquared.knn.lat.0, rsquared.rfor.lat.0, rsquared.knn.lat.1, rsquared.rfor.lat.1,
                                       rsquared.knn.lat.2, rsquared.rfor.lat.2))

longitude.rsquared <- data.frame(metricas = c("KNN.0", "RFOREST.0", "KNN.1", "RFOREST.1", "KNN.2", "RFOREST.2"),
                             values = c(rsquared.knn.long.0, rsquared.rfor.long.0, rsquared.knn.long.1,                                                          rsquared.rfor.long.1, rsquared.knn.long.2, rsquared.rfor.long.2))

floor.accuracy <- data.frame(metricas = c("KNN.0", "RFOREST.0", "KNN.1", "RFOREST.1", "KNN.2", "RFOREST.2"),
                             values = c(accuracy.knn.floor.0, accuracy.rfor.floor.0, accuracy.knn.floor.1,                                                      accuracy.rfor.floor.1, accuracy.knn.floor.2, accuracy.rfor.floor.2))

For a clear visualization, I have made one plot for each feature and each metric. As the ggplot2 package sorts the x axis alphabetically, I had to customize the ordering:

#Order x axis
latitude.rmse$metricas <- as.character(latitude.rmse$metricas)
latitude.rmse$metricas <- factor(latitude.rmse$metricas, levels=unique(latitude.rmse$metricas))

longitude.rmse$metricas <- as.character(longitude.rmse$metricas)
longitude.rmse$metricas <- factor(longitude.rmse$metricas, levels=unique(longitude.rmse$metricas))

latitude.rsquared$metricas <- as.character(latitude.rsquared$metricas)
latitude.rsquared$metricas <- factor(latitude.rsquared$metricas, levels=unique(latitude.rsquared$metricas))

longitude.rsquared$metricas <- as.character(longitude.rsquared$metricas)
longitude.rsquared$metricas <- factor(longitude.rsquared$metricas, levels=unique(longitude.rsquared$metricas))

floor.accuracy$metricas <- as.character(floor.accuracy$metricas)
floor.accuracy$metricas <- factor(floor.accuracy$metricas, levels=unique(floor.accuracy$metricas))

7.Plots

The plotting parameters used are the following:

#Latitude plots
a <- latitude.rmse %>% 
  ggplot(aes(x = metricas, y = values)) + 
  geom_col(aes(fill = metricas)) +
  geom_text(aes(fill = metricas, label = round(values, digits = 3)), colour = "black") +
  coord_flip() +
  labs(x = "Metrics for each Building",
       y = "RMSE",
       title = "LATITUDE") +
  theme_light() +
  scale_fill_brewer(palette = "GnBu") +
  theme(legend.position="none")

d <- latitude.rsquared %>% 
ggplot(aes(x = metricas, y = values)) + 
  geom_col(aes(fill = metricas)) +
  geom_text(aes(fill = metricas, label = round(values, digits = 3)), colour = "black") +
  coord_flip() +
  labs(x = "",
       y = "RSQUARED",
       title = "") +
  theme_light() +
  scale_fill_brewer(palette = "GnBu") +
  theme(legend.position="none")


#Longitude plots
b <- longitude.rmse %>% 
  ggplot(aes(x = metricas, y = values)) + 
  geom_col(aes(fill = metricas)) +
  geom_text(aes(fill = metricas, label = round(values, digits = 3)), colour = "black") +
  coord_flip() +
  labs(x = "",
       y = "RMSE",
       title = "LONGITUDE") +
  theme_light() +
  scale_fill_brewer(palette = "OrRd") +
  theme(legend.position="none")

e <- longitude.rsquared %>% 
  ggplot(aes(x = metricas, y = values)) + 
  geom_col(aes(fill = metricas)) +
  geom_text(aes(fill = metricas, label = round(values, digits = 3)), colour = "black") +
  coord_flip() +
  labs(x = "",
       y = "RSQUARED",
       title = "") +
  theme_light() +
  scale_fill_brewer(palette = "OrRd") +
  theme(legend.position="none")

#Floor plots
f <- floor.accuracy %>% 
  ggplot(aes(x = metricas, y = values)) + 
  geom_col(aes(fill = metricas)) +
  geom_text(aes(fill = metricas, label = round(values, digits = 3)), colour = "black") +
  coord_flip() +
  labs(x = "",
       y = "ACCURACY",
       title = "FLOOR") +
  theme_light() +
  scale_fill_brewer(palette = "BuPu") +
  theme(legend.position="none")

#All plots in one
lat.long.plots <- grid.arrange(a, d, b, e, f, ncol = 2)

As we can see, Random Forest is the best algorithm in all cases. This may be because that algorithm works well with high-dimensional data sets, and here we have 520 predictors (WAPs). We can also compare the buildings: latitude and longitude have the lowest error in building 0. The highest R squared for latitude is in building 0, whereas for longitude it is in building 1. For the floor, the highest accuracy is in building 2, although accuracy is high in all buildings.

Summary with Random Forest Algorithm

LATITUDE

Algorithm RMSE R squared
0.R.forest 3.479 98.828
1.R.forest 5.491 97.716
2.R.forest 5.545 96.228

LONGITUDE

Algorithm RMSE R squared
0.R.forest 4.297 97.103
1.R.forest 5.507 98.748
2.R.forest 7.282 93.878

FLOOR

Algorithm Accuracy
0.R.forest 99.139
1.R.forest 99.347
2.R.forest 99.413

8.Fit the final model

We have seen that the Random Forest algorithm is the best, so now we use all the available training sample to predict longitude, latitude and floor. First, we create a data frame with the WAPs (predictors) and each feature, in the training and validation sets.

#Create Data Frame
lat <- data.frame(training$LATITUDE, training[,1:520])
long <- data.frame(training$LONGITUDE, training[,1:520])
floor <- data.frame(training$FLOOR, training[,1:520])

floor$training.FLOOR <- as.factor(floor$training.FLOOR)


lat.v <- data.frame(validation$LATITUDE, validation[,1:520])
long.v <- data.frame(validation$LONGITUDE, validation[,1:520])
floor.v <- data.frame(validation$FLOOR, validation[,1:520])
#Latitude
lat.rfor <- train(training.LATITUDE ~ ., 
                   lat,
                   method = "ranger",
                  tuneGrid=data.frame(mtry=32),
                   trControl = trainControl(method = "cv",
                                            number = 5,
                                            verboseIter = TRUE),
                   preProcess = c("zv", "medianImpute"))

pred.lat.rfor <- predict(lat.rfor, lat.v)
error.lat.rfor <- pred.lat.rfor - lat.v$validation.LATITUDE 
rmse.lat.rfor <- sqrt(mean(error.lat.rfor^2))
rmse.lat.rfor
rsquared.lat.rfor <- 1 - (sum(error.lat.rfor^2) / sum((lat.v$validation.LATITUDE-mean(lat.v$validation.LATITUDE))^2))
rsquared.lat.rfor <- rsquared.lat.rfor * 100
rsquared.lat.rfor

#Longitude
long.rfor <- train(training.LONGITUDE ~ ., 
                  long,
                  method = "ranger",
                  tuneGrid=data.frame(mtry=32),
                  trControl = trainControl(method = "cv",
                                           number = 5,
                                           verboseIter = TRUE),
                  preProcess = c("zv", "medianImpute"))

pred.long.rfor <- predict(long.rfor, long.v)
error.long.rfor <- pred.long.rfor - long.v$validation.LONGITUDE 
rmse.long.rfor <- sqrt(mean(error.long.rfor^2))
rmse.long.rfor
rsquared.long.rfor <- 1 - (sum(error.long.rfor^2) / sum((long.v$validation.LONGITUDE-mean(long.v$validation.LONGITUDE))^2))
rsquared.long.rfor <- rsquared.long.rfor * 100
rsquared.long.rfor

#Floor
floor.rfor <- train(training.FLOOR ~ ., 
                   floor,
                   method = "ranger",
                   tuneGrid=data.frame(mtry=32),
                   trControl = trainControl(method = "cv",
                                            number = 5,
                                            verboseIter = TRUE),
                   preProcess = c("zv", "medianImpute"))

pred.floor.rfor <- predict(floor.rfor, floor.v)
conf.matrix.rfor.floor <- table(pred.floor.rfor, floor.v$validation.FLOOR)
accuracy.rfor.floor <- (sum(diag(conf.matrix.rfor.floor))) / sum(conf.matrix.rfor.floor)
accuracy.rfor.floor <- accuracy.rfor.floor * 100
accuracy.rfor.floor

Here we have the results for each feature:

#Latitude
rmse.lat.rfor
## [1] 6.599523
rsquared.lat.rfor
## [1] 99.03976
#Longitude
rmse.long.rfor
## [1] 8.375887
rsquared.long.rfor
## [1] 99.54325
#Floor
conf.matrix.rfor.floor
##                
## pred.floor.rfor    0    1    2    3    4
##               0 1302    9    0   13    1
##               1    7 1591    5    0    0
##               2    0    7 1448    8    0
##               3    0    0   12 1568    5
##               4    0    0    0    0  355
accuracy.rfor.floor
## [1] 98.94172

The prediction error is roughly 7 meters from north to south (latitude) and 9 meters from west to east (longitude). In real terms, this error range is suitable for predicting positions in indoor spaces, considering that the error range of GPS (the usual outdoor location system) goes from 3 to 15 meters, depending on the quality of the device.
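As a rough visual check (a sketch, not part of the original figures), the predicted positions can be overlaid on the actual ones in the data set's own coordinate system:

#Actual validation positions in grey, Random Forest predictions in red
plot(long.v$validation.LONGITUDE, lat.v$validation.LATITUDE,
     col = "grey70", pch = 16, cex = 0.4,
     xlab = "LONGITUDE", ylab = "LATITUDE",
     main = "Actual (grey) vs predicted (red) positions")
points(pred.long.rfor, pred.lat.rfor, col = "red", pch = 16, cex = 0.3)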

Regarding the floor feature, we can see which floor was predicted best and which worst:

conf.matrix.rfor.floor
##                
## pred.floor.rfor    0    1    2    3    4
##               0 1302    9    0   13    1
##               1    7 1591    5    0    0
##               2    0    7 1448    8    0
##               3    0    0   12 1568    5
##               4    0    0    0    0  355
FLOOR  % correctly predicted
0 98.129
1 99.338
2 98.651
3 99.872
4 99.715

The best predicted floor has been floor 3, and the worst floor 0. This may be because on floor 0 there are more users outside or in front of the door, so the signal was not captured correctly.

RELATIVEPOSITION values: 1 = inside, 2 = outside, in front of the door

summary(build.0$RELATIVEPOSITION)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   1.809   2.000   2.000
summary(build.1$RELATIVEPOSITION)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.599   2.000   2.000
summary(build.2$RELATIVEPOSITION)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.00    2.00    1.77    2.00    2.00
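These summaries are per building; to look at the floor-0 hypothesis more directly, the RELATIVEPOSITION values could also be cross-tabulated per floor. A quick sketch on the training data (the zeros come from the merged validation records, where this field is not filled in):

#Proportion of records per floor taken inside (1) or outside in front of the door (2)
round(prop.table(table(training$FLOOR, training$RELATIVEPOSITION), margin = 1), 3)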

9.Conclusions

We now have a location system available for indoor spaces. The problem has been applied to a university campus, but many other uses are possible to improve the user experience. For example, with this system of indoor location via Wi-Fi, users in a shopping center could receive customized offers and advertising based on their location. Likewise, knowing the most crowded zones would make it possible to place advertising strategically. In the cultural field, the system could be implemented in museums and galleries to inform users when they pass a place of interest, giving them a versatile and comfortable experience.

Thanks to:

Joaquín Torres-Sospedra, Raúl Montoliu, Adolfo Martínez-Usó, Tomás J. Arnau, Joan P. Avariento, Mauri Benedito-Bordonau, Joaquín Huerta. UJIIndoorLoc: A New Multi-building and Multi-floor Database for WLAN Fingerprint-based Indoor Localization Problems. In Proceedings of the Fifth International Conference on Indoor Positioning and Indoor Navigation, 2014. https://archive.ics.uci.edu/ml/datasets/ujiindoorloc

Contact:

@ jessica.gonzalez.d8@gmail.com linkedin: https://www.linkedin.com/in/jessica-gonzalezd/