A Wi-Fi positioning system (WPS) is used where GPS cannot receive a signal, typically indoors. One way to implement it is Wi-Fi fingerprinting: the position of a user is estimated from the signal strengths (the fingerprint, or instance) that the user's smartphone records from the surrounding Wireless Access Points (WAPs). The UJIIndoorLoc data set from Universitat Jaume I has been used for the task. That free repository contains 520 WAP signal columns plus the attributes LONGITUDE, LATITUDE, FLOOR, BUILDINGID, SPACEID, RELATIVEPOSITION, USERID, PHONEID and TIMESTAMP.
In this project I aim to predict the location (longitude, latitude and floor) of a user from the WAP signals. The machine learning algorithms I used are k-nearest neighbours (kNN) and Random Forest.
Both have been used to predict longitude and latitude (regression) and floor (classification). The main goal has been to see which algorithm works best.
setwd("/Users/jessicagonzalez/Downloads/UJIndoorLoc")
train <- read.csv("trainingData.csv")
valid <- read.csv("validationData.csv")
library(dplyr)
library(ISLR)
library(lattice)
library(ggplot2)
library(caret)
library(ggmap)
library(caTools)
library(gridExtra)
library(ranger)
library(e1071)
#Summarize the non-WAP attributes
summary(train[,521:529])
## LONGITUDE LATITUDE FLOOR BUILDINGID
## Min. :-7691 Min. :4864746 Min. :0.000 Min. :0.000
## 1st Qu.:-7595 1st Qu.:4864821 1st Qu.:1.000 1st Qu.:0.000
## Median :-7423 Median :4864852 Median :2.000 Median :1.000
## Mean :-7464 Mean :4864871 Mean :1.675 Mean :1.213
## 3rd Qu.:-7359 3rd Qu.:4864930 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :-7301 Max. :4865017 Max. :4.000 Max. :2.000
## SPACEID RELATIVEPOSITION USERID PHONEID
## Min. : 1.0 Min. :1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.:110.0 1st Qu.:2.000 1st Qu.: 5.000 1st Qu.: 8.00
## Median :129.0 Median :2.000 Median :11.000 Median :13.00
## Mean :148.4 Mean :1.833 Mean : 9.068 Mean :13.02
## 3rd Qu.:207.0 3rd Qu.:2.000 3rd Qu.:13.000 3rd Qu.:14.00
## Max. :254.0 Max. :2.000 Max. :18.000 Max. :24.00
## TIMESTAMP
## Min. :1.370e+09
## 1st Qu.:1.371e+09
## Median :1.372e+09
## Mean :1.371e+09
## 3rd Qu.:1.372e+09
## Max. :1.372e+09
summary(valid[,521:529])
## LONGITUDE LATITUDE FLOOR BUILDINGID
## Min. :-7696 Min. :4864748 Min. :0.000 Min. :0.0000
## 1st Qu.:-7637 1st Qu.:4864843 1st Qu.:1.000 1st Qu.:0.0000
## Median :-7560 Median :4864915 Median :1.000 Median :1.0000
## Mean :-7529 Mean :4864902 Mean :1.572 Mean :0.7588
## 3rd Qu.:-7421 3rd Qu.:4864967 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :-7300 Max. :4865017 Max. :4.000 Max. :2.0000
## SPACEID RELATIVEPOSITION USERID PHONEID
## Min. :0 Min. :0 Min. :0 Min. : 0.00
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.: 9.00
## Median :0 Median :0 Median :0 Median :13.00
## Mean :0 Mean :0 Mean :0 Mean :11.92
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:15.00
## Max. :0 Max. :0 Max. :0 Max. :21.00
## TIMESTAMP
## Min. :1.380e+09
## 1st Qu.:1.380e+09
## Median :1.381e+09
## Mean :1.381e+09
## 3rd Qu.:1.381e+09
## Max. :1.381e+09
#Plot the locations in both data sets
plot(train$LONGITUDE, train$LATITUDE)
plot(valid$LONGITUDE, valid$LATITUDE)
#In real life
#Universitat Jaume I:
univ <- c(-0.067417, 39.992871)
map1 <- get_map(univ, zoom = 17, scale = 1)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=39.992871,-0.067417&zoom=17&size=640x640&scale=1&maptype=terrain&language=en-EN
ggmap(map1)
In the previous plots we can see that some areas have no instances in the validation set. To have complete information I merge both data sets into a full data set, and then divide it into a training and a validation set.
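#Merge both data sets and split the full data 70/30 (caTools library)
data <- rbind(train, valid)
set.seed(123) #arbitrary seed so that the split can be reproduced
#sample.split() expects a vector with one value per row, not the whole data frame
in.train <- sample.split(data$BUILDINGID, SplitRatio = .70)
training <- subset(data, in.train == TRUE)
validation <- subset(data, in.train == FALSE)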
The kNN and Random Forest algorithms have been implemented for latitude, longitude and floor, for each of the 3 buildings:
#Separate data by building in train & valid
build.0 <- filter(training, BUILDINGID == 0)
build.1 <- filter(training, BUILDINGID == 1)
build.2 <- filter(training, BUILDINGID == 2)
build.0.v <- filter(validation, BUILDINGID == 0)
build.1.v <- filter(validation, BUILDINGID == 1)
build.2.v <- filter(validation, BUILDINGID == 2)
#Create a data frame for each feature
#TRAIN
#Build 0
build.0.lat <- data.frame(build.0$LATITUDE, build.0[,1:520])
build.0.long <- data.frame(build.0$LONGITUDE, build.0[,1:520])
build.0.floor <- data.frame(build.0$FLOOR, build.0[,1:520])
#Build 1
build.1.lat <- data.frame(build.1$LATITUDE, build.1[,1:520])
build.1.long <- data.frame(build.1$LONGITUDE, build.1[,1:520])
build.1.floor <- data.frame(build.1$FLOOR, build.1[,1:520])
#Build 2
build.2.lat <- data.frame(build.2$LATITUDE, build.2[,1:520])
build.2.long <- data.frame(build.2$LONGITUDE, build.2[,1:520])
build.2.floor <- data.frame(build.2$FLOOR, build.2[,1:520])
#VALID
#Build 0
build.0.lat.v <- data.frame(build.0.v$LATITUDE, build.0.v[,1:520])
build.0.long.v <- data.frame(build.0.v$LONGITUDE, build.0.v[,1:520])
build.0.floor.v <- data.frame(build.0.v$FLOOR, build.0.v[,1:520])
#Build 1
build.1.lat.v <- data.frame(build.1.v$LATITUDE, build.1.v[,1:520])
build.1.long.v <- data.frame(build.1.v$LONGITUDE, build.1.v[,1:520])
build.1.floor.v <- data.frame(build.1.v$FLOOR, build.1.v[,1:520])
#Build 2
build.2.lat.v <- data.frame(build.2.v$LATITUDE, build.2.v[,1:520])
build.2.long.v <- data.frame(build.2.v$LONGITUDE, build.2.v[,1:520])
build.2.floor.v <- data.frame(build.2.v$FLOOR, build.2.v[,1:520])
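The data frames above all follow one pattern (one target column in front of the 520 WAP columns), so they could also be produced by a small helper. The following is only a minimal sketch on made-up data, not the code used for the results in this report: `make_target_frame` and the toy columns are hypothetical, and the helper keeps the original target name (for example `LATITUDE`) instead of names such as `build.0.LATITUDE`.

```r
# Hypothetical helper: put one target column in front of the predictor columns.
make_target_frame <- function(df, target, predictors) {
  out <- data.frame(df[[target]], df[, predictors, drop = FALSE])
  names(out)[1] <- target
  out
}

# Toy data standing in for the real set (3 WAP columns instead of 520).
toy <- data.frame(WAP001 = c(-60, 100, -75),
                  WAP002 = c(100, -80, -70),
                  WAP003 = c(-55, -65, 100),
                  LATITUDE = c(4864800, 4864850, 4864900),
                  BUILDINGID = c(0, 0, 1))

# Split by building, then build the latitude frame for each building.
by_building <- split(toy, toy$BUILDINGID)
lat_frames <- lapply(by_building, make_target_frame,
                     target = "LATITUDE", predictors = 1:3)
names(lat_frames[["0"]])
```

With the real data, `predictors` would be `1:520` and the same call could be repeated for `LONGITUDE` and `FLOOR`.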
To save computing time I used a random sample of each data frame. The sample sizes differ because each building has a different number of instances:
#Sample the data taking X random values
#TRAIN
#Build 0
sample.build.0.lat <- build.0.lat[sample(1:nrow(build.0.lat), 4000, replace = FALSE),]
sample.build.0.long <- build.0.long[sample(1:nrow(build.0.long), 4000, replace = FALSE),]
sample.build.0.floor <- build.0.floor[sample(1:nrow(build.0.floor), 4000, replace = FALSE),]
#Build 1
sample.build.1.lat <- build.1.lat[sample(1:nrow(build.1.lat), 3000, replace = FALSE),]
sample.build.1.long <- build.1.long[sample(1:nrow(build.1.long), 3000, replace = FALSE),]
sample.build.1.floor <- build.1.floor[sample(1:nrow(build.1.floor), 3000, replace = FALSE),]
#Build 2
sample.build.2.lat <- build.2.lat[sample(1:nrow(build.2.lat), 6000, replace = FALSE),]
sample.build.2.long <- build.2.long[sample(1:nrow(build.2.long), 6000, replace = FALSE),]
sample.build.2.floor <- build.2.floor[sample(1:nrow(build.2.floor), 6000, replace = FALSE),]
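Note that `sample()` without replacement fails when more rows are requested than exist, which can happen if a building ends up with fewer training rows than the fixed sample size. A minimal sketch of a capped version follows; `sample_rows` and the toy data are hypothetical, and the seed is arbitrary.

```r
# Hypothetical helper: draw at most n rows without replacement.
sample_rows <- function(df, n) {
  idx <- sample(seq_len(nrow(df)), size = min(n, nrow(df)), replace = FALSE)
  df[idx, , drop = FALSE]
}

set.seed(1)  # arbitrary seed, only to make the toy example repeatable
toy <- data.frame(id = 1:10, value = rnorm(10))
small <- sample_rows(toy, 4)    # 4 of the 10 rows
capped <- sample_rows(toy, 50)  # asks for more rows than exist, returns all 10
c(nrow(small), nrow(capped))
```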
Floor is the only feature treated as a classification problem, so I convert it to a factor attribute:
#Convert FLOOR feature to factor
sample.build.0.floor$build.0.FLOOR <- as.factor(sample.build.0.floor$build.0.FLOOR)
sample.build.1.floor$build.1.FLOOR <- as.factor(sample.build.1.floor$build.1.FLOOR)
sample.build.2.floor$build.2.FLOOR <- as.factor(sample.build.2.floor$build.2.FLOOR)
We want to see whether the WAP signals can predict the real position of a user. The main algorithm for that is kNN. I used this supervised learning algorithm because it is instance based: it predicts a point from the information of its nearest neighbours, found with a distance function, in my case Euclidean distance.
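To make the idea concrete, here is a minimal base R sketch of kNN regression with Euclidean distance (the square root of the summed squared differences between two signal vectors). It is an illustration on made-up fingerprints, not the caret implementation used in this report; `euclidean` and `knn_predict` are hypothetical names.

```r
# Euclidean distance between two signal vectors of equal length.
euclidean <- function(a, b) sqrt(sum((a - b)^2))

# Hypothetical kNN regressor: average the target of the k closest training rows.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  d <- apply(train_x, 1, euclidean, b = new_x)
  nearest <- order(d)[seq_len(k)]
  mean(train_y[nearest])
}

# Toy fingerprints: 4 training points, 2 WAPs (values are made up).
train_x <- rbind(c(-50, -90), c(-52, -88), c(-90, -50), c(-88, -52))
train_y <- c(10, 12, 40, 42)   # e.g. a coordinate in meters

dist_example <- euclidean(c(0, 0), c(3, 4))                  # 5
pred_example <- knn_predict(train_x, train_y, c(-51, -89), k = 2)  # 11
```

For a classification target such as floor, the mean in the last step would be replaced by a majority vote among the k neighbours.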
Another suitable algorithm for this problem is Random Forest (an ensemble of CART trees). It grows many trees, each considering a random subset of the predictors at every split, and then combines them: for regression the tree predictions are averaged, and for classification the most common class is the predicted value.
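The way a forest combines its trees can be shown with a few lines of base R. This is only a toy illustration with made-up tree outputs, not what ranger does internally step by step.

```r
# Made-up predictions from 5 hypothetical trees for one observation.
tree_regression <- c(4864850, 4864853, 4864848, 4864851, 4864849)
tree_class <- c("2", "2", "3", "2", "1")

# Regression: the forest prediction is the mean of the tree predictions.
forest_regression <- mean(tree_regression)

# Classification: the forest prediction is the most frequent class.
votes <- table(tree_class)
forest_class <- names(votes)[which.max(votes)]
```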
Here I use the caret package with 5-fold cross-validation, and zero-variance removal and median imputation as preprocessing. In all cases I first train the model and then predict each feature on a validation sample. Afterwards, I compute the RMSE and R squared for latitude and longitude. For the floor feature I build a confusion matrix that crosses the predicted values with the real values.
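The three metrics are computed by hand many times in the code of this report. As a sketch, they could be wrapped in small helpers; the function names are hypothetical and the numbers are toy values. Computing accuracy as the share of matching labels, instead of the diagonal of a table, also avoids problems when a floor is missing from either the predictions or the real values.

```r
# Root mean squared error.
rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))

# R squared: 1 minus residual sum of squares over total sum of squares.
r_squared <- function(actual, predicted) {
  1 - sum((predicted - actual)^2) / sum((actual - mean(actual))^2)
}

# Accuracy: share of predictions equal to the real label.
accuracy <- function(actual, predicted) mean(as.character(predicted) == as.character(actual))

actual <- c(1, 2, 3, 4)
predicted <- c(1, 2, 3, 6)
rmse_toy <- rmse(actual, predicted)        # 1
r2_toy <- r_squared(actual, predicted)     # 0.2
acc_toy <- accuracy(c("0", "1", "1"), c("0", "1", "2"))  # 2 of 3 correct
```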
#KNN - Build 0
#Latitude
knn.lat.0 <- train(build.0.LATITUDE ~ .,
sample.build.0.lat,
method = "knn",
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.knn.lat.0 <- predict(knn.lat.0, build.0.lat.v)
error.knn.lat.0 <- pred.knn.lat.0 - build.0.lat.v$build.0.v.LATITUDE
rmse.knn.lat.0 <- sqrt(mean(error.knn.lat.0^2))
rmse.knn.lat.0
rsquared.knn.lat.0 <- 1 - (sum(error.knn.lat.0^2) / sum((build.0.lat.v$build.0.v.LATITUDE-mean(build.0.lat.v$build.0.v.LATITUDE))^2))
rsquared.knn.lat.0 <- rsquared.knn.lat.0 * 100
rsquared.knn.lat.0
#Longitude
knn.long.0 <- train(build.0.LONGITUDE ~ .,
sample.build.0.long,
method = "knn",
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.knn.long.0 <- predict(knn.long.0, build.0.long.v)
error.knn.long.0 <- pred.knn.long.0 - build.0.long.v$build.0.v.LONGITUDE
rmse.knn.long.0 <- sqrt(mean(error.knn.long.0^2))
rmse.knn.long.0
rsquared.knn.long.0 <- 1 - (sum(error.knn.long.0^2) / sum((build.0.long.v$build.0.v.LONGITUDE-mean(build.0.long.v$build.0.v.LONGITUDE))^2))
rsquared.knn.long.0 <- rsquared.knn.long.0 * 100
rsquared.knn.long.0
#Floor
knn.floor.0 <- train(build.0.FLOOR ~ .,
sample.build.0.floor,
method = "knn",
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.knn.floor.0 <- predict(knn.floor.0, build.0.floor.v)
conf.matrix.knn.floor.0 <- table(pred.knn.floor.0, build.0.floor.v$build.0.v.FLOOR)
accuracy.knn.floor.0 <- (sum(diag(conf.matrix.knn.floor.0))) / sum(conf.matrix.knn.floor.0)
accuracy.knn.floor.0 <- accuracy.knn.floor.0 * 100
accuracy.knn.floor.0
#KNN - Build 1
#Latitude
knn.lat.1 <- train(build.1.LATITUDE ~ .,
sample.build.1.lat,
method = "knn",
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.knn.lat.1 <- predict(knn.lat.1, build.1.lat.v)
error.knn.lat.1 <- pred.knn.lat.1 - build.1.lat.v$build.1.v.LATITUDE
rmse.knn.lat.1 <- sqrt(mean(error.knn.lat.1^2))
rmse.knn.lat.1
rsquared.knn.lat.1 <- 1 - (sum(error.knn.lat.1^2) / sum((build.1.lat.v$build.1.v.LATITUDE-mean(build.1.lat.v$build.1.v.LATITUDE))^2))
rsquared.knn.lat.1 <- rsquared.knn.lat.1 * 100
rsquared.knn.lat.1
#Longitude
knn.long.1 <- train(build.1.LONGITUDE ~ .,
sample.build.1.long,
method = "knn",
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.knn.long.1 <- predict(knn.long.1, build.1.long.v)
error.knn.long.1 <- pred.knn.long.1 - build.1.long.v$build.1.v.LONGITUDE
rmse.knn.long.1 <- sqrt(mean(error.knn.long.1^2))
rmse.knn.long.1
rsquared.knn.long.1 <- 1 - (sum(error.knn.long.1^2) / sum((build.1.long.v$build.1.v.LONGITUDE-mean(build.1.long.v$build.1.v.LONGITUDE))^2))
rsquared.knn.long.1 <- rsquared.knn.long.1 * 100
rsquared.knn.long.1
#Floor
knn.floor.1 <- train(build.1.FLOOR ~ .,
sample.build.1.floor,
method = "knn",
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.knn.floor.1 <- predict(knn.floor.1, build.1.floor.v)
conf.matrix.knn.floor.1 <- table(pred.knn.floor.1, build.1.floor.v$build.1.v.FLOOR)
accuracy.knn.floor.1 <- (sum(diag(conf.matrix.knn.floor.1))) / sum(conf.matrix.knn.floor.1)
accuracy.knn.floor.1 <- accuracy.knn.floor.1 * 100
accuracy.knn.floor.1
#KNN - Build 2
#Latitude
knn.lat.2 <- train(build.2.LATITUDE ~ .,
sample.build.2.lat,
method = "knn",
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.knn.lat.2 <- predict(knn.lat.2, build.2.lat.v)
error.knn.lat.2 <- pred.knn.lat.2 - build.2.lat.v$build.2.v.LATITUDE
rmse.knn.lat.2 <- sqrt(mean(error.knn.lat.2^2))
rmse.knn.lat.2
rsquared.knn.lat.2 <- 1 - (sum(error.knn.lat.2^2) / sum((build.2.lat.v$build.2.v.LATITUDE-mean(build.2.lat.v$build.2.v.LATITUDE))^2))
rsquared.knn.lat.2 <- rsquared.knn.lat.2 * 100
rsquared.knn.lat.2
#Longitude
knn.long.2 <- train(build.2.LONGITUDE ~ .,
sample.build.2.long,
method = "knn",
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.knn.long.2 <- predict(knn.long.2, build.2.long.v)
error.knn.long.2 <- pred.knn.long.2 - build.2.long.v$build.2.v.LONGITUDE
rmse.knn.long.2 <- sqrt(mean(error.knn.long.2^2))
rmse.knn.long.2
rsquared.knn.long.2 <- 1 - (sum(error.knn.long.2^2) / sum((build.2.long.v$build.2.v.LONGITUDE-mean(build.2.long.v$build.2.v.LONGITUDE))^2))
rsquared.knn.long.2 <- rsquared.knn.long.2 * 100
rsquared.knn.long.2
#Floor
knn.floor.2 <- train(build.2.FLOOR ~ .,
sample.build.2.floor,
method = "knn",
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.knn.floor.2 <- predict(knn.floor.2, build.2.floor.v)
conf.matrix.knn.floor.2 <- table(pred.knn.floor.2, build.2.floor.v$build.2.v.FLOOR)
accuracy.knn.floor.2 <- (sum(diag(conf.matrix.knn.floor.2))) / sum(conf.matrix.knn.floor.2)
accuracy.knn.floor.2 <- accuracy.knn.floor.2 * 100
accuracy.knn.floor.2
To save computational cost, for this algorithm I fixed the mtry parameter at mtry = 32, which gave the best accuracy among 2, 32 and 520.
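Recent caret versions expect the tuning grid for method = "ranger" to have three columns (mtry, splitrule and min.node.size), so a grid with only mtry may be rejected depending on the installed version. Below is a sketch of a grid with the three mtry candidates mentioned above. The values "variance" and 5 for the other two parameters are assumptions suited to the regression targets, not settings taken from this report; for the floor models a classification split rule such as "gini" would be needed instead.

```r
# Candidate mtry values mentioned in the text.
mtry_candidates <- c(2, 32, 520)

# Assumed settings for the other two ranger tuning parameters (regression).
rf_grid <- expand.grid(mtry = mtry_candidates,
                       splitrule = "variance",
                       min.node.size = 5,
                       stringsAsFactors = FALSE)
rf_grid
# The grid would then be passed as train(..., method = "ranger", tuneGrid = rf_grid).
```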
#Random Forest - Build 0
#Latitude
rfor.lat.0 <- train(build.0.LATITUDE ~ .,
sample.build.0.lat,
method = "ranger",
tuneGrid=data.frame(mtry=32),
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.rfor.lat.0 <- predict(rfor.lat.0, build.0.lat.v)
error.rfor.lat.0 <- pred.rfor.lat.0 - build.0.lat.v$build.0.v.LATITUDE
rmse.rfor.lat.0 <- sqrt(mean(error.rfor.lat.0^2))
rmse.rfor.lat.0
rsquared.rfor.lat.0 <- 1 - (sum(error.rfor.lat.0^2) / sum((build.0.lat.v$build.0.v.LATITUDE-mean(build.0.lat.v$build.0.v.LATITUDE))^2))
rsquared.rfor.lat.0 <- rsquared.rfor.lat.0 * 100
rsquared.rfor.lat.0
#Longitude
rfor.long.0 <- train(build.0.LONGITUDE ~ .,
sample.build.0.long,
method = "ranger",
tuneGrid=data.frame(mtry=32),
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.rfor.long.0 <- predict(rfor.long.0, build.0.long.v)
error.rfor.long.0 <- pred.rfor.long.0 - build.0.long.v$build.0.v.LONGITUDE
rmse.rfor.long.0 <- sqrt(mean(error.rfor.long.0^2))
rmse.rfor.long.0
rsquared.rfor.long.0 <- 1 - (sum(error.rfor.long.0^2) / sum((build.0.long.v$build.0.v.LONGITUDE-mean(build.0.long.v$build.0.v.LONGITUDE))^2))
rsquared.rfor.long.0 <- rsquared.rfor.long.0 * 100
rsquared.rfor.long.0
#Floor
rfor.floor.0 <- train(build.0.FLOOR ~ .,
sample.build.0.floor,
method = "ranger",
tuneGrid=data.frame(mtry=32),
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.rfor.floor.0 <- predict(rfor.floor.0, build.0.floor.v)
conf.matrix.rfor.floor.0 <- table(pred.rfor.floor.0, build.0.floor.v$build.0.v.FLOOR)
accuracy.rfor.floor.0 <- (sum(diag(conf.matrix.rfor.floor.0))) / sum(conf.matrix.rfor.floor.0)
accuracy.rfor.floor.0 <- accuracy.rfor.floor.0 * 100
accuracy.rfor.floor.0
#Random Forest - Build 1
#Latitude
rfor.lat.1 <- train(build.1.LATITUDE ~ .,
sample.build.1.lat,
method = "ranger",
tuneGrid=data.frame(mtry=32),
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.rfor.lat.1 <- predict(rfor.lat.1, build.1.lat.v)
error.rfor.lat.1 <- pred.rfor.lat.1 - build.1.lat.v$build.1.v.LATITUDE
rmse.rfor.lat.1 <- sqrt(mean(error.rfor.lat.1^2))
rmse.rfor.lat.1
rsquared.rfor.lat.1 <- 1 - (sum(error.rfor.lat.1^2) / sum((build.1.lat.v$build.1.v.LATITUDE-mean(build.1.lat.v$build.1.v.LATITUDE))^2))
rsquared.rfor.lat.1 <- rsquared.rfor.lat.1 * 100
rsquared.rfor.lat.1
#Longitude
rfor.long.1 <- train(build.1.LONGITUDE ~ .,
sample.build.1.long,
method = "ranger",
tuneGrid=data.frame(mtry=32),
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.rfor.long.1 <- predict(rfor.long.1, build.1.long.v)
error.rfor.long.1 <- pred.rfor.long.1 - build.1.long.v$build.1.v.LONGITUDE
rmse.rfor.long.1 <- sqrt(mean(error.rfor.long.1^2))
rmse.rfor.long.1
rsquared.rfor.long.1 <- 1 - (sum(error.rfor.long.1^2) / sum((build.1.long.v$build.1.v.LONGITUDE-mean(build.1.long.v$build.1.v.LONGITUDE))^2))
rsquared.rfor.long.1 <- rsquared.rfor.long.1 * 100
rsquared.rfor.long.1
#Floor
rfor.floor.1 <- train(build.1.FLOOR ~ .,
sample.build.1.floor,
method = "ranger",
tuneGrid=data.frame(mtry=32),
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.rfor.floor.1 <- predict(rfor.floor.1, build.1.floor.v)
conf.matrix.rfor.floor.1 <- table(pred.rfor.floor.1, build.1.floor.v$build.1.v.FLOOR)
accuracy.rfor.floor.1 <- (sum(diag(conf.matrix.rfor.floor.1))) / sum(conf.matrix.rfor.floor.1)
accuracy.rfor.floor.1 <- accuracy.rfor.floor.1 * 100
accuracy.rfor.floor.1
#Random Forest - Build 2
#Latitude
rfor.lat.2 <- train(build.2.LATITUDE ~ .,
sample.build.2.lat,
method = "ranger",
tuneGrid=data.frame(mtry=32),
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.rfor.lat.2 <- predict(rfor.lat.2, build.2.lat.v)
error.rfor.lat.2 <- pred.rfor.lat.2 - build.2.lat.v$build.2.v.LATITUDE
rmse.rfor.lat.2 <- sqrt(mean(error.rfor.lat.2^2))
rmse.rfor.lat.2
rsquared.rfor.lat.2 <- 1 - (sum(error.rfor.lat.2^2) / sum((build.2.lat.v$build.2.v.LATITUDE-mean(build.2.lat.v$build.2.v.LATITUDE))^2))
rsquared.rfor.lat.2 <- rsquared.rfor.lat.2 * 100
rsquared.rfor.lat.2
#Longitude
rfor.long.2 <- train(build.2.LONGITUDE ~ .,
sample.build.2.long,
method = "ranger",
tuneGrid=data.frame(mtry=32),
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.rfor.long.2 <- predict(rfor.long.2, build.2.long.v)
error.rfor.long.2 <- pred.rfor.long.2 - build.2.long.v$build.2.v.LONGITUDE
rmse.rfor.long.2 <- sqrt(mean(error.rfor.long.2^2))
rmse.rfor.long.2
rsquared.rfor.long.2 <- 1 - (sum(error.rfor.long.2^2) / sum((build.2.long.v$build.2.v.LONGITUDE-mean(build.2.long.v$build.2.v.LONGITUDE))^2))
rsquared.rfor.long.2 <- rsquared.rfor.long.2 * 100
rsquared.rfor.long.2
#Floor
rfor.floor.2 <- train(build.2.FLOOR ~ .,
sample.build.2.floor,
method = "ranger",
tuneGrid=data.frame(mtry=32),
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.rfor.floor.2 <- predict(rfor.floor.2, build.2.floor.v)
conf.matrix.rfor.floor.2 <- table(pred.rfor.floor.2, build.2.floor.v$build.2.v.FLOOR)
accuracy.rfor.floor.2 <- (sum(diag(conf.matrix.rfor.floor.2))) / sum(conf.matrix.rfor.floor.2)
accuracy.rfor.floor.2 <- accuracy.rfor.floor.2 * 100
accuracy.rfor.floor.2
LATITUDE
| Algorithm | RMSE | R squared |
|---|---|---|
| 0.knn | 4.830 | 97.741 |
| 0.R.forest | 3.479 | 98.828 |
| 1.knn | 7.219 | 96.052 |
| 1.R.forest | 5.491 | 97.716 |
| 2.knn | 5.694 | 96.021 |
| 2.R.forest | 5.545 | 96.228 |
LONGITUDE
| Algorithm | RMSE | R squared |
|---|---|---|
| 0.knn | 5.981 | 94.389 |
| 0.R.forest | 4.297 | 97.103 |
| 1.knn | 7.021 | 97.964 |
| 1.R.forest | 5.507 | 98.748 |
| 2.knn | 8.245 | 92.152 |
| 2.R.forest | 7.282 | 93.878 |
FLOOR
| Algorithm | Accuracy |
|---|---|
| 0.knn | 93.226 |
| 0.R.forest | 99.139 |
| 1.knn | 95.309 |
| 1.R.forest | 99.347 |
| 2.knn | 97.446 |
| 2.R.forest | 99.413 |
The floor models are classification problems, so a confusion matrix shows which floors were predicted better and which worse:
#KNN
#Build 0
conf.matrix.knn.floor.0
##
## pred.knn.floor.0 0 1 2 3
## 0 323 26 2 0
## 1 8 378 12 1
## 2 0 7 448 24
## 3 0 0 57 449
#Build 1
conf.matrix.knn.floor.1
##
## pred.knn.floor.1 0 1 2 3
## 0 390 25 1 0
## 1 11 447 4 0
## 2 0 8 421 5
## 3 5 4 23 299
#Build 2
conf.matrix.knn.floor.2
##
## pred.knn.floor.2 0 1 2 3 4
## 0 570 23 4 2 2
## 1 2 681 6 0 0
## 2 0 8 472 3 0
## 3 0 0 14 800 3
## 4 0 0 1 6 356
#Random Forest
#Build 0
conf.matrix.rfor.floor.0
##
## pred.rfor.floor.0 0 1 2 3
## 0 326 1 0 0
## 1 5 410 1 0
## 2 0 0 515 2
## 3 0 0 3 472
#Build 1
conf.matrix.rfor.floor.1
##
## pred.rfor.floor.1 0 1 2 3
## 0 405 1 0 0
## 1 1 477 0 0
## 2 0 5 445 5
## 3 0 1 4 299
#Build 2
conf.matrix.rfor.floor.2
##
## pred.rfor.floor.2 0 1 2 3 4
## 0 570 1 0 0 0
## 1 2 709 3 1 1
## 2 0 1 485 0 0
## 3 0 1 9 810 4
## 4 0 0 0 0 356
To see which model works best, I gathered each performance metric (R squared and RMSE) for every model. R squared should be as high as possible, and RMSE (Root Mean Squared Error) as low as possible. For the floor feature I only use accuracy (% of correctly classified instances).
#Create dataframes
#rmse
latitude.rmse <- data.frame(metricas = c("KNN.0", "RFOREST.0", "KNN.1", "RFOREST.1", "KNN.2", "RFOREST.2"),
values = c(rmse.knn.lat.0, rmse.rfor.lat.0, rmse.knn.lat.1, rmse.rfor.lat.1,
rmse.knn.lat.2, rmse.rfor.lat.2))
longitude.rmse <- data.frame(metricas = c("KNN.0", "RFOREST.0", "KNN.1", "RFOREST.1", "KNN.2", "RFOREST.2"),
values = c(rmse.knn.long.0, rmse.rfor.long.0, rmse.knn.long.1, rmse.rfor.long.1,
rmse.knn.long.2, rmse.rfor.long.2))
#rsquared
latitude.rsquared <- data.frame(metricas = c("KNN.0", "RFOREST.0", "KNN.1", "RFOREST.1", "KNN.2", "RFOREST.2"),
values = c(rsquared.knn.lat.0, rsquared.rfor.lat.0, rsquared.knn.lat.1, rsquared.rfor.lat.1,
rsquared.knn.lat.2, rsquared.rfor.lat.2))
longitude.rsquared <- data.frame(metricas = c("KNN.0", "RFOREST.0", "KNN.1", "RFOREST.1", "KNN.2", "RFOREST.2"),
values = c(rsquared.knn.long.0, rsquared.rfor.long.0, rsquared.knn.long.1, rsquared.rfor.long.1, rsquared.knn.long.2, rsquared.rfor.long.2))
floor.accuracy <- data.frame(metricas = c("KNN.0", "RFOREST.0", "KNN.1", "RFOREST.1", "KNN.2", "RFOREST.2"),
values = c(accuracy.knn.floor.0, accuracy.rfor.floor.0, accuracy.knn.floor.1, accuracy.rfor.floor.1, accuracy.knn.floor.2, accuracy.rfor.floor.2))
For a clear visualization, I made a plot for each feature and each metric. Since ggplot2 sorts a character x axis alphabetically, I had to set the order explicitly.
#Order x axis
latitude.rmse$metricas <- as.character(latitude.rmse$metricas)
latitude.rmse$metricas <- factor(latitude.rmse$metricas, levels=unique(latitude.rmse$metricas))
longitude.rmse$metricas <- as.character(longitude.rmse$metricas)
longitude.rmse$metricas <- factor(longitude.rmse$metricas, levels=unique(longitude.rmse$metricas))
latitude.rsquared$metricas <- as.character(latitude.rsquared$metricas)
latitude.rsquared$metricas <- factor(latitude.rsquared$metricas, levels=unique(latitude.rsquared$metricas))
longitude.rsquared$metricas <- as.character(longitude.rsquared$metricas)
longitude.rsquared$metricas <- factor(longitude.rsquared$metricas, levels=unique(longitude.rsquared$metricas))
floor.accuracy$metricas <- as.character(floor.accuracy$metricas)
floor.accuracy$metricas <- factor(floor.accuracy$metricas, levels=unique(floor.accuracy$metricas))
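The effect of `levels = unique(...)` can be checked on a small vector. This sketch uses only base R and four of the labels from the data frames above.

```r
labels <- c("KNN.0", "RFOREST.0", "KNN.1", "RFOREST.1")

# Default: factor levels are sorted alphabetically.
alphabetical <- levels(factor(labels))

# With explicit levels: the order of first appearance is kept.
as_entered <- levels(factor(labels, levels = unique(labels)))
```

ggplot2 draws a discrete axis in the order of the factor levels, which is why the second form keeps each building's kNN and Random Forest bars next to each other.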
The plots were built with the following parameters:
#Latitude plots
a <- latitude.rmse %>%
ggplot(aes(x = metricas, y = values)) +
geom_col(aes(fill = metricas)) +
geom_text(aes(label = round(values, digits = 3)), colour = "black") +
coord_flip() +
labs(x = "Metrics for each Building",
y = "RMSE",
title = "LATITUDE") +
theme_light() +
scale_fill_brewer(palette = "GnBu") +
theme(legend.position="none")
d <- latitude.rsquared %>%
ggplot(aes(x = metricas, y = values)) +
geom_col(aes(fill = metricas)) +
geom_text(aes(label = round(values, digits = 3)), colour = "black") +
coord_flip() +
labs(x = "",
y = "RSQUARED",
title = "") +
theme_light() +
scale_fill_brewer(palette = "GnBu") +
theme(legend.position="none")
#Longitude plots
b <- longitude.rmse %>%
ggplot(aes(x = metricas, y = values)) +
geom_col(aes(fill = metricas)) +
geom_text(aes(label = round(values, digits = 3)), colour = "black") +
coord_flip() +
labs(x = "",
y = "RMSE",
title = "LONGITUDE") +
theme_light() +
scale_fill_brewer(palette = "OrRd") +
theme(legend.position="none")
e <- longitude.rsquared %>%
ggplot(aes(x = metricas, y = values)) +
geom_col(aes(fill = metricas)) +
geom_text(aes(label = round(values, digits = 3)), colour = "black") +
coord_flip() +
labs(x = "",
y = "RSQUARED",
title = "") +
theme_light() +
scale_fill_brewer(palette = "OrRd") +
theme(legend.position="none")
#Floor plots
f <- floor.accuracy %>%
ggplot(aes(x = metricas, y = values)) +
geom_col(aes(fill = metricas)) +
geom_text(aes(label = round(values, digits = 3)), colour = "black") +
coord_flip() +
labs(x = "",
y = "ACCURACY",
title = "FLOOR") +
theme_light() +
scale_fill_brewer(palette = "BuPu") +
theme(legend.position="none")
#All plots in one
lat.long.plots <- grid.arrange(a, d, b, e, f, ncol = 2)
As we can see, Random Forest is the best algorithm in all cases. This may be because it copes well with high-dimensional data, and here we have 520 predictors (WAPs). Looking at the Random Forest results per building: latitude and longitude both have the lowest error in building 0. The highest R squared for latitude is in building 0, whereas for longitude it is in building 1. For floor, the highest accuracy is in building 2, although accuracy is high in all buildings.
LATITUDE
| Algorithm | RMSE | R squared |
|---|---|---|
| 0.R.forest | 3.479 | 98.828 |
| 1.R.forest | 5.491 | 97.716 |
| 2.R.forest | 5.545 | 96.228 |
LONGITUDE
| Algorithm | RMSE | R squared |
|---|---|---|
| 0.R.forest | 4.297 | 97.103 |
| 1.R.forest | 5.507 | 98.748 |
| 2.R.forest | 7.282 | 93.878 |
FLOOR
| Algorithm | Accuracy |
|---|---|
| 0.R.forest | 99.139 |
| 1.R.forest | 99.347 |
| 2.R.forest | 99.413 |
Having seen that Random Forest is the best algorithm, we now use the whole training set to predict longitude, latitude and floor. First, we create a data frame with the WAPs (predictors) and each target feature, for both the training and the validation set.
#Create Data Frame
lat <- data.frame(training$LATITUDE, training[,1:520])
long <- data.frame(training$LONGITUDE, training[,1:520])
floor <- data.frame(training$FLOOR, training[,1:520])
floor$training.FLOOR <- as.factor(floor$training.FLOOR)
lat.v <- data.frame(validation$LATITUDE, validation[,1:520])
long.v <- data.frame(validation$LONGITUDE, validation[,1:520])
floor.v <- data.frame(validation$FLOOR, validation[,1:520])
#Latitude
lat.rfor <- train(training.LATITUDE ~ .,
lat,
method = "ranger",
tuneGrid=data.frame(mtry=32),
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.lat.rfor <- predict(lat.rfor, lat.v)
error.lat.rfor <- pred.lat.rfor - lat.v$validation.LATITUDE
rmse.lat.rfor <- sqrt(mean(error.lat.rfor^2))
rmse.lat.rfor
rsquared.lat.rfor <- 1 - (sum(error.lat.rfor^2) / sum((lat.v$validation.LATITUDE-mean(lat.v$validation.LATITUDE))^2))
rsquared.lat.rfor <- rsquared.lat.rfor * 100
rsquared.lat.rfor
#Longitude
long.rfor <- train(training.LONGITUDE ~ .,
long,
method = "ranger",
tuneGrid=data.frame(mtry=32),
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.long.rfor <- predict(long.rfor, long.v)
error.long.rfor <- pred.long.rfor - long.v$validation.LONGITUDE
rmse.long.rfor <- sqrt(mean(error.long.rfor^2))
rmse.long.rfor
rsquared.long.rfor <- 1 - (sum(error.long.rfor^2) / sum((long.v$validation.LONGITUDE-mean(long.v$validation.LONGITUDE))^2))
rsquared.long.rfor <- rsquared.long.rfor * 100
rsquared.long.rfor
#Floor
floor.rfor <- train(training.FLOOR ~ .,
floor,
method = "ranger",
tuneGrid=data.frame(mtry=32),
trControl = trainControl(method = "cv",
number = 5,
verboseIter = TRUE),
preProcess = c("zv", "medianImpute"))
pred.floor.rfor <- predict(floor.rfor, floor.v)
conf.matrix.rfor.floor <- table(pred.floor.rfor, floor.v$validation.FLOOR)
accuracy.rfor.floor <- (sum(diag(conf.matrix.rfor.floor))) / sum(conf.matrix.rfor.floor)
accuracy.rfor.floor <- accuracy.rfor.floor * 100
accuracy.rfor.floor
Here are the resulting metrics for each feature:
#Latitude
rmse.lat.rfor
## [1] 6.599523
rsquared.lat.rfor
## [1] 99.03976
#Longitude
rmse.long.rfor
## [1] 8.375887
rsquared.long.rfor
## [1] 99.54325
#Floor
conf.matrix.rfor.floor
##
## pred.floor.rfor 0 1 2 3 4
## 0 1302 9 0 13 1
## 1 7 1591 5 0 0
## 2 0 7 1448 8 0
## 3 0 0 12 1568 5
## 4 0 0 0 0 355
accuracy.rfor.floor
## [1] 98.94172
The prediction error (RMSE) is about 6.6 meters in the north-south direction (latitude) and about 8.4 meters in the east-west direction (longitude).
This error range is suitable for predicting position in indoor spaces, taking into account that the error of GPS (the usual outdoor location system) ranges from about 3 to 15 meters, depending on the quality of the device.
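The two per-axis errors can be combined into a single horizontal error. Because both RMSE values were computed on the same validation rows, the RMSE of the 2D distance is the square root of the sum of their squares. A short sketch using the two values printed above:

```r
rmse_lat <- 6.599523    # RMSE for latitude reported above, in meters
rmse_long <- 8.375887   # RMSE for longitude reported above, in meters

# RMSE of the horizontal (2D) position error.
rmse_2d <- sqrt(rmse_lat^2 + rmse_long^2)
round(rmse_2d, 2)       # about 10.66
```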
For the floor feature we can see which floor has been predicted best and which worst:
conf.matrix.rfor.floor
##
## pred.floor.rfor 0 1 2 3 4
## 0 1302 9 0 13 1
## 1 7 1591 5 0 0
## 2 0 7 1448 8 0
## 3 0 0 12 1568 5
## 4 0 0 0 0 355
| Floor | % correctly predicted |
|---|---|
| 0 | 98.129 |
| 1 | 99.338 |
| 2 | 98.651 |
| 3 | 99.872 |
| 4 | 99.715 |
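A per-floor percentage of this kind can be derived from a confusion matrix. The sketch below uses a made-up 3x3 matrix, not the one printed above, and assumes the same layout as in this report: rows are predicted floors and columns are real floors.

```r
# Made-up confusion matrix: rows = predicted floor, columns = actual floor.
cm <- matrix(c(90,  5,  0,
                8, 80,  2,
                2, 15, 98),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = 0:2, actual = 0:2))

# Share of each actual floor that was predicted correctly (per-class recall).
per_floor <- 100 * diag(cm) / colSums(cm)
round(per_floor, 1)
```

Dividing by `rowSums(cm)` instead would give the share of each predicted floor that was correct (per-class precision), so it is worth stating which of the two a table reports.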
The best predicted floor is floor 3, and the worst is floor 0. This may be because on floor 0 more users are outside or in front of the door, so the signal was not captured correctly.
RELATIVEPOSITION range: 1 - Inside, 2 - Outside in Front of the door
summary(build.0$RELATIVEPOSITION)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 2.000 1.809 2.000 2.000
summary(build.1$RELATIVEPOSITION)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.599 2.000 2.000
summary(build.2$RELATIVEPOSITION)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.00 2.00 1.77 2.00 2.00
We now have a location system for indoor spaces. Here it has been applied to a university campus, but many other uses are possible to improve the user experience. For example, with indoor location via Wi-Fi, users in a shopping center could get customized offers and advertising based on where they are. Likewise, knowing the most crowded zones would make it possible to place advertising strategically. In culture, the system could be implemented in museums and galleries to inform users when they pass a place of interest, giving them a versatile and comfortable experience.
Joaquín Torres-Sospedra, Raúl Montoliu, Adolfo Martínez-Usó, Tomás J. Arnau, Joan P. Avariento, Mauri Benedito-Bordonau, Joaquín Huerta. UJIIndoorLoc: A New Multi-building and Multi-floor Database for WLAN Fingerprint-based Indoor Localization Problems. In Proceedings of the Fifth International Conference on Indoor Positioning and Indoor Navigation, 2014. https://archive.ics.uci.edu/ml/datasets/ujiindoorloc
@ jessica.gonzalez.d8@gmail.com linkedin: https://www.linkedin.com/in/jessica-gonzalezd/