Techniques for WiFi Locationing

This report explains the process of analyzing, visualizing, and preprocessing the data, building models, testing them, and analyzing their errors. The dataset used for this task is the UJIIndoorLoc Data Set, a multi-building, multi-floor indoor localization database for testing indoor positioning systems that rely on WLAN/WiFi fingerprinting. The dataset can be found at http://archive.ics.uci.edu/ml/datasets/UJIIndoorLoc. It covers three buildings of Universitat Jaume I, each with 4 or more floors, over almost 110,000 m². The database consists of 19937 training/reference records (trainingData.csv) and 1111 validation/test records (validationData.csv). The 529 attributes contain the WiFi fingerprint, the coordinates where it was taken, and other useful information.

# LOADING DATASETS ----

library(dplyr) # distinct(), filtering, data manipulation
library(caret) # nearZeroVar(), createDataPartition(), preProcess(), train(), confusionMatrix()
library(class) # knn()

trainingData <- read.csv("trainingData.csv", stringsAsFactors = FALSE)
validationData <- read.csv("validationData.csv", stringsAsFactors = FALSE)
trainingData$inTrain <- TRUE
validationData$inTrain <- FALSE
locationData <- rbind(trainingData, validationData) # Merging the two datasets

Starting with the Data and Preprocessing

The first thing I did after loading the data was to merge the two given datasets (training and validation). This was done in order to preprocess them at the same time, but I later discovered that the validation dataset was not representative of the data, so using it to validate the models was not the best choice. I therefore chose to re-split the data later on into train, test, and validation sets. I then proceeded to change the types of the variables to factor or date-time as needed.

# INSPECTING, PREPROCESSING, VISUALIZATIONS ----

# Transform data types
# locationData[1:520] <- sapply(locationData[1:520], as.numeric)
factorCols <- c("SPACEID", "USERID", "PHONEID", "RELATIVEPOSITION", "BUILDINGID", "FLOOR")
locationData[, factorCols] <- lapply(locationData[, factorCols], factor) # to factors
locationData$TIMESTAMP <- as.POSIXct(as.numeric(locationData$TIMESTAMP),
                                     origin = "1970-01-01", tz = "GMT") # Unix time to date-time

# Removing duplicated rows (637 duplicates found)
locationData <- distinct(locationData)

# Remove WAP columns with zero variance in either split (access points that never recorded a signal)
WAPS_VarTrain <- nearZeroVar(locationData[locationData$inTrain == TRUE, 1:520], saveMetrics = TRUE)
WAPS_VarValid <- nearZeroVar(locationData[locationData$inTrain == FALSE, 1:520], saveMetrics = TRUE)
locationData <- locationData[, -which(WAPS_VarTrain$zeroVar == TRUE | WAPS_VarValid$zeroVar == TRUE)]

As part of the preprocessing, the 637 duplicated rows were removed. The dataset contained no missing values. All WAP columns with zero variance (access points that never recorded a signal in either split) were also removed, which reduced the number of WAPs from 520 to 315.

Visualizing the Data

A 3D plot of the data from each building was created in order to understand where the data points come from and to examine their distribution. Plotting the buildings in 3D also showed that they have the same shape as the campus of Universitat Jaume I, where the dataset was collected, as can be seen at the link below. https://www.google.es/maps/place/Jaume+I+University/@39.9915504,-0.0682044,516a,35y,32.49h,14.15t/data=!3m1!1e3!4m5!3m4!1s0x0:0x1368bf53b3a7fb3f!8m2!3d39.9945711!4d-0.0689003
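
A minimal sketch of how such a 3D plot can be produced, assuming the plotly package (FLOOR is converted back to numeric for the z axis):

library(plotly)
# One point per fingerprint, coloured by building
plot_ly(locationData,
        x = ~LONGITUDE, y = ~LATITUDE, z = ~as.numeric(as.character(FLOOR)),
        color = ~BUILDINGID, type = "scatter3d", mode = "markers",
        marker = list(size = 2))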

Inspecting the Given Split of the Dataset

As part of the visualization we wanted to see whether the given split of the data (train and validation) would be suitable for building and testing the models. The plot below shows the distribution of the samples across the two datasets.
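
A sketch of how such a comparison plot can be produced, assuming ggplot2 and the inTrain flag added when merging:

library(ggplot2)
# Compare the spatial coverage of the two given sets, one panel per building
ggplot(locationData, aes(x = LONGITUDE, y = LATITUDE, colour = inTrain)) +
  geom_point(alpha = 0.3, size = 0.5) +
  facet_wrap(~ BUILDINGID) +
  labs(colour = "Training set")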

As we can see, especially in the case of building 3, there are very few data points in the validation set, and the other buildings also lack enough validation data. Seeing these differences in distribution made us re-split the merged dataset later on into train, test, and validation sets, so that each set contains representative data.

Changing the Signal Values

The signal value for each WAP ranged over negative integers from -104 to 0, with the positive value +100 used when no signal was detected for that WAP. I chose to change the values to a more intuitive scale, where 0 means no signal and 105 is the strongest signal. We also removed the rows that had no signal across all WAPs.

WAPS <- grep("WAP", names(locationData), value = TRUE) # gets all the WAP names
locationData[, WAPS] <- sapply(locationData[, WAPS], function(x) ifelse(x == 100, -105, x)) # recode 100 (no signal) as -105
locationData[, WAPS] <- locationData[, WAPS] + 105 # shift all WAP values to a positive scale (0 = no signal)

# Remove rows with a 0 signal across all WAPs (no access point was detected for that record)
locationData <- locationData[rowSums(locationData[, WAPS]) != 0, ]

Feature Engineering

Three new features were added for each data point: the highest signal value, the lowest detected signal value, and the number of WAPs with a signal.

# Add features: highest signal, lowest detected signal, and number of WAPs with a signal
locationData$HIGHESTSIGNAL <- apply(locationData[, WAPS], 1, max)
locationData$LOWESTSIGNAL <- apply(locationData[, WAPS], 1, function(x) min(x[x > 0]))
locationData$NUMBERCONNECTIONS <- apply(locationData[, WAPS], 1, function(x) sum(x > 0))

Highest signal distribution

After plotting the distribution of the highest signal values, we noticed that the curve of the distribution started falling at a signal of 47. This made us remove the rows with values under 45, because the signals were too low. We also removed the extremely high values (> 100), because we believe they are impossible to obtain in real life.
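
A minimal sketch of this filtering step, with the thresholds taken from the description above (the exact cut-offs used originally may differ):

# Inspect the distribution, then keep only rows whose strongest signal is plausible
ggplot(locationData, aes(x = HIGHESTSIGNAL)) + geom_histogram(binwidth = 1)
locationData <- locationData[locationData$HIGHESTSIGNAL >= 45 &
                               locationData$HIGHESTSIGNAL <= 100, ]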

Splitting the Data

We split the data into three data frames: train, test, and validation. We used the createDataPartition() function to first set aside 60 % of the data for the training set, then split the remaining 40 % in half to obtain the test and validation sets. Since we would be using distance-based models, we also built a standardized version of these datasets by centering and scaling the signal values of each WAP.

# DATA SPLIT ----
# We do this again because the initial split was not representative, as we saw in the plots above
# We also created a validation set in order to better evaluate the models
set.seed(123) # fix the random seed so the split is reproducible (seed value assumed)
indicesTraining <- createDataPartition(locationData$BUILDINGID, p = 0.6, list = FALSE)
dfTraining <- locationData[indicesTraining, ] # Training set, 60 %
dfLeftoverTraining <- locationData[-indicesTraining, ]
indicesTest <- createDataPartition(dfLeftoverTraining$BUILDINGID, p = 0.5, list = FALSE)
dfTest <- dfLeftoverTraining[indicesTest, ] # Test set, 20 % of total
dfValidation <- dfLeftoverTraining[-indicesTest, ] # Validation set, 20 % of total

# STANDARDIZING DATA FOR DISTANCE BASED MODELS ----

# Saving the WAP names in a vector
WAPs <- grep("WAP", names(locationData), value = TRUE)

# Calculate the pre-processing parameters from the dataset
preprocessParams <- preProcess(locationData[WAPs], method = c("center", "scale"))

# Transform the WAPs using the parameters
stand_waps <- predict(preprocessParams, locationData[WAPs])

# Complete dataset
stand_dataset <- cbind(stand_waps,
                       BUILDINGID = locationData$BUILDINGID,
                       LONGITUDE = locationData$LONGITUDE,
                       LATITUDE = locationData$LATITUDE,
                       HIGHESTSIGNAL = locationData$HIGHESTSIGNAL,
                       FLOOR = locationData$FLOOR,
                       LOWESTSIGNAL = locationData$LOWESTSIGNAL,
                       NUMBERCONNECTIONS = locationData$NUMBERCONNECTIONS)

# DATA SPLIT STANDARDIZED DATA ----
# Reuse the partition indices from above so the standardized and raw splits contain the same rows
dfTrainingStand <- stand_dataset[indicesTraining, ] # Training set, 60 %
dfLeftoverTrainingS <- stand_dataset[-indicesTraining, ]
dfTestStand <- dfLeftoverTrainingS[indicesTest, ] # Test set, 20 % of total
dfValidationStand <- dfLeftoverTrainingS[-indicesTest, ] # Validation set, 20 % of total

Model Building and Testing

For this task we built two models for each variable we had to predict (BUILDINGID, FLOOR, LATITUDE, and LONGITUDE). For the building ID and floor number we had a classification problem, and for the latitude and longitude a regression problem.

Models for Predicting Building ID

To predict the building ID I chose to test two models, KNN and SVM. Both models did very well, with the SVM (linear kernel, C = 1) doing slightly better than KNN.

Tune KNN

The number of neighbours for the KNN model was chosen using the loop below. For this dataset, k = 1 looked like the best option. The same loop was also used in the case of the floor to find the best number of neighbours.

# Tune k by accuracy on the validation set; BUILDINGID is the target, so it is
# dropped from the predictors together with the other location variables
dropCols <- c("BUILDINGID", "LONGITUDE", "LATITUDE", "FLOOR")
k.optm <- numeric(3)
for (i in 1:3) {
  knn.mod <- knn(train = dfTrainingStand[, !(names(dfTrainingStand) %in% dropCols)],
                 test = dfValidationStand[, !(names(dfValidationStand) %in% dropCols)],
                 cl = dfTrainingStand$BUILDINGID, k = i)
  k.optm[i] <- 100 * sum(dfValidationStand$BUILDINGID == knn.mod) / nrow(dfValidationStand)
  cat(i, "=", k.optm[i], "\n")
}
## 1 = 100
## 2 = 99.86264
## 3 = 99.95421
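
With k = 1 selected, the final model was evaluated on the test and validation sets. A sketch of the evaluation call, reusing the objects defined above (knnPredBuilding is an illustrative name):

# Predict the building on the held-out test set and compute the confusion matrix
knnPredBuilding <- knn(train = dfTrainingStand[, !(names(dfTrainingStand) %in% dropCols)],
                       test = dfTestStand[, !(names(dfTestStand) %in% dropCols)],
                       cl = dfTrainingStand$BUILDINGID, k = 1)
confusionMatrix(knnPredBuilding, dfTestStand$BUILDINGID)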

KNN building ID prediction performance

The results of testing the KNN model on the test set were:

##    user  system elapsed 
##   22.46    0.06   23.21
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2
##          0  768    0    0
##          1    1  416    0
##          2    0    0 1001
## 
## Overall Statistics
##                                      
##                Accuracy : 0.9995     
##                  95% CI : (0.9975, 1)
##     No Information Rate : 0.4579     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 0.9993     
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.9987   1.0000   1.0000
## Specificity            1.0000   0.9994   1.0000
## Pos Pred Value         1.0000   0.9976   1.0000
## Neg Pred Value         0.9993   1.0000   1.0000
## Prevalence             0.3518   0.1903   0.4579
## Detection Rate         0.3513   0.1903   0.4579
## Detection Prevalence   0.3513   0.1908   0.4579
## Balanced Accuracy      0.9993   0.9997   1.0000

The results of testing the KNN model on the validation set were:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2
##          0  769    0    0
##          1    0  415    0
##          2    0    0 1000
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9983, 1)
##     No Information Rate : 0.4579     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            1.0000     1.00   1.0000
## Specificity            1.0000     1.00   1.0000
## Pos Pred Value         1.0000     1.00   1.0000
## Neg Pred Value         1.0000     1.00   1.0000
## Prevalence             0.3521     0.19   0.4579
## Detection Rate         0.3521     0.19   0.4579
## Detection Prevalence   0.3521     0.19   0.4579
## Balanced Accuracy      1.0000     1.00   1.0000

As we can see, the KNN model built to predict the building ID performed almost perfectly on the test set (a single misclassified record) and perfectly on the validation set. We need this prediction to be essentially perfect because it will be used as an input to the next models when we deploy them on the blind dataset.

SVM building ID prediction performance

The results of testing the SVM model on the test set were:

# 2. SVM BUILDING (CHOSEN) ----
# set.seed(123)
# ctrl <- trainControl(method = "cv", number = 10)
# system.time(svmFit <- train(BUILDINGID ~ .,
#                             data = dfTrainingStand[, -which(names(dfTrainingStand) %in% c("LONGITUDE", "LATITUDE", "FLOOR"))],
#                             method = "svmLinear",
#                             trControl = ctrl))
svmFit <- readRDS("svmBuilding.rds") # load the previously trained model
# Check results on the test dataset
svmTest <- predict(svmFit, newdata = dfTestStand[, -which(names(dfTestStand) %in% c("LONGITUDE", "LATITUDE", "FLOOR"))])
print(svmCMT <- confusionMatrix(svmTest, dfTestStand$BUILDINGID)) # Confusion matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2
##          0  769    0    0
##          1    0  416    0
##          2    0    0 1001
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9983, 1)
##     No Information Rate : 0.4579     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000
## Prevalence             0.3518   0.1903   0.4579
## Detection Rate         0.3518   0.1903   0.4579
## Detection Prevalence   0.3518   0.1903   0.4579
## Balanced Accuracy      1.0000   1.0000   1.0000

The results of testing the SVM model on the validation set were:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2
##          0  769    0    0
##          1    0  415    0
##          2    0    0 1000
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9983, 1)
##     No Information Rate : 0.4579     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            1.0000     1.00   1.0000
## Specificity            1.0000     1.00   1.0000
## Pos Pred Value         1.0000     1.00   1.0000
## Neg Pred Value         1.0000     1.00   1.0000
## Prevalence             0.3521     0.19   0.4579
## Detection Rate         0.3521     0.19   0.4579
## Detection Prevalence   0.3521     0.19   0.4579
## Balanced Accuracy      1.0000     1.00   1.0000

Models for Predicting Floor Number

For predicting the floor number we used the same approach as for the building, with the same two models, KNN and SVM. The SVM again used a linear kernel with cost C = 1. For KNN, the best number of neighbours turned out to be 5, so we used k = 5. We used the building ID as a predictor for these models because, on the blind set, we will add its prediction to the other predictors, given that it was 100 % accurate in our testing. A sketch of this setup is shown below.
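
A minimal sketch of the floor SVM, assuming the same caret setup as for the building model (svmFitFloor is an illustrative name; BUILDINGID stays among the predictors, as described above):

set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)
# Only the coordinates are dropped; BUILDINGID is kept as a predictor
svmFitFloor <- train(FLOOR ~ .,
                     data = dfTrainingStand[, -which(names(dfTrainingStand) %in% c("LONGITUDE", "LATITUDE"))],
                     method = "svmLinear",
                     trControl = ctrl)
# On a hypothetical blind set, the building prediction would be added first:
# blindData$BUILDINGID <- predict(svmFit, newdata = blindData)
# blindFloor <- predict(svmFitFloor, newdata = blindData)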

KNN floor prediction performance

The results of testing the KNN model on the test set were:

##    user  system elapsed 
##   23.09    0.09   24.79
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4
##          0 431   6   0   2   0
##          1   1 562   5   0   0
##          2   0   7 493   7   0
##          3   0   0   9 562   6
##          4   0   0   0   7  88
## 
## Overall Statistics
##                                        
##                Accuracy : 0.9771       
##                  95% CI : (0.97, 0.983)
##     No Information Rate : 0.2644       
##     P-Value [Acc > NIR] : < 2.2e-16    
##                                        
##                   Kappa : 0.9702       
##                                        
##  Mcnemar's Test P-Value : NA           
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9977   0.9774   0.9724   0.9723  0.93617
## Specificity            0.9954   0.9963   0.9917   0.9907  0.99665
## Pos Pred Value         0.9818   0.9894   0.9724   0.9740  0.92632
## Neg Pred Value         0.9994   0.9920   0.9917   0.9901  0.99713
## Prevalence             0.1976   0.2630   0.2319   0.2644  0.04300
## Detection Rate         0.1972   0.2571   0.2255   0.2571  0.04026
## Detection Prevalence   0.2008   0.2598   0.2319   0.2640  0.04346
## Balanced Accuracy      0.9966   0.9868   0.9820   0.9815  0.96641

The results of testing the KNN model on the validation set were:

##    user  system elapsed 
##   24.35    0.07   25.97
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4
##          0 456   7   0   0   0
##          1   3 536   7   0   0
##          2   0   4 497  11   0
##          3   0   1   9 547   1
##          4   0   1   0   2 102
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9789         
##                  95% CI : (0.972, 0.9845)
##     No Information Rate : 0.2564         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9726         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9935   0.9763   0.9688   0.9768  0.99029
## Specificity            0.9959   0.9939   0.9910   0.9932  0.99856
## Pos Pred Value         0.9849   0.9817   0.9707   0.9803  0.97143
## Neg Pred Value         0.9983   0.9921   0.9904   0.9920  0.99952
## Prevalence             0.2102   0.2514   0.2349   0.2564  0.04716
## Detection Rate         0.2088   0.2454   0.2276   0.2505  0.04670
## Detection Prevalence   0.2120   0.2500   0.2344   0.2555  0.04808
## Balanced Accuracy      0.9947   0.9851   0.9799   0.9850  0.99442

SVM floor prediction performance

The results of testing the SVM model on the test set were:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4
##          0 431   0   0   0   0
##          1   1 575   2   0   0
##          2   0   0 504   0   0
##          3   0   0   1 576   1
##          4   0   0   0   2  93
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9968          
##                  95% CI : (0.9934, 0.9987)
##     No Information Rate : 0.2644          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9958          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9977   1.0000   0.9941   0.9965  0.98936
## Specificity            1.0000   0.9981   1.0000   0.9988  0.99904
## Pos Pred Value         1.0000   0.9948   1.0000   0.9965  0.97895
## Neg Pred Value         0.9994   1.0000   0.9982   0.9988  0.99952
## Prevalence             0.1976   0.2630   0.2319   0.2644  0.04300
## Detection Rate         0.1972   0.2630   0.2306   0.2635  0.04254
## Detection Prevalence   0.1972   0.2644   0.2306   0.2644  0.04346
## Balanced Accuracy      0.9988   0.9991   0.9970   0.9976  0.99420

The results of testing the SVM model on the validation set were:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4
##          0 459   1   0   0   0
##          1   0 548   1   0   0
##          2   0   0 512   0   0
##          3   0   0   0 559   2
##          4   0   0   0   1 101
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9977          
##                  95% CI : (0.9947, 0.9993)
##     No Information Rate : 0.2564          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.997           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000   0.9982   0.9981   0.9982  0.98058
## Specificity            0.9994   0.9994   1.0000   0.9988  0.99952
## Pos Pred Value         0.9978   0.9982   1.0000   0.9964  0.99020
## Neg Pred Value         1.0000   0.9994   0.9994   0.9994  0.99904
## Prevalence             0.2102   0.2514   0.2349   0.2564  0.04716
## Detection Rate         0.2102   0.2509   0.2344   0.2560  0.04625
## Detection Prevalence   0.2106   0.2514   0.2344   0.2569  0.04670
## Balanced Accuracy      0.9997   0.9988   0.9990   0.9985  0.99005

Models for Predicting Latitude and Longitude

To solve the regression problem of predicting the longitude and latitude, we applied the same method as for the building and floor classification: we tried two models, random forest and KNN regression, for each coordinate.
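
A minimal sketch of how one of these models can be trained and evaluated with caret (rfFitLat, the predictor selection, and the default tuning are illustrative; the report does not show the original training calls):

# Predict latitude from the WAP signals, engineered features, and building ID
predCols <- c(WAPs, "HIGHESTSIGNAL", "LOWESTSIGNAL", "NUMBERCONNECTIONS", "BUILDINGID", "LATITUDE")
set.seed(123)
rfFitLat <- train(LATITUDE ~ ., data = dfTraining[, predCols], method = "rf")
rfPredLat <- predict(rfFitLat, newdata = dfTest)
postResample(pred = rfPredLat, obs = dfTest$LATITUDE) # RMSE, Rsquared, MAE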

Models for Predicting Latitude

Random Forest Latitude prediction performance

Metrics on the test set:

##      RMSE  Rsquared       MAE 
## 2.5090602 0.9987657 1.3562495

Metrics on the validation set:

##      RMSE  Rsquared       MAE 
## 2.6040049 0.9986972 1.3676621

KNN Latitude prediction performance

Metrics on the test set:

##      RMSE  Rsquared       MAE 
## 4.8700911 0.9954751 2.3034323

Metrics on the validation set:

##      RMSE  Rsquared       MAE 
## 5.1863992 0.9946624 2.1910701

Models for Predicting Longitude

Random Forest Longitude prediction performance

Metrics on the test set:

##      RMSE  Rsquared       MAE 
## 2.1248809 0.9997545 1.1021418

Metrics on the validation set:

##      RMSE  Rsquared       MAE 
## 2.3596028 0.9996901 1.1273789

KNN Longitude prediction performance

Metrics on the test set:

##      RMSE  Rsquared       MAE 
## 4.0375491 0.9990575 1.6697196

Metrics on the validation set:

##      RMSE  Rsquared       MAE 
## 4.1945788 0.9990116 1.6226327

Error Analysis

The plots below show the error distributions of the KNN and random forest models for both latitude and longitude.
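
A sketch of how such an error distribution can be computed and plotted, assuming the prediction object from the latitude sketch above:

# Signed prediction errors of the random forest latitude model on the test set
latErrors <- data.frame(error = rfPredLat - dfTest$LATITUDE)
ggplot(latErrors, aes(x = error)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Random forest latitude errors", x = "Error (m)")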