Introduction

The objective of this project was to build classifiers that predict an indoor location from the RSSI readings of 13 iBeacons. The data was collected in Waldo Library, Western Michigan University, using an iPhone 6S, and was sourced from the UCI Machine Learning Repository. The project is divided into two phases. Phase I, which this report covers, puts the emphasis on data preprocessing and exploration. Section two describes the data and its attributes. Section three covers data preprocessing and transformation. Section four first explores the target feature and then continues with univariate and multivariate exploration. Prediction model building will be covered in Phase II of the project.

Data set

The UCI Machine Learning Repository provides two data sets (Mohammadi et al. (2017)): one labeled and one unlabeled. For the purpose of this course only the labeled data set will be used. It contains the RSSI readings of the 13 iBeacons and the location, together with a timestamp of when the readings were made. The data set has 1420 observations.
In this project we decided not to split the data set into test and training data, since in Phase II we will build classifiers from the entire data set and use cross-validation to evaluate their performance.

Library map showing the target feature location (Mohammadi et al. (2017)).

Target feature

The target feature is a location on a map of the library (see figure ). The target feature (labeled location) consists of a horizontal coordinate expressed as a letter and a vertical coordinate expressed as a number on the map.

As an example, if the target feature is K03 it means that the location is column K and row 3.

Descriptive features

All descriptive features are RSSI signal readings collected from the iBeacons. The iBeacons are named after their position on the map (see figure ), from b3001 to b3013, where the last two digits correspond to the position. All readings are integers.
An RSSI reading of -200 means that the given iBeacon is out of reach. A greater RSSI value indicates closer proximity to the iBeacon: a reading of -45 represents a shorter distance to an iBeacon than a reading of -125.
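
The encoding described above can be illustrated with a minimal sketch. The reading vector is invented for illustration and is not taken from the data set; only the -200 sentinel and the "greater means closer" rule come from the description.

```r
# Hypothetical RSSI readings for a single observation (invented values).
reading <- c(b3001 = -200, b3002 = -45, b3003 = -125, b3004 = -200)

# -200 is the sentinel for "beacon out of reach".
in_range <- reading > -200

# Among the beacons in range, the largest (least negative) value is the closest.
closest <- names(which.max(reading[in_range]))

print(in_range)
print(closest)
```

In this made-up observation two beacons are in range, and b3002 is the closer one.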

Data Pre-processing

Preliminaries

The following R packages were used in this project.

library(knitr)
library(mlr)
library(rmarkdown)
library(ggplot2)
library(corrplot)
library(GGally)
library(pander)

A short description of the packages used:

mlr is an R package for classification and regression techniques. It is used multiple times throughout this project (Bischl et al. (2016)).
ggplot2 is used for data visualisation (Wickham (2009)).
corrplot is used to display the correlation matrix of the descriptive features graphically (Wei and Simko (2017)).
GGally is an extension of ggplot2 that simplifies some graphical visualisations (Schloerke et al. (2017)).
pander is used for exporting and converting outputs to PDF in a presentable way (Daróczi and Tsegelskyi (2017)).

We read the data from the CSV file and store it as a data frame. We decided not to split the data set into a test and a training set, since we will use five-fold cross-validation to measure the performance of our prediction models in Phase II.

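# stringsAsFactors = TRUE keeps location and date as factors (as in the output below) on R >= 4.0
indoor_local <- read.csv("iBeacon_RSSI_Labeled.csv", stringsAsFactors = TRUE)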
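
As a rough illustration of the five-fold cross-validation mentioned above, the sketch below assigns each of the 1420 observations to one of five folds using base R only. It is not the resampling code used in Phase II; the seed and the variable names are assumptions made for this example.

```r
set.seed(1)        # arbitrary seed, only for reproducibility of the sketch
n_obs   <- 1420    # number of observations in the labeled data set
n_folds <- 5

# Shuffle the fold labels 1..5 so that every fold gets the same number of rows.
fold_id <- sample(rep(seq_len(n_folds), length.out = n_obs))

fold_sizes <- table(fold_id)
print(fold_sizes)

# Rows used for testing and training in the first of the five iterations.
test_rows  <- which(fold_id == 1)
train_rows <- which(fold_id != 1)
```

Every observation is used for testing exactly once and for training four times, which is why no separate test set is held back.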

Data Cleaning and Transformation

First we get a general overview of the data set and check the dataset for missing values, white spaces and misspellings. From our findings we transform and prepare the data set for visualisation and model building.

Data Cleaning

First we use the head and dim functions to get a first look at the data set.

pander(head(indoor_local),split.table = 85, style = "rmarkdown", caption = "Overview of the dataset before data transformation")

Overview of the dataset before data transformation (continued below)
location	date	b3001	b3002	b3003	b3004	b3005
O02	10-18-2016 11:15:21	-200	-200	-200	-200	-200
P01	10-18-2016 11:15:19	-200	-200	-200	-200	-200
P01	10-18-2016 11:15:17	-200	-200	-200	-200	-200
P01	10-18-2016 11:15:15	-200	-200	-200	-200	-200
P01	10-18-2016 11:15:13	-200	-200	-200	-200	-200
P01	10-18-2016 11:15:11	-200	-200	-82	-200	-200

b3006	b3007	b3008	b3009	b3010	b3011	b3012	b3013
-78	-200	-200	-200	-200	-200	-200	-200
-78	-200	-200	-200	-200	-200	-200	-200
-77	-200	-200	-200	-200	-200	-200	-200
-77	-200	-200	-200	-200	-200	-200	-200
-77	-200	-200	-200	-200	-200	-200	-200
-200	-200	-200	-200	-200	-200	-200	-200

dim_df <- as.data.frame(dim(indoor_local))
rownames(dim_df) <- c("Instances", "Features")
colnames(dim_df) <- c("Dimension")
kable(dim_df,caption = "Dimensions of data", row.names = TRUE)

Dimensions of data
	Dimension
Instances	1420
Features	15

From the output of the head function we notice that the file was loaded correctly.
Using the dim function we see that the dimensions of the data frame are as expected with 1420 instances and 15 features.

We use the summarizeColumns and str functions to get an overview of the data set.

kable(summarizeColumns(indoor_local), caption = "Summary of data before data transformation")

Summary of data before data transformation
name	type	mean	disp	median	mad	min	max	nlevs
location	factor	NA	0.9760563	NA	NA	2	34	105
date	factor	NA	0.9992958	NA	NA	1	1	1420
b3001	integer	-197.8254	16.2591046	-200	0	-200	-67	0
b3002	integer	-156.6239	60.2177467	-200	0	-200	-59	0
b3003	integer	-175.5331	49.4529584	-200	0	-200	-56	0
b3004	integer	-164.5345	56.5232607	-200	0	-200	-56	0
b3005	integer	-178.3782	47.1757986	-200	0	-200	-60	0
b3006	integer	-175.0634	49.5966273	-200	0	-200	-62	0
b3007	integer	-195.6373	22.8809805	-200	0	-200	-58	0
b3008	integer	-191.9704	30.7337421	-200	0	-200	-56	0
b3009	integer	-197.1451	19.1602072	-200	0	-200	-55	0
b3010	integer	-197.4423	17.7416322	-200	0	-200	-61	0
b3011	integer	-197.7486	16.8525347	-200	0	-200	-59	0
b3012	integer	-197.2338	18.5410878	-200	0	-200	-60	0
b3013	integer	-196.0655	22.0539237	-200	0	-200	-59	0

From this we see:

There are 105 levels of the target feature. Looking at the map (see figure ) there are $18*21=378$ possible levels.
All descriptive features range from -200 to around -60.
No minimum or maximum value of the descriptive features deviates markedly from the other values.
There are no NA values.
location and date are factors.
There are 1420 different timestamps. We do not want to use them for prediction.
All 13 descriptive features are integers.

We now use the str function to understand the internal structure of the data frame.

str(indoor_local)

## 'data.frame':    1420 obs. of  15 variables:
##  $ location: Factor w/ 105 levels "D13","D14","D15",..: 59 64 64 64 64 64 64 65 77 77 ...
##  $ date    : Factor w/ 1420 levels "10-18-2016 10:35:20",..: 600 599 598 597 596 595 594 593 592 591 ...
##  $ b3001   : int  -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
##  $ b3002   : int  -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
##  $ b3003   : int  -200 -200 -200 -200 -200 -82 -80 -86 -200 -200 ...
##  $ b3004   : int  -200 -200 -200 -200 -200 -200 -200 -200 -75 -75 ...
##  $ b3005   : int  -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
##  $ b3006   : int  -78 -78 -77 -77 -77 -200 -77 -200 -200 -200 ...
##  $ b3007   : int  -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
##  $ b3008   : int  -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
##  $ b3009   : int  -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
##  $ b3010   : int  -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
##  $ b3011   : int  -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
##  $ b3012   : int  -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
##  $ b3013   : int  -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...

From this we see:

There appear to be no abnormalities such as misspellings, white spaces or missing values.

We now use the unique function to check if there are any misspellings in the target feature.

unique(indoor_local$location)

##   [1] O02 P01 P02 R01 R02 S01 S02 T01 U02 U01 J03 K03 L03 M03 N03 O03 P03
##  [18] Q03 R03 S03 T03 U03 U04 T04 S04 R04 Q04 P04 O04 N04 M04 L04 K04 J04
##  [35] I04 I05 J05 K05 L05 M05 N05 O05 P05 Q05 R05 S05 T05 U05 S06 R06 Q06
##  [52] P06 O06 N06 M06 L06 K06 J06 I06 F08 J02 J07 I07 I10 J10 D15 E15 G15
##  [69] J15 L15 R15 T15 W15 I08 I03 J08 I01 I02 J01 K01 K02 L01 L02 M01 M02
##  [86] N01 N02 O01 I09 D14 D13 K07 K08 N15 P15 I15 S15 U15 V15 S07 S08 L09
## [103] L08 Q02 Q01
## 105 Levels: D13 D14 D15 E15 F08 G15 I01 I02 I03 I04 I05 I06 I07 I08 ... W15

The following observations are made from the output:

We did not find any misspellings.
We did discover some levels of the target feature containing V and W, which is not expected when looking at the map (figure ). This means that there are $23*18=414$ possible target levels.
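
Besides inspecting the output of unique by eye, the label format can be checked mechanically. The sketch below is one possible way to do this; the example labels are invented, and the pattern (one capital letter from A to W followed by two digits between 01 and 18) is an assumption based on the levels listed above.

```r
# Hypothetical labels; "k03" and "P1 " are deliberately malformed.
labels <- c("O02", "P01", "k03", "W15", "P1 ")

# One capital letter A-W followed by exactly two digits.
well_formed <- grepl("^[A-W][0-9]{2}$", labels)

# The row number must also lie between 1 and 18.
row_num <- suppressWarnings(as.integer(substr(labels, 2, 3)))
valid   <- well_formed & !is.na(row_num) & row_num >= 1 & row_num <= 18

print(labels[!valid])
```

An empty result on the real location column would support the observation that there are no misspellings or stray white spaces.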

Since only 105 out of 414, or roughly 25 %, of the possible target levels are present in the data set, we want to plot these 105 on the map to see whether they are spread out or located in the same area. One option is to split the target feature into its x- and y-coordinates. We should consider this, since we would then have two target features with at most 23 and 18 levels respectively. If the target feature is split into its x- and y-coordinates, it is necessary to build two independent prediction models: one to predict the x-coordinate and one to predict the y-coordinate.

So far we have not found any outliers, but we do a visual check for outliers during data visualisation (Section 4.2).

Data Transformation

In this section the data is transformed and prepared for model building. We leave data normalisation to Phase II of the project.

Transforming negative RSSI signal readings to positive values

We add 200 to all signal measurements, which turns the out-of-reach value of -200 into zero and all other readings into positive numbers. We can do this because all signals are measured on the same scale and -200 is the minimum of every feature.

indoor_local[,3:15] <- indoor_local[,3:15] + 200
pander(head(indoor_local),split.table = 85, style = "rmarkdown", caption = "Overview of the dataset after adding 200 to all RSSI signal readings ")

Overview of the dataset after adding 200 to all RSSI signal readings (continued below)
location	date	b3001	b3002	b3003	b3004	b3005
O02	10-18-2016 11:15:21	0	0	0	0	0
P01	10-18-2016 11:15:19	0	0	0	0	0
P01	10-18-2016 11:15:17	0	0	0	0	0
P01	10-18-2016 11:15:15	0	0	0	0	0
P01	10-18-2016 11:15:13	0	0	0	0	0
P01	10-18-2016 11:15:11	0	0	118	0	0

b3006	b3007	b3008	b3009	b3010	b3011	b3012	b3013
122	0	0	0	0	0	0	0
122	0	0	0	0	0	0	0
123	0	0	0	0	0	0	0
123	0	0	0	0	0	0	0
123	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0

Note that a value of zero now indicates that no signal from a given iBeacon was detected.
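
The effect of the shift can be checked on a small example. The data frame below is invented and only mimics the structure of the RSSI columns; the operation itself is the same addition of 200.

```r
# Hypothetical raw readings for three observations and two beacons.
raw <- data.frame(b3001 = c(-200, -82, -200),
                  b3002 = c(-78, -200, -77))

shifted <- raw + 200   # same shift as applied to the RSSI columns

# After the shift, 0 marks "no signal" and larger values mean closer proximity.
no_signal <- shifted == 0

print(shifted)
print(colSums(no_signal))
```

Because the same constant is added everywhere, the ordering of the readings and their spread are unchanged; only the location of the scale moves.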

Splitting Target Feature

Now we want to split the target feature into x- and y-coordinates. From our earlier exploration of the data set we found that only roughly 25 % of the possible target feature levels were present in the data set. By splitting the target feature into x- and y-coordinates we get two target features with a higher percentage of their possible levels present. Furthermore, we get more instances for each target feature level in each new target feature.
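
The gain from splitting can be quantified with a short calculation. The level counts 105, 19 and 13 are the nlevs values reported in the summary tables of this report; the 23 by 18 grid is the one derived above. The variable names are chosen for this sketch only.

```r
# Levels present in the data versus levels possible on the 23 x 18 grid.
present  <- c(location = 105, x = 19, y = 13)
possible <- c(location = 23 * 18, x = 23, y = 18)

coverage <- round(100 * present / possible, 1)   # percent of levels present
avg_inst <- round(1420 / present, 1)             # average instances per level

print(coverage)
print(avg_inst)
```

Under these counts the coverage rises from about a quarter of the combined levels to well over two thirds for each coordinate, and the average number of instances per level grows accordingly.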

To search through "location" and split it into x and y, we convert "location" to characters. We use the unique function to check that the conversion went right.

indoor_local$location <- as.character(indoor_local$location)
unique(indoor_local$location)

##   [1] "O02" "P01" "P02" "R01" "R02" "S01" "S02" "T01" "U02" "U01" "J03"
##  [12] "K03" "L03" "M03" "N03" "O03" "P03" "Q03" "R03" "S03" "T03" "U03"
##  [23] "U04" "T04" "S04" "R04" "Q04" "P04" "O04" "N04" "M04" "L04" "K04"
##  [34] "J04" "I04" "I05" "J05" "K05" "L05" "M05" "N05" "O05" "P05" "Q05"
##  [45] "R05" "S05" "T05" "U05" "S06" "R06" "Q06" "P06" "O06" "N06" "M06"
##  [56] "L06" "K06" "J06" "I06" "F08" "J02" "J07" "I07" "I10" "J10" "D15"
##  [67] "E15" "G15" "J15" "L15" "R15" "T15" "W15" "I08" "I03" "J08" "I01"
##  [78] "I02" "J01" "K01" "K02" "L01" "L02" "M01" "M02" "N01" "N02" "O01"
##  [89] "I09" "D14" "D13" "K07" "K08" "N15" "P15" "I15" "S15" "U15" "V15"
## [100] "S07" "S08" "L09" "L08" "Q02" "Q01"

The conversion went right and therefore we can start splitting the target feature. We start by finding all the x-values:

indoor_local$x[grepl("A", indoor_local$location, ignore.case=FALSE)] <- "A"
indoor_local$x[grepl("B", indoor_local$location, ignore.case=FALSE)] <- "B"
indoor_local$x[grepl("C", indoor_local$location, ignore.case=FALSE)] <- "C"
indoor_local$x[grepl("D", indoor_local$location, ignore.case=FALSE)] <- "D"
indoor_local$x[grepl("E", indoor_local$location, ignore.case=FALSE)] <- "E"
indoor_local$x[grepl("F", indoor_local$location, ignore.case=FALSE)] <- "F"
indoor_local$x[grepl("G", indoor_local$location, ignore.case=FALSE)] <- "G"
indoor_local$x[grepl("H", indoor_local$location, ignore.case=FALSE)] <- "H"
indoor_local$x[grepl("I", indoor_local$location, ignore.case=FALSE)] <- "I"
indoor_local$x[grepl("J", indoor_local$location, ignore.case=FALSE)] <- "J"
indoor_local$x[grepl("K", indoor_local$location, ignore.case=FALSE)] <- "K"
indoor_local$x[grepl("L", indoor_local$location, ignore.case=FALSE)] <- "L"
indoor_local$x[grepl("M", indoor_local$location, ignore.case=FALSE)] <- "M"
indoor_local$x[grepl("N", indoor_local$location, ignore.case=FALSE)] <- "N"
indoor_local$x[grepl("O", indoor_local$location, ignore.case=FALSE)] <- "O"
indoor_local$x[grepl("P", indoor_local$location, ignore.case=FALSE)] <- "P"
indoor_local$x[grepl("Q", indoor_local$location, ignore.case=FALSE)] <- "Q"
indoor_local$x[grepl("R", indoor_local$location, ignore.case=FALSE)] <- "R"
indoor_local$x[grepl("S", indoor_local$location, ignore.case=FALSE)] <- "S"
indoor_local$x[grepl("T", indoor_local$location, ignore.case=FALSE)] <- "T"
indoor_local$x[grepl("U", indoor_local$location, ignore.case=FALSE)] <- "U"
indoor_local$x[grepl("V", indoor_local$location, ignore.case=FALSE)] <- "V"
indoor_local$x[grepl("W", indoor_local$location, ignore.case=FALSE)] <- "W"

We also assign each x-coordinate a numerical value for plotting.

indoor_local$x_num[grepl("A", indoor_local$location, ignore.case=FALSE)] <- 1
indoor_local$x_num[grepl("B", indoor_local$location, ignore.case=FALSE)] <- 2
indoor_local$x_num[grepl("C", indoor_local$location, ignore.case=FALSE)] <- 3
indoor_local$x_num[grepl("D", indoor_local$location, ignore.case=FALSE)] <- 4
indoor_local$x_num[grepl("E", indoor_local$location, ignore.case=FALSE)] <- 5
indoor_local$x_num[grepl("F", indoor_local$location, ignore.case=FALSE)] <- 6
indoor_local$x_num[grepl("G", indoor_local$location, ignore.case=FALSE)] <- 7
indoor_local$x_num[grepl("H", indoor_local$location, ignore.case=FALSE)] <- 8
indoor_local$x_num[grepl("I", indoor_local$location, ignore.case=FALSE)] <- 9
indoor_local$x_num[grepl("J", indoor_local$location, ignore.case=FALSE)] <- 10
indoor_local$x_num[grepl("K", indoor_local$location, ignore.case=FALSE)] <- 11
indoor_local$x_num[grepl("L", indoor_local$location, ignore.case=FALSE)] <- 12
indoor_local$x_num[grepl("M", indoor_local$location, ignore.case=FALSE)] <- 13
indoor_local$x_num[grepl("N", indoor_local$location, ignore.case=FALSE)] <- 14
indoor_local$x_num[grepl("O", indoor_local$location, ignore.case=FALSE)] <- 15
indoor_local$x_num[grepl("P", indoor_local$location, ignore.case=FALSE)] <- 16
indoor_local$x_num[grepl("Q", indoor_local$location, ignore.case=FALSE)] <- 17
indoor_local$x_num[grepl("R", indoor_local$location, ignore.case=FALSE)] <- 18
indoor_local$x_num[grepl("S", indoor_local$location, ignore.case=FALSE)] <- 19
indoor_local$x_num[grepl("T", indoor_local$location, ignore.case=FALSE)] <- 20
indoor_local$x_num[grepl("U", indoor_local$location, ignore.case=FALSE)] <- 21
indoor_local$x_num[grepl("V", indoor_local$location, ignore.case=FALSE)] <- 22
indoor_local$x_num[grepl("W", indoor_local$location, ignore.case=FALSE)] <- 23

And we do the same for the y-coordinates:

indoor_local$y[grepl("01", indoor_local$location, ignore.case=FALSE)] <- "01"
indoor_local$y[grepl("02", indoor_local$location, ignore.case=FALSE)] <- "02"
indoor_local$y[grepl("03", indoor_local$location, ignore.case=FALSE)] <- "03"
indoor_local$y[grepl("04", indoor_local$location, ignore.case=FALSE)] <- "04"
indoor_local$y[grepl("05", indoor_local$location, ignore.case=FALSE)] <- "05"
indoor_local$y[grepl("06", indoor_local$location, ignore.case=FALSE)] <- "06"
indoor_local$y[grepl("07", indoor_local$location, ignore.case=FALSE)] <- "07"
indoor_local$y[grepl("08", indoor_local$location, ignore.case=FALSE)] <- "08"
indoor_local$y[grepl("09", indoor_local$location, ignore.case=FALSE)] <- "09"
indoor_local$y[grepl("10", indoor_local$location, ignore.case=FALSE)] <- "10"
indoor_local$y[grepl("11", indoor_local$location, ignore.case=FALSE)] <- "11"
indoor_local$y[grepl("12", indoor_local$location, ignore.case=FALSE)] <- "12"
indoor_local$y[grepl("13", indoor_local$location, ignore.case=FALSE)] <- "13"
indoor_local$y[grepl("14", indoor_local$location, ignore.case=FALSE)] <- "14"
indoor_local$y[grepl("15", indoor_local$location, ignore.case=FALSE)] <- "15"
indoor_local$y[grepl("16", indoor_local$location, ignore.case=FALSE)] <- "16"
indoor_local$y[grepl("17", indoor_local$location, ignore.case=FALSE)] <- "17"
indoor_local$y[grepl("18", indoor_local$location, ignore.case=FALSE)] <- "18"

We also save the y-coordinate as a number for plotting:

indoor_local$y_num[grepl("01", indoor_local$location, ignore.case=FALSE)] <- 1
indoor_local$y_num[grepl("02", indoor_local$location, ignore.case=FALSE)] <- 2
indoor_local$y_num[grepl("03", indoor_local$location, ignore.case=FALSE)] <- 3
indoor_local$y_num[grepl("04", indoor_local$location, ignore.case=FALSE)] <- 4
indoor_local$y_num[grepl("05", indoor_local$location, ignore.case=FALSE)] <- 5
indoor_local$y_num[grepl("06", indoor_local$location, ignore.case=FALSE)] <- 6
indoor_local$y_num[grepl("07", indoor_local$location, ignore.case=FALSE)] <- 7
indoor_local$y_num[grepl("08", indoor_local$location, ignore.case=FALSE)] <- 8
indoor_local$y_num[grepl("09", indoor_local$location, ignore.case=FALSE)] <- 9
indoor_local$y_num[grepl("10", indoor_local$location, ignore.case=FALSE)] <- 10
indoor_local$y_num[grepl("11", indoor_local$location, ignore.case=FALSE)] <- 11
indoor_local$y_num[grepl("12", indoor_local$location, ignore.case=FALSE)] <- 12
indoor_local$y_num[grepl("13", indoor_local$location, ignore.case=FALSE)] <- 13
indoor_local$y_num[grepl("14", indoor_local$location, ignore.case=FALSE)] <- 14
indoor_local$y_num[grepl("15", indoor_local$location, ignore.case=FALSE)] <- 15
indoor_local$y_num[grepl("16", indoor_local$location, ignore.case=FALSE)] <- 16
indoor_local$y_num[grepl("17", indoor_local$location, ignore.case=FALSE)] <- 17
indoor_local$y_num[grepl("18", indoor_local$location, ignore.case=FALSE)] <- 18

We now convert location and the x- and y-coordinates to factors.

indoor_local$x <- factor(indoor_local$x)
indoor_local$y <- factor(indoor_local$y)
indoor_local$location <- factor(indoor_local$location)
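
Because every label has the fixed form of one letter followed by two digits, the same four columns can also be derived by position instead of one grepl call per level. The sketch below shows this alternative on invented labels; it is not the code used in this report.

```r
# Hypothetical location labels in the same format as the data set.
location <- c("O02", "P01", "E15", "W15", "I10")

x     <- substr(location, 1, 1)   # column letter
y     <- substr(location, 2, 3)   # row as two-digit text
x_num <- match(x, LETTERS)        # A = 1, B = 2, ..., W = 23
y_num <- as.integer(y)            # "02" becomes 2

print(data.frame(location, x, x_num, y, y_num))
```

Deriving the numbers with match and as.integer avoids typing each level by hand, which makes a mistyped value in a single line impossible.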

Removing date column

We also remove the date column since we are not going to use it for our predictions.

indoor_local$date <- NULL

Preparing data frames for model building

Since the target feature was split in two, we create one data frame each for building the x- and y-coordinate prediction models.

X-coordinate data frame

We define a new data frame and remove all attributes that are not necessary for predicting the x-coordinate.

data_x <- indoor_local

data_x$y <- NULL
data_x$y_num <- NULL
data_x$x_num <- NULL
data_x$location <- NULL

kable(summarizeColumns(data_x), caption = "Summary of x data frame after transformation\\label{data_x}")

Summary of x data frame after transformation
name	type	mean	disp	median	mad	min	max	nlevs
b3001	numeric	2.174648	16.2591046	0	0	0	133	0
b3002	numeric	43.376056	60.2177467	0	0	0	141	0
b3003	numeric	24.466901	49.4529584	0	0	0	144	0
b3004	numeric	35.465493	56.5232607	0	0	0	144	0
b3005	numeric	21.621831	47.1757986	0	0	0	140	0
b3006	numeric	24.936620	49.5966273	0	0	0	138	0
b3007	numeric	4.362676	22.8809805	0	0	0	142	0
b3008	numeric	8.029578	30.7337421	0	0	0	144	0
b3009	numeric	2.854930	19.1602072	0	0	0	145	0
b3010	numeric	2.557746	17.7416322	0	0	0	139	0
b3011	numeric	2.251409	16.8525347	0	0	0	141	0
b3012	numeric	2.766197	18.5410878	0	0	0	140	0
b3013	numeric	3.934507	22.0539237	0	0	0	141	0
x	factor	NA	0.8577465	NA	NA	4	202	19

Y-coordinate data frame

We define a new data frame and remove all attributes that are not necessary for predicting the y-coordinate.

data_y <- indoor_local

data_y$x <- NULL
data_y$x_num <- NULL
data_y$y_num <- NULL
data_y$location <- NULL

kable(summarizeColumns(data_y), caption = "Summary of y data frame after transformation\\label{data_y}")

Summary of y data frame after transformation
name	type	mean	disp	median	mad	min	max	nlevs
b3001	numeric	2.174648	16.2591046	0	0	0	133	0
b3002	numeric	43.376056	60.2177467	0	0	0	141	0
b3003	numeric	24.466901	49.4529584	0	0	0	144	0
b3004	numeric	35.465493	56.5232607	0	0	0	144	0
b3005	numeric	21.621831	47.1757986	0	0	0	140	0
b3006	numeric	24.936620	49.5966273	0	0	0	138	0
b3007	numeric	4.362676	22.8809805	0	0	0	142	0
b3008	numeric	8.029578	30.7337421	0	0	0	144	0
b3009	numeric	2.854930	19.1602072	0	0	0	145	0
b3010	numeric	2.557746	17.7416322	0	0	0	139	0
b3011	numeric	2.251409	16.8525347	0	0	0	141	0
b3012	numeric	2.766197	18.5410878	0	0	0	140	0
b3013	numeric	3.934507	22.0539237	0	0	0	141	0
y	factor	NA	0.8323944	NA	NA	4	238	13
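
An equivalent way to build the two data frames is to select the columns to keep rather than deleting the unwanted ones. The miniature data frame below is invented for illustration; selecting the beacon columns by the name prefix "b30" is an assumption that matches the column names in this data set.

```r
# Hypothetical miniature version of the transformed data frame.
df <- data.frame(location = c("O02", "P01"),
                 b3001 = c(0, 118), b3002 = c(122, 0),
                 x = c("O", "P"), x_num = c(15, 16),
                 y = c("02", "01"), y_num = c(2, 1))

beacons <- grep("^b30", names(df), value = TRUE)   # descriptive features

data_x <- df[, c(beacons, "x")]   # features plus x-coordinate target
data_y <- df[, c(beacons, "y")]   # features plus y-coordinate target

print(names(data_x))
print(names(data_y))
```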

Summary after data transformation

We use summarizeColumns to get an overview of the data after data transformation.

kable(summarizeColumns(indoor_local), caption = "Summary of data after data transformation")

Summary of data after data transformation
name	type	mean	disp	median	mad	min	max	nlevs
location	factor	NA	0.9760563	NA	NA	2	34	105
b3001	numeric	2.174648	16.2591046	0	0.0000	0	133	0
b3002	numeric	43.376056	60.2177467	0	0.0000	0	141	0
b3003	numeric	24.466901	49.4529584	0	0.0000	0	144	0
b3004	numeric	35.465493	56.5232607	0	0.0000	0	144	0
b3005	numeric	21.621831	47.1757986	0	0.0000	0	140	0
b3006	numeric	24.936620	49.5966273	0	0.0000	0	138	0
b3007	numeric	4.362676	22.8809805	0	0.0000	0	142	0
b3008	numeric	8.029578	30.7337421	0	0.0000	0	144	0
b3009	numeric	2.854930	19.1602072	0	0.0000	0	145	0
b3010	numeric	2.557746	17.7416322	0	0.0000	0	139	0
b3011	numeric	2.251409	16.8525347	0	0.0000	0	141	0
b3012	numeric	2.766197	18.5410878	0	0.0000	0	140	0
b3013	numeric	3.934507	22.0539237	0	0.0000	0	141	0
x	factor	NA	0.8577465	NA	NA	4	202	19
x_num	numeric	13.654225	4.1723970	13	4.4478	4	23	0
y	factor	NA	0.8323944	NA	NA	4	238	13
y_num	numeric	5.169014	3.6182760	4	2.9652	1	15	0

We notice:

We now have 4 new columns:
- x, which is the x-coordinate target feature as a factor.
- y, which is the y-coordinate target feature as a factor.
- x_num, which is a numeric value of the x-coordinate used for plotting.
- y_num, which is a numeric value of the y-coordinate used for plotting.
All RSSI signals are now zero or greater since we added 200 to the existing values.
The date column was removed.

We also have two new data frames for building prediction models for the x- and y-coordinates respectively (see table and table ). Notice that we leave normalization or standardization to Phase II.

Data exploration

First we will explore the target feature. Afterwards the descriptive features will be explored individually, and finally a multivariate exploration is done.

Exploration of target feature

We start by plotting the instances of the target feature on a grid representing the library. The more transparent a location is, the less often it was visited during data collection.

# Semi-transparent points: locations with many instances appear darker
ggplot(indoor_local, aes(x_num, y_num)) +
  geom_point(colour = "blue", size = 5, alpha = 0.05) +
  scale_y_continuous(breaks = 1:18) +
  scale_x_continuous(breaks = 1:23, labels = LETTERS[1:23]) +
  ggtitle("Target feature instances") +
  xlab("x-coordinate") +
  ylab("y-coordinate")

Target feature instances.

From the plot we see that many locations in the library were not visited (consistent with our finding that only 105 levels of the target feature are present). The majority of the instances lie in an area bounded by the coordinates I1, I6, U1 and U6.
We also notice that there are no instances with x-coordinates A, B, C and H and no instances with y-coordinates 11, 12, 16, 17 and 18.

We will also take a look at how the instances are distributed along the x- and y-axis respectively.

Distribution of target feature instances along x- and y-axis

As expected, most of our instances lie between I and U on the x-axis (1359 of 1420) and between 1 and 6 on the y-axis (1131 of 1420).

For model building it would be preferable to have instances that are more evenly distributed across the target levels. Alternatively, we could decide to focus on predicting locations within the area bounded by I1, I6, U1 and U6.
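
Restricting the data to the area bounded by I1, I6, U1 and U6 would amount to a simple filter on the numeric coordinates. The sketch below uses invented observations and assumes the numbering A = 1, ..., W = 23 for the columns.

```r
# Hypothetical numeric coordinates (x_num: A = 1, ..., W = 23).
obs <- data.frame(x_num = c(9, 21, 4, 15, 23),
                  y_num = c(1, 6, 13, 7, 15))

# Columns I (9) to U (21) and rows 1 to 6.
in_area <- obs$x_num >= 9 & obs$x_num <= 21 &
           obs$y_num >= 1 & obs$y_num <= 6

core <- obs[in_area, ]
print(core)
print(mean(in_area))   # share of observations inside the area
```

After such a filter the target factors would need droplevels so that levels outside the area do not remain as empty classes.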

We also notice, somewhat unexpectedly, that a large number of instances are recorded at y-coordinate 15.

The data shown in the graphs is also summarized in table and table .

table_x <- as.data.frame(table(indoor_local$x))
table_x <- t(table_x)
rownames(table_x) <- c("X", "Freq")
kable(table_x, caption = "Frequency of X \\label{FreqX}")

Frequency of X
X	D	E	F	G	I	J	K	L	M	N	O	P	Q	R	S	T	U	V	W
Freq	24	4	4	4	202	192	142	100	85	86	86	71	74	91	136	39	55	8	17

table_y <- as.data.frame(table(indoor_local$y))
table_y <- t(table_y)
rownames(table_y) <- c("Y", "Freq")
kable(table_y, caption = "Frequency of Y\\label{FreqY}")

Frequency of Y
Y	01	02	03	04	05	06	07	08	09	10	13	14	15
Freq	138	155	200	238	213	187	75	58	10	20	6	4	116
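
The shares behind these observations can be computed directly from the two frequency tables. The counts below are copied from the tables above; the variable names are chosen for this sketch. Note that these are marginal shares per axis, not the share of instances inside the rectangle I1 to U6.

```r
# Frequencies copied from the two tables above.
freq_x <- c(D = 24, E = 4, F = 4, G = 4, I = 202, J = 192, K = 142, L = 100,
            M = 85, N = 86, O = 86, P = 71, Q = 74, R = 91, S = 136, T = 39,
            U = 55, V = 8, W = 17)
freq_y <- c("01" = 138, "02" = 155, "03" = 200, "04" = 238, "05" = 213,
            "06" = 187, "07" = 75, "08" = 58, "09" = 10, "10" = 20,
            "13" = 6, "14" = 4, "15" = 116)

cols_I_to_U <- LETTERS[9:21]   # the letters I through U
share_x  <- sum(freq_x[names(freq_x) %in% cols_I_to_U]) / sum(freq_x)
share_y  <- sum(freq_y[c("01", "02", "03", "04", "05", "06")]) / sum(freq_y)
share_15 <- freq_y[["15"]] / sum(freq_y)

print(round(100 * c(x_I_to_U = share_x, y_01_to_06 = share_y, y_15 = share_15), 1))
```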

Univariate exploration

We will look at each iBeacon separately and see if there is anything to notice. For each feature we will plot a histogram of the signal strengths. We will also plot a scatter plot showing the signal strength with respect to the x- and y-coordinates. This will help us understand the relationship between location and iBeacon signals.

b3001

The histogram shows that this iBeacon is detected in very few instances. This makes sense given the placement of the iBeacon. However, strong signals from this iBeacon are detected at x-coordinates F, I and L and at y-coordinates 4, 7 and 9.

Visualisation of iBeacon b3001

b3002

b3002 is detected in a relatively high number of instances and at a wide variety of locations. For each location there is also a relatively high variation in signal strength. We notice a signal detected at x-coordinate U, which could be an outlier.

Visualisation of iBeacon b3002

b3003

b3003 is also detected at a wide variety of locations: between 1 and 7 for the y-coordinate and from I to U for the x-coordinate. The signal strength varies the most at x-coordinates M and O and at y-coordinates 3 and 4.

Visualisation of iBeacon b3003

b3004

This iBeacon shows a similar tendency to b3003, but the largest variation in signal strength is found at x-coordinate S and y-coordinates 3 and 4. Furthermore, the detected signal strengths come very close to following a normal distribution.

Visualisation of iBeacon b3004

b3005

The signals from b3005 are strongest between I and O and between 1 and 10. The signal strength varies the most at y-coordinate 7, which is where the iBeacon is located.

Visualisation of iBeacon b3005

b3006

Again we see that the detected signal strengths are approximately normally distributed. This iBeacon is also detected a relatively high number of times. Signals were detected between I and S and between 1 and 10. It is also worth noticing that signals are detected at x-coordinate S but not at R, even though R is closer to the iBeacon than S.

Visualisation of iBeacon b3006

b3007

The histogram shows that a signal was detected in only a low number of instances. Most of the signals were detected at x-coordinate S and y-coordinate 7. We notice that a signal is detected at x-coordinate I, which could be an outlier.

Visualisation of iBeacon b3007

b3008

This iBeacon is also detected in a low number of instances. Most of the detected signals are found at x-coordinates I or J and y-coordinates 9 and 10. We also see that a signal was detected in all instances at y-coordinates 9 and 10. Signals are also detected at x-coordinate R and y-coordinate 15.

Visualisation of iBeacon b3008

b3009

This iBeacon is detected in a very low number of instances, and when a signal is detected it is only within a limited area. Signals are mainly detected at x-coordinates D and E and y-coordinates 13, 14 and 15. Some signals were also detected at I and J.

Visualisation of iBeacon b3009

b3010

Signals from this iBeacon are also only detected at a limited number of locations. Strong signals are mostly detected at x-coordinates G, I and J and y-coordinate 15. A few signals are detected at y-coordinate 10.

Visualisation of iBeacon b3010

b3011

This iBeacon is also only detected in a very low number of instances. The detections are more spread out along the x-axis than along the y-axis, where signals are only detected at y-coordinate 15. We also notice that signals are detected at x-coordinate U.

Visualisation of iBeacon b3011

b3012

This iBeacon looks similar to b3011. The only difference is that the x-coordinates at which signals are detected lie slightly further to the right.

Visualisation of iBeacon b3012

b3013

Signals from iBeacon b3013 are detected at locations from Q to W and at K. The single detection at L could be an outlier: on the library map (see figure ), L is located far away from iBeacon b3013. A signal was detected in all instances at V and W. We also see that 15 is the only y-coordinate where signals are detected.

Visualisation of iBeacon b3013

Learnings from univariate visualisation

Looking at all iBeacons, there is a clear relationship between the placement of an iBeacon and the locations where its signals are detected. We also learned that, for every iBeacon, there are far more instances where no signal is detected than instances where one is. The detected signals seem to be distributed around an RSSI signal strength of 125 (on the shifted scale), and the more instances in which a signal from an iBeacon is detected, the more the distribution resembles a normal distribution.
About half of the iBeacons are detected more often than the others. These iBeacons are also detected at a wider variety of locations. When only a low number of signals is detected, they tend to come from only a few x- and y-coordinates, e.g. b3013 is only detected at y-coordinate 15.

We also notice that some of the iBeacons are detected a few times from a location far away, e.g. a signal from iBeacon b3011 is detected at U (see figure ). These could potentially be outliers, but since this behaviour is seen for more than half of the iBeacons, we consider it a natural tendency of the iBeacons and do not treat these values as outliers.
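
The two quantities discussed here, how often a beacon is detected and how strong its detected signals are, can be summarised per beacon. The sketch below uses invented shifted readings (0 = no signal) and is meant only to show the calculation.

```r
# Hypothetical shifted readings (0 = no signal) for four observations.
rssi <- data.frame(b3001 = c(0, 0, 0, 133),
                   b3002 = c(122, 0, 141, 118))

# Share of observations in which each beacon was detected at all.
detection_rate <- colMeans(rssi > 0)

# Mean strength over the detected readings only, ignoring the zeros.
mean_detected <- sapply(rssi, function(v) mean(v[v > 0]))

print(detection_rate)
print(mean_detected)
```

Separating the two summaries keeps the many zeros from pulling the mean signal strength towards zero.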

Multivariate visualisation

To explore the relationship between the RSSI signals of the different iBeacons we use scatter plot matrices and a correlation matrix. We notice that the majority of instances are detected close to iBeacons b3002, b3003, b3004, b3005, b3006 and b3007 (see figure ). Therefore we want to plot and explore these descriptive features in a scatter plot matrix. We also notice that many signals are detected at y-coordinate 15 (see figure ), so we also explore the iBeacons located at y-coordinate 15 in a scatter plot matrix. We also look for a correlation between the iBeacons located close to the entrance of the library (see figure ), again using a scatter plot matrix.

Finally we calculate and plot the correlation matrix of all descriptive features to see whether any features are correlated.

Location wise visualisation of iBeacons

We define three groups of iBeacons based on their location (See figure ) and analyse them using a scatter plot matrix.
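The grouping can be sketched as follows. The group membership is taken from the text of the following subsections; everything else is an assumption for illustration: the data frame `toy` contains random made-up readings, and base R `pairs()` is used as a dependency-free stand-in for whichever scatter plot matrix function produced the figures.

```r
set.seed(1)
n <- 50

# Made-up shifted readings: roughly 60% zeros ("no signal"), the rest around 125
make_beacon <- function(n) {
  ifelse(runif(n) < 0.6, 0, round(rnorm(n, mean = 125, sd = 5)))
}

beacons <- sprintf("b30%02d", 1:13)  # "b3001" ... "b3013"
toy <- as.data.frame(
  setNames(replicate(13, make_beacon(n), simplify = FALSE), beacons)
)

# Group membership as listed in the text
groups <- list(
  first  = c("b3002", "b3003", "b3004", "b3005", "b3006", "b3007"),
  second = c("b3010", "b3011", "b3012", "b3013"),
  third  = c("b3001", "b3005", "b3008", "b3009")
)

pdf(NULL)  # null device, so the sketch also runs without a display
for (name in names(groups)) {
  pairs(toy[, groups[[name]]], main = paste("Group:", name))
}
invisible(dev.off())
```

In an interactive session the `pdf(NULL)` and `dev.off()` lines can be dropped so that the plots appear on screen.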

First group of iBeacons

The first group consists of the iBeacons located where most of the target feature instances appear (see figure ). This group consists of iBeacons b3002, b3003, b3004, b3005, b3006 and b3007.

iBeacon b3002, b3003, b3005, b3006 and b3007. First group of iBeacons.

From the graphs in figure we can see that there is little or no linear correlation between the iBeacons in this group.

Second group of iBeacons

The second group consists of all iBeacons located at y-coordinate 15, except iBeacon b3009, since this row contains a high number of instances (see figure ). This group therefore contains iBeacons b3010, b3011, b3012 and b3013 (see figure ).

iBeacon b3010, b3011, b3012 and b3013. Second group of iBeacons

Third group of iBeacons

This group consists of the iBeacons located near the entrance area of the library. In this group we find iBeacons b3001, b3005, b3008 and b3009 (see figure ).

iBeacon b3001, b3005, b3008 and b3009. Third group of iBeacons.

From the graph in figure we can see that there is little or no linear correlation between the iBeacons in this group.

For all three groups of iBeacons we see that the zero values, where no signal is detected, have a great influence on the relationship between the iBeacons.
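The influence of the zero values can be illustrated by comparing the correlation over all observations with the correlation over only those observations where both iBeacons were detected. This is a sketch on made-up data; the column names `b_a` and `b_b` are hypothetical and the values were chosen so that the effect is easy to see.

```r
# Made-up shifted readings of two iBeacons (0 = no signal detected)
toy <- data.frame(
  b_a = c(0, 0, 0, 120, 125, 130, 0, 128),
  b_b = c(0, 0, 0, 131, 126, 121, 127, 0)
)

# Correlation over all observations, zeros included
cor_all <- cor(toy$b_a, toy$b_b)

# Correlation restricted to observations where both signals were detected
both <- toy$b_a > 0 & toy$b_b > 0
cor_detected <- cor(toy$b_a[both], toy$b_b[both])

print(cor_all)       # positive: driven by shared zeros versus shared detections
print(cor_detected)  # negative: the detected readings move in opposite directions
```

In this toy example the sign of the correlation changes once the zeros are excluded, which shows how strongly the "no signal" value can dominate a linear correlation measure.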

Correlation matrix

We calculate and plot the correlation matrix to see if there is a linear relationship between any of the iBeacons, not just within the groups that we inspected using scatter plots.

library(corrplot)
# Pearson correlation between the iBeacon columns (2 to 14)
correlation <- cor(indoor_local[,2:14])
corrplot(correlation, type = "lower")

Correlation plot of all iBeacons

In general we see that the linear correlations between the iBeacons are very weak. We notice that iBeacons b3002 and b3004 have the highest correlation.
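Instead of reading the strongest pair off the plot, it can also be extracted from the correlation matrix. The sketch below uses a small made-up data frame in place of `indoor_local[,2:14]`; the values are invented and only the column names follow the report.

```r
# Made-up shifted readings (0 = no signal detected)
toy <- data.frame(
  b3002 = c(0, 120, 125, 0, 130, 0),
  b3003 = c(118, 0, 0, 126, 0, 0),
  b3004 = c(0, 119, 127, 0, 128, 0)
)

correlation <- cor(toy)

# Keep only the lower triangle so each pair is considered once
off_diag <- correlation
off_diag[upper.tri(off_diag, diag = TRUE)] <- NA

# Position of the largest absolute correlation
idx <- which(abs(off_diag) == max(abs(off_diag), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(
  rownames(correlation)[idx[1, "row"]],
  colnames(correlation)[idx[1, "col"]]
)

print(strongest)
print(off_diag[idx[1, "row"], idx[1, "col"]])
```

Using the absolute value means that a strong negative correlation would be reported as well, which matches the goal of finding any linear relationship.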

Learnings from multivariate visualisation

We see a general tendency towards little or no linear correlation between the different iBeacons. This could be explained by the high number of zero values for each iBeacon. The high number of zero values could turn out to be an important characteristic when the prediction models are built.

Summary

We found the downloaded data set to be in a clean state, so no data cleaning was required. However, some data manipulation was carried out to prepare the data set for model building. First, the constant 200 was added to all iBeacon RSSI readings to convert all negative values into non-negative ones. Readings of -200 thereby became zero, which represents that no signal was detected for that observation. Second, the target feature was split into two target features, one containing the x-coordinate of the location and the other containing the y-coordinate. Therefore, two independent models need to be built for prediction in Phase II. We also converted the coordinates into numerical values for plotting.

From the univariate visualisation of the iBeacons, we found a clear relationship between the placement of an iBeacon and the locations where its signals were detected. In addition, we suspect that there are some outlier values in the data set, as seen in the graphs of the individual iBeacons, but since these values appear for more than half of the iBeacons this could also be a natural tendency of the iBeacon signals. From the multivariate visualisation, we found little or no linear correlation between the iBeacons. Moreover, we found that there are a lot of zero values in the data set, which could be an important characteristic.

Note that data normalization or standardization is left for Phase II.
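The preparation steps summarised above can be sketched as follows. The data frame `raw` and its values are made up, and the mapping of the x-coordinate letter to a number via its position in the alphabet is an assumption; the report does not state which numeric encoding was used.

```r
# Made-up raw observations: location label plus two iBeacon readings,
# where -200 stands for "no signal detected"
raw <- data.frame(
  location = c("K03", "O02", "W15"),
  b3001 = c(-200, -78, -200),
  b3002 = c(-75, -200, -200),
  stringsAsFactors = FALSE
)

beacon_cols <- grep("^b30", names(raw), value = TRUE)

prepared <- raw
# Shift all readings by 200 so that "no signal" becomes 0
prepared[beacon_cols] <- raw[beacon_cols] + 200

# Split the target feature into x-coordinate (letter) and y-coordinate (number)
prepared$x <- substr(raw$location, 1, 1)
prepared$y <- as.integer(substr(raw$location, 2, 3))

# Numeric x-coordinate for plotting (assumed encoding: position in the alphabet)
prepared$x_num <- match(prepared$x, LETTERS)

print(prepared)
```

For the location K03 this yields the x-coordinate K and the y-coordinate 3, in line with the example given in the description of the target feature.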

References

Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. http://jmlr.org/papers/v17/15-066.html.

Daróczi, Gergely, and Roman Tsegelskyi. 2017. Pander: An R ’Pandoc’ Writer. https://CRAN.R-project.org/package=pander.

Mohammadi, M., A. Al-Fuqaha, M. Guizani, and J. S. Oh. 2017. “Semi-supervised Deep Reinforcement Learning in Support of IoT and Smart City Services.” IEEE Internet of Things Journal. IEEE, 1–12. doi:10.1109/JIOT.2017.2712560.

Schloerke, Barret, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Joseph Larmarange. 2017. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.

Wei, Taiyun, and Viliam Simko. 2017. R Package “Corrplot”: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

Predicting location from RSSI signals

MATH 2319 Machine Learning Applied Project Phase I

Maya Dere (s3675042) & Kristoffer Løwenstein (s3706122)

April 8, 2018