The objective of this project was to build classifiers that predict an indoor location from the RSSI readings of 13 iBeacons. The data was collected in Waldo Library, Western Michigan University, using an iPhone 6S, and was sourced from the UCI Machine Learning Repository. The project is divided into two phases. Phase I, covered in this report, focuses on data preprocessing and exploration. Section two describes the data and its attributes. Section three covers data preprocessing and transformation. Section four explores the target feature first, followed by univariate and multivariate exploration of the descriptive features. Prediction models will be built in Phase II.
The UCI Machine Learning Repository provides two data sets (Mohammadi et al. (2017)): one labeled and one unlabeled. To fulfill the purpose of this course, only the labeled data set is used. It contains the RSSI readings of the 13 iBeacons and the location at which they were recorded, together with a timestamp of when the readings were made. The data set has 1420 observations.
In this project we decided not to split the data set into training and test sets, since in Phase II we will build classifiers from the entire data set and evaluate their performance using cross-validation.
Library map showing the target feature location (Mohammadi et al. (2017)).
The target feature is a location on a map of the library (see figure ). The target feature (labeled as location) consists of a horizontal coordinate expressed as a letter and a vertical coordinate expressed as a number on the map.
As an example, if the target feature is K03 it means that the location is column K and row 3.
All descriptive features are RSSI readings collected from the iBeacons. The iBeacons are named b3001 to b3013 after their position on the map (see figure ), where the last two digits correspond to the position. All readings are integers.
An RSSI reading of -200 means that the given iBeacon is out of reach. A greater RSSI value indicates closer proximity to the iBeacon: a reading of -45 represents a shorter distance than a reading of -125.
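As a small illustration of these conventions, the sketch below uses made-up readings (not taken from the data set) to flag out-of-range iBeacons and to pick the strongest one. The names `reading` and `OUT_OF_RANGE` are hypothetical and only serve the example.

```r
# Hypothetical single observation: RSSI readings for three of the 13 iBeacons.
reading <- c(b3001 = -200, b3002 = -45, b3003 = -125)

# By the data set's convention, -200 means the iBeacon was out of reach.
OUT_OF_RANGE <- -200
detected <- reading != OUT_OF_RANGE

# Among the detected iBeacons, the largest (least negative) value
# indicates the closest iBeacon.
closest <- names(which.max(reading[detected]))

print(detected)
print(closest)
```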
The following R packages were used in this project.
library(knitr)
library(mlr)
library(rmarkdown)
library(ggplot2)
library(corrplot)
library(GGally)
library(pander)
A short description of the packages used:
We read the data from the CSV file and store it as a data frame. We decided not to split the data set into a training and a test set, since we will use five-fold cross-validation to measure the performance of our prediction models in Phase II.
indoor_local <- read.csv("iBeacon_RSSI_Labeled.csv")
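The actual resampling is left to Phase II. Purely as an illustration of what five-fold cross-validation means for a data set of this size, the base-R sketch below assigns 1420 instances to five folds. The seed and all variable names are assumptions made for the example; only the number of observations comes from the data set.

```r
# Minimal sketch of a five-fold split; only the count of 1420 comes from the data set.
set.seed(1)                      # arbitrary seed, for reproducibility only
n_obs   <- 1420
n_folds <- 5

# Shuffle the fold labels 1..5 so that each instance lands in exactly one fold.
fold_id <- sample(rep(seq_len(n_folds), length.out = n_obs))

# Each fold serves once as the test set; the other four form the training set.
fold_sizes      <- table(fold_id)
test_idx_fold1  <- which(fold_id == 1)
train_idx_fold1 <- which(fold_id != 1)

print(fold_sizes)
```

Because 1420 is divisible by five, every fold holds the same number of instances, so each model would be trained on four fifths of the data and tested on the remaining fifth.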
First we get a general overview of the data set and check it for missing values, white space and misspellings. Based on our findings we transform and prepare the data set for visualisation and model building.
First we use the head and dim functions to get a first look at the data set.
pander(head(indoor_local),split.table = 85, style = "rmarkdown", caption = "Overview of the dataset before data transformation")
| location | date | b3001 | b3002 | b3003 | b3004 | b3005 |
|---|---|---|---|---|---|---|
| O02 | 10-18-2016 11:15:21 | -200 | -200 | -200 | -200 | -200 |
| P01 | 10-18-2016 11:15:19 | -200 | -200 | -200 | -200 | -200 |
| P01 | 10-18-2016 11:15:17 | -200 | -200 | -200 | -200 | -200 |
| P01 | 10-18-2016 11:15:15 | -200 | -200 | -200 | -200 | -200 |
| P01 | 10-18-2016 11:15:13 | -200 | -200 | -200 | -200 | -200 |
| P01 | 10-18-2016 11:15:11 | -200 | -200 | -82 | -200 | -200 |
| b3006 | b3007 | b3008 | b3009 | b3010 | b3011 | b3012 | b3013 |
|---|---|---|---|---|---|---|---|
| -78 | -200 | -200 | -200 | -200 | -200 | -200 | -200 |
| -78 | -200 | -200 | -200 | -200 | -200 | -200 | -200 |
| -77 | -200 | -200 | -200 | -200 | -200 | -200 | -200 |
| -77 | -200 | -200 | -200 | -200 | -200 | -200 | -200 |
| -77 | -200 | -200 | -200 | -200 | -200 | -200 | -200 |
| -200 | -200 | -200 | -200 | -200 | -200 | -200 | -200 |
dim_df <- as.data.frame(dim(indoor_local))
rownames(dim_df) <- c("Instances", "Features")
colnames(dim_df) <- c("Dimension")
kable(dim_df,caption = "Dimensions of data", row.names = TRUE)
| | Dimension |
|---|---|
| Instances | 1420 |
| Features | 15 |
We use the summarizeColumns and str functions to get an overview of the data set.
kable(summarizeColumns(indoor_local), caption = "Summary of data before data transformation")
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| location | factor | 0 | NA | 0.9760563 | NA | NA | 2 | 34 | 105 |
| date | factor | 0 | NA | 0.9992958 | NA | NA | 1 | 1 | 1420 |
| b3001 | integer | 0 | -197.8254 | 16.2591046 | -200 | 0 | -200 | -67 | 0 |
| b3002 | integer | 0 | -156.6239 | 60.2177467 | -200 | 0 | -200 | -59 | 0 |
| b3003 | integer | 0 | -175.5331 | 49.4529584 | -200 | 0 | -200 | -56 | 0 |
| b3004 | integer | 0 | -164.5345 | 56.5232607 | -200 | 0 | -200 | -56 | 0 |
| b3005 | integer | 0 | -178.3782 | 47.1757986 | -200 | 0 | -200 | -60 | 0 |
| b3006 | integer | 0 | -175.0634 | 49.5966273 | -200 | 0 | -200 | -62 | 0 |
| b3007 | integer | 0 | -195.6373 | 22.8809805 | -200 | 0 | -200 | -58 | 0 |
| b3008 | integer | 0 | -191.9704 | 30.7337421 | -200 | 0 | -200 | -56 | 0 |
| b3009 | integer | 0 | -197.1451 | 19.1602072 | -200 | 0 | -200 | -55 | 0 |
| b3010 | integer | 0 | -197.4423 | 17.7416322 | -200 | 0 | -200 | -61 | 0 |
| b3011 | integer | 0 | -197.7486 | 16.8525347 | -200 | 0 | -200 | -59 | 0 |
| b3012 | integer | 0 | -197.2338 | 18.5410878 | -200 | 0 | -200 | -60 | 0 |
| b3013 | integer | 0 | -196.0655 | 22.0539237 | -200 | 0 | -200 | -59 | 0 |
From this we see:
We now use the str function to inspect the internal structure of the data frame.
str(indoor_local)
## 'data.frame': 1420 obs. of 15 variables:
## $ location: Factor w/ 105 levels "D13","D14","D15",..: 59 64 64 64 64 64 64 65 77 77 ...
## $ date : Factor w/ 1420 levels "10-18-2016 10:35:20",..: 600 599 598 597 596 595 594 593 592 591 ...
## $ b3001 : int -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
## $ b3002 : int -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
## $ b3003 : int -200 -200 -200 -200 -200 -82 -80 -86 -200 -200 ...
## $ b3004 : int -200 -200 -200 -200 -200 -200 -200 -200 -75 -75 ...
## $ b3005 : int -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
## $ b3006 : int -78 -78 -77 -77 -77 -200 -77 -200 -200 -200 ...
## $ b3007 : int -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
## $ b3008 : int -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
## $ b3009 : int -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
## $ b3010 : int -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
## $ b3011 : int -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
## $ b3012 : int -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
## $ b3013 : int -200 -200 -200 -200 -200 -200 -200 -200 -200 -200 ...
From this we see:
We now use the unique function to check if there are any misspellings in the target feature.
unique(indoor_local$location)
## [1] O02 P01 P02 R01 R02 S01 S02 T01 U02 U01 J03 K03 L03 M03 N03 O03 P03
## [18] Q03 R03 S03 T03 U03 U04 T04 S04 R04 Q04 P04 O04 N04 M04 L04 K04 J04
## [35] I04 I05 J05 K05 L05 M05 N05 O05 P05 Q05 R05 S05 T05 U05 S06 R06 Q06
## [52] P06 O06 N06 M06 L06 K06 J06 I06 F08 J02 J07 I07 I10 J10 D15 E15 G15
## [69] J15 L15 R15 T15 W15 I08 I03 J08 I01 I02 J01 K01 K02 L01 L02 M01 M02
## [86] N01 N02 O01 I09 D14 D13 K07 K08 N15 P15 I15 S15 U15 V15 S07 S08 L09
## [103] L08 Q02 Q01
## 105 Levels: D13 D14 D15 E15 F08 G15 I01 I02 I03 I04 I05 I06 I07 I08 ... W15
The following observations are made from the output:
Since only 105 out of 414 (roughly 25 %) of the possible target levels are present in the data set, we want to plot these 105 levels on the map to see whether they are spread out or concentrated in one area. One option is to split the target feature into its x- and y-coordinates. This is worth considering because we would then have two target features with at most 23 and 18 levels respectively. If the target feature is split in this way, two independent prediction models are needed: one to predict the x-coordinate and one to predict the y-coordinate.
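The figure of 414 possible levels follows from a grid of 23 columns (A to W) and 18 rows (01 to 18). The sketch below reproduces this arithmetic; the grid dimensions are taken from the coordinate ranges used later in this report, and the variable names are chosen for the example only.

```r
# Grid dimensions: columns A-W and rows 01-18.
x_levels <- LETTERS[1:23]
y_levels <- sprintf("%02d", 1:18)

# Every possible location label, e.g. "A01", "B01", ..., "W18".
all_locations <- as.vector(outer(x_levels, y_levels, paste0))
n_possible <- length(all_locations)

# 105 distinct labels occur in the data set (see the output of unique above).
n_observed <- 105
coverage <- n_observed / n_possible

print(n_possible)
print(round(coverage, 3))
```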
So far we have not found any outliers, but we check for outliers visually during data visualisation (Section 4.2).
In this section the data is transformed and prepared for model building. Normalisation is left to Phase II of the project.
We add 200 to all signal measurements, which turns all readings into zeros and positive numbers. We can do this because all signals are measured on the same scale.
indoor_local[,3:15] <- indoor_local[,3:15] + 200
pander(head(indoor_local),split.table = 85, style = "rmarkdown", caption = "Overview of the dataset after adding 200 to all RSSI signal readings ")
| location | date | b3001 | b3002 | b3003 | b3004 | b3005 |
|---|---|---|---|---|---|---|
| O02 | 10-18-2016 11:15:21 | 0 | 0 | 0 | 0 | 0 |
| P01 | 10-18-2016 11:15:19 | 0 | 0 | 0 | 0 | 0 |
| P01 | 10-18-2016 11:15:17 | 0 | 0 | 0 | 0 | 0 |
| P01 | 10-18-2016 11:15:15 | 0 | 0 | 0 | 0 | 0 |
| P01 | 10-18-2016 11:15:13 | 0 | 0 | 0 | 0 | 0 |
| P01 | 10-18-2016 11:15:11 | 0 | 0 | 118 | 0 | 0 |
| b3006 | b3007 | b3008 | b3009 | b3010 | b3011 | b3012 | b3013 |
|---|---|---|---|---|---|---|---|
| 122 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 122 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 123 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 123 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 123 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
It is also important to note that a value of zero now indicates that no signal from a given iBeacon was detected.
Now we want to split the target feature into x- and y-coordinates. From our earlier exploration of the data set we found that only roughly 25 % of the possible target feature levels are present. By splitting the target feature into x- and y-coordinates we get two target features, each with a higher percentage of its possible levels present. Furthermore, we get more instances per level for each of the new target features.
To search through "location" and split it into x and y, we convert "location" to characters. We use the unique function to check that the conversion went right.
indoor_local$location <- as.character(indoor_local$location)
unique(indoor_local$location)
## [1] "O02" "P01" "P02" "R01" "R02" "S01" "S02" "T01" "U02" "U01" "J03"
## [12] "K03" "L03" "M03" "N03" "O03" "P03" "Q03" "R03" "S03" "T03" "U03"
## [23] "U04" "T04" "S04" "R04" "Q04" "P04" "O04" "N04" "M04" "L04" "K04"
## [34] "J04" "I04" "I05" "J05" "K05" "L05" "M05" "N05" "O05" "P05" "Q05"
## [45] "R05" "S05" "T05" "U05" "S06" "R06" "Q06" "P06" "O06" "N06" "M06"
## [56] "L06" "K06" "J06" "I06" "F08" "J02" "J07" "I07" "I10" "J10" "D15"
## [67] "E15" "G15" "J15" "L15" "R15" "T15" "W15" "I08" "I03" "J08" "I01"
## [78] "I02" "J01" "K01" "K02" "L01" "L02" "M01" "M02" "N01" "N02" "O01"
## [89] "I09" "D14" "D13" "K07" "K08" "N15" "P15" "I15" "S15" "U15" "V15"
## [100] "S07" "S08" "L09" "L08" "Q02" "Q01"
The conversion went right and therefore we can start splitting the target feature. We start by finding all the x-values:
indoor_local$x[grepl("A", indoor_local$location, ignore.case=FALSE)] <- "A"
indoor_local$x[grepl("B", indoor_local$location, ignore.case=FALSE)] <- "B"
indoor_local$x[grepl("C", indoor_local$location, ignore.case=FALSE)] <- "C"
indoor_local$x[grepl("D", indoor_local$location, ignore.case=FALSE)] <- "D"
indoor_local$x[grepl("E", indoor_local$location, ignore.case=FALSE)] <- "E"
indoor_local$x[grepl("F", indoor_local$location, ignore.case=FALSE)] <- "F"
indoor_local$x[grepl("G", indoor_local$location, ignore.case=FALSE)] <- "G"
indoor_local$x[grepl("H", indoor_local$location, ignore.case=FALSE)] <- "H"
indoor_local$x[grepl("I", indoor_local$location, ignore.case=FALSE)] <- "I"
indoor_local$x[grepl("J", indoor_local$location, ignore.case=FALSE)] <- "J"
indoor_local$x[grepl("K", indoor_local$location, ignore.case=FALSE)] <- "K"
indoor_local$x[grepl("L", indoor_local$location, ignore.case=FALSE)] <- "L"
indoor_local$x[grepl("M", indoor_local$location, ignore.case=FALSE)] <- "M"
indoor_local$x[grepl("N", indoor_local$location, ignore.case=FALSE)] <- "N"
indoor_local$x[grepl("O", indoor_local$location, ignore.case=FALSE)] <- "O"
indoor_local$x[grepl("P", indoor_local$location, ignore.case=FALSE)] <- "P"
indoor_local$x[grepl("Q", indoor_local$location, ignore.case=FALSE)] <- "Q"
indoor_local$x[grepl("R", indoor_local$location, ignore.case=FALSE)] <- "R"
indoor_local$x[grepl("S", indoor_local$location, ignore.case=FALSE)] <- "S"
indoor_local$x[grepl("T", indoor_local$location, ignore.case=FALSE)] <- "T"
indoor_local$x[grepl("U", indoor_local$location, ignore.case=FALSE)] <- "U"
indoor_local$x[grepl("V", indoor_local$location, ignore.case=FALSE)] <- "V"
indoor_local$x[grepl("W", indoor_local$location, ignore.case=FALSE)] <- "W"
We also assign a numerical value to each x-coordinate for plotting.
indoor_local$x_num[grepl("A", indoor_local$location, ignore.case=FALSE)] <- 1
indoor_local$x_num[grepl("B", indoor_local$location, ignore.case=FALSE)] <- 2
indoor_local$x_num[grepl("C", indoor_local$location, ignore.case=FALSE)] <- 3
indoor_local$x_num[grepl("D", indoor_local$location, ignore.case=FALSE)] <- 4
indoor_local$x_num[grepl("E", indoor_local$location, ignore.case=FALSE)] <- 5
indoor_local$x_num[grepl("F", indoor_local$location, ignore.case=FALSE)] <- 6
indoor_local$x_num[grepl("G", indoor_local$location, ignore.case=FALSE)] <- 7
indoor_local$x_num[grepl("H", indoor_local$location, ignore.case=FALSE)] <- 8
indoor_local$x_num[grepl("I", indoor_local$location, ignore.case=FALSE)] <- 9
indoor_local$x_num[grepl("J", indoor_local$location, ignore.case=FALSE)] <- 10
indoor_local$x_num[grepl("K", indoor_local$location, ignore.case=FALSE)] <- 11
indoor_local$x_num[grepl("L", indoor_local$location, ignore.case=FALSE)] <- 12
indoor_local$x_num[grepl("M", indoor_local$location, ignore.case=FALSE)] <- 13
indoor_local$x_num[grepl("N", indoor_local$location, ignore.case=FALSE)] <- 14
indoor_local$x_num[grepl("O", indoor_local$location, ignore.case=FALSE)] <- 15
indoor_local$x_num[grepl("P", indoor_local$location, ignore.case=FALSE)] <- 16
indoor_local$x_num[grepl("Q", indoor_local$location, ignore.case=FALSE)] <- 17
indoor_local$x_num[grepl("R", indoor_local$location, ignore.case=FALSE)] <- 18
indoor_local$x_num[grepl("S", indoor_local$location, ignore.case=FALSE)] <- 19
indoor_local$x_num[grepl("T", indoor_local$location, ignore.case=FALSE)] <- 20
indoor_local$x_num[grepl("U", indoor_local$location, ignore.case=FALSE)] <- 21
indoor_local$x_num[grepl("V", indoor_local$location, ignore.case=FALSE)] <- 22
indoor_local$x_num[grepl("W", indoor_local$location, ignore.case=FALSE)] <- 23
And we do the same for the y-coordinates:
indoor_local$y[grepl("01", indoor_local$location, ignore.case=FALSE)] <- "01"
indoor_local$y[grepl("02", indoor_local$location, ignore.case=FALSE)] <- "02"
indoor_local$y[grepl("03", indoor_local$location, ignore.case=FALSE)] <- "03"
indoor_local$y[grepl("04", indoor_local$location, ignore.case=FALSE)] <- "04"
indoor_local$y[grepl("05", indoor_local$location, ignore.case=FALSE)] <- "05"
indoor_local$y[grepl("06", indoor_local$location, ignore.case=FALSE)] <- "06"
indoor_local$y[grepl("07", indoor_local$location, ignore.case=FALSE)] <- "07"
indoor_local$y[grepl("08", indoor_local$location, ignore.case=FALSE)] <- "08"
indoor_local$y[grepl("09", indoor_local$location, ignore.case=FALSE)] <- "09"
indoor_local$y[grepl("10", indoor_local$location, ignore.case=FALSE)] <- "10"
indoor_local$y[grepl("11", indoor_local$location, ignore.case=FALSE)] <- "11"
indoor_local$y[grepl("12", indoor_local$location, ignore.case=FALSE)] <- "12"
indoor_local$y[grepl("13", indoor_local$location, ignore.case=FALSE)] <- "13"
indoor_local$y[grepl("14", indoor_local$location, ignore.case=FALSE)] <- "14"
indoor_local$y[grepl("15", indoor_local$location, ignore.case=FALSE)] <- "15"
indoor_local$y[grepl("16", indoor_local$location, ignore.case=FALSE)] <- "16"
indoor_local$y[grepl("17", indoor_local$location, ignore.case=FALSE)] <- "17"
indoor_local$y[grepl("18", indoor_local$location, ignore.case=FALSE)] <- "18"
## saving y as a number for plot as well:
indoor_local$y_num[grepl("01", indoor_local$location, ignore.case=FALSE)] <- 1
indoor_local$y_num[grepl("02", indoor_local$location, ignore.case=FALSE)] <- 2
indoor_local$y_num[grepl("03", indoor_local$location, ignore.case=FALSE)] <- 3
indoor_local$y_num[grepl("04", indoor_local$location, ignore.case=FALSE)] <- 4
indoor_local$y_num[grepl("05", indoor_local$location, ignore.case=FALSE)] <- 5
indoor_local$y_num[grepl("06", indoor_local$location, ignore.case=FALSE)] <- 6
indoor_local$y_num[grepl("07", indoor_local$location, ignore.case=FALSE)] <- 7
indoor_local$y_num[grepl("08", indoor_local$location, ignore.case=FALSE)] <- 8
indoor_local$y_num[grepl("09", indoor_local$location, ignore.case=FALSE)] <- 9
indoor_local$y_num[grepl("10", indoor_local$location, ignore.case=FALSE)] <- 10
indoor_local$y_num[grepl("11", indoor_local$location, ignore.case=FALSE)] <- 11
indoor_local$y_num[grepl("12", indoor_local$location, ignore.case=FALSE)] <- 12
indoor_local$y_num[grepl("13", indoor_local$location, ignore.case=FALSE)] <- 13
indoor_local$y_num[grepl("14", indoor_local$location, ignore.case=FALSE)] <- 14
indoor_local$y_num[grepl("15", indoor_local$location, ignore.case=FALSE)] <- 15
indoor_local$y_num[grepl("16", indoor_local$location, ignore.case=FALSE)] <- 16
indoor_local$y_num[grepl("17", indoor_local$location, ignore.case=FALSE)] <- 17
indoor_local$y_num[grepl("18", indoor_local$location, ignore.case=FALSE)] <- 18
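The same split can be written more compactly, assuming (as the output of `unique` suggests) that every label is one uppercase letter followed by two digits. The sketch below runs on a small hypothetical data frame `toy` rather than on `indoor_local`; it is an alternative formulation, not the code used to produce the tables in this report.

```r
# Toy stand-in for indoor_local; the labels are taken from the unique() output above.
toy <- data.frame(location = c("O02", "P01", "K03", "D15", "W15"),
                  stringsAsFactors = FALSE)

# The first character is the column letter, the last two are the row number.
toy$x <- substr(toy$location, 1, 1)
toy$y <- substr(toy$location, 2, 3)

# Numeric versions for plotting: position of the letter in the alphabet
# (A = 1, ..., W = 23) and the row number as an integer.
toy$x_num <- match(toy$x, LETTERS)
toy$y_num <- as.integer(toy$y)

print(toy)
```

Deriving `x_num` from the letter's position in `LETTERS` avoids typing each number by hand, which removes one source of copy-and-paste mistakes.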
We now convert the location and the x- and y-coordinates to factors.
indoor_local$x <- factor(indoor_local$x)
indoor_local$y <- factor(indoor_local$y)
indoor_local$location <- factor(indoor_local$location)
We also remove the date column since we are not going to use it for our predictions.
indoor_local$date <- NULL
Since the target feature was split in two, we create one data frame each for building the x-coordinate and the y-coordinate prediction model.
We define a new data frame and remove all attributes that are not necessary for predicting the x-coordinate.
data_x <- indoor_local
data_x$y <- NULL
data_x$y_num <- NULL
data_x$x_num <- NULL
data_x$location <- NULL
kable(summarizeColumns(data_x), caption = "Summary of x data frame after transformation\\label{data_x}")
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| b3001 | numeric | 0 | 2.174648 | 16.2591046 | 0 | 0 | 0 | 133 | 0 |
| b3002 | numeric | 0 | 43.376056 | 60.2177467 | 0 | 0 | 0 | 141 | 0 |
| b3003 | numeric | 0 | 24.466901 | 49.4529584 | 0 | 0 | 0 | 144 | 0 |
| b3004 | numeric | 0 | 35.465493 | 56.5232607 | 0 | 0 | 0 | 144 | 0 |
| b3005 | numeric | 0 | 21.621831 | 47.1757986 | 0 | 0 | 0 | 140 | 0 |
| b3006 | numeric | 0 | 24.936620 | 49.5966273 | 0 | 0 | 0 | 138 | 0 |
| b3007 | numeric | 0 | 4.362676 | 22.8809805 | 0 | 0 | 0 | 142 | 0 |
| b3008 | numeric | 0 | 8.029578 | 30.7337421 | 0 | 0 | 0 | 144 | 0 |
| b3009 | numeric | 0 | 2.854930 | 19.1602072 | 0 | 0 | 0 | 145 | 0 |
| b3010 | numeric | 0 | 2.557746 | 17.7416322 | 0 | 0 | 0 | 139 | 0 |
| b3011 | numeric | 0 | 2.251409 | 16.8525347 | 0 | 0 | 0 | 141 | 0 |
| b3012 | numeric | 0 | 2.766197 | 18.5410878 | 0 | 0 | 0 | 140 | 0 |
| b3013 | numeric | 0 | 3.934507 | 22.0539237 | 0 | 0 | 0 | 141 | 0 |
| x | factor | 0 | NA | 0.8577465 | NA | NA | 4 | 202 | 19 |
We define a new data frame and remove all attributes that are not necessary for predicting the y-coordinate.
data_y <- indoor_local
data_y$x <- NULL
data_y$x_num <- NULL
data_y$y_num <- NULL
data_y$location <- NULL
kable(summarizeColumns(data_y), caption = "Summary of y data frame after transformation\\label{data_y}")
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| b3001 | numeric | 0 | 2.174648 | 16.2591046 | 0 | 0 | 0 | 133 | 0 |
| b3002 | numeric | 0 | 43.376056 | 60.2177467 | 0 | 0 | 0 | 141 | 0 |
| b3003 | numeric | 0 | 24.466901 | 49.4529584 | 0 | 0 | 0 | 144 | 0 |
| b3004 | numeric | 0 | 35.465493 | 56.5232607 | 0 | 0 | 0 | 144 | 0 |
| b3005 | numeric | 0 | 21.621831 | 47.1757986 | 0 | 0 | 0 | 140 | 0 |
| b3006 | numeric | 0 | 24.936620 | 49.5966273 | 0 | 0 | 0 | 138 | 0 |
| b3007 | numeric | 0 | 4.362676 | 22.8809805 | 0 | 0 | 0 | 142 | 0 |
| b3008 | numeric | 0 | 8.029578 | 30.7337421 | 0 | 0 | 0 | 144 | 0 |
| b3009 | numeric | 0 | 2.854930 | 19.1602072 | 0 | 0 | 0 | 145 | 0 |
| b3010 | numeric | 0 | 2.557746 | 17.7416322 | 0 | 0 | 0 | 139 | 0 |
| b3011 | numeric | 0 | 2.251409 | 16.8525347 | 0 | 0 | 0 | 141 | 0 |
| b3012 | numeric | 0 | 2.766197 | 18.5410878 | 0 | 0 | 0 | 140 | 0 |
| b3013 | numeric | 0 | 3.934507 | 22.0539237 | 0 | 0 | 0 | 141 | 0 |
| y | factor | 0 | NA | 0.8323944 | NA | NA | 4 | 238 | 13 |
We use summarizeColumns to get an overview of the data after data transformation.
kable(summarizeColumns(indoor_local), caption = "Summary of data after data transformation")
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| location | factor | 0 | NA | 0.9760563 | NA | NA | 2 | 34 | 105 |
| b3001 | numeric | 0 | 2.174648 | 16.2591046 | 0 | 0.0000 | 0 | 133 | 0 |
| b3002 | numeric | 0 | 43.376056 | 60.2177467 | 0 | 0.0000 | 0 | 141 | 0 |
| b3003 | numeric | 0 | 24.466901 | 49.4529584 | 0 | 0.0000 | 0 | 144 | 0 |
| b3004 | numeric | 0 | 35.465493 | 56.5232607 | 0 | 0.0000 | 0 | 144 | 0 |
| b3005 | numeric | 0 | 21.621831 | 47.1757986 | 0 | 0.0000 | 0 | 140 | 0 |
| b3006 | numeric | 0 | 24.936620 | 49.5966273 | 0 | 0.0000 | 0 | 138 | 0 |
| b3007 | numeric | 0 | 4.362676 | 22.8809805 | 0 | 0.0000 | 0 | 142 | 0 |
| b3008 | numeric | 0 | 8.029578 | 30.7337421 | 0 | 0.0000 | 0 | 144 | 0 |
| b3009 | numeric | 0 | 2.854930 | 19.1602072 | 0 | 0.0000 | 0 | 145 | 0 |
| b3010 | numeric | 0 | 2.557746 | 17.7416322 | 0 | 0.0000 | 0 | 139 | 0 |
| b3011 | numeric | 0 | 2.251409 | 16.8525347 | 0 | 0.0000 | 0 | 141 | 0 |
| b3012 | numeric | 0 | 2.766197 | 18.5410878 | 0 | 0.0000 | 0 | 140 | 0 |
| b3013 | numeric | 0 | 3.934507 | 22.0539237 | 0 | 0.0000 | 0 | 141 | 0 |
| x | factor | 0 | NA | 0.8577465 | NA | NA | 4 | 202 | 19 |
| x_num | numeric | 0 | 13.654225 | 4.1723970 | 13 | 4.4478 | 4 | 23 | 0 |
| y | factor | 0 | NA | 0.8323944 | NA | NA | 4 | 238 | 13 |
| y_num | numeric | 0 | 5.169014 | 3.6182760 | 4 | 2.9652 | 1 | 15 | 0 |
We notice:
We also have two new data frames for building prediction models for the x- and y-coordinates respectively (see table and table ). Note that we leave normalization or standardization to Phase II.
First we will explore the target feature. Afterwards the descriptive features will be explored individually, and finally a multivariate exploration is done.
We start by plotting the instances of the target feature in a plot representing the library. The more transparent a location is, the less often it was visited during data collection.
ggplot(indoor_local, aes(x_num, y_num)) +
  geom_point(colour = "blue", size = 5, alpha = 0.05) +
  scale_y_continuous(breaks = 1:18) +
  scale_x_continuous(breaks = 1:23, labels = LETTERS[1:23]) +
  ggtitle("Target feature instances") +
  xlab("x-coordinate") +
  ylab("y-coordinate")
Target feature instances.
From the plot we see that many locations in the library were not visited, which agrees with our earlier finding that only 105 levels of the target feature are present. The majority of the instances lie in an area bounded by the coordinates I1, I6, U1 and U6.
We also notice that there are no instances with x-coordinates A, B, C and H and no instances with y-coordinates 11, 12, 16, 17 and 18.
We will also look at how the instances are distributed along the x- and y-axis respectively.
Distribution of target feature instances along x- and y-axis
As expected we see that most of our instances are between I and U with respect to the x-axis and between one and six with respect to the y-axis.
For model building it would be preferable to have the instances distributed more evenly across the target levels. Alternatively, we could decide to focus on predicting locations within the area bounded by I1, I6, U1 and U6.
We also notice that a surprisingly large number of instances were recorded with y-coordinate 15.
The data shown in the graphs is also summarized in table and table .
table_x <- as.data.frame(table(indoor_local$x))
table_x <- t(table_x)
rownames(table_x) <- c("X", "Freq")
kable(table_x, caption = "Frequency of X \\label{FreqX}")
| X | D | E | F | G | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Freq | 24 | 4 | 4 | 4 | 202 | 192 | 142 | 100 | 85 | 86 | 86 | 71 | 74 | 91 | 136 | 39 | 55 | 8 | 17 |
table_y <- as.data.frame(table(indoor_local$y))
table_y <- t(table_y)
rownames(table_y) <- c("Y", "Freq")
kable(table_y, caption = "Frequency of Y\\label{FreqY}")
| Y | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Freq | 138 | 155 | 200 | 238 | 213 | 187 | 75 | 58 | 10 | 20 | 6 | 4 | 116 |
We will look at each iBeacon separately and see if there is anything to notice. For each feature we will plot a histogram of the signal strengths. We will also plot a scatter plot showing the signal strength with respect to the x- and y-coordinates. This will help us understand the relationship between location and iBeacon signals.
The histogram shows that this iBeacon is detected in a very low number of instances. This makes sense given the placement of the iBeacon. However, strong signals from this iBeacon are detected at x-coordinates F, I and L and at y-coordinates 4, 7 and 9.
Visualisation of iBeacon b3001
b3002 is detected in a relatively high number of instances and at a wide variety of locations. For each location there is also a relatively high variation in signal strength. We notice a signal detected at x-coordinate U, which could be an outlier.
Visualisation of iBeacon b3002
Signal b3003 is also detected at a wide variety of locations: between 1 and 7 for the y-coordinate and from I to U for the x-coordinate. The signal strength varies the most at x-coordinates M and O and at y-coordinates 3 and 4.
Visualisation of iBeacon b3003
This iBeacon shows a similar tendency to iBeacon b3003, but the largest variation in signal strength is found at x-coordinate S and at y-coordinates 3 and 4. Furthermore, the signal strengths come close to following a normal distribution.
Visualisation of iBeacon b3004
The signals from b3005 are strongest between I and O and between 1 and 10. The signal strength varies the most at y-coordinate 7, which is where the iBeacon is located.
Visualisation of iBeacon b3005
Again we see that the detected signal strengths appear to be normally distributed. This iBeacon is also detected a relatively high number of times. Signals were detected between I and S and between 1 and 10. It is also worth noting that signals are detected at x-coordinate S but not at R, even though R is closer to the iBeacon than S.
Visualisation of iBeacon b3006
From this histogram we see that a signal was detected in only a low number of instances. Most of the signals were detected at S and 7. We notice that a signal is detected at x-coordinate I, which could be an outlier.
Visualisation of iBeacon b3007
This iBeacon is also detected in a low number of instances. Most of the signals are detected at I or J and at 9 and 10. We also see that a signal was detected for all instances at y-coordinates 9 and 10. Signals are also detected at x-coordinate R and y-coordinate 15.
Visualisation of iBeacon b3008
This iBeacon is detected in a very low number of instances, and when a signal is detected it is only detected in a limited area. Signals are mainly detected at D and E and at 13, 14 and 15. Some signals were also detected at I and J.
Visualisation of iBeacon b3009
Signals are again only detected at a limited number of locations. Strong signals are mostly detected at G, I, J and 15. A few signals are detected at y-coordinate 10.
Visualisation of iBeacon b3010
This iBeacon is also detected in only a very low number of instances. The detected signals are more spread out along the x-axis than along the y-axis, where signals are only detected at y-coordinate 15. We also notice that signals are detected at x-coordinate U.
Visualisation of iBeacon b3011
This signal looks similar to the signal from iBeacon b3011. The only difference is that the signals are detected at x-coordinates slightly further to the right.
Visualisation of iBeacon b3012
Signals from iBeacon b3013 are detected at locations from Q to W and at K. The single detection at L could be an outlier: on the library map (see figure ) L is located far away from iBeacon b3013. A signal was detected for all instances at V and W. We also see that 15 is the only y-coordinate where signals are detected.
Visualisation of iBeacon b3013
Looking at all iBeacons, there is a clear relationship between the placement of an iBeacon and the locations where its signal is detected. We also learned that for any given iBeacon there are far more instances where no signal is detected than instances where one is. The detected signals seem to be distributed around an RSSI signal strength of 125 (that is, -75 before the shift by 200), and the more instances in which a signal from an iBeacon is detected, the more the distribution looks like a normal distribution.
About half of the iBeacons are detected more often than the others. These iBeacons are also detected at a wider variety of locations. If an iBeacon is detected in only a low number of instances, its signals tend to come from only a few x- and y-coordinates, e.g. b3013 is only detected at y-coordinate 15.
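Observations of this kind can also be checked numerically. The sketch below computes, for each iBeacon, the share of instances in which it was detected and the y-coordinates at which it was detected. It uses an invented data frame `toy` in the shifted scale (0 = no signal); the values and the helper names `detection_rate` and `detected_rows` are hypothetical and do not come from the data set.

```r
# Toy data in the shifted scale (0 = no signal); the values are invented for illustration.
toy <- data.frame(
  y     = c("01", "03", "15", "15", "15", "04"),
  b3002 = c(125, 130, 0, 0, 122, 118),
  b3013 = c(0, 0, 127, 0, 131, 0),
  stringsAsFactors = FALSE
)
beacons <- c("b3002", "b3013")

# Share of instances in which each iBeacon was detected at all.
detection_rate <- colMeans(toy[beacons] > 0)

# The y-coordinates at which each iBeacon was detected at least once.
detected_rows <- lapply(toy[beacons],
                        function(signal) sort(unique(toy$y[signal > 0])))

print(detection_rate)
print(detected_rows)
```

Applied to the real data, the same two summaries would be computed over all 13 iBeacon columns, and analogously for the x-coordinate.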
We also notice that some iBeacons are occasionally detected from a location far away, e.g. a signal from iBeacon b3011 is detected at U (see figure ). These could potentially be outliers, but since this behaviour is seen for more than half of the iBeacons we consider it a natural tendency of the iBeacons and do not treat these values as outliers.
To explore the relationship between the different iBeacon RSSI signals we use scatter plot matrices and a correlation matrix. We notice that the majority of instances are detected close to iBeacon b3002, b3003, b3004, b3005, b3006 and b3007 (see figure ). Therefore we want to plot and explore these descriptive features in a scatter plot matrix. We also notice that a lot of signals are detected at y-coordinate 15 (see figure ). Therefore we also want to explore all iBeacons located at y-coordinate 15 in a scatter plot matrix. Finally, we look for a correlation between the iBeacons located close to the entrance of the library (see figure ), again using a scatter plot matrix.
We also calculate and plot the correlation matrix of all descriptive features to see if any features are correlated.
We define three groups of iBeacons based on their location (See figure ) and analyse them using a scatter plot matrix.
The first group consists of the iBeacons located where most of the target feature instances appear (see figure ): iBeacon b3002, b3003, b3004, b3005, b3006 and b3007.
iBeacon b3002, b3003, b3004, b3005, b3006 and b3007. First group of iBeacons.
From the graphs in figure we can see that there is little or no linear correlation between the iBeacons in this group.
The second group contains the iBeacons located at y-coordinate 15, because this row contains a high number of instances (see figure ). With the exception of iBeacon b3009, these are iBeacon b3010, b3011, b3012 and b3013 (see figure ).
iBeacon b3010, b3011, b3012 and b3013. Second group of iBeacons
The third group consists of the iBeacons located near the entrance area of the library: b3001, b3005, b3008 and b3009 (see figure ).
iBeacon b3001, b3005, b3008 and b3009. Third group of iBeacons.
From the scatter plots in the figure we can see that there is little or no linear correlation between the iBeacons in this group.
For all three groups of iBeacons, we see that the zero values, where no signal is detected, have a great influence on the relationship between the iBeacons.
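A scatter plot matrix of one group can be sketched as follows. This example uses base R's `pairs()` on simulated readings, which is an assumption; the figures in this report may have been produced with a different function. The helper `simulate_beacon` and its parameters are hypothetical and only serve to generate toy data with many zero values.

```r
set.seed(42)  # reproducible toy data

# Hypothetical helper: simulate shifted RSSI readings for one iBeacon,
# where a share `p_missing` of the observations has no signal (value 0).
simulate_beacon <- function(n, p_missing) {
  rssi <- round(rnorm(n, mean = 125, sd = 5))
  rssi[runif(n) < p_missing] <- 0
  rssi
}

group2 <- c("b3010", "b3011", "b3012", "b3013")
toy <- as.data.frame(sapply(group2, function(b) simulate_beacon(100, 0.7)))

# Scatter plot matrix of the group; points with a zero reading
# line up along the axes of each panel
pairs(toy, main = "Second group of iBeacons (toy data)")
```

In such a plot the points where one of the two iBeacons has no signal lie along the axes, which is one way to see how strongly the zero values shape each panel.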
We calculate and plot the correlation matrix to see whether there is a linear relationship between any of the iBeacons, and not just within the groups that we inspected using scatter plots.
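library(corrplot)
# Correlation matrix of the descriptive features (columns 2 to 14 of indoor_local)
correlation <- cor(indoor_local[, 2:14])
corrplot(correlation, type = "lower")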
Correlation plot of all iBeacons
In general, the linear correlations between the iBeacons are very weak. iBeacons b3002 and b3004 have the highest correlation.
Overall, there is little or no correlation between the different iBeacons. This could be explained by the high number of zero values for each iBeacon, which could turn out to be an important characteristic when the prediction models are built.
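To illustrate how zero values can affect a correlation coefficient, the sketch below compares the correlation over all observations with the correlation over only those observations where both iBeacons were detected. The readings are made up for illustration and are not taken from the data set, so the resulting coefficients say nothing about the real b3002 and b3004 columns.

```r
# Made-up readings on the shifted scale (0 = no signal detected)
toy <- data.frame(
  b3002 = c(0, 0, 120, 124, 0, 130, 127, 0),
  b3004 = c(118, 0, 122, 125, 0, 0, 126, 121)
)

# Correlation over all observations, zeros included
cor_all <- cor(toy$b3002, toy$b3004)

# Correlation over observations where both iBeacons were detected
both <- toy$b3002 > 0 & toy$b3004 > 0
cor_detected <- cor(toy$b3002[both], toy$b3004[both])

print(c(all = cor_all, detected_only = cor_detected))
```

Comparing the two coefficients on the real data would be one way to check how much of the weak correlation is due to the zero values.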
We found that the downloaded data set was in a clean state, so no data cleaning was necessary. However, some data manipulation was done to prepare the data set for model building. Firstly, a positive number (200) was added to all iBeacon RSSI signal readings to convert all negative values into non-negative values. As a result, readings with the value -200 became zero, which represents that no signal was detected for that iBeacon in the particular observation. Secondly, the target feature was split into two target features, one containing the x-coordinate of the location and the other containing the y-coordinate. Therefore, two independent models need to be built for prediction in Phase II. We also converted the coordinates into numerical values for plotting.
From the univariate visualization of the iBeacons, we found a clear relationship between the placement of the iBeacons and where their signals were detected. In addition, we suspect that there are some outlier values in the data set, as seen in the graphs of the individual iBeacons, but since these values appear for more than half of the iBeacons this could also be a natural tendency of the iBeacon signals. From the multivariate visualization, we found that there is little or no linear correlation between the iBeacons. Moreover, we found that there are a lot of zero values in the data set, which could be an important characteristic. Note that data normalization or standardization is left for Phase II.
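The preprocessing steps summarised above can be sketched as follows. This is a minimal illustration on made-up rows, not the code used for this report. The column names `x`, `y` and `x_num` are hypothetical, and mapping the letter to its position in the alphabet is an assumed encoding, since the exact numerical conversion is not specified here.

```r
# Made-up raw rows; -200 marks "no signal" in the raw readings
raw <- data.frame(
  location = c("K03", "U15", "J04"),
  b3001 = c(-200, -78, -200),
  b3002 = c(-75, -200, -80),
  stringsAsFactors = FALSE
)

beacons <- c("b3001", "b3002")
prepared <- raw

# Shift all readings by 200 so that "no signal" becomes 0
prepared[beacons] <- raw[beacons] + 200

# Split the location into its letter (x) and number (y) parts
prepared$x <- substr(raw$location, 1, 1)
prepared$y <- as.integer(substr(raw$location, 2, 3))

# Numeric x-coordinate for plotting: position of the letter in the
# alphabet (assumed encoding, not confirmed by the report)
prepared$x_num <- match(prepared$x, LETTERS)

print(prepared)
```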
Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. http://jmlr.org/papers/v17/15-066.html.
Daróczi, Gergely, and Roman Tsegelskyi. 2017. Pander: An R ’Pandoc’ Writer. https://CRAN.R-project.org/package=pander.
Mohammadi, M., A. Al-Fuqaha, M. Guizani, and J. S. Oh. 2017. “Semi-supervised Deep Reinforcement Learning in Support of IoT and Smart City Services.” IEEE Internet of Things Journal. IEEE, 1–12. doi:10.1109/JIOT.2017.2712560.
Schloerke, Barret, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Joseph Larmarange. 2017. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.
Wei, Taiyun, and Viliam Simko. 2017. R Package “Corrplot”: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.