Data is retrieved from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ ml/datasets/Occupancy+Detection+#. There are 3 datasets in the repository. One is training set and other two are test sets. Training dataset includes 8143 observation, test1 includes 2665 and test2 includes 9752 with 7 variables. Aim is to detect occupancy status of an office room based on attributes such as temperature, CO2, light. Occupancy variable has 0 and 1 values that represents not occupied and occupied respectively. These 3 data frames will be combined into one room data frame.
# import the txt files
roomTr <- read.csv("datatraining.txt", header = TRUE, sep=',',stringsAsFactors = FALSE)
roomTe1 <- read.csv("datatest.txt", header = TRUE, sep=',',stringsAsFactors = FALSE)
roomTe2 <- read.csv("datatest2.txt", header = TRUE, sep=',',stringsAsFactors = FALSE)
room <- rbind(roomTr, roomTe1, roomTe2)
# examine the structure of the room data frame
str(room)
## 'data.frame': 20560 obs. of 7 variables:
## $ date : chr "2015-02-04 17:51:00" "2015-02-04 17:51:59" "2015-02-04 17:53:00" "2015-02-04 17:54:00" ...
## $ Temperature : num 23.2 23.1 23.1 23.1 23.1 ...
## $ Humidity : num 27.3 27.3 27.2 27.2 27.2 ...
## $ Light : num 426 430 426 426 426 ...
## $ CO2 : num 721 714 714 708 704 ...
## $ HumidityRatio: num 0.00479 0.00478 0.00478 0.00477 0.00476 ...
## $ Occupancy : int 1 1 1 1 1 1 1 1 1 1 ...
Date column is removed from the data set. We see that 23% of the times the room was occupied while 76.9% of the times it was not occupied.
# drop the temperature feature
room <- room[-1]
# table of diagnosis
table(room$Occupancy)
##
## 0 1
## 15810 4750
# recode occupancy as a factor
room$Occupancy <- factor(room$Occupancy, levels = c(0,1),
labels = c("Not Occupied", "Occupied"))
# table or proportions with more informative labels
round(prop.table(table(room$Occupancy)) * 100, digits = 1)
##
## Not Occupied Occupied
## 76.9 23.1
If we have a look at the numerical variables, we see that ranges differ from each other. We need to normalize these numbers to have better results in the classification.
# summarize three numeric features
summary(room[c("Temperature", "Light", "CO2")])
## Temperature Light CO2
## Min. :19.00 Min. : 0.0 Min. : 412.8
## 1st Qu.:20.20 1st Qu.: 0.0 1st Qu.: 460.0
## Median :20.70 Median : 0.0 Median : 565.4
## Mean :20.91 Mean : 130.8 Mean : 690.6
## 3rd Qu.:21.52 3rd Qu.: 301.0 3rd Qu.: 804.7
## Max. :24.41 Max. :1697.2 Max. :2076.5
# create normalization function
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
# normalize the room data
room_n <- as.data.frame(lapply(room[1:5], normalize))
Dataset should be divided into training and test sets to assess the performance of the algorithm. In this case, we will use 15560 observations to train the kNN model and will use the rest 5000 to predict the results of unlabeled observations to estimate accuracy of the model.
# create training and test data
room_train <- room_n[201:15760, ]
room_test <- room_n[c(1:200,15761:20560), ]
# create labels for training and test data
room_train_labels <- room[201:15760, 6 ]
room_test_labels <- room[c(1:200,15761:20560),6 ]
library(car)
scatterplotMatrix(~Temperature+Humidity+Light+CO2+HumidityRatio | Occupancy, data=room)
In the scatterplot, we see that there is a correlation between humidity and humidity ratio. Furthermore, in the light and C02 attribute pair plot, we see that there is somehow a separation for these two classes.
In the training phase, we train the model with 15560 observations and classify the 5000 observations by taking votes of the closest observations.
Since we have 15560 observations, we will choose k=125 as the square root of number of training observations.
# load the "class" library
library(class)
room_test_pred <- knn(train = room_train, test = room_test,
cl = room_train_labels, k = 125)
We will use CrossTable function to evaluate the model’s performance by comparing the estimated labels with the real class values. In the confusion matrix, we see that true positives and true negatives are accounted for 0.736+0.222 of the data set. So, the overall accuracy of the model is %95.8.
# load the "gmodels" library
library(gmodels)
# Create the cross tabulation of predicted vs. actual
CrossTable(x = room_test_labels, y = room_test_pred,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 5000
##
##
## | room_test_pred
## room_test_labels | Not Occupied | Occupied | Row Total |
## -----------------|--------------|--------------|--------------|
## Not Occupied | 3681 | 206 | 3887 |
## | 0.947 | 0.053 | 0.777 |
## | 1.000 | 0.156 | |
## | 0.736 | 0.041 | |
## -----------------|--------------|--------------|--------------|
## Occupied | 1 | 1112 | 1113 |
## | 0.001 | 0.999 | 0.223 |
## | 0.000 | 0.844 | |
## | 0.000 | 0.222 | |
## -----------------|--------------|--------------|--------------|
## Column Total | 3682 | 1318 | 5000 |
## | 0.736 | 0.264 | |
## -----------------|--------------|--------------|--------------|
##
##
There is just one case that is labeled as not occupied but the room was occupied. That is why false negative rate of the model is very low. On the other hand, %4.1 of the observations are labeled as occupied but the room was indeed not occupied which corresponds a high false positive rate.
We can try to improve model performance to reduce false negatives by normalizing the dataset with z-score normalization and changing k values. By standardizing the values with z-score, we reached better results. It did not improve the model. True negative number increased from 3681 to 3732, while false positive rate decreases from 206 to 155. Overall accuracy of the model with z-score normalization increased 1 percent and became 96.8%.
# use the scale() function to z-score standardize a data frame
room_z <- as.data.frame(scale(room[1:5]))
# confirm that the transformation was applied correctly
summary(room_z$Light)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.6214 -0.6214 -0.6214 0.0000 0.8090 7.4440
# create training and test datasets
room_train <- room_z[201:15760, ]
room_test <- room_z[c(1:200,15761:20560), ]
# re-classify test cases
room_test_pred <- knn(train = room_train, test = room_test,
cl = room_train_labels, k = 125)
# Create the cross tabulation of predicted vs. actual
CrossTable(x = room_test_labels, y = room_test_pred,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 5000
##
##
## | room_test_pred
## room_test_labels | Not Occupied | Occupied | Row Total |
## -----------------|--------------|--------------|--------------|
## Not Occupied | 3732 | 155 | 3887 |
## | 0.960 | 0.040 | 0.777 |
## | 1.000 | 0.122 | |
## | 0.746 | 0.031 | |
## -----------------|--------------|--------------|--------------|
## Occupied | 1 | 1112 | 1113 |
## | 0.001 | 0.999 | 0.223 |
## | 0.000 | 0.878 | |
## | 0.000 | 0.222 | |
## -----------------|--------------|--------------|--------------|
## Column Total | 3733 | 1267 | 5000 |
## | 0.747 | 0.253 | |
## -----------------|--------------|--------------|--------------|
##
##
# try several different values of k
room_train <- room_n[201:15760, ]
room_test <- room_n[c(1:200,15761:20560), ]
room_test_pred <- knn(train = room_train, test = room_test, cl = room_train_labels, k=1)
CrossTable(x = room_test_labels, y = room_test_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 5000
##
##
## | room_test_pred
## room_test_labels | Not Occupied | Occupied | Row Total |
## -----------------|--------------|--------------|--------------|
## Not Occupied | 3686 | 201 | 3887 |
## | 0.948 | 0.052 | 0.777 |
## | 0.997 | 0.154 | |
## | 0.737 | 0.040 | |
## -----------------|--------------|--------------|--------------|
## Occupied | 10 | 1103 | 1113 |
## | 0.009 | 0.991 | 0.223 |
## | 0.003 | 0.846 | |
## | 0.002 | 0.221 | |
## -----------------|--------------|--------------|--------------|
## Column Total | 3696 | 1304 | 5000 |
## | 0.739 | 0.261 | |
## -----------------|--------------|--------------|--------------|
##
##
room_test_pred <- knn(train = room_train, test = room_test, cl = room_train_labels, k=31)
CrossTable(x = room_test_labels, y = room_test_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 5000
##
##
## | room_test_pred
## room_test_labels | Not Occupied | Occupied | Row Total |
## -----------------|--------------|--------------|--------------|
## Not Occupied | 3705 | 182 | 3887 |
## | 0.953 | 0.047 | 0.777 |
## | 1.000 | 0.141 | |
## | 0.741 | 0.036 | |
## -----------------|--------------|--------------|--------------|
## Occupied | 1 | 1112 | 1113 |
## | 0.001 | 0.999 | 0.223 |
## | 0.000 | 0.859 | |
## | 0.000 | 0.222 | |
## -----------------|--------------|--------------|--------------|
## Column Total | 3706 | 1294 | 5000 |
## | 0.741 | 0.259 | |
## -----------------|--------------|--------------|--------------|
##
##
room_test_pred <- knn(train = room_train, test = room_test, cl = room_train_labels, k=61)
CrossTable(x = room_test_labels, y = room_test_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 5000
##
##
## | room_test_pred
## room_test_labels | Not Occupied | Occupied | Row Total |
## -----------------|--------------|--------------|--------------|
## Not Occupied | 3680 | 207 | 3887 |
## | 0.947 | 0.053 | 0.777 |
## | 1.000 | 0.157 | |
## | 0.736 | 0.041 | |
## -----------------|--------------|--------------|--------------|
## Occupied | 1 | 1112 | 1113 |
## | 0.001 | 0.999 | 0.223 |
## | 0.000 | 0.843 | |
## | 0.000 | 0.222 | |
## -----------------|--------------|--------------|--------------|
## Column Total | 3681 | 1319 | 5000 |
## | 0.736 | 0.264 | |
## -----------------|--------------|--------------|--------------|
##
##
room_test_pred <- knn(train = room_train, test = room_test, cl = room_train_labels, k=91)
CrossTable(x = room_test_labels, y = room_test_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 5000
##
##
## | room_test_pred
## room_test_labels | Not Occupied | Occupied | Row Total |
## -----------------|--------------|--------------|--------------|
## Not Occupied | 3683 | 204 | 3887 |
## | 0.948 | 0.052 | 0.777 |
## | 1.000 | 0.155 | |
## | 0.737 | 0.041 | |
## -----------------|--------------|--------------|--------------|
## Occupied | 1 | 1112 | 1113 |
## | 0.001 | 0.999 | 0.223 |
## | 0.000 | 0.845 | |
## | 0.000 | 0.222 | |
## -----------------|--------------|--------------|--------------|
## Column Total | 3684 | 1316 | 5000 |
## | 0.737 | 0.263 | |
## -----------------|--------------|--------------|--------------|
##
##
room_test_pred <- knn(train = room_train, test = room_test, cl = room_train_labels, k=201)
CrossTable(x = room_test_labels, y = room_test_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 5000
##
##
## | room_test_pred
## room_test_labels | Not Occupied | Occupied | Row Total |
## -----------------|--------------|--------------|--------------|
## Not Occupied | 3543 | 344 | 3887 |
## | 0.911 | 0.089 | 0.777 |
## | 0.999 | 0.236 | |
## | 0.709 | 0.069 | |
## -----------------|--------------|--------------|--------------|
## Occupied | 2 | 1111 | 1113 |
## | 0.002 | 0.998 | 0.223 |
## | 0.001 | 0.764 | |
## | 0.000 | 0.222 | |
## -----------------|--------------|--------------|--------------|
## Column Total | 3545 | 1455 | 5000 |
## | 0.709 | 0.291 | |
## -----------------|--------------|--------------|--------------|
##
##
For k=31: We reach better results. Overall accuracy of the model increases %0.5 and true negative rate increases.