Introduction and Objective

This RMarkdown document is a learning exercise: we build a model that classifies room occupancy from temperature, relative humidity, light, carbon dioxide (CO2), and humidity ratio (a quantity derived from temperature and relative humidity).

We will use data from Kaggle: https://www.kaggle.com/sachinsharma1123/room-occupancy. The dataset consists of 2,665 rows and 6 columns, including the target variable Occupancy.

The occupancy predictions can help people such as a room owner estimate how many rooms are still vacant and can be advertised. In addition, interpreting the logistic regression model shows how strongly each selected predictor (e.g. temperature, humidity) influences the target variable.

Libraries Used

library(tidyverse)
library(gtools)
library(caret)

Read Data and Exploratory Data Analysis

We will read the dataset first, then look at the data type of each column.

room <- read.csv("data/room_occupancy.csv")
glimpse(room)

## Rows: 2,665
## Columns: 6
## $ Temperature   <dbl> 23.7000, 23.7180, 23.7300, 23.7225, 23.7540, 23.7600,...
## $ Humidity      <dbl> 26.2720, 26.2900, 26.2300, 26.1250, 26.2000, 26.2600,...
## $ Light         <dbl> 585.2000, 578.4000, 572.6667, 493.7500, 488.6000, 568...
## $ CO2           <dbl> 749.2000, 760.4000, 769.6667, 774.7500, 779.0000, 790...
## $ HumidityRatio <dbl> 0.004764163, 0.004772661, 0.004765153, 0.004743773, 0...
## $ Occupancy     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...

Below is a description of each column in the dataset:

Temperature: Temperature, in Celsius
Humidity: Relative Humidity in %
Light: Light, in Lux
CO2: CO2, in ppm
HumidityRatio: Humidity ratio, a quantity derived from temperature and relative humidity, in kg water-vapor/kg air
Occupancy: Occupancy, 0 or 1, 0 for not occupied, 1 for occupied status

Now, let’s also look at the first 6 rows of the dataset.

head(room)

##   Temperature Humidity    Light      CO2 HumidityRatio Occupancy
## 1     23.7000   26.272 585.2000 749.2000   0.004764163         1
## 2     23.7180   26.290 578.4000 760.4000   0.004772661         1
## 3     23.7300   26.230 572.6667 769.6667   0.004765153         1
## 4     23.7225   26.125 493.7500 774.7500   0.004743773         1
## 5     23.7540   26.200 488.6000 779.0000   0.004766594         1
## 6     23.7600   26.260 568.6667 790.0000   0.004779332         1

Then, we check whether the dataset has any NA values.

colSums(is.na(room))

##   Temperature      Humidity         Light           CO2 HumidityRatio 
##             0             0             0             0             0 
##     Occupancy 
##             0

Data Preprocessing

As the glimpse above showed, the Occupancy column is stored as an integer, which is not the right data type for a class label. We convert it to a factor labelled not occupied when Occupancy = 0 and occupied when Occupancy = 1.

room <- room %>% 
  mutate(Occupancy = factor(ifelse(Occupancy == 0, "not occupied", "occupied")))

summary(room)

##   Temperature       Humidity         Light             CO2        
##  Min.   :20.20   Min.   :22.10   Min.   :   0.0   Min.   : 427.5  
##  1st Qu.:20.65   1st Qu.:23.26   1st Qu.:   0.0   1st Qu.: 466.0  
##  Median :20.89   Median :25.00   Median :   0.0   Median : 580.5  
##  Mean   :21.43   Mean   :25.35   Mean   : 193.2   Mean   : 717.9  
##  3rd Qu.:22.36   3rd Qu.:26.86   3rd Qu.: 442.5   3rd Qu.: 956.3  
##  Max.   :24.41   Max.   :31.47   Max.   :1697.2   Max.   :1402.2  
##  HumidityRatio             Occupancy   
##  Min.   :0.003303   not occupied:1693  
##  1st Qu.:0.003529   occupied    : 972  
##  Median :0.003815                      
##  Mean   :0.004027                      
##  3rd Qu.:0.004532                      
##  Max.   :0.005378

Cross-Validation

To train the model, we will use a random sample of the dataset: 80% of the rows form the training set and the remaining 20% form the test set.

RNGkind(sample.kind = "Rounding")
set.seed(123)
row_data <- nrow(room)

index <- sample(row_data, row_data*0.8)

data_train <- room[index, ]
data_test <- room[-index, ]
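The split above can be checked on a small example. The sketch below uses a hypothetical toy data frame (`toy`, with made-up values) rather than the room dataset, and only illustrates that an index-based split puts every row in exactly one of the two sets.

```r
# Hypothetical toy data frame standing in for `room` (values are made up).
set.seed(123)
toy <- data.frame(x = rnorm(50), y = rep(c("a", "b"), times = 25))

n <- nrow(toy)
# floor() makes the rounding explicit when n * 0.8 is not a whole number.
train_idx <- sample(n, floor(n * 0.8))

toy_train <- toy[train_idx, ]   # rows selected by the sampled index
toy_test  <- toy[-train_idx, ]  # all remaining rows

# Every row lands in exactly one of the two sets.
nrow(toy_train)
nrow(toy_test)
length(intersect(rownames(toy_train), rownames(toy_test)))
```

The same three checks can be run on data_train and data_test to confirm the split sizes.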

Then, to check whether the target variable has a class imbalance, we can use the prop.table() function.

prop.table(table(data_train$Occupancy))

## 
## not occupied     occupied 
##    0.6355535    0.3644465

Up-Sample

The class proportions of the target variable are imbalanced (about 64% not occupied versus 36% occupied). This can hurt model performance because the model will tend to predict the majority class. Thus, we need to adjust the training set to remove the class imbalance.

Since we only have a small amount of data, upsampling the training set is preferable because it does not discard any information from the training set.

set.seed(123)
data_train_up <- upSample(x = data_train %>% select(-Occupancy),
                          y = data_train$Occupancy,
                          list = FALSE,
                          yname = "Occupancy")

table(data_train_up$Occupancy)

## 
## not occupied     occupied 
##         1355         1355
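To show what upsampling does conceptually, here is a minimal base-R sketch on a hypothetical toy data frame (`toy`, made-up values). It resamples the minority class with replacement until both classes have the same size. It illustrates the idea only and is not necessarily how caret's upSample() is implemented internally.

```r
# Hypothetical imbalanced toy data (made-up values, not the room dataset).
set.seed(123)
toy <- data.frame(
  x = rnorm(10),
  label = factor(c(rep("not occupied", 7), rep("occupied", 3)))
)

counts <- table(toy$label)
target <- max(counts)  # size of the majority class

# For each class, keep all original rows and draw extra rows with
# replacement until the class reaches the majority size.
balanced <- do.call(rbind, lapply(names(counts), function(cls) {
  rows  <- which(toy$label == cls)
  extra <- rows[sample.int(length(rows), target - length(rows), replace = TRUE)]
  toy[c(rows, extra), ]
}))

table(balanced$label)
```

Because the extra rows are copies of existing observations, upsampling should be applied to the training set only, as done above, so that duplicated rows never appear in the test set.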

Model Fitting and Evaluation

After adjusting the class imbalance, we can prepare the model and evaluate it. We will compare models created using logistic regression and k-nearest neighbors (KNN).

Logistic Regression

First, we will fit a logistic regression model on all predictors, then apply stepwise selection to find the model with the lowest AIC.

model_up <- glm(Occupancy ~ ., data_train_up, family = "binomial")
model_step_up <- step(model_up, direction = "both", trace = 0)
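The introduction mentions interpreting the logistic regression model to see how predictors influence the target. A common way to do this is to exponentiate the coefficients into odds ratios. The sketch below uses simulated data (the `light` and `occupied` variables and their true coefficients are assumptions made for illustration), not the fitted model_step_up.

```r
set.seed(123)
# Hypothetical simulated data: higher `light` raises the log-odds of occupancy.
n <- 500
light <- runif(n, 0, 10)
occupied <- rbinom(n, 1, plogis(-3 + 0.8 * light))
toy <- data.frame(light = light, occupied = occupied)

fit <- glm(occupied ~ light, data = toy, family = "binomial")

# Coefficients are on the log-odds scale; exp() turns them into odds ratios.
# An odds ratio above 1 means the odds of "occupied" rise as the predictor rises.
odds_ratio <- exp(coef(fit))
odds_ratio
```

Applied to this report, the equivalent call would be exp(coef(model_step_up)), where each value is the multiplicative change in the odds of occupied for a one-unit increase in that predictor, holding the others fixed.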

Once the model is created, we can evaluate it using the confusionMatrix() function.

# Predict data train
pred_train_upsample <- predict(model_step_up, data_train_up, type = "response")
pred_class_train_up <- ifelse(pred_train_upsample > 0.5, "occupied", "not occupied") %>% as.factor()

confusionMatrix(pred_class_train_up, data_train_up$Occupancy, positive = "occupied")

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     not occupied occupied
##   not occupied         1307        4
##   occupied               48     1351
##                                                
##                Accuracy : 0.9808               
##                  95% CI : (0.9749, 0.9856)     
##     No Information Rate : 0.5                  
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9616               
##                                                
##  Mcnemar's Test P-Value : 0.000000002476       
##                                                
##             Sensitivity : 0.9970               
##             Specificity : 0.9646               
##          Pos Pred Value : 0.9657               
##          Neg Pred Value : 0.9969               
##              Prevalence : 0.5000               
##          Detection Rate : 0.4985               
##    Detection Prevalence : 0.5162               
##       Balanced Accuracy : 0.9808               
##                                                
##        'Positive' Class : occupied             
##

# Predict data test
pred_test <- predict(model_step_up, data_test, type = "response")
pred_class_test <- ifelse(pred_test > 0.5, "occupied", "not occupied") %>% as.factor()

confusionMatrix(pred_class_test, data_test$Occupancy, positive = "occupied")

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     not occupied occupied
##   not occupied          332        0
##   occupied                6      195
##                                               
##                Accuracy : 0.9887              
##                  95% CI : (0.9757, 0.9959)    
##     No Information Rate : 0.6341              
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.9759              
##                                               
##  Mcnemar's Test P-Value : 0.04123             
##                                               
##             Sensitivity : 1.0000              
##             Specificity : 0.9822              
##          Pos Pred Value : 0.9701              
##          Neg Pred Value : 1.0000              
##              Prevalence : 0.3659              
##          Detection Rate : 0.3659              
##    Detection Prevalence : 0.3771              
##       Balanced Accuracy : 0.9911              
##                                               
##        'Positive' Class : occupied            
##

Overall, the logistic regression model performs well: its accuracy, sensitivity, specificity, and precision are all above 96% on both the training and test sets, and the training accuracy (0.9808) is close to the test accuracy (0.9887), so there is no sign of overfitting.
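As a sanity check, the main metrics can be recomputed by hand from the four cells of the test-set confusion matrix above. This is a minimal sketch; the variable names are chosen here for illustration.

```r
# Counts copied from the test-set confusion matrix above,
# with "occupied" as the positive class.
tp <- 195  # predicted occupied, actually occupied
fp <- 6    # predicted occupied, actually not occupied
fn <- 0    # predicted not occupied, actually occupied
tn <- 332  # predicted not occupied, actually not occupied

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)   # also called recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # reported as Pos Pred Value by caret

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
# accuracy 0.9887, sensitivity 1.0000, specificity 0.9822, precision 0.9701
```

These values match the Accuracy, Sensitivity, Specificity, and Pos Pred Value lines in the confusionMatrix() output.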

KNN

Let’s see the performance of the model created via KNN.

Scaling Data

Earlier, when we glimpsed the dataset and looked at its first 6 rows, we saw that the variables have very different ranges (for example, HumidityRatio is around 0.004 while CO2 is in the hundreds). Thus, we need to scale the values first.

Scaling is needed because KNN computes the Euclidean distance between observations, so all variables must be on a comparable scale; otherwise the variables with the largest ranges dominate the distance. We scale with the z-score method, which changes only the scale of the data and not the shape of its distribution. The test data is scaled with the center and scale values computed from the training data.

# Scaling Data Train
train_x <- data_train_up %>% 
  select(-Occupancy) %>%
  scale()

# Save the target variable
train_y <- data_train_up$Occupancy

# Scaling Data Test
test_x <- data_test %>% 
  select(-Occupancy) %>%
  scale(center = attr(train_x, "scaled:center"), 
        scale = attr(train_x, "scaled:scale") 
        )

# Save the target variable
test_y <- data_test$Occupancy
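The z-score step can be verified by hand. The sketch below uses two hypothetical toy matrices (`train` and `test`, made-up values) and shows that scale() with the training center and scale is the same as computing (x - mean) / sd with the training statistics.

```r
# Hypothetical toy matrices (made-up values) standing in for the predictors.
train <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
                dimnames = list(NULL, c("a", "b")))
test  <- matrix(c(2.5, 25), ncol = 2, dimnames = list(NULL, c("a", "b")))

train_scaled <- scale(train)
mu    <- attr(train_scaled, "scaled:center")  # column means of the training data
sigma <- attr(train_scaled, "scaled:scale")   # column standard deviations

# z = (x - mean) / sd, computed by hand with the training statistics.
manual <- sweep(sweep(test, 2, mu, "-"), 2, sigma, "/")
auto   <- scale(test, center = mu, scale = sigma)

all.equal(as.vector(manual), as.vector(auto))
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.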

Once the dataset is scaled, we can create the KNN model and evaluate it.

pred_knn <- knn3Train(train = train_x, 
                      test = test_x, 
                      cl = train_y, 
                      k = sqrt(nrow(train_x)) %>% round()
                      ) %>% 
  as.factor()

confusionMatrix(pred_knn, test_y, positive = "occupied")

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     not occupied occupied
##   not occupied          328        0
##   occupied               10      195
##                                                
##                Accuracy : 0.9812               
##                  95% CI : (0.9658, 0.991)      
##     No Information Rate : 0.6341               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.96                 
##                                                
##  Mcnemar's Test P-Value : 0.004427             
##                                                
##             Sensitivity : 1.0000               
##             Specificity : 0.9704               
##          Pos Pred Value : 0.9512               
##          Neg Pred Value : 1.0000               
##              Prevalence : 0.3659               
##          Detection Rate : 0.3659               
##    Detection Prevalence : 0.3846               
##       Balanced Accuracy : 0.9852               
##                                                
##        'Positive' Class : occupied             
##
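To make the KNN idea concrete, here is a minimal hand-written version for a single query point. The function `knn_predict` and the toy data are hypothetical and written only for illustration; this is not caret's knn3Train() implementation.

```r
# Minimal hand-written KNN for one query point (illustration only).
knn_predict <- function(train, labels, query, k) {
  # Euclidean distance from the query to every training row.
  d <- sqrt(rowSums(sweep(train, 2, query, "-")^2))
  nearest <- order(d)[seq_len(k)]     # indices of the k closest rows
  votes <- table(labels[nearest])     # count labels among the neighbours
  names(votes)[which.max(votes)]      # in this sketch, ties go to the first level
}

# Hypothetical, already-scaled toy data (made-up values).
train  <- rbind(c(0, 0), c(0.1, 0.2), c(0.2, 0.1),
                c(2, 2), c(2.1, 1.9), c(1.9, 2.2))
labels <- factor(c("not occupied", "not occupied", "not occupied",
                   "occupied", "occupied", "occupied"))

knn_predict(train, labels, query = c(1.8, 2.0), k = 3)
# → "occupied"
```

With the 2,710 rows of data_train_up, the rule round(sqrt(n)) used above gives k = 52. Since 52 is even, tied votes between the two classes are possible, so an odd k may be worth trying.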

Conclusion

Comparing the test-set evaluation of the logistic regression and KNN models, the logistic regression model is slightly better than the KNN model because its accuracy (0.9887 vs. 0.9812), specificity (0.9822 vs. 0.9704), and precision (0.9701 vs. 0.9512) are higher.

Further, we can also consider evaluation metrics other than accuracy, such as precision.

Why precision? If we act as the room owner and want to advertise rooms for rent, we want a room that is predicted as occupied to really be occupied (more true positives) rather than predicted as occupied when it is actually vacant (fewer false positives). Better precision helps the room owner give customers the right vacancy information when advertising rooms for rent.

Room Occupancy Prediction using Logistic Regression & KNN

Margareth Devina

6 April 2021