Introduction and Objective
This RMarkdown document is a learning exercise: we build a model that classifies room occupancy from temperature, relative humidity, light, carbon dioxide (CO2), and humidity ratio (a quantity derived from temperature and relative humidity).
We use data from Kaggle: https://www.kaggle.com/sachinsharma1123/room-occupancy. The dataset consists of 2,665 rows and 6 columns, including the target variable Occupancy.
The occupancy prediction can help people such as a room owner estimate how many rooms are still vacant and can be advertised. Further, by interpreting the logistic regression model, we can see how strongly each selected predictor (e.g. temperature, humidity) influences the target variable.
Library Used
library(tidyverse)
library(gtools)
library(caret)
Read Data and Exploratory Data Analysis
We first read the dataset, then look at each column’s data type.
room <- read.csv("data/room_occupancy.csv")
glimpse(room)
## Rows: 2,665
## Columns: 6
## $ Temperature <dbl> 23.7000, 23.7180, 23.7300, 23.7225, 23.7540, 23.7600,...
## $ Humidity <dbl> 26.2720, 26.2900, 26.2300, 26.1250, 26.2000, 26.2600,...
## $ Light <dbl> 585.2000, 578.4000, 572.6667, 493.7500, 488.6000, 568...
## $ CO2 <dbl> 749.2000, 760.4000, 769.6667, 774.7500, 779.0000, 790...
## $ HumidityRatio <dbl> 0.004764163, 0.004772661, 0.004765153, 0.004743773, 0...
## $ Occupancy <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
Below is the description of each column in the dataset:
Temperature: temperature, in Celsius
Humidity: relative humidity, in %
Light: light, in Lux
CO2: CO2, in ppm
HumidityRatio: humidity ratio, a quantity derived from temperature and relative humidity, in kg water vapor / kg air
Occupancy: occupancy status, 0 for not occupied, 1 for occupied
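To make "derived from temperature and relative humidity" concrete, here is a minimal sketch of one standard way to compute a humidity ratio. The dataset does not state which formula its authors used, so everything here is an assumption: the Magnus approximation for saturation vapor pressure, a fixed sea-level pressure of 1013.25 hPa, and the hypothetical function name humidity_ratio().

```r
# Sketch only: the exact formula behind the HumidityRatio column is not
# documented in the source. Assumptions: Magnus approximation and a
# constant sea-level pressure of 1013.25 hPa.
humidity_ratio <- function(temp_c, rel_humidity_pct, pressure_hpa = 1013.25) {
  # Saturation vapor pressure over water, in hPa (Magnus approximation)
  p_sat <- 6.112 * exp(17.62 * temp_c / (243.12 + temp_c))
  # Partial pressure of water vapor, in hPa
  p_w <- rel_humidity_pct / 100 * p_sat
  # 0.622 is the ratio of the molar masses of water vapor and dry air
  0.622 * p_w / (pressure_hpa - p_w)
}

# First row of the dataset: Temperature = 23.7, Humidity = 26.272
w <- humidity_ratio(23.7, 26.272)
print(w)
```

Under these assumptions the result is close to, though not identical to, the 0.004764163 stored in the first row, which is consistent with HumidityRatio being a function of the other two columns.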
Now, let’s also look at the first 6 rows of the dataset.
head(room)
## Temperature Humidity Light CO2 HumidityRatio Occupancy
## 1 23.7000 26.272 585.2000 749.2000 0.004764163 1
## 2 23.7180 26.290 578.4000 760.4000 0.004772661 1
## 3 23.7300 26.230 572.6667 769.6667 0.004765153 1
## 4 23.7225 26.125 493.7500 774.7500 0.004743773 1
## 5 23.7540 26.200 488.6000 779.0000 0.004766594 1
## 6 23.7600 26.260 568.6667 790.0000 0.004779332 1
Then, we check whether the dataset contains any NA values.
colSums(is.na(room))
## Temperature Humidity Light CO2 HumidityRatio
## 0 0 0 0 0
## Occupancy
## 0
Data Preprocessing
As the glimpse above shows, the Occupancy column is stored as an integer, which is not the right data type for a class label. We therefore convert it to a factor, labelled "not occupied" if Occupancy = 0 and "occupied" if Occupancy = 1.
room <- room %>%
  mutate(Occupancy = factor(ifelse(Occupancy == 0, "not occupied", "occupied")))
summary(room)
## Temperature Humidity Light CO2
## Min. :20.20 Min. :22.10 Min. : 0.0 Min. : 427.5
## 1st Qu.:20.65 1st Qu.:23.26 1st Qu.: 0.0 1st Qu.: 466.0
## Median :20.89 Median :25.00 Median : 0.0 Median : 580.5
## Mean :21.43 Mean :25.35 Mean : 193.2 Mean : 717.9
## 3rd Qu.:22.36 3rd Qu.:26.86 3rd Qu.: 442.5 3rd Qu.: 956.3
## Max. :24.41 Max. :31.47 Max. :1697.2 Max. :1402.2
## HumidityRatio Occupancy
## Min. :0.003303 not occupied:1693
## 1st Qu.:0.003529 occupied : 972
## Median :0.003815
## Mean :0.004027
## 3rd Qu.:0.004532
## Max. :0.005378
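As a side note on the conversion step, the same relabelling can be sketched with factor() alone by passing explicit levels and labels. The vector occupancy_raw below is hypothetical toy data, not the dataset.

```r
# Hypothetical toy vector standing in for the integer Occupancy column
occupancy_raw <- c(1, 0, 0, 1)

# Map 0 -> "not occupied" and 1 -> "occupied" with an explicit level order
occupancy <- factor(occupancy_raw,
                    levels = c(0, 1),
                    labels = c("not occupied", "occupied"))

print(levels(occupancy))
print(table(occupancy))
```

Stating the levels explicitly keeps the level order fixed even if one class happens to be absent from a subset, which matters later when predictions and reference labels are compared.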
Cross-Validation
To train and evaluate the model, we split the dataset at random: 80% of the rows form the training set and the remaining 20% form the test set.
RNGkind(sample.kind = "Rounding")
set.seed(123)
row_data <- nrow(room)
index <- sample(row_data, row_data*0.8)
data_train <- room[index, ]
data_test <- room[-index, ]
Then, to check whether the target variable is imbalanced, we can use the prop.table() function.
prop.table(table(data_train$Occupancy))
##
## not occupied occupied
## 0.6355535 0.3644465
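The split above samples rows without regard to class. As a hedged alternative, a stratified split samples 80% within each class so that the train and test sets keep the same class proportions. The sketch below uses base R on hypothetical toy data (toy, toy_train, toy_test are illustrative names, not objects from this analysis).

```r
set.seed(123)

# Hypothetical toy data with a 60/40 class split
toy <- data.frame(
  x = rnorm(100),
  Occupancy = factor(rep(c("not occupied", "occupied"), times = c(60, 40)))
)

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy)), toy$Occupancy)

# Draw 80% of the row indices within each class
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# Both subsets keep the original class proportions
print(prop.table(table(toy_train$Occupancy)))
print(prop.table(table(toy_test$Occupancy)))
```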
Up-Sample
The classes of the target variable are imbalanced (about 64% not occupied versus 36% occupied). This can hurt model performance because the model tends to favor the majority class, so we rebalance the training set.
Because the dataset is small, upsampling is preferable to downsampling: it does not remove any information from the training set.
set.seed(123)
data_train_up <- upSample(x = data_train %>% select(-Occupancy),
                          y = data_train$Occupancy,
                          list = FALSE,
                          yname = "Occupancy")
table(data_train_up$Occupancy)
##
## not occupied occupied
## 1355 1355
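To illustrate the idea behind upsampling, here is a minimal base-R sketch on hypothetical toy data: every original row is kept, and rows of the smaller class are redrawn with replacement until both classes have the same size. It shows the general technique and is not necessarily how caret implements upSample() internally.

```r
set.seed(123)

# Hypothetical imbalanced training data: 7 vs 3 rows
toy_train <- data.frame(
  x = rnorm(10),
  Occupancy = factor(rep(c("not occupied", "occupied"), times = c(7, 3)))
)

# Size of the largest class is the target size for every class
target_n <- max(table(toy_train$Occupancy))
rows_by_class <- split(seq_len(nrow(toy_train)), toy_train$Occupancy)

up_idx <- unlist(lapply(rows_by_class, function(idx) {
  extra <- target_n - length(idx)
  # Keep all original rows, then add `extra` rows drawn with replacement
  c(idx, idx[sample.int(length(idx), size = extra, replace = TRUE)])
}), use.names = FALSE)

toy_train_up <- toy_train[up_idx, ]
print(table(toy_train_up$Occupancy))
```

Only the training set is upsampled. The test set keeps its natural class proportions so that the evaluation reflects realistic data.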
Model Fitting and Evaluation
After adjusting the class imbalance, we can prepare the model and evaluate it. We will compare models created using logistic regression and k-nearest neighbors (KNN).
Logistic Regression
First, we fit a logistic regression model, then apply stepwise selection to find the model with the lowest AIC.
# glm() models the probability of the second factor level, here "occupied"
model_up <- glm(Occupancy ~ ., data_train_up, family = "binomial")
model_step_up <- step(model_up, direction = "both", trace = 0)
Once the model is fitted, we can evaluate it using the confusionMatrix() function.
# Predict data train
pred_train_upsample <- predict(model_step_up, data_train_up, type = "response")
pred_class_train_up <- ifelse(pred_train_upsample > 0.5, "occupied", "not occupied") %>% as.factor()
confusionMatrix(pred_class_train_up, data_train_up$Occupancy, positive = "occupied")
## Confusion Matrix and Statistics
##
## Reference
## Prediction not occupied occupied
## not occupied 1307 4
## occupied 48 1351
##
## Accuracy : 0.9808
## 95% CI : (0.9749, 0.9856)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.9616
##
## Mcnemar's Test P-Value : 0.000000002476
##
## Sensitivity : 0.9970
## Specificity : 0.9646
## Pos Pred Value : 0.9657
## Neg Pred Value : 0.9969
## Prevalence : 0.5000
## Detection Rate : 0.4985
## Detection Prevalence : 0.5162
## Balanced Accuracy : 0.9808
##
## 'Positive' Class : occupied
##
# Predict data test
pred_test <- predict(model_step_up, data_test, type = "response")
pred_class_test <- ifelse(pred_test > 0.5, "occupied", "not occupied") %>% as.factor()
confusionMatrix(pred_class_test, data_test$Occupancy, positive = "occupied")
## Confusion Matrix and Statistics
##
## Reference
## Prediction not occupied occupied
## not occupied 332 0
## occupied 6 195
##
## Accuracy : 0.9887
## 95% CI : (0.9757, 0.9959)
## No Information Rate : 0.6341
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.9759
##
## Mcnemar's Test P-Value : 0.04123
##
## Sensitivity : 1.0000
## Specificity : 0.9822
## Pos Pred Value : 0.9701
## Neg Pred Value : 1.0000
## Prevalence : 0.3659
## Detection Rate : 0.3659
## Detection Prevalence : 0.3771
## Balanced Accuracy : 0.9911
##
## 'Positive' Class : occupied
##
Overall, the logistic regression model performs well. On the test set, accuracy (98.9%), sensitivity (100%), specificity (98.2%), and precision (97.0%) are all far above 70%, and the train and test accuracies (98.1% and 98.9%) are similar, which suggests the model is not overfitting.
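The test-set figures reported by confusionMatrix() can be reproduced by hand from the four cell counts of the logistic regression test confusion matrix, with "occupied" as the positive class. The variable names below are illustrative.

```r
# Cell counts from the logistic regression test confusion matrix
tp <- 195  # predicted occupied, actually occupied
fn <- 0    # predicted not occupied, actually occupied
fp <- 6    # predicted occupied, actually not occupied
tn <- 332  # predicted not occupied, actually not occupied

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # recall for the positive class
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)  # reported as "Pos Pred Value"

metrics <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                   specificity = specificity, precision = precision), 4)
print(metrics)
```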
KNN
Let’s see the performance of the model created via KNN.
Scaling Data
When we glimpsed the dataset and looked at its first 6 rows, we saw that the columns span very different ranges (for example, HumidityRatio is around 0.004 while CO2 is in the hundreds). We therefore need to scale the values first.
Scaling is needed because KNN relies on the distance between observations (Euclidean distance), so all predictors must be on a comparable scale. Otherwise, the columns with the largest ranges dominate the distance. We scale using the z-score method, which shifts and rescales each column without changing the shape of its distribution.
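As a small illustration of z-score scaling, the sketch below scales a hypothetical toy training matrix and then applies the training set's mean and standard deviation to a toy test matrix. Reusing the training statistics keeps both sets on the same scale and avoids using information from the test set.

```r
# Hypothetical toy matrices with two predictors on very different scales
train_toy <- matrix(c(20, 22, 24, 400, 600, 800), ncol = 2,
                    dimnames = list(NULL, c("Temperature", "CO2")))
test_toy <- matrix(c(21, 23, 500, 900), ncol = 2,
                   dimnames = list(NULL, c("Temperature", "CO2")))

# z-score: subtract the column mean, divide by the column standard deviation
train_scaled <- scale(train_toy)

# scale() stores the statistics it used as attributes
center <- attr(train_scaled, "scaled:center")
spread <- attr(train_scaled, "scaled:scale")

# Apply the training statistics to the test data
test_scaled <- scale(test_toy, center = center, scale = spread)

print(colMeans(train_scaled))      # approximately 0 for every column
print(apply(train_scaled, 2, sd))  # 1 for every column
print(test_scaled)
```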
# Scale the training data
train_x <- data_train_up %>%
  select(-Occupancy) %>%
  scale()
# Save the target variable
train_y <- data_train_up$Occupancy
# Scale the test data using the training set's center and scale
test_x <- data_test %>%
  select(-Occupancy) %>%
  scale(center = attr(train_x, "scaled:center"),
        scale = attr(train_x, "scaled:scale"))
# Save the target variable
test_y <- data_test$Occupancy
Once the dataset is scaled, we can create the KNN model and evaluate it.
pred_knn <- knn3Train(train = train_x,
                      test = test_x,
                      cl = train_y,
                      k = sqrt(nrow(train_x)) %>% round()) %>%
  as.factor()
confusionMatrix(pred_knn, test_y, positive = "occupied")
## Confusion Matrix and Statistics
##
## Reference
## Prediction not occupied occupied
## not occupied 328 0
## occupied 10 195
##
## Accuracy : 0.9812
## 95% CI : (0.9658, 0.991)
## No Information Rate : 0.6341
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.96
##
## Mcnemar's Test P-Value : 0.004427
##
## Sensitivity : 1.0000
## Specificity : 0.9704
## Pos Pred Value : 0.9512
## Neg Pred Value : 1.0000
## Prevalence : 0.3659
## Detection Rate : 0.3659
## Detection Prevalence : 0.3846
## Balanced Accuracy : 0.9852
##
## 'Positive' Class : occupied
##
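To show what a KNN prediction does conceptually, here is a minimal base-R sketch that classifies one new point by majority vote among its k nearest training points under Euclidean distance. The helper knn_predict_one() and the toy data are hypothetical. The sketch is not the implementation used by knn3Train(), and ties are resolved simply by taking the first class with the most votes.

```r
# Classify a single observation by majority vote among its k nearest neighbors
knn_predict_one <- function(x_train, y_train, new_x, k) {
  # Euclidean distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(x_train, 2, new_x)^2))
  # Indices of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  # Majority vote; which.max() returns the first maximum on ties
  votes <- table(y_train[nearest])
  names(votes)[which.max(votes)]
}

# Hypothetical toy data: two well-separated clusters
toy_x <- matrix(c(0, 0,  0, 1,  1, 0,
                  5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
toy_y <- factor(c(rep("not occupied", 3), rep("occupied", 3)))

pred <- knn_predict_one(toy_x, toy_y, c(4.5, 5), k = 3)
print(pred)
```

In the analysis above, the upsampled training set has 2,710 rows, so k = round(sqrt(2710)) = 52.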
Conclusion
Comparing the two models on the test set, the logistic regression model is slightly better than the KNN model: its accuracy (98.9% vs. 98.1%), specificity (98.2% vs. 97.0%), and precision (97.0% vs. 95.1%) are higher, while both models reach 100% sensitivity.
Further, we can also consider evaluation metrics other than accuracy, such as precision.
Why precision? If we act as the room owner and want to advertise rooms for rent, a false positive is costly: a room predicted to be occupied that is actually vacant would not be advertised. High precision means that rooms predicted as occupied really are occupied (more true positives, fewer false positives), which helps the owner give customers accurate vacancy information when advertising rooms for rent.