1 Overview

The Human Activity Recognition data set records an experiment with 30 volunteers aged 19 to 48. Each volunteer performed six activities, which appear in the data as the classes of the target variable:

WALKING: walking on level ground
WALKING_UPSTAIRS: walking up stairs
WALKING_DOWNSTAIRS: walking down stairs
SITTING: sitting
STANDING: standing
LAYING: lying down

Each volunteer performed these activities while wearing a Galaxy SII smartphone on the waist. Their movements were recorded with the smartphone's built-in accelerometer and gyroscope.

We will build three models on this data and compare their performance:

Naive Bayes
Decision Tree
Random Forest

2 Data Exploration

# load packages
library(dplyr)
library(tidyverse)
library(e1071)
library(GGally)
library(caret)
library(randomForest)
library(partykit)

There are four data files, but none of them carries column names describing its contents, so we need to fix that first.

# read the data
train_x <- read.table("train/X_train.txt",sep = "")
train_y <- read.table("train/y_train.txt",sep = "")
test_x <- read.table("test/X_test.txt",sep = "")
test_y <- read.table("test/y_test.txt",sep = "")
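As a small illustration of what `read.table(..., sep = "")` does with files like these, the sketch below parses two made-up rows laid out the same way (numbers separated by runs of spaces, no header line). The values are invented for the example and are not taken from the data set.

```r
# Two toy rows in the same layout as X_train.txt: whitespace-separated
# numbers with no header line (values are made up for illustration).
raw <- "  0.288 -0.020  -0.132\n  0.278 -0.016 -0.123\n"

# sep = "" splits on any amount of white space
toy_x <- read.table(text = raw, sep = "")

names(toy_x)  # default names V1, V2, ... because there is no header
dim(toy_x)
```

The default `V1`, `V2`, ... names are why the columns have to be renamed before modelling.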

2.1 Converting Data Types

# Convert the target to a factor
train_y$V1 = as.factor(train_y$V1)
test_y$V1 = as.factor(test_y$V1)

# Rename the columns
data_act = read.table("activity_labels.txt", sep = "")
features = read.table("features.txt", sep = "")
act = as.character(data_act$V2)
attr = as.character(features$V2)  # make.unique() needs a character vector

colnames(train_x) = make.unique(attr)
colnames(train_y) = "Activity"

colnames(test_x) = make.unique(attr)
colnames(test_y) = "Activity"

# Rename the levels of the target variable
levels(train_y$Activity) = act
levels(test_y$Activity) = act
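The two renaming steps above can be tried on toy data. This sketch assumes, as the use of `make.unique()` suggests, that the feature list contains repeated names; the feature names and activity codes below are illustrative only.

```r
# Illustrative feature names containing a duplicate
feat <- c("fBodyAcc-bandsEnergy()-1,8",
          "fBodyAcc-bandsEnergy()-1,8",
          "tBodyAcc-mean()-X")

# make.unique() appends .1, .2, ... to repeated names so that every
# column can be addressed unambiguously
unique_feat <- make.unique(feat)
unique_feat

# Relabelling integer codes: the factor levels are "1", "2", "3",
# so the i-th label replaces code i
y <- factor(c(3, 1, 2, 1))
levels(y) <- c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS")
as.character(y)
```

Assigning to `levels()` relies on the labels being listed in the same order as the codes, which holds here only if `activity_labels.txt` is sorted by its code column.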

2.2 Combining the Data

Combine the features and the target of the training set, and likewise of the test set, into one data frame each.

train_df <- cbind(train_x, train_y)
test_df <- cbind(test_x, test_y)

3 Data Analysis

Count the observations for each activity and sort them, as a check before building the models.

all = rbind(train_df, test_df)
count_all <- all %>% select(Activity) %>% 
  group_by(Activity) %>% 
  summarise(Count = length(Activity)) %>% 
  arrange(-Count)
count_all

#> # A tibble: 6 × 2
#>   Activity           Count
#>   <fct>              <int>
#> 1 LAYING              1944
#> 2 STANDING            1906
#> 3 SITTING             1777
#> 4 WALKING             1722
#> 5 WALKING_UPSTAIRS    1544
#> 6 WALKING_DOWNSTAIRS  1406

count_all %>% ggplot(aes(x = reorder(Activity, Count), y = Count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Activity") +
  theme_minimal()

The classes are not perfectly balanced, but the differences are small (1,406 to 1,944 observations per activity), so we can proceed without resampling.
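To put a number on the balance, the counts from the table above can be turned into class shares and a largest-to-smallest ratio. This is only a quick check on the printed counts, not part of the original analysis.

```r
# Counts copied from the count_all output above
counts <- c(LAYING = 1944, STANDING = 1906, SITTING = 1777,
            WALKING = 1722, WALKING_UPSTAIRS = 1544,
            WALKING_DOWNSTAIRS = 1406)

# Share of each activity in the combined data
share <- round(prop.table(counts), 3)
share

# Ratio between the largest and the smallest class
imbalance_ratio <- max(counts) / min(counts)
round(imbalance_ratio, 2)
```

A ratio well below 2 is usually read as mild imbalance.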

4 Modelling

4.1 Naive-Bayes

# Fit the model
model1 <- naiveBayes(Activity ~ ., data = train_df)

# Make predictions
pred1 <- predict(model1, newdata = test_df %>% select(-Activity))

# Evaluate the model with a confusion matrix
cm1 <- table(as.factor(pred1), test_df %>% select(Activity) %>% pull)

confusionMatrix(cm1)

#> Confusion Matrix and Statistics
#> 
#>                     
#>                      WALKING WALKING_UPSTAIRS WALKING_DOWNSTAIRS SITTING
#>   WALKING                416                9                 80       0
#>   WALKING_UPSTAIRS        38              451                 83       7
#>   WALKING_DOWNSTAIRS      42               11                257       0
#>   SITTING                  0                0                  0     368
#>   STANDING                 0                0                  0     111
#>   LAYING                   0                0                  0       5
#>                     
#>                      STANDING LAYING
#>   WALKING                   0      0
#>   WALKING_UPSTAIRS         15      3
#>   WALKING_DOWNSTAIRS        0      0
#>   SITTING                  54    212
#>   STANDING                455      0
#>   LAYING                    8    322
#> 
#> Overall Statistics
#>                                          
#>                Accuracy : 0.7699         
#>                  95% CI : (0.7543, 0.785)
#>     No Information Rate : 0.1822         
#>     P-Value [Acc > NIR] : < 2.2e-16      
#>                                          
#>                   Kappa : 0.7237         
#>                                          
#>  Mcnemar's Test P-Value : NA             
#> 
#> Statistics by Class:
#> 
#>                      Class: WALKING Class: WALKING_UPSTAIRS
#> Sensitivity                  0.8387                  0.9575
#> Specificity                  0.9637                  0.9410
#> Pos Pred Value               0.8238                  0.7554
#> Neg Pred Value               0.9672                  0.9915
#> Prevalence                   0.1683                  0.1598
#> Detection Rate               0.1412                  0.1530
#> Detection Prevalence         0.1714                  0.2026
#> Balanced Accuracy            0.9012                  0.9493
#>                      Class: WALKING_DOWNSTAIRS Class: SITTING Class: STANDING
#> Sensitivity                            0.61190         0.7495          0.8553
#> Specificity                            0.97903         0.8917          0.9540
#> Pos Pred Value                         0.82903         0.5804          0.8039
#> Neg Pred Value                         0.93819         0.9468          0.9677
#> Prevalence                             0.14252         0.1666          0.1805
#> Detection Rate                         0.08721         0.1249          0.1544
#> Detection Prevalence                   0.10519         0.2151          0.1921
#> Balanced Accuracy                      0.79547         0.8206          0.9047
#>                      Class: LAYING
#> Sensitivity                 0.5996
#> Specificity                 0.9946
#> Pos Pred Value              0.9612
#> Neg Pred Value              0.9177
#> Prevalence                  0.1822
#> Detection Rate              0.1093
#> Detection Prevalence        0.1137
#> Balanced Accuracy           0.7971
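The per-class statistics above can be reproduced by hand from the confusion matrix, which makes it easier to see where the Naive Bayes model goes wrong. The sketch below retypes the printed counts into a base R matrix (rows are predictions, columns are the actual classes) and computes accuracy, sensitivity and positive predictive value directly.

```r
classes <- c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
             "SITTING", "STANDING", "LAYING")

# Rows = predicted class, columns = actual class; counts copied from
# the Naive Bayes output above
cm <- matrix(c(416,   9,  80,   0,   0,   0,
                38, 451,  83,   7,  15,   3,
                42,  11, 257,   0,   0,   0,
                 0,   0,   0, 368,  54, 212,
                 0,   0,   0, 111, 455,   0,
                 0,   0,   0,   5,   8, 322),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- diag(cm) / colSums(cm)  # correct / all actual cases of the class
precision   <- diag(cm) / rowSums(cm)  # correct / all predictions of the class

round(accuracy, 4)
round(sensitivity, 4)
round(precision, 4)
```

The low sensitivity for LAYING and the low precision for SITTING have the same cause: 212 LAYING observations were predicted as SITTING.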

4.2 Decision Tree

# Fit the model
model2 <- ctree(Activity ~ ., data = train_df)

# Make predictions
pred2 <- predict(model2, newdata = test_df %>% select(-Activity))

# Evaluate the model with a confusion matrix
cm2 <- table(as.factor(pred2), test_df %>% select(Activity) %>% pull)

confusionMatrix(cm2)

#> Confusion Matrix and Statistics
#> 
#>                     
#>                      WALKING WALKING_UPSTAIRS WALKING_DOWNSTAIRS SITTING
#>   WALKING                456              129                 30       0
#>   WALKING_UPSTAIRS        13              297                 61       0
#>   WALKING_DOWNSTAIRS      27               45                329       0
#>   SITTING                  0                0                  0     397
#>   STANDING                 0                0                  0      94
#>   LAYING                   0                0                  0       0
#>                     
#>                      STANDING LAYING
#>   WALKING                   0      0
#>   WALKING_UPSTAIRS          0      0
#>   WALKING_DOWNSTAIRS        0      0
#>   SITTING                  46      1
#>   STANDING                486      0
#>   LAYING                    0    536
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.8487          
#>                  95% CI : (0.8352, 0.8614)
#>     No Information Rate : 0.1822          
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.818           
#>                                           
#>  Mcnemar's Test P-Value : NA              
#> 
#> Statistics by Class:
#> 
#>                      Class: WALKING Class: WALKING_UPSTAIRS
#> Sensitivity                  0.9194                  0.6306
#> Specificity                  0.9351                  0.9701
#> Pos Pred Value               0.7415                  0.8005
#> Neg Pred Value               0.9828                  0.9325
#> Prevalence                   0.1683                  0.1598
#> Detection Rate               0.1547                  0.1008
#> Detection Prevalence         0.2087                  0.1259
#> Balanced Accuracy            0.9272                  0.8003
#>                      Class: WALKING_DOWNSTAIRS Class: SITTING Class: STANDING
#> Sensitivity                             0.7833         0.8086          0.9135
#> Specificity                             0.9715         0.9809          0.9611
#> Pos Pred Value                          0.8204         0.8941          0.8379
#> Neg Pred Value                          0.9643         0.9624          0.9806
#> Prevalence                              0.1425         0.1666          0.1805
#> Detection Rate                          0.1116         0.1347          0.1649
#> Detection Prevalence                    0.1361         0.1507          0.1968
#> Balanced Accuracy                       0.8774         0.8947          0.9373
#>                      Class: LAYING
#> Sensitivity                 0.9981
#> Specificity                 1.0000
#> Pos Pred Value              1.0000
#> Neg Pred Value              0.9996
#> Prevalence                  0.1822
#> Detection Rate              0.1819
#> Detection Prevalence        0.1819
#> Balanced Accuracy           0.9991

4.3 Random Forest

#set.seed(123)
#ctrl <- trainControl(method="repeatedcv", number=4, repeats=3) # repeated k-fold cross-validation

#model3 <- train(Activity ~ ., data = train_df, method="rf", trControl = ctrl) # random forest

#saveRDS(model3, "fb_forest_update.RDS") # save the model
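The commented-out training call uses repeated cross-validation with `number = 4` and `repeats = 3`, which means each candidate model is fitted on 4 x 3 = 12 resamples. The base R sketch below shows how such resamples can be built on a toy index. It is a simplified illustration: the toy size `n` is made up, and the folds here are plain random splits, whereas caret additionally stratifies its folds on the outcome.

```r
set.seed(123)
n       <- 20  # toy number of training rows (the real training set is larger)
k       <- 4   # corresponds to number = 4
repeats <- 3   # corresponds to repeats = 3

# One random fold assignment per repeat: each row gets a fold id in 1..k
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n)))

# Every (repeat, fold) pair is one resample: hold that fold out and
# fit on the remaining rows
resamples <- list()
for (r in seq_len(repeats)) {
  for (f in seq_len(k)) {
    holdout <- which(fold_ids[, r] == f)
    resamples[[length(resamples) + 1]] <- list(
      train = setdiff(seq_len(n), holdout),
      test  = holdout
    )
  }
}

length(resamples)
```

Because a random forest is fitted on every resample, training takes a long time on this data, which is why the fitted model is saved to disk.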

Load the previously trained model to save time.

# Previously trained model
model3 <- readRDS("fb_forest_update.RDS")
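The train-once, load-afterwards pattern can also be wrapped in a small helper so that no code has to be commented in and out. `cached_fit()` below is a hypothetical helper written for this illustration, not a function from caret or base R, and a cheap linear model on the built-in `cars` data stands in for the random forest.

```r
# Hypothetical helper: fit once, then reuse the saved object on later runs
cached_fit <- function(path, fit_fun) {
  if (file.exists(path)) {
    return(readRDS(path))  # a saved model exists: load it
  }
  fit <- fit_fun()         # otherwise fit the model ...
  saveRDS(fit, path)       # ... and store it for next time
  fit
}

# Toy usage with a cheap model standing in for the random forest
path <- tempfile(fileext = ".RDS")
m1 <- cached_fit(path, function() lm(dist ~ speed, data = cars))    # fits and saves
m2 <- cached_fit(path, function() stop("not called: file exists"))  # loads from disk

all.equal(coef(m1), coef(m2))
```

The saved file has to be deleted by hand whenever the data or the training settings change, otherwise a stale model is loaded.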

# Make predictions
pred3 <- predict(model3, test_df)

# Evaluate the model with a confusion matrix
cm3 <- table(as.factor(pred3), test_df %>% select(Activity) %>% pull)

confusionMatrix(cm3)

#> Confusion Matrix and Statistics
#> 
#>                     
#>                      WALKING WALKING_UPSTAIRS WALKING_DOWNSTAIRS SITTING
#>   WALKING                483               35                 18       0
#>   WALKING_UPSTAIRS         6              429                 45       0
#>   WALKING_DOWNSTAIRS       7                7                357       0
#>   SITTING                  0                0                  0     428
#>   STANDING                 0                0                  0      63
#>   LAYING                   0                0                  0       0
#>                     
#>                      STANDING LAYING
#>   WALKING                   0      0
#>   WALKING_UPSTAIRS          0      0
#>   WALKING_DOWNSTAIRS        0      0
#>   SITTING                  51      0
#>   STANDING                481      0
#>   LAYING                    0    537
#> 
#> Overall Statistics
#>                                          
#>                Accuracy : 0.9213         
#>                  95% CI : (0.911, 0.9307)
#>     No Information Rate : 0.1822         
#>     P-Value [Acc > NIR] : < 2.2e-16      
#>                                          
#>                   Kappa : 0.9054         
#>                                          
#>  Mcnemar's Test P-Value : NA             
#> 
#> Statistics by Class:
#> 
#>                      Class: WALKING Class: WALKING_UPSTAIRS
#> Sensitivity                  0.9738                  0.9108
#> Specificity                  0.9784                  0.9794
#> Pos Pred Value               0.9011                  0.8937
#> Neg Pred Value               0.9946                  0.9830
#> Prevalence                   0.1683                  0.1598
#> Detection Rate               0.1639                  0.1456
#> Detection Prevalence         0.1819                  0.1629
#> Balanced Accuracy            0.9761                  0.9451
#>                      Class: WALKING_DOWNSTAIRS Class: SITTING Class: STANDING
#> Sensitivity                             0.8500         0.8717          0.9041
#> Specificity                             0.9945         0.9792          0.9739
#> Pos Pred Value                          0.9623         0.8935          0.8842
#> Neg Pred Value                          0.9755         0.9745          0.9788
#> Prevalence                              0.1425         0.1666          0.1805
#> Detection Rate                          0.1211         0.1452          0.1632
#> Detection Prevalence                    0.1259         0.1625          0.1846
#> Balanced Accuracy                       0.9222         0.9255          0.9390
#>                      Class: LAYING
#> Sensitivity                 1.0000
#> Specificity                 1.0000
#> Pos Pred Value              1.0000
#> Neg Pred Value              1.0000
#> Prevalence                  0.1822
#> Detection Rate              0.1822
#> Detection Prevalence        0.1822
#> Balanced Accuracy           1.0000

Next, we compare the predictions with the actual labels.

pred_model3 <- data.frame(prediction = pred3, Activity = test_df$Activity) %>% 
  mutate(Check = as.factor(ifelse(prediction == Activity, 1, 0)))

pred_model3 %>% ggplot(aes(x = Activity, fill = Check)) +
  geom_bar() +
  coord_flip() +
  labs(x = "Activity") +
  theme_minimal()

5 Comparing the Models

Compare the three models to see which one performs best.

# Summarise a confusion matrix as overall accuracy plus macro-averaged
# (mean over the six classes) recall, specificity and precision.
# For a multi-class problem byClass is a matrix with one row per class,
# so its columns have to be selected by name.
summarise_cm <- function(cm, model_name) {
  res <- confusionMatrix(cm)
  tibble(Model = model_name,
         Accuracy = unname(res$overall["Accuracy"]),
         Recall = mean(res$byClass[, "Sensitivity"]),
         Specificity = mean(res$byClass[, "Specificity"]),
         Precision = mean(res$byClass[, "Pos Pred Value"]))
}

nb <- summarise_cm(cm1, "Naive Bayes")
dt <- summarise_cm(cm2, "Decision Tree")
rf <- summarise_cm(cm3, "Random Forest")

# Combine the three summaries for comparison
rbind(nb, dt, rf)

#> # A tibble: 3 × 5
#>   Model         Accuracy Recall Specificity Precision
#>   <chr>            <dbl>  <dbl>       <dbl>     <dbl>
#> 1 Naive Bayes      0.770  0.769       0.954     0.792
#> 2 Decision Tree    0.849  0.842       0.970     0.849
#> 3 Random Forest    0.921  0.918       0.984     0.922

Based on the comparison above, Random Forest is the best of the three models for classifying the Human Activity Recognition data, with a test accuracy of 0.921 against 0.849 for the decision tree and 0.770 for Naive Bayes. Overall it outperforms both of the other models.
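As a rough check that the accuracy ranking is not a sampling artefact, the 95% confidence intervals reported in the three confusion-matrix outputs can be recomputed and placed side by side. The sketch below uses the exact binomial interval from `binom.test()`. The counts of correct predictions are the diagonal sums of the three confusion matrices, and 2947 is the number of test observations.

```r
n_test  <- 2947
# Correctly classified test observations (diagonal sums of cm1, cm2, cm3)
correct <- c("Naive Bayes" = 2269, "Decision Tree" = 2501,
             "Random Forest" = 2715)

# Accuracy with an exact binomial 95% confidence interval per model
acc_ci <- t(sapply(correct, function(x) {
  ci <- binom.test(x, n_test)$conf.int
  c(Accuracy = x / n_test, Lower = ci[1], Upper = ci[2])
}))

round(acc_ci, 4)
```

The three intervals do not overlap, which supports the ranking. Since all models are scored on the same test set, a paired comparison such as McNemar's test would be the more rigorous choice for a formal claim.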

LBB Regression Model

Ali Ramadhani

2023-01-03