The Human Activity Recognition dataset contains recordings from an experiment with 30 volunteers aged 19-48. Each volunteer performed six activities, which form the target classes:
- WALKING: walking
- WALKING_UPSTAIRS: walking upstairs
- WALKING_DOWNSTAIRS: walking downstairs
- SITTING: sitting
- STANDING: standing
- LAYING: lying down
Each volunteer performed these activities while wearing a Samsung Galaxy S II smartphone on the waist. Their movements were recorded using the smartphone's built-in accelerometer and gyroscope.
We will build three models on this data and compare them:
- Naive Bayes
- Decision Tree
- Random Forest
# Load the required packages
library(dplyr)
library(tidyverse)
library(e1071)
library(GGally)
library(caret)
library(randomForest)
library(partykit)
There are four data files, but none of them contains variable names describing its contents, so we need to tidy the data first.
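As a quick sanity check, a minimal sketch using the same file paths as the code below shows that the raw files are headerless, whitespace-separated values, with the feature names stored separately in features.txt:
# Peek at the raw files: X_train.txt has no header row,
# and features.txt maps column positions to feature names
readLines("train/X_train.txt", n = 1)
readLines("features.txt", n = 3)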
# Read the data
train_x <- read.table("train/X_train.txt",sep = "")
train_y <- read.table("train/y_train.txt",sep = "")
test_x <- read.table("test/X_test.txt",sep = "")
test_y <- read.table("test/y_test.txt",sep = "")# Mengubah tipe data menjadi factor
train_y$V1 = as.factor(train_y$V1)
test_y$V1 = as.factor(test_y$V1)
# Rename the columns
data_act = read.table("activity_labels.txt", sep = "")
features = read.table("features.txt", sep = "")
act = as.character(data_act$V2)
attr = as.character(features$V2)
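# features.txt contains some duplicated feature names, which is why
# make.unique() is applied below before assigning the column names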
colnames(train_x) = make.unique(attr)
colnames(train_y) = "Activity"
colnames(test_x) = make.unique(attr)
colnames(test_y) = "Activity"
# Relabel the target variable
levels(train_y$Activity) = act
levels(test_y$Activity) = act
Next, we combine the predictors and labels into a training data frame and a test data frame.
train_df <- cbind(train_x, train_y)
test_df <- cbind(test_x, test_y)
We then count how many observations each activity has and sort the result, to check the class distribution before modelling.
all = rbind(train_df, test_df)
count_all <- all %>% select(Activity) %>%
group_by(Activity) %>%
summarise(Count = length(Activity)) %>%
arrange(-Count)
count_all
#> # A tibble: 6 × 2
#> Activity Count
#> <fct> <int>
#> 1 LAYING 1944
#> 2 STANDING 1906
#> 3 SITTING 1777
#> 4 WALKING 1722
#> 5 WALKING_UPSTAIRS 1544
#> 6 WALKING_DOWNSTAIRS 1406
count_all %>% ggplot(aes(x = reorder(Activity, Count), y = Count)) +
geom_col() +
coord_flip() +
labs(x = "Activity") +
theme_minimal()
The classes do not have equal counts, but the imbalance is mild enough that we can proceed without resampling.
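To put a number on that, an optional proportion table over the combined all data frame shows that every class accounts for roughly 14-19% of the observations:
# Class proportions across the combined train + test data
round(prop.table(table(all$Activity)), 3)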
# Build the Naive Bayes model
model1 <- naiveBayes(Activity ~ ., data = train_df)
# Make predictions
pred1 <- predict(model1, newdata = test_df %>% select(-Activity))
# Evaluate the model with a confusion matrix
cm1 <- table(as.factor(pred1), test_df %>% select(Activity) %>% pull)
confusionMatrix(cm1)
#> Confusion Matrix and Statistics
#>
#>
#> WALKING WALKING_UPSTAIRS WALKING_DOWNSTAIRS SITTING
#> WALKING 416 9 80 0
#> WALKING_UPSTAIRS 38 451 83 7
#> WALKING_DOWNSTAIRS 42 11 257 0
#> SITTING 0 0 0 368
#> STANDING 0 0 0 111
#> LAYING 0 0 0 5
#>
#> STANDING LAYING
#> WALKING 0 0
#> WALKING_UPSTAIRS 15 3
#> WALKING_DOWNSTAIRS 0 0
#> SITTING 54 212
#> STANDING 455 0
#> LAYING 8 322
#>
#> Overall Statistics
#>
#> Accuracy : 0.7699
#> 95% CI : (0.7543, 0.785)
#> No Information Rate : 0.1822
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.7237
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: WALKING Class: WALKING_UPSTAIRS
#> Sensitivity 0.8387 0.9575
#> Specificity 0.9637 0.9410
#> Pos Pred Value 0.8238 0.7554
#> Neg Pred Value 0.9672 0.9915
#> Prevalence 0.1683 0.1598
#> Detection Rate 0.1412 0.1530
#> Detection Prevalence 0.1714 0.2026
#> Balanced Accuracy 0.9012 0.9493
#> Class: WALKING_DOWNSTAIRS Class: SITTING Class: STANDING
#> Sensitivity 0.61190 0.7495 0.8553
#> Specificity 0.97903 0.8917 0.9540
#> Pos Pred Value 0.82903 0.5804 0.8039
#> Neg Pred Value 0.93819 0.9468 0.9677
#> Prevalence 0.14252 0.1666 0.1805
#> Detection Rate 0.08721 0.1249 0.1544
#> Detection Prevalence 0.10519 0.2151 0.1921
#> Balanced Accuracy 0.79547 0.8206 0.9047
#> Class: LAYING
#> Sensitivity 0.5996
#> Specificity 0.9946
#> Pos Pred Value 0.9612
#> Neg Pred Value 0.9177
#> Prevalence 0.1822
#> Detection Rate 0.1093
#> Detection Prevalence 0.1137
#> Balanced Accuracy 0.7971
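Naive Bayes reaches about 77% accuracy, but it confuses SITTING with LAYING and mixes up the walking classes. A plausible explanation is that the engineered accelerometer and gyroscope features are strongly correlated with each other, which violates the conditional-independence assumption behind Naive Bayes. A minimal sketch to check this (the 0.8 cutoff is an arbitrary choice):
# Share of feature pairs in the training predictors with |correlation| > 0.8
cor_mat <- cor(train_x)
mean(abs(cor_mat[upper.tri(cor_mat)]) > 0.8)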
# Build the decision tree model
model2 <- ctree(Activity ~ ., data = train_df)
# Make predictions
pred2 <- predict(model2, newdata = test_df %>% select(-Activity))
# Evaluate the model with a confusion matrix
cm2 <- table(as.factor(pred2), test_df %>% select(Activity) %>% pull)
confusionMatrix(cm2)
#> Confusion Matrix and Statistics
#>
#>
#> WALKING WALKING_UPSTAIRS WALKING_DOWNSTAIRS SITTING
#> WALKING 456 129 30 0
#> WALKING_UPSTAIRS 13 297 61 0
#> WALKING_DOWNSTAIRS 27 45 329 0
#> SITTING 0 0 0 397
#> STANDING 0 0 0 94
#> LAYING 0 0 0 0
#>
#> STANDING LAYING
#> WALKING 0 0
#> WALKING_UPSTAIRS 0 0
#> WALKING_DOWNSTAIRS 0 0
#> SITTING 46 1
#> STANDING 486 0
#> LAYING 0 536
#>
#> Overall Statistics
#>
#> Accuracy : 0.8487
#> 95% CI : (0.8352, 0.8614)
#> No Information Rate : 0.1822
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.818
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: WALKING Class: WALKING_UPSTAIRS
#> Sensitivity 0.9194 0.6306
#> Specificity 0.9351 0.9701
#> Pos Pred Value 0.7415 0.8005
#> Neg Pred Value 0.9828 0.9325
#> Prevalence 0.1683 0.1598
#> Detection Rate 0.1547 0.1008
#> Detection Prevalence 0.2087 0.1259
#> Balanced Accuracy 0.9272 0.8003
#> Class: WALKING_DOWNSTAIRS Class: SITTING Class: STANDING
#> Sensitivity 0.7833 0.8086 0.9135
#> Specificity 0.9715 0.9809 0.9611
#> Pos Pred Value 0.8204 0.8941 0.8379
#> Neg Pred Value 0.9643 0.9624 0.9806
#> Prevalence 0.1425 0.1666 0.1805
#> Detection Rate 0.1116 0.1347 0.1649
#> Detection Prevalence 0.1361 0.1507 0.1968
#> Balanced Accuracy 0.8774 0.8947 0.9373
#> Class: LAYING
#> Sensitivity 0.9981
#> Specificity 1.0000
#> Pos Pred Value 1.0000
#> Neg Pred Value 0.9996
#> Prevalence 0.1822
#> Detection Rate 0.1819
#> Detection Prevalence 0.1819
#> Balanced Accuracy 0.9991
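The conditional inference tree raises accuracy to about 85% and separates LAYING almost perfectly, although it still mixes up the walking classes. If you want to inspect the fitted tree, an optional partykit sketch (plotting can be slow with this many predictors):
# Size of the fitted tree and, optionally, a compact plot
length(nodeids(model2, terminal = TRUE))  # number of terminal nodes
# plot(model2, type = "simple")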
#set.seed(123)
#ctrl <- trainControl(method="repeatedcv", number=4, repeats=3) # repeated k-fold cross-validation
#model3 <- train(Activity ~ ., data = train_df, method="rf", trControl = ctrl) # train the random forest
#saveRDS(model3, "fb_forest_update.RDS") # save the model for reuse
Training the random forest takes a long time, so we load a previously saved model instead.
# Load the pre-trained model
model3 <- readRDS("fb_forest_update.RDS")
# Make predictions
pred3 <- predict(model3, test_df)
# Evaluate the model with a confusion matrix
cm3 <- table(as.factor(pred3), test_df %>% select(Activity) %>% pull)
confusionMatrix(cm3)
#> Confusion Matrix and Statistics
#>
#>
#> WALKING WALKING_UPSTAIRS WALKING_DOWNSTAIRS SITTING
#> WALKING 483 35 18 0
#> WALKING_UPSTAIRS 6 429 45 0
#> WALKING_DOWNSTAIRS 7 7 357 0
#> SITTING 0 0 0 428
#> STANDING 0 0 0 63
#> LAYING 0 0 0 0
#>
#> STANDING LAYING
#> WALKING 0 0
#> WALKING_UPSTAIRS 0 0
#> WALKING_DOWNSTAIRS 0 0
#> SITTING 51 0
#> STANDING 481 0
#> LAYING 0 537
#>
#> Overall Statistics
#>
#> Accuracy : 0.9213
#> 95% CI : (0.911, 0.9307)
#> No Information Rate : 0.1822
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9054
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: WALKING Class: WALKING_UPSTAIRS
#> Sensitivity 0.9738 0.9108
#> Specificity 0.9784 0.9794
#> Pos Pred Value 0.9011 0.8937
#> Neg Pred Value 0.9946 0.9830
#> Prevalence 0.1683 0.1598
#> Detection Rate 0.1639 0.1456
#> Detection Prevalence 0.1819 0.1629
#> Balanced Accuracy 0.9761 0.9451
#> Class: WALKING_DOWNSTAIRS Class: SITTING Class: STANDING
#> Sensitivity 0.8500 0.8717 0.9041
#> Specificity 0.9945 0.9792 0.9739
#> Pos Pred Value 0.9623 0.8935 0.8842
#> Neg Pred Value 0.9755 0.9745 0.9788
#> Prevalence 0.1425 0.1666 0.1805
#> Detection Rate 0.1211 0.1452 0.1632
#> Detection Prevalence 0.1259 0.1625 0.1846
#> Balanced Accuracy 0.9222 0.9255 0.9390
#> Class: LAYING
#> Sensitivity 1.0000
#> Specificity 1.0000
#> Pos Pred Value 1.0000
#> Neg Pred Value 1.0000
#> Prevalence 0.1822
#> Detection Rate 0.1822
#> Detection Prevalence 0.1822
#> Balanced Accuracy 1.0000
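The random forest reaches about 92% accuracy and classifies LAYING perfectly. Since model3 is a caret train object, its variable importance can also be inspected; a small optional sketch:
# Top features by random forest variable importance
rf_imp <- varImp(model3)
plot(rf_imp, top = 10)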
Next, we compare the predictions against the actual labels.
pred_model3 <- data.frame(prediction = pred3, real = (test_df %>% select(Activity))) %>%
mutate(Check = ifelse(prediction==Activity, 1, 0),
Check = as.factor(Check))
pred_model3 %>% ggplot(aes(x = Activity, fill = Check)) +
geom_bar() +
coord_flip() +
labs(x = "Activity") +
theme_minimal()
Finally, we compare the three models to see which one performs best.
# Build a data frame of Naive Bayes metrics from its confusion matrix
temp1 <- confusionMatrix(cm1)
nb <- tibble(Model = "Naive Bayes",
Accuracy = temp1$overall[1],
Recall = temp1$byClass[1],
Specificity = temp1$byClass[2],
Precision = temp1$byClass[3])
# Build a data frame of decision tree metrics from its confusion matrix
temp2 <- confusionMatrix(cm2)
dt <- tibble(Model = "Decision Tree",
Accuracy = temp2$overall[1],
Recall = temp2$byClass[1],
Specificity = temp2$byClass[2],
Precision = temp2$byClass[3])
# Build a data frame of random forest metrics from its confusion matrix
temp3 <- confusionMatrix(cm3)
rf <- tibble(Model = "Random Forest",
Accuracy = temp3$overall[1],
Recall = temp3$byClass[1],
Specificity = temp3$byClass[2],
Precision = temp3$byClass[3])
# Combine the three results for comparison
rbind(nb, dt, rf)
#> # A tibble: 3 × 5
#> Model Accuracy Recall Specificity Precision
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Naive Bayes 0.770 0.839 0.958 0.612
#> 2 Decision Tree 0.849 0.919 0.631 0.783
#> 3 Random Forest 0.921 0.974 0.911 0.85
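One caveat about the table above: for a multiclass problem, confusionMatrix()$byClass is a matrix with one row per class, so byClass[1], byClass[2], and byClass[3] index it column-major and actually return the sensitivities of the first three classes (compare them with the per-class output above), not overall Recall, Specificity, and Precision. A minimal sketch of macro-averaged metrics, which summarise all six classes, might look like the following (the helper macro_metrics is hypothetical):
# Macro-averaged metrics: average each column of the per-class matrix
macro_metrics <- function(cm, model_name) {
  res <- confusionMatrix(cm)
  tibble(Model = model_name,
         Accuracy = res$overall["Accuracy"],
         Recall = mean(res$byClass[, "Sensitivity"], na.rm = TRUE),
         Specificity = mean(res$byClass[, "Specificity"], na.rm = TRUE),
         Precision = mean(res$byClass[, "Pos Pred Value"], na.rm = TRUE))
}
rbind(macro_metrics(cm1, "Naive Bayes"),
      macro_metrics(cm2, "Decision Tree"),
      macro_metrics(cm3, "Random Forest"))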
Based on the comparison above, the Random Forest model is the best of the three for the Human Activity Recognition data: overall, it outperforms both Naive Bayes and the Decision Tree.