Multiclass Machine Learning Case Study: Human Activity Recognition with Smartphones

Introduction

This document is a demonstration and an exercise in building multi-class classification models. Three models are built here (Naive Bayes, Decision Tree, and Random Forest) and then compared on how well they classify the target variable. The purpose is to demonstrate and compare models for the Human Activity Recognition data set. The data originally comes from the UCI Machine Learning Repository. The version used in this document, however, was downloaded from Kaggle, where it has already been compiled and split into a train set and a test set for machine learning.

In brief, the data set comprises a large number of numerical variables that describe the motion and orientation (e.g. body acceleration, gravity acceleration, etc.) recorded by a smartphone worn on the subject's waist, which are used to identify human activities. The experiment was conducted with 30 subjects, each performing six activities: sitting, laying, standing, walking, walking upstairs, and walking downstairs. The target variable of the models is therefore the activity, and the goal of the machine learning models is to classify each activity accurately from the sensor-derived variables. For further information about the data set, please refer to its page on the UCI Machine Learning Repository or Kaggle.

Data Preparation

To begin, the R libraries are loaded and the data set is then imported into the workspace, as follows:

library(dplyr)
library(janitor)
library(ggplot2)
library(gridExtra)
library(inspectdf)
library(tidymodels)
library(factoextra)
library(caret)
library(e1071)
library(MLmetrics)
library(rpart)
library(rattle)
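
Because the list of packages is fairly long, it can help to confirm that they are all installed before calling library(). The snippet below is a minimal sketch and not part of the original analysis; the names `required_pkgs`, `is_installed`, and `missing_pkgs` are introduced here purely for illustration.

```r
# Packages used in this document (copied from the library() calls above).
required_pkgs <- c("dplyr", "janitor", "ggplot2", "gridExtra", "inspectdf",
                   "tidymodels", "factoextra", "caret", "e1071", "MLmetrics",
                   "rpart", "rattle")

# requireNamespace() returns FALSE when a package is absent,
# and it does not attach the package to the search path.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1),
                       quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```

If `missing_pkgs` is empty, the library() calls above should run without an installation error.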

The data set was originally split into a train set and a test set. For EDA purposes, the two sets were combined beforehand, and the combined data set was saved in .RDS format because of its large size. The variable names still need to be converted to a tidy format with clean_names, as follows:

whole_df <- readRDS("data_input/whole_df.RDS") %>% 
  clean_names()
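
The combination step itself is not shown in this document. As a rough sketch of how it might be done, the example below stacks two toy data frames, saves the result as an RDS file, reads it back, and tidies the names. Everything in it is an assumption made for illustration: `train_df`, `test_df`, the made-up values, and the temporary file path are hypothetical stand-ins for the Kaggle files, not the author's actual code.

```r
library(dplyr)
library(janitor)

# Toy stand-ins for the Kaggle train and test files (values are made up).
# check.names = FALSE keeps the raw column names instead of sanitising them.
train_df <- data.frame(
  "tBodyAcc-mean()-X" = c(0.29, 0.28),
  "subject"           = c(1, 1),
  "Activity"          = c("STANDING", "SITTING"),
  check.names = FALSE
)
test_df <- data.frame(
  "tBodyAcc-mean()-X" = 0.26,
  "subject"           = 2,
  "Activity"          = "WALKING",
  check.names = FALSE
)

# Stack the two sets row-wise, then save the result in RDS format.
combined <- bind_rows(train_df, test_df)
rds_path <- tempfile(fileext = ".RDS")  # hypothetical location
saveRDS(combined, rds_path)

# Read the file back and convert the column names to snake_case.
toy_df <- readRDS(rds_path) %>%
  clean_names()

names(toy_df)
```

In this toy case, clean_names should turn a raw name such as `tBodyAcc-mean()-X` into `t_body_acc_mean_x`, which is the naming style of the real data set's variables.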

A glimpse of each variable and its data type can be obtained with the glimpse function, as below:

glimpse(whole_df)

## Rows: 10,299
## Columns: 563
## $ t_body_acc_mean_x                        <dbl> 0.2885845, 0.2784188, 0.27965~
## $ t_body_acc_mean_y                        <dbl> -0.020294171, -0.016410568, -~
## $ t_body_acc_mean_z                        <dbl> -0.13290514, -0.12352019, -0.~
## $ t_body_acc_std_x                         <dbl> -0.9952786, -0.9982453, -0.99~
## $ t_body_acc_std_y                         <dbl> -0.9831106, -0.9753002, -0.96~
## $ t_body_acc_std_z                         <dbl> -0.9135264, -0.9603220, -0.97~
## $ t_body_acc_mad_x                         <dbl> -0.9951121, -0.9988072, -0.99~
## $ t_body_acc_mad_y                         <dbl> -0.9831846, -0.9749144, -0.96~
## $ t_body_acc_mad_z                         <dbl> -0.9235270, -0.9576862, -0.97~
## $ t_body_acc_max_x                         <dbl> -0.9347238, -0.9430675, -0.93~
## $ t_body_acc_max_y                         <dbl> -0.5673781, -0.5578513, -0.55~
## $ t_body_acc_max_z                         <dbl> -0.7444125, -0.8184087, -0.81~
## $ t_body_acc_min_x                         <dbl> 0.8529474, 0.8493079, 0.84360~
## $ t_body_acc_min_y                         <dbl> 0.6858446, 0.6858446, 0.68240~
## $ t_body_acc_min_z                         <dbl> 0.8142628, 0.8226368, 0.83934~
## $ t_body_acc_sma                           <dbl> -0.9655228, -0.9819301, -0.98~
## $ t_body_acc_energy_x                      <dbl> -0.9999446, -0.9999913, -0.99~
## $ t_body_acc_energy_y                      <dbl> -0.9998630, -0.9997884, -0.99~
## $ t_body_acc_energy_z                      <dbl> -0.9946122, -0.9984054, -0.99~
## $ t_body_acc_iqr_x                         <dbl> -0.9942308, -0.9991504, -0.99~
## $ t_body_acc_iqr_y                         <dbl> -0.9876139, -0.9778655, -0.96~
## $ t_body_acc_iqr_z                         <dbl> -0.9432200, -0.9482248, -0.97~
## $ t_body_acc_entropy_x                     <dbl> -0.4077471, -0.7148917, -0.59~
## $ t_body_acc_entropy_y                     <dbl> -0.67933751, -0.50093000, -0.~
## $ t_body_acc_entropy_z                     <dbl> -0.60212187, -0.57097906, -0.~
## $ t_body_acc_ar_coeff_x_1                  <dbl> 0.92929351, 0.61162716, 0.273~
## $ t_body_acc_ar_coeff_x_2                  <dbl> -0.85301114, -0.32954862, -0.~
## $ t_body_acc_ar_coeff_x_3                  <dbl> 0.35990976, 0.28421321, 0.337~
## $ t_body_acc_ar_coeff_x_4                  <dbl> -0.05852638, 0.28459454, -0.1~
## $ t_body_acc_ar_coeff_y_1                  <dbl> 0.25689154, 0.11570542, 0.017~
## $ t_body_acc_ar_coeff_y_2                  <dbl> -0.22484763, -0.09096253, -0.~
## $ t_body_acc_ar_coeff_y_3                  <dbl> 0.26410572, 0.29431041, 0.342~
## $ t_body_acc_ar_coeff_y_4                  <dbl> -0.09524563, -0.28121057, -0.~
## $ t_body_acc_ar_coeff_z_1                  <dbl> 0.27885143, 0.08598843, 0.239~
## $ t_body_acc_ar_coeff_z_2                  <dbl> -0.465084570, -0.022152694, -~
## $ t_body_acc_ar_coeff_z_3                  <dbl> 0.49193596, -0.01665654, 0.17~
## $ t_body_acc_ar_coeff_z_4                  <dbl> -0.190883560, -0.220643500, -~
## $ t_body_acc_correlation_x_y               <dbl> 0.376313890, -0.013428663, -0~
## $ t_body_acc_correlation_x_z               <dbl> 0.43512919, -0.07269189, -0.1~
## $ t_body_acc_correlation_y_z               <dbl> 0.66079033, 0.57938169, 0.608~
## $ t_gravity_acc_mean_x                     <dbl> 0.9633961, 0.9665611, 0.96687~
## $ t_gravity_acc_mean_y                     <dbl> -0.1408397, -0.1415513, -0.14~
## $ t_gravity_acc_mean_z                     <dbl> 0.115374940, 0.109378810, 0.1~
## $ t_gravity_acc_std_x                      <dbl> -0.9852497, -0.9974113, -0.99~
## $ t_gravity_acc_std_y                      <dbl> -0.9817084, -0.9894474, -0.99~
## $ t_gravity_acc_std_z                      <dbl> -0.8776250, -0.9316387, -0.99~
## $ t_gravity_acc_mad_x                      <dbl> -0.9850014, -0.9978836, -0.99~
## $ t_gravity_acc_mad_y                      <dbl> -0.9844162, -0.9896137, -0.99~
## $ t_gravity_acc_mad_z                      <dbl> -0.8946774, -0.9332404, -0.99~
## $ t_gravity_acc_max_x                      <dbl> 0.8920545, 0.8920603, 0.89240~
## $ t_gravity_acc_max_y                      <dbl> -0.1612655, -0.1613426, -0.16~
## $ t_gravity_acc_max_z                      <dbl> 0.124659770, 0.122585730, 0.0~
## $ t_gravity_acc_min_x                      <dbl> 0.9774363, 0.9845201, 0.98677~
## $ t_gravity_acc_min_y                      <dbl> -0.1232134, -0.1148933, -0.11~
## $ t_gravity_acc_min_z                      <dbl> 0.05648273, 0.10276411, 0.102~
## $ t_gravity_acc_sma                        <dbl> -0.3754260, -0.3834296, -0.40~
## $ t_gravity_acc_energy_x                   <dbl> 0.8994686, 0.9078289, 0.90866~
## $ t_gravity_acc_energy_y                   <dbl> -0.9709052, -0.9705828, -0.97~
## $ t_gravity_acc_energy_z                   <dbl> -0.9755104, -0.9785004, -0.98~
## $ t_gravity_acc_iqr_x                      <dbl> -0.9843254, -0.9991884, -0.99~
## $ t_gravity_acc_iqr_y                      <dbl> -0.9888491, -0.9900285, -0.99~
## $ t_gravity_acc_iqr_z                      <dbl> -0.9177426, -0.9416854, -0.99~
## $ t_gravity_acc_entropy_x                  <dbl> -1.0000000, -1.0000000, -1.00~
## $ t_gravity_acc_entropy_y                  <dbl> -1, -1, -1, -1, -1, -1, -1, -~
## $ t_gravity_acc_entropy_z                  <dbl> 0.1138061, -0.2104936, -0.926~
## $ t_gravity_acc_ar_coeff_x_1               <dbl> -0.590425000, -0.410055520, 0~
## $ t_gravity_acc_ar_coeff_x_2               <dbl> 0.59114630, 0.41385634, 0.027~
## $ t_gravity_acc_ar_coeff_x_3               <dbl> -0.59177346, -0.41756716, -0.~
## $ t_gravity_acc_ar_coeff_x_4               <dbl> 0.59246928, 0.42132499, 0.085~
## $ t_gravity_acc_ar_coeff_y_1               <dbl> -0.745448780, -0.196359290, -~
## $ t_gravity_acc_ar_coeff_y_2               <dbl> 0.72086167, 0.12534464, 0.270~
## $ t_gravity_acc_ar_coeff_y_3               <dbl> -0.71237239, -0.10556772, -0.~
## $ t_gravity_acc_ar_coeff_y_4               <dbl> 0.71130003, 0.10909013, 0.257~
## $ t_gravity_acc_ar_coeff_z_1               <dbl> -0.995111590, -0.833882110, -~
## $ t_gravity_acc_ar_coeff_z_2               <dbl> 0.995674910, 0.834271100, 0.7~
## $ t_gravity_acc_ar_coeff_z_3               <dbl> -0.99566759, -0.83418438, -0.~
## $ t_gravity_acc_ar_coeff_z_4               <dbl> 0.99165268, 0.83046390, 0.728~
## $ t_gravity_acc_correlation_x_y            <dbl> 0.5702216, -0.8312839, -0.181~
## $ t_gravity_acc_correlation_x_z            <dbl> 0.43902735, -0.86571108, 0.33~
## $ t_gravity_acc_correlation_y_z            <dbl> 0.9869131, 0.9743856, 0.64341~
## $ t_body_acc_jerk_mean_x                   <dbl> 0.07799634, 0.07400671, 0.073~
## $ t_body_acc_jerk_mean_y                   <dbl> 0.005000803, 0.005771104, 0.0~
## $ t_body_acc_jerk_mean_z                   <dbl> -0.0678308080, 0.0293766330, ~
## $ t_body_acc_jerk_std_x                    <dbl> -0.9935191, -0.9955481, -0.99~
## $ t_body_acc_jerk_std_y                    <dbl> -0.9883600, -0.9810636, -0.98~
## $ t_body_acc_jerk_std_z                    <dbl> -0.9935750, -0.9918457, -0.98~
## $ t_body_acc_jerk_mad_x                    <dbl> -0.9944876, -0.9956320, -0.99~
## $ t_body_acc_jerk_mad_y                    <dbl> -0.9862066, -0.9789380, -0.97~
## $ t_body_acc_jerk_mad_z                    <dbl> -0.9928183, -0.9912766, -0.98~
## $ t_body_acc_jerk_max_x                    <dbl> -0.9851801, -0.9945447, -0.98~
## $ t_body_acc_jerk_max_y                    <dbl> -0.9919942, -0.9790682, -0.97~
## $ t_body_acc_jerk_max_z                    <dbl> -0.9931189, -0.9922574, -0.99~
## $ t_body_acc_jerk_min_x                    <dbl> 0.9898347, 0.9925771, 0.98839~
## $ t_body_acc_jerk_min_y                    <dbl> 0.9919569, 0.9918084, 0.99180~
## $ t_body_acc_jerk_min_z                    <dbl> 0.9905192, 0.9885391, 0.98853~
## $ t_body_acc_jerk_sma                      <dbl> -0.9935220, -0.9913937, -0.98~
## $ t_body_acc_jerk_energy_x                 <dbl> -0.9999349, -0.9999597, -0.99~
## $ t_body_acc_jerk_energy_y                 <dbl> -0.9998204, -0.9996396, -0.99~
## $ t_body_acc_jerk_energy_z                 <dbl> -0.9998785, -0.9998454, -0.99~
## $ t_body_acc_jerk_iqr_x                    <dbl> -0.9943640, -0.9938627, -0.98~
## $ t_body_acc_jerk_iqr_y                    <dbl> -0.9860249, -0.9794351, -0.98~
## $ t_body_acc_jerk_iqr_z                    <dbl> -0.9892336, -0.9933838, -0.98~
## $ t_body_acc_jerk_entropy_x                <dbl> -0.8199492, -0.8750964, -0.75~
## $ t_body_acc_jerk_entropy_y                <dbl> -0.7930464, -0.6553621, -0.67~
## $ t_body_acc_jerk_entropy_z                <dbl> -0.8888529, -0.7673809, -0.74~
## $ t_body_acc_jerk_ar_coeff_x_1             <dbl> 1.0000000, 0.4896622, 0.26522~
## $ t_body_acc_jerk_ar_coeff_x_2             <dbl> -0.22074703, 0.07099708, 0.18~
## $ t_body_acc_jerk_ar_coeff_x_3             <dbl> 0.63683075, 0.36271450, 0.464~
## $ t_body_acc_jerk_ar_coeff_x_4             <dbl> 0.38764356, 0.52730342, 0.371~
## $ t_body_acc_jerk_ar_coeff_y_1             <dbl> 0.24140146, 0.14939565, 0.082~
## $ t_body_acc_jerk_ar_coeff_y_2             <dbl> -0.052252848, 0.062925097, -0~
## $ t_body_acc_jerk_ar_coeff_y_3             <dbl> 0.26417720, 0.37049343, 0.327~
## $ t_body_acc_jerk_ar_coeff_y_4             <dbl> 0.37343945, 0.41354814, 0.437~
## $ t_body_acc_jerk_ar_coeff_z_1             <dbl> 0.34177752, 0.12221568, 0.257~
## $ t_body_acc_jerk_ar_coeff_z_2             <dbl> -0.56979119, 0.18061304, 0.07~
## $ t_body_acc_jerk_ar_coeff_z_3             <dbl> 0.265398820, 0.047423999, 0.1~
## $ t_body_acc_jerk_ar_coeff_z_4             <dbl> -0.4778749, 0.1665727, 0.2467~
## $ t_body_acc_jerk_correlation_x_y          <dbl> -0.385300500, -0.208772180, -~
## $ t_body_acc_jerk_correlation_x_z          <dbl> 0.03364394, 0.08410380, -0.11~
## $ t_body_acc_jerk_correlation_y_z          <dbl> -0.126510820, -0.268553900, -~
## $ t_body_gyro_mean_x                       <dbl> -0.006100849, -0.016111620, -~
## $ t_body_gyro_mean_y                       <dbl> -0.03136479, -0.08389378, -0.~
## $ t_body_gyro_mean_z                       <dbl> 0.10772540, 0.10058429, 0.096~
## $ t_body_gyro_std_x                        <dbl> -0.9853103, -0.9831200, -0.97~
## $ t_body_gyro_std_y                        <dbl> -0.9766234, -0.9890458, -0.99~
## $ t_body_gyro_std_z                        <dbl> -0.9922053, -0.9891212, -0.98~
## $ t_body_gyro_mad_x                        <dbl> -0.9845863, -0.9868904, -0.97~
## $ t_body_gyro_mad_y                        <dbl> -0.9763526, -0.9890380, -0.99~
## $ t_body_gyro_mad_z                        <dbl> -0.9923616, -0.9891846, -0.98~
## $ t_body_gyro_max_x                        <dbl> -0.8670437, -0.8649038, -0.86~
## $ t_body_gyro_max_y                        <dbl> -0.9337860, -0.9535605, -0.95~
## $ t_body_gyro_max_z                        <dbl> -0.7475662, -0.7458700, -0.74~
## $ t_body_gyro_min_x                        <dbl> 0.8473080, 0.8337211, 0.83372~
## $ t_body_gyro_min_y                        <dbl> 0.9148953, 0.9081096, 0.90575~
## $ t_body_gyro_min_z                        <dbl> 0.8308405, 0.8289350, 0.82893~
## $ t_body_gyro_sma                          <dbl> -0.9671843, -0.9806131, -0.97~
## $ t_body_gyro_energy_x                     <dbl> -0.9995783, -0.9997558, -0.99~
## $ t_body_gyro_energy_y                     <dbl> -0.9993543, -0.9998973, -0.99~
## $ t_body_gyro_energy_z                     <dbl> -0.9997634, -0.9998224, -0.99~
## $ t_body_gyro_iqr_x                        <dbl> -0.9834381, -0.9928328, -0.97~
## $ t_body_gyro_iqr_y                        <dbl> -0.9786140, -0.9893447, -0.99~
## $ t_body_gyro_iqr_z                        <dbl> -0.9929656, -0.9902402, -0.98~
## $ t_body_gyro_entropy_x                    <dbl> 0.082631682, 0.007469356, -0.~
## $ t_body_gyro_entropy_y                    <dbl> 0.20226765, -0.53115659, -1.0~
## $ t_body_gyro_entropy_z                    <dbl> -0.1687567, -0.1774446, -0.24~
## $ t_body_gyro_ar_coeff_x_1                 <dbl> 0.09632324, -0.38768063, -0.4~
## $ t_body_gyro_ar_coeff_x_2                 <dbl> -0.27498511, 0.17913763, 0.23~
## $ t_body_gyro_ar_coeff_x_3                 <dbl> 0.49864419, 0.21078900, 0.145~
## $ t_body_gyro_ar_coeff_x_4                 <dbl> -0.22031685, -0.14025958, -0.~
## $ t_body_gyro_ar_coeff_y_1                 <dbl> 1.000000000, -0.047031809, 0.~
## $ t_body_gyro_ar_coeff_y_2                 <dbl> -0.97297139, -0.06494907, -0.~
## $ t_body_gyro_ar_coeff_y_3                 <dbl> 0.31665451, 0.11768661, 0.114~
## $ t_body_gyro_ar_coeff_y_4                 <dbl> 0.375726410, 0.081691287, 0.1~
## $ t_body_gyro_ar_coeff_z_1                 <dbl> 0.72339919, 0.04236404, 0.112~
## $ t_body_gyro_ar_coeff_z_2                 <dbl> -0.77111201, -0.14992836, -0.~
## $ t_body_gyro_ar_coeff_z_3                 <dbl> 0.69021323, 0.29261893, 0.134~
## $ t_body_gyro_ar_coeff_z_4                 <dbl> -0.331831040, -0.149429350, 0~
## $ t_body_gyro_correlation_x_y              <dbl> 0.709583770, 0.046721243, -0.~
## $ t_body_gyro_correlation_x_z              <dbl> 0.134873360, -0.256929400, 0.~
## $ t_body_gyro_correlation_y_z              <dbl> 0.30109948, 0.16939480, -0.35~
## $ t_body_gyro_jerk_mean_x                  <dbl> -0.09916740, -0.11050283, -0.~
## $ t_body_gyro_jerk_mean_y                  <dbl> -0.05551737, -0.04481873, -0.~
## $ t_body_gyro_jerk_mean_z                  <dbl> -0.061985797, -0.059242822, -~
## $ t_body_gyro_jerk_std_x                   <dbl> -0.9921107, -0.9898726, -0.98~
## $ t_body_gyro_jerk_std_y                   <dbl> -0.9925193, -0.9972926, -0.99~
## $ t_body_gyro_jerk_std_z                   <dbl> -0.9920553, -0.9938510, -0.99~
## $ t_body_gyro_jerk_mad_x                   <dbl> -0.9921648, -0.9898762, -0.98~
## $ t_body_gyro_jerk_mad_y                   <dbl> -0.9949416, -0.9974917, -0.99~
## $ t_body_gyro_jerk_mad_z                   <dbl> -0.9926190, -0.9937783, -0.99~
## $ t_body_gyro_jerk_max_x                   <dbl> -0.9901558, -0.9919469, -0.99~
## $ t_body_gyro_jerk_max_y                   <dbl> -0.9867428, -0.9977171, -0.99~
## $ t_body_gyro_jerk_max_z                   <dbl> -0.9920416, -0.9949208, -0.98~
## $ t_body_gyro_jerk_min_x                   <dbl> 0.9944288, 0.9904860, 0.98929~
## $ t_body_gyro_jerk_min_y                   <dbl> 0.9917558, 0.9971222, 0.99712~
## $ t_body_gyro_jerk_min_z                   <dbl> 0.9893519, 0.9945031, 0.99414~
## $ t_body_gyro_jerk_sma                     <dbl> -0.9944534, -0.9952984, -0.99~
## $ t_body_gyro_jerk_energy_x                <dbl> -0.9999375, -0.9999077, -0.99~
## $ t_body_gyro_jerk_energy_y                <dbl> -0.9999535, -0.9999897, -0.99~
## $ t_body_gyro_jerk_energy_z                <dbl> -0.9999229, -0.9999459, -0.99~
## $ t_body_gyro_jerk_iqr_x                   <dbl> -0.9922997, -0.9907418, -0.98~
## $ t_body_gyro_jerk_iqr_y                   <dbl> -0.9969389, -0.9973013, -0.99~
## $ t_body_gyro_jerk_iqr_z                   <dbl> -0.9922430, -0.9938078, -0.99~
## $ t_body_gyro_jerk_entropy_x               <dbl> -0.5898510, -0.6009445, -0.54~
## $ t_body_gyro_jerk_entropy_y               <dbl> -0.6884590, -0.7482472, -0.67~
## $ t_body_gyro_jerk_entropy_z               <dbl> -0.5721069, -0.6089321, -0.58~
## $ t_body_gyro_jerk_ar_coeff_x_1            <dbl> 0.292376340, -0.193307570, -0~
## $ t_body_gyro_jerk_ar_coeff_x_2            <dbl> -0.361998020, -0.067406458, -~
## $ t_body_gyro_jerk_ar_coeff_x_3            <dbl> 0.4055426900, 0.1856190700, 0~
## $ t_body_gyro_jerk_ar_coeff_x_4            <dbl> -0.039006951, 0.041521811, 0.~
## $ t_body_gyro_jerk_ar_coeff_y_1            <dbl> 0.98928381, 0.07235255, 0.095~
## $ t_body_gyro_jerk_ar_coeff_y_2            <dbl> -0.414560480, -0.035377727, 0~
## $ t_body_gyro_jerk_ar_coeff_y_3            <dbl> 0.391602510, 0.177606360, 0.0~
## $ t_body_gyro_jerk_ar_coeff_y_4            <dbl> 0.282250870, 0.027498054, 0.2~
## $ t_body_gyro_jerk_ar_coeff_z_1            <dbl> 0.92726984, 0.18270272, 0.181~
## $ t_body_gyro_jerk_ar_coeff_z_2            <dbl> -0.57237001, -0.16745740, -0.~
## $ t_body_gyro_jerk_ar_coeff_z_3            <dbl> 0.691619200, 0.253251030, 0.1~
## $ t_body_gyro_jerk_ar_coeff_z_4            <dbl> 0.468289820, 0.132333860, 0.0~
## $ t_body_gyro_jerk_correlation_x_y         <dbl> -0.131076970, 0.293855350, 0.~
## $ t_body_gyro_jerk_correlation_x_z         <dbl> -0.087159695, -0.018075169, 0~
## $ t_body_gyro_jerk_correlation_y_z         <dbl> 0.33624748, -0.34333678, -0.3~
## $ t_body_acc_mag_mean                      <dbl> -0.9594339, -0.9792892, -0.98~
## $ t_body_acc_mag_std                       <dbl> -0.9505515, -0.9760571, -0.98~
## $ t_body_acc_mag_mad                       <dbl> -0.9579929, -0.9782473, -0.98~
## $ t_body_acc_mag_max                       <dbl> -0.9463052, -0.9787115, -0.98~
## $ t_body_acc_mag_min                       <dbl> -0.9925557, -0.9953329, -0.99~
## $ t_body_acc_mag_sma                       <dbl> -0.9594339, -0.9792892, -0.98~
## $ t_body_acc_mag_energy                    <dbl> -0.9984928, -0.9994880, -0.99~
## $ t_body_acc_mag_iqr                       <dbl> -0.9576374, -0.9812483, -0.98~
## $ t_body_acc_mag_entropy                   <dbl> -0.23258164, -0.44187611, -0.~
## $ t_body_acc_mag_ar_coeff_1                <dbl> -0.17317874, 0.08156863, 0.03~
## $ t_body_acc_mag_ar_coeff_2                <dbl> -0.02289666, -0.10936606, -0.~
## $ t_body_acc_mag_ar_coeff_3                <dbl> 0.0948315680, 0.3117577100, 0~
## $ t_body_acc_mag_ar_coeff_4                <dbl> 0.19181715, -0.41167480, -0.2~
## $ t_gravity_acc_mag_mean                   <dbl> -0.9594339, -0.9792892, -0.98~
## $ t_gravity_acc_mag_std                    <dbl> -0.9505515, -0.9760571, -0.98~
## $ t_gravity_acc_mag_mad                    <dbl> -0.9579929, -0.9782473, -0.98~
## $ t_gravity_acc_mag_max                    <dbl> -0.9463052, -0.9787115, -0.98~
## $ t_gravity_acc_mag_min                    <dbl> -0.9925557, -0.9953329, -0.99~
## $ t_gravity_acc_mag_sma                    <dbl> -0.9594339, -0.9792892, -0.98~
## $ t_gravity_acc_mag_energy                 <dbl> -0.9984928, -0.9994880, -0.99~
## $ t_gravity_acc_mag_iqr                    <dbl> -0.9576374, -0.9812483, -0.98~
## $ t_gravity_acc_mag_entropy                <dbl> -0.23258164, -0.44187611, -0.~
## $ t_gravity_acc_mag_ar_coeff_1             <dbl> -0.17317874, 0.08156863, 0.03~
## $ t_gravity_acc_mag_ar_coeff_2             <dbl> -0.02289666, -0.10936606, -0.~
## $ t_gravity_acc_mag_ar_coeff_3             <dbl> 0.0948315680, 0.3117577100, 0~
## $ t_gravity_acc_mag_ar_coeff_4             <dbl> 0.19181715, -0.41167480, -0.2~
## $ t_body_acc_jerk_mag_mean                 <dbl> -0.9933059, -0.9912535, -0.98~
## $ t_body_acc_jerk_mag_std                  <dbl> -0.9943364, -0.9916944, -0.99~
## $ t_body_acc_jerk_mag_mad                  <dbl> -0.9945004, -0.9927160, -0.99~
## $ t_body_acc_jerk_mag_max                  <dbl> -0.9927840, -0.9886606, -0.98~
## $ t_body_acc_jerk_mag_min                  <dbl> -0.9912085, -0.9912085, -0.99~
## $ t_body_acc_jerk_mag_sma                  <dbl> -0.9933059, -0.9912535, -0.98~
## $ t_body_acc_jerk_mag_energy               <dbl> -0.9998919, -0.9998454, -0.99~
## $ t_body_acc_jerk_mag_iqr                  <dbl> -0.9929337, -0.9934851, -0.98~
## $ t_body_acc_jerk_mag_entropy              <dbl> -0.8634148, -0.8199283, -0.79~
## $ t_body_acc_jerk_mag_ar_coeff_1           <dbl> 0.28308522, 0.45881205, 0.649~
## $ t_body_acc_jerk_mag_ar_coeff_2           <dbl> -0.23730869, -0.24494134, -0.~
## $ t_body_acc_jerk_mag_ar_coeff_3           <dbl> -0.105432190, 0.056139272, -0~
## $ t_body_acc_jerk_mag_ar_coeff_4           <dbl> -0.038212313, -0.458345680, -~
## $ t_body_gyro_mag_mean                     <dbl> -0.9689591, -0.9806831, -0.97~
## $ t_body_gyro_mag_std                      <dbl> -0.9643352, -0.9837542, -0.98~
## $ t_body_gyro_mag_mad                      <dbl> -0.9572448, -0.9820027, -0.98~
## $ t_body_gyro_mag_max                      <dbl> -0.9750599, -0.9847146, -0.98~
## $ t_body_gyro_mag_min                      <dbl> -0.9915537, -0.9915537, -0.96~
## $ t_body_gyro_mag_sma                      <dbl> -0.9689591, -0.9806831, -0.97~
## $ t_body_gyro_mag_energy                   <dbl> -0.9992865, -0.9997247, -0.99~
## $ t_body_gyro_mag_iqr                      <dbl> -0.9497658, -0.9828568, -0.98~
## $ t_body_gyro_mag_entropy                  <dbl> 0.07257904, -0.19289906, -0.2~
## $ t_body_gyro_mag_ar_coeff_1               <dbl> 0.57251142, -0.22531738, -0.2~
## $ t_body_gyro_mag_ar_coeff_2               <dbl> -0.738602190, -0.017059623, 0~
## $ t_body_gyro_mag_ar_coeff_3               <dbl> 0.21257776, 0.15577724, 0.061~
## $ t_body_gyro_mag_ar_coeff_4               <dbl> 0.43340495, 0.08257521, 0.041~
## $ t_body_gyro_jerk_mag_mean                <dbl> -0.9942478, -0.9951232, -0.99~
## $ t_body_gyro_jerk_mag_std                 <dbl> -0.9913676, -0.9961016, -0.99~
## $ t_body_gyro_jerk_mag_mad                 <dbl> -0.9931430, -0.9958385, -0.99~
## $ t_body_gyro_jerk_mag_max                 <dbl> -0.9889356, -0.9965449, -0.99~
## $ t_body_gyro_jerk_mag_min                 <dbl> -0.9934860, -0.9920060, -0.99~
## $ t_body_gyro_jerk_mag_sma                 <dbl> -0.9942478, -0.9951232, -0.99~
## $ t_body_gyro_jerk_mag_energy              <dbl> -0.9999490, -0.9999698, -0.99~
## $ t_body_gyro_jerk_mag_iqr                 <dbl> -0.9945472, -0.9948192, -0.99~
## $ t_body_gyro_jerk_mag_entropy             <dbl> -0.6197676, -0.7307216, -0.66~
## $ t_body_gyro_jerk_mag_ar_coeff_1          <dbl> 0.2928405, 0.2093341, 0.32803~
## $ t_body_gyro_jerk_mag_ar_coeff_2          <dbl> -0.1768892, -0.1781126, -0.15~
## $ t_body_gyro_jerk_mag_ar_coeff_3          <dbl> -0.145779210, -0.103084330, -~
## $ t_body_gyro_jerk_mag_ar_coeff_4          <dbl> -0.124072330, -0.043823965, -~
## $ f_body_acc_mean_x                        <dbl> -0.9947832, -0.9974507, -0.99~
## $ f_body_acc_mean_y                        <dbl> -0.9829841, -0.9768517, -0.97~
## $ f_body_acc_mean_z                        <dbl> -0.9392687, -0.9735227, -0.98~
## $ f_body_acc_std_x                         <dbl> -0.9954217, -0.9986803, -0.99~
## $ f_body_acc_std_y                         <dbl> -0.9831330, -0.9749298, -0.96~
## $ f_body_acc_std_z                         <dbl> -0.9061650, -0.9554381, -0.97~
## $ f_body_acc_mad_x                         <dbl> -0.9968886, -0.9978897, -0.99~
## $ f_body_acc_mad_y                         <dbl> -0.9845193, -0.9769239, -0.97~
## $ f_body_acc_mad_z                         <dbl> -0.9320820, -0.9683768, -0.98~
## $ f_body_acc_max_x                         <dbl> -0.9937563, -0.9993717, -0.99~
## $ f_body_acc_max_y                         <dbl> -0.9831629, -0.9737703, -0.96~
## $ f_body_acc_max_z                         <dbl> -0.8850542, -0.9487768, -0.96~
## $ f_body_acc_min_x                         <dbl> -0.9939619, -0.9982806, -0.99~
## $ f_body_acc_min_y                         <dbl> -0.9934461, -0.9927209, -0.98~
## $ f_body_acc_min_z                         <dbl> -0.9234277, -0.9895135, -0.99~
## $ f_body_acc_sma                           <dbl> -0.9747327, -0.9858116, -0.98~
## $ f_body_acc_energy_x                      <dbl> -0.9999684, -0.9999908, -0.99~
## $ f_body_acc_energy_y                      <dbl> -0.9996891, -0.9994499, -0.99~
## $ f_body_acc_energy_z                      <dbl> -0.9948915, -0.9985691, -0.99~
## $ f_body_acc_iqr_x                         <dbl> -0.9959260, -0.9948649, -0.98~
## $ f_body_acc_iqr_y                         <dbl> -0.9897089, -0.9807836, -0.97~
## $ f_body_acc_iqr_z                         <dbl> -0.9879912, -0.9857747, -0.98~
## $ f_body_acc_entropy_x                     <dbl> -0.9463569, -1.0000000, -1.00~
## $ f_body_acc_entropy_y                     <dbl> -0.9047478, -0.9047478, -0.81~
## $ f_body_acc_entropy_z                     <dbl> -0.5913025, -0.7584085, -0.81~
## $ f_body_acc_max_inds_x                    <dbl> -1.00000000, 0.09677419, -0.9~
## $ f_body_acc_max_inds_y                    <dbl> -1.0000000, -1.0000000, -1.00~
## $ f_body_acc_max_inds_z                    <dbl> -1.0000000, -1.0000000, -1.00~
## $ f_body_acc_mean_freq_x                   <dbl> 0.252482900, 0.271308550, 0.1~
## $ f_body_acc_mean_freq_y                   <dbl> 0.13183575, 0.04286364, -0.06~
## $ f_body_acc_mean_freq_z                   <dbl> -0.05205025, -0.01430976, 0.0~
## $ f_body_acc_skewness_x                    <dbl> 0.1420506, -0.6925409, -0.727~
## $ f_body_acc_kurtosis_x                    <dbl> -0.15068250, -0.95404703, -0.~
## $ f_body_acc_skewness_y                    <dbl> -0.22054694, -0.04970910, 0.1~
## $ f_body_acc_kurtosis_y                    <dbl> -0.55873853, -0.33197386, -0.~
## $ f_body_acc_skewness_z                    <dbl> 0.24676868, 0.05667537, -0.04~
## $ f_body_acc_kurtosis_z                    <dbl> -0.007415521, -0.289001440, -~
## $ f_body_acc_bands_energy_1_8              <dbl> -0.9999628, -0.9999962, -0.99~
## $ f_body_acc_bands_energy_9_16             <dbl> -0.9999865, -0.9999818, -0.99~
## $ f_body_acc_bands_energy_17_24            <dbl> -0.9999791, -0.9999440, -0.99~
## $ f_body_acc_bands_energy_25_32            <dbl> -0.9999624, -0.9999699, -0.99~
## $ f_body_acc_bands_energy_33_40            <dbl> -0.9999322, -0.9999189, -0.99~
## $ f_body_acc_bands_energy_41_48            <dbl> -0.9997251, -0.9998657, -0.99~
## $ f_body_acc_bands_energy_49_56            <dbl> -0.9996704, -0.9999651, -0.99~
## $ f_body_acc_bands_energy_57_64            <dbl> -0.9999858, -0.9999995, -0.99~
## $ f_body_acc_bands_energy_1_16             <dbl> -0.9999687, -0.9999939, -0.99~
## $ f_body_acc_bands_energy_17_32            <dbl> -0.9999769, -0.9999490, -0.99~
## $ f_body_acc_bands_energy_33_48            <dbl> -0.9998697, -0.9999140, -0.99~
## $ f_body_acc_bands_energy_49_64            <dbl> -0.9997761, -0.9999766, -0.99~
## $ f_body_acc_bands_energy_1_24             <dbl> -0.9999712, -0.9999921, -0.99~
## $ f_body_acc_bands_energy_25_48            <dbl> -0.9999193, -0.9999459, -0.99~
## $ f_body_acc_bands_energy_1_8_1            <dbl> -0.9996568, -0.9994166, -0.99~
## $ f_body_acc_bands_energy_9_16_1           <dbl> -0.9998605, -0.9998133, -0.99~
## $ f_body_acc_bands_energy_17_24_1          <dbl> -0.9998670, -0.9995686, -0.99~
## $ f_body_acc_bands_energy_25_32_1          <dbl> -0.9998630, -0.9998737, -0.99~
## $ f_body_acc_bands_energy_33_40_1          <dbl> -0.9997378, -0.9995489, -0.99~
## $ f_body_acc_bands_energy_41_48_1          <dbl> -0.9997322, -0.9997371, -0.99~
## $ f_body_acc_bands_energy_49_56_1          <dbl> -0.9994926, -0.9995658, -0.99~
## $ f_body_acc_bands_energy_57_64_1          <dbl> -0.9998136, -0.9999053, -0.99~
## $ f_body_acc_bands_energy_1_16_1           <dbl> -0.9996818, -0.9994735, -0.99~
## $ f_body_acc_bands_energy_17_32_1          <dbl> -0.9998394, -0.9995542, -0.99~
## $ f_body_acc_bands_energy_33_48_1          <dbl> -0.9997382, -0.9996020, -0.99~
## $ f_body_acc_bands_energy_49_64_1          <dbl> -0.9996120, -0.9996953, -0.99~
## $ f_body_acc_bands_energy_1_24_1           <dbl> -0.9996872, -0.9994442, -0.99~
## $ f_body_acc_bands_energy_25_48_1          <dbl> -0.9998386, -0.9998042, -0.99~
## $ f_body_acc_bands_energy_1_8_2            <dbl> -0.9935923, -0.9982346, -0.99~
## $ f_body_acc_bands_energy_9_16_2           <dbl> -0.9994758, -0.9997692, -0.99~
## $ f_body_acc_bands_energy_17_24_2          <dbl> -0.9996620, -0.9996922, -0.99~
## $ f_body_acc_bands_energy_25_32_2          <dbl> -0.9996423, -0.9998749, -0.99~
## $ f_body_acc_bands_energy_33_40_2          <dbl> -0.9992934, -0.9996656, -0.99~
## $ f_body_acc_bands_energy_41_48_2          <dbl> -0.9978922, -0.9994483, -0.99~
## $ f_body_acc_bands_energy_49_56_2          <dbl> -0.9959325, -0.9989302, -0.99~
## $ f_body_acc_bands_energy_57_64_2          <dbl> -0.9951464, -0.9987544, -0.99~
## $ f_body_acc_bands_energy_1_16_2           <dbl> -0.9947399, -0.9985456, -0.99~
## $ f_body_acc_bands_energy_17_32_2          <dbl> -0.9996883, -0.9997918, -0.99~
## $ f_body_acc_bands_energy_33_48_2          <dbl> -0.9989246, -0.9996312, -0.99~
## $ f_body_acc_bands_energy_49_64_2          <dbl> -0.9956713, -0.9988775, -0.99~
## $ f_body_acc_bands_energy_1_24_2           <dbl> -0.9948773, -0.9985534, -0.99~
## $ f_body_acc_bands_energy_25_48_2          <dbl> -0.9994544, -0.9998221, -0.99~
## $ f_body_acc_jerk_mean_x                   <dbl> -0.9923325, -0.9950322, -0.99~
## $ f_body_acc_jerk_mean_y                   <dbl> -0.9871699, -0.9813115, -0.98~
## $ f_body_acc_jerk_mean_z                   <dbl> -0.9896961, -0.9897398, -0.98~
## $ f_body_acc_jerk_std_x                    <dbl> -0.9958207, -0.9966523, -0.99~
## $ f_body_acc_jerk_std_y                    <dbl> -0.9909363, -0.9820839, -0.98~
## $ f_body_acc_jerk_std_z                    <dbl> -0.9970517, -0.9926268, -0.99~
## $ f_body_acc_jerk_mad_x                    <dbl> -0.9938055, -0.9949767, -0.98~
## $ f_body_acc_jerk_mad_y                    <dbl> -0.9905187, -0.9829295, -0.98~
## $ f_body_acc_jerk_mad_z                    <dbl> -0.9969928, -0.9916414, -0.98~
## $ f_body_acc_jerk_max_x                    <dbl> -0.9967369, -0.9974245, -0.99~
## $ f_body_acc_jerk_max_y                    <dbl> -0.9919752, -0.9849232, -0.98~
## $ f_body_acc_jerk_max_z                    <dbl> -0.9932417, -0.9931870, -0.99~
## $ f_body_acc_jerk_min_x                    <dbl> -0.9983491, -0.9979168, -0.99~
## $ f_body_acc_jerk_min_y                    <dbl> -0.9911084, -0.9825186, -0.99~
## $ f_body_acc_jerk_min_z                    <dbl> -0.9598854, -0.9868384, -0.99~
## $ f_body_acc_jerk_sma                      <dbl> -0.9905150, -0.9898509, -0.98~
## $ f_body_acc_jerk_energy_x                 <dbl> -0.9999347, -0.9999596, -0.99~
## $ f_body_acc_jerk_energy_y                 <dbl> -0.9998205, -0.9996396, -0.99~
## $ f_body_acc_jerk_energy_z                 <dbl> -0.9998845, -0.9998466, -0.99~
## $ f_body_acc_jerk_iqr_x                    <dbl> -0.9930263, -0.9928434, -0.98~
## $ f_body_acc_jerk_iqr_y                    <dbl> -0.9913734, -0.9852207, -0.98~
## $ f_body_acc_jerk_iqr_z                    <dbl> -0.9962396, -0.9910493, -0.98~
## $ f_body_acc_jerk_entropy_x                <dbl> -1, -1, -1, -1, -1, -1, -1, -~
## $ f_body_acc_jerk_entropy_y                <dbl> -1.0000000, -1.0000000, -1.00~
## $ f_body_acc_jerk_entropy_z                <dbl> -1, -1, -1, -1, -1, -1, -1, -~
## $ f_body_acc_jerk_max_inds_x               <dbl> 1.00, -0.32, -0.16, -0.12, -0~
## $ f_body_acc_jerk_max_inds_y               <dbl> -0.24, -0.12, -0.48, -0.56, -~
## $ f_body_acc_jerk_max_inds_z               <dbl> -1.00, -0.32, -0.28, -0.28, 0~
## $ f_body_acc_jerk_mean_freq_x              <dbl> 0.87038451, 0.60851352, 0.115~
## $ f_body_acc_jerk_mean_freq_y              <dbl> 0.210697000, -0.053675613, -0~
## $ f_body_acc_jerk_mean_freq_z              <dbl> 0.26370789, 0.06314827, 0.038~
## $ f_body_acc_jerk_skewness_x               <dbl> -0.7036858, -0.6303049, -0.59~
## $ f_body_acc_jerk_kurtosis_x               <dbl> -0.9037425, -0.9103945, -0.92~
## $ f_body_acc_jerk_skewness_y               <dbl> -0.5825736, -0.4144235, -0.52~
## $ f_body_acc_jerk_kurtosis_y               <dbl> -0.9363101, -0.8505864, -0.91~
## $ f_body_acc_jerk_skewness_z               <dbl> -0.5073447, -0.6555347, -0.80~
## $ f_body_acc_jerk_kurtosis_z               <dbl> -0.8055359, -0.9159869, -0.98~
## $ f_body_acc_jerk_bands_energy_1_8         <dbl> -0.9999865, -0.9999963, -0.99~
## $ f_body_acc_jerk_bands_energy_9_16        <dbl> -0.9999796, -0.9999797, -0.99~
## $ f_body_acc_jerk_bands_energy_17_24       <dbl> -0.9999748, -0.9999489, -0.99~
## $ f_body_acc_jerk_bands_energy_25_32       <dbl> -0.9999551, -0.9999683, -0.99~
## $ f_body_acc_jerk_bands_energy_33_40       <dbl> -0.9999186, -0.9999101, -0.99~
## $ f_body_acc_jerk_bands_energy_41_48       <dbl> -0.9996401, -0.9998137, -0.99~
## $ f_body_acc_jerk_bands_energy_49_56       <dbl> -0.9994833, -0.9999203, -0.99~
## $ f_body_acc_jerk_bands_energy_57_64       <dbl> -0.9999609, -0.9999607, -0.99~
## $ f_body_acc_jerk_bands_energy_1_16        <dbl> -0.9999823, -0.9999867, -0.99~
## $ f_body_acc_jerk_bands_energy_17_32       <dbl> -0.9999707, -0.9999560, -0.99~
## $ f_body_acc_jerk_bands_energy_33_48       <dbl> -0.9998110, -0.9998767, -0.99~
## $ f_body_acc_jerk_bands_energy_49_64       <dbl> -0.9994847, -0.9999141, -0.99~
## $ f_body_acc_jerk_bands_energy_1_24        <dbl> -0.9999808, -0.9999744, -0.99~
## $ f_body_acc_jerk_bands_energy_25_48       <dbl> -0.9998519, -0.9999058, -0.99~
## $ f_body_acc_jerk_bands_energy_1_8_1       <dbl> -0.9999326, -0.9998610, -0.99~
## $ f_body_acc_jerk_bands_energy_9_16_1      <dbl> -0.9998999, -0.9998272, -0.99~
## $ f_body_acc_jerk_bands_energy_17_24_1     <dbl> -0.9998244, -0.9994565, -0.99~
## $ f_body_acc_jerk_bands_energy_25_32_1     <dbl> -0.9998598, -0.9998303, -0.99~
## $ f_body_acc_jerk_bands_energy_33_40_1     <dbl> -0.9997275, -0.9996093, -0.99~
## $ f_body_acc_jerk_bands_energy_41_48_1     <dbl> -0.9997288, -0.9996855, -0.99~
## $ f_body_acc_jerk_bands_energy_49_56_1     <dbl> -0.9995671, -0.9995761, -0.99~
## $ f_body_acc_jerk_bands_energy_57_64_1     <dbl> -0.9997652, -0.9999370, -0.99~
## $ f_body_acc_jerk_bands_energy_1_16_1      <dbl> -0.9999002, -0.9998174, -0.99~
## $ f_body_acc_jerk_bands_energy_17_32_1     <dbl> -0.9998149, -0.9995325, -0.99~
## $ f_body_acc_jerk_bands_energy_33_48_1     <dbl> -0.9997098, -0.9995952, -0.99~
## $ f_body_acc_jerk_bands_energy_49_64_1     <dbl> -0.9995961, -0.9996257, -0.99~
## $ f_body_acc_jerk_bands_energy_1_24_1      <dbl> -0.9998522, -0.9996299, -0.99~
## $ f_body_acc_jerk_bands_energy_25_48_1     <dbl> -0.9998221, -0.9997593, -0.99~
## $ f_body_acc_jerk_bands_energy_1_8_2       <dbl> -0.9993999, -0.9998589, -0.99~
## $ f_body_acc_jerk_bands_energy_9_16_2      <dbl> -0.9997656, -0.9998465, -0.99~
## $ f_body_acc_jerk_bands_energy_17_24_2     <dbl> -0.9999585, -0.9997949, -0.99~
## $ f_body_acc_jerk_bands_energy_25_32_2     <dbl> -0.9999495, -0.9998009, -0.99~
## $ f_body_acc_jerk_bands_energy_33_40_2     <dbl> -0.9998385, -0.9998193, -0.99~
## $ f_body_acc_jerk_bands_energy_41_48_2     <dbl> -0.9998135, -0.9997692, -0.99~
## $ f_body_acc_jerk_bands_energy_49_56_2     <dbl> -0.9987805, -0.9996370, -0.99~
## $ f_body_acc_jerk_bands_energy_57_64_2     <dbl> -0.9985778, -0.9999545, -0.99~
## $ f_body_acc_jerk_bands_energy_1_16_2      <dbl> -0.9996197, -0.9998519, -0.99~
## $ f_body_acc_jerk_bands_energy_17_32_2     <dbl> -0.9999836, -0.9998273, -0.99~
## $ f_body_acc_jerk_bands_energy_33_48_2     <dbl> -0.9998281, -0.9998001, -0.99~
## $ f_body_acc_jerk_bands_energy_49_64_2     <dbl> -0.9986807, -0.9996510, -0.99~
## $ f_body_acc_jerk_bands_energy_1_24_2      <dbl> -0.9998442, -0.9998350, -0.99~
## $ f_body_acc_jerk_bands_energy_25_48_2     <dbl> -0.9999279, -0.9998267, -0.99~
## $ f_body_gyro_mean_x                       <dbl> -0.9865744, -0.9773867, -0.97~
## $ f_body_gyro_mean_y                       <dbl> -0.9817615, -0.9925300, -0.99~
## $ f_body_gyro_mean_z                       <dbl> -0.9895148, -0.9896058, -0.98~
## $ f_body_gyro_std_x                        <dbl> -0.9850326, -0.9849043, -0.97~
## $ f_body_gyro_std_y                        <dbl> -0.9738861, -0.9871681, -0.99~
## $ f_body_gyro_std_z                        <dbl> -0.9940349, -0.9897847, -0.98~
## $ f_body_gyro_mad_x                        <dbl> -0.9865308, -0.9793612, -0.97~
## $ f_body_gyro_mad_y                        <dbl> -0.9836164, -0.9918368, -0.99~
## $ f_body_gyro_mad_z                        <dbl> -0.9923520, -0.9879651, -0.98~
## $ f_body_gyro_max_x                        <dbl> -0.9804984, -0.9873538, -0.97~
## $ f_body_gyro_max_y                        <dbl> -0.9722709, -0.9847864, -0.99~
## $ f_body_gyro_max_z                        <dbl> -0.9949443, -0.9901508, -0.99~
## $ f_body_gyro_min_x                        <dbl> -0.9975686, -0.9868918, -0.98~
## $ f_body_gyro_min_y                        <dbl> -0.9840851, -0.9990535, -0.99~
## $ f_body_gyro_min_z                        <dbl> -0.9943354, -0.9944137, -0.99~
## $ f_body_gyro_sma                          <dbl> -0.9852762, -0.9868687, -0.98~
## $ f_body_gyro_energy_x                     <dbl> -0.9998637, -0.9998249, -0.99~
## $ f_body_gyro_energy_y                     <dbl> -0.9996661, -0.9999115, -0.99~
## $ f_body_gyro_energy_z                     <dbl> -0.9999346, -0.9998921, -0.99~
## $ f_body_gyro_iqr_x                        <dbl> -0.9903439, -0.9870994, -0.98~
## $ f_body_gyro_iqr_y                        <dbl> -0.9948357, -0.9955637, -0.99~
## $ f_body_gyro_iqr_z                        <dbl> -0.9944116, -0.9872545, -0.99~
## $ f_body_gyro_entropy_x                    <dbl> -0.7124023, -0.6111119, -0.59~
## $ f_body_gyro_entropy_y                    <dbl> -0.6448424, -0.7646030, -0.80~
## $ f_body_gyro_entropy_z                    <dbl> -0.8389930, -0.7510797, -0.75~
## $ f_body_gyro_max_inds_x                   <dbl> -1.0000000, -1.0000000, -1.00~
## $ f_body_gyro_max_inds_y                   <dbl> -1.0000000, -1.0000000, -0.87~
## $ f_body_gyro_max_inds_z                   <dbl> -1.0000000, -1.0000000, -1.00~
## $ f_body_gyro_mean_freq_x                  <dbl> -0.25754888, -0.04816744, -0.~
## $ f_body_gyro_mean_freq_y                  <dbl> 0.09794711, -0.40160791, -0.0~
## $ f_body_gyro_mean_freq_z                  <dbl> 0.54715105, -0.06817833, -0.1~
## $ f_body_gyro_skewness_x                   <dbl> 0.377311210, -0.458553310, 0.~
## $ f_body_gyro_kurtosis_x                   <dbl> 0.1340915, -0.7970135, -0.244~
## $ f_body_gyro_skewness_y                   <dbl> 0.27337197, 0.38756889, -0.42~
## $ f_body_gyro_kurtosis_y                   <dbl> -0.09126183, 0.14866483, -0.8~
## $ f_body_gyro_skewness_z                   <dbl> -0.48434650, -0.15690927, -0.~
## $ f_body_gyro_kurtosis_z                   <dbl> -0.78285070, -0.45177589, -0.~
## $ f_body_gyro_bands_energy_1_8             <dbl> -0.9998650, -0.9998509, -0.99~
## $ f_body_gyro_bands_energy_9_16            <dbl> -0.9999318, -0.9997943, -0.99~
## $ f_body_gyro_bands_energy_17_24           <dbl> -0.9999729, -0.9999131, -0.99~
## $ f_body_gyro_bands_energy_25_32           <dbl> -0.9999702, -0.9999182, -0.99~
## $ f_body_gyro_bands_energy_33_40           <dbl> -0.9999301, -0.9998964, -0.99~
## $ f_body_gyro_bands_energy_41_48           <dbl> -0.9999586, -0.9998853, -0.99~
## $ f_body_gyro_bands_energy_49_56           <dbl> -0.9999290, -0.9997842, -0.99~
## $ f_body_gyro_bands_energy_57_64           <dbl> -0.9999847, -0.9997824, -0.99~
## $ f_body_gyro_bands_energy_1_16            <dbl> -0.9998633, -0.9998299, -0.99~
## $ f_body_gyro_bands_energy_17_32           <dbl> -0.9999681, -0.9998988, -0.99~
## $ f_body_gyro_bands_energy_33_48           <dbl> -0.9999361, -0.9998828, -0.99~
## $ f_body_gyro_bands_energy_49_64           <dbl> -0.9999536, -0.9997834, -0.99~
## $ f_body_gyro_bands_energy_1_24            <dbl> -0.9998644, -0.9998283, -0.99~
## $ f_body_gyro_bands_energy_25_48           <dbl> -0.9999610, -0.9999080, -0.99~
## $ f_body_gyro_bands_energy_1_8_1           <dbl> -0.9994537, -0.9998564, -0.99~
## $ f_body_gyro_bands_energy_9_16_1          <dbl> -0.9999781, -0.9999885, -0.99~
## $ f_body_gyro_bands_energy_17_24_1         <dbl> -0.9999915, -0.9999957, -0.99~
## $ f_body_gyro_bands_energy_25_32_1         <dbl> -0.9999901, -0.9999942, -0.99~
## $ f_body_gyro_bands_energy_33_40_1         <dbl> -0.9999686, -0.9999861, -0.99~
## $ f_body_gyro_bands_energy_41_48_1         <dbl> -0.9998066, -0.9999845, -0.99~
## $ f_body_gyro_bands_energy_49_56_1         <dbl> -0.9983460, -0.9999800, -0.99~
## $ f_body_gyro_bands_energy_57_64_1         <dbl> -0.9989612, -0.9999900, -0.99~
## $ f_body_gyro_bands_energy_1_16_1          <dbl> -0.9996187, -0.9998966, -0.99~
## $ f_body_gyro_bands_energy_17_32_1         <dbl> -0.9999893, -0.9999945, -0.99~
## $ f_body_gyro_bands_energy_33_48_1         <dbl> -0.9999354, -0.9999860, -0.99~
## $ f_body_gyro_bands_energy_49_64_1         <dbl> -0.9983875, -0.9999817, -0.99~
## $ f_body_gyro_bands_energy_1_24_1          <dbl> -0.9996426, -0.9999026, -0.99~
## $ f_body_gyro_bands_energy_25_48_1         <dbl> -0.9999727, -0.9999917, -0.99~
## $ f_body_gyro_bands_energy_1_8_2           <dbl> -0.9999554, -0.9999089, -0.99~
## $ f_body_gyro_bands_energy_9_16_2          <dbl> -0.9999763, -0.9999594, -0.99~
## $ f_body_gyro_bands_energy_17_24_2         <dbl> -0.9999058, -0.9999281, -0.99~
## $ f_body_gyro_bands_energy_25_32_2         <dbl> -0.9999855, -0.9999663, -0.99~
## $ f_body_gyro_bands_energy_33_40_2         <dbl> -0.9999372, -0.9999855, -0.99~
## $ f_body_gyro_bands_energy_41_48_2         <dbl> -0.9997512, -0.9999264, -0.99~
## $ f_body_gyro_bands_energy_49_56_2         <dbl> -0.9990723, -0.9999615, -0.99~
## $ f_body_gyro_bands_energy_57_64_2         <dbl> -0.9999275, -0.9999831, -0.99~
## $ f_body_gyro_bands_energy_1_16_2          <dbl> -0.9999516, -0.9999017, -0.99~
## $ f_body_gyro_bands_energy_17_32_2         <dbl> -0.9999058, -0.9999178, -0.99~
## $ f_body_gyro_bands_energy_33_48_2         <dbl> -0.9998927, -0.9999754, -0.99~
## $ f_body_gyro_bands_energy_49_64_2         <dbl> -0.9994443, -0.9999711, -0.99~
## $ f_body_gyro_bands_energy_1_24_2          <dbl> -0.9999410, -0.9998943, -0.99~
## $ f_body_gyro_bands_energy_25_48_2         <dbl> -0.9999586, -0.9999710, -0.99~
## $ f_body_acc_mag_mean                      <dbl> -0.9521547, -0.9808566, -0.98~
## $ f_body_acc_mag_std                       <dbl> -0.9561340, -0.9758658, -0.98~
## $ f_body_acc_mag_mad                       <dbl> -0.9488701, -0.9757769, -0.98~
## $ f_body_acc_mag_max                       <dbl> -0.9743206, -0.9782264, -0.99~
## $ f_body_acc_mag_min                       <dbl> -0.9257218, -0.9869108, -0.98~
## $ f_body_acc_mag_sma                       <dbl> -0.9521547, -0.9808566, -0.98~
## $ f_body_acc_mag_energy                    <dbl> -0.9982852, -0.9994719, -0.99~
## $ f_body_acc_mag_iqr                       <dbl> -0.9732732, -0.9844792, -0.98~
## $ f_body_acc_mag_entropy                   <dbl> -0.6463764, -0.8166736, -0.90~
## $ f_body_acc_mag_max_inds                  <dbl> -0.7931035, -1.0000000, -0.86~
## $ f_body_acc_mag_mean_freq                 <dbl> -0.08843612, -0.04414989, 0.2~
## $ f_body_acc_mag_skewness                  <dbl> -0.43647104, -0.12204037, -0.~
## $ f_body_acc_mag_kurtosis                  <dbl> -0.7968405, -0.4495219, -0.87~
## $ f_body_body_acc_jerk_mag_mean            <dbl> -0.9937257, -0.9903355, -0.98~
## $ f_body_body_acc_jerk_mag_std             <dbl> -0.9937550, -0.9919603, -0.99~
## $ f_body_body_acc_jerk_mag_mad             <dbl> -0.9919757, -0.9897320, -0.98~
## $ f_body_body_acc_jerk_mag_max             <dbl> -0.9933647, -0.9944888, -0.99~
## $ f_body_body_acc_jerk_mag_min             <dbl> -0.9881754, -0.9895488, -0.99~
## $ f_body_body_acc_jerk_mag_sma             <dbl> -0.9937257, -0.9903355, -0.98~
## $ f_body_body_acc_jerk_mag_energy          <dbl> -0.9999184, -0.9998669, -0.99~
## $ f_body_body_acc_jerk_mag_iqr             <dbl> -0.9913637, -0.9911339, -0.98~
## $ f_body_body_acc_jerk_mag_entropy         <dbl> -1, -1, -1, -1, -1, -1, -1, -~
## $ f_body_body_acc_jerk_mag_max_inds        <dbl> -0.9365079, -0.8412698, -0.90~
## $ f_body_body_acc_jerk_mag_mean_freq       <dbl> 0.3469885, 0.5320605, 0.66079~
## $ f_body_body_acc_jerk_mag_skewness        <dbl> -0.5160801, -0.6248710, -0.72~
## $ f_body_body_acc_jerk_mag_kurtosis        <dbl> -0.8027600, -0.9001600, -0.92~
## $ f_body_body_gyro_mag_mean                <dbl> -0.9801349, -0.9882956, -0.98~
## $ f_body_body_gyro_mag_std                 <dbl> -0.9613094, -0.9833219, -0.98~
## $ f_body_body_gyro_mag_mad                 <dbl> -0.9736534, -0.9826593, -0.98~
## $ f_body_body_gyro_mag_max                 <dbl> -0.9522638, -0.9863208, -0.99~
## $ f_body_body_gyro_mag_min                 <dbl> -0.9894981, -0.9918288, -0.99~
## $ f_body_body_gyro_mag_sma                 <dbl> -0.9801349, -0.9882956, -0.98~
## $ f_body_body_gyro_mag_energy              <dbl> -0.9992403, -0.9998112, -0.99~
## $ f_body_body_gyro_mag_iqr                 <dbl> -0.9926555, -0.9939785, -0.99~
## $ f_body_body_gyro_mag_entropy             <dbl> -0.7012914, -0.7206830, -0.73~
## $ f_body_body_gyro_mag_max_inds            <dbl> -1.0000000, -0.9487180, -0.79~
## $ f_body_body_gyro_mag_mean_freq           <dbl> -0.12898890, -0.27195846, -0.~
## $ f_body_body_gyro_mag_skewness            <dbl> 0.586156430, -0.336310410, -0~
## $ f_body_body_gyro_mag_kurtosis            <dbl> 0.37460462, -0.72001508, -0.8~
## $ f_body_body_gyro_jerk_mag_mean           <dbl> -0.9919904, -0.9958539, -0.99~
## $ f_body_body_gyro_jerk_mag_std            <dbl> -0.9906975, -0.9963995, -0.99~
## $ f_body_body_gyro_jerk_mag_mad            <dbl> -0.9899408, -0.9954421, -0.99~
## $ f_body_body_gyro_jerk_mag_max            <dbl> -0.9924478, -0.9968660, -0.99~
## $ f_body_body_gyro_jerk_mag_min            <dbl> -0.9910477, -0.9944397, -0.99~
## $ f_body_body_gyro_jerk_mag_sma            <dbl> -0.9919904, -0.9958539, -0.99~
## $ f_body_body_gyro_jerk_mag_energy         <dbl> -0.9999368, -0.9999807, -0.99~
## $ f_body_body_gyro_jerk_mag_iqr            <dbl> -0.9904579, -0.9945437, -0.99~
## $ f_body_body_gyro_jerk_mag_entropy        <dbl> -0.8713058, -1.0000000, -1.00~
## $ f_body_body_gyro_jerk_mag_max_inds       <dbl> -1.00000000, -1.00000000, -0.~
## $ f_body_body_gyro_jerk_mag_mean_freq      <dbl> -0.07432303, 0.15807454, 0.41~
## $ f_body_body_gyro_jerk_mag_skewness       <dbl> -0.29867637, -0.59505094, -0.~
## $ f_body_body_gyro_jerk_mag_kurtosis       <dbl> -0.71030407, -0.86149931, -0.~
## $ angle_t_body_acc_mean_gravity            <dbl> -0.112754340, 0.053476955, -0~
## $ angle_t_body_acc_jerk_mean_gravity_mean  <dbl> 0.030400372, -0.007434566, 0.~
## $ angle_t_body_gyro_mean_gravity_mean      <dbl> -0.464761390, -0.732626210, 0~
## $ angle_t_body_gyro_jerk_mean_gravity_mean <dbl> -0.01844588, 0.70351059, 0.80~
## $ angle_x_gravity_mean                     <dbl> -0.8412468, -0.8447876, -0.84~
## $ angle_y_gravity_mean                     <dbl> 0.1799406, 0.1802889, 0.18063~
## $ angle_z_gravity_mean                     <dbl> -0.05862692, -0.05431672, -0.~
## $ subject                                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ activity                                 <chr> "STANDING", "STANDING", "STAN~

This gives a general overview of the data set, which comprises 10,299 observations and 563 variables. The large number of variables suggests an opportunity for dimensionality reduction, which is covered in a later section.

The target variable activity is still stored as a character vector, so it is converted to its proper type, factor, as follows:

whole_df$activity <- as.factor(whole_df$activity)

class(whole_df$activity)

## [1] "factor"

Next, the data set is checked for duplicated rows and missing values, which should be removed before modelling:

anyDuplicated(whole_df)

## [1] 0

Checking for missing values:

anyNA(whole_df)

## [1] FALSE
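Both checks come back clean for this data set. Had they not, the cleanup could look like the minimal sketch below. It runs on a small made-up data frame (`toy_df` is hypothetical and not part of the case study) and uses base R only.

```r
# Hypothetical toy data frame with one duplicated row and one missing value
toy_df <- data.frame(
  x = c(1, 2, 2, NA),
  activity = c("SITTING", "WALKING", "WALKING", "LAYING")
)

anyDuplicated(toy_df)  # index of the first duplicated row (0 if none)
anyNA(toy_df)          # TRUE if any value is missing

# Drop exact duplicate rows, then drop rows that contain any NA
toy_clean <- toy_df[!duplicated(toy_df), ]
toy_clean <- toy_clean[complete.cases(toy_clean), ]
```

Note that anyDuplicated returns 0 rather than FALSE when no duplicates exist, which matches the output shown above.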

Exploratory Data Analysis

In the EDA phase, several analyses are conducted, including the proportion of classes in the target variable and the differences between stationary and moving activities as seen through several variables, such as body acceleration and the gravitational axes.

Proportion of Classes in Target Variable

The inspectdf library is useful for visualizing and exploring the classes within the target. For a quick inspection, the inspect_cat function is combined with show_plot, as seen below.

show_plot(inspect_cat(whole_df))

Based on the plot above, the classes are roughly balanced, which is helpful for the modelling phase. An imbalanced class proportion may hurt the performance of the model, as the minority class would not be learned properly.
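The balance can also be checked numerically. The sketch below is a minimal illustration in base R; `toy_activity` is a hypothetical stand-in for `whole_df$activity`, so the printed shares are not those of the real data.

```r
# Hypothetical stand-in for whole_df$activity
toy_activity <- factor(c("LAYING", "SITTING", "STANDING", "WALKING",
                         "LAYING", "SITTING", "STANDING", "WALKING",
                         "LAYING", "WALKING"))

class_counts <- table(toy_activity)       # observations per class
class_props  <- prop.table(class_counts)  # share of each class, sums to 1

round(class_props, 3)
```

Applied to the real target, the same two calls would give the exact share of each activity instead of a visual estimate.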

Identifying Differences between Stationary and Moving Activities

In this section, several variables are used to explain and discern the differences between stationary activities (sitting, laying, standing) and moving activities (walking, walking upstairs, walking downstairs). The first variable is t_body_acc_mag_mean, the mean magnitude of body acceleration, which is visualized below:

whole_df %>% 
  ggplot(aes(t_body_acc_mag_mean, color = activity)) +
  geom_density() +
  annotate(geom = "segment", x = -0.65, y = 23, xend = -0.9, yend = 21,
           arrow = arrow(length = unit(2, "mm")))+
  annotate(geom = "text", x = -0.63, y = 23, label = "Stationary", hjust = "left") +
  annotate(geom = "segment", x = 0.25, y = 5, xend = 0, yend = 2.5,
           arrow = arrow(length = unit(2, "mm"))) +
  annotate(geom = "text", x = 0.27, y = 5, label = "Moving", hjust = "left") +
  labs(y = NULL) +
  theme_minimal()

The density plot above shows that the moving activities have a greater magnitude of body acceleration than the stationary ones. This is expected, as the human body barely accelerates during stationary activities. The stationary activities also show much taller and narrower peaks, which indicates that their values are tightly concentrated; peak height in a density plot reflects concentration rather than the number of observations.

For a closer look, the boxplot of each activity is shown below:

whole_df %>% 
  ggplot(aes(x = activity, y = t_body_acc_mag_mean, fill = activity)) +
  geom_boxplot(outlier.shape = NA) +
  geom_hline(yintercept = 0, color = "red", linetype = 2) +
  geom_hline(yintercept = -0.6, color = "blue", linetype = 2) +
  labs(y = "Acceleration Magnitude Mean",
       x = NULL) +
  scale_x_discrete(labels = function(x) stringr::str_trunc(x, 12)) +
  theme_minimal() +
  theme(axis.text.x.bottom = element_text(angle = 45),
        legend.position = "none")

Specifically, walking downstairs has a greater mean body acceleration magnitude than the other moving activities, as seen in the plot above. Consistent with the density plot, the stationary activities have a lower magnitude of body acceleration.

From a different angle, the boxplots of the gravitational angles for each target class are shown below:

bp1 <- whole_df %>% 
  ggplot(aes(x = activity, y = angle_x_gravity_mean, fill = activity)) +
  geom_boxplot(outlier.shape = NA) +
  geom_hline(yintercept = 0, color = "red", linetype = 2) +
  labs(title = "X-axis vs Gravity Mean Angle",
       x = NULL,
       y = "Angle X Gravity Mean") +
  scale_x_discrete(labels = function(x) stringr::str_trunc(x, 12)) +
  theme_minimal() +
  theme(axis.text.x.bottom = element_text(angle = 45),
        legend.position = "none") 

bp2 <- whole_df %>% 
  ggplot(aes(x = activity, y = angle_y_gravity_mean, fill = activity)) +
  geom_boxplot(outlier.shape = NA) +
  geom_hline(yintercept = 0, color = "red", linetype = 2)+
  labs(title = "Y-axis vs Gravity Mean Angle",
       x = NULL,
       y = "Angle Y Gravity Mean") +
  scale_x_discrete(labels = function(x) stringr::str_trunc(x, 12)) +
  theme_minimal() +
  theme(axis.text.x.bottom = element_text(angle = 45),
        legend.position = "none") 

grid.arrange(bp1, bp2, nrow = 1)

The plots above show that laying differs strongly from the other activities in terms of the gravitational angles, while the other activities share a similar central value. This also means that the gravitational angles are not the best variables for telling apart activities other than laying.

Overall, the exploratory analysis of stationary and moving activities shows that the predictor variables capture different aspects of each target class, and that some variables are better suited than others to classifying particular activities.

Principal Component Analysis

There are two common approaches to selecting the relevant features for a model, described in the following points:

Feature Elimination: features are selected by eliminating those that are irrelevant to the model being built (e.g. stepwise regression, AIC, etc.)
Feature Extraction: the predictors are combined into a new set of independent variables (e.g. Principal Component Analysis)

Principal Component Analysis (PCA) is one of the most commonly used methods for feature extraction. Briefly, the extraction is conducted by calculating the covariance matrix of the variables and then finding its eigenvectors and eigenvalues: each eigenvector defines a principal component, and its eigenvalue is the variance captured by that component (see this link for further information). Dividing each eigenvalue by the sum of all eigenvalues gives the proportion of variance explained, and the cumulative proportion then determines how many of the top principal components are needed to represent the whole set of features.
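To make these steps concrete, the sketch below carries them out by hand and checks the result against prcomp. It uses R's built-in `USArrests` data purely as a small stand-in; it is not the activity data, and the variable names are illustrative.

```r
# Small built-in data set used as a stand-in for the activity features
X <- scale(USArrests)        # center and scale each variable

cov_mat <- cov(X)            # covariance matrix of the scaled data
eig     <- eigen(cov_mat)    # eigenvectors = PC directions,
                             # eigenvalues  = variance of each PC

prop_var <- eig$values / sum(eig$values)  # proportion of variance per PC
cumsum(prop_var)                          # cumulative proportion

# prcomp arrives at the same variances (sdev^2) through an SVD
pc <- prcomp(USArrests, scale. = TRUE)
all.equal(eig$values, pc$sdev^2)
```

The eigenvalues from the manual route match the squared sdev values from prcomp, which is why sdev^2 can be used directly when computing the proportion of variance explained.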

PCA is highly beneficial for a data set with many predictors. However, interpretability suffers: it is harder to tell which original features matter to the model, since each principal component is a linear combination of all predictors. Moreover, non-linear relationships are disregarded, as standard PCA only captures linear relationships between predictors. In this documentation, those disadvantages can be set aside, as the focus is on building and comparing the models.

To begin the PCA, whole_df is split into a train set and a test set. Note that the split differs from the one in the data source (which splits by subject ID) so that the target classes stay balanced within each set. The split is done with the initial_split function from the rsample package. Afterwards, the inputs for the PCA are obtained by removing the activity variable from each set.

# dividing train and test
set.seed(318)

df_split <- initial_split(whole_df %>% select(-subject), strata = activity)

train <- training(df_split)
test <- testing(df_split)

train_pca <- train %>% 
  select(-activity)

test_pca <- test %>% 
  select(-activity)

#original method in splitting the df
#train <- whole_df %>% 
#  filter(!(subject %in% c(2, 4, 9, 10, 12, 13, 18, 20, 24)))
#
#train_pca <- train %>% 
#  select(-c(subject, activity))
#
#test <- whole_df %>% 
#  filter(subject %in% c(2, 4, 9, 10, 12, 13, 18, 20, 24))
#
#test_pca <- test %>% 
#  select(-c(subject, activity))
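The idea behind a stratified split can be illustrated in base R. The block below is a conceptual sketch only and not rsample's actual implementation: `toy` is a hypothetical data frame, and the 75% share mirrors the default prop of initial_split.

```r
set.seed(318)

# Hypothetical toy labels: 40 rows of class "A" and 20 rows of class "B"
toy <- data.frame(id = 1:60,
                  activity = rep(c("A", "B"), times = c(40, 20)))

# Sample 75% of the row indices within each class separately
idx_by_class <- split(seq_len(nrow(toy)), toy$activity)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class shares are preserved in both sets
prop.table(table(toy_train$activity))
prop.table(table(toy_test$activity))
```

Because the sampling happens inside each class, both sets keep the original two-to-one ratio, which is the property the strata argument is meant to provide.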

Performing the PCA is simply done with the prcomp function, as seen below. Note that the scale. parameter is set to TRUE so that every variable is measured on the same scale when the PCA is performed.

prin_comp <- prcomp(train_pca, scale. = TRUE)

The PCA can also be visualized with biplot from base R. Below is the visualization for PC 1 and PC 2.

biplot(prin_comp, scale = F, cex = 0.3)

Unfortunately, due to the large number of variables, the direction and strength of each variable's contribution to the PCs cannot be seen clearly. However, the plot does reveal outliers (top left of the plot) with respect to PC 1 and PC 2, meaning that several observations are not captured well by these two PCs alone.

To see how each variable contributes to each PC, the fviz_contrib function from the factoextra package is used. Note that in the code chunk below, the choice parameter is set to "var" to compare the variables, and the axes parameter sets which PC is visualized.

fviz_contrib(prin_comp, choice = "var", axes = 1, top = 10)

Based on the plot above, the top 10 variables contributing to PC 1 are all related to body acceleration. However, each contribution is small, at no more than 0.3%, which means that the variables contribute to PC 1 in similar amounts.

To determine how much of the total variance each PC explains, the proportion of variance is calculated by squaring the sdev element of the prcomp object and dividing by the sum of the variances of all principal components. The visualization is shown below.

prin_comp_var <- prin_comp$sdev^2

prop_varex <- prin_comp_var/sum(prin_comp_var)

plot(prop_varex, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     type = "b")

The plot above shows that PC 1 alone explains about 50% of the variance in the data set, while the remaining PCs each explain a far smaller share, most of them close to zero.

To determine how many of the top principal components to use in the models, the cumulative proportion of variance explained is calculated (with cumsum) and plotted below.

plot(cumsum(prop_varex), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b")

The plot above shows how each additional principal component raises the cumulative proportion of variance explained towards 1 (or 100%). At PC 50, the cumulative proportion has reached more than 80%. Therefore, the models use PC 1 to PC 50.

As the PCA was fitted on the train set, the test set is then projected onto the same principal components to obtain the component scores for each of its observations. Note that the PCA is fitted only after splitting the data, and the fitted PCA is then applied to the test set. Fitting the PCA on the whole data set would introduce bias, as the unseen data (here, the test set) would already have influenced the transformation.
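Rather than reading the cut-off from the plot, the number of components can be computed directly. The sketch below is one possible way to do so, shown on the built-in `USArrests` data as a stand-in; `threshold` and `n_pc` are illustrative names.

```r
# Stand-in example on the built-in USArrests data
pc <- prcomp(USArrests, scale. = TRUE)

prop_var <- pc$sdev^2 / sum(pc$sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of leading PCs whose cumulative share reaches the threshold
threshold <- 0.80
n_pc <- which(cum_var >= threshold)[1]
n_pc
```

Applying the same expression to cumsum(prop_varex) would return the first PC at which the 80% mark is crossed for the activity data.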

train_data <- data.frame(activity = train$activity, prin_comp$x)
train_data <- train_data[,1:51]

#transform test into PCA

test_data <- predict(prin_comp, newdata = test_pca) %>% 
  as.data.frame()
test_data <- test_data[,1:50]
test_data <- data.frame(activity = test$activity,
                        test_data)
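What predict does with a prcomp object can be verified by hand: the new rows are centered and scaled with the statistics stored from the training data, then multiplied by the rotation matrix. The sketch below demonstrates this on the built-in `USArrests` data, with an arbitrary 40/10 row split chosen only for illustration.

```r
# Stand-in example: fit the PCA on a "train" part of USArrests only
train_x <- USArrests[1:40, ]
test_x  <- USArrests[41:50, ]

pc <- prcomp(train_x, scale. = TRUE)

# predict() centers and scales the new rows with the TRAIN statistics,
# then multiplies by the rotation (loading) matrix
scores_predict <- predict(pc, newdata = test_x)
scores_manual  <- scale(test_x, center = pc$center, scale = pc$scale) %*%
  pc$rotation

all.equal(scores_predict, scores_manual, check.attributes = FALSE)
```

Since only training statistics enter the calculation, the test rows never influence the transformation, which is what keeps the evaluation free of leakage.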

Modeling

As previously mentioned, the models used in this documentation are Naive Bayes, Decision Tree, and Random Forest. After each model is built, its confusion matrix is provided to evaluate and compare the models and find the best performer for this case. Moreover, F1-scores are provided to evaluate how well each model classifies each of the classes.

Naive Bayes

Naive Bayes is an algorithm commonly used in classification problems, with the following characteristics:

Assumption of independence among predictors: the Naive Bayes model assumes that, given the class, each independent variable does not depend on the other independent variables
Prone to bias due to data scarcity: some of the variable distributions may have bins without any observations. This is called data scarcity, and it yields estimated probabilities of 0 or close to 0 or 1, which heavily biases the Naive Bayes model. Laplace smoothing helps here by adding a small number (usually 1) to the count of each predictor value

Given these characteristics, the Naive Bayes model is generally better suited to data with categorical predictors than numerical ones, as data scarcity would heavily bias the model. It also means that the model may not be well suited to the data set used in this documentation, whose predictors are all numerical.
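The effect of Laplace smoothing is easiest to see on a small worked example. The counts below are hypothetical and describe one categorical predictor within a single class, where the level "high" was never observed.

```r
# Hypothetical counts of a categorical predictor within one class:
# the level "high" was never observed for this class
counts <- c(low = 6, medium = 4, high = 0)

# Without smoothing, P(high | class) is 0 and zeroes out the whole product
p_raw <- counts / sum(counts)

# Laplace (add-one) smoothing: add 1 to the count of every level
laplace <- 1
p_smooth <- (counts + laplace) / (sum(counts) + laplace * length(counts))

p_raw
p_smooth
```

After smoothing, the unseen level receives a small non-zero probability (1/13 here) and the probabilities still sum to 1. In e1071, the naiveBayes function exposes this through its laplace argument, which defaults to 0.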

To build the Naive Bayes model, the naiveBayes function from e1071 is used as below. Note that the function requires two arguments: the predictors x and the target y.

naive_model <- naiveBayes(x = train_data %>% select(-activity), y = train_data$activity)

Afterwards, predictions are made on the test set and put into a confusion matrix to evaluate the model. The metric used to evaluate the models in this documentation is accuracy, despite arguments that accuracy is not the best metric in most cases. This choice follows from the goal that every class should be classified accurately. To further evaluate each model, the F1-score for each class is computed and explained in a later passage.

The confusion matrix is produced with confusionMatrix from the caret package as follows:
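For reference, both metrics can be derived from a confusion matrix with a few lines of base R. The matrix below is a hypothetical 3-class example with made-up counts, laid out like caret's output (rows are predictions, columns are the reference).

```r
# Hypothetical 3-class confusion matrix (rows = predicted, columns = actual)
cm <- matrix(c(50,  5,  0,
                3, 40,  7,
                2,  5, 38),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("A", "B", "C"),
                             Reference  = c("A", "B", "C")))

accuracy  <- sum(diag(cm)) / sum(cm)  # share of correct predictions

precision <- diag(cm) / rowSums(cm)   # correct / all predicted as the class
recall    <- diag(cm) / colSums(cm)   # correct / all actually in the class
f1        <- 2 * precision * recall / (precision + recall)

accuracy
round(f1, 3)
```

In caret's per-class statistics, precision appears as Pos Pred Value and recall as Sensitivity, so the F1-score of a class is the harmonic mean of those two rows.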

naive_pred <- predict(naive_model, test_data, type = "class")

confmat_naive <- confusionMatrix(data = naive_pred,
                reference = test_data$activity)

confmat_naive

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
##   LAYING                452       7        1       0                  0
##   SITTING                 8     311       57       0                  0
##   STANDING                3     106      397       0                  0
##   WALKING                 0       0        4     357                 28
##   WALKING_DOWNSTAIRS     22      18       13      50                266
##   WALKING_UPSTAIRS        1       3        5      24                 58
##                     Reference
## Prediction           WALKING_UPSTAIRS
##   LAYING                            0
##   SITTING                           0
##   STANDING                          0
##   WALKING                          16
##   WALKING_DOWNSTAIRS               27
##   WALKING_UPSTAIRS                343
## 
## Overall Statistics
##                                                
##                Accuracy : 0.825                
##                  95% CI : (0.8098, 0.8395)     
##     No Information Rate : 0.1886               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.7897               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: LAYING Class: SITTING Class: STANDING
## Sensitivity                 0.9300         0.6989          0.8323
## Specificity                 0.9962         0.9695          0.9481
## Pos Pred Value              0.9826         0.8271          0.7846
## Neg Pred Value              0.9839         0.9391          0.9614
## Prevalence                  0.1886         0.1727          0.1851
## Detection Rate              0.1754         0.1207          0.1541
## Detection Prevalence        0.1785         0.1459          0.1964
## Balanced Accuracy           0.9631         0.8342          0.8902
##                      Class: WALKING Class: WALKING_DOWNSTAIRS
## Sensitivity                  0.8283                    0.7557
## Specificity                  0.9776                    0.9416
## Pos Pred Value               0.8815                    0.6717
## Neg Pred Value               0.9659                    0.9606
## Prevalence                   0.1672                    0.1366
## Detection Rate               0.1385                    0.1032
## Detection Prevalence         0.1572                    0.1537
## Balanced Accuracy            0.9030                    0.8486
##                      Class: WALKING_UPSTAIRS
## Sensitivity                           0.8886
## Specificity                           0.9585
## Pos Pred Value                        0.7903
## Neg Pred Value                        0.9799
## Prevalence                            0.1498
## Detection Rate                        0.1331
## Detection Prevalence                  0.1684
## Balanced Accuracy                     0.9235

The confusion matrix above shows that the overall accuracy of the Naive Bayes model on the test set is 82.5%, which can be considered high for a classification model. However, to evaluate whether the model is overfit to its train set, the confusion matrix for the train set is also examined.

naive_train_pred <- predict(naive_model, train_data, type = "class")

confusionMatrix(naive_train_pred, reference = train_data$activity)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
##   LAYING               1362      16        4       0                  0
##   SITTING                20     939      132       0                  0
##   STANDING               18     310     1244       0                  0
##   WALKING                 1       3        4    1097                123
##   WALKING_DOWNSTAIRS     57      54       32     134                763
##   WALKING_UPSTAIRS        0      10       13      60                168
##                     Reference
## Prediction           WALKING_UPSTAIRS
##   LAYING                            0
##   SITTING                           0
##   STANDING                          0
##   WALKING                          27
##   WALKING_DOWNSTAIRS               81
##   WALKING_UPSTAIRS               1050
## 
## Overall Statistics
##                                                
##                Accuracy : 0.8359               
##                  95% CI : (0.8275, 0.8441)     
##     No Information Rate : 0.1888               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.8028               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: LAYING Class: SITTING Class: STANDING
## Sensitivity                 0.9342         0.7050          0.8705
## Specificity                 0.9968         0.9762          0.9479
## Pos Pred Value              0.9855         0.8607          0.7913
## Neg Pred Value              0.9849         0.9407          0.9699
## Prevalence                  0.1888         0.1725          0.1851
## Detection Rate              0.1764         0.1216          0.1611
## Detection Prevalence        0.1790         0.1413          0.2036
## Balanced Accuracy           0.9655         0.8406          0.9092
##                      Class: WALKING Class: WALKING_DOWNSTAIRS
## Sensitivity                  0.8497                   0.72391
## Specificity                  0.9754                   0.94631
## Pos Pred Value               0.8741                   0.68064
## Neg Pred Value               0.9700                   0.95592
## Prevalence                   0.1672                   0.13649
## Detection Rate               0.1421                   0.09881
## Detection Prevalence         0.1625                   0.14517
## Balanced Accuracy            0.9126                   0.83511
##                      Class: WALKING_UPSTAIRS
## Sensitivity                           0.9067
## Specificity                           0.9618
## Pos Pred Value                        0.8071
## Neg Pred Value                        0.9832
## Prevalence                            0.1500
## Detection Rate                        0.1360
## Detection Prevalence                  0.1685
## Balanced Accuracy                     0.9342

These confusion matrices show that the gap in accuracy between the train set (83.6%) and the test set (82.5%) for the Naive Bayes model is not wide. Thus, the model is considered a good fit for this case.
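The summary statistics printed by confusionMatrix can be reproduced by hand from the counts. The sketch below re-enters the test set confusion matrix of the Naive Bayes model shown earlier (rows are predictions, columns are the reference) and recomputes the accuracy, sensitivity, and precision in base R. The object names are illustrative.

```r
classes <- c("LAYING", "SITTING", "STANDING",
             "WALKING", "WALKING_DOWNSTAIRS", "WALKING_UPSTAIRS")

# Counts copied from the Naive Bayes test set confusion matrix above
nb_test <- matrix(c(
  452,   7,   1,   0,   0,   0,
    8, 311,  57,   0,   0,   0,
    3, 106, 397,   0,   0,   0,
    0,   0,   4, 357,  28,  16,
   22,  18,  13,  50, 266,  27,
    1,   3,   5,  24,  58, 343
), nrow = 6, byrow = TRUE,
dimnames = list(Prediction = classes, Reference = classes))

# Accuracy: correctly classified observations over all observations
accuracy <- sum(diag(nb_test)) / sum(nb_test)

# Sensitivity (recall): correct predictions over the reference totals (columns)
sensitivity <- diag(nb_test) / colSums(nb_test)

# Precision (positive predictive value): correct predictions over the prediction totals (rows)
precision <- diag(nb_test) / rowSums(nb_test)

round(accuracy, 4)
round(sensitivity, 4)
round(precision, 4)
```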

For further evaluation of the multi-class models, the F1-score is used in this documentation. The F1-score is a metric for evaluating a classifier, defined as the harmonic mean of two metrics: Precision and Recall (Sensitivity). It is preferable when the target classes are imbalanced and when a balance between Precision and Recall for each class is desirable. The F1-score for each class is calculated with the equation below:

\[F1\ Score = \frac{2\times Precision \times Recall}{Precision\ + Recall}\]

With this equation, the F1-score of each class in the model can be observed. The value indicates how well the model classifies each class: the closer the value is to 1, the better the class is classified. To calculate the F1-score in this documentation, the f1_score_multi function is defined below, built on the output of the confusionMatrix function.

# F1-score per class for a multiclass caret confusion matrix
f1_score_multi <- function(confmat_caret){
  
  by_class <- confmat_caret$byClass
  
  # Look the columns up by name rather than by position
  precision <- by_class[, "Precision"]
  recall <- by_class[, "Recall"]
  
  # Harmonic mean of precision and recall
  f1 <- 2 * precision * recall / (precision + recall)
  
  # One row per class, with the class label as the row name
  data.frame(f1_scores = round(f1, 4), row.names = rownames(by_class))
}

f1_multi_naive <- f1_score_multi(confmat_caret = confmat_naive)

f1_multi_naive

Based on the table above, each class has a different F1-score. The Naive Bayes model classifies the Laying activity better than the other activities. In contrast, the Sitting and Walking Downstairs activities have the lowest F1-scores, which indicates that the model performs less well in classifying these activities.
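As a cross-check of the equation, the F1-score can be recomputed by hand from the precision (Pos Pred Value) and recall (Sensitivity) printed in the Naive Bayes test set confusion matrix. The sketch below does this for three of the classes. The vector names are illustrative.

```r
# Precision and recall copied from the Naive Bayes test set statistics above
precision <- c(LAYING = 0.9826, SITTING = 0.8271, WALKING_DOWNSTAIRS = 0.6717)
recall    <- c(LAYING = 0.9300, SITTING = 0.6989, WALKING_DOWNSTAIRS = 0.7557)

# Harmonic mean of precision and recall, computed per class
f1_by_hand <- 2 * precision * recall / (precision + recall)

round(f1_by_hand, 4)
```

Because the harmonic mean is pulled toward the smaller of the two values, a class with high precision but low recall (such as Sitting) still ends up with a modest F1-score.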

Decision Tree

The Decision Tree model is another classification model commonly used in the machine learning community. Its appeal is ease of interpretation: it classifies the data through a set of rules that split the information in the data set, branching like a tree. The advantages and disadvantages of the decision tree are as follows:

Advantages

Easy and simple to interpret, visualize, and comprehend
Feature selection is done implicitly
Both numerical and categorical data can be handled by the model
Multi-output problems can be handled by the model
Non-linear relationships among the predictors do not affect the model's performance

Disadvantages

An over-complex tree tends to overfit the data
Small variations in the data may change the tree completely; ensemble methods such as bagging and boosting can reduce this instability
A dominant class (an imbalanced proportion of classes) may bias the model

A more detailed explanation of the decision tree can be found in the link here.

To establish the model, the rpart function is used in this documentation. The arguments that need to be considered are the formula, the data, and the method. The method argument determines the type of output of the model, which here is the class of the activity.

dtree_model <- rpart(formula = activity ~., data = train_data, method = "class")

fancyRpartPlot(dtree_model, sub = NULL)

The plot above shows how the decision tree model classifies each of the target activity classes. Note that the model was trained on the principal components obtained in the previous section, which makes the tree harder to interpret because each component is an extracted feature rather than an original variable. Nevertheless, three main elements can be observed in the plot. First, the root node, which is the top level of the tree; here PC1 is the main predictor for the model. Second, the interior nodes, which follow the root node. These can be seen in the ‘LAYING’ and ‘WALKING’ branches, which also reflect the difference between moving and stationary activities found in the EDA phase. Last, the terminal nodes, which are the bottom level of the tree and assign each observation to a target class. To ease interpretation in terms of the actual predictor variables, the model could instead be fitted on the original data (without PCA) to see which predictors appear in which nodes.
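The root, interior, and terminal nodes can also be read directly from a fitted rpart object instead of from the plot. The sketch below is a minimal illustration on the built-in iris data, used here as a stand-in for the activity data. The object names are illustrative.

```r
library(rpart)

# Toy tree on a stand-in data set (not the activity data)
toy_tree <- rpart(Species ~ ., data = iris, method = "class")

# The frame holds one row per node; `var` is the split variable,
# and "<leaf>" marks a terminal node
node_vars <- as.character(toy_tree$frame$var)

root_var <- node_vars[1]                                # variable used at the root node
n_leaves <- sum(node_vars == "<leaf>")                  # number of terminal nodes
split_vars <- unique(node_vars[node_vars != "<leaf>"])  # predictors used in any split

root_var
n_leaves
split_vars
```

Applied to dtree_model, the same fields would list which principal components are used at the root and interior nodes.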

To evaluate the model, a confusion matrix is built by passing in the model predictions and comparing them with the test data as shown below:

dtree_pred <- predict(dtree_model, newdata = test_data, type = "class")

confmat_dtree <- confusionMatrix(dtree_pred, 
                                 reference = test_data$activity)
confmat_dtree

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
##   LAYING                434      55        4       0                  0
##   SITTING                48     199       78       0                  0
##   STANDING                1     189      395       0                  0
##   WALKING                 0       0        0     341                 93
##   WALKING_DOWNSTAIRS      0       0        0      44                205
##   WALKING_UPSTAIRS        3       2        0      46                 54
##                     Reference
## Prediction           WALKING_UPSTAIRS
##   LAYING                            0
##   SITTING                           0
##   STANDING                          0
##   WALKING                          71
##   WALKING_DOWNSTAIRS               10
##   WALKING_UPSTAIRS                305
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7291               
##                  95% CI : (0.7115, 0.7462)     
##     No Information Rate : 0.1886               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6736               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: LAYING Class: SITTING Class: STANDING
## Sensitivity                 0.8930        0.44719          0.8281
## Specificity                 0.9718        0.94090          0.9095
## Pos Pred Value              0.8803        0.61231          0.6752
## Neg Pred Value              0.9750        0.89076          0.9588
## Prevalence                  0.1886        0.17268          0.1851
## Detection Rate              0.1684        0.07722          0.1533
## Detection Prevalence        0.1913        0.12612          0.2270
## Balanced Accuracy           0.9324        0.69405          0.8688
##                      Class: WALKING Class: WALKING_DOWNSTAIRS
## Sensitivity                  0.7912                   0.58239
## Specificity                  0.9236                   0.97573
## Pos Pred Value               0.6752                   0.79151
## Neg Pred Value               0.9566                   0.93658
## Prevalence                   0.1672                   0.13659
## Detection Rate               0.1323                   0.07955
## Detection Prevalence         0.1960                   0.10050
## Balanced Accuracy            0.8574                   0.77906
##                      Class: WALKING_UPSTAIRS
## Sensitivity                           0.7902
## Specificity                           0.9521
## Pos Pred Value                        0.7439
## Neg Pred Value                        0.9626
## Prevalence                            0.1498
## Detection Rate                        0.1184
## Detection Prevalence                  0.1591
## Balanced Accuracy                     0.8711

Based on the matrix above, the accuracy of the model (72.9%) is lower than that of the previous Naive Bayes model (82.5%). For further analysis, the confusion matrix for the train set is presented below to see whether the decision tree model has a good fit.

dtree_train_pred <- predict(dtree_model, newdata = train_data, type = "class")
confusionMatrix(dtree_train_pred, reference = train_data$activity)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
##   LAYING               1288     128       19       0                  0
##   SITTING               157     668      236       0                  0
##   STANDING                2     535     1174       0                  0
##   WALKING                 0       0        0    1000                278
##   WALKING_DOWNSTAIRS      0       0        0     132                622
##   WALKING_UPSTAIRS       11       1        0     159                154
##                     Reference
## Prediction           WALKING_UPSTAIRS
##   LAYING                            0
##   SITTING                           1
##   STANDING                          0
##   WALKING                         212
##   WALKING_DOWNSTAIRS               30
##   WALKING_UPSTAIRS                915
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7339               
##                  95% CI : (0.7239, 0.7437)     
##     No Information Rate : 0.1888               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6794               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: LAYING Class: SITTING Class: STANDING
## Sensitivity                 0.8834        0.50150          0.8216
## Specificity                 0.9765        0.93834          0.9147
## Pos Pred Value              0.8976        0.62900          0.6861
## Neg Pred Value              0.9730        0.90030          0.9576
## Prevalence                  0.1888        0.17249          0.1851
## Detection Rate              0.1668        0.08651          0.1520
## Detection Prevalence        0.1858        0.13753          0.2216
## Balanced Accuracy           0.9300        0.71992          0.8681
##                      Class: WALKING Class: WALKING_DOWNSTAIRS
## Sensitivity                  0.7746                   0.59013
## Specificity                  0.9238                   0.97570
## Pos Pred Value               0.6711                   0.79337
## Neg Pred Value               0.9533                   0.93773
## Prevalence                   0.1672                   0.13649
## Detection Rate               0.1295                   0.08055
## Detection Prevalence         0.1930                   0.10153
## Balanced Accuracy            0.8492                   0.78292
##                      Class: WALKING_UPSTAIRS
## Sensitivity                           0.7902
## Specificity                           0.9505
## Pos Pred Value                        0.7379
## Neg Pred Value                        0.9625
## Prevalence                            0.1500
## Detection Rate                        0.1185
## Detection Prevalence                  0.1606
## Balanced Accuracy                     0.8703

Interestingly, the gap between the train set accuracy and the test set accuracy for the decision tree model is not wide, which suggests a good fit. Even so, the model can still be improved, as explained in the next section.

To further analyze the performance of the decision tree in classifying each target class, the F1-scores are presented below:

f1_multi_dtree <- f1_score_multi(confmat_dtree)

f1_multi_dtree

Based on the table above, the decision tree model shows a similar pattern to the Naive Bayes model in classifying the target classes, but with lower scores. The lowest scores in the table belong to the ‘SITTING’ and ‘WALKING_DOWNSTAIRS’ activities, while the model performs best in classifying the ‘LAYING’ activity.

Decision Tree Tuned

There are several ways to tune a decision tree model. One common way is to prune the tree to reduce overfitting. Another is hyperparameter tuning, which searches for the parameter values that give the best performance in classifying the target class. This documentation uses hyperparameter tuning, performed with the tidymodels package. The first step is to create a model specification in which each parameter to be tuned is marked by passing tune() to it, as seen in the code chunk below:

dtree_tune <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune()
) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

dtree_tune

## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = tune()
##   tree_depth = tune()
## 
## Computational engine: rpart

Next, the grid is created with the grid_regular function. The levels parameter is set to 5, so the function picks 5 sensible values for each parameter and returns the 5 x 5 = 25 combinations of cost_complexity and tree_depth to try.

dtree_grid <- grid_regular(cost_complexity(), 
                           tree_depth(),
                           levels = 5)

dtree_grid %>% count(tree_depth)
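To make the 5 x 5 grid concrete, the sketch below rebuilds it in base R. The parameter ranges (cost complexity from 1e-10 to 1e-1 on a log10 scale, tree depth from 1 to 15) are assumptions intended to mirror the dials defaults, and the object names are illustrative.

```r
# Assumed ranges, intended to mirror the dials defaults
cost_complexity_values <- 10^seq(-10, -1, length.out = 5)     # evenly spaced on the log10 scale
tree_depth_values <- as.integer(seq(1, 15, length.out = 5))   # truncated to whole depths

# Every combination of the two parameters: 5 x 5 = 25 candidates
toy_grid <- expand.grid(cost_complexity = cost_complexity_values,
                        tree_depth = tree_depth_values)

nrow(toy_grid)
table(toy_grid$tree_depth)
signif(cost_complexity_values, 3)
```

Under these assumed ranges, the fourth cost complexity value is 10^-3.25, which agrees with the 0.0005623413 that appears later in the tuning results.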

Afterwards, the train set must be split into cross-validation folds, since tuning with tidymodels requires a resampling object created by the rsample package.

set.seed(318)

train_folds <- vfold_cv(train_data)

train_folds

The resampling yields a 10-fold cross validation (the default of vfold_cv), which is used to find the best tuning of the decision tree.
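The idea behind the 10 folds can be sketched in base R. This is an illustration of V-fold assignment only, not the rsample implementation; the sample size n is hypothetical.

```r
set.seed(318)

n <- 100   # hypothetical number of training rows
v <- 10    # number of folds

# Assign each row to one of the 10 folds at random, keeping fold sizes balanced
fold_id <- sample(rep(seq_len(v), length.out = n))

# For fold k, the model is fitted on the other 9 folds and assessed on fold k
k <- 1
assessment_rows <- which(fold_id == k)
analysis_rows <- which(fold_id != k)

fold_sizes <- table(fold_id)
fold_sizes
length(analysis_rows)
```

Every candidate in the grid is therefore fitted 10 times, and its accuracy is averaged over the 10 assessment folds.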

To run the grid search, the workflow function is combined with add_model and add_formula to set the model to be tuned and its formula. The results of all tuning combinations are then assigned to the tree_res object as follows:

#set.seed(129)
#
#tree_wf <- workflow() %>% 
#  add_model(dtree_tune) %>% 
#  add_formula(activity ~ .)
#
#tree_res <- tree_wf %>% 
#  tune_grid(
#    resamples = train_folds,
#    grid = dtree_grid
#  )
#
#tree_res
#
#saveRDS(tree_res, "data_input/tree_res.RDS")

tree_res <- readRDS("data_input/tree_res.RDS")
tree_res

To find which combination performs best in terms of accuracy, the collect_metrics function can be used to unpack the results above. The grid search results can also be visualized as seen below:

tree_res %>% 
  collect_metrics() %>% 
  mutate(tree_depth = factor(tree_depth)) %>% 
  ggplot(aes(cost_complexity, mean, color = tree_depth)) +
  geom_line(size = 1.5, alpha = 0.7) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number()) +
  scale_color_viridis_d(option = "plasma", begin = .9, end = 0)

From this visualization, the best performing models are those with a tree depth of 11 or 15 and a cost complexity between approximately 0 and 0.0005. The best performing models can also be listed by piping tree_res into the show_best function with the accuracy metric as shown below:

tree_res %>% 
  show_best(metric = "accuracy")

This table shows that the best tree model has a cost complexity of 0.0005623413 and a tree depth of 11. To extract the best parameter combination, select_best can be applied to tree_res as below.

best_tree <- tree_res %>% 
  select_best(metric = "accuracy")

best_tree

After acquiring the best values of cost_complexity and tree_depth, they are used in the tuned model. With the tidymodels package, the workflow can be finalized with finalize_workflow and then fitted with last_fit to the split object created by initial_split. Lastly, the final model is extracted with extract_workflow (see the link here for further information).
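The finalizing steps described above can be sketched as follows. This is a minimal illustration on the built-in iris data, used as a stand-in because the activity data went through PCA; the object names are illustrative, and the parameter values are entered by hand rather than taken from select_best.

```r
library(tidymodels)

set.seed(318)

# Stand-in split object (not the activity data)
toy_split <- initial_split(iris, strata = Species)

# Workflow with the same tunable specification as dtree_tune
toy_wf <- workflow() %>% 
  add_model(
    decision_tree(cost_complexity = tune(), tree_depth = tune()) %>% 
      set_engine("rpart") %>% 
      set_mode("classification")
  ) %>% 
  add_formula(Species ~ .)

# One-row tibble of parameter values, entered by hand for this sketch
toy_best <- tibble(cost_complexity = 0.0005623413, tree_depth = 11)

# Replace the tune() placeholders with the chosen values
toy_final_wf <- finalize_workflow(toy_wf, toy_best)

# Fit on the training part of the split and evaluate on the testing part
toy_final_fit <- last_fit(toy_final_wf, toy_split)
collect_metrics(toy_final_fit)

# Extract the fitted workflow for later predictions
toy_final_tree <- extract_workflow(toy_final_fit)
```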

In this documentation, however, the use of PCA hinders the steps explained above, because the last_fit function cannot read the PCA steps that have been conducted. For this reason, the best parameters are entered manually through the control parameter of the rpart function as below.

dtree_model_tuned <- rpart(formula = activity ~., data = train_data,
                           method = "class", 
                           control = list(cp = 0.0005623413,
                                          maxdepth = 11))

fancyRpartPlot(dtree_model_tuned)

As a result, the tree depth and the number of terminal nodes have increased tremendously. This could raise the overfitting issue described above. To see the performance of the tuned model along with its fit, the confusion matrices are presented below:

dtree_tuned_pred <- predict(dtree_model_tuned, newdata = test_data, 
                            type = "class")

confmat_dtree_tuned <- 
  confusionMatrix(dtree_tuned_pred, reference = test_data$activity)

confmat_dtree_tuned

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
##   LAYING                473      18        3       0                  0
##   SITTING                 9     304       81       0                  0
##   STANDING                3     122      393       0                  0
##   WALKING                 0       0        0     377                 38
##   WALKING_DOWNSTAIRS      0       0        0      25                286
##   WALKING_UPSTAIRS        1       1        0      29                 28
##                     Reference
## Prediction           WALKING_UPSTAIRS
##   LAYING                            0
##   SITTING                           0
##   STANDING                          0
##   WALKING                          27
##   WALKING_DOWNSTAIRS               23
##   WALKING_UPSTAIRS                336
## 
## Overall Statistics
##                                                
##                Accuracy : 0.8417               
##                  95% CI : (0.827, 0.8556)      
##     No Information Rate : 0.1886               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.8095               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: LAYING Class: SITTING Class: STANDING
## Sensitivity                 0.9733         0.6831          0.8239
## Specificity                 0.9900         0.9578          0.9405
## Pos Pred Value              0.9575         0.7716          0.7587
## Neg Pred Value              0.9938         0.9354          0.9592
## Prevalence                  0.1886         0.1727          0.1851
## Detection Rate              0.1835         0.1180          0.1525
## Detection Prevalence        0.1917         0.1529          0.2010
## Balanced Accuracy           0.9816         0.8205          0.8822
##                      Class: WALKING Class: WALKING_DOWNSTAIRS
## Sensitivity                  0.8747                    0.8125
## Specificity                  0.9697                    0.9784
## Pos Pred Value               0.8529                    0.8563
## Neg Pred Value               0.9747                    0.9706
## Prevalence                   0.1672                    0.1366
## Detection Rate               0.1463                    0.1110
## Detection Prevalence         0.1715                    0.1296
## Balanced Accuracy            0.9222                    0.8955
##                      Class: WALKING_UPSTAIRS
## Sensitivity                           0.8705
## Specificity                           0.9731
## Pos Pred Value                        0.8506
## Neg Pred Value                        0.9771
## Prevalence                            0.1498
## Detection Rate                        0.1304
## Detection Prevalence                  0.1533
## Balanced Accuracy                     0.9218

The accuracy of the model on the test set (84.2%) has increased compared with the default decision tree model (72.9%). To see whether the model is a good fit, the confusion matrix for the train set is shown below:

dtree_tuned_train_pred <- predict(dtree_model_tuned, newdata = train_data, type = "class")

confusionMatrix(dtree_tuned_train_pred, reference = train_data$activity)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
##   LAYING               1430      23        1       0                  0
##   SITTING                18    1125      140       0                  0
##   STANDING                8     183     1288       0                  0
##   WALKING                 0       0        0    1202                 70
##   WALKING_DOWNSTAIRS      0       0        0      41                923
##   WALKING_UPSTAIRS        2       1        0      48                 61
##                     Reference
## Prediction           WALKING_UPSTAIRS
##   LAYING                            0
##   SITTING                           1
##   STANDING                          0
##   WALKING                          43
##   WALKING_DOWNSTAIRS               42
##   WALKING_UPSTAIRS               1072
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9117               
##                  95% CI : (0.9051, 0.9179)     
##     No Information Rate : 0.1888               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.8937               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: LAYING Class: SITTING Class: STANDING
## Sensitivity                 0.9808         0.8446          0.9013
## Specificity                 0.9962         0.9751          0.9696
## Pos Pred Value              0.9835         0.8762          0.8709
## Neg Pred Value              0.9955         0.9678          0.9774
## Prevalence                  0.1888         0.1725          0.1851
## Detection Rate              0.1852         0.1457          0.1668
## Detection Prevalence        0.1883         0.1663          0.1915
## Balanced Accuracy           0.9885         0.9099          0.9355
##                      Class: WALKING Class: WALKING_DOWNSTAIRS
## Sensitivity                  0.9311                    0.8757
## Specificity                  0.9824                    0.9876
## Pos Pred Value               0.9141                    0.9175
## Neg Pred Value               0.9861                    0.9805
## Prevalence                   0.1672                    0.1365
## Detection Rate               0.1557                    0.1195
## Detection Prevalence         0.1703                    0.1303
## Balanced Accuracy            0.9567                    0.9316
##                      Class: WALKING_UPSTAIRS
## Sensitivity                           0.9257
## Specificity                           0.9829
## Pos Pred Value                        0.9054
## Neg Pred Value                        0.9868
## Prevalence                            0.1500
## Detection Rate                        0.1388
## Detection Prevalence                  0.1533
## Balanced Accuracy                     0.9543

The accuracy on the train set (91.2%) is noticeably higher than the accuracy on the test set (84.2%). This may indicate that the tuned model is overfit, as the gap between the accuracies has widened in comparison to the default tree model. It also illustrates the disadvantage of the decision tree described earlier: the more branches the tree has, the more the fit of the model is affected.
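The train and test accuracies reported so far can be placed side by side to compare the gaps. The sketch below copies the accuracy values from the confusion matrices above into a small data frame. The object name fit_check is illustrative.

```r
# Accuracies copied from the confusion matrices printed above
fit_check <- data.frame(
  model = c("Naive Bayes", "Decision Tree", "Decision Tree Tuned"),
  train_accuracy = c(0.8359, 0.7339, 0.9117),
  test_accuracy = c(0.8250, 0.7291, 0.8417)
)

# A larger positive gap means the model does better on data it has already seen
fit_check$gap <- fit_check$train_accuracy - fit_check$test_accuracy

fit_check
```

The gap is about 1 percentage point for Naive Bayes, under 1 point for the default tree, and about 7 points for the tuned tree.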

To further analyze the performance on each target class, the F1-scores for the tuned decision tree model are shown below:

f1_multi_dtree_tuned <- f1_score_multi(confmat_dtree_tuned)

f1_multi_dtree_tuned

The tuned model also classifies the ‘LAYING’ activity best among all the target classes. Unlike the other models, however, it performs worst on the ‘SITTING’ and ‘STANDING’ activities.
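The pruning alternative mentioned at the start of this section was not applied in this documentation. A minimal sketch of it is given below, assuming the usual rpart approach of growing a large tree and cutting it back at the complexity parameter with the lowest cross-validated error. It uses the built-in iris data as a stand-in, and the object names are illustrative.

```r
library(rpart)

set.seed(318)

# Grow a deliberately large tree on a stand-in data set (not the activity data)
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

# The cp table records the cross-validated error (xerror) for each subtree size
cp_table <- full_tree$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Cut the tree back to the subtree that corresponds to the chosen cp
pruned_tree <- prune(full_tree, cp = best_cp)

# Number of nodes before and after pruning
c(full = nrow(full_tree$frame), pruned = nrow(pruned_tree$frame))
```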

Random Forest

Random Forest is another common choice for building classification models. It is essentially an enhancement of the decision tree model: it builds multiple trees with different rule-based criteria and then uses a voting method to determine the class or value of the target variable. The model is categorized as an ensemble learning method and uses bootstrap aggregation (bagging) to train each tree on a resampled set of data points (see here for more information).

The advantages and disadvantages of this model would be described as follow:

Advantages

Reduction of overfitting risk: Because the random forest votes across many trees, averaging uncorrelated trees lowers the variance and the prediction error
Flexibility: The model can solve both classification and regression problems
Determination of feature importance: Variable importance can be analyzed within the model to see which predictors contribute most to it

Disadvantages

Computational time: The model takes longer to compute than the other models and therefore requires more computing resources
Black-box model: The computation process is hard to describe and interpret

To establish the model, the train function is used. Note that the trainControl function is also used here to control how the model is trained. The parameters configured are the number of cross-validation folds and the number of repeats used for resampling while the model is tuned. The model was pre-built and saved into an RDS file to shorten the computation, as shown below:

#set.seed(318)
#
#ctrl <- trainControl(method = "repeatedcv", number = 4, repeats = 4) 
#rf_model <- train(activity ~., data = train_data, method = "rf", trControl = ctrl)
#
#saveRDS(rf_model, "rf_model.RDS")
#
#rf_model

rf_model <- readRDS("rf_model.RDS")

rf_model

## Random Forest 
## 
## 7722 samples
##   50 predictor
##    6 classes: 'LAYING', 'SITTING', 'STANDING', 'WALKING', 'WALKING_DOWNSTAIRS', 'WALKING_UPSTAIRS' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times) 
## Summary of sample sizes: 5790, 5791, 5793, 5792, 5792, 5792, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9279644  0.9133210
##   26    0.9180908  0.9014380
##   50    0.9020340  0.8821194
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
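To make the resampling summary concrete, the following sketch lays out a 4-fold cross-validation repeated 4 times for 7722 rows and then picks the tuning row with the largest accuracy. It is a simplified illustration, not caret's implementation: the helper make_folds is hypothetical, and the folds here are drawn without stratifying by class, which is likely why the training sizes differ slightly from the ones printed above.

```r
set.seed(318)
n <- 7722; k <- 4; repeats <- 4   # values taken from the printed model summary

# One repeat: shuffle the row indices and deal them into k folds of near-equal size
make_folds <- function(n, k) split(sample(n), rep_len(seq_len(k), n))

# Training-set size of every resample (all rows except the held-out fold)
train_sizes <- unlist(lapply(seq_len(repeats), function(r) {
  n - lengths(make_folds(n, k))
}))
length(train_sizes)   # k * repeats resamples in total
range(train_sizes)

# Tuning results copied from the printed output; keep the row with the largest accuracy
results <- data.frame(mtry     = c(2, 26, 50),
                      Accuracy = c(0.9279644, 0.9180908, 0.9020340))
best <- results[which.max(results$Accuracy), ]
best
```

Each accuracy in the tuning table is the average over all 16 resamples, which makes the comparison between mtry values less sensitive to a single lucky or unlucky split.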

Based on a 4-fold cross-validation with 4 repetitions, the best performing model was the one with an mtry parameter of 2, which resulted in an accuracy of 92.8 %. To investigate the importance of the predictor variables, the varImp function could be used as below:

varImp(rf_model)

## rf variable importance
## 
##   only 20 most important variables shown (out of 50)
## 
##      Overall
## PC1  100.000
## PC5   75.532
## PC4   69.467
## PC3   39.936
## PC2   33.425
## PC8   22.848
## PC6   21.803
## PC7   19.694
## PC13  13.636
## PC11  13.057
## PC9   10.928
## PC31   9.553
## PC10   8.987
## PC12   8.870
## PC14   8.668
## PC27   8.604
## PC37   7.584
## PC36   6.864
## PC26   6.490
## PC15   6.480

The varImp function returns the importance of each predictor variable to the model, scaled by default so that the most important variable scores 100. Within this model, however, the predictors have been extracted into 50 principal components. To see the importance of the real predictors, the model could be established by using the original data set without the dimensionality reduction step (this would raise the computational time due to the large number of variables in the data set).
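The 0 to 100 scale can be reproduced by hand. The sketch below assumes a min-max rescaling, which is how the default scaling in caret is understood to work here; the raw importance values are hypothetical and are not taken from the model.

```r
# Hypothetical raw importance values (e.g. mean decrease in Gini); not taken from rf_model
raw_imp <- c(PC1 = 412.6, PC5 = 318.9, PC4 = 295.7, PC50 = 29.8)

# Shift so the minimum is 0, then rescale so the maximum is 100
scaled_imp <- (raw_imp - min(raw_imp)) / (max(raw_imp) - min(raw_imp)) * 100

round(sort(scaled_imp, decreasing = TRUE), 3)
```

Read this way, a score of 75 does not mean the variable explains 75 % of anything; it only places the variable three quarters of the way between the least and the most important predictor.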

The random forest model also does not strictly require splitting the data into a train and test set, because it uses the Out-of-Bag (OOB) sampling method, which is considered reliable for estimating the accuracy on unseen examples. For further information on OOB sampling, please refer to this article
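Why OOB rows exist at all can be checked with a short simulation. The sketch below draws bootstrap samples of the same size as the train set and measures how many rows are left out each time; the number of bootstrap samples (200) is a hypothetical choice for illustration.

```r
set.seed(318)
n      <- 7722   # number of training rows reported in the model summary
n_boot <- 200    # hypothetical number of bootstrap samples

# For each bootstrap sample, compute the share of rows that were never drawn
oob_fraction <- replicate(n_boot, {
  in_bag <- sample(n, replace = TRUE)
  1 - length(unique(in_bag)) / n
})

mean(oob_fraction)   # simulated share of out-of-bag rows per tree
(1 - 1 / n)^n        # theoretical share, which approaches exp(-1) for large n
```

Roughly a third of the rows are unseen by any given tree, so every row can be scored by the trees that did not train on it.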

plot(rf_model$finalModel)
legend("topright", colnames(rf_model$finalModel$err.rate), col = 1:6, cex = 0.6, fill = 1:6)

The finalModel element can also be extracted from the model and visualized. The visualization above shows the error rates over the number of trees generated in the model. Overall, the OOB error as well as the classification errors of the target classes stay flat once the number of trees exceeds 200. This may indicate that around 200 trees would be sufficient for this model.
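The point where the error curve flattens can also be located numerically instead of by eye. The sketch below uses a hypothetical, smoothly decaying error curve as a stand-in for rf_model$finalModel$err.rate[, "OOB"]; the helper settle_point and its tolerance are assumptions for illustration.

```r
# Hypothetical OOB error curve that flattens out; replace with the OOB column
# of rf_model$finalModel$err.rate when the fitted model is available
ntree   <- 500
oob_err <- 0.07 + 0.10 * exp(-seq_len(ntree) / 40)

# Smallest number of trees after which the error stays within `tol` of its final value
settle_point <- function(err, tol = 0.001) {
  outside <- which(abs(err - err[length(err)]) > tol)
  if (length(outside) == 0) 1L else max(outside) + 1L
}

settle_point(oob_err)
```

A real OOB curve is noisier than this stand-in, so a looser tolerance or a smoothed curve may be needed before the result is meaningful.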

To further analyze the performance of the model, a confusion matrix is presented below:

rf_pred <- predict(rf_model, test_data, type = "raw")

confmat_rf <- confusionMatrix(rf_pred, test_data$activity)

confmat_rf

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS
##   LAYING                485       6        0       0                  0
##   SITTING                 0     355       55       0                  0
##   STANDING                1      83      422       0                  0
##   WALKING                 0       0        0     423                  6
##   WALKING_DOWNSTAIRS      0       0        0       5                335
##   WALKING_UPSTAIRS        0       1        0       3                 11
##                     Reference
## Prediction           WALKING_UPSTAIRS
##   LAYING                            0
##   SITTING                           0
##   STANDING                          0
##   WALKING                          16
##   WALKING_DOWNSTAIRS                3
##   WALKING_UPSTAIRS                367
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9263               
##                  95% CI : (0.9155, 0.9361)     
##     No Information Rate : 0.1886               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9113               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: LAYING Class: SITTING Class: STANDING
## Sensitivity                 0.9979         0.7978          0.8847
## Specificity                 0.9971         0.9742          0.9600
## Pos Pred Value              0.9878         0.8659          0.8340
## Neg Pred Value              0.9995         0.9585          0.9734
## Prevalence                  0.1886         0.1727          0.1851
## Detection Rate              0.1882         0.1378          0.1638
## Detection Prevalence        0.1905         0.1591          0.1964
## Balanced Accuracy           0.9975         0.8860          0.9223
##                      Class: WALKING Class: WALKING_DOWNSTAIRS
## Sensitivity                  0.9814                    0.9517
## Specificity                  0.9897                    0.9964
## Pos Pred Value               0.9506                    0.9767
## Neg Pred Value               0.9962                    0.9924
## Prevalence                   0.1672                    0.1366
## Detection Rate               0.1641                    0.1300
## Detection Prevalence         0.1727                    0.1331
## Balanced Accuracy            0.9856                    0.9741
##                      Class: WALKING_UPSTAIRS
## Sensitivity                           0.9508
## Specificity                           0.9932
## Pos Pred Value                        0.9607
## Neg Pred Value                        0.9913
## Prevalence                            0.1498
## Detection Rate                        0.1424
## Detection Prevalence                  0.1482
## Balanced Accuracy                     0.9720
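The headline statistics follow directly from the counts in the matrix. As a check, the sketch below rebuilds the matrix from the printed output and recomputes the overall accuracy and the per-class sensitivity and precision by hand; the object names are chosen for illustration only.

```r
classes <- c("LAYING", "SITTING", "STANDING", "WALKING",
             "WALKING_DOWNSTAIRS", "WALKING_UPSTAIRS")

# Counts copied from the printed matrix: rows are predictions, columns are reference classes
cm <- matrix(c(485,   6,   0,   0,   0,   0,
                 0, 355,  55,   0,   0,   0,
                 1,  83, 422,   0,   0,   0,
                 0,   0,   0, 423,   6,  16,
                 0,   0,   0,   5, 335,   3,
                 0,   1,   0,   3,  11, 367),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy    <- sum(diag(cm)) / sum(cm)   # share of correctly classified observations
sensitivity <- diag(cm) / colSums(cm)    # correct predictions per reference class
precision   <- diag(cm) / rowSums(cm)    # correct predictions per predicted class (Pos Pred Value)

round(accuracy, 4)
round(sensitivity, 4)
round(precision, 4)
```

The off-diagonal counts show where the remaining errors sit: almost all of them are confusions between SITTING and STANDING.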

Interestingly, the model outperforms the other models, reaching an accuracy of 92.6 % on the test set. This is also reflected in the F1-scores for each target class, as shown below:

f1_multi_rf <- f1_score_multi(confmat_rf)
f1_multi_rf
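For reference, per-class F1-scores can also be derived from the statistics already printed in the confusion matrix output. The sketch below is not the f1_score_multi helper used above; it is a minimal stand-in that takes the Pos Pred Value (precision) and Sensitivity (recall) figures from "Statistics by Class" and combines them with the standard F1 formula.

```r
# Values copied from "Statistics by Class" in the confusion matrix output
precision <- c(LAYING = 0.9878, SITTING = 0.8659, STANDING = 0.8340,
               WALKING = 0.9506, WALKING_DOWNSTAIRS = 0.9767, WALKING_UPSTAIRS = 0.9607)
recall    <- c(LAYING = 0.9979, SITTING = 0.7978, STANDING = 0.8847,
               WALKING = 0.9814, WALKING_DOWNSTAIRS = 0.9517, WALKING_UPSTAIRS = 0.9508)

# F1 is the harmonic mean of precision and recall, computed per class
f1_by_class <- 2 * precision * recall / (precision + recall)

round(f1_by_class, 3)
mean(f1_by_class)   # unweighted (macro) average across the six classes
```

Because the inputs are rounded to four decimals, the results may differ in the last digit from scores computed on the raw counts.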

Conclusion

In conclusion, the Naive Bayes, Decision Tree, and Random Forest models each have advantages and disadvantages in classifying the target classes in a classification problem. The Naive Bayes model is advantageous for its short computation time but is biased for variables with scarce data. The Decision Tree model, on the other hand, is best known for its interpretability, since it classifies the target variable through a set of rules, but it is prone to overfitting as those rules grow more complex. Meanwhile, Random Forest performed best as a classification model here but has a slow computation time due to its complex algorithm. The selection among these models should be guided by the business requirements.

The summary of the performance of each model shows that the Random Forest model has the highest performance of all the models. This is also seen from the F1-scores of each class, which show that each of the target classes is well classified. The Random Forest is therefore the best choice for the Human Activity Recognition case, given its high accuracy and its F1-scores for each target class.

As a reference for further exploration, the random forest model could be tuned with hyperparameter tuning to find the best parameters for the model, as was done for the decision tree above. Further, other models such as logistic regression and/or Support Vector Machine could also be compared.