In this project, we predict human activities from wearable accelerometer data. The training data set contains 160 variables and 19622 observations.
After PCA, we applied six classification models: decision tree, random forest, LDA, QDA, KNN, and SVM. The random forest performs best, with an accuracy of about 97.8% in cross-validation.
As technology develops, it is now possible to collect a large amount of data about personal activity conveniently. Popular devices include Jawbone Up, Nike FuelBand, and Fitbit.
These devices take measurements regularly, which people use to improve their health, to find patterns in their behaviour, and so on.
In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Our goals are:
Identifying those 5 performances according to the accelerometer recordings in the training data set.
Predicting the movement for each observation in the testing data set.
There are 160 columns in both the training and testing data.
The training data set has 19622 observations, while the testing data set has 20 observations.
We might need to reduce the number or dimension of the predictors. First, we have a look at each variable in the two data sets.
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_belt : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt.1 : Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : Factor w/ 4 levels "","#DIV/0!","0.00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : Factor w/ 330 levels "","-0.02438",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_arm : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_arm : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_arm : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_arm : Factor w/ 328 levels "","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_arm : Factor w/ 395 levels "","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : Factor w/ 398 levels "","-0.0035","-0.0073",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","-0.0163","-0.0233",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_dumbbell : Factor w/ 401 levels "","-0.0082","-0.0096",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","-0.0084",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
## 'data.frame': 20 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 6 5 5 1 4 5 5 5 2 3 ...
## $ raw_timestamp_part_1 : int 1323095002 1322673067 1322673075 1322832789 1322489635 1322673149 1322673128 1322673076 1323084240 1322837822 ...
## $ raw_timestamp_part_2 : int 868349 778725 342967 560311 814776 510661 766645 54671 916313 384285 ...
## $ cvtd_timestamp : Factor w/ 11 levels "02/12/2011 13:33",..: 5 10 10 1 6 11 11 10 3 2 ...
## $ new_window : Factor w/ 1 level "no": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 74 431 439 194 235 504 485 440 323 664 ...
## $ roll_belt : num 123 1.02 0.87 125 1.35 -5.92 1.2 0.43 0.93 114 ...
## $ pitch_belt : num 27 4.87 1.82 -41.6 3.33 1.59 4.44 4.15 6.72 22.4 ...
## $ yaw_belt : num -4.75 -88.9 -88.5 162 -88.6 -87.7 -87.3 -88.5 -93.7 -13.1 ...
## $ total_accel_belt : int 20 4 5 17 3 4 4 4 4 18 ...
## $ kurtosis_roll_belt : logi NA NA NA NA NA NA ...
## $ kurtosis_picth_belt : logi NA NA NA NA NA NA ...
## $ kurtosis_yaw_belt : logi NA NA NA NA NA NA ...
## $ skewness_roll_belt : logi NA NA NA NA NA NA ...
## $ skewness_roll_belt.1 : logi NA NA NA NA NA NA ...
## $ skewness_yaw_belt : logi NA NA NA NA NA NA ...
## $ max_roll_belt : logi NA NA NA NA NA NA ...
## $ max_picth_belt : logi NA NA NA NA NA NA ...
## $ max_yaw_belt : logi NA NA NA NA NA NA ...
## $ min_roll_belt : logi NA NA NA NA NA NA ...
## $ min_pitch_belt : logi NA NA NA NA NA NA ...
## $ min_yaw_belt : logi NA NA NA NA NA NA ...
## $ amplitude_roll_belt : logi NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : logi NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : logi NA NA NA NA NA NA ...
## $ var_total_accel_belt : logi NA NA NA NA NA NA ...
## $ avg_roll_belt : logi NA NA NA NA NA NA ...
## $ stddev_roll_belt : logi NA NA NA NA NA NA ...
## $ var_roll_belt : logi NA NA NA NA NA NA ...
## $ avg_pitch_belt : logi NA NA NA NA NA NA ...
## $ stddev_pitch_belt : logi NA NA NA NA NA NA ...
## $ var_pitch_belt : logi NA NA NA NA NA NA ...
## $ avg_yaw_belt : logi NA NA NA NA NA NA ...
## $ stddev_yaw_belt : logi NA NA NA NA NA NA ...
## $ var_yaw_belt : logi NA NA NA NA NA NA ...
## $ gyros_belt_x : num -0.5 -0.06 0.05 0.11 0.03 0.1 -0.06 -0.18 0.1 0.14 ...
## $ gyros_belt_y : num -0.02 -0.02 0.02 0.11 0.02 0.05 0 -0.02 0 0.11 ...
## $ gyros_belt_z : num -0.46 -0.07 0.03 -0.16 0 -0.13 0 -0.03 -0.02 -0.16 ...
## $ accel_belt_x : int -38 -13 1 46 -8 -11 -14 -10 -15 -25 ...
## $ accel_belt_y : int 69 11 -1 45 4 -16 2 -2 1 63 ...
## $ accel_belt_z : int -179 39 49 -156 27 38 35 42 32 -158 ...
## $ magnet_belt_x : int -13 43 29 169 33 31 50 39 -6 10 ...
## $ magnet_belt_y : int 581 636 631 608 566 638 622 635 600 601 ...
## $ magnet_belt_z : int -382 -309 -312 -304 -418 -291 -315 -305 -302 -330 ...
## $ roll_arm : num 40.7 0 0 -109 76.1 0 0 0 -137 -82.4 ...
## $ pitch_arm : num -27.8 0 0 55 2.76 0 0 0 11.2 -63.8 ...
## $ yaw_arm : num 178 0 0 -142 102 0 0 0 -167 -75.3 ...
## $ total_accel_arm : int 10 38 44 25 29 14 15 22 34 32 ...
## $ var_accel_arm : logi NA NA NA NA NA NA ...
## $ avg_roll_arm : logi NA NA NA NA NA NA ...
## $ stddev_roll_arm : logi NA NA NA NA NA NA ...
## $ var_roll_arm : logi NA NA NA NA NA NA ...
## $ avg_pitch_arm : logi NA NA NA NA NA NA ...
## $ stddev_pitch_arm : logi NA NA NA NA NA NA ...
## $ var_pitch_arm : logi NA NA NA NA NA NA ...
## $ avg_yaw_arm : logi NA NA NA NA NA NA ...
## $ stddev_yaw_arm : logi NA NA NA NA NA NA ...
## $ var_yaw_arm : logi NA NA NA NA NA NA ...
## $ gyros_arm_x : num -1.65 -1.17 2.1 0.22 -1.96 0.02 2.36 -3.71 0.03 0.26 ...
## $ gyros_arm_y : num 0.48 0.85 -1.36 -0.51 0.79 0.05 -1.01 1.85 -0.02 -0.5 ...
## $ gyros_arm_z : num -0.18 -0.43 1.13 0.92 -0.54 -0.07 0.89 -0.69 -0.02 0.79 ...
## $ accel_arm_x : int 16 -290 -341 -238 -197 -26 99 -98 -287 -301 ...
## $ accel_arm_y : int 38 215 245 -57 200 130 79 175 111 -42 ...
## $ accel_arm_z : int 93 -90 -87 6 -30 -19 -67 -78 -122 -80 ...
## $ magnet_arm_x : int -326 -325 -264 -173 -170 396 702 535 -367 -420 ...
## $ magnet_arm_y : int 385 447 474 257 275 176 15 215 335 294 ...
## $ magnet_arm_z : int 481 434 413 633 617 516 217 385 520 493 ...
## $ kurtosis_roll_arm : logi NA NA NA NA NA NA ...
## $ kurtosis_picth_arm : logi NA NA NA NA NA NA ...
## $ kurtosis_yaw_arm : logi NA NA NA NA NA NA ...
## $ skewness_roll_arm : logi NA NA NA NA NA NA ...
## $ skewness_pitch_arm : logi NA NA NA NA NA NA ...
## $ skewness_yaw_arm : logi NA NA NA NA NA NA ...
## $ max_roll_arm : logi NA NA NA NA NA NA ...
## $ max_picth_arm : logi NA NA NA NA NA NA ...
## $ max_yaw_arm : logi NA NA NA NA NA NA ...
## $ min_roll_arm : logi NA NA NA NA NA NA ...
## $ min_pitch_arm : logi NA NA NA NA NA NA ...
## $ min_yaw_arm : logi NA NA NA NA NA NA ...
## $ amplitude_roll_arm : logi NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : logi NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : logi NA NA NA NA NA NA ...
## $ roll_dumbbell : num -17.7 54.5 57.1 43.1 -101.4 ...
## $ pitch_dumbbell : num 25 -53.7 -51.4 -30 -53.4 ...
## $ yaw_dumbbell : num 126.2 -75.5 -75.2 -103.3 -14.2 ...
## $ kurtosis_roll_dumbbell : logi NA NA NA NA NA NA ...
## $ kurtosis_picth_dumbbell : logi NA NA NA NA NA NA ...
## $ kurtosis_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ skewness_roll_dumbbell : logi NA NA NA NA NA NA ...
## $ skewness_pitch_dumbbell : logi NA NA NA NA NA NA ...
## $ skewness_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ max_roll_dumbbell : logi NA NA NA NA NA NA ...
## $ max_picth_dumbbell : logi NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ min_roll_dumbbell : logi NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : logi NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ amplitude_roll_dumbbell : logi NA NA NA NA NA NA ...
## [list output truncated]
We notice that in the testing dataset, several variables have lots of NAs (missing values). We then look at how the missing values are distributed.
library(naniar)
vis_miss(testing)
We can see that several columns in the testing data set contain nothing but missing values. We delete those columns from both the training and testing data sets, as they cannot play any role in the model.
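The column removal described above can be sketched in base R as follows. This is only an illustration on two tiny made-up data frames, not the code used in this report; the names `keep`, `training_kept`, and `testing_kept` are hypothetical.

```r
# Toy stand-ins for the real data frames (values are made up).
training <- data.frame(roll_belt     = c(1.41, 1.42, 1.48),
                       max_roll_belt = c(NA, NA, 5.2),
                       classe        = factor(c("A", "B", "A")))
testing  <- data.frame(roll_belt     = c(123, 1.02),
                       max_roll_belt = c(NA, NA),
                       problem_id    = 1:2)

# Columns of the test set that contain at least one observed value.
keep <- names(testing)[colSums(!is.na(testing)) > 0]

# Drop the all-NA columns from the test set.
testing_kept <- testing[, keep, drop = FALSE]

# Keep the same predictors in the training set, plus the response.
train_cols    <- c(intersect(keep, names(training)), "classe")
training_kept <- training[, train_cols, drop = FALSE]
```

Selecting columns by name from the test set, rather than by position, keeps the two data sets aligned even if their column order differs.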
Now we have 56 predictors in the training data set. We check the new data sets for any remaining missing values.
library(naniar)
#vis_miss(testing0a)
sapply(training0, function(x) sum(is.na(x)))
## raw_timestamp_part_1 raw_timestamp_part_2 new_window
## 0 0 0
## num_window roll_belt pitch_belt
## 0 0 0
## yaw_belt total_accel_belt gyros_belt_x
## 0 0 0
## gyros_belt_y gyros_belt_z accel_belt_x
## 0 0 0
## accel_belt_y accel_belt_z magnet_belt_x
## 0 0 0
## magnet_belt_y magnet_belt_z roll_arm
## 0 0 0
## pitch_arm yaw_arm total_accel_arm
## 0 0 0
## gyros_arm_x gyros_arm_y gyros_arm_z
## 0 0 0
## accel_arm_x accel_arm_y accel_arm_z
## 0 0 0
## magnet_arm_x magnet_arm_y magnet_arm_z
## 0 0 0
## roll_dumbbell pitch_dumbbell yaw_dumbbell
## 0 0 0
## total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 0 0 0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 0 0 0
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## 0 0 0
## magnet_dumbbell_z roll_forearm pitch_forearm
## 0 0 0
## yaw_forearm total_accel_forearm gyros_forearm_x
## 0 0 0
## gyros_forearm_y gyros_forearm_z accel_forearm_x
## 0 0 0
## accel_forearm_y accel_forearm_z magnet_forearm_x
## 0 0 0
## magnet_forearm_y magnet_forearm_z classe
## 0 0 0
sapply(testing0a, function(x) sum(is.na(x)))
## raw_timestamp_part_1 raw_timestamp_part_2 new_window
## 0 0 0
## num_window roll_belt pitch_belt
## 0 0 0
## yaw_belt total_accel_belt gyros_belt_x
## 0 0 0
## gyros_belt_y gyros_belt_z accel_belt_x
## 0 0 0
## accel_belt_y accel_belt_z magnet_belt_x
## 0 0 0
## magnet_belt_y magnet_belt_z roll_arm
## 0 0 0
## pitch_arm yaw_arm total_accel_arm
## 0 0 0
## gyros_arm_x gyros_arm_y gyros_arm_z
## 0 0 0
## accel_arm_x accel_arm_y accel_arm_z
## 0 0 0
## magnet_arm_x magnet_arm_y magnet_arm_z
## 0 0 0
## roll_dumbbell pitch_dumbbell yaw_dumbbell
## 0 0 0
## total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 0 0 0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 0 0 0
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## 0 0 0
## magnet_dumbbell_z roll_forearm pitch_forearm
## 0 0 0
## yaw_forearm total_accel_forearm gyros_forearm_x
## 0 0 0
## gyros_forearm_y gyros_forearm_z accel_forearm_x
## 0 0 0
## accel_forearm_y accel_forearm_z magnet_forearm_x
## 0 0 0
## magnet_forearm_y magnet_forearm_z
## 0 0
We do not have any missing values in the two data sets.
We randomly split the training data into two parts in a 7:3 ratio.
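A minimal sketch of such a 7:3 split, using plain random sampling in base R on a toy data frame. The report may well have used a stratified helper instead; the seed, the data, and the names `fit_part` and `hold_part` are assumptions made for illustration.

```r
set.seed(123)  # arbitrary seed, only to make the sketch reproducible

# Toy data frame standing in for the cleaned training set.
toy <- data.frame(x      = rnorm(100),
                  classe = factor(sample(LETTERS[1:5], 100, replace = TRUE)))

# Draw 70% of the row indices at random for model fitting.
n       <- nrow(toy)
fit_idx <- sample(seq_len(n), size = round(0.7 * n))

fit_part  <- toy[fit_idx, ]   # used to fit the models
hold_part <- toy[-fit_idx, ]  # held out to estimate accuracy
```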
There are 56 predictors in the training data set. We may reduce them by PCA to avoid overfitting.
PCA is especially helpful when we have a large number of variables.
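To illustrate how PCA reduces the number of predictors, here is a small sketch with base R's `prcomp()` on synthetic data. The 95% variance threshold and all variable names are assumptions chosen for the example, not settings taken from this report.

```r
set.seed(1)

# Synthetic predictors: x2 is nearly a copy of x1, so two components
# should capture almost all of the variance.
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.05)
x3 <- rnorm(200)
X  <- data.frame(x1, x2, x3)

# Center and scale the predictors before extracting the components.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the components.
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching the (assumed) 95% threshold.
k      <- which(var_explained >= 0.95)[1]
scores <- pca$x[, seq_len(k), drop = FALSE]  # reduced predictor matrix
```

The columns of `scores` would then replace the original predictors when fitting a model.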
The decision tree is a basic classifier in machine learning.
It predicts the response by splitting on the values of the predictors. The result can be plotted as a flowchart that looks like a tree: the outputs are the leaves, and the decision process forms the branches.
library(tree)
## Registered S3 method overwritten by 'tree':
## method from
## print.tree cli
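# Fit a classification tree on the first part of the training data.
tree_fit <- tree(classe ~ ., data = training1a)
# Draw the tree and label its splits.
plot(tree_fit)
text(tree_fit, pretty = 0)
# Predict the class of each observation in the second part.
tree.pred <- predict(tree_fit, training2a, type = "class")
# Confusion table: rows are predictions, columns are true classes.
table(tree.pred, training2a$classe)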
##
## tree.pred A B C D E
## A 1220 284 159 201 156
## B 22 332 23 104 49
## C 75 60 655 105 71
## D 335 280 97 429 130
## E 22 183 92 125 676
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 4.937978e-01 3.379542e-01 4.809450e-01 5.066567e-01 2.844520e-01
## AccuracyPValue McnemarPValue
## 8.473727e-251 NaN
The accuracy rate of the decision tree is about 49%.
The random forest is an improved version of the decision tree classifier. Instead of relying on one "tree", we generate many trees (a "forest"), and each tree votes for a class based on its own output. The final prediction is the mode of the votes, that is, the class chosen by the most trees, rather than the output of a single tree.
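The voting step can be illustrated with a few lines of base R. The votes below are made up, and `majority_vote()` is a hypothetical helper written for this sketch; it is not part of any package and only mimics the aggregation step of a random forest.

```r
# Hypothetical class votes from five trees for three observations
# (rows = observations, columns = trees).
votes <- matrix(c("A", "A", "B", "A", "C",
                  "B", "B", "B", "D", "B",
                  "E", "C", "E", "E", "C"),
                nrow = 3, byrow = TRUE)

# Return the most frequent label in a vector of votes.
majority_vote <- function(v) {
  counts <- table(v)
  names(counts)[which.max(counts)]
}

# One aggregated prediction per observation.
forest_pred <- apply(votes, 1, majority_vote)
```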
##
## predictions_rf A B C D E
## A 1674 0 0 0 0
## B 0 1139 3 0 0
## C 0 0 1023 2 0
## D 0 0 0 961 0
## E 0 0 0 1 1082
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9989805 0.9987104 0.9977822 0.9996258 0.2844520
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
The accuracy rate of the random forest on the held-out part of the training data set is about 99.9%.
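As a check, the accuracy and kappa values can be recomputed by hand from the confusion table printed above. The sketch below copies the counts into a matrix; the names `cm`, `pe`, and `kappa_hat` are chosen for this example.

```r
# Confusion table copied from the random forest output above
# (rows = predictions, columns = true classes).
cm <- matrix(c(1674,    0,    0,   0,    0,
                  0, 1139,    3,   0,    0,
                  0,    0, 1023,   2,    0,
                  0,    0,    0, 961,    0,
                  0,    0,    0,   1, 1082),
             nrow = 5, byrow = TRUE,
             dimnames = list(prediction = LETTERS[1:5],
                             reference  = LETTERS[1:5]))

# Accuracy: share of observations on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# Cohen's kappa: accuracy corrected for agreement expected by chance.
pe        <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappa_hat <- (accuracy - pe) / (1 - pe)
```

Only 6 of the 5885 held-out observations fall off the diagonal, which is where the 99.9% figure comes from.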
importance(modFit_rf)
## MeanDecreaseGini
## raw_timestamp_part_1 1305.6975877
## raw_timestamp_part_2 14.7719730
## new_window 0.2494942
## num_window 754.6774410
## roll_belt 711.7297947
## pitch_belt 393.6603326
## yaw_belt 482.7320384
## total_accel_belt 150.4030288
## gyros_belt_x 51.9663588
## gyros_belt_y 63.1683911
## gyros_belt_z 173.1603951
## accel_belt_x 77.1172576
## accel_belt_y 81.7437427
## accel_belt_z 251.0262154
## magnet_belt_x 145.5489140
## magnet_belt_y 249.8255841
## magnet_belt_z 217.2270763
## roll_arm 179.6985740
## pitch_arm 86.5862987
## yaw_arm 121.0340440
## total_accel_arm 48.8191663
## gyros_arm_x 60.3228365
## gyros_arm_y 63.5428679
## gyros_arm_z 28.4863720
## accel_arm_x 135.1681881
## accel_arm_y 73.5352654
## accel_arm_z 58.9083016
## magnet_arm_x 151.1151934
## magnet_arm_y 115.6861412
## magnet_arm_z 87.4088502
## roll_dumbbell 256.8840567
## pitch_dumbbell 114.3513837
## yaw_dumbbell 161.9740988
## total_accel_dumbbell 154.9237760
## gyros_dumbbell_x 62.3381073
## gyros_dumbbell_y 137.4000303
## gyros_dumbbell_z 37.0873489
## accel_dumbbell_x 175.2524185
## accel_dumbbell_y 245.2034651
## accel_dumbbell_z 199.6260225
## magnet_dumbbell_x 280.9854166
## magnet_dumbbell_y 386.3899192
## magnet_dumbbell_z 454.4603416
## roll_forearm 357.6984008
## pitch_forearm 462.9530712
## yaw_forearm 85.6735857
## total_accel_forearm 52.5731842
## gyros_forearm_x 37.8221772
## gyros_forearm_y 59.0224981
## gyros_forearm_z 39.7255322
## accel_forearm_x 185.4255585
## accel_forearm_y 70.6384909
## accel_forearm_z 137.6899287
## magnet_forearm_x 116.0251819
## magnet_forearm_y 112.4537420
## magnet_forearm_z 140.3223889
MeanDecreaseGini measures the importance of each variable. "raw_timestamp_part_1" is the most important variable, followed by "num_window" and "roll_belt".
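Sorting the values makes the ranking easier to read. The sketch below copies a handful of the MeanDecreaseGini values from the output above into a named vector; `gini`, `ranked`, and `top3` are names chosen for this example.

```r
# A few MeanDecreaseGini values copied from the output above.
gini <- c(raw_timestamp_part_1 = 1305.6975877,
          num_window           = 754.6774410,
          roll_belt            = 711.7297947,
          yaw_belt             = 482.7320384,
          pitch_forearm        = 462.9530712,
          magnet_dumbbell_z    = 454.4603416,
          new_window           = 0.2494942)

# Rank the variables from most to least important.
ranked <- sort(gini, decreasing = TRUE)
top3   <- names(ranked)[1:3]
```

With the fitted model at hand, the same ranking could presumably be obtained by ordering the matrix returned by `importance(modFit_rf)` on its MeanDecreaseGini column.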
Linear discriminant analysis (LDA) classifies observations using linear combinations of the predictors, which gives linear decision boundaries between the classes.
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.087511e-01 6.315417e-01 6.969567e-01 7.203381e-01 2.844520e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 5.200708e-58
The accuracy rate of LDA is about 71%.
Quadratic discriminant analysis (QDA) classifies observations using quadratic decision surfaces. It is a more general version of LDA because each class is allowed its own covariance matrix.
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.941376e-01 8.663044e-01 8.859959e-01 9.018863e-01 2.844520e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 4.053350e-47
The accuracy rate of QDA is about 89%.
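For readers who want to try both discriminant methods, here is a minimal sketch using `lda()` and `qda()` from the MASS package on the built-in `iris` data. The data set, the seed, and the 70/30 split are stand-ins chosen for illustration; this is not the code behind the figures above.

```r
library(MASS)  # provides lda() and qda()

set.seed(42)
# Built-in iris data as a stand-in; 70% of the rows are used for fitting.
idx       <- sample(seq_len(nrow(iris)), size = round(0.7 * nrow(iris)))
fit_part  <- iris[idx, ]
hold_part <- iris[-idx, ]

lda_fit <- lda(Species ~ ., data = fit_part)  # shared covariance matrix
qda_fit <- qda(Species ~ ., data = fit_part)  # one covariance per class

# predict() returns a list; the predicted labels are in $class.
lda_pred <- predict(lda_fit, hold_part)$class
qda_pred <- predict(qda_fit, hold_part)$class

# Hold-out accuracy of each method.
lda_acc <- mean(lda_pred == hold_part$Species)
qda_acc <- mean(qda_pred == hold_part$Species)
```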
k-NN classification is a type of instance-based learning: each observation is assigned the class that is most common among its k nearest neighbours.
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 2.910790e-01 9.820378e-02 2.794940e-01 3.028714e-01 2.844520e-01
## AccuracyPValue McnemarPValue
## 1.330847e-01 7.633337e-16
The accuracy rate of k-NN is about 29%, which is barely above the no-information rate of about 28%.
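Because k-NN relies on distances, predictors measured on very different scales can dominate the result. The sketch below shows one common way to run k-NN with standardized predictors, using `knn()` from the class package on the built-in `iris` data. The data, the seed, and the choice of k = 5 are assumptions for illustration, and the sketch does not claim to explain the accuracy reported above.

```r
library(class)  # provides knn()

set.seed(42)
idx     <- sample(seq_len(nrow(iris)), size = round(0.7 * nrow(iris)))
train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]

# Standardize with the training means and standard deviations so that
# no single predictor dominates the Euclidean distance.
mu      <- colMeans(train_x)
sdv     <- apply(train_x, 2, sd)
train_s <- scale(train_x, center = mu, scale = sdv)
test_s  <- scale(test_x,  center = mu, scale = sdv)

# Classify each hold-out row by the majority class of its 5 neighbours.
knn_pred <- knn(train_s, test_s, cl = iris$Species[idx], k = 5)
knn_acc  <- mean(knn_pred == iris$Species[-idx])
```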
The principle of SVM is to find the boundary that separates two groups with the maximal margin.
It is often very accurate and works well on small data sets.
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9199660 0.8989294 0.9127382 0.9267741 0.2844520
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
The accuracy rate of SVM is about 92%. This is a reasonably good result, considering that the tuning parameters could be optimized further if necessary. However, the drawback is that it is time-consuming compared to the other methods.
The random forest has the best accuracy, so we use the random forest method to build the model for cross-validation.
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9777400170 0.9718401212 0.9736404879 0.9813558807 0.2844519966
## AccuracyPValue McnemarPValue
## 0.0000000000 0.0004058384
The accuracy of the random forest in cross-validation is 97.8%.
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A A A A E D B A A B C B A E E A B B B
## Levels: A B C D E
We compared several models and chose the random forest to make the predictions because it had the best accuracy among the models we tested.
The accuracy of the random forest is 99.9% on the held-out part of the training data set and 97.8% in cross-validation.