Data cleaning process

I first remove variables that contain NA values. I also remove the first seven variables (row numbers, user names, timestamps, etc.) because they have nothing to do with the outcome. Lastly, I remove variables that are largely empty. By the end of the data cleaning process, the dataset has 19,622 observations and 53 variables.

##  [1] "X"                       "user_name"              
##  [3] "raw_timestamp_part_1"    "raw_timestamp_part_2"   
##  [5] "cvtd_timestamp"          "new_window"             
##  [7] "num_window"              "roll_belt"              
##  [9] "pitch_belt"              "yaw_belt"               
## [11] "total_accel_belt"        "kurtosis_roll_belt"     
## [13] "kurtosis_picth_belt"     "kurtosis_yaw_belt"      
## [15] "skewness_roll_belt"      "skewness_roll_belt.1"   
## [17] "skewness_yaw_belt"       "max_yaw_belt"           
## [19] "min_yaw_belt"            "amplitude_yaw_belt"     
## [21] "gyros_belt_x"            "gyros_belt_y"           
## [23] "gyros_belt_z"            "accel_belt_x"           
## [25] "accel_belt_y"            "accel_belt_z"           
## [27] "magnet_belt_x"           "magnet_belt_y"          
## [29] "magnet_belt_z"           "roll_arm"               
## [31] "pitch_arm"               "yaw_arm"                
## [33] "total_accel_arm"         "gyros_arm_x"            
## [35] "gyros_arm_y"             "gyros_arm_z"            
## [37] "accel_arm_x"             "accel_arm_y"            
## [39] "accel_arm_z"             "magnet_arm_x"           
## [41] "magnet_arm_y"            "magnet_arm_z"           
## [43] "kurtosis_roll_arm"       "kurtosis_picth_arm"     
## [45] "kurtosis_yaw_arm"        "skewness_roll_arm"      
## [47] "skewness_pitch_arm"      "skewness_yaw_arm"       
## [49] "roll_dumbbell"           "pitch_dumbbell"         
## [51] "yaw_dumbbell"            "kurtosis_roll_dumbbell" 
## [53] "kurtosis_picth_dumbbell" "kurtosis_yaw_dumbbell"  
## [55] "skewness_roll_dumbbell"  "skewness_pitch_dumbbell"
## [57] "skewness_yaw_dumbbell"   "max_yaw_dumbbell"       
## [59] "min_yaw_dumbbell"        "amplitude_yaw_dumbbell" 
## [61] "total_accel_dumbbell"    "gyros_dumbbell_x"       
## [63] "gyros_dumbbell_y"        "gyros_dumbbell_z"       
## [65] "accel_dumbbell_x"        "accel_dumbbell_y"       
## [67] "accel_dumbbell_z"        "magnet_dumbbell_x"      
## [69] "magnet_dumbbell_y"       "magnet_dumbbell_z"      
## [71] "roll_forearm"            "pitch_forearm"          
## [73] "yaw_forearm"             "kurtosis_roll_forearm"  
## [75] "kurtosis_picth_forearm"  "kurtosis_yaw_forearm"   
## [77] "skewness_roll_forearm"   "skewness_pitch_forearm" 
## [79] "skewness_yaw_forearm"    "max_yaw_forearm"        
## [81] "min_yaw_forearm"         "amplitude_yaw_forearm"  
## [83] "total_accel_forearm"     "gyros_forearm_x"        
## [85] "gyros_forearm_y"         "gyros_forearm_z"        
## [87] "accel_forearm_x"         "accel_forearm_y"        
## [89] "accel_forearm_z"         "magnet_forearm_x"       
## [91] "magnet_forearm_y"        "magnet_forearm_z"       
## [93] "classe"
## [1] 19622    53
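
The three cleaning steps can be sketched in base R on a toy data frame. This is only an illustration under stated assumptions: the data frame `raw`, its values, and the columns `max_roll` and `kurtosis` are hypothetical, only two identifier columns are dropped instead of seven, and the 0.5 cutoff for "largely empty" is a guess, since the report does not state a threshold.

```r
# Toy data frame standing in for the raw training data (hypothetical values).
raw <- data.frame(
  X         = 1:4,
  user_name = c("a", "b", "c", "d"),
  roll_belt = c(1.4, 1.5, 1.6, 1.7),
  max_roll  = c(NA, NA, NA, 2.1),     # contains NA values
  kurtosis  = c("", "", "", "0.3"),   # mostly empty strings
  classe    = factor(c("A", "B", "A", "C")),
  stringsAsFactors = FALSE
)

# Step 1: drop every column that contains at least one NA.
no_na <- raw[, colSums(is.na(raw)) == 0, drop = FALSE]

# Step 2: drop the leading identifier columns (two here; seven in the report).
no_id <- no_na[, -(1:2), drop = FALSE]

# Step 3: drop columns in which most entries are empty strings.
# The 0.5 cutoff is an assumption, not a value taken from the report.
empty_share <- sapply(no_id, function(col) mean(as.character(col) == ""))
cleaned <- no_id[, empty_share <= 0.5, drop = FALSE]

names(cleaned)
```

In this toy example only `roll_belt` and `classe` survive all three steps.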

Building training and testing datasets

I create a training dataset and a testing dataset using a 70/30 split.
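
A 70/30 split of this kind might look like the sketch below. It uses base R's `sample` on the built-in `iris` data as a stand-in for the cleaned dataset; the seed and the object names `train_idx`, `training`, and `testing` are assumptions made for illustration, not taken from the report.

```r
set.seed(123)  # arbitrary seed, chosen only to make the sketch reproducible

# iris (shipped with R) stands in for the cleaned dataset.
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))  # 70% of the row indices

training <- iris[train_idx, ]   # rows used to fit the model
testing  <- iris[-train_idx, ]  # remaining 30%, held out for evaluation

c(train = nrow(training), test = nrow(testing))
```

A simple random split does not guarantee the same class proportions in both parts. `createDataPartition(y, p = 0.7, list = FALSE)` from the caret package is a common alternative that samples within each outcome class.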

Prediction model

I use the random forest method because it is versatile. I also set ntree to 250 to reduce the computational cost.
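
As a rough illustration of fitting a random forest with `ntree = 250`, the sketch below uses the `randomForest` package on the built-in `iris` data. Whether the report calls `randomForest` directly or goes through another interface is not stated, so the call, the seed, and the object names here are assumptions.

```r
library(randomForest)

set.seed(123)  # arbitrary seed for reproducibility of the sketch

# iris stands in for the cleaned data; Species plays the role of classe.
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
training  <- iris[train_idx, ]
testing   <- iris[-train_idx, ]

# Fit a forest of 250 trees instead of the package default of 500.
fit <- randomForest(Species ~ ., data = training, ntree = 250)

# Predict the held-out rows and compare against the known labels.
pred <- predict(fit, newdata = testing)
mean(pred == testing$Species)  # share of correct predictions
```

Fewer trees shorten training time roughly in proportion; whether accuracy suffers depends on the data, so it is worth checking the held-out accuracy after changing `ntree`.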

The accuracy of the model on the testing dataset is 0.9964, meaning that the estimated out-of-sample error is 0.0036.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    0    0    0    0
##          B    2 1137    1    1    0
##          C    0    2 1023    9    0
##          D    0    0    2  953    3
##          E    0    0    0    1 1079
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9964          
##                  95% CI : (0.9946, 0.9978)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9955          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9988   0.9982   0.9971   0.9886   0.9972
## Specificity            1.0000   0.9992   0.9977   0.9990   0.9998
## Pos Pred Value         1.0000   0.9965   0.9894   0.9948   0.9991
## Neg Pred Value         0.9995   0.9996   0.9994   0.9978   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2841   0.1932   0.1738   0.1619   0.1833
## Detection Prevalence   0.2841   0.1939   0.1757   0.1628   0.1835
## Balanced Accuracy      0.9994   0.9987   0.9974   0.9938   0.9985
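
The Accuracy value in the output above can be recomputed by hand from the confusion matrix: correct predictions sit on the diagonal, and the out-of-sample error estimate is one minus the accuracy. The object names `cm`, `accuracy`, and `oos_error` are chosen here for illustration.

```r
# Confusion matrix copied from the output above:
# rows are predictions, columns are reference classes.
cm <- matrix(
  c(1672,    0,    0,   0,    0,
       2, 1137,    1,   1,    0,
       0,    2, 1023,   9,    0,
       0,    0,    2, 953,    3,
       0,    0,    0,   1, 1079),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy  <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
oos_error <- 1 - accuracy             # estimated out-of-sample error

round(c(accuracy = accuracy, oos_error = oos_error), 4)
```

With 5864 of 5885 test cases classified correctly, this gives an accuracy of 0.9964 and an error of 0.0036.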

Lastly, I use the model to predict the 20 test cases.

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E