Introduction

In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of six participants. The six young, healthy participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

The goal of the project is to predict the manner in which the participants did the exercise. This is the “classe” variable in the training set. The data for this project come from this source

Download the data and read the data into R

trainingUrl<- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testingUrl<- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(trainingUrl, destfile = "pml-training.csv", method="curl")
download.file(testingUrl, destfile = "pml-testing.csv", method="curl")
training<- read.csv("pml-training.csv", stringsAsFactors = FALSE)
testing<- read.csv("pml-testing.csv", stringsAsFactors = FALSE)
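
The `str()` listing further down shows many summary columns read in as character. One way to avoid a later conversion step is to declare the missing-value markers when reading the file. The sketch below uses a small made-up CSV rather than the real download, and it assumes that the raw file marks missing values with empty strings and with spreadsheet error strings such as `#DIV/0!`; that second marker is an assumption, not something shown in the output here.

```r
# Toy CSV standing in for pml-training.csv (hypothetical values).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "roll_belt,kurtosis_roll_belt,classe",
  "1.41,,A",
  "1.42,#DIV/0!,A",
  "1.48,5.587755,B"
), csv_path)

# Treat "NA", empty strings and "#DIV/0!" (assumed marker) as missing while
# reading, so that numeric columns are parsed as numeric straight away.
toy <- read.csv(csv_path,
                na.strings = c("NA", "", "#DIV/0!"),
                stringsAsFactors = FALSE)

str(toy)
```

With this approach the character-to-numeric conversion later on may become unnecessary, at the cost of having to know the markers in advance.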

Feature selection, cross validation and model building

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(333)
# let's look at the data first
str(training)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : chr  "carlitos" "carlitos" "carlitos" "carlitos" ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : chr  "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
##  $ new_window              : chr  "no" "no" "no" "no" ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : chr  "" "" "" "" ...
##  $ kurtosis_picth_belt     : chr  "" "" "" "" ...
##  $ kurtosis_yaw_belt       : chr  "" "" "" "" ...
##  $ skewness_roll_belt      : chr  "" "" "" "" ...
##  $ skewness_roll_belt.1    : chr  "" "" "" "" ...
##  $ skewness_yaw_belt       : chr  "" "" "" "" ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : chr  "" "" "" "" ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : chr  "" "" "" "" ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : chr  "" "" "" "" ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : chr  "" "" "" "" ...
##  $ kurtosis_picth_arm      : chr  "" "" "" "" ...
##  $ kurtosis_yaw_arm        : chr  "" "" "" "" ...
##  $ skewness_roll_arm       : chr  "" "" "" "" ...
##  $ skewness_pitch_arm      : chr  "" "" "" "" ...
##  $ skewness_yaw_arm        : chr  "" "" "" "" ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : chr  "" "" "" "" ...
##  $ kurtosis_picth_dumbbell : chr  "" "" "" "" ...
##  $ kurtosis_yaw_dumbbell   : chr  "" "" "" "" ...
##  $ skewness_roll_dumbbell  : chr  "" "" "" "" ...
##  $ skewness_pitch_dumbbell : chr  "" "" "" "" ...
##  $ skewness_yaw_dumbbell   : chr  "" "" "" "" ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : chr  "" "" "" "" ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : chr  "" "" "" "" ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

The first column is the entry number and the second column is the user name. Let’s remove them, along with the other non-sensor columns (timestamps and window markers).

See the discussion on the forum.
Including the non-sensor columns will give you artificially high accuracy on your model, because they are highly correlated with the classe outcome.
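
A quick way to check whether a bookkeeping column leaks the outcome is to count how many distinct classes occur for each of its values. The data frame below is a made-up toy, not the real training set; on the real data the same check could be run on `training$num_window` before the column is dropped.

```r
# Hypothetical toy data: each recording window belongs to exactly one class,
# so the window id alone "predicts" the outcome without any sensor reading.
toy <- data.frame(
  num_window = c(11, 11, 12, 12, 12, 13, 13),
  classe     = c("A", "A", "A", "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Number of distinct classes observed inside each window.
classes_per_window <- tapply(toy$classe, toy$num_window,
                             function(x) length(unique(x)))
classes_per_window

# TRUE means a lookup table keyed on num_window would classify every row.
all(classes_per_window == 1)
```

If the check returns `TRUE`, a model can memorise the column instead of learning from the sensors, which is why such columns are removed here.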

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
training<- select(training, c(-X,-user_name,-raw_timestamp_part_1, -raw_timestamp_part_2, -cvtd_timestamp, -new_window, -num_window))
testing<- select(testing, c(-X,-user_name,-raw_timestamp_part_1, -raw_timestamp_part_2, -cvtd_timestamp, -new_window, -num_window))

## turn all the character columns into numeric.

char_col_index<- sapply(training, class) == "character"

char_col<- names(training)[char_col_index]
char_col
##  [1] "kurtosis_roll_belt"      "kurtosis_picth_belt"    
##  [3] "kurtosis_yaw_belt"       "skewness_roll_belt"     
##  [5] "skewness_roll_belt.1"    "skewness_yaw_belt"      
##  [7] "max_yaw_belt"            "min_yaw_belt"           
##  [9] "amplitude_yaw_belt"      "kurtosis_roll_arm"      
## [11] "kurtosis_picth_arm"      "kurtosis_yaw_arm"       
## [13] "skewness_roll_arm"       "skewness_pitch_arm"     
## [15] "skewness_yaw_arm"        "kurtosis_roll_dumbbell" 
## [17] "kurtosis_picth_dumbbell" "kurtosis_yaw_dumbbell"  
## [19] "skewness_roll_dumbbell"  "skewness_pitch_dumbbell"
## [21] "skewness_yaw_dumbbell"   "max_yaw_dumbbell"       
## [23] "min_yaw_dumbbell"        "amplitude_yaw_dumbbell" 
## [25] "kurtosis_roll_forearm"   "kurtosis_picth_forearm" 
## [27] "kurtosis_yaw_forearm"    "skewness_roll_forearm"  
## [29] "skewness_pitch_forearm"  "skewness_yaw_forearm"   
## [31] "max_yaw_forearm"         "min_yaw_forearm"        
## [33] "amplitude_yaw_forearm"   "classe"
training<- training %>% mutate_each_(funs(as.numeric), char_col[-34])
## Warning in mutate_impl(.data, dots): NAs introduced by coercion
## (the same warning is printed 33 times, once per converted column)
testing<- testing %>% mutate_each_(funs(as.numeric), char_col[-34])
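
`mutate_each_()` and `funs()` have been deprecated in later dplyr releases, and `char_col[-34]` excludes the outcome by position, which breaks silently if the column order changes. As a hedged alternative, the same conversion can be sketched in base R, excluding `classe` by name. The toy frame and its values are hypothetical, and the `#DIV/0!` string is an assumed example of a non-numeric entry.

```r
# Hypothetical toy frame with the same kind of character columns.
toy <- data.frame(
  kurtosis_roll_belt = c("", "#DIV/0!", "5.587755"),
  max_yaw_belt       = c("", "-1.1", "-1.4"),
  classe             = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

# Character columns, excluding the outcome (selected by name, not position).
char_cols <- setdiff(names(toy)[sapply(toy, is.character)], "classe")

# Convert in place; non-numeric strings become NA, hence suppressWarnings().
toy[char_cols] <- lapply(toy[char_cols],
                         function(x) suppressWarnings(as.numeric(x)))

sapply(toy, class)
```

Applying the same `char_cols` vector to the testing set keeps both data frames consistent.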

# some variables have little or no variability
# such variables are not useful when constructing a prediction model: when a predictor has nzv = TRUE, exclude it from the model

zeroV<- nearZeroVar(training,saveMetrics=TRUE)
zeroV
##                          freqRatio percentUnique zeroVar   nzv
## roll_belt                 1.101904    6.77810621   FALSE FALSE
## pitch_belt                1.036082    9.37722964   FALSE FALSE
## yaw_belt                  1.058480    9.97349913   FALSE FALSE
## total_accel_belt          1.063160    0.14779329   FALSE FALSE
## kurtosis_roll_belt        2.000000    2.01304658   FALSE FALSE
## kurtosis_picth_belt       1.333333    1.60534094   FALSE FALSE
## kurtosis_yaw_belt         0.000000    0.00000000    TRUE  TRUE
## skewness_roll_belt        2.000000    2.00285394   FALSE FALSE
## skewness_roll_belt.1      1.333333    1.71236367   FALSE FALSE
## skewness_yaw_belt         0.000000    0.00000000    TRUE  TRUE
## max_roll_belt             1.000000    0.99378249   FALSE FALSE
## max_picth_belt            1.538462    0.11211905   FALSE FALSE
## max_yaw_belt              1.034483    0.33635715   FALSE FALSE
## min_roll_belt             1.000000    0.93772296   FALSE FALSE
## min_pitch_belt            2.192308    0.08154113   FALSE FALSE
## min_yaw_belt              1.034483    0.33635715   FALSE FALSE
## amplitude_roll_belt       1.290323    0.75425543   FALSE FALSE
## amplitude_pitch_belt      3.042254    0.06625217   FALSE FALSE
## amplitude_yaw_belt        0.000000    0.00509632    TRUE  TRUE
## var_total_accel_belt      1.426829    0.33126083   FALSE FALSE
## avg_roll_belt             1.066667    0.97339721   FALSE FALSE
## stddev_roll_belt          1.039216    0.35164611   FALSE FALSE
## var_roll_belt             1.615385    0.48924676   FALSE FALSE
## avg_pitch_belt            1.375000    1.09061258   FALSE FALSE
## stddev_pitch_belt         1.161290    0.21914178   FALSE FALSE
## var_pitch_belt            1.307692    0.32106819   FALSE FALSE
## avg_yaw_belt              1.200000    1.22311691   FALSE FALSE
## stddev_yaw_belt           1.693878    0.29558659   FALSE FALSE
## var_yaw_belt              1.500000    0.73896647   FALSE FALSE
## gyros_belt_x              1.058651    0.71348486   FALSE FALSE
## gyros_belt_y              1.144000    0.35164611   FALSE FALSE
## gyros_belt_z              1.066214    0.86127816   FALSE FALSE
## accel_belt_x              1.055412    0.83579655   FALSE FALSE
## accel_belt_y              1.113725    0.72877383   FALSE FALSE
## accel_belt_z              1.078767    1.52379982   FALSE FALSE
## magnet_belt_x             1.090141    1.66649679   FALSE FALSE
## magnet_belt_y             1.099688    1.51870350   FALSE FALSE
## magnet_belt_z             1.006369    2.32901845   FALSE FALSE
## roll_arm                 52.338462   13.52563449   FALSE FALSE
## pitch_arm                87.256410   15.73234125   FALSE FALSE
## yaw_arm                  33.029126   14.65701763   FALSE FALSE
## total_accel_arm           1.024526    0.33635715   FALSE FALSE
## var_accel_arm             5.500000    2.01304658   FALSE FALSE
## avg_roll_arm             77.000000    1.68178575   FALSE  TRUE
## stddev_roll_arm          77.000000    1.68178575   FALSE  TRUE
## var_roll_arm             77.000000    1.68178575   FALSE  TRUE
## avg_pitch_arm            77.000000    1.68178575   FALSE  TRUE
## stddev_pitch_arm         77.000000    1.68178575   FALSE  TRUE
## var_pitch_arm            77.000000    1.68178575   FALSE  TRUE
## avg_yaw_arm              77.000000    1.68178575   FALSE  TRUE
## stddev_yaw_arm           80.000000    1.66649679   FALSE  TRUE
## var_yaw_arm              80.000000    1.66649679   FALSE  TRUE
## gyros_arm_x               1.015504    3.27693405   FALSE FALSE
## gyros_arm_y               1.454369    1.91621649   FALSE FALSE
## gyros_arm_z               1.110687    1.26388747   FALSE FALSE
## accel_arm_x               1.017341    3.95984099   FALSE FALSE
## accel_arm_y               1.140187    2.73672409   FALSE FALSE
## accel_arm_z               1.128000    4.03628580   FALSE FALSE
## magnet_arm_x              1.000000    6.82397309   FALSE FALSE
## magnet_arm_y              1.056818    4.44399144   FALSE FALSE
## magnet_arm_z              1.036364    6.44684538   FALSE FALSE
## kurtosis_roll_arm         1.000000    1.67159311   FALSE FALSE
## kurtosis_picth_arm        1.000000    1.66140047   FALSE FALSE
## kurtosis_yaw_arm          1.000000    2.00285394   FALSE FALSE
## skewness_roll_arm         1.000000    1.67668943   FALSE FALSE
## skewness_pitch_arm        1.000000    1.66140047   FALSE FALSE
## skewness_yaw_arm          1.000000    2.00285394   FALSE FALSE
## max_roll_arm             25.666667    1.47793293   FALSE  TRUE
## max_picth_arm            12.833333    1.34033228   FALSE FALSE
## max_yaw_arm               1.227273    0.25991234   FALSE FALSE
## min_roll_arm             19.250000    1.41677709   FALSE  TRUE
## min_pitch_arm            19.250000    1.47793293   FALSE  TRUE
## min_yaw_arm               1.000000    0.19366018   FALSE FALSE
## amplitude_roll_arm       25.666667    1.55947406   FALSE  TRUE
## amplitude_pitch_arm      20.000000    1.49831821   FALSE  TRUE
## amplitude_yaw_arm         1.037037    0.25991234   FALSE FALSE
## roll_dumbbell             1.022388   84.20650290   FALSE FALSE
## pitch_dumbbell            2.277372   81.74498012   FALSE FALSE
## yaw_dumbbell              1.132231   83.48282540   FALSE FALSE
## kurtosis_roll_dumbbell    1.000000    2.01814290   FALSE FALSE
## kurtosis_picth_dumbbell   1.000000    2.03343186   FALSE FALSE
## kurtosis_yaw_dumbbell     0.000000    0.00000000    TRUE  TRUE
## skewness_roll_dumbbell    1.000000    2.03343186   FALSE FALSE
## skewness_pitch_dumbbell   1.000000    2.03852818   FALSE FALSE
## skewness_yaw_dumbbell     0.000000    0.00000000    TRUE  TRUE
## max_roll_dumbbell         1.000000    1.72255631   FALSE FALSE
## max_picth_dumbbell        1.333333    1.72765263   FALSE FALSE
## max_yaw_dumbbell          1.052632    0.36183875   FALSE FALSE
## min_roll_dumbbell         1.000000    1.69197839   FALSE FALSE
## min_pitch_dumbbell        1.666667    1.81429008   FALSE FALSE
## min_yaw_dumbbell          1.052632    0.36183875   FALSE FALSE
## amplitude_roll_dumbbell   8.000000    1.97227602   FALSE FALSE
## amplitude_pitch_dumbbell  8.000000    1.95189073   FALSE FALSE
## amplitude_yaw_dumbbell    0.000000    0.00509632    TRUE  TRUE
## total_accel_dumbbell      1.072634    0.21914178   FALSE FALSE
## var_accel_dumbbell        6.000000    1.95698706   FALSE FALSE
## avg_roll_dumbbell         1.000000    2.02323922   FALSE FALSE
## stddev_roll_dumbbell     16.000000    1.99266130   FALSE FALSE
## var_roll_dumbbell        16.000000    1.99266130   FALSE FALSE
## avg_pitch_dumbbell        1.000000    2.02323922   FALSE FALSE
## stddev_pitch_dumbbell    16.000000    1.99266130   FALSE FALSE
## var_pitch_dumbbell       16.000000    1.99266130   FALSE FALSE
## avg_yaw_dumbbell          1.000000    2.02323922   FALSE FALSE
## stddev_yaw_dumbbell      16.000000    1.99266130   FALSE FALSE
## var_yaw_dumbbell         16.000000    1.99266130   FALSE FALSE
## gyros_dumbbell_x          1.003268    1.22821323   FALSE FALSE
## gyros_dumbbell_y          1.264957    1.41677709   FALSE FALSE
## gyros_dumbbell_z          1.060100    1.04984201   FALSE FALSE
## accel_dumbbell_x          1.018018    2.16593619   FALSE FALSE
## accel_dumbbell_y          1.053061    2.37488533   FALSE FALSE
## accel_dumbbell_z          1.133333    2.08949139   FALSE FALSE
## magnet_dumbbell_x         1.098266    5.74864948   FALSE FALSE
## magnet_dumbbell_y         1.197740    4.30129447   FALSE FALSE
## magnet_dumbbell_z         1.020833    3.44511263   FALSE FALSE
## roll_forearm             11.589286   11.08959331   FALSE FALSE
## pitch_forearm            65.983051   14.85577413   FALSE FALSE
## yaw_forearm              15.322835   10.14677403   FALSE FALSE
## kurtosis_roll_forearm     1.000000    1.63082255   FALSE FALSE
## kurtosis_picth_forearm    1.000000    1.63591887   FALSE FALSE
## kurtosis_yaw_forearm      0.000000    0.00000000    TRUE  TRUE
## skewness_roll_forearm     1.000000    1.63591887   FALSE FALSE
## skewness_pitch_forearm    2.000000    1.61553358   FALSE FALSE
## skewness_yaw_forearm      0.000000    0.00000000    TRUE  TRUE
## max_roll_forearm         27.666667    1.38110284   FALSE  TRUE
## max_picth_forearm         2.964286    0.78992967   FALSE FALSE
## max_yaw_forearm           1.032258    0.21914178   FALSE FALSE
## min_roll_forearm         27.666667    1.37091020   FALSE  TRUE
## min_pitch_forearm         2.862069    0.87147080   FALSE FALSE
## min_yaw_forearm           1.032258    0.21914178   FALSE FALSE
## amplitude_roll_forearm   20.750000    1.49322189   FALSE  TRUE
## amplitude_pitch_forearm   3.269231    0.93262664   FALSE FALSE
## amplitude_yaw_forearm     0.000000    0.00509632    TRUE  TRUE
## total_accel_forearm       1.128928    0.35674243   FALSE FALSE
## var_accel_forearm         3.500000    2.03343186   FALSE FALSE
## avg_roll_forearm         27.666667    1.64101519   FALSE  TRUE
## stddev_roll_forearm      87.000000    1.63082255   FALSE  TRUE
## var_roll_forearm         87.000000    1.63082255   FALSE  TRUE
## avg_pitch_forearm        83.000000    1.65120783   FALSE  TRUE
## stddev_pitch_forearm     41.500000    1.64611151   FALSE  TRUE
## var_pitch_forearm        83.000000    1.65120783   FALSE  TRUE
## avg_yaw_forearm          83.000000    1.65120783   FALSE  TRUE
## stddev_yaw_forearm       85.000000    1.64101519   FALSE  TRUE
## var_yaw_forearm          85.000000    1.64101519   FALSE  TRUE
## gyros_forearm_x           1.059273    1.51870350   FALSE FALSE
## gyros_forearm_y           1.036554    3.77637346   FALSE FALSE
## gyros_forearm_z           1.122917    1.56457038   FALSE FALSE
## accel_forearm_x           1.126437    4.04647844   FALSE FALSE
## accel_forearm_y           1.059406    5.11160942   FALSE FALSE
## accel_forearm_z           1.006250    2.95586586   FALSE FALSE
## magnet_forearm_x          1.012346    7.76679238   FALSE FALSE
## magnet_forearm_y          1.246914    9.54031189   FALSE FALSE
## magnet_forearm_z          1.000000    8.57710733   FALSE FALSE
## classe                    1.469581    0.02548160   FALSE FALSE
## only 118 columns left (117 predictors plus classe)
training<- training[,!zeroV$nzv]
training$classe <- as.factor(training$classe)

testing<- testing[,!zeroV$nzv]
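
To make the `freqRatio` and `percentUnique` columns in the table above easier to read, here is a base-R sketch of how the two metrics can be computed for a single column. It reflects my understanding of the definitions and of caret's documented default cutoffs (`freqCut = 95/5`, `uniqueCut = 10`); it is not caret's actual code, and `nzv_metrics` is a hypothetical helper name.

```r
# Sketch of the two nearZeroVar() metrics for a single column.
# freq_cut and unique_cut mirror caret's documented defaults (95/5 and 10).
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x[!is.na(x)]), decreasing = TRUE)
  n_unique <- length(counts)
  # Ratio of the most common value's count to the second most common one.
  freq_ratio <- if (n_unique <= 1) 0 else as.numeric(counts[1] / counts[2])
  percent_unique <- 100 * n_unique / length(x)
  zero_var <- n_unique <= 1
  data.frame(
    freqRatio     = freq_ratio,
    percentUnique = percent_unique,
    zeroVar       = zero_var,
    nzv           = zero_var ||
      (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

# Hypothetical column: 97 zeros and 3 other values out of 100 rows.
x <- c(rep(0, 97), 1, 2, 3)
nzv_metrics(x)
```

Under these rules a column is flagged only when one value dominates and there are few distinct values overall, which matches rows such as `avg_roll_arm` (flagged) and `roll_arm` (high `freqRatio` but many unique values, not flagged) in the table.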

## remove columns with NAs: most machine-learning algorithms cannot handle NAs, although imputation
## can help. For simplicity, I just remove every column that contains any NA.

# flag every column that contains at least one NA
NA_col<- unname(sapply(training, anyNA))

NA_col 
##   [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [23]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [56]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
##  [67]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
##  [78]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
## [100]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## only 53 columns left (52 predictors plus classe)
training<- training[,!NA_col]

testing<- testing[,!NA_col]
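
Before dropping columns, it can help to look at how much of each column is actually missing. The following sketch uses a hypothetical toy frame; it computes the fraction of NAs per column and then applies the same "no NA at all" rule used above.

```r
# Hypothetical toy frame: one complete column, one mostly-missing column.
toy <- data.frame(
  roll_belt     = c(1.41, 1.41, 1.42, 1.48),
  max_roll_belt = c(NA, NA, NA, 17.2),
  classe        = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Fraction of missing values per column.
na_fraction <- colMeans(is.na(toy))
na_fraction

# Keep only columns without any missing value (same rule as in the text).
complete_cols <- names(toy)[na_fraction == 0]
toy_complete <- toy[, complete_cols, drop = FALSE]
names(toy_complete)
```

Selecting by column name rather than by a logical vector also makes it easy to apply exactly the same selection to the testing set.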



### Cross validation and model building
# I am going to use K-fold cross validation.
# 1. First, break the training set into K subsets (in this case K = 10, a 10-fold cross validation)
# 2. For each subset, build the model on the other K-1 subsets and apply it to the held-out subset
# 3. Repeat this for all 10 subsets and average the results



fitControl<- trainControl( ## 10-fold CV
                           method="cv",
                           number = 10)
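
The three steps above can be sketched by hand in base R. This is only an illustration of the resampling scheme that `trainControl(method = "cv", number = 10)` requests: it uses the built-in `iris` data and a simple nearest-centroid classifier as stand-ins (both are assumptions made for the example, not the data or the random forest used in this report).

```r
set.seed(333)
k <- 10
data(iris)

# Step 1: randomly assign each row to one of k folds of equal size.
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Stand-in classifier (not the random forest used in the text):
# predict the class whose mean feature vector is closest to the observation.
nearest_centroid <- function(train, test) {
  centroids <- aggregate(. ~ Species, data = train, FUN = mean)
  features <- setdiff(names(train), "Species")
  dists <- sapply(seq_len(nrow(centroids)), function(i) {
    centre <- unlist(centroids[i, features])
    rowSums(sweep(as.matrix(test[, features]), 2, centre)^2)
  })
  as.character(centroids$Species)[max.col(-dists, ties.method = "first")]
}

# Step 2: hold out each fold in turn, fit on the rest, record accuracy.
fold_accuracy <- sapply(seq_len(k), function(f) {
  train <- iris[fold_id != f, ]
  test  <- iris[fold_id == f, ]
  mean(nearest_centroid(train, test) == as.character(test$Species))
})

# Step 3: average over the folds to get the cross-validated estimate.
mean(fold_accuracy)
```

caret does the same bookkeeping internally and additionally repeats it for every candidate value of the tuning parameter.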


# enable multi-core processing
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)


## fit a model using random forest; it takes 20 minutes using 4 CPUs.

rfFit1<- train(classe ~ ., data=training, method="rf", trControl=fitControl, verbose = FALSE)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
rfFit1
## Random Forest 
## 
## 19622 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 17659, 17661, 17659, 17660, 17660, 17659, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
##    2    0.9951585  0.9938756  0.001984239  0.00251024
##   27    0.9948529  0.9934890  0.001280618  0.00162018
##   52    0.9901133  0.9874932  0.001023639  0.00129476
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
stopCluster(cl)
# stopCluster() is necessary to terminate the extra worker processes
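
If the model fit fails part-way through, `stopCluster(cl)` is never reached and the worker processes stay alive. One possible safeguard, sketched below with the base `parallel` package, is to create the cluster inside a function and register the cleanup with `on.exit()`. The helper name `with_cluster` and the toy workload are hypothetical.

```r
library(parallel)

# Run `work` on a temporary cluster and always shut the workers down,
# even if `work` throws an error.
with_cluster <- function(n_workers, work) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)
  work(cl)
}

# Stand-in workload (an assumption): square a few numbers on 2 workers.
squares <- with_cluster(2, function(cl) {
  unlist(parLapply(cl, 1:4, function(i) i^2))
})
squares
```

In this report the `work` function would call `registerDoParallel(cl)` and then `train(...)`.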


# estimate variable importance
importance <- varImp(rfFit1, scale=FALSE)
# summarize importance
print(importance)
## rf variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                      Overall
## roll_belt              718.6
## yaw_belt               645.3
## magnet_dumbbell_z      546.4
## pitch_belt             520.1
## magnet_dumbbell_y      513.8
## pitch_forearm          489.4
## magnet_dumbbell_x      445.9
## roll_forearm           443.1
## accel_dumbbell_y       392.4
## accel_belt_z           380.8
## magnet_belt_y          378.1
## roll_dumbbell          373.3
## magnet_belt_z          370.2
## accel_dumbbell_z       352.8
## roll_arm               339.3
## accel_forearm_x        330.8
## gyros_belt_z           299.4
## accel_dumbbell_x       296.7
## total_accel_dumbbell   295.8
## yaw_dumbbell           293.2
# plot importance
plot(importance)
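
The importance table can also be used programmatically, for example to pick the strongest predictors by name. The sketch below rebuilds the first five rows of the printed output as a plain data frame (on the fitted object the corresponding table should be available as `importance$importance`, with an `Overall` column as shown in the printout) and extracts the top three names.

```r
# Importance scores copied from the printed output above (top five only).
imp <- data.frame(
  Overall = c(718.6, 645.3, 546.4, 520.1, 513.8),
  row.names = c("roll_belt", "yaw_belt", "magnet_dumbbell_z",
                "pitch_belt", "magnet_dumbbell_y")
)

# Share of the listed importance carried by each variable.
imp$share <- imp$Overall / sum(imp$Overall)

# Names of the top three predictors, ordered by importance.
top3 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:3]
top3
```

The shares here are relative to the five listed variables only, not to all 52 predictors.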

# predictions go first (data), true labels second (reference)
confusionMatrix(predict(rfFit1, training), training$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 5580    0    0    0    0
##          B    0 3797    0    0    0
##          C    0    0 3422    0    0
##          D    0    0    0 3216    0
##          E    0    0    0    0 3607
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9998, 1)
##     No Information Rate : 0.2844     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2844   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000
table(prediction=predict(rfFit1, training), training$classe)
##           
## prediction    A    B    C    D    E
##          A 5580    0    0    0    0
##          B    0 3797    0    0    0
##          C    0    0 3422    0    0
##          D    0    0    0 3216    0
##          E    0    0    0    0 3607
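
The headline statistics in the `confusionMatrix()` output can be recomputed directly from the table. The sketch below rebuilds the table from the counts printed above and derives the overall accuracy, the per-class sensitivity and the no-information rate; the variable names are illustrative.

```r
# Confusion table copied from the output above
# (rows: prediction, columns: true class).
classes <- c("A", "B", "C", "D", "E")
conf_tab <- matrix(0, nrow = 5, ncol = 5,
                   dimnames = list(prediction = classes, truth = classes))
diag(conf_tab) <- c(5580, 3797, 3422, 3216, 3607)

# Overall accuracy: correctly classified rows divided by all rows.
accuracy <- sum(diag(conf_tab)) / sum(conf_tab)

# Per-class sensitivity: correct predictions divided by true class size.
sensitivity <- diag(conf_tab) / colSums(conf_tab)

# No-information rate: share of the most frequent true class.
nir <- max(colSums(conf_tab)) / sum(conf_tab)

accuracy
round(nir, 4)
```

Because the table is computed on the same rows the model was trained on, these numbers describe the in-sample fit only.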

Prediction on the testing data set

In-sample error is the error obtained by applying the prediction algorithm to the data set it was built with; it is also known as resubstitution error.

Out-of-sample error is the error obtained by applying the prediction algorithm to a new data set; it is also known as generalization error.

The random forest model classifies every row of the training set correctly, so the in-sample error is 0. I expect in-sample error < out-of-sample error.
The reason is over-fitting: the model is adapted to the data set it was trained on. The 10-fold cross-validated accuracy of 0.9952 (mtry = 2) is the better guide, and it suggests an out-of-sample error of about 0.5%.
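
As a rough worked example, the cross-validated accuracy reported by caret can be turned into an estimate of the out-of-sample error and of what to expect on the 20 test cases. The calculation assumes that errors on the test cases are independent and occur at the cross-validated rate, which is a simplification.

```r
# Cross-validated accuracy for mtry = 2, copied from the caret output above.
cv_accuracy <- 0.9951585

# Estimated out-of-sample error rate.
oos_error <- 1 - cv_accuracy

# Number of cases in the testing set.
n_test <- 20

# Expected number of misclassified test cases, and the chance of getting all
# of them right, assuming independent errors at the cross-validated rate.
expected_errors <- n_test * oos_error
p_all_correct <- cv_accuracy^n_test

oos_error
expected_errors
p_all_correct
```

Under these assumptions the model would be expected to misclassify well under one of the 20 test cases on average.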

predict(rfFit1, newdata = testing)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E