In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. Six young, healthy participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
The goal of the project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. The training and test data are downloaded from the URLs below.
trainingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(trainingUrl, destfile = "pml-training.csv", method = "curl")
download.file(testingUrl, destfile = "pml-testing.csv", method = "curl")
training <- read.csv("pml-training.csv", stringsAsFactors = FALSE)
testing <- read.csv("pml-testing.csv", stringsAsFactors = FALSE)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(333)
# let's look at the data first
str(training)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : chr "carlitos" "carlitos" "carlitos" "carlitos" ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : chr "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
## $ new_window : chr "no" "no" "no" "no" ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : chr "" "" "" "" ...
## $ kurtosis_picth_belt : chr "" "" "" "" ...
## $ kurtosis_yaw_belt : chr "" "" "" "" ...
## $ skewness_roll_belt : chr "" "" "" "" ...
## $ skewness_roll_belt.1 : chr "" "" "" "" ...
## $ skewness_yaw_belt : chr "" "" "" "" ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : chr "" "" "" "" ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : chr "" "" "" "" ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : chr "" "" "" "" ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : chr "" "" "" "" ...
## $ kurtosis_picth_arm : chr "" "" "" "" ...
## $ kurtosis_yaw_arm : chr "" "" "" "" ...
## $ skewness_roll_arm : chr "" "" "" "" ...
## $ skewness_pitch_arm : chr "" "" "" "" ...
## $ skewness_yaw_arm : chr "" "" "" "" ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : chr "" "" "" "" ...
## $ kurtosis_picth_dumbbell : chr "" "" "" "" ...
## $ kurtosis_yaw_dumbbell : chr "" "" "" "" ...
## $ skewness_roll_dumbbell : chr "" "" "" "" ...
## $ skewness_pitch_dumbbell : chr "" "" "" "" ...
## $ skewness_yaw_dumbbell : chr "" "" "" "" ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : chr "" "" "" "" ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : chr "" "" "" "" ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
The first column is the entry number and the second column is the user_name. Let’s remove them, along with the other non-sensor columns (timestamps and window markers).
See the discussion on the forum.
Including the non-sensor columns would give artificially high accuracy for the model, because they are highly correlated with the classe outcome.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
training<- select(training, c(-X,-user_name,-raw_timestamp_part_1, -raw_timestamp_part_2, -cvtd_timestamp, -new_window, -num_window))
testing<- select(testing, c(-X,-user_name,-raw_timestamp_part_1, -raw_timestamp_part_2, -cvtd_timestamp, -new_window, -num_window))
## turn all the character columns into numeric.
char_col_index<- sapply(training, class) == "character"
char_col<- names(training)[char_col_index]
char_col
## [1] "kurtosis_roll_belt" "kurtosis_picth_belt"
## [3] "kurtosis_yaw_belt" "skewness_roll_belt"
## [5] "skewness_roll_belt.1" "skewness_yaw_belt"
## [7] "max_yaw_belt" "min_yaw_belt"
## [9] "amplitude_yaw_belt" "kurtosis_roll_arm"
## [11] "kurtosis_picth_arm" "kurtosis_yaw_arm"
## [13] "skewness_roll_arm" "skewness_pitch_arm"
## [15] "skewness_yaw_arm" "kurtosis_roll_dumbbell"
## [17] "kurtosis_picth_dumbbell" "kurtosis_yaw_dumbbell"
## [19] "skewness_roll_dumbbell" "skewness_pitch_dumbbell"
## [21] "skewness_yaw_dumbbell" "max_yaw_dumbbell"
## [23] "min_yaw_dumbbell" "amplitude_yaw_dumbbell"
## [25] "kurtosis_roll_forearm" "kurtosis_picth_forearm"
## [27] "kurtosis_yaw_forearm" "skewness_roll_forearm"
## [29] "skewness_pitch_forearm" "skewness_yaw_forearm"
## [31] "max_yaw_forearm" "min_yaw_forearm"
## [33] "amplitude_yaw_forearm" "classe"
training<- training %>% mutate_each_(funs(as.numeric), char_col[-34])
## Warning in mutate_impl(.data, dots): NAs introduced by coercion
## (this warning is emitted 33 times, once per converted column)
testing<- testing %>% mutate_each_(funs(as.numeric), char_col[-34])
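The conversion above picks the columns by position (`char_col[-34]`) and relies on `mutate_each_()` and `funs()`, which, as far as I know, were deprecated in later dplyr releases. The sketch below shows two base-R alternatives on a toy CSV. The toy values are hypothetical, and the assumption that the non-numeric tokens are empty strings and `#DIV/0!` markers is mine; it is not visible in the output above.

```r
# Toy CSV standing in for pml-training.csv (hypothetical values).
csv_text <- 'roll_belt,kurtosis_roll_belt,classe
1.41,,A
1.42,#DIV/0!,A
1.48,5.587755,B'

# Option 1: declare the non-numeric tokens as missing at read time,
# so the column is parsed as numeric straight away.
# ASSUMPTION: "" and "#DIV/0!" are the only non-numeric tokens.
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                na.strings = c("NA", "", "#DIV/0!"))

# Option 2: read as-is, then convert character columns selected by name,
# which avoids hard-coding the position of the outcome column.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
is_char <- vapply(raw, is.character, logical(1))
to_convert <- setdiff(names(raw)[is_char], "classe")  # keep the outcome as text
raw[to_convert] <- lapply(raw[to_convert],
                          function(x) suppressWarnings(as.numeric(x)))

str(toy)
str(raw)
```

Both options turn unparseable entries into `NA`. The second one suppresses the coercion warnings, so it should only be used once the offending tokens have been inspected.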
# some variables have no variability at all
# these variables are not useful when we want to construct a prediction model:
# when a predictor has nzv = TRUE, exclude it from the model
zeroV<- nearZeroVar(training,saveMetrics=TRUE)
zeroV
## freqRatio percentUnique zeroVar nzv
## roll_belt 1.101904 6.77810621 FALSE FALSE
## pitch_belt 1.036082 9.37722964 FALSE FALSE
## yaw_belt 1.058480 9.97349913 FALSE FALSE
## total_accel_belt 1.063160 0.14779329 FALSE FALSE
## kurtosis_roll_belt 2.000000 2.01304658 FALSE FALSE
## kurtosis_picth_belt 1.333333 1.60534094 FALSE FALSE
## kurtosis_yaw_belt 0.000000 0.00000000 TRUE TRUE
## skewness_roll_belt 2.000000 2.00285394 FALSE FALSE
## skewness_roll_belt.1 1.333333 1.71236367 FALSE FALSE
## skewness_yaw_belt 0.000000 0.00000000 TRUE TRUE
## max_roll_belt 1.000000 0.99378249 FALSE FALSE
## max_picth_belt 1.538462 0.11211905 FALSE FALSE
## max_yaw_belt 1.034483 0.33635715 FALSE FALSE
## min_roll_belt 1.000000 0.93772296 FALSE FALSE
## min_pitch_belt 2.192308 0.08154113 FALSE FALSE
## min_yaw_belt 1.034483 0.33635715 FALSE FALSE
## amplitude_roll_belt 1.290323 0.75425543 FALSE FALSE
## amplitude_pitch_belt 3.042254 0.06625217 FALSE FALSE
## amplitude_yaw_belt 0.000000 0.00509632 TRUE TRUE
## var_total_accel_belt 1.426829 0.33126083 FALSE FALSE
## avg_roll_belt 1.066667 0.97339721 FALSE FALSE
## stddev_roll_belt 1.039216 0.35164611 FALSE FALSE
## var_roll_belt 1.615385 0.48924676 FALSE FALSE
## avg_pitch_belt 1.375000 1.09061258 FALSE FALSE
## stddev_pitch_belt 1.161290 0.21914178 FALSE FALSE
## var_pitch_belt 1.307692 0.32106819 FALSE FALSE
## avg_yaw_belt 1.200000 1.22311691 FALSE FALSE
## stddev_yaw_belt 1.693878 0.29558659 FALSE FALSE
## var_yaw_belt 1.500000 0.73896647 FALSE FALSE
## gyros_belt_x 1.058651 0.71348486 FALSE FALSE
## gyros_belt_y 1.144000 0.35164611 FALSE FALSE
## gyros_belt_z 1.066214 0.86127816 FALSE FALSE
## accel_belt_x 1.055412 0.83579655 FALSE FALSE
## accel_belt_y 1.113725 0.72877383 FALSE FALSE
## accel_belt_z 1.078767 1.52379982 FALSE FALSE
## magnet_belt_x 1.090141 1.66649679 FALSE FALSE
## magnet_belt_y 1.099688 1.51870350 FALSE FALSE
## magnet_belt_z 1.006369 2.32901845 FALSE FALSE
## roll_arm 52.338462 13.52563449 FALSE FALSE
## pitch_arm 87.256410 15.73234125 FALSE FALSE
## yaw_arm 33.029126 14.65701763 FALSE FALSE
## total_accel_arm 1.024526 0.33635715 FALSE FALSE
## var_accel_arm 5.500000 2.01304658 FALSE FALSE
## avg_roll_arm 77.000000 1.68178575 FALSE TRUE
## stddev_roll_arm 77.000000 1.68178575 FALSE TRUE
## var_roll_arm 77.000000 1.68178575 FALSE TRUE
## avg_pitch_arm 77.000000 1.68178575 FALSE TRUE
## stddev_pitch_arm 77.000000 1.68178575 FALSE TRUE
## var_pitch_arm 77.000000 1.68178575 FALSE TRUE
## avg_yaw_arm 77.000000 1.68178575 FALSE TRUE
## stddev_yaw_arm 80.000000 1.66649679 FALSE TRUE
## var_yaw_arm 80.000000 1.66649679 FALSE TRUE
## gyros_arm_x 1.015504 3.27693405 FALSE FALSE
## gyros_arm_y 1.454369 1.91621649 FALSE FALSE
## gyros_arm_z 1.110687 1.26388747 FALSE FALSE
## accel_arm_x 1.017341 3.95984099 FALSE FALSE
## accel_arm_y 1.140187 2.73672409 FALSE FALSE
## accel_arm_z 1.128000 4.03628580 FALSE FALSE
## magnet_arm_x 1.000000 6.82397309 FALSE FALSE
## magnet_arm_y 1.056818 4.44399144 FALSE FALSE
## magnet_arm_z 1.036364 6.44684538 FALSE FALSE
## kurtosis_roll_arm 1.000000 1.67159311 FALSE FALSE
## kurtosis_picth_arm 1.000000 1.66140047 FALSE FALSE
## kurtosis_yaw_arm 1.000000 2.00285394 FALSE FALSE
## skewness_roll_arm 1.000000 1.67668943 FALSE FALSE
## skewness_pitch_arm 1.000000 1.66140047 FALSE FALSE
## skewness_yaw_arm 1.000000 2.00285394 FALSE FALSE
## max_roll_arm 25.666667 1.47793293 FALSE TRUE
## max_picth_arm 12.833333 1.34033228 FALSE FALSE
## max_yaw_arm 1.227273 0.25991234 FALSE FALSE
## min_roll_arm 19.250000 1.41677709 FALSE TRUE
## min_pitch_arm 19.250000 1.47793293 FALSE TRUE
## min_yaw_arm 1.000000 0.19366018 FALSE FALSE
## amplitude_roll_arm 25.666667 1.55947406 FALSE TRUE
## amplitude_pitch_arm 20.000000 1.49831821 FALSE TRUE
## amplitude_yaw_arm 1.037037 0.25991234 FALSE FALSE
## roll_dumbbell 1.022388 84.20650290 FALSE FALSE
## pitch_dumbbell 2.277372 81.74498012 FALSE FALSE
## yaw_dumbbell 1.132231 83.48282540 FALSE FALSE
## kurtosis_roll_dumbbell 1.000000 2.01814290 FALSE FALSE
## kurtosis_picth_dumbbell 1.000000 2.03343186 FALSE FALSE
## kurtosis_yaw_dumbbell 0.000000 0.00000000 TRUE TRUE
## skewness_roll_dumbbell 1.000000 2.03343186 FALSE FALSE
## skewness_pitch_dumbbell 1.000000 2.03852818 FALSE FALSE
## skewness_yaw_dumbbell 0.000000 0.00000000 TRUE TRUE
## max_roll_dumbbell 1.000000 1.72255631 FALSE FALSE
## max_picth_dumbbell 1.333333 1.72765263 FALSE FALSE
## max_yaw_dumbbell 1.052632 0.36183875 FALSE FALSE
## min_roll_dumbbell 1.000000 1.69197839 FALSE FALSE
## min_pitch_dumbbell 1.666667 1.81429008 FALSE FALSE
## min_yaw_dumbbell 1.052632 0.36183875 FALSE FALSE
## amplitude_roll_dumbbell 8.000000 1.97227602 FALSE FALSE
## amplitude_pitch_dumbbell 8.000000 1.95189073 FALSE FALSE
## amplitude_yaw_dumbbell 0.000000 0.00509632 TRUE TRUE
## total_accel_dumbbell 1.072634 0.21914178 FALSE FALSE
## var_accel_dumbbell 6.000000 1.95698706 FALSE FALSE
## avg_roll_dumbbell 1.000000 2.02323922 FALSE FALSE
## stddev_roll_dumbbell 16.000000 1.99266130 FALSE FALSE
## var_roll_dumbbell 16.000000 1.99266130 FALSE FALSE
## avg_pitch_dumbbell 1.000000 2.02323922 FALSE FALSE
## stddev_pitch_dumbbell 16.000000 1.99266130 FALSE FALSE
## var_pitch_dumbbell 16.000000 1.99266130 FALSE FALSE
## avg_yaw_dumbbell 1.000000 2.02323922 FALSE FALSE
## stddev_yaw_dumbbell 16.000000 1.99266130 FALSE FALSE
## var_yaw_dumbbell 16.000000 1.99266130 FALSE FALSE
## gyros_dumbbell_x 1.003268 1.22821323 FALSE FALSE
## gyros_dumbbell_y 1.264957 1.41677709 FALSE FALSE
## gyros_dumbbell_z 1.060100 1.04984201 FALSE FALSE
## accel_dumbbell_x 1.018018 2.16593619 FALSE FALSE
## accel_dumbbell_y 1.053061 2.37488533 FALSE FALSE
## accel_dumbbell_z 1.133333 2.08949139 FALSE FALSE
## magnet_dumbbell_x 1.098266 5.74864948 FALSE FALSE
## magnet_dumbbell_y 1.197740 4.30129447 FALSE FALSE
## magnet_dumbbell_z 1.020833 3.44511263 FALSE FALSE
## roll_forearm 11.589286 11.08959331 FALSE FALSE
## pitch_forearm 65.983051 14.85577413 FALSE FALSE
## yaw_forearm 15.322835 10.14677403 FALSE FALSE
## kurtosis_roll_forearm 1.000000 1.63082255 FALSE FALSE
## kurtosis_picth_forearm 1.000000 1.63591887 FALSE FALSE
## kurtosis_yaw_forearm 0.000000 0.00000000 TRUE TRUE
## skewness_roll_forearm 1.000000 1.63591887 FALSE FALSE
## skewness_pitch_forearm 2.000000 1.61553358 FALSE FALSE
## skewness_yaw_forearm 0.000000 0.00000000 TRUE TRUE
## max_roll_forearm 27.666667 1.38110284 FALSE TRUE
## max_picth_forearm 2.964286 0.78992967 FALSE FALSE
## max_yaw_forearm 1.032258 0.21914178 FALSE FALSE
## min_roll_forearm 27.666667 1.37091020 FALSE TRUE
## min_pitch_forearm 2.862069 0.87147080 FALSE FALSE
## min_yaw_forearm 1.032258 0.21914178 FALSE FALSE
## amplitude_roll_forearm 20.750000 1.49322189 FALSE TRUE
## amplitude_pitch_forearm 3.269231 0.93262664 FALSE FALSE
## amplitude_yaw_forearm 0.000000 0.00509632 TRUE TRUE
## total_accel_forearm 1.128928 0.35674243 FALSE FALSE
## var_accel_forearm 3.500000 2.03343186 FALSE FALSE
## avg_roll_forearm 27.666667 1.64101519 FALSE TRUE
## stddev_roll_forearm 87.000000 1.63082255 FALSE TRUE
## var_roll_forearm 87.000000 1.63082255 FALSE TRUE
## avg_pitch_forearm 83.000000 1.65120783 FALSE TRUE
## stddev_pitch_forearm 41.500000 1.64611151 FALSE TRUE
## var_pitch_forearm 83.000000 1.65120783 FALSE TRUE
## avg_yaw_forearm 83.000000 1.65120783 FALSE TRUE
## stddev_yaw_forearm 85.000000 1.64101519 FALSE TRUE
## var_yaw_forearm 85.000000 1.64101519 FALSE TRUE
## gyros_forearm_x 1.059273 1.51870350 FALSE FALSE
## gyros_forearm_y 1.036554 3.77637346 FALSE FALSE
## gyros_forearm_z 1.122917 1.56457038 FALSE FALSE
## accel_forearm_x 1.126437 4.04647844 FALSE FALSE
## accel_forearm_y 1.059406 5.11160942 FALSE FALSE
## accel_forearm_z 1.006250 2.95586586 FALSE FALSE
## magnet_forearm_x 1.012346 7.76679238 FALSE FALSE
## magnet_forearm_y 1.246914 9.54031189 FALSE FALSE
## magnet_forearm_z 1.000000 8.57710733 FALSE FALSE
## classe 1.469581 0.02548160 FALSE FALSE
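To make the columns of this table concrete, here is a minimal base-R sketch of how the four metrics can be computed for a single vector. It follows my reading of caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`) and is not caret's actual implementation; the function name `nzv_metrics` and the toy vectors are hypothetical.

```r
# Hand-rolled version of the near-zero-variance metrics for one vector.
# ASSUMPTION: cutoffs mirror caret's documented defaults.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)  # table() ignores NA
  n_unique <- length(counts)
  # ratio of the most common value's count to the second most common one
  freq_ratio <- if (n_unique >= 2) counts[1] / counts[2] else 0
  percent_unique <- 100 * n_unique / length(x)
  zero_var <- n_unique <= 1
  data.frame(
    freqRatio = freq_ratio,
    percentUnique = percent_unique,
    zeroVar = zero_var,
    nzv = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

mostly_constant <- c(rep(0, 97), 1, 2, 3)  # one dominant value
varied <- 1:100                            # every value distinct
all_missing <- rep(NA_real_, 10)           # like the all-NA kurtosis_yaw_* columns

rbind(
  mostly_constant = nzv_metrics(mostly_constant),
  varied = nzv_metrics(varied),
  all_missing = nzv_metrics(all_missing)
)
```

Read against the table above, this explains why a column such as min_roll_arm (freqRatio 19.25) is flagged while max_picth_arm (freqRatio 12.83) is not: only the former exceeds the 95/5 = 19 cutoff.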
## drop the nzv columns: 118 columns are left (117 predictors plus classe)
training<- training[,!zeroV$nzv]
training$classe <- as.factor(training$classe)
testing<- testing[,!zeroV$nzv]
## remove columns with NAs; most machine-learning algorithms cannot deal with NAs, although imputation
## can help. For simplicity, I just remove columns containing any NAs.
## NA_col[i] is TRUE when column i contains at least one NA
NA_col <- unname(vapply(training, anyNA, logical(1)))
NA_col
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [56] TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## [67] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [78] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [100] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## drop the columns with NAs: 53 columns are left (52 predictors plus classe)
training<- training[,!NA_col]
testing<- testing[,!NA_col]
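Dropping every column that contains any NA is the strictest possible rule. The sketch below, on a hypothetical toy data frame, shows how the share of missing values per column can be inspected first, and how a softer threshold could be applied instead; the name `max_na_fraction` and its value of 0.5 are illustrative assumptions, not choices made in this analysis.

```r
# Toy data frame with one mostly-missing column (hypothetical values).
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.48, 1.45),
  max_roll_belt = c(NA, NA, NA, 5.2),
  accel_belt_x  = c(-21, -22, -20, -22)
)

# Fraction of missing values in each column.
na_fraction <- colMeans(is.na(toy))

# Rule used in this report: keep only columns with no NA at all.
keep_strict <- na_fraction == 0

# Softer alternative: keep columns whose NA share is below a threshold.
max_na_fraction <- 0.5  # hypothetical cutoff
keep_soft <- na_fraction <= max_na_fraction

toy_clean <- toy[, keep_strict, drop = FALSE]
names(toy_clean)
```

On this toy input both rules drop max_roll_belt, which is 75% missing. Looking at `na_fraction` before deleting makes it easier to tell whether imputation would be worth the effort.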
### Cross validation and model building
# I am going to use K-fold cross validation.
# 1. First, break the training set into K subsets (in this case, 10-fold cross validation)
# 2. for each subset, build the model on the remaining K-1 subsets and apply it to the held-out subset
# 3. repeat this for all 10 subsets and average the results
fitControl<- trainControl( ## 10-fold CV
method="cv",
number = 10)
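The three steps listed above can be written out by hand. The sketch below is only an illustration of what `method = "cv"` does conceptually: it uses the built-in `iris` data and a simple nearest-centroid classifier instead of the accelerometer data and random forest, so that it runs in seconds with base R only. The helper `predict_centroid` is hypothetical.

```r
set.seed(333)
k <- 10

# Step 1: assign every row to one of k folds of equal size.
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))
features <- setdiff(names(iris), "Species")

# Stand-in model (NOT random forest): predict the class whose mean
# feature vector is closest to the observation.
predict_centroid <- function(train, test) {
  centroids <- aggregate(train[features],
                         by = list(Species = train$Species), FUN = mean)
  centre_mat <- t(as.matrix(centroids[features]))  # features x classes
  nearest <- apply(as.matrix(test[features]), 1, function(obs) {
    which.min(colSums((centre_mat - obs)^2))
  })
  as.character(centroids$Species[nearest])
}

# Step 2: for each fold, fit on the other k - 1 folds, score on the held-out fold.
fold_accuracy <- vapply(seq_len(k), function(i) {
  held_out <- fold_id == i
  pred <- predict_centroid(iris[!held_out, ], iris[held_out, ])
  mean(pred == as.character(iris$Species[held_out]))
}, numeric(1))

# Step 3: average the per-fold results.
cv_accuracy <- mean(fold_accuracy)
cv_accuracy
```

Every observation is used for testing exactly once and never by a model that saw it during fitting, which is why the averaged accuracy is a fairer estimate than accuracy on the training data.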
# enable multi-core processing
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
## fit a model using random forest; it takes about 20 minutes using 4 CPUs.
rfFit1<- train(classe ~ ., data=training, method="rf", trControl=fitControl, verbose = FALSE)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
rfFit1
## Random Forest
##
## 19622 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 17659, 17661, 17659, 17660, 17660, 17659, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9951585 0.9938756 0.001984239 0.00251024
## 27 0.9948529 0.9934890 0.001280618 0.00162018
## 52 0.9901133 0.9874932 0.001023639 0.00129476
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
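The three mtry values in the table (2, 27, 52) look like an evenly spaced grid between 2 and the number of predictors. The arithmetic below reproduces them under that assumption (it is my reading of the output, not something the report states) and compares them with `floor(sqrt(p))`, the usual randomForest default for classification.

```r
p <- 52  # number of predictors reported in the model summary

# ASSUMPTION: three evenly spaced candidates between 2 and p.
mtry_grid <- floor(seq(2, p, length.out = 3))
mtry_grid
# 2 27 52

# randomForest's usual default for classification problems.
rf_default_mtry <- floor(sqrt(p))
rf_default_mtry
# 7
```

Since accuracy at mtry = 2 and mtry = 27 differs only in the fourth decimal place, values in between (such as 7) would probably perform similarly; they could be tried by passing something like `tuneGrid = data.frame(mtry = c(2, 7, 14))` to `train()`.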
stopCluster(cl)
# The stopCluster is necessary to terminate the extra processes
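If `train()` fails midway, the `stopCluster(cl)` call above is never reached and the worker processes stay alive. One way to guard against that is to wrap the cluster in a function and register the cleanup with `on.exit()`. This is a minimal sketch using only the `parallel` package; the wrapper `with_cluster` is hypothetical, and a trivial `parLapply()` call stands in for model fitting.

```r
library(parallel)

# Run `task(cl)` on a fresh cluster and always shut the cluster down,
# whether the task finishes normally or raises an error.
with_cluster <- function(n_workers, task) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)
  task(cl)
}

# Stand-in workload: square four numbers on two workers.
squares <- with_cluster(2, function(cl) {
  unlist(parLapply(cl, 1:4, function(i) i^2))
})
squares
```

In this analysis, the body of the task would call `registerDoParallel(cl)` and then `train(...)`. Note that `detectCores()` can return `NA` on some platforms, so a fixed worker count may be the safer choice.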
# estimate variable importance
importance <- varImp(rfFit1, scale=FALSE)
# summarize importance
print(importance)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 718.6
## yaw_belt 645.3
## magnet_dumbbell_z 546.4
## pitch_belt 520.1
## magnet_dumbbell_y 513.8
## pitch_forearm 489.4
## magnet_dumbbell_x 445.9
## roll_forearm 443.1
## accel_dumbbell_y 392.4
## accel_belt_z 380.8
## magnet_belt_y 378.1
## roll_dumbbell 373.3
## magnet_belt_z 370.2
## accel_dumbbell_z 352.8
## roll_arm 339.3
## accel_forearm_x 330.8
## gyros_belt_z 299.4
## accel_dumbbell_x 296.7
## total_accel_dumbbell 295.8
## yaw_dumbbell 293.2
# plot importance
plot(importance)
# predictions go first (data), true classes second (reference)
confusionMatrix(predict(rfFit1, training), training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 5580 0 0 0 0
## B 0 3797 0 0 0
## C 0 0 3422 0 0
## D 0 0 0 3216 0
## E 0 0 0 0 3607
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9998, 1)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
table(prediction=predict(rfFit1, training), training$classe)
##
## prediction A B C D E
## A 5580 0 0 0 0
## B 0 3797 0 0 0
## C 0 0 3422 0 0
## D 0 0 0 3216 0
## E 0 0 0 0 3607
In-sample error is the error from applying the prediction algorithm to the dataset it was built with, also known as resubstitution error.
Out-of-sample error is the error from applying the prediction algorithm to a new data set, also known as generalization error.
The random forest model classifies the training data perfectly, so the in-sample error is 0. I expect in-sample error < out-of-sample error; the 10-fold cross-validated accuracy of about 0.995 (mtry = 2) is a more realistic estimate of out-of-sample accuracy.
The reason is over-fitting: the model is too adapted/optimized to the initial dataset.
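The figures reported in the outputs above can be turned into explicit error estimates. The sketch below re-enters the confusion table and the cross-validated accuracy for mtry = 2 by hand and does the arithmetic; nothing is refitted.

```r
classes <- c("A", "B", "C", "D", "E")

# Training-set confusion table as printed above (all counts on the diagonal).
conf_mat <- diag(c(5580, 3797, 3422, 3216, 3607))
dimnames(conf_mat) <- list(prediction = classes, reference = classes)

# In-sample (resubstitution) accuracy and error.
in_sample_accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
in_sample_error <- 1 - in_sample_accuracy

# Cross-validated accuracy from the resampling table (mtry = 2).
cv_accuracy <- 0.9951585
cv_error <- 1 - cv_accuracy

# Expected number of misclassified cases among the 20 test cases.
expected_misses <- 20 * cv_error

c(in_sample_error = in_sample_error,
  cv_error = cv_error,
  expected_misses = expected_misses)
```

The cross-validated error of roughly 0.5% is the number to quote as the expected out-of-sample error. Because the same resampling results were used to pick mtry, it is probably still slightly optimistic.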
predict(rfFit1, newdata = testing)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E