From the course website:
“Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).”
Use the RCurl package to download the data and load it as training and testing_final.
library(RCurl)
## Loading required package: bitops
URL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
X <- getURL(URL, ssl.verifypeer = FALSE)
training <- read.csv(textConnection(X))
URL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
Y <- getURL(URL, ssl.verifypeer = FALSE)
testing_final <- read.csv(textConnection(Y))
rm(list=c("URL","X","Y")) # clean up the workspace
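The summary further down shows many columns read in as factors with levels such as "#DIV/0!" and the empty string. One way to avoid that, sketched here on a small inline CSV rather than the real download (the rows below are invented for illustration), is to tell read.csv to treat those strings as missing through its na.strings argument:

```r
# Toy CSV standing in for the downloaded text; the values are invented.
csv_text <- 'X,roll_belt,kurtosis_roll_belt,classe
1,1.41,,A
2,1.42,#DIV/0!,A
3,1.48,-1.908453,B'

# read.csv does not treat "#" as a comment character, so "#DIV/0!" arrives as data.
# Treat "NA", empty fields and the spreadsheet error string as missing values.
toy <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

# kurtosis_roll_belt is now numeric with NAs instead of a factor of strings.
str(toy)
```

With the real files, the same na.strings argument could be passed to the read.csv calls above; whether that changes the later column selection has not been checked here.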
Now we’re going to further divide the training data into two sets so that we can validate the model on rows it has not seen. We will call these training and testing. We’ll use the createDataPartition function in the caret package to put 80% of the rows in training and hold out 20% for validation, and we’ll stratify the split on our response variable, classe, so that each class keeps roughly the same proportion in both sets.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
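inTrain <- createDataPartition(y=training$classe, p=0.8, list=FALSE)
# take the hold-out rows first: once training is overwritten, training[-inTrain,] would only return rows that are still in the training set
testing <- training[-inTrain,]
training <- training[inTrain,]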
rm(inTrain) # clean up your workspace
Now you can see the sizes of our training and testing data:
dim(training)
## [1] 15699 160
dim(testing)
## [1] 3141 160
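A quick sanity check on any train/hold-out split is that the two sets share no rows and together cover the original data. The following is a minimal sketch in base R, using a hypothetical toy data frame and a hand-rolled stratified sample as a stand-in for createDataPartition:

```r
set.seed(1)

# Hypothetical toy data: 100 rows, two classes.
toy <- data.frame(id = 1:100,
                  classe = factor(rep(c("A", "B"), times = c(60, 40))))

# Hand-rolled stratified sample: pick 80% of the row indices within each class.
idx_by_class <- split(seq_len(nrow(toy)), toy$classe)
in_train <- sort(unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.8 * length(i)))]
}), use.names = FALSE))

# Index both subsets from the ORIGINAL data frame.
toy_train <- toy[in_train, ]
toy_test  <- toy[-in_train, ]

# The two sets should not overlap and should cover every row.
length(intersect(toy_train$id, toy_test$id))      # 0 when the split is clean
nrow(toy_train) + nrow(toy_test) == nrow(toy)     # TRUE when nothing is lost
```

The same two checks can be run on the real training and testing frames, for example by comparing their row names.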
Before we fit a model let’s take a closer look at our training data.
First, how balanced is the response variable (i.e., outcome we’re trying to predict)?
summary(training$classe)
## A B C D E
## 4464 3038 2738 2573 2886
summary(training$classe)/nrow(training)
## A B C D E
## 0.2843493 0.1935155 0.1744060 0.1638958 0.1838334
This looks fairly balanced across the different outcomes, although “A” is a little more common.
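The proportions also give a useful baseline: a model that always guesses the most common class would be right about as often as that class occurs. Here is a small sketch that rebuilds a stand-in outcome vector from the class counts printed above (the vector itself is constructed for illustration, not taken from the data):

```r
# Stand-in outcome vector with the class counts printed above.
classe <- factor(rep(c("A", "B", "C", "D", "E"),
                     times = c(4464, 3038, 2738, 2573, 2886)))

counts <- table(classe)
props  <- prop.table(counts)   # same as counts / sum(counts)

round(props, 3)

# Accuracy of always guessing the most common class: a floor for any model.
baseline_accuracy <- max(props)
baseline_accuracy
```

Any model worth keeping should clearly beat this floor on held-out data.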
Let’s look at all of the predictors that we have:
summary(training)
## X user_name raw_timestamp_part_1 raw_timestamp_part_2
## Min. : 1 adelmo :3105 Min. :1.322e+09 Min. : 294
## 1st Qu.: 4892 carlitos:2520 1st Qu.:1.323e+09 1st Qu.:252303
## Median : 9821 charles :2825 Median :1.323e+09 Median :500295
## Mean : 9813 eurico :2409 Mean :1.323e+09 Mean :500863
## 3rd Qu.:14726 jeremy :2740 3rd Qu.:1.323e+09 3rd Qu.:752292
## Max. :19622 pedro :2100 Max. :1.323e+09 Max. :998801
##
## cvtd_timestamp new_window num_window roll_belt
## 05/12/2011 11:24:1202 no :15374 Min. : 1.0 Min. :-28.90
## 05/12/2011 11:25:1172 yes: 325 1st Qu.:223.0 1st Qu.: 1.10
## 28/11/2011 14:14:1169 Median :424.0 Median :113.00
## 30/11/2011 17:11:1149 Mean :431.2 Mean : 64.42
## 02/12/2011 14:57:1107 3rd Qu.:645.0 3rd Qu.:123.00
## 05/12/2011 14:23:1099 Max. :864.0 Max. :162.00
## (Other) :8801
## pitch_belt yaw_belt total_accel_belt kurtosis_roll_belt
## Min. :-55.800 Min. :-180.00 Min. : 0.00 :15374
## 1st Qu.: 1.830 1st Qu.: -88.30 1st Qu.: 3.00 #DIV/0! : 9
## Median : 5.300 Median : -13.20 Median :17.00 -1.908453: 2
## Mean : 0.373 Mean : -11.40 Mean :11.31 -0.021024: 1
## 3rd Qu.: 15.100 3rd Qu.: 12.55 3rd Qu.:18.00 -0.025513: 1
## Max. : 60.300 Max. : 179.00 Max. :29.00 -0.033935: 1
## (Other) : 311
## kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## :15374 :15374 :15374
## #DIV/0! : 28 #DIV/0!: 325 #DIV/0! : 8
## 47.000000: 4 0.000000 : 3
## -0.150950: 3 0.422463 : 2
## 1.216445 : 3 -0.003095: 1
## 1.326417 : 3 -0.010002: 1
## (Other) : 284 (Other) : 310
## skewness_roll_belt.1 skewness_yaw_belt max_roll_belt max_picth_belt
## :15374 :15374 Min. :-94.300 Min. : 3.00
## #DIV/0! : 28 #DIV/0!: 325 1st Qu.:-88.000 1st Qu.: 5.00
## -2.156553: 3 Median : -4.900 Median :18.00
## -3.072669: 3 Mean : -4.574 Mean :13.05
## 0.000000 : 3 3rd Qu.: 20.100 3rd Qu.:19.00
## 6.855655 : 3 Max. :180.000 Max. :30.00
## (Other) : 285 NA's :15374 NA's :15374
## max_yaw_belt min_roll_belt min_pitch_belt min_yaw_belt
## :15374 Min. :-180.000 Min. : 0.00 :15374
## -1.4 : 27 1st Qu.: -88.400 1st Qu.: 3.00 -1.4 : 27
## -1.1 : 23 Median : -7.000 Median :16.00 -1.1 : 23
## -1.2 : 22 Mean : -8.758 Mean :10.87 -1.2 : 22
## -0.9 : 18 3rd Qu.: 13.600 3rd Qu.:17.00 -0.9 : 18
## -0.7 : 17 Max. : 173.000 Max. :23.00 -0.7 : 17
## (Other): 218 NA's :15374 NA's :15374 (Other): 218
## amplitude_roll_belt amplitude_pitch_belt amplitude_yaw_belt
## Min. : 0.000 Min. : 0.000 :15374
## 1st Qu.: 0.300 1st Qu.: 1.000 #DIV/0!: 9
## Median : 1.000 Median : 1.000 0.00 : 11
## Mean : 4.183 Mean : 2.175 0.0000 : 305
## 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :360.000 Max. :12.000
## NA's :15374 NA's :15374
## var_total_accel_belt avg_roll_belt stddev_roll_belt var_roll_belt
## Min. : 0.000 Min. :-20.90 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.100 1st Qu.: 1.20 1st Qu.: 0.100 1st Qu.: 0.000
## Median : 0.200 Median :116.70 Median : 0.400 Median : 0.100
## Mean : 1.004 Mean : 69.24 Mean : 1.353 Mean : 8.153
## 3rd Qu.: 0.300 3rd Qu.:123.90 3rd Qu.: 0.700 3rd Qu.: 0.440
## Max. :16.500 Max. :157.40 Max. :14.200 Max. :200.700
## NA's :15374 NA's :15374 NA's :15374 NA's :15374
## avg_pitch_belt stddev_pitch_belt var_pitch_belt avg_yaw_belt
## Min. :-51.400 Min. :0.000 Min. : 0.000 Min. :-138.30
## 1st Qu.: 1.900 1st Qu.:0.200 1st Qu.: 0.000 1st Qu.: -88.10
## Median : 5.300 Median :0.300 Median : 0.100 Median : -5.90
## Mean : 0.013 Mean :0.588 Mean : 0.737 Mean : -7.02
## 3rd Qu.: 15.700 3rd Qu.:0.700 3rd Qu.: 0.500 3rd Qu.: 18.20
## Max. : 41.000 Max. :4.000 Max. :16.200 Max. : 173.40
## NA's :15374 NA's :15374 NA's :15374 NA's :15374
## stddev_yaw_belt var_yaw_belt gyros_belt_x
## Min. : 0.000 Min. : 0.00 Min. :-1.040000
## 1st Qu.: 0.100 1st Qu.: 0.01 1st Qu.:-0.030000
## Median : 0.300 Median : 0.09 Median : 0.030000
## Mean : 1.504 Mean : 133.95 Mean :-0.005337
## 3rd Qu.: 0.700 3rd Qu.: 0.51 3rd Qu.: 0.110000
## Max. :176.600 Max. :31183.24 Max. : 2.200000
## NA's :15374 NA's :15374
## gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y
## Min. :-0.64000 Min. :-1.4600 Min. :-120.000 Min. :-69.0
## 1st Qu.: 0.00000 1st Qu.:-0.2000 1st Qu.: -21.000 1st Qu.: 3.0
## Median : 0.02000 Median :-0.1000 Median : -15.000 Median : 34.0
## Mean : 0.03924 Mean :-0.1316 Mean : -5.695 Mean : 30.2
## 3rd Qu.: 0.11000 3rd Qu.:-0.0200 3rd Qu.: -5.000 3rd Qu.: 61.0
## Max. : 0.64000 Max. : 1.6200 Max. : 85.000 Max. :164.0
##
## accel_belt_z magnet_belt_x magnet_belt_y magnet_belt_z
## Min. :-275.00 Min. :-52.00 Min. :354.0 Min. :-623.0
## 1st Qu.:-162.00 1st Qu.: 9.00 1st Qu.:581.0 1st Qu.:-375.0
## Median :-152.00 Median : 35.00 Median :601.0 Median :-319.0
## Mean : -72.62 Mean : 55.48 Mean :593.8 Mean :-345.3
## 3rd Qu.: 27.00 3rd Qu.: 59.00 3rd Qu.:610.0 3rd Qu.:-306.0
## Max. : 105.00 Max. :485.00 Max. :673.0 Max. : 293.0
##
## roll_arm pitch_arm yaw_arm total_accel_arm
## Min. :-180.0 Min. :-88.800 Min. :-180.0000 Min. : 1.00
## 1st Qu.: -31.5 1st Qu.:-26.000 1st Qu.: -42.7000 1st Qu.:17.00
## Median : 0.0 Median : 0.000 Median : 0.0000 Median :27.00
## Mean : 17.7 Mean : -4.758 Mean : -0.5804 Mean :25.55
## 3rd Qu.: 77.4 3rd Qu.: 11.100 3rd Qu.: 45.6500 3rd Qu.:33.00
## Max. : 180.0 Max. : 88.500 Max. : 180.0000 Max. :66.00
##
## var_accel_arm avg_roll_arm stddev_roll_arm var_roll_arm
## Min. : 0.000 Min. :-166.67 Min. : 0.000 Min. : 0.00
## 1st Qu.: 9.682 1st Qu.: -38.31 1st Qu.: 1.643 1st Qu.: 2.70
## Median : 40.562 Median : 0.00 Median : 5.455 Median : 29.75
## Mean : 51.126 Mean : 13.42 Mean : 10.298 Mean : 329.58
## 3rd Qu.: 70.608 3rd Qu.: 76.25 3rd Qu.: 13.929 3rd Qu.: 194.01
## Max. :331.699 Max. : 160.78 Max. :161.452 Max. :26066.58
## NA's :15374 NA's :15374 NA's :15374 NA's :15374
## avg_pitch_arm stddev_pitch_arm var_pitch_arm avg_yaw_arm
## Min. :-77.019 Min. : 0.000 Min. : 0.000 Min. :-173.440
## 1st Qu.:-21.041 1st Qu.: 2.518 1st Qu.: 6.341 1st Qu.: -30.206
## Median : 0.000 Median : 8.219 Median : 67.546 Median : 0.000
## Mean : -3.090 Mean :10.758 Mean : 204.430 Mean : 2.987
## 3rd Qu.: 9.755 3rd Qu.:16.813 3rd Qu.: 282.666 3rd Qu.: 41.600
## Max. : 75.659 Max. :43.097 Max. :1857.367 Max. : 152.000
## NA's :15374 NA's :15374 NA's :15374 NA's :15374
## stddev_yaw_arm var_yaw_arm gyros_arm_x gyros_arm_y
## Min. : 0.000 Min. : 0.00 Min. :-6.37000 Min. :-3.440
## 1st Qu.: 3.965 1st Qu.: 15.72 1st Qu.:-1.32000 1st Qu.:-0.790
## Median : 16.520 Median : 272.91 Median : 0.08000 Median :-0.240
## Mean : 22.118 Mean : 1062.50 Mean : 0.03828 Mean :-0.256
## 3rd Qu.: 32.775 3rd Qu.: 1074.19 3rd Qu.: 1.54000 3rd Qu.: 0.140
## Max. :177.044 Max. :31344.57 Max. : 4.87000 Max. : 2.840
## NA's :15374 NA's :15374
## gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## Min. :-2.3300 Min. :-404.00 Min. :-315.00 Min. :-636.00
## 1st Qu.:-0.0700 1st Qu.:-242.00 1st Qu.: -54.00 1st Qu.:-144.00
## Median : 0.2300 Median : -44.00 Median : 14.00 Median : -47.00
## Mean : 0.2674 Mean : -60.31 Mean : 32.57 Mean : -71.47
## 3rd Qu.: 0.7200 3rd Qu.: 83.00 3rd Qu.: 139.00 3rd Qu.: 24.00
## Max. : 3.0200 Max. : 437.00 Max. : 308.00 Max. : 292.00
##
## magnet_arm_x magnet_arm_y magnet_arm_z kurtosis_roll_arm
## Min. :-584.0 Min. :-392 Min. :-597.0 :15374
## 1st Qu.:-304.0 1st Qu.: -9 1st Qu.: 134.0 #DIV/0! : 56
## Median : 290.0 Median : 202 Median : 443.0 -0.02438: 1
## Mean : 190.6 Mean : 157 Mean : 306.1 -0.04190: 1
## 3rd Qu.: 637.0 3rd Qu.: 324 3rd Qu.: 544.0 -0.05051: 1
## Max. : 782.0 Max. : 583 Max. : 694.0 -0.05695: 1
## (Other) : 265
## kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm
## :15374 :15374 :15374 :15374
## #DIV/0! : 58 #DIV/0! : 10 #DIV/0! : 55 #DIV/0! : 58
## -0.00484: 1 -0.01548: 1 -0.00051: 1 -0.00184: 1
## -0.02967: 1 -0.01749: 1 -0.00696: 1 -0.01247: 1
## -0.07394: 1 -0.04059: 1 -0.01884: 1 -0.02063: 1
## -0.10385: 1 -0.04626: 1 -0.03359: 1 -0.02652: 1
## (Other) : 263 (Other) : 311 (Other) : 266 (Other) : 263
## skewness_yaw_arm max_roll_arm max_picth_arm max_yaw_arm
## :15374 Min. :-71.90 Min. :-173.00 Min. : 4.00
## #DIV/0! : 10 1st Qu.: 0.00 1st Qu.: -5.30 1st Qu.:29.00
## -0.00311: 1 Median : 8.40 Median : 27.30 Median :34.00
## -0.04470: 1 Mean : 13.14 Mean : 36.03 Mean :35.04
## -0.04866: 1 3rd Qu.: 28.10 3rd Qu.: 100.00 3rd Qu.:41.00
## -0.05413: 1 Max. : 85.50 Max. : 180.00 Max. :65.00
## (Other) : 311 NA's :15374 NA's :15374 NA's :15374
## min_roll_arm min_pitch_arm min_yaw_arm amplitude_roll_arm
## Min. :-89.1 Min. :-180.00 Min. : 1.00 Min. : 0.00
## 1st Qu.:-41.4 1st Qu.: -75.30 1st Qu.: 8.00 1st Qu.: 9.70
## Median :-21.7 Median : -32.80 Median :13.00 Median : 28.64
## Mean :-20.3 Mean : -32.86 Mean :14.59 Mean : 33.44
## 3rd Qu.: 0.0 3rd Qu.: 0.00 3rd Qu.:19.00 3rd Qu.: 51.90
## Max. : 66.4 Max. : 152.00 Max. :38.00 Max. :119.50
## NA's :15374 NA's :15374 NA's :15374 NA's :15374
## amplitude_pitch_arm amplitude_yaw_arm roll_dumbbell pitch_dumbbell
## Min. : 0.0 Min. : 0.00 Min. :-153.71 Min. :-149.59
## 1st Qu.: 14.6 1st Qu.:13.00 1st Qu.: -18.90 1st Qu.: -40.82
## Median : 55.4 Median :21.00 Median : 48.20 Median : -20.90
## Mean : 68.9 Mean :20.45 Mean : 23.94 Mean : -10.74
## 3rd Qu.:110.8 3rd Qu.:27.00 3rd Qu.: 67.73 3rd Qu.: 17.44
## Max. :360.0 Max. :52.00 Max. : 153.55 Max. : 129.82
## NA's :15374 NA's :15374
## yaw_dumbbell kurtosis_roll_dumbbell kurtosis_picth_dumbbell
## Min. :-148.766 :15374 :15374
## 1st Qu.: -77.592 #DIV/0!: 5 -0.5464: 2
## Median : -2.282 -0.3705: 2 -0.9334: 2
## Mean : 1.865 -0.5855: 2 -2.0833: 2
## 3rd Qu.: 79.998 -2.0851: 2 -2.0851: 2
## Max. : 154.952 -2.0889: 2 -2.0889: 2
## (Other): 312 (Other): 315
## kurtosis_yaw_dumbbell skewness_roll_dumbbell skewness_pitch_dumbbell
## :15374 :15374 :15374
## #DIV/0!: 325 #DIV/0!: 4 -0.3521: 2
## 0.1110 : 2 0.1090 : 2
## 1.0312 : 2 1.0326 : 2
## -0.0082: 1 -0.0053: 1
## -0.0096: 1 -0.0166: 1
## (Other): 315 (Other): 317
## skewness_yaw_dumbbell max_roll_dumbbell max_picth_dumbbell
## :15374 Min. :-70.10 Min. :-112.90
## #DIV/0!: 325 1st Qu.:-26.90 1st Qu.: -67.80
## Median : 16.50 Median : 42.60
## Mean : 14.21 Mean : 32.91
## 3rd Qu.: 50.60 3rd Qu.: 133.00
## Max. :129.80 Max. : 155.00
## NA's :15374 NA's :15374
## max_yaw_dumbbell min_roll_dumbbell min_pitch_dumbbell min_yaw_dumbbell
## :15374 Min. :-149.6 Min. :-147.00 :15374
## 0.2 : 17 1st Qu.: -59.2 1st Qu.: -92.00 0.2 : 17
## -0.6 : 16 Median : -39.8 Median : -62.70 -0.6 : 16
## -0.4 : 13 Mean : -39.5 Mean : -31.46 -0.4 : 13
## -0.8 : 13 3rd Qu.: -19.3 3rd Qu.: 23.00 -0.8 : 13
## -0.3 : 12 Max. : 73.2 Max. : 120.90 -0.3 : 12
## (Other): 254 NA's :15374 NA's :15374 (Other): 254
## amplitude_roll_dumbbell amplitude_pitch_dumbbell amplitude_yaw_dumbbell
## Min. : 0.00 Min. : 0.00 :15374
## 1st Qu.: 13.41 1st Qu.: 16.50 #DIV/0!: 5
## Median : 33.22 Median : 41.52 0.00 : 320
## Mean : 53.71 Mean : 64.36
## 3rd Qu.: 76.14 3rd Qu.: 97.47
## Max. :256.48 Max. :270.84
## NA's :15374 NA's :15374
## total_accel_dumbbell var_accel_dumbbell avg_roll_dumbbell
## Min. : 0.00 Min. : 0.000 Min. :-128.96
## 1st Qu.: 4.00 1st Qu.: 0.374 1st Qu.: -11.03
## Median :10.00 Median : 0.932 Median : 47.20
## Mean :13.73 Mean : 4.588 Mean : 24.97
## 3rd Qu.:19.00 3rd Qu.: 3.466 3rd Qu.: 65.01
## Max. :58.00 Max. :230.428 Max. : 125.99
## NA's :15374 NA's :15374
## stddev_roll_dumbbell var_roll_dumbbell avg_pitch_dumbbell
## Min. : 0.00 Min. : 0.00 Min. :-70.73
## 1st Qu.: 4.51 1st Qu.: 20.34 1st Qu.:-40.23
## Median : 11.31 Median : 127.98 Median :-15.61
## Mean : 20.66 Mean : 1043.02 Mean :-10.96
## 3rd Qu.: 26.18 3rd Qu.: 685.21 3rd Qu.: 15.44
## Max. :123.78 Max. :15321.01 Max. : 94.28
## NA's :15374 NA's :15374 NA's :15374
## stddev_pitch_dumbbell var_pitch_dumbbell avg_yaw_dumbbell
## Min. : 0.000 Min. : 0.00 Min. :-117.950
## 1st Qu.: 3.108 1st Qu.: 9.66 1st Qu.: -76.640
## Median : 7.938 Median : 63.01 Median : 4.815
## Mean :12.930 Mean : 349.00 Mean : 1.297
## 3rd Qu.:18.291 3rd Qu.: 334.57 3rd Qu.: 72.140
## Max. :82.680 Max. :6836.02 Max. : 130.879
## NA's :15374 NA's :15374 NA's :15374
## stddev_yaw_dumbbell var_yaw_dumbbell gyros_dumbbell_x
## Min. : 0.000 Min. : 0.00 Min. :-204.0000
## 1st Qu.: 3.643 1st Qu.: 13.27 1st Qu.: -0.0300
## Median : 9.587 Median : 91.92 Median : 0.1300
## Mean : 16.192 Mean : 565.21 Mean : 0.1578
## 3rd Qu.: 23.642 3rd Qu.: 558.93 3rd Qu.: 0.3500
## Max. :107.088 Max. :11467.91 Max. : 2.2200
## NA's :15374 NA's :15374
## gyros_dumbbell_y gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## Min. :-2.10000 Min. : -2.3800 Min. :-419.00 Min. :-189.00
## 1st Qu.:-0.14000 1st Qu.: -0.3100 1st Qu.: -50.00 1st Qu.: -8.00
## Median : 0.05000 Median : -0.1300 Median : -8.00 Median : 42.00
## Mean : 0.04931 Mean : -0.1244 Mean : -28.53 Mean : 52.82
## 3rd Qu.: 0.21000 3rd Qu.: 0.0300 3rd Qu.: 11.00 3rd Qu.: 112.00
## Max. :52.00000 Max. :317.0000 Max. : 234.00 Max. : 315.00
##
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z
## Min. :-284.00 Min. :-643.0 Min. :-744 Min. :-262.0
## 1st Qu.:-142.00 1st Qu.:-535.0 1st Qu.: 232 1st Qu.: -45.0
## Median : -1.00 Median :-480.0 Median : 311 Median : 13.0
## Mean : -38.42 Mean :-327.6 Mean : 221 Mean : 45.2
## 3rd Qu.: 38.00 3rd Qu.:-301.0 3rd Qu.: 391 3rd Qu.: 94.0
## Max. : 318.00 Max. : 592.0 Max. : 633 Max. : 451.0
##
## roll_forearm pitch_forearm yaw_forearm
## Min. :-180.000 Min. :-72.50 Min. :-180.00
## 1st Qu.: -0.865 1st Qu.: 0.00 1st Qu.: -69.10
## Median : 21.500 Median : 9.22 Median : 0.00
## Mean : 33.564 Mean : 10.74 Mean : 19.32
## 3rd Qu.: 140.000 3rd Qu.: 28.60 3rd Qu.: 110.00
## Max. : 180.000 Max. : 88.70 Max. : 180.00
##
## kurtosis_roll_forearm kurtosis_picth_forearm kurtosis_yaw_forearm
## :15374 :15374 :15374
## #DIV/0!: 70 #DIV/0!: 70 #DIV/0!: 325
## -0.8079: 2 -0.0489: 1
## -0.0227: 1 -0.0523: 1
## -0.0359: 1 -0.0891: 1
## -0.0567: 1 -0.0920: 1
## (Other): 250 (Other): 251
## skewness_roll_forearm skewness_pitch_forearm skewness_yaw_forearm
## :15374 :15374 :15374
## #DIV/0!: 69 #DIV/0!: 70 #DIV/0!: 325
## -0.1912: 2 0.0000 : 4
## -0.0004: 1 -0.6992: 2
## -0.0013: 1 -0.0113: 1
## -0.0088: 1 -0.0131: 1
## (Other): 251 (Other): 247
## max_roll_forearm max_picth_forearm max_yaw_forearm min_roll_forearm
## Min. :-66.60 Min. :-151.00 :15374 Min. :-72.500
## 1st Qu.: 0.00 1st Qu.: 0.00 #DIV/0!: 70 1st Qu.: -6.000
## Median : 26.90 Median : 112.00 -1.2 : 26 Median : 0.000
## Mean : 24.34 Mean : 81.33 -1.3 : 23 Mean : -0.007
## 3rd Qu.: 47.20 3rd Qu.: 175.00 -1.5 : 22 3rd Qu.: 12.600
## Max. : 89.80 Max. : 180.00 -1.6 : 21 Max. : 62.100
## NA's :15374 NA's :15374 (Other): 163 NA's :15374
## min_pitch_forearm min_yaw_forearm amplitude_roll_forearm
## Min. :-180.0 :15374 Min. : 0.00
## 1st Qu.:-175.0 #DIV/0!: 70 1st Qu.: 1.07
## Median : -65.5 -1.2 : 26 Median : 17.84
## Mean : -59.1 -1.3 : 23 Mean : 24.34
## 3rd Qu.: 0.0 -1.5 : 22 3rd Qu.: 40.20
## Max. : 167.0 -1.6 : 21 Max. :120.30
## NA's :15374 (Other): 163 NA's :15374
## amplitude_pitch_forearm amplitude_yaw_forearm total_accel_forearm
## Min. : 0.0 :15374 Min. : 0.00
## 1st Qu.: 1.8 #DIV/0!: 70 1st Qu.: 29.00
## Median : 85.6 0.00 : 255 Median : 36.00
## Mean :140.4 Mean : 34.67
## 3rd Qu.:350.0 3rd Qu.: 41.00
## Max. :360.0 Max. :108.00
## NA's :15374
## var_accel_forearm avg_roll_forearm stddev_roll_forearm
## Min. : 0.000 Min. :-177.234 Min. : 0.000
## 1st Qu.: 6.768 1st Qu.: -2.985 1st Qu.: 0.428
## Median : 20.892 Median : 5.499 Median : 8.455
## Mean : 33.659 Mean : 31.379 Mean : 43.048
## 3rd Qu.: 51.253 3rd Qu.: 104.021 3rd Qu.: 87.099
## Max. :172.606 Max. : 174.714 Max. :179.171
## NA's :15374 NA's :15374 NA's :15374
## var_roll_forearm avg_pitch_forearm stddev_pitch_forearm
## Min. : 0.00 Min. :-68.17 Min. : 0.000
## 1st Qu.: 0.18 1st Qu.: 0.00 1st Qu.: 0.298
## Median : 71.48 Median : 12.24 Median : 5.552
## Mean : 5367.40 Mean : 11.95 Mean : 7.875
## 3rd Qu.: 7586.30 3rd Qu.: 29.55 3rd Qu.:12.954
## Max. :32102.24 Max. : 72.09 Max. :39.561
## NA's :15374 NA's :15374 NA's :15374
## var_pitch_forearm avg_yaw_forearm stddev_yaw_forearm
## Min. : 0.000 Min. :-155.06 Min. : 0.00
## 1st Qu.: 0.089 1st Qu.: -26.87 1st Qu.: 0.52
## Median : 30.825 Median : 0.00 Median : 26.16
## Mean : 134.153 Mean : 17.18 Mean : 45.41
## 3rd Qu.: 167.818 3rd Qu.: 84.15 3rd Qu.: 87.95
## Max. :1565.055 Max. : 169.24 Max. :197.51
## NA's :15374 NA's :15374 NA's :15374
## var_yaw_forearm gyros_forearm_x gyros_forearm_y
## Min. : 0.00 Min. :-22.0000 Min. : -6.6200
## 1st Qu.: 0.27 1st Qu.: -0.2200 1st Qu.: -1.4800
## Median : 684.62 Median : 0.0500 Median : 0.0300
## Mean : 4710.02 Mean : 0.1575 Mean : 0.0747
## 3rd Qu.: 7735.10 3rd Qu.: 0.5600 3rd Qu.: 1.6100
## Max. :39009.33 Max. : 3.9700 Max. :311.0000
## NA's :15374
## gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## Min. : -8.090 Min. :-498.00 Min. :-632.0 Min. :-446.00
## 1st Qu.: -0.180 1st Qu.:-178.00 1st Qu.: 54.0 1st Qu.:-182.00
## Median : 0.080 Median : -57.00 Median : 199.0 Median : -42.00
## Mean : 0.153 Mean : -61.26 Mean : 162.5 Mean : -56.63
## 3rd Qu.: 0.490 3rd Qu.: 77.00 3rd Qu.: 312.0 3rd Qu.: 25.00
## Max. :231.000 Max. : 477.00 Max. : 923.0 Max. : 291.00
##
## magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
## Min. :-1280.0 Min. :-896.0 Min. :-973 A:4464
## 1st Qu.: -615.0 1st Qu.: -7.0 1st Qu.: 197 B:3038
## Median : -377.0 Median : 587.0 Median : 512 C:2738
## Mean : -311.4 Mean : 376.5 Mean : 396 D:2573
## 3rd Qu.: -70.0 3rd Qu.: 735.0 3rd Qu.: 653 E:2886
## Max. : 666.0 Max. :1480.0 Max. :1090
##
Wow. That’s a lot of variables. Let’s see if we can remove anything that is not going to be informative.
We probably don’t want to predict with training$X. It is just a row index that numbers the observations from 1 to 19622.
summary(training$X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 4892 9821 9813 14730 19620
Remove it! We’ll do this for all of our data sets: training for model building, testing for validation, and testing_final for the set we’re making our actual predictions on.
training$X <- NULL
testing$X <- NULL
testing_final$X <- NULL
There are a lot of missing values. In almost all of the columns that contain NAs, the number of NAs is 15374 out of the 15699 rows in our training split (19216 out of 19622 in the full download). Most of these variables are summary statistics of other columns, e.g., kurtosis, skewness, var, min, max, etc.
Let’s remove those variables with tons of missing values. We’ll use a regular expression to find all of the columns that we don’t want by their name prefix.
var_names <- grep("^(var_|stddev_|avg_|min_|max_|skewness_|kurtosis_|amplitude_)",names(training))
# remove those columns from the training and testing sets and the final testing set too
training02 <- training[,-var_names]
testing02 <- testing[,-var_names]
testing_final02 <- testing_final[,-var_names]
We’ll now call our data sets training02, testing02, and testing_final02.
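Selecting columns by name prefix relies on knowing which prefixes are sparse. An alternative, sketched below on a toy data frame (column names mimic the real ones, but the values and the 50% threshold are assumptions for illustration), is to measure the fraction of missing entries in each column and drop the sparse ones directly:

```r
# Toy data frame; column names mimic the real ones but the values are invented.
toy <- data.frame(
  roll_belt          = c(1.41, 1.42, 1.48, 1.45, 1.43),
  max_roll_belt      = c(NA, NA, NA, NA, -94.3),
  kurtosis_roll_belt = c("", "", "#DIV/0!", "", "-1.908453"),
  classe             = c("A", "A", "B", "B", "C"),
  stringsAsFactors   = FALSE
)

# Count blanks and "#DIV/0!" as missing too, since they are not NA yet.
is_missing <- function(x) is.na(x) | as.character(x) %in% c("", "#DIV/0!")

# Fraction of missing entries per column.
missing_frac <- vapply(toy, function(col) mean(is_missing(col)), numeric(1))

# Keep columns that are at most 50% missing (the threshold is an arbitrary choice).
keep <- names(missing_frac)[missing_frac <= 0.5]
toy_clean <- toy[, keep, drop = FALSE]

names(toy_clean)
```

Whichever rule is used, the set of kept columns should be computed once on the training data and then applied unchanged to the other data sets.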
We’re going to fit a random forest model to our data. Random forests are a good method for classification, especially in cases with non-linear relationships between variables.
library(caret)
# run the model
set.seed(1235)
modelFit <- train(classe ~ . ,
method="rf",
verbose=TRUE,
importance=TRUE,
data=training02)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
Now, let’s look at how this model performed:
print(modelFit)
## Random Forest
##
## 15699 samples
## 58 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 15699, 15699, 15699, 15699, 15699, 15699, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9879858 0.9847964 0.0018194717 0.002303120
## 41 0.9987640 0.9984363 0.0007251904 0.000917370
## 80 0.9978913 0.9973324 0.0013177840 0.001665298
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 41.
We’ve built this model on the training02 set. Now, let’s see how well the model does on the testing02 set.
# predict new values
pred <- predict(modelFit, testing02)
# confusion matrix on the test data
confusionMatrix(pred, testing02$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 879 0 0 0 0
## B 0 625 0 0 0
## C 0 0 550 0 0
## D 0 0 0 507 0
## E 0 0 0 0 580
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9988, 1)
## No Information Rate : 0.2798
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.000 1.0000 1.0000 1.0000
## Prevalence 0.2798 0.199 0.1751 0.1614 0.1847
## Detection Rate 0.2798 0.199 0.1751 0.1614 0.1847
## Detection Prevalence 0.2798 0.199 0.1751 0.1614 0.1847
## Balanced Accuracy 1.0000 1.000 1.0000 1.0000 1.0000
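This looks almost too good. Accuracy is 100%, i.e. an error rate of 0% on these 3141 rows. Note that this is a hold-out figure, not the random forest’s out-of-bag (OOB) error, and it should be read with caution: a clean 20% hold-out of the 19622 original rows would contain 3923 rows, and 3141 is what remains when testing is indexed from the already-subsetted training data frame. In that case the scored rows were also seen during fitting, so 100% is resubstitution accuracy. The bootstrap resampling accuracy of 0.9988 (mtry = 41) reported by print(modelFit) is the more trustworthy estimate of out-of-sample performance.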
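The headline numbers in the confusionMatrix output follow directly from the table of predicted versus reference classes. As a rough illustration, here is how accuracy, error rate and Cohen’s kappa can be computed by hand in base R; the 3-class table below is hypothetical and is not taken from the model above:

```r
# Hypothetical 3-class confusion table (rows = predicted, columns = reference).
cm <- matrix(c(50,  2,  1,
                3, 40,  2,
                0,  4, 48),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("A", "B", "C"),
                             Reference  = c("A", "B", "C")))

n <- sum(cm)
accuracy   <- sum(diag(cm)) / n          # share of cases on the diagonal
error_rate <- 1 - accuracy

# Cohen's kappa: agreement beyond what the row/column totals alone would give.
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

c(accuracy = accuracy, error_rate = error_rate, kappa = kappa_hat)
```

A table with all counts on the diagonal, like the one printed above, gives accuracy and kappa of exactly 1 under these formulas.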
Let’s see which variables were the most important predictors:
varImp(modelFit)
## rf variable importance
##
## variables are sorted by maximum importance across the classes
## only 20 most important variables shown (out of 80)
##
## A B C D E
## raw_timestamp_part_1 84.70 100.00 95.762 84.55 50.61
## roll_belt 56.78 85.67 76.215 68.94 57.56
## pitch_forearm 33.57 48.01 67.647 50.31 40.95
## num_window 42.31 65.33 54.562 43.65 46.35
## magnet_dumbbell_z 58.64 39.55 49.873 36.95 32.59
## cvtd_timestamp30/11/2011 17:12 19.36 37.00 44.796 41.60 53.68
## yaw_belt 20.32 32.37 42.259 34.80 21.49
## cvtd_timestamp28/11/2011 14:15 15.43 25.18 26.136 22.44 36.72
## magnet_dumbbell_y 36.50 32.78 34.115 31.81 28.93
## pitch_belt 17.92 30.78 34.499 24.39 21.26
## cvtd_timestamp02/12/2011 14:58 19.42 26.72 16.135 30.54 19.52
## cvtd_timestamp05/12/2011 11:24 16.73 27.80 14.835 23.88 17.43
## cvtd_timestamp02/12/2011 13:33 23.36 21.88 26.459 24.95 25.29
## cvtd_timestamp05/12/2011 14:24 12.01 14.74 14.780 22.61 21.11
## cvtd_timestamp05/12/2011 11:25 16.17 22.55 9.316 16.70 16.17
## roll_forearm 22.34 18.12 19.380 16.26 15.69
## roll_dumbbell 16.02 19.07 21.494 19.99 16.69
## gyros_dumbbell_y 20.40 16.42 19.450 13.16 12.09
## cvtd_timestamp02/12/2011 13:35 13.68 20.29 16.588 17.75 10.80
## magnet_dumbbell_x 19.05 19.88 19.096 19.95 17.38
Finally, let’s make our predictions on the unknown testing_final02 cases.
predictions_final <- predict(modelFit,newdata=testing_final02)
predictions_final
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
There you have it. Those are the predictions for how the exercise was performed in the 20 final test cases.