Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
#Downloading files
URLtrain="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
URLtest="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(URLtrain,destfile = "./train.csv")
download.file(URLtest,destfile="./test.csv")
Loading files:
training=read.csv(file="./train.csv",na.strings=c("NA","#DIV/0!",""))
testing=read.csv(file="./test.csv",na.strings=c("NA","#DIV/0!",""))
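To illustrate why the na.strings argument matters when loading these files, here is a minimal sketch that reads a tiny inline CSV with and without it. The values, the column names (kurtosis, amplitude), and the variable names are made up for illustration and are not taken from the real data.

```r
# Toy CSV containing the three kinds of missing-value markers handled above:
# "NA", the spreadsheet error "#DIV/0!", and an empty field.
csv_text <- "id,kurtosis,amplitude\n1,#DIV/0!,0.5\n2,NA,\n3,1.2,0.7"

# Without extra na.strings, "#DIV/0!" is kept as text, so the column is not numeric.
toy_raw <- read.csv(text = csv_text)

# With the same na.strings as in the report, all markers become NA
# and the column is parsed as numeric.
toy_clean <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

is.numeric(toy_raw$kurtosis)
is.numeric(toy_clean$kurtosis)
colSums(is.na(toy_clean))
```

Treating these markers as NA at load time is what makes the later per-column NA count meaningful.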
We have two different data sets: a training set and a test set. The data were extracted from this source: http://groupware.les.inf.puc-rio.br/har and contain readings from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The way each lift was performed is recorded in the classe variable of the training set, which is what we want to predict.
Some exploratory analysis and cleaning of the training data set:
NAs=apply(training,2,function(x)sum(is.na(x))/length(x)) # fraction of NAs in each column
# Keep only columns with less than 50% NAs; the mask is computed on training and applied to both sets
training=training[,NAs<0.5]
testing=testing[,NAs<0.5]
str(training)
## 'data.frame': 19622 obs. of 60 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1: int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2: int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 0 -0.02 0 0 0 0 0 0 ...
## $ accel_dumbbell_x : int -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
## $ accel_dumbbell_y : int 47 47 46 48 48 48 47 46 47 48 ...
## $ accel_dumbbell_z : int -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
## $ magnet_dumbbell_x : int -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
## $ magnet_dumbbell_y : int 293 296 298 303 292 294 295 300 292 291 ...
## $ magnet_dumbbell_z : num -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
## $ roll_forearm : num 28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
## $ yaw_forearm : num -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
## $ gyros_forearm_y : num 0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
## $ gyros_forearm_z : num -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
## $ accel_forearm_x : int 192 192 196 189 189 193 195 193 193 190 ...
## $ accel_forearm_y : int 203 203 204 206 206 203 205 205 204 205 ...
## $ accel_forearm_z : int -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
## $ magnet_forearm_x : int -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
## $ magnet_forearm_y : num 654 661 658 658 655 660 659 660 653 656 ...
## $ magnet_forearm_z : num 476 473 469 469 473 478 470 474 476 473 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
table(training$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
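The column filter used above (keep a column only if fewer than half of its values are NA) can be sketched on a toy data frame. The data frame and variable names below are illustrative assumptions; colMeans(is.na(...)) is an equivalent way to get the per-column NA fraction.

```r
# Toy data frame: column b is 75% NA, column c is 25% NA, column a has no NAs.
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c(1, NA, 3, 4)
)

# Fraction of missing values in each column.
na_frac <- colMeans(is.na(toy))

# Keep columns with less than 50% NAs; drop = FALSE keeps a data frame
# even if only one column survives.
kept <- toy[, na_frac < 0.5, drop = FALSE]

na_frac
names(kept)
```

Because the mask is a logical vector over column positions, applying the same mask to a second data frame only makes sense when both have their columns in the same order.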
We split the training data set into two parts: 70% is used to fit the models and the remaining 30% is held out to evaluate them.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
inTrain=createDataPartition(y=training$classe,p=0.7,list=F)
myTraining=training[inTrain,]
myTesting=training[-inTrain,]
dim(myTraining); dim(myTesting)
## [1] 13737 60
## [1] 5885 60
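The idea behind a stratified 70/30 split, where each class contributes about 70% of its rows to the training part, can be sketched in base R. This is not caret's implementation of createDataPartition, only an illustration of the principle on made-up labels; all names below are hypothetical.

```r
set.seed(1124)  # set the seed before sampling so the split is reproducible

# Toy class labels with unequal class sizes.
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Row indices grouped by class.
idx_by_class <- split(seq_along(labels), labels)

# Sample about 70% of the indices within each class.
in_train <- unlist(lapply(idx_by_class, function(idx) {
  n_take <- round(0.7 * length(idx))
  idx[sample.int(length(idx), n_take)]
}), use.names = FALSE)

train_labels <- labels[in_train]
test_labels <- labels[-in_train]

table(train_labels)
table(test_labels)
```

Sampling within each class keeps the class proportions of the two parts close to those of the full data. Calling set.seed() before the split, as in this sketch, makes the partition itself repeatable between runs.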
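Remove the first 7 columns (row index, user name, timestamps, and window identifiers), which are not sensor predictors, and then check the remaining variables for near-zero variance: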
myTraining=myTraining[,8:dim(myTraining)[2]]
nzv=nearZeroVar(myTraining, saveMetrics = TRUE)
nzv
## freqRatio percentUnique zeroVar nzv
## roll_belt 1.109453 8.08764650 FALSE FALSE
## pitch_belt 1.075188 12.31709980 FALSE FALSE
## yaw_belt 1.022161 13.01594235 FALSE FALSE
## total_accel_belt 1.052429 0.21110868 FALSE FALSE
## gyros_belt_x 1.070288 0.96090850 FALSE FALSE
## gyros_belt_y 1.128822 0.47317464 FALSE FALSE
## gyros_belt_z 1.073804 1.20841523 FALSE FALSE
## accel_belt_x 1.043478 1.15745796 FALSE FALSE
## accel_belt_y 1.120188 0.99730654 FALSE FALSE
## accel_belt_z 1.067742 2.10380724 FALSE FALSE
## magnet_belt_x 1.090535 2.20572177 FALSE FALSE
## magnet_belt_y 1.168565 2.13292568 FALSE FALSE
## magnet_belt_z 1.005917 3.17390988 FALSE FALSE
## roll_arm 49.916667 17.47834316 FALSE FALSE
## pitch_arm 79.900000 20.41202592 FALSE FALSE
## yaw_arm 32.378378 19.31280483 FALSE FALSE
## total_accel_arm 1.053140 0.47317464 FALSE FALSE
## gyros_arm_x 1.036723 4.55703574 FALSE FALSE
## gyros_arm_y 1.420765 2.67161680 FALSE FALSE
## gyros_arm_z 1.170391 1.71070831 FALSE FALSE
## accel_arm_x 1.024390 5.56162190 FALSE FALSE
## accel_arm_y 1.164474 3.82179515 FALSE FALSE
## accel_arm_z 1.125000 5.55434229 FALSE FALSE
## magnet_arm_x 1.031746 9.57268690 FALSE FALSE
## magnet_arm_y 1.000000 6.20950717 FALSE FALSE
## magnet_arm_z 1.038462 9.11407149 FALSE FALSE
## roll_dumbbell 1.146067 86.40168887 FALSE FALSE
## pitch_dumbbell 2.068627 84.18868749 FALSE FALSE
## yaw_dumbbell 1.259259 85.84115891 FALSE FALSE
## total_accel_dumbbell 1.086458 0.31302322 FALSE FALSE
## gyros_dumbbell_x 1.040000 1.65975104 FALSE FALSE
## gyros_dumbbell_y 1.281863 1.95821504 FALSE FALSE
## gyros_dumbbell_z 1.043981 1.40496469 FALSE FALSE
## accel_dumbbell_x 1.136564 2.98464002 FALSE FALSE
## accel_dumbbell_y 1.045198 3.31950207 FALSE FALSE
## accel_dumbbell_z 1.164706 2.91184393 FALSE FALSE
## magnet_dumbbell_x 1.065574 7.81102133 FALSE FALSE
## magnet_dumbbell_y 1.153846 5.99111888 FALSE FALSE
## magnet_dumbbell_z 1.000000 4.81910170 FALSE FALSE
## roll_forearm 11.040323 13.70022567 FALSE FALSE
## pitch_forearm 59.478261 18.96338356 FALSE FALSE
## yaw_forearm 15.116022 12.82667249 FALSE FALSE
## total_accel_forearm 1.146621 0.48773386 FALSE FALSE
## gyros_forearm_x 1.024390 2.07468880 FALSE FALSE
## gyros_forearm_y 1.010676 5.20492102 FALSE FALSE
## gyros_forearm_z 1.110787 2.09652763 FALSE FALSE
## accel_forearm_x 1.193548 5.68537526 FALSE FALSE
## accel_forearm_y 1.013889 7.09761957 FALSE FALSE
## accel_forearm_z 1.060345 4.07658150 FALSE FALSE
## magnet_forearm_x 1.035714 10.59911189 FALSE FALSE
## magnet_forearm_y 1.368421 13.24888986 FALSE FALSE
## magnet_forearm_z 1.022727 11.66921453 FALSE FALSE
## classe 1.469526 0.03639805 FALSE FALSE
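In the table above every row has nzv equal to FALSE, so no further columns are dropped. To make the two metrics concrete, here is a rough base R sketch of how freqRatio and percentUnique can be computed for one variable. The helper nzv_metrics is hypothetical, and the thresholds (a frequency ratio of 95/5 and 10% unique values) as well as the exact boundary handling are assumptions based on caret's documented defaults.

```r
# Hypothetical helper: approximate near-zero-variance metrics for one vector.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- as.vector(sort(table(x), decreasing = TRUE))
  zero_var <- length(counts) == 1
  # Ratio of the most common value's count to the second most common one
  # (0 is used as a placeholder when there is only one distinct value).
  freq_ratio <- if (zero_var) 0 else counts[1] / counts[2]
  # Distinct values as a percentage of the number of observations.
  percent_unique <- 100 * length(counts) / length(x)
  list(
    freqRatio = freq_ratio,
    percentUnique = percent_unique,
    zeroVar = zero_var,
    nzv = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

almost_constant <- c(rep(0, 98), 1, 2)  # one value dominates
varied <- rep(1:10, each = 10)          # ten values, equally frequent

str(nzv_metrics(almost_constant))
str(nzv_metrics(varied))
```

Both conditions must hold for a variable to be flagged. That is why roll_arm and pitch_arm, whose freqRatio is high but whose percentUnique is around 17 to 20, are not marked as near-zero variance.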
library(rattle); library(rpart)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
set.seed(1124)
treeModel=rpart(classe ~ .,data=myTraining,method="class")
fancyRpartPlot(treeModel,sub="")
predictions1=predict(treeModel,newdata=myTesting,type="class")
confusionMatrix(predictions1,myTesting$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1516  171   22   60   17
##          B   68  704   84   93   98
##          C   41  116  820  161  131
##          D   17   88   70  585   59
##          E   32   60   30   65  777
##
## Overall Statistics
##
## Accuracy : 0.748
## 95% CI : (0.7367, 0.7591)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6805
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9056   0.6181   0.7992  0.60685   0.7181
## Specificity            0.9359   0.9277   0.9076  0.95245   0.9611
## Pos Pred Value         0.8488   0.6724   0.6462  0.71429   0.8060
## Neg Pred Value         0.9615   0.9101   0.9554  0.92519   0.9380
## Prevalence             0.2845   0.1935   0.1743  0.16381   0.1839
## Detection Rate         0.2576   0.1196   0.1393  0.09941   0.1320
## Detection Prevalence   0.3035   0.1779   0.2156  0.13917   0.1638
## Balanced Accuracy      0.9207   0.7729   0.8534  0.77965   0.8396
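The headline numbers in the decision tree output can be recomputed by hand from the confusion matrix. The sketch below re-enters the counts reported above (rows are predictions, columns are reference classes) and derives accuracy, per-class sensitivity, and Cohen's kappa with base R; the variable names are illustrative.

```r
# Decision tree confusion matrix, copied from the output above.
cm <- matrix(
  c(1516, 171,  22,  60,  17,
      68, 704,  84,  93,  98,
      41, 116, 820, 161, 131,
      17,  88,  70, 585,  59,
      32,  60,  30,  65, 777),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(cm)

# Accuracy: share of cases on the diagonal (prediction equals reference).
accuracy <- sum(diag(cm)) / n

# Sensitivity per class: correct predictions divided by the reference total.
sensitivity <- diag(cm) / colSums(cm)

# Cohen's kappa: agreement beyond what the row and column totals
# would produce by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(accuracy, 4)
round(sensitivity, 4)
round(cohen_kappa, 4)
```

Kappa is lower than accuracy because it discounts the agreement expected by chance given how often each class occurs and is predicted.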
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
rfModel=randomForest(classe ~ .,data=myTraining)
predictions2=predict(rfModel,newdata=myTesting)
confusionMatrix(predictions2,myTesting$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    4    0    0    0
##          B    0 1134    3    0    0
##          C    0    1 1022   12    0
##          D    0    0    1  951    0
##          E    0    0    0    1 1082
##
## Overall Statistics
##
## Accuracy : 0.9963
## 95% CI : (0.9943, 0.9977)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9953
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9956   0.9961   0.9865   1.0000
## Specificity            0.9991   0.9994   0.9973   0.9998   0.9998
## Pos Pred Value         0.9976   0.9974   0.9874   0.9989   0.9991
## Neg Pred Value         1.0000   0.9989   0.9992   0.9974   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1927   0.1737   0.1616   0.1839
## Detection Prevalence   0.2851   0.1932   0.1759   0.1618   0.1840
## Balanced Accuracy      0.9995   0.9975   0.9967   0.9932   0.9999
The random forest is far more accurate than the single decision tree: 99.63% versus 74.8% on the held-out 30% of the training data, which corresponds to an estimated out-of-sample error of about 0.37%. We therefore use the random forest to predict the 20 cases in the test set.
predictions3=predict(rfModel,testing)
predictions3
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
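As a rough sanity check on these 20 predictions, the held-out accuracy can be turned into an expected number of errors. The sketch below assumes that the 20 test cases are independent and that the 0.9963 accuracy measured on the held-out data carries over to them; both are simplifying assumptions, not results from the report.

```r
accuracy <- 0.9963  # random forest accuracy on the held-out data
n_cases <- 20       # number of cases in the test set

# Expected number of misclassified cases among the 20.
expected_errors <- n_cases * (1 - accuracy)

# Probability that all 20 predictions are correct, assuming independence.
p_all_correct <- accuracy^n_cases

expected_errors
p_all_correct
```

Under these assumptions, fewer than one error is expected across the 20 cases.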