## Instructions
# One thing that people regularly do is quantify how much of a particular activity
# they do, but they rarely quantify how well they do it. In this project, your goal
# will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of
# 6 participants.
## Review criteria
## What you should submit
# The goal of your project is to predict the manner in which they did the exercise.
# This is the "classe" variable in the training set. You may use any of the other
# variables to predict with. You should create a report describing how you built
# your model, how you used cross validation, what you think the expected out of
# sample error is, and why you made the choices you did. You will also use your
# prediction model to predict 20 different test cases.
## Peer Review Portion
# Your submission for the Peer Review portion should consist of a link to a Github
# repo with your R markdown and compiled HTML file describing your analysis. Please
# constrain the text of the writeup to < 2000 words and the number of figures to be
# less than 5. It will make it easier for the graders if you submit a repo with a
# gh-pages branch so the HTML page can be viewed online (and you always want to make
# it easy on graders :-).
## Course Project Prediction Quiz Portion
# Apply your machine learning algorithm to the 20 test cases available in the test
# data above and submit your predictions in appropriate format to the Course Project
# Prediction Quiz for automated grading.
## Reproducibility
# Due to security concerns with the exchange of R code, your code will not be run
# during the evaluation by your classmates. Please be sure that if they download
# the repo, they will be able to view the compiled HTML version of your analysis.
## Prediction Assignment Writeup
## Background
# Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to
# collect a large amount of data about personal activity relatively inexpensively.
# These types of devices are part of the quantified self movement – a group of
# enthusiasts who take measurements about themselves regularly to improve their
# health, to find patterns in their behavior, or because they are tech geeks. One
# thing that people regularly do is quantify how much of a particular activity they
# do, but they rarely quantify how well they do it. In this project, your goal will
# be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6
# participants. They were asked to perform barbell lifts correctly and incorrectly
# in 5 different ways. More information is available from the website here:
# http://groupware.les.inf.puc-rio.br/har
# (see the section on the Weight Lifting Exercise Dataset).
## Data
# The training data for this project are available here:
# https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
# The test data are available here:
# https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
# The data for this project come from this source:
# http://groupware.les.inf.puc-rio.br/har
# If you use the document you create for this class for any purpose please cite them
# as they have been very generous in allowing their data to be used for this kind of
# assignment.
# Choosing the prediction algorithm
# Steps Taken
# 1. Tidy the data. Remove columns with little or no data.
# 2. Split the training data into training and test sets for cross-validation checking.
# 3. Trial one method only: Random Forest using trainControl method "cv"
# and 25 resamples.
# 4. Fine-tune the model through combinations of the above methods, reduction of input
# variables, or similar. The fine-tuning takes accuracy into account first and
# speed of analysis second.
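# The steps above can be sketched on a small built-in dataset. This is only an
# illustration under stated assumptions: it uses iris instead of the project data,
# every demo.* name is hypothetical, and 5 folds are chosen just to keep it fast.
suppressMessages(library(caret))
set.seed(1)
# Step 2: stratified split into a training part and a held-out part.
demo.idx = createDataPartition(iris$Species, p = 0.7, list = FALSE)
demo.train = iris[demo.idx, ]
demo.holdout = iris[-demo.idx, ]
# Step 3: random forest with k-fold cross validation. The number of folds is an
# argument of trainControl(), not of train().
demo.ctrl = trainControl(method = "cv", number = 5)
demo.fit = train(Species ~ ., data = demo.train, method = "rf", trControl = demo.ctrl)
# Accuracy on rows that were not used to fit the model.
demo.pred = predict(demo.fit, demo.holdout)
demo.acc = mean(demo.pred == demo.holdout$Species)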
setwd("/home/pcbrom/Dropbox/Trabalho e Estudo/Cursos Livres/Machine Learning/Curse Project")
# Register multiple cores for parallel processing
suppressMessages(require(doMC)); registerDoMC(cores = 4)
# GET DATA
# IMPORT TRAINING (the testing data are imported later, before the submission step)
training = read.csv("pml-training.csv")
# Eliminating variables that are not sensor measurements
# Note that the first seven columns are:
# "X"
# "user_name"
# "raw_timestamp_part_1"
# "raw_timestamp_part_2"
# "cvtd_timestamp"
# "new_window"
# "num_window"
# These variables could be added to the model, and adding them would probably increase
# the hit rate: extra variables tend to do so, even a numeric sequence that is only
# spuriously correlated with the response variable. However, they say nothing about how
# the exercise was performed, so it is not coherent to keep them only to improve the
# accuracy of the model and claim that everything is fine.
training = training[, -c(1:7)]
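# Remove bad columns: those where more than 95% of the values are NA or empty strings.
# Despite its name, bad.col is TRUE for the columns that are kept (note the negation).
bad.col = !apply(training, 2, function(x) sum(is.na(x)) > 0.95*nrow(training) ||
                   sum(x == "") > 0.95*nrow(training))
# A column whose test could not be evaluated (NA result) is dropped as well.
bad.col[is.na(bad.col)] = FALSE
training = training[, bad.col]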
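# The same kind of filter can be written with column-wise fractions, which avoids
# the NA results. A minimal sketch on a toy data frame (every toy.* name is
# hypothetical and the data are made up for illustration):
toy.df = data.frame(a = 1:100,
                    b = c(1, 2, rep(NA, 98)),
                    c = c("x", "y", rep("", 98)),
                    stringsAsFactors = FALSE)
# Fraction of missing values and fraction of empty strings, per column.
toy.na.frac = colMeans(is.na(toy.df))
toy.blank.frac = colMeans(!is.na(toy.df) & toy.df == "")
# Keep a column only if neither fraction exceeds 95%.
toy.keep = toy.na.frac <= 0.95 & toy.blank.frac <= 0.95
toy.clean = toy.df[, toy.keep, drop = FALSE]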
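# Check for near zero variance predictors. The metrics are stored for inspection only;
# no columns are removed in this step.
suppressMessages(require(caret))
training.zeroVar = nearZeroVar(training, saveMetrics = TRUE)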
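# If near zero variance columns did need to be dropped, nearZeroVar() without
# saveMetrics returns their column indices. A minimal sketch on made-up data
# (every toy.* name is hypothetical):
suppressMessages(library(caret))
toy.nzv.df = data.frame(constant = rep(1, 50),
                        varied = seq_len(50),
                        rare = c(rep(0, 49), 1))
toy.nzv = nearZeroVar(toy.nzv.df)
# Guard against the empty case: indexing with -integer(0) would drop every column.
toy.nzv.kept = if (length(toy.nzv) > 0) toy.nzv.df[, -toy.nzv, drop = FALSE] else toy.nzv.df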
# Remove incomplete rows
training = training[complete.cases(training), ]
# Assessing the Data
str(training)
## 'data.frame': 19622 obs. of 53 variables:
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 0 -0.02 0 0 0 0 0 0 ...
## $ accel_dumbbell_x : int -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
## $ accel_dumbbell_y : int 47 47 46 48 48 48 47 46 47 48 ...
## $ accel_dumbbell_z : int -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
## $ magnet_dumbbell_x : int -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
## $ magnet_dumbbell_y : int 293 296 298 303 292 294 295 300 292 291 ...
## $ magnet_dumbbell_z : num -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
## $ roll_forearm : num 28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
## $ yaw_forearm : num -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
## $ gyros_forearm_y : num 0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
## $ gyros_forearm_z : num -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
## $ accel_forearm_x : int 192 192 196 189 189 193 195 193 193 190 ...
## $ accel_forearm_y : int 203 203 204 206 206 203 205 205 204 205 ...
## $ accel_forearm_z : int -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
## $ magnet_forearm_x : int -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
## $ magnet_forearm_y : num 654 661 658 658 655 660 659 660 653 656 ...
## $ magnet_forearm_z : num 476 473 469 469 473 478 470 474 476 473 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(training)
## roll_belt pitch_belt yaw_belt total_accel_belt
## Min. :-28.90 Min. :-55.8000 Min. :-180.00 Min. : 0.00
## 1st Qu.: 1.10 1st Qu.: 1.7600 1st Qu.: -88.30 1st Qu.: 3.00
## Median :113.00 Median : 5.2800 Median : -13.00 Median :17.00
## Mean : 64.41 Mean : 0.3053 Mean : -11.21 Mean :11.31
## 3rd Qu.:123.00 3rd Qu.: 14.9000 3rd Qu.: 12.90 3rd Qu.:18.00
## Max. :162.00 Max. : 60.3000 Max. : 179.00 Max. :29.00
## gyros_belt_x gyros_belt_y gyros_belt_z
## Min. :-1.040000 Min. :-0.64000 Min. :-1.4600
## 1st Qu.:-0.030000 1st Qu.: 0.00000 1st Qu.:-0.2000
## Median : 0.030000 Median : 0.02000 Median :-0.1000
## Mean :-0.005592 Mean : 0.03959 Mean :-0.1305
## 3rd Qu.: 0.110000 3rd Qu.: 0.11000 3rd Qu.:-0.0200
## Max. : 2.220000 Max. : 0.64000 Max. : 1.6200
## accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
## Min. :-120.000 Min. :-69.00 Min. :-275.00 Min. :-52.0
## 1st Qu.: -21.000 1st Qu.: 3.00 1st Qu.:-162.00 1st Qu.: 9.0
## Median : -15.000 Median : 35.00 Median :-152.00 Median : 35.0
## Mean : -5.595 Mean : 30.15 Mean : -72.59 Mean : 55.6
## 3rd Qu.: -5.000 3rd Qu.: 61.00 3rd Qu.: 27.00 3rd Qu.: 59.0
## Max. : 85.000 Max. :164.00 Max. : 105.00 Max. :485.0
## magnet_belt_y magnet_belt_z roll_arm pitch_arm
## Min. :354.0 Min. :-623.0 Min. :-180.00 Min. :-88.800
## 1st Qu.:581.0 1st Qu.:-375.0 1st Qu.: -31.77 1st Qu.:-25.900
## Median :601.0 Median :-320.0 Median : 0.00 Median : 0.000
## Mean :593.7 Mean :-345.5 Mean : 17.83 Mean : -4.612
## 3rd Qu.:610.0 3rd Qu.:-306.0 3rd Qu.: 77.30 3rd Qu.: 11.200
## Max. :673.0 Max. : 293.0 Max. : 180.00 Max. : 88.500
## yaw_arm total_accel_arm gyros_arm_x gyros_arm_y
## Min. :-180.0000 Min. : 1.00 Min. :-6.37000 Min. :-3.4400
## 1st Qu.: -43.1000 1st Qu.:17.00 1st Qu.:-1.33000 1st Qu.:-0.8000
## Median : 0.0000 Median :27.00 Median : 0.08000 Median :-0.2400
## Mean : -0.6188 Mean :25.51 Mean : 0.04277 Mean :-0.2571
## 3rd Qu.: 45.8750 3rd Qu.:33.00 3rd Qu.: 1.57000 3rd Qu.: 0.1400
## Max. : 180.0000 Max. :66.00 Max. : 4.87000 Max. : 2.8400
## gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## Min. :-2.3300 Min. :-404.00 Min. :-318.0 Min. :-636.00
## 1st Qu.:-0.0700 1st Qu.:-242.00 1st Qu.: -54.0 1st Qu.:-143.00
## Median : 0.2300 Median : -44.00 Median : 14.0 Median : -47.00
## Mean : 0.2695 Mean : -60.24 Mean : 32.6 Mean : -71.25
## 3rd Qu.: 0.7200 3rd Qu.: 84.00 3rd Qu.: 139.0 3rd Qu.: 23.00
## Max. : 3.0200 Max. : 437.00 Max. : 308.0 Max. : 292.00
## magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell
## Min. :-584.0 Min. :-392.0 Min. :-597.0 Min. :-153.71
## 1st Qu.:-300.0 1st Qu.: -9.0 1st Qu.: 131.2 1st Qu.: -18.49
## Median : 289.0 Median : 202.0 Median : 444.0 Median : 48.17
## Mean : 191.7 Mean : 156.6 Mean : 306.5 Mean : 23.84
## 3rd Qu.: 637.0 3rd Qu.: 323.0 3rd Qu.: 545.0 3rd Qu.: 67.61
## Max. : 782.0 Max. : 583.0 Max. : 694.0 Max. : 153.55
## pitch_dumbbell yaw_dumbbell total_accel_dumbbell
## Min. :-149.59 Min. :-150.871 Min. : 0.00
## 1st Qu.: -40.89 1st Qu.: -77.644 1st Qu.: 4.00
## Median : -20.96 Median : -3.324 Median :10.00
## Mean : -10.78 Mean : 1.674 Mean :13.72
## 3rd Qu.: 17.50 3rd Qu.: 79.643 3rd Qu.:19.00
## Max. : 149.40 Max. : 154.952 Max. :58.00
## gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
## Min. :-204.0000 Min. :-2.10000 Min. : -2.380
## 1st Qu.: -0.0300 1st Qu.:-0.14000 1st Qu.: -0.310
## Median : 0.1300 Median : 0.03000 Median : -0.130
## Mean : 0.1611 Mean : 0.04606 Mean : -0.129
## 3rd Qu.: 0.3500 3rd Qu.: 0.21000 3rd Qu.: 0.030
## Max. : 2.2200 Max. :52.00000 Max. :317.000
## accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x
## Min. :-419.00 Min. :-189.00 Min. :-334.00 Min. :-643.0
## 1st Qu.: -50.00 1st Qu.: -8.00 1st Qu.:-142.00 1st Qu.:-535.0
## Median : -8.00 Median : 41.50 Median : -1.00 Median :-479.0
## Mean : -28.62 Mean : 52.63 Mean : -38.32 Mean :-328.5
## 3rd Qu.: 11.00 3rd Qu.: 111.00 3rd Qu.: 38.00 3rd Qu.:-304.0
## Max. : 235.00 Max. : 315.00 Max. : 318.00 Max. : 592.0
## magnet_dumbbell_y magnet_dumbbell_z roll_forearm pitch_forearm
## Min. :-3600 Min. :-262.00 Min. :-180.0000 Min. :-72.50
## 1st Qu.: 231 1st Qu.: -45.00 1st Qu.: -0.7375 1st Qu.: 0.00
## Median : 311 Median : 13.00 Median : 21.7000 Median : 9.24
## Mean : 221 Mean : 46.05 Mean : 33.8265 Mean : 10.71
## 3rd Qu.: 390 3rd Qu.: 95.00 3rd Qu.: 140.0000 3rd Qu.: 28.40
## Max. : 633 Max. : 452.00 Max. : 180.0000 Max. : 89.80
## yaw_forearm total_accel_forearm gyros_forearm_x
## Min. :-180.00 Min. : 0.00 Min. :-22.000
## 1st Qu.: -68.60 1st Qu.: 29.00 1st Qu.: -0.220
## Median : 0.00 Median : 36.00 Median : 0.050
## Mean : 19.21 Mean : 34.72 Mean : 0.158
## 3rd Qu.: 110.00 3rd Qu.: 41.00 3rd Qu.: 0.560
## Max. : 180.00 Max. :108.00 Max. : 3.970
## gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
## Min. : -7.02000 Min. : -8.0900 Min. :-498.00 Min. :-632.0
## 1st Qu.: -1.46000 1st Qu.: -0.1800 1st Qu.:-178.00 1st Qu.: 57.0
## Median : 0.03000 Median : 0.0800 Median : -57.00 Median : 201.0
## Mean : 0.07517 Mean : 0.1512 Mean : -61.65 Mean : 163.7
## 3rd Qu.: 1.62000 3rd Qu.: 0.4900 3rd Qu.: 76.00 3rd Qu.: 312.0
## Max. :311.00000 Max. :231.0000 Max. : 477.00 Max. : 923.0
## accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z
## Min. :-446.00 Min. :-1280.0 Min. :-896.0 Min. :-973.0
## 1st Qu.:-182.00 1st Qu.: -616.0 1st Qu.: 2.0 1st Qu.: 191.0
## Median : -39.00 Median : -378.0 Median : 591.0 Median : 511.0
## Mean : -55.29 Mean : -312.6 Mean : 380.1 Mean : 393.6
## 3rd Qu.: 26.00 3rd Qu.: -73.0 3rd Qu.: 737.0 3rd Qu.: 653.0
## Max. : 291.00 Max. : 672.0 Max. :1480.0 Max. :1090.0
## classe
## A:5580
## B:3797
## C:3422
## D:3216
## E:3607
##
# DATA ANALYSIS
# Assessing correlated columns
suppressMessages(require(corrr))
rdf = correlate(subset(training, select = -c(classe)))
rplot(rdf, print_cor = T, legend = T, colours = heat.colors(20, alpha = .5))
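# The plot only displays correlated pairs; nothing is removed on that basis in this
# script. If one wanted to drop highly correlated predictors, caret's
# findCorrelation() is one option. A minimal sketch on made-up data (every toy.*
# name is hypothetical and the 0.9 cutoff is an assumption):
suppressMessages(library(caret))
toy.x1 = 1:20
toy.m = cbind(x1 = toy.x1,
              x2 = 2 * toy.x1 + cos(toy.x1),  # almost a linear copy of x1
              x3 = sin(toy.x1))               # unrelated to x1
# Indices of columns suggested for removal at the chosen cutoff.
toy.high = findCorrelation(cor(toy.m), cutoff = 0.9)
toy.reduced = if (length(toy.high) > 0) toy.m[, -toy.high, drop = FALSE] else toy.m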

# Using Random Forest
set.seed(2964)
# Partition rows into training and cross-validation sets.
# In this pretest the model is fitted on only 5% of the training data; the remaining 95%
# (crossv) is held out. The aim is to calibrate a model quickly and identify the most
# important variables. After this step, only the most important variables are used to
# train on the full dataset.
inTrain = createDataPartition(training$classe, p = 0.05, list = F)
crossv = training[-inTrain, ]
training2 = training[inTrain, ]
dim(crossv); dim(training2)
## [1] 18639 53
## [1] 983 53
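# Note: number = 25 is given to train() rather than to trainControl(), so it is passed
# on to randomForest() (see the Call line in the output below) and the resampling uses
# trainControl()'s default of 10 folds for "cv". To request 25 folds, write
# trainControl(method = "cv", number = 25).
mod = suppressMessages(
  train(classe ~ ., method = "rf", data = training2,
        trControl = trainControl(method = "cv"), number = 25)
)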
mod$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, number = 25)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 9.56%
## Confusion matrix:
## A B C D E class.error
## A 270 2 1 5 1 0.03225806
## B 16 161 10 2 1 0.15263158
## C 1 8 158 5 0 0.08139535
## D 2 2 14 138 5 0.14285714
## E 0 5 9 5 162 0.10497238
pred.test = predict(mod, crossv); confusionMatrix(pred.test, crossv$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 5158 320 28 32 24
## B 35 2951 145 12 120
## C 37 254 2959 271 145
## D 64 58 117 2674 58
## E 7 24 1 66 3079
##
## Overall Statistics
##
## Accuracy : 0.9025
## 95% CI : (0.8981, 0.9067)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8765
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9730 0.8181 0.9105 0.8753 0.8987
## Specificity 0.9697 0.9792 0.9541 0.9809 0.9936
## Pos Pred Value 0.9274 0.9044 0.8071 0.9000 0.9692
## Neg Pred Value 0.9891 0.9573 0.9806 0.9757 0.9776
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2767 0.1583 0.1588 0.1435 0.1652
## Detection Prevalence 0.2984 0.1751 0.1967 0.1594 0.1704
## Balanced Accuracy 0.9714 0.8987 0.9323 0.9281 0.9461
# As might be expected, the accuracy is not high at this stage (Accuracy: 0.9025, 95% CI:
# (0.8981, 0.9067)), because only a very small fraction of the data was used for training.
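# The accuracy reported above can be reproduced by hand from the confusion matrix:
# correct predictions lie on the diagonal. A short check (cm.pre, acc.pre and
# sens.pre are hypothetical names; the counts are copied from the output above):
cm.pre = matrix(c(5158,  320,   28,   32,   24,
                    35, 2951,  145,   12,  120,
                    37,  254, 2959,  271,  145,
                    64,   58,  117, 2674,   58,
                     7,   24,    1,   66, 3079),
                nrow = 5, byrow = TRUE,
                dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5]))
# Overall accuracy: 16821 correct out of 18639 rows.
acc.pre = sum(diag(cm.pre)) / sum(cm.pre)
# Sensitivity per class: correct predictions divided by the reference column totals.
sens.pre = diag(cm.pre) / colSums(cm.pre)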
# Variable importance, used for fine-tuning
mod.varImp = varImp(mod)
plot(mod.varImp, main = "Importance of all Variables for 'rf' model")

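# According to the plot "Importance of all Variables for 'rf' model", a cutoff of 35%
# importance is a reasonable filter: variables above it are candidates for the final model.
# The predictors are selected by name and the response is kept explicitly, because the
# importance table has one row per predictor and no row for classe.
mod.col = rownames(mod.varImp$importance)[mod.varImp$importance$Overall > 35]
training = training[, c(mod.col, "classe")]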
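# Fit the fine-tuned model on the original training set (all rows, selected variables).
# Note: number = 25 is passed to train(), not trainControl(), so it reaches
# randomForest() and the resampling uses the default of 10 folds.
mod.ft = suppressMessages(
  train(classe ~ ., method = "rf", data = training,
        trControl = trainControl(method = "cv"), number = 25)
)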
pred.ft.test = predict(mod.ft, crossv)
confusionMatrix(pred.ft.test, crossv$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 5296 1 0 0 0
## B 0 3606 2 0 0
## C 5 0 3248 0 0
## D 0 0 0 3055 0
## E 0 0 0 0 3426
##
## Overall Statistics
##
## Accuracy : 0.9996
## 95% CI : (0.9992, 0.9998)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9995
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9997 0.9994 1.0000 1.0000
## Specificity 0.9999 0.9999 0.9997 1.0000 1.0000
## Pos Pred Value 0.9998 0.9994 0.9985 1.0000 1.0000
## Neg Pred Value 0.9996 0.9999 0.9999 1.0000 1.0000
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2841 0.1935 0.1743 0.1639 0.1838
## Detection Prevalence 0.2842 0.1936 0.1745 0.1639 0.1838
## Balanced Accuracy 0.9995 0.9998 0.9995 1.0000 1.0000
mod.ft$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, number = 25)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 1.38%
## Confusion matrix:
## A B C D E class.error
## A 5525 20 18 13 4 0.009856631
## B 32 3699 51 14 1 0.025809850
## C 5 14 3381 22 0 0.011981297
## D 3 2 30 3178 3 0.011815920
## E 4 18 7 10 3568 0.010812309
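# The fine-tuned model performs very well: accuracy > 0.99 on crossv with a very narrow
# confidence interval, and an OOB estimate of error rate < 2%. Note that crossv is part
# of the data used to fit mod.ft, so the crossv accuracy is optimistic; the OOB estimate
# (1.38%) is the better guide to the expected out-of-sample error.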
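# The OOB estimate printed by mod.ft$finalModel above is the share of off-diagonal
# counts in its confusion matrix, where rows are the true classes. A short check
# (cm.oob, oob.err and class.err are hypothetical names; the counts are copied from
# the output above):
cm.oob = matrix(c(5525,   20,   18,   13,    4,
                    32, 3699,   51,   14,    1,
                     5,   14, 3381,   22,    0,
                     3,    2,   30, 3178,    3,
                     4,   18,    7,   10, 3568),
                nrow = 5, byrow = TRUE,
                dimnames = list(LETTERS[1:5], LETTERS[1:5]))
# Overall OOB error: 271 misclassified out of 19622 rows.
oob.err = 1 - sum(diag(cm.oob)) / sum(cm.oob)
# Per-class error, matching the class.error column: computed row by row.
class.err = 1 - diag(cm.oob) / rowSums(cm.oob)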
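# Let's get a beautiful decision tree (a single tree fitted with the tree package for
# illustration; it is not the random forest itself)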
suppressMessages(require(tree))
tr = tree(classe ~ . , data = training)
plot(tr); text(tr, cex = .75)

# Prepare the submission (using code provided by Coursera).
# The test set goes through the same column cleaning as the training set.
testing = read.csv("pml-testing.csv")
testing = testing[, -c(1:7)]
# Despite its name, bad.col is TRUE for the columns that are kept.
bad.col = !apply(testing, 2, function(x) sum(is.na(x)) > 0.95*nrow(testing) ||
                   sum(x == "") > 0.95*nrow(testing))
bad.col[is.na(bad.col)] = FALSE
testing = testing[, bad.col]
testing = testing[complete.cases(testing), ]
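# For a model fitted through the formula interface, predict() looks predictors up by
# name, so one alternative to repeating the column filter is to check that the test
# set contains every predictor used in training. A minimal sketch on made-up data
# (every toy.* name is hypothetical):
toy.train = data.frame(roll = c(1, 2), pitch = c(3, 4), classe = factor(c("A", "B")))
toy.test = data.frame(X = 1:2, pitch = c(5, 6), roll = c(7, 8), problem_id = 1:2)
# Predictors are all training columns except the response.
toy.predictors = setdiff(names(toy.train), "classe")
# Stop early if the test set lacks any of them.
toy.missing = setdiff(toy.predictors, names(toy.test))
stopifnot(length(toy.missing) == 0)
# Select the predictors by name, in the training order.
toy.test.aligned = toy.test[, toy.predictors, drop = FALSE]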
pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
x = testing
answers = predict(mod.ft, newdata = x); answers
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
pml_write_files(answers)
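# A possible variant of the helper above, sketched under assumptions: toy.write.files
# and its out.dir argument are hypothetical names, not part of the Coursera code. It
# writes one file per prediction into a chosen directory instead of the working
# directory and returns the paths so the files can be checked.
toy.write.files = function(x, out.dir = tempdir()) {
  paths = file.path(out.dir, paste0("problem_id_", seq_along(x), ".txt"))
  for (i in seq_along(x)) {
    write.table(x[i], file = paths[i], quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
  invisible(paths)
}
toy.answers = factor(c("B", "A", "E"), levels = c("A", "B", "C", "D", "E"))
toy.paths = toy.write.files(toy.answers)
# Read the files back to confirm that each one holds a single class letter.
toy.read = vapply(toy.paths, readLines, character(1), USE.NAMES = FALSE)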
# Reference
# [1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity
# Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference
# in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.
# [2] Revolution Analytics and Steve Weston (2015). doMC: Foreach Parallel Adaptor for
# 'parallel'. R package version 1.3.4. https://CRAN.R-project.org/package=doMC
# [3] Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer,
# Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael
# Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang and Can Candan. (2016).
# caret: Classification and Regression Training. R package version 6.0-71.
# https://CRAN.R-project.org/package=caret
# [4] Simon Jackson (2016). corrr: Correlations in R. R package version 0.2.1.
# https://CRAN.R-project.org/package=corrr
# [5] Brian Ripley (2016). tree: Classification and Regression Trees. R package version
# 1.0-37. https://CRAN.R-project.org/package=tree