final_projectML.R

## Instructions

# One thing that people regularly do is quantify how much of a particular activity 
# they do, but they rarely quantify how well they do it. In this project, your goal 
# will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 
# 6 participants.

## Review criterialess 

## What you should submit

# The goal of your project is to predict the manner in which they did the exercise. 
# This is the "classe" variable in the training set. You may use any of the other 
# variables to predict with. You should create a report describing how you built 
# your model, how you used cross validation, what you think the expected out of 
# sample error is, and why you made the choices you did. You will also use your 
# prediction model to predict 20 different test cases.

## Peer Review Portion

# Your submission for the Peer Review portion should consist of a link to a Github 
# repo with your R markdown and compiled HTML file describing your analysis. Please 
# constrain the text of the writeup to < 2000 words and the number of figures to be 
# less than 5. It will make it easier for the graders if you submit a repo with a 
# gh-pages branch so the HTML page can be viewed online (and you always want to make 
# it easy on graders :-).

## Course Project Prediction Quiz Portion

# Apply your machine learning algorithm to the 20 test cases available in the test 
# data above and submit your predictions in appropriate format to the Course Project 
# Prediction Quiz for automated grading.

## Reproducibility

# Due to security concerns with the exchange of R code, your code will not be run 
# during the evaluation by your classmates. Please be sure that if they download 
# the repo, they will be able to view the compiled HTML version of your analysis.

## Prediction Assignment Writeupless 

## Background

# Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to 
# collect a large amount of data about personal activity relatively inexpensively. 
# These type of devices are part of the quantified self movement – a group of 
# enthusiasts who take measurements about themselves regularly to improve their 
# health, to find patterns in their behavior, or because they are tech geeks. One 
# thing that people regularly do is quantify how much of a particular activity they 
# do, but they rarely quantify how well they do it. In this project, your goal will 
# be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 
# participants. They were asked to perform barbell lifts correctly and incorrectly 
# in 5 different ways. More information is available from the website here: 
# http://groupware.les.inf.puc-rio.br/har 
# (see the section on the Weight Lifting Exercise Dataset).

## Data

# The training data for this project are available here:
# https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
# https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

# The data for this project come from this source: 
# http://groupware.les.inf.puc-rio.br/har
# If you use the document you create for this class for any purpose please cite them 
# as they have been very generous in allowing their data to be used for this kind of 
# assignment.

# Choosing the prediction algorithm

# Steps Taken
# 1. Tidy data. Remove columns with little/no data.
# 2. Create Training and test data from traing data for cross validation checking.
# 3. Trial one method only: Random Forrest using Train Control Method "cv" 
# and number of resampling 25.
# 4. Fine tune model through combinations of above methods, reduction of input 
# variables or similar. The fine tuning will take into account accuracy first and 
# speed of analysis second.

setwd("/home/pcbrom/Dropbox/Trabalho e Estudo/Cursos Livres/Machine Learning/Curse Project")

# Do Multiple Cores
suppressMessages(require(doMC)); registerDoMC(cores = 4)

# GET DATA

# IMPORT TRAINING AND TESTING

training = read.csv("pml-training.csv")

# Eliminating useless variables

# Note that:
# "X"
# "user_name"               
# "raw_timestamp_part_1"
# "raw_timestamp_part_2"    
# "cvtd_timestamp"
# "new_window"              
# "num_window"

# They are variables that can be added to the model, but every model by adding variables,
# even if it is a numeric sequence with spurious correlation with the response variable,
# increase the hit rate. On the other hand are variables that do not qualitatively 
# contribute to the system, ie, it is not coherent to keep them only to improve the 
# accuracy of the model and say that everything is fine.

training = training[, -c(1:7)]

# Remove bad columns

bad.col = !apply(training, 2, function(x) sum(is.na(x)) > 0.95*nrow(training) || 
                     sum(x == "") > 0.95*nrow(training))
bad.col[is.na(bad.col) == T] = F
training = training[, bad.col]

# Remove near zero values

suppressMessages(require(caret))

training.zeroVar = nearZeroVar(training, saveMetrics = T)

# Remove incomplete lines

training = training[complete.cases(training), ]

# Assessing the Data

str(training)

## 'data.frame':    19622 obs. of  53 variables:
##  $ roll_belt           : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y        : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x        : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y        : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z        : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x       : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y       : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z       : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm           : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y         : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z         : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x         : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y         : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z         : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x        : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y        : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z        : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ roll_dumbbell       : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell      : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell        : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 0 0 -0.02 0 0 0 0 0 0 ...
##  $ accel_dumbbell_x    : int  -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
##  $ accel_dumbbell_y    : int  47 47 46 48 48 48 47 46 47 48 ...
##  $ accel_dumbbell_z    : int  -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
##  $ magnet_dumbbell_x   : int  -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
##  $ magnet_dumbbell_y   : int  293 296 298 303 292 294 295 300 292 291 ...
##  $ magnet_dumbbell_z   : num  -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
##  $ roll_forearm        : num  28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
##  $ yaw_forearm         : num  -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
##  $ gyros_forearm_y     : num  0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
##  $ gyros_forearm_z     : num  -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
##  $ accel_forearm_x     : int  192 192 196 189 189 193 195 193 193 190 ...
##  $ accel_forearm_y     : int  203 203 204 206 206 203 205 205 204 205 ...
##  $ accel_forearm_z     : int  -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
##  $ magnet_forearm_x    : int  -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
##  $ magnet_forearm_y    : num  654 661 658 658 655 660 659 660 653 656 ...
##  $ magnet_forearm_z    : num  476 473 469 469 473 478 470 474 476 473 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

summary(training)

##    roll_belt        pitch_belt          yaw_belt       total_accel_belt
##  Min.   :-28.90   Min.   :-55.8000   Min.   :-180.00   Min.   : 0.00   
##  1st Qu.:  1.10   1st Qu.:  1.7600   1st Qu.: -88.30   1st Qu.: 3.00   
##  Median :113.00   Median :  5.2800   Median : -13.00   Median :17.00   
##  Mean   : 64.41   Mean   :  0.3053   Mean   : -11.21   Mean   :11.31   
##  3rd Qu.:123.00   3rd Qu.: 14.9000   3rd Qu.:  12.90   3rd Qu.:18.00   
##  Max.   :162.00   Max.   : 60.3000   Max.   : 179.00   Max.   :29.00   
##   gyros_belt_x        gyros_belt_y       gyros_belt_z    
##  Min.   :-1.040000   Min.   :-0.64000   Min.   :-1.4600  
##  1st Qu.:-0.030000   1st Qu.: 0.00000   1st Qu.:-0.2000  
##  Median : 0.030000   Median : 0.02000   Median :-0.1000  
##  Mean   :-0.005592   Mean   : 0.03959   Mean   :-0.1305  
##  3rd Qu.: 0.110000   3rd Qu.: 0.11000   3rd Qu.:-0.0200  
##  Max.   : 2.220000   Max.   : 0.64000   Max.   : 1.6200  
##   accel_belt_x       accel_belt_y     accel_belt_z     magnet_belt_x  
##  Min.   :-120.000   Min.   :-69.00   Min.   :-275.00   Min.   :-52.0  
##  1st Qu.: -21.000   1st Qu.:  3.00   1st Qu.:-162.00   1st Qu.:  9.0  
##  Median : -15.000   Median : 35.00   Median :-152.00   Median : 35.0  
##  Mean   :  -5.595   Mean   : 30.15   Mean   : -72.59   Mean   : 55.6  
##  3rd Qu.:  -5.000   3rd Qu.: 61.00   3rd Qu.:  27.00   3rd Qu.: 59.0  
##  Max.   :  85.000   Max.   :164.00   Max.   : 105.00   Max.   :485.0  
##  magnet_belt_y   magnet_belt_z       roll_arm         pitch_arm      
##  Min.   :354.0   Min.   :-623.0   Min.   :-180.00   Min.   :-88.800  
##  1st Qu.:581.0   1st Qu.:-375.0   1st Qu.: -31.77   1st Qu.:-25.900  
##  Median :601.0   Median :-320.0   Median :   0.00   Median :  0.000  
##  Mean   :593.7   Mean   :-345.5   Mean   :  17.83   Mean   : -4.612  
##  3rd Qu.:610.0   3rd Qu.:-306.0   3rd Qu.:  77.30   3rd Qu.: 11.200  
##  Max.   :673.0   Max.   : 293.0   Max.   : 180.00   Max.   : 88.500  
##     yaw_arm          total_accel_arm  gyros_arm_x        gyros_arm_y     
##  Min.   :-180.0000   Min.   : 1.00   Min.   :-6.37000   Min.   :-3.4400  
##  1st Qu.: -43.1000   1st Qu.:17.00   1st Qu.:-1.33000   1st Qu.:-0.8000  
##  Median :   0.0000   Median :27.00   Median : 0.08000   Median :-0.2400  
##  Mean   :  -0.6188   Mean   :25.51   Mean   : 0.04277   Mean   :-0.2571  
##  3rd Qu.:  45.8750   3rd Qu.:33.00   3rd Qu.: 1.57000   3rd Qu.: 0.1400  
##  Max.   : 180.0000   Max.   :66.00   Max.   : 4.87000   Max.   : 2.8400  
##   gyros_arm_z       accel_arm_x       accel_arm_y      accel_arm_z     
##  Min.   :-2.3300   Min.   :-404.00   Min.   :-318.0   Min.   :-636.00  
##  1st Qu.:-0.0700   1st Qu.:-242.00   1st Qu.: -54.0   1st Qu.:-143.00  
##  Median : 0.2300   Median : -44.00   Median :  14.0   Median : -47.00  
##  Mean   : 0.2695   Mean   : -60.24   Mean   :  32.6   Mean   : -71.25  
##  3rd Qu.: 0.7200   3rd Qu.:  84.00   3rd Qu.: 139.0   3rd Qu.:  23.00  
##  Max.   : 3.0200   Max.   : 437.00   Max.   : 308.0   Max.   : 292.00  
##   magnet_arm_x     magnet_arm_y     magnet_arm_z    roll_dumbbell    
##  Min.   :-584.0   Min.   :-392.0   Min.   :-597.0   Min.   :-153.71  
##  1st Qu.:-300.0   1st Qu.:  -9.0   1st Qu.: 131.2   1st Qu.: -18.49  
##  Median : 289.0   Median : 202.0   Median : 444.0   Median :  48.17  
##  Mean   : 191.7   Mean   : 156.6   Mean   : 306.5   Mean   :  23.84  
##  3rd Qu.: 637.0   3rd Qu.: 323.0   3rd Qu.: 545.0   3rd Qu.:  67.61  
##  Max.   : 782.0   Max.   : 583.0   Max.   : 694.0   Max.   : 153.55  
##  pitch_dumbbell     yaw_dumbbell      total_accel_dumbbell
##  Min.   :-149.59   Min.   :-150.871   Min.   : 0.00       
##  1st Qu.: -40.89   1st Qu.: -77.644   1st Qu.: 4.00       
##  Median : -20.96   Median :  -3.324   Median :10.00       
##  Mean   : -10.78   Mean   :   1.674   Mean   :13.72       
##  3rd Qu.:  17.50   3rd Qu.:  79.643   3rd Qu.:19.00       
##  Max.   : 149.40   Max.   : 154.952   Max.   :58.00       
##  gyros_dumbbell_x    gyros_dumbbell_y   gyros_dumbbell_z 
##  Min.   :-204.0000   Min.   :-2.10000   Min.   : -2.380  
##  1st Qu.:  -0.0300   1st Qu.:-0.14000   1st Qu.: -0.310  
##  Median :   0.1300   Median : 0.03000   Median : -0.130  
##  Mean   :   0.1611   Mean   : 0.04606   Mean   : -0.129  
##  3rd Qu.:   0.3500   3rd Qu.: 0.21000   3rd Qu.:  0.030  
##  Max.   :   2.2200   Max.   :52.00000   Max.   :317.000  
##  accel_dumbbell_x  accel_dumbbell_y  accel_dumbbell_z  magnet_dumbbell_x
##  Min.   :-419.00   Min.   :-189.00   Min.   :-334.00   Min.   :-643.0   
##  1st Qu.: -50.00   1st Qu.:  -8.00   1st Qu.:-142.00   1st Qu.:-535.0   
##  Median :  -8.00   Median :  41.50   Median :  -1.00   Median :-479.0   
##  Mean   : -28.62   Mean   :  52.63   Mean   : -38.32   Mean   :-328.5   
##  3rd Qu.:  11.00   3rd Qu.: 111.00   3rd Qu.:  38.00   3rd Qu.:-304.0   
##  Max.   : 235.00   Max.   : 315.00   Max.   : 318.00   Max.   : 592.0   
##  magnet_dumbbell_y magnet_dumbbell_z  roll_forearm       pitch_forearm   
##  Min.   :-3600     Min.   :-262.00   Min.   :-180.0000   Min.   :-72.50  
##  1st Qu.:  231     1st Qu.: -45.00   1st Qu.:  -0.7375   1st Qu.:  0.00  
##  Median :  311     Median :  13.00   Median :  21.7000   Median :  9.24  
##  Mean   :  221     Mean   :  46.05   Mean   :  33.8265   Mean   : 10.71  
##  3rd Qu.:  390     3rd Qu.:  95.00   3rd Qu.: 140.0000   3rd Qu.: 28.40  
##  Max.   :  633     Max.   : 452.00   Max.   : 180.0000   Max.   : 89.80  
##   yaw_forearm      total_accel_forearm gyros_forearm_x  
##  Min.   :-180.00   Min.   :  0.00      Min.   :-22.000  
##  1st Qu.: -68.60   1st Qu.: 29.00      1st Qu.: -0.220  
##  Median :   0.00   Median : 36.00      Median :  0.050  
##  Mean   :  19.21   Mean   : 34.72      Mean   :  0.158  
##  3rd Qu.: 110.00   3rd Qu.: 41.00      3rd Qu.:  0.560  
##  Max.   : 180.00   Max.   :108.00      Max.   :  3.970  
##  gyros_forearm_y     gyros_forearm_z    accel_forearm_x   accel_forearm_y 
##  Min.   : -7.02000   Min.   : -8.0900   Min.   :-498.00   Min.   :-632.0  
##  1st Qu.: -1.46000   1st Qu.: -0.1800   1st Qu.:-178.00   1st Qu.:  57.0  
##  Median :  0.03000   Median :  0.0800   Median : -57.00   Median : 201.0  
##  Mean   :  0.07517   Mean   :  0.1512   Mean   : -61.65   Mean   : 163.7  
##  3rd Qu.:  1.62000   3rd Qu.:  0.4900   3rd Qu.:  76.00   3rd Qu.: 312.0  
##  Max.   :311.00000   Max.   :231.0000   Max.   : 477.00   Max.   : 923.0  
##  accel_forearm_z   magnet_forearm_x  magnet_forearm_y magnet_forearm_z
##  Min.   :-446.00   Min.   :-1280.0   Min.   :-896.0   Min.   :-973.0  
##  1st Qu.:-182.00   1st Qu.: -616.0   1st Qu.:   2.0   1st Qu.: 191.0  
##  Median : -39.00   Median : -378.0   Median : 591.0   Median : 511.0  
##  Mean   : -55.29   Mean   : -312.6   Mean   : 380.1   Mean   : 393.6  
##  3rd Qu.:  26.00   3rd Qu.:  -73.0   3rd Qu.: 737.0   3rd Qu.: 653.0  
##  Max.   : 291.00   Max.   :  672.0   Max.   :1480.0   Max.   :1090.0  
##  classe  
##  A:5580  
##  B:3797  
##  C:3422  
##  D:3216  
##  E:3607  
##

# DATA ANALYSIS

# Assessing correlated col

suppressMessages(require(corrr))

rdf = correlate(subset(training, select = -c(classe)))
rplot(rdf, print_cor = T, legend = T, colours = heat.colors(20, alpha = .5))

# Using Random Forest

set.seed(2964)

# Partition rows into training and crossvalidation

# In this pretest I made a model valuation adjustment using only 5% of the training database. 
# The aim is to quickly calibrate a model to separate the most important variables. After this 
# step we will use only the most significant with the full dataset training.

inTrain = createDataPartition(training$classe, p = 0.05, list = F)
crossv = training[-inTrain, ]
training2 = training[inTrain, ]

dim(crossv); dim(training2)

## [1] 18639    53

## [1] 983  53

mod = suppressMessages(
    train(classe ~ ., method = "rf", data = training2, 
          trControl = trainControl(method = "cv"), number = 25)
)

mod$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, number = 25) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 9.56%
## Confusion matrix:
##     A   B   C   D   E class.error
## A 270   2   1   5   1  0.03225806
## B  16 161  10   2   1  0.15263158
## C   1   8 158   5   0  0.08139535
## D   2   2  14 138   5  0.14285714
## E   0   5   9   5 162  0.10497238

pred.test = predict(mod, crossv); confusionMatrix(pred.test, crossv$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 5158  320   28   32   24
##          B   35 2951  145   12  120
##          C   37  254 2959  271  145
##          D   64   58  117 2674   58
##          E    7   24    1   66 3079
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9025          
##                  95% CI : (0.8981, 0.9067)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8765          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9730   0.8181   0.9105   0.8753   0.8987
## Specificity            0.9697   0.9792   0.9541   0.9809   0.9936
## Pos Pred Value         0.9274   0.9044   0.8071   0.9000   0.9692
## Neg Pred Value         0.9891   0.9573   0.9806   0.9757   0.9776
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2767   0.1583   0.1588   0.1435   0.1652
## Detection Prevalence   0.2984   0.1751   0.1967   0.1594   0.1704
## Balanced Accuracy      0.9714   0.8987   0.9323   0.9281   0.9461

# As might be expected, the accuracy is not high at this time Accuracy: Accuracy: 0.9025 and 95% 
# CI: (0.8981, 0.9067), for use only a trickle minimum database.

# Create fine tunning

mod.varImp = varImp(mod)
plot(mod.varImp, main = "Importance of all Variables for 'rf' model")

# According to the image "Importance of all Variables for 'rf' model" have a potential variable 
# filter with more than 35% of importance to be candidates of the final model.

mod.col = mod.varImp$importance > 35
training = training[, mod.col]

# Create FINE TUNNING, on original training set

mod.ft = suppressMessages(
    train(classe ~ ., method = "rf", data = training, 
          trControl = trainControl(method = "cv"), number = 25)
)

pred.ft.test = predict(mod.ft, crossv)
confusionMatrix(pred.ft.test, crossv$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 5296    1    0    0    0
##          B    0 3606    2    0    0
##          C    5    0 3248    0    0
##          D    0    0    0 3055    0
##          E    0    0    0    0 3426
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9996          
##                  95% CI : (0.9992, 0.9998)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9995          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9997   0.9994   1.0000   1.0000
## Specificity            0.9999   0.9999   0.9997   1.0000   1.0000
## Pos Pred Value         0.9998   0.9994   0.9985   1.0000   1.0000
## Neg Pred Value         0.9996   0.9999   0.9999   1.0000   1.0000
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2841   0.1935   0.1743   0.1639   0.1838
## Detection Prevalence   0.2842   0.1936   0.1745   0.1639   0.1838
## Balanced Accuracy      0.9995   0.9998   0.9995   1.0000   1.0000

mod.ft$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, number = 25) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 1.38%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 5525   20   18   13    4 0.009856631
## B   32 3699   51   14    1 0.025809850
## C    5   14 3381   22    0 0.011981297
## D    3    2   30 3178    3 0.011815920
## E    4   18    7   10 3568 0.010812309

# Thus the final model could not be better. The results are notable: Accuracy > 0.99,
# CI extremely precise and estimate of error rate: < 2%.

# Let's get a beautiful decision tree

suppressMessages(require(tree))

tr = tree(classe ~ . , data = training)
plot(tr); text(tr, cex = .75)

# Prepare the submission. (using COURSERA provided code)

testing = read.csv("pml-testing.csv")
testing = testing[, -c(1:7)]
bad.col = !apply(testing, 2, function(x) sum(is.na(x)) > 0.95*nrow(testing) || 
                     sum(x == "") > 0.95*nrow(testing))
bad.col[is.na(bad.col) == T] = F
testing = testing[, bad.col]
testing = testing[complete.cases(testing), ]

pml_write_files = function(x){
    n = length(x)
    for(i in 1:n){
        filename = paste0("problem_id_",i,".txt")
        write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
    }
}
x = testing

answers = predict(mod.ft, newdata = x); answers

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

pml_write_files(answers)

# Reference

# [1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity
# Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference
# in Cooperation with SIGCHI (Augmented Human '13) . Stuttgart, Germany: ACM SIGCHI, 2013.

# [2] Revolution Analytics and Steve Weston (2015). doMC: Foreach Parallel Adaptor for 
# 'parallel'. R package version 1.3.4. https://CRAN.R-project.org/package=doMC

# [3] Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, 
# Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael 
# Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang and Can Candan. (2016). 
# caret: Classification and Regression Training. R package version 6.0-71. 
# https://CRAN.R-project.org/package=caret

# [4] Simon Jackson (2016). corrr: Correlations in R. R package version 0.2.1. 
# https://CRAN.R-project.org/package=corrr

# [5] Brian Ripley (2016). tree: Classification and Regression Trees. R package version 
# 1.0-37. https://CRAN.R-project.org/package=tree

final_projectML.R

pcbrom

Fri Oct 14 21:32:20 2016