Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data Source

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.

Preprocessing

Here we set up the environment.

# packages we will need
        library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

        library(ggplot2)
        library(caret)

## Loading required package: lattice

        library(rattle)

## Loading required package: tibble

## Loading required package: bitops

## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

        library(rpart)

Loading the data

Here we download the data (if necessary) and then assign to data frames.

# declare the links
        Link1 <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
        Link2 <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# get the data and save
        # download.file(url = Link1, destfile = "./pml-training.csv", method = "curl")
        # download.file(url = Link2, destfile = "./pml-testing.csv", method = "curl")

# Load data sets
        # training_DS <- read.csv("./pml-training.csv", na.strings=c("NA","#DIV/0!",""))
        # testing_DS <- read.csv("./pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
        
# if already local use this
        training_DS<- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!",""))
        testing_DS<- read.csv("pml-testing.csv")

Exploratory data analysis

In this section of code, we take an exploratory look at the data to understand it’s names and size better.

        dim(training_DS)

## [1] 19622   160

        dim(testing_DS)

## [1]  20 160

        names(training_DS)

##   [1] "X"                        "user_name"               
##   [3] "raw_timestamp_part_1"     "raw_timestamp_part_2"    
##   [5] "cvtd_timestamp"           "new_window"              
##   [7] "num_window"               "roll_belt"               
##   [9] "pitch_belt"               "yaw_belt"                
##  [11] "total_accel_belt"         "kurtosis_roll_belt"      
##  [13] "kurtosis_picth_belt"      "kurtosis_yaw_belt"       
##  [15] "skewness_roll_belt"       "skewness_roll_belt.1"    
##  [17] "skewness_yaw_belt"        "max_roll_belt"           
##  [19] "max_picth_belt"           "max_yaw_belt"            
##  [21] "min_roll_belt"            "min_pitch_belt"          
##  [23] "min_yaw_belt"             "amplitude_roll_belt"     
##  [25] "amplitude_pitch_belt"     "amplitude_yaw_belt"      
##  [27] "var_total_accel_belt"     "avg_roll_belt"           
##  [29] "stddev_roll_belt"         "var_roll_belt"           
##  [31] "avg_pitch_belt"           "stddev_pitch_belt"       
##  [33] "var_pitch_belt"           "avg_yaw_belt"            
##  [35] "stddev_yaw_belt"          "var_yaw_belt"            
##  [37] "gyros_belt_x"             "gyros_belt_y"            
##  [39] "gyros_belt_z"             "accel_belt_x"            
##  [41] "accel_belt_y"             "accel_belt_z"            
##  [43] "magnet_belt_x"            "magnet_belt_y"           
##  [45] "magnet_belt_z"            "roll_arm"                
##  [47] "pitch_arm"                "yaw_arm"                 
##  [49] "total_accel_arm"          "var_accel_arm"           
##  [51] "avg_roll_arm"             "stddev_roll_arm"         
##  [53] "var_roll_arm"             "avg_pitch_arm"           
##  [55] "stddev_pitch_arm"         "var_pitch_arm"           
##  [57] "avg_yaw_arm"              "stddev_yaw_arm"          
##  [59] "var_yaw_arm"              "gyros_arm_x"             
##  [61] "gyros_arm_y"              "gyros_arm_z"             
##  [63] "accel_arm_x"              "accel_arm_y"             
##  [65] "accel_arm_z"              "magnet_arm_x"            
##  [67] "magnet_arm_y"             "magnet_arm_z"            
##  [69] "kurtosis_roll_arm"        "kurtosis_picth_arm"      
##  [71] "kurtosis_yaw_arm"         "skewness_roll_arm"       
##  [73] "skewness_pitch_arm"       "skewness_yaw_arm"        
##  [75] "max_roll_arm"             "max_picth_arm"           
##  [77] "max_yaw_arm"              "min_roll_arm"            
##  [79] "min_pitch_arm"            "min_yaw_arm"             
##  [81] "amplitude_roll_arm"       "amplitude_pitch_arm"     
##  [83] "amplitude_yaw_arm"        "roll_dumbbell"           
##  [85] "pitch_dumbbell"           "yaw_dumbbell"            
##  [87] "kurtosis_roll_dumbbell"   "kurtosis_picth_dumbbell" 
##  [89] "kurtosis_yaw_dumbbell"    "skewness_roll_dumbbell"  
##  [91] "skewness_pitch_dumbbell"  "skewness_yaw_dumbbell"   
##  [93] "max_roll_dumbbell"        "max_picth_dumbbell"      
##  [95] "max_yaw_dumbbell"         "min_roll_dumbbell"       
##  [97] "min_pitch_dumbbell"       "min_yaw_dumbbell"        
##  [99] "amplitude_roll_dumbbell"  "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell"   "total_accel_dumbbell"    
## [103] "var_accel_dumbbell"       "avg_roll_dumbbell"       
## [105] "stddev_roll_dumbbell"     "var_roll_dumbbell"       
## [107] "avg_pitch_dumbbell"       "stddev_pitch_dumbbell"   
## [109] "var_pitch_dumbbell"       "avg_yaw_dumbbell"        
## [111] "stddev_yaw_dumbbell"      "var_yaw_dumbbell"        
## [113] "gyros_dumbbell_x"         "gyros_dumbbell_y"        
## [115] "gyros_dumbbell_z"         "accel_dumbbell_x"        
## [117] "accel_dumbbell_y"         "accel_dumbbell_z"        
## [119] "magnet_dumbbell_x"        "magnet_dumbbell_y"       
## [121] "magnet_dumbbell_z"        "roll_forearm"            
## [123] "pitch_forearm"            "yaw_forearm"             
## [125] "kurtosis_roll_forearm"    "kurtosis_picth_forearm"  
## [127] "kurtosis_yaw_forearm"     "skewness_roll_forearm"   
## [129] "skewness_pitch_forearm"   "skewness_yaw_forearm"    
## [131] "max_roll_forearm"         "max_picth_forearm"       
## [133] "max_yaw_forearm"          "min_roll_forearm"        
## [135] "min_pitch_forearm"        "min_yaw_forearm"         
## [137] "amplitude_roll_forearm"   "amplitude_pitch_forearm" 
## [139] "amplitude_yaw_forearm"    "total_accel_forearm"     
## [141] "var_accel_forearm"        "avg_roll_forearm"        
## [143] "stddev_roll_forearm"      "var_roll_forearm"        
## [145] "avg_pitch_forearm"        "stddev_pitch_forearm"    
## [147] "var_pitch_forearm"        "avg_yaw_forearm"         
## [149] "stddev_yaw_forearm"       "var_yaw_forearm"         
## [151] "gyros_forearm_x"          "gyros_forearm_y"         
## [153] "gyros_forearm_z"          "accel_forearm_x"         
## [155] "accel_forearm_y"          "accel_forearm_z"         
## [157] "magnet_forearm_x"         "magnet_forearm_y"        
## [159] "magnet_forearm_z"         "classe"

        names(testing_DS)

##   [1] "X"                        "user_name"               
##   [3] "raw_timestamp_part_1"     "raw_timestamp_part_2"    
##   [5] "cvtd_timestamp"           "new_window"              
##   [7] "num_window"               "roll_belt"               
##   [9] "pitch_belt"               "yaw_belt"                
##  [11] "total_accel_belt"         "kurtosis_roll_belt"      
##  [13] "kurtosis_picth_belt"      "kurtosis_yaw_belt"       
##  [15] "skewness_roll_belt"       "skewness_roll_belt.1"    
##  [17] "skewness_yaw_belt"        "max_roll_belt"           
##  [19] "max_picth_belt"           "max_yaw_belt"            
##  [21] "min_roll_belt"            "min_pitch_belt"          
##  [23] "min_yaw_belt"             "amplitude_roll_belt"     
##  [25] "amplitude_pitch_belt"     "amplitude_yaw_belt"      
##  [27] "var_total_accel_belt"     "avg_roll_belt"           
##  [29] "stddev_roll_belt"         "var_roll_belt"           
##  [31] "avg_pitch_belt"           "stddev_pitch_belt"       
##  [33] "var_pitch_belt"           "avg_yaw_belt"            
##  [35] "stddev_yaw_belt"          "var_yaw_belt"            
##  [37] "gyros_belt_x"             "gyros_belt_y"            
##  [39] "gyros_belt_z"             "accel_belt_x"            
##  [41] "accel_belt_y"             "accel_belt_z"            
##  [43] "magnet_belt_x"            "magnet_belt_y"           
##  [45] "magnet_belt_z"            "roll_arm"                
##  [47] "pitch_arm"                "yaw_arm"                 
##  [49] "total_accel_arm"          "var_accel_arm"           
##  [51] "avg_roll_arm"             "stddev_roll_arm"         
##  [53] "var_roll_arm"             "avg_pitch_arm"           
##  [55] "stddev_pitch_arm"         "var_pitch_arm"           
##  [57] "avg_yaw_arm"              "stddev_yaw_arm"          
##  [59] "var_yaw_arm"              "gyros_arm_x"             
##  [61] "gyros_arm_y"              "gyros_arm_z"             
##  [63] "accel_arm_x"              "accel_arm_y"             
##  [65] "accel_arm_z"              "magnet_arm_x"            
##  [67] "magnet_arm_y"             "magnet_arm_z"            
##  [69] "kurtosis_roll_arm"        "kurtosis_picth_arm"      
##  [71] "kurtosis_yaw_arm"         "skewness_roll_arm"       
##  [73] "skewness_pitch_arm"       "skewness_yaw_arm"        
##  [75] "max_roll_arm"             "max_picth_arm"           
##  [77] "max_yaw_arm"              "min_roll_arm"            
##  [79] "min_pitch_arm"            "min_yaw_arm"             
##  [81] "amplitude_roll_arm"       "amplitude_pitch_arm"     
##  [83] "amplitude_yaw_arm"        "roll_dumbbell"           
##  [85] "pitch_dumbbell"           "yaw_dumbbell"            
##  [87] "kurtosis_roll_dumbbell"   "kurtosis_picth_dumbbell" 
##  [89] "kurtosis_yaw_dumbbell"    "skewness_roll_dumbbell"  
##  [91] "skewness_pitch_dumbbell"  "skewness_yaw_dumbbell"   
##  [93] "max_roll_dumbbell"        "max_picth_dumbbell"      
##  [95] "max_yaw_dumbbell"         "min_roll_dumbbell"       
##  [97] "min_pitch_dumbbell"       "min_yaw_dumbbell"        
##  [99] "amplitude_roll_dumbbell"  "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell"   "total_accel_dumbbell"    
## [103] "var_accel_dumbbell"       "avg_roll_dumbbell"       
## [105] "stddev_roll_dumbbell"     "var_roll_dumbbell"       
## [107] "avg_pitch_dumbbell"       "stddev_pitch_dumbbell"   
## [109] "var_pitch_dumbbell"       "avg_yaw_dumbbell"        
## [111] "stddev_yaw_dumbbell"      "var_yaw_dumbbell"        
## [113] "gyros_dumbbell_x"         "gyros_dumbbell_y"        
## [115] "gyros_dumbbell_z"         "accel_dumbbell_x"        
## [117] "accel_dumbbell_y"         "accel_dumbbell_z"        
## [119] "magnet_dumbbell_x"        "magnet_dumbbell_y"       
## [121] "magnet_dumbbell_z"        "roll_forearm"            
## [123] "pitch_forearm"            "yaw_forearm"             
## [125] "kurtosis_roll_forearm"    "kurtosis_picth_forearm"  
## [127] "kurtosis_yaw_forearm"     "skewness_roll_forearm"   
## [129] "skewness_pitch_forearm"   "skewness_yaw_forearm"    
## [131] "max_roll_forearm"         "max_picth_forearm"       
## [133] "max_yaw_forearm"          "min_roll_forearm"        
## [135] "min_pitch_forearm"        "min_yaw_forearm"         
## [137] "amplitude_roll_forearm"   "amplitude_pitch_forearm" 
## [139] "amplitude_yaw_forearm"    "total_accel_forearm"     
## [141] "var_accel_forearm"        "avg_roll_forearm"        
## [143] "stddev_roll_forearm"      "var_roll_forearm"        
## [145] "avg_pitch_forearm"        "stddev_pitch_forearm"    
## [147] "var_pitch_forearm"        "avg_yaw_forearm"         
## [149] "stddev_yaw_forearm"       "var_yaw_forearm"         
## [151] "gyros_forearm_x"          "gyros_forearm_y"         
## [153] "gyros_forearm_z"          "accel_forearm_x"         
## [155] "accel_forearm_y"          "accel_forearm_z"         
## [157] "magnet_forearm_x"         "magnet_forearm_y"        
## [159] "magnet_forearm_z"         "problem_id"

yikes!

Tidy Data is important

Here we want the data set to be tidy, so we remove NA values, and remove features that are not in the testing data set. Additionally we see that the first 7 cols are not relevant to our report so we only need cols 8 to 59.

# what features are we dealing with?        
        features <- names(testing_DS[,colSums(is.na(testing_DS)) == 0])[8:59]

# Only use features used in testing cases.
        training_DS <- training_DS[,c(features,"classe")]
        testing_DS <- testing_DS[,c(features,"problem_id")]

# verify our work so far
        dim(training_DS)

## [1] 19622    53

        dim(testing_DS)

## [1] 20 53

Data Partitioning

As discussed in the lectures, it is good practices to partition the data. Here we are using 60% of the data as the initial training set and reserving 40%.

# set seed for reproducibility       
        set.seed(1)

# create partitions
        part_info <- createDataPartition(training_DS$classe, p=0.6, list=FALSE)
        training_set <- training_DS[part_info,]
        testing_set <- training_DS[-part_info,]


# verify our work again
        dim(training_set)

## [1] 11776    53

        dim(testing_set)

## [1] 7846   53

nice.

Data Modelling

We will build two models for our analysis, a Descision Tree and a Random Forest.

Build a Decision Tree with training data set.

As Jeff discussed in the lecture, a decision tree is a good place to start but may not yield a high level of accuracy. We can expect an accuracy of about 70-80%.

# set seed for reproducibility 
        set.seed(1)

# build up the tree        
        model_tree <- rpart(classe ~ ., data = training_set, method="class", 
                            control =  rpart.control(method = "cv", number = 100))
        
# lets see it
        fancyRpartPlot(model_tree)

## Warning: labs do not fit even at cex 0.15, there may be some overplotting

Test the Decision Tree using the testing data set.

The confusion matrix will give us an accuracy result and an estimate at the 95% CI.

# set seed for reproducibility 
        set.seed(1)

# set up classe as a factor
        testing_set$classe <- as.factor(testing_set$classe)

# analysis with the confusion matrix
        prediction <- predict(model_tree, testing_set, type = "class")
        confusionMatrix(prediction, testing_set$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1983  223   20   50   21
##          B   66  922   67   83  104
##          C   60  149 1098  197  173
##          D   74   97   92  775  102
##          E   49  127   91  181 1042
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7418          
##                  95% CI : (0.7319, 0.7514)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6732          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8884   0.6074   0.8026  0.60264   0.7226
## Specificity            0.9441   0.9494   0.9106  0.94436   0.9300
## Pos Pred Value         0.8633   0.7424   0.6547  0.67982   0.6993
## Neg Pred Value         0.9551   0.9098   0.9562  0.92380   0.9371
## Prevalence             0.2845   0.1935   0.1744  0.16391   0.1838
## Detection Rate         0.2527   0.1175   0.1399  0.09878   0.1328
## Detection Prevalence   0.2928   0.1583   0.2137  0.14530   0.1899
## Balanced Accuracy      0.9163   0.7784   0.8566  0.77350   0.8263

The Accuracy is about 74%, not bad but we can do better.

Build a Random Forest Model with training data set.

As we learned in class, the Random forest should give a better accuracy.

# we need another library to do this
        library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:rattle':
## 
##     importance

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

# convert to factor so RF works!
        training_set$classe <- as.factor(training_set$classe)

# set seed for reproducibility 
        set.seed(1)

# build up the forest now
        model_random_forest <- randomForest(classe ~ ., data = training_set, ntree = 100)

Plot the random forest models

        plot(model_random_forest)

This looks promising and follows the pattern that Jeff highlighted in the lecture.

### Test the random forest model using the testing data set. The confusion matrix will give us an accuracy result and an estimate at the 95% CI.

prediction <- predict(model_random_forest, testing_set, type = "class")
confusionMatrix(prediction, testing_set$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2229    4    0    0    0
##          B    2 1507   20    0    0
##          C    1    7 1346    8    1
##          D    0    0    2 1277    6
##          E    0    0    0    1 1435
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9934         
##                  95% CI : (0.9913, 0.995)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9916         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9987   0.9928   0.9839   0.9930   0.9951
## Specificity            0.9993   0.9965   0.9974   0.9988   0.9998
## Pos Pred Value         0.9982   0.9856   0.9875   0.9938   0.9993
## Neg Pred Value         0.9995   0.9983   0.9966   0.9986   0.9989
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2841   0.1921   0.1716   0.1628   0.1829
## Detection Prevalence   0.2846   0.1949   0.1737   0.1638   0.1830
## Balanced Accuracy      0.9990   0.9946   0.9906   0.9959   0.9975

whoa! 99.34%

Prediction on the Actual Test Data

Now the real test! Can our models predict the actual results using the test data which consists of 20 obs.

Decision Tree Predictions:

prediction_model_tree <- predict(model_tree, testing_DS, type = "class")
prediction_model_tree

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  E  D  A  C  D  A  A  A  C  E  C  A  D  E  A  B  B  B 
## Levels: A B C D E

Not bad but…

Random Forest Predictions:

prediction_random_forest <- predict(model_random_forest, testing_DS)
prediction_random_forest

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Nice!

Conclusion

The decision tree model has a moderate degree of accuracy. This can be seen in the confusion matrix where the accuracy is about 74%. This is reflected in actual test.

In contrast, the random tree model has a high degree of accuracy. This can be seen in the confusion matrix where the accuracy is over 99%, furthermore the model exhibits 95% CI: (0.9942, 0.9972). The 20 sample results based on the actual test confirms this.

Practical Machine Learning - Assignment - Week 4

RMemmel

05/10/2020