You may get source code of this at https://github.com/svicente99/Machine_Learning
And html version is available at RPubs: http://rpubs.com/svicente99/MachLearn_Peer_Assesment
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways [A,B,C,D,E].
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. Any of the other variables were used to predict it. Along next sections we describe how the model was built and fitted. Two alternatives were tested: one with classification tree and another one with random forest The expected out of sample error was showed in order to advice the choice we did. Our option was random forest as best model fit. Finally we use this model to predict 20 different test cases.
The data for this assignment can be downloaded from the course web site:
There is also some documentation of the database available. See above links to know how some of the variables are constructed/defined.
Setting parameters and files to be processed:
# MAIN PARAMETERS
DATA_FOLDER <- "./data" # subdirectory of current named 'data'
URL_DATA <- "http://d396qusza40orc.cloudfront.net/predmachlearn"
CSV_TRAIN <- "pml-training.csv"
CSV_TEST <- "pml-testing.csv"
# -----------------------------------------------------------------
Getting (original) CSV data files and setting the main data frame (df_*):
## [1] "Dimension of training data - rows x cols"
## [1] 19622 160
## [1] "Dimension of test data - rows x cols"
## [1] 20 160
The next Libraries were used in this project. A custom function will install them (if not done yet) and will load on your “RStudio” environment.
# we just install and load the libraries we need to do this analysis
pkgTest("rpart")
## Loading required package: rpart
pkgTest("rpart.plot")
## Loading required package: rpart.plot
pkgTest("caret")
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
pkgTest("rattle")
## Loading required package: rattle
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
pkgTest("randomForest")
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(rpart)
library(rpart.plot)
library(caret)
library(rattle)
library(randomForest)
We need to pass a good filter to clean these data up. Next function do this job. It’s not interesting to include in this analysis data containing missing values. So, get rid of them:
clean_data <- function(df, step3=TRUE)
{
# get rid of columns that has only "missing values"
df_clean <- df[, colSums(is.na(df)) == 0]
print(dim(df_clean))
# then throw columns out that not participate as predictors
vColsNotPred <- c("user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window", "num_window")
cols2Remove <- which(colnames(df) %in% vColsNotPred)
df_clean <- df_clean[, -cols2Remove]
print(dim(df_clean))
# the ID column (the 1st) is another that not influences our forecasting
df_clean <- df_clean[c(-1)]
# lastly, remove columns with zero variability
if(step3) { ## because there is no need to 'testing' dataset ##
colsNear0 <- nearZeroVar(df_clean)
df_clean <- df_clean[, -colsNear0]
print(dim(df_clean))
}
return(df_clean)
}
df_train <- clean_data(df_train)
## [1] 19622 93
## [1] 19622 87
## [1] 19622 53
summary(df_train$classe)
## A B C D E
## 5580 3797 3422 3216 3607
We’ll split training data set, after cleaning, in two subsets - one is a pure training data set, containing 70% of observations and another, smaller (30%) to contain data for validation.
set.seed(1987) ## to may reproducible research
trainPart <- createDataPartition(df_train$classe, p=0.7, list=FALSE)
training <- df_train[+trainPart,]
testing <- df_train[-trainPart,]
rm(trainPart)
dim(training); dim(testing)
## [1] 13737 53
## [1] 5885 53
class(training)
## [1] "data.frame"
training$classe <- as.factor(training$classe)
# Equalizing dimensions (columns) to test data sets
# These are the final counting to each part of our data use to apply algorithm models:
#nCols <- ncol(training)
##colsTraining1 <- colnames(training)
#colsTraining2 <- colnames(training[, -nCols]) # remove classe column (for test data set)
##testing <- testing[colsTraining1] # leave vars that are also in training set
#df_test <- df_test[colsTraining2] # leave vars that are also in training set
summary(training$classe); summary(testing$classe)
## A B C D E
## 3906 2658 2396 2252 2525
## A B C D E
## 1674 1139 1026 964 1082
We initially use Decision Tree technique to fit a model in these data. Here are the results obtained to this adjusted model and its tree plotting.
##
## Classification tree:
## rpart(formula = classe ~ ., data = training, method = "class")
##
## Variables actually used in tree construction:
## [1] accel_dumbbell_y accel_dumbbell_z accel_forearm_x
## [4] magnet_arm_y magnet_belt_z magnet_dumbbell_y
## [7] magnet_dumbbell_z magnet_forearm_z pitch_belt
## [10] pitch_forearm roll_belt roll_dumbbell
## [13] roll_forearm total_accel_dumbbell yaw_belt
##
## Root node error: 9831/13737 = 0.71566
##
## n= 13737
##
## CP nsplit rel error xerror xstd
## 1 0.115553 0 1.00000 1.00000 0.0053780
## 2 0.059336 1 0.88445 0.88445 0.0057464
## 3 0.035398 4 0.70644 0.70735 0.0059605
## 4 0.031431 5 0.67104 0.66982 0.0059559
## 5 0.022124 6 0.63961 0.64022 0.0059401
## 6 0.019835 11 0.51439 0.51490 0.0057511
## 7 0.018716 12 0.49456 0.49456 0.0057010
## 8 0.018513 13 0.47584 0.47289 0.0056412
## 9 0.014749 15 0.43882 0.44217 0.0055448
## 10 0.013122 16 0.42407 0.42498 0.0054846
## 11 0.011189 18 0.39782 0.41267 0.0054387
## 12 0.011087 19 0.38663 0.39630 0.0053738
## 13 0.010000 20 0.37555 0.38409 0.0053226
For next, we use ‘random forest’ algorithm to fit another model to those data. Follow the results.
modelFit2 <- randomForest(classe ~. , data=training)
modelFit2
##
## Call:
## randomForest(formula = classe ~ ., data = training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.5%
## Confusion matrix:
## A B C D E class.error
## A 3904 2 0 0 0 0.0005120328
## B 13 2640 5 0 0 0.0067720090
## C 0 6 2388 2 0 0.0033388982
## D 0 0 25 2224 3 0.0124333925
## E 0 0 1 11 2513 0.0047524752
prev1 <- predict(modelFit1, data=testing, type="class")
prev2 <- predict(modelFit2, data=testing, type="class")
tab <- table(prev1, prev2)
apply(
prop.table(tab,1)*100, 2,
function(u) sprintf( "%.1f%%", u )
)
## prev2
## A B C D E
## [1,] "74.9%" "11.4%" "3.5%" "7.7%" "2.5%"
## [2,] "4.3%" "74.3%" "9.5%" "4.0%" "7.9%"
## [3,] "3.3%" "6.9%" "66.9%" "12.0%" "10.8%"
## [4,] "4.4%" "10.0%" "9.3%" "70.9%" "5.4%"
## [5,] "2.9%" "5.8%" "2.7%" "9.6%" "79.1%"
We see above that both models tied more at values ‘E’ and ‘A’ for ‘classe’.
Random Forest provides right answers to ‘testing’ dataset in almost 99% [OOB estimate of error rate: 0.5%]. It is more then we achieve using Decision Tree [Root node error: 0.71566].
## modelFit2 was adjusted by "Random Forest" technique ##
myChoiceModel <- predict(modelFit2, df_test, type="class")
myChoiceModel
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# Write the result file to submission
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("./data/problem_id_",i,".txt")
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
}
pml_write_files(myChoiceModel)
# Twenty files were saved on data folder (check it!).