Introduction

Overview

This report describes the data analysis and model fitting performed as part of the Practical Machine Learning course, offered by the Johns Hopkins Bloomberg School of Public Health on Coursera.

Background on Human Activity Recognition

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Details of how the data was collected can be found on the source website, in the section on the Weight Lifting Exercise Dataset.

The goal of this project is to predict the manner in which they did the exercise; this report describes the data analysis, pre-processing, model fitting and model evaluation performed toward this goal.

Strategy

In this analysis we’ll take the following steps toward the project goal:

  1. Data Analysis: To load the datasets and perform some exploratory data analysis;
  2. Preprocessing: To clean the datasets, removing all-NA values and non-significant features;
  3. Data Partition: To partition the Training dataset in order to evaluate the performance of a fitted model through cross validation;
  4. Training a Model: To train/fit the model;
  5. Evaluation: To evaluate the performance of the fitted model;
  6. Predict Test Dataset Values: Lastly, to predict the values of the Test dataset.

Execution

Data Analysis

There are two datasets, the Training and the Test dataset. The CSV data files can be downloaded from:

If you want to reproduce this Rmd, you will have to download the data into a ./data/ folder in the working directory.
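
For reproducibility, here is a minimal download sketch; the URLs are the CloudFront mirrors commonly used for this assignment and are an assumption of this sketch, not taken from the original report:

# download the CSV files into ./data/ (URLs assumed; see note above)
if (!dir.exists("./data")) dir.create("./data")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
              destfile = "./data/pml-training.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
              destfile = "./data/pml-testing.csv")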

## setup
library(caret)
library(ggplot2)
library(gridExtra)

## reading data (empty strings and "#DIV/0!" are treated as NA)
training <- read.csv("./data/pml-training.csv", na.strings=c("NA","","#DIV/0!"))
testing <- read.csv("./data/pml-testing.csv", na.strings=c("NA","","#DIV/0!"))

# the 'result' column
str(training$classe)
##  Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

The goal of the project is to predict the manner in which the participants did the exercise. This is the “classe” variable in the training set: the value “A” corresponds to performing the exercise correctly, while the other values correspond to common errors: throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
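
Before any modeling, it is worth a quick look at how the five classes are distributed in the training set (output not shown):

# distribution of the outcome variable across the five classes
table(training$classe)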

Let’s start by choosing which data frame columns hold only NA values, as well as those that don’t make sense to be used as valid data features in model training.

# subsetting the data frame: removing columns with NA values in every row
nonNA_Cols <- sapply(training, function(x){mean(!is.na(x))}) # % of non NA values in a column
valid_Cols <- nonNA_Cols > 0 # T -> has some values or F -> only NA values 

# removing Non relevant data: shouldn't be predictors
valid_Cols[c("X", "user_name","cvtd_timestamp","num_window","new_window")]  <- F

Understanding the Timestamp Information

There are two columns holding the raw timestamp, raw_timestamp_part_1 and raw_timestamp_part_2. Looking at how the data points from one specific case are distributed over these columns, it is possible to verify that they should be unified into a single value in order to distribute the data points adequately in time.

# understanding the timestamp info

# just one user and class of one of the features (yaw_arm)
# to see how is the behavior of timestamp parts
dt <- training[training$classe=="E" & training$user_name=="adelmo",] 

g1 <- qplot(dt$raw_timestamp_part_1,dt$yaw_arm) + ggtitle("Timestamp Part 1") + xlab("Raw Timestamp Part 1") + ylab("yaw_arm")
g2 <- qplot(dt$raw_timestamp_part_2,dt$yaw_arm) + ggtitle("Timestamp Part 2")+ xlab("Raw Timestamp Part 2") + ylab("yaw_arm")
grid.arrange(g1, g2, ncol=2)

We can see that there are several data points at the same raw_timestamp_part_1 value, so we have to combine the two parts: part 1 holds whole seconds and part 2 the sub-second fraction.

# the maximum value of part 2 shows its 'magnitude':
# it stays below 10^6, so part 2 is the microseconds fraction of part 1
max(dt$raw_timestamp_part_2) 
## [1] 996368
# testing the combination of the two parts
new_timestamp <- dt$raw_timestamp_part_1*1000000+dt$raw_timestamp_part_2
qplot(new_timestamp,dt$yaw_arm)+ ggtitle("New Timestamp") + xlab("Timestamp") + ylab("yaw_arm")

Now, we can see that the time information is correct. Let’s apply this transformation to the Training and Test datasets.

# adjusting the datasets
training_timestamp <- training$raw_timestamp_part_1*1000000+training$raw_timestamp_part_2
testing_timestamp <- testing$raw_timestamp_part_1*1000000+testing$raw_timestamp_part_2
valid_Cols[c("raw_timestamp_part_1","raw_timestamp_part_2")] <- F

Preprocessing

Now, we’ll apply some basic preprocessing to both datasets: first removing the all-NA columns and the non-relevant data (the columns where valid_Cols is FALSE), and then removing the Near Zero Variance columns.

We will also normalize the remaining data, but this will be done after the partition of the training dataset in the next section.

# subsetting columns and adding the new timestamp info
training <- training[,which(valid_Cols==T)]
training$timestamp <- training_timestamp
testing <- testing[,which(valid_Cols==T)]
testing$timestamp <- testing_timestamp

# Near Zero Variance
nZeroVar <- nearZeroVar(training,saveMetrics=T)
nZeroVar[nZeroVar$nzv==T,]
##                        freqRatio percentUnique zeroVar  nzv
## amplitude_yaw_belt       0.00000    0.00509632    TRUE TRUE
## avg_roll_arm            77.00000    1.68178575   FALSE TRUE
## stddev_roll_arm         77.00000    1.68178575   FALSE TRUE
## var_roll_arm            77.00000    1.68178575   FALSE TRUE
## avg_pitch_arm           77.00000    1.68178575   FALSE TRUE
## stddev_pitch_arm        77.00000    1.68178575   FALSE TRUE
## var_pitch_arm           77.00000    1.68178575   FALSE TRUE
## avg_yaw_arm             77.00000    1.68178575   FALSE TRUE
## stddev_yaw_arm          80.00000    1.66649679   FALSE TRUE
## var_yaw_arm             80.00000    1.66649679   FALSE TRUE
## max_roll_arm            25.66667    1.47793293   FALSE TRUE
## min_roll_arm            19.25000    1.41677709   FALSE TRUE
## min_pitch_arm           19.25000    1.47793293   FALSE TRUE
## amplitude_roll_arm      25.66667    1.55947406   FALSE TRUE
## amplitude_pitch_arm     20.00000    1.49831821   FALSE TRUE
## amplitude_yaw_dumbbell   0.00000    0.00509632    TRUE TRUE
## max_roll_forearm        27.66667    1.38110284   FALSE TRUE
## min_roll_forearm        27.66667    1.37091020   FALSE TRUE
## amplitude_roll_forearm  20.75000    1.49322189   FALSE TRUE
## amplitude_yaw_forearm    0.00000    0.00509632    TRUE TRUE
## avg_roll_forearm        27.66667    1.64101519   FALSE TRUE
## stddev_roll_forearm     87.00000    1.63082255   FALSE TRUE
## var_roll_forearm        87.00000    1.63082255   FALSE TRUE
## avg_pitch_forearm       83.00000    1.65120783   FALSE TRUE
## stddev_pitch_forearm    41.50000    1.64611151   FALSE TRUE
## var_pitch_forearm       83.00000    1.65120783   FALSE TRUE
## avg_yaw_forearm         83.00000    1.65120783   FALSE TRUE
## stddev_yaw_forearm      85.00000    1.64101519   FALSE TRUE
## var_yaw_forearm         85.00000    1.64101519   FALSE TRUE
# subsetting by NZV
training <- training[,nZeroVar$nzv==F]
testing <- testing[,nZeroVar$nzv==F]
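
As a sanity check (output not shown), the Training and Test sets should now have the same number of columns; the Test set carries problem_id in the position where the Training set carries classe:

# both data frames should agree on the number of columns
dim(training)
dim(testing)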

Partition of Training Dataset

To perform the cross validation of the prediction model after training, we’ll partition the training dataset.

# partition in the training set (Cross Validation)
set.seed(1234)
inTrain <- createDataPartition(training$classe, p=.75, list=F)
train <- training[inTrain,]
cross <- training[-inTrain,]
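
Since createDataPartition samples within each level of classe, both partitions should keep roughly the original class proportions; a quick check (output not shown):

# class proportions in each partition
round(prop.table(table(train$classe)), 3)
round(prop.table(table(cross$classe)), 3)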

Lastly, we’ll normalize the partitions. Note that column 118 holds the outcome classe, so it is excluded from the preprocessing and added back afterwards.

# normalizing the Training subset
preProc <- preProcess(train[,-118],method=c("knnImpute", "center", "scale"))
pTrain <- predict(preProc,train[,-118])
pTrain$classe <- train$classe

# normalizing the Cross Validation subset
pCross <- predict(preProc,cross[,-118])

Now we have two new datasets:

  • pTrain: a normalized partition of the original Training dataset that will be used to train the model
  • pCross: a normalized partition of the original Training dataset to be used as the ‘cross validation’ dataset to evaluate the performance of the trained model
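
Because knnImpute fills in the remaining missing values, neither normalized partition should contain NAs anymore; a quick check (output not shown):

# knnImpute should have left no missing values behind
anyNA(pTrain)
anyNA(pCross)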

Training a Model

At this stage, we’ll train a model using the random forest algorithm, with settings intended to shorten the duration of the training task.
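
One note on the parallel settings: caret distributes the cross validation folds via foreach only when a parallel backend is registered, which the report does not show; a minimal sketch, assuming the doParallel package is available:

# register a foreach backend so caret can parallelize the CV folds (assumed setup)
library(doParallel)       # not loaded in the original report
cl <- makeCluster(4)      # one worker per core (matches ncores=4 below)
registerDoParallel(cl)    # trainControl's allowParallel defaults to TRUE
# ... run train() as below, then stop the workers:
# stopCluster(cl)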

# training a model
modFit <- train(classe~., method="rf", data=pTrain, allowParallel=T, ncores=4, nodesize=10, importance=F,  proximity=F, trControl=trainControl(method="cv"))
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
modFit
## Random Forest 
## 
## 14718 samples
##   118 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 13246, 13246, 13246, 13244, 13247, 13245, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD   Kappa SD   
##     2   0.9697633  0.9617336  0.0042036645  0.005318093
##    60   0.9983694  0.9979376  0.0009162832  0.001159055
##   118   0.9962629  0.9952731  0.0009746586  0.001232760
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 60.
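
The final randomForest object selected by caret, including its out-of-bag (OOB) error estimate, can also be inspected directly (output not shown):

# inspect the underlying random forest chosen by caret
modFit$finalModel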

Evaluation

To evaluate the performance of the model, we apply it to the cross validation dataset and check the predicted values against the real ones.

# predict and evaluation
crossValid <- predict(modFit, pCross)
confusionMatrix(crossValid, cross$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    1    0    0    0
##          B    0  948    1    0    0
##          C    0    0  851    0    0
##          D    0    0    3  802    0
##          E    0    0    0    2  901
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9986          
##                  95% CI : (0.9971, 0.9994)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9982          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9989   0.9953   0.9975   1.0000
## Specificity            0.9997   0.9997   1.0000   0.9993   0.9995
## Pos Pred Value         0.9993   0.9989   1.0000   0.9963   0.9978
## Neg Pred Value         1.0000   0.9997   0.9990   0.9995   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1933   0.1735   0.1635   0.1837
## Detection Prevalence   0.2847   0.1935   0.1735   0.1642   0.1841
## Balanced Accuracy      0.9999   0.9993   0.9977   0.9984   0.9998

We can see that the model achieves very good performance according to the confusion matrix, with 99.86% accuracy on the cross validation set.
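
Since the cross validation partition was held out of training, its accuracy yields an estimate of the out-of-sample error; a small sketch extracting it from the confusionMatrix object:

# estimated out-of-sample error = 1 - cross validation accuracy (about 0.0014 here)
cm <- confusionMatrix(crossValid, cross$classe)
1 - unname(cm$overall["Accuracy"])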

Predict Test Dataset Values

The last step of this project is to generate the classifications for the Test dataset, predicting the manner in which the subjects did the exercise (the values of the “classe” variable).

# predicting test set
pTest <- predict(preProc,testing[,-118])
result <- predict(modFit,pTest)
result
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
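
For the course submission, each prediction is typically written to its own text file; a minimal sketch, based on the helper function suggested in the assignment instructions:

# write one problem_id_<i>.txt file per prediction (submission helper, assumed from the course instructions)
pml_write_files <- function(x){
  n <- length(x)
  for(i in 1:n){
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
pml_write_files(as.character(result))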