This report describes the data analysis and the model fitting performed as part of the Practical Machine Learning course, offered by the Johns Hopkins Bloomberg School of Public Health on Coursera.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Details of how the data was collected can be found at the linked page, in the section on the Weight Lifting Exercise Dataset.
The goal of this project is to predict the manner in which they did the exercise, and this report describes the data analysis, pre-processing, model fitting and model evaluation for this goal.
In this analysis we’ll go through the following steps towards the project goal: reading and cleaning the data, pre-processing, model training, and model evaluation.
There are two datasets, the Training and the Test dataset; the CSV data files can be downloaded from the links on the course project page.
If you want to reproduce this Rmd, you will have to download the data into a ./data/ folder in the working directory.
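A minimal reproduction sketch is shown below; the URLs are placeholders for the course-provided links, which are not reproduced in this report.

# hypothetical reproduction sketch: replace the placeholder URLs
# with the links provided on the course project page
if (!dir.exists("./data")) dir.create("./data")
train_url <- "<training csv url from the course page>"  # placeholder
test_url  <- "<testing csv url from the course page>"   # placeholder
download.file(train_url, destfile = "./data/pml-training.csv")
download.file(test_url, destfile = "./data/pml-testing.csv")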
## setup
library(caret)
library(ggplot2)
library(gridExtra)
## reading data
training <- read.csv("./data/pml-training.csv", na.strings=c("NA","","#DIV/0!"))
testing <- read.csv("./data/pml-testing.csv", na.strings=c("NA","","#DIV/0!"))
# the 'result' column
str(training$classe)
## Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
The goal of the project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set: class “A” corresponds to the correct execution of the exercise, while the other values correspond to common errors: throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
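To get a feel for how the five classes are distributed in the training set, we can tabulate the classe variable (a quick optional check; output not shown in this report).

# class distribution of the outcome variable
table(training$classe)
round(prop.table(table(training$classe)), 3)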
Let’s start by identifying which data frame columns contain only NA values, as well as columns that don’t make sense as predictors, so they can be excluded from model training.
# subsetting the data frame, removing columns with NA values in all rows
nonNA_Cols <- sapply(training, function(x){mean(!is.na(x))}) # % of non NA values in a column
valid_Cols <- nonNA_Cols > 0 # T -> has some values or F -> only NA values
# removing Non relevant data: shouldn't be predictors
valid_Cols[c("X", "user_name","cvtd_timestamp","num_window","new_window")] <- F
There are two columns named raw_timestamp_part_1 and raw_timestamp_part_2. Looking at how the data points of one specific case are distributed over these two columns, we can see that they should be combined into a single timestamp so that the data points are distributed correctly in time.
# understanding the timestamp info
# just one user and class of one of the features (yaw_arm)
# to see how is the behavior of timestamp parts
dt <- training[training$classe=="E" & training$user_name=="adelmo",]
g1 <- qplot(dt$raw_timestamp_part_1,dt$yaw_arm) + ggtitle("Timestamp Part 1") + xlab("Raw Timestamp Part 1") + ylab("yaw_arm")
g2 <- qplot(dt$raw_timestamp_part_2,dt$yaw_arm) + ggtitle("Timestamp Part 2")+ xlab("Raw Timestamp Part 2") + ylab("yaw_arm")
grid.arrange(g1, g2, ncol=2)
We can see that there are several data points with the same raw_timestamp_part_1 value, so we have to combine the two parts.
# the maximum value of part 2 gives the scale factor
# by which part 1 must be multiplied before adding part 2
max(dt$raw_timestamp_part_2)
## [1] 996368
# testing the sum of the two timestamp parts
new_timestamp <- dt$raw_timestamp_part_1*1000000+dt$raw_timestamp_part_2
qplot(new_timestamp,dt$yaw_arm)+ ggtitle("New Timestamp") + xlab("Timestamp") + ylab("yaw_arm")
Now we can see that the time information is correct. Let’s apply this transformation to the Training and Test datasets.
# adjusting the datasets
training_timestamp <- training$raw_timestamp_part_1*1000000+training$raw_timestamp_part_2
testing_timestamp <- testing$raw_timestamp_part_1*1000000+testing$raw_timestamp_part_2
valid_Cols[c("raw_timestamp_part_1","raw_timestamp_part_2")] <- F
Now we’ll apply some basic pre-processing to both datasets: first removing the all-NA columns and the non-relevant data (the columns where valid_Cols is FALSE), and then removing the near zero variance columns.
We will also normalize the remaining data, but this will be done after partitioning the training dataset in the next section.
# subsetting columns and adding the new timestamp info
training <- training[,which(valid_Cols==T)]
training$timestamp <- training_timestamp
testing <- testing[,which(valid_Cols==T)]
testing$timestamp <- testing_timestamp
# Near Zero Variance
nZeroVar <- nearZeroVar(training,saveMetrics=T)
nZeroVar[nZeroVar$nzv==T,]
## freqRatio percentUnique zeroVar nzv
## amplitude_yaw_belt 0.00000 0.00509632 TRUE TRUE
## avg_roll_arm 77.00000 1.68178575 FALSE TRUE
## stddev_roll_arm 77.00000 1.68178575 FALSE TRUE
## var_roll_arm 77.00000 1.68178575 FALSE TRUE
## avg_pitch_arm 77.00000 1.68178575 FALSE TRUE
## stddev_pitch_arm 77.00000 1.68178575 FALSE TRUE
## var_pitch_arm 77.00000 1.68178575 FALSE TRUE
## avg_yaw_arm 77.00000 1.68178575 FALSE TRUE
## stddev_yaw_arm 80.00000 1.66649679 FALSE TRUE
## var_yaw_arm 80.00000 1.66649679 FALSE TRUE
## max_roll_arm 25.66667 1.47793293 FALSE TRUE
## min_roll_arm 19.25000 1.41677709 FALSE TRUE
## min_pitch_arm 19.25000 1.47793293 FALSE TRUE
## amplitude_roll_arm 25.66667 1.55947406 FALSE TRUE
## amplitude_pitch_arm 20.00000 1.49831821 FALSE TRUE
## amplitude_yaw_dumbbell 0.00000 0.00509632 TRUE TRUE
## max_roll_forearm 27.66667 1.38110284 FALSE TRUE
## min_roll_forearm 27.66667 1.37091020 FALSE TRUE
## amplitude_roll_forearm 20.75000 1.49322189 FALSE TRUE
## amplitude_yaw_forearm 0.00000 0.00509632 TRUE TRUE
## avg_roll_forearm 27.66667 1.64101519 FALSE TRUE
## stddev_roll_forearm 87.00000 1.63082255 FALSE TRUE
## var_roll_forearm 87.00000 1.63082255 FALSE TRUE
## avg_pitch_forearm 83.00000 1.65120783 FALSE TRUE
## stddev_pitch_forearm 41.50000 1.64611151 FALSE TRUE
## var_pitch_forearm 83.00000 1.65120783 FALSE TRUE
## avg_yaw_forearm 83.00000 1.65120783 FALSE TRUE
## stddev_yaw_forearm 85.00000 1.64101519 FALSE TRUE
## var_yaw_forearm 85.00000 1.64101519 FALSE TRUE
# subsetting by NZV
training <- training[,nZeroVar$nzv==F]
testing <- testing[,nZeroVar$nzv==F]
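As a quick sanity check (not part of the original pipeline), we can confirm the dimensions of the processed training set and the position of the outcome column, which should end up at position 118, consistent with the column index excluded in the normalization step further below.

# sanity check on the processed training set
dim(training)                        # rows x remaining columns
which(names(training) == "classe")   # position of the outcome column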
To perform the cross validation of the prediction model after training, we’ll partition the training dataset.
# partition in the training set (Cross Validation)
set.seed(1234)
inTrain <- createDataPartition(training$classe, p=.75, list=F)
train <- training[inTrain,]
cross <- training[-inTrain,]
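As an optional check, we can verify that createDataPartition preserved the class proportions in both partitions (a small sketch using the objects defined above).

# compare class proportions in the full training set and in each partition
round(rbind(full  = prop.table(table(training$classe)),
            train = prop.table(table(train$classe)),
            cross = prop.table(table(cross$classe))), 3)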
Lastly, we’ll normalize the partitions, excluding the outcome column from the pre-processing.
# normalizing the Training subset
preProc <- preProcess(train[,-118],method=c("knnImpute", "center", "scale"))
pTrain <- predict(preProc,train[,-118])
pTrain$classe <- train$classe
# normalizing the Cross Validation subset
pCross <- predict(preProc,cross[,-118])
Now we have two new datasets:
- pTrain: a normalized partition of the original Training dataset that will be used to train the model
- pCross: a normalized partition of the original Training dataset to be used as the ‘cross validation’ dataset to evaluate the performance of the trained model

At this stage, we’ll train a model using the random forest algorithm, with settings chosen to shorten the training time.
# training a model
modFit <- train(classe~., method="rf", data=pTrain, allowParallel=T, ncores=4, nodesize=10, importance=F, proximity=F, trControl=trainControl(method="cv"))
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
modFit
## Random Forest
##
## 14718 samples
## 118 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 13246, 13246, 13246, 13244, 13247, 13245, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9697633 0.9617336 0.0042036645 0.005318093
## 60 0.9983694 0.9979376 0.0009162832 0.001159055
## 118 0.9962629 0.9952731 0.0009746586 0.001232760
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 60.
To evaluate the performance of the model, we apply it to the cross-validation dataset and check the predictions against the real values.
# predict and evaluation
crossValid <- predict(modFit, pCross)
confusionMatrix(crossValid, cross$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 1 0 0 0
## B 0 948 1 0 0
## C 0 0 851 0 0
## D 0 0 3 802 0
## E 0 0 0 2 901
##
## Overall Statistics
##
## Accuracy : 0.9986
## 95% CI : (0.9971, 0.9994)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9982
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9989 0.9953 0.9975 1.0000
## Specificity 0.9997 0.9997 1.0000 0.9993 0.9995
## Pos Pred Value 0.9993 0.9989 1.0000 0.9963 0.9978
## Neg Pred Value 1.0000 0.9997 0.9990 0.9995 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1933 0.1735 0.1635 0.1837
## Detection Prevalence 0.2847 0.1935 0.1735 0.1642 0.1841
## Balanced Accuracy 0.9999 0.9993 0.9977 0.9984 0.9998
We can see that the model obtains a very good performance, according to the confusion matrix.
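As a complementary figure, the expected out-of-sample error can be estimated directly from the cross-validation predictions (a minimal check using the objects defined above).

# estimated out-of-sample error rate on the cross-validation partition
oos_error <- mean(crossValid != cross$classe)
oos_error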
The last step of this project is to generate the classification of the Testing dataset, predicting the way the subject did the exercise (the values of the “classe” variable).
# predicting test set
pTest <- predict(preProc,testing[,-118])
result <- predict(modFit,pTest)
result
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
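For the course submission, the predicted classes can also be written to individual text files, one per test case. The helper below is a hypothetical sketch, not part of the original analysis, and the file-naming convention is an assumption.

# hypothetical helper: write each predicted class to its own file for submission
write_predictions <- function(preds, out_dir = "./answers") {
  if (!dir.exists(out_dir)) dir.create(out_dir)
  for (i in seq_along(preds)) {
    filename <- file.path(out_dir, sprintf("problem_id_%02d.txt", i))
    write.table(preds[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
write_predictions(as.character(result))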