Background and Data Description

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

The aim of this report is to develop a predictive model that uses these measurements to predict the quality of the workout. Quality is recorded in the classe variable of the data set, which takes 5 values from A to E. Since the task is to distinguish between the classe types, classification algorithms need to be employed.

The training and test data sets can be downloaded and read into R as follows.

setwd("C:\\Users\\Alina\\OneDrive\\R")
training<-read.csv("pml-training.csv")
testing<-read.csv("pml-testing.csv")
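If the CSV files are not already on disk, they can be fetched first. A minimal sketch, assuming the standard course download URLs (adjust if the files are hosted elsewhere):

urlTrain<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv" #assumed URL
urlTest<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv" #assumed URL
download.file(urlTrain, destfile="pml-training.csv")
download.file(urlTest, destfile="pml-testing.csv")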

Similarly, the R packages required for subsequent processing will also be loaded at this stage.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin

We look at the training data set to get an idea of the variables that make it up and to ascertain the distribution of the classe variable.

str(training, list.len=15)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_belt     : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_belt       : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt      : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
##   [list output truncated]
barplot(table(training$classe))
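The barplot shows that class A is the most frequent category, at roughly 28% of observations (consistent with the No Information Rate reported later in the confusion matrix), with the remaining classes fairly evenly spread. The exact proportions can be checked directly; a quick sketch:

round(prop.table(table(training$classe)), 3) #share of each classe category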

To reduce the dimension of the data set, the first 7 columns (row index, user name, timestamps, and window markers) can be removed, since they carry no sensor information useful for predicting the classe type.

training<-training[,8:160]
testing<-testing[,8:160]

As part of preprocessing, the columns containing NA values can be removed; in this data set, any column that has NAs is missing the vast majority of its entries.

not_na<-colSums(is.na(training))==0 #keep columns where all 19622 observations are present
training<-training[,not_na]
testing<-testing[, not_na]
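A quick sanity check, as a sketch, confirms that no missing values remain and shows how many columns survive:

sum(is.na(training)) #should be 0 after the filtering above
dim(training) #columns remaining after dropping the NA-heavy ones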

To validate the predictive model, the training set will be divided into 2 parts: 60% will be used to build the model and the remaining 40% will be held out to validate it once it is built.

set.seed(12345)
inTrain<-createDataPartition(y=training$classe, p=0.6, list=FALSE)
train1<-training[inTrain,]
train2<-training[-inTrain,] #for validation
dim(train1)
## [1] 11776    86
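Since createDataPartition() samples within each level of classe, the class proportions in the two partitions should mirror the full training set; a sketch to verify:

round(prop.table(table(train1$classe)), 3) #should be nearly identical
round(prop.table(table(train2$classe)), 3) #to each other and to the full set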

So there are still 85 predictor variables (86 columns including classe) in the data set. Further preprocessing is needed to narrow these down to a smaller number that can predict classe with a high level of accuracy. To begin, the variables that show near zero variance will be removed, since they are unlikely to be good predictors of classe.

nzv<-nearZeroVar(train1)
if (length(nzv)>0){
  train1<-train1[,-nzv]
  train2<-train2[,-nzv]
}
dim(train1)
## [1] 11776    53
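As a sanity check, nearZeroVar() can also return its per-column diagnostics; on the pruned set, no column should be flagged any more. A sketch:

nzvMetrics<-nearZeroVar(train1, saveMetrics=TRUE) #frequency ratios and unique-value percentages
sum(nzvMetrics$nzv) #expected: 0 after the removal above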

After removing the near zero variance covariates, 52 predictors (plus the classe column) are left in the training set. That is still too many, so the randomForest package will be used to identify the most important of these variables for prediction purposes.

set.seed(33633)
modfit<-randomForest(classe~., data=train1, importance=TRUE, ntree=100)
varImpPlot(modfit)

Using the accuracy and Gini importance plots, the most important variables for prediction can be identified. A predictive model based on these will be created and tested for accuracy; if the resulting model has a high accuracy level it will be accepted, and otherwise more covariates can be added. This step is necessary because predicting with all 52 variables is too time-consuming to be practical with most algorithms.

The 9 most important covariates are: yaw_belt, roll_belt, pitch_belt, magnet_dumbbell_y, magnet_dumbbell_z, pitch_forearm, accel_dumbbell_y, roll_arm, and roll_forearm.
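The same ranking can also be extracted programmatically from the fitted forest instead of being read off the plots; a sketch:

imp<-importance(modfit) #importance matrix with per-class, accuracy, and Gini columns
head(imp[order(imp[,"MeanDecreaseAccuracy"], decreasing=TRUE),], 9) #top 9 by accuracy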

Next, the correlations between these 9 variables will be calculated to check whether any pair is highly correlated (above 0.75), in which case one member of the pair can be removed from the prediction set.

correl<-cor(train1[,c("yaw_belt","roll_belt","pitch_belt","magnet_dumbbell_z","magnet_dumbbell_y","pitch_forearm","accel_dumbbell_y","roll_arm","roll_forearm")])
diag(correl)<-0 #zero out the self-correlations of 1
which(abs(correl)>0.75, arr.ind=TRUE)
##           row col
## roll_belt   2   1
## yaw_belt    1   2

The results reveal that yaw_belt and roll_belt have a correlation higher than 0.75, so one of them can be safely eliminated from the list of predictive variables. To decide which, the maximum correlation among the remaining covariates is calculated after dropping each of the two in turn:

correl2<-cor(train1[,c("roll_belt","pitch_belt","magnet_dumbbell_z","magnet_dumbbell_y","pitch_forearm","accel_dumbbell_y","roll_arm","roll_forearm")]) #yaw_belt dropped
diag(correl2)<-0
max(abs(correl2))
## [1] 0.4969822
correl3<-cor(train1[,c("yaw_belt","pitch_belt","magnet_dumbbell_z","magnet_dumbbell_y","pitch_forearm","accel_dumbbell_y","roll_arm","roll_forearm")]) #roll_belt dropped
diag(correl3)<-0
max(abs(correl3))
## [1] 0.6979023

After removing roll_belt, the maximum correlation between the covariates is about 0.70, whereas after removing yaw_belt it is only about 0.50. Hence roll_belt is the more valuable covariate to keep, and yaw_belt can be removed from the shortened list of covariates.
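caret also provides findCorrelation(), which automates this kind of pairwise pruning; a sketch using the same 0.75 cutoff:

vars<-c("yaw_belt","roll_belt","pitch_belt","magnet_dumbbell_z","magnet_dumbbell_y","pitch_forearm","accel_dumbbell_y","roll_arm","roll_forearm")
dropIdx<-findCorrelation(cor(train1[,vars]), cutoff=0.75) #indices suggested for removal
vars[dropIdx] #expected to flag one of the yaw_belt/roll_belt pair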

Model

The algorithm used to build the model is random forests, trained through the caret package in R. The 8 variables that remain after the correlation analysis are used for prediction: roll_belt, pitch_belt, magnet_dumbbell_y, magnet_dumbbell_z, pitch_forearm, accel_dumbbell_y, roll_arm, and roll_forearm. A 2-fold cross-validation control will be employed: this is the simplest k-fold cross-validation possible and gives a reduced computation time, which is justified here because the data set is large.

set.seed(3141592)
modelFit <- train(classe~roll_belt+pitch_belt+magnet_dumbbell_y+magnet_dumbbell_z+pitch_forearm+accel_dumbbell_y+roll_arm+roll_forearm,
                  data=train1,
                  method="rf",
                  trControl=trainControl(method="cv",number=2),
                  prox=TRUE,
                  verbose=TRUE,
                  allowParallel=TRUE)
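The fitted object can be inspected to see the cross-validated accuracy for each value of mtry that was tried, as well as the underlying forest with its own internal out-of-bag error estimate; a sketch:

print(modelFit) #cross-validated accuracy per mtry
modelFit$finalModel #underlying randomForest fit with its internal OOB estimate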

Once the model has been trained on train1, it can be used to make predictions on the validation set to check its accuracy.

pred<-predict(modelFit, newdata=train2)
confusionMat<-confusionMatrix(pred, train2$classe)
confusionMat
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2218   16    1    1    0
##          B   13 1465    9    2   13
##          C    0   23 1347   29    6
##          D    1   10    9 1249    8
##          E    0    4    2    5 1415
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9806          
##                  95% CI : (0.9773, 0.9836)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9755          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9937   0.9651   0.9846   0.9712   0.9813
## Specificity            0.9968   0.9942   0.9910   0.9957   0.9983
## Pos Pred Value         0.9919   0.9754   0.9587   0.9781   0.9923
## Neg Pred Value         0.9975   0.9916   0.9967   0.9944   0.9958
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2827   0.1867   0.1717   0.1592   0.1803
## Detection Prevalence   0.2850   0.1914   0.1791   0.1628   0.1817
## Balanced Accuracy      0.9953   0.9796   0.9878   0.9835   0.9898

According to the confusion matrix calculations, the model predicts on the validation set with an accuracy of 98.1% (95% CI: 97.7% to 98.4%).
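The headline figures can also be pulled straight from the confusionMatrix object; a one-line sketch:

round(confusionMat$overall[c("Accuracy","Kappa")], 4) #overall accuracy and kappa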

Estimate of the Out-of-Sample Error

The validation set will be used to estimate the out-of-sample error. Since it was not involved in training the model, it gives an honest estimate of the error to expect on new data. (This held-out estimate is distinct from the random forest's internal out-of-bag, or OOB, error, though the two should be similar.)

missClass = function(values, predicted) {
  sum(predicted != values) / length(values) #fraction misclassified
}
errRate = missClass(train2$classe, pred)
errRate
## [1] 0.01937293

The estimated out-of-sample error rate is about 1.9%, which is quite low. Hence this is a strong model for making predictions on the testing set, and it will be used to answer the questions in the prediction quiz.
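Note that this misclassification rate is simply the complement of the validation accuracy, so the same figure can be read off the confusion matrix; a sketch:

1 - confusionMat$overall["Accuracy"] #identical to the error rate computed above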

Prediction Quiz

Using the predictive model, predictions can be made from the testing portion of the data set as follows:

pred2<-predict(modelFit, testing)
pred2
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
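For the original course submission format, each prediction was written to its own text file. A sketch based on the helper the course materials used to provide (a hypothetical reconstruction; the submission mechanism has since changed):

pml_write_files<-function(x) { #hypothetical helper mirroring the course script
  for (i in seq_along(x)) {
    write.table(x[i], file=paste0("problem_id_", i, ".txt"),
                quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
pml_write_files(pred2)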

As per the quiz results, the model predicted all 20 test cases correctly, indicating that it is a good model for the data provided.