Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how well an exercise was performed (the classe variable). The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:
Class A: exactly according to the specification
Class B: throwing the elbows to the front
Class C: lifting the dumbbell only halfway
Class D: lowering the dumbbell only halfway
Class E: throwing the hips to the front
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har
The main objectives of this project are as follows
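# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0 (it was the default before 4.0)
training <- read.csv(file = "./data/pml-training.csv", header = TRUE, na.strings = c("NA", ""), stringsAsFactors = TRUE)
testing <- read.csv(file = "./data/pml-testing.csv", header = TRUE, na.strings = c("NA", ""), stringsAsFactors = TRUE)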
dim(training) #[1] 19622 160
## [1] 19622 160
dim(testing) #[1] 20 160
## [1] 20 160
# str(training)
The training set contains 19622 observations of 160 variables, and the testing set contains 20 test cases with the same number of columns.
First, we count the total number of NA values in the training and testing data.
sum(is.na(training)) #[1] 1921600
## [1] 1921600
sum(is.na(testing)) #[1] 2000
## [1] 2000
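The totals above do not show how the missing values are spread across columns. A minimal sketch of a per-column check follows; the data frame `toy` and its values are made up for illustration and only stand in for the real training set.

```r
# Toy data frame standing in for the real training set (hypothetical values)
toy <- data.frame(
  roll_belt    = c(1.41, 1.42, 1.48, 1.45),
  max_roll_arm = c(NA, NA, NA, 5.2),
  classe       = c("A", "A", "B", "B")
)

# Total number of NA cells (what sum(is.na(...)) reports)
total_na <- sum(is.na(toy))

# NA count per column, and the number of columns with at least one NA
na_per_column   <- colSums(is.na(toy))
columns_with_na <- sum(na_per_column > 0)

print(total_na)
print(na_per_column)
print(columns_with_na)
```

Applied to `training`, the same two lines would give the per-column counts that the cleaning step relies on.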
Rather than imputing, we drop every column that consists almost entirely of NA values, using the following code segment:
# for training dataset
columnNACounts <- colSums(is.na(training))
# columnNACounts
# after checking columnNACounts, we noticed that every column
# containing NA values has more than 19200 of them (out of 19622 rows)
badColumns <- columnNACounts >= 19200
cleanTrainingdata <- training[!badColumns]
sum(is.na(cleanTrainingdata)) # 0
## [1] 0
# same for testing dataset
columnNACounts <- colSums(is.na(testing))
# columnNACounts
# after checking columnNACounts, we noticed that every column
# containing NA values is entirely NA (20 out of 20 rows)
badColumns <- columnNACounts >= 20
cleanTestingdata <- testing[!badColumns]
sum(is.na(cleanTestingdata)) # 0
## [1] 0
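The fixed counts used above (19200 and 20) depend on the number of rows in each file. As an alternative sketch, not the method used in this report, the filter can be expressed as a fraction of missing values computed on the training data only, and the resulting column set reused for the test data. The toy data frames and the 0.5 threshold are assumptions chosen for illustration.

```r
# Toy stand-ins for the training and testing sets (hypothetical values)
toy_train <- data.frame(
  roll_belt = c(1.4, 1.5, 1.6, 1.7, 1.8),
  avg_roll  = c(NA, NA, NA, NA, 0.3),
  classe    = c("A", "B", "C", "D", "E")
)
toy_test <- data.frame(
  roll_belt  = c(1.2, 1.3),
  avg_roll   = c(NA, NA),
  problem_id = 1:2
)

# Fraction of NA values per training column
na_fraction <- colMeans(is.na(toy_train))

# Keep columns whose NA fraction is below an assumed threshold of 0.5
keep <- names(na_fraction)[na_fraction < 0.5]
clean_train <- toy_train[, keep, drop = FALSE]

# Reuse the same predictors for the test set so both sets stay aligned
predictors <- setdiff(keep, "classe")
clean_test <- toy_test[, c(predictors, "problem_id"), drop = FALSE]

print(names(clean_train))
print(names(clean_test))
```

Deriving the column set once from the training data avoids the risk of the two sets ending up with different predictors.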
# remove the first 6 columns as they contain user name and time stamps
# which are not useful to the classifier
cleanTrainingdata <- cleanTrainingdata[, c(7:60)]
cleanTestingdata <- cleanTestingdata[, c(7:60)]
dim(cleanTrainingdata) # [1] 19622 54
## [1] 19622 54
dim(cleanTestingdata) # [1] 20 54
## [1] 20 54
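Selecting columns 7:60 by position works only while the column order stays fixed. A hedged sketch of the same step by name is shown below. The identifier column names are assumptions about the CSV header (they do not appear in this report), and the toy data frame is made up.

```r
# Toy data frame with assumed bookkeeping columns (hypothetical names and values)
toy <- data.frame(
  X                    = 1:3,
  user_name            = c("user_a", "user_b", "user_c"),
  raw_timestamp_part_1 = c(100L, 101L, 102L),
  raw_timestamp_part_2 = c(10L, 20L, 30L),
  cvtd_timestamp       = c("t1", "t2", "t3"),
  new_window           = c("no", "no", "yes"),
  num_window           = c(11, 11, 12),
  roll_belt            = c(1.41, 1.42, 1.48),
  classe               = c("A", "A", "B")
)

# Assumed names of the six leading non-sensor columns
id_columns <- c("X", "user_name", "raw_timestamp_part_1",
                "raw_timestamp_part_2", "cvtd_timestamp", "new_window")

# Drop them by name instead of by position
features <- toy[, !(names(toy) %in% id_columns), drop = FALSE]

print(names(features))
```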
plot(cleanTrainingdata$classe,col=rainbow(5),main = "classe frequency plot")
# plot scatterplot matrices to judge whether the relationships are linear or nonlinear
# (attach() is not needed because each call passes data = cleanTrainingdata)
pairs(classe ~ num_window + roll_arm + pitch_arm, data = cleanTrainingdata,
      main = "Simple Scatterplot Matrix")
pairs(classe ~ roll_belt + pitch_belt + yaw_belt, data = cleanTrainingdata,
      main = "Simple Scatterplot Matrix")
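The scatterplot matrices above show no clear linear relationship between these predictors and classe, which suggests that the relationship is nonlinear.
Now we partition the cleaned training data: 60% is used to train the model and the remaining 40% is held out for validation.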
library(caret)
## Warning: package 'caret' was built under R version 3.1.2
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(y = cleanTrainingdata$classe, p = 0.6, list = FALSE)
trainingdata <- cleanTrainingdata[inTrain, ]
crossval <- cleanTrainingdata[-inTrain, ]
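The split above is stratified by class, so both parts keep roughly the same class proportions. The sketch below shows the idea in base R. It is an approximation for illustration, not caret's actual implementation; `stratified_split`, the seed, and the toy labels are hypothetical.

```r
set.seed(123)  # hypothetical seed so the split is reproducible

# Toy outcome vector with unequal class sizes (made-up counts)
labels <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 30, 30, 20, 20)))

# Sample a fraction p of the row indices within each class
stratified_split <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

in_train     <- stratified_split(labels, p = 0.6)
train_labels <- labels[in_train]
valid_labels <- labels[-in_train]

print(table(train_labels))
print(table(valid_labels))
```

Calling `set.seed()` before `createDataPartition` in the report's own code would make its split reproducible in the same way.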
cvCtrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE, verboseIter = TRUE)
# Build the model using 5-fold cross validation
model <- train(classe ~ ., data = trainingdata, method = "rf", trControl = cvCtrl)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
## + Fold1: mtry= 2
## - Fold1: mtry= 2
## + Fold1: mtry=27
## - Fold1: mtry=27
## + Fold1: mtry=53
## - Fold1: mtry=53
## + Fold2: mtry= 2
## - Fold2: mtry= 2
## + Fold2: mtry=27
## - Fold2: mtry=27
## + Fold2: mtry=53
## - Fold2: mtry=53
## + Fold3: mtry= 2
## - Fold3: mtry= 2
## + Fold3: mtry=27
## - Fold3: mtry=27
## + Fold3: mtry=53
## - Fold3: mtry=53
## + Fold4: mtry= 2
## - Fold4: mtry= 2
## + Fold4: mtry=27
## - Fold4: mtry=27
## + Fold4: mtry=53
## - Fold4: mtry=53
## + Fold5: mtry= 2
## - Fold5: mtry= 2
## + Fold5: mtry=27
## - Fold5: mtry=27
## + Fold5: mtry=53
## - Fold5: mtry=53
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 27 on full training set
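The log above shows three candidate values of mtry being evaluated on each of the 5 folds, after which the value with the best average performance is refit on the full training set. The sketch below illustrates that selection loop in base R. To stay self-contained it uses a tiny hand-written k-nearest-neighbour classifier on the built-in iris data as a stand-in for the random forest, with k playing the role of mtry; the function `knn_predict`, the seed, and the candidate values are all assumptions.

```r
set.seed(42)  # hypothetical seed

# Tiny k-nearest-neighbour classifier used as a stand-in model
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    dists   <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    nearest <- train_y[order(dists)[seq_len(k)]]
    names(which.max(table(nearest)))
  })
}

x <- as.matrix(iris[, 1:4])
y <- as.character(iris$Species)

# Assign every row to one of 5 folds at random
n_folds <- 5
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(x)))

# Candidate tuning values (analogous to mtry = 2, 27, 53)
candidates <- c(1, 5, 15)

# Mean held-out accuracy over the folds for each candidate
cv_accuracy <- sapply(candidates, function(k) {
  fold_acc <- sapply(seq_len(n_folds), function(f) {
    held_out <- fold_id == f
    pred <- knn_predict(x[!held_out, , drop = FALSE], y[!held_out],
                        x[held_out, , drop = FALSE], k)
    mean(pred == y[held_out])
  })
  mean(fold_acc)
})
names(cv_accuracy) <- candidates

# The candidate with the highest mean accuracy would be refit on all data
best_k <- candidates[which.max(cv_accuracy)]
print(cv_accuracy)
print(best_k)
```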
vimp <- varImp(model)
print(vimp)
## rf variable importance
##
## only 20 most important variables shown (out of 53)
##
## Overall
## num_window 100.000
## roll_belt 67.998
## pitch_forearm 42.220
## yaw_belt 32.119
## magnet_dumbbell_z 31.338
## pitch_belt 29.957
## magnet_dumbbell_y 29.619
## roll_forearm 26.876
## accel_dumbbell_y 12.229
## accel_forearm_x 11.776
## magnet_dumbbell_x 11.358
## roll_dumbbell 11.210
## accel_belt_z 10.329
## total_accel_dumbbell 9.381
## magnet_forearm_z 8.428
## accel_dumbbell_z 8.056
## magnet_belt_y 7.982
## magnet_belt_z 7.861
## magnet_belt_x 6.143
## yaw_dumbbell 5.323
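The importance values above are relative: the top variable is reported as 100 and the others are expressed on the same scale. A minimal sketch of that kind of rescaling is given below, assuming a shift to zero followed by division by the maximum. The raw scores are invented for illustration and are not taken from the fitted model.

```r
# Hypothetical raw importance scores (not from the fitted model)
raw_importance <- c(num_window = 820, roll_belt = 560,
                    pitch_forearm = 350, yaw_dumbbell = 20)

# Shift so the smallest score is 0, then scale so the largest is 100
shifted <- raw_importance - min(raw_importance)
scaled  <- shifted / max(shifted) * 100

print(round(scaled, 1))
```

On this scale a value of 50 means half the range between the least and the most important variable, not that the variable explains half of the outcome.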
Here, we calculate the in-sample accuracy, which is the prediction accuracy of our model on the data it was trained on.
training_pred <- predict(model, trainingdata) # predict on the same data used to fit the model
confusionMatrix(training_pred, trainingdata$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3348 0 0 0 0
## B 0 2279 0 0 0
## C 0 0 2054 0 0
## D 0 0 0 1930 0
## E 0 0 0 0 2165
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
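Thus, from the above confusion matrix, the in-sample accuracy is 100%. A random forest usually fits its own training data almost perfectly, so this figure is optimistic; the held-out 40% evaluated next gives a more realistic estimate.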
# predict on the held-out validation set (not the 20-case testing file)
crossval_pred <- predict(model, crossval)
confusionMatrix(crossval_pred, crossval$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2231 2 0 0 0
## B 0 1515 1 0 0
## C 0 1 1367 6 0
## D 0 0 0 1280 3
## E 1 0 0 0 1439
##
## Overall Statistics
##
## Accuracy : 0.9982
## 95% CI : (0.997, 0.999)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9977
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9996 0.9980 0.9993 0.9953 0.9979
## Specificity 0.9996 0.9998 0.9989 0.9995 0.9998
## Pos Pred Value 0.9991 0.9993 0.9949 0.9977 0.9993
## Neg Pred Value 0.9998 0.9995 0.9998 0.9991 0.9995
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1931 0.1742 0.1631 0.1834
## Detection Prevalence 0.2846 0.1932 0.1751 0.1635 0.1835
## Balanced Accuracy 0.9996 0.9989 0.9991 0.9974 0.9989
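The summary statistics above follow directly from the counts in the confusion matrix. As a check, the sketch below re-enters the validation counts by hand and recomputes accuracy, per-class sensitivity, positive predictive value, and Cohen's kappa using the standard formulas. The variable names are illustrative.

```r
# Validation confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
cm <- matrix(
  c(2231,    2,    0,    0,    0,
       0, 1515,    1,    0,    0,
       0,    1, 1367,    6,    0,
       0,    0,    0, 1280,    3,
       1,    0,    0,    0, 1439),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n        <- sum(cm)
n_errors <- n - sum(diag(cm))

# Overall accuracy: correctly classified cases over all cases
accuracy <- sum(diag(cm)) / n

# Sensitivity: correct predictions over the true count of each class
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions over predicted count
pos_pred_value <- diag(cm) / rowSums(cm)

# Cohen's kappa: agreement corrected for chance agreement
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(pos_pred_value, 4))
print(round(cohen_kappa, 4))
```

The estimated out-of-sample error is one minus the accuracy, which corresponds to 14 misclassified cases out of 7846.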
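The out-of-sample accuracy estimated on the held-out data is 99.82% (14 errors in 7846 cases), so the expected out-of-sample error is about 0.18%. Now we apply the model to the cleaned testing data (20 cases).
answers <- predict(model, cleanTestingdata)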
answers <- as.character(answers)
answers
## [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"