The data for this project come from http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite the authors, as they have been very generous in allowing their data to be used for this kind of assignment. The data set originated from: Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements. Proceedings of the 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
For this project, we are given data from accelerometers on the belt, forearm, arm, and dumbbell of 6 research study participants. Our training data consist of accelerometer data and a label identifying the quality of the activity the participant was doing. Our testing data consist of accelerometer data without the identifying label. Our goal is to predict the labels for the test set observations.
Note: the implementation in this project draws on this excellent caret tutorial: https://www.r-project.org/nosvn/conferences/useR-2013/Tutorials/kuhn/user_caret_2up.pdf
The first step is to get the data and check that all the fields in the training and test data are exactly the same.
# Load all the libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
#library(rattle)
url_raw_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
file_train <- "pml-training.csv"
# download.file(url = url_raw_train, destfile = file_train, method = "curl")
training <- read.csv(file_train, na.strings = c("NA", "", "#DIV/0!"), header = TRUE)
url_raw_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
file_test <- "pml-testing.csv"
# download.file(url = url_raw_test, destfile = file_test, method = "curl")
testing <- read.csv(file_test, na.strings = c("NA", "", "#DIV/0!"), header = TRUE)
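Since the download.file() calls above are commented out, a small guard makes the script reproducible end to end. A minimal sketch, reusing the variables from the chunk above:
# Download the raw CSVs only if they are not already present locally
if (!file.exists(file_train)) {
  download.file(url = url_raw_train, destfile = file_train, method = "curl")
}
if (!file.exists(file_test)) {
  download.file(url = url_raw_test, destfile = file_test, method = "curl")
}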
# Get column names
colnames_train <- colnames(training)
colnames_test <- colnames(testing)
# Verify that the column names (excluding the last column: classe in
# training, problem_id in testing) are identical in the two data sets.
all.equal(colnames_train[1:(length(colnames_train) - 1)],
          colnames_test[1:(length(colnames_test) - 1)])
## [1] TRUE
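For completeness, the differing columns can be listed explicitly; each setdiff() should return only the final column (output not shown here):
# classe exists only in the training set; problem_id only in the test set
setdiff(colnames_train, colnames_test)
setdiff(colnames_test, colnames_train)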
Next, clean the data: remove near-zero-variance variables, mostly-NA variables, and identifier columns. The cleaned training data will then be split into training and validation sets.
# remove variables with nearly zero variance
nzv <- nearZeroVar(training)
training <- training[, -nzv]
testing <- testing[, -nzv]
# remove variables that are almost always NA
mostlyNA <- sapply(training, function(x) mean(is.na(x))) > 0.90
training <- training[, mostlyNA==F]
testing <- testing[, mostlyNA==F]
# remove variables that don't make much sense for prediction (X, user_name,
# raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, ...), which
# happen to be the first seven variables
training <- training[, -(1:7)]
testing <- testing[, -(1:7)]
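As a quick sanity check (output not shown here), the same columns should have been dropped from both data sets:
# Both frames should now have the same number of columns
dim(training); dim(testing)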
# Predictor information after data cleaning
nearZeroVar(training,saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## pitch_belt 1.036082 9.3772296 FALSE FALSE
## yaw_belt 1.058480 9.9734991 FALSE FALSE
## total_accel_belt 1.063160 0.1477933 FALSE FALSE
## gyros_belt_x 1.058651 0.7134849 FALSE FALSE
## gyros_belt_y 1.144000 0.3516461 FALSE FALSE
## gyros_belt_z 1.066214 0.8612782 FALSE FALSE
## accel_belt_x 1.055412 0.8357966 FALSE FALSE
## accel_belt_y 1.113725 0.7287738 FALSE FALSE
## accel_belt_z 1.078767 1.5237998 FALSE FALSE
## magnet_belt_x 1.090141 1.6664968 FALSE FALSE
## magnet_belt_y 1.099688 1.5187035 FALSE FALSE
## magnet_belt_z 1.006369 2.3290184 FALSE FALSE
## roll_arm 52.338462 13.5256345 FALSE FALSE
## pitch_arm 87.256410 15.7323412 FALSE FALSE
## yaw_arm 33.029126 14.6570176 FALSE FALSE
## total_accel_arm 1.024526 0.3363572 FALSE FALSE
## gyros_arm_x 1.015504 3.2769341 FALSE FALSE
## gyros_arm_y 1.454369 1.9162165 FALSE FALSE
## gyros_arm_z 1.110687 1.2638875 FALSE FALSE
## accel_arm_x 1.017341 3.9598410 FALSE FALSE
## accel_arm_y 1.140187 2.7367241 FALSE FALSE
## accel_arm_z 1.128000 4.0362858 FALSE FALSE
## magnet_arm_x 1.000000 6.8239731 FALSE FALSE
## magnet_arm_y 1.056818 4.4439914 FALSE FALSE
## magnet_arm_z 1.036364 6.4468454 FALSE FALSE
## roll_dumbbell 1.022388 84.2065029 FALSE FALSE
## pitch_dumbbell 2.277372 81.7449801 FALSE FALSE
## yaw_dumbbell 1.132231 83.4828254 FALSE FALSE
## total_accel_dumbbell 1.072634 0.2191418 FALSE FALSE
## gyros_dumbbell_x 1.003268 1.2282132 FALSE FALSE
## gyros_dumbbell_y 1.264957 1.4167771 FALSE FALSE
## gyros_dumbbell_z 1.060100 1.0498420 FALSE FALSE
## accel_dumbbell_x 1.018018 2.1659362 FALSE FALSE
## accel_dumbbell_y 1.053061 2.3748853 FALSE FALSE
## accel_dumbbell_z 1.133333 2.0894914 FALSE FALSE
## magnet_dumbbell_x 1.098266 5.7486495 FALSE FALSE
## magnet_dumbbell_y 1.197740 4.3012945 FALSE FALSE
## magnet_dumbbell_z 1.020833 3.4451126 FALSE FALSE
## roll_forearm 11.589286 11.0895933 FALSE FALSE
## pitch_forearm 65.983051 14.8557741 FALSE FALSE
## yaw_forearm 15.322835 10.1467740 FALSE FALSE
## total_accel_forearm 1.128928 0.3567424 FALSE FALSE
## gyros_forearm_x 1.059273 1.5187035 FALSE FALSE
## gyros_forearm_y 1.036554 3.7763735 FALSE FALSE
## gyros_forearm_z 1.122917 1.5645704 FALSE FALSE
## accel_forearm_x 1.126437 4.0464784 FALSE FALSE
## accel_forearm_y 1.059406 5.1116094 FALSE FALSE
## accel_forearm_z 1.006250 2.9558659 FALSE FALSE
## magnet_forearm_x 1.012346 7.7667924 FALSE FALSE
## magnet_forearm_y 1.246914 9.5403119 FALSE FALSE
## magnet_forearm_z 1.000000 8.5771073 FALSE FALSE
## classe 1.469581 0.0254816 FALSE FALSE
Create a training set (60%) and a validation set (40%) from the given training data.
set.seed(007)
intrain <- createDataPartition(y=training$classe, p=0.6, list=FALSE)
my_training <- training[intrain, ]
my_testing <- training[-intrain, ]
dim(my_training); dim(my_testing)
## [1] 11776 52
## [1] 7846 52
Let's see how some of the remaining predictors with the lowest unique-value percentages relate to one another.
featurePlot(x = training[, c("total_accel_forearm", "total_accel_dumbbell",
                             "total_accel_arm", "gyros_belt_x",
                             "gyros_belt_y", "gyros_belt_z")],
            y = training$classe)
The data cleaning process is now complete. We are left with 51 predictor variables plus the classe outcome (52 columns in total); these predictors will be used to fit different ML models.
As hinted by the professor during the video lectures, decision trees combined with boosting, and random forests, tend to perform best in terms of accuracy, though they can be slow to train. Since our training split has 11,776 records, we will train models based on those algorithms.
I will start with a decision tree.
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
Note: the tree plotting technique is derived from http://blog.revolutionanalytics.com/2013/06/plotting-classification-and-regression-trees-with-plotrpart.html
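The chunk that fits and plots the tree is not echoed in this report. A minimal sketch of an equivalent fit, using rpart directly (which is consistent with the type = "class" prediction below) and rattle's fancyRpartPlot from the link above:
library(rpart)
# Fit a CART model on the 60% training split; this is a reconstruction of
# the un-echoed chunk, not necessarily the exact original call
model_dectree <- rpart(classe ~ ., data = my_training, method = "class")
# Draw the fitted tree
fancyRpartPlot(model_dectree)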
predictions_dectree <- predict(model_dectree, my_testing, type = "class")
confusionMatrix(predictions_dectree, my_testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1975 296 21 105 70
## B 50 743 119 45 163
## C 42 182 954 116 193
## D 143 240 222 944 206
## E 22 57 52 76 810
##
## Overall Statistics
##
## Accuracy : 0.6916
## 95% CI : (0.6812, 0.7018)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6093
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8849 0.4895 0.6974 0.7341 0.5617
## Specificity 0.9124 0.9404 0.9177 0.8764 0.9677
## Pos Pred Value 0.8006 0.6634 0.6416 0.5379 0.7965
## Neg Pred Value 0.9522 0.8848 0.9349 0.9439 0.9075
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2517 0.0947 0.1216 0.1203 0.1032
## Detection Prevalence 0.3144 0.1427 0.1895 0.2237 0.1296
## Balanced Accuracy 0.8986 0.7149 0.8075 0.8052 0.7647
The evaluation matrix shows that our decision tree model achieved an accuracy of just 69.16%, which is not satisfactory in our case. Because the tree achieved far less accuracy than we expected, we will diverge from our initial plan of combining it with boosting and instead apply the random forest algorithm directly to see how it performs.
Let's quickly look at the results of the random forest.
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
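The chunk that fits the model is not echoed in this report. Based on the printed Call below (caret's method = "rf", which tuned mtry to 26), an equivalent fit might look like the following sketch; the 3-fold cross-validation is an assumption made to keep run time reasonable:
# Reconstruction of the un-echoed fitting chunk; the trControl settings
# are an assumption, not taken from the original analysis
model_randfor <- train(classe ~ ., data = my_training, method = "rf",
                       trControl = trainControl(method = "cv", number = 3))
model_randfor$finalModel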
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 26
##
## OOB estimate of error rate: 0.87%
## Confusion matrix:
## A B C D E class.error
## A 3341 5 1 0 1 0.002090800
## B 17 2251 10 0 1 0.012286090
## C 0 20 2027 7 0 0.013145083
## D 0 0 26 1901 3 0.015025907
## E 0 1 4 7 2153 0.005542725
It took my machine around 5 minutes to generate the above model.
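If the five-minute fit becomes a bottleneck, caret can run the resampling in parallel. A minimal sketch, assuming the doParallel package is installed:
library(doParallel)
# Register a parallel backend; caret's train() uses it automatically
# because allowParallel defaults to TRUE in trainControl()
cl <- makePSOCKcluster(max(1, parallel::detectCores() - 1))
registerDoParallel(cl)
# ... refit the random forest here, as in the sketch above ...
stopCluster(cl)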
predictions_randfor <- predict(model_randfor, my_testing)
confusionMatrix(predictions_randfor, my_testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2229 16 0 0 0
## B 2 1500 4 0 0
## C 0 2 1359 9 1
## D 1 0 5 1274 6
## E 0 0 0 3 1435
##
## Overall Statistics
##
## Accuracy : 0.9938
## 95% CI : (0.9918, 0.9954)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9921
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9987 0.9881 0.9934 0.9907 0.9951
## Specificity 0.9971 0.9991 0.9981 0.9982 0.9995
## Pos Pred Value 0.9929 0.9960 0.9912 0.9907 0.9979
## Neg Pred Value 0.9995 0.9972 0.9986 0.9982 0.9989
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2841 0.1912 0.1732 0.1624 0.1829
## Detection Prevalence 0.2861 0.1919 0.1747 0.1639 0.1833
## Balanced Accuracy 0.9979 0.9936 0.9958 0.9944 0.9973
As this matrix shows, our random forest model achieved 99.38% accuracy on the validation set, for an estimated out-of-sample error of 0.62%, which is far better than the decision tree model we evaluated previously. With accuracy this high, I will use this model on the test set rather than the decision tree.
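The out-of-sample error estimate can also be pulled straight from the confusion matrix object rather than read off the printout; a small sketch (cm_rf is a name of our choosing):
# Estimated out-of-sample error derived from the validation accuracy
cm_rf <- confusionMatrix(predictions_randfor, my_testing$classe)
1 - as.numeric(cm_rf$overall["Accuracy"])  # roughly 0.0062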
Let's use the provided test data set and generate predictions from our best model.
predictions_test <- predict(model_randfor, testing)
predictions_test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
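For the course submission, each prediction can be written to its own text file. A minimal sketch; the helper name and file naming scheme are our own choices, not prescribed by the assignment:
# Write one text file per test case (hypothetical helper)
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(predictions_test))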