============================
*****Alexa Kiss*****
This is a homework assignment of Coursera’s MOOC Practical Machine Learning from Johns Hopkins University. For more information about the MOOCs in this Specialization, please visit: https://www.coursera.org/specialization/jhudatascience/
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly. One thing that people often do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
### Aim
In this project, I used data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they performed the exercise.
### Weight lifting exercises dataset
“Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied to the manner they were supposed to simulate.”
Read more: http://groupware.les.inf.puc-rio.br/har#ixzz3s9NNDsSK
# load the packages used in this analysis
library(caret)
library(rpart)
library(randomForest)
## Loading required package: lattice
## Loading required package: ggplot2
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
## sysname release
## "Darwin" "12.2.0"
## [1] "R version 3.1.2 (2014-10-31)"
# setting seed and downloading data
set.seed(123)
trainURL <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
traindata <- read.csv(url(trainURL), na.strings=c("NA","#DIV/0!",""))
testdata <- read.csv(url(testURL), na.strings=c("NA","#DIV/0!",""))
NAsums <- function(x) sum(is.na(x))
NAvari <- sapply(traindata, NAsums)
sum(NAvari == 0)
## [1] 60
A look at the dataset shows that most of the variables consist largely of missing values (NA); only 60 columns are complete. Because missing values cause most classification models to fail, and these variables are summary statistics, I remove every column with more than 1000 missing values.
rem<-which(colSums(is.na(traindata))>1000)
traindata<-traindata[, -rem]
dim(traindata)
## [1] 19622 60
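The column filter above can be tried on a toy data frame. This is a minimal sketch, not the code used in this report: the data frame `toy` and the threshold `max_na` are made up for illustration. It selects columns with a logical mask instead of a negative index, which sidesteps a pitfall of `x[, -which(...)]`: if no column exceeds the threshold, `which()` returns an empty vector and the negative index drops every column.

```r
# Toy data (hypothetical): two complete columns and one mostly-missing column
toy <- data.frame(
  roll  = c(1.2, 3.4, 5.6, 7.8),
  pitch = c(0.1, 0.2, 0.3, 0.4),
  kurt  = c(NA, NA, NA, 2.5)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))

# Keep columns whose NA count is at or below a threshold (assumed value)
max_na <- 1
keep <- na_counts <= max_na
toy_clean <- toy[, keep, drop = FALSE]

print(na_counts)
print(names(toy_clean))
```

Because `keep` is a logical vector with one entry per column, the selection stays correct even when every column passes the check.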
Near-zero variance predictors may also cause model failure, so it is better to remove them.
nzv<- nearZeroVar(traindata,saveMetrics=TRUE)
traindata <- traindata[,nzv$nzv==FALSE]
dim(traindata)
## [1] 19622 59
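To make the near-zero variance check less of a black box, the sketch below approximates, in base R, the two metrics that `nearZeroVar` reports with `saveMetrics=TRUE`: the frequency ratio of the two most common values and the percentage of unique values. The helper `nzv_metrics` and the toy vectors are hypothetical, and the cutoffs (95/5 and 10) are assumed to match caret's defaults; treat it as an illustration rather than a replacement for the caret function.

```r
# Hypothetical helper: approximates the two metrics caret reports per column
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common one
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  # Distinct values as a percentage of the number of observations
  percent_unique <- 100 * length(counts) / length(x)
  zero_var <- length(counts) <= 1
  list(
    freq_ratio = freq_ratio,
    percent_unique = percent_unique,
    nzv = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

# A flag that is "no" in 98 of 100 rows versus a well-spread numeric column
flag   <- c(rep("no", 98), rep("yes", 2))
spread <- seq(0, 1, length.out = 100)

str(nzv_metrics(flag))
str(nzv_metrics(spread))
```

The lopsided flag is marked as near-zero variance because it has both a high frequency ratio and few distinct values; the evenly spread column is not.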
59 variables remain. However, the goal is a model that can tell whether an unknown user is performing an exercise well or not, so variables that only identify the recording are not needed: the ‘X’ row IDs, the user names, the timestamp details, and the window number. These are the first six columns.
traindata<-traindata[,-(1:6),drop=FALSE]
dim(traindata)
## [1] 19622 53
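Dropping columns by position works here, but it silently depends on the column order left by the earlier steps. A name-based variant is sketched below on a toy data frame. The values are invented, and the six column names are my assumption about what the first six columns are called at this point in the analysis.

```r
# Toy stand-in (hypothetical values) with bookkeeping columns and one sensor column
toy <- data.frame(
  X = 1:3,
  user_name = c("u1", "u2", "u3"),
  raw_timestamp_part_1 = c(100L, 200L, 300L),
  raw_timestamp_part_2 = c(10L, 20L, 30L),
  cvtd_timestamp = c("t1", "t2", "t3"),
  num_window = c(11L, 12L, 13L),
  roll_belt = c(1.41, 1.42, 1.48),
  classe = factor(c("A", "B", "A"))
)

# Columns that identify the recording rather than describe the movement
# (names assumed, see the note above)
id_cols <- c("X", "user_name", "raw_timestamp_part_1",
             "raw_timestamp_part_2", "cvtd_timestamp", "num_window")

# Keep everything that is not an identifier column
toy_model <- toy[, setdiff(names(toy), id_cols), drop = FALSE]
print(names(toy_model))
```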
I will use hold-out validation: 75% of the data is used for training and the remaining 25% is set aside for validation.
dataparts<- createDataPartition(traindata$classe, p=0.75, list=FALSE)
training <- traindata[dataparts, ]
validation <- traindata[-dataparts, ]
dim(training); dim(validation)
## [1] 14718 53
## [1] 4904 53
# check the proportion of the different classes after data splitting
prop.table(table(traindata$classe))
##
## A B C D E
## 0.2843747 0.1935073 0.1743961 0.1638977 0.1838243
prop.table(table(training$classe))
##
## A B C D E
## 0.2843457 0.1935046 0.1744123 0.1638810 0.1838565
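The class proportions barely change because the split is stratified by `classe`. The base R sketch below shows the idea on made-up labels: sample 75% of the rows within each class, then combine the indices. It illustrates the principle only and is not a reimplementation of `createDataPartition`.

```r
set.seed(123)

# Hypothetical labels with unequal class sizes
y <- factor(rep(c("A", "B", "C"), times = c(400, 250, 150)))

# Sample 75% of the row indices within each class, then combine them
idx_by_class <- split(seq_along(y), y)
train_idx <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}), use.names = FALSE))

train_y <- y[train_idx]
valid_y <- y[-train_idx]

# Class proportions before and after the split should be nearly identical
print(prop.table(table(y)))
print(prop.table(table(train_y)))
```

Sampling within each class guarantees that rare classes are represented in both parts, which a plain random split does not.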
First, I apply a decision tree:
model1 <- rpart(classe ~ ., data=training, method="class")
plot(model1, uniform=FALSE,
main="Decision tree of weight lifting data ")
text(model1, use.n=FALSE, all=TRUE, cex=.6)
prediction1 <- predict(model1, validation, type = "class")
CMmodel1 <- confusionMatrix(prediction1, validation$classe)
CMmodel1
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1237 131 16 44 15
## B 45 598 72 67 71
## C 39 102 683 134 115
## D 51 64 65 499 46
## E 23 54 19 60 654
##
## Overall Statistics
##
## Accuracy : 0.7486
## 95% CI : (0.7362, 0.7607)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6817
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8867 0.6301 0.7988 0.6206 0.7259
## Specificity 0.9413 0.9355 0.9037 0.9449 0.9610
## Pos Pred Value 0.8572 0.7011 0.6365 0.6883 0.8074
## Neg Pred Value 0.9543 0.9134 0.9551 0.9270 0.9397
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2522 0.1219 0.1393 0.1018 0.1334
## Detection Prevalence 0.2942 0.1739 0.2188 0.1478 0.1652
## Balanced Accuracy 0.9140 0.7828 0.8513 0.7828 0.8434
Note that predictions on the validation set are not very accurate (accuracy: 0.7486, 95% CI: 0.7362 to 0.7607). With this model, we would expect an out-of-sample error of about 0.25 (25%).
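The headline numbers can be recomputed by hand from the decision-tree confusion matrix, which shows where the 25% error estimate comes from. The sketch below uses only base R; the counts are copied from the output above, and the variable names are my own.

```r
# Decision-tree confusion matrix as reported above
# (rows = prediction, columns = reference)
cm <- matrix(
  c(1237, 131,  16,  44,  15,
      45, 598,  72,  67,  71,
      39, 102, 683, 134, 115,
      51,  64,  65, 499,  46,
      23,  54,  19,  60, 654),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(cm)
accuracy    <- sum(diag(cm)) / n          # share of correct predictions
error_rate  <- 1 - accuracy               # estimated out-of-sample error
sensitivity <- diag(cm) / colSums(cm)     # per-class recall

# Cohen's kappa: agreement beyond what the marginal totals give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, error = error_rate, kappa = cohen_kappa), 4))
print(round(sensitivity, 4))
```

The error estimate is simply one minus the validation accuracy, so it is only as trustworthy as the validation set is representative of new data.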
Next, I use a random forest. Some advantages of this method: high accuracy, no variable transformation needed, and results that are relatively easy to interpret (e.g. using variable importance).
model2 <- randomForest(classe ~ ., data=training)
prediction2 <- predict(model2, validation, type = "class")
CMmodel2 <- confusionMatrix(prediction2, validation$classe)
CMmodel2
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 1 0 0 0
## B 1 946 7 0 0
## C 0 2 848 9 0
## D 0 0 0 794 1
## E 0 0 0 1 900
##
## Overall Statistics
##
## Accuracy : 0.9955
## 95% CI : (0.9932, 0.9972)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9943
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9968 0.9918 0.9876 0.9989
## Specificity 0.9997 0.9980 0.9973 0.9998 0.9998
## Pos Pred Value 0.9993 0.9916 0.9872 0.9987 0.9989
## Neg Pred Value 0.9997 0.9992 0.9983 0.9976 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2843 0.1929 0.1729 0.1619 0.1835
## Detection Prevalence 0.2845 0.1945 0.1752 0.1621 0.1837
## Balanced Accuracy 0.9995 0.9974 0.9945 0.9937 0.9993
# plot the important variables
varImpPlot(model2, main="Random forest of weight lifting data", cex=1)
Clearly, the random forest is more accurate than the decision tree (0.9955 versus 0.7486 on the validation set); the expected out-of-sample error is below 0.01.
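The bound on the out-of-sample error can be checked from the counts in the random-forest confusion matrix. The sketch below uses base R's `binom.test` to get an exact binomial interval for the accuracy. Whether this is exactly how the interval in the output above was computed is an assumption on my part.

```r
# Correct predictions: diagonal of the random-forest confusion matrix above
correct <- 1394 + 946 + 848 + 794 + 900
total   <- 4904

accuracy <- correct / total
ci <- binom.test(correct, total)$conf.int   # exact 95% interval for accuracy

# The error rate and its upper bound are one minus the accuracy figures
error_rate  <- 1 - accuracy
error_upper <- 1 - ci[1]

print(round(c(accuracy = accuracy, lower = ci[1], upper = ci[2]), 4))
print(round(c(error = error_rate, error_upper = error_upper), 4))
```

Even the pessimistic end of the interval keeps the estimated error below 1%, which supports choosing the random forest over the decision tree.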
Finally, I use the selected model to predict the manner of weight lifting for the 20 cases in the test data.
answers<-predict(model2,testdata, type="class")
answers1<-as.character(answers)
answers1
## [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"
pml_write_files <- function(x) {
  # write each prediction to its own file: problem_id_1.txt, problem_id_2.txt, ...
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(answers1)
Note: The model classified all of the test cases successfully. However, the variable importance plot shows little change after the first 10-15 variables, so reducing the number of predictors may yield further improvements (e.g. in interpretability and computational cost). For example, the number of predictors could be reduced by removing one variable from each highly correlated pair, or by performing PCA in the preprocessing step.
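Both reduction ideas can be sketched in base R on simulated predictors. Everything in the example is hypothetical: the data, the 0.9 correlation cutoff, and the 95% variance target. In practice, caret's `findCorrelation()` and `preProcess(method = "pca")` would be the more natural tools, and neither approach was evaluated in this report.

```r
set.seed(123)

# Simulated predictors (hypothetical): x2 is nearly a copy of x1
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
x3 <- rnorm(n)
x4 <- rnorm(n)
X  <- data.frame(x1, x2, x3, x4)

# Option 1: drop the later member of each highly correlated pair
cutoff  <- 0.9
cor_mat <- abs(cor(X))
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- 0   # keep each pair only once
drop_cols <- names(which(apply(cor_mat, 2, max) > cutoff))
X_reduced <- X[, setdiff(names(X), drop_cols), drop = FALSE]

# Option 2: PCA on centered, scaled predictors; keep the components
# that together explain at least 95% of the variance
pca <- prcomp(X, center = TRUE, scale. = TRUE)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(var_explained >= 0.95)[1]
X_pca  <- pca$x[, seq_len(n_comp), drop = FALSE]

print(drop_cols)
print(n_comp)
```

The correlation filter keeps the original variables and so preserves interpretability, while PCA replaces them with combinations that are harder to relate back to individual sensors.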
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr