Practical machine learning course report
============================

*****Alexa Kiss*****

This is a homework assignment of Coursera’s MOOC Practical Machine Learning from Johns Hopkins University. For more information about the MOOCs in this Specialization, please visit: https://www.coursera.org/specialization/jhudatascience/

Summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly. One thing that people often do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

Aim

In this project, I have used data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants in order to predict the manner in which they performed the exercise.

Data source and experimental details

Weight lifting exercises dataset

“Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied to the manner they were supposed to simulate.”

Loading and cleaning data

## Loading required package: lattice
## Loading required package: ggplot2
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

##  sysname  release 
## "Darwin" "12.2.0"

## [1] "R version 3.1.2 (2014-10-31)"

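#  loading packages, setting seed and downloading data
library(caret)         # nearZeroVar, createDataPartition, confusionMatrix
library(rpart)         # decision tree
library(randomForest)  # random forest
set.seed(123)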

trainURL <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

traindata <- read.csv(url(trainURL), na.strings=c("NA","#DIV/0!",""))
testdata <- read.csv(url(testURL), na.strings=c("NA","#DIV/0!",""))

NAsums <- function(x) sum(is.na(x))
NAvari <- sapply(traindata, NAsums)
sum(NAvari == 0)

## [1] 60

Upon viewing the dataset, it appears that most of the variables contain a large number of missing values (NA). As missing values can cause most classification models to fail, and these variables represent summary statistics, they will be removed.

rem<-which(colSums(is.na(traindata))>1000)
traindata<-traindata[, -rem]
dim(traindata)

## [1] 19622    60
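The filter above uses an absolute count of 1000 missing values. A minimal sketch of the same idea expressed as a proportion, on a small made-up data frame, is given below. The `toy` data, the name `na_frac` and the 0.5 threshold are illustrative assumptions and not part of the original analysis.

```r
# Toy data frame: 'b' is mostly missing, 'a' and 'c' are complete.
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c("x", "y", "x", "y")
)

# Fraction of missing values in each column.
na_frac <- colMeans(is.na(toy))

# Keep only columns whose missing fraction is at most 0.5 (assumed threshold).
# Logical indexing also behaves correctly when no column is dropped,
# whereas toy[, -which(...)] would return zero columns in that case.
toy_clean <- toy[, na_frac <= 0.5, drop = FALSE]
names(toy_clean)
```

Working with a fraction rather than a count keeps the rule meaningful if the number of rows changes.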

Near-zero variance predictors may also cause model failure, so it is better to remove them.

nzv<- nearZeroVar(traindata,saveMetrics=TRUE)
traindata <- traindata[,nzv$nzv==FALSE]

dim(traindata)

## [1] 19622    59
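As a rough illustration of what a near-zero variance check looks for, the base-R sketch below computes two quantities for a single vector: the ratio of the most common value's frequency to that of the second most common, and the percentage of distinct values. The helper `is_near_zero_var` is hypothetical, and its cutoffs (95/5 and 10, which to my knowledge match the `nearZeroVar` defaults) are assumptions; it is not caret's implementation.

```r
# Hypothetical helper, not caret's code: flag a vector as near-zero variance.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)  # a single distinct value: zero variance
  freq_ratio <- as.numeric(counts[1]) / as.numeric(counts[2])
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique < unique_cut
}

# Made-up examples.
mostly_constant <- c(rep("no", 98), rep("yes", 2))  # frequency ratio 49, 2% unique
varied <- seq_len(100)                               # every value distinct

is_near_zero_var(mostly_constant)
is_near_zero_var(varied)
```

A predictor is flagged only when both conditions hold, so a variable with many distinct values is kept even if one value is very frequent.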

After this step, 59 variables remain. However, the goal is to produce a model that can tell how well an unknown user is performing an exercise. Some of the remaining variables are not needed for this: the ‘X’ row IDs, the user names, the timestamp details and the window number, which occupy the first six columns.

traindata<-traindata[,-(1:6),drop=FALSE]
dim(traindata)

## [1] 19622    53
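Dropping columns by position depends on the column order left by the earlier steps. Below is a sketch of the same step done by name on a toy data frame. The names listed in `id_cols` are my assumption about what the first six columns are called and should be checked against `names(traindata)`; the toy values are made up.

```r
# Toy stand-in for the cleaned training data (values are made up).
toy <- data.frame(
  X = 1:3,
  user_name = c("u1", "u2", "u3"),
  roll_belt = c(1.4, 1.5, 1.6),
  classe = c("A", "B", "A")
)

# Assumed names of the identifier and bookkeeping columns.
id_cols <- c("X", "user_name", "raw_timestamp_part_1",
             "raw_timestamp_part_2", "cvtd_timestamp", "num_window")

# setdiff() keeps every column not listed, and ignores listed names
# that are absent from the data.
toy_pred <- toy[, setdiff(names(toy), id_cols), drop = FALSE]
names(toy_pred)
```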

Data partitioning and training

I will use hold-out cross-validation: 75% of the data is used for training and the remaining 25% for validation.

dataparts<- createDataPartition(traindata$classe, p=0.75, list=FALSE)
training <- traindata[dataparts, ]
validation <- traindata[-dataparts, ]
dim(training); dim(validation)

## [1] 14718    53

## [1] 4904   53

# check the proportion of the different classes after data splitting
prop.table(table(traindata$classe))

## 
##         A         B         C         D         E 
## 0.2843747 0.1935073 0.1743961 0.1638977 0.1838243

prop.table(table(training$classe))

## 
##         A         B         C         D         E 
## 0.2843457 0.1935046 0.1744123 0.1638810 0.1838565
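The near-identical class proportions come from stratified sampling: the split is drawn within each class. A base-R sketch of that idea on made-up labels follows. The helper `stratified_index` is hypothetical and only approximates what `createDataPartition` does.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_index <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # Index into idx explicitly so a class with one row is handled correctly.
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)
y <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))  # made-up labels
train_idx <- stratified_index(y, p = 0.75)

# Class proportions are preserved in both parts of the split.
prop.table(table(y))
prop.table(table(y[train_idx]))
prop.table(table(y[-train_idx]))
```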

First, I apply a decision tree:

model1 <- rpart(classe ~ ., data=training, method="class")
plot(model1, uniform=FALSE,
   main="Decision tree of weight lifting data ")
text(model1, use.n=FALSE, all=TRUE, cex=.6)

prediction1 <- predict(model1, validation, type = "class")
CMmodel1 <- confusionMatrix(prediction1, validation$classe)
CMmodel1

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1237  131   16   44   15
##          B   45  598   72   67   71
##          C   39  102  683  134  115
##          D   51   64   65  499   46
##          E   23   54   19   60  654
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7486          
##                  95% CI : (0.7362, 0.7607)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6817          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8867   0.6301   0.7988   0.6206   0.7259
## Specificity            0.9413   0.9355   0.9037   0.9449   0.9610
## Pos Pred Value         0.8572   0.7011   0.6365   0.6883   0.8074
## Neg Pred Value         0.9543   0.9134   0.9551   0.9270   0.9397
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2522   0.1219   0.1393   0.1018   0.1334
## Detection Prevalence   0.2942   0.1739   0.2188   0.1478   0.1652
## Balanced Accuracy      0.9140   0.7828   0.8513   0.7828   0.8434

Note that the prediction on the validation dataset is not very accurate (accuracy: 0.7486; 95% CI: 0.7362 to 0.7607). Using this model, we would expect an out-of-sample error of around 0.25 (25%).
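The accuracy and the error estimate can be recomputed directly from the confusion matrix. The sketch below copies the decision-tree counts from the output above into a plain matrix; the variable names are illustrative.

```r
# Decision-tree confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(1237, 131,  16,  44,  15,
      45, 598,  72,  67,  71,
      39, 102, 683, 134, 115,
      51,  64,  65, 499,  46,
      23,  54,  19,  60, 654),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

accuracy <- sum(diag(cm)) / sum(cm)    # share of correct predictions
error_rate <- 1 - accuracy             # estimated out-of-sample error
sensitivity <- diag(cm) / colSums(cm)  # per-class recall

round(accuracy, 4)
round(error_rate, 4)
round(sensitivity, 4)
```

The diagonal holds the correctly classified cases, so the error estimate is simply the off-diagonal share of the 4904 validation rows.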

Next, I will use random forest. Some advantages of this method are high accuracy, no need for variable transformation, and results that are relatively easy to interpret (e.g. using variable importance).

model2 <- randomForest(classe ~ ., data=training)
prediction2 <- predict(model2, validation, type = "class")
CMmodel2 <- confusionMatrix(prediction2, validation$classe)
CMmodel2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1394    1    0    0    0
##          B    1  946    7    0    0
##          C    0    2  848    9    0
##          D    0    0    0  794    1
##          E    0    0    0    1  900
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9955          
##                  95% CI : (0.9932, 0.9972)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9943          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9993   0.9968   0.9918   0.9876   0.9989
## Specificity            0.9997   0.9980   0.9973   0.9998   0.9998
## Pos Pred Value         0.9993   0.9916   0.9872   0.9987   0.9989
## Neg Pred Value         0.9997   0.9992   0.9983   0.9976   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2843   0.1929   0.1729   0.1619   0.1835
## Detection Prevalence   0.2845   0.1945   0.1752   0.1621   0.1837
## Balanced Accuracy      0.9995   0.9974   0.9945   0.9937   0.9993

# plot the important variables
varImpPlot(model2, main="Random forest of weight lifting data", cex=1)

Clearly, random forest yields higher accuracy than the decision tree: the expected out-of-sample error is below 0.01 (1 - 0.9955 ≈ 0.0045 on the validation set).
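The error estimate and an interval around it can be derived from the validation counts alone. The sketch below uses an exact binomial interval from base R's `binom.test`; as far as I know this is also how the interval in the `confusionMatrix` output is obtained, but treat that as an assumption.

```r
# Random-forest validation counts taken from the confusion matrix above.
correct <- 1394 + 946 + 848 + 794 + 900  # diagonal of the matrix
total <- 4904                            # size of the validation set

accuracy <- correct / total
error_rate <- 1 - accuracy

# Exact binomial 95% confidence interval for the accuracy.
ci_accuracy <- binom.test(correct, total)$conf.int
# The same interval expressed as an error rate.
ci_error <- rev(1 - ci_accuracy)

round(c(accuracy = accuracy, error = error_rate), 4)
round(ci_error, 4)
```

Because the interval is computed on a single hold-out split, it reflects sampling uncertainty in the validation set only, not variation between different splits.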

Final prediction

Finally, I use the selected model (random forest) to predict the manner of weight lifting for the 20 cases in the test data.

answers<-predict(model2,testdata, type="class")
answers1<-as.character(answers)
answers1

##  [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"

pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}

pml_write_files(answers1)

Note: The model successfully classified all of the test dataset, with high accuracy. However, the variable importance plot shows little change after the first 10-15 variables, so decreasing the number of predictors may yield further improvements (e.g. in interpretability and in computational cost). For example, the number of predictors may be reduced by removing one variable from each highly correlated pair, or by performing PCA in the preprocessing step.
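The two reduction strategies mentioned in the note can be sketched in base R on simulated data. Everything below is an illustration under assumptions: the data are made up, and the 0.9 correlation cutoff and the 95% variance threshold are arbitrary choices, not values taken from this analysis.

```r
set.seed(123)

# Made-up data: six predictors built from two underlying signals plus noise,
# so several columns are highly correlated with each other.
n <- 200
s1 <- rnorm(n)
s2 <- rnorm(n)
x <- data.frame(
  v1 = s1 + rnorm(n, sd = 0.1),
  v2 = s1 + rnorm(n, sd = 0.1),
  v3 = s1 + rnorm(n, sd = 0.1),
  v4 = s2 + rnorm(n, sd = 0.1),
  v5 = s2 + rnorm(n, sd = 0.1),
  v6 = s2 + rnorm(n, sd = 0.1)
)

# Option 1: list predictor pairs with absolute correlation above 0.9.
cors <- abs(cor(x))
high <- which(cors > 0.9 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, "row"]],
           var2 = colnames(cors)[high[, "col"]])

# Option 2: PCA on centred and scaled predictors; keep enough components
# to explain 95% of the variance (assumed threshold).
pca <- prcomp(x, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(cum_var >= 0.95)[1]
n_comp
```

Dropping one variable from each correlated pair keeps the original, interpretable predictors, while PCA replaces them with combinations that are harder to read in a variable importance plot.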

References

http://groupware.les.inf.puc-rio.br/har
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr