Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways; each way is one exercise class. We will build a model to predict the exercise class from the accelerometer data.
library(dplyr)  # select(), starts_with(), %>%
library(caret)  # createDataPartition(), train(), confusionMatrix()
# Create the data directory if needed, then download each file only once
if (!dir.exists("./data")) dir.create("./data")
if (!file.exists("./data/pml-training.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                "./data/pml-training.csv", method = "curl")
}
if (!file.exists("./data/pml-testing.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                "./data/pml-testing.csv", method = "curl")
}
df_train_base <- read.csv("./data/pml-training.csv", sep = ",", stringsAsFactors = FALSE, strip.white = TRUE)
df_test_base <- read.csv("./data/pml-testing.csv", sep = ",", stringsAsFactors = FALSE, strip.white = TRUE)
We set an initial seed to ensure this analysis can be reproduced:
set.seed(123)
The training set contains 19622 observations and 160 features. Some features are present for only some observations. These observations correspond to observation windows over which the research team derived summary statistics from the sensor data (average, variance, etc.). The windows last 0.5 to 2.5 seconds and each corresponds to one exercise class. This window-based approach cannot be used for our problem, because the observations we have to predict are instantaneous measurements.
We can therefore clean up the dataset by removing all the derived data generated by the research team for their window-based approach. The time-related data are also not relevant for our exercise, as we are looking to classify instantaneous measurements. Finally, the name of the participant and the X feature are also irrelevant.
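One way to spot the columns that are filled in only on window-summary rows is to measure the fraction of missing or empty values per column. The sketch below runs on a small made-up data frame (`toy`, with hypothetical values) rather than the real file; the same `sapply` call could be pointed at `df_train_base`.

```r
# Hypothetical toy data: the two derived columns are only filled in on the
# last row, mimicking a window-summary row in the real dataset.
toy <- data.frame(
  roll_belt          = c(1.41, 1.42, 1.48, 1.45),
  avg_roll_belt      = c(NA, NA, NA, 1.44),
  kurtosis_roll_belt = c("", "", "", "5.58"),
  stringsAsFactors   = FALSE
)

# Fraction of values per column that are NA or an empty string.
missing_frac <- sapply(toy, function(col) mean(is.na(col) | col == ""))
print(missing_frac)

# Columns that are mostly empty are candidates for removal.
mostly_empty <- names(missing_frac)[missing_frac > 0.5]
print(mostly_empty)
```

The 0.5 threshold is an arbitrary choice for the illustration; the analysis itself removes the derived columns by name prefix.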
df_train <- df_train_base %>%
  select(-starts_with("avg_"),
         -starts_with("var_"),
         -starts_with("stddev_"),
         -starts_with("max_"),
         -starts_with("min_"),
         -starts_with("amplitude_"),
         -starts_with("kurtosis_"),
         -starts_with("skewness_"),
         -new_window,
         -num_window,
         -user_name,
         -raw_timestamp_part_1,
         -raw_timestamp_part_2,
         -cvtd_timestamp,
         -X)
dim(df_train)
## [1] 19622 53
We are now left with 53 columns: 52 features corresponding to instantaneous measurements of the sensors installed on the participants, plus the classe outcome.
The outcome is the exercise class (classe) and must be converted to a factor:
df_train$classe <- as.factor(df_train$classe)
We will split the training set and keep a test set aside in order to estimate the performance of our model. The split keeps 70% of the data for training and the remaining 30% for testing. We will use the testing set to estimate the out-of-sample accuracy of our model.
inTrain <- createDataPartition(y=df_train$classe, p=0.7, list=FALSE)
training <- df_train[inTrain, ]
testing <- df_train[-inTrain,]
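For a factor outcome, createDataPartition samples within each class so that the class proportions are roughly preserved in both subsets. The sketch below imitates that idea in base R on hypothetical labels; it is a simplified stand-in, not caret's actual implementation.

```r
set.seed(123)

# Hypothetical labels with unequal class sizes.
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Draw 70% of the row indices separately within each class.
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}), use.names = FALSE))

# Class proportions should match across the full data and both subsets.
print(prop.table(table(classe)))
print(prop.table(table(classe[in_train])))
print(prop.table(table(classe[-in_train])))
```

The same `prop.table(table(...))` check can be applied to `training$classe` and `testing$classe` to confirm that the real split is balanced.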
We can now train a random forest model, using 3-fold cross-validation to choose the tuning parameter (mtry, the number of variables tried at each split).
myTrainControl <- trainControl(method="cv", number=3)
modFit <- train(classe ~ ., method="rf", data=training, prox=TRUE, trControl = myTrainControl)
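caret handles the fold bookkeeping internally. As a rough illustration of what 3-fold cross-validation means, the base R sketch below assigns a hypothetical set of 12 rows to 3 folds and shows which rows would be used for fitting and for evaluation in each round. The simple random assignment is a simplification and an assumption on our part; it is not a copy of caret's fold construction.

```r
set.seed(123)

n <- 12  # hypothetical number of training rows
k <- 3   # number of folds, matching number = 3 in trainControl

# Randomly assign each row to one of k folds of equal size.
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)  # rows used to score the model
  fit_rows <- which(fold_id != fold)  # rows used to fit the model
  cat("fold", fold, ": fit on", length(fit_rows),
      "rows, evaluate on", length(held_out), "rows\n")
}
```

For each candidate value of the tuning parameter, the model is fit once per fold and the held-out accuracies are averaged; the value with the best average is then used to fit the final model on all the training rows.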
The final model:
modFit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.72%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3900    3    1    0    2 0.001536098
## B   23 2628    6    1    0 0.011286682
## C    0   14 2376    6    0 0.008347245
## D    0    1   27 2222    2 0.013321492
## E    0    0    5    8 2512 0.005148515
We can see that some observations have been misclassified. The out-of-bag (OOB) error estimate of the final model is 0.72%, which corresponds to an accuracy of more than 99%. This is very good, but such a low error rate also raises the possibility that our model is overfitting the training data.
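The 0.72% figure and the per-class errors in the printout can be recomputed directly from the confusion matrix. The sketch below re-enters the counts shown above by hand (rows are the actual classes, columns the predicted ones); the object names are ours.

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the printed confusion matrix of the final model.
oob <- matrix(c(3900,    3,    1,    0,    2,
                  23, 2628,    6,    1,    0,
                   0,   14, 2376,    6,    0,
                   0,    1,   27, 2222,    2,
                   0,    0,    5,    8, 2512),
              nrow = 5, byrow = TRUE,
              dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall error: share of all observations off the diagonal.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 9))
print(round(100 * oob_error, 2))  # error rate in percent
```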
To estimate more realistically the out-of-sample error of our model, we can now use the testing set that we defined earlier. Let’s predict the classes of the testing observations and compare to their real values.
pred <- predict(modFit, testing)
confusionMatrix(table(pred, testing$classe))
## Confusion Matrix and Statistics
##
##
## pred    A    B    C    D    E
##    A 1673    8    0    0    0
##    B    1 1130    9    0    0
##    C    0    1 1014   14    2
##    D    0    0    3  949    1
##    E    0    0    0    1 1079
##
## Overall Statistics
##
## Accuracy : 0.9932
## 95% CI : (0.9908, 0.9951)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9914
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9921   0.9883   0.9844   0.9972
## Specificity            0.9981   0.9979   0.9965   0.9992   0.9998
## Pos Pred Value         0.9952   0.9912   0.9835   0.9958   0.9991
## Neg Pred Value         0.9998   0.9981   0.9975   0.9970   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1920   0.1723   0.1613   0.1833
## Detection Prevalence   0.2856   0.1937   0.1752   0.1619   0.1835
## Balanced Accuracy      0.9988   0.9950   0.9924   0.9918   0.9985
We can see some misclassification on the testing set as well. The estimated accuracy is 99.32% (an estimated out-of-sample error of 0.68%), which is very good and close to the error estimated on the training data. We can confidently use this model for the prediction of the remaining data.
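As a sanity check, the accuracy can be recomputed from the testing confusion matrix by hand. The sketch below re-enters the printed counts (rows are predictions, columns the actual classes) and also computes an exact binomial interval with base R's `binom.test`, on the assumption that this is comparable to the interval reported above.

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the printed testing confusion matrix.
cm <- matrix(c(1673,    8,    0,    0,    0,
                  1, 1130,    9,    0,    0,
                  0,    1, 1014,   14,    2,
                  0,    0,    3,  949,    1,
                  0,    0,    0,    1, 1079),
             nrow = 5, byrow = TRUE,
             dimnames = list(pred = classes, actual = classes))

correct  <- sum(diag(cm))   # observations on the diagonal
total    <- sum(cm)         # all testing observations
accuracy <- correct / total

# Exact binomial 95% confidence interval for the accuracy.
ci <- binom.test(correct, total)$conf.int

print(round(accuracy, 4))
print(round(1 - accuracy, 4))  # estimated out-of-sample error
print(round(ci, 4))
```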
The data used in this analysis are provided by http://groupware.les.inf.puc-rio.br/har.