Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Load the required R libraries:
library(corrplot)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
Download the training and test data, then read them in as CSV files:
if (!file.exists("pmlTraining.csv")) {
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
destfile = "pmlTraining.csv")
}
if (!file.exists("pmlTesting.csv")) {
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
destfile = "pmlTesting.csv")
}
pmlTraining <- read.csv("pmlTraining.csv", header = TRUE, na.strings = c("NA",""))
pmlTesting <- read.csv("pmlTesting.csv", header = TRUE, na.strings = c("NA",""))
The training set consists of 19622 observations of 160 variables, one of which ('classe') is the dependent variable for this study:
dim(pmlTraining)
## [1] 19622 160
Columns in the original training and testing datasets that are mostly filled with missing values are removed next; this strips unnecessary predictors from the dataset.
To do this, we count the number of missing values in each column of the full training dataset and keep only the columns that contain no missing values, since the heavily incomplete columns are not useful as predictors.
pmlTraining_filter_col <- pmlTraining[,(colSums(is.na(pmlTraining)) == 0)]
pmlTesting_filter_col <- pmlTesting[,(colSums(is.na(pmlTesting)) == 0)]
Next, delete additional columns that are not useful as predictors (the row index, user name, timestamps, and window indicator) from the pared-down training and testing datasets.
removeCol <- c("X","user_name","raw_timestamp_part_1","raw_timestamp_part_2","cvtd_timestamp","new_window")
pmlTraining_filter_col <- pmlTraining_filter_col[,!(names(pmlTraining_filter_col) %in% removeCol)]
pmlTesting_filter_col <- pmlTesting_filter_col[,!(names(pmlTesting_filter_col) %in% removeCol)]
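As a quick sanity check, the number of remaining columns can be confirmed with 'dim()' (there should now be 54 columns, including 'classe'):
# confirm the number of columns remaining after both filtering steps
dim(pmlTraining_filter_col)
dim(pmlTesting_filter_col)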
We now split the updated training dataset into a training dataset (70% of the observations) and a validation dataset (30% of the observations). This validation dataset will allow us to perform cross validation when developing our model.
inTrain = createDataPartition(y = pmlTraining_filter_col$classe, p = 0.7, list = FALSE)
pmlTraining_sub_data <- pmlTraining_filter_col[inTrain,]
pmlValid_sub_data <- pmlTraining_filter_col[-inTrain,]
At this point, our dataset contains 54 variables, with the last column containing the ‘classe’ variable we are trying to predict. We begin by looking at the correlations between the variables in our dataset. We may want to remove highly correlated predictors from our analysis and replace them with weighted combinations of predictors. This may allow a more complete capture of the information available.
corMatrix<- cor(pmlTraining_sub_data[, -54])
corrplot(corMatrix, order = "FPC", method = "color", type = "lower", tl.cex = 0.8, tl.col = rgb(0, 0, 0))
The graph shows how the different columns are correlated with each other. At a high level, darker blue and darker red squares indicate strong positive and strong negative correlations, respectively. Since several predictors are strongly correlated with one another, we will use a principal component analysis to produce a set of linearly uncorrelated variables to use as our predictors.
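As an alternative to PCA (shown only as a sketch here, not used below), caret's 'findCorrelation()' function can identify a set of highly correlated predictors that could instead be dropped directly, using the same correlation matrix:
# columns with pairwise correlations above 0.9 (sketch only; the PCA approach below is what we actually use)
highCorr <- findCorrelation(corMatrix, cutoff = 0.9)
names(pmlTraining_sub_data)[highCorr]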
We pre-process our data using a principal component analysis, leaving out the last column (‘classe’). After pre-processing, we use the ‘predict’ function to apply the pre-processing to both the training and validation subsets of the original larger ‘training’ dataset.
preProc <- preProcess(pmlTraining_sub_data[, -54], method = "pca", thresh = 0.99)
trainPC <- predict(preProc, pmlTraining_sub_data[, -54])
valid_testPC <- predict(preProc, pmlValid_sub_data[, -54])
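As a quick check, the number of principal components needed to reach the 0.99 variance threshold can be read from the 'preProc' object (assuming the 'numComp' element that caret's 'preProcess()' stores for PCA):
# number of principal components retained at thresh = 0.99
preProc$numComp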
Next, we train a model using a random forest approach on the smaller training dataset. We specify cross-validation as the resampling method through the 'trControl' argument (via 'trainControl()') when applying the random forest routine. Without specifying this, the default method (bootstrapping) would have been used. The bootstrapping method took considerably longer to complete while producing essentially the same accuracy.
We chose a random forest for several reasons. Random forests deal naturally with non-linearity, and assuming linearity in this case would be imprudent.
Little parameter tuning is involved. While a random forest may overfit a given data set, just as any other machine learning algorithm can, Breiman showed that classifier variance does not grow with the number of trees used (unlike with AdaBoosted decision trees, for example). Therefore, it is generally better to use more trees, memory and computational power allowing.
The algorithm allows for good in-training estimates of variable importance and generalization error [2], which largely eliminates the need for a separate validation stage, though obtaining a proper generalization error estimate on a testing set would still be prudent.
The algorithm is generally robust to outliers and correlated covariates [2], which is a useful property when there are known interactions between variables and no information about the presence of outliers in the data set.
modFit <- train(pmlTraining_sub_data$classe ~ ., method = "rf", data = trainPC, trControl = trainControl(method = "cv", number = 4), importance = TRUE)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
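Before examining variable importance, the cross-validated accuracy estimated during training can be inspected from the resampling results stored in the fitted 'train' object:
# cross-validated accuracy for each tuning value of mtry, averaged over the 4 folds
modFit$results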
We now review the relative importance of the resulting principal components of the trained model, ‘modFit’.
varImpPlot(modFit$finalModel, sort = TRUE, type = 1, pch = 19, col = 1, cex = 1, main = "Importance of the Individual Principal Components")
Reading from top to bottom on the y-axis, the plot lists the principal components in order from most important to least important. The degree of importance is shown on the x-axis, increasing from left to right. Therefore, points high and to the right on this graph correspond to the principal components that are especially valuable for classifying the observed training data.
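The same ranking can be obtained numerically with caret's 'varImp()' function, which is useful if the plot is hard to read:
# numeric importance ranking of the principal components
varImp(modFit)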
We call the 'predict' function again to apply the trained model to the validation dataset. We can then view the table in the 'confusionMatrix' function's output to see how well the model classified the values in the validation set.
predValidRF <- predict(modFit, valid_testPC)
confus <- confusionMatrix(pmlValid_sub_data$classe, predValidRF)
confus$table
## Reference
## Prediction A B C D E
## A 1669 1 2 0 2
## B 15 1103 17 1 3
## C 1 17 999 9 0
## D 2 2 37 919 4
## E 0 2 8 1 1071
The estimated out-of-sample error is 1.000 minus the model's accuracy, the latter of which is provided in the output of 'confusionMatrix', or more directly via the 'postResample' function.
accur <- postResample(pmlValid_sub_data$classe, predValidRF)
modAccuracy <- accur[[1]]
modAccuracy
## [1] 0.9789
out_of_sample_error <- 1 - modAccuracy
out_of_sample_error
## [1] 0.02107
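Equivalently, the same accuracy can be read directly from the 'confusionMatrix' output computed above:
# overall accuracy from confusionMatrix (matches the postResample value)
confus$overall[["Accuracy"]]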
Finally, we apply the pre-processing to the original testing dataset, after removing the extraneous column labeled ‘problem_id’ (column 54). We then run our model against the testing dataset and display the predicted results.
testPC <- predict(preProc, pmlTesting_filter_col[, -54])
pred_final <- predict(modFit, testPC)
pred_final
## [1] B A C A A E D B A A B C B A E E A B B B
## Levels: A B C D E