In this project, we analyze a set of weight lifting exercise data and use a machine learning algorithm to build a predictive model that identifies mistakes in weight lifting with high accuracy.
The Random Forest algorithm without Principal Component Analysis achieves very high accuracy on the cross validation data set and correctly predicts the outcomes of all 20 testing observations.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
First, we download both the training and testing data files and load them into data frames.
# Load the packages used throughout the analysis
library(caret)          # createDataPartition, preProcess, confusionMatrix
library(randomForest)   # randomForest

trainLocal <- "pml-training.csv"
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testLocal <- "pml-testing.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists(trainLocal)) {
    download.file(trainURL, destfile = trainLocal)
}
if (!file.exists(testLocal)) {
    download.file(testURL, destfile = testLocal)
}
# Also treat empty string as "NA"
rawTrain <- read.csv(trainLocal, header = TRUE, na.strings = c("NA", ""))
rawTest <- read.csv(testLocal, header = TRUE, na.strings = c("NA", ""))
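Before cleaning, it is worth confirming that the two files share the same layout. A minimal sanity check (a sketch; in this data set the testing file is expected to replace the classe outcome with a problem_id column):
# The first 159 column names should be identical in both files
all(names(rawTrain)[-160] == names(rawTest)[-160])
# The final column differs: "classe" in training vs. "problem_id" in testing
c(names(rawTrain)[160], names(rawTest)[160])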
We then perform exploratory analysis and data cleaning, wherever needed, on the training data set.
dim(rawTrain)
## [1] 19622 160
dim(rawTest)
## [1] 20 160
The training data set has 19622 rows and 160 columns, and the testing data set has 20 rows and 160 columns.
# Display internal structure of the training data set
str(rawTrain, list.len = 20)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 396 levels "-0.016850","-0.021024",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_belt : Factor w/ 316 levels "-0.021887","-0.060755",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt : Factor w/ 394 levels "-0.003095","-0.010002",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt.1 : Factor w/ 337 levels "-0.005928","-0.005960",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
# Get a count of the NAs in each column
unname(colSums(is.na(rawTrain)))
## [1] 0 0 0 0 0 0 0 0 0 0 0
## [12] 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216
## [23] 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216
## [34] 19216 19216 19216 0 0 0 0 0 0 0 0
## [45] 0 0 0 0 0 19216 19216 19216 19216 19216 19216
## [56] 19216 19216 19216 19216 0 0 0 0 0 0 0
## [67] 0 0 19216 19216 19216 19216 19216 19216 19216 19216 19216
## [78] 19216 19216 19216 19216 19216 19216 0 0 0 19216 19216
## [89] 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216 19216
## [100] 19216 19216 0 19216 19216 19216 19216 19216 19216 19216 19216
## [111] 19216 19216 0 0 0 0 0 0 0 0 0
## [122] 0 0 0 19216 19216 19216 19216 19216 19216 19216 19216
## [133] 19216 19216 19216 19216 19216 19216 19216 0 19216 19216 19216
## [144] 19216 19216 19216 19216 19216 19216 19216 0 0 0 0
## [155] 0 0 0 0 0 0
The internal structure of the training data set shows that a number of columns contain NA values.
A count of the NAs in each column shows that every column has either no NAs at all or exactly 19216 NAs, the latter accounting for 97.93% of all observations.
It is therefore safe to exclude these columns from further analysis, as they contribute almost no information:
# Exclude columns with NA value(s)
isColNA <- (colSums(is.na(rawTrain)) > 0)
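As a quick sanity check on the 97.93% figure quoted above (a sketch reusing the objects already defined):
# Largest per-column NA count as a percentage of all rows: 19216 / 19622
round(100 * max(colSums(is.na(rawTrain))) / nrow(rawTrain), 2)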
We also notice that the first 7 columns (row index, user name, timestamps, and window markers) are irrelevant to the sensor measurements we are interested in, so we exclude them as well:
# Remove the first 7 columns
isColUnrelated <- rep(FALSE, ncol(rawTrain))
isColUnrelated[1:7] <- TRUE
# Extract the training and validation data
cleanedTrain <- rawTrain[, !(isColNA | isColUnrelated)]
ncol(cleanedTrain)
## [1] 53
Now we have 53 columns left (52 predictors plus the classe outcome) for building the predictive model.
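As a quick verification (a sketch), we can list the dropped bookkeeping columns and confirm that no missing values remain:
# The seven identifier/bookkeeping columns that were removed
names(rawTrain)[1:7]
# The cleaned training data should contain no NAs
sum(is.na(cleanedTrain))  # expected 0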
Next, we split the training data into training and cross validation sets, with proportions of 80% and 20%, respectively:
set.seed(800)
# Create data partitions
idxTrain <- createDataPartition(y = cleanedTrain$classe, p = 0.8, list = FALSE)
# Set training data set
trainData <- cleanedTrain[idxTrain, ]
dim(trainData)
## [1] 15699 53
# Set CV data set
cvData <- cleanedTrain[-idxTrain, ]
dim(cvData)
## [1] 3923 53
The training and cross validation data sets contain 15699 and 3923 observations, respectively.
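A quick check (sketch) confirms the 80/20 split:
# Fraction of observations assigned to the training partition
round(nrow(trainData) / nrow(cleanedTrain), 4)  # approximately 0.8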
We compare two predictive models below: Random Forest with and without Principal Component Analysis (PCA). We will use whichever model achieves the higher accuracy.
# Create the Random Forest model
rfFit <- randomForest(classe ~ ., data = trainData)
rfFit
##
## Call:
## randomForest(formula = classe ~ ., data = trainData)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.4%
## Confusion matrix:
## A B C D E class.error
## A 4461 2 0 0 1 0.000672043
## B 13 3022 3 0 0 0.005266623
## C 0 13 2724 1 0 0.005113221
## D 0 0 23 2549 1 0.009327633
## E 0 0 1 5 2880 0.002079002
# Perform cross validation
cvPred <- predict(rfFit, cvData)
cvConf <- confusionMatrix(cvData$classe, cvPred)
cvConf$table
## Reference
## Prediction A B C D E
## A 1116 0 0 0 0
## B 2 755 2 0 0
## C 0 4 679 1 0
## D 0 0 2 640 1
## E 0 0 0 0 721
# Get accuracy
cvAccuracy <- cvConf$overall["Accuracy"]
round(cvAccuracy, 4)
## Accuracy
## 0.9969
From the confusion matrix summary above, we obtain an accuracy of 0.9969 on the cross validation data set.
# Get the names of predictors
predNames <- names(cleanedTrain)
predIdx <- grep("^classe", predNames, invert = TRUE)
predNames <- predNames[predIdx]
# Create preProcess object
preProc <- preProcess(trainData[, predNames], method = "pca", thresh = 0.99)
preProc$numComp
## [1] 36
PCA needs 36 components to capture 99% of the variance, so the number of variables randomly sampled as candidates at each split will be 6 (randomForest's default of the square root of 36 for classification).
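For reference, a minimal sketch of that default (for classification, randomForest samples floor(sqrt(p)) variables at each split):
# Default mtry for a classification forest on 36 principal components
floor(sqrt(preProc$numComp))  # 6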
# Calculate PCs for training data
trainPC <- predict(preProc, trainData[, predNames])
dim(trainPC)
## [1] 15699 36
The training data set projected onto the principal components has 36 columns.
# Create the Random Forest model
rfFitPC <- randomForest(trainData$classe ~ ., data = trainPC)
rfFitPC
##
## Call:
## randomForest(formula = trainData$classe ~ ., data = trainPC)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 1.75%
## Confusion matrix:
## A B C D E class.error
## A 4453 7 1 1 2 0.002464158
## B 53 2961 19 4 1 0.025345622
## C 7 26 2691 12 2 0.017165814
## D 4 1 88 2471 9 0.039642441
## E 1 9 17 11 2848 0.013167013
# Calculate PCs for the CV data
cvPC <- predict(preProc, cvData[, predNames])
# Evaluate the PCA model against the CV data
cvConfPC <- confusionMatrix(cvData$classe, predict(rfFitPC, cvPC))
cvConfPC$table
## Reference
## Prediction A B C D E
## A 1116 0 0 0 0
## B 22 730 4 1 2
## C 0 13 668 2 1
## D 0 0 14 626 3
## E 0 3 3 1 714
cvPCAccuracy <- cvConfPC$overall["Accuracy"]
round(cvPCAccuracy, 4)
## Accuracy
## 0.9824
From the confusion matrix summary above, we obtain an accuracy of 0.9824 on the cross validation data set with principal components.
modCompare <- rbind(c(round(cvAccuracy, 4), round(1 - cvAccuracy, 4)),
                    c(round(cvPCAccuracy, 4), round(1 - cvPCAccuracy, 4)))
colnames(modCompare) <- c("Accuracy", "Out-of-Sample Err")
rownames(modCompare) <- c("Without PCA", "With PCA")
modCompare
## Accuracy Out-of-Sample Err
## Without PCA 0.9969 0.0031
## With PCA 0.9824 0.0176
The table above shows that the Random Forest model without PCA has slightly higher accuracy, so we choose it to perform prediction on the testing data set.
Now we predict the testing data set using the Random Forest model without PCA that we derived above:
testPred <- predict(rfFit, rawTest)
testPred
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
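Because predict.randomForest matches predictors by name, passing rawTest directly works even though it still contains the unused columns. As a cross-check (a minimal sketch reusing predNames from above), the same predictions can be obtained from a test set reduced to the model's 52 predictors:
# Keep only the predictor columns the model was trained on
testClean <- rawTest[, predNames]
all(testPred == predict(rfFit, testClean))  # expected TRUE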
Among the algorithms we evaluated against the cross validation data set, Random Forest without Principal Component Analysis achieved an accuracy as high as 99.69% and correctly predicted the outcomes of all 20 testing observations. This machine learning model is thus able to identify mistakes in weight lifting with high accuracy.
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.