This report presents an analysis for the Project Assignment of the Practical Machine Learning course in the Johns Hopkins Data Science Specialization at Coursera. The project uses data from the Weight Lifting Exercises (WLE) Dataset (see http://groupware.les.inf.puc-rio.br/har and also the References section below).
According to the WLE website, six participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions, identified as classes A, B, C, D and E. Class A corresponds to a correct execution of the exercise, and the remaining four classes identify common mistakes in this weight lifting exercise. Several sensors were used to collect data about the quality of the exercise execution. The goal of this project is to obtain a prediction algorithm that takes such a set of sensor readings and correctly predicts the corresponding class (A to E).
The following analysis uses a random forest prediction algorithm to accomplish this task, after data cleaning. The results of the analysis confirm that the model provided by this algorithm achieves a high prediction accuracy (as indicated by several prediction quality indicators).
Data File Loading and Initial Data Exploration.
The project assignment includes two data files in csv format, which can be downloaded from the URLs used in the following code:
# downloading training dataset
fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(fileUrl, destfile = "pml-training.csv")
pml_training <- read.csv("pml-training.csv",
header = TRUE, sep = ",",
na.strings = c("NA", "#DIV/0!"))
# downloading testing dataset
file2Url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(file2Url, destfile = "pml-testing.csv")
pml_testing <- read.csv("pml-testing.csv")
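The na.strings argument in the read.csv call above controls which cell contents are read as missing values. As a minimal sketch of its effect, the following reads a small made-up csv snippet with and without that argument. The column names id, x and y are hypothetical and are not taken from the WLE data.

```r
# Made-up csv text; the column names are hypothetical.
csv_text <- "id,x,y\n1,2.5,NA\n2,#DIV/0!,3\n3,4.0,5"

# Without the extra marker, "#DIV/0!" forces column x to be read as text
# (character or factor, depending on the R version).
raw_demo <- read.csv(text = csv_text)
print(class(raw_demo$x))

# With both markers declared, x is numeric and the bad cell becomes NA.
demo <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!"))
print(class(demo$x))
print(sum(is.na(demo)))
```

Declaring the markers at read time keeps the sensor columns numeric, so no later type conversion is needed for them.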
The pml-training.csv file contains both sensor data and execution type data, but the pml-testing.csv file does not contain execution type data. As an additional part of the assignment, we have to use the prediction algorithm trained on the data from the pml-training.csv file to predict the execution type for the data in the pml-testing.csv file.
In this assignment there is no codebook for the data files. However, relevant information can be obtained from the cited source, archived here:
[https://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har]
In particular, we know that four types of sensors were used in the experiment, and we will see below that this is reflected in the names of many of the variables in the data set.
First I read the pml-training.csv file into R. An initial inspection of the data file (using e.g. a text editor or a spreadsheet program) shows that:
The data columns in the file are separated by commas.
There are many missing values. These come in two forms: the usual NA value, and values of the form “#DIV/0!” (probably the result of an attempted division by zero in a spreadsheet).
The header line contains the names of the variables in the data set.
The first column is not really a variable; it just contains the row number.
Taking all that into account, the read.csv call shown above (with header = TRUE, sep = "," and both missing-value markers passed to na.strings) reads the csv into a data frame in R with the following dimensions (rows, columns):
19622, 160
As you can see, the data frame has 19622 rows (observations) and 160 columns (variables). Most of the variables (152 out of 160) correspond to sensor readings for one of the four sensors. Those sensor-reading variable names (columns 8 to 159) include one of the following strings to identify the corresponding sensor:
_belt _arm _dumbbell _forearm
The last column in the data frame (column 160) contains the values A to E of the classe variable that indicates the execution type of the exercise.
Finally, the first seven columns contain:
column 1: the row index (not really a variable).
column 2: the user_name variable; that is, the name of the person performing the exercise.
columns 3 to 7: variables related to the time window for that particular sensor reading. See Section 5.1 of the paper in the references for more details on these variables.
Restricting the Variables to Sensor-related Ones.
Thus, the data in the first seven columns are not sensor readings. For the prediction purposes of this analysis, we will remove the data in those columns from the data frame (using grep to select the sensor-related columns).
sensorColumns = grep(pattern = "_belt|_arm|_dumbbell|_forearm", names(pml_training))
length(sensorColumns)
## [1] 152
data = pml_training[, c(sensorColumns,160)]
dim(data)
## [1] 19622 153
See the Notes section below for further discussion of this choice of variables.
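To make the column selection above easier to follow, here is a minimal sketch of how the same grep pattern behaves on a short vector of column names. The vector is illustrative only and written in the style of the data set; it is not the full list of 160 names.

```r
# Illustrative column names in the style of the WLE data set.
col_names <- c("X", "user_name", "raw_timestamp_part_1",
               "roll_belt", "pitch_arm", "yaw_dumbbell",
               "gyros_forearm_x", "classe")

sensor_pattern <- "_belt|_arm|_dumbbell|_forearm"

# Positions of the matching names (what the analysis stores in sensorColumns).
sensor_idx <- grep(pattern = sensor_pattern, x = col_names)

# value = TRUE returns the matching names themselves, which is easier to audit.
sensor_names <- grep(pattern = sensor_pattern, x = col_names, value = TRUE)

print(sensor_idx)
print(sensor_names)
```

Selecting by name, for example with c(sensor_names, "classe"), is one way to avoid relying on the fixed column position 160.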
The selected sensor data columns still include many variables whose values are NA for almost all observations. To remove those variables we do the following:
missingData = is.na(data)
omitColumns = which(colSums(missingData) > 19000)
data = data[, -omitColumns]
dim(data)
## [1] 19622 53
As you can see, only 52 predictor variables (plus classe) remain in the data set, for a total of 53 columns. Next we check that the resulting data frame has no missing values with:
table(complete.cases(data))
##
## TRUE
## 19622
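The NA filter used above relies on an absolute count (19000) that is tied to the size of this particular data set. As a hedged alternative, the sketch below expresses the same idea as a share of missing values per column, on a made-up data frame. The 0.5 cutoff and the names toy, na_share and cleaned are assumptions chosen for illustration.

```r
# Made-up data frame: column b is mostly missing, column c has one gap.
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(NA, NA, NA, 4),
                  c = c(1, NA, 3, 4))

# Share of missing values in each column.
na_share <- colMeans(is.na(toy))

# Keep the columns whose share of missing values is at most 50%.
keep <- na_share <= 0.5
cleaned <- toy[, keep, drop = FALSE]

print(na_share)
print(names(cleaned))
```

A logical keep vector also sidesteps a pitfall of negative indexing: if which() returns no columns, data[, -omitColumns] selects zero columns instead of all of them.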
All of the remaining predictor variables are of numeric type (integer or numeric); the single character column is the classe outcome:
table(sapply(data[1,], class))
##
## character integer numeric
## 1 25 27
Following common practice in machine learning, I split the data into a training data set (75% of the total cases) and a testing data set (the remaining cases; the latter should not be confused with the data in the pml-testing.csv file). This allows me to estimate the out-of-sample error of the predictor. I use the caret package for this purpose, and begin by setting the seed to ensure reproducibility.
set.seed(2014)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
inTrain <- createDataPartition(y=data$classe, p=0.75, list=FALSE)
training <- data[inTrain,]
dim(training)
## [1] 14718 53
testing <- data[-inTrain,]
dim(testing)
## [1] 4904 53
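The split above samples within each class, so the class proportions are roughly preserved in both subsets. The following base R sketch illustrates that idea of a stratified 75/25 split on made-up labels. It is an approximation of the concept and not caret's actual implementation; the names labels, idx_by_class and in_train are hypothetical.

```r
set.seed(2014)

# Made-up outcome with unbalanced classes.
labels <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Row indices grouped by class.
idx_by_class <- split(seq_along(labels), labels)

# Draw 75% of the rows inside each class.
# sample.int() on the group size avoids the sample() surprise for length-1 groups.
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE)

train_counts <- table(labels[in_train])
test_counts <- table(labels[-in_train])

print(train_counts)
print(test_counts)
```

Because the draw happens per class, a rare class cannot end up entirely in one of the two subsets, which a plain random split does not guarantee.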
Some remarks are in order, before proceeding to train our predictor:
Since we are going to apply a non-parametric model (random forests), no preprocessing is needed to transform the variables. The possible use of PCA to further reduce the number of features is discussed in the Notes section below. Even though the assignment rubric mentions it, cross-validation is not strictly necessary for such a direct construction of random forests: each tree is grown on a bootstrap sample of the training set, and the observations left out of that sample (the out-of-bag, or OOB, cases) provide an internal estimate of the prediction error.
Thus, we are ready to continue building the predictor.
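The remark about subsampling can be made concrete with a short simulation. A bootstrap sample of size n drawn with replacement leaves out, on average, a share of about exp(-1), roughly 36.8%, of the observations, and those left-out cases are what a random forest can use for its internal error estimate. The sketch below only simulates the resampling step and does not fit any trees; the number of repetitions (20) is an arbitrary choice.

```r
set.seed(2014)

# Size of the training set used in this analysis.
n <- 14718

# For each simulated tree, draw a bootstrap sample and record
# the share of observations that were never drawn.
oob_share <- replicate(20, {
  boot <- sample.int(n, size = n, replace = TRUE)
  1 - length(unique(boot)) / n
})

print(mean(oob_share))
print(exp(-1))
```

With 500 trees, each observation is left out of many bootstrap samples, so it receives a prediction from many trees that never saw it.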
Training the Predictor.
We will use the randomForest function (in the randomForest package) to fit the predictor to the training set. In the computer used for this analysis (see the Notes section below for details) the default number of trees (500) gives a reasonable tradeoff between training time and accuracy. In more powerful machines that number can be increased for (slightly) better predictions.
training$classe <- as.factor(training$classe)
training <- data.frame(lapply(training, function(x) if(is.character(x)) as.factor(x) else x))
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
time1 = proc.time()
(randForest = randomForest(classe~., data=training, ntree = 500))
##
## Call:
## randomForest(formula = classe ~ ., data = training, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.43%
## Confusion matrix:
## A B C D E class.error
## A 4184 0 0 0 1 0.0002389486
## B 11 2835 2 0 0 0.0045646067
## C 0 12 2553 2 0 0.0054538372
## D 0 0 27 2383 2 0.0120232172
## E 0 0 2 4 2700 0.0022172949
time2 = proc.time()
(time = time2 - time1)
## user system elapsed
## 33.468 0.499 34.358
As the above results show, the resulting predictor has a low out-of-bag (OOB) error estimate of 0.43%. The confusion matrix printed above is computed from the OOB predictions on the training set, and it indicates that the predictor is accurate on that set.
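As a check on how the reported error rate relates to the printed matrix, the sketch below re-enters the counts from the randomForest output above by hand and recomputes the overall and per-class error rates. The object names oob_cm, oob_error and class_error are hypothetical; in this printout the rows are the actual classes and the columns are the predicted ones.

```r
# Counts copied from the printed confusion matrix (rows: actual class).
oob_cm <- matrix(c(4184,    0,    0,    0,    1,
                     11, 2835,    2,    0,    0,
                      0,   12, 2553,    2,    0,
                      0,    0,   27, 2383,    2,
                      0,    0,    2,    4, 2700),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(actual = LETTERS[1:5],
                                 predicted = LETTERS[1:5]))

# Overall error: share of cases off the diagonal.
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

print(round(100 * oob_error, 2))
print(round(class_error, 4))
```

The 63 misclassified cases out of 14718 give the 0.43% reported in the output, and class D has the largest per-class error.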
After training the predictor we use it on the testing subsample we constructed before, to get an estimate of its out of sample error.
library(caret)
training <- training[, sapply(training, function(x) is.numeric(x) | is.factor(x))]
predictionTesting = predict(randForest, newdata = testing)
The error estimate can be obtained with the confusionMatrix function of the caret package:
# Train the random forest model
training$classe <- as.factor(training$classe)
training <- data.frame(lapply(training, function(x) if(is.character(x)) as.factor(x) else x))
time1 <- proc.time()
randForest <- randomForest(classe ~ ., data=training, ntree=500)
time2 <- proc.time()
time <- time2 - time1
# Generate predictions on the testing set
predictions <- predict(randForest, newdata=testing)
# Create the confusion matrix
testing$classe <- factor(testing$classe, levels = levels(predictions))
predictions <- factor(predictions, levels = levels(testing$classe))
confMatrix <- confusionMatrix(predictions, testing$classe)
print(confMatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 5 0 0 0
## B 0 943 5 0 0
## C 0 1 850 11 0
## D 0 0 0 791 1
## E 0 0 0 2 900
##
## Overall Statistics
##
## Accuracy : 0.9949
## 95% CI : (0.9925, 0.9967)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9936
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9937 0.9942 0.9838 0.9989
## Specificity 0.9986 0.9987 0.9970 0.9998 0.9995
## Pos Pred Value 0.9964 0.9947 0.9861 0.9987 0.9978
## Neg Pred Value 1.0000 0.9985 0.9988 0.9968 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1923 0.1733 0.1613 0.1835
## Detection Prevalence 0.2855 0.1933 0.1758 0.1615 0.1839
## Balanced Accuracy 0.9993 0.9962 0.9956 0.9918 0.9992
The results of the confusion matrix and the associated statistics can be summarized as follows.
The confusion matrix compares the model's predictions (rows) with the actual classes (reference, columns): 4879 of the 4904 cases in the testing set are classified correctly. The overall accuracy is 0.9949 (95% CI: 0.9925 to 0.9967), so the estimated out-of-sample error is about 0.5%.
Sensitivity, specificity, and positive predictive value are above 0.98 for every class, and the Kappa statistic (0.9936) indicates almost perfect agreement between the predicted and actual classes. This suggests that the random forest model is highly effective at predicting the correct class for the weight lifting exercise data.
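The headline statistics in the confusionMatrix output can be reproduced by hand from the printed counts. The sketch below re-enters the testing-set matrix (rows: prediction, columns: reference) and recomputes accuracy, per-class sensitivity, positive predictive value, and Cohen's Kappa in base R. The object names are hypothetical, and this is an illustration of the formulas rather than caret's internal code.

```r
# Counts copied from the printed confusion matrix (rows: predicted class).
cm <- matrix(c(1395,   5,   0,   0,   0,
                  0, 943,   5,   0,   0,
                  0,   1, 850,  11,   0,
                  0,   0,   0, 791,   1,
                  0,   0,   0,   2, 900),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference = LETTERS[1:5]))

n <- sum(cm)

# Accuracy: correctly classified cases over all cases.
accuracy <- sum(diag(cm)) / n

# Sensitivity: correct predictions over the actual class totals (columns).
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions over the predicted totals (rows).
pos_pred_value <- diag(cm) / rowSums(cm)

# Cohen's Kappa: agreement corrected for the agreement expected by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(cohen_kappa, 4))
```

The estimated out-of-sample error is 1 - accuracy, that is, 25 misclassified cases out of 4904.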
Note: R version and system information for this analysis:
Sys.info()[1:2]
sysname “Darwin”, release “23.3.0”, version “Darwin Kernel Version 23.3.0: Wed Dec 20 21:30:27 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T8103”
R.version.string
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.