Final Project: Machine Learning

Project Assignment for Practical Machine Learning

A course in the Johns Hopkins Coursera Data Science specialization

Summary

This report presents an analysis that corresponds to the Project Assignment for the Practical Machine Learning course of the Johns Hopkins Data Science Specialization at Coursera. The project uses data from the Weight Lifting Exercises (WLE) Dataset (see http://groupware.les.inf.puc-rio.br/har and also the References section below.)

According to the WLE website, six participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions, identified as classes A, B, C, D and E. Class A corresponds to a correct execution of the exercise, and the remaining four classes identify common mistakes in this weight lifting exercise. Several sensors were used to collect data about the quality of the exercise execution. The goal of this project is to obtain a prediction algorithm that takes such a set of sensor readings and correctly predicts the corresponding class (A to E).

The following analysis uses a random forest prediction algorithm to accomplish this task, after data cleaning. The results of the analysis confirm that the model provided by this algorithm achieves a high prediction accuracy (as indicated by several prediction quality indicators).

Discussion and Code for the Analysis.

Data File Loading and Initial Data Exploration.

The project assignment includes two data files in CSV format, which can be downloaded and read as follows:

# downloading training dataset
fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(fileUrl, destfile = "pml-training.csv")
pml_training <- read.csv("pml-training.csv",
                         header = TRUE, sep = ",",
                         na.strings = c("NA", "#DIV/0!"))

# downloading testing dataset (same missing-value markers as the training file)
file2Url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(file2Url, destfile = "pml-testing.csv")
pml_testing <- read.csv("pml-testing.csv",
                        na.strings = c("NA", "#DIV/0!"))

The pml-training.csv file contains both sensor data and execution type data, but the pml-testing.csv file does not contain execution type data. As an additional part of the assignment, we have to use the prediction algorithm trained on the data from the pml-training.csv file to predict the execution type for the data in the pml-testing.csv file.

In this assignment there is no codebook for the data files. However, relevant information can be obtained from the sources cited, here:

[https://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har]

In particular, we know that four types of sensors were used in the experiment, and we will see below that this is reflected in the names of many of the variables in the data set.

First I read the pml-training.csv file into R. An initial inspection of the data file (using e.g. a text editor or a spreadsheet program) shows that:

The data columns in the file are separated by commas.

There are many missing values. They come in two forms: the usual NA value, and values of the form “#DIV/0!” (probably the result of an attempted division by zero in a spreadsheet).

The header line contains the names of the variables in the data set.

The first column is not really a variable; it just contains the row number.

Taking all that into account, the read.csv call shown above (with na.strings = c("NA", "#DIV/0!")) reads the file into a data frame with the following dimensions:

dim(pml_training)

## [1] 19622   160
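The effect of the na.strings argument can be seen on a tiny example. The following is a minimal sketch on made-up values (the three-row CSV text and the names csv_text, raw and clean are illustrative assumptions, not part of the assignment data):

```r
# Toy CSV text (made-up values) containing both kinds of missing-value markers.
csv_text <- 'id,roll_belt,kurtosis_roll_belt
1,1.41,NA
2,1.42,#DIV/0!
3,1.48,0.25'

# Without na.strings, "#DIV/0!" is kept as text, so the column is not numeric.
raw <- read.csv(text = csv_text)
class(raw$kurtosis_roll_belt)

# With na.strings, both markers become NA and the column is read as numeric.
clean <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!"))
class(clean$kurtosis_roll_belt)
sum(is.na(clean$kurtosis_roll_belt))
```

Declaring both markers at read time avoids having to convert text columns back to numbers later.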

As you can see, the data frame has 19622 rows (observations) and 160 columns (variables). Most of the variables (152 out of 160) correspond to sensor readings for one of the four sensors. Those sensor-reading variable names (columns 8 to 159) include one of the following strings to identify the corresponding sensor:

_belt _arm _dumbbell _forearm

The last column in the data frame (column 160) contains the values A to E of the classe variable that indicates the execution type of the exercise.

Finally, the first seven columns contain:

column 1: the row index (not really a variable).

column 2: the user_name variable; that is, the name of the person performing the exercise.

columns 3 to 7: variables related to the time window for that particular sensor reading. See Section 5.1 of the paper in the references for more details on these variables.

Restricting the Variables to Sensor-related Ones.

Thus, the data in the first seven columns are not sensor readings. For the prediction purposes of this analysis, we remove those columns from the data frame, using grep to select the sensor-related columns.

# indices of the columns whose names mention one of the four sensors
sensorColumns = grep(pattern = "_belt|_arm|_dumbbell|_forearm", names(pml_training))
length(sensorColumns)

## [1] 152

# keep the sensor columns plus classe (column 160)
data = pml_training[, c(sensorColumns, 160)]
dim(data)

## [1] 19622   153

See the Notes section below for further discussion of this choice of variables.
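To see what the grep pattern matches, here is a small sketch on a handful of column names. The vector col_names is an illustrative subset chosen for this example, and selecting classe by name instead of by position is a suggested variation, not what the analysis above does:

```r
# A few illustrative column names: bookkeeping columns, sensor columns, outcome.
col_names <- c("X", "user_name", "raw_timestamp_part_1", "num_window",
               "roll_belt", "gyros_arm_x", "accel_dumbbell_y",
               "magnet_forearm_z", "classe")

# grep returns the positions of the names that contain any of the sensor tags.
sensor_idx <- grep("_belt|_arm|_dumbbell|_forearm", col_names)
col_names[sensor_idx]

# Selecting the outcome by name avoids hard-coding its column number.
keep <- c(col_names[sensor_idx], "classe")
keep
```

Selecting by name keeps working if the column order of the file ever changes.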

Handling NA Values.

The selected sensor data columns still include many variables whose values are NA for almost all observations. To remove those variables we do the following:

# logical matrix marking every missing entry
missingData = is.na(data)
# columns with more than 19000 missing values (out of 19622 rows)
omitColumns = which(colSums(missingData) > 19000)
data = data[, -omitColumns]
dim(data)

## [1] 19622    53

As you can see, only 52 predictor variables (plus classe) remain in the data set. Next we check that the resulting data frame has no missing values with:

table(complete.cases(data))

## 
##  TRUE 
## 19622
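The same filtering idea can be written with a fraction of missing values instead of an absolute count. This is a hedged sketch on a made-up data frame (toy, the 0.5 threshold and the column names are assumptions for illustration only):

```r
set.seed(1)
# Toy data frame: column b is missing in 9 of its 10 rows.
toy <- data.frame(a = rnorm(10),
                  b = c(1, rep(NA, 9)),
                  c = rnorm(10))

# Fraction of missing values per column.
na_fraction <- colMeans(is.na(toy))
na_fraction

# Keep columns that are at most 50% missing, using a logical mask.
toy_clean <- toy[, na_fraction <= 0.5, drop = FALSE]
names(toy_clean)
table(complete.cases(toy_clean))
```

A logical mask also behaves sensibly when no column exceeds the threshold, whereas negative indexing with an empty index vector would select zero columns.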

All of the remaining predictor variables are numeric (integer or double); the single character column is classe:

table(sapply(data[1,], class))

## 
## character   integer   numeric 
##         1        25        27
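The type check can also be made explicit for the predictors alone. The sketch below uses a made-up two-row data frame (toy and its values are assumptions); the same two calls could be applied to the cleaned data frame:

```r
# Toy data frame with one double, one integer and one character column.
toy <- data.frame(roll_belt = c(1.41, 1.42),
                  accel_arm_x = c(-288L, -290L),
                  classe = c("A", "B"),
                  stringsAsFactors = FALSE)

# Class of every column.
table(sapply(toy, class))

# Check that every column other than the outcome is numeric.
predictors <- toy[, names(toy) != "classe", drop = FALSE]
all_numeric <- all(sapply(predictors, is.numeric))
all_numeric
```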

Data Splitting and Discussion of Preprocessing.

Following the most common practice in Machine Learning, I split our data into a training data set (75% of the total cases) and a testing data set (with the remaining cases; the latter should not be confused with the data in the pml-testing.csv file). This will allow me to estimate the out of sample error of our predictor. I use the caret package for this purpose, and begin by setting the seed to ensure reproducibility.

set.seed(2014)
library(caret)

## Loading required package: ggplot2
## Loading required package: lattice

inTrain <- createDataPartition(y=data$classe, p=0.75, list=FALSE)

training <- data[inTrain,]
dim(training)

## [1] 14718    53

testing <- data[-inTrain,]
dim(testing)

## [1] 4904   53
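The split above is stratified by class, so each class keeps roughly the same share in both subsets. The following base-R sketch illustrates that idea on made-up labels; it is not caret's implementation, and the class counts and the names idx_by_class and in_train are assumptions for illustration:

```r
set.seed(2014)
# Made-up outcome vector with unequal class sizes.
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(40, 30, 20, 20, 10)))

# Sample 75% of the row indices separately within each class.
idx_by_class <- split(seq_along(classe), classe)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE)

# Class proportions are close in the full data and in the training part.
round(prop.table(table(classe)), 2)
round(prop.table(table(classe[in_train])), 2)
```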

Some remarks are in order, before proceeding to train our predictor:

Since we are going to apply a non-parametric model (random forests), no preprocessing is needed to transform the variables.

The possible use of PCA to further reduce the number of features is discussed in the Notes section below.

Even though the assignment rubric mentions it, cross-validation is not necessary for such a direct construction of random forests. In short, each tree is grown on a bootstrap sample of the training data, and the observations left out of that sample (the out-of-bag, or OOB, cases) provide a built-in error estimate.

Thus, we are ready to continue building the predictor.
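To make the subsampling remark concrete, the sketch below simulates bootstrap samples of the same size as the training set and measures how many observations each one leaves out. It is a base-R illustration under stated assumptions (n_trees = 200 is an arbitrary number of repetitions, and sampling n rows with replacement mirrors the randomForest default for classification); it does not call the randomForest package:

```r
set.seed(2014)
n <- 14718       # number of rows in the training set of this report
n_trees <- 200   # illustrative number of bootstrap samples

# For each bootstrap sample, compute the fraction of rows never drawn.
oob_fraction <- replicate(n_trees, {
  in_bag <- sample.int(n, size = n, replace = TRUE)
  1 - length(unique(in_bag)) / n
})

# The average left-out fraction is close to exp(-1), about 37%.
mean(oob_fraction)
exp(-1)
```

Because each tree never sees roughly a third of the rows, those rows act as a test set for that tree.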

Training the Predictor.

We will use the randomForest function (in the randomForest package) to fit the predictor to the training set. On the computer used for this analysis (see the Notes section below for details), the default number of trees (500) gives a reasonable tradeoff between training time and accuracy. On more powerful machines that number can be increased for (slightly) better predictions.

training$classe <- as.factor(training$classe)
training <- data.frame(lapply(training, function(x) if(is.character(x)) as.factor(x) else x))
library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

time1 = proc.time()

(randForest = randomForest(classe~., data=training, ntree = 500))

## 
## Call:
##  randomForest(formula = classe ~ ., data = training, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.43%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4184    0    0    0    1 0.0002389486
## B   11 2835    2    0    0 0.0045646067
## C    0   12 2553    2    0 0.0054538372
## D    0    0   27 2383    2 0.0120232172
## E    0    0    2    4 2700 0.0022172949

time2 = proc.time()

(time = time2 - time1)

##    user  system elapsed 
##  33.468   0.499  34.358

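As the above results show, the resulting predictor has a low out-of-bag (OOB) error estimate of 0.43%. The confusion matrix, which randomForest computes from the OOB predictions on the training set, shows few misclassifications in every class (the largest class error, for class D, is about 1.2%).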
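The error figures in the printed output can be recomputed by hand from the confusion matrix. The sketch below re-enters the counts shown above (rows are actual classes, columns are predicted classes); the names oob_cm, class_error and oob_error are chosen here for illustration:

```r
# Counts copied from the randomForest output above.
oob_cm <- matrix(c(4184,    0,    0,    0,    1,
                     11, 2835,    2,    0,    0,
                      0,   12, 2553,    2,    0,
                      0,    0,   27, 2383,    2,
                      0,    0,    2,    4, 2700),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(actual = LETTERS[1:5],
                                 predicted = LETTERS[1:5]))

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)
round(class_error, 4)

# Overall error: share of all cases that fall off the diagonal.
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)
round(100 * oob_error, 2)
```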

Applying the Model to the Testing Subsample.

After training the predictor we use it on the testing subsample we constructed before, to get an estimate of its out of sample error.

# randForest and the testing subsample were created above
predictionTesting = predict(randForest, newdata = testing)

The error estimate can be obtained with the confusionMatrix function of the caret package:

# Generate predictions on the testing subsample with the forest trained above
predictions <- predict(randForest, newdata = testing)

# confusionMatrix() needs both arguments to be factors with the same levels
testing$classe <- factor(testing$classe, levels = levels(predictions))

confMatrix <- confusionMatrix(predictions, testing$classe)
print(confMatrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    5    0    0    0
##          B    0  943    5    0    0
##          C    0    1  850   11    0
##          D    0    0    0  791    1
##          E    0    0    0    2  900
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9949          
##                  95% CI : (0.9925, 0.9967)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9936          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9937   0.9942   0.9838   0.9989
## Specificity            0.9986   0.9987   0.9970   0.9998   0.9995
## Pos Pred Value         0.9964   0.9947   0.9861   0.9987   0.9978
## Neg Pred Value         1.0000   0.9985   0.9988   0.9968   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1923   0.1733   0.1613   0.1835
## Detection Prevalence   0.2855   0.1933   0.1758   0.1615   0.1839
## Balanced Accuracy      0.9993   0.9962   0.9956   0.9918   0.9992

Here I break down the results of the confusion matrix and the associated statistics:

Confusion Matrix

The confusion matrix shows the number of correct and incorrect predictions made by the model, compared with the actual classes (the Reference columns). Here is a summary by actual class:

Class A: all 1395 cases correctly predicted as A.
Class B: 943 correctly predicted as B, 5 incorrectly predicted as A, 1 incorrectly predicted as C.
Class C: 850 correctly predicted as C, 5 incorrectly predicted as B.
Class D: 791 correctly predicted as D, 11 incorrectly predicted as C, 2 incorrectly predicted as E.
Class E: 900 correctly predicted as E, 1 incorrectly predicted as D.
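The headline statistics reported by confusionMatrix can be checked directly from the counts in the test-set matrix printed above. The following is a minimal base-R sketch (the names cm, accuracy, sensitivity, ppv and cohen_kappa are chosen here; the formulas are the standard definitions, not a copy of caret's code):

```r
# Counts copied from the confusionMatrix output above
# (rows are predictions, columns are reference classes).
cm <- matrix(c(1395,   5,   0,   0,   0,
                  0, 943,   5,   0,   0,
                  0,   1, 850,  11,   0,
                  0,   0,   0, 791,   1,
                  0,   0,   0,   2, 900),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference = LETTERS[1:5]))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n           # share of correct predictions
sensitivity <- diag(cm) / colSums(cm)   # per actual class
ppv <- diag(cm) / rowSums(cm)           # per predicted class

# Cohen's kappa: agreement beyond what the marginals alone would give.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(accuracy, 4)
round(sensitivity, 4)
round(ppv, 4)
round(cohen_kappa, 4)
```

With 4879 of 4904 cases on the diagonal, the estimated out-of-sample error is about 0.5%.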

Felix G Lopez

2024-08-17

References.