Introduction:

For this project, we look at accelerometer data collected from six individuals as they performed different activities. We use the training data to build a model that classifies the manner in which they exercised, which is recorded in the “classe” column of the training set.

Adding Libraries:

There are a few libraries that we are going to use in this project, so let’s add them.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(ggplot2)

Loading the data:

Now let’s load the training and testing datasets and take a look at some of the data. There are 160 columns in the dataset, so we limit the number of columns that are displayed.

training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
head(training[, 1:8], 5)
##   X user_name raw_timestamp_part_1 raw_timestamp_part_2   cvtd_timestamp
## 1 1  carlitos           1323084231               788290 05/12/2011 11:23
## 2 2  carlitos           1323084231               808298 05/12/2011 11:23
## 3 3  carlitos           1323084231               820366 05/12/2011 11:23
## 4 4  carlitos           1323084232               120339 05/12/2011 11:23
## 5 5  carlitos           1323084232               196328 05/12/2011 11:23
##   new_window num_window roll_belt
## 1         no         11      1.41
## 2         no         11      1.41
## 3         no         11      1.42
## 4         no         12      1.48
## 5         no         12      1.48

Cleaning the data:

There are a few things about the data that we need to clean up before we can start applying machine learning algorithms. First, we can remove the first column because it only contains a row number. Second, the last column of the testing set doesn’t have the same name as in the training set, so we rename it. Third, we need to deal with the missing and NA values: some of the columns in the testing set are entirely NA, so we remove those columns from the testing set and then keep only the same columns in the training set. Finally, there are some columns related to the date, time, and measurement window, which won’t be helpful in the prediction model, so let’s remove those columns as well. At the end, we split the training data into a 75% partition for fitting and a 25% partition for checking the models.

# Remove the first column
training <- training[, -1]
testing <- testing[, -1]

# Rename the last column of the testing set
testing <- rename(testing, classe = "problem_id")

# Drop the columns that are entirely NA in the testing set, then keep the same columns in the training set
testing <- testing[colSums(!is.na(testing)) > 0]
training <- select(training, c(names(testing)))

# Remove time related columns
training <- select(training, -c("raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window", "num_window"))
testing <- select(testing, -c("raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window", "num_window"))

# Partition the data for training and testing.
trPart <- createDataPartition(training$classe, p = 0.75, list = FALSE)
trainPart <- training[trPart,]
testPart <- training[-trPart,]
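
The column filter used above can be tried on a small made-up data frame. This is only a minimal sketch: the data frame `toy` and its column names are invented for illustration and are not part of the accelerometer data.

```r
# Toy data frame (hypothetical): column `b` is entirely NA, `c` is partly NA.
toy <- data.frame(
  a = c(1, 2, 3),
  b = c(NA, NA, NA),
  c = c("x", NA, "z")
)

# Count the non-NA values in each column.
non_na_counts <- colSums(!is.na(toy))

# Keep only the columns that have at least one non-NA value.
kept <- toy[non_na_counts > 0]

print(names(kept))
```

Columns that are only partly NA, such as `c` here, survive this filter, so it may be worth checking the cleaned data with `anyNA()` before modelling.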

Building a recursive partitioning model:

The first model that we will build is a recursive partitioning (decision tree) model, fit with 10-fold cross-validation. After building the model, we will check its accuracy with a confusion matrix.
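
As a rough picture of what 10-fold cross-validation does with the rows, the base-R sketch below assigns rows to 10 folds at random. It is an illustration under assumptions: `n_rows` is a made-up row count, and caret builds its own folds internally when training, which this sketch does not try to reproduce.

```r
set.seed(42)   # arbitrary seed so the assignment is repeatable
n_rows <- 103  # hypothetical number of training rows
k <- 10        # number of folds

# Give every row a fold label 1..k, then shuffle the labels.
fold_id <- sample(rep(seq_len(k), length.out = n_rows))

# In round f, fold f is held out and the other k - 1 folds are used for fitting.
fit_sizes <- integer(k)
held_sizes <- integer(k)
for (f in seq_len(k)) {
  held_sizes[f] <- sum(fold_id == f)
  fit_sizes[f] <- sum(fold_id != f)
}

print(rbind(fit = fit_sizes, held_out = held_sizes))
```

Every row is held out exactly once, and the reported cross-validation accuracy is the average over the 10 held-out folds.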

# Set the training control for 10-fold cross validation.
train_control <- trainControl(method = "cv", number = 10)

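# Train the model on the full training set (this includes the testPart rows)
modelrpart <- train(classe ~ ., data = training, method = "rpart", trControl = train_control)

# Check the confusion matrix of the model.
# confusionMatrix() treats its first argument as the predictions, so in the
# output below the rows are the true classes and the columns are the predictions.
confusionMatrix(testPart$classe, predict(modelrpart, newdata = testPart))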
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1292   17   85    0    1
##          B  383  331  235    0    0
##          C  399   26  430    0    0
##          D  373  143  288    0    0
##          E  120  123  232    0  426
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5055          
##                  95% CI : (0.4914, 0.5196)
##     No Information Rate : 0.5235          
##     P-Value [Acc > NIR] : 0.9943          
##                                           
##                   Kappa : 0.3533          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.5033   0.5172  0.33858       NA  0.99766
## Specificity            0.9559   0.8551  0.88305   0.8361  0.89390
## Pos Pred Value         0.9262   0.3488  0.50292       NA  0.47281
## Neg Pred Value         0.6366   0.9219  0.79254       NA  0.99975
## Prevalence             0.5235   0.1305  0.25897   0.0000  0.08707
## Detection Rate         0.2635   0.0675  0.08768   0.0000  0.08687
## Detection Prevalence   0.2845   0.1935  0.17435   0.1639  0.18373
## Balanced Accuracy      0.7296   0.6861  0.61082       NA  0.94578

The confusion matrix for this model shows a very poor result. The accuracy is only about 50%, and the tree never predicts class D (the D column of the matrix is all zeros), so this is not a good model to use for prediction. Let’s try some other models.
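
The NA entries for class D in the output above follow from the table itself. As a worked check, the sketch below re-enters the printed counts by hand (the object name `tab` is ours) and computes each class's sensitivity as the diagonal count divided by the column total, which reproduces the printed values.

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the printed rpart confusion matrix, row by row.
tab <- matrix(
  c(1292,  17,  85, 0,   1,
     383, 331, 235, 0,   0,
     399,  26, 430, 0,   0,
     373, 143, 288, 0,   0,
     120, 123, 232, 0, 426),
  nrow = 5, byrow = TRUE,
  dimnames = list(row = classes, col = classes)
)

# Sensitivity per class: diagonal count over the column total.
sensitivity <- diag(tab) / colSums(tab)

print(round(sensitivity, 4))
```

Column D sums to zero, so its ratio is 0/0, which R reports as NaN and the caret summary shows as NA.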

Building a linear discriminant analysis model:

The next model that we are going to try is a Linear Discriminant Analysis model. Again, we’re going to use 10-fold cross-validation.

# Set the training control for 10-fold cross validation.
train_control <- trainControl(method = "cv", number = 10)

# Train the model on the full training set (this includes the testPart rows)
modellda <- train(classe ~ ., data = training, method = "lda", trControl = train_control)

# Check the confusion matrix of the model
confusionMatrix(testPart$classe, predict(modellda, newdata = testPart))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1169   35   99   91    1
##          B  155  601  136   26   31
##          C   89   90  554  109   13
##          D   52   31   78  629   14
##          E   28  135   62   63  613
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7272          
##                  95% CI : (0.7145, 0.7396)
##     No Information Rate : 0.3044          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6543          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7830   0.6738   0.5963   0.6852   0.9122
## Specificity            0.9337   0.9133   0.9243   0.9561   0.9319
## Pos Pred Value         0.8380   0.6333   0.6480   0.7823   0.6804
## Neg Pred Value         0.9077   0.9264   0.9074   0.9295   0.9853
## Prevalence             0.3044   0.1819   0.1894   0.1872   0.1370
## Detection Rate         0.2384   0.1226   0.1130   0.1283   0.1250
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.8584   0.7935   0.7603   0.8206   0.9221

The accuracy of this model is about 73%, which is better than the rpart model but still not great. We may be able to do better than that, so let’s try another model.
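
The Accuracy and Kappa figures in the output above can be recomputed from the table alone. The sketch below re-enters the printed counts by hand (the names `tab`, `observed`, `expected`, and `kappa_stat` are ours) and applies the usual definitions: accuracy is the share of counts on the diagonal, and kappa compares that share with the agreement expected by chance from the row and column totals.

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the printed LDA confusion matrix, row by row.
tab <- matrix(
  c(1169,  35,  99,  91,   1,
     155, 601, 136,  26,  31,
      89,  90, 554, 109,  13,
      52,  31,  78, 629,  14,
      28, 135,  62,  63, 613),
  nrow = 5, byrow = TRUE,
  dimnames = list(row = classes, col = classes)
)

n <- sum(tab)

# Observed agreement (accuracy): diagonal counts over all counts.
observed <- sum(diag(tab)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(tab) * colSums(tab)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled to a maximum of 1.
kappa_stat <- (observed - expected) / (1 - expected)

print(round(c(accuracy = observed, kappa = kappa_stat), 4))
```

Rounded to four decimals, these match the 0.7272 and 0.6543 shown in the summary.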

Building a random forest model:

The third model that we will build is a random forest model. This time we use 4-fold rather than 10-fold cross-validation.

# Set the training control for 4-fold cross validation.
train_control <- trainControl(method = "cv", number = 4)

# Train the model on the full training set (this includes the testPart rows)
modelrf <- train(classe ~ ., data = training, method = "rf", trControl = train_control)

# Check the confusion matrix of the model
confusionMatrix(testPart$classe, predict(modelrf, newdata = testPart))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B    0  949    0    0    0
##          C    0    0  855    0    0
##          D    0    0    0  804    0
##          E    0    0    0    0  901
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9992, 1)
##     No Information Rate : 0.2845     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

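The results of this model look excellent: the confusion matrix shows 100% accuracy on testPart. However, the model was fit on the full training set, which includes the testPart rows, so this is an in-sample figure rather than an out-of-sample estimate. The cross-validated accuracy reported by modelrf is the better guide, and we should expect the true out-of-sample accuracy to be somewhat below 100%. Random forest is still by far the strongest of the three models, so this is the model that we will use to predict on the testing dataset.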
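
To see how much scoring already-seen rows can flatter a model, the sketch below compares a leaky and an honest accuracy estimate. It is an illustration under assumptions, not the method used in this report: it swaps in a simple 1-nearest-neighbour classifier and R's built-in `iris` data, and the helper `predict_1nn` is hypothetical.

```r
# Illustration only: 1-nearest-neighbour on iris, not the report's random forest.
set.seed(1)  # arbitrary seed so the split is repeatable

features <- as.matrix(iris[, 1:4])
labels <- iris$Species
distances <- as.matrix(dist(features))  # pairwise Euclidean distances

# Predict each row in `eval_rows` from its nearest neighbour among `fit_rows`.
predict_1nn <- function(fit_rows, eval_rows) {
  nearest <- apply(distances[eval_rows, fit_rows, drop = FALSE], 1, which.min)
  labels[fit_rows][nearest]
}

all_rows <- seq_len(nrow(iris))
held_out <- sample(all_rows, size = 38)  # roughly 25% of the rows
fit_rows <- setdiff(all_rows, held_out)

# Leaky estimate: the held-out rows were also available when fitting.
leaky_acc <- mean(predict_1nn(all_rows, held_out) == labels[held_out])

# Honest estimate: the held-out rows were never seen when fitting.
honest_acc <- mean(predict_1nn(fit_rows, held_out) == labels[held_out])

print(c(leaky = leaky_acc, honest = honest_acc))
```

In the leaky case each held-out row has itself as a zero-distance neighbour, so that estimate is optimistic by construction. For this report, the equivalent honest check would be to fit on `trainPart` only and then score `testPart`.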

Predicting on the testing set:

Now we’re going to predict the classe in the testing set and add it to the dataset.

testing$classe <- predict(modelrf, newdata = testing)
select(testing, c(user_name, classe))
##    user_name classe
## 1      pedro      B
## 2     jeremy      A
## 3     jeremy      B
## 4     adelmo      A
## 5     eurico      A
## 6     jeremy      E
## 7     jeremy      D
## 8     jeremy      B
## 9   carlitos      A
## 10   charles      A
## 11  carlitos      B
## 12    jeremy      C
## 13    eurico      B
## 14    jeremy      A
## 15    jeremy      E
## 16    eurico      E
## 17     pedro      A
## 18  carlitos      B
## 19     pedro      B
## 20    eurico      B