Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data

The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.

Data Collection:

First a function to read data from respective CSV file. It keeps the header and replaces all “NA”, “” and “#DIV/0!” with “NA”

readData <- function(file) {
    df <- read.csv(file,
                   header=TRUE, 
                   na.strings=c("NA", "", "#DIV/0!"), 
                   sep=",")
    }

Then the data is read using the read readData function:

data <- readData("./data/training.csv") #read data

Data Cleaning:

Then the function cleanData will be used to clean data. This function turns the variable new_window from “yes” or “no” into “1” or “0” respectively. It converts the variable cvtd_timestamp into a time object and splits it into year, month, weekday, hour and minute variables.

Then all the columns with the most “NA” are droped from the dataframe. Columns X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp are also droped.

cleanData <- function(df) {    
    df$new_window <- ifelse(df$new_window=="yes", "1", "0") #convert new_window into "0" or "1"
    df$cvtd_timestamp <- strptime(df$cvtd_timestamp, "%d/%m/%Y %H:%M") #convert cvtd_timestamp into a time object
    df$year <- as.numeric(strftime(df$cvtd_timestamp, "%Y")) #creat a new feature year
    df$month <- as.numeric(strftime(df$cvtd_timestamp, "%m")) #creat a new feature month
    df$weekday <- as.numeric(strftime(df$cvtd_timestamp, "%d")) #creat a new feature weekday
    df$hour <- as.numeric(strftime(df$cvtd_timestamp, "%H")) #creat a new feature hour
    df$minute <- as.numeric(strftime(df$cvtd_timestamp, "%M")) #creat a new feature minute
    df <- df[,colSums(is.na(df)) < 1] #drop columns with the most "NA"
    df <- subset(df, select=-c(X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp)) #drop noisy features 
    }

Then the data is cleaned using the cleanData function:

data <- cleanData(data) #clean data

Data Exploration:

I used the featurePlot function to visualize the data. Here is an example of feature plots for the first 10 features in the data set:

library(caret)
featurePlot(x=data[,1:10],
            y=data$classe,
            plot="density",
            scales=list(x=list(relation="free"),
                        y=list(relation="free")),
            adjust=1.5,
            pch="|",
            layout=c(5,2),
            auto.key=list(columns=3))

plot of chunk unnamed-chunk-5

Training:

The function createDataPartition can be used to create a stratified random sample of the data into training and validation sets with 70% of the data in the training set and the rest in the validation set:

set.seed(1)
inTrain <- createDataPartition(y=data$classe, p=0.7, list=FALSE)
training <- data[inTrain,] #create training set
testing <- data[-inTrain,] #create validation set

The model is fitted using train from the caret library with the following parameters:

  • method : argument specifies the type of training model.

  • tuneGrid : a data frame with columns for each tuning parameter.

  • trControl : used to specifiy the type of resampling.

    • method : the resampling method to be used.

    • number : number of cross-validation groups. This may also be an explicit list of integers that define the cross-validation groups.

#creat a tune grid.
grid <-  expand.grid(cp=c(1:10)*0.01)

#fit the classification model
fit <- train(classe ~ ., data=training,
             method="rpart",
             tuneGrid=grid,
             trControl=trainControl(method="cv", number=10))

Cross-Validation:

The training set ids resampled in the trainig step above using Cross-Validated (10 fold). The cross-validation results are given below:

From this model, I expect Out of Sample Error to be approximately 0.3 from the information below:

fit
## CART 
## 
## 13737 samples
##    59 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 12363, 12361, 12363, 12363, 12363, 12363, ... 
## 
## Resampling results across tuning parameters:
## 
##   cp    Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.01  0.7       0.7    0.013        0.016   
##   0.02  0.7       0.6    0.010        0.012   
##   0.03  0.6       0.5    0.012        0.015   
##   0.04  0.5       0.4    0.025        0.040   
##   0.05  0.5       0.3    0.015        0.020   
##   0.06  0.4       0.2    0.049        0.084   
##   0.07  0.4       0.1    0.004        0.006   
##   0.08  0.4       0.1    0.004        0.006   
##   0.09  0.4       0.1    0.004        0.006   
##   0.10  0.4       0.1    0.004        0.006   
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.01.
plot(fit)

plot of chunk unnamed-chunk-7

This is a visualisation for the “classification” tree produced by the model:

library(rattle)
fancyRpartPlot(fit$finalModel)

plot of chunk unnamed-chunk-8

Predicting:

Using the model developed, I predict using the variable classe

testPred <- predict(fit, testing)

From the Confusion Matrix we can calculate our Out of Sample Error as 24.84%:

Out of Sample Error = 1 - 0.7516
                    = 0.2484
confusionMatrix(testPred, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1451  134   25   51   83
##          B   51  685   39   74   86
##          C   31  130  840   51   53
##          D  110  181  122  740  153
##          E   31    9    0   48  707
## 
## Overall Statistics
##                                        
##                Accuracy : 0.752        
##                  95% CI : (0.74, 0.763)
##     No Information Rate : 0.284        
##     P-Value [Acc > NIR] : <2e-16       
##                                        
##                   Kappa : 0.686        
##  Mcnemar's Test P-Value : <2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.867    0.601    0.819    0.768    0.653
## Specificity             0.930    0.947    0.945    0.885    0.982
## Pos Pred Value          0.832    0.733    0.760    0.567    0.889
## Neg Pred Value          0.946    0.908    0.961    0.951    0.926
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.247    0.116    0.143    0.126    0.120
## Detection Prevalence    0.296    0.159    0.188    0.222    0.135
## Balanced Accuracy       0.899    0.774    0.882    0.826    0.818