Project to predict and quantify the manner in which the exercise was done

Executive summary:-

This report analyzes accelerometer data to predict the manner in which participants performed a weight-lifting exercise. The data was extracted from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The outcome variable, classe, has five levels (A to E), one for each way of performing the exercise. The analysis consists of exploratory data analysis followed by a prediction model. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har.

Data

The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Exploratory data analyses:-

Let's first load the training dataset and check its dimensions.

temp <- tempfile()
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url,temp,method="curl")
# Treat blank fields as NA, in addition to the literal string "NA"
file <- read.csv(temp,header = TRUE,sep = ",",na.strings = c("NA", ""))
dim(file)

## [1] 19622   160
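
The effect of na.strings can be seen on a tiny inline CSV. This is a minimal sketch with made-up values, not the project data; the column names id, reading and label are illustrative assumptions.

```r
# Toy CSV with a literal "NA" and an empty field (made-up values)
csv_text <- "id,reading,label\n1,0.5,up\n2,NA,\n3,1.2,down"

# Default: only the string "NA" is treated as missing,
# so the empty label is read as an empty string
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# As in the report: empty fields are also treated as missing
strict_read <- read.csv(text = csv_text, na.strings = c("NA", ""),
                        stringsAsFactors = FALSE)

# Missing values per column under each setting
colSums(is.na(default_read))
colSums(is.na(strict_read))
```

With the default setting the empty label is not counted as missing, which would let mostly empty text columns survive a filter based on NA counts.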

Before starting the analysis we will clean the training data. We will keep only the columns that contain no missing values (NA count equal to 0) and drop the rest, so that the model is trained on complete fields.

# Keep only the columns with no missing values
test <- file[,colSums(is.na(file))==0]
# Remove the first seven columns, which are not needed for prediction:
# X, user_name, raw_timestamp_part_1, raw_timestamp_part_2,
# cvtd_timestamp, new_window and num_window
final <- test[, -(1:7)]

dim(final)

## [1] 19622    53
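
The missing-value filter used above can be illustrated on a small data frame. This is a sketch with made-up values; the names toy, na_counts and complete_cols are hypothetical and do not appear in the analysis.

```r
# Toy data frame (made-up values): column b has one missing value
toy <- data.frame(a = c(1, 2, 3),
                  b = c(NA, 5, 6),
                  c = c("x", "y", "z"))

# Count the missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only the columns whose missing-value count is zero
complete_cols <- toy[, na_counts == 0]
names(complete_cols)
```

Column b is dropped because it contains an NA, while a and c are kept with all three rows.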

Now that the data has been cleaned and reduced to the required fields, we will hold out part of it for validation by splitting the training data into a training set and a testing set.

Cross validation

library(caret);library(kernlab);
# Split the data into 75% training and 25% testing for validation
inTrain <- createDataPartition(y=final$classe,p=0.75,list=FALSE)
training <- final[inTrain,]
testing <- final[-inTrain,]
dim(training);dim(testing)

## [1] 14718    53

## [1] 4904   53
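
When the outcome is a factor, createDataPartition samples within each class, so both parts keep roughly the class proportions of the full data. The idea can be sketched in base R as follows. This is only an illustration with made-up labels and an arbitrary seed, not caret's implementation.

```r
set.seed(1)  # arbitrary seed, used only to make this sketch repeatable

# Made-up labels with unequal class sizes
labels <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Sample 75% of the row indices within each class
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}), use.names = FALSE)

# Class proportions in the training part and in the held-out part
prop.table(table(labels[train_idx]))
prop.table(table(labels[-train_idx]))
```

Both tables show the same proportions as the full label vector, which a plain random split would only match approximately.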

Create Fit Model

We will fit a model on the training data with method = "knn" (k-nearest neighbors). The predictors are centered and scaled, and k is tuned with 10-fold cross-validation.

modelfit <- train(classe ~ .,data=training,method="knn",preProcess = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))
modelfit

## k-Nearest Neighbors 
## 
## 14718 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 13246, 13247, 13245, 13248, 13246, 13245, ... 
## 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa      Accuracy SD  Kappa SD   
##    5  0.9658243  0.9567613  0.004939973  0.006249964
##    7  0.9545458  0.9424857  0.004972096  0.006302970
##    9  0.9444897  0.9297494  0.004368420  0.005538304
##   11  0.9333470  0.9156323  0.006463869  0.008197362
##   13  0.9204360  0.8992688  0.006045151  0.007659166
##   15  0.9126893  0.8894622  0.007473483  0.009469397
##   17  0.9040599  0.8785376  0.008386261  0.010618301
##   19  0.8963156  0.8687526  0.008609216  0.010927292
##   21  0.8881607  0.8584177  0.011418634  0.014475394
##   23  0.8794656  0.8474311  0.008171329  0.010364854
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 5.
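
Centering and scaling matter for k-nearest neighbors because the method relies on distances, and a feature measured in large units would otherwise dominate. The sketch below shows this on four made-up points with a hand-written nearest-neighbor vote. The function knn_predict and all values are hypothetical and are not part of caret.

```r
# Made-up training points: two features on very different scales
train_x <- data.frame(f1 = c(1.0, 1.2, 3.0, 3.2),
                      f2 = c(1000, 3000, 1100, 2900))
train_y <- factor(c("A", "A", "B", "B"))
new_x   <- c(f1 = 1.1, f2 = 2920)

# Plain k-nearest-neighbor majority vote using Euclidean distance
knn_predict <- function(train_x, train_y, new_x, k) {
  diffs <- sweep(as.matrix(train_x), 2, new_x)   # subtract new_x from each row
  dists <- sqrt(rowSums(diffs^2))
  votes <- table(train_y[order(dists)[seq_len(k)]])
  names(votes)[which.max(votes)]
}

# Raw features: distances are dominated by f2
raw_pred <- knn_predict(train_x, train_y, new_x, k = 1)

# Center and scale with the training means and standard deviations
centers      <- colMeans(train_x)
scales       <- apply(train_x, 2, sd)
train_scaled <- scale(train_x, center = centers, scale = scales)
new_scaled   <- (new_x - centers) / scales
scaled_pred  <- knn_predict(train_scaled, train_y, new_scaled, k = 1)

c(raw = raw_pred, scaled = scaled_pred)
```

On the raw features the new point is assigned to class B because its f2 value is close to a B point, while after scaling both features contribute comparably and the nearest neighbor is an A point.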

modelfit$finalModel

## 5-nearest neighbor classification model
## 
## Call:
## knn3.matrix(x = as.matrix(x), y = y, k = param$k)
## 
## Training set class distribution:
## 
##    A    B    C    D    E 
## 4185 2848 2567 2412 2706

prediction <- predict(modelfit,newdata=testing)

After fitting the model we predict on the held-out testing data and compute the confusion matrix.

confusionMatrix(prediction,testing$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1372   25    4    0    0
##          B    8  898   11    0    4
##          C    6   24  818   27    6
##          D    7    1   20  776    5
##          E    2    1    2    1  886
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9686          
##                  95% CI : (0.9633, 0.9733)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9603          
##  Mcnemar's Test P-Value : 0.0004858       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9835   0.9463   0.9567   0.9652   0.9834
## Specificity            0.9917   0.9942   0.9844   0.9920   0.9985
## Pos Pred Value         0.9793   0.9750   0.9285   0.9592   0.9933
## Neg Pred Value         0.9934   0.9872   0.9908   0.9932   0.9963
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2798   0.1831   0.1668   0.1582   0.1807
## Detection Prevalence   0.2857   0.1878   0.1796   0.1650   0.1819
## Balanced Accuracy      0.9876   0.9702   0.9706   0.9786   0.9909

Out of sample error

From the confusion matrix above, the accuracy on the held-out testing data is 0.9686, so the estimated out-of-sample error is 1 - 0.9686 = 0.0314 (about 3.1%).
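
The accuracy and the corresponding error rate can be recomputed directly from the confusion matrix counts reported above. This is a small sketch; the names cm, accuracy and error_rate are introduced here for illustration.

```r
# Confusion matrix counts copied from the output above
# (rows = predicted class, columns = reference class)
cm <- matrix(c(1372,  25,   4,   0,   0,
                  8, 898,  11,   0,   4,
                  6,  24, 818,  27,   6,
                  7,   1,  20, 776,   5,
                  2,   1,   2,   1, 886),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

accuracy   <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
error_rate <- 1 - accuracy             # estimated out-of-sample error

round(c(accuracy = accuracy, error_rate = error_rate), 4)
```

The diagonal holds 4750 correct predictions out of 4904, which gives an accuracy of 0.9686 and an error rate of 0.0314.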

Below is a pairs plot of the features whose names contain "total", grouped by the outcome classe.

Plot

library(knitr)
featurePlot(x=testing[,grep("total",names(testing))],y=testing$classe,plot="pairs")

Applying the model to the test data

temp <- tempfile()
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url,temp,method="curl")
test <- read.csv(temp,header = TRUE,sep = ",",na.strings = c("NA", ""))
x <- as.character(predict(modelfit,newdata=test))

Results

print(x)

##  [1] "B" "A" "A" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"