This report analyzes data collected to predict how well participants performed a weight-lifting exercise. The outcome (classe) takes one of 5 values describing the manner in which the exercise was done. The analysis consists of a prediction model and exploratory data analysis. The data come from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har.
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Let's first load the training dataset and check its dimensions.
temp <- tempfile()
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url,temp,method="curl")
#Treat blank fields as NA when reading the file
file <- read.csv(temp,header = TRUE,sep = ",",na.strings = c("NA", ""))
dim(file)
## [1] 19622 160
Before starting the analysis we clean the training data. We remove every column that contains missing values (NA count <> 0), so that only fully populated fields are used for prediction.
#Keep only columns with no missing values
test <- file[,colSums(is.na(file))==0]
#Remove columns not needed for prediction: X, user_name, raw_timestamp_part_1, raw_timestamp_part_2,
#cvtd_timestamp, new_window and num_window (the first 7 columns)
final <- test[,-(1:7)]
dim(final)
## [1] 19622 53
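The column filter above can be illustrated on a small data frame. This is a minimal sketch in base R; the data frame toy and its values are hypothetical and only stand in for the real training file.

```r
# Toy data frame standing in for the training file (hypothetical values).
toy <- data.frame(
  id     = 1:4,
  roll   = c(1.2, 3.4, 5.6, 7.8),
  kurt   = c(NA, NA, 0.5, NA),   # mostly missing column
  classe = c("A", "B", "A", "C")
)

# Count missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the columns with zero missing values.
complete_cols <- toy[, na_counts == 0]
print(names(complete_cols))
```

Any column with at least one NA is dropped, so a column that is only partly missing is removed as well. That is the same rule the report applies to the training data.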
Now that the data has been cleaned and reduced to the required fields, we set up hold-out validation by splitting the training data into a training set and a testing set.
library(caret);library(kernlab);
#Split the data into 75% training and 25% testing for hold-out validation
intain <- createDataPartition(y=final$classe,p=0.75,list=FALSE)
training <- final[intain,]
testing <- final[-intain,]
dim(training);dim(testing)
## [1] 14718 53
## [1] 4904 53
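For a factor outcome, a split like the one above is stratified: rows are sampled within each class so both parts keep similar class proportions. The following base R sketch shows that idea only. It is not caret's implementation, and the seed, the class sizes, and the names classe, in_train, train_y and test_y are assumptions made for illustration.

```r
set.seed(123)  # assumed seed, only so the sketch is repeatable

# Hypothetical outcome vector with unbalanced classes.
classe <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Sample 75% of the row indices within each class.
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE))

train_y <- classe[in_train]
test_y  <- classe[-in_train]

# Class proportions stay close to the originals in both parts.
print(table(train_y))
print(table(test_y))
```

Setting a seed before the split, as in this sketch, makes the partition repeatable from run to run.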
We fit a model on the training data using method = "knn" (k-nearest neighbors), centering and scaling the predictors and tuning k with 10-fold cross-validation.
modelfit <- train(classe ~ ., data = training, method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))
modelfit
## k-Nearest Neighbors
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 13246, 13247, 13245, 13248, 13246, 13245, ...
##
## Resampling results across tuning parameters:
##
## k Accuracy Kappa Accuracy SD Kappa SD
## 5 0.9658243 0.9567613 0.004939973 0.006249964
## 7 0.9545458 0.9424857 0.004972096 0.006302970
## 9 0.9444897 0.9297494 0.004368420 0.005538304
## 11 0.9333470 0.9156323 0.006463869 0.008197362
## 13 0.9204360 0.8992688 0.006045151 0.007659166
## 15 0.9126893 0.8894622 0.007473483 0.009469397
## 17 0.9040599 0.8785376 0.008386261 0.010618301
## 19 0.8963156 0.8687526 0.008609216 0.010927292
## 21 0.8881607 0.8584177 0.011418634 0.014475394
## 23 0.8794656 0.8474311 0.008171329 0.010364854
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
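The tuning table above comes from cross-validation: for each candidate k, accuracy is averaged over the held-out folds and the best k is kept. Below is a minimal sketch of that procedure using the built-in iris data and the knn function from the class package. The dataset, the seed, the 5 folds, and the candidate values of k are assumptions for illustration. Unlike caret's preProcess, the sketch scales the predictors once up front instead of inside each fold.

```r
library(class)  # recommended package, provides knn()

set.seed(42)  # assumed seed for repeatability
x <- scale(iris[, 1:4])   # center and scale the predictors
y <- iris$Species
folds <- sample(rep(1:5, length.out = nrow(x)))  # random 5-fold assignment

candidate_k <- c(1, 5, 9)
cv_accuracy <- sapply(candidate_k, function(k) {
  # Mean accuracy over the 5 held-out folds for this k.
  mean(sapply(1:5, function(f) {
    held_out <- folds == f
    pred <- knn(train = x[!held_out, ], test = x[held_out, ],
                cl = y[!held_out], k = k)
    mean(pred == y[held_out])
  }))
})
names(cv_accuracy) <- candidate_k
print(cv_accuracy)

# Pick the k with the highest cross-validated accuracy.
best_k <- as.integer(names(which.max(cv_accuracy)))
print(best_k)
```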
modelfit$finalModel
## 5-nearest neighbor classification model
##
## Call:
## knn3.matrix(x = as.matrix(x), y = y, k = param$k)
##
## Training set class distribution:
##
## A B C D E
## 4185 2848 2567 2412 2706
prediction <- predict(modelfit,newdata=testing)
After fitting the model, we predict on the held-out testing data and build a confusion matrix.
confusionMatrix(prediction,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1372 25 4 0 0
## B 8 898 11 0 4
## C 6 24 818 27 6
## D 7 1 20 776 5
## E 2 1 2 1 886
##
## Overall Statistics
##
## Accuracy : 0.9686
## 95% CI : (0.9633, 0.9733)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9603
## Mcnemar's Test P-Value : 0.0004858
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9835 0.9463 0.9567 0.9652 0.9834
## Specificity 0.9917 0.9942 0.9844 0.9920 0.9985
## Pos Pred Value 0.9793 0.9750 0.9285 0.9592 0.9933
## Neg Pred Value 0.9934 0.9872 0.9908 0.9932 0.9963
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2798 0.1831 0.1668 0.1582 0.1807
## Detection Prevalence 0.2857 0.1878 0.1796 0.1650 0.1819
## Balanced Accuracy 0.9876 0.9702 0.9706 0.9786 0.9909
As shown above, the accuracy on the held-out testing data is 0.9686, so the estimated out-of-sample error is 1 - 0.9686 = 0.0314 (about 3.1%).
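The accuracy and per-class sensitivity can be recomputed by hand from the confusion matrix counts. The sketch below copies the counts from the output above into a matrix; the variable names cm, accuracy, oos_error and sensitivity are chosen here for illustration.

```r
# Confusion matrix counts copied from the output above
# (rows = predictions, columns = reference classes).
cm <- matrix(
  c(1372,  25,   4,   0,   0,
       8, 898,  11,   0,   4,
       6,  24, 818,  27,   6,
       7,   1,  20, 776,   5,
       2,   1,   2,   1, 886),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
oos_error   <- 1 - accuracy              # estimated out-of-sample error
sensitivity <- diag(cm) / colSums(cm)    # per-class recall

print(round(accuracy, 4))
print(round(oos_error, 4))
print(round(sensitivity, 4))
```

The diagonal holds the 4750 correctly classified cases out of 4904, which gives the reported accuracy.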
The plot below shows pairwise scatter plots of the features whose names contain "total", grouped by the outcome classe.
library(knitr)
featurePlot(x=testing[,grep("total",names(testing))],y=testing$classe,plot="pairs")
#Download the test data and predict classe for each of its cases
temp <- tempfile()
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url,temp,method="curl")
test <- read.csv(temp,header = TRUE,sep = ",",na.strings = c("NA", ""))
x <- as.character(predict(modelfit,newdata=test))
print(x)
## [1] "B" "A" "A" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"