Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The Weight Lifting Exercise Dataset was analysed to predict the class of the exercise (classe) from the other variables in the dataset. First, the data was preprocessed to remove columns with a large number of NA values. Next, the nearZeroVar function was used to check whether any attributes with near-zero variance were present. After preprocessing, a classification tree (rpart) was fitted with 4-fold cross-validation. Since its out-of-sample accuracy turned out to be below 50%, a random forest was trained on the same data, also with 4-fold cross-validation. It reached an out-of-sample accuracy of 99.4% on the held-out testing set, and this model was therefore selected.
Tabulating the number of missing values per column showed that many columns consist almost entirely of missing values.
{
set.seed(1234)
library(caret,quietly=TRUE)
data <- read.csv("pml-training.csv",na.strings=c("","NA","NULL"))
quiz <- read.csv("pml-testing.csv",na.strings=c("","NA","NULL"))
table(sapply(data,function(x) sum(is.na(x))))
}
##
##     0 19216
##    60   100
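The table above reads as: 60 columns have no missing values, and 100 columns have 19216 missing values each. The following is a minimal sketch of the same count-then-filter idea on a toy data frame; the column names and values are hypothetical stand-ins, not taken from the real files.

```r
# Toy data frame standing in for the training data (hypothetical values).
toy <- data.frame(
  roll_belt  = c(1.4, 1.5, 1.6, 1.7),
  max_roll   = c(NA, NA, NA, 2.1),   # mostly missing, like the sparse columns
  pitch_belt = c(8.0, 8.1, 8.2, 8.3),
  classe     = c("A", "A", "B", "B")
)

# Count missing values per column, then tabulate how many columns share each count.
na_counts <- colSums(is.na(toy))
print(table(na_counts))

# Keep only the columns with no missing values.
complete_cols <- names(na_counts)[na_counts == 0]
toy_clean <- toy[, complete_cols, drop = FALSE]
print(names(toy_clean))
```

Selecting by an explicit zero-NA condition keeps the rule easy to audit; a looser threshold (for example, dropping columns with more than half their values missing) would be a different design choice.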
The 100 columns containing missing values were removed to produce a clean dataset.
The first 7 columns of the dataset were also removed, as they contain bookkeeping parameters that do not aid in predicting the class.
Finally, the nearZeroVar function was used to check whether any remaining column has near-zero variance, since such columns can hamper the model training process.
cleanData <- data[,colSums(is.na(data))==0] # keep only the columns with no missing values
cleanData <- cleanData[,-c(1:7)] # drop the first 7 columns of the dataset
nearZeroVar(cleanData)
## integer(0)
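The empty result above means that no column was flagged. As a rough illustration of what the check looks at, here is a base-R approximation of the two statistics that caret documents for nearZeroVar: the ratio of the most common value's frequency to the second most common, and the percentage of distinct values. The function name `nzv_sketch` and the toy vectors are hypothetical, and the cutoffs are assumed to match caret's documented defaults (freqCut = 95/5, uniqueCut = 10); this is not caret's implementation.

```r
# Base-R approximation of the near-zero-variance rule (a sketch, not caret's code).
nzv_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)            # constant column: zero variance
  freq_ratio <- as.numeric(counts[1] / counts[2])  # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(x)   # distinct values as a % of rows
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

set.seed(1)
varied   <- rnorm(100)             # continuous, all values distinct
lopsided <- c(rep(0, 98), 1, 2)    # dominated by a single value

print(nzv_sketch(varied))
print(nzv_sketch(lopsided))
```

A column is flagged only when both conditions hold, so a continuous sensor reading with many distinct values passes even if one value happens to repeat often.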
inTrain <- createDataPartition(cleanData$classe,p=0.7,list=FALSE)
training <- cleanData[inTrain,]
testing <- cleanData[-inTrain,]
modFit <- train(classe~.,data=training,method="rpart",trControl = trainControl(method="cv",number=4,allowParallel=TRUE))
## Loading required namespace: e1071
confusionMatrix(testing$classe,predict(modFit,testing))
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1530   35  105    0    4
##          B  486  379  274    0    0
##          C  493   31  502    0    0
##          D  452  164  348    0    0
##          E  168  145  302    0  467
##
## Overall Statistics
##
##                Accuracy : 0.489
##                  95% CI : (0.476, 0.502)
##     No Information Rate : 0.532
##     P-Value [Acc > NIR] : 1
##
##                   Kappa : 0.331
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.489   0.5027   0.3279       NA   0.9915
## Specificity             0.948   0.8519   0.8797    0.836   0.8864
## Pos Pred Value          0.914   0.3327   0.4893       NA   0.4316
## Neg Pred Value          0.620   0.9210   0.7882       NA   0.9992
## Prevalence              0.532   0.1281   0.2602    0.000   0.0800
## Detection Rate          0.260   0.0644   0.0853    0.000   0.0794
## Detection Prevalence    0.284   0.1935   0.1743    0.164   0.1839
## Balanced Accuracy       0.718   0.6773   0.6038       NA   0.9390
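The reported accuracy can be checked by hand from the counts in the matrix above: it is the sum of the diagonal divided by the total. One caveat when reading the table: caret documents the first argument of `confusionMatrix` as the predictions and the second as the reference, whereas the call above passes `testing$classe` first. The rows labelled "Prediction" therefore hold the actual classes and the columns labelled "Reference" hold the tree's predictions. Overall accuracy is unaffected by this swap, but the per-class statistics are computed from the transposed table. The sketch below uses only base R; the object name `cm` is hypothetical.

```r
# Counts copied from the confusion matrix above.
# Rows: first argument (testing$classe); columns: second argument (the predictions).
cm <- matrix(
  c(1530,  35, 105, 0,   4,
     486, 379, 274, 0,   0,
     493,  31, 502, 0,   0,
     452, 164, 348, 0,   0,
     168, 145, 302, 0, 467),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

accuracy <- sum(diag(cm)) / sum(cm)   # share of held-out rows classified correctly
print(round(accuracy, 3))             # 0.489, matching the reported Accuracy

# Transposing the table (swapping the two arguments) leaves accuracy unchanged.
print(all.equal(accuracy, sum(diag(t(cm))) / sum(t(cm))))

# Column totals: the D column sums to zero, so the tree never predicts class D.
print(colSums(cm))
```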
Since the accuracy on the held-out testing set (the out-of-sample accuracy) is below 50%, we try fitting a different model, a random forest.
modFit <- train(classe~.,data=training,method="rf",trControl = trainControl(method="cv",number=4,allowParallel=TRUE))
confusionMatrix(testing$classe,predict(modFit,testing))
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B   11 1127    1    0    0
##          C    0    4 1018    4    0
##          D    0    2    6  955    1
##          E    0    1    2    3 1076
##
## Overall Statistics
##
##                Accuracy : 0.994
##                  95% CI : (0.992, 0.996)
##     No Information Rate : 0.286
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.992
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.993    0.994    0.991    0.993    0.999
## Specificity             1.000    0.997    0.998    0.998    0.999
## Pos Pred Value          1.000    0.989    0.992    0.991    0.994
## Neg Pred Value          0.997    0.999    0.998    0.999    1.000
## Prevalence              0.286    0.193    0.175    0.163    0.183
## Detection Rate          0.284    0.192    0.173    0.162    0.183
## Detection Prevalence    0.284    0.194    0.174    0.164    0.184
## Balanced Accuracy       0.997    0.996    0.995    0.995    0.999
This model is accepted, as its out-of-sample accuracy (the accuracy on the held-out testing set) is 99.4%, well above 90%.
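The expected out-of-sample error is one minus this accuracy. A minimal sketch of the arithmetic, using the counts from the random forest confusion matrix above, follows. The interval is computed with base R's `binom.test`, on the assumption that an exact binomial interval is a reasonable stand-in for the reported 95% CI; the object names are hypothetical.

```r
# Counts copied from the random forest confusion matrix above.
cm_rf <- matrix(
  c(1674,    0,    0,   0,    0,
      11, 1127,    1,   0,    0,
       0,    4, 1018,   4,    0,
       0,    2,    6, 955,    1,
       0,    1,    2,   3, 1076),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

correct <- sum(diag(cm_rf))          # correctly classified held-out rows
total   <- sum(cm_rf)                # all held-out rows
oos_error <- 1 - correct / total     # estimated out-of-sample error rate

print(round(100 * oos_error, 2))     # error as a percentage: 0.59

# Exact binomial confidence interval for the accuracy (assumed method).
ci <- binom.test(correct, total)$conf.int
print(round(ci, 3))
```

With 35 misclassified rows out of 5885, roughly one row in 170 is wrong on data the model did not see during training.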
cleanTestData <- quiz[,colSums(is.na(data))==0] # select the same columns in the quiz set as in the training set
cleanTestData <- cleanTestData[,-c(1:7)] # drop the first 7 columns of the dataset
answers <- predict(modFit,cleanTestData)
print(answers)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
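The quiz set above is subset with the missing-value mask computed on the training file, which works as long as both files list their columns in the same order. The following is a minimal sketch of an alternative that selects predictors by name instead. It is not the method used in this report, and the toy data frames and their values are hypothetical.

```r
# Toy frames standing in for the cleaned training set and the raw quiz set.
training_toy <- data.frame(
  roll_belt  = c(1.4, 1.5),
  pitch_belt = c(8.0, 8.1),
  classe     = c("A", "B")
)
quiz_toy <- data.frame(
  X          = 1:2,
  pitch_belt = c(8.2, 8.3),   # deliberately in a different column order
  roll_belt  = c(1.6, 1.7),
  problem_id = 1:2
)

# Predictor names are every training column except the outcome.
predictors <- setdiff(names(training_toy), "classe")

# Fail early if the quiz set lacks a predictor the model expects.
stopifnot(all(predictors %in% names(quiz_toy)))

quiz_clean <- quiz_toy[, predictors, drop = FALSE]
print(names(quiz_clean))
```

Selecting by name makes the subset independent of column order and raises an error if a predictor is missing, rather than silently picking the wrong column.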
n <- length(answers)
for(i in seq_len(n)){
  # write each prediction to its own file: problem_id_1.txt, problem_id_2.txt, ...
  filename <- paste0("problem_id_",i,".txt")
  write.table(answers[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
These results will be submitted for the assignment.