Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
Goal
The goal of the project is to predict the manner in which the exercise was performed, that is, to predict the “classe” variable in the training set.
Setting the working directory and importing the data
setwd("D:/git/PML")
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
library(lubridate); library(caret); library(randomForest); library(dplyr); library(rpart);library(e1071);
Many columns contain only 0 or NA values. Since predictions have to be made on the test set, only columns that are fully populated there are usable, so I drop every column that contains an NA in the test set and build my model with the remaining ones.
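The column selection described above can be illustrated on a toy example. This is only a sketch: the data frames `toy_train` and `toy_test` and their values are made up for illustration, and only the `colSums(is.na(...)) == 0` idea comes from the report.

```r
# Toy stand-ins for the training and test sets (hypothetical values).
toy_train <- data.frame(roll_belt = c(1.4, 1.5, 1.6),
                        kurtosis_roll_belt = c(NA, 0.2, NA),
                        classe = c("A", "B", "A"))
toy_test <- data.frame(roll_belt = c(1.2, 1.3),
                       kurtosis_roll_belt = c(NA, NA),
                       problem_id = 1:2)

# A column is kept only if the test set has no missing values in it.
keep <- colSums(is.na(toy_test)) == 0

# The same logical mask is applied by position to both data frames,
# so the two sets must have their columns in the same order.
toy_train_clean <- toy_train[, keep]
toy_test_clean <- toy_test[, keep]

names(toy_train_clean)
names(toy_test_clean)
```

Because the mask is positional, the last column is kept in both sets even though it is named `classe` in one and `problem_id` in the other.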
datacleantraining <- training[,colSums(is.na(testing)) == 0]
datacleantesting <- testing[,colSums(is.na(testing)) ==0]
# changing the time and date format
datacleantraining$cvtd_timestamp <- as.Date(datacleantraining$cvtd_timestamp, format = "%m/%d/%Y %H:%M")
datacleantesting$cvtd_timestamp <- as.Date(datacleantesting$cvtd_timestamp, format = "%m/%d/%Y %H:%M")
table(datacleantraining$classe, datacleantraining$user_name)
##
## adelmo carlitos charles eurico jeremy pedro
## A 1165 834 899 865 1177 640
## B 776 690 745 592 489 505
## C 750 493 539 489 652 499
## D 515 486 642 582 522 469
## E 686 609 711 542 562 497
qplot(datacleantraining$classe, xlab = "Activity Class", fill = datacleantraining$user_name)
qplot(datacleantraining$user_name, xlab = "User", fill = datacleantraining$classe)
Class A is the most frequent activity class, followed by B and E. Among the users, “Adelmo” has the most observations, followed by “Charles” and “Jeremy”.
Of the available variables (regressors), I want to build my model only with those that depend on the sensor readings and not on the date or the user.
For the model to be effective, I believe it should be user agnostic and should be able to predict the activity class irrespective of the time.
filter1 <- grepl("belt|arm|dumbell|classe", names(datacleantraining))
datacleantraining <- datacleantraining[,filter1]
datacleantesting <- datacleantesting[,filter1]
dim(datacleantraining)
## [1] 19622 40
Hence, from the 160 initial columns we are now down to 40: 39 predictors and the outcome variable, which we will use to construct our model.
Before constructing the model, I want to check for zero-variance regressors among these variables.
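To see which kinds of columns a pattern like the one above keeps, it can be tried on a handful of names. The column names below are hypothetical examples written in the style of the data set, not a listing of its actual columns. One thing the sketch brings out: `grepl` matches substrings, so `arm` also matches `forearm`, while the spelling `dumbell` would not match a column spelled `dumbbell`.

```r
# Hypothetical column names in the style of the data set.
cols <- c("user_name", "roll_belt", "pitch_arm", "yaw_forearm",
          "roll_dumbbell", "classe")

# Same pattern as in the report: alternatives are separated by "|".
pattern <- "belt|arm|dumbell|classe"
matched <- grepl(pattern, cols)

# Show each name next to whether the pattern matched it.
data.frame(column = cols, matched = matched)
```

If the dumbbell sensor columns are wanted and the data set spells them with a double "b", the pattern would need to be spelled the same way; it is worth checking `names(datacleantraining)` before relying on the filter.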
zerovariance <- nearZeroVar(datacleantraining, saveMetrics = T)
summary(zerovariance$zeroVar); summary(zerovariance$nzv)
## Mode FALSE
## logical 40
## Mode FALSE
## logical 40
None of the variables has zero or near-zero variance, so all 39 predictors (40 columns minus the outcome variable) will be used to construct the predictive model.
I will split the training set further into a training subset and a test subset, which I will use for cross-validation and testing.
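For readers unfamiliar with the check, the idea behind a near-zero-variance filter can be sketched in base R. This is a simplified illustration, not caret's implementation: the function name `nzv_sketch` is made up, and the cut-offs (a frequency ratio of 95/5 and 10% unique values) are assumed here to mirror caret's documented defaults.

```r
# Simplified sketch of the two metrics behind a near-zero-variance check.
nzv_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of observations.
  pct_unique <- 100 * length(counts) / length(x)
  # A single distinct value means the variable has zero variance.
  zero_var <- length(counts) < 2
  list(freq_ratio = freq_ratio,
       pct_unique = pct_unique,
       zero_var = zero_var,
       nzv = zero_var || (freq_ratio > freq_cut && pct_unique <= unique_cut))
}

nzv_sketch(c(rep(0, 99), 1))  # one value dominates: flagged
nzv_sketch(1:100)             # every value distinct: not flagged
```

A predictor is flagged only when one value dominates and there are few distinct values overall, which is why continuous sensor readings typically pass.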
inTrain <- createDataPartition(datacleantraining$classe , p = 0.5, list = FALSE)
train1 <- datacleantraining[inTrain,]; test1 <- datacleantraining[-inTrain,]
Ideally I would have put 60% of the data in the training subset, but my machine crashed three times because of RAM and other resource constraints, so I used 50% instead.
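The 50/50 split keeps the class proportions roughly equal in both halves, since `createDataPartition` samples within each level of a factor outcome. A minimal base R sketch of that idea follows; the labels, class sizes, and seed are invented for illustration, and this is not caret's code.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical outcome with unbalanced classes.
labels <- factor(rep(c("A", "B", "C"), times = c(60, 30, 10)))

# Group row indices by class, then sample half of each group.
idx_by_class <- split(seq_along(labels), labels)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.5 * length(idx)))]
}))

train_labels <- labels[in_train]
test_labels <- labels[-in_train]

table(train_labels)
table(test_labels)
```

Fixing a seed before the real `createDataPartition` call would likewise make the split, and therefore the reported accuracies, reproducible.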
fit1 <- train(classe ~ . , data = train1, method = "rf")
Checking the model
fit1$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 20
##
## OOB estimate of error rate: 1.4%
## Confusion matrix:
## A B C D E class.error
## A 2782 4 1 2 1 0.002867384
## B 11 1867 18 1 2 0.016850974
## C 0 35 1655 19 2 0.032729398
## D 0 4 23 1577 4 0.019278607
## E 0 3 2 5 1794 0.005543237
We see that the OOB error rate is 1.4%.
Now to check the accuracy of the predictions on the training subset.
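The OOB error rate and the per-class errors can be recomputed from the confusion matrix printed in the model summary. The sketch below types the counts in by hand (rows are the actual classes, columns the predicted ones); the object names are chosen here for illustration.

```r
classes <- c("A", "B", "C", "D", "E")

# OOB confusion matrix copied from the model summary above.
oob_cm <- matrix(c(2782,    4,    1,    2,    1,
                     11, 1867,   18,    1,    2,
                      0,   35, 1655,   19,    2,
                      0,    4,   23, 1577,    4,
                      0,    3,    2,    5, 1794),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

# Overall OOB error: all off-diagonal counts over all observations.
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

round(class_error, 4)
round(100 * oob_error, 2)  # in percent
```

Of the 9812 training observations, 137 fall off the diagonal, which gives the 1.4% reported by `randomForest`.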
pred1 <- predict(fit1, train1)
confusionMatrix(pred1, train1$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2790 0 0 0 0
## B 0 1899 0 0 0
## C 0 0 1711 0 0
## D 0 0 0 1608 0
## E 0 0 0 0 1804
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9996, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
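Here we get 100% accuracy, which is expected because these are the same observations the model was trained on; it is not an estimate of out-of-sample performance.
We now apply the model to the test subset created by the partition.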
pred2 <- predict(fit1, newdata = test1)
confusionMatrix(pred2, test1$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2773 14 0 0 0
## B 11 1870 22 3 1
## C 3 10 1670 27 0
## D 1 1 18 1575 2
## E 2 3 1 3 1800
##
## Overall Statistics
##
## Accuracy : 0.9876
## 95% CI : (0.9852, 0.9897)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9843
## Mcnemar's Test P-Value : 0.1037
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9939 0.9852 0.9760 0.9795 0.9983
## Specificity 0.9980 0.9953 0.9951 0.9973 0.9989
## Pos Pred Value 0.9950 0.9806 0.9766 0.9862 0.9950
## Neg Pred Value 0.9976 0.9965 0.9949 0.9960 0.9996
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2827 0.1906 0.1702 0.1606 0.1835
## Detection Prevalence 0.2841 0.1944 0.1743 0.1628 0.1844
## Balanced Accuracy 0.9960 0.9903 0.9855 0.9884 0.9986
On the test subset the model achieves an accuracy of 98.8% (95% CI: 98.5% to 99.0%), i.e. an out-of-sample error of about 1.2%, which indicates a good fit.
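The headline accuracy and the per-class sensitivities follow directly from the confusion matrix on the held-out half. A short sketch, with the counts copied by hand from the output above (rows are predictions, columns the reference classes) and object names chosen here for illustration:

```r
classes <- c("A", "B", "C", "D", "E")

# Held-out confusion matrix copied from the confusionMatrix() output.
test_cm <- matrix(c(2773,   14,    0,    0,    0,
                      11, 1870,   22,    3,    1,
                       3,   10, 1670,   27,    0,
                       1,    1,   18, 1575,    2,
                       2,    3,    1,    3, 1800),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(prediction = classes, reference = classes))

# Accuracy: correctly classified observations over all observations.
accuracy <- sum(diag(test_cm)) / sum(test_cm)

# Sensitivity per class: correct predictions over the reference total.
sensitivity <- diag(test_cm) / colSums(test_cm)

round(accuracy, 4)
round(1 - accuracy, 4)   # estimated out-of-sample error
round(sensitivity, 4)
```

The out-of-sample error of roughly 1.2% is close to the 1.4% OOB estimate from the training half, which suggests the OOB figure was a reasonable guide.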
plot(fit1$finalModel, main = "Final Model")
plot(fit1)
pred3 <- predict(fit1, newdata = datacleantesting)
pred3
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
These are the predictions for the 20 test cases, which I used as my final answers.