Devices such as Jawbone Up, Nike FuelBand, and Fitbit make it possible to collect a large amount of data about personal activity relatively inexpensively. The goal of this assignment is to predict the behaviour of the enthusiasts who wear these devices, which is recorded in the classe variable. Using the machine learning tools we have learnt, we fit several models and select the one that best achieves this objective.
The training data for this project are available here:
The test data are available here:
Required Libraries
#required libraries for our analysis
require(rpart)
require(gbm)
require(rattle)
require(randomForest)
require(caret)
require(here)
pml_training<- read.csv(here("pml-training.csv"))
Diving into data
#dimensions of our data
dim(pml_training)
## [1] 19622 160
#structure of data
str(pml_training)
#variables of the data
names(pml_training)
Our first step is to remove the columns that are unnecessary to our models. A little exploration shows that the first five columns hold identifiers and timestamps rather than sensor measurements, so we drop them.
data<- pml_training[,-(1:5)]
Next we filter out the variables that would contribute little to our models.
#near-zero variance columns
ext_col<-nearZeroVar(data)
data<- data[,-ext_col]
#removing columns in which more than 70% of the values are NA
rem_na<- sapply(data, function(x) mean(is.na(x))>0.7)
clean_data<- data[,rem_na == FALSE]
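To make the missing-value filter concrete, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are hypothetical and are not taken from pml-training.csv; only the 70% threshold comes from the code above.

```r
# Toy data frame with hypothetical values (not from pml-training.csv)
toy <- data.frame(
  a = c(1, 2, 3, 4),       # no missing values
  b = c(NA, NA, NA, 4),    # 75% missing
  c = c(1, NA, 3, 4)       # 25% missing
)

# Fraction of NA values in each column
na_frac <- sapply(toy, function(x) mean(is.na(x)))

# Keep only the columns where at most 70% of the values are missing
toy_clean <- toy[, na_frac <= 0.7, drop = FALSE]
names(toy_clean)
```

Column b is dropped because three of its four values are missing, while a and c are kept.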
Data Slicing
To find the best model we hold out 30% of our training data for validation and fit the models on the remaining 70%.
slice<- createDataPartition(clean_data$classe, p = 0.7, list = FALSE)
training<- clean_data[slice,]
val_data<- clean_data[-slice,]
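The idea behind a class-stratified 70/30 split can be sketched in base R as follows. This is an illustration of the concept only, not caret's exact algorithm; the label vector and the seed are hypothetical.

```r
set.seed(1)  # arbitrary seed, used only to make this illustration reproducible

# Hypothetical class labels: 70 observations of class A, 30 of class B
labels <- factor(rep(c("A", "B"), times = c(70, 30)))

# Sample 70% of the row indices separately within each class
idx <- unlist(lapply(
  split(seq_along(labels), labels),
  function(i) sample(i, size = round(0.7 * length(i)))
))

train_lab <- labels[idx]   # training part
val_lab   <- labels[-idx]  # validation part

table(train_lab)
table(val_lab)
```

Because sampling is done within each class, both parts keep the same 70:30 ratio of A to B as the full label vector.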
#resampling settings: 5-fold cross-validation
set.seed(122)
control<- trainControl(method = "cv", number = 5)
#gradient boosting modelling(gbm)
set.seed(123)
gbm_model<- train(form = classe ~ .,
data = training,
method = "gbm",
trControl = control)
#validation
pre_gbm<- predict(gbm_model, val_data)
confusionMatrix(table(pre_gbm, val_data$classe))
## Confusion Matrix and Statistics
##
##
## pre_gbm    A    B    C    D    E
##       A 1673   13    0    0    0
##       B    1 1112   14    2    1
##       C    0   13 1008   15    4
##       D    0    1    4  947    6
##       E    0    0    0    0 1071
##
## Overall Statistics
##
## Accuracy : 0.9874
## 95% CI : (0.9842, 0.9901)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9841
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9763   0.9825   0.9824   0.9898
## Specificity            0.9969   0.9962   0.9934   0.9978   1.0000
## Pos Pred Value         0.9923   0.9841   0.9692   0.9885   1.0000
## Neg Pred Value         0.9998   0.9943   0.9963   0.9965   0.9977
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1890   0.1713   0.1609   0.1820
## Detection Prevalence   0.2865   0.1920   0.1767   0.1628   0.1820
## Balanced Accuracy      0.9982   0.9863   0.9879   0.9901   0.9949
Insights:
Accuracy: 0.9874. An accuracy of about 98.7% is quite impressive, but we cannot call this the best model until we have tested other model types and seen which fits best.
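As a sanity check, the overall accuracy can be recomputed by hand from the gbm confusion matrix printed above. The sketch below re-enters those counts manually; the names `cm`, `accuracy`, `oos_error`, and `sens_B` are introduced here for illustration only.

```r
# Counts copied from the gbm confusion matrix above
# (rows = predicted class, columns = reference class)
cm <- matrix(
  c(1673,   13,    0,   0,    0,
       1, 1112,   14,   2,    1,
       0,   13, 1008,  15,    4,
       0,    1,    4, 947,    6,
       0,    0,    0,   0, 1071),
  nrow = 5, byrow = TRUE,
  dimnames = list(pred = LETTERS[1:5], ref = LETTERS[1:5])
)

# Accuracy = correctly classified observations / all observations
accuracy <- sum(diag(cm)) / sum(cm)

# Estimated out-of-sample error on the validation set
oos_error <- 1 - accuracy

# Sensitivity for class B = correct B predictions / all true B observations
sens_B <- cm["B", "B"] / sum(cm[, "B"])

round(c(accuracy = accuracy, oos_error = oos_error, sens_B = sens_B), 4)
```

The diagonal holds 5811 of the 5885 validation observations, which gives the accuracy of 0.9874 reported in the output, an estimated out-of-sample error of about 1.3%, and the class B sensitivity of 0.9763.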
#decision tree
set.seed(124)
dec_model<- rpart(formula = classe ~ .,
data = training,
method = "class")
Pruning the tree for a more interpretable fit.
#complexity parameter table and pruning
printcp(dec_model)
pruned_mod<- prune(dec_model, cp = 0.065)
#validation (predictions come from the unpruned dec_model)
pre_dec<- predict(dec_model, val_data, type = "class")
dec_valmat<- confusionMatrix(table(pre_dec, val_data$classe))
Insights:
Accuracy: 0.7487. About 75% accuracy is only average, and clearly worse than the gradient boosting model we tested before.
#random forest
set.seed(125)
rf_model <- train(
classe ~ .,
data = training,
method = "rf",
trControl = control)
#plot
plot(varImp(rf_model))
#validation
pre_rf<- predict(rf_model, val_data)
confusionMatrix(table(pre_rf, val_data$classe))
## Confusion Matrix and Statistics
##
##
## pre_rf    A    B    C    D    E
##      A 1674    5    0    0    0
##      B    0 1133    3    0    0
##      C    0    1 1023    4    0
##      D    0    0    0  960    2
##      E    0    0    0    0 1080
##
## Overall Statistics
##
## Accuracy : 0.9975
## 95% CI : (0.9958, 0.9986)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9968
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9947   0.9971   0.9959   0.9982
## Specificity            0.9988   0.9994   0.9990   0.9996   1.0000
## Pos Pred Value         0.9970   0.9974   0.9951   0.9979   1.0000
## Neg Pred Value         1.0000   0.9987   0.9994   0.9992   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1925   0.1738   0.1631   0.1835
## Detection Prevalence   0.2853   0.1930   0.1747   0.1635   0.1835
## Balanced Accuracy      0.9994   0.9971   0.9980   0.9977   0.9991
Insights:
Accuracy: 0.9975. At over 99%, the random forest has the best accuracy of all the models above, so this is the model we will use on the test data.
Now let’s apply our most accurate model to the test dataset.
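#load the test data (the file name pml-testing.csv is an assumption mirroring the training file)
pml_testing<- read.csv(here("pml-testing.csv"))
pre_test<- predict(rf_model, pml_testing)#predicting with the random forest model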
pre_test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
And here our assignment comes to an end: the best model we found is the random forest, which reached about 99.7% accuracy on the held-out validation data, so we expect similar accuracy on the test data.