Abstract

Devices such as Jawbone Up, Nike FuelBand, and Fitbit make it possible to collect a large amount of data about personal activity relatively inexpensively. The goal of this assignment is to predict the behaviour of the enthusiasts who use these devices. Using the machine learning tools and models we have learnt, we need to prepare the data and find the model that best achieves this objective.

Data Source:

The training data for this project are available here:

Training Data

The test data are available here:

Testing Data

Required Libraries

# required libraries for our analysis
library(rpart)
library(gbm)
library(rattle)
library(randomForest)
library(caret)
library(here)

Loading Data

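pml_training <- read.csv(here("pml-training.csv"))
pml_testing <- read.csv(here("pml-testing.csv"))  # assumed file name; needed for the final predictions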

Diving into data

#dimensions of our data 
dim(pml_training)

## [1] 19622   160

#structure of data
str(pml_training)

#variables of the data
names(pml_training)

Polishing Data

Our first step is to remove the columns that are unnecessary for our models. A little exploration shows that the first five columns fall into this category.

data<- pml_training[,-(1:5)]

Next, we filter out the variables that contribute little to the models.

# near-zero-variance columns
ext_col <- nearZeroVar(data)
data <- data[, -ext_col]
# remove columns in which more than 70% of the values are NA
rem_na <- sapply(data, function(x) mean(is.na(x)) > 0.7)
clean_data <- data[, !rem_na]
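
The missing-value filter above can be tried on a small example. The sketch below uses a made-up data frame (`toy`, with hypothetical columns `a`, `b` and `c`), not the assignment data, and applies the same 70% threshold.

```r
# Toy data frame (hypothetical, not the assignment data)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),       # no missing values
  b = c(NA, NA, NA, NA, 5),   # 80% missing
  c = c(1, NA, 3, NA, 5)      # 40% missing
)

# Fraction of NA values in each column
na_share <- sapply(toy, function(x) mean(is.na(x)))

# Keep only the columns with at most 70% missing values
toy_clean <- toy[, na_share <= 0.7, drop = FALSE]
names(toy_clean)  # "a" "c"
```

Column `b` is dropped because 80% of its values are missing, while `c` stays because it is below the threshold.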

Data Slicing

To compare models, we hold out 30% of the training data for validation.

# set the seed before partitioning so that the split is reproducible
set.seed(122)
slice <- createDataPartition(clean_data$classe, p = 0.7, list = FALSE)
training <- clean_data[slice, ]
val_data <- clean_data[-slice, ]
# 5-fold cross-validation for the caret models
control <- trainControl(method = "cv", number = 5)
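
For a factor outcome, createDataPartition samples within each class so that the class proportions are roughly preserved in both parts. The base R sketch below illustrates that idea on made-up labels; it is an illustration of stratified sampling under assumed class sizes, not caret's actual implementation.

```r
set.seed(1)  # arbitrary seed for this illustration

# Hypothetical labels with unequal class sizes
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.7 * length(i)))]
}))

train_labels <- labels[train_idx]
valid_labels <- labels[-train_idx]

table(train_labels)  # 35 A, 21 B, 14 C
table(valid_labels)  # 15 A,  9 B,  6 C
```

Both parts keep the 50:30:20 ratio of the original labels, which is why a validation set built this way has the same class prevalence as the training set.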

Model Testing

1. Gradient Boosting

# gradient boosting model (gbm)
set.seed(123)
gbm_model<- train(form = classe ~ .,
                  data = training,
                  method = "gbm",
                  trControl = control)

Visual Insights

Model Validation

#validation
pre_gbm<- predict(gbm_model, val_data)
confusionMatrix(table(pre_gbm, val_data$classe))

## Confusion Matrix and Statistics
## 
##        
## pre_gbm    A    B    C    D    E
##       A 1673   13    0    0    0
##       B    1 1112   14    2    1
##       C    0   13 1008   15    4
##       D    0    1    4  947    6
##       E    0    0    0    0 1071
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9874          
##                  95% CI : (0.9842, 0.9901)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9841          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9763   0.9825   0.9824   0.9898
## Specificity            0.9969   0.9962   0.9934   0.9978   1.0000
## Pos Pred Value         0.9923   0.9841   0.9692   0.9885   1.0000
## Neg Pred Value         0.9998   0.9943   0.9963   0.9965   0.9977
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1890   0.1713   0.1609   0.1820
## Detection Prevalence   0.2865   0.1920   0.1767   0.1628   0.1820
## Balanced Accuracy      0.9982   0.9863   0.9879   0.9901   0.9949

Insights:

Accuracy: 0.9874. An accuracy of roughly 98.7% is impressive, but we cannot call this the best model until we have tested other model types and seen which fits best.
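
The accuracy and sensitivity reported by confusionMatrix can be reproduced by hand from the printed table. The sketch below re-enters the gbm confusion matrix from the output above; the object names (`cm`, `accuracy`, `sensitivity`) are chosen only for this illustration.

```r
# Confusion matrix for the gbm model, copied from the printed output
# (rows = predicted class, columns = actual class)
cm <- matrix(
  c(1673,   13,    0,    0,    0,
       1, 1112,   14,    2,    1,
       0,   13, 1008,   15,    4,
       0,    1,    4,  947,    6,
       0,    0,    0,    0, 1071),
  nrow = 5, byrow = TRUE,
  dimnames = list(predicted = LETTERS[1:5], actual = LETTERS[1:5])
)

accuracy <- sum(diag(cm)) / sum(cm)     # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)   # per-class share of actual cases found

round(accuracy, 4)     # 0.9874
round(sensitivity, 4)  # matches the "Sensitivity" row of the output
```

Here 5811 of the 5885 validation rows lie on the diagonal, which gives the overall accuracy of 0.9874.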

2. Decision Tree

#decision tree
set.seed(124)
dec_model<- rpart(formula = classe ~ .,
                  data = training,
                  method = "class")

Pruning the tree so that the plot is easier to read.

#plot tuning
printcp(dec_model)
pruned_mod<- prune(dec_model, cp = 0.065)
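
The code above prunes with a fixed cp of 0.065. A common alternative, shown here only as a sketch and not as the method used in this report, is to take the cp value with the lowest cross-validated error from the model's cp table. The built-in iris data stands in for the assignment data, and the names `fit`, `best_cp` and `pruned` are illustrative.

```r
library(rpart)

set.seed(1)  # rpart's internal cross-validation is random

# Built-in iris data stands in for the assignment data here
fit <- rpart(Species ~ ., data = iris, method = "class")

# The cp table holds the cross-validated error (xerror) for each tree size
cp_table <- fit$cptable

# Pick the cp value with the lowest cross-validated error
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

pruned <- prune(fit, cp = best_cp)

# Pruning can only remove nodes, never add them
nrow(pruned$frame) <= nrow(fit$frame)
```

A larger cp removes more splits, so a value such as 0.065 gives a small tree that is easy to plot, usually at some cost in accuracy.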

Visual Insights

Model Validation

# validation (predictions come from the unpruned tree, dec_model)
pre_dec <- predict(dec_model, val_data, type = "class")
dec_valmat <- confusionMatrix(table(pre_dec, val_data$classe))
dec_valmat

Insights:

Accuracy: 0.7487. An accuracy of about 75% is only average, and clearly worse than that of the gradient boosting model tested above.

3. Random Forest

set.seed(125)
rf_model <- train(
  classe ~ .,
  data = training,
  method = "rf",
  trControl = control)

Visual Insights

#plot
plot(varImp(rf_model))

Model Validation

#validation
pre_rf<- predict(rf_model, val_data)
confusionMatrix(table(pre_rf, val_data$classe))

## Confusion Matrix and Statistics
## 
##       
## pre_rf    A    B    C    D    E
##      A 1674    5    0    0    0
##      B    0 1133    3    0    0
##      C    0    1 1023    4    0
##      D    0    0    0  960    2
##      E    0    0    0    0 1080
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9975          
##                  95% CI : (0.9958, 0.9986)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9968          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9947   0.9971   0.9959   0.9982
## Specificity            0.9988   0.9994   0.9990   0.9996   1.0000
## Pos Pred Value         0.9970   0.9974   0.9951   0.9979   1.0000
## Neg Pred Value         1.0000   0.9987   0.9994   0.9992   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1925   0.1738   0.1631   0.1835
## Detection Prevalence   0.2853   0.1930   0.1747   0.1635   0.1835
## Balanced Accuracy      0.9994   0.9971   0.9980   0.9977   0.9991

Insights:

Accuracy: 0.9975. At over 99%, this is the best accuracy among the models above, so this is the model we will apply to the test data.

The Best Fit

Now let's apply the most accurate model to the test dataset.

pre_test <- predict(rf_model, pml_testing)  # predict with the random forest model
pre_test

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

This brings the assignment to an end: the random forest model is the best of the three, with an accuracy of about 99.7% on the validation set.

Prediction Assignment Writeup

Jeet Bhadouria

2025-11-09
