The goal of this project is to predict the manner in which participants did the exercise. This is the “classe” variable in the training set; any of the other variables may be used as predictors.
The source dataset description mentions 5 classes (sitting-down, standing-up, standing, walking, and sitting) collected over 8 hours of activities of 4 healthy subjects.
The goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).
The data for this project come from this source:
- The training data for this project are available here.
- The test data are available here.
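If the CSV files are not already in the working directory, they can be fetched first. This is a minimal sketch, assuming train_url and test_url are placeholders holding the training and test links above:
# fetch the CSVs if they are not already present (train_url / test_url stand in for the links above)
if (!file.exists("pml-training.csv")) download.file(train_url, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(test_url, destfile = "pml-testing.csv")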
# import libraries
library(plotly)
library(caret)
# import data
train <- read.csv("pml-training.csv")
test <- read.csv("pml-testing.csv")
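Before cleaning, it is worth checking the raw dimensions to see how many columns will need trimming:
# raw dimensions of the imported data
dim(train)
dim(test)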
First we need to clean up our dataset and only use columns that carry data:
- Handle NAs
- Remove unwanted columns
# removing columns with NAs
train <- train[, colSums(is.na(train)) == 0]
NZV <- nearZeroVar(train)
train <- train[, -NZV]
# removing the first seven bookkeeping columns (row id, user name, timestamps, window markers)
train <- train[, -c(1:7)]
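The same column selection can be mirrored on the test set so that both data frames carry identical predictors. This is optional, since predict() later matches columns by name; a minimal sketch (test_clean is an illustrative name and is not used below):
# optional: keep only the predictor columns that survived cleaning
test_clean <- test[, intersect(names(train), names(test))]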
# examine our objective column
plot_ly(x = train$classe, type = "histogram", histnorm = "probability") %>%
  layout(title = "Proportion of Classe variable in Train dataset",
         xaxis = list(title = "Classe"), yaxis = list(title = "Probability"))
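The plot shows that class A is the most frequent class, at roughly 28% of observations, with the remaining classes fairly evenly spread between about 16% and 19%.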
The two main modeling techniques will be a decision tree and a random forest.
library(rpart)
library(rpart.plot)
# create training and validation sets (70/30 split)
inTrain <- createDataPartition(train$classe, p = 0.7, list = FALSE)
training <- train[inTrain, ]
validation <- train[-inTrain, ]
training$classe <- as.factor(training$classe)
validation$classe <- as.factor(validation$classe)
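Note that createDataPartition samples randomly, so the exact counts below will differ slightly between runs; setting a seed before the split would make the results reproducible (the seed value here is arbitrary):
# example: set a seed, then re-run the partition above for a reproducible split
set.seed(2023)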
# build decision tree model & plot the result
mod1 <- rpart(classe ~ ., data = training, method = "class", na.action = na.pass)
rpart.plot(mod1, main = "Model #1: Decision Tree")
# use model #1 to predict on the validation set
pred1 <- predict(mod1, newdata=validation, type = "class")
confusionTree <- confusionMatrix(pred1, validation$classe)
confusionTree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1487 210 9 75 69
## B 34 636 90 31 125
## C 32 74 763 157 149
## D 95 178 134 658 133
## E 26 41 30 43 606
##
## Overall Statistics
##
## Accuracy : 0.7052
## 95% CI : (0.6933, 0.7168)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6263
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8883 0.5584 0.7437 0.6826 0.5601
## Specificity 0.9138 0.9410 0.9152 0.8903 0.9709
## Pos Pred Value 0.8038 0.6943 0.6494 0.5492 0.8123
## Neg Pred Value 0.9537 0.8988 0.9442 0.9347 0.9074
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2527 0.1081 0.1297 0.1118 0.1030
## Detection Prevalence 0.3144 0.1556 0.1997 0.2036 0.1268
## Balanced Accuracy 0.9010 0.7497 0.8294 0.7864 0.7655
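The decision tree reaches an accuracy of about 70.5% on the validation set, which corresponds to an estimated out-of-sample error of roughly 29.5%.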
library(randomForest)
# build random forest model & plot error rates vs. number of trees
mod2 <- randomForest(classe ~ ., data=training, ntree=50, mtry=5, importance=TRUE)
plot(mod2, log="y")
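The plot shows the out-of-bag and per-class error rates as trees are added to the forest.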
# use model #2 to predict on the validation set
pred2 <- predict(mod2, validation)
confusionRF <- confusionMatrix(pred2, validation$classe)
confusionRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 10 0 0 0
## B 2 1126 9 0 0
## C 0 3 1015 21 2
## D 0 0 2 940 0
## E 0 0 0 3 1080
##
## Overall Statistics
##
## Accuracy : 0.9912
## 95% CI : (0.9884, 0.9934)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9888
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9988 0.9886 0.9893 0.9751 0.9982
## Specificity 0.9976 0.9977 0.9946 0.9996 0.9994
## Pos Pred Value 0.9941 0.9903 0.9750 0.9979 0.9972
## Neg Pred Value 0.9995 0.9973 0.9977 0.9951 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2841 0.1913 0.1725 0.1597 0.1835
## Detection Prevalence 0.2858 0.1932 0.1769 0.1601 0.1840
## Balanced Accuracy 0.9982 0.9931 0.9920 0.9873 0.9988
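The random forest reaches an accuracy of about 99.1% on the validation set (an estimated out-of-sample error below 1%), clearly outperforming the decision tree, so it is used to predict the 20 test cases. Since the model was fit with importance = TRUE, the most influential predictors can also be inspected; a minimal sketch:
# inspect the ten most important predictors in the random forest
varImpPlot(mod2, n.var = 10, main = "Model #2: Variable Importance (top 10)")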
# use model #2 to predict the 20 test cases
predict(mod2, newdata = test)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E