This project is about predicting human movements. Wearable devices have become commonplace; in our data set, users wore accelerometers on the belt, arm, forearm and dumbbell and performed 5 different movements while the sensors recorded the motion. The main idea is to investigate whether we can determine which movement was performed from these recordings.
# Downloading training and test sets
options(warn = -1)  # suppress warnings for cleaner output
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url, destfile = paste(getwd(),"/training.csv",sep = ""))
testurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(testurl, destfile = paste(getwd(),"/testing.csv",sep = ""))
training <- read.csv(paste(getwd(),"/training.csv",sep = ""))
testing <- read.csv(paste(getwd(),"/testing.csv",sep = ""))
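# Note: read.csv can mark empty strings as NA at load time via na.strings,
# which would simplify the column cleaning below. An alternative (not used here):
# training <- read.csv("training.csv", na.strings = c("NA", ""))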
dim(training); dim(testing)
## [1] 19622 160
## [1] 20 160
head(training)
str(training$classe)
## chr [1:19622] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" ...
Since our target variable classe is categorical, we should make it a factor variable.
training$classe <- as.factor(training$classe)
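A quick look at the class distribution is worthwhile before modelling, since the largest class sets the baseline "no information rate" we will meet later in the confusion matrix output:

table(training$classe)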
Let’s have a look at which columns have NA values.
colSums(is.na(training))[colSums(is.na(training)) > 100]  # NA counts for columns with more than 100 NAs
## max_roll_belt max_picth_belt min_roll_belt
## 19216 19216 19216
## min_pitch_belt amplitude_roll_belt amplitude_pitch_belt
## 19216 19216 19216
## var_total_accel_belt avg_roll_belt stddev_roll_belt
## 19216 19216 19216
## var_roll_belt avg_pitch_belt stddev_pitch_belt
## 19216 19216 19216
## var_pitch_belt avg_yaw_belt stddev_yaw_belt
## 19216 19216 19216
## var_yaw_belt var_accel_arm avg_roll_arm
## 19216 19216 19216
## stddev_roll_arm var_roll_arm avg_pitch_arm
## 19216 19216 19216
## stddev_pitch_arm var_pitch_arm avg_yaw_arm
## 19216 19216 19216
## stddev_yaw_arm var_yaw_arm max_roll_arm
## 19216 19216 19216
## max_picth_arm max_yaw_arm min_roll_arm
## 19216 19216 19216
## min_pitch_arm min_yaw_arm amplitude_roll_arm
## 19216 19216 19216
## amplitude_pitch_arm amplitude_yaw_arm max_roll_dumbbell
## 19216 19216 19216
## max_picth_dumbbell min_roll_dumbbell min_pitch_dumbbell
## 19216 19216 19216
## amplitude_roll_dumbbell amplitude_pitch_dumbbell var_accel_dumbbell
## 19216 19216 19216
## avg_roll_dumbbell stddev_roll_dumbbell var_roll_dumbbell
## 19216 19216 19216
## avg_pitch_dumbbell stddev_pitch_dumbbell var_pitch_dumbbell
## 19216 19216 19216
## avg_yaw_dumbbell stddev_yaw_dumbbell var_yaw_dumbbell
## 19216 19216 19216
## max_roll_forearm max_picth_forearm min_roll_forearm
## 19216 19216 19216
## min_pitch_forearm amplitude_roll_forearm amplitude_pitch_forearm
## 19216 19216 19216
## var_accel_forearm avg_roll_forearm stddev_roll_forearm
## 19216 19216 19216
## var_roll_forearm avg_pitch_forearm stddev_pitch_forearm
## 19216 19216 19216
## var_pitch_forearm avg_yaw_forearm stddev_yaw_forearm
## 19216 19216 19216
## var_yaw_forearm
## 19216
Some columns are almost entirely missing (19216 NAs out of 19622 rows) in the training set. We will drop the union of these mostly-NA columns, and of the mostly-empty-string columns, from both the training and test sets.
# columns that are mostly NA
empty_columns <- colnames(training)[colSums(is.na(training)) > 100]
training <- training[, !(names(training) %in% empty_columns)]
testing <- testing[, !(names(testing) %in% empty_columns)]
# columns that are mostly empty strings
empty_chars <- colnames(training)[sapply(training, function(x) sum(x == "", na.rm = TRUE)) > 100]
training <- training[, !(names(training) %in% empty_chars)]
testing <- testing[, !(names(testing) %in% empty_chars)]
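# sanity check: no missing values should remain after dropping the sparse columns
sum(is.na(training))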
dim(training); dim(testing)
## [1] 19622 60
## [1] 20 60
head(training, 2); head(testing, 2)
Now we have clean data. Let’s do some exploring and look at a few plots.
# Exploring the data
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
featurePlot(x = training[, c(7, 11, 14, 16)], y = training$classe, plot = "pairs")  # pairs plot of a few covariates, colored by classe
In this plot we can see the classes are piled on top of each other. Of course we could plot any combination of covariates and look for a relationship. Let’s try another one, this time with a regression line added:
q <- qplot(roll_belt, pitch_belt, data=training, color=classe)
q + geom_smooth(method="lm", formula = y ~ x)
In this plot we can see that some classe categories pile up in different regions, so there are ways to distinguish the classes, but the relationship between the covariates is still not a regression problem. The values in this plot look mostly positive; do the remaining covariates have similar ranges? What is the range of the data, excluding the non-numeric user_name, cvtd_timestamp and new_window columns and the classe target?

sapply(training[, -c(2, 5, 6, 60)], range)  # columns 2, 5, 6 and 60 are the non-numeric ones and classe
## X raw_timestamp_part_1 raw_timestamp_part_2 num_window roll_belt
## [1,] 1 1322489605 294 1 -28.9
## [2,] 19622 1323095081 998801 864 162.0
## pitch_belt yaw_belt total_accel_belt gyros_belt_x gyros_belt_y
## [1,] -55.8 -180 0 -1.04 -0.64
## [2,] 60.3 179 29 2.22 0.64
## gyros_belt_z accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
## [1,] -1.46 -120 -69 -275 -52
## [2,] 1.62 85 164 105 485
## magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm
## [1,] 354 -623 -180 -88.8 -180 1
## [2,] 673 293 180 88.5 180 66
## gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## [1,] -6.37 -3.44 -2.33 -404 -318 -636
## [2,] 4.87 2.84 3.02 437 308 292
## magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell pitch_dumbbell
## [1,] -584 -392 -597 -153.7137 -149.5936
## [2,] 782 583 694 153.5456 149.4024
## yaw_dumbbell total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## [1,] -150.8712 0 -204.00 -2.1
## [2,] 154.9523 58 2.22 52.0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## [1,] -2.38 -419 -189 -334
## [2,] 317.00 235 315 318
## magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm
## [1,] -643 -3600 -262 -180
## [2,] 592 633 452 180
## pitch_forearm yaw_forearm total_accel_forearm gyros_forearm_x
## [1,] -72.5 -180 0 -22.00
## [2,] 89.8 180 108 3.97
## gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
## [1,] -7.02 -8.09 -498 -632
## [2,] 311.00 231.00 477 923
## accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z
## [1,] -446 -1280 -896 -973
## [2,] 291 672 1480 1090
We have negative and zero values, as expected. We also have some date-time information. That might suggest treating this as a time-series forecasting problem, but there is nothing to forecast: every row already carries its classe label, so we cannot (and need not) use earlier observations to predict later ones. Let’s take an ensembling approach instead. We have training data, but our test data has no classe variable, so we will create new training and test sets from the training data. We will also remove the date-time information, because the classification should not depend on it, and the X, user_name and new_window columns are equally irrelevant. As a starting model we take a 5% train set and a 95% test set, which keeps the initial comparison of many models fast.
set.seed(323)
inTrain <- createDataPartition(y = training$classe, p = 0.05, list = F)
train <- training[inTrain, -(1:6)]    # drop X, user_name, the timestamps and new_window
tester <- training[-inTrain, -(1:6)]
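We can verify the split sizes before training anything:

dim(train); dim(tester)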
Now, since we are taking an ensembling approach, we should choose some classification algorithms, measure their accuracy, decide which is best, and finally tune that model’s hyperparameters. We could try to tune every model, but that would take serious time. So we will use Random Forest, rpart, Gradient Boosting Machine, Support Vector Machine, Naive Bayes and linear discriminant analysis (LDA) as classifier methods. Our measurement will be accuracy, which in the binary case is (TP+TN)/(TP+TN+FP+FN); for our five classes this generalizes to the sum of the diagonal of the confusion table divided by the total number of predictions. We will use the “gbm” package for gradient boosting, the “e1071” package for the support vector machine and Naive Bayes classifiers, and the “randomForest” package for random forest. We will train a model with each method, make predictions, and save the accuracies in an array named accuracy_rate.
library(gbm)
## Loaded gbm 2.1.8
library(e1071)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
method = c("Random Forest","RPart", "Gradient Boosting", "Support Vector Machine", "Naive Bayes", "Lda")
accuracy_rate = c()
accuracy <- function(pred, test){
sum(diag(table(pred, test)))/sum(table(pred, test))
}
set.seed(1252)
model_rf <- train(classe~., method = "rf", data = train)
pred_rf <- predict(model_rf, tester)
accuracy_rate <- append(accuracy_rate, accuracy(pred_rf, tester$classe))
model_rpart <- train(classe~., method = "rpart", data = train)
pred_rp <- predict(model_rpart, tester)
accuracy_rate <- append(accuracy_rate, accuracy(pred_rp, tester$classe))
model_gbm <- train(classe~., method = "gbm", data = train,verbose=FALSE)
pred_gbm <- predict(model_gbm, tester)
accuracy_rate <- append(accuracy_rate, accuracy(pred_gbm, tester$classe))
model_sv <- svm(classe~., data = train)
pred_sv <- predict(model_sv, tester)
accuracy_rate <- append(accuracy_rate, accuracy(pred_sv, tester$classe))
model_nb <- naiveBayes(classe~., data = train, laplace=1)
pred_nb <- predict(model_nb, tester)
accuracy_rate <- append(accuracy_rate, accuracy(pred_nb, tester$classe))
model_lda <- train(classe~.,method="lda", data = train)
pred_lda <- predict(model_lda, tester)
accuracy_rate <- append(accuracy_rate, accuracy(pred_lda, tester$classe))
We have trained the models and collected their accuracy rates. Let’s see which method worked well for this data:
df <- data.frame(method, accuracy_rate)
df[order(accuracy_rate, decreasing = TRUE), ]
Random Forest has the best accuracy overall, but gradient boosting also gives good results. As mentioned before, we used set.seed(323) to choose the train set, so this comparison reflects one particular random split. Different algorithms do well on different kinds of data: what if this random train set simply happened to favor the Random Forest algorithm, and the model would give faulty results once the train data grows? Let’s compare these two models on a different train set.
set.seed(101)
inTrain <- createDataPartition(y = training$classe, p = 0.05, list = F)
train <- training[inTrain, -(1:6)]
tester <- training[-inTrain, -(1:6)]
set.seed(152)
second_model_rf <- randomForest(classe~., data = train)
second_pred_rf <- predict(second_model_rf, tester)
second_model_gbm <- train(classe~., method = "gbm", data = train,verbose=FALSE)
second_pred_gbm <- predict(second_model_gbm, tester)
paste("Random Forest accuracy:", accuracy(second_pred_rf, tester$classe))
## [1] "Random Forest accuracy: 0.933472825795375"
paste("Gradient Boosting Machine accuracy:", accuracy(second_pred_gbm, tester$classe))
## [1] "Gradient Boosting Machine accuracy: 0.921508664627931"
Accuracy is still pretty good for both, but random forest has slightly better results. Let’s have a look at the confusion matrix for the random forest model:
library(cvms)
library(tibble)
library(ggimage)
library(rsvg)
## Linking to librsvg 2.48.8
tbl <- tibble("Test"=tester$classe, "Prediction"=second_pred_rf)
conf_mat <- confusion_matrix(targets = tbl$Test, predictions = tbl$Prediction)
plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]])
Random Forest is the better algorithm for our data, so we are going to tune its hyperparameters. For this we create trainControl and expand.grid objects: the grid lists candidate values for mtry, the number of variables tried at each split, and caret trains and cross-validates the model for every candidate. The final model deserves a larger training set, so let’s create a bigger train set and do a deeper training of the random forest model:
set.seed(323)
inTrain <- createDataPartition(y = training$classe, p = 0.7, list = F)
train <- training[inTrain, -(1:6)]
tester <- training[-inTrain, -(1:6)]
control <- trainControl(method="repeatedcv", number=5, repeats=3, search="grid")
tunegrid <- expand.grid(.mtry = (1:60))
set.seed(1252)
tuned_model <- train(classe~.,method ="rf", data = train,tuneGrid= tunegrid, trControl=control)
pred_rf <- predict(tuned_model, tester)
confusionMatrix(pred_rf, tester$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 4 0 0 0
## B 0 1134 3 0 0
## C 0 1 1023 3 0
## D 0 0 0 961 7
## E 0 0 0 0 1075
##
## Overall Statistics
##
## Accuracy : 0.9969
## 95% CI : (0.9952, 0.9982)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9961
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9956 0.9971 0.9969 0.9935
## Specificity 0.9991 0.9994 0.9992 0.9986 1.0000
## Pos Pred Value 0.9976 0.9974 0.9961 0.9928 1.0000
## Neg Pred Value 1.0000 0.9989 0.9994 0.9994 0.9985
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1927 0.1738 0.1633 0.1827
## Detection Prevalence 0.2851 0.1932 0.1745 0.1645 0.1827
## Balanced Accuracy 0.9995 0.9975 0.9981 0.9977 0.9968
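The expected out-of-sample error follows directly from this: one minus the held-out accuracy, about 1 - 0.9969 = 0.0031, consistent with the OOB error estimate reported by the final model below.

1 - as.numeric(confusionMatrix(pred_rf, tester$classe)$overall["Accuracy"])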
Let’s look at our model’s properties:
print(tuned_model$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 19
##
## OOB estimate of error rate: 0.23%
## Confusion matrix:
## A B C D E class.error
## A 3905 0 0 0 1 0.0002560164
## B 6 2649 3 0 0 0.0033860045
## C 0 5 2391 0 0 0.0020868114
## D 0 0 11 2239 2 0.0057726465
## E 0 0 0 4 2521 0.0015841584
Now let’s plot our tuned model:
plot(tuned_model$finalModel, main="Tuned Random Forest Model")
varImpPlot(tuned_model$finalModel, main = "Variable Importance For Tuned Random Forest Model")
At the start of this project we downloaded the test data. Let’s make our final predictions with the tuned random forest model:
final_pred <- predict(tuned_model, testing)
final_pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
write.csv(data.frame(prediction = final_pred), paste(getwd(), "/predictions.csv", sep = ""), row.names = FALSE)
We have predicted the movements from the accelerometer records using the random forest method with high accuracy. Although we didn’t tune the Gradient Boosting Machine, its untuned performance suggests this data could be modelled well by gbm too.
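For completeness, here is a sketch of how the same grid-search machinery could be applied to gbm. The grid covers gbm’s four caret tuning parameters, but the values below are illustrative guesses rather than tuned choices, and training over this grid on the 70% train set would take considerable time:

# illustrative tuning grid for gbm (values are guesses, not recommendations)
gbm_grid <- expand.grid(n.trees = c(100, 200, 300),
                        interaction.depth = c(3, 5, 7),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
set.seed(1252)
tuned_gbm <- train(classe ~ ., method = "gbm", data = train,
                   tuneGrid = gbm_grid, trControl = control, verbose = FALSE)
confusionMatrix(predict(tuned_gbm, tester), tester$classe)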