This project is about predicting human movements. Wearable devices have become commonplace; in our data set, users wore accelerometers on the belt, arm, forearm and dumbbell and performed 5 different movements while the sensors recorded the motion. The main idea is to investigate whether we can determine which movement was performed from these recordings.
# Downloading training and test sets
options(warn = -1)  # suppress warnings for cleaner output
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url, destfile = paste(getwd(),"/training.csv",sep = ""))
testurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(testurl, destfile = paste(getwd(),"/testing.csv",sep = ""))
training <- read.csv(paste(getwd(),"/training.csv",sep = ""))
testing <- read.csv(paste(getwd(),"/testing.csv",sep = ""))
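# Note: read.csv can mark empty strings as NA at load time via na.strings,
# which would simplify the column cleaning below. An alternative (not used here):
# training <- read.csv("training.csv", na.strings = c("NA", ""))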
dim(training); dim(testing)
## [1] 19622 160
## [1] 20 160
head(training)
str(training$classe)
## chr [1:19622] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" ...
Since our target variable classe is categorical, we should make it a factor variable.
training$classe <- as.factor(training$classe)
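A quick look at the class distribution is worthwhile before modelling, since the largest class sets the baseline "no information rate" we will meet later in the confusion matrix output:

table(training$classe)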
Let’s have a look at which columns have NA values.
colSums(is.na(training))[colSums(is.na(training)) > 100]  # NA counts for columns with more than 100 NAs
## max_roll_belt max_picth_belt min_roll_belt
## 19216 19216 19216
## min_pitch_belt amplitude_roll_belt amplitude_pitch_belt
## 19216 19216 19216
## var_total_accel_belt avg_roll_belt stddev_roll_belt
## 19216 19216 19216
## var_roll_belt avg_pitch_belt stddev_pitch_belt
## 19216 19216 19216
## var_pitch_belt avg_yaw_belt stddev_yaw_belt
## 19216 19216 19216
## var_yaw_belt var_accel_arm avg_roll_arm
## 19216 19216 19216
## stddev_roll_arm var_roll_arm avg_pitch_arm
## 19216 19216 19216
## stddev_pitch_arm var_pitch_arm avg_yaw_arm
## 19216 19216 19216
## stddev_yaw_arm var_yaw_arm max_roll_arm
## 19216 19216 19216
## max_picth_arm max_yaw_arm min_roll_arm
## 19216 19216 19216
## min_pitch_arm min_yaw_arm amplitude_roll_arm
## 19216 19216 19216
## amplitude_pitch_arm amplitude_yaw_arm max_roll_dumbbell
## 19216 19216 19216
## max_picth_dumbbell min_roll_dumbbell min_pitch_dumbbell
## 19216 19216 19216
## amplitude_roll_dumbbell amplitude_pitch_dumbbell var_accel_dumbbell
## 19216 19216 19216
## avg_roll_dumbbell stddev_roll_dumbbell var_roll_dumbbell
## 19216 19216 19216
## avg_pitch_dumbbell stddev_pitch_dumbbell var_pitch_dumbbell
## 19216 19216 19216
## avg_yaw_dumbbell stddev_yaw_dumbbell var_yaw_dumbbell
## 19216 19216 19216
## max_roll_forearm max_picth_forearm min_roll_forearm
## 19216 19216 19216
## min_pitch_forearm amplitude_roll_forearm amplitude_pitch_forearm
## 19216 19216 19216
## var_accel_forearm avg_roll_forearm stddev_roll_forearm
## 19216 19216 19216
## var_roll_forearm avg_pitch_forearm stddev_pitch_forearm
## 19216 19216 19216
## var_pitch_forearm avg_yaw_forearm stddev_yaw_forearm
## 19216 19216 19216
## var_yaw_forearm
## 19216
Some columns are almost entirely missing (19216 NAs out of 19622 rows) in the training set. We will drop the union of these mostly-NA columns, and of the mostly-empty-string columns, from both the training and test sets.
# columns that are mostly NA
empty_columns <- colnames(training)[colSums(is.na(training)) > 100]
training <- training[, !(names(training) %in% empty_columns)]
testing <- testing[, !(names(testing) %in% empty_columns)]
# columns that are mostly empty strings
empty_chars <- colnames(training)[sapply(training, function(x) sum(x == "", na.rm = TRUE)) > 100]
training <- training[, !(names(training) %in% empty_chars)]
testing <- testing[, !(names(testing) %in% empty_chars)]
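# sanity check: no missing values should remain after dropping the sparse columns
sum(is.na(training))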
dim(training); dim(testing)
## [1] 19622 60
## [1] 20 60
head(training, 2); head(testing, 2)
Now we have clean data. Let’s do some exploring and look at a few plots.
# Exploring the data
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
featurePlot(x = training[, c(7, 11, 14, 16)], y = training$classe, plot = "pairs")  # pairs plot of a few covariates, colored by classe
In this plot we can see the classes are piled on top of each other. Of course we could plot any combination of covariates and look for a relationship. Let’s try another one, this time with a regression line added:
q <- qplot(roll_belt, pitch_belt, data=training, color=classe)
q + geom_smooth(method="lm", formula = y ~ x)
In this plot we can see that some classe categories pile up in different regions, so there are ways to distinguish the classes, but the relationship between the covariates is still not a regression problem. The values in this plot look mostly positive; do the remaining covariates have similar ranges? What is the range of the data, excluding the non-numeric user_name, cvtd_timestamp and new_window columns and the classe target?

sapply(training[, -c(2, 5, 6, 60)], range)  # columns 2, 5, 6 and 60 are the non-numeric ones and classe
## X raw_timestamp_part_1 raw_timestamp_part_2 num_window roll_belt
## [1,] 1 1322489605 294 1 -28.9
## [2,] 19622 1323095081 998801 864 162.0
## pitch_belt yaw_belt total_accel_belt gyros_belt_x gyros_belt_y
## [1,] -55.8 -180 0 -1.04 -0.64
## [2,] 60.3 179 29 2.22 0.64
## gyros_belt_z accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
## [1,] -1.46 -120 -69 -275 -52
## [2,] 1.62 85 164 105 485
## magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm
## [1,] 354 -623 -180 -88.8 -180 1
## [2,] 673 293 180 88.5 180 66
## gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## [1,] -6.37 -3.44 -2.33 -404 -318 -636
## [2,] 4.87 2.84 3.02 437 308 292
## magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell pitch_dumbbell
## [1,] -584 -392 -597 -153.7137 -149.5936
## [2,] 782 583 694 153.5456 149.4024
## yaw_dumbbell total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## [1,] -150.8712 0 -204.00 -2.1
## [2,] 154.9523 58 2.22 52.0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## [1,] -2.38 -419 -189 -334
## [2,] 317.00 235 315 318
## magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm
## [1,] -643 -3600 -262 -180
## [2,] 592 633 452 180
## pitch_forearm yaw_forearm total_accel_forearm gyros_forearm_x
## [1,] -72.5 -180 0 -22.00
## [2,] 89.8 180 108 3.97
## gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
## [1,] -7.02 -8.09 -498 -632
## [2,] 311.00 231.00 477 923
## accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z
## [1,] -446 -1280 -896 -973
## [2,] 291 672 1480 1090
We have negative and zero values, as expected. We also have some date-time information. That might suggest treating this as a time-series forecasting problem, but there is nothing to forecast: every row already carries its classe label, so we cannot (and need not) use earlier observations to predict later ones. Let’s take an ensembling approach instead. We have training data, but our test data has no classe variable, so we will create new training and test sets from the training data. We will also remove the date-time information, because the classification should not depend on it, and the X, user_name and new_window columns are equally irrelevant. As a starting model we take a 5% train set and a 95% test set, which keeps the initial comparison of many models fast.
set.seed(323)
inTrain <- createDataPartition(y = training$classe, p = 0.05, list = F)
train <- training[inTrain, -(1:6)]    # drop X, user_name, the timestamps and new_window
tester <- training[-inTrain, -(1:6)]
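We can verify the split sizes before training anything:

dim(train); dim(tester)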
Now, since we are taking an ensembling approach, we should choose some classification algorithms, measure their accuracy, decide which is best, and finally tune that model’s hyperparameters. We could try to tune every model, but that would take serious time. So we will use Random Forest, rpart, Gradient Boosting Machine, Support Vector Machine, Naive Bayes and linear discriminant analysis (LDA) as classifier methods. Our measurement will be accuracy, which in the binary case is (TP+TN)/(TP+TN+FP+FN); for our five classes this generalizes to the sum of the diagonal of the confusion table divided by the total number of predictions. We will use the “gbm” package for gradient boosting, the “e1071” package for the support vector machine and Naive Bayes classifiers, and the “randomForest” package for random forest. We will train a model with each method, make predictions, and save the accuracies in an array named accuracy_rate.
library(gbm)
## Loaded gbm 2.1.8
library(e1071)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
method = c("Random Forest","RPart", "Gradient Boosting", "Support Vector Machine", "Naive Bayes", "Lda")
accuracy_rate = c()
accuracy <- function(pred, test){
sum(diag(table(pred, test)))/sum(table(pred, test))
}
set.seed(1252)
model_rf <- train(classe~., method = "rf", data = train)
pred_rf <- predict(model_rf, tester)
accuracy_rate <- append(accuracy_rate, accuracy(pred_rf, tester$classe))
model_rpart <- train(classe~., method = "rpart", data = train)
pred_rp <- predict(model_rpart, tester)
accuracy_rate <- append(accuracy_rate, accuracy(pred_rp, tester$classe))
model_gbm <- train(classe~., method = "gbm", data = train,verbose=FALSE)
pred_gbm <- predict(model_gbm, tester)
accuracy_rate <- append(accuracy_rate, accuracy(pred_gbm, tester$classe))
model_sv <- svm(classe~., data = train)
pred_sv <- predict(model_sv, tester)
accuracy_rate <- append(accuracy_rate, accuracy(pred_sv, tester$classe))
model_nb <- naiveBayes(classe~., data = train, laplace=1)
pred_nb <- predict(model_nb, tester)
accuracy_rate <- append(accuracy_rate, accuracy(pred_nb, tester$classe))
model_lda <- train(classe~.,method="lda", data = train)
pred_lda <- predict(model_lda, tester)
accuracy_rate <- append(accuracy_rate, accuracy(pred_lda, tester$classe))
We have trained the models and collected their accuracy rates. Let’s see which method worked well for this data:
df <- data.frame(method, accuracy_rate)
df[order(accuracy_rate, decreasing = TRUE), ]
Random Forest has the best accuracy overall, but gradient boosting also gives good results. As mentioned before, we used set.seed(323) to choose the train set, so this comparison reflects one particular random split. Different algorithms do well on different kinds of data: what if this random train set simply happened to favor the Random Forest algorithm, and the model would give faulty results once the train data grows? Let’s compare these two models on a different train set.
set.seed(101)
inTrain <- createDataPartition(y = training$classe, p = 0.05, list = F)
train <- training[inTrain, -(1:6)]
tester <- training[-inTrain, -(1:6)]
set.seed(152)
second_model_rf <- randomForest(classe~., data = train)
second_pred_rf <- predict(second_model_rf, tester)
second_model_gbm <- train(classe~., method = "gbm", data = train,verbose=FALSE)
second_pred_gbm <- predict(second_model_gbm, tester)
paste("Random Forest accuracy:", accuracy(second_pred_rf, tester$classe))
## [1] "Random Forest accuracy: 0.933472825795375"
paste("Gradient Boosting Machine accuracy:", accuracy(second_pred_gbm, tester$classe))
## [1] "Gradient Boosting Machine accuracy: 0.921508664627931"
Accuracy is still pretty good for both, but random forest has slightly better results. Let’s have a look at the confusion matrix for the random forest model:
library(cvms)
library(tibble)
library(ggimage)
library(rsvg)
## Linking to librsvg 2.48.8
tbl <- tibble("Test"=tester$classe, "Prediction"=second_pred_rf)
conf_mat <- confusion_matrix(targets = tbl$Test, predictions = tbl$Prediction)
plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]])
Random Forest is the better algorithm for our data, so we are going to tune its hyperparameters. For this we create trainControl and expand.grid objects: the grid lists candidate values for mtry, the number of variables tried at each split, and caret trains and cross-validates the model for every candidate. The final model deserves a larger training set, so let’s create a bigger train set and do a deeper training of the random forest model:
set.seed(323)
inTrain <- createDataPartition(y = training$classe, p = 0.7, list = F)
train <- training[inTrain, -(1:6)]
tester <- training[-inTrain, -(1:6)]
control <- trainControl(method="repeatedcv", number=5, repeats=3, search="grid")
tunegrid <- expand.grid(.mtry = (1:60))
set.seed(1252)
tuned_model <- train(classe~.,method ="rf", data = train,tuneGrid= tunegrid, trControl=control)
pred_rf <- predict(tuned_model, tester)
confusionMatrix(pred_rf, tester$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 4 0 0 0
## B 0 1134 3 0 0
## C 0 1 1023 3 0
## D 0 0 0 961 7
## E 0 0 0 0 1075
##
## Overall Statistics
##
## Accuracy : 0.9969
## 95% CI : (0.9952, 0.9982)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9961
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9956 0.9971 0.9969 0.9935
## Specificity 0.9991 0.9994 0.9992 0.9986 1.0000
## Pos Pred Value 0.9976 0.9974 0.9961 0.9928 1.0000
## Neg Pred Value 1.0000 0.9989 0.9994 0.9994 0.9985
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1927 0.1738 0.1633 0.1827
## Detection Prevalence 0.2851 0.1932 0.1745 0.1645 0.1827
## Balanced Accuracy 0.9995 0.9975 0.9981 0.9977 0.9968
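The expected out-of-sample error follows directly from this: one minus the held-out accuracy, about 1 - 0.9969 = 0.0031, consistent with the OOB error estimate reported by the final model below.

1 - as.numeric(confusionMatrix(pred_rf, tester$classe)$overall["Accuracy"])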
Let’s look at our model’s properties:
print(tuned_model$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 19
##
## OOB estimate of error rate: 0.23%
## Confusion matrix:
## A B C D E class.error
## A 3905 0 0 0 1 0.0002560164
## B 6 2649 3 0 0 0.0033860045
## C 0 5 2391 0 0 0.0020868114
## D 0 0 11 2239 2 0.0057726465
## E 0 0 0 4 2521 0.0015841584
Now let’s plot our tuned model:
plot(tuned_model$finalModel, main="Tuned Random Forest Model")
varImpPlot(tuned_model$finalModel, main = "Variable Importance For Tuned Random Forest Model")
At the start of this project we downloaded the test data. Let’s make our final predictions with the tuned random forest model:
final_pred <- predict(tuned_model, testing)
final_pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
write.csv(data.frame(prediction = final_pred), paste(getwd(), "/predictions.csv", sep = ""), row.names = FALSE)
We have predicted the movements from the accelerometer records using the random forest method with high accuracy. Although we didn’t tune the Gradient Boosting Machine, its untuned performance suggests this data could be modelled well by gbm too.
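For completeness, here is a sketch of how the same grid-search machinery could be applied to gbm. The grid covers gbm’s four caret tuning parameters, but the values below are illustrative guesses rather than tuned choices, and training over this grid on the 70% train set would take considerable time:

# illustrative tuning grid for gbm (values are guesses, not recommendations)
gbm_grid <- expand.grid(n.trees = c(100, 200, 300),
                        interaction.depth = c(3, 5, 7),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
set.seed(1252)
tuned_gbm <- train(classe ~ ., method = "gbm", data = train,
                   tuneGrid = gbm_grid, trControl = control, verbose = FALSE)
confusionMatrix(predict(tuned_gbm, tester), tester$classe)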