Abstract: In this project, we build a model by which a smartphone can detect its owner's activity accurately. For the dataset, 30 volunteers performed 6 different activities, each wearing a Samsung Galaxy S II on the waist. Using the smartphone's embedded sensors (the accelerometer and the gyroscope), the 3-axial linear acceleration and 3-axial angular velocity of each user were recorded. We use these sensor data to predict the user's activity. After preparing the datasets and applying several machine learning methods, we find that the random forest method performed best, predicting the user's activity with an accuracy of 93 percent.
Project’s Website: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Type: Classification, Clustering
Used Machine Learning Methods: Random Forest, KNN, Neural Networks (mlp, mlpWeightDecay, pcaNNet)
Highest Achieved Accuracy: 93% (Random Forest)
Goal: In this project we will try to predict human activity (1-Walking, 2-Walking upstairs, 3-Walking downstairs, 4-Sitting, 5-Standing or 6-Laying) using the smartphone's sensors. In other words, with the methods below, the smartphone can detect what its user is doing at any given moment.
The Code:
library(randomForest)  # random forests
library(gmodels)       # cross-tabulation utilities
library(neuralnet)     # neural networks
library(RSNNS)         # provides the mlp and mlpWeightDecay models
library(Rcpp)          # R/C++ interface
library(lattice)       # required by caret
library(ggplot2)       # required by caret
library(caret)         # unified training/prediction interface for many models
set.seed(123)
The first thing to do is to create a uniform dataset out of the files available on the site. We have two files for the training data and its corresponding labels, and two files for the test data and its corresponding labels. The following code combines these datasets and converts the label (type of activity) column into a factor for the machine learning (ML) procedures that follow.
setwd("E:/DATA SCIENCE/CV/Human Activity Recognition Using Smartphones Data Set/UCI HAR Dataset/UCI HAR Dataset/")
train_data<-read.table("train/X_train.txt")
train_lables<-read.table("train/Y_train.txt")
test_data<-read.table("test/X_test.txt")
test_lables<-read.table("test/Y_test.txt")
col_names <- readLines("features.txt")
colnames(train_data)<-make.names(col_names)
colnames(test_data)<-make.names(col_names)
colnames(train_lables)<-"lable"
colnames(test_lables)<-"lable"
train_final<-cbind(train_lables,train_data)
test_final<-cbind(test_lables,test_data)
final_data<-rbind(train_final,test_final)
final_data$lable<-factor(final_data$lable)
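For reference, the numeric label codes correspond to the activities listed in the goal above. As a small optional sketch (the names activity_names and final_named are mine, and the tables below keep the numeric codes), readable names could be attached like this:
# Optional: a copy of the data with readable activity names on the label factor
activity_names <- c("Walking", "Walking upstairs", "Walking downstairs",
                    "Sitting", "Standing", "Laying")
final_named <- final_data
final_named$lable <- factor(final_named$lable, levels = 1:6, labels = activity_names)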
Since the training dataset (7352 obs. of 562 variables) and the test dataset (2947 obs. of 562 variables) are large for a normal PC to process, I will pick a random subset of the training dataset, fit each machine learning method on it, and evaluate the fitted model on the 2947 test observations. This way I can see which method works best and is suitable for a final, full run. So, to begin, I will create a vector of 500 random numbers between 1 and 7352; these random numbers will be used to pick random samples from the training part of the combined dataset.
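Following that description, the index ttt used in the code below can be created as:
ttt <- sample(1:7352, 500)  # 500 random row indices from the training portion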
The first method I used is "mlp", which is short for "Multi-Layer Perceptron". It is an ML method based on neural networks.
model_mlp<-caret::train(lable~.,data=final_data[ttt,],method="mlp")
pre_mlp<-predict(model_mlp,final_data[-(1:7352),])
table(pre_mlp,final_data[-(1:7352),1])
##
## pre_mlp 1 2 3 4 5 6
## 1 481 221 59 0 0 0
## 2 5 245 33 2 0 0
## 3 10 5 328 0 0 0
## 4 0 0 0 80 0 0
## 5 0 0 0 409 532 1
## 6 0 0 0 0 0 536
The above table shows the predicted results: the diagonal entries are the correct predictions and the off-diagonal entries are the misclassifications. The overall accuracy of the model is therefore calculated as follows:
(481+245+328+80+532+536)/2947
## [1] 0.7472005
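The same number can be obtained directly from the confusion matrix instead of summing the diagonal by hand; a minimal sketch (the object name tab_mlp is mine):
# overall accuracy = sum of the diagonal / total number of test observations
tab_mlp <- table(pre_mlp, final_data[-(1:7352), 1])
sum(diag(tab_mlp)) / sum(tab_mlp)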
After fitting the "mlp" model on the training subset, I computed predictions for the test dataset. As the results show, this method predicted about 75 percent of the activities correctly. However, the fourth activity (Sitting) has very few correct predictions, with most of its observations classified as Standing (the fifth activity), so this model is not adequate and we should try another one.
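To inspect this class-by-class behaviour without reading the table manually, caret's confusionMatrix function reports per-class statistics (sensitivity, specificity, and so on) along with the overall accuracy; a sketch:
# per-class statistics plus overall accuracy for the mlp predictions
caret::confusionMatrix(pre_mlp, final_data[-(1:7352), 1])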
The next model I used is "mlpWeightDecay", a variant of the mlp method that adds weight-decay regularization. Let's see how well it predicts the correct activities:
model_mlpw<-caret::train(lable~.,data=final_data[ttt,],method="mlpWeightDecay")
pre_mlpw<-predict(model_mlpw,final_data[-(1:7352),])
table(pre_mlpw,final_data[-(1:7352),1])
##
## pre_mlpw 1 2 3 4 5 6
## 1 456 7 12 0 1 0
## 2 39 461 61 3 1 0
## 3 1 3 347 0 0 0
## 4 0 0 0 467 153 0
## 5 0 0 0 20 377 0
## 6 0 0 0 1 0 537
The accuracy of the model is calculated as follows:
(456+461+347+467+377+537)/2947
## [1] 0.8975229
As can be seen from the above results, the predictions improved considerably, and no activity has a very low number of correct predictions anymore.
Let's try another neural network model. This time I used "pcaNNet", which caret lists as "Neural Networks with Feature Extraction":
model_pca<-caret::train(lable~.,data=final_data[ttt,],method="pcaNNet")
pre_pca<-predict(model_pca,final_data[-(1:7352),])
table(pre_pca,final_data[-(1:7352),1])
##
## pre_pca 1 2 3 4 5 6
## 1 394 17 11 0 0 0
## 2 14 398 37 0 1 0
## 3 88 55 372 0 0 0
## 4 0 0 0 397 48 13
## 5 0 1 0 91 483 14
## 6 0 0 0 3 0 510
The accuracy of the model is calculated as follows:
(394+398+372+397+483+510)/2947
## [1] 0.866644
This model is not as good as the previous one, so we keep testing other well-known models until we reach an acceptable result.
Next, I used the Random Forest method:
model_rf5<-randomForest(lable~.,final_data[ttt,])
pre_rf5<-predict(model_rf5,final_data[-(1:7352),],type = "response")
table(pre_rf5,final_data[-(1:7352),1])
##
## pre_rf5 1 2 3 4 5 6
## 1 486 45 68 0 0 0
## 2 0 418 50 0 0 0
## 3 10 8 302 0 0 0
## 4 0 0 0 411 38 0
## 5 0 0 0 80 494 0
## 6 0 0 0 0 0 537
The accuracy of the model is calculated as follows:
(486+418+302+411+494+537)/2947
## [1] 0.8985409
As can be seen from the above results, the accuracy of this model is higher than that of all the other models. Let's try the KNN (k-nearest neighbors) method and finish up our testing procedure.
model_knn<-train(lable~.,data=final_data[ttt,],method="knn")
pre_knn<-predict(model_knn,final_data[-(1:7352),])
table(pre_knn,final_data[-(1:7352),1])
##
## pre_knn 1 2 3 4 5 6
## 1 472 61 60 0 0 0
## 2 6 400 69 3 0 1
## 3 18 10 291 0 0 0
## 4 0 0 0 365 51 3
## 5 0 0 0 122 481 6
## 6 0 0 0 1 0 527
The accuracy of the model is calculated as follows:
(472+400+291+365+481+527)/2947
## [1] 0.8605361
As can be seen, this method's accuracy is lower than that of both the neural network and random forest models.
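Before the final run, the subset accuracies reported above can be gathered into a single named vector to compare the candidates at a glance (values rounded to three digits; the name subset_acc is mine):
# accuracy of each method when trained on the 500-row subset (from the tables above)
subset_acc <- c(mlp = 0.747, mlpWeightDecay = 0.898, pcaNNet = 0.867,
                randomForest = 0.899, knn = 0.861)
sort(subset_acc, decreasing = TRUE)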
Finally, I fitted the random forest model (the winning model) on the whole training dataset instead of just a subset of it:
model_rfF<-randomForest(lable~.,final_data[1:7352,])
pre_rfF<-predict(model_rfF,final_data[-(1:7352),],type = "response")
table(pre_rfF,final_data[-(1:7352),1])
##
## pre_rfF 1 2 3 4 5 6
## 1 480 33 18 0 0 0
## 2 7 432 43 0 0 0
## 3 9 6 359 0 0 0
## 4 0 0 0 435 44 0
## 5 0 0 0 56 488 0
## 6 0 0 0 0 0 537
The accuracy of this model is calculated as follows:
(480+432+359+435+488+537)/2947
## [1] 0.9267051
This is my final result. As can be seen, this model achieves the highest accuracy, about 93 percent.