Abstract: In this project, we build a model with which a smartphone can detect its owner’s activity. The dataset was collected from 30 people, each performing 6 different activities while wearing a Samsung Galaxy S II on their waist. Using the smartphone’s embedded sensors (the accelerometer and the gyroscope), the user’s linear acceleration and angular velocity were measured along three axes. We use the sensor data to predict the user’s activity. After preparing the datasets and trying several machine learning methods, we find that the random forest method performed best, predicting the user’s activity with an accuracy of 93 percent.

Overall Information

Project’s Website: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Type: Classification, Clustering

Used Machine Learning Methods: Random Forest, KNN, Neural Networks (mlp, mlpWeightDecay, pcaNNet)

Highest Achieved Accuracy: 93% (Random Forest)

Goal: In this project we try to predict human activity (1-Walking, 2-Walking upstairs, 3-Walking downstairs, 4-Sitting, 5-Standing or 6-Laying) from the smartphone’s sensor data, so that the smartphone can detect what its user is doing at the moment.

The Code:

library(randomForest)
library(gmodels)
library(neuralnet)
library(RSNNS)
library(Rcpp)
library(lattice)
library(ggplot2)
library(caret)
set.seed(123)

The first step is to build a single data set out of the files available on the site. There are two files for the training data and its labels, and two files for the test data and its labels. The following code combines them and converts the label (activity type) column into a factor for the machine learning (ML) steps that follow.

setwd("E:/DATA SCIENCE/CV/Human Activity Recognition Using Smartphones Data Set/UCI HAR Dataset/UCI HAR Dataset/")
# Feature matrices and activity labels (the label files use a lower-case "y" in the archive)
train_data<-read.table("train/X_train.txt")
train_lables<-read.table("train/y_train.txt")

test_data<-read.table("test/X_test.txt")
test_lables<-read.table("test/y_test.txt")

col_names <- readLines("features.txt")
colnames(train_data)<-make.names(col_names)
colnames(test_data)<-make.names(col_names)
colnames(train_lables)<-"lable"
colnames(test_lables)<-"lable"

train_final<-cbind(train_lables,train_data)
test_final<-cbind(test_lables,test_data)
final_data<-rbind(train_final,test_final)
final_data$lable<-factor(final_data$lable)
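The numeric codes can also be given readable names. The sketch below is not part of the original analysis: it assumes the six activity names listed in the Goal section, and `toy_labels` is a made-up vector standing in for final_data$lable.

```r
# Activity names in code order (1..6), taken from the Goal section
activity_names <- c("Walking", "Walking upstairs", "Walking downstairs",
                    "Sitting", "Standing", "Laying")

# Hypothetical label codes standing in for final_data$lable
toy_labels <- c(5, 5, 4, 1, 6, 2, 3, 1)

# levels fixes the code order, labels supplies the readable names
toy_factor <- factor(toy_labels, levels = 1:6, labels = activity_names)

# Count of samples per named activity
table(toy_factor)
```

Keeping the plain numeric codes, as the code above does, works equally well for the models that follow; the names only make the tables easier to read.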

Since the training dataset (7352 obs. of 562 variables) and the test dataset (2947 obs. of 562 variables) are large for a normal PC to process, I pick a random subset of the training dataset and run the machine learning methods on it. This way I can see which method works best and is suitable for a final full run. So I first create a vector of 500 random numbers from 1 to 7352, which is used to pick random rows from the training dataset.
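The sampling code itself is not shown. A minimal sketch of this step, assuming the index vector is the `ttt` used in the model-fitting calls below and that the rows are drawn without replacement (the original may have sampled differently):

```r
set.seed(123)      # same seed as at the top of the script
n_train <- 7352    # number of training rows stated in the text

# Draw 500 distinct row indices from the training rows
# (assumption: sampling without replacement)
ttt <- sample(seq_len(n_train), 500)

length(ttt)        # number of sampled rows
range(ttt)         # smallest and largest sampled index
```

Because the training rows come first in final_data, final_data[ttt,] then selects training rows only.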

The first method I used is “mlp”, short for “Multi-Layer Perceptron”. It is an ML method based on neural networks.

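# Fit a multi-layer perceptron on the 500-row training subset
model_mlp<-caret::train(lable~.,data=final_data[ttt,],method="mlp")
# Predict on the 2947 test rows (rows 7353 onward in final_data)
pre_mlp<-predict(model_mlp,final_data[-(1:7352),])
# Rows: predicted labels; columns: true labels
table(pre_mlp,final_data[-(1:7352),1])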
##    
## p4    1   2   3   4   5   6
##   1 481 221  59   0   0   0
##   2   5 245  33   2   0   0
##   3  10   5 328   0   0   0
##   4   0   0   0  80   0   0
##   5   0   0   0 409 532   1
##   6   0   0   0   0   0 536

The table above shows the predicted results: rows are predicted labels and columns are true labels. The diagonal entries count correct predictions and the off-diagonal entries count incorrect ones. So the accuracy of the model is calculated as follows:

(481+245+328+80+532+536)/2947
## [1] 0.7472005
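The ratio above is the share of correct predictions (overall accuracy), and it can be read off a confusion matrix without typing the diagonal by hand. A small sketch, with the matrix copied from the mlp output above; `accuracy` is a helper name chosen here for illustration, not a function from any package:

```r
# Confusion matrix copied from the mlp output
# (rows: predicted label, columns: true label)
cm_mlp <- matrix(c(481, 221,  59,   0,   0,   0,
                     5, 245,  33,   2,   0,   0,
                    10,   5, 328,   0,   0,   0,
                     0,   0,   0,  80,   0,   0,
                     0,   0,   0, 409, 532,   1,
                     0,   0,   0,   0,   0, 536),
                 nrow = 6, byrow = TRUE,
                 dimnames = list(predicted = as.character(1:6),
                                 actual    = as.character(1:6)))

# Hypothetical helper: diagonal (correct) count divided by the total count
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

accuracy(cm_mlp)
```

The same helper should also accept the two-way result of table() directly, since diag() and sum() work on it as on a matrix.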

After fitting the “mlp” model on the training subset, I computed predictions for the test dataset. As the results show, this method predicted 75 percent of the activities correctly. However, the fourth activity (sitting) has very few correct predictions: only 80 of its 491 samples are classified correctly, and 409 are labelled as standing. This suggests that the model did not converge well on this dataset, so we should try another model.
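The weakness on the fourth activity is easier to see with per-class figures than with the overall ratio. The sketch below is an addition, not part of the original analysis; it uses the standard per-class definitions of recall and precision, which differ from the overall share of correct predictions reported in the text:

```r
# Confusion matrix copied from the mlp output
# (rows: predicted label, columns: true label)
cm_mlp <- matrix(c(481, 221,  59,   0,   0,   0,
                     5, 245,  33,   2,   0,   0,
                    10,   5, 328,   0,   0,   0,
                     0,   0,   0,  80,   0,   0,
                     0,   0,   0, 409, 532,   1,
                     0,   0,   0,   0,   0, 536),
                 nrow = 6, byrow = TRUE,
                 dimnames = list(predicted = as.character(1:6),
                                 actual    = as.character(1:6)))

# Recall per activity: correct predictions / true samples (column sums)
recall <- diag(cm_mlp) / colSums(cm_mlp)

# Precision per activity: correct predictions / predicted samples (row sums)
precision <- diag(cm_mlp) / rowSums(cm_mlp)

round(rbind(recall, precision), 3)
```

For activity 4, recall is 80/491 while precision is 80/80: the model rarely predicts sitting, but is right whenever it does.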

The next model I used is “mlpWeightDecay”, another variation of the mlp method. Let’s see whether this one converges and how well it predicts:

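# Fit an mlp with weight decay on the 500-row training subset
model_mlpw<-caret::train(lable~.,data=final_data[ttt,],method="mlpWeightDecay")
# Predict on the 2947 test rows
pre_mlpw<-predict(model_mlpw,final_data[-(1:7352),])
table(pre_mlpw,final_data[-(1:7352),1])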
##    
## p4    1   2   3   4   5   6
##   1 456   7  12   0   1   0
##   2  39 461  61   3   1   0
##   3   1   3 347   0   0   0
##   4   0   0   0 467 153   0
##   5   0   0   0  20 377   0
##   6   0   0   0   1   0 537

The accuracy of the model is calculated as follows:

(456+461+347+467+377+537)/2947
## [1] 0.8975229

As the results above show, the predictions improved considerably (from 75 to 90 percent), and no activity has a low number of correct predictions any more.

Let’s try another neural network method. This time I used “pcaNNet”. This model is known as “Neural Networks with Feature Extraction”:

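# Fit a neural network with PCA feature extraction on the training subset
model_pca<-caret::train(lable~.,data=final_data[ttt,],method="pcaNNet")
# Predict on the 2947 test rows
pre_pca<-predict(model_pca,final_data[-(1:7352),])
table(pre_pca,final_data[-(1:7352),1])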
##    
## p41   1   2   3   4   5   6
##   1 394  17  11   0   0   0
##   2  14 398  37   0   1   0
##   3  88  55 372   0   0   0
##   4   0   0   0 397  48  13
##   5   0   1   0  91 483  14
##   6   0   0   0   3   0 510

The accuracy of the model is calculated as follows:

(394+398+372+397+483+510)/2947
## [1] 0.866644

This model is not as good as the previous one (87 versus 90 percent). We keep testing other well-known models to reach an acceptable result.

Next, I used the Random Forest method:

# Fit a random forest on the 500-row training subset
model_rf5<-randomForest(lable~.,final_data[ttt,])
# Predict on the 2947 test rows
pre_rf5<-predict(model_rf5,final_data[-(1:7352),],type = "response")
table(pre_rf5,final_data[-(1:7352),1])
##    
## p2    1   2   3   4   5   6
##   1 486  45  68   0   0   0
##   2   0 418  50   0   0   0
##   3  10   8 302   0   0   0
##   4   0   0   0 411  38   0
##   5   0   0   0  80 494   0
##   6   0   0   0   0   0 537

The accuracy of the model is calculated as follows:

(486+418+302+411+494+537)/2947
## [1] 0.8985409

As can be seen from the above results, the accuracy of this model (89.9 percent) is the highest so far, narrowly ahead of mlpWeightDecay (89.8 percent). Let’s try the KNN (k-nearest neighbors) method to finish our testing procedure.

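# Fit k-nearest neighbors on the 500-row training subset
model_knn<-caret::train(lable~.,data=final_data[ttt,],method="knn")
# Predict on the 2947 test rows
pre_knn<-predict(model_knn,final_data[-(1:7352),])
table(pre_knn,final_data[-(1:7352),1])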
##    
## p5    1   2   3   4   5   6
##   1 472  61  60   0   0   0
##   2   6 400  69   3   0   1
##   3  18  10 291   0   0   0
##   4   0   0   0 365  51   3
##   5   0   0   0 122 481   6
##   6   0   0   0   1   0 527

The accuracy of the model is calculated as follows:

(472+400+291+365+481+527)/2947
## [1] 0.8605361

As can be seen, this method’s accuracy (86 percent) is lower than that of mlpWeightDecay, pcaNNet and random forest.
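The five models trained on the subset can be compared side by side. A small sketch that uses only the diagonal sums already computed above; the vector names are labels chosen here for illustration:

```r
n_test <- 2947  # number of test rows

# Correct predictions per model, copied from the diagonal sums above
correct <- c(mlp            = 481 + 245 + 328 +  80 + 532 + 536,
             mlpWeightDecay = 456 + 461 + 347 + 467 + 377 + 537,
             pcaNNet        = 394 + 398 + 372 + 397 + 483 + 510,
             randomForest   = 486 + 418 + 302 + 411 + 494 + 537,
             knn            = 472 + 400 + 291 + 365 + 481 + 527)

# Share of correct predictions, best model first
model_accuracy <- sort(correct / n_test, decreasing = TRUE)

round(model_accuracy, 4)
```

Sorting makes the ranking explicit: random forest and mlpWeightDecay are nearly tied at the top, and mlp is last.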

Finally, I fitted the random forest model (the best-performing one) on the whole training dataset instead of just a subset of it:

# Fit the random forest on all 7352 training rows
model_rfF<-randomForest(lable~.,final_data[1:7352,])
# Predict on the 2947 test rows
pre_rfF<-predict(model_rfF,final_data[-(1:7352),],type = "response")
table(pre_rfF,final_data[-(1:7352),1])
##    
## p3    1   2   3   4   5   6
##   1 480  33  18   0   0   0
##   2   7 432  43   0   0   0
##   3   9   6 359   0   0   0
##   4   0   0   0 435  44   0
##   5   0   0   0  56 488   0
##   6   0   0   0   0   0 537

The accuracy of this model is calculated as follows:

(480+432+359+435+488+537)/2947
## [1] 0.9267051

This is my final result. As can be seen, this model has the highest accuracy (93%).