Introduction

This study builds a machine learning model to predict human activity using data from wearable technology sensors, taken from the HAR dataset:

 http://groupware.les.inf.puc-rio.br/har.

Files:

 https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
 https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Data Preprocessing

I have created some helper functions to make processing the data easier. The steps are as follows:

+ Download the files
+ Remove unneeded columns
+ Handle missing values
+ Remove highly correlated columns

Downloading Data

Dimensions of the two original CSV files (csv1 and csv2):

## csv1 (first file)  - data set dimension : 19622, 160
## csv2 (second file) - data set dimension : 20, 160
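
A minimal sketch of the download step (the destination file names and the `na.strings` choices are assumptions, not necessarily the author's exact code):

```r
url1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# Download both files and read them in, treating blanks and "#DIV/0!" as NA
download.file(url1, "pml-training.csv")
download.file(url2, "pml-testing.csv")
csv1 <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
csv2 <- read.csv("pml-testing.csv",  na.strings = c("NA", "", "#DIV/0!"))

dim(csv1); dim(csv2)  # 19622 x 160 and 20 x 160
```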

Removing Unneeded Columns

Columns 1 to 7 do not seem useful for prediction, so I will remove them:

+ The row number (X) and user name
+ Timestamps
+ Window indicators
##   X user_name raw_timestamp_part_1 raw_timestamp_part_2   cvtd_timestamp
## 1 1  carlitos           1323084231               788290 05/12/2011 11:23
## 2 2  carlitos           1323084231               808298 05/12/2011 11:23
## 3 3  carlitos           1323084231               820366 05/12/2011 11:23
## 4 4  carlitos           1323084232               120339 05/12/2011 11:23
## 5 5  carlitos           1323084232               196328 05/12/2011 11:23
## 6 6  carlitos           1323084232               304277 05/12/2011 11:23
##   new_window num_window
## 1         no         11
## 2         no         11
## 3         no         11
## 4         no         12
## 5         no         12
## 6         no         12
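
A minimal sketch of that removal (csv1 and csv2 are the data frames read in above):

```r
# Drop the first seven metadata columns from both data sets
csv1 <- csv1[, -(1:7)]
csv2 <- csv2[, -(1:7)]
```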

Do both data files have the same columns?

## Columns on csv1 not included on csv2 : classe
## Columns on csv2 not included on csv1 : problem_id
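
This check can be sketched with base R's setdiff:

```r
# Columns present in one file but not in the other
setdiff(names(csv1), names(csv2))  # "classe"
setdiff(names(csv2), names(csv1))  # "problem_id"
```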

Handling Missing Values

Let’s plot the missing values for csv1.

Now the missing values for csv2.

Some columns in the csv2 file have every single row missing (100% NA).
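
A short sketch of how those percentages can be computed (a simple base-R approach, not necessarily the author's helper function):

```r
# Percentage of missing rows per column in csv2
na_share <- colMeans(is.na(csv2)) * 100
table(na_share)  # how many columns fall at each missing percentage
```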

Let’s confirm this with the table below:

##   DataSet Num_Cols Num_Missing_Rows Total_Rows %_Missing_Rows %_Cols_with_Missing_Rows
## 1    csv1       86                0      19622           0.00                    56.21
## 2    csv1       67            19216      19622          97.93                    43.79
## 3    csv2       53                0         20           0.00                    34.64
## 4    csv2      100               20         20         100.00                    65.36

That confirms it. We won’t be able to impute these columns or obtain anything useful from them, so we will remove them.
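
A minimal sketch of the removal, assuming we drop every column that contains an NA in either file (which covers both the 97.93% and the 100% groups above):

```r
# Union of columns with any missing value in csv1 or csv2
na_cols <- union(names(csv1)[colSums(is.na(csv1)) > 0],
                 names(csv2)[colSums(is.na(csv2)) > 0])
csv1 <- csv1[, !(names(csv1) %in% na_cols)]
csv2 <- csv2[, !(names(csv2) %in% na_cols)]
```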

Now everything is clean:

##   DataSet Num_Cols Num_Missing_Rows Total_Rows %_Missing_Rows %_Cols_with_Missing_Rows
## 1    csv1       53                0      19622              0                      100
## 2    csv2       53                0         20              0                      100

Handling Highly Correlated Columns

We could still have highly correlated columns. Keeping them adds little information and can reduce the effectiveness of some methods such as LDA. We will find columns with pairwise correlation above 0.90 and remove the redundant ones.

## [1] "Found 11 highly correlated columns above defined threshold [x>0.9]. Keeping the one from the list with the lowest correlation"

Partitioning the Data in csv1

##      [,1]       [,2]    [,3]
## [1,] "training" "13737" "53"
## [2,] "testing"  "5885"  "53"

Creating 4 different Training Models & Predictions

Let’s train four different models and then decide which one works best.

We will use the following methods, with a training sketch after the list:

+ Random Forest (rf)
+ Stochastic Gradient Boosting (gbm)
+ Linear Discriminant Analysis (lda)
+ Bagged CART (treebag)
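
A hedged sketch of the model fitting with caret::train; the resampling settings here are assumptions, not necessarily the ones behind the results below:

```r
ctrl <- trainControl(method = "cv", number = 5)  # assumed 5-fold cross-validation

fit_rf      <- train(classe ~ ., data = training, method = "rf",      trControl = ctrl)
fit_gbm     <- train(classe ~ ., data = training, method = "gbm",     trControl = ctrl, verbose = FALSE)
fit_lda     <- train(classe ~ ., data = training, method = "lda",     trControl = ctrl)
fit_treebag <- train(classe ~ ., data = training, method = "treebag", trControl = ctrl)
```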

Prediction Results and Out-of-Sample Error

Let’s compare their results, displaying accuracy and out-of-sample error:

##    Method Accuracy Out_of_Sample_Error_%
## 1      rf   0.9946                  0.54
## 2     gbm   0.9601                  3.99
## 3     lda   0.6967                 30.33
## 4 treebag   0.9876                  1.24
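
The accuracy column comes from predicting on the held-out testing set, and the out-of-sample error is simply 1 minus accuracy, expressed as a percentage. A sketch for one model, with the object names carried over from the earlier sketches (the other three follow the same pattern):

```r
pred_rf <- predict(fit_rf, newdata = testing)
cm_rf   <- confusionMatrix(pred_rf, testing$classe)
acc_rf  <- as.numeric(cm_rf$overall["Accuracy"])
oos_rf  <- round((1 - acc_rf) * 100, 2)  # out-of-sample error in %
```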

Using Random Forest for final Prediction

Based on the results in the table above, the random forest model is very accurate. We could build a combined model from those four, but random forest alone is already very good. Treebag performs well too, but random forest internally estimates its error rate (via out-of-bag sampling, which behaves like cross-validation), and since its performance is already the best, it gives us more confidence.

Below is the random forest model’s prediction for the 20 new observations in the csv2 file (the exercise).
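
A minimal sketch of that final step (fit_rf is the assumed name of the random forest fit from the earlier sketches):

```r
# Predict the 20 quiz cases with the fitted random forest model
final_pred <- predict(fit_rf, newdata = csv2)
final_pred
```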

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E