This study builds a machine learning model to predict human activity from wearable sensor data in the Human Activity Recognition (HAR) dataset:
http://groupware.les.inf.puc-rio.br/har.
Files:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
I wrote a few helper functions to make the data easier to process. The steps are as follows:
+ Download the files
+ Remove unneeded columns
+ Handle missing values
+ Remove highly correlated columns
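The helper functions themselves are not shown, so the following is only a minimal sketch of what the last step in the list could look like. It is written in plain Python with made-up toy data; the function names and the 0.9 cutoff are assumptions for illustration, not values taken from the original analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def drop_highly_correlated(columns, cutoff=0.9):
    """Keep a column only if |r| <= cutoff against every column already kept.

    `columns` maps column name -> list of numbers.
    The cutoff of 0.9 is an illustrative assumption.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept


# Toy data (made up): accel_y is an exact multiple of accel_x.
toy = {
    "accel_x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "accel_y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "gyro_z": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_highly_correlated(toy))
```

This greedy version keeps the first column of each correlated pair, so the result depends on column order. Other strategies (for example, dropping the column with the larger mean correlation) are equally plausible.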
Dimensions of the two original CSV files (t1 and t2, shown below as csv1 and csv2):
## csv1 (first file) - data set dimension : 19622, 160
## csv2 (second file) - data set dimension : 20, 160
Columns 1 to 7 (row index, user name, timestamps, and window markers) describe the recording session rather than the movement itself, so I will remove them:
##   X user_name raw_timestamp_part_1 raw_timestamp_part_2   cvtd_timestamp new_window num_window
## 1 1  carlitos           1323084231               788290 05/12/2011 11:23         no         11
## 2 2  carlitos           1323084231               808298 05/12/2011 11:23         no         11
## 3 3  carlitos           1323084231               820366 05/12/2011 11:23         no         11
## 4 4  carlitos           1323084232               120339 05/12/2011 11:23         no         12
## 5 5  carlitos           1323084232               196328 05/12/2011 11:23         no         12
## 6 6  carlitos           1323084232               304277 05/12/2011 11:23         no         12
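Dropping those seven leading columns is a simple positional slice. The sketch below uses Python's `csv` module on a tiny in-memory extract. The two sensor columns and their values are placeholders added for illustration, and `N_ID_COLS` is a hypothetical name.

```python
import csv
import io

# A tiny illustrative extract: the seven leading columns shown above,
# followed by two placeholder sensor columns and the outcome column.
sample = io.StringIO(
    '"X","user_name","raw_timestamp_part_1","raw_timestamp_part_2",'
    '"cvtd_timestamp","new_window","num_window",'
    '"roll_belt","pitch_belt","classe"\n'
    '"1","carlitos",1323084231,788290,"05/12/2011 11:23","no",11,1.41,8.07,"A"\n'
    '"2","carlitos",1323084231,808298,"05/12/2011 11:23","no",11,1.41,8.07,"A"\n'
)

rows = list(csv.reader(sample))
header, data = rows[0], rows[1:]

N_ID_COLS = 7  # columns 1 to 7 are dropped
trimmed_header = header[N_ID_COLS:]
trimmed_data = [row[N_ID_COLS:] for row in data]

print(trimmed_header)
print(trimmed_data)
```

The same slice has to be applied to both files so that their remaining columns stay aligned.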
Do both data files have the same columns?
## Columns on csv1 not included on csv2 : classe
## Columns on csv2 not included on csv1 : problem_id
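A comparison like the one above boils down to two set differences on the column names. Here is a minimal sketch, with shortened, illustrative column lists and a hypothetical `column_differences` helper:

```python
def column_differences(cols_a, cols_b):
    """Return (names only in cols_a, names only in cols_b), keeping order."""
    set_a, set_b = set(cols_a), set(cols_b)
    only_a = [c for c in cols_a if c not in set_b]
    only_b = [c for c in cols_b if c not in set_a]
    return only_a, only_b


# Shortened, illustrative column lists (the real files have 160 columns each).
csv1_cols = ["roll_belt", "pitch_belt", "classe"]
csv2_cols = ["roll_belt", "pitch_belt", "problem_id"]

only_1, only_2 = column_differences(csv1_cols, csv2_cols)
print("Columns on csv1 not included on csv2 :", ", ".join(only_1))
print("Columns on csv2 not included on csv1 :", ", ".join(only_2))
```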
Let’s plot the missing values for csv1
Now the missing values for csv2
Some columns in the csv2 file are missing in every single row (100%).
Let’s display this in a table to confirm it:
##   DataSet Num_Cols Num_Missing_Rows Total_Rows %_Missing_Rows %_Cols_with_Missing_Rows
## 1    csv1       86                0      19622           0.00                    56.21
## 2    csv1       67            19216      19622          97.93                    43.79
## 3    csv2       53                0         20           0.00                    34.64
## 4    csv2      100               20         20         100.00                    65.36
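A summary of this shape can be produced by counting the missing rows per column and then grouping columns that share the same count. The sketch below is one way to do it. Which markers count as missing (here the empty string and `NA`) is an assumption, and the toy values are made up.

```python
from collections import Counter

MISSING = {"", "NA"}  # assumption: markers treated as missing


def missing_summary(columns):
    """Group columns by how many of their rows are missing.

    `columns` maps column name -> list of raw string values.
    Returns tuples of (num_cols, num_missing_rows, total_rows,
    pct_missing_rows, pct_of_all_columns), sorted by num_missing_rows.
    """
    total_rows = len(next(iter(columns.values())))
    missing_per_col = {
        name: sum(value in MISSING for value in values)
        for name, values in columns.items()
    }
    groups = Counter(missing_per_col.values())
    n_cols = len(columns)
    return [
        (
            count,
            n_missing,
            total_rows,
            round(100 * n_missing / total_rows, 2),
            round(100 * count / n_cols, 2),
        )
        for n_missing, count in sorted(groups.items())
    ]


# Toy data (made up): two complete columns and one fully missing column.
toy = {
    "roll_belt": ["1.41", "1.41", "1.42", "1.48"],
    "pitch_belt": ["8.07", "8.07", "8.07", "8.05"],
    "kurtosis_roll_belt": ["NA", "NA", "NA", "NA"],
}
for row in missing_summary(toy):
    print(row)
```

Read this way, the last column of the table is the share of all columns that fall into each group: 86 of the 153 remaining columns is 56.21%.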
That is the case: 100 columns in csv2 are missing in every row, so nothing can be imputed or learned from them. We will remove them from both files.
Both files are now free of missing values:
##   DataSet Num_Cols Num_Missing_Rows Total_Rows %_Missing_Rows %_Cols_with_Missing_Rows
## 1    csv1       53                0      19622              0                      100
## 2    csv2       53                0         20              0                      100
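The column removal described above could be sketched as follows: find the columns that are entirely missing in csv2, then drop them by name from both files. The helper names and toy values are hypothetical, and the set of missing markers is again an assumption.

```python
MISSING = {"", "NA"}  # assumption: markers treated as missing


def fully_missing(columns):
    """Names of columns in which every value is missing."""
    return {
        name
        for name, values in columns.items()
        if all(value in MISSING for value in values)
    }


def drop_columns(columns, to_drop):
    """Copy of `columns` without the names in `to_drop`."""
    return {name: values for name, values in columns.items() if name not in to_drop}


# Toy data (made up). kurtosis_roll_belt is empty in csv2 only.
csv1 = {
    "roll_belt": ["1.41", "1.42"],
    "kurtosis_roll_belt": ["NA", "-1.2"],
    "classe": ["A", "B"],
}
csv2 = {
    "roll_belt": ["123"],
    "kurtosis_roll_belt": ["NA"],
    "problem_id": ["1"],
}

to_drop = fully_missing(csv2)
csv1_clean = drop_columns(csv1, to_drop)
csv2_clean = drop_columns(csv2, to_drop)

print(sorted(to_drop))
print(list(csv1_clean), list(csv2_clean))
```

Deciding on csv2 and applying the result to csv1 keeps both files on the same predictors, which matters because a model cannot use a feature that is always absent at prediction time.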
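The csv1 data is then split into training and testing sets of roughly 70% and 30% of its 19622 rows (name, rows, columns):
## [,1] [,2] [,3]
## [1,] "training" "13737" "53"
## [2,] "testing" "5885" "53"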
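For illustration, a training/testing split of this kind can be sketched with a seeded shuffle. The 0.7 fraction is inferred from the row counts above (13737 of 19622), and the seed and function name are arbitrary assumptions. This plain random split is not the author's method and does not reproduce the exact 13737 / 5885 counts.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle row indices reproducibly and split them into train and test."""
    rng = random.Random(seed)  # fixed seed so the split can be repeated
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


# Stand-in for the 19622 rows of csv1: row numbers only.
rows = list(range(19622))
train, test = split_rows(rows)
print(len(train), len(test))
```

A split that samples within each activity class would keep the class proportions equal in both sets, which is usually preferable for classification.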
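Let’s start with four different methods and then decide which one works best.
We will use the following methods: random forest (rf), gradient boosting (gbm), linear discriminant analysis (lda), and bagged trees (treebag).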
Let’s compare their results and display the accuracy and the out-of-sample error (in percent):
## Model Method Accuracy Out_of_Sample_Error
## 1 rf 0.9946 0.54
## 2 gbm 0.9601 3.99
## 3 lda 0.6967 30.33
## 4 treebag 0.9876 1.24
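The out-of-sample error column in the table is simply 100 × (1 − accuracy). A short check in Python, using the accuracies from the table:

```python
# Accuracies copied from the table above.
accuracy = {"rf": 0.9946, "gbm": 0.9601, "lda": 0.6967, "treebag": 0.9876}


def out_of_sample_error_pct(acc):
    """Out-of-sample error in percent, rounded to two decimals."""
    return round((1 - acc) * 100, 2)


errors = {model: out_of_sample_error_pct(acc) for model, acc in accuracy.items()}
best = max(accuracy, key=accuracy.get)

for model, err in errors.items():
    print(f"{model:8s} accuracy={accuracy[model]:.4f} error={err:.2f}%")
print("best model:", best)
```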
Based on the results in the table above, the random forest model is the most accurate (99.46% accuracy, 0.54% out-of-sample error). We could build a combined model from all four, but the random forest alone is already very good. Treebag also performs very well (98.76%), but random forest performs best and estimates its error rate internally from out-of-bag samples, much like cross-validation, which gives more confidence in the result.
Below are the random forest model’s predictions for the 20 new observations in the csv2 file (the exercise):
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E