The following document is designed to meet the requirements of the Practical Machine Learning - Peer Assessment. Please note that, for the sake of clarity, non-informative code, such as sourcing or cleaning the data, was suppressed. If you wish to explore the work in greater detail, please refer to the provided markdown document.
Please note that, due to the presentational requirements, the document uses echo=FALSE to process the R code where the output would be too long or not relevant. You may consider changing it to echo=TRUE if you need the additional R output.
The code snippets were not counted in the word count.
# Clean any objects there may be in memory
rm(list = ls())
# Source the training and test data from the provided URLs. ssl.verifypeer = FALSE is added in case of difficulties sourcing the files on Windows.
suppressMessages(require(RCurl))
train.csv <-
getURL("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
ssl.verifypeer = FALSE)
test.csv <-
getURL("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
ssl.verifypeer = FALSE)
# Create data frames
train.dta <- read.csv(text = train.csv)
test.dta <- read.csv(text = test.csv)
The distribution of the classe variable is provided in the table below. The percentages are computed within each user, so each column sums to 100.
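A sketch of how such a table can be computed is shown below; it assumes the percentages are within-user (column) proportions of a classe-by-user cross-tabulation.
# Cross-tabulate classe by user and convert counts to within-user percentages
round(prop.table(table(train.dta$classe, train.dta$user_name), margin = 2) * 100, 2)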
| classe | adelmo | carlitos | charles | eurico | jeremy | pedro |
|---|---|---|---|---|---|---|
| A | 29.93 | 26.80 | 25.42 | 28.18 | 34.60 | 24.52 |
| B | 19.94 | 22.17 | 21.07 | 19.28 | 14.37 | 19.35 |
| C | 19.27 | 15.84 | 15.24 | 15.93 | 19.17 | 19.12 |
| D | 13.23 | 15.62 | 18.16 | 18.96 | 15.34 | 17.97 |
| E | 17.63 | 19.57 | 20.11 | 17.65 | 16.52 | 19.04 |
The distribution of the classe variable, showing the count of each class value by user, is additionally visualised in the bar chart below.
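A minimal sketch of such a chart, assuming the ggplot2 package, is given below; any bar-plot approach would do.
# Bar chart of classe counts per user, with bars grouped side by side
suppressMessages(require(ggplot2))
ggplot(train.dta, aes(x = user_name, fill = classe)) +
  geom_bar(position = "dodge") +
  labs(x = "User", y = "Count", fill = "classe")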
In total the training data set consists of 160 variables.
Before progressing with the analysis, the data is checked for the presence of erroneous values. First, we can check the columns for string values; it appears that 37 columns contain strings. As summarised in the extract below, some of these columns contain the erroneous value #DIV/0!, and along the same lines the dates can be expected to be stored as factors (as the file was imported from CSV).
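The check itself can be sketched roughly as follows; the exact construction of train.lst.strs.unq in the original analysis is an assumption.
# Identify columns stored as strings or factors
str.cols <- sapply(train.dta, function(x) is.character(x) || is.factor(x))
sum(str.cols)  # 37 columns contain strings, as reported above
# Collapse a few unique values of each string column into one line for inspection
train.lst.strs.unq <- sapply(train.dta[str.cols],
                             function(x) paste(head(unique(as.character(x)), 6),
                                               collapse = ","))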
| train.lst.strs.unq |
|---|
| carlitos,pedro,adelmo,charles,eurico,jeremy |
| 05/12/2011 11:23,05/12/2011 14:22,05/12/2011 14:23,02/12/2011 13:32,02/12/2011 13:33,02/12/2011 14:57, |
| no,yes |
| ,5.587755,-0.997130,7.515290,-2.121212,-1.122273, |
| ,#DIV/0!,-1.298590,19.810708,0.988872,-0.605475, |
| ,#DIV/0! |
The data cleaning process will involve:
1. Converting dates to the date format
2. Removing the erroneous #DIV/0! values
3. Ensuring that NAs are properly coded as missing values and other minor inconsistencies are fixed accordingly
# Make a clean data frame
train.dta.cln <- train.dta
## Remove DIV/0 values
train.dta.cln[ train.dta.cln == "#DIV/0!" ] <- NA
## If there are NA strings clean them as well
train.dta.cln[ train.dta.cln == "NA" ] <- NA
## The same with empty strings
train.dta.cln[ train.dta.cln == "" ] <- NA
# Convert the date to proper date format
train.dta.cln$cvtd_timestamp <- as.Date(x = train.dta.cln$cvtd_timestamp,
format = "%d/%m/%Y %H:%M")
## Keep only columns where fewer than 10% of values are missing
train.dta.cln.no.miss <- train.dta.cln[ sapply(
  train.dta.cln, function(x) sum(is.na(x)) / length(x) ) < 0.1]
After the cleaning it appears that 61.32% of the values in the data set are NAs. After removing the columns with a high proportion of missing values (the code keeps only columns with fewer than 10% missing; in practice the dropped columns are more than 90% missing), the total number of columns in the data set is 60.
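These figures can be verified with a quick sanity check (a sketch; the original calculation is assumed to be equivalent).
# Overall share of missing values in the cleaned data, as a percentage
round(mean(is.na(train.dta.cln)) * 100, 2)
# Number of columns remaining after dropping the mostly-missing columns
ncol(train.dta.cln.no.miss)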
Finally, it may make sense to remove certain variables by hand. It is noticeable that some variables entered the data set as strings; the variables X, user_name, raw_timestamp_part1, raw_timestamp_part2, cvtd_timestamp, new_window and num_window are identifiers and metadata rather than sensor measurements, so they are removed by keeping only the columns listed below.
# Columns to keep
keep.colums <- c('roll_belt', 'pitch_belt', 'yaw_belt', 'total_accel_belt',
'gyros_belt_x', 'gyros_belt_y', 'gyros_belt_z',
'accel_belt_x', 'accel_belt_y', 'accel_belt_z',
'magnet_belt_x', 'magnet_belt_y', 'magnet_belt_z',
'roll_arm', 'pitch_arm', 'yaw_arm', 'total_accel_arm',
'gyros_arm_x', 'gyros_arm_y', 'gyros_arm_z',
'accel_arm_x', 'accel_arm_y', 'accel_arm_z',
'magnet_arm_x', 'magnet_arm_y', 'magnet_arm_z',
'roll_dumbbell', 'pitch_dumbbell', 'yaw_dumbbell', 'total_accel_dumbbell',
'gyros_dumbbell_x', 'gyros_dumbbell_y', 'gyros_dumbbell_z',
'accel_dumbbell_x', 'accel_dumbbell_y', 'accel_dumbbell_z',
'magnet_dumbbell_x', 'magnet_dumbbell_y', 'magnet_dumbbell_z',
'roll_forearm', 'pitch_forearm', 'yaw_forearm', 'total_accel_forearm',
'gyros_forearm_x', 'gyros_forearm_y', 'gyros_forearm_z',
'accel_forearm_x', 'accel_forearm_y', 'accel_forearm_z',
'magnet_forearm_x', 'magnet_forearm_y', 'magnet_forearm_z',
'classe')
# Create a clean data set with the selected columns
train.dta.cln.sel <- train.dta.cln.no.miss[, keep.colums]
Having cleaned the data, it is worthwhile to explore the correlations between the remaining predictors.
# Compute correlations
corrs <- cor(train.dta.cln.sel[, names(train.dta.cln.sel) != 'classe'])
# Draw the correlations matrix
require(corrplot, quietly = TRUE, warn.conflicts = FALSE)
corrplot(corr = corrs, method = "color", type = "lower",
tl.col = "black", tl.cex =0.9, cl.cex=0.9, insig="blank", sig.level = 0.05)
From the visualised correlation matrix it is observable that some variables are strongly correlated. Using the code below we can get a list of correlations with an absolute value higher than 0.75.
# List correlations above 0.75 in absolute value, excluding the diagonal of
# self-correlations (corr == 1); each pair appears twice as the matrix is symmetric.
corrs[which((corrs > 0.75 & corrs != 1) | (corrs < -0.75 & corrs != 1))]
## [1] 0.8152297 0.9809241 0.9248983 -0.9920085 -0.7947807 -0.9657334
## [7] -0.8841727 0.8152297 0.7620963 -0.7764520 0.9809241 0.7620963
## [13] 0.9278069 -0.9749317 -0.7585137 0.7805650 -0.9657334 0.8920913
## [19] 0.9248983 0.9278069 -0.9333854 -0.9920085 -0.7764520 -0.9749317
## [25] -0.9333854 0.7869594 -0.8841727 0.8920913 0.7789335 0.7789335
## [31] -0.9181821 -0.9181821 0.8142732 -0.7947807 -0.7585137 0.7869594
## [37] 0.7788300 0.8142732 -0.7919744 -0.7919744 0.8144455 0.7788300
## [43] 0.8144455 0.8082885 0.8491322 0.7727934 -0.9789507 -0.9144764
## [49] -0.9789507 0.9330422 0.8082885 0.7727934 0.8491322 -0.7686882
## [55] 0.7805650 -0.7686882 0.8455626 -0.9144764 0.9330422 0.8455626
## [61] 0.7720986 0.7720986
Some correlations are unusually high, with values around 0.98, suggesting near-redundant variables. Manual exploration of the data, illustrated below, indicates that the roll_belt variable is strongly correlated with several of the other belt variables.
corrs['roll_belt', 'total_accel_belt']
## [1] 0.9809241
corrs['roll_belt', 'accel_belt_z']
## [1] -0.9920085
corrs['total_accel_belt', 'accel_belt_z']
## [1] -0.9749317
Consequently, the roll_belt indicator is removed from the training data.
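A minimal sketch of this removal step, assuming the same data frame is reused, is:
# Drop the highly correlated roll_belt variable before modelling
train.dta.cln.sel <- train.dta.cln.sel[, names(train.dta.cln.sel) != 'roll_belt']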
Having cleaned the variables accordingly, it is possible to progress with the predictive model. Random forests are often considered a strong default for classification problems: they are fast and scalable, and there is little need to worry about tuning a large number of parameters. The fitted model is summarised in the output below.
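A minimal sketch of the model fit is shown below; the object name mdl.frst matches the prediction step at the end of the document, while the seed and the factor conversion are assumptions added for reproducibility.
suppressMessages(require(randomForest))
set.seed(1234)  # hypothetical seed, not taken from the original analysis
# Ensure the outcome is a factor so randomForest performs classification
train.dta.cln.sel$classe <- as.factor(train.dta.cln.sel$classe)
mdl.frst <- randomForest(formula = classe ~ ., data = train.dta.cln.sel)
mdl.frst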
##
## Call:
## randomForest(formula = classe ~ ., data = train.dta.cln.sel)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.32%
## Confusion matrix:
## A B C D E class.error
## A 5579 1 0 0 0 0.0001792115
## B 12 3782 3 0 0 0.0039504872
## C 0 13 3408 1 0 0.0040911748
## D 0 0 24 3190 2 0.0080845771
## E 0 0 1 6 3600 0.0019406709
Naturally, we should look at the confusion matrix.
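The matrix stored in the fitted model can be rendered as a table, for instance with knitr::kable (the exact rendering call used originally is an assumption).
# Render the random forest confusion matrix as a markdown table
knitr::kable(mdl.frst$confusion)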
| | A | B | C | D | E | class.error |
|---|---|---|---|---|---|---|
| A | 5579 | 1 | 0 | 0 | 0 | 0.0001792 |
| B | 12 | 3782 | 3 | 0 | 0 | 0.0039505 |
| C | 0 | 13 | 3408 | 1 | 0 | 0.0040912 |
| D | 0 | 0 | 24 | 3190 | 2 | 0.0080846 |
| E | 0 | 0 | 1 | 6 | 3600 | 0.0019407 |
The confusion matrix is acceptable, with low class-level error rates and an out-of-bag error estimate of 0.32%. Finally, the validity of the proposed model can be tested using the provided test data set.
predict(mdl.frst, test.dta)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E