Practical Machine Learning

Data Overview

# The first column of each CSV file holds row names, which carry no
# information, so drop it while loading.
training = read.csv('pml-training.csv')[, -1]
testing = read.csv('pml-testing.csv')[, -1]

There are 19622 samples and 159 variables in the training dataset (pml-training.csv). Only 406 samples have no missing values, which means almost every sample has at least one missing value.
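As a minimal sketch of how such counts can be obtained, the snippet below uses a small made-up data frame (toy is hypothetical, not the real training data) and base R's complete.cases and is.na. The same calls should work on training.

```r
# Toy data frame standing in for the training set (hypothetical values).
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "w"),
  c = 1:4
)

# Number of rows that contain no missing value at all.
n_complete <- sum(complete.cases(toy))

# Number of missing values in each column.
na_per_col <- colSums(is.na(toy))

print(dim(toy))
print(n_complete)
print(na_per_col)
```

Applied to the real data, sum(complete.cases(training)) would be the quantity compared against nrow(training).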

As we are going to predict the classe variable, let’s have a look at its empirical distribution in the training data:

table(training$classe)


   A    B    C    D    E 
5580 3797 3422 3216 3607

barplot(sort(table(training$classe), decreasing = TRUE),
        col = rainbow(5), main = 'Fig1: Empirical distribution of classe')

[Fig1: bar plot of the empirical distribution of classe]

The classes are roughly evenly distributed, except that class A is noticeably more frequent than the others.
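To put numbers on this, the counts printed above can be turned into percentages. This is only a sketch that re-enters the table by hand; with the data loaded, prop.table(table(training$classe)) should give the same proportions.

```r
# Class counts copied from the table() output above.
counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

# Share of each class in percent, rounded to one decimal place.
props <- round(100 * counts / sum(counts), 1)
print(props)
```

Class A accounts for roughly 28% of the samples, while each of the other classes falls between about 16% and 19%.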

Preprocessing

Having looked at the data, we should do some preprocessing before building a model, which requires exploring the data in more depth.

Near-Zero-Variance Variables

First, check whether the predictors include any near-zero-variance variables. These are predictors that have a single unique value (zero-variance predictors), or predictors that have both of the following characteristics: very few unique values relative to the number of samples, and a large ratio of the frequency of the most common value to the frequency of the second most common value.

library(caret)
zeroVar = nearZeroVar(training)
training = training[, -c(zeroVar)]
testing = testing[, -c(zeroVar)]

Here, I simply use the nearZeroVar function from the caret package and remove the columns it returns. The training set then has 99 columns left.
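The two criteria described above can be sketched in base R as follows. The helper near_zero_sketch and the toy data are hypothetical and only illustrate the idea; this is not caret's implementation. The thresholds are assumed to mirror nearZeroVar's documented defaults (freqCut = 95/5, uniqueCut = 10).

```r
# Hypothetical helper: flags a vector as near-zero-variance using the
# frequency-ratio and percent-unique criteria described in the text.
near_zero_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's frequency to the second most common's.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of samples.
  percent_unique <- 100 * length(counts) / length(x)
  zero_var <- length(counts) <= 1
  zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
}

toy <- data.frame(
  constant = rep(1, 100),          # a single unique value
  skewed   = c(rep(0, 98), 1, 2),  # 98:1 frequency ratio, 3% unique values
  varied   = seq_len(100)          # every value distinct
)

flags <- sapply(toy, near_zero_sketch)
print(flags)
```

With caret itself, nearZeroVar(training, saveMetrics = TRUE) reports the frequency ratio and the percentage of unique values for each column, which helps to see why a column was flagged.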

Missing values

As mentioned in the first step, almost all the samples have missing values.

require(plyr)
miss_var = as.vector(as.matrix(colwise(function(x) any(is.na(x)))(training)))
train_data = training[, !miss_var]
test_data = testing[, !miss_var]

After checking each predictor, I find that sum(miss_var) of them (41, given the 99 columns before this step and the 58 after it) contain missing values, which are hard to fill in. So I simply remove all of these variables, leaving 58.
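The same column filter can be written without plyr. The sketch below uses base R's anyNA on a made-up data frame (toy is hypothetical); it is an alternative formulation, not the code used above.

```r
# Toy data frame with two columns that contain missing values.
toy <- data.frame(
  keep_me = 1:5,
  has_na  = c(1, NA, 3, 4, 5),
  also_na = c("a", "b", NA, "d", "e"),
  classe  = c("A", "B", "A", "C", "B")
)

# TRUE for every column that contains at least one NA.
miss_var <- vapply(toy, anyNA, logical(1))

# Keep only the columns without missing values.
toy_clean <- toy[, !miss_var, drop = FALSE]

print(sum(miss_var))
print(colnames(toy_clean))
```

Passing drop = FALSE keeps the result a data frame even if only one column survives the filter.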

Modeling

Having looked over and preprocessed the data, it’s time to build a model. As there are many missing values in the dataset, I choose the Random Forest model, which is much more powerful than simple trees.

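# Convert every non-factor column to numeric in both data sets.
factor_var = sapply(train_data, class) == 'factor'
train_data[, !factor_var] = colwise(as.numeric)(train_data[, !factor_var])
test_data[, !factor_var] = colwise(as.numeric)(test_data[, !factor_var])

# Stack the two sets so that they share column names and factor levels;
# tag records which rows came from which file.
tag = rep(c('train', 'test'), c(nrow(train_data), nrow(test_data)))
colnames(test_data) = colnames(train_data)
test_data$classe = 'A'  # placeholder label only; it is never used for fitting
all_data = rbind(train_data, test_data)

# Fit a random forest on the training rows only.
set.seed(123)
require(randomForest)
model = randomForest(classe ~ ., data=all_data[tag == 'train', ])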
model


Call:
 randomForest(formula = classe ~ ., data = all_data[tag == "train",      ]) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of  error rate: 0.06%
Confusion matrix:
     A    B    C    D    E class.error
A 5579    1    0    0    0   0.0001792
B    2 3795    0    0    0   0.0005267
C    0    4 3417    1    0   0.0014611
D    0    0    1 3214    1   0.0006219
E    0    0    0    1 3606   0.0002772

As we can see, the OOB (out-of-bag) estimate of the error rate is 5.606 × 10^-4, which the summary above rounds to 0.06%. This also serves as the estimate of the expected out-of-sample error.
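The figure can be checked by hand from the confusion matrix. The sketch below re-enters the printed matrix (rows are true classes, columns are OOB predictions) and computes the share of off-diagonal entries.

```r
# Confusion matrix copied from the model output above.
conf <- matrix(
  c(5579,    1,    0,    0,    0,
       2, 3795,    0,    0,    0,
       0,    4, 3417,    1,    0,
       0,    0,    1, 3214,    1,
       0,    0,    0,    1, 3606),
  nrow = 5, byrow = TRUE,
  dimnames = list(LETTERS[1:5], LETTERS[1:5])
)

# Misclassified samples are everything off the diagonal.
n_wrong <- sum(conf) - sum(diag(conf))
oob_error <- n_wrong / sum(conf)

print(n_wrong)
print(oob_error)
```

That is 11 misclassified samples out of 19622. With the fitted object available, the same value should also be stored in the last row of model$err.rate, in the "OOB" column.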

Prediction

Finally, use the model to predict the class of each sample in the test data.

pre = predict(model, newdata=all_data[tag == 'test', ])
as.character(pre)

 [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
[18] "B" "B" "B"