training = read.csv('pml-training.csv')[, -1]
testing = read.csv('pml-testing.csv')[, -1]
# The first column of each CSV file holds row names, which are useless as a predictor,
# so it is dropped directly when loading.
The training dataset (pml-training.csv) contains 19622 samples and 159 variables. Only 406 samples have no missing values, which means almost every sample has at least one missing value.
Since the goal is to predict the classe variable, let’s look at its empirical distribution in the training data:
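A count like the 406 fully observed samples can be obtained with base R's `complete.cases`. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from pml-training.csv), so it only illustrates the idea:

```r
# Toy data frame standing in for the training set (hypothetical values)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  c = c(0.5, 0.1, 0.2, 0.9)
)

# complete.cases() returns TRUE for rows with no NA in any column
n_complete <- sum(complete.cases(toy))
n_complete
```

On the real data the same call would be `sum(complete.cases(training))`.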
table(training$classe);
A B C D E
5580 3797 3422 3216 3607
barplot(sort(table(training$classe), decreasing = TRUE),
col=rainbow(5), main='Fig1: Empirical distribution of Classe')
The classes are roughly evenly distributed, except that class A (5580 samples) is noticeably more frequent than the others (3216 to 3797 samples each).
Having looked at the data, we should do some preprocessing before building a model, which requires exploring the data more deeply.
First, check whether the predictors include any near-zero-variance variables. These are predictors that have a single unique value (zero-variance predictors), or predictors that have both of the following characteristics: very few unique values relative to the number of samples, and a large ratio between the frequency of the most common value and the frequency of the second most common value.
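To put numbers on the imbalance, the class shares can be computed from the counts printed above. This is only arithmetic on the reported table; the variable names are my own:

```r
# Class counts copied from the table(training$classe) output above
counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

# Share of each class, in percent
props <- counts / sum(counts)
round(100 * props, 1)

# Ratio between the largest and the smallest class
imbalance <- max(counts) / min(counts)
imbalance
```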
library(caret)
zeroVar = nearZeroVar(training)
training = training[, -c(zeroVar)]
testing = testing[, -c(zeroVar)]
Here I simply use the nearZeroVar function from the caret package and remove the columns it returns. The training set then has 99 columns left.
As mentioned in the first step, almost all the samples have missing values.
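To make the two criteria concrete, here is a minimal base R sketch of the frequency ratio and the percentage of unique values. It is an illustration, not caret's actual implementation; the helper names `nzv_metrics` and `is_nzv` are hypothetical, and the cutoffs 95/5 and 10 are assumed to match the documented defaults of `nearZeroVar` (`freqCut` and `uniqueCut`):

```r
# Frequency ratio and percent of unique values for one vector
nzv_metrics <- function(x) {
  freq <- sort(as.vector(table(x)), decreasing = TRUE)
  # Ratio of the most common value to the second most common one
  freq_ratio <- if (length(freq) > 1) freq[1] / freq[2] else Inf
  pct_unique <- 100 * length(freq) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Flag a vector as zero or near-zero variance (assumed cutoffs)
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  m <- nzv_metrics(x)
  zero_var <- length(unique(x)) == 1
  zero_var || (m[["freq_ratio"]] > freq_cut && m[["pct_unique"]] < unique_cut)
}

x1 <- c(rep(0, 99), 1)  # one dominant value: near-zero variance
x2 <- seq_len(100)      # all values distinct: informative
x3 <- rep(5, 100)       # a single value: zero variance

c(x1 = is_nzv(x1), x2 = is_nzv(x2), x3 = is_nzv(x3))
```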
require(plyr)
miss_var = as.vector(as.matrix(colwise(function(x) any(is.na(x)))(training)))
train_data = training[, !miss_var]
test_data = testing[, !miss_var]
After checking each predictor, I find that 41 of them contain missing values (99 columns before this step, 58 after), and these are hard to fill in. So I simply remove all of these variables, leaving 58.
Having surveyed and preprocessed the data, it’s time to build a model. With the columns containing missing values removed, I choose a random forest model, which is much more powerful than a single decision tree.
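The same column filter can also be written without plyr. The sketch below is a base R alternative on a made-up data frame (`toy` is hypothetical); it also reports the fraction of missing values per column, which helps judge whether imputation would be realistic:

```r
# Toy data frame standing in for the training set (hypothetical values)
toy <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6), c = c(NA, NA, 9))

# Fraction of missing values in each column
na_frac <- colMeans(is.na(toy))

# Keep only the columns with no missing value at all
has_na <- na_frac > 0
kept <- toy[, !has_na, drop = FALSE]

sum(has_na)   # number of columns dropped
names(kept)   # columns that survive
```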
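# Convert every non-factor column to numeric so both sets share the same column types
factor_var = sapply(train_data, class) == 'factor'
train_data[, !factor_var] = colwise(as.numeric)(train_data[, !factor_var])
test_data[, !factor_var] = colwise(as.numeric)(test_data[, !factor_var])
# Tag the rows, then stack both sets so that factor levels are consistent
tag = rep(c('train', 'test'), c(nrow(train_data), nrow(test_data)))
colnames(test_data) = colnames(train_data)
# The test set has no classe label, so fill in a placeholder value
test_data$classe = 'A'
all_data = rbind(train_data, test_data)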
set.seed(123)
require(randomForest)
model = randomForest(classe ~ ., data=all_data[tag == 'train', ])
model
Call:
randomForest(formula = classe ~ ., data = all_data[tag == "train", ])
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 7
OOB estimate of error rate: 0.06%
Confusion matrix:
A B C D E class.error
A 5579 1 0 0 0 0.0001792
B 2 3795 0 0 0 0.0005267
C 0 4 3417 1 0 0.0014611
D 0 0 1 3214 1 0.0006219
E 0 0 0 1 3606 0.0002772
As we can see, the OOB (out-of-bag) estimate of the error rate is 5.606 × 10^-4, i.e. about 0.06%. I also take this as the estimate of the expected out-of-sample error.
Finally, use the model to make predictions on the test data.
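The OOB error rate can be recomputed from the confusion matrix printed above: it is the share of off-diagonal counts. The sketch below retypes the reported counts by hand (the variable names are my own):

```r
# Confusion matrix copied from the model output (rows: actual, columns: predicted)
conf <- matrix(
  c(5579,    1,    0,    0,    0,
       2, 3795,    0,    0,    0,
       0,    4, 3417,    1,    0,
       0,    0,    1, 3214,    1,
       0,    0,    0,    1, 3606),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Overall OOB error: misclassified samples divided by all samples
oob_error <- 1 - sum(diag(conf)) / sum(conf)

# Per-class error: misclassified share within each actual class
class_error <- 1 - diag(conf) / rowSums(conf)

oob_error
class_error
```

Only 11 of the 19622 samples lie off the diagonal, which gives the 5.606 × 10^-4 figure.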
pre = predict(model, newdata=all_data[tag == 'test', ])
as.character(pre)
[1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
[18] "B" "B" "B"