People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project we show that, using data from accelerometers on the belt, forearm, arm, and dumbbell, a Random Forest model can predict the manner in which 6 participants performed an exercise with 99.3% accuracy.
suppressWarnings(library(caret))

training <- read.csv("Data/pml-training.csv", header = TRUE) #load training data
test <- read.csv("Data/pml-testing.csv", header = TRUE) #load test data

training <- subset(training, select = -c(1:6)) #drop row numbers, user names and timestamps
test <- subset(test, select = -c(1:6)) #drop row numbers, user names and timestamps

classe <- training$classe #save the outcome before the column filters below

training <- training[, colSums(is.na(training)) == 0] #drop every column containing at least one NA
test <- test[, colSums(is.na(test)) == 0] #drop every column containing at least one NA

training <- training[, sapply(training, is.numeric)] #keep numeric variables only
test <- test[, sapply(test, is.numeric)] #keep numeric variables only

training$classe <- classe; rm(classe) #restore the outcome and delete the helper variable
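After cleaning, the training and test sets should contain the same predictors and differ only in the outcome/identifier column. A quick sanity check, as a sketch (this output is not part of the original analysis):

dim(training); dim(test) #both sets should now have the same number of columns
setdiff(names(training), names(test)) #expect "classe" (training only)
setdiff(names(test), names(training)) #expect "problem_id" (test only)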
Training data set:

- Includes the outcome to predict: classe
- We split it into two subsets of 50% of the observations each, train and validate (the class balance of the split is checked below)
- train is used to build the model and validate to assess it

Test data set:

- Includes 20 observations, without classe but with a problem_id column
set.seed(33433)
inTraining <- createDataPartition(training$classe, p=0.5, list=FALSE)
train <- training[ inTraining,]
validate <- training[-inTraining,]
rm(training, inTraining)
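createDataPartition performs a stratified split on the outcome, so both subsets should keep roughly the original class proportions. A quick check, as a sketch (not part of the original output):

round(prop.table(table(train$classe)), 3) #class proportions in train
round(prop.table(table(validate$classe)), 3) #class proportions in validate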
We follow the recommendation of the original study and fit a Random Forest model to the train data.
To estimate model accuracy we use 10-fold cross-validation, with each forest limited to 10 trees to keep training time reasonable. If the accuracy turns out to be low, we will try a second model.
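Note that caret only runs the cross-validation folds in parallel when a foreach backend has been registered beforehand. A minimal sketch of that setup, assuming the doParallel package is installed (this setup is not part of the original run):

library(doParallel) #assumption: doParallel is installed
cl <- makeCluster(max(1, parallel::detectCores() - 1)) #leave one core free
registerDoParallel(cl) #register the backend so caret/foreach can use it
#... fit the model with train() as below, then release the workers:
stopCluster(cl)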
set.seed(33433)
tr <- trainControl(method = "cv", number = 10) #10-fold cross-validation

fit.rf <- train(classe ~ ., #fit the model
                method = "rf", #random forest
                data = train,
                trControl = tr, #resampling parameters
                allowParallel = TRUE, #use parallel processing (if available)
                ntree = 10, #number of trees per forest
                verbose = FALSE)
print(fit.rf$finalModel)
##
## Call:
## randomForest(x = x, y = y, ntree = 10, mtry = param$mtry, allowParallel = TRUE, verbose = FALSE)
## Type of random forest: classification
## Number of trees: 10
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 2.47%
## Confusion matrix:
## A B C D E class.error
## A 2726 21 5 7 4 0.01339124
## B 23 1814 24 15 4 0.03510638
## C 2 26 1651 17 0 0.02653302
## D 2 14 28 1533 8 0.03280757
## E 2 16 9 13 1749 0.02235886
The out-of-bag (OOB) error estimate above is 2.47%. We now evaluate the model on the held-out validation data:
set.seed(33433)
pred.rf <- predict(fit.rf, newdata = validate) #Predict with validation data
confusionMatrix(pred.rf, validate$classe)$overall[1] #Measure accuracy with validation data
## Accuracy
## 0.993578
We get 99.3% accuracy on the validation set, so there is no need to try a second model.
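The expected out-of-sample error follows directly from the validation accuracy; a minimal sketch (this line is not part of the original output):

1 - confusionMatrix(pred.rf, validate$classe)$overall["Accuracy"] #about 0.0064 (0.64%), given the 0.993578 accuracy above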
Here we plot the ten variables with the highest impact on the fitted model:
VarImportance <- varImp(fit.rf)
plot(VarImportance, main = "Most relevant variables", top = 10)
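The same ranking can also be read as text from the $importance slot of the varImp() result; a sketch, assuming the default "Overall" importance column:

imp <- VarImportance$importance
head(imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE], 10) #top 10 predictors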
Now we predict classe for each problem_id in the test data set:
pred.rf.test <- predict(fit.rf, newdata = test[, -which(names(test) %in% "problem_id")]) #predict on the 20 test cases
t(data.frame(problem_id = test$problem_id, prediction = pred.rf.test)) #show as a compact two-row table
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## problem_id " 1" " 2" " 3" " 4" " 5" " 6" " 7" " 8" " 9" "10" "11" "12"
## prediction "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C"
## [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## problem_id "13" "14" "15" "16" "17" "18" "19" "20"
## prediction "B" "A" "E" "E" "A" "B" "B" "B"