In this project, we examined the effects of exercise behaviors on the exercise qualities, using the data of measurements collected from wearable devices. We applied two predictive algorithms-decision tree and random forest, and finally decided to use random forest with 13 variables resampled at each split as our final model, which has a cross-validation accuracy 0.99.
The final model shows that roll_belt, yaw_belt and pitch_forearm are the most influential variables impacting the quality of exercise.
For this exercise, we will primarily use dplyr for data manipulation, caret for model building, ggplot2 for data visualization
require(dplyr)
require(caret)
require(ggplot2)
The first column and second column are timestamps and new_window, num_window variable which are not quite useful here, so we delete them.
We did not delete userName variable because there could be user-level confounding effect which will impacts the response (for instance, some users may be more likely to give a better exercise than others)
training <- read.csv('pml-training.csv', sep = ',',na.strings = c('','NA'))
training <- training[,-c(1,3:7)]
training$user_name <- factor(training$user_name)
In this section, we explore the datasets by plotting the distribution of variables and search for the irregularities which needs to be addressed by appropriate preprocessing method
There are 67 variables each of which have 98% missing values. With such a high percentage of NAs, doing imputation would be both too computationally costly and highly biased. Including them in the data will also reduce model performance. Therefore, the best way is to delete all of these variables.
missing_percent <- data.frame(VarName <- names(training), Pct <- 0.0)
for(i in 1:length(names(training))){
missing_percent$Pct[i] <- sum(is.na(training[,i]))/nrow(training)
}
missing_percent <- missing_percent[,c(1,3)]
names(missing_percent) <- c('Vars','Pct')
#Delete all of the variables
training <- training[,-which(missing_percent$Pct > 0.5)]
The variables with much less unique values than the total number of observation, as well as a huge imbalance between the most frequent value and the second most frequent value, should be removed because they have insignificant predictive power.
To be more conservative and not losing too much information, I use freqCut = 99/1 and uniqueCut = 10 as cutoff points, which resulted in 24 variables to be removed.
nsv <- nearZeroVar(training,saveMetrics=TRUE,freqCut = 99/1, uniqueCut=10 )
if (length(which(nsv$nzv)) > 0){
training <- training[,-which(nsv$nzv)]
}
Imbalance check in response variable is conducted to see if balancing technique such as undersampling/oversampling methods need to be applied before modeling.
As the result shows, there is not a huge imbalance among response levels. So we do not need to apply balancing methods.
training$classe <- factor(training$classe)
res_percent <- training %>% group_by(classe) %>% summarise(Percentage = n()/nrow(training))
ggplot(data=res_percent, aes(x = classe, y = Percentage)) + geom_bar(aes(fill = classe),stat='identity') + ggtitle('Percentage of exercise wellness levels') + xlab('Classe') + ylab('Percentage')
We applied two models here:
Decision Tree Classification Random Forest
In order to reduce the risk of overfitting, we applied 10-fold repeated cross-validation on the training set and use the same folds across three models. The purpose of repeated cross-validation is to test the stability of model performance, i.e. whether the model have close accuracy metrics across different samples.
We enable Parallel processing and random seeds here to give three methods the same set of folds for training.
set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, 22)
fitControl <- trainControl(method = 'cv',number=10,allowParallel= TRUE,seeds=seeds)
parGrid = expand.grid(cp = c(0.001,0.01,0.1))
CTree <- train(classe~.,data=training, trControl = fitControl, tuneGrid = parGrid,method = 'rpart')
final_tree <- CTree$finalModel
library(doMC)
registerDoMC(core=5)
number_of_var = length(names(training)) - 1
parGrid = expand.grid(mtry = c(floor(sqrt(number_of_var)),floor(number_of_var/3), floor(number_of_var/4)))
rForest <- train(classe~.,data=training,trControl = fitControl,tuneGrid=parGrid,method = 'rf',allowParallel = TRUE)
final_rForest <- rForest$finalModel
From the final results of these three models, random forest with 13 variables resampled at each split performs better at nearly 99% classification accuracy with similar variance.
The variable importance figure shows that the roll_belt, yaw_belt and pitch_forearm are the three most important variables determining the qualities of exercise.
load('../.RData')
res_rf <- rForest$results[1,]
res_tree <- CTree$results[1,]
metrics <- data.frame(Model=c('Random Forest','Decision Tree'),importance = rbind(res_rf[,2:5],res_tree[,2:5]))
metrics
Model importance.Accuracy importance.Kappa importance.AccuracySD
1 Random Forest 0.9966364 0.9957452 0.0009968378
2 Decision Tree 0.9040877 0.8785978 0.0087942896
importance.KappaSD
1 0.001261066
2 0.011148416
final_rForest = rForest$finalModel
#Variable Importance
varImportance= data.frame(Variable = row.names(final_rForest$importance),final_rForest$importance)
varImportance = varImportance %>% arrange(desc(MeanDecreaseGini))
varImportance$Variable <- ordered(varImportance$Variable,levels = varImportance$Variable[order(varImportance$MeanDecreaseGini,decreasing=FALSE)])
ggplot(data=varImportance[1:10,],aes(x = Variable, y = MeanDecreaseGini)) + geom_bar(aes(fill = Variable),stat = 'identity') + coord_flip() + ggtitle('Variable Importance of Random Forest') + theme(axis.text = element_text(colour = 'black',face='bold',size=10))
The best model is applied to 20 test cases to produce results
testing <- read.csv('pml-testing.csv',na.strings = c('NA',''),header=TRUE)
answers = as.character(predict(rForest,newdata=testing))
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("results/problem_id_",i,".txt")
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
}
pml_write_files(answers)