Using the data set from this project, we need to predict whether people performed barbell lifts correctly.
For this assignment we will use the following packages:
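The exact loading block is not reproduced here; a minimal sketch, based on the functions used later in the analysis (caret for the modelling, dplyr for the data manipulation, ggplot2 for the plots), could look like this:

```r
# packages assumed from the code used later in this report
library(caret)    # createDataPartition, trainControl, train, confusionMatrix, nearZeroVar
library(dplyr)    # %>%, select, mutate
library(ggplot2)  # ggplot, ggsave
```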
We have a training and a test dataset that were provided with the assignment and loaded.
We run a few checks on the training dataset.
However, there are 100 variables in which about 98% of the values are NA. We removed them from both the training and test datasets (Annex 1 has the complete list of the removed variables together with their proportion of NAs).
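As a sketch, assuming the raw datasets are called training and testing, the mostly-NA columns could be identified and dropped as follows (the 0.9 threshold and the training_clean/testing_clean names are assumptions):

```r
# proportion of NAs per column of the training data (dataset names assumed)
na_prop   <- colMeans(is.na(training))
drop_cols <- names(na_prop)[na_prop > 0.9]   # the ~98%-NA columns listed in Annex 1
training_clean <- training %>% select(-all_of(drop_cols))
testing_clean  <- testing  %>% select(-all_of(drop_cols))
```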
We explore the class of the variables, and we see that only four are not numeric.
| variable | user_name | cvtd_timestamp | new_window | classe |
|---|---|---|---|---|
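A quick way to list the non-numeric columns (assuming the cleaned data is in training_clean):

```r
# columns that are not numeric after removing the mostly-NA variables
non_numeric <- names(training_clean)[!sapply(training_clean, is.numeric)]
non_numeric
```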
Three of these variables are factors: user_name, new_window, and classe. The cvtd_timestamp variable is a date and time. We apply these conversions in both datasets, i.e. we change these variables from character to factor or to date-time.
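A sketch of these conversions, assuming lubridate is available and that cvtd_timestamp is stored as day/month/year hour:minute (the training_clean input name is also an assumption; training_clean2 is the name used in the plotting loop below):

```r
library(lubridate)
training_clean2 <- training_clean %>%
  mutate(user_name      = as.factor(user_name),
         new_window     = as.factor(new_window),
         classe         = as.factor(classe),
         cvtd_timestamp = dmy_hm(cvtd_timestamp))  # parse "dd/mm/yyyy hh:mm" strings (format assumed)
```

The same conversions, minus classe, would apply to the test dataset.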
We first explore the correlation across covariates. In Annex 2, you can find a correlation matrix plot (the higher the intensity of the color, the higher the correlation). We see that measurements coming from the same sensor are highly correlated with each other (for example, gyros_dumbbell_x is correlated with gyros_dumbbell_y and gyros_dumbbell_z).
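The plot in Annex 2 could be produced, for example, with the corrplot package (whether this exact package was used is an assumption):

```r
library(corrplot)
# correlation matrix across the numeric covariates only
num_vars <- training_clean2 %>% select(where(is.numeric))
corr_mat <- cor(num_vars, use = "pairwise.complete.obs")
corrplot(corr_mat, method = "color", tl.cex = 0.4)  # more intense color = stronger correlation
```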
We visually explored the relationship between the covariates and classe. The loop used to generate the plots is below, and the plots are stored in the plots_raw folder. We see that the variables raw_timestamp_part_1, raw_timestamp_part_2 and cvtd_timestamp are evenly distributed across classes. Plot 1 presents raw_timestamp_part_1 as an example. We remove these three variables from the analysis.
for(i in seq_along(var_list)){
  # jittered points and a boxplot of each covariate against classe
  temp<-training_clean2%>%select(classe,all_of(var_list[i]))%>%
    ggplot(aes(classe,.data[[var_list[i]]],color=classe))+
    geom_jitter(alpha=0.3,size=2)+geom_point(color="black")+geom_boxplot(alpha=0.3)+
    labs(title=var_list[i])+theme(axis.title.y=element_blank())
  # keep the plot in the environment and save it to the plots_raw folder
  assign(paste0("plot_",var_list[i]),temp)
  invisible(ggsave(paste0(".//plots_raw//plot_",var_list[i],".jpg"),temp))
}
We also check for near-zero-variance covariates; only new_window is flagged, so we remove it as well.

| var | freqRatio | percentUnique | zeroVar | nzv |
|---|---|---|---|---|
| new_window | 47.33005 | 0.0101926 | FALSE | TRUE |
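The metrics in the table above are the ones returned by caret's nearZeroVar() with saveMetrics = TRUE; a possible call (the dataset name is an assumption):

```r
# near-zero-variance diagnostics; keep only the flagged rows
nzv_metrics <- nearZeroVar(training_clean2, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]
```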
We also remove the column “…1”, which is just the row index, and the user_name variable.
Finally, after checking, we see that the classe variable is not present in the testing dataset; it has been replaced by a problem_id variable.
We end up with datasets of the following dimensions:
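Putting the cleaning steps together, a sketch of the final column removals (training_clean2 comes from the plotting loop above, training_clean5 is the name used in the split code below; the exact intermediate steps are assumptions):

```r
training_clean5 <- training_clean2 %>%
  select(-raw_timestamp_part_1, -raw_timestamp_part_2, -cvtd_timestamp,  # uninformative timestamps
         -new_window,                                                    # near-zero variance
         -`...1`, -user_name)                                            # row index and user name
dim(training_clean5)  # check the resulting dimensions
```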
For cross-validation, we split the data into two datasets: 70% of the observations for training and 30% for testing.
To avoid any confusion, we refer to the original testing dataset (the one with 20 observations) as the validation dataset.
The code to split the dataset is below:
#we create the training data set with 70% of the observations and the test data set with the remaining 30%
split1<-createDataPartition(y=training_clean5$classe,p=0.7,list=FALSE)
train<-training_clean5[split1,]
test<-training_clean5[-split1,]
We will use the following models: a decision tree (rpart), a random forest (rf), boosting with trees (gbm), and model-based prediction with linear discriminant analysis (lda).
For each of the models, we generate predicted values on the test set and compute the confusion matrix.
# we set a seed and a common 3-fold cross-validation control for the models
set.seed(1234)
control <- trainControl(method="cv", number=3, verboseIter=FALSE)
## decision tree
mod_trees <- train(classe~., data=train, method="rpart",
trControl = control, tuneLength = 5)
pred_trees<-predict(mod_trees,test)
conf_trees<-confusionMatrix(pred_trees,factor(test$classe))
## random forest
mod_rf <- train(classe~., data=train, method="rf",
trControl = control, tuneLength = 5)
pred_rf<-predict(mod_rf,test)
conf_rf<-confusionMatrix(pred_rf,factor(test$classe))
## Boosting with trees
mod_gbm <- train(classe~., data=train, method="gbm",
trControl = control, tuneLength = 5,verbose=FALSE)
pred_gbm<-predict(mod_gbm,test)
conf_gbm<-confusionMatrix(pred_gbm,factor(test$classe))
## model-based prediction
mod_lda<- train(classe~., data=train, method="lda")
pred_lda<-predict(mod_lda,test)
conf_lda<-confusionMatrix(pred_lda,factor(test$classe))
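The accuracy comparison below can be assembled from the four confusion matrices; a possible way to do it (the data-frame construction itself is an assumption):

```r
# collect the overall accuracy of each model from its confusion matrix
accuracy_tbl <- data.frame(
  model    = c("Decision Tree", "Random Forest", "Boosting with trees", "Model-based prediction"),
  accuracy = c(conf_trees$overall["Accuracy"],
               conf_rf$overall["Accuracy"],
               conf_gbm$overall["Accuracy"],
               conf_lda$overall["Accuracy"]))
accuracy_tbl[order(-accuracy_tbl$accuracy), ]  # sort from best to worst
```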
| model | accuracy |
|---|---|
| Random Forest | 0.9947324 |
| Boosting with trees | 0.9899745 |
| Model-based prediction | 0.7002549 |
| Decision Tree | 0.5417162 |
We see that the model with the highest accuracy is the random forest, with an accuracy of 99.5% on the test set; the expected out-of-sample error is therefore about 0.5%. This is the model we will use on the validation set.
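A sketch of the final prediction step, assuming the renamed 20-observation dataset is called validation:

```r
# predict classe for the 20 held-out observations with the random forest model
pred_final <- predict(mod_rf, validation)
pred_final
```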
Below are our predictions for the 20 observations of the validation dataset.
| observation | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| prediction | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |
Annex 1 lists the variables removed from the datasets because of their high proportion of missing values.

| variable | proportion of NAs |
|---|---|
| kurtosis_roll_belt | 0.9793089 |
| kurtosis_picth_belt | 0.9793089 |
| kurtosis_yaw_belt | 0.9793089 |
| skewness_roll_belt | 0.9797676 |
| skewness_roll_belt.1 | 0.9793089 |
| skewness_yaw_belt | 0.9793089 |
| max_roll_belt | 0.9793089 |
| max_picth_belt | 0.9793089 |
| max_yaw_belt | 0.9793089 |
| min_roll_belt | 0.9793089 |
| min_pitch_belt | 0.9793089 |
| min_yaw_belt | 0.9793089 |
| amplitude_roll_belt | 0.9793089 |
| amplitude_pitch_belt | 0.9793089 |
| amplitude_yaw_belt | 0.9793089 |
| var_total_accel_belt | 0.9793089 |
| avg_roll_belt | 0.9793089 |
| stddev_roll_belt | 0.9793089 |
| var_roll_belt | 0.9793089 |
| avg_pitch_belt | 0.9793089 |
| stddev_pitch_belt | 0.9793089 |
| var_pitch_belt | 0.9793089 |
| avg_yaw_belt | 0.9793089 |
| stddev_yaw_belt | 0.9793089 |
| var_yaw_belt | 0.9793089 |
| var_accel_arm | 0.9793089 |
| avg_roll_arm | 0.9793089 |
| stddev_roll_arm | 0.9793089 |
| var_roll_arm | 0.9793089 |
| avg_pitch_arm | 0.9793089 |
| stddev_pitch_arm | 0.9793089 |
| var_pitch_arm | 0.9793089 |
| avg_yaw_arm | 0.9793089 |
| stddev_yaw_arm | 0.9793089 |
| var_yaw_arm | 0.9793089 |
| kurtosis_roll_arm | 0.9793089 |
| kurtosis_picth_arm | 0.9793089 |
| kurtosis_yaw_arm | 0.9793089 |
| skewness_roll_arm | 0.9793089 |
| skewness_pitch_arm | 0.9793089 |
| skewness_yaw_arm | 0.9793089 |
| max_roll_arm | 0.9793089 |
| max_picth_arm | 0.9793089 |
| max_yaw_arm | 0.9793089 |
| min_roll_arm | 0.9793089 |
| min_pitch_arm | 0.9793089 |
| min_yaw_arm | 0.9793089 |
| amplitude_roll_arm | 0.9793089 |
| amplitude_pitch_arm | 0.9793089 |
| amplitude_yaw_arm | 0.9793089 |
| kurtosis_roll_dumbbell | 0.9793089 |
| kurtosis_picth_dumbbell | 0.9793089 |
| kurtosis_yaw_dumbbell | 0.9793089 |
| skewness_roll_dumbbell | 0.9795128 |
| skewness_pitch_dumbbell | 0.9793599 |
| skewness_yaw_dumbbell | 0.9793089 |
| max_roll_dumbbell | 0.9793089 |
| max_picth_dumbbell | 0.9793089 |
| max_yaw_dumbbell | 0.9793089 |
| min_roll_dumbbell | 0.9793089 |
| min_pitch_dumbbell | 0.9793089 |
| min_yaw_dumbbell | 0.9793089 |
| amplitude_roll_dumbbell | 0.9793089 |
| amplitude_pitch_dumbbell | 0.9793089 |
| amplitude_yaw_dumbbell | 0.9793089 |
| var_accel_dumbbell | 0.9793089 |
| avg_roll_dumbbell | 0.9793089 |
| stddev_roll_dumbbell | 0.9793089 |
| var_roll_dumbbell | 0.9793089 |
| avg_pitch_dumbbell | 0.9793089 |
| stddev_pitch_dumbbell | 0.9793089 |
| var_pitch_dumbbell | 0.9793089 |
| avg_yaw_dumbbell | 0.9793089 |
| stddev_yaw_dumbbell | 0.9793089 |
| var_yaw_dumbbell | 0.9793089 |
| kurtosis_roll_forearm | 0.9793089 |
| kurtosis_picth_forearm | 0.9793089 |
| kurtosis_yaw_forearm | 0.9793089 |
| skewness_roll_forearm | 0.9793089 |
| skewness_pitch_forearm | 0.9793089 |
| skewness_yaw_forearm | 0.9793089 |
| max_roll_forearm | 0.9793089 |
| max_picth_forearm | 0.9793089 |
| max_yaw_forearm | 0.9793089 |
| min_roll_forearm | 0.9793089 |
| min_pitch_forearm | 0.9793089 |
| min_yaw_forearm | 0.9793089 |
| amplitude_roll_forearm | 0.9793089 |
| amplitude_pitch_forearm | 0.9793089 |
| amplitude_yaw_forearm | 0.9793089 |
| var_accel_forearm | 0.9793089 |
| avg_roll_forearm | 0.9793089 |
| stddev_roll_forearm | 0.9793089 |
| var_roll_forearm | 0.9793089 |
| avg_pitch_forearm | 0.9793089 |
| stddev_pitch_forearm | 0.9793089 |
| var_pitch_forearm | 0.9793089 |
| avg_yaw_forearm | 0.9793089 |
| stddev_yaw_forearm | 0.9793089 |
| var_yaw_forearm | 0.9793089 |