Assignment

Using the dataset from this project, we need to predict whether people performed barbell lifts correctly.

Packages

For this assignment we will use the following packages:
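The set below is a sketch inferred from the functions called later in the report; caret loads the model back-ends (rpart, randomForest, gbm, MASS) as needed:

library(dplyr)    # %>% pipelines, select(), mutate()
library(ggplot2)  # plots of the covariates against classe
library(caret)    # createDataPartition, trainControl, train, confusionMatrix, nearZeroVar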

The data

We have a training and a test dataset that were provided with the assignment and loaded.
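As a minimal sketch of the loading step (the file names pml-training.csv and pml-testing.csv are assumptions; the readr-style column name "...1" removed later suggests read_csv was used):

training <- readr::read_csv("pml-training.csv")  # assumed file name
testing  <- readr::read_csv("pml-testing.csv")   # assumed file name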

Checking and cleaning the data

Empty values

We run a few checks on the training dataset.

  • Empty variables: no variable is completely empty.
  • Empty rows: there are no empty rows.

However, there are 100 variables with about 98% NA values. We remove them from both the training and test datasets (Annex 1 lists all the variables removed).
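A minimal sketch of how this check and removal can be done (the data-frame names are illustrative, and the 90% threshold is an assumption):

na_share  <- colMeans(is.na(training))           # share of NAs per column
mostly_na <- names(na_share[na_share > 0.9])     # columns that are almost entirely NA
training_clean <- training[, !(names(training) %in% mostly_na)]
testing_clean  <- testing[,  !(names(testing)  %in% mostly_na)]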

Variables Class

We explore the class of each variable and see that only four are not numeric: user_name, cvtd_timestamp, new_window and classe.

Three of these variables are factors: user_name, new_window and classe; cvtd_timestamp is a date and time. We apply these modifications to the datasets, i.e. we convert these variables from character to factor or to date-time.
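A sketch of these conversions, assuming dplyr and the day/month/year hour:minute timestamp format used in this dataset (data-frame names are illustrative):

training_clean <- training_clean %>%
  mutate(
    user_name      = factor(user_name),
    new_window     = factor(new_window),
    classe         = factor(classe),
    cvtd_timestamp = as.POSIXct(cvtd_timestamp, format = "%d/%m/%Y %H:%M")
  )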

Correlation between variables

We first explore the correlation across covariates. In Annex 2, you can find a correlation matrix plot (the more intense the color, the higher the correlation). We see that measurements taken by the same sensor are highly correlated with one another (for example, gyros_dumbbell_x is correlated with gyros_dumbbell_y and gyros_dumbbell_z).
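The report does not name the plotting function; one way to produce such a plot, assuming the corrplot package (data-frame names are illustrative):

num_vars <- training_clean %>% select(where(is.numeric))
corr_mat <- cor(num_vars, use = "pairwise.complete.obs")
corrplot::corrplot(corr_mat, method = "color", tl.cex = 0.5)  # darker = stronger correlation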

Relationship between covariates and classe

We visually explored the relationship between the covariates and classe. The loop used to generate the plots is below, and the plots are saved in the plots_raw folder. We see that the variables raw_timestamp_part_1, raw_timestamp_part_2 and cvtd_timestamp are evenly distributed across classe. Plot 1 presents raw_timestamp_part_1 as an example. We remove these three variables from the analysis.

# Loop over the candidate covariates: plot each one against classe
# and save the plot to the plots_raw folder.
for (i in seq_along(var_list)) {
  temp <- training_clean2 %>%
    select(classe, all_of(var_list[i])) %>%
    ggplot(aes(classe, .data[[var_list[i]]], color = classe)) +
    geom_jitter(alpha = 0.3, size = 2) +
    geom_point(color = "black") +
    geom_boxplot(alpha = 0.3) +
    labs(title = var_list[i]) +
    theme(axis.title.y = element_blank())
  assign(paste0("plot_", var_list[i]), temp)                       # keep the plot object
  invisible(ggsave(paste0("./plots_raw/plot_", var_list[i], ".jpg"), temp))
}

Removing zero covariates

We check for zero covariates. We see that the variable new_window is a near-zero-variance covariate, so we remove it from the dataset.

var          freqRatio   percentUnique   zeroVar   nzv
new_window   47.33005    0.0101926       FALSE     TRUE
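The metrics in this table match what caret's nearZeroVar returns with saveMetrics = TRUE; a minimal sketch (data-frame names are illustrative):

nzv <- nearZeroVar(training_clean, saveMetrics = TRUE)
nzv[nzv$nzv, ]                                   # rows flagged as near zero variance
training_clean <- training_clean %>% select(-new_window)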

Removing the index and the user

Finally, we remove the column "...1", which is just the row index, and the user_name variable.
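A one-line sketch of this step, assuming dplyr (the backticks are needed because "...1" is not a syntactic name):

training_clean <- training_clean %>% select(-`...1`, -user_name)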

Final Datasets

After checking, we also see that the classe variable is not present in the testing dataset; it has been replaced with a variable called problem_id.

We end up with datasets of the following dimensions:

  • Training dataset: 19,622 observations and 53 variables.
  • Test dataset: 20 observations and 53 variables.

Cross Validation

For cross-validation, we will split the data into two datasets: a training set with 70% of the observations and a test set with the remaining 30%.

To avoid any confusion, we rename the original testing dataset (the one with 20 observations) to the final validation dataset.

The code to split the data set is below:

# we create the training dataset with 70% of the observations
split1 <- createDataPartition(y = training_clean5$classe, p = 0.7, list = FALSE)
train  <- training_clean5[split1, ]
test   <- training_clean5[-split1, ]

The models

We will use the following models:

  • decision tree (rpart)
  • random forest (rf)
  • boosting with trees (gbm)
  • model-based prediction: linear discriminant analysis (lda)

For each model, we generate predicted values on the test set and compute the confusion matrix.

# we set up the model training: 3-fold cross-validation
set.seed(1234)
control <- trainControl(method = "cv", number = 3, verboseIter = FALSE)

## decision tree
mod_trees <- train(classe ~ ., data = train, method = "rpart",
                   trControl = control, tuneLength = 5)

pred_trees <- predict(mod_trees, test)
conf_trees <- confusionMatrix(pred_trees, factor(test$classe))

## random forest
mod_rf <- train(classe ~ ., data = train, method = "rf",
                trControl = control, tuneLength = 5)

pred_rf <- predict(mod_rf, test)
conf_rf <- confusionMatrix(pred_rf, factor(test$classe))

## boosting with trees
mod_gbm <- train(classe ~ ., data = train, method = "gbm",
                 trControl = control, tuneLength = 5, verbose = FALSE)

pred_gbm <- predict(mod_gbm, test)
conf_gbm <- confusionMatrix(pred_gbm, factor(test$classe))

## model-based prediction (linear discriminant analysis)
mod_lda  <- train(classe ~ ., data = train, method = "lda")
pred_lda <- predict(mod_lda, test)
conf_lda <- confusionMatrix(pred_lda, factor(test$classe))

Comparing accuracies

model                    accuracy
Random Forest            0.9947324
Boosting with trees      0.9899745
Model-based prediction   0.7002549
Decision Tree            0.5417162

We see that the model with the highest accuracy is the random forest, at 99.5%. This is the model we will use on the validation set.
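The accuracies above can be extracted from the confusionMatrix objects; a sketch of how the comparison table could have been assembled:

accuracies <- data.frame(
  model = c("Decision Tree", "Random Forest",
            "Boosting with trees", "Model-based prediction"),
  accuracy = c(conf_trees$overall["Accuracy"],
               conf_rf$overall["Accuracy"],
               conf_gbm$overall["Accuracy"],
               conf_lda$overall["Accuracy"])
)
accuracies[order(-accuracies$accuracy), ]  # sort from best to worst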

Predictions on the validation set

Below are our predictions of the 20 observations.

prediction: B A B A A E D B A A B C B A E E A B B B
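A sketch of the final step, assuming the renamed 20-observation validation data frame is called validation:

pred_final <- predict(mod_rf, validation)  # random forest, the best-performing model
pred_final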

ANNEX

ANNEX 1: Variables with almost only NA values

var percentage of NAs
kurtosis_roll_belt 0.9793089
kurtosis_picth_belt 0.9793089
kurtosis_yaw_belt 0.9793089
skewness_roll_belt 0.9797676
skewness_roll_belt.1 0.9793089
skewness_yaw_belt 0.9793089
max_roll_belt 0.9793089
max_picth_belt 0.9793089
max_yaw_belt 0.9793089
min_roll_belt 0.9793089
min_pitch_belt 0.9793089
min_yaw_belt 0.9793089
amplitude_roll_belt 0.9793089
amplitude_pitch_belt 0.9793089
amplitude_yaw_belt 0.9793089
var_total_accel_belt 0.9793089
avg_roll_belt 0.9793089
stddev_roll_belt 0.9793089
var_roll_belt 0.9793089
avg_pitch_belt 0.9793089
stddev_pitch_belt 0.9793089
var_pitch_belt 0.9793089
avg_yaw_belt 0.9793089
stddev_yaw_belt 0.9793089
var_yaw_belt 0.9793089
var_accel_arm 0.9793089
avg_roll_arm 0.9793089
stddev_roll_arm 0.9793089
var_roll_arm 0.9793089
avg_pitch_arm 0.9793089
stddev_pitch_arm 0.9793089
var_pitch_arm 0.9793089
avg_yaw_arm 0.9793089
stddev_yaw_arm 0.9793089
var_yaw_arm 0.9793089
kurtosis_roll_arm 0.9793089
kurtosis_picth_arm 0.9793089
kurtosis_yaw_arm 0.9793089
skewness_roll_arm 0.9793089
skewness_pitch_arm 0.9793089
skewness_yaw_arm 0.9793089
max_roll_arm 0.9793089
max_picth_arm 0.9793089
max_yaw_arm 0.9793089
min_roll_arm 0.9793089
min_pitch_arm 0.9793089
min_yaw_arm 0.9793089
amplitude_roll_arm 0.9793089
amplitude_pitch_arm 0.9793089
amplitude_yaw_arm 0.9793089
kurtosis_roll_dumbbell 0.9793089
kurtosis_picth_dumbbell 0.9793089
kurtosis_yaw_dumbbell 0.9793089
skewness_roll_dumbbell 0.9795128
skewness_pitch_dumbbell 0.9793599
skewness_yaw_dumbbell 0.9793089
max_roll_dumbbell 0.9793089
max_picth_dumbbell 0.9793089
max_yaw_dumbbell 0.9793089
min_roll_dumbbell 0.9793089
min_pitch_dumbbell 0.9793089
min_yaw_dumbbell 0.9793089
amplitude_roll_dumbbell 0.9793089
amplitude_pitch_dumbbell 0.9793089
amplitude_yaw_dumbbell 0.9793089
var_accel_dumbbell 0.9793089
avg_roll_dumbbell 0.9793089
stddev_roll_dumbbell 0.9793089
var_roll_dumbbell 0.9793089
avg_pitch_dumbbell 0.9793089
stddev_pitch_dumbbell 0.9793089
var_pitch_dumbbell 0.9793089
avg_yaw_dumbbell 0.9793089
stddev_yaw_dumbbell 0.9793089
var_yaw_dumbbell 0.9793089
kurtosis_roll_forearm 0.9793089
kurtosis_picth_forearm 0.9793089
kurtosis_yaw_forearm 0.9793089
skewness_roll_forearm 0.9793089
skewness_pitch_forearm 0.9793089
skewness_yaw_forearm 0.9793089
max_roll_forearm 0.9793089
max_picth_forearm 0.9793089
max_yaw_forearm 0.9793089
min_roll_forearm 0.9793089
min_pitch_forearm 0.9793089
min_yaw_forearm 0.9793089
amplitude_roll_forearm 0.9793089
amplitude_pitch_forearm 0.9793089
amplitude_yaw_forearm 0.9793089
var_accel_forearm 0.9793089
avg_roll_forearm 0.9793089
stddev_roll_forearm 0.9793089
var_roll_forearm 0.9793089
avg_pitch_forearm 0.9793089
stddev_pitch_forearm 0.9793089
var_pitch_forearm 0.9793089
avg_yaw_forearm 0.9793089
stddev_yaw_forearm 0.9793089
var_yaw_forearm 0.9793089

ANNEX 2: Correlation matrix