Data on the execution of an exercise are provided. The data cover 6 participants and 5 classes (the classe variable), each describing a way of performing the exercise:
1. exactly according to the specification (Class A),
2. throwing the elbows to the front (Class B),
3. lifting the dumbbell only halfway (Class C),
4. lowering the dumbbell only halfway (Class D),
5. throwing the hips to the front (Class E).
The goal is to predict the manner in which the participants did the exercise.
Since the participants performed the exercise both correctly and incorrectly, the first step was to look at all the participants together with the Class A records to see whether anything noticeable stands out.
The column new_window takes only 2 values, ‘yes’ and ‘no’. I kept only the rows with value ‘no’, since these make up the majority of the records (19216).
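A minimal sketch of this filtering step in base R. The data frame below is a small made-up stand-in for the training data, and the column roll_belt and the object names are only illustrative assumptions:

```r
# Toy stand-in for the training data (values are made up for illustration).
training <- data.frame(
  new_window = c("no", "yes", "no", "no"),
  roll_belt  = c(1.41, 1.48, 1.42, 1.45),
  classe     = c("A", "A", "B", "E"),
  stringsAsFactors = TRUE
)

# Count how many records fall under each value of new_window.
window_counts <- table(training$new_window)

# Keep only the rows where new_window is "no".
training_no <- training[training$new_window == "no", ]
```

On the real data, window_counts is where a figure such as the 19216 ‘no’ records would show up.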
Determine which columns to use.
The data include both numeric and factor variables, and either kind could affect the performance of the model. For now, only the numeric columns are considered.
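One way to pick out the numeric columns, sketched in base R on a made-up data frame. The column names and values are assumptions for illustration, not the actual dataset:

```r
# Toy data frame mixing factor and numeric columns (made up for illustration).
df <- data.frame(
  user_name        = factor(c("p1", "p2", "p1")),
  roll_belt        = c(1.41, 1.48, 1.42),
  total_accel_belt = c(3L, 4L, 3L),
  classe           = factor(c("A", "B", "A"))
)

# TRUE for every column that is numeric (integer or double), FALSE otherwise.
is_num <- vapply(df, is.numeric, logical(1))

# Names of the numeric columns and the corresponding predictor table.
numeric_cols <- names(df)[is_num]
predictors   <- df[, is_num, drop = FALSE]
```

The outcome column classe is a factor, so it drops out of predictors and has to be passed to the model separately.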
Keeping my computer’s hardware in mind, I ran cross-validation with cv-folds = 5 and cv-folds = 10.
I plotted the cross-validated error against the number of predictors to decide how many columns to keep. Around 7-10 variables play a vital role in determining how the exercise was executed.
with(res, plot(n.var,error.cv,log="x",type= "o", lwd=2))
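The plot call above expects an object res with the fields n.var and error.cv. A sketch of how such an object might be produced, assuming the rfcv function from the randomForest package and using synthetic data in place of the real predictors:

```r
library(randomForest)  # assumption: the randomForest package is installed

set.seed(1)

# Synthetic predictors X1..X8 and a two-level outcome (illustration only).
n <- 150
x <- data.frame(matrix(rnorm(n * 8), ncol = 8))
y <- factor(ifelse(x$X1 + x$X2 > 0, "A", "B"))

# Cross-validated error for successively smaller sets of predictors.
res <- rfcv(trainx = x, trainy = y, cv.fold = 5)

# Same plot as above: number of variables (log scale) against CV error.
with(res, plot(n.var, error.cv, log = "x", type = "o", lwd = 2))
```

The point where the curve flattens suggests how many predictors are worth keeping.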
To find these columns, I ranked the predictors by their ‘MeanDecreaseGini’ value (the mean decrease in the Gini index) and picked the 9 columns with the highest values for the analysis.
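A sketch of this ranking, assuming the randomForest package and synthetic data. The object names fit, ranked and top_cols are hypothetical:

```r
library(randomForest)  # assumption: the randomForest package is installed

set.seed(1)

# Synthetic predictors X1..X12 and a two-level outcome (illustration only).
n <- 200
x <- data.frame(matrix(rnorm(n * 12), ncol = 12))
y <- factor(ifelse(x$X1 - x$X2 > 0, "A", "B"))

fit <- randomForest(x = x, y = y, ntree = 200)

# For a classification forest, importance() returns a matrix with a
# "MeanDecreaseGini" column and one row per predictor.
imp <- importance(fit)

# Sort predictors from most to least important and keep the top 9 names.
ranked   <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
top_cols <- rownames(ranked)[1:9]
```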
I fitted a random forest model on the training dataset to predict how each participant executed the exercise.
Not much emphasis was placed on estimating the out-of-sample error; I moved on to applying the model.
I then took the test dataset, which has 20 test cases, kept only the columns used to train the model, and predicted the class for each of the 20 test cases.
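A sketch of the fit-and-predict step, again assuming the randomForest package. The data are synthetic, and make_data, extra_id and top_cols are hypothetical names used only for this illustration:

```r
library(randomForest)  # assumption: the randomForest package is installed

set.seed(1)

# Helper that builds a synthetic data set with predictors X1..X5 and classe.
make_data <- function(n) {
  d <- data.frame(matrix(rnorm(n * 5), ncol = 5))
  d$classe <- factor(ifelse(d$X1 > 0, "A", "B"), levels = c("A", "B"))
  d
}

training <- make_data(200)
testing  <- make_data(20)
testing$classe   <- NULL   # the test cases come without labels
testing$extra_id <- 1:20   # a column that was not used for training

# Columns selected for the model (stand-in for the top 9 columns).
top_cols <- c("X1", "X2", "X3")

# Fit on the selected training columns only.
fit <- randomForest(x = training[, top_cols], y = training$classe, ntree = 200)

# Restrict the test set to the same columns before predicting.
pred <- predict(fit, newdata = testing[, top_cols])
```

Subsetting the test set with the same top_cols vector keeps the training and prediction inputs aligned.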