Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The exercises are classified as:

- Class A: exactly according to the specification
- Class B: throwing the elbows to the front
- Class C: lifting the dumbbell only halfway
- Class D: lowering the dumbbell only halfway
- Class E: throwing the hips to the front
Loading up the required libraries and the data.
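A minimal sketch of this step (assuming readr::read_csv with the course URLs listed at the end of this report; the messages below are the column specifications it prints):

library(readr)    #read_csv prints the column specifications shown below
library(caret)    #loads lattice and ggplot2 as dependencies
library(corrplot)

training <- read_csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
testing <- read_csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")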
## corrplot 0.84 loaded
## Loading required package: lattice
## Loading required package: ggplot2
## Parsed with column specification:
## cols(
## .default = col_character(),
## X1 = col_integer(),
## raw_timestamp_part_1 = col_integer(),
## raw_timestamp_part_2 = col_integer(),
## num_window = col_integer(),
## roll_belt = col_double(),
## pitch_belt = col_double(),
## yaw_belt = col_double(),
## total_accel_belt = col_integer(),
## gyros_belt_x = col_double(),
## gyros_belt_y = col_double(),
## gyros_belt_z = col_double(),
## accel_belt_x = col_integer(),
## accel_belt_y = col_integer(),
## accel_belt_z = col_integer(),
## magnet_belt_x = col_integer(),
## magnet_belt_y = col_integer(),
## magnet_belt_z = col_integer(),
## roll_arm = col_double(),
## pitch_arm = col_double(),
## yaw_arm = col_double()
## # ... with 37 more columns
## )
## See spec(...) for full column specifications.
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_integer(),
## user_name = col_character(),
## raw_timestamp_part_1 = col_integer(),
## raw_timestamp_part_2 = col_integer(),
## cvtd_timestamp = col_character(),
## new_window = col_character(),
## num_window = col_integer(),
## total_accel_belt = col_integer(),
## kurtosis_roll_belt = col_character(),
## kurtosis_picth_belt = col_character(),
## kurtosis_yaw_belt = col_character(),
## skewness_roll_belt = col_character(),
## skewness_roll_belt.1 = col_character(),
## skewness_yaw_belt = col_character(),
## max_picth_belt = col_integer(),
## max_yaw_belt = col_character(),
## min_pitch_belt = col_integer(),
## min_yaw_belt = col_character(),
## amplitude_pitch_belt = col_integer(),
## amplitude_yaw_belt = col_character()
## # ... with 46 more columns
## )
## See spec(...) for full column specifications.
Creating a temporary lookup table, a kind of legend for finding and referring to predictors by position (there are 160 columns in total in the training set).
#Position of "classe" column (outcome) = 155
col_names<-names(training)[order(names(training),decreasing = FALSE)]
col_order<-order(names(training),decreasing = FALSE)
all_cols_pos<-data.frame(cbind(col_names,col_order))
all_cols_pos[all_cols_pos$col_names == "classe",]
## col_names col_order
## 37 classe 160
By looking into the data we can see that the first 5 columns (the row index X1, user_name, and the raw and converted timestamps) are identifiers rather than sensor measurements, so they will not help in predicting the response "classe". We will delete them.
#Removing the first 5 columns because they are not helpful for prediction
training_red <- training[, -c(1:5)]
testing_red <- testing[, -c(1:5)]
A quick look at the summary tells us that many other columns have lots of NAs and thus need to be investigated further, removing only the columns whose NA proportion is so high that they cannot be sensibly imputed.
#Summary of training
#Reveals presence of many columns with NAs
summary(training_red)
#Removing columns with NAs
prop_NA <- vector()
col_idx <- vector()
for (i in 1:(ncol(training_red) - 1))  #all predictor columns, excluding the last column ("classe")
{
  col_idx[i] <- i
  prop_NA[i] <- round(sum(is.na(training_red[, i])) / nrow(training_red), 3)
}
col_checks_NA <- data.frame(col_idx, prop_NA)
x <- col_checks_NA[col_checks_NA$prop_NA > 0.5, ]  #100 columns where the proportion of NAs exceeds 0.5
training_red_final <- training_red[, -c(x[, 1])]
testing_red_final <- testing_red[, -c(x[, 1])]
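The same check can be done without an explicit loop; this is a sketch of an equivalent vectorized computation, not part of the original analysis:

#Equivalent vectorized version (sketch): proportion of NAs per predictor column
prop_NA_vec <- colMeans(is.na(training_red[, -ncol(training_red)]))
which(prop_NA_vec > 0.5)  #the same 100 high-NA columns found by the loop above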
After getting rid of the columns with more than half of their values missing, we check whether the remaining variables carry enough variance, since near-zero-variance predictors provide little information. We find that the predictor "new_window" has near-zero variance, so it should also be removed.
#Checking variance among predictors, in case we need PCA preprocessing
zero_var <- nearZeroVar(training_red_final, saveMetrics = TRUE)
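To see which predictors are flagged, the saved metrics can be filtered (a small inspection sketch):

#Show the flagged near-zero-variance predictors ("new_window" has nzv = TRUE)
zero_var[zero_var$nzv, ]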
Removing "new_window" and "num_window":
#Removing new_window because it has near-zero variance,
#and num_window because its percentUnique is too low
training_red_final <- training_red_final[, -c(1, 2)]
testing_red_final <- testing_red_final[, -c(1, 2)]
We still find 3 NA cells in the training data. Because the correlation computation below cannot handle NAs, we remove the affected observation.
sum(is.na(training_red_final))  #training has 3 NA cells
## [1] 3
sum(is.na(testing_red_final))  #testing has none
## [1] 0
training_red_final2 <- training_red_final[complete.cases(training_red_final), ]
which(!complete.cases(training_red_final))  #locate the incomplete row
## [1] 5373
The 5373rd observation in training_red_final contained all 3 NAs.
Checking the independence (correlation structure) of the predictors:
res <- cor(training_red_final2[, -53])  #column 53 is the outcome "classe"
corrplot(res, type = "upper", order = "hclust")
From the corrplot we can see that predictors such as accel_belt_z are highly negatively correlated with accel_belt_y and two others, so multicollinearity exists. Based on manual inspection, we will remove the following columns because of their correlation with other predictors (left: column name; right: position in the training_red_final2 table):
#accel_belt_z    10
#accel_arm_y     22
#yaw_belt         3
#magnet_belt_x   11
training_red_final3 <- training_red_final2[, -c(10, 22, 3, 11)]
testing_red_final3 <- testing_red_final[, -c(10, 22, 3, 11)]
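As an aside, caret can automate this kind of selection; the sketch below was not used in this analysis, and the 0.9 cutoff is an arbitrary choice:

#Automated alternative (sketch): let caret flag highly correlated columns
high_corr <- findCorrelation(res, cutoff = 0.9)
names(training_red_final2)[high_corr]  #candidate columns to drop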
Creating training and testing datasets within the original training dataset to compare out-of-sample error rates from 3 different models: a random forest, a gradient boosted model, and linear discriminant analysis.
train_obj <- createDataPartition(training_red_final3$classe, p = 0.6, list = FALSE)
intrain <- training_red_final3[train_obj, ]
intest <- training_red_final3[-train_obj, ]
system.time(mod_rf <- train(classe ~ ., data = intrain, method = "rf"))
system.time(mod_gbm <- train(classe ~ ., data = intrain, method = "gbm"))
system.time(mod_lda <- train(classe ~ ., data = intrain, method = "lda"))
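train()'s default bootstrap resampling is slow for random forests; a common alternative is k-fold cross-validation via trainControl. This is a sketch, not what produced the results below, and the seed and fold count are arbitrary choices:

set.seed(12345)  #arbitrary seed, for reproducibility
ctrl <- trainControl(method = "cv", number = 5)  #5-fold cross-validation
mod_rf_cv <- train(classe ~ ., data = intrain, method = "rf", trControl = ctrl)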
pred_rf <- predict(mod_rf, intest)
confusionMatrix(pred_rf, intest$classe)
#Accuracy : 0.9874
pred_gbm <- predict(mod_gbm, intest)
confusionMatrix(pred_gbm, intest$classe)
#Accuracy : 0.9517
pred_lda <- predict(mod_lda, intest)
confusionMatrix(pred_lda, intest$classe)
#Accuracy : 0.6868
Since the random forest gave us the best out-of-sample accuracy (RF = 98.7%, GBM = 95.2%, LDA = 68.7%), we will use the random forest model created above on the final testing dataset.
pred_testing <- predict(mod_rf, testing_red_final3)
pred_testing
Following are the predictions, all 20 of which were correct when checked against the provided answers.
Testing Observation | Prediction |
---|---|
1 | B |
2 | A |
3 | B |
4 | A |
5 | A |
6 | E |
7 | D |
8 | B |
9 | A |
10 | A |
11 | B |
12 | C |
13 | B |
14 | A |
15 | E |
16 | E |
17 | A |
18 | B |
19 | B |
20 | B |
Following are the top predictors, ranked by importance:
varImp(mod_rf)
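The importance ranking can also be plotted (a sketch; top = 10 is an arbitrary choice):

#Plot the 10 most important predictors of the random forest model
plot(varImp(mod_rf), top = 10)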
According to the above code, roll_belt, pitch_forearm, roll_forearm, magnet_dumbbell_z, and pitch_belt are the top 5 predictors by importance. Predictors with very low importance could be removed from the model in the future to simplify it with little loss of accuracy.
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har
The data were published in:
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.