The goal of this project is to predict the ‘classe’ variable from a set of predictors. We start by loading the datasets and the ‘caret’ and ‘ggplot2’ packages in R. In each dataset, we drop the first variable, ‘X’, since it only holds the row numbers.
# Loading packages
library(ggplot2)
library(caret)
# Loading training and testing sets
training<-read.csv("pml-training.csv"); testing<-read.csv("pml-testing.csv")
training<-training[,-1]; testing<-testing[,-1]
dim(training); dim(testing)
## [1] 19622 159
## [1] 20 159
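# Data preparation: stack both sets so that they share the same columns
# (rbind.fill() comes from the plyr package, which is not attached above)
data<-plyr::rbind.fill(training,testing)
training<-data[1:19622,];testing<-data[19623:19642,]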
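To see what the stacking step does, the sketch below mimics a fill-style row bind on two toy data frames in base R. `bind_fill` is a hypothetical helper written for illustration, not the plyr implementation, and the toy columns are made up; it assumes each set has one column the other lacks, as with `classe` and `problem_id` here.

```r
# Hypothetical base-R stand-in for a fill-style row bind (illustration only)
bind_fill <- function(a, b) {
  all_cols <- union(names(a), names(b))
  # Add any column a frame lacks, filled with NA
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  rbind(a[, all_cols], b[, all_cols])
}

# Toy data: the training set has the outcome, the testing set has an id
toy_train <- data.frame(roll_belt = c(1.4, 1.5), classe = c("A", "B"))
toy_test  <- data.frame(roll_belt = 123, problem_id = 1L)

stacked <- bind_fill(toy_train, toy_test)
print(stacked)
```

In the stacked result, `problem_id` is missing for every training row and `classe` is missing for every testing row, so both sets end up with the same columns after they are split apart again.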
Now, let’s explore the ‘classe’ variable in the training set by tabulating it.
# Looking at the outcome variable in the training set
table(training$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
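As a small aside, the counts above can be turned into class proportions, which also gives the accuracy of a naive rule that always predicts the most frequent class. This is only a sketch in base R; the counts are copied by hand from the table output, and the name `baseline_acc` is introduced here for illustration.

```r
# Class counts copied from the table output above
counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

# Share of each class in the training set
props <- counts / sum(counts)
print(round(props, 3))

# Accuracy of always predicting the most frequent class
baseline_acc <- max(props)
cat("Majority-class baseline accuracy:", round(baseline_acc, 3), "\n")
```

Any useful classifier should do clearly better than this baseline.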
From the table output, we can see that the outcome is a categorical variable with five levels (A to E). This is therefore a classification problem rather than a regression problem. Since the outcome is not binary either, we cannot rely on ordinary (binomial) logistic regression. Now let’s explore our training set a little further.
# Removing near-zero-variance covariates: variables with little or no variability are poor predictors
nzv<-nearZeroVar(training,saveMetrics = TRUE)
nzv[nzv$nzv,]
## freqRatio percentUnique zeroVar nzv
## new_window 47.33005 0.01019264 FALSE TRUE
## kurtosis_roll_belt 1921.60000 2.02323922 FALSE TRUE
## kurtosis_picth_belt 600.50000 1.61553358 FALSE TRUE
## kurtosis_yaw_belt 47.33005 0.01019264 FALSE TRUE
## skewness_roll_belt 2135.11111 2.01304658 FALSE TRUE
## skewness_roll_belt.1 600.50000 1.72255631 FALSE TRUE
## skewness_yaw_belt 47.33005 0.01019264 FALSE TRUE
## max_yaw_belt 640.53333 0.34654979 FALSE TRUE
## min_yaw_belt 640.53333 0.34654979 FALSE TRUE
## amplitude_yaw_belt 50.04167 0.02038528 FALSE TRUE
## avg_roll_arm 77.00000 1.68178575 FALSE TRUE
## stddev_roll_arm 77.00000 1.68178575 FALSE TRUE
## var_roll_arm 77.00000 1.68178575 FALSE TRUE
## avg_pitch_arm 77.00000 1.68178575 FALSE TRUE
## stddev_pitch_arm 77.00000 1.68178575 FALSE TRUE
## var_pitch_arm 77.00000 1.68178575 FALSE TRUE
## avg_yaw_arm 77.00000 1.68178575 FALSE TRUE
## stddev_yaw_arm 80.00000 1.66649679 FALSE TRUE
## var_yaw_arm 80.00000 1.66649679 FALSE TRUE
## kurtosis_roll_arm 246.35897 1.68178575 FALSE TRUE
## kurtosis_picth_arm 240.20000 1.67159311 FALSE TRUE
## kurtosis_yaw_arm 1746.90909 2.01304658 FALSE TRUE
## skewness_roll_arm 249.55844 1.68688207 FALSE TRUE
## skewness_pitch_arm 240.20000 1.67159311 FALSE TRUE
## skewness_yaw_arm 1746.90909 2.01304658 FALSE TRUE
## max_roll_arm 25.66667 1.47793293 FALSE TRUE
## min_roll_arm 19.25000 1.41677709 FALSE TRUE
## min_pitch_arm 19.25000 1.47793293 FALSE TRUE
## amplitude_roll_arm 25.66667 1.55947406 FALSE TRUE
## amplitude_pitch_arm 20.00000 1.49831821 FALSE TRUE
## kurtosis_roll_dumbbell 3843.20000 2.02833554 FALSE TRUE
## kurtosis_picth_dumbbell 9608.00000 2.04362450 FALSE TRUE
## kurtosis_yaw_dumbbell 47.33005 0.01019264 FALSE TRUE
## skewness_roll_dumbbell 4804.00000 2.04362450 FALSE TRUE
## skewness_pitch_dumbbell 9608.00000 2.04872082 FALSE TRUE
## skewness_yaw_dumbbell 47.33005 0.01019264 FALSE TRUE
## max_yaw_dumbbell 960.80000 0.37203139 FALSE TRUE
## min_yaw_dumbbell 960.80000 0.37203139 FALSE TRUE
## amplitude_yaw_dumbbell 47.92020 0.01528896 FALSE TRUE
## kurtosis_roll_forearm 228.76190 1.64101519 FALSE TRUE
## kurtosis_picth_forearm 226.07059 1.64611151 FALSE TRUE
## kurtosis_yaw_forearm 47.33005 0.01019264 FALSE TRUE
## skewness_roll_forearm 231.51807 1.64611151 FALSE TRUE
## skewness_pitch_forearm 226.07059 1.62572623 FALSE TRUE
## skewness_yaw_forearm 47.33005 0.01019264 FALSE TRUE
## max_roll_forearm 27.66667 1.38110284 FALSE TRUE
## max_yaw_forearm 228.76190 0.22933442 FALSE TRUE
## min_roll_forearm 27.66667 1.37091020 FALSE TRUE
## min_yaw_forearm 228.76190 0.22933442 FALSE TRUE
## amplitude_roll_forearm 20.75000 1.49322189 FALSE TRUE
## amplitude_yaw_forearm 59.67702 0.01528896 FALSE TRUE
## avg_roll_forearm 27.66667 1.64101519 FALSE TRUE
## stddev_roll_forearm 87.00000 1.63082255 FALSE TRUE
## var_roll_forearm 87.00000 1.63082255 FALSE TRUE
## avg_pitch_forearm 83.00000 1.65120783 FALSE TRUE
## stddev_pitch_forearm 41.50000 1.64611151 FALSE TRUE
## var_pitch_forearm 83.00000 1.65120783 FALSE TRUE
## avg_yaw_forearm 83.00000 1.65120783 FALSE TRUE
## stddev_yaw_forearm 85.00000 1.64101519 FALSE TRUE
## var_yaw_forearm 85.00000 1.64101519 FALSE TRUE
## problem_id 0.00000 0.00000000 TRUE TRUE
nzv <- nearZeroVar(training)
filteredTraining <- training[, -nzv]
filteredTesting<-testing[,-nzv]
dim(filteredTraining)
## [1] 19622 99
dim(filteredTesting)
## [1] 20 99
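The two metrics in the table above can be reproduced by hand. The sketch below is a simplified base-R version, not caret's own code: `nzv_metrics` is a hypothetical helper, the cutoffs (95/5 and 10) are assumed to match the defaults of `nearZeroVar`, and the toy column uses counts chosen so that they reproduce the values reported for `new_window`.

```r
# Hypothetical helper: the two metrics behind a near-zero-variance flag
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  tab <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common's
  freq_ratio <- if (length(tab) > 1) as.numeric(tab[1] / tab[2]) else 0
  # Distinct values as a percentage of the number of observations
  percent_unique <- 100 * length(unique(x)) / length(x)
  zero_var <- length(tab) <= 1
  list(freqRatio = freq_ratio,
       percentUnique = percent_unique,
       zeroVar = zero_var,
       nzv = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut))
}

# Toy column: counts chosen to reproduce the metrics reported for new_window
new_window <- rep(c("no", "yes"), times = c(19216, 406))
m <- nzv_metrics(new_window)
print(unlist(m))
```

A column is flagged when one value dominates (high `freqRatio`) and there are few distinct values relative to the sample size (low `percentUnique`), or when it takes a single value only.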
We will now use the filtered training set to fit our predictive model. Since the training set contains many missing values, we decide to impute them. We will use the random forest classifier.
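The imputation step itself is not shown in the code of this report. One simple possibility, assumed here purely for illustration, is median imputation of the numeric columns. `impute_median` is a hypothetical helper and the toy values are made up; this is a sketch, not the method actually used for the fit below.

```r
# Hypothetical helper: replace NAs in numeric columns with the column median
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# Toy data with one missing value per numeric column
toy <- data.frame(roll_belt = c(1.4, NA, 1.6, 1.5),
                  yaw_belt  = c(-94, -94, NA, -93),
                  classe    = c("A", "A", "B", "B"))
imputed <- impute_median(toy)
print(imputed)
```

In practice the medians should be computed on the training set only and then reused to fill the testing set, so that no information from the testing set leaks into the model.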
# Model fit
modFit<-train(classe~.,data = filteredTraining,method='rf')
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.2.1
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
modFit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 61
##
## OOB estimate of error rate: 13.79%
## Confusion matrix:
## A B C D E class.error
## A 98 6 2 2 1 0.1009174
## B 9 62 5 2 1 0.2151899
## C 3 4 62 1 0 0.1142857
## D 5 1 4 59 0 0.1449275
## E 1 4 1 4 69 0.1265823
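The reported error rates can be checked directly from the confusion matrix. The sketch below re-enters the matrix by hand from the output above and recomputes the out-of-bag (OOB) error and the per-class errors in base R; the variable names are introduced here for illustration.

```r
# Confusion matrix copied from the finalModel output (rows = actual class)
cm <- matrix(c(98,  6,  2,  2,  1,
                9, 62,  5,  2,  1,
                3,  4, 62,  1,  0,
                5,  1,  4, 59,  0,
                1,  4,  1,  4, 69),
             nrow = 5, byrow = TRUE,
             dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5]))

n_obs <- sum(cm)
# Overall error: share of observations off the diagonal
oob_error <- 1 - sum(diag(cm)) / n_obs
# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(cm) / rowSums(cm)

cat("Observations in the matrix:", n_obs, "\n")
cat("OOB error rate:", round(100 * oob_error, 2), "%\n")
print(round(class_error, 4))
```

The matrix covers 406 observations, far fewer than the 19622 rows of the training set. This suggests that rows with missing values were dropped before the forest was grown, which is worth checking against the intended imputation.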