The goal of this project is to predict the ‘classe’ variable from a set of predictors. We start by loading the datasets and the ‘caret’ and ‘ggplot2’ packages in R. In each dataset, we drop the first variable, ‘X’, since it only holds the row numbers.

# Loading packages
library(ggplot2)
library(caret)

# Loading training and testing sets
training<-read.csv("pml-training.csv"); testing<-read.csv("pml-testing.csv")
training<-training[,-1]; testing<-testing[,-1]
dim(training); dim(testing)
## [1] 19622   159
## [1]  20 159
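# Data preparation: stack both sets so that they share the same columns
# (rbind.fill comes from the plyr package, so it is called with its namespace)
data<-plyr::rbind.fill(training,testing)
training<-data[1:19622,];testing<-data[19623:19642,]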
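
The stacking step above works because `rbind.fill` pads columns that exist in only one of the two data frames with `NA`. The following is a minimal base R sketch of that idea on toy data, not the plyr implementation itself; the data frames `train_toy` and `test_toy` and the helper `pad` are hypothetical names used only for illustration.

```r
# Toy stand-ins: the training set has 'classe', the testing set has 'problem_id'
train_toy <- data.frame(x1 = c(1, 2, 3), classe = c("A", "B", "A"))
test_toy  <- data.frame(x1 = c(4, 5), problem_id = c(1, 2))

# Union of all column names found in either data frame
all_cols <- union(names(train_toy), names(test_toy))

# Add every missing column as NA, then put the columns in a common order
pad <- function(df, cols) {
  for (col in setdiff(cols, names(df))) df[[col]] <- NA
  df[, cols, drop = FALSE]
}

combined <- rbind(pad(train_toy, all_cols), pad(test_toy, all_cols))
print(dim(combined))
print(combined)
```

After stacking, both halves have identical columns, so the same column indices can later be dropped from each of them.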

Now, let’s explore the ‘classe’ variable in the training set. For this purpose, we simply tabulate the variable.

# Looking at the outcome variable in the training set
table(training$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607
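
The counts above can also be read as proportions, which gives a simple reference point for any classifier: the accuracy of always predicting the most frequent class. The sketch below only reuses the five counts printed above; the variable names are illustrative.

```r
# Class counts copied from the table above
counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

# Share of each class among the training rows
proportions <- counts / sum(counts)
print(round(proportions, 3))

# Accuracy of a naive rule that always predicts the most frequent class
majority_class <- names(which.max(counts))
baseline_accuracy <- max(proportions)
print(majority_class)
print(round(baseline_accuracy, 3))
```

Class A is the most frequent, with roughly 28% of the rows, so a useful model should do clearly better than that.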

From the table output, we can see that the outcome is a categorical variable with five levels (A to E). This is therefore a classification problem rather than a regression problem. Since the outcome is not binary either, we cannot rely on ordinary (binary) logistic regression. Now let’s explore our training set a little further.

# Removing near-zero-variance covariates: identifying variables that have little or no variability and hence are poor predictors
nzv<-nearZeroVar(training,saveMetrics = TRUE)
nzv[nzv$nzv,]
##                          freqRatio percentUnique zeroVar  nzv
## new_window                47.33005    0.01019264   FALSE TRUE
## kurtosis_roll_belt      1921.60000    2.02323922   FALSE TRUE
## kurtosis_picth_belt      600.50000    1.61553358   FALSE TRUE
## kurtosis_yaw_belt         47.33005    0.01019264   FALSE TRUE
## skewness_roll_belt      2135.11111    2.01304658   FALSE TRUE
## skewness_roll_belt.1     600.50000    1.72255631   FALSE TRUE
## skewness_yaw_belt         47.33005    0.01019264   FALSE TRUE
## max_yaw_belt             640.53333    0.34654979   FALSE TRUE
## min_yaw_belt             640.53333    0.34654979   FALSE TRUE
## amplitude_yaw_belt        50.04167    0.02038528   FALSE TRUE
## avg_roll_arm              77.00000    1.68178575   FALSE TRUE
## stddev_roll_arm           77.00000    1.68178575   FALSE TRUE
## var_roll_arm              77.00000    1.68178575   FALSE TRUE
## avg_pitch_arm             77.00000    1.68178575   FALSE TRUE
## stddev_pitch_arm          77.00000    1.68178575   FALSE TRUE
## var_pitch_arm             77.00000    1.68178575   FALSE TRUE
## avg_yaw_arm               77.00000    1.68178575   FALSE TRUE
## stddev_yaw_arm            80.00000    1.66649679   FALSE TRUE
## var_yaw_arm               80.00000    1.66649679   FALSE TRUE
## kurtosis_roll_arm        246.35897    1.68178575   FALSE TRUE
## kurtosis_picth_arm       240.20000    1.67159311   FALSE TRUE
## kurtosis_yaw_arm        1746.90909    2.01304658   FALSE TRUE
## skewness_roll_arm        249.55844    1.68688207   FALSE TRUE
## skewness_pitch_arm       240.20000    1.67159311   FALSE TRUE
## skewness_yaw_arm        1746.90909    2.01304658   FALSE TRUE
## max_roll_arm              25.66667    1.47793293   FALSE TRUE
## min_roll_arm              19.25000    1.41677709   FALSE TRUE
## min_pitch_arm             19.25000    1.47793293   FALSE TRUE
## amplitude_roll_arm        25.66667    1.55947406   FALSE TRUE
## amplitude_pitch_arm       20.00000    1.49831821   FALSE TRUE
## kurtosis_roll_dumbbell  3843.20000    2.02833554   FALSE TRUE
## kurtosis_picth_dumbbell 9608.00000    2.04362450   FALSE TRUE
## kurtosis_yaw_dumbbell     47.33005    0.01019264   FALSE TRUE
## skewness_roll_dumbbell  4804.00000    2.04362450   FALSE TRUE
## skewness_pitch_dumbbell 9608.00000    2.04872082   FALSE TRUE
## skewness_yaw_dumbbell     47.33005    0.01019264   FALSE TRUE
## max_yaw_dumbbell         960.80000    0.37203139   FALSE TRUE
## min_yaw_dumbbell         960.80000    0.37203139   FALSE TRUE
## amplitude_yaw_dumbbell    47.92020    0.01528896   FALSE TRUE
## kurtosis_roll_forearm    228.76190    1.64101519   FALSE TRUE
## kurtosis_picth_forearm   226.07059    1.64611151   FALSE TRUE
## kurtosis_yaw_forearm      47.33005    0.01019264   FALSE TRUE
## skewness_roll_forearm    231.51807    1.64611151   FALSE TRUE
## skewness_pitch_forearm   226.07059    1.62572623   FALSE TRUE
## skewness_yaw_forearm      47.33005    0.01019264   FALSE TRUE
## max_roll_forearm          27.66667    1.38110284   FALSE TRUE
## max_yaw_forearm          228.76190    0.22933442   FALSE TRUE
## min_roll_forearm          27.66667    1.37091020   FALSE TRUE
## min_yaw_forearm          228.76190    0.22933442   FALSE TRUE
## amplitude_roll_forearm    20.75000    1.49322189   FALSE TRUE
## amplitude_yaw_forearm     59.67702    0.01528896   FALSE TRUE
## avg_roll_forearm          27.66667    1.64101519   FALSE TRUE
## stddev_roll_forearm       87.00000    1.63082255   FALSE TRUE
## var_roll_forearm          87.00000    1.63082255   FALSE TRUE
## avg_pitch_forearm         83.00000    1.65120783   FALSE TRUE
## stddev_pitch_forearm      41.50000    1.64611151   FALSE TRUE
## var_pitch_forearm         83.00000    1.65120783   FALSE TRUE
## avg_yaw_forearm           83.00000    1.65120783   FALSE TRUE
## stddev_yaw_forearm        85.00000    1.64101519   FALSE TRUE
## var_yaw_forearm           85.00000    1.64101519   FALSE TRUE
## problem_id                 0.00000    0.00000000    TRUE TRUE
# Dropping the flagged columns from both sets (same column indices in each)
nzv <- nearZeroVar(training)
filteredTraining <- training[, -nzv]
filteredTesting <- testing[, -nzv]
dim(filteredTraining)
## [1] 19622    99
dim(filteredTesting)
## [1] 20 99
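
To make the `freqRatio` and `percentUnique` columns in the metrics table more concrete, here is a minimal base R sketch that computes them for a single toy column. It is an approximation of the idea, not caret's code: the helper `nzv_metrics` and the toy vector are hypothetical, and the thresholds 95/5 and 10 are assumed to match the defaults of caret's `freqCut` and `uniqueCut` arguments.

```r
# Hypothetical toy column: one dominant value and one rare value
x <- c(rep("no", 96), rep("yes", 4))

nzv_metrics <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  # Counts of each distinct non-missing value, largest first
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common one's
  freqRatio <- if (length(counts) > 1) counts[1] / counts[2] else 0
  # Distinct values as a percentage of the number of observations
  percentUnique <- 100 * length(counts) / length(x)
  # A column with at most one distinct value carries no information
  zeroVar <- length(counts) <= 1
  data.frame(
    freqRatio = freqRatio,
    percentUnique = percentUnique,
    zeroVar = zeroVar,
    nzv = (freqRatio > freqCut && percentUnique <= uniqueCut) || zeroVar
  )
}

m <- nzv_metrics(x)
print(m)
```

Here the dominant value is 24 times as frequent as the runner-up and only 2% of the values are distinct, so the column is flagged, in the same way as the highly unbalanced columns listed in the table.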

We will now use the filtered training set to fit our predictive model. Since the training set contains many missing values, we decide to impute them. We will use the Random Forest classifier.
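
The model-fitting listing does not show an imputation step, so the following is only a sketch of one common approach, median imputation, written in base R on hypothetical toy data. The column values and the helper `impute_median` are assumptions made for illustration; caret's `preProcess` function also offers imputation methods such as "medianImpute" and "knnImpute", which may be closer to what was intended.

```r
# Toy data frame (hypothetical values) standing in for numeric sensor columns
toy <- data.frame(
  roll_belt  = c(1.4, NA, 1.5, 1.6, NA),
  pitch_belt = c(8.0, 8.1, NA, 8.2, 8.3)
)

# Replace each NA by the median of the observed values in the same column
impute_median <- function(column) {
  column[is.na(column)] <- median(column, na.rm = TRUE)
  column
}

toy_imputed <- as.data.frame(lapply(toy, impute_median))
print(toy_imputed)
```

In a real workflow the medians would be computed on the training set only and then reused to fill the testing set, so that no information from the testing set leaks into the model.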

# Model fit
modFit<-train(classe~.,data = filteredTraining,method='rf')
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.2.1
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
modFit$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 61
## 
##         OOB estimate of  error rate: 13.79%
## Confusion matrix:
##    A  B  C  D  E class.error
## A 98  6  2  2  1   0.1009174
## B  9 62  5  2  1   0.2151899
## C  3  4 62  1  0   0.1142857
## D  5  1  4 59  0   0.1449275
## E  1  4  1  4 69   0.1265823
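
The error figures in this output can be reproduced directly from the confusion matrix, which is a useful check on how to read it. The sketch below copies the printed matrix by hand (rows are the actual classes, columns the predicted ones); the variable names are illustrative.

```r
# OOB confusion matrix copied from the output above (rows = actual class)
cm <- matrix(
  c(98,  6,  2,  2,  1,
     9, 62,  5,  2,  1,
     3,  4, 62,  1,  0,
     5,  1,  4, 59,  0,
     1,  4,  1,  4, 69),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Total number of observations covered by the matrix
n_obs <- sum(cm)

# Overall OOB error: share of observations off the diagonal
oob_error <- 1 - sum(diag(cm)) / n_obs

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(cm) / rowSums(cm)

print(n_obs)
print(round(100 * oob_error, 2))
print(round(class_error, 7))
```

The matrix covers 406 observations, far fewer than the 19622 rows of the training set. This suggests that only the rows without missing values entered the forest; that is an inference from the output rather than something the listing states.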