Introduction

The amount and intensity of activity during physical training is as important as its quality, i.e. whether proper form is maintained during exercise. To classify the quality of an activity in a quantifiable manner, recordings were collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Based on these measurements, five classes (A-E) of activity quality were defined. Our task is to model the classes as a function of the measurements in a training set and to apply that model to a set of 20 queries in the testing set.

Methods and Results

We download the training and testing files, load the caret package to partition the training set into training and validation subsets, and load the randomForest package to build the most accurate model for predicting the classe variable.

# Download and read the files
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", destfile="pml-training.csv")
x<-read.csv("pml-training.csv")
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile="pml-testing.csv")
y<-read.csv("pml-testing.csv")
# Install and open the caret and randomForest packages
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
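
As an aside, the cleaning step below could be simplified by declaring the missing-value codes at read time. This is only a sketch, assuming the raw CSVs mark missing data as NA, empty strings, and "#DIV/0!" entries; it was not used in the analysis that follows.

# Optional: treat empty strings and "#DIV/0!" entries as NA on import,
# so incomplete columns can be dropped with a single is.na() check
x_alt<-read.csv("pml-training.csv", na.strings=c("NA", "", "#DIV/0!"))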

We clean the data by removing variables that are entirely empty, mostly incomplete, or irrelevant to the analysis, such as the first 7 columns, which contain the user id, time stamps, and similar bookkeeping fields. To this end, we use loops and visual inspection.

# Find which columns have NAs and subset the original data frame. First, make a loop, establishing an empty vector z and a counter n, equal to 1.
z<-vector()
n=1
for (i in names(x)) {
        z[n]<-sum(is.na(x[i]))
        n=n+1
}
z2<-data.frame("names"=names(x), "NAs"=z)
z3<-as.character(z2[z2$NAs==0, ]$names)
# subset the original data
x2<-x[z3]
y2<-y[z3[1:92]] # the testing set does not have classe, which is the last column of the x and x2 training sets
# Then, drop columns that consist mostly of empty cells
x3<-apply(x2, 2, as.character) # convert every column to character
x4<-apply(x3, 2, nchar)        # character count of each cell
x5<-apply(x4, 2, sum)          # total character count per column
x8<-x5[x5>19000] # keep well-populated columns; there are approximately 19000 rows
smallset<-names(x8)
x9<-x2[smallset]
x10<-x9[, 8:60] # because the first 7 columns are not needed after visual inspection
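
For reference, the same cleaning can be written more compactly. The sketch below keeps only columns without NAs or empty cells and then drops the first 7 bookkeeping columns; it reproduces the spirit of the loop above rather than its exact intermediate objects.

# Keep columns with no NAs and no empty strings, then drop the first 7 columns
complete_cols<-colSums(is.na(x))==0 & colSums(x=="", na.rm=TRUE)==0
x10_alt<-x[, complete_cols][, -(1:7)]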

We split the data into training and validation sets.

# Split training set into training and validation
inTrain = createDataPartition(y = x10$classe, p = 0.7, list = FALSE)
toTrain = x10[inTrain, ]
toValidate = x10[-inTrain, ]
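
Both createDataPartition and random forest training involve random sampling, so for reproducibility one could fix the seed beforehand. This is a minor addition that was not part of the original run, and the seed value is arbitrary.

# Fix the random seed so the partition and the model are reproducible
set.seed(12345)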

We examine possible correlations among the variables and discover multiple strong correlations, both positive and negative.

# Are the variables correlated? To answer this we load the corrplot package and exclude the classe variable
library(corrplot)
corMat<-cor(x10[, -53])
corrplot(corMat, order = "FPC", method = "color", type = "lower", tl.cex = 0.6, tl.col = rgb(0, 0, 0))
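
Beyond visual inspection, caret provides findCorrelation for flagging highly correlated predictors; a short sketch, with an arbitrary cutoff of 0.8:

# Indices of predictors with pairwise absolute correlation above 0.8
highCor<-findCorrelation(corMat, cutoff=0.8)
length(highCor)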

We apply random forest analysis to fit the measurements to the classe variable and select the most accurate model. Given the correlations among the variables, we choose to preprocess the predictors with principal component analysis. We then predict the classe variable in the validation set. The code we used is the following:

library(randomForest)
modelFit2 <- train(classe ~ ., data = toTrain, preProcess = "pca", method = "rf")
# saveModelFit<-save(modelFit2, file="savemodelFit2")
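
Training a random forest under caret's default bootstrap resampling can be slow on a data set of this size. A possible variant, not what was run above, is to resample with 5-fold cross-validation, which also yields a resampling estimate of the out-of-sample error:

# Same model, but resampled with 5-fold cross-validation to reduce runtime
ctrl <- trainControl(method = "cv", number = 5)
modelFitCV <- train(classe ~ ., data = toTrain, preProcess = "pca", method = "rf", trControl = ctrl)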

We then predict on the validation set and calculate the accuracy.

#predict in the validating set
predictsValid<-predict(modelFit2, toValidate)
results<-confusionMatrix(predictsValid, toValidate$classe)
print (results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    5    1    1    0
##          B    0 1130    5    0    3
##          C    1    3 1012    8    2
##          D    2    0    6  953    6
##          E    0    1    2    2 1071
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9918         
##                  95% CI : (0.9892, 0.994)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9897         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9921   0.9864   0.9886   0.9898
## Specificity            0.9983   0.9983   0.9971   0.9972   0.9990
## Pos Pred Value         0.9958   0.9930   0.9864   0.9855   0.9954
## Neg Pred Value         0.9993   0.9981   0.9971   0.9978   0.9977
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2839   0.1920   0.1720   0.1619   0.1820
## Detection Prevalence   0.2851   0.1934   0.1743   0.1643   0.1828
## Balanced Accuracy      0.9983   0.9952   0.9917   0.9929   0.9944

The accuracy is satisfactory, above 99%. The estimated out-of-sample error rate equals 1 minus the accuracy, i.e. about 0.82% in our case. We therefore expect to predict the classes in the testing set with high accuracy.
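
The same figure can be extracted programmatically from the confusionMatrix object returned above:

# Estimated out-of-sample error rate: 1 minus the validation accuracy
1 - results$overall["Accuracy"]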

predictsTest<-predict(modelFit2, y)
answers=as.character(predictsTest)
print (answers)
##  [1] "B" "A" "C" "A" "A" "B" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"
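
The answers were then submitted to Coursera. Assuming each answer has to be uploaded as a separate text file (the file-naming convention below is a guess), a minimal writer in base R is:

# Write each predicted answer to its own text file, e.g. problem_id_1.txt
for (i in seq_along(answers)) {
        writeLines(answers[i], paste0("problem_id_", i, ".txt"))
}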

After submission to Coursera, we found that problems 3 and 6 were not predicted correctly, which corresponds to an error rate of 10%. I changed the answer to problem 3 to B at random and accidentally submitted it; it turned out to be correct. I did not apply a similar correction to problem 6, because it would not reflect the true behaviour of the algorithm. The small size of the testing set may have led to an overestimation of the error rate; still, the algorithm is fairly accurate, with 90% accuracy on the test queries.