Introduction

The amount and intensity of activity during physical training is as important as its quality, i.e. whether proper form is maintained during exercise. To classify the quality of an activity in a quantifiable manner, recordings were collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Based on these measurements, five classes (A-E) of activity quality were defined. Our task is to model the classes as a function of the measurements in a training set and to apply that model to a set of 20 queries in the testing set.

Methods and Results

We download the training and testing files, load the caret package to partition the training set into training and validation subsets, and load the randomForest package to build the most accurate model for predicting the classe variable.

# Download and read the files
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", destfile="pml-training.csv")
x<-read.csv("pml-training.csv")
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile="pml-testing.csv")
y<-read.csv("pml-testing.csv")
# Install and open the caret and randomForest packages
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
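
As an aside, the cleaning step below could be simplified by declaring the missing-value codes at read time. This is only a sketch, assuming the raw CSVs mark missing data as NA, empty strings, and "#DIV/0!" entries; it was not used in the analysis that follows.

# Optional: treat empty strings and "#DIV/0!" entries as NA on import,
# so incomplete columns can be dropped with a single is.na() check
x_alt<-read.csv("pml-training.csv", na.strings=c("NA", "", "#DIV/0!"))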

We clean the data by removing variables that are entirely empty, mostly incomplete, or irrelevant to the analysis, such as the first 7 columns, which contain the user id, time stamps, and similar bookkeeping fields. To this end, we use loops and visual inspection.

# Find which columns have NAs and subset the original data frame. First, make a loop, establishing an empty vector z and a counter n, equal to 1.
z<-vector()
n=1
for (i in names(x)) {
        z[n]<-sum(is.na(x[i]))
        n=n+1
}
z2<-data.frame("names"=names(x), "NAs"=z)
z3<-as.character(z2[z2$NAs==0, ]$names)
# subset the original data
x2<-x[z3]
y2<-y[z3[1:92]] # the testing set does not have classe, which is the last column of the x and x2 training sets
# Then, drop columns that consist mostly of empty cells
x3<-apply(x2, 2, as.character) # convert every column to character
x4<-apply(x3, 2, nchar)        # character count of each cell
x5<-apply(x4, 2, sum)          # total character count per column
x8<-x5[x5>19000] # keep well-populated columns; there are approximately 19000 rows
smallset<-names(x8)
x9<-x2[smallset]
x10<-x9[, 8:60] # because the first 7 columns are not needed after visual inspection
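
For reference, the same cleaning can be written more compactly. The sketch below keeps only columns without NAs or empty cells and then drops the first 7 bookkeeping columns; it reproduces the spirit of the loop above rather than its exact intermediate objects.

# Keep columns with no NAs and no empty strings, then drop the first 7 columns
complete_cols<-colSums(is.na(x))==0 & colSums(x=="", na.rm=TRUE)==0
x10_alt<-x[, complete_cols][, -(1:7)]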

We split the data into training and validation sets.

# Split training set into training and validation
inTrain = createDataPartition(y = x10$classe, p = 0.7, list = FALSE)
toTrain = x10[inTrain, ]
toValidate = x10[-inTrain, ]
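
Both createDataPartition and random forest training involve random sampling, so for reproducibility one could fix the seed beforehand. This is a minor addition that was not part of the original run, and the seed value is arbitrary.

# Fix the random seed so the partition and the model are reproducible
set.seed(12345)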

We examine possible correlations among the variables and discover multiple strong correlations, both positive and negative.

# Are the variables correlated? To answer this we load the corrplot package and exclude the classe variable
library(corrplot)
corMat<-cor(x10[, -53])
corrplot(corMat, order = "FPC", method = "color", type = "lower", tl.cex = 0.6, tl.col = rgb(0, 0, 0))
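
Beyond visual inspection, caret provides findCorrelation for flagging highly correlated predictors; a short sketch, with an arbitrary cutoff of 0.8:

# Indices of predictors with pairwise absolute correlation above 0.8
highCor<-findCorrelation(corMat, cutoff=0.8)
length(highCor)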

We apply random forest analysis to fit the measurements to the classe variable and select the most accurate model. Given the correlations among the variables, we choose to preprocess the predictors with principal component analysis. We then predict the classe variable in the validation set. The code we used is the following:

library(randomForest)
modelFit2 <- train(classe ~ ., data = toTrain, preProcess = "pca", method = "rf")
# saveModelFit<-save(modelFit2, file="savemodelFit2")
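
Training a random forest under caret's default bootstrap resampling can be slow on a data set of this size. A possible variant, not what was run above, is to resample with 5-fold cross-validation, which also yields a resampling estimate of the out-of-sample error:

# Same model, but resampled with 5-fold cross-validation to reduce runtime
ctrl <- trainControl(method = "cv", number = 5)
modelFitCV <- train(classe ~ ., data = toTrain, preProcess = "pca", method = "rf", trControl = ctrl)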

We then predict on the validation set and calculate the accuracy.

#predict in the validating set
predictsValid<-predict(modelFit2, toValidate)
results<-confusionMatrix(predictsValid, toValidate$classe)
print (results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    5    1    1    0
##          B    0 1130    5    0    3
##          C    1    3 1012    8    2
##          D    2    0    6  953    6
##          E    0    1    2    2 1071
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9918         
##                  95% CI : (0.9892, 0.994)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9897         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9921   0.9864   0.9886   0.9898
## Specificity            0.9983   0.9983   0.9971   0.9972   0.9990
## Pos Pred Value         0.9958   0.9930   0.9864   0.9855   0.9954
## Neg Pred Value         0.9993   0.9981   0.9971   0.9978   0.9977
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2839   0.1920   0.1720   0.1619   0.1820
## Detection Prevalence   0.2851   0.1934   0.1743   0.1643   0.1828
## Balanced Accuracy      0.9983   0.9952   0.9917   0.9929   0.9944

The accuracy is satisfactory, above 99%. The estimated out-of-sample error rate equals 1 minus the accuracy, i.e. about 0.82% in our case. We therefore expect to predict the classes in the testing set with high accuracy.
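
The same figure can be extracted programmatically from the confusionMatrix object returned above:

# Estimated out-of-sample error rate: 1 minus the validation accuracy
1 - results$overall["Accuracy"]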

predictsTest<-predict(modelFit2, y)
answers=as.character(predictsTest)
print (answers)
##  [1] "B" "A" "C" "A" "A" "B" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"
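
The answers were then submitted to Coursera. Assuming each answer has to be uploaded as a separate text file (the file-naming convention below is a guess), a minimal writer in base R is:

# Write each predicted answer to its own text file, e.g. problem_id_1.txt
for (i in seq_along(answers)) {
        writeLines(answers[i], paste0("problem_id_", i, ".txt"))
}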

After submission to Coursera, we found that problems 3 and 6 were not predicted correctly, which corresponds to an error rate of 10%. I changed the answer to problem 3 to B at random and accidentally submitted it; it turned out to be correct. I did not apply a similar correction to problem 6, because it would not reflect the true behaviour of the algorithm. The small size of the testing set may have led to an overestimation of the error rate; still, the algorithm is fairly accurate, with 90% accuracy on the test queries.