This is an analysis for the machine learning project. The dataset contains sensor readings from the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts. The goal of this report is to evaluate whether the exercise is performed correctly. Random Forest and conditional inference trees (ctree) are used.

Load the Data

setwd("E:/machine learning/project1")
library(data.table)
library(caret)
## Warning: package 'caret' was built under R version 3.1.3
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.1.3
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(e1071)
## Warning: package 'e1071' was built under R version 3.1.3
library(party)
## Warning: package 'party' was built under R version 3.1.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.1.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 3.1.3
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.1.3
library(arm)
## Warning: package 'arm' was built under R version 3.1.3
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lme4
## Warning: package 'lme4' was built under R version 3.1.3
## Loading required package: Rcpp
## 
## Attaching package: 'lme4'
## 
## The following object is masked from 'package:modeltools':
## 
##     refit
## 
## 
## arm (Version 1.8-4, built: 2015-04-07)
## 
## Working directory is E:/machine learning/project1
library(kernlab)
## Warning: package 'kernlab' was built under R version 3.1.3
## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:modeltools':
## 
##     prior
read.pml = function(file) {
    fread(file, na.strings=c("#DIV/0!", ""))   ##treat "#DIV/0!" and empty cells as NA
}
test=read.pml("pml-testing.csv")   ##use the same NA handling for both files
train=read.pml("pml-training.csv")
dim(test)
## [1]  20 160
dim(train)
## [1] 19622   160

Tidy Dataset

The test dataset is used to check the results we get; the train dataset is used to perform the data analysis. We will try to reduce the number of variables in the train dataset, tidying it so that only the useful variables remain.

Across the columns of test and train, only one variable differs: the 160th column. We will remove that last column, problem_id, from the test set, so that the test set has 159 columns and the train set has 160.
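
A quick sanity check (a sketch, using the objects loaded above) confirms that the two files share the first 159 column names and differ only in the last one:

#Sketch: verify train and test differ only in the final column
which(names(train) != names(test))   ##expect: 160
names(train)[160]                    ##"classe"
names(test)[160]                     ##"problem_id"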

test=subset(test,select=-160)
dim(test)
## [1]  20 159

There are 160 variables in the dataset. First we can cut the first 5 columns, as they only contain the row index, participant names, and timestamps.

#drop=c("V1","user_name","raw-timestamp_part_1","raw_timestamp_part_2","cvtd_timestap"))
#train[,!(names(train)%in%drop)]
train=subset(train,select=-c(1:5))
test=subset(test,select=-c(1:5))   ##drop the unused columns

Eliminate the columns that are sparse

Many columns carry almost no information. We use caret's nearZeroVar() to flag predictors with (near) zero variance and drop them from both the training dataset and the test dataset.

#Zero Variance Tidying
zerovars <- nearZeroVar(train)
train2=subset(train,select=-c(zerovars))
test2=subset(test,select=-c(zerovars))
dim(train2)
## [1] 19622   119
dim(test2)
## [1]  20 118
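
Optionally, we can inspect what was flagged (a quick sketch using the zerovars indices from above):

length(zerovars)              ##number of near-zero-variance columns
head(names(train)[zerovars])  ##a few of the dropped column names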

Remove columns that have too many NAs.

The next step is to eliminate the columns that have many NAs. Here, columns with more than 50% NAs will be removed.

#NA Tidying
na= apply(train2, 2, function(x) mean(is.na(x)))   ##fraction of NAs per column

drop= which(na> .50)
train3=subset(train2,select=-drop)
test3=subset(test2,select=-drop)

dim(train3)
## [1] 19622    54
dim(test3)
## [1] 20 53

Now both the training and testing datasets are nicely processed; the column counts are reduced to 54 and 53. Note that the test set always has one column fewer than the train set, because we removed problem_id from the test set while the train set keeps classe.
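
As a quick check (sketch), the only column name in train3 that is absent from test3 should be classe:

setdiff(names(train3), names(test3))   ##expect: "classe"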

Subset Training Data Into Two Parts

We split the training data into two parts: 70% to train the model, and 30% as a validation set to estimate the out-of-sample error.

set.seed(8221)
intrain=createDataPartition(y=train3$classe,p=0.7,list=FALSE)
mytrain=train3[intrain[,1]]
mytest=train3[-intrain[,1]]
dim(mytrain);dim(mytest)
## [1] 13737    54
## [1] 5885   54
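
Since createDataPartition() samples within each level of classe, the class proportions in the two parts should be nearly identical. A quick check (sketch):

round(prop.table(table(mytrain$classe)), 3)
round(prop.table(table(mytest$classe)), 3)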

Random Forest

set.seed(8221)
mytrain$classe=as.factor(mytrain$classe)
modfit=randomForest(classe ~ ., data=mytrain, ntree=30)  ##set ntree to shorten the running time
predictit=predict(modfit, mytest, type="class")
acc=confusionMatrix(predictit, as.factor(mytest$classe))
acc
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    1    0    0    0
##          B    0 1136    5    0    0
##          C    0    2 1019   10    0
##          D    0    0    2  954    5
##          E    0    0    0    0 1077
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9958          
##                  95% CI : (0.9937, 0.9972)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9946          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9974   0.9932   0.9896   0.9954
## Specificity            0.9998   0.9989   0.9975   0.9986   1.0000
## Pos Pred Value         0.9994   0.9956   0.9884   0.9927   1.0000
## Neg Pred Value         1.0000   0.9994   0.9986   0.9980   0.9990
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1930   0.1732   0.1621   0.1830
## Detection Prevalence   0.2846   0.1939   0.1752   0.1633   0.1830
## Balanced Accuracy      0.9999   0.9982   0.9954   0.9941   0.9977

We have 99.58% accuracy on the validation set. Even with further tuning, the accuracy is unlikely to improve significantly.
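
Since the party package was loaded at the start, a single conditional inference tree (ctree) can serve as a simpler comparison model. A minimal sketch, assuming the same mytrain/mytest split as above (a single tree is generally less accurate than the forest):

##Sketch: fit one conditional inference tree for comparison
set.seed(8221)
modct=ctree(classe ~ ., data=mytrain)
predct=predict(modct, newdata=mytest)
confusionMatrix(predct, as.factor(mytest$classe))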

Conclusion

The estimated out-of-sample error is approximately 0.42% (1 minus the validation accuracy). However, even though this error is quite small, in real-world use the out-of-sample error might be somewhat higher due to circumstances not represented in the data. Overall, the random forest provides a satisfactory result.
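
The error estimate follows directly from the reported accuracy; as a one-line sketch:

1 - acc$overall["Accuracy"]   ##approximately 0.0042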