This is an analysis for the machine learning project. The dataset contains sensor data from the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts. The goal of this report is to evaluate whether the exercise is performed correctly. Random forest and ctree are used.
setwd("E:/machine learning/project1")
library(data.table)
library(caret)
## Warning: package 'caret' was built under R version 3.1.3
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.1.3
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(e1071)
## Warning: package 'e1071' was built under R version 3.1.3
library(party)
## Warning: package 'party' was built under R version 3.1.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.1.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 3.1.3
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.1.3
library(arm)
## Warning: package 'arm' was built under R version 3.1.3
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lme4
## Warning: package 'lme4' was built under R version 3.1.3
## Loading required package: Rcpp
##
## Attaching package: 'lme4'
##
## The following object is masked from 'package:modeltools':
##
## refit
##
##
## arm (Version 1.8-4, built: 2015-04-07)
##
## Working directory is E:/machine learning/project1
library(kernlab)
## Warning: package 'kernlab' was built under R version 3.1.3
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:modeltools':
##
## prior
# Treat "NA", "#DIV/0!" and empty strings as missing values
read.pml = function(file) {
  fread(file, na.strings=c("NA", "#DIV/0!", ""))
}
test=fread("pml-testing.csv")
train=read.pml("pml-training.csv")
dim(test)
## [1] 20 160
dim(train)
## [1] 19622 160
The test dataset is used to test the final result; the train dataset is used for the data analysis. We will reduce the number of variables in the train dataset, tidy it, and keep only the “useful” variables.
The test and train sets have the same columns except for the 160th: it is “problem_id” in the test set and “classe” in the train set. We remove the last column “problem_id” from the test set, so the test set now has 159 columns and the train set 160.
test=subset(test,select=-160)
dim(test)
## [1] 20 159
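The claim that the two files differ only in their last column can be checked by comparing column names. The following is a minimal sketch on two small made-up data frames; `toy_train` and `toy_test` are hypothetical stand-ins, not the project data.

```r
# Hypothetical stand-ins for train and test; only the last column name differs
toy_train <- data.frame(roll_belt = c(1.4, 1.5), pitch_belt = c(8.1, 8.0),
                        classe = c("A", "B"))
toy_test  <- data.frame(roll_belt = c(1.2, 1.3), pitch_belt = c(7.9, 8.2),
                        problem_id = c(1, 2))

# Positions where the column names disagree
differing <- which(names(toy_train) != names(toy_test))
differing

# Names present in one set but not in the other
only_train <- setdiff(names(toy_train), names(toy_test))
only_test  <- setdiff(names(toy_test), names(toy_train))
only_train
only_test
```

On the real data the same comparison would be run on `names(train)` and `names(test)` before any column is dropped.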
There are 160 variables in the dataset. First we can cut the first 5 columns, as they only hold the row index, the participant name, and timestamps.
#drop=c("V1","user_name","raw_timestamp_part_1","raw_timestamp_part_2","cvtd_timestamp")
#train[,!(names(train)%in%drop),with=FALSE]
train=subset(train,select=-c(1:5))
test=subset(test,select=-c(1:5)) ##drop the unused columns
Many columns carry very little information, so we eliminate them from both the training dataset and the test dataset. Here nearZeroVar from caret flags the columns with zero or near-zero variance in the training dataset, and the same columns are dropped from both datasets.
#Zero Variance Tidying
zerovars <- nearZeroVar(train)
train2=subset(train,select=-c(zerovars))
test2=subset(test,select=-c(zerovars))
dim(train2)
## [1] 19622 119
dim(test2)
## [1] 20 118
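To see which statistics drive this filter, `nearZeroVar` can be called with `saveMetrics = TRUE`, which reports the frequency ratio and the percentage of unique values for each column. The sketch below uses made-up data (the data frame `toy` and its column names are hypothetical) and assumes the caret defaults `freqCut = 95/5` and `uniqueCut = 10`.

```r
library(caret)

set.seed(1)
# Hypothetical data: one constant column, one almost constant, one informative
toy <- data.frame(
  constant    = rep(0, 100),
  mostly_zero = c(rep(0, 99), 1),
  informative = rnorm(100)
)

# One row per column: freqRatio, percentUnique, zeroVar, nzv
metrics <- nearZeroVar(toy, saveMetrics = TRUE)
metrics

# Without saveMetrics, the indices of the flagged columns are returned
flagged <- nearZeroVar(toy)
flagged
```

Inspecting `metrics` before dropping anything makes it easier to judge whether the default cut-offs are too aggressive for a given dataset.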
The next step is to eliminate the columns that have many NAs. Here columns with more than 50% NAs are removed.
#NA Tidying
# fraction of NAs in each column
na= sapply(train2, function(x) mean(is.na(x)))
drop= which(na> .50)
train3=subset(train2,select=-drop)
test3=subset(test2,select=-drop)
dim(train3)
## [1] 19622 54
dim(test3)
## [1] 20 53
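The NA filter can be illustrated on a tiny example. This is a minimal base R sketch on a made-up data frame (`toy` and its columns are hypothetical); it keeps columns by a logical mask instead of dropping them by negative index.

```r
# Hypothetical data frame with one mostly-missing column
toy <- data.frame(
  mostly_na = c(NA, NA, NA, 4),
  some_na   = c(1, NA, 3, 4),
  complete  = c(1, 2, 3, 4)
)

# Fraction of missing values per column
na_frac <- colMeans(is.na(toy))
na_frac

# Keep only the columns with at most 50% missing values
kept <- toy[, na_frac <= 0.5, drop = FALSE]
names(kept)
```

A logical mask is used here because, in base R indexing, a negative index built from an empty `which()` result selects no columns at all; the mask behaves the same whether or not any column exceeds the threshold.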
Both the training dataset and the test dataset are now processed, with 54 and 53 columns respectively. Note that the test dataset always has one column fewer than the training dataset, because we removed its last column (“problem_id”), which has no counterpart in the training dataset.
We split the training data into two parts: 70% as the training dataset and 30% as a held-out test dataset for validation.
set.seed(8221)
intrain=createDataPartition(y=train3$classe,p=0.7,list=FALSE)
mytrain=train3[intrain[,1]]
mytest=train3[-intrain[,1]]
dim(mytrain);dim(mytest)
## [1] 13737 54
## [1] 5885 54
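`createDataPartition` samples within each class, so both parts should show roughly the same class proportions. A small sketch of how this could be checked, on simulated data (the data frame `toy`, the class probabilities, and the names `part_train` and `part_test` are all hypothetical):

```r
library(caret)

set.seed(8221)
# Hypothetical outcome with unbalanced classes
toy <- data.frame(
  x = rnorm(1000),
  classe = factor(sample(c("A", "B", "C"), 1000, replace = TRUE,
                         prob = c(0.5, 0.3, 0.2)))
)

# 70/30 split, stratified on the outcome
idx <- createDataPartition(y = toy$classe, p = 0.7, list = FALSE)
part_train <- toy[idx[, 1], ]
part_test  <- toy[-idx[, 1], ]

# Class proportions in the two parts
prop_train <- prop.table(table(part_train$classe))
prop_test  <- prop.table(table(part_test$classe))
round(prop_train, 3)
round(prop_test, 3)
```

The same two `prop.table` calls applied to `mytrain$classe` and `mytest$classe` would show whether the hold-out set is representative of the training part.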
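set.seed(8221)
# randomForest needs a factor response for classification
mytrain$classe=as.factor(mytrain$classe)
# confusionMatrix expects the reference as a factor with the same levels
mytest$classe=factor(mytest$classe,levels=levels(mytrain$classe))
# method="rf" is a caret::train argument and is dropped here; ntree=30 shortens the running time
modfit=randomForest(classe ~ .,data=mytrain,ntree=30)
perdictit=predict(modfit,mytest,type="response")
acc=confusionMatrix(perdictit,mytest$classe)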
acc
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    1    0    0    0
##          B    0 1136    5    0    0
##          C    0    2 1019   10    0
##          D    0    0    2  954    5
##          E    0    0    0    0 1077
##
## Overall Statistics
##
##                Accuracy : 0.9958
##                  95% CI : (0.9937, 0.9972)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9946
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9974   0.9932   0.9896   0.9954
## Specificity            0.9998   0.9989   0.9975   0.9986   1.0000
## Pos Pred Value         0.9994   0.9956   0.9884   0.9927   1.0000
## Neg Pred Value         1.0000   0.9994   0.9986   0.9980   0.9990
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1930   0.1732   0.1621   0.1830
## Detection Prevalence   0.2846   0.1939   0.1752   0.1633   0.1830
## Balanced Accuracy      0.9999   0.9982   0.9954   0.9941   0.9977
We have 99.58% accuracy. Even though we could perform more tests, I don’t think the accuracy rate can become significantly higher.
The out-of-sample error, estimated on the 30% held-out data, is approximately 0.42% (1 - 0.9958). However, note that even though this error is small, in real life the out-of-sample error might be a bit higher due to unexpected circumstances. Overall, the random forest provides a satisfying result.
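The accuracy and error figures can be reproduced by hand from the confusion matrix printed earlier in this report. The sketch below re-enters those counts as a plain matrix (the object names `cm`, `accuracy`, and `oos_error` are chosen here for illustration) and recomputes the overall accuracy, the estimated out-of-sample error, and the per-class sensitivity.

```r
# Counts copied from the confusion matrix in this report
# (rows: prediction, columns: reference)
cm <- matrix(c(1674,    1,    0,   0,    0,
                  0, 1136,    5,   0,    0,
                  0,    2, 1019,  10,    0,
                  0,    0,    2, 954,    5,
                  0,    0,    0,   0, 1077),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

accuracy  <- sum(diag(cm)) / sum(cm)  # correctly classified / all cases
oos_error <- 1 - accuracy             # estimated out-of-sample error

round(accuracy, 4)
round(100 * oos_error, 2)

# Per-class sensitivity: correct predictions divided by the reference totals
sensitivity <- diag(cm) / colSums(cm)
round(sensitivity, 4)
```

Here 5860 of the 5885 held-out cases lie on the diagonal, which gives the 0.9958 accuracy and the 0.42% error quoted in the text.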