author: liuyubobobo
date: Sunday, May 24, 2015
The goal of this project is to predict the manner in which participants performed an exercise. Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har
First, we read the data from CSV files. The data are split into a training set and a test set. The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv. The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv. The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
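Spreadsheet exports sometimes encode missing values as empty fields or error markers rather than NA. The following is a minimal sketch, assuming the CSV files contain such markers; the inline sample, its column names and the marker list are illustrative assumptions, not taken from the files. It shows how the na.strings argument of read.csv maps them to NA at read time:

```r
# Hypothetical three-row sample standing in for a few columns of a real file.
csv_text <- "user_name,roll_value,kurtosis_value
user_a,1.41,
user_b,1.42,#DIV/0!
user_c,1.48,-1.908453"

# Treat empty fields and spreadsheet error markers as missing values.
sample_df <- read.csv(text = csv_text,
                      na.strings = c("NA", "", "#DIV/0!"),
                      stringsAsFactors = FALSE)

str(sample_df)
print(colSums(is.na(sample_df)))
```

If this option were applied to the real files, more columns would be read as numeric with NA entries, so the column counts reported in the later steps could change.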
To understand the data better, we first do some exploratory data analysis using the head, str and summary functions. To save space, we omit their output.
head(training)
str(training)
summary(training)
From the str output, we can see that our training data set has 19622 observations and 160 variables.
160 variables are too many to build a predictive model on conveniently, so we need to reduce the dimension of the data.
First, the str output shows that many variables in the training data contain NA values, which makes them of little use as predictors. Therefore, we keep only the columns that have no NA values in the training data and apply the same selection to the testing data.
# Keep only the columns that contain no missing values
colBools <- colSums(is.na(training)) == 0
training <- training[, colBools]
testing <- testing[, colBools]
dim(training)
## [1] 19622 93
This step reduces the number of variables from 160 to 93.
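The column filter above can be illustrated on a small example. This is a minimal sketch on a toy data frame with made-up values, not the project data:

```r
# Toy data frame (hypothetical values) with one column that contains NA.
toy <- data.frame(
  a = c(1, 2, 3),
  b = c(NA, 5, NA),
  c = c("x", "y", "z")
)

# Count the NA values per column and keep the columns that have none.
na_counts <- colSums(is.na(toy))
keep <- na_counts == 0
toy_clean <- toy[, keep, drop = FALSE]

print(na_counts)
print(names(toy_clean))
```

Because the logical vector is computed on one data frame, it can be reused to select the same columns from another data frame with the same column order, which is how the testing data are handled above.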
Secondly, we use the nearZeroVar function from the caret package to remove variables with near-zero variance. These variables hardly change across observations, so they contribute little to the model.
library(lattice)
library(ggplot2)
library(caret)
nearZeroVars <- nearZeroVar(training)
training <- training[,-nearZeroVars]
testing <- testing[,-nearZeroVars]
dim(training)
## [1] 19622 59
This step reduces the number of variables from 93 to 59.
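To make the near-zero-variance idea concrete, here is a simplified base R sketch of the kind of rule nearZeroVar applies. It is a re-implementation for illustration only, not caret's code: the helper is_near_zero_var is hypothetical, and the cutoffs are assumed to match caret's documented defaults (freqCut = 95/5, uniqueCut = 10). A column is flagged when its most common value heavily dominates the second most common one and it has few distinct values.

```r
# Hypothetical helper: flags a column as near-zero variance.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)  # constant column: zero variance
  # Ratio of the most common value's count to the second most common.
  freq_ratio <- as.numeric(counts[1]) / as.numeric(counts[2])
  # Distinct values as a percentage of the number of observations.
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique < unique_cut
}

# Toy data (made-up values): one informative column, two uninformative ones.
toy <- data.frame(
  varied      = seq_len(100),
  mostly_same = c(rep(0, 98), 1, 2),
  constant    = rep(5, 100)
)

flags <- vapply(toy, is_near_zero_var, logical(1))
toy_kept <- toy[, !flags, drop = FALSE]

print(flags)
print(names(toy_kept))
```

Filtering with a logical vector also avoids a pitfall of negative indexing: if no column is flagged, an expression such as df[, -integer(0)] selects zero columns instead of keeping all of them.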
Thirdly, from the str and colnames output, we know that the first six variables, X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp and num_window, are not semantically related to our outcome (i.e. classe). We can safely delete these variables.
training <- training[, -(1:6)]
testing <- testing[, -(1:6)]
dim(training)
## [1] 19622 53
This step reduces the number of variables from 59 to 53.
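Dropping columns by position depends on the column order produced by the earlier steps. As an alternative, the same kind of removal can be written by name. This is a minimal sketch on a toy data frame with made-up values; the id_cols vector is an illustrative assumption:

```r
# Toy data frame (hypothetical values) with two bookkeeping columns.
toy <- data.frame(
  X = 1:3,
  user_name = c("u1", "u2", "u3"),
  roll_belt = c(1.41, 1.42, 1.48),
  classe = c("A", "B", "A")
)

# Columns that identify the record rather than describe the movement.
id_cols <- c("X", "user_name")

# Keep every column whose name is not in id_cols.
toy_features <- toy[, !(names(toy) %in% id_cols), drop = FALSE]

print(names(toy_features))
```

Selecting by name keeps working if an earlier filtering step removes a different set of columns.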
To make the data set more compact, we can process it further with PCA.
First, we partition the training data into two parts, training_set and validating_set.
set.seed(54321)
inTrain <- createDataPartition( y = training$classe , p=0.75, list=FALSE )
training_set <- training[ inTrain , ]
validating_set <- training[ -inTrain , ]
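For a factor outcome, createDataPartition samples within each class so that both parts keep roughly the same class proportions. The base R sketch below illustrates that idea on a made-up outcome vector; it approximates the behaviour and is not caret's implementation.

```r
set.seed(54321)

# Hypothetical outcome with unbalanced classes.
outcome <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Row indices grouped by class.
idx_by_class <- split(seq_along(outcome), outcome)

# Draw 75% of the rows from each class separately.
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE))

train_outcome <- outcome[in_train]
valid_outcome <- outcome[-in_train]

print(table(train_outcome))
print(table(valid_outcome))
```

Sampling per class keeps rare classes represented in both the training and the validation part, which a plain random split does not guarantee.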
From the colnames output, we know that the last variable (the 53rd) is our outcome, classe. We exclude it before applying PCA.
compress <- preProcess(training_set[,-53], method = "pca")
training_PCA_set <- predict(compress, training_set[,-53])
validating_PCA_set <- predict(compress, validating_set[,-53])
testing_PCA_set <- predict(compress, testing[,-53])
dim(training_PCA_set)
## [1] 14718 25
This step reduces the number of variables from 53 to 25.
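According to the caret documentation, preProcess with method = "pca" centers and scales the predictors and, by default, keeps enough components to explain 95% of the variance. The sketch below shows the same idea with base R's prcomp. It uses the built-in iris data as a stand-in for our predictors, and the split into rows is purely illustrative.

```r
# Use the built-in iris measurements as stand-in numeric predictors.
predictors <- iris[, 1:4]
train_rows <- seq(1, nrow(predictors), by = 2)  # illustrative split
train_x <- predictors[train_rows, ]
valid_x <- predictors[-train_rows, ]

# Fit PCA on the training rows only, after centering and scaling.
pca <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Keep the smallest number of components explaining at least 95% of variance.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
k <- which(cumsum(var_explained) >= 0.95)[1]

# Project both sets with the rotation learned from the training rows.
train_pcs <- predict(pca, newdata = train_x)[, seq_len(k), drop = FALSE]
valid_pcs <- predict(pca, newdata = valid_x)[, seq_len(k), drop = FALSE]

print(k)
print(dim(valid_pcs))
```

Fitting the transformation on the training rows and only applying it to the other rows mirrors what the code above does with compress, and keeps information from the validation and test data out of the preprocessing.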
For the model-fitting function, we put the outcome classe back into training_PCA_set.
training_PCA_set$classe <- training_set$classe
Now we can build a random forest model on training_PCA_set.
library( randomForest )
## Warning: package 'randomForest' was built under R version 3.1.3
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
modelFit <- randomForest(classe ~ ., data = training_PCA_set)
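A random forest also produces an out-of-bag (OOB) error estimate during fitting, which can serve as a quick sanity check alongside a held-out validation set. The sketch below is illustrative only: it uses the built-in iris data in place of training_PCA_set, with Species playing the role of classe, and ntree = 200 is an arbitrary choice. It skips the fit if the randomForest package is not installed.

```r
oob_error <- NA_real_  # stays NA if the randomForest package is not installed

if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(54321)
  # iris stands in for training_PCA_set; Species plays the role of classe.
  fit <- randomForest::randomForest(Species ~ ., data = iris, ntree = 200)
  # err.rate has one row per tree; the "OOB" column is the running OOB error.
  oob_error <- fit$err.rate[fit$ntree, "OOB"]
}

print(oob_error)
```

The same lookup on a fitted classification forest such as modelFit would give its OOB error after the last tree.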
We first use validating_PCA_set to check the accuracy of this model.
validating_pred <- predict( modelFit, newdata=validating_PCA_set)
confusionMatrix(validating_pred, validating_set$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1388   13    0    0    0
##          B    0  929   12    0    2
##          C    3    6  830   32    4
##          D    2    0   13  768   10
##          E    2    1    0    4  885
##
## Overall Statistics
##
##                Accuracy : 0.9788
##                  95% CI : (0.9744, 0.9826)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9732
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9950   0.9789   0.9708   0.9552   0.9822
## Specificity            0.9963   0.9965   0.9889   0.9939   0.9983
## Pos Pred Value         0.9907   0.9852   0.9486   0.9685   0.9922
## Neg Pred Value         0.9980   0.9950   0.9938   0.9912   0.9960
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2830   0.1894   0.1692   0.1566   0.1805
## Detection Prevalence   0.2857   0.1923   0.1784   0.1617   0.1819
## Balanced Accuracy      0.9956   0.9877   0.9798   0.9746   0.9902
From the results above, the prediction accuracy on validating_PCA_set is 97.88%, so the estimated out-of-sample error is 2.12% (1 - 0.9788). This suggests that the model is suitable for the prediction task.
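The accuracy and error figures can be recomputed directly from the confusion matrix counts. The sketch below copies the counts from the output above and derives the overall accuracy, the error rate and the per-class sensitivity in base R:

```r
# Confusion matrix counts copied from the output above
# (rows = predictions, columns = reference classes).
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(1388,  13,   0,   0,   0,
       0, 929,  12,   0,   2,
       3,   6, 830,  32,   4,
       2,   0,  13, 768,  10,
       2,   1,   0,   4, 885),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

accuracy <- sum(diag(cm)) / sum(cm)    # correct predictions / all predictions
error_rate <- 1 - accuracy             # estimated out-of-sample error
sensitivity <- diag(cm) / colSums(cm)  # per-class share of true cases found

print(round(accuracy, 4))
print(round(error_rate, 4))
print(round(sensitivity, 4))
```

The diagonal holds the 4800 correctly classified observations out of 4904 in the validation set, which gives the 97.88% accuracy reported by confusionMatrix.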
Now we can apply the model to testing_PCA_set to predict the manner in which the exercise was performed.
testing_pred <- predict( modelFit, newdata=testing_PCA_set)
testing_pred
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B
## Levels: A B C D E