The data come from six young, healthy participants who were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:

- exactly according to the specification (Class A)
- throwing the elbows to the front (Class B)
- lifting the dumbbell only halfway (Class C)
- lowering the dumbbell only halfway (Class D)
- throwing the hips to the front (Class E)

In what follows, we will analyze the data and try to build a predictor that classifies these fashions, then check whether the predictor also works well on new data.
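The package-loading chunk was not echoed in the report; judging from the startup messages below, it presumably looked like this:

library(caret)   # also loads lattice and ggplot2
library(rattle)  # fancyRpartPlot
library(rpart)   # decision trees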
## Loading required package: lattice
## Loading required package: ggplot2
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle and roll your data.
if(!file.exists("./trainingdata.csv")) {
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
destfile = "./trainingdata.csv")
}
trainData <- read.csv("./trainingdata.csv")
Cleaning the data leaves 53 of the original 160 variables.
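The cleaning chunk itself was not shown; a minimal sketch of a typical approach, assuming we drop the first seven bookkeeping columns (row id, user name, timestamps, window markers) and every column that contains NAs or empty strings:

train <- trainData[, -(1:7)]                               # drop bookkeeping columns
train <- train[, colSums(is.na(train) | train == "") == 0] # keep fully observed columns

The cleaned data then looks like this: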
str(train)
## 'data.frame': 19622 obs. of 53 variables:
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 0 -0.02 0 0 0 0 0 0 ...
## $ accel_dumbbell_x : int -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
## $ accel_dumbbell_y : int 47 47 46 48 48 48 47 46 47 48 ...
## $ accel_dumbbell_z : int -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
## $ magnet_dumbbell_x : int -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
## $ magnet_dumbbell_y : int 293 296 298 303 292 294 295 300 292 291 ...
## $ magnet_dumbbell_z : num -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
## $ roll_forearm : num 28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
## $ yaw_forearm : num -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
## $ gyros_forearm_y : num 0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
## $ gyros_forearm_z : num -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
## $ accel_forearm_x : int 192 192 196 189 189 193 195 193 193 190 ...
## $ accel_forearm_y : int 203 203 204 206 206 203 205 205 204 205 ...
## $ accel_forearm_z : int -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
## $ magnet_forearm_x : int -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
## $ magnet_forearm_y : num 654 661 658 658 655 660 659 660 653 656 ...
## $ magnet_forearm_z : num 476 473 469 469 473 478 470 474 476 473 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
Split the data into a training set and a validation set so we can estimate the predictor's out-of-sample performance:
inTrain <- createDataPartition(train$classe, p = 0.7, list = FALSE)  # 70/30 split, stratified by classe
training <- train[inTrain, ]
pmlvalid <- train[-inTrain, ]
Let's look at the density of the total-acceleration variables for each class:
p1 <- qplot(total_accel_belt, colour = classe, data = training, geom = "density")
p2 <- qplot(total_accel_arm, colour = classe, data = training, geom = "density")
p3 <- qplot(total_accel_dumbbell, colour = classe, data = training, geom = "density")
p4 <- qplot(total_accel_forearm, colour = classe, data = training, geom = "density")
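The four density plots can be drawn together in one figure; a minimal sketch, assuming the gridExtra package is installed:

library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)  # 2x2 grid of density plots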
# Tree model

I will use a decision tree to build the predictor, and inspect it with a fancy plot:
set.seed(111)
treemodel <- rpart(classe ~ ., data=training, method="class")
fancyRpartPlot(treemodel, sub="")
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
Using a confusion matrix to evaluate the tree on the validation set:
treePred <- predict(treemodel, pmlvalid, type = "class")
treeconf <- confusionMatrix(treePred, pmlvalid$classe)
treeAccuracy <- round(treeconf$overall['Accuracy'], 4)
print(treeconf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1562 229 18 95 38
## B 35 623 114 36 78
## C 35 163 791 101 94
## D 36 77 68 649 86
## E 6 47 35 83 786
##
## Overall Statistics
##
## Accuracy : 0.7495
## 95% CI : (0.7383, 0.7606)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6816
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9331 0.5470 0.7710 0.6732 0.7264
## Specificity 0.9098 0.9446 0.9191 0.9457 0.9644
## Pos Pred Value 0.8043 0.7032 0.6681 0.7085 0.8213
## Neg Pred Value 0.9716 0.8968 0.9500 0.9366 0.9399
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2654 0.1059 0.1344 0.1103 0.1336
## Detection Prevalence 0.3300 0.1506 0.2012 0.1556 0.1626
## Balanced Accuracy 0.9214 0.7458 0.8450 0.8095 0.8454
plot(treeconf$table, col = treeconf$byClass, main = paste(
  "Decision tree model confusion matrix: Accuracy =", treeAccuracy))
From the confusion matrix, the decision tree reaches a prediction accuracy of about 75% on the validation set. That is well above the no-information rate of 28%, but it leaves substantial room for improvement.
# Random forest model

A random forest was applied next to see whether it improves prediction accuracy. A random forest already resamples internally (each tree is grown on a bootstrap sample, which yields an out-of-bag error estimate), so explicit k-fold cross-validation is not strictly required; a sketch of how it could be added anyway follows below.
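If explicit k-fold cross-validation is still preferred, caret's train interface supports it; a minimal sketch with 5 folds (the object names ctrl and RFcv are illustrative, and this is considerably slower than calling randomForest directly):

ctrl <- trainControl(method = "cv", number = 5)                              # 5-fold CV
RFcv <- train(classe ~ ., data = training, method = "rf", trControl = ctrl)

Here the forest is fitted directly with randomForest: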
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(222)
RFmodel <- randomForest(classe ~ ., data=training)
RFprediction <- predict(RFmodel, pmlvalid, type = "class")
RFconf <- confusionMatrix(RFprediction, pmlvalid$classe)
Evaluate the random forest on the validation set with a confusion matrix:

print(RFconf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 6 0 0 0
## B 0 1131 9 0 0
## C 0 2 1017 17 1
## D 0 0 0 947 3
## E 1 0 0 0 1078
##
## Overall Statistics
##
## Accuracy : 0.9934
## 95% CI : (0.991, 0.9953)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9916
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9930 0.9912 0.9824 0.9963
## Specificity 0.9986 0.9981 0.9959 0.9994 0.9998
## Pos Pred Value 0.9964 0.9921 0.9807 0.9968 0.9991
## Neg Pred Value 0.9998 0.9983 0.9981 0.9966 0.9992
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1922 0.1728 0.1609 0.1832
## Detection Prevalence 0.2853 0.1937 0.1762 0.1614 0.1833
## Balanced Accuracy 0.9990 0.9955 0.9936 0.9909 0.9980
Plot the confusion matrix with the accuracy in the title:

RFaccuracy <- round(RFconf$overall['Accuracy'], 4)
plot(RFconf$table, col = RFconf$byClass, main = paste(
  "Random Forest model confusion matrix: Accuracy =", RFaccuracy))
The random forest model had an overall prediction accuracy of 99.34% on the validation set, much higher than the simple tree model's 75%. The corresponding estimated out-of-sample error rate is 1 - 0.9934 = 0.66%.
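The same error estimate can be cross-checked without the validation set: randomForest keeps an internal out-of-bag (OOB) error estimate, which is shown when the fitted model is printed.

print(RFmodel)                            # summary includes the OOB estimate of error rate
round(1 - RFconf$overall['Accuracy'], 4)  # validation-set error rate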
# Testing on new data

Since the random forest is the far better predictor, we use it to classify the 20 test cases:
if(!file.exists("./testingdata.csv")) {
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
destfile = "./testingdata.csv")
}
testData <- read.csv("./testingdata.csv")
RFpredictSubmit <- predict(RFmodel, testData, type = "class")
results <- data.frame("Participant"=testData$user_name, "Problem_id"=testData$problem_id,
"Class"=RFpredictSubmit)
print(results)
## Participant Problem_id Class
## 1 pedro 1 B
## 2 jeremy 2 A
## 3 jeremy 3 B
## 4 adelmo 4 A
## 5 eurico 5 A
## 6 jeremy 6 E
## 7 jeremy 7 D
## 8 jeremy 8 B
## 9 carlitos 9 A
## 10 charles 10 A
## 11 carlitos 11 B
## 12 jeremy 12 C
## 13 eurico 13 B
## 14 jeremy 14 A
## 15 jeremy 15 E
## 16 eurico 16 E
## 17 pedro 17 A
## 18 carlitos 18 B
## 19 pedro 19 B
## 20 eurico 20 B
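Finally, the predictions can be written out one file per problem id, as the course submission expects; a minimal sketch (the helper name writeResults is hypothetical):

# Hypothetical helper: writes one text file per problem id
writeResults <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writeResults(RFpredictSubmit)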