Load the libraries needed for the analysis.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(stats)
library(ggplot2)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
Note that the raw CSV files contain "#DIV/0!" entries (spreadsheet division-by-zero errors) that need to be read in as NA.
trainingAddress<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testingAddress<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training<-read.csv(trainingAddress, na.strings=c("#DIV/0!","NA",""))
testing<-read.csv(testingAddress, na.strings=c("#DIV/0!","NA",""))
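As a quick check (my addition, not in the original script), we can count how many columns picked up NAs after the conversion:
# Number of columns containing at least one NA after reading
sum(colSums(is.na(training)) > 0)
dim(training)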
Here we set a seed so the project can be reproduced.
set.seed(111)
Here, I split the training data into two groups, 60% for training and 40% for validation, so the model can be evaluated later on data it has not seen.
inTrain<-createDataPartition(training$classe, p=0.6, list=FALSE)
myTraining<-training[inTrain,]
myTesting<-training[-inTrain,]
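A quick sanity check on the partition (my addition, not in the original script):
# createDataPartition stratifies on classe, so both splits should be
# roughly 60/40 with similar class proportions
dim(myTraining); dim(myTesting)
prop.table(table(myTraining$classe))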
First, I remove columns with near-zero variance, since they carry almost no information for prediction.
NZV <- nearZeroVar(myTraining)
NZV <- c(1, NZV)   # also drop column 1 (X, the row index) so it isn't used as a predictor
myTraining <- myTraining[-NZV]
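For reference, nearZeroVar() can also report the diagnostics behind each flag when run before the columns are dropped (a sketch, not part of the original analysis):
# saveMetrics=TRUE returns frequency ratio, percent-unique, and
# zero/near-zero-variance flags for every column, not just indices
nzvMetrics <- nearZeroVar(training, saveMetrics=TRUE)
head(nzvMetrics[nzvMetrics$nzv, ])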
Then, while not particularly scientific, I removed any variable whose first 10 values were all NA; a programmatic sketch of this filter appears after the name list below.
thenames<-c("amplitude_roll_forearm", "amplitude_pitch_forearm", "var_accel_forearm",
"max_picth_forearm", "max_yaw_forearm", "min_roll_forearm", "min_pitch_forearm",
"min_yaw_forearm","kurtosis_roll_forearm", "kurtosis_picth_forearm",
"skewness_roll_forearm", "skewness_pitch_forearm", "var_pitch_dumbbell",
"avg_yaw_dumbbell", "stddev_yaw_dumbbell", "var_yaw_dumbbell",
"stddev_roll_dumbbell", "var_roll_dumbbell", "avg_pitch_dumbbell",
"stddev_pitch_dumbbell","var_accel_dumbbell", "avg_roll_dumbbell",
"amplitude_pitch_dumbbell","min_roll_dumbbell", "min_pitch_dumbbell",
"min_yaw_dumbbell", "amplitude_roll_dumbbell","skewness_pitch_dumbbell",
"max_roll_dumbbell", "max_picth_dumbbell", "max_yaw_dumbbell",
"kurtosis_roll_dumbbell", "kurtosis_picth_dumbbell", "skewness_roll_dumbbell",
"min_yaw_arm", "amplitude_pitch_arm", "amplitude_yaw_arm","skewness_pitch_arm",
"skewness_yaw_arm", "max_picth_arm", "max_yaw_arm", "min_roll_arm",
"min_pitch_arm","kurtosis_roll_arm", "kurtosis_picth_arm", "kurtosis_yaw_arm",
"skewness_roll_arm","var_accel_arm","stddev_pitch_belt", "var_pitch_belt",
"avg_yaw_belt", "stddev_yaw_belt", "var_yaw_belt", "var_total_accel_belt",
"avg_roll_belt", "stddev_roll_belt", "var_roll_belt", "avg_pitch_belt",
"min_roll_belt", "min_pitch_belt", "min_yaw_belt", "amplitude_roll_belt",
"amplitude_pitch_belt","skewness_roll_belt", "skewness_roll_belt.1",
"max_roll_belt", "max_picth_belt", "max_yaw_belt","kurtosis_roll_belt",
"kurtosis_picth_belt")
notnames<-!(names(myTraining) %in% thenames)
myTraining<-myTraining[notnames]
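As noted above, the same filter could have been written programmatically instead of listing the names by hand (a sketch of the idea, not the original code):
# Flag columns whose first 10 values are all NA (run on the data
# before this filtering step); roughly equivalent to the manual list
firstTenNA <- sapply(myTraining, function(col) all(is.na(col[1:10])))
# myTraining <- myTraining[!firstTenNA]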
I repeated the same column removals on the test data (both the graded test set for submission and the validation set carved out of the training data).
myTesting<-myTesting[-NZV]
myTesting<-myTesting[notnames]
testing<-testing[-NZV]
testing<-testing[notnames]
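At this point the column sets should line up; a quick check (my addition, assuming the graded test set differs from the training data only in its final column):
# myTraining and myTesting should now have identical column names;
# the graded test set should differ only by problem_id vs. classe
identical(names(myTraining), names(myTesting))
setdiff(names(myTraining), names(testing))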
A little more cleanup is needed before running random forests: corresponding columns must have the same data types across the data sets.
testing[5:58]<-as.numeric(as.character(unlist(testing[5:58])))
myTraining[5:57]<-as.numeric(as.character(unlist(myTraining[5:57])))
myTesting[5:57]<-as.numeric(as.character(unlist(myTesting[5:57])))
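To verify the coercion produced matching types across the splits (my addition):
# All coerced columns should now be numeric in both derived sets
identical(sapply(myTraining[5:57], class), sapply(myTesting[5:57], class))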
Lastly, I found that the factor variable in column 4 was breaking the model fit, so I removed it from all three data sets.
testing<-testing[-4]
myTesting<-myTesting[-4]
myTraining<-myTraining[-4]
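One way to spot such problem columns ahead of time (a sketch, my addition):
# List remaining factor columns; predict() on a random forest can fail
# when a factor's levels differ between training and test data
names(myTraining)[sapply(myTraining, is.factor)]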
I chose randomForest as the machine learning algorithm.
model <- randomForest(classe ~ ., data=myTraining, na.action=na.roughfix)
prediction <- predict(model, myTesting, type="class")
# na.action=na.roughfix was added when I knitted with knitr; for some
# reason the code did not run correctly here without it.
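Before evaluating on the held-out split, randomForest's built-in out-of-bag (OOB) error estimate is worth a look (my addition):
# Printing the model shows the OOB error estimate and a confusion
# matrix computed from out-of-bag samples during training
print(model)
varImpPlot(model)   # which predictors the forest relied on most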
Earlier, the training data was split into training and validation sets. Here, I evaluate the model on that validation set (the myTesting data frame).
confusionMatrix(prediction, myTesting$classe)
## Loading required namespace: e1071
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 45 0 0 0 0
## B 0 30 0 0 0
## C 0 0 30 1 0
## D 0 0 0 23 0
## E 0 0 0 0 31
##
## Overall Statistics
##
## Accuracy : 0.9938
## 95% CI : (0.9657, 0.9998)
## No Information Rate : 0.2812
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9921
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 0.9583 1.0000
## Specificity 1.0000 1.0000 0.9923 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 0.9677 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 0.9927 1.0000
## Prevalence 0.2812 0.1875 0.1875 0.1500 0.1938
## Detection Rate 0.2812 0.1875 0.1875 0.1437 0.1938
## Detection Prevalence 0.2812 0.1875 0.1938 0.1437 0.1938
## Balanced Accuracy 1.0000 1.0000 0.9962 0.9792 1.0000
The output above shows 99.38% accuracy (my original run, before adding the na.roughfix workaround, reported 99.75%). Since myTesting was held out from model fitting, this corresponds to an estimated out-of-sample error of about 1 - 0.9938 = 0.62%.
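The same estimate can be pulled directly from the confusionMatrix object (a small sketch, my addition):
# Estimated out-of-sample error = 1 - validation accuracy
cm <- confusionMatrix(prediction, myTesting$classe)
1 - as.numeric(cm$overall["Accuracy"])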
Here we apply the model to the real (graded) test data.
finalPredict<-predict(model,testing)
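A quick look at the predicted classes before writing them out (my addition):
# Distribution of predicted classes across the graded test cases
table(finalPredict)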
Last, the function below was used to generate the files for submission.
pml_write_files <- function(x) {
    # Write each prediction to its own one-line text file for submission
    n <- length(x)
    for (i in 1:n) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
}
#pml_write_files(finalPredict)
Below is the confusion matrix output from my original .R file. I could not determine why knitr was failing on my original code, so I applied the na.action=na.roughfix fix above; that fix drastically reduced the number of records used in the evaluation. At any rate, my original output follows.