Prep R

Load the libraries needed for the analysis.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(stats)
library(ggplot2)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

Download the data from the course site and read in the CSVs.

Note that the raw files contain "#DIV/0!" strings (Excel division-by-zero errors), which need to be read in as missing values.

trainingAddress<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testingAddress<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training<-read.csv(trainingAddress, na.strings=c("#DIV/0!","NA",""))
testing<-read.csv(testingAddress, na.strings=c("#DIV/0!","NA",""))
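
As a quick sanity check (an illustrative addition of mine, not part of the original script), we can confirm the size of the data and count how many columns became mostly NA after the na.strings conversion:

# How big is the data, and how many columns are mostly missing values?
dim(training)
sum(colMeans(is.na(training)) > 0.9)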

Setting a seed for reproducibility

Here we set a seed so that the random steps below (the data partition and the random forest) produce the same results on every run.

set.seed(111)

Dividing the training data

Here, I split the training data into two groups: 60% for training and 40% held out for validation. The held-out set is used later to estimate the out-of-sample error.

inTrain<-createDataPartition(training$classe, p=0.6, list=FALSE)
myTraining<-training[inTrain,] 
myTesting<-training[-inTrain,]
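
As an illustrative check (my addition), the stratified split should preserve the class distribution in both partitions:

# createDataPartition samples within each level of classe,
# so the proportions should be nearly identical
round(prop.table(table(myTraining$classe)), 3)
round(prop.table(table(myTesting$classe)), 3)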

Cleaning the data

First, I remove columns that have nearly no variance.

NZV <- nearZeroVar(myTraining)
NZV<-c(1,NZV) # also flag column 1 (the X row-index column) so it isn't used as a predictor
myTraining<-myTraining[-NZV]
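
For reference, caret can also report why columns get flagged. This sketch (an illustrative addition, run against the original training frame since the flagged columns have just been removed from myTraining) shows the diagnostics:

# saveMetrics=TRUE returns frequency-ratio and percent-unique diagnostics
nzvMetrics<-nearZeroVar(training, saveMetrics=TRUE)
head(nzvMetrics[nzvMetrics$nzv,])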

Then, while not particularly scientific, I removed any variable whose first 10 values were all NA. The variables dropped by that rule are listed below (a programmatic alternative is sketched after the code).

thenames<-c("amplitude_roll_forearm", "amplitude_pitch_forearm", "var_accel_forearm",
            "max_picth_forearm", "max_yaw_forearm", "min_roll_forearm", "min_pitch_forearm",
            "min_yaw_forearm","kurtosis_roll_forearm", "kurtosis_picth_forearm", 
            "skewness_roll_forearm", "skewness_pitch_forearm", "var_pitch_dumbbell", 
            "avg_yaw_dumbbell", "stddev_yaw_dumbbell", "var_yaw_dumbbell",
            "stddev_roll_dumbbell", "var_roll_dumbbell", "avg_pitch_dumbbell", 
            "stddev_pitch_dumbbell","var_accel_dumbbell", "avg_roll_dumbbell",
            "amplitude_pitch_dumbbell","min_roll_dumbbell", "min_pitch_dumbbell", 
            "min_yaw_dumbbell", "amplitude_roll_dumbbell","skewness_pitch_dumbbell", 
            "max_roll_dumbbell", "max_picth_dumbbell", "max_yaw_dumbbell",
            "kurtosis_roll_dumbbell", "kurtosis_picth_dumbbell", "skewness_roll_dumbbell",
            "min_yaw_arm", "amplitude_pitch_arm", "amplitude_yaw_arm","skewness_pitch_arm",
            "skewness_yaw_arm", "max_picth_arm", "max_yaw_arm", "min_roll_arm", 
            "min_pitch_arm","kurtosis_roll_arm", "kurtosis_picth_arm", "kurtosis_yaw_arm",
            "skewness_roll_arm","var_accel_arm","stddev_pitch_belt", "var_pitch_belt", 
            "avg_yaw_belt", "stddev_yaw_belt", "var_yaw_belt", "var_total_accel_belt", 
            "avg_roll_belt", "stddev_roll_belt", "var_roll_belt", "avg_pitch_belt",
            "min_roll_belt", "min_pitch_belt", "min_yaw_belt", "amplitude_roll_belt", 
            "amplitude_pitch_belt","skewness_roll_belt", "skewness_roll_belt.1", 
            "max_roll_belt", "max_picth_belt", "max_yaw_belt","kurtosis_roll_belt", 
            "kurtosis_picth_belt")
notnames<-!(names(myTraining) %in% thenames)
myTraining<-myTraining[notnames]
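
The same rule can be expressed programmatically. A minimal sketch (my addition; it would need to run before the removal above) that recovers such a list without typing it by hand:

# Flag every column whose first 10 values are all NA
naFirst10<-sapply(myTraining, function(col) all(is.na(col[1:10])))
names(myTraining)[naFirst10]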

I repeated the same process on the test data, both the submission test set and the validation set carved out of the training data.

myTesting<-myTesting[-NZV]
myTesting<-myTesting[notnames]
testing<-testing[-NZV]
testing<-testing[notnames]

A little more cleanup helped the random forest along: the predictor columns need to have the same data types across all three data frames.

# Coerce the predictor columns to a common numeric type
testing[5:58]<-as.numeric(as.character(unlist(testing[5:58])))
myTraining[5:57]<-as.numeric(as.character(unlist(myTraining[5:57])))
myTesting[5:57]<-as.numeric(as.character(unlist(myTesting[5:57])))

Lastly, I found that the factor variable in column 4 was breaking everything (most likely because its levels differ between the training and test files), so I removed it from all three data frames.

testing<-testing[-4]
myTesting<-myTesting[-4]
myTraining<-myTraining[-4]
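
To confirm the cleanup worked, an illustrative check (my addition) that the training and validation frames now agree column-by-column on their data types:

# Should return TRUE: both frames were cleaned with the same steps
identical(sapply(myTraining, class), sapply(myTesting, class))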

Model it!

I chose randomForest as the machine learning algorithm.

model <- randomForest(classe ~ ., data=myTraining, na.action=na.roughfix)
prediction <- predict(model, myTesting, type = "class")
# na.action=na.roughfix (which imputes NAs with column medians/modes) was
# added to get this chunk to run under knitr; the original .R script ran
# without it (see the Appendix).
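
The fitted forest itself offers a first sanity check before any hold-out evaluation; these two calls (my illustrative additions) report randomForest's built-in out-of-bag error estimate and rank the predictors:

print(model)      # includes the out-of-bag (OOB) error estimate
varImpPlot(model) # predictor importance (mean decrease in Gini)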

Earlier, we split the original training data into training and validation sets. Here, I evaluate the model on that held-out validation set.

confusionMatrix(prediction, myTesting$classe)
## Loading required namespace: e1071
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  A  B  C  D  E
##          A 45  0  0  0  0
##          B  0 30  0  0  0
##          C  0  0 30  1  0
##          D  0  0  0 23  0
##          E  0  0  0  0 31
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9938          
##                  95% CI : (0.9657, 0.9998)
##     No Information Rate : 0.2812          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9921          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   0.9583   1.0000
## Specificity            1.0000   1.0000   0.9923   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   0.9677   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   0.9927   1.0000
## Prevalence             0.2812   0.1875   0.1875   0.1500   0.1938
## Detection Rate         0.2812   0.1875   0.1875   0.1437   0.1938
## Detection Prevalence   0.2812   0.1875   0.1938   0.1437   0.1938
## Balanced Accuracy      1.0000   1.0000   0.9962   0.9792   1.0000

The run above shows 99.38% accuracy (the original run, without the knitr fix, achieved 99.75%; see the Appendix). Because these predictions are made on held-out data that played no part in training, they give an estimate of out-of-sample performance.
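
The estimated out-of-sample error rate is one minus the validation accuracy; using the two accuracies reported above:

1 - 0.9938  # ~0.62% estimated out-of-sample error (na.roughfix run)
1 - 0.9975  # ~0.25% in the original run (see the Appendix)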

Apply to test data

Here we apply the model to the real (submission) test data.

finalPredict<-predict(model,testing)

Last, the function below was used to generate the files for submission.

pml_write_files = function(x) {
  # Write one text file per prediction, named problem_id_<i>.txt
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}

#pml_write_files(finalPredict)

Appendix

This is the confusion matrix output from my original .R file. I could not determine why knitr was failing on my code, so I found a fix (na.action=na.roughfix) and used it in the knitted version above. That fix drastically reduced the number of records in the evaluation (compare the cell counts below with the much smaller ones above).

At any rate, below is my output.

confusionMatrix(prediction, myTesting$classe)
## Loading required namespace: e1071
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2232   10    0    0    0
##          B    0 1508    3    0    0
##          C    0    0 1365    6    0
##          D    0    0    0 1279    0
##          E    0    0    0    1 1442
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9975          
##                  95% CI : (0.9961, 0.9984)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9968          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9934   0.9978   0.9946   1.0000
## Specificity            0.9982   0.9995   0.9991   1.0000   0.9998
## Pos Pred Value         0.9955   0.9980   0.9956   1.0000   0.9993
## Neg Pred Value         1.0000   0.9984   0.9995   0.9989   1.0000
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1922   0.1740   0.1630   0.1838
## Detection Prevalence   0.2858   0.1926   0.1747   0.1630   0.1839
## Balanced Accuracy      0.9991   0.9965   0.9984   0.9973   0.9999