Load the libraries needed for the analysis.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(stats)
library(ggplot2)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
Note that the raw CSV files contain "#DIV/0!" entries (spreadsheet division-by-zero errors) that need to be read in as NA.
trainingAddress<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testingAddress<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training<-read.csv(trainingAddress, na.strings=c("#DIV/0!","NA",""))
testing<-read.csv(testingAddress, na.strings=c("#DIV/0!","NA",""))
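As a quick check (my addition, not in the original script), we can count how many columns picked up NAs after the conversion:
# Number of columns containing at least one NA after reading
sum(colSums(is.na(training)) > 0)
dim(training)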
Here we set a seed so the project can be reproduced.
set.seed(111)
Here, I split the training data into two groups, 60% for training and 40% for validation, so the model can be evaluated later on data it has not seen.
inTrain<-createDataPartition(training$classe, p=0.6, list=FALSE)
myTraining<-training[inTrain,]
myTesting<-training[-inTrain,]
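A quick sanity check on the partition (my addition, not in the original script):
# createDataPartition stratifies on classe, so both splits should be
# roughly 60/40 with similar class proportions
dim(myTraining); dim(myTesting)
prop.table(table(myTraining$classe))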
First, I remove columns with near-zero variance, since they carry almost no information for prediction.
NZV <- nearZeroVar(myTraining)
NZV <- c(1, NZV)   # also drop column 1 (X, the row index) so it isn't used as a predictor
myTraining <- myTraining[-NZV]
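For reference, nearZeroVar() can also report the diagnostics behind each flag when run before the columns are dropped (a sketch, not part of the original analysis):
# saveMetrics=TRUE returns frequency ratio, percent-unique, and
# zero/near-zero-variance flags for every column, not just indices
nzvMetrics <- nearZeroVar(training, saveMetrics=TRUE)
head(nzvMetrics[nzvMetrics$nzv, ])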
Then, while not particularly scientific, I removed any variable whose first 10 values were all NA; a programmatic sketch of this filter appears after the name list below.
thenames<-c("amplitude_roll_forearm", "amplitude_pitch_forearm", "var_accel_forearm",
"max_picth_forearm", "max_yaw_forearm", "min_roll_forearm", "min_pitch_forearm",
"min_yaw_forearm","kurtosis_roll_forearm", "kurtosis_picth_forearm",
"skewness_roll_forearm", "skewness_pitch_forearm", "var_pitch_dumbbell",
"avg_yaw_dumbbell", "stddev_yaw_dumbbell", "var_yaw_dumbbell",
"stddev_roll_dumbbell", "var_roll_dumbbell", "avg_pitch_dumbbell",
"stddev_pitch_dumbbell","var_accel_dumbbell", "avg_roll_dumbbell",
"amplitude_pitch_dumbbell","min_roll_dumbbell", "min_pitch_dumbbell",
"min_yaw_dumbbell", "amplitude_roll_dumbbell","skewness_pitch_dumbbell",
"max_roll_dumbbell", "max_picth_dumbbell", "max_yaw_dumbbell",
"kurtosis_roll_dumbbell", "kurtosis_picth_dumbbell", "skewness_roll_dumbbell",
"min_yaw_arm", "amplitude_pitch_arm", "amplitude_yaw_arm","skewness_pitch_arm",
"skewness_yaw_arm", "max_picth_arm", "max_yaw_arm", "min_roll_arm",
"min_pitch_arm","kurtosis_roll_arm", "kurtosis_picth_arm", "kurtosis_yaw_arm",
"skewness_roll_arm","var_accel_arm","stddev_pitch_belt", "var_pitch_belt",
"avg_yaw_belt", "stddev_yaw_belt", "var_yaw_belt", "var_total_accel_belt",
"avg_roll_belt", "stddev_roll_belt", "var_roll_belt", "avg_pitch_belt",
"min_roll_belt", "min_pitch_belt", "min_yaw_belt", "amplitude_roll_belt",
"amplitude_pitch_belt","skewness_roll_belt", "skewness_roll_belt.1",
"max_roll_belt", "max_picth_belt", "max_yaw_belt","kurtosis_roll_belt",
"kurtosis_picth_belt")
notnames<-!(names(myTraining) %in% thenames)
myTraining<-myTraining[notnames]
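As noted above, the same filter could have been written programmatically instead of listing the names by hand (a sketch of the idea, not the original code):
# Flag columns whose first 10 values are all NA (run on the data
# before this filtering step); roughly equivalent to the manual list
firstTenNA <- sapply(myTraining, function(col) all(is.na(col[1:10])))
# myTraining <- myTraining[!firstTenNA]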
I repeated the same column removals on the test data (both the graded test set for submission and the validation set carved out of the training data).
myTesting<-myTesting[-NZV]
myTesting<-myTesting[notnames]
testing<-testing[-NZV]
testing<-testing[notnames]
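At this point the column sets should line up; a quick check (my addition, assuming the graded test set differs from the training data only in its final column):
# myTraining and myTesting should now have identical column names;
# the graded test set should differ only by problem_id vs. classe
identical(names(myTraining), names(myTesting))
setdiff(names(myTraining), names(testing))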
A little more cleanup is needed before running random forests: corresponding columns must have the same data types across the data sets.
testing[5:58]<-as.numeric(as.character(unlist(testing[5:58])))
myTraining[5:57]<-as.numeric(as.character(unlist(myTraining[5:57])))
myTesting[5:57]<-as.numeric(as.character(unlist(myTesting[5:57])))
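To verify the coercion produced matching types across the splits (my addition):
# All coerced columns should now be numeric in both derived sets
identical(sapply(myTraining[5:57], class), sapply(myTesting[5:57], class))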
Lastly, I found that the factor variable in column 4 was breaking the model fit, so I removed it from all three data sets.
testing<-testing[-4]
myTesting<-myTesting[-4]
myTraining<-myTraining[-4]
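One way to spot such problem columns ahead of time (a sketch, my addition):
# List remaining factor columns; predict() on a random forest can fail
# when a factor's levels differ between training and test data
names(myTraining)[sapply(myTraining, is.factor)]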
I chose randomForest as the machine learning algorithm.
model <- randomForest(classe ~ ., data=myTraining, na.action=na.roughfix)
prediction <- predict(model, myTesting, type="class")
# na.action=na.roughfix was added when I knitted with knitr; for some
# reason the code did not run correctly here without it.
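Before evaluating on the held-out split, randomForest's built-in out-of-bag (OOB) error estimate is worth a look (my addition):
# Printing the model shows the OOB error estimate and a confusion
# matrix computed from out-of-bag samples during training
print(model)
varImpPlot(model)   # which predictors the forest relied on most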
Earlier, the training data was split into training and validation sets. Here, I evaluate the model on that validation set (the myTesting data frame).
confusionMatrix(prediction, myTesting$classe)
## Loading required namespace: e1071
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 45 0 0 0 0
## B 0 30 0 0 0
## C 0 0 30 1 0
## D 0 0 0 23 0
## E 0 0 0 0 31
##
## Overall Statistics
##
## Accuracy : 0.9938
## 95% CI : (0.9657, 0.9998)
## No Information Rate : 0.2812
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9921
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 0.9583 1.0000
## Specificity 1.0000 1.0000 0.9923 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 0.9677 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 0.9927 1.0000
## Prevalence 0.2812 0.1875 0.1875 0.1500 0.1938
## Detection Rate 0.2812 0.1875 0.1875 0.1437 0.1938
## Detection Prevalence 0.2812 0.1875 0.1938 0.1437 0.1938
## Balanced Accuracy 1.0000 1.0000 0.9962 0.9792 1.0000
The output above shows 99.38% accuracy (my original run, before adding the na.roughfix workaround, reported 99.75%). Since myTesting was held out from model fitting, this corresponds to an estimated out-of-sample error of about 1 - 0.9938 = 0.62%.
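The same estimate can be pulled directly from the confusionMatrix object (a small sketch, my addition):
# Estimated out-of-sample error = 1 - validation accuracy
cm <- confusionMatrix(prediction, myTesting$classe)
1 - as.numeric(cm$overall["Accuracy"])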
Here we apply the model to the real (graded) test data.
finalPredict<-predict(model,testing)
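A quick look at the predicted classes before writing them out (my addition):
# Distribution of predicted classes across the graded test cases
table(finalPredict)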
Last, the function below was used to generate the files for submission.
pml_write_files <- function(x) {
    # Write each prediction to its own one-line text file for submission
    n <- length(x)
    for (i in 1:n) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
}
#pml_write_files(finalPredict)
Below is the confusion matrix output from my original .R file. I could not determine why knitr was failing on my original code, so I applied the na.action=na.roughfix fix above; that fix drastically reduced the number of records used in the evaluation. At any rate, my original output follows.