Loading data

Background:

In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, to predict the way in which each lift was performed (the classe variable).

I need to build a model, cross-validate it, and use it to predict 20 different test cases.

Loading data

library(caret)

## Warning: package 'caret' was built under R version 2.15.3

## Loading required package: lattice

## Warning: package 'lattice' was built under R version 2.15.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 2.15.3

# setwd('C:/Users/Alfonso/Desktop/JOM/Practical_Machine_Learning')
setwd("C:/Users/JosePortatil/Dropbox/Data Science/Practical_Machine_Learning")

## Error: cannot change working directory

training <- read.csv("pml-training.csv", header = TRUE, na.strings = "")
testing <- read.csv("pml-testing.csv", header = TRUE, na.strings = "")
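
The setwd() call above fails on any machine that does not have that exact folder. A minimal sketch of one alternative follows, assuming a hypothetical helper read_pml() with a data_dir argument (neither is part of the original analysis). It builds the path with file.path() and stops early if the file is missing:

```r
# Hypothetical helper (not from the original analysis): read one of the CSV
# files from a configurable directory instead of relying on setwd().
read_pml <- function(file_name, data_dir = ".") {
  path <- file.path(data_dir, file_name)
  if (!file.exists(path)) {
    stop("Data file not found: ", path)
  }
  # Same arguments as the read.csv() calls above
  read.csv(path, header = TRUE, na.strings = "")
}

# Usage, assuming the CSV files sit in the current directory:
# training <- read_pml("pml-training.csv")
# testing  <- read_pml("pml-testing.csv")
```

Passing the directory as an argument keeps the script portable between machines without editing a hard-coded path.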

Cleaning data

Cleaning Training Data:

Several columns (row index, user name, timestamps, and window markers) are bookkeeping fields rather than sensor measurements. They are not relevant to the analysis and should be dropped.

# Drop irrelevant columns.
training$X <- NULL
training$user_name <- NULL
training$raw_timestamp_part_1 <- NULL
training$raw_timestamp_part_2 <- NULL
training$cvtd_timestamp <- NULL
training$new_window <- NULL
training$num_window <- NULL
training$problem_id <- NULL

Almost all columns represent numeric values, but they were read in as factors. We must convert them to numeric values.

# Convert column variables to numeric (first to character, then to numeric).
# Column 153 is the outcome, classe, and is left as it is.
training[, -c(153)] <- sapply(training[, -c(153)], as.character)
training[, -c(153)] <- sapply(training[, -c(153)], as.numeric)
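
The detour through as.character() matters because as.numeric() applied directly to a factor returns the internal level codes, not the values. The toy example below (the CSV text is made up for illustration) also shows one way a numeric column can end up as a factor: with na.strings = "", a literal "NA" in the file is kept as text.

```r
# Toy CSV (made up for this example) in which a missing value is spelled "NA"
csv_text <- "x,y\n1.5,NA\n2.5,10\n3.5,2.5"

# With na.strings = "", the literal "NA" is kept as text, so y is not numeric.
# stringsAsFactors = TRUE reproduces the factor behaviour of older R versions.
toy <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)
print(class(toy$y))

# as.numeric() on a factor returns level codes, not the measurements
codes <- as.numeric(toy$y)
print(codes)

# Going through as.character() recovers the numbers; "NA" becomes a real NA
values <- suppressWarnings(as.numeric(as.character(toy$y)))
print(values)
```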

A summary of the data shows that many columns consist almost entirely of NA values. These columns can be dropped because they add no value to the model.

# Drop columns with lots of NA values
delete_columns <- which(colSums(is.na(training)) > 19000)
training <- training[, -c(delete_columns)]
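
The same idea can be sketched on a toy data frame with a relative threshold instead of an absolute row count. The data and the 50% cut-off below are assumptions chosen for illustration, not values from the analysis.

```r
# Toy data frame (made up): column b is mostly missing, c has one gap
toy <- data.frame(
  a = 1:5,
  b = c(NA, NA, NA, NA, 1),
  c = c(1, NA, 3, 4, 5)
)

# Share of missing values per column
na_share <- colMeans(is.na(toy))

# Keep columns with at most 50% missing values (threshold is an assumption).
# A logical mask stays correct even when no column exceeds the threshold,
# whereas toy[, -which(...)] would return zero columns in that case.
keep <- na_share <= 0.5
toy_clean <- toy[, keep, drop = FALSE]
print(names(toy_clean))
```

A proportion is easier to reuse than a fixed count such as 19000, which only makes sense for a data set of one particular size.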

Cleaning Testing Data:

The same process is applied to the testing data.

# Drop irrelevant columns.
testing$X <- NULL
testing$user_name <- NULL
testing$raw_timestamp_part_1 <- NULL
testing$raw_timestamp_part_2 <- NULL
testing$cvtd_timestamp <- NULL
testing$new_window <- NULL
testing$num_window <- NULL
testing$problem_id <- NULL

# Convert column variables to numeric (first to character, then to numeric).
testing[, -c(153)] <- sapply(testing[, -c(153)], as.character)
testing[, -c(153)] <- sapply(testing[, -c(153)], as.numeric)

# Drop the same columns that were dropped from the training data
testing <- testing[, -c(delete_columns)]
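
Reusing positional indices works only while both data frames keep the same column order. A hedged alternative is to select the testing columns by name, as in this sketch. The toy data frames and their sensor column names are illustrative assumptions.

```r
# Toy stand-ins (made up) for the cleaned training set and the raw testing set
train_toy <- data.frame(
  roll_belt = c(1.2, 1.4),
  yaw_belt = c(0.3, 0.1),
  classe = factor(c("A", "B"))
)
test_toy <- data.frame(
  roll_belt = 1.3,
  kurtosis_roll_belt = NA,
  yaw_belt = 0.2,
  problem_id = 1
)

# Predictor names are whatever survived cleaning in training, minus the outcome
predictors <- setdiff(names(train_toy), "classe")

# Select those columns by name, so column order and extra columns in the
# testing file do not matter
test_clean <- test_toy[, predictors, drop = FALSE]
print(names(test_clean))
```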

Create a model and cross-validate

The training set is a little too big for my computer, and training a random forest on all of it would take too long. To get around this, I will take a random sample of 2000 observations from the training data to fit the model and another sample of 2000 observations to validate it.

# Take a random sample of 2000 rows from the training data, without
# replacement
training_sample <- training[sample(1:nrow(training), 2000, replace = FALSE), ]
# A second random sample of 2000 rows, used as a validation set
test_sample <- training[sample(1:nrow(training), 2000, replace = FALSE), ]
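
The two sample() calls above are independent, so the same row can land in both samples, which tends to make the validation accuracy look better than it is. A minimal sketch of a disjoint split is given here. The seed and the row count n_rows are placeholders, not values from the analysis.

```r
set.seed(1234)  # arbitrary seed, only to make the split repeatable

n_rows <- 10000  # stand-in for nrow(training); an assumption for this sketch

# Draw 4000 distinct row indices once, then split them into two halves
idx <- sample(n_rows, 4000, replace = FALSE)
train_idx <- idx[1:2000]
valid_idx <- idx[2001:4000]

# No row appears in both samples
overlap <- length(intersect(train_idx, valid_idx))
print(overlap)

# With the real data this would become, for example:
# training_sample <- training[train_idx, ]
# test_sample     <- training[valid_idx, ]
```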

Now a random forest is trained with 10-fold cross-validation and then tested on the second sample. The resulting confusion matrix is displayed.

# define training control
train_control <- trainControl(method = "cv", number = 10)
# train the model
model <- train(classe ~ ., data = training_sample, trControl = train_control, 
    method = "rf")

## Loading required package: randomForest

## Warning: package 'randomForest' was built under R version 2.15.3

## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.

## Warning: package 'e1071' was built under R version 2.15.3
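
trainControl(method = "cv", number = 10) requests 10-fold cross-validation: the rows are split into 10 groups, and each group in turn is held out while the model is fitted on the other nine. The base R sketch below only illustrates that partitioning. It is not how caret builds its folds internally, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed for a repeatable illustration

n_rows <- 2000  # size of the training sample used above
k <- 10         # number of folds, as in trainControl(number = 10)

# Assign every row to one of k folds of equal size, in random order
fold_id <- sample(rep(seq_len(k), length.out = n_rows))

# Each fold is held out once while the model is fitted on the other nine
for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)
  fitted_on <- which(fold_id != fold)
  # a model would be fitted on `fitted_on` and scored on `held_out` here
}

fold_sizes <- table(fold_id)
print(fold_sizes)
```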

# Make predictions on the validation sample and build the confusion matrix
predictions <- predict(model, test_sample[, 1:52])
# summarize results
confusionMatrix(predictions, test_sample$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 560  20   2   1   3
##          B   6 363  12   1   2
##          C   2  11 319   9   2
##          D   0   1   5 321   5
##          E   0   1   1   0 353
## 
## Overall Statistics
##                                         
##                Accuracy : 0.958         
##                  95% CI : (0.948, 0.966)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.947         
##  Mcnemar's Test P-Value : 0.0487        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.986    0.917    0.941    0.967    0.967
## Specificity             0.982    0.987    0.986    0.993    0.999
## Pos Pred Value          0.956    0.945    0.930    0.967    0.994
## Neg Pred Value          0.994    0.980    0.988    0.993    0.993
## Prevalence              0.284    0.198    0.170    0.166    0.182
## Detection Rate          0.280    0.181    0.160    0.160    0.176
## Detection Prevalence    0.293    0.192    0.172    0.166    0.177
## Balanced Accuracy       0.984    0.952    0.963    0.980    0.983

The model performs well, with an accuracy of 0.958 on the validation sample, so we proceed to predict values on the test set provided.
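
As a check on how the headline figures follow from the confusion matrix, the counts reported above can be re-entered by hand. This sketch uses base R only and copies the counts from the output above. The variable names are introduced here for illustration.

```r
# Confusion matrix copied from the output above (rows = prediction,
# columns = reference)
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(560,  20,   2,   1,   3,
      6, 363,  12,   1,   2,
      2,  11, 319,   9,   2,
      0,   1,   5, 321,   5,
      0,   1,   1,   0, 353),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over the reference (column) totals
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 3))
print(round(sensitivity, 3))
```

The diagonal sums to 1916 of 2000 cases, which gives the reported accuracy of 0.958 and an estimated error of about 0.042 on this validation sample.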

Predict

Final_predictions <- predict(model, testing)
Final_predictions

##  [1] B A B A A E D D A A B C B A E E A B A B
## Levels: A B C D E