Background:
In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
I need to build a model, cross-validate it, and use it to predict 20 different test cases.
library(caret)
## Warning: package 'caret' was built under R version 2.15.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 2.15.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 2.15.3
# setwd('C:/Users/Alfonso/Desktop/JOM/Practical_Machine_Learning')
setwd("C:/Users/JosePortatil/Dropbox/Data Science/Practical_Machine_Learning")
## Error: cannot change working directory
# Load both files; only empty strings are read as NA at this stage
training <- read.csv("pml-training.csv", header = TRUE, na.strings = "")
testing <- read.csv("pml-testing.csv", header = TRUE, na.strings = "")
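As a hedged alternative to converting the columns afterwards, several missing-value markers can be passed to `na.strings` at read time so that numeric columns are parsed as numbers directly. The sketch below uses a small inline CSV with made-up column names and values, and it assumes (this is not shown in the report) that the export marks missing cells with `NA`, empty strings, and `#DIV/0!`.

```r
# Toy CSV standing in for the real file; column names and values are hypothetical.
csv_text <- 'roll_belt,kurtosis_roll_belt,classe
1.41,NA,A
1.42,#DIV/0!,A
1.48,,B
1.50,-1.2,B'

# Assumption: "NA", "" and "#DIV/0!" are the missing-value markers in the data.
toy <- read.csv(text = csv_text, header = TRUE,
                na.strings = c("NA", "", "#DIV/0!"),
                stringsAsFactors = TRUE)

# kurtosis_roll_belt is read as numeric, with NA where a marker was found
str(toy)
```

With this approach the character-to-numeric conversion step would no longer be needed, although the mostly-NA columns would still have to be dropped.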
Cleaning Training Data:
There are several columns that are not important for the analysis and should be dropped.
# Drop columns that are not relevant.
training$X <- NULL
training$user_name <- NULL
training$raw_timestamp_part_1 <- NULL
training$raw_timestamp_part_2 <- NULL
training$cvtd_timestamp <- NULL
training$new_window <- NULL
training$num_window <- NULL
training$problem_id <- NULL  # has no effect if the column is absent from the training file
Almost all columns represent numeric values, but they were read in as factors. We must convert them to numeric values.
# Convert column variables to numeric (first to character, then to numeric);
# column 153 is excluded from the conversion
training[, -c(153)] <- sapply(training[, -c(153)], as.character)
training[, -c(153)] <- sapply(training[, -c(153)], as.numeric)
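The two-step conversion matters because calling `as.numeric()` directly on a factor returns its internal level codes rather than the values it displays. A minimal sketch on made-up data (the vector `f` and the data frame `df` are illustrative only); it also shows an `lapply`-based variant that assigns back column by column:

```r
# A factor whose labels look like numbers (toy data)
f <- factor(c("10.5", "3.2", "10.5", ""))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the actual numbers; "" becomes NA

# Column-wise version for a data frame, leaving the outcome column untouched
df <- data.frame(a = factor(c("1.5", "2.5")),
                 b = factor(c("7", "")),
                 classe = factor(c("A", "B")))
num_cols <- setdiff(names(df), "classe")
df[num_cols] <- lapply(df[num_cols],
                       function(col) as.numeric(as.character(col)))
str(df)
```

Selecting the columns by name instead of by position is a design choice made here so the code keeps working if the column order changes.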
A summary of the data shows many columns that consist almost entirely of NA values. These columns can be dropped because they add no value to the model.
# Drop columns with lots of NA values
delete_columns <- which(colSums(is.na(training)) > 19000)
training <- training[, -c(delete_columns)]
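The same filter can be expressed as a fraction of missing values and applied with a logical mask. This is only a sketch on a toy data frame, and the 0.5 cutoff is an arbitrary illustrative choice. One reason to prefer a mask: if no column exceeds the threshold, `which()` returns an empty vector, and indexing with `-integer(0)` selects zero columns instead of keeping all of them.

```r
# Toy data frame with one mostly-missing column
toy <- data.frame(
  x = c(1, 2, 3, 4),
  mostly_na = c(NA, NA, NA, 4),
  classe = factor(c("A", "B", "A", "B"))
)

# Fraction of NA values per column
na_fraction <- colMeans(is.na(toy))

# Keep columns whose NA fraction is at most the (illustrative) cutoff
keep <- na_fraction <= 0.5
toy_clean <- toy[, keep, drop = FALSE]
names(toy_clean)
```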
Cleaning Testing Data:
The same process is applied to the testing data.
# Drop columns that are not relevant.
testing$X <- NULL
testing$user_name <- NULL
testing$raw_timestamp_part_1 <- NULL
testing$raw_timestamp_part_2 <- NULL
testing$cvtd_timestamp <- NULL
testing$new_window <- NULL
testing$num_window <- NULL
testing$problem_id <- NULL
# Convert column variables to numeric (first to character, then to numeric)
testing[, -c(153)] <- sapply(testing[, -c(153)], as.character)
testing[, -c(153)] <- sapply(testing[, -c(153)], as.numeric)
# Drop the same columns that were dropped from the training data
testing <- testing[, -c(delete_columns)]
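Reusing positional indices across two data frames works only while both have the same column order. A more defensive sketch, on toy data frames with hypothetical column names, selects the test columns by the predictor names kept in the training data and fails early if any are missing:

```r
# Toy stand-ins for the cleaned training data and the raw testing data
train_toy <- data.frame(a = 1:3, b = 4:6,
                        classe = factor(c("A", "B", "A")))
test_toy  <- data.frame(b = 7:8, junk = c(NA, NA), a = 9:10,
                        problem_id = 1:2)

# Predictor names are everything in the training data except the outcome
predictors <- setdiff(names(train_toy), "classe")

# Stop if the test data lacks any predictor the model will expect
missing_cols <- setdiff(predictors, names(test_toy))
stopifnot(length(missing_cols) == 0)

# Same columns, same order as in the training data
test_aligned <- test_toy[, predictors, drop = FALSE]
names(test_aligned)
```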
The training set is a little big for my computer, and training the random forest model on all of it would take too much time. To tackle this, I will take a sample of 2000 observations from the training data to fit the model and another sample of 2000 observations to validate it.
# Take a random sample of 2000 rows from the training data, without
# replacement
training_sample <- training[sample(1:nrow(training), 2000, replace = FALSE), ]
# A second sample of 2000 rows, used to validate the model
test_sample <- training[sample(1:nrow(training), 2000, replace = FALSE), ]
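Because the two samples are drawn independently from the same data frame, some rows can end up in both, which would make the validation accuracy somewhat optimistic. A minimal sketch of a disjoint and reproducible split, assuming a toy data frame of 10000 rows (the size and the seed are arbitrary choices for illustration):

```r
set.seed(1234)  # arbitrary seed so the split can be reproduced

# Toy data frame standing in for the cleaned training data
toy <- data.frame(id = seq_len(10000))

# Draw the training rows first
train_idx <- sample(nrow(toy), 2000)

# Draw the validation rows only from the rows not used for training
remaining <- setdiff(seq_len(nrow(toy)), train_idx)
valid_idx <- sample(remaining, 2000)

toy_train <- toy[train_idx, , drop = FALSE]
toy_valid <- toy[valid_idx, , drop = FALSE]

length(intersect(train_idx, valid_idx))  # no shared rows
```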
Now a random forest is trained with 10-fold cross-validation and then evaluated on the separate validation sample. A confusion matrix is displayed.
# define training control
train_control <- trainControl(method = "cv", number = 10)
# train the model
model <- train(classe ~ ., data = training_sample, trControl = train_control,
    method = "rf")
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 2.15.3
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
## Warning: package 'e1071' was built under R version 2.15.3
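To make concrete what `method = "cv", number = 10` asks for, here is a dependency-free sketch of 10-fold cross-validation on the built-in `iris` data. It is not how `caret` is implemented, and a simple nearest-centroid classifier stands in for the random forest purely to keep the example short. The helper names `fit_centroids` and `predict_centroids` are made up for this illustration.

```r
set.seed(42)  # arbitrary seed for reproducible folds
k <- 10

# Assign every row to one of k folds of (nearly) equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Stand-in "model": one row of feature means per class
fit_centroids <- function(train) {
  aggregate(. ~ Species, data = train, FUN = mean)
}

# Predict the class whose centroid is closest (squared Euclidean distance)
predict_centroids <- function(centroids, newdata) {
  feats <- as.matrix(newdata[, names(centroids)[-1]])
  cents <- as.matrix(centroids[, -1])
  dists <- sapply(seq_len(nrow(cents)), function(j) {
    rowSums(sweep(feats, 2, cents[j, ])^2)
  })
  as.character(centroids$Species)[max.col(-dists, ties.method = "first")]
}

# Each fold is held out once; the model is fit on the other k - 1 folds
fold_accuracy <- sapply(seq_len(k), function(i) {
  held_out <- fold_id == i
  centroids <- fit_centroids(iris[!held_out, ])
  predicted <- predict_centroids(centroids, iris[held_out, ])
  mean(predicted == as.character(iris$Species[held_out]))
})

mean(fold_accuracy)  # cross-validated accuracy estimate
```

The average over folds is the quantity a cross-validated accuracy estimate reports; every row contributes to exactly one held-out fold.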
# Predict on the validation sample (columns 1 to 52 are the predictors)
predictions <- predict(model, test_sample[, 1:52])
# summarize results
confusionMatrix(predictions, test_sample$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   A   B   C   D   E
##          A 560  20   2   1   3
##          B   6 363  12   1   2
##          C   2  11 319   9   2
##          D   0   1   5 321   5
##          E   0   1   1   0 353
##
## Overall Statistics
##
##                Accuracy : 0.958
##                  95% CI : (0.948, 0.966)
##     No Information Rate : 0.284
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.947
##  Mcnemar's Test P-Value : 0.0487
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.986    0.917    0.941    0.967    0.967
## Specificity             0.982    0.987    0.986    0.993    0.999
## Pos Pred Value          0.956    0.945    0.930    0.967    0.994
## Neg Pred Value          0.994    0.980    0.988    0.993    0.993
## Prevalence              0.284    0.198    0.170    0.166    0.182
## Detection Rate          0.280    0.181    0.160    0.160    0.176
## Detection Prevalence    0.293    0.192    0.172    0.166    0.177
## Balanced Accuracy       0.984    0.952    0.963    0.980    0.983
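Several of the reported statistics can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below re-enters the counts from the output above (rows are predictions, columns are reference classes) and uses only base R. The variable names are illustrative.

```r
# Counts copied from the confusion matrix above
cm <- matrix(
  c(560,  20,   2,   1,   3,
      6, 363,  12,   1,   2,
      2,  11, 319,   9,   2,
      0,   1,   5, 321,   5,
      0,   1,   1,   0, 353),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(cm)

# Accuracy: share of observations on the diagonal
accuracy <- sum(diag(cm)) / n

# Sensitivity per class: correct predictions / reference total (column sums)
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value per class: correct / predicted total (row sums)
pos_pred_value <- diag(cm) / rowSums(cm)

# Cohen's kappa: agreement beyond what the marginals alone would give
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa_hat), 3)
round(sensitivity, 3)
```

The rounded values agree with the Accuracy, Kappa, and Sensitivity figures printed by `confusionMatrix`.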
Accuracy on the validation sample is very high (0.958), so we proceed to predict the values for the provided test set.
Final_predictions <- predict(model, testing)
Final_predictions
## [1] B A B A A E D D A A B C B A E E A B A B
## Levels: A B C D E