The goal of this project is to predict the manner in which the participants did the exercise. This is the "classe" variable in the training set, and any of the other variables may be used as predictors. This report describes how the model was built, how cross-validation was used, what the expected out-of-sample error is, and why these choices were made. The final model is then used to predict 20 test cases.
# Downloading the data
set.seed(1)
tn_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
ts_url <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(tn_url, "train.csv", method = "curl")
download.file(ts_url, "test.csv", method = "curl")
The first step is to load the data into the train and test variables, treating both "NA" and "#DIV/0!" entries as missing values.
train <- read.csv("train.csv", na.strings = c("NA", "#DIV/0!"))
test <- read.csv("test.csv", na.strings = c("NA", "#DIV/0!"))
dim(train)
## [1] 19622 160
table(train$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
The first seven columns are identifier, timestamp, and window fields rather than sensor measurements, so they carry no useful predictive information and are removed.
train <- train[, -(1:7)] # drop the identifier/timestamp/window columns
test <- test[, -(1:7)]
Next, remove the columns that contain missing values, keeping only complete predictors:
train <- train[,colSums(is.na(train))== 0]
test <- test[,colSums(is.na(test))== 0]
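As an additional, optional filter, near-zero-variance predictors could also be dropped explicitly with caret's nearZeroVar(); a minimal sketch, not applied for the results reported below:

library(caret)                  # nearZeroVar() is provided by caret
nzv <- nearZeroVar(train)       # column indices with near-zero variance
if (length(nzv) > 0) {
  train <- train[, -nzv]
  test  <- test[, -nzv]
}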
We now load the caret package and partition the data: 70% of the training set is used to fit the model, and the remaining 30% is held out as a validation set for estimating out-of-sample performance.
suppressPackageStartupMessages(library(caret))
train_set <- createDataPartition(train$classe, p = 0.7, list = FALSE)
train_full <- train[train_set,]
train_val <- train[-train_set,]
dim(train_full) # size of the training data
## [1] 13737 53
dim(train_val) #size of the validation data
## [1] 5885 53
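A simple hold-out validation is used here in place of k-fold cross-validation. An explicit k-fold cross-validation could instead be run through caret's train() interface; a minimal sketch, assuming 5 folds (not used for the results below, as it is considerably slower):

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
fit_cv <- train(as.factor(classe) ~ ., data = train_full, method = "rf", trControl = ctrl)
fit_cv$results                                    # cross-validated accuracy for each mtry tried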
The chosen model is a random forest, trained with 500 trees (the randomForest default). A random forest is an ensemble classifier built from many decision trees: it uses bagging and feature randomness when growing each tree, so the resulting forest of largely uncorrelated trees predicts more accurately by committee than any individual tree. This makes it well suited to this classification problem.
suppressPackageStartupMessages(library(randomForest))
## Warning: package 'randomForest' was built under R version 4.0.2
fit <- randomForest(as.factor(classe) ~ ., data = train_full, ntree = 500, importance = TRUE)
print(fit)
##
## Call:
## randomForest(formula = as.factor(classe) ~ ., data = train_full, ntree = 500, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.51%
## Confusion matrix:
## A B C D E class.error
## A 3902 2 0 0 2 0.001024066
## B 10 2642 6 0 0 0.006019564
## C 0 14 2381 1 0 0.006260434
## D 0 0 29 2222 1 0.013321492
## E 0 0 1 4 2520 0.001980198
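Because importance = TRUE was requested in the fit, the forest also records variable-importance measures, which could be inspected as follows (output not shown; this is just one way to view them):

varImpPlot(fit, n.var = 10, main = "Top 10 predictors by importance") # plot the most influential predictors
head(importance(fit))                                                 # raw importance measures per predictor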
As expected, the model classifies the training set perfectly (100% accuracy), since these are the same observations it was fit on; this figure says nothing about out-of-sample performance and is shown only as a sanity check.
train_full$classe <- as.factor(train_full$classe)
training_accuracy <- predict(fit, train_full)
print(confusionMatrix(training_accuracy, train_full$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3906 0 0 0 0
## B 0 2658 0 0 0
## C 0 0 2396 0 0
## D 0 0 0 2252 0
## E 0 0 0 0 2525
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
The validation set serves as a test set, since the model was never trained on it. The model achieves an accuracy of 99.46% on this set, so the expected out-of-sample error is roughly 0.54%, consistent with the OOB error estimate of 0.51% reported by the random forest.
train_val$classe <- as.factor(train_val$classe)
validation_accuracy <- predict(fit, train_val)
print(confusionMatrix(validation_accuracy, train_val$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 1 0 0 0
## B 2 1138 9 0 0
## C 0 0 1012 12 0
## D 0 0 5 951 2
## E 0 0 0 1 1080
##
## Overall Statistics
##
## Accuracy : 0.9946
## 95% CI : (0.9923, 0.9963)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9931
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9988 0.9991 0.9864 0.9865 0.9982
## Specificity 0.9998 0.9977 0.9975 0.9986 0.9998
## Pos Pred Value 0.9994 0.9904 0.9883 0.9927 0.9991
## Neg Pred Value 0.9995 0.9998 0.9971 0.9974 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2841 0.1934 0.1720 0.1616 0.1835
## Detection Prevalence 0.2843 0.1952 0.1740 0.1628 0.1837
## Balanced Accuracy 0.9993 0.9984 0.9919 0.9925 0.9990
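The expected out-of-sample error can also be computed directly from the confusion matrix object, as one minus the validation accuracy:

cm <- confusionMatrix(validation_accuracy, train_val$classe)
oos_error <- 1 - as.numeric(cm$overall["Accuracy"]) # expected out-of-sample error
oos_error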
Since the model generalizes well to held-out data, we finally apply it to the 20 test cases.
test_accuracy <- predict(fit, test)
test_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E