- Platform: Coursera
- Course: Practical Machine Learning
- Task: Week 4 Final Project
- Location: Texas, USA
- Link: www.coursera.org/learn/practical-machine-learning/

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self-movement. A group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

The training data and the testing data are below:

```
if(!require("pacman")) install.packages("pacman")
pacman::p_load(knitr, corrplot, caret, randomForest, rattle, gbm)
options(digits = 3)
set.seed(123)
```

```
# link set
train.file = "pml-training.csv"
test.file = "pml-testing.csv"
# download dataset
train = read.csv(train.file)
test = read.csv(test.file)
# partition dataset
in.train = createDataPartition(train$classe, p = 0.7, list = F)
train.set = train[in.train, ]
test.set = train[-in.train, ]
# check dataset
dim(train.set)
```

`## [1] 13737 160`

Firstly, we are gonna clean the NA variables and identification variables. Importantly, the test dataset is not changed and will only be used for the quiz results generation.

```
# near zero variance
nzv = nearZeroVar(train.set)
train.set = train.set[, -nzv]
# check dataset
dim(train.set)
```

`## [1] 13737 111`

The variables as near zero variance are meaningless for modeling.

```
# mostly NA variable
near.na = sapply(train.set, function(x) mean(is.na(x))) > 0.95
train.set = train.set[, near.na == F]
# check dataset
dim(train.set)
```

`## [1] 13737 59`

The variables as mostly NA variables are useless for modeling.

```
# identification variable
train.set = train.set[, -c(1:5)]
# check dataset
dim(train.set)
```

`## [1] 13737 54`

The variable as identification variables are pointless for modeling. After cleaning, we can see the ready-to-analysis dataset has 53 variables as independent and 1 variable as dependent.

```
cor.matrix = cor(train.set[, -54])
corrplot(cor.matrix, order = "FPC", method = "color", type = "lower",
tl.cex = 0.6, tl.col = rgb(0, 0, 0))
```

A correlation among variables is analysed before modeling. If the correlations are quite more, a principal components analysis (PCA) could be performed as processing step to make an even more compact analysis. However, the plot shows quite a few correlations. PCA will not be applied.

We are gonna use three methods to build up models and even stack them together. The methods are random forest, decision tree, and generalized boosted model.

```
# train model
mod.rf = train(data = train.set,
classe ~ .,
method = "rf",
trControl = trainControl(method = "cv",
number = 3))
# test model & validate model
pred.rf = predict(mod.rf, newdata = test.set)
# evaluate model
cm.rf = confusionMatrix(table(pred.rf, test.set$classe))
# plot matrix result
plot(cm.rf$table, col = cm.rf$byClass,
main = "Random Forest",
sub = paste("Accuracy =", round(cm.rf$overall[1], 4)),
xlab = "Prediction",
ylab = "Reference")
```

```
# train model
mod.dt = train(data = train.set,
classe ~ .,
method = "rpart",
trControl = trainControl(method = "cv",
number = 3,
verboseIter = F))
fancyRpartPlot(mod.dt$finalModel) # decision tree plot (optional)
```

```
# test model & validate model
pred.dt = predict(mod.dt, newdata = test.set)
# evaluate model on validate
cm.dt = confusionMatrix(table(pred.dt, test.set$classe))
# plot matrix result
plot(cm.dt$table, col = cm.dt$byClass,
main = "Decision Tree",
sub = paste("Accuracy =", round(cm.dt$overall[1], 4)),
xlab = "Prediction",
ylab = "Reference")
```

```
# train model
mod.gbm = train(data = train.set,
classe ~ .,
method = "gbm",
trControl = trainControl(method = "cv",
number = 3),
verbose = F)
# test model & validate model
pred.gbm = predict(mod.gbm, newdata = test.set)
# evaluate model on validate
cm.gbm = confusionMatrix(table(pred.gbm, test.set$classe))
# plot matrix result
plot(cm.gbm$table, col = cm.gbm$byClass,
main = "Generalized Boosted Model",
sub = paste("Accuracy =", round(cm.gbm$overall[1], 4)),
xlab = "Prediction",
ylab = "Reference")
```

The accuracy of the 3 modeling methods above are:

- Random Forest (0.9986)
- Decision Tree (0.5694)
- Generalized Boosted Model (0.9845)

```
pred.test = predict(mod.rf, newdata = test)
pred.test
```

```
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
```

The random forest model will be applied to predict the 20 quiz results as shown above. We also used the stacking model strategy, but it was not working well with a classficiation model. Also, the accuracy was not improving.