Developing Data Products

Project Pitch

Anuj Parashar

2/21/2019

About the App

This application is a simple demonstration of ‘random forest’ classification. It shows how a subset of the choices made when training the model affects the results.

The application uses the ‘iris’ dataset, which has the following structure:

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Based on the user’s inputs, the application trains the model, applies it to the test set, and displays the model accuracy along with a variable importance plot. Here is a sample run:

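library(caret)   # createDataPartition(), train(), confusionMatrix()
data(iris)

# Hold out 30% of the rows, stratified by species, for testing
inTrain <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing <- iris[-inTrain, ]

# Seed the resampling and the forest (the split above is drawn before this call)
set.seed(1432)

# Random forest trained with 5-fold cross-validation
trCtrl <- trainControl(method = "cv", number = 5)
mf <- train(Species ~ ., method = "rf", trControl = trCtrl, data = training)

# Predict on the held-out rows and compare with the true species
prd <- predict(mf, testing)
cm <- confusionMatrix(prd, testing$Species)

print(paste("Model Accuracy:", round(cm$overall['Accuracy'] * 100, 2), '%'))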
## [1] "Model Accuracy: 95.56 %"

User Selection Options

The user can select from the following options:

Training proportion (p in the sample run). Default: 0.7 (70% training, 30% testing).

Predictors. Default: all 4. At least one predictor must be selected; otherwise an error is thrown.

Resampling control (trainControl in the sample run). Default: not selected. If selected, there are 4 method options:

Bootstrapping, Cross Validation, Leave One Out Cross Validation, None

Number of folds or resampling iterations (number in the sample run). Default: 5
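
The options above could be turned into caret inputs roughly as follows. This is only a sketch: the helper names build_formula and build_control and the label-to-method lookup are hypothetical and are not taken from the app's source. It assumes the four method options correspond to caret's "boot", "cv", "LOOCV" and "none".

```r
library(caret)

# Hypothetical lookup from the option labels to caret method names
method_codes <- c(
  "Bootstrapping"                  = "boot",
  "Cross Validation"               = "cv",
  "Leave One Out Cross Validation" = "LOOCV",
  "None"                           = "none"
)

# Hypothetical helper: build the model formula from the chosen predictors.
# reformulate() stops with an error when no predictor is supplied.
build_formula <- function(predictors) {
  reformulate(predictors, response = "Species")
}

# Hypothetical helper: build the trainControl object from the UI choices.
# When the control option is not selected, caret's defaults are used.
build_control <- function(use_control = FALSE,
                          method_label = "Cross Validation",
                          number = 5) {
  if (!use_control) {
    return(trainControl())
  }
  trainControl(method = method_codes[[method_label]], number = number)
}

fml  <- build_formula(c("Petal.Length", "Petal.Width"))
ctrl <- build_control(TRUE, "Cross Validation", number = 5)

print(fml)
print(ctrl$method)
```

The resulting fml and ctrl can be passed to train() in place of Species ~ . and trCtrl in the sample run. Note that with method "none" caret fits a single model, so train() would also need a one-row tuneGrid (for example data.frame(mtry = 2)).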

Output (Model Tab)

Based on the user’s selections, the model is trained and then applied to the testing dataset. The following information is included:

Please note the different parameters as they change based on user inputs.

Accuracy when the model is trained. Note that for the method ‘none’, this accuracy is not displayed. Can you think of why?

Accuracy when the model is applied to the testing set.

The variable importance plot shows the relative importance of the different predictors in the model.
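
As a rough illustration of where these outputs could come from in caret, the sketch below repeats the sample run and then pulls out the training accuracy, the test-set results, and the variable importance. It is not the app's actual code. It assumes the randomForest package is installed, and it sets the seed before the split, so its numbers need not match the sample run.

```r
library(caret)
data(iris)

set.seed(1432)  # assumption: seed set before the split so the whole run repeats
inTrain  <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

mf <- train(Species ~ ., method = "rf", data = training,
            trControl = trainControl(method = "cv", number = 5))

# Accuracy estimated during training, one row per candidate mtry
print(mf$results[, c("mtry", "Accuracy")])

# Results on the testing set: confusion table and overall accuracy
cm <- confusionMatrix(predict(mf, testing), testing$Species)
print(cm$table)
print(cm$overall["Accuracy"])

# Relative importance of the predictors, as a table and as a plot
imp <- varImp(mf)
print(imp)
print(plot(imp))
```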