Developing Data Products

Project Pitch

Anuj Parashar

2/21/2019

About the App

This application is a simple demonstration of ‘random forest’ classification. It shows how a few of the choices made when training the model affect its accuracy.

The application uses the built-in ‘iris’ dataset, which has the following structure:

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Based on the user’s inputs, the application trains the model, applies it to the test set, and displays the model accuracy along with a variable importance plot. Here is a sample run:

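library(caret)   # method = "rf" also needs the randomForest package installed
data(iris)

# Split the data: 70% of each species for training, the rest for testing
inTrain <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing <- iris[-inTrain, ]

# Seeds the resampling in train(); the split above is drawn before this call,
# so the partition itself can differ between runs
set.seed(1432)

# Train a random forest with 5-fold cross validation
trCtrl <- trainControl(method = "cv", number = 5)
mf <- train(Species ~ ., method = "rf", trControl = trCtrl, data = training)

# Predict on the held-out rows and compare with the true species
prd <- predict(mf, testing)
cm <- confusionMatrix(prd, testing$Species)

print(paste("Model Accuracy:", round(cm$overall['Accuracy'] * 100, 2), '%'))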

## [1] "Model Accuracy: 95.56 %"

User Selection Options

The user can select from the following options:

Partition Value for Training and Test Datasets

Default: 0.7 (70% training, 30% testing)

Predictor Variables Selection

Default: All 4. Please note that at least one predictor must be selected; otherwise an error will be thrown.
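As a minimal sketch of how a predictor selection could be turned into a model formula, the snippet below uses base R's `reformulate()`. The helper name `build_formula` and the error message are hypothetical and are not taken from the app's source.

```r
# Hypothetical helper: build a model formula from the selected predictors
build_formula <- function(selected, response = "Species") {
  # Stop early when nothing is selected, before the formula reaches train()
  if (length(selected) == 0) {
    stop("Select at least one predictor.")
  }
  reformulate(termlabels = selected, response = response)
}

# Example: only the two petal measurements are selected
f <- build_formula(c("Petal.Length", "Petal.Width"))
```

The resulting formula could be passed to `train()` in place of `Species ~ .`.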

Resampling Method

Default: Not Selected. If selected, there are 4 options:

Bootstrapping, Cross Validation, Leave-One-Out Cross Validation, None

Number of Folds (cross validation) or Resampling Iterations (bootstrap)

Default: 5
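The resampling options above correspond to `method` values accepted by caret's `trainControl()`. The sketch below shows one way the two inputs might be combined; the lookup vector, its labels, and the helper `make_control` are hypothetical and are not taken from the app's source.

```r
library(caret)

# Hypothetical lookup from UI labels to caret's resampling method names
resampling_methods <- c(
  "Bootstrapping"                  = "boot",
  "Cross Validation"               = "cv",
  "Leave-One-Out Cross Validation" = "LOOCV",
  "None"                           = "none"
)

# Hypothetical helper: build the control object from the two user inputs
make_control <- function(choice, number = 5) {
  method <- resampling_methods[[choice]]
  if (method %in% c("boot", "cv")) {
    # 'number' is the fold count for "cv" and the resample count for "boot"
    trainControl(method = method, number = number)
  } else {
    # The count is not meaningful for leave-one-out or for no resampling
    trainControl(method = method)
  }
}

ctrl <- make_control("Cross Validation", number = 5)
```

The returned object can be supplied to `train()` through its `trControl` argument, as in the sample run.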

Output (Model Tab)

Based on the user’s selections, the model is trained and then applied to the testing dataset. The following information is displayed:

The Model

Please note the different parameters as they change based on user inputs.

In-Sample Accuracy

The accuracy estimated from resampling while the model is trained. Note that for method ‘none’, the accuracy is not displayed. Can you think of why?

Confusion Matrix

A cross-tabulation of predicted against actual species, produced when the model is applied to the testing set.

Out-of-Sample Accuracy

The share of testing-set observations that the model classifies correctly.
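To make the link between the confusion matrix and the out-of-sample accuracy concrete, here is a small base R sketch. The labels are invented purely for illustration and are not output from the app.

```r
# Toy labels, invented for illustration only
actual <- factor(c("setosa", "setosa", "versicolor",
                   "versicolor", "virginica", "virginica"))
predicted <- factor(c("setosa", "setosa", "versicolor",
                      "virginica", "virginica", "virginica"),
                    levels = levels(actual))

# Rows are predictions, columns are the true classes
cm_table <- table(Predicted = predicted, Actual = actual)

# Correct predictions sit on the diagonal: accuracy = correct / total
accuracy <- sum(diag(cm_table)) / sum(cm_table)
```

In this toy case five of the six predictions match, so the accuracy is 5/6. caret's `confusionMatrix()` reports the same quantity as `cm$overall['Accuracy']`.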

Variable Importance Plot

It shows the relative importance of different predictors in the model.
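A plot of this kind can be produced with caret's `varImp()`. The sketch below assumes a model fitted on the full iris data with the same settings as the sample run; whether the app builds its plot this way is an assumption.

```r
library(caret)   # method = "rf" also needs the randomForest package installed
data(iris)

set.seed(1432)

# Fit a random forest with 5-fold cross validation, as in the sample run
fit <- train(Species ~ ., data = iris, method = "rf",
             trControl = trainControl(method = "cv", number = 5))

# Extract the importance scores, one row per predictor
imp <- varImp(fit)
print(imp)

# Draw the importance plot
plot(imp)
```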

Useful Links

Project

– Shiny: https://promisinganuj.shinyapps.io/myNewApp/

– Github Repository: https://github.com/promisinganuj/ddp_assigment

It contains the “ui.R” and “shiny.R” files of the application.