Introduction to the ML Studio Package

Rami Krispin

2017-11-05

Overview

The ML Studio is an interactive platform for data visualization, statistical modeling and machine learning applications. Built on the Shiny and shinydashboard interface, with Plotly interactive data visualization, DT HTML tables, and H2O machine learning and deep learning algorithms, the ML Studio provides a set of tools for the data science pipeline workflow.

Installation

The package can be installed from GitHub with the devtools package (if the devtools package is not installed, use install.packages("devtools") to install it).

# Install the MLstudio
devtools::install_github("RamiKrispin/MLstudio")

Please note – the H2O package may require additional Java add-ins (if not already installed). It is therefore listed under the “Suggests” packages of the MLstudio package (and not under Imports or Depends) and won’t be installed automatically with the MLstudio package. More information about installing H2O can be found in the H2O documentation (under the “INSTALL IN R” tab).

Getting Started

Launch the App

The app is called from R and opens in the default web browser (it runs best on Google Chrome). To open the app, use:

# Load the package and launch the MLstudio
library(MLstudio)
runML()

Data

The ML Studio provides the user with the ability to load (or remove), modify, visualize and analyze multiple datasets at the same time.

Under the “Data” tab there are two sub-tabs:

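  1. Load – a set of tools to load data into the platform (from the R environment, datasets in installed packages and/or a csv file)

  2. Prep – data preparation tools for viewing and modifying variable attributes and for summarizing data by group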

Loading Data

There are three methods to load a dataset into the platform:

  1. Loading a dataset from the R environment. Data frame, data table, matrix and ts objects are currently supported.

Loading a dataset from the R environment

  2. Loading a dataset available in an installed package. Data frame, data table, matrix and ts objects are supported.

Loading the diamonds dataset from the ggplot2 package

  3. Loading from a csv file.

Loading Kaggle’s Titanic train set from a csv file
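
As a minimal sketch of the first and third methods, the R code below creates a few objects of the supported classes in the global environment and writes a csv file that could then be selected from the “Load” tab. The object names and the temporary file are illustrative assumptions and are not part of the MLstudio package. A data table object would additionally need the data.table package.

```r
# Illustrative objects of the classes the loader supports; the names are
# hypothetical and not defined by MLstudio
iris_df  <- iris                  # data frame
mtcars_m <- as.matrix(mtcars)     # matrix
air_ts   <- AirPassengers         # ts (monthly time series)

# Write a csv file to a temporary location as a stand-in for a file on disk,
# then read it back to confirm it parses as expected
csv_path <- tempfile(fileext = ".csv")
write.csv(iris_df, csv_path, row.names = FALSE)
iris_csv <- read.csv(csv_path, stringsAsFactors = TRUE)

str(iris_csv)
```

Creating the objects before launching the app keeps them available in the session from which the app is started.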

Data Attribution

The variable attributes are shown in the middle table of the “Prep” tab, and a more in-depth summary is available in the variable summary box. The variable attributes option makes it possible to modify the attributes of a variable if needed. The interactive table shown below supports sorting by field and a search option.

Variables summary and attribution changing

Data Summary

A data summary function is available on the “Prep” tab under the “Select Option” dropdown menu. This dplyr-based function summarizes data by a specific group. The currently available summary statistics are count, mean, sd and max.
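
A similar grouped summary can be approximated outside the app with a few lines of dplyr. The sketch below is not the package’s internal code. It only assumes the diamonds dataset from ggplot2 and summarizes price by cut, and the output column names are illustrative.

```r
library(dplyr)
library(ggplot2)  # used only for the diamonds dataset

# Summarize diamond price within each level of cut
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    count       = n(),
    mean_price  = mean(price),
    sd_price    = sd(price),
    max_price   = max(price)
  )

print(price_by_cut)
```

The result has one row per level of cut, which is the same shape as the grouped table the “Prep” tab displays.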

diamonds dataset - price summary by cut

Visualization

Combining Plotly’s interactive data visualization tools with the Shiny engine, the ML Studio provides the user with an effective tool for data exploration. The “Visualization” tab provides the key functionality.
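
To give a sense of the kind of Plotly charts involved, here is a small sketch that builds two interactive plots directly with the plotly package: a scatter plot of the iris data and a line plot of the AirPassengers series. The choice of variables and chart types is an assumption made for illustration and does not reproduce the app’s own plotting code.

```r
library(plotly)

# Scatter plot of two iris measurements, colored by species
p_scatter <- plot_ly(
  data  = iris,
  x     = ~Sepal.Length,
  y     = ~Sepal.Width,
  color = ~Species,
  type  = "scatter",
  mode  = "markers"
)

# Line plot of the AirPassengers monthly time series
p_line <- plot_ly(
  x    = as.numeric(time(AirPassengers)),
  y    = as.numeric(AirPassengers),
  type = "scatter",
  mode = "lines"
)

# Printing a plotly object renders it in the viewer or browser
p_scatter
p_line
```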

Visualization of the diamonds and iris datasets

Visualization of the AirPassengers dataset

Models

The modeling functionality of the ML Studio is still under development. Four classification models from the H2O package are currently available: Deep Learning, GBM, GLM and Random Forest.
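
For readers who want to see what one of these models looks like when called directly, here is a rough sketch that trains a GBM classifier with the h2o package. It uses the built-in iris data to stay self-contained (an assumption; the app’s example uses GermanCredit), and the split ratio and seed are arbitrary choices. This is not the code the app runs, and it requires a working Java installation for the local H2O cluster.

```r
library(h2o)

# Start (or connect to) a local H2O cluster
h2o.init()

# Copy the data into the cluster and split it into train and test frames
iris_hf <- as.h2o(iris)
splits  <- h2o.splitFrame(iris_hf, ratios = 0.8, seed = 1234)
train   <- splits[[1]]
test    <- splits[[2]]

# Train a gradient boosting classifier with Species as the response
gbm_fit <- h2o.gbm(
  x              = setdiff(names(iris), "Species"),
  y              = "Species",
  training_frame = train,
  seed           = 1234
)

# Evaluate on the held-out frame
perf <- h2o.performance(gbm_fit, newdata = test)
print(h2o.confusionMatrix(perf))
```

The other three model families have analogous functions in the h2o package (h2o.deeplearning, h2o.glm and h2o.randomForest). When finished, the cluster can be stopped with h2o.shutdown(prompt = FALSE).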

Classification model for the GermanCredit dataset

Features under development:

  1. Machine learning applications
  2. Time series and forecasting