Introduction to the ML Studio Package

Installation

The package can be installed from GitHub with the devtools package (if devtools is not installed, use install.packages("devtools") to install it first).

# Install the MLstudio
devtools::install_github("RamiKrispin/MLstudio")

Please note: the H2O package may require additional Java components (if not already installed) and is therefore listed under the “Suggests” packages of the MLstudio package (rather than under Imports or Depends). It will not be installed automatically when MLstudio is installed. More information about installing H2O can be found in the H2O documentation (under the “INSTALL IN R” tab).
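Because H2O is only a suggested package, it can be useful to check whether it is present before opening the app. The following is a minimal sketch: `missing_packages()` is a hypothetical helper written for this example and is not part of MLstudio.

```r
# Hypothetical helper (not part of MLstudio): return the packages from
# `pkgs` that are not currently installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("h2o"))
if (length(to_install) > 0) {
  message("Missing suggested packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

The install call is left commented out so that running the check never changes the library by accident.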

Getting Started

Launch the App

The app is launched from R and opens in the default web browser (it runs best on Google Chrome). To open the app, use:

# Load the package and launch the MLstudio app
library(MLstudio)
runML()

Data

The ML Studio lets the user load (or remove), modify, visualize and analyze multiple datasets at the same time.

Under the “Data” tab there are two sub-tabs:

Load – a set of tools to load data into the platform (from the R environment, R datasets and/or a CSV file)
Prep – data preparation tools:

Variable summary
Ability to modify variable attributes
dplyr data summary

Loading Data

There are three ways to load a dataset into the platform:

Loading a dataset from the R environment. Data frame, data table, matrix and ts objects are currently supported.

Loading a dataset from the R environment

Loading a dataset that ships with an installed package. Data frame, data table, matrix and ts objects are supported.

Loading the diamonds dataset from ggplot2’s datasets

Loading from a CSV file.

Loading Kaggle’s Titanic training set from a CSV file
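The three sources correspond to ordinary R operations. The sketch below prepares one object of each kind in a plain R session. The object names and the temporary CSV file are made up for the example, and the code uses standard base R calls rather than the app's internal loading code.

```r
# 1. Objects in the R environment: a data frame, a matrix and a ts object.
my_df  <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))
my_mat <- as.matrix(my_df)
my_ts  <- ts(c(112, 118, 132, 129, 121, 135), start = c(1949, 1), frequency = 12)

# 2. A dataset shipped with an installed package (here the base
#    "datasets" package, which is always available).
data("AirPassengers", package = "datasets")

# 3. A CSV file: write a small file to a temporary path and read it back.
csv_path <- tempfile(fileext = ".csv")
write.csv(my_df, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

# Show the primary class of each object.
sapply(
  list(my_df = my_df, my_mat = my_mat, my_ts = my_ts,
       AirPassengers = AirPassengers, from_csv = from_csv),
  function(obj) class(obj)[1]
)
```

Objects created this way in the global environment before the app is launched are the kind of objects the first loading method is described as picking up.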

Data Attribution

The variable attributes are shown in the middle table of the “Prep” tab, and a more in-depth summary is available in the variable summary box. The variable attributes option can be used to modify the attributes of a variable if needed. The table is interactive: its fields can be sorted, and a search option is available.

Variable summary and attribute modification
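Outside the app, changing a variable's attributes amounts to converting a column from one class to another. The following is a minimal base R sketch of that idea. The data frame and its column names are invented for illustration, and this is not the code the app runs.

```r
# Illustrative data; the column names are made up for this example.
passengers <- data.frame(
  survived = c(0, 1, 1, 0),
  pclass   = c(3, 1, 2, 3),
  fare     = c("7.25", "71.28", "13.00", "8.05"),
  stringsAsFactors = FALSE
)

# Inspect the current class of each variable.
sapply(passengers, class)

# Change the attributes: categorical codes become factors, text becomes numeric.
passengers$survived <- as.factor(passengers$survived)
passengers$pclass   <- factor(passengers$pclass, levels = c(1, 2, 3), ordered = TRUE)
passengers$fare     <- as.numeric(passengers$fare)

# Confirm the new structure.
str(passengers)
```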

Data Summary

A data summary function is available on the “Prep” tab under the “Select Option” dropdown menu. This dplyr-based function summarizes the data by a specific group. The summary categories currently available are count, mean, sd and max.

diamonds dataset - price summary by cut
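A grouped summary of this kind can be written directly with dplyr. The sketch below reproduces the idea for the diamonds example (price summarized by cut). It assumes the dplyr and ggplot2 packages are installed, and it is an equivalent hand-written pipeline, not the function used inside the app.

```r
library(dplyr)

# The diamonds dataset ships with ggplot2.
data("diamonds", package = "ggplot2")

# Summarize price within each level of cut.
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    count = n(),
    mean  = mean(price),
    sd    = sd(price),
    max   = max(price)
  )

price_by_cut
```

The result has one row per level of `cut`, with one column per summary category.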

Visualization

By combining Plotly's interactive data visualization tools with the Shiny engine, the ML Studio provides the user with an effective tool for data exploration. The “Visualization” tab provides the following key functionality:

Application for multivariate visualization – scatter, line, boxplot, histogram, density, and correlation plots

Visualization of the diamonds and iris datasets

Application for time series visualization - seasonality, boxplot and lag plots

Visualization of the AirPassengers dataset
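Plots similar to these can be built by hand with the plotly package. The sketch below draws a scatter plot of iris and two simple views of AirPassengers (one line per year, and a box plot by month). It assumes plotly is installed. The reshaped `air` data frame and the plot object names are made up for the example, and the code illustrates the plot types rather than reproducing the app's own plotting code.

```r
library(plotly)

# Multivariate: interactive scatter plot of the iris dataset.
scatter_plot <- plot_ly(
  data = iris,
  x = ~Sepal.Length, y = ~Sepal.Width, color = ~Species,
  type = "scatter", mode = "markers"
)

# Time series: reshape AirPassengers into one row per observation so that
# each year can be drawn as a separate line (a simple seasonality view).
air <- data.frame(
  year  = factor(as.numeric(floor(time(AirPassengers)))),
  month = factor(month.abb[as.integer(cycle(AirPassengers))], levels = month.abb),
  value = as.numeric(AirPassengers)
)

seasonal_plot <- plot_ly(
  data = air,
  x = ~month, y = ~value, split = ~year,
  type = "scatter", mode = "lines"
)

# Box plot of the monthly values across all years.
month_boxplot <- plot_ly(data = air, x = ~month, y = ~value, type = "box")

# Print an object (for example, scatter_plot) to render it in the viewer.
```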

Models

The models application of the ML Studio is still under development. It currently offers four classification models from the H2O package (Deep Learning, GBM, GLM and Random Forest).

Classification model for the GermanCredit dataset
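For readers who want to see what such a model looks like outside the app, here is a minimal sketch that fits an H2O GBM classifier to GermanCredit. It assumes that h2o, caret (which ships the GermanCredit data) and Java are installed. The split ratio, number of trees and seed are arbitrary choices for the example and are not the app's settings.

```r
library(h2o)

# Start (or connect to) a local H2O cluster; this requires Java.
h2o.init()

# GermanCredit ships with the caret package; "Class" is the Good/Bad target.
data("GermanCredit", package = "caret")
credit <- as.h2o(GermanCredit)

# Hold out roughly 30% of the rows for validation.
splits <- h2o.splitFrame(credit, ratios = 0.7, seed = 1234)
train  <- splits[[1]]
valid  <- splits[[2]]

# Fit a gradient boosting classifier; all other columns are used as predictors.
gbm_model <- h2o.gbm(
  y = "Class",
  training_frame = train,
  validation_frame = valid,
  ntrees = 50,
  seed = 1234
)

# Validation metrics (AUC, confusion matrix and so on).
h2o.performance(gbm_model, valid = TRUE)
```

The other three model families have analogous functions in the h2o package (`h2o.deeplearning()`, `h2o.glm()` and `h2o.randomForest()`).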

Features under development:

Machine learning applications:

In-depth model summary
Ability to compare, select and save models
Regression models
The caret functions and models
H2O grid search and autoML
Deep learning models with Keras

Time series and forecasting:

Tools for time series analysis
Forecasting models with the forecast package