November 13, 2016

What is Data Science

Data Science ≈

Using R & Python +

R and Python

  • Pros:
  1. Easy to get started
  2. Move fast and break things
  3. Human-friendly
  • Cons:
  1. Easy to crash
  2. Hard to maintain
  3. Low performance

Cons -> Problems

General Side

  • Hard to set up a new working environment

Academic Side

  • Hard to reproduce the results reported in a paper

Industry Side

  • Hard to deploy research results and integrate them into a product

General Side

Tedious Installation

Academic Side

Hard to Repeat

Industry Side

Quality and Agile

Previously: Slash and Burn

R | Python

Levels | R | Python
Package and Environment | pacman (Rinker and Kurkiewicz 2015) + packrat (Ushey et al. 2016) + .RData | pip/conda + virtualenv
Content | Rmarkdown (Allaire et al. 2016) | Jupyter Notebook
OS | ? | ?

In Details

  • R
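packrat::init()                 # start a project-local package library
pacman::p_load(tidyverse,       # install any missing packages, then load them
               xgboost,
               rstanarm,
               liftr)
load(".RData")                  # restore the saved workspace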
  • Python
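virtualenv xxx_project               # create an isolated environment
source xxx_project/bin/activate      # activate it before using pip
pip freeze > requirements.txt        # record the installed package versions
pip install -r requirements.txt      # restore them on another machine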
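The Python commands above serve two different roles: the author records an environment, and a collaborator restores it. A minimal sketch of that split follows, assuming virtualenv and pip are on the PATH. The function names and the ENV_DIR and REQS variables are hypothetical and only illustrate the flow.

```shell
# Sketch only: assumes virtualenv and pip are installed.
ENV_DIR="xxx_project"        # placeholder environment name from the slide
REQS="requirements.txt"      # file that records the package versions

# Create the isolated environment and activate it in the current shell.
create_env() {
  virtualenv "$ENV_DIR" && . "$ENV_DIR/bin/activate"
}

# Author side: after installing packages, record their exact versions.
record_env() {
  pip freeze > "$REQS"
}

# Collaborator side: build a fresh environment and install the recorded versions.
restore_env() {
  create_env && pip install -r "$REQS"
}
```

The author would run create_env, install packages, and then call record_env. A collaborator with a copy of requirements.txt only needs restore_env.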

Disrupt

What is Docker

Docker ≈ Virtual Machine

Why Docker

Item | Docker | Virtual Machine
Launch Speed | < 1 s | > 1 min
Base Size | < 20 MB | > 200 MB
Performance | 100% native | 80% native
Cross Platform | True | True
Social Collaboration | True | False

Cases

  1. Kaggle
  2. Shiny
  3. Deep Learning
  4. More

Case 1: Kaggle

Case 2: ShinyProxy

Case 3: Deep Learning

Case 4: More Benefits

Step by Step

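1. Install Docker

brew install docker              # Mac: no sudo with Homebrew; installs the CLI only, the daemon needs Docker Desktop
sudo apt-get install docker.io   # Ubuntu: the Docker package is named docker.io
# Windows: download the installer from the official website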

2. One Line to Build

docker run -d -p 8787:8787 --name sparklyr index.tenxcloud.com/7harryprince/sparkr-rstudio
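To make the one-liner easier to read, the sketch below wraps it and its usual companions in small functions. It assumes the docker CLI is installed and the daemon is running. The function and variable names are hypothetical, while the image, container name, and port are taken from the command above.

```shell
# Sketch only: assumes the docker CLI is installed and the daemon is running.
IMAGE="index.tenxcloud.com/7harryprince/sparkr-rstudio"   # image from the slide
NAME="sparklyr"                                           # container name from the slide
PORT=8787                                                 # RStudio Server port

# -d runs the container in the background,
# -p publishes the container port on the same host port,
# --name gives the container a fixed name for later commands.
start_rstudio() {
  docker run -d -p "$PORT:$PORT" --name "$NAME" "$IMAGE"
}

# List the container if it is currently running.
status_rstudio() {
  docker ps --filter "name=$NAME"
}

# Stop and remove the container so the name can be reused.
stop_rstudio() {
  docker stop "$NAME" && docker rm "$NAME"
}
```

Running start_rstudio twice without stop_rstudio in between fails, because Docker refuses to reuse a container name that is still taken.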

3. Open Chrome

localhost:8787 # if localhost does not respond, use the host IP: ifconfig | grep 0xfffff000 | awk '{print $2}'
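The container may need a few seconds before RStudio answers on the port. The sketch below polls the address until it responds, assuming curl is installed. RSTUDIO_HOST, RSTUDIO_PORT, and both function names are hypothetical; set RSTUDIO_HOST to the IP found above if localhost does not work.

```shell
# Sketch only: assumes curl is installed and the container is already started.
RSTUDIO_HOST="${RSTUDIO_HOST:-localhost}"   # or the host IP
RSTUDIO_PORT="${RSTUDIO_PORT:-8787}"        # port published by docker run

# Print the address to open in the browser.
rstudio_url() {
  printf 'http://%s:%s\n' "$RSTUDIO_HOST" "$RSTUDIO_PORT"
}

# Poll once per second, giving up after 30 failed attempts.
wait_for_rstudio() {
  tries=0
  until curl -fsS -o /dev/null "$(rstudio_url)"; do
    tries=$((tries + 1))
    if [ "$tries" -ge 30 ]; then
      return 1
    fi
    sleep 1
  done
}
```

A typical use is `wait_for_rstudio && echo "open $(rstudio_url)"`.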

4. Witness the Miracle

user/passwd: harryzhu

Witness the Miracle

Q&A

Feel free to contact me at 7harryprince@gmail.com

Allaire, JJ, Joe Cheng, Yihui Xie, Jonathan McPherson, Winston Chang, Jeff Allen, Hadley Wickham, Aron Atkins, and Rob Hyndman. 2016. Rmarkdown: Dynamic Documents for R. http://rmarkdown.rstudio.com.

Rinker, Tyler W., and Dason Kurkiewicz. 2015. pacman: Package Management for R. Buffalo, New York: University at Buffalo/SUNY. http://github.com/trinker/pacman.

Ushey, Kevin, Jonathan McPherson, Joe Cheng, Aron Atkins, and JJ Allaire. 2016. Packrat: A Dependency Management System for Projects and Their R Package Dependencies. https://CRAN.R-project.org/package=packrat.