Jason Freels
05 October 2016
Introduce the components of the R toolchain for data science
The R Project for Statistical Computing
The RStudio IDE
\(\LaTeX\)
R Tools (for Windows machines)
Git/GitHub
Illustrate the benefits of using this toolchain
Outline how to install and configure each of the components in the toolchain
On personal machines (Windows/Mac/Linux/Unix)
On AFIT-networked machines
What is Data Science?
Can be thought of as the intersection of statistics and computer science
The rate of progress in statistics has grown SLIGHTLY over the past decade
The rate of progress in computer science has grown SO FAST over the past decade that it's impossible to keep up with
Data science began when statisticians started capitalizing on the tools developed by computer scientists
A data scientist is someone who knows more about computer science than the average statistician AND more about statistics than the average computer scientist.
What is a toolchain?
Quite simply, a toolchain is a collection of tools that seamlessly link together to improve a workflow
While Microsoft's toolchain is very approachable for the average person - it is FAR less capable than open-source toolchains
Microsoft is tightening the integration between MS Office tools, in part by making huge investments in open source tools
Building a fully open-source toolchain can be daunting, which is why I'm going to walk you through it!
Each component in this toolchain is
Open-source
Available for all operating systems
Available for download on AFIT machines
A general-purpose, interpreted computing language and suite of statistical operators for calculations on vectors, matrices, and arrays
A large, coherent and integrated collection of tools and graphical facilities for data analysis and reproducible research
A well-developed programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities
Has more than 10,000 add-on packages available from the CRAN, GitHub, and Bioconductor repositories
One of the fastest growing technical programming languages in the world
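As a small illustration of the language features listed above, here is a sketch that uses only base R. The object names (`x`, `m`, `parity`, `factorial_rec`) are invented for this example.

```r
# Vectorized arithmetic: the operation applies to every element at once
x <- c(1, 2, 3, 4)
x_squared <- x^2

# Matrices: build a 2 x 3 matrix and multiply it by its transpose
m <- matrix(1:6, nrow = 2)
gram <- m %*% t(m)

# A loop with a conditional: label each element of x as odd or even
parity <- character(length(x))
for (i in seq_along(x)) {
  parity[i] <- if (x[i] %% 2 == 0) "even" else "odd"
}

# A user-defined recursive function
factorial_rec <- function(n) {
  if (n <= 1) 1 else n * factorial_rec(n - 1)
}

print(x_squared)
print(gram)
print(parity)
print(factorial_rec(5))
```

The loop is shown only to demonstrate the syntax; in practice `ifelse(x %% 2 == 0, "even", "odd")` does the same job without a loop.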
An integrated development environment (IDE) providing a seamless interface to many open-source computing languages
R (*.R)
Python (*.py)
Julia (*.jl)
C++ (*.cpp)
JavaScript (*.js)
JSON (*.json)
XML (*.xml)
CSS (*.css)
HTML5 (*.html)
TeX (*.tex)
YAML (*.yml)
Markdown (*.md)
RStudio also provides a seamless interface to many other open-source applications
Pandoc
MathJax
Git/GitHub
SVN
Shinyapps.io
Apache Spark
MiKTeX for Windows 7/8/10
MacTeX for Mac OS X
TeX Live 2013+ for Linux
Linux, Unix, and Mac come with a suite of compilers for working with compiled languages
C
C++
FORTRAN
Windows does not come with these tools so they must be added
Ordinarily, these tools would have to be installed one by one (and there are a lot of them)
Rtools makes life much easier by installing all of them in one step
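One way to see whether the compilers are visible to R is `Sys.which()`, which returns the full path of each program it finds on the search path and an empty string for anything missing. The list of program names below is an assumption about what a typical setup exposes, not an official Rtools checklist.

```r
# Programs we expect a working build toolchain to expose (assumed list)
build_tools <- c("gcc", "g++", "gfortran", "make")

# Sys.which() returns "" for any program it cannot find on the PATH
tool_paths <- Sys.which(build_tools)
tool_found <- nzchar(tool_paths)

build_report <- data.frame(
  tool  = build_tools,
  found = unname(tool_found),
  path  = unname(tool_paths),
  stringsAsFactors = FALSE
)
print(build_report)

if (!all(tool_found)) {
  message("Missing: ", paste(build_tools[!tool_found], collapse = ", "))
}
```

A missing entry does not always mean the tool is absent; it may simply not be on the PATH that R sees.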
Git is a free/open-source distributed version control system for ASCII or UTF-8 encoded files
Organizes code, text files, images, etc. into repositories ('repos')
Easy to collaborate on projects with other users anywhere in the world
No more emailing files, DVDs, USB drives - it's like Dropbox on steroids
It can be used on AF NIPR machines (probably not all, but I haven't yet found one where it couldn't)
GitHub has a learning curve, but once you get the hang of it you don't go back
Outside of AFIT/AF, GitHub is the standard - some employers prefer GitHub profiles over resumes
"It's become so important...that if GitHub goes down, the software development world practically stops."
Millions of users - GitHub.com is currently the 54th most visited website in the world
Designed to handle small projects to very large projects with speed and efficiency
Writing and sharing software packages
Writing books with collaborators from around the world
View all changes to the entire volume of German Federal Laws
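To make the idea of a repository concrete, the sketch below drives the Git command line from R with `system2()`. It assumes Git is installed and on the PATH, and skips the Git steps if it is not. The folder name `demo-repo` and the file `analysis.R` are hypothetical.

```r
# Create a scratch folder with one file in it (hypothetical names)
repo <- file.path(tempdir(), "demo-repo")
dir.create(repo, showWarnings = FALSE)
writeLines("x <- rnorm(10)", file.path(repo, "analysis.R"))

git <- unname(Sys.which("git"))

if (nzchar(git)) {
  # Turn the folder into a repository
  system2(git, c("init", shQuote(repo)))
  # Stage the file so Git starts tracking it
  system2(git, c("-C", shQuote(repo), "add", "analysis.R"))
  # Ask Git what has changed; stdout = TRUE captures the output in R
  status <- system2(git, c("-C", shQuote(repo), "status", "--short"),
                    stdout = TRUE)
  print(status)
} else {
  message("git was not found on the PATH; skipping the Git steps")
}
```

Committing and pushing to GitHub follow the same pattern, but they need a configured user name, email, and remote, so they are left out of this sketch.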
Text
HTML5, CSS3 and JavaScript code
HTML Tables
| Censoring type | Range | Likelihood |
|---|---|---|
| \(d_{i}\) observations interval censored in \(t_{i-1}\) and \(t_{i}\) | \(t_{i-1}<T\le t_{i}\) | \([F(t_{i})-F(t_{i-1})]^{d_{i}}\) |
| \(l_{i}\) observations left censored at \(t_{i}\) | \(T\le t_{i}\) | \([F(t_{i})]^{l_{i}}\) |
| \(r_{i}\) observations right censored at \(t_{i}\) | \(T>t_{i}\) | \([1-F(t_{i})]^{r_{i}}\) |
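The likelihood contributions in the table above multiply together, so on the log scale they add. The sketch below evaluates that sum for a Weibull distribution, using `pweibull()` for \(F(t)\). The function name `censored_loglik` and all of the counts and inspection times are made up for illustration.

```r
# Log-likelihood for interval-, left-, and right-censored observations,
# assuming T follows a Weibull(shape, scale) distribution
censored_loglik <- function(shape, scale, interval, left, right) {
  cdf <- function(t) pweibull(t, shape = shape, scale = scale)

  # d_i observations in (t_{i-1}, t_i]: [F(t_i) - F(t_{i-1})]^d_i
  ll_interval <- sum(interval$d * log(cdf(interval$upper) - cdf(interval$lower)))
  # l_i observations left censored at t_i: [F(t_i)]^l_i
  ll_left <- sum(left$l * log(cdf(left$t)))
  # r_i observations right censored at t_i: [1 - F(t_i)]^r_i
  ll_right <- sum(right$r * log(1 - cdf(right$t)))

  ll_interval + ll_left + ll_right
}

# Hypothetical data, not taken from any real study
interval <- data.frame(lower = c(0.5, 1.0), upper = c(1.0, 1.5), d = c(4, 3))
left     <- data.frame(t = 0.5, l = 2)
right    <- data.frame(t = 2.0, r = 5)

ll <- censored_loglik(shape = 1.7, scale = 1,
                      interval = interval, left = left, right = right)
print(ll)
```

Passing this function to an optimizer such as `optim()` would give maximum likelihood estimates of the shape and scale, but that step is not shown here.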
And interactive tables
Figure 2.6 - Likelihood contributions for different kinds of censoring
# Weibull density over the plotting range
x <- seq(0, 2.4, by = 0.01)
y <- dweibull(x, shape = 1.7, scale = 1)

plot(x = x,
     y = y,
     type = 'l',
     lwd = 1.25,
     xlab = 't',
     ylab = 'f(t)',
     las = 1)

# Shade the left-censored region (0 to 0.5)
polygon(x = c(seq(0, 0.5, 0.01), 0.5),
        y = c(dweibull(seq(0, 0.5, 0.01), shape = 1.7, scale = 1), 0),
        col = 1)

# Shade the interval-censored region (1 to 1.5)
polygon(x = c(1, seq(1, 1.5, 0.01), 1.5),
        y = c(0, dweibull(seq(1, 1.5, 0.01), shape = 1.7, scale = 1), 0),
        col = 1)

# Shade the right-censored region (2 to 2.4)
polygon(x = c(2, seq(2, 2.4, 0.01), 2.4),
        y = c(0, dweibull(seq(2, 2.4, 0.01), shape = 1.7, scale = 1), 0),
        col = 1)

# Label the three shaded regions
text(x = 0.16, y = 0.75, 'Left Censoring')
text(x = 1.3, y = 0.65, 'Interval Censoring')
text(x = 2.2, y = 0.15, 'Right Censoring')

devtools::install_github('rstudio/revealjs')
devtools::install_github('MangoTheCat/rmdshower')
install.packages('rticles')
Static plots and static tables are quickly becoming outdated - like transparencies
Plots & tables are now expected to provide multiple layers of information
Animated plots illustrate complex processes - without having to "visualize" the changes
Interactive plots let the viewer change the assumptions and investigate multiple scenarios
Interactive tables can be searched and sorted - directly within the document or presentation
R has two major frameworks to customize how users interact with your presentations: shiny and htmlwidgets
Converts R code to JavaScript - no experience programming in JavaScript is required
Build a custom UI - Lots of options...check out the gallery
Define what events will occur based on a user's input in the UI
Several shiny sub-packages have been created for specific purposes
shinydashboard - Create real-time dashboards in R
shinystan - Interactively explore Bayesian model fit using MCMC
shinyAce - Interactive interface to the Ace text editor
I've created many shiny apps and found them to be helpful for three primary reasons
To illustrate complex ideas
To present content and coding skills simultaneously
To replace lots of static plots
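A minimal sketch of the two pieces of a shiny app, the UI and the server logic that reacts to user input, might look like the following. It assumes the shiny package is installed. The input and output IDs (`shape`, `density`) and the slider range are choices made for this example.

```r
library(shiny)

# UI: a slider for the Weibull shape parameter and a slot for the plot
ui <- fluidPage(
  titlePanel("Weibull density"),
  sliderInput("shape", "Shape parameter",
              min = 0.5, max = 5, value = 1.7, step = 0.1),
  plotOutput("density")
)

# Server: redraw the density whenever the slider value changes
server <- function(input, output) {
  output$density <- renderPlot({
    t <- seq(0, 2.4, by = 0.01)
    plot(t, dweibull(t, shape = input$shape, scale = 1),
         type = "l", lwd = 1.25, xlab = "t", ylab = "f(t)", las = 1)
  })
}

# Bundle the two pieces into an app object
app <- shinyApp(ui = ui, server = server)

# Uncomment to launch the app in a browser or the RStudio viewer
# runApp(app)
```

One slider here stands in for what would otherwise be a separate static plot for each shape value.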
The 'showcase' widgets can be explored on the htmlwidgets showcase page
metricsgraphics - Dynamic scatterplots, line charts, and histograms with D3.js
leaflet - Dynamic maps with support for panning, zooming, and annotations
threejs - Interactive 3D scatterplots and globes
d3heatmap - Interactive heatmaps with D3.js
networkD3 - Interactive network graphs with D3.js
dygraphs - Interactive charting for time-series data
DT - Interactive tables that support filtering, pagination, and sorting
DiagrammeR - Create diagrams and flowcharts using Graphviz.js and Mermaid.js
Many, many more widgets are available on the Widget-A-Week Project page
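As one example of how little code a widget needs, the sketch below builds an interactive table with DT from the built-in `mtcars` data. It assumes the DT package is installed. The choice of columns and the page length are arbitrary.

```r
library(DT)

# A searchable, sortable table with a filter box above each column
cars_table <- datatable(
  mtcars[, c("mpg", "cyl", "hp", "wt")],
  filter  = "top",
  options = list(pageLength = 5)
)

# Print the object in RStudio, or place it in an R Markdown chunk, to view it
# cars_table
```

The result is an ordinary htmlwidget, so it can be embedded in an HTML document or presentation without a running R session behind it.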
The links below provide instructions for installing, connecting, configuring, and testing the R/RStudio toolchain for data science
It is
Follow the link that corresponds to your machine's operating system
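After installing, a quick check from the R console can show which components R is able to find. This is a rough sketch: the program names are assumptions about how each component is usually invoked, and RStudio bundles its own copy of Pandoc that may not appear on the PATH.

```r
# Report the version of R that is running
cat("R version:", R.version.string, "\n")

# Command-line programs behind the other components (assumed names)
components <- c(Git = "git", LaTeX = "pdflatex", Pandoc = "pandoc")
component_paths <- unname(Sys.which(components))

toolchain_report <- data.frame(
  component = names(components),
  program   = unname(components),
  found     = nzchar(component_paths),
  path      = component_paths,
  stringsAsFactors = FALSE
)
print(toolchain_report)
```

A `FALSE` in the `found` column is a prompt to revisit the installation or PATH settings for that component, not proof that it is missing.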
Contact me
937.255.3636 ext. 4676
If you are new to R, check out these presentations
If you are familiar with R, this presentation will get you started on how to create documents & presentations and add interactivity