Julia Package Ecosystem

Data Science with Julia

Julia Workshop

Julia’s Ecosystem is Growing, but Young

Julia was publicly released ~1.5 years ago, after 2 years of internal development. Although the Julia community has grown substantially since February 2012, the language ecosystem is still very young. Many popular libraries available for languages like R or Python have no counterpart in Julia yet. Julia is gradually developing its own ecosystem of packages, but for now a practical data scientist will need to mix Julia code with R or Python when the problem at hand demands it.

Notable Packages in the Julia Ecosystem

Base Julia: Provides much of the functionality available in NumPy.
Stats.jl, Distributions.jl, Optim.jl, and JuMP.jl: Fill in functionality provided by SciPy.
DataFrames.jl: Tools for working with tabular data, familiar to users of R or pandas.
Gadfly.jl: A bare-bones visualization package similar to ggplot2.
PyPlot.jl: Provides a complete interface to matplotlib from Julia.
Graphs.jl: Provides functionality found in packages like igraph or NetworkX.

Strategies for Meeting Users’ Needs

Direct Language Interop:
- PyCall.jl: Call Python code from within a Julia program.
- SymPy.jl: Wraps a popular symbolic algebra system developed for Python.
- Matlab.jl: Call Matlab from Julia.
Multistep Pipelines:
- Many data science tasks can be divided into a pipeline of completely independent steps.
- Moving a pipeline to Julia one step at a time eases the transition.
- For example, start by doing data preprocessing and modeling in Julia, and keep the visualization steps in R.
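
A minimal sketch of the pipeline idea, written in Python as one of the languages a Julia step might sit alongside. It assumes the steps communicate only through plain CSV files, so any single step could later be reimplemented in Julia (or R) without touching the others. The file names, column names, and step functions are hypothetical and chosen only for illustration.

```python
import csv
import os
import statistics
import tempfile


def preprocess(raw_path, clean_path):
    """Step 1: drop rows that contain an empty field, write the rest to CSV."""
    with open(raw_path, newline="") as src, open(clean_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if all(value != "" for value in row.values()):
                writer.writerow(row)


def summarize(clean_path, summary_path):
    """Step 2: compute the mean of column y per group, write it to CSV."""
    groups = {}
    with open(clean_path, newline="") as src:
        for row in csv.DictReader(src):
            groups.setdefault(row["group"], []).append(float(row["y"]))
    with open(summary_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["group", "mean_y"])
        for name in sorted(groups):
            writer.writerow([name, statistics.mean(groups[name])])


def run_pipeline(workdir):
    """Run both steps; each step sees only the file written by the previous one."""
    raw = os.path.join(workdir, "raw.csv")
    clean = os.path.join(workdir, "clean.csv")
    summary = os.path.join(workdir, "summary.csv")
    # Hypothetical input data; the third row has a missing y value.
    with open(raw, "w", newline="") as f:
        f.write("group,y\na,1.0\na,3.0\nb,\nb,4.0\n")
    preprocess(raw, clean)
    summarize(clean, summary)
    with open(summary, newline="") as f:
        return list(csv.reader(f))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        for row in run_pipeline(workdir):
            print(row)
```

Because the only contract between steps is the intermediate file, replacing `preprocess` with a Julia script that writes the same `clean.csv` layout would leave `summarize` unchanged. That is the sense in which a pipeline can be migrated one step at a time.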

Julia Packages and Functionalities

StatsBase: Descriptive statistics and moments, sampling, counting and ranking, autocorrelation and cross-correlation, weighted statistics.
DataArrays: Arrays with missing data, optimized representation of arrays with repetitive values, computational routines with missing values.
DataFrames: Data frames for tabular datasets, database-style joins and indexing, split-apply-combine operations, pivoting, formula and model frames.
Distributions: Large collection of univariate and multivariate distributions, descriptive stats, pdf/pmf, mgf, efficient sampling, maximum likelihood estimation.
MultivariateStats: Linear regression, dimensionality reduction, multidimensional scaling, linear discriminant analysis.
HypothesisTests: Parametric and nonparametric tests, binomial tests, sign tests, exact tests, U tests, rank tests.
MLBase: Data preprocessing, score-based classification, performance evaluation, model selection, cross validation.
Distances: Various metrics, efficient column-wise and pairwise computation, support for weighted distances.
KernelDensity: Kernel density estimation for univariate and bivariate data, customizable interpolation points, kernel, and bandwidth.
Clustering: K-means, K-medoids, affinity propagation, evaluation of clustering performance.
GLM: Generalized linear models, friendly API for fitting GLMs, works with data frames and formulas, variety of link types, optimized implementation.
NMF: Nonnegative matrix factorization, variety of algorithms, NNDSVD initialization.
RegERMs: Regularized empirical risk minimization, support for ridge regression and logistic regression, L-BFGS and SGD solvers, highly configurable and extensible.
MCMC: Markov Chain Monte Carlo, Bayesian inference, variety of samplers, user-friendly syntax for model specification, auto-differentiation, suspend and resume.
TimeSeries: Tools for representing, manipulating, and applying computation to time series data, based on data frames.

Julia’s Role in Data Science