Julia Package Ecosystem

Data Science with Julia

Julia Workshop


Julia’s Ecosystem is Growing, but Young

Julia was publicly released in February 2012, roughly 1.5 years ago, after about 2 years of internal development. The community has grown substantially since then, but the language ecosystem is still very young: many popular R and Python libraries have no counterpart in Julia yet. Julia is steadily building its own ecosystem of packages, but for now a practical data scientist will need to mix Julia code with R or Python whenever the problem at hand demands it.


Notable Packages in the Julia Ecosystem

  • Base Julia: Provides much of the functionality available in NumPy.
  • Stats.jl, Distributions.jl, Optim.jl, and JuMP.jl: Together, these are filling in the functionality of SciPy.
  • DataFrames.jl: Tools for working with tabular data, familiar to users of R or pandas.
  • Gadfly.jl: A bare-bones visualization package similar to ggplot2.
  • PyPlot.jl: Provides a complete interface to matplotlib from Julia.
  • Graphs.jl: Provides graph functionality similar to packages like igraph or NetworkX.

Strategies for Meeting Users’ Needs

  1. Direct Language Interop:
    • PyCall.jl: Call Python code from within a Julia program.
    • SymPy.jl: Wraps a popular symbolic algebra system developed for Python.
    • MATLAB.jl: Call MATLAB from Julia.
  2. Multistep Pipelines:
    • Many data science tasks can be divided into a pipeline of completely independent steps.
    • Moving a pipeline to Julia one step at a time lowers the cost of switching.
    • For example, start by doing data preprocessing and modeling in Julia while keeping the visualization steps in R.
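
The pipeline idea can be sketched as stages that communicate only through files, so that any single stage can later be rewritten in Julia (or kept in R) without touching the others. The example below is a minimal illustration in Python using only the standard library. The file names, the column names `group` and `y`, and the functions `preprocess`, `summarize`, and `run_pipeline` are all hypothetical and chosen for this sketch; they do not come from any particular package.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def preprocess(raw_path, clean_path):
    """Stage 1: drop rows with a missing field and write a clean CSV."""
    with raw_path.open(newline="") as src, clean_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Keep only rows where every field is non-empty.
            if all(value != "" for value in row.values()):
                writer.writerow(row)


def summarize(clean_path, summary_path):
    """Stage 2: compute the mean of y per group and write it as CSV."""
    groups = {}
    with clean_path.open(newline="") as src:
        for row in csv.DictReader(src):
            groups.setdefault(row["group"], []).append(float(row["y"]))
    with summary_path.open("w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["group", "mean_y"])
        for name in sorted(groups):
            writer.writerow([name, statistics.mean(groups[name])])


def run_pipeline(workdir):
    """Run both stages on a tiny made-up dataset and return the summary rows."""
    raw = workdir / "raw.csv"
    clean = workdir / "clean.csv"
    summary = workdir / "summary.csv"
    # Hypothetical input: one row for group b has a missing y value.
    raw.write_text("group,y\na,1.0\na,3.0\nb,\nb,4.0\n")
    preprocess(raw, clean)
    summarize(clean, summary)
    with summary.open(newline="") as src:
        return list(csv.reader(src))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for row in run_pipeline(Path(tmp)):
            print(row)
```

Because the only contract between stages is the CSV file on disk, either `preprocess` or `summarize` could be replaced by a Julia script that reads and writes the same files, and the rest of the pipeline would not need to change.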

Julia Packages and Functionalities

  • StatsBase: Descriptive statistics and moments, sampling, counting and ranking, autocorrelation and cross-correlation, weighted statistics.
  • DataArrays: Arrays with missing data, optimized representation of arrays with repetitive values, computational routines with missing values.
  • DataFrames: Data frames for tabular datasets, database-style joins and indexing, split-apply-combine operations, pivoting, formula and model frames.
  • Distributions: Large collection of univariate and multivariate distributions, descriptive stats, pdf/pmf, mgf, efficient sampling, maximum likelihood estimation.
  • MultivariateStats: Linear regression, dimensionality reduction, multidimensional scaling, linear discriminant analysis.
  • HypothesisTests: Parametric and nonparametric tests, binomial tests, sign tests, exact tests, U tests, rank tests.
  • MLBase: Data preprocessing, score-based classification, performance evaluation, model selection, cross validation.
  • Distances: Various metrics, efficient column-wise and pairwise computation, support for weighted distances.
  • KernelDensity: Kernel density estimation for univariate and bivariate data, customizable interpolation points, kernel, and bandwidth.
  • Clustering: K-means, K-medoids, affinity propagation, evaluation of clustering performance.
  • GLM: Generalized linear models, friendly API for fitting GLMs, works with data frames and formulas, variety of link types, optimized implementation.
  • NMF: Nonnegative matrix factorization, variety of algorithms, NNDSVD initialization.
  • RegERMs: Regularized empirical risk minimization, support for ridge regression and logistic regression, L-BFGS and SGD solvers, highly configurable and extensible.
  • MCMC: Markov Chain Monte Carlo, Bayesian inference, variety of samplers, user-friendly syntax for model specification, auto-differentiation, suspend and resume.
  • TimeSeries: Tools for representing, manipulating, and applying computation to time series data, based on data frames.

Julia’s Role in Data Science