Comparative Biology as Data Science

A Field Guide to the Tech Industry for Biologists

Samuel Crane, PhD
Data Scientist, Amplify Inc.


On Career Advice

Data Science = Applied Research Problems

Is 'data science' just another name for statistics? Kinda, but not really.

One way to think of 'data science' is as a job that combines statistics, programming, and business intelligence: what used to be several distinct roles (statistician/modeler, engineer, and business analyst) is now a single role, with new responsibilities that arise from the combination.

Three core skills
1. Mathematics (especially statistics and linear algebra)
2. Computer science (especially programming and infrastructure)
3. Communication (asking questions, visualization, writing)

You don't need an academic background in statistics to be hired as a data scientist or data analyst.

Industries

What else do you like besides anoles?

  • Education
  • Health
  • Government
  • Technology
  • Retail
  • pretty much everything else too

Data Science in Health

Data Science in Government & Politics

Data Science in Technology

Data Science in Retail

Data Science in Sports

Data Science at University

Data Science in Science

Data Science in Educational Technology

I was hired 7 months after defending.

My job entails:

  • Using specialized software and languages to build and manage data pipelines (just like grad school)
  • Using statistics to model these data (just like grad school)
  • Communicating results to stakeholders (just like grad school)

IRT, Machine Learning, and Bayesian Statistics

Our Analytics Stack

Research & Prototyping:

  • R, Python, SQL, Git
  • Weka, Mplus, RStudio, GitHub/Stash
  • Postgres

Deployment and Production:

  • Java, Python, Ruby, Hive/Impala
  • Hadoop, Amazon Web Services

Comparative Biology as Data Science

You bring this to the table:

  • Technical background
  • Research experience
  • Communication skills

What you're not thinking about now but should be

  • Statistics (do extra coursework now if necessary; it's harder later)
  • Programming (just teach yourself, see Resources)

The Upside

  • Flexibility: More opportunities will arise more often over a smaller area. Work remotely.
  • Lifestyle: Compensation is greater than anything you'll get as a post-doc or even as an assistant professor. You'll rarely work outside business hours (unless you're at a small company or are C-level)
  • Challenging Work: The work is very fast-paced and collaborative, in a way life science never was (at least for me)
  • Community: In the right kind of company, you'll still be publishing and going to conferences

The Pitfalls

  • The Tech interview
    • Things I was asked during my interviews:
      • Make a wide data table narrow, using language of your choice
      • What does the "general" in general linear model mean?
      • What are the assumptions of linear regression and how would you know when you've violated them?
  • The business environment
    • You're not always right. Being on a product team means that you're working with people from very different professional backgrounds and with very different priorities than you may be used to. Arguing is a habit; listening is a skill.
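
The "wide table to narrow" interview question above can be sketched in plain Python. This is only one possible answer, not the one given in the interview: the function name `wide_to_long`, the column names, and the measurement values are all made up for illustration.

```python
def wide_to_long(rows, id_col, var_name="variable", value_name="value"):
    """Reshape wide rows (one row per id, one column per measurement)
    into narrow rows (one row per id/measurement pair)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_col:
                continue  # the id column is repeated on every narrow row
            long_rows.append(
                {id_col: row[id_col], var_name: column, value_name: value}
            )
    return long_rows


# Hypothetical wide table: one row per specimen, invented values.
wide_rows = [
    {"id": "lizard_1", "svl_mm": 60.0, "mass_g": 5.0},
    {"id": "lizard_2", "svl_mm": 52.0, "mass_g": 4.0},
]

long_rows = wide_to_long(wide_rows, id_col="id")
for row in long_rows:
    print(row)
```

With a data-frame library the same reshape is usually a single call, for example `DataFrame.melt` in pandas or `pivot_longer` in tidyr for R.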

Suggestions For Being a Better (Data) Scientist

  • Sufficient technology skills:
    • R, Python, SQL, Git
    • Bonus: Java
    • Learn best practices and style guides for coding
  • Sufficient statistical skills:
    • Descriptive statistics
    • Linear algebra
    • Bayesian statistics
    • Machine learning (try some niche modeling!)
  • Cultivate a passion for what you might do… no one will hire you if you make them feel like you're doing them a favor.

Embracing fundamental attribution error

You have to be visible.

  • Be the hacker in the lab
  • Push your code to GitHub
  • Be active on the Internet (Twitter, blog, LinkedIn, etc)
  • Go outside the Museum: go to MeetUps, hack days, courses and workshops outside AMNH.

What's Lost?

  • Focus on organisms
  • Field work
  • Domain knowledge (but this is a sunk cost and so a bad basis for decision-making)

What's Not Lost?

  • Research: Applied research is still a deep dive into complex topics
  • Nerds
  • Respect
  • Hard things to learn

\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \]
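
For readers who think in code, the formula above is a squared-error cost averaged over m examples. The slide does not define h_θ, so this sketch assumes the usual linear hypothesis (a dot product of θ and x); the function names and the toy data are illustrative.

```python
def hypothesis(theta, x):
    """Assumed linear hypothesis: h_theta(x) = theta . x"""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared prediction errors."""
    m = len(y)
    squared_errors = (
        (hypothesis(theta, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y)
    )
    return sum(squared_errors) / (2 * m)


# Toy data: the first feature is a constant 1 for the intercept, and y = x.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]

print(cost([0.0, 1.0], X, y))  # perfect fit, so the cost is 0.0
print(cost([0.0, 0.0], X, y))  # predicts 0 everywhere: (1 + 4 + 9) / 6
```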

If you're doing quantitative comparative biology, you're well suited to a career as a data scientist.

Resources

Math & Statistics

Connect

Voices

Jessica Kirkpatrick on the transition from academia to industry:
http://womeninastronomy.blogspot.com/2013/01/datascience.html
http://womeninastronomy.blogspot.com/2013/01/astroVdatascience.html
http://www.astrobetter.com/nailing-the-tech-interview/

Trey Causey on getting started in data science
http://treycausey.com/getting_started.html

Philip Guo compares 6 months in industry vs 6 months as assistant professor
http://pgbovine.net/academia-industry-junior-employee.htm

Shelby Sturgis on Developing Skillset
http://insightdatascience.com/blog/fellow_spotlight_shelby_sturgis.html

Should I Get a PhD?
http://shouldigetaphd.com/

Jobs

https://twitter.com/RStatsJobs
https://www.kaggle.com/jobs

Stay In Touch: