MATH 216 Introduction to Data Science

Albert Y. Kim

From a presentation by former Institute of Mathematical Statistics president Bin Yu:

[Image: slide from Bin Yu's presentation]

More refined Venn Diagram by Drew Conway:

[Image: Drew Conway's Venn diagram]

How does one answer a scientific question with data? Leek & Peng in Nature (2015) illustrate the data pipeline.

Intro stats classes partially cover stages 3, 4, and 5 of this pipeline. This class will try to cover all aspects of it.

Follow the complete statistical analysis cycle
Real data: more interesting, not clean, violating statistical assumptions
Data visualization: not just infographics, but as an analytical tool
Use computational tools: R coding, R packages, scraping data from the web, building web apps
Apply statistical methodologies: regression, correlated data, spatial statistics, text mining, etc.

Teach in a language-agnostic way. Transferable and generalizable ideas, not a class on R.
The goal is not to learn a programming language, but to learn how to learn a programming language:
- By doing
- Google is your best friend
- By suffering, like learning any other language

Don't thrash. Really!
Don’t be stuck for more than 20 minutes. This takes self-awareness and mindfulness.
Seek expert advice; you’ll be on the other side soon enough:
- Your peers.
- Me. Note that I prefer speaking in person to email.


For the first part of the class, we emphasize the two most important tools:

Tools for manipulating/wrangling your data: dplyr package
Tools for visualizing data: ggplot2 package (an implementation of the Grammar of Graphics)

The beauty of these two R packages is the deep philosophy underlying their implementations.
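
As a rough illustration of that philosophy, here is a minimal sketch using R's built-in mtcars dataset. The dataset and the particular columns are assumptions chosen for illustration, not examples from the course. dplyr expresses wrangling as a chain of small verbs, while ggplot2 builds a plot by mapping variables to aesthetics and adding layers.

```r
library(dplyr)
library(ggplot2)

# dplyr: chain "verbs" to wrangle a data frame.
# mtcars is a built-in R dataset, used here only as a stand-in.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%                 # one group per cylinder count
  summarise(mean_mpg = mean(mpg),   # average fuel economy per group
            n = n())                # number of cars per group

# ggplot2: map variables to aesthetics, then add a geometric layer.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
```

Printing mpg_by_cyl shows the summary table, and printing p draws the scatterplot. The same two ideas, verbs for data and layers for graphics, carry over to tools in other languages.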