An introduction to the standards and tools of the professional data scientist
This class is about gaining knowledge from raw data. You'll learn to use large and complicated data sets to make better decisions.
A mix of practice and principles:
We’ll learn what to trust, how to use it, and how to learn more.
In real life, there might be hundreds or thousands of features. If you know regression: this is like regression on steroids!
Statistical learning, data mining, data science, ML, AI… there are many labels for what we’re doing!
How our goals fit into all this: we keep an eye on what is both useful and true:
Among economists, “data mining” is a dirty word. Example: the “Lucas critique”:
This is a total caricature. We'll strive to give data mining a better reputation :-)
Big in either or both:
In these settings, you cannot:
Some data-mining tools are familiar, or familiar with a twist:
Some are totally new:
All require a different approach when \( n \) and \( p \) get really big.
This course focuses a little on 2, heavily on 3-4, and not at all on 1.
On collection, management, and storage: a full subject unto itself. (I’m happy to provide references, but this isn’t the part of data science we cover in this course.)
On cleaning: I defer to Jeff Leek’s description of “How to Share Data with a Statistician.” (See course readings.) Always provide:
You will analyze a lot of data in this course. Our watchwords are transparency and reproducibility.
The ideal: “hit-enter” reproducibility.
All reports involve three main things:
The basic recipe for writing a statistical report:
This helps avoid “fear of the blank page”!
R is the real deal: an immensely capable, industrial-strength platform for data analysis.
It's used everywhere:
R is free and looks the same on all platforms, so you'll always be able to use it.
A huge strength of R is that it is open-source. R has a core, to which anyone can add contributed packages.
R has flaws, but so do all options (e.g., Python is great, but the community of stats developers is smaller, interactive data analysis is less slick, and you need to be a more careful and sophisticated programmer).
Most students prefer to use R via an IDE. We'll use RStudio. It's awesome.
This presentation was written in Markdown.
## Markdown
- A simple markup language for generating a wide variety of output formats (HTML, PDF, etc.) from plain text documents.
- Two pillars: (1) a formatting language; (2) a conversion tool.
- Much simpler than, for example, HTML.
This is what the raw text looked like for the last slide; it got rendered as a bulleted list under a title.
git:
GitHub:
The git repo for our class website is stored both on GitHub (the remote copy) and my own computer (the local copy).
Basic workflow:
- Make changes to files in your local copy of the repo.
- Commit those changes, thereby creating a snapshot of the repo at a single moment in time that can always be restored.
- Push those changes to the remote.
You can use git either through:
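For instance, at the command line, the basic workflow above might look like the sketch below. It is only an illustration under some loud assumptions: the "remote" is a throwaway bare repo on your own disk standing in for GitHub, and the file name, user name, email, and commit message are all made up for the example.

```shell
#!/bin/sh
# Sketch of the basic git workflow in a throwaway directory.
# Assumption: a local bare repo stands in for the GitHub remote so the
# example is self-contained. All names below are hypothetical.
set -e

WORK=$(mktemp -d)
git init --quiet --bare "$WORK/remote.git"                      # stand-in for the remote copy
git clone --quiet "$WORK/remote.git" "$WORK/local" 2>/dev/null  # the local copy
cd "$WORK/local"

# Identity for this repo only, so the commit works on a fresh machine.
git config user.name "Example Student"
git config user.email "student@example.com"

# Step 1: make changes in the local copy.
echo "# Homework 1" > report.md

# Step 2: commit those changes (a snapshot that can always be restored).
git add report.md
git commit --quiet -m "Start homework 1 report"

# Step 3: push those changes to the remote.
git push --quiet origin HEAD
```

With a real GitHub repo, you would clone from the repo's URL instead of a local path; the add, commit, and push steps stay the same.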