Day 1
Introduction
Something has changed
Data used to be something that was collected in lab notebooks or in government offices and published in printed tables. Recording each piece of data required a human intervention, accessing each piece of data required a human interpretation.
Nowadays, data is being recorded automatically. Vast volumes are being collected. It's easily copied and transferred.
- Genetics:
- The old days: medical outcomes of your relatives, and handful of chromosomally detected abnormalities (e.g. Downs syndrome), inference of genotype based on ancestors phenotype, e.g. Sickle-cell disease, Tay-Sachs.
- Now: Human genome. 10 years to sequence for the first time, now we're seeing differential sequencing of normal and cancerous cells.
- Physiology:
- The old days: blood pressure, respiration, 5-second traces of electrocardiograms
- Now: Microarray data — monitor expressions of tens of thousands of proteins. Long-term monitoring.
- Consumer Economics:
- Old days: payments in cash or by check.
- Now: Payments by credit card and EFT. Data available on where people spend money.
- Sales:
- The old days: Close down for inventory every month or so. Thousands of products.
- Now: sales are tracked immediately and often linked to customer traits. Hundreds of thousands of products: need for recommender systems.
- Human-to-human contacts:
- Insurance:
- The old days: determine rates based based on group membership, e.g. males aged 16-20.
- Now: Monitor the car and pull out risky behaviors. (What constitutes risky behavior is a question a data analyst might want to address). A recent controversy and car monitoring: A test drive of the new Tesla original article, Tesla's response and an ombudsman's review
What is Data?
Measurements together with the context in which the measurement was taken.
Information collected for a purpose.
Major types of information being collected nowadays:
- Photographs and videos: Russian meteorite
- Time series and sounds: Stock prices, seismograms, North Korean nuclear test detection
- Text: Crisis mapping, earthquake estimation
- Tabular data — organization as rows and columns.
What is “Big Data”?
- Data with lots of cases.
- Data with lots of variables.
- “Any amount of data that is too big to be handled by one computer.” — John Rauser (Amazon.com)
The goal of computing on data is to reduce it to a form where decisions and judgments can be supported. We'll call this a “presentation” of data.
Some goals for data presentations:
- Identifying interesting or important events that are hidden by a mass of irrelevant events. Example: Going through a database of clinical reports to identify drug side-effects and interactions.
- Looking for patterns in data that help us understand the overall system. Often, information is distributed among the multiple measured variables and our job is to condense that distributed information into a form where it's readily interpreted. Example: Interpreting the expression pattern of thousands of genes to identify pathways or mechanisms of a disease.
Goals for this course
- Introduce you to the basic ideas of data presentation
- Graphics modalities
- Transformating and combining data
- Summarizing patterns with models
- Classification and dimension reduction
- Develop the skills needed to make effective data presentations
- Access to tabular data
- Re-organization of tabular data for combining different sources
- Proficiency with basic techniques for modeling, classification, and dimension reduction.
- Experience with choices in data presentation
- Develop the confidence to work with modern tools
- Computer commands
- Documentation and work-flow
One way to think about this is as the “grammar of data.” We want you to know what are the nouns and verbs, and what combinations make sense.
What we won't do
- Techniques for reduction of information from photographs, videos, or free text.
- Techniques for heirarchical or similarly structured data (e.g. XML files)
- Techniques particular to the Internet
- Techniques about computational processes for huge data sets; we'll work with data that fits on a conventional computer.
Style of the class
- In class:
- About half the time as presentation of concepts and new skills.
- Often these will be presented with very small data sets that make it easy to see what's going on.
- About half the time with you working on using these concepts and skills with us present.
- Getting a specific assigned example to work
- Exploring the data on your own.
- Out of class:
- A little bit of drill, just to solidify basic skills.
- Write up a short report about what you did in class.
- Work on another case study that we'll assign.
- You're job is to tell a story with the data. To do this, you'll have to think about what the story might be — call it making hypotheses — and then test these possibilities with the data. Construct a presentation of the data that supports or refutes your hypotheses.
You are pioneers
- There is no established way to teach this material, or even a broad consensus about what material should be taught.
- We need to find out what works and what doesn't.
- We ask you to be robust enough to bear up with things that don't seem initially to be working; give them a chance until we definitively conclude that they won't work.
- You will be working with very limited support materials: we're lacking texts, formally written assignments, crib sheets for computer commands.
- Sometimes we will miss something basic. Unfortunately, we the instructors take a lot of knowledge for granted and any amount of talking among us won't reliably identify what this is. We need you to help us identify the things that many students won't already know so that we can make them explicit. Example from algebra: If I can reduce it to a quadratic form, I can find the zeros.
Today's Agenda
Log in to R. Basic aspects of syntax:
- Functions and arguments.
- Assignment
Create an R Markdown document about basic syntax.
Tabular data
Structure of tabular data:
- cases and variables
- types of variables: quantitative and categorical
Basic operations on data frames: examples with small data sets, CPS85, KidsFeet, Utilities.
Today's graphics modality: The scatter plot
Quick scatter plots: xyplot
Graphical principles of scatter plots
- x and y are each a “dimension”. So natural to plot out two dimensions.
- Many possibilities for three dimensions, but x,y,z doesn't seem to be effective unless you can rotate it, and this is hard to reproduce and display. So color and size and shape are used. Later in the semester, we'll work with “dimension reduction.”
- Position is precisely detected.
- Size and color are not, so use these for categories with discrete differences or just to suggest a trend.
Elaborate scatter plots with mScatter:
- choices of x and y variables
- choice of symbol size
- choice of colors
Use of R Markdown
- Creating, naming, and saving files.
- Using “run chunks” to test
- Compiling to HTML
- Publishing to RPubs
- Copying the command form produced by
mScatter into a document.
Larger data: FAOsimple country data
How to make scatter.
Scaling and logarithms.
Work through the in-class activity.