Stat/CS 087: Introduction to Data Science

Sheila Weaver
26 August 2019

Introduction to Data Science -- Class #1

First, what is Big Data?

  • It's a matter of opinion…
  • In any case, it represents a big jump in information assets

Here's how progress happens:

plot of chunk unnamed-chunk-1

  • Big Data is the latest ‘information explosion.’
  • The printing press was probably the first major one.
  • Took 300 years for the world to 'settle down' after its invention.
  • We're still settling down with big data…

plot of chunk unnamed-chunk-2

Then what is Data Science?

Here's one idea:

plot of chunk unnamed-chunk-3

Random Question:

Who was Antonie van Leeuwenhoek?

  • The Father of Microbiology (1632 - 1723)
  • Improved the microscope in order to bring small things into focus.
  • The goal of Data Science is to bring large things into focus.
  • “The greatest value of a picture is when it forces us to notice what we never expected to see.” —John Tukey
  • Let's look at an example from the NYTimes: NY Student Test Scores from the Regents Exam

Background Knowledge:

We know that standardized test scores usually follow a bell curve:

plot of chunk unnamed-chunk-4
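
For a rough feel of that bell shape in R, here is a minimal sketch that simulates scores from a normal distribution. The mean of 70 and standard deviation of 10 are made-up values for illustration, not actual Regents figures.

```r
# Simulate 10,000 hypothetical test scores from a normal distribution.
# The mean (70) and standard deviation (10) are assumptions for illustration.
set.seed(87)
scores <- rnorm(10000, mean = 70, sd = 10)

# A histogram of the simulated scores should look roughly bell-shaped.
hist(scores, breaks = 40,
     main = "Simulated test scores", xlab = "Score")

# For a bell curve, about 68% of scores fall within one
# standard deviation of the mean.
share_within_1sd <- mean(abs(scores - 70) <= 10)
share_within_1sd
```

Real score distributions can depart from this shape, and those departures are often what is worth looking at.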

So, what's wrong with this picture?

plot of chunk unnamed-chunk-5

Stat/CS 087: Introduction to Data Science

  • Web page: https://bb.uvm.edu
  • Software: R, using IDE: RStudio
  • Assumes NO previous experience with coding or Statistics
  • Topics (not necessarily in this order during the semester)…

Main Topics

  • I. Framing Real World Problems as Data Questions
  • II. Data Wrangling (Organizing, ‘tidying’, ‘munging’ data)
  • III. Data Analysis
  • IV. Communication

I. Framing Real World Problems as Data Questions (unlike CS, not working to spec)

  • Is the effect real? (signal/noise, chance)
  • What is causing the effect? (study design)
  • How do I predict a variable of interest? (regression, machine learning)
  • How do I identify similar subgroups? (clustering, 'market segmentation')
  • Are there ethical concerns?
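
One way to approach "Is the effect real?" is a permutation test. The sketch below is an illustration only and uses R's built-in mtcars data rather than any course data set. It asks whether the mileage gap between manual and automatic cars could plausibly arise by chance. The helper name mean_gap is hypothetical.

```r
# Permutation test sketch using the built-in mtcars data.
# Question: is the mileage gap between manual cars (am == 1) and
# automatics (am == 0) larger than chance alone would produce?
set.seed(87)

# Hypothetical helper: difference in mean mpg, manual minus automatic.
mean_gap <- function(mpg, am) mean(mpg[am == 1]) - mean(mpg[am == 0])

observed_gap <- mean_gap(mtcars$mpg, mtcars$am)

# Shuffle the transmission labels many times to see what gaps chance produces.
n_shuffles <- 5000
shuffled_gaps <- replicate(n_shuffles,
                           mean_gap(mtcars$mpg, sample(mtcars$am)))

# Share of shuffles with a gap at least as large as the observed one
# (the +1 counts the observed data as one possible arrangement).
p_value <- (sum(abs(shuffled_gaps) >= abs(observed_gap)) + 1) / (n_shuffles + 1)

observed_gap
p_value
```

A small p_value suggests the gap is hard to explain by chance, but it says nothing about what causes the gap. That is a study design question.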

II. Data Wrangling (Organizing, ‘tidying’, ‘munging’ data)

  • Basic computing skills
  • Knowing how to detect problems in the data
  • Knowing how to shape data to answer your questions
  • Not trivial! Takes time, skill and artistry
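
As a small illustration of detecting problems in data, the sketch below builds a made-up data frame with deliberate flaws and checks for three common issues. The names and scores are hypothetical.

```r
# A small made-up data set with deliberate problems (all values hypothetical).
students <- data.frame(
  name  = c("Ana", "Ben", "Cy", "Dee", "Ben"),
  score = c(88, NA, 105, 72, NA),   # scores should be between 0 and 100
  stringsAsFactors = FALSE
)

# Problem 1: missing values. Count the NAs in each column.
na_counts <- colSums(is.na(students))

# Problem 2: impossible values. Flag scores outside the 0-100 range.
# which() skips the NA comparisons, so only real violations are returned.
out_of_range <- which(students$score < 0 | students$score > 100)

# Problem 3: duplicated rows. Flag rows that repeat an earlier row.
duplicate_rows <- which(duplicated(students))

na_counts
students[out_of_range, ]
students[duplicate_rows, ]
```

Deciding what to do with the flagged rows (drop, correct, or keep them) is a judgment call that depends on the question being asked.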

III. Data Analysis

  • Association Rule Learning – Market Basket Analysis
  • Correlation, Regression and Predictive Modeling
  • Bootstrap Confidence Intervals
  • Classification and Clustering methods
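
To give a flavor of one of these methods, here is a minimal sketch of a percentile bootstrap confidence interval for a mean. It uses the built-in mtcars data as a stand-in. The number of resamples and the 95% level are choices made for illustration.

```r
# Percentile bootstrap sketch: a 95% confidence interval for mean mpg
# in the built-in mtcars data.
set.seed(87)
mileage <- mtcars$mpg

# Resample the data with replacement many times, recording the mean each time.
n_resamples <- 5000
boot_means <- replicate(n_resamples, mean(sample(mileage, replace = TRUE)))

# The middle 95% of the resampled means gives a percentile interval.
ci <- quantile(boot_means, probs = c(0.025, 0.975))

mean(mileage)
ci
```

The interval describes how much the sample mean would be expected to vary from one sample of cars to another.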

Classification Ex: Consider the Not HotDog app

plot of chunk unnamed-chunk-6

What would it do with this image?

plot of chunk unnamed-chunk-7

IV. Communication

  • Teamwork for better quality (e.g., “Pair Programming” in Agile software development, https://en.wikipedia.org/wiki/Pair_programming)
  • Good Data Visualizations, accurate and appropriate for the audience (ggplot2)
  • Use Statistical Thinking to interpret results
  • Presentation of results, accurate and appropriate for the audience (R markdown, R presentation)

Note: Statistical Thinking is not always obvious! Example: WWII planes.

plot of chunk unnamed-chunk-8

In this course, we'll use:

  • RStudio as our “driver's seat,”
  • R for data analysis, and
  • R Markdown for presenting results.

Example: highway mileage for different types of cars:

# Load ggplot2, which provides the mpg data set
library(ggplot2)

# Min, quartiles, median, mean, and max of highway miles per gallon
summary(mpg$hwy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   18.00   24.00   23.44   27.00   44.00 

# Density of highway mileage, one panel per class of car
ggplot(data = mpg,
       mapping = aes(x = hwy, fill = class)) +
  geom_density() +
  facet_grid(class ~ .) +
  theme(legend.title = element_text(size = 18),
        legend.text = element_text(size = 16),
        strip.text.y = element_blank())

plot of chunk density
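
A numeric summary can sit alongside the density plot. The sketch below, which assumes the ggplot2 package is installed, computes the median highway mileage and the number of cars for each class in the same mpg data set.

```r
library(ggplot2)  # provides the mpg data set

# Median highway mileage for each class of car.
median_hwy <- tapply(mpg$hwy, mpg$class, median)
sort(median_hwy, decreasing = TRUE)

# Number of cars in each class, to show how much data sits behind each curve.
class_counts <- table(mpg$class)
class_counts
```

Classes with only a few cars produce less reliable density curves, so the counts are worth checking before reading much into a panel's shape.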