First, what is Big Data?

  • It’s a matter of opinion…

  • In any case, it represents a big jump in information assets

How progress happens:

  • Big Data is the latest ‘information explosion.’

  • The printing press was probably the first major one.

  • Took 300 years for the world to ‘settle down’ after its invention.

  • We’re still settling down with big data…

Then what is Data Science?

  • It’s such a new field that this is also a matter of opinion. Here’s one idea:

Who was Antonie van Leeuwenhoek?

  • The Father of Microbiology (1632–1723)

  • Improved the microscope in order to bring small things into focus.

  • The goal of Data Science is to bring large things into focus.

  • “The greatest value of a picture is when it forces us to notice what we never expected to see.” —John Tukey

  • Example from the NYTimes: NY Student Test Scores from the Regents Exam

First, we know that standardized test scores usually follow a bell curve:

What’s wrong with this picture?

Stat/CS 087: Introduction to Data Science

  • Web page: https://bb.uvm.edu

  • Software: R, with RStudio as the IDE

  • Assumes NO previous experience with coding or Statistics

  • Topics (not in this exact order during the semester):

Topics

  • I. Framing Real World Problems as Data Questions

  • II. Data Wrangling (Organizing, ‘tidying’, ‘munging’ data)

  • III. Data Analysis

  • IV. Communicating Results

I. Framing Real World Problems as Data Questions (unlike CS, not working to spec)

  • Is the effect real? (signal/noise, chance)

  • What is causing the effect? (study design)

  • How do I predict a variable of interest? (regression, machine learning)

  • How do I identify similar subgroups? (clustering, ‘market segmentation’)

  • Are there ethical concerns?
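
As one way to make "Is the effect real?" concrete, here is a minimal sketch of a permutation test in base R. It uses the built-in mtcars data and asks whether the gap in mpg between manual and automatic cars is bigger than chance shuffling would produce. The seed, the number of shuffles, and all variable names are choices made for this illustration, not course material.

```r
# Permutation test sketch: could the mpg gap between manual and automatic
# cars be explained by chance alone?
set.seed(87)  # arbitrary seed so the shuffles are reproducible

# Observed difference in mean mpg (am: 0 = automatic, 1 = manual)
observed_diff <- mean(mtcars$mpg[mtcars$am == 1]) -
  mean(mtcars$mpg[mtcars$am == 0])

# Shuffle the transmission labels many times and recompute the difference.
# Shuffling breaks any real link between transmission type and mpg.
n_shuffles <- 5000  # illustrative choice
shuffled_diffs <- replicate(n_shuffles, {
  shuffled_am <- sample(mtcars$am)
  mean(mtcars$mpg[shuffled_am == 1]) - mean(mtcars$mpg[shuffled_am == 0])
})

# Share of shuffles with a gap at least as large as the observed one
p_value <- mean(abs(shuffled_diffs) >= abs(observed_diff))

observed_diff
p_value
```

A small p_value suggests signal rather than noise. It says nothing about what causes the gap; that is a study design question.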

II. Data Wrangling (Organizing, ‘tidying’, ‘munging’ data)

  • Basic computing skills

  • Knowing how to detect problems in the data

  • Knowing how to shape data to answer your questions

  • Not trivial! Takes time, skill and artistry
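
A small sketch of what detecting problems and shaping data can look like in base R. It assumes the built-in airquality data set, which contains missing values; the variable names are made up for this illustration.

```r
# airquality ships with R; some of its measurements are missing (NA)

# Detect problems: count the missing values in each column
missing_per_column <- colSums(is.na(airquality))
missing_per_column

# One simple (and lossy) fix: keep only rows with no missing values
complete_rows <- airquality[complete.cases(airquality), ]

# Shape the data to answer a question: mean ozone reading for each month
ozone_by_month <- aggregate(Ozone ~ Month, data = complete_rows, FUN = mean)
ozone_by_month
```

Dropping incomplete rows is only one option, and whether it is appropriate depends on why the values are missing.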

III. Data Analysis

  • Association Rule Learning – Market Basket Analysis

  • Correlation, Regression and Predictive Modeling

  • Bootstrap Confidence Intervals

  • Classification and Clustering methods
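
To give a feel for one item on this list, here is a minimal sketch of a percentile bootstrap confidence interval for the mean mpg in the built-in mtcars data. The seed, the number of resamples, and the variable names are assumptions chosen for illustration; the course may present the method differently.

```r
# Percentile bootstrap sketch: 95% confidence interval for mean mpg
set.seed(87)  # arbitrary seed so the resamples are reproducible

n_resamples <- 2000  # illustrative choice

# Resample the mpg values with replacement and record each resample's mean
boot_means <- replicate(n_resamples,
                        mean(sample(mtcars$mpg, replace = TRUE)))

# The middle 95% of the resampled means gives the interval
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```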

Classification example: Consider the Not Hotdog app

What would it do with this image?

IV. Communicating Results

  • Use Statistical Thinking to interpret results

  • Good Data Visualizations, accurate and appropriate for the audience (ggplot2)

  • Presentation of results, accurate and appropriate for the audience (R Markdown)

BTW: Using Statistical Thinking – Not always obvious! – Example:

WWII planes…

  • In this course: We’ll use RStudio as our “driver’s seat,” R for data analysis, and R Markdown for presenting results:

    summary(mtcars$wt)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   1.513   2.581   3.325   3.217   3.610   5.424
    summary(mtcars$mpg)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   10.40   15.43   19.20   20.09   22.80   33.90

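    library(ggplot2)  # provides ggplot(), aes(), and geom_point()
    # Scatterplot of weight vs. fuel economy, colored by cylinder count
    ggplot(data = mtcars,
           mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
      geom_point()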
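
The same two variables can also illustrate regression and prediction. The following is a minimal sketch using base R's lm(); the choice of a straight-line model and the example weight of 3 (in thousands of pounds) are assumptions made for illustration.

```r
# Fit a straight line predicting mpg from weight (wt, in 1000 lbs)
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope; the slope is the change in mpg per extra 1000 lbs
coef(fit)

# Predict the mpg of a hypothetical car weighing 3000 lbs
predicted <- predict(fit, newdata = data.frame(wt = 3))
predicted
```

A negative slope here would match the downward trend in the scatterplot: heavier cars tend to get fewer miles per gallon.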