Stat 087: Introduction to Data Science

  • What is Big Data?

  • What is Data Science?

  • Why are you taking this course?

How progress happens:

  • Big Data is the latest ‘information explosion.’

  • The printing press was probably the first major one.

  • Took 300 years for the world to 'settle down' after its invention.

  • We're still settling down with Big Data…

Wait, what IS Big Data?

  • It's a matter of opinion…

One definition of Data Science:

Discoveries in the 20th century:

  • polio vaccine

  • smoking causes lung cancer

  • fertilizing crops increases yield

Discoveries in the 21st century:

Who was Antonie van Leeuwenhoek?

  • The Father of Microbiology (1632–1723)

  • Improved the microscope in order to bring small things into focus.

  • The goal of Data Science is to bring large things into focus.

John Tukey:

"The greatest value of a picture is when it forces us to notice what we never expected to see."

Example from NYTimes:

Stat 087: Introduction to Data Science

  • Web page: https://bb.uvm.edu

  • Software: R and RStudio

  • Topics (not in this exact order)…

Topics (not in this exact order)…

  • I. Data Wrangling (Organizing, ‘tidying’, ‘munging’ data)

  • II. Data Visualization

  • III. Data Analysis

  • IV. Using Statistical Thinking to interpret everything

I. Data Wrangling (not trivial! Takes skill and artistry)

  • Basic ideas

  • Computing skills!!

  • Managing data
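
To make "tidying" concrete, here is a minimal sketch in base R. The `scores_wide` table, its column names, and its values are made up for illustration; the course may use other tools for this step. It reshapes a "wide" table (one column per quiz) into a "long" one (one row per student-quiz pair) and then summarizes it.

```r
# Hypothetical "wide" table: one row per student, one column per quiz.
scores_wide <- data.frame(
  student = c("A", "B", "C"),
  quiz1   = c(8, 6, 9),
  quiz2   = c(7, 9, 10)
)

# Reshape to "long" (tidy) form: one row per student-quiz pair.
scores_long <- reshape(
  scores_wide,
  direction = "long",
  varying   = c("quiz1", "quiz2"),  # columns to stack
  v.names   = "score",              # name of the stacked value column
  timevar   = "quiz",               # new column recording which quiz
  times     = c("quiz1", "quiz2"),
  idvar     = "student"
)
rownames(scores_long) <- NULL

# Once the data are long, a per-quiz summary is a single call.
quiz_means <- aggregate(score ~ quiz, data = scores_long, FUN = mean)

print(scores_long)
print(quiz_means)
```

The long form is usually what plotting and modeling functions expect, which is why wrangling tends to come first.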

II. Data Visualization

  • What constitutes a good Graphic? (Tufte text)

  • Basic plots (base R and ggplot2)

  • Fancy plots (Shiny apps, etc.?)

III. Data Analysis

  • Market Basket Analysis – Association Rules

  • Correlation, Regression and Predictive Modeling

  • Classification and Clustering methods
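
As a small taste of regression and predictive modeling, the sketch below fits a straight line to R's built-in `mtcars` data. The choice of `mpg ~ wt` and the two weights in `new_cars` are illustrative assumptions, not a prescribed course example.

```r
# Fit a simple linear regression: fuel economy (mpg) as a function of weight (wt).
fit <- lm(mpg ~ wt, data = mtcars)

# Estimated intercept and slope.
print(coef(fit))

# Predict mpg for two hypothetical cars (wt is in units of 1000 lbs).
new_cars <- data.frame(wt = c(2.5, 3.5))
preds <- predict(fit, newdata = new_cars)
print(preds)
```

The slope is negative: in this data set, heavier cars tend to get fewer miles per gallon.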

IV. Using Statistical Thinking – Not always obvious! – Example:

WWII planes…

IV. Using Statistical Thinking

  • How was data collected? Random sampling & selection bias

  • How was study designed? Random assignment & cause-effect

  • Separate signal from noise: Modeling & visualizations

  • Could results be due to chance? Bootstrap confidence intervals

  • Presenting the results (we'll use RMarkdown)
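
The question "could results be due to chance?" can be sketched with a percentile bootstrap. This is a minimal illustration on the built-in `mtcars` data; the seed, the 2000 resamples, and the 95% level are arbitrary choices made for this example.

```r
set.seed(87)  # arbitrary seed so the sketch is reproducible

mpg    <- mtcars$mpg
n_boot <- 2000  # number of bootstrap resamples (an assumption, not a rule)

# Resample the data with replacement and record the mean each time.
boot_means <- replicate(n_boot, mean(sample(mpg, replace = TRUE)))

# Percentile bootstrap 95% confidence interval for the mean mpg.
ci <- quantile(boot_means, probs = c(0.025, 0.975))

print(mean(mpg))
print(ci)
```

The interval gives a range of plausible values for the mean, based only on the variability seen in the sample itself.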

We'll use RStudio as our "driver's seat" and we'll use R for data analysis…

summary(mtcars$wt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.513   2.581   3.325   3.217   3.610   5.424
summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.42   19.20   20.09   22.80   33.90

…and for plotting:

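library(ggplot2)  # provides ggplot(), aes(), geom_point()

# Weight vs. fuel economy, colored by number of cylinders
ggplot(data = mtcars,
       mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()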