Stat/CS 087: Introduction to Data Science

Sheila Weaver
6 July 2020

Introduction to Data Science -- Class #1

First, what is Big Data?

  • It's a matter of opinion…
  • In any case, it represents a big jump in information assets

Here's how progress happens:

[Figure omitted]

  • Big Data is the latest ‘information explosion.’
  • The printing press was probably the first major one.
  • It took 300 years for the world to 'settle down' after its invention.
  • We're still settling down with big data…

[Figure omitted]

Then what is Data Science?

Here's one idea:

[Figure omitted]

Random Question:

Who was Antonie van Leeuwenhoek?

  • The Father of Microbiology (1632 - 1723)
  • Improved the microscope in order to bring small things into focus.
  • The goal of Data Science is to bring large things into focus.
  • “The greatest value of a picture is when it forces us to notice what we never expected to see.” —John Tukey
  • Let's look at an example from the NYTimes: NY Student Test Scores from the Regents Exam

Background Knowledge:

We know that standardized test scores usually follow a bell curve:

[Figure: bell-shaped distribution of test scores]

So, what's wrong with this picture?

[Figure: NYTimes plot of NY Regents Exam scores]

Stat/CS 087: Introduction to Data Science

  • Web page: https://bb.uvm.edu
  • Software: R, using the RStudio IDE
  • Assumes NO previous experience with coding or Statistics
  • Topics (not in this exact order during the semester)…

Main Topics

  • I. Framing Real World Problems as Data Questions
  • II. Data Wrangling (Organizing, ‘tidying’, ‘munging’ data)
  • III. Data Analysis
  • IV. Communication

I. Framing Real World Problems as Data Questions (unlike CS, not working to spec)

  • Is the effect real? (signal/noise, chance)
  • What is causing the effect? (study design)
  • How do I predict a variable of interest? (regression, machine learning)
  • How do I identify similar subgroups? (clustering, 'market segmentation')
  • Are there ethical concerns?
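
As a hedged illustration of the first question ("Is the effect real?"), here is a minimal permutation-test sketch. It assumes R's built-in mtcars data, which is not part of the course material, and asks whether the mileage gap between manual and automatic cars could plausibly be due to chance. All variable names are illustrative.

```r
# Assumption: the built-in mtcars data stands in for a real problem.
# am == 1 means manual transmission, am == 0 means automatic.
set.seed(87)

# Observed difference in mean miles per gallon (manual minus automatic)
observed <- mean(mtcars$mpg[mtcars$am == 1]) -
  mean(mtcars$mpg[mtcars$am == 0])

# If transmission made no difference, the labels could be shuffled freely.
# Shuffle them many times and record the difference each time.
perm_diffs <- replicate(5000, {
  shuffled <- sample(mtcars$am)
  mean(mtcars$mpg[shuffled == 1]) - mean(mtcars$mpg[shuffled == 0])
})

# Share of shuffles with a gap at least as large as the observed one
p_value <- mean(abs(perm_diffs) >= abs(observed))

observed
p_value
```

A small p_value means that shuffled labels rarely produce a gap as large as the observed one, so chance alone is an unlikely explanation. It says nothing about what causes the gap; that is a study design question.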

II. Data Wrangling (Organizing, ‘tidying’, ‘munging’ data)

  • Basic computing skills
  • Knowing how to detect problems in the data
  • Knowing how to shape data to answer your questions
  • Not trivial! Takes time, skill and artistry
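
To make the wrangling steps concrete, here is a small sketch using dplyr. The survey data frame and its column names are hypothetical, invented only to show typical problems: a missing value, an impossible age, and inconsistent labels.

```r
library(dplyr)

# Hypothetical data with typical problems (all values are made up)
survey <- data.frame(
  id    = 1:6,
  age   = c(19, 21, NA, 210, 20, 22),
  major = c("Stat", "stat", "CS", "Stat", "CS", "cs")
)

# Detect problems: count missing ages and implausible ages
problems <- survey %>%
  summarise(n_missing_age     = sum(is.na(age)),
            n_implausible_age = sum(age > 120, na.rm = TRUE))

# Shape the data: standardise the labels, drop the bad rows,
# then summarise by major
clean <- survey %>%
  mutate(major = toupper(major)) %>%
  filter(!is.na(age), age <= 120) %>%
  group_by(major) %>%
  summarise(n = n(), mean_age = mean(age))

problems
clean
```

The cutoff of 120 years is a judgment call, which is part of why wrangling takes skill: someone has to decide what counts as a problem.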

III. Data Analysis

  • Association Rule Learning – Market Basket Analysis
  • Correlation, Regression and Predictive Modeling
  • Bootstrap Confidence Intervals
  • Classification and Clustering methods
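
As one example from this list, a bootstrap confidence interval can be sketched in a few lines of base R. This is a minimal percentile-bootstrap sketch, assuming the built-in mtcars data as a stand-in sample; the number of resamples (2000) is an arbitrary choice.

```r
# Assumption: mtcars$mpg stands in for a sample we want to learn from.
set.seed(87)
x <- mtcars$mpg

# Resample the data with replacement many times,
# computing the mean of each resample
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# The middle 95% of the resampled means gives a percentile interval
ci <- quantile(boot_means, c(0.025, 0.975))

mean(x)
ci
```

The idea is that the spread of the resampled means approximates how much the sample mean would vary from sample to sample.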

Classification Ex #1: Consider the Not HotDog app

[Figure omitted]

What would it do with this image?

[Figure omitted]

It might use a 'classification tree' to decide:

[Figure: classification tree]
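
For a rough idea of how a classification tree is fit in R, here is a minimal sketch using the rpart package. It assumes the built-in iris data rather than hot dog images, so it only illustrates the technique, not the app itself.

```r
library(rpart)

# Assumption: iris (flower measurements) stands in for image features.
# Fit a tree that predicts Species from the four measurements.
fit <- rpart(Species ~ ., data = iris, method = "class")

# Predict a class for each flower and compare with the truth
pred <- predict(fit, newdata = iris, type = "class")
accuracy <- mean(pred == iris$Species)

# Print the splitting rules and the share classified correctly
print(fit)
accuracy
```

Accuracy measured on the same data used to fit the tree is optimistic; a fair check would use data the tree has not seen.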

Classification Ex #2: Democrat or Republican?

IV. Communication

  • Teamwork for better quality (e.g., “Pair Programming” in Agile software development, https://en.wikipedia.org/wiki/Pair_programming)
  • Good Data Visualizations, accurate and appropriate for the audience (ggplot2)
  • Use Statistical Thinking to interpret results
  • Presentation of results, accurate and appropriate for the audience (R markdown, R presentation)

Note: Statistical Thinking is not always obvious! Example: WWII planes.

[Figure omitted]

In this course, we'll use:

  • RStudio as our “driver's seat,”
  • R for data analysis,
  • R packages in the 'tidyverse' – dplyr and ggplot2
  • R Markdown for presenting results

For example:

  • Highway mileage for different types of cars:

# Load dplyr and ggplot2 (ggplot2 includes the mpg data set)
library(tidyverse)

# Mean highway mileage (hwy) for each class of car
mpg %>%
  group_by(class) %>%
  summarise(mileage = mean(hwy))

# A tibble: 7 x 2
  class      mileage
  <chr>        <dbl>
1 2seater       24.8
2 compact       28.3
3 midsize       27.3
4 minivan       22.4
5 pickup        16.9
6 subcompact    28.1
7 suv           18.1

# Density of highway mileage, one panel per class
ggplot(data = mpg,
       mapping = aes(x = hwy, fill = class)) +
  geom_density() +
  facet_grid(class ~ .) +
  theme(legend.title = element_text(size = 18),
        legend.text = element_text(size = 16),
        strip.text.y = element_blank())

[Figure: density plots of highway mileage (hwy), one panel per car class]