Stat/CS 087: Introduction to Data Science

Sheila Weaver
6 July 2020

Introduction to Data Science -- Class #1

First, what is Big Data?

It's a matter of opinion…
In any case, it represents a big jump in information assets

Here's how progress happens:

plot of chunk unnamed-chunk-1

Big Data is the latest ‘information explosion.’

The printing press was probably the first major one.

Took 300 years for the world to 'settle down' after its invention.

We're still settling down with big data…

plot of chunk unnamed-chunk-2

Then what is Data Science?

Here's one idea:

plot of chunk unnamed-chunk-3

Random Question:

Who was Antonie van Leeuwenhoek?

The Father of Microbiology (1632 - 1723)
Improved the microscope in order to bring small things into focus.
The goal of Data Science is to bring large things into focus.
“The greatest value of a picture is when it forces us to notice what we never expected to see.” —John Tukey
Let's look at an example from the New York Times: New York student test scores on the Regents Exam

Background Knowledge:

We know that standardized test scores usually follow a bell curve:

plot of chunk unnamed-chunk-4
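To see what that bell shape looks like, here is a minimal sketch that simulates made-up scores. The mean of 70 and standard deviation of 10 are assumptions chosen for illustration; they are not taken from the Regents data.

```r
# Simulated test scores (illustrative only; not real exam data).
set.seed(87)
scores <- rnorm(10000, mean = 70, sd = 10)

# A histogram of the simulated scores shows the familiar bell shape.
hist(scores, breaks = 40,
     main = "Simulated test scores",
     xlab = "Score")

# In a bell curve, roughly 68% of values fall within one
# standard deviation of the mean.
within_one_sd <- mean(abs(scores - 70) <= 10)
within_one_sd
```

Keep this smooth, symmetric shape in mind when looking at real score data.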

So, what's wrong with this picture?

plot of chunk unnamed-chunk-5

Stat/CS 087: Introduction to Data Science

Web page: https://bb.uvm.edu
Software: R, using the RStudio IDE
Assumes NO previous experience with coding or Statistics
Topics (not in this exact order during the semester)…

Main Topics

I. Framing Real World Problems as Data Questions
II. Data Wrangling (Organizing, ‘tidying’, ‘munging’ data)
III. Data Analysis
IV. Communication

I. Framing Real World Problems as Data Questions (unlike in CS, you are not working to a spec)

Is the effect real? (signal/noise, chance)
What is causing the effect? (study design)
How do I predict a variable of interest? (regression, machine learning)
How do I identify similar subgroups? (clustering, 'market segmentation')
Are there ethical concerns?
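As one possible illustration of the first question ("Is the effect real?"), the sketch below runs a simple permutation test. It uses the `mtcars` data set that ships with base R as a stand-in (an assumption for illustration; it is not a course data set) and asks whether manual and automatic cars differ in fuel economy by more than chance alone would explain.

```r
# mtcars ships with base R; am = 0 (automatic), 1 (manual).
set.seed(87)

# The signal: observed difference in mean miles per gallon.
observed <- mean(mtcars$mpg[mtcars$am == 1]) -
            mean(mtcars$mpg[mtcars$am == 0])

# The noise: shuffle the transmission labels many times to see
# how large a difference arises by chance alone.
shuffled <- replicate(5000, {
  am_shuffled <- sample(mtcars$am)
  mean(mtcars$mpg[am_shuffled == 1]) - mean(mtcars$mpg[am_shuffled == 0])
})

# Share of shuffles with a difference at least as large as the observed one.
p_value <- mean(abs(shuffled) >= abs(observed))
observed
p_value
```

A small share suggests the difference is unlikely to be chance alone. It says nothing about what causes the difference; that is a study design question.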

II. Data Wrangling (Organizing, ‘tidying’, ‘munging’ data)

Basic computing skills
Knowing how to detect problems in the data
Knowing how to shape data to answer your questions
Not trivial! Takes time, skill and artistry
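A minimal sketch of two of these steps, detecting a problem and then shaping the data to answer a question, might look like the following. It assumes the `dplyr` package is installed and uses the `airquality` data set from base R as a stand-in example; neither choice comes from the course materials.

```r
library(dplyr)

# airquality ships with base R: daily air measurements with some gaps.
# Step 1: detect problems. Count the missing values in each column.
missing_counts <- colSums(is.na(airquality))
missing_counts

# Step 2: shape the data to answer a question:
# "What was the average ozone level in each month?"
ozone_by_month <- airquality %>%
  filter(!is.na(Ozone)) %>%            # drop days with no ozone reading
  group_by(Month) %>%                  # one group per month
  summarise(mean_ozone = mean(Ozone),
            days_measured = n())       # how many days back each average
ozone_by_month
```

Reporting `days_measured` next to each average is a deliberate choice: it shows how much data survived the cleaning step.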

III. Data Analysis

Association Rule Learning – Market Basket Analysis
Correlation, Regression and Predictive Modeling
Bootstrap Confidence Intervals
Classification and Clustering methods
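To give a flavor of one of these methods, here is a sketch of a bootstrap confidence interval using the percentile method, which is one common variant and not necessarily the one used later in the course. It uses the `mtcars` data from base R as a stand-in data set.

```r
set.seed(87)
mpg_values <- mtcars$mpg   # miles per gallon for 32 cars

# Resample the data with replacement many times,
# recording the mean of each resample.
boot_means <- replicate(10000, mean(sample(mpg_values, replace = TRUE)))

# Percentile method: the middle 95% of the bootstrap means.
ci <- quantile(boot_means, probs = c(0.025, 0.975))

mean(mpg_values)   # the sample mean
ci                 # a 95% bootstrap interval around it
```

The interval describes how much the sample mean would plausibly move if we had drawn a different sample of cars.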

Classification Ex #1: Consider the Not HotDog app

What would it do with this image?

It might use a 'classification tree' to decide:

[Figure: example classification tree]
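Fitting a tree to hot dog images is beyond a first class, but the idea can be sketched on a small table of measurements. The example below assumes the `rpart` package is available (it is bundled with most R installations) and uses the `iris` data from base R as a stand-in for image data.

```r
library(rpart)

# iris ships with base R: 150 flowers, 3 species, 4 measurements each.
# Fit a tree that predicts species from the four measurements.
tree <- rpart(Species ~ ., data = iris, method = "class")

# Print the sequence of yes/no questions the tree asks.
print(tree)

# Classify every flower and compare with the true species.
predicted <- predict(tree, newdata = iris, type = "class")
accuracy <- mean(predicted == iris$Species)
accuracy
table(predicted, actual = iris$Species)
```

Accuracy measured on the same data used to build the tree is optimistic; a fair check would hold some flowers back for testing.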

Classification Ex #2: Democrat or Republican?

https://www.nytimes.com/interactive/2019/08/08/opinion/sunday/party-polarization-quiz.html

IV. Communication

Teamwork for better quality (e.g., “Pair Programming” in Agile software development, https://en.wikipedia.org/wiki/Pair_programming)
Good Data Visualizations, accurate and appropriate for the audience (ggplot2)
Use Statistical Thinking to interpret results
Presentation of results, accurate and appropriate for the audience (R markdown, R presentation)

Note: Statistical Thinking is not always obvious! Example: WWII planes…

plot of chunk unnamed-chunk-9

In this course, we'll use:

RStudio as our “driver's seat,”
R for data analysis,
R packages in the 'tidyverse' – dplyr and ggplot2
R Markdown for presenting results

For example:

Highway mileage for different types of cars:

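library(tidyverse)  # loads dplyr and ggplot2 (ggplot2 provides the mpg data)

# Mean highway mileage (hwy) for each class of car
mpg %>%
  group_by(class) %>%
  summarise(mileage = mean(hwy))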

# A tibble: 7 x 2
  class      mileage
  <chr>        <dbl>
1 2seater       24.8
2 compact       28.3
3 midsize       27.3
4 minivan       22.4
5 pickup        16.9
6 subcompact    28.1
7 suv           18.1

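# One density curve of highway mileage per class, stacked in rows
ggplot(data = mpg,
       mapping = aes(x = hwy, fill = class)) +
  geom_density() +
  facet_grid(class ~ .) +
  theme(legend.title = element_text(size = 18),
        legend.text = element_text(size = 16),
        strip.text.y = element_blank())  # hide row labels; the legend names each class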

[Figure: density plots of highway mileage, one panel per car class]