# Stat/CS 087
## ❆<br/>Fall 2021
### Sheila Weaver
### University of Vermont
### 31 August 2021 (updated: 2021-08-12)

---

# Get Started

---

# First, what is Big Data?

---

# It's a matter of opinion...

---

# In any case, it represents a big jump in information assets:

---

background-position: center, bottom

background-size: 85%

---
class: inverse

##  Big Data is the latest ‘information explosion.’

--
class: inverse

## The printing press was probably the first major one.

--
class: inverse

## It took 300 years for the world to 'settle down' after its invention.

--
##  We're still settling down with big data...

---
class: inverse

# What is Data Science?

---

background-position: center, bottom

background-size: 60%

Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Data_scientist_Venn_diagram.png)

---

#  Random Question:

---

## Who was Antonie van Leeuwenhoek?

### The Father of Microbiology (1632 - 1723)

### Improved the microscope in order to bring *small* things into focus.

### The goal of Data Science is to bring *large* things into focus.

> *"The greatest value of a picture is when it forces us to notice what we never expected to see."* -- John Tukey

### Let's look at an example from the New York Times: New York student test scores on the Regents Exam

### First, remember this:

---

background-position: center, bottom

background-size: 70%

### We know that standardized tests usually have a bell shape:

---
class: inverse

background-position: center, bottom

background-size: 95%

## What's wrong with this?

---

## Stat/CS 087: Introduction to Data Science

## Main Topics (not in this exact order) 
======================================

### I.	Framing Real World Problems as Data Questions

### II. Data Wrangling (Organizing, ‘tidying’, ‘munging’ data)

### III. Data Visualization and Analysis

### IV.	Communication

---

## I.	Framing Real World Problems as Data Questions
======================================

#### Is the effect real?  (signal/noise, chance)

#### What is causing the effect?  (study design)

#### How do I predict a variable of interest?  (regression, classification)

#### How do I identify similar subgroups? (clustering, 'market segmentation')

#### Are there ethical concerns?
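
To make the first question ("Is the effect real?") concrete, here is a minimal sketch of a permutation test. It assumes R's built-in `mtcars` data as a stand-in for a real study (it is not part of the course materials listed here) and asks whether the gap in gas mileage between manual and automatic cars could plausibly be due to chance alone.

```r
# Assumption: the built-in mtcars data stands in for a real study.
set.seed(87)  # make the random shuffles reproducible

# Observed signal: mean mpg of manual cars (am = 1) minus automatic cars (am = 0)
observed <- diff(tapply(mtcars$mpg, mtcars$am, mean))

# Noise: shuffle the transmission labels many times to see what chance alone produces
shuffled <- replicate(2000, {
  diff(tapply(mtcars$mpg, sample(mtcars$am), mean))
})

# Share of shuffles with a difference at least as large as the observed one
p_value <- mean(abs(shuffled) >= abs(observed))
p_value
```

A small share suggests the difference is hard to explain by chance, but it says nothing about *why* the difference exists. That is the study design question.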

---

## II. Data Wrangling (Transforming, ‘tidying’ data) 
======================================

#### Basic computing skills

#### Knowing how to detect problems in the data

#### Knowing how to shape data to answer your questions

#### Not trivial!  Takes time, skill and artistry
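
A small sketch of what "detecting problems" and "shaping data" can look like in practice. It assumes R's built-in `airquality` data as a stand-in for a messy real-world file, and it uses base R only; the same steps could be written with dplyr.

```r
# Assumption: the built-in airquality data stands in for a messy real-world file.

# Detect a problem: count the missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Shape the data to answer a question: what is the average ozone level by month?
# The formula interface of aggregate() drops rows with a missing Ozone value.
monthly_ozone <- aggregate(Ozone ~ Month, data = airquality, FUN = mean)
monthly_ozone
```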

---

## III. Data Visualization and Analysis
======================================

#### Basic Summaries and Visualizations

#### Association Rule Learning -- Market Basket Analysis

#### Correlation, Regression and Predictive Modeling

#### Bootstrap Confidence Intervals

#### Classification and Clustering methods
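
As a rough illustration of the bootstrap confidence intervals listed above, the sketch below resamples a small data set with replacement many times and reads a 95% interval off the resampled means. The `mtcars` data and the choice of 5000 resamples are assumptions made for the example.

```r
# Assumption: the built-in mtcars data is our sample; 5000 resamples is an arbitrary choice.
set.seed(87)  # make the resampling reproducible

mpg_values <- mtcars$mpg

# Resample the data with replacement and record the mean each time
boot_means <- replicate(5000, mean(sample(mpg_values, replace = TRUE)))

# The middle 95% of the resampled means is a percentile bootstrap interval
ci <- quantile(boot_means, c(0.025, 0.975))
ci
```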

---

background-position: center, bottom

background-size: 60%

### Classification Ex #1:  Consider the Not HotDog app

---

background-position: center, bottom

background-size: 70%

---

background-position: right, bottom

background-size: 50%

### Classification Ex #2:

### Create a Tree
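
One possible way to create a classification tree in R is sketched below. It assumes the `rpart` package (one of the recommended packages that ships with standard R installations) and the built-in `iris` data; neither is named in these slides, so treat both as illustrative choices.

```r
# Assumptions: rpart is installed; iris stands in for a real classification problem.
library(rpart)

# Grow a tree that predicts the species from the four flower measurements
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)  # shows the yes/no questions the tree asks at each split

# Check how often the tree classifies the flowers it was trained on correctly
predicted <- predict(fit, newdata = iris, type = "class")
accuracy <- mean(predicted == iris$Species)
accuracy
```

Accuracy measured on the training data is optimistic. A fairer check holds some data back for testing.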

---
class: inverse

### Classification Ex #3:

### [Political Leaning](https://www.nytimes.com/interactive/2019/08/08/opinion/sunday/party-polarization-quiz.html)

---

## IV.	Communication
======================================

#### Teamwork for better quality (e.g., ["Pair Programming"](https://en.wikipedia.org/wiki/Pair_programming) in Agile software development)

#### Good Data Visualizations, accurate and appropriate for the audience (ggplot2)

#### Use Statistical Thinking to interpret results

#### Presentation of results, accurate and appropriate for the audience, using code that is readable by others  (R markdown, R presentation)

---

background-position: center, bottom

background-size: 60%

### Note: Statistical Thinking is not always obvious, e.g., WWII planes:

---

background-position: right, bottom

background-size: 45%

### In this course, we'll use:
======================================

#### Blackboard page for resources: [bb.uvm.edu](https://bb.uvm.edu)

#### Microsoft Teams for meetings

#### Reading materials:  Free, online

#### Software:  R and RStudio

- RStudio, the IDE, is our "driver's seat"
- R is the engine
- R packages in the 'tidyverse', such as dplyr and ggplot2
- R Markdown for presenting results

---
background-position: center, bottom

background-size: 40%

### Warning!

### At times, there will be problems with R, RStudio, and R Markdown. That's just how it is with coding and computers. But we'll figure it out!

---
### For example:
======================================

### Highway mileage for different types of cars:

```r
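library(tidyverse)  # loads dplyr and ggplot2 (ggplot2 provides the mpg data)
# Mean highway mileage (hwy, in miles per gallon) for each class of car
m <- mpg %>%
  group_by(class) %>%
  summarise(mileage = mean(hwy))
# Display the first six classes as an HTML table
knitr::kable(head(m), format = 'html')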
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> class </th>
   <th style="text-align:right;"> mileage </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 2seater </td>
   <td style="text-align:right;"> 24.80000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> compact </td>
   <td style="text-align:right;"> 28.29787 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> midsize </td>
   <td style="text-align:right;"> 27.29268 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> minivan </td>
   <td style="text-align:right;"> 22.36364 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> pickup </td>
   <td style="text-align:right;"> 16.87879 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> subcompact </td>
   <td style="text-align:right;"> 28.14286 </td>
  </tr>
</tbody>
</table>

---

```r
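# One density curve of highway mileage per class, each class in its own row
ggplot(data = mpg,
       mapping = aes(x = hwy, fill = class)) +
  geom_density() +
  facet_grid(class ~ .) +
  theme(legend.title = element_text(size = 18),
        legend.text = element_text(size = 16),
        strip.text.y = element_blank())  # hide row labels; the legend names each class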
```