class: center, middle, inverse, title-slide

# Stat/CS 087
## ❆ Fall 2021
### Sheila Weaver
### University of Vermont
### 31 August 2021 (updated: 2021-08-12)

---
class: inverse, center, middle

# Get Started

---
class: inverse, center, middle

# First, what is Big Data?

---
class: inverse, center, middle

# It's a matter of opinion...

---
class: inverse, center, middle

# In any case, it has presented a big jump in information assets:

---
class: inverse, center, middle
background-position: center, bottom
background-size: 85%
background-image: url("S-Curves2.png")

---
class: inverse

## Big Data is the latest ‘information explosion.’

--
class: inverse

## The printing press was probably the first major one.

--
class: inverse

## It took 300 years for the world to 'settle down' after its invention.

--

## We're still settling down with big data...

---
class: inverse

# What is Data Science?

---
background-position: center, bottom
background-size: 60%
background-image: url("Data_scientist_Venn_diagram.png")

Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Data_scientist_Venn_diagram.png)

---
class: inverse, center, middle

# Random Question:

---

## Who is Antonie van Leeuwenhoek?

--

### The Father of Microbiology (1632 - 1723)

--

### Improved the microscope in order to bring *small* things into focus.

--

### The goal of Data Science is to bring *large* things into focus.

--

> *"The greatest value of a picture is when it forces us to notice what we never expected to see."* ---John Tukey

--

### Let's look at an example from the NYTimes: NY Student Test Scores from the Regents Exam

--

### First, remember this:

---
background-position: center, bottom
background-size: 70%
background-image: url("normalcurve.png")
class: inverse

### We know that standardized test scores usually have a bell-shaped distribution:

---
class: inverse
background-position: center, bottom
background-size: 95%
background-image: url("Regents2.gif")

## What's wrong with this?

---

## Stat/CS 087: Introduction to Data Science

--

## Main Topics (not in this exact order)
======================================

--

### I. Framing Real World Problems as Data Questions

--

### II. Data Wrangling (Organizing, ‘tidying’, ‘munging’ data)

--

### III. Data Visualization and Analysis

--

### IV. Communication

---

## I. Framing Real World Problems as Data Questions
======================================

#### Is the effect real? (signal/noise, chance)

--

#### What is causing the effect? (study design)

--

#### How do I predict a variable of interest? (regression, classification)

--

#### How do I identify similar subgroups? (clustering, 'market segmentation')

--

#### Are there ethical concerns?

---

## II. Data Wrangling (Transforming, ‘tidying’ data)
======================================

--

#### Basic computing skills

--

#### Knowing how to detect problems in the data

--

#### Knowing how to shape data to answer your questions

--

#### Not trivial! Takes time, skill and artistry

---

## III. Data Visualization and Analysis
======================================

#### Basic Summaries and Visualizations

--

#### Association Rule Learning -- Market Basket Analysis

--

#### Correlation, Regression and Predictive Modeling

--

#### Bootstrap Confidence Intervals

--

#### Classification and Clustering methods

---
background-position: center, bottom
background-size: 60%
background-image: url("NotHotdogapp.png")
class: inverse

### Classification Ex #1: Consider the Not HotDog app

---
background-position: center, bottom
background-size: 70%
background-image: url("GracesHotdogPic.png")
class: inverse

---
background-position: right, bottom
background-size: 50%
background-image: url("tree.jpg")
class: inverse

### Classification Ex #2:
### Create a Tree

---
class: inverse

### Classification Ex #3:
### [Political Leaning](https://www.nytimes.com/interactive/2019/08/08/opinion/sunday/party-polarization-quiz.html)

---

## IV. Communication
======================================

#### Teamwork for better quality (e.g., ["Pair Programming"](https://en.wikipedia.org/wiki/Pair_programming) in Agile software development)

--

#### Good Data Visualizations, accurate and appropriate for the audience (ggplot2)

--

#### Use Statistical Thinking to interpret results

--

#### Presentation of results, accurate and appropriate for the audience, using code that is readable by others (R Markdown, R Presentation)

---
background-position: center, bottom
background-size: 60%
background-image: url("WWIIplanes.png")
class: inverse

### Note: Statistical Thinking is not always obvious, e.g., WWII planes:

---
background-position: right, bottom
background-size: 45%
background-image: url("tidyverse.png")

### In this course, we'll use:
======================================

#### Blackboard page for resources: [bb.uvm.edu](https://bb.uvm.edu)

--

#### Microsoft Teams for meetings

--

#### Reading materials: Free, online

#### Software: R and RStudio
- RStudio, the IDE, is our "driver's seat,"
- R is the engine,
- R packages in the 'tidyverse' -- dplyr and ggplot2
- R Markdown for presenting results

---
background-position: center, bottom
background-size: 40%
background-image: url("warning.jpg")
class: inverse

### Warning!

### At times, there will be problems with R, RStudio, and R Markdown. That's just how it is with coding and computers. But we'll figure it out!

---

### For example:
======================================

### Highway mileage for different types of cars:

```r
library(tidyverse)

m <- mpg %>%
  group_by(class) %>%
  summarise(mileage = mean(hwy))

knitr::kable(head(m), format = 'html')
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> class </th>
   <th style="text-align:right;"> mileage </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 2seater </td>
   <td style="text-align:right;"> 24.80000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> compact </td>
   <td style="text-align:right;"> 28.29787 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> midsize </td>
   <td style="text-align:right;"> 27.29268 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> minivan </td>
   <td style="text-align:right;"> 22.36364 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> pickup </td>
   <td style="text-align:right;"> 16.87879 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> subcompact </td>
   <td style="text-align:right;"> 28.14286 </td>
  </tr>
</tbody>
</table>

---

```r
ggplot(data = mpg, mapping = aes(x = hwy, fill = class)) +
  geom_density() +
  facet_grid(class ~ .) +
  theme(legend.title = element_text(size = 18),
        legend.text = element_text(size = 16),
        strip.text.y = element_blank())
```

<img src="Day1_files/figure-html/density-1.png" style="display: block; margin: auto;" />