Week 1 Introduction

09/11/2017

Course Syllabus (will discuss later)

Why teach this course?

Simple Answer: Data Science is Sexy!

Background

Jonathan Goldman (PhD, Physics, Stanford) arrived at LinkedIn in 2006
Started using data analytics to explore people’s connections
- Began forming theories, testing hunches, and finding patterns to predict whose networks a given profile would land in.
Goldman is a good example of a new key player in organizations: the “data scientist.”
- A high-ranking professional with the training and curiosity to make discoveries in the world of big data.

High Demand, Low Supply

“The sexiest job in the next 10 years will be statisticians,” said Hal Varian, chief economist at Google.
Glassdoor named data scientist the "best job of the year" for 2016.
- "It's one of the hottest and fastest growing jobs we're seeing right now," said Andrew Chamberlain, Glassdoor's chief economist.
McKinsey Global Institute:
- The “United States alone faces a shortage of 140,000 to 190,000 people [by 2018] with deep analytical skills.”

Dangers

Course Goals/Objectives

Primary Objectives

The emphasis will be on statistical/economic reasoning and basic statistical concepts
Students completing the course will be able to
1. critically read economic research reports that use basic methods
2. use basic statistical methods in their own work
3. pursue further coursework in statistics/econometrics

Secondary Objectives

Learn data visualization and analysis techniques using R statistical software.
- graphs, charts, maps, dynamic figures, networks, etc.
Work with multiple datasets and familiarize yourself with applied economic analysis topics
Learn to write professional reports and reproducible research

Some of the data sets that we will use

Consumer expenditures
Individual Income
U.S. Population
World Bank API - World Dev. Indicators
Quandl Financial data API
and multiple other data sets

What you will learn to do (Expected Outputs)

Create Tables: Ex. U.S. Unemployment

Convert Tables to Figures

Create Maps: U.S. County Population

Create Conditional Maps: ex. SF Crime

Create Interactive Maps

Create Networks: Ex. Media Mentions

Today's Agenda

Outline

Course overview
Introduction to R, R Studio and R Markdown
Programming basics
Coding style

Course Overview

How this class will work

No programming knowledge presumed
Some stats knowledge presumed. E.g.:
- Hypothesis testing (t-tests, confidence intervals)
- Linear regression
Class attendance is mandatory
Class will be very cumulative

Course Structure

Two 50-minute lectures a week:
- First 50 minutes: concepts, methods, examples
- Last 50 minutes: short labs
Labs:
- Students will get hands-on practice with the day's material by completing assigned lab activities.
- Tasks may include but are not limited to: running or modifying code from the lecture or completing short coding exercises.

Course Outline

The course is broken down into four main parts, followed by student presentations:
1. Getting to know R (Weeks 1-4)
  - Importing data, basic programming and data manipulation tools
2. Data Visualization (Weeks 5-7)
  - Plotting, ggplot, interactive plots, mapping
3. Statistical Inference and Modeling (Weeks 8-10)
  - Hypothesis testing, linear regression, programming for statistics/economics
4. Network Analysis (Weeks 11-14)
  - Creating and importing network data, topological features of networks, basic statistical analysis
5. Student Presentations (Weeks 15-16)

Grading - Class participation (10%)

Participation grade is based on in-class attendance and participation in lab
There will be approximately 10 labs during the semester
Participation points are calculated as follows:

# of labs participated in	0-2	3-4	5-6	7-8	9-10
Points	0	1	4	7	10
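
As a rough illustration of the table above, the lookup can be sketched in R with `findInterval()`. The function name `participation.points` is made up for this sketch and is not the official grading code.

```r
# Hypothetical helper: map the number of labs attended (0-10) to points.
participation.points <- function(num.labs) {
  lower.bounds <- c(0, 3, 5, 7, 9)  # lower end of each bracket in the table
  points <- c(0, 1, 4, 7, 10)       # points awarded for each bracket
  points[findInterval(num.labs, lower.bounds)]
}

participation.points(c(2, 4, 6, 8, 10))  # 0 1 4 7 10
```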

Grading Policy – Homework (30%)

There will be 3 HW assignments
HW assigned Monday is due on the following Sunday by 11:59pm
Late homework will not be accepted for credit
Calculation of homework grade will be discussed at a later date.

Grading Policy – Quizzes (10%)

2 quizzes in the second half of term.
Will be based on class labs.
Dates and times will be announced in advance.
Purpose is to assess your understanding of various concepts that are central to the class.

Grading Policy – Final project (50%)

Write a data-led report that analyzes a policy question.
Complete research experience; Students will be expected to:
1. explore the data to identify important variables;
2. perform statistical analyses to address the policy question;
3. produce tabular and graphical summaries to support findings;
4. write a report describing their methodology and findings
Work in small groups to decide on appropriate statistical methodology and graphical/tabular summaries;
BUT each student will be required to produce and submit their own code and final report.

Grading Distribution

Activity	Grade Contribution
Participation	10%
Assignment/Homework	30%
Quizzes	10%
Research Report/Presentation	50%

Course resources

Office hours by appointment
Syllabus, assignments, class notes, and grading policies posted on class email.
Use course email for gradebook and for turning in homework
Wechat for class forum
- Please post class/homework related questions on Wechat instead of emailing the teaching staff

Required Readings

No required textbook, but several are highly recommended:
Garrett Grolemund and Hadley Wickham, R for Data Science
Phil Spector, Data Manipulation with R
Paul Teetor, R Cookbook
Winston Chang, R Graphics Cookbook
Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design

What you will learn to do in this class

This class will teach you to use R to:

Generate graphical and tabular data summaries
Perform statistical analyses (e.g., hypothesis testing, regression modeling)
Produce reproducible statistical reports using R Markdown
Integrate R with other tools (e.g., databases, web, etc.)

Why R?

Free (open-source)
Programming language (not point-and-click)
Excellent graphics
Offers broadest range of statistical tools
Easy to generate reproducible reports
Easy to integrate with other tools

Introduction to R and RStudio

The R Console

Basic interaction with R is through typing in the console
This is the terminal or command-line interface
Download from the homepage: https://www.r-project.org/

RStudio (an IDE for R)

RStudio has 4 main windows ('panes'):
(1) Source; (2) Console; (3) Environment; (4) Plots

Console pane

Use the Console pane to type or paste commands to get output from R
Use ? to look up a help file (e.g., try typing in ?mean)

Source pane

Use the Source pane to create and edit .R and .Rmd files
The menu bar of this pane contains handy shortcuts for sending code to the Console for evaluation

Plots, etc. pane

All figures will be displayed in the Plots tab
Can Zoom, Export, and Navigate back to older plots

Environment pane

By default, you will see the global environment, listing all datasets and variables currently in your RStudio session.
You can also use the `Import Dataset' button to read data files from other formats (e.g. Stata, Excel, etc.)

RStudio: Panes overview

Source pane: create a file that you can save and run later
Console pane: type or paste in commands to get output from R
Plots, etc. pane: see plots, help pages, and other items in this window.
Environment pane: see a list of variables or previous commands

Download from the homepage: http://www.rstudio.com/

R Libraries

R libraries (or packages) are bundles of code (functions) that can be downloaded to carry out additional statistical analyses not available in the base R software.
There are thousands of libraries, each developed to make research more efficient and to save us from having to write our own functions.

Download R library

First you have to install the library you want (Only do this once!)

install.packages("NAME", dependencies = TRUE) 
# dependencies = TRUE tells R to also install the packages that NAME relies on

- An alternative way to install packages is to go to the `Plots, etc.' pane:
    - click on `Packages',
    - then click on `Install',
    - then type in the name of the package you want.

Next, repeat the following step every time you open RStudio to load the package into R

library(NAME)
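
The two steps above can be combined so that a package is installed only when it is missing. This is a minimal sketch: the helper name `load.or.install` is hypothetical, and "stats" is used only because it ships with R, so nothing is downloaded when the example runs.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load.or.install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  library(pkg, character.only = TRUE)  # pkg holds the name as a string
}

load.or.install("stats")  # already installed, so it is simply loaded
```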

Introduction to R Markdown

R Markdown

R Markdown allows the user to integrate R code into a report
When data changes or code changes, so does the report
No more need to copy-and-paste graphics, tables, or numbers
Creates reproducible reports
- Anyone who has your R Markdown (.Rmd) file and input data can re-run your analysis and get the exact same results (tables, figures, summaries)
Can output report in HTML (default), Microsoft Word, or PDF

R Markdown

This example shows an R Markdown (.Rmd) file in RStudio.
To turn an Rmd file into a report, click the Knit HTML button
The results will appear in a Preview window

R Markdown

Use R code `chunks' to include R code and its output in a report
All of the code that appears in between the "triple back-ticks" is run when the report is knit
You can knit into html (default), MS Word, and pdf format
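
To make the layout of an .Rmd file concrete, the sketch below writes a very small one to a temporary folder. The file name, title, and text are made up for illustration, and the back-ticks are built with `strrep()` only so that the example can be pasted anywhere without confusing a Markdown parser.

```r
# Three back-ticks, built programmatically (see the note above).
fence <- strrep("`", 3)

rmd.lines <- c(
  "---",
  "title: \"My first report\"",
  "output: html_document",
  "---",
  "",
  "The average of 1, 2 and 3 is computed below.",
  "",
  paste0(fence, "{r}"),   # start of an R code chunk
  "mean(c(1, 2, 3))",     # code that runs when the report is knit
  fence                   # end of the chunk
)

rmd.path <- file.path(tempdir(), "first-report.Rmd")
writeLines(rmd.lines, rmd.path)

# To knit outside RStudio (assumes the rmarkdown package is installed):
# rmarkdown::render(rmd.path)
```

Opening the saved file in RStudio and clicking the Knit button should give the same result as the commented-out `render()` call.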

Programming basics

Data building blocks

You'll encounter different kinds of data types

Booleans: direct binary values, TRUE or FALSE in R
Integers: whole numbers (positive, negative or zero)
Characters: fixed-length blocks of bits, with special coding; strings are sequences of characters
Floating point numbers: a fraction (with a finite number of bits) times an exponent, like \(1.87 \times {10}^{6}\)
Missing or ill-defined values: NA, NaN, etc.
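
A quick way to see these types in practice is to ask R for the class of a few values. This is a small sketch using base R functions only.

```r
class(TRUE)      # "logical"   (Boolean)
class(7L)        # "integer"   (the L suffix asks for an integer)
class(7)         # "numeric"   (floating point, the default for numbers)
class(1.87e6)    # "numeric"   (1.87 times 10^6)
class("seven")   # "character"
is.na(NA)        # TRUE: NA marks a missing value
is.nan(0 / 0)    # TRUE: 0 / 0 is "not a number" (NaN)
```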

Operators (functions)

Command	Description
`+, -, *, /`	add, subtract, multiply, divide
`^`	raise to the power of
`%%`	remainder after division (ex: `8 %% 3 = 2`)
`( )`	change the order of operations
`log(), exp()`	logarithms and exponents (ex: `log(10) = 2.302`)
`sqrt()`	square root
`round()`	round to the nearest whole number (ex: `round(2.3) = 2`)

7 + 5 # Addition

## [1] 12

7 - 5 # Subtraction

## [1] 2

7 * 5 # Multiplication

## [1] 35

7 ^ 5 # Exponentiation

## [1] 16807

7 / 5 # Division

## [1] 1.4

7 %% 5 # Modulus

## [1] 2
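
The remaining rows of the operator table (parentheses, logarithms, square roots, rounding) are not covered by the examples above. A short sketch:

```r
(7 + 5) * 2      # 24: parentheses force the addition first
7 + 5 * 2        # 17: multiplication happens before addition
log(10)          # 2.302585 (natural logarithm)
exp(1)           # 2.718282
sqrt(16)         # 4
round(2.3)       # 2
round(2.567, 2)  # 2.57 (second argument = number of decimal places)
```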

Operators cont'd.

Comparisons are also binary operators; they take two objects, like numbers, and give a Boolean

7 > 5

## [1] TRUE

7 < 5

## [1] FALSE

7 >= 7

## [1] TRUE

7 <= 5

## [1] FALSE

7 == 5

## [1] FALSE

7 != 5

## [1] TRUE

Boolean operators

Basically "and" and "or":

(5 > 7) & (6*7 == 42)

## [1] FALSE

(5 > 7) | (6*7 == 42)

## [1] TRUE
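
Besides "and" and "or", base R also has "not" (`!`) and "exclusive or" (`xor()`). A brief sketch using the same comparisons:

```r
!(5 > 7)                  # TRUE: ! flips TRUE and FALSE
!(5 > 7) & (6 * 7 == 42)  # TRUE: both sides are now TRUE
xor(5 > 7, 6 * 7 == 42)   # TRUE: exactly one side is TRUE
```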

Variables

We can give names to data objects; these give us variables

A few variables are built in:

pi

## [1] 3.141593

Variables can be arguments to functions or operators, just like constants:

pi*10

## [1] 31.41593

Assignment operator

Most variables are created with the assignment operator, <- or =

time.factor <- 12
time.factor

## [1] 12

time.in.years = 2.5
time.in.years * time.factor

## [1] 30

The assignment operator also changes values:

time.in.months <- time.in.years * time.factor
time.in.months

## [1] 30

time.in.months <- 45
time.in.months

## [1] 45

Using names and variables makes code:
- easier to design,
- easier to debug
- and easier for others to read
Use descriptive variable names
- Good: num.students <- 35
- Bad: ns <- 35

The workspace

What names have you defined values for?

ls()

##  [1] "bbox"           "c_opts"         "colrs"          "deg"           
##  [5] "df"             "g"              "gg"             "l"             
##  [9] "links"          "map"            "net"            "nodes"         
## [13] "plot"           "population"     "presidential"   "sfData2"       
## [17] "sfg"            "SFOcrime"       "time.factor"    "time.in.months"
## [21] "time.in.years"  "unemp"          "UnempPres"      "us"            
## [25] "xrng"           "yrng"

Getting rid of variables:

rm("time.in.months")
ls()

##  [1] "bbox"          "c_opts"        "colrs"         "deg"          
##  [5] "df"            "g"             "gg"            "l"            
##  [9] "links"         "map"           "net"           "nodes"        
## [13] "plot"          "population"    "presidential"  "sfData2"      
## [17] "sfg"           "SFOcrime"      "time.factor"   "time.in.years"
## [21] "unemp"         "UnempPres"     "us"            "xrng"         
## [25] "yrng"
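
To check for a single name without printing the whole workspace, `exists()` can be used. In this sketch `demo.value` is a throwaway variable invented for the example.

```r
demo.value <- 10      # hypothetical variable, used only for this sketch
exists("demo.value")  # TRUE
rm(demo.value)        # the quotes are optional in rm()
exists("demo.value")  # FALSE

# To clear the whole workspace at once (use with care):
# rm(list = ls())
```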

First data structure: vectors

Group related data values into one object, a data structure
A vector is a sequence of values, all of the same type
c() function returns a vector containing all its arguments

students <- c("Sean", "Louisa", "Frank", "Farhad", "Li")
midterm <- c(80, 90, 93, 82, 95)
final <- c(78, 84, 95, 82, 91) # Final exam scores

Typing the variable name at the prompt causes it to display

students

## [1] "Sean"   "Louisa" "Frank"  "Farhad" "Li"
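
Because all values in a vector must share one type, R quietly converts mixed inputs to a common type. A small sketch of what that looks like (the vector `mixed` is made up for illustration):

```r
mixed <- c(1, "two", TRUE)  # numbers and logicals are converted to text
mixed                       # "1" "two" "TRUE"
class(mixed)                # "character"

class(c(1, TRUE))           # "numeric": TRUE becomes 1
1:5                         # the whole numbers 1 through 5
```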

Indexing

vec[1] is the first element, vec[4] is the 4th element of vec

students

## [1] "Sean"   "Louisa" "Frank"  "Farhad" "Li"

students[4]

## [1] "Farhad"

vec[-4] is a vector containing all but the fourth element

students[-4]

## [1] "Sean"   "Louisa" "Frank"  "Li"
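
Indexing also works with ranges, and it can be used on the left of `<-` to change a single element. The sketch below reuses the midterm scores; the "regrade" is a made-up scenario.

```r
midterm <- c(80, 90, 93, 82, 95)  # same midterm scores as above
length(midterm)                   # 5

midterm[2:4]                      # 90 93 82
midterm[6]                        # NA: there is no sixth element

midterm.regraded <- midterm       # work on a copy
midterm.regraded[2] <- 92         # hypothetical regrade of the second score
midterm.regraded                  # 80 92 93 82 95
```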

Vector arithmetic

midterm # Midterm exam scores

## [1] 80 90 93 82 95

midterm + final # Sum of midterm and final scores

## [1] 158 174 188 164 186

(midterm + final)/2 # Average exam score

## [1] 79 87 94 82 93

course.grades <- 0.4*midterm + 0.6*final # Final course grade
course.grades

## [1] 78.8 86.4 94.2 82.0 92.6

Is the final score higher than the midterm score?

midterm

## [1] 80 90 93 82 95

final

## [1] 78 84 95 82 91

final > midterm

## [1] FALSE FALSE  TRUE FALSE FALSE
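
A vector of Booleans like this one can be summarized directly, because R treats TRUE as 1 and FALSE as 0 in arithmetic. A short sketch (the name `improved` is chosen here for illustration):

```r
midterm <- c(80, 90, 93, 82, 95)
final <- c(78, 84, 95, 82, 91)

improved <- final > midterm
sum(improved)   # 1: number of students who improved
mean(improved)  # 0.2: share of students who improved
any(improved)   # TRUE: at least one student improved
all(improved)   # FALSE: not everyone improved
```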

Functions on vectors

Command	Description
`sum(vec)`	sums up all the elements of `vec`
`mean(vec)`	mean of `vec`
`median(vec)`	median of `vec`
`sd(vec), var(vec)`	the standard deviation and variance of `vec`
`length(vec)`	the number of elements in `vec`
`sort(vec)`	returns the `vec` in sorted order
`summary(vec)`	gives a five-number summary plus the mean

Functions on vectors

course.grades

## [1] 78.8 86.4 94.2 82.0 92.6

mean(course.grades) # mean grade

## [1] 86.8

median(course.grades)

## [1] 86.4

sd(course.grades) # grade standard deviation

## [1] 6.625708

More functions on vectors

sort(course.grades)

## [1] 78.8 82.0 86.4 92.6 94.2

max(course.grades) # highest course grade

## [1] 94.2

min(course.grades) # lowest course grade

## [1] 78.8
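
The functions from the table that were not shown above can be tried on the same grades. This sketch rebuilds `course.grades` so that it runs on its own.

```r
midterm <- c(80, 90, 93, 82, 95)
final <- c(78, 84, 95, 82, 91)
course.grades <- 0.4 * midterm + 0.6 * final

length(course.grades)                   # 5
var(course.grades)                      # 43.9
range(course.grades)                    # 78.8 94.2
sort(course.grades, decreasing = TRUE)  # 94.2 92.6 86.4 82.0 78.8
which.max(course.grades)                # 3: position of the highest grade
summary(course.grades)                  # min, quartiles, median, mean, max
```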

Referencing elements of vectors

students

## [1] "Sean"   "Louisa" "Frank"  "Farhad" "Li"

Vector of indices:

students[c(2,4)]

## [1] "Louisa" "Farhad"

Vector of negative indices

students[c(-1,-3)]

## [1] "Louisa" "Farhad" "Li"

More referencing: `which()` function

Return only A students

a.threshold <- 90 # A grade = 90% or higher
course.grades >= a.threshold # vector of booleans

## [1] FALSE FALSE  TRUE FALSE  TRUE

a.students <- which(course.grades >= a.threshold) # Applying which() 
a.students

## [1] 3 5

students[a.students] # Names of A students

## [1] "Frank" "Li"
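
The same result can be obtained without `which()`, because a vector of Booleans can be used directly as an index. A minimal sketch with the grades typed in by hand:

```r
students <- c("Sean", "Louisa", "Frank", "Farhad", "Li")
course.grades <- c(78.8, 86.4, 94.2, 82.0, 92.6)

students[course.grades >= 90]       # "Frank" "Li"
students[course.grades < 90]        # "Sean" "Louisa" "Farhad"
course.grades[course.grades >= 90]  # 94.2 92.6
```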

Named components

You can give names to elements or components of vectors

students

## [1] "Sean"   "Louisa" "Frank"  "Farhad" "Li"

names(course.grades) <- students # Assign names to the grades
names(course.grades)

## [1] "Sean"   "Louisa" "Frank"  "Farhad" "Li"

course.grades

##   Sean Louisa  Frank Farhad     Li 
##   78.8   86.4   94.2   82.0   92.6
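
Once a vector has names, elements can be looked up by name instead of by position. In this sketch the names are attached while the vector is created, which has the same effect as the `names()` assignment above.

```r
course.grades <- c(Sean = 78.8, Louisa = 86.4, Frank = 94.2,
                   Farhad = 82.0, Li = 92.6)

course.grades["Frank"]                     # Frank's grade: 94.2
course.grades[c("Sean", "Li")]             # grades for Sean and Li
names(course.grades)[course.grades >= 90]  # "Frank" "Li"
```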

R coding style

Coding style (and code commenting) will become increasingly important as we get into more advanced and involved programming tasks
A few R "style guides" exist:
- Google's
- Hadley Wickham's
Borrowing Hadley Wickham's words: "You don't have to use my style, but you really should use a consistent style."

R style recommendations

Hadley Wickham's guide is short and easy to follow
We'll revisit the question of coding style several times over the course of the class

Enforced style: Assignment operator

Assignment operator. USE <-

student.names <- c("Eric", "Hao", "Jennifer")  # Good
student.names = c("Eric", "Hao", "Jennifer") # Bad

Note: When specifying function arguments, only = is valid

sort(tv.hours, decreasing=TRUE) # Good
sort(tv.hours, decreasing<-TRUE) # Bad!!
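
Why is the second form bad? It still runs, but it quietly creates a variable in your workspace as a side effect. The sketch below uses made-up values for `tv.hours`, and `stray.created` is a name invented here to record what happened.

```r
tv.hours <- c(2, 5, 1)               # made-up values for illustration

sort(tv.hours, decreasing = TRUE)    # 5 2 1
exists("decreasing")                 # FALSE: nothing was created

sort(tv.hours, decreasing <- TRUE)   # 5 2 1, but only by accident
stray.created <- exists("decreasing")
stray.created                        # TRUE: a stray variable now exists
rm(decreasing)                       # clean up
```

The result happens to be correct only because `decreasing` is the second argument of `sort()`, so the value TRUE is matched to it by position.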

Enforced style: Spacing

Binary operators should have spaces around them
Commas should have a space after, but not before (just like in writing)

3 * 4 # Good
3*4 # Bad
which(student.names == "Eric") # Good
which(student.names=="Eric") # Bad

For specifying arguments, spacing around = is optional

sort(tv.hours, decreasing=TRUE) # Accepted
sort(tv.hours, decreasing = FALSE) # Accepted

Enforced style: Variable names

To make code easy to read, debug, and maintain, you should use concise but descriptive variable names
Terms in variable names should be separated by _ or .

# Accepted
day_one   day.one   day_1   day.1   day1

# Bad
d1   DayOne   dayone   

# Can be made more concise:
first.day.of.the.month

Avoid using variable names that are already pre-defined in R

# EXTREMELY bad:
c   T   pi   sum   mean
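
To see why reusing a built-in name is risky, consider what happens to `pi`. This is a small sketch; removing your own copy with `rm()` makes the built-in value visible again.

```r
pi * 10  # 31.41593
pi <- 3  # masks the built-in constant in your workspace
pi * 10  # 30: every later calculation is now silently wrong
rm(pi)   # remove your copy
pi * 10  # 31.41593 again
```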

A common data problem

When data is entered manually, misspellings and case changes are very common
E.g., a column showing life support mechanism may look like,

Health <- c("dialysis", "Dialysis", "dialysis", "none", "None", "nnone")
summary(Health)

##    Length     Class      Mode 
##         6 character character

This character vector has 5 distinct values even though it should have only 2 (dialysis, none)
We can fix many of the typos by running spellcheck in Excel before importing data, or by changing the values on a case-by-case basis later
There's a faster way to fix just the capitalization issue (this is an exercise for Homework 1)
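
Since `summary()` says little about a character vector, it can help to list the distinct spellings directly. A minimal sketch using base R only (it inspects the problem and does not fix it):

```r
Health <- c("dialysis", "Dialysis", "dialysis", "none", "None", "nnone")

unique(Health)          # the distinct spellings, in order of first appearance
length(unique(Health))  # 5
table(Health)           # how often each spelling occurs
```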

Assignment

Download R: https://mirrors.tuna.tsinghua.edu.cn/CRAN/
- Can also download at: https://cran.cnr.berkeley.edu/
Next, download RStudio: http://www.rstudio.com/
