09/11/2017

Course Syllabus (will discuss later)

Why teach this course?

Simple Answer: Data Science is Sexy!

Background

  • Jonathan Goldman (PhD, Physics, Stanford) arrived at LinkedIn in 2006
  • Started using data analytics to explore people’s connections
    • Began forming theories, testing hunches, and finding patterns to predict whose networks a given profile would land in.
  • Goldman is a good example of a new key player in organizations: the “data scientist.”
    • A high-ranking professional with the training and curiosity to make discoveries in the world of big data.

High Demand, Low Supply

  • “The sexiest job in the next 10 years will be statisticians,” Hal Varian, chief economist at Google.

  • Glassdoor named it the "best job of the year" for 2016.
    • "It's one of the hottest and fastest growing jobs we're seeing right now," Andrew Chamberlain, Glassdoor's chief economist
  • McKinsey Global Institute:
    • The “United States alone faces a shortage of 140,000 to 190,000 people [by 2018] with deep analytical skills”

Dangers

Course Goals/Objectives

Primary Objectives

  • The emphasis will be on statistical/economic reasoning and basic statistical concepts
  • Students completing the course will be able to
    1. critically read economic research reports that use basic methods
    2. use basic statistical methods in their own work
    3. pursue further coursework in statistics/econometrics

Secondary Objectives

  1. Learn data visualization and analysis techniques using R statistical software.
    • graphs, charts, maps, dynamic figures, networks, etc.
  2. Work with multiple datasets and familiarize yourself with applied economic analysis topics

  3. Learn to write professional reports and reproducible research

Some of the data sets that we will use

  1. Consumer expenditures
  2. Individual Income
  3. U.S. Population
  4. World Bank API - World Dev. Indicators
  5. Quandl Financial data API
  6. and multiple other data sets

What you will learn to do (Expected Outputs)

Create Tables: Ex. U.S. Unemployment

Convert Tables to Figures

Create Maps: U.S. County Population

Create Conditional Maps: ex. SF Crime

Create Interactive Maps

Create Networks: Ex. Media Mentions

Today's Agenda

Outline

  • Course overview

  • Introduction to R, RStudio and R Markdown

  • Programming basics

  • Coding style

Course Overview

How this class will work

  • No programming knowledge presumed

  • Some stats knowledge presumed. E.g.:
    • Hypothesis testing (t-tests, confidence intervals)
    • Linear regression
  • Class attendance is mandatory

  • Class will be very cumulative

Course Structure

  • Two 50-minute lectures a week:

    • First 50 minutes: concepts, methods, examples
    • Last 50 minutes: short labs
  • Labs:
    • Students will get hands-on practice with the day's material by completing assigned lab activities.
    • Tasks may include but are not limited to: running or modifying code from the lecture or completing short coding exercises.

Course Outline

  • The course is broken down into four main parts, followed by student presentations:
    1. Getting to know R (Weeks 1-4)
      • Importing data, basic programming and data manipulation tools
    2. Data Visualization (Weeks 5-7)
      • Plotting, ggplot, interactive plots, mapping
    3. Statistical Inference and Modeling (Weeks 8-10)
      • Hypothesis testing, linear regression, programming for statistics/economics
    4. Network Analysis (Weeks 11-14)
      • Creating and importing network data, topological features of networks, basic statistical analysis
    5. Student Presentations (Weeks 15-16)

Grading - Class participation (10%)

  • Participation grade is based on in-class attendance and participation in lab

  • There will be approximately 10 labs during the semester

  • Participation points are calculated as follows:

# of labs participated in    0-2    3-4    5-6    7-8    9-10
Points                       0      1      4      7      10

Grading Policy – Homework (30%)

  • There will be 3 HW assignments
  • HW assigned Monday is due on the following Sunday by 11:59pm
  • Late homework will not be accepted for credit
  • Calculation of homework grade will be discussed at a later date.

Grading Policy – Quizzes (10%)

  • 2 quizzes in the second half of term.
  • Will be based on class labs.
  • Dates and times will be announced in advance.
  • Purpose is to assess your understanding of various concepts that are central to the class.

Grading Policy – Final project (50%)

  • Write a data-led report that analyzes a policy question.
  • Complete research experience; Students will be expected to:
    1. explore the data to identify important variables;
    2. perform statistical analyses to address the policy question;
    3. produce tabular and graphical summaries to support findings;
    4. write a report describing their methodology and findings
  • Work in small groups to decide on appropriate statistical methodology and graphical/tabular summaries;
  • BUT each student will be required to produce and submit their own code and final report.

Grading Distribution

Activity                        Grade Contribution
Participation                   10%
Assignment/Homework             30%
Quizzes                         10%
Research Report/Presentation    50%

Course resources

  • Office hours by appointment

  • Syllabus, assignments, class notes, and grading policies posted on class email.

  • Use course email for gradebook and for turning in homework

  • WeChat for class forum
    • Please post class/homework-related questions on WeChat instead of emailing the teaching staff

Required Readings

  • No required textbook, but several are highly recommended:
  • Garrett Grolemund and Hadley Wickham, R for Data Science
  • Phil Spector, Data Manipulation with R
  • Paul Teetor, R Cookbook
  • Winston Chang, R Graphics Cookbook
  • Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design

What you will learn to do in this class

This class will teach you to use R to:

  • Generate graphical and tabular data summaries
  • Perform statistical analyses (e.g., hypothesis testing, regression modeling)
  • Produce reproducible statistical reports using R Markdown
  • Integrate R with other tools (e.g., databases, web, etc.)

Why R?

  • Free (open-source)
  • Programming language (not point-and-click)
  • Excellent graphics
  • Offers broadest range of statistical tools
  • Easy to generate reproducible reports
  • Easy to integrate with other tools

Introduction to R and RStudio

The R Console

  • Basic interaction with R is through typing in the console
  • This is the terminal or command-line interface
  • Download from the homepage: https://www.r-project.org/

RStudio (an IDE for R)

  • RStudio has 4 main windows ('panes'):

    (1) Source; (2) Console; (3) Environment; (4) Plots

Console pane

  • Use the Console pane to type or paste commands to get output from R

  • Use ? to look up a help file (e.g., try typing ?mean)

Source pane

  • Use the Source pane to create and edit .R and .Rmd files
  • The menu bar of this pane contains handy shortcuts for sending code to the Console for evaluation

Plots, etc. pane

  • All figures will be displayed in the Plots tab
  • Can Zoom, Export, and Navigate back to older plots

Environment pane

  • By default, you will see the global environment, listing all datasets and variables currently in your RStudio session.

  • You can also use the 'Import Dataset' button to read data files from other formats (e.g., Stata, Excel, etc.)

RStudio: Panes overview

  1. Source pane: create a file that you can save and run later

  2. Console pane: type or paste in commands to get output from R

  3. Plots, etc. pane: see plots, help pages, and other items in this window.

  4. Environment pane: see a list of variables or previous commands

R Libraries

  • R libraries (or packages) are bundles of code (functions) that can be downloaded to carry out additional statistical analyses not available in the base R software.
  • There are thousands of libraries each developed to help do research more efficiently and prevent us from having to write our own functions.

Download R library

  • First you have to install the library you want (only do this once!)
install.packages("NAME", dependencies = TRUE)
# dependencies = TRUE tells R to also install the packages that NAME depends on
  • An alternative way to install packages is to go to the 'Plots, etc.' pane:
    • click on 'Packages',
    • then click on 'Install',
    • then type in the name of the package you want.
  • Next, repeat the following step every time you open RStudio to load the package into your R session
library(NAME)
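
A common pattern is to install a package only when it is missing and then load it. The sketch below uses "stats" as a stand-in for NAME, because it ships with every R installation and so the example runs without a network connection; substitute the package you actually need.

```r
pkg <- "stats"  # stand-in for NAME; "stats" ships with every R installation

# Install only if the package is not already available on this machine
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg, dependencies = TRUE)
}

# Load the package; character.only = TRUE lets library() read the name from a variable
library(pkg, character.only = TRUE)

# Confirm that the package is now attached to the session
pkg.loaded <- paste0("package:", pkg) %in% search()
pkg.loaded
```

Guarding the install step this way means the same script can be re-run every session without downloading the package again.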

Introduction to R Markdown

R Markdown

  • R Markdown allows the user to integrate R code into a report

  • When data changes or code changes, so does the report

  • No more need to copy-and-paste graphics, tables, or numbers

  • Creates reproducible reports
    • Anyone who has your R Markdown (.Rmd) file and input data can re-run your analysis and get the exact same results (tables, figures, summaries)
  • Can output report in HTML (default), Microsoft Word, or PDF

R Markdown

  • This example shows an R Markdown (.Rmd) file in RStudio.
  • To turn an Rmd file into a report, click the Knit HTML button
  • The results will appear in a Preview window

R Markdown

  • Use R code 'chunks' to embed R code in a report
  • All of the code that appears between the "triple back-ticks" is run when the report is knit, and its output is included in the report
  • You can knit into HTML (default), MS Word, and PDF format
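
To make the chunk idea concrete, here is a minimal sketch that writes a tiny .Rmd file from R and prints it. The file name, title, and report text are made up for illustration. The three back-ticks are built with strrep() only so that they can appear inside this example; in a real .Rmd file you type them directly.

```r
# Three back-ticks open and close a chunk
fence <- strrep("`", 3)

rmd.lines <- c(
  "---",
  "title: \"Exam scores\"",
  "output: html_document",
  "---",
  "",
  "The chunk below is run when the report is knit.",
  "",
  paste0(fence, "{r}"),                 # chunk opens
  "midterm <- c(80, 90, 93, 82, 95)",
  "mean(midterm)",
  fence                                 # chunk closes
)

# Write the report source to a temporary .Rmd file (hypothetical file name)
rmd.file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd.lines, rmd.file)

# Print the file as it would appear in the Source pane
cat(readLines(rmd.file), sep = "\n")
```

Opening this file in RStudio and clicking the Knit button (or, assuming the rmarkdown package is installed, calling rmarkdown::render(rmd.file)) should produce a report that shows both the code and the computed mean.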

Programming basics

Data building blocks

You'll encounter different kinds of data types

  • Booleans: direct binary values, TRUE or FALSE in R
  • Integers: whole numbers (positive, negative or zero)
  • Characters: fixed-length blocks of bits, with special coding; strings = sequences of characters
  • Floating point numbers: a fraction (with a finite number of bits) times an exponent, like \(1.87 \times {10}^{6}\)
  • Missing or ill-defined values: NA, NaN, etc.
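
The types listed above can be inspected with typeof(). This is a small sketch; the variable names and values are made up for illustration.

```r
is.enrolled <- TRUE              # Boolean (R calls this type "logical")
num.students <- 35L              # the L suffix marks an integer
avg.score <- 86.8                # numbers without L are floating point ("double")
course.name <- "Data Science"    # character string
missing.score <- NA              # missing value

typeof(is.enrolled)    # "logical"
typeof(num.students)   # "integer"
typeof(avg.score)      # "double"
typeof(course.name)    # "character"
is.na(missing.score)   # TRUE
is.nan(0 / 0)          # TRUE: 0/0 is ill-defined, so R returns NaN
```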

Operators (functions)

Command         Description
+, -, *, /      add, subtract, multiply, divide
^               raise to the power of
%%              remainder after division (ex: 8 %% 3 = 2)
( )             change the order of operations
log(), exp()    logarithms and exponents (ex: log(10) = 2.303)
sqrt()          square root
round()         round to the nearest whole number (ex: round(2.3) = 2)

7 + 5 # Addition
## [1] 12
7 - 5 # Subtraction
## [1] 2
7 * 5 # Multiplication
## [1] 35
7 ^ 5 # Exponentiation
## [1] 16807

7 / 5 # Division
## [1] 1.4
7 %% 5 # Modulus
## [1] 2
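
The remaining entries in the operator table (parentheses, log(), exp(), sqrt(), round()) work the same way. A short sketch:

```r
7 + 5 * 2                   # multiplication happens first: 17
(7 + 5) * 2                 # parentheses change the order: 24
log(10)                     # natural logarithm: 2.302585
log(100, base = 10)         # logarithm with a different base
exp(1)                      # e raised to the power 1: 2.718282
sqrt(16)                    # 4
round(2.3)                  # 2
round(2.567, digits = 2)    # digits controls the number of decimal places
```

Note that log() is the natural logarithm by default; the base argument switches to another base.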

Operators cont'd.

Comparisons are also binary operators; they take two objects, like numbers, and give a Boolean

7 > 5
## [1] TRUE
7 < 5
## [1] FALSE
7 >= 7
## [1] TRUE

7 <= 5
## [1] FALSE
7 == 5
## [1] FALSE
7 != 5
## [1] TRUE

Boolean operators

Basically "and" and "or":

(5 > 7) & (6*7 == 42)
## [1] FALSE
(5 > 7) | (6*7 == 42)
## [1] TRUE

Variables

We can give names to data objects; these give us variables

A few variables are built in:

pi
## [1] 3.141593

Variables can be arguments to functions or operators, just like constants:

pi*10
## [1] 31.41593

Assignment operator

Most variables are created with the assignment operator, <- or =

time.factor <- 12
time.factor
## [1] 12
time.in.years = 2.5
time.in.years * time.factor
## [1] 30

The assignment operator also changes values:

time.in.months <- time.in.years * time.factor
time.in.months
## [1] 30
time.in.months <- 45
time.in.months
## [1] 45

  • Using names and variables makes code:
    • easier to design,
    • easier to debug
    • and easier for others to read
  • Use descriptive variable names
    • Good: num.students <- 35
    • Bad: ns <- 35

The workspace

What names have you defined values for?

ls()
##  [1] "bbox"           "c_opts"         "colrs"          "deg"           
##  [5] "df"             "g"              "gg"             "l"             
##  [9] "links"          "map"            "net"            "nodes"         
## [13] "plot"           "population"     "presidential"   "sfData2"       
## [17] "sfg"            "SFOcrime"       "time.factor"    "time.in.months"
## [21] "time.in.years"  "unemp"          "UnempPres"      "us"            
## [25] "xrng"           "yrng"

Getting rid of variables:

rm("time.in.months")
ls()
##  [1] "bbox"          "c_opts"        "colrs"         "deg"          
##  [5] "df"            "g"             "gg"            "l"            
##  [9] "links"         "map"           "net"           "nodes"        
## [13] "plot"          "population"    "presidential"  "sfData2"      
## [17] "sfg"           "SFOcrime"      "time.factor"   "time.in.years"
## [21] "unemp"         "UnempPres"     "us"            "xrng"         
## [25] "yrng"
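
To check for a single name rather than scanning the whole ls() listing, exists() can be used. A minimal sketch with a made-up variable:

```r
scratch.value <- 42                       # hypothetical variable, for illustration only
exists("scratch.value")                   # TRUE
rm(scratch.value)                         # the name can be given with or without quotes
still.defined <- exists("scratch.value")
still.defined                             # FALSE

# rm(list = ls()) would remove every variable in the workspace; use with care
```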

First data structure: vectors

  • Group related data values into one object, a data structure

  • A vector is a sequence of values, all of the same type

  • c() function returns a vector containing all its arguments

students <- c("Sean", "Louisa", "Frank", "Farhad", "Li")
midterm <- c(80, 90, 93, 82, 95)
final <- c(78, 84, 95, 82, 91) # Final exam scores
  • Typing the variable name at the prompt causes it to display
students
## [1] "Sean"   "Louisa" "Frank"  "Farhad" "Li"
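
Because all values in a vector must share one type, c() silently converts mixed inputs to a common type. The values below are made up to show this:

```r
scores <- c(80, 90, 93)
class(scores)                   # "numeric"

mixed <- c(80, "ninety", 93)    # one string forces every element to character
class(mixed)                    # "character"
mixed                           # "80" "ninety" "93"

flags <- c(TRUE, FALSE, 1)      # Booleans mixed with a number become numbers
flags                           # 1 0 1
```

This is worth keeping in mind when a numeric column unexpectedly turns into text after importing data.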

Indexing

  • vec[1] is the first element, vec[4] is the 4th element of vec
students
## [1] "Sean"   "Louisa" "Frank"  "Farhad" "Li"
students[4]
## [1] "Farhad"
  • vec[-4] is a vector containing all but the fourth element
students[-4]
## [1] "Sean"   "Louisa" "Frank"  "Li"
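
A few more indexing patterns that follow from the rules above, sketched on the same students vector:

```r
students <- c("Sean", "Louisa", "Frank", "Farhad", "Li")

length(students)              # 5
students[2:4]                 # "Louisa" "Frank" "Farhad"
students[length(students)]    # last element: "Li"
students[6]                   # no sixth element, so R returns NA
```

R counts from 1, and asking for a position past the end gives NA rather than an error.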

Vector arithmetic

midterm # Midterm exam scores
## [1] 80 90 93 82 95
midterm + final # Sum of midterm and final scores
## [1] 158 174 188 164 186
(midterm + final)/2 # Average exam score
## [1] 79 87 94 82 93
course.grades <- 0.4*midterm + 0.6*final # Final course grade
course.grades
## [1] 78.8 86.4 94.2 82.0 92.6

Is the final score higher than the midterm score?

midterm 
## [1] 80 90 93 82 95
final
## [1] 78 84 95 82 91
final > midterm
## [1] FALSE FALSE  TRUE FALSE FALSE
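
Since TRUE counts as 1 and FALSE as 0 in arithmetic, the Boolean vector from a comparison can be summarized directly. A short sketch (the name improved is introduced here for illustration):

```r
midterm <- c(80, 90, 93, 82, 95)
final <- c(78, 84, 95, 82, 91)

improved <- final > midterm    # one TRUE/FALSE per student
sum(improved)                  # number of students who improved: 1
mean(improved)                 # share of students who improved: 0.2
```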

Functions on vectors

Command              Description
sum(vec)             sums up all the elements of vec
mean(vec)            mean of vec
median(vec)          median of vec
sd(vec), var(vec)    the standard deviation and variance of vec
length(vec)          the number of elements in vec
sort(vec)            returns vec in sorted order
summary(vec)         gives a five-number summary plus the mean

Functions on vectors

course.grades
## [1] 78.8 86.4 94.2 82.0 92.6
mean(course.grades) # mean grade
## [1] 86.8
median(course.grades)
## [1] 86.4
sd(course.grades) # grade standard deviation
## [1] 6.625708

More functions on vectors

sort(course.grades)
## [1] 78.8 82.0 86.4 92.6 94.2
max(course.grades) # highest course grade
## [1] 94.2
min(course.grades) # lowest course grade
## [1] 78.8
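
summary() is the one function from the table not shown above. On a numeric vector it reports six values: the minimum, first quartile, median, mean, third quartile, and maximum. A sketch using the same course grades:

```r
midterm <- c(80, 90, 93, 82, 95)
final <- c(78, 84, 95, 82, 91)
course.grades <- 0.4 * midterm + 0.6 * final

grade.summary <- summary(course.grades)
grade.summary
names(grade.summary)    # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."

range(course.grades)    # lowest and highest grade together: 78.8 94.2
```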

Referencing elements of vectors

students
## [1] "Sean"   "Louisa" "Frank"  "Farhad" "Li"

Vector of indices:

students[c(2,4)]
## [1] "Louisa" "Farhad"

Vector of negative indices

students[c(-1,-3)]
## [1] "Louisa" "Farhad" "Li"

More referencing: which() function

Return only A students

a.threshold <- 90 # A grade = 90% or higher
course.grades >= a.threshold # vector of booleans
## [1] FALSE FALSE  TRUE FALSE  TRUE
a.students <- which(course.grades >= a.threshold) # Applying which() 
a.students
## [1] 3 5
students[a.students] # Names of A students
## [1] "Frank" "Li"
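
The Boolean vector can also be used as an index directly, without which(), and which.max() finds the position of the largest value. A sketch with the same data:

```r
students <- c("Sean", "Louisa", "Frank", "Farhad", "Li")
course.grades <- c(78.8, 86.4, 94.2, 82.0, 92.6)
a.threshold <- 90

students[course.grades >= a.threshold]    # "Frank" "Li"
which.max(course.grades)                  # position of the top grade: 3
students[which.max(course.grades)]        # "Frank"
```

Logical indexing keeps the elements where the condition is TRUE; which() is useful when you need the positions themselves.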

Named components

You can give names to elements or components of vectors

students
## [1] "Sean"   "Louisa" "Frank"  "Farhad" "Li"
names(course.grades) <- students # Assign names to the grades
names(course.grades)
## [1] "Sean"   "Louisa" "Frank"  "Farhad" "Li"
course.grades
##   Sean Louisa  Frank Farhad     Li 
##   78.8   86.4   94.2   82.0   92.6
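
Once a vector has names, elements can be looked up by name instead of by position. A minimal sketch:

```r
course.grades <- c(Sean = 78.8, Louisa = 86.4, Frank = 94.2, Farhad = 82.0, Li = 92.6)

course.grades["Frank"]                       # Frank's grade: 94.2
course.grades[c("Sean", "Li")]               # several names at once
names(course.grades)[course.grades >= 90]    # "Frank" "Li"
```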

R coding style

  • Coding style (and code commenting) will become increasingly important as we get into more advanced and involved programming tasks

  • A few R "style guides" exist:
  • Borrowing Hadley Wickham's words: “You don’t have to use my style, but you really should use a consistent style.”

R style recommendations

  • Hadley Wickham's guide is short and easy to follow

  • We'll revisit the question of coding style several times over the course of the class

Enforced style: Assignment operator

Assignment operator. USE <-

student.names <- c("Eric", "Hao", "Jennifer")  # Good
student.names = c("Eric", "Hao", "Jennifer") # Bad
  • Note: When specifying function arguments, use = (writing <- inside a call instead creates a variable in your workspace and passes its value by position)
sort(tv.hours, decreasing=TRUE) # Good
sort(tv.hours, decreasing<-TRUE) # Bad!!
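
The sketch below shows why <- inside a function call is a problem. The tv.hours values are made up for illustration. Both calls happen to return the same sorted vector, but the second one leaves a stray variable named decreasing behind.

```r
tv.hours <- c(2, 5, 1, 3)    # made-up values, for illustration only

sorted.good <- sort(tv.hours, decreasing = TRUE)     # names the argument
sorted.bad <- sort(tv.hours, decreasing <- TRUE)     # assigns, then passes TRUE by position

identical(sorted.good, sorted.bad)                   # TRUE, but only by luck of position
stray.created <- exists("decreasing")
stray.created                                        # TRUE: a new variable now sits in the workspace
rm(decreasing)                                       # clean up the stray variable
```

With a function whose second argument is something other than decreasing, the <- version would quietly pass the value to the wrong argument.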

Enforced style: Spacing

  • Binary operators should have spaces around them

  • Commas should have a space after, but not before (just like in writing)

3 * 4 # Good
3*4 # Bad
which(student.names == "Eric") # Good
which(student.names=="Eric") # Bad
  • For specifying arguments, spacing around = is optional
sort(tv.hours, decreasing=TRUE) # Accepted
sort(tv.hours, decreasing = FALSE) # Accepted

Enforced style: Variable names

  • To make code easy to read, debug, and maintain, you should use concise but descriptive variable names

  • Terms in variable names should be separated by _ or .

# Accepted
day_one   day.one   day_1   day.1   day1

# Bad
d1   DayOne   dayone   

# Can be made more concise:
first.day.of.the.month
  • Avoid using variable names that are already pre-defined in R
# EXTREMELY bad:
c   T   pi   sum   mean   
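
A short sketch of what goes wrong when a built-in name is reused: assigning to T hides the built-in shorthand for TRUE until your copy is removed.

```r
T <- 0               # hides the built-in T (shorthand for TRUE) in your workspace
masked.value <- T
isTRUE(T)            # FALSE: code that relies on T now misbehaves

rm(T)                # removing your copy makes the built-in visible again
isTRUE(T)            # TRUE
```

Writing TRUE and FALSE in full avoids this problem, because those names cannot be reassigned.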

A common data problem

  • When data is entered manually, misspellings and case changes are very common

  • E.g., a column showing life support mechanism may look like,

Health<-c("dialysis" , "Dialysis", "dialysis" ,"none", "None", "nnone")
summary(Health)
##    Length     Class      Mode 
##         6 character character
  • This character vector has 5 distinct values even though it should have 2 (dialysis, none)

  • We can fix many of the typos by running spellcheck in Excel before importing data, or by changing the values on a case-by-case basis later

  • There's a faster way to fix just the capitalization issue (this is an exercise for Homework 1)
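
Before cleaning, it helps to see how many distinct spellings the vector actually contains. A minimal sketch using unique() and table() on the same Health vector (the fix itself is left for the homework):

```r
Health <- c("dialysis", "Dialysis", "dialysis", "none", "None", "nnone")

unique(Health)                          # each distinct spelling, in order of first appearance
n.distinct <- length(unique(Health))
n.distinct                              # 5
table(Health)                           # how often each spelling occurs
```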

Assignment