HMC R Bootcamp 2016

Jeho Park
May 26, 2016

DAY 1

Reference:

Some (i.e., a lot of) materials are adapted from the following websites:

Some Housekeeping Stuff

This R bootcamp is...

  • about R's programming aspects.
    • It's designed to help you start using and coding R.
  • not about Statistics.
    • I assume that you already have a basic knowledge of Statistics.
  • for your summer research and beyond.
    • The teaching materials are used to help you get up to speed on using R as quickly as possible for your research activities with Prof. Lelia Hawkins and Prof. Rachel Levy.

Day 1 Agenda

  • Module 0: Introduction
    • About this bootcamp
    • About R
    • Getting ready (R environment setup)
  • Module 1: The Basics of R
    • General Stuff
    • RStudio
    • R Data Structure (R Objects)
    • Basic graphics

Day 1 Agenda (Cont.)

  • Module 2: Working with Data
    • Raw data (CSV files)
    • Data import and export
    • Data cleaning
    • Subsetting
    • Dataframes vs Matrices
    • Attributes and missing values
    • Graphics (lattice and ggplot2)
    • Simple graph manipulations

Module 0: Introduction (15 min)

  • Goals and learning objectives
  • What is R?
  • What is not R?
  • Then Why R?
  • Getting ready for R Bootcamp

Bootcamp Goal:

To prepare students to use R to handle and analyze atmospheric sample data.

Learning Objectives:

By the end of this two-day bootcamp, you will be able to:

  • Import data from a simple tab-delimited file into R
  • Distinguish string variables from other R data types
  • Convert different data types back and forth
  • Create a loop using R’s for loop syntax
  • Create an if-else statement in R
  • Interpolate data points and use the interpolating functions given
  • Restate an existing R script and explain what each block of the script is for
  • Create basic plots and scale the x and y axes according to the data points
  • Use subsetting methods

What is R?

  • R is a statistical programming language/environment.
  • R is open source/free.
  • R is widely used/preferred.
  • R is cross-platform.
  • R is hard to learn (really?).

What is not R?

  • S: R's ancestor
  • S-Plus: Commercial; modern implementation of S
  • SAS: Commercial; widely used in commercial analytics.
  • SPSS: Commercial; easy to use; widely used in Social Science.
  • MATLAB: Commercial; can do some Stats.
  • Python: Also can do some Stats; good at text data manipulation.

Then Why R?

  • The R community is active and constantly growing
  • R is the most popular (other than SPSS and SAS)
  • R has tons of user-generated libraries/packages
  • R code is easily shared with others
  • R is constantly improved

[Figure: R Google Scholar hits. R comes after SPSS and SAS.]

R Extensions

Getting Ready!

  • Check R
  • Check RStudio
  • Check slides
  • Check check check…

End of Module 0

Yay! Good job guys!

Module 1: The Basics of R (45 min)

  • RStudio and its functionality
  • General stuff
  • Workspace of R and some calculations
  • R Objects (Vectors, Matrices, Lists, Data frames, and Factors)
  • Converting between different types of objects
  • Getting help online and offline

RStudio

  • Integrated Development Environment for R
  • Nice combination of GUI and CLI
  • Free and commercial versions
  • 4 main windows, tabs, etc.
  • Version control: Git and SVN
  • R Markdown
  • R Presentation

RStudio Panes

General Stuff

demo() # display available demos
demo(graphics) # try graphics demo
library() # show available packages on the computer
search() # show loaded packages
?hist # search for the usage of hist function
??histogram # search for package documents containing the word "histogram"

Workspace of R

The R workspace stores objects like vectors, datasets and functions in memory (the available space for calculation is limited to the size of the RAM).

a <- 5 # notice a in your Environment window
A <- "text" 
a
A
ls()
print(c(a,A))
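print(a, A) # compare with the line above: the second argument of print() is 'digits', not another value to print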

Look Ma, R can do Math!

1+1
2+runif(1,0,1)
2+runif(1,min=0,max=1)
3^2
3*3
sqrt(3*3) # comments
# comments are preceded by hash sign

Even More Math!

  • R can take integrals and derivatives, for example:

Numerical Integral of

\( \displaystyle\int_0^{\infty} \frac{1}{(x+1)\sqrt{x}}dx \)

integrand <- function(x) {1/((x+1)*sqrt(x))} ## define the integrand
integrate(integrand, lower=0, upper=Inf) ## integrate the function from 0 to infinity
## 3.141593 with absolute error < 2.7e-05
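
The slide mentions derivatives but only shows an integral. As a minimal sketch, base R's D() differentiates an expression symbolically. The expression x^3 below is a made-up example, not part of the original slides. The last lines re-check the integral above against its exact value, pi.

```r
# Symbolic derivative of x^3 with respect to x (illustrative expression)
f <- expression(x^3)
df_dx <- D(f, "x")          # an unevaluated call representing the derivative
df_dx

# Evaluate the derivative at x = 2
eval(df_dx, list(x = 2))

# Numerical check of the integral above: the exact value is pi
integrand <- function(x) 1 / ((x + 1) * sqrt(x))
res <- integrate(integrand, lower = 0, upper = Inf)
abs(res$value - pi) < 1e-4
```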

R Objects: Vectors

The most basic form of an R object.
Scalar values are vectors of length one.
A vector is a one-dimensional object whose data elements are all of the same type (homogeneous).

class(a)
class(A)
B <- c(a,A) # concatenation function
B # see the values
class(B) # why?
a <- rnorm(10)
a[3:5] <- NA # NA is a missing value
a
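
The "why?" above comes down to coercion: a vector holds one type only, so c() converts mixed inputs to the most general type among them. A small sketch (the names a_num, a_chr, and v are chosen here to avoid overwriting the objects on the slide):

```r
a_num <- 5
a_chr <- "text"
B <- c(a_num, a_chr)    # the number is coerced to the string "5"
class(B)

# Coercion goes logical -> integer -> numeric -> character
class(c(TRUE, 2L))      # integer
class(c(2L, 3.5))       # numeric
class(c(3.5, "x"))      # character

# NA does not change the type of a vector
v <- c(1.5, NA, 3)
class(v)
is.na(v)
mean(v, na.rm = TRUE)   # drop missing values before averaging
```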

R Objects: Vectors (cont.)

R has five basic or “atomic” classes of objects:

  • character
  • numeric (real numbers)
  • integer
  • complex
  • logical (TRUE/FALSE)

A vector contains a set of data in any one of the atomic classes.

R Objects: Matrices

A matrix is a two-dimensional rectangular object whose data elements are all of the same type (homogeneous).

mat <- matrix(rnorm(6), nrow = 3, ncol = 2) 
mat # a matrix
dim(mat) # dimension
t(mat) # transpose
summary(mat) # summary statistics for each column
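
Because rnorm() gives different numbers on every run, here is a small sketch with fixed values that shows how a matrix is filled and indexed. The matrix m is an illustrative example, not from the original slides.

```r
m <- matrix(1:6, nrow = 3, ncol = 2)  # filled column by column
m
m[2, 1]        # element in row 2, column 1
m[, 2]         # the whole second column, returned as a vector
dim(t(m))      # transposing swaps the dimensions
m %*% t(m)     # matrix product: (3 x 2) times (2 x 3) gives 3 x 3
```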

R Objects: Lists

A list is an object that can store different types of vectors.

aList <- list(name=c("Joseph"), married=TRUE, kids=2)
aList
aList$kids <- aList$kids+1
aList$kids
aList2 <- list(numeric_data=a,character_data=A)
aList2
allList <- list(aList, aList2)
allList

R Objects: Data frames

A data frame is a list of vectors of equal length with possibly different types. It is used for storing rectangular data tables.

n <- c(2, 3, 5) # a vector 
s <- c("aa", "bb", "cc") # a vector
b <- c(TRUE, FALSE, TRUE) # a vector
df <- data.frame(n, s, b) # a data frame
df
class(df$s) # a factor in R < 4.0 (stringsAsFactors = TRUE by default); a character vector in R >= 4.0
mtcars # a built-in (attached) data frame
mtcars$mpg

R Objects: Data frames (cont.)

myFrame <- data.frame(y1=rnorm(100),y2=rnorm(100), y3=rnorm(100))
head(myFrame) # display first few lines of data
names(myFrame) # display column names
summary(myFrame) # output depends on the data types
plot(myFrame)
myFrame2 <- read.table(file="http://scicomp.hmc.edu/data/R/Rtest.txt", header=TRUE, sep=",")
myFrame2

R Objects: Factors

  • Factors are a special compound object used to represent categorical data such as gender, social class, etc.
  • Factors have a 'levels' attribute. They may be nominal or ordered.
v <- c("a","b","c","c","b")
x <- factor(v) # turn the character vector into a factor object
z <- factor(v, ordered = TRUE) # ordered factor
x
z
table(x)
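
A short sketch of what the 'levels' attribute does in practice. The size factor and its level names are invented for illustration. Levels are stored once, each value is an integer code pointing at a level, and an ordered factor supports comparisons.

```r
v <- c("a", "b", "c", "c", "b")
x <- factor(v)
levels(x)        # the distinct categories, sorted alphabetically by default
as.integer(x)    # the underlying integer codes

# An ordered factor with an explicit level order (hypothetical categories)
size <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)
size < "high"    # comparisons follow the level order, not the alphabet
table(size)      # counts are reported in level order
```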

Converting between different types

Use the as.*() family of functions. Type as. and wait to see the list of available as.*() functions.

integers <- 1:10
as.character(integers)
as.numeric(c('3.7', '4.8'))
indices <- c(1.7, 2.3)
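integers[indices] # sometimes R is too generous: fractional indices are truncated, so this is integers[c(1, 2)]
integers[0.999999999] # close to 1 but truncated to 0, so the result is integer(0)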
df <- as.data.frame(mat)
df
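
Conversions can fail quietly, so it helps to know a few common pitfalls. The sketch below uses made-up values. A factor converts to its integer codes rather than its labels, and strings that are not numbers become NA.

```r
# Factor -> number: go through character to get the labels, not the codes
f <- factor(c("10", "20", "20"))
as.numeric(f)                  # level codes: 1 2 2
as.numeric(as.character(f))    # the actual values: 10 20 20

# Strings that cannot be parsed become NA (with a warning, silenced here)
bad <- suppressWarnings(as.numeric(c("3.7", "abc")))
is.na(bad)

# Other members of the as.*() family
as.logical(c("TRUE", "false", "yes"))  # unrecognized strings become NA
as.integer(3.9)                        # truncates toward zero, no rounding
```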

Getting help online and offline
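
A few base R ways to get help offline, as a minimal sketch. This selection of functions is a suggestion rather than the content of the original slide.

```r
help("hist")                      # same as ?hist
found <- apropos("^read\\.")      # names of visible objects starting with "read."
found
args(read.table)                  # argument list of a function
```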

Wrap-up exercise (15 min)

Basics
1) Using the cars dataset, create a vector 'stop_mean' that contains the mean stopping distance.
2) Create a matrix 'cars_mat' by converting the cars data frame.

Challenges
3) Install the 'hflights' dataset package.
4) Using the hflights dataset, create a variable called 'x' that contains the mean flight arrival delay. (Hint: NA's)
5) Create a boolean (TRUE/FALSE) vector indicating whether the departure delay is shorter than the arrival delay.

Break (15 min)

Take a break!

Solutions and discussions (15 min)

1)

2)

3)

4)

5)
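
One possible set of answers, offered as a sketch rather than the original solutions. The variable names flights and dep_shorter are illustrative. The hflights part assumes the package is installed and that the delay columns are named ArrDelay and DepDelay, and it is skipped otherwise.

```r
# 1) Mean stopping distance from the built-in cars data frame
stop_mean <- mean(cars$dist)

# 2) Convert the cars data frame to a matrix
cars_mat <- as.matrix(cars)

# 3) install.packages("hflights")   # run once; needs an internet connection

if (requireNamespace("hflights", quietly = TRUE)) {
  flights <- hflights::hflights
  # 4) Mean arrival delay; na.rm = TRUE drops the missing values
  x <- mean(flights$ArrDelay, na.rm = TRUE)
  # 5) TRUE where the departure delay is shorter than the arrival delay
  dep_shorter <- flights$DepDelay < flights$ArrDelay
}
```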

Module 2: Working with Data

  • Working with raw data (text files)
  • Data import and export
  • Subsetting
  • Using data frames vs. matrices

Working with Raw Data (text files)

  • Use read.table() to read text files into R
  • Try the help document for read.table(): ?read.table

Data Import

  • read.csv() is a special case of read.table()
  • Data import from your local folder
cpds <- read.csv(file.path('.', 'data', 'cpds.csv'))
head(cpds) # good to look at a few lines
class(cpds) # data.frame
  • Data import from the Internet
data <- read.table(file="http://scicomp.hmc.edu/data/R/normtemp.txt", header=TRUE)
tail(data)
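
The examples above depend on files that may not be on your machine. Here is a self-contained sketch that writes a tiny CSV to a temporary file and reads it both ways. The file contents and the names via_csv and via_table are made up for illustration.

```r
# Create a small CSV file in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,temp,site",
             "1,98.6,A",
             "2,97.9,B",
             "3,,A"),            # the empty field becomes NA
           csv_path)

# read.csv() is read.table() with header = TRUE and sep = "," preset
via_csv   <- read.csv(csv_path, stringsAsFactors = FALSE)
via_table <- read.table(csv_path, header = TRUE, sep = ",",
                        stringsAsFactors = FALSE)
identical(via_csv, via_table)

str(via_csv)            # check the column types after import
is.na(via_csv$temp)     # locate the missing value
```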

Data Import (Cont.)

rta <- read.table("./data/RTADataSub.csv", sep = ",", header = TRUE)
dim(rta)
rta[1:5, 1:5]
class(rta)
class(rta$time) # what? likely a factor in R < 4.0, where stringsAsFactors defaulted to TRUE; let's see ?read.table more carefully
rta2 <- read.table("./data/RTADataSub.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
class(rta2$time)

Data Export

  • Use write.csv() (a wrapper around write.table()) to write data to a CSV file
write.csv(data, file = "temp.csv", row.names = FALSE) 
  • Writing out plots
pdf('myplot.pdf', width = 7, height = 7) # call pdf() before calling plot()
x <- rnorm(10); y <- rnorm(10)
plot(x, y)
dev.off()
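
A round-trip sketch for a tab-delimited file, the format named in the learning objectives. The data frame out and the temporary file paths are illustrative.

```r
out <- data.frame(id = 1:3, temp = c(98.6, 97.9, 99.1))

# Write a tab-delimited text file without quotes or row names
tsv_path <- tempfile(fileext = ".txt")
write.table(out, file = tsv_path, sep = "\t", quote = FALSE, row.names = FALSE)

# read.delim() is read.table() preset for tab-delimited files with a header
back <- read.delim(tsv_path)
all.equal(out, back)

# Write a plot to a PDF file, then confirm the file was created
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path, width = 7, height = 7)
plot(out$id, out$temp)
dev.off()
file.exists(pdf_path)
```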

Subsetting

Operators that can be used to extract subsets of R objects.

  • '[ ]' returns an object of the same class as the original (with a few exceptions, such as selecting a single column of a data frame or matrix); it can be used to select more than one element.
  • '[[ ]]' is used to extract elements of a list or a data frame; it can only be used to extract a single element.
  • $ is used to extract elements of a list or data frame by name.
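
The difference between the three operators is easiest to see on a small list. In this sketch the list y and its contents are made up for illustration.

```r
y <- list(foo = c("a", "b"), bar = 1:3)

class(y["foo"])      # single brackets keep the container: still a list
class(y[["foo"]])    # double brackets return the element itself
identical(y[["bar"]], y$bar)   # $ is shorthand for [[ ]] with a name

y[c("foo", "bar")]   # single brackets can select several elements
y[["bar"]][2]        # extract the element, then index into it
```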

Subsetting (cont.)

x <- c("a", "b", "c", "c", "d", "a")
x[1]
x[1:4]
x[x > "a"] # strings are compared alphabetically
u <- x > "a" # what's u here? (a logical vector, one value per element of x)
u
x[u] # subsetting using a boolean vector
y <- list(foo=x, bar=x[u]) 
y
y[[1]]
y$bar
subset(mtcars, gear == 5) # use of subset function for data frames
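
The same boolean-vector idea works on the rows of a data frame. This sketch uses the built-in mtcars data to show that subset() and logical row indexing select the same rows. The names by_subset and by_index are illustrative.

```r
by_subset <- subset(mtcars, gear == 5)
by_index  <- mtcars[mtcars$gear == 5, ]   # rows where the condition is TRUE, all columns
all.equal(by_subset, by_index)
nrow(by_subset)

# Combine conditions with & and pick columns by name
mtcars[mtcars$gear == 5 & mtcars$mpg > 20, c("mpg", "hp")]
```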

Data frame vs matrix

Consider the following:

  • Are the columns of the same type or different types? Numeric or another type?
  • Is it convenient to use $ with column names?
  • Is the data too big? (memory efficiency and size)
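m <- matrix(1:400000, 2, 200000) # esp. for a large number of columns!
d <- as.data.frame(m)
object.size(m) # about 1.6 MB (1600200 bytes on the original slides)
object.size(d) # about 22 MB (22400568 bytes on the original slides); exact values vary by R version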
  • Conversion between data frame and matrix
    • as.data.frame()
    • as.matrix() or data.matrix() # consider coercion
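
What "consider coercion" means in practice: a matrix holds one type only, so converting a mixed data frame changes the data. In this sketch the data frame mixed is a made-up example.

```r
mixed <- data.frame(n = c(2, 3, 5),
                    s = factor(c("aa", "bb", "cc")))

m_chr <- as.matrix(mixed)     # one non-numeric column makes everything character
is.character(m_chr)

m_num <- data.matrix(mixed)   # factors are replaced by their integer codes
is.numeric(m_num)
m_num[, "s"]

m_ok <- as.matrix(mixed["n"]) # numeric columns only: stays numeric
is.numeric(m_ok)
```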

Wrap-up exercise

Basics
1) Try installing a new package (lmtest) from the Console
2) Try installing another package (ggplot2) from the Packages pane

Challenges
3) From the hflights dataset, create a subsample of flights whose ArrDelay is greater than 10 hours (600 min) and name it hflightsSub.
4) Set all of the extreme delays in ArrDelay (more than 300 minutes) in hflightsSub to NA.
5) Convert the data frame d back to a matrix named m1 and compare the sizes of m and m1.

Homework (Day 1)

  1. Install Git on your computer
  2. Create a GitHub account
  3. Create a test repository, r-bootcamp-git-test
  4. Try pushing to the repository
  5. Try pulling from the repository
  6. Digest what you learned from Day 1