R for Beginners

Jeho Park
Feb 1, 2017

HMC Scientific Computing Workshop Series, Spring 2017

https://github.com/jehopark/r-workshop-for-beginners-sp2017.git

Some Housekeeping Stuff

After this workshop, you will be able to...

  • Create basic R objects such as a vector, a matrix, a list, and a data frame
  • Tell the differences between those data objects/types
  • Import CSV files into the R environment
  • Export R objects in your environment to an R data image
  • Subset data to extract the part of a data set you need

Workshop Agenda: Real Basics of R

  • What is R?
  • What is not R?
  • Then Why R?
  • What is RStudio?
  • What can you do with it? (some math example)

Workshop Agenda: Basics of R

  • Demos, search, and help documents
  • Workspace of R
  • R Objects: Vector, Matrix, List, Data Frame, Factor, Function
  • Converting between different types
  • Working with Data: Import/Export, subsetting

What is R?

  • R is a statistical programming language/environment.
  • R is open source/free.
  • R is widely used/preferred.
  • R is cross-platform.
  • R is hard to learn (really?).

What is not R?

  • S: R's ancestor
  • S-Plus: Commercial; modern implementation of S
  • SAS: Commercial; widely used in commercial analytics.
  • SPSS: Commercial; easy to use; widely used in Social Science.
  • MATLAB: Commercial; can do some Stats.
  • Python: Also can do some Stats; good in text data manipulation.

Then Why R?

  • R community is active and constantly growing
  • R is one of the most popular statistical programming languages
  • R has tons of user generated libraries/packages
  • R code is easily shared with others
  • R is constantly improved

R listserv help

Further reading about R's popularity in science and engineering:
R moves up to 5th place in IEEE language rankings at https://www.r-bloggers.com/r-moves-up-to-5th-place-in-ieee-language-rankings/

R Google Scholar Hit

R Extensions

Getting help online and offline

Some R Vocabulary

  • packages are add-on features to R that include data, new functions and methods, and extended capabilities. Think of them as "apps" on your phone. We've already installed several!
  • console is the main window of R where you enter commands
  • scripts are where you store commands to be run in the console later, like syntax files in SPSS or .do files in Stata
  • functions are commands that do something to an object in R
  • workspace is the working memory of R where all objects are stored
  • vector is the basic unit of data in R
  • data frame is the main element for statistical purposes, an object with rows and columns that includes numbers, factors, and other data types

What is RStudio?

  • Integrated Development Environment for R
  • Nice combination of GUI and CLI
  • Free and commercial version
  • 4 main windows, tabs, etc
  • Version control: Git and SVN
  • R Markdown
  • R Presentation

Get ready

  • Open RStudio!

What Can We Do with RStudio?

RStudio

Look Ma, R can do Math!

1+1
2+runif(1,0,1)
2+runif(1,min=0,max=1)
3^2
3*3
sqrt(3*3) # comments
# comments are preceded by hash sign

Even More Math!

  • R can take integrals and derivatives, for example:

Numerical Integral of

\( \displaystyle\int_0^{\infty} \frac{1}{(x+1)\sqrt{x}}dx \)

integrand <- function(x) {1/((x+1)*sqrt(x))} ## define the function
integrate(integrand, lower=0, upper=Inf) ## integrate the function from 0 to infinity
3.141593 with absolute error < 2.7e-05
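
The slide mentions derivatives but only shows an integral. A minimal sketch of a symbolic derivative with base R's D() follows; the function x^3 and the variable names are chosen here for illustration only. The last lines check that the integral above agrees with its exact value, pi.

```r
# Symbolic derivative of x^3 with respect to x, using base R's D()
f_expr <- quote(x^3)
deriv_expr <- D(f_expr, "x")
deriv_expr                          # the derivative as an unevaluated expression

# Evaluate the derivative at x = 2
slope_at_2 <- eval(deriv_expr, list(x = 2))
slope_at_2                          # 12

# Compare the numerical integral from the slide with pi
integrand <- function(x) 1 / ((x + 1) * sqrt(x))
result <- integrate(integrand, lower = 0, upper = Inf)
abs(result$value - pi) < 1e-4       # TRUE
```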

Some General Stuff

demo() # display available demos
demo(graphics) # try graphics demo
library() # show available packages on the computer
search() # show loaded packages
?hist # search for the usage of hist function
??histogram # search for package documents containing the word "histogram"
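
Two more base R helpers can be useful when you only remember part of a function name. This is a small sketch; the search pattern is just an example.

```r
# List visible objects whose names start with "read."
read_functions <- apropos("^read\\.")
read_functions

# Show the argument list of a function without opening its help page
args(read.csv)
```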

Workspace of R

The R workspace stores objects like vectors, datasets and functions in memory (the available space for calculation is limited to the size of the RAM).

a <- 5 # notice a in your Environment window
A <- "text" 
a
A
ls()
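print(c(a,A)) # combine first, then print both values
print(a,A) # not equivalent: print() shows only its first argument; A is taken as the digits argument, which typically fails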
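
As a small sketch of how the workspace behaves: names are case-sensitive, ls() lists the objects, and rm() removes them. The helper names objects_before and objects_after are introduced here only for illustration.

```r
a <- 5
A <- "text"
identical(a, A)                          # FALSE: a and A are different objects

objects_before <- ls()                   # names in the workspace right now
rm(a)                                    # remove a single object
objects_after <- ls()

setdiff(objects_before, objects_after)   # "a" is the only name that disappeared
```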

R as a Programming Language: R Objects

VECTOR (homogeneous)
A vector is a one-dimensional sequence of data elements that all have the same type.

class(a)
class(A)
B <- c(a,A) # concatenation
print(B)
class(B) # why?

R Objects: Vectors (cont.)

R has five basic or “atomic” classes of objects:

  • character
  • numeric (real numbers)
  • integer
  • complex
  • logical (TRUE/FALSE)

A vector contains a set of data in any one of the atomic classes.
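
Because a vector holds a single atomic class, c() coerces mixed inputs to the most flexible class involved. The sketch below illustrates that ordering (logical, then integer, then numeric, then character) and is one way to answer the "why?" on the previous slide; the variable names are illustrative.

```r
cls_logical_integer <- class(c(TRUE, 2L))       # "integer"
cls_integer_numeric <- class(c(2L, 3.5))        # "numeric"
cls_numeric_character <- class(c(3.5, "text"))  # "character"

c(TRUE, 2L)      # TRUE becomes 1
c(3.5, "text")   # 3.5 becomes the string "3.5"
```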

R as a Programming Language: R Objects

Matrices (homogeneous)
A matrix is a two-dimensional rectangular object whose data elements all have the same type (homogeneous).

mat <- matrix(rnorm(6), nrow = 3, ncol = 2) 
mat # a matrix
dim(mat) # dimension
t(mat) 
summary(mat) 
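
Since rnorm() gives different numbers on every run, a deterministic sketch may make the layout easier to see. It uses a small integer matrix (the name m is chosen here for illustration) to show column-wise filling, indexing, and matrix multiplication.

```r
m <- matrix(1:6, nrow = 3, ncol = 2)   # values are filled column by column
m

m[2, 1]                # row 2, column 1: 2
m[, 2]                 # the whole second column: 4 5 6

prod_dim <- dim(m %*% t(m))   # (3 x 2) times (2 x 3) gives a 3 x 3 matrix
prod_dim
```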

R as a Programming Language: R Objects

LIST (heterogeneous)
A list is an object that can store elements of different types, including vectors of different classes and other lists.

aList <- list(name=c("Joseph"), married=T, kids=2)
aList
aList$kids <- aList$kids+1
aList$kids
aList2 <- list(numeric_data=a,character_data=A)
aList2
allList <- list(aList, aList2)
allList # a list of lists
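
A few base functions help when inspecting and editing a list. The sketch below uses a hypothetical list called person that mirrors aList above.

```r
person <- list(name = "Joseph", married = TRUE, kids = 2)

names(person)            # "name" "married" "kids"
str(person)              # compact overview of each element and its type

person$city <- "Claremont"   # add a new element by name (value is illustrative)
person$married <- NULL       # assigning NULL removes an element
n_elements <- length(person)
n_elements                   # 3
```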

R as a Programming Language: R Objects

Data Frame (heterogeneous and homogeneous)
A data frame is used for storing data tables. It is a list of vectors of equal length.

n <- c(2, 3, 5) # a vector 
s <- c("aa", "bb", "cc") # a vector
b <- c(TRUE, FALSE, TRUE) # a vector
df <- data.frame(n, s, b) # a data frame
df
mtcars # a built-in (attached) data frame
mtcars$mpg
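
The statement that a data frame is a list of equal-length vectors can be checked directly. This is a minimal sketch; small_df is an illustrative name, and stringsAsFactors = FALSE is set explicitly because its default differs between R versions.

```r
small_df <- data.frame(n = c(2, 3, 5),
                       s = c("aa", "bb", "cc"),
                       b = c(TRUE, FALSE, TRUE),
                       stringsAsFactors = FALSE)

is.list(small_df)                        # TRUE: a data frame is a list underneath
column_classes <- sapply(small_df, class)
column_classes                           # one class per column
dim(small_df)                            # 3 rows, 3 columns
```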

R as a Programming Language: R Objects

Data Frame (cont.)

myFrame <- data.frame(y1=rnorm(100),y2=rnorm(100), y3=rnorm(100))
head(myFrame) # display first few lines of data
names(myFrame) # display column names
summary(myFrame) # output depends on the data types
plot(myFrame)
myFrame2 <- read.table(file="http://scicomp.hmc.edu/data/R/Rtest.txt", header=T, sep=",")
myFrame2

R as a Programming Language: R Objects

FACTOR

  • Factors are a special compound object used to represent categorical data such as gender, social class, etc.
  • Factors have a 'levels' attribute. They may be unordered (nominal) or ordered.
v <- c("a","b","c","c","b")
x <- factor(v) # turn the character vector into a factor object
z <- factor(v, ordered = TRUE) # ordered factor
x
z
table(x)
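
By default the levels are sorted alphabetically, which is rarely the order you want for an ordered factor. The sketch below sets the levels explicitly; the size labels are made up for illustration.

```r
sizes <- c("small", "large", "medium", "small")
f <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)

levels(f)                   # "small" "medium" "large"
codes <- as.integer(f)      # underlying integer codes: 1 3 2 1
codes
below_large <- f < "large"  # comparisons work for ordered factors
below_large                 # TRUE FALSE TRUE TRUE
```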

R as a Programming Language: R Objects

Function
Functions are also objects in the R environment.

fun <- function(a,b) {
  a*b
}

fun   
fun(2,3) # a function call
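
Because functions are objects, they can have default arguments and be passed to other functions. The following is a small sketch; scale_by is a hypothetical function, not part of the workshop material.

```r
scale_by <- function(x, factor = 2) {
  x * factor                  # the last evaluated expression is the return value
}

scale_by(3)                   # uses the default factor: 6
scale_by(3, factor = 10)      # 30

doubled <- sapply(1:3, scale_by)            # pass the function as an argument
doubled                                     # 2 4 6
squares <- sapply(1:3, function(v) v^2)     # anonymous function
squares                                     # 1 4 9
```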

Converting between different types

Use the as.*() family of functions. In RStudio, type as. and wait for the autocomplete list of as.*() functions.

integers <- 1:10
as.character(integers)
as.numeric(c('3.7', '4.8'))
indices <- c(1.7, 2.3)
integers[indices] # sometimes R is too generous
integers[0.999999999] # close to 1 but...
df <- as.data.frame(mat)
df
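
The two indexing surprises above come from the same rule: numeric indices are truncated toward zero, as as.integer() would do. A short sketch that makes this visible:

```r
integers <- 1:10
indices <- c(1.7, 2.3)

as.integer(indices)                    # 1 2: the fractional part is dropped
first_two <- integers[indices]         # same as integers[c(1, 2)]
first_two

nothing <- integers[0.999999999]       # index truncates to 0, which selects nothing
length(nothing)                        # 0
```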

Working with Data

  • Working with raw data (text files)
  • Data import and export
  • Subsetting
  • Using data frames vs. matrices

Working with Raw Data (text files)

  • Use read.table() to read text files into R
  • Try help document for read.table()

Data Import

  • read.csv() is a special case of read.table()
  • Data import from your local folder
cpds <- read.csv(file.path('.', 'data', 'cpds.csv'))
head(cpds) # good to look at a few lines
class(cpds) # data.frame
  • Data import from the Internet
data <- read.table(file="http://scicomp.hmc.edu/data/R/normtemp.txt", header=T)
tail(data)

Data Import (Cont.)

rta <- read.table("./data/RTADataSub.csv", sep = ",", header = TRUE)
dim(rta)
rta[1:5, 1:5]
class(rta)
class(rta$time) # what? a factor in R < 4.0.0, where stringsAsFactors defaulted to TRUE; see ?read.table
rta2 <- read.table("./data/RTADataSub.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
class(rta2$time)
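
If the workshop data files are not at hand, the effect of stringsAsFactors can be reproduced with a temporary file. This is a self-contained sketch; the file contents and column names are made up for illustration.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,time,score",
             "1,08:30,3.5",
             "2,09:15,4.0"), tmp)

as_factor <- read.csv(tmp, stringsAsFactors = TRUE)
as_text   <- read.csv(tmp, stringsAsFactors = FALSE)

class(as_factor$time)   # "factor"
class(as_text$time)     # "character"
str(as_text)            # check every column type after an import

unlink(tmp)             # clean up the temporary file
```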

Data Export

  • Use write.csv() (a wrapper around write.table()) to write data to a CSV file
write.csv(data, file = "temp.csv", row.names = FALSE) 
  • Save all the objects in the current environment
save.image(file="myenv.RData") 
# Q: How to load them back later?
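
One possible answer, sketched with a temporary file: load() restores the objects stored by save() or save.image() under their original names. The object names x1 and label are illustrative; for the file above the call would be load("myenv.RData").

```r
env_file <- tempfile(fileext = ".RData")

x1 <- 1:3
label <- "demo"
save(x1, label, file = env_file)   # save selected objects (save.image() saves all)

rm(x1, label)                      # simulate a fresh session
loaded <- load(env_file)           # restores the objects and returns their names
loaded
x1                                 # available again

unlink(env_file)                   # clean up the temporary file
```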

Subsetting

Operators that can be used to extract subsets of R objects.

  • '[' and ']' generally return an object of the same class as the original; they can be used to select more than one element.
  • '[[' and ']]' are used to extract elements of a list or a data frame; they can only be used to extract a single element.
  • $ is used to extract elements of a list or data frame by name.

Subsetting (cont.)

x <- c("a", "b", "c", "c", "d", "a")
x[1]
x[1:4]
x[x > "a"] 
u <- x > "a" # what's u here?
u
x[u] # subsetting using a boolean vector
y <- list(foo=x, bar=x[u]) 
y
y[[1]]
y$bar
subset(mtcars, gear == 5) # use of subset function for data frames
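
The difference between [ and [[ is easiest to see on a list, and subset() has an equivalent form using logical indexing. A brief sketch; the list y_demo and the chosen columns are illustrative.

```r
y_demo <- list(foo = c("a", "b"), bar = 1:3)

class(y_demo["bar"])      # "list": single brackets keep the container
class(y_demo[["bar"]])    # "integer": double brackets return the element itself

# Logical indexing on a data frame: rows with 5 gears, two columns only
five_gears <- mtcars[mtcars$gear == 5, c("mpg", "hp")]
five_gears

# The same selection written with subset()
five_gears2 <- subset(mtcars, gear == 5, select = c(mpg, hp))
nrow(five_gears2)         # 5
```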

Data frame vs matrix

Consider the following:

  • Same types or different types? Numeric or other type?
  • Convenient using $ with col names?
  • Data size too big? (memory efficiency and size)
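m <- matrix(1:400000, 2, 200000) # esp. for a large number of columns!
d <- as.data.frame(m)
object.size(m) # 1600200 bytes in the original run; exact size varies by R version
object.size(d) # 22400568 bytes in the original run; much larger than the matrix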
  • Conversion between data frame and matrix
    • as.data.frame()
    • as.matrix() or data.matrix() # consider coercion
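
The coercion mentioned in the last bullet can be sketched as follows: as.matrix() turns a mixed data frame into a character matrix, while data.matrix() keeps it numeric by replacing factor levels with their integer codes. The data frame mixed is made up for illustration.

```r
mixed <- data.frame(n = c(2, 3, 5),
                    s = c("aa", "bb", "cc"),
                    stringsAsFactors = TRUE)

m_chr <- as.matrix(mixed)     # everything is coerced to character
m_num <- data.matrix(mixed)   # factor column s becomes its integer codes

typeof(m_chr)                 # "character"
typeof(m_num)                 # "double"
m_num
```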

That's it!

Please submit your R environment file* at http://bit.ly/hmc-r-workshop-sign-in (digital badge requirement).

*You should know how to export the objects in your environment to an RData file.

Further Study!