R for Beginners

Jeho Park
Jan 10, 2018

HMC Scientific Computing Workshop Series, Spring 2018

Some Housekeeping Stuff

  • Slides at bit.ly/hmc-r-workshop-slides-sp2018
  • R files at bit.ly/hmc-r-workshop-git-sp2018
  • Sign-in at bit.ly/hmc-r-workshop-signin-sp2018

After this workshop, you will be able to...

  • Create basic R objects such as a vector, a matrix, a list, a dataframe.
  • Tell the differences between those data objects/types
  • Import CSV files into R environment
  • Export R objects in your environment to R data image
  • (Subset data to create a subset of given data)

For this workshop, I expect that...

  • You have had no experience using R
  • You have R and RStudio installed on your laptop
    • If not, please use one of our loaner laptops
  • You have the Internet access
    • If not, please let me know
  • You interact with me and ask questions anytime (if you have any)

Workshop Agenda: Real Basics of R

  • What is R?
  • What is not R?
  • Then Why R?
  • What is RStudio?
  • What can you do with it? (some math example)

Workshop Agenda: Basics of R

  • Demos, search, and help documents
  • Workspace of R
  • R Objects: Vector, Matrix, List, Data Frame, Factor, Function
  • Converting between different types
  • Working with Data: Import/Export, subsetting

What is R?

  • R is a statical programming language/environment.
  • R is open source/free.
  • R is widely used/prefered.
  • R is cross-platform.
  • R is hard to learn (really?).

What is not R?

  • S: R's ancestor
  • S-Plus: Commercial; modern implementation of S
  • SAS: Commercial; widely used in the commercial analytics.
  • SPSS: Commercial; easy to use; widely used in Social Science.
  • MATLAB: Commercial; can do some Stats.
  • Python: Also can do some Stats; good in text data manipulation.

Then Why R?

  • R community is active and constantly growing
  • R is one of the most popular stat programming lang
  • R has tons of user generated libraries/packages
  • R code is easily shared with others
  • R is constantly improved

Then Why R?

  • R community is active and constantly growing
  • R is one of the most popular stat programming lang
  • R has tons of user generated libraries/packages
  • R code is easily shared with others
  • R is constantly improved
Further reading about R's popularity in science and engineering:
R moves up to 5th place in IEEE language rankings at https://www.r-bloggers.com/r-moves-up-to-5th-place-in-ieee-language-rankings/
https://spectrum.ieee.org/static/interactive-the-top-programming-languages-2017 (6th in 2017)

Then Why R?

IEEE Spectrum: Popular Programming Languages in 2017 Poplar Programming Languages

Then Why R?

  • R community is active and constantly growing
  • R is one of the most popular stat programming lang)
  • R has tons of user generated libraries/packages
  • R code is easily shared with others
  • R is constantly improved

Getting help online and offline

On the Internet:

Local Resources:

What is RStudio?

  • Integrated Development Environment for R
  • Nice combination of GUI and CLI
  • Free and commercial version
  • 4 main windows, tabs, etc
  • Version control: Git and VPN
  • R Markdown
  • R Presentation

Get ready

  • Let's open RStudio

What Can We Do with RStudio?

RStudio

RStudio Project

  • RStudio project helps you to keep all the files associated with a project/work in one directory along with environment, history, version control (git) files.
  • Let's create a new RStudio project by

File >> New Project…

Look Ma, R can do Math!

1+1
2+runif(1,0,1)
2+runif(1,min=0,max=1)
3^2
3*3
sqrt(3*3) # comments
# comments are preceded by hash sign

Even More Math!

  • R can take integrals and derivatives, for example:

Numerical Integral of

\( \displaystyle\int_0^{\infty} \frac{1}{(x+1)\sqrt{x}}dx \)

integrand <- function(x) {1/((x+1)*sqrt(x))} ## define the function
integrate(integrand, lower=0, upper=Inf) ## integrate the function from 0 to infinity
3.141593 with absolute error < 2.7e-05

Some General Stuff

demo() # display available demos
demo(graphics) # try graphics demo
library() # show available packages on the computer
search() # show loaded packages
?hist # search for the usage of hist function
??histogram # search for package documents containing the word "histogram"

Workspace of R

R workspace stores objects like vecors, datasets and functions in memory (the available space for calculation is limited to the size of the RAM).

a <- 5 # notice a in your Environment window
A <- "text" 
a
A
ls()
print(c(a,A))
print(a,A)

R as a Programming Language: R Objects

VECTOR (homogeneous)
A vector is an array object of the same type data elements.

class(a)
class(A)
B <- c(a,A) # concatenation
print(B)
class(B) # why?

R Objects: Vectors (cont.)

R has five basic or “atomic” classes of objects:

  • character
  • numeric (real numbers)
  • integer
  • complex
  • logical (True/False)

A vector contains a set of data in any one of the atomic classes.

R as a Programming Language: R Objects

Matrices (homogeneous)
A matrix is a two-dimensional rectangular object of the same type (homogeneous) data elements.

mat <- matrix(rnorm(6), nrow = 3, ncol = 2) 
mat # a matrix
dim(mat) # dimension
t(mat) 
summary(mat) 

R as a Programming Language: R Objects

LIST (heterogeneous)
A list is an object that can store different types of vectors.

aList <- list(name=c("Joseph"), married=T, kids=2)
aList
aList$kids <- aList$kids+1
aList$kids
aList2 <- list(numeric_data=a,character_data=A)
aList2
allList <- list(aList, aList2)
allList # a list of lists

R as a Programming Language: R Objects

Data Frame (heterogeneous and homogeneous)
A data frame is used for storing data tables. It is a list of vectors of equal length.

n <- c(2, 3, 5) # a vector 
s <- c("aa", "bb", "cc") # a vector
b <- c(TRUE, FALSE, TRUE) # a vector
df <- data.frame(n, s, b) # a data frame
df
mtcars # a built-in (attached) data frame
mtcars$mpg
plot(mtcars$mpg, mtcars$hp) # what to expect?

R as a Programming Language: R Objects

Data Frame (cont.)

myFrame <- data.frame(y1=rnorm(100),y2=rnorm(100), y3=rnorm(100))
head(myFrame) # display first few lines of data
names(myFrame) # display column names
summary(myFrame) # output depends on the data types
plot(myFrame)
myFrame2 <- read.table(file="http://scicomp.hmc.edu/data/R/Rtest.txt", header=T, sep=",")
myFrame2

R as a Programming Language: R Objects

FACTOR

  • Factors are a special compoud object used to represent categorical data such as gender, social class, etc.
  • Factors have 'levels' attribute. They may be nominal or ordered.
v <- c("a","b","c","c","b")
x <- factor(v) # turn the character vector into a factor object
z <- factor(v, ordered = TRUE) # ordered factor
x
z
table(x)

R as a Programming Language: R Objects

Function
Functions are also objects in R environment

fun <- function(a,b) {
  a*b
}

fun   
fun(2,3) # a function call

Converting between different types

Use of the as() family of functions. Type as. and wait to see the list of as() functions.

integers <- 1:10
as.character(integers)
as.numeric(c('3.7', '4.8'))
indices <- c(1.7, 2.3)
integers[indices] # sometimes R is too generous
integers[0.999999999] # close to 1 but...
df <- as.data.frame(mat)
df

Working with Data

  • Working with raw data (text files)
  • Data import and export
  • Subsetting
  • Using data frames vs. matrices

Working with Raw Data (text files)

  • Use read.table() to read text files into R
  • Try help document for read.table()

Data Import

  • read.csv() is a special case of read.table()
  • Data import from your local folder
cpds <- read.csv(file.path('.', 'data', 'cpds.csv')) # Comparative Political Data Set (http://www.cpds-data.org/)
head(cpds) # good to look at a few lines
class(cpds) # data.frame
  • Data import from the Internet
data <- read.table(file="http://scicomp.hmc.edu/data/R/normtemp.txt", header=T)
tail(data)

Data Export

  • Use write.table() to write data to a CSV file
write.csv(data, file = "temp.csv", row.names = FALSE) 
  • Save all the objects in current environmet
save.image(file="myenv.RData") 
# Q: How to load them back later?

Subsseting

Operators that can be used to extract subsets of R objects.

  • '[' and ']' always returns an object of the same class as the original; can be used to select more than one element.
  • '[[' and ']]' is used to extract elements of a list or a data frame; it can only be used to extract a single element.
  • $ is used to extract elements of a list or data frame by name.

Subsetting (cont.)

x <- c("a", "b", "c", "c", "d", "a")
x[1]
x[1:4]
x[x > "a"] 
u <- x > "a" # what's u here?
u
x[u] # subsetting using a boolean vector
y <- list(foo=x, bar=x[u]) 
y
y[[1]]
y$bar
subset(mtcars, gear == 5) # use of subset function for data frames

Data frame vs matrix

Consider the following:

  • Same types or different types? Numeric or other type?
  • Convenient using $ with col names?
  • Data size too big? (memory efficiency and size)
m = matrix(1:400000, 2, 200000) # esp. for a large number of columns!
d = as.data.frame(m)
object.size(m) # 1600200 bytes
object.size(d) # 22400568 bytes
  • Conversion between data frame and matrix
    • as.data.frame()
    • as.matrix() or data.matrix() # consider coercion

That's it!

Further Study!