Jeho Park
May 26, 2016
HMC R Bootcamp 2016
DAY 1
Some (i.e., a lot of) materials are adapted from the following websites:
To prepare students to use R to handle and analyze atmospheric sample data.
By the end of this two-day bootcamp, you will be able to:
After SPSS and SAS:
Yay! Good job guys!
demo() # display available demos
demo(graphics) # try graphics demo
library() # show available packages on the computer
search() # show loaded packages
?hist # search for the usage of hist function
??histogram # search for package documents containing the word "histogram"
The R workspace stores objects such as vectors, datasets, and functions in memory, so the space available for computation is limited by the size of the RAM.
a <- 5 # notice a in your Environment window
A <- "text"
a
A
ls()
print(c(a,A))
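print(a, A) # not the same as above: the second argument is matched to 'digits', so A is not printed as a value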
1+1
2+runif(1,0,1)
2+runif(1,min=0,max=1)
3^2
3*3
sqrt(3*3) # comments
# comments are preceded by hash sign
Numerical Integral of
\( \displaystyle\int_0^{\infty} \frac{1}{(x+1)\sqrt{x}}dx \)
integrand <- function(x) {1/((x+1)*sqrt(x))} ## define the integrand
integrate(integrand, lower=0, upper=Inf) ## integrate the function from 0 to infinity
3.141593 with absolute error < 2.7e-05
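The printed value looks like pi, and it is: substituting x = u^2 turns the integral into 2 times the integral of 1/(1+u^2) from 0 to infinity, which equals pi. A minimal sketch that checks the numerical result against R's built-in constant `pi` (the name `res` is introduced here only for illustration):

```r
integrand <- function(x) 1 / ((x + 1) * sqrt(x))
res <- integrate(integrand, lower = 0, upper = Inf)

res$value             # numerical estimate of the integral
res$abs.error         # absolute error estimate reported by integrate()
abs(res$value - pi)   # distance from the exact value, pi
```

integrate() returns a list, so the estimate and its error bound can be reused in later calculations instead of being read off the console.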
The vector is the most basic form of an R object.
Scalar values are vectors of length one.
A vector is a one-dimensional array of data elements of the same type (homogeneous).
class(a)
class(A)
B <- c(a,A) # concatenation function
B # see the values
class(B) # why?
a <- rnorm(10)
a[3:5] <- NA # NA is a missing value
a
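Missing values propagate through most calculations, which matters for the exercises later on. A small sketch of how to detect and skip them (the fixed seed is an assumption added here so the run is reproducible):

```r
set.seed(1)             # assumption: fixed seed for reproducibility
a <- rnorm(10)
a[3:5] <- NA

is.na(a)                # TRUE at positions 3, 4 and 5
sum(is.na(a))           # number of missing values
mean(a)                 # NA: one missing value is enough to make the mean NA
mean(a, na.rm = TRUE)   # mean of the 7 non-missing values
```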
R has five basic or “atomic” classes of objects:
A vector contains a set of data in any one of the atomic classes.
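The five atomic classes usually listed are character, numeric, integer, complex, and logical (assuming this is the list the slide intended). A quick sketch of each, plus what happens when classes are mixed in c():

```r
class("text")     # "character"
class(3.14)       # "numeric"
class(5L)         # "integer" (the L suffix requests an integer)
class(1 + 2i)     # "complex"
class(TRUE)       # "logical"

# Mixing classes in c() coerces every element to the most general class
B <- c(5, "text")
B                 # "5" "text"
class(B)          # "character"
```

This is why class(B) in the earlier example is not numeric: a vector can hold only one class, so the number is converted to a string.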
A matrix is a two-dimensional rectangular object whose data elements are all of the same type (homogeneous).
mat <- matrix(rnorm(6), nrow = 3, ncol = 2)
mat # a matrix
dim(mat) # dimension
t(mat)
summary(mat)
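A short sketch of how a matrix is filled and indexed, using fixed values instead of random ones so the results are predictable (the names `small_mat` and `by_row` are illustrative):

```r
small_mat <- matrix(1:6, nrow = 3, ncol = 2)   # filled column by column
small_mat[2, 1]     # 2: row 2, column 1
small_mat[, 2]      # 4 5 6: the whole second column

by_row <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)   # filled row by row
by_row[2, 1]        # 3

dim(t(small_mat))   # 2 3: transposing swaps rows and columns
```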
A list is an object whose elements can be of different types (vectors, other lists, functions, and so on).
aList <- list(name=c("Joseph"), married=TRUE, kids=2)
aList
aList$kids <- aList$kids+1
aList$kids
aList2 <- list(numeric_data=a,character_data=A)
aList2
allList <- list(aList, aList2)
allList
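List elements can be reached in three ways, and nested lists are reached by chaining them. A minimal sketch, using the illustrative names `person` and `nested` so the objects above are left untouched:

```r
person <- list(name = "Joseph", married = TRUE, kids = 2)

person["name"]      # single brackets return a list of length 1
person[["name"]]    # double brackets return the element itself
person$name         # same as person[["name"]]

nested <- list(person, list(x = 1:3))
nested[[1]]$kids    # 2
nested[[2]]$x[3]    # 3
```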
A data frame is a list of vectors of equal length with possibly different types. It is used for storing rectangular data tables.
n <- c(2, 3, 5) # a vector
s <- c("aa", "bb", "cc") # a vector
b <- c(TRUE, FALSE, TRUE) # a vector
df <- data.frame(n, s, b) # a data frame
df
class(df$s) # factor in R < 4.0.0 (stringsAsFactors defaulted to TRUE); character in R >= 4.0.0
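Whether strings become factors depends on the R version, so it is safer to say what you want. A sketch that sets stringsAsFactors explicitly (the names `df_chr` and `df_fct` are illustrative):

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")

df_chr <- data.frame(n, s, stringsAsFactors = FALSE)
df_fct <- data.frame(n, s, stringsAsFactors = TRUE)

class(df_chr$s)   # "character"
class(df_fct$s)   # "factor"
str(df_fct)       # compact overview of column types
```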
mtcars # a built-in (attached) data frame
mtcars$mpg
myFrame <- data.frame(y1=rnorm(100),y2=rnorm(100), y3=rnorm(100))
head(myFrame) # display first few lines of data
names(myFrame) # display column names
summary(myFrame) # output depends on the data types
plot(myFrame)
myFrame2 <- read.table(file="http://scicomp.hmc.edu/data/R/Rtest.txt", header=T, sep=",")
myFrame2
v <- c("a","b","c","c","b")
x <- factor(v) # turn the character vector into a factor object
z <- factor(v, ordered = TRUE) # ordered factor
x
z
table(x)
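A factor stores integer codes plus a set of labels (levels). The sketch below looks at both and shows what an ordered factor adds; the `sizes` example and its level order are assumptions made up for illustration:

```r
v <- c("a", "b", "c", "c", "b")
x <- factor(v)
levels(x)        # "a" "b" "c"
as.integer(x)    # 1 2 3 3 2: the integer codes behind the labels

z <- factor(v, ordered = TRUE)
z[1] < z[3]      # TRUE: ordered factors support comparisons

# Levels can be given an explicit order instead of the alphabetical default
sizes <- factor(c("low", "high", "mid"),
                levels = c("low", "mid", "high"), ordered = TRUE)
sizes < "high"   # TRUE FALSE TRUE
```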
The as() family of functions converts objects between types. Type as. and wait for autocompletion to list the available as.*() functions.
integers <- 1:10
as.character(integers)
as.numeric(c('3.7', '4.8'))
indices <- c(1.7, 2.3)
integers[indices] # sometimes R is too generous
integers[0.999999999] # close to 1 but...
df <- as.data.frame(mat)
df
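The fractional-index examples above work because R silently truncates non-integer indices toward zero. A sketch of what happens and of two safer alternatives:

```r
integers <- 1:10
indices <- c(1.7, 2.3)

integers[indices]          # 1 2: indices are truncated to 1 and 2
integers[0.999999999]      # integer(0): truncated to 0, which selects nothing
integers[round(indices)]   # 2 2: round explicitly if that is what you mean

# Compare doubles with a tolerance rather than relying on exact equality
isTRUE(all.equal(0.999999999, 1))   # TRUE
```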
Basics
1) Using the cars dataset, create a vector 'stop_mean' that contains the mean stopping distance.
2) Create a matrix 'cars_mat' by converting the cars data frame.
Challenges
3) Install the 'hflights' package, which contains the hflights dataset.
4) Using the hflights dataset, create a variable called 'x' that contains the mean flight arrival delay. (Hint: NAs)
5) Create a boolean (TRUE/FALSE) vector indicating whether the departure delay is shorter than the arrival delay.
Take a break!
1)
2)
3)
4)
5)
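One possible set of answers, as a sketch rather than the official solution. It assumes that `dist` is the stopping-distance column of cars and that `DepDelay` is the departure-delay column of hflights; `flights` and `dep_shorter` are illustrative names. The hflights part only runs if the package is already installed.

```r
# 1) Mean stopping distance from the built-in cars data frame
stop_mean <- mean(cars$dist)

# 2) Convert the cars data frame to a matrix
cars_mat <- as.matrix(cars)
dim(cars_mat)   # 50 2

# 3) Installing needs a network connection, so it is left commented out
# install.packages("hflights")

if (requireNamespace("hflights", quietly = TRUE)) {
  flights <- hflights::hflights
  # 4) Drop NAs before averaging, otherwise the result is NA
  x <- mean(flights$ArrDelay, na.rm = TRUE)
  # 5) Logical vector; NA wherever either delay is missing
  dep_shorter <- flights$DepDelay < flights$ArrDelay
}
```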
cpds <- read.csv(file.path('.', 'data', 'cpds.csv'))
head(cpds) # good to look at a few lines
class(cpds) # data.frame
data <- read.table(file="http://scicomp.hmc.edu/data/R/normtemp.txt", header=T)
tail(data)
rta <- read.table("./data/RTADataSub.csv", sep = ",", header = TRUE)
dim(rta)
rta[1:5, 1:5]
class(rta)
class(rta$time) # what? let's see ?read.table more carefully
rta2 <- read.table("./data/RTADataSub.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
class(rta2$time)
write.csv(data, file = "temp.csv", row.names = FALSE)
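A quick way to confirm that a file was written as intended is to read it back. A self-contained sketch using a temporary file and a made-up data frame (`tmp`, `out`, and `back` are illustrative names, not part of the bootcamp data):

```r
tmp <- tempfile(fileext = ".csv")   # temporary path, removed at the end
out <- data.frame(id = 1:3, label = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

write.csv(out, file = tmp, row.names = FALSE)
back <- read.csv(tmp, stringsAsFactors = FALSE)

identical(names(out), names(back))   # TRUE: same column names
all.equal(out, back)                 # compares the contents
unlink(tmp)                          # clean up
```

Without row.names = FALSE, write.csv() adds a column of row names, which then comes back as an extra column when the file is read.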
pdf('myplot.pdf', width = 7, height = 7) # call pdf() before calling plot()
x <- rnorm(10); y <- rnorm(10)
plot(x, y)
dev.off()
Operators that can be used to extract subsets of R objects: [ ], [[ ]], and $.
x <- c("a", "b", "c", "c", "d", "a")
x[1]
x[1:4]
x[x > "a"]
u <- x > "a" # what's u here?
u
x[u] # subsetting using a boolean vector
y <- list(foo=x, bar=x[u])
y
y[[1]]
y$bar
subset(mtcars, gear == 5) # use of subset function for data frames
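subset() is a convenience wrapper; the same rows can be selected with a logical vector inside square brackets. A sketch comparing the two forms (`by_subset` and `by_bracket` are illustrative names):

```r
by_subset  <- subset(mtcars, gear == 5)
by_bracket <- mtcars[mtcars$gear == 5, ]   # note the trailing comma: all columns
all.equal(by_subset, by_bracket)

# Combining conditions and choosing columns, both ways
subset(mtcars, gear == 5 & mpg > 20, select = c(mpg, gear))
mtcars[mtcars$gear == 5 & mtcars$mpg > 20, c("mpg", "gear")]
```

One difference to keep in mind: subset() drops rows where the condition is NA, while bracket indexing keeps them as rows of NAs.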
Consider the following:
m = matrix(1:400000, 2, 200000) # esp. for a large number of columns!
d = as.data.frame(m)
object.size(m) # 1600200 bytes (exact figure varies with R version)
object.size(d) # 22400568 bytes (exact figure varies with R version)
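The size gap comes from how the two objects are stored: a matrix is a single vector with a dim attribute, while a data frame is a list with one vector per column, each carrying its own overhead. A smaller sketch of the same comparison (`m_small` and `d_small` are illustrative names):

```r
m_small <- matrix(1:40, nrow = 2, ncol = 20)
d_small <- as.data.frame(m_small)

length(m_small)        # 40: one vector holding every cell
length(d_small)        # 20: a list with one element per column

object.size(m_small)
object.size(d_small)   # larger: per-column overhead plus column names
```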
Basics
1) Try installing a new package (lmtest) from the Console.
2) Try installing another package (ggplot2) from the Packages pane.
Challenges
3) From the hflights dataset, create a subsample of flights whose ArrDelay is greater than 10 hours (600 min) and name it hflightsSub.
4) Set all of the extreme delays in ArrDelay (more than 300 minutes) in hflightsSub to NA.
5) Convert the data frame d back to a matrix named m1 and compare the sizes of m and m1.