R for Statistical Computing

Jeho Park
October 22, 2014

HMC Scientific Computing Workshop Series, Fall 2014

(Slides are made with the R Presentations tool in RStudio.)
(Some materials are adapted from the R-Bootcamp by Jared Knowles: http://jaredknowles.com/r-bootcamp/)

Some Housekeeping Stuff

Real Basics of R

  • What is R?
  • What is not R?
  • Then Why R?
  • What is RStudio?
  • What can you do with it?
  • I want to install it!
  • Look Ma, R can do MATH!
  • Even more Math!

What is R?

  • R is a statistical programming language/environment.
  • R is open source/free.
  • R is widely used/preferred.
  • R is cross-platform.
  • R is hard to learn (really?).

What is not R?

  • S: R's ancestor
  • S-Plus: Commercial; modern implementation of S
  • SAS: Commercial; widely used in commercial analytics.
  • SPSS: Commercial; easy to use; widely used in Social Science.
  • MATLAB: Commercial; can do some Stats.
  • Python: Also can do some Stats; good in text data manipulation.

Then Why R?

  • R community is active and constantly growing
  • R is the most popular statistical tool after SPSS and SAS
  • R has tons of user generated libraries/packages
  • R code is easily shared with others
  • R is constantly improved

R listserv help

R Google Scholar Hit

R Extensions

Asking Questions

What is RStudio?

  • Integrated Development Environment for R
  • Nice combination of GUI and CLI
  • Free and commercial versions
  • 4 main windows, tabs, etc.
  • Version control: Git and SVN
  • R Markdown
  • R Presentation

Let's install them!

  • R for Mac
  • R for Linux
  • R for Windows
  • RStudio Desktop
  • RStudio Server

Google is our friend!

What Can We Do with RStudio?

RStudio

Some R Vocabulary

  • packages are add-on features to R that include data, new functions and methods, and extended capabilities. Think of them as "apps" on your phone. We've already installed several!
  • terminal (the Console in RStudio) is the main window of R where you enter commands
  • scripts are where you store commands to be run in the terminal later, like syntax files in SPSS or .do files in Stata
  • functions are commands that do something to an object in R
  • dataframe is the main element for statistical purposes, an object with rows and columns that includes numbers, factors, and other data types

Some R Vocabulary (cont.)

  • workspace is the working memory of R where all objects are stored
  • vector is the basic unit of data in R
  • symbols are used to name and store objects or to designate operations/functions
  • attributes determine how functions act on objects
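
A small sketch of a few of these terms in action. The object names (v, m) are made up for illustration.

```r
v <- c(first = 1.5, second = 2.5)  # a named numeric vector
attributes(v)                      # the names are stored as an attribute
length(v)

m <- 1:6
dim(m) <- c(2, 3)                  # adding a 'dim' attribute turns the vector into a matrix
is.matrix(m)                       # functions now treat m as a 2 x 3 matrix
m

ls()                               # symbols currently stored in the workspace
```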

Look Ma, R can do Math!

1+1
2+runif(1,0,1)
2+runif(1,min=0,max=1)
3^2
3*3
sqrt(3*3) # comments
# comments are preceded by hash sign

Even More Math!

  • R can take integrals and derivatives, for example:

Numerical Integral of

\( \displaystyle\int_0^{\infty} \frac{1}{(x+1)\sqrt{x}}dx \)

integrand <- function(x) {1/((x+1)*sqrt(x))} ## define the integrand
integrate(integrand, lower=0, upper=Inf) ## integrate the function from 0 to infinity
## 3.142 with absolute error < 2.7e-05
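
Derivatives can be sketched in a similar way. The example below uses base R's D() for a symbolic derivative. The function x^2 is just an illustrative choice, not part of the original slides.

```r
# Symbolic derivative of x^2 with respect to x
dfdx <- D(expression(x^2), "x")
dfdx                        # an unevaluated expression: 2 * x
eval(dfdx, list(x = 3))     # evaluate the derivative at x = 3

# The integral above, stored so its value can be compared with pi
integrand <- function(x) {1/((x+1)*sqrt(x))}
res <- integrate(integrand, lower=0, upper=Inf)
res$value                   # the exact value of this integral is pi
pi
```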

Some General Stuff

demo() # display available demos
demo(graphics) # try graphics demo
library() # show available packages on the computer
search() # show loaded packages
?hist # search for the usage of hist function
??histogram # search for package documents containing the word "histogram"

Workspace of R

The R workspace stores objects like vectors, datasets and functions in memory (the available space for calculation is limited by the size of the RAM).

a <- 5 # notice a in your Environment window
A <- "text" 
a
A
ls()
print(c(a,A))
print(a,A) # error: print() takes one object; the second argument is read as 'digits'

R as a Programming Language: R Objects

VECTOR (homogeneous)
A vector is a one-dimensional collection of elements that all have the same data type.

class(a)
class(A)
B <- c(a,A) # concatenation
print(B)
class(B) # why?
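
Why is B a character vector? Because a vector holds only one type, c() coerces its arguments to the most flexible type present. A short sketch (a and A are redefined here so the block runs on its own):

```r
a <- 5
A <- "text"
B <- c(a, A)
B                  # the number 5 has become the string "5"
class(B)           # "character"

# Coercion order: logical -> integer -> double -> character
class(c(TRUE, 2L))   # "integer"
class(c(1L, 2.5))    # "numeric"
class(c(1.5, "x"))   # "character"
```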

R as a Programming Language: R Objects

LIST (heterogeneous)
A list is an object whose elements can have different types and lengths (vectors, other lists, even functions).

aList <- list(name=c("Joseph"), married=TRUE, kids=2)
aList
aList$kids <- aList$kids+1
aList$kids
aList <- list(numeric_data=a,character_data=A)
aList
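
List elements can be reached in more than one way. A minimal sketch; the list name person is made up for illustration:

```r
person <- list(name = "Joseph", married = TRUE, kids = 2)

person$kids            # access by name with $
person[["kids"]]       # double brackets return the element itself
sub <- person["kids"]  # single brackets return a list of length 1
class(sub)             # "list"
str(person)            # compact overview of the whole list
```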

R as a Programming Language: R Objects

Data Frame
A data frame is used for storing data tables. It is a list of vectors of equal length.

n <- c(2, 3, 5) # a vector 
s <- c("aa", "bb", "cc") # a vector
b <- c(TRUE, FALSE, TRUE) # a vector
df <- data.frame(n, s, b) # a data frame
df
mtcars # a built-in (attached) data frame
mtcars$mpg
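
Rows and columns of a data frame can also be picked out with [row, column] indexing. A sketch that rebuilds the small data frame from above so it runs on its own:

```r
df <- data.frame(n = c(2, 3, 5),
                 s = c("aa", "bb", "cc"),
                 b = c(TRUE, FALSE, TRUE))

df[2, ]          # second row
df[, "n"]        # column n as a vector
df[df$b, ]       # only the rows where b is TRUE
nrow(df)         # number of rows
ncol(df)         # number of columns
mean(mtcars$mpg) # columns of built-in data frames work the same way
```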

R as a Programming Language: R Objects

Data Frame (cont.)

myFrame <- data.frame(y1=rnorm(100),y2=rnorm(100), y3=rnorm(100))
head(myFrame) # display first few lines of data
names(myFrame) # display column names
summary(myFrame) # output depends on the data types
plot(myFrame)
myFrame2 <- read.table(file="http://scicomp.hmc.edu/data/R/Rtest.txt", header=TRUE, sep=",")
myFrame2

R as a Programming Language: R Objects

FACTOR

  • Factors are a special compound object used to represent categorical data such as gender, social class, etc.
  • Factors have a 'levels' attribute. They may be nominal or ordered.

v <- c("a","b","c","c","b")
x <- factor(v) # turn the character vector into a factor object
z <- factor(v, ordered = TRUE) # ordered factor
x
z
table(x)
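
A quick look inside a factor, plus an ordered factor with explicitly chosen levels. The size example is hypothetical and only meant to show how the ordering is used.

```r
v <- c("a","b","c","c","b")
x <- factor(v)
levels(x)          # "a" "b" "c"
as.integer(x)      # internal integer codes: 1 2 3 3 2

# Ordered factor: the order of 'levels' defines the ranking
size <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)
size
size < "high"      # comparisons are allowed for ordered factors
```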

Classical Tests (1)

Single Sample Tests

Our questions for the sample might be:

  • What is the average (mean)?
  • Is the mean value significantly different from current expectation?
  • What is the level of uncertainty associated with our estimate of the mean value?
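
These three questions map onto a one-sample t test. A sketch on simulated data; the sample, the seed, and the hypothesized mean of 2 are assumptions for illustration, not workshop data.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
sample_y <- rnorm(50, mean = 2.4, sd = 0.5)  # simulated sample

mean(sample_y)                    # what is the average?
res <- t.test(sample_y, mu = 2)   # is the mean different from the expected value 2?
res$p.value
res$conf.int                      # uncertainty: 95% confidence interval for the mean
```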

Classical Tests (2)

Single Sample Tests

To make valid inferences, we need to establish some facts about the distribution of the data:

  • Are the values normally distributed or not?
  • Are there outliers in the data? (Can we safely exclude them?)
  • If data were collected over a period of time, is there evidence for serial correlation? (Do we need to do time series analysis?)

Depending on the answers, standard parametric tests like Student's t test might not be applicable, and we may have to turn to non-parametric techniques.
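
A sketch of how these checks might look in base R, using simulated heavy-tailed data (an assumption for illustration). The Wilcoxon signed rank test stands in here as one possible non-parametric alternative to the t test.

```r
set.seed(2)                            # arbitrary seed, for reproducibility only
heavy <- rt(40, df = 2)                # simulated heavy-tailed, symmetric data

boxplot.stats(heavy)$out               # points the boxplot rule flags as outliers
acf_vals <- acf(heavy, plot = FALSE)   # sample autocorrelations (serial correlation check)
acf_vals$acf[2]                        # lag-1 autocorrelation

np <- wilcox.test(heavy, mu = 0)       # Wilcoxon signed rank test; H0: centered at 0
np$p.value
```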

Classical Tests (3)

Single Sample Tests

data <- read.table(file="http://scicomp.hmc.edu/data/R/das.txt", header=TRUE)
names(data)
attach(data)
par(mfrow=c(2,2))
plot(y)
boxplot(y)
hist(y,main="")
y2 <- y
y2[52] <- 21.75
plot(y2)
dev.off() # close the graphics device (this also resets the par settings)

Classical Tests (4)

Single Sample Tests

summary(y)
# Graphical test for normality
qqnorm(y)
qqline(y,lty=2)
# Test for normality
x <- exp(rnorm(30)) # lognormally distributed 
shapiro.test(x) # look at the p value

The null hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level (e.g., p < 0.05), the null hypothesis is rejected and there is evidence that the data tested are not from a normally distributed population.
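
To see the decision rule at work, the sketch below runs the test on one normal and one lognormal sample and compares each p-value with alpha. The samples and the seed are arbitrary, so the actual p-values will vary.

```r
set.seed(3)                               # arbitrary seed, for reproducibility only
normal_sample    <- rnorm(30)             # drawn from a normal distribution
lognormal_sample <- exp(rnorm(30))        # drawn from a lognormal distribution

p_normal    <- shapiro.test(normal_sample)$p.value
p_lognormal <- shapiro.test(lognormal_sample)$p.value

alpha <- 0.05
c(normal = p_normal, lognormal = p_lognormal)
c(normal = p_normal < alpha, lognormal = p_lognormal < alpha)  # TRUE means reject normality
```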

That's it!

Please submit your R environment file at http://bit.ly/hmc-r-workshop-homework (a digital badge requirement).

The file has the .RData extension. To attach it to the form, change the extension to .txt.
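
One way to create and rename the file from within R, as a sketch. The file name workshop.RData and the use of a temporary directory are assumptions; save it wherever is convenient.

```r
a <- 5                                                # something to store in the workspace
rdata_file <- file.path(tempdir(), "workshop.RData")  # hypothetical file name
save.image(file = rdata_file)                         # write all workspace objects to the file

txt_file <- sub("\\.RData$", ".txt", rdata_file)      # same name, .txt extension
file.rename(rdata_file, txt_file)
file.exists(txt_file)
# load(txt_file) would restore the objects; load() does not depend on the extension
```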

Useful links: