R for Big Data Analytics and Data Science

Getting Started

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To download R, please choose your preferred CRAN mirror.

Introduction to R

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes

• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either on-screen or on hardcopy, and
• a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.

Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution and many more are available through the CRAN family of Internet sites covering a very wide range of modern statistics.

R has its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both on-line in a number of formats and in hardcopy.

R Console, R commander (Rcmdr) and RStudio

Once you have downloaded and installed R with the default settings, you can launch it from your Start Menu. The window that opens is the R Console, where you type R commands and call functions. Alternatively, you may use R Commander.

R provides a powerful and comprehensive system for analysing data, and when used in conjunction with R Commander (a graphical user interface, commonly known as Rcmdr) it also provides one that is easy and intuitive to use. R is the engine that carries out the analyses, and Rcmdr is a convenient way for users to input commands. Rcmdr gives analysts access to a selection of commonly used R commands through a simple interface that should be familiar to most computer users. It also helps users learn the R commands behind each menu action and so develop their expertise with the command line, an important skill for those wishing to exploit the full power of the program.

To download RStudio Desktop (Free Version), visit https://www.rstudio.com/products/rstudio/download3/, and to learn about using the RStudio IDE, visit http://dss.princeton.edu/training/RStudio101.pdf.

R Help: help() and ?

The help() function and ? help operator in R provide access to the documentation pages for R functions, data sets, and other objects, both for packages in the standard R distribution and for contributed packages. To access documentation for the standard lm (linear model) function, for example, enter the command help(lm) or help("lm"), or ?lm or ?"lm" (i.e., the quotes are optional). To access help for a function in a package that’s not currently loaded, specify in addition the name of the package: for example, to obtain documentation for the rlm() (robust linear model) function in the MASS package, enter help(rlm, package="MASS"). Standard names in R consist of upper- and lower-case letters, numerals (0-9), underscores (_), and periods (.), and must begin with a letter or a period. To obtain help for an object with a non-standard name (such as the help operator ?), the name must be quoted: for example, help("?") or ?"?". You may also use the help() function to access information about a package in your library, for example help(package="MASS"), which displays an index of available help pages for the package along with some other information. Help pages for functions usually include a section with executable examples illustrating how the functions work. You can execute these examples in the current R session via the example() command: e.g., example(lm).

Vignettes and Code Demonstrations: browseVignettes(), vignette() and demo()

Many packages include vignettes, which are discursive documents meant to illustrate and explain facilities in the package. You can discover vignettes by accessing the help page for a package, or via the browseVignettes() function: the command browseVignettes() opens a list of vignettes from all of your installed packages in your browser, while browseVignettes(package=package-name) (e.g., browseVignettes(package="survival")) shows the vignettes, if any, for a particular package. vignette() is employed similarly, but displays a list of vignettes in text form. You can also use the vignette("vignette-name") command to view a vignette (possibly specifying the name of the package in which the vignette resides, if the vignette name is not unique): for example, vignette("timedep") or vignette("timedep", package="survival") (which are, in this case, equivalent). Vignettes may also be accessed from the CRAN page for the package (e.g. survival), if you wish to review the vignette for a package prior to installing and/or using it. Packages may also include extended code demonstrations (“demos”). The command demo() lists all demos for all packages in your library, while demo(package="package-name") (e.g., demo(package="stats")) lists demos in a particular package. To run a demo, call the demo() function with the quoted name of the demo (e.g., demo("nlm")), specifying the name of the package if the name of the demo isn’t unique (e.g., demo("nlm", package="stats"), where, in this case, the package name need not be given explicitly).

Searching for Help Within R

The help() function and ? operator are useful only if you already know the name of the function that you wish to use. There are also facilities in the standard R distribution for discovering functions and other objects. The following functions cast a progressively wider net. Use the help system to obtain complete documentation for these functions: for example, ?apropos.

apropos()

The apropos() function searches for objects, including functions, directly accessible in the current R session that have names that include a specified character string. This may be a literal string or a regular expression to be used for pattern-matching (see ?"regular expression"). By default, string matching by apropos() is case-insensitive. For example, apropos("^glm") returns the names of all accessible objects that start with the (case-insensitive) characters “glm”.
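
As a small illustration of the search just described, the sketch below calls apropos() once with a regular expression and once with a literal string restricted to functions. The variable names are only for illustration, and the exact names returned depend on which packages are attached in your session, so no output is shown.

```r
# Names of accessible objects that start with "glm"
# (matching is case-insensitive by default)
glm_hits <- apropos("^glm")

# Literal string, functions only, case-sensitive
mean_funs <- apropos("mean", mode = "function", ignore.case = FALSE)

print(glm_hits)
print(mean_funs)
```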

help.search() and ??

The help.search() function scans the documentation for packages installed in your library. The (first) argument to help.search() is a character string or regular expression. For example, help.search("^glm") searches for help pages, vignettes, and code demos that have help “aliases,” “concepts,” or titles that begin (case-insensitively) with the characters “glm”. The ?? operator is a synonym for help.search(): for example, ??"^glm".

RSiteSearch()

RSiteSearch() uses an internet search engine (also see below) to search for information in function help pages and vignettes for all CRAN packages, and in CRAN task views (described below). Unlike the apropos() and help.search() functions, RSiteSearch() requires an active internet connection and doesn’t employ regular expressions. Braces may be used to specify multi-word terms; otherwise matches for individual words are included. For example, RSiteSearch("{generalized linear model}") returns information about R functions, vignettes, and CRAN task views related to the term “generalized linear model” without matching the individual words “generalized”, “linear”, or “model”.

findFn() and ??? in the sos package, which is not part of the standard R distribution but is available on CRAN, provide an alternative interface to RSiteSearch().

help.start()

help.start() starts and displays a hypertext based version of R’s online documentation in your default browser that provides links to locally installed versions of the R manuals, a listing of your currently installed packages and other documentation resources.

R Help on the Internet

There are internet search sites that are specialized for R searches, including search.r-project.org (which is the site used by RSiteSearch) and Rseek.org.

It is also possible to use a general search site like Google, by qualifying the search with “R” or the name of an R package (or both). It can be particularly helpful to paste an error message into a search engine to find out whether others have solved a problem that you encountered.

CRAN Task Views

CRAN Task Views are documents that summarize R resources on CRAN in particular areas of application, helping you to navigate the maze of thousands of CRAN packages. A list of available Task Views may be found on CRAN.

R FAQs (Frequently Asked Questions)

There are three primary FAQ listings which are periodically updated to reflect very commonly asked questions by R users. There is a Main R FAQ, a Windows specific R FAQ and a Mac OS (OS X) specific R FAQ.

Asking for Help

If you find that you can’t answer a question or solve a problem yourself, you can ask others for help, either locally (if you know someone who is knowledgeable about R) or on the internet. In order to ask a question effectively, it helps to phrase the question clearly, and, if you’re trying to solve a problem, to include a small, self-contained, reproducible example of the problem that others can execute. For information on how to ask questions, see, e.g., the R mailing list posting guide, and the document about how to create reproducible examples for R on Stack Overflow.
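
One possible shape for such a reproducible example is sketched below. The data frame small and its values are hypothetical; the point is only that dput() turns a small object into code that others can paste, and that sessionInfo() reports the R version and loaded packages.

```r
# A tiny data set that shows the problem (hypothetical values)
small <- data.frame(id = 1:3, score = c(2.5, NA, 4.0))

# dput() prints R code that recreates the object;
# paste that text into your question instead of a screenshot
dput(small)

# The call that misbehaves, reduced to one line
mean(small$score)                 # NA, because of the missing value
mean(small$score, na.rm = TRUE)   # 3.25

# R version, platform and loaded packages
sessionInfo()
```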

Stack Overflow

Stack Overflow is a well organized and formatted site for help and discussions about programming. It has excellent searchability. Topics are tagged, and “r” is a very popular tag on the site with almost 150,000 questions (as of summer 2016). To go directly to R-related topics, visit http://stackoverflow.com/questions/tagged/r. For an example both of the value of the site’s organization and information that is very useful to R users, see “How to make a great R reproducible example?”, which is also mentioned above.

R Email Lists

The R Project maintains a number of subscription-based email lists for posing and answering questions about R, including the general R-help email list, the R-devel list for R code development, and R-package-devel list for developers of CRAN packages; lists for announcements about R and R packages; and a variety of more specialized lists. Before posing a question on one of these lists, please read the R mailing list instructions and the posting guide.

Creating Datasets in R

#  Creating vectors

a <- c(1, 2, 5, 3, 6, -2, 4)
b <- c("one", "two", "three")
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

# selecting specific datapoints

a <- c(1, 2, 5, 3, 6, -2, 4)
a[3]

## [1] 5

a[c(1, 3, 5)]

## [1] 1 5 6

a[2:6]

## [1]  2  5  3  6 -2
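
Beyond positive positions, vectors can also be indexed with negative positions and with logical conditions. The following is a minimal sketch using the same vector a as above.

```r
a <- c(1, 2, 5, 3, 6, -2, 4)

a[-1]          # everything except the first element: 2 5 3 6 -2 4
a[a > 3]       # elements greater than 3: 5 6 4
which(a > 3)   # positions of those elements: 3 5 7
a[length(a)]   # last element: 4
```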

# Creating Matrices

y <- matrix(1:20, nrow = 5, ncol = 4)
y

##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20

cells <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE, 
    dimnames = list(rnames, cnames))
mymatrix

##    C1 C2
## R1  1 26
## R2 24 68

mymatrix <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE, 
    dimnames = list(rnames, cnames))
mymatrix

##    C1 C2
## R1  1 24
## R2 26 68

# selecting elements

x <- matrix(1:10, nrow = 2)
x

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10

x[2, ]

## [1]  2  4  6  8 10

x[, 2]

## [1] 3 4

x[1, 4]

## [1] 7

x[1, c(4, 5)]

## [1] 7 9
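
Note that selecting a single row or column returns a plain vector, as the output above shows. If the result should stay a matrix, the drop = FALSE argument can be used; a small sketch with the same matrix x (the names col2 and col2_m are illustrative):

```r
x <- matrix(1:10, nrow = 2)

col2   <- x[, 2]                 # plain vector; dimensions are dropped
col2_m <- x[, 2, drop = FALSE]   # still a 2 x 1 matrix

dim(col2)     # NULL
dim(col2_m)   # 2 1
```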

# Creating an array

dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, 
    dim2, dim3))
z

## , , C1
## 
##    B1 B2 B3
## A1  1  3  5
## A2  2  4  6
## 
## , , C2
## 
##    B1 B2 B3
## A1  7  9 11
## A2  8 10 12
## 
## , , C3
## 
##    B1 B2 B3
## A1 13 15 17
## A2 14 16 18
## 
## , , C4
## 
##    B1 B2 B3
## A1 19 21 23
## A2 20 22 24

# Creating a dataframe

patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")
patientdata <- data.frame(patientID, age, diabetes, 
    status)
patientdata

##   patientID age diabetes    status
## 1         1  25    Type1      Poor
## 2         2  34    Type2  Improved
## 3         3  28    Type1 Excellent
## 4         4  52    Type1      Poor

# Specifying elements of a dataframe

patientdata[1:2]

##   patientID age
## 1         1  25
## 2         2  34
## 3         3  28
## 4         4  52

patientdata[c("diabetes", "status")]

##   diabetes    status
## 1    Type1      Poor
## 2    Type2  Improved
## 3    Type1 Excellent
## 4    Type1      Poor

patientdata$age

## [1] 25 34 28 52
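
The examples above select columns. Rows can be selected in the same bracket notation by placing a condition before the comma. A minimal sketch, rebuilding patientdata so the block runs on its own (the name older is illustrative):

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

# Rows where age exceeds 30, keeping two columns
older <- patientdata[patientdata$age > 30, c("patientID", "age")]
older

# A single cell: row 3, column "status"
patientdata[3, "status"]

# [[ ]] extracts one column as a vector, like $
patientdata[["age"]]
```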

# Using factors

patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")
diabetes <- factor(diabetes)
status <- factor(status, ordered = TRUE)
patientdata <- data.frame(patientID, age, diabetes, 
    status)
str(patientdata)

## 'data.frame':    4 obs. of  4 variables:
##  $ patientID: num  1 2 3 4
##  $ age      : num  25 34 28 52
##  $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
##  $ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3

summary(patientdata)

##    patientID         age         diabetes       status 
##  Min.   :1.00   Min.   :25.00   Type1:3   Excellent:1  
##  1st Qu.:1.75   1st Qu.:27.25   Type2:1   Improved :1  
##  Median :2.50   Median :31.00             Poor     :2  
##  Mean   :2.50   Mean   :34.75                          
##  3rd Qu.:3.25   3rd Qu.:38.50                          
##  Max.   :4.00   Max.   :52.00
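
In the str() output above, the ordered factor status has its levels in alphabetical order ("Excellent" < "Improved" < "Poor"), which is probably not the intended ranking. A sketch of how the levels argument of factor() can set the order explicitly, assuming Poor < Improved < Excellent is what is wanted (the name status_f is illustrative):

```r
status <- c("Poor", "Improved", "Excellent", "Poor")

status_f <- factor(status, ordered = TRUE,
                   levels = c("Poor", "Improved", "Excellent"))

levels(status_f)       # "Poor" "Improved" "Excellent"
as.integer(status_f)   # 1 2 3 1
status_f > "Poor"      # FALSE TRUE TRUE FALSE
```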

# Creating a list

g <- "My First List"
h <- c(25, 26, 18, 39)
j <- matrix(1:10, nrow = 5)
k <- c("one", "two", "three")
mylist <- list(title = g, ages = h, j, k)
mylist

## $title
## [1] "My First List"
## 
## $ages
## [1] 25 26 18 39
## 
## [[3]]
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
## 
## [[4]]
## [1] "one"   "two"   "three"
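
Components of a list can be retrieved by position or by name. The sketch below rebuilds the same list and shows the difference between double brackets (the component itself) and single brackets (a sub-list).

```r
mylist <- list(title = "My First List",
               ages  = c(25, 26, 18, 39),
               matrix(1:10, nrow = 5),
               c("one", "two", "three"))

mylist[[2]]         # second component, by position
mylist[["ages"]]    # same component, by name
mylist$ages         # shorthand for the line above

class(mylist[2])    # "list": single brackets return a sub-list
mylist[[3]][2, 2]   # 7: row 2, column 2 of the matrix component
```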

Data Management

# Creating the leadership data frame

manager <- c(1, 2, 3, 4, 5)
date <- c("10/24/08", "10/28/08", "10/1/08", "10/12/08", 
    "5/1/09")
gender <- c("M", "F", "F", "M", "F")
age <- c(32, 45, 25, 39, 99)
q1 <- c(5, 3, 3, 3, 2)
q2 <- c(4, 5, 5, 3, 2)
q3 <- c(5, 2, 5, 4, 1)
q4 <- c(5, 5, 5, NA, 2)
q5 <- c(5, 5, 2, NA, 1)
leadership <- data.frame(manager, date, gender, age, 
    q1, q2, q3, q4, q5, stringsAsFactors = FALSE)

# the individual vectors are no longer needed
rm(manager, date, gender, age, q1, q2, q3, q4, q5)

# Creating new variables

mydata <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 
    4, 2, 8))
mydata$sumx <- mydata$x1 + mydata$x2
mydata$meanx <- (mydata$x1 + mydata$x2)/2

attach(mydata)
mydata$sumx <- x1 + x2
mydata$meanx <- (x1 + x2)/2
detach(mydata)

mydata <- transform(mydata, sumx = x1 + x2, meanx = (x1 + 
    x2)/2)

# Recoding variables

leadership$agecat[leadership$age > 75] <- "Elder"
leadership$agecat[leadership$age >= 55 & 
    leadership$age <= 75] <- "Middle Aged"
leadership$agecat[leadership$age < 55] <- "Young"

# or more compactly

leadership <- within(leadership, {
    agecat <- NA
    agecat[age > 75] <- "Elder"
    agecat[age >= 55 & age <= 75] <- "Middle Aged"
    agecat[age < 55] <- "Young"
})
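
Another way to express the same recoding is a nested ifelse(), sketched here with the 55 and 75 cut-offs from the within() version and a stand-alone age vector (with one NA added to show that missing ages stay missing).

```r
age <- c(32, 45, 25, 39, 99, NA)

# Conditions are checked from the outside in; NA ages remain NA
agecat <- ifelse(age > 75, "Elder",
          ifelse(age >= 55, "Middle Aged", "Young"))
agecat
table(agecat, useNA = "ifany")
```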

# Renaming variables with the reshape package
library(reshape)
rename(leadership, c(manager = "managerID", date = "testDate"))

##   managerID testDate gender age q1 q2 q3 q4 q5 agecat
## 1         1 10/24/08      M  32  5  4  5  5  5  Young
## 2         2 10/28/08      F  45  3  5  2  5  5  Young
## 3         3  10/1/08      F  25  3  5  5  5  2  Young
## 4         4 10/12/08      M  39  3  3  4 NA NA  Young
## 5         5   5/1/09      F  99  2  2  1  2  1  Elder
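
The rename() call above prints a renamed copy but does not store it, so leadership itself keeps its original column names. If an extra package is not wanted, the same renaming can be done in base R by assigning to names(). A minimal sketch on a cut-down, hypothetical copy of the data (leadership_demo):

```r
leadership_demo <- data.frame(manager = 1:2,
                              date = c("10/24/08", "10/28/08"),
                              q1 = c(5, 3))

# Replace individual names by matching on the old name
names(leadership_demo)[names(leadership_demo) == "manager"] <- "managerID"
names(leadership_demo)[names(leadership_demo) == "date"] <- "testDate"

names(leadership_demo)   # "managerID" "testDate" "q1"
```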

# Applying the is.na() function
is.na(leadership[, 6:10])

##         q2    q3    q4    q5 agecat
## [1,] FALSE FALSE FALSE FALSE  FALSE
## [2,] FALSE FALSE FALSE FALSE  FALSE
## [3,] FALSE FALSE FALSE FALSE  FALSE
## [4,] FALSE FALSE  TRUE  TRUE  FALSE
## [5,] FALSE FALSE FALSE FALSE  FALSE

# recode 99 to missing for the variable age
leadership[leadership$age == 99, "age"] <- NA
leadership

##   manager     date gender age q1 q2 q3 q4 q5 agecat
## 1       1 10/24/08      M  32  5  4  5  5  5  Young
## 2       2 10/28/08      F  45  3  5  2  5  5  Young
## 3       3  10/1/08      F  25  3  5  5  5  2  Young
## 4       4 10/12/08      M  39  3  3  4 NA NA  Young
## 5       5   5/1/09      F  NA  2  2  1  2  1  Elder

# Using na.omit() to delete incomplete observations
newdata <- na.omit(leadership)
newdata

##   manager     date gender age q1 q2 q3 q4 q5 agecat
## 1       1 10/24/08      M  32  5  4  5  5  5  Young
## 2       2 10/28/08      F  45  3  5  2  5  5  Young
## 3       3  10/1/08      F  25  3  5  5  5  2  Young
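
Missing values also propagate through calculations. Instead of deleting rows, many functions accept na.rm = TRUE, and complete.cases() reports which rows na.omit() would keep. A short sketch using the q4 values and ages from above in a stand-alone data frame df (an illustrative name):

```r
q4 <- c(5, 5, 5, NA, 2)

sum(q4)                  # NA
sum(q4, na.rm = TRUE)    # 17
mean(q4, na.rm = TRUE)   # 4.25

df <- data.frame(age = c(32, 45, 25, 39, NA), q4 = q4)
complete.cases(df)         # TRUE TRUE TRUE FALSE FALSE
df[complete.cases(df), ]   # the same rows that na.omit(df) keeps
```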

# Working with Dates

mydates <- as.Date(c("2007-06-22", "2004-02-13"))

# Converting character values to dates

strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")

myformat <- "%m/%d/%y"
leadership$date <- as.Date(leadership$date, myformat)

# Useful date functions

Sys.Date()

## [1] "2016-11-14"

date()

## [1] "Mon Nov 14 21:02:42 2016"

today <- Sys.Date()
format(today, format = "%B %d %Y")

## [1] "November 14 2016"

format(today, format = "%A")

## [1] "Monday"

# Calculations with dates

startdate <- as.Date("2004-02-13")
enddate <- as.Date("2009-06-22")
days <- enddate - startdate
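
Subtracting two dates gives a difftime object measured in days. The sketch below prints it, converts it to a plain number, and uses difftime() and seq() for other units and for date sequences.

```r
startdate <- as.Date("2004-02-13")
enddate <- as.Date("2009-06-22")

days <- enddate - startdate
days               # Time difference of 1956 days
as.numeric(days)   # 1956

# The same interval in weeks
difftime(enddate, startdate, units = "weeks")

# Three dates one month apart, starting at startdate
seq(startdate, by = "month", length.out = 3)
```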

# Date functions and formatted printing

today <- Sys.Date()
format(today, format = "%B %d %Y")

## [1] "November 14 2016"

dob <- as.Date("1956-10-10")
format(dob, format = "%A")

## [1] "Wednesday"

# Converting from one data type to another

a <- c(1, 2, 3)
a

## [1] 1 2 3

is.numeric(a)

## [1] TRUE

is.vector(a)

## [1] TRUE

a <- as.character(a)
a

## [1] "1" "2" "3"

is.numeric(a)

## [1] FALSE

is.vector(a)

## [1] TRUE

is.character(a)

## [1] TRUE
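
Conversions do not always succeed. When a character value cannot be read as a number, as.numeric() returns NA for it and issues a warning. A small sketch (the warning is suppressed here only to keep the output clean):

```r
x <- c("1", "two", "3")

# "two" cannot be parsed, so it becomes NA
num <- suppressWarnings(as.numeric(x))
num          # 1 NA 3
is.na(num)   # FALSE TRUE FALSE

# Logical values convert to 1 and 0
as.numeric(c(TRUE, FALSE, TRUE))   # 1 0 1
```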

# Sorting a dataset

attach(leadership)
newdata <- leadership[order(age), ]
newdata

##   manager       date gender age q1 q2 q3 q4 q5 agecat
## 3       3 2008-10-01      F  25  3  5  5  5  2  Young
## 1       1 2008-10-24      M  32  5  4  5  5  5  Young
## 4       4 2008-10-12      M  39  3  3  4 NA NA  Young
## 2       2 2008-10-28      F  45  3  5  2  5  5  Young
## 5       5 2009-05-01      F  NA  2  2  1  2  1  Elder

detach(leadership)

attach(leadership)
newdata <- leadership[order(gender, -age), ]
newdata

##   manager       date gender age q1 q2 q3 q4 q5 agecat
## 2       2 2008-10-28      F  45  3  5  2  5  5  Young
## 3       3 2008-10-01      F  25  3  5  5  5  2  Young
## 5       5 2009-05-01      F  NA  2  2  1  2  1  Elder
## 4       4 2008-10-12      M  39  3  3  4 NA NA  Young
## 1       1 2008-10-24      M  32  5  4  5  5  5  Young

detach(leadership)
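
The same sorting can be written without attach(), either by naming the data frame explicitly or by using with(). A sketch on a stand-alone data frame df (illustrative) holding the gender and age values shown above:

```r
df <- data.frame(gender = c("M", "F", "F", "M", "F"),
                 age = c(32, 45, 25, 39, NA))

# Gender ascending, then age descending (missing ages sort last)
sorted <- df[order(df$gender, -df$age), ]
sorted

# with() evaluates the expression inside the data frame;
# na.last = NA drops rows whose sort key is missing
by_age <- with(df, df[order(age, na.last = NA), ])
by_age
```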

# Selecting variables

# q1 to q5 are columns 5 to 9 of this data frame
newdata <- leadership[, c(5:9)]

myvars <- c("q1", "q2", "q3", "q4", "q5")
newdata <- leadership[myvars]

myvars <- paste("q", 1:5, sep = "")
newdata <- leadership[myvars]

# Dropping variables

myvars <- names(leadership) %in% c("q3", "q4")
newdata <- leadership[!myvars]

newdata <- leadership[c(-7, -8)]

# You could use the following to delete q3 and q4
# from the leadership dataset (commented out so 
# the rest of the code in this file will work)
#
# leadership$q3 <- leadership$q4 <- NULL

# Selecting Observations

newdata <- leadership[1:3, ]

newdata <- leadership[which(leadership$gender == "M" & 
    leadership$age > 30), ]

# with the data frame attached, columns can be named directly
attach(leadership)
newdata <- leadership[which(gender == "M" & age > 30), ]
detach(leadership)

# Selecting observations based on dates

leadership$date <- as.Date(leadership$date, "%m/%d/%y")
startdate <- as.Date("2009-01-01")
enddate <- as.Date("2009-10-31")
newdata <- leadership[leadership$date >= startdate & 
    leadership$date <= enddate, ]

# Using the subset() function

newdata <- subset(leadership, age >= 35 | age < 24, 
    select = c(q1, q2, q3, q4))
newdata <- subset(leadership, gender == "M" & age > 
    25, select = gender:q4)

Graphical Parameters

# working with graphs

attach(mtcars)
plot(wt, mpg)
abline(lm(mpg ~ wt))
title("Regression of MPG on Weight")

detach(mtcars)

# an example

dose <- c(20, 30, 40, 45, 60)
drugA <- c(16, 20, 27, 40, 60)
drugB <- c(15, 18, 25, 31, 40)
plot(dose, drugA, type = "b")

# Setting line type and plotting symbol with par()

opar <- par(no.readonly = TRUE)
par(lty = 2, pch = 17)
plot(dose, drugA, type = "b")

par(opar)

plot(dose, drugA, type = "b", lty = 2, pch = 17)

plot(dose, drugA, type = "b", lty = 3, lwd = 3, pch = 15, 
    cex = 2)
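
The save-and-restore pattern around par() can also be wrapped in a function so that the settings are restored even if plotting stops with an error. The following is only a sketch; the function name plot_dashed is hypothetical.

```r
plot_dashed <- function(x, y) {
  # par() returns the previous values of the settings it changes
  opar <- par(lty = 2, pch = 17)
  # restore them when the function exits, even after an error
  on.exit(par(opar))
  plot(x, y, type = "b")
  invisible(NULL)
}

dose <- c(20, 30, 40, 45, 60)
drugA <- c(16, 20, 27, 40, 60)

before <- par("lty")
plot_dashed(dose, drugA)
identical(par("lty"), before)   # the line type is back to its earlier value
```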

# Using graphical parameters to control
# graph appearance

dose <- c(20, 30, 40, 45, 60)
drugA <- c(16, 20, 27, 40, 60)
drugB <- c(15, 18, 25, 31, 40)
opar <- par(no.readonly = TRUE)
par(pin = c(2, 3))
par(lwd = 2, cex = 1.5)
par(cex.axis = 0.75, font.axis = 3)
plot(dose, drugA, type = "b", pch = 19, lty = 2, col = "red")

plot(dose, drugB, type = "b", pch = 23, lty = 6, col = "blue", 
    bg = "green")

par(opar)

# Adding text, customized axes, and legends

plot(dose, drugA, type = "b", col = "red", lty = 2, 
    pch = 2, lwd = 2, main = "Clinical Trials for Drug A", 
    sub = "This is hypothetical data", 
    xlab = "Dosage", ylab = "Drug Response", xlim = c(0, 60), 
    ylim = c(0, 70))

# An Example of Custom Axes

x <- c(1:10)
y <- x
z <- 10/x
opar <- par(no.readonly = TRUE)
par(mar = c(5, 4, 4, 8) + 0.1)

plot(x, y, type = "b", pch = 21, col = "red", yaxt = "n", 
    lty = 3, ann = FALSE)
lines(x, z, type = "b", pch = 22, col = "blue", lty = 2)
axis(2, at = x, labels = x, col.axis = "red", las = 2)
axis(4, at = z, labels = round(z, digits = 2), col.axis = "blue", 
    las = 2, cex.axis = 0.7, tck = -0.01)
mtext("y=10/x", side = 4, line = 3, cex.lab = 1, las = 2, 
    col = "blue")
title("An Example of Creative Axes", xlab = "X values", 
    ylab = "Y=X")

par(opar)

# Comparing Drug A and Drug B response by dose

dose <- c(20, 30, 40, 45, 60)
drugA <- c(16, 20, 27, 40, 60)
drugB <- c(15, 18, 25, 31, 40)
opar <- par(no.readonly = TRUE)
par(lwd = 2, cex = 1.5, font.lab = 2)
plot(dose, drugA, type = "b", pch = 15, lty = 1, col = "red", 
    ylim = c(0, 60), main = "Drug A vs. Drug B", xlab = "Drug Dosage", 
    ylab = "Drug Response")
lines(dose, drugB, type = "b", pch = 17, lty = 2, 
    col = "blue")
abline(h = c(30), lwd = 1.5, lty = 2, col = "grey")
library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

minor.tick(nx = 3, ny = 3, tick.ratio = 0.5)
legend("topleft", inset = 0.05, title = "Drug Type", 
    c("A", "B"), lty = c(1, 2), pch = c(15, 17), col = c("red", 
        "blue"))

par(opar)

# Example of labeling points

attach(mtcars)

## The following object is masked from package:ggplot2:
## 
##     mpg

plot(wt, mpg, main = "Mileage vs. Car Weight", xlab = "Weight", 
    ylab = "Mileage", pch = 18, col = "blue")
text(wt, mpg, row.names(mtcars), cex = 0.6, pos = 4, 
    col = "red")

detach(mtcars)

# View font families
opar <- par(no.readonly = TRUE)
par(cex = 1.5)
plot(1:7, 1:7, type = "n")
text(3, 3, "Example of default text")
text(4, 4, family = "mono", "Example of mono-spaced text")
text(5, 5, family = "serif", "Example of serif text")

par(opar)

# combining graphs

attach(mtcars)

## The following object is masked from package:ggplot2:
## 
##     mpg

opar <- par(no.readonly = TRUE)
par(mfrow = c(2, 2))
plot(wt, mpg, main = "Scatterplot of wt vs. mpg")
plot(wt, disp, main = "Scatterplot of wt vs disp")
hist(wt, main = "Histogram of wt")
boxplot(wt, main = "Boxplot of wt")

par(opar)
detach(mtcars)


attach(mtcars)

## The following object is masked from package:ggplot2:
## 
##     mpg

opar <- par(no.readonly = TRUE)
par(mfrow = c(3, 1))
hist(wt)
hist(mpg)
hist(disp)
par(opar)
detach(mtcars)


attach(mtcars)

## The following object is masked from package:ggplot2:
## 
##     mpg

layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)
detach(mtcars)


attach(mtcars)

## The following object is masked from package:ggplot2:
## 
##     mpg

layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE), 
    widths = c(1, 1), heights = c(1, 1))
hist(wt)
hist(mpg)
hist(disp)

detach(mtcars)

# Fine placement of figures in a graph

opar <- par(no.readonly = TRUE)
par(fig = c(0, 0.8, 0, 0.8))
plot(mtcars$wt, mtcars$mpg, xlab = "Car Weight", 
    ylab = "Miles Per Gallon")
par(fig = c(0, 0.8, 0.55, 1), new = TRUE)
boxplot(mtcars$wt, horizontal = TRUE, axes = FALSE)
par(fig = c(0.65, 1, 0, 0.8), new = TRUE)
boxplot(mtcars$mpg, axes = FALSE)
mtext("Enhanced Scatterplot", side = 3, outer = TRUE, 
    line = -3)

par(opar)

Basic Graphs

# pause after each graph
par(ask = TRUE)

# save original graphic settings
opar <- par(no.readonly = TRUE)

# Load vcd package
library(vcd)

## Loading required package: grid

# Get cell counts for improved variable
counts <- table(Arthritis$Improved)
counts

## 
##   None   Some Marked 
##     42     14     28

Bar Plots

# simple bar plot
barplot(counts, main = "Simple Bar Plot", xlab = "Improvement", 
    ylab = "Frequency")

# horizontal bar plot
barplot(counts, main = "Horizontal Bar Plot", xlab = "Frequency", 
    ylab = "Improvement", horiz = TRUE)

# get counts for Improved by Treatment table
counts <- table(Arthritis$Improved, Arthritis$Treatment)
counts

##         
##          Placebo Treated
##   None        29      13
##   Some         7       7
##   Marked       7      21
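
Raw counts can be hard to compare when the groups differ in size. As a sketch, the cross-tabulation printed above is re-entered by hand below (so the block does not need the vcd package) and converted to column proportions with prop.table(); the name col_props is illustrative.

```r
# Counts copied from the table above
counts <- matrix(c(29, 7, 7, 13, 7, 21), nrow = 3,
                 dimnames = list(Improved = c("None", "Some", "Marked"),
                                 Treatment = c("Placebo", "Treated")))

# Proportion of each outcome within each treatment group
col_props <- prop.table(counts, margin = 2)
round(col_props, 2)
colSums(col_props)   # each column sums to 1

# Counts with row and column totals added
addmargins(counts)
```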

# stacked barplot
barplot(counts, main = "Stacked Bar Plot", xlab = "Treatment", 
    ylab = "Frequency", col = c("red", "yellow", "green"), 
    legend = rownames(counts))

# grouped barplot
barplot(counts, main = "Grouped Bar Plot", xlab = "Treatment", 
    ylab = "Frequency", col = c("red", "yellow", "green"), 
    legend = rownames(counts), 
    beside = TRUE)

# Mean bar plots

states <- data.frame(state.region, state.x77)
means <- aggregate(states$Illiteracy, 
    by = list(state.region), 
    FUN = mean)
means

##         Group.1        x
## 1     Northeast 1.000000
## 2         South 1.737500
## 3 North Central 0.700000
## 4          West 1.023077

means <- means[order(means$x), ]
means

##         Group.1        x
## 3 North Central 0.700000
## 1     Northeast 1.000000
## 4          West 1.023077
## 2         South 1.737500

barplot(means$x, names.arg = means$Group.1)
title("Mean Illiteracy Rate")
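
The same group means can be computed with tapply(), which returns a named vector that barplot() can label directly. This is an alternative sketch, not the approach used above.

```r
illiteracy <- state.x77[, "Illiteracy"]

# Mean illiteracy rate per region, sorted from lowest to highest
means <- sort(tapply(illiteracy, state.region, mean))
means

barplot(means, main = "Mean Illiteracy Rate")
```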

# Fitting labels in bar plots

par(mar = c(5, 8, 4, 2))
par(las = 2)
counts <- table(Arthritis$Improved)

barplot(counts, main = "Treatment Outcome", horiz = TRUE, 
    cex.names = 0.8, names.arg = c("No Improvement", 
    "Some Improvement", "Marked Improvement"))

# Spinograms

library(vcd)
attach(Arthritis)
counts <- table(Treatment, Improved)
spine(counts, main = "Spinogram Example")

detach(Arthritis)

# Pie charts

par(mfrow = c(2, 2))
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")

pie(slices, labels = lbls, main = "Simple Pie Chart")

pct <- round(slices/sum(slices) * 100)
lbls2 <- paste(lbls, " ", pct, "%", sep = "")
pie(slices, labels = lbls2, col = rainbow(length(lbls)), 
    main = "Pie Chart with Percentages")

library(plotrix)
pie3D(slices, labels = lbls, explode = 0.1, main = "3D Pie Chart ")

mytable <- table(state.region)
lbls <- paste(names(mytable), "\n", mytable, sep = "")
pie(mytable, labels = lbls, 
    main = "Pie Chart from a Table\n (with sample sizes)")

# restore original graphic parameters
par(opar)

# fan plots
library(plotrix)
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
fan.plot(slices, labels = lbls, main = "Fan Plot")

# Histograms

par(mfrow = c(2, 2))

hist(mtcars$mpg)

hist(mtcars$mpg, breaks = 12, col = "red", 
    xlab = "Miles Per Gallon", 
    main = "Colored histogram with 12 bins")

hist(mtcars$mpg, freq = FALSE, breaks = 12, col = "red", 
    xlab = "Miles Per Gallon", 
    main = "Histogram, rug plot, density curve")
rug(jitter(mtcars$mpg))
lines(density(mtcars$mpg), col = "blue", lwd = 2)

# Histogram with Superimposed Normal Curve 
# (Thanks to Peter Dalgaard)
x <- mtcars$mpg
h <- hist(x, breaks = 12, col = "red", 
    xlab = "Miles Per Gallon", 
    main = "Histogram with normal curve and box")
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)
box()
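
In the code above, the normal density is multiplied by the bin width and the number of observations so that it matches a histogram of counts. An alternative sketch is to draw the histogram on the density scale with freq = FALSE, in which case dnorm() can be overlaid without rescaling.

```r
x <- mtcars$mpg
h <- hist(x, breaks = 12, freq = FALSE, col = "red",
          xlab = "Miles Per Gallon",
          main = "Histogram on the density scale")

# curve() evaluates the expression over the current x-axis range
curve(dnorm(x, mean = mean(mtcars$mpg), sd = sd(mtcars$mpg)),
      add = TRUE, col = "blue", lwd = 2)

# On the density scale the bar areas sum to 1
sum(h$density * diff(h$breaks))
```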

# restore original graphic parameters
par(opar)

# Kernel density plot

par(mfrow = c(2, 1))
d <- density(mtcars$mpg)

plot(d)

d <- density(mtcars$mpg)
plot(d, main = "Kernel Density of Miles Per Gallon")
polygon(d, col = "red", border = "blue")
rug(mtcars$mpg, col = "brown")

# restore original graphic parameters
par(opar)

# Comparing kernel density plots

par(lwd = 2)
library(sm)

## Package 'sm', version 2.2-5.4: type help(sm) for summary information

attach(mtcars)

## The following object is masked from package:ggplot2:
## 
##     mpg

cyl.f <- factor(cyl, levels = c(4, 6, 8), 
    labels = c("4 cylinder", "6 cylinder", "8 cylinder"))

sm.density.compare(mpg, cyl, xlab = "Miles Per Gallon")
title(main = "MPG Distribution by Car Cylinders")

colfill <- c(2:(1 + length(levels(cyl.f))))
cat("Use mouse to place legend...", "\n\n")

## Use mouse to place legend...

#legend(locator(1), levels(cyl.f), fill = colfill)
detach(mtcars)
par(lwd = 1)

# Box Plot

boxplot(mpg ~ cyl, data = mtcars, 
    main = "Car Mileage Data", 
    xlab = "Number of Cylinders", 
    ylab = "Miles Per Gallon")

boxplot(mpg ~ cyl, data = mtcars, notch = TRUE, 
    varwidth = TRUE, col = "red", 
    main = "Car Mileage Data", 
    xlab = "Number of Cylinders", 
    ylab = "Miles Per Gallon")

## Warning in bxp(structure(list(stats = structure(c(21.4, 22.8, 26, 30.4, :
## some notches went outside hinges ('box'): maybe set notch=FALSE

# Box plots for two crossed factors

mtcars$cyl.f <- factor(mtcars$cyl, levels = c(4, 6, 
    8), labels = c("4", "6", "8"))

mtcars$am.f <- factor(mtcars$am, levels = c(0, 1), 
    labels = c("auto", "standard"))

boxplot(mpg ~ am.f * cyl.f, data = mtcars, 
    varwidth = TRUE, col = c("gold", "darkgreen"), 
    main = "MPG Distribution by Auto Type", 
    xlab = "Auto Type")

# Violin plots

library(vioplot)
x1 <- mtcars$mpg[mtcars$cyl == 4]
x2 <- mtcars$mpg[mtcars$cyl == 6]
x3 <- mtcars$mpg[mtcars$cyl == 8]
vioplot(x1, x2, x3, 
    names = c("4 cyl", "6 cyl", "8 cyl"), 
    col = "gold")
title("Violin Plots of Miles Per Gallon")

# dotchart

dotchart(mtcars$mpg, labels = row.names(mtcars), 
    cex = 0.7, 
    main = "Gas Mileage for Car Models", 
    xlab = "Miles Per Gallon")

# sorted colored grouped dot chart

x <- mtcars[order(mtcars$mpg), ]
x$cyl <- factor(x$cyl)
x$color[x$cyl == 4] <- "red"
x$color[x$cyl == 6] <- "blue"
x$color[x$cyl == 8] <- "darkgreen"
dotchart(x$mpg, labels = row.names(x), cex = 0.7, 
    pch = 19, groups = x$cyl, 
    gcolor = "black", color = x$color, 
    main = "Gas Mileage for Car Models\ngrouped by cylinder", 
    xlab = "Miles Per Gallon")