This chapter introduces the foundations of R programming for visualization, statistical computing and scientific inference. Specifically, in this chapter we will (1) discuss the rationale for selecting R as a computational platform for all DSPA demonstrations; (2) present the basics of installing shell-based R and RStudio user-interface, (3) show some simple R commands and scripts (e.g., translate long-to-wide data format, data simulation, data stratification and subsetting), (4) introduce variable types and their manipulation; (5) demonstrate simple mathematical functions, statistics, and matrix operators; (6) explore simple data visualization; and (7) introduce optimization and model fitting. The chapter appendix includes references to R introductory and advanced resources, as well as a primer on debugging.
R?There are many different classes of software that can be used for
data interrogation, modeling, inference and statistical computing. Among
these are R, Python, Java, C/C++, Perl, and many others.
The table below compares R to various other statistical
analysis software packages and more
detailed comparison is available online.
| Statistical Software | Advantages | Disadvantages |
|---|---|---|
| R | R is actively maintained (\(\ge 100,000\) developers, \(\ge 15K\) packages). Excellent connectivity to various types of data and other systems. Versatile for solving problems in many domains. It’s free, open-source code. Anybody can access/review/extend the source code. R is very stable and reliable. If you change or redistribute the R source code, you have to make those changes available for anybody else to use. R runs anywhere (platform agnostic). Extensibility: R supports extensions, e.g., for data manipulation, statistical modeling, and graphics. Active and engaged community supports R. Unparalleled question-and-answer (Q&A) websites. R connects with other languages (Java/C/JavaScript/Python/Fortran) & database systems, and other programs, SAS, SPSS, etc. Other packages have add-ons to connect with R. SPSS has incorporated a link to R, and SAS has protocols to move data and graphics between the two packages. | Mostly scripting language. Steeper learning curve |
| SAS | Large datasets. Commonly used in business & Government | Expensive. Somewhat dated programming language. Expensive/proprietary |
| Stata | Easy statistical analyses | Mostly classical stats |
| SPSS | Appropriate for beginners Simple interfaces | Weak in more cutting edge statistical procedures lacking in robust methods and survey methods |
There exist substantial differences between different types of computational environments for data wrangling, preprocessing, analytics, visualization and interpretation. The table below provides some rough comparisons between some of the most popular data computational platforms. With the exception of ComputeTime, higher scores represent better performance within the specific category. Note that these are just estimates and the scales are not normalized between categories.
| Language | OpenSource | Speed | ComputeTime | LibraryExtent | EaseOfEntry | Costs | Interoperability |
|---|---|---|---|---|---|---|---|
| Python | Yes | 16 | 62 | 80 | 85 | 10 | 90 |
| Julia | Yes | 2941 | 0.34 | 100 | 30 | 10 | 90 |
| R | Yes | 1 | 745 | 100 | 80 | 15 | 90 |
| IDL | No | 67 | 14.77 | 50 | 88 | 100 | 20 |
| Matlab | No | 147 | 6.8 | 75 | 95 | 100 | 20 |
| Scala | Yes | 1428 | 0.7 | 50 | 30 | 20 | 40 |
| C | Yes | 1818 | 0.55 | 100 | 30 | 10 | 99 |
| Fortran | Yes | 1315 | 0.76 | 95 | 25 | 15 | 95 |
Let’s first look at some real peer-review publication data
(1995-2015), specifically comparing all published scientific reports
utilizing R, SAS and SPSS, as
popular tools for data manipulation and statistical modeling. These data
are retrieved using GoogleScholar
literature searches.
# library(ggplot2)
# library(reshape2)
library(ggplot2)
library(reshape2)
library(plotly)
Data_R_SAS_SPSS_Pubs <-
read.csv('https://umich.instructure.com/files/2361245/download?download_frd=1', header=T)
df <- data.frame(Data_R_SAS_SPSS_Pubs)
# convert to long format (http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/)
# df <- melt(df , id.vars = 'Year', variable.name = 'Software')
# ggplot(data=df, aes(x=Year, y=value, color=Software, group = Software)) +
# geom_line(size=4) + labs(x='Year', y='Paper Software Citations') +
# ggtitle("Manuscript Citations of Software Use (1995-2015)") +
# theme(legend.position=c(0.1,0.8),
# legend.direction="vertical",
# axis.text.x = element_text(angle = 45, hjust = 1),
# plot.title = element_text(hjust = 0.5))
plot_ly(df, x = ~Year) %>%
add_trace(y = ~R, name = 'R', mode = 'lines+markers') %>%
add_trace(y = ~SAS, name = 'SAS', mode = 'lines+markers') %>%
add_trace(y = ~SPSS, name = 'SPSS', mode = 'lines+markers') %>%
layout(title="Manuscript Citations of Software Use (1995-2015)", legend = list(orientation = 'h'))
We can also look at a dynamic
Google Trends map, which provides longitudinal tracking of the
number of web-searches for each of these three statistical computing
platforms (R, SAS, SPSS). The figure below shows one example of the
evolving software interest over the past 15 years. You can expand
this plot by modifying the trend terms, expanding the search phrases,
and changing the time period.
The 2004-2018 monthly data of popularity of SAS, SPSS and R programming
Google searches is saved in this file GoogleTrends_Data_R_SAS_SPSS_Worldwide_2004_2018.csv.
# require(ggplot2)
# require(reshape2)
# GoogleTrends_Data_R_SAS_SPSS_Worldwide_2004_2018 <-
# read.csv('https://umich.instructure.com/files/9310141/download?download_frd=1', header=T)
# # read.csv('https://umich.instructure.com/files/9314613/download?download_frd=1', header=T) # Include Python
# df_GT <- data.frame(GoogleTrends_Data_R_SAS_SPSS_Worldwide_2004_2018)
#
# # convert to long format
# # df_GT <- melt(df_GT , id.vars = 'Month', variable.name = 'Software')
# #
# # library(scales)
# df_GT$Month <- as.Date(paste(df_GT$Month,"-01",sep=""))
# ggplot(data=df_GT1, aes(x=Date, y=hits, color=keyword, group = keyword)) +
# geom_line(size=4) + labs(x='Month-Year', y='Worldwide Google Trends') +
# scale_x_date(labels = date_format("%m-%Y"), date_breaks='4 months') +
# ggtitle("Web-Search Trends of Statistical Software (2004-2018)") +
# theme(legend.position=c(0.1,0.8),
# legend.direction="vertical",
# axis.text.x = element_text(angle = 45, hjust = 1),
# plot.title = element_text(hjust = 0.5))
#### Pull dynamic Google-Trends data
# install.packages("prophet")
# install.packages("devtools")
# install.packages("ps"); install.packages("pkgbuild")
# devtools::install_github("PMassicotte/gtrendsR")
library(gtrendsR)
library(ggplot2)
library(prophet)
df_GT1 <- gtrends(c("R", "SAS", "SPSS", "Python"), # geo = c("US","CN","GB", "EU"),
gprop = "web", time = "2004-01-01 2021-08-01")[[1]]
head(df_GT1)## date hits keyword geo time gprop category
## 1 2004-01-01 62 R world 2004-01-01 2021-08-01 web 0
## 2 2004-02-01 61 R world 2004-01-01 2021-08-01 web 0
## 3 2004-03-01 59 R world 2004-01-01 2021-08-01 web 0
## 4 2004-04-01 58 R world 2004-01-01 2021-08-01 web 0
## 5 2004-05-01 56 R world 2004-01-01 2021-08-01 web 0
## 6 2004-06-01 60 R world 2004-01-01 2021-08-01 web 0
library(tidyr)
df_GT1_wide <- spread(df_GT1, key = keyword, value = hits)
# dim(df_GT1_wide ) # [1] 212 9
plot_ly(df_GT1_wide, x = ~date) %>%
add_trace(x = ~date, y = ~R, name = 'R', type = 'scatter', mode = 'lines+markers') %>%
add_trace(x = ~date, y = ~SAS, name = 'SAS', type = 'scatter', mode = 'lines+markers') %>%
add_trace(x = ~date, y = ~SPSS, name = 'SPSS', type = 'scatter', mode = 'lines+markers') %>%
add_trace(x = ~date, y = ~Python, name = 'Python', type = 'scatter', mode = 'lines+markers') %>%
layout(title="Monthly Web-Search Trends of Statistical Software (2003-2021)",
legend = list(orientation = 'h'),
xaxis = list(title = 'Time'),
yaxis = list (title = 'Relative Search Volume'))R is a free software that can be installed on any computer. The ‘R’ website is: http://R-project.org. There you can install a shell-based R-environment following this protocol:
For many readers, its best to also install and run R via RStudio GUI, which provides a nice user interface. To install RStudio, go to: http://www.rstudio.org/ and do the following:
The RStudio interface consists of several windows.
Updating and upgrading the R environment involves a three-step process:
installr package. Type this in the R console:
install.packages("installr"); library(installr); updateR(),Help menu and click Check for Updates,
andTools menu and
click Check for Package Updates….These R updates should be done regularly (preferably monthly or at least semi-annually).
reshape2, ggplot2, and we will show how
to expand
your Rstudio library throughout these materials.<- (although
= may also be used), so to assign a value of \(2\) to a variable \(x\), we can use x <- 2 or
equivalently x = 2.R provides us documentations for different R functions.
The function call those documentations use help(). Just put
help(topic) in the R console and you can get detailed
explanations for each R topic or function. Another way of doing it is to
call ?topic, which is even easier.
For example, if I want to check the function for linear models
(i.e. function lm()), I will use the following
function.
help(lm)
?lmA first R script for melting a simple dataset
rawdata_wide <- read.table(header=TRUE, text='
CaseID Gender Age Condition1 Condition2
1 M 5 13 10.5
2 F 6 16 11.2
3 F 8 10 18.3
4 M 9 9.5 18.1
5 M 10 12.1 19
')
# Make the CaseID column a factor
rawdata_wide$subject <- factor(rawdata_wide$CaseID)
rawdata_wide## CaseID Gender Age Condition1 Condition2 subject
## 1 1 M 5 13.0 10.5 1
## 2 2 F 6 16.0 11.2 2
## 3 3 F 8 10.0 18.3 3
## 4 4 M 9 9.5 18.1 4
## 5 5 M 10 12.1 19.0 5
library(reshape2)
# Specify id.vars: the variables to keep (don't split apart on!)
melt(rawdata_wide, id.vars=c("CaseID", "Gender"))## CaseID Gender variable value
## 1 1 M Age 5
## 2 2 F Age 6
## 3 3 F Age 8
## 4 4 M Age 9
## 5 5 M Age 10
## 6 1 M Condition1 13
## 7 2 F Condition1 16
## 8 3 F Condition1 10
## 9 4 M Condition1 9.5
## 10 5 M Condition1 12.1
## 11 1 M Condition2 10.5
## 12 2 F Condition2 11.2
## 13 3 F Condition2 18.3
## 14 4 M Condition2 18.1
## 15 5 M Condition2 19
## 16 1 M subject 1
## 17 2 F subject 2
## 18 3 F subject 3
## 19 4 M subject 4
## 20 5 M subject 5
There are options for melt that can make the output a
little easier to work with:
data_long <- melt(rawdata_wide,
# ID variables - all the variables to keep but not split apart on
id.vars=c("CaseID", "Gender"),
# The source columns
measure.vars=c("Age", "Condition1", "Condition2" ),
# Name of the destination column that will identify the original
# column that the measurement came from
variable.name="Feature",
value.name="Measurement"
)
data_long## CaseID Gender Feature Measurement
## 1 1 M Age 5.0
## 2 2 F Age 6.0
## 3 3 F Age 8.0
## 4 4 M Age 9.0
## 5 5 M Age 10.0
## 6 1 M Condition1 13.0
## 7 2 F Condition1 16.0
## 8 3 F Condition1 10.0
## 9 4 M Condition1 9.5
## 10 5 M Condition1 12.1
## 11 1 M Condition2 10.5
## 12 2 F Condition2 11.2
## 13 3 F Condition2 18.3
## 14 4 M Condition2 18.1
## 15 5 M Condition2 19.0
For an elaborate justification, detailed description, and multiple
examples of handling long-and-wide data, messy and tidy data, and data
cleaning strategies see the JSS
Tidy Data article by Hadley Wickham.
Popular data generation functions are c(),
seq(), rep(), and data.frame().
Sometimes we use list() and array() to create
data too.
c()
c() creates a (column) vector. With option
recursive=T, it descends through lists combining all
elements into one vector.
a<-c(1, 2, 3, 5, 6, 7, 10, 1, 4)
a## [1] 1 2 3 5 6 7 10 1 4
c(list(A = c(Z = 1, Y = 2), B = c(X = 7), C = c(W = 7, V=3, U=-1.9)), recursive = TRUE)## A.Z A.Y B.X C.W C.V C.U
## 1.0 2.0 7.0 7.0 3.0 -1.9
When combined with list(), c() successfully
created a vector with all the information in a list with three members
A, B, and C.
seq(from, to)
seq(from, to) generates a sequence. Adding option
by= can help us specifies increment; Option
length= specifies desired length. Also,
seq(along=x) generates a sequence
1, 2, ..., length(x). This is used for loops to create ID
for each element in x.
seq(1, 20, by=0.5)## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5
## [31] 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0
seq(1, 20, length=9)## [1] 1.000 3.375 5.750 8.125 10.500 12.875 15.250 17.625 20.000
seq(along=c(5, 4, 5, 6))## [1] 1 2 3 4
rep(x, times)
rep(x, times) creates a sequence that repeats
x a specified number of times. The option
each= also allow us to repeat first over each element of
x certain number of times.
rep(c(1, 2, 3), 4)## [1] 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1, 2, 3), each=4)## [1] 1 1 1 1 2 2 2 2 3 3 3 3
Compare this to replicating using replicate()
X <- seq(along=c(1, 2, 3)); replicate(4, X+1)## [,1] [,2] [,3] [,4]
## [1,] 2 2 2 2
## [2,] 3 3 3 3
## [3,] 4 4 4 4
data.frame()
data.frame() creates a data frame of named or unnamed
arguments. We can combine multiple vectors. Each vector is stored as a
column. Shorter vectors are recycled to the length of the longest one.
With data.frame() you can mix numeric and characteristic
vectors.
data.frame(v=1:4, ch=c("a", "B", "C", "d"), n=c(10, 11))## v ch n
## 1 1 a 10
## 2 2 B 11
## 3 3 C 10
## 4 4 d 11
Note that the 1:4 means from 1 to 4. The operator
: generates a sequence.
list()
Like we mentioned in function c(), list()
creates a list of the named or unnamed arguments - indexing rule: from 1
to n, including 1 and n.
l<-list(a=c(1, 2), b="hi", c=-3+3i)
l## $a
## [1] 1 2
##
## $b
## [1] "hi"
##
## $c
## [1] -3+3i
# Note Complex Numbers a <- -1+3i; b <- -2-2i; a+bWe use $ to call each member in the list and
[[]] to call the element corresponding to specific index.
For example,
l$a[[2]] ## [1] 2
l$b## [1] "hi"
Note that R uses 1-based numbering
rather than 0-based like some other languages (C/Java), so the first
element of a list has index \(1\).
array(x, dim=)
array(x, dim=) creates an array with specific
dimensions. For example, dim=c(3, 4, 2) means two 3x4
matrices. We use [] to extract specific elements in the
array. [2, 3, 1] means the element at the 2nd row 3rd
column in the 1st page. Leaving one number in the dimensions empty would
help us to get a specific row, column or page. [2, ,1]
means the second row in the 1st page. See this image:
ar<-array(1:24, dim=c(3, 4, 2))
ar## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
ar[2, 3, 1] ## [1] 8
ar[2, ,1]## [1] 2 5 8 11
Other useful functions are:
matrix(x, nrow=, ncol=): creates matrix elements of
nrow rows and ncol columns.factor(x, levels=): encodes a vector x as
a factor.gl(n, k, length=n*k, labels=1:n): generate levels
(factors) by specifying the pattern of their levels.expand.grid(): a data frame from all combinations of
the supplied vectors or factors.rbind() combine arguments by rows for matrices, data
frames, and otherscbind() combine arguments by columns for matrices, data
frames, and othersThe first pair of functions we will talk about are
load(), which helps us reload datasets written with the
save() function.
Let’s create some data first.
x <- seq(1, 10, by=0.5)
y <- list(a = 1, b = TRUE, c = "oops")
save(x, y, file="xy.RData")
load("xy.RData")data(x) loads specified data sets
library(x) load add-on packages.
data("iris")
summary(iris)## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
read.table(file) reads a file in table format and
creates a data frame from it. The default separator sep=""
is any whitespace. Use header=TRUE to read the first line
as a header of column names. Use as.is=TRUE to prevent
character vectors from being converted to factors. Use
comment.char="" to prevent "#" from being
interpreted as a comment. Use skip=n to skip n
lines before reading data. See the help for options on row naming, NA
treatment, and others.
Let’s use read.table() to read a text file in our class
file.
data.txt<-read.table("https://umich.instructure.com/files/1628628/download?download_frd=1", header=T, as.is = T) # 01a_data.txt
summary(data.txt)## Name Team Position Height
## Length:1034 Length:1034 Length:1034 Min. :67.0
## Class :character Class :character Class :character 1st Qu.:72.0
## Mode :character Mode :character Mode :character Median :74.0
## Mean :73.7
## 3rd Qu.:75.0
## Max. :83.0
## Weight Age
## Min. :150.0 Min. :20.90
## 1st Qu.:187.0 1st Qu.:25.44
## Median :200.0 Median :27.93
## Mean :201.7 Mean :28.74
## 3rd Qu.:215.0 3rd Qu.:31.23
## Max. :290.0 Max. :48.52
A note of caution; if you post data on a Cloud web service like
Instructure/Canvas or GoogleDrive/GDrive that you want to process
externally, e.g., via R, the direct URL reference to the
raw file will be different from the URL of the pointer to the file that
can be rendered in the browser window. For instance,
R processing,
uc?export=download&id=.dataGDrive.txt<-read.table("https://drive.google.com/uc?export=download&id=1Zpw3HSe-8HTDsOnR-n64KoMRWYpeBBek", header=T, as.is = T) # 01a_data.txt
summary(dataGDrive.txt)## Name Team Position Height
## Length:1034 Length:1034 Length:1034 Min. :67.0
## Class :character Class :character Class :character 1st Qu.:72.0
## Mode :character Mode :character Mode :character Median :74.0
## Mean :73.7
## 3rd Qu.:75.0
## Max. :83.0
## Weight Age
## Min. :150.0 Min. :20.90
## 1st Qu.:187.0 1st Qu.:25.44
## Median :200.0 Median :27.93
## Mean :201.7 Mean :28.74
## 3rd Qu.:215.0 3rd Qu.:31.23
## Max. :290.0 Max. :48.52
read.csv(“filename”, header=TRUE) is identical to
read.table() but with defaults set for reading
comma-delimited files.
data.csv<-read.csv("https://umich.instructure.com/files/1628650/download?download_frd=1", header = T) # 01_hdp.csv
summary(data.csv)## tumorsize co2 pain wound
## Min. : 33.97 Min. :1.222 Min. :1.000 Min. :1.000
## 1st Qu.: 62.49 1st Qu.:1.519 1st Qu.:4.000 1st Qu.:5.000
## Median : 70.07 Median :1.601 Median :5.000 Median :6.000
## Mean : 70.88 Mean :1.605 Mean :5.473 Mean :5.732
## 3rd Qu.: 79.02 3rd Qu.:1.687 3rd Qu.:6.000 3rd Qu.:7.000
## Max. :116.46 Max. :2.128 Max. :9.000 Max. :9.000
## mobility ntumors nmorphine remission
## Min. :1.00 Min. :0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.:5.00 1st Qu.:1.000 1st Qu.: 2.000 1st Qu.:0.0000
## Median :6.00 Median :3.000 Median : 3.000 Median :0.0000
## Mean :6.08 Mean :3.066 Mean : 3.624 Mean :0.2957
## 3rd Qu.:7.00 3rd Qu.:5.000 3rd Qu.: 5.000 3rd Qu.:1.0000
## Max. :9.00 Max. :9.000 Max. :18.000 Max. :1.0000
## lungcapacity Age Married FamilyHx
## Min. :0.01612 Min. :26.32 Min. :0.0 Length:8525
## 1st Qu.:0.67647 1st Qu.:46.69 1st Qu.:0.0 Class :character
## Median :0.81560 Median :50.93 Median :1.0 Mode :character
## Mean :0.77409 Mean :50.97 Mean :0.6
## 3rd Qu.:0.91150 3rd Qu.:55.27 3rd Qu.:1.0
## Max. :0.99980 Max. :74.48 Max. :1.0
## SmokingHx Sex CancerStage LengthofStay
## Length:8525 Length:8525 Length:8525 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 5.000
## Mode :character Mode :character Mode :character Median : 5.000
## Mean : 5.492
## 3rd Qu.: 6.000
## Max. :10.000
## WBC RBC BMI IL6
## Min. :2131 Min. :3.919 Min. :18.38 Min. : 0.03521
## 1st Qu.:5323 1st Qu.:4.802 1st Qu.:24.20 1st Qu.: 1.93039
## Median :6007 Median :4.994 Median :27.73 Median : 3.34400
## Mean :5998 Mean :4.995 Mean :29.07 Mean : 4.01698
## 3rd Qu.:6663 3rd Qu.:5.190 3rd Qu.:32.54 3rd Qu.: 5.40551
## Max. :9776 Max. :6.065 Max. :58.00 Max. :23.72777
## CRP DID Experience School
## Min. : 0.0451 Min. : 1.0 Min. : 7.00 Length:8525
## 1st Qu.: 2.6968 1st Qu.:100.0 1st Qu.:15.00 Class :character
## Median : 4.3330 Median :199.0 Median :18.00 Mode :character
## Mean : 4.9730 Mean :203.3 Mean :17.64
## 3rd Qu.: 6.5952 3rd Qu.:309.0 3rd Qu.:21.00
## Max. :28.7421 Max. :407.0 Max. :29.00
## Lawsuits HID Medicaid
## Min. :0.000 Min. : 1.00 Min. :0.1416
## 1st Qu.:1.000 1st Qu.: 9.00 1st Qu.:0.3369
## Median :2.000 Median :17.00 Median :0.5215
## Mean :1.866 Mean :17.76 Mean :0.5125
## 3rd Qu.:3.000 3rd Qu.:27.00 3rd Qu.:0.7083
## Max. :9.000 Max. :35.00 Max. :0.8187
read.delim(“filename”, header=TRUE) is very similar to the first two. However, it has defaults set for reading tab-delimited files.
Also we have
read.fwf(file, widths, header=FALSE, sep="\t", as.is=FALSE)
to read a table of fixed width formatted data into a data frame.
match(x, y) returns a vector of the positions of
(first) matches of its first argument in its second. For a specific
element in x if no elements matches it in y
then the output for that elements would be NA.
match(c(1, 2, 4, 5), c(1, 4, 4, 5, 6, 7))## [1] 1 NA 2 4
save.image(file) saves all objects in the current work space.
**write.table(x, file=““, row.names=TRUE, col.names=TRUE, sep=”“)**
prints x after converting to a data frame and stores it into a specified
file. If quote is TRUE, character or factor columns are
surrounded by quotes (”). sep is the field separator.
eol is the end-of-line separator. na is the
string for missing values. Use col.names=NA to add a blank
column header to get the column headers aligned correctly for
spreadsheet input.
Most of the I/O functions have a file argument. This can often be a
character string naming a file or a connection. file=""
means the standard input or output. Connections can include files,
pipes, zipped files, and R variables.
On windows, the file connection can also be used with
description = "clipboard". To read a table copied from
Excel, use x <- read.delim("clipboard")
To write a table to the clipboard for Excel, use
write.table(x, "clipboard", sep="\t", col.names=NA)
For database interaction, see packages RODBC, DBI, RMySQL, RPgSQL, and ROracle, as well as packages XML, hdf5, netCDF for reading other file formats. We will talk about some of them in later chapters.
Note, an alternative library called rio handles
import/export of multiple data types with simple syntax.
The following table can help us to understand how to index vectors.
| Expression | Explanation |
|---|---|
x[n] |
nth element |
x[-n] |
all but the nth element |
x[1:n] |
first n elements |
x[-(1:n)] |
elements from n+1 to the end |
x[c(1, 4, 2)] |
specific elements |
x["name"] |
element named “name” |
x[x > 3] |
all elements greater than 3 |
x[x > 3 & x < 5] |
all elements between 3 and 5 |
x[x %in% c("a", "and", "the")] |
elements in the given set |
Indexing lists are similar to indexing vectors but some of the symbols are different.
| Expression | Explanation |
|---|---|
x[n] |
list with n elements |
x[[n]] |
nth element of the list |
x[["name"]] |
element of the list named “name” |
Indexing for matrices is a higher dimensional version of indexing vectors.
| Expression | Explanation |
|---|---|
x[i, j] |
element at row i, column j |
x[i, ] |
row i |
x[, j] |
column j |
x[, c(1, 3)] |
columns 1 and 3 |
x["name", ] |
row named “name” |
The following functions can be used to convert data types:
as.array(x), as.data.frame(x),
as.numeric(x), as.logical(x),
as.complex(x), as.character(x), …
Typing methods(as) in the console will generate a
complete list for variable conversion functions.
The following functions will test if the each data element is a specific type:
is.na(x), is.null(x),
is.array(x), is.data.frame(x),
is.numeric(x), is.complex(x),
is.character(x), …
For a complete list, type methods(is) in R console. The
output for these functions are a bunch of TRUE or
FALSE logical statements. One statement for one element in
the dataset.
length(x) gives us the number of elements in
x.
x<-c(1, 3, 10, 23, 1, 3)
length(x)## [1] 6
dim(x) retrieves or sets the dimension of an object.
x<-1:12
dim(x)<-c(3, 4)
x## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
dimnames(x) retrieves or sets the dimension names of
an object. For higher dimensional objects like matrix or arrays we can
combine dimnames() with list.
dimnames(x)<-list(c("R1", "R2", "R3"), c("C1", "C2", "C3", "C4"))
x## C1 C2 C3 C4
## R1 1 4 7 10
## R2 2 5 8 11
## R3 3 6 9 12
nrow(x) number of rows; ncol(x) number of columns
nrow(x) ## [1] 3
ncol(x)## [1] 4
class(x) get or set the class of x. Note that we can
use unclass(x) to remove the class attribute of x.
class(x)## [1] "matrix" "array"
class(x)<-"myclass"
x<-unclass(x)
x## C1 C2 C3 C4
## R1 1 4 7 10
## R2 2 5 8 11
## R3 3 6 9 12
attr(x, which) get or set the attribute
which of x.
attr(x, "class")## NULL
attr(x, "dim")<-c(2, 6)
x## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 3 5 7 9 11
## [2,] 2 4 6 8 10 12
From the above commands we know that when we unclass x,
its class would be NULL.
attributes(obj) get or set the list of attributes of object.
attributes(x) <- list(mycomment = "really special", dim = 3:4,dimnames = list(LETTERS[1:3], letters[1:4]), names = paste(1:12))
x## a b c d
## A 1 4 7 10
## B 2 5 8 11
## C 3 6 9 12
## attr(,"mycomment")
## [1] "really special"
## attr(,"names")
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12"
In this section, we will introduce some data manipulation functions.
In addition, tools from dplyr provide easy dataset
manipulation routines.
which.max(x) returns the index of the greatest element of x. which.min(x) returns the index of the smallest element of x. rev(x) reverses the elements of x. Let’s see these three functions first.
x<-c(1, 5, 2, 1, 10, 40, 3)
which.max(x)## [1] 6
which.min(x)## [1] 1
rev(x)## [1] 3 40 10 1 2 5 1
sort(x) sorts the elements of x in increasing order.
To sort in decreasing order we can use rev(sort(x)).
sort(x) ## [1] 1 1 2 3 5 10 40
rev(sort(x))## [1] 40 10 5 3 2 1 1
cut(x, breaks) divides x into intervals with same
length (sometimes factors). breaks is the number of cut
intervals or a vector of cut points. cut divides the range
of x into intervals coding the values in x according to the intervals
they fall into.
x## [1] 1 5 2 1 10 40 3
cut(x, 3)## [1] (0.961,14] (0.961,14] (0.961,14] (0.961,14] (0.961,14] (27,40] (0.961,14]
## Levels: (0.961,14] (14,27] (27,40]
cut(x, c(0, 5, 20, 30))## [1] (0,5] (0,5] (0,5] (0,5] (5,20] <NA> (0,5]
## Levels: (0,5] (5,20] (20,30]
which(x == a) returns a vector of the indices of
x if the comparison operation is true (TRUE). For example
it returns the value i, if x[i]== a is true.
Thus, the argument of this function (like x==a) must be a
variable of mode logical.
x## [1] 1 5 2 1 10 40 3
which(x==2)## [1] 3
na.omit(x) suppresses the observations with missing
data (NA). It suppresses the corresponding line if x is a
matrix or a data frame. na.fail(x) returns an error
message if x contains at least one NA.
df<-data.frame(a=1:5, b=c(1, 3, NA, 9, 8))
df## a b
## 1 1 1
## 2 2 3
## 3 3 NA
## 4 4 9
## 5 5 8
na.omit(df)## a b
## 1 1 1
## 2 2 3
## 4 4 9
## 5 5 8
unique(x) If x is a vector or a data frame, it returns a similar object but with the duplicate elements suppressed.
df1<-data.frame(a=c(1, 1, 7, 6, 8), b=c(1, 1, NA, 9, 8))
df1## a b
## 1 1 1
## 2 1 1
## 3 7 NA
## 4 6 9
## 5 8 8
unique(df1)## a b
## 1 1 1
## 3 7 NA
## 4 6 9
## 5 8 8
table(x) returns a table with the different values
of x and their frequencies (typically for integers or factors). Also
check prop.table().
v<-c(1, 2, 4, 2, 2, 5, 6, 4, 7, 8, 8)
table(v)## v
## 1 2 4 5 6 7 8
## 1 3 2 1 1 1 2
subset(x, …) returns a selection of x with respect
to criteria ... (typically ... are comparisons
like x$V1 < 10). If x is a data frame, the option
select= gives the variables to be kept or dropped using a
minus sign.
sub<-subset(df1, df1$a>5)
sub ## a b
## 3 7 NA
## 4 6 9
## 5 8 8
sub<-subset(df1, select=-a)
sub## b
## 1 1
## 2 1
## 3 NA
## 4 9
## 5 8
sample(x, size) resample randomly and without
replacement size elements in the vector x, the option
replace = TRUE allows to resample with replacement.
df1<-data.frame(a=c(1, 1, 7, 6, 8), b=c(1, 1, NA, 9, 8))
sample(df1$a, 20, replace = T)## [1] 1 1 1 1 7 1 1 7 1 7 1 1 7 8 1 1 6 1 6 1
prop.table(x, margin=) table entries as fraction of marginal table.
prop.table(table(v))## v
## 1 2 4 5 6 7 8
## 0.09090909 0.27272727 0.18181818 0.09090909 0.09090909 0.09090909 0.18181818
Basic math functions like sin, cos,
tan, asin, acos,
atan, atan2, log,
log10, exp and “set” functions
union(x, y), intersect(x, y),
setdiff(x, y), setequal(x, y),
is.element(el, set) are available in R.
lsf.str("package:base") displays all base function built
in a specific R package (like base).
Also we have the following table of functions that you might need when use R for calculations.
| Expression | Explanation |
|---|---|
choose(n, k) |
computes the combinations of k events among n repetitions. Mathematically it equals to \(\frac{n!}{[(n-k)!k!]}\) |
max(x) |
maximum of the elements of x |
min(x) |
minimum of the elements of x |
range(x) |
minimum and maximum of the elements of x |
sum(x) |
sum of the elements of x |
diff(x) |
lagged and iterated differences of vector x |
prod(x) |
product of the elements of x |
mean(x) |
mean of the elements of x |
median(x) |
median of the elements of x |
quantile(x, probs=) |
sample quantiles corresponding to the given probabilities (defaults to 0, .25, .5, .75, 1) |
weighted.mean(x, w) |
mean of x with weights w |
rank(x) |
ranks of the elements of x |
var(x) or cov(x) |
variance of the elements of x (calculated on n>1). If x is a matrix or a data frame, the variance-covariance matrix is calculated |
sd(x) |
standard deviation of x |
cor(x) |
correlation matrix of x if it is a matrix or a data frame (1 if x is a vector) |
var(x, y) or cov(x, y) |
covariance between x and y, or between the columns of x and those of y if they are matrices or data frames |
cor(x, y) |
linear correlation between x and y, or correlation matrix if they are matrices or data frames |
round(x, n) |
rounds the elements of x to n decimals |
log(x, base) |
computes the logarithm of x with base base |
scale(x) |
if x is a matrix, centers and reduces the data. Without centering
use the option center=FALSE. Without scaling use
scale=FALSE (by default center=TRUE, scale=TRUE) |
pmin(x, y, ...) |
a vector which ith element is the minimum of x[i], y[i], . . . |
pmax(x, y, ...) |
a vector which ith element is the maximum of x[i], y[i], . . . |
cumsum(x) |
a vector which ith element is the sum from x[1] to x[i] |
cumprod(x) |
id. for the product |
cummin(x) |
id. for the minimum |
cummax(x) |
id. for the maximum |
Re(x) |
real part of a complex number |
Im(x) |
imaginary part of a complex number |
Mod(x) |
modulus. abs(x) is the same |
Arg(x) |
angle in radians of the complex number |
Conj(x) |
complex conjugate |
convolve(x, y) |
compute the several kinds of convolutions of two sequences |
fft(x) |
Fast Fourier Transform of an array |
mvfft(x) |
FFT of each column of a matrix |
filter(x, filter) |
applies linear filtering to a univariate time series or to each series separately of a multivariate time series |
Note: many math functions have a logical parameter
na.rm=TRUE to specify missing data (NA)
removal.
The following table summarizes basic operation functions. We will discuss this topic in detail in Chapter 4.
| Expression | Explanation |
|---|---|
t(x) |
transpose |
diag(x) |
diagonal |
%*% |
matrix multiplication |
solve(a, b) |
solves a %*% x = b for x |
solve(a) |
matrix inverse of a |
rowsum(x) |
sum of rows for a matrix-like object. rowSums(x) is a
faster version |
colsum(x), colSums(x) |
id. for columns |
rowMeans(x) |
fast version of row means |
colMeans(x) |
id. for columns |
mat1 <- cbind(c(1, -1/5), c(-1/3, 1))
mat1.inv <- solve(mat1)
mat1.identity <- mat1.inv %*% mat1
mat1.identity## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
b <- c(1, 2)
x <- solve (mat1, b)
x## [1] 1.785714 2.357143
optim(par, fn, method = c(“Nelder-Mead”, “BFGS”, “CG”,
“L-BFGS-B”, “SANN”)) general-purpose optimization;
par is initial values, fn is function to
optimize (normally minimize).
nlm(f, p) minimize function fusing a Newton-type algorithm with starting values p.
lm(formula) fit linear models; formula
is typically of the form response ~ termA + termB + ...;
use I(x*y) + I(x^2) for terms made of nonlinear
components.
glm(formula, family=) fit generalized linear models,
specified by giving a symbolic description of the linear predictor and a
description of the error distribution; family is a
description of the error distribution and link function to be used in
the model; see ?family.
nls(formula) nonlinear least-squares estimates of the nonlinear model parameters.
approx(x, y=) linearly interpolate given data points; x can be an xy plotting structure.
spline(x, y=) cubic spline interpolation.
loess(formula) (locally weighted scatterplot smoothing) fit a polynomial surface using local fitting.
Many of the formula-based modeling functions have several common arguments:
data= the data frame for the formula variables,
subset= a subset of variables used in the fit,
na.action= action for missing values:
"na.fail", "na.omit", or a function.
The following generics often apply to model fitting functions:
predict(fit, ...) predictions from fit based on input
data.
df.residual(fit) returns the number of residual degrees
of freedom.
coef(fit) returns the estimated coefficients (sometimes
with their standard-errors).
residuals(fit) returns the residuals.
deviance(fit) returns the deviance.
fitted(fit) returns the fitted values.
logLik(fit) computes the logarithm of the likelihood and
the number of parameters.
AIC(fit) computes the Akaike information criterion
(AIC).
aov(formula) analysis of variance model.
anova(fit, …) analysis of variance (or deviance) tables for one or more fitted model objects.
density(x) kernel density estimates of x.
Other functions include: binom.test(),
pairwise.t.test(), power.t.test(),
prop.test(), t.test(), … use
help.search("test") to see details.
Random sample generation from different distributions. Remember to
include set.seed() if you want to get reproducibility
during exercises.
| Expression | Explanation |
|---|---|
rnorm(n, mean=0, sd=1) |
Gaussian (normal) |
rexp(n, rate=1) |
exponential |
rgamma(n, shape, scale=1) |
gamma |
rpois(n, lambda) |
Poisson |
rweibull(n, shape, scale=1) |
Weibull |
rcauchy(n, location=0, scale=1) |
Cauchy |
rbeta(n, shape1, shape2) |
beta |
rt(n, df) |
Student’s (t) |
rf(n, df1, df2) |
Fisher’s (F) (df1, df2) |
rchisq(n, df) |
Pearson rbinom(n, size, prob) binomial |
rgeom(n, prob) |
geometric |
rhyper(nn, m, n, k) |
hypergeometric |
rlogis(n, location=0, scale=1) |
logistic |
rlnorm(n, meanlog=0, sdlog=1) |
lognormal |
rnbinom(n, size, prob) |
negative binomial |
runif(n, min=0, max=1) |
uniform |
rwilcox(nn, m, n), rsignrank(nn, n) |
Wilcoxon’s statistics |
Also all these functions can be used by replacing the letter
r with d, p or q to
get, respectively, the probability density (dfunc(x, ...)),
the cumulative probability density (pfunc(x, ...)), and the
value of quantile (qfunc(p, ...), with \(0 < p < 1\)).
In this section, we will introduce some fancy functions that can save time remarkably.
apply(X, INDEX, FUN=) a vector or array or list of
values obtained by applying a function FUN to margins
(INDEX=1 means row, INDEX=2 means column) of
X.
df1 ## a b
## 1 1 1
## 2 1 1
## 3 7 NA
## 4 6 9
## 5 8 8
apply(df1, 2, mean, na.rm=T)## a b
## 4.60 4.75
Note that we can add options for the FUN after the
function.
lapply(X, FUN) apply FUN to each member
of the list X. If X is a data frame then it will apply the
FUN to each column and return a list.
lapply(df1, mean, na.rm=T)## $a
## [1] 4.6
##
## $b
## [1] 4.75
lapply(list(a=c(1, 23, 5, 6, 1), b=c(9, 90, 999)), median)## $a
## [1] 5
##
## $b
## [1] 90
tapply(X, INDEX, FUN=) apply FUN to
each cell of a ragged array given by X with indexes equals to
INDEX. Note that X is an atomic object, typically a
vector
v## [1] 1 2 4 2 2 5 6 4 7 8 8
fac <- factor(rep(1:3, length = 11), levels = 1:3)
table(fac)## fac
## 1 2 3
## 4 4 3
tapply(v, fac, sum)## 1 2 3
## 17 16 16
by(data, INDEX, FUN) apply FUN to data
frame data subsetted by INDEX.
by(df1, df1[, 1], sum)## df1[, 1]: 1
## [1] 4
## ------------------------------------------------------------
## df1[, 1]: 6
## [1] 15
## ------------------------------------------------------------
## df1[, 1]: 7
## [1] NA
## ------------------------------------------------------------
## df1[, 1]: 8
## [1] 16
The above line of code apply the sum function using
column 1(a) as an index.
merge(a, b) merge two data frames by common columns
or row names. We can use option by= to specify the index
column.
df2<-data.frame(a=c(1, 1, 7, 6, 8), c=1:5)
df2 ## a c
## 1 1 1
## 2 1 2
## 3 7 3
## 4 6 4
## 5 8 5
df3<-merge(df1, df2, by="a")
df3## a b c
## 1 1 1 1
## 2 1 1 2
## 3 1 1 1
## 4 1 1 2
## 5 6 9 4
## 6 7 NA 3
## 7 8 8 5
xtabs(a ~ b, data=x) a contingency table from cross-classifying factors.
DF <- as.data.frame(UCBAdmissions)
## 'DF' is a data frame with a grid of the factors and the counts
## in variable 'Freq'.
DF## Admit Gender Dept Freq
## 1 Admitted Male A 512
## 2 Rejected Male A 313
## 3 Admitted Female A 89
## 4 Rejected Female A 19
## 5 Admitted Male B 353
## 6 Rejected Male B 207
## 7 Admitted Female B 17
## 8 Rejected Female B 8
## 9 Admitted Male C 120
## 10 Rejected Male C 205
## 11 Admitted Female C 202
## 12 Rejected Female C 391
## 13 Admitted Male D 138
## 14 Rejected Male D 279
## 15 Admitted Female D 131
## 16 Rejected Female D 244
## 17 Admitted Male E 53
## 18 Rejected Male E 138
## 19 Admitted Female E 94
## 20 Rejected Female E 299
## 21 Admitted Male F 22
## 22 Rejected Male F 351
## 23 Admitted Female F 24
## 24 Rejected Female F 317
## Nice for taking margins …
xtabs(Freq ~ Gender + Admit, DF)## Admit
## Gender Admitted Rejected
## Male 1198 1493
## Female 557 1278
## And for testing independence …
summary(xtabs(Freq ~ ., DF))## Call: xtabs(formula = Freq ~ ., data = DF)
## Number of cases in table: 4526
## Number of factors: 3
## Test for independence of all factors:
## Chisq = 2000.3, df = 16, p-value = 0
aggregate(x, by, FUN) splits the data frame x into
subsets, computes summary statistics for each, and returns the result in
a convenient form. by is a list of grouping elements, each
has the same length as the variables in x.
list(rep(1:3, length=7))## [[1]]
## [1] 1 2 3 1 2 3 1
aggregate(df3, by=list(rep(1:3, length=7)), sum)## Group.1 a b c
## 1 1 10 10 8
## 2 2 7 10 6
## 3 3 8 NA 4
The above code applied function sum to data frame
df3 according to the index created by
list(rep(1:3, length=7)).
stack(x, …) transform data available as separate
columns in a data frame or list into a single column.
and unstack(x, ...) is inverse of stack().
stack(df3)## values ind
## 1 1 a
## 2 1 a
## 3 1 a
## 4 1 a
## 5 6 a
## 6 7 a
## 7 8 a
## 8 1 b
## 9 1 b
## 10 1 b
## 11 1 b
## 12 9 b
## 13 NA b
## 14 8 b
## 15 1 c
## 16 2 c
## 17 1 c
## 18 2 c
## 19 4 c
## 20 3 c
## 21 5 c
unstack(stack(df3))## a b c
## 1 1 1 1
## 2 1 1 2
## 3 1 1 1
## 4 1 1 2
## 5 6 9 4
## 6 7 NA 3
## 7 8 8 5
reshape(x, …) reshapes a data frame between “wide”
format with repeated measurements in separate columns of the same record
and “long” format with the repeated measurements in separate records.
Use direction="wide" or direction="long".
df4 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),time = rep(c(1, 1, 2, 2), 3), score = rnorm(12))
wide <- reshape(df4, idvar = c("school", "class"), direction = "wide")
wide## school class score.1 score.2
## 1 1 9 -0.1432515 -0.4297252
## 2 1 10 0.4599993 -1.3383367
## 5 2 9 0.6658810 -0.6102316
## 6 2 10 -0.6621952 0.4192676
## 9 3 9 -0.1317508 -0.9635106
## 10 3 10 -0.4144321 1.9632750
long <- reshape(wide, idvar = c("school", "class"), direction = "long")
long## school class time score.1
## 1.9.1 1 9 1 -0.1432515
## 1.10.1 1 10 1 0.4599993
## 2.9.1 2 9 1 0.6658810
## 2.10.1 2 10 1 -0.6621952
## 3.9.1 3 9 1 -0.1317508
## 3.10.1 3 10 1 -0.4144321
## 1.9.2 1 9 2 -0.4297252
## 1.10.2 1 10 2 -1.3383367
## 2.9.2 2 9 2 -0.6102316
## 2.10.2 2 10 2 0.4192676
## 3.9.2 3 9 2 -0.9635106
## 3.10.2 3 10 2 1.9632750
Notes
rnorm used in reshape might generate
different results for each call, unless set.seed(1234) is
used to ensure reproducibility of random-number generation.The following functions are useful for handling strings in R.
paste(…) concatenate vectors after converting to
character. It has a few options. sep= is the string to
separate terms (a single space is the default). collapse=
is an optional string to separate “collapsed” results.
a<-"today"
b<-"is a good day"
paste(a, b) ## [1] "today is a good day"
paste(a, b, sep=",")## [1] "today,is a good day"
substr(x, start, stop) substrings in a character
vector. It can also assign values (with the same length) to part of a
string, as substr(x, start, stop) <- value.
a<-"When the going gets tough, the tough get going!"
substr(a, 10, 40)## [1] "going gets tough, the tough get"
## [1] "going gets tough, the tough get"
substr(a, 1, 9)<-"………"
a## [1] "………n the going gets tough, the tough get going!"
Note that characters at start and stop
indexes are inclusive in the output.
strsplit(x, split) split x according to the
substring split. Use fixed=TRUE for non-regular
expressions.
strsplit("a.b.c", ".", fixed = TRUE)## [[1]]
## [1] "a" "b" "c"
grep(pattern, x) searches for matches to pattern
within x. It will return a vector of the indices of the elements of x
that yielded a match. Use regular expression for
pattern(unless fixed=TRUE). See
?regex for details.
letters ## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
grep("[a-z]", letters)## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26
gsub(pattern, replacement, x) replacement of matches determined by regular expression matching. sub() is the same but only replaces the first occurrence.
a<-c("e", 0, "kj", 10, ";")
gsub("[a-z]", "letters", a) ## [1] "letters" "0" "lettersletters" "10"
## [5] ";"
sub("[a-z]", "letters", a)## [1] "letters" "0" "lettersj" "10" ";"
tolower(x) convert to lowercase. toupper(x) convert to uppercase.
match(x, table) a vector of the positions of first
matches for the elements of x among table.
x %in% table id. but returns a logical vector.
x<-c(1, 2, 10, 19, 29)
match(x, c(1, 10))## [1] 1 NA 2 NA NA
x %in% c(1, 10)## [1] TRUE FALSE TRUE FALSE FALSE
pmatch(x, table) partial matches for the elements of x among table.
pmatch("m", c("mean", "median", "mode")) # returns NA ## [1] NA
pmatch("med", c("mean", "median", "mode")) # returns 2## [1] 2
The first one returns NA, and dependent on the R-version
possibly a warning, because all elements have the pattern
"m".
nchar(x) number of characters
Dates and Times
The class Date has dates without times.
POSIXct() has dates and times, including time zones.
Comparisons (e.g. >), seq(), and difftime()
are useful. ?DateTimeClasses gives more information. See
also package chron.
as.Date(s) and as.POSIXct(s) convert to the
respective class; format(dt) converts to a string
representation. The default string format is 2001-02-21.
These accept a second argument to specify a format for conversion. Some
common formats are:
| Formats | Explanations |
|---|---|
%a, %A |
Abbreviated and full weekday name. |
%b, %B |
Abbreviated and full month name. |
%d |
Day of the month (01 … 31). |
%H |
Hours (00 … 23). |
%I |
Hours (01 … 12). |
%j |
Day of year (001 … 366). |
%m |
Month (01 … 12). |
%M |
Minute (00 … 59). |
%p |
AM/PM indicator. |
%S |
Second as decimal number (00 … 61). |
%U |
Week (00 … 53); the first Sunday as day 1 of week 1. |
%w |
Weekday (0 … 6, Sunday is 0). |
%W |
Week (00 … 53); the first Monday as day 1 of week 1. |
%y |
Year without century (00 … 99). Don’t use. |
%Y |
Year with century. |
%z (output only.) |
Offset from Greenwich; -0800 is 8 hours west of. |
%Z (output only.) |
Time zone as a character string (empty if not available). |
Where leading zeros are shown they will be used on output but are
optional on input. See ?strftime for details.
This only an introduction for plotting functions in R. In Chapter 2, we will discuss visualization in more detail.
plot(x) plot of the values of x (on the y-axis) ordered on the x-axis.
plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis).
hist(x) histogram of the frequencies of x.
barplot(x) histogram of the values of x. Use
horiz=FALSE for horizontal bars.
dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column).
pie(x) circular pie-chart.
boxplot(x) ‘box-and-whiskers’ plot.
sunflowerplot(x, y) id. than plot() but the points with similar coordinates are drawn as flowers which petal number represents the number of points.
stripplot(x) plot of the values of x on a line (an
alternative to boxplot() for small sample sizes).
coplot(x~y | z) bivariate plot of x and y for each value or interval of values of z.
interaction.plot (f1, f2, y) if f1 and f2 are
factors, plots the means of y (on the y-axis) with respect to the values
of f1 (on the x-axis) and of f2 (different curves). The option
fun allows to choose the summary statistic of y (by default
fun=mean).
matplot(x, y) bivariate plot of the first column of x vs. the first one of y, the second one of x vs. the second one of y, etc.
fourfoldplot(x) visualizes, with quarters of circles, the association between two dichotomous variables for different populations (x must be an array with dim=c(2, 2, k), or a matrix with dim=c(2, 2) if k = 1)
assocplot(x) Cohen’s Friendly graph shows the deviations from independence of rows and columns in a two dimensional contingency table.
mosaicplot(x) “mosaic”” graph of the residuals from a log-linear regression of a contingency table.
pairs(x) if x is a matrix or a data frame, draws all possible bivariate plots between the columns of x.
plot.ts(x) if x is an object of class “ts”, plot of x with respect to time, x may be multivariate but the series must have the same frequency and dates. Detailed examples are in Chapter 17: Big Longitudinal Data Analysis.
ts.plot(x) id. but if x is multivariate the series may have different dates and must have the same frequency.
qqnorm(x) quantiles of x with respect to the values expected under a normal law.
qqplot(x, y) quantiles of y with respect to the quantiles of x.
contour(x, y, z) contour plot (data are interpolated
to draw the curves), x and y must be vectors and z must be a matrix so
that dim(z)=c(length(x), length(y)) (x and y may be
omitted).
filled.contour(x, y, z) areas between the contours are colored, and a legend of the colors is drawn as well.
image(x, y, z) plotting actual data with colors.
persp(x, y, z) plotting actual data in perspective view.
stars(x) if x is a matrix or a data frame, draws a graph with segments or a star where each row of x is represented by a star and the columns are the lengths of the segments.
symbols(x, y, …) draws, at the coordinates given by x and y, symbols (circles, squares, rectangles, stars, thermometers or “boxplots”“) which sizes, colors… are specified by supplementary arguments.
termplot(mod.obj) plot of the (partial) effects of a
regression model (mod.obj).
The following parameters are common to many plotting functions:
| Parameters | Explanations |
|---|---|
add=FALSE |
if TRUE superposes the plot on the previous one (if it exists) |
axes=TRUE |
if FALSE does not draw the axes and the box |
type="p" |
specifies the type of plot, “p”: points, “l”: lines, “b”: points connected by lines, “o”: id. But the lines are over the points, “h”: vertical lines, “s”: steps, the data are represented by the top of the vertical lines, “S”: id. However, the data are represented at the bottom of the vertical lines |
xlim=, ylim= |
specifies the lower and upper limits of the axes, for example with
xlim=c(1, 10) or xlim=range(x) |
xlab=, ylab= |
annotates the axes, must be variables of mode character |
main= |
main title, must be a variable of mode character |
sub= |
subtitle (written in a smaller font) |
Let’s look at one simple example - quantile-quantile probability plot. Suppose \(X\sim N(0,1)\) and \(Y\sim Cauchy\) represent the observed/raw and simulated/generated data for one feature (variable) in the data.
X_norm1 <- rnorm(1000)
X_norm2 <- rnorm(1000, m=-75, sd=3.7)
X_Cauchy <- rcauchy(1000)
# compare X to StdNormal distribution
# qqnorm(X,
# main="Normal Q-Q Plot of the data",
# xlab="Theoretical Quantiles of the Normal",
# ylab="Sample Quantiles of the X (Normal) Data")
# qqline(X)
# qqplot(X, Y)
fit_norm_norm = lm(X_norm2 ~ X_norm1)
fit_norm_cauchy = lm(X_Cauchy ~ X_norm1)
# Get model fitted values
Fitted.Values.norm_norm <- fitted(fit_norm_norm)
Fitted.Values.norm_cauchy <- fitted(fit_norm_cauchy)
# Extract model residuals
Residuals.norm_norm <- resid(fit_norm_norm)
Residuals.norm_cauchy <- resid(fit_norm_cauchy)
# Compute the model standardized residuals from lm() object
Std.Res.norm_norm <- MASS::stdres(fit_norm_norm)
Std.Res.norm_cauchy <- MASS::stdres(fit_norm_cauchy)
# Extract the theoretical (Normal) quantiles
Theoretical.Quantiles.norm_norm <- qqnorm(Residuals.norm_norm, plot.it = F)$x
Theoretical.Quantiles.norm_cauchy <- qqnorm(Residuals.norm_cauchy, plot.it = F)$x
qq.df.norm_norm <- data.frame(Std.Res.norm_norm, Theoretical.Quantiles.norm_norm)
qq.df.norm_cauchy <- data.frame(Std.Res.norm_cauchy, Theoretical.Quantiles.norm_cauchy)
qq.df.norm_norm %>%
plot_ly(x = ~Theoretical.Quantiles.norm_norm) %>%
add_markers(y = ~Std.Res.norm_norm, name="Normal(0,1) vs. Normal(-75, 3.7) Data") %>%
add_lines(x = ~Theoretical.Quantiles.norm_norm, y = ~Theoretical.Quantiles.norm_norm,
mode = "line", name = "Theoretical Normal", line = list(width = 2)) %>%
layout(title = "Q-Q Normal Plot", legend = list(orientation = 'x = 100, y = 0.5')) # Normal vs. Cauchy
qq.df.norm_cauchy %>%
plot_ly(x = ~Theoretical.Quantiles.norm_cauchy) %>%
add_markers(y = ~Std.Res.norm_cauchy, name="Normal(0,1) vs. Cauchy Data") %>%
add_lines(x = ~Theoretical.Quantiles.norm_norm, y = ~Theoretical.Quantiles.norm_norm,
mode = "line", name = "Theoretical Normal", line = list(width = 2)) %>%
layout(title = "Normal vs. Cauchy Q-Q Plot", legend = list(x = 100, y = 0.5)) # Q-Q plot data (X) vs. simulation(Y)
#
# myQQ <- function(x, y, ...) {
# #rang <- range(x, y, na.rm=T)
# rang <- range(-4, 4, na.rm=T)
# qqplot(x, y, xlim=rang, ylim=rang)
# }
#
# myQQ(X, Y) # where the Y is the newly simulated data for X
# qqline(X)
# Sample different number of observations from all the 3 processes
X_norm1 <- rnorm(500)
X_norm2 <- rnorm(1000, m=-75, sd=3.7)
X_Cauchy <- rcauchy(1500)
# estimate the quantiles (scale the values to ensure measuring-unit invariance of both processes)
qX_norm1 <- quantile(scale(X_norm1), probs = seq(from=0.01, to=0.99, by=0.01))
qX_norm2 <- quantile(scale(X_norm2), probs = seq(from=0.01, to=0.99, by=0.01))
qq.df.norm_norm <- data.frame(qX_norm1, qX_norm2)
# Normal(0,1) vs. Normal(-75, 3.7)
qq.df.norm_norm %>%
plot_ly(x = ~qX_norm1) %>%
add_markers(y = ~qX_norm2, name="Normal(0,1) vs. Normal(-75, 3.7) Data") %>%
add_lines(x = ~qX_norm1, y = ~qX_norm1,
mode = "line", name = "Theoretical Normal", line = list(width = 2)) %>%
layout(title = "Q-Q Normal Plot", legend = list(x = 100, y = 0.5))# Normal(0,1) vs. Cauchy
qX_norm1 <- quantile(X_norm1, probs = seq(from=0.01, to=0.99, by=0.01))
qX_Cauchy <- quantile(X_Cauchy, probs = seq(from=0.01, to=0.99, by=0.01))
qq.df.norm_cauchy <- data.frame(qX_norm1, qX_Cauchy)
qq.df.norm_cauchy %>%
plot_ly(x = ~qX_norm1) %>%
add_markers(y = ~qX_Cauchy, name="Normal(0,1) vs. Cauchy Data") %>%
add_lines(x = ~qX_norm1, y = ~qX_norm1,
mode = "line", name = "Theoretical Normal", line = list(width = 2)) %>%
layout(title = "Normal vs. Cauchy Q-Q Plot", legend = list(x = 100, y = 0.5))x <- matrix(rnorm(100), ncol = 5)
y <- c(1, seq(19))
z <- cbind(x, y)
z.df <- data.frame(z)
z.df## V1 V2 V3 V4 V5 y
## 1 0.52323096 -0.89798353 -0.39094732 -0.55373306 0.11038516 1
## 2 -0.08042115 -1.20529395 2.14972048 0.46923505 -1.42259713 1
## 3 -0.25279252 0.26675380 0.57090972 -1.13619033 0.05456048 2
## 4 1.67870567 1.26277600 -1.06624623 -3.10125689 0.77973935 3
## 5 -0.73803152 -1.25613142 0.29156874 0.67416763 1.06226220 4
## 6 0.48694076 0.55387445 -0.42028263 0.45264876 -1.33658920 5
## 7 0.11768260 -0.93730322 1.75924396 0.28709427 0.42624034 6
## 8 0.11798351 0.45820392 0.02915744 -0.23048392 -1.15671960 7
## 9 -0.65583076 -0.39199133 0.17465862 3.29309197 1.17048438 8
## 10 -1.37376355 1.05774303 -1.45695926 -0.18816334 0.74176111 9
## 11 0.09597003 -0.55748458 -1.29190378 -0.09555632 -0.30576667 10
## 12 -1.34217913 0.16703063 -1.02963187 0.35998561 -0.55285384 11
## 13 -0.14815680 0.55157272 -0.23791682 1.62506205 0.93179816 12
## 14 -0.85280475 -0.14882391 -0.28797595 0.63862163 0.19482344 13
## 15 1.83167418 1.54912667 0.89475123 -0.93042625 0.48846957 14
## 16 1.31610079 -0.30942746 -1.33916101 -0.36820902 -1.32411648 15
## 17 1.68539024 -0.80479148 -0.78004638 0.93583784 0.01708107 16
## 18 -0.48361691 -0.05891933 0.37559273 -0.78080907 0.21724811 17
## 19 1.01831004 -0.86107527 -0.48010318 -0.57142667 -2.24122252 18
## 20 -0.22236621 0.57612150 -0.74454153 1.02351600 -0.67869260 19
names(z.df) ## [1] "V1" "V2" "V3" "V4" "V5" "y"
# subsetting rows
z.sub <- subset(z.df, y > 2 & (y<10 | V1>0))
z.sub## V1 V2 V3 V4 V5 y
## 4 1.67870567 1.2627760 -1.06624623 -3.10125689 0.77973935 3
## 5 -0.73803152 -1.2561314 0.29156874 0.67416763 1.06226220 4
## 6 0.48694076 0.5538745 -0.42028263 0.45264876 -1.33658920 5
## 7 0.11768260 -0.9373032 1.75924396 0.28709427 0.42624034 6
## 8 0.11798351 0.4582039 0.02915744 -0.23048392 -1.15671960 7
## 9 -0.65583076 -0.3919913 0.17465862 3.29309197 1.17048438 8
## 10 -1.37376355 1.0577430 -1.45695926 -0.18816334 0.74176111 9
## 11 0.09597003 -0.5574846 -1.29190378 -0.09555632 -0.30576667 10
## 15 1.83167418 1.5491267 0.89475123 -0.93042625 0.48846957 14
## 16 1.31610079 -0.3094275 -1.33916101 -0.36820902 -1.32411648 15
## 17 1.68539024 -0.8047915 -0.78004638 0.93583784 0.01708107 16
## 19 1.01831004 -0.8610753 -0.48010318 -0.57142667 -2.24122252 18
z.sub1 <- z.df[z.df$y == 1, ]
z.sub1 ## V1 V2 V3 V4 V5 y
## 1 0.52323096 -0.8979835 -0.3909473 -0.5537331 0.1103852 1
## 2 -0.08042115 -1.2052939 2.1497205 0.4692350 -1.4225971 1
z.sub2 <- z.df[z.df$y %in% c(1, 4), ]
z.sub2## V1 V2 V3 V4 V5 y
## 1 0.52323096 -0.8979835 -0.3909473 -0.5537331 0.1103852 1
## 2 -0.08042115 -1.2052939 2.1497205 0.4692350 -1.4225971 1
## 5 -0.73803152 -1.2561314 0.2915687 0.6741676 1.0622622 4
#subsetting columns
z.sub6 <- z.df[, 1:2]
z.sub6## V1 V2
## 1 0.52323096 -0.89798353
## 2 -0.08042115 -1.20529395
## 3 -0.25279252 0.26675380
## 4 1.67870567 1.26277600
## 5 -0.73803152 -1.25613142
## 6 0.48694076 0.55387445
## 7 0.11768260 -0.93730322
## 8 0.11798351 0.45820392
## 9 -0.65583076 -0.39199133
## 10 -1.37376355 1.05774303
## 11 0.09597003 -0.55748458
## 12 -1.34217913 0.16703063
## 13 -0.14815680 0.55157272
## 14 -0.85280475 -0.14882391
## 15 1.83167418 1.54912667
## 16 1.31610079 -0.30942746
## 17 1.68539024 -0.80479148
## 18 -0.48361691 -0.05891933
## 19 1.01831004 -0.86107527
## 20 -0.22236621 0.57612150
points(x, y) adds points (the option
type= can be used)
lines(x, y) id. but with lines
text(x, y, labels, …) adds text given by labels at
coordinates (x, y). Typical use:
plot(x, y, type="n"); text(x, y, names)
mtext(text, side=3, line=0, …) adds text given by
text in the margin specified by side (see axis() below);
line specifies the line from the plotting area.
segments(x0, y0, x1, y1) draws lines from points
(x0, y0) to points (x1, y1)
arrows(x0, y0, x1, y1, angle= 30, code=2) id. With
arrows at points (x0, y0), if code=2. The
arrow is at point (x1, y1), if code=1. Arrows
are at both if code=3. Angle controls the angle from the
shaft of the arrow to the edge of the arrow head.
abline(a, b) draws a line of slope b
and intercept a.
abline(h=y) draws a horizontal line at ordinate y.
abline(v=x) draws a vertical line at abscissa x.
abline(lm.obj) draws the regression line given by
lm.obj. abline(h=0, col=2) #color (col) is often used
rect(x1, y1, x2, y2) draws a rectangle which left, right, bottom, and top limits are x1, x2, y1, and y2, respectively.
polygon(x, y) draws a polygon linking the points with coordinates given by x and y.
legend(x, y, legend) adds the legend at the point
(x, y) with the symbols given by legend.
title() adds a title and optionally a subtitle.
axis(side, vect) adds an axis at the bottom
(side=1), on the left (side=2), at the top
(side=3), or on the right (side=4);
vect (optional) gives the abscissa (or ordinates) where
tick-marks are drawn.
rug(x) draws the data x on the x-axis as small vertical lines.
locator(n, type=“n”, …) returns the coordinates
(x, y) after the user has clicked n times on the plot with
the mouse; also draws symbols (type="p") or lines
(type="l") with respect to optional graphic parameters (…);
by default nothing is drawn (type="n").
These can be set globally with par(…). Many can be passed as parameters to plotting commands.
adj controls text justification (adj=0
left-justified, adj=0.5 centered, adj=1
right-justified).
bg specifies the color of the background (ex. :
bg="red", bg="blue", …the list of the 657
available colors is displayed with colors()).
bty controls the type of box drawn around the plot.
Allowed values are: “o”, “l”, “7”, “c”, “u” ou “]” (the box looks like
the corresponding character). If bty="n" the box is not
drawn.
cex a value controlling the size of texts and
symbols with respect to the default. The following parameters have the
same control for numbers on the axes-cex.axis, the axis
labels-cex.lab, the title-cex.main, and the
subtitle-cex.sub.
col controls the color of symbols and lines. Use
color names: “red”, “blue” see colors() or as “#RRGGBB”;
see rgb(), hsv(), gray(), and
rainbow(); as for cex there are: col.axis,
col.lab, col.main, col.sub.
font an integer which controls the style of text (1:
normal, 2: italics, 3: bold, 4: bold italics); as for cex there are:
font.axis, font.lab, font.main,
font.sub.
las an integer which controls the orientation of the axis labels (0: parallel to the axes, 1: horizontal, 2: perpendicular to the axes, 3: vertical).
lty controls the type of lines, can be an integer or
string (1: “solid”, 2: “dashed”, 3: “dotted”, 4: “dotdash”, 5:
“longdash”, 6: “twodash”, or a string of up to eight characters (between
“0” and “9”) which specifies alternatively the length, in points or
pixels, of the drawn elements and the blanks, for example
lty="44" will have the same effect than
lty=2.
lwd a numeric which controls the width of lines, default=1.
mar a vector of 4 numeric values which control the
space between the axes and the border of the graph of the form
c(bottom, left, top, right), the default values are
c(5.1, 4.1, 4.1, 2.1).
mfcol a vector of the form c(nr, nc)
which partitions the graphic window as a matrix of nr lines and nc
columns, the plots are then drawn in columns.
mfrow id. but the plots are drawn by row.
pch controls the type of symbol, either an integer between 1 and 25, or any single character within ““.
ts.plot(x) id. but if x is multivariate the series may have different dates by x and y.
ps an integer which controls the size in points of texts and symbols.
pty a character, which specifies the type of the plotting region, “s”: square, “m”: maximal.
tck a value which specifies the length of tick-marks
on the axes as a fraction of the smallest of the width or height of the
plot; if tck=1 a grid is drawn.
tcl a value which specifies the length of tick-marks
on the axes as a fraction of the height of a line of text (by default
tcl=-0.5).
xaxt if xaxt="n" the x-axis is set but
not drawn (useful in conjunction with
axis(side=1, ...)).
yaxt if yaxt="n" the y-axis is set but
not drawn (useful in conjunction with
axis(side=2, ...)).
Lattice (Trellis) graphics
| Expression | Explanation |
|---|---|
| xyplot(y~x) | bivariate plots (with many functionalities). |
| barchart(y~x) | histogram of the values of y with respect to those of x. |
| dotplot(y~x) | Cleveland dot plot (stacked plots line-by-line and column-by-column) |
| densityplot(~x) | density functions plot |
| histogram(~x) | histogram of the frequencies of x |
| bwplot(y~x) | “box-and-whiskers” plot |
| qqmath(~x) | quantiles of x with respect to the values expected under a theoretical distribution |
| stripplot(y~x) | single dimension plot, x must be numeric, y may be a factor |
| qq(y~x) | quantiles to compare two distributions, x must be numeric, y may be numeric, character, or factor but must have two “levels” |
| splom(~x) | matrix of bivariate plots |
| parallel(~x) | parallel coordinates plot |
| levelplot(\(z\sim x*y\|g1*g2\)) | colored plot of the values of z at the coordinates given by x and y (x, y and z are all of the same length) |
| wireframe(\(z\sim x*y\|g1*g2\)) | 3d surface plot |
| cloud(\(z\sim x*y\|g1*g2\)) | 3d scatter plot |
In the normal Lattice formula, y~x|g1*g2 has
combinations of optional conditioning variables g1 and g2 plotted on
separate panels. Lattice functions take many of the same arguments as
base graphics plus also data= the data frame for the
formula variables and subset= for subsetting. Use
panel= to define a custom panel function (see
apropos("panel") and ?lines). Lattice
functions return an object of class trellis and have to be printed to
produce the graph. Use print(xyplot(...)) inside functions
where automatic printing doesn’t work. Use lattice.theme
and lset to change Lattice defaults.
The standard setting for our own function is:
function.name<-function(x) {
expr(an expression)
return(value)
}
Where \(x\) is the parameter in the expression. A simple example of this is:
adding<-function(x=0, y=0){z<-x+y
return(z)}
adding(x=5, y=10)## [1] 15
Conditions setting:
if(cond) {expr}
or
if(cond) cons.expr else alt.expr
x<-10
if(x>10) z="T" else z="F"
z## [1] "F"
Alternatively, ifelse represents a vectorized and
extremely efficient conditional mechanism that provides one of the main
advantages of R.
For loop:
for(var in seq) expr
x<-c()
for(i in 1:10) x[i]=i
x## [1] 1 2 3 4 5 6 7 8 9 10
Other loops:
While loop: while(cond) expr
Repeat: repeat expr
Applied to innermost of nested loops: break,
next
Use braces {} around statements.
ifelse(test, yes, no) a value with the same shape as test filled with elements from either yes or no.
do.call(funname, args) executes a function call from the name of the function and a list of arguments to be passed to it.
Before we demonstrate how to synthetically simulate that that resembles closely the characteristics of real observations from the same process let’s import some observed data for initial exploratory analytics.
Using the SOCR Health Evaluation and Linkage to Primary (HELP) Care Dataset we can extract some sample data: 00_Tiny_SOCR_HELP_Data_Simmulation.csv.
data_1 <- read.csv("https://umich.instructure.com/files/1628625/download?download_frd=1", as.is=T, header=T)
# data_1 = read.csv(file.choose( ))
attach(data_1)
# to ensure all variables are accessible within R, e.g., using "age" instead of data_1$age
# i2 maximum number of drinks (standard units) consumed per day (in the past 30 days range 0-184) see also i1
# treat randomization group (0=usual care, 1=HELP clinic)
# pcs SF-36 Physical Component Score (range 14-75)
# mcs SF-36 Mental Component Score(range 7-62)
# cesd Center for Epidemiologic Studies Depression scale (range 0-60)
# indtot Inventory of Drug Use Con-sequences (InDUC) total score (range 4-45)
# pss_fr perceived social supports (friends, range 0-14) see also dayslink
# drugrisk Risk-Assessment Battery(RAB) drug risk score (range0-21)
# satreat any BSAS substance abuse treatment at baseline (0=no, 1=yes)| ID | i2 | age | treat | homeless | pcs | mcs | cesd | indtot | pss_fr | drugrisk | sexrisk | satreat | female | substance | racegrp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 25 | 0 | 0 | 49 | 7 | 46 | 37 | 0 | 1 | 6 | 0 | 0 | cocaine | black 2 |
| 3 | 39 | 36 | 0 | 0 | 76 | 9 | 33 | 41 | 12 | 19 | 4 | 0 | 0 | heroin | black |
| . | |||||||||||||||
| 100 | 81 | 22 | 0 | 0 | 37 | 17 | 19 | 30 | 3 | 0 | 10 | 0 | 0 | alcohol | other |
summary(data_1)## ID i2 age treat
## Min. : 1.00 Min. : 0.00 Min. : 3.00 Min. :0.0000
## 1st Qu.: 24.25 1st Qu.: 1.00 1st Qu.:27.00 1st Qu.:0.0000
## Median : 50.50 Median : 15.50 Median :34.00 Median :0.0000
## Mean : 50.29 Mean : 27.08 Mean :34.31 Mean :0.1222
## 3rd Qu.: 74.75 3rd Qu.: 39.00 3rd Qu.:43.00 3rd Qu.:0.0000
## Max. :100.00 Max. :137.00 Max. :65.00 Max. :2.0000
## homeless pcs mcs cesd
## Min. :0.0000 Min. : 6.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:41.25 1st Qu.:20.25 1st Qu.:17.25
## Median :0.0000 Median :48.50 Median :29.00 Median :30.00
## Mean :0.1444 Mean :47.61 Mean :30.49 Mean :30.21
## 3rd Qu.:0.0000 3rd Qu.:57.00 3rd Qu.:39.75 3rd Qu.:43.00
## Max. :1.0000 Max. :76.00 Max. :93.00 Max. :68.00
## indtot pss_fr drugrisk sexrisk
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:31.25 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 1.250
## Median :36.00 Median : 6.000 Median : 0.000 Median : 5.000
## Mean :37.03 Mean : 6.533 Mean : 2.578 Mean : 4.922
## 3rd Qu.:45.00 3rd Qu.:10.000 3rd Qu.: 3.000 3rd Qu.: 7.750
## Max. :60.00 Max. :20.000 Max. :23.000 Max. :13.000
## satreat female substance racegrp
## Min. :0.00000 Min. :0.00000 Length:90 Length:90
## 1st Qu.:0.00000 1st Qu.:0.00000 Class :character Class :character
## Median :0.00000 Median :0.00000 Mode :character Mode :character
## Mean :0.07778 Mean :0.05556
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
x.norm <- rnorm(n=200, m=10, sd=20) # hist(x.norm, main='N(10, 20) Histogram')
plot_ly(x = ~x.norm, type = "histogram") %>%layout(bargap=0.1, title='N(10, 20) Histogram')mean(data_1$age) ## [1] 34.31111
sd(data_1$age)## [1] 11.68947
Next, we will simulate new synthetic data to match the
properties/characteristics of the observed data (using
Uniform, Normal, and Poisson
distributions):
# i2 [0: 184]
# age m=34, sd=12
# treat {0, 1}
# homeless {0, 1}
# pcs 14-75
# mcs 7-62
# cesd 0-60
# indtot 4-45
# pss_fr 0-14
# drugrisk 0-21
# sexrisk
# satreat (0=no, 1=yes)
# female (0=no, 1=yes)
# racegrp (black, white, other)
# Demographics variables
# Define number of subjects
NumSubj <- 282
NumTime <- 4
# Define data elements
# Cases
Cases <- c(2, 3, 6, 7, 8, 10, 11, 12, 13, 14, 17, 18, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35, 37, 41, 42, 43, 44, 45, 53, 55, 58, 60, 62, 67, 69, 71, 72, 74, 79, 80, 85, 87, 90, 95, 97, 99, 100, 101, 106, 107, 109, 112, 120, 123, 125, 128, 129, 132, 134, 136, 139, 142, 147, 149, 153, 158, 160, 162, 163, 167, 172, 174, 178, 179, 180, 182, 192, 195, 201, 208, 211, 215, 217, 223, 227, 228, 233, 235, 236, 240, 245, 248, 250, 251, 254, 257, 259, 261, 264, 268, 269, 272, 273, 275, 279, 288, 289, 291, 296, 298, 303, 305, 309, 314, 318, 324, 325, 326, 328, 331, 332, 333, 334, 336, 338, 339, 341, 344, 346, 347, 350, 353, 354, 359, 361, 363, 364, 366, 367, 368, 369, 370, 371, 372, 374, 375, 376, 377, 378, 381, 382, 384, 385, 386, 387, 389, 390, 393, 395, 398, 400, 410, 421, 423, 428, 433, 435, 443, 447, 449, 450, 451, 453, 454, 455, 456, 457, 458, 459, 460, 461, 465, 466, 467, 470, 471, 472, 476, 477, 478, 479, 480, 481, 483, 484, 485, 486, 487, 488, 489, 492, 493, 494, 496, 498, 501, 504, 507, 510, 513, 515, 528, 530, 533, 537, 538, 542, 545, 546, 549, 555, 557, 559, 560, 566, 572, 573, 576, 582, 586, 590, 592, 597, 603, 604, 611, 619, 621, 623, 624, 625, 631, 633, 634, 635, 637, 640, 641, 643, 644, 645, 646, 647, 648, 649, 650, 652, 654, 656, 658, 660, 664, 665, 670, 673, 677, 678, 679, 680, 682, 683, 686, 687, 688, 689, 690, 692)
# Imaging Biomarkers
L_caudate_ComputeArea <- rpois(NumSubj, 600)
L_caudate_Volume <- rpois(NumSubj, 800)
R_caudate_ComputeArea <- rpois(NumSubj, 893)
R_caudate_Volume <- rpois(NumSubj, 1000)
L_putamen_ComputeArea <- rpois(NumSubj, 900)
L_putamen_Volume <- rpois(NumSubj, 1400)
R_putamen_ComputeArea <- rpois(NumSubj, 1300)
R_putamen_Volume <- rpois(NumSubj, 3000)
L_hippocampus_ComputeArea <- rpois(NumSubj, 1300)
L_hippocampus_Volume <- rpois(NumSubj, 3200)
R_hippocampus_ComputeArea <- rpois(NumSubj, 1500)
R_hippocampus_Volume <- rpois(NumSubj, 3800)
cerebellum_ComputeArea <- rpois(NumSubj, 16700)
cerebellum_Volume <- rpois(NumSubj, 14000)
L_lingual_gyrus_ComputeArea <- rpois(NumSubj, 3300)
L_lingual_gyrus_Volume <- rpois(NumSubj, 11000)
R_lingual_gyrus_ComputeArea <- rpois(NumSubj, 3300)
R_lingual_gyrus_Volume <- rpois(NumSubj, 12000)
L_fusiform_gyrus_ComputeArea <- rpois(NumSubj, 3600)
L_fusiform_gyrus_Volume <- rpois(NumSubj, 11000)
R_fusiform_gyrus_ComputeArea <- rpois(NumSubj, 3300)
R_fusiform_gyrus_Volume <- rpois(NumSubj, 10000)
Sex <- ifelse(runif(NumSubj)<.5, 0, 1)
Weight <- as.integer(rnorm(NumSubj, 80, 10))
Age <- as.integer(rnorm(NumSubj, 62, 10))
# Diagnosis:
Dx <- c(rep("PD", 100), rep("HC", 100), rep("SWEDD", 82))
# Genetics
chr12_rs34637584_GT <- c(ifelse(runif(100)<.3, 0, 1), ifelse(runif(100)<.6, 0, 1), ifelse(runif(82)<.4, 0, 1)) # NumSubj Bernoulli trials
chr17_rs11868035_GT <- c(ifelse(runif(100)<.7, 0, 1), ifelse(runif(100)<.4, 0, 1), ifelse(runif(82)<.5, 0, 1)) # NumSubj Bernoulli trials
# Clinical # rpois(NumSubj, 15) + rpois(NumSubj, 6)
UPDRS_part_I <- c( ifelse(runif(100)<.7, 0, 1) + ifelse(runif(100) < .7, 0, 1),
ifelse(runif(100)<.6, 0, 1)+ ifelse(runif(100)<.6, 0, 1),
ifelse(runif(82)<.4, 0, 1)+ ifelse(runif(82)<.4, 0, 1) )
UPDRS_part_II <- c(sample.int(20, 100, replace=T), sample.int(14, 100, replace=T),
sample.int(18, 82, replace=T) )
UPDRS_part_III <- c(sample.int(30, 100, replace=T), sample.int(20, 100, replace=T),
sample.int(25, 82, replace=T) )
# Time: VisitTime - done automatically below in aggregator
# Data (putting all components together)
sim_PD_Data <- cbind(
rep(Cases, each= NumTime), # Cases
rep(L_caudate_ComputeArea, each= NumTime), # Imaging
rep(Sex, each= NumTime), # Demographics
rep(Weight, each= NumTime),
rep(Age, each= NumTime),
rep(Dx, each= NumTime), # Dx
rep(chr12_rs34637584_GT, each= NumTime), # Genetics
rep(chr17_rs11868035_GT, each= NumTime),
rep(UPDRS_part_I, each= NumTime), # Clinical
rep(UPDRS_part_II, each= NumTime),
rep(UPDRS_part_III, each= NumTime),
rep(c(0, 6, 12, 18), NumSubj) # Time
)
# Assign the column names
colnames(sim_PD_Data) <- c(
"Cases",
"L_caudate_ComputeArea",
"Sex", "Weight", "Age",
"Dx", "chr12_rs34637584_GT", "chr17_rs11868035_GT",
"UPDRS_part_I", "UPDRS_part_II", "UPDRS_part_III",
"Time"
)
# some QC
summary(sim_PD_Data) ## Cases L_caudate_ComputeArea Sex Weight
## Length:1128 Length:1128 Length:1128 Length:1128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Age Dx chr12_rs34637584_GT chr17_rs11868035_GT
## Length:1128 Length:1128 Length:1128 Length:1128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## UPDRS_part_I UPDRS_part_II UPDRS_part_III Time
## Length:1128 Length:1128 Length:1128 Length:1128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
dim(sim_PD_Data) ## [1] 1128 12
head(sim_PD_Data) ## Cases L_caudate_ComputeArea Sex Weight Age Dx chr12_rs34637584_GT
## [1,] "2" "652" "0" "84" "79" "PD" "0"
## [2,] "2" "652" "0" "84" "79" "PD" "0"
## [3,] "2" "652" "0" "84" "79" "PD" "0"
## [4,] "2" "652" "0" "84" "79" "PD" "0"
## [5,] "3" "594" "0" "99" "29" "PD" "1"
## [6,] "3" "594" "0" "99" "29" "PD" "1"
## chr17_rs11868035_GT UPDRS_part_I UPDRS_part_II UPDRS_part_III Time
## [1,] "0" "1" "11" "12" "0"
## [2,] "0" "1" "11" "12" "6"
## [3,] "0" "1" "11" "12" "12"
## [4,] "0" "1" "11" "12" "18"
## [5,] "0" "0" "18" "20" "0"
## [6,] "0" "0" "18" "20" "6"
# hist(data_1$age, freq=FALSE, right=FALSE, ylim = c(0,0.05))
# lines(density(as.numeric(as.data.frame(sim_PD_Data)$Age)), lwd=2, col="blue")
# legend("topright", c("Raw Data", "Simulated Data"), fill=c("black", "blue"))
x <- data_1$age
fit <- density(as.numeric(as.data.frame(sim_PD_Data)$Age))
plot_ly(x = x, type = "histogram", name = "Histogram (Raw Age)") %>%
add_trace(x = fit$x, y = fit$y, type = "scatter", mode = "lines",
fill = "tozeroy", yaxis = "y2", name = "Density (Simulated Age)") %>%
layout(title='Observed and Simulated Ages', yaxis2 = list(overlaying = "y", side = "right")) # Save Results
# Write out (save) the result to a file that can be shared
write.table(sim_PD_Data, "output_data.csv", sep=", ", row.names=FALSE, col.names=TRUE) The Tidyverse represents a suite of integrated
R packages that provide support for
data science and Big Data analytics, including
functionality for data import (readr), data manipulation
(dplyr), data visualization (ggplot2),
expanded data frames (tibble), data tidying
(tidyr), and functional programming (purrr).
These learning
modules provide introduction to tidyverse.
R documentation and resourcesThe Software Carpentry Foundation provides useful Programming with R and R for Reproducible Scientific Analysis materials.
SOCR
Datasets can automatically be downloaded into the R
environment using the following protocol, which uses the Parkinson’s
Disease dataset as an example:
library(rvest)
#Loading required package: xml2
#wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata") # UMich SOCR Data
wiki_url <- read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_PD_BiomedBigMetadata") # UCLA SOCR Data
html_nodes(wiki_url, "#content")## {xml_nodeset (1)}
## [1] <div id="content">\n\t\t<a name="top" id="top"></a>\n\t\t\t\t<h1 id="firs ...
pd_data <- html_table(html_nodes(wiki_url, "table")[[2]])
head(pd_data); summary(pd_data)## # A tibble: 6 × 33
## Cases L_caud…¹ L_cau…² R_cau…³ R_cau…⁴ L_put…⁵ L_put…⁶ R_put…⁷ R_put…⁸ L_hip…⁹
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 2 597 767 855 968 842 1357 1285 3052 1306
## 2 2 597 767 855 968 842 1357 1285 3052 1306
## 3 2 597 767 855 968 842 1357 1285 3052 1306
## 4 2 597 767 855 968 842 1357 1285 3052 1306
## 5 3 604 873 935 1043 892 1366 1305 2920 1292
## 6 3 604 873 935 1043 892 1366 1305 2920 1292
## # … with 23 more variables: L_hippocampus_Volume <int>,
## # R_hippocampus_ComputeArea <int>, R_hippocampus_Volume <int>,
## # cerebellum_ComputeArea <int>, cerebellum_Volume <int>,
## # L_lingual_gyrus_ComputeArea <int>, L_lingual_gyrus_Volume <int>,
## # R_lingual_gyrus_ComputeArea <int>, R_lingual_gyrus_Volume <int>,
## # L_fusiform_gyrus_ComputeArea <int>, L_fusiform_gyrus_Volume <int>,
## # R_fusiform_gyrus_ComputeArea <int>, R_fusiform_gyrus_Volume <int>, …
## Cases L_caudate_ComputeArea L_caudate_Volume R_caudate_ComputeArea
## Min. : 2.0 Min. :525.0 Min. :719.0 Min. :795.0
## 1st Qu.:158.0 1st Qu.:582.0 1st Qu.:784.0 1st Qu.:875.0
## Median :363.5 Median :600.0 Median :800.0 Median :897.0
## Mean :346.1 Mean :600.4 Mean :800.3 Mean :894.5
## 3rd Qu.:504.0 3rd Qu.:619.0 3rd Qu.:819.0 3rd Qu.:916.0
## Max. :692.0 Max. :667.0 Max. :890.0 Max. :977.0
## R_caudate_Volume L_putamen_ComputeArea L_putamen_Volume R_putamen_ComputeArea
## Min. : 916 Min. : 815.0 Min. :1298 Min. :1198
## 1st Qu.: 979 1st Qu.: 879.0 1st Qu.:1376 1st Qu.:1276
## Median : 998 Median : 897.5 Median :1400 Median :1302
## Mean :1001 Mean : 898.9 Mean :1400 Mean :1300
## 3rd Qu.:1022 3rd Qu.: 919.0 3rd Qu.:1427 3rd Qu.:1321
## Max. :1094 Max. :1003.0 Max. :1507 Max. :1392
## R_putamen_Volume L_hippocampus_ComputeArea L_hippocampus_Volume
## Min. :2846 Min. :1203 Min. :3036
## 1st Qu.:2959 1st Qu.:1277 1st Qu.:3165
## Median :3000 Median :1300 Median :3200
## Mean :3000 Mean :1302 Mean :3198
## 3rd Qu.:3039 3rd Qu.:1325 3rd Qu.:3228
## Max. :3148 Max. :1422 Max. :3381
## R_hippocampus_ComputeArea R_hippocampus_Volume cerebellum_ComputeArea
## Min. :1414 Min. :3634 Min. :16378
## 1st Qu.:1479 1st Qu.:3761 1st Qu.:16617
## Median :1504 Median :3802 Median :16699
## Mean :1504 Mean :3799 Mean :16700
## 3rd Qu.:1529 3rd Qu.:3833 3rd Qu.:16784
## Max. :1602 Max. :4013 Max. :17096
## cerebellum_Volume L_lingual_gyrus_ComputeArea L_lingual_gyrus_Volume
## Min. :13680 Min. :3136 Min. :10709
## 1st Qu.:13933 1st Qu.:3262 1st Qu.:10943
## Median :13996 Median :3299 Median :11007
## Mean :14002 Mean :3300 Mean :11010
## 3rd Qu.:14077 3rd Qu.:3333 3rd Qu.:11080
## Max. :14370 Max. :3469 Max. :11488
## R_lingual_gyrus_ComputeArea R_lingual_gyrus_Volume
## Min. :3135 Min. :11679
## 1st Qu.:3258 1st Qu.:11935
## Median :3294 Median :12001
## Mean :3296 Mean :12008
## 3rd Qu.:3338 3rd Qu.:12079
## Max. :3490 Max. :12324
## L_fusiform_gyrus_ComputeArea L_fusiform_gyrus_Volume
## Min. :3446 Min. :10682
## 1st Qu.:3554 1st Qu.:10947
## Median :3594 Median :11016
## Mean :3598 Mean :11011
## 3rd Qu.:3637 3rd Qu.:11087
## Max. :3763 Max. :11394
## R_fusiform_gyrus_ComputeArea R_fusiform_gyrus_Volume Sex
## Min. :3094 Min. : 9736 Min. :0.0000
## 1st Qu.:3260 1st Qu.: 9928 1st Qu.:0.0000
## Median :3296 Median : 9994 Median :1.0000
## Mean :3299 Mean : 9996 Mean :0.5851
## 3rd Qu.:3332 3rd Qu.:10058 3rd Qu.:1.0000
## Max. :3443 Max. :10235 Max. :1.0000
## Weight Age Dx chr12_rs34637584_GT
## Min. : 51.00 Min. :31.00 Length:1128 Min. :0.000
## 1st Qu.: 71.00 1st Qu.:54.00 Class :character 1st Qu.:0.000
## Median : 78.50 Median :61.00 Mode :character Median :1.000
## Mean : 78.45 Mean :60.64 Mean :0.539
## 3rd Qu.: 84.00 3rd Qu.:68.00 3rd Qu.:1.000
## Max. :109.00 Max. :87.00 Max. :1.000
## chr17_rs11868035_GT UPDRS_part_I UPDRS_part_II UPDRS_part_III
## Min. :0.0000 Min. :0.000 Min. : 1.000 Min. : 1.00
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.: 5.000 1st Qu.: 6.00
## Median :0.0000 Median :1.000 Median : 9.000 Median :13.00
## Mean :0.4184 Mean :0.773 Mean : 8.879 Mean :13.02
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:13.000 3rd Qu.:18.00
## Max. :1.0000 Max. :2.000 Max. :20.000 Max. :30.00
## Time
## Min. : 0.0
## 1st Qu.: 4.5
## Median : 9.0
## Mean : 9.0
## 3rd Qu.:13.5
## Max. :18.0
Also see the SMHS Simulation Primer.
Most programs that give incorrect results are impacted by logical errors. When errors (bugs, exceptions) occur, we need explore deeper – this procedure to identify and fix bugs is “debugging”.
R tools for debugging: traceback(), debug() browser() trace() recover()
traceback(): Failing R functions report to the screen immediately the error. Calling traceback() will show the function where the error occurred. The traceback() function prints the list of functions that were called before the error occurred.
The function calls are printed in reverse order.
f1<-function(x) { r<- x-g1(x); r }
g1<-function(y) { r<-y*h1(y); r }
h1<-function(z) { r<-log(z); if(r<10) {r^2} else {r^3}}
f1(-1)## Warning in log(z): NaNs produced
## Error in if (r < 10) {: missing value where TRUE/FALSE needed
traceback()## No traceback available
3: h(y)## Error in h(y): could not find function "h"
2: g(x)## Error in g(x): could not find function "g"
1: f(-1)## Error in f(-1): could not find function "f"
debug()
traceback() does not tell you where is the error. To
find out which line causes the error, we may step through the function
using debug().
debug(foo) flags the function foo() for
debugging. undebug(foo) unflags the function.
When a function is flagged for debugging, each statement in the function is executed one at a time. After a statement is executed, the function suspends and user can interact with the R shell.
This allows us to inspect a function line-by-line.
Example: compute sum of squared error SS
#compute sum of squares
SS<-function(mu, x) {
d<-x-mu;
d2<-d^2;
ss<-sum(d2);
ss }
set.seed(100);
x<-rnorm(100);
SS(1, x)## [1] 202.5615
#to debug
debug(SS); SS(1, x)## debugging in: SS(1, x)
## debug at <text>#3: {
## d <- x - mu
## d2 <- d^2
## ss <- sum(d2)
## ss
## }
## debug at <text>#4: d <- x - mu
## debug at <text>#5: d2 <- d^2
## debug at <text>#6: ss <- sum(d2)
## debug at <text>#7: ss
## exiting from: SS(1, x)
## [1] 202.5615
In the debugging shell (“Browse[1]>”), users can:
Example:
debug(SS)
SS(1, x)## debugging in: SS(1, x)
## debug at <text>#3: {
## d <- x - mu
## d2 <- d^2
## ss <- sum(d2)
## ss
## }
## debug at <text>#4: d <- x - mu
## debug at <text>#5: d2 <- d^2
## debug at <text>#6: ss <- sum(d2)
## debug at <text>#7: ss
## exiting from: SS(1, x)
## [1] 202.5615
Browse[1]> n
debug: d <- x - mu ## the next command
Browse[1]> ls() ## current environment [1] “mu” “x” ## there is no
d
Browse[1]> n ## go one step debug: d2 <- d^2 ## the next
command
Browse[1]> ls() ## current environment [1] “d” “mu” “x” ## d has been
created
Browse[1]> d[1:3] ## first three elements of d [1] -1.5021924
-0.8684688 -1.0789171
Browse[1]> hist(d) ## histogram of d
Browse[1]> where ## current position in call stack where 1: SS(1,
x)
Browse[1]> n
debug: ss <- sum(d2)
Browse[1]> Q ## quit
undebug(SS) ## remove debug label, stop debugging process
SS(1, x) ## now call SS again will without debugging## [1] 202.5615
You can label a function for debugging while debugging another function
f<-function(x){ r<-x-g(x);r }
g<-function(y){ r<-y*h(y);r }
h<-function(z){ r<-log(z);if(r<10) r^2 else r^3 }
debug(f) # ## If you only debug f, you will not go into g
f(-1)## debugging in: f(-1)
## debug at <text>#1: {
## r <- x - g(x)
## r
## }
## debug at <text>#1: r <- x - g(x)
## Warning in log(z): NaNs produced
## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed
Browse[1]> n
Browse[1]> n
But, we can also label g and h for debugging when we debug f
f(-1)
Browse[1]> n
Browse[1]> debug(g)
Browse[1]> debug(h)
Browse[1]> n
Inserting a call to browser() in a function will
pause the execution of a function at the point where browser() is
called.
Similar to using debug() except you can control where execution gets
paused.
h<-function(z) {browser() ## a break point inserted here
r<-log(z);if(r<10) r^2 else r^3 }
f(-1)## debugging in: f(-1)
## debug at <text>#1: {
## r <- x - g(x)
## r
## }
## debug at <text>#1: r <- x - g(x)
## Called from: h(y)
## debug at <text>#2: r <- log(z)
## Warning in log(z): NaNs produced
## debug at <text>#2: if (r < 10) r^2 else r^3
## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed
Browse[1]> ls()
Browse[1]> z
Browse[1]> n
Browse[1]> n
Browse[1]> ls()
Browse[1]> c
Calling trace() on a function allows inserting new code into a function. The syntax for trace() may be challenging.
as.list(body(h))
trace(“h”, quote(
if(is.nan(r))
{browser()}), at=3, print=FALSE)
f(1)
f(-1)
trace(“h”, quote(if(z<0) {z<-1}), at=2, print=FALSE)
f(-1)
untrace()
During the debugging process, recover() allows checking the status of variables in upper level functions. recover() can be used as an error handler using options() (e.g. options(error=recover)). When functions throw exceptions, execution stops at point of failure. Browsing the function calls and examining the environment may indicate the source of the problem.