Data analysis and visualization using R

September 2015

Descriptive statistics
Sort/order
General purpose functions
Reading and writing textual data
Flow control
Creating your own functions

Descriptive statistics

Descriptive stats functions

R provides a wealth of descriptive statistics functions
They are listed on the next two slides

descriptive statistics functions (1)

function	purpose
mean( )	mean
median( )	median
var( )	variance s^2
sd( )	standard deviation s
min( )	minimum
max( )	maximum
range( )	min and max

descriptive statistics functions (2)

function	purpose
quantile( )	quantiles
IQR( )	interquartile range
summary( )	6-number summary
hist( )	histogram
boxplot( )	boxplot
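
A minimal sketch of the numeric functions from these two tables, applied to a small made-up vector (x is hypothetical example data, chosen so the results are easy to check by hand):

```r
## hypothetical example data, not taken from the slides
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)      # 5
median(x)    # 4.5
var(x)       # 32 / 7, about 4.57 (squared deviations sum to 32, n - 1 = 7)
sd(x)        # square root of the variance, about 2.14
min(x)       # 2
max(x)       # 9
range(x)     # 2 9
IQR(x)       # 1.5 (75% quantile 5.5 minus 25% quantile 4)
```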

the `quantile()` function

Gives the data values corresponding to the specified quantiles
Defaults to 0% 25% 50% 75% 100%

quantile(ChickWeight$weight)

    0%    25%    50%    75%   100% 
 35.00  63.00 103.00 163.75 373.00

quantile(ChickWeight$weight, probs = seq(0, 1, 0.2))

   0%   20%   40%   60%   80%  100% 
 35.0  57.0  85.0 126.0 181.6 373.0

Interquartile range `IQR()`

Gives the range between 25% and 75% quantiles

IQR(ChickWeight$weight)

[1] 100.75

## same as
quantile(ChickWeight$weight)[4] - quantile(ChickWeight$weight)[2]

   75% 
100.75

`boxplot()` is a picture of `summary()`

Boxplot is a graph of the 5-number summary, but summary() also gives the mean

summary(ChickWeight$weight)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   35.0    63.0   103.0   121.8   163.8   373.0

boxplot(ChickWeight$weight)

sorting and ordering

Sort and Order

sort() sorts a vector
order() returns the vector of indices that would put a vector in sorted order

x <- c(2, 4, 6, 1, 3)
sort(x)

[1] 1 2 3 4 6

ordr <- order(x); ordr

[1] 4 1 5 2 3

x[ordr]

[1] 1 2 3 4 6

Use order when order matters

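sort() returns a sorted copy of a single vector; any coupled vectors keep their original order, so their elements no longer correspond. This is usually NOT desirable when coupled vectors are being analysed (as in the most used data type, the data frame!). Use order() to get one index vector and apply it to all of them.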
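
A small sketch of why order() is the safer choice with coupled vectors. The names gene_ids and expr_level are hypothetical; element i of each vector describes the same gene:

```r
## hypothetical coupled vectors: element i of each belongs to the same gene
gene_ids   <- c("P53", "BRCA1", "VAMP1", "FHIT")
expr_level <- c(4.5, 7.3, 5.4, 2.4)

sort(expr_level)           # 2.4 4.5 5.4 7.3, but gene_ids is left as it was
idx <- order(expr_level)   # 4 1 3 2
gene_ids[idx]              # "FHIT" "P53" "VAMP1" "BRCA1"
expr_level[idx]            # 2.4 4.5 5.4 7.3, still matched to gene_ids[idx]
```

Indexing both vectors with the same idx keeps each gene paired with its own value.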

Sorting dataframes

geneNames <- c("P53","BRCA1","VAMP1", "FHIT")
sig <- c(TRUE, TRUE, FALSE, FALSE)
meanExp <- c(4.5, 7.3, 5.4, 2.4)
genes <- data.frame(
    "name" = geneNames,
    "significant" = sig,
    "meanExp" = meanExp)
genes

   name significant meanExp
1   P53        TRUE     4.5
2 BRCA1        TRUE     7.3
3 VAMP1       FALSE     5.4
4  FHIT       FALSE     2.4

## sort on gene name
genes[order(genes$name), ]

   name significant meanExp
2 BRCA1        TRUE     7.3
4  FHIT       FALSE     2.4
1   P53        TRUE     4.5
3 VAMP1       FALSE     5.4

Multilevel sorting

You can also sort on multiple properties:

students <- data.frame(
    "st.names" = c("Henk", "Piet", "Sara", "Henk", "Henk"),
    "st.ages" = c(22, 23, 18, 19, 24))
students[order(students$st.names, students$st.ages), ]

  st.names st.ages
4     Henk      19
1     Henk      22
5     Henk      24
2     Piet      23
3     Sara      18
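
Mixing sort directions is a common follow-up. One possible approach, sketched here on the same students data, is to negate a numeric key so that it sorts in descending order while the names stay ascending (this trick only works for numeric columns):

```r
students <- data.frame(
    "st.names" = c("Henk", "Piet", "Sara", "Henk", "Henk"),
    "st.ages" = c(22, 23, 18, 19, 24))

## names ascending, ages descending within each name
by_name_oldest_first <- students[order(students$st.names, -students$st.ages), ]
by_name_oldest_first
## rows come out in the order 5, 1, 4, 2, 3:
## Henk 24, Henk 22, Henk 19, Piet 23, Sara 18
```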

Some general purpose functions (part 1)

Remove objects from memory

When working with large datasets it may be useful to remove objects from memory when they are no longer needed
e.g. intermediate results
use rm() to do this: rm(genes), rm(x, y, z)
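
A minimal sketch of removing an intermediate result; tmp_result is a hypothetical object name:

```r
tmp_result <- rnorm(1e6)                       # hypothetical large intermediate result
before <- exists("tmp_result", inherits = FALSE)   # TRUE
rm(tmp_result)                                 # remove it from the workspace
after <- exists("tmp_result", inherits = FALSE)    # FALSE
```

ls() lists the objects currently in the workspace, which helps to decide what can be removed.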

search: the R scope search

R has a search path for objects/variables: search()
R searches from the beginning of this list until the variable has been found.
Order:
- 1st position: always .GlobalEnv (defined in current session)
- 2nd position: by default the most recently loaded package / attached data frame
- last position: always package:base
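
The search path can be inspected directly. A short sketch, using the grid package (shipped with R) purely as an example of a package to load:

```r
path <- search()
path[1]               # ".GlobalEnv"
path[length(path)]    # "package:base"

library(grid)         # example package; any installed package would do
search()              # "package:grid" now appears, at position 2 in a fresh session
```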

File system operations

getwd() returns the current working directory
setwd() sets the current working directory
dir(), dir(path) lists the contents of the current directory or of path
path can be defined as
- Windows: "E:\\emile\\datasets"
- Linux/Mac: "~/datasets" or "/home/emile/datasets"
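
A sketch of these functions in use, working in R's temporary directory so that nothing else on disk is touched (example.txt is a hypothetical file name):

```r
old_dir <- getwd()                 # remember where we started
setwd(tempdir())                   # move to the session's temporary directory
file.create("example.txt")         # hypothetical file, created for the demo
found <- "example.txt" %in% dir()  # TRUE: dir() lists the new file
data_path <- file.path("~", "datasets")   # "~/datasets", built without typing separators
setwd(old_dir)                     # go back to the original directory
```

file.path() is a convenient way to build paths without worrying about escaping backslashes on Windows.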

Reading and writing textual data

Text data formats

Textual data comes in many forms. Here are a few examples:

DesertBirdCensus.csv

Species,"Count"
Black Vulture,64
Turkey Vulture,23
Harris's Hawk,3
Red-tailed Hawk,16
American Kestrel,7

BED_file.txt

browser position chr7:127471196-127495720
browser hide all
track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2
itemRgb="On"
chr7    127471196  127472363  Pos1  0  +  127471196  127472363  255,0,0
chr7    127472363  127473530  Pos2  0  +  127472363  127473530  255,0,0
chr7    127473530  127474697  Pos3  0  +  127473530  127474697  255,0,0
chr7    127474697  127475864  Pos4  0  +  127474697  127475864  255,0,0
chr7    127475864  127477031  Neg1  0  -  127475864  127477031  0,0,255
chr7    127477031  127478198  Neg2  0  -  127477031  127478198  0,0,255

mySNPdata.vcf

##fileformat=VCFv4.0
##fileDate=20100501
##reference=1000GenomesPilot-NCBI36
##assembly=ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fasta
##INFO=<ID=BKPTID,Number=-1,Type=String,Description="ID of the assembled alternate allele in the assembly file">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=CNV,Description="Copy number variable region">
##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype">
#CHROM  POS   ID  REF ALT   QUAL  FILTER  INFO  FORMAT  NA00001
1 2827693   . CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA  C . PASS  SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9
2 321682    . T <DEL>   6 PASS    IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62  GT:GQ 0/1:12
2 14477084  . C <DEL:ME:ALU>  12  PASS  IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32  GT:GQ 0/1:12
3 9425916   . C <INS:ME:L1> 23  PASS  IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
3 12665100  . A <DUP>   14  PASS  IMPRECISE;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500   GT:GQ:CN:CNQ  ./.:0:3:16.2
4 18665128  . T <DUP:TANDEM>  11  PASS  IMPRECISE;SVTYPE=DUP;END=18665204;SVLEN=76;CIPOS=-10,10;CIEND=-10,10  GT:GQ:CN:CNQ  ./.:0:5:8.3

Data file structure

Whatever the contents of a file, you always need to address (some of) these questions:

Are there comment lines at the top?
Is there a header line with column names?
What is the column separator? (fixed width?)
Are there quotes around character data?
How are missing values encoded?
How are numeric values encoded?
What is the type in each column?
- character / numeric / factor / date/time
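
Each of these questions corresponds roughly to an argument of read.table(). The sketch below reads the bird census lines shown earlier from a character vector via the text argument; with a real file you would pass the file name as the first argument instead. The variable names are hypothetical:

```r
## the lines of DesertBirdCensus.csv, held in a character vector
census_lines <- c(
    'Species,"Count"',
    "Black Vulture,64",
    "Turkey Vulture,23",
    "Harris's Hawk,3",
    "Red-tailed Hawk,16",
    "American Kestrel,7")

birds <- read.table(
    text = census_lines,
    comment.char = "#",      # comment lines start with #
    header = TRUE,           # first line holds the column names
    sep = ",",               # column separator
    quote = "\"",            # only double quotes delimit text, so Harris's is safe
    na.strings = "NA",       # how missing values are encoded
    dec = ".",               # decimal mark in numeric values
    colClasses = c("character", "integer"))   # type of each column
str(birds)                   # a data frame with 5 rows and 2 columns
```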

Writing data to file

writing a data frame / matrix / vector to file:
write.table(myData, file="file.csv", sep=",")
Without sep, the default separator is a single space. Both column and row names are written, unless otherwise specified:
- col.names = FALSE
- row.names = FALSE
- sep = ";"
- sep = "\t" # tab-separated
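
A minimal sketch of writing a tab-separated file without row names and checking what ended up on disk. my_data and out_file are hypothetical names, and the file goes to a temporary location:

```r
my_data <- data.frame(name = c("P53", "BRCA1"), meanExp = c(4.5, 7.3))
out_file <- tempfile(fileext = ".tsv")     # temporary output file

write.table(my_data, file = out_file,
            sep = "\t",                    # tab-separated
            row.names = FALSE,             # no row names
            quote = FALSE)                 # no quotes around text

written <- readLines(out_file)
written   # "name\tmeanExp" "P53\t4.5" "BRCA1\t7.3"
```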

Some general purpose functions (part 2)

glueing text pieces: `paste()`

Use paste() to combine elements into a string

paste(1, 2, 3)

[1] "1 2 3"

paste(1, 2, 3, sep="-")

[1] "1-2-3"

paste(1:12, month.abb)

 [1] "1 Jan"  "2 Feb"  "3 Mar"  "4 Apr"  "5 May"  "6 Jun"  "7 Jul"  "8 Aug"  "9 Sep"  "10 Oct"
[11] "11 Nov" "12 Dec"
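
Two related forms are often useful: paste0() glues pieces without a separator, and the collapse argument joins all elements of a vector into a single string. A short sketch with made-up values:

```r
chroms <- paste0("chr", 1:3)                     # "chr1" "chr2" "chr3"
joined <- paste(c("a", "b", "c"), collapse = "+")   # "a+b+c"
```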

investigate structure: `str()`

Use str() to investigate the structure of a complex object

str(chickwts)

'data.frame':   71 obs. of  2 variables:
 $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
 $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...

Flow control

what is flow control

used to control the execution of different commands
these structures are used for flow control
- if(){} else if(){} else{}
- for(){}
- while(){}
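
A minimal sketch of the while(){} form, which keeps looping as long as its condition is TRUE. The variables n and steps are hypothetical:

```r
n <- 1
steps <- 0
while (n < 100) {        # repeat until n reaches at least 100
    n <- n * 2           # double n on every pass
    steps <- steps + 1   # count the passes
}
n        # 128
steps    # 7
```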

if and else

Since flow control is used primarily within functions it is dealt with here
The first is if & else for conditional code

x <- 43
if (x > 40) {
    print("TRUE")
} else {
    print("FALSE")
}

[1] "TRUE"

if/else real life example

this code chunk checks if a file exists and only downloads it if it is not present

remote_url <- "http://example.com/some/file"  ## hypothetical URL of the remote file
my_data_file <- "/some/file/on/disk"
## fetch file
if (!file.exists(my_data_file)) {
    print(paste("downloading", my_data_file))
    download.file(url = remote_url, destfile = my_data_file)
} else {
    print(paste("reading cached copy of", my_data_file))
}

if/else as a ternary

if(){}else{} is an expression that returns a value, so it can also be used as a one-line shorthand

a <- 3
x <- if (a == 1) 1 else 2
x

[1] 2
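
The one-line if/else form tests a single condition. It should not be confused with the separate ifelse() function, which works element by element on a whole vector. A short sketch with a made-up vector:

```r
a_values <- c(1, 2, 3)
x_values <- ifelse(a_values == 1, 1, 2)   # tests every element in turn
x_values                                  # 1 2 2
```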

for

for is used for looping
However, the preferred way to do this is by apply() and its relatives (see above)

for (i in 1:3) {
    print(i)
}

[1] 1
[1] 2
[1] 3
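
To illustrate the comparison with the apply() family, here is a sketch that computes squares once with a for loop and once with sapply(). The variable names are hypothetical:

```r
## for loop: fill a pre-allocated result vector
squares_loop <- numeric(3)
for (i in 1:3) {
    squares_loop[i] <- i^2
}

## sapply(): apply a function to each element and collect the results
squares_apply <- sapply(1:3, function(i) i^2)

squares_loop     # 1 4 9
squares_apply    # 1 4 9
```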

Creating functions

Functions are reusable code

Functions are named pieces of code with a single well-defined purpose
They usually have some data as input: arguments
They usually have some return value

A first function

Here is a simple function calculating the mean of a vector

my_mean <- function(x) {
    sum(x) / length(x)
}
my_mean(1:5)

[1] 3

Function basics

The result of the last statement within a function is the return value of that function
Use return() for forcing return values at other points:

my_message <- function(age) {
    if (age < 18) return("have a lemonade!")
    else return("have a beer!")
}
my_message(20)

[1] "have a beer!"

Default argument values

Use default values for function arguments whenever possible
Almost all functions in R packages have many arguments with default values

my_power <- function(x, power = 2) {
    x ^ power
}
my_power(10, 3) ## custom power

[1] 1000

my_power(10) ## defaults to 2

[1] 100
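
Arguments can also be passed by name, in which case their order no longer matters. A short sketch reusing my_power():

```r
my_power <- function(x, power = 2) {
    x ^ power
}
by_name <- my_power(power = 3, x = 2)   # named arguments, any order: 8
vectorised <- my_power(1:3)             # works on whole vectors too: 1 4 9
```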
