september 2015

Contents

  • Descriptive statistics
  • Sort/order
  • General purpose functions
  • Reading and writing textual data
  • Flow control
  • Creating your own functions

Descriptive statistics

Descriptive stats functions

  • R provides a wealth of descriptive statistics functions
  • They are listed on the next two slides

descriptive statistics functions (1)

function     purpose
mean( )      mean
median( )    median
var( )       variance s^2
sd( )        standard deviation s
min( )       minimum
max( )       maximum
range( )     min and max

descriptive statistics functions (2)

function     purpose
quantile( )  quantiles
IQR( )       interquartile range
summary( )   6-number summary
hist( )      histogram
boxplot( )   boxplot
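
As a minimal sketch of the numeric functions listed above, the snippet below applies them to a small made-up vector. The values in x are an assumption chosen so the results are easy to check by hand; they are not course data.

```r
## small illustrative vector (hypothetical values)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)     # 5
median(x)   # 4.5
range(x)    # 2 9
var(x)      # sample variance (divides by n - 1)
sd(x)       # square root of var(x)

## sd() and var() are consistent with each other
all.equal(sd(x), sqrt(var(x)))   # TRUE
```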

the quantile() function

  • Gives the data values corresponding to the specified quantiles
  • Defaults to 0% 25% 50% 75% 100%
quantile(ChickWeight$weight)
    0%    25%    50%    75%   100% 
 35.00  63.00 103.00 163.75 373.00 
quantile(ChickWeight$weight, probs = seq(0, 1, 0.2))
   0%   20%   40%   60%   80%  100% 
 35.0  57.0  85.0 126.0 181.6 373.0 

Interquartile range IQR()

  • Gives the range between the 25% and 75% quantiles
IQR(ChickWeight$weight)
[1] 100.75
## same as
quantile(ChickWeight$weight)[4] - quantile(ChickWeight$weight)[2]
   75% 
100.75 

boxplot() is a picture of summary()

  • Boxplot is a graph of the 5-number summary, but summary() also gives the mean
summary(ChickWeight$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   35.0    63.0   103.0   121.8   163.8   373.0 
boxplot(ChickWeight$weight)
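
The numbers behind the picture can be inspected without drawing anything. The sketch below assumes the default boxplot() settings, where the box is drawn at the hinges (versions of the quartiles, as computed by fivenum()) and the whiskers stop at the most extreme data point within 1.5 times the box length. Points beyond that are drawn individually, so the whisker ends need not equal the minimum and maximum.

```r
w <- ChickWeight$weight

## Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
fivenum(w)

## ask boxplot() for the values it would draw instead of plotting them
bp <- boxplot(w, plot = FALSE)
bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # points that would be drawn individually as outliers
```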

sorting and ordering

Sort and Order

  • sort() sorts a vector
  • order() returns the indices that put a vector in sorted order
x <- c(2, 4, 6, 1, 3)
sort(x)
[1] 1 2 3 4 6
ordr <- order(x); ordr
[1] 4 1 5 2 3
x[ordr]
[1] 1 2 3 4 6

Use order when order matters

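sort() returns a sorted copy of a single vector, so the positions of its elements no longer match those in any coupled vector. This is usually NOT desirable when coupled vectors are being analysed (as in the most used data type, the data frame!). Use order() to obtain indices that can be applied to every coupled vector.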
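
A minimal sketch of the problem with two coupled vectors. The names sample_id and height and their values are made up for illustration.

```r
## element i of each vector describes the same sample (hypothetical values)
sample_id <- c("s1", "s2", "s3", "s4")
height    <- c(180, 165, 172, 158)

sort(height)            # 158 165 172 180, but which sample is which?
sample_id               # unchanged, so the pairing with sort(height) is lost

idx <- order(height)    # 4 2 3 1
height[idx]             # 158 165 172 180
sample_id[idx]          # "s4" "s2" "s3" "s1"
```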

Sorting dataframes

geneNames <- c("P53","BRCA1","VAMP1", "FHIT")
sig <- c(TRUE, TRUE, FALSE, FALSE)
meanExp <- c(4.5, 7.3, 5.4, 2.4)
genes <- data.frame(
    "name" = geneNames,
    "significant" = sig,
    "meanExp" = meanExp)
genes
   name significant meanExp
1   P53        TRUE     4.5
2 BRCA1        TRUE     7.3
3 VAMP1       FALSE     5.4
4  FHIT       FALSE     2.4
## sort on gene name
genes[order(genes$name), ]
   name significant meanExp
2 BRCA1        TRUE     7.3
4  FHIT       FALSE     2.4
1   P53        TRUE     4.5
3 VAMP1       FALSE     5.4

Multilevel sorting

You can also sort on multiple properties:

students <- data.frame(
    "st.names" = c("Henk", "Piet", "Sara", "Henk", "Henk"),
    "st.ages" = c(22, 23, 18, 19, 24))
students[order(students$st.names, students$st.ages), ]
  st.names st.ages
4     Henk      19
1     Henk      22
5     Henk      24
2     Piet      23
3     Sara      18
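
To mix sort directions, one common approach for numeric keys is to negate the key that should run from high to low. The sketch below repeats the students data frame so that it runs on its own; the variable name by_name_oldest_first is only illustrative.

```r
students <- data.frame(
    "st.names" = c("Henk", "Piet", "Sara", "Henk", "Henk"),
    "st.ages" = c(22, 23, 18, 19, 24))

## names ascending, ages descending within each name
by_name_oldest_first <- students[order(students$st.names, -students$st.ages), ]
by_name_oldest_first    # rows 5, 1, 4, 2, 3
```

Note that order(..., decreasing = TRUE) would reverse the direction of all keys at once.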

Some general purpose functions (part 1)

Remove objects from memory

  • When working with large datasets it may be useful to remove objects from memory when they are no longer needed
  • e.g. intermediate results
  • use rm() to do this: rm(genes), rm(x, y, z)
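
A small sketch of rm() in action. The object name tmp_result is hypothetical; exists() is used only to show that the name is gone afterwards.

```r
tmp_result <- rnorm(1e5)     # hypothetical large intermediate result
exists("tmp_result")         # TRUE

rm(tmp_result)               # remove the object from the workspace
exists("tmp_result")         # FALSE
```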

search: the R scope search

File system operations

  • getwd() returns the current working directory
  • setwd() sets the current working directory
  • dir(), dir(path) lists the contents of the current directory or of path
  • path can be defined as
    • Windows: "E:\\emile\\datasets"
    • Linux/Mac: "~/datasets" or "/home/emile/datasets"
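
The functions above can be tried safely in R's temporary directory. This is a sketch only: the file names are made up, and file.create() and file.path() are base R helpers not mentioned on the slide.

```r
old_wd <- getwd()                       # remember where we started
setwd(tempdir())                        # move to R's per-session temporary directory

file.create("example_a.txt", "example_b.csv")   # two empty, hypothetical files
dir(pattern = "^example_")              # "example_a.txt" "example_b.csv"

setwd(old_wd)                           # go back to the original directory

## file.path() joins path elements with "/", which R accepts on all platforms
file.path("~", "datasets", "birds.csv") # "~/datasets/birds.csv"
```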

Reading and writing textual data

Text data formats

Textual data comes in many forms. Here are a few examples:

DesertBirdCensus.csv

Species,"Count"
Black Vulture,64
Turkey Vulture,23
Harris's Hawk,3
Red-tailed Hawk,16
American Kestrel,7

BED_file.txt

browser position chr7:127471196-127495720
browser hide all
track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2
itemRgb="On"
chr7    127471196  127472363  Pos1  0  +  127471196  127472363  255,0,0
chr7    127472363  127473530  Pos2  0  +  127472363  127473530  255,0,0
chr7    127473530  127474697  Pos3  0  +  127473530  127474697  255,0,0
chr7    127474697  127475864  Pos4  0  +  127474697  127475864  255,0,0
chr7    127475864  127477031  Neg1  0  -  127475864  127477031  0,0,255
chr7    127477031  127478198  Neg2  0  -  127477031  127478198  0,0,255

mySNPdata.vcf

##fileformat=VCFv4.0
##fileDate=20100501
##reference=1000GenomesPilot-NCBI36
##assembly=ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fasta
##INFO=<ID=BKPTID,Number=-1,Type=String,Description="ID of the assembled alternate allele in the assembly file">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=CNV,Description="Copy number variable region">
##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype">
#CHROM  POS   ID  REF ALT   QUAL  FILTER  INFO  FORMAT  NA00001
1 2827693   . CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA  C . PASS  SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9
2 321682    . T <DEL>   6 PASS    IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62  GT:GQ 0/1:12
2 14477084  . C <DEL:ME:ALU>  12  PASS  IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32  GT:GQ 0/1:12
3 9425916   . C <INS:ME:L1> 23  PASS  IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
3 12665100  . A <DUP>   14  PASS  IMPRECISE;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500   GT:GQ:CN:CNQ  ./.:0:3:16.2
4 18665128  . T <DUP:TANDEM>  11  PASS  IMPRECISE;SVTYPE=DUP;END=18665204;SVLEN=76;CIPOS=-10,10;CIEND=-10,10  GT:GQ:CN:CNQ  ./.:0:5:8.3

Data file structure

Whatever the contents of a file, you always need to address (some of) these questions:

  • Are there comment lines at the top?
  • Is there a header line with column names?
  • What is the column separator? (fixed width?)
  • Are there quotes around character data?
  • How are missing values encoded?
  • How are numeric values encoded?
  • What is the type in each column?
    • character / numeric / factor / date/time
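
Each of these questions maps onto an argument of read.table(). The sketch below reads the bird census example from a character vector (via the text argument) so that it runs without a file on disk. The argument values are one reasonable choice for this data, not the only one.

```r
## the lines of DesertBirdCensus.csv, supplied as text to keep the sketch self-contained
census_lines <- c(
    "Species,\"Count\"",
    "Black Vulture,64",
    "Turkey Vulture,23",
    "Harris's Hawk,3",
    "Red-tailed Hawk,16",
    "American Kestrel,7")

birds <- read.table(
    text = census_lines,
    header = TRUE,               # first line holds the column names
    sep = ",",                   # column separator
    quote = "\"",                # only double quotes delimit text; keeps the ' in Harris's
    comment.char = "#",          # lines starting with # would be skipped
    na.strings = "NA",           # how missing values are encoded
    dec = ".",                   # decimal mark of numeric values
    colClasses = c("character", "integer"))   # type of each column
str(birds)
```

The quote argument matters here: with the read.table() default, which also treats a single quote as a quoting character, the apostrophe in "Harris's Hawk" would be read as the start of a quoted string.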

Writing data to file

  • writing a data frame / matrix / vector to file:
  • write.table(myData, file="file.csv")
  • By default the output is space-separated (not comma-separated; use sep = "," for a real CSV file), with quoted character data and both column and row names, unless otherwise specified:
    • col.names = FALSE
    • row.names = FALSE
    • sep = ";"
    • sep = "\t" # tab-separated
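
A minimal sketch that writes a small data frame as a comma-separated file and shows the resulting lines. The data frame genes_small and the temporary output file are assumptions made for illustration.

```r
genes_small <- data.frame(name = c("P53", "BRCA1"), meanExp = c(4.5, 7.3))

out_file <- tempfile(fileext = ".csv")   # throw-away file in the session temp directory
write.table(genes_small, file = out_file,
            sep = ",", row.names = FALSE, quote = FALSE)

readLines(out_file)
## [1] "name,meanExp" "P53,4.5"      "BRCA1,7.3"
```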

Some general purpose functions (part 2)

glueing text pieces: paste()

  • Use paste() to combine elements into a string
paste(1, 2, 3)
[1] "1 2 3"
paste(1, 2, 3, sep="-")
[1] "1-2-3"
paste(1:12, month.abb)
 [1] "1 Jan"  "2 Feb"  "3 Mar"  "4 Apr"  "5 May"  "6 Jun"  "7 Jul"  "8 Aug"  "9 Sep"  "10 Oct"
[11] "11 Nov" "12 Dec"
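
Two related forms are worth knowing: paste0(), which joins without a separator, and the collapse argument, which joins the elements of a vector into a single string. A short sketch with made-up inputs:

```r
paste0("chr", 1:3)                            # "chr1" "chr2" "chr3"
paste(c("A", "C", "G", "T"), collapse = "")   # "ACGT"
paste("gene", 1:2, sep = "_", collapse = ";") # "gene_1;gene_2"
```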

investigate structure: str()

  • Use str() to investigate the structure of a complex object
str(chickwts)
'data.frame':   71 obs. of  2 variables:
 $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
 $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...

Flow control

what is flow control

  • used to control the execution of different commands
  • these structures are used for flow control
    • if(){} else if(){} else{}
    • for(){}
    • while(){}
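
As a quick sketch of the first and last structure in this list: an else if chain picks one of several branches, and while repeats a block for as long as its condition holds. The variable names and values are made up for illustration.

```r
temperature <- 15          # hypothetical value
if (temperature > 25) {
    advice <- "warm"
} else if (temperature > 10) {
    advice <- "mild"
} else {
    advice <- "cold"
}
advice                     # "mild"

## keep doubling n until it reaches at least 100
n <- 1
while (n < 100) {
    n <- n * 2
}
n                          # 128
```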

if and else

  • Since flow control is used primarily within functions it is dealt with here
  • The first is if & else for conditional code
x <- 43
if (x > 40) {
    print("TRUE")
} else {
    print("FALSE")
}
[1] "TRUE"

if/else real life example

  • this code chunk checks if a file exists and only downloads it if it is not present
my_data_file <- "/some/file/on/disk"
remote_url <- "https://example.com/some/file" ## placeholder URL, replace with the real location
## fetch file
if (!file.exists(my_data_file)) {
    print(paste("downloading", my_data_file))
    download.file(url = remote_url, destfile = my_data_file)
} else {
    print(paste("reading cached copy of", my_data_file))
}

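if/else as a ternary

There is also a shorthand for if(){}else{}: because if returns a value, it can be written on one line like a ternary operator. This is not the same as the vectorised ifelse() function.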

a <- 3
x <- if (a == 1) 1 else 2
x
[1] 2
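
For comparison, a short sketch of the ifelse() function, which tests every element of a vector, whereas if() expects a single TRUE or FALSE. The input values are made up.

```r
a <- c(1, 2, 3)
ifelse(a == 1, "one", "other")   # "one" "other" "other"
```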

for

  • for is used for looping
  • However, the preferred way to do this is by apply() and its relatives (see above)
for (i in 1:3) {
    print(i)
}
[1] 1
[1] 2
[1] 3
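
To make the comparison with the apply family concrete, here is a sketch that computes the same result twice: once with a for loop filling a pre-allocated vector, and once with sapply(), one of apply()'s relatives. The squaring task is just an illustration.

```r
## for loop: fill a pre-allocated vector element by element
squares <- numeric(3)
for (i in 1:3) {
    squares[i] <- i^2
}
squares                          # 1 4 9

## sapply: apply a function to each element and collect the results
sapply(1:3, function(i) i^2)     # 1 4 9
```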

Creating functions

Functions are reusable code

  • Functions are named pieces of code with a single well-defined purpose
  • They usually have some data as input: arguments
  • They usually have some return value

A first function

  • Here is a simple function calculating the mean of a vector
my_mean <- function(x) {
    sum(x) / length(x)
}
my_mean(1:5)
[1] 3

Function basics

  • The result of the last statement within a function is the return value of that function
  • Use return() for forcing return values at other points:
my_message <- function(age) {
    if (age < 18) return("have a lemonade!")
    else return("have a beer!")
}
my_message(20)
[1] "have a beer!"

Default argument values

  • Use default values for function arguments whenever possible
  • Almost all functions in R packages have many arguments with default values
my_power <- function(x, power = 2) {
    x ^ power
}
my_power(10, 3) ## custom power
[1] 1000
my_power(10) ## defaults to 2
[1] 100
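
Arguments can also be passed by name, in which case their order no longer matters. Because ^ is vectorised, the function works on whole vectors too. A short sketch that repeats the definition so it runs on its own:

```r
my_power <- function(x, power = 2) {
    x ^ power
}

my_power(2, power = 3)       # 8
my_power(power = 3, x = 2)   # 8, named arguments can come in any order
my_power(1:4)                # 1 4 9 16
```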