- Descriptive statistics
- Sort/order
- General purpose functions
- Reading and writing textual data
- Flow control
- Creating your own functions
september 2015
| function | purpose |
|---|---|
| mean( ) | mean |
| median( ) | median |
| var( ) | variance s^2 |
| sd( ) | standard deviation s |
| min( ) | minimum |
| max( ) | maximum |
| range( ) | min and max |
| quantile( ) | quantiles |
| IQR( ) | interquartile range |
| summary( ) | 6-number summary |
| hist( ) | histogram |
| boxplot( ) | boxplot |
- By default, the `quantile()` function returns the 0%, 25%, 50%, 75% and 100% quantiles

quantile(ChickWeight$weight)
    0%    25%    50%    75%   100%
 35.00  63.00 103.00 163.75 373.00
quantile(ChickWeight$weight, probs = seq(0, 1, 0.2))
   0%   20%   40%   60%   80%  100%
 35.0  57.0  85.0 126.0 181.6 373.0
- `IQR()` returns the interquartile range

IQR(ChickWeight$weight)
[1] 100.75
## same as
quantile(ChickWeight$weight)[4] - quantile(ChickWeight$weight)[2]
   75%
100.75
- `boxplot()` is a picture of `summary()`
- `summary()` also gives the mean

summary(ChickWeight$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   35.0    63.0   103.0   121.8   163.8   373.0
boxplot(ChickWeight$weight)
- `sort()` sorts a vector
- `order()` returns the indices that put a vector in sorted order

x <- c(2, 4, 6, 1, 3)
sort(x)
[1] 1 2 3 4 6
ordr <- order(x); ordr
[1] 4 1 5 2 3
x[ordr]
[1] 1 2 3 4 6
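`sort()` returns a sorted copy of a single vector, so the link between its elements and those of any coupled vectors is lost. This is usually NOT desirable when coupled vectors are being analysed (as in the most-used data type, the data frame!). Use `order()` in that case.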
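To make the point about coupled vectors concrete, here is a minimal sketch. The vectors `gene` and `expr` are made up for illustration; `expr[i]` is assumed to belong to `gene[i]`.

```r
## two coupled vectors (illustrative values): expr[i] belongs to gene[i]
gene <- c("P53", "BRCA1", "VAMP1", "FHIT")
expr <- c(4.5, 7.3, 5.4, 2.4)

## sort() gives a sorted copy of ONE vector; 'gene' stays as it was,
## so sorted_expr[i] no longer belongs to gene[i]
sorted_expr <- sort(expr)

## order() gives the index vector that sorts 'expr';
## applying it to both vectors keeps the pairs together
idx <- order(expr)
gene[idx]
expr[idx]
```

Indexing both vectors with the same `idx` is what `genes[order(genes$name), ]` does for all columns of a data frame at once.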
geneNames <- c("P53","BRCA1","VAMP1", "FHIT")
sig <- c(TRUE, TRUE, FALSE, FALSE)
meanExp <- c(4.5, 7.3, 5.4, 2.4)
genes <- data.frame(
"name" = geneNames,
"significant" = sig,
"meanExp" = meanExp)
genes
   name significant meanExp
1   P53        TRUE     4.5
2 BRCA1        TRUE     7.3
3 VAMP1       FALSE     5.4
4  FHIT       FALSE     2.4
## sort on gene name
genes[order(genes$name), ]
   name significant meanExp
2 BRCA1        TRUE     7.3
4  FHIT       FALSE     2.4
1   P53        TRUE     4.5
3 VAMP1       FALSE     5.4
You can also sort on multiple properties:
students <- data.frame(
"st.names" = c("Henk", "Piet", "Sara", "Henk", "Henk"),
"st.ages" = c(22, 23, 18, 19, 24))
students[order(students$st.names, students$st.ages), ]
  st.names st.ages
4     Henk      19
1     Henk      22
5     Henk      24
2     Piet      23
3     Sara      18
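As a variation on the multi-key sort, the sketch below orders by name ascending and, within each name, by age descending. Negating a numeric key is one common way to do this; it is shown as an illustration, not as the only approach.

```r
students <- data.frame(
  st.names = c("Henk", "Piet", "Sara", "Henk", "Henk"),
  st.ages = c(22, 23, 18, 19, 24))

## first key: name (ascending); second key: age (descending, via negation)
by_name_age_desc <- students[order(students$st.names, -students$st.ages), ]
by_name_age_desc
```

The negation trick only works for numeric keys; for character keys, `order()` also has a `decreasing` argument.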
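- Use `rm()` to remove objects from the workspace: `rm(genes)`, `rm(x, y, z)`
- `search()` lists the attached packages and environments (the search path)
- `getwd()` returns the current working directory
- `setwd()` sets the current working directory
- `dir()`, `dir(path)` lists the contents of the current directory or of `path`
- Paths are written as strings: `"E:\\emile\\datasets"` on Windows (note the doubled backslashes), `"~/datasets"` or `"/home/emile/datasets"` on Linux/Mac

Textual data comes in many forms. Here are a few examples: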
DesertBirdCensus.csv
Species,"Count"
Black Vulture,64
Turkey Vulture,23
Harris's Hawk,3
Red-tailed Hawk,16
American Kestrel,7
BED_file.txt
browser position chr7:127471196-127495720
browser hide all
track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"
chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255
mySNPdata.vcf
##fileformat=VCFv4.0
##fileDate=20100501
##reference=1000GenomesPilot-NCBI36
##assembly=ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fasta
##INFO=<ID=BKPTID,Number=-1,Type=String,Description="ID of the assembled alternate allele in the assembly file">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=CNV,Description="Copy number variable region">
##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
1 2827693 . CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA C . PASS SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9
2 321682 . T <DEL> 6 PASS IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12
2 14477084 . C <DEL:ME:ALU> 12 PASS IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12
3 9425916 . C <INS:ME:L1> 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
3 12665100 . A <DUP> 14 PASS IMPRECISE;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500 GT:GQ:CN:CNQ ./.:0:3:16.2
4 18665128 . T <DUP:TANDEM> 11 PASS IMPRECISE;SVTYPE=DUP;END=18665204;SVLEN=76;CIPOS=-10,10;CIEND=-10,10 GT:GQ:CN:CNQ ./.:0:5:8.3
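As a sketch of how the simplest of these, DesertBirdCensus.csv, could be read into R and written out again: the code below first recreates the file in a temporary location so that it is self-contained. The variable names (`census_file`, `birds`, `tsv_file`) are made up for this illustration.

```r
## recreate the example file in a temporary location
census_lines <- c(
  "Species,\"Count\"",
  "Black Vulture,64",
  "Turkey Vulture,23",
  "Harris's Hawk,3",
  "Red-tailed Hawk,16",
  "American Kestrel,7")
census_file <- tempfile(fileext = ".csv")
writeLines(census_lines, census_file)

## read it: there is a header line and the separator is a comma
birds <- read.csv(census_file, header = TRUE, sep = ",",
                  stringsAsFactors = FALSE)
str(birds)

## write it back tab-separated, without row names or quotes
tsv_file <- tempfile(fileext = ".tsv")
write.table(birds, file = tsv_file, sep = "\t",
            row.names = FALSE, quote = FALSE)

## read the tab-separated copy to check the round trip
birds_again <- read.delim(tsv_file, stringsAsFactors = FALSE)
```

`read.csv()` treats only the double quote as a quoting character by default, so the apostrophe in "Harris's Hawk" is read as ordinary text. `read.table()` also treats the single quote as a quoting character by default, so there the `quote` argument would need adjusting.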
Whatever the contents of a file, you always need to address (some of) these questions:
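write.table(myData, file="file.csv")

- `col.names = FALSE`: do not write the column names
- `row.names = FALSE`: do not write the row names
- `sep = ";"`: semicolon-separated
- `sep = "\t"`: tab-separated

- Use `paste()` to combine elements into a string

paste(1, 2, 3)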
[1] "1 2 3"
paste(1, 2, 3, sep="-")
[1] "1-2-3"
paste(1:12, month.abb)
 [1] "1 Jan"  "2 Feb"  "3 Mar"  "4 Apr"  "5 May"  "6 Jun"  "7 Jul"  "8 Aug"  "9 Sep"  "10 Oct"
[11] "11 Nov" "12 Dec"
- Use `str()` to investigate the structure of a complex object

str(chickwts)
'data.frame': 71 obs. of 2 variables:
 $ weight: num 179 160 136 227 217 168 108 124 143 140 ...
 $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
- `if` & `else` for conditional code

x <- 43
if (x > 40) {
print("TRUE")
} else {
print("FALSE")
}
[1] "TRUE"
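my_data_file <- "/some/file/on/disk"
remote_url <- "https://example.org/some/file" ## hypothetical placeholder; set this to the real download location
## fetch the file only if there is no local copy yet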
if (!file.exists(my_data_file)) {
print(paste("downloading", my_data_file))
download.file(url = remote_url, destfile = my_data_file)
} else {
print(paste("reading cached copy of", my_data_file))
}
There is also a shorthand for if(){}else{}
a <- 3
x <- if (a == 1) 1 else 2
x
[1] 2
- For iteration: `apply()` and its relatives (see above), or a `for` loop

for (i in 1:3) {
print(i)
}
[1] 1
[1] 2
[1] 3
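To show how a `for` loop relates to `apply()` and its relatives, the sketch below computes the same result both ways. The task (squaring the numbers 1 to 3) and the variable names are chosen purely for illustration.

```r
## for loop: pre-allocate the result vector, then fill it element by element
squares_loop <- numeric(3)
for (i in 1:3) {
  squares_loop[i] <- i^2
}

## sapply(): apply a function to each element and collect the results
squares_apply <- sapply(1:3, function(i) i^2)

squares_loop
squares_apply
```

Both give the same vector. The `sapply()` version needs no index bookkeeping, which is one reason the apply family is often preferred for this kind of task.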
my_mean <- function(x) {
sum(x) / length(x)
}
my_mean(1:5)
[1] 3
- Use `return()` for forcing return values at other points:

my_message <- function(age) {
if (age < 18) return("have a lemonade!")
else return("have a beer!")
}
my_message(20)
[1] "have a beer!"
my_power <- function(x, power = 2) {
x ^ power
}
my_power(10, 3) ## custom power
[1] 1000
my_power(10) ## defaults to 2
[1] 100
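The pieces above (a function body, `if`, and a default argument) can be combined. The sketch below extends `my_mean()` with an `na.rm` argument that defaults to `FALSE`. The name `my_mean2` is hypothetical, and the behaviour is modelled loosely on the built-in `mean()`.

```r
## mean with optional removal of missing values (illustrative helper)
my_mean2 <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]  ## drop missing values before averaging
  }
  sum(x) / length(x)
}

my_mean2(1:5)                           ## no missing values
my_mean2(c(1, 2, NA, 3))                ## NA propagates by default
my_mean2(c(1, 2, NA, 3), na.rm = TRUE)  ## NA dropped first
```

Because `na.rm` has a default, existing calls such as `my_mean2(1:5)` keep working, and callers opt in to the extra behaviour by name.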