dplyr
Hypothesis testing for categorical variables (Fisher’s Test, Chi-square test)
Book chapters 0 and 1
Pseudo code | Example code |
---|---|
library(packagename) | library(dplyr) |
?functionname | ?select |
?package::functionname | ?dplyr::select |
? ‘Reserved keyword or symbol’ | ? ‘%>%’ |
??searchforpossiblyexistingfunctionandortopic | ??simulate |
help(package = “loadedpackage”) | help(“dplyr”) |
browseVignettes(“packagename”) | browseVignettes(“dplyr”) |
Slide credit: Marcel Ramos
Pseudo code:
install.packages("BiocManager") #from CRAN
packages <- c("packagename", "githubuser/repository", "biopackage")
BiocManager::install(packages)
BiocManager::valid() #check validity of installed packages
devtools
BiocManager::install()
numeric
(set seed to sync random number generator):
set.seed(1)
rnorm(5)
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
integer
:
sample( 1:5 )
## [1] 2 1 3 4 5
logical
:
1:3 %in% 3
## [1] FALSE FALSE TRUE
character
:
c("yes", "no")
## [1] "yes" "no"
factor
:
factor(c("yes", "no"))
## [1] yes no
## Levels: no yes
Demo: integer-like properties, relevel()
c(NA, NaN, -Inf, Inf)
## [1] NA NaN -Inf Inf
class()
to find the class of a variable.
matrix
:
matrix(1:9, nrow = 3)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
The list
is a non-atomic vector:
measurements <- c( 1.3, 1.6, 3.2, 9.8, 10.2 )
parents <- c( "Parent1.name", "Parent2.name" )
my.list <- list( measurements, parents)
my.list
## [[1]]
## [1] 1.3 1.6 3.2 9.8 10.2
##
## [[2]]
## [1] "Parent1.name" "Parent2.name"
The data.frame
has list-like and matrix-like properties:
x <- 11:16
y <- seq(0,1,.2)
z <- c( "one", "two", "three", "four", "five", "six" )
a <- factor( z )
my.df <- data.frame(x,y,z,a, stringsAsFactors = FALSE)
DataFrame
: like a data.frame
but more flexible. columns can be any atomic vector type:
GenomicRanges
objectsRle
(run-length encoding)suppressPackageStartupMessages(library(S4Vectors))
df <- DataFrame(var1 = Rle(c("a", "a", "b")),
var2 = 1:3)
metadata(df) <- list(father="Levi is my father")
df
## DataFrame with 3 rows and 2 columns
## var1 var2
## <Rle> <integer>
## 1 a 1
## 2 a 2
## 3 b 3
List(my.list)
## List of length 2
str(List(my.list))
## Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
## ..@ listData :List of 2
## .. ..$ : num [1:5] 1.3 1.6 3.2 9.8 10.2
## .. ..$ : chr [1:2] "Parent1.name" "Parent2.name"
## ..@ elementType : chr "ANY"
## ..@ elementMetadata: NULL
## ..@ metadata : list()
suppressPackageStartupMessages(library(IRanges))
IntegerList(var1=1:26, var2=1:100)
## IntegerList of length 2
## [["var1"]] 1 2 3 4 5 6 7 8 9 10 11 12 ... 16 17 18 19 20 21 22 23 24 25 26
## [["var2"]] 1 2 3 4 5 6 7 8 9 10 11 ... 90 91 92 93 94 95 96 97 98 99 100
CharacterList(var1=letters[1:100], var2=LETTERS[1:26])
## CharacterList of length 2
## [["var1"]] a b c d e f g h i ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## [["var2"]] A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
LogicalList(var1=1:100 %in% 5, var2=1:100 %% 2)
## LogicalList of length 2
## [["var1"]] FALSE FALSE FALSE FALSE TRUE ... FALSE FALSE FALSE FALSE FALSE
## [["var2"]] TRUE FALSE TRUE FALSE TRUE ... FALSE TRUE FALSE TRUE FALSE
suppressPackageStartupMessages(library(Biostrings))
bstring = BString("I am a BString object")
bstring
## 21-letter "BString" instance
## seq: I am a BString object
dnastring = DNAString("TTGAAA-CTC-N")
dnastring
## 12-letter "DNAString" instance
## seq: TTGAAA-CTC-N
str(dnastring)
## Formal class 'DNAString' [package "Biostrings"] with 5 slots
## ..@ shared :Formal class 'SharedRaw' [package "XVector"] with 2 slots
## .. .. ..@ xp :<externalptr>
## .. .. ..@ .link_to_cached_object:<environment: 0x7ff37196c090>
## ..@ offset : int 0
## ..@ length : int 12
## ..@ elementMetadata: NULL
## ..@ metadata : list()
alphabetFrequency(dnastring, baseOnly=TRUE, as.prob=TRUE)
## A C G T other
## 0.25000000 0.16666667 0.08333333 0.25000000 0.25000000
dplyr
dplyr
convention aims to ease cognitive burdendplyr
examplelibrary(nycflights13)
library(dplyr)
delays <- flights %>%
filter(!is.na(dep_delay)) %>%
group_by(year, month, day, hour) %>%
summarise(delay = mean(dep_delay), n = n()) %>%
filter(n > 10)
dplyr
example (cont’d)hist(delays$delay, main="Mean hourly delay", xlab="Delay (hours)")
A random variable: any characteristic that can be measured or categorized, and where any particular outcome is determined at least partially by chance.
Normally distributed random variable with mean \(\mu = 0\) / standard deviation \(\sigma = 1\), and a sample of \(n=100\)
Poisson distributed random variable (\(\lambda = 2\)), and a sample of \(n=100\).
Negative Binomially distributed random variable (\(size=30, \mu=2\)), and a sample of \(n=100\).
One Sample - observations come from one of two population distributions:
Two Sample - two samples are drawn, either:
p-value is the probability of the observed number of successes, or more, under \(H_0\)
Success count | Permutations of selection | Number of permutations |
---|---|---|
0 | oooo | 1 × 1 = 1 |
1 | ooox, ooxo, oxoo, xooo | 4 × 4 = 16 |
2 | ooxx, oxox, oxxo, xoxo, xxoo, xoox | 6 × 6 = 36 |
3 | oxxx, xoxx, xxox, xxxo | 4 × 4 = 16 |
4 | xxxx | 1 × 1 = 1 |
Total | 70 |
What do you notice about all these combinations?
fisher.test(x, y = NULL, etc, simulate.p.value = FALSE)
disease=factor(c(rep(0,180),rep(1,20),rep(0,40),rep(1,10)),
labels=c("control","cases"))
genotype=factor(c(rep("AA/Aa",204),rep("aa",46)),
levels=c("AA/Aa","aa"))
set.seed(1)
genotype <- sample(genotype)
dat <- data.frame(disease, genotype)
summary(dat)
## disease genotype
## control:220 AA/Aa:204
## cases : 30 aa : 46
table(disease, genotype)
## genotype
## disease AA/Aa aa
## control 181 39
## cases 23 7
chisq.test(disease, genotype)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: disease and genotype
## X-squared = 0.24229, df = 1, p-value = 0.6226
chisq.test(disease, genotype, simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: disease and genotype
## X-squared = 0.5526, df = NA, p-value = 0.5777
Note that the result says nothing about how the departure from independence occurs
library(epitools)
epitools::oddsratio(genotype, disease, method="wald")$measure
## odds ratio with 95% C.I.
## Predictor estimate lower upper
## AA/Aa 1.000000 NA NA
## aa 1.412486 0.5662511 3.523379
(the default is whichever level is first alphabetically!)
simulate.p.value=TRUE
to get a more accurate p-value from the chi-square testTrue state of nature | Result of test | |
---|---|---|
Reject \(H_0\) | Fail to reject \(H_0\) | |
\(H_0\) TRUE | Type I error, probability = \(\alpha\) | No error, probability = \(1-\alpha\) |
\(H_0\) FALSE | No error, probability is called power = \(1-\beta\) | Type II error, probability = \(\beta\) (false negative) |