My research interests:
Biomedical Data Science book cover
dplyr
A random variable: any characteristic that can be measured or categorized, and where any particular outcome is determined at least partially by chance.
Normally distributed random variable with mean \(\mu = 0\) / standard deviation \(\sigma = 1\), and a sample of \(n=100\)
Poisson distributed random variable (\(\lambda = 2\)), and a sample of \(n=100\).
Negative Binomially distributed random variable (\(size=30, \mu=2\)), and a sample of \(n=100\).
- Example questions with yes/no answers:
- Is a gene differentially expressed between two populations?
- Do hypertensive, smoking men have the same mean cholesterol level as the general population?
One-sample hypothesis test of mean - sample comes from one of two population distributions:
Two-sample hypothesis test of means - two samples are drawn, either:
When sample size is large, the average \(\bar{Y}\) of a random sample: 1. Follows a normal distribution 2. The normal distribution has mean entered at the population average \(\mu_Y\) 3. with standard deviation equal to the population standard deviation \(\sigma_Y\), divided by the square root of the sample size \(N\). We refer to the standard deviation of the distribution of a random variable as the random variable’s standard error.
Thanks to CLT, linear modeling, t-tests, ANOVA are all guaranteed to work reasonably well for large samples (more than ~30 observations).
Go to online demo: http://onlinestatbook.com/stat_sim/sampling_dist/
These are research hypotheses. What are the corresponding null hypotheses?
library(TeachingDemos)
set.seed(1)
(weights = rnorm(10, mean = 75, sd = 10))
## [1] 68.73546 76.83643 66.64371 90.95281 78.29508 66.79532 79.87429
## [8] 82.38325 80.75781 71.94612
TeachingDemos::z.test(
x = weights,
mu = 70, #specified for population under H0 because this is a one-sample test
stdev = 10, #specified for population because this is a z-test
alternative = "two.sided"
)
##
## One Sample z-test
##
## data: weights
## z = 1.9992, n = 10.0000, Std. Dev. = 10.0000, Std. Dev. of the
## sample mean = 3.1623, p-value = 0.04559
## alternative hypothesis: true mean is not equal to 70
## 95 percent confidence interval:
## 70.12408 82.51998
## sample estimates:
## mean of weights
## 76.32203
Note that here we do not specify the standard deviation, because it is estimated from the sample.
stats::t.test(
x = weights,
mu = 70, #specified for population under H0 because this is a one-sample test
alternative = "two.sided"
)
##
## One Sample t-test
##
## data: weights
## t = 2.5612, df = 9, p-value = 0.03063
## alternative hypothesis: true mean is not equal to 70
## 95 percent confidence interval:
## 70.73805 81.90600
## sample estimates:
## mean of x
## 76.32203
In this particular example the p-value is smaller than from the z-test, but in general it would be less powerful than a z-test if its assumptions are met.
Confidence intervals provide a probable range for the true population mean.
set.seed(1)
reps <- 100
N <- 30
alpha <- 0.05
my.conf.ints <- replicate(reps, t.test(rnorm(N), conf.level = 1-alpha)$conf.int)
plot(x = rep(range(my.conf.ints), reps / 2),
y = 1:reps,
main = paste(reps, "samples of size N=", N),
type = "n")
abline(v = 0, lw = 2, lty = 3)
for (i in seq(1, reps)) {
acceptH0 <- my.conf.ints[1, i] < 0 & my.conf.ints[2, i] > 0
segments(
x0 = my.conf.ints[1, i],
y0 = i,
x1 = my.conf.ints[2, i],
y1 = i,
lw = ifelse(acceptH0, 1, 2),
col = ifelse(acceptH0, "black", "red")
)
}
Which of the following are true?
True state of nature | Result of test | |
---|---|---|
Reject \(H_0\) | Fail to reject \(H_0\) | |
\(H_0\) TRUE | Type I error, probability = \(\alpha\) | No error, probability = \(1-\alpha\) |
\(H_0\) FALSE | No error, probability is called power = \(1-\beta\) | Type II error, probability = \(\beta\) (false negative) |
Pseudo code | Example code |
---|---|
library(packagename) | library(dplyr) |
?functionname | ?select |
?package::functionname | ?dplyr::select |
? ‘Reserved keyword or symbol’ | ? ‘%>%’ |
??searchforpossiblyexistingfunctionandortopic | ??simulate |
help(package = “loadedpackage”) | help(“dplyr”) |
browseVignettes(“packagename”) | browseVignettes(“dplyr”) |
Slide credit: Marcel Ramos
Pseudo code:
install.packages("BiocManager") #from CRAN
packages <- c("packagename", "githubuser/repository", "biopackage")
BiocManager::install(packages)
BiocManager::valid() #check validity of installed packages
devtools
BiocManager::install()
numeric
(set seed to sync random number generator):
set.seed(1)
rnorm(5)
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
integer
:
sample(1L:5L)
## [1] 3 5 1 4 2
logical
:
1:3 %in% 3
## [1] FALSE FALSE TRUE
character
:
c("yes", "no")
## [1] "yes" "no"
factor
:
factor(c("yes", "no"))
## [1] yes no
## Levels: no yes
Demo: integer-like properties, relevel()
c(NA, NaN, -Inf, Inf)
## [1] NA NaN -Inf Inf
class()
to find the class of a variable.
matrix
:
matrix(1:9, nrow = 3)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
The list
is a non-atomic vector:
measurements <- c(1.3, 1.6, 3.2, 9.8, 10.2)
parents <- c("Parent1.name", "Parent2.name")
my.list <- list(measurements, parents)
my.list
## [[1]]
## [1] 1.3 1.6 3.2 9.8 10.2
##
## [[2]]
## [1] "Parent1.name" "Parent2.name"
The data.frame
has list-like and matrix-like properties:
x <- 11:16
y <- seq(0, 1, .2)
z <- c("one", "two", "three", "four", "five", "six")
a <- factor(z)
my.df <- data.frame(x, y, z, a, stringsAsFactors = FALSE)
DataFrame
: like a data.frame
but more flexible. columns can be any atomic vector type:
GenomicRanges
objectsRle
(run-length encoding)suppressPackageStartupMessages(library(S4Vectors))
df <- DataFrame(var1 = Rle(c("a", "a", "b")),
var2 = 1:3)
metadata(df) <- list(father = "Levi is my father")
df
## DataFrame with 3 rows and 2 columns
## var1 var2
## <Rle> <integer>
## 1 a 1
## 2 a 2
## 3 b 3
List(my.list)
## List of length 2
str(List(my.list))
## Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
## ..@ listData :List of 2
## .. ..$ : num [1:5] 1.3 1.6 3.2 9.8 10.2
## .. ..$ : chr [1:2] "Parent1.name" "Parent2.name"
## ..@ elementType : chr "ANY"
## ..@ elementMetadata: NULL
## ..@ metadata : list()
suppressPackageStartupMessages(library(IRanges))
IntegerList(var1 = 1:26, var2 = 1:100)
## IntegerList of length 2
## [["var1"]] 1 2 3 4 5 6 7 8 9 10 11 12 ... 16 17 18 19 20 21 22 23 24 25 26
## [["var2"]] 1 2 3 4 5 6 7 8 9 10 11 ... 90 91 92 93 94 95 96 97 98 99 100
CharacterList(var1 = letters[1:100], var2 = LETTERS[1:26])
## CharacterList of length 2
## [["var1"]] a b c d e f g h i ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## [["var2"]] A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
LogicalList(var1 = 1:100 %in% 5, var2 = 1:100 %% 2)
## LogicalList of length 2
## [["var1"]] FALSE FALSE FALSE FALSE TRUE ... FALSE FALSE FALSE FALSE FALSE
## [["var2"]] TRUE FALSE TRUE FALSE TRUE ... FALSE TRUE FALSE TRUE FALSE
suppressPackageStartupMessages(library(Biostrings))
bstring = BString("I am a BString object")
bstring
## 21-letter "BString" instance
## seq: I am a BString object
dnastring = DNAString("TTGAAA-CTC-N")
dnastring
## 12-letter "DNAString" instance
## seq: TTGAAA-CTC-N
str(dnastring)
## Formal class 'DNAString' [package "Biostrings"] with 5 slots
## ..@ shared :Formal class 'SharedRaw' [package "XVector"] with 2 slots
## .. .. ..@ xp :<externalptr>
## .. .. ..@ .link_to_cached_object:<environment: 0x7fb03e656098>
## ..@ offset : int 0
## ..@ length : int 12
## ..@ elementMetadata: NULL
## ..@ metadata : list()
alphabetFrequency(dnastring, baseOnly=TRUE, as.prob=TRUE)
## A C G T other
## 0.25000000 0.16666667 0.08333333 0.25000000 0.25000000
dplyr
dplyr
convention aims to ease cognitive burdendplyr
examplelibrary(nycflights13)
library(dplyr)
delays <- flights %>%
filter(!is.na(dep_delay)) %>%
group_by(year, month, day, hour) %>%
summarise(delay = mean(dep_delay), n = n()) %>%
filter(n > 10)
## dplyr cheatsheet
## ggplot2 cheatsheet
dplyr
example (cont’d)hist(delays$delay, main = "Mean hourly delay", xlab = "Delay (minutes)")