Name your file hw_01_yourname.Rmd and use it for your solutions. Only include the code necessary to answer the questions.
Most of the functions you use should be from the tidyverse. Too much base R will result in point deductions.
Use pull requests and/or email to ask me any questions. If you email, please ensure your most recent code is pushed to GitHub.
Learning Outcomes:
Libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stats)
Because of their generality, lists (or list-like objects) are often the output of many statistical procedures in R. The file fpout.RDS in the data folder contains sample output from using fitPoly, a statistical method to quantify properties of locations on the genome.
Use readRDS() and a relative path to read this data into R.
fpout_01 <- readRDS(file = "../data/fpout.RDS")
length(fpout_01)
## [1] 7
names(fpout_01)
## [1] "log" "modeldata" "allmodeldata" "scores" "diploscores"
## [6] "time" "prop_miss"
The diploscores element does not provide any information. Remove it from the list.
fpout_01 <- fpout_01[-c(5)]
# check again
length(fpout_01)
## [1] 6
names(fpout_01)
## [1] "log" "modeldata" "allmodeldata" "scores" "time"
## [6] "prop_miss"
# second method
#fpout_01$diploscores <- NULL
# third method
#fpout_01[[5]] <- NULL
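A name-based removal is a bit more robust than a positional one, since it does not depend on where diploscores sits in the list. A sketch of a fourth option (my addition, equivalent to the methods above):
# fourth method: subset by name so the position of diploscores does not matter
#fpout_01 <- fpout_01[names(fpout_01) != "diploscores"]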
The scores element contains the output most users would want. The variables in scores called P0, P1, P2, P3, P4, P5, and P6 contain "posterior probabilities" for each individual for values 0, 1, 2, 3, 4, 5, and 6 (respectively). The posterior mean for each individual is (P0 * 0) + (P1 * 1) + (P2 * 2) + (P3 * 3) + (P4 * 4) + (P5 * 5) + (P6 * 6). Add this posterior mean as a new variable in the scores data frame.
fpout_01$scores %>%
mutate(Posterior_Mean = (P0 * 0) + (P1 * 1) + (P2 * 2) + (P3 * 3) + (P4 * 4) + (P5 * 5) + (P6 * 6)) -> fpout_01$scores
head(fpout_01$scores)
## marker MarkerName SampleName ratio P0 P1 P2
## 2 1 SNP 2 0.9450980 0 0.000000e+00 0.000000e+00
## 3 1 SNP 3 0.9186047 0 1.513089e-70 1.630727e-32
## 4 1 SNP 4 0.9976387 0 4.965532e-143 6.960499e-85
## 5 1 SNP 5 1.0000000 0 3.237313e-161 5.236498e-99
## 6 1 SNP 6 0.9202756 0 2.644688e-71 5.145367e-33
## 7 1 SNP 7 0.9037620 0 4.181303e-64 2.965385e-28
## P3 P4 P5 P6 maxgeno maxP geno
## 2 5.487698e-278 1.792528e-95 1.000000e+00 1.765903e-145 5 1.0000000 5
## 3 1.688654e-15 8.489692e-05 9.999151e-01 2.727826e-18 5 0.9999151 5
## 4 2.023616e-54 1.173498e-31 8.163299e-14 1.000000e+00 6 1.0000000 6
## 5 7.915965e-66 1.252725e-40 5.265544e-20 1.000000e+00 6 1.0000000 6
## 6 7.908401e-16 5.665944e-05 9.999433e-01 6.710451e-18 5 0.9999433 5
## 7 1.067884e-12 2.639557e-03 9.973604e-01 1.287973e-21 5 0.9973604 5
## Posterior_Mean
## 2 5.000000
## 3 4.999915
## 4 6.000000
## 5 6.000000
## 6 4.999943
## 7 4.997360
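The same weighted sum can also be computed as a dot product of the probability columns with 0:6, which avoids typing each term. A minimal sketch (my own check, assuming the columns P0 through P6 appear in that order):
# sketch: posterior mean as a matrix product of the P0..P6 columns with 0:6
probs <- fpout_01$scores %>% select(P0:P6)
head(as.vector(as.matrix(probs) %*% 0:6)) # should match Posterior_Mean above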
Use a map*() function to identify the names of the variables in the scores data frame that are not of type double.
# keep(): retains the elements for which the predicate is TRUE
# discard(): retains the elements for which the predicate is FALSE
discard(fpout_01$scores, is_double) -> notDouble
head(notDouble)
## marker MarkerName SampleName
## 2 1 SNP 2
## 3 1 SNP 3
## 4 1 SNP 4
## 5 1 SNP 5
## 6 1 SNP 6
## 7 1 SNP 7
# $marker, $MarkerName, and $SampleName are not of type double
Revised: Is this correct? No – it does not return just the names. Here are some alternatives.
names(fpout_01$scores)[map_chr(fpout_01$scores, typeof) != "double"]
## [1] "marker" "MarkerName" "SampleName"
names(keep(fpout_01$scores, ~ typeof(.) != "double"))
## [1] "marker" "MarkerName" "SampleName"
names(discard(fpout_01$scores, ~ typeof(.) == "double"))
## [1] "marker" "MarkerName" "SampleName"
And a simpler approach to the column means:
fpout_01$scores %>%
select(-marker, -MarkerName, -SampleName) %>%
colMeans(na.rm = TRUE) ->
fpout_01$col_means
Create a new element called col_means in the list that contains just the column means of all the double variables in the scores data frame.
fpout_01$scores %>%
select(ratio:Posterior_Mean) -> scores_02
col_means <- list(col_means = map_dbl(scores_02, mean, na.rm = TRUE))
# Create a new list called `col_means`
print(col_means)
## $col_means
## ratio P0 P1 P2 P3
## 9.110837e-01 1.141826e-236 5.571926e-24 3.159587e-06 1.528993e-02
## P4 P5 P6 maxgeno maxP
## 2.535163e-01 5.074307e-01 2.237600e-01 4.951049e+00 9.780702e-01
## geno Posterior_Mean
## 5.095652e+00 4.939654e+00
# append the column means to the existing list, saving the result as fpout_02
append(fpout_01, col_means) -> fpout_02
# check that the new element was added
length(fpout_02)
## [1] 8
Extract the col_means element from the list. The extracted element should not be a list.
# first way
fpout_02[[7]]
## ratio P0 P1 P2 P3
## 9.110837e-01 1.141826e-236 5.571926e-24 3.159587e-06 1.528993e-02
## P4 P5 P6 maxgeno maxP
## 2.535163e-01 5.074307e-01 2.237600e-01 4.951049e+00 9.780702e-01
## geno Posterior_Mean
## 5.095652e+00 4.939654e+00
# second way
fpout_02$col_means
## ratio P0 P1 P2 P3
## 9.110837e-01 1.141826e-236 5.571926e-24 3.159587e-06 1.528993e-02
## P4 P5 P6 maxgeno maxP
## 2.535163e-01 5.074307e-01 2.237600e-01 4.951049e+00 9.780702e-01
## geno Posterior_Mean
## 5.095652e+00 4.939654e+00
# third method
fpout_02[["col_means"]]
## ratio P0 P1 P2 P3
## 9.110837e-01 1.141826e-236 5.571926e-24 3.159587e-06 1.528993e-02
## P4 P5 P6 maxgeno maxP
## 2.535163e-01 5.074307e-01 2.237600e-01 4.951049e+00 9.780702e-01
## geno Posterior_Mean
## 5.095652e+00 4.939654e+00
# we can always check with str()
Extract the P1 element from col_means.
# first way
fpout_02$col_means[c(3)]
## P1
## 5.571926e-24
#second way
fpout_02$col_means[c("P1")]
## P1
## 5.571926e-24
#third way (optional)
fpout_02$col_means[c(-1, -2, -4, -5, -6, -7, -8, -9, -10, -11, -12)]
## P1
## 5.571926e-24
Consider the recursive sequence defined by \[ x_n = x_{n-1} + \frac{|x_{n-3} - x_{n-2}|}{4}. \] That is, element \(n\) is the sum of element \(n-1\) and the absolute value of the difference between elements \(n-3\) and \(n-2\) divided by four. For example, if we let \(x_1 = 3\), \(x_2 = 1\), and \(x_3 = 10\), then \(x_4\) is \[ x_4 = 10 + \frac{|3 - 1|}{4} = 10.5. \]
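Carrying the example one more step (my own worked check of the recurrence): \[ x_5 = x_4 + \frac{|x_2 - x_3|}{4} = 10.5 + \frac{|1 - 10|}{4} = 12.75. \]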
Write a function calcn() that takes as input a vector x containing the first three elements of this sequence and an integer n denoting the final element of the sequence to calculate. calcn() should return element n. For example, in my implementation of calcn(), I obtained the following: (see HTML)
calcn <- function(x, n){
nums <- vector(mode = "integer", length = n) # pre-allocate the result vector
for(i in seq_along(nums)){
if (i <= 3){
nums[i] <- x[i] # the first three elements are given, so copy them from x
}
else {
nums[i] <- nums[i-1] + (abs(nums[i-3] - nums[i-2]))/4 # apply the recurrence (divide by 4)
}
}
return(nums[n]) # return Xn in formula
}
calcn(x = c(2, 4, 3), n = 3)
## [1] 3
calcn(x = c(2, 4, 3), n = 4)
## [1] 3.5
calcn(x = c(2, 4, 3), n = 5)
## [1] 3.75
calcn(x = c(2, 4, 3), n = 6)
## [1] 3.875
calcn(x = c(2, 4, 3), n = 7)
## [1] 3.9375
calcn <- function(x, n){
if(length(x) == 3 & n > 0 & is.integer(n)){
nums <- vector(mode = "integer", length = n)
for(i in seq_along(nums)){
if(i <= 3){
nums[i] <- x[i]
}else{
nums[i] <- nums[i-1] + (abs(nums[i-3] - nums[i-2]))/4
}
}
return(nums[n])
}else{
stop("x is not equal to 3 or n is not a positive integer")
} # error check x and n
} # created calcn function
calcn(x = c(2, 4, 3), n = 7L) # correct x and n type
## [1] 3.9375
calcn(x = c(2, 4), n = 7) # length of x < 3
## Error in calcn(x = c(2, 4), n = 7): length of x is not equal to 3 or n is not a positive integer
calcn(x = c(2, 4, 3, 5), n = 6) # length of x > 3
## Error in calcn(x = c(2, 4, 3, 5), n = 6): length of x is not equal to 3 or n is not a positive integer
calcn(x = c(2, 4, 3), n = 0) # n <= 0
## Error in calcn(x = c(2, 4, 3), n = 0): length of x is not equal to 3 or n is not a positive integer
calcn(x = c(2, 4, 3), n = 5.53) # n is a float(dbl)
## Error in calcn(x = c(2, 4, 3), n = 5.53): length of x is not equal to 3 or n is not a positive integer
calcn_test <- function(x, n){
stopifnot(length(x) == 3 & is.numeric(x)) # error check x
stopifnot(n > 0 & is.integer(n)) # error check n
nums <- vector(mode = "integer", length = n) # pre-allocate the result vector
for(i in seq_along(nums)){
if(i <= 3){
nums[i] <- x[i]
}else{
nums[i] <- nums[i-1] + (abs(nums[i-3] - nums[i-2]))/4
}
}
return(nums[n])
} # created calcn function
calcn_test(x = c(2, 4, 3), n = 7L) # correct x and n type
## [1] 3.9375
calcn_test(x = c(2, 4), n = 7) # length of x < 3
## Error in calcn_test(x = c(2, 4), n = 7): length(x) == 3 & is.numeric(x) is not TRUE
calcn_test(x = c(2, 4, 3, 5), n = 6) # length of x > 3
## Error in calcn_test(x = c(2, 4, 3, 5), n = 6): length(x) == 3 & is.numeric(x) is not TRUE
calcn_test(x = c(2, 4, 3), n = 0) # n <= 0
## Error in calcn_test(x = c(2, 4, 3), n = 0): n > 0 & is.integer(n) is not TRUE
calcn_test(x = c(2, 4, 3), n = 5.53) # n is a float(dbl)
## Error in calcn_test(x = c(2, 4, 3), n = 5.53): n > 0 & is.integer(n) is not TRUE
calcn(c(11,1,130), 1000L)
## [1] 176.3333
calcn(c(11,1,130), 1L)
## [1] 11
calcn(c(7, 3, 20), 8L)
## [1] 26.625
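As a sanity check on the 176.3333 value (my own derivation, not part of the assignment): once the sequence is increasing, each increment is one quarter of the increment two steps earlier, so writing \(d_n = x_n - x_{n-1}\), the increments form two geometric series and the sequence converges to \[ x_3 + \frac{4}{3}\left(d_4 + d_5\right) = 130 + \frac{4}{3}\left(\frac{|11 - 1|}{4} + \frac{|1 - 130|}{4}\right) = 130 + \frac{4}{3}(2.5 + 32.25) \approx 176.33, \] which calcn(c(11, 1, 130), 1000L) has effectively reached.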
map_*()
Lists are often used to save simulation output. You can then extract individual elements from the lists using for-loops.
Consider the \(t\)-test, used to test whether or not the mean of some observations is 0. We would use the following code to simulate data from a Normal (0,1) distribution, and then use a \(t\)-test to test if the true mean is 0:
x <- rnorm(n = 10, mean = 0, sd = 1)
tout <- t.test(x)
tout
##
## One Sample t-test
##
## data: x
## t = 1.9842, df = 9, p-value = 0.07853
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.06640975 1.01459370
## sample estimates:
## mean of x
## 0.474092
The output of t.test() is a list-like object. Use one function to show how many elements are in the list along with their names and types.
# only one function is needed to show all of this
str(tout)
## List of 10
## $ statistic : Named num 1.98
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 9
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.0785
## $ conf.int : num [1:2] -0.0664 1.0146
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num 0.474
## ..- attr(*, "names")= chr "mean of x"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "mean"
## $ stderr : num 0.239
## $ alternative: chr "two.sided"
## $ method : chr "One Sample t-test"
## $ data.name : chr "x"
## - attr(*, "class")= chr "htest"
# how many elements
length(tout)
## [1] 10
# names
names(tout)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
# types
typeof(tout)
## [1] "list"
is_double(tout)
## [1] FALSE
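Note that typeof(tout) reports the type of the whole object ("list"); to see the type of each element, a map_*() call works. A short sketch (my addition):
# sketch: type of each element of tout, as a named character vector
map_chr(tout, typeof)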
Repeat the simulation 1000 times with a for-loop. On each iteration i, save the output of t.test() as the ith element in a list called tlist.
set.seed(1)
tlist <- list()
for(i in 1:1000) {
x <- rnorm(10, mean = 0, sd = 2) # draw 10 random observations
tout <- t.test(x) # run the t-test on this sample
tlist[[i]] <- tout # save the result as the ith element of tlist
}
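The same list could be built without an explicit loop. A purrr sketch (my addition; the draws would differ from the loop above unless the seed and draw order match exactly):
# sketch: build the list of t-test results with map() instead of a for-loop
#tlist_m <- map(1:1000, ~ t.test(rnorm(10, mean = 0, sd = 2)))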
# spot-check a few elements of tlist
tlist[[1]]
##
## One Sample t-test
##
## data: x
## t = 0.53557, df = 9, p-value = 0.6052
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.8523895 1.3812007
## sample estimates:
## mean of x
## 0.2644056
tlist[[20]]
##
## One Sample t-test
##
## data: x
## t = -0.52263, df = 9, p-value = 0.6139
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -1.5732947 0.9827645
## sample estimates:
## mean of x
## -0.2952651
tlist[[300]]
##
## One Sample t-test
##
## data: x
## t = -0.582, df = 9, p-value = 0.5749
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -2.725734 1.610202
## sample estimates:
## mean of x
## -0.557766
Hint: Make sure the data going into ggplot() is a data frame (tibble).
QQ plots are for testing against a known distribution or for comparing two distributions to see if they are similar.
A histogram is the more appropriate plot here.
Revised: 3.3.3: - 0.25 Created intermediate variables instead of using pipe.
If you don’t need intermediate variables they can clutter up your code and memory. Sometimes they can add to clarity or help in debugging but don’t create them just because you can. Once you created them and tested everything works consider removing them to have cleaner code.
Original Data from Yunting
tlist %>%
map_dbl(~.$estimate) -> tlist_mean # extract the sample mean from each t-test
as_tibble(tlist_mean) -> tlist_mean01 # convert to a tibble so ggplot can use it
# plot a histogram of the means
tlist_mean01 %>%
ggplot(aes(x = value)) +
geom_histogram() +
theme_bw() +
xlab("Mean of the tlist")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#tlist_mean01 %>%
#ggplot(aes(sample = value))+
#geom_qq()+
#theme_bw()
tlist %>%
map_dbl(~.$estimate[[1]]) %>%
tibble(x_bar=.) %>%
ggplot(aes(x=x_bar))+
geom_histogram(bins = 50)+
theme_bw()
Use a for-loop to extract the p-value from each test and store them in a vector called pvec_f. Show the first 6 values.
pvec_f <- vector(mode = "double", length = 1000) # pre-allocate a double vector for the p-values
# start for loop
for(i in 1:1000) {
pvec_f[[i]] <- tlist[[i]]$p.value # extract the p-value from the ith t-test
}
head(pvec_f)
## [1] 0.6052327 0.4806056 0.6686761 0.6480420 0.4945143 0.6653161
Use a map_*() function to extract the p-values into a vector called pvec_m. Show the first 6 values.
map_dbl(tlist, ~.$p.value) -> pvec_m
head(pvec_m)
## [1] 0.6052327 0.4806056 0.6686761 0.6480420 0.4945143 0.6653161
Use pvec_m to create a QQ-plot and then interpret the plot with regard to whether the p-values exhibit a uniform distribution. Interpretation: If the null hypothesis is true, the p-values should follow a Uniform(0, 1) distribution. In this plot the points lie almost on top of the reference line, so the evidence indicates that the p-values in tlist are uniformly distributed.
Revised: 3.3.6: +1.75. Labels are confusing, as your x axis should not be p-values; it is the theoretical uniform distribution, and your sample is the p-values.
pvec_m %>%
as_tibble(pvec_m) -> dfpvec_m # make sure the data save as data frame type
ggplot(data = dfpvec_m)+
geom_qq(distribution = stats::qunif, mapping = aes(sample = pvec_m))+
geom_abline(color = "red", linetype = "dashed", size = 1)+ # "qunif" gives the quantile function
theme_bw()+
labs(x = "Theoretical Quantiles", y = "Sample Quantiles")+
ggtitle("Uniform Distribution")
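An alternative that addresses the labeling comment above is to let ggplot compute the reference line from the same uniform distribution with geom_qq_line(). A sketch (my addition, not run here):
# sketch: QQ plot with a uniform reference line and clearer axis labels
#tibble(p = pvec_m) %>%
# ggplot(aes(sample = p))+
# geom_qq(distribution = stats::qunif)+
# geom_qq_line(distribution = stats::qunif, color = "red", linetype = "dashed")+
# theme_bw()+
# labs(x = "Theoretical Uniform(0, 1) quantiles", y = "Sample p-value quantiles")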