Instructions

  1. Clone this homework repo to your homework directory as a new repo.
  2. Rename the starter file under the analysis directory as hw_01_yourname.Rmd and use it for your solutions.
  3. Modify the “author” field in the YAML header.
  4. Stage and Commit R Markdown and HTML files (no PDF files).
  5. Push both .Rmd and HTML files to GitHub.
  1. Commit each time you answer a part of question, e.g. 1.1
  2. Push to GitHub after each major question
  3. When complete, submit a response in Canvas

Libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(stats)

1 Working with Lists

Because of their generality, lists (or list-like objects) are often the output of many statistical procedures in R. The file fpout.RDS in the data folder contains sample output from using fitPoly, a statistical method to quantify properties of locations on the genome.

  1. Use readRDS() and a relative path to read this data into R.
fpout_01 <- readRDS(file = "../data/fpout.RDS")
  1. Use appropriate functions to obtain the length of the list and then the vector of the names from the list.
length(fpout_01)
## [1] 7
names(fpout_01)
## [1] "log"          "modeldata"    "allmodeldata" "scores"       "diploscores" 
## [6] "time"         "prop_miss"
  1. The diploscores element does not provide any information. Remove it from the list.
    Revised: - 0.1 Did not remove from the original list but created a new list. Could assign Null to the list element.
fpout_01 <- fpout_01[-c(5)]

# check again
length(fpout_01)
## [1] 6
names(fpout_01)
## [1] "log"          "modeldata"    "allmodeldata" "scores"       "time"        
## [6] "prop_miss"
# second method
#fpout$diploscores <- NULL

# third method
#[[5]] <- NULL
  1. The scores element contains the output most users would want. The variables in scores called P0, P1, P2, P3, P4, P5, and P6 contain “posterior probabilities” for each individual for values 0, 1, 2, 3, 4, 5, and 6 (respectively).
fpout_01$scores %>% 
  mutate(Posterior_Mean = (P0 * 0) + (P1 * 1) + (P2 * 2) + (P3 * 3) + (P4 * 4) + (P5 * 5) + (P6 * 6)) -> fpout_01$scores

head(fpout_01$scores)
##   marker MarkerName SampleName     ratio P0            P1           P2
## 2      1        SNP          2 0.9450980  0  0.000000e+00 0.000000e+00
## 3      1        SNP          3 0.9186047  0  1.513089e-70 1.630727e-32
## 4      1        SNP          4 0.9976387  0 4.965532e-143 6.960499e-85
## 5      1        SNP          5 1.0000000  0 3.237313e-161 5.236498e-99
## 6      1        SNP          6 0.9202756  0  2.644688e-71 5.145367e-33
## 7      1        SNP          7 0.9037620  0  4.181303e-64 2.965385e-28
##              P3           P4           P5            P6 maxgeno      maxP geno
## 2 5.487698e-278 1.792528e-95 1.000000e+00 1.765903e-145       5 1.0000000    5
## 3  1.688654e-15 8.489692e-05 9.999151e-01  2.727826e-18       5 0.9999151    5
## 4  2.023616e-54 1.173498e-31 8.163299e-14  1.000000e+00       6 1.0000000    6
## 5  7.915965e-66 1.252725e-40 5.265544e-20  1.000000e+00       6 1.0000000    6
## 6  7.908401e-16 5.665944e-05 9.999433e-01  6.710451e-18       5 0.9999433    5
## 7  1.067884e-12 2.639557e-03 9.973604e-01  1.287973e-21       5 0.9973604    5
##   Posterior_Mean
## 2       5.000000
## 3       4.999915
## 4       6.000000
## 5       6.000000
## 6       4.999943
## 7       4.997360
  1. Use a map*() function to identify the names of the variables in the scores data frame that are not of type double.
    Not correct: 3.1.5: - 0.25 Map function identified all names, not just the names of the variables that were Not double. A very complex approach which could have been done with select, colMeans and assignment back to the lists as a new vector.
# keep: keep the TRUE list
# discard: keep the FALSE list
discard(fpout_01$scores, is_double) -> notDouble

head(notDouble)
##   marker MarkerName SampleName
## 2      1        SNP          2
## 3      1        SNP          3
## 4      1        SNP          4
## 5      1        SNP          5
## 6      1        SNP          6
## 7      1        SNP          7
# $marker, $MarkerName, and $SampleName, are not dbl type

Revised: Is this correct? No – it does not return just the names. Here are some alternates.

names(fpout_01$scores)[map_chr(fpout_01$scores, typeof)!= "double"]
## [1] "marker"     "MarkerName" "SampleName"
names(keep(fpout_01$scores, ~ typeof(.) != "double"))
## [1] "marker"     "MarkerName" "SampleName"
names(discard(fpout_01$scores, ~ typeof(.) == "double"))
## [1] "marker"     "MarkerName" "SampleName"

and a simpler approach to Column Means

fpout_01$scores %>%
  select(-marker, -MarkerName, -SampleName) %>%
  colMeans(na.rm = TRUE) ->
  fpout_01$col_means
fpout_01$scores %>%
 select(ratio:Posterior_Mean) -> scores_02
col_means <- list(col_means = map_dbl(scores_02, mean, na.rm = TRUE))

#  Create a new list called `col_means` 
print(col_means)
## $col_means
##          ratio             P0             P1             P2             P3 
##   9.110837e-01  1.141826e-236   5.571926e-24   3.159587e-06   1.528993e-02 
##             P4             P5             P6        maxgeno           maxP 
##   2.535163e-01   5.074307e-01   2.237600e-01   4.951049e+00   9.780702e-01 
##           geno Posterior_Mean 
##   5.095652e+00   4.939654e+00
# adding to the existing list to "fpout_02"
append(fpout_01, col_means) -> fpout_02
# test whether is successful or not
length(fpout_02)
## [1] 8
  1. Demonstrate three different ways to extract the col_means element from the list. The extracted element should not be a list.
# first way
fpout_02[[7]]
##          ratio             P0             P1             P2             P3 
##   9.110837e-01  1.141826e-236   5.571926e-24   3.159587e-06   1.528993e-02 
##             P4             P5             P6        maxgeno           maxP 
##   2.535163e-01   5.074307e-01   2.237600e-01   4.951049e+00   9.780702e-01 
##           geno Posterior_Mean 
##   5.095652e+00   4.939654e+00
# second way
fpout_02$col_means
##          ratio             P0             P1             P2             P3 
##   9.110837e-01  1.141826e-236   5.571926e-24   3.159587e-06   1.528993e-02 
##             P4             P5             P6        maxgeno           maxP 
##   2.535163e-01   5.074307e-01   2.237600e-01   4.951049e+00   9.780702e-01 
##           geno Posterior_Mean 
##   5.095652e+00   4.939654e+00
# third method
fpout_02[["col_means"]]
##          ratio             P0             P1             P2             P3 
##   9.110837e-01  1.141826e-236   5.571926e-24   3.159587e-06   1.528993e-02 
##             P4             P5             P6        maxgeno           maxP 
##   2.535163e-01   5.074307e-01   2.237600e-01   4.951049e+00   9.780702e-01 
##           geno Posterior_Mean 
##   5.095652e+00   4.939654e+00
# we can always check with str()
# first way
fpout_02$col_means[c(3)]
##           P1 
## 5.571926e-24
#second way
fpout_02$col_means[c("P1")]
##           P1 
## 5.571926e-24
#third way (optional)
fpout_02$col_means[c(-1, -2, -4, -5, -6, -7, -8, -9, -10, -11, -12)]
##           P1 
## 5.571926e-24

2 For Loops

Consider the recursive sequence defined by \[ x_n = x_{n-1} + \frac{|x_{n-3} - x_{n-2}|}{4}. \] That is, element \(n\) is the sum of element \(n-1\) and the absolute value of the difference between between elements \(n-3\) and \(n-2\) divided by two. For example, if we let \(x_1 = 3\), \(x_2 = 1\), and \(x_3 = 10\), then \(x_4\) is \[ x_4 = 10 + \frac{|3 - 1|}{4} = 11. \]

  1. Write a function called calcn() that takes as input a vector x containing the first three elements of this sequence and an integer n denoting the final element of the sequence to calculate.

For example, in my implementation of calcn(), I obtained the following: (see HTML)

calcn <- function(x, n){
  nums <- vector(mode = "integer", length = n) # set nums type
  for(i in seq_along(nums)){
    if (i <= 3){
      nums[i] <- x[i] # 1, 2, 3 can not be calculated so keep it
    }
    else {
      nums[i] <- nums[i-1] + (abs(nums[i-3] - nums[i-2]))/4 # start working formula with 4 
    }
    
    }
    return(nums[n]) # return Xn in formula
}

calcn(x = c(2, 4, 3), n = 3)
## [1] 3
calcn(x = c(2, 4, 3), n = 4)
## [1] 3.5
calcn(x = c(2, 4, 3), n = 5)
## [1] 3.75
calcn(x = c(2, 4, 3), n = 6)
## [1] 3.875
calcn(x = c(2, 4, 3), n = 7)
## [1] 3.9375
calcn <- function(x, n){
  if(length(x) == 3 & n > 0 & is.integer(n)){
  nums <- vector(mode = "integer", length = n)
  for(i in seq_along(nums)){
    if(i <= 3){
      nums[i] <- x[i]
    }else{
      nums[i] <- nums[i-1] + (abs(nums[i-3] - nums[i-2]))/4
    }
    
    }
  return(nums[n])
    }else{
      stop("x is not equal to 3 or n is not a positive integer")
      
    } # error check x and n
} # created calcn function 


calcn(x = c(2, 4, 3), n = 7L) # correct x and n type
## [1] 3.9375
calcn(x = c(2, 4), n = 7) # length of x < 3
## Error in calcn(x = c(2, 4), n = 7): x is not equal to 3 or n is not a positive integer
calcn(x = c(2, 4, 3, 5), n = 6) # length of x > 3
## Error in calcn(x = c(2, 4, 3, 5), n = 6): x is not equal to 3 or n is not a positive integer
calcn(x = c(2, 4, 3), n = 0) # n <= 0
## Error in calcn(x = c(2, 4, 3), n = 0): x is not equal to 3 or n is not a positive integer
calcn(x = c(2, 4, 3), n = 5.53) # n is a float(dbl)
## Error in calcn(x = c(2, 4, 3), n = 5.53): x is not equal to 3 or n is not a positive integer
calcn_test <- function(x, n){
  stopifnot(length(x) == 3 & is.numeric(x)) # error check x
  stopifnot(n > 0 & is.integer(n)) # error check n
  
  nums <- vector(mode = "integer", length = n) # create nums's type 
  for(i in seq_along(nums)){
    if(i <= 3){
      nums[i] <- x[i]
    }else{
      nums[i] <- nums[i-1] + (abs(nums[i-3] - nums[i-2]))/4
    }
  }
  return(nums[n])
} # created calcn function 


calcn_test(x = c(2, 4, 3), n = 7L) # correct x and n type
## [1] 3.9375
calcn_test(x = c(2, 4), n = 7) # length of x < 3
## Error in calcn_test(x = c(2, 4), n = 7): length(x) == 3 & is.numeric(x) is not TRUE
calcn_test(x = c(2, 4, 3, 5), n = 6) # length of x > 3
## Error in calcn_test(x = c(2, 4, 3, 5), n = 6): length(x) == 3 & is.numeric(x) is not TRUE
calcn_test(x = c(2, 4, 3), n = 0) # n <= 0
## Error in calcn_test(x = c(2, 4, 3), n = 0): n > 0 & is.integer(n) is not TRUE
calcn_test(x = c(2, 4, 3), n = 5.53) # n is a float(dbl)
## Error in calcn_test(x = c(2, 4, 3), n = 5.53): n > 0 & is.integer(n) is not TRUE
calcn(c(11,1,130), 1000L)  
## [1] 176.3333
calcn(c(11,1,130), 1L)  
## [1] 11
calcn(c(7, 3, 20), 8L)
## [1] 26.625

3 Question: Lists, For-loops, and map_*()

Lists are often used to save simulation output. You can then extract individual elements from the lists using for-loops.

Consider the \(t\)-test, used to test whether or not the mean of some observations is 0. We would use the following code to simulate data from a Normal (0,1) distribution, and then use a \(t\)-test to test if the true mean is 0:

x <- rnorm(n = 10, mean = 0, sd = 1)
tout <- t.test(x)
tout
## 
##  One Sample t-test
## 
## data:  x
## t = 1.9842, df = 9, p-value = 0.07853
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.06640975  1.01459370
## sample estimates:
## mean of x 
##  0.474092
  1. The output of t.test() is a list-like object. Use one function to show how many elements are in the list along with their names and types.
    Not revise: 3.3.1: - 0.05 No need for additional functions to repeat what is in str()
# only one function to show all
str(tout)
## List of 10
##  $ statistic  : Named num 1.98
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 9
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 0.0785
##  $ conf.int   : num [1:2] -0.0664 1.0146
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num 0.474
##   ..- attr(*, "names")= chr "mean of x"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "mean"
##  $ stderr     : num 0.239
##  $ alternative: chr "two.sided"
##  $ method     : chr "One Sample t-test"
##  $ data.name  : chr "x"
##  - attr(*, "class")= chr "htest"
# how many elements
length(tout)
## [1] 10
# names
names(tout)
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"
# types
typeof(tout)
## [1] "list"
is_double(tout)
## [1] FALSE
  1. Write a for-loop to perform the following three operations on iteration i:
    1. Draw 10 random observations from a normal distribution with mean 0 and standard deviation.
    2. Run a \(t\)-test on these 10 observations.
    3. Save the output of the \(t\)-test as the ith element in a list called tlist.
set.seed(1)
tlist <- list()

for(i in 1:1000) {
  x <- rnorm(10, mean = 0, sd = 2)
  tout <- t.test(x)# draw 10 random observations
  tlist[[i]] <- tout# t-test the x for each time then save to the tlist
}

# randomly test some tlists
tlist[[1]]
## 
##  One Sample t-test
## 
## data:  x
## t = 0.53557, df = 9, p-value = 0.6052
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.8523895  1.3812007
## sample estimates:
## mean of x 
## 0.2644056
tlist[[20]]
## 
##  One Sample t-test
## 
## data:  x
## t = -0.52263, df = 9, p-value = 0.6139
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -1.5732947  0.9827645
## sample estimates:
##  mean of x 
## -0.2952651
tlist[[300]]
## 
##  One Sample t-test
## 
## data:  x
## t = -0.582, df = 9, p-value = 0.5749
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -2.725734  1.610202
## sample estimates:
## mean of x 
## -0.557766
  1. Use the appropriate map function to extract the sample mean from each test (from 2) and pipe to an appropriate plot to show the sampling distribution of the sample mean.
tlist %>%
map_dbl(~.$estimate) -> tlist_mean # extract the mean of each tlist
as_tibble(tlist_mean) -> tlist_mean01 # make sure the data save as data frame type

# plot to histogram
tlist_mean01 %>%
 ggplot(aes(x = value)) +
  geom_histogram() +
  theme_bw() +
  xlab("Mean of the tlist")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#tlist_mean01 %>%
 #ggplot(aes(sample = value))+
  #geom_qq()+
  #theme_bw()
tlist %>% 
  map_dbl(~.$estimate[[1]]) %>% 
  tibble(x_bar=.) %>% 
  ggplot(aes(x=x_bar))+
    geom_histogram(bins = 50)+
  theme_bw()

  1. The \(p\)-value is an important quantity in statistics. Use a for-loop to extract the \(p\)-values from each test in part 2 and combine these \(p\)-values into a single vector called pvec_f. Show the first 6 values.
    Revised:3.3.4: - 0.1 Use mode and length = 1000) to initialize vector of proper type and length
pvec_f <- vector(mode = "double", length = 1000) # set pvec_f is a vector type

# start for loop
for(i in 1:1000) {
  pvec_f[[i]] <- tlist[[i]]$p.value # recall tlist$p.value to extract it
}

head(pvec_f)
## [1] 0.6052327 0.4806056 0.6686761 0.6480420 0.4945143 0.6653161
  1. Use the appropriate map function to extract the \(p\)-values from each test in part 2 and combine these \(p\)-values into a single vector called pvec_m. Show the first 6 values.
map_dbl(tlist, ~.$p.value) -> pvec_m

head(pvec_m)
## [1] 0.6052327 0.4806056 0.6686761 0.6480420 0.4945143 0.6653161
  1. Extra Credit \(p\)-values have a nice property where, if the the null hypothesis is true (i.e. the mean of the observations is actually 0), then the \(p\)-values follow the uniform distribution.

Interpretation: If \(p\)-values is uniformly distributed, the null hypothesis should be true. In this plot, we can see these two lines are almost sticking together, which means this statistical evidence indicates tlist’s \(p\)-values is uniformly distributed.
Revised: 3.3.6: +1.75 Labels are confusing as your x axis should not be p values, it is the theoretical uniform distribution and your sample is the p values

pvec_m %>%
  as_tibble(pvec_m) -> dfpvec_m # make sure the data save as data frame type
  ggplot(data = dfpvec_m)+
  geom_qq(distribution = stats::qunif, mapping = aes(sample = pvec_m))+
  geom_abline(color = "red", linetype = "dashed", size = 1)+ # "qunif" gives the quantile function 
  theme_bw()+
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")+
  ggtitle("Uniform Distribution")