Contents

1 Getting started: Tibbles

A tidy data frame can be constructed by the tibble() function. For example, we illustrate how to construct a data frame from genomic intervals or coordinates.

library(tidyverse)

chr <- c("chr1", "chr1", "chr2", "chr2")
strand <- c("-","-","+","+")
start<- c(200,4000,100,400)
end<-c(250,410,200,450)
mydata <- tibble(chr,start,end,strand)
#change column names
names(mydata) <- c("chr","start","end","strand")
mydata # OR this will work too
## # A tibble: 4 x 4
##   chr   start   end strand
##   <chr> <dbl> <dbl> <chr> 
## 1 chr1    200   250 -     
## 2 chr1   4000   410 -     
## 3 chr2    100   200 +     
## 4 chr2    400   450 +

2 Reading and writing data

To read whitespace-separated columns into a tibble with the readr package, two dedicated funcitons are useful:

Most of the genomics data sets are in the form of genomic intervals associated with a score. That means mostly the data will be in table format with columns denoting chromosome, start positions, end positions, strand and score. One of the popular formats is the BED format, which is used primarily by the UCSC genome browser but most other genome browsers and tools will support the BED file format.

library(compGenomRData)

enhancerFilePath=system.file("extdata",
                "subset.enhancers.hg18.bed",
                package="compGenomRData")
enh_df <- read_table2(enhancerFilePath,col_names = FALSE) 
glimpse(enh_df)
## Rows: 50,416
## Columns: 9
## $ X1 <chr> "chr20", "chr20", "chr20", "chr20", "chr20", "chr20", "chr20", "chr…
## $ X2 <dbl> 266275, 287400, 300500, 330400, 341425, 437975, 516650, 519100, 543…
## $ X3 <dbl> 267925, 294500, 302500, 331800, 343400, 439900, 518525, 521475, 545…
## $ X4 <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".…
## $ X5 <dbl> 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1…
## $ X6 <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".…
## $ X7 <dbl> 9.11, 10.53, 9.10, 6.39, 6.20, 6.31, 12.48, 7.10, 9.52, 7.71, 8.00,…
## $ X8 <dbl> 13.1693, 13.0231, 13.3935, 13.5105, 12.9852, 13.5184, 13.6827, 13.0…
## $ X9 <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
cpgiFilePath=system.file("extdata",
                "subset.cpgi.hg18.bed",
                package="compGenomRData")
cpgi_df <- read_table2(cpgiFilePath, col_names = FALSE) 
glimpse(cpgi_df)
## Rows: 1,895
## Columns: 4
## $ X1 <chr> "chr20", "chr20", "chr20", "chr20", "chr20", "chr20", "chr20", "chr…
## $ X2 <dbl> 195575, 207789, 219055, 225831, 252826, 275376, 309489, 324916, 336…
## $ X3 <dbl> 195851, 208148, 219437, 227155, 256323, 276977, 310805, 325165, 337…
## $ X4 <chr> "CpG:_28", "CpG:_32", "CpG:_33", "CpG:_135", "CpG:_286", "CpG:_116"…

A data frame or matrix can be written out by using the write_csv function, of the write_*() family of functions.

write_csv(enh_df, "enh_df.csv")
write_csv(cpgi_df, "cpgi_df.csv")

3 Visualisation

Captions to figures can be assigned in the code chunk option fig.cap to automatically number them, and to be able to reference them, see Figure 1. The figure label is generated from the code chunk label by prefixing it with fig:.

plot histogram of lenths of CpG islands.

Figure 1: plot histogram of lenths of CpG islands

Small and wide figures can be specified by fig.small and fig.wide code chunk options.

Small figure. Barplot of one categorical data: chromosome for cpgi_df.

Figure 2: Small figure
Barplot of one categorical data: chromosome for cpgi_df.

Wide figure. Color according to chromosome type for cpgi_df.

Figure 3: Wide figure
Color according to chromosome type for cpgi_df.

To decide if the CpG island is large (>1500), normal (<=1500 and >700) or short (<=700), without loop structures, subsetting is very useful.

cpglen <- tibble(cpgi_df$X3-cpgi_df$X2+1)

cpglen <- cpglen %>% mutate( size = if_else( cpglen > 1500, "large", if_else( cpglen > 700, "normal", "short" )))

cpglen[100,]
## # A tibble: 1 x 2
##   `cpgi_df$X3 - cpgi_df$X2 + 1` size  
##                           <dbl> <chr> 
## 1                          1048 normal

4 Inference

Due to technical bias in experiments or natural variation in the samples, the statistical treatment of data is needed when measuring gene expression values for a certain gene.

To do this in the tidyverse, the indicated package is infer because implements an expressive grammar to perform statistical inference that coheres with a tidy design framework.

The workflow of this package is designed around the following funcitons:

The Poisson distribution is commonly used to model count data such as sequencing read counts. Let’s simulate \(30\) random variables with a mean parameter of \(50\).

library(Distributacalcul)
library(tidymodels)
set.seed(21)
sample1= rpois(30,50)
sample2= rpois(30,50)
sample=tibble(sample1,sample2)

The specify() function can be used to specify that the vector sample1 in this tibble will be our response variable.

sample %>% specify(response = sample1) %>% class()
## [1] "infer"      "tbl_df"     "tbl"        "data.frame"

The next step in the infer pipeline is often to declare a null hypothesis using hypothesize(). In this case, the null hypothesis is that the mean of the sample in our population is \(46\).

sample %>% specify(response = sample1) %>% hypothesize(null = "point", mu = 47)
## Response: sample1 (integer)
## Null Hypothesis: point
## # A tibble: 30 x 1
##    sample1
##      <int>
##  1      55
##  2      53
##  3      62
##  4      41
##  5      57
##  6      60
##  7      52
##  8      37
##  9      49
## 10      33
## # … with 20 more rows

Now we do this \(1000\) times and calculate the mean of each sample using bootstraping, by repeatedly taking samples from the original sample with replacement and forming a null distribution.

sample %>% specify(response = sample1) %>% hypothesize(null = "point", mu = 47) %>% generate(reps = 1000, type = "bootstrap")
## Response: sample1 (integer)
## Null Hypothesis: point
## # A tibble: 30,000 x 2
## # Groups:   replicate [1,000]
##    replicate sample1
##        <int>   <dbl>
##  1         1    58.2
##  2         1    58.2
##  3         1    53.2
##  4         1    43.2
##  5         1    39.2
##  6         1    48.2
##  7         1    35.2
##  8         1    56.2
##  9         1    45.2
## 10         1    54.2
## # … with 29,990 more rows

With the output of generate(), we now supply calculate() with the stat argument set to “mean”.

tibble(sample1) %>% specify(response = sample1) %>% hypothesize(null = "point", mu = 47) %>% generate(reps = 1000, type = "bootstrap") %>% calculate(stat = "mean")
## # A tibble: 1,000 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1  43.5
##  2         2  45.2
##  3         3  47.8
##  4         4  48.0
##  5         5  48.4
##  6         6  45.5
##  7         7  46.2
##  8         8  47.6
##  9         9  48  
## 10        10  48.2
## # … with 990 more rows

The output shows the sample statistic mean for each of our \(1000\) replicates. Now we can visualize whether this statistic is relative to the distribution by using visualize().

null_dist <- sample %>% specify(response = sample1) %>% hypothesize(null = "point", mu = 47) %>% generate(reps = 1000, type = "bootstrap") %>% calculate(stat = "mean") 

point_estimate <- tibble(sample1) %>% specify(response = sample1) %>% calculate(stat = "mean") 

null_dist %>% visualize()
Key functionality of calculate.

Figure 4: Key functionality of calculate

And to observe where this statistic lies on this distribution, we can use the obs_stat argument to specify this inside the shade_p_value function.

null_dist %>% visualize() + shade_p_value(obs_stat = point_estimate, direction = "two-sided")
Key functionality of shade_p_value.

Figure 5: Key functionality of shade_p_value

And to see how unlikely the estimate is, we have the get_p_value function

p_value <- null_dist %>%
  get_p_value(obs_stat = point_estimate, direction = "two-sided")
p_value
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.004

To get a confidence interval around our estimate, we can write:

null_dist %>% get_confidence_interval(point_estimate = point_estimate, level = .95, type = "se")
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     47.8     53.8

5 Samples

To compare sets of samples, and evaluate whether differences in paired values differ from \(0\), a test in this difference in means of two populations can be carried using a sample of data drawn from them. Such comparisons include if wild-type samples have different expression compared to mutants or if healthy samples are different from disease samples in some measurable feature (blood count, gene expression, methylation of certain loci).

\(2\)-Sample t-Tests evaluate the difference in mean values of two populations using data randomly-sampled from the population that approximately follows a normal distribution. As an example, let’s test the difference of means of the following simulated genes:

set.seed(100)
gene1=rnorm(30,mean=4,sd=2)
gene2=rnorm(30,mean=2,sd=2)
org.diff=mean(gene1)-mean(gene2)
gene_df=tibble(expr=c(gene1,gene2),group=c( rep("test",30),rep("control",30) ))

Their distribution can be visualized as follows:

ggplot(gene_df, aes(group, expr)) +
  geom_boxplot()
Boxplot of simulated genes.

Figure 6: Boxplot of simulated genes

To calculate the observed t statistic, we can use specify() and calculate().

observed_statistic2 <- gene_df %>%
  specify(expr ~ group) %>%
  calculate(stat = "t", order = c("test", "control"))
observed_statistic2
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1  3.77

By using permutation, a null distribution can be generated.

null_distribution_2_sample_permute <- gene_df %>%
  specify(expr ~ group) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("test", "control"))

And also a null distribution using the theoretical t distribution can be generated

null_distribution_2_sample_theoretical <- gene_df %>%
  specify(expr ~ group) %>%
  hypothesize(null = "independence") %>%
  # generate() isn't used for the theoretical version!
  calculate(stat = "t", order = c("test", "control"))

A visualization of these distributions is donde with visualize().

null_distribution_2_sample_permute %>%
  visualize() + 
  shade_p_value(observed_statistic2,
                direction = "two-sided")
Visualization of permuted null dsitribution.

Figure 7: Visualization of permuted null dsitribution

null_distribution_2_sample_theoretical %>%
  visualize(method = "theoretical") + 
  shade_p_value(observed_statistic2,
                direction = "two-sided")
Visualization of theoretical null dsitribution.

Figure 8: Visualization of theoretical null dsitribution

And the calculation of the p-value is:

p_value_2_sample <- null_distribution_2_sample_permute %>%
  get_p_value(obs_stat = observed_statistic2,
              direction = "two-sided")
p_value_2_sample
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.002

Another syntax to carry out \(2\)-sample t-Tests on tidy data looks like this:

t_test(x = gene_df, 
       formula = expr ~ group, 
       order = c("test", "control"),
       alternative = "two-sided")
## # A tibble: 1 x 6
##   statistic  t_df  p_value alternative lower_ci upper_ci
##       <dbl> <dbl>    <dbl> <chr>          <dbl>    <dbl>
## 1      3.77  47.6 0.000458 two.sided      0.872     2.87

6 Modeling

To ilustrate the basics of tidy modeling, let’s read and inspect the following histone modification data set:

hmodFile=system.file("extdata",
                     "HistoneModeVSgeneExp.rds",
                     package="compGenomRData") 
hmod_df <- read_rds(hmodFile)
hmod_df <- as_tibble(hmod_df) 
glimpse(hmod_df)
## Rows: 13,731
## Columns: 3
## $ H3k4me3       <dbl> 4.670, 2.046, 2.375, 1.954, 5.190, 5.456, 5.837, 5.432, …
## $ H3k27me3      <dbl> -0.24554094, -0.24554094, 1.22368169, 0.43572509, 0.2237…
## $ measured_log2 <dbl> 6.21751, -5.05889, -5.05889, -5.05889, 6.05534, 5.03454,…

There are \(3\) columns in the dataset. These are measured levels of H3K4me3, H3K27me3 and gene expression per gene. The scatter plot for H3K4me3 vs. expression is:

hmodFile=system.file("extdata",
                     "HistoneModeVSgeneExp.rds",
                     package="compGenomRData") %>% 
  read_rds() %>% as_tibble() %>% glimpse()
## Rows: 13,731
## Columns: 3
## $ H3k4me3       <dbl> 4.670, 2.046, 2.375, 1.954, 5.190, 5.456, 5.837, 5.432, …
## $ H3k27me3      <dbl> -0.24554094, -0.24554094, 1.22368169, 0.43572509, 0.2237…
## $ measured_log2 <dbl> 6.21751, -5.05889, -5.05889, -5.05889, 6.05534, 5.03454,…

And the the scatter plot for H3K27me3 vs. expression is:

ggplot(hmod_df,aes(H3k4me3,measured_log2))+
  geom_point(alpha = 0.5, col = '#006EA1') 

First, a model workflow for prediction of expression data using only H3K4me3 as explanatory variable can be created.

set.seed(1)
histone_split <- initial_split(hmod_df, prop = 0.75, strata = measured_log2)
histone_train <- training(histone_split)
histone_test <- testing(histone_split)

histone_recipe <- recipe(measured_log2 ~ H3k4me3, data = histone_train) %>%
  step_scale(all_predictors()) %>%
  step_YeoJohnson(all_predictors()) %>%
  step_center(all_predictors())

set.seed(1)
lm_model <- linear_reg() %>% 
  set_engine('lm') %>% # adds lm implementation of linear regression
  set_mode('regression')

lm_fit <- lm_model %>% 
  fit(measured_log2 ~ H3k4me3, data = histone_train)

par(mar=c(1,1,1,1))
par(mfrow=c(2,2)) # plot all 4 plots in one
plot(lm_fit$fit, 
     pch = 16,    # optional parameters to make points blue
     col = '#006EA1')

And now the detailed results from the trained linear regression model in a data frame is obtained

# Data frame of estimated coefficients
tidy(lm_fit)
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    -9.70    0.0675     -144.       0
## 2 H3k4me3         2.34    0.0168      139.       0
# Performance metrics on training data
glance(lm_fit)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.653         0.653  2.56    19384.       0     1 -24293. 48593. 48614.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

To assess the accuracy of the trained linear regression model, lm_fit, predictions are made on the test data, histone_test.

histone_test_results <- predict(lm_fit, new_data = histone_test) %>% 
  bind_cols(histone_test)
histone_test_results
## # A tibble: 3,434 x 4
##     .pred H3k4me3 H3k27me3 measured_log2
##     <dbl>   <dbl>    <dbl>         <dbl>
##  1 -4.15     2.38  1.22          -5.06  
##  2  2.43     5.19  0.224          6.06  
##  3  3.95     5.84  0.00881        2.29  
##  4  2.54     5.24 -0.246          6.19  
##  5 -1.22     3.63  4.19          -1.94  
##  6  0.797    4.49 -0.149          2.01  
##  7 -4.90     2.05 -0.246          0.0970
##  8  4.89     6.24  0.411          6.55  
##  9  3.37     5.59  0.412          4.22  
## 10  1.35     4.73  4.98          -5.06  
## # … with 3,424 more rows

To obtain the RMSE and \(R^2\) values on the test set results, the rmse() and rsq() functions are used.

rmse(histone_test_results, 
     truth = measured_log2,
     estimate = .pred)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        2.58
rsq(histone_test_results,
    truth = measured_log2,
    estimate = .pred)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.645

The best way to assess the test set accuracy is by making an \(R^2\) plot. This is a plot that can be used for any regression model.

ggplot(data = histone_test_results,
       mapping = aes(x = .pred, y = measured_log2)) +
  geom_point(color = '#006EA1') +
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Linear Regression Results - Histone Test Set',
       x = 'Predicted H3k4me3 modification',
       y = 'Actual H3k4me3 measurement')  

The machine learning workflow can be accomplished with a few steps using tidymodels.

histone_workflow <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(histone_recipe)

histone_fit <- histone_workflow %>% 
  last_fit(split = histone_split)

histone_fit %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       2.59  Preprocessor1_Model1
## 2 rsq     standard       0.641 Preprocessor1_Model1
test_results <- histone_fit %>% 
  collect_predictions()
test_results
## # A tibble: 3,434 x 5
##    id                .pred  .row measured_log2 .config             
##    <chr>             <dbl> <int>         <dbl> <chr>               
##  1 train/test split -4.06      3       -5.06   Preprocessor1_Model1
##  2 train/test split  2.47      5        6.06   Preprocessor1_Model1
##  3 train/test split  3.56      7        2.29   Preprocessor1_Model1
##  4 train/test split  2.55     18        6.19   Preprocessor1_Model1
##  5 train/test split -0.681    24       -1.94   Preprocessor1_Model1
##  6 train/test split  1.17     28        2.01   Preprocessor1_Model1
##  7 train/test split -5.11     30        0.0970 Preprocessor1_Model1
##  8 train/test split  4.18     32        6.55   Preprocessor1_Model1
##  9 train/test split  3.16     36        4.22   Preprocessor1_Model1
## 10 train/test split  1.63     37       -5.06   Preprocessor1_Model1
## # … with 3,424 more rows

Comparing now, using only H3K27me3 as explanatory variable, the analogous workflow is constructed.

histone_recipe2 <- recipe(measured_log2 ~ H3k27me3, data = histone_train) %>%
  step_scale(all_predictors()) %>%
  step_YeoJohnson(all_predictors()) %>%
  step_center(all_predictors())

lm_fit2 <- lm_model %>% 
  fit(measured_log2 ~ H3k27me3, data = histone_train)

par(mar=c(1,1,1,1))
par(mfrow=c(2,2)) # plot all 4 plots in one
plot(lm_fit2$fit, 
     pch = 16,    # optional parameters to make points blue
     col = '#006EA1')

# Data frame of estimated coefficients
tidy(lm_fit2)
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   -0.354    0.0509     -6.97 3.42e- 12
## 2 H3k27me3      -0.789    0.0362    -21.8  3.12e-103
# Performance metrics on training data
glance(lm_fit2)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic   p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1    0.0442        0.0441  4.25      476. 3.12e-103     1 -29512. 59029. 59051.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
predict(lm_fit2, new_data = histone_test)
## # A tibble: 3,434 x 1
##     .pred
##     <dbl>
##  1 -1.32 
##  2 -0.531
##  3 -0.361
##  4 -0.161
##  5 -3.66 
##  6 -0.237
##  7 -0.161
##  8 -0.678
##  9 -0.679
## 10 -4.28 
## # … with 3,424 more rows
histone_test_results2 <- predict(lm_fit2, new_data = histone_test) %>% 
  bind_cols(histone_test)
histone_test_results2
## # A tibble: 3,434 x 4
##     .pred H3k4me3 H3k27me3 measured_log2
##     <dbl>   <dbl>    <dbl>         <dbl>
##  1 -1.32     2.38  1.22          -5.06  
##  2 -0.531    5.19  0.224          6.06  
##  3 -0.361    5.84  0.00881        2.29  
##  4 -0.161    5.24 -0.246          6.19  
##  5 -3.66     3.63  4.19          -1.94  
##  6 -0.237    4.49 -0.149          2.01  
##  7 -0.161    2.05 -0.246          0.0970
##  8 -0.678    6.24  0.411          6.55  
##  9 -0.679    5.59  0.412          4.22  
## 10 -4.28     4.73  4.98          -5.06  
## # … with 3,424 more rows
rmse(histone_test_results2, 
     truth = measured_log2,
     estimate = .pred)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.22
rsq(histone_test_results2,
    truth = measured_log2,
    estimate = .pred)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard      0.0453
ggplot(data = histone_test_results2,
       mapping = aes(x = .pred, y = measured_log2)) +
  geom_point(color = '#006EA1') +
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Linear Regression Results - Histone Test Set',
       x = 'Predicted H3k27me3 modification',
       y = 'Actual H3k27me3 measurement')

histone_workflow2 <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(histone_recipe2)

histone_fit2 <- histone_workflow2 %>% 
  last_fit(split = histone_split)

histone_fit2 %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard      4.24   Preprocessor1_Model1
## 2 rsq     standard      0.0377 Preprocessor1_Model1
test_results2 <- histone_fit2 %>% 
  collect_predictions()
test_results2
## # A tibble: 3,434 x 5
##    id                .pred  .row measured_log2 .config             
##    <chr>             <dbl> <int>         <dbl> <chr>               
##  1 train/test split -1.59      3       -5.06   Preprocessor1_Model1
##  2 train/test split -0.705     5        6.06   Preprocessor1_Model1
##  3 train/test split -0.388     7        2.29   Preprocessor1_Model1
##  4 train/test split  0.102    18        6.19   Preprocessor1_Model1
##  5 train/test split -2.60     24       -1.94   Preprocessor1_Model1
##  6 train/test split -0.100    28        2.01   Preprocessor1_Model1
##  7 train/test split  0.102    30        0.0970 Preprocessor1_Model1
##  8 train/test split -0.929    32        6.55   Preprocessor1_Model1
##  9 train/test split -0.930    36        4.22   Preprocessor1_Model1
## 10 train/test split -2.74     37       -5.06   Preprocessor1_Model1
## # … with 3,424 more rows

Finally, using both H3K4me3 and H3K27me3 as explanatory variables, the regression omproves significantly.

histone_recipe3 <- recipe(measured_log2 ~ ., data = histone_train) %>%
  step_scale(all_predictors()) %>%
  step_YeoJohnson(all_predictors()) %>%
  step_center(all_predictors())

set.seed(1)
lm_model <- linear_reg() %>% 
  set_engine('lm') %>% # adds lm implementation of linear regression
  set_mode('regression')
lm_model
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
lm_fit3 <- lm_model %>% 
  fit(measured_log2 ~ ., data = histone_train)
lm_fit3
## parsnip model object
## 
## Fit time:  7ms 
## 
## Call:
## stats::lm(formula = measured_log2 ~ ., data = data)
## 
## Coefficients:
## (Intercept)      H3k4me3     H3k27me3  
##     -9.1415       2.3029      -0.5365
names(lm_fit3)
## [1] "lvl"     "spec"    "fit"     "preproc" "elapsed"
summary(lm_fit3$fit)
## 
## Call:
## stats::lm(formula = measured_log2 ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0620  -1.0821  -0.3085   1.2425  10.9017 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.14154    0.06912  -132.3   <2e-16 ***
## H3k4me3      2.30295    0.01635   140.8   <2e-16 ***
## H3k27me3    -0.53646    0.02121   -25.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.485 on 10294 degrees of freedom
## Multiple R-squared:  0.6734, Adjusted R-squared:  0.6734 
## F-statistic: 1.061e+04 on 2 and 10294 DF,  p-value: < 2.2e-16
par(mar=c(1,1,1,1))
par(mfrow=c(2,2)) # plot all 4 plots in one
plot(lm_fit3$fit, 
     pch = 16,    # optional parameters to make points blue
     col = '#006EA1')

# Data frame of estimated coefficients
tidy(lm_fit3)
## # A tibble: 3 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   -9.14     0.0691    -132.  0        
## 2 H3k4me3        2.30     0.0164     141.  0        
## 3 H3k27me3      -0.536    0.0212     -25.3 5.29e-137
# Performance metrics on training data
glance(lm_fit3)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.673         0.673  2.49    10613.       0     2 -23983. 47974. 48003.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
predict(lm_fit3, new_data = histone_test)
## # A tibble: 3,434 x 1
##     .pred
##     <dbl>
##  1 -4.33 
##  2  2.69 
##  3  4.30 
##  4  3.05 
##  5 -3.04 
##  6  1.28 
##  7 -4.28 
##  8  5.01 
##  9  3.51 
## 10 -0.924
## # … with 3,424 more rows
histone_test_results3 <- predict(lm_fit3, new_data = histone_test) %>% 
  bind_cols(histone_test)
histone_test_results3
## # A tibble: 3,434 x 4
##     .pred H3k4me3 H3k27me3 measured_log2
##     <dbl>   <dbl>    <dbl>         <dbl>
##  1 -4.33     2.38  1.22          -5.06  
##  2  2.69     5.19  0.224          6.06  
##  3  4.30     5.84  0.00881        2.29  
##  4  3.05     5.24 -0.246          6.19  
##  5 -3.04     3.63  4.19          -1.94  
##  6  1.28     4.49 -0.149          2.01  
##  7 -4.28     2.05 -0.246          0.0970
##  8  5.01     6.24  0.411          6.55  
##  9  3.51     5.59  0.412          4.22  
## 10 -0.924    4.73  4.98          -5.06  
## # … with 3,424 more rows
rmse(histone_test_results3, 
     truth = measured_log2,
     estimate = .pred)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        2.49
rsq(histone_test_results3,
    truth = measured_log2,
    estimate = .pred)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.669
ggplot(data = histone_test_results3,
       mapping = aes(x = .pred, y = measured_log2)) +
  geom_point(color = '#006EA1') +
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Linear Regression Results - Histone Test Set',
       x = 'Predicted histone modification',
       y = 'Actual histone measurement')

histone_workflow3 <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(histone_recipe3)

histone_fit3 <- histone_workflow3 %>% 
  last_fit(split = histone_split)

histone_fit3 %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       2.51  Preprocessor1_Model1
## 2 rsq     standard       0.663 Preprocessor1_Model1
test_results3 <- histone_fit3 %>% 
  collect_predictions()
test_results3
## # A tibble: 3,434 x 5
##    id                .pred  .row measured_log2 .config             
##    <chr>             <dbl> <int>         <dbl> <chr>               
##  1 train/test split -4.45      3       -5.06   Preprocessor1_Model1
##  2 train/test split  2.63      5        6.06   Preprocessor1_Model1
##  3 train/test split  3.92      7        2.29   Preprocessor1_Model1
##  4 train/test split  3.28     18        6.19   Preprocessor1_Model1
##  5 train/test split -1.82     24       -1.94   Preprocessor1_Model1
##  6 train/test split  1.76     28        2.01   Preprocessor1_Model1
##  7 train/test split -4.30     30        0.0970 Preprocessor1_Model1
##  8 train/test split  4.16     32        6.55   Preprocessor1_Model1
##  9 train/test split  3.14     36        4.22   Preprocessor1_Model1
## 10 train/test split  0.359    37       -5.06   Preprocessor1_Model1
## # … with 3,424 more rows

Session info

## R version 4.0.5 (2021-03-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=es_MX.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_MX.UTF-8        LC_COLLATE=es_MX.UTF-8    
##  [5] LC_MONETARY=es_MX.UTF-8    LC_MESSAGES=es_MX.UTF-8   
##  [7] LC_PAPER=es_MX.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_MX.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] vctrs_0.3.8            rlang_0.4.11           yardstick_0.0.8       
##  [4] workflowsets_0.0.2     workflows_0.2.2        tune_0.1.5            
##  [7] rsample_0.1.0          recipes_0.1.16         parsnip_0.1.6         
## [10] modeldata_0.1.0        infer_0.5.4            dials_0.0.9           
## [13] scales_1.1.1           broom_0.7.6            tidymodels_0.1.3      
## [16] Distributacalcul_0.3.0 compGenomRData_0.1.0   forcats_0.5.1         
## [19] stringr_1.4.0.9000     dplyr_1.0.6            purrr_0.3.4           
## [22] readr_1.4.0            tidyr_1.1.3            tibble_3.1.2          
## [25] ggplot2_3.3.3          tidyverse_1.3.1        BiocStyle_2.18.1      
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-1    ellipsis_0.3.2      class_7.3-18       
##  [4] fs_1.5.0            rstudioapi_0.13     listenv_0.8.0      
##  [7] furrr_0.2.2         farver_2.1.0        prodlim_2019.11.13 
## [10] fansi_0.5.0         lubridate_1.7.10    xml2_1.3.2         
## [13] codetools_0.2-18    splines_4.0.5       knitr_1.33         
## [16] jsonlite_1.7.2      pROC_1.17.0.1       dbplyr_2.1.1       
## [19] BiocManager_1.30.15 compiler_4.0.5      httr_1.4.2         
## [22] backports_1.2.1     assertthat_0.2.1    Matrix_1.3-2       
## [25] cli_2.5.0           prettyunits_1.1.1   htmltools_0.5.1.1  
## [28] tools_4.0.5         gtable_0.3.0        glue_1.4.2         
## [31] Rcpp_1.0.6          cellranger_1.1.0    jquerylib_0.1.4    
## [34] DiceDesign_1.9      iterators_1.0.13    timeDate_3043.102  
## [37] gower_0.2.2         xfun_0.23           globals_0.14.0     
## [40] rvest_1.0.0         lifecycle_1.0.0     future_1.21.0      
## [43] MASS_7.3-53.1       ipred_0.9-11        hms_1.1.0          
## [46] parallel_4.0.5      yaml_2.2.1          sass_0.4.0         
## [49] rpart_4.1-15        stringi_1.6.2       highr_0.9          
## [52] foreach_1.5.1       lhs_1.1.1           hardhat_0.1.5      
## [55] lava_1.6.9          pkgconfig_2.0.3     evaluate_0.14      
## [58] lattice_0.20-41     labeling_0.4.2      tidyselect_1.1.1   
## [61] parallelly_1.26.0   plyr_1.8.6          magrittr_2.0.1     
## [64] bookdown_0.22       R6_2.5.0            generics_0.1.0     
## [67] DBI_1.1.1           pillar_1.6.1        haven_2.4.1        
## [70] withr_2.4.2         survival_3.2-11     nnet_7.3-16        
## [73] modelr_0.1.8        crayon_1.4.1        utf8_1.2.1         
## [76] rmarkdown_2.8       grid_4.0.5          readxl_1.3.1       
## [79] reprex_2.0.0        digest_0.6.27       munsell_0.5.0      
## [82] GPfit_1.0-8         bslib_0.2.5.1