Computational genomics using tidyverse and tidymodels
A tidy data frame can be constructed with the tibble() function. As an example, we construct a data frame from genomic intervals or coordinates.
library(tidyverse)
chr <- c("chr1", "chr1", "chr2", "chr2")
strand <- c("-", "-", "+", "+")
start <- c(200, 4000, 100, 400)
end <- c(250, 410, 200, 450)
mydata <- tibble(chr, start, end, strand)
# column names are taken from the variables, but can also be reassigned explicitly
names(mydata) <- c("chr", "start", "end", "strand")
mydata
## # A tibble: 4 x 4
## chr start end strand
## <chr> <dbl> <dbl> <chr>
## 1 chr1 200 250 -
## 2 chr1 4000 410 -
## 3 chr2 100 200 +
## 4 chr2 400 450 +
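The same tibble can also be written row by row with tribble(), which can be easier to read for small interval sets (a minimal sketch; mydata2 is just an illustrative name):
mydata2 <- tribble(
  ~chr,   ~start, ~end, ~strand,
  "chr1",    200,  250, "-",
  "chr1",   4000,  410, "-",
  "chr2",    100,  200, "+",
  "chr2",    400,  450, "+"
)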
To read whitespace-separated columns into a tibble with the readr package, two dedicated functions are useful:
read_table2() is like read.table(): it allows any number of whitespace characters between columns, and the lines can be of different lengths.
read_table() is stricter: each line must be the same length, and each field must be in the same position in every line. It first finds empty columns and then parses the file like a fixed-width file.
Most genomics data sets come in the form of genomic intervals associated with a score, so the data will mostly be in table format with columns denoting chromosome, start position, end position, strand and score. One of the popular formats is the BED format, used primarily by the UCSC genome browser, although most other genome browsers and tools also support it.
library(compGenomRData)
enhancerFilePath=system.file("extdata",
"subset.enhancers.hg18.bed",
package="compGenomRData")
enh_df <- read_table2(enhancerFilePath,col_names = FALSE)
glimpse(enh_df)
## Rows: 50,416
## Columns: 9
## $ X1 <chr> "chr20", "chr20", "chr20", "chr20", "chr20", "chr20", "chr20", "chr…
## $ X2 <dbl> 266275, 287400, 300500, 330400, 341425, 437975, 516650, 519100, 543…
## $ X3 <dbl> 267925, 294500, 302500, 331800, 343400, 439900, 518525, 521475, 545…
## $ X4 <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".…
## $ X5 <dbl> 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1…
## $ X6 <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".…
## $ X7 <dbl> 9.11, 10.53, 9.10, 6.39, 6.20, 6.31, 12.48, 7.10, 9.52, 7.71, 8.00,…
## $ X8 <dbl> 13.1693, 13.0231, 13.3935, 13.5105, 12.9852, 13.5184, 13.6827, 13.0…
## $ X9 <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
cpgiFilePath=system.file("extdata",
"subset.cpgi.hg18.bed",
package="compGenomRData")
cpgi_df <- read_table2(cpgiFilePath, col_names = FALSE)
glimpse(cpgi_df)
## Rows: 1,895
## Columns: 4
## $ X1 <chr> "chr20", "chr20", "chr20", "chr20", "chr20", "chr20", "chr20", "chr…
## $ X2 <dbl> 195575, 207789, 219055, 225831, 252826, 275376, 309489, 324916, 336…
## $ X3 <dbl> 195851, 208148, 219437, 227155, 256323, 276977, 310805, 325165, 337…
## $ X4 <chr> "CpG:_28", "CpG:_32", "CpG:_33", "CpG:_135", "CpG:_286", "CpG:_116"…
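Because read_table2() assigns placeholder column names X1, X2, …, it can help to map them onto the BED convention. A sketch that keeps the original cpgi_df intact (cpgi_named is an illustrative name; chrom, start, end and name follow the BED column order):
# Rename the positional BED columns descriptively
cpgi_named <- cpgi_df %>%
  rename(chrom = X1, start = X2, end = X3, name = X4)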
A data frame or matrix can be written out using the write_csv() function, one of readr's write_*() family of functions.
write_csv(enh_df, "enh_df.csv")
write_csv(cpgi_df, "cpgi_df.csv")
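Since BED files are tab-separated and carry no header line, write_tsv() from the same family is the natural choice for writing the intervals back out in BED form (a sketch):
# BED files are tab-separated with no header line
write_tsv(cpgi_df, "cpgi_df.bed", col_names = FALSE)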
Captions can be assigned to figures with the fig.cap code chunk option, which numbers them automatically and makes them referenceable, as in Figure 1. The figure label is generated from the code chunk label by prefixing it with fig:.
Figure 1: Histogram of lengths of CpG islands
Small and wide figures can be requested with the fig.small and fig.wide code chunk options; a sketch of such a chunk source follows the figure captions below.
Figure 2: Small figure
Barplot of a categorical variable: chromosome, for cpgi_df.
Figure 3: Wide figure
Color according to chromosome type for cpgi_df.
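For reference, the R Markdown source of a chunk using these options might look like the following (a sketch; the actual chunk source is not visible in this rendered extract, and the label cpgi-hist is illustrative):
```{r cpgi-hist, fig.cap="Histogram of lengths of CpG islands", fig.wide=TRUE}
ggplot(cpgi_df, aes(X3 - X2 + 1)) +
  geom_histogram(bins = 50) +
  labs(x = "CpG island length (bp)")
```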
To classify each CpG island as large (>1500 bp), normal (>700 and <=1500 bp) or short (<=700 bp) without loop structures, vectorized conditionals are very useful.
cpglen <- tibble(len = cpgi_df$X3 - cpgi_df$X2 + 1)
cpglen <- cpglen %>%
  mutate(size = if_else(len > 1500, "large",
                        if_else(len > 700, "normal", "short")))
cpglen[100,]
## # A tibble: 1 x 2
##     len size
##   <dbl> <chr>
## 1  1048 normal
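The same classification reads more clearly with dplyr's case_when(), which evaluates its conditions in order (an equivalent sketch):
cpglen <- cpglen %>%
  mutate(size = case_when(
    len > 1500 ~ "large",
    len > 700  ~ "normal",
    TRUE       ~ "short"
  ))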
Due to technical bias in experiments and natural variation among samples, a statistical treatment of the data is needed when measuring gene expression values for a certain gene.
To do this in the tidyverse, the recommended package is infer, because it implements an expressive grammar for statistical inference that coheres with the tidy design framework.
The workflow of this package is designed around the following functions:
specify() allows you to specify the variable, or relationship between variables, that you’re interested in.
hypothesize() allows you to declare the null hypothesis.
generate() allows you to generate data reflecting the null hypothesis.
calculate() allows you to calculate a distribution of statistics from the generated data to form the null distribution.
The Poisson distribution is commonly used to model count data such as sequencing read counts. Let’s simulate \(30\) random variables with a mean parameter of \(50\).
library(Distributacalcul)
library(tidymodels)
set.seed(21)
sample1 <- rpois(30, 50)
sample2 <- rpois(30, 50)
sample <- tibble(sample1, sample2)
The specify() function can be used to specify that the vector sample1 in this tibble will be our response variable.
sample %>% specify(response = sample1) %>% class()
## [1] "infer" "tbl_df" "tbl" "data.frame"
The next step in the infer pipeline is often to declare a null hypothesis using hypothesize(). In this case, the null hypothesis is that the mean of the population our sample was drawn from is \(47\).
sample %>% specify(response = sample1) %>% hypothesize(null = "point", mu = 47)
## Response: sample1 (integer)
## Null Hypothesis: point
## # A tibble: 30 x 1
## sample1
## <int>
## 1 55
## 2 53
## 3 62
## 4 41
## 5 57
## 6 60
## 7 52
## 8 37
## 9 49
## 10 33
## # … with 20 more rows
Now we do this \(1000\) times using bootstrapping, repeatedly taking samples from the original sample with replacement, to form a null distribution.
sample %>% specify(response = sample1) %>% hypothesize(null = "point", mu = 47) %>% generate(reps = 1000, type = "bootstrap")
## Response: sample1 (integer)
## Null Hypothesis: point
## # A tibble: 30,000 x 2
## # Groups: replicate [1,000]
## replicate sample1
## <int> <dbl>
## 1 1 58.2
## 2 1 58.2
## 3 1 53.2
## 4 1 43.2
## 5 1 39.2
## 6 1 48.2
## 7 1 35.2
## 8 1 56.2
## 9 1 45.2
## 10 1 54.2
## # … with 29,990 more rows
With the output of generate(), we now supply calculate() with the stat argument set to “mean”.
tibble(sample1) %>% specify(response = sample1) %>% hypothesize(null = "point", mu = 47) %>% generate(reps = 1000, type = "bootstrap") %>% calculate(stat = "mean")
## # A tibble: 1,000 x 2
## replicate stat
## <int> <dbl>
## 1 1 43.5
## 2 2 45.2
## 3 3 47.8
## 4 4 48.0
## 5 5 48.4
## 6 6 45.5
## 7 7 46.2
## 8 8 47.6
## 9 9 48
## 10 10 48.2
## # … with 990 more rows
The output shows the sample mean for each of our \(1000\) replicates. We can now visualize this null distribution by using visualize().
null_dist <- sample %>% specify(response = sample1) %>% hypothesize(null = "point", mu = 47) %>% generate(reps = 1000, type = "bootstrap") %>% calculate(stat = "mean")
point_estimate <- tibble(sample1) %>% specify(response = sample1) %>% calculate(stat = "mean")
null_dist %>% visualize()
Figure 4: Key functionality of calculate
To observe where the observed statistic lies on this distribution, we can pass it to the obs_stat argument of the shade_p_value function.
null_dist %>% visualize() + shade_p_value(obs_stat = point_estimate, direction = "two-sided")
Figure 5: Key functionality of shade_p_value
And to quantify how unlikely the estimate is under the null hypothesis, we have the get_p_value function:
p_value <- null_dist %>%
get_p_value(obs_stat = point_estimate, direction = "two-sided")
p_value
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.004
To get a confidence interval around our estimate, we can write:
null_dist %>% get_confidence_interval(point_estimate = point_estimate, level = .95, type = "se")
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 47.8 53.8
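A percentile-based interval can instead be computed from a bootstrap distribution of the statistic itself, generated without hypothesize() (a sketch; boot_dist is an illustrative name):
# Bootstrap distribution of the mean, without a null hypothesis
boot_dist <- sample %>%
  specify(response = sample1) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")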
To compare sets of samples and evaluate whether the difference in means of two populations differs from \(0\), a test can be carried out using samples of data drawn from them. Such comparisons include whether wild-type samples have different expression compared to mutants, or whether healthy samples differ from disease samples in some measurable feature (blood count, gene expression, methylation of certain loci).
\(2\)-sample t-tests evaluate the difference in mean values of two populations using data randomly sampled from each, assuming the data approximately follow a normal distribution. As an example, let’s test the difference of means of the following simulated genes:
set.seed(100)
gene1 <- rnorm(30, mean = 4, sd = 2)
gene2 <- rnorm(30, mean = 2, sd = 2)
org.diff <- mean(gene1) - mean(gene2)
gene_df <- tibble(expr = c(gene1, gene2),
                  group = c(rep("test", 30), rep("control", 30)))
Their distribution can be visualized as follows:
ggplot(gene_df, aes(group, expr)) +
geom_boxplot()
Figure 6: Boxplot of simulated genes
To calculate the observed t statistic, we can use specify() and calculate().
observed_statistic2 <- gene_df %>%
specify(expr ~ group) %>%
calculate(stat = "t", order = c("test", "control"))
observed_statistic2
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 3.77
By using permutation, a null distribution can be generated.
null_distribution_2_sample_permute <- gene_df %>%
specify(expr ~ group) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "t", order = c("test", "control"))
A null distribution based on the theoretical t distribution can also be generated:
null_distribution_2_sample_theoretical <- gene_df %>%
specify(expr ~ group) %>%
hypothesize(null = "independence") %>%
# generate() isn't used for the theoretical version!
calculate(stat = "t", order = c("test", "control"))
A visualization of these distributions is done with visualize().
null_distribution_2_sample_permute %>%
visualize() +
shade_p_value(observed_statistic2,
direction = "two-sided")
Figure 7: Visualization of the permuted null distribution
null_distribution_2_sample_theoretical %>%
visualize(method = "theoretical") +
shade_p_value(observed_statistic2,
direction = "two-sided")
Figure 8: Visualization of the theoretical null distribution
And the calculation of the p-value is:
p_value_2_sample <- null_distribution_2_sample_permute %>%
get_p_value(obs_stat = observed_statistic2,
direction = "two-sided")
p_value_2_sample
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.002
Another syntax to carry out a \(2\)-sample t-test on tidy data is infer's t_test() wrapper:
t_test(x = gene_df,
formula = expr ~ group,
order = c("test", "control"),
alternative = "two-sided")
## # A tibble: 1 x 6
## statistic t_df p_value alternative lower_ci upper_ci
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 3.77 47.6 0.000458 two.sided 0.872 2.87
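For comparison, the same Welch two-sample t-test can be run with base R's t.test() on the original vectors (a sketch; the t_df of 47.6 above reflects the Welch degrees-of-freedom correction, which t.test() also applies by default):
t.test(gene1, gene2, alternative = "two.sided")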
To illustrate the basics of tidy modeling, let’s read and inspect the following histone modification data set:
hmodFile=system.file("extdata",
"HistoneModeVSgeneExp.rds",
package="compGenomRData")
hmod_df <- read_rds(hmodFile)
hmod_df <- as_tibble(hmod_df)
glimpse(hmod_df)
## Rows: 13,731
## Columns: 3
## $ H3k4me3 <dbl> 4.670, 2.046, 2.375, 1.954, 5.190, 5.456, 5.837, 5.432, …
## $ H3k27me3 <dbl> -0.24554094, -0.24554094, 1.22368169, 0.43572509, 0.2237…
## $ measured_log2 <dbl> 6.21751, -5.05889, -5.05889, -5.05889, 6.05534, 5.03454,…
There are \(3\) columns in the data set: the measured levels of H3K4me3 and H3K27me3, and the gene expression, per gene. The scatter plot for H3K4me3 vs. expression is:
ggplot(hmod_df, aes(H3k4me3, measured_log2)) +
  geom_point(alpha = 0.5, col = '#006EA1')
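The scatter plot for H3K27me3 vs. expression can be drawn analogously (a sketch; the code for this plot is not shown in the original extract):
ggplot(hmod_df, aes(H3k27me3, measured_log2)) +
  geom_point(alpha = 0.5, col = '#006EA1')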
First, a model workflow for predicting the expression data using only H3K4me3 as the explanatory variable can be created.
set.seed(1)
histone_split <- initial_split(hmod_df, prop = 0.75, strata = measured_log2)
histone_train <- training(histone_split)
histone_test <- testing(histone_split)
histone_recipe <- recipe(measured_log2 ~ H3k4me3, data = histone_train) %>%
step_scale(all_predictors()) %>%
step_YeoJohnson(all_predictors()) %>%
step_center(all_predictors())
set.seed(1)
lm_model <- linear_reg() %>%
set_engine('lm') %>% # adds lm implementation of linear regression
set_mode('regression')
lm_fit <- lm_model %>%
fit(measured_log2 ~ H3k4me3, data = histone_train)
par(mar=c(1,1,1,1))
par(mfrow=c(2,2)) # plot all 4 plots in one
plot(lm_fit$fit,
pch = 16, # optional parameters to make points blue
col = '#006EA1')
And now the detailed results from the trained linear regression model are obtained as data frames:
# Data frame of estimated coefficients
tidy(lm_fit)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -9.70 0.0675 -144. 0
## 2 H3k4me3 2.34 0.0168 139. 0
# Performance metrics on training data
glance(lm_fit)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.653 0.653 2.56 19384. 0 1 -24293. 48593. 48614.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
To assess the accuracy of the trained linear regression model, lm_fit, predictions are made on the test data, histone_test.
histone_test_results <- predict(lm_fit, new_data = histone_test) %>%
bind_cols(histone_test)
histone_test_results
## # A tibble: 3,434 x 4
## .pred H3k4me3 H3k27me3 measured_log2
## <dbl> <dbl> <dbl> <dbl>
## 1 -4.15 2.38 1.22 -5.06
## 2 2.43 5.19 0.224 6.06
## 3 3.95 5.84 0.00881 2.29
## 4 2.54 5.24 -0.246 6.19
## 5 -1.22 3.63 4.19 -1.94
## 6 0.797 4.49 -0.149 2.01
## 7 -4.90 2.05 -0.246 0.0970
## 8 4.89 6.24 0.411 6.55
## 9 3.37 5.59 0.412 4.22
## 10 1.35 4.73 4.98 -5.06
## # … with 3,424 more rows
To obtain the RMSE and \(R^2\) values on the test set results, the rmse() and rsq() functions are used.
rmse(histone_test_results,
truth = measured_log2,
estimate = .pred)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.58
rsq(histone_test_results,
truth = measured_log2,
estimate = .pred)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.645
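Both metrics can also be computed in a single call with yardstick's metric_set() (a sketch; histone_metrics is an illustrative name):
# Bundle RMSE and R-squared into one metric function
histone_metrics <- metric_set(rmse, rsq)
histone_metrics(histone_test_results,
                truth = measured_log2,
                estimate = .pred)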
A good way to assess the test set accuracy is by making an \(R^2\) plot of predicted against actual values. This is a plot that can be used for any regression model.
ggplot(data = histone_test_results,
mapping = aes(x = .pred, y = measured_log2)) +
geom_point(color = '#006EA1') +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression Results - Histone Test Set',
x = 'Predicted expression (log2)',
y = 'Actual expression (log2)')
The machine learning workflow can be accomplished with a few steps using tidymodels.
histone_workflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(histone_recipe)
histone_fit <- histone_workflow %>%
last_fit(split = histone_split)
histone_fit %>% collect_metrics()
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 2.59 Preprocessor1_Model1
## 2 rsq standard 0.641 Preprocessor1_Model1
test_results <- histone_fit %>%
collect_predictions()
test_results
## # A tibble: 3,434 x 5
## id .pred .row measured_log2 .config
## <chr> <dbl> <int> <dbl> <chr>
## 1 train/test split -4.06 3 -5.06 Preprocessor1_Model1
## 2 train/test split 2.47 5 6.06 Preprocessor1_Model1
## 3 train/test split 3.56 7 2.29 Preprocessor1_Model1
## 4 train/test split 2.55 18 6.19 Preprocessor1_Model1
## 5 train/test split -0.681 24 -1.94 Preprocessor1_Model1
## 6 train/test split 1.17 28 2.01 Preprocessor1_Model1
## 7 train/test split -5.11 30 0.0970 Preprocessor1_Model1
## 8 train/test split 4.18 32 6.55 Preprocessor1_Model1
## 9 train/test split 3.16 36 4.22 Preprocessor1_Model1
## 10 train/test split 1.63 37 -5.06 Preprocessor1_Model1
## # … with 3,424 more rows
For comparison, the analogous workflow using only H3K27me3 as the explanatory variable is constructed.
histone_recipe2 <- recipe(measured_log2 ~ H3k27me3, data = histone_train) %>%
step_scale(all_predictors()) %>%
step_YeoJohnson(all_predictors()) %>%
step_center(all_predictors())
lm_fit2 <- lm_model %>%
fit(measured_log2 ~ H3k27me3, data = histone_train)
par(mar=c(1,1,1,1))
par(mfrow=c(2,2)) # plot all 4 plots in one
plot(lm_fit2$fit,
pch = 16, # optional parameters to make points blue
col = '#006EA1')
# Data frame of estimated coefficients
tidy(lm_fit2)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.354 0.0509 -6.97 3.42e- 12
## 2 H3k27me3 -0.789 0.0362 -21.8 3.12e-103
# Performance metrics on training data
glance(lm_fit2)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0442 0.0441 4.25 476. 3.12e-103 1 -29512. 59029. 59051.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
predict(lm_fit2, new_data = histone_test)
## # A tibble: 3,434 x 1
## .pred
## <dbl>
## 1 -1.32
## 2 -0.531
## 3 -0.361
## 4 -0.161
## 5 -3.66
## 6 -0.237
## 7 -0.161
## 8 -0.678
## 9 -0.679
## 10 -4.28
## # … with 3,424 more rows
histone_test_results2 <- predict(lm_fit2, new_data = histone_test) %>%
bind_cols(histone_test)
histone_test_results2
## # A tibble: 3,434 x 4
## .pred H3k4me3 H3k27me3 measured_log2
## <dbl> <dbl> <dbl> <dbl>
## 1 -1.32 2.38 1.22 -5.06
## 2 -0.531 5.19 0.224 6.06
## 3 -0.361 5.84 0.00881 2.29
## 4 -0.161 5.24 -0.246 6.19
## 5 -3.66 3.63 4.19 -1.94
## 6 -0.237 4.49 -0.149 2.01
## 7 -0.161 2.05 -0.246 0.0970
## 8 -0.678 6.24 0.411 6.55
## 9 -0.679 5.59 0.412 4.22
## 10 -4.28 4.73 4.98 -5.06
## # … with 3,424 more rows
rmse(histone_test_results2,
truth = measured_log2,
estimate = .pred)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4.22
rsq(histone_test_results2,
truth = measured_log2,
estimate = .pred)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.0453
ggplot(data = histone_test_results2,
mapping = aes(x = .pred, y = measured_log2)) +
geom_point(color = '#006EA1') +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression Results - Histone Test Set',
x = 'Predicted expression (log2)',
y = 'Actual expression (log2)')
histone_workflow2 <- workflow() %>%
add_model(lm_model) %>%
add_recipe(histone_recipe2)
histone_fit2 <- histone_workflow2 %>%
last_fit(split = histone_split)
histone_fit2 %>% collect_metrics()
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 4.24 Preprocessor1_Model1
## 2 rsq standard 0.0377 Preprocessor1_Model1
test_results2 <- histone_fit2 %>%
collect_predictions()
test_results2
## # A tibble: 3,434 x 5
## id .pred .row measured_log2 .config
## <chr> <dbl> <int> <dbl> <chr>
## 1 train/test split -1.59 3 -5.06 Preprocessor1_Model1
## 2 train/test split -0.705 5 6.06 Preprocessor1_Model1
## 3 train/test split -0.388 7 2.29 Preprocessor1_Model1
## 4 train/test split 0.102 18 6.19 Preprocessor1_Model1
## 5 train/test split -2.60 24 -1.94 Preprocessor1_Model1
## 6 train/test split -0.100 28 2.01 Preprocessor1_Model1
## 7 train/test split 0.102 30 0.0970 Preprocessor1_Model1
## 8 train/test split -0.929 32 6.55 Preprocessor1_Model1
## 9 train/test split -0.930 36 4.22 Preprocessor1_Model1
## 10 train/test split -2.74 37 -5.06 Preprocessor1_Model1
## # … with 3,424 more rows
Finally, using both H3K4me3 and H3K27me3 as explanatory variables, the regression improves considerably.
histone_recipe3 <- recipe(measured_log2 ~ ., data = histone_train) %>%
step_scale(all_predictors()) %>%
step_YeoJohnson(all_predictors()) %>%
step_center(all_predictors())
set.seed(1)
lm_model <- linear_reg() %>%
set_engine('lm') %>% # adds lm implementation of linear regression
set_mode('regression')
lm_model
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
lm_fit3 <- lm_model %>%
fit(measured_log2 ~ ., data = histone_train)
lm_fit3
## parsnip model object
##
## Fit time: 7ms
##
## Call:
## stats::lm(formula = measured_log2 ~ ., data = data)
##
## Coefficients:
## (Intercept) H3k4me3 H3k27me3
## -9.1415 2.3029 -0.5365
names(lm_fit3)
## [1] "lvl" "spec" "fit" "preproc" "elapsed"
summary(lm_fit3$fit)
##
## Call:
## stats::lm(formula = measured_log2 ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0620 -1.0821 -0.3085 1.2425 10.9017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.14154 0.06912 -132.3 <2e-16 ***
## H3k4me3 2.30295 0.01635 140.8 <2e-16 ***
## H3k27me3 -0.53646 0.02121 -25.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.485 on 10294 degrees of freedom
## Multiple R-squared: 0.6734, Adjusted R-squared: 0.6734
## F-statistic: 1.061e+04 on 2 and 10294 DF, p-value: < 2.2e-16
par(mar=c(1,1,1,1))
par(mfrow=c(2,2)) # plot all 4 plots in one
plot(lm_fit3$fit,
pch = 16, # optional parameters to make points blue
col = '#006EA1')
# Data frame of estimated coefficients
tidy(lm_fit3)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -9.14 0.0691 -132. 0
## 2 H3k4me3 2.30 0.0164 141. 0
## 3 H3k27me3 -0.536 0.0212 -25.3 5.29e-137
# Performance metrics on training data
glance(lm_fit3)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.673 0.673 2.49 10613. 0 2 -23983. 47974. 48003.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
predict(lm_fit3, new_data = histone_test)
## # A tibble: 3,434 x 1
## .pred
## <dbl>
## 1 -4.33
## 2 2.69
## 3 4.30
## 4 3.05
## 5 -3.04
## 6 1.28
## 7 -4.28
## 8 5.01
## 9 3.51
## 10 -0.924
## # … with 3,424 more rows
histone_test_results3 <- predict(lm_fit3, new_data = histone_test) %>%
bind_cols(histone_test)
histone_test_results3
## # A tibble: 3,434 x 4
## .pred H3k4me3 H3k27me3 measured_log2
## <dbl> <dbl> <dbl> <dbl>
## 1 -4.33 2.38 1.22 -5.06
## 2 2.69 5.19 0.224 6.06
## 3 4.30 5.84 0.00881 2.29
## 4 3.05 5.24 -0.246 6.19
## 5 -3.04 3.63 4.19 -1.94
## 6 1.28 4.49 -0.149 2.01
## 7 -4.28 2.05 -0.246 0.0970
## 8 5.01 6.24 0.411 6.55
## 9 3.51 5.59 0.412 4.22
## 10 -0.924 4.73 4.98 -5.06
## # … with 3,424 more rows
rmse(histone_test_results3,
truth = measured_log2,
estimate = .pred)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.49
rsq(histone_test_results3,
truth = measured_log2,
estimate = .pred)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.669
ggplot(data = histone_test_results3,
mapping = aes(x = .pred, y = measured_log2)) +
geom_point(color = '#006EA1') +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression Results - Histone Test Set',
x = 'Predicted expression (log2)',
y = 'Actual expression (log2)')
histone_workflow3 <- workflow() %>%
add_model(lm_model) %>%
add_recipe(histone_recipe3)
histone_fit3 <- histone_workflow3 %>%
last_fit(split = histone_split)
histone_fit3 %>% collect_metrics()
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 2.51 Preprocessor1_Model1
## 2 rsq standard 0.663 Preprocessor1_Model1
test_results3 <- histone_fit3 %>%
collect_predictions()
test_results3
## # A tibble: 3,434 x 5
## id .pred .row measured_log2 .config
## <chr> <dbl> <int> <dbl> <chr>
## 1 train/test split -4.45 3 -5.06 Preprocessor1_Model1
## 2 train/test split 2.63 5 6.06 Preprocessor1_Model1
## 3 train/test split 3.92 7 2.29 Preprocessor1_Model1
## 4 train/test split 3.28 18 6.19 Preprocessor1_Model1
## 5 train/test split -1.82 24 -1.94 Preprocessor1_Model1
## 6 train/test split 1.76 28 2.01 Preprocessor1_Model1
## 7 train/test split -4.30 30 0.0970 Preprocessor1_Model1
## 8 train/test split 4.16 32 6.55 Preprocessor1_Model1
## 9 train/test split 3.14 36 4.22 Preprocessor1_Model1
## 10 train/test split 0.359 37 -5.06 Preprocessor1_Model1
## # … with 3,424 more rows
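To summarize the comparison, the test-set metrics of the three workflows can be gathered into one tibble (a sketch using the fitted objects from above):
bind_rows(
  histone_fit  %>% collect_metrics() %>% mutate(model = "H3k4me3"),
  histone_fit2 %>% collect_metrics() %>% mutate(model = "H3k27me3"),
  histone_fit3 %>% collect_metrics() %>% mutate(model = "H3k4me3 + H3k27me3")
)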
Finally, the session information for the R session used to render this document:
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
##
## locale:
## [1] LC_CTYPE=es_MX.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_MX.UTF-8 LC_COLLATE=es_MX.UTF-8
## [5] LC_MONETARY=es_MX.UTF-8 LC_MESSAGES=es_MX.UTF-8
## [7] LC_PAPER=es_MX.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_MX.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] vctrs_0.3.8 rlang_0.4.11 yardstick_0.0.8
## [4] workflowsets_0.0.2 workflows_0.2.2 tune_0.1.5
## [7] rsample_0.1.0 recipes_0.1.16 parsnip_0.1.6
## [10] modeldata_0.1.0 infer_0.5.4 dials_0.0.9
## [13] scales_1.1.1 broom_0.7.6 tidymodels_0.1.3
## [16] Distributacalcul_0.3.0 compGenomRData_0.1.0 forcats_0.5.1
## [19] stringr_1.4.0.9000 dplyr_1.0.6 purrr_0.3.4
## [22] readr_1.4.0 tidyr_1.1.3 tibble_3.1.2
## [25] ggplot2_3.3.3 tidyverse_1.3.1 BiocStyle_2.18.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-1 ellipsis_0.3.2 class_7.3-18
## [4] fs_1.5.0 rstudioapi_0.13 listenv_0.8.0
## [7] furrr_0.2.2 farver_2.1.0 prodlim_2019.11.13
## [10] fansi_0.5.0 lubridate_1.7.10 xml2_1.3.2
## [13] codetools_0.2-18 splines_4.0.5 knitr_1.33
## [16] jsonlite_1.7.2 pROC_1.17.0.1 dbplyr_2.1.1
## [19] BiocManager_1.30.15 compiler_4.0.5 httr_1.4.2
## [22] backports_1.2.1 assertthat_0.2.1 Matrix_1.3-2
## [25] cli_2.5.0 prettyunits_1.1.1 htmltools_0.5.1.1
## [28] tools_4.0.5 gtable_0.3.0 glue_1.4.2
## [31] Rcpp_1.0.6 cellranger_1.1.0 jquerylib_0.1.4
## [34] DiceDesign_1.9 iterators_1.0.13 timeDate_3043.102
## [37] gower_0.2.2 xfun_0.23 globals_0.14.0
## [40] rvest_1.0.0 lifecycle_1.0.0 future_1.21.0
## [43] MASS_7.3-53.1 ipred_0.9-11 hms_1.1.0
## [46] parallel_4.0.5 yaml_2.2.1 sass_0.4.0
## [49] rpart_4.1-15 stringi_1.6.2 highr_0.9
## [52] foreach_1.5.1 lhs_1.1.1 hardhat_0.1.5
## [55] lava_1.6.9 pkgconfig_2.0.3 evaluate_0.14
## [58] lattice_0.20-41 labeling_0.4.2 tidyselect_1.1.1
## [61] parallelly_1.26.0 plyr_1.8.6 magrittr_2.0.1
## [64] bookdown_0.22 R6_2.5.0 generics_0.1.0
## [67] DBI_1.1.1 pillar_1.6.1 haven_2.4.1
## [70] withr_2.4.2 survival_3.2-11 nnet_7.3-16
## [73] modelr_0.1.8 crayon_1.4.1 utf8_1.2.1
## [76] rmarkdown_2.8 grid_4.0.5 readxl_1.3.1
## [79] reprex_2.0.0 digest_0.6.27 munsell_0.5.0
## [82] GPfit_1.0-8 bslib_0.2.5.1