Computational Biology & Data Science

Simple Linear Regression

library(tidyverse)

indata <- read.csv("income.data.csv", header = T)
head(indata)

##   X   income happiness
## 1 1 3.862647  2.314489
## 2 2 4.979381  3.433490
## 3 3 4.923957  4.599373
## 4 4 3.214372  2.791114
## 5 5 7.196409  5.596398
## 6 6 3.729643  2.458556

plot(indata$income, indata$happiness)

Alternative visualization

library(tidyverse)
ggplot(indata, aes(income, happiness)) +
  geom_point()

Linear regression

Here, we will explore if two variables can be fitted to regression line.

\[ y = αx + β\] The slope of the line (the regression coefficient) is β, the increase per unit change in x. The line intersects the y-axis at the intercept α.

Here is the linear model

income.happiness.lm <- lm(happiness ~ income, data = indata)

summary(income.happiness.lm)

## 
## Call:
## lm(formula = happiness ~ income, data = indata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02479 -0.48526  0.04078  0.45898  2.37805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.20427    0.08884   2.299   0.0219 *  
## income       0.71383    0.01854  38.505   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7181 on 496 degrees of freedom
## Multiple R-squared:  0.7493, Adjusted R-squared:  0.7488 
## F-statistic:  1483 on 1 and 496 DF,  p-value: < 2.2e-16

Let’s check how this regression line looks like on our plot

ggplot(indata, aes(income, happiness)) +
  geom_point()+
  stat_smooth(method=lm)

## `geom_smooth()` using formula = 'y ~ x'

For further investigation, please check out http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/.

Correlation

#Returns Pearson correlation coefficient as default
cor(indata$income, indata$happiness)

## [1] 0.8656337

Calculate Correlation Matrix for Entire Data Frame

Gene expression table of 6 tissues represented by 1000 genes each. Please, download the data from the Google drive link. “Gene_expression_table.txt”

exp <- read.delim("Gene_expression_table.txt", header = T, row.names = 1)
head(exp)

##        Sample_A_rep1 Sample_A_rep2 Sample_A_rep3 Sample_B_rep1 Sample_B_rep2
## Gene_1    22.9429000     23.674800     25.902100    22.5873000    23.4013000
## Gene_2     0.0966727      0.000000      0.282938     0.1273130     0.4440570
## Gene_3    52.2926000     48.391900     53.052200    60.5590000    72.4147000
## Gene_4     0.3252890      0.313256      0.512266     0.0556837     0.0719282
## Gene_5     0.5296930      0.629770      2.150640     1.1608300     1.1623000
## Gene_6     0.0000000      0.000000      0.000000     0.0000000     0.0458154
##        Sample_B_rep3
## Gene_1    22.3985000
## Gene_2     0.0835085
## Gene_3    53.1314000
## Gene_4     0.2759260
## Gene_5     1.4716900
## Gene_6     0.0000000

We wonder if our experiment went well or not.

ggplot(exp, aes(Sample_A_rep1, Sample_A_rep2)) +
  geom_point()+
  stat_smooth(method=lm)

## `geom_smooth()` using formula = 'y ~ x'

Let’s check out Pearson correlation coefficients between the pairs of samples. Notice the difference.

#Between the replicates of samples
cor(exp$Sample_A_rep1, exp$Sample_A_rep2)

## [1] 0.9839493

cor(exp$Sample_A_rep1, exp$Sample_A_rep3)

## [1] 0.9847132

#Across the samples
cor(exp$Sample_A_rep1, exp$Sample_B_rep3)

## [1] 0.8055932

Now, let’s calculate all at once.

cormat <- cor(exp)
cormat

##               Sample_A_rep1 Sample_A_rep2 Sample_A_rep3 Sample_B_rep1
## Sample_A_rep1     1.0000000     0.9839493     0.9847132     0.9151998
## Sample_A_rep2     0.9839493     1.0000000     0.9794035     0.8773407
## Sample_A_rep3     0.9847132     0.9794035     1.0000000     0.8767977
## Sample_B_rep1     0.9151998     0.8773407     0.8767977     1.0000000
## Sample_B_rep2     0.8695399     0.8206401     0.8270926     0.9460479
## Sample_B_rep3     0.8055932     0.7528408     0.7354855     0.9347932
##               Sample_B_rep2 Sample_B_rep3
## Sample_A_rep1     0.8695399     0.8055932
## Sample_A_rep2     0.8206401     0.7528408
## Sample_A_rep3     0.8270926     0.7354855
## Sample_B_rep1     0.9460479     0.9347932
## Sample_B_rep2     1.0000000     0.9014187
## Sample_B_rep3     0.9014187     1.0000000

Visualization of the sample to sample correlation coefficients

#For this, we will use `reshape2` R library
library(reshape2)#Please, install if not already.

#First, round the digits following points
cormat <- round(cormat, 2)

#then, melt the NxN matrix to 2 by N*(N-1)
melted_cormat <- melt(cormat)
head(melted_cormat)

##            Var1          Var2 value
## 1 Sample_A_rep1 Sample_A_rep1  1.00
## 2 Sample_A_rep2 Sample_A_rep1  0.98
## 3 Sample_A_rep3 Sample_A_rep1  0.98
## 4 Sample_B_rep1 Sample_A_rep1  0.92
## 5 Sample_B_rep2 Sample_A_rep1  0.87
## 6 Sample_B_rep3 Sample_A_rep1  0.81

Now, let’s drow the heatmap with ggplot functions, i.e. geom_time()

heatmap <- ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile()
heatmap

heatmap <- heatmap + theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1))

heatmap

For further aestetics, please check out http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization

Also, an alternative R package for beautiful heatmaps is pheatmap. Try it out your self!

Exploring gene expression dinamics

Calculate gene expression variance across samples (row wise) using apply().

exp.vars <- apply(exp, 1, var, na.rm=TRUE)#row wise variance

#sample.vars <- apply(exp, 2, mean, na.rm=TRUE)#col wise mean

#Calculate gene mean expression
exp.means <- apply(exp, 1, mean, na.rm=T)

Examine the relationship between mean gene expression and its variance across samples.

plot(exp.means, exp.vars, pch=20)

Coefficient of variation (CV) is a better estimation of variance against mean value of gene expression when selecting gene features. Let’s see how to examine CV.

exp.df <- data.frame(row.names = rownames(exp),
                     GeneVar = exp.vars,
                     GeneMean = exp.means,
                     CoefVar = exp.vars/exp.means)

plot(exp.df$GeneMean, exp.df$CoefVar)

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.4  lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0  
##  [5] dplyr_1.1.2     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0    
##  [9] tibble_3.2.1    ggplot2_3.4.3   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0  xfun_0.39         bslib_0.5.1       splines_4.1.0    
##  [5] lattice_0.21-8    colorspace_2.1-0  vctrs_0.6.2       generics_0.1.3   
##  [9] htmltools_0.5.5   yaml_2.3.7        mgcv_1.8-42       utf8_1.2.3       
## [13] rlang_1.1.1       jquerylib_0.1.4   pillar_1.9.0      glue_1.6.2       
## [17] withr_2.5.0       lifecycle_1.0.3   plyr_1.8.8        munsell_0.5.0    
## [21] gtable_0.3.3      evaluate_0.21     labeling_0.4.2    knitr_1.43       
## [25] tzdb_0.3.0        fastmap_1.1.1     fansi_1.0.4       highr_0.10       
## [29] Rcpp_1.0.10       scales_1.2.1      cachem_1.0.8      jsonlite_1.8.4   
## [33] farver_2.1.1      hms_1.1.3         digest_0.6.31     stringi_1.7.12   
## [37] grid_4.1.0        cli_3.6.1         tools_4.1.0       magrittr_2.0.3   
## [41] sass_0.4.6        pkgconfig_2.0.3   Matrix_1.5-1      timechange_0.2.0 
## [45] rmarkdown_2.24    rstudioapi_0.15.0 R6_2.5.1          nlme_3.1-162     
## [49] compiler_4.1.0

Computational Biology & Data Science - Lecture 5

Yasin Kaymaz

2022-12-04