library(tidyverse)
indata <- read.csv("income.data.csv", header = T)
head(indata)
## X income happiness
## 1 1 3.862647 2.314489
## 2 2 4.979381 3.433490
## 3 3 4.923957 4.599373
## 4 4 3.214372 2.791114
## 5 5 7.196409 5.596398
## 6 6 3.729643 2.458556
plot(indata$income, indata$happiness)
library(tidyverse)
ggplot(indata, aes(income, happiness)) +
geom_point()
Here, we will explore if two variables can be fitted to regression line.
\[ y = αx + β\] The slope of the line (the regression coefficient) is β, the increase per unit change in x. The line intersects the y-axis at the intercept α.
Here is the linear model
income.happiness.lm <- lm(happiness ~ income, data = indata)
summary(income.happiness.lm)
##
## Call:
## lm(formula = happiness ~ income, data = indata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.02479 -0.48526 0.04078 0.45898 2.37805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.20427 0.08884 2.299 0.0219 *
## income 0.71383 0.01854 38.505 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7181 on 496 degrees of freedom
## Multiple R-squared: 0.7493, Adjusted R-squared: 0.7488
## F-statistic: 1483 on 1 and 496 DF, p-value: < 2.2e-16
Let’s check how this regression line looks like on our plot
ggplot(indata, aes(income, happiness)) +
geom_point()+
stat_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'
For further investigation, please check out http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/.
#Returns Pearson correlation coefficient as default
cor(indata$income, indata$happiness)
## [1] 0.8656337
Gene expression table of 6 tissues represented by 1000 genes each. Please, download the data from the Google drive link. “Gene_expression_table.txt”
exp <- read.delim("Gene_expression_table.txt", header = T, row.names = 1)
head(exp)
## Sample_A_rep1 Sample_A_rep2 Sample_A_rep3 Sample_B_rep1 Sample_B_rep2
## Gene_1 22.9429000 23.674800 25.902100 22.5873000 23.4013000
## Gene_2 0.0966727 0.000000 0.282938 0.1273130 0.4440570
## Gene_3 52.2926000 48.391900 53.052200 60.5590000 72.4147000
## Gene_4 0.3252890 0.313256 0.512266 0.0556837 0.0719282
## Gene_5 0.5296930 0.629770 2.150640 1.1608300 1.1623000
## Gene_6 0.0000000 0.000000 0.000000 0.0000000 0.0458154
## Sample_B_rep3
## Gene_1 22.3985000
## Gene_2 0.0835085
## Gene_3 53.1314000
## Gene_4 0.2759260
## Gene_5 1.4716900
## Gene_6 0.0000000
We wonder if our experiment went well or not.
ggplot(exp, aes(Sample_A_rep1, Sample_A_rep2)) +
geom_point()+
stat_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'
Let’s check out Pearson correlation coefficients between the pairs of samples. Notice the difference.
#Between the replicates of samples
cor(exp$Sample_A_rep1, exp$Sample_A_rep2)
## [1] 0.9839493
cor(exp$Sample_A_rep1, exp$Sample_A_rep3)
## [1] 0.9847132
#Across the samples
cor(exp$Sample_A_rep1, exp$Sample_B_rep3)
## [1] 0.8055932
Now, let’s calculate all at once.
cormat <- cor(exp)
cormat
## Sample_A_rep1 Sample_A_rep2 Sample_A_rep3 Sample_B_rep1
## Sample_A_rep1 1.0000000 0.9839493 0.9847132 0.9151998
## Sample_A_rep2 0.9839493 1.0000000 0.9794035 0.8773407
## Sample_A_rep3 0.9847132 0.9794035 1.0000000 0.8767977
## Sample_B_rep1 0.9151998 0.8773407 0.8767977 1.0000000
## Sample_B_rep2 0.8695399 0.8206401 0.8270926 0.9460479
## Sample_B_rep3 0.8055932 0.7528408 0.7354855 0.9347932
## Sample_B_rep2 Sample_B_rep3
## Sample_A_rep1 0.8695399 0.8055932
## Sample_A_rep2 0.8206401 0.7528408
## Sample_A_rep3 0.8270926 0.7354855
## Sample_B_rep1 0.9460479 0.9347932
## Sample_B_rep2 1.0000000 0.9014187
## Sample_B_rep3 0.9014187 1.0000000
Visualization of the sample to sample correlation coefficients
#For this, we will use `reshape2` R library
library(reshape2)#Please, install if not already.
#First, round the digits following points
cormat <- round(cormat, 2)
#then, melt the NxN matrix to 2 by N*(N-1)
melted_cormat <- melt(cormat)
head(melted_cormat)
## Var1 Var2 value
## 1 Sample_A_rep1 Sample_A_rep1 1.00
## 2 Sample_A_rep2 Sample_A_rep1 0.98
## 3 Sample_A_rep3 Sample_A_rep1 0.98
## 4 Sample_B_rep1 Sample_A_rep1 0.92
## 5 Sample_B_rep2 Sample_A_rep1 0.87
## 6 Sample_B_rep3 Sample_A_rep1 0.81
Now, let’s drow the heatmap with ggplot functions, i.e. geom_time()
heatmap <- ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()
heatmap
heatmap <- heatmap + theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1))
heatmap
For further aestetics, please check out http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization
Also, an alternative R package for beautiful heatmaps is
pheatmap
. Try it out your self!
Calculate gene expression variance
across samples (row
wise) using apply()
.
exp.vars <- apply(exp, 1, var, na.rm=TRUE)#row wise variance
#sample.vars <- apply(exp, 2, mean, na.rm=TRUE)#col wise mean
#Calculate gene mean expression
exp.means <- apply(exp, 1, mean, na.rm=T)
Examine the relationship between mean gene expression and its variance across samples.
plot(exp.means, exp.vars, pch=20)
Coefficient of variation (CV) is a better estimation of variance against mean value of gene expression when selecting gene features. Let’s see how to examine CV.
exp.df <- data.frame(row.names = rownames(exp),
GeneVar = exp.vars,
GeneMean = exp.means,
CoefVar = exp.vars/exp.means)
plot(exp.df$GeneMean, exp.df$CoefVar)
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.4 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
## [5] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0
## [9] tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 xfun_0.39 bslib_0.5.1 splines_4.1.0
## [5] lattice_0.21-8 colorspace_2.1-0 vctrs_0.6.2 generics_0.1.3
## [9] htmltools_0.5.5 yaml_2.3.7 mgcv_1.8-42 utf8_1.2.3
## [13] rlang_1.1.1 jquerylib_0.1.4 pillar_1.9.0 glue_1.6.2
## [17] withr_2.5.0 lifecycle_1.0.3 plyr_1.8.8 munsell_0.5.0
## [21] gtable_0.3.3 evaluate_0.21 labeling_0.4.2 knitr_1.43
## [25] tzdb_0.3.0 fastmap_1.1.1 fansi_1.0.4 highr_0.10
## [29] Rcpp_1.0.10 scales_1.2.1 cachem_1.0.8 jsonlite_1.8.4
## [33] farver_2.1.1 hms_1.1.3 digest_0.6.31 stringi_1.7.12
## [37] grid_4.1.0 cli_3.6.1 tools_4.1.0 magrittr_2.0.3
## [41] sass_0.4.6 pkgconfig_2.0.3 Matrix_1.5-1 timechange_0.2.0
## [45] rmarkdown_2.24 rstudioapi_0.15.0 R6_2.5.1 nlme_3.1-162
## [49] compiler_4.1.0