HW3: Simple Linear Regression in Bioinformatics

2024-06-09

Introduction to Linear Regression

Simple linear regression is a statistical technique that models the relationship between a dependent variable and a single independent variable. It finds the best-fitting line through the observed points, so that the dependent variable can be predicted from the value of the independent variable. In bioinformatics, linear regression is used to predict gene expression levels from other biological variables and to model the effect of treatments.

Mathematical Formula

Use of math text written in LaTeX (1)

\[ y = \beta_0 + \beta_1 x + \epsilon \]

\(y\) is the dependent variable.
\(\beta_0\) is the intercept.
\(\beta_1\) is the slope coefficient.
\(x\) is the independent variable.
\(\epsilon\) is the error term.
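For a concrete view of how the coefficients are obtained, the least-squares estimates have the standard closed form \(\hat{\beta}_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). The sketch below applies these formulas to a tiny toy dataset (the five values are made up for illustration and are not part of the analysis) and compares the result with lm().

```r
# Toy data: hypothetical values, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(4.6, 4.4, 4.1, 3.9, 3.5)

# Closed-form least-squares estimates of slope and intercept
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# Estimates computed by hand
c(intercept = beta0_hat, slope = beta1_hat)

# The same estimates as returned by lm()
coef(lm(y ~ x))
```

Both calls should report the same intercept and slope, since lm() solves the same least-squares problem.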

Example Analysis

Bioinformatics Dataset

I’ll simulate a simple dataset of gene expression levels and cell cycle stages as an example. It is meant to capture the essence of differential gene expression in single-cell RNA sequencing. The simulated dataset illustrates how linear regression can be used to understand the relationship between gene expression and cell cycle stage, which matters for analyzing disease progression and tailoring personalized treatments in bioinformatics.

In this example, gene expression levels are measured across different stages of the cell cycle. The dataset simulates 100 observations of two genes: A_gene and B_gene. The cell cycle stage is an integer from 1 to 10, representing different phases of the cell cycle. The expression levels of A_gene and B_gene are modeled to vary linearly with the cell cycle stage, with random noise added to reflect biological variability.
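The simulation code itself is not shown in this report, so the following is only a sketch of how a dataset matching the description could be generated. The seed, the true intercepts and slopes, the noise level, and the column name B_gene_expression are all assumptions. Only cell_cycle_stage and A_gene_expression are names that appear in the lm() call in this document.

```r
set.seed(42)  # assumed seed, only for reproducibility of this sketch

n <- 100
# Integer stage from 1 to 10 for each simulated observation
cell_cycle_stage <- sample(1:10, n, replace = TRUE)

data <- data.frame(
  cell_cycle_stage  = cell_cycle_stage,
  # Assumed linear trends plus Gaussian noise for biological variability
  A_gene_expression = 5 - 0.25 * cell_cycle_stage + rnorm(n, sd = 1),
  B_gene_expression = 2 + 0.30 * cell_cycle_stage + rnorm(n, sd = 1)
)

head(data)
```

Because the parameters are guesses, numbers produced from this data frame will not match the model output reported in this document exactly.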

Fitting the Linear Model

The simulated dataset is stored in a data frame named “data”. I can use the lm() function to find the best-fitting line relating cell_cycle_stage to A_gene_expression.

## 
## Call:
## lm(formula = A_gene_expression ~ cell_cycle_stage, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.07079 -0.68894 -0.01604  0.52582  3.11818 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.7718     0.2337  20.418  < 2e-16 ***
## cell_cycle_stage  -0.2649     0.0352  -7.525 2.58e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9774 on 98 degrees of freedom
## Multiple R-squared:  0.3662, Adjusted R-squared:  0.3597 
## F-statistic: 56.62 on 1 and 98 DF,  p-value: 2.584e-11
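In the output above, the slope estimate of -0.2649 means that the expression of gene A is estimated to fall by about 0.26 units per cell cycle stage, and the R-squared of 0.3662 means that about 37% of the variance in expression is explained by the stage. A summary of this form can be produced with a call like the one sketched below. The data frame here is a stand-in built with assumed parameters, so its numbers will differ from those reported above.

```r
# Stand-in data with assumed parameters (not the original simulation)
set.seed(42)
n <- 100
cell_cycle_stage <- sample(1:10, n, replace = TRUE)
data <- data.frame(
  cell_cycle_stage  = cell_cycle_stage,
  A_gene_expression = 5 - 0.25 * cell_cycle_stage + rnorm(n, sd = 1)
)

# Fit the simple linear regression and print the full summary
model <- lm(A_gene_expression ~ cell_cycle_stage, data = data)
summary(model)

# Pull out individual pieces of the fit
coef(model)                # intercept and slope estimates
confint(model)             # 95% confidence intervals (default level)
summary(model)$r.squared   # proportion of variance explained
```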

R Code to Plot Linear Regression for Gene A

Use of R code

library(ggplot2)

# Plot the data and the regression line with 99% confidence interval
# (`data` is the simulated data frame created earlier)
ggplot_linear <-
  ggplot(data, aes(x = cell_cycle_stage, y = A_gene_expression)) +
  geom_point(size = 1) +
  geom_smooth(method = "lm",
              se = TRUE,
              level = 0.99,
              fill = "lightgrey") +
  labs(title = "Simple Linear Regression in Bioinformatics",
       x = "Cell Cycle Stage",
       y = "A Gene Expression") +
  ylim(0, 8)

# Print the plot
ggplot_linear

ggplot2 Plot: A Gene Regression Line

Use of ggplot (1)

## `geom_smooth()` using formula = 'y ~ x'

Convert ggplot linear regression to Plotly object

Use of plotly (1)

## `geom_smooth()` using formula = 'y ~ x'
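The conversion code is not shown in this report. A minimal sketch of how a ggplot object can be turned into an interactive Plotly object with ggplotly() from the plotly package is given below. The data frame is a stand-in with assumed parameters, and the object names p and interactive_plot are hypothetical.

```r
library(ggplot2)
library(plotly)

# Stand-in data with assumed parameters (not the original simulation)
set.seed(42)
n <- 100
cell_cycle_stage <- sample(1:10, n, replace = TRUE)
data <- data.frame(
  cell_cycle_stage  = cell_cycle_stage,
  A_gene_expression = 5 - 0.25 * cell_cycle_stage + rnorm(n, sd = 1)
)

# Static ggplot2 scatter plot with a fitted regression line
p <- ggplot(data, aes(x = cell_cycle_stage, y = A_gene_expression)) +
  geom_point(size = 1) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.99)

# Convert the ggplot object into an interactive plotly htmlwidget
interactive_plot <- ggplotly(p)
interactive_plot
```

Passing formula = y ~ x explicitly to geom_smooth() also avoids the informational message shown above.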

Jitter Plot of A Gene Expression by Cell Cycle Stage

Use of ggplot (2)
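Because the cell cycle stage takes only integer values, many points overlap in an ordinary scatter plot. A jitter plot spreads them out horizontally. The plotting code is not shown in this report, so the sketch below is one possible version, using stand-in data with assumed parameters and an assumed jitter width of 0.2.

```r
library(ggplot2)

# Stand-in data with assumed parameters (not the original simulation)
set.seed(42)
n <- 100
cell_cycle_stage <- sample(1:10, n, replace = TRUE)
data <- data.frame(
  cell_cycle_stage  = cell_cycle_stage,
  A_gene_expression = 5 - 0.25 * cell_cycle_stage + rnorm(n, sd = 1)
)

# Jitter only along x so the expression values themselves are not altered
ggplot_jitter <-
  ggplot(data, aes(x = cell_cycle_stage, y = A_gene_expression)) +
  geom_jitter(width = 0.2, height = 0, size = 1) +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "A Gene Expression by Cell Cycle Stage",
       x = "Cell Cycle Stage",
       y = "A Gene Expression")

ggplot_jitter
```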

Convert ggplot jitter plot to Plotly object

Use of plotly (2)

Plotly Plot: 3D Visualization

Use of plotly (3)
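The code for the 3D figure is not shown in this report. One way such a plot could be built with plot_ly() is sketched below. The choice of axes (stage on x, gene A on y, gene B on z), the column name B_gene_expression, and the simulation parameters are all assumptions.

```r
library(plotly)

# Stand-in data with assumed parameters (not the original simulation)
set.seed(42)
n <- 100
cell_cycle_stage <- sample(1:10, n, replace = TRUE)
data <- data.frame(
  cell_cycle_stage  = cell_cycle_stage,
  A_gene_expression = 5 - 0.25 * cell_cycle_stage + rnorm(n, sd = 1),
  B_gene_expression = 2 + 0.30 * cell_cycle_stage + rnorm(n, sd = 1)
)

# 3D scatter plot: one point per simulated observation
plot_3d <- plot_ly(
  data,
  x = ~cell_cycle_stage,
  y = ~A_gene_expression,
  z = ~B_gene_expression,
  type = "scatter3d",
  mode = "markers",
  marker = list(size = 3)
)

# Label the three axes of the 3D scene
plot_3d <- layout(
  plot_3d,
  scene = list(
    xaxis = list(title = "Cell Cycle Stage"),
    yaxis = list(title = "A Gene Expression"),
    zaxis = list(title = "B Gene Expression")
  )
)

plot_3d
```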

Implication of the 3D plot

Adding Gene B to the 3D plot introduces a second variable alongside Gene A, plotted over the same independent variable (cell cycle stage). By analogy, the same gene can be expressed differently across regions or tissues of the body, and each such measurement could be added as a further variable. With additional variables in the same space, researchers can build new models, for example with linear algebra techniques, to gain deeper insight into gene expression patterns.

Residuals of the Linear Model

Mathematical Representation

Use of math text written in Latex (2)

Residuals in a linear model reflect the difference between the observed gene expression levels and the predicted gene expression levels from the linear model. Specifically, for each data point in our bioinformatics dataset, the residual is calculated as: \[ \epsilon_i = y_i - \hat{y}_i \] where \(\epsilon_i\) is the residual; \(y_i\) is the observed value; \(\hat{y}_i\) is the predicted value from the model.

Residuals help validate the model by showing whether the assumptions of linear regression hold for the data.
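The residual formula can be checked directly in R. The sketch below computes the residuals by hand as observed minus fitted values and compares them with residuals(). The data frame is a stand-in built with assumed parameters, and the names y_hat and res_manual are hypothetical.

```r
# Stand-in data with assumed parameters (not the original simulation)
set.seed(42)
n <- 100
cell_cycle_stage <- sample(1:10, n, replace = TRUE)
data <- data.frame(
  cell_cycle_stage  = cell_cycle_stage,
  A_gene_expression = 5 - 0.25 * cell_cycle_stage + rnorm(n, sd = 1)
)

model <- lm(A_gene_expression ~ cell_cycle_stage, data = data)

# Predicted values from the fitted line
y_hat <- fitted(model)

# Residual = observed value minus predicted value
res_manual <- data$A_gene_expression - y_hat

# The hand-computed residuals agree with those stored in the model
all.equal(unname(res_manual), unname(residuals(model)))

# With an intercept in the model, least-squares residuals sum to
# (numerically) zero
sum(res_manual)
```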

ggplot2 Plot: Residuals

Use of ggplot (3)
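A common way to inspect residuals is to plot them against the fitted values. If the linear model is adequate, the points should scatter around zero without a clear pattern. The plotting code is not shown in this report, so the version below is an assumed sketch on stand-in data, with hypothetical names resid_df and ggplot_resid.

```r
library(ggplot2)

# Stand-in data with assumed parameters (not the original simulation)
set.seed(42)
n <- 100
cell_cycle_stage <- sample(1:10, n, replace = TRUE)
data <- data.frame(
  cell_cycle_stage  = cell_cycle_stage,
  A_gene_expression = 5 - 0.25 * cell_cycle_stage + rnorm(n, sd = 1)
)

model <- lm(A_gene_expression ~ cell_cycle_stage, data = data)

# Collect fitted values and residuals in one data frame for plotting
resid_df <- data.frame(
  fitted   = fitted(model),
  residual = residuals(model)
)

# Residuals versus fitted values, with a reference line at zero
ggplot_resid <-
  ggplot(resid_df, aes(x = fitted, y = residual)) +
  geom_point(size = 1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals of the Linear Model",
       x = "Fitted A Gene Expression",
       y = "Residual")

ggplot_resid
```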

Conclusion

In conclusion, simple linear regression is a powerful tool in bioinformatics for understanding the relationships between biological variables, such as gene expression levels and cell cycle stages in this example. By simulating a dataset and fitting a linear model, we have demonstrated how this technique can reveal patterns that are relevant to analyzing disease progression and developing personalized treatments.