Second session: Biostatistics Project

Study title: Inflammation and the Host Response to Injury

Summary of the previous report

Inflammation and the Host Response to Injury is a research consortium with the objective of uncover the biological reasons why patients have different outcomes after suffering traumatic injuries.
The first part consisted of a descriptive analysis that followed the genetic expression Affymetrix U133 plus 2.0 GeneChip signal of 244 severe burned patients (>20% of the total body surface) admited within the first 96 hours of injury and 37 healthy controls, over the time for 44692 human genes. Thus, some of the cases have multiple observations across their evolution in time. The data frames obtained for this study are publicly available via the NCBI portal Bioproject.
The first data frame, pheno_data contains the clinical measurement variables and the second data frame, expr_data, contains the affimetrix chip measurement. Bot datasets are joined together by the identifier “geo_accession”.
As a main conclusion from the first part, after performing PCA and differential gene expression analysis, is that the gene expression between controls and the longest hour since injury samples match, possibly explaining the return of the genetic expression to a baseline after a period of time after the injury.

Proposed objectives

Asses the differentially expressed genes over time.
Identify the amount of time necessary for the return of the genetic expression to the baseline.
Create a genetic map to identify the metabolic route of the genetic expression after a burn.

Identification of a single gene differential expression:

The samples might be biased by the person they come from, as there are multiple measures per patient and hours since injury ¹.
When there are many time points, it is reasonable to assume that expression changes smoothly over time rather than making discrete jumps from one time point to another ².
Following this reasoning, a single linear regression may not be suitable, and a polynomial regression may overfit the values³. this is why it was decided to use the cubic spline “generalized aditive model” approach ^4,5.
In terms of a linear predictor, the mathematical formula takes the following output:
\[y = \sum x_{j} \beta{j} \] Where \(y\) is a response to the outcome variable, \(x\) is a prognostic factor and \(\beta{j}\) are parameters.
The generalized aditive model replaces \(\sum x_{j} \beta{j}\) with \(\sum f{j}(x{j})\) where \(f{j}\) is a smooth curve that minimizes the function:
\[\sum (y{i} - f(x{i}))^2 + \lambda \int f'' (x)^2 dx \]

Notice here that \(\lambda\) is a non negative smoothing parameter that has to be choosen between the tradeoff of the goodness of fit of the data ( \(\sum (y{i} - f(x{i}))^2\) ) and the wiggleness of the function. Larger values of \(\lambda\) force \(f\) to be smoother.
The solution of this function is a cubic spline, a cubic polynomial with the pieces joined at unique observed \(x\) in the dataset, named knots¹⁴.

The “ns” function of the package limma ² creates a basis for the cubic spline regression that can be replaced for the variable hours_since_injury with the added separation between knots.
In order to choose the correct degrees of freedom (the value of \(\lambda\)), first I have to see if inside the knots, the same patient will be repeated at least twice, which will introduce a bias. Therefore I created a function that will output how many times a patient is repeated within the boundaries of an incresing amount of knots, using gene as an example. Bearing in mind that I decided to put all the controls as time 0, with no unique source name, they will be counted as repeats; therefore the minimum outcome expected in this function is 1, the column name is \(\lambda\) and the result is the total repeats per knot.

Figure 1: Barplot of the total number of repeated samples inside knots for increasing amounts of degrees of freedom.

As it can be seen in Fig. 1, there is a great decrease in the number of repeats. When it reaches 38 there are 0 repeats. However, at late stage of this assay I found that there are empty knots that the lmFit function from limma package does not accept. Thus, the minimum degrees of freedom chosen was 16.
Next is to compare if the use of the knots in the ns function represents a better model than just using hours since injury.

## Analysis of Variance Table
## 
## Model 1: X211275_s_at ~ hours_since_injury
## Model 2: X211275_s_at ~ X
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    588 466.06                                  
## 2    553 184.79 35    281.26 24.048 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Analysis of Variance Table
## 
## Model 1: X211275_s_at ~ X
## Model 2: X211275_s_at ~ X
##   Res.Df    RSS  Df Sum of Sq     F Pr(>F)
## 1    553 184.79                           
## 2    573 192.03 -20   -7.2381 1.083 0.3631

As it is shown by the anova result, the addition of the knots is significant between a single linear function and the addition of a cubic spline with 16 knots (first result). However, using 16 or 38 knots does not show a significative difference (second result). It increases the \(R^2\) from 0.24 to 0.68 and excludes in each bin the individual differences between patients but for 16 of the 285 patients. When excluding the controls and performing the test on 16 degress of freedom the \(R^2\) is 0.55 and the p-value is the same (Fig. 2(B)).

<b>Figure 2: Cubic spline fit of the gene GYG1.</b> The black line represents the fitted function. (A)16 degrees of freedom including the controls (light blue) and the cases orange. (B) 16 degrees of freedom withouth the controls (grey).

Figure 2: Cubic spline fit of the gene GYG1. The black line represents the fitted function. (A)16 degrees of freedom including the controls (light blue) and the cases orange. (B) 16 degrees of freedom withouth the controls (grey).

When comparing the fitting from \(\lambda = 16\) with and without controls (Fig. 2), the expression levels approximate those of the controls when the time passes by. Using the approximation function, using the fitted data from the black curve in Fig. 1 B, the mean time in where the expression of the gene GYG1 for the cases approximates the controls is 6959 hours with a standrd deviation of 2900 hours. If one was to fit the same function only in a linear model, the expected hours to return to the baseline would be 9035.

Finding the most differentially expressed genes in the dataset

Figure 3: Gene expression as a function of time for the best 10 differentialy expressed genes.

From the total 44692 genes, only 7887 were not differentially expressed (p-adj <0.05) or using a cutoff of p-adj of 0.01, 12284 . It tells the big impact of a severe burn in the body.
When selecting the \(10^{-61}\) p-adj; the top 270 genes are selected. Using the package topGO ⁷, that relates the gene names to annotated categories in the GO object, the pathway of biological signalling corresponding to inflammation, signalling and inmmune response related to 28 adjusted top genes ends in neutrophil degranulation; which is a feature of inflammatory disorders ⁸.

Figure 4: Significantly enriched GO nodes in the GO hierarchy.

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 28 
## Number of Edges = 39 
## 
## $complete.dag
## [1] "A graph with 28 nodes."

Pathway Enrichment Analysis

Using the package called ReactomePA ⁹, it is possible to investigate the peer reviewed Reactome pathway database to visualize specific metabolic pathways associated to the burn injury results. Arbitrarily, less than \(10^{-32}\) p-adj selected 2588 genes, that were adjusted

<b>Figure 5: Enriched Reactome pathways and their p–values.</b> Barplot of the different pathways.

Figure 5: Enriched Reactome pathways and their p–values. Barplot of the different pathways.

Figure 6: Visualization of the gene sets as a network.

Discussion

From the genetic expression on the Affymetrix U133 plus 2.0 GeneChip signal from 244 severe burned patients it was possible to study the passage of time and the genetic expression as a function of it. Because some patients were measured multiple times, individual expression of each patient would be a bias. In order to account for this, a model of multiple testing using cubic splines was used in the study of a single gene, the GYG1 (glycogenin, related to the glucose metabolism and innate immune system ¹⁰). It was selected because it was the most differentially expressed when using 5 degrees of freedom for the initial testing of the cubic splines. However, testing the amount of repeats in each knot of single patient measurements rapidly showed the need to increase the degrees of freedom for the basis of the cubic spline. The function used did not take in account the empty knots, what in the end gave errors trying to use the limma package with 38 degrees of freedom. The maximum number allowed was 16 degrees of freedom. Regardess of that, an analysis of variance of the two multivariate models (16 and 38 degress of freedom) did not show any statistical difference, so this value was picked for later studies. To pick most accurately the degrees of freedom it may be necessary to iterate over all the genes, and also to review cross validation error estimates ⁶.
About the time necesssary for the gene expression to return to the baseline; the approximation of the cubic spline function for the GYG1 gene to the value of the expression of the controls was around 7000 hours, which is around 9 months. Nevertheless, in Fig. 3 a quick visual inspection shows that this value is different for the top genes of the whole dataset regression with 16 degrees of freedom, for those the return to the baseline is observed around 4 months. I conclude that the time necessary for the genetic expression to return to the baseline depends on the type of gene. Possibly the genes related with inflamation as an acute response will be the most differentially expressed but also the ones that will return to the baseline faster, whereas the GYG1 gene releated to a more general metabolic pathway will take longer to return to its baseline. This results are not enough to give real conclusions, though.
With the results from the limma package analysis, using a p-adjusted cutoff of 0.01, 72% of the total genes in this study were affected by a major burn. This highlights the impact that a major burn does to the body.
The topGO package annotation was tracked to the U133 Affimetrix chip to generate a hierarchy of nodes, that starts with a general biological process and leads into a neutrophil degranulation pathway, using a \(10^{-61}\) arbitraty cutoff because of a means of visualization. A p-value correction test using fisher statistics shows the results of the 28 variables that showed to be dependent to one another from the initial 270.
Finally, from an initial cutoff of \(10^{-32}\), that is 2588 genes with the top p-adjusted values; an enrichment analysis was performed to identify pathway groups, using a hypergeometric model to asses wether the number of genes associated with a reactome pathway is larger than expected¹¹. Four main groups result from this analysis, DAP12 , interleukin 4 , interferon gamma signaling and neutrophil degranulation. DAP12 is related in celular activation ¹² expressed in receptors of immune cells ¹³. Interleukin 4 is a citokine that induces the differentiation of naive helper T cells into Th2 cells (adaptative immune system) and interferon gamma is a citokine related to the activation of the innate immune system. Finally, as discussed previously, the degranulation of neutrophils is related to kinase activators. The contents of such granules is higly enriched in tissue destructive proteases, so in some extent is related to burn injury.⁸
The results from topGO and Reactome both complement each other and validate the results.
With this study, it was possible to conclude that there is indeed regression of the expresion of the genes to a baseline depending on the type of genes involved and that burn injuries are related to the activation of innate and adaptive immune respones. Further statistical analysis need to be done and iterated over all the genes to validate the results.Also, it would be interesting to asses more clinical data specially not only in the surface but