Biostatistics Project

Study title: Inflammation and the Host Response to Injury

Retriving the Data from Bioconductor

data <- getGEO('GSE37069', GSEMatrix=TRUE)
eset <- data[[1]]

Background of the Study

Inflammation and the Host Response to Injury is a research consortium supported by the National Institute of General Medical Sciences (NIGMS), a component of the National Institutes of Health. This collaborative program aims to uncover the biological reasons why patients can have dramatically different outcomes after suffering a traumatic injury.¹

Accession numbers: PRJNA158087²; GEO: GSE37069³

m<- pData(phenoData(eset))
test<- exprs(eset)

It consists of 590 blood sample tests, from 244 severe burned patients over time as well as 37 healthy subjects (controls). Study subjects were treated under standard operating procedures to minimize treatment variations. Patients had burns covering >20% of the total body surface area and were admitted within 96 hours of injury. Genome-wide expression analyses were conducted using the Affymetrix U133 plus 2.0 GeneChip^TM ^(3,4).
This chip contains the measurements of 44692 human genes as described elsewere ⁵. The value of each result is the Robust Multi-array Average (RMA) signal per Affymetrix Power Tools version 1.8.6 RMA model ⁵ which consists in background-adjusted, quantile-normalized, and log₂-transform of the Perfect Match probes signal, meaning the signal given by the 16–20 pairs of oligonucleotides referred to as probe sets that match perfectly the RNA of the sample⁶.

Data Arrangement

There are 2 data frames that contain the data from this study, pheno_data and expr_data, both joined together by the affimetrix chip measurement identifier “geo_accession”.

Pheno_data contains the clinical data of each blow sample measurement and expr_data contains the intensity results of each measurement (Fig. 1)

<b>Figure 1. General structure of the data frames for this study.</b> (A) Pheno_data. The first 5 columns were selected as the most relevant from the original 44 columns, plus a dummy variable - ID, that was introduced to identify cases and controls more easily later on in the analysis. (B) Expr_data. Each row contains the RMA value of each 44692 genes of the microarray chip, and in each column the geo_accession number linked to the clinical data in the pheno_data dataframe.

Figure 1. General structure of the data frames for this study. (A) Pheno_data. The first 5 columns were selected as the most relevant from the original 44 columns, plus a dummy variable - ID, that was introduced to identify cases and controls more easily later on in the analysis. (B) Expr_data. Each row contains the RMA value of each 44692 genes of the microarray chip, and in each column the geo_accession number linked to the clinical data in the pheno_data dataframe.

The present report makes an exploratory data analysis of the two data frames of Fig. 1. First, an analysis variable per variable, then by pairs and finally combining the two data frames to obtain preliminary results that may need further validation. The RMD R data code is available in github (https://github.com/florez-alberto/master2-bio)and the final report was also published in RPubs (http://rpubs.com/florez_alberto/m2-biostat).

Descriptive analytics for Pheno_data

Single Variable Statistics

The variable age has been corrected for each individual patient because there may be several observations per patient. It is quantitative continuous variable with a mean 24.79, a standard deviation of 19.75. The bins were chosen following Sturge’s rule ⁷. The histogram plot is skewed to the left, according to the literature that says that younger people tend to have more burn accidents⁸.

<b>Figure 2: Summary statistics of the distribution of the age of each patient.</b> (A) Quartile distribution. (B) Histogram distribution. The red vertical line represents the mean and the blue line represents the median.

Figure 2: Summary statistics of the distribution of the age of each patient. (A) Quartile distribution. (B) Histogram distribution. The red vertical line represents the mean and the blue line represents the median.

The variable hours_since_injury is quantitative continuous, in Fig. 3 it is appreciated that the data is skewed to the left, with a predominancy of the observations close to the time of injury. The outliers in this case make the stantard deviation to be very high and shift the mean to the right, whereas the majority of observations are close to the injury time. The controls were not taken into account because they did not have a time after injury.

<b>Figure 3: Density Plot of the distribution of the hours since injury.</b> The red vertical line represents the mean and the blue line represents the median.

Figure 3: Density Plot of the distribution of the hours since injury. The red vertical line represents the mean and the blue line represents the median.

The variable source_name is qualitative and it identifies each patient. Some patients were taken repeated blood samples at different hours since injury. When tabled and plotted it is possible to asses how many invividual times each patient had their blood sampled, which in turn is a quantitative discrete variable.

<b>Figure 4: Number of blood samples taken per patient.</b> The numbers over the bars correspond to the absolute count.

Figure 4: Number of blood samples taken per patient. The numbers over the bars correspond to the absolute count.

As it can be appreciated in Fig. 4, most of the patients only had one blood sample taken and analized on the microarray chip U133.
The variables sex and ID both are dichotomous qualitative nominal variables. To review these variables, only the unique source_names must be taken in account. For this reason a subset of the pheno_data data frame was taken.

<b>Figure 4: Summary statistics of the sex and ID.</b> (A) Sex. Absolute and relative values. (B) ID. Absolute and relative values.

Figure 4: Summary statistics of the sex and ID. (A) Sex. Absolute and relative values. (B) ID. Absolute and relative values.

There is a predominancy of male subjects in the sample. The ratio male/female for the cases is 2.76. This ratio is slightly higher than the reported in the literature, which is expected to be 1.56 ⁹.
Interestingly, the raw ratio male/female withouth removing duplicated observations is 2.64, which is very close to the calculated ratio male/female for unique patients.
The dummy variable ID just identifies if an observation corresponds to a control or to a case. As stated before, each of the 590 rows corresponds to each individual blood sample. Also taking the subset of the unique source names it is possible to describe the number of patients in each group.

Descriptive Analytics of certain pairs of variables

Using the combn function⁸, all the possible pairs of the previously analysed variables are 10.

The variables are mostly unique per patient (source name) which means that they need to be subsetted. For the case of combining hours_since_injury with these variables, no valuable information can me extracted, reason why only the some of the pairs were analyzed.
To achieve a boxplot per sample per patient, a new data frame is created by merging the table count of each source name and its respective age. The results in Fig. 5 may be biased because of the uneven distribution of the absolute number of samples per patient (Fig. 4). However, the trend shows that apparently older patients had more samples taken.

<b>Figure 5: Boxplot of samples per patient and age.</b> Only the significant pairs have been plotted with a line.

Figure 5: Boxplot of samples per patient and age. Only the significant pairs have been plotted with a line.

Regarding the pair source_name, sex a facetted barplot was generated. The ratio male/female is not preserved for each of the absolute number per patients and this could introduce a bias in the generalization of the results for women.

<b>Figure 6: Barplot of samples per sex.</b> The values over the columns represent the absolut count for each group.

Figure 6: Barplot of samples per sex. The values over the columns represent the absolut count for each group.

The sixth pair of variables are age and sex and ID are, quantitative, and the last two qualitative. Again, a predominancy of male cases can be seen in Fig 7-C, similarly to Fig. 6. In Fig. 7-A it can be appreciated that there is no difference between both groups, whereas there is a difference between the ages of the cases and controls (Fig. 7-B), being the controls relatively older.

<b>Figure 7: Summary statistics of different pairs of variables.</b> (A) Table of the sex and ID. A predominancy of male cases can be appreciated. (B) Age and sex. (C) Age and ID. A wilcoxon test for the difference of the means was performed.

Figure 7: Summary statistics of different pairs of variables. (A) Table of the sex and ID. A predominancy of male cases can be appreciated. (B) Age and sex. (C) Age and ID. A wilcoxon test for the difference of the means was performed.

Descriptive Analytics for expr_data

Central tendency and dispersion analytics per gene

First, the row-wise calculation of the medians, means and standard deviations per gene was generated. The median statistic was also chosen because it is ourlier independent. In Fig. 9 all distributions skewed to the left, this means that there are genes that are poorly expressed and that the gene expression between genes have a relatively low variance.

Figure 9: Central tendency and dispersion analytics per gene. The histogram bars correspond to the bins of the density of each of the 44692 genes analyzed in the chip.

To explore genes that have a low variance and are poorly expressed, that can potentially be trimmed for the dataset, SDs were plotted against mean, using the density in a heat-plot style to show the concentration of the samples. In Fig. 10 there is a cluster of genes that may be close to the background noise expression, that can potentially be interpretted as significantly different in later analysis but in reality it may be just a confusion with the background.

<b>Figure 10:SD vs Mean RMA intensities per gene.</b> The red line is a regression line of the samples and the colors represent the density of the observations in each x-y coordinate.

Figure 10:SD vs Mean RMA intensities per gene. The red line is a regression line of the samples and the colors represent the density of the observations in each x-y coordinate.

Central tendency and dispersion analytics per patient

Analyzing each patient expression across all the genes, it can be appreciated in Fig. 11 that there is a higher median expression with a mean and variance skewed to the right. This means that there is a large variation of the expression of genes across patients.

<b>Figure 11: Central tendency and dispersion analytics per Chip Measurement.</b> The histogram bars correspond to the bins of the density of each of the 590 chip measurements.

Figure 11: Central tendency and dispersion analytics per Chip Measurement. The histogram bars correspond to the bins of the density of each of the 590 chip measurements.

Exploratory analysis between pheno_data and expr_data

Principal Component Analysis

The principal component analysis of each gene observation was carried out as previosly published ¹⁰, to assess how much each component contributes to the variability of the samples. The first 2 components roughly contribute to around 37% of the variability as it can be seen in Fig. 12, with two differentiated groups across the components.

Figure 12: Principal Component Analysis of the gene expression. Every point in the plot represents one blood sample. The colour indicates the hours since injury and the shape indicates whether its case or control.

clustering Heatmap of the Manhattan distances

The manhattan distance is defined as the sum of the absolute distance of their cartesian coordinates¹², and can be used to asses the differences of the discrete (genetical) distributions per patient as an exploratory analysis. This analysis if followed by a heatmap cluster analysis that will validate the results of the PCA. Large distances that do not cluster together may correspond to outliers that can possibly be removed too. Row annotations were used to follow the clustering for each observation.

<b>Figure 13: Clustering heatmap of the diferential gene expression.</b> Every point in the plot represents one blood sample. The colour indicates the hours since injury and the shape indicates whether its case or control.

Figure 13: Clustering heatmap of the diferential gene expression. Every point in the plot represents one blood sample. The colour indicates the hours since injury and the shape indicates whether its case or control.

Conclusions

Inflammation and the Host Response to Injury is a research consortium with the objective of uncover the biological reasons why patients have different outcomes after suffering traumatic injuries.
The following descriptive analysis follows the genetic expression Affymetrix U133 plus 2.0 GeneChip signal of 244 severe burned patients (>20% of the total body surface) admited within the first 96 hours of injury and 37 healthy controls, over the time for 44692 human genes. Thus, some of the cases have multiple observations across their evolution in time. The data frames obtained for this study are publicly available via the NCBI portal Bioproject.
The first data frame, pheno_data contains the clinical measurement variables and the second data frame, expr_data, contains the affimetrix chip measurement. Bot datasets are joined together by the identifier “geo_accession”.
An analysis of the clinical variables, singles and by pairs showed a mean distribution of the age according to the literature that says that younger patients tend to have more burning injuries. There is a tendency that the hours since injury are closer to the time of the injury than further in time. and most of the patients only have one or two measurements, apparently younger patients. The sex distribution ratio is slightly higher than the reported literature. However, it does not follow an even distribution by the absolute number of samples taken per patient. Apparently, there is a significative difference between the age of the cases and controls and a clustering of male cases in this study.
In the analysis of the signal expression per gene across all patients, it appears that there is a subset of poorly expressed genes with high variation that may be accounted as significantly different but in reality it may be confusion with the background.
The differential expression between the hours since injury is the dominant source of variation. Also it can be seen that the expression between controls and the longest hour since injury sample start to match, possibly explaining the return of the genetic expression to a baseline after a period of time after the injury.
In the heatmap cluster analysis, this data is validated as it can be seen the clustering of the longest hours since injury with the controls. However this clustering is not completely defined confirming that the diferential expression is not completely perfect.
Finally, the results of the PCA and differential expression analysis may serve as a milestone to look for statisticals tests and regression models using the hours since injury and case/control variables across genetical expression data to find differentially expressed genes for a posterior analysis of differential expression of those target genes.

References

https://med.stanford.edu/sgtc/research/inflammation.html
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA158087
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse37069
https://www.thermofisher.com/order/catalog/product/900466
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570
Irizarry RA, Hobbs B, Collin F, Beazer‐Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003 Apr 1;4(2):249-
Sturges, H.(1926) The choice of a class-interval.J. Amer. Statist. Assoc.,21, 65–66
https://www.who.int/news-room/fact-sheets/detail/burns
Smolle C, Cambiaso-Daniel J, Forbes AA, Wurzer P, Hundeshagen G, Branski LK, Huss F, Kamolz LP. Recent trends in burn epidemiology worldwide: A systematic review. Burns. 2017 Mar 1;43(2):249-57.
https://www.rdocumentation.org/packages/utils/versions/3.4.1/topics/combn
https://bioconductor.org/packages/devel/workflows/vignettes/maEndToEnd/inst/doc/MA-Workflow.html