Project: Statistics of Gene Expression

Introduction
Analysis
Conclusion
References

Introduction

A large number of genes with diverse normal functions are involved in human cancer. More than 500 genes have been identified as strongly implicated in the process of transforming normal cells to cancer cells. The expression of these genes in normal cells contributes to normal growth, survival and function, whereas dysregulated expression, including overexpression, loss of expression or expression of a defect protein, in cancer cells contributes to ungoverned tumor growth. Altered gene expression can be caused by coarse structural and numerical chromosomal rearrangements, specific gene amplifications, etc.

In the 1980s, the National Cancer Institute developed a set of 60 cancer cell lines, called NCI60, contained in the DataComputing package in two data tables NCI60 and NCI60cells.

The original purpose was for screening potential anti-cancer drugs. Here, we will examine gene expression in these cell lines. 41,078 probes were used for each of the 60 cell lines.

A gene probe (DNA probe) is a single-stranded DNA or RNA fragment used in genetic engineering to search for a particular gene or other DNA sequence. The probe has a base sequence complementary to the target sequence and will thus attach to it by base pairing. By labelling the probe with a radioactive isotope it can be identified on subsequent separation and purification. In short, it is a gene reporter.

Each of these 2,454,680 entries is a measure of how much a particular gene was expressed in one cell line. In fact, one clue that a gene is linked to cancer is differences in expression of that gene from one cancer type to another.

Analysis

Let’s start by joining NCI60 and NCI60cell and select for probes, cell line, expression and tissue.

Narrow <-
  NCI60 %>%
  gather(cellLine, expression, -Probe)

CellTypes <-
  NCI60cells %>% 
  select(cellLine, tissue) %>% 
  mutate(cellLine = gsub("\\:",".", as.character(cellLine)))

Narrow <-
  Narrow %>% 
  inner_join(CellTypes)

Then, let’s analyze a probe we randomly picked.

Probe_PRM1 <-
  Narrow %>% 
  filter(Probe == "PRM1")
  

SummaryStats <-
  Probe_PRM1 %>%
  group_by(tissue) %>% 
  summarize(mn = mean(expression, na.rm=TRUE),
            se = sd(expression, na.rm=TRUE)/sqrt(n()))

SummaryStats <-
  SummaryStats %>% 
  mutate(tissue = reorder(tissue, mn))

SummaryStats %>% 
  ggplot(aes(x = tissue, y = exp(mn))) +
  geom_bar(stat = "identity", fill = "gray", color=NA) +
  geom_point(data = Probe_PRM1, aes(x=tissue, y=exp(expression)))+
  theme(axis.text.x = element_text(angle = 30, hjust=1))

Figure 1: A graphic with two layers. The bar chart shows the SummaryStats data table while the scatter plot shows the Probe PRM1 data table.

From the bar graph, it looks like PRM1 is expressed more in Prostate cancer than in other cancer. However, when we look at the expression of PRM1 in the individual cell lines for each type of tissues, it is not so clear that PRM1 is expressed more or differently in prostate cancer when compared to the other.

Confidence intervals

The mean reflects the PRM1 expression collectively within each tissue type. As such, the mean is a statistic. In order to compare different tissues to one another, some indication like the 95-percent confidence interval must be given.

SummaryStats <-
  SummaryStats %>%
  mutate(top = mn + 2 * se,
         bottom = mn - 2 * se)

SummaryStats %>%
  ggplot(aes(x = tissue, y = exp(mn))) +
  geom_errorbar(aes(x= tissue,
                    ymax= exp(top),
                    ymin= exp(bottom)), width=0.5)+
  geom_point(data = Probe_PRM1, aes(x=tissue, y=exp(expression)))+
  theme(axis.text.x = element_text(angle = 30, hjust=1))

Figure 2: PRM1 mean expression by tissue types with 95-percent cofidence intervals.

The confidence intervals greatly overlap each other, meaning there is no difference in the expression of PRM1 in the different types of tissues.

Probing for a probe

There are 41,078 distinct probes in NCI60. PRM1 in the above example was selected literally at random. In what follows, we’ll identify candidate probes that might be more closely related to tissue type than PRM1. It’s impractical to look through 41,078 different probes individually. We need a way for the computer to evaluate each probe on its own. We use the \(R^2\) statistic here which provides a measure of how much of the variation is one variable, called a response variable, is accounted for other, explanatory variables.

r2 <- function(data) {
  rsquared(lm(data$expression ~ data$tissue))
}

One strategy for finding probes that are strongly connected with tissue type is to find the \(R^2\) for each probe.

ProbeR2 <-
  Narrow %>% 
  group_by(Probe) %>% 
  do(r2 = r2(.)) %>% 
  mutate(r2 = unlist(r2))

Next, here are statements to pull out the 30 probes with the largest \(R^2\) and order them from largest \(R^2\) to smallest.

Actual <-
  ProbeR2 %>% 
  arrange(desc(r2)) %>% 
  head(30) %>% 
  mutate(Probe = reorder(Probe, desc(r2)))

Actual %>% 
  ggplot(aes(x= Probe, y=r2)) +
  geom_point()+
  theme(axis.text.x = element_text(angle = 45, hjust=1))

Figure 3: Probes with the largest \(R^2\) for gene expression level.

Finally, let’s graph one of the probes with the highest \(R^2\). Here, I choose to graph the probe CD53 because it is right at that 95% mark.

Probe_CD53 <-
  Narrow %>% 
  filter(Probe == "CD53")

SummaryStats1 <-
  Probe_CD53 %>%
  group_by(tissue) %>% 
  summarize(mn = mean(expression, na.rm=TRUE),
            se = sd(expression, na.rm=TRUE)/sqrt(n()))

SummaryStats1 <-
  SummaryStats1 %>% 
  mutate(tissue = reorder(tissue, mn))

SummaryStats1 <-
  SummaryStats1 %>%
  mutate(top = mn + 2 * se,
         bottom = mn - 2 * se)

SummaryStats1 %>%
  ggplot(aes(x = tissue, y = exp(mn))) +
  geom_errorbar(aes(x= tissue,
                    ymax= exp(top),
                    ymin= exp(bottom)), width=0.5)+
  geom_point(data = Probe_CD53, aes(x=tissue, y=exp(expression)))+
  theme(axis.text.x = element_text(angle = 30, hjust=1))

Figure 4: CD53 mean expression by tissue types with 95-percent confidence intervals.

As you could see, CD53 confidence intervals all greatly overlap but for Leukemia, showing that this probe is strongly related to Leukemia.

False discoveries

A major concern when selecting results from ten of thousands of possibilities is the possiblity of false discovery. Even if the probe expression is unrelated to tissue type, examining the probes with the highest \(R^2\) will select out those that, perhaps by chance, have a relationship. We need to determine if the high \(R^2\) is due to chance selection.

We can get an idea of what sort of role chance plays by examining a situation where only chance is at play. This is the Null hypothesis. By comparing the actual statitistic (\(R^2\) in this example) to that of the Null hypothesis, we can determine wheteher the observed statistic has arisen in a world where the Null Hypothesis is true. This probability is the p-value. Only when this probability is small can you reject the Null Hypothesis. Let’s create a world where the Null hypothesis is true. For this gene-expression example, an appropriate Null Hypothesis is that gene expression is unrelated to tissue time. To do so, we shuffle the data so that the expression levels are unrelated with tissue type.

NullR2 <-
  Narrow %>% 
  group_by(Probe) %>% 
  mutate(expression = shuffle(expression)) %>% 
  group_by(Probe) %>% 
  do(r2 = r2(.)) %>% 
  mutate(r2= unlist(r2))

By comparing the distributions of \(R^2\) from the actual data to the suffled data, we can get an idea of the extent to which the actual data is different.

ProbeR2 %>% 
  ggplot(aes(x=r2))+
  geom_density(fill="gray30", color=NA)+
  geom_density(data=NullR2, aes(x=r2),
               fill="gray80", alpha=.75, color=NA)

Figure 5: Comparing the distribution of \(R^2\) for the actual data (dark gray) to that for the null hypothesis data (light gray).

From this figure, we can see that there are hardly any null-hypothesis probes with \(R^2\) greater than 0.30. What’s at issue is not whether the Null Hypothesis can produce \(R^2\) > 30, but what the highest \(R^2\) values will be when selected from 32,344 candidates. That sort of question can be answered by comparing the very highest \(R^2\) probes from the Null to the highest \(R^2\) from the actual data.

Null <-
  NullR2 %>% 
  arrange(desc(r2)) %>% 
  head(30)

Actual$null <- Null$r2
Actual %>% 
  ggplot(aes(x=Probe, y=r2)) +
  geom_point() +
  geom_point(aes(y=null), color="gray50") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

Figure 6: Comparing the highest \(R^2\) from the actual data and the Null

We can see that none of the top 30 \(R^2\) values for the actual data lie anywhere near those from the Null Hypothesis. Therefore, their high correlations to their different tissue types have a very low probability of being due to chance.

Conclusion

We were able to to find the probes with the expressions that have the highest correlation to specific tissue types. We were also able to show that the high correlation between the expression of these genes and the tissue types was not due to chance. Therefore, we can confidently move forward and analyze what roles those gene expressions play in their particular type of cancers. It would be really interesting to look for genes that have a high correlation with many types of cancer and analyze them to understand why they are so prominent across multiple cancer types.

References

Kaplan A. Project: Statistics of Gene Expression. Data Computing, 175-181.
Gene probe. A Dictionary of Biology. Encyclopedia.com. Nov 15. 2018 https://www.encyclopedia.com.
Tamborero, David et al. “Comprehensive identification of mutational cancer driver genes across 12 tumor types” Scientific reports vol. 3 2650. 2 Oct. 2013, doi:10.1038/srep02650