This module does is very similar to Module 4 from the course. With this module, and the ones to follow, you can see how we can use Python to do the analysis we have already done in R.

First we need to import our modules. We then read in the data as a csv file. In order to make the data the same as the data we used while working in R, we are going to drop the columns below (brca_expr).

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import plotly
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from sklearn import preprocessing
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial import distance_matrix
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as shc
brca_expr = pd.read_csv('/home/alex/brca_expr_norm_names_df.csv')
brca_expr = brca_expr.drop(columns = ["Unnamed: 0"])
brca_expr = brca_expr.set_index("symbol")
print("The gene expression matrix has " + str(np.shape(brca_expr)[0]) + " rows and " + str(np.shape(brca_expr)[1]) + " columns.")
## The gene expression matrix has 18351 rows and 1082 columns.

We can see that the sentence created when you run the code chunk above is the same sentence we created in R.

Now, let’s look at our data.

brca_expr.head()
##           TCGA-3C-AAAU  TCGA-3C-AALI  ...  TCGA-Z7-A8R5  TCGA-Z7-A8R6
## symbol                                ...                            
## TSPAN6        188.1840      207.7220  ...     1319.4800      677.1640
## TNMD            0.3447        1.0875  ...        4.7785        1.7399
## DPM1          524.2260      809.1350  ...      596.7210      645.8460
## SCYL3         325.1410     1558.9400  ...      498.1330      520.9010
## C1orf112      124.2940      274.6660  ...      138.6080      554.0010
## 
## [5 rows x 1082 columns]

We can see that this data is the same as the data in Module 4.

The code chunk below is the same code chunk from Module 3 where we found the log of base 2 of the values in the data frame.

df3 = pd.DataFrame(brca_expr)
df = round(df3, 1)
dfadd = df + 1
brca_expr_log = np.log2(dfadd)
brca_expr_log.head()
##           TCGA-3C-AAAU  TCGA-3C-AALI  ...  TCGA-Z7-A8R5  TCGA-Z7-A8R6
## symbol                                ...                            
## TSPAN6        7.563768      7.705287  ...     10.366869      9.405567
## TNMD          0.378512      1.070389  ...      2.536053      1.432959
## DPM1          9.036723      9.661956  ...      9.223278      9.337176
## SCYL3         8.349171     10.607238  ...      8.963185      9.027630
## C1orf112      6.969243      8.106955  ...      7.125155      9.116344
## 
## [5 rows x 1082 columns]

Now, we do something similar to what we did in R. We take the variance of the data frame and sort the values of the data frame. We then rank the values using .argsort(). Then, we look create brca_log_sub which consists of the 500 most variable genes.

df2 = df.transpose()
NUM_GENES = 500
V = df2.var()
O = (-V).argsort()
var_genes = V.iloc[O]
brca_log_sub = brca_expr_log.iloc[O[0:NUM_GENES]]

Let’s look at brca_log_sub.

brca_log_sub.head()
##         TCGA-3C-AAAU  TCGA-3C-AALI  ...  TCGA-Z7-A8R5  TCGA-Z7-A8R6
## symbol                              ...                            
## CPB1       16.161348      4.240314  ...     11.611578      8.947491
## COL1A1     15.762154     17.451710  ...     16.138004     16.391566
## COL1A2     15.081309     16.591093  ...     15.580200     15.505306
## FN1        15.493033     16.703795  ...     13.791102     16.051126
## COL3A1     14.947623     16.132576  ...     15.407580     15.183748
## 
## [5 rows x 1082 columns]

We will use np.linspace to sequence every fourth number. Then we subset brca_log_sub to take every fourth number, and we create a data frame named brca_sub. Now, we’ll look at the heatmap of brca_sub.

I_sample = np.linspace(start = 0, stop = 1080, num = 271)
I_sample = np.trunc(I_sample)
brca_sub = brca_log_sub.iloc[0:500,I_sample]
sns.heatmap(brca_sub, cmap = "RdBu", cbar = False, yticklabels = False, xticklabels = False)
plt.ylabel(" ")
plt.show()

Similar to what we did in Module 4 using R, we are going to scale the matrix. First, we need to transpose it and then we can scale it. After we scale it, we then have to transpose the matrix again.

b = brca_log_sub.transpose()
b1 = preprocessing.scale(b)
b2 = pd.DataFrame(b1, index = b.index, columns = b.columns)
brca_log_sub_scale = b2.transpose()

Now, let’s look at the heatmap of the scaled data frame.

sns.heatmap(brca_log_sub_scale.iloc[0:500, I_sample], cmap = "RdBu", cbar = False, yticklabels = False, xticklabels = False, vmin = -2, vmax = 2)
plt.ylabel(" ")
plt.show()

Now we create a data frame called our_matrix, which is every fourth sample from the scaled matrix. Then we create row_dist and col_dist which are the distance matrices for the rows and columns.

our_matrix = brca_log_sub_scale.iloc[0:500, I_sample]
row_dist = pd.DataFrame(distance_matrix(our_matrix.values, our_matrix.values), index = our_matrix.index, columns = our_matrix.index)
our_matrix_t = our_matrix.transpose()
col_dist = pd.DataFrame(distance_matrix(our_matrix_t.values, our_matrix_t.values), index = our_matrix_t.index, columns = our_matrix_t.index)

When you run the code chunk below, you see the dendrogram created from col_dist.

dend = shc.linkage(col_dist, method = 'complete')
## /home/alex/miniconda/envs/r-reticulate/bin/python:1: ClusterWarning:
## 
## scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
plt.show(shc.dendrogram(dend))

When you run the code chunk below, you see the dendrogram created from row_dist.

dend1 = shc.linkage(row_dist, method = 'complete')
plt.show(shc.dendrogram(dend1))

Below we cerate a heatmap of our_matrix and we use the dendrograms for the rows and columns to create the dendrograms on the heatmap.

plt.show(sns.clustermap(our_matrix, cmap = "RdBu", yticklabels = False, xticklabels = False, vmin = -2, vmax = 2, row_linkage = dend1, col_linkage = dend))