This module does is very similar to Module 5 from the course. With this module, and the ones to follow, you can see how we can use Python to do the analysis we have already done in R. The beginning of this module is very similar to the Python module 4.

First we need to import our modules. We then read in the data as csv files. In order to make the data the same as the data we used while working in R, we are going to drop the columns below.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import plotly
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from sklearn import preprocessing
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial import distance_matrix
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as shc
brca_expr_mat = pd.read_csv('/home/alex/brca_expr_mat.csv')
brca_expr_mat_log = pd.read_csv('/home/alex/brca_expr_mat_log.csv')
brca_expr_mat = brca_expr_mat.set_index("Unnamed: 0")
brca_expr_mat_log = brca_expr_mat_log.set_index("Unnamed: 0")
print("The gene expression matrix has " + str(np.shape(brca_expr_mat_log)[0]) + " rows and " + str(np.shape(brca_expr_mat_log)[1]) + " columns.")
## The gene expression matrix has 18351 rows and 1082 columns.

First we will look at the head of brca_expr_mat_log.

brca_expr_mat_log.head()
##             TCGA-3C-AAAU  TCGA-3C-AALI  ...  TCGA-Z7-A8R5  TCGA-Z7-A8R6
## Unnamed: 0                              ...                            
## TSPAN6          7.563646      7.705439  ...     10.366847      9.405490
## TNMD            0.427284      1.061776  ...      2.530695      1.454123
## DPM1            9.036795      9.662019  ...      9.223328      9.337278
## SCYL3           8.349352     10.607275  ...      8.963280      9.027632
## C1orf112        6.969174      8.106778  ...      7.125238      9.116347
## 
## [5 rows x 1082 columns]

Now we will find the 500 most variable genes.

df2 = brca_expr_mat.transpose()
NUM_GENES = 500
V = df2.var()
O = (-V).argsort()
var_genes = V.iloc[O]
brca_log_sub = brca_expr_mat_log.iloc[O[0:NUM_GENES]]

Let’s take a look at the head of brca_log_sub.

brca_log_sub.head()
##             TCGA-3C-AAAU  TCGA-3C-AALI  ...  TCGA-Z7-A8R5  TCGA-Z7-A8R6
## Unnamed: 0                              ...                            
## CPB1           16.161348      4.243707  ...     11.611555      8.947596
## COL1A1         15.762154     17.451710  ...     16.138004     16.391566
## COL1A2         15.081309     16.591093  ...     15.580200     15.505306
## FN1            15.493033     16.703795  ...     13.791102     16.051126
## COL3A1         14.947623     16.132576  ...     15.407580     15.183748
## 
## [5 rows x 1082 columns]

Now we will scale the brca_log_sub data frame.

b = brca_log_sub.transpose()
b1 = preprocessing.scale(b)
b2 = pd.DataFrame(b1, index = b.index, columns = b.columns)
brca_log_sub_scale = b2.transpose()
brca_log_sub_scale
##             TCGA-3C-AAAU  TCGA-3C-AALI  ...  TCGA-Z7-A8R5  TCGA-Z7-A8R6
## Unnamed: 0                              ...                            
## CPB1            1.863869     -0.445328  ...      0.982288      0.466111
## COL1A1         -0.465844      0.659024  ...     -0.215611     -0.046796
## COL1A2         -0.637834      0.379305  ...     -0.301732     -0.352187
## FN1            -0.426909      0.423212  ...     -1.621897     -0.035051
## COL3A1         -0.626509      0.148882  ...     -0.325530     -0.471998
## ...                  ...           ...  ...           ...           ...
## PLEC            0.949011      0.942321  ...     -0.826189      0.477968
## RPS10          -0.671866     -1.026790  ...      1.792479     -0.288156
## TMEM66          1.099204      1.180778  ...      0.485094     -1.339690
## RNF213          0.248104      0.624964  ...     -1.426613     -0.530906
## STMN1          -0.020757      0.188477  ...     -0.741528      1.641815
## 
## [500 rows x 1082 columns]

Below is the heatmap of every fourth sample from brca_log_sub_scale.

I_sample = np.linspace(start = 0, stop = 1080, num = 271)
I_sample = np.trunc(I_sample)
brca_scale_sub = brca_log_sub_scale.iloc[0:500, I_sample]
sns.heatmap(brca_scale_sub, cmap = "RdBu", cbar = False, yticklabels = False, xticklabels = False, vmin = -2, vmax = 2)
plt.ylabel(" ")
plt.show()

Now, we will find the distance matrix for the rows and columns.

our_matrix = brca_log_sub_scale.iloc[0:500, I_sample]
row_dist = distance_matrix(our_matrix.values, our_matrix.values)
our_matrix_t = our_matrix.transpose()
col_dist = distance_matrix(our_matrix_t.values, our_matrix_t.values)
dend = shc.linkage(col_dist, method = 'complete')
## /home/alex/miniconda/envs/r-reticulate/bin/python:1: ClusterWarning:
## 
## scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
dend1 = shc.linkage(row_dist, method = 'complete')

Below is the heatmap of our_matrix with the dendrograms from above.

plt.show(sns.clustermap(our_matrix, cmap = "RdBu", yticklabels = False, xticklabels = False, vmin = -2, vmax = 2, row_linkage = dend1, col_linkage = dend))

In the code chunk below, we are looking at the dendrogram for the row clusters.

plt.show(shc.dendrogram(dend1))

We will now use the cut_tree function to cut the tree into 4 clusters. Then, we will look at the first 25 elements.

cluster = shc.cut_tree(dend1, 4)
cluster[0:24]
## array([[0],
##        [1],
##        [1],
##        [1],
##        [1],
##        [0],
##        [0],
##        [2],
##        [2],
##        [1],
##        [2],
##        [2],
##        [2],
##        [0],
##        [2],
##        [2],
##        [0],
##        [2],
##        [2],
##        [0],
##        [0],
##        [0],
##        [0],
##        [2]])

We want to make sure that Python split dend1 into 4 clusters, so we will use the unique function to check that it has.

np.unique(cluster)
## array([0, 1, 2, 3])

After running this code chunk we can see that there are 4 unique clusters, and Python did what we asked it to.