This module does is very similar to Module 5 from the course. With this module, and the ones to follow, you can see how we can use Python to do the analysis we have already done in R. The beginning of this module is very similar to the Python module 4.
First we need to import our modules. We then read in the data as csv files. In order to make the data the same as the data we used while working in R, we are going to drop the columns below.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import plotly
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from sklearn import preprocessing
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial import distance_matrix
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as shc
brca_expr_mat = pd.read_csv('/home/alex/brca_expr_mat.csv')
brca_expr_mat_log = pd.read_csv('/home/alex/brca_expr_mat_log.csv')
brca_expr_mat = brca_expr_mat.set_index("Unnamed: 0")
brca_expr_mat_log = brca_expr_mat_log.set_index("Unnamed: 0")
print("The gene expression matrix has " + str(np.shape(brca_expr_mat_log)[0]) + " rows and " + str(np.shape(brca_expr_mat_log)[1]) + " columns.")
## The gene expression matrix has 18351 rows and 1082 columns.
First we will look at the head of brca_expr_mat_log.
brca_expr_mat_log.head()
## TCGA-3C-AAAU TCGA-3C-AALI ... TCGA-Z7-A8R5 TCGA-Z7-A8R6
## Unnamed: 0 ...
## TSPAN6 7.563646 7.705439 ... 10.366847 9.405490
## TNMD 0.427284 1.061776 ... 2.530695 1.454123
## DPM1 9.036795 9.662019 ... 9.223328 9.337278
## SCYL3 8.349352 10.607275 ... 8.963280 9.027632
## C1orf112 6.969174 8.106778 ... 7.125238 9.116347
##
## [5 rows x 1082 columns]
Now we will find the 500 most variable genes.
df2 = brca_expr_mat.transpose()
NUM_GENES = 500
V = df2.var()
O = (-V).argsort()
var_genes = V.iloc[O]
brca_log_sub = brca_expr_mat_log.iloc[O[0:NUM_GENES]]
Let’s take a look at the head of brca_log_sub.
brca_log_sub.head()
## TCGA-3C-AAAU TCGA-3C-AALI ... TCGA-Z7-A8R5 TCGA-Z7-A8R6
## Unnamed: 0 ...
## CPB1 16.161348 4.243707 ... 11.611555 8.947596
## COL1A1 15.762154 17.451710 ... 16.138004 16.391566
## COL1A2 15.081309 16.591093 ... 15.580200 15.505306
## FN1 15.493033 16.703795 ... 13.791102 16.051126
## COL3A1 14.947623 16.132576 ... 15.407580 15.183748
##
## [5 rows x 1082 columns]
Now we will scale the brca_log_sub data frame.
b = brca_log_sub.transpose()
b1 = preprocessing.scale(b)
b2 = pd.DataFrame(b1, index = b.index, columns = b.columns)
brca_log_sub_scale = b2.transpose()
brca_log_sub_scale
## TCGA-3C-AAAU TCGA-3C-AALI ... TCGA-Z7-A8R5 TCGA-Z7-A8R6
## Unnamed: 0 ...
## CPB1 1.863869 -0.445328 ... 0.982288 0.466111
## COL1A1 -0.465844 0.659024 ... -0.215611 -0.046796
## COL1A2 -0.637834 0.379305 ... -0.301732 -0.352187
## FN1 -0.426909 0.423212 ... -1.621897 -0.035051
## COL3A1 -0.626509 0.148882 ... -0.325530 -0.471998
## ... ... ... ... ... ...
## PLEC 0.949011 0.942321 ... -0.826189 0.477968
## RPS10 -0.671866 -1.026790 ... 1.792479 -0.288156
## TMEM66 1.099204 1.180778 ... 0.485094 -1.339690
## RNF213 0.248104 0.624964 ... -1.426613 -0.530906
## STMN1 -0.020757 0.188477 ... -0.741528 1.641815
##
## [500 rows x 1082 columns]
Below is the heatmap of every fourth sample from brca_log_sub_scale.
I_sample = np.linspace(start = 0, stop = 1080, num = 271)
I_sample = np.trunc(I_sample)
brca_scale_sub = brca_log_sub_scale.iloc[0:500, I_sample]
sns.heatmap(brca_scale_sub, cmap = "RdBu", cbar = False, yticklabels = False, xticklabels = False, vmin = -2, vmax = 2)
plt.ylabel(" ")
plt.show()
Now, we will find the distance matrix for the rows and columns.
our_matrix = brca_log_sub_scale.iloc[0:500, I_sample]
row_dist = distance_matrix(our_matrix.values, our_matrix.values)
our_matrix_t = our_matrix.transpose()
col_dist = distance_matrix(our_matrix_t.values, our_matrix_t.values)
dend = shc.linkage(col_dist, method = 'complete')
## /home/alex/miniconda/envs/r-reticulate/bin/python:1: ClusterWarning:
##
## scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
dend1 = shc.linkage(row_dist, method = 'complete')
Below is the heatmap of our_matrix with the dendrograms from above.
plt.show(sns.clustermap(our_matrix, cmap = "RdBu", yticklabels = False, xticklabels = False, vmin = -2, vmax = 2, row_linkage = dend1, col_linkage = dend))
In the code chunk below, we are looking at the dendrogram for the row clusters.
plt.show(shc.dendrogram(dend1))
We will now use the cut_tree function to cut the tree into 4 clusters. Then, we will look at the first 25 elements.
cluster = shc.cut_tree(dend1, 4)
cluster[0:24]
## array([[0],
## [1],
## [1],
## [1],
## [1],
## [0],
## [0],
## [2],
## [2],
## [1],
## [2],
## [2],
## [2],
## [0],
## [2],
## [2],
## [0],
## [2],
## [2],
## [0],
## [0],
## [0],
## [0],
## [2]])
We want to make sure that Python split dend1 into 4 clusters, so we will use the unique function to check that it has.
np.unique(cluster)
## array([0, 1, 2, 3])
After running this code chunk we can see that there are 4 unique clusters, and Python did what we asked it to.