This module does is very similar to Module 4 from the course. With this module, and the ones to follow, you can see how we can use Python to do the analysis we have already done in R.
First we need to import our modules. We then read in the data as a csv file. In order to make the data the same as the data we used while working in R, we are going to drop the columns below (brca_expr).
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import plotly
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from sklearn import preprocessing
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial import distance_matrix
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as shc
brca_expr = pd.read_csv('/home/alex/brca_expr_norm_names_df.csv')
brca_expr = brca_expr.drop(columns = ["Unnamed: 0"])
brca_expr = brca_expr.set_index("symbol")
print("The gene expression matrix has " + str(np.shape(brca_expr)[0]) + " rows and " + str(np.shape(brca_expr)[1]) + " columns.")
## The gene expression matrix has 18351 rows and 1082 columns.
We can see that the sentence created when you run the code chunk above is the same sentence we created in R.
Now, let’s look at our data.
brca_expr.head()
## TCGA-3C-AAAU TCGA-3C-AALI ... TCGA-Z7-A8R5 TCGA-Z7-A8R6
## symbol ...
## TSPAN6 188.1840 207.7220 ... 1319.4800 677.1640
## TNMD 0.3447 1.0875 ... 4.7785 1.7399
## DPM1 524.2260 809.1350 ... 596.7210 645.8460
## SCYL3 325.1410 1558.9400 ... 498.1330 520.9010
## C1orf112 124.2940 274.6660 ... 138.6080 554.0010
##
## [5 rows x 1082 columns]
We can see that this data is the same as the data in Module 4.
The code chunk below is the same code chunk from Module 3 where we found the log of base 2 of the values in the data frame.
df3 = pd.DataFrame(brca_expr)
df = round(df3, 1)
dfadd = df + 1
brca_expr_log = np.log2(dfadd)
brca_expr_log.head()
## TCGA-3C-AAAU TCGA-3C-AALI ... TCGA-Z7-A8R5 TCGA-Z7-A8R6
## symbol ...
## TSPAN6 7.563768 7.705287 ... 10.366869 9.405567
## TNMD 0.378512 1.070389 ... 2.536053 1.432959
## DPM1 9.036723 9.661956 ... 9.223278 9.337176
## SCYL3 8.349171 10.607238 ... 8.963185 9.027630
## C1orf112 6.969243 8.106955 ... 7.125155 9.116344
##
## [5 rows x 1082 columns]
Now, we do something similar to what we did in R. We take the variance of the data frame and sort the values of the data frame. We then rank the values using .argsort(). Then, we look create brca_log_sub which consists of the 500 most variable genes.
df2 = df.transpose()
NUM_GENES = 500
V = df2.var()
O = (-V).argsort()
var_genes = V.iloc[O]
brca_log_sub = brca_expr_log.iloc[O[0:NUM_GENES]]
Let’s look at brca_log_sub.
brca_log_sub.head()
## TCGA-3C-AAAU TCGA-3C-AALI ... TCGA-Z7-A8R5 TCGA-Z7-A8R6
## symbol ...
## CPB1 16.161348 4.240314 ... 11.611578 8.947491
## COL1A1 15.762154 17.451710 ... 16.138004 16.391566
## COL1A2 15.081309 16.591093 ... 15.580200 15.505306
## FN1 15.493033 16.703795 ... 13.791102 16.051126
## COL3A1 14.947623 16.132576 ... 15.407580 15.183748
##
## [5 rows x 1082 columns]
We will use np.linspace to sequence every fourth number. Then we subset brca_log_sub to take every fourth number, and we create a data frame named brca_sub. Now, we’ll look at the heatmap of brca_sub.
I_sample = np.linspace(start = 0, stop = 1080, num = 271)
I_sample = np.trunc(I_sample)
brca_sub = brca_log_sub.iloc[0:500,I_sample]
sns.heatmap(brca_sub, cmap = "RdBu", cbar = False, yticklabels = False, xticklabels = False)
plt.ylabel(" ")
plt.show()
Similar to what we did in Module 4 using R, we are going to scale the matrix. First, we need to transpose it and then we can scale it. After we scale it, we then have to transpose the matrix again.
b = brca_log_sub.transpose()
b1 = preprocessing.scale(b)
b2 = pd.DataFrame(b1, index = b.index, columns = b.columns)
brca_log_sub_scale = b2.transpose()
Now, let’s look at the heatmap of the scaled data frame.
sns.heatmap(brca_log_sub_scale.iloc[0:500, I_sample], cmap = "RdBu", cbar = False, yticklabels = False, xticklabels = False, vmin = -2, vmax = 2)
plt.ylabel(" ")
plt.show()
Now we create a data frame called our_matrix, which is every fourth sample from the scaled matrix. Then we create row_dist and col_dist which are the distance matrices for the rows and columns.
our_matrix = brca_log_sub_scale.iloc[0:500, I_sample]
row_dist = pd.DataFrame(distance_matrix(our_matrix.values, our_matrix.values), index = our_matrix.index, columns = our_matrix.index)
our_matrix_t = our_matrix.transpose()
col_dist = pd.DataFrame(distance_matrix(our_matrix_t.values, our_matrix_t.values), index = our_matrix_t.index, columns = our_matrix_t.index)
When you run the code chunk below, you see the dendrogram created from col_dist.
dend = shc.linkage(col_dist, method = 'complete')
## /home/alex/miniconda/envs/r-reticulate/bin/python:1: ClusterWarning:
##
## scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
plt.show(shc.dendrogram(dend))
When you run the code chunk below, you see the dendrogram created from row_dist.
dend1 = shc.linkage(row_dist, method = 'complete')
plt.show(shc.dendrogram(dend1))
Below we cerate a heatmap of our_matrix and we use the dendrograms for the rows and columns to create the dendrograms on the heatmap.
plt.show(sns.clustermap(our_matrix, cmap = "RdBu", yticklabels = False, xticklabels = False, vmin = -2, vmax = 2, row_linkage = dend1, col_linkage = dend))