This module is very similar to Module 2 from the course. In this module, and the ones that follow, you can see how to use Python to repeat the analysis we have already done in R.

First we import the pandas and numpy modules and read in the data from a CSV file. To make the data match what we used in R, we then drop the columns listed in the code below with brca_clin_df.drop.

import pandas as pd
import numpy as np
brca_clin_df = pd.read_csv('/home/alex/brca_clin_df.csv')
brca_clin_df = brca_clin_df.drop(columns = ["Unnamed: 0", "OS", "OS.time", "DSS", "DSS.time", "DFI", "DFI.time", "PFI", "PFI.time"])
brca_clin_df
##      bcr_patient_barcode  ... lab_proc_her2_neu_immunohistochemistry_receptor_status
## 0           TCGA-3C-AAAU  ...                                           Negative    
## 1           TCGA-3C-AALI  ...                                           Positive    
## 2           TCGA-3C-AALJ  ...                                      Indeterminate    
## 3           TCGA-3C-AALK  ...                                           Positive    
## 4           TCGA-4H-AAAK  ...                                          Equivocal    
## ...                  ...  ...                                                ...    
## 1077        TCGA-WT-AB44  ...                                           Negative    
## 1078        TCGA-XX-A899  ...                                           Negative    
## 1079        TCGA-XX-A89A  ...                                           Negative    
## 1080        TCGA-Z7-A8R5  ...                                           Negative    
## 1081        TCGA-Z7-A8R6  ...                                           Negative    
## 
## [1082 rows x 29 columns]
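
A column named "Unnamed: 0", like the one dropped above, typically appears when a row index was written to the CSV file without a header. The sketch below illustrates this with a tiny made-up CSV held in memory (the patient IDs and values are hypothetical, not the course data). It shows the drop approach used above next to one possible alternative, index_col=0.

```python
import io
import pandas as pd

# A tiny made-up CSV whose first column has no header, as happens when
# a row index is written out alongside the data (hypothetical values).
csv_text = ",bcr_patient_barcode,gender,OS\n0,P001,FEMALE,0\n1,P002,MALE,1\n"

# Option 1: read everything, then drop the unwanted columns by name.
toy_df = pd.read_csv(io.StringIO(csv_text))
dropped = toy_df.drop(columns=["Unnamed: 0", "OS"])

# Option 2: tell read_csv to use the first column as the row index,
# so it never shows up as a data column.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0).drop(columns=["OS"])

print(list(toy_df.columns))
print(list(dropped.columns))
print(list(indexed.columns))
```

Both options leave the same two data columns. Which one is preferable depends on whether the saved index carries any meaning.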

When you run the code chunk below, you should see that the data frame has the same number of rows and columns as the one we used in R.

print("The clinical data frame has " + str(np.shape(brca_clin_df)[0]) + " rows and " + str(np.shape(brca_clin_df)[1]) + " columns.")
## The clinical data frame has 1082 rows and 29 columns.
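
As a side note, the same sentence can be built with the data frame's own .shape attribute and an f-string. This is a minimal sketch on a made-up data frame (toy_df is a hypothetical name), not a replacement for the code above.

```python
import numpy as np
import pandas as pd

# A small made-up data frame standing in for the clinical data.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# np.shape(df) and df.shape give the same (rows, columns) tuple.
same_shape = np.shape(toy_df) == toy_df.shape
n_rows, n_cols = toy_df.shape

# An f-string inserts the numbers without calling str() on each one.
sentence = f"The toy data frame has {n_rows} rows and {n_cols} columns."
print(same_shape)
print(sentence)
```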

The sentence printed by the code chunk above is the same sentence we created in R. Notice the differences in the code: for example, Python joins strings with + and needs str() to turn each number into text first.

We can see the columns in the data frame by putting .columns after the name of the data frame.

brca_clin_df.columns
## Index(['bcr_patient_barcode', 'gender', 'race', 'ethnicity',
##        'age_at_diagnosis', 'year_of_initial_pathologic_diagnosis',
##        'vital_status', 'menopause_status', 'tumor_status', 'margin_status',
##        'days_to_last_followup', 'prior_dx',
##        'new_tumor_event_after_initial_treatment', 'radiation_therapy',
##        'tissue_source_site', 'histological_type', 'pathologic_T',
##        'pathologic_M', 'pathologic_N', 'pathologic_stage',
##        'lymph_node_examined_count', 'number_of_lymphnodes_positive_by_he',
##        'initial_pathologic_diagnosis_method',
##        'axillary_lymph_node_stage_method_type',
##        'breast_carcinoma_surgical_procedure_name',
##        'anatomic_neoplasm_subdivision',
##        'breast_carcinoma_estrogen_receptor_status',
##        'breast_carcinoma_progesterone_receptor_status',
##        'lab_proc_her2_neu_immunohistochemistry_receptor_status'],
##       dtype='object')
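
When the printed Index is long, it can help to turn it into a plain Python list or to check for a single name. A minimal sketch on a made-up data frame (toy_df and its values are hypothetical):

```python
import pandas as pd

# A small made-up data frame with two of the column names seen above.
toy_df = pd.DataFrame({"gender": ["FEMALE", "MALE"], "race": ["A", "B"]})

# Convert the Index of column names to an ordinary list.
column_names = toy_df.columns.tolist()

# Membership tests work directly on .columns.
has_gender = "gender" in toy_df.columns
has_height = "height" in toy_df.columns

print(column_names, has_gender, has_height)
```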

Similar to table() in R, we can use .value_counts to count how many rows have each unique value of a column.

brca_clin_df.value_counts("gender")
## gender
## FEMALE    1070
## MALE        12
## dtype: int64
brca_clin_df.value_counts("breast_carcinoma_estrogen_receptor_status")
## breast_carcinoma_estrogen_receptor_status
## Positive           796
## Negative           236
## [Not Evaluated]     48
## Indeterminate        2
## dtype: int64
brca_clin_df.value_counts("breast_carcinoma_progesterone_receptor_status")
## breast_carcinoma_progesterone_receptor_status
## Positive           689
## Negative           340
## [Not Evaluated]     49
## Indeterminate        4
## dtype: int64
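
The counts above include labels such as [Not Evaluated] because they are stored as ordinary text. True missing values behave differently: value_counts skips them unless asked not to. The sketch below uses a made-up Series (the name status and its values are hypothetical) to show the dropna and normalize options.

```python
import pandas as pd

# A made-up status column with one genuinely missing value.
status = pd.Series(["Positive", "Positive", "Negative", None, "Positive"],
                   name="status")

# By default, missing values are left out of the counts.
counts = status.value_counts()

# dropna=False adds a separate entry for missing values.
counts_with_na = status.value_counts(dropna=False)

# normalize=True returns proportions of the non-missing values.
proportions = status.value_counts(normalize=True)

print(counts)
print(counts_with_na)
print(proportions)
```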

In the code chunk below, we filter the data frame in two steps: first we keep the rows where breast_carcinoma_estrogen_receptor_status is Positive, then, from those, the rows where breast_carcinoma_progesterone_receptor_status is also Positive. .shape[0] is equivalent to nrow in R; the zero in the square brackets selects the number of rows. The chunk ends with the value counts for the HER2 status column.

est = brca_clin_df[brca_clin_df["breast_carcinoma_estrogen_receptor_status"] == "Positive"]
both = est[est["breast_carcinoma_progesterone_receptor_status"] == "Positive"]
both.shape[0]
## 672
brca_clin_df.value_counts("lab_proc_her2_neu_immunohistochemistry_receptor_status")
## lab_proc_her2_neu_immunohistochemistry_receptor_status
## Negative           554
## Equivocal          177
## [Not Evaluated]    170
## Positive           161
## Indeterminate       12
## [Not Available]      8
## dtype: int64
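
The two-step filter used above for the estrogen and progesterone receptor columns can also be written as a single boolean mask. This is a minimal sketch on made-up data (the short column names er_status and pr_status are hypothetical stand-ins for the long names in brca_clin_df).

```python
import pandas as pd

# A small made-up data frame with two receptor status columns.
toy_df = pd.DataFrame({
    "er_status": ["Positive", "Positive", "Negative", "Positive"],
    "pr_status": ["Positive", "Negative", "Negative", "Positive"],
})

# Two-step filtering, as in the chunk above.
est = toy_df[toy_df["er_status"] == "Positive"]
both_two_step = est[est["pr_status"] == "Positive"]

# One-step filtering: combine the conditions with &.
# Each comparison needs its own parentheses.
mask = (toy_df["er_status"] == "Positive") & (toy_df["pr_status"] == "Positive")
both_one_step = toy_df[mask]

# The mask is a column of True/False values, so its sum is the row count.
print(both_two_step.shape[0], both_one_step.shape[0], int(mask.sum()))
```

Both approaches keep the same rows. The single mask avoids the intermediate data frame.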

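In the code chunk below, we take specific columns from brca_clin_df and create a new data frame called receptor_status. Because pd.DataFrame treats each Series in the list as a row, we transpose the result so that each Series becomes a column. We then look at the dimensions of receptor_status (np.shape) and at the first 5 rows of the data.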

sample_id = brca_clin_df["bcr_patient_barcode"]
er_status = brca_clin_df["breast_carcinoma_estrogen_receptor_status"]
pro_status = brca_clin_df["breast_carcinoma_progesterone_receptor_status"]
her2_status = brca_clin_df["lab_proc_her2_neu_immunohistochemistry_receptor_status"]
receptor_status = pd.DataFrame([sample_id, er_status, pro_status, her2_status])
receptor_status = receptor_status.transpose()

np.shape(receptor_status)
## (1082, 4)
receptor_status.head()
##   bcr_patient_barcode  ... lab_proc_her2_neu_immunohistochemistry_receptor_status
## 0        TCGA-3C-AAAU  ...                                           Negative    
## 1        TCGA-3C-AALI  ...                                           Positive    
## 2        TCGA-3C-AALJ  ...                                      Indeterminate    
## 3        TCGA-3C-AALK  ...                                           Positive    
## 4        TCGA-4H-AAAK  ...                                          Equivocal    
## 
## [5 rows x 4 columns]
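
A more direct way to pick several columns is to index the data frame with a list of column names. The sketch below compares that with the stack-and-transpose approach on made-up data (toy_df, its column names and its values are hypothetical).

```python
import pandas as pd

# A small made-up data frame.
toy_df = pd.DataFrame({
    "id": ["P1", "P2", "P3"],
    "er_status": ["Positive", "Negative", "Positive"],
    "age": [50, 61, 47],
})

# Approach used above: build rows from the Series, then transpose.
stacked = pd.DataFrame([toy_df["id"], toy_df["er_status"]]).transpose()

# Direct approach: select the columns with a list of names.
selected = toy_df[["id", "er_status"]]

print(stacked.shape, selected.shape)
print(list(stacked.columns) == list(selected.columns))
```

Both results have the same shape, column names and values. The direct selection also keeps each column's original data type, which the transpose route does not guarantee.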

Below, we create a data frame called tnbc that contains the rows of receptor_status in which breast_carcinoma_estrogen_receptor_status, breast_carcinoma_progesterone_receptor_status and lab_proc_her2_neu_immunohistochemistry_receptor_status are all Negative. We then look at the head and the number of rows of tnbc.

es = receptor_status[(receptor_status["breast_carcinoma_estrogen_receptor_status"] == "Negative")]
pr = es[(es["breast_carcinoma_progesterone_receptor_status"] == "Negative")]
tnbc = pr[(pr["lab_proc_her2_neu_immunohistochemistry_receptor_status"] == "Negative")]
tnbc.head()
##    bcr_patient_barcode  ... lab_proc_her2_neu_immunohistochemistry_receptor_status
## 15        TCGA-A1-A0SK  ...                                           Negative    
## 19        TCGA-A1-A0SP  ...                                           Negative    
## 26        TCGA-A2-A04U  ...                                           Negative    
## 33        TCGA-A2-A0CM  ...                                           Negative    
## 46        TCGA-A2-A0D0  ...                                           Negative    
## 
## [5 rows x 4 columns]
tnbc.shape[0]
## 114
#tnbc.match(tnbc["bcr_patient_barcode"], data["bcr_patient_barcode"])
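
The three-step filter above can also be expressed as one test across all three status columns. The commented-out line appears to aim at matching barcodes back to another data frame; that reading is an assumption, and isin is one common way to do such a lookup in pandas. The sketch below uses made-up data (clinical, the short column names and all values are hypothetical).

```python
import pandas as pd

# A small made-up clinical table with three receptor status columns.
clinical = pd.DataFrame({
    "id": ["P1", "P2", "P3", "P4"],
    "er": ["Negative", "Negative", "Positive", "Negative"],
    "pr": ["Negative", "Negative", "Negative", "Negative"],
    "her2": ["Negative", "Positive", "Negative", "Negative"],
    "age": [50, 61, 47, 39],
})

status_cols = ["er", "pr", "her2"]

# True for rows where every status column equals "Negative".
all_negative = clinical[status_cols].eq("Negative").all(axis=1)
tnbc_toy = clinical.loc[all_negative, ["id"] + status_cols]

# Look up the full rows for those IDs in the original table.
tnbc_full = clinical[clinical["id"].isin(tnbc_toy["id"])]

print(tnbc_toy["id"].tolist())
print(tnbc_full.shape)
```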

Practice 1

When you run the code chunk below, you will see the table for the histological_type column of brca_clin_df.

brca_clin_df.value_counts("histological_type")
## histological_type
## Infiltrating Ductal Carcinoma       774
## Infiltrating Lobular Carcinoma      201
## Other, specify                       45
## Mixed Histology (please specify)     29
## Mucinous Carcinoma                   17
## Metaplastic Carcinoma                 8
## Medullary Carcinoma                   6
## [Not Available]                       1
## Infiltrating Carcinoma NOS            1
## dtype: int64

Look at the columns of brca_clin_df (see the output of brca_clin_df.columns above) and replace the word choice with one of the column names to see the table for that column. Remember to uncomment the line of code.

#brca_clin_df.value_counts("choice")

Practice 2

When you run the code chunk below, you see the dimensions of receptor_status and a shortened view of its first and last rows.

np.shape(receptor_status)
## (1082, 4)
receptor_status
##      bcr_patient_barcode  ... lab_proc_her2_neu_immunohistochemistry_receptor_status
## 0           TCGA-3C-AAAU  ...                                           Negative    
## 1           TCGA-3C-AALI  ...                                           Positive    
## 2           TCGA-3C-AALJ  ...                                      Indeterminate    
## 3           TCGA-3C-AALK  ...                                           Positive    
## 4           TCGA-4H-AAAK  ...                                          Equivocal    
## ...                  ...  ...                                                ...    
## 1077        TCGA-WT-AB44  ...                                           Negative    
## 1078        TCGA-XX-A899  ...                                           Negative    
## 1079        TCGA-XX-A89A  ...                                           Negative    
## 1080        TCGA-Z7-A8R5  ...                                           Negative    
## 1081        TCGA-Z7-A8R6  ...                                           Negative    
## 
## [1082 rows x 4 columns]

Replace the words Feature_value with a feature value of your choice. (To find your options, look at the value counts for the estrogen and progesterone receptor status columns above.) Remember to uncomment the lines of code.

#hormone_receptor_positive1 = receptor_status[(receptor_status["breast_carcinoma_estrogen_receptor_status"] == "Feature_value")]
#hormone_receptor_positive = hormone_receptor_positive1[(hormone_receptor_positive1["breast_carcinoma_progesterone_receptor_status"] == "Feature_value")]
#hormone_receptor_positive.head()