This module is very similar to Module 2 from the course. In this module and the ones that follow, you can see how to use Python to repeat the analysis we have already done in R.
First we need to import the pandas and numpy modules. We then read in the data from a CSV file. To make the data match the data we used in R, we drop the columns listed in the call to brca_clin_df.drop below.
import pandas as pd
import numpy as np
brca_clin_df = pd.read_csv('/home/alex/brca_clin_df.csv')
brca_clin_df = brca_clin_df.drop(columns = ["Unnamed: 0", "OS", "OS.time", "DSS", "DSS.time", "DFI", "DFI.time", "PFI", "PFI.time"])
brca_clin_df
## bcr_patient_barcode ... lab_proc_her2_neu_immunohistochemistry_receptor_status
## 0 TCGA-3C-AAAU ... Negative
## 1 TCGA-3C-AALI ... Positive
## 2 TCGA-3C-AALJ ... Indeterminate
## 3 TCGA-3C-AALK ... Positive
## 4 TCGA-4H-AAAK ... Equivocal
## ... ... ... ...
## 1077 TCGA-WT-AB44 ... Negative
## 1078 TCGA-XX-A899 ... Negative
## 1079 TCGA-XX-A89A ... Negative
## 1080 TCGA-Z7-A8R5 ... Negative
## 1081 TCGA-Z7-A8R6 ... Negative
##
## [1082 rows x 29 columns]
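The drop step above raises a KeyError if any listed column is missing, for example if the CSV was saved without the "Unnamed: 0" index column. As a minimal sketch, assuming a small made-up data frame (toy_df and its values are hypothetical, not the real clinical data), passing errors="ignore" makes the same call tolerant of missing names:

```python
import pandas as pd

# Hypothetical stand-in for the clinical table; the values are invented.
toy_df = pd.DataFrame({
    "bcr_patient_barcode": ["P1", "P2", "P3"],
    "gender": ["FEMALE", "FEMALE", "MALE"],
    "OS": [0, 1, 0],
    "OS.time": [4047, 1000, 1448],
})

# "Unnamed: 0" is not in toy_df. With errors="ignore", pandas skips it
# instead of raising a KeyError.
cols_to_drop = ["Unnamed: 0", "OS", "OS.time"]
toy_trimmed = toy_df.drop(columns=cols_to_drop, errors="ignore")

print(toy_trimmed.columns.tolist())
```

Whether to ignore missing columns is a judgment call: the strict default is safer when a missing column would signal that the wrong file was loaded.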
When you run the code chunk below, you should see that the data frame has the same number of rows and columns as the one we used in R.
print("The clinical data frame has " + str(np.shape(brca_clin_df)[0]) + " rows and " + str(np.shape(brca_clin_df)[1]) + " columns.")
## The clinical data frame has 1082 rows and 29 columns.
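The same sentence can also be built with the data frame's own .shape attribute and an f-string, which avoids the repeated str() calls. This is only a sketch on a made-up data frame (toy_df is hypothetical), not a replacement for the chunk above:

```python
import pandas as pd

# Hypothetical small data frame used only for illustration.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# .shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy_df.shape

# An f-string converts the numbers to text inside the braces.
sentence = f"The clinical data frame has {n_rows} rows and {n_cols} columns."
print(sentence)
```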
We can see that the sentence created when you run the code chunk above is the same sentence we created in R. Notice the differences in the code.
We can see the columns in the data frame by putting .columns after the name of the data frame.
brca_clin_df.columns
## Index(['bcr_patient_barcode', 'gender', 'race', 'ethnicity',
## 'age_at_diagnosis', 'year_of_initial_pathologic_diagnosis',
## 'vital_status', 'menopause_status', 'tumor_status', 'margin_status',
## 'days_to_last_followup', 'prior_dx',
## 'new_tumor_event_after_initial_treatment', 'radiation_therapy',
## 'tissue_source_site', 'histological_type', 'pathologic_T',
## 'pathologic_M', 'pathologic_N', 'pathologic_stage',
## 'lymph_node_examined_count', 'number_of_lymphnodes_positive_by_he',
## 'initial_pathologic_diagnosis_method',
## 'axillary_lymph_node_stage_method_type',
## 'breast_carcinoma_surgical_procedure_name',
## 'anatomic_neoplasm_subdivision',
## 'breast_carcinoma_estrogen_receptor_status',
## 'breast_carcinoma_progesterone_receptor_status',
## 'lab_proc_her2_neu_immunohistochemistry_receptor_status'],
## dtype='object')
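Because .columns returns an Index object, it can be turned into a plain list or checked for a particular name before that name is used elsewhere. A minimal sketch, assuming a made-up data frame (toy_df is hypothetical):

```python
import pandas as pd

# Hypothetical data frame with two of the column names from above.
toy_df = pd.DataFrame({"gender": ["FEMALE"], "race": ["WHITE"]})

# Convert the Index to a regular Python list.
column_names = toy_df.columns.tolist()

# Membership tests work directly on the Index.
has_gender = "gender" in toy_df.columns
has_choice = "choice" in toy_df.columns

print(column_names, has_gender, has_choice)
```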
Similar to table() in R, we can use .value_counts to count how many rows have each unique value in a column.
brca_clin_df.value_counts("gender")
## gender
## FEMALE 1070
## MALE 12
## dtype: int64
brca_clin_df.value_counts("breast_carcinoma_estrogen_receptor_status")
## breast_carcinoma_estrogen_receptor_status
## Positive 796
## Negative 236
## [Not Evaluated] 48
## Indeterminate 2
## dtype: int64
brca_clin_df.value_counts("breast_carcinoma_progesterone_receptor_status")
## breast_carcinoma_progesterone_receptor_status
## Positive 689
## Negative 340
## [Not Evaluated] 49
## Indeterminate 4
## dtype: int64
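The counts above can also be taken from a single column, which gives access to options such as normalize (fractions instead of counts) and dropna (whether missing values are counted). The sketch below assumes a made-up column, er_status, with one missing value; it is not the real data:

```python
import pandas as pd

# Hypothetical receptor status column with one missing entry.
toy_df = pd.DataFrame(
    {"er_status": ["Positive", "Positive", "Negative", "Positive", None]}
)

# dropna=False keeps the missing value as its own category.
counts = toy_df["er_status"].value_counts(dropna=False)

# normalize=True returns fractions; missing values are excluded by default.
fractions = toy_df["er_status"].value_counts(normalize=True)

print(counts)
print(fractions)
```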
In the code chunk below, we filter the data frame to keep only the rows where breast_carcinoma_estrogen_receptor_status is Positive, and then filter that result to the rows where breast_carcinoma_progesterone_receptor_status is also Positive. .shape[0] is equivalent to nrow in R: .shape is a (rows, columns) tuple, so index 0 gives the number of rows.
est = brca_clin_df[(brca_clin_df["breast_carcinoma_estrogen_receptor_status"] == "Positive")]
both = est[(est["breast_carcinoma_progesterone_receptor_status"] == "Positive")]
both.shape[0]
## 672
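The two filtering steps above can also be written as one boolean mask by joining the conditions with &. This is a sketch on made-up data (toy_df and the shortened column names er_status and pr_status are hypothetical); each condition needs its own parentheses because & binds more tightly than ==:

```python
import pandas as pd

# Hypothetical receptor status table; the values are invented.
toy_df = pd.DataFrame({
    "er_status": ["Positive", "Positive", "Negative", "Positive"],
    "pr_status": ["Positive", "Negative", "Negative", "Positive"],
})

# Each comparison yields a True/False Series; & combines them row by row.
mask = (toy_df["er_status"] == "Positive") & (toy_df["pr_status"] == "Positive")

# Indexing with the mask keeps only the rows where it is True.
both_toy = toy_df[mask]
n_both = both_toy.shape[0]

print(n_both)
```

Summing the mask (mask.sum()) gives the same count without building the filtered data frame, since True counts as 1.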
brca_clin_df.value_counts("lab_proc_her2_neu_immunohistochemistry_receptor_status")
## lab_proc_her2_neu_immunohistochemistry_receptor_status
## Negative 554
## Equivocal 177
## [Not Evaluated] 170
## Positive 161
## Indeterminate 12
## [Not Available] 8
## dtype: int64
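In the code chunk below, we take specific columns from brca_clin_df and combine them into a new data frame called receptor_status. Because pd.DataFrame treats each Series in the list as a row, we transpose the result so that each Series becomes a column. We then check the dimensions of receptor_status (np.shape) and look at the first 5 rows of the data.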
sample_id = brca_clin_df["bcr_patient_barcode"]
er_status = brca_clin_df["breast_carcinoma_estrogen_receptor_status"]
pro_status = brca_clin_df["breast_carcinoma_progesterone_receptor_status"]
her2_status = brca_clin_df["lab_proc_her2_neu_immunohistochemistry_receptor_status"]
receptor_status = pd.DataFrame([sample_id, er_status, pro_status, her2_status])
receptor_status = receptor_status.transpose()
np.shape(receptor_status)
## (1082, 4)
receptor_status.head()
## bcr_patient_barcode ... lab_proc_her2_neu_immunohistochemistry_receptor_status
## 0 TCGA-3C-AAAU ... Negative
## 1 TCGA-3C-AALI ... Positive
## 2 TCGA-3C-AALJ ... Indeterminate
## 3 TCGA-3C-AALK ... Positive
## 4 TCGA-4H-AAAK ... Equivocal
##
## [5 rows x 4 columns]
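A more direct way to build the same kind of subset is to index the data frame with a list of column names, which skips the transpose and carries each column's dtype over unchanged. The sketch below assumes a made-up data frame (toy_df, age and er_status are hypothetical names):

```python
import pandas as pd

# Hypothetical clinical table with one numeric and two text columns.
toy_df = pd.DataFrame({
    "bcr_patient_barcode": ["P1", "P2"],
    "age": [50, 61],
    "er_status": ["Positive", "Negative"],
    "gender": ["FEMALE", "FEMALE"],
})

# Double brackets: the inner list names the columns to keep.
# .copy() makes the result independent of toy_df.
wanted = ["bcr_patient_barcode", "age", "er_status"]
toy_receptor = toy_df[wanted].copy()

print(toy_receptor.shape)
print(toy_receptor.dtypes)
```

With the list-of-Series approach, a numeric column mixed with text columns would typically come out as a generic object column after the transpose, so column selection is often the safer choice when dtypes matter.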
Below, we create a data frame called tnbc that contains only the rows in which breast_carcinoma_estrogen_receptor_status, breast_carcinoma_progesterone_receptor_status and lab_proc_her2_neu_immunohistochemistry_receptor_status are all Negative. We then look at the head and the number of rows in tnbc.
es = receptor_status[(receptor_status["breast_carcinoma_estrogen_receptor_status"] == "Negative")]
pr = es[(es["breast_carcinoma_progesterone_receptor_status"] == "Negative")]
tnbc = pr[(pr["lab_proc_her2_neu_immunohistochemistry_receptor_status"] == "Negative")]
tnbc.head()
## bcr_patient_barcode ... lab_proc_her2_neu_immunohistochemistry_receptor_status
## 15 TCGA-A1-A0SK ... Negative
## 19 TCGA-A1-A0SP ... Negative
## 26 TCGA-A2-A04U ... Negative
## 33 TCGA-A2-A0CM ... Negative
## 46 TCGA-A2-A0D0 ... Negative
##
## [5 rows x 4 columns]
tnbc.shape[0]
## 114
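When the same test applies to several columns, the three filtering steps can be collapsed into one mask by comparing all the status columns at once. This is a sketch on made-up data (toy_receptor and the short column names er, pr and her2 are hypothetical):

```python
import pandas as pd

# Hypothetical receptor table; only P1 is negative for all three markers.
toy_receptor = pd.DataFrame({
    "sample": ["P1", "P2", "P3", "P4"],
    "er": ["Negative", "Negative", "Positive", "Negative"],
    "pr": ["Negative", "Negative", "Negative", "Positive"],
    "her2": ["Negative", "Equivocal", "Negative", "Negative"],
})

status_cols = ["er", "pr", "her2"]

# .eq compares every cell to "Negative"; .all(axis=1) is True only for
# rows where all three comparisons are True.
all_negative = toy_receptor[status_cols].eq("Negative").all(axis=1)
toy_tnbc = toy_receptor[all_negative]

print(toy_tnbc["sample"].tolist())
```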
#tnbc.match(tnbc["bcr_patient_barcode"], data["bcr_patient_barcode"])
When you run the code chunk below, you will see the table for the histological_type column of brca_clin_df.
brca_clin_df.value_counts("histological_type")
## histological_type
## Infiltrating Ductal Carcinoma 774
## Infiltrating Lobular Carcinoma 201
## Other, specify 45
## Mixed Histology (please specify) 29
## Mucinous Carcinoma 17
## Metaplastic Carcinoma 8
## Medullary Carcinoma 6
## [Not Available] 1
## Infiltrating Carcinoma NOS 1
## dtype: int64
Look at the columns of brca_clin_df (shown in the output of brca_clin_df.columns above) and replace the word choice with one of the column names to see the table for that column. Remember to uncomment the line of code.
#brca_clin_df.value_counts("choice")
When you run the code chunk below, you see the dimensions of receptor_status followed by a truncated view of the whole data frame (its first and last 5 rows).
np.shape(receptor_status)
## (1082, 4)
receptor_status
## bcr_patient_barcode ... lab_proc_her2_neu_immunohistochemistry_receptor_status
## 0 TCGA-3C-AAAU ... Negative
## 1 TCGA-3C-AALI ... Positive
## 2 TCGA-3C-AALJ ... Indeterminate
## 3 TCGA-3C-AALK ... Positive
## 4 TCGA-4H-AAAK ... Equivocal
## ... ... ... ...
## 1077 TCGA-WT-AB44 ... Negative
## 1078 TCGA-XX-A899 ... Negative
## 1079 TCGA-XX-A89A ... Negative
## 1080 TCGA-Z7-A8R5 ... Negative
## 1081 TCGA-Z7-A8R6 ... Negative
##
## [1082 rows x 4 columns]
Replace the words Feature_value with a feature value of your choice. (To find your options, look at the value_counts output for the estrogen and progesterone receptor status columns above.) Remember to uncomment the lines of code.
#hormone_receptor_positive1 = receptor_status[(receptor_status["breast_carcinoma_estrogen_receptor_status"] == "Feature_value")]
#hormone_receptor_positive = hormone_receptor_positive1[(hormone_receptor_positive1["breast_carcinoma_progesterone_receptor_status"] == "Feature_value")]
#hormone_receptor_positive.head()