# Objective

This project is intended to help undergraduate and early graduate students develop research projects.

---

# Overview

**Using datasets to support your hypothesis**

Analyzing data from the NCBI GEO database

- Cleaning data
- Building graphs
- Statistical analysis

**Using web-based software to interpret data**

- Looking at correlations in Ingenuity Pathway Analysis

**Assessing possible implications of your hypothesis**

- Looking at correlations with a disease dataset
- Cleaning data
- Building graphs

---

# Using datasets to support your hypothesis

**NCBI Gene Expression Omnibus**

+ "GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles."
+ GSM3027039 (lung10.5E expression)
+ GSM3027047 (lung11.5E expression)

---

# Using datasets to support your hypothesis

+ Save the http page as a txt file
+ Open the txt file in Excel

---

# Creating arrays from the data

```python
import numpy as np
from scipy import stats

# Expression values for epithelial (E) and mesenchymal (M) cells,
# two measurements each
E2E = np.array([19.83, 67.81, 0.00])
E1E = np.array([4.22, 190.49, 0.00, 0.00, 64.77, 0.00, 5.15, 0.00,
                79.94, 19.73, 0.00, 0.00, 0.00, 174.08, 4.64, 0.00])
E2M = np.array([5.15, 0.00, 0.00, 0.00, 8.77, 68.86, 73.01, 174.58,
                0.00, 23.24, 30.73, 35.16, 0.00, 126.49, 73.18, 73.03])
E1M = np.array([255.61, 15.31, 0.00, 52.57, 98.65, 24.65, 46.29, 217.5])

# Combine the arrays from the different measurements into one array
Epi = np.concatenate([E1E, E2E])
Mes = np.concatenate([E1M, E2M])
```

---

# Creating the error bars

```python
# Calculate the mean
Epi_mean = np.mean(Epi)
Mes_mean = np.mean(Mes)

# Calculate the standard deviation
Epi_std = np.std(Epi)
Mes_std = np.std(Mes)

# Error bars
Cell_types = ['Epi', 'Mes']
x_pos = np.arange(len(Cell_types))
meanbars = [Epi_mean, Mes_mean]
error = [Epi_std, Mes_std]
```

---

# Customizing the graphs

```python
import matplotlib.pyplot as plt

# Build the plot
fig, ax = plt.subplots()
ax.bar(x_pos, meanbars, color=['pink', 'blue'], yerr=error,
       align='center', alpha=0.5, ecolor='black', capsize=10)
ax.set_ylabel('mRNA Expression')
ax.set_xticks(x_pos)
ax.set_xticklabels(Cell_types)
ax.set_title('Timp2 expression in lung E11.5 epithelial and mesenchymal cells')
ax.yaxis.grid(True)

# Save the figure and show
plt.tight_layout()
plt.savefig('Timp2_mRNAexpression_E11.5lung.png')
plt.show()

# t-test and p-value (equal_var=False gives Welch's t-test)
t, p = stats.ttest_ind(Mes, Epi, equal_var=False)
print("t = " + str(t))
print("p = " + str(p))
```

---

# GSM3027039 (lung10.5E expression)

+ t = 0.18786636459852307
+ p = 0.8519663195255461

---

# GSM3027047 (lung11.5E expression)

+ t = 1.2774824478919518
+ p = 0.20863320356230441

---

# Using web-based software to interpret data

**What is IPA?**

Ingenuity Pathway Analysis (IPA) is web-based software used to interpret omics data. You can either interpret your own data or use the data available within the software.
[IPA](https://www.qiagenbioinformatics.com/products/ingenuity-pathway-analysis/)

---

# Building pathways in IPA

.pull-left[]

.pull-right[]

---

# Using matplotlib venn to look for relationships

```python
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# Genes in each category; duplicates are removed by set() below
inhibition_lung_dev = ['ADAM17', 'MMP14', 'MMP2', 'MMP9', 'ADAM17']
inhibition_Fibrosis_and_lung_cancer = []

venn2([set(inhibition_lung_dev), set(inhibition_Fibrosis_and_lung_cancer)],
      set_labels=('Lung development', 'Fibrosis and lung adenocarcinoma'))
plt.title('Timp2 inhibition interactions during development vs disease')
plt.savefig('Timp2_inhibition_interaction_dev_vs_disease')
plt.show()
```

---

# Using matplotlib venn to look for relationships

---

# Assessing possible implications

**Looking at correlations with disease dataset**

+ "OncoLnc is a tool for interactively exploring survival correlations, and for downloading clinical data coupled to expression data for mRNAs, miRNAs, or lncRNAs." *OncoLnc: Linking TCGA survival data to mRNAs, miRNAs, and lncRNAs*. Available from: https://www.researchgate.net/publication/308718954_OncoLnc_Linking_TCGA_survival_data_to_mRNAs_miRNAs_and_lncRNAs [accessed Dec 13 2018].
[https://www.researchgate.net/publication/308718954_OncoLnc_Linking_TCGA_survival_data_to_mRNAs_miRNAs_and_lncRNAs](https://www.researchgate.net/publication/308718954_OncoLnc_Linking_TCGA_survival_data_to_mRNAs_miRNAs_and_lncRNAs)

---

# Cleaning data from OncoLnc

```python
import os
import pandas as pd

# Path to the CSV file downloaded from OncoLnc
filename = os.path.abspath(os.path.join('Desktop', 'LUAD_7077_25_25_1.csv'))
readCSV = pd.read_csv(filename)
readCSV.head()

# Getting rid of columns that are not needed
readCSV.drop(["Patient", "Expression"], axis=1, inplace=True)
readCSV.head()

print("Number of Observations:", readCSV.shape[0])
```

---

# Building survival graphs

```python
from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()

# Event indicator (lifelines expects 1/True where the death was observed)
C = readCSV['Status']
# Follow-up time in days
T = readCSV['Days']

# Boolean mask selecting the high-expression group
groups = readCSV['Group']
ix = groups == 'High'

# Fit and plot one Kaplan-Meier curve per group on the same axes
kmf.fit(T[~ix], C[~ix], label='Low Timp2 Expression')
ax = kmf.plot()
kmf.fit(T[ix], C[ix], label='High Timp2 Expression')
kmf.plot(ax=ax)
```

---

# Survival curve of high vs low Timp2

---

# Summary

Hopefully this project helped the reader:

+ Get a better idea of how to design a research question
+ Learn how to look for evidence that could support a hypothesis

---

# Currently working on

**Assessing possible implications of your hypothesis**

- Statistical analysis
- Graph
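
---

# Sketch: log-rank test for the survival curves

One possible starting point for the statistical analysis still in progress is a two-group log-rank test, which asks whether the high and low Timp2 survival curves differ more than chance would explain. The block below is a minimal from-scratch sketch using only the standard library, not the project's actual method. The function name `logrank_test` and the toy numbers are assumptions made up for illustration; they are not values from the OncoLnc file. lifelines also provides its own `logrank_test` in `lifelines.statistics`, which would normally be the more convenient choice.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi_square, p_value).

    times_*  : follow-up times for each subject
    events_* : 1/True if the event (death) was observed, 0/False if censored
    """
    times_a, events_a = list(times_a), list(events_a)
    times_b, events_b = list(times_b), list(events_b)

    # Distinct times at which at least one event occurred in either group
    event_times = sorted({t for t, e in zip(times_a + times_b,
                                            events_a + events_b) if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Subjects still at risk just before time t
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        # Events occurring exactly at time t
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b

        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

# Hypothetical toy data (days, event flag); NOT taken from the OncoLnc download
low_days, low_status = [100, 200, 300, 400, 500], [1, 1, 1, 0, 1]
high_days, high_status = [150, 600, 700, 800, 900], [1, 0, 1, 0, 0]

chi2, p = logrank_test(low_days, low_status, high_days, high_status)
print("chi-square = " + str(chi2))
print("p = " + str(p))
```

With the real data, the two groups would come from the `Group` column, for example `T[ix]`, `C[ix]` for high expression and `T[~ix]`, `C[~ix]` for low expression, provided `Status` is coded as 1 for an observed death and 0 for censoring.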