Methods

We will first clean and explore the data, then carry out the predictive analysis.

Notes on cleaning:

We have two datasets: a clinical dataset and a gene expression dataset. We will first transpose the gene expression dataset so that each gene is a column and each sample is a row.
The clinical dataset includes two attributes that we need for our analysis: the Sample ID column and the cancer subtype attribute. The Sample ID column in the clinical dataset corresponds to the Ensemble ID column in the gene expression dataset. Our goal is to merge the two datasets into a single dataset that includes both the cancer subtype and the gene expression values. The values in the Ensemble ID column have an extra “A” in front that the Sample ID values lack, so we will prepend an “A” to each Sample ID and rename the Sample ID column to Ensemble ID.
Now that both datasets share a column named Ensemble ID, we will perform an inner join between them on that column. From the merged dataset, we will keep only the gene expression columns and the subtype column. We will also remove rows with missing data.
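
The cleaning steps above can be sketched with pandas as follows. This is only an illustration on made-up toy data, not our final code: the gene names, the sample IDs, the column name "Subtype", and the helper clean_and_merge are all assumptions, and the real files may need extra handling.

```python
import pandas as pd

# Toy stand-ins for the two files. All names and values here are made up.
# Expression data: one row per gene, one column per sample (before transposing).
expression = pd.DataFrame(
    {"A001": [1.2, 0.4], "A002": [2.3, 0.1], "A003": [0.7, None]},
    index=["GENE1", "GENE2"],
)
# Clinical data: "Subtype" is a hypothetical name for the cancer subtype column.
clinical = pd.DataFrame(
    {
        "Sample ID": ["001", "002", "003", "004"],
        "Subtype": ["LumA", "Basal", "Her2", "LumB"],
    }
)


def clean_and_merge(expression, clinical):
    """Transpose, align the ID columns, inner-join, and drop incomplete rows."""
    # Transpose so that each sample is a row and each gene is a column.
    expr_t = expression.T
    gene_cols = list(expr_t.columns)
    expr_t = expr_t.rename_axis("Ensemble ID").reset_index()

    # Prepend "A" to every Sample ID and store it under the shared column name.
    clin = clinical.assign(
        **{"Ensemble ID": "A" + clinical["Sample ID"].astype(str)}
    )[["Ensemble ID", "Subtype"]]

    # The inner join keeps only samples present in both datasets.
    merged = expr_t.merge(clin, on="Ensemble ID", how="inner")

    # Keep the gene expression columns and the subtype, then drop missing rows.
    return merged[gene_cols + ["Subtype"]].dropna()


merged = clean_and_merge(expression, clinical)
print(merged)
```

In this toy example, sample 004 has no expression data and is dropped by the inner join, while sample A003 has a missing expression value and is dropped by the final dropna step.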

Notes on predictive analysis:

We will train several classification models and compare their accuracy. The classifiers are support vector machines, random forest, and logistic regression. Since the dataset has many columns, we will apply a dimensionality reduction technique (PCA) before training the models. We will also train a neural network and compare its accuracy with that of the other models.
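
A minimal sketch of this comparison with scikit-learn is shown below. It runs on synthetic data rather than our merged dataset, and several choices are assumptions made only for illustration: the standardization step before PCA, the number of principal components (10), the 75/25 train/test split, the use of MLPClassifier as the neural network, and all hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the merged data: many columns, a few classes.
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=20, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The classifiers named in the text, plus a small neural network.
classifiers = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
}

accuracies = {}
for name, clf in classifiers.items():
    # Scaling and PCA are fitted on the training split only, inside the pipeline.
    model = make_pipeline(StandardScaler(), PCA(n_components=10), clf)
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
```

Wrapping the scaler and PCA in a pipeline with each classifier means the dimensionality reduction is learned from the training samples alone, so the held-out accuracy is not influenced by the test samples.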