In our project, we will first clean and explore the data, then perform the predictive analysis.
Notes on cleaning:
The columns of interest in the clinical dataset are the Sample ID column and the cancer subtype attribute. The Sample ID column in the clinical dataset corresponds to the Ensemble ID column in the gene expression dataset. Our goal is to merge these two datasets into a single dataset that includes the cancer subtype and the gene expression. We also note that the Ensemble ID column has an extra “A” in front of every ID compared with the Sample ID column. We will therefore prepend an “A” to every Sample ID and rename the Sample ID column to Ensemble ID. With the same ID column in both datasets, we will perform an inner join between the two datasets on that column. From the merged dataset, we will keep only the gene expression columns and the subtype column. We will also remove the rows with missing data.
Notes on Predictive Analysis:
We will train Support Vector Machines, Random Forest, and logistic regression models. Since the dataset has many columns, we will apply a dimensionality reduction technique (PCA) before training the models. We will also train a neural network and compare the accuracy of the models.
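
As a rough illustration of the plan in these notes, the sketch below runs the cleaning steps and the model comparison end to end on synthetic data, using pandas and scikit-learn. It is only a sketch under stated assumptions: the column name "Subtype", the gene column names, the subtype labels, the number of PCA components, the train/test split, and all model settings are hypothetical placeholders and are not taken from our actual datasets.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def clean_and_merge(clinical, expression):
    """Align the ID columns, inner-join, keep subtype + genes, drop missing rows."""
    # "Subtype" is an assumed name for the cancer subtype attribute.
    clinical = clinical[["Sample ID", "Subtype"]].copy()
    # Prepend the extra "A" so the clinical IDs match the expression IDs.
    clinical["Sample ID"] = "A" + clinical["Sample ID"].astype(str)
    # Rename the key so both tables share the same ID column.
    clinical = clinical.rename(columns={"Sample ID": "Ensemble ID"})
    merged = clinical.merge(expression, on="Ensemble ID", how="inner")
    # Keep only the subtype label and the gene expression columns.
    merged = merged.drop(columns=["Ensemble ID"])
    return merged.dropna().reset_index(drop=True)


def compare_models(data, n_components=5, seed=0):
    """Fit scaling + PCA + each classifier; return held-out accuracy per model."""
    X = data.drop(columns=["Subtype"]).to_numpy()
    y = data["Subtype"].to_numpy()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # Hyperparameters below are illustrative defaults, not tuned values.
    models = {
        "svm": SVC(),
        "random_forest": RandomForestClassifier(random_state=seed),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "neural_network": MLPClassifier(
            hidden_layer_sizes=(32,), max_iter=2000, random_state=seed
        ),
    }
    scores = {}
    for name, model in models.items():
        # Scaling and PCA are fit on the training split only, inside the pipeline.
        pipe = make_pipeline(StandardScaler(), PCA(n_components=n_components), model)
        pipe.fit(X_train, y_train)
        scores[name] = pipe.score(X_test, y_test)
    return scores


# Synthetic stand-ins for the clinical and gene expression datasets.
rng = np.random.default_rng(0)
n_samples, n_genes = 80, 20
ids = [f"S{i:03d}" for i in range(n_samples)]
labels = np.repeat(["subtype_1", "subtype_2"], n_samples // 2)
clinical = pd.DataFrame(
    {"Sample ID": ids, "Subtype": labels, "Age": rng.integers(30, 80, n_samples)}
)
# Shift the mean of one class so the toy problem is learnable.
values = rng.normal(size=(n_samples, n_genes)) + (labels == "subtype_2")[:, None] * 1.5
expression = pd.DataFrame(values, columns=[f"gene_{j}" for j in range(n_genes)])
expression.insert(0, "Ensemble ID", ["A" + s for s in ids])
expression.loc[0, "gene_0"] = np.nan  # one missing value to exercise dropna

data = clean_and_merge(clinical, expression)
print(data.shape)
for name, acc in compare_models(data).items():
    print(f"{name}: {acc:.3f}")
```

Wrapping the scaler and PCA in a pipeline means they are fit on the training split only, so the held-out accuracy is not influenced by the test samples. On the real data, the number of components and the model settings would need to be chosen with care, for example by cross-validation.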