Overview
This is my analysis and model for the Personalized Medicine Kaggle competition: https://www.kaggle.com/c/msk-redefining-cancer-treatment
As there is a bit of drama about external data being used that actually included the test transactions, I will focus only on the data provided and no external data. The goal is not to win the competition but do some exploratory work, feature selection, and have a halfway decent model result.
Data Import
Options & Libraries
options(repos = getOption("repos")["CRAN"])
options(scipen = 999)
rm(list=ls())
#Libraries
library(tidyverse)
library(corrplot)
library(caret)
library(gbm)
library(magrittr)
library(gbm)
library(e1071)
library(randomForest)
library(lubridate)
library(xgboost)
library(Matrix)
library(data.table)
library(stringi)
library(NMOF)
library(tidytext)
library(stringr)
library(forcats)
library(tidytext)
library(SnowballC)
library(wordcloud)
library(tm)
library(syuzhet)
Reading in the data files, there are four files. One set of test/train is a base file with genes and their variations. The other test/train text files include researcher write-up of the case.
d_train <- read.csv('training_variants')
d_test <- read.csv('test_variants')
#these text files have a bad first row and don't import well with read.table functions
d_train_text <- do.call(rbind,strsplit(readLines('training_text'),'||',fixed=T))
d_train_text <- as.data.frame(d_train_text)
d_train_text <- d_train_text[-1,]
colnames(d_train_text) <-c("ID","Text")
d_train_text$ID <- as.numeric(as.character(d_train_text$ID))
d_test_text <- do.call(rbind,strsplit(readLines('test_text'),'||',fixed=T))
d_test_text <- as.data.frame(d_test_text)
d_test_text <- d_test_text[-1,]
colnames(d_test_text) <- c("ID", "Text")
d_test_text$ID <- as.numeric(as.character(d_test_text$ID))
I’ll combine the data for the base files and the text files.
d_train <- d_train %>%
mutate(Class = factor(Class))
d_test$Class <-as.factor(-9)
df <- rbind(d_train,d_test)
rm(list=c('d_train','d_test'))
df$ds <- ifelse(df$Class == '-9','Test','Train')
d_train_text$ds <- "Train"
d_test_text$ds <- "Test"
dt <- rbind(d_train_text,d_test_text)
dt$ds <- as.factor(dt$ds)
rm(list = c('d_train_text','d_test_text'))
# dt %>%
# group_by(ds) %>%
# summarize(count = n())
df <- merge(df,dt,by=c('ID','ds'))
Exploratory Plots
The first view shows us the prediction class distribution for levels 1-9, which is what needs to be predicted. We can see there are many gene, some with high distribution while many small. The variation seems to have very high cardinality with few terms showing up more than once, which will be difficult to manage.
df %>%
filter(ds == 'Train') %>%
select(Gene,Variation,Class) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = 'free') +
geom_bar(fill='blue') +
theme_bw()
attributes are not identical across measure variables; they will be dropped

I will manually look at a few of the higher occurrences to get a sense of the data, there seems to be some discrete values and then lots of hard to understand variations like “Q61R”.
df %>%
group_by(ds, Variation) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
head(n=20)
The genes seem to have complicated naming conventions as well, some research and industry knowledge would be very helpful here. It is interesting to see that the most common gene in the test set (F8) is not common in the training set. It is not clear how the training/ test sets were split and sampled at this point.
df %>%
group_by(ds,Gene) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
head(n=20)
The distribution of genes to classes is quite interesting, as some classes have very few genes in them.
df %>%
filter(ds == 'Train') %>%
select(Class,Gene) %>%
ggplot(aes(x=Gene)) +
geom_bar() +
facet_wrap(~ Class, scales = 'free')

There are not too many major Gene-Class combinations.
df %>%
group_by(ds,Class,Gene) %>%
summarize(obs = n()) %>%
arrange(Gene,Class,obs) %>%
head(10)
I’ll make a new variable for high occuring genes, and also plot it. There are some high observation ones which may be interesting to research.
df <- df %>%
group_by(Gene) %>%
summarize(obs = n()) %>%
mutate(gene_o_25 = ifelse(obs > 25, 1, 0)) %>%
right_join(df) %>%
ungroup() %>%
select(-obs)
Joining, by = "Gene"
df %>%
group_by(Gene) %>%
summarize(obs = n()) %>%
mutate(gene_o_25 = ifelse(obs > 25, 1, 0)) %>%
filter(gene_o_25 == 1) %>%
ggplot(aes(x=Gene, y = obs)) +
geom_bar(fill = 'green',stat = 'identity') +
coord_flip() +
theme_bw()

When I include the class level it can be quite interesting. Certain genes clearly have very different distributions of the classes they may belong to.
df %>%
filter(gene_o_25 == 1 & ds == 'Train') %>%
ggplot(aes(x=Gene ,fill = Class)) +
geom_bar() +
coord_flip() +
theme_bw() +
scale_fill_brewer(palette = 'Set3')

I’ll do similar plotting for the Variations. It looks like a few buckets dominate and then more technical classifications occur.
df <- df %>%
group_by(Variation) %>%
summarize(obs1 = n()) %>%
mutate(variation_o_2 = ifelse(obs1 > 2, 1, 0)) %>%
right_join(df, by = 'Variation') %>%
ungroup() %>%
select(-obs1)
df %>%
group_by(Variation) %>%
summarize(obs = n()) %>%
mutate(big_vari = ifelse(obs > 2, 1, 0)) %>%
filter(big_vari == 1) %>%
ggplot(aes(x=Variation, y = obs)) +
geom_bar(fill = 'green',stat = 'identity') +
coord_flip() +
theme_bw()

Class 1 seems to dominate the Truncating Mutations as well as Deletion, it will be harder to group the Q61’s up and find a pattern.
df %>%
filter(variation_o_2 == 1 & ds == 'Train') %>%
ggplot(aes(x=Variation, fill = Class)) +
geom_bar() +
coord_flip() +
theme_bw() +
scale_fill_brewer(palette = 'Set1')

This view will show us where the gene-variation combos might occur, and which class they fall into.
df %>%
filter(variation_o_2 == 1 & gene_o_25 == 1 & ds == 'Train') %>%
ggplot(aes(x=Variation, y=Gene, color = Class)) +
geom_point(size = 4, alpha = .8) +
theme_bw() +
scale_size() +
scale_fill_brewer(palette = 'Set1')

Feature Engineering - Base File
I’ll build some basic features for what I have just explored. Let’s start scraping variation for some of these key words as well as the character length of gene and variation.
df <- df %>%
mutate(fusion_in_variation = ifelse(grepl('Fusion',df$Variation),1,0),
mutation_in_variation = ifelse(grepl('Mutation',df$Variation),1,0),
truncating_in_variation = ifelse(grepl('Truncating',df$Variation),1,0),
amplification_in_variation = ifelse(grepl('Amplification',df$Variation),1,0),
insertion_in_variation = ifelse(grepl('insertion',df$Variation),1,0),
del_in_variation = ifelse(grepl('Del',df$Variation),1,0),
gene_length = nchar(as.character(Gene)),
variation_length = nchar(as.character(Variation))
)
Dummy variables for the 10 most common genes, exact matching.
df <- df %>%
mutate(g_brca1 = ifelse(Gene == 'BRCA1',1,0),
g_brca2 = ifelse(Gene == 'BRCA2',1,0),
g_tp53 = ifelse(Gene == 'TP53',1,0),
g_pten = ifelse(Gene == 'PTEN',1,0),
g_kit = ifelse(Gene == 'KIT',1,0),
g_egfr = ifelse(Gene == 'EGFR',1,0),
g_braf = ifelse(Gene == 'BRAF',1,0),
g_cdkn2a = ifelse(Gene == 'CDKN2A',1,0),
g_abl1 = ifelse(Gene == 'ABL1',1,0),
g_tsc2 = ifelse(Gene == 'TSC2',1,0),
g_vhl = ifelse(Gene == 'VHL',1,0))
Dummy variables for truncations.
df <- df %>%
mutate(v_trunc_mutations = ifelse(Variation == 'Truncating Mutations',1,0),
v_t581 = ifelse(Variation == 'T58I',1,0),
v_q61r = ifelse(Variation == 'Q61R',1,0),
v_q61l = ifelse(Variation == 'Q61L',1,0),
v_q61h = ifelse(Variation == 'Q61H',1,0),
v_overexpression = ifelse(Variation == 'Overexpression',1,0),
v_g13d = ifelse(Variation == 'G13D',1,0),
v_g12v = ifelse(Variation == 'G12V',1,0),
v_g12c = ifelse(Variation == 'G12C',1,0),
v_fusions = ifelse(Variation == 'Fusions',1,0),
v_e17k = ifelse(Variation == 'E17K',1,0),
v_Deletion = ifelse(Variation == 'Deletion',1,0),
v_Amplification = ifelse(Variation == 'Amplification',1,0)
)
The first and last values for the genes seem to be significant, lets create dummy variables for them as well. First, the last letter in the gene.
df$gene <- as.character(sapply(strsplit(as.character(df$Gene),""),tail,1))
df$val <- 1
df <- df %>%
spread(key = gene,value = val, fill = 0,sep = '_end_')
We can create the first letter for the gene in a similar way.
df$gene <- as.character(sapply(strsplit(as.character(df$Gene),""),head,1))
df$val <- 1
df <- df %>%
spread(key = gene,value = val, fill = 0,sep = '_begin_')
We can also check what variation end in.
df$vrtn <- as.character(sapply(strsplit(as.character(df$Variation),""),head,1))
df$val <- 1
df <- df %>%
spread(key = vrtn,value = val, fill = 0,sep = '_end_')
The second, third and fourth digits also seem to have some type of similar partterns, lets use the top occurences to create a few more dummy variables. Later, research will be needed as to what this represents.
df$vrtn <- as.character(substr(as.character(df$Variation),2,4))
df$val <- 1
df <- df %>%
group_by(vrtn,val) %>%
summarize(count = n()) %>%
mutate(com_mid_vrtn = ifelse(count>25,vrtn,"")) %>%
right_join(df, by = 'vrtn') %>%
ungroup()
df <- df %>%
spread(key=com_mid_vrtn,value = 'val.x', fill = 0, sep = '_mid3_') %>%
select(-vrtn,-count,-val.y)
Feature Engineering - Text File
The text files provided are quite extensive with very long write ups by the researchers. I will take a very light text mining approach here by converting this to a Corpus and then a sparse matrix for the more common words.
First, the corpus can be created and cleaned up using the standard text-mining approach.
t_corpus <- Corpus(VectorSource(dt$Text))
t_corpus <- t_corpus %>%
tm_map(stripWhitespace) %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords("english")) %>%
tm_map(stemDocument, language="english")
I’ll now convert it to a matrix and use TF-IDF for word importance. We can join back to our base dataset.
dtm <- DocumentTermMatrix(t_corpus, control = list(weighting = weightTfIdf))
dtm <- removeSparseTerms(dtm, 0.95)
df <- cbind(df,as.matrix(dtm))
Lastly, let’s do sentimenet analysis on all these terms to determine positive/ negative relationships.
df <- cbind(df,get_nrc_sentiment(as.character(df$Text)))
Let’s plot out the sentiments that were created.

Model Building
I will now set up the framework to run a GBT model. I will use k-fold cross-validation to get an estimate.
set.seed(7)
cv_folds <- createFolds(df$Class[df$Class != '-9'], k=5, list=TRUE, returnTrain=FALSE)
With the folds set, I’ll divide the data back into our train and test sets.
train_matrix <- df %>%
filter(ds == 'Train') %>%
select(-ID,-Gene,-Variation,-Class, -ds, - Text) %>%
apply(2,as.character) %>%
apply(2,as.integer) %>%
as.matrix() %>%
Matrix(sparse = T)
test_matrix <- df %>%
filter(ds == 'Test') %>%
select(-ID,-Gene,-Variation,-Class, -ds,-Text) %>%
apply(2,as.character) %>%
apply(2,as.integer) %>%
as.matrix() %>%
Matrix(sparse = T)
Let’s give our training data the correct labels, as well as create the vector of our test ids for later.
train_y <- df %>%
filter(ds == 'Train') %>%
select(Class) %>%
apply(2,as.character) %>%
apply(2,as.integer) -1
test_ids <- df[df$ds == 'Test','ID']
Now lets create our XGB Matrices.
dtrain <- xgb.DMatrix(data = train_matrix, label = train_y)
dtest <- xgb.DMatrix(data = test_matrix)
The paramaters can be found through a grid search or trial and error.
param <- list(booster = 'gbtree',
objective = 'multi:softprob',
eval_metric = 'mlogloss',
num_class = 9,
eta = .2,
gamma = 1,
max_depth = 6,
min_child_weight = 1,
subsample = .7,
colsample_bytree = .7)
I’ll run to find the proper rounds needed, as well get a sense of our average log-loss error on the test sets.
xgb_cv <- xgb.cv(data = dtrain,
params = param,
nrounds = 1000,
maximize = FALSE,
prediction = TRUE,
folds = cv_folds,
print_every_n = 10,
early_stopping_rounds = 100)
[1] train-mlogloss:1.937655+0.009034 test-mlogloss:1.959912+0.018980
Multiple eval metrics are present. Will use test_mlogloss for early stopping.
Will train until test_mlogloss hasn't improved in 100 rounds.
[11] train-mlogloss:1.152867+0.008930 test-mlogloss:1.310597+0.034807
[21] train-mlogloss:0.898848+0.007829 test-mlogloss:1.158172+0.040239
[31] train-mlogloss:0.754805+0.006796 test-mlogloss:1.093189+0.044991
[41] train-mlogloss:0.658517+0.005712 test-mlogloss:1.062522+0.046323
[51] train-mlogloss:0.591242+0.006247 test-mlogloss:1.048288+0.045692
[61] train-mlogloss:0.541578+0.005534 test-mlogloss:1.039373+0.047339
[71] train-mlogloss:0.505758+0.006165 test-mlogloss:1.034065+0.047184
[81] train-mlogloss:0.476895+0.005391 test-mlogloss:1.033773+0.048242
[91] train-mlogloss:0.453790+0.005452 test-mlogloss:1.032013+0.046393
[101] train-mlogloss:0.434546+0.004890 test-mlogloss:1.031118+0.045972
[111] train-mlogloss:0.419225+0.005004 test-mlogloss:1.030887+0.044943
[121] train-mlogloss:0.406539+0.004720 test-mlogloss:1.031755+0.045891
[131] train-mlogloss:0.395407+0.004484 test-mlogloss:1.032509+0.046531
[141] train-mlogloss:0.385014+0.004666 test-mlogloss:1.032902+0.046992
[151] train-mlogloss:0.376440+0.004775 test-mlogloss:1.033313+0.046635
[161] train-mlogloss:0.368263+0.003906 test-mlogloss:1.034437+0.045821
[171] train-mlogloss:0.362079+0.003916 test-mlogloss:1.034673+0.046986
[181] train-mlogloss:0.356015+0.003251 test-mlogloss:1.035144+0.046364
[191] train-mlogloss:0.350541+0.003166 test-mlogloss:1.036582+0.045985
[201] train-mlogloss:0.345986+0.002985 test-mlogloss:1.036591+0.046569
[211] train-mlogloss:0.341979+0.002905 test-mlogloss:1.038154+0.047450
Stopping. Best iteration:
[115] train-mlogloss:0.413861+0.004695 test-mlogloss:1.029883+0.044247
We can now build our final model.
rounds <- which.min(xgb_cv$evaluation_log$test_mlogloss_mean)
xgb_model <- xgb.train(data = dtrain,
params = param,
watchlist = list(train = dtrain),
nrounds = rounds,
verbose = 1,
print_every_n = 5)
[1] train-mlogloss:1.942394
[6] train-mlogloss:1.403119
[11] train-mlogloss:1.162500
[16] train-mlogloss:1.023238
[21] train-mlogloss:0.923765
[26] train-mlogloss:0.844091
[31] train-mlogloss:0.782102
[36] train-mlogloss:0.732347
[41] train-mlogloss:0.694913
[46] train-mlogloss:0.653488
[51] train-mlogloss:0.619108
[56] train-mlogloss:0.588880
[61] train-mlogloss:0.565901
[66] train-mlogloss:0.547181
[71] train-mlogloss:0.530021
[76] train-mlogloss:0.515254
[81] train-mlogloss:0.502279
[86] train-mlogloss:0.489295
[91] train-mlogloss:0.476362
[96] train-mlogloss:0.465331
[101] train-mlogloss:0.455101
[106] train-mlogloss:0.445170
[111] train-mlogloss:0.436663
[115] train-mlogloss:0.430055
The importance matrix is as follows, sentiment analysis helped quite a bit along with a few of our gene/variation variables.
names <- dimnames(train_matrix)[[2]]
importance_matrix <- xgb.importance(names,model=xgb_model)
xgb.plot.importance(importance_matrix,top_n=20)

We can now get our predictions on the test data.
predictions <- xgb_model %>%
predict(dtest) %>%
matrix(nrow = nrow(dtest), ncol=9,byrow = T) %>%
as.data.frame()
Let’s format them correctly and see if each row sums properly to 1 as a sanity check.
colnames(predictions) <- c("class1","class2","class3","class4","class5","class6","class7","class8","class9")
df_preds <- data.frame(ID=test_ids,predictions)
df_preds %>%
select(-ID) %>%
apply(1,sum) %>%
head(n=25)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
And now our submission file.
write.csv(df_preds,'submission_nb.csv',row.names = F)
With the very light feature engineering and model tuning, I am around the 50th percentile. More research needs to be done around how to understand the genes and mutations, as well as deeper text mining and feature creation.
---
title: "Personalized Medicine - Kaggle"
output: html_notebook
---

#Overview
This is my analysis and model for the Personalized Medicine Kaggle competition:
https://www.kaggle.com/c/msk-redefining-cancer-treatment

As there is a bit of drama about external data being used that actually included the test transactions, I will focus only on the data provided and no external data. The goal is not to win the competition but do some exploratory work, feature selection, and have a halfway decent model result.

#Data Import

Options & Libraries
```{r}
options(repos = getOption("repos")["CRAN"])
options(scipen = 999)

rm(list=ls())


#Libraries
library(tidyverse)
library(corrplot)
library(caret)
library(gbm)
library(magrittr)
library(gbm)
library(e1071)
library(randomForest)
library(lubridate)
library(xgboost)
library(Matrix)
library(data.table)
library(stringi)
library(NMOF)
library(tidytext)
library(stringr)
library(forcats) 
library(tidytext) 
library(SnowballC) 
library(wordcloud) 
library(tm)
library(syuzhet)
```

Reading in the data files, there are four files. One set of test/train is a base file with genes and their variations. The other test/train text files include researcher write-up of the case.
```{r}
d_train <- read.csv('training_variants')
d_test <- read.csv('test_variants')

#these text files have a bad first row and don't import well with read.table functions
d_train_text <- do.call(rbind,strsplit(readLines('training_text'),'||',fixed=T))
d_train_text <- as.data.frame(d_train_text)
d_train_text <- d_train_text[-1,]
colnames(d_train_text) <-c("ID","Text")
d_train_text$ID <- as.numeric(as.character(d_train_text$ID))

d_test_text <- do.call(rbind,strsplit(readLines('test_text'),'||',fixed=T))
d_test_text <- as.data.frame(d_test_text)
d_test_text <- d_test_text[-1,]
colnames(d_test_text) <- c("ID", "Text")
d_test_text$ID <- as.numeric(as.character(d_test_text$ID))
```
I'll combine the data for the base files and the text files.
```{r}
d_train <- d_train %>%
  mutate(Class = factor(Class))

d_test$Class <-as.factor(-9)

df <- rbind(d_train,d_test)
rm(list=c('d_train','d_test'))
df$ds <- ifelse(df$Class == '-9','Test','Train')


d_train_text$ds <- "Train"
d_test_text$ds <- "Test"
dt <- rbind(d_train_text,d_test_text)
dt$ds <- as.factor(dt$ds)

rm(list = c('d_train_text','d_test_text'))

# dt %>%
#   group_by(ds) %>%
#   summarize(count = n())

df <- merge(df,dt,by=c('ID','ds'))
```
#Exploratory Plots

The first view shows us the prediction class distribution for levels 1-9, which is what needs to be predicted. We can see there are many gene, some with high distribution while many small. The variation seems to have very high cardinality with few terms showing up more than once, which will be difficult to manage.
```{r}
df %>%
  filter(ds == 'Train') %>%
  select(Gene,Variation,Class) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = 'free') +
  geom_bar(fill='blue') +
  theme_bw() 
```
I will manually look at a few of the higher occurrences to get a sense of the data, there seems to be some discrete values and then lots of hard to understand variations like "Q61R".

```{r}
df %>%
  group_by(ds, Variation) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  head(n=20)
```
The genes seem to have complicated naming conventions as well, some research and industry knowledge would be very helpful here. It is interesting to see that the most common gene in the test set (F8) is not common in the training set. It is not clear how the training/ test sets were split and sampled at this point.
```{r}
df %>%
  group_by(ds,Gene) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  head(n=20)
```
The distribution of genes to classes is quite interesting, as some classes have very few genes in them.
```{r}
df %>%
  filter(ds == 'Train') %>%
  select(Class,Gene) %>%
  ggplot(aes(x=Gene)) +
  geom_bar() +
  facet_wrap(~ Class, scales = 'free')
```
There are not too many major Gene-Class combinations.
```{r}
df %>%
  group_by(ds,Class,Gene) %>%
  summarize(obs = n()) %>%
  arrange(Gene,Class,obs) %>%
  head(10)
```
I'll make a new variable for high occuring genes, and also plot it. There are some high observation ones which may be interesting to research.
```{r}

df <- df %>%
  group_by(Gene) %>%
  summarize(obs = n()) %>%
  mutate(gene_o_25 = ifelse(obs > 25, 1, 0)) %>%
  right_join(df) %>%
  ungroup() %>%
  select(-obs)

df %>%
  group_by(Gene) %>%
  summarize(obs = n()) %>%
  mutate(gene_o_25 = ifelse(obs > 25, 1, 0)) %>%
  filter(gene_o_25 == 1) %>%
  ggplot(aes(x=Gene, y = obs)) +
  geom_bar(fill = 'green',stat = 'identity') +
  coord_flip() +
  theme_bw()
```
When I include the class level it can be quite interesting. Certain genes clearly have very different distributions of the classes they may belong to.
```{r}
df %>%
  filter(gene_o_25 == 1 & ds == 'Train') %>%
  ggplot(aes(x=Gene ,fill = Class)) +
  geom_bar() +
  coord_flip() +
  theme_bw() +
  scale_fill_brewer(palette = 'Set3')
```
I'll do similar plotting for the Variations. It looks like a few buckets dominate and then more technical classifications occur.
```{r}
df <-   df %>%
  group_by(Variation) %>%
  summarize(obs1 = n()) %>%
  mutate(variation_o_2 = ifelse(obs1 > 2, 1, 0)) %>%
  right_join(df, by = 'Variation') %>%
  ungroup() %>%
  select(-obs1)

df %>%
  group_by(Variation) %>%
  summarize(obs = n()) %>%
  mutate(big_vari = ifelse(obs > 2, 1, 0)) %>%
  filter(big_vari == 1) %>%
  ggplot(aes(x=Variation, y = obs)) +
  geom_bar(fill = 'green',stat = 'identity') +
  coord_flip() +
  theme_bw()
```
Class 1 seems to dominate the Truncating Mutations as well as Deletion, it will be harder to group the Q61's up and find a pattern.
```{r}
df %>%
  filter(variation_o_2 == 1 & ds == 'Train') %>%
  ggplot(aes(x=Variation, fill = Class)) +
  geom_bar() +
  coord_flip() +
  theme_bw() +
  scale_fill_brewer(palette = 'Set1')
```
This view will show us where the gene-variation combos might occur, and which class they fall into.
```{r}
df %>%
  filter(variation_o_2 == 1 & gene_o_25 == 1 & ds == 'Train') %>%
  ggplot(aes(x=Variation, y=Gene, color = Class)) +
  geom_point(size = 4, alpha = .8) +
  theme_bw() +
  scale_size() + 
  scale_fill_brewer(palette = 'Set1')
```
#Feature Engineering - Base File

I'll build some basic features for what I have just explored. Let's start scraping variation for some of these key words as well as the character length of gene and variation.
```{r}
df <- df %>%
  mutate(fusion_in_variation = ifelse(grepl('Fusion',df$Variation),1,0),
         mutation_in_variation = ifelse(grepl('Mutation',df$Variation),1,0),
         truncating_in_variation = ifelse(grepl('Truncating',df$Variation),1,0),
         amplification_in_variation = ifelse(grepl('Amplification',df$Variation),1,0),
         insertion_in_variation = ifelse(grepl('insertion',df$Variation),1,0),
         del_in_variation = ifelse(grepl('Del',df$Variation),1,0),
         gene_length = nchar(as.character(Gene)), 
         variation_length = nchar(as.character(Variation))
         )
```
Dummy variables for the 10 most common genes, exact matching.
```{r}
df <- df %>%
  mutate(g_brca1 = ifelse(Gene == 'BRCA1',1,0),
         g_brca2 = ifelse(Gene == 'BRCA2',1,0),
         g_tp53 = ifelse(Gene == 'TP53',1,0),
         g_pten = ifelse(Gene == 'PTEN',1,0),
         g_kit = ifelse(Gene == 'KIT',1,0),
         g_egfr = ifelse(Gene == 'EGFR',1,0),
         g_braf = ifelse(Gene == 'BRAF',1,0),
         g_cdkn2a = ifelse(Gene == 'CDKN2A',1,0),
         g_abl1 = ifelse(Gene == 'ABL1',1,0),
         g_tsc2 = ifelse(Gene == 'TSC2',1,0),
         g_vhl = ifelse(Gene == 'VHL',1,0))
```
Dummy variables for truncations.
```{r}
df <- df %>%
  mutate(v_trunc_mutations = ifelse(Variation == 'Truncating Mutations',1,0),
         v_t581 = ifelse(Variation == 'T58I',1,0),
         v_q61r = ifelse(Variation == 'Q61R',1,0),
         v_q61l = ifelse(Variation == 'Q61L',1,0),
         v_q61h = ifelse(Variation == 'Q61H',1,0),
         v_overexpression = ifelse(Variation == 'Overexpression',1,0),
         v_g13d = ifelse(Variation == 'G13D',1,0),
         v_g12v = ifelse(Variation == 'G12V',1,0),
         v_g12c = ifelse(Variation == 'G12C',1,0),
         v_fusions = ifelse(Variation == 'Fusions',1,0),
         v_e17k = ifelse(Variation == 'E17K',1,0),
         v_Deletion = ifelse(Variation == 'Deletion',1,0),
         v_Amplification = ifelse(Variation == 'Amplification',1,0)
  )
```
The first and last values for the genes seem to be significant, lets create dummy variables for them as well. First, the last letter in the gene.
```{r}
df$gene <- as.character(sapply(strsplit(as.character(df$Gene),""),tail,1))
df$val <- 1

df <- df %>%
  spread(key = gene,value = val, fill = 0,sep = '_end_') 
```
We can create the first letter for the gene in a similar way.
```{r}
df$gene <- as.character(sapply(strsplit(as.character(df$Gene),""),head,1))
df$val <- 1

df <- df %>%
  spread(key = gene,value = val, fill = 0,sep = '_begin_') 
```
We can also check what variation end in.
```{r}
df$vrtn <- as.character(sapply(strsplit(as.character(df$Variation),""),head,1))
df$val <- 1

df <- df %>%
  spread(key = vrtn,value = val, fill = 0,sep = '_end_') 
```
The second, third and fourth digits also seem to have some type of similar partterns, lets use the top occurences to create a few more dummy variables. Later, research will be needed as to what this represents.
```{r}
df$vrtn <- as.character(substr(as.character(df$Variation),2,4))
df$val <- 1

df <- df %>%
  group_by(vrtn,val) %>%
  summarize(count = n()) %>%
  mutate(com_mid_vrtn = ifelse(count>25,vrtn,"")) %>%
  right_join(df, by = 'vrtn') %>%
  ungroup()
  
df <- df %>%
  spread(key=com_mid_vrtn,value = 'val.x', fill = 0, sep = '_mid3_') %>%
  select(-vrtn,-count,-val.y)
```

#Feature Engineering - Text File

The text files provided are quite extensive with very long write ups by the researchers. I will take a very light text mining approach here by converting this to a Corpus and then a sparse matrix for the more common words.

First, the corpus can be created and cleaned up using the standard text-mining approach.
```{r}
t_corpus <- Corpus(VectorSource(dt$Text))
t_corpus <- t_corpus %>%
                tm_map(stripWhitespace) %>%
                tm_map(removeNumbers) %>%
                tm_map(removePunctuation) %>%
                tm_map(content_transformer(tolower)) %>%
                tm_map(removeWords, stopwords("english")) %>%
                tm_map(stemDocument, language="english")

```
I'll now convert it to a matrix and use TF-IDF for word importance. We can join back to our base dataset.
```{r}
dtm <- DocumentTermMatrix(t_corpus, control = list(weighting = weightTfIdf))
dtm <- removeSparseTerms(dtm, 0.95)
df <-  cbind(df,as.matrix(dtm))
```
Lastly, let's do sentimenet analysis on all these terms to determine positive/ negative relationships.
```{r}
df <- cbind(df,get_nrc_sentiment(as.character(df$Text))) 
```
Let's plot out the sentiments that were created.
```{r}
df[,3866:3875] %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_bar(fill = 'green') +
  facet_wrap(~key,scales = 'free') +
  theme_bw()
```
# Model Building
I will now set up the framework to run a GBT model. I will use k-fold cross-validation to get an estimate.
```{r}
set.seed(7)
cv_folds <- createFolds(df$Class[df$Class != '-9'], k=5, list=TRUE, returnTrain=FALSE) 
```
With the folds set, I'll divide the data back into our train and test sets.
```{r}
train_matrix <- df %>%
  filter(ds == 'Train') %>%
  select(-ID,-Gene,-Variation,-Class, -ds, - Text) %>%
  apply(2,as.character) %>%
  apply(2,as.integer) %>%
  as.matrix() %>%
  Matrix(sparse = T)

test_matrix <- df %>%
  filter(ds == 'Test') %>%
  select(-ID,-Gene,-Variation,-Class, -ds,-Text) %>%
  apply(2,as.character) %>%
  apply(2,as.integer) %>%
  as.matrix() %>%
  Matrix(sparse = T)
```
Let's give our training data the correct labels, as well as create the vector of our test ids for later.
```{r}
train_y <- df %>%
  filter(ds == 'Train') %>%
  select(Class) %>%
  apply(2,as.character) %>%
  apply(2,as.integer) -1

test_ids <- df[df$ds == 'Test','ID']
```
Now lets create our XGB Matrices.
```{r}
dtrain <- xgb.DMatrix(data = train_matrix, label = train_y)
dtest <- xgb.DMatrix(data = test_matrix)   
```
The paramaters can be found through a grid search or trial and error.
```{r}
param <- list(booster = 'gbtree',
              objective = 'multi:softprob',
              eval_metric = 'mlogloss',
              num_class = 9,
              eta = .2,
              gamma = 1,
              max_depth = 6,
              min_child_weight = 1,
              subsample = .7,
              colsample_bytree = .7) 
```
I'll run to find the proper rounds needed, as well get a sense of our average log-loss error on the test sets.
```{r}
xgb_cv <- xgb.cv(data = dtrain,
                 params = param,
                 nrounds = 1000,
                 maximize = FALSE,
                 prediction = TRUE,
                 folds = cv_folds,
                 print_every_n = 10,
                 early_stopping_rounds = 100)
```
We can now build our final model.
```{r}
rounds <- which.min(xgb_cv$evaluation_log$test_mlogloss_mean)

xgb_model <- xgb.train(data = dtrain,
                       params = param,
                       watchlist = list(train = dtrain),
                       nrounds = rounds,
                       verbose = 1,
                       print_every_n = 5)
```
The importance matrix is as follows, sentiment analysis helped quite a bit along with a few of our gene/variation variables.
```{r}
names <- dimnames(train_matrix)[[2]] 

importance_matrix <- xgb.importance(names,model=xgb_model)
xgb.plot.importance(importance_matrix,top_n=20)
```
We can now get our predictions on the test data.
```{r}
predictions <- xgb_model %>%
  predict(dtest) %>%
  matrix(nrow = nrow(dtest), ncol=9,byrow = T) %>%
  as.data.frame() 
```
Let's format them correctly and see if each row sums properly to 1 as a sanity check.
```{r}
colnames(predictions) <- c("class1","class2","class3","class4","class5","class6","class7","class8","class9")
  
df_preds <- data.frame(ID=test_ids,predictions)

df_preds %>%
  select(-ID) %>%
  apply(1,sum) %>%
  head(n=25)
```
And now our submission file.
```{r}
write.csv(df_preds,'submission_nb.csv',row.names = F)
```
With the very light feature engineering and model tuning, I am around the 50th percentile. More research needs to be done around how to understand the genes and mutations, as well as deeper text mining and feature creation.
