Objective

ML Question: Build a model that predicts the form of skin cancer a patient has from demographic and income-deprivation data. The model should return the predicted diagnosis and the degree (stage) of the cancer.

Background

As with all cancers, an early and accurate diagnosis is safest for a patient showing signs of skin cancer. Melanoma occurs when melanocytes (pigment-producing skin cells) behave malignantly, for example through unnaturally rapid cell division. While melanoma is less common than other forms of skin cancer, it is more dangerous because it is more likely to spread to other parts of the body. Hence, melanoma is best treated in its early stages, before it can spread and become difficult to track.

Undetected melanoma can spread to vital areas and cause bodily dysfunction, especially in the eyes, nose, musculoskeletal system, or internal organ systems. In critical cases, leaving melanoma untreated can lower survival rates from as high as 63% to less than 20%. This highlights the importance of a model that can, in effect, increase awareness among vulnerable populations. Greater awareness of melanoma incidence among people who may develop the disease because of environmental factors can raise their chances of receiving treatment early, while the disease is still curable.

Data Description

The melanoma epidemiology registration data consists of United Kingdom National Health Service patient information collated by the National Cancer Registration and Analysis Service, Public Health England (PHE). Covering the years 1995-2017, the data contains general demographics (age, sex, ethnicity, and an area code) as well as a column describing the proportion of people experiencing income deprivation in the patient's area. We selected these variables in addition to the histology type (melanoma diagnosis), the tumor degree, and the basis of diagnosis (how the melanoma was discovered).

Because many melanoma diagnoses were rare (fewer than 20 instances among the 210,477 cases), the data was consolidated into a small number of general groups. Melanoma has four common subtypes: superficial spreading, nodular, lentigo maligna, and acral lentiginous. These groups already existed in the histology factor, but some diagnoses were recorded more specifically and so were not counted under the general classifications. During data cleaning we therefore standardized spelling and capitalization and merged groupable classes where appropriate. Not Otherwise Specified (NOS) is an additional group representing the unclassified melanoma diagnoses, which accounts for the minority subtypes. NOS and carcinoma cases are labeled ‘Other’, as they do not fall under any of the four main subtypes.
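
The grouping step can be sketched as follows. This is only an illustration, not the cleaning code used for this report: the raw labels in `raw_hist` are hypothetical examples rather than values taken from the registry, and `group_histology` is an assumed helper name.

```r
# Hypothetical raw histology labels (illustrative only; not taken from the registry data)
raw_hist <- c("Superficial spreading melanoma", "NODULAR MELANOMA",
              "Lentigo maligna melanoma", "Acral lentiginous melanoma, malignant",
              "Malignant melanoma, NOS", "Basal cell carcinoma, NOS",
              "nodular melanoma ")

# Map a free-text label onto one of the five groups used in this report
group_histology <- function(x) {
  x <- tolower(trimws(x))              # normalise capitalization and stray whitespace
  out <- rep("Other", length(x))       # default group: NOS, carcinoma, rare subtypes
  out[grepl("superficial spreading", x)] <- "Superficial_spreading"
  out[grepl("nodular", x)]               <- "nodular"
  out[grepl("lentigo maligna", x)]       <- "lentigo_maligna"
  out[grepl("acral lentiginous", x)]     <- "acral_lentiginous"
  factor(out, levels = c("acral_lentiginous", "lentigo_maligna", "nodular",
                         "Superficial_spreading", "Other"))
}

hist_grouped <- group_histology(raw_hist)
table(hist_grouped)
```

Matching on lower-cased substrings keeps spelling and capitalization variants in the same group, and anything unmatched falls through to ‘Other’.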

Exploratory Data Analysis

str(cmeld)
## 'data.frame':    210477 obs. of  9 variables:
##  $ DIAGNOSISYEAR       : Factor w/ 23 levels "1995","1996",..: 5 5 5 13 5 13 12 10 5 13 ...
##  $ ethnicity_band      : Factor w/ 3 levels "Non-White","Unknown",..: 3 1 3 3 3 3 3 1 3 3 ...
##  $ age_group           : Factor w/ 3 levels "<45","45-69",..: 1 1 3 3 2 3 3 2 2 3 ...
##  $ HISTOLOGY_CODED_DESC: Factor w/ 5 levels "acral_lentiginous",..: 4 4 3 5 3 5 4 5 3 3 ...
##  $ SEX                 : Factor w/ 2 levels "1","2": 1 1 2 2 2 1 2 1 2 2 ...
##  $ CREG_CODE           : Factor w/ 8 levels "Y0201","Y0301",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ BASISOFDIAGNOSIS    : Factor w/ 9 levels "0","1","2","3",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ GRADE               : Factor w/ 6 levels "","G1","G2","G3",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ QUINTILE_2015       : Factor w/ 5 levels "1 - least deprived",..: 2 3 1 3 4 5 4 2 3 3 ...

The structure of cmeld shows that all the variables are factors and therefore non-continuous. For that reason, we use bar plots rather than histograms or scatterplots to display the counts of each melanoma type against the related variables.

summary(cmeld)
##  DIAGNOSISYEAR      ethnicity_band   age_group    
##  2016   : 13797   Non-White: 41711   <45  :41534  
##  2017   : 13740   Unknown  : 15166   45-69:96181  
##  2015   : 13431   White    :153600   70+  :72762  
##  2014   : 13110                                   
##  2013   : 12501                                   
##  2012   : 11644                                   
##  (Other):132254                                   
##             HISTOLOGY_CODED_DESC SEX          CREG_CODE     BASISOFDIAGNOSIS
##  acral_lentiginous    : 2556     1: 99610   Y1001  :42705   7      :203042  
##  lentigo_maligna      : 6932     2:110867   Y0801  :39206   1      :  2861  
##  nodular              :26333                Y1701  :26845   6      :  1997  
##  Superficial_spreading:98751                Y0201  :25647   9      :  1104  
##  Other                :75905                Y0401  :25322   2      :   810  
##                                             Y1201  :19090   0      :   447  
##                                             (Other):31662   (Other):   216  
##  GRADE                  QUINTILE_2015  
##    :  5469   1 - least deprived:58188  
##  G1:   189   2                 :52913  
##  G2:   231   3                 :43567  
##  G3:   716   4                 :33251  
##  G4:   248   5 - most deprived :22558  
##  GX:203624                             
## 

The summary shows the counts for every level. Several columns have most of their data concentrated in one or two levels, for example GRADE, ethnicity_band, age_group, HISTOLOGY_CODED_DESC, and BASISOFDIAGNOSIS. So, let's calculate the following proportions from the data set.

# Let's see the following proportions from the data set.
table(cmeld$age_group)/sum(table(cmeld$age_group))
## 
##       <45     45-69       70+ 
## 0.1973327 0.4569668 0.3457005
table(cmeld$GRADE)/sum(table(cmeld$GRADE))
## 
##                        G1           G2           G3           G4           GX 
## 0.0259838367 0.0008979603 0.0010975071 0.0034017969 0.0011782760 0.9674406230
table(cmeld$ethnicity_band)/sum(table(cmeld$ethnicity_band))
## 
##  Non-White    Unknown      White 
## 0.19817367 0.07205538 0.72977095
table(cmeld$HISTOLOGY_CODED_DESC)/sum(table(cmeld$HISTOLOGY_CODED_DESC))
## 
##     acral_lentiginous       lentigo_maligna               nodular 
##            0.01214384            0.03293471            0.12511106 
## Superficial_spreading                 Other 
##            0.46917715            0.36063323

From the above proportions, 96.74% of GRADE is “GX”, 72.98% of ethnicity_band is “White”, and HISTOLOGY_CODED_DESC has 46.92% of its data in the “Superficial_spreading” level and 36.06% in the “Other” level. Such a dominant level can have a strong impact on how the model is formed.
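
To make the effect of this imbalance concrete, here is a minimal sketch using the HISTOLOGY_CODED_DESC counts printed by `summary(cmeld)`. A classifier that always predicts the most frequent level would already be right about 47% of the time, which is a useful baseline for any accuracy figure. The variable names `hist_counts`, `hist_props`, and `majority_baseline` are introduced here for illustration.

```r
# Class counts for HISTOLOGY_CODED_DESC, copied from the summary(cmeld) output
hist_counts <- c(acral_lentiginous = 2556, lentigo_maligna = 6932,
                 nodular = 26333, Superficial_spreading = 98751, Other = 75905)

# Proportion of each level (equivalent to table(x) / sum(table(x)))
hist_props <- prop.table(hist_counts)
round(hist_props, 4)

# Accuracy of a model that always predicts the most common level
majority_class    <- names(which.max(hist_props))
majority_baseline <- max(hist_props)
majority_class
majority_baseline
```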

Bar Plots

library(ggplot2)  # plotting package used for all bar plots below

# bar plot -- number of patients w/ each type of melanoma by age group
ggplot(cmeld, aes(HISTOLOGY_CODED_DESC)) + 
  geom_bar(aes(fill = age_group)) +
  labs(
    title = 'Number of Patients with Each Type of Melanoma by Age Group') +
  xlab('Histology') + 
  ylab('Count') +
  scale_fill_discrete(name = "Age Group")+
  theme_minimal()
[Figure: Age]

The age bar plot shows that most of our data falls in the Superficial spreading and Other categories, and that the majority of cases occur in patients aged 45 and over.

# bar plot -- number of patients w/ each type of melanoma by SEX
ggplot(cmeld, aes(HISTOLOGY_CODED_DESC)) + 
  geom_bar(aes(fill = SEX)) +
  labs(
    title = 'Number of Patients with Each Type of Melanoma by Sex') +
  xlab('Histology') + 
  ylab('Count') +
  scale_fill_discrete(name = "SEX", labels = c("Male", "Female"))+
  theme_minimal()
[Figure: Sex]

The sex bar plot displays that the distribution of male vs female data is fairly even.

# bar plot -- counts of each type of melanoma by basis of diagnosis 
ggplot(cmeld, aes(HISTOLOGY_CODED_DESC)) + 
  geom_bar(aes(fill = BASISOFDIAGNOSIS)) +
  labs(
    title = 'Number of Patients with Each Type of Melanoma by their Basis of Diagnosis') +
  xlab('Histology') + 
  ylab('Count') +
  scale_fill_discrete(name = "Basis of Diagnosis",
                      labels = c("Death certificate", "Clinical", "Clinical investigation",
                                 "Clinical investigation 2", "Specific tumour markers", "Cytology",
                                 "Histology of a metastasis", "Histology of a primary tumour",
                                 "Unknown")) +
  theme_minimal()
[Figure: Basis of Diagnosis]

The Basis of Diagnosis bar plot displays that the overwhelming majority of cases were diagnosed on the basis of the histology of the primary tumor.

Clustering / Decision Tree

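# Packages used below (assumed to be installed): caret for createDataPartition()
# and train(), data.table for as.data.table(), mltools for one_hot()
library(caret)
library(data.table)
library(mltools)

# select the columns for clustering (drop DIAGNOSISYEAR, the first column)
set.seed(2002)
cluster_columns <- cmeld[, -1]
cluster_1h <- one_hot(as.data.table(cluster_columns), cols = "auto", sparsifyNAs = TRUE,
                      naCols = FALSE, dropCols = TRUE, dropUnusedLevels = TRUE)

# an elbow chart (not shown) suggested using 8 clusters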
kmeans_cluster = kmeans(cluster_1h, centers = 8,
                        algorithm = "Lloyd")
cluster_columns$cluster<-as.factor(kmeans_cluster$cluster)


# making the tree
train_index <- createDataPartition(cluster_columns$HISTOLOGY_CODED_DESC,
                                           p = .7,
                                           list = FALSE,
                                           times = 1)
train <- cluster_columns[train_index,]
tune_and_test <- cluster_columns[-train_index, ]

# Then we use the function again to create the tuning set. The indices must
# refer to rows of tune_and_test, so we partition on its target column.

tune_and_test_index <- createDataPartition(tune_and_test$HISTOLOGY_CODED_DESC,
                                           p = .5,
                                           list = FALSE,
                                           times = 1)

tune <- tune_and_test[tune_and_test_index, ]
test <- tune_and_test[-tune_and_test_index, ]


# Create our features and target for training of the model. 

features <- as.data.frame(train[,-3])
target <- train$HISTOLOGY_CODED_DESC

cancer_dt <- train(x=features,
                    y=target,
                    method="rpart")

# This is more or less an easy target, but the clusters are very predictive.

#varImp(cancer_dt)

# Let's predict and see how we did. 



#confusionMatrix(as.factor(dt_predict_1), 
                #as.factor(test$HISTOLOGY_CODED_DESC), 
                #dnn=c("Prediction", "Actual"), 
                #mode = "sens_spec")
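
One point worth checking in the block above: `cluster_columns` still contains `HISTOLOGY_CODED_DESC` when it is one-hot encoded, so the k-means clusters are computed with the target included. The following is a minimal sketch, on synthetic data, of computing elbow values and clusters from the predictor columns only. It uses base R `model.matrix` in place of `mltools::one_hot`, and the data frame `toy`, its values, and the choice of 4 clusters are made up for illustration.

```r
set.seed(2002)

# Synthetic stand-in for the cleaned data (hypothetical values, not the registry data)
n <- 500
toy <- data.frame(
  age_group     = factor(sample(c("<45", "45-69", "70+"), n, replace = TRUE)),
  SEX           = factor(sample(c("1", "2"), n, replace = TRUE)),
  QUINTILE_2015 = factor(sample(1:5, n, replace = TRUE)),
  target        = factor(sample(c("nodular", "Superficial_spreading", "Other"),
                                n, replace = TRUE))
)

# One-hot encode the predictors only; the target column is deliberately left out
predictors <- toy[, setdiff(names(toy), "target")]
X <- model.matrix(~ . - 1, data = predictors,
                  contrasts.arg = lapply(predictors, contrasts, contrasts = FALSE))

# Elbow values: total within-cluster sum of squares for k = 1..6
ks  <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(X, centers = k, nstart = 5, iter.max = 50, algorithm = "Lloyd")$tot.withinss
})

# Fit the chosen k (4 here is arbitrary) and attach the cluster label as a feature
km <- kmeans(X, centers = 4, nstart = 5, iter.max = 50, algorithm = "Lloyd")
toy$cluster <- factor(km$cluster)
```

Plotting `wss` against `ks` gives the elbow chart. Comparing the tree's accuracy with clusters built this way against the clusters built above would show how much of the gain comes from the target column itself.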

Methods

Because we are trying to identify which of the 5 melanoma categories a diagnosed patient has from demographic factors, we chose a decision tree as our classifier. All of the model's features are factors and the target variable has 5 levels, and we felt a decision tree was well suited to classification under those conditions. We initially tried a kNN model to see whether our data had predictive value, but could not get it to work with a five-level target, because our kNN setup needed a designated positive class within the target variable. This limited us to a decision tree or a random forest. Given the low tree depth and the small number of features (8), we decided that a single decision tree was sufficient and that the extra run time and memory of a random forest would not be worth the marginal increase in accuracy. Random forests also tend to favor the majority class, a problem this data set already has with its two large levels, Superficial spreading and Other.
Our decision tree differs from a default one, however: when we ran the tree on the unmodified features, accuracy reached only about 55%. The modified tree is built around clustering. We grouped the data into 8 clusters and added each row's cluster assignment as an extra feature column. We then ran the decision tree algorithm on the original features plus the cluster column, which produced a much more effective tree than the original. Specifically, the original tree only ever predicted the two most prominent levels, Superficial spreading and Other, whereas the new version also predicts a third level (nodular) and is more accurate on the other two.
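
As a point of reference for the accuracy figures quoted here, the following minimal sketch fits a plain classification tree on factor-only data and compares its hold-out accuracy with a majority-class baseline. It calls `rpart` directly rather than through `caret::train`, and the data frame `toy` is synthetic, with a target that is deliberately tied to `age_group`; none of the values come from the registry data.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(2002)

# Synthetic factor-only data (hypothetical; stands in for the cleaned registry data)
n <- 2000
toy <- data.frame(
  age_group = factor(sample(c("<45", "45-69", "70+"), n, replace = TRUE)),
  SEX       = factor(sample(c("1", "2"), n, replace = TRUE))
)

# Target follows age_group, with about 20% of labels reassigned at random
signal <- c("<45" = "Superficial_spreading", "45-69" = "Other", "70+" = "nodular")
target <- unname(signal[as.character(toy$age_group)])
noisy  <- runif(n) < 0.2
target[noisy] <- sample(unname(signal), sum(noisy), replace = TRUE)
toy$target <- factor(target)

# 70/30 train/test split
train_idx <- sample(n, size = 0.7 * n)
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

# Fit the tree and predict class labels for the held-out rows
tree <- rpart(target ~ ., data = train_set, method = "class")
pred <- predict(tree, newdata = test_set, type = "class")

# Compare with always predicting the most common class
accuracy <- mean(as.character(pred) == as.character(test_set$target))
baseline <- max(prop.table(table(test_set$target)))
c(tree = accuracy, majority_baseline = baseline)
```

Reporting the baseline next to the accuracy makes it easier to judge how much a tree adds beyond always predicting the largest class.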

set.seed(2002)
cancer_dt$finalModel # print the fitted rpart tree
## n= 147337 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 147337 78211 Superficial_spreading (0.012 0.033 0.13 0.47 0.36)  
##    2) cluster=3,5,8 61568  1857 Superficial_spreading (0.006 0.0056 0.019 0.97 0) *
##    3) cluster=1,2,4,6,7 85769 32635 Other (0.017 0.053 0.2 0.11 0.62)  
##      6) cluster=1,4 36372 23584 nodular (0.029 0.11 0.35 0.24 0.27)  
##       12) SEX=1 11790  3038 nodular (0.044 0.21 0.74 0 0) *
##       13) SEX=2 24582 14843 Other (0.021 0.061 0.16 0.36 0.4) *
##      7) cluster=2,6,7 49397  6002 Other (0.0077 0.01 0.091 0.013 0.88) *
dt_predict_1 <- predict(cancer_dt, test, type = "raw")
library(rpart.plot)  # provides rpart.plot()
rpart.plot(cancer_dt$finalModel, type = 4, extra = 101) # this is a visual of the final tree

Model Evaluation

Looking at the overall statistics, the model's accuracy is just above 82 percent (0.822), with a relatively narrow 95% confidence interval (0.8177 to 0.8262). While this is not good enough to serve as an exact diagnosis in a medical setting, it is still potentially usable from a large-scale point of view.

The per-class statistics are less even. Sensitivity is 0 for acral lentiginous and lentigo maligna, which have only about 400 and 1,000 cases in the test set, far fewer than the other levels. The model never predicts these two classes, so it fails to detect them, and it also struggles with nodular (sensitivity of about 46%). Results are much stronger for Superficial spreading and Other, largely because of how dominant they are in the data set: Other has 100% sensitivity and ~77% specificity, and Superficial spreading has ~86% sensitivity and ~97.5% specificity. The confusion matrix shows that every actual Other case was predicted as Other; the reverse does not hold, since only about 71% of Other predictions were correct (positive predictive value 0.7113). The model can therefore help give a patient a broad idea of the cancer they have, even though it misses the rarer forms of skin cancer. When testing with other seeds, we also noticed that the models can end up favoring different classes, for example with higher Superficial spreading sensitivity.

As expected, the cluster column is by far the most important variable, with a scaled importance of 100 against roughly 10 for the next most important variables. Without this column, which was created through k-means clustering, our model's accuracy and predictive quality would plummet.

varImp(cancer_dt)
## rpart variable importance
## 
##                  Overall
## cluster          100.000
## age_group         10.714
## SEX               10.506
## BASISOFDIAGNOSIS   8.195
## CREG_CODE          5.659
## ethnicity_band     2.249
## QUINTILE_2015      0.000
## GRADE              0.000
confusionMatrix(as.factor(dt_predict_1),
                as.factor(test$HISTOLOGY_CODED_DESC),
                dnn=c("Prediction", "Actual"),
                mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##                        Actual
## Prediction              acral_lentiginous lentigo_maligna nodular
##   acral_lentiginous                     0               0       0
##   lentigo_maligna                       0               0       0
##   nodular                             105             512    1783
##   Superficial_spreading                83              85     245
##   Other                               208             427    1849
##                        Actual
## Prediction              Superficial_spreading Other
##   acral_lentiginous                         0     0
##   lentigo_maligna                           0     0
##   nodular                                   0     0
##   Superficial_spreading                 12810     0
##   Other                                  2082 11251
## 
## Overall Statistics
##                                           
##                Accuracy : 0.822           
##                  95% CI : (0.8177, 0.8262)
##     No Information Rate : 0.4737          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7089          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: acral_lentiginous Class: lentigo_maligna
## Sensitivity                            0.0000                0.00000
## Specificity                            1.0000                1.00000
## Pos Pred Value                            NaN                    NaN
## Neg Pred Value                         0.9874                0.96743
## Prevalence                             0.0126                0.03257
## Detection Rate                         0.0000                0.00000
## Detection Prevalence                   0.0000                0.00000
## Balanced Accuracy                      0.5000                0.50000
##                      Class: nodular Class: Superficial_spreading Class: Other
## Sensitivity                 0.45989                       0.8602       1.0000
## Specificity                 0.97761                       0.9750       0.7738
## Pos Pred Value              0.74292                       0.9688       0.7113
## Neg Pred Value              0.92789                       0.8857       1.0000
## Prevalence                  0.12331                       0.4737       0.3579
## Detection Rate              0.05671                       0.4074       0.3579
## Detection Prevalence        0.07634                       0.4206       0.5031
## Balanced Accuracy           0.71875                       0.9176       0.8869
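
The per-class statistics can be recomputed by hand from the confusion matrix printed above, which is a useful check on how each number is defined. The sketch below copies the counts from that output; the names `cm`, `tp`, `fn`, `fp`, and `tn` are introduced here for illustration.

```r
classes <- c("acral_lentiginous", "lentigo_maligna", "nodular",
             "Superficial_spreading", "Other")

# Rows = predicted class, columns = actual class (counts copied from the output above)
cm <- matrix(c(  0,   0,    0,     0,     0,
                 0,   0,    0,     0,     0,
               105, 512, 1783,     0,     0,
                83,  85,  245, 12810,     0,
               208, 427, 1849,  2082, 11251),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Actual = classes))

tp <- diag(cm)               # correctly predicted cases per class
fn <- colSums(cm) - tp       # actual members of the class that were missed
fp <- rowSums(cm) - tp       # cases of other classes predicted as this class
tn <- sum(cm) - tp - fn - fp

accuracy    <- sum(tp) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)   # NaN where a class is never predicted

round(accuracy, 4)
round(cbind(sensitivity, specificity, ppv), 4)
```

Sensitivity is computed down a column (actual cases) and positive predictive value along a row (predictions), which is why a class can have perfect sensitivity without every prediction of that class being correct.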

Fairness Assessment

The protected classes in this data set are ethnicity and sex. In the variable importance computed for our model, sex scored 10.506 and ethnicity 2.249. While not zero, the contribution of these variables to the model is low. For ethnicity, this is partly because the variable is only coarsely divided, into “White”, “Non-White”, and “Unknown”.
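
Variable importance shows how much the tree relies on a feature, not whether predictions are equally accurate across groups. One complementary check, sketched here on synthetic data, is to compare accuracy within each level of the protected attributes. The data frame `eval_df` and its `correct` column are hypothetical; in practice `correct` would be `dt_predict_1 == test$HISTOLOGY_CODED_DESC` on the test set.

```r
set.seed(2002)

# Hypothetical evaluation table: one row per test patient (synthetic values)
n <- 1000
eval_df <- data.frame(
  ethnicity_band = factor(sample(c("White", "Non-White", "Unknown"), n,
                                 replace = TRUE, prob = c(0.73, 0.20, 0.07))),
  SEX            = factor(sample(c("1", "2"), n, replace = TRUE)),
  correct        = runif(n) < 0.82   # TRUE where the prediction matched the actual class
)

# Accuracy within each level of a protected attribute
acc_by_ethnicity <- tapply(eval_df$correct, eval_df$ethnicity_band, mean)
acc_by_sex       <- tapply(eval_df$correct, eval_df$SEX, mean)

# Largest accuracy gap between groups; a large gap would suggest uneven performance
gap_ethnicity <- diff(range(acc_by_ethnicity))
gap_sex       <- diff(range(acc_by_sex))

round(acc_by_ethnicity, 3)
round(acc_by_sex, 3)
```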

Conclusions

Having evaluated our approach to predicting the type of melanoma a diagnosed patient has, we see some merit in the model for predicting some of the categories, but overall it is not accurate enough for a medical setting or for an exact diagnosis. Feeding cluster assignments into a decision tree was an overall success: accuracy rose by roughly 27 percentage points (from about 55% to 82%), and the model now predicts an additional level of the target variable. This is most likely the approach we would continue to experiment with if we kept working on this data set, although we would probably need more specific data on age and other demographic factors to make the predictions more useful. Medical settings tend to have the strictest standards for acceptable accuracy because false positives and false negatives can be so consequential, so while 82% accuracy is a substantial improvement, the model still needs further work before it would be useful in practice.

Future Work

Additional analysis of demographic factors is necessary. The raw data contains vague and over-generalized demographic information: ethnicity is recorded only as ‘White’, ‘Non-White’, or ‘Unknown’, and the economic factor is an area-level deprivation quintile rather than a more direct financial measure such as individual income or socioeconomic status. With more detailed information about each patient, a more comprehensive model might have been possible. Another limitation was the ill-defined histology (diagnosis) in the raw data. Inconsistent data organization may have led to an imbalance in the melanoma subtype counts, with particular classes over- or under-represented because of misidentification. One possible remedy would be to merge in an additional data set that supports the ML question, but we struggled to find compatible data sources.