Breast cancer is currently one of the deadliest types of cancer. We will investigate how useful machine learning is for classifying tumors as malignant or benign, applying machine learning and data science techniques to the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names
The dataset consists of real data.
setwd('E:/projetos/previsao_cancer_de_mama')
pacman::p_load(tidyverse,class,caTools,gmodels,DMwR,caret,e1071,knitr)
df <- read.csv("dataset.csv")
glimpse(df)
## Rows: 569
## Columns: 32
## $ id <int> 87139402, 8910251, 905520, 868871, 9012568, 90653...
## $ diagnosis <chr> "B", "B", "B", "B", "B", "B", "B", "M", "B", "B",...
## $ radius_mean <dbl> 12.32, 10.60, 11.04, 11.28, 15.19, 11.57, 11.51, ...
## $ texture_mean <dbl> 12.39, 18.95, 16.83, 13.39, 13.21, 19.04, 23.93, ...
## $ perimeter_mean <dbl> 78.85, 69.28, 70.92, 73.00, 97.65, 74.20, 74.52, ...
## $ area_mean <dbl> 464.1, 346.4, 373.2, 384.8, 711.8, 409.7, 403.5, ...
## $ smoothness_mean <dbl> 0.10280, 0.09688, 0.10770, 0.11640, 0.07963, 0.08...
## $ compactness_mean <dbl> 0.06981, 0.11470, 0.07804, 0.11360, 0.06934, 0.07...
## $ concavity_mean <dbl> 0.039870, 0.063870, 0.030460, 0.046350, 0.033930,...
## $ points_mean <dbl> 0.037000, 0.026420, 0.024800, 0.047960, 0.026570,...
## $ symmetry_mean <dbl> 0.1959, 0.1922, 0.1714, 0.1771, 0.1721, 0.2031, 0...
## $ dimension_mean <dbl> 0.05955, 0.06491, 0.06340, 0.06072, 0.05544, 0.06...
## $ radius_se <dbl> 0.2360, 0.4505, 0.1967, 0.3384, 0.1783, 0.2864, 0...
## $ texture_se <dbl> 0.6656, 1.1970, 1.3870, 1.3430, 0.4125, 1.4400, 2...
## $ perimeter_se <dbl> 1.670, 3.430, 1.342, 1.851, 1.338, 2.206, 1.936, ...
## $ area_se <dbl> 17.43, 27.10, 13.54, 26.33, 17.72, 20.30, 16.97, ...
## $ smoothness_se <dbl> 0.008045, 0.007470, 0.005158, 0.011270, 0.005012,...
## $ compactness_se <dbl> 0.011800, 0.035810, 0.009355, 0.034980, 0.014850,...
## $ concavity_se <dbl> 0.016830, 0.033540, 0.010560, 0.021870, 0.015510,...
## $ points_se <dbl> 0.012410, 0.013650, 0.007483, 0.019650, 0.009155,...
## $ symmetry_se <dbl> 0.01924, 0.03504, 0.01718, 0.01580, 0.01647, 0.01...
## $ dimension_se <dbl> 0.002248, 0.003318, 0.002198, 0.003442, 0.001767,...
## $ radius_worst <dbl> 13.50, 11.88, 12.41, 11.92, 16.20, 13.07, 12.48, ...
## $ texture_worst <dbl> 15.64, 22.94, 26.44, 15.77, 15.73, 26.98, 37.16, ...
## $ perimeter_worst <dbl> 86.97, 78.28, 79.93, 76.53, 104.50, 86.43, 82.28,...
## $ area_worst <dbl> 549.1, 424.8, 471.4, 434.0, 819.1, 520.5, 474.2, ...
## $ smoothness_worst <dbl> 0.1385, 0.1213, 0.1369, 0.1367, 0.1126, 0.1249, 0...
## $ compactness_worst <dbl> 0.12660, 0.25150, 0.14820, 0.18220, 0.17370, 0.19...
## $ concavity_worst <dbl> 0.124200, 0.191600, 0.106700, 0.086690, 0.136200,...
## $ points_worst <dbl> 0.09391, 0.07926, 0.07431, 0.08611, 0.08178, 0.06...
## $ symmetry_worst <dbl> 0.2827, 0.2940, 0.2998, 0.2102, 0.2487, 0.3035, 0...
## $ dimension_worst <dbl> 0.06771, 0.07587, 0.07881, 0.06784, 0.06766, 0.08...
All variables in the dataset are numeric except “diagnosis”, the class label that indicates whether a record corresponds to a “benign” (B) or “malignant” (M) tumor.
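# Drop the identifier, encode the target as a factor (B -> 0, M -> 1)
# and standardize every numeric column
df <- df %>% select(-id) %>%
  mutate(diagnosis = factor(diagnosis, levels = c("B","M"), labels = c(0,1))) %>%
  mutate_if(is.numeric,scale)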
At this point, the dataset was adjusted.
We dropped the “id” column, standardized the numeric variables with “scale” (each column is centered on its mean and divided by its standard deviation), and converted the target variable “diagnosis” into a factor with labels 0 and 1, where 0 = benign and 1 = malignant.
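As a minimal illustration of what “scale” does to each numeric column, the sketch below standardizes a small vector by hand and compares the result with scale(). The vector x reuses the first few radius_mean values from the glimpse() output purely as sample input.

```r
# First radius_mean values from the glimpse() output, used only as sample input
x <- c(12.32, 10.60, 11.04, 11.28, 15.19)

# z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops the matrix attributes
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)  # TRUE
mean(z_scale)                 # approximately 0
sd(z_scale)                   # approximately 1
```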
round(prop.table(table(df$diagnosis))*100,2)
##
## 0 1
## 62.74 37.26
ggplot(df, aes(x = diagnosis))+
  geom_bar(fill = "Steelblue")+
  labs(title = "Proportion of each cancer type")+
  theme_minimal()
The class balance also needs to be checked so that the model generalizes well instead of favoring the majority class. Here, 62.74% of the records are “Benign” and 37.26% are “Malignant”. We will balance the classes with the SMOTE (Synthetic Minority Over-sampling Technique) algorithm.
set.seed(123)
df2 <- DMwR::SMOTE(diagnosis~.,df, perc.over = 900,perc.under = 100)
round(prop.table(table(df2$diagnosis))*100,2)
##
## 0 1
## 47.37 52.63
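The proportions above follow from how perc.over and perc.under are documented in DMwR::SMOTE. The sketch below reproduces the arithmetic, assuming original class counts of 357 benign and 212 malignant records (569 rows at 62.74% / 37.26%), and then shows the interpolation step SMOTE uses to create one synthetic point. The helper synth_point is hypothetical, and the real algorithm picks the neighbor among the k nearest minority cases rather than receiving it as an argument.

```r
n_benign <- 357   # majority class (assumed from 62.74% of 569 rows)
n_malign <- 212   # minority class (assumed from 37.26% of 569 rows)
perc_over <- 900
perc_under <- 100

# perc.over = 900 creates 9 synthetic cases per original minority case
n_synthetic <- n_malign * perc_over / 100        # 1908
n_malign_new <- n_malign + n_synthetic           # 2120
# perc.under = 100 keeps as many majority cases as synthetic cases were created
n_benign_new <- n_synthetic * perc_under / 100   # 1908

props <- round(100 * c(n_benign_new, n_malign_new) / (n_benign_new + n_malign_new), 2)
props  # 47.37 52.63

# One synthetic point lies on the segment between a minority case and a neighbor
synth_point <- function(x, neighbor) {
  gap <- runif(1)            # random position along the segment
  x + gap * (neighbor - x)
}

set.seed(123)
p <- synth_point(c(1, 2), c(3, 6))
p
```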
set.seed(123)
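# Note: teste is drawn from the whole of df2, not from the rows left out of treino,
# so the two samples share observations
treino <- df2 %>% sample_frac(0.7)
teste <- df2 %>% sample_frac(0.3)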
set.seed(123)
round(prop.table(table(treino$diagnosis))*100,2)
##
## 0 1
## 47.77 52.23
round(prop.table(table(teste$diagnosis))*100,2)
##
## 0 1
## 47.27 52.73
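Following the same balancing principle, the “treino” (training) and “teste” (test) datasets are both balanced. Because “teste” is sampled from all of “df2” rather than from the rows left out of “treino”, the two sets share observations, so the metrics reported below should be read as optimistic.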
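Calling sample_frac twice on df2 draws two independent samples. A hold-out split that keeps the two sets disjoint can be sketched in base R as below; the data frame toy and the index names are hypothetical stand-ins, not objects from the analysis.

```r
set.seed(123)
toy <- data.frame(id = 1:100, x = rnorm(100))   # hypothetical stand-in for df2

# Sample 70% of the row indices for training; the remaining rows form the test set
idx_train <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train_set <- toy[idx_train, ]
test_set  <- toy[-idx_train, ]

nrow(train_set)                                # 70
nrow(test_set)                                 # 30
length(intersect(train_set$id, test_set$id))   # 0, no shared rows
```

When oversampling is used, applying SMOTE only to the training rows after such a split also keeps synthetic points derived from test cases out of training.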
set.seed(123)
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
set.seed(123)
knn_v1 <- caret::train(diagnosis~.,
                       data = treino,
                       method = "knn",
                       trControl = ctrl,
                       tuneLength = 20)
plot(knn_v1)
The plot above shows the cross-validated accuracy for each value of k tested; caret keeps the k with the highest accuracy as the final model.
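For readers unfamiliar with the classifier being tuned, the following base R sketch shows the k-nearest-neighbors rule itself: measure the Euclidean distance from a new point to every training point and take a majority vote among the k closest. The function knn_predict and the toy data are hypothetical, and this is not the implementation caret uses.

```r
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to each training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(d)[seq_len(k)]
  # Majority vote among the k nearest labels
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c("0", "0", "0", "1", "1", "1"))

knn_predict(train_x, train_y, c(0.5, 0.5), k = 3)  # "0"
knn_predict(train_x, train_y, c(5.5, 5.5), k = 3)  # "1"
```

A small k follows the training data closely, while a large k smooths the decision boundary, which is why the value is chosen by cross-validation.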
set.seed(123)
previsao_knn <- predict(knn_v1,newdata=teste)
confusionMatrix(previsao_knn,teste$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 569 0
## 1 2 637
##
## Accuracy : 0.9983
## 95% CI : (0.994, 0.9998)
## No Information Rate : 0.5273
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9967
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9965
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9969
## Prevalence : 0.4727
## Detection Rate : 0.4710
## Detection Prevalence : 0.4710
## Balanced Accuracy : 0.9982
##
## 'Positive' Class : 0
##
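The statistics above can be recomputed by hand from the four cells of the matrix, which makes it clear what each one measures. The sketch below does so in base R, treating class 0 (benign) as the positive class, as the output indicates; the variable names are illustrative.

```r
# Cells of the confusion matrix above (rows = prediction, columns = reference)
tp <- 569  # predicted 0, actually 0
fp <- 0    # predicted 0, actually 1
fn <- 2    # predicted 1, actually 0
tn <- 637  # predicted 1, actually 1

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)   # share of actual class 0 that was recovered
specificity <- tn / (tn + fp)   # share of actual class 1 that was recovered
ppv         <- tp / (tp + fp)   # share of class 0 predictions that were right
npv         <- tn / (tn + fn)   # share of class 1 predictions that were right

stats <- round(c(accuracy, sensitivity, specificity, ppv, npv), 4)
stats
# 0.9983 0.9965 1.0000 1.0000 0.9969
```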
set.seed(123)
controle <- trainControl(method = "repeatedcv",
                         repeats = 3,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)
df3 <- df2 %>% mutate(diagnosis = factor(diagnosis,levels = c("0","1"), labels = c("Benigno","Maligno")))
treino2 <- df3 %>% sample_frac(0.7)
teste2 <- df3 %>% sample_frac(0.3)
knn_v2 <- caret::train(diagnosis~.,
                       data = treino2,
                       method = "knn",
                       trControl = controle,
                       metric = "ROC",
                       tuneLength = 20)
plot(knn_v2)
The plot above shows the cross-validated area under the ROC curve for each value of k tested; caret keeps the k with the highest ROC value as the final model.
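With metric = "ROC", the candidate values of k are compared by the area under the ROC curve (AUC). As a rough illustration of that quantity, the base R sketch below computes the AUC from ranks (the Mann-Whitney formulation) on a tiny made-up example; the function auc_rank and the scores are hypothetical.

```r
auc_rank <- function(score, is_positive) {
  r <- rank(score)                      # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score <- c(0.10, 0.40, 0.35, 0.80)          # predicted probability of the positive class
is_positive <- c(FALSE, FALSE, TRUE, TRUE)  # true labels

auc_rank(score, is_positive)  # 0.75
```

An AUC of 0.75 here means that a randomly chosen positive case receives a higher score than a randomly chosen negative case 75% of the time.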
set.seed(123)
previsao_knn2 <- predict(knn_v2,teste2[-1])
confusionMatrix(previsao_knn2,teste2$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benigno Maligno
## Benigno 561 6
## Maligno 10 631
##
## Accuracy : 0.9868
## 95% CI : (0.9786, 0.9924)
## No Information Rate : 0.5273
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9734
##
## Mcnemar's Test P-Value : 0.4533
##
## Sensitivity : 0.9825
## Specificity : 0.9906
## Pos Pred Value : 0.9894
## Neg Pred Value : 0.9844
## Prevalence : 0.4727
## Detection Rate : 0.4644
## Detection Prevalence : 0.4694
## Balanced Accuracy : 0.9865
##
## 'Positive' Class : Benigno
##
set.seed(123)
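modelo_naive <- naiveBayes(x = treino[-1],y=treino$diagnosis)
previsao <- predict(modelo_naive,teste[-1])
# Rows hold the actual classes and columns the predictions; confusionMatrix()
# reads a table as predictions in rows, so the per-class statistics below are
# computed with the two roles swapped
conf.matrix <- table(teste[,1],previsao)
confusionMatrix(conf.matrix)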
## Confusion Matrix and Statistics
##
## previsao
## 0 1
## 0 555 16
## 1 63 574
##
## Accuracy : 0.9346
## 95% CI : (0.9192, 0.9479)
## No Information Rate : 0.5116
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8694
##
## Mcnemar's Test P-Value : 2.274e-07
##
## Sensitivity : 0.8981
## Specificity : 0.9729
## Pos Pred Value : 0.9720
## Neg Pred Value : 0.9011
## Prevalence : 0.5116
## Detection Rate : 0.4594
## Detection Prevalence : 0.4727
## Balanced Accuracy : 0.9355
##
## 'Positive' Class : 0
##
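Because the table above has the actual classes in rows and the predictions in columns, it helps to recompute recall and precision for class 0 directly from the cells. The base R sketch below does this; the object name cm is illustrative.

```r
# Same counts as the table above: rows = actual class, columns = predicted class
cm <- matrix(c(555, 16,
               63, 574),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Recall for class 0: of the actual benign cases, how many were predicted benign
recall_0 <- cm["0", "0"] / sum(cm["0", ])
# Precision for class 0: of the cases predicted benign, how many were benign
precision_0 <- cm["0", "0"] / sum(cm[, "0"])

round(c(recall = recall_0, precision = precision_0), 4)
# recall 0.9720, precision 0.8981
```

The value 0.8981, labelled Sensitivity in the output, is therefore the precision for class 0, and 0.9720, labelled Pos Pred Value, is its recall. Passing t(conf.matrix), or calling confusionMatrix(previsao, teste$diagnosis), would align the labels with their usual meaning.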
set.seed(123)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 2)
modelo_reg_log <- train(diagnosis~., data = treino, method = "glm", trControl = control)
set.seed(123)
importance <- varImp(modelo_reg_log, scale = FALSE)
plot(importance)
set.seed(123)
previsoes <- predict(modelo_reg_log, teste[-1])
set.seed(123)
confusionMatrix(table(data = previsoes, reference = teste[,1]), positive = "1")
## Confusion Matrix and Statistics
##
## reference
## data 0 1
## 0 571 0
## 1 0 637
##
## Accuracy : 1
## 95% CI : (0.997, 1)
## No Information Rate : 0.5273
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5273
## Detection Rate : 0.5273
## Detection Prevalence : 0.5273
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 1
##
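The logistic regression model classified 100% of the test records correctly, with no false positives and no false negatives, which is what one would hope for in a breast cancer diagnosis setting. This figure should be read with caution, however: SMOTE was applied before the split and “teste” shares observations with “treino”, so it likely overstates performance on unseen data.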
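For reference, a logistic regression of the kind fitted above through method = "glm" can be sketched directly with base R's glm(). The example uses the built-in mtcars data only as a stand-in (it is unrelated to the cancer dataset), and the 0.5 cutoff is an assumption made for illustration.

```r
# Binary outcome: am (0 = automatic, 1 = manual) explained by car weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of class 1 for each row
prob <- predict(fit, newdata = mtcars, type = "response")

# Turn probabilities into class labels with a 0.5 cutoff
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))

# Confusion table: rows = prediction, columns = reference
table(prediction = pred, reference = mtcars$am)
```

Evaluating on the same rows used for fitting, as this toy example does, measures fit rather than generalization; a disjoint test set is needed for the latter.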