Os serviços baseados em assinatura geralmente ganham dinheiro das três maneiras a seguir:
Neste neste, iremos nos concentrar na retenção de clientes. Para fazer isso, construiremos um modelo preditivo de rotatividade de clientes.
A motivação para este modelo é o retorno sobre o investimento (ROI). Se uma empresa interagisse com todos os clientes, o custo seria astronômico. Concentrar os esforços de retenção em um pequeno subconjunto de clientes de alto risco é uma estratégia muito mais eficaz.
O dataset que trabalharemos pode ser encontrado em IBM Watson Analytics website: https://www.ibm.com/communities/analytics/watson-analytics-blog/guide-to-sample-datasets/. https://jtsulliv.github.io/churn-prediction/
Este é um exemplo de conjunto de dados para uma empresa de telecomunicações. Podemos começar observando as dimensões dos dados, bem como os diferentes recursos.
library(tidyverse)
library(miscset) # Gráficos Estilo Facets
df <- read_csv("Telco.data.csv")
dim(df)
## [1] 7043 21
names(df)
## [1] "customerID" "gender" "SeniorCitizen" "Partner"
## [5] "Dependents" "tenure" "PhoneService" "MultipleLines"
## [9] "InternetService" "OnlineSecurity" "OnlineBackup" "DeviceProtection"
## [13] "TechSupport" "StreamingTV" "StreamingMovies" "Contract"
## [17] "PaperlessBilling" "PaymentMethod" "MonthlyCharges" "TotalCharges"
## [21] "Churn"
Dando uma olhada, vemos que existem 21 variáveis e 7043 linhas de observações. Os recursos são nomeados muito bem, como “PhoneService” e “TechSupport.”
O recurso de destino que tentaremos prever é “Churn”. Podemos cavar um pouco mais fundo e dar uma olhada nos tipos de dados.
str(df)
## spc_tbl_ [7,043 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ customerID : chr [1:7043] "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr [1:7043] "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : num [1:7043] 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr [1:7043] "Yes" "No" "No" "No" ...
## $ Dependents : chr [1:7043] "No" "No" "No" "No" ...
## $ tenure : num [1:7043] 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr [1:7043] "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr [1:7043] "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr [1:7043] "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr [1:7043] "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr [1:7043] "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr [1:7043] "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr [1:7043] "No" "No" "No" "Yes" ...
## $ StreamingTV : chr [1:7043] "No" "No" "No" "No" ...
## $ StreamingMovies : chr [1:7043] "No" "No" "No" "No" ...
## $ Contract : chr [1:7043] "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr [1:7043] "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr [1:7043] "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num [1:7043] 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num [1:7043] 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr [1:7043] "No" "No" "Yes" "No" ...
## - attr(*, "spec")=
## .. cols(
## .. customerID = col_character(),
## .. gender = col_character(),
## .. SeniorCitizen = col_double(),
## .. Partner = col_character(),
## .. Dependents = col_character(),
## .. tenure = col_double(),
## .. PhoneService = col_character(),
## .. MultipleLines = col_character(),
## .. InternetService = col_character(),
## .. OnlineSecurity = col_character(),
## .. OnlineBackup = col_character(),
## .. DeviceProtection = col_character(),
## .. TechSupport = col_character(),
## .. StreamingTV = col_character(),
## .. StreamingMovies = col_character(),
## .. Contract = col_character(),
## .. PaperlessBilling = col_character(),
## .. PaymentMethod = col_character(),
## .. MonthlyCharges = col_double(),
## .. TotalCharges = col_double(),
## .. Churn = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Efetuando as alterações nas observações para ficarmos apenas com variávei binárias (“No”, “Yes”)
colunas <- c('MultipleLines',
'OnlineSecurity',
'OnlineBackup',
'DeviceProtection',
'TechSupport',
'StreamingTV',
'StreamingMovies')
df <- df %>%
mutate_at(colunas, ~ifelse(startsWith(., "No"), "No", .))
df
Os dados contêm vários recursos categóricos que são tipos de caracteres, como o recurso “Dependents”, que tem valores de “Sim” ou “Não”. Há, também, tipos numéricos, que incluem “MonthlyCharges” e “TotalCharges”.
A variável “SeniorCitizen” é um tipo inteiro, mas realmente representa “Sim” e “Não”, então vamos converter isso em um fator.
Investigaremos a variável “tenure”, que também é um número inteiro, mais tarde. Por enquanto, vamos começar transformando as variáveis de caracteres, bem como a variável “SeniorCitizen”, para os tipos de fator.
df$SeniorCitizen <- factor(df$SeniorCitizen,
levels = c("0","1"),
labels = c("No", "Yes"))
df <- df%>%mutate_if(is.character, as.factor)
glimpse(df)
## Rows: 7,043
## Columns: 21
## $ customerID <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237-…
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
df %>% map(~ sum(is.na(.)))
## $customerID
## [1] 0
##
## $gender
## [1] 0
##
## $SeniorCitizen
## [1] 0
##
## $Partner
## [1] 0
##
## $Dependents
## [1] 0
##
## $tenure
## [1] 0
##
## $PhoneService
## [1] 0
##
## $MultipleLines
## [1] 0
##
## $InternetService
## [1] 0
##
## $OnlineSecurity
## [1] 0
##
## $OnlineBackup
## [1] 0
##
## $DeviceProtection
## [1] 0
##
## $TechSupport
## [1] 0
##
## $StreamingTV
## [1] 0
##
## $StreamingMovies
## [1] 0
##
## $Contract
## [1] 0
##
## $PaperlessBilling
## [1] 0
##
## $PaymentMethod
## [1] 0
##
## $MonthlyCharges
## [1] 0
##
## $TotalCharges
## [1] 11
##
## $Churn
## [1] 0
Parece que “TotalCharges” é o único recurso com valores ausentes.
Vamos em frente e introduzir os 11 valores ausentes usando o valor mediano.
df <- df %>%
mutate(TotalCharges = replace(TotalCharges,
is.na(TotalCharges),
median(TotalCharges, na.rm = TRUE)))
Funcionou? Se positivo, a soma de NAs deve ser igual a zero.
sum(is.na(df$TotalCharges))
## [1] 0
Analisando os valores únicos das variáveis classe fator.
df_tbl <- df %>%
select_if(is.factor) %>%
summarise_all(n_distinct)
df_tbl[1:8]
df_tbl[9:17]
Há um valor único para cada “ID do cliente”, então provavelmente não seremos capazes de obter muitas informações lá. Todos os outros fatores têm quatro ou menos valores exclusivos, portanto, todos serão bastante gerenciáveis.
Como guia da análise procuraremos responder algumas questões, a saber:
Os homens são mais propensos a efetuar o cancelamento do que as mulheres?
Os idosos ?
Os indivíduos com um parceiro mudam mais do que aqueles sem parceiro?
As pessoas com dependentes se mudam mais do que as pessoas que não têm dependentes?
Variável Sexo:
ggplot(df) +
geom_bar(aes(x = gender, fill = Churn), position = "dodge") +
ggtitle("Churn por Sexo")
Os resultados são similares, visualmente falando.
df %>%
group_by(gender,Churn) %>%
summarise(n=n()) %>%
mutate(freq = n / sum(n))
Aproximadamente um quarto dos clientes do sexo masculino abandona, e cerca de um quarto das clientes do sexo feminino se desligam. Também podemos dar uma olhada em exatamente quantas pessoas de cada gênero mudaram.
# Seniors Citizens
ggplot(df) +
geom_bar(aes(x = SeniorCitizen, fill = Churn), position = "dodge")+
ggtitle("Churn por Idades")
# Grupos de Idade
df %>%
group_by(SeniorCitizen) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# Grupos de Idade e Churn
df %>%
group_by(SeniorCitizen, Churn) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
Ops!!!!
Esta variável mostra uma relação muito mais significativa. Aproximadamente 16% dos clientes são idosos e de 41,7 % destes se desligam.
Por outro lado, dos 84% dos clientes que não são idosos, apenas 23,6% se desligam. Embora em menor número, os resultados mostram que os idosos são muito mais propensos a mudar de companhia.
#Pessoas com parceiros
ggplot(df) +
geom_bar(aes(x=Partner, fill = Churn), position = "dodge")+
ggtitle("Churn por Dependentes")
df %>%
group_by(Partner) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
df %>%
group_by(Partner, Churn) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
Um pouco mais da metade das pessoas tem parceiros - das pessoas parceitos, 20% se desligam. Para pessoas sem parceiros, cerca de 33% se desligam.
# Dependentes
ggplot(df) +
geom_bar(aes_string(x="Dependents", fill="Churn"), position = "dodge")
df %>% group_by(Dependents, Churn) %>%
summarise(n=n()) %>%
mutate(freq = n / sum(n))
# Pessoas com dependentes
df %>% group_by(Dependents) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
Aproximadamente 30% das pessoas tem dependentes, dos quais 15% churn. Para 70% das pessoas que não têm dependentes, 31% churn.
Outra visualização útil é o gráfico de caixa e bigode (boxplot). Isso nos dá um visual um pouco mais compacto de nossos dados, e nos ajuda a identificar outliers.
Vamos dar uma olhada em alguns gráficos de caixa e bigode para cobranças totais dos diferentes segmentos de clientes.
#Senior Citizens
ggplot(df, aes(x = SeniorCitizen, y = TotalCharges)) +
geom_boxplot(col ="steelblue", fill = "brown")
# Partner
ggplot(df, aes(x = Partner, y = TotalCharges)) +
geom_boxplot(col ="steelblue", fill = "brown")
# Dependents
ggplot(df, aes(x = Dependents, y = TotalCharges)) +
geom_boxplot(col ="steelblue", fill = "brown")
Depois de examinar esses resultados iniciais, podemos fazer mais algumas perguntas.
Podemos comparar o total de queixas de idosos, pessoas sem companheiros e pessoas sem dependentes.
Estes parecem ser os subconjuntos de pessoas com maior probabilidade de efetuar o churn em seus respectivos segmentos de clientes.
Vamos compará-los para que possamos identificar onde potencialmente concentraríamos nossos esforços.
#Total de queixas de idosos
glimpse(df)
## Rows: 7,043
## Columns: 21
## $ customerID <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237-…
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye…
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No…
## $ tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y…
## $ MultipleLines <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o…
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No,…
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No, N…
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No, Y…
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No, No,…
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No, Ye…
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, Yes…
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M…
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No…
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
df %>%
dplyr::select(SeniorCitizen, Churn, TotalCharges, tenure) %>%
filter(SeniorCitizen == "Yes", Churn == "Yes") %>%
summarize(n = n(),
total = sum(TotalCharges),
avg_tenure = sum(tenure)/n)
# Idosos: Número de 476 com 882.405 queixas
# Total de queixas de pessoas sem parceiros
df %>%
dplyr::select(Partner, Churn, TotalCharges, tenure) %>%
filter(Partner == "No", Churn == "Yes") %>%
summarize(n = n(),
total = sum(TotalCharges),
avg_tenure = sum(tenure)/n)
# 1200 clientes com 1.306.776 queixas
# Total de queixas com pessoas sem dependentes
df %>%
dplyr::select(Dependents, Churn, TotalCharges, tenure) %>%
filter(Dependents == "No", Churn == "Yes") %>%
summarise(n = n(),
total = sum(TotalCharges),
avg_tenure = sum(tenure)/n)
# 1543 e 2.261.840 queixas
# Total de queixas
sum(df$TotalCharges)
## [1] 16071541
Em resumo:
| Customer Segment | Total Charges |
|---|---|
| Senior Citizens | 882,405 |
| No Partners | 1.306.776 |
| No Dependents | 2.261.840 |
Com base nos resultados, devemos direcionar nossos esforços para pessoas sem dependentes. Este segmento de clientes teve churn de quase 2,3 milhões em custos totais comparados a 1,3 MM para pessoas sem companheiros eem terceiro lugar 900 mil para idosos.
Vamos aprofundar um pouco mais e ver quais serviços esse segmento de clientes sem dependendes usa.
dependents <- df %>% filter(Dependents == "No")
ggplotGrid(ncol = 2,
lapply(c("PhoneService","MultipleLines",
"OnlineSecurity","OnlineBackup",
"DeviceProtection"),
function(col){
ggplot(dependents,aes_string(col)) +
geom_bar(aes(fill=Churn),position="dodge")
}))
ggplotGrid(ncol = 2,
lapply(c("TechSupport","StreamingTV","StreamingMovies","Contract",
"PaperlessBilling", "InternetService"),
function(col){
ggplot(dependents,aes_string(col)) + geom_bar(aes(fill= Churn),position="dodge")
}))
ggplot(dependents) +
geom_bar(aes(x=PaymentMethod,fill=Churn), position = "dodge")
Visualizando os resultados, ganhamos alguns insights em potencial:
Muitas pessoas com serviço de telefone mudaram.
Pessoas com Internet de fibra ótica efetuando o Churn muito mais do que pessoas com DSL ou sem Internet.
Pessoas sem backup online, proteção de dispositivo e segurança online mudam com bastante frequência. Talvez seus dispositivos tenham travado,
Da mesma forma que o backup e a segurança online, aqueles sem proteção de dispositivo tendiam a agitar mais do que aqueles que assinavam o serviço.
Aqueles sem suporte técnico tendem a se mudar com mais freqüência do que aqueles com suporte técnico. Mover clientes para contas de suporte técnico pode ser outra forma potencial de evitar a rotatividade.
Há vários outros insights diferentes que poderíamos obter com os dados, mas essa seria uma boa lista inicial para investigar melhor se a empresa tivesse conjuntos de dados ainda mais detalhados.
Agora que fizemos uma análise exploratória básica, começaremos a modelar algumas técnicas preditivas de Machine Learning.
Usaremos a regressão logística e a floresta aleatória (random forest).
A regressão logística é um classificador linear, o que a torna mais fácil de interpretar do que os modelos não lineares.
Ao mesmo tempo, por ser um modelo linear, ele tem um viés alto em relação a esse tipo de ajuste, então pode não ter um bom desempenho em dados não lineares.
A floresta aleatória é outro método de classificação popular.
Ao contrário da regressão logística, a floresta aleatória é melhor no ajuste de dados não lineares.
Também pode funcionar bem mesmo se houver recursos correlacionados, o que pode ser um problema para a regressão logística (embora métodos de como Lasso e Ridge Regression possam ajudar com recursos correlacionados em um modelo de regressão logística).
Não tenho certeza se meus dados têm um limite de decisão linear ou não linear, então é por isso que vou começar com regressão logística e, em seguida, testar um modelo de floresta aleatória.
Abordagem de conjunto de validação de treinamento / teste, ara meu método de reamostragem.
Se eu fosse me aprofundar no modelo para realmente examinar os resultados, usaria validação cruzada k-fold repetida.
Para manter este projeto razoavelmente curto, vou manter a validação de treinamento / teste e deixar o k-fold para outro projeto.
Vamos mdodelar!
Primeiro, vamos desenvolver um modelo de regressão logística, começando por dividir os dados em um conjunto de treinamento (75%) e um conjunto de teste (25%).
Vou remover o recurso customerID porque é único para cada observação e não adicionará informações valiosas ao meu modelo.
#install.packages("caret")
library(caret)
# removendo customerID
df <- df %>% dplyr::select(-customerID)
# train/test dividindo em 75%/25%
set.seed(5) # Para reproducibilidade
inTrain <- createDataPartition(y=df$Churn, p=0.75, list=FALSE)
train <- df[inTrain,]
test <- df[-inTrain,]
# Proporção de Churn na df (dataframe inteiro)
round(prop.table(table(df$Churn)) * 100, digits = 2)
##
## No Yes
## 73.46 26.54
# Verificar se as proporções são próximas
round(prop.table(table(train$Churn)) * 100, digits = 2)
##
## No Yes
## 73.46 26.54
round(prop.table(table(test$Churn)) * 100, digits = 2)
##
## No Yes
## 73.47 26.53
str(train)
## tibble [5,283 × 20] (S3: tbl_df/tbl/data.frame)
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 2 2 2 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 2 1 2 1 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 1 1 ...
## $ tenure : num [1:5283] 1 2 22 10 62 13 16 58 49 25 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 2 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 2 2 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 2 1 1 1 3 2 2 2 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 1 2 1 2 2 2 1 1 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 1 1 2 1 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 2 2 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 2 2 2 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 2 2 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 1 1 1 2 1 3 2 1 1 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 2 2 1 1 2 1 1 2 2 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 2 4 1 4 2 2 1 3 ...
## $ MonthlyCharges : num [1:5283] 29.9 53.9 89.1 29.8 56.1 ...
## $ TotalCharges : num [1:5283] 29.9 108.2 1949.4 301.9 3487.9 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
str(test)
## tibble [1,760 × 20] (S3: tbl_df/tbl/data.frame)
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 1 1 2 2 1 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 1 2 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 1 1 ...
## $ tenure : num [1:1760] 34 45 2 8 28 69 10 12 1 71 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 2 1 1 1 2 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 2 2 2 2 1 3 3 2 ...
## $ OnlineSecurity : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 2 1 1 1 2 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 1 2 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 2 1 1 2 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 2 1 1 1 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 2 1 1 1 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 2 2 1 1 1 3 1 2 1 3 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 1 1 2 2 2 1 1 1 1 2 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 4 1 3 3 3 2 2 1 4 2 ...
## $ MonthlyCharges : num [1:1760] 57 42.3 70.7 99.7 104.8 ...
## $ TotalCharges : num [1:1760] 1890 1841 152 820 3046 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 1 2 1 2 1 ...
Proporções mantendo-se bem próximas.
fit <- glm(Churn ~ ., data = train, family = binomial)
summary(fit)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8855 -0.6849 -0.2885 0.7314 3.3603
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.228e+00 9.438e-01 1.301 0.1932
## genderMale -6.811e-03 7.457e-02 -0.091 0.9272
## SeniorCitizenYes 8.980e-02 9.835e-02 0.913 0.3612
## PartnerYes -4.509e-02 9.012e-02 -0.500 0.6168
## DependentsYes -3.883e-02 1.019e-01 -0.381 0.7031
## tenure -5.451e-02 6.985e-03 -7.803 6.02e-15 ***
## PhoneServiceYes 3.576e-01 7.514e-01 0.476 0.6342
## MultipleLinesYes 4.795e-01 2.044e-01 2.346 0.0190 *
## InternetServiceFiber optic 1.973e+00 9.235e-01 2.136 0.0326 *
## InternetServiceNo -1.907e+00 9.353e-01 -2.039 0.0414 *
## OnlineSecurityYes -1.688e-01 2.082e-01 -0.811 0.4173
## OnlineBackupYes 9.493e-02 2.033e-01 0.467 0.6405
## DeviceProtectionYes 2.154e-01 2.034e-01 1.059 0.2895
## TechSupportYes -1.315e-01 2.084e-01 -0.631 0.5280
## StreamingTVYes 6.838e-01 3.772e-01 1.812 0.0699 .
## StreamingMoviesYes 6.972e-01 3.786e-01 1.841 0.0656 .
## ContractOne year -6.878e-01 1.251e-01 -5.498 3.84e-08 ***
## ContractTwo year -1.434e+00 2.021e-01 -7.094 1.30e-12 ***
## PaperlessBillingYes 3.713e-01 8.579e-02 4.328 1.51e-05 ***
## PaymentMethodCredit card (automatic) -2.352e-01 1.318e-01 -1.785 0.0743 .
## PaymentMethodElectronic check 2.690e-01 1.088e-01 2.473 0.0134 *
## PaymentMethodMailed check -4.020e-02 1.314e-01 -0.306 0.7597
## MonthlyCharges -4.718e-02 3.676e-02 -1.283 0.1994
## TotalCharges 2.628e-04 7.987e-05 3.290 0.0010 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6113.6 on 5282 degrees of freedom
## Residual deviance: 4393.7 on 5259 degrees of freedom
## AIC: 4441.7
##
## Number of Fisher Scoring iterations: 6
# Efteuando uma nova regressão com as variáveis significativas
fit <- glm(Churn ~ tenure + MultipleLines + InternetService + PhoneService +
MultipleLines + TechSupport +StreamingTV +
StreamingMovies + Contract,
data = train, family = binomial)
summary(fit)
##
## Call:
## glm(formula = Churn ~ tenure + MultipleLines + InternetService +
## PhoneService + MultipleLines + TechSupport + StreamingTV +
## StreamingMovies + Contract, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8200 -0.6894 -0.3023 0.7413 3.1863
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.003670 0.125237 0.029 0.976623
## tenure -0.036619 0.002447 -14.964 < 2e-16 ***
## MultipleLinesYes 0.311199 0.088676 3.509 0.000449 ***
## InternetServiceFiber optic 1.114323 0.101695 10.957 < 2e-16 ***
## InternetServiceNo -0.721762 0.147170 -4.904 9.38e-07 ***
## PhoneServiceYes -0.557854 0.144647 -3.857 0.000115 ***
## TechSupportYes -0.390476 0.096794 -4.034 5.48e-05 ***
## StreamingTVYes 0.328224 0.089692 3.659 0.000253 ***
## StreamingMoviesYes 0.354701 0.089974 3.942 8.07e-05 ***
## ContractOne year -0.834863 0.121610 -6.865 6.65e-12 ***
## ContractTwo year -1.633449 0.196956 -8.293 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6113.6 on 5282 degrees of freedom
## Residual deviance: 4473.4 on 5272 degrees of freedom
## AIC: 4495.4
##
## Number of Fisher Scoring iterations: 6
Com o stepwise.
#install.packages("MASS")
library(MASS)
step.model <- fit %>%
stepAIC(trace = TRUE)
## Start: AIC=4495.4
## Churn ~ tenure + MultipleLines + InternetService + PhoneService +
## MultipleLines + TechSupport + StreamingTV + StreamingMovies +
## Contract
##
## Df Deviance AIC
## <none> 4473.4 4495.4
## - MultipleLines 1 4485.7 4505.7
## - StreamingTV 1 4486.9 4506.9
## - PhoneService 1 4488.0 4508.0
## - StreamingMovies 1 4489.0 4509.0
## - TechSupport 1 4489.9 4509.9
## - Contract 2 4576.9 4594.9
## - InternetService 2 4714.9 4732.9
## - tenure 1 4721.5 4741.5
summary(step.model)
##
## Call:
## glm(formula = Churn ~ tenure + MultipleLines + InternetService +
## PhoneService + MultipleLines + TechSupport + StreamingTV +
## StreamingMovies + Contract, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8200 -0.6894 -0.3023 0.7413 3.1863
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.003670 0.125237 0.029 0.976623
## tenure -0.036619 0.002447 -14.964 < 2e-16 ***
## MultipleLinesYes 0.311199 0.088676 3.509 0.000449 ***
## InternetServiceFiber optic 1.114323 0.101695 10.957 < 2e-16 ***
## InternetServiceNo -0.721762 0.147170 -4.904 9.38e-07 ***
## PhoneServiceYes -0.557854 0.144647 -3.857 0.000115 ***
## TechSupportYes -0.390476 0.096794 -4.034 5.48e-05 ***
## StreamingTVYes 0.328224 0.089692 3.659 0.000253 ***
## StreamingMoviesYes 0.354701 0.089974 3.942 8.07e-05 ***
## ContractOne year -0.834863 0.121610 -6.865 6.65e-12 ***
## ContractTwo year -1.633449 0.196956 -8.293 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6113.6 on 5282 degrees of freedom
## Residual deviance: 4473.4 on 5272 degrees of freedom
## AIC: 4495.4
##
## Number of Fisher Scoring iterations: 6
# Efetuando as predições
result <- predict(step.model, newdata=test, type='response')
result <- ifelse(result > 0.5, 1 , 0)
table(result)
## result
## 0 1
## 1394 366
test$Churn <- ifelse(test$Churn == "No", 0, 1)
table(test$Churn)
##
## 0 1
## 1293 467
glimpse(test)
## Rows: 1,760
## Columns: 20
## $ gender <fct> Male, Male, Female, Female, Female, Female, Female, M…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, Yes, Yes, No, Yes…
## $ Partner <fct> No, No, No, No, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes…
## $ Dependents <fct> No, No, No, No, No, Yes, Yes, No, No, No, No, Yes, No…
## $ tenure <dbl> 34, 45, 2, 8, 28, 69, 10, 12, 1, 71, 2, 27, 1, 46, 11…
## $ PhoneService <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
## $ MultipleLines <fct> No, No, No, Yes, Yes, Yes, No, No, No, Yes, No, No, N…
## $ InternetService <fct> DSL, DSL, Fiber optic, Fiber optic, Fiber optic, Fibe…
## $ OnlineSecurity <fct> Yes, Yes, No, No, No, Yes, No, No, No, Yes, No, Yes, …
## $ OnlineBackup <fct> No, No, No, No, No, Yes, No, No, No, Yes, No, Yes, No…
## $ DeviceProtection <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Y…
## $ TechSupport <fct> No, Yes, No, No, Yes, Yes, Yes, No, No, Yes, No, Yes,…
## $ StreamingTV <fct> No, No, No, Yes, Yes, Yes, No, No, No, No, Yes, No, N…
## $ StreamingMovies <fct> No, No, No, Yes, Yes, Yes, No, No, No, No, Yes, No, N…
## $ Contract <fct> One year, One year, Month-to-month, Month-to-month, M…
## $ PaperlessBilling <fct> No, No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, No, …
## $ PaymentMethod <fct> Mailed check, Bank transfer (automatic), Electronic c…
## $ MonthlyCharges <dbl> 56.95, 42.30, 70.70, 99.65, 104.80, 113.25, 55.20, 19…
## $ TotalCharges <dbl> 1889.50, 1840.75, 151.65, 820.50, 3046.05, 7895.15, 5…
## $ Churn <dbl> 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,…
test$Churn <- factor(test$Churn,
levels = c(0, 1),
labels=c("0", "1"))
test$factor.result <- factor(result,
levels = c(0, 1),
labels=c("0", "1"))
glimpse(test)
## Rows: 1,760
## Columns: 21
## $ gender <fct> Male, Male, Female, Female, Female, Female, Female, M…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, Yes, Yes, No, Yes…
## $ Partner <fct> No, No, No, No, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes…
## $ Dependents <fct> No, No, No, No, No, Yes, Yes, No, No, No, No, Yes, No…
## $ tenure <dbl> 34, 45, 2, 8, 28, 69, 10, 12, 1, 71, 2, 27, 1, 46, 11…
## $ PhoneService <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
## $ MultipleLines <fct> No, No, No, Yes, Yes, Yes, No, No, No, Yes, No, No, N…
## $ InternetService <fct> DSL, DSL, Fiber optic, Fiber optic, Fiber optic, Fibe…
## $ OnlineSecurity <fct> Yes, Yes, No, No, No, Yes, No, No, No, Yes, No, Yes, …
## $ OnlineBackup <fct> No, No, No, No, No, Yes, No, No, No, Yes, No, Yes, No…
## $ DeviceProtection <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Y…
## $ TechSupport <fct> No, Yes, No, No, Yes, Yes, Yes, No, No, Yes, No, Yes,…
## $ StreamingTV <fct> No, No, No, Yes, Yes, Yes, No, No, No, No, Yes, No, N…
## $ StreamingMovies <fct> No, No, No, Yes, Yes, Yes, No, No, No, No, Yes, No, N…
## $ Contract <fct> One year, One year, Month-to-month, Month-to-month, M…
## $ PaperlessBilling <fct> No, No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, No, …
## $ PaymentMethod <fct> Mailed check, Bank transfer (automatic), Electronic c…
## $ MonthlyCharges <dbl> 56.95, 42.30, 70.70, 99.65, 104.80, 113.25, 55.20, 19…
## $ TotalCharges <dbl> 1889.50, 1840.75, 151.65, 820.50, 3046.05, 7895.15, 5…
## $ Churn <fct> 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,…
## $ factor.result <fct> 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
# install.packages("e1071")
library(e1071)
(mc <- confusionMatrix(data = test$factor.result, reference = test$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1169 225
## 1 124 242
##
## Accuracy : 0.8017
## 95% CI : (0.7823, 0.8201)
## No Information Rate : 0.7347
## P-Value [Acc > NIR] : 3.160e-11
##
## Kappa : 0.4536
##
## Mcnemar's Test P-Value : 8.657e-08
##
## Sensitivity : 0.9041
## Specificity : 0.5182
## Pos Pred Value : 0.8386
## Neg Pred Value : 0.6612
## Prevalence : 0.7347
## Detection Rate : 0.6642
## Detection Prevalence : 0.7920
## Balanced Accuracy : 0.7112
##
## 'Positive' Class : 0
##
source('Matriz de Confusao.R')
plot_matconf(mc)
# install.packages("rpart")
# install.packages("rattle")
# install.packages('rpart.plot')
# install.packages('RColorBrewer')
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
head(train)
fit_2 <- rpart(Churn ~ tenure + MultipleLines + InternetService + PhoneService + MultipleLines + TechSupport + StreamingTV + StreamingMovies + Contract,
data = train,
method = "class",
minsplit = 2,
minbucket = 1)
fancyRpartPlot(fit_2, cex = 0.7, palettes = c("Blues", "Reds"))
result_1 <- predict(fit_2, test, type = 'class')
result_1 <- ifelse(result_1 == "No", "0", "1")
test$result_1 <- as.factor(result_1)
levels(test$result_1)
## [1] "0" "1"
levels(test$Churn)
## [1] "0" "1"
(mc <- confusionMatrix(test$result_1, test$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1182 248
## 1 111 219
##
## Accuracy : 0.796
## 95% CI : (0.7764, 0.8146)
## No Information Rate : 0.7347
## P-Value [Acc > NIR] : 1.247e-09
##
## Kappa : 0.4227
##
## Mcnemar's Test P-Value : 7.084e-13
##
## Sensitivity : 0.9142
## Specificity : 0.4690
## Pos Pred Value : 0.8266
## Neg Pred Value : 0.6636
## Prevalence : 0.7347
## Detection Rate : 0.6716
## Detection Prevalence : 0.8125
## Balanced Accuracy : 0.6916
##
## 'Positive' Class : 0
##
plot_matconf(mc)
#install.packages("neuralnet")
library(neuralnet)
glimpse(train)
## Rows: 5,283
## Columns: 20
## $ gender <fct> Female, Male, Male, Female, Male, Male, Male, Male, M…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, Y…
## $ Partner <fct> Yes, No, No, No, No, Yes, No, Yes, No, No, No, No, No…
## $ Dependents <fct> No, No, Yes, No, Yes, Yes, No, No, No, No, No, Yes, N…
## $ tenure <dbl> 1, 2, 22, 10, 62, 13, 16, 58, 49, 25, 52, 71, 21, 1, …
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, …
## $ MultipleLines <fct> No, No, Yes, No, No, No, No, Yes, Yes, No, No, Yes, N…
## $ InternetService <fct> DSL, DSL, Fiber optic, DSL, DSL, DSL, No, Fiber optic…
## $ OnlineSecurity <fct> No, Yes, No, Yes, Yes, Yes, No, No, No, Yes, No, Yes,…
## $ OnlineBackup <fct> Yes, Yes, Yes, No, Yes, No, No, No, Yes, No, No, No, …
## $ DeviceProtection <fct> No, No, No, No, No, No, No, Yes, Yes, Yes, No, Yes, Y…
## $ TechSupport <fct> No, No, No, No, No, No, No, No, No, Yes, No, No, No, …
## $ StreamingTV <fct> No, No, Yes, No, No, No, No, Yes, Yes, Yes, No, Yes, …
## $ StreamingMovies <fct> No, No, No, No, No, No, No, Yes, Yes, Yes, No, Yes, Y…
## $ Contract <fct> Month-to-month, Month-to-month, Month-to-month, Month…
## $ PaperlessBilling <fct> Yes, Yes, Yes, No, No, Yes, No, No, Yes, Yes, No, No,…
## $ PaymentMethod <fct> Electronic check, Mailed check, Credit card (automati…
## $ MonthlyCharges <dbl> 29.85, 53.85, 89.10, 29.75, 56.15, 49.95, 18.95, 100.…
## $ TotalCharges <dbl> 29.85, 108.15, 1949.40, 301.90, 3487.95, 587.45, 326.…
## $ Churn <fct> No, Yes, No, No, No, No, No, No, Yes, No, No, No, No,…
# Transformando as variaveis de Train em numéricas
nn.train <- train # Iremos usar o original em outros método
nn.train$gender <- as.numeric(ifelse(nn.train$gender == "Female", "0", "1"))
nn.train$MultipleLines <- as.numeric(ifelse(nn.train$MultipleLines == "No", "0", "1"))
glimpse(nn.train)
## Rows: 5,283
## Columns: 20
## $ gender <dbl> 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1,…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, Y…
## $ Partner <fct> Yes, No, No, No, No, Yes, No, Yes, No, No, No, No, No…
## $ Dependents <fct> No, No, Yes, No, Yes, Yes, No, No, No, No, No, Yes, N…
## $ tenure <dbl> 1, 2, 22, 10, 62, 13, 16, 58, 49, 25, 52, 71, 21, 1, …
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, …
## $ MultipleLines <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1,…
## $ InternetService <fct> DSL, DSL, Fiber optic, DSL, DSL, DSL, No, Fiber optic…
## $ OnlineSecurity <fct> No, Yes, No, Yes, Yes, Yes, No, No, No, Yes, No, Yes,…
## $ OnlineBackup <fct> Yes, Yes, Yes, No, Yes, No, No, No, Yes, No, No, No, …
## $ DeviceProtection <fct> No, No, No, No, No, No, No, Yes, Yes, Yes, No, Yes, Y…
## $ TechSupport <fct> No, No, No, No, No, No, No, No, No, Yes, No, No, No, …
## $ StreamingTV <fct> No, No, Yes, No, No, No, No, Yes, Yes, Yes, No, Yes, …
## $ StreamingMovies <fct> No, No, No, No, No, No, No, Yes, Yes, Yes, No, Yes, Y…
## $ Contract <fct> Month-to-month, Month-to-month, Month-to-month, Month…
## $ PaperlessBilling <fct> Yes, Yes, Yes, No, No, Yes, No, No, Yes, Yes, No, No,…
## $ PaymentMethod <fct> Electronic check, Mailed check, Credit card (automati…
## $ MonthlyCharges <dbl> 29.85, 53.85, 89.10, 29.75, 56.15, 49.95, 18.95, 100.…
## $ TotalCharges <dbl> 29.85, 108.15, 1949.40, 301.90, 3487.95, 587.45, 326.…
## $ Churn <fct> No, Yes, No, No, No, No, No, No, Yes, No, No, No, No,…
table(nn.train$InternetService)
##
## DSL Fiber optic No
## 1808 2326 1149
levels(nn.train$InternetService)[levels(nn.train$InternetService)=="No"] <- "0"
levels(nn.train$InternetService)[levels(nn.train$InternetService)=="DSL"] <- "1"
levels(nn.train$InternetService)[levels(nn.train$InternetService)== "Fiber optic"] <- "2"
table(nn.train$InternetService)
##
## 1 2 0
## 1808 2326 1149
nn.train$InternetService <- as.numeric(as.character(nn.train$InternetService))
nn.train$PhoneService <- as.numeric(ifelse(nn.train$PhoneService == "No", "0", "1"))
nn.train$TechSupport <- as.numeric(ifelse(nn.train$TechSupport == "No", "0", "1"))
nn.train$StreamingTV <- as.numeric(ifelse(nn.train$StreamingTV == "No", "0", "1"))
nn.train$StreamingMovies <- as.numeric(ifelse(nn.train$StreamingMovies == "No", "0", "1"))
levels(nn.train$Contract)[levels(nn.train$Contract)=="Month-to-month"] <- "0"
levels(nn.train$Contract)[levels(nn.train$Contract)=="One year"] <- "1"
levels(nn.train$Contract)[levels(nn.train$Contract)== "Two year"] <- "2"
nn.train$Contract <- as.numeric(as.character(nn.train$Contract))
table(nn.train$Contract)
##
## 0 1 2
## 2914 1085 1284
nn.train$Churn <- as.numeric(ifelse(nn.train$Churn == "No", "0", "1"))
glimpse(nn.train)
## Rows: 5,283
## Columns: 20
## $ gender <dbl> 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1,…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, Y…
## $ Partner <fct> Yes, No, No, No, No, Yes, No, Yes, No, No, No, No, No…
## $ Dependents <fct> No, No, Yes, No, Yes, Yes, No, No, No, No, No, Yes, N…
## $ tenure <dbl> 1, 2, 22, 10, 62, 13, 16, 58, 49, 25, 52, 71, 21, 1, …
## $ PhoneService <dbl> 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,…
## $ MultipleLines <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1,…
## $ InternetService <dbl> 1, 1, 2, 1, 1, 1, 0, 2, 2, 2, 0, 2, 2, 1, 1, 1, 1, 2,…
## $ OnlineSecurity <fct> No, Yes, No, Yes, Yes, Yes, No, No, No, Yes, No, Yes,…
## $ OnlineBackup <fct> Yes, Yes, Yes, No, Yes, No, No, No, Yes, No, No, No, …
## $ DeviceProtection <fct> No, No, No, No, No, No, No, Yes, Yes, Yes, No, Yes, Y…
## $ TechSupport <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ StreamingTV <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1,…
## $ StreamingMovies <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ Contract <dbl> 0, 0, 0, 0, 1, 0, 2, 1, 0, 0, 1, 2, 0, 0, 2, 0, 0, 0,…
## $ PaperlessBilling <fct> Yes, Yes, Yes, No, No, Yes, No, No, Yes, Yes, No, No,…
## $ PaymentMethod <fct> Electronic check, Mailed check, Credit card (automati…
## $ MonthlyCharges <dbl> 29.85, 53.85, 89.10, 29.75, 56.15, 49.95, 18.95, 100.…
## $ TotalCharges <dbl> 29.85, 108.15, 1949.40, 301.90, 3487.95, 587.45, 326.…
## $ Churn <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
Repetindo o mesmo procedimento para test.
glimpse(test)
## Rows: 1,760
## Columns: 22
## $ gender <fct> Male, Male, Female, Female, Female, Female, Female, M…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, Yes, Yes, No, Yes…
## $ Partner <fct> No, No, No, No, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes…
## $ Dependents <fct> No, No, No, No, No, Yes, Yes, No, No, No, No, Yes, No…
## $ tenure <dbl> 34, 45, 2, 8, 28, 69, 10, 12, 1, 71, 2, 27, 1, 46, 11…
## $ PhoneService <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
## $ MultipleLines <fct> No, No, No, Yes, Yes, Yes, No, No, No, Yes, No, No, N…
## $ InternetService <fct> DSL, DSL, Fiber optic, Fiber optic, Fiber optic, Fibe…
## $ OnlineSecurity <fct> Yes, Yes, No, No, No, Yes, No, No, No, Yes, No, Yes, …
## $ OnlineBackup <fct> No, No, No, No, No, Yes, No, No, No, Yes, No, Yes, No…
## $ DeviceProtection <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Y…
## $ TechSupport <fct> No, Yes, No, No, Yes, Yes, Yes, No, No, Yes, No, Yes,…
## $ StreamingTV <fct> No, No, No, Yes, Yes, Yes, No, No, No, No, Yes, No, N…
## $ StreamingMovies <fct> No, No, No, Yes, Yes, Yes, No, No, No, No, Yes, No, N…
## $ Contract <fct> One year, One year, Month-to-month, Month-to-month, M…
## $ PaperlessBilling <fct> No, No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, No, …
## $ PaymentMethod <fct> Mailed check, Bank transfer (automatic), Electronic c…
## $ MonthlyCharges <dbl> 56.95, 42.30, 70.70, 99.65, 104.80, 113.25, 55.20, 19…
## $ TotalCharges <dbl> 1889.50, 1840.75, 151.65, 820.50, 3046.05, 7895.15, 5…
## $ Churn <fct> 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,…
## $ factor.result <fct> 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
## $ result_1 <fct> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1,…
nn.test <- test # Iremos usar o original em outros método
nn.test$gender <- as.numeric(ifelse(nn.test$gender == "Female", "0", "1"))
nn.test$MultipleLines <- as.numeric(ifelse(nn.test$MultipleLines == "No", "0", "1"))
table(nn.test$InternetService)
##
## DSL Fiber optic No
## 613 770 377
levels(nn.test$InternetService)[levels(nn.test$InternetService)=="No"] <- "0"
levels(nn.test$InternetService)[levels(nn.test$InternetService)=="DSL"] <- "1"
levels(nn.test$InternetService)[levels(nn.test$InternetService)== "Fiber optic"] <- "2"
table(nn.test$InternetService)
##
## 1 2 0
## 613 770 377
nn.test$InternetService <- as.numeric(as.character(nn.test$InternetService))
nn.test$PhoneService <- as.numeric(ifelse(nn.test$PhoneService == "No", "0", "1"))
nn.test$TechSupport <- as.numeric(ifelse(nn.test$TechSupport == "No", "0", "1"))
nn.test$StreamingTV <- as.numeric(ifelse(nn.test$StreamingTV == "No", "0", "1"))
nn.test$StreamingMovies <- as.numeric(ifelse(nn.test$StreamingMovies == "No", "0", "1"))
levels(nn.test$Contract)[levels(nn.test$Contract)=="Month-to-month"] <- "0"
levels(nn.test$Contract)[levels(nn.test$Contract)=="One year"] <- "1"
levels(nn.test$Contract)[levels(nn.test$Contract)== "Two year"] <- "2"
nn.test$Contract <- as.numeric(as.character(nn.test$Contract))
table(nn.test$Contract)
##
## 0 1 2
## 961 388 411
nn.test$Churn <- as.numeric(ifelse(nn.test$Churn == "No", "0", "1"))
glimpse(nn.test)
## Rows: 1,760
## Columns: 22
## $ gender <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ SeniorCitizen <fct> No, No, No, No, No, No, No, No, No, Yes, Yes, No, Yes…
## $ Partner <fct> No, No, No, No, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes…
## $ Dependents <fct> No, No, No, No, No, Yes, Yes, No, No, No, No, Yes, No…
## $ tenure <dbl> 34, 45, 2, 8, 28, 69, 10, 12, 1, 71, 2, 27, 1, 46, 11…
## $ PhoneService <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ MultipleLines <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ InternetService <dbl> 1, 1, 2, 2, 2, 2, 1, 0, 0, 2, 2, 1, 1, 2, 2, 2, 1, 2,…
## $ OnlineSecurity <fct> Yes, Yes, No, No, No, Yes, No, No, No, Yes, No, Yes, …
## $ OnlineBackup <fct> No, No, No, No, No, Yes, No, No, No, Yes, No, Yes, No…
## $ DeviceProtection <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Y…
## $ TechSupport <dbl> 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ StreamingTV <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
## $ StreamingMovies <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,…
## $ Contract <dbl> 1, 1, 0, 0, 0, 2, 0, 1, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ PaperlessBilling <fct> No, No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, No, …
## $ PaymentMethod <fct> Mailed check, Bank transfer (automatic), Electronic c…
## $ MonthlyCharges <dbl> 56.95, 42.30, 70.70, 99.65, 104.80, 113.25, 55.20, 19…
## $ TotalCharges <dbl> 1889.50, 1840.75, 151.65, 820.50, 3046.05, 7895.15, 5…
## $ Churn <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ factor.result <fct> 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
## $ result_1 <fct> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1,…
Tudo pronto.
nn <- neuralnet(Churn ~ tenure + MultipleLines + InternetService
+ PhoneService+ MultipleLines + TechSupport
+ StreamingTV + StreamingMovies + Contract,
hidden = 16,
lifesign = "minimal",
linear.output = FALSE,
threshold = 0.1,
data = nn.train)
## Plotando a Rede Neural Artificial
nn$result.matrix
## [,1]
## error 3.270223e+02
## reached.threshold 7.562742e-02
## steps 2.239400e+04
## Intercept.to.1layhid1 1.138743e+00
## tenure.to.1layhid1 -1.831204e-01
## MultipleLines.to.1layhid1 6.483027e+00
## InternetService.to.1layhid1 -1.625252e+00
## PhoneService.to.1layhid1 -4.647132e+00
## TechSupport.to.1layhid1 -7.172801e+02
## StreamingTV.to.1layhid1 6.686042e+00
## StreamingMovies.to.1layhid1 6.560198e-01
## Contract.to.1layhid1 -4.118677e+00
## Intercept.to.1layhid2 1.653959e+01
## tenure.to.1layhid2 -3.296146e+00
## MultipleLines.to.1layhid2 -6.663613e+01
## InternetService.to.1layhid2 7.989322e+00
## PhoneService.to.1layhid2 7.222195e+01
## TechSupport.to.1layhid2 4.329927e+01
## StreamingTV.to.1layhid2 1.347040e+01
## StreamingMovies.to.1layhid2 -3.961656e+01
## Contract.to.1layhid2 1.603783e+02
## Intercept.to.1layhid3 -4.812659e-01
## tenure.to.1layhid3 3.185972e-02
## MultipleLines.to.1layhid3 7.405267e-01
## InternetService.to.1layhid3 -8.344509e-01
## PhoneService.to.1layhid3 -1.326962e+00
## TechSupport.to.1layhid3 8.242099e-01
## StreamingTV.to.1layhid3 1.684514e+00
## StreamingMovies.to.1layhid3 1.648559e+00
## Contract.to.1layhid3 -4.491656e+00
## Intercept.to.1layhid4 1.471965e+01
## tenure.to.1layhid4 1.716796e+01
## MultipleLines.to.1layhid4 -9.221385e+01
## InternetService.to.1layhid4 1.237418e+01
## PhoneService.to.1layhid4 -3.626759e-01
## TechSupport.to.1layhid4 -4.521366e+01
## StreamingTV.to.1layhid4 -1.315815e+02
## StreamingMovies.to.1layhid4 8.688916e+01
## Contract.to.1layhid4 3.857145e-01
## Intercept.to.1layhid5 -4.393576e+00
## tenure.to.1layhid5 2.767732e-02
## MultipleLines.to.1layhid5 5.367237e+00
## InternetService.to.1layhid5 3.338529e+00
## PhoneService.to.1layhid5 -1.440227e+01
## TechSupport.to.1layhid5 -4.599125e+00
## StreamingTV.to.1layhid5 5.301298e+00
## StreamingMovies.to.1layhid5 4.680127e+00
## Contract.to.1layhid5 -8.538929e+00
## Intercept.to.1layhid6 2.511608e+00
## tenure.to.1layhid6 9.524278e+00
## MultipleLines.to.1layhid6 1.079297e+01
## InternetService.to.1layhid6 -1.178394e+01
## PhoneService.to.1layhid6 -1.108906e+01
## TechSupport.to.1layhid6 1.857308e+01
## StreamingTV.to.1layhid6 2.708785e+00
## StreamingMovies.to.1layhid6 -5.412265e-01
## Contract.to.1layhid6 -1.965352e+01
## Intercept.to.1layhid7 2.335836e+01
## tenure.to.1layhid7 2.741479e+00
## MultipleLines.to.1layhid7 -1.489832e-01
## InternetService.to.1layhid7 -4.719904e+01
## PhoneService.to.1layhid7 2.810265e+01
## TechSupport.to.1layhid7 5.776384e-01
## StreamingTV.to.1layhid7 -6.529699e+01
## StreamingMovies.to.1layhid7 3.057242e+01
## Contract.to.1layhid7 2.536915e+01
## Intercept.to.1layhid8 1.316034e+01
## tenure.to.1layhid8 -7.200407e+00
## MultipleLines.to.1layhid8 -1.528312e+01
## InternetService.to.1layhid8 -1.487478e+01
## PhoneService.to.1layhid8 8.655246e-01
## TechSupport.to.1layhid8 -1.080400e+01
## StreamingTV.to.1layhid8 6.116803e+01
## StreamingMovies.to.1layhid8 -2.640689e+01
## Contract.to.1layhid8 6.481398e+01
## Intercept.to.1layhid9 -1.398992e+01
## tenure.to.1layhid9 1.232100e+01
## MultipleLines.to.1layhid9 2.246009e+01
## InternetService.to.1layhid9 -9.838739e+00
## PhoneService.to.1layhid9 7.045568e+00
## TechSupport.to.1layhid9 7.892595e+00
## StreamingTV.to.1layhid9 4.629155e+01
## StreamingMovies.to.1layhid9 1.305894e+01
## Contract.to.1layhid9 -8.208182e+01
## Intercept.to.1layhid10 -3.229738e+00
## tenure.to.1layhid10 -1.372548e+00
## MultipleLines.to.1layhid10 2.348966e+01
## InternetService.to.1layhid10 8.417296e+00
## PhoneService.to.1layhid10 -2.495236e+01
## TechSupport.to.1layhid10 -3.031727e+00
## StreamingTV.to.1layhid10 2.443474e+01
## StreamingMovies.to.1layhid10 2.551706e+01
## Contract.to.1layhid10 -2.264108e+01
## Intercept.to.1layhid11 1.183864e+01
## tenure.to.1layhid11 -1.914676e+01
## MultipleLines.to.1layhid11 -7.327936e+00
## InternetService.to.1layhid11 7.017125e+00
## PhoneService.to.1layhid11 -1.691153e+01
## TechSupport.to.1layhid11 -3.009731e+01
## StreamingTV.to.1layhid11 6.177991e+00
## StreamingMovies.to.1layhid11 3.276077e+01
## Contract.to.1layhid11 -7.108603e+02
## Intercept.to.1layhid12 -1.727974e+01
## tenure.to.1layhid12 1.723715e+00
## MultipleLines.to.1layhid12 -3.588885e+01
## InternetService.to.1layhid12 -2.202742e+01
## PhoneService.to.1layhid12 4.050082e+01
## TechSupport.to.1layhid12 8.346899e+01
## StreamingTV.to.1layhid12 7.861098e+01
## StreamingMovies.to.1layhid12 4.650064e+01
## Contract.to.1layhid12 -1.588456e+02
## Intercept.to.1layhid13 -1.887209e-01
## tenure.to.1layhid13 -1.222886e+00
## MultipleLines.to.1layhid13 1.279114e-01
## InternetService.to.1layhid13 9.131759e-01
## PhoneService.to.1layhid13 3.596237e-01
## TechSupport.to.1layhid13 -1.792084e+00
## StreamingTV.to.1layhid13 3.444678e+00
## StreamingMovies.to.1layhid13 -5.570563e+00
## Contract.to.1layhid13 7.577348e+00
## Intercept.to.1layhid14 -2.682817e+00
## tenure.to.1layhid14 1.293580e+01
## MultipleLines.to.1layhid14 1.402391e+00
## InternetService.to.1layhid14 4.177280e+00
## PhoneService.to.1layhid14 -1.080543e+01
## TechSupport.to.1layhid14 -5.082451e+00
## StreamingTV.to.1layhid14 -1.289244e+01
## StreamingMovies.to.1layhid14 5.330142e+01
## Contract.to.1layhid14 3.784925e+01
## Intercept.to.1layhid15 1.947992e+00
## tenure.to.1layhid15 2.676778e+00
## MultipleLines.to.1layhid15 -9.220034e+00
## InternetService.to.1layhid15 -2.071704e+00
## PhoneService.to.1layhid15 -5.836584e+00
## TechSupport.to.1layhid15 4.831941e+00
## StreamingTV.to.1layhid15 -1.890615e+01
## StreamingMovies.to.1layhid15 6.840077e+00
## Contract.to.1layhid15 4.882700e+01
## Intercept.to.1layhid16 1.317142e+00
## tenure.to.1layhid16 1.777481e-01
## MultipleLines.to.1layhid16 -1.065679e+00
## InternetService.to.1layhid16 -3.437214e+00
## PhoneService.to.1layhid16 -2.040236e+00
## TechSupport.to.1layhid16 -1.064026e+00
## StreamingTV.to.1layhid16 -5.920867e+00
## StreamingMovies.to.1layhid16 -1.986092e+00
## Contract.to.1layhid16 5.951302e+00
## Intercept.to.Churn -3.384574e-01
## 1layhid1.to.Churn -2.682519e+00
## 1layhid2.to.Churn -1.255060e+00
## 1layhid3.to.Churn -3.524509e+00
## 1layhid4.to.Churn 2.479107e+00
## 1layhid5.to.Churn 2.417643e+00
## 1layhid6.to.Churn 3.376629e+00
## 1layhid7.to.Churn -1.204331e+00
## 1layhid8.to.Churn -4.030421e+00
## 1layhid9.to.Churn 2.154360e+00
## 1layhid10.to.Churn 1.464653e+00
## 1layhid11.to.Churn 4.452669e+00
## 1layhid12.to.Churn 1.076528e+00
## 1layhid13.to.Churn 5.419765e+00
## 1layhid14.to.Churn -2.822835e+00
## 1layhid15.to.Churn -3.898946e+00
## 1layhid16.to.Churn -1.877085e+00
plot(nn)
### Acurácia
test_result <- neuralnet::compute(nn, nn.test[, c("tenure", "MultipleLines",
"InternetService",
"PhoneService",
"MultipleLines",
"TechSupport",
"StreamingTV",
"StreamingMovies",
"Contract")])
result_2 <- test_result$net.result
dim(result_2)
## [1] 1760 1
result_2 <- ifelse(result_2 > 0.5, 1, 0)
test$result_2 <- as.factor(as.character(result_2))
table(test$result_2)
##
## 0 1
## 1400 360
table(test$Churn)
##
## 0 1
## 1293 467
(mc <- confusionMatrix(test$result_2, test$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1165 235
## 1 128 232
##
## Accuracy : 0.7938
## 95% CI : (0.7741, 0.8124)
## No Information Rate : 0.7347
## P-Value [Acc > NIR] : 4.928e-09
##
## Kappa : 0.4292
##
## Mcnemar's Test P-Value : 2.643e-08
##
## Sensitivity : 0.9010
## Specificity : 0.4968
## Pos Pred Value : 0.8321
## Neg Pred Value : 0.6444
## Prevalence : 0.7347
## Detection Rate : 0.6619
## Detection Prevalence : 0.7955
## Balanced Accuracy : 0.6989
##
## 'Positive' Class : 0
##
plot_matconf(mc)
#install.packages("randomForest")
library(randomForest)
set.seed(1234)
rf <- train(Churn ~ tenure + MultipleLines + InternetService +
PhoneService+ MultipleLines + TechSupport +
StreamingTV + StreamingMovies + Contract,
data = nn.train,
method = "rf",
trControl = trainControl("cv", number = 10),
importance = TRUE)
# O melhor parâmetro
rf$bestTune
rf$finalModel # o OUT OF THE BOUNDARY estima erro de 14,45%
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1382093
## % Var explained: 29.11
(importance(rf$finalModel))
## %IncMSE IncNodePurity
## tenure 44.58488 106.183964
## MultipleLines 20.14413 7.528720
## InternetService 58.66933 67.555965
## PhoneService 13.48294 4.451239
## TechSupport 27.91221 13.384354
## StreamingTV 24.91013 8.869404
## StreamingMovies 25.21210 9.085068
## Contract 55.79435 95.420806
#Acurácia
result_3 <- predict(rf, nn.test)
length(result_3)
## [1] 1760
result_3 <- ifelse(result_3 > 0.5, 1, 0)
result_3 <- as.factor(as.character(result_3))
# Matriz de Confusão
(mc <- confusionMatrix(result_3, test$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1210 271
## 1 83 196
##
## Accuracy : 0.7989
## 95% CI : (0.7794, 0.8174)
## No Information Rate : 0.7347
## P-Value [Acc > NIR] : 2.073e-10
##
## Kappa : 0.408
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9358
## Specificity : 0.4197
## Pos Pred Value : 0.8170
## Neg Pred Value : 0.7025
## Prevalence : 0.7347
## Detection Rate : 0.6875
## Detection Prevalence : 0.8415
## Balanced Accuracy : 0.6778
##
## 'Positive' Class : 0
##
plot_matconf(mc)
# Acurácia: 79,89% (IC 95% 77,94% e 81,74%)
# Plotando
# Plot MeanDecreaseAccuracy
varImpPlot(rf$finalModel, type = 1, col = "blue")
# Plot MeanDecreaseGini
varImpPlot(rf$finalModel, type = 2, scale = TRUE, col ="red")
varImp(rf) # Isso aqui é mega importante
## rf variable importance
##
## Overall
## InternetService 100.00
## Contract 93.64
## tenure 68.83
## TechSupport 31.93
## StreamingMovies 25.96
## StreamingTV 25.29
## MultipleLines 14.74
## PhoneService 0.00
var.imp <-varImp(rf)
var.imp$importance
data<- data.frame(name = row.names(var.imp$importance),
value = var.imp$importance$Overall)
data$name <- as.factor(data$name)
data$value <- as.numeric(data$value)
dev.off()
## pdf
## 3
data %>%
mutate(name = fct_reorder(name, value)) %>%
ggplot( aes(x=name, y=value)) +
geom_bar(stat="identity", fill="brown3", width= .6) +
coord_flip() +
xlab("") +
theme_bw()
library(e1071)
nB <- naiveBayes(Churn ~ tenure + MultipleLines + InternetService +
PhoneService+ MultipleLines + TechSupport +
StreamingTV + StreamingMovies + Contract,
data = train)
result_5 <- predict(nB, test, type="class")
class(test$Churn)
## [1] "factor"
levels(result_5)[levels(result_5)=="No"] <- "0"
levels(result_5)[levels(result_5)=="Yes"] <- "1"
table(result_5)
## result_5
## 0 1
## 1209 551
table(test$Churn)
##
## 0 1
## 1293 467
(mc <- confusionMatrix(result_5, test$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1059 150
## 1 234 317
##
## Accuracy : 0.7818
## 95% CI : (0.7618, 0.8009)
## No Information Rate : 0.7347
## P-Value [Acc > NIR] : 2.783e-06
##
## Kappa : 0.4708
##
## Mcnemar's Test P-Value : 2.280e-05
##
## Sensitivity : 0.8190
## Specificity : 0.6788
## Pos Pred Value : 0.8759
## Neg Pred Value : 0.5753
## Prevalence : 0.7347
## Detection Rate : 0.6017
## Detection Prevalence : 0.6869
## Balanced Accuracy : 0.7489
##
## 'Positive' Class : 0
##
plot_matconf(mc)
# Acurácia de 78,18% IC = 95% (76,18% 80,09%)
| Modelo | Acurácia |
|---|---|
| Regressão Logística | 80,2% |
| Árvore de Decisão | 79,6% |
| Rede Neural | 79,4% |
| Random Forest | 79,9% |
| Naïve Bayes | 78,2% |
Seu trabalho é continuar o desenvolvimento de modelos de classifição para prever o “churn”, utilizando os algoritmos que vimos em aula.
Note que os dados faltantes (NA´s) forma substituídos de modo “inocente” pela mediana dos dados. Empregue as técnicas vistas em aula para selecionar a melhor maneira de substituir esses dados no dataset.
Em seu relatório, complemente a tabela acima com os resulados obtidos, além de desenhar (como fizemos em cada caso, com o emprego da função plot_matconf()) a Matriz de Confusão de cada modelo.
Escreva uma conclusão, considerando que sua audiência não é versada em aprendizagem de máquina ou inteligência artificial.
Bom trabalho!