The goal of this final project is to apply TWO of the techniques covered in the course Estadística Aplicada a la Investigación de Mercado (Statistics Applied to Market Research).
The dataset was taken from Kaggle [1]. It is a public dataset built through web scraping in order to analyze the Android app market, and it was published 4 years ago by Lavanya Gupta.
The variables in the data frame are listed below. Note that all of this information reflects the moment the data was scraped, that is, 2019.
| Variable | Description |
|---|---|
| App | Name of the application |
| Category | Category the app belongs to |
| Rating | Score that users gave the app |
| Reviews | Number of user reviews of the app |
| Size | Size of the app |
| Installs | Number of users who downloaded the app |
| Type | Whether the app is paid (Paid) or free (Free) |
| Price | Price of the app |
| Content Rating | Age group the app is intended for (Everyone, Everyone 10+, Teen, Mature 17+, Adults only 18+, Unrated) |
| Genres | Genre of the app. An app can belong to multiple genres |
| Last Updated | Date of the last update |
| Current Ver | Current version of the app available on the Play Store |
| Android Ver | Minimum Android version required to install the app |
The Asociación por el Derecho al Acceso (ADA) wants to develop a Google Play Store app to promote and raise awareness of the right of access to information, so it needs to understand the Google Play Store app market.
The goal of this work is to analyze the characteristics of the apps hosted on the Google Play Store and determine how they influence app popularity (measured by the number of installs). In this way, we investigate the app market to decide which variables should be taken into account to make this app popular. Should it be a game or an informational app? Under which label would it be best to classify it?
The work is structured as follows. First, we carry out the exploratory analysis and clean the data. Second, we explore which characteristics contribute to the popularity of the apps. Finally, we try to predict app popularity (measured by installs).
We start by loading the dataset and exploring its variables [2]:
# Load the packages this chunk needs
library(tidyverse)
library(janitor)
library(gt)
# Load the dataset and clean its column names
apps <- read_csv("Data/googleplaystore.csv")
bd_apps <- apps %>%
  clean_names()
# Show the first rows
head(bd_apps) %>%
  gt()
| app | category | rating | reviews | size | installs | type | price | content_rating | genres | last_updated | current_ver | android_ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| U Launcher Lite – FREE Live Cool Themes, Hide Apps | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
| Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 50,000+ | Free | 0 | Everyone | Art & Design | March 26, 2017 | 1.0 | 2.3 and up |
Composition of the variables:
#visualizamos las variables
bd_apps %>%
skimr::skim()
| Name | Piped data |
| Number of rows | 10841 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 11 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| app | 0 | 1 | 1 | 194 | 0 | 9660 | 0 |
| category | 0 | 1 | 3 | 19 | 0 | 34 | 0 |
| size | 0 | 1 | 3 | 18 | 0 | 462 | 0 |
| installs | 0 | 1 | 1 | 14 | 0 | 22 | 0 |
| type | 0 | 1 | 1 | 4 | 0 | 4 | 0 |
| price | 0 | 1 | 1 | 8 | 0 | 93 | 0 |
| content_rating | 1 | 1 | 4 | 15 | 0 | 6 | 0 |
| genres | 0 | 1 | 4 | 37 | 0 | 120 | 0 |
| last_updated | 0 | 1 | 6 | 18 | 0 | 1378 | 0 |
| current_ver | 1 | 1 | 1 | 50 | 0 | 2833 | 0 |
| android_ver | 1 | 1 | 3 | 18 | 0 | 34 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| rating | 1474 | 0.86 | 4.19 | 0.54 | 1 | 4 | 4.3 | 4.5 | 19 | ▇▁▁▁▁ |
| reviews | 1 | 1.00 | 444152.90 | 2927760.60 | 0 | 38 | 2094.0 | 54775.5 | 78158306 | ▇▁▁▁▁ |
# Variable types
bd_apps %>% vis_dat(warn_large_data = FALSE)
# Missing values
vis_miss(bd_apps)
# Missing values are concentrated in rating
Of the 13 columns, 11 are of type “character” and 2 are numeric. Next we convert some of these variables.
Rating is the only variable with a sizeable share of missing values (1,474 of 10,841 rows). Other variables such as size or android_ver show almost none only because their missing entries are recorded as the string “Varies with device”, which will be replaced by NA. There are also duplicated rows, so we keep only the unique ones.
nrow(bd_apps %>%
distinct())
## [1] 10358
bd_apps <- bd_apps %>%
distinct()
bd_apps %>%
filter(installs == "Free") %>%
gt()
| app | category | rating | reviews | size | installs | type | price | content_rating | genres | last_updated | current_ver | android_ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Life Made WI-Fi Touchscreen Photo Frame | 1.9 | 19 | NA | 1,000+ | Free | 0 | Everyone | NA | February 11, 2018 | 1.0.19 | 4.0 and up | NA |
We remove this observation: its values are shifted one column to the left (the category field holds 1.9 and the rating field holds 19), so it has a rating above 5 and a category that does not match the rest of the apps.
The installs, price and size variables contain non-numeric characters, so we transform them for later use. The size variable carries the suffix “M” (MB) or “k” (KB), and it also takes the value “Varies with device”. This happens because Google Play allows publishing different APKs for each app, each targeting a different device configuration, so that when the user taps “install” Android selects the appropriate resources for the device. To convert this variable to numeric, everything is expressed in KB, that is, MB values are multiplied by 1000, given that 1 MB = 1000 KB. The same happens with android_ver, which also contains “Varies with device” values.
# Clean and convert the variables
bd_apps <- bd_apps %>%
  filter(app != "Life Made WI-Fi Touchscreen Photo Frame") %>%
  mutate(last_updated = mdy(last_updated), # parse as a date (lubridate)
         installs = gsub("\\+", "", installs), # strip the "+" and "," symbols
         installs = gsub(",", "", installs),
         installs = as.numeric(installs),
         reviews = as.numeric(reviews), # make sure it is numeric
         price = as.numeric(gsub("\\$", "", as.character(price))), # strip the "$" symbol
         android_ver = gsub("Varies with device", NA, android_ver), # "Varies with device" becomes NA
         android_ver = as.numeric(substr(android_ver, start = 1, stop = 3)), # keep major.minor only
         # size in KB: "M" values are multiplied by 1000, "k" values are kept, anything else becomes NA
         size_num = ifelse(grepl("M", size), as.numeric(sub("([0-9\\.]+)M", "\\1", size))*1000, as.numeric(sub("([0-9\\.]+)k", "\\1", size)))) %>%
  # Two apps have 0 or NA in type; keep only those with Free or Paid
  filter(type %in% c("Free", "Paid"))
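The unit conversion described above can be illustrated in isolation. The following is a minimal sketch in base R, not the code used in the report: parse_size_kb is a hypothetical helper introduced only for this example, and it assumes that valid sizes always end in “M” or “k”.

```r
# Hypothetical helper (not part of the original analysis): convert Play Store
# size strings such as "19M" or "201k" to kilobytes, using 1 MB = 1000 KB.
parse_size_kb <- function(size) {
  # Strip the trailing unit; anything non-numeric becomes NA
  value <- suppressWarnings(as.numeric(sub("[Mk]$", "", size)))
  # 1000 for megabytes, 1 for kilobytes, NA for any other string
  unit_factor <- ifelse(grepl("M$", size), 1000,
                        ifelse(grepl("k$", size), 1, NA_real_))
  value * unit_factor
}

sizes_kb <- parse_size_kb(c("19M", "8.7M", "201k", "Varies with device"))
print(sizes_kb)
```

Strings without a recognised unit, such as “Varies with device”, come out as NA, which is the behaviour the cleaning step relies on.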
bd_apps %>%
skimr::skim()
| Name | Piped data |
| Number of rows | 10356 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| Date | 1 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| app | 0 | 1 | 1 | 194 | 0 | 9658 | 0 |
| category | 0 | 1 | 4 | 19 | 0 | 33 | 0 |
| size | 0 | 1 | 3 | 18 | 0 | 461 | 0 |
| type | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
| content_rating | 0 | 1 | 4 | 15 | 0 | 6 | 0 |
| genres | 0 | 1 | 4 | 37 | 0 | 119 | 0 |
| current_ver | 1 | 1 | 1 | 50 | 0 | 2832 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_updated | 0 | 1 | 2010-05-21 | 2018-08-08 | 2018-05-20 | 1377 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| rating | 1464 | 0.86 | 4.19 | 0.52 | 1.0 | 4 | 4.3 | 4.50 | 5 | ▁▁▁▆▇ |
| reviews | 0 | 1.00 | 405943.81 | 2696905.10 | 0.0 | 32 | 1683.0 | 46438.25 | 78158306 | ▇▁▁▁▁ |
| installs | 0 | 1.00 | 14159126.55 | 80243307.58 | 0.0 | 1000 | 100000.0 | 1000000.00 | 1000000000 | ▇▁▁▁▁ |
| price | 0 | 1.00 | 1.03 | 16.28 | 0.0 | 0 | 0.0 | 0.00 | 400 | ▇▁▁▁▁ |
| android_ver | 1222 | 0.88 | 3.85 | 0.84 | 1.0 | 4 | 4.1 | 4.10 | 8 | ▂▁▇▁▁ |
| size_num | 1525 | 0.85 | 21287.79 | 22540.25 | 8.5 | 4700 | 13000.0 | 29000.00 | 100000 | ▇▂▁▁▁ |
#transformamos type a binaria
bd_apps$type_bin <- ifelse(bd_apps$type == "Paid", 1, 0)
#mostramos el resultado
g2 <- bd_apps %>%
group_by(type) %>%
summarise(N=n()) %>%
ggplot(aes(N, reorder(type,N))) +
geom_col(fill = "#009999") +
theme_classic() +
labs(x = " ",
y = " ",
title = "Distribución de aplicaciones por tipo (pago/gratuito)")
ggplotly(g2)
The type variable indicates whether the app is paid or free. To use it in the analyses we turn it into a binary variable, assigning 1 if the app is paid and 0 if it is free.
Next we recode content_rating into age ranges:
bd_apps <- bd_apps %>%
mutate(grupo_edades = case_when(content_rating == "Everyone" ~ "4+",
content_rating == "Everyone 10+" ~ "9+",
content_rating == "Teen" ~ "12+",
content_rating == "Mature 17+" ~ "17+",
content_rating == "Unrated" ~ "9+",
content_rating == "Adults only 18+" ~ "17+"))
bd_apps %>%
filter(content_rating == "Unrated") %>%
gt()
| app | category | rating | reviews | size | installs | type | price | content_rating | genres | last_updated | current_ver | android_ver | size_num | type_bin | grupo_edades |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Best CG Photography | FAMILY | NaN | 1 | 2.5M | 500 | Free | 0 | Unrated | Entertainment | 2015-06-24 | 5.2 | 3.0 | 2500 | 0 | 9+ |
| DC Universe Online Map | TOOLS | 4.1 | 1186 | 6.4M | 50000 | Free | 0 | Unrated | Tools | 2012-02-27 | 1.3 | 2.3 | 6400 | 0 | 9+ |
g1 <- bd_apps %>%
group_by(grupo_edades) %>%
summarise(N=n()) %>%
ggplot(aes(N, reorder(grupo_edades,N))) +
geom_col(fill = "#009999") +
theme_classic() +
labs(x = "Número de aplicaciones",
y = "Grupo de edades",
title = "Distribución de aplicaciones por grupo de edades")
ggplotly(g1)
Two apps are unrated: Best CG Photography and DC Universe Online Map. Since they are utility-type apps, we classify them as “Everyone 10+”, that is, 9+. Only 3 apps fall under 18+, so they are merged into 17+.
#vemos la distribución de los grupos de edad en las apps
edades <- bd_apps %>%
group_by(grupo_edades) %>%
summarize(Total = n()) %>%
mutate(perc = round(Total/sum(Total) * 100)) %>%
arrange(-perc)
perc_counts <- edades$perc
names(perc_counts) <- edades$grupo_edades
# Graficamos
waffle(perc_counts) +
theme_minimal() +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5)) +
labs(title = "Porcentaje de apps por grupos de edad")
rm(edades)
rm(perc_counts)
Most apps are rated “Everyone”, that is, suitable for ages 4 and up.
bd_apps %>%
group_by(genres) %>%
summarise(n = n()) %>%
head(15) %>%
gt()
| genres | n |
|---|---|
| Action | 356 |
| Action;Action & Adventure | 15 |
| Adventure | 75 |
| Adventure;Action & Adventure | 13 |
| Adventure;Brain Games | 1 |
| Adventure;Education | 2 |
| Arcade | 218 |
| Arcade;Action & Adventure | 15 |
| Arcade;Pretend Play | 1 |
| Art & Design | 58 |
| Art & Design;Action & Adventure | 2 |
| Art & Design;Creativity | 7 |
| Art & Design;Pretend Play | 2 |
| Auto & Vehicles | 85 |
| Beauty | 53 |
#exploramos las etiquetas, dividimos las etiquetas
generos <- bd_apps %>%
separate_rows(genres, sep = ";") %>%
# separate_rows(genres, sep = "&") %>%
count(genres) %>%
arrange(desc(n))
g3 <- ggplot(generos, aes(x = genres, y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
xlab("Género") +
ylab("Frecuencia") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(x = "Géneros",
y = "N° de apps",
title = "Distribución de aplicaciones por género")
ggplotly(g3)
The genres variable holds the labels that define the genre of an app, and an app can carry more than one label. Given the number of categories it contains (119 distinct values), we drop it from the models developed later.
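For readers who want to see what splitting the labels amounts to, here is a minimal base R sketch on a handful of genre strings copied from the tables above. It is an illustration only; the report itself uses separate_rows.

```r
# A few genre strings as they appear in the dataset (labels separated by ";")
genres <- c("Art & Design",
            "Art & Design;Pretend Play",
            "Action;Action & Adventure",
            "Action")

# Split each string on ";" and flatten into one vector of individual labels
labels <- unlist(strsplit(genres, ";", fixed = TRUE))

# Count how often each individual label occurs
label_counts <- sort(table(labels), decreasing = TRUE)
print(label_counts)
```

Four apps yield six labels here, which shows why counting labels is not the same as counting apps.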
# Valores nulos
md.pattern(bd_apps, rotate.names = TRUE, plot = TRUE)
## app category reviews size installs type price content_rating genres
## 7368 1 1 1 1 1 1 1 1 1
## 343 1 1 1 1 1 1 1 1 1
## 1407 1 1 1 1 1 1 1 1 1
## 15 1 1 1 1 1 1 1 1 1
## 55 1 1 1 1 1 1 1 1 1
## 1125 1 1 1 1 1 1 1 1 1
## 42 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0
## last_updated type_bin grupo_edades current_ver android_ver rating size_num
## 7368 1 1 1 1 1 1 1
## 343 1 1 1 1 1 1 0
## 1407 1 1 1 1 1 0 1
## 15 1 1 1 1 1 0 0
## 55 1 1 1 1 0 1 1
## 1125 1 1 1 1 0 1 0
## 42 1 1 1 1 0 0 0
## 1 1 1 1 0 1 1 1
## 0 0 0 1 1222 1464 1525
##
## 7368 0
## 343 1
## 1407 1
## 15 2
## 55 1
## 1125 2
## 42 3
## 1 1
## 4212
# Visualización de perdidos
gg_miss_var(bd_apps, show_pct = T)
#Analizamos el patrón de los valores nulos
gg_miss_upset(bd_apps,nsets = 10)
The missing values in android_ver, rating and size_num follow a pattern: they tend to occur together (for example, 1,125 rows lack both android_ver and size_num), so simply dropping the rows with missing values is not appropriate.
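The idea of a missingness pattern can be sketched in base R on a toy data frame. The values below are made up for illustration and the coding is an assumption of this example: 1 marks a missing cell, which is the opposite of the md.pattern output above, where 1 marks an observed cell.

```r
# Toy data (invented values) with missing entries in three variables
toy <- data.frame(
  rating      = c(4.1, NA, 4.5, NA, 3.9),
  size_num    = c(19000, 14000, NA, NA, NA),
  android_ver = c(4.0, 4.1, NA, NA, 4.4)
)

# Logical matrix: TRUE where a value is missing
missing_flags <- is.na(toy)

# One pattern string per row, e.g. "011" = size_num and android_ver missing
pattern <- apply(missing_flags, 1, function(row) paste(as.integer(row), collapse = ""))
pattern_counts <- table(pattern)
print(pattern_counts)

# Rows where size_num and android_ver are missing at the same time
both_missing <- sum(missing_flags[, "size_num"] & missing_flags[, "android_ver"])
print(both_missing)
```

When a joint pattern such as “011” is frequent, the variables are not missing independently of each other, which is the situation described above.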
As noted above, the rating variable has missing values. We look at how these are distributed with respect to the number of installs and reviews:
# Missing ratings by number of installs
t1 <- bd_apps %>%
  filter(is.na(rating)) %>%
  count(installs) %>%
  mutate(porcentaje = round(n / sum(n) * 100, 2),
         porcentaje_acumulado = cumsum(porcentaje))
# Missing ratings by number of reviews
t2 <- bd_apps %>%
  filter(is.na(rating)) %>%
  count(reviews) %>%
  mutate(porcentaje = round(n / sum(n) * 100, 2),
         porcentaje_acumulado = cumsum(porcentaje))
kable(list(t1, t2))
About 80% of the apps with a missing rating have fewer than 500 installs and fewer than 10 reviews. This may be because they have been downloaded so little that users have not rated them yet, or because of an error in the web scraping.
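The cumulative percentages behind a statement like this one can be computed as in the following base R sketch. The install counts are invented for illustration; the point is only the mechanics of dividing by the number of rows actually present and accumulating.

```r
# Invented install counts for ten apps with a missing rating
installs_missing <- c(1, 5, 10, 10, 50, 100, 100, 100, 500, 1000)

# Frequency of each install level, sorted in increasing order
counts <- table(installs_missing)

# Percentage of each level, relative to the rows actually present
share <- round(100 * as.numeric(counts) / sum(counts), 2)

summary_tbl <- data.frame(
  installs = as.numeric(names(counts)),
  n = as.integer(counts),
  porcentaje = share,
  porcentaje_acumulado = cumsum(share)
)
print(summary_tbl)
```

Dividing by sum(counts) rather than by a hard-coded total keeps the percentages correct if rows are removed earlier in the pipeline.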
bd_apps %>%
group_by(size_num) %>%
summarise(N= n()) %>%
arrange(desc(N)) %>%
head() %>%
gt()
| size_num | N |
|---|---|
| NA | 1525 |
| 11000 | 188 |
| 12000 | 186 |
| 13000 | 186 |
| 14000 | 182 |
| 15000 | 174 |
Of the 10,356 records, 1,525 (15%) have no defined size, for the reason explained above. Let us look at the distribution of this variable:
# histograma de la variable "size"
p1<- ggplot(bd_apps, aes(x = size_num)) +
geom_histogram(color="black", fill="lightblue", binwidth = 5000) +
labs(title = "Distribución del tamaño de las aplicaciones",
x = "Tamaño (KB)", y = "Frecuencia") +
theme_minimal() +
theme(plot.title = element_text(size=14, face="bold"),
axis.title.x = element_text(size=12),
axis.title.y = element_text(size=12),
axis.text = element_text(size=10))
# densidad de la variable "size"
p2 <-ggplot(bd_apps, aes(x = size_num)) +
geom_density(color="black", fill="lightblue") +
labs(x = "Tamaño (KB)", y = "Densidad") +
theme_minimal() +
theme(plot.title = element_text(size=14, face="bold"),
axis.title.x = element_text(size=12),
axis.title.y = element_text(size=12),
axis.text = element_text(size=10))
grid.arrange(p1, p2, ncol=2)
There are 1,525 rows with NA in size_num because they correspond to the “Varies with device” category.
bd_apps %>%
group_by(android_ver) %>%
summarise(n = n()) %>%
ggplot(aes(x = as.factor(android_ver), y = n, fill = android_ver)) +
geom_bar(stat = "identity") +
labs(x = "Versión de Android", y = "Frecuencia", fill = "Versión de Android") +
theme_classic()
bd_apps %>%
group_by(current_ver) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
head() %>%
gt()
| current_ver | n |
|---|---|
| Varies with device | 1301 |
| 1.0 | 802 |
| 1.1 | 260 |
| 1.2 | 177 |
| 2.0 | 149 |
| 1.3 | 142 |
The current_ver variable has 2,832 unique values, about 27% of the records, which suggests it is not very informative. Moreover, it cannot be converted to numeric: it is a categorical variable that identifies the current version of the app, and version identifiers are not always plain numbers. For these reasons we leave this variable out of the subsequent analysis.
Next we use multiple imputation from the mice package to fill in the missing values. Note that complete(imp) returns only the first of the imputed datasets, which is the one used from here on.
#Imputamos los NA
imp <- mice(bd_apps[, c("size_num", "rating", "android_ver")])
# Visualizamos la distribución de variables antes y después de la imputación
kableExtra::kable(summary(bd_apps[, c("size_num", "rating", "android_ver")]),caption = "Estructura variables previo a imputar")
| size_num | rating | android_ver | |
|---|---|---|---|
| Min. : 8.5 | Min. :1.000 | Min. :1.000 | |
| 1st Qu.: 4700.0 | 1st Qu.:4.000 | 1st Qu.:4.000 | |
| Median : 13000.0 | Median :4.300 | Median :4.100 | |
| Mean : 21287.8 | Mean :4.188 | Mean :3.853 | |
| 3rd Qu.: 29000.0 | 3rd Qu.:4.500 | 3rd Qu.:4.100 | |
| Max. :100000.0 | Max. :5.000 | Max. :8.000 | |
| NA’s :1525 | NA’s :1464 | NA’s :1222 |
kableExtra::kable(summary(complete(imp)[, c("size_num", "rating", "android_ver")]),caption = "Estructura variables imputadas")
| size_num | rating | android_ver | |
|---|---|---|---|
| Min. : 8.5 | Min. :1.000 | Min. :1.000 | |
| 1st Qu.: 4800.0 | 1st Qu.:4.000 | 1st Qu.:4.000 | |
| Median : 13000.0 | Median :4.300 | Median :4.100 | |
| Mean : 21648.0 | Mean :4.185 | Mean :3.857 | |
| 3rd Qu.: 30000.0 | 3rd Qu.:4.500 | 3rd Qu.:4.100 | |
| Max. :100000.0 | Max. :5.000 | Max. :8.000 |
# Append the imputed variables to the original data, with an "_imp" suffix
imputadas <- complete(imp)
colnames(imputadas) <- paste0(colnames(imputadas), "_imp")
bd_apps_imputed <- cbind(bd_apps, imputadas)
# Drop the variables that are no longer needed or were already imputed, and build the dataset used for the rest of the assignment
df_apps <- bd_apps_imputed %>%
select(-rating,
-size,
-size_num,
-android_ver,
-content_rating,
-current_ver)
# Verificamos la estructura de la nueva base de datos
skimr::skim(df_apps)
| Name | df_apps |
| Number of rows | 10356 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| Date | 1 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| app | 0 | 1 | 1 | 194 | 0 | 9658 | 0 |
| category | 0 | 1 | 4 | 19 | 0 | 33 | 0 |
| type | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
| genres | 0 | 1 | 4 | 37 | 0 | 119 | 0 |
| grupo_edades | 0 | 1 | 2 | 3 | 0 | 4 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_updated | 0 | 1 | 2010-05-21 | 2018-08-08 | 2018-05-20 | 1377 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| reviews | 0 | 1 | 405943.81 | 2696905.10 | 0.0 | 32 | 1683.0 | 46438.25 | 78158306 | ▇▁▁▁▁ |
| installs | 0 | 1 | 14159126.55 | 80243307.58 | 0.0 | 1000 | 100000.0 | 1000000.00 | 1000000000 | ▇▁▁▁▁ |
| price | 0 | 1 | 1.03 | 16.28 | 0.0 | 0 | 0.0 | 0.00 | 400 | ▇▁▁▁▁ |
| type_bin | 0 | 1 | 0.07 | 0.26 | 0.0 | 0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
| size_num_imp | 0 | 1 | 21647.96 | 22755.34 | 8.5 | 4800 | 13000.0 | 30000.00 | 100000 | ▇▂▁▁▁ |
| rating_imp | 0 | 1 | 4.19 | 0.53 | 1.0 | 4 | 4.3 | 4.50 | 5 | ▁▁▁▆▇ |
| android_ver_imp | 0 | 1 | 3.86 | 0.84 | 1.0 | 4 | 4.1 | 4.10 | 8 | ▂▁▇▁▁ |
We check the structure of the missing values again:
# variables por tipo
df_apps %>% vis_dat(warn_large_data = F)
#valores perdidos
vis_miss(df_apps)
Next, boxplot.stats computes the lower and upper whisker limits of each variable, and we count the values that fall below the lower limit or above the upper limit:
count_outliers <- function(x) {
bp <- boxplot.stats(x)
sum(x < bp$stats[1] | x > bp$stats[5])
}
bd_numerico <- df_apps %>%
select_if(is.numeric)
# Contamos los outliers en la base de datos de ejemplo
sapply(bd_numerico, count_outliers)
## reviews installs price type_bin size_num_imp
## 1869 2566 765 765 636
## rating_imp android_ver_imp
## 586 4034
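The whisker limits used above correspond to Tukey's rule: a value is flagged when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. The sketch below makes that rule explicit in base R and checks it against boxplot.stats on a toy vector; count_outliers_tukey is a hypothetical name used only in this example.

```r
# Hypothetical helper: count outliers with Tukey's rule written out explicitly
count_outliers_tukey <- function(x, coef = 1.5) {
  hinges <- fivenum(x)[c(2, 4)]   # lower and upper hinge
  spread <- diff(hinges)          # distance between the hinges
  sum(x < hinges[1] - coef * spread | x > hinges[2] + coef * spread)
}

x <- c(1, 2, 3, 4, 5, 100)

# Explicit rule
n_tukey <- count_outliers_tukey(x)

# Same count obtained from the whisker ends, as in count_outliers above
bp <- boxplot.stats(x)
n_whiskers <- sum(x < bp$stats[1] | x > bp$stats[5])

print(c(tukey = n_tukey, whiskers = n_whiskers))
```

Both approaches flag only the value 100 here. For a binary variable such as type_bin the same rule flags the whole minority class, which is why its count equals the number of paid apps.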
# Boxplots
plot1 <- ggplot(df_apps, aes(y = installs)) +
geom_boxplot(aes(fill = "installs")) +
scale_fill_manual(values = '#FF689f', guide= FALSE) +
ggtitle("Boxplot para la variable installs") +
ylab("Cantidad") +
theme_classic()
plot2 <- ggplot(df_apps, aes(y = price)) +
geom_boxplot(aes(fill = "price")) +
scale_fill_manual(values = '#DC71FA', guide= FALSE) +
ggtitle("Boxplot para la variable price") +
ylab("Cantidad") +
theme_classic()
plot3 <- ggplot(df_apps, aes(y = rating_imp )) +
geom_boxplot(aes(fill = "rating_imp")) +
scale_fill_manual(values = '#00ABFD', guide= FALSE) +
ylab("Cantidad") +
theme_classic() +
ggtitle("Boxplot para la variable rating")
# Boxplot of reviews
plot4 <- ggplot(df_apps, aes(x = "", y = reviews)) +
geom_boxplot(aes(fill = "reviews")) +
scale_fill_manual(values = '#00C1AA', guide= FALSE)+
ggtitle("Boxplot para la variable reviews")+
ylab("Cantidad") +
theme_classic()
plot5 <- ggplot(df_apps, aes(y = size_num_imp)) +
geom_boxplot(aes(fill = "size_num_imp")) +
scale_fill_manual(values = '#39B600', guide= FALSE)+
ggtitle("Boxplot para la variable size_num") +
ylab("Tamaño en KB") +
theme_classic()
plot6 <-ggplot(df_apps, aes(y = android_ver_imp )) +
geom_boxplot(aes(fill = "android_ver_imp")) +
scale_fill_manual(values = '#F37B59', guide= FALSE)+
ggtitle("Boxplot para la variable android version") +
ylab("Cantidad") +
theme_classic()
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, ncol = 2)
Outliers are present in most of the numeric variables. They reflect the dispersion of the data rather than measurement error: some apps simply have a higher price or more installs than others. For type_bin, the flagged values are an artifact of it being a binary variable (they are the 765 paid apps), so it makes no sense to impute them and alter its distribution.
The price variable shows outliers because most of the apps in the sample are free. We decided not to impute these values, since they are legitimate and represent the actual price of the apps.
quantile(df_apps$price, na.rm=TRUE)
## 0% 25% 50% 75% 100%
## 0 0 0 0 400
At least 75% of the apps are free (the mean of type_bin, 0.07, puts the share of paid apps at about 7%). For the subsequent analyses we decided to drop this variable and use only type, that is, whether the app is paid or free.
df_apps <- df_apps %>%
select(-price)
Reviews
ggplot(df_apps, aes(x = reviews)) +
geom_histogram(bins = 30, color = "black", fill = "lightblue") +
labs(title = "Histograma de Reviews", x = "Número de Reviews", y = "Frecuencia")
summary(df_apps$reviews)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 32 1683 405944 46438 78158306
Most apps have few reviews: 50% have 1,683 or fewer. Variability is very large, with a mean of 405,944 and a maximum of 78,158,306, which points to the presence of outliers.
We do not consider it appropriate to replace the outliers of rating and android_ver, because the former only takes values from 1 to 5 and the latter from 1 to 8.
plot1 <- ggplot(df_apps, aes(x = rating_imp)) +
geom_histogram(binwidth = 0.5, fill = "#00ABFD", color = "white") +
ggtitle("Distribución de Rating") +
xlab("Rating") +
ylab("Frecuencia") +
theme_classic()
plot_3 <- ggplot(df_apps, aes(x = rating_imp)) +
geom_density(fill = "#00ABFD", color = "white") +
ggtitle("Densidad de rating") +
xlab("Rating") +
ylab("Densidad") +
theme_classic()
plot2 <- ggplot(df_apps, aes(x = android_ver_imp)) +
geom_histogram(binwidth = 0.5, fill = "#00ABFD", color = "white") +
ggtitle("Distribución de android_ver") +
xlab("version android") +
ylab("Frecuencia") +
theme_classic()
plot4 <- ggplot(df_apps, aes(x = android_ver_imp)) +
geom_density(fill = "#00ABFD", color = "white") +
ggtitle("Densidad de android_ver") +
xlab("versión android") +
ylab("Densidad") +
theme_classic()
grid.arrange(plot1, plot_3, plot2,plot4, ncol =2 )
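Next we treat the outliers of installs, reviews and size using winsorize from the robustHD package, in order to reduce the impact of extreme values. Winsorizing replaces each outlier with the nearest value that is still considered acceptable, instead of removing it. With its default settings, robustHD's winsorize appears to place those limits at the median ± 2 × MAD rather than at the quantiles given in probs: the resulting upper limit for installs, 396,371.7, matches median + 2 × MAD = 100,000 + 2 × 148,185.9.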
#imputamos los outliers
df_final <- df_apps %>%
mutate(installs = winsorize(installs, probs = c(0.05, 0.95)),
reviews = winsorize(reviews, probs = c(0.05, 0.95)),
size_num_imp = winsorize(size_num_imp, probs = c(0.05, 0.95)))
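To make the effect of this step concrete, here is a minimal base R sketch of winsorization with limits at the median ± 2 × MAD. It is an assumption of this example that these limits mirror the default behaviour of robustHD::winsorize; winsorize_mad is a hypothetical function written for illustration, not the package's implementation.

```r
# Hypothetical helper: shrink values beyond median +/- const * MAD to those limits
winsorize_mad <- function(x, const = 2) {
  center <- median(x)
  spread <- mad(x)  # 1.4826 * median(|x - median(x)|)
  lower <- center - const * spread
  upper <- center + const * spread
  pmin(pmax(x, lower), upper)
}

x <- c(1, 2, 3, 4, 100)
x_wins <- winsorize_mad(x)
print(x_wins)
```

Only the extreme value is changed: it is pulled in to the upper limit, while the remaining values are left untouched. A quantile-based variant would instead clip at, for example, the 5th and 95th percentiles.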
We analyze the new distribution:
bd_numerico2 <- df_final %>%
select_if(is.numeric)
# Contar los outliers en la base de datos de ejemplo
sapply(bd_numerico2, count_outliers)
## reviews installs type_bin size_num_imp rating_imp
## 0 0 765 0 586
## android_ver_imp
## 4034
# Crear el boxplot
plot1 <- ggplot(df_final, aes(y = installs)) +
geom_boxplot(aes(fill = "installs")) +
scale_fill_manual(values = '#FF689f', guide= FALSE) +
ggtitle("Boxplot para la variable installs") +
ylab("Cantidad") +
theme_classic()
plot5 <- ggplot(df_final, aes(y = size_num_imp)) +
geom_boxplot(aes(fill = "size_num_imp")) +
scale_fill_manual(values = '#39B600', guide= FALSE)+
ggtitle("Boxplot para la variable size_num") +
ylab("Tamaño en KB") +
theme_classic()
plot6 <-ggplot(df_final, aes(y = reviews )) +
geom_boxplot(aes(fill = "reviews")) +
scale_fill_manual(values = '#F37B59', guide= FALSE)+
ggtitle("Boxplot para la variable reviews") +
ylab("Cantidad") +
theme_classic()
grid.arrange(plot1,plot5, plot6, ncol = 1)
g <- df_final %>%
group_by(category) %>%
summarise(N = n()) %>%
ggplot(aes(x = category, y = N, size = N, color = category)) +
geom_point() +
theme_classic()+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
labs(x = "", y = " ", title = "Cantidad de apps por categoría") +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold")) +
theme(legend.position = "none")
ggplotly(g)
rm(g)
Among the existing categories, most apps are classified as Family, Game and Tools. Next we look at which categories are installed the most:
g <- df_final %>%
group_by(category) %>%
summarise(descargas = sum(installs)) %>%
ggplot(aes(x = reorder(category,-descargas), y = descargas, size = descargas, color = category)) +
geom_point() +
theme_classic()+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
labs(x = "", y = " ", title = "Categorías más populares (Cantidad de instalaciones)", caption = "En millones de descargas") +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold")) +
theme(legend.position = "none") +
scale_y_continuous(labels = scales::comma)
# bd_apps %>%
# group_by(category) %>%
# summarise(descargas = sum(installs)) %>%
# arrange(desc(descargas))
ggplotly(g)
rm(g)
The most popular categories are GAME and FAMILY.
g4 <- ggplot(df_final, aes(x=type, y=installs, fill=type)) +
geom_boxplot() +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(labels = scales::comma) +
theme_classic() +
ggtitle(label = "Boxplot instalaciones por tipo") +
guides(fill=FALSE)
ggplotly(g4)
rm(g4)
Free apps are installed more in general.
# Convertir a gráfico interactivo
p <- ggplot(df_final, aes(x = reorder(grupo_edades,installs), y = installs)) +
geom_bar(stat = "identity", fill = "#FB8072") +
labs(title = "Cantidad de instalaciones por grupos de edades",
x = "Grupos de Edades", y = "Cantidad de instalaciones") +
theme_classic()
ggplotly(p)
rm(p)
The most downloaded apps are those rated 4+ (suitable for the whole family).
p <- df_final %>%
mutate(rating_imp = round(rating_imp,0)) %>%
ggplot(aes(x = reorder(rating_imp,installs), y = installs)) +
geom_bar(stat = "identity", fill = "#FB8072") +
labs(title = "Cantidad de instalaciones por rating",
x = "Rating", y = "Cantidad de instalaciones") +
theme_classic()
ggplotly(p)
rm(p)
The apps with the most installs are those with a rating of 4 stars.
ggplot(df_final, aes(x = factor(installs), y = size_num_imp)) +
  geom_boxplot(fill = "#FB8072") + theme_classic() +
  scale_y_continuous(labels = scales::comma) +
  xlab("Instalaciones") +
  ylab("Tamaño (KB)") +
  ggtitle("Relación entre tamaño y cantidad de instalaciones")
The apps with the most installs are between 10,000 and 30,000 KB in size.
plot1 <- df_final %>%
group_by(last_updated) %>%
summarise(total_installs = sum(installs)) %>%
ggplot(aes(x = last_updated, y = total_installs)) +
geom_line(color = "#FF689f", size = 1) +
labs(x = "Fecha de actualización", y = "Instalaciones",
title = "Total de instalaciones por fecha de actualización") +
theme_classic() +
theme(plot.title = element_text(size = 14, face = "bold"),
axis.text.x = element_text(angle = 90, vjust = 0.5, size = 10),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 12, face = "bold"),
legend.title = element_blank(),
legend.position = "none")
ggplotly(plot1)
The most popular apps are those with the most recent update (2018). To include this information in the later analyses, we create a variable with the year of the last update:
df_final <- df_final %>%
mutate(ano_act = year(last_updated))
#observamos la distribución por año
df_final %>%
group_by(ano_act) %>%
summarise(N=n()) %>%
gt()
| ano_act | N |
|---|---|
| 2010 | 1 |
| 2011 | 15 |
| 2012 | 26 |
| 2013 | 108 |
| 2014 | 204 |
| 2015 | 454 |
| 2016 | 789 |
| 2017 | 1826 |
| 2018 | 6933 |
A decision tree is a machine learning model that splits the data into progressively smaller subsets based on different features and rules. The aim is to find the variables with the greatest influence on an app's popularity (measured by the number of installs), so as to guide the development of an app that helps spread the right of access to information as effectively as possible. In this way, we seek to identify the most important variables for explaining the variability of the target variable.
We analyze the correlation between the numeric variables:
#Guardamos el CSV
write.csv(df_final, "df_final.csv")
#armamos una matriz de correlación
matriz_df <- df_final %>%
select_if(is.numeric)
matriz_df <- cor(matriz_df)
matriz_df
## reviews installs type_bin size_num_imp rating_imp
## reviews 1.00000000 0.93540955 -0.16366458 0.30863061 0.18104029
## installs 0.93540955 1.00000000 -0.23489469 0.28906019 0.12983148
## type_bin -0.16366458 -0.23489469 1.00000000 -0.02798766 0.02957440
## size_num_imp 0.30863061 0.28906019 -0.02798766 1.00000000 0.06648153
## rating_imp 0.18104029 0.12983148 0.02957440 0.06648153 1.00000000
## android_ver_imp 0.05965945 0.06238828 -0.10790581 0.17620397 0.05616275
## ano_act 0.22045246 0.22614106 -0.17713161 0.22549474 0.12548035
## android_ver_imp ano_act
## reviews 0.05965945 0.2204525
## installs 0.06238828 0.2261411
## type_bin -0.10790581 -0.1771316
## size_num_imp 0.17620397 0.2254947
## rating_imp 0.05616275 0.1254804
## android_ver_imp 1.00000000 0.4859452
## ano_act 0.48594524 1.0000000
c1 <- matriz_df %>%
ggcorrplot::ggcorrplot(type = "lower", lab=TRUE, hc.order = TRUE, title = "Matriz de correlación R", colors = c("#6D9EC1", "white", "purple")) +
theme(text = element_text(size = 10),
axis.text.x = element_text(angle=90, hjust=1, size = 7),
axis.text.y = element_text(size = 7))
ggplotly(c1)
# Remove objects to free memory
rm(c1)
rm(matriz_df)
The variable with the strongest positive correlation with installs is reviews (r ≈ 0.94).
We create a binary variable called popular from installs, so that we can later compare which of the two target variables the model predicts better.
# Remove objects to free memory
rm(plot1)
rm(count_outliers)
rm(fa_apps)
### Distribution of the installs variable
df_final %>%
group_by(installs) %>%
summarise(N=n()) %>%
gt() %>%
tab_header(title = "Apps por cantidad de instalaciones")
Apps por cantidad de instalaciones

| installs | N |
|---|---|
| 0.0 | 14 |
| 1.0 | 67 |
| 5.0 | 82 |
| 10.0 | 385 |
| 50.0 | 204 |
| 100.0 | 710 |
| 500.0 | 328 |
| 1000.0 | 890 |
| 5000.0 | 469 |
| 10000.0 | 1033 |
| 50000.0 | 474 |
| 100000.0 | 1129 |
| 396371.7 | 517 |
| 396371.7 | 4054 |
# Convert the categorical variables to factors and create the binary target popular
df_accesin <- df_final %>%
mutate(category = as.factor(category),
grupo_edades = as.factor(grupo_edades),
ano_act = as.factor(ano_act),
popular = ifelse(installs >= 100000, 1,0)) %>%
select(-last_updated, -type, -genres)
# The numeric predictors are left unscaled: tree splits depend only on the ordering of the values,
# and the split thresholds reported in the model output are in the original units
#df_accesin %>% group_by(android_ver_imp) %>% summarise(N=n())
# Use the app name as the row name (prefixed with the row index so that it is unique)
df_accesin$app <- paste0(seq_len(nrow(df_accesin)), "_", df_accesin$app)
rownames(df_accesin) <- df_accesin$app
df_accesin_cat <- df_accesin %>%
select_if(is.factor)
#Creating Dummy Variables
dummy<- data.frame(sapply(df_accesin_cat,function(x) data.frame(model.matrix(~x-1,data =df_accesin_cat))[,-1]))
dummy %>%
head(10) %>%
gt()
| category.xAUTO_AND_VEHICLES | category.xBEAUTY | category.xBOOKS_AND_REFERENCE | category.xBUSINESS | category.xCOMICS | category.xCOMMUNICATION | category.xDATING | category.xEDUCATION | category.xENTERTAINMENT | category.xEVENTS | category.xFAMILY | category.xFINANCE | category.xFOOD_AND_DRINK | category.xGAME | category.xHEALTH_AND_FITNESS | category.xHOUSE_AND_HOME | category.xLIBRARIES_AND_DEMO | category.xLIFESTYLE | category.xMAPS_AND_NAVIGATION | category.xMEDICAL | category.xNEWS_AND_MAGAZINES | category.xPARENTING | category.xPERSONALIZATION | category.xPHOTOGRAPHY | category.xPRODUCTIVITY | category.xSHOPPING | category.xSOCIAL | category.xSPORTS | category.xTOOLS | category.xTRAVEL_AND_LOCAL | category.xVIDEO_PLAYERS | category.xWEATHER | grupo_edades.x17. | grupo_edades.x4. | grupo_edades.x9. | ano_act.x2011 | ano_act.x2012 | ano_act.x2013 | ano_act.x2014 | ano_act.x2015 | ano_act.x2016 | ano_act.x2017 | ano_act.x2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
df_accesin_int <- df_accesin %>%
select_if(is.numeric)
df_analisis <- cbind(df_accesin_int,dummy)
## Create one dataset with the popular target and another with the installs target
df_popular <- df_analisis %>%
select(-installs)
accesin_final <- df_analisis %>%
select(-popular)
We inspect the correlation matrix including the binary (dummy) variables:
matriz_df <- accesin_final %>%
select_if(is.numeric)
matriz_df <- cor(matriz_df)
c1 <- matriz_df %>%
ggcorrplot::ggcorrplot(type = "full", lab=FALSE, hc.order = FALSE, colors = c("#6D9EC1", "white", "purple"), ggtheme = ggplot2::theme_classic) +
ggtitle("Matriz de correlación R") +
theme(text = element_text(size = 5),
axis.text.x = element_text(angle=90, hjust=1, size = 5),
axis.text.y = element_text(size = 5))
ggplotly(c1)
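The variable pairs with the highest absolute correlation are listed next (each pair appears twice because the correlation matrix is symmetric):
# Select the variable pairs with an absolute correlation above 0.3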
cols_sel <- matriz_df %>%
abs() %>%
as.data.frame() %>%
rownames_to_column(var = "var1") %>%
pivot_longer(cols = -var1, names_to = "var2", values_to = "cor") %>%
filter(var1 != var2 & cor > 0.3) %>%
arrange(desc(cor))
cols_sel %>%
gt() %>%
tab_header(title = "Variables con mayor correlación")
Variables con mayor correlación

| var1 | var2 | cor |
|---|---|---|
| reviews | installs | 0.9354096 |
| installs | reviews | 0.9354096 |
| ano_act.x2017 | ano_act.x2018 | 0.6584655 |
| ano_act.x2018 | ano_act.x2017 | 0.6584655 |
| category.xDATING | grupo_edades.x17. | 0.5543406 |
| grupo_edades.x17. | category.xDATING | 0.5543406 |
| grupo_edades.x17. | grupo_edades.x4. | 0.4391950 |
| grupo_edades.x4. | grupo_edades.x17. | 0.4391950 |
| ano_act.x2016 | ano_act.x2018 | 0.4087029 |
| ano_act.x2018 | ano_act.x2016 | 0.4087029 |
| grupo_edades.x4. | grupo_edades.x9. | 0.4010739 |
| grupo_edades.x9. | grupo_edades.x4. | 0.4010739 |
| android_ver_imp | ano_act.x2018 | 0.3965828 |
| ano_act.x2018 | android_ver_imp | 0.3965828 |
| reviews | size_num_imp | 0.3086306 |
| size_num_imp | reviews | 0.3086306 |
| ano_act.x2015 | ano_act.x2018 | 0.3047359 |
| ano_act.x2018 | ano_act.x2015 | 0.3047359 |
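Because a correlation matrix is symmetric, a long-format listing built from the full matrix repeats every pair. A minimal base-R sketch of one way to list each pair only once is to keep just the upper triangle. The toy data frame and the object names below are hypothetical stand-ins, not the report's data.

```r
# Hypothetical toy data standing in for the numeric columns
set.seed(1)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.3)  # built to be strongly correlated with a
toy$c <- rnorm(50)                    # generated independently of a and b

cor_mat <- cor(toy)

# Positions of the upper triangle: each unordered pair appears exactly once
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)

pairs_once <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  cor  = cor_mat[idx]
)

# Sort by absolute correlation, strongest first
pairs_once <- pairs_once[order(-abs(pairs_once$cor)), ]
pairs_once
```

With p variables this yields p(p-1)/2 rows instead of p(p-1), which makes a threshold filter such as the 0.3 cut-off easier to read.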
# Create the train and test datasets
# Set a seed for reproducibility
set.seed(583)
n <- nrow(accesin_final)
train_idx <- sample(1:n, n*0.7, replace = FALSE) # 70% para entrenamiento
train <- accesin_final[train_idx, ]
test <- accesin_final[-train_idx, ]
n_train = nrow(train)
# Inspect the distribution of installs in the training set
train %>%
group_by(installs) %>%
summarise(Prop = round(n()/n_train,3)) %>%
gt() %>%
tab_options(page.width = "100") %>%
tab_header(title = "Proporción de Instalaciones")
Proporción de Instalaciones

| installs | Prop |
|---|---|
| 0.0 | 0.002 |
| 1.0 | 0.007 |
| 5.0 | 0.009 |
| 10.0 | 0.037 |
| 50.0 | 0.020 |
| 100.0 | 0.067 |
| 500.0 | 0.032 |
| 1000.0 | 0.088 |
| 5000.0 | 0.045 |
| 10000.0 | 0.101 |
| 50000.0 | 0.046 |
| 100000.0 | 0.108 |
| 396371.7 | 0.050 |
| 396371.7 | 0.388 |
# Fit the model (method = 'class' treats each distinct installs value as a class)
model <- rpart(installs ~., data = train, method = 'class')
# Plot the tree
rpart.plot(model, main = "Árbol de clasificación", extra = 101, under = TRUE, branch.lty = 1, shadow.col = "gray")
# Summary of the fitted model
summary(model)
## Call:
## rpart(formula = installs ~ ., data = train, method = "class")
## n= 7249
##
## CP nsplit rel error xerror xstd
## 1 0.13868434 0 1.0000000 1.0000000 0.010368652
## 2 0.10702013 1 0.8613157 0.8613157 0.010444013
## 3 0.09008346 2 0.7542955 0.7557683 0.010330297
## 4 0.06504664 3 0.6642121 0.6710849 0.010129035
## 5 0.01251841 4 0.5991654 0.6072656 0.009908923
## 6 0.01202749 5 0.5866470 0.6033382 0.009893363
## 7 0.01000000 6 0.5746195 0.5915562 0.009845239
##
## Variable importance
## reviews size_num_imp ano_act.x2018 category.xGAME
## 68 10 6 5
## grupo_edades.x4. rating_imp type_bin
## 5 4 1
##
## Node number 1: 7249 observations, complexity param=0.1386843
## predicted class=396371.74 expected loss=0.5620086 P(node) =1
## class counts: 12 54 65 265 145 489 230 637 323 735 333 786 3175
## probabilities: 0.002 0.007 0.009 0.037 0.020 0.067 0.032 0.088 0.045 0.101 0.046 0.108 0.438
## left son=2 (4013 obs) right son=3 (3236 obs)
## Primary splits:
## reviews < 3430.5 to the left, improve=1594.34500, (0 missing)
## ano_act.x2018 < 0.5 to the left, improve= 153.10500, (0 missing)
## size_num_imp < 33500 to the left, improve= 117.21300, (0 missing)
## type_bin < 0.5 to the right, improve= 99.12031, (0 missing)
## rating_imp < 4.75 to the right, improve= 89.13282, (0 missing)
## Surrogate splits:
## size_num_imp < 33500 to the left, agree=0.636, adj=0.184, (0 split)
## ano_act.x2018 < 0.5 to the left, agree=0.605, adj=0.115, (0 split)
## category.xGAME < 0.5 to the left, agree=0.603, adj=0.110, (0 split)
## grupo_edades.x4. < 0.5 to the right, agree=0.598, adj=0.099, (0 split)
## rating_imp < 4.15 to the left, agree=0.577, adj=0.053, (0 split)
##
## Node number 2: 4013 observations, complexity param=0.1070201
## predicted class=10000 expected loss=0.817842 P(node) =0.5535936
## class counts: 12 54 65 265 145 489 230 637 323 731 320 576 166
## probabilities: 0.003 0.013 0.016 0.066 0.036 0.122 0.057 0.159 0.080 0.182 0.080 0.144 0.041
## left son=4 (2096 obs) right son=5 (1917 obs)
## Primary splits:
## reviews < 57.5 to the left, improve=331.531700, (0 missing)
## rating_imp < 4.95 to the right, improve= 35.459040, (0 missing)
## type_bin < 0.5 to the right, improve= 14.158570, (0 missing)
## category.xBUSINESS < 0.5 to the right, improve= 7.719463, (0 missing)
## ano_act.x2018 < 0.5 to the left, improve= 6.210602, (0 missing)
## Surrogate splits:
## rating_imp < 4.55 to the right, agree=0.555, adj=0.069, (0 split)
## size_num_imp < 34500 to the left, agree=0.551, adj=0.061, (0 split)
## android_ver_imp < 2.65 to the right, agree=0.534, adj=0.024, (0 split)
## category.xGAME < 0.5 to the left, agree=0.534, adj=0.023, (0 split)
## grupo_edades.x4. < 0.5 to the right, agree=0.532, adj=0.020, (0 split)
##
## Node number 3: 3236 observations
## predicted class=396371.74 expected loss=0.07014833 P(node) =0.4464064
## class counts: 0 0 0 0 0 0 0 0 0 4 13 210 3009
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.004 0.065 0.930
##
## Node number 4: 2096 observations, complexity param=0.06504664
## predicted class=1000 expected loss=0.740458 P(node) =0.2891433
## class counts: 12 54 65 265 145 489 223 544 187 108 3 1 0
## probabilities: 0.006 0.026 0.031 0.126 0.069 0.233 0.106 0.260 0.089 0.052 0.001 0.000 0.000
## left son=8 (987 obs) right son=9 (1109 obs)
## Primary splits:
## reviews < 4.5 to the left, improve=145.843000, (0 missing)
## rating_imp < 4.95 to the right, improve= 12.720960, (0 missing)
## type_bin < 0.5 to the right, improve= 5.845681, (0 missing)
## category.xFAMILY < 0.5 to the left, improve= 5.206641, (0 missing)
## ano_act.x2018 < 0.5 to the right, improve= 4.586008, (0 missing)
## Surrogate splits:
## category.xBUSINESS < 0.5 to the right, agree=0.556, adj=0.057, (0 split)
## rating_imp < 4.95 to the right, agree=0.552, adj=0.048, (0 split)
## ano_act.x2018 < 0.5 to the right, agree=0.549, adj=0.043, (0 split)
## category.xMEDICAL < 0.5 to the right, agree=0.542, adj=0.027, (0 split)
## size_num_imp < 7450 to the right, agree=0.539, adj=0.021, (0 split)
##
## Node number 5: 1917 observations, complexity param=0.09008346
## predicted class=10000 expected loss=0.675013 P(node) =0.2644503
## class counts: 0 0 0 0 0 0 7 93 136 623 317 575 166
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.004 0.049 0.071 0.325 0.165 0.300 0.087
## left son=10 (919 obs) right son=11 (998 obs)
## Primary splits:
## reviews < 417.5 to the left, improve=183.621600, (0 missing)
## type_bin < 0.5 to the right, improve= 32.657830, (0 missing)
## rating_imp < 4.45 to the right, improve= 10.123790, (0 missing)
## category.xFINANCE < 0.5 to the right, improve= 6.843752, (0 missing)
## size_num_imp < 2150 to the left, improve= 4.607771, (0 missing)
## Surrogate splits:
## size_num_imp < 4350 to the left, agree=0.560, adj=0.083, (0 split)
## rating_imp < 3.25 to the left, agree=0.543, adj=0.047, (0 split)
## ano_act.x2017 < 0.5 to the right, agree=0.543, adj=0.046, (0 split)
## ano_act.x2018 < 0.5 to the left, agree=0.541, adj=0.044, (0 split)
## ano_act.x2016 < 0.5 to the right, agree=0.533, adj=0.025, (0 split)
##
## Node number 8: 987 observations, complexity param=0.01251841
## predicted class=100 expected loss=0.6646403 P(node) =0.1361567
## class counts: 12 54 63 249 118 331 87 66 3 1 2 1 0
## probabilities: 0.012 0.055 0.064 0.252 0.120 0.335 0.088 0.067 0.003 0.001 0.002 0.001 0.000
## left son=16 (631 obs) right son=17 (356 obs)
## Primary splits:
## reviews < 1.5 to the left, improve=32.162300, (0 missing)
## type_bin < 0.5 to the right, improve= 7.373249, (0 missing)
## android_ver_imp < 2.05 to the left, improve= 3.508581, (0 missing)
## category.xBOOKS_AND_REFERENCE < 0.5 to the right, improve= 3.208907, (0 missing)
## ano_act.x2018 < 0.5 to the right, improve= 2.428568, (0 missing)
## Surrogate splits:
## rating_imp < 4.95 to the left, agree=0.668, adj=0.079, (0 split)
## ano_act.x2016 < 0.5 to the left, agree=0.644, adj=0.014, (0 split)
## size_num_imp < 113 to the right, agree=0.642, adj=0.008, (0 split)
## category.xVIDEO_PLAYERS < 0.5 to the left, agree=0.642, adj=0.008, (0 split)
##
## Node number 9: 1109 observations
## predicted class=1000 expected loss=0.5689811 P(node) =0.1529866
## class counts: 0 0 2 16 27 158 136 478 184 107 1 0 0
## probabilities: 0.000 0.000 0.002 0.014 0.024 0.142 0.123 0.431 0.166 0.096 0.001 0.000 0.000
##
## Node number 10: 919 observations
## predicted class=10000 expected loss=0.4798694 P(node) =0.1267761
## class counts: 0 0 0 0 0 0 7 93 121 478 148 63 9
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.008 0.101 0.132 0.520 0.161 0.069 0.010
##
## Node number 11: 998 observations, complexity param=0.01202749
## predicted class=100000 expected loss=0.4869739 P(node) =0.1376742
## class counts: 0 0 0 0 0 0 0 0 15 145 169 512 157
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.015 0.145 0.169 0.513 0.157
## left son=22 (88 obs) right son=23 (910 obs)
## Primary splits:
## type_bin < 0.5 to the right, improve=44.460350, (0 missing)
## reviews < 1548 to the left, improve=25.545110, (0 missing)
## rating_imp < 4.55 to the right, improve=11.518250, (0 missing)
## category.xMEDICAL < 0.5 to the right, improve= 2.541844, (0 missing)
## size_num_imp < 217.5 to the right, improve= 2.325753, (0 missing)
##
## Node number 16: 631 observations
## predicted class=10 expected loss=0.6656101 P(node) =0.08704649
## class counts: 12 53 56 211 86 160 33 15 1 1 2 1 0
## probabilities: 0.019 0.084 0.089 0.334 0.136 0.254 0.052 0.024 0.002 0.002 0.003 0.002 0.000
##
## Node number 17: 356 observations
## predicted class=100 expected loss=0.5196629 P(node) =0.04911022
## class counts: 0 1 7 38 32 171 54 51 2 0 0 0 0
## probabilities: 0.000 0.003 0.020 0.107 0.090 0.480 0.152 0.143 0.006 0.000 0.000 0.000 0.000
##
## Node number 22: 88 observations
## predicted class=10000 expected loss=0.375 P(node) =0.01213961
## class counts: 0 0 0 0 0 0 0 0 8 55 19 6 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.091 0.625 0.216 0.068 0.000
##
## Node number 23: 910 observations
## predicted class=100000 expected loss=0.443956 P(node) =0.1255346
## class counts: 0 0 0 0 0 0 0 0 7 90 150 506 157
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.008 0.099 0.165 0.556 0.173
The first model shows that the most important variables are, in decreasing order (relative importance in parentheses): reviews (68), size_num_imp (10), ano_act.x2018 (6), category.xGAME (5), grupo_edades.x4. (5), rating_imp (4) and type_bin (1).
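The percentages printed under "Variable importance" by summary() are the raw importance scores stored in the fitted object, rescaled to sum to 100. A minimal sketch of retrieving them programmatically follows; it uses the built-in iris data as a hypothetical stand-in for the report's training set, and fit_toy is an illustrative name.

```r
library(rpart)

# Hypothetical stand-in model fitted on a built-in dataset
fit_toy <- rpart(Species ~ ., data = iris, method = "class")

# Raw importance scores stored in the fitted object
imp_raw <- fit_toy$variable.importance

# Rescale to percentages, as in the summary() printout
imp_pct <- round(100 * imp_raw / sum(imp_raw))
imp_pct
```

Applied to the report's model object, the same two lines would give the ranking as a named vector that can be tabulated or plotted.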
# Apply the model to the test data
predict_test <- predict(model, test, type = "class")
# Build a contingency table (rows = observed, columns = predicted) to evaluate accuracy
table_mat <- table(test$installs, predict_test)
table_mat
## predict_test
## 0 1 5 10 50 100 500 1000 5000 10000 50000 100000
## 0 0 0 0 2 0 0 0 0 0 0 0 0
## 1 0 0 0 9 0 4 0 0 0 0 0 0
## 5 0 0 0 14 0 3 0 0 0 0 0 0
## 10 0 0 0 93 0 19 0 8 0 0 0 0
## 50 0 0 0 33 0 16 0 10 0 0 0 0
## 100 0 0 0 63 0 68 0 90 0 0 0 0
## 500 0 0 0 9 0 28 0 55 0 6 0 0
## 1000 0 0 0 4 0 23 0 184 0 41 0 1
## 5000 0 0 0 1 0 2 0 83 0 59 0 1
## 10000 0 0 0 2 0 0 0 50 0 212 0 29
## 50000 0 0 0 1 0 1 0 0 0 61 0 66
## 100000 0 0 0 1 0 0 0 1 0 39 0 206
## 396371.74 0 0 0 1 0 0 0 0 0 4 0 61
## predict_test
## 396371.74
## 0 0
## 1 0
## 5 0
## 10 0
## 50 0
## 100 0
## 500 0
## 1000 0
## 5000 0
## 10000 5
## 50000 12
## 100000 96
## 396371.74 1330
# Compute the model's accuracy on the train and test sets
accuracy_train <- function(fit) {
predict_unseen_train <- predict(fit, train, type = 'class')
table_mat <- table(train$installs, predict_unseen_train)
accuracy_train <- sum(diag(table_mat)) / sum(table_mat)
accuracy_train
}
accuracy_test <- function(fit) {
predict_unseen <- predict(fit, test, type = 'class')
table_mat <- table(test$installs, predict_unseen)
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
accuracy_Test
}
print(paste('Accuracy para train', accuracy_train(model)))
## [1] "Accuracy para train 0.677058904676507"
print(paste('Accuracy para test', accuracy_test(model)))
## [1] "Accuracy para test 0.673640167364017"
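An accuracy of 0.67 across 13 classes can be considered relatively good: it is well above the no-information rate of 0.449 (the share of the largest class) reported with the confusion matrix, and the nearly identical train (0.677) and test (0.674) values show no sign of overfitting.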
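To judge whether an accuracy figure is good, it helps to compare it with what a trivial classifier would achieve. The sketch below computes accuracy, the no-information rate and Cohen's kappa by hand from a confusion matrix. The 3-class matrix and the names cm and cohen_kappa are hypothetical and only illustrate the arithmetic.

```r
# Hypothetical 3-class confusion matrix (rows = observed, columns = predicted)
cm <- matrix(c(50,  5, 0,
               10, 20, 5,
                0,  5, 5),
             nrow = 3, byrow = TRUE,
             dimnames = list(observed  = c("low", "mid", "high"),
                             predicted = c("low", "mid", "high")))

n <- sum(cm)

# Share of observations on the diagonal
accuracy <- sum(diag(cm)) / n

# No-information rate: accuracy of always predicting the largest observed class
nir <- max(rowSums(cm)) / n

# Agreement expected by chance, from the row and column margins
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled to a maximum of 1
cohen_kappa <- (accuracy - expected) / (1 - expected)
```

An accuracy close to the no-information rate, or a kappa close to 0, would indicate that the model adds little beyond always predicting the majority class.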
# Get the model's predictions on the test set
pred_test <- predict(model, newdata = test, type = "class")
# Confusion matrix (rows = predicted, columns = observed)
confusionMatrix(table(pred_test, test$installs))
## Confusion Matrix and Statistics
##
##
## pred_test 0 1 5 10 50 100 500 1000 5000 10000 50000 100000
## 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0
## 10 2 9 14 93 33 63 9 4 1 2 1 1
## 50 0 0 0 0 0 0 0 0 0 0 0 0
## 100 0 4 3 19 16 68 28 23 2 0 1 0
## 500 0 0 0 0 0 0 0 0 0 0 0 0
## 1000 0 0 0 8 10 90 55 184 83 50 0 1
## 5000 0 0 0 0 0 0 0 0 0 0 0 0
## 10000 0 0 0 0 0 0 6 41 59 212 61 39
## 50000 0 0 0 0 0 0 0 0 0 0 0 0
## 100000 0 0 0 0 0 0 0 1 1 29 66 206
## 396371.74 0 0 0 0 0 0 0 0 0 5 12 96
##
## pred_test 396371.74
## 0 0
## 1 0
## 5 0
## 10 1
## 50 0
## 100 0
## 500 0
## 1000 0
## 5000 0
## 10000 4
## 50000 0
## 100000 61
## 396371.74 1330
##
## Overall Statistics
##
## Accuracy : 0.6736
## 95% CI : (0.6568, 0.6901)
## No Information Rate : 0.4493
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5626
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 5 Class: 10 Class: 50 Class: 100
## Sensitivity 0.0000000 0.000000 0.000000 0.77500 0.00000 0.30769
## Specificity 1.0000000 1.000000 1.000000 0.95313 1.00000 0.96674
## Pos Pred Value NaN NaN NaN 0.39914 NaN 0.41463
## Neg Pred Value 0.9993563 0.995816 0.994528 0.99061 0.98101 0.94801
## Prevalence 0.0006437 0.004184 0.005472 0.03862 0.01899 0.07113
## Detection Rate 0.0000000 0.000000 0.000000 0.02993 0.00000 0.02189
## Detection Prevalence 0.0000000 0.000000 0.000000 0.07499 0.00000 0.05278
## Balanced Accuracy 0.5000000 0.500000 0.500000 0.86407 0.50000 0.63721
## Class: 500 Class: 1000 Class: 5000 Class: 10000
## Sensitivity 0.00000 0.72727 0.00000 0.71141
## Specificity 1.00000 0.89594 1.00000 0.92524
## Pos Pred Value NaN 0.38254 NaN 0.50237
## Neg Pred Value 0.96846 0.97372 0.95301 0.96797
## Prevalence 0.03154 0.08143 0.04699 0.09591
## Detection Rate 0.00000 0.05922 0.00000 0.06823
## Detection Prevalence 0.00000 0.15481 0.00000 0.13582
## Balanced Accuracy 0.50000 0.81160 0.50000 0.81832
## Class: 50000 Class: 100000 Class: 396371.74
## Sensitivity 0.00000 0.6006 0.9527
## Specificity 1.00000 0.9428 0.9340
## Pos Pred Value NaN 0.5659 0.9217
## Neg Pred Value 0.95462 0.9501 0.9603
## Prevalence 0.04538 0.1104 0.4493
## Detection Rate 0.00000 0.0663 0.4281
## Detection Prevalence 0.00000 0.1172 0.4644
## Balanced Accuracy 0.50000 0.7717 0.9433
# Contingency table (rows = observed, columns = predicted)
table_test <- table(test$installs, pred_test)
# Precision per installs class (NaN for classes that are never predicted)
precision <- diag(table_test) / colSums(table_test)
# Recall per installs class
recall <- diag(table_test) / rowSums(table_test)
# Plot precision and recall as bar charts
barplot(precision, ylim = c(0, 1), main = "Precisión por instalaciones", xlab = "cantidad", ylab = "Precisión")
barplot(recall, ylim = c(0, 1), main = "Recall por instalaciones", xlab = "cantidad", ylab = "Recall")
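The model correctly classifies 67.36% of the test observations. The confusion matrix shows that the model never predicts seven of the thirteen classes: the three smallest (0, 1 and 5) and the intermediate ones (50, 500, 5000 and 50000), all of which have a sensitivity of 0. Their observations are mostly absorbed by the neighbouring classes 10, 100, 1000, 10000 and 100000. Only the top class (396371.74) is identified reliably, with a sensitivity of 0.95.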
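Classes that are never predicted can be found directly from the contingency table: their column sums are zero, which is also why their precision is NaN. Here is a minimal sketch with hypothetical toy labels, not the report's predictions.

```r
# Hypothetical observed and predicted labels sharing the same factor levels
lv <- c("10", "50", "100", "500")
observed  <- factor(c("10", "10", "50", "100", "100", "500"), levels = lv)
predicted <- factor(c("10", "10", "10", "100", "100", "100"), levels = lv)

# Rows = observed, columns = predicted
tab <- table(observed, predicted)

# Classes with an empty column are never predicted by the model
never_predicted <- colnames(tab)[colSums(tab) == 0]

# Recall is 0 for those classes; precision is 0/0 = NaN
recall    <- diag(tab) / rowSums(tab)
precision <- diag(tab) / colSums(tab)
```

Keeping the same factor levels on both vectors is what guarantees a square table, so that diag() lines up observed and predicted classes.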
We now tune this model with a grid search that tries different hyperparameter values under 10-fold cross-validation and returns the value that produces the best model. In this case the tuned hyperparameter is cp, which controls the complexity of the tree:
# Define the range of values for the cp parameter
cp_values <- seq(0.001, 0.1, by = 0.001)
# Create the tuning grid
tune_grid <- data.frame(cp = cp_values)
# Fit the model with cross-validation and the tuning grid
# (installs is numeric here, so train() treats this as a regression problem)
fit <- train(
installs ~ .,
data = train,
method = "rpart",
tuneGrid = tune_grid,
trControl = trainControl(method = "cv", number = 10, verboseIter = TRUE)
)
## + Fold01: cp=0.001
## - Fold01: cp=0.001
## + Fold02: cp=0.001
## - Fold02: cp=0.001
## + Fold03: cp=0.001
## - Fold03: cp=0.001
## + Fold04: cp=0.001
## - Fold04: cp=0.001
## + Fold05: cp=0.001
## - Fold05: cp=0.001
## + Fold06: cp=0.001
## - Fold06: cp=0.001
## + Fold07: cp=0.001
## - Fold07: cp=0.001
## + Fold08: cp=0.001
## - Fold08: cp=0.001
## + Fold09: cp=0.001
## - Fold09: cp=0.001
## + Fold10: cp=0.001
## - Fold10: cp=0.001
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.001 on full training set
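As a complement to the grid search, rpart stores its own cross-validated error for each candidate cp in the cptable of a fitted tree (the xerror column of the summary output). The sketch below shows one common way to use it: grow a large tree, pick the cp with the lowest xerror, and prune. This is a different technique from the caret grid above, shown on the kyphosis data shipped with rpart as a hypothetical stand-in; fit_big, best_cp and fit_pruned are illustrative names.

```r
library(rpart)

# Cross-validation folds are random, so fix a seed
set.seed(583)

# Grow a deliberately large tree with a very small cp
fit_big <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 method = "class",
                 control = rpart.control(cp = 0.001, minsplit = 5))

# One row per candidate cp, with its cross-validated error (xerror)
cp_tab <- fit_big$cptable

# cp value with the lowest cross-validated error
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# Prune the large tree back to the subtree associated with that cp
fit_pruned <- prune(fit_big, cp = best_cp)
```

A smaller cp allows more splits and a lower training error, while xerror indicates whether the extra splits actually help on held-out folds.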
model_2 <- rpart(installs ~ ., data = train, method = "class", control = rpart.control(cp = fit$bestTune$cp))
# Plot the decision tree
rpart.plot(model_2, extra = 1,main = "Arbol" )
summary(model_2)
## Call:
## rpart(formula = installs ~ ., data = train, method = "class",
## control = rpart.control(cp = fit$bestTune$cp))
## n= 7249
##
## CP nsplit rel error xerror xstd
## 1 0.138684340 0 1.0000000 1.0000000 0.010368652
## 2 0.107020128 1 0.8613157 0.8807069 0.010448790
## 3 0.090083456 2 0.7542955 0.7447226 0.010309711
## 4 0.065046637 3 0.6642121 0.6747668 0.010139897
## 5 0.012518409 4 0.5991654 0.6107020 0.009922342
## 6 0.012027491 5 0.5866470 0.6104566 0.009921390
## 7 0.007363770 6 0.5746195 0.5886107 0.009832867
## 8 0.002209131 7 0.5672558 0.5787923 0.009790630
## 9 0.002127311 12 0.5562101 0.5792833 0.009792779
## 10 0.001840943 15 0.5498282 0.5787923 0.009790630
## 11 0.001718213 18 0.5441826 0.5780560 0.009787401
## 12 0.001636393 20 0.5407462 0.5770741 0.009783081
## 13 0.001472754 24 0.5341188 0.5760923 0.009778745
## 14 0.001227295 27 0.5297005 0.5714286 0.009757939
## 15 0.001145475 31 0.5247914 0.5770741 0.009783081
## 16 0.001104566 34 0.5213549 0.5785469 0.009789555
## 17 0.001022746 39 0.5149730 0.5748650 0.009773304
## 18 0.001000000 45 0.5088365 0.5748650 0.009773304
##
## Variable importance
## reviews size_num_imp ano_act.x2018 rating_imp
## 65 9 5 5
## category.xGAME grupo_edades.x4. type_bin
## 5 4 4
##
## Node number 1: 7249 observations, complexity param=0.1386843
## predicted class=396371.74 expected loss=0.5620086 P(node) =1
## class counts: 12 54 65 265 145 489 230 637 323 735 333 786 3175
## probabilities: 0.002 0.007 0.009 0.037 0.020 0.067 0.032 0.088 0.045 0.101 0.046 0.108 0.438
## left son=2 (4013 obs) right son=3 (3236 obs)
## Primary splits:
## reviews < 3430.5 to the left, improve=1594.34500, (0 missing)
## ano_act.x2018 < 0.5 to the left, improve= 153.10500, (0 missing)
## size_num_imp < 33500 to the left, improve= 117.21300, (0 missing)
## type_bin < 0.5 to the right, improve= 99.12031, (0 missing)
## rating_imp < 4.75 to the right, improve= 89.13282, (0 missing)
## Surrogate splits:
## size_num_imp < 33500 to the left, agree=0.636, adj=0.184, (0 split)
## ano_act.x2018 < 0.5 to the left, agree=0.605, adj=0.115, (0 split)
## category.xGAME < 0.5 to the left, agree=0.603, adj=0.110, (0 split)
## grupo_edades.x4. < 0.5 to the right, agree=0.598, adj=0.099, (0 split)
## rating_imp < 4.15 to the left, agree=0.577, adj=0.053, (0 split)
##
## Node number 2: 4013 observations, complexity param=0.1070201
## predicted class=10000 expected loss=0.817842 P(node) =0.5535936
## class counts: 12 54 65 265 145 489 230 637 323 731 320 576 166
## probabilities: 0.003 0.013 0.016 0.066 0.036 0.122 0.057 0.159 0.080 0.182 0.080 0.144 0.041
## left son=4 (2096 obs) right son=5 (1917 obs)
## Primary splits:
## reviews < 57.5 to the left, improve=331.531700, (0 missing)
## rating_imp < 4.95 to the right, improve= 35.459040, (0 missing)
## type_bin < 0.5 to the right, improve= 14.158570, (0 missing)
## category.xBUSINESS < 0.5 to the right, improve= 7.719463, (0 missing)
## ano_act.x2018 < 0.5 to the left, improve= 6.210602, (0 missing)
## Surrogate splits:
## rating_imp < 4.55 to the right, agree=0.555, adj=0.069, (0 split)
## size_num_imp < 34500 to the left, agree=0.551, adj=0.061, (0 split)
## android_ver_imp < 2.65 to the right, agree=0.534, adj=0.024, (0 split)
## category.xGAME < 0.5 to the left, agree=0.534, adj=0.023, (0 split)
## grupo_edades.x4. < 0.5 to the right, agree=0.532, adj=0.020, (0 split)
##
## Node number 3: 3236 observations, complexity param=0.002127311
## predicted class=396371.74 expected loss=0.07014833 P(node) =0.4464064
## class counts: 0 0 0 0 0 0 0 0 0 4 13 210 3009
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.004 0.065 0.930
## left son=6 (296 obs) right son=7 (2940 obs)
## Primary splits:
## reviews < 6550.5 to the left, improve=75.879300, (0 missing)
## type_bin < 0.5 to the right, improve=62.851410, (0 missing)
## rating_imp < 4.55 to the right, improve= 8.094133, (0 missing)
## ano_act.x2018 < 0.5 to the left, improve= 4.363969, (0 missing)
## category.xMEDICAL < 0.5 to the right, improve= 3.610085, (0 missing)
## Surrogate splits:
## size_num_imp < 32 to the left, agree=0.909, adj=0.003, (0 split)
## rating_imp < 2.75 to the left, agree=0.909, adj=0.003, (0 split)
##
## Node number 4: 2096 observations, complexity param=0.06504664
## predicted class=1000 expected loss=0.740458 P(node) =0.2891433
## class counts: 12 54 65 265 145 489 223 544 187 108 3 1 0
## probabilities: 0.006 0.026 0.031 0.126 0.069 0.233 0.106 0.260 0.089 0.052 0.001 0.000 0.000
## left son=8 (987 obs) right son=9 (1109 obs)
## Primary splits:
## reviews < 4.5 to the left, improve=145.843000, (0 missing)
## rating_imp < 4.95 to the right, improve= 12.720960, (0 missing)
## type_bin < 0.5 to the right, improve= 5.845681, (0 missing)
## category.xFAMILY < 0.5 to the left, improve= 5.206641, (0 missing)
## ano_act.x2018 < 0.5 to the right, improve= 4.586008, (0 missing)
## Surrogate splits:
## category.xBUSINESS < 0.5 to the right, agree=0.556, adj=0.057, (0 split)
## rating_imp < 4.95 to the right, agree=0.552, adj=0.048, (0 split)
## ano_act.x2018 < 0.5 to the right, agree=0.549, adj=0.043, (0 split)
## category.xMEDICAL < 0.5 to the right, agree=0.542, adj=0.027, (0 split)
## size_num_imp < 7450 to the right, agree=0.539, adj=0.021, (0 split)
##
## Node number 5: 1917 observations, complexity param=0.09008346
## predicted class=10000 expected loss=0.675013 P(node) =0.2644503
## class counts: 0 0 0 0 0 0 7 93 136 623 317 575 166
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.004 0.049 0.071 0.325 0.165 0.300 0.087
## left son=10 (919 obs) right son=11 (998 obs)
## Primary splits:
## reviews < 417.5 to the left, improve=183.621600, (0 missing)
## type_bin < 0.5 to the right, improve= 32.657830, (0 missing)
## rating_imp < 4.45 to the right, improve= 10.123790, (0 missing)
## category.xFINANCE < 0.5 to the right, improve= 6.843752, (0 missing)
## size_num_imp < 2150 to the left, improve= 4.607771, (0 missing)
## Surrogate splits:
## size_num_imp < 4350 to the left, agree=0.560, adj=0.083, (0 split)
## rating_imp < 3.25 to the left, agree=0.543, adj=0.047, (0 split)
## ano_act.x2017 < 0.5 to the right, agree=0.543, adj=0.046, (0 split)
## ano_act.x2018 < 0.5 to the left, agree=0.541, adj=0.044, (0 split)
## ano_act.x2016 < 0.5 to the right, agree=0.533, adj=0.025, (0 split)
##
## Node number 6: 296 observations, complexity param=0.002127311
## predicted class=396371.74 expected loss=0.4324324 P(node) =0.04083322
## class counts: 0 0 0 0 0 0 0 0 0 4 11 113 168
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.014 0.037 0.382 0.568
## left son=12 (21 obs) right son=13 (275 obs)
## Primary splits:
## type_bin < 0.5 to the right, improve=10.101500, (0 missing)
## rating_imp < 4.15 to the right, improve= 9.397326, (0 missing)
## grupo_edades.x4. < 0.5 to the left, improve= 3.631335, (0 missing)
## android_ver_imp < 2.25 to the left, improve= 2.577301, (0 missing)
## reviews < 5036.5 to the left, improve= 2.311034, (0 missing)
## Surrogate splits:
## android_ver_imp < 2.15 to the left, agree=0.936, adj=0.095, (0 split)
## category.xENTERTAINMENT < 0.5 to the right, agree=0.936, adj=0.095, (0 split)
## ano_act.x2014 < 0.5 to the right, agree=0.936, adj=0.095, (0 split)
## category.xCOMMUNICATION < 0.5 to the right, agree=0.932, adj=0.048, (0 split)
##
## Node number 7: 2940 observations, complexity param=0.002127311
## predicted class=396371.74 expected loss=0.03367347 P(node) =0.4055732
## class counts: 0 0 0 0 0 0 0 0 0 0 2 97 2841
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.033 0.966
## left son=14 (69 obs) right son=15 (2871 obs)
## Primary splits:
## type_bin < 0.5 to the right, improve=44.5395400, (0 missing)
## rating_imp < 4.55 to the right, improve= 5.2936860, (0 missing)
## ano_act.x2015 < 0.5 to the right, improve= 1.0158670, (0 missing)
## category.xFAMILY < 0.5 to the right, improve= 0.8871212, (0 missing)
## ano_act.x2018 < 0.5 to the left, improve= 0.6688937, (0 missing)
##
## Node number 8: 987 observations, complexity param=0.01251841
## predicted class=100 expected loss=0.6646403 P(node) =0.1361567
## class counts: 12 54 63 249 118 331 87 66 3 1 2 1 0
## probabilities: 0.012 0.055 0.064 0.252 0.120 0.335 0.088 0.067 0.003 0.001 0.002 0.001 0.000
## left son=16 (631 obs) right son=17 (356 obs)
## Primary splits:
## reviews < 1.5 to the left, improve=32.162300, (0 missing)
## type_bin < 0.5 to the right, improve= 7.373249, (0 missing)
## android_ver_imp < 2.05 to the left, improve= 3.508581, (0 missing)
## category.xBOOKS_AND_REFERENCE < 0.5 to the right, improve= 3.208907, (0 missing)
## ano_act.x2018 < 0.5 to the right, improve= 2.428568, (0 missing)
## Surrogate splits:
## rating_imp < 4.95 to the left, agree=0.668, adj=0.079, (0 split)
## ano_act.x2016 < 0.5 to the left, agree=0.644, adj=0.014, (0 split)
## size_num_imp < 113 to the right, agree=0.642, adj=0.008, (0 split)
## category.xVIDEO_PLAYERS < 0.5 to the left, agree=0.642, adj=0.008, (0 split)
##
## Node number 9: 1109 observations, complexity param=0.002209131
## predicted class=1000 expected loss=0.5689811 P(node) =0.1529866
## class counts: 0 0 2 16 27 158 136 478 184 107 1 0 0
## probabilities: 0.000 0.000 0.002 0.014 0.024 0.142 0.123 0.431 0.166 0.096 0.001 0.000 0.000
## left son=18 (526 obs) right son=19 (583 obs)
## Primary splits:
## reviews < 16.5 to the left, improve=34.778520, (0 missing)
## rating_imp < 4.75 to the right, improve=10.371930, (0 missing)
## type_bin < 0.5 to the right, improve= 9.615269, (0 missing)
## category.xVIDEO_PLAYERS < 0.5 to the left, improve= 3.373009, (0 missing)
## category.xPARENTING < 0.5 to the left, improve= 3.149121, (0 missing)
## Surrogate splits:
## rating_imp < 4.35 to the right, agree=0.564, adj=0.080, (0 split)
## category.xBUSINESS < 0.5 to the right, agree=0.542, adj=0.034, (0 split)
## category.xCOMMUNICATION < 0.5 to the right, agree=0.537, adj=0.023, (0 split)
## category.xBOOKS_AND_REFERENCE < 0.5 to the right, agree=0.532, adj=0.013, (0 split)
## category.xEVENTS < 0.5 to the right, agree=0.530, adj=0.010, (0 split)
##
## Node number 10: 919 observations, complexity param=0.00736377
## predicted class=10000 expected loss=0.4798694 P(node) =0.1267761
## class counts: 0 0 0 0 0 0 7 93 121 478 148 63 9
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.008 0.101 0.132 0.520 0.161 0.069 0.010
## left son=20 (109 obs) right son=21 (810 obs)
## Primary splits:
## type_bin < 0.5 to the right, improve=37.068620, (0 missing)
## reviews < 204.5 to the left, improve=21.938330, (0 missing)
## rating_imp < 4.75 to the right, improve=11.443230, (0 missing)
## category.xAUTO_AND_VEHICLES < 0.5 to the left, improve= 4.832124, (0 missing)
## category.xFINANCE < 0.5 to the right, improve= 4.621178, (0 missing)
## Surrogate splits:
## size_num_imp < 37 to the left, agree=0.882, adj=0.009, (0 split)
##
## Node number 11: 998 observations, complexity param=0.01202749
## predicted class=100000 expected loss=0.4869739 P(node) =0.1376742
## class counts: 0 0 0 0 0 0 0 0 15 145 169 512 157
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.015 0.145 0.169 0.513 0.157
## left son=22 (88 obs) right son=23 (910 obs)
## Primary splits:
## type_bin < 0.5 to the right, improve=44.460350, (0 missing)
## reviews < 1548 to the left, improve=25.545110, (0 missing)
## rating_imp < 4.55 to the right, improve=11.518250, (0 missing)
## category.xMEDICAL < 0.5 to the right, improve= 2.541844, (0 missing)
## size_num_imp < 217.5 to the right, improve= 2.325753, (0 missing)
##
## Node number 12: 21 observations
## predicted class=100000 expected loss=0.3809524 P(node) =0.002896951
## class counts: 0 0 0 0 0 0 0 0 0 2 6 13 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.095 0.286 0.619 0.000
##
## Node number 13: 275 observations, complexity param=0.001104566
## predicted class=396371.74 expected loss=0.3890909 P(node) =0.03793627
## class counts: 0 0 0 0 0 0 0 0 0 2 5 100 168
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.007 0.018 0.364 0.611
## left son=26 (159 obs) right son=27 (116 obs)
## Primary splits:
## rating_imp < 4.15 to the right, improve=6.909134, (0 missing)
## grupo_edades.x4. < 0.5 to the left, improve=3.367080, (0 missing)
## reviews < 3675.5 to the right, improve=3.287658, (0 missing)
## category.xGAME < 0.5 to the right, improve=1.438048, (0 missing)
## category.xMEDICAL < 0.5 to the right, improve=1.163596, (0 missing)
## Surrogate splits:
## android_ver_imp < 3.15 to the right, agree=0.604, adj=0.060, (0 split)
## category.xLIFESTYLE < 0.5 to the left, agree=0.593, adj=0.034, (0 split)
## category.xTOOLS < 0.5 to the left, agree=0.593, adj=0.034, (0 split)
## size_num_imp < 1450 to the right, agree=0.589, adj=0.026, (0 split)
## category.xNEWS_AND_MAGAZINES < 0.5 to the left, agree=0.589, adj=0.026, (0 split)
##
## Node number 14: 69 observations, complexity param=0.001472754
## predicted class=100000 expected loss=0.4202899 P(node) =0.009518554
## class counts: 0 0 0 0 0 0 0 0 0 0 2 40 27
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.029 0.580 0.391
## left son=28 (51 obs) right son=29 (18 obs)
## Primary splits:
## size_num_imp < 42058.96 to the left, improve=4.091121, (0 missing)
## reviews < 6673.432 to the left, improve=3.866372, (0 missing)
## android_ver_imp < 3.5 to the right, improve=2.254170, (0 missing)
## rating_imp < 4.45 to the right, improve=1.521143, (0 missing)
## ano_act.x2016 < 0.5 to the left, improve=1.492553, (0 missing)
##
## Node number 15: 2871 observations
## predicted class=396371.74 expected loss=0.01985371 P(node) =0.3960546
## class counts: 0 0 0 0 0 0 0 0 0 0 0 57 2814
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.020 0.980
##
## Node number 16: 631 observations, complexity param=0.001840943
## predicted class=10 expected loss=0.6656101 P(node) =0.08704649
## class counts: 12 53 56 211 86 160 33 15 1 1 2 1 0
## probabilities: 0.019 0.084 0.089 0.334 0.136 0.254 0.052 0.024 0.002 0.002 0.003 0.002 0.000
## left son=32 (86 obs) right son=33 (545 obs)
## Primary splits:
## type_bin < 0.5 to the right, improve=6.551269, (0 missing)
## reviews < 0.5 to the left, improve=4.406052, (0 missing)
## size_num_imp < 13500 to the left, improve=2.459145, (0 missing)
## android_ver_imp < 2.05 to the left, improve=2.426151, (0 missing)
## category.xBOOKS_AND_REFERENCE < 0.5 to the right, improve=2.140979, (0 missing)
## Surrogate splits:
## android_ver_imp < 2.05 to the left, agree=0.873, adj=0.070, (0 split)
## ano_act.x2014 < 0.5 to the right, agree=0.868, adj=0.035, (0 split)
##
## Node number 17: 356 observations, complexity param=0.001227295
## predicted class=100 expected loss=0.5196629 P(node) =0.04911022
## class counts: 0 1 7 38 32 171 54 51 2 0 0 0 0
## probabilities: 0.000 0.003 0.020 0.107 0.090 0.480 0.152 0.143 0.006 0.000 0.000 0.000 0.000
## left son=34 (347 obs) right son=35 (9 obs)
## Primary splits:
## rating_imp < 2.5 to the right, improve=3.948846, (0 missing)
## category.xFAMILY < 0.5 to the left, improve=2.638266, (0 missing)
## reviews < 2.5 to the left, improve=2.384410, (0 missing)
## android_ver_imp < 2.05 to the left, improve=2.259204, (0 missing)
## type_bin < 0.5 to the right, improve=2.196937, (0 missing)
##
## Node number 18: 526 observations, complexity param=0.002209131
## predicted class=1000 expected loss=0.5589354 P(node) =0.07256173
## class counts: 0 0 2 16 25 120 101 232 24 6 0 0 0
## probabilities: 0.000 0.000 0.004 0.030 0.048 0.228 0.192 0.441 0.046 0.011 0.000 0.000 0.000
## left son=36 (57 obs) right son=37 (469 obs)
## Primary splits:
## type_bin < 0.5 to the right, improve=8.378135, (0 missing)
## rating_imp < 4.75 to the right, improve=7.540602, (0 missing)
## reviews < 10.5 to the left, improve=6.331664, (0 missing)
## category.xBOOKS_AND_REFERENCE < 0.5 to the left, improve=3.076515, (0 missing)
## category.xPHOTOGRAPHY < 0.5 to the left, improve=2.885551, (0 missing)
## Surrogate splits:
## ano_act.x2014 < 0.5 to the right, agree=0.894, adj=0.018, (0 split)
##
## Node number 19: 583 observations, complexity param=0.002209131
## predicted class=1000 expected loss=0.5780446 P(node) =0.08042489
## class counts: 0 0 0 0 2 38 35 246 160 101 1 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.003 0.065 0.060 0.422 0.274 0.173 0.002 0.000 0.000
## left son=38 (68 obs) right son=39 (515 obs)
## Primary splits:
## type_bin < 0.5 to the right, improve=10.077620, (0 missing)
## reviews < 42.5 to the left, improve= 7.865569, (0 missing)
## rating_imp < 4.55 to the right, improve= 7.076145, (0 missing)
## category.xPARENTING < 0.5 to the left, improve= 2.610204, (0 missing)
## category.xVIDEO_PLAYERS < 0.5 to the left, improve= 2.247603, (0 missing)
## Surrogate splits:
## size_num_imp < 34 to the left, agree=0.887, adj=0.029, (0 split)
##
## Node number 20: 109 observations, complexity param=0.001472754
## predicted class=1000 expected loss=0.5412844 P(node) =0.01503656
## class counts: 0 0 0 0 0 0 5 50 33 20 1 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.046 0.459 0.303 0.183 0.009 0.000 0.000
## left son=40 (55 obs) right son=41 (54 obs)
## Primary splits:
## reviews < 160 to the left, improve=5.181299, (0 missing)
## rating_imp < 3.65 to the right, improve=3.917017, (0 missing)
## category.xMEDICAL < 0.5 to the left, improve=2.706491, (0 missing)
## android_ver_imp < 2.05 to the right, improve=2.575239, (0 missing)
## size_num_imp < 344 to the right, improve=1.920693, (0 missing)
## Surrogate splits:
## size_num_imp < 3650 to the left, agree=0.578, adj=0.148, (0 split)
## rating_imp < 4.55 to the right, agree=0.550, adj=0.093, (0 split)
## grupo_edades.x9. < 0.5 to the left, agree=0.550, adj=0.093, (0 split)
## category.xCOMMUNICATION < 0.5 to the left, agree=0.541, adj=0.074, (0 split)
## category.xGAME < 0.5 to the left, agree=0.541, adj=0.074, (0 split)
##
## Node number 21: 810 observations, complexity param=0.001636393
## predicted class=10000 expected loss=0.4345679 P(node) =0.1117396
## class counts: 0 0 0 0 0 0 2 43 88 458 147 63 9
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.053 0.109 0.565 0.181 0.078 0.011
## left son=42 (523 obs) right son=43 (287 obs)
## Primary splits:
## reviews < 204.5 to the left, improve=25.244190, (0 missing)
## rating_imp < 4.75 to the right, improve= 9.250378, (0 missing)
## category.xAUTO_AND_VEHICLES < 0.5 to the left, improve= 5.101173, (0 missing)
## category.xFINANCE < 0.5 to the right, improve= 4.472597, (0 missing)
## ano_act.x2013 < 0.5 to the right, improve= 1.987642, (0 missing)
## Surrogate splits:
## category.xGAME < 0.5 to the left, agree=0.651, adj=0.014, (0 split)
## ano_act.x2011 < 0.5 to the left, agree=0.651, adj=0.014, (0 split)
## android_ver_imp < 7.05 to the left, agree=0.649, adj=0.010, (0 split)
## rating_imp < 1.65 to the right, agree=0.648, adj=0.007, (0 split)
## category.xEDUCATION < 0.5 to the left, agree=0.648, adj=0.007, (0 split)
##
## Node number 22: 88 observations, complexity param=0.001227295
## predicted class=10000 expected loss=0.375 P(node) =0.01213961
## class counts: 0 0 0 0 0 0 0 0 8 55 19 6 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.091 0.625 0.216 0.068 0.000
## left son=44 (68 obs) right son=45 (20 obs)
## Primary splits:
## reviews < 1815.5 to the left, improve=6.798128, (0 missing)
## android_ver_imp < 3.15 to the left, improve=2.212196, (0 missing)
## size_num_imp < 4600 to the left, improve=1.905814, (0 missing)
## category.xTOOLS < 0.5 to the right, improve=1.761364, (0 missing)
## category.xPERSONALIZATION < 0.5 to the right, improve=1.266814, (0 missing)
##
## Node number 23: 910 observations, complexity param=0.001145475
## predicted class=100000 expected loss=0.443956 P(node) =0.1255346
## class counts: 0 0 0 0 0 0 0 0 7 90 150 506 157
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.008 0.099 0.165 0.556 0.173
## left son=46 (540 obs) right son=47 (370 obs)
## Primary splits:
## reviews < 1550.5 to the left, improve=25.192840, (0 missing)
## rating_imp < 4.55 to the right, improve= 9.129044, (0 missing)
## category.xCOMICS < 0.5 to the right, improve= 2.601593, (0 missing)
## category.xBOOKS_AND_REFERENCE < 0.5 to the right, improve= 2.319170, (0 missing)
## category.xMEDICAL < 0.5 to the right, improve= 2.133059, (0 missing)
## Surrogate splits:
## category.xSOCIAL < 0.5 to the left, agree=0.598, adj=0.011, (0 split)
## category.xENTERTAINMENT < 0.5 to the left, agree=0.597, adj=0.008, (0 split)
## category.xHEALTH_AND_FITNESS < 0.5 to the left, agree=0.596, adj=0.005, (0 split)
## category.xTRAVEL_AND_LOCAL < 0.5 to the left, agree=0.596, adj=0.005, (0 split)
## category.xWEATHER < 0.5 to the left, agree=0.596, adj=0.005, (0 split)
##
## Node number 26: 159 observations, complexity param=0.001104566
## predicted class=396371.74 expected loss=0.490566 P(node) =0.02193406
## class counts: 0 0 0 0 0 0 0 0 0 2 4 72 81
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.013 0.025 0.453 0.509
## left son=52 (89 obs) right son=53 (70 obs)
## Primary splits:
## reviews < 5036.5 to the left, improve=2.541121, (0 missing)
## rating_imp < 4.55 to the right, improve=2.268519, (0 missing)
## grupo_edades.x4. < 0.5 to the left, improve=1.890206, (0 missing)
## android_ver_imp < 4.05 to the left, improve=1.504865, (0 missing)
## category.xMEDICAL < 0.5 to the right, improve=1.295178, (0 missing)
## Surrogate splits:
## size_num_imp < 36000 to the left, agree=0.635, adj=0.171, (0 split)
## rating_imp < 4.65 to the left, agree=0.610, adj=0.114, (0 split)
## category.xFAMILY < 0.5 to the left, agree=0.610, adj=0.114, (0 split)
## category.xFINANCE < 0.5 to the left, agree=0.579, adj=0.043, (0 split)
## category.xHEALTH_AND_FITNESS < 0.5 to the left, agree=0.579, adj=0.043, (0 split)
##
## Node number 27: 116 observations
## predicted class=396371.74 expected loss=0.25 P(node) =0.01600221
## class counts: 0 0 0 0 0 0 0 0 0 0 1 28 87
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.009 0.241 0.750
##
## Node number 28: 51 observations
## predicted class=100000 expected loss=0.3333333 P(node) =0.007035453
## class counts: 0 0 0 0 0 0 0 0 0 0 2 34 15
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.039 0.667 0.294
##
## Node number 29: 18 observations
## predicted class=396371.74 expected loss=0.3333333 P(node) =0.002483101
## class counts: 0 0 0 0 0 0 0 0 0 0 0 6 12
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.333 0.667
##
## Node number 32: 86 observations
## predicted class=10 expected loss=0.5581395 P(node) =0.01186371
## class counts: 8 16 9 38 9 6 0 0 0 0 0 0 0
## probabilities: 0.093 0.186 0.105 0.442 0.105 0.070 0.000 0.000 0.000 0.000 0.000 0.000 0.000
##
## Node number 33: 545 observations, complexity param=0.001840943
## predicted class=10 expected loss=0.6825688 P(node) =0.07518278
## class counts: 4 37 47 173 77 154 33 15 1 1 2 1 0
## probabilities: 0.007 0.068 0.086 0.317 0.141 0.283 0.061 0.028 0.002 0.002 0.004 0.002 0.000
## left son=66 (376 obs) right son=67 (169 obs)
## Primary splits:
## reviews < 0.5 to the left, improve=3.664163, (0 missing)
## ano_act.x2018 < 0.5 to the right, improve=2.893701, (0 missing)
## size_num_imp < 27500 to the left, improve=2.672262, (0 missing)
## ano_act.x2016 < 0.5 to the left, improve=2.405935, (0 missing)
## rating_imp < 3.55 to the right, improve=2.008815, (0 missing)
## Surrogate splits:
## rating_imp < 4.95 to the left, agree=0.734, adj=0.142, (0 split)
## category.xSHOPPING < 0.5 to the left, agree=0.695, adj=0.018, (0 split)
## ano_act.x2015 < 0.5 to the left, agree=0.694, adj=0.012, (0 split)
## category.xAUTO_AND_VEHICLES < 0.5 to the left, agree=0.692, adj=0.006, (0 split)
## category.xPHOTOGRAPHY < 0.5 to the left, agree=0.692, adj=0.006, (0 split)
##
## Node number 34: 347 observations
## predicted class=100 expected loss=0.5100865 P(node) =0.04786867
## class counts: 0 1 7 38 32 170 48 49 2 0 0 0 0
## probabilities: 0.000 0.003 0.020 0.110 0.092 0.490 0.138 0.141 0.006 0.000 0.000 0.000 0.000
##
## Node number 35: 9 observations
## predicted class=500 expected loss=0.3333333 P(node) =0.001241551
## class counts: 0 0 0 0 0 1 6 2 0 0 0 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.111 0.667 0.222 0.000 0.000 0.000 0.000 0.000
##
## Node number 36: 57 observations
## predicted class=100 expected loss=0.5087719 P(node) =0.007863154
## class counts: 0 0 0 5 4 28 8 12 0 0 0 0 0
## probabilities: 0.000 0.000 0.000 0.088 0.070 0.491 0.140 0.211 0.000 0.000 0.000 0.000 0.000
##
## Node number 37: 469 observations, complexity param=0.001022746
## predicted class=1000 expected loss=0.5309168 P(node) =0.06469858
## class counts: 0 0 2 11 21 92 93 220 24 6 0 0 0
## probabilities: 0.000 0.000 0.004 0.023 0.045 0.196 0.198 0.469 0.051 0.013 0.000 0.000 0.000
## left son=74 (307 obs) right son=75 (162 obs)
## Primary splits:
## reviews < 10.5 to the left, improve=7.264448, (0 missing)
## rating_imp < 4.75 to the right, improve=6.951071, (0 missing)
## category.xBOOKS_AND_REFERENCE < 0.5 to the left, improve=3.431773, (0 missing)
## category.xPHOTOGRAPHY < 0.5 to the left, improve=2.593914, (0 missing)
## ano_act.x2018 < 0.5 to the right, improve=2.037250, (0 missing)
## Surrogate splits:
## category.xPHOTOGRAPHY < 0.5 to the left, agree=0.661, adj=0.019, (0 split)
## size_num_imp < 635.5 to the right, agree=0.659, adj=0.012, (0 split)
## android_ver_imp < 6.5 to the left, agree=0.659, adj=0.012, (0 split)
## ano_act.x2013 < 0.5 to the left, agree=0.657, adj=0.006, (0 split)
##
## Node number 38: 68 observations
## predicted class=1000 expected loss=0.4264706 P(node) =0.009380604
## class counts: 0 0 0 0 0 14 10 39 1 4 0 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.206 0.147 0.574 0.015 0.059 0.000 0.000 0.000
##
## Node number 39: 515 observations, complexity param=0.002209131
## predicted class=1000 expected loss=0.5980583 P(node) =0.07104428
## class counts: 0 0 0 0 2 24 25 207 159 97 1 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.004 0.047 0.049 0.402 0.309 0.188 0.002 0.000 0.000
## left son=78 (133 obs) right son=79 (382 obs)
## Primary splits:
## rating_imp < 4.55 to the right, improve=8.756813, (0 missing)
## reviews < 44.5 to the left, improve=8.184461, (0 missing)
## category.xPARENTING < 0.5 to the left, improve=2.666829, (0 missing)
## size_num_imp < 2350 to the right, improve=2.080131, (0 missing)
## category.xVIDEO_PLAYERS < 0.5 to the right, improve=1.861519, (0 missing)
## Surrogate splits:
## category.xEVENTS < 0.5 to the right, agree=0.746, adj=0.015, (0 split)
## reviews < 56.5 to the right, agree=0.744, adj=0.008, (0 split)
## category.xFOOD_AND_DRINK < 0.5 to the right, agree=0.744, adj=0.008, (0 split)
##
## Node number 40: 55 observations
## predicted class=1000 expected loss=0.3818182 P(node) =0.007587253
## class counts: 0 0 0 0 0 0 5 34 11 5 0 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.091 0.618 0.200 0.091 0.000 0.000 0.000
##
## Node number 41: 54 observations, complexity param=0.001472754
## predicted class=5000 expected loss=0.5925926 P(node) =0.007449303
## class counts: 0 0 0 0 0 0 0 16 22 15 1 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.296 0.407 0.278 0.019 0.000 0.000
## left son=82 (35 obs) right son=83 (19 obs)
## Primary splits:
## rating_imp < 4.25 to the right, improve=4.123141, (0 missing)
## android_ver_imp < 2.15 to the right, improve=1.466734, (0 missing)
## size_num_imp < 978.5 to the right, improve=1.371981, (0 missing)
## reviews < 251.5 to the left, improve=1.160199, (0 missing)
## category.xMEDICAL < 0.5 to the left, improve=1.004728, (0 missing)
## Surrogate splits:
## reviews < 168.5 to the right, agree=0.704, adj=0.158, (0 split)
## size_num_imp < 344 to the right, agree=0.704, adj=0.158, (0 split)
## category.xTOOLS < 0.5 to the left, agree=0.704, adj=0.158, (0 split)
## android_ver_imp < 4.7 to the left, agree=0.667, adj=0.053, (0 split)
## category.xCOMMUNICATION < 0.5 to the left, agree=0.667, adj=0.053, (0 split)
##
## Node number 42: 523 observations, complexity param=0.001636393
## predicted class=10000 expected loss=0.374761 P(node) =0.07214788
## class counts: 0 0 0 0 0 0 2 41 85 327 48 18 2
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.004 0.078 0.163 0.625 0.092 0.034 0.004
## left son=84 (31 obs) right son=85 (492 obs)
## Primary splits:
## rating_imp < 4.75 to the right, improve=10.705650, (0 missing)
## reviews < 85.5 to the left, improve= 6.767054, (0 missing)
## category.xAUTO_AND_VEHICLES < 0.5 to the left, improve= 3.237148, (0 missing)
## category.xCOMMUNICATION < 0.5 to the right, improve= 2.332380, (0 missing)
## size_num_imp < 2650 to the left, improve= 1.724450, (0 missing)
##
## Node number 43: 287 observations, complexity param=0.001636393
## predicted class=10000 expected loss=0.543554 P(node) =0.03959167
## class counts: 0 0 0 0 0 0 0 2 3 131 99 45 7
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.007 0.010 0.456 0.345 0.157 0.024
## left son=86 (111 obs) right son=87 (176 obs)
## Primary splits:
## rating_imp < 4.25 to the right, improve=8.547383, (0 missing)
## category.xFINANCE < 0.5 to the right, improve=5.255432, (0 missing)
## reviews < 207.5 to the left, improve=3.691115, (0 missing)
## size_num_imp < 7500 to the right, improve=2.970639, (0 missing)
## category.xMEDICAL < 0.5 to the left, improve=2.327487, (0 missing)
## Surrogate splits:
## reviews < 410.5 to the right, agree=0.627, adj=0.036, (0 split)
## category.xLIBRARIES_AND_DEMO < 0.5 to the right, agree=0.627, adj=0.036, (0 split)
## category.xBOOKS_AND_REFERENCE < 0.5 to the right, agree=0.624, adj=0.027, (0 split)
## category.xNEWS_AND_MAGAZINES < 0.5 to the right, agree=0.624, adj=0.027, (0 split)
## category.xTRAVEL_AND_LOCAL < 0.5 to the right, agree=0.624, adj=0.027, (0 split)
##
## Node number 44: 68 observations
## predicted class=10000 expected loss=0.2647059 P(node) =0.009380604
## class counts: 0 0 0 0 0 0 0 0 8 50 9 1 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.118 0.735 0.132 0.015 0.000
##
## Node number 45: 20 observations
## predicted class=50000 expected loss=0.5 P(node) =0.002759001
## class counts: 0 0 0 0 0 0 0 0 0 5 10 5 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.250 0.500 0.250 0.000
##
## Node number 46: 540 observations, complexity param=0.001145475
## predicted class=100000 expected loss=0.45 P(node) =0.07449303
## class counts: 0 0 0 0 0 0 0 0 6 85 119 297 33
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.011 0.157 0.220 0.550 0.061
## left son=92 (73 obs) right son=93 (467 obs)
## Primary splits:
## rating_imp < 4.55 to the right, improve=10.760680, (0 missing)
## reviews < 806.5 to the left, improve= 9.823751, (0 missing)
## category.xCOMICS < 0.5 to the right, improve= 3.566737, (0 missing)
## android_ver_imp < 4.05 to the left, improve= 2.356799, (0 missing)
## category.xPERSONALIZATION < 0.5 to the right, improve= 2.134066, (0 missing)
## Surrogate splits:
## category.xBOOKS_AND_REFERENCE < 0.5 to the right, agree=0.87, adj=0.041, (0 split)
##
## Node number 47: 370 observations
## predicted class=100000 expected loss=0.4351351 P(node) =0.05104152
## class counts: 0 0 0 0 0 0 0 0 1 5 31 209 124
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.003 0.014 0.084 0.565 0.335
##
## Node number 52: 89 observations, complexity param=0.001104566
## predicted class=100000 expected loss=0.4719101 P(node) =0.01227756
## class counts: 0 0 0 0 0 0 0 0 0 2 2 47 38
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.022 0.022 0.528 0.427
## left son=104 (76 obs) right son=105 (13 obs)
## Primary splits:
## reviews < 3654 to the right, improve=3.9684070, (0 missing)
## ano_act.x2018 < 0.5 to the left, improve=1.2613220, (0 missing)
## size_num_imp < 2900 to the right, improve=1.0811960, (0 missing)
## grupo_edades.x4. < 0.5 to the left, improve=0.8318352, (0 missing)
## rating_imp < 4.55 to the right, improve=0.6127940, (0 missing)
##
## Node number 53: 70 observations
## predicted class=396371.74 expected loss=0.3857143 P(node) =0.009656504
## class counts: 0 0 0 0 0 0 0 0 0 0 2 25 43
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.029 0.357 0.614
##
## Node number 66: 376 observations, complexity param=0.001718213
## predicted class=10 expected loss=0.6648936 P(node) =0.05186922
## class counts: 4 33 41 126 50 92 17 8 1 1 2 1 0
## probabilities: 0.011 0.088 0.109 0.335 0.133 0.245 0.045 0.021 0.003 0.003 0.005 0.003 0.000
## left son=132 (239 obs) right son=133 (137 obs)
## Primary splits:
## size_num_imp < 13500 to the left, improve=2.365628, (0 missing)
## ano_act.x2018 < 0.5 to the right, improve=2.185873, (0 missing)
## rating_imp < 3.55 to the left, improve=1.865445, (0 missing)
## category.xTRAVEL_AND_LOCAL < 0.5 to the left, improve=1.664472, (0 missing)
## grupo_edades.x4. < 0.5 to the left, improve=1.631643, (0 missing)
## Surrogate splits:
## category.xPRODUCTIVITY < 0.5 to the left, agree=0.649, adj=0.036, (0 split)
## category.xSPORTS < 0.5 to the left, agree=0.649, adj=0.036, (0 split)
## category.xGAME < 0.5 to the left, agree=0.646, adj=0.029, (0 split)
## category.xTRAVEL_AND_LOCAL < 0.5 to the left, agree=0.644, adj=0.022, (0 split)
## android_ver_imp < 4.7 to the left, agree=0.638, adj=0.007, (0 split)
##
## Node number 67: 169 observations, complexity param=0.001840943
## predicted class=100 expected loss=0.6331361 P(node) =0.02331356
## class counts: 0 4 6 47 27 62 16 7 0 0 0 0 0
## probabilities: 0.000 0.024 0.036 0.278 0.160 0.367 0.095 0.041 0.000 0.000 0.000 0.000 0.000
## left son=134 (45 obs) right son=135 (124 obs)
## Primary splits:
## android_ver_imp < 4.15 to the right, improve=2.680694, (0 missing)
## rating_imp < 3.95 to the right, improve=2.201740, (0 missing)
## ano_act.x2016 < 0.5 to the left, improve=2.182554, (0 missing)
## ano_act.x2018 < 0.5 to the right, improve=1.830705, (0 missing)
## size_num_imp < 42029.48 to the left, improve=1.449943, (0 missing)
## Surrogate splits:
## category.xBEAUTY < 0.5 to the right, agree=0.746, adj=0.044, (0 split)
## size_num_imp < 42058.96 to the right, agree=0.740, adj=0.022, (0 split)
## category.xPRODUCTIVITY < 0.5 to the right, agree=0.740, adj=0.022, (0 split)
##
## Node number 74: 307 observations, complexity param=0.001022746
## predicted class=1000 expected loss=0.6058632 P(node) =0.04235067
## class counts: 0 0 2 10 16 70 72 121 13 3 0 0 0
## probabilities: 0.000 0.000 0.007 0.033 0.052 0.228 0.235 0.394 0.042 0.010 0.000 0.000 0.000
## left son=148 (184 obs) right son=149 (123 obs)
## Primary splits:
## rating_imp < 4.15 to the right, improve=3.307040, (0 missing)
## category.xBOOKS_AND_REFERENCE < 0.5 to the left, improve=2.449646, (0 missing)
## android_ver_imp < 4.05 to the right, improve=2.186021, (0 missing)
## ano_act.x2018 < 0.5 to the right, improve=2.086453, (0 missing)
## category.xPRODUCTIVITY < 0.5 to the right, improve=1.964938, (0 missing)
## Surrogate splits:
## size_num_imp < 3050 to the right, agree=0.622, adj=0.057, (0 split)
## category.xAUTO_AND_VEHICLES < 0.5 to the left, agree=0.612, adj=0.033, (0 split)
## category.xTOOLS < 0.5 to the left, agree=0.612, adj=0.033, (0 split)
## ano_act.x2015 < 0.5 to the left, agree=0.612, adj=0.033, (0 split)
## category.xPERSONALIZATION < 0.5 to the left, agree=0.609, adj=0.024, (0 split)
##
## Node number 75: 162 observations, complexity param=0.001022746
## predicted class=1000 expected loss=0.3888889 P(node) =0.02234791
## class counts: 0 0 0 1 5 22 21 99 11 3 0 0 0
## probabilities: 0.000 0.000 0.000 0.006 0.031 0.136 0.130 0.611 0.068 0.019 0.000 0.000 0.000
## left son=150 (16 obs) right son=151 (146 obs)
## Primary splits:
## rating_imp < 4.85 to the right, improve=5.840859, (0 missing)
## size_num_imp < 10500 to the left, improve=2.664200, (0 missing)
## category.xGAME < 0.5 to the right, improve=1.284953, (0 missing)
## android_ver_imp < 4.15 to the left, improve=1.153390, (0 missing)
## reviews < 14.5 to the left, improve=1.137499, (0 missing)
##
## Node number 78: 133 observations
## predicted class=1000 expected loss=0.5338346 P(node) =0.01834736
## class counts: 0 0 0 0 2 19 16 62 22 12 0 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.015 0.143 0.120 0.466 0.165 0.090 0.000 0.000 0.000
##
## Node number 79: 382 observations, complexity param=0.002209131
## predicted class=1000 expected loss=0.6204188 P(node) =0.05269692
## class counts: 0 0 0 0 0 5 9 145 137 85 1 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.013 0.024 0.380 0.359 0.223 0.003 0.000 0.000
## left son=158 (304 obs) right son=159 (78 obs)
## Primary splits:
## reviews < 44.5 to the left, improve=10.408140, (0 missing)
## category.xBUSINESS < 0.5 to the left, improve= 2.246599, (0 missing)
## size_num_imp < 4050 to the right, improve= 1.862363, (0 missing)
## ano_act.x2016 < 0.5 to the right, improve= 1.643459, (0 missing)
## category.xVIDEO_PLAYERS < 0.5 to the left, improve= 1.529237, (0 missing)
## Surrogate splits:
## size_num_imp < 89.5 to the right, agree=0.798, adj=0.013, (0 split)
##
## Node number 82: 35 observations
## predicted class=1000 expected loss=0.5428571 P(node) =0.004828252
## class counts: 0 0 0 0 0 0 0 16 10 9 0 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.457 0.286 0.257 0.000 0.000 0.000
##
## Node number 83: 19 observations
## predicted class=5000 expected loss=0.3684211 P(node) =0.002621051
## class counts: 0 0 0 0 0 0 0 0 12 6 1 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.632 0.316 0.053 0.000 0.000
##
## Node number 84: 31 observations
## predicted class=1000 expected loss=0.5806452 P(node) =0.004276452
## class counts: 0 0 0 0 0 0 1 13 8 6 0 3 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.032 0.419 0.258 0.194 0.000 0.097 0.000
##
## Node number 85: 492 observations
## predicted class=10000 expected loss=0.347561 P(node) =0.06787143
## class counts: 0 0 0 0 0 0 1 28 77 321 48 15 2
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.057 0.157 0.652 0.098 0.030 0.004
##
## Node number 86: 111 observations
## predicted class=10000 expected loss=0.3873874 P(node) =0.01531246
## class counts: 0 0 0 0 0 0 0 2 3 68 23 11 4
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.018 0.027 0.613 0.207 0.099 0.036
##
## Node number 87: 176 observations, complexity param=0.001636393
## predicted class=50000 expected loss=0.5681818 P(node) =0.02427921
## class counts: 0 0 0 0 0 0 0 0 0 63 76 34 3
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.358 0.432 0.193 0.017
## left son=174 (107 obs) right son=175 (69 obs)
## Primary splits:
## reviews < 305.5 to the left, improve=4.926168, (0 missing)
## category.xFINANCE < 0.5 to the right, improve=4.681243, (0 missing)
## size_num_imp < 7500 to the right, improve=1.830204, (0 missing)
## android_ver_imp < 4.15 to the right, improve=1.622348, (0 missing)
## category.xTOOLS < 0.5 to the left, improve=1.167208, (0 missing)
## Surrogate splits:
## size_num_imp < 2350 to the right, agree=0.631, adj=0.058, (0 split)
## android_ver_imp < 2.25 to the right, agree=0.619, adj=0.029, (0 split)
## rating_imp < 1.8 to the right, agree=0.614, adj=0.014, (0 split)
## category.xAUTO_AND_VEHICLES < 0.5 to the left, agree=0.614, adj=0.014, (0 split)
## category.xCOMMUNICATION < 0.5 to the left, agree=0.614, adj=0.014, (0 split)
##
## Node number 92: 73 observations, complexity param=0.001145475
## predicted class=10000 expected loss=0.6438356 P(node) =0.01007035
## class counts: 0 0 0 0 0 0 0 0 5 26 18 19 5
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.068 0.356 0.247 0.260 0.068
## left son=184 (32 obs) right son=185 (41 obs)
## Primary splits:
## reviews < 712 to the left, improve=3.770318, (0 missing)
## ano_act.x2018 < 0.5 to the left, improve=2.280535, (0 missing)
## rating_imp < 4.85 to the right, improve=2.175079, (0 missing)
## ano_act.x2017 < 0.5 to the left, improve=1.949394, (0 missing)
## android_ver_imp < 4.7 to the right, improve=1.762142, (0 missing)
## Surrogate splits:
## size_num_imp < 2050 to the left, agree=0.644, adj=0.188, (0 split)
## ano_act.x2018 < 0.5 to the left, agree=0.644, adj=0.188, (0 split)
## android_ver_imp < 3.5 to the left, agree=0.603, adj=0.094, (0 split)
## category.xPERSONALIZATION < 0.5 to the right, agree=0.603, adj=0.094, (0 split)
## ano_act.x2015 < 0.5 to the right, agree=0.603, adj=0.094, (0 split)
##
## Node number 93: 467 observations
## predicted class=100000 expected loss=0.4047109 P(node) =0.06442268
## class counts: 0 0 0 0 0 0 0 0 1 59 101 278 28
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.126 0.216 0.595 0.060
##
## Node number 104: 76 observations
## predicted class=100000 expected loss=0.4078947 P(node) =0.0104842
## class counts: 0 0 0 0 0 0 0 0 0 1 2 45 28
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.013 0.026 0.592 0.368
##
## Node number 105: 13 observations
## predicted class=396371.74 expected loss=0.2307692 P(node) =0.001793351
## class counts: 0 0 0 0 0 0 0 0 0 1 0 2 10
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.077 0.000 0.154 0.769
##
## Node number 132: 239 observations
## predicted class=10 expected loss=0.6192469 P(node) =0.03297006
## class counts: 4 19 23 91 30 51 12 8 0 0 0 1 0
## probabilities: 0.017 0.079 0.096 0.381 0.126 0.213 0.050 0.033 0.000 0.000 0.000 0.004 0.000
##
## Node number 133: 137 observations, complexity param=0.001718213
## predicted class=100 expected loss=0.7007299 P(node) =0.01889916
## class counts: 0 14 18 35 20 41 5 0 1 1 2 0 0
## probabilities: 0.000 0.102 0.131 0.255 0.146 0.299 0.036 0.000 0.007 0.007 0.015 0.000 0.000
## left son=266 (100 obs) right son=267 (37 obs)
## Primary splits:
## ano_act.x2018 < 0.5 to the right, improve=3.842932, (0 missing)
## category.xSPORTS < 0.5 to the right, improve=2.540799, (0 missing)
## android_ver_imp < 4.7 to the left, improve=2.037782, (0 missing)
## rating_imp < 4.05 to the right, improve=1.716818, (0 missing)
## category.xPRODUCTIVITY < 0.5 to the right, improve=1.439691, (0 missing)
## Surrogate splits:
## ano_act.x2017 < 0.5 to the left, agree=0.942, adj=0.784, (0 split)
## ano_act.x2016 < 0.5 to the left, agree=0.774, adj=0.162, (0 split)
## android_ver_imp < 3.15 to the right, agree=0.759, adj=0.108, (0 split)
## category.xFINANCE < 0.5 to the left, agree=0.745, adj=0.054, (0 split)
## category.xPERSONALIZATION < 0.5 to the left, agree=0.745, adj=0.054, (0 split)
##
## Node number 134: 45 observations
## predicted class=10 expected loss=0.6222222 P(node) =0.006207753
## class counts: 0 2 1 17 10 9 3 3 0 0 0 0 0
## probabilities: 0.000 0.044 0.022 0.378 0.222 0.200 0.067 0.067 0.000 0.000 0.000 0.000 0.000
##
## Node number 135: 124 observations, complexity param=0.001227295
## predicted class=100 expected loss=0.5725806 P(node) =0.01710581
## class counts: 0 2 5 30 17 53 13 4 0 0 0 0 0
## probabilities: 0.000 0.016 0.040 0.242 0.137 0.427 0.105 0.032 0.000 0.000 0.000 0.000 0.000
## left son=270 (33 obs) right son=271 (91 obs)
## Primary splits:
## rating_imp < 4.65 to the right, improve=2.9047190, (0 missing)
## ano_act.x2016 < 0.5 to the left, improve=1.3617270, (0 missing)
## ano_act.x2018 < 0.5 to the right, improve=1.2231950, (0 missing)
## size_num_imp < 26500 to the left, improve=0.9421113, (0 missing)
## category.xBUSINESS < 0.5 to the left, improve=0.8444905, (0 missing)
##
## Node number 148: 184 observations, complexity param=0.001022746
## predicted class=1000 expected loss=0.6793478 P(node) =0.02538281
## class counts: 0 0 2 8 9 49 45 59 9 3 0 0 0
## probabilities: 0.000 0.000 0.011 0.043 0.049 0.266 0.245 0.321 0.049 0.016 0.000 0.000 0.000
## left son=296 (83 obs) right son=297 (101 obs)
## Primary splits:
## rating_imp < 4.65 to the right, improve=2.381488, (0 missing)
## reviews < 5.5 to the left, improve=2.056307, (0 missing)
## size_num_imp < 10500 to the right, improve=1.861594, (0 missing)
## category.xBUSINESS < 0.5 to the right, improve=1.826880, (0 missing)
## ano_act.x2018 < 0.5 to the right, improve=1.719765, (0 missing)
## Surrogate splits:
## category.xBUSINESS < 0.5 to the right, agree=0.587, adj=0.084, (0 split)
## reviews < 6.5 to the left, agree=0.582, adj=0.072, (0 split)
## category.xDATING < 0.5 to the right, agree=0.560, adj=0.024, (0 split)
## category.xGAME < 0.5 to the right, agree=0.560, adj=0.024, (0 split)
## category.xBOOKS_AND_REFERENCE < 0.5 to the right, agree=0.554, adj=0.012, (0 split)
##
## Node number 149: 123 observations
## predicted class=1000 expected loss=0.495935 P(node) =0.01696786
## class counts: 0 0 0 2 7 21 27 62 4 0 0 0 0
## probabilities: 0.000 0.000 0.000 0.016 0.057 0.171 0.220 0.504 0.033 0.000 0.000 0.000 0.000
##
## Node number 150: 16 observations
## predicted class=100 expected loss=0.5 P(node) =0.002207201
## class counts: 0 0 0 1 2 8 1 3 1 0 0 0 0
## probabilities: 0.000 0.000 0.000 0.062 0.125 0.500 0.062 0.188 0.062 0.000 0.000 0.000 0.000
##
## Node number 151: 146 observations
## predicted class=1000 expected loss=0.3424658 P(node) =0.02014071
## class counts: 0 0 0 0 3 14 20 96 10 3 0 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.021 0.096 0.137 0.658 0.068 0.021 0.000 0.000 0.000
##
## Node number 158: 304 observations, complexity param=0.001104566
## predicted class=1000 expected loss=0.5559211 P(node) =0.04193682
## class counts: 0 0 0 0 0 5 9 135 98 56 1 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.016 0.030 0.444 0.322 0.184 0.003 0.000 0.000
## left son=316 (296 obs) right son=317 (8 obs)
## Primary splits:
## category.xBUSINESS < 0.5 to the left, improve=2.836771, (0 missing)
## reviews < 43.5 to the left, improve=2.689877, (0 missing)
## category.xVIDEO_PLAYERS < 0.5 to the left, improve=2.498833, (0 missing)
## category.xFAMILY < 0.5 to the right, improve=1.767272, (0 missing)
## size_num_imp < 30500 to the right, improve=1.764518, (0 missing)
##
## Node number 159: 78 observations
## predicted class=5000 expected loss=0.5 P(node) =0.0107601
## class counts: 0 0 0 0 0 0 0 10 39 29 0 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.128 0.500 0.372 0.000 0.000 0.000
##
## Node number 174: 107 observations
## predicted class=10000 expected loss=0.5327103 P(node) =0.01476066
## class counts: 0 0 0 0 0 0 0 0 0 50 43 13 1
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.467 0.402 0.121 0.009
##
## Node number 175: 69 observations
## predicted class=50000 expected loss=0.5217391 P(node) =0.009518554
## class counts: 0 0 0 0 0 0 0 0 0 13 33 21 2
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.188 0.478 0.304 0.029
##
## Node number 184: 32 observations
## predicted class=10000 expected loss=0.5 P(node) =0.004414402
## class counts: 0 0 0 0 0 0 0 0 3 16 10 2 1
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.094 0.500 0.312 0.062 0.031
##
## Node number 185: 41 observations
## predicted class=100000 expected loss=0.5853659 P(node) =0.005655953
## class counts: 0 0 0 0 0 0 0 0 2 10 8 17 4
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.049 0.244 0.195 0.415 0.098
##
## Node number 266: 100 observations, complexity param=0.001227295
## predicted class=10 expected loss=0.7 P(node) =0.01379501
## class counts: 0 12 16 30 17 22 2 0 0 0 1 0 0
## probabilities: 0.000 0.120 0.160 0.300 0.170 0.220 0.020 0.000 0.000 0.000 0.010 0.000 0.000
## left son=532 (77 obs) right son=533 (23 obs)
## Primary splits:
## android_ver_imp < 4.35 to the left, improve=2.478046, (0 missing)
## category.xSPORTS < 0.5 to the right, improve=1.816527, (0 missing)
## category.xPRODUCTIVITY < 0.5 to the right, improve=1.333636, (0 missing)
## rating_imp < 3.75 to the right, improve=1.297381, (0 missing)
## size_num_imp < 21500 to the right, improve=1.295984, (0 missing)
## Surrogate splits:
## category.xTRAVEL_AND_LOCAL < 0.5 to the left, agree=0.78, adj=0.043, (0 split)
##
## Node number 267: 37 observations
## predicted class=100 expected loss=0.4864865 P(node) =0.005104152
## class counts: 0 2 2 5 3 19 3 0 1 1 1 0 0
## probabilities: 0.000 0.054 0.054 0.135 0.081 0.514 0.081 0.000 0.027 0.027 0.027 0.000 0.000
##
## Node number 270: 33 observations
## predicted class=10 expected loss=0.5454545 P(node) =0.004552352
## class counts: 0 0 2 15 4 10 2 0 0 0 0 0 0
## probabilities: 0.000 0.000 0.061 0.455 0.121 0.303 0.061 0.000 0.000 0.000 0.000 0.000 0.000
##
## Node number 271: 91 observations
## predicted class=100 expected loss=0.5274725 P(node) =0.01255346
## class counts: 0 2 3 15 13 43 11 4 0 0 0 0 0
## probabilities: 0.000 0.022 0.033 0.165 0.143 0.473 0.121 0.044 0.000 0.000 0.000 0.000 0.000
##
## Node number 296: 83 observations, complexity param=0.001022746
## predicted class=1000 expected loss=0.6385542 P(node) =0.01144986
## class counts: 0 0 2 5 6 26 12 30 2 0 0 0 0
## probabilities: 0.000 0.000 0.024 0.060 0.072 0.313 0.145 0.361 0.024 0.000 0.000 0.000 0.000
## left son=592 (29 obs) right son=593 (54 obs)
## Primary splits:
## size_num_imp < 11500 to the right, improve=2.498146, (0 missing)
## category.xBUSINESS < 0.5 to the left, improve=1.945783, (0 missing)
## android_ver_imp < 4.05 to the right, improve=1.930632, (0 missing)
## rating_imp < 4.75 to the right, improve=1.542557, (0 missing)
## category.xFAMILY < 0.5 to the right, improve=1.438946, (0 missing)
## Surrogate splits:
## android_ver_imp < 4.15 to the right, agree=0.699, adj=0.138, (0 split)
## category.xMEDICAL < 0.5 to the right, agree=0.699, adj=0.138, (0 split)
## category.xBUSINESS < 0.5 to the right, agree=0.687, adj=0.103, (0 split)
## category.xCOMMUNICATION < 0.5 to the right, agree=0.675, adj=0.069, (0 split)
## category.xGAME < 0.5 to the right, agree=0.675, adj=0.069, (0 split)
##
## Node number 297: 101 observations, complexity param=0.001022746
## predicted class=500 expected loss=0.6732673 P(node) =0.01393296
## class counts: 0 0 0 3 3 23 33 29 7 3 0 0 0
## probabilities: 0.000 0.000 0.000 0.030 0.030 0.228 0.327 0.287 0.069 0.030 0.000 0.000 0.000
## left son=594 (53 obs) right son=595 (48 obs)
## Primary splits:
## ano_act.x2018 < 0.5 to the right, improve=2.338883, (0 missing)
## size_num_imp < 28500 to the left, improve=1.620878, (0 missing)
## category.xCOMMUNICATION < 0.5 to the left, improve=1.553164, (0 missing)
## reviews < 5.5 to the left, improve=1.446561, (0 missing)
## ano_act.x2016 < 0.5 to the left, improve=1.364044, (0 missing)
## Surrogate splits:
## ano_act.x2017 < 0.5 to the left, agree=0.792, adj=0.563, (0 split)
## size_num_imp < 6850 to the right, agree=0.683, adj=0.333, (0 split)
## ano_act.x2016 < 0.5 to the left, agree=0.663, adj=0.292, (0 split)
## android_ver_imp < 4.05 to the right, agree=0.653, adj=0.271, (0 split)
## rating_imp < 4.45 to the right, agree=0.574, adj=0.104, (0 split)
##
## Node number 316: 296 observations, complexity param=0.001104566
## predicted class=1000 expected loss=0.5506757 P(node) =0.04083322
## class counts: 0 0 0 0 0 5 8 133 98 51 1 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.017 0.027 0.449 0.331 0.172 0.003 0.000 0.000
## left son=632 (282 obs) right son=633 (14 obs)
## Primary splits:
## category.xVIDEO_PLAYERS < 0.5 to the left, improve=2.423067, (0 missing)
## reviews < 43.5 to the left, improve=2.283122, (0 missing)
## size_num_imp < 36500 to the right, improve=1.930457, (0 missing)
## rating_imp < 2.85 to the left, improve=1.886526, (0 missing)
## ano_act.x2018 < 0.5 to the right, improve=1.435229, (0 missing)
##
## Node number 317: 8 observations
## predicted class=10000 expected loss=0.375 P(node) =0.0011036
## class counts: 0 0 0 0 0 0 1 2 0 5 0 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.125 0.250 0.000 0.625 0.000 0.000 0.000
##
## Node number 532: 77 observations
## predicted class=10 expected loss=0.6363636 P(node) =0.01062215
## class counts: 0 11 9 28 12 15 2 0 0 0 0 0 0
## probabilities: 0.000 0.143 0.117 0.364 0.156 0.195 0.026 0.000 0.000 0.000 0.000 0.000 0.000
##
## Node number 533: 23 observations
## predicted class=5 expected loss=0.6956522 P(node) =0.003172851
## class counts: 0 1 7 2 5 7 0 0 0 0 1 0 0
## probabilities: 0.000 0.043 0.304 0.087 0.217 0.304 0.000 0.000 0.000 0.000 0.043 0.000 0.000
##
## Node number 592: 29 observations
## predicted class=100 expected loss=0.5172414 P(node) =0.004000552
## class counts: 0 0 1 2 3 14 3 6 0 0 0 0 0
## probabilities: 0.000 0.000 0.034 0.069 0.103 0.483 0.103 0.207 0.000 0.000 0.000 0.000 0.000
##
## Node number 593: 54 observations
## predicted class=1000 expected loss=0.5555556 P(node) =0.007449303
## class counts: 0 0 1 3 3 12 9 24 2 0 0 0 0
## probabilities: 0.000 0.000 0.019 0.056 0.056 0.222 0.167 0.444 0.037 0.000 0.000 0.000 0.000
##
## Node number 594: 53 observations
## predicted class=1000 expected loss=0.6415094 P(node) =0.007311353
## class counts: 0 0 0 3 3 12 11 19 3 2 0 0 0
## probabilities: 0.000 0.000 0.000 0.057 0.057 0.226 0.208 0.358 0.057 0.038 0.000 0.000 0.000
##
## Node number 595: 48 observations
## predicted class=500 expected loss=0.5416667 P(node) =0.006621603
## class counts: 0 0 0 0 0 11 22 10 4 1 0 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.229 0.458 0.208 0.083 0.021 0.000 0.000 0.000
##
## Node number 632: 282 observations
## predicted class=1000 expected loss=0.5390071 P(node) =0.03890192
## class counts: 0 0 0 0 0 5 7 130 89 50 1 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.018 0.025 0.461 0.316 0.177 0.004 0.000 0.000
##
## Node number 633: 14 observations
## predicted class=5000 expected loss=0.3571429 P(node) =0.001931301
## class counts: 0 0 0 0 0 0 1 3 9 1 0 0 0
## probabilities: 0.000 0.000 0.000 0.000 0.000 0.000 0.071 0.214 0.643 0.071 0.000 0.000 0.000
# Get the model's predictions on the test set
pred_test <- predict(model_2, newdata = test, type = "class")
# Confusion matrix
confusionMatrix(table(pred_test, test$installs))
## Confusion Matrix and Statistics
##
##
## pred_test 0 1 5 10 50 100 500 1000 5000 10000 50000 100000
## 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 1 5 0 1 0 1 0 1 0 0
## 10 2 9 11 75 25 35 5 2 1 1 1 1
## 50 0 0 0 0 0 0 0 0 0 0 0 0
## 100 0 4 4 35 31 118 36 29 3 0 1 0
## 500 0 0 1 0 1 5 6 12 1 0 0 0
## 1000 0 0 0 5 2 60 50 172 77 51 0 1
## 5000 0 0 0 0 0 0 0 15 16 13 0 0
## 10000 0 0 0 0 0 2 1 22 47 195 50 30
## 50000 0 0 0 0 0 0 0 0 0 8 14 14
## 100000 0 0 0 0 0 0 0 0 1 29 72 235
## 396371.74 0 0 0 0 0 0 0 0 0 0 3 62
##
## pred_test 396371.74
## 0 0
## 1 0
## 5 0
## 10 1
## 50 0
## 100 0
## 500 0
## 1000 0
## 5000 0
## 10000 4
## 50000 0
## 100000 87
## 396371.74 1304
##
## Overall Statistics
##
## Accuracy : 0.6875
## 95% CI : (0.6708, 0.7038)
## No Information Rate : 0.4493
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5864
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 5 Class: 10 Class: 50
## Sensitivity 0.0000000 0.000000 0.0588235 0.62500 0.00000
## Specificity 1.0000000 1.000000 0.9974110 0.96853 1.00000
## Pos Pred Value NaN NaN 0.1111111 0.44379 NaN
## Neg Pred Value 0.9993563 0.995816 0.9948354 0.98468 0.98101
## Prevalence 0.0006437 0.004184 0.0054715 0.03862 0.01899
## Detection Rate 0.0000000 0.000000 0.0003219 0.02414 0.00000
## Detection Prevalence 0.0000000 0.000000 0.0028967 0.05439 0.00000
## Balanced Accuracy 0.5000000 0.500000 0.5281173 0.79677 0.50000
## Class: 100 Class: 500 Class: 1000 Class: 5000 Class: 10000
## Sensitivity 0.53394 0.061224 0.67984 0.10959 0.65436
## Specificity 0.95045 0.993353 0.91381 0.99054 0.94446
## Pos Pred Value 0.45211 0.230769 0.41148 0.36364 0.55556
## Neg Pred Value 0.96381 0.970140 0.96988 0.95756 0.96263
## Prevalence 0.07113 0.031542 0.08143 0.04699 0.09591
## Detection Rate 0.03798 0.001931 0.05536 0.00515 0.06276
## Detection Prevalence 0.08400 0.008368 0.13453 0.01416 0.11297
## Balanced Accuracy 0.74219 0.527289 0.79682 0.55007 0.79941
## Class: 50000 Class: 100000 Class: 396371.74
## Sensitivity 0.099291 0.68513 0.9341
## Specificity 0.992583 0.93162 0.9620
## Pos Pred Value 0.388889 0.55425 0.9525
## Neg Pred Value 0.958645 0.95975 0.9471
## Prevalence 0.045381 0.11040 0.4493
## Detection Rate 0.004506 0.07564 0.4197
## Detection Prevalence 0.011587 0.13647 0.4406
## Balanced Accuracy 0.545937 0.80838 0.9481
predict_test <- predict(model_2, test, type = "class")
# Build a contingency table (rows = actual, columns = predicted) to evaluate accuracy
table_mat <- table(test$installs, predict_test)
table_mat
## predict_test
## 0 1 5 10 50 100 500 1000 5000 10000 50000 100000
## 0 0 0 0 2 0 0 0 0 0 0 0 0
## 1 0 0 0 9 0 4 0 0 0 0 0 0
## 5 0 0 1 11 0 4 1 0 0 0 0 0
## 10 0 0 5 75 0 35 0 5 0 0 0 0
## 50 0 0 0 25 0 31 1 2 0 0 0 0
## 100 0 0 1 35 0 118 5 60 0 2 0 0
## 500 0 0 0 5 0 36 6 50 0 1 0 0
## 1000 0 0 1 2 0 29 12 172 15 22 0 0
## 5000 0 0 0 1 0 3 1 77 16 47 0 1
## 10000 0 0 1 1 0 0 0 51 13 195 8 29
## 50000 0 0 0 1 0 1 0 0 0 50 14 72
## 100000 0 0 0 1 0 0 0 1 0 30 14 235
## 396371.74 0 0 0 1 0 0 0 0 0 4 0 87
## predict_test
## 396371.74
## 0 0
## 1 0
## 5 0
## 10 0
## 50 0
## 100 0
## 500 0
## 1000 0
## 5000 0
## 10000 0
## 50000 3
## 100000 62
## 396371.74 1304
print(paste('Accuracy para train', accuracy_train(model_2)))
## [1] "Accuracy para train 0.714029521313285"
print(paste('Accuracy para test', accuracy_test(model_2)))
## [1] "Accuracy para test 0.687479884132604"
Compared with model 2, model 1 has a lower accuracy (67.36% vs 68.75%), a lower kappa (0.5626 vs 0.5864), a much lower sensitivity for classes 0, 1, 5 and 50, and a higher sensitivity for classes 1000, 50000 and 100000.
The accuracy on train (71.4%) and on test (68.7%) is similar, which is a good sign that the model is not overfitting.
# Get the model's predictions on the test set
pred_test <- predict(model_2, newdata = test, type = "class")
# Contingency table (rows = actual class, columns = predicted class)
table_test <- table(test$installs, pred_test)
# Precision per installs class
precision <- diag(table_test) / colSums(table_test)
# Recall per installs class
recall <- diag(table_test) / rowSums(table_test)
# Plot precision and recall as bar charts
barplot(precision, ylim = c(0, 1), main = "Precisión por instalaciones", xlab = "cantidad", ylab = "Precisión")
barplot(recall, ylim = c(0, 1), main = "Recall por instalaciones", xlab = "cantidad", ylab = "Recall")
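The per-class precision and recall computed above divide by column and row totals, so a class that is never predicted (such as 0, 1 or 50 in the model_2 confusion matrix) produces 0 / 0. The following is a minimal sketch on a toy table with hypothetical counts, not the author's data, showing the same formulas and one possible way to make the undefined values explicit:

```r
# Toy 3-class contingency table (hypothetical counts):
# rows = actual class, columns = predicted class.
# Class "c" is never predicted, so its column total is zero.
toy_table <- matrix(
  c(8, 2, 0,
    1, 9, 0,
    3, 2, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = c("a", "b", "c"), predicted = c("a", "b", "c"))
)

# Precision: correct predictions over everything predicted as that class
precision_toy <- diag(toy_table) / colSums(toy_table)
# Recall: correct predictions over everything that truly belongs to that class
recall_toy <- diag(toy_table) / rowSums(toy_table)

# 0 / 0 yields NaN; turn it into NA so it is clearly "undefined", not zero
precision_toy[is.nan(precision_toy)] <- NA

print(round(precision_toy, 3))
print(round(recall_toy, 3))
```

In the bar charts, an undefined precision is simply drawn as a missing bar, which is worth keeping in mind when reading the classes the tree never predicts.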
Earlier, a binary variable popular was created from installs based on the distribution of the number of installs: apps with 100,000 or more installs were considered popular. We will try to predict app popularity with this variable and check whether the model predicts better than with installs.
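As a reminder of how such a binary target can be derived, here is a minimal sketch on a hypothetical vector of install counts. The inclusive cutoff (>= 100000) is an assumption: it is inferred from the class counts reported in this section, where the popular apps in the test set (1739) equal the 100000 and 396371.74 classes combined.

```r
# Hypothetical install counts; the real values live in the installs column
installs <- c(10, 5000, 50000, 100000, 396371.74)

# Assumption: the cutoff is inclusive, i.e. the 100000 class counts as popular
popular <- as.integer(installs >= 100000)

print(data.frame(installs, popular))
# Share of each class, comparable to the proportion table shown for train_pop
print(prop.table(table(popular)))
```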
# Set a seed for reproducibility
set.seed(583)
n <- nrow(df_popular)
train_idx <- sample(seq_len(n), floor(n * 0.7), replace = FALSE) # 70% for training
train_pop <- df_popular[train_idx, ]
test_pop <- df_popular[-train_idx, ]
n_train <- nrow(train_pop)
# Share of each class of popular in the training set
train_pop %>%
  group_by(popular) %>%
  summarise(Prop = round(n()/n_train,3)) %>%
  gt() %>%
  tab_options(page.width = "100") %>%
  tab_header(title = "Proporción de Instalaciones")
Proporción de Instalaciones

| popular | Prop |
|---|---|
| 0 | 0.454 |
| 1 | 0.546 |
We fit the model:
# Fit the model
model_pop <- rpart(popular ~., data = train_pop, method = 'class')
# Plot the tree
# fancyRpartPlot(model, type = 1, palettes=c("Greys", "Blues"), main = "Arbol de Clasificación - Modelo 1", caption = "Por la cantidad de instalaciones")
rpart.plot(model_pop, main = "Árbol de clasificación", extra = 101, under = TRUE, branch.lty = 1, shadow.col = "gray")
# Summary of the fitted model
summary(model_pop)
## Call:
## rpart(formula = popular ~ ., data = train_pop, method = "class")
## n= 7249
##
## CP nsplit rel error xerror xstd
## 1 0.88382 0 1.00000 1.0000000 0.012891322
## 2 0.01000 1 0.11618 0.1225669 0.005933352
##
## Variable importance
## reviews ano_act.x2018 size_num_imp rating_imp ano_act.x2017
## 64 10 9 7 6
## type_bin
## 4
##
## Node number 1: 7249 observations, complexity param=0.88382
## predicted class=1 expected loss=0.4535798 P(node) =1
## class counts: 3288 3961
## probabilities: 0.454 0.546
## left son=2 (3160 obs) right son=3 (4089 obs)
## Primary splits:
## reviews < 580.5 to the left, improve=2871.2720, (0 missing)
## ano_act.x2018 < 0.5 to the left, improve= 227.5610, (0 missing)
## size_num_imp < 33500 to the left, improve= 202.2597, (0 missing)
## type_bin < 0.5 to the right, improve= 171.8375, (0 missing)
## rating_imp < 4.75 to the right, improve= 162.3323, (0 missing)
## Surrogate splits:
## ano_act.x2018 < 0.5 to the left, agree=0.633, adj=0.158, (0 split)
## size_num_imp < 5350 to the left, agree=0.625, adj=0.140, (0 split)
## rating_imp < 3.75 to the left, agree=0.612, adj=0.111, (0 split)
## ano_act.x2017 < 0.5 to the right, agree=0.604, adj=0.092, (0 split)
## type_bin < 0.5 to the right, agree=0.592, adj=0.064, (0 split)
##
## Node number 2: 3160 observations
## predicted class=0 expected loss=0.04018987 P(node) =0.4359222
## class counts: 3033 127
## probabilities: 0.960 0.040
##
## Node number 3: 4089 observations
## predicted class=1 expected loss=0.06236244 P(node) =0.5640778
## class counts: 255 3834
## probabilities: 0.062 0.938
predict_test <- predict(model_pop, test_pop, type = "class")
# Build a contingency table (rows = actual, columns = predicted) to evaluate accuracy
table_mat <- table(test_pop$popular, predict_test)
table_mat
## predict_test
## 0 1
## 0 1261 107
## 1 71 1668
# Accuracy of a fitted model on the training set of the binary problem
accuracy_train <- function(fit) {
  predict_unseen_train <- predict(fit, train_pop, type = 'class')
  table_mat <- table(train_pop$popular, predict_unseen_train)
  accuracy_train <- sum(diag(table_mat)) / sum(table_mat)
  accuracy_train
}
# Accuracy of a fitted model on the test set of the binary problem
accuracy_test <- function(fit) {
  predict_unseen <- predict(fit, test_pop, type = 'class')
  table_mat <- table(test_pop$popular, predict_unseen)
  accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
  accuracy_Test
}
print(paste('Accuracy para train', accuracy_train(model_pop)))
## [1] "Accuracy para train 0.947303076286384"
print(paste('Accuracy para test', accuracy_test(model_pop)))
## [1] "Accuracy para test 0.942710009655616"
# Get the model's predictions on the test set
pred_test <- predict(model_pop, newdata = test_pop, type = "class")
# Confusion matrix
confusionMatrix(table(pred_test, test_pop$popular))
## Confusion Matrix and Statistics
##
##
## pred_test 0 1
## 0 1261 71
## 1 107 1668
##
## Accuracy : 0.9427
## 95% CI : (0.934, 0.9506)
## No Information Rate : 0.5597
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.8834
##
## Mcnemar's Test P-Value : 0.008707
##
## Sensitivity : 0.9218
## Specificity : 0.9592
## Pos Pred Value : 0.9467
## Neg Pred Value : 0.9397
## Prevalence : 0.4403
## Detection Rate : 0.4059
## Detection Prevalence : 0.4287
## Balanced Accuracy : 0.9405
##
## 'Positive' Class : 0
##
# Predicted probability of class 1 (popular) on test_pop, the same set the labels come from
predictions <- predict(model_pop, newdata = test_pop, type = "prob")[, 2]
roc_curve <- roc(test_pop$popular ~ predictions)
plot(roc_curve)
auc(roc_curve)
## Area under the curve: 0.9405
By using popular instead of installs as the dependent variable, the model ignores the finer-grained variation in installs. However, model_pop predicts popularity more accurately: its test accuracy is 94.27% and its kappa 0.8834, against 68.75% and 0.5864 for model_2 on the 13 installs classes.
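As for the ROC curve, the AUC is 0.9405, which indicates that the model discriminates well between popular and non-popular apps. Because the tree has a single split, the ROC curve has only one interior point, so the AUC coincides with the balanced accuracy (0.9405).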
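The AUC reported above can be checked by hand. model_pop has only two leaves, so its predicted probabilities take two values and the ROC curve is a polyline through a single interior point. The sketch below, which only uses the counts from the contingency table of model_pop on test_pop, computes the area with the trapezoid rule; the variable names are illustrative.

```r
# Counts from the model_pop contingency table on test_pop
# (rows = actual, columns = predicted)
tn <- 1261; fp <- 107    # actual class 0
fn <- 71;   tp <- 1668   # actual class 1

tpr <- tp / (tp + fn)    # true positive rate for class 1
fpr <- fp / (fp + tn)    # false positive rate for class 1

# ROC polyline through (0, 0), (fpr, tpr) and (1, 1); area by the trapezoid rule
x <- c(0, fpr, 1)
y <- c(0, tpr, 1)
auc_manual <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# With a single interior point the area reduces to the balanced accuracy
balanced_accuracy <- (tpr + (1 - fpr)) / 2

print(round(c(auc = auc_manual, balanced_accuracy = balanced_accuracy), 4))
```

Both quantities come out at about 0.9405, matching the AUC and the balanced accuracy in the output above.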
The confusion matrix shows that model_pop has an accuracy of 94.27%, with a sensitivity of 92.18% and a specificity of 95.92% (the positive class is 0, i.e. non-popular). In addition, the kappa of 0.8834 indicates strong agreement between the model's predictions and the observed values.
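To make these figures traceable, the following sketch recomputes them from the four cells of the confusion matrix printed by confusionMatrix(). It is a base R illustration of the standard formulas, not a replacement for caret; object names such as cm and kappa_hat are illustrative.

```r
# Confusion matrix as printed above: rows = predicted, columns = actual
cm <- matrix(c(1261,   71,
                107, 1668),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# The positive class is "0", as stated in the caret output
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # correctly detected non-popular apps
specificity <- cm["1", "1"] / sum(cm[, "1"])   # correctly detected popular apps

# Cohen's kappa: agreement beyond what the marginal totals would give by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = kappa_hat), 4))
```

Since the positive class is 0, the sensitivity of 92.18% is the hit rate on non-popular apps and the specificity of 95.92% is the hit rate on popular ones.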
The candidate predictors are reviews and ano_act.x2018, together with size_num_imp, rating_imp and type_bin, although the fitted tree splits only on reviews (the others appear as competing and surrogate splits). The model's accuracy is about 94%, and the rate of correct classification is 95.92% for class 1 (popular) and 92.18% for class 0. Being free, a higher rating and a larger size are the characteristics most related to app popularity:
p1 <- ggplot(data = train_pop, aes(x = type_bin, y = popular)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Type", y = "Popularidad") +
ggtitle("Popularidad x type")+
theme_classic() +
theme(plot.title = element_text(size = 10))
p2 <- ggplot(data = train_pop, aes(x = rating_imp, y = popular)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "rating", y = "") +
ggtitle("Popularidad x rating") +
theme_classic() +
theme(plot.title = element_text(size = 10))
p3 <- ggplot(data = train_pop, aes(x = size_num_imp, y = popular)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "tamaño", y = "") +
ggtitle("Popularidad x tamaño") +
theme_classic() +
theme(plot.title = element_text(size = 10))
cowplot::plot_grid(p1, p2, p3, ncol = 3, align = "h")
# Free memory
rm(fit, model,model_2,model_pop,p1,p2,p3, predict_test, predictions, roc1, roc2, roc_test,roc_curve, test_pop, test, train, train_pop, tune_grid, accuracy, cp_values, n, n_train, precision, pred_test, recall,table_mat, table_test, train_idx, df_analisis, accuracy_test,accuracy_Test,accuracy_train,df_final)
In line with the goal of this work, “developing a promotional app”, we will build a model that predicts the probability that an application is popular based on its characteristics, in order to inform decisions about the app's development and marketing strategy.
We will use df_popular.
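A logistic regression turns a linear combination of the predictors into a probability through the inverse logit. As a minimal sketch, the code below applies that step by hand to a hypothetical app, using a few coefficients copied from the summary(model_1) output shown further down in this section. Two assumptions are made: type_bin = 1 is taken to mean a paid app, and every category and age-group dummy is left at 0 (the reference levels).

```r
# Coefficients copied from the summary(model_1) output in this section
beta <- c(
  intercept       = 12.388887858,
  reviews         = 0.002559911,
  type_bin        = -8.221826174,
  size_num_imp    = 0.000004066,
  rating_imp      = -0.564338309,
  android_ver_imp = -0.076240167,
  ano_act.x2018   = -11.380131206
)

# Hypothetical free app updated in 2018 (illustrative values only)
app_free <- c(intercept = 1, reviews = 1000, type_bin = 0, size_num_imp = 19000,
              rating_imp = 4.3, android_ver_imp = 4.0, ano_act.x2018 = 1)

# Linear predictor (log-odds), then inverse logit to obtain a probability
eta_free <- sum(beta * app_free[names(beta)])
p_free <- plogis(eta_free)

# Same app, but paid (assumption: type_bin = 1 means paid)
app_paid <- app_free
app_paid["type_bin"] <- 1
p_paid <- plogis(sum(beta * app_paid[names(beta)]))

print(round(c(free = p_free, paid = p_paid), 4))
```

Under these assumptions the free app gets a probability of about 0.71 of being popular, while the paid version drops close to zero, which illustrates how strongly the type_bin coefficient weighs in the fitted model.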
# Logistic regression
library(caTools) # sample.split
library(MASS) # stepAIC
df_popular %>%
head(5) %>%
gt()
| reviews | type_bin | size_num_imp | rating_imp | android_ver_imp | popular | category.xAUTO_AND_VEHICLES | category.xBEAUTY | category.xBOOKS_AND_REFERENCE | category.xBUSINESS | category.xCOMICS | category.xCOMMUNICATION | category.xDATING | category.xEDUCATION | category.xENTERTAINMENT | category.xEVENTS | category.xFAMILY | category.xFINANCE | category.xFOOD_AND_DRINK | category.xGAME | category.xHEALTH_AND_FITNESS | category.xHOUSE_AND_HOME | category.xLIBRARIES_AND_DEMO | category.xLIFESTYLE | category.xMAPS_AND_NAVIGATION | category.xMEDICAL | category.xNEWS_AND_MAGAZINES | category.xPARENTING | category.xPERSONALIZATION | category.xPHOTOGRAPHY | category.xPRODUCTIVITY | category.xSHOPPING | category.xSOCIAL | category.xSPORTS | category.xTOOLS | category.xTRAVEL_AND_LOCAL | category.xVIDEO_PLAYERS | category.xWEATHER | grupo_edades.x17. | grupo_edades.x4. | grupo_edades.x9. | ano_act.x2011 | ano_act.x2012 | ano_act.x2013 | ano_act.x2014 | ano_act.x2015 | ano_act.x2016 | ano_act.x2017 | ano_act.x2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 159.000 | 0 | 19000 | 4.1 | 4.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 967.000 | 0 | 14000 | 3.9 | 4.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6673.432 | 0 | 8700 | 4.7 | 4.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6673.432 | 0 | 25000 | 4.5 | 4.2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 967.000 | 0 | 2800 | 4.3 | 4.4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
# Split the data into train and validation
set.seed(1120)
indices = sample.split(df_popular$popular, SplitRatio = 0.7)
train = df_popular[indices,]
validation = df_popular[!(indices),]
# First model with all variables
model_1 = glm(popular ~ ., data = train, family = "binomial")
summary(model_1)
##
## Call:
## glm(formula = popular ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9577 -0.2944 0.0008 0.0014 3.6714
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 12.388887858 324.744348865 0.038
## reviews 0.002559911 0.000101855 25.133
## type_bin -8.221826174 0.673893559 -12.200
## size_num_imp 0.000004066 0.000004966 0.819
## rating_imp -0.564338309 0.086966948 -6.489
## android_ver_imp -0.076240167 0.080436830 -0.948
## category.xAUTO_AND_VEHICLES -0.467622446 0.566044964 -0.826
## category.xBEAUTY -0.894271265 0.668085372 -1.339
## category.xBOOKS_AND_REFERENCE -2.456852098 0.634579027 -3.872
## category.xBUSINESS -2.464637006 0.517624412 -4.761
## category.xCOMICS -2.802789589 0.929428167 -3.016
## category.xCOMMUNICATION -1.795538738 0.537497684 -3.341
## category.xDATING -1.923513933 0.673317979 -2.857
## category.xEDUCATION 0.155848040 0.632748728 0.246
## category.xENTERTAINMENT -0.607821759 1.145987823 -0.530
## category.xEVENTS -1.422233810 0.661540747 -2.150
## category.xFAMILY -1.583373049 0.420157067 -3.769
## category.xFINANCE -2.827766809 0.559323761 -5.056
## category.xFOOD_AND_DRINK -0.786501355 0.568652894 -1.383
## category.xGAME -1.745951570 0.485280079 -3.598
## category.xHEALTH_AND_FITNESS -1.671373480 0.533250944 -3.134
## category.xHOUSE_AND_HOME -0.581546032 0.664851951 -0.875
## category.xLIBRARIES_AND_DEMO -1.193724587 0.592359188 -2.015
## category.xLIFESTYLE -1.999204222 0.493663991 -4.050
## category.xMAPS_AND_NAVIGATION -2.099727251 0.661136272 -3.176
## category.xMEDICAL -2.389822075 0.519288377 -4.602
## category.xNEWS_AND_MAGAZINES -2.105287128 0.555544363 -3.790
## category.xPARENTING -0.283102277 0.588921919 -0.481
## category.xPERSONALIZATION -1.696018311 0.531797239 -3.189
## category.xPHOTOGRAPHY -0.847654859 0.541144957 -1.566
## category.xPRODUCTIVITY -1.673246295 0.515656212 -3.245
## category.xSHOPPING -1.347994511 0.612867024 -2.199
## category.xSOCIAL -2.959814870 0.704809941 -4.199
## category.xSPORTS -2.379326836 0.537474975 -4.427
## category.xTOOLS -1.412186155 0.436679753 -3.234
## category.xTRAVEL_AND_LOCAL -1.689209355 0.557713171 -3.029
## category.xVIDEO_PLAYERS -1.702548856 0.604605379 -2.816
## category.xWEATHER -1.400207522 0.980858657 -1.428
## grupo_edades.x17. 0.946976626 0.509417292 1.859
## grupo_edades.x4. 0.622005764 0.265810639 2.340
## grupo_edades.x9. 0.643597338 0.486276704 1.324
## ano_act.x2011 -10.857782891 324.745478384 -0.033
## ano_act.x2012 -10.879129249 324.744570532 -0.034
## ano_act.x2013 -11.941472038 324.744180856 -0.037
## ano_act.x2014 -12.144618633 324.743963334 -0.037
## ano_act.x2015 -11.412222280 324.743858282 -0.035
## ano_act.x2016 -11.934297781 324.743845680 -0.037
## ano_act.x2017 -11.959239558 324.743811356 -0.037
## ano_act.x2018 -11.380131206 324.743811550 -0.035
## Pr(>|z|)
## (Intercept) 0.969568
## reviews < 0.0000000000000002 ***
## type_bin < 0.0000000000000002 ***
## size_num_imp 0.412867
## rating_imp 0.0000000000863 ***
## android_ver_imp 0.343218
## category.xAUTO_AND_VEHICLES 0.408735
## category.xBEAUTY 0.180714
## category.xBOOKS_AND_REFERENCE 0.000108 ***
## category.xBUSINESS 0.0000019221750 ***
## category.xCOMICS 0.002565 **
## category.xCOMMUNICATION 0.000836 ***
## category.xDATING 0.004280 **
## category.xEDUCATION 0.805448
## category.xENTERTAINMENT 0.595841
## category.xEVENTS 0.031565 *
## category.xFAMILY 0.000164 ***
## category.xFINANCE 0.0000004288404 ***
## category.xFOOD_AND_DRINK 0.166636
## category.xGAME 0.000321 ***
## category.xHEALTH_AND_FITNESS 0.001723 **
## category.xHOUSE_AND_HOME 0.381737
## category.xLIBRARIES_AND_DEMO 0.043883 *
## category.xLIFESTYLE 0.0000512774927 ***
## category.xMAPS_AND_NAVIGATION 0.001494 **
## category.xMEDICAL 0.0000041823282 ***
## category.xNEWS_AND_MAGAZINES 0.000151 ***
## category.xPARENTING 0.630721
## category.xPERSONALIZATION 0.001427 **
## category.xPHOTOGRAPHY 0.117253
## category.xPRODUCTIVITY 0.001175 **
## category.xSHOPPING 0.027843 *
## category.xSOCIAL 0.0000267562824 ***
## category.xSPORTS 0.0000095614441 ***
## category.xTOOLS 0.001221 **
## category.xTRAVEL_AND_LOCAL 0.002455 **
## category.xVIDEO_PLAYERS 0.004863 **
## category.xWEATHER 0.153426
## grupo_edades.x17. 0.063036 .
## grupo_edades.x4. 0.019282 *
## grupo_edades.x9. 0.185662
## ano_act.x2011 0.973328
## ano_act.x2012 0.973275
## ano_act.x2013 0.970667
## ano_act.x2014 0.970168
## ano_act.x2015 0.971966
## ano_act.x2016 0.970684
## ano_act.x2017 0.970623
## ano_act.x2018 0.972045
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9975.4 on 7248 degrees of freedom
## Residual deviance: 2011.2 on 7200 degrees of freedom
## AIC: 2109.2
##
## Number of Fisher Scoring iterations: 11
The most significant variables are “reviews”, “type_bin”, “rating_imp” and “category.xFINANCE”. The table below lists the variables of model 1 that are statistically significant at the 10% level or better:
| Variable | P-value |
|---|---|
| reviews | < 0.0000000000000002 |
| type_bin | < 0.0000000000000002 |
| rating_imp | 0.0000000000863 |
| category.xBOOKS_AND_REFERENCE | 0.000108 |
| category.xBUSINESS | 0.0000019221750 |
| category.xCOMICS | 0.002565 |
| category.xCOMMUNICATION | 0.000836 |
| category.xDATING | 0.004280 |
| category.xEVENTS | 0.031565 |
| category.xFAMILY | 0.000164 |
| category.xFINANCE | 0.0000004288404 |
| category.xGAME | 0.000321 |
| category.xHEALTH_AND_FITNESS | 0.001723 |
| category.xLIBRARIES_AND_DEMO | 0.043883 |
| category.xLIFESTYLE | 0.0000512774927 |
| category.xMAPS_AND_NAVIGATION | 0.001494 |
| category.xMEDICAL | 0.0000041823282 |
| category.xNEWS_AND_MAGAZINES | 0.000151 |
| category.xPERSONALIZATION | 0.001427 |
| category.xPRODUCTIVITY | 0.001175 |
| category.xSHOPPING | 0.027843 |
| category.xSOCIAL | 0.0000267562824 |
| category.xSPORTS | 0.0000095614441 |
| category.xTOOLS | 0.001221 |
| category.xTRAVEL_AND_LOCAL | 0.002455 |
| category.xVIDEO_PLAYERS | 0.004863 |
| grupo_edades.x17. | 0.063036 |
| grupo_edades.x4. | 0.019282 |
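A table like the one above can be built directly from a fitted model rather than copied by hand. The following is a minimal sketch under stated assumptions: it fits a toy logistic regression on R's built-in `mtcars` data (not the report's `model_1`), and the names `toy_model` and `sig_table` as well as the 0.10 cutoff are illustrative choices, not taken from the original analysis.

```r
# Toy logistic regression on the built-in mtcars data (illustrative only;
# in the report the fitted object would be model_1).
toy_model <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# coef(summary(.)) returns a matrix with one row per coefficient and the
# columns Estimate, Std. Error, z value and Pr(>|z|).
coefs <- coef(summary(toy_model))
p_values <- coefs[, "Pr(>|z|)"]

# Keep the terms whose p-value is below the chosen cutoff (assumed: 0.10)
# and drop the intercept.
keep <- p_values < 0.10 & names(p_values) != "(Intercept)"
sig_table <- data.frame(
  Variable = names(p_values)[keep],
  P_value  = signif(p_values[keep], 3),
  row.names = NULL
)

# Sort from most to least significant.
sig_table <- sig_table[order(sig_table$P_value), ]
print(sig_table)
```

Applied to the report's model, the same steps would presumably start from `coef(summary(model_1))`; the cutoff should match whatever significance level the text states.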
Variable selection is done with stepAIC, an iterative procedure that
adds or removes variables in search of the subset that gives the best
model. With direction = "backward", as used here, it starts from the
full model and only removes variables. The selected model is the one
with the lowest AIC.
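The AIC values reported for model 1 and in the stepAIC trace can be checked by hand. For a logistic regression on a 0/1 response, the residual deviance equals $-2\log\hat{L}$, so the AIC is the deviance plus twice the number of estimated coefficients. The worked arithmetic below uses only figures printed in this report (deviances and degrees of freedom); the symbols $D$, $k$ and $n$ are notation introduced here for illustration.

$$
\begin{aligned}
\mathrm{AIC} &= -2\log\hat{L} + 2k = D + 2k \\
n &= 7248 + 1 = 7249, \qquad k = n - 7200 = 49 \\
\mathrm{AIC}_{\text{model 1}} &= 2011.2 + 2 \cdot 49 = 2109.2 \\
\mathrm{AIC}_{-\,\text{category.xEDUCATION}} &= 2011.3 + 2 \cdot 48 = 2107.3
\end{aligned}
$$

Dropping category.xEDUCATION barely changes the deviance but saves one parameter, so the AIC falls and that removal is the first backward step accepted.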
# Select the model with the lowest AIC by backward elimination
# (stepAIC() comes from the MASS package)
model_2 <- stepAIC(model_1, direction = "backward", steps = 100)
## Start: AIC=2109.23
## popular ~ reviews + type_bin + size_num_imp + rating_imp + android_ver_imp +
## category.xAUTO_AND_VEHICLES + category.xBEAUTY + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xEDUCATION + category.xENTERTAINMENT +
## category.xEVENTS + category.xFAMILY + category.xFINANCE +
## category.xFOOD_AND_DRINK + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xHOUSE_AND_HOME + category.xLIBRARIES_AND_DEMO +
## category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION + category.xMEDICAL +
## category.xNEWS_AND_MAGAZINES + category.xPARENTING + category.xPERSONALIZATION +
## category.xPHOTOGRAPHY + category.xPRODUCTIVITY + category.xSHOPPING +
## category.xSOCIAL + category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + category.xWEATHER + grupo_edades.x17. +
## grupo_edades.x4. + grupo_edades.x9. + ano_act.x2011 + ano_act.x2012 +
## ano_act.x2013 + ano_act.x2014 + ano_act.x2015 + ano_act.x2016 +
## ano_act.x2017 + ano_act.x2018
##
## Df Deviance AIC
## - category.xEDUCATION 1 2011.3 2107.3
## - category.xPARENTING 1 2011.5 2107.5
## - category.xENTERTAINMENT 1 2011.5 2107.5
## - ano_act.x2011 1 2011.5 2107.5
## - ano_act.x2012 1 2011.6 2107.6
## - ano_act.x2018 1 2011.8 2107.8
## - ano_act.x2015 1 2011.8 2107.8
## - size_num_imp 1 2011.9 2107.9
## - category.xAUTO_AND_VEHICLES 1 2011.9 2107.9
## - category.xHOUSE_AND_HOME 1 2012.0 2108.0
## - ano_act.x2013 1 2012.1 2108.1
## - ano_act.x2016 1 2012.1 2108.1
## - ano_act.x2017 1 2012.1 2108.1
## - android_ver_imp 1 2012.1 2108.1
## - ano_act.x2014 1 2012.2 2108.2
## - grupo_edades.x9. 1 2012.9 2108.9
## - category.xBEAUTY 1 2013.1 2109.1
## - category.xFOOD_AND_DRINK 1 2013.2 2109.2
## <none> 2011.2 2109.2
## - category.xWEATHER 1 2013.4 2109.4
## - category.xPHOTOGRAPHY 1 2013.7 2109.7
## - grupo_edades.x17. 1 2014.5 2110.5
## - category.xLIBRARIES_AND_DEMO 1 2015.4 2111.4
## - category.xEVENTS 1 2016.2 2112.2
## - category.xSHOPPING 1 2016.3 2112.3
## - grupo_edades.x4. 1 2017.1 2113.1
## - category.xDATING 1 2019.4 2115.4
## - category.xVIDEO_PLAYERS 1 2019.4 2115.4
## - category.xTRAVEL_AND_LOCAL 1 2020.5 2116.5
## - category.xTOOLS 1 2020.8 2116.8
## - category.xHEALTH_AND_FITNESS 1 2021.0 2117.0
## - category.xPERSONALIZATION 1 2021.3 2117.3
## - category.xPRODUCTIVITY 1 2021.5 2117.5
## - category.xCOMICS 1 2022.2 2118.2
## - category.xCOMMUNICATION 1 2022.3 2118.3
## - category.xMAPS_AND_NAVIGATION 1 2022.4 2118.4
## - category.xGAME 1 2023.4 2119.4
## - category.xFAMILY 1 2023.8 2119.8
## - category.xNEWS_AND_MAGAZINES 1 2025.6 2121.6
## - category.xLIFESTYLE 1 2026.7 2122.7
## - category.xBOOKS_AND_REFERENCE 1 2027.2 2123.2
## - category.xSPORTS 1 2030.1 2126.1
## - category.xSOCIAL 1 2031.1 2127.1
## - category.xMEDICAL 1 2031.6 2127.6
## - category.xBUSINESS 1 2033.0 2129.0
## - category.xFINANCE 1 2036.8 2132.8
## - rating_imp 1 2050.9 2146.9
## - type_bin 1 2324.2 2420.2
## - reviews 1 7989.8 8085.8
##
## Step: AIC=2107.29
## popular ~ reviews + type_bin + size_num_imp + rating_imp + android_ver_imp +
## category.xAUTO_AND_VEHICLES + category.xBEAUTY + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xENTERTAINMENT + category.xEVENTS +
## category.xFAMILY + category.xFINANCE + category.xFOOD_AND_DRINK +
## category.xGAME + category.xHEALTH_AND_FITNESS + category.xHOUSE_AND_HOME +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPARENTING +
## category.xPERSONALIZATION + category.xPHOTOGRAPHY + category.xPRODUCTIVITY +
## category.xSHOPPING + category.xSOCIAL + category.xSPORTS +
## category.xTOOLS + category.xTRAVEL_AND_LOCAL + category.xVIDEO_PLAYERS +
## category.xWEATHER + grupo_edades.x17. + grupo_edades.x4. +
## grupo_edades.x9. + ano_act.x2011 + ano_act.x2012 + ano_act.x2013 +
## ano_act.x2014 + ano_act.x2015 + ano_act.x2016 + ano_act.x2017 +
## ano_act.x2018
##
## Df Deviance AIC
## - ano_act.x2011 1 2011.6 2105.6
## - ano_act.x2012 1 2011.6 2105.6
## - category.xENTERTAINMENT 1 2011.6 2105.6
## - category.xPARENTING 1 2011.7 2105.7
## - ano_act.x2018 1 2011.8 2105.8
## - ano_act.x2015 1 2011.8 2105.8
## - size_num_imp 1 2012.0 2106.0
## - ano_act.x2013 1 2012.1 2106.1
## - ano_act.x2016 1 2012.1 2106.1
## - ano_act.x2017 1 2012.2 2106.2
## - android_ver_imp 1 2012.2 2106.2
## - ano_act.x2014 1 2012.3 2106.3
## - category.xAUTO_AND_VEHICLES 1 2012.4 2106.4
## - category.xHOUSE_AND_HOME 1 2012.4 2106.4
## - grupo_edades.x9. 1 2013.0 2107.0
## <none> 2011.3 2107.3
## - category.xBEAUTY 1 2013.9 2107.9
## - category.xWEATHER 1 2013.9 2107.9
## - category.xFOOD_AND_DRINK 1 2014.2 2108.2
## - grupo_edades.x17. 1 2014.7 2108.7
## - category.xPHOTOGRAPHY 1 2015.0 2109.0
## - category.xLIBRARIES_AND_DEMO 1 2017.1 2111.1
## - grupo_edades.x4. 1 2017.2 2111.2
## - category.xEVENTS 1 2018.1 2112.1
## - category.xSHOPPING 1 2018.4 2112.4
## - category.xDATING 1 2022.0 2116.0
## - category.xVIDEO_PLAYERS 1 2022.8 2116.8
## - category.xCOMICS 1 2024.5 2118.5
## - category.xTRAVEL_AND_LOCAL 1 2025.1 2119.1
## - category.xHEALTH_AND_FITNESS 1 2026.1 2120.1
## - category.xMAPS_AND_NAVIGATION 1 2026.5 2120.5
## - category.xPERSONALIZATION 1 2026.7 2120.7
## - category.xPRODUCTIVITY 1 2027.2 2121.2
## - category.xCOMMUNICATION 1 2028.1 2122.1
## - category.xTOOLS 1 2028.2 2122.2
## - category.xGAME 1 2031.3 2125.3
## - category.xNEWS_AND_MAGAZINES 1 2032.6 2126.6
## - category.xBOOKS_AND_REFERENCE 1 2033.7 2127.7
## - category.xFAMILY 1 2034.7 2128.7
## - category.xLIFESTYLE 1 2036.1 2130.1
## - category.xSOCIAL 1 2038.0 2132.0
## - category.xSPORTS 1 2039.7 2133.7
## - category.xMEDICAL 1 2043.5 2137.5
## - category.xBUSINESS 1 2046.0 2140.0
## - category.xFINANCE 1 2050.4 2144.4
## - rating_imp 1 2051.0 2145.0
## - type_bin 1 2324.8 2418.8
## - reviews 1 8013.7 8107.7
##
## Step: AIC=2105.6
## popular ~ reviews + type_bin + size_num_imp + rating_imp + android_ver_imp +
## category.xAUTO_AND_VEHICLES + category.xBEAUTY + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xENTERTAINMENT + category.xEVENTS +
## category.xFAMILY + category.xFINANCE + category.xFOOD_AND_DRINK +
## category.xGAME + category.xHEALTH_AND_FITNESS + category.xHOUSE_AND_HOME +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPARENTING +
## category.xPERSONALIZATION + category.xPHOTOGRAPHY + category.xPRODUCTIVITY +
## category.xSHOPPING + category.xSOCIAL + category.xSPORTS +
## category.xTOOLS + category.xTRAVEL_AND_LOCAL + category.xVIDEO_PLAYERS +
## category.xWEATHER + grupo_edades.x17. + grupo_edades.x4. +
## grupo_edades.x9. + ano_act.x2012 + ano_act.x2013 + ano_act.x2014 +
## ano_act.x2015 + ano_act.x2016 + ano_act.x2017 + ano_act.x2018
##
## Df Deviance AIC
## - ano_act.x2012 1 2011.6 2103.6
## - category.xENTERTAINMENT 1 2011.9 2103.9
## - category.xPARENTING 1 2012.0 2104.0
## - ano_act.x2018 1 2012.0 2104.0
## - ano_act.x2015 1 2012.1 2104.1
## - size_num_imp 1 2012.3 2104.3
## - android_ver_imp 1 2012.5 2104.5
## - category.xAUTO_AND_VEHICLES 1 2012.7 2104.7
## - category.xHOUSE_AND_HOME 1 2012.7 2104.7
## - ano_act.x2013 1 2012.8 2104.8
## - ano_act.x2016 1 2013.0 2105.0
## - ano_act.x2017 1 2013.1 2105.1
## - grupo_edades.x9. 1 2013.3 2105.3
## - ano_act.x2014 1 2013.4 2105.4
## <none> 2011.6 2105.6
## - category.xBEAUTY 1 2014.2 2106.2
## - category.xWEATHER 1 2014.2 2106.2
## - category.xFOOD_AND_DRINK 1 2014.5 2106.5
## - grupo_edades.x17. 1 2015.0 2107.0
## - category.xPHOTOGRAPHY 1 2015.3 2107.3
## - category.xLIBRARIES_AND_DEMO 1 2017.4 2109.4
## - grupo_edades.x4. 1 2017.5 2109.5
## - category.xEVENTS 1 2018.4 2110.4
## - category.xSHOPPING 1 2018.7 2110.7
## - category.xDATING 1 2022.3 2114.3
## - category.xVIDEO_PLAYERS 1 2023.1 2115.1
## - category.xCOMICS 1 2024.8 2116.8
## - category.xTRAVEL_AND_LOCAL 1 2025.4 2117.4
## - category.xHEALTH_AND_FITNESS 1 2026.4 2118.4
## - category.xMAPS_AND_NAVIGATION 1 2026.8 2118.8
## - category.xPERSONALIZATION 1 2027.0 2119.0
## - category.xPRODUCTIVITY 1 2027.5 2119.5
## - category.xCOMMUNICATION 1 2028.4 2120.4
## - category.xTOOLS 1 2028.6 2120.6
## - category.xGAME 1 2031.7 2123.7
## - category.xNEWS_AND_MAGAZINES 1 2032.9 2124.9
## - category.xBOOKS_AND_REFERENCE 1 2034.0 2126.0
## - category.xFAMILY 1 2034.9 2126.9
## - category.xLIFESTYLE 1 2036.4 2128.4
## - category.xSOCIAL 1 2038.3 2130.3
## - category.xSPORTS 1 2040.0 2132.0
## - category.xMEDICAL 1 2043.8 2135.8
## - category.xBUSINESS 1 2046.3 2138.3
## - category.xFINANCE 1 2050.7 2142.7
## - rating_imp 1 2051.3 2143.3
## - type_bin 1 2325.5 2417.5
## - reviews 1 8014.8 8106.8
##
## Step: AIC=2103.62
## popular ~ reviews + type_bin + size_num_imp + rating_imp + android_ver_imp +
## category.xAUTO_AND_VEHICLES + category.xBEAUTY + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xENTERTAINMENT + category.xEVENTS +
## category.xFAMILY + category.xFINANCE + category.xFOOD_AND_DRINK +
## category.xGAME + category.xHEALTH_AND_FITNESS + category.xHOUSE_AND_HOME +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPARENTING +
## category.xPERSONALIZATION + category.xPHOTOGRAPHY + category.xPRODUCTIVITY +
## category.xSHOPPING + category.xSOCIAL + category.xSPORTS +
## category.xTOOLS + category.xTRAVEL_AND_LOCAL + category.xVIDEO_PLAYERS +
## category.xWEATHER + grupo_edades.x17. + grupo_edades.x4. +
## grupo_edades.x9. + ano_act.x2013 + ano_act.x2014 + ano_act.x2015 +
## ano_act.x2016 + ano_act.x2017 + ano_act.x2018
##
## Df Deviance AIC
## - category.xENTERTAINMENT 1 2012.0 2102.0
## - category.xPARENTING 1 2012.0 2102.0
## - size_num_imp 1 2012.3 2102.3
## - ano_act.x2018 1 2012.4 2102.4
## - ano_act.x2015 1 2012.4 2102.4
## - android_ver_imp 1 2012.5 2102.5
## - category.xAUTO_AND_VEHICLES 1 2012.7 2102.7
## - category.xHOUSE_AND_HOME 1 2012.7 2102.7
## - grupo_edades.x9. 1 2013.3 2103.3
## - ano_act.x2013 1 2013.6 2103.6
## <none> 2011.6 2103.6
## - category.xBEAUTY 1 2014.2 2104.2
## - category.xWEATHER 1 2014.2 2104.2
## - category.xFOOD_AND_DRINK 1 2014.5 2104.5
## - ano_act.x2016 1 2014.5 2104.5
## - ano_act.x2017 1 2014.8 2104.8
## - grupo_edades.x17. 1 2015.0 2105.0
## - ano_act.x2014 1 2015.1 2105.1
## - category.xPHOTOGRAPHY 1 2015.4 2105.4
## - category.xLIBRARIES_AND_DEMO 1 2017.5 2107.5
## - grupo_edades.x4. 1 2017.5 2107.5
## - category.xEVENTS 1 2018.4 2108.4
## - category.xSHOPPING 1 2018.7 2108.7
## - category.xDATING 1 2022.3 2112.3
## - category.xVIDEO_PLAYERS 1 2023.1 2113.1
## - category.xCOMICS 1 2024.8 2114.8
## - category.xTRAVEL_AND_LOCAL 1 2025.4 2115.4
## - category.xHEALTH_AND_FITNESS 1 2026.4 2116.4
## - category.xMAPS_AND_NAVIGATION 1 2026.8 2116.8
## - category.xPERSONALIZATION 1 2027.0 2117.0
## - category.xPRODUCTIVITY 1 2027.5 2117.5
## - category.xCOMMUNICATION 1 2028.5 2118.5
## - category.xTOOLS 1 2028.6 2118.6
## - category.xGAME 1 2031.7 2121.7
## - category.xNEWS_AND_MAGAZINES 1 2032.9 2122.9
## - category.xBOOKS_AND_REFERENCE 1 2034.0 2124.0
## - category.xFAMILY 1 2035.0 2125.0
## - category.xLIFESTYLE 1 2036.4 2126.4
## - category.xSOCIAL 1 2038.3 2128.3
## - category.xSPORTS 1 2040.0 2130.0
## - category.xMEDICAL 1 2043.8 2133.8
## - category.xBUSINESS 1 2046.3 2136.3
## - category.xFINANCE 1 2050.7 2140.7
## - rating_imp 1 2051.3 2141.3
## - type_bin 1 2325.5 2415.5
## - reviews 1 8014.9 8104.9
##
## Step: AIC=2101.96
## popular ~ reviews + type_bin + size_num_imp + rating_imp + android_ver_imp +
## category.xAUTO_AND_VEHICLES + category.xBEAUTY + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xEVENTS + category.xFAMILY +
## category.xFINANCE + category.xFOOD_AND_DRINK + category.xGAME +
## category.xHEALTH_AND_FITNESS + category.xHOUSE_AND_HOME +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPARENTING +
## category.xPERSONALIZATION + category.xPHOTOGRAPHY + category.xPRODUCTIVITY +
## category.xSHOPPING + category.xSOCIAL + category.xSPORTS +
## category.xTOOLS + category.xTRAVEL_AND_LOCAL + category.xVIDEO_PLAYERS +
## category.xWEATHER + grupo_edades.x17. + grupo_edades.x4. +
## grupo_edades.x9. + ano_act.x2013 + ano_act.x2014 + ano_act.x2015 +
## ano_act.x2016 + ano_act.x2017 + ano_act.x2018
##
## Df Deviance AIC
## - category.xPARENTING 1 2012.3 2100.3
## - size_num_imp 1 2012.6 2100.6
## - ano_act.x2018 1 2012.7 2100.7
## - ano_act.x2015 1 2012.8 2100.8
## - category.xAUTO_AND_VEHICLES 1 2012.9 2100.9
## - android_ver_imp 1 2012.9 2100.9
## - category.xHOUSE_AND_HOME 1 2012.9 2100.9
## - grupo_edades.x9. 1 2013.7 2101.7
## - ano_act.x2013 1 2013.9 2101.9
## <none> 2012.0 2102.0
## - category.xBEAUTY 1 2014.3 2102.3
## - category.xWEATHER 1 2014.4 2102.4
## - category.xFOOD_AND_DRINK 1 2014.6 2102.6
## - ano_act.x2016 1 2014.8 2102.8
## - ano_act.x2017 1 2015.2 2103.2
## - ano_act.x2014 1 2015.4 2103.4
## - category.xPHOTOGRAPHY 1 2015.4 2103.4
## - grupo_edades.x17. 1 2015.4 2103.4
## - category.xLIBRARIES_AND_DEMO 1 2017.5 2105.5
## - grupo_edades.x4. 1 2018.2 2106.2
## - category.xEVENTS 1 2018.4 2106.4
## - category.xSHOPPING 1 2018.7 2106.7
## - category.xDATING 1 2022.3 2110.3
## - category.xVIDEO_PLAYERS 1 2023.1 2111.1
## - category.xCOMICS 1 2024.8 2112.8
## - category.xTRAVEL_AND_LOCAL 1 2025.4 2113.4
## - category.xHEALTH_AND_FITNESS 1 2026.4 2114.4
## - category.xMAPS_AND_NAVIGATION 1 2026.8 2114.8
## - category.xPERSONALIZATION 1 2027.1 2115.1
## - category.xPRODUCTIVITY 1 2027.6 2115.6
## - category.xCOMMUNICATION 1 2028.5 2116.5
## - category.xTOOLS 1 2028.8 2116.8
## - category.xGAME 1 2032.0 2120.0
## - category.xNEWS_AND_MAGAZINES 1 2033.0 2121.0
## - category.xBOOKS_AND_REFERENCE 1 2034.1 2122.1
## - category.xFAMILY 1 2035.5 2123.5
## - category.xLIFESTYLE 1 2036.8 2124.8
## - category.xSOCIAL 1 2038.5 2126.5
## - category.xSPORTS 1 2040.3 2128.3
## - category.xMEDICAL 1 2044.4 2132.4
## - category.xBUSINESS 1 2046.9 2134.9
## - category.xFINANCE 1 2051.4 2139.4
## - rating_imp 1 2051.4 2139.4
## - type_bin 1 2325.6 2413.6
## - reviews 1 8025.4 8113.4
##
## Step: AIC=2100.28
## popular ~ reviews + type_bin + size_num_imp + rating_imp + android_ver_imp +
## category.xAUTO_AND_VEHICLES + category.xBEAUTY + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xEVENTS + category.xFAMILY +
## category.xFINANCE + category.xFOOD_AND_DRINK + category.xGAME +
## category.xHEALTH_AND_FITNESS + category.xHOUSE_AND_HOME +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPHOTOGRAPHY + category.xPRODUCTIVITY + category.xSHOPPING +
## category.xSOCIAL + category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + category.xWEATHER + grupo_edades.x17. +
## grupo_edades.x4. + grupo_edades.x9. + ano_act.x2013 + ano_act.x2014 +
## ano_act.x2015 + ano_act.x2016 + ano_act.x2017 + ano_act.x2018
##
## Df Deviance AIC
## - size_num_imp 1 2012.9 2098.9
## - category.xAUTO_AND_VEHICLES 1 2013.0 2099.0
## - category.xHOUSE_AND_HOME 1 2013.0 2099.0
## - ano_act.x2018 1 2013.1 2099.1
## - ano_act.x2015 1 2013.1 2099.1
## - android_ver_imp 1 2013.3 2099.3
## - grupo_edades.x9. 1 2014.0 2100.0
## - ano_act.x2013 1 2014.2 2100.2
## <none> 2012.3 2100.3
## - category.xBEAUTY 1 2014.3 2100.3
## - category.xWEATHER 1 2014.5 2100.5
## - category.xFOOD_AND_DRINK 1 2014.6 2100.6
## - ano_act.x2016 1 2015.2 2101.2
## - ano_act.x2017 1 2015.4 2101.4
## - category.xPHOTOGRAPHY 1 2015.4 2101.4
## - ano_act.x2014 1 2015.7 2101.7
## - grupo_edades.x17. 1 2015.8 2101.8
## - category.xLIBRARIES_AND_DEMO 1 2017.6 2103.6
## - category.xEVENTS 1 2018.5 2104.5
## - grupo_edades.x4. 1 2018.5 2104.5
## - category.xSHOPPING 1 2018.9 2104.9
## - category.xDATING 1 2022.4 2108.4
## - category.xVIDEO_PLAYERS 1 2023.5 2109.5
## - category.xCOMICS 1 2024.9 2110.9
## - category.xTRAVEL_AND_LOCAL 1 2026.3 2112.3
## - category.xMAPS_AND_NAVIGATION 1 2027.4 2113.4
## - category.xHEALTH_AND_FITNESS 1 2027.5 2113.5
## - category.xPERSONALIZATION 1 2028.2 2114.2
## - category.xPRODUCTIVITY 1 2029.0 2115.0
## - category.xCOMMUNICATION 1 2029.8 2115.8
## - category.xTOOLS 1 2031.5 2117.5
## - category.xGAME 1 2034.3 2120.3
## - category.xNEWS_AND_MAGAZINES 1 2034.8 2120.8
## - category.xBOOKS_AND_REFERENCE 1 2035.5 2121.5
## - category.xSOCIAL 1 2039.9 2125.9
## - category.xLIFESTYLE 1 2039.9 2125.9
## - category.xFAMILY 1 2040.7 2126.7
## - category.xSPORTS 1 2043.1 2129.1
## - category.xMEDICAL 1 2048.8 2134.8
## - rating_imp 1 2051.9 2137.9
## - category.xBUSINESS 1 2051.9 2137.9
## - category.xFINANCE 1 2056.3 2142.3
## - type_bin 1 2325.8 2411.8
## - reviews 1 8040.5 8126.5
##
## Step: AIC=2098.95
## popular ~ reviews + type_bin + rating_imp + android_ver_imp +
## category.xAUTO_AND_VEHICLES + category.xBEAUTY + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xEVENTS + category.xFAMILY +
## category.xFINANCE + category.xFOOD_AND_DRINK + category.xGAME +
## category.xHEALTH_AND_FITNESS + category.xHOUSE_AND_HOME +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPHOTOGRAPHY + category.xPRODUCTIVITY + category.xSHOPPING +
## category.xSOCIAL + category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + category.xWEATHER + grupo_edades.x17. +
## grupo_edades.x4. + grupo_edades.x9. + ano_act.x2013 + ano_act.x2014 +
## ano_act.x2015 + ano_act.x2016 + ano_act.x2017 + ano_act.x2018
##
## Df Deviance AIC
## - category.xAUTO_AND_VEHICLES 1 2013.6 2097.6
## - ano_act.x2018 1 2013.6 2097.6
## - ano_act.x2015 1 2013.7 2097.7
## - category.xHOUSE_AND_HOME 1 2013.7 2097.7
## - android_ver_imp 1 2013.8 2097.8
## - grupo_edades.x9. 1 2014.7 2098.7
## - ano_act.x2013 1 2014.9 2098.9
## <none> 2012.9 2098.9
## - category.xWEATHER 1 2015.1 2099.1
## - category.xBEAUTY 1 2015.1 2099.1
## - category.xFOOD_AND_DRINK 1 2015.2 2099.2
## - ano_act.x2016 1 2015.7 2099.7
## - ano_act.x2017 1 2015.9 2099.9
## - ano_act.x2014 1 2016.3 2100.3
## - category.xPHOTOGRAPHY 1 2016.3 2100.3
## - grupo_edades.x17. 1 2016.4 2100.4
## - category.xLIBRARIES_AND_DEMO 1 2018.4 2102.4
## - grupo_edades.x4. 1 2018.9 2102.9
## - category.xEVENTS 1 2019.2 2103.2
## - category.xSHOPPING 1 2019.8 2103.8
## - category.xDATING 1 2023.3 2107.3
## - category.xVIDEO_PLAYERS 1 2024.3 2108.3
## - category.xCOMICS 1 2025.8 2109.8
## - category.xTRAVEL_AND_LOCAL 1 2026.7 2110.7
## - category.xMAPS_AND_NAVIGATION 1 2028.2 2112.2
## - category.xHEALTH_AND_FITNESS 1 2028.3 2112.3
## - category.xPERSONALIZATION 1 2029.3 2113.3
## - category.xPRODUCTIVITY 1 2029.9 2113.9
## - category.xCOMMUNICATION 1 2030.8 2114.8
## - category.xTOOLS 1 2032.8 2116.8
## - category.xGAME 1 2034.3 2118.3
## - category.xNEWS_AND_MAGAZINES 1 2036.1 2120.1
## - category.xBOOKS_AND_REFERENCE 1 2036.5 2120.5
## - category.xLIFESTYLE 1 2040.9 2124.9
## - category.xFAMILY 1 2041.2 2125.2
## - category.xSOCIAL 1 2041.2 2125.2
## - category.xSPORTS 1 2043.8 2127.8
## - category.xMEDICAL 1 2049.2 2133.2
## - rating_imp 1 2053.0 2137.0
## - category.xBUSINESS 1 2053.0 2137.0
## - category.xFINANCE 1 2057.7 2141.7
## - type_bin 1 2325.8 2409.8
## - reviews 1 8328.4 8412.4
##
## Step: AIC=2097.59
## popular ~ reviews + type_bin + rating_imp + android_ver_imp +
## category.xBEAUTY + category.xBOOKS_AND_REFERENCE + category.xBUSINESS +
## category.xCOMICS + category.xCOMMUNICATION + category.xDATING +
## category.xEVENTS + category.xFAMILY + category.xFINANCE +
## category.xFOOD_AND_DRINK + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xHOUSE_AND_HOME + category.xLIBRARIES_AND_DEMO +
## category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION + category.xMEDICAL +
## category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPHOTOGRAPHY + category.xPRODUCTIVITY + category.xSHOPPING +
## category.xSOCIAL + category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + category.xWEATHER + grupo_edades.x17. +
## grupo_edades.x4. + grupo_edades.x9. + ano_act.x2013 + ano_act.x2014 +
## ano_act.x2015 + ano_act.x2016 + ano_act.x2017 + ano_act.x2018
##
## Df Deviance AIC
## - category.xHOUSE_AND_HOME 1 2014.1 2096.1
## - ano_act.x2018 1 2014.3 2096.3
## - ano_act.x2015 1 2014.3 2096.3
## - android_ver_imp 1 2014.4 2096.4
## - grupo_edades.x9. 1 2015.3 2097.3
## - category.xBEAUTY 1 2015.3 2097.3
## - category.xFOOD_AND_DRINK 1 2015.4 2097.4
## - category.xWEATHER 1 2015.5 2097.5
## - ano_act.x2013 1 2015.5 2097.5
## <none> 2013.6 2097.6
## - ano_act.x2016 1 2016.3 2098.3
## - category.xPHOTOGRAPHY 1 2016.4 2098.4
## - ano_act.x2017 1 2016.6 2098.6
## - ano_act.x2014 1 2016.9 2098.9
## - grupo_edades.x17. 1 2017.0 2099.0
## - category.xLIBRARIES_AND_DEMO 1 2018.4 2100.4
## - category.xEVENTS 1 2019.3 2101.3
## - grupo_edades.x4. 1 2019.5 2101.5
## - category.xSHOPPING 1 2019.8 2101.8
## - category.xDATING 1 2023.3 2105.3
## - category.xVIDEO_PLAYERS 1 2024.3 2106.3
## - category.xCOMICS 1 2025.8 2107.8
## - category.xTRAVEL_AND_LOCAL 1 2026.8 2108.8
## - category.xMAPS_AND_NAVIGATION 1 2028.3 2110.3
## - category.xHEALTH_AND_FITNESS 1 2028.6 2110.6
## - category.xPERSONALIZATION 1 2029.7 2111.7
## - category.xPRODUCTIVITY 1 2030.4 2112.4
## - category.xCOMMUNICATION 1 2031.2 2113.2
## - category.xTOOLS 1 2034.4 2116.4
## - category.xGAME 1 2035.3 2117.3
## - category.xNEWS_AND_MAGAZINES 1 2036.8 2118.8
## - category.xBOOKS_AND_REFERENCE 1 2036.9 2118.9
## - category.xSOCIAL 1 2041.6 2123.6
## - category.xLIFESTYLE 1 2042.5 2124.5
## - category.xFAMILY 1 2044.7 2126.7
## - category.xSPORTS 1 2045.1 2127.1
## - category.xMEDICAL 1 2051.7 2133.7
## - rating_imp 1 2053.3 2135.3
## - category.xBUSINESS 1 2056.1 2138.1
## - category.xFINANCE 1 2060.5 2142.5
## - type_bin 1 2326.6 2408.6
## - reviews 1 8344.1 8426.1
##
## Step: AIC=2096.1
## popular ~ reviews + type_bin + rating_imp + android_ver_imp +
## category.xBEAUTY + category.xBOOKS_AND_REFERENCE + category.xBUSINESS +
## category.xCOMICS + category.xCOMMUNICATION + category.xDATING +
## category.xEVENTS + category.xFAMILY + category.xFINANCE +
## category.xFOOD_AND_DRINK + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPHOTOGRAPHY + category.xPRODUCTIVITY + category.xSHOPPING +
## category.xSOCIAL + category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + category.xWEATHER + grupo_edades.x17. +
## grupo_edades.x4. + grupo_edades.x9. + ano_act.x2013 + ano_act.x2014 +
## ano_act.x2015 + ano_act.x2016 + ano_act.x2017 + ano_act.x2018
##
## Df Deviance AIC
## - ano_act.x2018 1 2014.8 2094.8
## - ano_act.x2015 1 2014.9 2094.9
## - android_ver_imp 1 2014.9 2094.9
## - category.xFOOD_AND_DRINK 1 2015.6 2095.6
## - category.xBEAUTY 1 2015.6 2095.6
## - category.xWEATHER 1 2015.8 2095.8
## - grupo_edades.x9. 1 2015.8 2095.8
## - ano_act.x2013 1 2016.0 2096.0
## <none> 2014.1 2096.1
## - category.xPHOTOGRAPHY 1 2016.5 2096.5
## - ano_act.x2016 1 2016.8 2096.8
## - ano_act.x2017 1 2017.1 2097.1
## - ano_act.x2014 1 2017.3 2097.3
## - grupo_edades.x17. 1 2017.6 2097.6
## - category.xLIBRARIES_AND_DEMO 1 2018.6 2098.6
## - category.xEVENTS 1 2019.4 2099.4
## - category.xSHOPPING 1 2019.9 2099.9
## - grupo_edades.x4. 1 2020.0 2100.0
## - category.xDATING 1 2023.4 2103.4
## - category.xVIDEO_PLAYERS 1 2024.4 2104.4
## - category.xCOMICS 1 2025.9 2105.9
## - category.xTRAVEL_AND_LOCAL 1 2026.8 2106.8
## - category.xMAPS_AND_NAVIGATION 1 2028.3 2108.3
## - category.xHEALTH_AND_FITNESS 1 2028.6 2108.6
## - category.xPERSONALIZATION 1 2029.7 2109.7
## - category.xPRODUCTIVITY 1 2030.4 2110.4
## - category.xCOMMUNICATION 1 2031.2 2111.2
## - category.xTOOLS 1 2034.8 2114.8
## - category.xGAME 1 2035.5 2115.5
## - category.xNEWS_AND_MAGAZINES 1 2036.9 2116.9
## - category.xBOOKS_AND_REFERENCE 1 2036.9 2116.9
## - category.xSOCIAL 1 2041.7 2121.7
## - category.xLIFESTYLE 1 2043.0 2123.0
## - category.xSPORTS 1 2045.3 2125.3
## - category.xFAMILY 1 2046.0 2126.0
## - category.xMEDICAL 1 2052.4 2132.4
## - rating_imp 1 2053.6 2133.6
## - category.xBUSINESS 1 2057.0 2137.0
## - category.xFINANCE 1 2061.3 2141.3
## - type_bin 1 2326.9 2406.9
## - reviews 1 8344.1 8424.1
##
## Step: AIC=2094.78
## popular ~ reviews + type_bin + rating_imp + android_ver_imp +
## category.xBEAUTY + category.xBOOKS_AND_REFERENCE + category.xBUSINESS +
## category.xCOMICS + category.xCOMMUNICATION + category.xDATING +
## category.xEVENTS + category.xFAMILY + category.xFINANCE +
## category.xFOOD_AND_DRINK + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPHOTOGRAPHY + category.xPRODUCTIVITY + category.xSHOPPING +
## category.xSOCIAL + category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + category.xWEATHER + grupo_edades.x17. +
## grupo_edades.x4. + grupo_edades.x9. + ano_act.x2013 + ano_act.x2014 +
## ano_act.x2015 + ano_act.x2016 + ano_act.x2017
##
## Df Deviance AIC
## - ano_act.x2015 1 2014.9 2092.9
## - ano_act.x2013 1 2016.1 2094.1
## - android_ver_imp 1 2016.2 2094.2
## - category.xFOOD_AND_DRINK 1 2016.3 2094.3
## - category.xBEAUTY 1 2016.3 2094.3
## - category.xWEATHER 1 2016.5 2094.5
## - grupo_edades.x9. 1 2016.6 2094.6
## <none> 2014.8 2094.8
## - category.xPHOTOGRAPHY 1 2017.2 2095.2
## - grupo_edades.x17. 1 2018.3 2096.3
## - ano_act.x2014 1 2019.0 2097.0
## - category.xLIBRARIES_AND_DEMO 1 2019.4 2097.4
## - category.xEVENTS 1 2020.0 2098.0
## - category.xSHOPPING 1 2020.3 2098.3
## - grupo_edades.x4. 1 2020.8 2098.8
## - ano_act.x2016 1 2021.3 2099.3
## - category.xDATING 1 2024.1 2102.1
## - category.xVIDEO_PLAYERS 1 2024.9 2102.9
## - category.xCOMICS 1 2026.5 2104.5
## - category.xTRAVEL_AND_LOCAL 1 2027.4 2105.4
## - category.xMAPS_AND_NAVIGATION 1 2028.9 2106.9
## - category.xHEALTH_AND_FITNESS 1 2029.1 2107.1
## - ano_act.x2017 1 2029.3 2107.3
## - category.xPERSONALIZATION 1 2030.3 2108.3
## - category.xPRODUCTIVITY 1 2031.0 2109.0
## - category.xCOMMUNICATION 1 2031.7 2109.7
## - category.xTOOLS 1 2035.0 2113.0
## - category.xGAME 1 2036.0 2114.0
## - category.xNEWS_AND_MAGAZINES 1 2037.5 2115.5
## - category.xBOOKS_AND_REFERENCE 1 2037.6 2115.6
## - category.xSOCIAL 1 2042.3 2120.3
## - category.xLIFESTYLE 1 2043.4 2121.4
## - category.xSPORTS 1 2045.9 2123.9
## - category.xFAMILY 1 2046.4 2124.4
## - category.xMEDICAL 1 2052.7 2130.7
## - rating_imp 1 2054.4 2132.4
## - category.xBUSINESS 1 2057.3 2135.3
## - category.xFINANCE 1 2061.5 2139.5
## - type_bin 1 2327.8 2405.8
## - reviews 1 8345.0 8423.0
##
## Step: AIC=2092.86
## popular ~ reviews + type_bin + rating_imp + android_ver_imp +
## category.xBEAUTY + category.xBOOKS_AND_REFERENCE + category.xBUSINESS +
## category.xCOMICS + category.xCOMMUNICATION + category.xDATING +
## category.xEVENTS + category.xFAMILY + category.xFINANCE +
## category.xFOOD_AND_DRINK + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPHOTOGRAPHY + category.xPRODUCTIVITY + category.xSHOPPING +
## category.xSOCIAL + category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + category.xWEATHER + grupo_edades.x17. +
## grupo_edades.x4. + grupo_edades.x9. + ano_act.x2013 + ano_act.x2014 +
## ano_act.x2016 + ano_act.x2017
##
## Df Deviance AIC
## - ano_act.x2013 1 2016.1 2092.1
## - android_ver_imp 1 2016.2 2092.2
## - category.xFOOD_AND_DRINK 1 2016.4 2092.4
## - category.xBEAUTY 1 2016.4 2092.4
## - category.xWEATHER 1 2016.6 2092.6
## - grupo_edades.x9. 1 2016.7 2092.7
## <none> 2014.9 2092.9
## - category.xPHOTOGRAPHY 1 2017.3 2093.3
## - grupo_edades.x17. 1 2018.4 2094.4
## - ano_act.x2014 1 2019.0 2095.0
## - category.xLIBRARIES_AND_DEMO 1 2019.5 2095.5
## - category.xEVENTS 1 2020.1 2096.1
## - category.xSHOPPING 1 2020.4 2096.4
## - grupo_edades.x4. 1 2020.9 2096.9
## - ano_act.x2016 1 2021.3 2097.3
## - category.xDATING 1 2024.2 2100.2
## - category.xVIDEO_PLAYERS 1 2025.4 2101.4
## - category.xCOMICS 1 2026.6 2102.6
## - category.xTRAVEL_AND_LOCAL 1 2027.6 2103.6
## - category.xMAPS_AND_NAVIGATION 1 2029.1 2105.1
## - category.xHEALTH_AND_FITNESS 1 2029.4 2105.4
## - ano_act.x2017 1 2029.5 2105.5
## - category.xPERSONALIZATION 1 2030.5 2106.5
## - category.xPRODUCTIVITY 1 2031.3 2107.3
## - category.xCOMMUNICATION 1 2032.0 2108.0
## - category.xTOOLS 1 2035.4 2111.4
## - category.xGAME 1 2036.7 2112.7
## - category.xNEWS_AND_MAGAZINES 1 2037.7 2113.7
## - category.xBOOKS_AND_REFERENCE 1 2037.8 2113.8
## - category.xSOCIAL 1 2042.5 2118.5
## - category.xLIFESTYLE 1 2043.7 2119.7
## - category.xSPORTS 1 2046.1 2122.1
## - category.xFAMILY 1 2046.9 2122.9
## - category.xMEDICAL 1 2053.0 2129.0
## - rating_imp 1 2054.4 2130.4
## - category.xBUSINESS 1 2057.7 2133.7
## - category.xFINANCE 1 2061.7 2137.7
## - type_bin 1 2330.0 2406.0
## - reviews 1 8399.3 8475.3
##
## Step: AIC=2092.1
## popular ~ reviews + type_bin + rating_imp + android_ver_imp +
## category.xBEAUTY + category.xBOOKS_AND_REFERENCE + category.xBUSINESS +
## category.xCOMICS + category.xCOMMUNICATION + category.xDATING +
## category.xEVENTS + category.xFAMILY + category.xFINANCE +
## category.xFOOD_AND_DRINK + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPHOTOGRAPHY + category.xPRODUCTIVITY + category.xSHOPPING +
## category.xSOCIAL + category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + category.xWEATHER + grupo_edades.x17. +
## grupo_edades.x4. + grupo_edades.x9. + ano_act.x2014 + ano_act.x2016 +
## ano_act.x2017
##
## Df Deviance AIC
## - android_ver_imp 1 2016.9 2090.9
## - category.xFOOD_AND_DRINK 1 2017.6 2091.6
## - category.xBEAUTY 1 2017.6 2091.6
## - category.xWEATHER 1 2017.8 2091.8
## - grupo_edades.x9. 1 2018.0 2092.0
## <none> 2016.1 2092.1
## - category.xPHOTOGRAPHY 1 2018.5 2092.5
## - grupo_edades.x17. 1 2019.7 2093.7
## - ano_act.x2014 1 2019.8 2093.8
## - category.xLIBRARIES_AND_DEMO 1 2020.5 2094.5
## - category.xEVENTS 1 2021.3 2095.3
## - category.xSHOPPING 1 2021.6 2095.6
## - grupo_edades.x4. 1 2021.9 2095.9
## - ano_act.x2016 1 2022.1 2096.1
## - category.xDATING 1 2025.6 2099.6
## - category.xVIDEO_PLAYERS 1 2027.1 2101.1
## - category.xCOMICS 1 2027.9 2101.9
## - category.xTRAVEL_AND_LOCAL 1 2028.8 2102.8
## - ano_act.x2017 1 2030.1 2104.1
## - category.xMAPS_AND_NAVIGATION 1 2030.4 2104.4
## - category.xHEALTH_AND_FITNESS 1 2030.8 2104.8
## - category.xPERSONALIZATION 1 2032.6 2106.6
## - category.xPRODUCTIVITY 1 2032.7 2106.7
## - category.xCOMMUNICATION 1 2033.7 2107.7
## - category.xTOOLS 1 2036.9 2110.9
## - category.xGAME 1 2038.2 2112.2
## - category.xBOOKS_AND_REFERENCE 1 2038.9 2112.9
## - category.xNEWS_AND_MAGAZINES 1 2038.9 2112.9
## - category.xSOCIAL 1 2043.9 2117.9
## - category.xLIFESTYLE 1 2045.2 2119.2
## - category.xSPORTS 1 2047.7 2121.7
## - category.xFAMILY 1 2048.0 2122.0
## - category.xMEDICAL 1 2054.7 2128.7
## - rating_imp 1 2055.6 2129.6
## - category.xBUSINESS 1 2059.0 2133.0
## - category.xFINANCE 1 2063.2 2137.2
## - type_bin 1 2331.9 2405.9
## - reviews 1 8417.1 8491.1
##
## Step: AIC=2090.9
## popular ~ reviews + type_bin + rating_imp + category.xBEAUTY +
## category.xBOOKS_AND_REFERENCE + category.xBUSINESS + category.xCOMICS +
## category.xCOMMUNICATION + category.xDATING + category.xEVENTS +
## category.xFAMILY + category.xFINANCE + category.xFOOD_AND_DRINK +
## category.xGAME + category.xHEALTH_AND_FITNESS + category.xLIBRARIES_AND_DEMO +
## category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION + category.xMEDICAL +
## category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPHOTOGRAPHY + category.xPRODUCTIVITY + category.xSHOPPING +
## category.xSOCIAL + category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + category.xWEATHER + grupo_edades.x17. +
## grupo_edades.x4. + grupo_edades.x9. + ano_act.x2014 + ano_act.x2016 +
## ano_act.x2017
##
## Df Deviance AIC
## - category.xBEAUTY 1 2018.4 2090.4
## - category.xFOOD_AND_DRINK 1 2018.4 2090.4
## - category.xWEATHER 1 2018.6 2090.6
## - grupo_edades.x9. 1 2018.8 2090.8
## <none> 2016.9 2090.9
## - category.xPHOTOGRAPHY 1 2019.3 2091.3
## - ano_act.x2014 1 2020.0 2092.0
## - grupo_edades.x17. 1 2020.5 2092.5
## - category.xLIBRARIES_AND_DEMO 1 2020.8 2092.8
## - category.xEVENTS 1 2022.2 2094.2
## - category.xSHOPPING 1 2022.4 2094.4
## - ano_act.x2016 1 2022.4 2094.4
## - grupo_edades.x4. 1 2022.7 2094.7
## - category.xDATING 1 2026.6 2098.6
## - category.xVIDEO_PLAYERS 1 2028.0 2100.0
## - category.xCOMICS 1 2028.8 2100.8
## - category.xTRAVEL_AND_LOCAL 1 2030.0 2102.0
## - ano_act.x2017 1 2030.4 2102.4
## - category.xMAPS_AND_NAVIGATION 1 2031.4 2103.4
## - category.xHEALTH_AND_FITNESS 1 2032.0 2104.0
## - category.xPERSONALIZATION 1 2033.6 2105.6
## - category.xPRODUCTIVITY 1 2033.7 2105.7
## - category.xCOMMUNICATION 1 2034.3 2106.3
## - category.xTOOLS 1 2037.6 2109.6
## - category.xGAME 1 2038.5 2110.5
## - category.xBOOKS_AND_REFERENCE 1 2039.4 2111.4
## - category.xNEWS_AND_MAGAZINES 1 2040.0 2112.0
## - category.xSOCIAL 1 2044.5 2116.5
## - category.xLIFESTYLE 1 2046.0 2118.0
## - category.xFAMILY 1 2048.6 2120.6
## - category.xSPORTS 1 2048.9 2120.9
## - category.xMEDICAL 1 2056.3 2128.3
## - rating_imp 1 2056.5 2128.5
## - category.xBUSINESS 1 2060.2 2132.2
## - category.xFINANCE 1 2064.9 2136.9
## - type_bin 1 2331.9 2403.9
## - reviews 1 8417.2 8489.2
##
## Step: AIC=2090.38
## popular ~ reviews + type_bin + rating_imp + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xEVENTS + category.xFAMILY +
## category.xFINANCE + category.xFOOD_AND_DRINK + category.xGAME +
## category.xHEALTH_AND_FITNESS + category.xLIBRARIES_AND_DEMO +
## category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION + category.xMEDICAL +
## category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPHOTOGRAPHY + category.xPRODUCTIVITY + category.xSHOPPING +
## category.xSOCIAL + category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + category.xWEATHER + grupo_edades.x17. +
## grupo_edades.x4. + grupo_edades.x9. + ano_act.x2014 + ano_act.x2016 +
## ano_act.x2017
##
## Df Deviance AIC
## - category.xFOOD_AND_DRINK 1 2019.5 2089.5
## - category.xWEATHER 1 2019.9 2089.9
## - category.xPHOTOGRAPHY 1 2020.2 2090.2
## - grupo_edades.x9. 1 2020.2 2090.2
## <none> 2018.4 2090.4
## - ano_act.x2014 1 2021.4 2091.4
## - category.xLIBRARIES_AND_DEMO 1 2021.6 2091.6
## - grupo_edades.x17. 1 2021.9 2091.9
## - category.xEVENTS 1 2023.0 2093.0
## - category.xSHOPPING 1 2023.2 2093.2
## - ano_act.x2016 1 2023.8 2093.8
## - grupo_edades.x4. 1 2024.4 2094.4
## - category.xDATING 1 2027.1 2097.1
## - category.xVIDEO_PLAYERS 1 2028.4 2098.4
## - category.xCOMICS 1 2029.4 2099.4
## - category.xTRAVEL_AND_LOCAL 1 2030.3 2100.3
## - ano_act.x2017 1 2031.8 2101.8
## - category.xMAPS_AND_NAVIGATION 1 2031.8 2101.8
## - category.xHEALTH_AND_FITNESS 1 2032.2 2102.2
## - category.xPERSONALIZATION 1 2033.7 2103.7
## - category.xPRODUCTIVITY 1 2033.9 2103.9
## - category.xCOMMUNICATION 1 2034.4 2104.4
## - category.xTOOLS 1 2037.6 2107.6
## - category.xGAME 1 2038.5 2108.5
## - category.xBOOKS_AND_REFERENCE 1 2039.6 2109.6
## - category.xNEWS_AND_MAGAZINES 1 2040.0 2110.0
## - category.xSOCIAL 1 2044.6 2114.6
## - category.xLIFESTYLE 1 2046.0 2116.0
## - category.xFAMILY 1 2048.9 2118.9
## - category.xSPORTS 1 2048.9 2118.9
## - category.xMEDICAL 1 2056.3 2126.3
## - rating_imp 1 2058.4 2128.4
## - category.xBUSINESS 1 2060.3 2130.3
## - category.xFINANCE 1 2065.0 2135.0
## - type_bin 1 2333.1 2403.1
## - reviews 1 8429.9 8499.9
##
## Step: AIC=2089.47
## popular ~ reviews + type_bin + rating_imp + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xEVENTS + category.xFAMILY +
## category.xFINANCE + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPHOTOGRAPHY + category.xPRODUCTIVITY + category.xSHOPPING +
## category.xSOCIAL + category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + category.xWEATHER + grupo_edades.x17. +
## grupo_edades.x4. + grupo_edades.x9. + ano_act.x2014 + ano_act.x2016 +
## ano_act.x2017
##
## Df Deviance AIC
## - category.xWEATHER 1 2020.8 2088.8
## - category.xPHOTOGRAPHY 1 2020.9 2088.9
## - grupo_edades.x9. 1 2021.2 2089.2
## <none> 2019.5 2089.5
## - category.xLIBRARIES_AND_DEMO 1 2022.1 2090.1
## - ano_act.x2014 1 2022.5 2090.5
## - grupo_edades.x17. 1 2023.0 2091.0
## - category.xEVENTS 1 2023.5 2091.5
## - category.xSHOPPING 1 2023.6 2091.6
## - ano_act.x2016 1 2024.9 2092.9
## - grupo_edades.x4. 1 2025.5 2093.5
## - category.xDATING 1 2027.5 2095.5
## - category.xVIDEO_PLAYERS 1 2028.6 2096.6
## - category.xCOMICS 1 2029.9 2097.9
## - category.xTRAVEL_AND_LOCAL 1 2030.4 2098.4
## - category.xMAPS_AND_NAVIGATION 1 2031.9 2099.9
## - category.xHEALTH_AND_FITNESS 1 2032.3 2100.3
## - ano_act.x2017 1 2032.9 2100.9
## - category.xPERSONALIZATION 1 2033.7 2101.7
## - category.xPRODUCTIVITY 1 2033.9 2101.9
## - category.xCOMMUNICATION 1 2034.4 2102.4
## - category.xTOOLS 1 2037.7 2105.7
## - category.xGAME 1 2038.6 2106.6
## - category.xBOOKS_AND_REFERENCE 1 2039.6 2107.6
## - category.xNEWS_AND_MAGAZINES 1 2040.0 2108.0
## - category.xSOCIAL 1 2044.6 2112.6
## - category.xLIFESTYLE 1 2046.1 2114.1
## - category.xSPORTS 1 2049.0 2117.0
## - category.xFAMILY 1 2049.9 2117.9
## - category.xMEDICAL 1 2056.6 2124.6
## - rating_imp 1 2059.2 2127.2
## - category.xBUSINESS 1 2060.8 2128.8
## - category.xFINANCE 1 2065.4 2133.4
## - type_bin 1 2334.2 2402.2
## - reviews 1 8431.7 8499.7
##
## Step: AIC=2088.75
## popular ~ reviews + type_bin + rating_imp + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xEVENTS + category.xFAMILY +
## category.xFINANCE + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPHOTOGRAPHY + category.xPRODUCTIVITY + category.xSHOPPING +
## category.xSOCIAL + category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + grupo_edades.x17. + grupo_edades.x4. +
## grupo_edades.x9. + ano_act.x2014 + ano_act.x2016 + ano_act.x2017
##
## Df Deviance AIC
## - category.xPHOTOGRAPHY 1 2021.9 2087.9
## - grupo_edades.x9. 1 2022.5 2088.5
## <none> 2020.8 2088.8
## - category.xLIBRARIES_AND_DEMO 1 2023.2 2089.2
## - ano_act.x2014 1 2024.0 2090.0
## - grupo_edades.x17. 1 2024.4 2090.4
## - category.xEVENTS 1 2024.5 2090.5
## - category.xSHOPPING 1 2024.6 2090.6
## - ano_act.x2016 1 2026.2 2092.2
## - grupo_edades.x4. 1 2026.8 2092.8
## - category.xDATING 1 2028.5 2094.5
## - category.xVIDEO_PLAYERS 1 2029.5 2095.5
## - category.xCOMICS 1 2030.8 2096.8
## - category.xTRAVEL_AND_LOCAL 1 2031.1 2097.1
## - category.xMAPS_AND_NAVIGATION 1 2032.7 2098.7
## - category.xHEALTH_AND_FITNESS 1 2032.9 2098.9
## - ano_act.x2017 1 2034.3 2100.3
## - category.xPERSONALIZATION 1 2034.3 2100.3
## - category.xPRODUCTIVITY 1 2034.4 2100.4
## - category.xCOMMUNICATION 1 2035.0 2101.0
## - category.xTOOLS 1 2038.0 2104.0
## - category.xGAME 1 2038.9 2104.9
## - category.xBOOKS_AND_REFERENCE 1 2040.2 2106.2
## - category.xNEWS_AND_MAGAZINES 1 2040.5 2106.5
## - category.xSOCIAL 1 2045.2 2111.2
## - category.xLIFESTYLE 1 2046.4 2112.4
## - category.xSPORTS 1 2049.3 2115.3
## - category.xFAMILY 1 2050.0 2116.0
## - category.xMEDICAL 1 2056.8 2122.8
## - rating_imp 1 2060.4 2126.4
## - category.xBUSINESS 1 2061.0 2127.0
## - category.xFINANCE 1 2065.6 2131.6
## - type_bin 1 2335.2 2401.2
## - reviews 1 8439.4 8505.4
##
## Step: AIC=2087.93
## popular ~ reviews + type_bin + rating_imp + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xEVENTS + category.xFAMILY +
## category.xFINANCE + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPRODUCTIVITY + category.xSHOPPING + category.xSOCIAL +
## category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + grupo_edades.x17. + grupo_edades.x4. +
## grupo_edades.x9. + ano_act.x2014 + ano_act.x2016 + ano_act.x2017
##
## Df Deviance AIC
## - grupo_edades.x9. 1 2023.7 2087.7
## - category.xLIBRARIES_AND_DEMO 1 2023.9 2087.9
## <none> 2021.9 2087.9
## - category.xEVENTS 1 2025.2 2089.2
## - category.xSHOPPING 1 2025.2 2089.2
## - ano_act.x2014 1 2025.3 2089.3
## - grupo_edades.x17. 1 2025.6 2089.6
## - ano_act.x2016 1 2027.6 2091.6
## - grupo_edades.x4. 1 2027.9 2091.9
## - category.xDATING 1 2029.1 2093.1
## - category.xVIDEO_PLAYERS 1 2029.8 2093.8
## - category.xTRAVEL_AND_LOCAL 1 2031.3 2095.3
## - category.xCOMICS 1 2031.4 2095.4
## - category.xMAPS_AND_NAVIGATION 1 2033.0 2097.0
## - category.xHEALTH_AND_FITNESS 1 2033.0 2097.0
## - category.xPERSONALIZATION 1 2034.4 2098.4
## - category.xPRODUCTIVITY 1 2034.5 2098.5
## - category.xCOMMUNICATION 1 2035.1 2099.1
## - ano_act.x2017 1 2036.2 2100.2
## - category.xTOOLS 1 2038.0 2102.0
## - category.xGAME 1 2039.0 2103.0
## - category.xBOOKS_AND_REFERENCE 1 2040.3 2104.3
## - category.xNEWS_AND_MAGAZINES 1 2040.6 2104.6
## - category.xSOCIAL 1 2045.3 2109.3
## - category.xLIFESTYLE 1 2046.5 2110.5
## - category.xSPORTS 1 2049.3 2113.3
## - category.xFAMILY 1 2050.7 2114.7
## - category.xMEDICAL 1 2057.0 2121.0
## - rating_imp 1 2061.0 2125.0
## - category.xBUSINESS 1 2061.3 2125.3
## - category.xFINANCE 1 2065.8 2129.8
## - type_bin 1 2338.3 2402.3
## - reviews 1 8449.6 8513.6
##
## Step: AIC=2087.7
## popular ~ reviews + type_bin + rating_imp + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xEVENTS + category.xFAMILY +
## category.xFINANCE + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xLIBRARIES_AND_DEMO + category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION +
## category.xMEDICAL + category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPRODUCTIVITY + category.xSHOPPING + category.xSOCIAL +
## category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + grupo_edades.x17. + grupo_edades.x4. +
## ano_act.x2014 + ano_act.x2016 + ano_act.x2017
##
## Df Deviance AIC
## - category.xLIBRARIES_AND_DEMO 1 2025.6 2087.6
## <none> 2023.7 2087.7
## - grupo_edades.x17. 1 2026.5 2088.5
## - category.xEVENTS 1 2026.9 2088.9
## - ano_act.x2014 1 2027.0 2089.0
## - category.xSHOPPING 1 2027.0 2089.0
## - grupo_edades.x4. 1 2027.9 2089.9
## - ano_act.x2016 1 2029.6 2091.6
## - category.xDATING 1 2031.0 2093.0
## - category.xVIDEO_PLAYERS 1 2031.4 2093.4
## - category.xTRAVEL_AND_LOCAL 1 2033.0 2095.0
## - category.xCOMICS 1 2033.5 2095.5
## - category.xHEALTH_AND_FITNESS 1 2034.7 2096.7
## - category.xMAPS_AND_NAVIGATION 1 2034.7 2096.7
## - category.xPRODUCTIVITY 1 2036.2 2098.2
## - category.xPERSONALIZATION 1 2036.4 2098.4
## - category.xCOMMUNICATION 1 2037.0 2099.0
## - ano_act.x2017 1 2038.2 2100.2
## - category.xTOOLS 1 2039.6 2101.6
## - category.xGAME 1 2041.3 2103.3
## - category.xNEWS_AND_MAGAZINES 1 2041.4 2103.4
## - category.xBOOKS_AND_REFERENCE 1 2041.8 2103.8
## - category.xLIFESTYLE 1 2047.9 2109.9
## - category.xSOCIAL 1 2048.5 2110.5
## - category.xSPORTS 1 2050.9 2112.9
## - category.xFAMILY 1 2052.4 2114.4
## - category.xMEDICAL 1 2058.4 2120.4
## - rating_imp 1 2062.7 2124.7
## - category.xBUSINESS 1 2063.1 2125.1
## - category.xFINANCE 1 2067.4 2129.4
## - type_bin 1 2338.3 2400.3
## - reviews 1 8467.7 8529.7
##
## Step: AIC=2087.63
## popular ~ reviews + type_bin + rating_imp + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xEVENTS + category.xFAMILY +
## category.xFINANCE + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION + category.xMEDICAL +
## category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPRODUCTIVITY + category.xSHOPPING + category.xSOCIAL +
## category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + grupo_edades.x17. + grupo_edades.x4. +
## ano_act.x2014 + ano_act.x2016 + ano_act.x2017
##
## Df Deviance AIC
## <none> 2025.6 2087.6
## - category.xEVENTS 1 2028.3 2088.3
## - category.xSHOPPING 1 2028.4 2088.4
## - grupo_edades.x17. 1 2028.5 2088.5
## - ano_act.x2014 1 2028.8 2088.8
## - grupo_edades.x4. 1 2029.7 2089.7
## - ano_act.x2016 1 2031.4 2091.4
## - category.xDATING 1 2032.4 2092.4
## - category.xVIDEO_PLAYERS 1 2032.5 2092.5
## - category.xTRAVEL_AND_LOCAL 1 2033.9 2093.9
## - category.xCOMICS 1 2034.9 2094.9
## - category.xHEALTH_AND_FITNESS 1 2035.5 2095.5
## - category.xMAPS_AND_NAVIGATION 1 2035.7 2095.7
## - category.xPRODUCTIVITY 1 2036.9 2096.9
## - category.xPERSONALIZATION 1 2037.2 2097.2
## - category.xCOMMUNICATION 1 2037.7 2097.7
## - category.xTOOLS 1 2039.9 2099.9
## - ano_act.x2017 1 2041.2 2101.2
## - category.xGAME 1 2041.7 2101.7
## - category.xNEWS_AND_MAGAZINES 1 2042.0 2102.0
## - category.xBOOKS_AND_REFERENCE 1 2042.6 2102.6
## - category.xLIFESTYLE 1 2048.1 2108.1
## - category.xSOCIAL 1 2049.2 2109.2
## - category.xSPORTS 1 2051.3 2111.3
## - category.xFAMILY 1 2052.4 2112.4
## - category.xMEDICAL 1 2058.5 2118.5
## - category.xBUSINESS 1 2063.1 2123.1
## - rating_imp 1 2065.0 2125.0
## - category.xFINANCE 1 2067.5 2127.5
## - type_bin 1 2339.4 2399.4
## - reviews 1 8485.9 8545.9
summary(model_2)
##
## Call:
## glm(formula = popular ~ reviews + type_bin + rating_imp + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xDATING + category.xEVENTS + category.xFAMILY +
## category.xFINANCE + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION + category.xMEDICAL +
## category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPRODUCTIVITY + category.xSHOPPING + category.xSOCIAL +
## category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
## category.xVIDEO_PLAYERS + grupo_edades.x17. + grupo_edades.x4. +
## ano_act.x2014 + ano_act.x2016 + ano_act.x2017, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9596 -0.2980 0.0008 0.0014 3.5644
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.3256196 0.4434563 0.734
## reviews 0.0025651 0.0001013 25.327
## type_bin -8.1454415 0.6692997 -12.170
## rating_imp -0.5584931 0.0863395 -6.469
## category.xBOOKS_AND_REFERENCE -1.8691146 0.5150485 -3.629
## category.xBUSINESS -1.8937094 0.3577243 -5.294
## category.xCOMICS -2.2917798 0.8529493 -2.687
## category.xCOMMUNICATION -1.2415271 0.3864343 -3.213
## category.xDATING -1.4075128 0.5492249 -2.563
## category.xEVENTS -0.8455897 0.5501968 -1.537
## category.xFAMILY -0.9795207 0.1891670 -5.178
## category.xFINANCE -2.2607102 0.4142731 -5.457
## category.xGAME -1.1327329 0.2948277 -3.842
## category.xHEALTH_AND_FITNESS -1.1060827 0.3787786 -2.920
## category.xLIFESTYLE -1.4094100 0.3202885 -4.400
## category.xMAPS_AND_NAVIGATION -1.5309902 0.5450130 -2.809
## category.xMEDICAL -1.8111919 0.3584871 -5.052
## category.xNEWS_AND_MAGAZINES -1.4965058 0.4085322 -3.663
## category.xPERSONALIZATION -1.1838232 0.3758839 -3.149
## category.xPRODUCTIVITY -1.1110392 0.3546183 -3.133
## category.xSHOPPING -0.7558444 0.4820156 -1.568
## category.xSOCIAL -2.4638362 0.5980034 -4.120
## category.xSPORTS -1.8203606 0.3865600 -4.709
## category.xTOOLS -0.8327901 0.2240891 -3.716
## category.xTRAVEL_AND_LOCAL -1.0944288 0.4098848 -2.670
## category.xVIDEO_PLAYERS -1.1553295 0.4685611 -2.466
## grupo_edades.x17. 0.8384517 0.4880515 1.718
## grupo_edades.x4. 0.4528372 0.2300680 1.968
## ano_act.x2014 -0.6664322 0.3882616 -1.716
## ano_act.x2016 -0.5307236 0.2291425 -2.316
## ano_act.x2017 -0.5889548 0.1538303 -3.829
## Pr(>|z|)
## (Intercept) 0.462780
## reviews < 0.0000000000000002 ***
## type_bin < 0.0000000000000002 ***
## rating_imp 0.0000000000989 ***
## category.xBOOKS_AND_REFERENCE 0.000285 ***
## category.xBUSINESS 0.0000001198211 ***
## category.xCOMICS 0.007212 **
## category.xCOMMUNICATION 0.001315 **
## category.xDATING 0.010385 *
## category.xEVENTS 0.124321
## category.xFAMILY 0.0000002241897 ***
## category.xFINANCE 0.0000000484104 ***
## category.xGAME 0.000122 ***
## category.xHEALTH_AND_FITNESS 0.003499 **
## category.xLIFESTYLE 0.0000108031699 ***
## category.xMAPS_AND_NAVIGATION 0.004968 **
## category.xMEDICAL 0.0000004364725 ***
## category.xNEWS_AND_MAGAZINES 0.000249 ***
## category.xPERSONALIZATION 0.001636 **
## category.xPRODUCTIVITY 0.001730 **
## category.xSHOPPING 0.116860
## category.xSOCIAL 0.0000378701585 ***
## category.xSPORTS 0.0000024877917 ***
## category.xTOOLS 0.000202 ***
## category.xTRAVEL_AND_LOCAL 0.007583 **
## category.xVIDEO_PLAYERS 0.013675 *
## grupo_edades.x17. 0.085804 .
## grupo_edades.x4. 0.049036 *
## ano_act.x2014 0.086079 .
## ano_act.x2016 0.020551 *
## ano_act.x2017 0.000129 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9975.4 on 7248 degrees of freedom
## Residual deviance: 2025.6 on 7218 degrees of freedom
## AIC: 2087.6
##
## Number of Fisher Scoring iterations: 9
The predictors that stand out for the size of their coefficients or for their very low p-values include type_bin, reviews, category.xBOOKS_AND_REFERENCE, category.xBUSINESS, category.xCOMICS, category.xFINANCE, category.xSPORTS, category.xVIDEO_PLAYERS and category.xSOCIAL. By contrast, category.xSHOPPING, category.xEVENTS, grupo_edades.x17. and ano_act.x2014 are not statistically significant at the 5% level, so they could be dropped from the analysis.
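To make the significance reading above concrete, the following sketch recomputes the Wald z statistics, two-sided p-values and odds ratios for a few rows of the coefficient table. The numbers are copied from the `summary(model_2)` output; the object names (`est`, `se`, `wald_table`) are illustrative and are not part of the original analysis.

```r
# Estimates and standard errors copied from the summary(model_2) output above.
est <- c(reviews            = 0.0025651,
         type_bin           = -8.1454415,
         category.xEVENTS   = -0.8455897,
         category.xSHOPPING = -0.7558444)
se  <- c(reviews            = 0.0001013,
         type_bin           = 0.6692997,
         category.xEVENTS   = 0.5501968,
         category.xSHOPPING = 0.4820156)

z_value    <- est / se                  # Wald z statistic
p_value    <- 2 * pnorm(-abs(z_value))  # two-sided p-value
odds_ratio <- exp(est)                  # multiplicative change in the odds per unit increase

# Illustrative summary table (hypothetical name, not from the report)
wald_table <- data.frame(
  z                = round(z_value, 3),
  p_value          = signif(p_value, 3),
  odds_ratio       = signif(odds_ratio, 3),
  significant_5pct = p_value < 0.05
)
print(wald_table)
```

On the odds scale, one additional unit of reviews multiplies the odds of being popular by exp(0.0025651), about 1.0026. How large a "unit" is depends on how reviews was scaled upstream, which this excerpt does not show.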
The variance inflation factor (VIF) will be used to identify and remove redundant predictors, that is, variables that are highly collinear with one another. The VIF measures how much the variance of a coefficient estimate is inflated because its predictor can be explained by the other predictors in the model. A VIF of 1 indicates that the predictor is uncorrelated with the others, while values above 1 indicate increasing correlation with them.
car::vif(model_2)
## reviews type_bin
## 1.590630 1.473004
## rating_imp category.xBOOKS_AND_REFERENCE
## 1.076671 1.077197
## category.xBUSINESS category.xCOMICS
## 1.144185 1.049969
## category.xCOMMUNICATION category.xDATING
## 1.128051 2.705575
## category.xEVENTS category.xFAMILY
## 1.059053 1.771376
## category.xFINANCE category.xGAME
## 1.115258 1.359045
## category.xHEALTH_AND_FITNESS category.xLIFESTYLE
## 1.132575 1.195243
## category.xMAPS_AND_NAVIGATION category.xMEDICAL
## 1.060665 1.146465
## category.xNEWS_AND_MAGAZINES category.xPERSONALIZATION
## 1.111806 1.150098
## category.xPRODUCTIVITY category.xSHOPPING
## 1.147852 1.073984
## category.xSOCIAL category.xSPORTS
## 1.096141 1.152683
## category.xTOOLS category.xTRAVEL_AND_LOCAL
## 1.487034 1.115659
## category.xVIDEO_PLAYERS grupo_edades.x17.
## 1.087282 3.034021
## grupo_edades.x4. ano_act.x2014
## 1.719947 1.046145
## ano_act.x2016 ano_act.x2017
## 1.038117 1.040019
Most variables have a VIF between 1 and 2, which suggests that there is no serious multicollinearity in the model; the only values above 2 are category.xDATING (2.71) and grupo_edades.x17. (3.03). As a rule of thumb, “a predictor with a VIF of 2 or less is generally considered safe and can be assumed not to be correlated with the other predictor variables”.
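For intuition about what a VIF measures, the sketch below reproduces it by hand in the linear case, where the VIF of predictor j equals 1 / (1 - R²_j) and R²_j comes from regressing that predictor on all the others. The data are simulated and the names (`x1`, `x2`, `x3`, `vif_manual`) are hypothetical. As far as I know, `car::vif()` on a `glm` works from the estimated coefficient covariance matrix, so its values for a logistic model need not match this linear version exactly.

```r
set.seed(123)

# Simulated predictors (illustrative only, not the Play Store data)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # independent of the other two
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from the auxiliary regression
# of predictor j on the remaining predictors
vif_manual <- sapply(names(X), function(v) {
  aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 3))
```

The same quantities are the diagonal of the inverse correlation matrix of the predictors, `diag(solve(cor(X)))`, which gives a quick cross-check without fitting any auxiliary model.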
model_2 is now refitted without the variables that were not statistically significant (category.xSHOPPING, category.xEVENTS, grupo_edades.x17. and ano_act.x2014), giving model_3:
model_3 <- glm(formula = popular ~ reviews + type_bin + rating_imp + category.xBOOKS_AND_REFERENCE +
                 category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION + category.xFAMILY +
                 category.xFINANCE + category.xGAME + category.xHEALTH_AND_FITNESS +
                 category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION + category.xMEDICAL +
                 category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
                 category.xPRODUCTIVITY + category.xSOCIAL +
                 category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
                 category.xVIDEO_PLAYERS + ano_act.x2016 + ano_act.x2017 + category.xDATING + grupo_edades.x4.,
               family = "binomial", data = train)
summary(model_3)
##
## Call:
## glm(formula = popular ~ reviews + type_bin + rating_imp + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xFAMILY + category.xFINANCE + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION + category.xMEDICAL +
## category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPRODUCTIVITY + category.xSOCIAL + category.xSPORTS +
## category.xTOOLS + category.xTRAVEL_AND_LOCAL + category.xVIDEO_PLAYERS +
## ano_act.x2016 + ano_act.x2017 + category.xDATING + grupo_edades.x4.,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9751 -0.2980 0.0007 0.0014 3.6067
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.3485971 0.4262971 0.818
## reviews 0.0025602 0.0001007 25.430
## type_bin -8.1446070 0.6686982 -12.180
## rating_imp -0.5617996 0.0858054 -6.547
## category.xBOOKS_AND_REFERENCE -1.7550526 0.5092750 -3.446
## category.xBUSINESS -1.7855512 0.3553920 -5.024
## category.xCOMICS -2.1214243 0.8690952 -2.441
## category.xCOMMUNICATION -1.1165792 0.3827254 -2.917
## category.xFAMILY -0.8625122 0.1809668 -4.766
## category.xFINANCE -2.1159912 0.4099957 -5.161
## category.xGAME -1.0672780 0.2885420 -3.699
## category.xHEALTH_AND_FITNESS -0.9741456 0.3750889 -2.597
## category.xLIFESTYLE -1.2788397 0.3153649 -4.055
## category.xMAPS_AND_NAVIGATION -1.3891143 0.5418198 -2.564
## category.xMEDICAL -1.6758247 0.3538254 -4.736
## category.xNEWS_AND_MAGAZINES -1.3960021 0.4058702 -3.440
## category.xPERSONALIZATION -1.1119307 0.3718230 -2.990
## category.xPRODUCTIVITY -0.9934965 0.3492538 -2.845
## category.xSOCIAL -2.3615727 0.5976708 -3.951
## category.xSPORTS -1.6807647 0.3820530 -4.399
## category.xTOOLS -0.7128830 0.2171869 -3.282
## category.xTRAVEL_AND_LOCAL -0.9554333 0.4060406 -2.353
## category.xVIDEO_PLAYERS -1.0451383 0.4699643 -2.224
## ano_act.x2016 -0.5075043 0.2282188 -2.224
## ano_act.x2017 -0.5687543 0.1527835 -3.723
## category.xDATING -0.6378535 0.3919167 -1.628
## grupo_edades.x4. 0.2986621 0.2082569 1.434
## Pr(>|z|)
## (Intercept) 0.413510
## reviews < 0.0000000000000002 ***
## type_bin < 0.0000000000000002 ***
## rating_imp 0.0000000000586 ***
## category.xBOOKS_AND_REFERENCE 0.000569 ***
## category.xBUSINESS 0.0000005056043 ***
## category.xCOMICS 0.014648 *
## category.xCOMMUNICATION 0.003529 **
## category.xFAMILY 0.0000018779244 ***
## category.xFINANCE 0.0000002456230 ***
## category.xGAME 0.000217 ***
## category.xHEALTH_AND_FITNESS 0.009401 **
## category.xLIFESTYLE 0.0000501104418 ***
## category.xMAPS_AND_NAVIGATION 0.010354 *
## category.xMEDICAL 0.0000021765121 ***
## category.xNEWS_AND_MAGAZINES 0.000583 ***
## category.xPERSONALIZATION 0.002785 **
## category.xPRODUCTIVITY 0.004446 **
## category.xSOCIAL 0.0000777298835 ***
## category.xSPORTS 0.0000108601821 ***
## category.xTOOLS 0.001029 **
## category.xTRAVEL_AND_LOCAL 0.018620 *
## category.xVIDEO_PLAYERS 0.026157 *
## ano_act.x2016 0.026164 *
## ano_act.x2017 0.000197 ***
## category.xDATING 0.103626
## grupo_edades.x4. 0.151542
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9975.4 on 7248 degrees of freedom
## Residual deviance: 2036.7 on 7222 degrees of freedom
## AIC: 2090.7
##
## Number of Fisher Scoring iterations: 9
car::vif(model_3)
## reviews type_bin
## 1.569844 1.462363
## rating_imp category.xBOOKS_AND_REFERENCE
## 1.072786 1.066314
## category.xBUSINESS category.xCOMICS
## 1.116923 1.040181
## category.xCOMMUNICATION category.xFAMILY
## 1.101597 1.626308
## category.xFINANCE category.xGAME
## 1.095646 1.303909
## category.xHEALTH_AND_FITNESS category.xLIFESTYLE
## 1.109061 1.160791
## category.xMAPS_AND_NAVIGATION category.xMEDICAL
## 1.050052 1.121012
## category.xNEWS_AND_MAGAZINES category.xPERSONALIZATION
## 1.091261 1.110079
## category.xPRODUCTIVITY category.xSOCIAL
## 1.122120 1.082080
## category.xSPORTS category.xTOOLS
## 1.128971 1.404621
## category.xTRAVEL_AND_LOCAL category.xVIDEO_PLAYERS
## 1.096217 1.070644
## ano_act.x2016 ano_act.x2017
## 1.033930 1.031542
## category.xDATING grupo_edades.x4.
## 1.419840 1.421911
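As a complementary check on the reduction from model_2 to model_3, the residual deviances and degrees of freedom reported in the two summaries above can be compared with a likelihood-ratio (deviance) test, and the AIC values can be recovered from the same numbers. This is only a sketch based on the rounded figures printed in the output, and it assumes a 0/1 response, for which AIC equals the residual deviance plus twice the number of estimated coefficients. The variable names are illustrative.

```r
# Values copied from the summary() output of model_2 and model_3 above.
n_obs       <- 7248 + 1                 # null-deviance degrees of freedom + 1
dev_model_2 <- 2025.6; df_model_2 <- 7218
dev_model_3 <- 2036.7; df_model_3 <- 7222

# Likelihood-ratio (deviance) test for the four predictors dropped in model_3
lr_stat <- dev_model_3 - dev_model_2    # increase in residual deviance
lr_df   <- df_model_3 - df_model_2      # number of coefficients removed
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# AIC = residual deviance + 2 * (number of estimated coefficients)
aic_model_2 <- dev_model_2 + 2 * (n_obs - df_model_2)
aic_model_3 <- dev_model_3 + 2 * (n_obs - df_model_3)

print(c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value))
print(c(aic_model_2 = aic_model_2, aic_model_3 = aic_model_3))
```

With the fitted objects at hand, `anova(model_2, model_3, test = "Chisq")` performs the same comparison directly, without the rounding of the printed deviances.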
All the variance inflation factors lie between 1 and 2. However, category.xDATING and grupo_edades.x4. are not statistically significant at the 5% level, so a model_4 is fitted without these variables:
model_4 <- glm(formula = popular ~ reviews + type_bin + rating_imp + category.xBOOKS_AND_REFERENCE +
                 category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION + category.xFAMILY +
                 category.xFINANCE + category.xGAME + category.xHEALTH_AND_FITNESS +
                 category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION + category.xMEDICAL +
                 category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
                 category.xPRODUCTIVITY + category.xSOCIAL +
                 category.xSPORTS + category.xTOOLS + category.xTRAVEL_AND_LOCAL +
                 category.xVIDEO_PLAYERS + ano_act.x2016 + ano_act.x2017,
               family = "binomial", data = train)
summary(model_4)
##
## Call:
## glm(formula = popular ~ reviews + type_bin + rating_imp + category.xBOOKS_AND_REFERENCE +
## category.xBUSINESS + category.xCOMICS + category.xCOMMUNICATION +
## category.xFAMILY + category.xFINANCE + category.xGAME + category.xHEALTH_AND_FITNESS +
## category.xLIFESTYLE + category.xMAPS_AND_NAVIGATION + category.xMEDICAL +
## category.xNEWS_AND_MAGAZINES + category.xPERSONALIZATION +
## category.xPRODUCTIVITY + category.xSOCIAL + category.xSPORTS +
## category.xTOOLS + category.xTRAVEL_AND_LOCAL + category.xVIDEO_PLAYERS +
## ano_act.x2016 + ano_act.x2017, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9794 -0.3035 0.0008 0.0014 3.6311
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.45072109 0.36798100 1.225
## reviews 0.00253901 0.00009965 25.480
## type_bin -8.06055102 0.66402834 -12.139
## rating_imp -0.55001164 0.08574821 -6.414
## category.xBOOKS_AND_REFERENCE -1.59813146 0.50349542 -3.174
## category.xBUSINESS -1.63688534 0.35066341 -4.668
## category.xCOMICS -2.12291787 0.86278911 -2.461
## category.xCOMMUNICATION -0.99406376 0.37766591 -2.632
## category.xFAMILY -0.75282151 0.17367632 -4.335
## category.xFINANCE -1.95395604 0.40534114 -4.821
## category.xGAME -1.03269695 0.27715374 -3.726
## category.xHEALTH_AND_FITNESS -0.84029891 0.37051717 -2.268
## category.xLIFESTYLE -1.14046646 0.31063321 -3.671
## category.xMAPS_AND_NAVIGATION -1.22974715 0.53798234 -2.286
## category.xMEDICAL -1.52807807 0.34863293 -4.383
## category.xNEWS_AND_MAGAZINES -1.27051095 0.40092525 -3.169
## category.xPERSONALIZATION -1.00878194 0.36541431 -2.761
## category.xPRODUCTIVITY -0.84578015 0.34440360 -2.456
## category.xSOCIAL -2.35734537 0.58468320 -4.032
## category.xSPORTS -1.52817639 0.37714788 -4.052
## category.xTOOLS -0.56057817 0.21130254 -2.653
## category.xTRAVEL_AND_LOCAL -0.79787140 0.40208337 -1.984
## category.xVIDEO_PLAYERS -0.91554861 0.46683842 -1.961
## ano_act.x2016 -0.50852983 0.22868184 -2.224
## ano_act.x2017 -0.55067071 0.15299143 -3.599
## Pr(>|z|)
## (Intercept) 0.220632
## reviews < 0.0000000000000002 ***
## type_bin < 0.0000000000000002 ***
## rating_imp 0.000000000142 ***
## category.xBOOKS_AND_REFERENCE 0.001503 **
## category.xBUSINESS 0.000003041942 ***
## category.xCOMICS 0.013873 *
## category.xCOMMUNICATION 0.008485 **
## category.xFAMILY 0.000014600969 ***
## category.xFINANCE 0.000001431828 ***
## category.xGAME 0.000194 ***
## category.xHEALTH_AND_FITNESS 0.023335 *
## category.xLIFESTYLE 0.000241 ***
## category.xMAPS_AND_NAVIGATION 0.022263 *
## category.xMEDICAL 0.000011702537 ***
## category.xNEWS_AND_MAGAZINES 0.001530 **
## category.xPERSONALIZATION 0.005769 **
## category.xPRODUCTIVITY 0.014058 *
## category.xSOCIAL 0.000055343373 ***
## category.xSPORTS 0.000050797096 ***
## category.xTOOLS 0.007979 **
## category.xTRAVEL_AND_LOCAL 0.047218 *
## category.xVIDEO_PLAYERS 0.049859 *
## ano_act.x2016 0.026166 *
## ano_act.x2017 0.000319 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9975.4 on 7248 degrees of freedom
## Residual deviance: 2045.8 on 7224 degrees of freedom
## AIC: 2095.8
##
## Number of Fisher Scoring iterations: 9
car::vif(model_4)
## reviews type_bin
## 1.551758 1.458784
## rating_imp category.xBOOKS_AND_REFERENCE
## 1.066679 1.056401
## category.xBUSINESS category.xCOMICS
## 1.096669 1.023141
## category.xCOMMUNICATION category.xFAMILY
## 1.084677 1.503305
## category.xFINANCE category.xGAME
## 1.078281 1.186952
## category.xHEALTH_AND_FITNESS category.xLIFESTYLE
## 1.090422 1.133291
## category.xMAPS_AND_NAVIGATION category.xMEDICAL
## 1.041032 1.100225
## category.xNEWS_AND_MAGAZINES category.xPERSONALIZATION
## 1.075506 1.090835
## category.xPRODUCTIVITY category.xSOCIAL
## 1.101478 1.047189
## category.xSPORTS category.xTOOLS
## 1.108778 1.334461
## category.xTRAVEL_AND_LOCAL category.xVIDEO_PLAYERS
## 1.079966 1.059048
## ano_act.x2016 ano_act.x2017
## 1.034841 1.030511
All predictors are statistically significant at the 5% level in this last model; only the intercept is not (p = 0.22).
# Get the deviance residuals
residuos <- residuals(model_4, type = "deviance")
summary(residuos)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.979407 -0.303497 0.000814 -0.067982 0.001425 3.631136
par(mfrow=c(1, 2))
plot(model_3, main="Modelo 3", pch=19, cex=1, which=1)
plot(model_4, main="Modelo 4", pch=19, cex=1, which=1)
The mean of the deviance residuals is close to zero (-0.068), which
suggests a reasonable overall fit. At the same time, the residuals
contain extreme values (minimum -4.98, maximum 3.63), which may point to
outliers or to some misspecification in the model. We choose model_4 as
the final model.
We now analyze the odds ratios.
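As a complement to the residual summary, the sketch below shows one possible way to list the observations behind the extreme residuals. It runs on simulated data, not on the Play Store dataset: `toy_model`, `x`, `y` and the threshold of 2 are assumptions made for illustration only, and the threshold is a common rule of thumb rather than a formal test.

```r
# Simulated data only: these values are NOT from the Play Store dataset
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 2 * x))

# Hypothetical stand-in for model_4
toy_model <- glm(y ~ x, family = binomial)

# Deviance residuals, obtained the same way as for model_4
dev_res <- residuals(toy_model, type = "deviance")

# Flag observations whose absolute deviance residual exceeds 2
# (illustrative rule of thumb, not a formal test)
flagged <- which(abs(dev_res) > 2)

# Inspect the flagged rows
head(data.frame(row = flagged,
                x = x[flagged],
                y = y[flagged],
                residual = dev_res[flagged]))
```

Applied to the real model, the same idea would use `residuals(model_4, type = "deviance")` and the rows of `train`.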
final_model <- model_4
# compute the odds ratios
exp(coef(final_model))
## (Intercept) reviews
## 1.5694434851 1.0025422378
## type_bin rating_imp
## 0.0003157528 0.5769430973
## category.xBOOKS_AND_REFERENCE category.xBUSINESS
## 0.2022741231 0.1945851651
## category.xCOMICS category.xCOMMUNICATION
## 0.1196819025 0.3700697550
## category.xFAMILY category.xFINANCE
## 0.4710356464 0.1417123418
## category.xGAME category.xHEALTH_AND_FITNESS
## 0.3560454261 0.4315815011
## category.xLIFESTYLE category.xMAPS_AND_NAVIGATION
## 0.3196698728 0.2923664933
## category.xMEDICAL category.xNEWS_AND_MAGAZINES
## 0.2169522344 0.2806881688
## category.xPERSONALIZATION category.xPRODUCTIVITY
## 0.3646628897 0.4292223715
## category.xSOCIAL category.xSPORTS
## 0.0946712066 0.2169309039
## category.xTOOLS category.xTRAVEL_AND_LOCAL
## 0.5708789048 0.4502864237
## category.xVIDEO_PLAYERS ano_act.x2016
## 0.4002969596 0.6013790566
## ano_act.x2017
## 0.5765629750
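The odds ratio indicates by what factor the odds of the event (the app
being popular) are multiplied when a predictor increases by one unit,
holding the other variables constant. Values below 1 reduce the odds and
values above 1 increase them. For example, the odds ratio of 1.0025 for
reviews means that each additional unit of reviews raises the odds of
being popular by about 0.25%.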
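The arithmetic behind these odds ratios can be checked by hand. The sketch below copies two coefficients from the `summary(model_4)` output and exponentiates them; the 100-unit increase is an illustrative choice, not something taken from the analysis, and the units of `reviews` are whatever was used when fitting the model.

```r
# Coefficients copied from the summary(model_4) output
b_reviews  <- 0.00253901
b_type_bin <- -8.06055102

# Odds ratio for a one-unit increase in a predictor: exp(beta)
or_reviews  <- exp(b_reviews)
or_type_bin <- exp(b_type_bin)

# Odds ratio for a k-unit increase: exp(k * beta)
# k = 100 is an illustrative choice
or_reviews_100 <- exp(100 * b_reviews)

# Percentage change in the odds per one-unit increase in reviews
pct_reviews <- (or_reviews - 1) * 100

round(c(or_reviews = or_reviews,
        or_type_bin = or_type_bin,
        or_reviews_100 = or_reviews_100,
        pct_reviews = pct_reviews), 6)
```

Because the model is linear on the log-odds scale, effects of several units multiply rather than add, which is why the 100-unit odds ratio is `exp(100 * beta)` and not `100 * exp(beta)`.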
pred <- predict(final_model, type = "response", newdata = validation)
summary(pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000004 0.057112 0.761552 0.549189 0.999999 1.000000
validation$prob <- pred
# Using probability cutoff of 50%.
pred_popular <- factor(ifelse(pred >= 0.50, "Yes", "No"))
actual_popular <- factor(ifelse(validation$popular==1,"Yes","No"))
table(actual_popular,pred_popular)
## pred_popular
## actual_popular No Yes
## No 1361 36
## Yes 129 1581
We now compute sensitivity, specificity and accuracy with a cutoff of 0.50.
cutoff_popular <- factor(ifelse(pred >=0.50, "Yes", "No"))
conf_final <- confusionMatrix(cutoff_popular, actual_popular, positive = "Yes")
accuracy <- conf_final$overall[1]
sensitivity <- conf_final$byClass[1]
specificity <- conf_final$byClass[2]
accuracy
## Accuracy
## 0.9468941
sensitivity
## Sensitivity
## 0.9245614
specificity
## Specificity
## 0.9742305
Sensitivity measures the model's ability to correctly classify positive
cases (popular apps); here the model reaches a sensitivity of 0.92.
Specificity measures its ability to correctly classify negative cases
(non-popular apps); model_4 reaches a specificity of 0.97. Finally,
accuracy is the share of all apps that were classified correctly, that
is, the diagonal of the confusion matrix divided by the total number of
apps (0.95).
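These three metrics can be reproduced directly from the confusion matrix, without `caret`. The sketch below copies the four counts from the `table(actual_popular, pred_popular)` output; the variable names are ours and exist only for this illustration.

```r
# Counts copied from table(actual_popular, pred_popular)
tn <- 1361  # actual No,  predicted No
fp <- 36    # actual No,  predicted Yes
fn <- 129   # actual Yes, predicted No
tp <- 1581  # actual Yes, predicted Yes

# Share of truly popular apps that the model labels as popular
sensitivity_manual <- tp / (tp + fn)

# Share of truly non-popular apps that the model labels as non-popular
specificity_manual <- tn / (tn + fp)

# Share of all apps classified correctly (diagonal over total)
accuracy_manual <- (tp + tn) / (tp + tn + fp + fn)

round(c(sensitivity = sensitivity_manual,
        specificity = specificity_manual,
        accuracy = accuracy_manual), 7)
```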
perform_app <- function(cutoff)
{
predicted_popular <- factor(ifelse(pred >= cutoff, "Yes", "No"))
conf <- confusionMatrix(predicted_popular, actual_popular, positive = "Yes")
accuracy <- conf$overall[1]
sensitivity <- conf$byClass[1]
specificity <- conf$byClass[2]
out <- t(as.matrix(c(sensitivity, specificity, accuracy)))
colnames(out) <- c("sensitivity", "specificity", "accuracy")
return(out)
}
options(repr.plot.width =8, repr.plot.height =6)
summary(pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000004 0.057112 0.761552 0.549189 0.999999 1.000000
s = seq(0.01,0.80,length=100)
OUT = matrix(0,100,3)
for(i in 1:100)
{
OUT[i,] = perform_app(s[i])
}
plot(s, OUT[,1],xlab="Cutoff",ylab="Value",cex.lab=1.5,cex.axis=1.5,ylim=c(0,1),
type="l",lwd=2,axes=FALSE,col=2)
axis(1,seq(0,1,length=5),seq(0,1,length=5),cex.lab=1.5)
axis(2,seq(0,1,length=5),seq(0,1,length=5),cex.lab=1.5)
lines(s,OUT[,2],col="darkgreen",lwd=2)
lines(s,OUT[,3],col=4,lwd=2)
box()
legend("bottom",col=c(2,"darkgreen",4),text.font =3,inset = 0.02,
box.lty=0,cex = 0.8,
lwd=c(2,2,2),legend=c("Sensitivity","Specificity","Accuracy"))
abline(v = 0.32, col="red", lwd=1, lty=2)
axis(1, at = seq(0.1, 1, by = 0.1))
#cutoff <- s[which(abs(OUT[,1]-OUT[,2])<0.01)]
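The commented-out line above hints at choosing the cutoff where the sensitivity and specificity curves cross. A minimal sketch of that idea follows, written in base R and run on simulated scores: `prob`, `actual` and `sens_spec` are hypothetical names, and none of the numbers come from the validation set of this report.

```r
# Simulated labels and scores: NOT the validation set from this report
set.seed(42)
n <- 1000
actual <- rbinom(n, size = 1, prob = 0.55)

# Hypothetical predicted probabilities, higher on average for positives
prob <- plogis(rnorm(n, mean = ifelse(actual == 1, 1.5, -1.5), sd = 1.5))

# Sensitivity and specificity at a given cutoff, computed without caret
sens_spec <- function(cutoff) {
  predicted <- as.integer(prob >= cutoff)
  c(sensitivity = sum(predicted == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(predicted == 0 & actual == 0) / sum(actual == 0))
}

# Same grid of cutoffs as in the loop above
cutoffs <- seq(0.01, 0.80, length.out = 100)
metrics <- t(sapply(cutoffs, sens_spec))

# Pick the single cutoff where the two curves are closest to each other
best <- which.min(abs(metrics[, "sensitivity"] - metrics[, "specificity"]))
best_cutoff <- cutoffs[best]
best_cutoff
```

Using `which.min` always returns exactly one cutoff, whereas a fixed tolerance such as `< 0.01` may return several candidates or none, depending on how fine the grid is.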
According to the results, the model correctly predicts about 95% of the validation cases (accuracy of 0.947 at the 0.50 cutoff).
https://www.kaggle.com/datasets/lava18/google-play-store-apps
https://www.kaggle.com/code/akwamfoneventus/google-play-store-app-prep-cleaning-r
https://www.kaggle.com/code/arstby/app-store-games-eda/report
https://www.kaggle.com/code/gabydel1982/telco-customer-churn-rboles-de-decisi-n
https://www.kaggle.com/code/gabydel1982/telco-customer-churn-logisticregression-untref