Modelar

O objetivo básico de qualquer modelo é fornecer um resumo simples de baixa dimensão de um conjunto de dados.

Captar padrões gerados pelo fenômeno de interesse e ignorar ruídos (variação aleatória na qual se está interessado);
Modelos preditivos (supervisionado);
Modelos não preditivos (descobrir relacionamentos dentro dos dados - não supervisonado);
Divisão dos dados em três partes - treinamento, consulta e teste.

Partes de um modelo

Família de modelos - expressa um padrão preciso
Modelo ajustado
Todos os modelos são errados, mas alguns são úteis - George Box.
Nota: O objetivo de um modelo não é descobrir a verdade, mas sim buscar fazer uma aproximação o mais simples possível que ainda seja útel.

Construção de modelos

Pacote ‘modelr’

O pacote modelr fornece funções que ajudam a criar pipelines elegantes durante a modelagem. Foi projetado principalmente para apoiar o ensino dos conceitos básicos de modelagem dentro do universo tidyverse.

Qualidade de diamentes e seu preço

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.6.1

## -- Attaching packages --------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'ggplot2' was built under R version 3.6.1

## Warning: package 'tibble' was built under R version 3.6.1

## Warning: package 'tidyr' was built under R version 3.6.1

## Warning: package 'dplyr' was built under R version 3.6.1

## Warning: package 'stringr' was built under R version 3.6.1

## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(modelr)

## Warning: package 'modelr' was built under R version 3.6.1

library(ggplot2)

# https://ggplot2.tidyverse.org/reference/diamonds.html

head(diamonds)

glimpse(diamonds)

## Observations: 53,940
## Variables: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...

summary(diamonds)

##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
##

ggplot(data=diamonds) + geom_histogram(binwidth=500, aes(x=diamonds$price))

ggplot(diamonds,aes(cut,price))+geom_boxplot()

ggplot(diamonds,aes(color,price))+geom_boxplot()

ggplot(diamonds,aes(clarity,price))+geom_boxplot()

# opcional
# install.packages("devtools")
# devtools::install_github("easystats/see")
library(see)

p1<-ggplot(diamonds,aes(cut,price))+geom_violin()
p2<-ggplot(diamonds,aes(color,price))+geom_violin()
p3<-ggplot(diamonds,aes(clarity,price))+geom_violin()
plots(p1, p2, p3, ncol = 2)

# O que já podemos verificar com os gráficos?

Diamantes de menor qualidade têm preços mais altos porque há uma variável complicadora neste caso que é o peso ou quilate (carat). * Diamantes de baixa qualidade tendem a ser maiores.

ggplot(diamonds,aes(carat, price)) + geom_hex(bins = 150)

Vamos focar nos diamantes menores que 2,5 quilates (99,7%).

d1<-diamonds%>%
  filter(carat<=2.5)
(length(d1$carat)/length(diamonds$carat))*100

## [1] 99.76641

d2<-d1%>%
  mutate(logprice = log2(price),logcarat = log2(carat))

ggplot(d2,aes(logcarat,logprice)) + geom_hex(bins=150)

mod1<- lm(logprice ~ logcarat,data = d2)
summary(mod1)

## 
## Call:
## lm(formula = logprice ~ logcarat, data = d2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.96407 -0.24549 -0.00844  0.23930  1.93486 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.193863   0.001969  6194.5   <2e-16 ***
## logcarat     1.681371   0.001936   868.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3767 on 53812 degrees of freedom
## Multiple R-squared:  0.9334, Adjusted R-squared:  0.9334 
## F-statistic: 7.542e+05 on 1 and 53812 DF,  p-value: < 2.2e-16

# Valores preditos

grid<-d2%>%
  data_grid(carat = seq_range(carat,20))%>%
  mutate(logcarat=log2(carat))%>%
  add_predictions(mod1,"logprice")%>%
  mutate(price = 2^logprice)
grid

ggplot(d2,aes(carat,price)) + geom_hex(bins = 150) + geom_line(data = grid, color = "red", size = 1)

Os diamantes maiores são muito mais baratos do que o esperado. Por que?
E os resíduos?

d3<-d2%>%
  add_residuals(mod1,"logresid")

ggplot(d3, aes(logcarat, logresid)) + geom_hex(bins = 150)

r1<-ggplot(d3,aes(cut,logresid))+ geom_boxplot()
r2<-ggplot(d3,aes(color,logresid)) + geom_boxplot()
r3<-ggplot(d3,aes(clarity,logresid)) + geom_boxplot()
plots(r1,r2,r3)

Perceba que a medida que a qualidade dos diamentes aumenta, o mesmo ocorre com seu preço relativo.

Modelos mais ‘complicados’

mod2<-lm(logprice ~ logcarat + color + cut + clarity, data = d2)
summary(mod2)

## 
## Call:
## lm(formula = logprice ~ logcarat + color + cut + clarity, data = d2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17388 -0.12437 -0.00094  0.11920  2.78322 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 12.206978   0.001693 7211.806  < 2e-16 ***
## logcarat     1.886239   0.001124 1677.809  < 2e-16 ***
## color.L     -0.633998   0.002910 -217.872  < 2e-16 ***
## color.Q     -0.137580   0.002676  -51.409  < 2e-16 ***
## color.C     -0.022072   0.002503   -8.819  < 2e-16 ***
## color^4      0.016570   0.002297    7.213 5.54e-13 ***
## color^5     -0.002828   0.002169   -1.304    0.192    
## color^6      0.003533   0.001971    1.793    0.073 .  
## cut.L        0.173866   0.003386   51.349  < 2e-16 ***
## cut.Q       -0.050346   0.002980  -16.897  < 2e-16 ***
## cut.C        0.019129   0.002583    7.407 1.31e-13 ***
## cut^4       -0.002410   0.002066   -1.166    0.243    
## clarity.L    1.308155   0.005179  252.598  < 2e-16 ***
## clarity.Q   -0.334090   0.004839  -69.047  < 2e-16 ***
## clarity.C    0.178423   0.004140   43.093  < 2e-16 ***
## clarity^4   -0.088059   0.003298  -26.697  < 2e-16 ***
## clarity^5    0.035885   0.002680   13.389  < 2e-16 ***
## clarity^6   -0.001371   0.002327   -0.589    0.556    
## clarity^7    0.048221   0.002051   23.512  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1916 on 53795 degrees of freedom
## Multiple R-squared:  0.9828, Adjusted R-squared:  0.9828 
## F-statistic: 1.706e+05 on 18 and 53795 DF,  p-value: < 2.2e-16

grid<- d2 %>%
  data_grid(cut,.model = mod2) %>%
  add_predictions(mod2)
grid

ggplot(grid, aes(cut, pred)) + geom_point()

Para variáveis contínuas o comando anterior utiliza a mediana e para variáveis categórias usa o valor mais comum (ou valores, se for um empate).

d4<- d2 %>%
  add_residuals(mod2,"lresid2")

ggplot(d4, aes(logcarat,lresid2)) + geom_hex(bins = 150)

# Filtrar resíduos maiores que 1

d4<- d4 %>%
  filter(abs(lresid2)>1) %>%
  add_predictions(mod2) %>%
  mutate(pred = round(2^pred)) %>%
  select(price, pred, carat:table, x:z) %>%
  arrange(price)
d4

Exemplo gapminder
Conjunto de dados que resume a evolução de países ao longo do tempo, observando as estatísticas de expectativa de vida e PIB. Maiores detalhes ver o vídeo de Hans Rosling (https://www.youtube.com/watch?time_continue=4&v=WjVHvC9EeB4&feature=emb_logo).

library(tidyverse)
library(modelr)
library(gapminder)

## Warning: package 'gapminder' was built under R version 3.6.1

head(gapminder)

glimpse(gapminder)

## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

# Vamos verificar como a expectativa de vida (lifeExp) muda com o tempo (year) em cada país (country).

gapminder %>%
  ggplot(aes(year,lifeExp,group = country)) + geom_line(apha = 0.4)

## Warning: Ignoring unknown parameters: apha

# Vamos considerar um único país 

### Nova Zelândia 

nz<-filter(gapminder, country == 'New Zealand')
nz %>%
  ggplot(aes(year, lifeExp)) + geom_line()

nz_mod<- lm(lifeExp ~ year, data = nz)
nz %>%
  add_predictions(nz_mod)%>%
  ggplot(aes(year,pred)) + geom_line() + ggtitle('Tendência linear')

nz %>%
  add_residuals(nz_mod) %>%
  ggplot(aes(year,resid)) + geom_hline(yintercept = 0, color = 'lightblue', size = 3) + geom_line() + ggtitle('Padrão restante')

### Brasil

br<-filter(gapminder, country == 'Brazil')
br %>%
  ggplot(aes(year, lifeExp)) + geom_line()

Podemos repetir isso para cada país da forma:

by_country <- gapminder %>%
    group_by(country, continent) %>%
    nest()
by_country %>% head

by_country$data[[1]] %>% head()

# Note que temos um dataframe agrupado e que em cada linha temos uma observação; em cada data frame aninhado, cada linha é um grupo, isto é, uma linha representa o curso de tempo completo para um país ao invés de um único ponto no tempo.

Agora vamos ajustar modelos para cada continente.

country_model<-function(df){
  lm(lifeExp ~ year, data =df)
}
# Podemos utilizar o purrr agora para aplicar country_model a cada elemento. 

models<-map(by_country$data, country_model)
head(models)

## [[1]]
## 
## Call:
## lm(formula = lifeExp ~ year, data = df)
## 
## Coefficients:
## (Intercept)         year  
##   -507.5343       0.2753  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = lifeExp ~ year, data = df)
## 
## Coefficients:
## (Intercept)         year  
##   -594.0725       0.3347  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = lifeExp ~ year, data = df)
## 
## Coefficients:
## (Intercept)         year  
##  -1067.8590       0.5693  
## 
## 
## [[4]]
## 
## Call:
## lm(formula = lifeExp ~ year, data = df)
## 
## Coefficients:
## (Intercept)         year  
##   -376.5048       0.2093  
## 
## 
## [[5]]
## 
## Call:
## lm(formula = lifeExp ~ year, data = df)
## 
## Coefficients:
## (Intercept)         year  
##   -389.6063       0.2317  
## 
## 
## [[6]]
## 
## Call:
## lm(formula = lifeExp ~ year, data = df)
## 
## Coefficients:
## (Intercept)         year  
##   -376.1163       0.2277

by_country<-by_country %>%
  mutate(model=map(data, country_model))
by_country

by_country %>%
  filter(continent=="Europe")

# resíduos

by_country<-by_country %>%
  mutate(resids = map2(data,model,add_residuals))
by_country

# Lembrando que usamos o 'nest()' para transformar um data frame regular em um data frame aninhado e agora fazemos o oposto com o 'unnest'.

resids<-unnest(by_country,resids)
resids

# Gráfico dos resíduos

resids %>%
  ggplot(aes(year,resid)) + geom_line(aes(group = country), alpha = 0.35) + geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# Resíduos

resids %>%
  ggplot(aes(year,resid,group = country)) + geom_line(alpha = 0.35) + facet_wrap(~continent)

# Predições

by_country<-by_country %>%
  mutate(preds = map2(data,model,add_predictions))
by_country

preds <- unnest(data = by_country, preds)
preds %>% head()

preds %>%
    ggplot(aes(year, pred, group = country)) +
    geom_line(alpha = 0.35) +
    facet_wrap(~ continent)

Qualidade do modelo

Pacote broom

O pacote broom fornece um conjunto de funções gerais para transformar modelos em dados tidy. O comando ‘glance’ serve para extrairmos algumas métricas de qualidade de ajuste de modelos.

library(broom)

## Warning: package 'broom' was built under R version 3.6.1

## 
## Attaching package: 'broom'

## The following object is masked from 'package:modelr':
## 
##     bootstrap

glance(nz_mod)

# Para cada país

glance <- by_country %>% 
    mutate(glance = map(model, broom::glance)) %>% 
    unnest(glance, .drop = TRUE) %>%
    arrange(r.squared)

## Warning: The `.drop` argument of `unnest()` is deprecated as of tidyr 1.0.0.
## All list-columns are now preserved.
## This warning is displayed once per session.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

glance %>% head()

# R2

glance %>%
  arrange(adj.r.squared)

# Ajustes ruins

bad_fit <- filter(glance, adj.r.squared <= 0.30)
gapminder %>%
    filter(country %in% unique(bad_fit$country)) %>%
    ggplot(aes(year, lifeExp, color = country, group = country)) +
    geom_line()

glance %>%
  ggplot(aes(continent,adj.r.squared)) + geom_jitter(witdh = 0.5)

## Warning: Ignoring unknown parameters: witdh

Modelagem de dados

Paulo H. Sales Guimarães

Modelar

Partes de um modelo

Construção de modelos

Modelos mais ‘complicados’

Qualidade do modelo