O objetivo bĂ¡sico de qualquer modelo Ă© fornecer um resumo simples de baixa dimensĂ£o de um conjunto de dados.
Captar padrões gerados pelo fenĂ´meno de interesse e ignorar ruĂdos (variaĂ§Ă£o aleatĂ³ria na qual se estĂ¡ interessado);
Modelos preditivos (supervisionado);
Modelos nĂ£o preditivos (descobrir relacionamentos dentro dos dados - nĂ£o supervisonado);
DivisĂ£o dos dados em trĂªs partes - treinamento, consulta e teste.
FamĂlia de modelos - expressa um padrĂ£o preciso
Modelo ajustado
Todos os modelos sĂ£o errados, mas alguns sĂ£o Ăºteis - George Box.
Nota: O objetivo de um modelo nĂ£o Ă© descobrir a verdade, mas sim buscar fazer uma aproximaĂ§Ă£o o mais simples possĂvel que ainda seja Ăºtel.
O pacote modelr fornece funções que ajudam a criar pipelines elegantes durante a modelagem. Foi projetado principalmente para apoiar o ensino dos conceitos bĂ¡sicos de modelagem dentro do universo tidyverse.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.1
## -- Attaching packages --------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.1
## Warning: package 'tibble' was built under R version 3.6.1
## Warning: package 'tidyr' was built under R version 3.6.1
## Warning: package 'dplyr' was built under R version 3.6.1
## Warning: package 'stringr' was built under R version 3.6.1
## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(modelr)
## Warning: package 'modelr' was built under R version 3.6.1
library(ggplot2)
# https://ggplot2.tidyverse.org/reference/diamonds.html
head(diamonds)
glimpse(diamonds)
## Observations: 53,940
## Variables: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
ggplot(data=diamonds) + geom_histogram(binwidth=500, aes(x=diamonds$price))
ggplot(diamonds,aes(cut,price))+geom_boxplot()
ggplot(diamonds,aes(color,price))+geom_boxplot()
ggplot(diamonds,aes(clarity,price))+geom_boxplot()
# opcional
# install.packages("devtools")
# devtools::install_github("easystats/see")
library(see)
p1<-ggplot(diamonds,aes(cut,price))+geom_violin()
p2<-ggplot(diamonds,aes(color,price))+geom_violin()
p3<-ggplot(diamonds,aes(clarity,price))+geom_violin()
plots(p1, p2, p3, ncol = 2)
# O que jĂ¡ podemos verificar com os grĂ¡ficos?
Diamantes de menor qualidade tĂªm preços mais altos porque hĂ¡ uma variĂ¡vel complicadora neste caso que Ă© o peso ou quilate (carat). * Diamantes de baixa qualidade tendem a ser maiores.
ggplot(diamonds,aes(carat, price)) + geom_hex(bins = 150)
Vamos focar nos diamantes menores que 2,5 quilates (99,7%).
d1<-diamonds%>%
filter(carat<=2.5)
(length(d1$carat)/length(diamonds$carat))*100
## [1] 99.76641
d2<-d1%>%
mutate(logprice = log2(price),logcarat = log2(carat))
ggplot(d2,aes(logcarat,logprice)) + geom_hex(bins=150)
mod1<- lm(logprice ~ logcarat,data = d2)
summary(mod1)
##
## Call:
## lm(formula = logprice ~ logcarat, data = d2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.96407 -0.24549 -0.00844 0.23930 1.93486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.193863 0.001969 6194.5 <2e-16 ***
## logcarat 1.681371 0.001936 868.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3767 on 53812 degrees of freedom
## Multiple R-squared: 0.9334, Adjusted R-squared: 0.9334
## F-statistic: 7.542e+05 on 1 and 53812 DF, p-value: < 2.2e-16
# Valores preditos
grid<-d2%>%
data_grid(carat = seq_range(carat,20))%>%
mutate(logcarat=log2(carat))%>%
add_predictions(mod1,"logprice")%>%
mutate(price = 2^logprice)
grid
ggplot(d2,aes(carat,price)) + geom_hex(bins = 150) + geom_line(data = grid, color = "red", size = 1)
Os diamantes maiores sĂ£o muito mais baratos do que o esperado. Por que?
E os resĂduos?
d3<-d2%>%
add_residuals(mod1,"logresid")
ggplot(d3, aes(logcarat, logresid)) + geom_hex(bins = 150)
r1<-ggplot(d3,aes(cut,logresid))+ geom_boxplot()
r2<-ggplot(d3,aes(color,logresid)) + geom_boxplot()
r3<-ggplot(d3,aes(clarity,logresid)) + geom_boxplot()
plots(r1,r2,r3)
Perceba que a medida que a qualidade dos diamentes aumenta, o mesmo ocorre com seu preço relativo.
mod2<-lm(logprice ~ logcarat + color + cut + clarity, data = d2)
summary(mod2)
##
## Call:
## lm(formula = logprice ~ logcarat + color + cut + clarity, data = d2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.17388 -0.12437 -0.00094 0.11920 2.78322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.206978 0.001693 7211.806 < 2e-16 ***
## logcarat 1.886239 0.001124 1677.809 < 2e-16 ***
## color.L -0.633998 0.002910 -217.872 < 2e-16 ***
## color.Q -0.137580 0.002676 -51.409 < 2e-16 ***
## color.C -0.022072 0.002503 -8.819 < 2e-16 ***
## color^4 0.016570 0.002297 7.213 5.54e-13 ***
## color^5 -0.002828 0.002169 -1.304 0.192
## color^6 0.003533 0.001971 1.793 0.073 .
## cut.L 0.173866 0.003386 51.349 < 2e-16 ***
## cut.Q -0.050346 0.002980 -16.897 < 2e-16 ***
## cut.C 0.019129 0.002583 7.407 1.31e-13 ***
## cut^4 -0.002410 0.002066 -1.166 0.243
## clarity.L 1.308155 0.005179 252.598 < 2e-16 ***
## clarity.Q -0.334090 0.004839 -69.047 < 2e-16 ***
## clarity.C 0.178423 0.004140 43.093 < 2e-16 ***
## clarity^4 -0.088059 0.003298 -26.697 < 2e-16 ***
## clarity^5 0.035885 0.002680 13.389 < 2e-16 ***
## clarity^6 -0.001371 0.002327 -0.589 0.556
## clarity^7 0.048221 0.002051 23.512 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1916 on 53795 degrees of freedom
## Multiple R-squared: 0.9828, Adjusted R-squared: 0.9828
## F-statistic: 1.706e+05 on 18 and 53795 DF, p-value: < 2.2e-16
grid<- d2 %>%
data_grid(cut,.model = mod2) %>%
add_predictions(mod2)
grid
ggplot(grid, aes(cut, pred)) + geom_point()
d4<- d2 %>%
add_residuals(mod2,"lresid2")
ggplot(d4, aes(logcarat,lresid2)) + geom_hex(bins = 150)
# Filtrar resĂduos maiores que 1
d4<- d4 %>%
filter(abs(lresid2)>1) %>%
add_predictions(mod2) %>%
mutate(pred = round(2^pred)) %>%
select(price, pred, carat:table, x:z) %>%
arrange(price)
d4
Exemplo gapminder
Conjunto de dados que resume a evoluĂ§Ă£o de paĂses ao longo do tempo, observando as estatĂsticas de expectativa de vida e PIB. Maiores detalhes ver o vĂdeo de Hans Rosling (https://www.youtube.com/watch?time_continue=4&v=WjVHvC9EeB4&feature=emb_logo).
library(tidyverse)
library(modelr)
library(gapminder)
## Warning: package 'gapminder' was built under R version 3.6.1
head(gapminder)
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
# Vamos verificar como a expectativa de vida (lifeExp) muda com o tempo (year) em cada paĂs (country).
gapminder %>%
ggplot(aes(year,lifeExp,group = country)) + geom_line(apha = 0.4)
## Warning: Ignoring unknown parameters: apha
# Vamos considerar um Ăºnico paĂs
### Nova ZelĂ¢ndia
nz<-filter(gapminder, country == 'New Zealand')
nz %>%
ggplot(aes(year, lifeExp)) + geom_line()
nz_mod<- lm(lifeExp ~ year, data = nz)
nz %>%
add_predictions(nz_mod)%>%
ggplot(aes(year,pred)) + geom_line() + ggtitle('TendĂªncia linear')
nz %>%
add_residuals(nz_mod) %>%
ggplot(aes(year,resid)) + geom_hline(yintercept = 0, color = 'lightblue', size = 3) + geom_line() + ggtitle('PadrĂ£o restante')
### Brasil
br<-filter(gapminder, country == 'Brazil')
br %>%
ggplot(aes(year, lifeExp)) + geom_line()
Podemos repetir isso para cada paĂs da forma:
by_country <- gapminder %>%
group_by(country, continent) %>%
nest()
by_country %>% head
by_country$data[[1]] %>% head()
# Note que temos um dataframe agrupado e que em cada linha temos uma observaĂ§Ă£o; em cada data frame aninhado, cada linha Ă© um grupo, isto Ă©, uma linha representa o curso de tempo completo para um paĂs ao invĂ©s de um Ăºnico ponto no tempo.
Agora vamos ajustar modelos para cada continente.
country_model<-function(df){
lm(lifeExp ~ year, data =df)
}
# Podemos utilizar o purrr agora para aplicar country_model a cada elemento.
models<-map(by_country$data, country_model)
head(models)
## [[1]]
##
## Call:
## lm(formula = lifeExp ~ year, data = df)
##
## Coefficients:
## (Intercept) year
## -507.5343 0.2753
##
##
## [[2]]
##
## Call:
## lm(formula = lifeExp ~ year, data = df)
##
## Coefficients:
## (Intercept) year
## -594.0725 0.3347
##
##
## [[3]]
##
## Call:
## lm(formula = lifeExp ~ year, data = df)
##
## Coefficients:
## (Intercept) year
## -1067.8590 0.5693
##
##
## [[4]]
##
## Call:
## lm(formula = lifeExp ~ year, data = df)
##
## Coefficients:
## (Intercept) year
## -376.5048 0.2093
##
##
## [[5]]
##
## Call:
## lm(formula = lifeExp ~ year, data = df)
##
## Coefficients:
## (Intercept) year
## -389.6063 0.2317
##
##
## [[6]]
##
## Call:
## lm(formula = lifeExp ~ year, data = df)
##
## Coefficients:
## (Intercept) year
## -376.1163 0.2277
by_country<-by_country %>%
mutate(model=map(data, country_model))
by_country
by_country %>%
filter(continent=="Europe")
# resĂduos
by_country<-by_country %>%
mutate(resids = map2(data,model,add_residuals))
by_country
# Lembrando que usamos o 'nest()' para transformar um data frame regular em um data frame aninhado e agora fazemos o oposto com o 'unnest'.
resids<-unnest(by_country,resids)
resids
# GrĂ¡fico dos resĂduos
resids %>%
ggplot(aes(year,resid)) + geom_line(aes(group = country), alpha = 0.35) + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
# ResĂduos
resids %>%
ggplot(aes(year,resid,group = country)) + geom_line(alpha = 0.35) + facet_wrap(~continent)
# Predições
by_country<-by_country %>%
mutate(preds = map2(data,model,add_predictions))
by_country
preds <- unnest(data = by_country, preds)
preds %>% head()
preds %>%
ggplot(aes(year, pred, group = country)) +
geom_line(alpha = 0.35) +
facet_wrap(~ continent)
O pacote broom fornece um conjunto de funções gerais para transformar modelos em dados tidy. O comando ‘glance’ serve para extrairmos algumas mĂ©tricas de qualidade de ajuste de modelos.
library(broom)
## Warning: package 'broom' was built under R version 3.6.1
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
glance(nz_mod)
# Para cada paĂs
glance <- by_country %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance, .drop = TRUE) %>%
arrange(r.squared)
## Warning: The `.drop` argument of `unnest()` is deprecated as of tidyr 1.0.0.
## All list-columns are now preserved.
## This warning is displayed once per session.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
glance %>% head()
# R2
glance %>%
arrange(adj.r.squared)
# Ajustes ruins
bad_fit <- filter(glance, adj.r.squared <= 0.30)
gapminder %>%
filter(country %in% unique(bad_fit$country)) %>%
ggplot(aes(year, lifeExp, color = country, group = country)) +
geom_line()
glance %>%
ggplot(aes(continent,adj.r.squared)) + geom_jitter(witdh = 0.5)
## Warning: Ignoring unknown parameters: witdh