Aula 7 - Regressão 2

Autor

Paulo Manoel da Silva Junior

1 Introdução

  • Vamos fazer o uso de um modelo GAM, e ver qual ajuste vai se adequar melhor aos dados.
rm(list=ls(all=T))
gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 559509 29.9    1276470 68.2   644254 34.5
Vcells 985724  7.6    8388608 64.0  1635077 12.5

2 Uso do GAM

2.1 Impotando o banco de dados e visualizando

O banco de Dados se encontra no pacote UsingR, e o nome do banco de dados é fat, e a descrição do banco é: “A data set containing many physical measurements of 252 males. Most of the variables can be measured with a scale or tape measure. Can they be used to predict the percentage of body fat? If so, this offers an easy alternative to an underwater weighing technique.”

As variáveis são as seguintes:

  • case: Case Number
  • body.fat: Percent body fat using Brozek’s equation, 457/Density - 414.2
  • body.fat.siri : Percent body fat using Siri’s equation, 495/Density - 450
  • density : Density (gm/cm^2)
  • age : Age (yrs)
  • weight : Weight (lbs)
  • height : Height (inches)
  • BMI : Adiposity index = Weight/Height^2 (kg/m^2)
  • ffweight : Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek’s formula (lbs)
  • neck : Neck circumference (cm)
  • chest : Chest circumference (cm)
  • abdomen : Abdomen circumference (cm) “at the umbilicus and level with the iliac crest”
  • hip : Hip circumference (cm)
  • thigh : Thigh circumference (cm)
  • knee : Knee circumference (cm)
  • ankle : Ankle circumference (cm)
  • bicep : Extended biceps circumference (cm)
  • forearm : Forearm circumference (cm)
  • wrist : Wrist circumference (cm) “distal to the styloid processes”
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

2.1.1 Visualizando algumas observações do banco

UsingR::fat %>% 
  slice_head(n = 10) %>% 
  gt::gt()
case body.fat body.fat.siri density age weight height BMI ffweight neck chest abdomen hip thigh knee ankle bicep forearm wrist
1 12.6 12.3 1.0708 23 154.25 67.75 23.7 134.9 36.2 93.1 85.2 94.5 59.0 37.3 21.9 32.0 27.4 17.1
2 6.9 6.1 1.0853 22 173.25 72.25 23.4 161.3 38.5 93.6 83.0 98.7 58.7 37.3 23.4 30.5 28.9 18.2
3 24.6 25.3 1.0414 22 154.00 66.25 24.7 116.0 34.0 95.8 87.9 99.2 59.6 38.9 24.0 28.8 25.2 16.6
4 10.9 10.4 1.0751 26 184.75 72.25 24.9 164.7 37.4 101.8 86.4 101.2 60.1 37.3 22.8 32.4 29.4 18.2
5 27.8 28.7 1.0340 24 184.25 71.25 25.6 133.1 34.4 97.3 100.0 101.9 63.2 42.2 24.0 32.2 27.7 17.7
6 20.6 20.9 1.0502 24 210.25 74.75 26.5 167.0 39.0 104.5 94.4 107.8 66.0 42.0 25.6 35.7 30.6 18.8
7 19.0 19.2 1.0549 26 181.00 69.75 26.2 146.6 36.4 105.1 90.7 100.3 58.4 38.3 22.9 31.9 27.8 17.7
8 12.8 12.4 1.0704 25 176.00 72.50 23.6 153.6 37.8 99.6 88.5 97.1 60.0 39.4 23.2 30.5 29.0 18.8
9 5.1 4.1 1.0900 25 191.00 74.00 24.6 181.3 38.1 100.9 82.5 99.9 62.9 38.3 23.8 35.9 31.1 18.2
10 12.0 11.7 1.0722 23 198.25 73.50 25.8 174.4 42.1 99.6 88.6 104.1 63.1 41.7 25.0 35.6 30.0 19.2
UsingR::fat %>% 
  slice_tail(n = 12) %>% 
  gt::gt()
case body.fat body.fat.siri density age weight height BMI ffweight neck chest abdomen hip thigh knee ankle bicep forearm wrist
241 17.0 17.0 1.0599 65 127.50 65.75 20.8 105.9 34.7 93.0 79.7 87.6 50.7 33.4 20.1 28.5 24.8 16.5
242 33.6 35.0 1.0207 65 224.50 68.25 33.9 149.2 38.8 119.6 118.0 114.3 61.3 42.1 23.4 34.9 30.1 19.4
243 29.3 30.4 1.0304 66 234.25 72.00 31.8 165.6 41.4 119.7 109.0 109.1 63.7 42.4 24.6 35.6 30.7 19.5
244 31.4 32.6 1.0256 67 227.75 72.75 30.3 156.3 41.3 115.8 113.4 109.8 65.6 46.0 25.4 35.3 29.8 19.5
245 28.1 29.0 1.0334 67 199.50 68.50 29.9 143.6 40.7 118.3 106.1 101.6 58.2 38.8 24.1 32.1 29.3 18.5
246 15.3 15.2 1.0641 68 155.50 69.25 22.8 131.8 36.3 97.4 84.3 94.4 54.3 37.5 22.6 29.2 27.3 18.5
247 29.1 30.2 1.0308 69 215.50 70.50 30.5 152.7 40.8 113.7 107.6 110.0 63.3 44.0 22.6 37.5 32.6 18.8
248 11.5 11.0 1.0736 70 134.25 67.00 21.1 118.9 34.9 89.2 83.6 88.8 49.6 34.8 21.5 25.6 25.7 18.5
249 32.3 33.6 1.0236 72 201.00 69.75 29.1 136.1 40.9 108.5 105.0 104.5 59.6 40.8 23.2 35.2 28.6 20.1
250 28.3 29.3 1.0328 72 186.75 66.00 30.2 133.9 38.9 111.1 111.5 101.7 60.3 37.3 21.5 31.3 27.2 18.0
251 25.3 26.0 1.0399 72 190.75 70.50 27.0 142.6 38.9 108.3 101.3 97.8 56.0 41.6 22.7 30.5 29.4 19.8
252 30.7 31.9 1.0271 74 207.50 70.00 29.8 143.7 40.8 112.4 108.5 107.1 59.3 42.2 24.6 33.7 30.0 20.9

2.1.2 Estatísticas Descritivas

visdat::vis_dat(UsingR::fat)

Verificando se existe algum valor ausente:

visdat::vis_miss(UsingR::fat, sort_miss = TRUE)

Conforme observado, temos todo o conjunto de dados completo sem nenhum NA.

  • Verificando o tipo das variáveis
glimpse(UsingR::fat)
Rows: 252
Columns: 19
$ case          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ body.fat      <dbl> 12.6, 6.9, 24.6, 10.9, 27.8, 20.6, 19.0, 12.8, 5.1, 12.0…
$ body.fat.siri <dbl> 12.3, 6.1, 25.3, 10.4, 28.7, 20.9, 19.2, 12.4, 4.1, 11.7…
$ density       <dbl> 1.0708, 1.0853, 1.0414, 1.0751, 1.0340, 1.0502, 1.0549, …
$ age           <int> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32, 30, …
$ weight        <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 181.00, …
$ height        <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, …
$ BMI           <dbl> 23.7, 23.4, 24.7, 24.9, 25.6, 26.5, 26.2, 23.6, 24.6, 25…
$ ffweight      <dbl> 134.9, 161.3, 116.0, 164.7, 133.1, 167.0, 146.6, 153.6, …
$ neck          <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38.1, 42…
$ chest         <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6, 100.9…
$ abdomen       <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5, 8…
$ hip           <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1, 99.9…
$ thigh         <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62.9, 63…
$ knee          <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38.3, 41…
$ ankle         <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23.8, 25…
$ bicep         <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35.9, 35…
$ forearm       <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31.1, 30…
$ wrist         <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18.2, 19…
  • Com a função skim do pacote skimr, temos um resumo da estatística descritiva do banco utilizado.
skimr::skim(UsingR::fat)
Data summary
Name UsingR::fat
Number of rows 252
Number of columns 19
_______________________
Column type frequency:
numeric 19
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
case 0 1 126.50 72.89 1.0 63.75 126.50 189.25 252.00 ▇▇▇▇▇
body.fat 0 1 18.94 7.75 0.0 12.80 19.00 24.60 45.10 ▂▇▇▃▁
body.fat.siri 0 1 19.15 8.37 0.0 12.47 19.20 25.30 47.50 ▃▇▇▃▁
density 0 1 1.06 0.02 1.0 1.04 1.05 1.07 1.11 ▁▅▇▆▁
age 0 1 44.88 12.60 22.0 35.75 43.00 54.00 81.00 ▅▇▇▃▁
weight 0 1 178.92 29.39 118.5 159.00 176.50 197.00 363.15 ▆▇▂▁▁
height 0 1 70.15 3.66 29.5 68.25 70.00 72.25 77.75 ▁▁▁▂▇
BMI 0 1 25.44 3.65 18.1 23.10 25.05 27.33 48.90 ▆▇▁▁▁
ffweight 0 1 143.71 18.23 105.9 131.35 141.55 153.88 240.50 ▃▇▂▁▁
neck 0 1 37.99 2.43 31.1 36.40 38.00 39.42 51.20 ▂▇▃▁▁
chest 0 1 100.82 8.43 79.3 94.35 99.65 105.38 136.20 ▂▇▅▁▁
abdomen 0 1 92.56 10.78 69.4 84.57 90.95 99.33 148.10 ▃▇▂▁▁
hip 0 1 99.90 7.16 85.0 95.50 99.30 103.53 147.70 ▆▇▁▁▁
thigh 0 1 59.41 5.25 47.2 56.00 59.00 62.35 87.30 ▃▇▃▁▁
knee 0 1 38.59 2.41 33.0 36.98 38.50 39.92 49.10 ▃▇▅▁▁
ankle 0 1 23.10 1.69 19.1 22.00 22.80 24.00 33.90 ▃▇▁▁▁
bicep 0 1 32.27 3.02 24.8 30.20 32.05 34.32 45.00 ▂▇▆▁▁
forearm 0 1 28.66 2.02 21.0 27.30 28.70 30.00 34.90 ▁▂▇▆▁
wrist 0 1 18.23 0.93 15.8 17.60 18.30 18.80 21.40 ▂▆▇▂▁

2.2 Primeiro Ajuste

Vamos fazer o primeiro ajuste, considerando a variável dependente body.fat, e o parâmetro te, nas variáveis independentes, logo em seguida vamos analisar os resultados.

fit1<-mgcv::gam(body.fat ~ te(age) + te(weight) + te(height) + te(BMI) + te(neck) + te(chest) + te(abdomen) + te(hip) + te(thigh) + te(knee) + te(ankle) + te(bicep) + te(forearm) + te(wrist), data=UsingR::fat)

2.2.1 Resultados

summary(fit1)

Family: gaussian 
Link function: identity 

Formula:
body.fat ~ te(age) + te(weight) + te(height) + te(BMI) + te(neck) + 
    te(chest) + te(abdomen) + te(hip) + te(thigh) + te(knee) + 
    te(ankle) + te(bicep) + te(forearm) + te(wrist)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.9385     0.2322   81.55   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df      F  p-value    
te(age)     1.000  1.000  6.216 0.013381 *  
te(weight)  1.000  1.000  0.121 0.728618    
te(height)  1.000  1.000  0.693 0.406014    
te(BMI)     2.076  2.647  0.810 0.402917    
te(neck)    1.336  1.591  3.816 0.087699 .  
te(chest)   1.000  1.000  0.984 0.322169    
te(abdomen) 4.000  4.000 25.794  < 2e-16 ***
te(hip)     3.324  3.680  2.182 0.163367    
te(thigh)   2.880  3.421  1.553 0.305784    
te(knee)    1.750  2.187  0.733 0.480100    
te(ankle)   1.000  1.000  0.579 0.447634    
te(bicep)   3.606  3.896  2.186 0.075074 .  
te(forearm) 1.000  1.000  1.429 0.233174    
te(wrist)   1.734  2.155  8.458 0.000234 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.774   Deviance explained = 79.8%
GCV = 15.269  Scale est. = 13.591    n = 252

Podemos ver, com o primeiro ajuste, o resultado de que, tivemos um R^2 ajustado de 77.4%, e uma explicação de Desvio de 79.8%, que são bons valores para o modelo ajustado. Todavia, prosseguiremos em busca do melhor modelo ajustado.

2.3 Segundo ajuste

Vamos fazer o segundo ajuste, considerando a variável dependente body.fat, e o parâmetro t2, nas variáveis independentes, logo em seguida vamos analisar os resultados.

fit2<-mgcv::gam(body.fat ~ t2(age) + t2(weight) + t2(height) + t2(BMI) + t2(neck) + t2(chest) + t2(abdomen) +t2(hip) + t2(thigh) + t2(knee) + t2(ankle) + t2(bicep) + t2(forearm) + t2(wrist), data=UsingR::fat)

2.3.1 Resultados

summary(fit2)

Family: gaussian 
Link function: identity 

Formula:
body.fat ~ t2(age) + t2(weight) + t2(height) + t2(BMI) + t2(neck) + 
    t2(chest) + t2(abdomen) + t2(hip) + t2(thigh) + t2(knee) + 
    t2(ankle) + t2(bicep) + t2(forearm) + t2(wrist)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.9385     0.2323   81.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df      F p-value    
t2(age)     1.000  1.000  6.134  0.0140 *  
t2(weight)  1.000  1.000  0.142  0.7067    
t2(height)  1.000  1.000  0.652  0.4204    
t2(BMI)     2.085  2.653  0.535  0.6423    
t2(neck)    1.339  1.594  0.837  0.5339    
t2(chest)   1.000  1.000  0.989  0.3211    
t2(abdomen) 4.000  4.000 25.890  <2e-16 ***
t2(hip)     3.167  3.627  1.925  0.2097    
t2(thigh)   2.890  3.420  1.540  0.2067    
t2(knee)    1.753  2.188  0.489  0.5731    
t2(ankle)   1.000  1.000  0.593  0.4419    
t2(bicep)   3.515  3.859  1.855  0.1050    
t2(forearm) 1.000  1.000  1.501  0.2217    
t2(wrist)   1.737  2.157  5.658  0.0257 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.774   Deviance explained = 79.7%
GCV = 15.269  Scale est. = 13.604    n = 252

Podemos ver, com o segundo ajuste, o resultado de que, tivemos um R^2 ajustado de 77.4%, e uma explicação de Desvio de 79.7%, que são bons valores para o modelo ajustado. Todavia, prosseguiremos em busca do melhor modelo ajustado.

2.4 Terceiro ajuste

Vamos fazer o terceiro ajuste, considerando a variável dependente body.fat, e o parâmetro s, nas variáveis independentes, logo em seguida vamos analisar os resultados.

fit3<-mgcv::gam(body.fat ~ s(age) + s(weight) + s(height) + s(BMI) + s(neck) + s(chest) + s(abdomen) + s(hip) + s(thigh) + s(knee) + s(ankle) + s(bicep) + s(forearm) + s(wrist), data=UsingR::fat)

2.4.1 Resultados

summary(fit3)

Family: gaussian 
Link function: identity 

Formula:
body.fat ~ s(age) + s(weight) + s(height) + s(BMI) + s(neck) + 
    s(chest) + s(abdomen) + s(hip) + s(thigh) + s(knee) + s(ankle) + 
    s(bicep) + s(forearm) + s(wrist)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.9385     0.2256   83.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
             edf Ref.df      F  p-value    
s(age)     5.423  6.567  2.075 0.051891 .  
s(weight)  1.000  1.000  0.175 0.676352    
s(height)  1.000  1.000  0.199 0.655765    
s(BMI)     2.452  3.173  0.925 0.482472    
s(neck)    1.494  1.842  1.867 0.104184    
s(chest)   1.000  1.000  0.563 0.453853    
s(abdomen) 6.547  7.425 14.779  < 2e-16 ***
s(hip)     7.439  8.261  2.130 0.047066 *  
s(thigh)   2.361  3.065  0.490 0.665215    
s(knee)    1.437  1.760  0.368 0.566256    
s(ankle)   2.844  3.503  0.820 0.618727    
s(bicep)   4.558  5.557  1.791 0.081062 .  
s(forearm) 1.000  1.000  2.156 0.143457    
s(wrist)   1.806  2.268  7.208 0.000558 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.787   Deviance explained = 82.1%
GCV = 15.338  Scale est. = 12.821    n = 252

Podemos ver, com o terceiro ajuste, o resultado de que, tivemos um R^2 ajustado de 78.7%, e uma explicação de Desvio de 82.1%. Ou seja, temos que o terceiro modelo foi o que melhor se adequou aos dados, sendo assim, tomaremos o parâmetro s de suavização como o parâmetro a ser utilizado no ajuste da GAM.

2.5 Ajuste do modelo final.

Devido ao número excessivo de variáveis, é observado que no ajuste acima algumas variáveis não são significativas para o ajuste, considerando o nível de significância de 5%, foi observado que apenas as variáveis abaixo são significativas.

  • abdomen : Abdomen circumference (cm) “at the umbilicus and level with the iliac crest”
  • hip : Hip circumference (cm)
  • wrist : Wrist circumference (cm) “distal to the styloid processes”

Sendo assim, ajustaremos um modelo apenas com essas variáveis.

ajuste_final <- mgcv::gam(body.fat~s(abdomen)+s(hip)+s(wrist), data = UsingR::fat)

2.5.1 Resultados

summary(ajuste_final)

Family: gaussian 
Link function: identity 

Formula:
body.fat ~ s(abdomen) + s(hip) + s(wrist)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.9385     0.2433   77.85   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
             edf Ref.df      F  p-value    
s(abdomen) 5.387  6.490 51.025  < 2e-16 ***
s(hip)     6.986  8.005  3.266  0.00145 ** 
s(wrist)   1.267  1.486 22.070 2.38e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.752   Deviance explained = 76.5%
GCV = 15.834  Scale est. = 14.914    n = 252

2.6 Análise gráfica residual

Uma importante parte é a análise gráfica para saber se o modelo ficou bem ajustado.

par(mfrow = c(1,3))
plot(ajuste_final)

par(mfrow = c(2,2))
mgcv::gam.check(ajuste_final)


Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 8 iterations.
The RMS GCV score gradient at convergence was 4.538414e-06 .
The Hessian was positive definite.
Model rank =  28 / 28 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

             k'  edf k-index p-value
s(abdomen) 9.00 5.39    1.02    0.56
s(hip)     9.00 6.99    1.03    0.66
s(wrist)   9.00 1.27    1.06    0.83

3 Conclusão

O uso de GAM (Modelos aditivos generalizados) se mostrou muito eficiente no ajuste dos dados, quando utilizado o parâmetro de suavização spline, sendo assim, obtivemos um bom modelo ajustado, onde os valores preditos estão coerentes com os valores observados, tendo uma diminuição no erro, um bom coeficiente de determinação, evidenciando a importância do modelo ajustado, e que os residuos aparentemente seguem uma distribuição normal.