Aula 7 - Regressão 2

Autor

Paulo Manoel da Silva Junior

1 Introdução

Vamos fazer o uso de um modelo GAM, e ver qual ajuste vai se adequar melhor aos dados.

rm(list=ls(all=T))
gc()

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 559509 29.9    1276470 68.2   644254 34.5
Vcells 985724  7.6    8388608 64.0  1635077 12.5

2 Uso do GAM

2.1 Impotando o banco de dados e visualizando

O banco de Dados se encontra no pacote UsingR, e o nome do banco de dados é fat, e a descrição do banco é: “A data set containing many physical measurements of 252 males. Most of the variables can be measured with a scale or tape measure. Can they be used to predict the percentage of body fat? If so, this offers an easy alternative to an underwater weighing technique.”

As variáveis são as seguintes:

case: Case Number
body.fat: Percent body fat using Brozek’s equation, 457/Density - 414.2
body.fat.siri : Percent body fat using Siri’s equation, 495/Density - 450
density : Density (gm/cm^2)
age : Age (yrs)
weight : Weight (lbs)
height : Height (inches)
BMI : Adiposity index = Weight/Height^2 (kg/m^2)
ffweight : Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek’s formula (lbs)
neck : Neck circumference (cm)
chest : Chest circumference (cm)
abdomen : Abdomen circumference (cm) “at the umbilicus and level with the iliac crest”
hip : Hip circumference (cm)
thigh : Thigh circumference (cm)
knee : Knee circumference (cm)
ankle : Ankle circumference (cm)
bicep : Extended biceps circumference (cm)
forearm : Forearm circumference (cm)
wrist : Wrist circumference (cm) “distal to the styloid processes”

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

2.1.1 Visualizando algumas observações do banco

UsingR::fat %>% 
  slice_head(n = 10) %>% 
  gt::gt()

case	body.fat	body.fat.siri	density	age	weight	height	BMI	ffweight	neck	chest	abdomen	hip	thigh	knee	ankle	bicep	forearm	wrist
1	12.6	12.3	1.0708	23	154.25	67.75	23.7	134.9	36.2	93.1	85.2	94.5	59.0	37.3	21.9	32.0	27.4	17.1
2	6.9	6.1	1.0853	22	173.25	72.25	23.4	161.3	38.5	93.6	83.0	98.7	58.7	37.3	23.4	30.5	28.9	18.2
3	24.6	25.3	1.0414	22	154.00	66.25	24.7	116.0	34.0	95.8	87.9	99.2	59.6	38.9	24.0	28.8	25.2	16.6
4	10.9	10.4	1.0751	26	184.75	72.25	24.9	164.7	37.4	101.8	86.4	101.2	60.1	37.3	22.8	32.4	29.4	18.2
5	27.8	28.7	1.0340	24	184.25	71.25	25.6	133.1	34.4	97.3	100.0	101.9	63.2	42.2	24.0	32.2	27.7	17.7
6	20.6	20.9	1.0502	24	210.25	74.75	26.5	167.0	39.0	104.5	94.4	107.8	66.0	42.0	25.6	35.7	30.6	18.8
7	19.0	19.2	1.0549	26	181.00	69.75	26.2	146.6	36.4	105.1	90.7	100.3	58.4	38.3	22.9	31.9	27.8	17.7
8	12.8	12.4	1.0704	25	176.00	72.50	23.6	153.6	37.8	99.6	88.5	97.1	60.0	39.4	23.2	30.5	29.0	18.8
9	5.1	4.1	1.0900	25	191.00	74.00	24.6	181.3	38.1	100.9	82.5	99.9	62.9	38.3	23.8	35.9	31.1	18.2
10	12.0	11.7	1.0722	23	198.25	73.50	25.8	174.4	42.1	99.6	88.6	104.1	63.1	41.7	25.0	35.6	30.0	19.2

UsingR::fat %>% 
  slice_tail(n = 12) %>% 
  gt::gt()

case	body.fat	body.fat.siri	density	age	weight	height	BMI	ffweight	neck	chest	abdomen	hip	thigh	knee	ankle	bicep	forearm	wrist
241	17.0	17.0	1.0599	65	127.50	65.75	20.8	105.9	34.7	93.0	79.7	87.6	50.7	33.4	20.1	28.5	24.8	16.5
242	33.6	35.0	1.0207	65	224.50	68.25	33.9	149.2	38.8	119.6	118.0	114.3	61.3	42.1	23.4	34.9	30.1	19.4
243	29.3	30.4	1.0304	66	234.25	72.00	31.8	165.6	41.4	119.7	109.0	109.1	63.7	42.4	24.6	35.6	30.7	19.5
244	31.4	32.6	1.0256	67	227.75	72.75	30.3	156.3	41.3	115.8	113.4	109.8	65.6	46.0	25.4	35.3	29.8	19.5
245	28.1	29.0	1.0334	67	199.50	68.50	29.9	143.6	40.7	118.3	106.1	101.6	58.2	38.8	24.1	32.1	29.3	18.5
246	15.3	15.2	1.0641	68	155.50	69.25	22.8	131.8	36.3	97.4	84.3	94.4	54.3	37.5	22.6	29.2	27.3	18.5
247	29.1	30.2	1.0308	69	215.50	70.50	30.5	152.7	40.8	113.7	107.6	110.0	63.3	44.0	22.6	37.5	32.6	18.8
248	11.5	11.0	1.0736	70	134.25	67.00	21.1	118.9	34.9	89.2	83.6	88.8	49.6	34.8	21.5	25.6	25.7	18.5
249	32.3	33.6	1.0236	72	201.00	69.75	29.1	136.1	40.9	108.5	105.0	104.5	59.6	40.8	23.2	35.2	28.6	20.1
250	28.3	29.3	1.0328	72	186.75	66.00	30.2	133.9	38.9	111.1	111.5	101.7	60.3	37.3	21.5	31.3	27.2	18.0
251	25.3	26.0	1.0399	72	190.75	70.50	27.0	142.6	38.9	108.3	101.3	97.8	56.0	41.6	22.7	30.5	29.4	19.8
252	30.7	31.9	1.0271	74	207.50	70.00	29.8	143.7	40.8	112.4	108.5	107.1	59.3	42.2	24.6	33.7	30.0	20.9

2.1.2 Estatísticas Descritivas

visdat::vis_dat(UsingR::fat)

Verificando se existe algum valor ausente:

visdat::vis_miss(UsingR::fat, sort_miss = TRUE)

Conforme observado, temos todo o conjunto de dados completo sem nenhum NA.

Verificando o tipo das variáveis

glimpse(UsingR::fat)

Rows: 252
Columns: 19
$ case          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ body.fat      <dbl> 12.6, 6.9, 24.6, 10.9, 27.8, 20.6, 19.0, 12.8, 5.1, 12.0…
$ body.fat.siri <dbl> 12.3, 6.1, 25.3, 10.4, 28.7, 20.9, 19.2, 12.4, 4.1, 11.7…
$ density       <dbl> 1.0708, 1.0853, 1.0414, 1.0751, 1.0340, 1.0502, 1.0549, …
$ age           <int> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32, 30, …
$ weight        <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 181.00, …
$ height        <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, …
$ BMI           <dbl> 23.7, 23.4, 24.7, 24.9, 25.6, 26.5, 26.2, 23.6, 24.6, 25…
$ ffweight      <dbl> 134.9, 161.3, 116.0, 164.7, 133.1, 167.0, 146.6, 153.6, …
$ neck          <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38.1, 42…
$ chest         <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6, 100.9…
$ abdomen       <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5, 8…
$ hip           <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1, 99.9…
$ thigh         <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62.9, 63…
$ knee          <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38.3, 41…
$ ankle         <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23.8, 25…
$ bicep         <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35.9, 35…
$ forearm       <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31.1, 30…
$ wrist         <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18.2, 19…

Com a função skim do pacote skimr, temos um resumo da estatística descritiva do banco utilizado.

skimr::skim(UsingR::fat)

Data summary
Name	UsingR::fat
Number of rows	252
Number of columns	19
_______________________
Column type frequency:
numeric	19
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
case	1	126.50	72.89	1.0	63.75	126.50	189.25	252.00	▇▇▇▇▇
body.fat	1	18.94	7.75	0.0	12.80	19.00	24.60	45.10	▂▇▇▃▁
body.fat.siri	1	19.15	8.37	0.0	12.47	19.20	25.30	47.50	▃▇▇▃▁
density	1	1.06	0.02	1.0	1.04	1.05	1.07	1.11	▁▅▇▆▁
age	1	44.88	12.60	22.0	35.75	43.00	54.00	81.00	▅▇▇▃▁
weight	1	178.92	29.39	118.5	159.00	176.50	197.00	363.15	▆▇▂▁▁
height	1	70.15	3.66	29.5	68.25	70.00	72.25	77.75	▁▁▁▂▇
BMI	1	25.44	3.65	18.1	23.10	25.05	27.33	48.90	▆▇▁▁▁
ffweight	1	143.71	18.23	105.9	131.35	141.55	153.88	240.50	▃▇▂▁▁
neck	1	37.99	2.43	31.1	36.40	38.00	39.42	51.20	▂▇▃▁▁
chest	1	100.82	8.43	79.3	94.35	99.65	105.38	136.20	▂▇▅▁▁
abdomen	1	92.56	10.78	69.4	84.57	90.95	99.33	148.10	▃▇▂▁▁
hip	1	99.90	7.16	85.0	95.50	99.30	103.53	147.70	▆▇▁▁▁
thigh	1	59.41	5.25	47.2	56.00	59.00	62.35	87.30	▃▇▃▁▁
knee	1	38.59	2.41	33.0	36.98	38.50	39.92	49.10	▃▇▅▁▁
ankle	1	23.10	1.69	19.1	22.00	22.80	24.00	33.90	▃▇▁▁▁
bicep	1	32.27	3.02	24.8	30.20	32.05	34.32	45.00	▂▇▆▁▁
forearm	1	28.66	2.02	21.0	27.30	28.70	30.00	34.90	▁▂▇▆▁
wrist	1	18.23	0.93	15.8	17.60	18.30	18.80	21.40	▂▆▇▂▁

2.2 Primeiro Ajuste

Vamos fazer o primeiro ajuste, considerando a variável dependente body.fat, e o parâmetro te, nas variáveis independentes, logo em seguida vamos analisar os resultados.

fit1<-mgcv::gam(body.fat ~ te(age) + te(weight) + te(height) + te(BMI) + te(neck) + te(chest) + te(abdomen) + te(hip) + te(thigh) + te(knee) + te(ankle) + te(bicep) + te(forearm) + te(wrist), data=UsingR::fat)

2.2.1 Resultados

summary(fit1)


Family: gaussian 
Link function: identity 

Formula:
body.fat ~ te(age) + te(weight) + te(height) + te(BMI) + te(neck) + 
    te(chest) + te(abdomen) + te(hip) + te(thigh) + te(knee) + 
    te(ankle) + te(bicep) + te(forearm) + te(wrist)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.9385     0.2322   81.55   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df      F  p-value    
te(age)     1.000  1.000  6.216 0.013381 *  
te(weight)  1.000  1.000  0.121 0.728618    
te(height)  1.000  1.000  0.693 0.406014    
te(BMI)     2.076  2.647  0.810 0.402917    
te(neck)    1.336  1.591  3.816 0.087699 .  
te(chest)   1.000  1.000  0.984 0.322169    
te(abdomen) 4.000  4.000 25.794  < 2e-16 ***
te(hip)     3.324  3.680  2.182 0.163367    
te(thigh)   2.880  3.421  1.553 0.305784    
te(knee)    1.750  2.187  0.733 0.480100    
te(ankle)   1.000  1.000  0.579 0.447634    
te(bicep)   3.606  3.896  2.186 0.075074 .  
te(forearm) 1.000  1.000  1.429 0.233174    
te(wrist)   1.734  2.155  8.458 0.000234 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.774   Deviance explained = 79.8%
GCV = 15.269  Scale est. = 13.591    n = 252

Podemos ver, com o primeiro ajuste, o resultado de que, tivemos um R^2 ajustado de 77.4%, e uma explicação de Desvio de 79.8%, que são bons valores para o modelo ajustado. Todavia, prosseguiremos em busca do melhor modelo ajustado.

2.3 Segundo ajuste

Vamos fazer o segundo ajuste, considerando a variável dependente body.fat, e o parâmetro t2, nas variáveis independentes, logo em seguida vamos analisar os resultados.

fit2<-mgcv::gam(body.fat ~ t2(age) + t2(weight) + t2(height) + t2(BMI) + t2(neck) + t2(chest) + t2(abdomen) +t2(hip) + t2(thigh) + t2(knee) + t2(ankle) + t2(bicep) + t2(forearm) + t2(wrist), data=UsingR::fat)

2.3.1 Resultados

summary(fit2)


Family: gaussian 
Link function: identity 

Formula:
body.fat ~ t2(age) + t2(weight) + t2(height) + t2(BMI) + t2(neck) + 
    t2(chest) + t2(abdomen) + t2(hip) + t2(thigh) + t2(knee) + 
    t2(ankle) + t2(bicep) + t2(forearm) + t2(wrist)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.9385     0.2323   81.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df      F p-value    
t2(age)     1.000  1.000  6.134  0.0140 *  
t2(weight)  1.000  1.000  0.142  0.7067    
t2(height)  1.000  1.000  0.652  0.4204    
t2(BMI)     2.085  2.653  0.535  0.6423    
t2(neck)    1.339  1.594  0.837  0.5339    
t2(chest)   1.000  1.000  0.989  0.3211    
t2(abdomen) 4.000  4.000 25.890  <2e-16 ***
t2(hip)     3.167  3.627  1.925  0.2097    
t2(thigh)   2.890  3.420  1.540  0.2067    
t2(knee)    1.753  2.188  0.489  0.5731    
t2(ankle)   1.000  1.000  0.593  0.4419    
t2(bicep)   3.515  3.859  1.855  0.1050    
t2(forearm) 1.000  1.000  1.501  0.2217    
t2(wrist)   1.737  2.157  5.658  0.0257 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.774   Deviance explained = 79.7%
GCV = 15.269  Scale est. = 13.604    n = 252

Podemos ver, com o segundo ajuste, o resultado de que, tivemos um R^2 ajustado de 77.4%, e uma explicação de Desvio de 79.7%, que são bons valores para o modelo ajustado. Todavia, prosseguiremos em busca do melhor modelo ajustado.

2.4 Terceiro ajuste

Vamos fazer o terceiro ajuste, considerando a variável dependente body.fat, e o parâmetro s, nas variáveis independentes, logo em seguida vamos analisar os resultados.

fit3<-mgcv::gam(body.fat ~ s(age) + s(weight) + s(height) + s(BMI) + s(neck) + s(chest) + s(abdomen) + s(hip) + s(thigh) + s(knee) + s(ankle) + s(bicep) + s(forearm) + s(wrist), data=UsingR::fat)

2.4.1 Resultados

summary(fit3)


Family: gaussian 
Link function: identity 

Formula:
body.fat ~ s(age) + s(weight) + s(height) + s(BMI) + s(neck) + 
    s(chest) + s(abdomen) + s(hip) + s(thigh) + s(knee) + s(ankle) + 
    s(bicep) + s(forearm) + s(wrist)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.9385     0.2256   83.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
             edf Ref.df      F  p-value    
s(age)     5.423  6.567  2.075 0.051891 .  
s(weight)  1.000  1.000  0.175 0.676352    
s(height)  1.000  1.000  0.199 0.655765    
s(BMI)     2.452  3.173  0.925 0.482472    
s(neck)    1.494  1.842  1.867 0.104184    
s(chest)   1.000  1.000  0.563 0.453853    
s(abdomen) 6.547  7.425 14.779  < 2e-16 ***
s(hip)     7.439  8.261  2.130 0.047066 *  
s(thigh)   2.361  3.065  0.490 0.665215    
s(knee)    1.437  1.760  0.368 0.566256    
s(ankle)   2.844  3.503  0.820 0.618727    
s(bicep)   4.558  5.557  1.791 0.081062 .  
s(forearm) 1.000  1.000  2.156 0.143457    
s(wrist)   1.806  2.268  7.208 0.000558 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.787   Deviance explained = 82.1%
GCV = 15.338  Scale est. = 12.821    n = 252

Podemos ver, com o terceiro ajuste, o resultado de que, tivemos um R^2 ajustado de 78.7%, e uma explicação de Desvio de 82.1%. Ou seja, temos que o terceiro modelo foi o que melhor se adequou aos dados, sendo assim, tomaremos o parâmetro s de suavização como o parâmetro a ser utilizado no ajuste da GAM.

2.5 Ajuste do modelo final.

Devido ao número excessivo de variáveis, é observado que no ajuste acima algumas variáveis não são significativas para o ajuste, considerando o nível de significância de 5%, foi observado que apenas as variáveis abaixo são significativas.

abdomen : Abdomen circumference (cm) “at the umbilicus and level with the iliac crest”
hip : Hip circumference (cm)
wrist : Wrist circumference (cm) “distal to the styloid processes”

Sendo assim, ajustaremos um modelo apenas com essas variáveis.

ajuste_final <- mgcv::gam(body.fat~s(abdomen)+s(hip)+s(wrist), data = UsingR::fat)

2.5.1 Resultados

summary(ajuste_final)


Family: gaussian 
Link function: identity 

Formula:
body.fat ~ s(abdomen) + s(hip) + s(wrist)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.9385     0.2433   77.85   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
             edf Ref.df      F  p-value    
s(abdomen) 5.387  6.490 51.025  < 2e-16 ***
s(hip)     6.986  8.005  3.266  0.00145 ** 
s(wrist)   1.267  1.486 22.070 2.38e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.752   Deviance explained = 76.5%
GCV = 15.834  Scale est. = 14.914    n = 252

2.6 Análise gráfica residual

Uma importante parte é a análise gráfica para saber se o modelo ficou bem ajustado.

par(mfrow = c(1,3))
plot(ajuste_final)

par(mfrow = c(2,2))
mgcv::gam.check(ajuste_final)


Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 8 iterations.
The RMS GCV score gradient at convergence was 4.538414e-06 .
The Hessian was positive definite.
Model rank =  28 / 28 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

             k'  edf k-index p-value
s(abdomen) 9.00 5.39    1.02    0.56
s(hip)     9.00 6.99    1.03    0.66
s(wrist)   9.00 1.27    1.06    0.83

3 Conclusão

O uso de GAM (Modelos aditivos generalizados) se mostrou muito eficiente no ajuste dos dados, quando utilizado o parâmetro de suavização spline, sendo assim, obtivemos um bom modelo ajustado, onde os valores preditos estão coerentes com os valores observados, tendo uma diminuição no erro, um bom coeficiente de determinação, evidenciando a importância do modelo ajustado, e que os residuos aparentemente seguem uma distribuição normal.