Trabalho II - Multivariados

Author

Vinicius Aquino

Pacotes usados:

library(gcmr)
library(ggplot2)
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-7

Questão 1

ETL dos dados:

dados = read.csv("http://leg.ufpr.br/~lucambio/CE090/20231S/DRS.dat", header = TRUE, sep = ",")
dados$treat <- as.factor(dados$treat)
dados$type <- as.factor(dados$type)
dados$ind <- as.factor(dados$ind)

Treinando modelos:

Poisson:

copulapois <- gcmr(status ~ . -Obs -id, marginal = poisson.marg(),
 cormat = cluster.cormat(id, type = "exc"), data = dados)
summary(copulapois)

Call:
gcmr(formula = status ~ . - Obs - id, data = dados, marginal = poisson.marg(), 
    cormat = cluster.cormat(id, type = "exc"))


Coefficients marginal model:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.087980   0.184866  -0.476  0.63414    
ind2         0.267516   0.082319   3.250  0.00116 ** 
obs_time    -0.031644   0.003716  -8.517  < 2e-16 ***
treat1      -0.159813   0.111571  -1.432  0.15203    
treat2      -0.111531   0.100803  -1.106  0.26854    
age         -0.002087   0.011257  -0.185  0.85290    
type2        0.228088   0.345206   0.661  0.50879    

Coefficients Gaussian copula:
    Estimate Std. Error z value Pr(>|z|)    
tau  0.91049    0.04124   22.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

log likelihood = 228.06,  AIC = 472.11
plot(copulapois)

Binomial:

copulaBin <- gcmr(status ~ . -Obs -id, marginal = binomial.marg(), cormat = cluster.cormat(id, type = "exc"), data = dados)
summary(copulaBin)

Call:
gcmr(formula = status ~ . - Obs - id, data = dados, marginal = binomial.marg(), 
    cormat = cluster.cormat(id, type = "exc"))


Coefficients marginal model:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.425575   0.461513   5.256 1.47e-07 ***
ind2         0.891029   0.231518   3.849 0.000119 ***
obs_time    -0.093177   0.009285 -10.035  < 2e-16 ***
treat1      -0.444397   0.288025  -1.543 0.122853    
treat2      -0.278507   0.273096  -1.020 0.307817    
age         -0.015060   0.020521  -0.734 0.462998    
type2        0.740572   0.618865   1.197 0.231438    

Coefficients Gaussian copula:
    Estimate Std. Error z value Pr(>|z|)    
tau  0.82426    0.09618    8.57   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

log likelihood = 148.01,  AIC = 312.03
plot(copulaBin)

Gaussiano:

copula <- gcmr(status ~ . -Obscid, marginal = gaussian.marg(), cormat = cluster.cormat(id, type = "exc"), data = dados)
Warning in terms.formula(fi, ...): 'varlist' has changed (from nvar=8) to new 9
after EncodeVars() -- should no longer happen!

Warning in terms.formula(fi, ...): 'varlist' has changed (from nvar=8) to new 9
after EncodeVars() -- should no longer happen!

Warning in terms.formula(fi, ...): 'varlist' has changed (from nvar=8) to new 9
after EncodeVars() -- should no longer happen!
summary(copula)

Call:
gcmr(formula = status ~ . - Obscid, data = dados, marginal = gaussian.marg(), 
    cormat = cluster.cormat(id, type = "exc"))


Coefficients marginal model:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.9384445  0.0710656  13.205  < 2e-16 ***
Obs         -0.0012850  0.0044170  -0.291    0.771    
id           0.0002520  0.0010151   0.248    0.804    
ind2         0.1422006  0.0291257   4.882 1.05e-06 ***
obs_time    -0.0157891  0.0009685 -16.302  < 2e-16 ***
treat1      -0.0606403  0.0370243  -1.638    0.101    
treat2      -0.0553727  0.0371285  -1.491    0.136    
age         -0.0014056  0.0028169  -0.499    0.618    
type2        0.0902787  0.0844710   1.069    0.285    
sigma        0.3716866  0.0149176  24.916  < 2e-16 ***

Coefficients Gaussian copula:
    Estimate Std. Error z value Pr(>|z|)    
tau  0.44532    0.06226   7.153 8.48e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

log likelihood = 143.62,  AIC = 309.25
plot(copula)

Como pode ser notado, o tempo observado foi a variável mais significativa nos dados. O modelo que teve melhor ajuste foi o de Poisson.

Questão 2

ETL dos dados:

data <- read.csv(file = "http://leg.ufpr.br/~lucambio/MSM/cigarettes.txt",
 header = TRUE, sep = "")
data$Gender <- as.factor(data$Gender)
data$Brand <- as.factor(data$Brand)

Modelo sem interação:

summary(glm(y ~ ., family = poisson, data = data))

Call:
glm(formula = y ~ ., family = poisson, data = data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.86527    1.07137   1.741   0.0817 .
Price        0.30959    0.32024   0.967   0.3337  
GenderM     -0.01568    0.22505  -0.070   0.9444  
BrandB      -0.87602    0.37957  -2.308   0.0210 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 63.647  on 11  degrees of freedom
Residual deviance: 46.311  on  8  degrees of freedom
AIC: 105.56

Number of Fisher Scoring iterations: 5

Modelo com interação:

summary(glm(y ~ . + Price:Brand, family = poisson, data = data))

Call:
glm(formula = y ~ . + Price:Brand, family = poisson, data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)    6.7204     2.6941   2.494   0.0126 *
Price         -1.1348     0.8028  -1.414   0.1575  
GenderM        0.2615     0.2491   1.050   0.2939  
BrandB        -6.6896     2.9766  -2.247   0.0246 *
Price:BrandB   1.6522     0.8394   1.968   0.0490 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 63.647  on 11  degrees of freedom
Residual deviance: 42.361  on  7  degrees of freedom
AIC: 103.61

Number of Fisher Scoring iterations: 5

Sem a análise da interação, somente há efeito da marca. No entanto, quando se adciona a interação, é possível notar que a interação preço e marca é significativa.

Adcionando curvatura no preço:

m = glm(y~.+Price:Brand+I(Price^2), family = poisson, data = data)

summary(m)

Call:
glm(formula = y ~ . + Price:Brand + I(Price^2), family = poisson, 
    data = data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -183.8193    46.7926  -3.928 8.55e-05 ***
Price         103.9964    25.7996   4.031 5.56e-05 ***
GenderM        -2.3613     0.6767  -3.489 0.000484 ***
BrandB       -105.5351    24.7177  -4.270 1.96e-05 ***
I(Price^2)    -14.2874     3.5069  -4.074 4.62e-05 ***
Price:BrandB   25.7770     6.0386   4.269 1.97e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 63.647  on 11  degrees of freedom
Residual deviance: 22.688  on  6  degrees of freedom
AIC: 85.941

Number of Fisher Scoring iterations: 5
plot(m)

Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Ao adcionar a curvatura no preço, todas as variáveis se tornam significativas. Nesse ponto, deve-se tomar cuidado com um eventual overfitting.

Preparando os dados:

data$Gender <- as.numeric(data$Gender)
data$Brand <- as.numeric(data$Brand)
m <- as.matrix(data[,2:4])
r <- data$y

Valor ótimo encontrado:

mod <- cv.glmnet(m, r, alpha = 0)
Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
fold
(mod$lambda.min)
[1] 18.24972