Package installation and data extraction

Packages

Library

Creating the casos table

Filtering only COVID cases

casos_fil_covid <- filter(casos_fil, CLASSI_FIN == 5)

Creating the earliest-date column

Table summary

summary(casos_fil_covid)
##    DT_NOTIFIC            SEM_NOT       SG_UF_NOT          ID_MUNICIP       
##  Min.   :2020-02-21   Min.   : 1.00   Length:2136005     Length:2136005    
##  1st Qu.:2020-11-11   1st Qu.:12.00   Class :character   Class :character  
##  Median :2021-03-24   Median :21.00   Mode  :character   Mode  :character  
##  Mean   :2021-03-17   Mean   :22.53                                        
##  3rd Qu.:2021-06-15   3rd Qu.:30.00                                        
##  Max.   :2022-12-09   Max.   :53.00                                        
##                                                                            
##    CO_MUN_NOT       CS_SEXO            DT_NASC            CS_GESTANT   
##  Min.   :110001   Length:2136005     Length:2136005     Min.   :0.000  
##  1st Qu.:310620   Class :character   Class :character   1st Qu.:5.000  
##  Median :351880   Mode  :character   Mode  :character   Median :6.000  
##  Mean   :344069                                         Mean   :5.779  
##  3rd Qu.:410690                                         3rd Qu.:6.000  
##  Max.   :530010                                         Max.   :9.000  
##                                                                        
##     CS_RACA        CS_ESCOL_N      PAC_COCBO            CS_ZONA      
##  Min.   :1.00    Min.   :0.0      Length:2136005     Min.   :1.00    
##  1st Qu.:1.00    1st Qu.:2.0      Class :character   1st Qu.:1.00    
##  Median :4.00    Median :4.0      Mode  :character   Median :1.00    
##  Mean   :3.49    Mean   :5.4                         Mean   :1.15    
##  3rd Qu.:4.00    3rd Qu.:9.0                         3rd Qu.:1.00    
##  Max.   :9.00    Max.   :9.0                         Max.   :9.00    
##  NA's   :28576   NA's   :716885                      NA's   :233075  
##    VACINA_COV         VACINA        DT_UT_DOSE           HOSPITAL    
##  Min.   :1.0      Min.   :1.0      Length:2136005     Min.   :1.00   
##  1st Qu.:2.0      1st Qu.:2.0      Class :character   1st Qu.:1.00   
##  Median :2.0      Median :2.0      Mode  :character   Median :1.00   
##  Mean   :2.6      Mean   :5.1                         Mean   :1.04   
##  3rd Qu.:2.0      3rd Qu.:9.0                         3rd Qu.:1.00   
##  Max.   :9.0      Max.   :9.0                         Max.   :9.00   
##  NA's   :326476   NA's   :528302                      NA's   :45286  
##    DT_INTERNA              UTI           DT_ENTUTI         
##  Min.   :2020-01-05   Min.   :1.00     Min.   :2020-01-05  
##  1st Qu.:2020-11-06   1st Qu.:1.00     1st Qu.:2020-11-09  
##  Median :2021-03-21   Median :2.00     Median :2021-03-24  
##  Mean   :2021-03-23   Mean   :1.79     Mean   :2021-03-17  
##  3rd Qu.:2021-06-11   3rd Qu.:2.00     3rd Qu.:2021-06-14  
##  Max.   :9202-09-11   Max.   :9.00     Max.   :4202-05-26  
##                       NA's   :254657                       
##    DT_SAIDUTI           CLASSI_FIN    EVOLUCAO       DT_EVOLUCA        
##  Min.   :2020-02-21   Min.   :5    Min.   :1.0     Min.   :2020-02-21  
##  1st Qu.:2020-11-12   1st Qu.:5    1st Qu.:1.0     1st Qu.:2020-11-17  
##  Median :2021-03-26   Median :5    Median :1.0     Median :2021-03-31  
##  Mean   :2021-03-18   Mean   :5    Mean   :1.5     Mean   :2021-03-23  
##  3rd Qu.:2021-06-16   3rd Qu.:5    3rd Qu.:2.0     3rd Qu.:2021-06-21  
##  Max.   :2121-03-13   Max.   :5    Max.   :9.0     Max.   :2022-12-04  
##                                    NA's   :99481                       
##     DT_MIN.DT_MIN    
##  Min.   :2020-01-05  
##  1st Qu.:2020-11-05  
##  Median :2021-03-21  
##  Mean   :2021-03-12  
##  3rd Qu.:2021-06-10  
##  Max.   :2022-12-04  
## 

Creating tables by type

The data set has many columns, some of them with many nulls. To check which ones affect mortality, I will separate them by type: demographic, comorbidities, vaccination-related, and so on.

Demographic

Creating dummies

casos_fil_covid <- dummy_cols(casos_fil_covid, select_columns = c('CS_SEXO','CS_RACA','CS_ZONA', 'VACINA_COV','UTI', 'CLASSI_FIN'),
           remove_selected_columns = TRUE)
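
To make the encoding concrete, here is a minimal base-R sketch of a one-hot encoding with a separate NA indicator. It is not the fastDummies implementation: the helper `one_hot` and the toy vector `zona` are hypothetical names made up for illustration. It reproduces the pattern visible in the final summary of this document, where the code columns keep NA for rows with a missing value and an extra `_NA` column flags those rows.

```r
# Toy column with three codes and one missing value (hypothetical data).
zona <- c(1, 2, NA, 9, 1)

# Hypothetical helper: one indicator column per observed code, plus an NA flag.
one_hot <- function(x, prefix) {
  codes <- sort(unique(x[!is.na(x)]))
  # x == code is NA wherever x is NA, so the indicator stays NA for those rows
  out <- lapply(codes, function(code) as.integer(x == code))
  names(out) <- paste(prefix, codes, sep = "_")
  # The NA flag itself is never missing: 1 when x is NA, 0 otherwise
  out[[paste(prefix, "NA", sep = "_")]] <- as.integer(is.na(x))
  as.data.frame(out)
}

dummies <- one_hot(zona, "CS_ZONA")
print(dummies)
```

Because the code columns stay NA for rows with a missing value, a model that uses one of them (for example CS_ZONA_2) will drop those rows unless they are handled beforehand.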

Creating the age column, counting the ages less than or equal to zero (input errors) and the ages of 100 or more, and excluding those values.

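# Parse the birth date (day/month/year) into a Date
casos_fil_covid$DT_NASC <- as_date(casos_fil_covid$DT_NASC, format = "%d/%m/%Y")

# Age in years: days between DT_MIN and DT_NASC, divided by 365
casos_fil_covid$idade <- (as.numeric(casos_fil_covid$DT_MIN) - as.numeric(casos_fil_covid$DT_NASC)) / 365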

count(casos_fil_covid, idade <= 0)
## # A tibble: 3 x 2
##   `idade <= 0`       n
##   <lgl>          <int>
## 1 FALSE        2133407
## 2 TRUE             743
## 3 NA              1855
count(casos_fil_covid, idade >= 100)
## # A tibble: 3 x 2
##   `idade >= 100`       n
##   <lgl>            <int>
## 1 FALSE          2130827
## 2 TRUE              3323
## 3 NA                1855
# Keep only ages strictly between 0 and 100; filter() also drops rows where idade is NA
casos_fil_covid <- filter(casos_fil_covid, idade > 0)
casos_fil_covid <- filter(casos_fil_covid, idade < 100)
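
The age step can be sketched on toy data with base R only. The vectors `birth` and `event` are hypothetical and stand in for DT_NASC and DT_MIN. The point is to show that the comparison is NA for a missing birth date, so that row is excluded together with the out-of-range ages, which is how the 1855 NA rows counted above disappear.

```r
# Hypothetical toy dates: a plausible case, a birth date after the event,
# an implausibly old patient, and a missing birth date.
birth <- as.Date(c("15/03/1960", "01/01/2021", "10/10/1900", NA), format = "%d/%m/%Y")
event <- as.Date(c("2021-03-15", "2020-06-01", "2021-01-01", "2021-05-05"))

# Same formula as in the document: difference in days divided by 365.
age <- as.numeric(event - birth) / 365

# Base-R equivalent of the two filter() calls: NA comparisons are treated as FALSE.
keep <- !is.na(age) & age > 0 & age < 100
print(data.frame(age = round(age, 2), keep = keep))
```

Dividing by 365.25 would account for leap years; the divisor 365 is kept here only to match the calculation used in this document.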

Transforming the EVOLUCAO variable into a dummy: 1 = survived, 0 = death. The recode maps every non-missing value other than 1 to 0, and missing values stay NA.

casos_fil_covid$EVOLUCAO <- if_else(casos_fil_covid$EVOLUCAO == 1 , 1, 0)
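
A small sketch of how this recode treats each kind of value, using base `ifelse` in place of `if_else` on a hypothetical toy vector. The first summary shows that EVOLUCAO goes up to 9, so values other than 1 and 2 do occur. The stricter variant at the end is only an assumption of mine (that code 2 is the only one meaning death); it should be checked against the data dictionary before use.

```r
# Hypothetical toy values covering the cases that matter for the recode.
evolucao <- c(1, 2, 3, 9, NA)

# Recode used in the document: 1 stays 1, any other non-missing value becomes 0.
survived <- ifelse(evolucao == 1, 1, 0)
print(survived)

# Assumed stricter alternative: only code 2 counts as death, the rest become NA.
strict <- ifelse(evolucao == 1, 1, ifelse(evolucao == 2, 0, NA))
print(strict)
```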

Proportion of nulls in the EVOLUCAO column (about 4.7%).

sum(is.na(casos_fil_covid$EVOLUCAO)) / length(casos_fil_covid$EVOLUCAO)
## [1] 0.0465484
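
As a sanity check, the proportion above can be rebuilt from counts printed elsewhere in this document. The sketch below first shows the shorter idiom `mean(is.na(x))` on a hypothetical toy vector, then redoes the row accounting: the 2136005 rows of the first summary, minus the rows removed by the age filter, with the 99152 NA's that the final summary reports for EVOLUCAO.

```r
# Toy vector: the proportion of NAs is the mean of the logical is.na() vector.
x <- c(1, 0, NA, 1, NA)
prop_na <- mean(is.na(x))            # same as sum(is.na(x)) / length(x)

# Row accounting with the counts printed in this document.
n_start   <- 2136005                 # rows in the first summary
n_removed <- 743 + 3323 + 1855       # idade <= 0, idade >= 100, idade NA
n_rows    <- n_start - n_removed     # rows left after the age filter
na_evol   <- 99152                   # NA's reported for EVOLUCAO in the final summary

print(n_rows)
print(na_evol / n_rows)
```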

I will drop certain columns so that my data is left in one-hot-encoded form.

casos_fil_covid <- casos_fil_covid %>%
  select(-SG_UF_NOT, -ID_MUNICIP, -CO_MUN_NOT, -DT_NASC, -CS_GESTANT, -CS_ESCOL_N, -VACINA, -HOSPITAL, -DT_MIN, -DT_EVOLUCA, -DT_SAIDUTI, -DT_ENTUTI, -DT_INTERNA, -DT_NOTIFIC, -PAC_COCBO, -DT_UT_DOSE, -SEM_NOT)

Summary of the data frame to be analyzed

summary(casos_fil_covid)
##     EVOLUCAO       CS_SEXO_F        CS_SEXO_I           CS_SEXO_M     
##  Min.   :0.00    Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000  
##  1st Qu.:0.00    1st Qu.:0.0000   1st Qu.:0.0000000   1st Qu.:0.0000  
##  Median :1.00    Median :0.0000   Median :0.0000000   Median :1.0000  
##  Mean   :0.65    Mean   :0.4484   Mean   :0.0001272   Mean   :0.5515  
##  3rd Qu.:1.00    3rd Qu.:1.0000   3rd Qu.:0.0000000   3rd Qu.:1.0000  
##  Max.   :1.00    Max.   :1.0000   Max.   :1.0000000   Max.   :1.0000  
##  NA's   :99152                                                        
##    CS_RACA_1       CS_RACA_2       CS_RACA_3       CS_RACA_4    
##  Min.   :0.000   Min.   :0.000   Min.   :0.00    Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.00    1st Qu.:0.000  
##  Median :0.000   Median :0.000   Median :0.00    Median :0.000  
##  Mean   :0.429   Mean   :0.043   Mean   :0.01    Mean   :0.342  
##  3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:0.00    3rd Qu.:1.000  
##  Max.   :1.000   Max.   :1.000   Max.   :1.00    Max.   :1.000  
##  NA's   :28434   NA's   :28434   NA's   :28434   NA's   :28434  
##    CS_RACA_5       CS_RACA_9       CS_RACA_NA        CS_ZONA_1     
##  Min.   :0.000   Min.   :0.000   Min.   :0.00000   Min.   :0.00    
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:1.00    
##  Median :0.000   Median :0.000   Median :0.00000   Median :1.00    
##  Mean   :0.002   Mean   :0.174   Mean   :0.01335   Mean   :0.93    
##  3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:0.00000   3rd Qu.:1.00    
##  Max.   :1.000   Max.   :1.000   Max.   :1.00000   Max.   :1.00    
##  NA's   :28434   NA's   :28434                     NA's   :231863  
##    CS_ZONA_2        CS_ZONA_3        CS_ZONA_9        CS_ZONA_NA    
##  Min.   :0.00     Min.   :0        Min.   :0.00     Min.   :0.0000  
##  1st Qu.:0.00     1st Qu.:0        1st Qu.:0.00     1st Qu.:0.0000  
##  Median :0.00     Median :0        Median :0.00     Median :0.0000  
##  Mean   :0.05     Mean   :0        Mean   :0.01     Mean   :0.1089  
##  3rd Qu.:0.00     3rd Qu.:0        3rd Qu.:0.00     3rd Qu.:0.0000  
##  Max.   :1.00     Max.   :1        Max.   :1.00     Max.   :1.0000  
##  NA's   :231863   NA's   :231863   NA's   :231863                   
##   VACINA_COV_1     VACINA_COV_2     VACINA_COV_9    VACINA_COV_NA   
##  Min.   :0.0      Min.   :0.0      Min.   :0.0      Min.   :0.0000  
##  1st Qu.:0.0      1st Qu.:0.0      1st Qu.:0.0      1st Qu.:0.0000  
##  Median :0.0      Median :1.0      Median :0.0      Median :0.0000  
##  Mean   :0.2      Mean   :0.7      Mean   :0.1      Mean   :0.1529  
##  3rd Qu.:0.0      3rd Qu.:1.0      3rd Qu.:0.0      3rd Qu.:0.0000  
##  Max.   :1.0      Max.   :1.0      Max.   :1.0      Max.   :1.0000  
##  NA's   :325774   NA's   :325774   NA's   :325774                   
##      UTI_1            UTI_2            UTI_9            UTI_NA      
##  Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.0000  
##  1st Qu.:0.00     1st Qu.:0.00     1st Qu.:0.00     1st Qu.:0.0000  
##  Median :0.00     Median :1.00     Median :0.00     Median :0.0000  
##  Mean   :0.37     Mean   :0.61     Mean   :0.02     Mean   :0.1189  
##  3rd Qu.:1.00     3rd Qu.:1.00     3rd Qu.:0.00     3rd Qu.:0.0000  
##  Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.0000  
##  NA's   :253195   NA's   :253195   NA's   :253195                   
##   CLASSI_FIN_5     idade          
##  Min.   :1     Min.   :  0.00274  
##  1st Qu.:1     1st Qu.: 45.50685  
##  Median :1     Median : 59.41918  
##  Mean   :1     Mean   : 58.41555  
##  3rd Qu.:1     3rd Qu.: 72.51233  
##  Max.   :1     Max.   : 99.99726  
## 

Selected variables: CS_SEXO_F = whether the person is female, CS_RACA_2 = whether the person is Black, CS_ZONA_2 = whether the person lives in a rural area, VACINA_COV_1 = whether the person was vaccinated, UTI_1 = whether the person went to the ICU, idade = the person's age. EVOLUCAO = 1 for survived, 0 for death. Note also that the glm command omits rows with null values (729499 observations in the fits below).
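
The row-dropping behaviour of glm can be seen on a tiny example. The data frame `toy` is hypothetical and has nothing to do with the real data; it only has one missing response and one missing predictor, so that two of its ten rows are discarded under R's default `na.action` (`na.omit`).

```r
# Hypothetical toy data: row 3 has a missing predictor, row 8 a missing response.
toy <- data.frame(
  y = c(1, 0, 1, 1, 0, 1, 0, NA, 1, 0),
  x = c(0.2, 0.5, NA, 0.1, 1.9, 0.4, 1.2, 0.3, 1.3, 1.1)
)

fit <- glm(y ~ x, family = binomial(link = "probit"), data = toy)

print(nobs(fit))               # number of rows actually used in the fit
print(length(fit$na.action))   # number of rows dropped because of NAs
```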

casos_fil_covid.probit <- glm(formula = EVOLUCAO ~   CS_SEXO_F + CS_RACA_2  + CS_ZONA_2 + VACINA_COV_1  + UTI_1 + idade, family = binomial(link = "probit"), 
    data = casos_fil_covid)
summary(casos_fil_covid.probit)
## 
## Call:
## glm(formula = EVOLUCAO ~ CS_SEXO_F + CS_RACA_2 + CS_ZONA_2 + 
##     VACINA_COV_1 + UTI_1 + idade, family = binomial(link = "probit"), 
##     data = casos_fil_covid)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1407  -0.9339   0.5147   0.8090   2.1926  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.169e+00  4.651e-03  466.41   <2e-16 ***
## CS_SEXO_F     6.344e-02  2.393e-03   26.51   <2e-16 ***
## CS_RACA_2    -1.883e-01  5.637e-03  -33.41   <2e-16 ***
## CS_ZONA_2    -1.370e-01  5.485e-03  -24.98   <2e-16 ***
## VACINA_COV_1  2.144e-01  2.890e-03   74.17   <2e-16 ***
## UTI_1        -9.741e-01  2.401e-03 -405.65   <2e-16 ***
## idade        -2.351e-02  7.058e-05 -333.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1796266  on 1400584  degrees of freedom
## Residual deviance: 1486420  on 1400578  degrees of freedom
##   (729499 observations deleted due to missingness)
## AIC: 1486434
## 
## Number of Fisher Scoring iterations: 4
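
Probit coefficients are on the scale of a standard normal index, so they are easier to read once converted to probabilities. The sketch below plugs the rounded estimates printed above into `pnorm` for a hypothetical profile (a 60-year-old man, not Black, not rural, not in the ICU), with and without vaccination. The profile is my own choice for illustration, and the results are only as precise as the rounded coefficients.

```r
# Rounded estimates copied from the probit summary above.
beta <- c(`(Intercept)` = 2.169, CS_SEXO_F = 0.06344, CS_RACA_2 = -0.1883,
          CS_ZONA_2 = -0.1370, VACINA_COV_1 = 0.2144, UTI_1 = -0.9741,
          idade = -0.02351)

# Hypothetical profile: 60-year-old man, all other dummies at 0.
x_unvax <- c(`(Intercept)` = 1, CS_SEXO_F = 0, CS_RACA_2 = 0, CS_ZONA_2 = 0,
             VACINA_COV_1 = 0, UTI_1 = 0, idade = 60)
x_vax <- x_unvax
x_vax["VACINA_COV_1"] <- 1

# Predicted survival probability = standard normal CDF of the linear index.
p_unvax <- pnorm(sum(beta * x_unvax[names(beta)]))
p_vax   <- pnorm(sum(beta * x_vax[names(beta)]))

print(c(unvaccinated = p_unvax, vaccinated = p_vax, difference = p_vax - p_unvax))
```

For this profile the predicted survival probability is roughly 0.78 without vaccination and roughly 0.83 with it. With the fitted object at hand, `predict(casos_fil_covid.probit, newdata = ..., type = "response")` would be the more usual route.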

Same analysis, but with Logit for comparison

casos_fil_covid.logit <- glm(formula = EVOLUCAO ~   CS_SEXO_F + CS_RACA_2  + CS_ZONA_2 + VACINA_COV_1  + UTI_1 + idade, family = binomial(link = "logit"), 
    data = casos_fil_covid)
summary(casos_fil_covid.logit)
## 
## Call:
## glm(formula = EVOLUCAO ~ CS_SEXO_F + CS_RACA_2 + CS_ZONA_2 + 
##     VACINA_COV_1 + UTI_1 + idade, family = binomial(link = "logit"), 
##     data = casos_fil_covid)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8967  -0.9233   0.5154   0.8005   2.1795  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.7000563  0.0083843  441.31   <2e-16 ***
## CS_SEXO_F     0.1121574  0.0040614   27.62   <2e-16 ***
## CS_RACA_2    -0.3205169  0.0094865  -33.79   <2e-16 ***
## CS_ZONA_2    -0.2231001  0.0092556  -24.10   <2e-16 ***
## VACINA_COV_1  0.3689358  0.0048997   75.30   <2e-16 ***
## UTI_1        -1.6203856  0.0040714 -397.99   <2e-16 ***
## idade        -0.0406087  0.0001241 -327.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1796266  on 1400584  degrees of freedom
## Residual deviance: 1485599  on 1400578  degrees of freedom
##   (729499 observations deleted due to missingness)
## AIC: 1485613
## 
## Number of Fisher Scoring iterations: 4
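
To compare the two fits, the sketch below uses only numbers copied from the two summaries above. It converts the logit coefficients into odds ratios, computes the ratio between logit and probit coefficients (a common rule of thumb puts it somewhere around 1.6 to 1.8), and compares the AIC values. It is a reading aid built on the printed estimates, not a re-estimation.

```r
# Estimates copied from the two summaries above.
probit <- c(`(Intercept)` = 2.169, CS_SEXO_F = 0.06344, CS_RACA_2 = -0.1883,
            CS_ZONA_2 = -0.1370, VACINA_COV_1 = 0.2144, UTI_1 = -0.9741,
            idade = -0.02351)
logit <- c(`(Intercept)` = 3.7000563, CS_SEXO_F = 0.1121574, CS_RACA_2 = -0.3205169,
           CS_ZONA_2 = -0.2231001, VACINA_COV_1 = 0.3689358, UTI_1 = -1.6203856,
           idade = -0.0406087)

# Odds ratios for survival (intercept left out).
odds_ratios <- exp(logit[-1])
print(round(odds_ratios, 3))

# Ratio of logit to probit coefficients, term by term.
ratio <- logit / probit[names(logit)]
print(round(ratio, 2))

# AIC values reported by each summary; lower is better.
aic <- c(probit = 1486434, logit = 1485613)
print(names(which.min(aic)))
```

All coefficients keep the same sign in both models and differ only by a roughly constant scale factor, and the logit fit has the slightly lower AIC.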