Let’s now create the dataset that we’ll use for modeling by filtering
on some of the variables and transforming some of them into
factors. There are still many NA values for age, but we
are going to impute those.
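As a minimal sketch of these three steps (filter, convert to factors, impute), the base R snippet below works on a small made-up data frame. The column names, the filter condition, and the choice of median imputation are all assumptions made for illustration; the text does not say which method is actually used.

```r
# Hypothetical toy data standing in for the survey (names are illustrative).
survey <- data.frame(
  province       = c("A", "A", "B", "C", NA),
  suspected_case = c("Sim", "Não", "Não", "Sim", "Não"),
  age            = c(34, NA, 51, 27, NA),
  stringsAsFactors = FALSE
)

# Filter: keep only rows with a known province (assumed condition).
model_df <- survey[!is.na(survey$province), ]

# Transform: turn every character column into a factor.
is_chr <- vapply(model_df, is.character, logical(1))
model_df[is_chr] <- lapply(model_df[is_chr], factor)

# Impute: replace missing ages with the median of the observed ages
# (one simple option among many; an assumption, not the document's method).
model_df$age[is.na(model_df$age)] <- median(model_df$age, na.rm = TRUE)

str(model_df)
```

Median imputation keeps the example short; a model-based approach may be preferable when age is related to the other variables.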
Let’s select the categorical variables and count the number of distinct levels in each:
## provincia
## 5
## caso.suspeito
## 2
## qual.e.a.principal.fonte.de.agua.para.beber.e.preparar.a.comida
## 5
## como.e.que.a.familia.trata.a.agua.que.bebe.
## 5
## tem.sistema.de.lavagem.das.maos.
## 2
## como.lavam.as.maos
## 2
## a.familia.come.alimentos.preparados.fora.da.casa
## 2
## se.teve.um.evento.particular.
## 5
## sabe.como.reduzir.o.risco.de.morte.por.colera.
## 2
## sanitation
## 4
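Per-variable level counts like the ones above could be produced along the following lines. This is a sketch on a made-up data frame with hypothetical column names, not the original chunk.

```r
# Hypothetical toy data; only the character columns count as categorical here.
toy <- data.frame(
  province       = c("A", "B", "C", "A"),
  suspected_case = c("Sim", "Não", "Sim", "Sim"),
  n_people       = c(3, 7, 5, 4),
  stringsAsFactors = FALSE
)

# Keep the character columns, then count distinct values in each one.
cat_vars <- toy[vapply(toy, is.character, logical(1))]
n_levels <- vapply(cat_vars, function(x) length(unique(x)), integer(1))

n_levels
```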
| Name | cati |
| Number of rows | 1551 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| character | 17 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| provincia | 0 | 1.00 | 4 | 12 | 0 | 5 | 0 |
| distrito | 0 | 1.00 | 3 | 16 | 0 | 28 | 0 |
| posto.administrativo | 0 | 1.00 | 1 | 17 | 0 | 141 | 0 |
| localidade | 0 | 1.00 | 3 | 21 | 0 | 185 | 0 |
| comunidade.bairro | 0 | 1.00 | 2 | 26 | 0 | 377 | 0 |
| caso.suspeito | 0 | 1.00 | 3 | 3 | 0 | 2 | 0 |
| qual.e.a.principal.fonte.de.agua.para.beber.e.preparar.a.comida | 0 | 1.00 | 3 | 15 | 0 | 5 | 0 |
| como.e.que.a.familia.trata.a.agua.que.bebe. | 0 | 1.00 | 5 | 9 | 0 | 5 | 0 |
| tem.sistema.de.lavagem.das.maos. | 0 | 1.00 | 3 | 3 | 0 | 2 | 0 |
| como.lavam.as.maos | 0 | 1.00 | 4 | 44 | 0 | 2 | 0 |
| tem.latrina. | 0 | 1.00 | 3 | 13 | 0 | 3 | 0 |
| se.nao. | 1122 | 0.28 | 21 | 32 | 0 | 2 | 0 |
| a.familia.come.alimentos.preparados.fora.da.casa | 0 | 1.00 | 3 | 3 | 0 | 2 | 0 |
| qual.o.mercado.principal.onde.se.procura.alimentos | 0 | 1.00 | 3 | 32 | 0 | 346 | 0 |
| como.a.agua.e.armazenada.pedir.para.ver.a.balde | 0 | 1.00 | 5 | 40 | 0 | 105 | 0 |
| se.teve.um.evento.particular. | 0 | 1.00 | 6 | 25 | 0 | 5 | 0 |
| sabe.como.reduzir.o.risco.de.morte.por.colera. | 0 | 1.00 | 3 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| n. | 0 | 1 | 5.66 | 4.09 | -1 | 4 | 5 | 7 | 102 | ▇▁▁▁▁ |
Explore dataset features
## Rows: 1,551
## Columns: 18
## $ provincia <chr> "Sofal…
## $ distrito <chr> "Beira…
## $ posto.administrativo <chr> "Chive…
## $ localidade <chr> "Beira…
## $ comunidade.bairro <chr> "Espan…
## $ n. <dbl> 3, 7, …
## $ caso.suspeito <chr> "Não",…
## $ qual.e.a.principal.fonte.de.agua.para.beber.e.preparar.a.comida <chr> "Água …
## $ como.e.que.a.familia.trata.a.agua.que.bebe. <chr> "Nao t…
## $ tem.sistema.de.lavagem.das.maos. <chr> "Não",…
## $ como.lavam.as.maos <chr> "Agua"…
## $ tem.latrina. <chr> "SIM c…
## $ se.nao. <chr> NA, NA…
## $ a.familia.come.alimentos.preparados.fora.da.casa <chr> "Sim",…
## $ qual.o.mercado.principal.onde.se.procura.alimentos <chr> "Merca…
## $ como.a.agua.e.armazenada.pedir.para.ver.a.balde <chr> "Balde…
## $ se.teve.um.evento.particular. <chr> "Event…
## $ sabe.como.reduzir.o.risco.de.morte.por.colera. <chr> "Sim",…
Let’s see the distribution of surveyed households by province:
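A tabulation of households by province might look like the sketch below. The data frame and the province labels are placeholders, since the original chunk and its output are not shown.

```r
# Hypothetical household records with placeholder province labels.
households <- data.frame(
  province = c("A", "B", "A", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# Count households per province, largest first.
counts <- sort(table(households$province), decreasing = TRUE)
counts

# The same distribution as percentages of all households.
round(100 * prop.table(counts), 1)
```

Passing `counts` to `barplot()` would draw the same distribution as a bar chart.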
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 0.449648259 18.7353441 18.73534
## Dim.2 0.249999578 10.4166491 29.15199
## Dim.3 0.226854673 9.4522780 38.60427
## Dim.4 0.170360756 7.0983648 45.70264
## Dim.5 0.132724011 5.5301671 51.23280
## Dim.6 0.121051102 5.0437959 56.27660
## Dim.7 0.108054890 4.5022871 60.77889
## Dim.8 0.101494085 4.2289202 65.00781
## Dim.9 0.094919921 3.9549967 68.96280
## Dim.10 0.090303823 3.7626593 72.72546
## Dim.11 0.083670652 3.4862772 76.21174
## Dim.12 0.080124365 3.3385152 79.55025
## Dim.13 0.076032967 3.1680403 82.71830
## Dim.14 0.071865453 2.9943939 85.71269
## Dim.15 0.066485833 2.7702430 88.48293
## Dim.16 0.056604458 2.3585191 90.84145
## Dim.17 0.049051063 2.0437943 92.88525
## Dim.18 0.043627412 1.8178088 94.70305
## Dim.19 0.032376142 1.3490059 96.05206
## Dim.20 0.029997648 1.2499020 97.30196
## Dim.21 0.023306764 0.9711152 98.27308
## Dim.22 0.018724171 0.7801738 99.05325
## Dim.23 0.014568382 0.6070159 99.66027
## Dim.24 0.008153591 0.3397330 100.00000
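The eigenvalue table above can be cross-checked with a little arithmetic. For an MCA of an indicator matrix with Q variables and J categories in total, there are J - Q non-trivial dimensions and the total inertia is J/Q - 1. The sketch below assumes the MCA was run on the ten categorical variables whose level counts were listed earlier; the table is consistent with that assumption.

```r
# Level counts of the ten categorical variables listed earlier.
n_levels <- c(5, 2, 5, 5, 2, 2, 2, 5, 2, 4)

Q <- length(n_levels)        # number of variables
J <- sum(n_levels)           # total number of categories

n_dims        <- J - Q       # non-trivial MCA dimensions
total_inertia <- J / Q - 1   # sum of all eigenvalues

# Share of inertia carried by Dim.1, using its eigenvalue from the table.
pct_dim1 <- 100 * 0.449648259 / total_inertia

c(n_dims = n_dims, total_inertia = total_inertia, pct_dim1 = pct_dim1)
```

This gives 24 dimensions and a total inertia of 2.4, so Dim.1 accounts for about 18.7% of it, in line with the first row of the table.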
Using FactoMineR directly
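As a rough sketch of fitting an MCA with FactoMineR and reading its eigenvalues with factoextra, the example below uses simulated data. The variable names and levels are hypothetical and only stand in for the survey's categorical columns.

```r
library(FactoMineR)
library(factoextra)

set.seed(1)
n <- 60

# Hypothetical categorical data; names and levels are illustrative only.
toy <- data.frame(
  water_source   = factor(sample(c("tap", "well", "river"), n, replace = TRUE)),
  has_latrine    = factor(sample(c("latrine_yes", "latrine_no"), n, replace = TRUE)),
  suspected_case = factor(sample(c("case_yes", "case_no"), n, replace = TRUE))
)

# Fit the MCA without drawing FactoMineR's default plots.
res_mca <- MCA(toy, graph = FALSE)

# Eigenvalues with the percentage and cumulative percentage of inertia.
eig <- get_eigenvalue(res_mca)
eig
```

Setting `graph = FALSE` keeps the fit quiet so the plots can be built separately.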
Using ggplot2 to colour the categories by the variable they belong to
The quality of the representation is called the squared cosine (cos2), which measures the degree of association between variable categories and a particular axis. The cos2 of variable categories can be extracted as follows:
It’s also possible to create a bar plot of variable cos2 using the function fviz_cos2().
We can also look at the correlation between the variables and the dimensions:
The most important (or contributing) variable categories can be highlighted on the scatter plot as follows:
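The four steps just described (extracting cos2, plotting cos2, plotting variable-dimension correlations, and highlighting contributions) could be sketched with factoextra as below. The data are simulated and the variable names are hypothetical; the colour palette is an arbitrary choice.

```r
library(FactoMineR)
library(factoextra)

set.seed(42)
n <- 60

# Hypothetical categorical data; names and levels are illustrative only.
toy <- data.frame(
  water_source   = factor(sample(c("tap", "well", "river"), n, replace = TRUE)),
  has_latrine    = factor(sample(c("latrine_yes", "latrine_no"), n, replace = TRUE)),
  suspected_case = factor(sample(c("case_yes", "case_no"), n, replace = TRUE))
)
res_mca <- MCA(toy, graph = FALSE)

# Coordinates, cos2 and contributions of the variable categories.
var <- get_mca_var(res_mca)
head(var$cos2)

# Bar plot of category cos2 summed over the first two dimensions.
p_cos2 <- fviz_cos2(res_mca, choice = "var", axes = 1:2)

# Correlation between the variables and the first two dimensions.
p_cor <- fviz_mca_var(res_mca, choice = "mca.cor", repel = TRUE)

# Category map coloured by contribution to the dimensions shown.
p_contrib <- fviz_mca_var(
  res_mca,
  col.var = "contrib",
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
  repel = TRUE
)
```

Each `p_*` object is a ggplot, so printing it draws the figure and further ggplot2 layers can be added to it.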
### Suspected case groups
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.