Load libraries

Configuration

Load data

Prepare data

Let’s now create the dataset that we’ll use for modeling by filtering on some of the variables and converting some of them to factors. There are still many NA values for age, but we are going to impute those.
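The code chunk for this step is not shown in the report. As a rough sketch of the filter-then-convert idea in base R, assuming a hypothetical toy data frame (the column names and values below are invented for illustration and are not the survey data):

```r
# Hypothetical toy data standing in for the survey; all values are invented.
survey <- data.frame(
  province = c("A", "B", "A", "C"),
  suspected_case = c("yes", "no", "no", "yes"),
  count = c(3, 7, -1, 5),
  stringsAsFactors = FALSE
)

# Filter: keep rows with a plausible count (negative values treated as invalid here).
model_data <- survey[survey$count >= 0, ]

# Convert every character column to a factor, leaving numeric columns untouched.
is_chr <- vapply(model_data, is.character, logical(1))
model_data[is_chr] <- lapply(model_data[is_chr], factor)

str(model_data)
```

Which rows to filter out and which columns to convert are analysis decisions; the rule used here is only an assumption for the example.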

Check number of categories

Let’s select the categorical variables and count the number of distinct categories in each:

##                                                       provincia 
##                                                               5 
##                                                   caso.suspeito 
##                                                               2 
## qual.e.a.principal.fonte.de.agua.para.beber.e.preparar.a.comida 
##                                                               5 
##                     como.e.que.a.familia.trata.a.agua.que.bebe. 
##                                                               5 
##                                tem.sistema.de.lavagem.das.maos. 
##                                                               2 
##                                              como.lavam.as.maos 
##                                                               2 
##                a.familia.come.alimentos.preparados.fora.da.casa 
##                                                               2 
##                                   se.teve.um.evento.particular. 
##                                                               5 
##                  sabe.como.reduzir.o.risco.de.morte.por.colera. 
##                                                               2 
##                                                      sanitation 
##                                                               4
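A named vector of category counts like the one above could be produced along these lines. This is a minimal sketch on a hypothetical toy data frame, not the report's actual code:

```r
# Hypothetical toy data; the real dataset has ten categorical variables.
toy <- data.frame(
  province = factor(c("A", "B", "C", "A")),
  suspected_case = factor(c("yes", "no", "no", "yes")),
  count = c(3, 7, 5, 4)
)

# Select the factor columns and count the levels of each one.
is_cat <- vapply(toy, is.factor, logical(1))
n_levels <- vapply(toy[is_cat], nlevels, integer(1))

print(n_levels)
```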

One hot encode variables
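One-hot encoding turns each category into its own 0/1 indicator column, so every row has exactly one 1 per original variable. A base R sketch, assuming a hypothetical toy data frame (the report's own encoding code is not shown and may use a different function):

```r
# Hypothetical toy data with two categorical variables.
toy <- data.frame(
  water = factor(c("well", "tap", "river", "tap")),
  latrine = factor(c("yes", "no", "yes", "yes"))
)

# Build one indicator column per level of each factor, then bind them together.
one_hot <- do.call(cbind, lapply(names(toy), function(v) {
  lv <- levels(toy[[v]])
  m <- sapply(lv, function(l) as.integer(toy[[v]] == l))
  colnames(m) <- paste(v, lv, sep = "_")
  m
}))

print(one_hot)
```

Each row of `one_hot` sums to the number of variables (2 here), which is a quick sanity check on the encoding.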

Explore data

Data summary

| | |
|:--|:--|
| Name | cati |
| Number of rows | 1551 |
| Number of columns | 18 |
| Column type frequency: | |
| character | 17 |
| numeric | 1 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|:--|--:|--:|--:|--:|--:|--:|--:|
| provincia | 0 | 1.00 | 4 | 12 | 0 | 5 | 0 |
| distrito | 0 | 1.00 | 3 | 16 | 0 | 28 | 0 |
| posto.administrativo | 0 | 1.00 | 1 | 17 | 0 | 141 | 0 |
| localidade | 0 | 1.00 | 3 | 21 | 0 | 185 | 0 |
| comunidade.bairro | 0 | 1.00 | 2 | 26 | 0 | 377 | 0 |
| caso.suspeito | 0 | 1.00 | 3 | 3 | 0 | 2 | 0 |
| qual.e.a.principal.fonte.de.agua.para.beber.e.preparar.a.comida | 0 | 1.00 | 3 | 15 | 0 | 5 | 0 |
| como.e.que.a.familia.trata.a.agua.que.bebe. | 0 | 1.00 | 5 | 9 | 0 | 5 | 0 |
| tem.sistema.de.lavagem.das.maos. | 0 | 1.00 | 3 | 3 | 0 | 2 | 0 |
| como.lavam.as.maos | 0 | 1.00 | 4 | 44 | 0 | 2 | 0 |
| tem.latrina. | 0 | 1.00 | 3 | 13 | 0 | 3 | 0 |
| se.nao. | 1122 | 0.28 | 21 | 32 | 0 | 2 | 0 |
| a.familia.come.alimentos.preparados.fora.da.casa | 0 | 1.00 | 3 | 3 | 0 | 2 | 0 |
| qual.o.mercado.principal.onde.se.procura.alimentos | 0 | 1.00 | 3 | 32 | 0 | 346 | 0 |
| como.a.agua.e.armazenada.pedir.para.ver.a.balde | 0 | 1.00 | 5 | 40 | 0 | 105 | 0 |
| se.teve.um.evento.particular. | 0 | 1.00 | 6 | 25 | 0 | 5 | 0 |
| sabe.como.reduzir.o.risco.de.morte.por.colera. | 0 | 1.00 | 3 | 3 | 0 | 2 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|:--|
| n. | 0 | 1 | 5.66 | 4.09 | -1 | 4 | 5 | 7 | 102 | ▇▁▁▁▁ |

Explore dataset features

## Rows: 1,551
## Columns: 18
## $ provincia                                                       <chr> "Sofal…
## $ distrito                                                        <chr> "Beira…
## $ posto.administrativo                                            <chr> "Chive…
## $ localidade                                                      <chr> "Beira…
## $ comunidade.bairro                                               <chr> "Espan…
## $ n.                                                              <dbl> 3, 7, …
## $ caso.suspeito                                                   <chr> "Não",…
## $ qual.e.a.principal.fonte.de.agua.para.beber.e.preparar.a.comida <chr> "Água …
## $ como.e.que.a.familia.trata.a.agua.que.bebe.                     <chr> "Nao t…
## $ tem.sistema.de.lavagem.das.maos.                                <chr> "Não",…
## $ como.lavam.as.maos                                              <chr> "Agua"…
## $ tem.latrina.                                                    <chr> "SIM c…
## $ se.nao.                                                         <chr> NA, NA…
## $ a.familia.come.alimentos.preparados.fora.da.casa                <chr> "Sim",…
## $ qual.o.mercado.principal.onde.se.procura.alimentos              <chr> "Merca…
## $ como.a.agua.e.armazenada.pedir.para.ver.a.balde                 <chr> "Balde…
## $ se.teve.um.evento.particular.                                   <chr> "Event…
## $ sabe.como.reduzir.o.risco.de.morte.por.colera.                  <chr> "Sim",…

How are surveyed households distributed?

Let’s see the distribution of surveyed households by province:
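A simple way to tabulate households per province is a frequency table. The sketch below uses a hypothetical toy vector of province labels (placeholders "A", "B", "C"), since the report's plotting code is not shown:

```r
# Hypothetical toy data: one row per surveyed household.
toy <- data.frame(province = c("A", "B", "A", "C", "A", "B"))

# Count households per province, largest first.
counts <- sort(table(toy$province), decreasing = TRUE)
print(counts)

# The same distribution expressed as percentages.
print(round(100 * prop.table(counts), 1))
```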

What water sources are used in each province?

What handwashing methods are used in each province?

What sanitation systems are used in each province?

What water treatments are used in each province?

How are reported suspected cases distributed by province?

Performing MCA

Extract eigenvalues and Scree Plot

##         eigenvalue variance.percent cumulative.variance.percent
## Dim.1  0.449648259       18.7353441                    18.73534
## Dim.2  0.249999578       10.4166491                    29.15199
## Dim.3  0.226854673        9.4522780                    38.60427
## Dim.4  0.170360756        7.0983648                    45.70264
## Dim.5  0.132724011        5.5301671                    51.23280
## Dim.6  0.121051102        5.0437959                    56.27660
## Dim.7  0.108054890        4.5022871                    60.77889
## Dim.8  0.101494085        4.2289202                    65.00781
## Dim.9  0.094919921        3.9549967                    68.96280
## Dim.10 0.090303823        3.7626593                    72.72546
## Dim.11 0.083670652        3.4862772                    76.21174
## Dim.12 0.080124365        3.3385152                    79.55025
## Dim.13 0.076032967        3.1680403                    82.71830
## Dim.14 0.071865453        2.9943939                    85.71269
## Dim.15 0.066485833        2.7702430                    88.48293
## Dim.16 0.056604458        2.3585191                    90.84145
## Dim.17 0.049051063        2.0437943                    92.88525
## Dim.18 0.043627412        1.8178088                    94.70305
## Dim.19 0.032376142        1.3490059                    96.05206
## Dim.20 0.029997648        1.2499020                    97.30196
## Dim.21 0.023306764        0.9711152                    98.27308
## Dim.22 0.018724171        0.7801738                    99.05325
## Dim.23 0.014568382        0.6070159                    99.66027
## Dim.24 0.008153591        0.3397330                   100.00000
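The percentages in this table can be checked by hand. Assuming the MCA was run on the ten categorical variables listed under "Check number of categories" (an assumption, although the counts are consistent with it), there are K = 10 variables and J = 34 categories, which gives J - K = 24 dimensions and a total inertia of (J - K) / K = 2.4. A short base R check:

```r
# Category counts per variable, copied from the output earlier in the report.
n_categories <- c(5, 2, 5, 5, 2, 2, 2, 5, 2, 4)

K <- length(n_categories)    # number of variables
J <- sum(n_categories)       # total number of categories
n_dims <- J - K              # number of MCA dimensions
total_inertia <- n_dims / K  # sum of all eigenvalues

# First nine eigenvalues, copied from the table above.
eig <- c(0.449648259, 0.249999578, 0.226854673, 0.170360756, 0.132724011,
         0.121051102, 0.108054890, 0.101494085, 0.094919921)

# Percentage of variance for each of these dimensions.
pct <- 100 * eig / total_inertia
print(round(pct, 4))

# Heuristic: count dimensions whose eigenvalue exceeds the average, 1 / K.
n_above_average <- sum(eig > 1 / K)
print(n_above_average)
```

Under this heuristic the first eight dimensions have above-average eigenvalues, although the first two alone account for about 29% of the variance. The average-eigenvalue rule is only one common rule of thumb; the scree plot may suggest a different cut-off.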

MCA plot of variables

MCA plot of categories

Using FactoMineR directly

Using ggplot to colour categories by the variable they belong to

MCA density plot of categories and individuals

Quality of representation of variable categories

The quality of representation is measured by the squared cosine (cos2), which quantifies the degree of association between a variable category and a particular axis. The cos2 of the variable categories can be extracted as follows:

It’s also possible to create a bar plot of the variable cos2 values using the function fviz_cos2().
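As an illustration of both steps, here is a sketch that uses the `poison` example dataset shipped with FactoMineR instead of the survey data (an assumption made so that the example is self-contained; the object names are hypothetical):

```r
library(FactoMineR)
library(factoextra)

# Example data from FactoMineR, standing in for the survey dataset.
data(poison)
poison_active <- poison[1:55, 5:15]

# Run the MCA without drawing the default plots.
res_mca <- MCA(poison_active, graph = FALSE)

# Extract results for the variable categories; cos2 has one row per category
# and one column per retained dimension.
var_res <- get_mca_var(res_mca)
print(head(round(var_res$cos2, 3)))

# Bar plot of the cos2 of categories on the first two dimensions combined.
p_cos2 <- fviz_cos2(res_mca, choice = "var", axes = 1:2)
```

A cos2 close to 1 on the plotted dimensions means the category is well represented there; values near 0 suggest its position on the map should be interpreted with caution.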

Correlation with the dimensions

Contribution of variable categories to the dimensions

The most important (that is, most contributing) variable categories can be highlighted on the scatter plot as follows:
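A sketch of how contributions might be inspected and highlighted, again on FactoMineR's `poison` example data and not on the survey (the object names are hypothetical):

```r
library(FactoMineR)
library(factoextra)

# Example data from FactoMineR, standing in for the survey dataset.
data(poison)
res_mca <- MCA(poison[1:55, 5:15], graph = FALSE)

# Contributions (in percent) of each category to each dimension.
contrib <- get_mca_var(res_mca)$contrib

# Within a dimension, the contributions of all categories add up to 100.
print(round(colSums(contrib), 6))

# Five categories contributing most to the first dimension.
print(head(sort(contrib[, 1], decreasing = TRUE), 5))

# Scatter plot restricted to the ten most contributing categories.
p_top <- fviz_mca_var(res_mca, select.var = list(contrib = 10))
```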

Color individuals by groups

Province groups

Sanitation groups

Handwash groups

Water source groups

Water treatment groups

Events group

Suspected case groups

Take-away food groups

Factor map 1

Factor map 2

Factor map 3

Dimension description
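FactoMineR's `dimdesc()` identifies the variables and categories most strongly associated with each dimension. A minimal sketch on the package's `poison` example data, used here in place of the survey dataset (an assumption for illustration):

```r
library(FactoMineR)

# Example data from FactoMineR, standing in for the survey dataset.
data(poison)
res_mca <- MCA(poison[1:55, 5:15], graph = FALSE)

# Describe the first two dimensions.
desc <- dimdesc(res_mca, axes = 1:2)

# One list element per described dimension; inspect the first one.
print(names(desc))
print(desc[[1]])
```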