R for Reproducible Scientific Analysis (https://swcarpentry.github.io/r-novice-gapminder/): A very recommendable and easy to follow tutorial that covers the basics with nice examples of use
Interactive Tutorial Tool included in R Studio (on the right top window). It will ask you permision to install some needed packages.
The R Graph Gallery (https://r-graph-gallery.com): A comprehensive resource with hundreds of tutorials and examples on creating amazing visualizations using R and focusing on ggplot
This only needs to be done once in your system:
install.packages(c("tidyverse", "openxlsx", "knitr", "janitor", "kable", "kableExtra"))
Error in install.packages : Updating loaded packages
We load the libraries we will use in our project.
library(tidyverse): A collection of R packages designed
for data science that helps in data manipulation, visualization, and
analysis.library(readxl): Provides straightforward tools to read
Excel files (.xls and .xlsx) into R without dependencies.library(openxlsx): Enables the reading, writing, and
editing of .xlsx files in R, with more flexibility and options than
readxl.: Loads the `janitor` package, useful for data cleaning tasks like simplifying column names and adding totals to tables.library(knitr): Loads the knitr package,
which is used for dynamic report generation in R, integrating code and
output into rich documents.library(kableExtra): Loads the kableExtra
package, which enhances knitr’s kable()
function to create more complex HTML or LaTeX tables with additional
formatting options.library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::%||%() masks base::%||%()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(readxl)
library(openxlsx)
library(janitor)
Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
chisq.test, fisher.test
library(knitr)
library(kableExtra)
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Attaching package: ‘kableExtra’
The following object is masked from ‘package:dplyr’:
group_rows
Here we define a simple vector that contains the UNICEF palette of colors, for later use in plots:
UNICEF_PALETTE <- c("#00AEEF","#00833D", "#80BD41", "#FFC20E", "#F26A21", "#E2231A", "#961A49", "#6A1E74", "#D8D1C9", "#777779", "#2D2926", "#374EA2")
openxlsx librarydataset1 <- read.xlsx("data/Education - MOZ_Edu Cluster_5Ws_Q12024.xlsx", sheet = "Data_Entry_5W", check.names = TRUE, detectDates = TRUE, startRow = 5)
Let’s check the column names:
names(dataset1)
[1] "Lead.organization...Organização.líder"
[2] "Organization.type...Tipo.de.organização"
[3] "Implementing.partner...Parceiro.de.implementação"
[4] "Organization.type...Tipo.de.organização.1"
[5] "Consortium.Lead..in.case.the.activity.is.implemented.under.a.consortium....Líder.de.Consórcio..caso.a.actividade.seja.implementada.no.âmbito.de.um.consórcio."
[6] "Donor...Doador"
[7] "Period...Período"
[8] "HRP.ou.non.HRP.project...Projeto.HRP.ou.não.HRP"
[9] "Type.of.Disaster..Tipo.de.Desastre"
[10] "Sector...Setor"
[11] "Cluster.objective...Objectivo.do.cluster"
[12] "Cluster.indicator...Indicador.do.cluster"
[13] "HRP.activity...Actividade.do.HRP"
[14] "Non.HRP.activity...Actividade.não.HRP"
[15] "Unit.of.measure...Unidade.de.medida"
[16] "Delivery.modality...Modalidade.de.intervenção"
[17] "Activity.status...Progresso.da.actividade"
[18] "X..of.items.distributed.or.infrastructures.improved.....de.ítens.distribuídos.ou.infraestruturas.melhoradas"
[19] "Province...Província...ADM1.PT"
[20] "ADM1...PCODE"
[21] "District...Distrito...ADM2.PT"
[22] "ADM2...PCODE.2"
[23] "Posto...ADM3.PT"
[24] "School.name...Nome.da.escola"
[25] "School.code...Código.de.escola"
[26] "Other.specific.location...Outro.local.específico"
[27] "Type.of.beneficiary.reached...Tipo.de.beneficiário.alcançado"
[28] "Boys...Meninos...18y."
[29] "Girls...Meninas...18y."
[30] "Men...Homens..18...59y."
[31] "Women...Mulheres..18...59y."
[32] "Elderly.men...Idosos..60.y."
[33] "Elderly.women...Idosas..60.y."
[34] "Of.which.are.male.PWD..Número.de.pessoas.masculinas.com.deficiência"
[35] "Of.which.are.female.PWD...Número.de.pessoas.femininas.com.deficiência"
[36] "Total.male...Total.masculino"
[37] "Total.female...Total.feminino"
[38] "Total.people.reached...Total.de.pessoas.alcançadas"
[39] "Child.Participation...Participação.da.criança"
[40] "Comments...Comentários"
readxl librarydataset2 <- read_excel("data/Education - MOZ_Edu Cluster_5Ws_Q12024.xlsx", sheet = "Data_Entry_5W", skip = 4)
names(dataset2)
[1] "Lead organization / Organização líder"
[2] "Organization type / Tipo de organização"
[3] "Implementing partner / Parceiro de implementação"
[4] "Organization type / Tipo de organização"
[5] "Consortium Lead (in case the activity is implemented under a consortium) / Líder de Consórcio (caso a actividade seja implementada no âmbito de um consórcio)"
[6] "Donor / Doador"
[7] "Period / Período"
[8] "HRP ou non-HRP project / Projeto HRP ou não-HRP"
[9] "Type of Disaster/ Tipo de Desastre"
[10] "Sector / Setor"
[11] "Cluster objective / Objectivo do cluster"
[12] "Cluster indicator / Indicador do cluster"
[13] "HRP activity / Actividade do HRP"
[14] "Non-HRP activity / Actividade não-HRP"
[15] "Unit of measure / Unidade de medida"
[16] "Delivery modality / Modalidade de intervenção"
[17] "Activity status / Progresso da actividade"
[18] "# of items distributed or infrastructures improved / # de ítens distribuídos ou infraestruturas melhoradas"
[19] "Province / Província / ADM1 PT"
[20] "ADM1 / PCODE"
[21] "District / Distrito / ADM2 PT"
[22] "ADM2 / PCODE 2"
[23] "Posto / ADM3 PT"
[24] "School name / Nome da escola"
[25] "School code / Código de escola"
[26] "Other specific location / Outro local específico"
[27] "Type of beneficiary reached / Tipo de beneficiário alcançado"
[28] "Boys / Meninos (<18y)"
[29] "Girls / Meninas (<18y)"
[30] "Men / Homens (18 - 59y)"
[31] "Women / Mulheres (18 - 59y)"
[32] "Elderly men / Idosos (60+y)"
[33] "Elderly women / Idosas (60+y)"
[34] "Of which are male PWD /Número de pessoas masculinas com deficiência"
[35] "Of which are female PWD / Número de pessoas femininas com deficiência"
[36] "Total male / Total masculino"
[37] "Total female / Total feminino"
[38] "Total people reached / Total de pessoas alcançadas"
[39] "Child Participation / Participação da criança"
[40] "Comments / Comentários"
We will use dataset2 in the example.
nrow(dataset2): Returns the number of rows in the
dataset2 dataset.ncol(dataset2): Returns the number of columns in the
dataset2 dataset.dim(dataset2): Returns a vector containing the number
of rows and columns in the dataset2 dataset.nrow(dataset2)
[1] 711
ncol(dataset2)
[1] 40
dim(dataset2)
[1] 711 40
headhead(dataset2) displays the first six rows of the
dataset2 dataset.
head(dataset2)
summarysummary(dataset2) provides a
statistical summary of each variable in the
dataset2 dataset, including measures like
mean, median, quartiles, minimum, and maximum values.
summary(dataset2)
Lead organization / Organização líder Organization type / Tipo de organização
Length:711 Length:711
Class :character Class :character
Mode :character Mode :character
Implementing partner / Parceiro de implementação Organization type / Tipo de organização
Length:711 Length:711
Class :character Class :character
Mode :character Mode :character
Consortium Lead (in case the activity is implemented under a consortium) / Líder de Consórcio (caso a actividade seja implementada no âmbito de um consórcio)
Length:711
Class :character
Mode :character
Donor / Doador Period / Período HRP ou non-HRP project / Projeto HRP ou não-HRP
Length:711 Length:711 Length:711
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Type of Disaster/ Tipo de Desastre Sector / Setor Cluster objective / Objectivo do cluster
Length:711 Length:711 Length:711
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Cluster indicator / Indicador do cluster HRP activity / Actividade do HRP
Length:711 Length:711
Class :character Class :character
Mode :character Mode :character
Non-HRP activity / Actividade não-HRP Unit of measure / Unidade de medida
Length:711 Length:711
Class :character Class :character
Mode :character Mode :character
Delivery modality / Modalidade de intervenção Activity status / Progresso da actividade
Length:711 Length:711
Class :character Class :character
Mode :character Mode :character
# of items distributed or infrastructures improved / # de ítens distribuídos ou infraestruturas melhoradas
Min. : 0.0
1st Qu.: 1.0
Median : 12.0
Mean : 402.4
3rd Qu.: 225.8
Max. :16000.0
NA's :169
Province / Província / ADM1 PT ADM1 / PCODE District / Distrito / ADM2 PT ADM2 / PCODE 2
Length:711 Length:711 Length:711 Length:711
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Posto / ADM3 PT School name / Nome da escola School code / Código de escola
Length:711 Length:711 Length:711
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Other specific location / Outro local específico
Length:711
Class :character
Mode :character
Type of beneficiary reached / Tipo de beneficiário alcançado Boys / Meninos (<18y)
Length:711 Min. : 0.0
Class :character 1st Qu.: 3.0
Mode :character Median : 45.0
Mean : 255.8
3rd Qu.: 157.0
Max. :7453.0
NA's :152
Girls / Meninas (<18y) Men / Homens (18 - 59y) Women / Mulheres (18 - 59y)
Min. : 0.0 Min. : 0.000 Min. : 0.000
1st Qu.: 3.0 1st Qu.: 0.000 1st Qu.: 0.000
Median : 60.0 Median : 0.000 Median : 0.000
Mean : 240.2 Mean : 2.106 Mean : 2.401
3rd Qu.: 189.0 3rd Qu.: 2.000 3rd Qu.: 1.000
Max. :8547.0 Max. :103.000 Max. :119.000
NA's :152 NA's :152 NA's :152
Elderly men / Idosos (60+y) Elderly women / Idosas (60+y)
Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.00000
Median :0.0000 Median :0.00000
Mean :0.0161 Mean :0.00179
3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :3.0000 Max. :1.00000
NA's :152 NA's :152
Of which are male PWD /Número de pessoas masculinas com deficiência
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.1056
3rd Qu.:0.0000
Max. :7.0000
NA's :152
Of which are female PWD / Número de pessoas femininas com deficiência Total male / Total masculino
Min. :0.00000 Min. : 0.0
1st Qu.:0.00000 1st Qu.: 0.0
Median :0.00000 Median : 21.0
Mean :0.07168 Mean : 202.8
3rd Qu.:0.00000 3rd Qu.: 94.5
Max. :4.00000 Max. :7453.0
NA's :153
Total female / Total feminino Total people reached / Total de pessoas alcançadas
Min. : 0.0 Min. : 0.0
1st Qu.: 0.0 1st Qu.: 2.0
Median : 24.0 Median : 50.0
Mean : 190.7 Mean : 393.5
3rd Qu.: 128.5 3rd Qu.: 230.0
Max. :8547.0 Max. :16000.0
Child Participation / Participação da criança Comments / Comentários
Length:711 Length:711
Class :character Class :character
Mode :character Mode :character
glimpseglimpse(dataset2) provides a transposed overview of the
dataset2 dataset, showing the first few entries of each
column and the data types of the columns.
glimpse(dataset)
Rows: 711
Columns: 40
$ Lead.organization...Organização.líder <chr> …
$ Organization.type...Tipo.de.organização <chr> …
$ Implementing.partner...Parceiro.de.implementação <chr> …
$ Organization.type...Tipo.de.organização.1 <chr> …
$ Consortium.Lead..in.case.the.activity.is.implemented.under.a.consortium....Líder.de.Consórcio..caso.a.actividade.seja.implementada.no.âmbito.de.um.consórcio. <chr> …
$ Donor...Doador <chr> …
$ Period...Período <chr> …
$ HRP.ou.non.HRP.project...Projeto.HRP.ou.não.HRP <chr> …
$ Type.of.Disaster..Tipo.de.Desastre <chr> …
$ Sector...Setor <chr> …
$ Cluster.objective...Objectivo.do.cluster <chr> …
$ Cluster.indicator...Indicador.do.cluster <chr> …
$ HRP.activity...Actividade.do.HRP <chr> …
$ Non.HRP.activity...Actividade.não.HRP <chr> …
$ Unit.of.measure...Unidade.de.medida <chr> …
$ Delivery.modality...Modalidade.de.intervenção <chr> …
$ Activity.status...Progresso.da.actividade <chr> …
$ X..of.items.distributed.or.infrastructures.improved.....de.ítens.distribuídos.ou.infraestruturas.melhoradas <dbl> …
$ Province...Província...ADM1.PT <chr> …
$ ADM1...PCODE <chr> …
$ District...Distrito...ADM2.PT <chr> …
$ ADM2...PCODE.2 <chr> …
$ Posto...ADM3.PT <chr> …
$ School.name...Nome.da.escola <chr> …
$ School.code...Código.de.escola <chr> …
$ Other.specific.location...Outro.local.específico <chr> …
$ Type.of.beneficiary.reached...Tipo.de.beneficiário.alcançado <chr> …
$ Boys...Meninos...18y. <dbl> …
$ Girls...Meninas...18y. <dbl> …
$ Men...Homens..18...59y. <dbl> …
$ Women...Mulheres..18...59y. <dbl> …
$ Elderly.men...Idosos..60.y. <dbl> …
$ Elderly.women...Idosas..60.y. <dbl> …
$ Of.which.are.male.PWD..Número.de.pessoas.masculinas.com.deficiência <dbl> …
$ Of.which.are.female.PWD...Número.de.pessoas.femininas.com.deficiência <dbl> …
$ Total.male...Total.masculino <dbl> …
$ Total.female...Total.feminino <dbl> …
$ Total.people.reached...Total.de.pessoas.alcançadas <dbl> …
$ Child.Participation...Participação.da.criança <chr> …
$ Comments...Comentários <chr> …
rename_with(~make.unique(str_squish(str_remove(., "/.*"))), .cols = everything()):
Removes everything after a slash in column names, squishes spaces, and
ensures names are unique.rename_with(str_to_lower): Converts all column names to
lowercase.rename_with(~str_replace_all(., " ", "_")): Replaces
all spaces in column names with underscores to create valid variable
names.
dataset_renamed <- dataset2 %>%
rename_with(~make.unique(str_squish(str_remove(., "/.*"))), .cols = everything()) %>%
rename_with(str_to_lower) %>%
rename_with(~str_replace_all(., " ", "_"))
#rename_with(~str_replace_all(str_to_lower(.), " ", "_"), .cols = everything())
Let’s check the column names now:
names(dataset_renamed)
[1] "lead_organization"
[2] "organization_type"
[3] "implementing_partner"
[4] "organization_type.1"
[5] "consortium_lead_(in_case_the_activity_is_implemented_under_a_consortium)"
[6] "donor"
[7] "period"
[8] "hrp_ou_non-hrp_project"
[9] "type_of_disaster"
[10] "sector"
[11] "cluster_objective"
[12] "cluster_indicator"
[13] "hrp_activity"
[14] "non-hrp_activity"
[15] "unit_of_measure"
[16] "delivery_modality"
[17] "activity_status"
[18] "#_of_items_distributed_or_infrastructures_improved"
[19] "province"
[20] "adm1"
[21] "district"
[22] "adm2"
[23] "posto"
[24] "school_name"
[25] "school_code"
[26] "other_specific_location"
[27] "type_of_beneficiary_reached"
[28] "boys"
[29] "girls"
[30] "men"
[31] "women"
[32] "elderly_men"
[33] "elderly_women"
[34] "of_which_are_male_pwd"
[35] "of_which_are_female_pwd"
[36] "total_male"
[37] "total_female"
[38] "total_people_reached"
[39] "child_participation"
[40] "comments"
Typing select(c(column_name1, etc.)), selects the
columns. But this can be impractical if we need a lot of columns.
dataset_keep_cols <- dataset_renamed %>%
select(c(lead_organization,organization_type,implementing_partner,organization_type.1))
head(dataset_keep_cols)
Removes specified columns by their names and positions from
dataset_renamed, creating a new dataset
dataset_filter_cols.
dataset_filter_cols <- dataset_renamed %>%
select(-c(5, donor, sector, hrp_activity, `non-hrp_activity`, 18, unit_of_measure, delivery_modality, comments))
names(dataset_filter_cols)
[1] "lead_organization" "organization_type" "implementing_partner"
[4] "organization_type.1" "period" "hrp_ou_non-hrp_project"
[7] "type_of_disaster" "cluster_objective" "cluster_indicator"
[10] "activity_status" "province" "adm1"
[13] "district" "adm2" "posto"
[16] "school_name" "school_code" "other_specific_location"
[19] "type_of_beneficiary_reached" "boys" "girls"
[22] "men" "women" "elderly_men"
[25] "elderly_women" "of_which_are_male_pwd" "of_which_are_female_pwd"
[28] "total_male" "total_female" "total_people_reached"
[31] "child_participation"
We can have another glimpse into the dataset:
glimpse(dataset_filter_cols)
Rows: 711
Columns: 31
$ lead_organization <chr> "United Nations Children's Fund", "United Nations Children's Fund…
$ organization_type <chr> "UN", "UN", "UN", "UN", "UN", "UN", "UN", "UN", "UN", "UN", "UN",…
$ implementing_partner <chr> "United Nations Children's Fund", "United Nations Children's Fund…
$ organization_type.1 <chr> "UN", "UN", "UN", "UN", "UN", "UN", "UN", "UN", "UN", "UN", "GOV"…
$ period <chr> "March", "March", "March", "March", "March", "March", "March", "M…
$ `hrp_ou_non-hrp_project` <chr> "HRP", "HRP", "HRP", "HRP", "HRP", "HRP", "HRP", "HRP", "HRP", "H…
$ type_of_disaster <chr> "Conflict", "Conflict", "Conflict", "Conflict", "Conflict", "Conf…
$ cluster_objective <chr> "To improve learning environments by ensuring access to improved …
$ cluster_indicator <chr> "# of children benefitting from new or improved WASH infrastructu…
$ activity_status <chr> "Ongoing", "Ongoing", "Ongoing", "Ongoing", "Completed", "Complet…
$ province <chr> "Nampula", "Nampula", "Nampula", "Nampula", "Nampula", "Nampula",…
$ adm1 <chr> "MZ07", "MZ07", "MZ07", "MZ07", "MZ07", "MZ07", "MZ07", "MZ07", "…
$ district <chr> "Namapa - Eráti", "Namapa - Eráti", "Namapa - Eráti", "Namapa - E…
$ adm2 <chr> "MZ0703", "MZ0703", "MZ0703", "MZ0703", "MZ0703", "MZ0703", "MZ07…
$ posto <chr> "Namapa - Eráti", "Namapa - Eráti", "Namapa - Eráti", "Namapa - E…
$ school_name <chr> "Escola Primária do 1º e 2º Grau de Nacucha", "Escola Primária do…
$ school_code <chr> "60076", "62132", "61622", "61733", "60076", "62132", "61622", "6…
$ other_specific_location <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
$ type_of_beneficiary_reached <chr> "IDPs", "IDPs", "IDPs", "IDPs", "IDPs", "IDPs", "IDPs", "IDPs", "…
$ boys <dbl> 248, 16, 192, 2, 1205, 1118, 3309, 1205, 1118, 3309, 55, 25, 25, …
$ girls <dbl> 186, 23, 138, 0, 1207, 1140, 1790, 1207, 1140, 1790, 52, 14, 14, …
$ men <dbl> 6, 8, 16, 0, 6, 8, 16, 6, 8, 16, 4, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0,…
$ women <dbl> 25, 29, 26, 0, 25, 29, 26, 25, 29, 26, 1, 0, 0, 0, 3, 3, 0, 0, 0,…
$ elderly_men <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ elderly_women <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ of_which_are_male_pwd <dbl> 6, 4, 7, 0, 6, 4, 7, 6, 4, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ of_which_are_female_pwd <dbl> 4, 2, 3, 0, 4, 2, 3, 4, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ total_male <dbl> 254, 24, 208, 2, 1211, 1126, 3325, 1211, 1126, 3325, 59, 27, 27, …
$ total_female <dbl> 211, 52, 164, 0, 1232, 1169, 1816, 1232, 1169, 1816, 53, 14, 14, …
$ total_people_reached <dbl> 465, 76, 372, 2, 2443, 2295, 5141, 2443, 2295, 5141, 112, 41, 41,…
$ child_participation <chr> "Sim/Yes", "Sim/Yes", "Sim/Yes", "Sim/Yes", "Sim/Yes", "Sim/Yes",…
We can remove rows with empty values that we will not use, for
example rows with no lead_organization value defined.
filter(!is.na(lead_organization)): Removes rows from
dataset_filter_cols where the
lead_organization column contains NA values.dataset_no_empty_rows <- dataset_filter_cols %>%
filter(!is.na(lead_organization))
dim(dataset_no_empty_rows)
[1] 542 31
period to date formatLet’s explore the type of data of the period column.
class(dataset_no_empty_rows$period)
returns the data type or class of the
period column in the
dataset_no_empty_rows datasetdataset_no_empty_rows$period)
class(dataset_no_empty_rows$period)
[1] "character"
It is of type “character”, let’s see its unique values using
unique:
unique(dataset_no_empty_rows$period)
[1] "March" "February" "January"
We need to convert those month strings into a date
format, so it will facilitate later on the ordering of results by
chronological order. We will use functions from the
lubridate package:
period column values to date format by
appending “2024” to each month name and using my() from
lubridate.class(dataset_period_date$period): Returns the class of
the period column in dataset_period_date.unique(dataset_period_date$period): Extracts unique
date values from the period column in
dataset_period_date.
dataset_period_date <- dataset_no_empty_rows %>%
mutate(period = my(paste(period, "2024")))
class(dataset_period_date$period)
[1] "Date"
unique(dataset_period_date$period)
[1] "2024-03-01" "2024-02-01" "2024-01-01"
Let’s continue by removing the ‘suspended’ and ‘planned’ activities
using filter, so we can focus only on ongoing and completed
activities. We take the previous cleaned dataframe
dataset_period_date.
What different values does activity_status get?
unique(dataset_no_empty_rows$activity_status): Returns
all unique values from the activity_status column of the
dataset_no_empty_rows dataset, removing duplicates.unique(dataset_period_date$activity_status)
[1] "Ongoing" "Completed" "Planned" "Suspended"
The code filters dataset_no_empty_rows to include only
rows where the activity_status column has values “Ongoing”
or “Completed”, and stores the result in
dataset_no_suspended.
dataset_filter_activity <- dataset_period_date %>%
filter(activity_status %in% c("Ongoing", "Completed"))
dim(dataset_filter_activity)
[1] 537 31
Let’s check
unique(dataset_filter_activity$activity_status)
[1] "Ongoing" "Completed"
We could have done all the previous steps at once by combining all the ‘pipes’ as follows:
lead_organization is NAperiod to 2024 date formatactivity_status “Ongoing” or
“Completed”
dataset_filter_activity_all <- dataset2 %>%
rename_with(~make.unique(str_squish(str_remove(., "/.*"))), .cols = everything()) %>%
rename_with(str_to_lower) %>%
rename_with(~str_replace_all(., " ", "_")) %>%
select(-c(5, donor, sector, hrp_activity, `non-hrp_activity`, 18, unit_of_measure, delivery_modality, comments)) %>%
filter(!is.na(lead_organization)) %>%
mutate(period = my(paste(period, "2024"))) %>%
filter(activity_status %in% c("Ongoing", "Completed"))
dim(dataset_filter_activity_all)
[1] 537 31
head(dataset_filter_activity_all)
NA
wb <- createWorkbook()
addWorksheet(wb, "clean_data", gridLines = FALSE)
writeDataTable(wb, sheet = 1, dataset_filter_activity_all, tableStyle = "TableStyleMedium9", withFilter = TRUE)
saveWorkbook(wb, file = "output/clean_data.xlsx", overwrite = TRUE)
Let’s see several examples of different aggregations. We will be
using our latest result dataset_filter_activity_all, that
contains all “ongoing” and “completed” activites
reached_period_province_district_tbl <- dataset_filter_activity_all %>%
group_by(period, province, district) %>%
summarise(
total_male = sum(total_male, na.rm = TRUE),
total_female = sum(total_female, na.rm = TRUE),
total_reached = sum(total_people_reached, na.rm = TRUE))
`summarise()` has grouped output by 'period', 'province'. You can override using the `.groups`
argument.
print(reached_period_province_district_tbl)
We can also use the command across(), to apply the same
operation over a collection of columns in a more compact way:
dataset_filter_activity_all by
period, province, and
district.total_male, total_female, and
total_people_reached for each group, removing NAs.reached_period_province_district_tbl.
reached_period_province_district_tbl <- dataset_filter_activity_all %>%
group_by(period, province, district) %>%
summarise(across(c(total_male, total_female, total_people_reached), ~ sum(., na.rm = TRUE)))
`summarise()` has grouped output by 'period', 'province'. You can override using the `.groups`
argument.
print(reached_period_province_district_tbl)
For simplicity, we will assume that we can sum everything up to national level, and also across months
reached_indicator_national_tbl <- dataset_filter_activity_all %>%
group_by(cluster_indicator) %>%
summarise(total_reached = sum(total_people_reached, na.rm = TRUE))
reached_indicator_national_tbl
Let’s see which indicator we have in the dataset:
unique(dataset_filter_activity_all$cluster_indicator)
[1] "# of children benefitting from new or improved WASH infrastructures (drinking water pumps, latrines, etc.) in schools"
[2] "# of children, affect by emergencies accessing formal or non-formal education disaggregated by gender, age and disability"
[3] "# of teachers and education personnel, officials trained in emergency and related topics (MHPSS, Peace Education, DRR, PEBE, climate change, basic pedagogy and learner-centred methodologies, gender socialization, VAC/GBV prevention and referral pathways)"
Let’s assume now that for the [2] indicator, we have take the maximum across months instead the sum because it was defined like that. We need then to apply different aggregation depending on the indicator. A way to do can be:
dataset_filter_activity_all by
period and cluster_indicator, calculates the
sum of total_people_reached for each group.cluster_indicator again, applies different
aggregation based on the content of cluster_indicator (max
for those containing “formal”, sum for others).reached_indicator_national_sum_max_tbl.In this case, the summarise function generated several
repeated rows, as different aggregation paths (like sum and max) could
potentially create rows with identical values across the columns being
analyzed. We use distinct() to remove the duplicates.
reached_indicator_national_sum_max_tbl <- dataset_filter_activity_all %>%
group_by(period, cluster_indicator) %>%
summarise(total_reached = sum(total_people_reached, na.rm = TRUE)) %>%
ungroup() %>%
group_by(cluster_indicator) %>%
summarise(total_reached = if_else(grepl("formal", cluster_indicator),
max(total_reached, na.rm = TRUE),
sum(total_reached, na.rm = TRUE))) %>%
ungroup() %>%
distinct()
`summarise()` has grouped output by 'period'. You can override using the `.groups` argument.
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an
ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
`summarise()` has grouped output by 'cluster_indicator'. You can override using the `.groups`
argument.
reached_indicator_national_sum_max_tbl
dataset_filter_activity_all by
period.total_people_reached into
total_reached, ignoring NAs.month column with month names extracted from
period.period,
month, total_reached.reached_period_national_tbl.
reached_month_national_tbl <- dataset_filter_activity_all %>%
group_by(period) %>%
summarise(total_reached = sum(total_people_reached, na.rm = TRUE)) %>%
mutate(month = month(period, label = TRUE)) %>%
select(period, month, total_reached)
print(reached_month_national_tbl)
reached_month_indicator_national_tbl <- dataset_filter_activity_all %>%
group_by(period, cluster_indicator) %>%
summarise(total_reached = sum(total_people_reached, na.rm = TRUE)) %>%
mutate(month = month(period, label = TRUE)) %>%
select(period, month, indicator = cluster_indicator, total_reached)
`summarise()` has grouped output by 'period'. You can override using the `.groups` argument.
print(reached_month_indicator_national_tbl)
reached_province_tbl <- dataset_filter_activity_all %>%
group_by(province, adm1) %>%
summarise(total_reached = sum(total_people_reached, na.rm = TRUE)) %>%
select(province, adm1, total_reached)
`summarise()` has grouped output by 'province'. You can override using the `.groups` argument.
print(reached_province_tbl)
reached_month_province_tbl <- dataset_filter_activity_all %>%
group_by(period, province, adm1) %>%
summarise(total_reached = sum(total_people_reached, na.rm = TRUE)) %>%
ungroup() %>%
mutate(month = month(period, label = TRUE)) %>%
select(month, province, adm1, total_reached)
`summarise()` has grouped output by 'period', 'province'. You can override using the `.groups`
argument.
print(reached_month_province_tbl)
janitor() packageWhile we can calculate table totals (both across row and by column)
using dplyr, there is also some packages that facilitate this task, for
example the janitor() package. Let’s see an example with
the indicators reached_indicator_national_tbl dataframe we
obtained before:
reached_indicator_national_tbl,
summing values across the value column, using
adorn_totals(where = "row").reached_indicator_national_total_tbl.
reached_indicator_national_total_tbl <- reached_indicator_national_tbl %>%
adorn_totals(where = "row")
print(reached_indicator_national_total_tbl)
cluster_indicator
# of children benefitting from new or improved WASH infrastructures (drinking water pumps, latrines, etc.) in schools
# of children, affect by emergencies accessing formal or non-formal education disaggregated by gender, age and disability
# of teachers and education personnel, officials trained in emergency and related topics (MHPSS, Peace Education, DRR, PEBE, climate change, basic pedagogy and learner-centred methodologies, gender socialization, VAC/GBV prevention and referral pathways)
Total
total_reached
28647
238213
11383
278243
kable() and
kableExtra() packagesWe can also format the resulting table for publishing, using the
kable() and kableExtra() packages:
reached_period_province_district_tbl %>% kable() %>% kable_classic()
| period | province | district | total_male | total_female | total_people_reached |
|---|---|---|---|---|---|
| 2024-01-01 | Cabo Delgado | Ancuabe | 1172.0 | 869.0 | 2041 |
| 2024-01-01 | Cabo Delgado | Chiúre | 375.0 | 276.0 | 651 |
| 2024-01-01 | Cabo Delgado | Cidade de Pemba | 110.0 | 190.0 | 300 |
| 2024-01-01 | Cabo Delgado | Ibo | 130.0 | 132.0 | 262 |
| 2024-01-01 | Cabo Delgado | Metuge | 1565.0 | 1713.0 | 3278 |
| 2024-01-01 | Cabo Delgado | Mocímboa da Praia | 8.0 | 2.0 | 10 |
| 2024-01-01 | Cabo Delgado | Mueda | 8.0 | 7.0 | 15 |
| 2024-01-01 | Cabo Delgado | Pemba - Metuge | 107.0 | 141.0 | 248 |
| 2024-01-01 | Cabo Delgado | Quissanga | 72.0 | 58.0 | 130 |
| 2024-01-01 | Nampula | Memba | 223.0 | 255.0 | 478 |
| 2024-02-01 | Cabo Delgado | Ancuabe | 6544.0 | 7385.0 | 13929 |
| 2024-02-01 | Cabo Delgado | Cidade de Pemba | 2938.0 | 4198.0 | 7136 |
| 2024-02-01 | Cabo Delgado | Ibo | 113.0 | 116.0 | 229 |
| 2024-02-01 | Cabo Delgado | Meluco | 1101.6 | 1652.4 | 2754 |
| 2024-02-01 | Cabo Delgado | Metuge | 4705.0 | 4547.0 | 9252 |
| 2024-02-01 | Cabo Delgado | Montepuez | 1481.0 | 2052.0 | 3533 |
| 2024-02-01 | Cabo Delgado | Mueda | 388.0 | 647.0 | 1035 |
| 2024-02-01 | Cabo Delgado | Muidumbe | 1496.4 | 2244.6 | 3741 |
| 2024-02-01 | Cabo Delgado | Palma | 886.0 | 1329.0 | 2215 |
| 2024-02-01 | Nampula | Memba | 349.0 | 377.0 | 726 |
| 2024-03-01 | Cabo Delgado | Ancuabe | 5067.0 | 3609.0 | 8676 |
| 2024-03-01 | Cabo Delgado | Chiúre | 1588.0 | 1574.0 | 3162 |
| 2024-03-01 | Cabo Delgado | Cidade de Pemba | 354.0 | 459.0 | 813 |
| 2024-03-01 | Cabo Delgado | Macomia | 6646.0 | 5810.0 | 12456 |
| 2024-03-01 | Cabo Delgado | Metuge | 0.0 | 1143.0 | 1143 |
| 2024-03-01 | Cabo Delgado | Mocimboa Da Praia | 3043.0 | 2772.0 | 5815 |
| 2024-03-01 | Cabo Delgado | Mocímboa da Praia | 587.0 | 593.0 | 1180 |
| 2024-03-01 | Cabo Delgado | Montepuez | 20345.0 | 20613.0 | 40958 |
| 2024-03-01 | Cabo Delgado | Mueda | 5509.0 | 5037.0 | 10546 |
| 2024-03-01 | Cabo Delgado | Nangade | 258.0 | 258.0 | 516 |
| 2024-03-01 | Cabo Delgado | Palma | 767.0 | 697.0 | 1464 |
| 2024-03-01 | Cabo Delgado | Pemba - Metuge | 10431.0 | 6966.0 | 17397 |
| 2024-03-01 | Nampula | Memba | 864.0 | 920.0 | 1784 |
| 2024-03-01 | Nampula | Namapa - Eráti | 11812.0 | 8861.0 | 20673 |
| 2024-03-01 | Zambézia | Inhassungue | 5904.0 | 4596.0 | 10500 |
| 2024-03-01 | Zambézia | Maganja da Costa | 13955.0 | 11556.0 | 25511 |
| 2024-03-01 | Zambézia | Mopeia | 5985.0 | 4515.0 | 10500 |
| 2024-03-01 | Zambézia | Namacurra | 13555.0 | 13631.0 | 27186 |
| 2024-03-01 | Zambézia | Nicoadala | 7453.0 | 8547.0 | 16000 |
| 2024-03-01 | Zambézia | Quelimane | 5482.0 | 4518.0 | 10000 |
ggplotggplot2 is a data visualization package
for the R programming language. It’s based on the grammar of graphics,
providing a system for declaratively creating graphics based on the idea
of building up a plot using layers. The plot layers are connected using
the sign +
This code snippet sets up a ggplot object with the aesthetics defined
for x-axis (province), y-axis (total reached), and bar fill color (also
by province), and then adds a column geometry to render a bar for each
province. The fill aesthetic ensures that each province’s
bar is colored differently, aiding in distinguishing between them
visually in the plot.
ggplot(reached_province_tbl, aes(x = province, y = total_reached, fill = province)) +
geom_col()
We can also ‘pipe’ the dataframe and send it to
ggplot:
reached_province_tbl %>% ggplot(aes(x = province, y = total_reached, fill = province)) +
geom_col()
If we don’t want to use the plot at the moment of creation, it can be also store in a variable for later print or saving:
This code creates a bar chart using ggplot2:
aes(x = province, y = total_reached, fill = province):
Maps province to the x-axis and the fill color of the bars,
and total_reached to the y-axis.geom_col(): Plots bars for each
province with heights determined by total_reached.scale_fill_manual(values = UNICEF_PALETTE):
Applies custom colors from UNICEF_PALETTE to the bars, one
for each province.theme_minimal(): Uses a minimal theme
for a clean presentation.labs(): Sets labels for the x-axis,
y-axis, and a title for the chart.
p <- ggplot(reached_province_tbl, aes(x = province, y = total_reached, fill = province)) +
geom_col() +
scale_fill_manual(values = UNICEF_PALETTE) +
theme_minimal() +
labs(x = "Province", y = "Total Reached", title = "Total Reached by Province")
p
Usually, we can flip any chart by swapping the variables in the x and y axis as follows:
Alternatively, we can also use the command
coord_flip():
p <- ggplot(reached_province_tbl, aes(x = reorder(province, total_reached), y = total_reached, fill = province)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = UNICEF_PALETTE) +
theme_minimal() +
labs(x = "Province", y = "Total Reached", title = "Total Reached by Province")
p
We are grouping month results in the same bar:
ggplot(reached_month_province_tbl, aes(x = province, y = total_reached, fill = month)):
Initializes a ggplot graph using reached_month_province_tbl
data, setting province as the x-axis,
total_reached as the y-axis, and month to
color the bars.geom_col(): Adds columns to the plot, with heights
representing total_reached values.scale_fill_manual(values = UNICEF_PALETTE): Customizes
the color palette of the fill (month) using predefined colors in
UNICEF_PALETTE.theme_minimal(): Applies a minimalistic theme to the
plot, reducing visual noise.labs(x = "Province", y = "Total Reached", title = "Total Reached by Province and Month"):
Sets labels for the x-axis, y-axis, and the plot title.
p <- ggplot(reached_month_province_tbl, aes(x = province, y = total_reached, fill = month)) +
geom_col() +
scale_fill_manual(values = UNICEF_PALETTE) +
theme_minimal() +
labs(x = "Province", y = "Total Reached", title = "Total Reached by Province and Month")
p
We are grouping by month as before, but in this case we want separate
contiguous bars. We control this with
geom_col(position = position_dodge()):
ggplot(reached_month_province_tbl, aes(x = province, y = total_reached, fill = month)):
Sets up a ggplot with province as the x-axis,
total_reached as the y-axis, and different colors for each
month.geom_col(position = position_dodge()): Adds columns to
the plot, separating them by month to avoid overlap and
make comparisons easier.scale_fill_manual(values = UNICEF_PALETTE): Customizes
the fill colors using the UNICEF_PALETTE.theme_minimal(): Applies a minimalistic theme, reducing
visual clutter.labs(x = "Province", y = "Total Reached", title = "Total Reached by Province and Month"):
Sets labels and title for clarity.p <- ggplot(reached_month_province_tbl, aes(x = province, y = total_reached, fill = month)) +
geom_col(position = position_dodge()) +
scale_fill_manual(values = UNICEF_PALETTE) +
theme_minimal() +
labs(x = "Province", y = "Total Reached", title = "Total Reached by Province and Month")
p
This is a useful plot that can be constructed by combining different
ggplot elements:
ggplot(reached_province_tbl, aes(x=province, y=total_reached)):
Initializes a ggplot graph using reached_province_tbl,
mapping province to the x-axis and
total_reached to the y-axis.geom_segment(aes(x=province, xend=province, y=0, yend=total_reached, color=province)):
Adds vertical line segments from y=0 to total_reached for
each province, colored by province.geom_point(aes(color=province), size=4): Adds points at
the end of each segment, colored by province and with a
specified size.coord_flip(): Flips the x and y axes, making the bars
horizontal.scale_color_manual(values = UNICEF_PALETTE): Sets the
colors for the different provinces using
UNICEF_PALETTE.theme_minimal(): Applies a minimal theme to reduce
visual clutter.
p <- ggplot(reached_province_tbl, aes(x=province, y=total_reached)) +
geom_segment( aes(x=province, xend=province, y=0, yend=total_reached, color=province)) +
geom_point(aes(color=province), size=4) +
coord_flip() +
scale_color_manual(values = UNICEF_PALETTE) +
theme_minimal()
p
We will now be plotting time series data in different ways. In here
it was useful to transform the period column into a
date format, so we could derive the month variable in the
right chronological order (if not, the month would appear ordered
alphabetically)
ggplot(reached_month_national_tbl, aes(x = month, y = total_reached)):
Initializes a ggplot graph with month on the x-axis and
total_reached on the y-axis.geom_point(alpha=1, shape=20, color=UNICEF_PALETTE[1], size=15):
Adds points representing each data entry. Each point is fully opaque
(alpha=1), uses a filled circle shape
(shape=20), colored with the first color of the
UNICEF_PALETTE, and is large (size=15).geom_text(aes(label = total_reached), size = 2, color = "white", fontface="bold"):
Adds text labels to each point displaying total_reached,
with small white bold text (size=2).theme_minimal(): Applies a minimal theme to reduce
visual clutter.p <- ggplot(reached_month_national_tbl, aes(x = month, y = total_reached)) +
geom_point(alpha=1, shape=20, color=UNICEF_PALETTE[1], size=15) +
geom_text(aes(label = total_reached), size = 2, color = "white", fontface="bold") +
theme_minimal()
print(p)
We can joint the dots with a line plot to enhance the visualization:
ggplot(reached_month_national_tbl, aes(x = month, y = total_reached, group=1)):
Sets up a ggplot for reached_month_national_tbl with
month on the x-axis, total_reached on the
y-axis, and all data considered part of the same group (enabling line
connectivity).geom_line(size = 0.5, color = UNICEF_PALETTE[1]): Adds
a line connecting the data points, with a specified color from the
UNICEF_PALETTE and thickness.geom_point(shape=20, color=UNICEF_PALETTE[1], size=15):
Adds large, solid circle points at each data point, matching the line
color.geom_text(aes(label = total_reached), size = 2, color = "white", fontface="bold"):
Labels each point with its total_reached value in bold,
white text.theme_minimal(): Applies a minimalistic theme for a
clean appearance.
p <- ggplot(reached_month_national_tbl, aes(x = month, y = total_reached, group=1)) +
geom_line(size = 0.5, color = UNICEF_PALETTE[1]) +
geom_point(shape=20, color=UNICEF_PALETTE[1], size=15) +
geom_text(aes(label = total_reached), size = 2, color = "white", fontface="bold") +
theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
print(p)
We can use a grouping variable, in this case province,
to plot several lineplots across time:
ggplot(reached_month_province_tbl, aes(x = month, y = total_reached, group = province, color = province)):
Initializes a ggplot with month on the x-axis,
total_reached on the y-axis, and lines grouped and colored
by province.geom_line(size = 0.5): Adds lines to connect data
points within each province group, with a specified line
thickness.geom_point(alpha=0.7, shape=20, size=12): Adds
semi-transparent circle points at each data point, with specified size
and opacity.geom_text(aes(label = total_reached), size = 1.5, color = "white", fontface="bold"):
Labels each point with its total_reached value in bold,
white text.scale_color_manual(values = UNICEF_PALETTE): Customizes
the line colors using the UNICEF_PALETTE.theme_minimal(): Applies a minimal theme to reduce
visual clutter.theme(legend.position = "bottom"): Moves the legend to
the bottom of the plot for better visibility and layout.
multiline_plot <- ggplot(reached_month_province_tbl, aes(x = month, y = total_reached, group = province, color = province)) +
geom_line(size = 0.5) +
geom_point(alpha=0.7, shape=20, size=12) +
geom_text(aes(label = total_reached), size = 1.5, color = "white", fontface="bold") +
scale_color_manual(values = UNICEF_PALETTE) + # Customize the colors
theme_minimal() +
theme(legend.position = "bottom")
print(multiline_plot)
Let’s save the latest plot and export it as a PDF. We only need to write the path and name to the file, and choose the right extension ‘.pdf’. It will be automatically saved as PDF:
ggsave(plot = multiline_plot, filename = "plots/multiline.pdf", width = 12, height = 8)
Alternativelly, we can also save it as PNG following the same approach