Introduction

According to Britannica, the case fatality rate is the proportion of people who die from a specified disease among all individuals diagnosed with the disease over a certain period of time. In this study, the case fatality rate is used to measure the severity of diseases among teenagers and children resident in Nigeria.
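
As a small worked illustration of this definition, the sketch below computes a case fatality rate, expressed as a percentage, for ten made-up cases of a single disease. The vector `health_status` is hypothetical toy data and is not taken from the dataset used in this study.

```r
# Hypothetical toy data: outcomes of ten diagnosed cases of one disease
health_status <- c("Alive", "Dead", "Alive", "Alive", "Dead",
                   "Alive", "Alive", "Alive", "Alive", "Alive")

deaths <- sum(health_status == "Dead")  # cases that ended in death
cases  <- length(health_status)         # all diagnosed cases

# Case fatality rate as a percentage of diagnosed cases
cfr <- 100 * deaths / cases
cfr
```

Here 2 of the 10 cases are fatal, so the rate is 20%.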

Aim

To determine which diseases are most dangerous to teenagers and children in Nigeria.

Objectives

Data and Data Source

The Disease Outbreak in Nigeria Datasets is a public dataset simulated and collated by Emmanuel Odelami. It is auto-generated, based on the most common and deadly disease outbreaks in Nigeria. It contains disease reports from 2009 to 2018, covering all 36 states and both urban and rural settlements, and records each patient's age and disease.

Cleaning and Transforming the data

Setting up the environment

The packages used in the analysis, which are already installed, are loaded.

library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.5
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(readr)
## Warning: package 'readr' was built under R version 4.0.5
library(ggplot2)

Loading the dataset

disease_outbreak <- read.csv("meningitis_dataset.csv")

Exploring the dataset

The dataset is explored to understand its structure.

tibble(disease_outbreak)
## # A tibble: 284,484 x 40
##       id surname   firstn~1 middl~2 gender gende~3 gende~4 state settl~5 rural~6
##    <int> <chr>     <chr>    <chr>   <chr>    <int>   <int> <chr> <chr>     <int>
##  1     1 Solade    Grace    Solape  Female       0       1 Rive~ Rural         1
##  2     2 Eneche    Kure     Balogun Male         1       0 Ebon~ Rural         1
##  3     3 Sanusi    Adaugo   Kateri~ Female       0       1 Ogun  Urban         0
##  4     4 Sowore    Mooslem~ Ifedayo Female       0       1 Ondo  Rural         1
##  5     5 Abdusalam Yusuf    Okafor  Male         1       0 Oyo   Urban         0
##  6     6 Yakubu    Janet    Chioma  Female       0       1 Kadu~ Rural         1
##  7     7 Razak     Adaugo   Adaobi  Female       0       1 Tara~ Rural         1
##  8     8 Annakyi   Danmbaz~ Osagie  Male         1       0 Kats~ Rural         1
##  9     9 Adejoro   Iyin     Osatim~ Male         1       0 Kats~ Rural         1
## 10    10 Okorie    Adaugo   Chika   Female       0       1 Osun  Urban         0
## # ... with 284,474 more rows, 30 more variables: urban_settlement <int>,
## #   report_date <chr>, report_year <int>, age <int>, age_str <chr>,
## #   date_of_birth <chr>, child_group <int>, adult_group <int>, disease <chr>,
## #   cholera <int>, diarrhoea <int>, measles <int>,
## #   viral_haemmorrhaphic_fever <int>, meningitis <int>, ebola <int>,
## #   marburg_virus <int>, yellow_fever <int>, rubella_mars <int>, malaria <int>,
## #   serotype <chr>, NmA <int>, NmC <int>, NmW <int>, health_status <chr>, ...
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
colnames(disease_outbreak)
##  [1] "id"                         "surname"                   
##  [3] "firstname"                  "middlename"                
##  [5] "gender"                     "gender_male"               
##  [7] "gender_female"              "state"                     
##  [9] "settlement"                 "rural_settlement"          
## [11] "urban_settlement"           "report_date"               
## [13] "report_year"                "age"                       
## [15] "age_str"                    "date_of_birth"             
## [17] "child_group"                "adult_group"               
## [19] "disease"                    "cholera"                   
## [21] "diarrhoea"                  "measles"                   
## [23] "viral_haemmorrhaphic_fever" "meningitis"                
## [25] "ebola"                      "marburg_virus"             
## [27] "yellow_fever"               "rubella_mars"              
## [29] "malaria"                    "serotype"                  
## [31] "NmA"                        "NmC"                       
## [33] "NmW"                        "health_status"             
## [35] "alive"                      "dead"                      
## [37] "report_outcome"             "unconfirmed"               
## [39] "confirmed"                  "null_serotype"
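
Beyond the column names, it can help to tabulate the categorical columns that later steps rely on, because inconsistent spellings affect filtering. The sketch below shows the idea with dplyr's count() on a small hypothetical data frame; `toy_reports` and its values are made up for illustration and are not rows from the real dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for a few rows of the real dataset
toy_reports <- data.frame(
  disease        = c("Malaria", "Measles", "Malaria", "Cholera", "Malaria"),
  report_outcome = c("Confirmed", "confirmed", "Not Confirmed",
                     "Confirmed", "confirmed"),
  stringsAsFactors = FALSE
)

# Count how often each distinct spelling of report_outcome occurs
outcome_counts <- toy_reports %>%
  count(report_outcome)
outcome_counts
```

A table like this makes it clear whether a filter on the column needs to handle more than one capitalisation.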

Checking for and removing duplicates from the dataset

disease_outbreak <- distinct(disease_outbreak)
tibble(disease_outbreak)
## # A tibble: 284,484 x 40
##       id surname   firstn~1 middl~2 gender gende~3 gende~4 state settl~5 rural~6
##    <int> <chr>     <chr>    <chr>   <chr>    <int>   <int> <chr> <chr>     <int>
##  1     1 Solade    Grace    Solape  Female       0       1 Rive~ Rural         1
##  2     2 Eneche    Kure     Balogun Male         1       0 Ebon~ Rural         1
##  3     3 Sanusi    Adaugo   Kateri~ Female       0       1 Ogun  Urban         0
##  4     4 Sowore    Mooslem~ Ifedayo Female       0       1 Ondo  Rural         1
##  5     5 Abdusalam Yusuf    Okafor  Male         1       0 Oyo   Urban         0
##  6     6 Yakubu    Janet    Chioma  Female       0       1 Kadu~ Rural         1
##  7     7 Razak     Adaugo   Adaobi  Female       0       1 Tara~ Rural         1
##  8     8 Annakyi   Danmbaz~ Osagie  Male         1       0 Kats~ Rural         1
##  9     9 Adejoro   Iyin     Osatim~ Male         1       0 Kats~ Rural         1
## 10    10 Okorie    Adaugo   Chika   Female       0       1 Osun  Urban         0
## # ... with 284,474 more rows, 30 more variables: urban_settlement <int>,
## #   report_date <chr>, report_year <int>, age <int>, age_str <chr>,
## #   date_of_birth <chr>, child_group <int>, adult_group <int>, disease <chr>,
## #   cholera <int>, diarrhoea <int>, measles <int>,
## #   viral_haemmorrhaphic_fever <int>, meningitis <int>, ebola <int>,
## #   marburg_virus <int>, yellow_fever <int>, rubella_mars <int>, malaria <int>,
## #   serotype <chr>, NmA <int>, NmC <int>, NmW <int>, health_status <chr>, ...
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
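
The row count is 284,484 both before and after distinct(), which suggests that the dataset contained no fully duplicated rows. Note that distinct() compares every column, including id, so two reports that differ only in their id would not be treated as duplicates. A minimal sketch of how both kinds of duplicate could be counted explicitly, using a hypothetical toy data frame (`toy_reports` is not part of the real dataset):

```r
library(dplyr)

# Hypothetical toy data: rows 2 and 4 are exact copies,
# rows 1 and 3 describe the same report under different ids
toy_reports <- data.frame(
  id        = c(1, 2, 3, 2),
  firstname = c("Grace", "Kure", "Grace", "Kure"),
  disease   = c("Malaria", "Cholera", "Malaria", "Cholera"),
  stringsAsFactors = FALSE
)

# Number of rows that exactly repeat an earlier row in every column
n_exact <- sum(duplicated(toy_reports))

# Number of rows that repeat an earlier row once the id column is ignored
n_ignoring_id <- toy_reports %>%
  select(-id) %>%
  duplicated() %>%
  sum()

# distinct() keeps the first occurrence of each exact duplicate
deduplicated <- distinct(toy_reports)
```

Whether ignoring id is appropriate depends on how the ids were assigned, which the source does not state.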

Selecting and filtering the dataset

The dataset contains many columns that are redundant for this analysis, so select() is used to keep only the columns that are needed, and filter() is used to keep only confirmed reports for teenagers and children (age below 20).

disease_outbreak <- disease_outbreak %>%
  select(id, firstname, surname, gender, age, disease, settlement, state,
         report_year, health_status, report_outcome) %>%
  # keep teenagers and children only (age below 20)
  filter(age < 20) %>%
  # keep confirmed reports, in either capitalisation
  filter(report_outcome %in% c("Confirmed", "confirmed"))
tibble(disease_outbreak)
## # A tibble: 45,100 x 11
##       id firstname   surname  gender   age disease settl~1 state repor~2 healt~3
##    <int> <chr>       <chr>    <chr>  <int> <chr>   <chr>   <chr>   <int> <chr>  
##  1     5 Yusuf       Abdusal~ Male       9 Rubell~ Urban   Oyo      2017 Alive  
##  2    10 Adaugo      Okorie   Female    15 Marbur~ Urban   Osun     2014 Alive  
##  3    13 Christopher Folawiyo Male       4 Measles Urban   Adam~    2012 Dead   
##  4    22 Adaugo      Eleojo   Female    14 Malaria Urban   Rive~    2011 Alive  
##  5    28 Jane        Egar     Female     7 Viral ~ Urban   Kwara    2016 Dead   
##  6    30 Caroline    Isa      Female     7 Marbur~ Urban   Yobe     2014 Alive  
##  7    33 Paulina     Igbonubi Female     2 Viral ~ Urban   Kogi     2017 Alive  
##  8    42 Alexandria  Quayum   Female    17 Malaria Urban   Oyo      2011 Alive  
##  9    45 Christianah Chima    Female     3 Yellow~ Rural   Osun     2012 Dead   
## 10    56 Alexandria  Ileri    Female     5 Rubell~ Urban   Jiga~    2017 Alive  
## # ... with 45,090 more rows, 1 more variable: report_outcome <chr>, and
## #   abbreviated variable names 1: settlement, 2: report_year, 3: health_status
## # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
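
The filter on report_outcome checks two capitalisations of the same word. A minimal sketch of a more general form, which lowercases the column with tolower() before comparing, is shown below. The data frame `toy_reports` and its values are hypothetical and serve only to show which rows are kept.

```r
library(dplyr)

# Hypothetical toy data with mixed capitalisation in report_outcome
toy_reports <- data.frame(
  age            = c(4, 15, 32, 19, 20, 7),
  report_outcome = c("Confirmed", "confirmed", "Confirmed",
                     "CONFIRMED", "confirmed", "Not Confirmed"),
  stringsAsFactors = FALSE
)

young_confirmed <- toy_reports %>%
  filter(age < 20) %>%                            # teenagers and children only
  filter(tolower(report_outcome) == "confirmed")  # any capitalisation of "confirmed"

young_confirmed
```

Three rows remain (ages 4, 15 and 19). The 20-year-old is dropped because the comparison is strict, and "Not Confirmed" is dropped because the whole lowercased string must equal "confirmed".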

Separating the dataset into Settlements

Here the dataset is split by settlement type (urban and rural).

# Urban vs rural
urban <- disease_outbreak %>%
  filter(settlement == "Urban")
rural <- disease_outbreak %>%
  filter(settlement == "Rural")

Separating the dataset into Gender

Finally, the dataset is split by gender.

# Gender
male <- disease_outbreak %>%
  filter(gender == "Male")
female <- disease_outbreak %>%
  filter(gender == "Female")
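
Separate data frames make it easy to analyse each subgroup on its own. As an alternative sketch, and not the method used in this study, a case fatality rate per disease and subgroup could also be computed in one step with group_by() and summarise(). The toy data below is hypothetical and only mirrors the column names kept above.

```r
library(dplyr)

# Hypothetical toy data mirroring the columns kept above
toy_reports <- data.frame(
  disease       = c("Malaria", "Malaria", "Malaria", "Malaria",
                    "Measles", "Measles", "Measles", "Measles"),
  settlement    = c("Urban", "Urban", "Rural", "Rural",
                    "Urban", "Urban", "Rural", "Rural"),
  health_status = c("Alive", "Dead", "Dead", "Dead",
                    "Alive", "Alive", "Alive", "Dead"),
  stringsAsFactors = FALSE
)

# Case fatality rate (%) per disease and settlement:
# deaths divided by the number of cases in each group
cfr_by_settlement <- toy_reports %>%
  group_by(disease, settlement) %>%
  summarise(
    cases  = n(),
    deaths = sum(health_status == "Dead"),
    cfr    = 100 * deaths / cases,
    .groups = "drop"
  )

cfr_by_settlement
```

Replacing `settlement` with `gender` in group_by() would give the equivalent breakdown by gender, assuming the data frame has that column.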