Introduction
This is a classroom exercise for the PYU CS424 course on Big Data Analysis. The aim of this exercise is to demonstrate data wrangling techniques useful for converting raw datasets into a tidy data frame that can be easily analyzed using the Tidyverse in R.
Background
Tuberculosis (TB) claims approximately 4,000 lives each day and remains the top infectious killer in the world. Each year millions of people fall ill with this preventable and curable disease. The WHO publishes the Global Tuberculosis Report annually to provide a comprehensive and up-to-date assessment of the TB epidemic and of the progress arising from abatement efforts at global, regional and country levels. However, data gathering and verification for this report is a slow process. To provide early warnings of changes, the WHO has invited health officials around the globe to submit provisional TB notifications monthly or quarterly on a voluntary basis.[^WHOTB] Although the numbers are neither official nor complete, they do provide early indications of changes in the pattern of tuberculosis infections. The dynamic state of this database makes it a useful exercise for demonstrating data wrangling techniques.
TB is caused by a mycobacterium that is able to create holes and abscesses in lung and brain tissue. It spreads rapidly in overcrowded conditions and among those who are malnourished. Although it is curable, treatment can take months and requires drugs that are often unavailable or unaffordable to those infected.

Research Question
Does the provisional database of tuberculosis infections show any hint of the impact of the COVID-19 pandemic on tuberculosis infections?
Methodology
The WHO Global Tuberculosis Programme receives provisional data voluntarily from countries worldwide on either a monthly or quarterly basis. These numbers are not final and are subject to change, but they do provide early indications of any changes in the infection pattern of tuberculosis.
The data was downloaded from the WHO website as a CSV file that was loaded into R. Analysis was done with the Tidyverse library.
library(tidyverse)
library(lubridate)
tbdat = read_csv("datasets/TB_provisional_notifications_2021-08-09.csv")
Rows: 215 Columns: 24
── Column specification ────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): country, iso2, iso3, iso_numeric, g_whoregion
dbl (19): year, report_frequency, report_coverage, m_01, m_02, m_03, m_04, m_05, m_06, m_07, m_0...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data dictionary
Meaning of the data fields:
- country = col_character() : Country name
- iso2 = col_character() : ISO 3166-1 alpha-2 (two-letter) country code
- iso3 = col_character() : ISO 3166-1 alpha-3 (three-letter) country code
- iso_numeric = col_character() : ISO 3166-1 numeric country code
- g_whoregion = col_character() : Region
- year = col_double() : Calendar year
- report_frequency = col_double() : Monthly or Quarterly reporting
- report_coverage = col_double() : Completeness of the report
- m_01 = col_double() : Jan infection data
- m_02 = col_double() : Feb infection data
- m_03 = col_double() : Mar infection data
- m_04 = col_double() : Apr infection data
- m_05 = col_double() : May infection data
- m_06 = col_double() : Jun infection data
- m_07 = col_double() : Jul infection data
- m_08 = col_double() : Aug infection data
- m_09 = col_double() : Sep infection data
- m_10 = col_double() : Oct infection data
- m_11 = col_double() : Nov infection data
- m_12 = col_double() : Dec infection data
- q_1 = col_double() : 1st quarter infection data
- q_2 = col_double() : 2nd quarter infection data
- q_3 = col_double() : 3rd quarter infection data
- q_4 = col_double() : 4th quarter infection data
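The column types listed above can also be passed to `read_csv()` explicitly, which silences the column-specification message and guards against type-guessing surprises. The following is a minimal sketch on a small inline table rather than the WHO file: the rows and counts are made up for illustration, and only a few of the 24 columns are included. It also sets `na = ""`, which is an assumption about what is wanted here, because with readr's default `na = c("", "NA")` the ISO code `NA` (Namibia) would be read as a missing value.

```r
library(readr)

# Hypothetical three-row extract; the counts are invented for illustration.
csv_text <- "country,iso2,iso3,iso_numeric,year,m_01,q_1
Afghanistan,AF,AFG,004,2020,100,
Namibia,NA,NAM,516,2020,,250
Thailand,TH,THA,764,2020,80,
"

toy <- read_csv(
  I(csv_text),                       # I() marks the string as literal data (readr >= 2.0)
  col_types = cols(
    country     = col_character(),
    iso2        = col_character(),
    iso3        = col_character(),
    iso_numeric = col_character(),   # character keeps leading zeros such as "004"
    year        = col_double(),
    m_01        = col_double(),
    q_1         = col_double()
  ),
  na = ""                            # only empty fields are treated as missing
)

toy$iso2         # the code "NA" stays a string, not a missing value
toy$iso_numeric  # leading zero preserved
```

With the real file, the same `col_types` and `na` arguments would be added to the existing `read_csv()` call, extended to all 24 columns.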
Selecting the countries of interest
Two methods were attempted and timed.
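# Time 1000 calls of func(), repeated 5 times.
# Returns the elapsed time of each repeat in seconds.
timefunc <- function(func) {
  dat <- numeric(5)
  for (rpt in seq_along(dat)) {
    start <- Sys.time()
    for (i in seq_len(1000)) {
      func()
    }
    # Explicit units keep the result in seconds regardless of duration
    dat[rpt] <- as.numeric(difftime(Sys.time(), start, units = "secs"))
    print(dat[rpt])
  }
  return(dat)
}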
- Selection by multiple logical expressions
# Same six countries as the grepl() version and the reshaping step
multiselect <- function() {
  tbdat %>%
    filter(iso2 == "TH" | iso2 == "IN" | iso2 == "CN" |
           iso2 == "US" | iso2 == "LA" | iso2 == "ID")
}
dat = timefunc(multiselect)
[1] 9.601976
[1] 8.056268
[1] 8.619806
[1] 8.674143
[1] 8.488425
Mean 8.6881237 +/- 0.5654888
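- Selection by grepl() patterns
# The pattern is unanchored; this is safe here because iso2 codes are
# exactly two characters long, so any match covers the whole code
greplselect <- function() {
  tbdat %>%
    filter(grepl("CN|IN|ID|LA|TH|US", iso2))
}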
dat2 = timefunc(greplselect)
[1] 8.251768
[1] 8.402063
[1] 8.514668
[1] 9.502662
[1] 9.095228
Mean 8.7532777 +/- 0.5268549
Welch Two Sample t-test
data: dat and dat2
t = -0.1885, df = 7.9603, p-value = 0.8552
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.8629067 0.7325987
sample estimates:
mean of x mean of y
8.688124 8.753278
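The summary lines and the test output above are consistent with base R's `mean()`, `sd()` and `t.test()` applied to the two timing vectors. The original calls are not shown, so the sketch below re-enters the printed timings by hand and assumes that the "+/-" figure is the sample standard deviation.

```r
# Timings (seconds per 1000 calls) copied from the printed output above.
dat  <- c(9.601976, 8.056268, 8.619806, 8.674143, 8.488425)  # logical expressions
dat2 <- c(8.251768, 8.402063, 8.514668, 9.502662, 9.095228)  # grepl() pattern

# Mean and sample standard deviation of each run.
cat("Mean", mean(dat),  "+/-", sd(dat),  "\n")
cat("Mean", mean(dat2), "+/-", sd(dat2), "\n")

# t.test() defaults to the two-sided Welch test (var.equal = FALSE).
res <- t.test(dat, dat2)
print(res)
```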
The Welch t-test shows no statistically significant difference in run time between the two methods (p = 0.86).
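A third way to express the same selection, not timed in this exercise, is the `%in%` operator, which avoids both the chain of `==` comparisons and the regular expression. The sketch below uses a small made-up tibble (`toy` and `wanted` are illustrative names, not part of the WHO file) to check that the three filters return the same rows.

```r
library(dplyr)

# Hypothetical data: a few country codes, one missing, with invented counts.
toy <- tibble(
  iso2  = c("TH", "IN", "FR", "CN", "US", "LA", "ID", "BR", NA),
  cases = 1:9
)

wanted <- c("CN", "IN", "ID", "LA", "TH", "US")

# Method 1: chained logical expressions
by_or <- toy %>%
  filter(iso2 == "TH" | iso2 == "IN" | iso2 == "CN" |
         iso2 == "US" | iso2 == "LA" | iso2 == "ID")

# Method 2: regular expression
by_grepl <- toy %>%
  filter(grepl("CN|IN|ID|LA|TH|US", iso2))

# Method 3: set membership
by_in <- toy %>%
  filter(iso2 %in% wanted)

# All three keep the same six rows; the row with a missing code is dropped.
identical(by_or$iso2, by_in$iso2)
identical(by_grepl$iso2, by_in$iso2)
```

Keeping the codes in a vector such as `wanted` also means the list of countries is written once and can be reused in later steps.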
Reshaping the dataset
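# Keep the six countries of interest, then reshape the wide columns to long form
tbdat2 = tbdat %>%
  filter(iso2 == "TH" | iso2 == "IN" | iso2 == "CN" |
         iso2 == "US" | iso2 == "LA" | iso2 == "ID") %>%
  # Monthly columns m_01..m_12 become (mon, infmon) pairs
  gather(m_01, m_02, m_03, m_04, m_05, m_06,
         m_07, m_08, m_09, m_10, m_11, m_12,
         key = "mon", value = "infmon") %>%
  # Quarterly columns q_1..q_4 become (qrt, infqrt) pairs; because this runs
  # after the monthly gather, every month row is repeated once per quarter
  gather(q_1, q_2, q_3, q_4,
         key = "qrt", value = "infqrt") %>%
  # Split "q_1" into "q" and "1", and "m_01" into "s" and "01"
  separate(qrt, into = c("q", "quarter"), sep = "_") %>%
  separate(mon, into = c("s", "month"), sep = "_") %>%
  # First day of each month as a Date, e.g. 20200101 -> 2020-01-01
  mutate(yrmon = ymd(year * 10000 + as.integer(month) * 100 + 1)) %>%
  select(iso2, yrmon, infmon, quarter, infqrt)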
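In current versions of tidyr, `gather()` is superseded by `pivot_longer()`. As an alternative sketch, not the pipeline used above, the monthly and quarterly columns can be reshaped in a single step so that each reported value occupies exactly one row, rather than pairing every month with every quarter. The toy table, its counts, and the column names `period_type`, `period` and `notifications` are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical data: one monthly reporter (TH) and one quarterly reporter (US).
toy <- tibble(
  iso2 = c("TH", "US"),
  year = c(2020, 2020),
  m_01 = c(100, NA),
  m_02 = c(90, NA),
  q_1  = c(NA, 500),
  q_2  = c(NA, 450)
)

# One row per reported value; "m_01" splits into period_type "m" and period "01".
long <- toy %>%
  pivot_longer(
    cols = matches("^[mq]_"),
    names_to = c("period_type", "period"),
    names_sep = "_",
    values_to = "notifications",
    values_drop_na = TRUE
  )

# Monthly series with a proper Date column (first day of each month).
monthly <- long %>%
  filter(period_type == "m") %>%
  mutate(yrmon = make_date(year, as.integer(period), 1L))

long
monthly
```

In this layout the monthly and quarterly series can be filtered apart by `period_type`, and no value is duplicated across rows.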
- Sample of the Polished Data
Plot the data
tbdat2 %>%
  filter(!is.na(infmon)) %>%
  ggplot(aes(x = yrmon, y = infmon, color = iso2)) +
  geom_smooth(method = "loess", formula = y ~ x) +
  geom_point()

Alternative plot
tbdat2 %>%
  filter(!is.na(infmon)) %>%
  ggplot(aes(x = yrmon, y = infmon, color = iso2)) +
  geom_smooth(method = "loess", formula = y ~ x) +
  geom_point() +
  facet_grid(rows = vars(iso2), scales = "free_y")

Results
The greatest decrease in the number of TB cases was seen at the peaks of the COVID-19 waves.
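One hedged way to put a number on such a decrease is to compare annual totals per country. The sketch below uses invented counts for a placeholder country code `XX`; it illustrates the arithmetic only and does not reproduce any figure from the WHO data.

```r
library(dplyr)
library(tidyr)

# Hypothetical monthly counts for a placeholder country "XX".
toy <- tibble(
  iso2   = rep("XX", 4),
  year   = c(2019, 2019, 2020, 2020),
  infmon = c(700, 500, 500, 400)
)

change <- toy %>%
  group_by(iso2, year) %>%
  summarise(total = sum(infmon), .groups = "drop") %>%
  # One column per year: y2019, y2020
  pivot_wider(names_from = year, values_from = total, names_prefix = "y") %>%
  # Percentage change relative to the 2019 total
  mutate(pct_change = 100 * (y2020 - y2019) / y2019)

change
```

A comparison like this is only meaningful when both years cover the same months, which matters for a provisional dataset in which the latest year is incomplete.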
Summary and Conclusion
Our preliminary findings are consistent with those reported by the WHO. Although the data are provisional, they provide an early indication of how the disruption caused by the COVID-19 pandemic may be affecting essential TB services. At the same time, the anti-COVID-19 measures also ... By mid-March 2021, 84 countries, accounting for more than 80% of global TB incidence and almost 90% of global TB notifications in 2019, had reported complete monthly or quarterly data for 2020. These showed a 21% drop in TB notifications between 2019 and 2020 overall, with much larger reductions in some high TB burden countries, notably India, Indonesia, the Philippines and South Africa.
References
