Data Introduction

Data provided by countries to WHO and estimates of TB burden generated by WHO for the Global Tuberculosis Report are available for download as comma-separated value (CSV) files. CSV files can be opened by or imported into many spreadsheet, statistical analysis and database packages.
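As a minimal sketch of importing such a file into R, the example below writes a tiny CSV to a temporary path and reads it back with readr. The file contents and the name csv_path are assumptions made for illustration and are not the actual WHO download; only the column names follow the who table used in this report.

```r
library(readr)

# Hypothetical stand-in for a downloaded WHO file: a tiny CSV written to a
# temporary path so that the example runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("country,year,new_sp_m014",
    "Afghanistan,1997,0",
    "Afghanistan,1998,30"),
  csv_path
)

# read_csv() guesses the column types and returns a tibble.
tb <- read_csv(csv_path)
tb
```

With a real download, csv_path would simply be replaced by the path of the saved file.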

Libraries used

library(stringr)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(readr)
library(ggplot2)
library(dplyr)

Raw Data

who
## # A tibble: 7,240 x 60
##        country  iso2  iso3  year new_sp_m014 new_sp_m1524 new_sp_m2534
##          <chr> <chr> <chr> <int>       <int>        <int>        <int>
##  1 Afghanistan    AF   AFG  1980          NA           NA           NA
##  2 Afghanistan    AF   AFG  1981          NA           NA           NA
##  3 Afghanistan    AF   AFG  1982          NA           NA           NA
##  4 Afghanistan    AF   AFG  1983          NA           NA           NA
##  5 Afghanistan    AF   AFG  1984          NA           NA           NA
##  6 Afghanistan    AF   AFG  1985          NA           NA           NA
##  7 Afghanistan    AF   AFG  1986          NA           NA           NA
##  8 Afghanistan    AF   AFG  1987          NA           NA           NA
##  9 Afghanistan    AF   AFG  1988          NA           NA           NA
## 10 Afghanistan    AF   AFG  1989          NA           NA           NA
## # ... with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
## #   new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## #   new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## #   new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## #   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## #   new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## #   new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## #   new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## #   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## #   new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
## #   new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## #   new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## #   new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
## #   new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
## #   newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
## #   newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
## #   newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
## #   newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
## #   newrel_f65 <int>

Tidy Data

who_new <- who %>%
  gather(code, cases, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  # In this step, we gather the case counts from all 56 count columns into one column (cases); the old column names go into code.
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  # In this step, we make the format of the codes consistent ("newrel" becomes "new_rel").
  separate(code, c("new", "var", "sexage")) %>% 
  # In this step, we split the code at each underscore and put the pieces into separate columns.
  select(-new, -iso2, -iso3) %>% 
  # In this step, we drop the variables that do not provide useful information.
  separate(sexage, c("sex", "age"), sep = 1)
  # In this step, we split sexage after its first character to get separate sex and age columns.
who_new
## # A tibble: 76,046 x 6
##        country  year   var   sex   age cases
##  *       <chr> <int> <chr> <chr> <chr> <int>
##  1 Afghanistan  1997    sp     m   014     0
##  2 Afghanistan  1998    sp     m   014    30
##  3 Afghanistan  1999    sp     m   014     8
##  4 Afghanistan  2000    sp     m   014    52
##  5 Afghanistan  2001    sp     m   014   129
##  6 Afghanistan  2002    sp     m   014    90
##  7 Afghanistan  2003    sp     m   014   127
##  8 Afghanistan  2004    sp     m   014   139
##  9 Afghanistan  2005    sp     m   014   151
## 10 Afghanistan  2006    sp     m   014   193
## # ... with 76,036 more rows
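The same reshaping can probably be written more compactly with pivot_longer(), which replaces gather() in newer versions of tidyr. This is a sketch that assumes tidyr 1.0.0 or later; the name who_long is introduced here only for illustration, and the regular expression is one possible way to split the column names.

```r
library(tidyr)
library(dplyr)

who_long <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    # "new_?" matches both "new_" and "new", so "newrel" needs no renaming.
    # The three capture groups become var, sex and age.
    names_pattern = "new_?(.*)_(.)(.*)",
    names_to = c("var", "sex", "age"),
    values_to = "cases",
    values_drop_na = TRUE
  ) %>%
  # Drop the redundant country codes, as in the original pipeline.
  select(-iso2, -iso3)

who_long
```

The rows come out in a different order than with gather(), but the columns and the number of rows should be the same.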

The original data is messy and could not be analyzed directly. Once the data is cleaned, we can start the analysis. For reference, the values of var are:

rel stands for cases of relapse
ep stands for cases of extrapulmonary TB
sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
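The codes listed above can be turned into readable labels with a named character vector. The sketch below uses a hypothetical lookup called var_labels; the label texts are shortened from the descriptions above, and the codes vector is made up for the example.

```r
# Hypothetical lookup table: names are the codes in the var column,
# values are short descriptions taken from the text.
var_labels <- c(
  rel = "relapse",
  ep  = "extrapulmonary TB",
  sn  = "pulmonary TB, smear negative",
  sp  = "pulmonary TB, smear positive"
)

# Example codes; indexing the lookup by name translates each one.
codes <- c("sp", "rel", "ep", "sn", "sp")
labelled <- unname(var_labels[codes])
labelled
```

Applied to the tidy table, the same indexing could be used inside mutate(), for example mutate(var_label = var_labels[var]).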

Analysis

Analysis of Time vs. Cases

The number of TB cases worldwide increased from 1995 onward, but dropped sharply after 2010. One possible explanation is that an effective new treatment was developed. However, when the data is separated by type of TB, the drop comes only from the rel cases (relapsed cases). These make up almost 75% of the total, so we conclude that the treatment is not effective and that in most cases TB comes back after 1 to 2 years.
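The code for this plot is not shown in the report. The following is a sketch of one way to produce a yearly view by type of TB, assuming the tidy table is built from tidyr's who data as in the Tidy Data section. The names who_year and p_year are introduced here for illustration, and the original figure may have been drawn differently.

```r
library(tidyverse)

# Rebuild the tidy table so that this block runs on its own
# (same steps as in the Tidy Data section).
who_new <- who %>%
  gather(code, cases, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
  mutate(code = str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>%
  select(-new, -iso2, -iso3) %>%
  separate(sexage, c("sex", "age"), sep = 1)

# One row per year and TB type, so each bar segment is drawn exactly once.
who_year <- who_new %>%
  group_by(year, var) %>%
  summarise(sum_cases = sum(cases)) %>%
  ungroup()

# Stacked bars: total cases per year, split by TB type.
p_year <- ggplot(who_year, aes(x = year, y = sum_cases, fill = var)) +
  geom_col()
p_year
```

Summarising before plotting keeps the bar heights equal to the actual yearly totals.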

Analysis of Sex vs. Cases

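who2 <- who_new %>% 
  group_by(sex, var) %>% 
  # Summarise to one row per sex and TB type; with mutate() the total is repeated on every row and the stacked bars are inflated.
  summarise(sum_cases = sum(cases)) %>%
  ungroup()
ggplot(who2, mapping = aes(x = sex, y = sum_cases, fill = var)) +
  geom_bar(stat = "identity") 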

Males account for more cases than females, both overall and among relapsed cases.
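The comparison between males and females can also be checked numerically. The sketch below computes, for each type of TB, the share of cases recorded for each sex. The name sex_share is hypothetical, and the tidy table is rebuilt so that the block runs on its own.

```r
library(tidyverse)

# Rebuild the tidy table (same steps as in the Tidy Data section).
who_new <- who %>%
  gather(code, cases, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
  mutate(code = str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>%
  select(-new, -iso2, -iso3) %>%
  separate(sexage, c("sex", "age"), sep = 1)

# Total cases per TB type and sex, then each sex's share within its type.
sex_share <- who_new %>%
  group_by(var, sex) %>%
  summarise(sum_cases = sum(cases)) %>%
  group_by(var) %>%
  mutate(share = sum_cases / sum(sum_cases)) %>%
  ungroup()

sex_share
```

A share above 0.5 in the rows with sex equal to m would support the statement for that type of TB.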

Analysis of Age vs. Cases

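who3 <- who_new %>%
  group_by(age, var) %>%
  # Summarise to one row per age group and TB type so that each bar segment is counted once.
  summarise(sum_cases = sum(cases)) %>%
  ungroup()
ggplot(who3, mapping = aes(x = age, y = sum_cases, fill = var)) +
  geom_bar(stat = "identity")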

Very few people aged 14 or under get the disease. There are many patients between 15 and 44, and this group also has a higher chance of relapsing.
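The age codes in the plot (014, 1524, and so on) are compact but hard to read. A possible way to relabel them is an ordered lookup, sketched below. The name age_labels and the label texts are assumptions based on the usual reading of these codes (014 as 0 to 14 years, 65 as 65 or older), and the ages vector is made up for the example.

```r
# Hypothetical lookup: names are the codes in the age column,
# values are readable labels in ascending age order.
age_labels <- c(
  "014"  = "0-14",
  "1524" = "15-24",
  "2534" = "25-34",
  "3544" = "35-44",
  "4554" = "45-54",
  "5564" = "55-64",
  "65"   = "65+"
)

# Example codes in arbitrary order.
ages <- c("65", "014", "2534")

# A factor keeps the groups in age order on a plot axis.
age_factor <- factor(ages, levels = names(age_labels), labels = unname(age_labels))
as.character(sort(age_factor))
```

In the tidy table, the same call could be applied to the age column before plotting so that the x axis shows the readable labels.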

Conclusion

Three analyses were carried out in this project to assess how effective treatment is and to give useful advice to researchers who aim to develop treatments for TB. The conclusions are:

~ The current treatment is not very effective, and there is a high probability of relapse after 1 to 2 years.
~ Males are more likely to get the disease and to relapse.
~ People aged 14 or under have a low probability of getting the disease, and the people most at risk are between 15 and 44.