Data provided by countries to WHO and estimates of TB burden generated by WHO for the Global Tuberculosis Report are available for download as comma-separated value (CSV) files. CSV files can be opened by or imported into many spreadsheet, statistical analysis and database packages.
library(stringr)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(readr)
library(ggplot2)
library(dplyr)
who
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Afghanistan AF AFG 1980 NA NA NA
## 2 Afghanistan AF AFG 1981 NA NA NA
## 3 Afghanistan AF AFG 1982 NA NA NA
## 4 Afghanistan AF AFG 1983 NA NA NA
## 5 Afghanistan AF AFG 1984 NA NA NA
## 6 Afghanistan AF AFG 1985 NA NA NA
## 7 Afghanistan AF AFG 1986 NA NA NA
## 8 Afghanistan AF AFG 1987 NA NA NA
## 9 Afghanistan AF AFG 1988 NA NA NA
## 10 Afghanistan AF AFG 1989 NA NA NA
## # ... with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
## # new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## # new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## # new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## # new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## # new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## # new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## # new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## # new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## # new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
## # new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## # new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## # new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
## # new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
## # newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
## # newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
## # newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
## # newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
## # newrel_f65 <int>
who_new <- who %>%
  # Gather the 56 case-count columns into one key column (code) and one value column (cases).
  gather(code, cases, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
  # Make the codes consistent: "newrel" lacks the underscore that the other prefixes have.
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  # Split each code at the underscores into three separate columns.
  separate(code, c("new", "var", "sexage")) %>%
  # Drop the columns that carry no useful information for this analysis.
  select(-new, -iso2, -iso3) %>%
  # Split sexage after its first character into a sex column and an age column.
  separate(sexage, c("sex", "age"), sep = 1)
who_new
## # A tibble: 76,046 x 6
## country year var sex age cases
## * <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
## 8 Afghanistan 2004 sp m 014 139
## 9 Afghanistan 2005 sp m 014 151
## 10 Afghanistan 2006 sp m 014 193
## # ... with 76,036 more rows
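To see what each step of the pipeline does, the same transformations can be traced on a couple of column names. This is only a minimal sketch: the `toy` table and its values are made up for illustration and are not WHO data.

```r
library(tidyr)
library(dplyr)
library(stringr)

# Hypothetical one-row table using two of the original column names.
toy <- tibble::tibble(country = "X", year = 2013L,
                      new_sp_m014 = 5L, newrel_f65 = 7L)

toy_tidy <- toy %>%
  gather(code, cases, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
  mutate(code = str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>%  # splits at each "_"
  select(-new) %>%
  separate(sexage, c("sex", "age"), sep = 1)     # first character is the sex

toy_tidy
```

Here `new_sp_m014` ends up as var `sp`, sex `m`, age `014`, and `newrel_f65` ends up as var `rel`, sex `f`, age `65`.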
The original data is messy and we were unable to analyze it directly. After the data is cleaned, we are able to start the analysis. For reference, the values of var are:
~ rel stands for cases of relapse
~ ep stands for cases of extrapulmonary TB
~ sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
~ sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
The number of TB cases worldwide increased from 1995 onward, with a sharp drop after 2010. One possible explanation is that an effective new treatment was developed. However, when the data is separated by type of TB, the drop comes only from the rel cases (relapsed cases). These make up almost 75% of the total, so we conclude that the treatment is not effective, and that in most cases TB comes back after 1-2 years.
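The code behind this first analysis is not shown. A minimal sketch of how a yearly breakdown by TB type might be computed, assuming the tidy columns year, var and cases from who_new; the helper `cases_by_year_type` and the `demo_trend` rows are hypothetical and made up for illustration:

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: total cases per year and TB type.
cases_by_year_type <- function(tidy_cases) {
  tidy_cases %>%
    group_by(year, var) %>%
    summarise(total_cases = sum(cases)) %>%
    ungroup()
}

# Made-up rows with the columns used here; not WHO data.
demo_trend <- tibble::tibble(
  country = rep(c("A", "B"), each = 4),
  year    = rep(c(2012L, 2012L, 2013L, 2013L), times = 2),
  var     = rep(c("sp", "rel"), times = 4),
  cases   = c(10L, 20L, 30L, 40L, 1L, 2L, 3L, 4L)
)

# In the analysis this would be cases_by_year_type(who_new).
yearly <- cases_by_year_type(demo_trend)

# One line per TB type, so a change in one type is visible on its own.
ggplot(yearly, aes(x = year, y = total_cases, colour = var)) +
  geom_line() +
  geom_point()
```

Summarising before plotting gives one value per year and type, which makes it easier to check which type drives a change in the overall total.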
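who2 <- who_new %>%
  # Total the cases for each sex and TB type, one row per combination.
  # (mutate() would repeat the total on every row and inflate the stacked bars.)
  group_by(sex, var) %>%
  summarise(sum_cases = sum(cases)) %>%
  ungroup()
ggplot(who2, mapping = aes(x = sex, y = sum_cases, fill = var)) +
  geom_bar(stat = "identity")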
We can see that males account for more cases than females, both for new cases and for relapses.
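who3 <- who_new %>%
  # Total the cases for each age group and TB type, one row per combination.
  # (mutate() would repeat the total on every row and inflate the stacked bars.)
  group_by(age, var) %>%
  summarise(sum_cases = sum(cases)) %>%
  ungroup()
ggplot(who3, mapping = aes(x = age, y = sum_cases, fill = var)) +
  geom_bar(stat = "identity")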
We can see that few people aged 14 or under get the disease. Most patients are between 15 and 44, and this group also has a higher chance of relapsing.
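A reading of the bar chart like this can be backed up with numbers by computing each age group's share of the total. The following is a rough sketch under the assumption that the tidy data has age and cases columns as in who_new; the `demo_age` counts are invented for illustration only.

```r
library(dplyr)

# Made-up counts; in the analysis, who_new would take the place of demo_age.
demo_age <- tibble::tibble(
  age   = c("014", "014", "1524", "2534", "65"),
  cases = c(5L, 5L, 40L, 30L, 20L)
)

age_share <- demo_age %>%
  group_by(age) %>%
  summarise(total_cases = sum(cases)) %>%
  # After summarise() the result is no longer grouped, so sum() is the grand total.
  mutate(share = total_cases / sum(total_cases)) %>%
  arrange(desc(share))

age_share
```

The same pattern with `group_by(sex)` would give the split between males and females.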
Three analyses were carried out in this project to assess how effective the treatment is and to give useful advice to researchers who aim to develop treatments for TB. The conclusions are:
~ The current treatment is not very effective, and there is a high probability of relapsing after a period of 1 to 2 years.
~ Males are more likely to get the disease and to relapse.
~ People aged 14 or under have a low probability of getting the disease, and the people most at risk are between 15 and 44.