R used with its tidyr package is impressive. Below is the messy World Health data set (who) as it ships with tidyr. Passing it through gather() then cleans it up in one line:

World Health Examples

Here is what the who data looks like before tidying:

who

## # A tibble: 7,240 x 60
##        country  iso2  iso3  year new_sp_m014 new_sp_m1524 new_sp_m2534
##          <chr> <chr> <chr> <int>       <int>        <int>        <int>
##  1 Afghanistan    AF   AFG  1980          NA           NA           NA
##  2 Afghanistan    AF   AFG  1981          NA           NA           NA
##  3 Afghanistan    AF   AFG  1982          NA           NA           NA
##  4 Afghanistan    AF   AFG  1983          NA           NA           NA
##  5 Afghanistan    AF   AFG  1984          NA           NA           NA
##  6 Afghanistan    AF   AFG  1985          NA           NA           NA
##  7 Afghanistan    AF   AFG  1986          NA           NA           NA
##  8 Afghanistan    AF   AFG  1987          NA           NA           NA
##  9 Afghanistan    AF   AFG  1988          NA           NA           NA
## 10 Afghanistan    AF   AFG  1989          NA           NA           NA
## # ... with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
## #   new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## #   new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## #   new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## #   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## #   new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## #   new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## #   new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## #   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## #   new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
## #   new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## #   new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## #   new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
## #   new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
## #   newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
## #   newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
## #   newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
## #   newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
## #   newrel_f65 <int>

who is 7,240 rows by 60 columns. A mess.

dim(who)

## [1] 7240   60

And here it is after it is processed by tidyr, with help from dplyr's select() and count(). That is a lot of cleaning for very little code.

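library(dplyr)  # select(), count(), and the %>% pipe
library(tidyr)  # gather() and the who data set

who %>%
  select(-iso2, -iso3) %>%                    # drop the two country-code columns
  gather(group, cases, -country, -year) %>%   # stack the 56 group columns into group/cases pairs
  count(country, year, wt = cases)            # sum cases for each country-year pair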

## # A tibble: 7,240 x 3
##        country  year     n
##          <chr> <int> <int>
##  1 Afghanistan  1980     0
##  2 Afghanistan  1981     0
##  3 Afghanistan  1982     0
##  4 Afghanistan  1983     0
##  5 Afghanistan  1984     0
##  6 Afghanistan  1985     0
##  7 Afghanistan  1986     0
##  8 Afghanistan  1987     0
##  9 Afghanistan  1988     0
## 10 Afghanistan  1989     0
## # ... with 7,230 more rows
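
To see what gather() and count() are doing, here is a minimal sketch on a made-up table. The toy data frame and its column names (m014, f014) are hypothetical, chosen only to mirror the shape of who; the sketch assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per country-year, one column per group.
toy <- tibble(
  country = c("A", "A", "B"),
  year    = c(2000L, 2001L, 2000L),
  m014    = c(1L, NA, 3L),
  f014    = c(2L, NA, 4L)
)

# gather() stacks every column except country and year into
# group/cases pairs: 3 rows x 2 group columns become 6 rows.
long <- toy %>%
  gather(group, cases, -country, -year)

# count() with wt = cases sums cases within each country-year pair,
# skipping missing values.
totals <- long %>%
  count(country, year, wt = cases)

print(long)
print(totals)
```

The country-year pair whose cases are all NA (A, 2001) comes out with n equal to 0, which is why the early Afghanistan years above show 0 rather than NA.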

Taking it a step further:

library(ggplot2)  # the graphing package

who %>%
  select(-iso2, -iso3) %>%
  gather(group, cases, -country, -year) %>%
  count(country, year, wt = cases) %>%          # makes a frequency table; wt = cases sums each country-year combo
  ggplot(aes(x = year, y = n, group = country)) +  # one line per country
  geom_line(size = 0.2)  # this could be one of dozens of geoms: geom_bar, geom_histogram, a heatmap...
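
In tidyr 1.0.0 and later, gather() is marked as superseded in favor of pivot_longer(). As a sketch only, assuming a tidyr version that provides pivot_longer(), the same totals could be computed like this (who_totals is a name introduced here for illustration):

```r
library(dplyr)
library(tidyr)  # assumes tidyr >= 1.0.0 for pivot_longer()

who_totals <- who %>%
  select(-iso2, -iso3) %>%
  pivot_longer(
    cols      = -c(country, year),  # every column except the two identifiers
    names_to  = "group",            # old column names go here
    values_to = "cases"             # cell values go here
  ) %>%
  count(country, year, wt = cases)  # sum cases per country-year pair

dim(who_totals)
```

The result should have the same shape as the gather() version, one row per country-year pair with a total in n, so it can be piped into the same ggplot() call.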

TIDYR_Examples

Joe Foley

October 7, 2017
