R with the tidyr package is impressive. Below is the messy World Health Organization data set (who) in its raw form. Passing it through gather() reshapes it in a single line:
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(ggplot2)
Here is what the who data looks like before:
who
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Afghanistan AF AFG 1980 NA NA NA
## 2 Afghanistan AF AFG 1981 NA NA NA
## 3 Afghanistan AF AFG 1982 NA NA NA
## 4 Afghanistan AF AFG 1983 NA NA NA
## 5 Afghanistan AF AFG 1984 NA NA NA
## 6 Afghanistan AF AFG 1985 NA NA NA
## 7 Afghanistan AF AFG 1986 NA NA NA
## 8 Afghanistan AF AFG 1987 NA NA NA
## 9 Afghanistan AF AFG 1988 NA NA NA
## 10 Afghanistan AF AFG 1989 NA NA NA
## # ... with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
## # new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## # new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## # new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## # new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## # new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## # new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## # new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## # new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## # new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
## # new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## # new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## # new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
## # new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
## # newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
## # newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
## # newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
## # newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
## # newrel_f65 <int>
who is 7,240 rows by 60 columns. A mess.
dim(who)
## [1] 7240 60
And here it is after tidyr's gather() reshapes it, with dplyr's select() and count() doing the rest. That is a lot of cleaning for very little code.
who %>%
  select(-iso2, -iso3) %>%
  gather(group, cases, -country, -year) %>%
  count(country, year, wt = cases)
## # A tibble: 7,240 x 3
## country year n
## <chr> <int> <int>
## 1 Afghanistan 1980 0
## 2 Afghanistan 1981 0
## 3 Afghanistan 1982 0
## 4 Afghanistan 1983 0
## 5 Afghanistan 1984 0
## 6 Afghanistan 1985 0
## 7 Afghanistan 1986 0
## 8 Afghanistan 1987 0
## 9 Afghanistan 1988 0
## 10 Afghanistan 1989 0
## # ... with 7,230 more rows
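To see what gather() and count() each contribute, here is a minimal sketch on a tiny made-up table. The data frame `toy`, its columns `m014` and `f014`, and the result names `long` and `totals` are hypothetical and exist only for this illustration; the calls themselves mirror the pipeline above.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: one row per country-year, one column per group
toy <- tibble(
  country = c("A", "A", "B"),
  year    = c(2000L, 2001L, 2000L),
  m014    = c(1L, 2L, 3L),
  f014    = c(4L, 5L, 6L)
)

# gather() stacks every column except country and year into
# a key column (group) and a value column (cases)
long <- toy %>%
  gather(group, cases, -country, -year)

# count() with wt = cases sums the cases within each country-year pair
totals <- long %>%
  count(country, year, wt = cases)

print(long)
print(totals)
```

The wide table has 3 rows and 2 group columns, so the gathered version has 6 rows: one per country, year, and group. count() then collapses those back to one row per country-year, with `n` holding the summed cases.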
Taking it a step further:
who %>%
  select(-iso2, -iso3) %>%
  gather(group, cases, -country, -year) %>%
  count(country, year, wt = cases) %>%  # builds a frequency table; wt = cases sums the cases for each country-year combination
  ggplot(aes(x = year, y = n, group = country)) +  # ggplot() comes from ggplot2, the graphing package
  geom_line(size = 0.2)  # this could be one of dozens of layers: geom_bar, geom_histogram, a heatmap...
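In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The following is a sketch of the same reshaping with the newer function, assuming a tidyr version that provides it; the result name `totals` is only for illustration.

```r
library(tidyr)
library(dplyr)

# Same pipeline as above, with pivot_longer() in place of gather().
# Requires tidyr >= 1.0.0 (an assumption about the installed version).
totals <- who %>%
  select(-iso2, -iso3) %>%
  pivot_longer(
    cols      = -c(country, year),  # reshape every column except these two
    names_to  = "group",            # former column names go here
    values_to = "cases"             # former cell values go here
  ) %>%
  count(country, year, wt = cases)

print(totals)
```

The named arguments `names_to` and `values_to` play the role of gather()'s positional key and value arguments, which makes the intent a little easier to read.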