1. The World Bank publishes data sets that track the population of countries and predefined regions over a period of 55 years. The regional data is classified into socio-economic categories, so comparing these series may tell us something about the trends underlying those categories.

http://data.worldbank.org/indicator/SP.POP.TOTL

The World Bank defines total population as follows:

Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship–except for refugees not permanently settled in the country of asylum, who are generally considered part of the population of their country of origin. The values shown are midyear estimates.

This rudimentary analysis looks at the 26 regional data sets contained in the World Bank’s superset. It presents a histogram of the regional population counts and scatterplots of each region’s population trend, faceted by region.

  1. Load Data Frame from file
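# packages used below: dplyr (%>%, select, filter), tidyr (gather),
# knitr (kable), ggplot2 (plots)
library(dplyr)
library(tidyr)
library(knitr)
library(ggplot2)
# read the wide World Bank file; empty strings are treated as NA
df <- read.csv(
              "/Users/scottkarr/IS607Spring2016/project2/more/population.csv",
              sep=",",
              na.strings = "",
              blank.lines.skip = TRUE,
              stringsAsFactors=FALSE
    )
# read.csv prefixes the year columns with "X" (X1960, ...); strip only that leading X
names(df) <- gsub("^X", "", names(df))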
#kable(head(df), align = 'l')
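The `X` prefix removed above comes from `read.csv`, which by default makes column names syntactically valid. A minimal sketch of an alternative, assuming the same wide layout, is to read the file with `check.names = FALSE` so the year columns keep their original names. The inline CSV is made-up sample data (hypothetical countries and values), not World Bank figures.

```r
# Hypothetical two-row sample in the same wide layout (illustrative values only)
csv_text <- "Country Name,Country Code,1960,1961
Aland,ALD,100,110
Borduria,BRD,200,220"

# Default behaviour: names are made syntactically valid, so years gain an X prefix
with_prefix <- read.csv(text = csv_text, stringsAsFactors = FALSE)
names(with_prefix)

# check.names = FALSE keeps the header exactly as written in the file
as_written <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)
names(as_written)
```

Note the trade-off: with `check.names = FALSE` the identifier columns also keep their spaces (`Country Name` rather than `Country.Name`), so later steps would have to refer to them with backticks.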
  2. Tidy data
# filter columns
df <- df %>% 
  select(Country.Name, Country.Code, 5:ncol(df))
# gather morphs data from wide to long format
df_tidy <- df %>% 
  gather(Year, Population, -Country.Name, -Country.Code) %>%
  arrange(Country.Name, Year, Population)
# make year and population numerics
df_tidy$Year <- as.numeric(df_tidy$Year)
df_tidy$Population <- as.numeric(df_tidy$Population)
# remove na rows
df_tidy <- df_tidy %>% na.omit()
# present data nicely
kable(head(df_tidy), align = 'l')
| Country.Name | Country.Code | Year | Population |
|:---|:---|:---|:---|
| Afghanistan | AFG | 1960 | 8994793 |
| Afghanistan | AFG | 1961 | 9164945 |
| Afghanistan | AFG | 1962 | 9343772 |
| Afghanistan | AFG | 1963 | 9531555 |
| Afghanistan | AFG | 1964 | 9728645 |
| Afghanistan | AFG | 1965 | 9935358 |
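`gather()` still works, but tidyr now recommends `pivot_longer()` for this kind of wide-to-long reshaping. The sketch below shows the same step on a small made-up data frame, assuming tidyr 1.1.0 or later (needed for `names_transform`). The country names, codes and values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide sample (illustrative values, not World Bank data)
wide <- data.frame(
  Country.Name = c("Aland", "Borduria"),
  Country.Code = c("ALD", "BRD"),
  `1960` = c(100, NA),
  `1961` = c(110, 220),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

long <- wide %>%
  pivot_longer(
    cols = -c(Country.Name, Country.Code),      # every year column
    names_to = "Year",
    values_to = "Population",
    names_transform = list(Year = as.integer),  # convert year labels to integers
    values_drop_na = TRUE                       # drop missing observations
  ) %>%
  arrange(Country.Name, Year)

long
```

This folds the type conversion and the removal of missing rows into the reshaping call. Unlike `na.omit()`, it only drops rows where the population value itself is missing.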
  3. Analyze data - let’s look only at regional population trends
# filter only on regional population data sets
# keywords that appear only in the names of regional / aggregate series
region_pattern <- " all|states|members|poor|conflict|:|income|only|small"
df_regions <- df_tidy %>%
  select(Country.Name, Country.Code) %>%
  distinct() %>%
  filter(grepl(region_pattern, Country.Name))
# display key codes for regions
kable(df_regions, align = 'l')
| Country.Name | Country.Code |
|:---|:---|
| Caribbean small states | CSS |
| East Asia & Pacific (all income levels) | EAS |
| East Asia & Pacific (developing only) | EAP |
| Europe & Central Asia (all income levels) | ECS |
| Europe & Central Asia (developing only) | ECA |
| Fragile and conflict affected situations | FCS |
| Heavily indebted poor countries (HIPC) | HPC |
| High income | HIC |
| High income: nonOECD | NOC |
| High income: OECD | OEC |
| Latin America & Caribbean (all income levels) | LCN |
| Latin America & Caribbean (developing only) | LAC |
| Least developed countries: UN classification | LDC |
| Low & middle income | LMY |
| Low income | LIC |
| Lower middle income | LMC |
| Middle East & North Africa (all income levels) | MEA |
| Middle East & North Africa (developing only) | MNA |
| Middle income | MIC |
| OECD members | OED |
| Other small states | OSS |
| Pacific island small states | PSS |
| Small states | SST |
| Sub-Saharan Africa (all income levels) | SSF |
| Sub-Saharan Africa (developing only) | SSA |
| Upper middle income | UMC |
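The keyword filter relies on `grepl()` being case-sensitive by default: the lowercase keyword `states` matches aggregates such as "Small states" but not a country name such as "United States". The sketch below illustrates this on three names. The `pattern` string simply joins the keywords used above with `|`.

```r
# One country name and two aggregate names
nm <- c("United States", "Small states", "High income: OECD")

# The same keywords as in the filter, combined into one regular expression
pattern <- " all|states|members|poor|conflict|:|income|only|small"

# Default, case-sensitive matching keeps the country out
case_sensitive <- grepl(pattern, nm)

# Ignoring case would wrongly classify the country as a region
case_insensitive <- grepl(pattern, nm, ignore.case = TRUE)

data.frame(nm, case_sensitive, case_insensitive)
```

Because the selection depends on naming conventions, it is worth checking the resulting list of codes by eye, as the table above does.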
# filter the presentation data set by these "target" regions
target <- df_regions$Country.Code
df_regions <- df_tidy %>% filter(Country.Code %in% target)
  4. Presentation
# histogram of population values across all regions and years
ggplot(df_regions) + geom_histogram(aes(x = Population))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
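The histogram above shows how the population values are distributed, not how the regions rank against each other. If the goal is to compare regions in order of decreasing population, a sorted bar chart of the most recent year may be closer to that. The following is a sketch on made-up data, not the original analysis: the region codes and values are hypothetical, and `sample_regions` stands in for `df_regions`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical long-format sample (illustrative values only)
sample_regions <- data.frame(
  Country.Code = rep(c("AAA", "BBB", "CCC"), each = 2),
  Year = rep(c(2013, 2014), times = 3),
  Population = c(50, 55, 300, 320, 120, 125)
)

# Keep the most recent year available for each region
latest <- sample_regions %>%
  group_by(Country.Code) %>%
  filter(Year == max(Year)) %>%
  ungroup()

# reorder() on the negated value sorts the bars from largest to smallest
p_bar <- ggplot(latest, aes(x = reorder(Country.Code, -Population), y = Population)) +
  geom_col() +
  labs(x = "Region code", y = "Population")
p_bar
```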

# regional scatterplot of population by regions
ggplot(data = df_regions, aes(x = Year, y = Population)) +
  geom_point() + facet_wrap( ~ Country.Code )
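With a shared y-axis, regions with small populations can look flat next to the largest aggregates. One option, sketched below on made-up data, is to let each facet have its own y-scale through the `scales = "free_y"` argument of `facet_wrap()`. The region codes and values are hypothetical.

```r
library(ggplot2)

# Hypothetical sample: two regions that differ greatly in magnitude
sample_regions <- data.frame(
  Country.Code = rep(c("AAA", "BBB"), each = 3),
  Year = rep(2012:2014, times = 2),
  Population = c(50, 52, 55, 3000, 3100, 3200)
)

# Each panel gets its own y-axis, so the trend in the small region stays visible
p_trend <- ggplot(sample_regions, aes(x = Year, y = Population)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ Country.Code, scales = "free_y")
p_trend
```

The trade-off is that panel heights are no longer comparable across regions, so free scales suit questions about the shape of each trend better than questions about relative size.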