http://data.worldbank.org/indicator/SP.POP.TOTL
The World Bank defines total population as follows:
Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship–except for refugees not permanently settled in the country of asylum, who are generally considered part of the population of their country of origin. The values shown are midyear estimates.
This rudimentary analysis looks at the 26 regional data sets contained in the World Bank’s super-set. It presents a histogram of the regional population counts and scatterplots of each region’s population trend, faceted by region.
# load the packages used below
library(dplyr)
library(tidyr)
library(knitr)
library(ggplot2)
# read the World Bank population extract
l <- read.csv(
  "/Users/scottkarr/IS607Spring2016/project2/more/population.csv",
  sep = ",",
  na.strings = "",
  blank.lines.skip = TRUE,
  stringsAsFactors = FALSE
)
df <- data.frame(l)
# read.csv() prefixes the year columns with "X"; strip only that leading character
names(df) <- gsub("^X", "", names(df))
#kable(head(df), align = 'l')
# keep the name and code columns plus the yearly columns (column 5 onward)
df <- df %>%
  select(Country.Name, Country.Code, 5:ncol(df))
# gather reshapes the data from wide to long format
df_tidy <- df %>%
  gather(Year, Population, -Country.Name, -Country.Code) %>%
  arrange(Country.Name, Year, Population)
# make year and population numerics
df_tidy$Year <- as.numeric(df_tidy$Year)
df_tidy$Population <- as.numeric(df_tidy$Population)
# remove na rows
df_tidy <- df_tidy %>% na.omit()
# present data nicely
kable(head(df_tidy), align = 'l')
Country.Name | Country.Code | Year | Population |
---|---|---|---|
Afghanistan | AFG | 1960 | 8994793 |
Afghanistan | AFG | 1961 | 9164945 |
Afghanistan | AFG | 1962 | 9343772 |
Afghanistan | AFG | 1963 | 9531555 |
Afghanistan | AFG | 1964 | 9728645 |
Afghanistan | AFG | 1965 | 9935358 |
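The wide-to-long reshaping above can be tried on a toy data frame. The sketch below uses made-up figures for two hypothetical entities (`Aland`, `Bland`) and assumes tidyr 1.0.0 or later, where `pivot_longer()` is the newer counterpart to `gather()`. It is an illustration only, not the code used for this analysis.

```r
library(tidyr)

# toy wide table: one row per entity, one column per year (values are made up)
wide <- data.frame(
  Country.Name = c("Aland", "Bland"),
  Country.Code = c("ALD", "BLD"),
  X1960 = c(100, 200),
  X1961 = c(110, 190),
  stringsAsFactors = FALSE
)

# strip only the leading "X" that read.csv() would add to year columns
names(wide) <- sub("^X", "", names(wide))

# one row per entity-year; pivot_longer() plays the role of gather()
long <- pivot_longer(
  wide,
  cols = -c(Country.Name, Country.Code),
  names_to = "Year",
  values_to = "Population"
)

# the year labels arrive as text, so convert them before plotting
long$Year <- as.numeric(long$Year)
```

Each entity contributes one row per year column, so the two-row, two-year toy table becomes four rows.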
# keep only the regional and income-group aggregates
df_regions <- df_tidy %>%
  select(Country.Name, Country.Code) %>%
  distinct() %>%
  filter(
    grepl(' all', Country.Name) |
    grepl('states', Country.Name) |
    grepl('members', Country.Name) |
    grepl('poor', Country.Name) |
    grepl('conflict', Country.Name) |
    grepl(':', Country.Name) |
    grepl('income', Country.Name) |
    grepl('only', Country.Name) |
    grepl('small', Country.Name)
  )
# display key codes for regions
kable(df_regions, align = 'l')
Country.Name | Country.Code |
---|---|
Caribbean small states | CSS |
East Asia & Pacific (all income levels) | EAS |
East Asia & Pacific (developing only) | EAP |
Europe & Central Asia (all income levels) | ECS |
Europe & Central Asia (developing only) | ECA |
Fragile and conflict affected situations | FCS |
Heavily indebted poor countries (HIPC) | HPC |
High income | HIC |
High income: nonOECD | NOC |
High income: OECD | OEC |
Latin America & Caribbean (all income levels) | LCN |
Latin America & Caribbean (developing only) | LAC |
Least developed countries: UN classification | LDC |
Low & middle income | LMY |
Low income | LIC |
Lower middle income | LMC |
Middle East & North Africa (all income levels) | MEA |
Middle East & North Africa (developing only) | MNA |
Middle income | MIC |
OECD members | OED |
Other small states | OSS |
Pacific island small states | PSS |
Small states | SST |
Sub-Saharan Africa (all income levels) | SSF |
Sub-Saharan Africa (developing only) | SSA |
Upper middle income | UMC |
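The chain of `grepl()` calls above can also be collapsed into a single regular expression. The sketch below is one way to do that, using a few names from the table plus two ordinary countries; the helper name `is_aggregate` is made up for illustration. Substring matching of this kind depends on the exact labels in the file, so the result is worth checking by eye, as the table above does.

```r
# substrings that mark an aggregate row, taken from the filter above
patterns <- c(" all", "states", "members", "poor", "conflict",
              ":", "income", "only", "small")

# hypothetical helper: TRUE when a name contains any of the substrings
is_aggregate <- function(x) grepl(paste(patterns, collapse = "|"), x)

sample_names <- c("Afghanistan", "High income: OECD", "OECD members",
                  "Small states", "Brazil")

# keep only the names flagged as aggregates
aggregates <- sample_names[is_aggregate(sample_names)]
```

The match is case-sensitive, so a name such as "United States" is not caught by the lowercase `states` pattern.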
# filter the presentation data set by these "target" regions
target <- df_regions$Country.Code
df_regions <- df_tidy %>% filter(Country.Code %in% target)
# histogram of population values across all regions and years
ggplot(df_regions) + geom_histogram(aes(x = Population))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
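The `stat_bin()` message above appears because no bin count or bin width was given. Since the regional totals run from small-state groupings to groupings of billions of people, one option is to bin on a log scale. This is a minimal sketch with made-up values, assuming ggplot2 is installed; the choice of 20 bins is arbitrary.

```r
library(ggplot2)

# made-up population values spanning several orders of magnitude
toy <- data.frame(Population = c(2e6, 5e6, 8e7, 3e8, 9e8, 2e9, 6e9))

# an explicit bin count avoids the stat_bin() message;
# the log10 x axis spreads out values that would otherwise pile up near zero
p <- ggplot(toy, aes(x = Population)) +
  geom_histogram(bins = 20) +
  scale_x_log10()
```

Printing `p` draws the plot.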
# scatterplot of population by year, one panel per region
ggplot(data = df_regions, aes(x = Year, y = Population)) +
  geom_point() +
  facet_wrap(~ Country.Code)
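With a shared y axis, the trend in the smaller regions can be hard to see next to the largest aggregates. One possible variation, sketched here with made-up data for two hypothetical region codes (`AAA`, `BBB`), lets each panel pick its own y range through the `scales` argument of `facet_wrap()`.

```r
library(ggplot2)

# made-up trends for two hypothetical regions of very different size
toy <- data.frame(
  Country.Code = rep(c("AAA", "BBB"), each = 3),
  Year = rep(c(1960, 1961, 1962), times = 2),
  Population = c(1e6, 1.1e6, 1.2e6, 1e9, 1.05e9, 1.1e9)
)

# scales = "free_y" gives every panel its own y axis
p <- ggplot(toy, aes(x = Year, y = Population)) +
  geom_point() +
  facet_wrap(~ Country.Code, scales = "free_y")
```

The trade-off is that panel heights are no longer comparable across regions, so this suits reading trends rather than comparing sizes.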