#load readr package for reading csvs
library(readr)
#load dplyr package for data transformations
library(dplyr)
#load tidyr package to make data long form or wide form
library(tidyr)
#load rmarkdown package for paged tables
library(rmarkdown)
#load ggplot2 package for plots
library(ggplot2)

#Import data
population <- read_csv("Population-EstimatesData.csv")

#Gather years into one year variable to make data tidy.
population <- 
  population %>%
  gather(`1960`:`2050`, key="year",
         value="value")

#Interested in data that describes population changes over time.
#Create subset of data for birth rate, death rate, total population, life expectancy, rural population, and urban population indicators
population_change <- subset(population, `Indicator Code` %in% c("SP.DYN.CBRT.IN", "SP.DYN.CDRT.IN", "SP.POP.TOTL", "SP.DYN.LE00.IN", "SP.RUR.TOTL", "SP.URB.TOTL"))

#Delete irrelevant columns: Indicator Code and ...96
population_change <- population_change[-c(4,5)]

#make population change wide format with Indicators as variable columns.
population_change <- spread(population_change, "Indicator Name", value)

#Rename variables for ease of referencing using dplyr::rename()
population_change <- population_change %>%
  rename(country="Country Name",
         country_code="Country Code",
         birth_rate="Birth rate, crude (per 1,000 people)",
         death_rate="Death rate, crude (per 1,000 people)",
         life_expectancy="Life expectancy at birth, total (years)",
         population="Population, total",
         rural_pop="Rural population",
         urban_pop="Urban population")

#Use dplyr::lag() to create population lag variable for calculation of population growth variable
population_change <- 
  population_change %>%
  group_by(country) %>%
  mutate(lag_population=lag(population, n=1, default=NA))

#Compute and add population growth, population index (birth rate - death rate),
#and urban percent ((urban_pop/(urban_pop+rural_pop))*100) variables to data frame
population_change <- 
  population_change %>%
  mutate(percent_pop_growth=((population-lag_population)/lag_population)*100,
         population_index=birth_rate-death_rate,
         percent_urban=(urban_pop/(urban_pop+rural_pop))*100)

#attach data frame for easy variable referencing
attach(population_change)

For the Week 3 Challenge, I am exploring the World Bank’s Population Estimates and Projections data set. This data set presents population and other demographic estimates and projections from 1960 to 2050 and covers more than 200 economies. I am specifically interested in how populations have grown and are projected to grow over time, including overall growth rates, population indexes, and percentages of people living in urban areas. I hope to gain insight into any relationships between these three variables. I suspect that nations with high growth rates will be more likely to have higher population indexes and more people living in urban areas. Additionally, I will explore the relationships between these variables and life expectancy to gain insight into which national characteristics are associated with better life expectancy and, therefore, better health.
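As a small worked example of the three derived variables, the sketch below recomputes them on a made-up data frame. The countries and numbers in toy are invented for illustration and are not World Bank values; the formulas follow the setup code above. The explicit arrange() by year is an added safeguard so that lag() always refers to the previous year.

```r
library(dplyr)

# Hypothetical toy data: two made-up countries, three years each
toy <- tibble(
  country    = rep(c("A", "B"), each = 3),
  year       = rep(c("2000", "2001", "2002"), times = 2),
  population = c(100, 110, 121, 200, 190, 209),
  birth_rate = c(30, 29, 28, 12, 11, 10),
  death_rate = c(10, 10, 9, 9, 9, 9),
  urban_pop  = c(40, 46.2, 54.45, 150, 142.5, 167.2),
  rural_pop  = c(60, 63.8, 66.55, 50, 47.5, 41.8)
)

toy_change <- toy %>%
  group_by(country) %>%
  # sort within each country so lag() picks up the previous year
  arrange(year, .by_group = TRUE) %>%
  mutate(
    lag_population     = lag(population),
    percent_pop_growth = (population - lag_population) / lag_population * 100,
    population_index   = birth_rate - death_rate,
    percent_urban      = urban_pop / (urban_pop + rural_pop) * 100
  ) %>%
  ungroup()

print(toy_change)
```

The first year of each country has no previous year, so its growth rate is NA. This is why the later averages are computed with na.rm=TRUE.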

To start this exploration, I computed the averages of the aforementioned variables by country:

#Use dplyr::group_by() and dplyr::summarise() to show averages by country.
growth_index_urban <- data.frame(population_change %>%
  group_by(country) %>%
  summarise(avg_pop_growth_percent=sprintf("%0.2f",mean(percent_pop_growth, na.rm=TRUE)),
            avg_pop_index=sprintf("%0.2f",mean(population_index, na.rm=TRUE)),
            avg_urban_pop_percent=sprintf("%0.2f",mean(percent_urban, na.rm=TRUE)),
            avg_life_expectancy=sprintf("%0.2f", mean(life_expectancy, na.rm=TRUE))))
paged_table(growth_index_urban, options=list(rows.print=25, cols.min.print=5))
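Because sprintf() returns character strings, the averages in the table above are stored as text. If the averages are needed for sorting or further calculation, one possible alternative is to keep them numeric and round them instead. The helper name country_averages and the toy data are hypothetical; this is a sketch, not a replacement for the table above.

```r
library(dplyr)

# Hypothetical helper: country averages kept as numbers rounded to two decimals
country_averages <- function(df) {
  df %>%
    group_by(country) %>%
    summarise(across(c(percent_pop_growth, population_index,
                       percent_urban, life_expectancy),
                     ~ round(mean(.x, na.rm = TRUE), 2),
                     .names = "avg_{.col}"),
              .groups = "drop")
}

# Made-up values, only to show that the helper runs
toy_avg <- tibble(
  country            = c("A", "A", "B", "B"),
  percent_pop_growth = c(NA, 2, NA, 1),
  population_index   = c(20, 18, 3, 1),
  percent_urban      = c(40, 42, 75, 77),
  life_expectancy    = c(70, 72, 50, NA)
)

averages <- country_averages(toy_avg)
print(averages)
```

With the real data, the call would be country_averages(population_change), and the result could then be ordered with arrange(), for example by avg_life_expectancy.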

The table of country averages is difficult to interpret and not suitable for drawing conclusions because there is simply too much information. To gain insight into the relationships between these variables, it is best to compute correlations and to visualize the relationships. The correlations of interest are:

Population Growth-Population Index
Population index is calculated by subtracting death rate from birth rate. It makes sense that population growth has a fairly strong positive correlation with the amount by which birth rate exceeds death rate.

Population Growth-Urban Population
It is unexpected that there is a slight negative correlation between these two variables. One would think that, as populations grow more quickly, more people would shift towards city environments. Perhaps, as populations grow more quickly, people are forced to spread outside of cities.

Population Growth-Life Expectancy
This negative correlation is also unexpected. However, it starts to make more sense with further thought. Life expectancy is associated with quality of health care and quality of life. More rapid population growth, on the other hand, may be associated with rapidly developing countries, which tend to be less affluent. It is quite possible that quality of life and quality of health care are confounding variables that explain the negative correlation.

Population Index-Urban Population
This negative correlation makes sense after considering the relationship between growth and urban population. Population index and population growth are closely related and have a fairly strong positive correlation. It therefore follows that population index should be negatively correlated with urban population, potentially for the same reason: people may need to move out of cities to accommodate a growing population.

Population Index-Life Expectancy
This was one of the most surprising correlations initially, but after considering the relationship between population growth and life expectancy, it makes sense. Perhaps similar factors, such as health care and quality of life, explain this strong negative correlation.

Urban Population-Life Expectancy
This is one of the more interesting correlations as there are many potential explanations. Perhaps nations with mostly urban populations are more affluent and, therefore, have better health care and quality of life. It would be interesting to examine the relationship between GDP, the percent of populations living in urban environments, and life expectancy.
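The code that produced the correlations is not shown in this write-up, so the following is only a sketch of how the six pairwise correlations discussed above could be computed. It assumes Pearson correlations with missing values dropped pair by pair; the helper name correlation_table and the toy data are hypothetical.

```r
library(dplyr)

# Hypothetical helper: pairwise Pearson correlations among the four variables
correlation_table <- function(df) {
  df %>%
    ungroup() %>%  # drop any leftover grouping, e.g. by country
    select(percent_pop_growth, population_index,
           percent_urban, life_expectancy) %>%
    cor(use = "pairwise.complete.obs") %>%  # ignore NAs pair by pair
    round(2)
}

# Made-up values, only to show that the helper runs; not World Bank data
toy_cor <- tibble(
  country            = c("A", "A", "A", "A"),
  percent_pop_growth = c(NA, 1, 2, 3),
  population_index   = c(0, 10, 20, 30),
  percent_urban      = c(10, 30, 20, 40),
  life_expectancy    = c(80, 70, 60, 50)
)

cor_matrix <- correlation_table(toy_cor)
print(cor_matrix)
```

With the real data, the call would be correlation_table(population_change). The result is a symmetric 4 x 4 matrix, so each of the six pairs appears twice, once above and once below the diagonal.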

Data Visualization

Because the relationship between population index and life expectancy was unexpected and interesting, I’ve plotted it. Perhaps a visualization will provide further insight into this relationship.

ggplot(data=population_change) +
  geom_smooth(mapping=aes(x=life_expectancy, y=population_index), color="blue") +
  ggtitle("Relationship Between Life Expectancy and Population Index") +
  xlab("Life Expectancy") +
  ylab("Population Index")

This visualization reveals two things. First, the relationship between the two variables appears to be nonlinear. Second, population index and life expectancy appear to be positively related up to a certain point and negatively related beyond it. This is consistent with the earlier hypotheses that nations with high population indexes are more likely to be developing and less affluent, and that nations with a higher life expectancy are more likely to be affluent. The plot suggests that both affluent and less affluent nations can support life expectancy up to a certain age (apparently around the mid-50s). Beyond that age, it is possible that more affluent nations, which are characterized by lower population indexes, are able to support a higher life expectancy, while less affluent nations with higher population indexes lack the resources and health care to do so. Comparing economic metrics, such as GDP, to life expectancy and population index could help in fully understanding this relationship.
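One rough way to check the change in direction seen in the plot is to compute the correlation separately on each side of a life expectancy threshold. The sketch below does this with a hypothetical helper, split_correlation. The default threshold of 55 is an assumption read off the plot, not an estimated breakpoint, and the toy data are made up.

```r
library(dplyr)

# Hypothetical helper: correlation between life expectancy and population
# index below and above a chosen life expectancy threshold.
# threshold = 55 is an assumption taken from the plot, not a fitted value.
split_correlation <- function(df, threshold = 55) {
  df %>%
    ungroup() %>%
    filter(!is.na(life_expectancy), !is.na(population_index)) %>%
    mutate(segment = if_else(life_expectancy < threshold, "below", "above")) %>%
    group_by(segment) %>%
    summarise(n = n(),
              correlation = cor(life_expectancy, population_index),
              .groups = "drop")
}

# Made-up values shaped like the curve described above; not World Bank data
toy_le <- tibble(
  life_expectancy  = c(40, 45, 50, 60, 70, 80),
  population_index = c(10, 15, 20, 20, 12, 5)
)

segments <- split_correlation(toy_le)
print(segments)
```

With the real data, the call would be split_correlation(population_change). A positive correlation in the "below" segment and a negative one in the "above" segment would be consistent with the shape of the smoothed curve, although the result depends on where the threshold is placed.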