What is the statistical relationship between a country’s birth rate and its population size?

In this final example, we’ll use the tools you’ve learned to model the relationship between a country’s birth rate and its population size. Perhaps we have a hypothesis that birth rates are positively correlated with population size (after we’ve read the pertinent literature on this topic, of course).

You already have a data set of birth rates.

total_fertility_long
## # A tibble: 40,296 x 3
##    country             year  children_per_woman
##    <chr>               <chr>              <dbl>
##  1 Afghanistan         1800                7   
##  2 Albania             1800                4.6 
##  3 Algeria             1800                6.99
##  4 Angola              1800                6.93
##  5 Antigua and Barbuda 1800                5   
##  6 Argentina           1800                6.8 
##  7 Armenia             1800                7.8 
##  8 Australia           1800                6.5 
##  9 Austria             1800                5.1 
## 10 Azerbaijan          1800                8.1 
## # ... with 40,286 more rows

Now we need a data set of population sizes. For this exercise, we’ll use the gapminder package in R to get population data.

install.packages("gapminder") #only needs to be run once
library(gapminder)
data(gapminder_unfiltered)

You should see the gapminder_unfiltered data set in your Environment (upper right window). Take a look at it.

head(gapminder_unfiltered) #head() shows the first six rows of a data set.
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

We’re interested in testing the relationship between population size and birth rate, so we need to combine those two variables in a single data set. The code below will do that using merge(), which by default matches rows on the columns the two data sets share (here, country and year).

pop_br <- gapminder_unfiltered %>% merge(total_fertility_long)

Let’s check the result.

head(pop_br)
##       country year continent lifeExp      pop gdpPercap children_per_woman
## 1 Afghanistan 1952      Asia  28.801  8425333  779.4453               7.55
## 2 Afghanistan 1957      Asia  30.332  9240934  820.8530               7.49
## 3 Afghanistan 1962      Asia  31.997 10267083  853.1007               7.45
## 4 Afghanistan 1967      Asia  34.020 11537966  836.1971               7.45
## 5 Afghanistan 1972      Asia  36.088 13079460  739.9811               7.45
## 6 Afghanistan 1977      Asia  38.438 14880372  786.1134               7.45

Do you see the new column with children_per_woman? Also note that this data set is quite a bit smaller than the total_fertility_long data set. Why do you think that is?
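One way to explore that question is with a toy example. The sketch below uses two small made-up data frames (pop_toy and fert_toy are hypothetical names, and the values are invented, not real Gapminder data) to show how merge() behaves when it is called without extra arguments.

```r
# Toy data, invented for illustration only
pop_toy <- data.frame(
  country = c("A", "A", "B"),
  year    = c(1950, 1955, 1950),
  pop     = c(100, 120, 50)
)

fert_toy <- data.frame(
  country            = c("A", "A", "A", "B", "C"),
  year               = c(1800, 1950, 1955, 1950, 1950),
  children_per_woman = c(7, 6, 5.5, 4, 3)
)

# With no `by` argument, merge() joins on all shared column names
# (country and year) and keeps only rows that appear in both tables.
merged_toy <- merge(pop_toy, fert_toy)

nrow(fert_toy)   # rows in the fertility table
nrow(merged_toy) # rows that survived the merge
merged_toy
```

Compare the two row counts, then look at which country and year combinations are missing from merged_toy. The same comparison with nrow(total_fertility_long) and nrow(pop_br) should point you toward the answer for the real data.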

There are also some other variables in our data set. They’re not a problem and we can simply ignore them for this analysis.

Let’s plot the relationship with children_per_woman on the x-axis and population on the y-axis. First we need to pick a year to analyze. The data set goes from 1950 to 2007. Let’s start with 1950.

pop_br1950 <- pop_br %>% filter(year == 1950) #filter the data set to only include 1950

ggplot(pop_br1950, aes(x = children_per_woman, y = pop)) +
  geom_point() +
  geom_smooth(method = "lm")

What would you conclude from this? Our hypothesis was that birth rates are positively correlated with population size. What do you think?

We can stare at the plot all day long, but we really need some numbers to assess our hypothesis. Let’s use a linear regression, like you learned before.

pop_br_model <- lm(pop ~ children_per_woman, data = pop_br1950)

summary(pop_br_model)
## 
## Call:
## lm(formula = pop ~ children_per_woman, data = pop_br1950)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -28074448 -19140479 -13278673   8911898 128555038 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        35937962   13384070   2.685    0.011 *
## children_per_woman -4047020    3467635  -1.167    0.251  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32700000 on 35 degrees of freedom
## Multiple R-squared:  0.03746,    Adjusted R-squared:  0.009958 
## F-statistic: 1.362 on 1 and 35 DF,  p-value: 0.2511

How would you interpret this? It shows a slope of -4047020. Here’s what that means: “On average, each additional child per woman was associated with a population about 4 million people smaller.” The standard error is huge though (~3.5 million), nearly as large as the estimate itself. We would interpret that like this: “There is large uncertainty in this estimate, however. The standard error is ~3.5 million, so one standard error on either side of the estimate spans a difference of roughly 0.6 to 7.5 million fewer people, and the p-value of 0.251 means a slope of zero cannot be ruled out.” These numbers indicate little support for the hypothesis that birth rate is positively correlated with population size.
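The uncertainty in the slope can also be expressed as a 95% confidence interval. The sketch below recomputes one by hand from the estimate, standard error, and residual degrees of freedom printed in the summary above; the variable names (estimate, std_error, df_resid) are just for illustration. If pop_br_model is still in your session, confint(pop_br_model) should report the same interval.

```r
# Values copied from the summary() output above
estimate  <- -4047020  # slope for children_per_woman
std_error <- 3467635   # standard error of that slope
df_resid  <- 35        # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

The interval runs from roughly -11 million to +3 million people per additional child per woman. Because it includes zero, it tells the same story as the p-value of 0.251: these data do not pin down the direction of the relationship.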

Can you repeat this analysis, but with 2007 data? Does anything change?