In this final example, we’ll use the tools you’ve learned to model the relationship between a country’s birth rate and its population size. Perhaps we have a hypothesis that birth rates are positively correlated with population size (after we’ve read the pertinent literature on this topic, of course).
You already have a data set of birth rates.
total_fertility_long
## # A tibble: 40,296 x 3
## country year children_per_woman
## <chr> <chr> <dbl>
## 1 Afghanistan 1800 7
## 2 Albania 1800 4.6
## 3 Algeria 1800 6.99
## 4 Angola 1800 6.93
## 5 Antigua and Barbuda 1800 5
## 6 Argentina 1800 6.8
## 7 Armenia 1800 7.8
## 8 Australia 1800 6.5
## 9 Austria 1800 5.1
## 10 Azerbaijan 1800 8.1
## # ... with 40,286 more rows
Now we need a data set of population sizes. For this exercise, we’ll use the gapminder package in R to get population data.
install.packages("gapminder") # only needs to be run once
library(gapminder)
data(gapminder_unfiltered) # load the data set into your Environment
You should see the gapminder_unfiltered data set in your Environment (upper right window). Take a look at it.
head(gapminder_unfiltered) #head() shows the first six rows of a data set.
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
We’re interested in testing the relationship between population size and birth rate, so we need to combine those two variables in a single data set. The code below will do that using merge().
pop_br <- gapminder_unfiltered %>% merge(total_fertility_long)
Let’s check the result.
head(pop_br)
## country year continent lifeExp pop gdpPercap children_per_woman
## 1 Afghanistan 1952 Asia 28.801 8425333 779.4453 7.55
## 2 Afghanistan 1957 Asia 30.332 9240934 820.8530 7.49
## 3 Afghanistan 1962 Asia 31.997 10267083 853.1007 7.45
## 4 Afghanistan 1967 Asia 34.020 11537966 836.1971 7.45
## 5 Afghanistan 1972 Asia 36.088 13079460 739.9811 7.45
## 6 Afghanistan 1977 Asia 38.438 14880372 786.1134 7.45
Do you see the new column with children_per_woman? Also note that this data set is quite a bit smaller than the total_fertility_long data set. Why do you think that is?
There are also some other variables in our data set. They’re not a problem and we can simply ignore them for this analysis.
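One way to explore why the merged data set is smaller is to try merge() on two tiny data frames. The sketch below uses made-up countries and values; the names population_toy and fertility_toy are hypothetical and are not part of the gapminder package. By default, merge() joins on the columns the two data frames share (here country and year) and keeps only the rows that have a match in both.

```r
# Toy data frames with made-up values (not taken from gapminder)
population_toy <- data.frame(
  country = c("A", "B", "D"),
  year    = c(1950L, 1950L, 1950L),
  pop     = c(1000L, 2000L, 3000L)
)

fertility_toy <- data.frame(
  country            = c("A", "A", "B", "C"),
  year               = c(1950L, 1960L, 1950L, 1950L),
  children_per_woman = c(6.1, 5.8, 4.2, 3.9)
)

# Default merge(): joins on the shared columns (country, year)
# and keeps only country-year pairs present in BOTH data frames
merged_toy <- merge(population_toy, fertility_toy)
merged_toy

# all.x = TRUE keeps every row of the first data frame,
# filling unmatched columns with NA
merged_keep_x <- merge(population_toy, fertility_toy, all.x = TRUE)
merged_keep_x
```

Compare the number of rows in merged_toy with the number of rows in each input, then think about which country-year combinations appear in total_fertility_long but not in gapminder_unfiltered.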
Let’s plot the relationship with children_per_woman on the x-axis and population on the y-axis. First we need to pick a year to analyze. The data set goes from 1950 to 2007. Let’s start with 1950.
pop_br1950 <- pop_br %>% filter(year == 1950) #filter the data set to only include 1950
ggplot(pop_br1950, aes(x = children_per_woman, y = pop)) +
  geom_point() +
  geom_smooth(method = "lm")
What would you conclude from this? Our hypothesis was that birth rates are positively correlated with population size. What do you think?
We can stare at the plot all day long, but we really need some numbers to assess our hypothesis. Let’s use a linear regression, like you learned before.
pop_br_model <- lm(pop ~ children_per_woman, data = pop_br1950)
summary(pop_br_model)
##
## Call:
## lm(formula = pop ~ children_per_woman, data = pop_br1950)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28074448 -19140479 -13278673 8911898 128555038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35937962 13384070 2.685 0.011 *
## children_per_woman -4047020 3467635 -1.167 0.251
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32700000 on 35 degrees of freedom
## Multiple R-squared: 0.03746, Adjusted R-squared: 0.009958
## F-statistic: 1.362 on 1 and 35 DF, p-value: 0.2511
How would you interpret this? The model estimates a slope of -4047020. Here’s what that means: “On average, each additional child per woman was associated with a population size that was ~4 million people smaller.” The standard error is huge, though (~3.5 million), nearly as large as the estimate itself. We would interpret that like this: “There is large uncertainty in this estimate, however. The standard error is ~3.5 million, so a difference of anywhere between ~0.6 and ~7.5 million people lies within one standard error of the estimate, and the p-value of 0.25 means we cannot rule out a slope of zero. These numbers indicate little support for the hypothesis that birth rate is positively correlated with population size.”
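To put a number on that uncertainty, we can turn the slope and its standard error into an approximate 95% confidence interval. The sketch below copies the three values by hand from the summary() output above; the variable names (estimate, std_error, df_resid) are just illustrative.

```r
# Values copied from the summary() output above
estimate  <- -4047020  # slope for children_per_woman
std_error <- 3467635   # standard error of that slope
df_resid  <- 35        # residual degrees of freedom

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Interval = estimate +/- critical value * standard error
lower <- estimate - t_crit * std_error
upper <- estimate + t_crit * std_error
c(lower = lower, upper = upper)

# With the fitted model in your Environment, confint(pop_br_model)
# should report the same interval for children_per_woman.
```

The interval runs from roughly -11 million to +3 million, so it includes zero. That agrees with the p-value of 0.25: the data are compatible with a negative slope, no slope at all, or a modest positive one.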
Can you repeat this analysis, but with 2007 data? Does anything change?
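If you find yourself repeating the same steps for several years, one option is to wrap them in a small function. This is only a sketch: fit_year() is a hypothetical helper, and toy_data is simulated so that the block runs on its own. To use it on the real data, you would pass pop_br instead.

```r
# Hypothetical helper: fit pop ~ children_per_woman for a single year
fit_year <- function(data, target_year) {
  one_year <- data[data$year == target_year, ]
  lm(pop ~ children_per_woman, data = one_year)
}

# Simulated stand-in for pop_br (made-up numbers, for illustration only)
set.seed(1)
toy_data <- data.frame(
  year               = rep(c(1950L, 2007L), each = 10),
  children_per_woman = runif(20, min = 1, max = 8)
)
toy_data$pop <- 1e6 * (5 + rnorm(20))

# Fit one model per year and pull out the slope from each
years  <- c(1950L, 2007L)
models <- lapply(years, function(y) fit_year(toy_data, y))
slopes <- sapply(models, function(m) coef(m)[["children_per_woman"]])
names(slopes) <- years
slopes
```

With the real data, fit_year(pop_br, 2007) followed by summary() would give the 2007 version of the regression table above.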