6) Model and describe the data

library(tidyverse)

ggplot(tot_fert_last50, aes(x=as.numeric(year), y=children_per_woman, group = year))+
  geom_point(alpha = 0.1) +
  xlab("Year")+
  ylab("Number of children per woman")+
  geom_boxplot(outlier.shape = NA)

The graph above shows a clear decline in the median global birth rate since 1950. But just how big of a decline was it? To answer that, we need to use descriptive statistics and some modeling.

What is the median birth rate in each year? The code below will calculate that using the pipes we saw earlier. It says in the dataset “tot_fert_last50” group the data by year and calculate the median number of children per woman within each year. Then show all of the results using print(n = Inf). Inf is short for infinity

tot_fert_last50 %>% 
  group_by(year) %>% 
  summarize(median_br = median(children_per_woman)) %>% 
  print(n = Inf)
## # A tibble: 69 x 2
##    year  median_br
##    <chr>     <dbl>
##  1 1950       5.98
##  2 1951       5.98
##  3 1952       5.99
##  4 1953       6.1 
##  5 1954       6.14
##  6 1955       6.16
##  7 1956       6.18
##  8 1957       6.22
##  9 1958       6.23
## 10 1959       6.23
## 11 1960       6.27
## 12 1961       6.29
## 13 1962       6.28
## 14 1963       6.22
## 15 1964       6.15
## 16 1965       6.14
## 17 1966       6.14
## 18 1967       6.06
## 19 1968       6.02
## 20 1969       5.96
## 21 1970       5.93
## 22 1971       5.82
## 23 1972       5.78
## 24 1973       5.69
## 25 1974       5.57
## 26 1975       5.46
## 27 1976       5.38
## 28 1977       5.26
## 29 1978       5.18
## 30 1979       5.06
## 31 1980       4.97
## 32 1981       4.88
## 33 1982       4.72
## 34 1983       4.62
## 35 1984       4.52
## 36 1985       4.36
## 37 1986       4.25
## 38 1987       4.14
## 39 1988       4.03
## 40 1989       3.94
## 41 1990       3.84
## 42 1991       3.71
## 43 1992       3.57
## 44 1993       3.46
## 45 1994       3.39
## 46 1995       3.29
## 47 1996       3.16
## 48 1997       3.03
## 49 1998       2.94
## 50 1999       2.9 
## 51 2000       2.84
## 52 2001       2.77
## 53 2002       2.72
## 54 2003       2.68
## 55 2004       2.64
## 56 2005       2.58
## 57 2006       2.58
## 58 2007       2.56
## 59 2008       2.53
## 60 2009       2.50
## 61 2010       2.48
## 62 2011       2.46
## 63 2012       2.41
## 64 2013       2.38
## 65 2014       2.36
## 66 2015       2.34
## 67 2016       2.32
## 68 2017       2.28
## 69 2018       2.26

We can use these numbers to describe the graph. In our paper, we might write “In 1950, women gave birth to a global median of 5.9 children. By 2000, that number dropped to a median of 2.8 children per woman, and by 2018 it was 2.3 children per woman. Overall, the global fertility rate declined by more than 50% since 1950.”

The text above shows how useful descriptive statistics can be. They put concrete numbers to visual trends. If we just said “The birth rate declined since 1950 (Figure 1)”, we wouldn’t be wrong, but we also wouldn’t be very compelling. A reader would say “Duh. I can see that on the graph. Just how much did it decline?”. Descriptive statistics answered the second question.

But we can go a bit further. Instead of describing each year by itself, we might want to know how birth rates declined in a typical year. In other words, what is the trend of the decline? For that, we need a model

Models

This is not a biostatistics course, so we’re only going to scratch the surface here. Let’s just get into it. Here’s a statistical model that tests the hypothesis “What is the average decline in birth rate per year?”

my_model <- lm(children_per_woman ~ as.numeric(year), data = tot_fert_last50)

lm() stands for linear model, otherwise known as a linear regression. This is one of the simplest models out there. The only new symbol here is the tilde (~). You can interpret it as saying ~ = is related to or as a function of…. In this case, the code above says What is the linear relationship between children per woman and year in the dataset tot_fert_last50?

The code below shows the summary of the statistical model

summary(my_model)
## 
## Call:
## lm(formula = children_per_woman ~ as.numeric(year), data = tot_fert_last50)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1116 -1.4989  0.0821  1.4776  4.6674 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.014e+02  1.591e+00   63.72   <2e-16 ***
## as.numeric(year) -4.892e-02  8.018e-04  -61.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.799 on 12694 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2267 
## F-statistic:  3722 on 1 and 12694 DF,  p-value: < 2.2e-16

This is a lot to unpack, but you’re almost done! The most important part of the model above is the term called “as.numeric(year)”. It is the slope of this model. The estimate of the slope is -0.048 with a standard error of -0.008. That means this: Every unit increase in year generates an average decrease in children per woman of -0.048 with a standard error of 0.008. That doesn’t sound like a lot, but it adds up. Multiply -0.048 by 68 years and it adds up to a drop of 3.2 children per woman between 1950 and 2018!

Let’s plot the linear relationship here. Can you see the code that adds the linear regression? What’s different about this code compared to the boxplot you created earlier?

ggplot(tot_fert_last50, aes(x=as.numeric(year), y=children_per_woman))+
  geom_point(alpha = 0.1, size = 0.1) +
  xlab("Year")+
  ylab("Number of children per woman")+
  geom_smooth(method = "lm")