IMPUTE MISSING DATA FOR TIDYTUESDAY VOYAGES OF CAPTIVE AFRICANS WITH TIDYMODELS
This week’s Tidy Tuesday datasets reflect on Juneteenth, the date when the law abolishing salvery in the United States was finally enforced throughout the American South, and specifically Texas. My own education as a white woman i nthe United States has been gravely lacking in the history of enslaved people, their experiences, and their impact and legacy: I’m glad to spend some time in this screencast understanding more about the forced transport of African people using the Salve Voyages African Names database
Let’s read in the data on African names and use skimr to see what’s there
suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
suppressMessages(library(kableExtra))
african_names <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-16/african_names.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## voyage_id = col_double(),
## name = col_character(),
## gender = col_character(),
## age = col_double(),
## height = col_double(),
## ship_name = col_character(),
## year_arrival = col_double(),
## port_disembark = col_character(),
## port_embark = col_character(),
## country_origin = col_character()
## )
skimr::skim(african_names)
| Name | african_names |
| Number of rows | 91490 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1.00 | 2 | 24 | 0 | 62330 | 0 |
| gender | 12878 | 0.86 | 3 | 5 | 0 | 4 | 0 |
| ship_name | 1 | 1.00 | 2 | 59 | 0 | 443 | 0 |
| port_disembark | 0 | 1.00 | 6 | 19 | 0 | 5 | 0 |
| port_embark | 1126 | 0.99 | 4 | 31 | 0 | 59 | 0 |
| country_origin | 79404 | 0.13 | 3 | 31 | 0 | 563 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 62122.02 | 51305.07 | 1.0 | 22935.25 | 45822.5 | 101263.8 | 199932 | ▇▆▃▁▂ |
| voyage_id | 0 | 1.00 | 17698.25 | 82016.88 | 557.0 | 2443.00 | 2871.0 | 3601.0 | 500082 | ▇▁▁▁▁ |
| age | 1126 | 0.99 | 18.89 | 8.60 | 0.5 | 11.00 | 20.0 | 26.0 | 77 | ▆▇▁▁▁ |
| height | 4820 | 0.95 | 58.61 | 6.84 | 0.0 | 54.00 | 60.0 | 64.0 | 85 | ▁▁▂▇▁ |
| year_arrival | 0 | 1.00 | 1831.40 | 9.52 | 1808.0 | 1826.00 | 1832.0 | 1837.0 | 1862 | ▂▆▇▃▁ |
There is data missing in both the gender and age variables, two I am interested in
This is a dataset of individual people who were liberated from slave ships. Where did the people in this dataset/ leave their ships?
african_names %>%
count(port_disembark, sort = TRUE) %>%
kable()
| port_disembark | n |
|---|---|
| Freetown | 81009 |
| Havana | 10058 |
| Bahamas unspecified | 183 |
| Kingston, Jamaica | 144 |
| St. Helena | 96 |
Most of the freed captives in this database were liberated in either Freetown. Sierra Leone (so on the eastern side of the Atlantic) or Havana, Cuba (on the western side). Both cities had tribunals/courts to judge ships seized by anti-slaving patrols after European countries outlawed or restricted slavery
Where did these people start their forced journeys?
african_names %>%
add_count(port_embark) %>%
mutate(port_embark = case_when(
n < 4000 ~ 'Other',
TRUE~ port_embark
)) %>%
ggplot(aes(x = port_embark, y = year_arrival, fill = port_embark)) +
geom_boxplot(alpha = 0.4, show.legend = FALSE) +
labs(x = NULL, y = 'Year') + scale_y_continuous(breaks = 100*seq(18.2, 18.7, by=0.1))
When is this data from?
african_names %>% ggplot(aes(year_arrival)) +
geom_histogram(bins = 20, fill = 'midnightblue', alpha = 0.7)+
scale_y_continuous(labels = scales:: comma_format()) +
labs(
y = 'Number of liberated individuals',
x = NULL
) + scale_x_continuous(breaks = 100*seq(18.2, 18.7, by = 0.1))
What is the pattern of missing data?
library(naniar)
## Warning: package 'naniar' was built under R version 4.0.2
african_names %>%
select(gender, age, height, year_arrival) %>%
gg_miss_upset()
Gender has the highest proportion of missing data, and there is not much data missing from the age column. Fortunately for our attempt to impute missing valeus for many rows have all three of these missing
What is the relationship between gender and year of arrival?
african_names %>%
ggplot(aes( x = gender, y = year_arrival, fill = gender)) +
geom_boxplot(alpha = 0.4, show.legend = FALSE) +
labs(x = NULL, y = 'Year')
Gender was coded as both man/woman and boy/girl, but there is a fair amount of overlap in ages (children coded as “man”, for example)
african_names %>%
ggplot(aes(x= gender, y = age, fill = gender)) +
geom_boxplot(alpha = 0.4, show.legend = FALSE) +
labs(x = NULL, y = 'Year')
## Warning: Removed 1126 rows containing non-finite values (stat_boxplot).
What is the relationship between age and year of arrival?
african_names %>%
filter(year_arrival < 1850) %>%
group_by(year_arrival) %>%
summarise(age = mean(age, na.rm = TRUE)) %>%
ggplot(aes(x = year_arrival, y = age)) +
geom_line(alpha = 0.6, size = 1.5) +
geom_smooth(method = 'lm') +
scale_y_continuous(limits = c(0,NA)) +
labs(x = NULL, y = 'Mean age')
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using formula 'y ~ x'
Overall, the age is drifting up slightly, although the previous plot on boys/girls/men/women calls this into question. We can use modeling to explore this better
One of the most unique and valuable characteristics of this dataset is the names. We can make a scatterplot to understand more about the distribution of ages and year of arrival.
library(ggrepel)
african_names %>% group_by(name) %>%
summarise(
n = n(),
age = mean(age, na.rm = TRUE),
year_arrival = mean(year_arrival, na.rm = TRUE)
) %>%
ungroup() %>%
arrange(-n) %>%
filter(n > 30) %>%
ggplot(aes(x = year_arrival, y = age)) +
geom_text_repel(aes(label = name), size = 3, family = 'IBMPlexSans') +
geom_point(aes(size = n), color = 'midnightblue', alphaa = 0.7) +
labs(
x = 'Mean year of arrival', y = 'Mean age',
size = 'Number of people',
title = 'Age and year of arrival for most common names of transported captives',
caption = 'African Names Database from slavevoyages.org'
)
## `summarise()` ungrouping output (override with `.groups` argument)
## Warning: Ignoring unknown parameters: alphaa
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
First, let’s filter to only the dat a from before 1850 and recode the gender variable
liberated_df <- african_names %>%
filter(year_arrival < 1850) %>%
mutate(gender = case_when(
gender == 'Boy' ~ 'Man',
gender == 'Girl' ~ 'Woman',
TRUE ~ gender
)) %>%
mutate_if(is.character, factor)
liberated_df
## # A tibble: 91,394 x 11
## id voyage_id name gender age height ship_name year_arrival
## <dbl> <dbl> <fct> <fct> <dbl> <dbl> <fct> <dbl>
## 1 1 2314 Bora Man 30 62.5 NS de Re~ 1819
## 2 2 2315 Flam Man 30 64 Fabiana 1819
## 3 3 2315 Dee Man 28 65 Fabiana 1819
## 4 4 2315 Pao Man 22 62.5 Fabiana 1819
## 5 5 2315 Mufa Man 16 59 Fabiana 1819
## 6 6 2315 Latty Man 22 67.5 Fabiana 1819
## 7 7 2315 So Man 20 62 Fabiana 1819
## 8 8 2315 Trua Man 30 65.5 Fabiana 1819
## 9 9 2315 Tou Man 18 61.5 Fabiana 1819
## 10 10 2315 Quaco Man 23 62 Fabiana 1819
## # ... with 91,384 more rows, and 3 more variables: port_disembark <fct>,
## # port_embark <fct>, country_origin <fct>
Next, let’s impute the missing data using a recipe
library(recipes)
impute_rec <- recipe(year_arrival ~ gender + age + height, data = liberated_df) %>%
step_meanimpute(height) %>%
step_knnimpute(all_predictors())
Let’s walk through the steps in this recipe
First, we must tell the recipe() what’s going on with our model what data we are using (notice we did not split into training and testing, because of our specific modeling goals)
Next, we impute the missing values for height with the mean value of height. height has a low values of missingness. and we are only going to use it to impute age and gender, not for modeling
Next, we impute the missing values for age and gender using a nearest neighbors model with all three predictors
Once we have the recipe defined, we can estimate the parameters needed to apply it using prep() function. In this case, that means finding the mean for height (fast) and training the nearest neighbor model to find gender and age (not) so fast). Then we can use juice() to get that imputed data back out. (If we wanted to apply the recipe to other data, like new data we hadn’t seen before, we would use bake() instead).
imputed <- prep(impute_rec) %>% juice()
How did the imputation turn out?
summary(liberated_df$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.50 11.00 20.00 18.89 26.00 77.00 1030
summary(imputed$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.50 11.00 19.00 18.77 26.00 77.00
summary(liberated_df$gender)
## Man Woman NA's
## 52723 25889 12782
summary(imputed$gender)
## Man Woman
## 60992 30402
No more NA values, and the distributions look about the same. I like to keep in mind that the point of imputation like this is to be able to use the information we have in the dataset without throwing it away, which feels especially important when dealing with historical data on individuals who experienced enslavement.
The distribution of year of arrival was a bit wonky, so that is good to keep in mind when training a linear model.
fit_lm <- lm(year_arrival ~ gender + age, data = imputed)
We can check out the model results.
summary(fit_lm)
##
## Call:
## lm(formula = year_arrival ~ gender + age, data = imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.7206 -5.3343 0.9842 5.6828 17.0903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.832e+03 8.163e-02 22440.485 < 2e-16 ***
## genderWoman -3.014e-01 6.724e-02 -4.482 7.40e-06 ***
## age -2.123e-02 3.665e-03 -5.793 6.95e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.476 on 91391 degrees of freedom
## Multiple R-squared: 0.0005149, Adjusted R-squared: 0.000493
## F-statistic: 23.54 on 2 and 91391 DF, p-value: 6.012e-11
tidy(fit_lm) %>%
kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 1831.869 | 0.082 | 22440.485 | 0 |
| genderWoman | -0.301 | 0.067 | -4.482 | 0 |
| age | -0.021 | 0.004 | -5.793 | 0 |
During the years (about 1810 to 1850) included here, as time passed, there were some gradual shifts in the population of who was found on (i.e. liberated from) these slave ships.
There is evidence for a modest shift to younger ages as time passed. (The plot showing increasing age with time was, it turns out, an example of Simpson’s paradox.)
In the earlier years, there were more proportionally more women while in the later years, there were proportionally more men.