Understanding the migrant death issue

The data on migrant deaths comes from the Arizona OpenGIS Project, cosponsored by Humane Borders and the Pima County Office of Medical Examiner. I’ve edited the data set to include all recorded migrant deaths from 2000 through July 2024. The objective of this project is to give you continued experience working with and making sense of quantitative data while also giving you the opportunity to understand, at a deep level, the nature of the migrant death issue. Above all else, my goal is always to honor and humanize those who have died in the desert. These data permit a better understanding of the death crisis.

For purposes of this project, I have compiled a number of plots of the data highlighting different features of the migrant death crisis. Apart from an overview of the total number of migrant deaths recorded, I have compiled information on gender, age, and cause of death, as well as other factors. Your job is to tell a story. Imagine you are asked to summarize the information for an audience that knows nothing about the migrant death crisis. How would you proceed? What would you highlight?

Your job is to take what I have provided, analyze it, and tell your story. I want to learn something important, interesting, and useful about the migrant death crisis. I will not tell you what to look for; your job is to think critically and analytically. In class, I’ll discuss examples of what I’d look for. Ultimately, when presented with information, you need practice in learning how to engage with it. But some prompts might be: What are the characteristics of most migrant deaths? What trends do you observe? Are there gender differences? Do attributes of migrant deaths change over time?

This assignment will be worth 500 points. 100 points will be based on writing. I’m expecting analysis, not summaries. 400 points will be based on creativity. What do you choose to engage with, and how? What should we learn about the migrant death crisis based on your analysis? Understand, you are probably THE ONLY college class in the world looking at these data. What will you tell the rest of the world about the migrant death crisis?

What do I mean by creativity? To be blunt, creativity is not repeating numbers or statistics you see in a table. I don’t need anyone to do that, as I can see them with my own eyes. Creativity uses the observed data to paint a picture of the human side of the death crisis. I’m looking for analysis that has a natural flow. Students are used to answering, by rote, the questions asked of them. In turn, answers are flat and usually nonanalytical. That’s not the student’s fault; it’s the fault of the person denying the student creativity. Creativity means bringing in information and context from outside the confines of the specific charts, plots, or questions.

If you’re wondering what page length I have in mind: assuming you cut and paste some of the charts or plots, I would expect the analysis, charts and plots included, to run 5 pages or thereabouts. I’d strongly encourage bringing in external information (I’ve sent you links to sources and there are sources on the syllabus) to augment your analysis. In the end, if you simply report the results I’ve already created for you (i.e. you just reproduce in words what I see in the tables or charts), then I would not expect anything above a “C”-level grade. In class, we went over in great detail tips on how to avoid these problems, so follow those tips and you’ll be well on your way.

For the submission, do not submit an R Markdown file. I want a holistic submission that I can read from start to finish.

Extra credit bonus

If you are interested in producing your own analysis using R, I will offer up to 10% extra credit if you take the data and do something interesting with it. The operative words are “up to.” There’s no guarantee that if you do something, you’ll get the full 10 percent. In order to do this, you will need to use the .RMD file and edit it as needed. We teach R Markdown in POL 51 (well, I do), so if you know how to use it, use it. Skills like this are real skills worth developing. BUT, there is no requirement to do anything above and beyond what the core assignment is asking. I’m trying to incentivize the use of statistical computing.

This assignment is worth 500 points and is due by 11:59 PM on December 13.

In this document, I make references to “chunks” of code. For purposes of my HTML file, I’ve suppressed this code from showing up in the resulting file. If you examine the .RMD file, you will see the code (should you want to see it).

The data file is a csv file saved to my GitHub site. Should you choose to use R directly to reproduce what I have, the code in this document will give you direct access to the data. You are not required to use R.

A note about the data

The data on migrant deaths comes from the Arizona OpenGIS Project, cosponsored by Humane Borders and the Pima County Office of Medical Examiner. Recorded remains span the years 1981 to 2024. However, systematic data on migrant deaths exist only from 2000 onward. This is primarily because the death crisis did not really emerge until around the year 2000.

The reason the crisis was nonexistent before 2000 is that very, very few people crossed through Arizona. This changed because of policy changes made during the Clinton Administration that produced the funnel effect, pushing migrants into the Arizona corridor. So if one downloads the death map data, one will find that 4,263 migrant remains have been recovered (as of 7/31/2024); however, 4,127 of these have been recovered since the year 2000. This means that in the death map, only 136 remains are recorded as being found between 1981 and 1999 (an 18-year period). I don’t mean “only” in a demeaning way; rather, relative to the mass death crisis, the total number of deaths reported in the entire period between 1981 and 1999 is lower than the average number of deaths in a single year after 2000.
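
The comparison in the last sentence can be checked with a few lines of arithmetic. The sketch below uses only the two totals quoted above; the elapsed time of 24 years and 7 months (January 2000 through July 2024) is an approximation introduced here, not a figure from the death map.

```r
# Totals quoted in the text (death map as of 7/31/2024)
total_all   <- 4263  # all recorded remains, 1981 onward
total_2000s <- 4127  # remains recovered since 2000

# Remains recorded before 2000
pre_2000 <- total_all - total_2000s
pre_2000  # 136

# Approximate average number of remains recovered per year since 2000.
# Assumption: January 2000 through July 2024 is about 24 years and 7 months.
years_elapsed <- 24 + 7 / 12
avg_per_year  <- total_2000s / years_elapsed
round(avg_per_year)

# Is the whole pre-2000 total smaller than one average post-2000 year?
pre_2000 < avg_per_year
```

Under that assumption, a single average year after 2000 accounts for more recovered remains than the entire 1981 to 1999 period combined.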

The data file is a csv file saved to my GitHub site. I dynamically update the data as new information is added. This file is current through 7/31/24.

library(readr)  # provides read_csv()

md_url <- "https://raw.githubusercontent.com/mightyjoemoon/POL51/main/ogis_migrant_deaths-10.csv"
md <- read_csv(md_url)  # read the csv file straight from GitHub into md
## Rows: 4262 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): ML Number, Name, Sex, Reporting Date, Surface Management, Location...
## dbl  (8): Age, Decade, Corridor Code, Condition Code, Latitude, Longitude, U...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Migrant deaths by year and decade

Because we may want to look at yearly data, it’s useful to generate a variable that records the calendar year in which migrant remains were found. The code in the chunk below does just this. In addition to creating the new variable, I use the R command tabyl to produce a table of migrant deaths by year. In all, the table covers 4,262 recorded remains. It shows the number of remains recovered in each year along with the percentage of the total that each year accounts for. So in 2010, we see 224 remains were recovered. As a proportion of the total, this is about 0.0526 (i.e. \(\frac{224}{4262}\)). Multiply this proportion by 100 and you get the percent contribution: about 5.3% of all recovered remains were found in 2010. One way to quickly assess the persistence of the death crisis is to inspect these percentages. If the crisis were abating, we’d expect to see a substantial decline in the percentages over time. If the crisis is persistent, we’d expect the percentages to be very similar across time. (Note that the 2024 figure is small because we only have partial data for this year.) What do you see when you look at these percentages?

md$yeardecade n percent
1981 1 0.02%
1982 1 0.02%
1985 3 0.07%
1987 1 0.02%
1990 9 0.21%
1991 6 0.14%
1992 7 0.16%
1993 17 0.40%
1994 4 0.09%
1995 12 0.28%
1996 13 0.31%
1997 22 0.52%
1998 15 0.35%
1999 23 0.54%
2000 75 1.76%
2001 79 1.85%
2002 151 3.54%
2003 164 3.85%
2004 186 4.36%
2005 202 4.74%
2006 174 4.08%
2007 221 5.19%
2008 166 3.89%
2009 197 4.62%
2010 224 5.26%
2011 182 4.27%
2012 163 3.82%
2013 184 4.32%
2014 140 3.28%
2015 147 3.45%
2016 164 3.85%
2017 124 2.91%
2018 128 3.00%
2019 144 3.38%
2020 223 5.23%
2021 225 5.28%
2022 173 4.06%
2023 197 4.62%
2024 95 2.23%
Total 4262 -
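
For those working in R, a table like the one above can be sketched in base R. This is an illustration only: the six dates are made up, the real file’s date format may differ, and the hidden chunk uses tabyl rather than the base functions shown here.

```r
# Hypothetical reporting dates (NOT taken from the real data file)
reporting_date <- c("2010-06-14", "2010-07-02", "2011-08-19",
                    "2012-05-30", "2012-07-21", "2012-09-03")

# Pull the calendar year out of each date
year <- as.integer(format(as.Date(reporting_date), "%Y"))

# Count remains per year, then convert counts to percentages of the total
counts  <- table(year)
percent <- round(100 * prop.table(counts), 2)

year_table <- data.frame(year    = names(counts),
                         n       = as.vector(counts),
                         percent = as.vector(percent))
year_table

# The same arithmetic applied to the 2010 row of the real table:
# 224 remains out of 4,262 in total
check_2010 <- round(100 * 224 / 4262, 2)
check_2010  # 5.26
```

The last two lines reproduce the 5.26% shown for 2010, which is a quick way to confirm how the percent column is defined.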

Visualizing migrant deaths by year

Often (almost always), it’s easier to visualize quantitative data than to read a table of numbers. The code in the chunk below will create what’s known as a barplot using the data from the table we just considered. Each bar corresponds to the number of migrant remains recovered in each year. When you look at this plot, what do you see? What interpretation would you give to this? Does the plot show the crisis abating? Does it seem persistent? Is it getting worse?
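
A barplot of this kind can be sketched in base R from the yearly counts in the table above. The counts for 2000 to 2024 are copied from that table; the title and axis labels are illustrative choices, and the hidden chunk may well use a different plotting package.

```r
# Yearly counts for 2000-2024, copied from the table of deaths by year
years  <- 2000:2024
deaths <- c(75, 79, 151, 164, 186, 202, 174, 221, 166, 197,
            224, 182, 163, 184, 140, 147, 164, 124, 128, 144,
            223, 225, 173, 197, 95)

# One bar per year; las = 2 turns the year labels so they do not overlap
barplot(deaths,
        names.arg = years,
        las  = 2,
        ylab = "Remains recovered",
        main = "Migrant remains recovered by year (2024 is partial)")
```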

Migrant deaths by decade

This plot shows the number of recovered remains by decade.
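
As a rough sketch of how decade totals can be built, the base R code below sums the yearly counts from the table of deaths by year. Deriving the decade by rounding the year down to the nearest 10 is an assumption made here for illustration; the data file has its own Decade column, which may be coded differently.

```r
# Years with at least one recorded case, and their counts, from the yearly table
years <- c(1981, 1982, 1985, 1987, 1990:2024)
n <- c(1, 1, 3, 1,
       9, 6, 7, 17, 4, 12, 13, 22, 15, 23,
       75, 79, 151, 164, 186, 202, 174, 221, 166, 197,
       224, 182, 163, 184, 140, 147, 164, 124, 128, 144,
       223, 225, 173, 197, 95)

# Assumption: a year's decade is the year rounded down to a multiple of 10
decade <- floor(years / 10) * 10

# Sum the yearly counts within each decade
by_decade <- tapply(n, decade, sum)
by_decade

barplot(by_decade,
        names.arg = paste0(names(by_decade), "s"),
        ylab = "Remains recovered",
        main = "Migrant remains recovered by decade")
```

Keep in mind that the 2020s bar covers fewer than five years, so it is not directly comparable to the full decades before it.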

Understanding gender and migrant deaths

How do migrant deaths and gender relate to one another? The code in the chunk below creates a “factor-level” variable recording the gender of the migrant. Since gender is not always determined, there is a category called “undetermined.” Using the tabyl function, I create a table showing the total number of remains recovered that are male, female, and undetermined. As you can see, about 81% of the total remains recovered are male.

##     md$gender    n      percent valid_percent
##          male 3443 0.8078366964    0.80840573
##        female  598 0.1403097137    0.14040855
##  undetermined  218 0.0511496950    0.05118572
##          <NA>    3 0.0007038949            NA
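
The table above has two percentage columns, and the difference between them is only in the denominator. A small sketch using the counts from the table: percent divides by all 4,262 records, while valid_percent leaves out the 3 records where gender is missing (NA).

```r
# Counts copied from the gender table above
n         <- c(male = 3443, female = 598, undetermined = 218)
n_missing <- 3  # records where gender is NA

# percent: share of ALL records, including those with missing gender
percent <- n / (sum(n) + n_missing)

# valid_percent: share of records with a recorded gender category
valid_percent <- n / sum(n)

round(cbind(n, percent, valid_percent), 4)
```

Because only 3 records are missing, the two columns barely differ here. The distinction matters much more for variables like age, where a large share of values is missing.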

Are there discernible trends in gender differences? There has been some speculation that if asylum seeker deaths begin to rise, then more women will die, given that a large share of asylum seekers are women. Here is a link to a recent article on this: https://19thnews.org/2024/07/women-migrants-deaths-us-mexico-border/

Sometimes, raw numbers are harder to interpret than percentages. The chunk below reports the gender data as percentages within each year, with the raw counts in parentheses. To understand this, consider the year 2023. In this year, of the remains recovered, 73% were male, 19% were female, and 8% were undetermined. Do you see any patterns in the data?

yeardecade/gender male female undetermined NA_
1981 0% (0) 0% (0) 100% (1) 0% (0)
1982 0% (0) 100% (1) 0% (0) 0% (0)
1985 100% (3) 0% (0) 0% (0) 0% (0)
1987 100% (1) 0% (0) 0% (0) 0% (0)
1990 56% (5) 0% (0) 44% (4) 0% (0)
1991 67% (4) 0% (0) 33% (2) 0% (0)
1992 71% (5) 0% (0) 29% (2) 0% (0)
1993 59% (10) 0% (0) 41% (7) 0% (0)
1994 50% (2) 0% (0) 50% (2) 0% (0)
1995 42% (5) 0% (0) 58% (7) 0% (0)
1996 23% (3) 0% (0) 77% (10) 0% (0)
1997 18% (4) 5% (1) 77% (17) 0% (0)
1998 20% (3) 0% (0) 80% (12) 0% (0)
1999 22% (5) 4% (1) 74% (17) 0% (0)
2000 76% (57) 24% (18) 0% (0) 0% (0)
2001 75% (59) 24% (19) 1% (1) 0% (0)
2002 76% (115) 24% (36) 0% (0) 0% (0)
2003 80% (131) 20% (33) 0% (0) 0% (0)
2004 81% (150) 19% (35) 1% (1) 0% (0)
2005 81% (163) 19% (39) 0% (0) 0% (0)
2006 82% (143) 18% (31) 0% (0) 0% (0)
2007 77% (171) 23% (50) 0% (0) 0% (0)
2008 80% (133) 20% (33) 0% (0) 0% (0)
2009 85% (167) 14% (27) 2% (3) 0% (0)
2010 87% (195) 13% (29) 0% (0) 0% (0)
2011 87% (159) 12% (21) 1% (2) 0% (0)
2012 87% (141) 12% (20) 1% (2) 0% (0)
2013 90% (166) 10% (18) 0% (0) 0% (0)
2014 91% (127) 8% (11) 1% (2) 0% (0)
2015 88% (130) 8% (12) 3% (5) 0% (0)
2016 93% (152) 6% (10) 1% (2) 0% (0)
2017 98% (121) 2% (3) 0% (0) 0% (0)
2018 90% (115) 9% (12) 1% (1) 0% (0)
2019 88% (127) 8% (12) 3% (5) 0% (0)
2020 82% (182) 8% (18) 10% (23) 0% (0)
2021 76% (170) 13% (30) 10% (22) 1% (3)
2022 77% (134) 13% (23) 9% (16) 0% (0)
2023 73% (144) 19% (38) 8% (15) 0% (0)
2024 43% (41) 18% (17) 39% (37) 0% (0)
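
The percentages in this table are row percentages: each count is divided by the total for its own year. A minimal base R sketch using the counts for 2020, 2022, and 2023 from the table above (years in which no gender values are missing):

```r
# Counts by year and gender, copied from the table above
counts <- matrix(c(182, 18, 23,
                   134, 23, 16,
                   144, 38, 15),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(year   = c("2020", "2022", "2023"),
                                 gender = c("male", "female", "undetermined")))

# margin = 1 divides every count by its ROW (year) total
row_pct <- round(100 * prop.table(counts, margin = 1))
row_pct
```

Reading across the 2023 row gives 73, 19, and 8, matching the table. Using margin = 2 instead would divide by column totals and answer a different question: of all male remains in these years, what share was recovered in each year?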

Visualizing deaths by gender

I created a summary data set of remains recovered for the years 2000 to 2024 (noting that 2024 is incomplete). This summary data set makes it easy to visualize migrant deaths by year and gender.

## New names:
## Rows: 26 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (4): year20, male, female, undetermined lgl (2): ...5, ...6
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...5`
## • `` -> `...6`

The code in the chunk below uses my new data set to create a line plot showing the percentage of migrant deaths accounted for by each gender category. What do you see in this plot? How would you characterize the relationship between time, deaths, and gender? (Note that the code produces some warning messages; these can be disregarded as they have no bearing on the plot.)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

To see the total number of migrant deaths by gender, the code in the chunk below gives a barplot of total deaths. What is the key takeaway point here?

Here is another way to visualize migrant deaths by gender. The code in the chunk below creates what is known as a “stacked bar plot.” The total height of each bar corresponds to the total number of remains recovered in that year. Each color corresponds to a gender classification, so the colored segments show the number of males, females, and those whose gender is undetermined.

Age of deceased migrants

The OpenGIS data records the age of the migrant if this determination is possible. To understand the relationship between age and migrant deaths, I’ve created a new variable recording age groups. These groups are in 5-year increments with the exception of the first group (0-9 years) and the last group (60 and over). We see that about 40% of the migrant deaths have an indeterminate age. The largest concentration of migrant deaths (about 32% of the total) is in the age range of 20 to 34 years. This percentage is calculated with all of the individuals whose age cannot be determined included in the total. What else do we see? What age groups contribute to most of the deaths?

md$agegroup n percent
0-9 7 0%
10-14 21 0%
15-19 246 6%
20-24 471 11%
25-29 459 11%
30-34 446 10%
35-39 372 9%
40-44 251 6%
45-49 155 4%
50-54 76 2%
55-59 36 1%
60 and over 21 0%
NA 1701 40%

Suppose we want to tabulate age group deaths by year. This is what the code in the chunk below is doing. To understand the resulting table, consider the year 2006 and the age group 20-24 years of age. This group accounted for about 13% of that year’s migrant deaths. For most years, those whose age is indeterminate account for a large share of migrant deaths. Note that the share of “NAs” seems to be increasing with time. What do you make of this?

yeardecade/agegroup 0-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60 and over NA Total
1981 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (1) 100% (1)
1982 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (1) 100% (1)
1985 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (3) 100% (3)
1987 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (1) 100% (1)
1990 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (9) 100% (9)
1991 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (6) 100% (6)
1992 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (7) 100% (7)
1993 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (17) 100% (17)
1994 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (4) 100% (4)
1995 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (12) 100% (12)
1996 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (13) 100% (13)
1997 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (22) 100% (22)
1998 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (15) 100% (15)
1999 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 0% (0) 100% (23) 100% (23)
2000 0% (0) 4% (3) 11% (8) 17% (13) 19% (14) 12% (9) 4% (3) 8% (6) 5% (4) 1% (1) 0% (0) 0% (0) 19% (14) 100% (75)
2001 0% (0) 0% (0) 6% (5) 19% (15) 13% (10) 13% (10) 6% (5) 9% (7) 6% (5) 3% (2) 0% (0) 0% (0) 25% (20) 100% (79)
2002 0% (0) 3% (4) 11% (17) 15% (22) 13% (20) 11% (17) 11% (17) 7% (10) 5% (8) 3% (4) 1% (2) 0% (0) 20% (30) 100% (151)
2003 1% (2) 0% (0) 10% (16) 15% (25) 14% (23) 10% (17) 9% (15) 8% (13) 2% (3) 1% (2) 2% (3) 1% (2) 26% (43) 100% (164)
2004 0% (0) 1% (1) 7% (13) 13% (24) 17% (31) 9% (16) 10% (18) 10% (18) 3% (5) 2% (3) 2% (3) 1% (1) 28% (53) 100% (186)
2005 0% (1) 1% (3) 10% (20) 15% (30) 11% (23) 12% (25) 10% (21) 4% (9) 5% (11) 3% (7) 0% (1) 0% (0) 25% (51) 100% (202)
2006 1% (1) 3% (5) 8% (14) 13% (23) 10% (17) 11% (19) 7% (13) 7% (12) 2% (4) 2% (3) 1% (2) 1% (2) 34% (59) 100% (174)
2007 0% (0) 0% (1) 8% (18) 9% (20) 13% (28) 14% (30) 13% (28) 8% (17) 2% (5) 3% (6) 1% (2) 1% (3) 29% (63) 100% (221)
2008 1% (1) 1% (1) 10% (17) 13% (22) 11% (18) 13% (21) 6% (10) 8% (13) 4% (7) 4% (7) 1% (1) 1% (1) 28% (47) 100% (166)
2009 0% (0) 1% (1) 8% (16) 10% (20) 14% (27) 15% (29) 11% (21) 3% (6) 2% (4) 2% (4) 2% (3) 1% (1) 33% (65) 100% (197)
2010 0% (0) 0% (0) 5% (12) 12% (27) 16% (35) 11% (24) 13% (28) 8% (18) 5% (12) 1% (2) 0% (0) 0% (0) 29% (66) 100% (224)
2011 0% (0) 0% (0) 5% (9) 9% (16) 12% (21) 12% (21) 11% (20) 7% (12) 3% (5) 1% (2) 1% (1) 1% (1) 41% (74) 100% (182)
2012 0% (0) 1% (1) 4% (7) 6% (9) 13% (22) 12% (19) 9% (15) 8% (13) 4% (7) 2% (3) 1% (1) 0% (0) 40% (66) 100% (163)
2013 0% (0) 0% (0) 5% (9) 9% (16) 10% (19) 10% (18) 12% (22) 5% (10) 3% (5) 2% (3) 1% (2) 1% (1) 43% (79) 100% (184)
2014 0% (0) 0% (0) 4% (6) 6% (9) 8% (11) 6% (9) 5% (7) 6% (9) 4% (5) 3% (4) 1% (2) 0% (0) 56% (78) 100% (140)
2015 0% (0) 0% (0) 6% (9) 12% (18) 8% (12) 12% (17) 12% (17) 5% (7) 2% (3) 3% (4) 1% (2) 1% (1) 39% (57) 100% (147)
2016 0% (0) 0% (0) 2% (3) 13% (21) 9% (14) 12% (19) 10% (16) 2% (4) 4% (6) 1% (1) 2% (3) 1% (2) 46% (75) 100% (164)
2017 0% (0) 0% (0) 2% (3) 9% (11) 10% (12) 11% (14) 5% (6) 2% (3) 5% (6) 1% (1) 1% (1) 0% (0) 54% (67) 100% (124)
2018 0% (0) 0% (0) 3% (4) 5% (7) 9% (11) 9% (11) 9% (12) 9% (11) 7% (9) 2% (2) 2% (3) 1% (1) 45% (57) 100% (128)
2019 1% (1) 0% (0) 3% (4) 13% (18) 8% (12) 10% (15) 10% (14) 3% (4) 1% (2) 3% (5) 0% (0) 1% (1) 47% (68) 100% (144)
2020 0% (0) 0% (0) 5% (11) 11% (24) 8% (17) 7% (16) 7% (16) 5% (11) 3% (7) 0% (0) 0% (0) 0% (1) 54% (120) 100% (223)
2021 0% (0) 0% (0) 4% (8) 11% (25) 10% (22) 9% (21) 8% (18) 5% (12) 4% (10) 1% (2) 1% (2) 0% (0) 47% (105) 100% (225)
2022 0% (0) 0% (0) 4% (7) 12% (20) 10% (18) 13% (22) 5% (8) 9% (15) 3% (6) 1% (1) 1% (1) 1% (1) 43% (74) 100% (173)
2023 1% (1) 1% (1) 3% (6) 14% (27) 9% (17) 10% (19) 10% (19) 4% (8) 6% (12) 4% (7) 1% (1) 1% (2) 39% (77) 100% (197)
2024 0% (0) 0% (0) 4% (4) 9% (9) 5% (5) 8% (8) 3% (3) 3% (3) 4% (4) 0% (0) 0% (0) 0% (0) 62% (59) 100% (95)
Total 0% (7) 0% (21) 6% (246) 11% (471) 11% (459) 10% (446) 9% (372) 6% (251) 4% (155) 2% (76) 1% (36) 0% (21) 40% (1,701) 100% (4,262)

It might be useful to visualize the data with a barplot. This is what I do in the following chunk. This plot includes the cases where age is indeterminate. What do you see?

Suppose we wanted to see the percentage for each age group based only on the number of migrants whose age could be determined. In other words, suppose we omitted the “NAs” from our analysis. This is what is done in the chunk below, where I subset the data by excluding the NAs. Consider again the year 2006 and the age group 20-24. Here I see that of those whose age can be identified, about 20% of the remains recovered were of people aged 20 to 24.
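
The jump from about 13% to about 20% comes entirely from changing the denominator. A short check using the 2006 row of the age group table (23 remains aged 20-24, 174 remains in total, 59 with no recorded age):

```r
# 2006 counts from the age group by year table
n_20_24   <- 23   # remains aged 20-24
n_total   <- 174  # all remains recovered in 2006
n_unknown <- 59   # remains with no recorded age (NA)

# Share of ALL 2006 remains
share_all <- n_20_24 / n_total

# Share of 2006 remains whose age could be determined
share_known <- n_20_24 / (n_total - n_unknown)

round(100 * c(all_remains = share_all, known_age_only = share_known))
```

Neither figure is wrong; they answer different questions. The second one quietly assumes that those with unknown ages are distributed like those with known ages, which is worth questioning when the share of unknowns is large.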

Suppose we are interested in seeing whether there are any trends in the average age of deceased migrants. One way to do this is through a simple regression model. This model will give us an estimate of the average age conditional on year. The code in the chunk will do this and produce the resulting plot. The plot may look daunting, but it’s actually simple. In order to make the x-axis cleaner, I’m just using the last digit(s) of the year to denote the year. So a “0” denotes the year 2000; a “24” denotes the year 2024. The key thing to look at is the dot plotted for each year. This dot is our estimate of the average age at death for migrants whose remains were recovered in that year. So consider the year 2003 (denoted as “3”). This dot is right on the line corresponding to 30 years of age. This implies that among those whose remains were recovered in 2003, the average age at death was about 30 years. The vertical lines above and below the dot correspond to a “confidence interval.” This is sort of like a margin of error in layperson’s terms. Years with smaller numbers of deaths will have larger confidence intervals because there are fewer cases with which to make an estimate. What do you see in this graph? Are there any trends you spot?

## 
## Call:
## lm(formula = Age ~ YEAR, data = md)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.650  -7.832  -1.120   6.494  66.404 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   28.230      1.301  21.704 < 0.0000000000000002 ***
## YEAR2001       2.686      1.855   1.448              0.14777    
## YEAR2002       1.903      1.595   1.193              0.23306    
## YEAR2003       1.770      1.595   1.110              0.26714    
## YEAR2004       2.891      1.571   1.840              0.06585 .  
## YEAR2005       1.843      1.541   1.196              0.23178    
## YEAR2006       1.797      1.609   1.117              0.26430    
## YEAR2007       3.929      1.531   2.566              0.01036 *  
## YEAR2008       2.602      1.600   1.627              0.10389    
## YEAR2009       2.324      1.573   1.477              0.13970    
## YEAR2010       3.277      1.531   2.140              0.03246 *  
## YEAR2011       3.891      1.627   2.391              0.01686 *  
## YEAR2012       4.451      1.660   2.681              0.00738 ** 
## YEAR2013       3.990      1.635   2.439              0.01478 *  
## YEAR2014       5.416      1.832   2.956              0.00314 ** 
## YEAR2015       3.459      1.685   2.053              0.04014 *  
## YEAR2016       4.366      1.689   2.586              0.00978 ** 
## YEAR2017       2.946      1.871   1.574              0.11557    
## YEAR2018       7.404      1.773   4.175            0.0000308 ***
## YEAR2019       3.192      1.746   1.828              0.06773 .  
## YEAR2020       2.120      1.641   1.292              0.19657    
## YEAR2021       3.487      1.597   2.183              0.02913 *  
## YEAR2022       3.356      1.654   2.030              0.04248 *  
## YEAR2023       4.420      1.597   2.767              0.00569 ** 
## YEAR2024       1.632      2.135   0.764              0.44481    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.16 on 2536 degrees of freedom
##   (1701 observations deleted due to missingness)
## Multiple R-squared:  0.01574,    Adjusted R-squared:  0.006424 
## F-statistic:  1.69 on 24 and 2536 DF,  p-value: 0.01934
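
To see why the output above translates into yearly average ages, it helps to run the same kind of model on a tiny example. The sketch below uses made-up ages for three years (not the real data). When year enters the model as a category, the intercept is the average age in the first year and every other coefficient is the difference from that first year.

```r
# Made-up ages for three years (NOT the real data)
toy <- data.frame(
  year = factor(rep(c(2000, 2001, 2002), each = 3)),
  age  = c(25, 28, 31,  29, 32, 35,  27, 30, 36)
)

# Regression of age on year, with year treated as a category
fit <- lm(age ~ year, data = toy)
coef(fit)

# Plain group averages, for comparison
group_means <- sapply(split(toy$age, toy$year), mean)
group_means

# Estimated average age per year with a 95% confidence interval
new_years <- data.frame(year = factor(levels(toy$year), levels = levels(toy$year)))
ci <- predict(fit, newdata = new_years, interval = "confidence")
cbind(new_years, round(ci, 1))

# Same logic applied to the real output above: intercept plus the YEAR2003 coefficient
reported_2003 <- 28.230 + 1.770
reported_2003
```

In the toy example, the intercept plus each coefficient reproduces the group averages exactly. Applied to the real output, 28.230 + 1.770 gives the value of about 30 described for 2003.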

Age and gender

Suppose we wanted to consider age and gender together. This is what I’m doing in this section. The code in the chunk below allows me to estimate the average age at death for males and females. The resulting plot has the same kind of interpretation as the one we just looked at. The difference is that we have two sets of estimates: one for males and one for females. Here, those things called confidence intervals are important. If the intervals for females overlap the intervals for males, then we might conclude that the difference in average age is not statistically significant. What do you see? Do you see clear gender differences?

## 
##   male female 
##   3443    598
## 
##  Welch Two Sample t-test
## 
## data:  md$Age by md$gender2
## t = 3.9991, df = 632.82, p-value = 0.00007108
## alternative hypothesis: true difference in means between group male and group female is not equal to 0
## 95 percent confidence interval:
##  1.057570 3.098214
## sample estimates:
##   mean in group male mean in group female 
##             31.71107             29.63318
## 
## Call:
## lm(formula = Age ~ YEAR * gender2, data = md)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.284  -7.518  -1.222   6.583  66.321 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)            28.51163    1.54846  18.413 < 0.0000000000000002 ***
## YEAR2001                2.44186    2.18985   1.115             0.264922    
## YEAR2002                1.88837    1.88236   1.003             0.315865    
## YEAR2003                2.77258    1.86628   1.486             0.137505    
## YEAR2004                2.44075    1.83838   1.328             0.184409    
## YEAR2005                2.20968    1.80078   1.227             0.219912    
## YEAR2006                1.84321    1.87252   0.984             0.325040    
## YEAR2007                3.91079    1.81288   2.157             0.031083 *  
## YEAR2008                2.97291    1.86027   1.598             0.110146    
## YEAR2009                2.17019    1.82620   1.188             0.234802    
## YEAR2010                2.68262    1.77185   1.514             0.130147    
## YEAR2011                3.74090    1.85449   2.017             0.043780 *  
## YEAR2012                4.00602    1.90018   2.108             0.035110 *  
## YEAR2013                3.97258    1.86628   2.129             0.033384 *  
## YEAR2014                5.48837    2.05098   2.676             0.007500 ** 
## YEAR2015                3.26886    1.91182   1.710             0.087424 .  
## YEAR2016                4.16738    1.91588   2.175             0.029709 *  
## YEAR2017                2.66381    2.05098   1.299             0.194132    
## YEAR2018                6.88837    1.99597   3.451             0.000567 ***
## YEAR2019                3.41484    1.97836   1.726             0.084453 .  
## YEAR2020                2.20576    1.87574   1.176             0.239728    
## YEAR2021                3.52919    1.85736   1.900             0.057533 .  
## YEAR2022                3.71059    1.91588   1.937             0.052887 .  
## YEAR2023                5.05826    1.87252   2.701             0.006953 ** 
## YEAR2024                2.83453    2.52253   1.124             0.261255    
## gender2female          -0.95607    2.85054  -0.335             0.737351    
## YEAR2001:gender2female  0.81508    4.11912   0.198             0.843157    
## YEAR2002:gender2female -0.08909    3.54923  -0.025             0.979976    
## YEAR2003:gender2female -5.02045    3.62992  -1.383             0.166766    
## YEAR2004:gender2female  1.75369    3.57627   0.490             0.623916    
## YEAR2005:gender2female -2.42041    3.53920  -0.684             0.494110    
## YEAR2006:gender2female -0.76240    3.73104  -0.204             0.838104    
## YEAR2007:gender2female -0.03777    3.38662  -0.011             0.991102    
## YEAR2008:gender2female -2.57392    3.72491  -0.691             0.489628    
## YEAR2009:gender2female  0.18335    3.70801   0.049             0.960568    
## YEAR2010:gender2female  3.55130    3.78071   0.939             0.347656    
## YEAR2011:gender2female -0.62979    4.54123  -0.139             0.889713    
## YEAR2012:gender2female  2.27176    4.23443   0.536             0.591662    
## YEAR2013:gender2female -1.82814    4.41826  -0.414             0.679079    
## YEAR2014:gender2female -3.44393    5.52764  -0.623             0.533316    
## YEAR2015:gender2female -0.07442    4.71918  -0.016             0.987420    
## YEAR2016:gender2female  0.02706    4.72082   0.006             0.995427    
## YEAR2017:gender2female       NA         NA      NA                   NA    
## YEAR2018:gender2female  3.72274    5.18608   0.718             0.472927    
## YEAR2019:gender2female -3.84540    4.74653  -0.810             0.417931    
## YEAR2020:gender2female -2.48859    4.31499  -0.577             0.564173    
## YEAR2021:gender2female -0.94189    3.75330  -0.251             0.801874    
## YEAR2022:gender2female -2.54393    3.88926  -0.654             0.513113    
## YEAR2023:gender2female -3.13234    3.61287  -0.867             0.386027    
## YEAR2024:gender2female -4.39008    4.73299  -0.928             0.353731    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.15 on 2511 degrees of freedom
##   (1702 observations deleted due to missingness)
## Multiple R-squared:  0.02631,    Adjusted R-squared:  0.007698 
## F-statistic: 1.414 on 48 and 2511 DF,  p-value: 0.03274
## Warning in predict.lm(model, newdata = data_grid, type = "response", se.fit =
## se, : prediction from rank-deficient fit; attr(*, "non-estim") has doubtful
## cases
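
One way to read the t-test output above: the confidence interval is centered on the difference between the two group averages, and the question is whether that interval includes zero. The first part of the sketch below uses only the numbers printed above. The second part runs the same kind of test on simulated ages, where the group sizes, averages, and seed are arbitrary values chosen for illustration.

```r
# Numbers copied from the Welch t-test output above
diff_means <- 31.71107 - 29.63318          # male average minus female average
ci <- c(lower = 1.057570, upper = 3.098214)

mean(ci)         # midpoint of the interval
diff_means       # essentially the same number
ci[["lower"]] > 0  # TRUE: the interval excludes zero

# The same kind of test on simulated data (NOT the real data)
set.seed(51)  # arbitrary seed so the example is reproducible
toy <- data.frame(
  gender = rep(c("male", "female"), times = c(60, 25)),
  age    = c(rnorm(60, mean = 32, sd = 10), rnorm(25, mean = 30, sd = 10))
)
toy_test <- t.test(age ~ gender, data = toy)  # t.test() uses the Welch version by default
toy_test$method
toy_test$conf.int
```

Across all years pooled together, the interval for the male and female difference excludes zero. Within any single year there are far fewer cases, so the yearly intervals are much wider, and that is why the year-by-year comparison can look less clear-cut.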

Understanding cause of death

The OpenGIS data records the likely cause of death. Several codes are given for migrant deaths. Two of these codes indicate that the cause of death is not determined. In one case, it is simply impossible to determine how the migrant died, and in the second case, only skeletal remains are found (making it extraordinarily difficult to discern cause of death). Below, I create a table of the causes of death reported in the OpenGIS data. What stands out most to you in terms of what we see in the data?

md$Cause of Death n percent valid_percent
Asphyxia 8 0% 0%
Blunt Force Injury 219 5% 5%
Diabetes 5 0% 0%
Drowning 45 1% 1%
Drug Overdose 5 0% 0%
Exposure 1493 35% 36%
Exsanguination 1 0% 0%
Gunshot Wound 97 2% 2%
Heart Disease 23 1% 1%
Lightning Strike 1 0% 0%
Motor Vehicle Accident 25 1% 1%
Nonviable Fetus 2 0% 0%
Other Disease 20 0% 0%
Other Injury 13 0% 0%
Other Injury / Homicide 26 1% 1%
Other injury 2 0% 0%
Pending 26 1% 1%
Pregnancy Complication 1 0% 0%
Skeletal Remains 1482 35% 36%
Undetermined 641 15% 15%
c 1 0% 0%
NA 126 3% -

Because “exposure,” “skeletal remains,” and “undetermined” are the three dominant codes given, let’s consider how these kinds of cases vary across the years. This is what I’m up to in the code below. In the table, do you see anything that stands out to you?

yeardecade/cod Exposure Other Skeletal remains Undetermined Total
1981 0% (0) 0% (0) 0% (0) 100% (1) 100% (1)
1982 0% (0) 0% (0) 0% (0) 100% (1) 100% (1)
1985 0% (0) 33% (1) 0% (0) 67% (2) 100% (3)
1987 0% (0) 0% (0) 0% (0) 100% (1) 100% (1)
1990 0% (0) 100% (9) 0% (0) 0% (0) 100% (9)
1991 0% (0) 100% (6) 0% (0) 0% (0) 100% (6)
1992 0% (0) 100% (7) 0% (0) 0% (0) 100% (7)
1993 0% (0) 100% (17) 0% (0) 0% (0) 100% (17)
1994 0% (0) 75% (3) 0% (0) 25% (1) 100% (4)
1995 0% (0) 100% (12) 0% (0) 0% (0) 100% (12)
1996 0% (0) 92% (12) 0% (0) 8% (1) 100% (13)
1997 0% (0) 100% (22) 0% (0) 0% (0) 100% (22)
1998 0% (0) 100% (15) 0% (0) 0% (0) 100% (15)
1999 0% (0) 96% (22) 0% (0) 4% (1) 100% (23)
2000 55% (41) 32% (24) 13% (10) 0% (0) 100% (75)
2001 67% (53) 6% (5) 3% (2) 24% (19) 100% (79)
2002 61% (92) 20% (30) 5% (8) 14% (21) 100% (151)
2003 59% (97) 16% (27) 4% (7) 20% (33) 100% (164)
2004 43% (80) 28% (53) 2% (3) 27% (50) 100% (186)
2005 69% (139) 16% (32) 2% (4) 13% (27) 100% (202)
2006 41% (71) 20% (34) 6% (11) 33% (58) 100% (174)
2007 51% (112) 14% (32) 11% (25) 24% (52) 100% (221)
2008 37% (61) 22% (37) 19% (31) 22% (37) 100% (166)
2009 38% (75) 23% (45) 20% (40) 19% (37) 100% (197)
2010 42% (95) 11% (25) 27% (61) 19% (43) 100% (224)
2011 29% (52) 10% (18) 51% (92) 11% (20) 100% (182)
2012 20% (33) 11% (18) 56% (92) 12% (20) 100% (163)
2013 28% (52) 9% (16) 54% (99) 9% (17) 100% (184)
2014 11% (15) 6% (9) 75% (105) 8% (11) 100% (140)
2015 21% (31) 10% (14) 64% (94) 5% (8) 100% (147)
2016 27% (44) 5% (9) 60% (99) 7% (12) 100% (164)
2017 12% (15) 3% (4) 76% (94) 9% (11) 100% (124)
2018 20% (26) 5% (6) 65% (83) 10% (13) 100% (128)
2019 20% (29) 2% (3) 65% (93) 13% (19) 100% (144)
2020 21% (46) 9% (19) 58% (130) 13% (28) 100% (223)
2021 34% (76) 10% (23) 50% (112) 6% (14) 100% (225)
2022 29% (51) 6% (10) 52% (90) 13% (22) 100% (173)
2023 38% (74) 9% (18) 34% (67) 19% (38) 100% (197)
2024 35% (33) 9% (9) 32% (30) 24% (23) 100% (95)
Total 35% (1,493) 15% (646) 35% (1,482) 15% (641) 100% (4,262)
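
A grouping like the four columns in this table can be sketched in base R by keeping the three dominant codes and labeling everything else “Other.” The example values below are illustrative, and the hidden chunk may do this differently. The counts do suggest that records with a missing cause of death were also placed in “Other,” since 4,262 minus the three dominant codes leaves 646; that is an inference from the tables, not something stated in the text.

```r
# Illustrative cause-of-death values, including a missing one (NOT real records)
cause <- c("Exposure", "Skeletal Remains", "Undetermined", "Gunshot Wound",
           "Other injury", "Other Injury", NA, "Exposure")

# Keep the three dominant codes; everything else (including NA) becomes "Other"
main_codes <- c("Exposure", "Skeletal Remains", "Undetermined")
cod <- ifelse(cause %in% main_codes, cause, "Other")
table(cod)

# Check against the full table: total minus the three dominant codes
other_total <- 4262 - (1493 + 1482 + 641)
other_total  # 646
```

Note that %in% treats NA as “not in the list,” which is why the missing value lands in “Other” rather than staying missing.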

I created a summary data set using the information in the table we just considered. Sometimes it’s useful to plot the data. In the chunk below is the code to read in my new data set and following this is a line plot of the different codes given for “cause of death.” What do you see? Compare deaths due to exposure versus deaths that are not determined because of skeletal remains. What do you make of this? Do you see trends?

## Rows: 25 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): year20, Exposure, Other, Skeletal remains, Undetermined
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##      year20        Exposure          Other         Skeletal remains 
##  Min.   :2000   Min.   :0.1071   Min.   :0.02083   Min.   :0.01613  
##  1st Qu.:2006   1st Qu.:0.2097   1st Qu.:0.06404   1st Qu.:0.10065  
##  Median :2012   Median :0.3526   Median :0.10056   Median :0.41894  
##  Mean   :2012   Mean   :0.3636   Mean   :0.12646   Mean   :0.36347  
##  3rd Qu.:2017   3rd Qu.:0.4533   3rd Qu.:0.17233   3rd Qu.:0.58813  
##  Max.   :2023   Max.   :0.6881   Max.   :0.32000   Max.   :0.75806  
##  NA's   :1      NA's   :1        NA's   :1         NA's   :1        
##   Undetermined    
##  Min.   :0.00000  
##  1st Qu.:0.09147  
##  Median :0.12956  
##  Mean   :0.14649  
##  3rd Qu.:0.19497  
##  Max.   :0.33333  
##  NA's   :1
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).