Migrant death project

Understanding the migrant death issue

The data on migrant deaths come from the Arizona OpenGIS Project, cosponsored by Humane Borders and the Pima County Office of Medical Examiner. I’ve edited the data set to include all recorded migrant deaths from 2000 through July 2024. The objective of this project is to give you continued experience in working with and making sense of quantitative data while also giving you the opportunity to understand, at a deep level, the nature of the migrant death issue. Above all else, my goal is always to honor and humanize those who have died in the desert. These data permit a better understanding of the death crisis.

For purposes of this project, I have compiled a number of plots of the data highlighting different features of the migrant death crisis. Apart from an overview of the total number of migrant deaths recorded, I have compiled information on gender, age, and cause of death, as well as other factors. Your job is to tell a story. Imagine you are asked to summarize the information for an audience that knows nothing about the migrant death crisis. How would you proceed? What would you highlight?

Your job is to take what I have provided, analyze it, and tell your story. I want to know something important, interesting, and useful about the migrant death crisis. I will not tell you what to look for; your job is to think critically and analytically. In class, I’ll discuss examples of what I’d look for. Ultimately, when presented with information, you need to get practice in learning how to engage it. But some prompts might be: What are the characteristics of most migrant deaths? What trends do you observe? Are there gender differences? Do attributes of migrant deaths change with respect to time?

This assignment will be worth 500 points. 100 points will be based on writing. I’m expecting analysis, not summaries. 400 points will be based on creativity. What do you choose to engage, and how? What should we learn about the migrant death crisis based on your analysis? Understand, you are probably THE ONLY college class in the world looking at these data. What will you tell the rest of the world about the migrant death crisis?

What do I mean by creativity? To be blunt, creativity is not repeating numbers or statistics you see in a table. I don’t need anyone to do that, as I can see it with my own eyes. Creativity uses the observed data to paint a picture of the human side of the death crisis. I’m looking for analysis that has a natural flow. Students are used to answering, by rote, the questions that get asked of them. In turn, answers are flat and usually nonanalytical. That’s not the student’s fault; it’s the fault of the person denying the student creativity. Creativity means bringing in information and context outside of the confines of the specific charts, plots, or questions. If you’re wondering what page length I’m imagining: assuming you will cut and paste some of the charts or plots, I would expect the analysis, with the charts and plots included, to be 5 pages or thereabouts. I’d strongly encourage bringing in external information (I’ve sent you links to sources and there are sources on the syllabus) to augment your analysis. In the end, if you simply report the results I’ve already created for you (i.e. you just reproduce in words what I see in the tables or charts), then I would not expect anything above a “C”-level grade. In class, we went over in great detail tips on how to avoid these problems, so follow those tips and you’re going to be well on your way.

For the submission, do not submit an R Markdown file. I want a holistic submission that I can read from start to finish.

Extra credit bonus

If you are interested in producing your own analysis using R, I will offer up to 10% extra credit if you take the data and do something interesting with it. The operative words are “up to.” There’s no guarantee that if you do something, you’ll get the full 10 percent. In order to do this, you will need to use the .RMD file and edit it as needed. We teach R Markdown in POL 51 (well, I do), so if you know how to use it, use it. Skills like this are real skills to develop. BUT, there is no requirement to do anything above and beyond what the core assignment is asking. I’m trying to incentivize the use of statistical computing.

This assignment is worth 500 points and is due by 11:59 PM on December 13.

In this document, I make references to “chunks” of code. For purposes of my HTML file, I’ve suppressed this code from showing up in the resulting file. If you examine the .RMD file, you will see the code (should you want to see it).

The data file is a csv file saved to my GitHub site. Should you choose to use R directly to reproduce what I have, this code will permit direct access to the data. You are not required to use R.

A note about the data

The data on migrant deaths come from the Arizona OpenGIS Project, cosponsored by Humane Borders and the Pima County Office of Medical Examiner. Recorded remains span the years 1981 to 2024. However, systematic data on migrant deaths are found only from 2000 onward. This is primarily because the death crisis doesn’t really emerge until around the year 2000.

The reason the crisis was nonexistent before 2000 is that very, very few people crossed through Arizona. This changed due to policy changes made in the Clinton Administration that led to the funnel effect pushing migrants into the Arizona corridor. So if one downloads the death map data, one will find that 4,263 migrant remains have been recovered (as of 7/31/2024); however, 4,127 have been recovered since the year 2000. This means that in the death map, only 136 remains are recorded as being found between 1981 and 1999 (an 18-year period). I don’t mean “only” in a demeaning way; rather, relative to the mass death crisis, the total number of deaths reported in the 18-year period between 1981 and 1999 is far lower than the average yearly number of deaths after the year 2000.

The data file is a csv file saved to my GitHub site. I dynamically update the data as new information is added. This file is current through 7/31/24.

library(readr)

md_url <- "https://raw.githubusercontent.com/mightyjoemoon/POL51/main/ogis_migrant_deaths-10.csv"

md <- read_csv(md_url)

## Rows: 4262 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): ML Number, Name, Sex, Reporting Date, Surface Management, Location...
## dbl  (8): Age, Decade, Corridor Code, Condition Code, Latitude, Longitude, U...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Migrant deaths by year and decade

Because we may want to look at yearly data, it’s useful to generate a variable that records the calendar year in which migrant remains were found. The code in the chunk below does just this. In addition to creating the new variable, I use the R command tabyl to produce a table of migrant deaths by year. In all, there are 4,262 recorded deaths in the data set. The table shows the number of remains recovered each year along with the share of total deaths each year accounts for. So in 2010, we see 224 remains were recovered. The proportion of the total number of deaths accounted for by this year is about 0.0526 (i.e. $\frac{224}{4262}$). Multiply this proportion by 100 and you get the percent contribution: about 5.3% of all the recovered remains were found in 2010. One way to quickly assess the persistence of the death crisis is to inspect these percentages. If the crisis were abating, we’d expect to see a substantial decline in them. If the crisis is persistent, we’d expect them to be very similar across time. (Note that 2024 will be a small number because we only have partial data for this year.) What do you see when you look at these percentages?

md$yeardecade	n	percent
1981	1	0.02%
1982	1	0.02%
1985	3	0.07%
1987	1	0.02%
1990	9	0.21%
1991	6	0.14%
1992	7	0.16%
1993	17	0.40%
1994	4	0.09%
1995	12	0.28%
1996	13	0.31%
1997	22	0.52%
1998	15	0.35%
1999	23	0.54%
2000	75	1.76%
2001	79	1.85%
2002	151	3.54%
2003	164	3.85%
2004	186	4.36%
2005	202	4.74%
2006	174	4.08%
2007	221	5.19%
2008	166	3.89%
2009	197	4.62%
2010	224	5.26%
2011	182	4.27%
2012	163	3.82%
2013	184	4.32%
2014	140	3.28%
2015	147	3.45%
2016	164	3.85%
2017	124	2.91%
2018	128	3.00%
2019	144	3.38%
2020	223	5.23%
2021	225	5.28%
2022	173	4.06%
2023	197	4.62%
2024	95	2.23%
Total	4262	-
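To make the year-and-proportion step concrete, here is a minimal base R sketch. The four reporting dates are hypothetical, and the ISO date format is an assumption (the real Reporting Date column may be stored differently); my own code uses tabyl, which this sketch only approximates with table() and prop.table().

```r
# Hypothetical reporting dates; the real column's date format may differ.
reporting_date <- c("2010-06-14", "2010-07-02", "2021-08-19", "2023-01-05")

# Pull the four-digit calendar year out of each date.
year <- as.integer(format(as.Date(reporting_date), "%Y"))

# Count remains per year, then convert counts to proportions of the total.
n_by_year <- table(year)
prop_by_year <- prop.table(n_by_year)

print(cbind(n = n_by_year, proportion = round(prop_by_year, 3)))
```

Applied to the full table above, the same arithmetic gives 224 / 4262 for 2010, which is the 5.26% shown in the percent column.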

Visualizing migrant deaths by year

Often (almost always), it’s easier to visualize quantitative data than to look at a table of data. The code in the chunk below will create what’s known as a barplot using the data from the table we just considered. Each bar corresponds to the number of migrant remains recovered in each year. When you look at this plot, what do you see? What interpretation would you give to this? Does the plot show the crisis abating? Does it seem persistent? Is it getting worse?

Migrant deaths by decade

This plot shows the number of recovered remains by decade.
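The decade totals behind a plot like this can be recovered from the yearly table. The sketch below is one way to do it in base R, using the yearly counts copied from the table in the previous section; the decade rule (round the year down to the nearest ten) is my assumption about how the Decade variable is defined.

```r
# Yearly counts copied from the table of migrant deaths by year.
years <- c(1981, 1982, 1985, 1987, 1990:2024)
n <- c(1, 1, 3, 1,
       9, 6, 7, 17, 4, 12, 13, 22, 15, 23,
       75, 79, 151, 164, 186, 202, 174, 221, 166, 197,
       224, 182, 163, 184, 140, 147, 164, 124, 128, 144,
       223, 225, 173, 197, 95)

# Assumed decade rule: round each year down to the nearest ten.
decade <- floor(years / 10) * 10

# Sum the yearly counts within each decade.
by_decade <- tapply(n, decade, sum)
print(by_decade)
```

Keep in mind that the 2020s total covers only 2020 through July 2024, so it is not directly comparable to a full decade.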

Understanding gender and migrant deaths

How do migrant deaths and gender relate to one another? The code in the chunk below creates a “factor-level” variable recording the gender of the migrant. Since gender is not always determined, there is a category called “undetermined.” Using the tabyl function, I create a table showing the total number of remains recovered that are male, female, and undetermined. As you can see, about 81% of the total remains recovered are male.

##     md$gender    n      percent valid_percent
##          male 3443 0.8078366964    0.80840573
##        female  598 0.1403097137    0.14040855
##  undetermined  218 0.0511496950    0.05118572
##          <NA>    3 0.0007038949            NA
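The table reports two kinds of percentages, and the difference is only the denominator. Here is a small sketch using the counts printed above; the name "missing" for the three NA records is my own label.

```r
# Counts copied from the gender table; "missing" stands for the <NA> row.
n <- c(male = 3443, female = 598, undetermined = 218, missing = 3)

# percent: each count divided by all 4,262 records.
percent <- n / sum(n)

# valid_percent: the same counts divided by the non-missing records only.
valid <- n[names(n) != "missing"]
valid_percent <- valid / sum(valid)

print(round(percent, 4))
print(round(valid_percent, 4))
```

Because only three records are missing, the two columns barely differ here; with more missing data the gap can be large.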

Are there discernable trends in gender differences? There has been some speculation that if asylum seeker deaths begin to rise, then more females will die given a large share of asylum seekers are women. Here is a link to a recent article on this: https://19thnews.org/2024/07/women-migrants-deaths-us-mexico-border/

Sometimes, raw numbers are harder to interpret than percentages. The chunk below reports the gender data in terms of percentages. To understand this, consider the year 2023. In this year, of the remains recovered, 73% were male, 19% were female, and 8% were undetermined. Do you see any patterns in the data?

yeardecade/gender	male	female	undetermined	NA_
1981	0% (0)	0% (0)	100% (1)	0% (0)
1982	0% (0)	100% (1)	0% (0)	0% (0)
1985	100% (3)	0% (0)	0% (0)	0% (0)
1987	100% (1)	0% (0)	0% (0)	0% (0)
1990	56% (5)	0% (0)	44% (4)	0% (0)
1991	67% (4)	0% (0)	33% (2)	0% (0)
1992	71% (5)	0% (0)	29% (2)	0% (0)
1993	59% (10)	0% (0)	41% (7)	0% (0)
1994	50% (2)	0% (0)	50% (2)	0% (0)
1995	42% (5)	0% (0)	58% (7)	0% (0)
1996	23% (3)	0% (0)	77% (10)	0% (0)
1997	18% (4)	5% (1)	77% (17)	0% (0)
1998	20% (3)	0% (0)	80% (12)	0% (0)
1999	22% (5)	4% (1)	74% (17)	0% (0)
2000	76% (57)	24% (18)	0% (0)	0% (0)
2001	75% (59)	24% (19)	1% (1)	0% (0)
2002	76% (115)	24% (36)	0% (0)	0% (0)
2003	80% (131)	20% (33)	0% (0)	0% (0)
2004	81% (150)	19% (35)	1% (1)	0% (0)
2005	81% (163)	19% (39)	0% (0)	0% (0)
2006	82% (143)	18% (31)	0% (0)	0% (0)
2007	77% (171)	23% (50)	0% (0)	0% (0)
2008	80% (133)	20% (33)	0% (0)	0% (0)
2009	85% (167)	14% (27)	2% (3)	0% (0)
2010	87% (195)	13% (29)	0% (0)	0% (0)
2011	87% (159)	12% (21)	1% (2)	0% (0)
2012	87% (141)	12% (20)	1% (2)	0% (0)
2013	90% (166)	10% (18)	0% (0)	0% (0)
2014	91% (127)	8% (11)	1% (2)	0% (0)
2015	88% (130)	8% (12)	3% (5)	0% (0)
2016	93% (152)	6% (10)	1% (2)	0% (0)
2017	98% (121)	2% (3)	0% (0)	0% (0)
2018	90% (115)	9% (12)	1% (1)	0% (0)
2019	88% (127)	8% (12)	3% (5)	0% (0)
2020	82% (182)	8% (18)	10% (23)	0% (0)
2021	76% (170)	13% (30)	10% (22)	1% (3)
2022	77% (134)	13% (23)	9% (16)	0% (0)
2023	73% (144)	19% (38)	8% (15)	0% (0)
2024	43% (41)	18% (17)	39% (37)	0% (0)
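Each row of this table is a set of row percentages: the counts for a year divided by that year's total. A minimal sketch with four years copied from the table (the choice of years is arbitrary):

```r
# Counts for a few years, copied from the table above.
counts <- rbind(
  "2000" = c(male = 57,  female = 18, undetermined = 0),
  "2010" = c(male = 195, female = 29, undetermined = 0),
  "2017" = c(male = 121, female = 3,  undetermined = 0),
  "2023" = c(male = 144, female = 38, undetermined = 15)
)

# margin = 1 divides every count by its row (year) total.
row_pct <- round(100 * prop.table(counts, margin = 1))
print(row_pct)
```

Reading across the 2023 row gives 73, 19, and 8, matching the table.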

Visualizing deaths by gender

I created a summary data set of remains recovered for the years 2000 to 2024 (noting that 2024 is incomplete). This summary data set makes it easy to visualize migrant deaths by year and gender.

## New names:
## Rows: 26 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (4): year20, male, female, undetermined lgl (2): ...5, ...6
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...5`
## • `` -> `...6`

The code in the chunk below uses my new data set to create a line plot showing the percentage of migrant deaths accounted for by each gender. What do you see in this plot? How would you characterize the relationship between time, deaths, and gender? (Note that the code produces some warnings; these can be disregarded as they have no bearing on the plot.)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

To see the total number of migrant deaths by gender, the code in the chunk below gives a barplot of total deaths. What is the key takeaway here?

Here is another way to visualize migrant deaths by gender. The code in the chunk below creates what is known as a “stacked bar plot.” Each color code corresponds to gender classification. The total height of the bar corresponds to the total number of remains recovered by year. The color codes represent the number associated with males, females, and those whose gender is undetermined.
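For anyone who wants to try a stacked bar plot themselves, here is a rough base R sketch. It is not the code behind my plot (which uses ggplot2); the colors are arbitrary, and the counts are the 2020 to 2023 rows of the gender table.

```r
# Rows are gender categories, columns are years (counts from the gender table).
counts <- cbind(
  "2020" = c(male = 182, female = 18, undetermined = 23),
  "2021" = c(male = 170, female = 30, undetermined = 22),
  "2022" = c(male = 134, female = 23, undetermined = 16),
  "2023" = c(male = 144, female = 38, undetermined = 15)
)

# With a matrix, barplot() stacks the rows within each column by default,
# so the full bar height is the yearly total across the three categories.
barplot(counts,
        col = c("steelblue", "tomato", "grey70"),
        legend.text = TRUE,
        xlab = "Year", ylab = "Remains recovered")
```

The 2021 bar reaches 222 rather than 225 because the three records with a missing gender code are left out of this sketch.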

Age of deceased migrants

The OpenGIS data records the age of the migrant if this determination is possible. To understand the relationship between age and migrant deaths, I’ve created a new variable recording age groups. These groups are in 5-year increments with the exception of the first group (0-9 years) and the last group (60 and over). We see that about 40% of the migrant deaths have an indeterminate age. The largest concentration of migrant deaths (about 32% of the total) is in the age range of 20 to 34 years. This percentage is computed with all of the individuals whose age cannot be determined included in the total. What else do we see? What age groups contribute to most of the deaths?

md$agegroup	n	percent
0-9	7	0%
10-14	21	0%
15-19	246	6%
20-24	471	11%
25-29	459	11%
30-34	446	10%
35-39	372	9%
40-44	251	6%
45-49	155	4%
49-54	76	2%
55-59	36	1%
60 and over	21	0%
NA	1701	40%
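Binning a numeric age into groups like these is commonly done with cut(). The sketch below is one way to do it, not necessarily how my agegroup variable was built: the ages are made up, and the break points and labels are assumptions based on the group definitions described above.

```r
# Made-up ages, including one missing value.
age <- c(3, 12, 19, 20, 24, 34, 50, 60, 71, NA)

# Assumed bins: 0-9, then 5-year groups, then 60 and over.
breaks <- c(0, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, Inf)
labels <- c("0-9", "10-14", "15-19", "20-24", "25-29", "30-34",
            "35-39", "40-44", "45-49", "50-54", "55-59", "60 and over")

# right = FALSE makes each bin include its lower bound, e.g. [20, 25).
agegroup <- cut(age, breaks = breaks, labels = labels, right = FALSE)

# useNA = "ifany" keeps the undetermined ages visible in the tabulation.
print(table(agegroup, useNA = "ifany"))
```

Missing ages stay NA after binning, which is why the table has an NA row rather than dropping those records.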

Suppose we want to tabulate age group deaths by year. This is what the code in the chunk below is doing. To understand the resulting table, consider the year 2006 and the age group 20-24 years of age. This group accounted for about 13% of that year’s migrant deaths. For most years, those whose age is indeterminate account for a large share of migrant deaths. Note that the share of “NAs” seems to be increasing with time. What do you make of this?

yeardecade/agegroup	0-9	10-14	15-19	20-24	25-29	30-34	35-39	40-44	45-49	49-54	55-59	60 and over	NA	Total
1981	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (1)	100% (1)
1982	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (1)	100% (1)
1985	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (3)	100% (3)
1987	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (1)	100% (1)
1990	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (9)	100% (9)
1991	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (6)	100% (6)
1992	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (7)	100% (7)
1993	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (17)	100% (17)
1994	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (4)	100% (4)
1995	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (12)	100% (12)
1996	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (13)	100% (13)
1997	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (22)	100% (22)
1998	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (15)	100% (15)
1999	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	0% (0)	100% (23)	100% (23)
2000	0% (0)	4% (3)	11% (8)	17% (13)	19% (14)	12% (9)	4% (3)	8% (6)	5% (4)	1% (1)	0% (0)	0% (0)	19% (14)	100% (75)
2001	0% (0)	0% (0)	6% (5)	19% (15)	13% (10)	13% (10)	6% (5)	9% (7)	6% (5)	3% (2)	0% (0)	0% (0)	25% (20)	100% (79)
2002	0% (0)	3% (4)	11% (17)	15% (22)	13% (20)	11% (17)	11% (17)	7% (10)	5% (8)	3% (4)	1% (2)	0% (0)	20% (30)	100% (151)
2003	1% (2)	0% (0)	10% (16)	15% (25)	14% (23)	10% (17)	9% (15)	8% (13)	2% (3)	1% (2)	2% (3)	1% (2)	26% (43)	100% (164)
2004	0% (0)	1% (1)	7% (13)	13% (24)	17% (31)	9% (16)	10% (18)	10% (18)	3% (5)	2% (3)	2% (3)	1% (1)	28% (53)	100% (186)
2005	0% (1)	1% (3)	10% (20)	15% (30)	11% (23)	12% (25)	10% (21)	4% (9)	5% (11)	3% (7)	0% (1)	0% (0)	25% (51)	100% (202)
2006	1% (1)	3% (5)	8% (14)	13% (23)	10% (17)	11% (19)	7% (13)	7% (12)	2% (4)	2% (3)	1% (2)	1% (2)	34% (59)	100% (174)
2007	0% (0)	0% (1)	8% (18)	9% (20)	13% (28)	14% (30)	13% (28)	8% (17)	2% (5)	3% (6)	1% (2)	1% (3)	29% (63)	100% (221)
2008	1% (1)	1% (1)	10% (17)	13% (22)	11% (18)	13% (21)	6% (10)	8% (13)	4% (7)	4% (7)	1% (1)	1% (1)	28% (47)	100% (166)
2009	0% (0)	1% (1)	8% (16)	10% (20)	14% (27)	15% (29)	11% (21)	3% (6)	2% (4)	2% (4)	2% (3)	1% (1)	33% (65)	100% (197)
2010	0% (0)	0% (0)	5% (12)	12% (27)	16% (35)	11% (24)	13% (28)	8% (18)	5% (12)	1% (2)	0% (0)	0% (0)	29% (66)	100% (224)
2011	0% (0)	0% (0)	5% (9)	9% (16)	12% (21)	12% (21)	11% (20)	7% (12)	3% (5)	1% (2)	1% (1)	1% (1)	41% (74)	100% (182)
2012	0% (0)	1% (1)	4% (7)	6% (9)	13% (22)	12% (19)	9% (15)	8% (13)	4% (7)	2% (3)	1% (1)	0% (0)	40% (66)	100% (163)
2013	0% (0)	0% (0)	5% (9)	9% (16)	10% (19)	10% (18)	12% (22)	5% (10)	3% (5)	2% (3)	1% (2)	1% (1)	43% (79)	100% (184)
2014	0% (0)	0% (0)	4% (6)	6% (9)	8% (11)	6% (9)	5% (7)	6% (9)	4% (5)	3% (4)	1% (2)	0% (0)	56% (78)	100% (140)
2015	0% (0)	0% (0)	6% (9)	12% (18)	8% (12)	12% (17)	12% (17)	5% (7)	2% (3)	3% (4)	1% (2)	1% (1)	39% (57)	100% (147)
2016	0% (0)	0% (0)	2% (3)	13% (21)	9% (14)	12% (19)	10% (16)	2% (4)	4% (6)	1% (1)	2% (3)	1% (2)	46% (75)	100% (164)
2017	0% (0)	0% (0)	2% (3)	9% (11)	10% (12)	11% (14)	5% (6)	2% (3)	5% (6)	1% (1)	1% (1)	0% (0)	54% (67)	100% (124)
2018	0% (0)	0% (0)	3% (4)	5% (7)	9% (11)	9% (11)	9% (12)	9% (11)	7% (9)	2% (2)	2% (3)	1% (1)	45% (57)	100% (128)
2019	1% (1)	0% (0)	3% (4)	13% (18)	8% (12)	10% (15)	10% (14)	3% (4)	1% (2)	3% (5)	0% (0)	1% (1)	47% (68)	100% (144)
2020	0% (0)	0% (0)	5% (11)	11% (24)	8% (17)	7% (16)	7% (16)	5% (11)	3% (7)	0% (0)	0% (0)	0% (1)	54% (120)	100% (223)
2021	0% (0)	0% (0)	4% (8)	11% (25)	10% (22)	9% (21)	8% (18)	5% (12)	4% (10)	1% (2)	1% (2)	0% (0)	47% (105)	100% (225)
2022	0% (0)	0% (0)	4% (7)	12% (20)	10% (18)	13% (22)	5% (8)	9% (15)	3% (6)	1% (1)	1% (1)	1% (1)	43% (74)	100% (173)
2023	1% (1)	1% (1)	3% (6)	14% (27)	9% (17)	10% (19)	10% (19)	4% (8)	6% (12)	4% (7)	1% (1)	1% (2)	39% (77)	100% (197)
2024	0% (0)	0% (0)	4% (4)	9% (9)	5% (5)	8% (8)	3% (3)	3% (3)	4% (4)	0% (0)	0% (0)	0% (0)	62% (59)	100% (95)
Total	0% (7)	0% (21)	4% (246)	7% (471)	7% (459)	7% (446)	6% (372)	4% (251)	2% (155)	1% (76)	1% (36)	0% (21)	60% (1,701)	100% (4,262)

It might be useful to visualize the data with a barplot. This is what I do in the following chunk. This plot includes the cases where age is indeterminate. What do you see?

Suppose we wanted to see the percentage for each age group based only on the number of migrants whose age could be determined. In other words, suppose we omitted the “NAs” from our analysis. This is what is done in the chunk below, where I subset the data by not including the NAs. Consider again the year 2006 and the age group 20-24. Here I see that of those whose age can be identified, about 20% of the remains recovered were of people aged 20 to 24.
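The 2006 example can be checked by hand. This sketch uses the 2006 row of the age-by-year table (group labels copied as printed there) and shows how the denominator changes the answer.

```r
# 2006 counts by age group, copied from the age-by-year table.
age_counts_2006 <- c("0-9" = 1, "10-14" = 5, "15-19" = 14, "20-24" = 23,
                     "25-29" = 17, "30-34" = 19, "35-39" = 13, "40-44" = 12,
                     "45-49" = 4, "49-54" = 3, "55-59" = 2, "60 and over" = 2)
unknown_2006 <- 59  # remains whose age could not be determined

# Share of ALL 2006 remains that were aged 20-24 (NAs in the denominator).
share_all <- age_counts_2006[["20-24"]] / (sum(age_counts_2006) + unknown_2006)

# Share among remains with a known age (NAs dropped from the denominator).
share_known <- age_counts_2006[["20-24"]] / sum(age_counts_2006)

print(round(100 * c(all = share_all, known = share_known), 1))
```

The count of 23 is the same in both cases; only the denominator moves, from 174 to 115.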

Suppose we are interested in seeing whether there are any trends in the average age of deceased migrants. One way to do this is through a simple regression model. This model will give us an estimate of the average age conditional on year. The code in the chunk will do this along with the resulting plot. The plot may look daunting but it’s actually simple. In order to make the x-axis cleaner, I’m just using the last digit(s) of the year to denote the year. So a “0” denotes the year 2000; a “24” denotes the year 2024. The key thing to look at is the dot plotted for each year. This dot is our estimate of the average age at death for migrants whose remains were recovered that year. So consider the year 2003 (denoted as “3”). This dot is right on the line corresponding to 30 years of age. This implies that among those whose remains were recovered in 2003, the average age at death was about 30 years. The vertical lines above and below the dot correspond to a “confidence interval.” This is sort of like a margin of error in layperson terms. Years with smaller numbers of deaths will have larger confidence intervals because there are fewer cases with which to make an estimate. What do you see in this graph? Are there any trends you spot?

## 
## Call:
## lm(formula = Age ~ YEAR, data = md)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.650  -7.832  -1.120   6.494  66.404 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   28.230      1.301  21.704 < 0.0000000000000002 ***
## YEAR2001       2.686      1.855   1.448              0.14777    
## YEAR2002       1.903      1.595   1.193              0.23306    
## YEAR2003       1.770      1.595   1.110              0.26714    
## YEAR2004       2.891      1.571   1.840              0.06585 .  
## YEAR2005       1.843      1.541   1.196              0.23178    
## YEAR2006       1.797      1.609   1.117              0.26430    
## YEAR2007       3.929      1.531   2.566              0.01036 *  
## YEAR2008       2.602      1.600   1.627              0.10389    
## YEAR2009       2.324      1.573   1.477              0.13970    
## YEAR2010       3.277      1.531   2.140              0.03246 *  
## YEAR2011       3.891      1.627   2.391              0.01686 *  
## YEAR2012       4.451      1.660   2.681              0.00738 ** 
## YEAR2013       3.990      1.635   2.439              0.01478 *  
## YEAR2014       5.416      1.832   2.956              0.00314 ** 
## YEAR2015       3.459      1.685   2.053              0.04014 *  
## YEAR2016       4.366      1.689   2.586              0.00978 ** 
## YEAR2017       2.946      1.871   1.574              0.11557    
## YEAR2018       7.404      1.773   4.175            0.0000308 ***
## YEAR2019       3.192      1.746   1.828              0.06773 .  
## YEAR2020       2.120      1.641   1.292              0.19657    
## YEAR2021       3.487      1.597   2.183              0.02913 *  
## YEAR2022       3.356      1.654   2.030              0.04248 *  
## YEAR2023       4.420      1.597   2.767              0.00569 ** 
## YEAR2024       1.632      2.135   0.764              0.44481    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.16 on 2536 degrees of freedom
##   (1701 observations deleted due to missingness)
## Multiple R-squared:  0.01574,    Adjusted R-squared:  0.006424 
## F-statistic:  1.69 on 24 and 2536 DF,  p-value: 0.01934
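To read output like this, it helps to see that a regression on a year factor simply reproduces yearly averages: the intercept is the average age in the baseline year (2000), and each YEAR coefficient is that year's difference from the baseline. In the output above, 28.230 + 1.770 gives about 30 for 2003. Here is a toy sketch with made-up ages for three years; nothing in it comes from the real data.

```r
# Made-up ages for three years (three people per year).
toy <- data.frame(
  Age  = c(20, 28, 36,   25, 30, 35,   30, 36, 42),
  YEAR = factor(rep(c(2000, 2003, 2018), each = 3))
)

# Regress age on the year factor; 2000 is the baseline level.
fit <- lm(Age ~ YEAR, data = toy)
print(coef(fit))  # intercept = 2000 mean; others = difference from 2000

# Estimated average age per year with a 95% confidence interval,
# i.e. the dot and the vertical line in a plot of yearly estimates.
new_years <- data.frame(YEAR = factor(c(2000, 2003, 2018),
                                      levels = levels(toy$YEAR)))
est <- predict(fit, newdata = new_years, interval = "confidence")
print(cbind(new_years, est))
```

The fit column holds the yearly averages, and lwr and upr are the ends of the confidence interval.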

Age and gender

Suppose we wanted to consider age and gender together. This is what I’m doing in this section. The code in the chunk below allows me to estimate the average age at death for males and females. The resulting plot has the same kind of interpretation as the one we just looked at. The difference is we have two sets of estimates: one for males and one for females. Here, those things called confidence intervals are important. If the intervals for females overlap the intervals for males, then we might conclude the difference in average age is not statistically significant. What do you see? Do you see clear gender differences?

## 
##   male female 
##   3443    598

## 
##  Welch Two Sample t-test
## 
## data:  md$Age by md$gender2
## t = 3.9991, df = 632.82, p-value = 0.00007108
## alternative hypothesis: true difference in means between group male and group female is not equal to 0
## 95 percent confidence interval:
##  1.057570 3.098214
## sample estimates:
##   mean in group male mean in group female 
##             31.71107             29.63318
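The test above compares the overall average age of males and females. As a rough sketch of how such a comparison is run and read, here is the same kind of test on made-up ages; the variable name gender2 follows the output above, and everything else is hypothetical.

```r
# Made-up ages for six males and six females.
toy <- data.frame(
  Age     = c(25, 30, 35, 40, 28, 33,   22, 27, 31, 26, 29, 24),
  gender2 = factor(rep(c("male", "female"), each = 6),
                   levels = c("male", "female"))
)

# t.test() with a formula runs Welch's two-sample test by default.
res <- t.test(Age ~ gender2, data = toy)

print(res$estimate)   # the two group means
print(res$conf.int)   # 95% interval for the male-minus-female difference

# If the interval excludes 0, the difference is significant at the 5% level.
excludes_zero <- res$conf.int[1] > 0 || res$conf.int[2] < 0
print(excludes_zero)
```

In the real output above, the interval runs from about 1.06 to 3.10 years, so it excludes zero.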

## 
## Call:
## lm(formula = Age ~ YEAR * gender2, data = md)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.284  -7.518  -1.222   6.583  66.321 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)            28.51163    1.54846  18.413 < 0.0000000000000002 ***
## YEAR2001                2.44186    2.18985   1.115             0.264922    
## YEAR2002                1.88837    1.88236   1.003             0.315865    
## YEAR2003                2.77258    1.86628   1.486             0.137505    
## YEAR2004                2.44075    1.83838   1.328             0.184409    
## YEAR2005                2.20968    1.80078   1.227             0.219912    
## YEAR2006                1.84321    1.87252   0.984             0.325040    
## YEAR2007                3.91079    1.81288   2.157             0.031083 *  
## YEAR2008                2.97291    1.86027   1.598             0.110146    
## YEAR2009                2.17019    1.82620   1.188             0.234802    
## YEAR2010                2.68262    1.77185   1.514             0.130147    
## YEAR2011                3.74090    1.85449   2.017             0.043780 *  
## YEAR2012                4.00602    1.90018   2.108             0.035110 *  
## YEAR2013                3.97258    1.86628   2.129             0.033384 *  
## YEAR2014                5.48837    2.05098   2.676             0.007500 ** 
## YEAR2015                3.26886    1.91182   1.710             0.087424 .  
## YEAR2016                4.16738    1.91588   2.175             0.029709 *  
## YEAR2017                2.66381    2.05098   1.299             0.194132    
## YEAR2018                6.88837    1.99597   3.451             0.000567 ***
## YEAR2019                3.41484    1.97836   1.726             0.084453 .  
## YEAR2020                2.20576    1.87574   1.176             0.239728    
## YEAR2021                3.52919    1.85736   1.900             0.057533 .  
## YEAR2022                3.71059    1.91588   1.937             0.052887 .  
## YEAR2023                5.05826    1.87252   2.701             0.006953 ** 
## YEAR2024                2.83453    2.52253   1.124             0.261255    
## gender2female          -0.95607    2.85054  -0.335             0.737351    
## YEAR2001:gender2female  0.81508    4.11912   0.198             0.843157    
## YEAR2002:gender2female -0.08909    3.54923  -0.025             0.979976    
## YEAR2003:gender2female -5.02045    3.62992  -1.383             0.166766    
## YEAR2004:gender2female  1.75369    3.57627   0.490             0.623916    
## YEAR2005:gender2female -2.42041    3.53920  -0.684             0.494110    
## YEAR2006:gender2female -0.76240    3.73104  -0.204             0.838104    
## YEAR2007:gender2female -0.03777    3.38662  -0.011             0.991102    
## YEAR2008:gender2female -2.57392    3.72491  -0.691             0.489628    
## YEAR2009:gender2female  0.18335    3.70801   0.049             0.960568    
## YEAR2010:gender2female  3.55130    3.78071   0.939             0.347656    
## YEAR2011:gender2female -0.62979    4.54123  -0.139             0.889713    
## YEAR2012:gender2female  2.27176    4.23443   0.536             0.591662    
## YEAR2013:gender2female -1.82814    4.41826  -0.414             0.679079    
## YEAR2014:gender2female -3.44393    5.52764  -0.623             0.533316    
## YEAR2015:gender2female -0.07442    4.71918  -0.016             0.987420    
## YEAR2016:gender2female  0.02706    4.72082   0.006             0.995427    
## YEAR2017:gender2female       NA         NA      NA                   NA    
## YEAR2018:gender2female  3.72274    5.18608   0.718             0.472927    
## YEAR2019:gender2female -3.84540    4.74653  -0.810             0.417931    
## YEAR2020:gender2female -2.48859    4.31499  -0.577             0.564173    
## YEAR2021:gender2female -0.94189    3.75330  -0.251             0.801874    
## YEAR2022:gender2female -2.54393    3.88926  -0.654             0.513113    
## YEAR2023:gender2female -3.13234    3.61287  -0.867             0.386027    
## YEAR2024:gender2female -4.39008    4.73299  -0.928             0.353731    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.15 on 2511 degrees of freedom
##   (1702 observations deleted due to missingness)
## Multiple R-squared:  0.02631,    Adjusted R-squared:  0.007698 
## F-statistic: 1.414 on 48 and 2511 DF,  p-value: 0.03274

## Warning in predict.lm(model, newdata = data_grid, type = "response", se.fit =
## se, : prediction from rank-deficient fit; attr(*, "non-estim") has doubtful
## cases

Understanding cause of death

The OpenGIS data records the likely cause of death. Several codes are given for migrant deaths. Two of these codes indicate that the cause of death is not determined. In one case, it is simply impossible to determine how the migrant died, and in the second case, only skeletal remains are found (making it extraordinarily difficult to discern cause of death). Below, I create a table of the causes of death reported in the OpenGIS data. What stands out most to you in terms of what we see in the data?

md$`Cause of Death`	n	percent	valid_percent
Asphyxia	8	0%	0%
Blunt Force Injury	219	5%	5%
Diabetes	5	0%	0%
Drowning	45	1%	1%
Drug Overdose	5	0%	0%
Exposure	1493	35%	36%
Exsanguination	1	0%	0%
Gunshot Wound	97	2%	2%
Heart Disease	23	1%	1%
Lightning Strike	1	0%	0%
Motor Vehicle Accident	25	1%	1%
Nonviable Fetus	2	0%	0%
Other Disease	20	0%	0%
Other Injury	13	0%	0%
Other Injury / Homicide	26	1%	1%
Other injury	2	0%	0%
Pending	26	1%	1%
Pregnancy Complication	1	0%	0%
Skeletal Remains	1482	35%	36%
Undetermined	641	15%	15%
c	1	0%	0%
NA	126	3%	-

Because “exposure,” “skeletal remains,” and “undetermined” are the three dominant codes given, let’s consider how these kinds of cases vary across the years. This is what I’m up to in the code below. In the table, do you see anything that stands out to you?

yeardecade/cod	Exposure	Other	Skeletal remains	Undetermined	Total
1981	0% (0)	0% (0)	0% (0)	100% (1)	100% (1)
1982	0% (0)	0% (0)	0% (0)	100% (1)	100% (1)
1985	0% (0)	33% (1)	0% (0)	67% (2)	100% (3)
1987	0% (0)	0% (0)	0% (0)	100% (1)	100% (1)
1990	0% (0)	100% (9)	0% (0)	0% (0)	100% (9)
1991	0% (0)	100% (6)	0% (0)	0% (0)	100% (6)
1992	0% (0)	100% (7)	0% (0)	0% (0)	100% (7)
1993	0% (0)	100% (17)	0% (0)	0% (0)	100% (17)
1994	0% (0)	75% (3)	0% (0)	25% (1)	100% (4)
1995	0% (0)	100% (12)	0% (0)	0% (0)	100% (12)
1996	0% (0)	92% (12)	0% (0)	8% (1)	100% (13)
1997	0% (0)	100% (22)	0% (0)	0% (0)	100% (22)
1998	0% (0)	100% (15)	0% (0)	0% (0)	100% (15)
1999	0% (0)	96% (22)	0% (0)	4% (1)	100% (23)
2000	55% (41)	32% (24)	13% (10)	0% (0)	100% (75)
2001	67% (53)	6% (5)	3% (2)	24% (19)	100% (79)
2002	61% (92)	20% (30)	5% (8)	14% (21)	100% (151)
2003	59% (97)	16% (27)	4% (7)	20% (33)	100% (164)
2004	43% (80)	28% (53)	2% (3)	27% (50)	100% (186)
2005	69% (139)	16% (32)	2% (4)	13% (27)	100% (202)
2006	41% (71)	20% (34)	6% (11)	33% (58)	100% (174)
2007	51% (112)	14% (32)	11% (25)	24% (52)	100% (221)
2008	37% (61)	22% (37)	19% (31)	22% (37)	100% (166)
2009	38% (75)	23% (45)	20% (40)	19% (37)	100% (197)
2010	42% (95)	11% (25)	27% (61)	19% (43)	100% (224)
2011	29% (52)	10% (18)	51% (92)	11% (20)	100% (182)
2012	20% (33)	11% (18)	56% (92)	12% (20)	100% (163)
2013	28% (52)	9% (16)	54% (99)	9% (17)	100% (184)
2014	11% (15)	6% (9)	75% (105)	8% (11)	100% (140)
2015	21% (31)	10% (14)	64% (94)	5% (8)	100% (147)
2016	27% (44)	5% (9)	60% (99)	7% (12)	100% (164)
2017	12% (15)	3% (4)	76% (94)	9% (11)	100% (124)
2018	20% (26)	5% (6)	65% (83)	10% (13)	100% (128)
2019	20% (29)	2% (3)	65% (93)	13% (19)	100% (144)
2020	21% (46)	9% (19)	58% (130)	13% (28)	100% (223)
2021	34% (76)	10% (23)	50% (112)	6% (14)	100% (225)
2022	29% (51)	6% (10)	52% (90)	13% (22)	100% (173)
2023	38% (74)	9% (18)	34% (67)	19% (38)	100% (197)
2024	35% (33)	9% (9)	32% (30)	24% (23)	100% (95)
Total	23% (1,493)	34% (646)	23% (1,482)	20% (641)	100% (4,262)
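The four columns in this table come from collapsing the detailed cause-of-death codes. The sketch below is one way to do that collapse, using the counts from the earlier cause-of-death table. Folding the 126 missing values into “Other” is my reading of the totals (it is what makes “Other” come out to 646), not a confirmed description of the original code.

```r
# Counts copied from the cause-of-death table; "NA" is the missing row.
cod_counts <- c(
  "Asphyxia" = 8, "Blunt Force Injury" = 219, "Diabetes" = 5, "Drowning" = 45,
  "Drug Overdose" = 5, "Exposure" = 1493, "Exsanguination" = 1,
  "Gunshot Wound" = 97, "Heart Disease" = 23, "Lightning Strike" = 1,
  "Motor Vehicle Accident" = 25, "Nonviable Fetus" = 2, "Other Disease" = 20,
  "Other Injury" = 13, "Other Injury / Homicide" = 26, "Other injury" = 2,
  "Pending" = 26, "Pregnancy Complication" = 1, "Skeletal Remains" = 1482,
  "Undetermined" = 641, "c" = 1, "NA" = 126
)

# Keep the three dominant codes; everything else becomes "Other".
main_codes <- c("Exposure", "Skeletal Remains", "Undetermined")
group <- ifelse(names(cod_counts) %in% main_codes, names(cod_counts), "Other")

# Add up the counts within each collapsed group.
collapsed <- sapply(split(cod_counts, group), sum)
print(collapsed)
print(round(100 * collapsed / sum(collapsed)))
```

The last line gives each group's share of all 4,262 records, which is a count-weighted figure.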

I created a summary data set using the information in the table we just considered. Sometimes it’s useful to plot the data. In the chunk below is the code to read in my new data set and following this is a line plot of the different codes given for “cause of death.” What do you see? Compare deaths due to exposure versus deaths that are not determined because of skeletal remains. What do you make of this? Do you see trends?

## Rows: 25 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): year20, Exposure, Other, Skeletal remains, Undetermined
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

##      year20        Exposure          Other         Skeletal remains 
##  Min.   :2000   Min.   :0.1071   Min.   :0.02083   Min.   :0.01613  
##  1st Qu.:2006   1st Qu.:0.2097   1st Qu.:0.06404   1st Qu.:0.10065  
##  Median :2012   Median :0.3526   Median :0.10056   Median :0.41894  
##  Mean   :2012   Mean   :0.3636   Mean   :0.12646   Mean   :0.36347  
##  3rd Qu.:2017   3rd Qu.:0.4533   3rd Qu.:0.17233   3rd Qu.:0.58813  
##  Max.   :2023   Max.   :0.6881   Max.   :0.32000   Max.   :0.75806  
##  NA's   :1      NA's   :1        NA's   :1         NA's   :1        
##   Undetermined    
##  Min.   :0.00000  
##  1st Qu.:0.09147  
##  Median :0.12956  
##  Mean   :0.14649  
##  3rd Qu.:0.19497  
##  Max.   :0.33333  
##  NA's   :1

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).