The data on migrant deaths comes from the Arizona OpenGIS Project, cosponsored by Humane Borders and the Pima County Office of Medical Examiner. I’ve edited the data set to include all recorded migrant deaths from 2000 through July of 2024. The objective of this project is to give you continued experience in working with and making sense of quantitative data while, at the same time, giving you the opportunity to understand the nature of the migrant death issue at a deep level. Above all else, my goal is always to honor and humanize those who have died in the desert. These data permit a better understanding of the death crisis.
For purposes of this project, I have compiled a number of plots of the data highlighting different features of the migrant death crisis. Apart from an overview of the total number of migrant deaths recorded, I have compiled information on gender, age, and cause of death, as well as other factors. Your job is to tell a story. Imagine you are asked to summarize the information for an audience who knows nothing about the migrant death crisis. How would you proceed? What would you highlight?
Your job is to take what I have provided, analyze it, and tell your story. I want to know something important, interesting, and useful about the migrant death crisis. I will not tell you what to look for; your job is to think critically and analytically. In class, I’ll discuss examples of what I’d look for. Ultimately, when presented with information, you need practice in learning how to engage it. But some prompts might be: what are the characteristics of most migrant deaths? What trends do you observe? Are there gender differences? Do attributes of migrant deaths change with respect to time?
This assignment will be worth 500 points. 100 points will be based on writing. I’m expecting analysis, not summaries. 400 points will be based on creativity. What do you choose to engage, and how? What should we learn about the migrant death crisis based on your analysis? Understand, you are probably THE ONLY college class in the world looking at these data. What will you tell the rest of the world about the migrant death crisis?
What do I mean by creativity? To be blunt, creativity is not repeating numbers or statistics you see in a table. I don’t need anyone to do that, as I can see them with my own eyes. Creativity paints a picture of the human side of the death crisis, grounded in the observed data. I’m looking for analysis that has a natural flow. Students are used to answering, by rote, the questions asked of them. In turn, answers are flat and usually nonanalytical. That’s not the student’s fault; it’s the fault of the person denying the student creativity. Creativity means bringing in information and context from outside the confines of the specific charts, plots, or questions. If you’re wondering what page length I’m imagining: assuming you cut and paste some of the charts or plots, I would expect the analysis, with charts and plots included, to be 5 pages or thereabouts. I’d strongly encourage bringing in external information (I’ve sent you links to sources and there are sources on the syllabus) to augment your analysis. In the end, if you simply report the results I’ve already created for you (i.e. you just reproduce in words what I see in the tables or charts), then I would not expect anything above a “C”-level grade. In class, we went over in great detail tips on how to avoid these problems, so follow those tips and you’ll be well on your way.
For the submission, do not submit an R Markdown file. I want a holistic submission that I can read from start to finish.
Extra credit bonus

If you are interested in producing your own analysis using R, I will offer up to 10% extra credit if you take the data and do something interesting with it. The operative words are “up to.” There’s no guarantee that if you do something, you’ll get the full 10%. In order to do this, you will need to use the .RMD file and edit it as needed. We teach R Markdown in POL 51 (well, I do), so if you know how to use it, use it. Skills like this are real skills to develop. BUT, there is no requirement to do anything above and beyond what the core assignment is asking. I’m trying to incentivize the use of statistical computing.
This assignment is worth 500 points and is due by 11:59 PM on December 13.
In this document, I make references to “chunks” of code. For purposes of my HTML file, I’ve suppressed this code from showing up in the resulting file. If you examine the .RMD file, you will see the code (should you want to see it).
The data file is a csv file saved to my GitHub site. Should you choose to use R directly to reproduce what I have, this code will permit direct access to the data. You are not required to use R.
The data on migrant deaths comes from the Arizona OpenGIS Project, cosponsored by Humane Borders and the Pima County Office of Medical Examiner. Recorded remains span the years 1981 to 2024. However, systematic data on migrant deaths are available only from 2000 onward. This is primarily because the death crisis really doesn’t emerge until around the year 2000.
The reason the crisis was nonexistent before 2000 is that very, very few people crossed through Arizona. This changed due to policy changes made during the Clinton Administration that led to the funnel effect, pushing migrants into the Arizona corridor. So if one downloads the death map data, one will find that 4,263 migrant remains have been recovered (as of 7/31/2024); however, 4,127 have been recovered since the year 2000. This means that in the death map, only 136 remains are recorded as being found between 1981 and 1999 (an 18-year period). I don’t mean “only” in a demeaning way; rather, relative to the mass death crisis, the total number of deaths reported in the 18-year period between 1981 and 1999 is lower than the average yearly number of deaths after the year 2000.
The data file is a csv file saved to my GitHub site. I dynamically update the data as new information is added. This file is current through 7/31/24.
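library(readr)  # provides read_csv()
# URL of the csv file on GitHub
md_url <- "https://raw.githubusercontent.com/mightyjoemoon/POL51/main/ogis_migrant_deaths-10.csv"
# Read the file into a data frame named md
md <- read_csv(md_url)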
## Rows: 4262 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): ML Number, Name, Sex, Reporting Date, Surface Management, Location...
## dbl (8): Age, Decade, Corridor Code, Condition Code, Latitude, Longitude, U...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Because we may want to look at yearly data, it’s useful to generate a variable that records the calendar year in which migrant remains were found. The code in the chunk below does just this. In addition to creating the new variable, I use the R command tabyl to produce a table of migrant deaths by year. In all, there are 4,262 recorded deaths in the data file. The table shows the number of remains recovered by year along with the percentage of total deaths each year accounts for. So in 2010, we see 224 remains were recovered. The proportion of the total number of deaths accounted for by this year is about 0.0526 (i.e. \(\frac{224}{4262}\)). Multiply this proportion by 100 and you get the percent contribution: about 5.3% of all the recovered remains were found in 2010. One way to quickly assess the persistence of the death crisis is to inspect these percentages. If the crisis were abating, we’d expect to see a substantial decline over time. If the crisis is persistent, we’d expect the percentages to be very similar across time. (Note that 2024 will be a very small number because we only have partial data for this year.) What do you see when you look at these percentages?
| md$yeardecade | n | percent |
|---|---|---|
| 1981 | 1 | 0.02% |
| 1982 | 1 | 0.02% |
| 1985 | 3 | 0.07% |
| 1987 | 1 | 0.02% |
| 1990 | 9 | 0.21% |
| 1991 | 6 | 0.14% |
| 1992 | 7 | 0.16% |
| 1993 | 17 | 0.40% |
| 1994 | 4 | 0.09% |
| 1995 | 12 | 0.28% |
| 1996 | 13 | 0.31% |
| 1997 | 22 | 0.52% |
| 1998 | 15 | 0.35% |
| 1999 | 23 | 0.54% |
| 2000 | 75 | 1.76% |
| 2001 | 79 | 1.85% |
| 2002 | 151 | 3.54% |
| 2003 | 164 | 3.85% |
| 2004 | 186 | 4.36% |
| 2005 | 202 | 4.74% |
| 2006 | 174 | 4.08% |
| 2007 | 221 | 5.19% |
| 2008 | 166 | 3.89% |
| 2009 | 197 | 4.62% |
| 2010 | 224 | 5.26% |
| 2011 | 182 | 4.27% |
| 2012 | 163 | 3.82% |
| 2013 | 184 | 4.32% |
| 2014 | 140 | 3.28% |
| 2015 | 147 | 3.45% |
| 2016 | 164 | 3.85% |
| 2017 | 124 | 2.91% |
| 2018 | 128 | 3.00% |
| 2019 | 144 | 3.38% |
| 2020 | 223 | 5.23% |
| 2021 | 225 | 5.28% |
| 2022 | 173 | 4.06% |
| 2023 | 197 | 4.62% |
| 2024 | 95 | 2.23% |
| Total | 4262 | - |
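The percentage column can be checked by hand. The following is a minimal base-R sketch, not the suppressed chunk from the .RMD file: the counts for 2000 to 2024 are typed in from the table above, and the object names (`year`, `n`, `deaths`) are illustrative assumptions.

```r
# Counts for 2000-2024, copied from the table above
year <- 2000:2024
n <- c(75, 79, 151, 164, 186, 202, 174, 221, 166, 197,
       224, 182, 163, 184, 140, 147, 164, 124, 128, 144,
       223, 225, 173, 197, 95)

# Denominator: all rows in the data file (the "Total" row of the table)
total_all_years <- 4262

# Each year's share of all recovered remains, as a percentage
deaths <- data.frame(year, n, percent = round(100 * n / total_all_years, 2))

# Look up a single year
deaths[deaths$year == 2010, ]
```

The 2010 row reproduces the 5.26% shown in the table. Scanning `deaths$percent` from top to bottom is one quick way to judge whether the yearly shares are falling or holding steady.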
Often (almost always), it’s easier to visualize quantitative data than to read a table of data. The code in the chunk below will create what’s known as a barplot using the data from the table we just considered. Each bar corresponds to the number of migrant remains recovered in each year. When you look at this plot, what do you see? What interpretation would you give to this? Does the plot show the crisis abating? Does it seem persistent? Is it getting worse?
This plot shows the number of recovered remains by decade.
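Both plots can be approximated with base R's `barplot()`. This is a sketch under stated assumptions: the original chunk is hidden and may use different tools or styling, the counts are copied from the yearly table for 2000 to 2024, and the variable names are illustrative.

```r
# Counts for 2000-2024, copied from the yearly table
year <- 2000:2024
n <- c(75, 79, 151, 164, 186, 202, 174, 221, 166, 197,
       224, 182, 163, 184, 140, 147, 164, 124, 128, 144,
       223, 225, 173, 197, 95)

# One bar per year; las = 2 turns the year labels sideways so they fit
mids <- barplot(n, names.arg = year, las = 2,
                ylab = "Remains recovered",
                main = "Recovered remains by year")

# Collapse years into decades (2000, 2010, 2020) and sum within each
decade <- floor(year / 10) * 10
by_decade <- tapply(n, decade, sum)

barplot(by_decade, xlab = "Decade", ylab = "Remains recovered",
        main = "Recovered remains by decade")
```

With these counts the decade totals are 1,615 for the 2000s, 1,600 for the 2010s, and 913 for 2020 through July 2024. The last bar covers fewer than five years, so it is not directly comparable to the other two.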
How do migrant deaths and gender relate to one another? The code in the chunk below creates a “factor-level” variable recording the gender of the migrant. Since gender is not always determined, there is a category called “undetermined.” Using the tabyl function, I create a table showing the total number of remains recovered that are male, female, and undetermined. As you can see, about 81% of the total remains recovered are male.
## md$gender n percent valid_percent
## male 3443 0.8078366964 0.80840573
## female 598 0.1403097137 0.14040855
## undetermined 218 0.0511496950 0.05118572
## <NA> 3 0.0007038949 NA
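The two percentage columns differ only in their denominator. As a small worked example (counts copied from the output above; names are illustrative), `percent` divides by every row in the file, while `valid_percent` drops the three rows where gender is missing:

```r
# Counts from the gender table above
counts <- c(male = 3443, female = 598, undetermined = 218)
n_missing <- 3  # rows where gender is <NA>

# percent: share of all 4,262 rows, including the missing ones
percent <- counts / (sum(counts) + n_missing)

# valid_percent: share of the 4,259 rows with a recorded gender code
valid_percent <- counts / sum(counts)

round(cbind(percent, valid_percent), 4)
```

Because only three rows are missing, the two columns are almost identical here. The gap is much larger for variables such as age, where a large share of values is missing.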
Are there discernable trends in gender differences? There has been some speculation that if asylum seeker deaths begin to rise, then more females will die given a large share of asylum seekers are women. Here is a link to a recent article on this: https://19thnews.org/2024/07/women-migrants-deaths-us-mexico-border/
Sometimes, raw numbers are harder to interpret than percentages. The chunk below reports the gender data in terms of percentages. To understand this, consider the year 2023. In this year, of the remains recovered, 73% were male, 19% were female, and 8% were undetermined. Do you see any patterns in the data?
| yeardecade/gender | male | female | undetermined | NA_ |
|---|---|---|---|---|
| 1981 | 0% (0) | 0% (0) | 100% (1) | 0% (0) |
| 1982 | 0% (0) | 100% (1) | 0% (0) | 0% (0) |
| 1985 | 100% (3) | 0% (0) | 0% (0) | 0% (0) |
| 1987 | 100% (1) | 0% (0) | 0% (0) | 0% (0) |
| 1990 | 56% (5) | 0% (0) | 44% (4) | 0% (0) |
| 1991 | 67% (4) | 0% (0) | 33% (2) | 0% (0) |
| 1992 | 71% (5) | 0% (0) | 29% (2) | 0% (0) |
| 1993 | 59% (10) | 0% (0) | 41% (7) | 0% (0) |
| 1994 | 50% (2) | 0% (0) | 50% (2) | 0% (0) |
| 1995 | 42% (5) | 0% (0) | 58% (7) | 0% (0) |
| 1996 | 23% (3) | 0% (0) | 77% (10) | 0% (0) |
| 1997 | 18% (4) | 5% (1) | 77% (17) | 0% (0) |
| 1998 | 20% (3) | 0% (0) | 80% (12) | 0% (0) |
| 1999 | 22% (5) | 4% (1) | 74% (17) | 0% (0) |
| 2000 | 76% (57) | 24% (18) | 0% (0) | 0% (0) |
| 2001 | 75% (59) | 24% (19) | 1% (1) | 0% (0) |
| 2002 | 76% (115) | 24% (36) | 0% (0) | 0% (0) |
| 2003 | 80% (131) | 20% (33) | 0% (0) | 0% (0) |
| 2004 | 81% (150) | 19% (35) | 1% (1) | 0% (0) |
| 2005 | 81% (163) | 19% (39) | 0% (0) | 0% (0) |
| 2006 | 82% (143) | 18% (31) | 0% (0) | 0% (0) |
| 2007 | 77% (171) | 23% (50) | 0% (0) | 0% (0) |
| 2008 | 80% (133) | 20% (33) | 0% (0) | 0% (0) |
| 2009 | 85% (167) | 14% (27) | 2% (3) | 0% (0) |
| 2010 | 87% (195) | 13% (29) | 0% (0) | 0% (0) |
| 2011 | 87% (159) | 12% (21) | 1% (2) | 0% (0) |
| 2012 | 87% (141) | 12% (20) | 1% (2) | 0% (0) |
| 2013 | 90% (166) | 10% (18) | 0% (0) | 0% (0) |
| 2014 | 91% (127) | 8% (11) | 1% (2) | 0% (0) |
| 2015 | 88% (130) | 8% (12) | 3% (5) | 0% (0) |
| 2016 | 93% (152) | 6% (10) | 1% (2) | 0% (0) |
| 2017 | 98% (121) | 2% (3) | 0% (0) | 0% (0) |
| 2018 | 90% (115) | 9% (12) | 1% (1) | 0% (0) |
| 2019 | 88% (127) | 8% (12) | 3% (5) | 0% (0) |
| 2020 | 82% (182) | 8% (18) | 10% (23) | 0% (0) |
| 2021 | 76% (170) | 13% (30) | 10% (22) | 1% (3) |
| 2022 | 77% (134) | 13% (23) | 9% (16) | 0% (0) |
| 2023 | 73% (144) | 19% (38) | 8% (15) | 0% (0) |
| 2024 | 43% (41) | 18% (17) | 39% (37) | 0% (0) |
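Each row of this table is a set of within-year percentages. A minimal sketch of the same calculation in base R, using the 2022 and 2023 counts from the table (the hidden chunk may well use a different function; the names here are illustrative):

```r
# Counts by gender for two years, copied from the table above
counts <- rbind(
  "2022" = c(male = 134, female = 23, undetermined = 16),
  "2023" = c(male = 144, female = 38, undetermined = 15)
)

# margin = 1 divides each cell by its row total, so every row sums to 1
row_share <- prop.table(counts, margin = 1)

# Express as whole percentages
pct <- round(100 * row_share)
pct
```

The 2023 row comes out as 73, 19, and 8, matching the table. Row percentages make years with very different totals comparable, which is what a question about trends requires.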
I created a summary data set of remains recovered for the years 2000 to 2024 (noting that 2024 is incomplete). This summary data set makes it easy to visualize migrant deaths by year and gender.
## New names:
## Rows: 26 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (4): year20, male, female, undetermined lgl (2): ...5, ...6
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...5`
## • `` -> `...6`
The code in the chunk below uses my new data set to create a line plot showing the percentage of migrant deaths accounted for by gender. What do you see in this plot? How would you characterize the relationship between time, deaths, and gender? (Note that the code produces some warnings; these can be disregarded as they have no bearing on the plot.)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
To see the total number of migrant deaths by gender, the code in the chunk below gives a barplot of total deaths. What is the key takeaway point here?
Here is another way to visualize migrant deaths by gender. The code in the chunk below creates what is known as a “stacked bar plot.” Each color code corresponds to gender classification. The total height of the bar corresponds to the total number of remains recovered by year. The color codes represent the number associated with males, females, and those whose gender is undetermined.
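A stacked bar plot of this kind can be sketched in base R by passing `barplot()` a matrix with one row per category and one column per year. The example below uses only the 2020 to 2023 counts from the gender table; the colors and names are assumptions, and the original chunk is not shown, so it may be built differently.

```r
# Rows are gender categories, columns are years (counts from the gender table)
counts <- rbind(
  male         = c(182, 170, 134, 144),
  female       = c(18, 30, 23, 38),
  undetermined = c(23, 22, 16, 15)
)
colnames(counts) <- 2020:2023

# With a matrix and the default beside = FALSE, barplot() stacks the rows
barplot(counts,
        col = c("steelblue", "tomato", "grey60"),
        legend.text = rownames(counts),
        args.legend = list(x = "topright"),
        xlab = "Year", ylab = "Remains recovered")

# Height of each stacked bar
colSums(counts)
```

The 2021 bar totals 222 rather than the 225 in the yearly table because the three remains with a missing gender code are left out of this sketch.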
The OpenGIS data records the age of the migrant if this determination is possible. To understand the relationship between age and migrant deaths, I’ve created a new variable recording age groups. These groups are in 5-year increments with the exception of the first group (0-9 years) and the last group (60 and over). We see that about 40% of the migrant deaths have an indeterminate age. We can also see that the largest concentration of migrant deaths (about 32% of the total) is in the age range of 20 to 34 years of age. This percentage is based on including all of the individuals whose age cannot be determined. What else do we see? What age groups contribute to most of the deaths?
| md$agegroup | n | percent |
|---|---|---|
| 0-9 | 7 | 0% |
| 10-14 | 21 | 0% |
| 15-19 | 246 | 6% |
| 20-24 | 471 | 11% |
| 25-29 | 459 | 11% |
| 30-34 | 446 | 10% |
| 35-39 | 372 | 9% |
| 40-44 | 251 | 6% |
| 45-49 | 155 | 4% |
| 49-54 | 76 | 2% |
| 55-59 | 36 | 1% |
| 60 and over | 21 | 0% |
| NA | 1701 | 40% |
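Grouping a numeric age into bands is typically done with `cut()`. The sketch below is one way to build groups like those in the table; it is not the hidden chunk. The `ages` vector is hypothetical, and the label "50-54" is an assumption for the band that the table displays as "49-54".

```r
# Band edges: 0, 10, then every 5 years up to 60, then open-ended
breaks <- c(0, 10, seq(15, 60, by = 5), Inf)
labels <- c("0-9", "10-14", "15-19", "20-24", "25-29", "30-34",
            "35-39", "40-44", "45-49", "50-54", "55-59", "60 and over")

# Hypothetical ages for illustration only (not taken from the data)
ages <- c(5, 12, 22, 34, 60, NA)

# right = FALSE makes each band include its lower edge: [20, 25) is "20-24"
agegroup <- cut(ages, breaks = breaks, labels = labels, right = FALSE)
table(agegroup, useNA = "ifany")

# Share of all 4,262 remains in the 20-34 range, using the table counts
share_20_34 <- (471 + 459 + 446) / 4262
round(100 * share_20_34, 1)
```

A missing age stays missing after `cut()`, which is why the table carries a separate NA row.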
Suppose we want to tabulate age group deaths by year. This is what the code in the chunk below is doing. To understand the resulting table, consider the year 2006 and the age group 20-24 years of age. This group accounted for about 13% of that year’s migrant deaths. For most years, those whose age is indeterminate account for a large share of migrant deaths. Note that the number of “NAs” seems to be increasing with time. What do you make of this?
| yeardecade/agegroup | 0-9 | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 | 40-44 | 45-49 | 49-54 | 55-59 | 60 and over | NA | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1981 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (1) | 100% (1) |
| 1982 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (1) | 100% (1) |
| 1985 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (3) | 100% (3) |
| 1987 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (1) | 100% (1) |
| 1990 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (9) | 100% (9) |
| 1991 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (6) | 100% (6) |
| 1992 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (7) | 100% (7) |
| 1993 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (17) | 100% (17) |
| 1994 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (4) | 100% (4) |
| 1995 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (12) | 100% (12) |
| 1996 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (13) | 100% (13) |
| 1997 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (22) | 100% (22) |
| 1998 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (15) | 100% (15) |
| 1999 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100% (23) | 100% (23) |
| 2000 | 0% (0) | 4% (3) | 11% (8) | 17% (13) | 19% (14) | 12% (9) | 4% (3) | 8% (6) | 5% (4) | 1% (1) | 0% (0) | 0% (0) | 19% (14) | 100% (75) |
| 2001 | 0% (0) | 0% (0) | 6% (5) | 19% (15) | 13% (10) | 13% (10) | 6% (5) | 9% (7) | 6% (5) | 3% (2) | 0% (0) | 0% (0) | 25% (20) | 100% (79) |
| 2002 | 0% (0) | 3% (4) | 11% (17) | 15% (22) | 13% (20) | 11% (17) | 11% (17) | 7% (10) | 5% (8) | 3% (4) | 1% (2) | 0% (0) | 20% (30) | 100% (151) |
| 2003 | 1% (2) | 0% (0) | 10% (16) | 15% (25) | 14% (23) | 10% (17) | 9% (15) | 8% (13) | 2% (3) | 1% (2) | 2% (3) | 1% (2) | 26% (43) | 100% (164) |
| 2004 | 0% (0) | 1% (1) | 7% (13) | 13% (24) | 17% (31) | 9% (16) | 10% (18) | 10% (18) | 3% (5) | 2% (3) | 2% (3) | 1% (1) | 28% (53) | 100% (186) |
| 2005 | 0% (1) | 1% (3) | 10% (20) | 15% (30) | 11% (23) | 12% (25) | 10% (21) | 4% (9) | 5% (11) | 3% (7) | 0% (1) | 0% (0) | 25% (51) | 100% (202) |
| 2006 | 1% (1) | 3% (5) | 8% (14) | 13% (23) | 10% (17) | 11% (19) | 7% (13) | 7% (12) | 2% (4) | 2% (3) | 1% (2) | 1% (2) | 34% (59) | 100% (174) |
| 2007 | 0% (0) | 0% (1) | 8% (18) | 9% (20) | 13% (28) | 14% (30) | 13% (28) | 8% (17) | 2% (5) | 3% (6) | 1% (2) | 1% (3) | 29% (63) | 100% (221) |
| 2008 | 1% (1) | 1% (1) | 10% (17) | 13% (22) | 11% (18) | 13% (21) | 6% (10) | 8% (13) | 4% (7) | 4% (7) | 1% (1) | 1% (1) | 28% (47) | 100% (166) |
| 2009 | 0% (0) | 1% (1) | 8% (16) | 10% (20) | 14% (27) | 15% (29) | 11% (21) | 3% (6) | 2% (4) | 2% (4) | 2% (3) | 1% (1) | 33% (65) | 100% (197) |
| 2010 | 0% (0) | 0% (0) | 5% (12) | 12% (27) | 16% (35) | 11% (24) | 13% (28) | 8% (18) | 5% (12) | 1% (2) | 0% (0) | 0% (0) | 29% (66) | 100% (224) |
| 2011 | 0% (0) | 0% (0) | 5% (9) | 9% (16) | 12% (21) | 12% (21) | 11% (20) | 7% (12) | 3% (5) | 1% (2) | 1% (1) | 1% (1) | 41% (74) | 100% (182) |
| 2012 | 0% (0) | 1% (1) | 4% (7) | 6% (9) | 13% (22) | 12% (19) | 9% (15) | 8% (13) | 4% (7) | 2% (3) | 1% (1) | 0% (0) | 40% (66) | 100% (163) |
| 2013 | 0% (0) | 0% (0) | 5% (9) | 9% (16) | 10% (19) | 10% (18) | 12% (22) | 5% (10) | 3% (5) | 2% (3) | 1% (2) | 1% (1) | 43% (79) | 100% (184) |
| 2014 | 0% (0) | 0% (0) | 4% (6) | 6% (9) | 8% (11) | 6% (9) | 5% (7) | 6% (9) | 4% (5) | 3% (4) | 1% (2) | 0% (0) | 56% (78) | 100% (140) |
| 2015 | 0% (0) | 0% (0) | 6% (9) | 12% (18) | 8% (12) | 12% (17) | 12% (17) | 5% (7) | 2% (3) | 3% (4) | 1% (2) | 1% (1) | 39% (57) | 100% (147) |
| 2016 | 0% (0) | 0% (0) | 2% (3) | 13% (21) | 9% (14) | 12% (19) | 10% (16) | 2% (4) | 4% (6) | 1% (1) | 2% (3) | 1% (2) | 46% (75) | 100% (164) |
| 2017 | 0% (0) | 0% (0) | 2% (3) | 9% (11) | 10% (12) | 11% (14) | 5% (6) | 2% (3) | 5% (6) | 1% (1) | 1% (1) | 0% (0) | 54% (67) | 100% (124) |
| 2018 | 0% (0) | 0% (0) | 3% (4) | 5% (7) | 9% (11) | 9% (11) | 9% (12) | 9% (11) | 7% (9) | 2% (2) | 2% (3) | 1% (1) | 45% (57) | 100% (128) |
| 2019 | 1% (1) | 0% (0) | 3% (4) | 13% (18) | 8% (12) | 10% (15) | 10% (14) | 3% (4) | 1% (2) | 3% (5) | 0% (0) | 1% (1) | 47% (68) | 100% (144) |
| 2020 | 0% (0) | 0% (0) | 5% (11) | 11% (24) | 8% (17) | 7% (16) | 7% (16) | 5% (11) | 3% (7) | 0% (0) | 0% (0) | 0% (1) | 54% (120) | 100% (223) |
| 2021 | 0% (0) | 0% (0) | 4% (8) | 11% (25) | 10% (22) | 9% (21) | 8% (18) | 5% (12) | 4% (10) | 1% (2) | 1% (2) | 0% (0) | 47% (105) | 100% (225) |
| 2022 | 0% (0) | 0% (0) | 4% (7) | 12% (20) | 10% (18) | 13% (22) | 5% (8) | 9% (15) | 3% (6) | 1% (1) | 1% (1) | 1% (1) | 43% (74) | 100% (173) |
| 2023 | 1% (1) | 1% (1) | 3% (6) | 14% (27) | 9% (17) | 10% (19) | 10% (19) | 4% (8) | 6% (12) | 4% (7) | 1% (1) | 1% (2) | 39% (77) | 100% (197) |
| 2024 | 0% (0) | 0% (0) | 4% (4) | 9% (9) | 5% (5) | 8% (8) | 3% (3) | 3% (3) | 4% (4) | 0% (0) | 0% (0) | 0% (0) | 62% (59) | 100% (95) |
| Total | 0% (7) | 0% (21) | 4% (246) | 7% (471) | 7% (459) | 7% (446) | 6% (372) | 4% (251) | 2% (155) | 1% (76) | 1% (36) | 0% (21) | 60% (1,701) | 100% (4,262) |
It might be useful to visualize the data with a barplot. This is what I do in the following chunk. This plot includes the cases where age is indeterminate. What do you see?
Suppose we wanted to see the percentage for each age group based only on the number of migrants whose age could be determined. In other words, suppose we omitted the “NAs” from our analysis. This is what is done in the chunk below, where I subset the data by not including the NAs. Consider again the year 2006 and the age group 20-24. Here I see that of those whose age can be identified, about 20% of the remains recovered were of people aged 20 to 24.
Suppose we are interested in seeing whether there are any trends in the average age of deceased migrants. One way to do this is through a simple regression model. This model will give us an estimate of the average age conditional on year. The code in the chunk will do this along with the resulting plot. The plot may look daunting but it’s actually simple. In order to make the x-axis cleaner, I’m just using the last digit(s) of the year to denote the year. So a “0” denotes the year 2000; a “24” denotes the year 2024. The key thing to look at is the dots plotted for each year. Each dot is our estimate of the average age at death for migrants whose remains were recovered in that year. So consider the year 2003 (denoted as “3”). This dot is right on the line corresponding to 30 years of age. This implies that among those whose remains were recovered in 2003, the average age at death was about 30 years of age. The vertical lines above and below the dot correspond to a “confidence interval.” This is sort of like a margin of error in layperson terms. Years with smaller numbers of deaths will have larger confidence intervals because there are fewer cases with which to make an estimate. What do you see in this graph? Are there any trends you spot?
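The 2006 example comes down to a change of denominator. As a quick worked check, using the counts in the 2006 row of the year-by-age table (names are illustrative):

```r
# 2006 row of the year-by-age table
n_2006  <- 174  # all remains recovered in 2006
n_na    <- 59   # age could not be determined
n_20_24 <- 23   # aged 20 to 24

# Share of everyone recovered that year
share_all <- n_20_24 / n_2006

# Share of those with a known age only
share_known <- n_20_24 / (n_2006 - n_na)

round(100 * c(all = share_all, known_age_only = share_known))
```

The same 23 people are 13% of all remains recovered in 2006 but 20% of those whose age is known, so it matters to state which denominator a percentage uses.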
##
## Call:
## lm(formula = Age ~ YEAR, data = md)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.650 -7.832 -1.120 6.494 66.404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.230 1.301 21.704 < 0.0000000000000002 ***
## YEAR2001 2.686 1.855 1.448 0.14777
## YEAR2002 1.903 1.595 1.193 0.23306
## YEAR2003 1.770 1.595 1.110 0.26714
## YEAR2004 2.891 1.571 1.840 0.06585 .
## YEAR2005 1.843 1.541 1.196 0.23178
## YEAR2006 1.797 1.609 1.117 0.26430
## YEAR2007 3.929 1.531 2.566 0.01036 *
## YEAR2008 2.602 1.600 1.627 0.10389
## YEAR2009 2.324 1.573 1.477 0.13970
## YEAR2010 3.277 1.531 2.140 0.03246 *
## YEAR2011 3.891 1.627 2.391 0.01686 *
## YEAR2012 4.451 1.660 2.681 0.00738 **
## YEAR2013 3.990 1.635 2.439 0.01478 *
## YEAR2014 5.416 1.832 2.956 0.00314 **
## YEAR2015 3.459 1.685 2.053 0.04014 *
## YEAR2016 4.366 1.689 2.586 0.00978 **
## YEAR2017 2.946 1.871 1.574 0.11557
## YEAR2018 7.404 1.773 4.175 0.0000308 ***
## YEAR2019 3.192 1.746 1.828 0.06773 .
## YEAR2020 2.120 1.641 1.292 0.19657
## YEAR2021 3.487 1.597 2.183 0.02913 *
## YEAR2022 3.356 1.654 2.030 0.04248 *
## YEAR2023 4.420 1.597 2.767 0.00569 **
## YEAR2024 1.632 2.135 0.764 0.44481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.16 on 2536 degrees of freedom
## (1701 observations deleted due to missingness)
## Multiple R-squared: 0.01574, Adjusted R-squared: 0.006424
## F-statistic: 1.69 on 24 and 2536 DF, p-value: 0.01934
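The regression output can be turned back into yearly averages with simple arithmetic. The sketch below assumes, based on the coefficient names, that YEAR enters the model as a categorical variable with 2000 as the baseline, so each coefficient is that year's average age minus the 2000 average. It also assumes the plotted intervals use the pooled residual standard error. The known-age counts come from the year-by-age table.

```r
# Values copied from the regression output above
intercept <- 28.230   # estimated average age in 2000 (the baseline year)
coef_2003 <- 1.770
coef_2018 <- 7.404
sigma     <- 10.16    # residual standard error
df_resid  <- 2536     # residual degrees of freedom

# Average age in a given year = baseline + that year's coefficient
mean_2003 <- intercept + coef_2003   # 30.0
mean_2018 <- intercept + coef_2018   # 35.634

# Number of remains with a known age (yearly total minus NA count)
n_known_2000 <- 75 - 14
n_known_2003 <- 164 - 43

# With one categorical predictor, the standard error of a year's mean
# is sigma / sqrt(number of cases in that year)
se_2000 <- sigma / sqrt(n_known_2000)
se_2003 <- sigma / sqrt(n_known_2003)

# Approximate 95% confidence interval for the 2003 average
ci_2003 <- mean_2003 + c(-1, 1) * qt(0.975, df_resid) * se_2003
round(ci_2003, 1)
```

As a check on the assumption, `se_2000` comes out at about 1.301, the standard error reported for the intercept. The 2003 interval runs from roughly 28.2 to 31.8, and the formula shows why years with fewer known ages have wider intervals.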
Suppose we wanted to consider age and gender. This is what I’m doing in this section. The code in the chunk below is going to allow me to estimate the average age at death for males and females. The resulting plot has the same kind of interpretation as the one we just looked at. The difference is we have two sets of estimates: one for males and one for females. Here, those things called confidence intervals are important. If the intervals for females overlap the intervals for males, then we might conclude the difference in average age is not statistically significant. What do you see? Do you see clear gender differences?
##
## male female
## 3443 598
##
## Welch Two Sample t-test
##
## data: md$Age by md$gender2
## t = 3.9991, df = 632.82, p-value = 0.00007108
## alternative hypothesis: true difference in means between group male and group female is not equal to 0
## 95 percent confidence interval:
## 1.057570 3.098214
## sample estimates:
## mean in group male mean in group female
## 31.71107 29.63318
##
## Call:
## lm(formula = Age ~ YEAR * gender2, data = md)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.284 -7.518 -1.222 6.583 66.321
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.51163 1.54846 18.413 < 0.0000000000000002 ***
## YEAR2001 2.44186 2.18985 1.115 0.264922
## YEAR2002 1.88837 1.88236 1.003 0.315865
## YEAR2003 2.77258 1.86628 1.486 0.137505
## YEAR2004 2.44075 1.83838 1.328 0.184409
## YEAR2005 2.20968 1.80078 1.227 0.219912
## YEAR2006 1.84321 1.87252 0.984 0.325040
## YEAR2007 3.91079 1.81288 2.157 0.031083 *
## YEAR2008 2.97291 1.86027 1.598 0.110146
## YEAR2009 2.17019 1.82620 1.188 0.234802
## YEAR2010 2.68262 1.77185 1.514 0.130147
## YEAR2011 3.74090 1.85449 2.017 0.043780 *
## YEAR2012 4.00602 1.90018 2.108 0.035110 *
## YEAR2013 3.97258 1.86628 2.129 0.033384 *
## YEAR2014 5.48837 2.05098 2.676 0.007500 **
## YEAR2015 3.26886 1.91182 1.710 0.087424 .
## YEAR2016 4.16738 1.91588 2.175 0.029709 *
## YEAR2017 2.66381 2.05098 1.299 0.194132
## YEAR2018 6.88837 1.99597 3.451 0.000567 ***
## YEAR2019 3.41484 1.97836 1.726 0.084453 .
## YEAR2020 2.20576 1.87574 1.176 0.239728
## YEAR2021 3.52919 1.85736 1.900 0.057533 .
## YEAR2022 3.71059 1.91588 1.937 0.052887 .
## YEAR2023 5.05826 1.87252 2.701 0.006953 **
## YEAR2024 2.83453 2.52253 1.124 0.261255
## gender2female -0.95607 2.85054 -0.335 0.737351
## YEAR2001:gender2female 0.81508 4.11912 0.198 0.843157
## YEAR2002:gender2female -0.08909 3.54923 -0.025 0.979976
## YEAR2003:gender2female -5.02045 3.62992 -1.383 0.166766
## YEAR2004:gender2female 1.75369 3.57627 0.490 0.623916
## YEAR2005:gender2female -2.42041 3.53920 -0.684 0.494110
## YEAR2006:gender2female -0.76240 3.73104 -0.204 0.838104
## YEAR2007:gender2female -0.03777 3.38662 -0.011 0.991102
## YEAR2008:gender2female -2.57392 3.72491 -0.691 0.489628
## YEAR2009:gender2female 0.18335 3.70801 0.049 0.960568
## YEAR2010:gender2female 3.55130 3.78071 0.939 0.347656
## YEAR2011:gender2female -0.62979 4.54123 -0.139 0.889713
## YEAR2012:gender2female 2.27176 4.23443 0.536 0.591662
## YEAR2013:gender2female -1.82814 4.41826 -0.414 0.679079
## YEAR2014:gender2female -3.44393 5.52764 -0.623 0.533316
## YEAR2015:gender2female -0.07442 4.71918 -0.016 0.987420
## YEAR2016:gender2female 0.02706 4.72082 0.006 0.995427
## YEAR2017:gender2female NA NA NA NA
## YEAR2018:gender2female 3.72274 5.18608 0.718 0.472927
## YEAR2019:gender2female -3.84540 4.74653 -0.810 0.417931
## YEAR2020:gender2female -2.48859 4.31499 -0.577 0.564173
## YEAR2021:gender2female -0.94189 3.75330 -0.251 0.801874
## YEAR2022:gender2female -2.54393 3.88926 -0.654 0.513113
## YEAR2023:gender2female -3.13234 3.61287 -0.867 0.386027
## YEAR2024:gender2female -4.39008 4.73299 -0.928 0.353731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.15 on 2511 degrees of freedom
## (1702 observations deleted due to missingness)
## Multiple R-squared: 0.02631, Adjusted R-squared: 0.007698
## F-statistic: 1.414 on 48 and 2511 DF, p-value: 0.03274
## Warning in predict.lm(model, newdata = data_grid, type = "response", se.fit =
## se, : prediction from rank-deficient fit; attr(*, "non-estim") has doubtful
## cases
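The Welch t-test output above can be read with a few lines of arithmetic. This sketch only re-uses the printed numbers; it does not re-run the test, and the variable names are illustrative.

```r
# Values copied from the Welch two-sample t-test output
mean_male   <- 31.71107
mean_female <- 29.63318
t_stat      <- 3.9991
ci          <- c(1.057570, 3.098214)  # 95% CI for the difference in means

# Estimated difference in average age at death (male minus female)
diff_means <- mean_male - mean_female

# The t statistic is the difference divided by its standard error,
# so the standard error can be recovered from the two
se_diff <- diff_means / t_stat

# The interval is centered on the estimated difference
ci_mid <- mean(ci)

# If the interval does not contain 0, the difference is statistically
# significant at the 5% level
excludes_zero <- ci[1] > 0 || ci[2] < 0

round(c(difference = diff_means, se = se_diff), 2)
excludes_zero
```

Pooled over all years, males with a known age were about two years older on average than females, and the interval excludes zero. The year-by-year interaction terms are far noisier, with standard errors of several years each.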
The OpenGIS data records the likely cause of death. Several codes are given for migrant deaths. Two of these codes suggest cause of death is not determined. In one case, it is simply impossible to determine how the migrant died, and in the second case, only skeletal remains are found (making it extraordinarily difficult to discern cause of death). Below, I create a table of the causes of death reported in the OpenGIS data. What stands out most to you in terms of what we see in the data?
| md$Cause of Death | n | percent | valid_percent |
|---|---|---|---|
| Asphyxia | 8 | 0% | 0% |
| Blunt Force Injury | 219 | 5% | 5% |
| Diabetes | 5 | 0% | 0% |
| Drowning | 45 | 1% | 1% |
| Drug Overdose | 5 | 0% | 0% |
| Exposure | 1493 | 35% | 36% |
| Exsanguination | 1 | 0% | 0% |
| Gunshot Wound | 97 | 2% | 2% |
| Heart Disease | 23 | 1% | 1% |
| Lightning Strike | 1 | 0% | 0% |
| Motor Vehicle Accident | 25 | 1% | 1% |
| Nonviable Fetus | 2 | 0% | 0% |
| Other Disease | 20 | 0% | 0% |
| Other Injury | 13 | 0% | 0% |
| Other Injury / Homicide | 26 | 1% | 1% |
| Other injury | 2 | 0% | 0% |
| Pending | 26 | 1% | 1% |
| Pregnancy Complication | 1 | 0% | 0% |
| Skeletal Remains | 1482 | 35% | 36% |
| Undetermined | 641 | 15% | 15% |
| c | 1 | 0% | 0% |
| NA | 126 | 3% | - |
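Collapsing this long list into a few categories is a recoding step. The sketch below shows one way to do it in base R, using the counts in the table. Folding the 126 missing values into "Other" is an assumption; it is consistent with the "Other" total of 646 reported in the year-by-cause table.

```r
# Counts by recorded cause of death, copied from the table above
counts <- c(
  "Asphyxia" = 8, "Blunt Force Injury" = 219, "Diabetes" = 5,
  "Drowning" = 45, "Drug Overdose" = 5, "Exposure" = 1493,
  "Exsanguination" = 1, "Gunshot Wound" = 97, "Heart Disease" = 23,
  "Lightning Strike" = 1, "Motor Vehicle Accident" = 25,
  "Nonviable Fetus" = 2, "Other Disease" = 20, "Other Injury" = 13,
  "Other Injury / Homicide" = 26, "Other injury" = 2, "Pending" = 26,
  "Pregnancy Complication" = 1, "Skeletal Remains" = 1482,
  "Undetermined" = 641, "c" = 1
)
n_missing <- 126  # rows with no cause recorded (the NA row)

# Keep the three dominant codes; everything else becomes "Other"
main_codes <- c("Exposure", "Skeletal Remains", "Undetermined")
cod <- ifelse(names(counts) %in% main_codes, names(counts), "Other")

# Sum the counts within each collapsed category
collapsed <- tapply(counts, cod, sum)

# Assumption: missing causes are counted as "Other"
collapsed["Other"] <- collapsed["Other"] + n_missing
collapsed
```

Collapsing also absorbs small inconsistencies in the raw codes, such as "Other Injury" and "Other injury" appearing as separate rows.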
Because “exposure,” “skeletal remains,” and “undetermined” are the three dominant codes given, let’s consider how these kinds of cases vary across the years. This is what I’m up to in the code below. In the table, do you see anything that stands out to you?
| yeardecade/cod | Exposure | Other | Skeletal remains | Undetermined | Total |
|---|---|---|---|---|---|
| 1981 | 0% (0) | 0% (0) | 0% (0) | 100% (1) | 100% (1) |
| 1982 | 0% (0) | 0% (0) | 0% (0) | 100% (1) | 100% (1) |
| 1985 | 0% (0) | 33% (1) | 0% (0) | 67% (2) | 100% (3) |
| 1987 | 0% (0) | 0% (0) | 0% (0) | 100% (1) | 100% (1) |
| 1990 | 0% (0) | 100% (9) | 0% (0) | 0% (0) | 100% (9) |
| 1991 | 0% (0) | 100% (6) | 0% (0) | 0% (0) | 100% (6) |
| 1992 | 0% (0) | 100% (7) | 0% (0) | 0% (0) | 100% (7) |
| 1993 | 0% (0) | 100% (17) | 0% (0) | 0% (0) | 100% (17) |
| 1994 | 0% (0) | 75% (3) | 0% (0) | 25% (1) | 100% (4) |
| 1995 | 0% (0) | 100% (12) | 0% (0) | 0% (0) | 100% (12) |
| 1996 | 0% (0) | 92% (12) | 0% (0) | 8% (1) | 100% (13) |
| 1997 | 0% (0) | 100% (22) | 0% (0) | 0% (0) | 100% (22) |
| 1998 | 0% (0) | 100% (15) | 0% (0) | 0% (0) | 100% (15) |
| 1999 | 0% (0) | 96% (22) | 0% (0) | 4% (1) | 100% (23) |
| 2000 | 55% (41) | 32% (24) | 13% (10) | 0% (0) | 100% (75) |
| 2001 | 67% (53) | 6% (5) | 3% (2) | 24% (19) | 100% (79) |
| 2002 | 61% (92) | 20% (30) | 5% (8) | 14% (21) | 100% (151) |
| 2003 | 59% (97) | 16% (27) | 4% (7) | 20% (33) | 100% (164) |
| 2004 | 43% (80) | 28% (53) | 2% (3) | 27% (50) | 100% (186) |
| 2005 | 69% (139) | 16% (32) | 2% (4) | 13% (27) | 100% (202) |
| 2006 | 41% (71) | 20% (34) | 6% (11) | 33% (58) | 100% (174) |
| 2007 | 51% (112) | 14% (32) | 11% (25) | 24% (52) | 100% (221) |
| 2008 | 37% (61) | 22% (37) | 19% (31) | 22% (37) | 100% (166) |
| 2009 | 38% (75) | 23% (45) | 20% (40) | 19% (37) | 100% (197) |
| 2010 | 42% (95) | 11% (25) | 27% (61) | 19% (43) | 100% (224) |
| 2011 | 29% (52) | 10% (18) | 51% (92) | 11% (20) | 100% (182) |
| 2012 | 20% (33) | 11% (18) | 56% (92) | 12% (20) | 100% (163) |
| 2013 | 28% (52) | 9% (16) | 54% (99) | 9% (17) | 100% (184) |
| 2014 | 11% (15) | 6% (9) | 75% (105) | 8% (11) | 100% (140) |
| 2015 | 21% (31) | 10% (14) | 64% (94) | 5% (8) | 100% (147) |
| 2016 | 27% (44) | 5% (9) | 60% (99) | 7% (12) | 100% (164) |
| 2017 | 12% (15) | 3% (4) | 76% (94) | 9% (11) | 100% (124) |
| 2018 | 20% (26) | 5% (6) | 65% (83) | 10% (13) | 100% (128) |
| 2019 | 20% (29) | 2% (3) | 65% (93) | 13% (19) | 100% (144) |
| 2020 | 21% (46) | 9% (19) | 58% (130) | 13% (28) | 100% (223) |
| 2021 | 34% (76) | 10% (23) | 50% (112) | 6% (14) | 100% (225) |
| 2022 | 29% (51) | 6% (10) | 52% (90) | 13% (22) | 100% (173) |
| 2023 | 38% (74) | 9% (18) | 34% (67) | 19% (38) | 100% (197) |
| 2024 | 35% (33) | 9% (9) | 32% (30) | 24% (23) | 100% (95) |
| Total | 23% (1,493) | 34% (646) | 23% (1,482) | 20% (641) | 100% (4,262) |
I created a summary data set using the information in the table we just considered. Sometimes it’s useful to plot the data. In the chunk below is the code to read in my new data set and following this is a line plot of the different codes given for “cause of death.” What do you see? Compare deaths due to exposure versus deaths that are not determined because of skeletal remains. What do you make of this? Do you see trends?
## Rows: 25 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): year20, Exposure, Other, Skeletal remains, Undetermined
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## year20 Exposure Other Skeletal remains
## Min. :2000 Min. :0.1071 Min. :0.02083 Min. :0.01613
## 1st Qu.:2006 1st Qu.:0.2097 1st Qu.:0.06404 1st Qu.:0.10065
## Median :2012 Median :0.3526 Median :0.10056 Median :0.41894
## Mean :2012 Mean :0.3636 Mean :0.12646 Mean :0.36347
## 3rd Qu.:2017 3rd Qu.:0.4533 3rd Qu.:0.17233 3rd Qu.:0.58813
## Max. :2023 Max. :0.6881 Max. :0.32000 Max. :0.75806
## NA's :1 NA's :1 NA's :1 NA's :1
## Undetermined
## Min. :0.00000
## 1st Qu.:0.09147
## Median :0.12956
## Mean :0.14649
## 3rd Qu.:0.19497
## Max. :0.33333
## NA's :1
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
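The comparison between exposure and skeletal remains can be sketched directly from the year-by-cause table. This is a base-R illustration under stated assumptions: the counts for 2000 to 2024 are typed in from that table, the plot uses `matplot()` rather than whatever the hidden chunk uses, and all names and colors are illustrative.

```r
year <- 2000:2024

# Yearly totals and counts for two cause-of-death codes (from the table)
total <- c(75, 79, 151, 164, 186, 202, 174, 221, 166, 197,
           224, 182, 163, 184, 140, 147, 164, 124, 128, 144,
           223, 225, 173, 197, 95)
exposure <- c(41, 53, 92, 97, 80, 139, 71, 112, 61, 75,
              95, 52, 33, 52, 15, 31, 44, 15, 26, 29,
              46, 76, 51, 74, 33)
skeletal <- c(10, 2, 8, 7, 3, 4, 11, 25, 31, 40,
              61, 92, 92, 99, 105, 94, 99, 94, 83, 93,
              130, 112, 90, 67, 30)

# Each code's share of that year's recovered remains
share <- cbind(Exposure = exposure / total, Skeletal = skeletal / total)

# One line per code
cols <- c("firebrick", "grey30")
matplot(year, share, type = "l", lty = 1, lwd = 2, col = cols,
        xlab = "Year", ylab = "Share of remains recovered")
legend("top", legend = colnames(share), lty = 1, lwd = 2, col = cols)

# Years in which skeletal remains outnumber exposure cases
skeletal_leads <- year[skeletal > exposure]
range(skeletal_leads)
```

In these counts, skeletal remains outnumber exposure cases in every year from 2011 through 2022, and exposure is higher again in 2023 and 2024. The smallest exposure share, about 0.107 in 2014, matches the minimum reported in the summary output.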