Today is 20 years since the terrorist attacks on New York’s World Trade Center. I thought I would take a look at the data on the US soldiers who died in Afghanistan.
This exercise illustrates the use of:
readxl package to read data from an Excel file into Rn() function to count the number of observations that belong to a particular grouplubridate package to:
R recognizes as a calendar daterbind() function to stack two data frames that have the same variablesggplot2 package to create a bar chartkableExtra package to format tabular data for displaylibrary(tidyverse)
library(readxl)
library(lubridate)
I downloaded an Excel file on the US military personnel who died in Operation Enduring Freedom (Afghanistan 2001 – 2014) from https://dcas.dmdc.osd.mil/dcas/pages/report_oef_namesalp.xhtml#. Similarly, I downloaded an Excel file on the US military personnel who died in Operation Freedom’s Sentinel (Afghanistan 2014 and after) from https://dcas.dmdc.osd.mil/dcas/pages/report_ofs_namesalp.xhtml#.
OEF.file <- "data/OEFNames of Fallen.xls"
OFS.file <- "data/OFSNames of Fallen.xls"
R using the readxl packageOEF.fallen <- read_excel(OEF.file, skip = 5, .name_repair = "universal")
The .name_repair = "universal"attribute converts the variable names in the Excel file into formats that R can deal with easily.
read_excel() into data that R recognizes as a calendar dateThis uses the ymd() function in the lubridate package.
OEF.fallen <- OEF.fallen %>%
mutate(Date.of.Death = ymd(Date.of.Death..yyyy.mm.dd.))
I used the ymd() function because my Excel dates were in the yyyymmdd format. Had they been in the ddmmyyyy format I would have used the dmy() function instead.
Also, I used the mutate() function of the dplyr package to give the date variable a nicer name.
OFS.fallen <- read_excel(OFS.file, skip = 5, .name_repair = "universal") %>%
mutate(Date.of.Death = ymd(Date.of.Death..yyyy.mm.dd.))
fallen <- rbind(OFS.fallen, OEF.fallen)
This tally of the Afghan war dead covers the period from October 10, 2001 through August 26, 2021. The last US soldiers left Afghanistan on August 31, 2021.
Here, I use the year() function of the lubridate package. (I could have used the month() function to extract the month, etc.)
fallen <- fallen %>%
mutate(Year = year(Date.of.Death)) %>%
select(Year, Age, Gender)
Also, I used the select() function of the dplyr package to select the three variables I need for this exercise.
Each observation in the fallen data frame I have created above is one soldier who died. Now, I need to count the number of the dead for each year-and-gender group. The grouping is done using the group_by() function of the dplyr package. Then the counting of observations within each group is done using the n() function.
fallen <- fallen %>%
group_by(Year, Gender) %>%
summarise(Deaths = n()) %>%
na.omit()
Finally, I use the ggplot2 package to make a bar chart:
ggplot(fallen, aes(x = Year, y = Deaths)) + # Deaths gives you the total Deaths in each year for both Genders.
geom_col(aes(fill = Gender))
I will now use the pivot_wider() command of the dplyr package to show deaths for males and females in separate columns. Then, I will use the kableExtra package to present my data in tabular form.
fallen <- fallen %>% pivot_wider(names_from = Gender, values_from = Deaths)
fallen[is.na(fallen)] = 0
fallen <- fallen %>%
mutate(TOTAL = MALE + FEMALE) %>%
knitr::kable(caption = "U.S. Deaths in Afghanistan") %>%
kableExtra::kable_styling(full_width = FALSE)
fallen
| Year | MALE | FEMALE | TOTAL |
|---|---|---|---|
| 2001 | 11 | 0 | 11 |
| 2002 | 46 | 3 | 49 |
| 2003 | 42 | 1 | 43 |
| 2004 | 51 | 1 | 52 |
| 2005 | 95 | 4 | 99 |
| 2006 | 95 | 3 | 98 |
| 2007 | 116 | 2 | 118 |
| 2008 | 155 | 1 | 156 |
| 2009 | 305 | 6 | 311 |
| 2010 | 494 | 4 | 498 |
| 2011 | 402 | 9 | 411 |
| 2012 | 305 | 9 | 314 |
| 2013 | 125 | 7 | 132 |
| 2014 | 54 | 1 | 55 |
| 2015 | 17 | 4 | 21 |
| 2016 | 10 | 0 | 10 |
| 2017 | 15 | 0 | 15 |
| 2018 | 16 | 0 | 16 |
| 2019 | 23 | 0 | 23 |
| 2020 | 10 | 1 | 11 |
| 2021 | 11 | 2 | 13 |
The End