Midterm Project

Author

Marie Adele Grosso

Data Introduction

The data I’m using is from the Stockholm International Peace Research Institute (SIPRI), which, among other things, tracks international weapons transfers from 1975 to the present day. SIPRI is the source used by the UN and is linked on their website. Because there’s so much information in their databases, I had to narrow down the data I would download. On their website I filtered to look only at weapons sold by the United States. The variables in the data set are “recipient” (the nation receiving weapons), “supplier” (the nation selling weapons, which in my case was always the United States), the year the weapons were ordered, the number ordered, the weapon designation (the name of the weapon(s) being bought), the year the weapon was delivered, the “status” (which in this case refers to whether the weapon was new or secondhand), comments (additional information about the weapons), and three more complicated variables: SIPRI TIV per unit, SIPRI TIV for total order, and SIPRI TIV of delivered weapons. SIPRI TIV stands for Stockholm International Peace Research Institute trend indicator value. It measures the military capability of the weapon or total purchase and its value to the country. Part of why it exists is to account for changes in the market and in weapons capability. The variables I ended up using were the total value of the weapons sold (measured by the SIPRI TIV), the years they were sold, and which nation was receiving them.

Load in data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
setwd("/Users/marieadelegrosso/Desktop/Desktop - Marie’s MacBook Air (2)/Data")
weapons <- read_csv("weapons register 2000s.csv")
Rows: 3006 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Recipient, Supplier, Weapon designation, Weapon description, Year(s...
dbl (6): Year of order, Number ordered, Number delivered, SIPRI TIV per unit...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(weapons)
# A tibble: 6 × 13
  Recipient   Supplier     `Year of order` `Number ordered` `Weapon designation`
  <chr>       <chr>                  <dbl>            <dbl> <chr>               
1 Afghanistan United Stat…            2014               12 MD-500E             
2 Afghanistan United Stat…            2015               12 MD-500E             
3 Afghanistan United Stat…            2012                4 C-130H Hercules     
4 Afghanistan United Stat…            2016             1673 HMMWV-UA            
5 Afghanistan United Stat…            2014              222 MaxxPro             
6 Afghanistan United Stat…            2016              433 HMMWV-UA            
# ℹ 8 more variables: `Weapon description` <chr>, `Number delivered` <dbl>,
#   `Year(s) of delivery` <chr>, status <chr>, Comments <chr>,
#   `SIPRI TIV per unit` <dbl>, `SIPRI TIV for total order` <dbl>,
#   `SIPRI TIV of delivered weapons` <dbl>

Look at totals of weapons delivered by country

byrecipient <- weapons |>
  group_by(Recipient, `Year of order`) |> # group by recipient and year of order
  summarise(SIPRI_total_value = sum(`SIPRI TIV of delivered weapons`), # combine value of all delivered weapons each year for each country
            delivery_total_count = sum(`Number delivered`), # total number of weapons delivered
            avg_count = mean(`Number delivered`), # average number of weapons delivered per order
            avg_SIPRI = mean(`SIPRI TIV of delivered weapons`), # average delivered value per order
            .groups = "drop") |>  # remove the grouping structure after summarizing
  arrange(Recipient) # sort alphabetically by recipient (country)
byrecipient 
# A tibble: 1,280 × 6
   Recipient   `Year of order` SIPRI_total_value delivery_total_count avg_count
   <chr>                 <dbl>             <dbl>                <dbl>     <dbl>
 1 Afghanistan            2004              18.8                  188     188  
 2 Afghanistan            2006              41.6                  800     800  
 3 Afghanistan            2008             616.                  4735    4735  
 4 Afghanistan            2010             684.                  6805    2268. 
 5 Afghanistan            2011             240.                  1063     177. 
 6 Afghanistan            2012             128.                   211      70.3
 7 Afghanistan            2013              54                    135     135  
 8 Afghanistan            2014              36.8                  234     117  
 9 Afghanistan            2015              30.4                   67      33.5
10 Afghanistan            2016             501.                  2159     720. 
# ℹ 1,270 more rows
# ℹ 1 more variable: avg_SIPRI <dbl>
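
The summary above calls sum() and mean() without na.rm, so if any value in a recipient-year group were missing, that group's result would be NA. A minimal sketch of the difference, using a made-up three-row table (the `toy` tibble, its values, and the column names `total_default` and `total_na_rm` are assumptions for illustration, not SIPRI data):

```r
library(dplyr)

# Hypothetical toy data: recipient "A" has one missing delivered value
toy <- tibble(
  Recipient = c("A", "A", "B"),
  `Year of order` = c(2001, 2001, 2001),
  `SIPRI TIV of delivered weapons` = c(10, NA, 5)
)

toy_summary <- toy |>
  group_by(Recipient, `Year of order`) |>
  summarise(
    total_default = sum(`SIPRI TIV of delivered weapons`),               # NA if any value in the group is missing
    total_na_rm   = sum(`SIPRI TIV of delivered weapons`, na.rm = TRUE), # skips missing values
    .groups = "drop"
  )
toy_summary
```

Whether dropping missing values is appropriate depends on why they are missing, so this is one option rather than a recommendation.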

Create new column with decades to broaden years

years <- byrecipient |> # create new factor/column to combine years into decades and increase legibility
  mutate(
    `Decade of order` = dplyr::case_when(
      `Year of order` <= 1980 ~ "1975-1980",
      `Year of order` > 1980 & `Year of order` <= 1990 ~ "1980-1990",
      `Year of order` > 1990 & `Year of order` <= 2000 ~ "1990-2000",
      `Year of order` > 2000 & `Year of order` <= 2010 ~ "2000-2010",
      `Year of order` > 2010 & `Year of order` <= 2020 ~ "2010-2020",
      `Year of order` > 2020 & `Year of order` <= 2025 ~ "2020-2025")) |>
  mutate(
    `Decade of order` = factor(
      `Decade of order`,
      levels = c("1975-1980",
                 "1980-1990",
                 "1990-2000",
                 "2000-2010",
                 "2010-2020",
                 "2020-2025")))
# Used this post to remember how to do this https://forum.posit.co/t/dplyr-way-s-and-base-r-way-s-of-creating-age-group-from-age/89226/4 
years
# A tibble: 1,280 × 7
   Recipient   `Year of order` SIPRI_total_value delivery_total_count avg_count
   <chr>                 <dbl>             <dbl>                <dbl>     <dbl>
 1 Afghanistan            2004              18.8                  188     188  
 2 Afghanistan            2006              41.6                  800     800  
 3 Afghanistan            2008             616.                  4735    4735  
 4 Afghanistan            2010             684.                  6805    2268. 
 5 Afghanistan            2011             240.                  1063     177. 
 6 Afghanistan            2012             128.                   211      70.3
 7 Afghanistan            2013              54                    135     135  
 8 Afghanistan            2014              36.8                  234     117  
 9 Afghanistan            2015              30.4                   67      33.5
10 Afghanistan            2016             501.                  2159     720. 
# ℹ 1,270 more rows
# ℹ 2 more variables: avg_SIPRI <dbl>, `Decade of order` <fct>
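
The same binning can also be sketched with base R's cut(), which builds the ordered factor in one step. This is an alternative under stated assumptions, not the method used above: the `toy_years` table is made up, and its years were chosen to sit on the bin edges. Note that each bin is closed on the right, so the label "1980-1990" actually covers 1981 through 1990.

```r
library(dplyr)

# Hypothetical order years placed on the bin boundaries
toy_years <- tibble(`Year of order` = c(1975, 1980, 1981, 1990, 1991, 2025))

toy_decades <- toy_years |>
  mutate(
    `Decade of order` = cut(
      `Year of order`,
      breaks = c(-Inf, 1980, 1990, 2000, 2010, 2020, 2025), # right-closed bins: (1980, 1990], (1990, 2000], ...
      labels = c("1975-1980", "1980-1990", "1990-2000",
                 "2000-2010", "2010-2020", "2020-2025")
    )
  )
toy_decades
```

As with the case_when() version, any year after 2025 would come out as NA.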

Linear Regression Model

linear <- lm(SIPRI_total_value ~ avg_SIPRI,
             subset = delivery_total_count, # an unnamed argument in this position is matched to lm()'s `subset`, so the counts select rows here rather than acting as a predictor
             data = years) # looking to see if estimated number of weapons purchased is related to the value of the purchase. I realized that the correlation is not particularly consistent; I would have had to categorize weapon type to make this interesting
summary(linear)

Call:
lm(formula = SIPRI_total_value ~ avg_SIPRI, data = years, subset = delivery_total_count)

Residuals:
    Min      1Q  Median      3Q     Max 
-590.16  -46.48  -31.14   -3.11 1417.02 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.35470    6.16434   4.762 2.15e-06 ***
avg_SIPRI    1.91107    0.04059  47.079  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 191.1 on 1202 degrees of freedom
  (76 observations deleted due to missingness)
Multiple R-squared:  0.6484,    Adjusted R-squared:  0.6481 
F-statistic:  2216 on 1 and 1202 DF,  p-value: < 2.2e-16
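
In the Call shown above, delivery_total_count appears as `subset = delivery_total_count`, which means lm() used it to pick rows rather than as a predictor. A minimal sketch of how the count could instead enter the model as a second predictor, run on synthetic data only (the `toy` data frame and its values are invented; only the column names mirror the report, and the coefficients it produces say nothing about the SIPRI data):

```r
set.seed(1)  # synthetic data, so results are reproducible but not meaningful

# Hypothetical stand-in for `years`
toy <- data.frame(
  avg_SIPRI = runif(50, 1, 100),
  delivery_total_count = sample(1:500, 50, replace = TRUE)
)
toy$SIPRI_total_value <- 2 * toy$avg_SIPRI + 0.1 * toy$delivery_total_count + rnorm(50)

# Both variables are listed in the formula, so both are predictors
both <- lm(SIPRI_total_value ~ avg_SIPRI + delivery_total_count, data = toy)
names(coef(both))
```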

What the model means

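The p-value was below 2.2e-16 (less than 0.00000000000000022), which means the relationship between these two variables is statistically significant, not absent. The R-squared of 0.648 means avg_SIPRI accounts for about 65% of the variation in SIPRI_total_value. The remaining variation is notable because it suggests that high-value purchases are not always going toward high-value weapons; they are often also going toward a high volume of weapons.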

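avg_SIPRI is the average SIPRI TIV of delivered weapons per order, for each recipient and year of order

SIPRI_total_value is the sum of the SIPRI TIV of delivered weapons across every order, for each recipient and year of order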

The model equation is: SIPRI_total_value = 1.91107 (avg_SIPRI) + 29.35470
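
As a worked example of reading the equation, the sketch below plugs in an avg_SIPRI of 100. The coefficients are copied from the summary output; the helper name `predict_total` and the input value are assumptions chosen for illustration.

```r
# Coefficients copied from the model summary
intercept <- 29.35470
slope <- 1.91107

# Hypothetical helper: predicted SIPRI_total_value for a given avg_SIPRI
predict_total <- function(avg_SIPRI) intercept + slope * avg_SIPRI

predict_total(100)  # 29.35470 + 1.91107 * 100 = 220.4617
```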

View Linear Regression Graphs

plot(linear) #visualizing this to see if I potentially want to use this data

Single out 10 countries receiving the most weapons from the US in total over all time

topten <- years |> # Narrow the table down to the top recipients of US weapons
  filter(Recipient %in% c("Saudi Arabia", # nine distinct recipients are listed here
                          "United Arab Emirates",
                          "Qatar",
                          "United Kingdom",
                          "Japan",
                          "Australia",
                          "Israel",
                          "Taiwan",
                          "Egypt")) |>
  arrange(SIPRI_total_value) # sort from lowest to highest total value
topten
# A tibble: 233 × 7
   Recipient    `Year of order` SIPRI_total_value delivery_total_count avg_count
   <chr>                  <dbl>             <dbl>                <dbl>     <dbl>
 1 United Arab…            2004              1                      50      50  
 2 Japan                   1977              1.42                    2       2  
 3 Saudi Arabia            1990              2.19                   73      73  
 4 Japan                   1982              4.28                    1       1  
 5 Taiwan                  1982              5                       2       2  
 6 Japan                   2008              7.2                    24      24  
 7 United King…            2023              7.63                  763     763  
 8 United King…            2019              8                       4       4  
 9 Australia               1999              8.97                  299     299  
10 Australia               2023              9                     105      52.5
# ℹ 223 more rows
# ℹ 2 more variables: avg_SIPRI <dbl>, `Decade of order` <fct>
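
The recipient list above is typed in by hand. A minimal sketch of picking the top recipients from the data instead, using dplyr's slice_max(); the `toy` table, the `all_time_total` column, and `n_top` are hypothetical names and values for illustration, and `n_top` would be 10 for the real data.

```r
library(dplyr)

# Hypothetical toy stand-in for `years`, with made-up recipients and values
toy <- tibble(
  Recipient = c("A", "A", "B", "C", "C"),
  SIPRI_total_value = c(10, 20, 5, 40, 1)
)

n_top <- 2  # number of recipients to keep

# Sum each recipient's value over all years, then keep the largest totals
top_recipients <- toy |>
  group_by(Recipient) |>
  summarise(all_time_total = sum(SIPRI_total_value, na.rm = TRUE)) |>
  slice_max(all_time_total, n = n_top) |>
  pull(Recipient)

# Keep only the rows belonging to those recipients
toy_top <- toy |>
  filter(Recipient %in% top_recipients)
toy_top
```

By default slice_max() keeps ties, so it can return more than `n_top` recipients when totals are equal.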

Create stacked bar graph

plot1 <- topten |>
  ggplot() +
  geom_col(aes(x = Recipient, y = SIPRI_total_value, fill = `Decade of order`), # stacked bar graph filled by decade
           position = "stack") +
  labs(fill = "Decade of Weapons Order", # specify labels for graph, title and source
       x = "Country Receiving Weapons",
       y = "Total Value of Weapons Sold\n(in SIPRI TIV)",
       title = "Top 10 Countries Receiving Weapons from the United States from 1975-2025",
       caption = "Source: Stockholm International Peace Research Institute") +
  scale_fill_manual(values = c( # change to pretty colors
    "1975-1980" = "#615055",
    "1980-1990" = "#946e83",
    "1990-2000" = "#b4a6ab",
    "2000-2010" = "#cdd5d1",
    "2010-2020" = "#B6CBBA",
    "2020-2025" = "#9EC1A3")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)) # Used The R Graph Gallery to remember how to do this because I couldn't find it in my notes
plot1

Methods/Mini Essay

Question A

I started out pretty simple with data cleaning. Because I was able to specify what data I wanted before I downloaded the CSV, my data didn’t include much information that I wanted to remove. Initially, I filtered out NA values, but the variables I ended up using did not have any, and filtering them out of other columns was unnecessary. I did change some formatting: I grouped the data by recipient and year of order, and I also added columns for several summary values that I found useful in deciding what I would do with the data. My last step was adding a column with year ranges, because I was working with a large range of dates and needed fewer categories.

Question B

The visualization shows the value of the weapons sold by the United States to the top 10 receiving countries, divided up by decade. That means we not only have information on the value of weapons the United States has sold to each country in the last 50 years, but we can also see when those sales happened. It’s important to know that, because we are using SIPRI TIV, the value is adjusted for inflation and for what weapons were available. It’s interesting to see the degree to which weapons sales increased in the 2000s-2010s.

Question C

Showing the most commonly sold weapons in each time period would have been really interesting, especially with interactivity. Unfortunately, it was too complicated to measure that empirically without using other databases or more complex grouping.