The following packages are required to produce the report.
# Required packages
library(readr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
This report aims to create a clean dataset and subsequently find the per capita Military Expenditure by Countries over the ten year span between 2008 and 2017. Three datasets (Military Expenditure, Population, Continents) are merged for this purpose. We first see the summary of the datasets and look at their structure. We then subset the Military Expenditure dataset to keep the required columns. We look for missing values in the datasets and impute the median where necessary. After merging the datasets, we apply the correct type conversions to the variables and then check for outliers in the Expenditure (Military Expenditure) and Count (population) variables. Since these datasets represent real world information, and so we do not impute or eliminate the outliers. Lastly we create a new column (Per Capita Expenditure) and then apply log transformation to reduce its skewness.
As discussed earlier, we import 3 datasets for this analysis. Their descriptions are as follows:
1. Military Expenditure
The file can be found in the ‘Military Spending of Countries (1960-2019)’ dataset. The file to be used is ‘Military Expenditure.csv’. This dataset has been taken from kaggle and can be viewed using the following link: https://www.kaggle.com/nitinsss/military-expenditure-of-countries-19602019
The dataset has 264 observation and 63 variables. The variables include the:
2. Population of the each Country
The second dataset contains population information of all countries from 1960 to 2018. The dataset consists of 6 files and for this analysis we will use the file named ‘population_total_long’. This dataset has also been downloaded from Kaggle and can be accessed using the following link: https://www.kaggle.com/imdevskp/world-population-19602018?select=population_total_long.csv
The data file has 12,595 observations and 3 variables. The variables are as follows:
3. Continent-wise information
The third dataset, named ‘Country to Continent’ has also been taken from Kaggle. It can be accessed using the following link: https://www.kaggle.com/statchaitya/country-to-continent. The data file used, ‘countryContinent’, has 249 observations and 9 variables. The important variables are as follows:
The other variables are ‘code_2’, ‘code_3’, ‘country_code’, ‘iso_3166_2’, ‘region_code’,‘Sub_region’ and ‘sub_region_code’. These variables are not essential to the analysis and hence we do not look into them any further. We will subsequently merge the 3 datasets after tidying them.
We import the 3 csv files using the read_csv() function from the readr package.
#Importing the data
Mil_Expenditure <- read_csv("C:/Users/Joshua/Desktop/Data Wrangling Military Expenditure.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Name = col_character(),
## Code = col_character(),
## Type = col_character(),
## `Indicator Name` = col_character()
## )
## See spec(...) for full column specifications.
head(Mil_Expenditure)
## # A tibble: 6 x 63
## Name Code Type `Indicator Name` `1960` `1961` `1962` `1963`
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 World WLD Regi~ Military expend~ NA NA NA NA
## 2 High~ HIC Regi~ Military expend~ 6.74e10 7.13e10 7.85e10 8.06e10
## 3 OECD~ OED Regi~ Military expend~ 6.78e10 7.16e10 7.89e10 8.10e10
## 4 Post~ PST Regi~ Military expend~ 6.33e10 6.68e10 7.37e10 7.51e10
## 5 Nort~ NAC Regi~ Military expend~ 4.71e10 4.95e10 5.41e10 5.39e10
## 6 Unit~ USA Coun~ Military expend~ 4.54e10 4.78e10 5.24e10 5.23e10
## # ... with 55 more variables: `1964` <dbl>, `1965` <dbl>, `1966` <dbl>,
## # `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>,
## # `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>,
## # `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
## # `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
## # `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
## # `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
## # `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>,
## # `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>,
## # `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## # `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>,
## # `2017` <dbl>, `2018` <dbl>
Population <- read_csv("C:/Users/Joshua/Desktop/Population.csv")
## Parsed with column specification:
## cols(
## `Country Name` = col_character(),
## Year = col_double(),
## Count = col_double()
## )
head(Population)
## # A tibble: 6 x 3
## `Country Name` Year Count
## <chr> <dbl> <dbl>
## 1 Aruba 1960 54211
## 2 Afghanistan 1960 8996973
## 3 Angola 1960 5454933
## 4 Albania 1960 1608800
## 5 Andorra 1960 13411
## 6 United Arab Emirates 1960 92418
Cont <- read_csv("C:/Users/Joshua/Desktop/Continent Info.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## code_2 = col_character(),
## code_3 = col_character(),
## country_code = col_double(),
## iso_3166_2 = col_character(),
## continent = col_character(),
## sub_region = col_character(),
## region_code = col_double(),
## sub_region_code = col_double()
## )
head(Cont)
## # A tibble: 6 x 9
## country code_2 code_3 country_code iso_3166_2 continent sub_region region_code
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 "Afgha~ AF AFG 4 ISO 3166-~ Asia Southern ~ 142
## 2 "\xc5l~ AX ALA 248 ISO 3166-~ Europe Northern ~ 150
## 3 "Alban~ AL ALB 8 ISO 3166-~ Europe Southern ~ 150
## 4 "Alger~ DZ DZA 12 ISO 3166-~ Africa Northern ~ 2
## 5 "Ameri~ AS ASM 16 ISO 3166-~ Oceania Polynesia 9
## 6 "Andor~ AD AND 20 ISO 3166-~ Europe Southern ~ 150
## # ... with 1 more variable: sub_region_code <dbl>
# We summarize the data by using the str() function
str(Mil_Expenditure)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 264 obs. of 63 variables:
## $ Name : chr "World" "High income" "OECD members" "Post-demographic dividend" ...
## $ Code : chr "WLD" "HIC" "OED" "PST" ...
## $ Type : chr "Regions Clubbed Geographically" "Regions Clubbed Economically" "Regions Clubbed Economically" "Regions Clubbed Economically" ...
## $ Indicator Name: chr "Military expenditure (current USD)" "Military expenditure (current USD)" "Military expenditure (current USD)" "Military expenditure (current USD)" ...
## $ 1960 : num NA 6.74e+10 6.78e+10 6.33e+10 4.71e+10 ...
## $ 1961 : num NA 7.13e+10 7.16e+10 6.68e+10 4.95e+10 ...
## $ 1962 : num NA 7.85e+10 7.89e+10 7.37e+10 5.41e+10 ...
## $ 1963 : num NA 8.06e+10 8.10e+10 7.51e+10 5.39e+10 ...
## $ 1964 : num NA 8.11e+10 8.15e+10 7.53e+10 5.29e+10 ...
## $ 1965 : num NA 8.33e+10 8.36e+10 7.71e+10 5.34e+10 ...
## $ 1966 : num NA 9.73e+10 9.75e+10 9.04e+10 6.52e+10 ...
## $ 1967 : num NA 1.12e+11 1.12e+11 1.04e+11 7.72e+10 ...
## $ 1968 : num NA 1.18e+11 1.18e+11 1.09e+11 8.25e+10 ...
## $ 1969 : num NA 1.21e+11 1.21e+11 1.11e+11 8.32e+10 ...
## $ 1970 : num NA 1.23e+11 1.23e+11 1.12e+11 8.17e+10 ...
## $ 1971 : num NA 1.23e+11 1.23e+11 1.11e+11 7.69e+10 ...
## $ 1972 : num NA 1.35e+11 1.34e+11 1.21e+11 7.99e+10 ...
## $ 1973 : num NA 1.49e+11 1.48e+11 1.31e+11 8.07e+10 ...
## $ 1974 : num NA 1.65e+11 1.65e+11 1.46e+11 8.87e+10 ...
## $ 1975 : num NA 1.79e+11 1.81e+11 1.60e+11 9.16e+10 ...
## $ 1976 : num NA 1.85e+11 1.87e+11 1.64e+11 9.46e+10 ...
## $ 1977 : num NA 2.18e+11 2.08e+11 1.84e+11 1.05e+11 ...
## $ 1978 : num NA 2.41e+11 2.29e+11 2.11e+11 1.13e+11 ...
## $ 1979 : num NA 2.70e+11 2.53e+11 2.43e+11 1.26e+11 ...
## $ 1980 : num 3.64e+11 3.09e+11 2.88e+11 2.78e+11 1.43e+11 ...
## $ 1981 : num 4.00e+11 3.39e+11 3.14e+11 3.02e+11 1.75e+11 ...
## $ 1982 : num NA 3.86e+11 3.57e+11 3.45e+11 2.20e+11 ...
## $ 1983 : num NA 3.78e+11 3.54e+11 3.43e+11 2.21e+11 ...
## $ 1984 : num 4.48e+11 3.92e+11 3.68e+11 3.57e+11 2.39e+11 ...
## $ 1985 : num 4.74e+11 4.17e+11 3.97e+11 3.87e+11 2.66e+11 ...
## $ 1986 : num 5.20e+11 4.58e+11 4.57e+11 4.47e+11 2.89e+11 ...
## $ 1987 : num 5.88e+11 5.16e+11 5.00e+11 4.88e+11 2.97e+11 ...
## $ 1988 : num 6.27e+11 5.48e+11 5.25e+11 5.12e+11 3.03e+11 ...
## $ 1989 : num 6.45e+11 5.57e+11 5.34e+11 5.24e+11 3.15e+11 ...
## $ 1990 : num 7.00e+11 6.09e+11 5.76e+11 5.62e+11 3.18e+11 ...
## $ 1991 : num 6.85e+11 5.97e+11 5.61e+11 5.42e+11 2.92e+11 ...
## $ 1992 : num 7.17e+11 6.24e+11 5.95e+11 5.78e+11 3.16e+11 ...
## $ 1993 : num 6.86e+11 5.98e+11 5.70e+11 5.52e+11 3.08e+11 ...
## $ 1994 : num 6.95e+11 5.97e+11 5.68e+11 5.52e+11 2.98e+11 ...
## $ 1995 : num 7.17e+11 6.10e+11 5.81e+11 5.65e+11 2.88e+11 ...
## $ 1996 : num 7.16e+11 5.99e+11 5.71e+11 5.53e+11 2.80e+11 ...
## $ 1997 : num 7.14e+11 5.89e+11 5.54e+11 5.36e+11 2.84e+11 ...
## $ 1998 : num 6.96e+11 5.80e+11 5.46e+11 5.26e+11 2.82e+11 ...
## $ 1999 : num 7.09e+11 5.90e+11 5.60e+11 5.39e+11 2.89e+11 ...
## $ 2000 : num 7.30e+11 6.01e+11 5.69e+11 5.46e+11 3.10e+11 ...
## $ 2001 : num 7.42e+11 6.05e+11 5.70e+11 5.49e+11 3.21e+11 ...
## $ 2002 : num 8.02e+11 6.63e+11 6.32e+11 6.10e+11 3.65e+11 ...
## $ 2003 : num 9.37e+11 7.72e+11 7.40e+11 7.16e+11 4.25e+11 ...
## $ 2004 : num 1.06e+12 8.64e+11 8.29e+11 8.03e+11 4.76e+11 ...
## $ 2005 : num 1.15e+12 9.20e+11 8.80e+11 8.52e+11 5.16e+11 ...
## $ 2006 : num 1.19e+12 9.62e+11 9.19e+11 8.88e+11 5.42e+11 ...
## $ 2007 : num 1.32e+12 1.04e+12 9.93e+11 9.59e+11 5.74e+11 ...
## $ 2008 : num 1.49e+12 1.15e+12 1.10e+12 1.06e+12 6.40e+11 ...
## $ 2009 : num 1.55e+12 1.19e+12 1.13e+12 1.09e+12 6.88e+11 ...
## $ 2010 : num 1.63e+12 1.23e+12 1.16e+12 1.12e+12 7.17e+11 ...
## $ 2011 : num 1.73e+12 1.28e+12 1.20e+12 1.16e+12 7.33e+11 ...
## $ 2012 : num 1.74e+12 1.25e+12 1.16e+12 1.12e+12 7.05e+11 ...
## $ 2013 : num 1.74e+12 1.21e+12 1.10e+12 1.06e+12 6.58e+11 ...
## $ 2014 : num 1.74e+12 1.19e+12 1.08e+12 1.03e+12 6.28e+11 ...
## $ 2015 : num 1.64e+12 1.11e+12 1.01e+12 9.71e+11 6.14e+11 ...
## $ 2016 : num 1.63e+12 1.10e+12 1.02e+12 9.84e+11 6.18e+11 ...
## $ 2017 : num 1.70e+12 1.13e+12 1.05e+12 1.01e+12 6.27e+11 ...
## $ 2018 : num 1.78e+12 1.20e+12 1.12e+12 1.08e+12 6.70e+11 ...
## - attr(*, "spec")=
## .. cols(
## .. Name = col_character(),
## .. Code = col_character(),
## .. Type = col_character(),
## .. `Indicator Name` = col_character(),
## .. `1960` = col_double(),
## .. `1961` = col_double(),
## .. `1962` = col_double(),
## .. `1963` = col_double(),
## .. `1964` = col_double(),
## .. `1965` = col_double(),
## .. `1966` = col_double(),
## .. `1967` = col_double(),
## .. `1968` = col_double(),
## .. `1969` = col_double(),
## .. `1970` = col_double(),
## .. `1971` = col_double(),
## .. `1972` = col_double(),
## .. `1973` = col_double(),
## .. `1974` = col_double(),
## .. `1975` = col_double(),
## .. `1976` = col_double(),
## .. `1977` = col_double(),
## .. `1978` = col_double(),
## .. `1979` = col_double(),
## .. `1980` = col_double(),
## .. `1981` = col_double(),
## .. `1982` = col_double(),
## .. `1983` = col_double(),
## .. `1984` = col_double(),
## .. `1985` = col_double(),
## .. `1986` = col_double(),
## .. `1987` = col_double(),
## .. `1988` = col_double(),
## .. `1989` = col_double(),
## .. `1990` = col_double(),
## .. `1991` = col_double(),
## .. `1992` = col_double(),
## .. `1993` = col_double(),
## .. `1994` = col_double(),
## .. `1995` = col_double(),
## .. `1996` = col_double(),
## .. `1997` = col_double(),
## .. `1998` = col_double(),
## .. `1999` = col_double(),
## .. `2000` = col_double(),
## .. `2001` = col_double(),
## .. `2002` = col_double(),
## .. `2003` = col_double(),
## .. `2004` = col_double(),
## .. `2005` = col_double(),
## .. `2006` = col_double(),
## .. `2007` = col_double(),
## .. `2008` = col_double(),
## .. `2009` = col_double(),
## .. `2010` = col_double(),
## .. `2011` = col_double(),
## .. `2012` = col_double(),
## .. `2013` = col_double(),
## .. `2014` = col_double(),
## .. `2015` = col_double(),
## .. `2016` = col_double(),
## .. `2017` = col_double(),
## .. `2018` = col_double()
## .. )
str(Population)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 12595 obs. of 3 variables:
## $ Country Name: chr "Aruba" "Afghanistan" "Angola" "Albania" ...
## $ Year : num 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
## $ Count : num 54211 8996973 5454933 1608800 13411 ...
## - attr(*, "spec")=
## .. cols(
## .. `Country Name` = col_character(),
## .. Year = col_double(),
## .. Count = col_double()
## .. )
str(Cont)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 249 obs. of 9 variables:
## $ country : chr "Afghanistan" "<c5>land Islands" "Albania" "Algeria" ...
## $ code_2 : chr "AF" "AX" "AL" "DZ" ...
## $ code_3 : chr "AFG" "ALA" "ALB" "DZA" ...
## $ country_code : num 4 248 8 12 16 20 24 660 10 28 ...
## $ iso_3166_2 : chr "ISO 3166-2:AF" "ISO 3166-2:AX" "ISO 3166-2:AL" "ISO 3166-2:DZ" ...
## $ continent : chr "Asia" "Europe" "Europe" "Africa" ...
## $ sub_region : chr "Southern Asia" "Northern Europe" "Southern Europe" "Northern Africa" ...
## $ region_code : num 142 150 150 2 9 150 2 19 NA 19 ...
## $ sub_region_code: num 34 154 39 15 61 39 17 29 NA 29 ...
## - attr(*, "spec")=
## .. cols(
## .. country = col_character(),
## .. code_2 = col_character(),
## .. code_3 = col_character(),
## .. country_code = col_double(),
## .. iso_3166_2 = col_character(),
## .. continent = col_character(),
## .. sub_region = col_character(),
## .. region_code = col_double(),
## .. sub_region_code = col_double()
## .. )
There are several unnecesary columns in the datasets. There are also several missing values in each column of the military expenditure dataset. We have to subset the dataset to keep only the required columns in all datasets. Additionally we must change the Military expenditure dataset to long format in line with Hadley Wickham’s Tidy data principles (each variable must have its own column, each observation must have its own row and each value must have its own cell).
The datasets also contain variables of the different data types. We have variables in factor, character and numeric format, but some variables are stored incorrectly. We must convert the Continent (from continent dataset) to factors.
Further inspection must be conducted after merging the data sets.
We first only select the columns we need. For this we select only the 10 year period between 2008-2017. We select this time period as we want to select the most recent ten year period that we have complete data for. Since we do not have population data for 2018, we only study the ten year period from 2008-2017.
# Select the required columns.
Mil_Exp <- Mil_Expenditure %>% select(Name, Code,Type,`Indicator Name`,`2008`,`2009`,`2010`,`2011`,`2012`,`2013`,`2014`,`2015`,`2016`,`2017`)
head(Mil_Exp)
## # A tibble: 6 x 14
## Name Code Type `Indicator Name` `2008` `2009` `2010` `2011` `2012`
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 World WLD Regi~ Military expend~ 1.49e12 1.55e12 1.63e12 1.73e12 1.74e12
## 2 High~ HIC Regi~ Military expend~ 1.15e12 1.19e12 1.23e12 1.28e12 1.25e12
## 3 OECD~ OED Regi~ Military expend~ 1.10e12 1.13e12 1.16e12 1.20e12 1.16e12
## 4 Post~ PST Regi~ Military expend~ 1.06e12 1.09e12 1.12e12 1.16e12 1.12e12
## 5 Nort~ NAC Regi~ Military expend~ 6.40e11 6.88e11 7.17e11 7.33e11 7.05e11
## 6 Unit~ USA Coun~ Military expend~ 6.21e11 6.69e11 6.98e11 7.11e11 6.85e11
## # ... with 5 more variables: `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## # `2016` <dbl>, `2017` <dbl>
We take a look at a summary of the data and the number of missing values for each variable.
colSums(is.na(Mil_Exp))
## Name Code Type Indicator Name 2008
## 0 0 0 0 63
## 2009 2010 2011 2012 2013
## 67 68 70 65 62
## 2014 2015 2016 2017
## 61 66 67 69
summary(Mil_Exp)
## Name Code Type Indicator Name
## Length:264 Length:264 Length:264 Length:264
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## 2008 2009 2010
## Min. :0.000e+00 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.:1.940e+08 1st Qu.:2.148e+08 1st Qu.:2.195e+08
## Median :1.732e+09 Median :2.116e+09 Median :2.136e+09
## Mean :5.607e+10 Mean :5.919e+10 Mean :6.289e+10
## 3rd Qu.:1.713e+10 3rd Qu.:1.783e+10 3rd Qu.:1.682e+10
## Max. :1.490e+12 Max. :1.550e+12 Max. :1.630e+12
## NA's :63 NA's :67 NA's :68
## 2011 2012 2013
## Min. :0.000e+00 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.:2.592e+08 1st Qu.:2.309e+08 1st Qu.:2.511e+08
## Median :2.485e+09 Median :2.590e+09 Median :2.436e+09
## Mean :6.823e+10 Mean :6.767e+10 Mean :6.782e+10
## 3rd Qu.:1.823e+10 3rd Qu.:1.876e+10 3rd Qu.:1.937e+10
## Max. :1.730e+12 Max. :1.740e+12 Max. :1.740e+12
## NA's :70 NA's :65 NA's :62
## 2014 2015 2016
## Min. :0.000e+00 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.:2.492e+08 1st Qu.:2.489e+08 1st Qu.:2.607e+08
## Median :2.356e+09 Median :2.416e+09 Median :2.456e+09
## Mean :6.788e+10 Mean :6.553e+10 Mean :6.562e+10
## 3rd Qu.:2.032e+10 3rd Qu.:1.928e+10 3rd Qu.:1.785e+10
## Max. :1.740e+12 Max. :1.640e+12 Max. :1.630e+12
## NA's :61 NA's :66 NA's :67
## 2017
## Min. :0.000e+00
## 1st Qu.:2.851e+08
## Median :2.328e+09
## Mean :6.928e+10
## 3rd Qu.:2.148e+10
## Max. :1.700e+12
## NA's :69
This current data set has 264 countries and 14 variables. We observe that there are still a considerable number of missing values in each year column. We eliminate the observations that have missing values for more than 5 years. This is because we will impute the median of the expenditure of all the countries that year. Imputing more than half the values for any country would not be appropriate for the analysis, as it would affect the integrity of the data and the significance of the results.
# We subset the data by keeping only the observations with less than 5 missing values
Mil_Exp2 <- Mil_Exp[which(rowSums(is.na(Mil_Exp)) < 5),]
colSums(is.na(Mil_Exp2 ))
## Name Code Type Indicator Name 2008
## 0 0 0 0 5
## 2009 2010 2011 2012 2013
## 7 6 7 2 1
## 2014 2015 2016 2017
## 1 5 7 8
We impute the median instead of the mean as there are extreme values in the dataset that would impact the mean value. In such a situation imputing with the median can give a more feasible estimate.
# IMPUTE the median into the missing values.
Mil_Exp3 <- sapply(Mil_Exp2,function(x) {
if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=T),x) else x})
head(Mil_Exp3)
## Name Code Type
## [1,] "World" "WLD" "Regions Clubbed Geographically"
## [2,] "High income" "HIC" "Regions Clubbed Economically"
## [3,] "OECD members" "OED" "Regions Clubbed Economically"
## [4,] "Post-demographic dividend" "PST" "Regions Clubbed Economically"
## [5,] "North America" "NAC" "Regions Clubbed Geographically"
## [6,] "United States" "USA" "Country"
## Indicator Name 2008 2009 2010
## [1,] "Military expenditure (current USD)" "1.49e+12" "1.55e+12" "1.63e+12"
## [2,] "Military expenditure (current USD)" "1.15e+12" "1.19e+12" "1.23e+12"
## [3,] "Military expenditure (current USD)" "1.1e+12" "1.13e+12" "1.16e+12"
## [4,] "Military expenditure (current USD)" "1.06e+12" "1.09e+12" "1.12e+12"
## [5,] "Military expenditure (current USD)" "6.4e+11" "6.88e+11" "7.17e+11"
## [6,] "Military expenditure (current USD)" "6.21e+11" "6.69e+11" "6.98e+11"
## 2011 2012 2013 2014 2015 2016
## [1,] "1.73e+12" "1.74e+12" "1.74e+12" "1.74e+12" "1.64e+12" "1.63e+12"
## [2,] "1.28e+12" "1.25e+12" "1.21e+12" "1.19e+12" "1.11e+12" "1.1e+12"
## [3,] "1.2e+12" "1.16e+12" "1.1e+12" "1.08e+12" "1.01e+12" "1.02e+12"
## [4,] "1.16e+12" "1.12e+12" "1.06e+12" "1.03e+12" "9.71e+11" "9.84e+11"
## [5,] "7.33e+11" "7.05e+11" "6.58e+11" "6.28e+11" "6.14e+11" "6.18e+11"
## [6,] "7.11e+11" "6.85e+11" "6.4e+11" "6.1e+11" "5.96e+11" "6e+11"
## 2017
## [1,] "1.7e+12"
## [2,] "1.13e+12"
## [3,] "1.05e+12"
## [4,] "1.01e+12"
## [5,] "6.27e+11"
## [6,] "6.06e+11"
colSums(is.na(Mil_Exp3))
## Name Code Type Indicator Name 2008
## 0 0 0 0 0
## 2009 2010 2011 2012 2013
## 0 0 0 0 0
## 2014 2015 2016 2017
## 0 0 0 0
Mil_Exp4 <- as.data.frame(Mil_Exp3)
head(Mil_Exp4)
## Name Code Type
## 1 World WLD Regions Clubbed Geographically
## 2 High income HIC Regions Clubbed Economically
## 3 OECD members OED Regions Clubbed Economically
## 4 Post-demographic dividend PST Regions Clubbed Economically
## 5 North America NAC Regions Clubbed Geographically
## 6 United States USA Country
## Indicator Name 2008 2009 2010 2011
## 1 Military expenditure (current USD) 1.49e+12 1.55e+12 1.63e+12 1.73e+12
## 2 Military expenditure (current USD) 1.15e+12 1.19e+12 1.23e+12 1.28e+12
## 3 Military expenditure (current USD) 1.1e+12 1.13e+12 1.16e+12 1.2e+12
## 4 Military expenditure (current USD) 1.06e+12 1.09e+12 1.12e+12 1.16e+12
## 5 Military expenditure (current USD) 6.4e+11 6.88e+11 7.17e+11 7.33e+11
## 6 Military expenditure (current USD) 6.21e+11 6.69e+11 6.98e+11 7.11e+11
## 2012 2013 2014 2015 2016 2017
## 1 1.74e+12 1.74e+12 1.74e+12 1.64e+12 1.63e+12 1.7e+12
## 2 1.25e+12 1.21e+12 1.19e+12 1.11e+12 1.1e+12 1.13e+12
## 3 1.16e+12 1.1e+12 1.08e+12 1.01e+12 1.02e+12 1.05e+12
## 4 1.12e+12 1.06e+12 1.03e+12 9.71e+11 9.84e+11 1.01e+12
## 5 7.05e+11 6.58e+11 6.28e+11 6.14e+11 6.18e+11 6.27e+11
## 6 6.85e+11 6.4e+11 6.1e+11 5.96e+11 6e+11 6.06e+11
We now look for missing values in the population and continents dataset
# Check for missing values
colSums(is.na(Population))
## Country Name Year Count
## 0 0 0
colSums(is.na(Cont))
## country code_2 code_3 country_code iso_3166_2
## 0 1 0 0 0
## continent sub_region region_code sub_region_code
## 9 9 9 9
We see that there are currently no NA values in the population dataset. The Continent dataset has a few missing values, but no missing values in the Country Name column which will be used to merge the datasets. The missing values will be dealt with after the datasets are merged.
The data currently has information for each year in a different column. According to the Tidy data principle each observation should have its own row, each variable should be in a column and each value should be in a cell. The years are not variables and can be efficiently stored as values for each country under a variable ‘Year’.
Additionally the data in the population dataset is also in long format. In order to merge the two datasets it is necessary for us to convert the dataset to long format. For this purpose we can use the gather() function or its newer alternative, pivot_longer().
Mil <- Mil_Exp4 %>% pivot_longer(cols = c(-Name,-Code,-Type,-`Indicator Name`), names_to = "Year", values_to = "Expenditure")
head(Mil)
## # A tibble: 6 x 6
## Name Code Type `Indicator Name` Year Expenditure
## <fct> <fct> <fct> <fct> <chr> <fct>
## 1 World WLD Regions Clubbed Geog~ Military expenditure (cur~ 2008 1.49e+12
## 2 World WLD Regions Clubbed Geog~ Military expenditure (cur~ 2009 1.55e+12
## 3 World WLD Regions Clubbed Geog~ Military expenditure (cur~ 2010 1.63e+12
## 4 World WLD Regions Clubbed Geog~ Military expenditure (cur~ 2011 1.73e+12
## 5 World WLD Regions Clubbed Geog~ Military expenditure (cur~ 2012 1.74e+12
## 6 World WLD Regions Clubbed Geog~ Military expenditure (cur~ 2013 1.74e+12
We now have a tidy dataset for military expenditure data.
We first create a key column, by combining the Country Name and Year in order to combine the Military expenditure and Population information. We do this using the unite() function from the tidyr package. We then left join the population data to the expenditure data. This is because we only want to analyse observations for which we have military expenditure information. By doing a left join we automatically drop the countries in the Population dataset that are not in the Expenditure dataset. We then separate the key column using the separate() function and combine it with the continents dataset using the Country name column as a key.
# Uniting the columns to make a key
Mil1 <- Mil %>% unite("Key", Name, Year)
head(Mil1)
## # A tibble: 6 x 5
## Key Code Type `Indicator Name` Expenditure
## <chr> <fct> <fct> <fct> <fct>
## 1 World_20~ WLD Regions Clubbed Geog~ Military expenditure (curre~ 1.49e+12
## 2 World_20~ WLD Regions Clubbed Geog~ Military expenditure (curre~ 1.55e+12
## 3 World_20~ WLD Regions Clubbed Geog~ Military expenditure (curre~ 1.63e+12
## 4 World_20~ WLD Regions Clubbed Geog~ Military expenditure (curre~ 1.73e+12
## 5 World_20~ WLD Regions Clubbed Geog~ Military expenditure (curre~ 1.74e+12
## 6 World_20~ WLD Regions Clubbed Geog~ Military expenditure (curre~ 1.74e+12
Pop1 <- Population %>% unite("Key", `Country Name`, Year)
head(Pop1)
## # A tibble: 6 x 2
## Key Count
## <chr> <dbl>
## 1 Aruba_1960 54211
## 2 Afghanistan_1960 8996973
## 3 Angola_1960 5454933
## 4 Albania_1960 1608800
## 5 Andorra_1960 13411
## 6 United Arab Emirates_1960 92418
Data <- left_join(Mil1, Pop1, by="Key")
head(Data)
## # A tibble: 6 x 6
## Key Code Type `Indicator Name` Expenditure Count
## <chr> <fct> <fct> <fct> <fct> <dbl>
## 1 World_2~ WLD Regions Clubbed Ge~ Military expenditure (cu~ 1.49e+12 NA
## 2 World_2~ WLD Regions Clubbed Ge~ Military expenditure (cu~ 1.55e+12 NA
## 3 World_2~ WLD Regions Clubbed Ge~ Military expenditure (cu~ 1.63e+12 NA
## 4 World_2~ WLD Regions Clubbed Ge~ Military expenditure (cu~ 1.73e+12 NA
## 5 World_2~ WLD Regions Clubbed Ge~ Military expenditure (cu~ 1.74e+12 NA
## 6 World_2~ WLD Regions Clubbed Ge~ Military expenditure (cu~ 1.74e+12 NA
#Separating the key column
Data2 <- Data %>% separate(Key, c("Name", "Year"), sep = "_")
# Combining the continent column
Exp <- left_join(Data2, Cont, by = c("Name" = "country"))
head(Exp)
## # A tibble: 6 x 15
## Name Year Code Type `Indicator Name` Expenditure Count code_2 code_3
## <chr> <chr> <fct> <fct> <fct> <fct> <dbl> <chr> <chr>
## 1 World 2008 WLD Regi~ Military expend~ 1.49e+12 NA <NA> <NA>
## 2 World 2009 WLD Regi~ Military expend~ 1.55e+12 NA <NA> <NA>
## 3 World 2010 WLD Regi~ Military expend~ 1.63e+12 NA <NA> <NA>
## 4 World 2011 WLD Regi~ Military expend~ 1.73e+12 NA <NA> <NA>
## 5 World 2012 WLD Regi~ Military expend~ 1.74e+12 NA <NA> <NA>
## 6 World 2013 WLD Regi~ Military expend~ 1.74e+12 NA <NA> <NA>
## # ... with 6 more variables: country_code <dbl>, iso_3166_2 <chr>,
## # continent <chr>, sub_region <chr>, region_code <dbl>, sub_region_code <dbl>
When we look at a summary of the data we see that there are NA values created due to the different naming scheme of countries in different datasets. Therefore we change these names to the correct format using the recode() function.
Population$`Country Name` <- recode(Population$`Country Name`, `Egypt, Arab Rep.` ="Egypt", `Gambia, The` = "Gambia",
`Iran, Islamic Rep.` = "Iran",`Korea, Rep.` = "South Korea", `Venezuela, RB` = "Venezuela")
The datasets are then merged again, with the adjusted names.
# Uniting the columns to make a key
Mil1 <- Mil %>% unite("Key", Name, Year)
Pop2 <- Population %>% unite("Key", `Country Name`, Year)
Data3 <- left_join(Mil1, Pop2, by="Key")
head(Data)
## # A tibble: 6 x 6
## Key Code Type `Indicator Name` Expenditure Count
## <chr> <fct> <fct> <fct> <fct> <dbl>
## 1 World_2~ WLD Regions Clubbed Ge~ Military expenditure (cu~ 1.49e+12 NA
## 2 World_2~ WLD Regions Clubbed Ge~ Military expenditure (cu~ 1.55e+12 NA
## 3 World_2~ WLD Regions Clubbed Ge~ Military expenditure (cu~ 1.63e+12 NA
## 4 World_2~ WLD Regions Clubbed Ge~ Military expenditure (cu~ 1.73e+12 NA
## 5 World_2~ WLD Regions Clubbed Ge~ Military expenditure (cu~ 1.74e+12 NA
## 6 World_2~ WLD Regions Clubbed Ge~ Military expenditure (cu~ 1.74e+12 NA
#Separating the key column
Data4 <- Data3 %>% separate(Key, c("Name", "Year"), sep = "_")
# Combining the continent column
Exp <- left_join(Data4, Cont, by = c("Name" = "country"))
head(Exp)
## # A tibble: 6 x 15
## Name Year Code Type `Indicator Name` Expenditure Count code_2 code_3
## <chr> <chr> <fct> <fct> <fct> <fct> <dbl> <chr> <chr>
## 1 World 2008 WLD Regi~ Military expend~ 1.49e+12 NA <NA> <NA>
## 2 World 2009 WLD Regi~ Military expend~ 1.55e+12 NA <NA> <NA>
## 3 World 2010 WLD Regi~ Military expend~ 1.63e+12 NA <NA> <NA>
## 4 World 2011 WLD Regi~ Military expend~ 1.73e+12 NA <NA> <NA>
## 5 World 2012 WLD Regi~ Military expend~ 1.74e+12 NA <NA> <NA>
## 6 World 2013 WLD Regi~ Military expend~ 1.74e+12 NA <NA> <NA>
## # ... with 6 more variables: country_code <dbl>, iso_3166_2 <chr>,
## # continent <chr>, sub_region <chr>, region_code <dbl>, sub_region_code <dbl>
The Type column tells us whether the observation is a country or a group of countries. We filter out the data to only keep observations that are ‘Country’ in the Type column.
Exp1 <- Exp %>% filter(Type == "Country")
Once again we select the columns we need.
# Selecting the required variables
Exp2 <- Exp1 %>% select(Name, Year,continent,Expenditure,Count )
head(Exp2)
## # A tibble: 6 x 5
## Name Year continent Expenditure Count
## <chr> <chr> <chr> <fct> <dbl>
## 1 United States 2008 <NA> 6.21e+11 304093966
## 2 United States 2009 <NA> 6.69e+11 306771529
## 3 United States 2010 <NA> 6.98e+11 309326085
## 4 United States 2011 <NA> 7.11e+11 311580009
## 5 United States 2012 <NA> 6.85e+11 313874218
## 6 United States 2013 <NA> 6.4e+11 316057727
We check if we have any missing values in our current dataset
colSums(is.na(Exp2))
## Name Year continent Expenditure Count
## 0 0 190 0 0
There are 190 missing values in the continent column. Having missing values in the continent column will not impact the analysis, however, we must not have any incomplete values in the Expenditure and Count columns. Since this requirement is met, we move on to converting the variables to the correct format.
# First we check the summary of our variables using summary() and str()
str(Exp2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1550 obs. of 5 variables:
## $ Name : chr "United States" "United States" "United States" "United States" ...
## $ Year : chr "2008" "2009" "2010" "2011" ...
## $ continent : chr NA NA NA NA ...
## $ Expenditure: Factor w/ 1841 levels "0","1.03e+11",..: 158 342 525 718 895 159 1267 1428 1640 1806 ...
## $ Count : num 3.04e+08 3.07e+08 3.09e+08 3.12e+08 3.14e+08 ...
summary(Exp2)
## Name Year continent Expenditure
## Length:1550 Length:1550 Length:1550 0 : 30
## Class :character Class :character Class :character 2115785124 : 8
## Mode :character Mode :character Mode :character 2474313259 : 8
## 2513200000 : 7
## 1867877499 : 6
## 2136472861.5: 6
## (Other) :1485
## Count
## Min. :8.696e+04
## 1st Qu.:3.859e+06
## Median :1.020e+07
## Mean :4.473e+07
## 3rd Qu.:3.087e+07
## Max. :1.386e+09
##
We observe that we have 1500 observations of 7 variables. We have one numeric variables (Count), two factor variable (Expenditure and Code) and 3 character variables. The Expenditure variable is incorrectly stored as a factor instead of a numeric variable. The Year column and Continent columns can be better stored as factors instead of characters.
We change the continent and year variables to a factor variable and Expenditure to Numeric Variable
# Converting Continent and Year Variable to factor
Exp2$continent <- factor(Exp2$continent, labels = c("Africa","Americas", "Asia", "Europe","Oceania" ))
Exp2$Year <- factor(Exp2$Year, labels= c("2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017"))
# Converting Expenditure to Numeric
Exp2$Expenditure <- as.numeric(as.character(Exp2$Expenditure))
levels(Exp2$Year)
## [1] "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017"
Exp2 <- as.data.frame(Exp2)
head(Exp2)
## Name Year continent Expenditure Count
## 1 United States 2008 <NA> 6.21e+11 304093966
## 2 United States 2009 <NA> 6.69e+11 306771529
## 3 United States 2010 <NA> 6.98e+11 309326085
## 4 United States 2011 <NA> 7.11e+11 311580009
## 5 United States 2012 <NA> 6.85e+11 313874218
## 6 United States 2013 <NA> 6.40e+11 316057727
str(Exp2)
## 'data.frame': 1550 obs. of 5 variables:
## $ Name : chr "United States" "United States" "United States" "United States" ...
## $ Year : Factor w/ 10 levels "2008","2009",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ continent : Factor w/ 5 levels "Africa","Americas",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Expenditure: num 6.21e+11 6.69e+11 6.98e+11 7.11e+11 6.85e+11 ...
## $ Count : num 3.04e+08 3.07e+08 3.09e+08 3.12e+08 3.14e+08 ...
summary(Exp2)
## Name Year continent Expenditure
## Length:1550 2008 :155 Africa :410 Min. :0.000e+00
## Class :character 2009 :155 Americas:220 1st Qu.:1.489e+08
## Mode :character 2010 :155 Asia :340 Median :8.003e+08
## 2011 :155 Europe :350 Mean :1.068e+10
## 2012 :155 Oceania : 40 3rd Qu.:4.380e+09
## 2013 :155 NA's :190 Max. :7.110e+11
## (Other):620
## Count
## Min. :8.696e+04
## 1st Qu.:3.859e+06
## Median :1.020e+07
## Mean :4.473e+07
## 3rd Qu.:3.087e+07
## Max. :1.386e+09
##
colSums(is.na(Exp2))
## Name Year continent Expenditure Count
## 0 0 190 0 0
sum(is.infinite(Exp2$Expenditure))
## [1] 0
sum(is.nan(Exp2$Expenditure))
## [1] 0
sum(is.infinite(Exp2$Count))
## [1] 0
sum(is.nan(Exp2$Count))
## [1] 0
There are no missing values in the required columns (expenditure and count) as well as no special values in the Expenditure and count variables. Additionally, through the summary function we can also conclude that there are no obvious errors in the data set as the max and min values are reasonable.
We check the data for possible outliers. We use “Tukey’s method of outlier detection” to identify the outliers. In this method we look for data points that lie outside the range
[Q1 − 1.5 × IQR Q3 + 1.5 × IQR]
where, * Q1 is the first quartile, * Q3 is the third quartile and * IQR is the inter quartile range.
We first plot a boxplot to visualise the outliers in the data for the Expenditure and Count variables.
par(mfrow=c(1,2))
boxplot(Exp2$Expenditure, Main= "Expenditure", col = "Maroon" )
boxplot(Exp2$Count, Main= "Count", col = "Blue" )
Due to the presence of some extreme values we are unable to see the distribution. We create a subset of the dataframe named outlier_data to see the complete information of the univariate outliers in both variables.
# Outliers for Expenditure
iqr <- IQR(Exp2$Expenditure)
Q1 <- quantile(Exp2$Expenditure, 0.25)
Q3 <- quantile(Exp2$Expenditure, 0.75)
outlier_data1 <- subset(Exp2, Exp2$Expenditure < (Q1 - 1.5*iqr) | Exp2$Expenditure > (Q3 + 1.5*iqr))
head(outlier_data1)
## Name Year continent Expenditure Count
## 1 United States 2008 <NA> 6.21e+11 304093966
## 2 United States 2009 <NA> 6.69e+11 306771529
## 3 United States 2010 <NA> 6.98e+11 309326085
## 4 United States 2011 <NA> 7.11e+11 311580009
## 5 United States 2012 <NA> 6.85e+11 313874218
## 6 United States 2013 <NA> 6.40e+11 316057727
summary(outlier_data1)
## Name Year continent Expenditure
## Length:193 2008 :20 Africa : 0 Min. :1.108e+10
## Class :character 2009 :20 Americas:23 1st Qu.:1.866e+10
## Mode :character 2010 :20 Asia :68 Median :3.405e+10
## 2011 :20 Europe :54 Mean :7.356e+10
## 2012 :20 Oceania :10 3rd Qu.:5.808e+10
## 2013 :20 NA's :38 Max. :7.110e+11
## (Other):73
## Count
## Min. :7.089e+06
## 1st Qu.:3.570e+07
## Median :6.560e+07
## Mean :2.103e+08
## 3rd Qu.:1.428e+08
## Max. :1.386e+09
##
# Outliers for Count
Ciqr <- IQR(Exp2$Count)
CQ1 <- quantile(Exp2$Count, 0.25)
CQ3 <- quantile(Exp2$Count, 0.75)
outlier_data2 <- subset(Exp2, Exp2$Count < (CQ1 - 1.5*Ciqr) | Exp2$Count > (CQ3 + 1.5*Ciqr))
head(outlier_data2)
## Name Year continent Expenditure Count
## 1 United States 2008 <NA> 6.21e+11 304093966
## 2 United States 2009 <NA> 6.69e+11 306771529
## 3 United States 2010 <NA> 6.98e+11 309326085
## 4 United States 2011 <NA> 7.11e+11 311580009
## 5 United States 2012 <NA> 6.85e+11 313874218
## 6 United States 2013 <NA> 6.40e+11 316057727
summary(outlier_data2)
## Name Year continent Expenditure
## Length:182 2014 :19 Africa :30 Min. :2.952e+08
## Class :character 2015 :19 Americas:20 1st Qu.:3.312e+09
## Mode :character 2016 :19 Asia :78 Median :9.073e+09
## 2017 :19 Europe :20 Mean :6.123e+10
## 2010 :18 Oceania : 0 3rd Qu.:4.678e+10
## 2011 :18 NA's :34 Max. :7.110e+11
## (Other):70
## Count
## Min. :7.212e+07
## 1st Qu.:8.989e+07
## Median :1.281e+08
## Mean :2.699e+08
## 3rd Qu.:2.023e+08
## Max. :1.386e+09
##
We check to see if there are any multivariate outliers in the data based on the Military Expenditure (Expenditure) and the Population of the country (Count). This can be done by plotting a scatter plot to visualize any outliers.
plot(Exp2$Expenditure,Exp2$Count, xlab = "Military Expenditure", ylab = "Population")
The scatter plot gives us a better view of the outliers. On further inspection we find that the outlier cluster towards the top of the chart consists of data points of China and India and the outliers to the extreme right are the data points of USA.
Although the data contains some extreme outliers, we know that these are not incorrect values. Therefore we must not remove or change them as these are real world data points and we don’t want the removal of valid data points to affect our results. We do not eliminate the outliers as these are many of the major countries that do spend large amounts of money on their military or have an extremely large population. Therefore deleting these values would lead to a loss of important information.
we create a per capita military expense column dividing military expense by count (population of the country)
# Creating the new column
Mil_Data <- Exp2 %>% mutate(Per_Capita_Exp = round((Expenditure/Count),2))
# We order the dataset by Per Capita Expenditure to see the highest per capita expenditure
Mil_Data2 <- Mil_Data[order(Mil_Data$Per_Capita_Exp, decreasing = TRUE),]
head(Mil_Data2)
## Name Year continent Expenditure Count Per_Capita_Exp
## 28 Saudi Arabia 2015 Asia 87185866667 31717667 2748.81
## 265 Oman 2012 Asia 9250650195 3498029 2644.53
## 27 Saudi Arabia 2014 Asia 80762400000 30916994 2612.23
## 1496 United Arab Emirates 2013 Asia 23561061947 9197910 2561.57
## 1497 United Arab Emirates 2014 Asia 22755071477 9214175 2469.57
## 266 Oman 2013 Asia 8766319896 3764805 2328.49
We plot a histogram of the Per_Capita_Exp column to see the distribution of the data.
# Plotting a histogram
hist(Mil_Data2$Per_Capita_Exp, breaks = 50,
main = 'Histogram of Per Capita Expenditure',
xlab = 'Per Capita Expenditure')
We can see that the data is highly right skewed. In order to conduct a linear regression or any other analysis we would require the data to follow normal distribution. Since the data is right skewed we first try the log transformation to reduce skewness.
Mil_Data3 <- Mil_Data2 %>% mutate(`Log of Per_Capita_Exp` = log(Per_Capita_Exp))
We plot another histogram to see the change in the distribution after applying log transformation.
hist(Mil_Data3$`Log of Per_Capita_Exp` ,
main = 'Transformed Per Capita Expenditure',
breaks = 50,
xlab = 'Log of Per Capita Expenditure')
We now see that the data is more normally distributed as compared to the initial distribution.
Military Spending of Countries (1960-2019), Military Expenditure.csv , Dataset, Kaggle.com, https://www.kaggle.com/nitinsss/military-expenditure-of-countries-19602019
World Population 1960-2018, population_total_long , Dataset, Kaggle.com, https://www.kaggle.com/imdevskp/world-population-19602018?select=population_total_long.csv
country to continent, countryContinent , Dataset, Kaggle.com, https://www.kaggle.com/statchaitya/country-to-continent