Required Packages

The following packages are required to produce the report.

# Required packages
library(readr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Executive Summary

This report aims to create a clean dataset and subsequently find the per capita Military Expenditure by Countries over the ten year span between 2008 and 2017. Three datasets (Military Expenditure, Population, Continents) are merged for this purpose. We first see the summary of the datasets and look at their structure. We then subset the Military Expenditure dataset to keep the required columns. We look for missing values in the datasets and impute the median where necessary. After merging the datasets, we apply the correct type conversions to the variables and then check for outliers in the Expenditure (Military Expenditure) and Count (population) variables. Since these datasets represent real world information, and so we do not impute or eliminate the outliers. Lastly we create a new column (Per Capita Expenditure) and then apply log transformation to reduce its skewness.

Data

As discussed earlier, we import 3 datasets for this analysis. Their descriptions are as follows:

1. Military Expenditure

The file can be found in the ‘Military Spending of Countries (1960-2019)’ dataset. The file to be used is ‘Military Expenditure.csv’. This dataset has been taken from kaggle and can be viewed using the following link: https://www.kaggle.com/nitinsss/military-expenditure-of-countries-19602019

The dataset has 264 observation and 63 variables. The variables include the:

2. Population of the each Country

The second dataset contains population information of all countries from 1960 to 2018. The dataset consists of 6 files and for this analysis we will use the file named ‘population_total_long’. This dataset has also been downloaded from Kaggle and can be accessed using the following link: https://www.kaggle.com/imdevskp/world-population-19602018?select=population_total_long.csv

The data file has 12,595 observations and 3 variables. The variables are as follows:

3. Continent-wise information

The third dataset, named ‘Country to Continent’ has also been taken from Kaggle. It can be accessed using the following link: https://www.kaggle.com/statchaitya/country-to-continent. The data file used, ‘countryContinent’, has 249 observations and 9 variables. The important variables are as follows:

The other variables are ‘code_2’, ‘code_3’, ‘country_code’, ‘iso_3166_2’, ‘region_code’,‘Sub_region’ and ‘sub_region_code’. These variables are not essential to the analysis and hence we do not look into them any further. We will subsequently merge the 3 datasets after tidying them.

Importing the data

We import the 3 csv files using the read_csv() function from the readr package.

#Importing the data
Mil_Expenditure <- read_csv("C:/Users/Joshua/Desktop/Data Wrangling Military Expenditure.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Name = col_character(),
##   Code = col_character(),
##   Type = col_character(),
##   `Indicator Name` = col_character()
## )
## See spec(...) for full column specifications.
head(Mil_Expenditure)
## # A tibble: 6 x 63
##   Name  Code  Type  `Indicator Name`   `1960`   `1961`   `1962`   `1963`
##   <chr> <chr> <chr> <chr>               <dbl>    <dbl>    <dbl>    <dbl>
## 1 World WLD   Regi~ Military expend~ NA       NA       NA       NA      
## 2 High~ HIC   Regi~ Military expend~  6.74e10  7.13e10  7.85e10  8.06e10
## 3 OECD~ OED   Regi~ Military expend~  6.78e10  7.16e10  7.89e10  8.10e10
## 4 Post~ PST   Regi~ Military expend~  6.33e10  6.68e10  7.37e10  7.51e10
## 5 Nort~ NAC   Regi~ Military expend~  4.71e10  4.95e10  5.41e10  5.39e10
## 6 Unit~ USA   Coun~ Military expend~  4.54e10  4.78e10  5.24e10  5.23e10
## # ... with 55 more variables: `1964` <dbl>, `1965` <dbl>, `1966` <dbl>,
## #   `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>,
## #   `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>,
## #   `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
## #   `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
## #   `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
## #   `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
## #   `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>,
## #   `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>,
## #   `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## #   `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>,
## #   `2017` <dbl>, `2018` <dbl>
Population <- read_csv("C:/Users/Joshua/Desktop/Population.csv")
## Parsed with column specification:
## cols(
##   `Country Name` = col_character(),
##   Year = col_double(),
##   Count = col_double()
## )
head(Population)
## # A tibble: 6 x 3
##   `Country Name`        Year   Count
##   <chr>                <dbl>   <dbl>
## 1 Aruba                 1960   54211
## 2 Afghanistan           1960 8996973
## 3 Angola                1960 5454933
## 4 Albania               1960 1608800
## 5 Andorra               1960   13411
## 6 United Arab Emirates  1960   92418
Cont <- read_csv("C:/Users/Joshua/Desktop/Continent Info.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   code_2 = col_character(),
##   code_3 = col_character(),
##   country_code = col_double(),
##   iso_3166_2 = col_character(),
##   continent = col_character(),
##   sub_region = col_character(),
##   region_code = col_double(),
##   sub_region_code = col_double()
## )
head(Cont)
## # A tibble: 6 x 9
##   country code_2 code_3 country_code iso_3166_2 continent sub_region region_code
##   <chr>   <chr>  <chr>         <dbl> <chr>      <chr>     <chr>            <dbl>
## 1 "Afgha~ AF     AFG               4 ISO 3166-~ Asia      Southern ~         142
## 2 "\xc5l~ AX     ALA             248 ISO 3166-~ Europe    Northern ~         150
## 3 "Alban~ AL     ALB               8 ISO 3166-~ Europe    Southern ~         150
## 4 "Alger~ DZ     DZA              12 ISO 3166-~ Africa    Northern ~           2
## 5 "Ameri~ AS     ASM              16 ISO 3166-~ Oceania   Polynesia            9
## 6 "Andor~ AD     AND              20 ISO 3166-~ Europe    Southern ~         150
## # ... with 1 more variable: sub_region_code <dbl>

Understand

Viewing a Summary of the Data

# We summarize the data by using the str() function 
str(Mil_Expenditure)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 264 obs. of  63 variables:
##  $ Name          : chr  "World" "High income" "OECD members" "Post-demographic dividend" ...
##  $ Code          : chr  "WLD" "HIC" "OED" "PST" ...
##  $ Type          : chr  "Regions Clubbed Geographically" "Regions Clubbed Economically" "Regions Clubbed Economically" "Regions Clubbed Economically" ...
##  $ Indicator Name: chr  "Military expenditure (current USD)" "Military expenditure (current USD)" "Military expenditure (current USD)" "Military expenditure (current USD)" ...
##  $ 1960          : num  NA 6.74e+10 6.78e+10 6.33e+10 4.71e+10 ...
##  $ 1961          : num  NA 7.13e+10 7.16e+10 6.68e+10 4.95e+10 ...
##  $ 1962          : num  NA 7.85e+10 7.89e+10 7.37e+10 5.41e+10 ...
##  $ 1963          : num  NA 8.06e+10 8.10e+10 7.51e+10 5.39e+10 ...
##  $ 1964          : num  NA 8.11e+10 8.15e+10 7.53e+10 5.29e+10 ...
##  $ 1965          : num  NA 8.33e+10 8.36e+10 7.71e+10 5.34e+10 ...
##  $ 1966          : num  NA 9.73e+10 9.75e+10 9.04e+10 6.52e+10 ...
##  $ 1967          : num  NA 1.12e+11 1.12e+11 1.04e+11 7.72e+10 ...
##  $ 1968          : num  NA 1.18e+11 1.18e+11 1.09e+11 8.25e+10 ...
##  $ 1969          : num  NA 1.21e+11 1.21e+11 1.11e+11 8.32e+10 ...
##  $ 1970          : num  NA 1.23e+11 1.23e+11 1.12e+11 8.17e+10 ...
##  $ 1971          : num  NA 1.23e+11 1.23e+11 1.11e+11 7.69e+10 ...
##  $ 1972          : num  NA 1.35e+11 1.34e+11 1.21e+11 7.99e+10 ...
##  $ 1973          : num  NA 1.49e+11 1.48e+11 1.31e+11 8.07e+10 ...
##  $ 1974          : num  NA 1.65e+11 1.65e+11 1.46e+11 8.87e+10 ...
##  $ 1975          : num  NA 1.79e+11 1.81e+11 1.60e+11 9.16e+10 ...
##  $ 1976          : num  NA 1.85e+11 1.87e+11 1.64e+11 9.46e+10 ...
##  $ 1977          : num  NA 2.18e+11 2.08e+11 1.84e+11 1.05e+11 ...
##  $ 1978          : num  NA 2.41e+11 2.29e+11 2.11e+11 1.13e+11 ...
##  $ 1979          : num  NA 2.70e+11 2.53e+11 2.43e+11 1.26e+11 ...
##  $ 1980          : num  3.64e+11 3.09e+11 2.88e+11 2.78e+11 1.43e+11 ...
##  $ 1981          : num  4.00e+11 3.39e+11 3.14e+11 3.02e+11 1.75e+11 ...
##  $ 1982          : num  NA 3.86e+11 3.57e+11 3.45e+11 2.20e+11 ...
##  $ 1983          : num  NA 3.78e+11 3.54e+11 3.43e+11 2.21e+11 ...
##  $ 1984          : num  4.48e+11 3.92e+11 3.68e+11 3.57e+11 2.39e+11 ...
##  $ 1985          : num  4.74e+11 4.17e+11 3.97e+11 3.87e+11 2.66e+11 ...
##  $ 1986          : num  5.20e+11 4.58e+11 4.57e+11 4.47e+11 2.89e+11 ...
##  $ 1987          : num  5.88e+11 5.16e+11 5.00e+11 4.88e+11 2.97e+11 ...
##  $ 1988          : num  6.27e+11 5.48e+11 5.25e+11 5.12e+11 3.03e+11 ...
##  $ 1989          : num  6.45e+11 5.57e+11 5.34e+11 5.24e+11 3.15e+11 ...
##  $ 1990          : num  7.00e+11 6.09e+11 5.76e+11 5.62e+11 3.18e+11 ...
##  $ 1991          : num  6.85e+11 5.97e+11 5.61e+11 5.42e+11 2.92e+11 ...
##  $ 1992          : num  7.17e+11 6.24e+11 5.95e+11 5.78e+11 3.16e+11 ...
##  $ 1993          : num  6.86e+11 5.98e+11 5.70e+11 5.52e+11 3.08e+11 ...
##  $ 1994          : num  6.95e+11 5.97e+11 5.68e+11 5.52e+11 2.98e+11 ...
##  $ 1995          : num  7.17e+11 6.10e+11 5.81e+11 5.65e+11 2.88e+11 ...
##  $ 1996          : num  7.16e+11 5.99e+11 5.71e+11 5.53e+11 2.80e+11 ...
##  $ 1997          : num  7.14e+11 5.89e+11 5.54e+11 5.36e+11 2.84e+11 ...
##  $ 1998          : num  6.96e+11 5.80e+11 5.46e+11 5.26e+11 2.82e+11 ...
##  $ 1999          : num  7.09e+11 5.90e+11 5.60e+11 5.39e+11 2.89e+11 ...
##  $ 2000          : num  7.30e+11 6.01e+11 5.69e+11 5.46e+11 3.10e+11 ...
##  $ 2001          : num  7.42e+11 6.05e+11 5.70e+11 5.49e+11 3.21e+11 ...
##  $ 2002          : num  8.02e+11 6.63e+11 6.32e+11 6.10e+11 3.65e+11 ...
##  $ 2003          : num  9.37e+11 7.72e+11 7.40e+11 7.16e+11 4.25e+11 ...
##  $ 2004          : num  1.06e+12 8.64e+11 8.29e+11 8.03e+11 4.76e+11 ...
##  $ 2005          : num  1.15e+12 9.20e+11 8.80e+11 8.52e+11 5.16e+11 ...
##  $ 2006          : num  1.19e+12 9.62e+11 9.19e+11 8.88e+11 5.42e+11 ...
##  $ 2007          : num  1.32e+12 1.04e+12 9.93e+11 9.59e+11 5.74e+11 ...
##  $ 2008          : num  1.49e+12 1.15e+12 1.10e+12 1.06e+12 6.40e+11 ...
##  $ 2009          : num  1.55e+12 1.19e+12 1.13e+12 1.09e+12 6.88e+11 ...
##  $ 2010          : num  1.63e+12 1.23e+12 1.16e+12 1.12e+12 7.17e+11 ...
##  $ 2011          : num  1.73e+12 1.28e+12 1.20e+12 1.16e+12 7.33e+11 ...
##  $ 2012          : num  1.74e+12 1.25e+12 1.16e+12 1.12e+12 7.05e+11 ...
##  $ 2013          : num  1.74e+12 1.21e+12 1.10e+12 1.06e+12 6.58e+11 ...
##  $ 2014          : num  1.74e+12 1.19e+12 1.08e+12 1.03e+12 6.28e+11 ...
##  $ 2015          : num  1.64e+12 1.11e+12 1.01e+12 9.71e+11 6.14e+11 ...
##  $ 2016          : num  1.63e+12 1.10e+12 1.02e+12 9.84e+11 6.18e+11 ...
##  $ 2017          : num  1.70e+12 1.13e+12 1.05e+12 1.01e+12 6.27e+11 ...
##  $ 2018          : num  1.78e+12 1.20e+12 1.12e+12 1.08e+12 6.70e+11 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Name = col_character(),
##   ..   Code = col_character(),
##   ..   Type = col_character(),
##   ..   `Indicator Name` = col_character(),
##   ..   `1960` = col_double(),
##   ..   `1961` = col_double(),
##   ..   `1962` = col_double(),
##   ..   `1963` = col_double(),
##   ..   `1964` = col_double(),
##   ..   `1965` = col_double(),
##   ..   `1966` = col_double(),
##   ..   `1967` = col_double(),
##   ..   `1968` = col_double(),
##   ..   `1969` = col_double(),
##   ..   `1970` = col_double(),
##   ..   `1971` = col_double(),
##   ..   `1972` = col_double(),
##   ..   `1973` = col_double(),
##   ..   `1974` = col_double(),
##   ..   `1975` = col_double(),
##   ..   `1976` = col_double(),
##   ..   `1977` = col_double(),
##   ..   `1978` = col_double(),
##   ..   `1979` = col_double(),
##   ..   `1980` = col_double(),
##   ..   `1981` = col_double(),
##   ..   `1982` = col_double(),
##   ..   `1983` = col_double(),
##   ..   `1984` = col_double(),
##   ..   `1985` = col_double(),
##   ..   `1986` = col_double(),
##   ..   `1987` = col_double(),
##   ..   `1988` = col_double(),
##   ..   `1989` = col_double(),
##   ..   `1990` = col_double(),
##   ..   `1991` = col_double(),
##   ..   `1992` = col_double(),
##   ..   `1993` = col_double(),
##   ..   `1994` = col_double(),
##   ..   `1995` = col_double(),
##   ..   `1996` = col_double(),
##   ..   `1997` = col_double(),
##   ..   `1998` = col_double(),
##   ..   `1999` = col_double(),
##   ..   `2000` = col_double(),
##   ..   `2001` = col_double(),
##   ..   `2002` = col_double(),
##   ..   `2003` = col_double(),
##   ..   `2004` = col_double(),
##   ..   `2005` = col_double(),
##   ..   `2006` = col_double(),
##   ..   `2007` = col_double(),
##   ..   `2008` = col_double(),
##   ..   `2009` = col_double(),
##   ..   `2010` = col_double(),
##   ..   `2011` = col_double(),
##   ..   `2012` = col_double(),
##   ..   `2013` = col_double(),
##   ..   `2014` = col_double(),
##   ..   `2015` = col_double(),
##   ..   `2016` = col_double(),
##   ..   `2017` = col_double(),
##   ..   `2018` = col_double()
##   .. )
str(Population)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 12595 obs. of  3 variables:
##  $ Country Name: chr  "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ Year        : num  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##  $ Count       : num  54211 8996973 5454933 1608800 13411 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Country Name` = col_character(),
##   ..   Year = col_double(),
##   ..   Count = col_double()
##   .. )
str(Cont)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 249 obs. of  9 variables:
##  $ country        : chr  "Afghanistan" "<c5>land Islands" "Albania" "Algeria" ...
##  $ code_2         : chr  "AF" "AX" "AL" "DZ" ...
##  $ code_3         : chr  "AFG" "ALA" "ALB" "DZA" ...
##  $ country_code   : num  4 248 8 12 16 20 24 660 10 28 ...
##  $ iso_3166_2     : chr  "ISO 3166-2:AF" "ISO 3166-2:AX" "ISO 3166-2:AL" "ISO 3166-2:DZ" ...
##  $ continent      : chr  "Asia" "Europe" "Europe" "Africa" ...
##  $ sub_region     : chr  "Southern Asia" "Northern Europe" "Southern Europe" "Northern Africa" ...
##  $ region_code    : num  142 150 150 2 9 150 2 19 NA 19 ...
##  $ sub_region_code: num  34 154 39 15 61 39 17 29 NA 29 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   country = col_character(),
##   ..   code_2 = col_character(),
##   ..   code_3 = col_character(),
##   ..   country_code = col_double(),
##   ..   iso_3166_2 = col_character(),
##   ..   continent = col_character(),
##   ..   sub_region = col_character(),
##   ..   region_code = col_double(),
##   ..   sub_region_code = col_double()
##   .. )

There are several unnecesary columns in the datasets. There are also several missing values in each column of the military expenditure dataset. We have to subset the dataset to keep only the required columns in all datasets. Additionally we must change the Military expenditure dataset to long format in line with Hadley Wickham’s Tidy data principles (each variable must have its own column, each observation must have its own row and each value must have its own cell).

The datasets also contain variables of the different data types. We have variables in factor, character and numeric format, but some variables are stored incorrectly. We must convert the Continent (from continent dataset) to factors.

Further inspection must be conducted after merging the data sets.

Selecting the required columns

We first only select the columns we need. For this we select only the 10 year period between 2008-2017. We select this time period as we want to select the most recent ten year period that we have complete data for. Since we do not have population data for 2018, we only study the ten year period from 2008-2017.

# Select the required columns. 
Mil_Exp <- Mil_Expenditure %>% select(Name, Code,Type,`Indicator Name`,`2008`,`2009`,`2010`,`2011`,`2012`,`2013`,`2014`,`2015`,`2016`,`2017`)
head(Mil_Exp)
## # A tibble: 6 x 14
##   Name  Code  Type  `Indicator Name`  `2008`  `2009`  `2010`  `2011`  `2012`
##   <chr> <chr> <chr> <chr>              <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 World WLD   Regi~ Military expend~ 1.49e12 1.55e12 1.63e12 1.73e12 1.74e12
## 2 High~ HIC   Regi~ Military expend~ 1.15e12 1.19e12 1.23e12 1.28e12 1.25e12
## 3 OECD~ OED   Regi~ Military expend~ 1.10e12 1.13e12 1.16e12 1.20e12 1.16e12
## 4 Post~ PST   Regi~ Military expend~ 1.06e12 1.09e12 1.12e12 1.16e12 1.12e12
## 5 Nort~ NAC   Regi~ Military expend~ 6.40e11 6.88e11 7.17e11 7.33e11 7.05e11
## 6 Unit~ USA   Coun~ Military expend~ 6.21e11 6.69e11 6.98e11 7.11e11 6.85e11
## # ... with 5 more variables: `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## #   `2016` <dbl>, `2017` <dbl>

Scan I: Dealing with Missing Values

We take a look at a summary of the data and the number of missing values for each variable.

colSums(is.na(Mil_Exp))
##           Name           Code           Type Indicator Name           2008 
##              0              0              0              0             63 
##           2009           2010           2011           2012           2013 
##             67             68             70             65             62 
##           2014           2015           2016           2017 
##             61             66             67             69
summary(Mil_Exp)
##      Name               Code               Type           Indicator Name    
##  Length:264         Length:264         Length:264         Length:264        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       2008                2009                2010          
##  Min.   :0.000e+00   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:1.940e+08   1st Qu.:2.148e+08   1st Qu.:2.195e+08  
##  Median :1.732e+09   Median :2.116e+09   Median :2.136e+09  
##  Mean   :5.607e+10   Mean   :5.919e+10   Mean   :6.289e+10  
##  3rd Qu.:1.713e+10   3rd Qu.:1.783e+10   3rd Qu.:1.682e+10  
##  Max.   :1.490e+12   Max.   :1.550e+12   Max.   :1.630e+12  
##  NA's   :63          NA's   :67          NA's   :68         
##       2011                2012                2013          
##  Min.   :0.000e+00   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:2.592e+08   1st Qu.:2.309e+08   1st Qu.:2.511e+08  
##  Median :2.485e+09   Median :2.590e+09   Median :2.436e+09  
##  Mean   :6.823e+10   Mean   :6.767e+10   Mean   :6.782e+10  
##  3rd Qu.:1.823e+10   3rd Qu.:1.876e+10   3rd Qu.:1.937e+10  
##  Max.   :1.730e+12   Max.   :1.740e+12   Max.   :1.740e+12  
##  NA's   :70          NA's   :65          NA's   :62         
##       2014                2015                2016          
##  Min.   :0.000e+00   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:2.492e+08   1st Qu.:2.489e+08   1st Qu.:2.607e+08  
##  Median :2.356e+09   Median :2.416e+09   Median :2.456e+09  
##  Mean   :6.788e+10   Mean   :6.553e+10   Mean   :6.562e+10  
##  3rd Qu.:2.032e+10   3rd Qu.:1.928e+10   3rd Qu.:1.785e+10  
##  Max.   :1.740e+12   Max.   :1.640e+12   Max.   :1.630e+12  
##  NA's   :61          NA's   :66          NA's   :67         
##       2017          
##  Min.   :0.000e+00  
##  1st Qu.:2.851e+08  
##  Median :2.328e+09  
##  Mean   :6.928e+10  
##  3rd Qu.:2.148e+10  
##  Max.   :1.700e+12  
##  NA's   :69

This current data set has 264 countries and 14 variables. We observe that there are still a considerable number of missing values in each year column. We eliminate the observations that have missing values for more than 5 years. This is because we will impute the median of the expenditure of all the countries that year. Imputing more than half the values for any country would not be appropriate for the analysis, as it would affect the integrity of the data and the significance of the results.

# We subset the data by keeping only the observations with less than 5 missing values
Mil_Exp2 <- Mil_Exp[which(rowSums(is.na(Mil_Exp)) < 5),]
colSums(is.na(Mil_Exp2 ))
##           Name           Code           Type Indicator Name           2008 
##              0              0              0              0              5 
##           2009           2010           2011           2012           2013 
##              7              6              7              2              1 
##           2014           2015           2016           2017 
##              1              5              7              8

Looking for NA values and Imputing the median

We impute the median instead of the mean as there are extreme values in the dataset that would impact the mean value. In such a situation imputing with the median can give a more feasible estimate.

# IMPUTE the median into the missing values. 
  
Mil_Exp3 <- sapply(Mil_Exp2,function(x) {
  if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=T),x) else x})

head(Mil_Exp3)
##      Name                        Code  Type                            
## [1,] "World"                     "WLD" "Regions Clubbed Geographically"
## [2,] "High income"               "HIC" "Regions Clubbed Economically"  
## [3,] "OECD members"              "OED" "Regions Clubbed Economically"  
## [4,] "Post-demographic dividend" "PST" "Regions Clubbed Economically"  
## [5,] "North America"             "NAC" "Regions Clubbed Geographically"
## [6,] "United States"             "USA" "Country"                       
##      Indicator Name                       2008       2009       2010      
## [1,] "Military expenditure (current USD)" "1.49e+12" "1.55e+12" "1.63e+12"
## [2,] "Military expenditure (current USD)" "1.15e+12" "1.19e+12" "1.23e+12"
## [3,] "Military expenditure (current USD)" "1.1e+12"  "1.13e+12" "1.16e+12"
## [4,] "Military expenditure (current USD)" "1.06e+12" "1.09e+12" "1.12e+12"
## [5,] "Military expenditure (current USD)" "6.4e+11"  "6.88e+11" "7.17e+11"
## [6,] "Military expenditure (current USD)" "6.21e+11" "6.69e+11" "6.98e+11"
##      2011       2012       2013       2014       2015       2016      
## [1,] "1.73e+12" "1.74e+12" "1.74e+12" "1.74e+12" "1.64e+12" "1.63e+12"
## [2,] "1.28e+12" "1.25e+12" "1.21e+12" "1.19e+12" "1.11e+12" "1.1e+12" 
## [3,] "1.2e+12"  "1.16e+12" "1.1e+12"  "1.08e+12" "1.01e+12" "1.02e+12"
## [4,] "1.16e+12" "1.12e+12" "1.06e+12" "1.03e+12" "9.71e+11" "9.84e+11"
## [5,] "7.33e+11" "7.05e+11" "6.58e+11" "6.28e+11" "6.14e+11" "6.18e+11"
## [6,] "7.11e+11" "6.85e+11" "6.4e+11"  "6.1e+11"  "5.96e+11" "6e+11"   
##      2017      
## [1,] "1.7e+12" 
## [2,] "1.13e+12"
## [3,] "1.05e+12"
## [4,] "1.01e+12"
## [5,] "6.27e+11"
## [6,] "6.06e+11"
colSums(is.na(Mil_Exp3))
##           Name           Code           Type Indicator Name           2008 
##              0              0              0              0              0 
##           2009           2010           2011           2012           2013 
##              0              0              0              0              0 
##           2014           2015           2016           2017 
##              0              0              0              0
Mil_Exp4 <- as.data.frame(Mil_Exp3)
head(Mil_Exp4)
##                        Name Code                           Type
## 1                     World  WLD Regions Clubbed Geographically
## 2               High income  HIC   Regions Clubbed Economically
## 3              OECD members  OED   Regions Clubbed Economically
## 4 Post-demographic dividend  PST   Regions Clubbed Economically
## 5             North America  NAC Regions Clubbed Geographically
## 6             United States  USA                        Country
##                       Indicator Name     2008     2009     2010     2011
## 1 Military expenditure (current USD) 1.49e+12 1.55e+12 1.63e+12 1.73e+12
## 2 Military expenditure (current USD) 1.15e+12 1.19e+12 1.23e+12 1.28e+12
## 3 Military expenditure (current USD)  1.1e+12 1.13e+12 1.16e+12  1.2e+12
## 4 Military expenditure (current USD) 1.06e+12 1.09e+12 1.12e+12 1.16e+12
## 5 Military expenditure (current USD)  6.4e+11 6.88e+11 7.17e+11 7.33e+11
## 6 Military expenditure (current USD) 6.21e+11 6.69e+11 6.98e+11 7.11e+11
##       2012     2013     2014     2015     2016     2017
## 1 1.74e+12 1.74e+12 1.74e+12 1.64e+12 1.63e+12  1.7e+12
## 2 1.25e+12 1.21e+12 1.19e+12 1.11e+12  1.1e+12 1.13e+12
## 3 1.16e+12  1.1e+12 1.08e+12 1.01e+12 1.02e+12 1.05e+12
## 4 1.12e+12 1.06e+12 1.03e+12 9.71e+11 9.84e+11 1.01e+12
## 5 7.05e+11 6.58e+11 6.28e+11 6.14e+11 6.18e+11 6.27e+11
## 6 6.85e+11  6.4e+11  6.1e+11 5.96e+11    6e+11 6.06e+11

We now look for missing values in the population and continents dataset

# Check for missing values
colSums(is.na(Population))
## Country Name         Year        Count 
##            0            0            0
colSums(is.na(Cont))
##         country          code_2          code_3    country_code      iso_3166_2 
##               0               1               0               0               0 
##       continent      sub_region     region_code sub_region_code 
##               9               9               9               9

We see that there are currently no NA values in the population dataset. The Continent dataset has a few missing values, but no missing values in the Country Name column which will be used to merge the datasets. The missing values will be dealt with after the datasets are merged.

Tidy & Manipulate Data I

We gather the date columns of Mil_Exp4, Changing the format of Military Expenses to long

The data currently has information for each year in a different column. According to the Tidy data principle each observation should have its own row, each variable should be in a column and each value should be in a cell. The years are not variables and can be efficiently stored as values for each country under a variable ‘Year’.

Additionally the data in the population dataset is also in long format. In order to merge the two datasets it is necessary for us to convert the dataset to long format. For this purpose we can use the gather() function or its newer alternative, pivot_longer().

Mil <- Mil_Exp4 %>% pivot_longer(cols = c(-Name,-Code,-Type,-`Indicator Name`), names_to = "Year", values_to = "Expenditure")
head(Mil)
## # A tibble: 6 x 6
##   Name  Code  Type                  `Indicator Name`           Year  Expenditure
##   <fct> <fct> <fct>                 <fct>                      <chr> <fct>      
## 1 World WLD   Regions Clubbed Geog~ Military expenditure (cur~ 2008  1.49e+12   
## 2 World WLD   Regions Clubbed Geog~ Military expenditure (cur~ 2009  1.55e+12   
## 3 World WLD   Regions Clubbed Geog~ Military expenditure (cur~ 2010  1.63e+12   
## 4 World WLD   Regions Clubbed Geog~ Military expenditure (cur~ 2011  1.73e+12   
## 5 World WLD   Regions Clubbed Geog~ Military expenditure (cur~ 2012  1.74e+12   
## 6 World WLD   Regions Clubbed Geog~ Military expenditure (cur~ 2013  1.74e+12

We now have a tidy dataset for military expenditure data.

We merge the Mil_Exp and Continent datasets together on basis of country and then combine the population data based on country and year

We first create a key column, by combining the Country Name and Year in order to combine the Military expenditure and Population information. We do this using the unite() function from the tidyr package. We then left join the population data to the expenditure data. This is because we only want to analyse observations for which we have military expenditure information. By doing a left join we automatically drop the countries in the Population dataset that are not in the Expenditure dataset. We then separate the key column using the separate() function and combine it with the continents dataset using the Country name column as a key.

# Uniting the columns to make a key
Mil1 <- Mil %>% unite("Key", Name, Year)
head(Mil1)
## # A tibble: 6 x 5
##   Key       Code  Type                  `Indicator Name`             Expenditure
##   <chr>     <fct> <fct>                 <fct>                        <fct>      
## 1 World_20~ WLD   Regions Clubbed Geog~ Military expenditure (curre~ 1.49e+12   
## 2 World_20~ WLD   Regions Clubbed Geog~ Military expenditure (curre~ 1.55e+12   
## 3 World_20~ WLD   Regions Clubbed Geog~ Military expenditure (curre~ 1.63e+12   
## 4 World_20~ WLD   Regions Clubbed Geog~ Military expenditure (curre~ 1.73e+12   
## 5 World_20~ WLD   Regions Clubbed Geog~ Military expenditure (curre~ 1.74e+12   
## 6 World_20~ WLD   Regions Clubbed Geog~ Military expenditure (curre~ 1.74e+12
Pop1 <- Population %>%  unite("Key", `Country Name`, Year)
head(Pop1)
## # A tibble: 6 x 2
##   Key                         Count
##   <chr>                       <dbl>
## 1 Aruba_1960                  54211
## 2 Afghanistan_1960          8996973
## 3 Angola_1960               5454933
## 4 Albania_1960              1608800
## 5 Andorra_1960                13411
## 6 United Arab Emirates_1960   92418
Data <- left_join(Mil1, Pop1, by="Key")
head(Data)
## # A tibble: 6 x 6
##   Key      Code  Type                `Indicator Name`          Expenditure Count
##   <chr>    <fct> <fct>               <fct>                     <fct>       <dbl>
## 1 World_2~ WLD   Regions Clubbed Ge~ Military expenditure (cu~ 1.49e+12       NA
## 2 World_2~ WLD   Regions Clubbed Ge~ Military expenditure (cu~ 1.55e+12       NA
## 3 World_2~ WLD   Regions Clubbed Ge~ Military expenditure (cu~ 1.63e+12       NA
## 4 World_2~ WLD   Regions Clubbed Ge~ Military expenditure (cu~ 1.73e+12       NA
## 5 World_2~ WLD   Regions Clubbed Ge~ Military expenditure (cu~ 1.74e+12       NA
## 6 World_2~ WLD   Regions Clubbed Ge~ Military expenditure (cu~ 1.74e+12       NA
#Separating the key column
Data2 <- Data %>% separate(Key, c("Name", "Year"), sep = "_")

# Combining the continent column
Exp <- left_join(Data2, Cont, by = c("Name" = "country"))
head(Exp)
## # A tibble: 6 x 15
##   Name  Year  Code  Type  `Indicator Name` Expenditure Count code_2 code_3
##   <chr> <chr> <fct> <fct> <fct>            <fct>       <dbl> <chr>  <chr> 
## 1 World 2008  WLD   Regi~ Military expend~ 1.49e+12       NA <NA>   <NA>  
## 2 World 2009  WLD   Regi~ Military expend~ 1.55e+12       NA <NA>   <NA>  
## 3 World 2010  WLD   Regi~ Military expend~ 1.63e+12       NA <NA>   <NA>  
## 4 World 2011  WLD   Regi~ Military expend~ 1.73e+12       NA <NA>   <NA>  
## 5 World 2012  WLD   Regi~ Military expend~ 1.74e+12       NA <NA>   <NA>  
## 6 World 2013  WLD   Regi~ Military expend~ 1.74e+12       NA <NA>   <NA>  
## # ... with 6 more variables: country_code <dbl>, iso_3166_2 <chr>,
## #   continent <chr>, sub_region <chr>, region_code <dbl>, sub_region_code <dbl>

When we look at a summary of the data we see that there are NA values created due to the different naming scheme of countries in different datasets. Therefore we change these names to the correct format using the recode() function.

Population$`Country Name` <- recode(Population$`Country Name`, `Egypt, Arab Rep.` ="Egypt", `Gambia, The` = "Gambia",
                                                                    `Iran, Islamic Rep.` = "Iran",`Korea, Rep.` = "South Korea", `Venezuela, RB` = "Venezuela") 

The datasets are then merged again, with the adjusted names.

# Uniting the columns to make a key

Mil1 <- Mil %>% unite("Key", Name, Year)

Pop2 <- Population %>%  unite("Key", `Country Name`, Year)

Data3 <- left_join(Mil1, Pop2, by="Key")
head(Data)
## # A tibble: 6 x 6
##   Key      Code  Type                `Indicator Name`          Expenditure Count
##   <chr>    <fct> <fct>               <fct>                     <fct>       <dbl>
## 1 World_2~ WLD   Regions Clubbed Ge~ Military expenditure (cu~ 1.49e+12       NA
## 2 World_2~ WLD   Regions Clubbed Ge~ Military expenditure (cu~ 1.55e+12       NA
## 3 World_2~ WLD   Regions Clubbed Ge~ Military expenditure (cu~ 1.63e+12       NA
## 4 World_2~ WLD   Regions Clubbed Ge~ Military expenditure (cu~ 1.73e+12       NA
## 5 World_2~ WLD   Regions Clubbed Ge~ Military expenditure (cu~ 1.74e+12       NA
## 6 World_2~ WLD   Regions Clubbed Ge~ Military expenditure (cu~ 1.74e+12       NA
#Separating the key column
Data4 <- Data3 %>% separate(Key, c("Name", "Year"), sep = "_")

# Combining the continent column
Exp <- left_join(Data4, Cont, by = c("Name" = "country"))
head(Exp)
## # A tibble: 6 x 15
##   Name  Year  Code  Type  `Indicator Name` Expenditure Count code_2 code_3
##   <chr> <chr> <fct> <fct> <fct>            <fct>       <dbl> <chr>  <chr> 
## 1 World 2008  WLD   Regi~ Military expend~ 1.49e+12       NA <NA>   <NA>  
## 2 World 2009  WLD   Regi~ Military expend~ 1.55e+12       NA <NA>   <NA>  
## 3 World 2010  WLD   Regi~ Military expend~ 1.63e+12       NA <NA>   <NA>  
## 4 World 2011  WLD   Regi~ Military expend~ 1.73e+12       NA <NA>   <NA>  
## 5 World 2012  WLD   Regi~ Military expend~ 1.74e+12       NA <NA>   <NA>  
## 6 World 2013  WLD   Regi~ Military expend~ 1.74e+12       NA <NA>   <NA>  
## # ... with 6 more variables: country_code <dbl>, iso_3166_2 <chr>,
## #   continent <chr>, sub_region <chr>, region_code <dbl>, sub_region_code <dbl>

The Type column tells us whether the observation is a country or a group of countries. We filter out the data to only keep observations that are ‘Country’ in the Type column.

Exp1 <- Exp %>% filter(Type == "Country")

Once again we select the columns we need.

# Selecting the required variables 
Exp2 <- Exp1 %>% select(Name, Year,continent,Expenditure,Count )
head(Exp2)
## # A tibble: 6 x 5
##   Name          Year  continent Expenditure     Count
##   <chr>         <chr> <chr>     <fct>           <dbl>
## 1 United States 2008  <NA>      6.21e+11    304093966
## 2 United States 2009  <NA>      6.69e+11    306771529
## 3 United States 2010  <NA>      6.98e+11    309326085
## 4 United States 2011  <NA>      7.11e+11    311580009
## 5 United States 2012  <NA>      6.85e+11    313874218
## 6 United States 2013  <NA>      6.4e+11     316057727

We check if we have any missing values in our current dataset

colSums(is.na(Exp2))
##        Name        Year   continent Expenditure       Count 
##           0           0         190           0           0

There are 190 missing values in the continent column. Having missing values in the continent column will not impact the analysis, however, we must not have any incomplete values in the Expenditure and Count columns. Since this requirement is met, we move on to converting the variables to the correct format.

We make sure all the variables are in the correct format. If required we will make the type conversions

# First we check the summary of our variables using summary() and str()
str(Exp2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1550 obs. of  5 variables:
##  $ Name       : chr  "United States" "United States" "United States" "United States" ...
##  $ Year       : chr  "2008" "2009" "2010" "2011" ...
##  $ continent  : chr  NA NA NA NA ...
##  $ Expenditure: Factor w/ 1841 levels "0","1.03e+11",..: 158 342 525 718 895 159 1267 1428 1640 1806 ...
##  $ Count      : num  3.04e+08 3.07e+08 3.09e+08 3.12e+08 3.14e+08 ...
summary(Exp2)
##      Name               Year            continent               Expenditure  
##  Length:1550        Length:1550        Length:1550        0           :  30  
##  Class :character   Class :character   Class :character   2115785124  :   8  
##  Mode  :character   Mode  :character   Mode  :character   2474313259  :   8  
##                                                           2513200000  :   7  
##                                                           1867877499  :   6  
##                                                           2136472861.5:   6  
##                                                           (Other)     :1485  
##      Count          
##  Min.   :8.696e+04  
##  1st Qu.:3.859e+06  
##  Median :1.020e+07  
##  Mean   :4.473e+07  
##  3rd Qu.:3.087e+07  
##  Max.   :1.386e+09  
## 

We observe that we have 1500 observations of 7 variables. We have one numeric variables (Count), two factor variable (Expenditure and Code) and 3 character variables. The Expenditure variable is incorrectly stored as a factor instead of a numeric variable. The Year column and Continent columns can be better stored as factors instead of characters.

Applying the required type conversions

We change the continent and year variables to a factor variable and Expenditure to Numeric Variable

# Converting Continent and Year Variable to factor
Exp2$continent <- factor(Exp2$continent, labels = c("Africa","Americas", "Asia", "Europe","Oceania" ))  

Exp2$Year <- factor(Exp2$Year, labels= c("2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017"))

# Converting Expenditure to Numeric
Exp2$Expenditure <- as.numeric(as.character(Exp2$Expenditure))

levels(Exp2$Year)
##  [1] "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017"
Exp2 <- as.data.frame(Exp2)

head(Exp2)
##            Name Year continent Expenditure     Count
## 1 United States 2008      <NA>    6.21e+11 304093966
## 2 United States 2009      <NA>    6.69e+11 306771529
## 3 United States 2010      <NA>    6.98e+11 309326085
## 4 United States 2011      <NA>    7.11e+11 311580009
## 5 United States 2012      <NA>    6.85e+11 313874218
## 6 United States 2013      <NA>    6.40e+11 316057727
str(Exp2)
## 'data.frame':    1550 obs. of  5 variables:
##  $ Name       : chr  "United States" "United States" "United States" "United States" ...
##  $ Year       : Factor w/ 10 levels "2008","2009",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ continent  : Factor w/ 5 levels "Africa","Americas",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Expenditure: num  6.21e+11 6.69e+11 6.98e+11 7.11e+11 6.85e+11 ...
##  $ Count      : num  3.04e+08 3.07e+08 3.09e+08 3.12e+08 3.14e+08 ...
summary(Exp2)
##      Name                Year        continent    Expenditure       
##  Length:1550        2008   :155   Africa  :410   Min.   :0.000e+00  
##  Class :character   2009   :155   Americas:220   1st Qu.:1.489e+08  
##  Mode  :character   2010   :155   Asia    :340   Median :8.003e+08  
##                     2011   :155   Europe  :350   Mean   :1.068e+10  
##                     2012   :155   Oceania : 40   3rd Qu.:4.380e+09  
##                     2013   :155   NA's    :190   Max.   :7.110e+11  
##                     (Other):620                                     
##      Count          
##  Min.   :8.696e+04  
##  1st Qu.:3.859e+06  
##  Median :1.020e+07  
##  Mean   :4.473e+07  
##  3rd Qu.:3.087e+07  
##  Max.   :1.386e+09  
## 

We then scan for missing values in the entire dataset as well as special values and obvious errors in our expenditure and count variables .

colSums(is.na(Exp2))
##        Name        Year   continent Expenditure       Count 
##           0           0         190           0           0
sum(is.infinite(Exp2$Expenditure))
## [1] 0
sum(is.nan(Exp2$Expenditure))
## [1] 0
sum(is.infinite(Exp2$Count))
## [1] 0
sum(is.nan(Exp2$Count))
## [1] 0

There are no missing values in the required columns (expenditure and count) as well as no special values in the Expenditure and count variables. Additionally, through the summary function we can also conclude that there are no obvious errors in the data set as the max and min values are reasonable.

Scan II : Check for outliers.

We check the data for possible outliers. We use “Tukey’s method of outlier detection” to identify the outliers. In this method we look for data points that lie outside the range

[Q1 − 1.5 × IQR Q3 + 1.5 × IQR]

where, * Q1 is the first quartile, * Q3 is the third quartile and * IQR is the inter quartile range.

We first plot a boxplot to visualise the outliers in the data for the Expenditure and Count variables.

par(mfrow=c(1,2))
boxplot(Exp2$Expenditure, Main= "Expenditure", col = "Maroon"  )
boxplot(Exp2$Count, Main= "Count", col = "Blue"  )

Due to the presence of some extreme values we are unable to see the distribution. We create a subset of the dataframe named outlier_data to see the complete information of the univariate outliers in both variables.

# Outliers for Expenditure

iqr <- IQR(Exp2$Expenditure)
Q1 <- quantile(Exp2$Expenditure, 0.25)
Q3 <- quantile(Exp2$Expenditure, 0.75)

outlier_data1 <- subset(Exp2, Exp2$Expenditure < (Q1 - 1.5*iqr) | Exp2$Expenditure > (Q3 + 1.5*iqr))

head(outlier_data1) 
##            Name Year continent Expenditure     Count
## 1 United States 2008      <NA>    6.21e+11 304093966
## 2 United States 2009      <NA>    6.69e+11 306771529
## 3 United States 2010      <NA>    6.98e+11 309326085
## 4 United States 2011      <NA>    7.11e+11 311580009
## 5 United States 2012      <NA>    6.85e+11 313874218
## 6 United States 2013      <NA>    6.40e+11 316057727
summary(outlier_data1)
##      Name                Year       continent   Expenditure       
##  Length:193         2008   :20   Africa  : 0   Min.   :1.108e+10  
##  Class :character   2009   :20   Americas:23   1st Qu.:1.866e+10  
##  Mode  :character   2010   :20   Asia    :68   Median :3.405e+10  
##                     2011   :20   Europe  :54   Mean   :7.356e+10  
##                     2012   :20   Oceania :10   3rd Qu.:5.808e+10  
##                     2013   :20   NA's    :38   Max.   :7.110e+11  
##                     (Other):73                                    
##      Count          
##  Min.   :7.089e+06  
##  1st Qu.:3.570e+07  
##  Median :6.560e+07  
##  Mean   :2.103e+08  
##  3rd Qu.:1.428e+08  
##  Max.   :1.386e+09  
## 
# Outliers for Count
Ciqr <- IQR(Exp2$Count)
CQ1 <- quantile(Exp2$Count, 0.25)
CQ3 <- quantile(Exp2$Count, 0.75)

outlier_data2 <- subset(Exp2, Exp2$Count < (CQ1 - 1.5*Ciqr) | Exp2$Count > (CQ3 + 1.5*Ciqr))

head(outlier_data2) 
##            Name Year continent Expenditure     Count
## 1 United States 2008      <NA>    6.21e+11 304093966
## 2 United States 2009      <NA>    6.69e+11 306771529
## 3 United States 2010      <NA>    6.98e+11 309326085
## 4 United States 2011      <NA>    7.11e+11 311580009
## 5 United States 2012      <NA>    6.85e+11 313874218
## 6 United States 2013      <NA>    6.40e+11 316057727
summary(outlier_data2)
##      Name                Year       continent   Expenditure       
##  Length:182         2014   :19   Africa  :30   Min.   :2.952e+08  
##  Class :character   2015   :19   Americas:20   1st Qu.:3.312e+09  
##  Mode  :character   2016   :19   Asia    :78   Median :9.073e+09  
##                     2017   :19   Europe  :20   Mean   :6.123e+10  
##                     2010   :18   Oceania : 0   3rd Qu.:4.678e+10  
##                     2011   :18   NA's    :34   Max.   :7.110e+11  
##                     (Other):70                                    
##      Count          
##  Min.   :7.212e+07  
##  1st Qu.:8.989e+07  
##  Median :1.281e+08  
##  Mean   :2.699e+08  
##  3rd Qu.:2.023e+08  
##  Max.   :1.386e+09  
## 

We check to see if there are any multivariate outliers in the data based on the Military Expenditure (Expenditure) and the Population of the country (Count). This can be done by plotting a scatter plot to visualize any outliers.

plot(Exp2$Expenditure,Exp2$Count, xlab = "Military Expenditure", ylab = "Population")

The scatter plot gives us a better view of the outliers. On further inspection we find that the outlier cluster towards the top of the chart consists of data points of China and India and the outliers to the extreme right are the data points of USA.

Although the data contains some extreme outliers, we know that these are not incorrect values. Therefore we must not remove or change them as these are real world data points and we don’t want the removal of valid data points to affect our results. We do not eliminate the outliers as these are many of the major countries that do spend large amounts of money on their military or have an extremely large population. Therefore deleting these values would lead to a loss of important information.

Tidy & Manipulate Data II

we create a per capita military expense column dividing military expense by count (population of the country)

# Creating the new column
Mil_Data <- Exp2 %>% mutate(Per_Capita_Exp = round((Expenditure/Count),2))

# We order the dataset by Per Capita Expenditure to see the highest per capita expenditure 
Mil_Data2 <- Mil_Data[order(Mil_Data$Per_Capita_Exp, decreasing = TRUE),]

head(Mil_Data2)
##                      Name Year continent Expenditure    Count Per_Capita_Exp
## 28           Saudi Arabia 2015      Asia 87185866667 31717667        2748.81
## 265                  Oman 2012      Asia  9250650195  3498029        2644.53
## 27           Saudi Arabia 2014      Asia 80762400000 30916994        2612.23
## 1496 United Arab Emirates 2013      Asia 23561061947  9197910        2561.57
## 1497 United Arab Emirates 2014      Asia 22755071477  9214175        2469.57
## 266                  Oman 2013      Asia  8766319896  3764805        2328.49

Transform: Carry out data transformation

We plot a histogram of the Per_Capita_Exp column to see the distribution of the data.

# Plotting a histogram
hist(Mil_Data2$Per_Capita_Exp, breaks = 50,
     main = 'Histogram of Per Capita Expenditure',
     xlab = 'Per Capita Expenditure')

We can see that the data is highly right skewed. In order to conduct a linear regression or any other analysis we would require the data to follow normal distribution. Since the data is right skewed we first try the log transformation to reduce skewness.

Mil_Data3 <- Mil_Data2 %>% mutate(`Log of Per_Capita_Exp` = log(Per_Capita_Exp))

We plot another histogram to see the change in the distribution after applying log transformation.

hist(Mil_Data3$`Log of Per_Capita_Exp` , 
     main = 'Transformed Per Capita Expenditure',
     breaks = 50,
     xlab = 'Log of Per Capita Expenditure')

We now see that the data is more normally distributed as compared to the initial distribution.

References