MATH2349 Data Wrangling

1. Required packages

1.1 The following packages were used:

# Requirement 10: Approved packages  --------------------------------------------------------------

library(readxl)
library(knitr)
library(tidyr)
library(gdata)
library(dplyr)
library(Hmisc)
library(MVN)

2. Executive Summary

2.1 We are interested in the socioeconomics of countries (income, population) that have experienced adverse natural disaster impacts (deaths and damages) between 2015 and 2019.

2.2 Three datasets (World Bank “Development Indicators” and “Country and Lending Groups”; “EM-Dat”) were imported into R. The datasets were inspected for structure and data types. The Indicators dataset was reshaped to a tidy format. Data type conversions were conducted on incorrectly stored variables and income groupings correctly labelled and ordered. The Em-Dat dataset was transformed to annual-level data to enable merging. Datasets were merged using country codes.

2.3 The merged dataset (1085 observations, 8 variables) was scanned for obvious errors (e.g. negative values for population variable, overlapping income groupings), missing and special values using logical checks. Considering our aim, missing values were dealt with by excluding cases where no damage and death data was recorded; imputing values (income) where possible, otherwise excluding the observations altogether (e.g. where no income available for all years).

2.4 This dataset (now 186 observations) was scanned for outliers. All numerical variables (income, damage, death, population) were skewed and so boxplots were used for univariate outlier scan and scatter plots for bivariate scans. Using the least correlated variables (death, damages), a multivariate scan was performed to obtain a dataset that removed outliers.

2.5 This clean dataset (117 observations) was semi-joined to reobtain observation values of remaining variables with death and damage data as linking variables.Three variables were rescaled (to millions). Two new variables (deaths per 100,000 people; damages as proportion of income) were created. The income variable was transformed to a normal distribution using log method.

2.6 Packages used: readxl (read in excel files), knitr (tables) gdata (rename variables), dplyr (mutate and create new variables), tidyr (reshape/tidy data), hmisc, basic R and MVN (plots, normality and logic checks, detect outliers).

3. Data

3.1 The following datasets and data sources were used:

World Bank Development Indicators, available via the World Bank Databank: https://databank.worldbank.org/source/world-development-indicators. The panel dataset, collected annually since 1960, contains 1437 indicators for 217 countries. For our analysis, the “Country Name” and indicators “GNI, Atlas method (current US$)” and “Population, total” were selected for all 217 countries for the years 2015 to 2019.
World Bank Country and Lending Groups, available from the World Bank website: https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups. The dataset provides the World Bank regional, income group and lending group classifications for each country. For our analysis, the “Code” and “Income group” classification for each country were selected.
EM-Dat, a comprehensive global disaster dataset managed by the Centre for Research on the Epidemiology of Disasters: https://public.emdat.be/data. It collates detailed event-level information on two disaster types (natural, technological) for all disasters recorded since 1900, with 43 variables including information on human impacts (e.g. number killed, injured, affected) and economic damage estimates. For this analysis, the “ISO” code, “Year”, Total Deaths“, and”Total Damages (’000 US$)" for all available countries were selected for the years 2015 to 2019.

3.2 Table 1 provides variable descriptions of all variables used in the analysis. It was created using the matrix function, and presented using the kable function (knittr package). Country Code, ISO and Code will be used as the keys to link the datasets.

#1. Create Table 1:

  #a. Read in variable descriptions file 
  
    table1 <- read_excel("Variable_descriptions.xlsx",range = "A1:D12")
  
  #b. Present table using knitr package:
  
    knitr::kable(table1, caption = "Table 1: Description of select variables")

Table 1: Description of select variables
Variable	Data type	Description	Dataset
Country Name	Categorical - character	Country name	World Bank Indicators
Country Code	Categorical - character	World Bank country code	World Bank Indicators
2015 [YR2015] 2016 [YR2016] 2017 [YR2017] 2018 [YR2018] 2019 [YR2019]	Integer - Untidy data	Year	World Bank Indicators
GNI Atlas method - current US$	Numerical	Gross national income in current US$ Atlas method	World Bank Indicators
Population - total	Integer	Population of a country in a given year	World Bank Indicators
Income group	Categorical - factor	World Bank Grouping of countries into: Low income, Lower middle income, Upper middle income, and High income	World Bank Income Groups
ISO	Categorical - character	EM-Dat country code	EM-Dat
Country	Categorical - character	Country name	EM-Dat
Total Deaths	Integer	Total deaths from natural disasters, per country and year	EM-Dat
Total Damages - one thousand US$	Numerical	Total Damages from natural disasters,per country and year	EM-Dat
Year	Integer	Year	EM-Dat

3.3 The excel datasets were read into R using the read_excel function:

World Bank Development Indicators was stored as indicators
World Bank Country and Lending Groups was stored as incomegroup
EM-Dat was stored as disaster.

3.4 Explanatory notes and headings were removed from the indicators and incomegroup datasets using skip and by dropping irrelevant rows and columns. As explained later, we cannot merge until we tidy the datasets.

# Requirement 1:  --------------------------------------------------------------

#1. Read in files

  #a. World Bank Income groupings dataset, skip rows, removing rows and columns
  
    incomegroup <- read_excel("CLASS.xls",
                              sheet =1,skip=3)[-c(1,266:275),-c(1,2,5)]

  #b. World Bank Indicators dataset, removing rows and columns
  
    indicators <- read_excel("WBIndicators1.xlsx")[-c(435:439),-c(4)] 

  #c. Read in natural disaster dataset, skip rows 

    disaster <- read_excel("emdat_public_2020_10_10_query_uid-lPH5eg.xlsx",skip=6)

4. Tidy & Manipulate Data I

4.1 Datasets were previewed using the head() function. This step indicated that further refinement was required before datasets could be merged (tidying of dataset and column names).

Tidiness of datasets

4.2. The incomegroup and disaster datasets are tidy as:

all the variables are in their own unique column. For instance, Economy and Code are split into two separate columns in the incomegroup dataset; likewise Region and Continent in the disaster dataset.
all observations related to each country (incomegroup dataset) and disaster event (disaster dataset) are in unique rows.
all values for each variable are in their unique cells.

4.3. The indicators dataset is untidy as:

unit of analysis is a country-year pairing; but
the year variable is spread across multiple columns, with the year values (e.g 2015) saved as the column (variable) names. This breaks the first tidy principle that each variable must have its own column.
income and population variables do not have their own columns, thus each country-year observation is split across multiple rows. For example, income and population observations for the year 2015 are split into two rows. This breaks both the first tidy principle and the second principle: Each observation must have its own row.

4.4 To tidy the indicators dataset, we:

Correct year values (e.g rename 2015 [YR2015] to 2015) using rename.vars() function
Use the gather() function (tidy) to gather all years into one column (“Year”)
Use the spread() function (tidy) to spread the population and income variables into separate columns.

4.5 The population and income column names were also simplified using rename.vars() function.

Merging

4.6 Before merging disasters and indicators, disasters must be transformed from event-level data into annual-level data for each country. We use group_by() and summarise() functions (dplyr) to aggregate the death and damages variables by country and year only, so that all other variables are dropped. Using dim() function, this reduces the:

observations from 1886 to 604, indicating that multiple disaster events occured within a given year and given country.
columns from 43 to 4, as required for the merge.

4.7 The datasets can now be merged. indicators will be our primary dataset. The datasets will be merged using left_join() (dplyr) as we want to add the variables from the two datasets as columns to indicators. Checking the dimensions of indicators (dim() function), we should end up with 1085 observations, and 8 columns (once income group, total deaths, and total damages are added).

4.8 Merge procedure:

Left join indicators with incomegroup dataset, selecting only “Code” and “Income Group” as our columns of interest, and use the country code columns in both datasets as the matching variables (“Country Code” and “Code”, respectively).
Left join these merged datasets to disaster, again using the country code columns to match (“ISO” and “Country Code”, respectively).
Store the final merged dataset as indicators_merged.

# Requirement 5: Tidy dataset  --------------------------------------------------------------

#1. Check tidiness of each dataset

  head(indicators)

## # A tibble: 6 x 8
##   `Country Name` `Country Code` `Series Name` `2015 [YR2015]` `2016 [YR2016]`
##   <chr>          <chr>          <chr>                   <dbl>           <dbl>
## 1 Afghanistan    AFG            GNI, Atlas m~    20583397306.    20092934119.
## 2 Afghanistan    AFG            Population, ~       34413603        35383128 
## 3 Albania        ALB            GNI, Atlas m~    12634903056.    12421239110.
## 4 Albania        ALB            Population, ~        2880703         2876101 
## 5 Algeria        DZA            GNI, Atlas m~   192676844902.   177365176105.
## 6 Algeria        DZA            Population, ~       39728025        40551404 
## # ... with 3 more variables: `2017 [YR2017]` <dbl>, `2018 [YR2018]` <dbl>,
## #   `2019 [YR2019]` <dbl>

  head(incomegroup)

## # A tibble: 6 x 6
##   Economy      Code  Region             `Income group`   `Lending categor~ Other
##   <chr>        <chr> <chr>              <chr>            <chr>             <chr>
## 1 Afghanistan  AFG   South Asia         Low income       IDA               HIPC 
## 2 Albania      ALB   Europe & Central ~ Upper middle in~ IBRD              <NA> 
## 3 Algeria      DZA   Middle East & Nor~ Lower middle in~ IBRD              <NA> 
## 4 American Sa~ ASM   East Asia & Pacif~ Upper middle in~ ..                <NA> 
## 5 Andorra      AND   Europe & Central ~ High income      ..                <NA> 
## 6 Angola       AGO   Sub-Saharan Africa Lower middle in~ IBRD              <NA>

  head(disaster)

## # A tibble: 6 x 43
##   `Dis No` Year  Seq   `Disaster Group` `Disaster Subgr~ `Disaster Type`
##   <chr>    <chr> <chr> <chr>            <chr>            <chr>          
## 1 2015-01~ 2015  0118  Natural          Meteorological   Storm          
## 2 2015-00~ 2015  0054  Natural          Hydrological     Landslide      
## 3 2015-01~ 2015  0148  Natural          Meteorological   Storm          
## 4 2015-04~ 2015  0479  Natural          Meteorological   Storm          
## 5 2015-00~ 2015  0090  Natural          Hydrological     Flood          
## 6 2015-00~ 2015  0023  Natural          Hydrological     Flood          
## # ... with 37 more variables: `Disaster Subtype` <chr>, `Disaster
## #   Subsubtype` <chr>, `Event Name` <chr>, `Entry Criteria` <chr>, ISO <chr>,
## #   Country <chr>, Region <chr>, Continent <chr>, Location <chr>, Origin <chr>,
## #   `Associated Dis` <chr>, `Associated Dis2` <chr>, `OFDA Response` <chr>,
## #   Appeal <chr>, Declaration <chr>, `Aid Contribution` <lgl>, `Dis Mag
## #   Value` <dbl>, `Dis Mag Scale` <chr>, Latitude <chr>, Longitude <chr>,
## #   `Local Time` <chr>, `River Basin` <chr>, `Start Year` <dbl>, `Start
## #   Month` <dbl>, `Start Day` <dbl>, `End Year` <dbl>, `End Month` <dbl>, `End
## #   Day` <dbl>, `Total Deaths` <dbl>, `No Injured` <dbl>, `No Affected` <dbl>,
## #   `No Homeless` <dbl>, `Total Affected` <dbl>, `Reconstruction Costs ('000
## #   US$)` <dbl>, `Insured Damages ('000 US$)` <dbl>, `Total Damages ('000
## #   US$)` <dbl>, CPI <dbl>

#2. Tidy the indicators dataset

  indicators <- indicators %>%
  
   #a. Rename year columns into proper years
    
      rename.vars(c(
              "2015 [YR2015]", "2016 [YR2016]",
              "2017 [YR2017]", "2018 [YR2018]",
              "2019 [YR2019]"), 
              c("2015", "2016",
                "2017","2018",
                "2019")
              ) %>%
    
  
    #b. Gather all years into one column
  
      gather(key = "Year", value = "Value",
           - c(1:3)) %>%
      
    #c. Spread variables into different columns
  
      spread(key = "Series Name", value = "Value")  %>%
  
  
    #d. Rename income and population for legibility
    
      rename.vars(c("GNI, Atlas method (current US$)",
                    "Population, total"),
                  c("income","population"))

## 
## Changing in .                                                                           
## From: 2015 [YR2015] 2016 [YR2016] 2017 [YR2017] 2018 [YR2018] 2019 [YR2019]
## To:   2015          2016          2017          2018          2019         
## 
## 
## Changing in .                                                       
## From: GNI, Atlas method (current US$) Population, total
## To:   income                          population

# Requirement 1: Merge dataset  --------------------------------------------------------------


#1. Prepare disaster dataset for merging

  #a. Group disaster data by year and country code 
  
      disaster <- disaster %>% 
      group_by(ISO, Year) %>%
      summarise(deaths = sum(`Total Deaths`, na.rm =TRUE),
                damages = sum(`Total Damages ('000 US$)`, na.rm =TRUE))
    
  #b. Check dimensions of disaster
      
      dim(disaster)

## [1] 604   4

#2. Merge datasets  

  #a. Join indicators and incomegroup by country codes, to obtain income grouping data
  
      indicators_merged <-  indicators %>%
      
      left_join(incomegroup[,c(2,4)], by = c("Country Code"="Code")) %>%
    
  #b. Join merged dataset with the disaster dataset by country codes, to obtain death and damage data 
  
      left_join(disaster,by=c("Country Code"="ISO","Year" = "Year"))

5. Understand

5.1 We confirm the merge has succeeded by checking the structure of indicators_merged (str()).

5.2 Turning to the variables, we cross-reference the stored data types with Table 1. All variables are correctly stored except:

Year is incorrectly stored as character. We convert to integer using as.integer() function.
Population should be integer, not numerical. We convert to integer using as.integer() function.
Income group is incorrectly stored as character. We convert to a factor using as.factor() function, and order from Low income to High income. We confirm it has been correctly converted using is.factor().

# Requirement 2: Multiple data types  --------------------------------------------------------------

#1. Check structure to confirm correct merger, and data types for relevant variables

    str(indicators_merged)

## tibble [1,085 x 8] (S3: tbl_df/tbl/data.frame)
##  $ Country Name: chr [1:1085] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Country Code: chr [1:1085] "AFG" "AFG" "AFG" "AFG" ...
##  $ Year        : chr [1:1085] "2015" "2016" "2017" "2018" ...
##  $ income      : num [1:1085] 2.06e+10 2.01e+10 2.02e+10 2.05e+10 2.07e+10 ...
##  $ population  : num [1:1085] 34413603 35383128 36296400 37172386 38041754 ...
##  $ Income group: chr [1:1085] "Low income" "Low income" "Low income" "Low income" ...
##  $ deaths      : num [1:1085] 446 84 226 123 211 1 2 7 0 51 ...
##  $ damages     : num [1:1085] 3000 0 0 0 0 0 0 8900 0 745000 ...

# Requirement 3: Data type conversions  --------------------------------------------------------------

#1. Data type conversions

  #a.Convert Year column into integer 

      indicators_merged$Year <-  as.integer(indicators_merged$Year )

  #b.Convert Population column into integer
    
      indicators_merged$population <- as.integer(indicators_merged$population)
    

# Requirement 4: Factor, ordered  --------------------------------------------------------------

#2. Convert income group to ordered factor 
  
  #a. Convert
      
    indicators_merged$`Income group` <- 
      factor(indicators_merged$`Income group`,
      levels= c("Low income", "Lower middle income",
    "Upper middle income", "High income"),
    labels=c("Low income", "Lower middle income",
    "Upper middle income", "High income"),
    ordered = TRUE)

  #b. Check Conversion
      
    is.factor(indicators_merged$`Income group`) #check factor conversion

## [1] TRUE

6. Scan I

Obvious errors

6.1 We check that:

Income, population, death and damage values are non-negative by running a logical check on whether the relevant columns have values greater than or equal to 0, and count (using sum) any FALSE values. No FALSE values were returned.
Income group incomes do not overlap by plotting a boxplot (boxplot()) of income (filtered for year 2019) by income group. We simplify the group classifications by combining Upper and Lower middle income groupings into “Middle income” (using dplyr mutate() function). The boxplot whiskers confirm there is no overlap between income group categories.

Special values

6.2 We scan for special values using is.nan() function. For efficiency, we:

nest this in a sapply() function so it can be applied to all variables in dataset.
use sum() to inspect the total number of special values within each variable. As no special values were detected, we move on to next step.

Missing values

6.3 For missing values, we again use sapply() and sum(), this time combining it with the is.na() function. We find 143 observations with no income values, and 491 observations with no deaths or damages figures. By summing (sum()) the total number of incomplete cases (complete.cases() and !) and dividing by total number of observations (nrow()) we find that almost half (48%) of cases are incomplete. Given this high percentage, we will need to investigate ways to improve the usefulness of our dataset.

6.4 For the income variable, we:

Use anti_join() to exclude any countries with no income data across all 5 years. The list of excluded countries is determined by using aggregate() to obtain NA tallies for countries with missing income values (income ~ Country Name), and filtering (filter()) for any who have a total of 5 NAs. To avoid human error, the number of years (5) is determined through passing the total number of unique years (unique(),length()) in the dataset. Comparing the number of countries remaining in indicators_merged to the original dataset, we find that 23 countries have been excluded.
Use aggregate() to identify the years with the highest number of missing values.
Use mutate()(dplyr) and impute() function to replace the remaining missing values with the mean for the country.
Confirm all missing values have been replaced by repeating the sapply() step.

6.5 For death and damages, we:

Assign any NA values as zeros (as not all countries will experience disasters).
filter (filter,dplyr) to exclude any observations where both damage and death data are zero (As we are only interested in considering countries that have been negatively impacted by natural disasters).

6.6 We verify that our changes have been successful by confirming that the proportion of complete cases is 1.

# Requirement 7: Scan missing values, special values, errors  ----------------------------------

#1. Check for obvious errors

#1.1. Numerical variables have no negative values 

  #a. Logical check 

      checkerror <- indicators_merged[,c(4,5,7,8)] >= 0 
  
  #b. Confirm no FALSE values

      sum(checkerror[checkerror==FALSE])

## [1] NA

#1.2. Income groups do not overlap
      
  #a.Create dataset with income per capita for year 2019  
  
        checkincomegroup <- indicators_merged %>%
        as.data.frame()%>%
        mutate(incomepop = income/population) %>%
        filter(Year=="2019") 
  
  #b.Merge middle income groupings
      
        checkincomegroup$`Income group` <- plyr::revalue(checkincomegroup$`Income group`, c("Lower middle income" ="Middle income",
          "Upper middle income"="Middle income"))
  
  #c.Create boxplot comparing income per capita by income group
        
      boxplot(data=checkincomegroup,incomepop ~ `Income group`,xlab="Income group",ylab="Income $m",main="Check: Income groups do not overlap")

#2. Check for special values

      sapply(indicators_merged, function(x) sum(is.nan(x) )) #Special values

## Country Name Country Code         Year       income   population Income group 
##            0            0            0            0            0            0 
##       deaths      damages 
##            0            0

#3. Check for missing values
    #a.  Check for NAs
      
      sapply(indicators_merged, function(x) sum(is.na(x))) #missing values

## Country Name Country Code         Year       income   population Income group 
##            0            0            0          143            5            0 
##       deaths      damages 
##          491          491

    #b. Obtain proportion of incomplete cases in dataset
      
      sum(!complete.cases(indicators_merged))/nrow(indicators_merged)

## [1] 0.4847926

#3. Deal with missing values 

#3.1 Income variable: 

  #a.  Remove countries with no income data
  
    ## Remove countries with no income data
    
      indicators_sub <- indicators_merged %>%
        anti_join(aggregate(income ~`Country Name`, 
                          data=indicators_merged, 
                          function(x) {sum(is.na(x))}, na.action = NULL) %>%
                  filter( income == length(unique(indicators_merged$Year))),
                by=c("Country Name"))
  
    ##Compare remaining countries compared to original dataset 
      count(unique(read_excel("WBIndicators.xlsx")[-c(435:439),c(2)])) - 
      length(unique(indicators_sub$`Country Name`))

##    n
## 1 23

  #b. Identify which years have highest number of missing income data 
  
      aggregate(income ~`Year`, 
            data=indicators_sub, function(x) {sum(is.na(x))}, na.action = 
              NULL)

##   Year income
## 1 2015      0
## 2 2016      1
## 3 2017      2
## 4 2018      6
## 5 2019     19

  #c. Impute missing income values for remaining years 
    
      require(dplyr)
        indicators_sub <- indicators_sub %>%
        group_by(`Country Name`) %>%
        mutate(income= impute(income, fun = mean))
      
#3.2 Deaths and damages: Remove observations with no damage or deaths data
  
  #a. Convert NAs to zeros
    indicators_sub$deaths[is.na(indicators_sub$deaths)] <- 0
    indicators_sub$damages[is.na(indicators_sub$damages)] <- 0

  #b. Remove observations with no damage and deaths data
  
    indicators_sub <-  indicators_sub %>%
    filter(deaths > 0 , damages > 0) 
    
#4. Final check of proportion of complete cases
    
    sum(complete.cases(indicators_sub))/nrow(indicators_sub)

## [1] 1

7. Scan II

7.1 We will check for outliers by completing univariate, bivariate and multivariate scans on the numerical variables: income, population, deaths and damages.

7.2 Some outlier checks (e.g. z score) assume normality. We plot the histograms (hist()) of these variables to test the assumption. From the plots, all variables are highly skewed. For our univariate scan, we rely on boxplots as an alternative to z-score. The boxplots show a high number of observations that are potential outliers across all variables, particularly income and population. These need to be investigated further.

7.3 For our multivariate scan, we use outliers from the MVN package to create a clean dataset that removes outliers. Considering high correlation, only death and damage variables are included in this step (as otherwise error occurs due to violation of invertibility).

7.4 By excluding outliers, the MVN package returns a new dataset (results$newData) with 117 observations. Comparing the descriptive statistics (summary()), we can see the more extreme death and damages observations have been removed, with some improvement to skewness of data.

7.5 As we need to re-merge this clean dataset with the other 6 variables, we:

First check that each combination of death and damages is unique.
Semi-join indicators_sub with this dataset, using both death and damages as the linking variables.

# Requirement 7: Scan numeric variables for outliers  ----------------------------------

#1. Univariate scan

  #a. Check normality
    par(mfrow=c(2,2))
    
    hist(indicators_sub$income, xlab="Income $m",main="")
    hist(indicators_sub$deaths,xlab="Deaths per 100k",main="")
    hist(indicators_sub$damages,xlab="Damages $m",main="")
    hist(indicators_sub$population,xlab = "Population m",main="")
    mtext("Normality check: Histograms", side = 3, line = -1, outer = TRUE)

  #b. Create boxplots
    par(mfrow=c(2,2))
    
    boxplot(indicators_sub$income, xlab="Income $m")
    boxplot(indicators_sub$deaths,xlab="Deaths per 100K")
    boxplot(indicators_sub$damages, xlab="Damages $m")
    boxplot(indicators_sub$population, xlab = "Population $m")
    mtext("Univariate Scans: Boxplots", side = 3, line = -1, outer = TRUE)

#2. Bivariate scan
  
  # Create scatterplots
    par(mfrow=c(2,2))
    plot(indicators_sub$income ~ indicators_sub$damages, ylab="Damages $m",xlab="Income $m")
    plot(indicators_sub$damages ~ indicators_sub$deaths,ylab="Damages $m ",xlab="Deaths per 100k")
    plot(indicators_sub$income ~ indicators_sub$deaths,ylab="Income $m",xlab="Deaths per 100k" )
    
    mtext("Bivariate Scans: Scatter plots", side = 3, line = -1, outer = TRUE)
  

#3. Multivariate scan

    #a. Correlation check
      
      cor(indicators_sub[,-c(1:3,6)])

##                income population     deaths    damages
## income     1.00000000  0.5073637 0.05002798 0.64884410
## population 0.50736373  1.0000000 0.30565468 0.23820648
## deaths     0.05002798  0.3056547 1.00000000 0.04887991
## damages    0.64884410  0.2382065 0.04887991 1.00000000

    #b. continue with variables with least correlation 
  
      outliers <- indicators_sub %>%
        as.data.frame() %>%
        select(deaths,damages)
        
      #c. Run outlier test
      
        par(mfrow=c(1,1))

        results <-  mvn(data = outliers, 
                          multivariateOutlierMethod ="quan", 
                          showOutliers = TRUE, 
                          showNewData = TRUE)

#3. Compare original dataset with clean data
  
     knitr::kable(summary(outliers), caption ="Table 2: Summary Statistics: Original")

Table 2: Summary Statistics: Original
	deaths	damages
	Min. : 1.00	Min. : 2
	1st Qu.: 11.25	1st Qu.: 38917
	Median : 51.50	Median : 272500
	Mean : 308.30	Mean : 4101826
	3rd Qu.: 257.75	3rd Qu.: 1586650
	Max. :9034.00	Max. :190510000

     knitr::kable(summary(results$newData), caption = "Table 3: Summary Statistics: Clean data")

Table 3: Summary Statistics: Clean data
	deaths	damages
	Min. : 1.00	Min. : 2
	1st Qu.: 8.00	1st Qu.: 16600
	Median : 24.00	Median : 94000
	Mean : 61.69	Mean : 216422
	3rd Qu.: 90.00	3rd Qu.: 280000
	Max. :294.00	Max. :1215000

#4. Final dataset (clean)
  
  #a. Check that each observation (death and data) are unique 
  
    nrow(unique(indicators_sub[,c(7,8)]))

## [1] 186

  #b. Obtain full clean dataset  
  
    indicators_clean <-  semi_join(indicators_sub,results$newData, by= c("deaths"="deaths","damages"="damages"))

8. Tidy & Manipulate Data II

8.1 As we are interested in analysing deaths and damages figures in relation to a country’s income, income group and population. We use the dplyr mutate() function to:

Create a new variable death_pop to calculate disaster deaths per 100,000 citizens (standard).
For comparability, mutate existing income, population and damages variables so all are in millions.
Create a new variable “damage_income” to calculate damages as a proportion of income.

# Requirement 6: Create/Mutate variables  -----------------------------------------


indicators_final <- indicators_clean  %>%
  mutate( 

#1.  Convert population and income variables to millions

    income = income/1000000,
    population = population/1000000,
    damages=   damages*1000/1000000,

#2. Create new variables: 
   
  #a. damages as proportion of income

    damage_income = round((damages/income),2),
  
  #b. Calculate disaster deaths per 100,000 citizens

    death_pop = deaths/population*100000
)

9. Transform

9.1 One of the issues detected was that all numerical variables were rightly skewed. As is common in economic literature, we apply a log transform (log()) to the income variable using the mutate() function (dplyr).As can be seen by comparing histograms, the transformation has significantly improved the distribution.

10. References

Centre for Research on the Epidemiology of Disasters 2020, EM-Dat, viewed 15 October 2020, https://www.emdat.be.
World Bank 2020, Wolrd Bank Country and Lending Groups, viewed 15 October 2020, 2020,https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups.
World Bank 2020, World Bank Development Indicators, viewed 15 October 2020, https://databank.worldbank.org/source/world-development-indicators.