1.1 The following packages were used:
# Requirement 10: Approved packages --------------------------------------------------------------
library(readxl)
library(knitr)
library(tidyr)
library(gdata)
library(dplyr)
library(Hmisc)
library(MVN)
2.1 We are interested in the socioeconomics of countries (income, population) that have experienced adverse natural disaster impacts (deaths and damages) between 2015 and 2019.
2.2 Three datasets (World Bank “Development Indicators” and “Country and Lending Groups”; “EM-Dat”) were imported into R. The datasets were inspected for structure and data types. The Indicators dataset was reshaped to a tidy format. Data type conversions were conducted on incorrectly stored variables and income groupings correctly labelled and ordered. The Em-Dat dataset was transformed to annual-level data to enable merging. Datasets were merged using country codes.
2.3 The merged dataset (1085 observations, 8 variables) was scanned for obvious errors (e.g. negative values for population variable, overlapping income groupings), missing and special values using logical checks. Considering our aim, missing values were dealt with by excluding cases where no damage and death data was recorded; imputing values (income) where possible, otherwise excluding the observations altogether (e.g. where no income available for all years).
2.4 This dataset (now 186 observations) was scanned for outliers. All numerical variables (income, damage, death, population) were skewed and so boxplots were used for univariate outlier scan and scatter plots for bivariate scans. Using the least correlated variables (death, damages), a multivariate scan was performed to obtain a dataset that removed outliers.
2.5 This clean dataset (117 observations) was semi-joined to reobtain observation values of remaining variables with death and damage data as linking variables.Three variables were rescaled (to millions). Two new variables (deaths per 100,000 people; damages as proportion of income) were created. The income variable was transformed to a normal distribution using log method.
2.6 Packages used: readxl (read in excel files), knitr (tables) gdata (rename variables), dplyr (mutate and create new variables), tidyr (reshape/tidy data), hmisc, basic R and MVN (plots, normality and logic checks, detect outliers).
3.1 The following datasets and data sources were used:
World Bank Development Indicators, available via the World Bank Databank: https://databank.worldbank.org/source/world-development-indicators. The panel dataset, collected annually since 1960, contains 1437 indicators for 217 countries. For our analysis, the “Country Name” and indicators “GNI, Atlas method (current US$)” and “Population, total” were selected for all 217 countries for the years 2015 to 2019.
World Bank Country and Lending Groups, available from the World Bank website: https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups. The dataset provides the World Bank regional, income group and lending group classifications for each country. For our analysis, the “Code” and “Income group” classification for each country were selected.
EM-Dat, a comprehensive global disaster dataset managed by the Centre for Research on the Epidemiology of Disasters: https://public.emdat.be/data. It collates detailed event-level information on two disaster types (natural, technological) for all disasters recorded since 1900, with 43 variables including information on human impacts (e.g. number killed, injured, affected) and economic damage estimates. For this analysis, the “ISO” code, “Year”, Total Deaths“, and”Total Damages (’000 US$)" for all available countries were selected for the years 2015 to 2019.
3.2 Table 1 provides variable descriptions of all variables used in the analysis. It was created using the matrix function, and presented using the kable function (knittr package). Country Code, ISO and Code will be used as the keys to link the datasets.
#1. Create Table 1:
#a. Read in variable descriptions file
table1 <- read_excel("Variable_descriptions.xlsx",range = "A1:D12")
#b. Present table using knitr package:
knitr::kable(table1, caption = "Table 1: Description of select variables")
| Variable | Data type | Description | Dataset |
|---|---|---|---|
| Country Name | Categorical - character | Country name | World Bank Indicators |
| Country Code | Categorical - character | World Bank country code | World Bank Indicators |
| 2015 [YR2015] 2016 [YR2016] 2017 [YR2017] 2018 [YR2018] 2019 [YR2019] | Integer - Untidy data | Year | World Bank Indicators |
| GNI Atlas method - current US$ | Numerical | Gross national income in current US$ Atlas method | World Bank Indicators |
| Population - total | Integer | Population of a country in a given year | World Bank Indicators |
| Income group | Categorical - factor | World Bank Grouping of countries into: Low income, Lower middle income, Upper middle income, and High income | World Bank Income Groups |
| ISO | Categorical - character | EM-Dat country code | EM-Dat |
| Country | Categorical - character | Country name | EM-Dat |
| Total Deaths | Integer | Total deaths from natural disasters, per country and year | EM-Dat |
| Total Damages - one thousand US$ | Numerical | Total Damages from natural disasters,per country and year | EM-Dat |
| Year | Integer | Year | EM-Dat |
3.3 The excel datasets were read into R using the read_excel function:
World Bank Development Indicators was stored as indicators
World Bank Country and Lending Groups was stored as incomegroup
EM-Dat was stored as disaster.
3.4 Explanatory notes and headings were removed from the indicators and incomegroup datasets using skip and by dropping irrelevant rows and columns. As explained later, we cannot merge until we tidy the datasets.
# Requirement 1: --------------------------------------------------------------
#1. Read in files
#a. World Bank Income groupings dataset, skip rows, removing rows and columns
incomegroup <- read_excel("CLASS.xls",
sheet =1,skip=3)[-c(1,266:275),-c(1,2,5)]
#b. World Bank Indicators dataset, removing rows and columns
indicators <- read_excel("WBIndicators1.xlsx")[-c(435:439),-c(4)]
#c. Read in natural disaster dataset, skip rows
disaster <- read_excel("emdat_public_2020_10_10_query_uid-lPH5eg.xlsx",skip=6)
4.1 Datasets were previewed using the head() function. This step indicated that further refinement was required before datasets could be merged (tidying of dataset and column names).
Tidiness of datasets
4.2. The incomegroup and disaster datasets are tidy as:
all the variables are in their own unique column. For instance, Economy and Code are split into two separate columns in the incomegroup dataset; likewise Region and Continent in the disaster dataset.
all observations related to each country (incomegroup dataset) and disaster event (disaster dataset) are in unique rows.
all values for each variable are in their unique cells.
4.3. The indicators dataset is untidy as:
unit of analysis is a country-year pairing; but
the year variable is spread across multiple columns, with the year values (e.g 2015) saved as the column (variable) names. This breaks the first tidy principle that each variable must have its own column.
income and population variables do not have their own columns, thus each country-year observation is split across multiple rows. For example, income and population observations for the year 2015 are split into two rows. This breaks both the first tidy principle and the second principle: Each observation must have its own row.
4.4 To tidy the indicators dataset, we:
Correct year values (e.g rename 2015 [YR2015] to 2015) using rename.vars() function
Use the gather() function (tidy) to gather all years into one column (“Year”)
Use the spread() function (tidy) to spread the population and income variables into separate columns.
4.5 The population and income column names were also simplified using rename.vars() function.
Merging
4.6 Before merging disasters and indicators, disasters must be transformed from event-level data into annual-level data for each country. We use group_by() and summarise() functions (dplyr) to aggregate the death and damages variables by country and year only, so that all other variables are dropped. Using dim() function, this reduces the:
observations from 1886 to 604, indicating that multiple disaster events occured within a given year and given country.
columns from 43 to 4, as required for the merge.
4.7 The datasets can now be merged. indicators will be our primary dataset. The datasets will be merged using left_join() (dplyr) as we want to add the variables from the two datasets as columns to indicators. Checking the dimensions of indicators (dim() function), we should end up with 1085 observations, and 8 columns (once income group, total deaths, and total damages are added).
4.8 Merge procedure:
Left join indicators with incomegroup dataset, selecting only “Code” and “Income Group” as our columns of interest, and use the country code columns in both datasets as the matching variables (“Country Code” and “Code”, respectively).
Left join these merged datasets to disaster, again using the country code columns to match (“ISO” and “Country Code”, respectively).
Store the final merged dataset as indicators_merged.
# Requirement 5: Tidy dataset --------------------------------------------------------------
#1. Check tidiness of each dataset
head(indicators)
## # A tibble: 6 x 8
## `Country Name` `Country Code` `Series Name` `2015 [YR2015]` `2016 [YR2016]`
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Afghanistan AFG GNI, Atlas m~ 20583397306. 20092934119.
## 2 Afghanistan AFG Population, ~ 34413603 35383128
## 3 Albania ALB GNI, Atlas m~ 12634903056. 12421239110.
## 4 Albania ALB Population, ~ 2880703 2876101
## 5 Algeria DZA GNI, Atlas m~ 192676844902. 177365176105.
## 6 Algeria DZA Population, ~ 39728025 40551404
## # ... with 3 more variables: `2017 [YR2017]` <dbl>, `2018 [YR2018]` <dbl>,
## # `2019 [YR2019]` <dbl>
head(incomegroup)
## # A tibble: 6 x 6
## Economy Code Region `Income group` `Lending categor~ Other
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Afghanistan AFG South Asia Low income IDA HIPC
## 2 Albania ALB Europe & Central ~ Upper middle in~ IBRD <NA>
## 3 Algeria DZA Middle East & Nor~ Lower middle in~ IBRD <NA>
## 4 American Sa~ ASM East Asia & Pacif~ Upper middle in~ .. <NA>
## 5 Andorra AND Europe & Central ~ High income .. <NA>
## 6 Angola AGO Sub-Saharan Africa Lower middle in~ IBRD <NA>
head(disaster)
## # A tibble: 6 x 43
## `Dis No` Year Seq `Disaster Group` `Disaster Subgr~ `Disaster Type`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2015-01~ 2015 0118 Natural Meteorological Storm
## 2 2015-00~ 2015 0054 Natural Hydrological Landslide
## 3 2015-01~ 2015 0148 Natural Meteorological Storm
## 4 2015-04~ 2015 0479 Natural Meteorological Storm
## 5 2015-00~ 2015 0090 Natural Hydrological Flood
## 6 2015-00~ 2015 0023 Natural Hydrological Flood
## # ... with 37 more variables: `Disaster Subtype` <chr>, `Disaster
## # Subsubtype` <chr>, `Event Name` <chr>, `Entry Criteria` <chr>, ISO <chr>,
## # Country <chr>, Region <chr>, Continent <chr>, Location <chr>, Origin <chr>,
## # `Associated Dis` <chr>, `Associated Dis2` <chr>, `OFDA Response` <chr>,
## # Appeal <chr>, Declaration <chr>, `Aid Contribution` <lgl>, `Dis Mag
## # Value` <dbl>, `Dis Mag Scale` <chr>, Latitude <chr>, Longitude <chr>,
## # `Local Time` <chr>, `River Basin` <chr>, `Start Year` <dbl>, `Start
## # Month` <dbl>, `Start Day` <dbl>, `End Year` <dbl>, `End Month` <dbl>, `End
## # Day` <dbl>, `Total Deaths` <dbl>, `No Injured` <dbl>, `No Affected` <dbl>,
## # `No Homeless` <dbl>, `Total Affected` <dbl>, `Reconstruction Costs ('000
## # US$)` <dbl>, `Insured Damages ('000 US$)` <dbl>, `Total Damages ('000
## # US$)` <dbl>, CPI <dbl>
#2. Tidy the indicators dataset
indicators <- indicators %>%
#a. Rename year columns into proper years
rename.vars(c(
"2015 [YR2015]", "2016 [YR2016]",
"2017 [YR2017]", "2018 [YR2018]",
"2019 [YR2019]"),
c("2015", "2016",
"2017","2018",
"2019")
) %>%
#b. Gather all years into one column
gather(key = "Year", value = "Value",
- c(1:3)) %>%
#c. Spread variables into different columns
spread(key = "Series Name", value = "Value") %>%
#d. Rename income and population for legibility
rename.vars(c("GNI, Atlas method (current US$)",
"Population, total"),
c("income","population"))
##
## Changing in .
## From: 2015 [YR2015] 2016 [YR2016] 2017 [YR2017] 2018 [YR2018] 2019 [YR2019]
## To: 2015 2016 2017 2018 2019
##
##
## Changing in .
## From: GNI, Atlas method (current US$) Population, total
## To: income population
# Requirement 1: Merge dataset --------------------------------------------------------------
#1. Prepare disaster dataset for merging
#a. Group disaster data by year and country code
disaster <- disaster %>%
group_by(ISO, Year) %>%
summarise(deaths = sum(`Total Deaths`, na.rm =TRUE),
damages = sum(`Total Damages ('000 US$)`, na.rm =TRUE))
#b. Check dimensions of disaster
dim(disaster)
## [1] 604 4
#2. Merge datasets
#a. Join indicators and incomegroup by country codes, to obtain income grouping data
indicators_merged <- indicators %>%
left_join(incomegroup[,c(2,4)], by = c("Country Code"="Code")) %>%
#b. Join merged dataset with the disaster dataset by country codes, to obtain death and damage data
left_join(disaster,by=c("Country Code"="ISO","Year" = "Year"))
5.1 We confirm the merge has succeeded by checking the structure of indicators_merged (str()).
5.2 Turning to the variables, we cross-reference the stored data types with Table 1. All variables are correctly stored except:
Year is incorrectly stored as character. We convert to integer using as.integer() function.
Population should be integer, not numerical. We convert to integer using as.integer() function.
Income group is incorrectly stored as character. We convert to a factor using as.factor() function, and order from Low income to High income. We confirm it has been correctly converted using is.factor().
# Requirement 2: Multiple data types --------------------------------------------------------------
#1. Check structure to confirm correct merger, and data types for relevant variables
str(indicators_merged)
## tibble [1,085 x 8] (S3: tbl_df/tbl/data.frame)
## $ Country Name: chr [1:1085] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Country Code: chr [1:1085] "AFG" "AFG" "AFG" "AFG" ...
## $ Year : chr [1:1085] "2015" "2016" "2017" "2018" ...
## $ income : num [1:1085] 2.06e+10 2.01e+10 2.02e+10 2.05e+10 2.07e+10 ...
## $ population : num [1:1085] 34413603 35383128 36296400 37172386 38041754 ...
## $ Income group: chr [1:1085] "Low income" "Low income" "Low income" "Low income" ...
## $ deaths : num [1:1085] 446 84 226 123 211 1 2 7 0 51 ...
## $ damages : num [1:1085] 3000 0 0 0 0 0 0 8900 0 745000 ...
# Requirement 3: Data type conversions --------------------------------------------------------------
#1. Data type conversions
#a.Convert Year column into integer
indicators_merged$Year <- as.integer(indicators_merged$Year )
#b.Convert Population column into integer
indicators_merged$population <- as.integer(indicators_merged$population)
# Requirement 4: Factor, ordered --------------------------------------------------------------
#2. Convert income group to ordered factor
#a. Convert
indicators_merged$`Income group` <-
factor(indicators_merged$`Income group`,
levels= c("Low income", "Lower middle income",
"Upper middle income", "High income"),
labels=c("Low income", "Lower middle income",
"Upper middle income", "High income"),
ordered = TRUE)
#b. Check Conversion
is.factor(indicators_merged$`Income group`) #check factor conversion
## [1] TRUE
Obvious errors
6.1 We check that:
Income, population, death and damage values are non-negative by running a logical check on whether the relevant columns have values greater than or equal to 0, and count (using sum) any FALSE values. No FALSE values were returned.
Income group incomes do not overlap by plotting a boxplot (boxplot()) of income (filtered for year 2019) by income group. We simplify the group classifications by combining Upper and Lower middle income groupings into “Middle income” (using dplyr mutate() function). The boxplot whiskers confirm there is no overlap between income group categories.
Special values
6.2 We scan for special values using is.nan() function. For efficiency, we:
nest this in a sapply() function so it can be applied to all variables in dataset.
use sum() to inspect the total number of special values within each variable. As no special values were detected, we move on to next step.
Missing values
6.3 For missing values, we again use sapply() and sum(), this time combining it with the is.na() function. We find 143 observations with no income values, and 491 observations with no deaths or damages figures. By summing (sum()) the total number of incomplete cases (complete.cases() and !) and dividing by total number of observations (nrow()) we find that almost half (48%) of cases are incomplete. Given this high percentage, we will need to investigate ways to improve the usefulness of our dataset.
6.4 For the income variable, we:
Use anti_join() to exclude any countries with no income data across all 5 years. The list of excluded countries is determined by using aggregate() to obtain NA tallies for countries with missing income values (income ~ Country Name), and filtering (filter()) for any who have a total of 5 NAs. To avoid human error, the number of years (5) is determined through passing the total number of unique years (unique(),length()) in the dataset. Comparing the number of countries remaining in indicators_merged to the original dataset, we find that 23 countries have been excluded.
Use aggregate() to identify the years with the highest number of missing values.
Use mutate()(dplyr) and impute() function to replace the remaining missing values with the mean for the country.
Confirm all missing values have been replaced by repeating the sapply() step.
6.5 For death and damages, we:
Assign any NA values as zeros (as not all countries will experience disasters).
filter (filter,dplyr) to exclude any observations where both damage and death data are zero (As we are only interested in considering countries that have been negatively impacted by natural disasters).
6.6 We verify that our changes have been successful by confirming that the proportion of complete cases is 1.
# Requirement 7: Scan missing values, special values, errors ----------------------------------
#1. Check for obvious errors
#1.1. Numerical variables have no negative values
#a. Logical check
checkerror <- indicators_merged[,c(4,5,7,8)] >= 0
#b. Confirm no FALSE values
sum(checkerror[checkerror==FALSE])
## [1] NA
#1.2. Income groups do not overlap
#a.Create dataset with income per capita for year 2019
checkincomegroup <- indicators_merged %>%
as.data.frame()%>%
mutate(incomepop = income/population) %>%
filter(Year=="2019")
#b.Merge middle income groupings
checkincomegroup$`Income group` <- plyr::revalue(checkincomegroup$`Income group`, c("Lower middle income" ="Middle income",
"Upper middle income"="Middle income"))
#c.Create boxplot comparing income per capita by income group
boxplot(data=checkincomegroup,incomepop ~ `Income group`,xlab="Income group",ylab="Income $m",main="Check: Income groups do not overlap")
#2. Check for special values
sapply(indicators_merged, function(x) sum(is.nan(x) )) #Special values
## Country Name Country Code Year income population Income group
## 0 0 0 0 0 0
## deaths damages
## 0 0
#3. Check for missing values
#a. Check for NAs
sapply(indicators_merged, function(x) sum(is.na(x))) #missing values
## Country Name Country Code Year income population Income group
## 0 0 0 143 5 0
## deaths damages
## 491 491
#b. Obtain proportion of incomplete cases in dataset
sum(!complete.cases(indicators_merged))/nrow(indicators_merged)
## [1] 0.4847926
#3. Deal with missing values
#3.1 Income variable:
#a. Remove countries with no income data
## Remove countries with no income data
indicators_sub <- indicators_merged %>%
anti_join(aggregate(income ~`Country Name`,
data=indicators_merged,
function(x) {sum(is.na(x))}, na.action = NULL) %>%
filter( income == length(unique(indicators_merged$Year))),
by=c("Country Name"))
##Compare remaining countries compared to original dataset
count(unique(read_excel("WBIndicators.xlsx")[-c(435:439),c(2)])) -
length(unique(indicators_sub$`Country Name`))
## n
## 1 23
#b. Identify which years have highest number of missing income data
aggregate(income ~`Year`,
data=indicators_sub, function(x) {sum(is.na(x))}, na.action =
NULL)
## Year income
## 1 2015 0
## 2 2016 1
## 3 2017 2
## 4 2018 6
## 5 2019 19
#c. Impute missing income values for remaining years
require(dplyr)
indicators_sub <- indicators_sub %>%
group_by(`Country Name`) %>%
mutate(income= impute(income, fun = mean))
#3.2 Deaths and damages: Remove observations with no damage or deaths data
#a. Convert NAs to zeros
indicators_sub$deaths[is.na(indicators_sub$deaths)] <- 0
indicators_sub$damages[is.na(indicators_sub$damages)] <- 0
#b. Remove observations with no damage and deaths data
indicators_sub <- indicators_sub %>%
filter(deaths > 0 , damages > 0)
#4. Final check of proportion of complete cases
sum(complete.cases(indicators_sub))/nrow(indicators_sub)
## [1] 1
7.1 We will check for outliers by completing univariate, bivariate and multivariate scans on the numerical variables: income, population, deaths and damages.
7.2 Some outlier checks (e.g. z score) assume normality. We plot the histograms (hist()) of these variables to test the assumption. From the plots, all variables are highly skewed. For our univariate scan, we rely on boxplots as an alternative to z-score. The boxplots show a high number of observations that are potential outliers across all variables, particularly income and population. These need to be investigated further.
7.3 For our multivariate scan, we use outliers from the MVN package to create a clean dataset that removes outliers. Considering high correlation, only death and damage variables are included in this step (as otherwise error occurs due to violation of invertibility).
7.4 By excluding outliers, the MVN package returns a new dataset (results$newData) with 117 observations. Comparing the descriptive statistics (summary()), we can see the more extreme death and damages observations have been removed, with some improvement to skewness of data.
7.5 As we need to re-merge this clean dataset with the other 6 variables, we:
First check that each combination of death and damages is unique.
Semi-join indicators_sub with this dataset, using both death and damages as the linking variables.
# Requirement 7: Scan numeric variables for outliers ----------------------------------
#1. Univariate scan
#a. Check normality
par(mfrow=c(2,2))
hist(indicators_sub$income, xlab="Income $m",main="")
hist(indicators_sub$deaths,xlab="Deaths per 100k",main="")
hist(indicators_sub$damages,xlab="Damages $m",main="")
hist(indicators_sub$population,xlab = "Population m",main="")
mtext("Normality check: Histograms", side = 3, line = -1, outer = TRUE)
#b. Create boxplots
par(mfrow=c(2,2))
boxplot(indicators_sub$income, xlab="Income $m")
boxplot(indicators_sub$deaths,xlab="Deaths per 100K")
boxplot(indicators_sub$damages, xlab="Damages $m")
boxplot(indicators_sub$population, xlab = "Population $m")
mtext("Univariate Scans: Boxplots", side = 3, line = -1, outer = TRUE)
#2. Bivariate scan
# Create scatterplots
par(mfrow=c(2,2))
plot(indicators_sub$income ~ indicators_sub$damages, ylab="Damages $m",xlab="Income $m")
plot(indicators_sub$damages ~ indicators_sub$deaths,ylab="Damages $m ",xlab="Deaths per 100k")
plot(indicators_sub$income ~ indicators_sub$deaths,ylab="Income $m",xlab="Deaths per 100k" )
mtext("Bivariate Scans: Scatter plots", side = 3, line = -1, outer = TRUE)
#3. Multivariate scan
#a. Correlation check
cor(indicators_sub[,-c(1:3,6)])
## income population deaths damages
## income 1.00000000 0.5073637 0.05002798 0.64884410
## population 0.50736373 1.0000000 0.30565468 0.23820648
## deaths 0.05002798 0.3056547 1.00000000 0.04887991
## damages 0.64884410 0.2382065 0.04887991 1.00000000
#b. continue with variables with least correlation
outliers <- indicators_sub %>%
as.data.frame() %>%
select(deaths,damages)
#c. Run outlier test
par(mfrow=c(1,1))
results <- mvn(data = outliers,
multivariateOutlierMethod ="quan",
showOutliers = TRUE,
showNewData = TRUE)
#3. Compare original dataset with clean data
knitr::kable(summary(outliers), caption ="Table 2: Summary Statistics: Original")
| deaths | damages | |
|---|---|---|
| Min. : 1.00 | Min. : 2 | |
| 1st Qu.: 11.25 | 1st Qu.: 38917 | |
| Median : 51.50 | Median : 272500 | |
| Mean : 308.30 | Mean : 4101826 | |
| 3rd Qu.: 257.75 | 3rd Qu.: 1586650 | |
| Max. :9034.00 | Max. :190510000 |
knitr::kable(summary(results$newData), caption = "Table 3: Summary Statistics: Clean data")
| deaths | damages | |
|---|---|---|
| Min. : 1.00 | Min. : 2 | |
| 1st Qu.: 8.00 | 1st Qu.: 16600 | |
| Median : 24.00 | Median : 94000 | |
| Mean : 61.69 | Mean : 216422 | |
| 3rd Qu.: 90.00 | 3rd Qu.: 280000 | |
| Max. :294.00 | Max. :1215000 |
#4. Final dataset (clean)
#a. Check that each observation (death and data) are unique
nrow(unique(indicators_sub[,c(7,8)]))
## [1] 186
#b. Obtain full clean dataset
indicators_clean <- semi_join(indicators_sub,results$newData, by= c("deaths"="deaths","damages"="damages"))
8.1 As we are interested in analysing deaths and damages figures in relation to a country’s income, income group and population. We use the dplyr mutate() function to:
Create a new variable death_pop to calculate disaster deaths per 100,000 citizens (standard).
For comparability, mutate existing income, population and damages variables so all are in millions.
Create a new variable “damage_income” to calculate damages as a proportion of income.
# Requirement 6: Create/Mutate variables -----------------------------------------
indicators_final <- indicators_clean %>%
mutate(
#1. Convert population and income variables to millions
income = income/1000000,
population = population/1000000,
damages= damages*1000/1000000,
#2. Create new variables:
#a. damages as proportion of income
damage_income = round((damages/income),2),
#b. Calculate disaster deaths per 100,000 citizens
death_pop = deaths/population*100000
)
9.1 One of the issues detected was that all numerical variables were rightly skewed. As is common in economic literature, we apply a log transform (log()) to the income variable using the mutate() function (dplyr).As can be seen by comparing histograms, the transformation has significantly improved the distribution.
Centre for Research on the Epidemiology of Disasters 2020, EM-Dat, viewed 15 October 2020, https://www.emdat.be.
World Bank 2020, Wolrd Bank Country and Lending Groups, viewed 15 October 2020, 2020,https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups.
World Bank 2020, World Bank Development Indicators, viewed 15 October 2020, https://databank.worldbank.org/source/world-development-indicators.