MATH2349 Data Wrangling

Executive Summary

The following data processing aims to create a dataset that can be used to analyse international tourism, what factors influence the number of yearly international arrivals to a country as well as analyse trends that took place since 1995. For this purpose 3 datasets are used: UNESCO world heritage list to count a number of cultural, natural and mixed sites per country; World Bank dataset that contains a number of international arrivals per country since 1960 till 2018; World Bank metadata on countries that includes income level group. All 3 datasets will be merged using the fourth dataset that contains a full updated list of world countries. This report will demonstrate steps taken to tidy, manipulate and merge four datasets. At the start I will explain datasets in further detail, later I will demonstrate how the datasets were retrieved and subsetted, then I will show type conversions, afterwards, I will demonstrate steps taken to count the number of sites and find the mean number of arrivals, then I will perform a logarithmic transformation on a number of arrivals, finally, I will scan for missing values and outliers and impute them before merging the datasets.

Required packages

library(xlsx)
library(readr)
library(rvest)

## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(naniar)

Data

The first dataset I am using is the UNESCO world heritage list that can be found on the UNESCO webpage: http://whc.unesco.org/en/list/ [1]. This dataset contains records of all the world heritage sites recognised by UNSECO: name of sites in both English and French, unique ids, longitude and latitude and more. For the purpose of this task I am only interested in the following 3 columns: - udnp_code: a 3 letter country code that the site is located in (qualitative nominal variable); - region_en: a name of the region the site is located in (qualitative nominal variable); - category: a site can be of 3 types: Cultural, Natural and Mixed (qualitative nominal variable).

The second dataset I am using is the data on the number of arrivals yearly to every country in the world, it can be found on the World Bank page: https://data.worldbank.org/indicator/ST.INT.ARVL. The dataset defines international inbound tourists (overnight visitors) as the number of tourists who travel to a country other than that in which they have their usual residence, but outside their usual environment, for a period not exceeding 12 months and whose main purpose in visiting is other than an activity remunerated from within the country visited. The data on inbound tourists refer to the number of arrivals, not to the number of people travelling.[2]

The dataset contains the following columns: - Country Name (qualitative nominal variable) that contains country names as well as region names and also “world”; - Country Code: a 3 letter code of the country(qualitative nominal variable); - Indicator Name: contains the same indicator for all the observations - International tourism, number of arrivals (qualitative nominal variable); - Indicator Code: code of the above value (qualitative nominal variable); - years 1960 - 2018: the number of arrivals per year(quantitative variable);

From this dataset, I am selecting only Country Code, years 1995 - 2018 as columns 1960 - 1994 are mostly empty.

The dataset also contains metadata on countries in a separate excel sheet: - Country Code: a 3 letter code of the country(qualitative nominal variable); - Region: a 3 letter code of the country(qualitative nominal variable); - IncomeGroup: the World Bank assigns the world’s economies to four income groups: low, lower-middle, upper-middle, and high-income countries (qualitative ordinal variable). - TableName: name of the country as written in the main table(qualitative nominal variable);

Finally, I am also using a list of all the countries in the world with the country codes that can be found: http://www.fao.org/countryprofiles/iso3list/en/ [3]. From this table I will only keep 3 columns: - Short name: short name of the country (qualitative nominal variable); - Official name: the official name of the country (qualitative nominal variable); - ISO3: 3 letter code of the country (qualitative nominal variable).

unesco <- read.xlsx("UNESCO_WORLD_Sites.xls",sheetName = "sites inscribed properties XLS ", colNames  = TRUE, stringsAsFactors = FALSE)
unesco <- unesco[,c("category","udnp_code")]
head(unesco)

tourism <- read_csv("tourism_by_country.csv", skip =4)

## Warning: Missing column names filled in: 'X65' [65]

## Parsed with column specification:
## cols(
##   .default = col_logical(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `1995` = col_double(),
##   `1996` = col_double(),
##   `1997` = col_double(),
##   `1998` = col_double(),
##   `1999` = col_double(),
##   `2000` = col_double(),
##   `2001` = col_double(),
##   `2002` = col_double(),
##   `2003` = col_double(),
##   `2004` = col_double(),
##   `2005` = col_double(),
##   `2006` = col_double(),
##   `2007` = col_double(),
##   `2008` = col_double(),
##   `2009` = col_double(),
##   `2010` = col_double()
##   # ... with 8 more columns
## )

## See spec(...) for full column specifications.

drop <- c("Country Name", "Indicator Name", "Indicator Code",1960:1994, 2019, "X65")
tourism = tourism[,!(names(tourism) %in% drop)]
head(tourism)

meta_tourism <- read.xlsx("API_ST.INT.ARVL_DS2_en_excel_v2_1496875.xls", sheetName = "Metadata - Countries", colNames = TRUE, stringsAsFactors = FALSE)
meta_tourism <- meta_tourism[c(1:3)]
head(meta_tourism)

names <- read_html("http://www.fao.org/countryprofiles/iso3list/en/")
names <- html_table(html_nodes(names, "table")[[1]])
names <- names[c("Short name", "Official name", "ISO3")]
head(names)

Understand

Data types are checked using str().

In UNESCO dataset site category has only 3 values and therefore needs to be changed to factor(satisfies requirement 3). tourism dataset doesn’t require any changes to its data types. meta_tourism has 2 columns that need to be changed to factor: Region and IncomeGroup, the later also need to be ordered (satisfies requirement 4). names don’t require type conversions.

str(unesco,give.attr = FALSE)

## 'data.frame':    1121 obs. of  2 variables:
##  $ category : chr  "Cultural" "Cultural" "Cultural" "Cultural" ...
##  $ udnp_code: chr  "afg" "afg" "alb" "alb" ...

str(tourism,give.attr = FALSE)

## tibble [264 x 25] (S3: tbl_df/tbl/data.frame)
##  $ Country Code: chr [1:264] "ABW" "AFG" "AGO" "ALB" ...
##  $ 1995        : num [1:264] 619000 NA 9000 NA NA ...
##  $ 1996        : num [1:264] 641000 NA 21000 NA NA ...
##  $ 1997        : num [1:264] 650000 NA 45000 NA NA ...
##  $ 1998        : num [1:264] 647000 NA 52000 NA NA ...
##  $ 1999        : num [1:264] 683000 NA 45000 NA 2347000 ...
##  $ 2000        : num [1:264] 721000 NA 51000 NA 2949000 ...
##  $ 2001        : num [1:264] 691000 NA 67000 NA 3516000 ...
##  $ 2002        : num [1:264] 643000 NA 91000 NA 3387000 ...
##  $ 2003        : num [1:264] 642000 NA 107000 NA 3138000 ...
##  $ 2004        : num [1:264] 728000 NA 194000 NA 2791000 ...
##  $ 2005        : num [1:264] 733000 NA 210000 NA 2418000 ...
##  $ 2006        : num [1:264] 694000 NA 121000 NA 2227000 ...
##  $ 2007        : num [1:264] 772000 NA 195000 1062000 2189000 ...
##  $ 2008        : num [1:264] 827000 NA 294000 1247000 2059000 ...
##  $ 2009        : num [1:264] 813000 NA 366000 1711000 1830000 ...
##  $ 2010        : num [1:264] 824000 NA 425000 2191000 1808000 ...
##  $ 2011        : num [1:264] 869000 NA 481000 2469000 2242000 ...
##  $ 2012        : num [1:264] 904000 NA 528000 3156000 2238000 ...
##  $ 2013        : num [1:264] 979000 NA 650000 2857000 2328000 ...
##  $ 2014        : num [1:264] 1072000 NA 595000 3341000 2363000 ...
##  $ 2015        : num [1:264] 1225000 NA 592000 3784000 2663000 ...
##  $ 2016        : num [1:264] 1102000 NA 397000 4070000 2819000 ...
##  $ 2017        : num [1:264] 1070500 NA 261000 4643000 3003000 ...
##  $ 2018        : num [1:264] 1082000 NA 218000 5340000 3042000 ...

str(meta_tourism,give.attr = FALSE)

## 'data.frame':    263 obs. of  3 variables:
##  $ Country.Code: chr  "ABW" "AFG" "AGO" "ALB" ...
##  $ Region      : chr  "Latin America & Caribbean" "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" ...
##  $ IncomeGroup : chr  "High income" "Low income" "Lower middle income" "Upper middle income" ...

str(names,give.attr = FALSE)

## 'data.frame':    196 obs. of  3 variables:
##  $ Short name   : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ Official name: chr  "the Islamic Republic of Afghanistan" "the Republic of Albania" "the People's Democratic Republic of Algeria" "the Principality of Andorra" ...
##  $ ISO3         : chr  "AFG" "ALB" "DZA" "AND" ...

unesco$category <- factor(unesco$category)

meta_tourism$Region <- factor(meta_tourism$Country.Code )

meta_tourism$IncomeGroup<- factor(meta_tourism$IncomeGroup, levels = c("Low income", "Lower middle income", "Upper middle income", "High income") , ordered = TRUE)

Tidy & Manipulate Data II

Original UNESCO dataset contained all the sites that comprise world heritage, however for this task, I am only interested in the number of UNESCO recognized sites by country and how it influences the number of international arrivals.

The UNESCO dataset contains some sites that are situated in multiple countries, therefore I need to separate country codes into separate rows using functioning separate_rows(). Once it’s done I group the dataset by country code using group_by() function. I want to add the total number of sites per country. To calculate it I change data to wide format and add an extra column called Total using mutate().

In tourism dataset, I calculate the mean number of arrivals per country for since 1995 and add it to data frame using mutate().

unesco<- separate_rows(unesco,2,sep = ",")

#Counting number of different sights per country
unesco <- unesco %>%group_by(udnp_code,category) %>%tally() 

#Changing data to wide format so we can calculate total and return to long format
unesco <- unesco%>%spread(key = category, value = n, fill = 0)

unesco <- unesco%>%mutate(Total = Cultural + Mixed + Natural)

#Finding mean number of arrivals
tourism <- tourism%>%mutate(mean_num_tourist = rowMeans(tourism[,2:25], na.rm = TRUE))

Transform

The number of international arrivals is measured in thousand and for some countries in millions. The data is right-skewed since most of the countries have a small number of arrivals, however, few large countries have huge numbers and comprise a long tale to the right. Additionally, the number of arrivals usually displayed in the format using e and makes it hard to read. To normalise the data and make the numbers more easily readable I apply log transformation to all the numeric columns in tourism.

tourism[c(2:26)] <- log10(tourism[c(2:26)])

Scan I

To check for the number of missing values in the dataset I use sum() and is.na() functions. I do not use which(is.na) because both datasets are big and it wouldn’t be helpful. To find proportion missing values make in tourism dataset I use prop_miss(). Additionally, the pattern of missing values sometimes can give additional info, so before I deal with them I plot them using mice_plot(). UNESCO dataset doesn’t contain missing values, because those rows that didn’t have some category of the site got filled with 0 in the previous step. Since the number of arrivals per country varies, I use calculated mean number of arrivals per country in the previous step instead of the column mean. The number of arrivals steadily increased in nearly all countries and the best solution would be to apply a linear model or use KNN to impute missing values, however, to adhere with minimum requirement #10 and keep the number of packages to those permitted, I will use rowMeans as described.

sum(is.na(unesco))

## [1] 1

sum(is.na(tourism))

## [1] 802

#dropping rows that contain only missing values
tourism <- tourism[complete.cases(tourism$mean_num_tourist), ]

prop_miss(tourism[,2:26])

## [1] 0.06859438

columns_to_loop_thourgh <- c(3:26)
for (i in columns_to_loop_thourgh){
  tourism[[i]] <- ifelse(is.na( tourism[[i]] ), tourism$mean_num_tourist, tourism[[i]])
}

Scan II

UNESCO dataset doesn’t have outliers since all the numbers have been calculated in R from the list of heritage sites as explained previously. However, to demonstrate how it could be done, I use a loop to plot 4 boxplots - 1 for each category and total.

There are nearly 200 countries, so visually checking for outliers is tiresome, but I still use for loop to plot small part of countries and see the general trend and if outliers present. These plots show that some countries indeed can have outliers like Bangladesh(BGD) for example. To deal with this I use cap function and apply it to each row (not a column for the same reason as with missing values). Since data was normalised following tranformation to log10 I san safely use standard outlier bounds: 1.5*IQR.

#Checking for outliers in each column
columns_to_loop_thourgh <- c(2:5)
for (i in columns_to_loop_thourgh){
  name<-colnames(unesco)[i]
  unesco[i]%>%boxplot(main=name, ylab="Number of sights", col = "grey")
}

#To check for outliers visually and see data trends plotting each country
row_num <-c(10:20)

for (i in row_num){
  row1<-tourism[i,c(3:26)]
  name <-tourism[i,1]
  row1%>%boxplot(main=name, ylab="Number of tourists", col = "grey")
}

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

tourism[c(3:26)] <- apply(tourism[c(3:26)],2, cap)

Tidy & Merge

In this step I merge 4 datasets: - UNESCO that contains the number of sites per country; - tourism that contains the number of arrivals per year per country; - meta_tourism that contains income data per country;

All these datasets contained initially country name as well as 3 letter code, however, some datasets in addition to country names contained also region names and whole “world” as mentioned previously, additionally the writing of country names in datasets can vary which will result in duplicate rows with missing values. To avoid this I used 4th dataset names with which I combine the rest of data by country code. To make sure that country codes are all in the same format I use toupper(), sstr_trim() and str_squish() functions.

tourism$`Country Code` <- toupper(tourism$`Country Code`)%>%str_trim(side = "both")%>%str_squish()
unesco$`udnp_code` <- toupper(unesco$`udnp_code`)%>%str_trim(side = "both")%>%str_squish()
meta_tourism$`Country.Code` <- toupper(meta_tourism$`Country.Code`)%>%str_trim(side = "both")%>%str_squish()
names$ISO3 <- toupper(names$ISO3)%>%str_trim(side = "both")%>%str_squish()

combined_data<-tourism%>%full_join(unesco, by = c("Country Code"="udnp_code") )
combined_data<-combined_data%>%full_join(meta_tourism, by = c("Country Code"="Country.Code") )
combined_data <- names%>%left_join(combined_data, by = c("ISO3"="Country Code") )

Tidy & Manipulate Data I

tourism data is not tidy, because years are column names, therefore I use gather(). Although UNESCO data was initially tidy I spread it to calculate total and need to gather again. Additionally, I rearrange column order and change names.

cols <- as.character(c(1995:2018))
combined_data <- combined_data%>%gather(key = year, value = num_sites,cols)

## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(cols)` instead of `cols` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

combined_data <- combined_data%>%gather(key = category, value = num_arrivals, c("Cultural", "Mixed", "Natural", "Total")) 

combined_data <- combined_data[, c(1, 2, 3, 6, 7, 8, 4,9 ,10 )]
combined_data <- setNames(combined_data, c("short_name", "official_name", "country_code", "income_group", "year", "num_arrivals", "mean_num_arrivals", "site_category", "num_sites" ))

Tidy & Manipulate Data I

As I have merged multiple datasets there are new missing values. For num_sites I impute missing values with 0 since this country wasn’t previously mentioned in UNESCO data, therefore it has 0 sites. Last but not num_arrivals also got missing values after the merge, this values appeared for countries where data was fully missing, imputing with mean would be very misleading, therefore I drop rows with missing values using complete.cases().

combined_data$num_sites <- ifelse(is.na(combined_data$num_sites), 0, combined_data$num_sites)
combined_data <- drop_na(combined_data)


head(combined_data)

References

[1] Centre, U. (n.d.). World Heritage List. Retrieved October 20, 2020, from https://whc.unesco.org/en/list/ [2] International tourism, number of arrivals. (n.d.). Retrieved October 20, 2020, from https://data.worldbank.org/indicator/ST.INT.ARVL [3] FAO Country Profiles. (n.d.). Retrieved October 20, 2020, from http://www.fao.org/countryprofiles/iso3list/en/

MATH2349 Data Wrangling

Assignment 2

Yelyzaveta Tokareva s3679177

Executive Summary

Required packages

Data

Understand

Tidy & Manipulate Data II

Transform

Scan I

Scan II

Tidy & Merge

Tidy & Manipulate Data I

Tidy & Manipulate Data I

References