TourismFast

Packages

library(tidyverse)

Data

Since our data files our .rds files we will be using the read_rds function to read them in.

Tour <-read_rds("tourism.rds")
Continent <- read_rds("country_info.rds")
head(Tour) #Gives the first couple rows of data

## # A tibble: 6 x 23
##       A COUNTRY   Series `_1995` `_1996` `_1997` `_1998` `_1999` `_2000` `_2001`
##   <dbl> <chr>     <chr>  <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
## 1     4 AFGHANIS~ ""     ""      ""      ""      ""      ""      ""      ""     
## 2    NA Inbound ~ ""     ""      ""      ""      ""      ""      ""      ""     
## 3    NA Arrivals~ ".."   ".."    ".."    ".."    ".."    ".."    ".."    ".."   
## 4    NA Tourism ~ "IMF"  ".."    ".."    ".."    ".."    ".."    ".."    ".."   
## 5    NA Travel -~ "IMF"  ".."    ".."    ".."    ".."    ".."    ".."    ".."   
## 6    NA Passenge~ "IMF"  ".."    ".."    ".."    ".."    ".."    ".."    ".."   
## # ... with 13 more variables: _2002 <chr>, _2003 <chr>, _2004 <chr>,
## #   _2005 <chr>, _2006 <chr>, _2007 <chr>, _2008 <chr>, _2009 <chr>,
## #   _2010 <chr>, _2011 <chr>, _2012 <chr>, _2013 <chr>, _2014 <chr>

Variables of Interest

Using the select function to only select the columns of data I want.

df <- select(Tour, c(A, COUNTRY, Series, "_2014"))

head(df)

## # A tibble: 6 x 4
##       A COUNTRY                                     Series `_2014`
##   <dbl> <chr>                                       <chr>  <chr>  
## 1     4 AFGHANISTAN                                 ""     ""     
## 2    NA Inbound tourism                             ""     ""     
## 3    NA Arrivals - Thousands                        ".."   ".."   
## 4    NA Tourism expenditure in the country - US$ Mn "IMF"  "91"   
## 5    NA Travel - US$ Mn                             "IMF"  "82"   
## 6    NA Passenger transport - US$ Mn                "IMF"  "9"

Creating Country Name

The mutate function lets us add columns to a dataframe. Since the A column has a number when there is Country name we can use the !(which means not) and is.na function with the ifelse function to say that when A is not missing extract the COUNTRY value and put it in the column COUNTRY_NAME. Then we use the fill function to replace all the missing values with our COUNTRY_NAME. Then we can drop the rows that have a number for A using the filter function.

df <- df %>% mutate(COUNTRY_NAME = ifelse(!is.na(A), COUNTRY, NA)) %>% #When A is not missing grab the COUNTRY value if not give a NA value
  fill(COUNTRY_NAME) %>% #Replaces missing values with the name of country
  filter(is.na(A)) #Drops Rows where A is a not missing, we don't need them anymore.
head(df)

## # A tibble: 6 x 5
##       A COUNTRY                                     Series `_2014` COUNTRY_NAME
##   <dbl> <chr>                                       <chr>  <chr>   <chr>       
## 1    NA Inbound tourism                             ""     ""      AFGHANISTAN 
## 2    NA Arrivals - Thousands                        ".."   ".."    AFGHANISTAN 
## 3    NA Tourism expenditure in the country - US$ Mn "IMF"  "91"    AFGHANISTAN 
## 4    NA Travel - US$ Mn                             "IMF"  "82"    AFGHANISTAN 
## 5    NA Passenger transport - US$ Mn                "IMF"  "9"     AFGHANISTAN 
## 6    NA Outbound tourism                            ""     ""      AFGHANISTAN

Getting Inbound or Outbound Tourism

A very similar process to the getting country name. We will be using the case_when function that says when COUNTRY is equal which we show with 2 equal signs. To learn more about comparison operators. https://www.tutorialgateway.org/comparison-operators-in-r/

df <- df %>% 
  mutate(TOURISM_TYPE = case_when(COUNTRY == "Inbound tourism" ~ "Inbound tourism",
                                             COUNTRY == "Outbound tourism" ~ "Outbound tourism")) %>% 
  fill(TOURISM_TYPE) %>% #fill the missing values with the type of tourism
  filter(COUNTRY != "Inbound tourism" & COUNTRY != "Outbound tourism") #if COUNTRY is equal to Inbound or Outbond filter those out
head(df)

## # A tibble: 6 x 6
##       A COUNTRY                       Series `_2014` COUNTRY_NAME TOURISM_TYPE  
##   <dbl> <chr>                         <chr>  <chr>   <chr>        <chr>         
## 1    NA Arrivals - Thousands          ..     ..      AFGHANISTAN  Inbound touri~
## 2    NA Tourism expenditure in the c~ IMF    91      AFGHANISTAN  Inbound touri~
## 3    NA Travel - US$ Mn               IMF    82      AFGHANISTAN  Inbound touri~
## 4    NA Passenger transport - US$ Mn  IMF    9       AFGHANISTAN  Inbound touri~
## 5    NA Departures - Thousands        ..     ..      AFGHANISTAN  Outbound tour~
## 6    NA Tourism expenditure in other~ IMF    122     AFGHANISTAN  Outbound tour~

Processing the Series Column

We want to make sure all the Series values are upper case so we use the toupper function. Then we want to replace “..” or "" with NA values. This can be done with the na_if functions.

df <- df %>% 
  mutate(Series = toupper(Series)) %>% #Turnings Everything Upper case
  mutate(Series = na_if(Series, ".."), #Replaces ".." with NA
         Series = na_if(Series, "")) #Replaces "" with NA

Processing the Y2014 column

Is it millions or Thousands

To figure out if the value is in Millions or Thousands I will be using the grepl function. This returns a logical if the vector has the letters “Mn” or “Thousand”. M_or_T contains 1,000,000 or 1,000 because that is the transformation needed to convert _2014 values to their actual values.

df <- df %>% 
  mutate(M_or_T = case_when(grepl("Mn", COUNTRY) ~ 1000000, #if it contains Mn then it is millions
                            grepl("Thousand", COUNTRY) ~ 1000)) #if it contains Thousand then it is in Thousands

Converting _2014 to numbers and actual numbers

Now that we know which rows are millions or thousands then we can transform _2014.

df <- df %>% 
  mutate(Y2014 = as.numeric(`_2014`) * M_or_T) #Multiply M_or_T with _2014 as numeric

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

Creating the CATEGORY column

This is simply using the case_when function to replace the various character values into a new column.

df <- df %>% 
  mutate(CATEGORY = case_when(COUNTRY == "Arrivals - Thousands" ~ "Arrivals",
                              COUNTRY == "Departures - Thousands" ~ "Departures",
                              COUNTRY == "Passenger transport - US$ Mn" ~ "Passenger transport - US$",
                              COUNTRY == "Tourism expenditure in other countries - US$ Mn" ~ "Tourism expenditure in other countries - US$",
                              COUNTRY == "Tourism expenditure in the country - US$ Mn" ~ "Tourism expenditure in the country - US$",
                              COUNTRY == "Travel - US$ Mn" ~ "Travel - US$"))

Dropping Unneeded Columns

We can drop unneeded columns by using the select function with a minus sign with the name of the column

df <- df %>% 
  select(-A, -COUNTRY, -`_2014`, -M_or_T)

Creating The Final Toursism Table

head(Continent)

## # A tibble: 6 x 2
##   Continent Country             
##       <dbl> <chr>               
## 1         3 ANDORRA             
## 2         5 UNITED ARAB EMIRATES
## 3         5 AFGHANISTAN         
## 4         1 ANTIGUA AND BARBUDA 
## 5         1 ANGUILLA            
## 6         3 ALBANIA

Relabeling Continent

For relabeling the Continent column we can use the case_when function.

Continent <- Continent %>% 
  mutate(Continent = case_when(Continent == 1 ~ "North America",
                               Continent == 2 ~ "South America",
                               Continent == 3 ~ "Europe",
                               Continent == 4 ~ "Africa",
                               Continent == 5 ~ "Asia",
                               Continent == 6 ~ "Oceania",
                               Continent == 7 ~ "Antarctica"))#if continent is equal to number give it name

Joining Data Together

We need to merge the tourism data with he continents data. We can use the merge function for this. It can be a little complex function so be sure to look at the documentation. We use the arrange function to sort rows to make it look more like the assignment.

final_tourism <- merge(df, Continent, by.x= "COUNTRY_NAME", by.y = "Country") %>%#Merging df and Continent by the country name 
  arrange(COUNTRY_NAME,TOURISM_TYPE) #This Sorts the rows
  

head(final_tourism)

##   COUNTRY_NAME Series     TOURISM_TYPE    Y2014
## 1  AFGHANISTAN    IMF  Inbound tourism 8.20e+07
## 2  AFGHANISTAN    IMF  Inbound tourism 9.10e+07
## 3  AFGHANISTAN   <NA>  Inbound tourism       NA
## 4  AFGHANISTAN    IMF  Inbound tourism 9.00e+06
## 5  AFGHANISTAN    IMF Outbound tourism 1.11e+08
## 6  AFGHANISTAN   <NA> Outbound tourism       NA
##                                   CATEGORY Continent
## 1                             Travel - US$      Asia
## 2 Tourism expenditure in the country - US$      Asia
## 3                                 Arrivals      Asia
## 4                Passenger transport - US$      Asia
## 5                             Travel - US$      Asia
## 6                               Departures      Asia

Finding Countries not found

To find the countries not found in the final data set we use the %in% operator and the unique function to figure out which countries are in the data set then use the filter and ! not operator to figure out which are missing.

logical<- Continent$Country %in% unique(final_tourism$COUNTRY_NAME) #Which Countries are in the dataset
NoCountryFound <- Continent %>% 
  filter(!logical)# Then which are not in the dataset
NoCountryFound

## # A tibble: 41 x 2
##    Continent     Country                              
##    <chr>         <chr>                                
##  1 Antarctica    ANTARCTICA                           
##  2 North America SAINT BARTHELEMY                     
##  3 North America BONAIRE, SAINT EUSTATIUS AND SABA    
##  4 Antarctica    BOUVET ISLAND                        
##  5 Asia          COCOS (KEELING) ISLANDS              
##  6 Africa        CONGO, THE DEMOCRATIC REPUBLIC OF THE
##  7 Africa        CAMEROON/NIGERIA                     
##  8 Asia          CHINA/INDIA                          
##  9 Asia          CHINA/TAIWAN, PROVINCE OF CHINA      
## 10 Africa        CAPE VERDE                           
## # ... with 31 more rows