library(tidyverse)
Since our data files our .rds files we will be using the read_rds function to read them in.
Tour <-read_rds("tourism.rds")
Continent <- read_rds("country_info.rds")
head(Tour) #Gives the first couple rows of data
## # A tibble: 6 x 23
## A COUNTRY Series `_1995` `_1996` `_1997` `_1998` `_1999` `_2000` `_2001`
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 4 AFGHANIS~ "" "" "" "" "" "" "" ""
## 2 NA Inbound ~ "" "" "" "" "" "" "" ""
## 3 NA Arrivals~ ".." ".." ".." ".." ".." ".." ".." ".."
## 4 NA Tourism ~ "IMF" ".." ".." ".." ".." ".." ".." ".."
## 5 NA Travel -~ "IMF" ".." ".." ".." ".." ".." ".." ".."
## 6 NA Passenge~ "IMF" ".." ".." ".." ".." ".." ".." ".."
## # ... with 13 more variables: _2002 <chr>, _2003 <chr>, _2004 <chr>,
## # _2005 <chr>, _2006 <chr>, _2007 <chr>, _2008 <chr>, _2009 <chr>,
## # _2010 <chr>, _2011 <chr>, _2012 <chr>, _2013 <chr>, _2014 <chr>
Using the select function to only select the columns of data I want.
df <- select(Tour, c(A, COUNTRY, Series, "_2014"))
head(df)
## # A tibble: 6 x 4
## A COUNTRY Series `_2014`
## <dbl> <chr> <chr> <chr>
## 1 4 AFGHANISTAN "" ""
## 2 NA Inbound tourism "" ""
## 3 NA Arrivals - Thousands ".." ".."
## 4 NA Tourism expenditure in the country - US$ Mn "IMF" "91"
## 5 NA Travel - US$ Mn "IMF" "82"
## 6 NA Passenger transport - US$ Mn "IMF" "9"
The mutate function lets us add columns to a dataframe. Since the A column has a number when there is Country name we can use the !(which means not) and is.na function with the ifelse function to say that when A is not missing extract the COUNTRY value and put it in the column COUNTRY_NAME. Then we use the fill function to replace all the missing values with our COUNTRY_NAME. Then we can drop the rows that have a number for A using the filter function.
df <- df %>% mutate(COUNTRY_NAME = ifelse(!is.na(A), COUNTRY, NA)) %>% #When A is not missing grab the COUNTRY value if not give a NA value
fill(COUNTRY_NAME) %>% #Replaces missing values with the name of country
filter(is.na(A)) #Drops Rows where A is a not missing, we don't need them anymore.
head(df)
## # A tibble: 6 x 5
## A COUNTRY Series `_2014` COUNTRY_NAME
## <dbl> <chr> <chr> <chr> <chr>
## 1 NA Inbound tourism "" "" AFGHANISTAN
## 2 NA Arrivals - Thousands ".." ".." AFGHANISTAN
## 3 NA Tourism expenditure in the country - US$ Mn "IMF" "91" AFGHANISTAN
## 4 NA Travel - US$ Mn "IMF" "82" AFGHANISTAN
## 5 NA Passenger transport - US$ Mn "IMF" "9" AFGHANISTAN
## 6 NA Outbound tourism "" "" AFGHANISTAN
A very similar process to the getting country name. We will be using the case_when function that says when COUNTRY is equal which we show with 2 equal signs. To learn more about comparison operators. https://www.tutorialgateway.org/comparison-operators-in-r/
df <- df %>%
mutate(TOURISM_TYPE = case_when(COUNTRY == "Inbound tourism" ~ "Inbound tourism",
COUNTRY == "Outbound tourism" ~ "Outbound tourism")) %>%
fill(TOURISM_TYPE) %>% #fill the missing values with the type of tourism
filter(COUNTRY != "Inbound tourism" & COUNTRY != "Outbound tourism") #if COUNTRY is equal to Inbound or Outbond filter those out
head(df)
## # A tibble: 6 x 6
## A COUNTRY Series `_2014` COUNTRY_NAME TOURISM_TYPE
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 NA Arrivals - Thousands .. .. AFGHANISTAN Inbound touri~
## 2 NA Tourism expenditure in the c~ IMF 91 AFGHANISTAN Inbound touri~
## 3 NA Travel - US$ Mn IMF 82 AFGHANISTAN Inbound touri~
## 4 NA Passenger transport - US$ Mn IMF 9 AFGHANISTAN Inbound touri~
## 5 NA Departures - Thousands .. .. AFGHANISTAN Outbound tour~
## 6 NA Tourism expenditure in other~ IMF 122 AFGHANISTAN Outbound tour~
We want to make sure all the Series values are upper case so we use the toupper function. Then we want to replace “..” or "" with NA values. This can be done with the na_if functions.
df <- df %>%
mutate(Series = toupper(Series)) %>% #Turnings Everything Upper case
mutate(Series = na_if(Series, ".."), #Replaces ".." with NA
Series = na_if(Series, "")) #Replaces "" with NA
To figure out if the value is in Millions or Thousands I will be using the grepl function. This returns a logical if the vector has the letters “Mn” or “Thousand”. M_or_T contains 1,000,000 or 1,000 because that is the transformation needed to convert _2014 values to their actual values.
df <- df %>%
mutate(M_or_T = case_when(grepl("Mn", COUNTRY) ~ 1000000, #if it contains Mn then it is millions
grepl("Thousand", COUNTRY) ~ 1000)) #if it contains Thousand then it is in Thousands
Now that we know which rows are millions or thousands then we can transform _2014.
df <- df %>%
mutate(Y2014 = as.numeric(`_2014`) * M_or_T) #Multiply M_or_T with _2014 as numeric
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
This is simply using the case_when function to replace the various character values into a new column.
df <- df %>%
mutate(CATEGORY = case_when(COUNTRY == "Arrivals - Thousands" ~ "Arrivals",
COUNTRY == "Departures - Thousands" ~ "Departures",
COUNTRY == "Passenger transport - US$ Mn" ~ "Passenger transport - US$",
COUNTRY == "Tourism expenditure in other countries - US$ Mn" ~ "Tourism expenditure in other countries - US$",
COUNTRY == "Tourism expenditure in the country - US$ Mn" ~ "Tourism expenditure in the country - US$",
COUNTRY == "Travel - US$ Mn" ~ "Travel - US$"))
We can drop unneeded columns by using the select function with a minus sign with the name of the column
df <- df %>%
select(-A, -COUNTRY, -`_2014`, -M_or_T)
head(Continent)
## # A tibble: 6 x 2
## Continent Country
## <dbl> <chr>
## 1 3 ANDORRA
## 2 5 UNITED ARAB EMIRATES
## 3 5 AFGHANISTAN
## 4 1 ANTIGUA AND BARBUDA
## 5 1 ANGUILLA
## 6 3 ALBANIA
For relabeling the Continent column we can use the case_when function.
Continent <- Continent %>%
mutate(Continent = case_when(Continent == 1 ~ "North America",
Continent == 2 ~ "South America",
Continent == 3 ~ "Europe",
Continent == 4 ~ "Africa",
Continent == 5 ~ "Asia",
Continent == 6 ~ "Oceania",
Continent == 7 ~ "Antarctica"))#if continent is equal to number give it name
We need to merge the tourism data with he continents data. We can use the merge function for this. It can be a little complex function so be sure to look at the documentation. We use the arrange function to sort rows to make it look more like the assignment.
final_tourism <- merge(df, Continent, by.x= "COUNTRY_NAME", by.y = "Country") %>%#Merging df and Continent by the country name
arrange(COUNTRY_NAME,TOURISM_TYPE) #This Sorts the rows
head(final_tourism)
## COUNTRY_NAME Series TOURISM_TYPE Y2014
## 1 AFGHANISTAN IMF Inbound tourism 8.20e+07
## 2 AFGHANISTAN IMF Inbound tourism 9.10e+07
## 3 AFGHANISTAN <NA> Inbound tourism NA
## 4 AFGHANISTAN IMF Inbound tourism 9.00e+06
## 5 AFGHANISTAN IMF Outbound tourism 1.11e+08
## 6 AFGHANISTAN <NA> Outbound tourism NA
## CATEGORY Continent
## 1 Travel - US$ Asia
## 2 Tourism expenditure in the country - US$ Asia
## 3 Arrivals Asia
## 4 Passenger transport - US$ Asia
## 5 Travel - US$ Asia
## 6 Departures Asia
To find the countries not found in the final data set we use the %in% operator and the unique function to figure out which countries are in the data set then use the filter and ! not operator to figure out which are missing.
logical<- Continent$Country %in% unique(final_tourism$COUNTRY_NAME) #Which Countries are in the dataset
NoCountryFound <- Continent %>%
filter(!logical)# Then which are not in the dataset
NoCountryFound
## # A tibble: 41 x 2
## Continent Country
## <chr> <chr>
## 1 Antarctica ANTARCTICA
## 2 North America SAINT BARTHELEMY
## 3 North America BONAIRE, SAINT EUSTATIUS AND SABA
## 4 Antarctica BOUVET ISLAND
## 5 Asia COCOS (KEELING) ISLANDS
## 6 Africa CONGO, THE DEMOCRATIC REPUBLIC OF THE
## 7 Africa CAMEROON/NIGERIA
## 8 Asia CHINA/INDIA
## 9 Asia CHINA/TAIWAN, PROVINCE OF CHINA
## 10 Africa CAPE VERDE
## # ... with 31 more rows