Image source: Organisation for Economic Co-operation and Development (OECD). (n.d.). Tourism GDP. https://www.oecd.org/en/data/indicators/tourism-gdp.html

This image is from the OECD Tourism GDP indicator page, which shows how tourism contributes directly to national GDP. This project analyzes the relationship between tourism activity and economic performance across countries.

Introduction

This project examines the relationship between tourism and economic performance across countries. Tourism contributes to the global economy by generating income, creating jobs, and supporting industries such as transportation and hospitality, making it useful to study its link to GDP. The dataset is based on World Bank World Development Indicators and UNWTO tourism statistics, which compile economic and tourism reporting data collected from government statistical agencies. The categorical variables are Country and Year. The quantitative variables are GDP, tourism receipts, tourism arrivals, inflation, and unemployment. These variables are used to compare countries and analyze relationships between tourism and economic performance.

The main research question is: What is the relationship between tourism activity and GDP across countries?

GDP (Gross Domestic Product) is the total value of all goods and services produced in a country over a specific time period. It is used to measure the size and strength of an economy, where higher GDP generally indicates a stronger economy. The project uses visualizations and multiple linear regression to examine this relationship.

Load libraries and read the data

# Load libraries needed for data analysis and visualization

library(tidyverse)   # Collection of packages used for data science and visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)       # Imports CSV files using read_csv()
library(dplyr)       # Used for data cleaning 
library(ggplot2)     # Creates graphs and visualizations
library(plotly)      # Adds interactivity to graphs
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
# Import the tourism and economic dataset using read_csv()
data <- read_csv("/Users/precious/Downloads/world_tourism_economy_data.csv")
## Rows: 6650 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, country_code
## dbl (9): year, tourism_receipts, tourism_arrivals, tourism_exports, tourism_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 11
##   country   country_code  year tourism_receipts tourism_arrivals tourism_exports
##   <chr>     <chr>        <dbl>            <dbl>            <dbl>           <dbl>
## 1 Aruba     ABW           1999       782000000           972000           62.5  
## 2 Africa E… AFE           1999      8034209108.        15309378.          12.2  
## 3 Afghanis… AFG           1999              NA               NA           NA    
## 4 Africa W… AFW           1999      1443612847.         3897975.           3.97 
## 5 Angola    AGO           1999        31000000            45000            0.584
## 6 Albania   ALB           1999       218000000           371000           56.0  
## # ℹ 5 more variables: tourism_departures <dbl>, tourism_expenditures <dbl>,
## #   gdp <dbl>, inflation <dbl>, unemployment <dbl>

Explore Dataset Structure

Understanding the Variables

# Display the structure of the dataset
str(data)
## spc_tbl_ [6,650 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ country             : chr [1:6650] "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa Western and Central" ...
##  $ country_code        : chr [1:6650] "ABW" "AFE" "AFG" "AFW" ...
##  $ year                : num [1:6650] 1999 1999 1999 1999 1999 ...
##  $ tourism_receipts    : num [1:6650] 7.82e+08 8.03e+09 NA 1.44e+09 3.10e+07 ...
##  $ tourism_arrivals    : num [1:6650] 972000 15309378 NA 3897975 45000 ...
##  $ tourism_exports     : num [1:6650] 62.543 12.204 NA 3.974 0.584 ...
##  $ tourism_departures  : num [1:6650] NA NA NA NA NA NA NA NA NA NA ...
##  $ tourism_expenditures: num [1:6650] 9.5 7.76 NA 6.15 2.49 ...
##  $ gdp                 : num [1:6650] 1.72e+09 2.65e+11 NA 1.39e+11 6.15e+09 ...
##  $ inflation           : num [1:6650] 2.28 7.82 NA 0.372 248.196 ...
##  $ unemployment        : num [1:6650] NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   country = col_character(),
##   ..   country_code = col_character(),
##   ..   year = col_double(),
##   ..   tourism_receipts = col_double(),
##   ..   tourism_arrivals = col_double(),
##   ..   tourism_exports = col_double(),
##   ..   tourism_departures = col_double(),
##   ..   tourism_expenditures = col_double(),
##   ..   gdp = col_double(),
##   ..   inflation = col_double(),
##   ..   unemployment = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Display summary statistics
summary(data)
##    country          country_code            year      tourism_receipts   
##  Length:6650        Length:6650        Min.   :1999   Min.   :1.000e+05  
##  Class :character   Class :character   1st Qu.:2005   1st Qu.:2.690e+08  
##  Mode  :character   Mode  :character   Median :2011   Median :1.553e+09  
##                                        Mean   :2011   Mean   :3.063e+10  
##                                        3rd Qu.:2017   3rd Qu.:9.144e+09  
##                                        Max.   :2023   Max.   :1.863e+12  
##                                                       NA's   :2361       
##  tourism_arrivals    tourism_exports     tourism_departures 
##  Min.   :9.000e+02   Min.   :  0.00096   Min.   :2.000e+03  
##  1st Qu.:5.290e+05   1st Qu.:  4.65773   1st Qu.:1.051e+06  
##  Median :2.508e+06   Median :  8.30680   Median :4.634e+06  
##  Mean   :6.264e+07   Mean   : 15.50685   Mean   :8.246e+07  
##  3rd Qu.:1.818e+07   3rd Qu.: 18.50671   3rd Qu.:4.509e+07  
##  Max.   :2.403e+09   Max.   :101.96700   Max.   :2.034e+09  
##  NA's   :1701        NA's   :2536        NA's   :4061       
##  tourism_expenditures      gdp              inflation        unemployment   
##  Min.   : 0.1578      Min.   :1.396e+07   Min.   :-16.860   Min.   : 0.039  
##  1st Qu.: 4.0747      1st Qu.:6.087e+09   1st Qu.:  1.865   1st Qu.: 4.250  
##  Median : 5.7548      Median :3.682e+10   Median :  3.629   Median : 6.548  
##  Mean   : 6.6527      Mean   :2.090e+12   Mean   :  6.319   Mean   : 7.961  
##  3rd Qu.: 7.9851      3rd Qu.:4.267e+11   3rd Qu.:  6.563   3rd Qu.: 9.895  
##  Max.   :28.1923      Max.   :1.062e+14   Max.   :557.202   Max.   :57.000  
##  NA's   :2477         NA's   :226         NA's   :982       NA's   :2992

Data Cleaning

Missing Values and Convert Variables

# Clean and select relevant variables
clean_data <- data %>%
  select(country, year, tourism_receipts, tourism_arrivals,
         gdp, inflation, unemployment) %>%   # keep only needed columns
  mutate(
    GDP_Billions = gdp / 1000000000,         # convert GDP to billions
    year = as.factor(year)                   # treat year as categorical
  ) %>%
  filter(
    !is.na(gdp),
    !is.na(tourism_receipts),
    !is.na(tourism_arrivals),
    !is.na(inflation),
    !is.na(unemployment)
  )  # remove missing values
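
The five is.na() checks in the cleaning step can also be written with drop_na() from tidyr. The sketch below is only an illustration of that equivalence: the toy tibble and all of its values are made up, and only the column names match the dataset.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical values) with the same column names as clean_data
toy <- tibble(
  country          = c("A", "B", "C", "D"),
  gdp              = c(1e9, NA, 3e9, 4e9),
  tourism_receipts = c(5e7, 6e7, NA, 8e7),
  tourism_arrivals = c(1e5, 2e5, 3e5, 4e5),
  inflation        = c(2.1, 3.0, 1.5, NA),
  unemployment     = c(5.0, 6.2, 7.1, 4.4)
)

# Approach used in the report: one !is.na() test per column
by_filter <- toy %>%
  filter(!is.na(gdp), !is.na(tourism_receipts), !is.na(tourism_arrivals),
         !is.na(inflation), !is.na(unemployment))

# Alternative: drop rows that have NA in any of the listed columns
by_drop_na <- toy %>%
  drop_na(gdp, tourism_receipts, tourism_arrivals, inflation, unemployment)

# Both approaches should keep the same rows
all.equal(by_filter, by_drop_na)
```

Either form removes a row as soon as one of the five variables is missing, so the choice is a matter of readability.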

Data Exploration

Average GDP by Country

# Summarize average GDP and tourism by country
country_summary <- clean_data %>%
  group_by(country) %>%   # group by country
  summarise(
    avg_gdp = mean(gdp, na.rm = TRUE),              # average GDP
    avg_tourism = mean(tourism_receipts, na.rm = TRUE)  # average tourism income
  ) %>%
  arrange(desc(avg_gdp))  # sort from highest GDP to lowest

country_summary  # display result
## # A tibble: 190 × 3
##    country                    avg_gdp avg_tourism
##    <chr>                        <dbl>       <dbl>
##  1 World                      6.50e13     1.23e12
##  2 High income                4.40e13     7.33e11
##  3 OECD members               4.22e13     6.57e11
##  4 Post-demographic dividend  3.99e13     6.69e11
##  5 Europe & Central Asia      1.88e13     4.16e11
##  6 North America              1.64e13     2.10e11
##  7 United States              1.53e13     1.60e11
##  8 European Union             1.30e13     3.13e11
##  9 Euro area                  1.27e13     2.92e11
## 10 Early-demographic dividend 9.46e12     1.88e11
## # ℹ 180 more rows

The code calculates the average GDP and average tourism receipts for each entry in the country column and sorts the results from highest to lowest average GDP. This reduces the dataset to one row per entry, making it easier to compare overall economic size.

The results show large differences in economic size and tourism receipts. Note that most of the highest-ranked rows (World, High income, OECD members, and similar entries) are regional or income-group aggregates rather than individual countries; the United States is the first individual country in the list. Entries with higher average GDP generally also have higher average tourism receipts, but the ordering is not identical: North America has a higher average GDP than the European Union (1.64e13 versus 1.30e13) but lower average tourism receipts (2.10e11 versus 3.13e11). This suggests that tourism is influenced by factors beyond economic size. Overall, this summary helps identify general patterns between tourism and GDP and provides a foundation for the regression analysis and visualization.
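
The summary output lists groupings such as World and High income alongside individual countries. One possible way to keep only individual countries is to filter out the aggregate names before summarising. The sketch below is a minimal illustration under stated assumptions: the aggregates vector contains only the grouping names visible in the output printed in this report, so it is not a complete list for the dataset, and the toy tibble uses made-up values.

```r
library(dplyr)

# Aggregate names copied from the printed output; NOT a complete list (assumption)
aggregates <- c(
  "World", "High income", "OECD members", "Post-demographic dividend",
  "Europe & Central Asia", "North America", "European Union", "Euro area",
  "Early-demographic dividend", "Africa Eastern and Southern",
  "Africa Western and Central"
)

# Toy data (hypothetical values) standing in for clean_data
toy <- tibble(
  country = c("World", "United States", "High income", "Albania"),
  gdp     = c(6.5e13, 1.5e13, 4.4e13, 1.2e10)
)

# Keep only rows whose name is not in the aggregate list
countries_only <- toy %>%
  filter(!country %in% aggregates)

countries_only$country
```

Applied before group_by(), a filter like this would make the ranking compare countries with countries instead of mixing them with totals that already contain them.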

Regression Model

# Multiple Linear Regression Model
tourism_model <- lm(
  gdp ~ tourism_receipts + tourism_arrivals + inflation + unemployment,
  data = clean_data
)

# Full model summary: coefficients, significance tests, and R-squared
summary(tourism_model)
## 
## Call:
## lm(formula = gdp ~ tourism_receipts + tourism_arrivals + inflation + 
##     unemployment, data = clean_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.527e+13 -2.731e+11 -1.039e+11  2.968e+10  1.649e+13 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.704e+11  6.696e+10   2.545   0.0110 *  
## tourism_receipts  5.835e+01  8.760e-01  66.603  < 2e-16 ***
## tourism_arrivals -2.232e+03  5.671e+02  -3.936 8.51e-05 ***
## inflation        -1.746e+09  4.036e+09  -0.433   0.6653    
## unemployment     -1.417e+10  6.377e+09  -2.222   0.0264 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.703e+12 on 2368 degrees of freedom
## Multiple R-squared:  0.9638, Adjusted R-squared:  0.9637 
## F-statistic: 1.575e+04 on 4 and 2368 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(tourism_model)

The regression model uses tourism receipts, tourism arrivals, inflation, and unemployment to explain GDP. Each coefficient estimates how GDP changes when that predictor increases by one unit while the others are held constant. Tourism receipts have a positive and highly significant coefficient (58.35, p < 2e-16), so higher receipts are associated with higher GDP. Tourism arrivals have a negative coefficient (-2,232, p < 0.001) once receipts are accounted for, and unemployment is also negatively associated with GDP (p = 0.026). Inflation is not statistically significant (p = 0.665). The model explains about 96% of the variation in GDP (adjusted R-squared = 0.9637). These results describe associations and do not show that tourism causes changes in GDP. The diagnostic plots are used to check whether the assumptions of linear regression (linearity, constant variance, and approximately normal residuals) are reasonably satisfied.
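
The estimates, standard errors, and p-values can also be read from the fitted model as a table instead of from the printed summary. The sketch below uses simulated data so that it runs on its own: the names x1, x2, y, and fit, and all values, are illustrative assumptions. With the report's model, tourism_model would take the place of fit.

```r
set.seed(1)                            # reproducible simulated data
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1 * x2 + rnorm(n)   # true coefficients chosen for illustration

fit <- lm(y ~ x1 + x2)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))

# 95% confidence intervals for each coefficient
ci <- confint(fit, level = 0.95)

coef_table
ci
summary(fit)$r.squared                 # proportion of variance explained
```

Confidence intervals are a useful complement to p-values because they show the range of plausible coefficient values in the units of the response.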

Visualization 1: Tourism Receipts vs GDP

library(plotly)
# Loads plotly package to add interactivity (hover, zoom, pan)
top_countries <- clean_data %>%
  group_by(country) %>%
  summarise(avg_gdp = mean(gdp, na.rm = TRUE)) %>%
  arrange(desc(avg_gdp)) %>%
  slice(1:8)

scatter_plot <- clean_data %>%
  filter(country %in% top_countries$country) %>%  
  # Filters dataset to only include top 8 GDP countries for clearer comparison

  ggplot(aes(tourism_receipts, gdp, color = country)) +  
  # X-axis = tourism receipts (independent variable)
  # Y-axis = GDP (dependent economic outcome)
  # Color separates each country for comparison

  geom_point(size = 3, alpha = 0.8) +  
  labs(
    title = "Tourism vs GDP (Top 8 Countries)",  
    # Main title describing relationship being analyzed

    x = "Tourism Receipts",  
    # Label for x-axis (tourism economic input)

    y = "GDP",  
    # Label for y-axis (economic output)

    color = "Country",  
    # Legend title for country grouping

    caption = "Source: World Bank / Tourism Dataset"  
    # Required data source citation
  ) +

  scale_color_brewer(palette = "Set2") +  
  #  color palette 
  theme_minimal() +  
  # Clean base theme for readability

  theme(
    legend.position = "right",  
    # Moves legend to right side for clarity

    panel.background = element_rect(fill = "#ffe6f0"),  
    # Custom background color (pink theme)

    plot.background = element_rect(fill = "#ffe6f0"),  
    # Ensures full plot background matches theme

    panel.grid.major = element_line(color = "white"),  
    # Makes major grid lines subtle and clean

    panel.grid.minor = element_blank()  
    # Removes minor grid lines to reduce clutter
  )

# Convert static ggplot into interactive plotly chart
ggplotly(scatter_plot)
# Adds interactivity (hover tooltips, zoom, pan) without changing visual design

1. Scatter Plot: Tourism Receipts vs GDP

This graph shows the relationship between tourism receipts and GDP for the eight entries with the highest average GDP. Because the selection uses the same ranking as the country summary above, these eight are mostly regional or income-group aggregates (such as World and High income), with the United States as the only individual country. Overall, there appears to be a positive association between the two variables: entries with higher GDP tend to have higher tourism receipts, suggesting that larger economies often attract or generate more tourism activity. However, the relationship is not perfectly linear, meaning that GDP alone does not fully explain tourism performance.

Factors such as geography, tourism policy, infrastructure, and global appeal likely play an important role. Some entries with very high GDP do not have proportionally high tourism receipts, showing that economic size and tourism success are related but that one does not determine the other.
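
One way to put a number on the association seen in a scatter plot is a correlation coefficient. The sketch below computes Pearson and Spearman correlations on a small made-up data frame (toy and its values are hypothetical, not taken from the dataset). Spearman is included because it depends only on ranks and is therefore less sensitive to the very large values typical of GDP data.

```r
# Toy data (hypothetical values) standing in for the filtered clean_data
toy <- data.frame(
  tourism_receipts = c(1e9, 5e9, 2e10, 8e10, 2e11),
  gdp              = c(5e10, 2e11, 9e11, 3e12, 1.5e13)
)

# Pearson: strength of the linear association on the raw scale
pearson <- cor(toy$tourism_receipts, toy$gdp, method = "pearson")

# Spearman: rank-based, so less affected by skew and extreme values
spearman <- cor(toy$tourism_receipts, toy$gdp, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

A gap between the two values would suggest that the relationship is monotonic but not linear, which is consistent with the curved pattern a scatter plot of skewed data often shows.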

Conclusion

Overall, the plots show that countries with higher GDP tend to generate more tourism revenue, indicating a positive relationship between economic size and tourism activity. However, the relationship is not perfectly consistent or linear. This suggests that while GDP matters, tourism is also influenced by other factors such as geography, infrastructure, and cultural appeal. Countries with similar GDP levels can still have very different tourism performance.

The second visualization, a bar chart of the top 8 countries by average GDP, highlights a strong concentration of global economic output. The United States and China dominate significantly compared to other countries, reflecting global economic inequality. After these top two economies, there is a gradual decline among the remaining countries, showing a divided economic structure rather than an even distribution.

Limitations and what I would change

One limitation of this study is that the visualizations focus only on the top 8 entries by average GDP, which limits how well the results represent all countries and may bias the findings toward larger economies. The data are also observational, so they can show relationships but cannot prove that tourism causes changes in GDP. In addition, differences in how countries collect economic and tourism data may affect accuracy, and important factors like population, geography, and political stability were not included.

Background Research

According to the World Tourism Organization (UNWTO, 2023), international tourism is an important part of the global economy, contributing to GDP, creating jobs, and generating income for many countries. The World Bank (2024) also notes that stronger economies often invest more in infrastructure and services, which can support higher tourism activity.

However, GDP is not the only factor that affects tourism. Other influences such as geography, political stability, culture, and climate also play a major role. As a result, some smaller countries can earn high tourism revenue due to natural attractions or historical sites (UNWTO, 2023; World Bank, 2024).

References

World Bank. (2024). World development indicators. https://data.worldbank.org

UN Tourism. (2023). International tourism highlights. https://www.unwto.org/tourism-data