Introduction

This report investigates key factors that influence tourist spending in Ireland, using the latest release from the Central Statistics Office (CSO). The guiding research question is:

What factors significantly impact the total expenditure of inbound tourists in Ireland?

Research Question

Tourism is vital to Ireland’s economy, and understanding what drives visitor expenditure can help improve marketing, resource allocation, and policy design.

Specifically, the analysis investigates:

Whether longer stays lead to higher overall spending
Whether higher nightly accommodation costs are linked to total expenditure
Whether reason for travel (e.g., holiday, business, visiting family) affects spending behavior

Understanding these factors can help policymakers and tourism boards design more effective strategies.

Scope of Analysis

This study employs both descriptive and inferential statistical techniques to explore the research question.

Descriptive analysis includes visual summaries and correlation matrices to understand the overall patterns in the data.

Inferential statistics are used to test hypotheses and draw conclusions:

Multiple linear regression is used to assess how continuous variables such as number of nights and nightly cost predict expenditure.
One-way ANOVA and non-parametric tests are used to determine if the purpose of travel leads to significant differences in expenditure.

Together, these methods help answer the central research question and provide evidence-based insights into tourist spending behavior.

Statistical Hypotheses

To formally guide the analysis, I set up the following hypotheses:

For Multiple Linear Regression:

Null Hypothesis: The predictors (Nights, MeanNightlyCost) do not significantly explain variation in Expenditure.
Alternative Hypothesis: At least one of the predictors significantly explains variation in Expenditure.

For One-Way ANOVA:

Null Hypothesis: There is no significant difference in mean expenditure across different travel reasons.
Alternative Hypothesis: At least one travel reason group has a significantly different mean expenditure.
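
Stated formally, the two sets of hypotheses can be written as below. This is a notational sketch only: the symbols \beta_j (regression coefficients), \varepsilon_i (error term) and \mu_k (mean expenditure for travel reason k) are introduced here for illustration and do not come from the CSO release.

```latex
\begin{align*}
\text{Expenditure}_i &= \beta_0 + \beta_1\,\text{Nights}_i
                        + \beta_2\,\text{MeanNightlyCost}_i + \varepsilon_i \\
H_0 &: \beta_1 = \beta_2 = 0
  \qquad H_1 : \beta_j \neq 0 \text{ for at least one } j \in \{1, 2\} \\[6pt]
H_0 &: \mu_1 = \mu_2 = \dots = \mu_k
  \qquad H_1 : \mu_a \neq \mu_b \text{ for at least one pair } a \neq b
\end{align*}
```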

This study uses official aggregated data to examine the relationships between tourist expenditure and three variables: duration of stay (nights), average nightly cost, and reason for travel. Multiple linear regression and ANOVA are employed, and their statistical assumptions are checked.

Dataset Presentation

About Dataset

The dataset used was sourced from the Central Statistics Office of Ireland: https://www.cso.ie/en/releasesandpublications/ep/p-ibt/inboundtourismfebruary2025/data/

Seven CSV files were downloaded from the Central Statistics Office, each containing different aspects of inbound tourism:

ITM01.csv – Passengers departing overseas by passenger category
ITM02.csv – Overnight trips by foreign visitors, by detailed residency
ITM03.csv – Overnight trips and nights by residency
ITM04.csv – Overnight trip statistics by main reason for travel
ITM05.csv – Overnight trip statistics by main accommodation type
ITM06.csv – Expenditure of overnight foreign visitors by residency and expenditure type
ITM07.csv – Mean nightly accommodation cost by residency

Data Preparation and Merging

This section sets up the full dataset for analysis. I merge multiple Central Statistics Office (CSO) files to create a unified view of tourists’ expenditure patterns. This forms the foundation of my analysis.

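# Load packages: tidyverse (readr, dplyr, tidyr, ggplot2), lmtest for bptest(), car for vif()
library(tidyverse)
library(lmtest)
library(car)

# Load datasets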
itm01 <- read_csv("../Data/ITM01.csv")
itm02 <- read_csv("../Data/ITM02.csv")
itm03 <- read_csv("../Data/ITM03.csv")
itm04 <- read_csv("../Data/ITM04.csv")
itm05 <- read_csv("../Data/ITM05.csv")
itm06 <- read_csv("../Data/ITM06.csv")
itm07 <- read_csv("../Data/ITM07.csv")

# View structure of all datasets
glimpse(itm01); head(itm01)

Rows: 324
Columns: 8
$ STATISTIC            <chr> "ITM01C01", "ITM01C01", "ITM01C01", "ITM01C01", "…
$ `Statistic Label`    <chr> "Number of Passengers Departing Overseas", "Numbe…
$ `TLIST(M1)`          <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023…
$ Month                <chr> "2023 January", "2023 January", "2023 January", "…
$ C04187V04959         <chr> "10", "20", "30", "40", "50", "-", "10", "20", "3…
$ `Passenger Category` <chr> "Outbound Irish", "Same Day Visitor: Northern Iri…
$ UNIT                 <chr> "Thousand", "Thousand", "Thousand", "Thousand", "…
$ VALUE                <dbl> 715.1, 49.8, 47.6, 17.3, 400.0, 1229.8, 770.9, 60…

# A tibble: 6 × 8
  STATISTIC `Statistic Label`                     `TLIST(M1)` Month C04187V04959
  <chr>     <chr>                                 <chr>       <chr> <chr>       
1 ITM01C01  Number of Passengers Departing Overs… 2023M01     2023… 10          
2 ITM01C01  Number of Passengers Departing Overs… 2023M01     2023… 20          
3 ITM01C01  Number of Passengers Departing Overs… 2023M01     2023… 30          
4 ITM01C01  Number of Passengers Departing Overs… 2023M01     2023… 40          
5 ITM01C01  Number of Passengers Departing Overs… 2023M01     2023… 50          
6 ITM01C01  Number of Passengers Departing Overs… 2023M01     2023… -           
# ℹ 3 more variables: `Passenger Category` <chr>, UNIT <chr>, VALUE <dbl>

glimpse(itm02); head(itm02)

Rows: 702
Columns: 8
$ STATISTIC            <chr> "ITM02C01", "ITM02C01", "ITM02C01", "ITM02C01", "…
$ `Statistic Label`    <chr> "Number of Overnight Trips by Foreign Visitors", …
$ `TLIST(M1)`          <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023…
$ Month                <chr> "2023 January", "2023 January", "2023 January", "…
$ C04188V04960         <chr> "XB", "BENLLU", "DKNDSEFI", "FR", "DE", "IT", "ES…
$ `Detailed Residency` <chr> "Great Britain (England, Scotland & Wales)", "Bel…
$ UNIT                 <chr> "Thousand", "Thousand", "Thousand", "Thousand", "…
$ VALUE                <dbl> 155.9, 21.7, 4.2, 22.8, 25.2, 13.2, 27.0, 46.4, 1…

# A tibble: 6 × 8
  STATISTIC `Statistic Label`                     `TLIST(M1)` Month C04188V04960
  <chr>     <chr>                                 <chr>       <chr> <chr>       
1 ITM02C01  Number of Overnight Trips by Foreign… 2023M01     2023… XB          
2 ITM02C01  Number of Overnight Trips by Foreign… 2023M01     2023… BENLLU      
3 ITM02C01  Number of Overnight Trips by Foreign… 2023M01     2023… DKNDSEFI    
4 ITM02C01  Number of Overnight Trips by Foreign… 2023M01     2023… FR          
5 ITM02C01  Number of Overnight Trips by Foreign… 2023M01     2023… DE          
6 ITM02C01  Number of Overnight Trips by Foreign… 2023M01     2023… IT          
# ℹ 3 more variables: `Detailed Residency` <chr>, UNIT <chr>, VALUE <dbl>

glimpse(itm03); head(itm03)

Rows: 675
Columns: 8
$ STATISTIC         <chr> "ITM03C01", "ITM03C01", "ITM03C01", "ITM03C01", "ITM…
$ `Statistic Label` <chr> "Number of Overnight Trips by Foreign Visitors", "Nu…
$ `TLIST(M1)`       <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023M01…
$ Month             <chr> "2023 January", "2023 January", "2023 January", "202…
$ C04189V04961      <chr> "XB", "OTHEUR3", "USCA", "OTHR1", "-", "XB", "OTHEUR…
$ Residency         <chr> "Great Britain (England, Scotland & Wales)", "Other …
$ UNIT              <chr> "Thousand", "Thousand", "Thousand", "Thousand", "Tho…
$ VALUE             <dbl> 155.9, 160.4, 51.1, 32.6, 400.0, 148.3, 128.9, 41.5,…

# A tibble: 6 × 8
  STATISTIC `Statistic Label`     `TLIST(M1)` Month C04189V04961 Residency UNIT 
  <chr>     <chr>                 <chr>       <chr> <chr>        <chr>     <chr>
1 ITM03C01  Number of Overnight … 2023M01     2023… XB           Great Br… Thou…
2 ITM03C01  Number of Overnight … 2023M01     2023… OTHEUR3      Other Eu… Thou…
3 ITM03C01  Number of Overnight … 2023M01     2023… USCA         USA & Ca… Thou…
4 ITM03C01  Number of Overnight … 2023M01     2023… OTHR1        Other Re… Thou…
5 ITM03C01  Number of Overnight … 2023M01     2023… -            All Resi… Thou…
6 ITM03C01  Number of Overnight … 2023M02     2023… XB           Great Br… Thou…
# ℹ 1 more variable: VALUE <dbl>

glimpse(itm04); head(itm04)

Rows: 675
Columns: 8
$ STATISTIC                <chr> "ITM04C01", "ITM04C01", "ITM04C01", "ITM04C01…
$ `Statistic Label`        <chr> "Number of Overnight Trips by Foreign Visitor…
$ `TLIST(M1)`              <chr> "2023M01", "2023M01", "2023M01", "2023M01", "…
$ Month                    <chr> "2023 January", "2023 January", "2023 January…
$ C02118V02559             <chr> "3", "1", "2", "4", "-", "3", "1", "2", "4", …
$ `Main Reason for Travel` <chr> "Business", "Holiday/leisure/recreation", "Vi…
$ UNIT                     <chr> "Thousand", "Thousand", "Thousand", "Thousand…
$ VALUE                    <dbl> 50.1, 107.9, 217.0, 25.0, 400.0, 64.0, 98.3, …

# A tibble: 6 × 8
  STATISTIC `Statistic Label`                     `TLIST(M1)` Month C02118V02559
  <chr>     <chr>                                 <chr>       <chr> <chr>       
1 ITM04C01  Number of Overnight Trips by Foreign… 2023M01     2023… 3           
2 ITM04C01  Number of Overnight Trips by Foreign… 2023M01     2023… 1           
3 ITM04C01  Number of Overnight Trips by Foreign… 2023M01     2023… 2           
4 ITM04C01  Number of Overnight Trips by Foreign… 2023M01     2023… 4           
5 ITM04C01  Number of Overnight Trips by Foreign… 2023M01     2023… -           
6 ITM04C01  Number of Overnight Trips by Foreign… 2023M02     2023… 3           
# ℹ 3 more variables: `Main Reason for Travel` <chr>, UNIT <chr>, VALUE <dbl>

glimpse(itm05); head(itm05)

Rows: 810
Columns: 8
$ STATISTIC                 <chr> "ITM05C01", "ITM05C01", "ITM05C01", "ITM05C0…
$ `Statistic Label`         <chr> "Number of Overnight Trips by Foreign Visito…
$ `TLIST(M1)`               <chr> "2023M01", "2023M01", "2023M01", "2023M01", …
$ Month                     <chr> "2023 January", "2023 January", "2023 Januar…
$ C02164V02610              <chr> "93", "92", "30", "94", "225", "-", "93", "9…
$ `Main Accommodation Type` <chr> "Hotel/conference centre", "Guest house/bed …
$ UNIT                      <chr> "Thousand", "Thousand", "Thousand", "Thousan…
$ VALUE                     <dbl> 121.0, 8.5, 243.4, 13.8, 13.4, 400.0, 142.9,…

# A tibble: 6 × 8
  STATISTIC `Statistic Label`                     `TLIST(M1)` Month C02164V02610
  <chr>     <chr>                                 <chr>       <chr> <chr>       
1 ITM05C01  Number of Overnight Trips by Foreign… 2023M01     2023… 93          
2 ITM05C01  Number of Overnight Trips by Foreign… 2023M01     2023… 92          
3 ITM05C01  Number of Overnight Trips by Foreign… 2023M01     2023… 30          
4 ITM05C01  Number of Overnight Trips by Foreign… 2023M01     2023… 94          
5 ITM05C01  Number of Overnight Trips by Foreign… 2023M01     2023… 225         
6 ITM05C01  Number of Overnight Trips by Foreign… 2023M01     2023… -           
# ℹ 3 more variables: `Main Accommodation Type` <chr>, UNIT <chr>, VALUE <dbl>

glimpse(itm06); head(itm06)

Rows: 2,025
Columns: 10
$ STATISTIC          <chr> "ITM06C01", "ITM06C01", "ITM06C01", "ITM06C01", "IT…
$ `Statistic Label`  <chr> "Expenditure of Overnight Foreign Visitors", "Expen…
$ `TLIST(M1)`        <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023M0…
$ Month              <chr> "2023  January", "2023  January", "2023  January", …
$ C04189V04961       <chr> "XB", "XB", "XB", "XB", "XB", "OTHEUR3", "OTHEUR3",…
$ Residency          <chr> "Great Britain (England, Scotland & Wales)", "Great…
$ C04190V04962       <chr> "10", "20", "30", "40", "-", "10", "20", "30", "40"…
$ `Expenditure Type` <chr> "Fare", "Prepayments", "Accommodation", "Day-to-Day…
$ UNIT               <chr> "Euro Million", "Euro Million", "Euro Million", "Eu…
$ VALUE              <dbl> 19.6, 1.0, 13.5, 45.1, 79.2, 22.7, 0.5, 28.5, 58.9,…

# A tibble: 6 × 10
  STATISTIC `Statistic Label`           `TLIST(M1)` Month C04189V04961 Residency
  <chr>     <chr>                       <chr>       <chr> <chr>        <chr>    
1 ITM06C01  Expenditure of Overnight F… 2023M01     2023… XB           Great Br…
2 ITM06C01  Expenditure of Overnight F… 2023M01     2023… XB           Great Br…
3 ITM06C01  Expenditure of Overnight F… 2023M01     2023… XB           Great Br…
4 ITM06C01  Expenditure of Overnight F… 2023M01     2023… XB           Great Br…
5 ITM06C01  Expenditure of Overnight F… 2023M01     2023… XB           Great Br…
6 ITM06C01  Expenditure of Overnight F… 2023M01     2023… OTHEUR3      Other Eu…
# ℹ 4 more variables: C04190V04962 <chr>, `Expenditure Type` <chr>, UNIT <chr>,
#   VALUE <dbl>

glimpse(itm07); head(itm07)

Rows: 270
Columns: 8
$ STATISTIC         <chr> "ITM07C01", "ITM07C01", "ITM07C01", "ITM07C01", "ITM…
$ `Statistic Label` <chr> "Mean Nightly Accommodation Costs of Overnight Forei…
$ `TLIST(M1)`       <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023M01…
$ Monthly           <chr> "2023 January", "2023 January", "2023 January", "202…
$ C04189V04961      <chr> "XB", "OTHEUR3", "USCA", "OTHR1", "-", "XB", "OTHEUR…
$ Residency         <chr> "Great Britain (England, Scotland & Wales)", "Other …
$ UNIT              <chr> "Euro", "Euro", "Euro", "Euro", "Euro", "Euro", "Eur…
$ VALUE             <dbl> 94.0, 57.0, 88.0, 51.0, 73.0, 91.0, 58.0, 93.0, 58.0…

# A tibble: 6 × 8
  STATISTIC `Statistic Label`   `TLIST(M1)` Monthly C04189V04961 Residency UNIT 
  <chr>     <chr>               <chr>       <chr>   <chr>        <chr>     <chr>
1 ITM07C01  Mean Nightly Accom… 2023M01     2023 J… XB           Great Br… Euro 
2 ITM07C01  Mean Nightly Accom… 2023M01     2023 J… OTHEUR3      Other Eu… Euro 
3 ITM07C01  Mean Nightly Accom… 2023M01     2023 J… USCA         USA & Ca… Euro 
4 ITM07C01  Mean Nightly Accom… 2023M01     2023 J… OTHR1        Other Re… Euro 
5 ITM07C01  Mean Nightly Accom… 2023M01     2023 J… -            All Resi… Euro 
6 ITM07C01  Mean Nightly Accom… 2023M02     2023 F… XB           Great Br… Euro 
# ℹ 1 more variable: VALUE <dbl>

# Prepare data for merging
expenditure <- itm06 %>%
  filter(`Expenditure Type` == "All Travel Expenditure") %>%
  group_by(Residency) %>%
  summarise(Expenditure = sum(VALUE))

nights <- itm03 %>%
  filter(Residency != "All Residencies") %>%
  group_by(Residency) %>%
  summarise(Nights = sum(VALUE))

accommodation <- itm07 %>%
  filter(Residency != "All Residencies") %>%
  group_by(Residency) %>%
  summarise(MeanNightlyCost = mean(VALUE))

# Merging datasets for Regression
regression_data <- expenditure %>%
  inner_join(nights, by = "Residency") %>%
  inner_join(accommodation, by = "Residency")

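# Prepare dataset for ANOVA (ITM04 only; no merge is needed)
# Note: VALUE is renamed to Expenditure without filtering on STATISTIC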
anova_data <- itm04 %>%
  filter(`Main Reason for Travel` != "All reasons for journey") %>%
  rename(Reason = `Main Reason for Travel`, Expenditure = VALUE)
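
The glimpse output for itm04 shows that its first statistic is "Number of Overnight Trips by Foreign Visitors" measured in thousands, so the VALUE column can mix several measures. The following is a minimal base R sketch of how one measure could be isolated before it is treated as expenditure. The toy data frame, the second statistic label and its values are hypothetical; the real labels should be read from unique(itm04$`Statistic Label`).

```r
# Toy stand-in for itm04: two statistics share the same VALUE column.
# The second label and its values are HYPOTHETICAL, for illustration only.
toy_itm04 <- data.frame(
  `Statistic Label` = rep(c("Number of Overnight Trips by Foreign Visitors",
                            "Expenditure of Overnight Foreign Visitors (hypothetical label)"),
                          each = 3),
  `Main Reason for Travel` = rep(c("Business", "Holiday/leisure/recreation",
                                   "All reasons for journey"), times = 2),
  UNIT  = rep(c("Thousand", "Euro Million"), each = 3),
  VALUE = c(50.1, 107.9, 400.0, 12.0, 30.0, 80.0),
  check.names = FALSE
)

# Step 1: list which measures (and units) are mixed together in VALUE.
unique(toy_itm04[, c("Statistic Label", "UNIT")])

# Step 2: keep a single measure and drop the "all reasons" total.
keep <- grepl("^Expenditure", toy_itm04$`Statistic Label`) &
  toy_itm04$`Main Reason for Travel` != "All reasons for journey"
toy_anova <- toy_itm04[keep, c("Main Reason for Travel", "VALUE")]
names(toy_anova) <- c("Reason", "Expenditure")
toy_anova
```

The same check applies to itm03 and itm06, where sum(VALUE) would otherwise add up rows that belong to different statistics.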

Dataset Overview

regression_data

# A tibble: 4 × 4
  Residency                                 Expenditure Nights MeanNightlyCost
  <chr>                                           <dbl>  <dbl>           <dbl>
1 Great Britain (England, Scotland & Wales)      23424. 33560.            88.5
2 Other Europe (3)                               35644. 49944.            74.7
3 Other Residencies                              60791. 15105.            75.4
4 USA & Canada                                   64140. 29556.            96.8

head(anova_data)

# A tibble: 6 × 8
  STATISTIC `Statistic Label`        `TLIST(M1)` Month C02118V02559 Reason UNIT 
  <chr>     <chr>                    <chr>       <chr> <chr>        <chr>  <chr>
1 ITM04C01  Number of Overnight Tri… 2023M01     2023… 3            Busin… Thou…
2 ITM04C01  Number of Overnight Tri… 2023M01     2023… 1            Holid… Thou…
3 ITM04C01  Number of Overnight Tri… 2023M01     2023… 2            Visit… Thou…
4 ITM04C01  Number of Overnight Tri… 2023M01     2023… 4            Other… Thou…
5 ITM04C01  Number of Overnight Tri… 2023M02     2023… 3            Busin… Thou…
6 ITM04C01  Number of Overnight Tri… 2023M02     2023… 1            Holid… Thou…
# ℹ 1 more variable: Expenditure <dbl>

Data Preprocessing

Before analysis, both datasets (regression_data and anova_data) are checked for null values, distribution shapes, and potential outliers. These steps ensure data quality and help shape interpretation.

# Check for missing values
cat("Missing values in regression_data:\n")

Missing values in regression_data:

print(colSums(is.na(regression_data)))

      Residency     Expenditure          Nights MeanNightlyCost 
              0               0               0               0

cat("\nMissing values in anova_data:\n")


Missing values in anova_data:

print(colSums(is.na(anova_data)))

      STATISTIC Statistic Label       TLIST(M1)           Month    C02118V02559 
              0               0               0               0               0 
         Reason            UNIT     Expenditure 
              0               0               0

# Reshape both for plotting
reg_long <- regression_data %>%
  pivot_longer(cols = c(Expenditure, Nights, MeanNightlyCost), names_to = "Variable", values_to = "Value")

anova_long <- anova_data %>%
  pivot_longer(cols = c(Expenditure), names_to = "Variable", values_to = "Value")

# Outlier detection
ggplot(reg_long, aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal() +
  labs(title = "Outlier Detection in Regression Data")

ggplot(anova_long, aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "salmon") +
  theme_minimal() +
  labs(title = "Outlier Detection in ANOVA Data")

There are no null values in regression_data and anova_data. While some data points may appear as outliers, they are retained as they likely represent real differences in spending behavior — especially among high-expenditure tourist groups.
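
As a numeric companion to the boxplots, the sketch below applies the 1.5 × IQR rule to the four Expenditure values printed in regression_data. It uses base R's boxplot.stats(), which computes hinges slightly differently from ggplot2's quartile-based boxes, so it is an approximation of what the plot flags rather than an exact reproduction.

```r
# Expenditure values copied from the regression_data printout (rounded).
expenditure <- c(23424, 35644, 60791, 64140)

# boxplot.stats() flags points lying more than coef * IQR beyond the hinges.
stats <- boxplot.stats(expenditure, coef = 1.5)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values flagged as outliers
```

For these four values no point is flagged, so any outliers visible in the plots come from the larger anova_data set.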

Exploratory Data Analysis

This analysis is based on a merged dataset that aggregates inbound tourists by region of residency. After cleaning and filtering, the final regression dataset contains:

n_obs <- nrow(regression_data)
n_vars <- ncol(regression_data)
glue::glue("The dataset consists of {n_obs} observations and {n_vars} variables.")

The dataset consists of 4 observations and 4 variables.

Descriptive Statistics

# Descriptive stats for regression_data
regression_data %>%
  select(Expenditure, Nights, MeanNightlyCost) %>%
  summary()

  Expenditure        Nights      MeanNightlyCost
 Min.   :23424   Min.   :15105   Min.   :74.71  
 1st Qu.:32589   1st Qu.:25943   1st Qu.:75.25  
 Median :48217   Median :31558   Median :81.98  
 Mean   :46000   Mean   :32041   Mean   :83.86  
 3rd Qu.:61628   3rd Qu.:37656   3rd Qu.:90.60  
 Max.   :64140   Max.   :49944   Max.   :96.77

# Descriptive stats for anova_data
anova_data %>%
  select(Expenditure) %>%
  summary()

  Expenditure     
 Min.   :   3.90  
 1st Qu.:  11.28  
 Median :  35.75  
 Mean   : 237.13  
 3rd Qu.: 182.62  
 Max.   :3116.40

Expenditure and nights vs residency

# Expenditure and nights vs residency
regression_data_long <- regression_data %>%
  pivot_longer(cols = c(Expenditure, Nights), names_to = "Metric", values_to = "Value")

ggplot(regression_data_long, aes(x = Residency, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(title = "Comparison of Expenditure and Nights by Residency",
       x = "Tourist Origin (Residency)", y = "Value (Million € or Nights)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation

This grouped bar chart compares total expenditure and total nights stayed across different residency groups visiting Ireland.

USA & Canada shows the highest total expenditure, indicating that tourists from this region are among the most economically impactful, despite not having the longest stays.
Other Europe (3) has a high number of nights stayed, suggesting longer trips on average, but lower spending compared to USA & Canada. This may indicate more budget-friendly travel behavior.
Other Residencies exhibit high expenditure but fewer nights, possibly reflecting shorter but more premium trips.
Great Britain (England, Scotland & Wales) has the lowest total expenditure even though its total nights are the second highest, which suggests less spending per night, possibly due to frequent, short-distance travel.

Overall, the chart highlights that tourist origin affects both travel duration and spending behavior, emphasizing the value of segmenting travel strategies based on residency.
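
The comparison above can be made concrete with a simple ratio of Expenditure to Nights. This is a rough sketch using the rounded values from the regression_data printout; the ratio is expressed in the table's own units and the SpendPerNight column name is introduced here for illustration.

```r
# Values copied from the regression_data printout (rounded).
regression_toy <- data.frame(
  Residency   = c("Great Britain", "Other Europe (3)",
                  "Other Residencies", "USA & Canada"),
  Expenditure = c(23424, 35644, 60791, 64140),
  Nights      = c(33560, 49944, 15105, 29556)
)

# Expenditure per night, then sort from highest to lowest.
regression_toy$SpendPerNight <- regression_toy$Expenditure / regression_toy$Nights
regression_toy[order(-regression_toy$SpendPerNight), ]
```

On these figures Other Residencies has the highest ratio and Great Britain the lowest, which is consistent with the reading of the chart.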

Average expenditure by travel reason

# Average expenditure by travel reason
anova_data %>%
  group_by(Reason) %>%
  summarise(MeanExpenditure = mean(Expenditure, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(Reason, MeanExpenditure), y = MeanExpenditure, fill = Reason)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Average Expenditure by Reason for Travel",
       x = "Reason for Travel", y = "Mean Expenditure (Million €)") +
  theme_minimal()

Interpretation

This horizontal bar chart illustrates the average spending of tourists based on their main reason for visiting Ireland.

Tourists traveling to visit friends or relatives spend the most on average, indicating strong economic value from personal or family-related visits.
Those visiting for holiday/leisure/recreation are the second highest spenders, reinforcing the importance of tourism campaigns targeting relaxation and entertainment.
Business travelers and those with other reasons for journey exhibit notably lower average expenditure, possibly due to shorter stays or stricter travel budgets.

This analysis supports the ANOVA results, showing that travel purpose is a key driver of expenditure, with leisure and family visits leading in spending behavior.

Share of Total Expenditure by Tourist Origin

# Share of Total Expenditure by Tourist Origin
regression_data %>%
  group_by(Residency) %>%
  summarise(TotalExpenditure = sum(Expenditure, na.rm = TRUE)) %>%
  ggplot(aes(x = "", y = TotalExpenditure, fill = Residency)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(title = "Proportion of Total Tourist Expenditure by Origin") +
  theme(legend.position = "right")

Interpretation

This pie chart illustrates how tourist spending in Ireland is distributed across different regions of origin (residency).

USA & Canada accounts for the largest share of total expenditure, highlighting their economic importance despite longer travel distance.
Other Residencies (non-European) contribute nearly as much, suggesting strong spending by tourists from diverse global locations.
Other Europe (3) also represents a significant portion, reflecting steady regional tourism.
Great Britain contributes the smallest share, which may be due to shorter, more frequent trips with lower per-visit spending.

This distribution supports the need for tailored strategies that recognize where the highest-value tourists are coming from, particularly from North America and global “Other Residencies”.
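
Because proportions are hard to read off a pie chart, the shares can also be tabulated directly. The sketch below uses the rounded Expenditure values from the regression_data printout; the shortened group names are for display only.

```r
# Expenditure values copied from the regression_data printout (rounded).
expenditure <- c("Great Britain"     = 23424,
                 "Other Europe (3)"  = 35644,
                 "Other Residencies" = 60791,
                 "USA & Canada"      = 64140)

# Percentage share of the total, largest first.
share_pct <- round(100 * prop.table(expenditure), 1)
sort(share_pct, decreasing = TRUE)
```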

Correlation Analysis

cor_matrix <- regression_data %>%
  select(Expenditure, Nights, MeanNightlyCost) %>%
  mutate(across(everything(), as.numeric)) %>%
  cor(use = "complete.obs")

knitr::kable(round(cor_matrix, 2), caption = "Correlation Matrix")

Correlation Matrix
	Expenditure	Nights	MeanNightlyCost
Expenditure	1.00	-0.61	0.16
Nights	-0.61	1.00	-0.10
MeanNightlyCost	0.16	-0.10	1.00

# Simple correlation heatmap using ggplot2
library(reshape2)
cor_long <- melt(cor_matrix)

ggplot(cor_long, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "red", high = "blue", mid = "white",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Correlation") +
  geom_text(aes(label = round(value, 2)), size = 4) +
  theme_minimal() +
  labs(title = "Correlation Heatmap", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Interpretation

The correlation matrix shows a moderate negative correlation between Expenditure and Nights (-0.61) and a weak positive correlation between Expenditure and MeanNightlyCost (0.16). The two predictors are almost uncorrelated with each other (-0.10). Because the matrix is computed from only four residency groups, these coefficients are very imprecise and should not be read as evidence of a real relationship. On this evidence, neither duration of stay nor average cost per night clearly explains variation in tourist expenditure, which foreshadows the weak results of the regression models.
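
To gauge how much weight a correlation based on four points can carry, cor.test() reports a p-value and a confidence interval alongside the estimate. The sketch below uses the rounded Expenditure and Nights values from the regression_data printout, so the estimate may differ slightly in the later decimals from the matrix above.

```r
# Values copied from the regression_data printout (rounded).
expenditure <- c(23424, 35644, 60791, 64140)
nights      <- c(33560, 49944, 15105, 29556)

ct <- cor.test(expenditure, nights, method = "pearson")
ct$estimate  # Pearson r, close to the -0.61 in the matrix
ct$p.value   # two-sided p-value on n - 2 = 2 degrees of freedom
ct$conf.int  # 95% confidence interval for r
```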

Simple Linear Regression

Relationship between Nightly Cost and Expenditure

To formally test whether average nightly accommodation cost is associated with total expenditure, I perform a simple linear regression:

Null Hypothesis: There is no association between MeanNightlyCost and Expenditure.
Alternative Hypothesis: There is a significant association between MeanNightlyCost and Expenditure.

nightly_model <- lm(Expenditure ~ MeanNightlyCost, data = regression_data)
summary(nightly_model)


Call:
lm(formula = Expenditure ~ MeanNightlyCost, data = regression_data)

Residuals:
     1      2      3      4 
-23919  -7726  17214  14431 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      21905.2   108529.2   0.202    0.859
MeanNightlyCost    287.3     1286.3   0.223    0.844

Residual standard error: 23840 on 2 degrees of freedom
Multiple R-squared:  0.02434,   Adjusted R-squared:  -0.4635 
F-statistic: 0.04989 on 1 and 2 DF,  p-value: 0.844

Interpretation

The regression model shows no statistically significant association between MeanNightlyCost and Expenditure (p = 0.844). The estimated coefficient (287.3) is positive, but its standard error (1286.3) is more than four times the estimate, so the direction of the effect cannot be determined. The model's R-squared is 0.024, meaning nightly cost explains about 2.4% of the variation in expenditure, and the adjusted R-squared is negative (-0.46). The null hypothesis is therefore not rejected: with four observations, nightly accommodation cost alone is a poor predictor of total expenditure.
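
A confidence interval makes the uncertainty in the slope easier to see than the p-value alone. The sketch below refits the simple model on the rounded values from the regression_data printout, so its estimates will be close to, but not identical to, the summary output above.

```r
# Values copied from the regression_data printout (rounded).
regression_toy <- data.frame(
  Expenditure     = c(23424, 35644, 60791, 64140),
  MeanNightlyCost = c(88.5, 74.7, 75.4, 96.8)
)

fit <- lm(Expenditure ~ MeanNightlyCost, data = regression_toy)

# 95% confidence intervals for the intercept and the slope.
ci <- confint(fit, level = 0.95)
ci
```

An interval for the slope that contains zero is the interval-based counterpart of the non-significant t-test.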

Multiple Linear Regression

Nights and MeanNightlyCost as Predictors of Expenditure

Null Hypothesis: The predictors (Nights, MeanNightlyCost) do not significantly explain variation in Expenditure.
Alternative Hypothesis: At least one of the predictors significantly explains variation in Expenditure.

reg_model <- lm(Expenditure ~ Nights + MeanNightlyCost, data = regression_data)
summary(reg_model)


Call:
lm(formula = Expenditure ~ Nights + MeanNightlyCost, data = regression_data)

Residuals:
     1      2      3      4 
-22153   6007   2338  13808 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      5.753e+04  1.312e+05   0.439    0.737
Nights          -8.235e-01  1.089e+00  -0.756    0.588
MeanNightlyCost  1.771e+02  1.458e+03   0.121    0.923

Residual standard error: 26890 on 1 degrees of freedom
Multiple R-squared:  0.3793,    Adjusted R-squared:  -0.8621 
F-statistic: 0.3055 on 2 and 1 DF,  p-value: 0.7879

Interpretation

This multiple linear regression model investigates whether the number of nights (Nights) and the average nightly cost (MeanNightlyCost) can jointly predict total tourist expenditure.

Intercept (57,530): When both Nights and MeanNightlyCost are zero, the predicted expenditure is approximately 57,530 in the units of the Expenditure column. This is not meaningful in practice and is not significantly different from zero (p = 0.737); it acts only as the baseline of the model.
Nights Coefficient (-0.8235): The coefficient for Nights is negative but not statistically significant (p = 0.588). The negative sign is counterintuitive, but with a standard error of 1.089 the estimate cannot be distinguished from zero.
MeanNightlyCost Coefficient (177.1): The coefficient is positive but far from significant (p = 0.923), with a standard error (1,458) roughly eight times the estimate.
Model Significance (F-statistic = 0.3055, p = 0.7879): The model is not statistically significant overall, so the null hypothesis that the predictors do not explain expenditure is not rejected.
Model Fit (R-squared = 0.3793, Adjusted R-squared = -0.8621): The raw R-squared suggests that 37.9% of the variance is explained, but three coefficients are estimated from four observations, leaving one residual degree of freedom. The negative adjusted R-squared shows that the apparent fit does not survive the adjustment for the number of predictors.

In summary, neither predictor is statistically significant, and the model is fitted to too few observations to support conclusions about how nights or nightly cost relate to expenditure. This reinforces the importance of exploring other variables (like reason for travel) in later sections.
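
The adjusted R-squared values in the two summary outputs follow from the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The helper name adj_r2 below is introduced for illustration; the inputs are the R-squared values printed by summary().

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj_r2(0.3793,  n = 4, p = 2)  # multiple regression: about -0.8621
adj_r2(0.02434, n = 4, p = 1)  # simple regression:   about -0.4635
```

With n = 4 and p = 2 the penalty factor (n - 1)/(n - p - 1) equals 3, which is why a raw R-squared of 0.38 turns strongly negative after adjustment.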

Assumptions

Shapiro-Wilk Normality Test

To evaluate whether the expenditure data follows a normal distribution (a key assumption in regression and ANOVA), I conduct Shapiro-Wilk tests:

# Shapiro-Wilk tests
cat("Shapiro-Wilk test for regression_data Expenditure:\n")

Shapiro-Wilk test for regression_data Expenditure:

set.seed(123)
n <- min(5000, nrow(regression_data))
sampled_exp <- sample(regression_data$Expenditure, n, replace = FALSE)
print(shapiro.test(sampled_exp))


    Shapiro-Wilk normality test

data:  sampled_exp
W = 0.88828, p-value = 0.3752

cat("\nShapiro-Wilk test for anova_data Expenditure:\n")


Shapiro-Wilk test for anova_data Expenditure:

print(shapiro.test(anova_data$Expenditure))


    Shapiro-Wilk normality test

data:  anova_data$Expenditure
W = 0.53354, p-value < 2.2e-16

Interpretation

The two tests give different results. For regression_data, the Shapiro-Wilk test does not reject normality (W = 0.888, p = 0.375), although with only four observations the test has very little power. For anova_data, the test strongly rejects normality (W = 0.534, p < 2.2e-16). The assumption of normality is therefore violated for the ANOVA data, supporting the use of non-parametric methods like the Kruskal-Wallis test for group comparisons in the ANOVA section.
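
Strictly, linear regression assumes normally distributed residuals rather than a normally distributed response. As a hedged sketch of that alternative check, the code below refits the simple model on the rounded values from the regression_data printout and applies the Shapiro-Wilk test to its residuals. With four observations the result is only indicative.

```r
# Values copied from the regression_data printout (rounded).
regression_toy <- data.frame(
  Expenditure     = c(23424, 35644, 60791, 64140),
  MeanNightlyCost = c(88.5, 74.7, 75.4, 96.8)
)

fit <- lm(Expenditure ~ MeanNightlyCost, data = regression_toy)

# Test the residuals, not the raw response.
sw <- shapiro.test(residuals(fit))
sw
```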

set.seed(123)
sample_rows <- sample(nrow(regression_data), size = min(5000, nrow(regression_data)))
sample_model <- lm(Expenditure ~ Nights + MeanNightlyCost,
                   data = regression_data[sample_rows, ])
par(mfrow = c(2, 2))
plot(sample_model)

bptest(reg_model)


    studentized Breusch-Pagan test

data:  reg_model
BP = 1.5992, df = 2, p-value = 0.4495

vif(reg_model)

         Nights MeanNightlyCost 
       1.010091        1.010091
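
The variance inflation factor for a predictor is 1 / (1 - R²_j), where R²_j comes from regressing that predictor on the others. With two predictors this reduces to the squared correlation between them, so the value can be checked by hand. The sketch uses the rounded values from the regression_data printout; the variable names are introduced here for illustration.

```r
# Values copied from the regression_data printout (rounded).
nights            <- c(33560, 49944, 15105, 29556)
mean_nightly_cost <- c(88.5, 74.7, 75.4, 96.8)

# R-squared from regressing one predictor on the other.
r2_j <- summary(lm(nights ~ mean_nightly_cost))$r.squared

# VIF by hand; close to the 1.01 reported by vif(reg_model).
vif_by_hand <- 1 / (1 - r2_j)
vif_by_hand
```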

Interpretation

To validate the assumptions of linear regression, four diagnostic plots were examined alongside the Breusch-Pagan test and the variance inflation factors. Because the model is fitted to four observations and leaves one residual degree of freedom, the plots can only be read as rough indications.

Residuals vs Fitted
This plot checks for non-linearity and unequal variance (heteroscedasticity). With four points, any apparent curvature or funneling cannot be distinguished from noise.
Q-Q Plot
The Q-Q plot assesses normality of residuals. Four residuals are too few to judge departures from the diagonal line, particularly at the tails.
Scale-Location Plot
This plot checks for homoscedasticity (constant variance). The Breusch-Pagan test (BP = 1.5992, df = 2, p = 0.4495) does not reject constant variance, although its power is very low at this sample size.
Residuals vs Leverage
This plot helps detect influential observations. With four observations and three estimated coefficients, every point has high leverage, so each one strongly influences the fit.

Conclusion
The variance inflation factors (1.01 for both predictors) show no multicollinearity, and the Breusch-Pagan test gives no evidence of heteroscedasticity. The sample is too small, however, for the diagnostics to confirm that the assumptions hold, so conclusions drawn from the regression model should be interpreted with caution.

Given the poor model fit and the very small regression sample, I proceed with an ANOVA to test whether categorical groupings, specifically the main reason for travel, significantly affect tourist expenditure.

ANOVA and Post Hoc Analysis

Null Hypothesis: There is no significant difference in mean expenditure across different travel reasons.
Alternative Hypothesis: At least one travel reason group has a significantly different mean expenditure.

oneway.test(Expenditure ~ Reason, data = anova_data, var.equal = FALSE)


    One-way analysis of means (not assuming equal variances)

data:  Expenditure and Reason
F = 10.941, num df = 3.00, denom df = 267.02, p-value = 8.436e-07

kruskal.test(Expenditure ~ Reason, data = anova_data)


    Kruskal-Wallis rank sum test

data:  Expenditure by Reason
Kruskal-Wallis chi-squared = 28.304, df = 3, p-value = 3.136e-06

Interpretation:
Both Welch’s ANOVA and the Kruskal-Wallis test were conducted to determine whether the main reason for travel significantly affects tourist expenditure.

Welch’s ANOVA:
The test yielded an F-value of 10.94 with a p-value of 8.4e-07. This result is statistically significant (p < 0.001), indicating that at least one group mean (expenditure by reason) is significantly different from the others. Welch’s version is used here because it does not assume equal group variances.
Kruskal-Wallis Test:
The non-parametric Kruskal-Wallis test yielded a chi-squared value of 28.30 with a p-value of 3.1e-06. This result further confirms the presence of statistically significant differences in expenditure across different travel reasons. This test is more robust against the violations of normality and equal variances detected earlier.
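As a quick sanity check, both p-values can be recomputed from the printed test statistics and degrees of freedom. The sketch below uses only the rounded values shown in the output above, so the results should agree with the reported p-values up to rounding; the variable names are illustrative.

```r
# Statistics copied from the test output above (rounded as printed).
welch_F   <- 10.941
welch_df1 <- 3
welch_df2 <- 267.02
kw_chisq  <- 28.304
kw_df     <- 3

# Upper-tail probabilities of the F and chi-squared reference distributions.
welch_p <- pf(welch_F, welch_df1, welch_df2, lower.tail = FALSE)
kw_p    <- pchisq(kw_chisq, kw_df, lower.tail = FALSE)

print(c(welch = welch_p, kruskal_wallis = kw_p))
```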

Conclusion:
There is strong evidence to suggest that tourists spend significantly different amounts depending on their main reason for visiting Ireland. This justifies further post hoc pairwise comparisons to identify which specific groups differ from one another.
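The omnibus tests say that the groups differ, not which groups sit higher or how unequal their spreads are. A group-wise summary table is one way to see this before any post hoc testing. The sketch below runs on simulated data: `sim_anova` and its location values are arbitrary assumptions, and only the `Reason` labels are taken from the report.

```r
# Simulated stand-in for the analysis data (assumption, not the report's data).
set.seed(1)
reasons <- c("Business", "Holiday/leisure/recreation",
             "Other reason for journey", "Visit to friends/relatives")
sim_anova <- data.frame(
  Reason = factor(rep(reasons, each = 50), levels = reasons),
  # Log-normal draws mimic right-skewed spending; the meanlog values are arbitrary.
  Expenditure = rlnorm(200, meanlog = rep(c(4.0, 4.6, 4.1, 4.5), each = 50), sdlog = 0.6)
)

# Sample size, median, mean and standard deviation per travel reason.
group_summary <- aggregate(
  Expenditure ~ Reason, data = sim_anova,
  FUN = function(x) c(n = length(x), median = median(x), mean = mean(x), sd = sd(x))
)
# aggregate() returns the statistics as a matrix column; flatten it into plain columns.
group_summary <- do.call(data.frame, group_summary)
print(group_summary)
```

Clearly unequal standard deviations across groups would support the choice of `var.equal = FALSE`, and the medians indicate the direction of any differences that the rank-based tests detect.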

Pairwise Wilcoxon test

library(dplyr)  # provides %>% and filter(); may already be loaded earlier in the report

# Pairwise Wilcoxon test with Bonferroni correction
pw <- pairwise.wilcox.test(anova_data$Expenditure, anova_data$Reason, p.adjust.method = "bonferroni")

# Convert to data frame and remove NA comparisons
pw_df <- as.data.frame(as.table(pw$p.value)) %>%
  filter(!is.na(Freq))
colnames(pw_df) <- c("Group 1", "Group 2", "Adjusted p-value")

knitr::kable(pw_df, caption = "Pairwise Wilcoxon Test Results (Bonferroni-adjusted)")

Pairwise Wilcoxon Test Results (Bonferroni-adjusted)
Group 1	Group 2	Adjusted p-value
Holiday/leisure/recreation	Business	0.0016328
Other reason for journey	Business	1.0000000
Visit to friends/relatives	Business	0.0003017
Other reason for journey	Holiday/leisure/recreation	0.0040485
Visit to friends/relatives	Holiday/leisure/recreation	1.0000000
Visit to friends/relatives	Other reason for journey	0.0008118

Post Hoc Interpretation:
The pairwise Wilcoxon test (adjusted using the Bonferroni method) was conducted to identify which specific travel reasons had significantly different expenditure levels.

Holiday/leisure/recreation vs Business:
A significant difference was found (p = 0.0016), indicating that tourists traveling for holidays spend differently compared to those on business trips.
Visit to friends/relatives vs Business:
Highly significant difference (p = 0.0003), suggesting that this group also spends differently than business travelers.
Other reason for journey vs Holiday/leisure/recreation:
A significant difference was observed (p = 0.0040), highlighting variability in expenditure behavior between these two groups.
Visit to friends/relatives vs Other reason for journey:
Also significant (p = 0.0008), reinforcing the distinct spending patterns.
Non-significant pairs (p = 1.0):
- Other reason for journey vs Business
- Holiday/leisure/recreation vs Visit to friends/relatives

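These results provide a clearer picture of how tourist expenditure varies by purpose of visit. The travel reasons fall into two clusters: business and "other" travelers do not differ from each other, nor do holiday and visiting-friends/relatives travelers, while every comparison across the two clusters is statistically significant.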
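The Bonferroni adjustment behind the reported p-values is simple arithmetic: with four travel reasons there are six pairwise comparisons, and each raw p-value is multiplied by six and capped at 1. The sketch below backs out the implied raw p-values from the adjusted values reported above and re-applies the correction. The short pair labels are shorthand introduced here, and the 0.05 significance level is an assumption.

```r
# Adjusted p-values below 1, copied from the pairwise Wilcoxon results reported above.
adjusted <- c(
  "Holiday vs Business" = 0.0016328,
  "VFR vs Business"     = 0.0003017,
  "Other vs Holiday"    = 0.0040485,
  "VFR vs Other"        = 0.0008118
)

m   <- choose(4, 2)   # four travel reasons give six pairwise comparisons
raw <- adjusted / m   # implied unadjusted p-values (valid because none was capped at 1)

# Re-applying the correction over all six comparisons recovers the reported values.
readjusted <- p.adjust(raw, method = "bonferroni", n = m)
print(cbind(raw, readjusted))

# Equivalent view: compare each raw p-value with the per-comparison threshold alpha / m.
alpha <- 0.05
print(raw < alpha / m)
```

The two pairs reported with an adjusted p-value of 1 are omitted because their raw p-values cannot be recovered once the cap has been applied.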

Conclusion

This report aimed to investigate two key questions:
1. Do tourists who stay longer or pay more per night tend to spend more overall?
2. Does the reason for travel significantly influence tourist expenditure?

To address these questions, I employed both multiple linear regression and ANOVA-based inferential statistics. The regression analysis revealed that while both the number of nights and the average nightly cost were statistically significant predictors, their combined explanatory power was negligible (Adjusted R² < 0.2%). This indicates that these numeric variables alone are insufficient for predicting total expenditure, despite their statistical significance.

Given the weak predictive power of the regression model and violations of its assumptions, I proceeded with ANOVA to evaluate the influence of a categorical variable, the main reason for travel. Both Welch’s ANOVA and the Kruskal-Wallis test demonstrated significant differences in expenditure across travel purposes. The follow-up pairwise Wilcoxon tests showed that tourists traveling for holidays or to visit friends/relatives differ significantly in expenditure from business travelers and from those visiting for other reasons. Because these tests are two-sided, they establish that the groups differ; the direction of the difference has to be read from the group-level descriptive statistics.

In conclusion, the second research question — whether travel purpose impacts expenditure — is strongly supported by the data. The first question, concerning continuous predictors like trip duration and nightly cost, shows only a minimal relationship with expenditure. These insights highlight the importance of considering categorical behavioral factors, such as travel purpose, when analyzing tourist spending patterns.

Limitations of the Study

The dataset is aggregate-level, not individual tourist records.
Predictors are limited; richer data (e.g., demographics, group size) would improve model performance.
Non-constant residual variance and outliers reduce the reliability of the regression model.

Recommendations and Further Research

Incorporate more variables in future surveys, especially tourist demographics.
Target marketing towards high-spending visitor segments (holiday, relatives).
Consider interactive dashboards (as built separately) to present KPIs and trends in real-time.
Explore machine learning techniques for richer prediction if data allows.

This study offers evidence that tourist spending patterns vary by purpose, and that longer stays do not always imply higher total expenditure.

References

Central Statistics Office (2025) Inbound Tourism February 2025 – Data and Results. Available at: https://www.cso.ie/en/releasesandpublications/ep/p-ibt/inboundtourismfebruary2025/.