Introduction

This report investigates key factors that influence tourist spending in Ireland, using the latest release from the Central Statistics Office (CSO). The guiding research question is:

What factors significantly impact the total expenditure of inbound tourists in Ireland?

Research Question

Tourism is vital to Ireland’s economy, and understanding what drives visitor expenditure can help improve marketing, resource allocation, and policy design.

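This includes investigating how duration of stay (nights), average nightly accommodation cost, and main reason for travel relate to total expenditure.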

Understanding these factors can help policymakers and tourism boards design more effective strategies.

Scope of Analysis

This study employs both descriptive and inferential statistical techniques to explore the research question.

Descriptive analysis includes visual summaries and correlation matrices to understand the overall patterns in the data.

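Inferential statistics are used to test hypotheses and draw conclusions: simple and multiple linear regression for the numeric predictors, and Welch’s ANOVA, the Kruskal-Wallis test, and pairwise Wilcoxon tests for the reason for travel.

Together, these methods help answer the central research question and provide evidence-based insights into tourist spending behavior.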

Statistical Hypotheses

To formally guide the analysis, I set up the following hypotheses:

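For Multiple Linear Regression:
H0: Neither the number of nights nor the mean nightly cost is linearly associated with total expenditure (both slope coefficients equal zero).
H1: At least one of the two slope coefficients differs from zero.

For One-Way ANOVA:
H0: Mean expenditure is the same for every main reason for travel.
H1: At least one reason for travel has a different mean expenditure.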

This study uses official aggregated data to examine the relationships between tourist expenditure and three variables: duration of stay (nights), average nightly cost, and reason for travel. Multiple linear regression and ANOVA are employed, and their statistical assumptions are checked.

Dataset Presentation

About Dataset

The dataset used was sourced from the Central Statistics Office of Ireland: https://www.cso.ie/en/releasesandpublications/ep/p-ibt/inboundtourismfebruary2025/data/

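Seven CSV files were downloaded from the Central Statistics Office, each covering a different aspect of inbound tourism:

ITM01: passengers departing overseas, by passenger category
ITM02: overnight trips by foreign visitors, by detailed residency
ITM03: overnight trips by foreign visitors, by residency
ITM04: overnight trips by foreign visitors, by main reason for travel
ITM05: overnight trips by foreign visitors, by main accommodation type
ITM06: expenditure of overnight foreign visitors, by residency and expenditure type
ITM07: mean nightly accommodation costs, by residency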

Data Preparation and Merging

This section sets up the full dataset for analysis. I merge multiple Central Statistics Office (CSO) files to create a unified view of tourists’ expenditure patterns. This forms the foundation of my analysis.

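# Load packages used throughout the report
library(tidyverse)  # read_csv, dplyr verbs, tidyr, ggplot2
library(lmtest)     # bptest
library(car)        # vif

# Load datasets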
itm01 <- read_csv("../Data/ITM01.csv")
itm02 <- read_csv("../Data/ITM02.csv")
itm03 <- read_csv("../Data/ITM03.csv")
itm04 <- read_csv("../Data/ITM04.csv")
itm05 <- read_csv("../Data/ITM05.csv")
itm06 <- read_csv("../Data/ITM06.csv")
itm07 <- read_csv("../Data/ITM07.csv")

# View structure of all datasets
glimpse(itm01); head(itm01)
Rows: 324
Columns: 8
$ STATISTIC            <chr> "ITM01C01", "ITM01C01", "ITM01C01", "ITM01C01", "…
$ `Statistic Label`    <chr> "Number of Passengers Departing Overseas", "Numbe…
$ `TLIST(M1)`          <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023…
$ Month                <chr> "2023 January", "2023 January", "2023 January", "…
$ C04187V04959         <chr> "10", "20", "30", "40", "50", "-", "10", "20", "3…
$ `Passenger Category` <chr> "Outbound Irish", "Same Day Visitor: Northern Iri…
$ UNIT                 <chr> "Thousand", "Thousand", "Thousand", "Thousand", "…
$ VALUE                <dbl> 715.1, 49.8, 47.6, 17.3, 400.0, 1229.8, 770.9, 60…
# A tibble: 6 × 8
  STATISTIC `Statistic Label`                     `TLIST(M1)` Month C04187V04959
  <chr>     <chr>                                 <chr>       <chr> <chr>       
1 ITM01C01  Number of Passengers Departing Overs… 2023M01     2023… 10          
2 ITM01C01  Number of Passengers Departing Overs… 2023M01     2023… 20          
3 ITM01C01  Number of Passengers Departing Overs… 2023M01     2023… 30          
4 ITM01C01  Number of Passengers Departing Overs… 2023M01     2023… 40          
5 ITM01C01  Number of Passengers Departing Overs… 2023M01     2023… 50          
6 ITM01C01  Number of Passengers Departing Overs… 2023M01     2023… -           
# ℹ 3 more variables: `Passenger Category` <chr>, UNIT <chr>, VALUE <dbl>
glimpse(itm02); head(itm02)
Rows: 702
Columns: 8
$ STATISTIC            <chr> "ITM02C01", "ITM02C01", "ITM02C01", "ITM02C01", "…
$ `Statistic Label`    <chr> "Number of Overnight Trips by Foreign Visitors", …
$ `TLIST(M1)`          <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023…
$ Month                <chr> "2023 January", "2023 January", "2023 January", "…
$ C04188V04960         <chr> "XB", "BENLLU", "DKNDSEFI", "FR", "DE", "IT", "ES…
$ `Detailed Residency` <chr> "Great Britain (England, Scotland & Wales)", "Bel…
$ UNIT                 <chr> "Thousand", "Thousand", "Thousand", "Thousand", "…
$ VALUE                <dbl> 155.9, 21.7, 4.2, 22.8, 25.2, 13.2, 27.0, 46.4, 1…
# A tibble: 6 × 8
  STATISTIC `Statistic Label`                     `TLIST(M1)` Month C04188V04960
  <chr>     <chr>                                 <chr>       <chr> <chr>       
1 ITM02C01  Number of Overnight Trips by Foreign… 2023M01     2023… XB          
2 ITM02C01  Number of Overnight Trips by Foreign… 2023M01     2023… BENLLU      
3 ITM02C01  Number of Overnight Trips by Foreign… 2023M01     2023… DKNDSEFI    
4 ITM02C01  Number of Overnight Trips by Foreign… 2023M01     2023… FR          
5 ITM02C01  Number of Overnight Trips by Foreign… 2023M01     2023… DE          
6 ITM02C01  Number of Overnight Trips by Foreign… 2023M01     2023… IT          
# ℹ 3 more variables: `Detailed Residency` <chr>, UNIT <chr>, VALUE <dbl>
glimpse(itm03); head(itm03)
Rows: 675
Columns: 8
$ STATISTIC         <chr> "ITM03C01", "ITM03C01", "ITM03C01", "ITM03C01", "ITM…
$ `Statistic Label` <chr> "Number of Overnight Trips by Foreign Visitors", "Nu…
$ `TLIST(M1)`       <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023M01…
$ Month             <chr> "2023 January", "2023 January", "2023 January", "202…
$ C04189V04961      <chr> "XB", "OTHEUR3", "USCA", "OTHR1", "-", "XB", "OTHEUR…
$ Residency         <chr> "Great Britain (England, Scotland & Wales)", "Other …
$ UNIT              <chr> "Thousand", "Thousand", "Thousand", "Thousand", "Tho…
$ VALUE             <dbl> 155.9, 160.4, 51.1, 32.6, 400.0, 148.3, 128.9, 41.5,…
# A tibble: 6 × 8
  STATISTIC `Statistic Label`     `TLIST(M1)` Month C04189V04961 Residency UNIT 
  <chr>     <chr>                 <chr>       <chr> <chr>        <chr>     <chr>
1 ITM03C01  Number of Overnight … 2023M01     2023… XB           Great Br… Thou…
2 ITM03C01  Number of Overnight … 2023M01     2023… OTHEUR3      Other Eu… Thou…
3 ITM03C01  Number of Overnight … 2023M01     2023… USCA         USA & Ca… Thou…
4 ITM03C01  Number of Overnight … 2023M01     2023… OTHR1        Other Re… Thou…
5 ITM03C01  Number of Overnight … 2023M01     2023… -            All Resi… Thou…
6 ITM03C01  Number of Overnight … 2023M02     2023… XB           Great Br… Thou…
# ℹ 1 more variable: VALUE <dbl>
glimpse(itm04); head(itm04)
Rows: 675
Columns: 8
$ STATISTIC                <chr> "ITM04C01", "ITM04C01", "ITM04C01", "ITM04C01…
$ `Statistic Label`        <chr> "Number of Overnight Trips by Foreign Visitor…
$ `TLIST(M1)`              <chr> "2023M01", "2023M01", "2023M01", "2023M01", "…
$ Month                    <chr> "2023 January", "2023 January", "2023 January…
$ C02118V02559             <chr> "3", "1", "2", "4", "-", "3", "1", "2", "4", …
$ `Main Reason for Travel` <chr> "Business", "Holiday/leisure/recreation", "Vi…
$ UNIT                     <chr> "Thousand", "Thousand", "Thousand", "Thousand…
$ VALUE                    <dbl> 50.1, 107.9, 217.0, 25.0, 400.0, 64.0, 98.3, …
# A tibble: 6 × 8
  STATISTIC `Statistic Label`                     `TLIST(M1)` Month C02118V02559
  <chr>     <chr>                                 <chr>       <chr> <chr>       
1 ITM04C01  Number of Overnight Trips by Foreign… 2023M01     2023… 3           
2 ITM04C01  Number of Overnight Trips by Foreign… 2023M01     2023… 1           
3 ITM04C01  Number of Overnight Trips by Foreign… 2023M01     2023… 2           
4 ITM04C01  Number of Overnight Trips by Foreign… 2023M01     2023… 4           
5 ITM04C01  Number of Overnight Trips by Foreign… 2023M01     2023… -           
6 ITM04C01  Number of Overnight Trips by Foreign… 2023M02     2023… 3           
# ℹ 3 more variables: `Main Reason for Travel` <chr>, UNIT <chr>, VALUE <dbl>
glimpse(itm05); head(itm05)
Rows: 810
Columns: 8
$ STATISTIC                 <chr> "ITM05C01", "ITM05C01", "ITM05C01", "ITM05C0…
$ `Statistic Label`         <chr> "Number of Overnight Trips by Foreign Visito…
$ `TLIST(M1)`               <chr> "2023M01", "2023M01", "2023M01", "2023M01", …
$ Month                     <chr> "2023 January", "2023 January", "2023 Januar…
$ C02164V02610              <chr> "93", "92", "30", "94", "225", "-", "93", "9…
$ `Main Accommodation Type` <chr> "Hotel/conference centre", "Guest house/bed …
$ UNIT                      <chr> "Thousand", "Thousand", "Thousand", "Thousan…
$ VALUE                     <dbl> 121.0, 8.5, 243.4, 13.8, 13.4, 400.0, 142.9,…
# A tibble: 6 × 8
  STATISTIC `Statistic Label`                     `TLIST(M1)` Month C02164V02610
  <chr>     <chr>                                 <chr>       <chr> <chr>       
1 ITM05C01  Number of Overnight Trips by Foreign… 2023M01     2023… 93          
2 ITM05C01  Number of Overnight Trips by Foreign… 2023M01     2023… 92          
3 ITM05C01  Number of Overnight Trips by Foreign… 2023M01     2023… 30          
4 ITM05C01  Number of Overnight Trips by Foreign… 2023M01     2023… 94          
5 ITM05C01  Number of Overnight Trips by Foreign… 2023M01     2023… 225         
6 ITM05C01  Number of Overnight Trips by Foreign… 2023M01     2023… -           
# ℹ 3 more variables: `Main Accommodation Type` <chr>, UNIT <chr>, VALUE <dbl>
glimpse(itm06); head(itm06)
Rows: 2,025
Columns: 10
$ STATISTIC          <chr> "ITM06C01", "ITM06C01", "ITM06C01", "ITM06C01", "IT…
$ `Statistic Label`  <chr> "Expenditure of Overnight Foreign Visitors", "Expen…
$ `TLIST(M1)`        <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023M0…
$ Month              <chr> "2023  January", "2023  January", "2023  January", …
$ C04189V04961       <chr> "XB", "XB", "XB", "XB", "XB", "OTHEUR3", "OTHEUR3",…
$ Residency          <chr> "Great Britain (England, Scotland & Wales)", "Great…
$ C04190V04962       <chr> "10", "20", "30", "40", "-", "10", "20", "30", "40"…
$ `Expenditure Type` <chr> "Fare", "Prepayments", "Accommodation", "Day-to-Day…
$ UNIT               <chr> "Euro Million", "Euro Million", "Euro Million", "Eu…
$ VALUE              <dbl> 19.6, 1.0, 13.5, 45.1, 79.2, 22.7, 0.5, 28.5, 58.9,…
# A tibble: 6 × 10
  STATISTIC `Statistic Label`           `TLIST(M1)` Month C04189V04961 Residency
  <chr>     <chr>                       <chr>       <chr> <chr>        <chr>    
1 ITM06C01  Expenditure of Overnight F… 2023M01     2023… XB           Great Br…
2 ITM06C01  Expenditure of Overnight F… 2023M01     2023… XB           Great Br…
3 ITM06C01  Expenditure of Overnight F… 2023M01     2023… XB           Great Br…
4 ITM06C01  Expenditure of Overnight F… 2023M01     2023… XB           Great Br…
5 ITM06C01  Expenditure of Overnight F… 2023M01     2023… XB           Great Br…
6 ITM06C01  Expenditure of Overnight F… 2023M01     2023… OTHEUR3      Other Eu…
# ℹ 4 more variables: C04190V04962 <chr>, `Expenditure Type` <chr>, UNIT <chr>,
#   VALUE <dbl>
glimpse(itm07); head(itm07)
Rows: 270
Columns: 8
$ STATISTIC         <chr> "ITM07C01", "ITM07C01", "ITM07C01", "ITM07C01", "ITM…
$ `Statistic Label` <chr> "Mean Nightly Accommodation Costs of Overnight Forei…
$ `TLIST(M1)`       <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023M01…
$ Monthly           <chr> "2023 January", "2023 January", "2023 January", "202…
$ C04189V04961      <chr> "XB", "OTHEUR3", "USCA", "OTHR1", "-", "XB", "OTHEUR…
$ Residency         <chr> "Great Britain (England, Scotland & Wales)", "Other …
$ UNIT              <chr> "Euro", "Euro", "Euro", "Euro", "Euro", "Euro", "Eur…
$ VALUE             <dbl> 94.0, 57.0, 88.0, 51.0, 73.0, 91.0, 58.0, 93.0, 58.0…
# A tibble: 6 × 8
  STATISTIC `Statistic Label`   `TLIST(M1)` Monthly C04189V04961 Residency UNIT 
  <chr>     <chr>               <chr>       <chr>   <chr>        <chr>     <chr>
1 ITM07C01  Mean Nightly Accom… 2023M01     2023 J… XB           Great Br… Euro 
2 ITM07C01  Mean Nightly Accom… 2023M01     2023 J… OTHEUR3      Other Eu… Euro 
3 ITM07C01  Mean Nightly Accom… 2023M01     2023 J… USCA         USA & Ca… Euro 
4 ITM07C01  Mean Nightly Accom… 2023M01     2023 J… OTHR1        Other Re… Euro 
5 ITM07C01  Mean Nightly Accom… 2023M01     2023 J… -            All Resi… Euro 
6 ITM07C01  Mean Nightly Accom… 2023M02     2023 F… XB           Great Br… Euro 
# ℹ 1 more variable: VALUE <dbl>
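# Prepare data for merging
# Total expenditure per residency: VALUE is summed over every month
# (and every STATISTIC code) in ITM06 for the "All Travel Expenditure" rows.
# The "All Residencies" row is dropped later by the inner joins.
expenditure <- itm06 %>%
  filter(`Expenditure Type` == "All Travel Expenditure") %>%
  group_by(Residency) %>%
  summarise(Expenditure = sum(VALUE))

# Nights per residency: VALUE is summed over every ITM03 row in each group
nights <- itm03 %>%
  filter(Residency != "All Residencies") %>%
  group_by(Residency) %>%
  summarise(Nights = sum(VALUE))

# Mean nightly accommodation cost per residency, averaged over all ITM07 rows
accommodation <- itm07 %>%
  filter(Residency != "All Residencies") %>%
  group_by(Residency) %>%
  summarise(MeanNightlyCost = mean(VALUE))

# Merge datasets for regression (one row per residency group)
regression_data <- expenditure %>%
  inner_join(nights, by = "Residency") %>%
  inner_join(accommodation, by = "Residency")

# Prepare dataset for ANOVA: keep the individual reason-for-travel rows.
# Note: VALUE comes from ITM04, whose UNIT column reads "Thousand";
# it is renamed Expenditure here.
anova_data <- itm04 %>%
  filter(`Main Reason for Travel` != "All reasons for journey") %>%
  rename(Reason = `Main Reason for Travel`, Expenditure = VALUE)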
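
The filter, aggregate, and join pattern used above can be checked on a small example. The following is a minimal sketch in base R, assuming a toy long-format table whose values are made up (they are not taken from the CSO files); merge() stands in for inner_join() so that the sketch runs without extra packages.

```r
# Toy long-format table in the CSO layout (all values are made up)
toy <- data.frame(
  Residency = rep(c("Group A", "Group B", "All Residencies"), each = 2),
  Month     = rep(c("2023M01", "2023M02"), times = 3),
  VALUE     = c(10, 12, 20, 22, 30, 34)
)

# Drop the pre-aggregated total so it is not treated as a group of its own
toy_groups <- subset(toy, Residency != "All Residencies")

# One row per residency: sum VALUE across months
totals <- aggregate(VALUE ~ Residency, data = toy_groups, FUN = sum)
names(totals)[names(totals) == "VALUE"] <- "Expenditure"

# A second per-residency table (hypothetical mean nightly costs)
costs <- data.frame(
  Residency       = c("Group A", "Group B"),
  MeanNightlyCost = c(80, 95)
)

# merge() keeps only keys present in both tables, like dplyr::inner_join()
merged <- merge(totals, costs, by = "Residency")

# The join should neither drop nor duplicate residency groups
stopifnot(nrow(merged) == length(unique(toy_groups$Residency)))
print(merged)
```

A row-count check of this kind is a cheap way to confirm that a join has kept exactly one row per group before any model is fitted.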

Dataset Overview

regression_data
# A tibble: 4 × 4
  Residency                                 Expenditure Nights MeanNightlyCost
  <chr>                                           <dbl>  <dbl>           <dbl>
1 Great Britain (England, Scotland & Wales)      23424. 33560.            88.5
2 Other Europe (3)                               35644. 49944.            74.7
3 Other Residencies                              60791. 15105.            75.4
4 USA & Canada                                   64140. 29556.            96.8
head(anova_data)
# A tibble: 6 × 8
  STATISTIC `Statistic Label`        `TLIST(M1)` Month C02118V02559 Reason UNIT 
  <chr>     <chr>                    <chr>       <chr> <chr>        <chr>  <chr>
1 ITM04C01  Number of Overnight Tri… 2023M01     2023… 3            Busin… Thou…
2 ITM04C01  Number of Overnight Tri… 2023M01     2023… 1            Holid… Thou…
3 ITM04C01  Number of Overnight Tri… 2023M01     2023… 2            Visit… Thou…
4 ITM04C01  Number of Overnight Tri… 2023M01     2023… 4            Other… Thou…
5 ITM04C01  Number of Overnight Tri… 2023M02     2023… 3            Busin… Thou…
6 ITM04C01  Number of Overnight Tri… 2023M02     2023… 1            Holid… Thou…
# ℹ 1 more variable: Expenditure <dbl>

Data Preprocessing

Before analysis, both datasets (regression_data and anova_data) are checked for null values, distribution shapes, and potential outliers. These steps ensure data quality and help shape interpretation.

# Check for missing values
cat("Missing values in regression_data:\n")
Missing values in regression_data:
print(colSums(is.na(regression_data)))
      Residency     Expenditure          Nights MeanNightlyCost 
              0               0               0               0 
cat("\nMissing values in anova_data:\n")

Missing values in anova_data:
print(colSums(is.na(anova_data)))
      STATISTIC Statistic Label       TLIST(M1)           Month    C02118V02559 
              0               0               0               0               0 
         Reason            UNIT     Expenditure 
              0               0               0 
# Reshape both for plotting
reg_long <- regression_data %>%
  pivot_longer(cols = c(Expenditure, Nights, MeanNightlyCost), names_to = "Variable", values_to = "Value")

anova_long <- anova_data %>%
  pivot_longer(cols = c(Expenditure), names_to = "Variable", values_to = "Value")
# Outlier detection
ggplot(reg_long, aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal() +
  labs(title = "Outlier Detection in Regression Data")
ggplot(anova_long, aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "salmon") +
  theme_minimal() +
  labs(title = "Outlier Detection in ANOVA Data")

There are no null values in regression_data and anova_data. While some data points may appear as outliers, they are retained as they likely represent real differences in spending behavior — especially among high-expenditure tourist groups.
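
For reference, the points that a boxplot draws individually are those lying more than 1.5 times the interquartile range beyond the hinges. A minimal sketch of that rule in base R, assuming a made-up vector of skewed values (not the report's data), is:

```r
# Hypothetical right-skewed values (not taken from anova_data)
x <- c(4, 11, 36, 183, 3116)

# boxplot.stats() applies the 1.5 * IQR convention also used for boxplot whiskers
bs <- boxplot.stats(x, coef = 1.5)

lower_hinge <- bs$stats[2]
upper_hinge <- bs$stats[4]
spread <- upper_hinge - lower_hinge

# Values outside these fences are reported as potential outliers
fences <- c(lower = lower_hinge - 1.5 * spread,
            upper = upper_hinge + 1.5 * spread)
flagged <- bs$out

print(fences)
print(flagged)
```

The rule only flags values as unusual relative to the bulk of the data; whether to keep them remains a judgment about the data source, as discussed above.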

Exploratory Data Analysis

This analysis is based on a merged dataset that aggregates inbound tourists into residency groups. After cleaning and filtering, the final dataset contains:

n_obs <- nrow(regression_data)
n_vars <- ncol(regression_data)
glue::glue("The dataset consists of {n_obs} observations and {n_vars} variables.")
The dataset consists of 4 observations and 4 variables.

Descriptive Statistics

# Descriptive stats for regression_data
regression_data %>%
  select(Expenditure, Nights, MeanNightlyCost) %>%
  summary()
  Expenditure        Nights      MeanNightlyCost
 Min.   :23424   Min.   :15105   Min.   :74.71  
 1st Qu.:32589   1st Qu.:25943   1st Qu.:75.25  
 Median :48217   Median :31558   Median :81.98  
 Mean   :46000   Mean   :32041   Mean   :83.86  
 3rd Qu.:61628   3rd Qu.:37656   3rd Qu.:90.60  
 Max.   :64140   Max.   :49944   Max.   :96.77  
# Descriptive stats for anova_data
anova_data %>%
  select(Expenditure) %>%
  summary()
  Expenditure     
 Min.   :   3.90  
 1st Qu.:  11.28  
 Median :  35.75  
 Mean   : 237.13  
 3rd Qu.: 182.62  
 Max.   :3116.40  

Expenditure and nights vs residency

# Expenditure and nights vs residency
regression_data_long <- regression_data %>%
  pivot_longer(cols = c(Expenditure, Nights), names_to = "Metric", values_to = "Value")

ggplot(regression_data_long, aes(x = Residency, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(title = "Comparison of Expenditure and Nights by Residency",
       x = "Tourist Origin (Residency)", y = "Value (Million € or Nights)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation

This grouped bar chart compares total expenditure and total nights stayed across different residency groups visiting Ireland.

Overall, the chart highlights that tourist origin affects both travel duration and spending behavior, emphasizing the value of segmenting travel strategies based on residency.

Average expenditure by travel reason

# Average expenditure by travel reason
anova_data %>%
  group_by(Reason) %>%
  summarise(MeanExpenditure = mean(Expenditure, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(Reason, MeanExpenditure), y = MeanExpenditure, fill = Reason)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Average Expenditure by Reason for Travel",
       x = "Reason for Travel", y = "Mean Expenditure (Million €)") +
  theme_minimal()

Interpretation

This horizontal bar chart illustrates the average spending of tourists based on their main reason for visiting Ireland.

This analysis supports the ANOVA results, showing that travel purpose is a key driver of expenditure, with leisure and family visits leading in spending behavior.

Share of Total Expenditure by Tourist Origin

# Share of Total Expenditure by Tourist Origin
regression_data %>%
  group_by(Residency) %>%
  summarise(TotalExpenditure = sum(Expenditure, na.rm = TRUE)) %>%
  ggplot(aes(x = "", y = TotalExpenditure, fill = Residency)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(title = "Proportion of Total Tourist Expenditure by Origin") +
  theme(legend.position = "right")

Interpretation

This pie chart illustrates how tourist spending in Ireland is distributed across different regions of origin (residency).

This distribution supports the need for tailored strategies that recognize where the highest-value tourists are coming from, particularly from North America and global “Other Residencies”.

Correlation Analysis

cor_matrix <- regression_data %>%
  select(Expenditure, Nights, MeanNightlyCost) %>%
  mutate(across(everything(), as.numeric)) %>%
  cor(use = "complete.obs")

knitr::kable(round(cor_matrix, 2), caption = "Correlation Matrix")
Correlation Matrix
                 Expenditure  Nights  MeanNightlyCost
Expenditure             1.00   -0.61             0.16
Nights                 -0.61    1.00            -0.10
MeanNightlyCost         0.16   -0.10             1.00
# Simple correlation heatmap using ggplot2
library(reshape2)
cor_long <- melt(cor_matrix)

ggplot(cor_long, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "red", high = "blue", mid = "white",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Correlation") +
  geom_text(aes(label = round(value, 2)), size = 4) +
  theme_minimal() +
  labs(title = "Correlation Heatmap", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Interpretation

The correlation matrix shows a moderate negative correlation between Expenditure and Nights (-0.61) and a weak positive correlation between Expenditure and MeanNightlyCost (0.16); Nights and MeanNightlyCost are almost uncorrelated (-0.10). Because these coefficients are computed from only four residency groups, they are imprecise and should not be read as evidence that either duration of stay or average cost per night explains variation in tourist expenditure. This foreshadows the low explanatory power seen later in the regression models.
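
To see how much weight a correlation from four observations can carry, the coefficients can be tested formally. The sketch below is an illustration only: it re-enters the rounded values printed in the Dataset Overview (the data frame name regression_values is my own), so results may differ slightly from the unrounded data.

```r
# Rounded values copied from the printed regression_data table
regression_values <- data.frame(
  Expenditure     = c(23424, 35644, 60791, 64140),
  Nights          = c(33560, 49944, 15105, 29556),
  MeanNightlyCost = c(88.5, 74.7, 75.4, 96.8)
)

# Pearson correlation tests (t test on n - 2 = 2 degrees of freedom)
ct_nights <- cor.test(regression_values$Expenditure, regression_values$Nights)
ct_cost   <- cor.test(regression_values$Expenditure, regression_values$MeanNightlyCost)

# Estimates and p-values side by side
data.frame(
  pair     = c("Expenditure ~ Nights", "Expenditure ~ MeanNightlyCost"),
  estimate = round(unname(c(ct_nights$estimate, ct_cost$estimate)), 2),
  p_value  = round(c(ct_nights$p.value, ct_cost$p.value), 3)
)
```

With two degrees of freedom, even a coefficient of moderate size is compatible with no association at all, which is why the matrix is treated here as descriptive only.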

Simple Linear Regression

Relationship between Nightly Cost and Expenditure

To formally test whether average nightly accommodation cost is associated with total expenditure, I perform a simple linear regression:

nightly_model <- lm(Expenditure ~ MeanNightlyCost, data = regression_data)
summary(nightly_model)

Call:
lm(formula = Expenditure ~ MeanNightlyCost, data = regression_data)

Residuals:
     1      2      3      4 
-23919  -7726  17214  14431 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      21905.2   108529.2   0.202    0.859
MeanNightlyCost    287.3     1286.3   0.223    0.844

Residual standard error: 23840 on 2 degrees of freedom
Multiple R-squared:  0.02434,   Adjusted R-squared:  -0.4635 
F-statistic: 0.04989 on 1 and 2 DF,  p-value: 0.844

Interpretation

The regression model shows no statistically significant association between MeanNightlyCost and Expenditure (p = 0.844). The estimated coefficient (287.3) would imply that each additional euro in mean nightly cost is associated with about 287 million euros more in total expenditure, but its standard error (1286.3) is more than four times the estimate, so neither the sign nor the size of the effect is determined by the data. The model’s R-squared is only 0.024, indicating that nightly cost explains about 2.4% of the variation in expenditure across the four residency groups, and the adjusted R-squared is negative (-0.4635). Nightly accommodation cost alone is therefore a poor predictor of total expenditure.
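
In a simple linear regression, R-squared equals the squared Pearson correlation between the response and the predictor, which ties this model back to the correlation matrix. A minimal sketch, assuming the rounded values printed in the Dataset Overview (re-entered by hand under the hypothetical name regression_values):

```r
# Rounded values copied from the printed regression_data table
regression_values <- data.frame(
  Expenditure     = c(23424, 35644, 60791, 64140),
  MeanNightlyCost = c(88.5, 74.7, 75.4, 96.8)
)

fit <- lm(Expenditure ~ MeanNightlyCost, data = regression_values)

# R-squared reported by the model
r_squared <- summary(fit)$r.squared

# Squared Pearson correlation between response and predictor
r <- cor(regression_values$Expenditure, regression_values$MeanNightlyCost)

# The two quantities coincide for a one-predictor model
stopifnot(isTRUE(all.equal(r_squared, r^2)))

# p-value of the slope
slope_p <- summary(fit)$coefficients["MeanNightlyCost", "Pr(>|t|)"]
print(c(r_squared = r_squared, slope_p = slope_p))
```

Because the inputs are rounded, the figures will not match the summary output above to every decimal place.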

Multiple Linear Regression

Nights and MeanNightlyCost as Predictors of Expenditure

reg_model <- lm(Expenditure ~ Nights + MeanNightlyCost, data = regression_data)
summary(reg_model)

Call:
lm(formula = Expenditure ~ Nights + MeanNightlyCost, data = regression_data)

Residuals:
     1      2      3      4 
-22153   6007   2338  13808 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      5.753e+04  1.312e+05   0.439    0.737
Nights          -8.235e-01  1.089e+00  -0.756    0.588
MeanNightlyCost  1.771e+02  1.458e+03   0.121    0.923

Residual standard error: 26890 on 1 degrees of freedom
Multiple R-squared:  0.3793,    Adjusted R-squared:  -0.8621 
F-statistic: 0.3055 on 2 and 1 DF,  p-value: 0.7879

Interpretation

This multiple linear regression model investigates whether the number of nights (Nights) and the average nightly cost (MeanNightlyCost) can jointly predict total tourist expenditure.

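In summary, neither predictor is statistically significant (Nights: p = 0.588; MeanNightlyCost: p = 0.923), and the overall model is not significant either (F = 0.3055, p = 0.7879). The multiple R-squared of 0.3793 is not meaningful with four observations and three estimated coefficients, as the negative adjusted R-squared (-0.8621) shows. This reinforces the importance of exploring other variables (like reason for travel) in later sections.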
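
The gap between R-squared and adjusted R-squared follows directly from the degrees of freedom. The sketch below assumes the rounded values printed in the Dataset Overview (re-entered under the hypothetical name regression_values) and recomputes the adjustment by hand.

```r
# Rounded values copied from the printed regression_data table
regression_values <- data.frame(
  Expenditure     = c(23424, 35644, 60791, 64140),
  Nights          = c(33560, 49944, 15105, 29556),
  MeanNightlyCost = c(88.5, 74.7, 75.4, 96.8)
)

fit <- lm(Expenditure ~ Nights + MeanNightlyCost, data = regression_values)
s <- summary(fit)

n <- nrow(regression_values)  # number of observations
p <- 2                        # number of predictors (excluding the intercept)

# Adjusted R-squared penalises R-squared by (n - 1) / (n - p - 1)
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

print(c(residual_df = df.residual(fit),
        r_squared   = s$r.squared,
        adj_manual  = adj_manual,
        adj_model   = s$adj.r.squared))
```

With n = 4 and p = 2 the penalty factor is 3, so any R-squared below two thirds produces a negative adjusted value.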

Assumptions

Shapiro-Wilk Normality Test

To evaluate whether the expenditure data follows a normal distribution (a key assumption in regression and ANOVA), I conduct Shapiro-Wilk tests:
# Shapiro-Wilk tests
cat("Shapiro-Wilk test for regression_data Expenditure:\n")
Shapiro-Wilk test for regression_data Expenditure:
set.seed(123)
n <- min(5000, nrow(regression_data))
sampled_exp <- sample(regression_data$Expenditure, n, replace = FALSE)
print(shapiro.test(sampled_exp))

    Shapiro-Wilk normality test

data:  sampled_exp
W = 0.88828, p-value = 0.3752
cat("\nShapiro-Wilk test for anova_data Expenditure:\n")

Shapiro-Wilk test for anova_data Expenditure:
print(shapiro.test(anova_data$Expenditure))

    Shapiro-Wilk normality test

data:  anova_data$Expenditure
W = 0.53354, p-value < 2.2e-16

Interpretation

For regression_data, the Shapiro-Wilk test does not reject normality of Expenditure (W = 0.88828, p = 0.3752), although with only four observations the test has very little power. For anova_data, the test strongly rejects normality (W = 0.53354, p < 2.2e-16). The normality assumption is therefore violated for the ANOVA data, supporting the use of non-parametric methods like the Kruskal-Wallis test for group comparisons in the ANOVA section.
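
Strictly, the normality assumption in linear regression concerns the model residuals rather than the raw response. As a hedged illustration, the sketch below applies the Shapiro-Wilk test to both, using the rounded values printed in the Dataset Overview (re-entered under the hypothetical name regression_values). With four observations neither result carries much weight.

```r
# Rounded values copied from the printed regression_data table
regression_values <- data.frame(
  Expenditure     = c(23424, 35644, 60791, 64140),
  Nights          = c(33560, 49944, 15105, 29556),
  MeanNightlyCost = c(88.5, 74.7, 75.4, 96.8)
)

# Test on the raw response, as done in the report
sw_response <- shapiro.test(regression_values$Expenditure)

# Test on the residuals of the fitted model, which is what the assumption refers to
fit <- lm(Expenditure ~ Nights + MeanNightlyCost, data = regression_values)
sw_resid <- shapiro.test(residuals(fit))

print(sw_response)
print(sw_resid)
```
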
set.seed(123)
sample_model <- lm(
  Expenditure ~ Nights + MeanNightlyCost,
  data = regression_data[sample(nrow(regression_data), size = min(5000, nrow(regression_data))), ]
)
par(mfrow = c(2, 2))
plot(sample_model)
bptest(reg_model)

    studentized Breusch-Pagan test

data:  reg_model
BP = 1.5992, df = 2, p-value = 0.4495
vif(reg_model)
         Nights MeanNightlyCost 
       1.010091        1.010091 

Interpretation

To validate the assumptions of linear regression, four diagnostic plots were analyzed:

  1. Residuals vs Fitted
    This plot checks for non-linearity and unequal variance (heteroscedasticity). The residuals do not appear to be randomly scattered around the horizontal line, and there’s visible funneling and structure. This indicates potential issues with non-linearity and heteroscedasticity.

  2. Q-Q Plot
    The Q-Q plot assesses normality of residuals. The heavy departure from the diagonal line, especially at the tails, suggests that the residuals may not be normally distributed, although the Shapiro-Wilk test on the four expenditure totals did not reject normality (p = 0.3752).

  3. Scale-Location Plot
    This plot checks for homoscedasticity (constant variance). The upward trend suggests that the variance of residuals increases with fitted values, indicating heteroscedasticity.

  4. Residuals vs Leverage
    This plot helps detect influential observations. A few points are far from the center, though not beyond the usual thresholds. However, some points may have high leverage and should be reviewed further if the model were to be optimized.

Conclusion
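The diagnostic plots suggest possible departures from the linear regression assumptions, but with only four observations they are hard to read reliably. The formal checks do not reject the assumptions: the Breusch-Pagan test finds no significant heteroscedasticity (BP = 1.5992, p = 0.4495), and variance inflation factors of about 1.01 indicate no multicollinearity between Nights and MeanNightlyCost. Conclusions drawn from the regression model should still be interpreted with caution because of the very small sample. For the group comparisons, the strongly non-normal values in anova_data justify the inclusion of non-parametric methods like Kruskal-Wallis for robustness.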
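
For a model with exactly two predictors, the variance inflation factor reported by vif(reg_model) reduces to 1 / (1 - r^2), where r is the correlation between the predictors. A minimal sketch in base R, assuming the rounded values printed in the Dataset Overview (re-entered under the hypothetical name regression_values):

```r
# Rounded predictor values copied from the printed regression_data table
regression_values <- data.frame(
  Nights          = c(33560, 49944, 15105, 29556),
  MeanNightlyCost = c(88.5, 74.7, 75.4, 96.8)
)

# Correlation between the two predictors
r_pred <- cor(regression_values$Nights, regression_values$MeanNightlyCost)

# With two predictors, both share the same VIF
vif_manual <- 1 / (1 - r_pred^2)

print(round(c(r_pred = r_pred, vif_manual = vif_manual), 3))
```

A value this close to 1 means the standard errors of the coefficients are not inflated by overlap between the predictors; their size reflects the small sample instead.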

Given the poor fit of the regression model, I proceed with an ANOVA to test whether categorical groupings, specifically the main reason for travel, significantly affect tourist expenditure.

ANOVA and Post Hoc Analysis

oneway.test(Expenditure ~ Reason, data = anova_data, var.equal = FALSE)

    One-way analysis of means (not assuming equal variances)

data:  Expenditure and Reason
F = 10.941, num df = 3.00, denom df = 267.02, p-value = 8.436e-07
kruskal.test(Expenditure ~ Reason, data = anova_data)

    Kruskal-Wallis rank sum test

data:  Expenditure by Reason
Kruskal-Wallis chi-squared = 28.304, df = 3, p-value = 3.136e-06

Interpretation:
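Both Welch’s ANOVA and the Kruskal-Wallis test were conducted to determine whether the main reason for travel significantly affects tourist expenditure. Welch’s ANOVA gives F(3, 267.02) = 10.941 (p = 8.436e-07), and the Kruskal-Wallis test gives chi-squared = 28.304 with 3 degrees of freedom (p = 3.136e-06), so both reject the null hypothesis of no difference between travel reasons.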

Conclusion:
There is strong evidence to suggest that tourists spend significantly different amounts depending on their main reason for visiting Ireland. This justifies further post hoc pairwise comparisons to identify which specific groups differ from one another.

Pairwise Wilcoxon test
# Pairwise Wilcoxon test with Bonferroni correction
pw <- pairwise.wilcox.test(anova_data$Expenditure, anova_data$Reason, p.adjust.method = "bonferroni")

# Convert to data frame and remove NA comparisons
pw_df <- as.data.frame(as.table(pw$p.value)) %>%
  filter(!is.na(Freq))
colnames(pw_df) <- c("Group 1", "Group 2", "Adjusted p-value")

knitr::kable(pw_df, caption = "Pairwise Wilcoxon Test Results (Bonferroni-adjusted)")
Pairwise Wilcoxon Test Results (Bonferroni-adjusted)
Group 1                       Group 2                       Adjusted p-value
Holiday/leisure/recreation    Business                      0.0016328
Other reason for journey      Business                      1.0000000
Visit to friends/relatives    Business                      0.0003017
Other reason for journey      Holiday/leisure/recreation    0.0040485
Visit to friends/relatives    Holiday/leisure/recreation    1.0000000
Visit to friends/relatives    Other reason for journey      0.0008118

Post Hoc Interpretation:
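The pairwise Wilcoxon test (adjusted using the Bonferroni method) was conducted to identify which specific travel reasons had significantly different expenditure levels. Four of the six comparisons are significant at the 5% level: Holiday/leisure/recreation vs Business (p = 0.0016), Visit to friends/relatives vs Business (p = 0.0003), Other reason for journey vs Holiday/leisure/recreation (p = 0.0040), and Visit to friends/relatives vs Other reason for journey (p = 0.0008). Business vs Other reason for journey and Holiday/leisure/recreation vs Visit to friends/relatives do not differ significantly (adjusted p = 1.0000).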

These results provide a clearer picture of how tourist expenditure varies by purpose of visit. The strongest contrasts were seen between business travelers and both leisure and family-visit segments.
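
The adjusted p-values of exactly 1 in the table are a consequence of how the Bonferroni method works: each raw p-value is multiplied by the number of comparisons and capped at 1. A minimal sketch with made-up raw p-values (they are not the unadjusted values from this analysis):

```r
# Hypothetical raw p-values for six pairwise comparisons
raw_p <- c(0.00005, 0.0002, 0.0005, 0.001, 0.3, 0.9)

# Bonferroni adjustment as applied by p.adjust()
adj_p <- p.adjust(raw_p, method = "bonferroni")

# Equivalent manual calculation: multiply by the number of tests, cap at 1
manual <- pmin(1, raw_p * length(raw_p))

stopifnot(isTRUE(all.equal(adj_p, manual)))
print(data.frame(raw_p, adj_p))
```

Bonferroni is conservative, so comparisons that remain significant after the adjustment are unlikely to be artefacts of running six tests.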

Conclusion

This report aimed to investigate two key questions:
1. Do tourists who stay longer or pay more per night tend to spend more overall?
2. Does the reason for travel significantly influence tourist expenditure?

To address these questions, I employed both multiple linear regression and ANOVA-based inferential statistics. The regression analysis found that neither the number of nights (p = 0.588) nor the average nightly cost (p = 0.923) was a statistically significant predictor, and the adjusted R² was negative (-0.8621). With only four residency groups, these numeric variables are insufficient for predicting total expenditure.

Given the weak predictive power of the regression model and its very small sample, I proceeded with ANOVA to evaluate the influence of a categorical variable, the main reason for travel. Both Welch’s ANOVA and the Kruskal-Wallis test demonstrated significant differences in expenditure across travel purposes. The follow-up pairwise Wilcoxon tests confirmed that tourists traveling for holidays or to visit friends/relatives differ significantly from business travelers and from those visiting for other reasons, with the group means indicating higher spending for the holiday and family-visit groups.

In conclusion, the second research question, whether travel purpose impacts expenditure, is strongly supported by the data. The first question, concerning continuous predictors like trip duration and nightly cost, is not supported: no statistically significant relationship with expenditure was found. These insights highlight the importance of considering categorical behavioral factors, such as travel purpose, when analyzing tourist spending patterns.

Limitations of the Study

Recommendations and Further Research

This study offers evidence that tourist spending patterns vary by purpose, and that longer stays do not always imply higher total expenditure.

References

Central Statistics Office (2025) Inbound Tourism February 2025 – Data and Results. Available at: https://www.cso.ie/en/releasesandpublications/ep/p-ibt/inboundtourismfebruary2025/.