Introduction
This report investigates key factors that influence tourist spending in Ireland, using the latest release from the Central Statistics Office (CSO). The guiding research question is:
What factors significantly impact the total expenditure of inbound tourists in Ireland?
Research Question
Tourism is vital to Ireland’s economy, and understanding what drives visitor expenditure can help improve marketing, resource allocation, and policy design.
This includes investigating:
Whether longer stays lead to higher overall spending
Whether higher nightly accommodation costs are linked to total expenditure
Whether reason for travel (e.g., holiday, business, visiting family) affects spending behavior
Understanding these factors can help policymakers and tourism boards design more effective strategies.
Scope of Analysis
This study employs both descriptive and inferential statistical techniques to explore the research question.
Descriptive analysis includes visual summaries and correlation matrices to understand the overall patterns in the data.
Inferential statistics are used to test hypotheses and draw conclusions:
Multiple linear regression is used to assess how continuous variables such as number of nights and nightly cost predict expenditure.
One-way ANOVA and non-parametric tests are used to determine if the purpose of travel leads to significant differences in expenditure.
Together, these methods help answer the central research questions and provide evidence-based insights into tourist spending behavior.
To formally guide the analysis, I set up the following hypotheses:
For Multiple Linear Regression:
Null Hypothesis: The predictors (Nights, MeanNightlyCost) do not significantly explain variation in Expenditure.
Alternative Hypothesis: At least one of the predictors significantly explains variation in Expenditure.
For One-Way ANOVA:
Null Hypothesis: There is no significant difference in mean expenditure across different travel reasons.
Alternative Hypothesis: At least one travel reason group has a significantly different mean expenditure.
This study uses official aggregated data to examine the relationships between tourist expenditure and three variables: duration of stay (nights), average nightly cost, and reason for travel. Multiple linear regression and ANOVA are employed, and their statistical assumptions are checked.
About Dataset
The dataset used was sourced from the Central Statistics Office of Ireland: https://www.cso.ie/en/releasesandpublications/ep/p-ibt/inboundtourismfebruary2025/data/
Seven CSV files (ITM01 to ITM07) were downloaded from the Central Statistics Office, each containing different aspects of inbound tourism.
Data Preparation and Merging
This section sets up the full dataset for analysis. I merge multiple Central Statistics Office (CSO) files to create a unified view of tourists’ expenditure patterns. This forms the foundation of my analysis.
# Load packages (tidyverse provides read_csv, dplyr, tidyr, and ggplot2)
library(tidyverse)
# Load datasets
itm01 <- read_csv("../Data/ITM01.csv")
itm02 <- read_csv("../Data/ITM02.csv")
itm03 <- read_csv("../Data/ITM03.csv")
itm04 <- read_csv("../Data/ITM04.csv")
itm05 <- read_csv("../Data/ITM05.csv")
itm06 <- read_csv("../Data/ITM06.csv")
itm07 <- read_csv("../Data/ITM07.csv")
# View structure of all datasets
glimpse(itm01); head(itm01)
Rows: 324
Columns: 8
$ STATISTIC <chr> "ITM01C01", "ITM01C01", "ITM01C01", "ITM01C01", "…
$ `Statistic Label` <chr> "Number of Passengers Departing Overseas", "Numbe…
$ `TLIST(M1)` <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023…
$ Month <chr> "2023 January", "2023 January", "2023 January", "…
$ C04187V04959 <chr> "10", "20", "30", "40", "50", "-", "10", "20", "3…
$ `Passenger Category` <chr> "Outbound Irish", "Same Day Visitor: Northern Iri…
$ UNIT <chr> "Thousand", "Thousand", "Thousand", "Thousand", "…
$ VALUE <dbl> 715.1, 49.8, 47.6, 17.3, 400.0, 1229.8, 770.9, 60…
# A tibble: 6 × 8
STATISTIC `Statistic Label` `TLIST(M1)` Month C04187V04959
<chr> <chr> <chr> <chr> <chr>
1 ITM01C01 Number of Passengers Departing Overs… 2023M01 2023… 10
2 ITM01C01 Number of Passengers Departing Overs… 2023M01 2023… 20
3 ITM01C01 Number of Passengers Departing Overs… 2023M01 2023… 30
4 ITM01C01 Number of Passengers Departing Overs… 2023M01 2023… 40
5 ITM01C01 Number of Passengers Departing Overs… 2023M01 2023… 50
6 ITM01C01 Number of Passengers Departing Overs… 2023M01 2023… -
# ℹ 3 more variables: `Passenger Category` <chr>, UNIT <chr>, VALUE <dbl>
Rows: 702
Columns: 8
$ STATISTIC <chr> "ITM02C01", "ITM02C01", "ITM02C01", "ITM02C01", "…
$ `Statistic Label` <chr> "Number of Overnight Trips by Foreign Visitors", …
$ `TLIST(M1)` <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023…
$ Month <chr> "2023 January", "2023 January", "2023 January", "…
$ C04188V04960 <chr> "XB", "BENLLU", "DKNDSEFI", "FR", "DE", "IT", "ES…
$ `Detailed Residency` <chr> "Great Britain (England, Scotland & Wales)", "Bel…
$ UNIT <chr> "Thousand", "Thousand", "Thousand", "Thousand", "…
$ VALUE <dbl> 155.9, 21.7, 4.2, 22.8, 25.2, 13.2, 27.0, 46.4, 1…
# A tibble: 6 × 8
STATISTIC `Statistic Label` `TLIST(M1)` Month C04188V04960
<chr> <chr> <chr> <chr> <chr>
1 ITM02C01 Number of Overnight Trips by Foreign… 2023M01 2023… XB
2 ITM02C01 Number of Overnight Trips by Foreign… 2023M01 2023… BENLLU
3 ITM02C01 Number of Overnight Trips by Foreign… 2023M01 2023… DKNDSEFI
4 ITM02C01 Number of Overnight Trips by Foreign… 2023M01 2023… FR
5 ITM02C01 Number of Overnight Trips by Foreign… 2023M01 2023… DE
6 ITM02C01 Number of Overnight Trips by Foreign… 2023M01 2023… IT
# ℹ 3 more variables: `Detailed Residency` <chr>, UNIT <chr>, VALUE <dbl>
Rows: 675
Columns: 8
$ STATISTIC <chr> "ITM03C01", "ITM03C01", "ITM03C01", "ITM03C01", "ITM…
$ `Statistic Label` <chr> "Number of Overnight Trips by Foreign Visitors", "Nu…
$ `TLIST(M1)` <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023M01…
$ Month <chr> "2023 January", "2023 January", "2023 January", "202…
$ C04189V04961 <chr> "XB", "OTHEUR3", "USCA", "OTHR1", "-", "XB", "OTHEUR…
$ Residency <chr> "Great Britain (England, Scotland & Wales)", "Other …
$ UNIT <chr> "Thousand", "Thousand", "Thousand", "Thousand", "Tho…
$ VALUE <dbl> 155.9, 160.4, 51.1, 32.6, 400.0, 148.3, 128.9, 41.5,…
# A tibble: 6 × 8
STATISTIC `Statistic Label` `TLIST(M1)` Month C04189V04961 Residency UNIT
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ITM03C01 Number of Overnight … 2023M01 2023… XB Great Br… Thou…
2 ITM03C01 Number of Overnight … 2023M01 2023… OTHEUR3 Other Eu… Thou…
3 ITM03C01 Number of Overnight … 2023M01 2023… USCA USA & Ca… Thou…
4 ITM03C01 Number of Overnight … 2023M01 2023… OTHR1 Other Re… Thou…
5 ITM03C01 Number of Overnight … 2023M01 2023… - All Resi… Thou…
6 ITM03C01 Number of Overnight … 2023M02 2023… XB Great Br… Thou…
# ℹ 1 more variable: VALUE <dbl>
Rows: 675
Columns: 8
$ STATISTIC <chr> "ITM04C01", "ITM04C01", "ITM04C01", "ITM04C01…
$ `Statistic Label` <chr> "Number of Overnight Trips by Foreign Visitor…
$ `TLIST(M1)` <chr> "2023M01", "2023M01", "2023M01", "2023M01", "…
$ Month <chr> "2023 January", "2023 January", "2023 January…
$ C02118V02559 <chr> "3", "1", "2", "4", "-", "3", "1", "2", "4", …
$ `Main Reason for Travel` <chr> "Business", "Holiday/leisure/recreation", "Vi…
$ UNIT <chr> "Thousand", "Thousand", "Thousand", "Thousand…
$ VALUE <dbl> 50.1, 107.9, 217.0, 25.0, 400.0, 64.0, 98.3, …
# A tibble: 6 × 8
STATISTIC `Statistic Label` `TLIST(M1)` Month C02118V02559
<chr> <chr> <chr> <chr> <chr>
1 ITM04C01 Number of Overnight Trips by Foreign… 2023M01 2023… 3
2 ITM04C01 Number of Overnight Trips by Foreign… 2023M01 2023… 1
3 ITM04C01 Number of Overnight Trips by Foreign… 2023M01 2023… 2
4 ITM04C01 Number of Overnight Trips by Foreign… 2023M01 2023… 4
5 ITM04C01 Number of Overnight Trips by Foreign… 2023M01 2023… -
6 ITM04C01 Number of Overnight Trips by Foreign… 2023M02 2023… 3
# ℹ 3 more variables: `Main Reason for Travel` <chr>, UNIT <chr>, VALUE <dbl>
Rows: 810
Columns: 8
$ STATISTIC <chr> "ITM05C01", "ITM05C01", "ITM05C01", "ITM05C0…
$ `Statistic Label` <chr> "Number of Overnight Trips by Foreign Visito…
$ `TLIST(M1)` <chr> "2023M01", "2023M01", "2023M01", "2023M01", …
$ Month <chr> "2023 January", "2023 January", "2023 Januar…
$ C02164V02610 <chr> "93", "92", "30", "94", "225", "-", "93", "9…
$ `Main Accommodation Type` <chr> "Hotel/conference centre", "Guest house/bed …
$ UNIT <chr> "Thousand", "Thousand", "Thousand", "Thousan…
$ VALUE <dbl> 121.0, 8.5, 243.4, 13.8, 13.4, 400.0, 142.9,…
# A tibble: 6 × 8
STATISTIC `Statistic Label` `TLIST(M1)` Month C02164V02610
<chr> <chr> <chr> <chr> <chr>
1 ITM05C01 Number of Overnight Trips by Foreign… 2023M01 2023… 93
2 ITM05C01 Number of Overnight Trips by Foreign… 2023M01 2023… 92
3 ITM05C01 Number of Overnight Trips by Foreign… 2023M01 2023… 30
4 ITM05C01 Number of Overnight Trips by Foreign… 2023M01 2023… 94
5 ITM05C01 Number of Overnight Trips by Foreign… 2023M01 2023… 225
6 ITM05C01 Number of Overnight Trips by Foreign… 2023M01 2023… -
# ℹ 3 more variables: `Main Accommodation Type` <chr>, UNIT <chr>, VALUE <dbl>
Rows: 2,025
Columns: 10
$ STATISTIC <chr> "ITM06C01", "ITM06C01", "ITM06C01", "ITM06C01", "IT…
$ `Statistic Label` <chr> "Expenditure of Overnight Foreign Visitors", "Expen…
$ `TLIST(M1)` <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023M0…
$ Month <chr> "2023 January", "2023 January", "2023 January", …
$ C04189V04961 <chr> "XB", "XB", "XB", "XB", "XB", "OTHEUR3", "OTHEUR3",…
$ Residency <chr> "Great Britain (England, Scotland & Wales)", "Great…
$ C04190V04962 <chr> "10", "20", "30", "40", "-", "10", "20", "30", "40"…
$ `Expenditure Type` <chr> "Fare", "Prepayments", "Accommodation", "Day-to-Day…
$ UNIT <chr> "Euro Million", "Euro Million", "Euro Million", "Eu…
$ VALUE <dbl> 19.6, 1.0, 13.5, 45.1, 79.2, 22.7, 0.5, 28.5, 58.9,…
# A tibble: 6 × 10
STATISTIC `Statistic Label` `TLIST(M1)` Month C04189V04961 Residency
<chr> <chr> <chr> <chr> <chr> <chr>
1 ITM06C01 Expenditure of Overnight F… 2023M01 2023… XB Great Br…
2 ITM06C01 Expenditure of Overnight F… 2023M01 2023… XB Great Br…
3 ITM06C01 Expenditure of Overnight F… 2023M01 2023… XB Great Br…
4 ITM06C01 Expenditure of Overnight F… 2023M01 2023… XB Great Br…
5 ITM06C01 Expenditure of Overnight F… 2023M01 2023… XB Great Br…
6 ITM06C01 Expenditure of Overnight F… 2023M01 2023… OTHEUR3 Other Eu…
# ℹ 4 more variables: C04190V04962 <chr>, `Expenditure Type` <chr>, UNIT <chr>,
# VALUE <dbl>
Rows: 270
Columns: 8
$ STATISTIC <chr> "ITM07C01", "ITM07C01", "ITM07C01", "ITM07C01", "ITM…
$ `Statistic Label` <chr> "Mean Nightly Accommodation Costs of Overnight Forei…
$ `TLIST(M1)` <chr> "2023M01", "2023M01", "2023M01", "2023M01", "2023M01…
$ Monthly <chr> "2023 January", "2023 January", "2023 January", "202…
$ C04189V04961 <chr> "XB", "OTHEUR3", "USCA", "OTHR1", "-", "XB", "OTHEUR…
$ Residency <chr> "Great Britain (England, Scotland & Wales)", "Other …
$ UNIT <chr> "Euro", "Euro", "Euro", "Euro", "Euro", "Euro", "Eur…
$ VALUE <dbl> 94.0, 57.0, 88.0, 51.0, 73.0, 91.0, 58.0, 93.0, 58.0…
# A tibble: 6 × 8
STATISTIC `Statistic Label` `TLIST(M1)` Monthly C04189V04961 Residency UNIT
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ITM07C01 Mean Nightly Accom… 2023M01 2023 J… XB Great Br… Euro
2 ITM07C01 Mean Nightly Accom… 2023M01 2023 J… OTHEUR3 Other Eu… Euro
3 ITM07C01 Mean Nightly Accom… 2023M01 2023 J… USCA USA & Ca… Euro
4 ITM07C01 Mean Nightly Accom… 2023M01 2023 J… OTHR1 Other Re… Euro
5 ITM07C01 Mean Nightly Accom… 2023M01 2023 J… - All Resi… Euro
6 ITM07C01 Mean Nightly Accom… 2023M02 2023 F… XB Great Br… Euro
# ℹ 1 more variable: VALUE <dbl>
# Prepare data for merging
expenditure <- itm06 %>%
filter(`Expenditure Type` == "All Travel Expenditure") %>%
group_by(Residency) %>%
summarise(Expenditure = sum(VALUE))
nights <- itm03 %>%
filter(Residency != "All Residencies") %>%
group_by(Residency) %>%
summarise(Nights = sum(VALUE))
accommodation <- itm07 %>%
filter(Residency != "All Residencies") %>%
group_by(Residency) %>%
summarise(MeanNightlyCost = mean(VALUE))
# Merging datasets for Regression
regression_data <- expenditure %>%
inner_join(nights, by = "Residency") %>%
inner_join(accommodation, by = "Residency")
# Merging datasets for ANOVA
anova_data <- itm04 %>%
  filter(`Main Reason for Travel` != "All reasons for journey") %>%
  rename(Reason = `Main Reason for Travel`, Expenditure = VALUE)
Dataset Overview
# A tibble: 4 × 4
Residency Expenditure Nights MeanNightlyCost
<chr> <dbl> <dbl> <dbl>
1 Great Britain (England, Scotland & Wales) 23424. 33560. 88.5
2 Other Europe (3) 35644. 49944. 74.7
3 Other Residencies 60791. 15105. 75.4
4 USA & Canada 64140. 29556. 96.8
# A tibble: 6 × 8
STATISTIC `Statistic Label` `TLIST(M1)` Month C02118V02559 Reason UNIT
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ITM04C01 Number of Overnight Tri… 2023M01 2023… 3 Busin… Thou…
2 ITM04C01 Number of Overnight Tri… 2023M01 2023… 1 Holid… Thou…
3 ITM04C01 Number of Overnight Tri… 2023M01 2023… 2 Visit… Thou…
4 ITM04C01 Number of Overnight Tri… 2023M01 2023… 4 Other… Thou…
5 ITM04C01 Number of Overnight Tri… 2023M02 2023… 3 Busin… Thou…
6 ITM04C01 Number of Overnight Tri… 2023M02 2023… 1 Holid… Thou…
# ℹ 1 more variable: Expenditure <dbl>
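The tables printed above carry a STATISTIC code and a UNIT on every row, and the preparation step sums VALUE within each group. A quick look at which statistic and unit combinations a table holds can confirm that a sum does not mix units. The sketch below uses a small hypothetical excerpt: the second statistic code, its unit, and its values are invented for illustration, and base R is used so that it runs without extra packages.

```r
# Hypothetical excerpt in the shape of a CSO table. "ITM03C02", "Number",
# and the values 7.5 and 9.0 are invented for illustration only.
toy <- data.frame(
  STATISTIC = c("ITM03C01", "ITM03C01", "ITM03C02", "ITM03C02"),
  UNIT      = c("Thousand", "Thousand", "Number", "Number"),
  Residency = c("USA & Canada", "Other Residencies",
                "USA & Canada", "Other Residencies"),
  VALUE     = c(51.1, 32.6, 7.5, 9.0),
  stringsAsFactors = FALSE
)

# List each statistic/unit pair present in the table.
stat_units <- unique(toy[, c("STATISTIC", "UNIT")])
print(stat_units)

# Keep a single statistic before summing so the total has one unit.
one_stat <- toy[toy$STATISTIC == "ITM03C01", ]
totals <- aggregate(VALUE ~ Residency, data = one_stat, FUN = sum)
print(totals)
```

If more than one pair is listed, filtering on STATISTIC before group_by() and summarise() keeps each aggregate in a single unit.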
Before analysis, both datasets (regression_data and
anova_data) are checked for null values, distribution
shapes, and potential outliers. These steps ensure data quality and help
shape interpretation.
Missing values in regression_data:
Residency Expenditure Nights MeanNightlyCost
0 0 0 0
Missing values in anova_data:
STATISTIC Statistic Label TLIST(M1) Month C02118V02559
0 0 0 0 0
Reason UNIT Expenditure
0 0 0
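The code behind the missing-value counts is not shown. One way to produce counts of this shape is `colSums(is.na(...))`; the sketch below applies it to a copy of regression_data rebuilt from the rounded values in the Dataset Overview table, which is an assumption about how the check was done rather than the original code.

```r
# regression_data rebuilt from the rounded values printed in the overview.
regression_data <- data.frame(
  Residency = c("Great Britain (England, Scotland & Wales)",
                "Other Europe (3)", "Other Residencies", "USA & Canada"),
  Expenditure = c(23424, 35644, 60791, 64140),
  Nights = c(33560, 49944, 15105, 29556),
  MeanNightlyCost = c(88.5, 74.7, 75.4, 96.8)
)

# Count missing values per column.
missing_counts <- colSums(is.na(regression_data))
print(missing_counts)
```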
# Reshape both for plotting
reg_long <- regression_data %>%
pivot_longer(cols = c(Expenditure, Nights, MeanNightlyCost), names_to = "Variable", values_to = "Value")
anova_long <- anova_data %>%
  pivot_longer(cols = c(Expenditure), names_to = "Variable", values_to = "Value")
# Outlier detection
ggplot(reg_long, aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal() +
  labs(title = "Outlier Detection in Regression Data")
ggplot(anova_long, aes(x = Variable, y = Value)) +
geom_boxplot(fill = "salmon") +
theme_minimal() +
labs(title = "Outlier Detection in ANOVA Data")
There are no null values in regression_data and
anova_data. While some data points may appear as outliers,
they are retained as they likely represent real differences in spending
behavior — especially among high-expenditure tourist groups.
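Boxplots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. As a minimal sketch, the same rule can be applied numerically to the four Expenditure totals from the Dataset Overview table (rounded values, so results are approximate).

```r
# Expenditure totals copied from the overview table (rounded).
expenditure <- c(23424, 35644, 60791, 64140)

# Quartiles and interquartile range.
q <- unname(quantile(expenditure, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences would be drawn as outlier points.
flagged <- expenditure[expenditure < lower_fence | expenditure > upper_fence]
print(flagged)
```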
This analysis is based on a merged dataset that includes data for inbound tourists from various countries. After cleaning and filtering, the final dataset contains:
n_obs <- nrow(regression_data)
n_vars <- ncol(regression_data)
glue::glue("The dataset consists of {n_obs} observations and {n_vars} variables.")
The dataset consists of 4 observations and 4 variables.
Descriptive Statistics
# Descriptive stats for regression_data
regression_data %>%
select(Expenditure, Nights, MeanNightlyCost) %>%
  summary()
 Expenditure Nights MeanNightlyCost
Min. :23424 Min. :15105 Min. :74.71
1st Qu.:32589 1st Qu.:25943 1st Qu.:75.25
Median :48217 Median :31558 Median :81.98
Mean :46000 Mean :32041 Mean :83.86
3rd Qu.:61628 3rd Qu.:37656 3rd Qu.:90.60
Max. :64140 Max. :49944 Max. :96.77
Expenditure
Min. : 3.90
1st Qu.: 11.28
Median : 35.75
Mean : 237.13
3rd Qu.: 182.62
Max. :3116.40
Expenditure and nights vs residency
# Expenditure and nights vs residency
regression_data_long <- regression_data %>%
pivot_longer(cols = c(Expenditure, Nights), names_to = "Metric", values_to = "Value")
ggplot(regression_data_long, aes(x = Residency, y = Value, fill = Metric)) +
geom_col(position = "dodge") +
labs(title = "Comparison of Expenditure and Nights by Residency",
x = "Tourist Origin (Residency)", y = "Value (Million € or Nights)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation
This grouped bar chart compares total expenditure and total nights stayed across different residency groups visiting Ireland.
USA & Canada shows the highest total expenditure, indicating that tourists from this region are among the most economically impactful, despite not having the longest stays.
Other Europe (3) has a high number of nights stayed, suggesting longer trips on average, but lower spending compared to USA & Canada. This may indicate more budget-friendly travel behavior.
Other Residencies exhibit high expenditure but fewer nights, possibly reflecting shorter but more premium trips.
Great Britain (England, Scotland & Wales) has the lowest total expenditure despite the second-highest number of nights, likely due to frequent, short-distance travel with less spending per trip.
Overall, the chart highlights that tourist origin affects both travel duration and spending behavior, emphasizing the value of segmenting travel strategies based on residency.
Average expenditure by travel reason
# Average expenditure by travel reason
anova_data %>%
group_by(Reason) %>%
summarise(MeanExpenditure = mean(Expenditure, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(Reason, MeanExpenditure), y = MeanExpenditure, fill = Reason)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Average Expenditure by Reason for Travel",
x = "Reason for Travel", y = "Mean Expenditure (Million €)") +
theme_minimal()
Interpretation
This horizontal bar chart illustrates the average spending of tourists based on their main reason for visiting Ireland.
Tourists traveling to visit friends or relatives spend the most on average, indicating strong economic value from personal or family-related visits.
Those visiting for holiday/leisure/recreation are the second highest spenders, reinforcing the importance of tourism campaigns targeting relaxation and entertainment.
Business travelers and those with other reasons for journey exhibit notably lower average expenditure, possibly due to shorter stays or stricter travel budgets.
This analysis supports the ANOVA results, showing that travel purpose is a key driver of expenditure, with leisure and family visits leading in spending behavior.
Share of Total Expenditure by Tourist Origin
# Share of Total Expenditure by Tourist Origin
regression_data %>%
group_by(Residency) %>%
summarise(TotalExpenditure = sum(Expenditure, na.rm = TRUE)) %>%
ggplot(aes(x = "", y = TotalExpenditure, fill = Residency)) +
geom_col(width = 1) +
coord_polar(theta = "y") +
theme_void() +
labs(title = "Proportion of Total Tourist Expenditure by Origin") +
theme(legend.position = "right")
Interpretation
This pie chart illustrates how tourist spending in Ireland is distributed across different regions of origin (residency).
USA & Canada accounts for the largest share of total expenditure, highlighting their economic importance despite longer travel distance.
Other Residencies (non-European) contribute nearly as much, suggesting strong spending by tourists from diverse global locations.
Other Europe (3) also represents a significant portion, reflecting steady regional tourism.
Great Britain contributes the smallest share, which may be due to shorter, more frequent trips with lower per-visit spending.
This distribution supports the need for tailored strategies that recognize where the highest-value tourists are coming from, particularly from North America and global “Other Residencies”.
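The shares behind the pie chart can also be read as numbers. The sketch below computes each residency group's share from the rounded Expenditure totals in the Dataset Overview table, so the percentages are approximate.

```r
# Expenditure totals copied from the overview table (rounded).
expenditure <- c(
  "Great Britain (England, Scotland & Wales)" = 23424,
  "Other Europe (3)" = 35644,
  "Other Residencies" = 60791,
  "USA & Canada" = 64140
)

# Each group's share of the combined total, as a proportion.
shares <- expenditure / sum(expenditure)

# Percentages, largest first.
print(round(100 * sort(shares, decreasing = TRUE), 1))
```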
cor_matrix <- regression_data %>%
select(Expenditure, Nights, MeanNightlyCost) %>%
mutate(across(everything(), as.numeric)) %>%
cor(use = "complete.obs")
knitr::kable(round(cor_matrix, 2), caption = "Correlation Matrix")
| | Expenditure | Nights | MeanNightlyCost |
|---|---|---|---|
| Expenditure | 1.00 | -0.61 | 0.16 |
| Nights | -0.61 | 1.00 | -0.10 |
| MeanNightlyCost | 0.16 | -0.10 | 1.00 |
# Simple correlation heatmap using ggplot2
library(reshape2)
cor_long <- melt(cor_matrix)
ggplot(cor_long, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "red", high = "blue", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
geom_text(aes(label = round(value, 2)), size = 4) +
theme_minimal() +
labs(title = "Correlation Heatmap", x = "", y = "") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
Interpretation
The correlation matrix shows a moderate negative correlation between Expenditure and Nights (-0.61) and a weak positive correlation between Expenditure and MeanNightlyCost (0.16). Nights and MeanNightlyCost are almost uncorrelated (-0.10). Because these correlations are computed from only four residency groups, they are imprecise, and neither duration of stay nor average cost per night alone can be shown to explain variation in tourist expenditure. This foreshadows the non-significant results seen later in the regression models.
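To judge how much weight the coefficients in the correlation table can bear, each one can be paired with a significance test. The sketch below runs `cor.test()` on a copy of regression_data rebuilt from the rounded Dataset Overview values; it is an illustration under that assumption, not part of the original analysis.

```r
# regression_data rebuilt from the rounded values printed in the overview.
regression_data <- data.frame(
  Expenditure = c(23424, 35644, 60791, 64140),
  Nights = c(33560, 49944, 15105, 29556),
  MeanNightlyCost = c(88.5, 74.7, 75.4, 96.8)
)

# Pearson correlation tests for Expenditure against each predictor.
ct_nights <- cor.test(regression_data$Expenditure, regression_data$Nights)
ct_cost <- cor.test(regression_data$Expenditure, regression_data$MeanNightlyCost)

# Estimates and p-values side by side.
print(data.frame(
  pair = c("Expenditure ~ Nights", "Expenditure ~ MeanNightlyCost"),
  r = unname(c(ct_nights$estimate, ct_cost$estimate)),
  p_value = c(ct_nights$p.value, ct_cost$p.value)
))
```

With four rows, each test has only two degrees of freedom, so even a coefficient of moderate size is compatible with no underlying correlation.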
Relationship between Nightly Cost and Expenditure
To formally test whether average nightly accommodation cost is associated with total expenditure, I perform a simple linear regression:
Null Hypothesis: There is no association between MeanNightlyCost and Expenditure.
Alternative Hypothesis: There is a significant association between MeanNightlyCost and Expenditure.
Call:
lm(formula = Expenditure ~ MeanNightlyCost, data = regression_data)
Residuals:
1 2 3 4
-23919 -7726 17214 14431
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21905.2 108529.2 0.202 0.859
MeanNightlyCost 287.3 1286.3 0.223 0.844
Residual standard error: 23840 on 2 degrees of freedom
Multiple R-squared: 0.02434, Adjusted R-squared: -0.4635
F-statistic: 0.04989 on 1 and 2 DF, p-value: 0.844
Interpretation
The regression model shows no statistically significant association between MeanNightlyCost and Expenditure (p = 0.844), so the null hypothesis is not rejected. The estimated coefficient (287.3) suggests that a 1 euro increase in average nightly cost is associated with an increase of about 287 in total expenditure, but its standard error (1286.3) is more than four times the estimate. The model’s R-squared value is 0.02434, indicating that nightly cost explains about 2.4% of the variation in tourist expenditure, and the adjusted R-squared is negative (-0.4635). With only four observations, nightly accommodation cost alone is a poor predictor of total expenditure.
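The call that produced the output above is shown only in the `Call:` line. As a minimal sketch, the same model can be refitted on a copy of regression_data rebuilt from the rounded Dataset Overview values; because of the rounding, the estimates differ slightly from the printed ones.

```r
# regression_data rebuilt from the rounded values printed in the overview.
regression_data <- data.frame(
  Expenditure = c(23424, 35644, 60791, 64140),
  Nights = c(33560, 49944, 15105, 29556),
  MeanNightlyCost = c(88.5, 74.7, 75.4, 96.8)
)

# Simple linear regression of Expenditure on MeanNightlyCost.
fit_simple <- lm(Expenditure ~ MeanNightlyCost, data = regression_data)
s_simple <- summary(fit_simple)

# Slope p-value and R-squared.
p_slope <- coef(s_simple)["MeanNightlyCost", "Pr(>|t|)"]
r2_simple <- s_simple$r.squared
print(c(p_slope = p_slope, r_squared = r2_simple))
```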
Nights and MeanNightlyCost as Predictors of Expenditure
Null Hypothesis: The predictors (Nights, MeanNightlyCost) do not significantly explain variation in Expenditure.
Alternative Hypothesis: At least one of the predictors significantly explains variation in Expenditure.
Call:
lm(formula = Expenditure ~ Nights + MeanNightlyCost, data = regression_data)
Residuals:
1 2 3 4
-22153 6007 2338 13808
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.753e+04 1.312e+05 0.439 0.737
Nights -8.235e-01 1.089e+00 -0.756 0.588
MeanNightlyCost 1.771e+02 1.458e+03 0.121 0.923
Residual standard error: 26890 on 1 degrees of freedom
Multiple R-squared: 0.3793, Adjusted R-squared: -0.8621
F-statistic: 0.3055 on 2 and 1 DF, p-value: 0.7879
Interpretation
This multiple linear regression model investigates whether the number of nights (Nights) and the average nightly cost (MeanNightlyCost) can jointly predict total tourist expenditure.
Intercept (57,530): When both Nights and MeanNightlyCost are zero, the predicted expenditure is approximately 57,530. This is not meaningful in practice and is not significantly different from zero (p = 0.737); it acts as the baseline of the model.
Nights Coefficient (-0.8235): The coefficient for Nights is negative but not statistically significant (p = 0.588). Taken at face value, each additional unit of Nights is associated with a decrease of about 0.82 in expenditure, but the standard error (1.089) is larger than the estimate, so the sign cannot be determined from these data.
MeanNightlyCost Coefficient (177.1): The coefficient for MeanNightlyCost is positive but not statistically significant (p = 0.923), with a standard error (1,458) roughly eight times the estimate.
Model Significance (F-statistic = 0.3055 on 2 and 1 DF, p = 0.7879): The model is not statistically significant overall, so the null hypothesis that the predictors do not explain variation in Expenditure is not rejected.
Model Fit (R-squared = 0.3793, Adjusted R-squared = -0.8621): The model nominally explains about 38% of the variance, but with three estimated coefficients and four observations only one residual degree of freedom remains. The negative adjusted R-squared shows that the fit is no better than would be expected by chance.
In summary, neither predictor is statistically significant, and the four-row dataset is too small to support conclusions about either one. This reinforces the importance of exploring other variables (like reason for travel) in later sections.
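The regression output above reports 1 residual degree of freedom, which follows from fitting three coefficients to four rows. The sketch below refits the model on a copy of regression_data rebuilt from the rounded Dataset Overview values and extracts the residual degrees of freedom and the overall F-test p-value; estimates differ slightly from the printed ones because of rounding.

```r
# regression_data rebuilt from the rounded values printed in the overview.
regression_data <- data.frame(
  Expenditure = c(23424, 35644, 60791, 64140),
  Nights = c(33560, 49944, 15105, 29556),
  MeanNightlyCost = c(88.5, 74.7, 75.4, 96.8)
)

# Multiple linear regression with both predictors.
fit_multi <- lm(Expenditure ~ Nights + MeanNightlyCost, data = regression_data)
s_multi <- summary(fit_multi)

# Residual degrees of freedom: rows minus estimated coefficients (4 - 3).
print(df.residual(fit_multi))

# Overall F-test p-value computed from the stored F statistic.
f <- s_multi$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
print(overall_p)
```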
Shapiro-Wilk Normality Test
To evaluate whether the expenditure data follows a normal distribution (a key assumption in regression and ANOVA), I conduct Shapiro-Wilk tests:
Shapiro-Wilk test for regression_data Expenditure:
set.seed(123)
n <- min(5000, nrow(regression_data))
sampled_exp <- sample(regression_data$Expenditure, n, replace = FALSE)
print(shapiro.test(sampled_exp))
Shapiro-Wilk normality test
data: sampled_exp
W = 0.88828, p-value = 0.3752
Shapiro-Wilk test for anova_data Expenditure:
Shapiro-Wilk normality test
data: anova_data$Expenditure
W = 0.53354, p-value < 2.2e-16
Interpretation
The Shapiro-Wilk results differ between the two datasets. For regression_data, the test does not reject normality of Expenditure (W = 0.88828, p = 0.3752), although with only four observations the test has very little power. For anova_data, the test rejects normality (W = 0.53354, p < 2.2e-16), indicating a significant deviation from a normal distribution. This supports the use of non-parametric methods like the Kruskal-Wallis test for group comparisons in the ANOVA section.
set.seed(123)
# Fit the model on a random sample of at most 5000 rows (all rows here)
sample_model <- lm(
  Expenditure ~ Nights + MeanNightlyCost,
  data = regression_data[sample(nrow(regression_data), size = min(5000, nrow(regression_data))), ]
)
par(mfrow = c(2, 2))
plot(sample_model)
studentized Breusch-Pagan test
data: reg_model
BP = 1.5992, df = 2, p-value = 0.4495
Nights MeanNightlyCost
1.010091 1.010091
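The two identical values printed above appear to be variance inflation factors (VIF). With exactly two predictors, both share one VIF, equal to 1 / (1 - R²) from regressing one predictor on the other. The base R sketch below reproduces this on a copy of regression_data rebuilt from the rounded Dataset Overview values; the original output was presumably produced with a package function that is not shown.

```r
# Predictor columns rebuilt from the rounded values printed in the overview.
regression_data <- data.frame(
  Nights = c(33560, 49944, 15105, 29556),
  MeanNightlyCost = c(88.5, 74.7, 75.4, 96.8)
)

# Auxiliary regression of one predictor on the other.
aux <- lm(Nights ~ MeanNightlyCost, data = regression_data)

# VIF = 1 / (1 - R-squared of the auxiliary regression).
vif_value <- 1 / (1 - summary(aux)$r.squared)
print(vif_value)
```

A value this close to 1 means the two predictors carry almost no shared linear information.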
Interpretation
To validate the assumptions of linear regression, four diagnostic plots and two formal checks were analyzed. Because the model is fitted to four observations and leaves one residual degree of freedom, the plots can show only coarse patterns.
Residuals vs Fitted
This plot checks for non-linearity and unequal variance (heteroscedasticity). Four points are too few to separate a systematic pattern from noise, so the plot neither confirms nor rules out these issues.
Q-Q Plot
The Q-Q plot assesses normality of residuals. With four residuals, departures from the diagonal line cannot be judged reliably. The Shapiro-Wilk test on regression_data Expenditure did not reject normality (p = 0.3752).
Scale-Location Plot
This plot checks for homoscedasticity (constant variance). The studentized Breusch-Pagan test gives no evidence of heteroscedasticity (BP = 1.5992, df = 2, p = 0.4495).
Residuals vs Leverage
This plot helps detect influential observations. With so few rows, every observation has high leverage, so each point strongly influences the fitted coefficients.
Conclusion
The formal checks do not detect violations of the regression assumptions: the Breusch-Pagan test is not significant, and the variance inflation factors (1.01 for both predictors) show no multicollinearity. However, these checks have little power with four observations, so conclusions drawn from the regression model should be interpreted with caution. The strong non-normality of anova_data Expenditure justifies the inclusion of non-parametric methods like Kruskal-Wallis for robustness in group comparisons.
Given the poor model fit and the very small sample behind the regression, I proceed with an ANOVA to test whether categorical groupings, specifically the main reason for travel, significantly affect tourist expenditure.
Null Hypothesis: There is no significant difference in mean expenditure across different travel reasons.
Alternative Hypothesis: At least one travel reason group has a significantly different mean expenditure.
One-way analysis of means (not assuming equal variances)
data: Expenditure and Reason
F = 10.941, num df = 3.00, denom df = 267.02, p-value = 8.436e-07
Kruskal-Wallis rank sum test
data: Expenditure by Reason
Kruskal-Wallis chi-squared = 28.304, df = 3, p-value = 3.136e-06
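The calls behind these two outputs are not shown. Output with these headings is what `oneway.test()` with `var.equal = FALSE` and `kruskal.test()` print in base R. The sketch below runs both on a small hypothetical data frame: the Expenditure values are invented for illustration and are not taken from anova_data.

```r
# Hypothetical data: four travel reasons with five invented values each.
toy <- data.frame(
  Reason = rep(c("Business", "Holiday/leisure/recreation",
                 "Visit to friends/relatives", "Other reason for journey"),
               each = 5),
  Expenditure = c(50, 64, 58, 71, 66,
                  108, 98, 130, 155, 170,
                  217, 190, 205, 240, 260,
                  25, 22, 30, 35, 28)
)

# Welch's one-way ANOVA: does not assume equal group variances.
welch <- oneway.test(Expenditure ~ Reason, data = toy, var.equal = FALSE)
print(welch)

# Kruskal-Wallis rank sum test: non-parametric alternative.
kw <- kruskal.test(Expenditure ~ Reason, data = toy)
print(kw)
```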
Interpretation:
Both Welch’s ANOVA and the Kruskal-Wallis test were conducted to
determine whether the main reason for travel significantly affects
tourist expenditure.
Welch’s ANOVA:
The test yielded an F-value of 10.94 with a p-value of 8.4e-07. This
result is statistically significant (p < 0.001), indicating that at
least one group mean (expenditure by reason) is significantly different
from the others. Welch’s version is used here because it does not assume
equal group variances.
Kruskal-Wallis Test:
The non-parametric Kruskal-Wallis test yielded a chi-squared value of
28.30 with a p-value of 3.1e-06. This result further confirms the
presence of statistically significant differences in expenditure across
different travel reasons. This test is more robust against the
violation of normality detected earlier in anova_data.
Conclusion:
There is strong evidence to suggest that tourists spend significantly
different amounts depending on their main reason for visiting Ireland.
This justifies further post hoc pairwise comparisons to identify which
specific groups differ from one another.
# Pairwise Wilcoxon test with Bonferroni correction
pw <- pairwise.wilcox.test(anova_data$Expenditure, anova_data$Reason, p.adjust.method = "bonferroni")
# Convert to data frame and remove NA comparisons
pw_df <- as.data.frame(as.table(pw$p.value)) %>%
filter(!is.na(Freq))
colnames(pw_df) <- c("Group 1", "Group 2", "Adjusted p-value")
knitr::kable(pw_df, caption = "Pairwise Wilcoxon Test Results (Bonferroni-adjusted)")
| Group 1 | Group 2 | Adjusted p-value |
|---|---|---|
| Holiday/leisure/recreation | Business | 0.0016328 |
| Other reason for journey | Business | 1.0000000 |
| Visit to friends/relatives | Business | 0.0003017 |
| Other reason for journey | Holiday/leisure/recreation | 0.0040485 |
| Visit to friends/relatives | Holiday/leisure/recreation | 1.0000000 |
| Visit to friends/relatives | Other reason for journey | 0.0008118 |
Post Hoc Interpretation:
The pairwise Wilcoxon test (adjusted using the Bonferroni method) was
conducted to identify which specific travel reasons had significantly
different expenditure levels.
Holiday/leisure/recreation vs Business:
A significant difference was found (p = 0.0016), indicating that
tourists traveling for holidays spend differently compared to those on
business trips.
Visit to friends/relatives vs Business:
Highly significant difference (p = 0.0003), suggesting that this group
also spends differently than business travelers.
Other reason for journey vs Holiday/leisure/recreation:
A significant difference was observed (p = 0.0040), highlighting
variability in expenditure behavior between these two groups.
Visit to friends/relatives vs Other reason for journey:
Also significant (p = 0.0008), reinforcing the distinct spending
patterns.
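Non-significant pairs (p = 1.0):
Other reason for journey vs Business and Visit to friends/relatives vs Holiday/leisure/recreation showed no significant difference after adjustment.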
These results provide a clearer picture of how tourist expenditure varies by purpose of visit. The strongest contrasts were seen between business travelers and both leisure and family-visit segments.
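The adjusted p-values of exactly 1.0 in the table follow from how the Bonferroni correction works: each raw p-value is multiplied by the number of comparisons (six pairs for four groups) and capped at 1. The sketch below shows this with `p.adjust()`; the raw p-values are hypothetical and chosen only to illustrate the arithmetic.

```r
# Hypothetical raw p-values for three of the six pairwise comparisons.
raw_p <- c(0.0005, 0.01, 0.2)

# Four groups give choose(4, 2) = 6 pairwise comparisons.
m <- choose(4, 2)

# Bonferroni: multiply by the number of comparisons, cap at 1.
adjusted <- p.adjust(raw_p, method = "bonferroni", n = m)
print(adjusted)
```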
This report aimed to investigate two key questions:
1. Do tourists who stay longer or pay more per night tend to spend more
overall?
2. Does the reason for travel significantly influence tourist
expenditure?
To address these questions, I employed both multiple linear regression and ANOVA-based inferential statistics. The regression analysis found that neither the number of nights (p = 0.588) nor the average nightly cost (p = 0.923) was a statistically significant predictor, and the overall model was not significant (F = 0.3055, p = 0.7879; Adjusted R² = -0.8621). Because the merged regression dataset contains only four residency groups, these numeric variables cannot be shown to predict total expenditure.
Given the weak predictive power of the regression model and its very small sample, I proceeded with ANOVA to evaluate the influence of categorical variables. Both Welch’s ANOVA and the Kruskal-Wallis test demonstrated significant differences in expenditure across travel purposes. The follow-up pairwise Wilcoxon tests confirmed that tourists traveling for holidays or to visit friends/relatives differ significantly from business travelers and those visiting for other reasons, and the group means indicate that they tend to spend more.
In conclusion, the second research question, whether travel purpose impacts expenditure, is strongly supported by the data. The first question, concerning continuous predictors like trip duration and nightly cost, shows no statistically significant relationship with expenditure. These insights highlight the importance of considering categorical behavioral factors, such as travel purpose, when analyzing tourist spending patterns.
This study offers evidence that tourist spending patterns vary by purpose, and that longer stays do not always imply higher total expenditure.
Central Statistics Office (2025) Inbound Tourism February 2025 – Data and Results. Available at: https://www.cso.ie/en/releasesandpublications/ep/p-ibt/inboundtourismfebruary2025/.