library(dplyr)
library(ggplot2)
library(GGally)
dataset <- read.csv("Train.csv", stringsAsFactors = TRUE)
# Party size per record (NA when either count is missing)
dataset$total_people <- dataset$total_male + dataset$total_female
The goal is to identify which aspects of tourism in Tanzania are the most profitable and therefore the most worthwhile to invest in.
summary(dataset)
## ID country age_group
## tour_0 : 1 UNITED STATES OF AMERICA: 695 1-24 : 624
## tour_10 : 1 UNITED KINGDOM : 533 25-44:2487
## tour_1000: 1 ITALY : 393 45-64:1391
## tour_1002: 1 FRANCE : 280 65+ : 307
## tour_1004: 1 ZIMBABWE : 274
## tour_1005: 1 KENYA : 235
## (Other) :4803 (Other) :2399
## travel_with total_female total_male
## :1114 Min. : 0.0000 Min. : 0.00
## Alone :1265 1st Qu.: 0.0000 1st Qu.: 1.00
## Children : 162 Median : 1.0000 Median : 1.00
## Friends/Relatives : 895 Mean : 0.9268 Mean : 1.01
## Spouse :1005 3rd Qu.: 1.0000 3rd Qu.: 1.00
## Spouse and Children: 368 Max. :49.0000 Max. :44.00
## NA's :3 NA's :5
## purpose main_activity
## Business : 671 Wildlife tourism :2259
## Leisure and Holidays :2840 Beach tourism :1025
## Meetings and Conference : 312 Hunting tourism : 457
## Other : 128 Conference tourism: 367
## Scientific and Academic : 87 Cultural tourism : 359
## Visiting Friends and Relatives: 633 Mountain climbing : 234
## Volunteering : 138 (Other) : 108
## info_source tour_arrangement
## Travel, agent, tour operator :1913 Independent :2570
## Friends, relatives :1635 Package Tour:2239
## others : 490
## Newspaper, magazines,brochures: 359
## Radio, TV, Web : 249
## Trade fair : 77
## (Other) : 86
## package_transport_int package_accomodation package_food package_transport_tz
## No :3357 No :2602 No :2748 No :2919
## Yes:1452 Yes:2207 Yes:2061 Yes:1890
##
##
##
##
##
## package_sightseeing package_guided_tour package_insurance night_mainland
## No :3319 No :3259 No :4079 Min. : 0.000
## Yes:1490 Yes:1550 Yes: 730 1st Qu.: 3.000
## Median : 6.000
## Mean : 8.488
## 3rd Qu.: 11.000
## Max. :145.000
##
## night_zanzibar payment_mode first_trip_tz
## Min. : 0.000 Cash :4172 No :1566
## 1st Qu.: 0.000 Credit Card : 622 Yes:3243
## Median : 0.000 Other : 8
## Mean : 2.304 Travellers Cheque: 7
## 3rd Qu.: 4.000
## Max. :61.000
##
## most_impressing total_cost
## Friendly People :1541 Min. : 49000
## Wildlife :1038 1st Qu.: 812175
## No comments : 743 Median : 3397875
## Wonderful Country, Landscape, Nature: 507 Mean : 8114389
## Good service : 365 3rd Qu.: 9945000
## : 313 Max. :99532875
## (Other) : 302
## total_people
## Min. : 0.000
## 1st Qu.: 1.000
## Median : 2.000
## Mean : 1.932
## 3rd Qu.: 2.000
## Max. :93.000
## NA's :8
The main diagonal of the correlogram below shows the distribution of each numerical variable in the dataset. These distributions, total_cost included, are skewed to the right (a long tail of large values: the maximum total_cost is 99,532,875 against a median of 3,397,875), so the median is a more representative summary than the mean. Taking each variable separately, half of the 4,809 paying tourists in the dataset (about 2,405) spend at most about 1,461 US dollars, stay at most 6 nights on the mainland and travel in a party of at most 2 people.
print(ggpairs(dataset[, c("total_female", "total_male", "night_mainland", "night_zanzibar", "total_cost", "total_people")], title = "Tanzania tourism correlogram"))
# Total cost in United States dollars (USD), using the conversion factor 0.00043
summary(dataset$total_cost)*0.00043
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.07 349.24 1461.09 3489.19 4276.35 42799.14
#Number of nights a tourist spent in Tanzania mainland
summary(dataset$night_mainland)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 6.000 8.488 11.000 145.000
#Number of nights a tourist spent in Zanzibar
summary(dataset$night_zanzibar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.304 4.000 61.000
#Total people
summary(dataset$total_people)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.000 2.000 1.932 2.000 93.000 8
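A quick way to see the skew numerically is to compare mean and median: a mean well above the median points to a long right tail. The following is a minimal sketch that uses only the summary values printed above, typed in by hand; the data frame name `skew_check` and its column names are illustrative and not part of the original analysis.

```r
# Mean and median copied by hand from the summary() output above
skew_check <- data.frame(
  variable = c("total_cost_usd", "night_mainland", "night_zanzibar", "total_people"),
  mean     = c(3489.19, 8.488, 2.304, 1.932),
  median   = c(1461.09, 6, 0, 2)
)

# A mean above the median is a rough indicator of right (positive) skew
skew_check$mean_above_median <- skew_check$mean > skew_check$median

print(skew_check)
```

By this rough measure total_cost and both night counts show a right skew. total_people does not, because most records contain only 1 or 2 people, although its maximum of 93 still indicates a long right tail.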
Spending statistics by country:
dataset %>%
group_by(country) %>%
summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
arrange(desc(sum))
## # A tibble: 105 × 6
## country sum total total_people mean median
## <fct> <dbl> <int> <dbl> <dbl> <dbl>
## 1 UNITED STATES OF AMERICA 8890832054. 695 1356 12792564. 7707375
## 2 UNITED KINGDOM 3808382503. 533 991 7145183. 3978000
## 3 ITALY 3762160276. 393 945 9572927. 7458750
## 4 FRANCE 3344495948. 280 730 11944628. 9077320
## 5 AUSTRALIA 2743131959. 186 311 14748021. 9903562.
## 6 SOUTH AFRICA 2594805035. 235 441 11041724. 3315000
## 7 GERMANY 2218351851. 223 523 9947766. 5884125
## 8 SPAIN 1699098956. 165 414 10297569. 6807990
## 9 CANADA 1462028554. 114 209 12824812. 5958712.
## 10 NETHERLANDS 1250217145. 112 282 11162653. 5552625
## # … with 95 more rows
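The same summary is computed for several grouping variables in this report, so it could be wrapped in a small helper. The sketch below is one possible way to do that with dplyr's `{{ }}` embracing; the function name `spend_summary` and the toy data frame are hypothetical and only illustrate the call. The toy values are made up and are not taken from Train.csv.

```r
library(dplyr)

# Hypothetical helper: one spending summary per level of a grouping column
spend_summary <- function(data, group_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(
      sum          = sum(total_cost),
      total        = n(),
      total_people = sum(total_people, na.rm = TRUE),
      mean         = mean(total_cost),
      median       = median(total_cost),
      .groups      = "drop"
    ) %>%
    arrange(desc(sum))
}

# Tiny made-up data frame, only to show the call (not real survey data)
toy <- data.frame(
  country      = c("A", "A", "B"),
  total_cost   = c(100, 300, 1000),
  total_people = c(1, NA, 2)
)

toy_summary <- spend_summary(toy, country)
print(toy_summary)
```

With the real data the call would look like `spend_summary(dataset, country)` or `spend_summary(dataset, age_group)`, assuming `dataset` has the `total_cost` and `total_people` columns created earlier.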
Spending statistics by age_group:
The boxplot and the summary table for the age groups show that total expenditure per tourist tends to rise with age: the 65+ group has the highest mean and median, followed by the 45-64 group, while the 1-24 and 25-44 groups are close to each other. Even so, the totals by age group show that the 45-64 and 25-44 groups spend more overall than the 65+ group, because they contain many more tourists, even though their medians are lower. Furthermore, the bar graph shows that the most popular main activity is “wildlife tourism”, followed by “beach tourism”.
dataset %>%
ggplot(aes(age_group,total_cost, fill=age_group))+
geom_boxplot()+
theme(axis.text.x=element_text(angle=-45, vjust=0.5,hjust=0))+
ggtitle("Age_group total_cost boxplot")
df <- dataset %>%
group_by(age_group, main_activity) %>%
summarise(total = n(), .groups = "drop_last")
df %>%
ggplot(aes(age_group, total, fill = main_activity)) +
geom_col() +
ggtitle("Total number of tourists by age_group and main_activity")
dataset %>%
group_by(age_group) %>%
summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
arrange(desc(sum))
## # A tibble: 4 × 6
## age_group sum total total_people mean median
## <fct> <dbl> <int> <dbl> <dbl> <dbl>
## 1 45-64 15371839260. 1391 3054 11050927. 5834400
## 2 25-44 14987099938. 2487 4548 6026176. 2486250
## 3 65+ 5284068284. 307 608 17211949. 12845625
## 4 1-24 3379088150. 624 1067 5415205. 2602275
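The relationship between group size and total spending can be checked directly from the table: a group's total is its number of records times its mean cost, so a large group with a modest mean can outweigh a small group with a high mean. The sketch below copies the table values by hand; the data frame name `age_stats` and the added columns `share` and `total_times_mean` are illustrative.

```r
# Values copied by hand from the age_group table above
age_stats <- data.frame(
  age_group = c("45-64", "25-44", "65+", "1-24"),
  sum       = c(15371839260, 14987099938, 5284068284, 3379088150),
  total     = c(1391, 2487, 307, 624),
  mean      = c(11050927, 6026176, 17211949, 5415205)
)

# Share of the overall spending contributed by each age group
age_stats$share <- age_stats$sum / sum(age_stats$sum)

# Number of records times mean cost should reproduce the group total
# (up to rounding of the printed mean)
age_stats$total_times_mean <- age_stats$total * age_stats$mean

print(age_stats)
```

On these figures the 45-64 and 25-44 groups each contribute just under 40% of total spending, while the 65+ group contributes roughly 14% despite having the highest mean.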
Spending statistics by travel_with:
The highest total spending comes from tourists travelling with a spouse, followed by those travelling with friends or relatives and then by those travelling with spouse and children. The last group has the highest mean and median spending, but it is much smaller:
dataset %>%
filter(travel_with!="") %>%
ggplot(aes(travel_with,total_cost, fill=travel_with))+
geom_boxplot()+
theme(axis.text.x=element_text(angle=-45, vjust=0.5,hjust=0))+
ggtitle("Travel_with total_cost boxplot")
dataset %>%
filter(travel_with!="") %>%
group_by(travel_with) %>%
summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
arrange(desc(sum))
## # A tibble: 5 × 6
## travel_with sum total total_people mean median
## <fct> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Spouse 12746305028. 1005 2023 12682891. 8623454
## 2 Friends/Relatives 9158699780. 895 2860 10233184. 4972500
## 3 Spouse and Children 6745753040. 368 1423 18330851. 14720606
## 4 Alone 4334079145. 1265 1278 3426150. 1491750
## 5 Children 1653502309. 162 514 10206804. 5687662.
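Because the groups differ in party size, it can also help to look at spending per person. The sketch below divides each group's total by its total_people, using values copied by hand from the table above. It rests on an assumption the source does not state: that the total_cost of a record covers everyone counted in total_people. Records with missing counts contribute cost but no people, so the figures are slightly overstated. The names `travel_stats`, `cost_per_person` and `ranked` are illustrative.

```r
# Values copied by hand from the travel_with table above
travel_stats <- data.frame(
  travel_with  = c("Spouse", "Friends/Relatives", "Spouse and Children",
                   "Alone", "Children"),
  sum          = c(12746305028, 9158699780, 6745753040, 4334079145, 1653502309),
  total_people = c(2023, 2860, 1423, 1278, 514)
)

# Assumption: total_cost of a record covers all people counted in total_people
travel_stats$cost_per_person <- travel_stats$sum / travel_stats$total_people

# Rank the groups from highest to lowest spending per person
ranked <- travel_stats[order(-travel_stats$cost_per_person), ]
print(ranked)
```

Under that assumption the Spouse group ranks first per person as well as in total.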
Spending statistics by purpose:
dataset %>%
group_by(purpose) %>%
summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
arrange(desc(sum))
## # A tibble: 7 × 6
## purpose sum total total_people mean median
## <fct> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Leisure and Holidays 33941224747. 2840 6370 1.20e7 7.29e6
## 2 Visiting Friends and Relatives 2019760962. 633 1093 3.19e6 8.29e5
## 3 Business 1196015966. 671 829 1.78e6 6.81e5
## 4 Meetings and Conference 765337095. 312 459 2.45e6 1.16e6
## 5 Volunteering 545177930. 138 205 3.95e6 2.65e6
## 6 Scientific and Academic 350783148. 87 149 4.03e6 1.16e6
## 7 Other 203795784. 128 172 1.59e6 4.65e5
Spending statistics by main_activity:
dataset %>%
group_by(main_activity) %>%
summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
arrange(desc(sum))
## # A tibble: 9 × 6
## main_activity sum total total_people mean median
## <fct> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Wildlife tourism 23934843156. 2259 4588 10595327. 5138250
## 2 Beach tourism 7712957936. 1025 2199 7524837. 3884500
## 3 Conference tourism 3782596917. 367 672 10306804. 6145308
## 4 Cultural tourism 1432818972. 359 683 3991139. 1657500
## 5 Hunting tourism 873476390. 457 628 1911327. 500000
## 6 business 471254513 58 117 8125078. 4889625
## 7 Mountain climbing 435908481. 234 309 1862857. 828750
## 8 Diving and Sport Fishing 222226440. 13 24 17094342. 3978000
## 9 Bird watching 156012828. 37 57 4216563. 663000
Spending statistics by info_source:
dataset %>%
group_by(info_source) %>%
summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
arrange(desc(sum))
## # A tibble: 8 × 6
## info_source sum total total_people mean median
## <fct> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Travel, agent, tour operator 25104817473. 1913 4216 1.31e7 7.96e6
## 2 Friends, relatives 7047420630. 1635 2824 4.31e6 1.56e6
## 3 Newspaper, magazines,brochures 2292693735. 359 654 6.39e6 2.49e6
## 4 others 2217138885 490 875 4.52e6 1.66e6
## 5 Radio, TV, Web 1581531308. 249 461 6.35e6 3.32e6
## 6 Trade fair 519880715 77 116 6.75e6 1.19e6
## 7 Tanzania Mission Abroad 213708810. 68 110 3.14e6 1.16e6
## 8 inflight magazines 44904078. 18 21 2.49e6 7.71e5
Spending statistics by most_impressing:
dataset %>%
group_by(most_impressing) %>%
summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
arrange(desc(sum))
## # A tibble: 8 × 6
## most_impressing sum total total_people mean median
## <fct> <dbl> <int> <dbl> <dbl> <dbl>
## 1 "Friendly People" 1.27e10 1541 2933 8.23e6 3.32e6
## 2 " Wildlife" 1.13e10 1038 2157 1.09e7 6.09e6
## 3 "No comments" 4.92e 9 743 1402 6.63e6 2.49e6
## 4 "Wonderful Country, Landscape, Natur… 3.98e 9 507 993 7.85e6 3.98e6
## 5 "Good service" 2.91e 9 365 650 7.97e6 3.32e6
## 6 "Excellent Experience" 2.01e 9 271 590 7.43e6 3.32e6
## 7 "" 1.00e 9 313 484 3.20e6 8.93e5
## 8 "Satisfies and Hope Come Back" 1.75e 8 31 68 5.66e6 2 e6
Measured by total spending, the most profitable tourism sector in Tanzania is “Wildlife tourism”, followed by “Beach tourism”. The most profitable source of information on tourism is “Travel, agent, tour operator”, and the most profitable purpose is “Leisure and Holidays”. Tourists travelling with a spouse spend the most overall, followed by those travelling with friends or relatives and then by those travelling with spouse and children; the last group has the highest mean and median spending but is much smaller. Spending per tourist is highest in the 65+ group, so it is worthwhile to encourage those over 65 to come to Tanzania. The most profitable countries of origin are the USA, the United Kingdom, Italy, France and Australia.