Load packages

library(dplyr)
library(ggplot2)
library(GGally)

Load data

dataset <- read.csv("Train.csv", stringsAsFactors = TRUE)
dataset$total_people <- dataset$total_male+dataset$total_female
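
Because `+` propagates missing values, total_people is NA whenever either count is missing. That is consistent with the summary in Part 2, where total_female has 3 NA's, total_male has 5 and total_people has 8. The sketch below uses a made-up data frame, `toy`, to show this behaviour next to one possible alternative that counts a missing value as zero. That alternative is an assumption for illustration, not the method used in this analysis, and whether it suits this survey is a judgment call.

```r
# Toy data standing in for the survey columns (values are made up)
toy <- data.frame(
  total_male   = c(1, NA, 2),
  total_female = c(1, 1, NA)
)

# Plain addition propagates NA, as in the computation above
toy$total_people <- toy$total_male + toy$total_female

# Alternative (an assumption, not the original method): count NA as 0
toy$total_people_filled <- rowSums(
  toy[, c("total_male", "total_female")],
  na.rm = TRUE
)

print(toy)
```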

Part 1: Research question

We want to identify which aspects of tourism in Tanzania are the most profitable and therefore the most worthwhile to invest in.

Part 2: Exploratory data analysis

summary(dataset)
##          ID                           country     age_group   
##  tour_0   :   1   UNITED STATES OF AMERICA: 695   1-24 : 624  
##  tour_10  :   1   UNITED KINGDOM          : 533   25-44:2487  
##  tour_1000:   1   ITALY                   : 393   45-64:1391  
##  tour_1002:   1   FRANCE                  : 280   65+  : 307  
##  tour_1004:   1   ZIMBABWE                : 274               
##  tour_1005:   1   KENYA                   : 235               
##  (Other)  :4803   (Other)                 :2399               
##               travel_with    total_female       total_male   
##                     :1114   Min.   : 0.0000   Min.   : 0.00  
##  Alone              :1265   1st Qu.: 0.0000   1st Qu.: 1.00  
##  Children           : 162   Median : 1.0000   Median : 1.00  
##  Friends/Relatives  : 895   Mean   : 0.9268   Mean   : 1.01  
##  Spouse             :1005   3rd Qu.: 1.0000   3rd Qu.: 1.00  
##  Spouse and Children: 368   Max.   :49.0000   Max.   :44.00  
##                             NA's   :3         NA's   :5      
##                            purpose                main_activity 
##  Business                      : 671   Wildlife tourism  :2259  
##  Leisure and Holidays          :2840   Beach tourism     :1025  
##  Meetings and Conference       : 312   Hunting tourism   : 457  
##  Other                         : 128   Conference tourism: 367  
##  Scientific and Academic       :  87   Cultural tourism  : 359  
##  Visiting Friends and Relatives: 633   Mountain climbing : 234  
##  Volunteering                  : 138   (Other)           : 108  
##                          info_source       tour_arrangement
##  Travel, agent, tour operator  :1913   Independent :2570   
##  Friends, relatives            :1635   Package Tour:2239   
##  others                        : 490                       
##  Newspaper, magazines,brochures: 359                       
##  Radio, TV, Web                : 249                       
##  Trade fair                    :  77                       
##  (Other)                       :  86                       
##  package_transport_int package_accomodation package_food package_transport_tz
##  No :3357              No :2602             No :2748     No :2919            
##  Yes:1452              Yes:2207             Yes:2061     Yes:1890            
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  package_sightseeing package_guided_tour package_insurance night_mainland   
##  No :3319            No :3259            No :4079          Min.   :  0.000  
##  Yes:1490            Yes:1550            Yes: 730          1st Qu.:  3.000  
##                                                            Median :  6.000  
##                                                            Mean   :  8.488  
##                                                            3rd Qu.: 11.000  
##                                                            Max.   :145.000  
##                                                                             
##  night_zanzibar              payment_mode  first_trip_tz
##  Min.   : 0.000   Cash             :4172   No :1566     
##  1st Qu.: 0.000   Credit Card      : 622   Yes:3243     
##  Median : 0.000   Other            :   8                
##  Mean   : 2.304   Travellers Cheque:   7                
##  3rd Qu.: 4.000                                         
##  Max.   :61.000                                         
##                                                         
##                              most_impressing   total_cost      
##  Friendly People                     :1541   Min.   :   49000  
##   Wildlife                           :1038   1st Qu.:  812175  
##  No comments                         : 743   Median : 3397875  
##  Wonderful Country, Landscape, Nature: 507   Mean   : 8114389  
##  Good service                        : 365   3rd Qu.: 9945000  
##                                      : 313   Max.   :99532875  
##  (Other)                             : 302                     
##   total_people   
##  Min.   : 0.000  
##  1st Qu.: 1.000  
##  Median : 2.000  
##  Mean   : 1.932  
##  3rd Qu.: 2.000  
##  Max.   :93.000  
##  NA's   :8

The correlogram below shows the histograms of the dataset's numerical variables on its main diagonal. These distributions, including total_cost, are right-skewed: most values are small and a long tail of large values pulls the mean above the median. The median is therefore the more representative summary. Among the 4809 records in the dataset, the median (the value that about 2405 of them do not exceed) is 1461 US dollars for total cost, 6 nights on the mainland and 2 people per party.

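num_cols <- c("total_female", "total_male", "night_mainland", "night_zanzibar", "total_cost", "total_people")
print(ggpairs(dataset[, num_cols], title = "Tanzania tourism correlogram"))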

#Total cost in United States dollars (USD), converted at the fixed rate 0.00043
summary(dataset$total_cost)*0.00043
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    21.07   349.24  1461.09  3489.19  4276.35 42799.14
#Number of nights a tourist spent in Tanzania mainland
summary(dataset$night_mainland)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   6.000   8.488  11.000 145.000
#Number of nights a tourist spent in Zanzibar 
summary(dataset$night_zanzibar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.304   4.000  61.000
#Total people
summary(dataset$total_people)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.000   2.000   1.932   2.000  93.000       8
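
The summaries above show the mean well above the median for total_cost, night_mainland and night_zanzibar, which is the usual sign of a right-skewed distribution. The sketch below illustrates that check on a made-up vector, `cost`. It is a rough heuristic under that assumption, not a formal skewness test.

```r
# Toy right-skewed sample (made-up values): most are small, one is very large
cost <- c(100, 200, 300, 400, 5000)

m   <- mean(cost)    # pulled upward by the large value
med <- median(cost)  # barely affected by the large value

# A mean above the median suggests a right (positive) skew
right_skewed <- m > med

print(c(mean = m, median = med))
print(right_skewed)
```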

Spending statistics by countries:

dataset %>%
  group_by(country) %>%
  summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
  arrange(desc(sum))
## # A tibble: 105 × 6
##    country                          sum total total_people      mean   median
##    <fct>                          <dbl> <int>        <dbl>     <dbl>    <dbl>
##  1 UNITED STATES OF AMERICA 8890832054.   695         1356 12792564. 7707375 
##  2 UNITED KINGDOM           3808382503.   533          991  7145183. 3978000 
##  3 ITALY                    3762160276.   393          945  9572927. 7458750 
##  4 FRANCE                   3344495948.   280          730 11944628. 9077320 
##  5 AUSTRALIA                2743131959.   186          311 14748021. 9903562.
##  6 SOUTH AFRICA             2594805035.   235          441 11041724. 3315000 
##  7 GERMANY                  2218351851.   223          523  9947766. 5884125 
##  8 SPAIN                    1699098956.   165          414 10297569. 6807990 
##  9 CANADA                   1462028554.   114          209 12824812. 5958712.
## 10 NETHERLANDS              1250217145.   112          282 11162653. 5552625 
## # … with 95 more rows
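
The same summary is computed for several grouping variables in the sections that follow. One way to avoid repeating the pipeline is a small helper. The function name `spending_by` and the data frame `toy` below are hypothetical and only illustrate the idea on made-up values; the column names total_cost and total_people match the dataset.

```r
library(dplyr)

# Hypothetical helper: spending summary for one grouping column
spending_by <- function(data, group_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(
      sum          = sum(total_cost),
      total        = n(),
      total_people = sum(total_people, na.rm = TRUE),
      mean         = mean(total_cost),
      median       = median(total_cost),
      .groups      = "drop"
    ) %>%
    arrange(desc(sum))
}

# Toy data (made-up values) with the same column names as the dataset
toy <- data.frame(
  country      = c("A", "A", "B"),
  total_cost   = c(100, 300, 1000),
  total_people = c(1, NA, 2)
)

res <- spending_by(toy, country)
print(res)
```

With the real data, a call such as `spending_by(dataset, age_group)` would be expected to give the same table as the corresponding pipeline.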

Spending statistics by age_group:

The boxplot of the age groups shows that the mean and the median of total expenditure tend to grow with the age of the paying tourists. Looking at total expenditure by age group, however, the 45-64 and 25-44 groups spend more overall than the 65+ group: they contain far more tourists, even though their medians are lower. The bar chart also shows that the most common main activity is “Wildlife tourism”, followed by “Beach tourism”.

dataset %>%
  ggplot(aes(age_group,total_cost, fill=age_group))+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=-45, vjust=0.5,hjust=0))+
  ggtitle("Age_group total_cost boxplot")

df<-dataset %>%
  group_by(age_group, main_activity) %>%
  summarise(total=n(), .groups = "drop_last") 

df %>%
  ggplot(aes(age_group,total, fill=main_activity))+
  geom_bar(stat = "identity")+
  ggtitle("Total number of tourists by age_group and main_activity")

dataset %>%
  group_by(age_group) %>%
  summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
  arrange(desc(sum))
## # A tibble: 4 × 6
##   age_group          sum total total_people      mean   median
##   <fct>            <dbl> <int>        <dbl>     <dbl>    <dbl>
## 1 45-64     15371839260.  1391         3054 11050927.  5834400
## 2 25-44     14987099938.  2487         4548  6026176.  2486250
## 3 65+        5284068284.   307          608 17211949. 12845625
## 4 1-24       3379088150.   624         1067  5415205.  2602275

Spending statistics by travel_with:

Those who spend the most overall are couples without children (travel_with “Spouse”), followed by groups of friends or relatives and then by couples with children. The last group has the highest mean and median spending, but it contains fewer tourists:

dataset %>%
  filter(travel_with!="") %>%
  ggplot(aes(travel_with,total_cost, fill=travel_with))+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=-45, vjust=0.5,hjust=0))+
  ggtitle("Travel_with total_cost boxplot")

dataset %>%
  filter(travel_with!="") %>%
  group_by(travel_with) %>%
  summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
  arrange(desc(sum))
## # A tibble: 5 × 6
##   travel_with                  sum total total_people      mean    median
##   <fct>                      <dbl> <int>        <dbl>     <dbl>     <dbl>
## 1 Spouse              12746305028.  1005         2023 12682891.  8623454 
## 2 Friends/Relatives    9158699780.   895         2860 10233184.  4972500 
## 3 Spouse and Children  6745753040.   368         1423 18330851. 14720606 
## 4 Alone                4334079145.  1265         1278  3426150.  1491750 
## 5 Children             1653502309.   162          514 10206804.  5687662.
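
Since the groups differ in party size, it can also help to relate total expenditure to the number of people covered. The sketch below re-enters the printed figures in a hypothetical data frame, `tw_stats`, and divides sum by total_people. The result is only approximate, because total_people leaves out records with a missing head count while sum includes their cost.

```r
# Figures copied from the travel_with summary table above (as printed)
tw_stats <- data.frame(
  travel_with  = c("Spouse", "Friends/Relatives", "Spouse and Children",
                   "Alone", "Children"),
  sum          = c(12746305028, 9158699780, 6745753040,
                   4334079145, 1653502309),
  total_people = c(2023, 2860, 1423, 1278, 514)
)

# Approximate expenditure per person in each group
tw_stats$per_person <- tw_stats$sum / tw_stats$total_people

# Group with the highest approximate per-person expenditure
top_per_person <- tw_stats$travel_with[which.max(tw_stats$per_person)]

print(tw_stats[order(-tw_stats$per_person), ])
print(top_per_person)
```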

Spending statistics by purpose:

dataset %>%
  group_by(purpose) %>%
  summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
  arrange(desc(sum))
## # A tibble: 7 × 6
##   purpose                                 sum total total_people     mean median
##   <fct>                                 <dbl> <int>        <dbl>    <dbl>  <dbl>
## 1 Leisure and Holidays           33941224747.  2840         6370   1.20e7 7.29e6
## 2 Visiting Friends and Relatives  2019760962.   633         1093   3.19e6 8.29e5
## 3 Business                        1196015966.   671          829   1.78e6 6.81e5
## 4 Meetings and Conference          765337095.   312          459   2.45e6 1.16e6
## 5 Volunteering                     545177930.   138          205   3.95e6 2.65e6
## 6 Scientific and Academic          350783148.    87          149   4.03e6 1.16e6
## 7 Other                            203795784.   128          172   1.59e6 4.65e5

Spending statistics by main_activity:

dataset %>%
  group_by(main_activity) %>%
  summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
  arrange(desc(sum))
## # A tibble: 9 × 6
##   main_activity                     sum total total_people      mean  median
##   <fct>                           <dbl> <int>        <dbl>     <dbl>   <dbl>
## 1 Wildlife tourism         23934843156.  2259         4588 10595327. 5138250
## 2 Beach tourism             7712957936.  1025         2199  7524837. 3884500
## 3 Conference tourism        3782596917.   367          672 10306804. 6145308
## 4 Cultural tourism          1432818972.   359          683  3991139. 1657500
## 5 Hunting tourism            873476390.   457          628  1911327.  500000
## 6 business                   471254513     58          117  8125078. 4889625
## 7 Mountain climbing          435908481.   234          309  1862857.  828750
## 8 Diving and Sport Fishing   222226440.    13           24 17094342. 3978000
## 9 Bird watching              156012828.    37           57  4216563.  663000

Spending statistics by info_source:

dataset %>%
  group_by(info_source) %>%
  summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
  arrange(desc(sum))
## # A tibble: 8 × 6
##   info_source                             sum total total_people     mean median
##   <fct>                                 <dbl> <int>        <dbl>    <dbl>  <dbl>
## 1 Travel, agent, tour operator   25104817473.  1913         4216   1.31e7 7.96e6
## 2 Friends, relatives              7047420630.  1635         2824   4.31e6 1.56e6
## 3 Newspaper, magazines,brochures  2292693735.   359          654   6.39e6 2.49e6
## 4 others                          2217138885    490          875   4.52e6 1.66e6
## 5 Radio, TV, Web                  1581531308.   249          461   6.35e6 3.32e6
## 6 Trade fair                       519880715     77          116   6.75e6 1.19e6
## 7 Tanzania Mission Abroad          213708810.    68          110   3.14e6 1.16e6
## 8 inflight magazines                44904078.    18           21   2.49e6 7.71e5

Spending statistics by most_impressing:

dataset %>%
  group_by(most_impressing) %>%
  summarise(sum = sum(total_cost), total = n(), total_people = sum(total_people, na.rm = TRUE), mean = mean(total_cost), median = median(total_cost)) %>%
  arrange(desc(sum))
## # A tibble: 8 × 6
##   most_impressing                           sum total total_people   mean median
##   <fct>                                   <dbl> <int>        <dbl>  <dbl>  <dbl>
## 1 "Friendly People"                     1.27e10  1541         2933 8.23e6 3.32e6
## 2 " Wildlife"                           1.13e10  1038         2157 1.09e7 6.09e6
## 3 "No comments"                         4.92e 9   743         1402 6.63e6 2.49e6
## 4 "Wonderful Country, Landscape, Natur… 3.98e 9   507          993 7.85e6 3.98e6
## 5 "Good service"                        2.91e 9   365          650 7.97e6 3.32e6
## 6 "Excellent Experience"                2.01e 9   271          590 7.43e6 3.32e6
## 7 ""                                    1.00e 9   313          484 3.20e6 8.93e5
## 8 "Satisfies and Hope Come Back"        1.75e 8    31           68 5.66e6 2   e6
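
The table above contains a label with a leading space (" Wildlife") and an empty label (""). The sketch below shows one possible way to tidy such labels before grouping, using a made-up vector, `raw`. Treating empty answers as missing is an assumption for illustration, not a step taken in this analysis.

```r
# Toy vector imitating the raw labels (leading space and empty string included)
raw <- factor(c("Friendly People", " Wildlife", "", "No comments", " Wildlife"))

clean <- trimws(as.character(raw))  # drop leading and trailing whitespace
clean[clean == ""] <- NA            # assumption: treat empty answers as missing
clean <- factor(clean)

print(levels(clean))
print(table(clean, useNA = "ifany"))
```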

Conclusion

The most profitable tourism sectors in Tanzania are “Wildlife tourism”, followed by “Beach tourism”. The most profitable source of information on tourism is “Travel, agent, tour operator”. The most profitable purpose is “Leisure and Holidays”. Those who spend the most overall are couples without children, followed by groups of friends or relatives and then by couples with children; the last group has the highest mean and median spending but contains fewer tourists. Older people spend more per trip, so it is worthwhile to encourage those over 65 to come to Tanzania. By total expenditure, the most profitable countries are the USA, the United Kingdom, Italy, France and Australia.