Introduction

Today I’m going to review and analyse the Reddit travel subreddit survey, which covers all sorts of travelling habits. The idea is to do an initial review of the data, then move on to find some trends and groups of people with similar travel habits. Let’s take a look at the data set:

travsur <- read.csv("travelsurvey2.csv")

dim(travsur)
## [1] 768 100

So, the data set has 768 rows and 100 variables. The variable names in the original data set were the full survey questions, which made them too long and awkward to code with. Below you can see the variables in a more manageable format:

colnames(travsur)
##   [1] "Timestamp"         "Age"               "Gender"           
##   [4] "Nationality"       "Rescidence"        "Education"        
##   [7] "Employed"          "travIndustry"      "Relationship"     
##  [10] "Passport"          "TripsPerY"         "DaysperTrav"      
##  [13] "TravDayBusi"       "TraveAbroa"        "Currency"         
##  [16] "VisitTravel"       "UsePlane"          "UseBus"           
##  [19] "UseTrain"          "UseBoat"           "UseBike"          
##  [22] "UseCar"            "UseMotorB"         "UseWalk"          
##  [25] "UseOther"          "Spend"             "MealsOut"         
##  [28] "NightsPlace"       "CountriesT"        "FavCountry"       
##  [31] "Motivation"        "Bots"              "Alone"            
##  [34] "MotivateTrav"      "LostBag"           "WhereLost"        
##  [37] "Insurance"         "SocialMedia"       "MediaFor"         
##  [40] "ExtraQ"            "SecPassport"       "SecPassCOunt"     
##  [43] "Born"              "NoCOunt"           "BestDescript"     
##  [46] "Holidays"          "Current"           "Religion"         
##  [49] "Children"          "AvTrip"            "LongTrip"         
##  [52] "ShortRIP"          "AvCity"            "TravPartners"     
##  [55] "PartnersCatergory" "PrefTravMont"      "MostLikeTravMonth"
##  [58] "ExtraBusDays"      "DisaapointingDES"  "PositiveDes"      
##  [61] "Continents"        "BigCityTiime"      "AccomodationTpe"  
##  [64] "NomalAxx"          "DelayedCheclBag"   "MusicGenre"       
##  [67] "AirportFood"       "Pick.one."         "Pick.one..1"      
##  [70] "Pick.one..2"       "FlightChoice"      "BestTickets"      
##  [73] "BestPrchase"       "PreferTrav"        "AbleTravf"        
##  [76] "Booking"           "Arranging"         "AccomBook"        
##  [79] "Activities"        "Photos"            "TraInsura"        
##  [82] "Trips"             "ResearchSite"      "Reconmendation"   
##  [85] "Budget"            "EmergencyFund"     "DreamDest"        
##  [88] "VisitedDream"      "Language"          "SocialMedia.1"    
##  [91] "StarngeSouv"       "FavSouv"           "TravelItem"       
##  [94] "FictionalTrav"     "NextTrip"          "NextTripDest"     
##  [97] "BestTip"           "DecidingTrip"      "Redditchsnge"     
## [100] "Bot"

Initial Exploratory Analysis

Here we take an exploratory look at the data set:


The three graphs above give a basic breakdown of the data set. The first thing to point out is that clearly more males than females filled out the survey. The most common age bracket is 22-29. American is by far the most common nationality, which I think is more a reflection of Reddit’s user base than of people who go travelling in general. At the start I said the idea was to find some groups and review their different tendencies. I am going to do that using the trips per year variable and a new variable built from trips and spend: spend per trip.
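As a rough illustration of the spend per trip idea, here is a minimal sketch on a made-up data frame. It assumes both columns are already numeric and that spend is a yearly total; the real `Spend` and `TripsPerY` columns may be stored as text ranges and would need converting first. The `toy` data frame and the `SpendPerTrip` name are my own placeholders.

```r
# Toy stand-in for the survey (hypothetical values, not the real data).
toy <- data.frame(
  TripsPerY = c(1, 2, 4, 8, 0),
  Spend     = c(3000, 4000, 6000, 8000, 500)
)

# Spend per trip = yearly spend divided by trips per year.
# Anyone reporting zero trips gets NA rather than a division by zero.
toy$SpendPerTrip <- ifelse(toy$TripsPerY > 0,
                           toy$Spend / toy$TripsPerY,
                           NA)

toy
```

Setting the zero-trip rows to NA keeps infinite values out of any later clustering step, where they would need to be dropped anyway.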

k-means Clustering

This relationship is not entirely unexpected: the more trips a person takes, the less they spend per trip. Not many people are millionaires who can permanently globetrot between 5* resorts, or at least not the people who read the Reddit travel subreddit. We are now going to use k-means clustering to allocate the groups. First up, let’s see how many groups there need to be.

Reviewing the silhouette plot, either 3 or 5 clusters looks appropriate. We are going to go with 3 clusters, as that has the highest value on the plot.
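For anyone wanting to reproduce this kind of check, here is one way to compute the average silhouette width for a range of cluster counts. It is a sketch on simulated data, not the code behind the plot above: it assumes the `cluster` package is installed, and the `sim` matrix and its group centres are invented for the example.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Simulated stand-in for the two clustering variables: three loose blobs
# around hypothetical centres.
centres <- rbind(c(1, 8), c(4, 4), c(8, 1))
sim <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centres[i, 1], 0.5), rnorm(50, centres[i, 2], 0.5))
}))

# k-means is distance based, so put both variables on the same scale.
sim <- scale(sim)
d <- dist(sim)

# Average silhouette width for k = 2 to 6.
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(sim, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

plot(ks, avg_sil, type = "b",
     xlab = "Number of clusters k",
     ylab = "Average silhouette width")

# The k with the highest average width is the usual pick.
best_k <- ks[which.max(avg_sil)]
```

A higher average silhouette width means points sit closer to their own cluster than to the next nearest one, which is why the peak of the plot is the usual choice.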

Above is the previous plot, now coloured by the cluster each point belongs to. There are three clear groups.
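A coloured plot like this can be produced along the following lines. Again this is only a sketch on simulated numbers: the `toy` data frame and its values are made up, and the real survey columns would need cleaning before they could be passed to `kmeans()`.

```r
set.seed(1)
# Hypothetical stand-in data: trips per year against spend per trip.
toy <- data.frame(
  TripsPerY    = c(rnorm(40, 2, 0.5), rnorm(40, 5, 0.7), rnorm(40, 10, 1)),
  SpendPerTrip = c(rnorm(40, 3000, 300), rnorm(40, 1500, 200), rnorm(40, 600, 100))
)

# Scale first so spend (in the thousands) does not swamp trips (single digits).
scaled <- scale(toy)

# Fit k-means with 3 clusters; nstart tries several random starts
# and keeps the best one.
km <- kmeans(scaled, centers = 3, nstart = 25)
toy$cluster <- factor(km$cluster)

# Colour each point by its assigned cluster.
plot(toy$TripsPerY, toy$SpendPerTrip,
     col = toy$cluster, pch = 19,
     xlab = "Trips per year", ylab = "Spend per trip")

table(toy$cluster)
```

The cluster numbers that `kmeans()` hands out are arbitrary labels, so which group ends up as cluster 1, 2 or 3 can change between runs unless the seed is fixed.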

We can now use these clusters to look for trends across the three groups. Below we are first going to look at the type of accommodation each group prefers.

k-means Analysis

Cluster 1 is probably the cluster the average person would belong to, and it has a fairly even spread of accommodation types. Cluster 2 has the majority of people in either Airbnb or hotels. Finally, cluster 3 seems to have the highest proportion of people in hotels, which is surprising: I would have thought people who travel a lot would use hostels more.
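Comparisons like these come down to row proportions in a cluster by accommodation table. Here is a small sketch with invented rows; the `toy` data frame and the `accom` column are placeholders, not the survey’s actual values.

```r
# Hypothetical cluster labels and accommodation answers.
toy <- data.frame(
  cluster = factor(c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3)),
  accom   = c("Hotel", "Hostel", "Airbnb", "Camping",
              "Airbnb", "Hotel", "Hotel",
              "Hotel", "Hotel", "Hostel")
)

# Counts of each accommodation type within each cluster.
counts <- table(toy$cluster, toy$accom)

# margin = 1 converts each row to proportions, so clusters of different
# sizes can be compared fairly.
props <- prop.table(counts, margin = 1)

round(props, 2)
```

Working with proportions rather than raw counts matters here because the clusters are unlikely to be the same size.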

The first surprise is that in 2 of the 3 groups fewer than a quarter of people use social media while travelling. That rises to about a third for people in cluster 3. When we look at what it is used for, cluster 3 has by far the most people who want to use social media to earn money. These must be the travel vloggers.

That’s it for today’s review. I’m sure there is loads more you could do with this data set, and plenty of other insights to find.