FourSquare Dataset

This dataset contains check-ins from FourSquare users in NYC and Tokyo, collected over about 10 months (from 12 April 2012 to 16 February 2013). It contains 227,428 check-ins in New York City and 573,703 check-ins in Tokyo. Each check-in is associated with a timestamp, GPS coordinates, and a semantic meaning (represented by fine-grained venue categories).

The str() command gives a better understanding of the structure of the data. The output below is for the Tokyo data (TKY).

str(TKY)
## Classes 'tbl_df', 'tbl' and 'data.frame':    573703 obs. of  11 variables:
##  $ userId         : int  1541 868 114 868 1458 1541 1541 114 1635 2033 ...
##  $ venueId        : chr  "4f0fd5a8e4b03856eeb6c8cb" "4b7b884ff964a5207d662fe3" "4c16fdda96040f477cc473a5" "4c178638c2dfc928651ea869" ...
##  $ venueCategoryId: chr  "4bf58dd8d48988d10c951735" "4bf58dd8d48988d1d1941735" "4d954b0ea243a5684a65b473" "4bf58dd8d48988d118951735" ...
##  $ venueCategory  : chr  "Cosmetics Shop" "Ramen /  Noodle House" "Convenience Store" "Food & Drink Shop" ...
##  $ latitude       : num  35.7 35.7 35.7 35.7 35.7 ...
##  $ longitude      : num  140 140 139 140 140 ...
##  $ timezoneOffset : int  540 540 540 540 540 540 540 540 540 540 ...
##  $ utcTimestamp   : chr  "Tue Apr 03 18:17:18 +0000 2012" "Tue Apr 03 18:22:04 +0000 2012" "Tue Apr 03 19:12:07 +0000 2012" "Tue Apr 03 19:12:13 +0000 2012" ...
##  $ datetime       : POSIXct, format: "2012-04-04 03:17:18" "2012-04-04 03:22:04" ...
##  $ weekday        : Factor w/ 7 levels "Sun","Mon","Tue",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ hour           : int  3 3 4 4 4 4 4 4 4 4 ...

Each user and venue is given a unique ID, and the venues are grouped into categories of similar venues. The location, given by latitude and longitude, and the time of each check-in are also recorded in the data.
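The datetime, weekday and hour columns in the str() output look like local-time values derived from utcTimestamp and timezoneOffset (an offset of 540 minutes turns 18:17:18 UTC into 03:17:18 the next day). The following is a minimal base R sketch of one way such columns could be derived, not necessarily how they were actually built. The helper name parse_utc and the two sample timestamps (copied from the output above) are only for illustration.

```r
# Hypothetical helper: parse strings like "Tue Apr 03 18:17:18 +0000 2012".
# Month names are matched against month.abb, so the result does not depend
# on the session locale.
parse_utc <- function(x) {
  parts <- do.call(rbind, strsplit(x, " ", fixed = TRUE))
  hms   <- do.call(rbind, strsplit(parts[, 4], ":", fixed = TRUE))
  ISOdatetime(year  = as.integer(parts[, 6]),
              month = match(parts[, 2], month.abb),
              day   = as.integer(parts[, 3]),
              hour  = as.integer(hms[, 1]),
              min   = as.integer(hms[, 2]),
              sec   = as.integer(hms[, 3]),
              tz    = "UTC")
}

utc        <- c("Tue Apr 03 18:17:18 +0000 2012",
                "Tue Apr 03 19:12:07 +0000 2012")
offset_min <- c(540L, 540L)  # timezoneOffset is assumed to be in minutes

# Shift by the offset; the result is local wall-clock time stored with a UTC label.
local_time <- parse_utc(utc) + offset_min * 60

hour    <- as.integer(format(local_time, "%H", tz = "UTC"))
weekday <- factor(as.POSIXlt(local_time, tz = "UTC")$wday,
                  levels = 0:6,
                  labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))

format(local_time, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

For these two rows the sketch gives hours 3 and 4 and the weekday "Wed" (level 4), which agrees with the first values shown in the str() output.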

Where did they check in the most?

The bar charts below show the top 10 check-in venue categories in Tokyo (TKY) and New York City (NYC). The top 10 categories account for 61.28% of all check-ins in TKY and 40.69% of all check-ins in NYC.

# Top check-in venue categories (share of all check-ins)
library(dplyr)    # group_by(), summarise(), mutate(), arrange(), %>%
library(ggplot2)  # ggplot(), geom_col(), coord_flip(), labs()

TKY %>%
  group_by(venueCategory) %>%
  summarise(Total = n()) %>%
  mutate(Proportion = Total / sum(Total) * 100) %>%  # percent of all check-ins
  arrange(desc(Total)) %>%
  head(10) %>%
  ggplot() +
  geom_col(aes(x = reorder(venueCategory, Proportion), y = Proportion)) +
  coord_flip() +  # horizontal bars keep long category names readable
  labs(title = "Top 10 Check-in Venue Categories by Tokyo FourSquare Users",
       subtitle = "Data from 12 April 2012 to 16 February 2013",
       y = "Proportion of all check-ins (percent)",
       x = "Venue Categories")
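The combined share of the top 10 categories quoted above is the sum of the 10 largest category counts divided by the total number of check-ins. A small base R sketch of that arithmetic follows. The function name top_share and the toy vector are made up for illustration and are not part of the original analysis.

```r
# Hypothetical helper: percentage of all observations that fall into the
# n most frequent categories.
top_share <- function(categories, n = 10) {
  counts <- sort(as.vector(table(categories)), decreasing = TRUE)
  sum(head(counts, n)) / sum(counts) * 100
}

# Toy data (not from the dataset): 10 check-ins across 3 categories.
toy <- c(rep("Train Station", 5), rep("Subway", 3), rep("Cafe", 2))

top_share(toy, n = 2)  # (5 + 3) / 10 * 100 = 80
```

Assuming the TKY data frame is loaded, top_share(TKY$venueCategory) should be directly comparable to the 61.28% figure reported for Tokyo.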

When did they check in?

The line chart below shows the hourly check-in pattern of Tokyo FourSquare users, with one line per day of the week.

# Count check-ins for every weekday/hour pair, then draw one line per weekday
TKY %>%
  group_by(weekday, hour) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = hour, y = count, col = weekday)) +
  geom_line() +
  labs(y = "Total Count", x = "Hour", title = "Total Check-in Counts",
       subtitle = "Data from 12 April 2012 to 16 February 2013")
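The grouped count above only returns weekday/hour pairs that actually occur in the data. As a cross-check, the same counts can be laid out as a full 7 x 24 table in base R, where empty pairs appear as zeros. This is a sketch on toy values, not on the real data. The three sample check-ins are invented for illustration.

```r
day_levels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Toy data (not from the dataset): three check-ins.
weekday <- factor(c("Mon", "Mon", "Tue"), levels = day_levels)
hour    <- c(8L, 8L, 23L)

# Fixing the hour levels to 0..23 forces every hour to appear as a column,
# even when no check-in happened in that hour.
counts <- table(weekday = weekday, hour = factor(hour, levels = 0:23))

dim(counts)          # 7 weekdays by 24 hours
counts["Mon", "8"]   # 2 check-ins on Monday in the 08:00 hour
```

With the real data, table(TKY$weekday, factor(TKY$hour, levels = 0:23)) would give the equivalent matrix, assuming the weekday and hour columns are as shown in the str() output.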