FourSquare Dataset

This dataset contains check-ins from FourSquare users in New York City (NYC) and Tokyo, collected over about 10 months (from 12 April 2012 to 16 February 2013). It contains 227,428 check-ins in New York City and 573,703 check-ins in Tokyo. Each check-in is associated with its timestamp, its GPS coordinates, and its semantic meaning (represented by fine-grained venue categories).

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
setwd("C:/Users/LokHeng/Desktop/Kaggle/FourSquare")

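# Read the raw check-in files and convert them to tibbles
# (as_tibble() replaces the deprecated tbl_df()).
TKY <- read.csv("dataset_TSMC2014_TKY.csv", stringsAsFactors = FALSE)
TKY <- as_tibble(TKY)
NYC <- read.csv("dataset_TSMC2014_NYC.csv", stringsAsFactors = FALSE)
NYC <- as_tibble(NYC)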

The str() function gives a compact overview of the structure of the data.

str(NYC)
## Classes 'tbl_df', 'tbl' and 'data.frame':    227428 obs. of  8 variables:
##  $ userId         : int  470 979 69 395 87 484 642 292 428 877 ...
##  $ venueId        : chr  "49bbd6c0f964a520f4531fe3" "4a43c0aef964a520c6a61fe3" "4c5cc7b485a1e21e00d35711" "4bc7086715a7ef3bef9878da" ...
##  $ venueCategoryId: chr  "4bf58dd8d48988d127951735" "4bf58dd8d48988d1df941735" "4bf58dd8d48988d103941735" "4bf58dd8d48988d104941735" ...
##  $ venueCategory  : chr  "Arts & Crafts Store" "Bridge" "Home (private)" "Medical Center" ...
##  $ latitude       : num  40.7 40.6 40.7 40.7 40.7 ...
##  $ longitude      : num  -74 -74 -73.9 -74 -74 ...
##  $ timezoneOffset : int  -240 -240 -240 -240 -240 -240 -240 -240 -240 -240 ...
##  $ utcTimestamp   : chr  "Tue Apr 03 18:00:09 +0000 2012" "Tue Apr 03 18:00:25 +0000 2012" "Tue Apr 03 18:02:24 +0000 2012" "Tue Apr 03 18:02:41 +0000 2012" ...

Each user and each venue is given a unique ID, and venues are grouped into categories. The location (latitude and longitude) and the time of each check-in are also recorded in the data.
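Note that utcTimestamp is read in as plain text (chr), so it has to be parsed before any time-based analysis. The following is a minimal sketch of one way to do this in base R, using two rows copied from the str(NYC) output. It assumes that timezoneOffset is the local offset from UTC in minutes (a value of -240 would then correspond to UTC-4) and that the session uses an English or C time locale, since the day and month names are abbreviated in English. The names sample_checkins, utcTime, and localTime are illustrative and not part of the dataset.

```r
# Two sample rows copied from the str(NYC) output.
sample_checkins <- data.frame(
  userId = c(470L, 979L),
  timezoneOffset = c(-240L, -240L),
  utcTimestamp = c("Tue Apr 03 18:00:09 +0000 2012",
                   "Tue Apr 03 18:00:25 +0000 2012"),
  stringsAsFactors = FALSE
)

# Parse the text timestamps as UTC date-times.
# %a and %b match English day/month abbreviations, so this assumes an
# English (or C) LC_TIME locale.
sample_checkins$utcTime <- as.POSIXct(
  sample_checkins$utcTimestamp,
  format = "%a %b %d %H:%M:%S %z %Y",
  tz = "UTC"
)

# Assumption: timezoneOffset is the local offset from UTC in minutes.
# The result keeps the UTC time zone label; only the clock value is shifted.
sample_checkins$localTime <-
  sample_checkins$utcTime + sample_checkins$timezoneOffset * 60

sample_checkins[, c("userId", "utcTime", "localTime")]
```

The same two assignments could be applied to the full NYC and TKY tables, for example to look at check-ins by local hour of day.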

Visualization

The bar charts below show the 10 venue categories with the most check-ins in Tokyo (TKY) and New York City (NYC). Together, these top 10 categories account for 61.28% of all check-ins in TKY and 40.69% of all check-ins in NYC.

# Top check-in places (total): share of all check-ins per venue category
TKY %>%
  group_by(venueCategory) %>%
  summarize(Total = n()) %>%
  mutate(Proportion = Total / sum(Total) * 100) %>%
  arrange(desc(Total)) %>%
  head(10) %>%
  ggplot() +
  geom_col(aes(x = reorder(venueCategory, Proportion), y = Proportion)) +
  coord_flip() +
  labs(title = "Top 10 Check-in Venue Categories by Tokyo FourSquare Users",
       subtitle = "Data from 12 April 2012 to 16 February 2013",
       y = "Proportion of all check-ins (percent)",
       x = "Venue Categories")

NYC %>%
  group_by(venueCategory) %>%
  summarize(Total = n()) %>%
  mutate(Proportion = Total / sum(Total) * 100) %>%
  arrange(desc(Total)) %>%
  head(10) %>%
  ggplot() +
  geom_col(aes(x = reorder(venueCategory, Proportion), y = Proportion)) +
  coord_flip() +
  labs(title = "Top 10 Check-in Venue Categories by NYC FourSquare Users",
       subtitle = "Data from 12 April 2012 to 16 February 2013",
       y = "Proportion of all check-ins (percent)",
       x = "Venue Categories")
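
The combined share of the top 10 categories quoted earlier can be checked by summing the Proportion column over the same 10 rows that are plotted. The sketch below wraps that calculation in a helper; top_n_share is a hypothetical name introduced here for illustration, and the toy data frame is made up so that the block runs without the CSV files.

```r
library(dplyr)

# Hypothetical helper: percentage of all check-ins that fall into the
# n most frequent venue categories.
top_n_share <- function(checkins, n = 10) {
  checkins %>%
    count(venueCategory, name = "Total", sort = TRUE) %>%
    mutate(Proportion = Total / sum(Total) * 100) %>%
    slice_head(n = n) %>%
    summarize(Share = sum(Proportion)) %>%
    pull(Share)
}

# Toy data for illustration only (not taken from the dataset):
# 2 of the 4 check-ins are in the most frequent category.
toy <- data.frame(
  venueCategory = c("Train Station", "Train Station", "Bar", "Home (private)"),
  stringsAsFactors = FALSE
)
toy_share <- top_n_share(toy, n = 1)
toy_share
```

With the real data loaded, top_n_share(TKY) and top_n_share(NYC) would give the figures to compare against the percentages stated in the text.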