This dataset contains check-ins from FourSquare users in New York City (NYC) and Tokyo (TKY), collected over about 10 months (from 12 April 2012 to 16 February 2013). It contains 227,428 check-ins in New York City and 573,703 check-ins in Tokyo. Each check-in is associated with its timestamp, its GPS coordinates, and its semantic meaning (represented by fine-grained venue categories).
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
setwd("C:/Users/LokHeng/Desktop/Kaggle/FourSquare")
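# Read both city files; keep text columns as character rather than factor
TKY <- read.csv("dataset_TSMC2014_TKY.csv", stringsAsFactors = FALSE)
TKY <- as_tibble(TKY)  # as_tibble() replaces the deprecated tbl_df()
NYC <- read.csv("dataset_TSMC2014_NYC.csv", stringsAsFactors = FALSE)
NYC <- as_tibble(NYC)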
The str() function gives a compact overview of the structure of the data: the number of observations, and the name, type, and first few values of each variable.
str(NYC)
## Classes 'tbl_df', 'tbl' and 'data.frame': 227428 obs. of 8 variables:
## $ userId : int 470 979 69 395 87 484 642 292 428 877 ...
## $ venueId : chr "49bbd6c0f964a520f4531fe3" "4a43c0aef964a520c6a61fe3" "4c5cc7b485a1e21e00d35711" "4bc7086715a7ef3bef9878da" ...
## $ venueCategoryId: chr "4bf58dd8d48988d127951735" "4bf58dd8d48988d1df941735" "4bf58dd8d48988d103941735" "4bf58dd8d48988d104941735" ...
## $ venueCategory : chr "Arts & Crafts Store" "Bridge" "Home (private)" "Medical Center" ...
## $ latitude : num 40.7 40.6 40.7 40.7 40.7 ...
## $ longitude : num -74 -74 -73.9 -74 -74 ...
## $ timezoneOffset : int -240 -240 -240 -240 -240 -240 -240 -240 -240 -240 ...
## $ utcTimestamp : chr "Tue Apr 03 18:00:09 +0000 2012" "Tue Apr 03 18:00:25 +0000 2012" "Tue Apr 03 18:02:24 +0000 2012" "Tue Apr 03 18:02:41 +0000 2012" ...
Each user and each venue is given a unique ID, and venues are grouped into categories. The location of each check-in (latitude and longitude) and its time (a UTC timestamp plus a timezone offset) are also recorded.
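Since utcTimestamp is stored as text, it has to be parsed before any time-based analysis. The following is a minimal sketch, not part of the original analysis. It assumes that timezoneOffset is expressed in minutes (which matches -240 for New York in summer) and that the session uses an English locale, so that the day and month abbreviations are recognized. The data frame checkins and the column localTime are hypothetical names; the two rows are copied from the str() output above.

```r
# Two toy rows copied from the str() output (stand-in for NYC or TKY)
checkins <- data.frame(
  utcTimestamp   = c("Tue Apr 03 18:00:09 +0000 2012",
                     "Tue Apr 03 18:02:41 +0000 2012"),
  timezoneOffset = c(-240L, -240L),
  stringsAsFactors = FALSE
)

# Parse the text timestamps as UTC date-times.
# Assumption: %a and %b match English day/month abbreviations.
utc <- as.POSIXct(checkins$utcTimestamp,
                  format = "%a %b %d %H:%M:%S %z %Y",
                  tz = "UTC")

# Shift by the offset (assumed to be in minutes) to get local wall-clock time.
# The result is still labelled UTC; only the clock reading is local.
checkins$localTime <- utc + checkins$timezoneOffset * 60

format(checkins$localTime, "%Y-%m-%d %H:%M:%S")
```

With a column like this, check-ins could be grouped by local hour of day or by weekday.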
The bar charts below show the top 10 check-in venue categories in TKY and NYC, ranked by their share of all check-ins in that city. Check-ins are more concentrated in Tokyo: its top 10 categories account for 61.28% of all check-ins in TKY, while the top 10 categories in NYC account for 40.69% of all check-ins in NYC.
# Top check-in venue categories in Tokyo (share of all check-ins)
TKY %>%
  group_by(venueCategory) %>%
  summarize(Total = n()) %>%
  mutate(Proportion = Total / sum(Total) * 100) %>%
  arrange(desc(Total)) %>%
  head(10) %>%
  ggplot() +
  geom_bar(aes(x = reorder(venueCategory, Proportion), y = Proportion), stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Check-in Venue Categories by Tokyo FourSquare Users",
       subtitle = "Data from 12 April 2012 to 16 February 2013",
       y = "Proportion of all check-ins (percent)",
       x = "Venue Categories")
# Top check-in venue categories in NYC (share of all check-ins)
NYC %>%
  group_by(venueCategory) %>%
  summarize(Total = n()) %>%
  mutate(Proportion = Total / sum(Total) * 100) %>%
  arrange(desc(Total)) %>%
  head(10) %>%
  ggplot() +
  geom_bar(aes(x = reorder(venueCategory, Proportion), y = Proportion), stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Check-in Venue Categories by NYC FourSquare Users",
       subtitle = "Data from 12 April 2012 to 16 February 2013",
       y = "Proportion of all check-ins (percent)",
       x = "Venue Categories")
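The combined top-10 shares quoted in the text (61.28% for TKY and 40.69% for NYC) are not computed in the code shown, but they could be obtained by summing the Proportion column of the same pipeline. The sketch below illustrates this; top_n_share is a hypothetical helper name, and the toy data frame is made up for illustration and is not taken from the dataset.

```r
library(dplyr)

# Hypothetical helper: percentage of all check-ins that fall into the
# n most frequent venue categories.
top_n_share <- function(checkins, n = 10) {
  checkins %>%
    group_by(venueCategory) %>%
    summarize(Total = n()) %>%
    mutate(Proportion = Total / sum(Total) * 100) %>%
    arrange(desc(Total)) %>%
    head(n) %>%
    summarize(Share = sum(Proportion)) %>%
    pull(Share)
}

# Made-up toy data: 3 check-ins at "Bar", 2 at "Park", 1 at "Gym"
toy <- data.frame(
  venueCategory = c("Bar", "Bar", "Bar", "Park", "Park", "Gym"),
  stringsAsFactors = FALSE
)

# Share of the top 2 categories: (3 + 2) / 6 * 100
top_n_share(toy, n = 2)
```

Applied to the real data, top_n_share(TKY) and top_n_share(NYC) would be expected to reproduce the two percentages, assuming they were computed from the same counts.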