This dataset contains check-ins from FourSquare users in New York City (NYC) and Tokyo (TKY), collected over about 10 months (from 12 April 2012 to 16 February 2013). It contains 227,428 check-ins in New York City and 573,703 check-ins in Tokyo. Each check-in is associated with its timestamp, its GPS coordinates and its semantic meaning (represented by fine-grained venue categories).
Calling str() on the Tokyo data gives a better understanding of its structure.
str(TKY)
## Classes 'tbl_df', 'tbl' and 'data.frame': 573703 obs. of 11 variables:
## $ userId : int 1541 868 114 868 1458 1541 1541 114 1635 2033 ...
## $ venueId : chr "4f0fd5a8e4b03856eeb6c8cb" "4b7b884ff964a5207d662fe3" "4c16fdda96040f477cc473a5" "4c178638c2dfc928651ea869" ...
## $ venueCategoryId: chr "4bf58dd8d48988d10c951735" "4bf58dd8d48988d1d1941735" "4d954b0ea243a5684a65b473" "4bf58dd8d48988d118951735" ...
## $ venueCategory : chr "Cosmetics Shop" "Ramen / Noodle House" "Convenience Store" "Food & Drink Shop" ...
## $ latitude : num 35.7 35.7 35.7 35.7 35.7 ...
## $ longitude : num 140 140 139 140 140 ...
## $ timezoneOffset : int 540 540 540 540 540 540 540 540 540 540 ...
## $ utcTimestamp : chr "Tue Apr 03 18:17:18 +0000 2012" "Tue Apr 03 18:22:04 +0000 2012" "Tue Apr 03 19:12:07 +0000 2012" "Tue Apr 03 19:12:13 +0000 2012" ...
## $ datetime : POSIXct, format: "2012-04-04 03:17:18" "2012-04-04 03:22:04" ...
## $ weekday : Factor w/ 7 levels "Sun","Mon","Tue",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ hour : int 3 3 4 4 4 4 4 4 4 4 ...
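Each user and each venue has a unique ID, and venues are grouped into categories of similar places. The location of each check-in (latitude and longitude) and its time are also recorded. The time is stored as a UTC string (utcTimestamp) together with the local offset in minutes (timezoneOffset); the datetime, weekday and hour columns hold the corresponding local time.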
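The preprocessing that produced datetime, weekday and hour is not shown, so the following is only a sketch of one way those columns could be derived from utcTimestamp and timezoneOffset. It uses base R and a toy data frame (the name `checkins` is hypothetical) built from two rows of the str() output above.

```r
# Use English day/month names so %a and %b parse regardless of the system locale
invisible(Sys.setlocale("LC_TIME", "C"))

# Toy rows copied from the str() output; `checkins` is a stand-in for TKY
checkins <- data.frame(
  utcTimestamp   = c("Tue Apr 03 18:17:18 +0000 2012",
                     "Tue Apr 03 19:12:07 +0000 2012"),
  timezoneOffset = c(540L, 540L),
  stringsAsFactors = FALSE
)

# Parse the UTC string into POSIXct
utc <- as.POSIXct(checkins$utcTimestamp,
                  format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")

# Shift by the offset (minutes to seconds). The result is local wall-clock
# time that is still labelled as UTC, which is enough for hour/weekday work.
checkins$datetime <- utc + checkins$timezoneOffset * 60

# Weekday and hour of the local time; wday is 0 for Sunday
day_levels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
local_lt   <- as.POSIXlt(checkins$datetime, tz = "UTC")
checkins$weekday <- factor(day_levels[local_lt$wday + 1], levels = day_levels)
checkins$hour    <- as.integer(local_lt$hour)

print(checkins[, c("datetime", "weekday", "hour")])
```

For the first row this gives a local time of 2012-04-04 03:17:18, a Wednesday (level 4 of the factor), at hour 3, which agrees with the values shown in the str() output.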
The bar charts below show the 10 venue categories with the most check-ins in TKY and NYC. Together, the top 10 categories account for 61.28% of all check-ins in TKY and 40.69% of all check-ins in NYC.
#Top Checkin Places (Total)
TKY %>%
  group_by(venueCategory) %>%
  summarise(Total = n()) %>%
  mutate(Proportion = Total / sum(Total) * 100) %>%
  arrange(desc(Total)) %>%
  head(10) %>%
  ggplot() +
  geom_bar(aes(x = reorder(venueCategory, Proportion), y = Proportion), stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Check-in Venue Categories by Tokyo FourSquare Users",
       subtitle = "Data from 12 April 2012 to 16 February 2013",
       y = "Proportion of all check-ins (percent)",
       x = "Venue Categories")
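The combined share quoted above for the top 10 categories (61.28% for TKY, 40.69% for NYC) can be computed directly from the category column. The sketch below shows one way to do it in base R; the helper `top_share` and the toy counts are hypothetical and only illustrate the arithmetic, they are not the real data.

```r
# Share (in percent) of all check-ins that fall in the n most frequent categories
top_share <- function(categories, n = 10) {
  counts <- sort(table(categories), decreasing = TRUE)
  sum(head(counts, n)) / sum(counts) * 100
}

# Toy vector standing in for TKY$venueCategory (counts are made up)
venue_category <- c(rep("Convenience Store", 5),
                    rep("Ramen / Noodle House", 3),
                    rep("Cosmetics Shop", 2))

# Top 2 categories cover 8 of the 10 toy check-ins
print(top_share(venue_category, n = 2))
```

With the real data, `top_share(TKY$venueCategory)` would be expected to return the Tokyo figure, assuming TKY is loaded as in the rest of this report.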
The line chart below shows how the number of check-ins by Tokyo FourSquare users varies with the hour of the day, with one line per weekday.
TKY %>%
  group_by(weekday, hour) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = hour, y = count, col = weekday)) +
  geom_line() +
  labs(y = "Total Count", x = "Hour", title = "Total Check-in Counts",
       subtitle = "Data from 12 April 2012 to 16 February 2013")