I chose to scrape Alexa.com, which ranks websites by traffic, because I thought it would be interesting to compare the number of visitors and the time spent on each website.
library(rvest)
library(dplyr)
library(tidyverse)
library(RSocrata)
url <- "https://www.alexa.com/topsites"
AlexaScrape <- function(x) {
  # Download and parse the page once, then reuse it for every column
  page <- read_html(x)
  # Every table row shares this XPath prefix; only the trailing cell differs
  base <- "/html/body/div/div/section/div/section/div[1]/section[2]/span/span/div/div/div[2]/div[*]"
  cell <- function(suffix) page %>% html_nodes(xpath = paste0(base, suffix)) %>% html_text()
  # Keep entries 2 to 51 of the rank cells, dropping the first match as in the original selection
  top <- data.frame(
    Ranking = cell("/div[1]")[2:51],
    Site = cell("/div[2]/p/a"),
    Average_Time_Spent_on_Site = cell("/div[3]/p"),
    Average_Daily_Visitors = cell("/div[4]/p"),
    stringsAsFactors = FALSE
  )
  return(top)
}
Top_50 <- map_df(url, AlexaScrape)
head(Top_50)
Top_50 <- Top_50 %>%
  # Strip the colon so "mm:ss" becomes the number mmss, then convert it to seconds
  mutate(Average_Time_Spent_on_Site = as.numeric(gsub(":", "", Average_Time_Spent_on_Site)),
         Average_Time_Spent_on_Site = (Average_Time_Spent_on_Site %/% 100) * 60 + Average_Time_Spent_on_Site %% 100,
         # Refer to the column directly instead of Top_50$... inside the pipe
         Average_Daily_Visitors = as.numeric(as.character(Average_Daily_Visitors)))
head(Top_50)
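The conversion above works by deleting the colon and then using integer division, which assumes every value is in "mm:ss" form. As a more explicit alternative, here is a minimal sketch that splits on the colon instead. The helper name `mmss_to_seconds` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: convert "mm:ss" strings to total seconds.
# Anything that does not split into exactly two numeric parts becomes NA.
mmss_to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- suppressWarnings(as.numeric(p))
    if (length(p) != 2 || anyNA(p)) return(NA_real_)
    p[1] * 60 + p[2]
  }, numeric(1))
}

# Example with made-up durations
example_seconds <- mmss_to_seconds(c("17:10", "3:05", "0:59"))
example_seconds
```

Applied to the scraped data, this could replace the `gsub()` and modulo steps with a single `mutate(Average_Time_Spent_on_Site = mmss_to_seconds(Average_Time_Spent_on_Site))`.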
Visited_Top_10 <- Top_50 %>%
  arrange(desc(Average_Daily_Visitors)) %>%
  # Name the weighting column explicitly; otherwise top_n() uses the last column
  top_n(10, Average_Daily_Visitors)
ggplot(Visited_Top_10) +
  geom_bar(mapping = aes(x = reorder(Site, Average_Daily_Visitors),
                         weight = Average_Daily_Visitors,
                         fill = Ranking)) +
  coord_flip() +
  ylab("Average Daily Visitors") +
  xlab("Site")
Time_Top_10 <- Top_50 %>%
  arrange(desc(Average_Time_Spent_on_Site)) %>%
  # Without wt, top_n() would pick by the last column (Average_Daily_Visitors)
  top_n(10, Average_Time_Spent_on_Site)
ggplot(Time_Top_10) +
  geom_bar(mapping = aes(x = reorder(Site, Average_Time_Spent_on_Site),
                         weight = Average_Time_Spent_on_Site,
                         fill = Ranking)) +
  coord_flip() +
  ylab("Average Time Spent on Site (seconds)") +
  xlab("Site")
ggplot(Top_50) +
  geom_point(aes(Average_Daily_Visitors,
                 Average_Time_Spent_on_Site),
             colour = "deeppink2") +
  xlab("Average Daily Visitors") +
  ylab("Average Time Spent on Site")
It is interesting that the ten most visited websites (globally) are not necessarily the ones ranked 1-10; the same is true of the ten websites with the greatest average time spent on site. What is not surprising is that, in the scatter plot, sites with more daily visitors also tend to show more time spent on site. A plausible reading is that the more popular a website is, the more users it has, and that these users generally spend more time on it.
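The relationship read off the scatter plot can also be checked numerically. The sketch below computes a Spearman rank correlation on a small toy data frame; `toy` and its values are invented for illustration and merely stand in for `Top_50`.

```r
# Toy data standing in for Top_50; the values are invented for illustration only.
toy <- data.frame(
  Average_Daily_Visitors = c(3.2, 8.1, 5.4, 12.0, 2.1, 6.7),
  Average_Time_Spent_on_Site = c(190, 510, 300, 720, 150, 330)
)

# Spearman's rho measures how consistently one variable rises with the other,
# without assuming the relationship is linear.
rho <- cor(toy$Average_Daily_Visitors,
           toy$Average_Time_Spent_on_Site,
           method = "spearman",
           use = "complete.obs")
rho
```

On the scraped data the same call would take `Top_50$Average_Daily_Visitors` and `Top_50$Average_Time_Spent_on_Site`; a value near 1 would support the reading above, while a value near 0 would suggest the pattern in the plot is weak.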
df <- read.socrata("https://data.cityofnewyork.us/api/views/varh-9tsp/rows.json?accessType=DOWNLOAD")
sapply(df,class)
## boro the_geom objectid type provider name
## "character" "character" "integer" "character" "character" "character"
## location lat lon x y location_t
## "character" "numeric" "numeric" "numeric" "numeric" "character"
## remarks city ssid sourceid activated borocode
## "character" "character" "character" "character" "character" "integer"
## boroname ntacode ntaname coundist postcode borocd
## "character" "character" "character" "integer" "integer" "integer"
## ct2010 boroct2010 bin bbl doitt_id
## "integer" "integer" "integer" "numeric" "integer"
library(ggmap)
qmplot(lon, lat,
data = df,
color = boro)
In this plot, what stands out is that the density of hotspot locations is greatest in the borough of Manhattan and very sparse in Staten Island.
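The visual impression of density can be backed up with a simple count per borough. This is a minimal sketch on toy data; `toy_hotspots` and its borough codes are assumptions made for illustration, not values taken from the real `df`.

```r
# Toy data standing in for df; the borough codes are assumed for illustration.
toy_hotspots <- data.frame(
  boro = c("MN", "MN", "BK", "QU", "MN", "SI", "BK", "BX")
)

# Count hotspots per borough and sort from most to fewest
boro_counts <- sort(table(toy_hotspots$boro), decreasing = TRUE)
boro_counts
```

With the real data, `sort(table(df$boro), decreasing = TRUE)` would give the corresponding counts.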
qmplot(lon, lat, data = df, color = type) +
facet_wrap(~ type)
Here, it is clear that the most common type of Wi-Fi is Free, with Limited Free second, and no visible Partner Site hotspots.
If I were visiting New York and considering access to Wi-Fi, I would choose Manhattan, since it has a high density of Free Wi-Fi hotspots.
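The comparison of boroughs by Wi-Fi type can be made concrete with a cross-tabulation. The sketch below uses invented toy data; `toy_hotspots` and its borough codes are hypothetical, while the type labels follow the ones named above.

```r
# Toy data standing in for df; rows are invented for illustration only.
toy_hotspots <- data.frame(
  boro = c("MN", "MN", "MN", "BK", "BK", "QU", "SI"),
  type = c("Free", "Free", "Limited Free", "Free",
           "Limited Free", "Free", "Limited Free")
)

# Hotspot counts for every borough / type combination
by_boro_type <- table(toy_hotspots$boro, toy_hotspots$type)
by_boro_type

# Share of each borough's hotspots that are of type "Free"
free_share <- prop.table(by_boro_type, margin = 1)[, "Free"]
free_share
```

On the real data, `table(df$boro, df$type)` would show both the raw count of Free hotspots per borough and, through `prop.table()`, how much of each borough's coverage is Free.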
D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf