Analyzing Alexa Rankings

I chose to scrape Alexa.com, which publishes a ranking of websites, because I thought it would be interesting to compare the number of visitors and the time spent on each site.

library(rvest)
library(dplyr)
library(tidyverse)
library(RSocrata)
url <- "https://www.alexa.com/topsites"

AlexaScrape <- function(x) {
  # Download and parse the page once, then reuse it for every column
  page <- read_html(x)
  # rank keeps the first 51 matches here; the first one is dropped in cbind() below
  rank <- (page %>% html_nodes(xpath = "/html/body/div/div/section/div/section/div[1]/section[2]/span/span/div/div/div[2]/div[*]/div[1]") %>% html_text() %>% as.data.frame())[1:51,1]
  site <- page %>% html_nodes(xpath = "/html/body/div/div/section/div/section/div[1]/section[2]/span/span/div/div/div[2]/div[*]/div[2]/p/a") %>% html_text() %>% as.data.frame()
  time <- page %>% html_nodes(xpath = "/html/body/div/div/section/div/section/div[1]/section[2]/span/span/div/div/div[2]/div[*]/div[3]/p") %>% html_text() %>% as.data.frame()
  visitors <- page %>% html_nodes(xpath = "/html/body/div/div/section/div/section/div[1]/section[2]/span/span/div/div/div[2]/div[*]/div[4]/p") %>% html_text() %>% as.data.frame()
  
  
  top <- cbind(rank[2:51], site, time, visitors)
  names(top) <- c("Ranking", "Site", "Average_Time_Spent_on_Site", "Average_Daily_Visitors")
  return(top)
}
Top_50 <- map_df(url, AlexaScrape)
head(Top_50)
# Convert "MM:SS" to seconds (drop the colon, then minutes * 60 + seconds)
# and parse the visitor column as numeric
Top_50 <- Top_50 %>% 
  mutate(Average_Time_Spent_on_Site = as.numeric(gsub(":", "", Average_Time_Spent_on_Site)),
         Average_Time_Spent_on_Site = (Average_Time_Spent_on_Site %/% 100 * 60) + Average_Time_Spent_on_Site %% 100,
         Average_Daily_Visitors = as.numeric(as.character(Average_Daily_Visitors)))
head(Top_50)
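
The conversion above treats the "MM:SS" string as a plain number once the colon is dropped, then splits it back into minutes and seconds with integer division and remainder. A minimal sketch of that arithmetic as a standalone helper follows; the name `mmss_to_seconds` and the sample values are made up for illustration and are not part of the scraped data.

```r
# Hypothetical helper: convert "MM:SS" strings to seconds using the same
# arithmetic as the mutate() step in the analysis.
mmss_to_seconds <- function(x) {
  digits <- as.numeric(gsub(":", "", x))   # "10:35" -> 1035
  (digits %/% 100) * 60 + digits %% 100    # 10 * 60 + 35
}

mmss_to_seconds(c("10:35", "3:07", "0:59"))
# 635 187 59
```

Note that this only holds while the seconds part has exactly two digits and there is no hours component; a value such as "1:02:03" would need different handling.
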
Visited_Top_10 <- Top_50 %>% 
  arrange(desc(Average_Daily_Visitors)) %>%
  top_n(10, Average_Daily_Visitors)  # name the ranking column explicitly

ggplot(Visited_Top_10) + 
  geom_bar(mapping = aes(x = reorder(Site, Average_Daily_Visitors),
                         weight = Average_Daily_Visitors,
                         fill = Ranking)) +
  coord_flip() +
  ylab("Average Daily Visitors") +
  xlab("Site")

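Time_Top_10 <- Top_50 %>% 
  arrange(desc(Average_Time_Spent_on_Site)) %>%
  # without an explicit column, top_n() ranks by the last column (Average_Daily_Visitors)
  top_n(10, Average_Time_Spent_on_Site)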

ggplot(Time_Top_10) +
  geom_bar(mapping = aes(x = reorder(Site, Average_Time_Spent_on_Site),
                         weight = Average_Time_Spent_on_Site, fill = Ranking)) +
  coord_flip() +
  ylab("Average Time Spent on Site (seconds)") +
  xlab("Site")

ggplot(Top_50) + 
  geom_point(aes(Average_Daily_Visitors,
                 Average_Time_Spent_on_Site),
             colour = "deeppink2") +
  xlab("Average Daily Visitors") +
  ylab("Average Time Spent on Site (seconds)")

It is interesting to see that the ten most visited websites (globally) are not necessarily the ones ranked 1-10; the same is true for the ten websites with the greatest average time spent on site. What is not surprising is that the more daily visitors a site has, the greater the time spent on it. We can assume that the more popular a website is, the more users it will have, and that these users will generally spend more time on it.
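
The relationship read off the scatter plot can also be summarised with a single number. The sketch below computes Spearman's rank correlation between the two columns; the `toy` data frame and its values are invented purely so the example runs on its own, and in the analysis the same call would be made on `Top_50`.

```r
# Illustrative values only; they are NOT the scraped Alexa figures.
toy <- data.frame(
  Average_Daily_Visitors     = c(2.1, 3.4, 4.8, 6.0, 9.5),
  Average_Time_Spent_on_Site = c(120, 310, 250, 480, 900)
)

# Spearman's rank correlation is less sensitive to a few very large sites
# than Pearson's; rows with NA in either column are dropped.
rho <- cor(toy$Average_Daily_Visitors,
           toy$Average_Time_Spent_on_Site,
           method = "spearman",
           use = "complete.obs")
rho
```

A value close to 1 would support the reading of the plot, while a value near 0 would suggest the apparent trend is driven by only a few sites.
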

Wi-Fi in the City of New York

df <- read.socrata("https://data.cityofnewyork.us/api/views/varh-9tsp/rows.json?accessType=DOWNLOAD")
sapply(df,class)
##        boro    the_geom    objectid        type    provider        name 
## "character" "character"   "integer" "character" "character" "character" 
##    location         lat         lon           x           y  location_t 
## "character"   "numeric"   "numeric"   "numeric"   "numeric" "character" 
##     remarks        city        ssid    sourceid   activated    borocode 
## "character" "character" "character" "character" "character"   "integer" 
##    boroname     ntacode     ntaname    coundist    postcode      borocd 
## "character" "character" "character"   "integer"   "integer"   "integer" 
##      ct2010  boroct2010         bin         bbl    doitt_id 
##   "integer"   "integer"   "integer"   "numeric"   "integer"
library(ggmap)
qmplot(lon, lat,
       data = df,
       color = boro)

In this plot, what stands out is that the density of hotspot locations is greatest in the borough of Manhattan, while hotspots are very sparse in Staten Island.

qmplot(lon, lat, data = df, color = type) +
  facet_wrap(~ type)

Here, it is clear that the most common type of Wi-Fi is Free, with Limited Free coming in second, and no visible Partner Site types.

If I were planning a visit to New York and considering access to Wi-Fi, I would choose Manhattan, since it has a high density of Free Wi-Fi.
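
The observations about hotspot type and borough are read off the maps, but they can also be checked with simple counts. The sketch below tabulates hotspots by `type` and by `boro`; the `toy` rows and their borough codes are made up so the example is self-contained, and the real check would run the same calls on `df`.

```r
# Made-up rows standing in for df; only the `boro` and `type` columns are used.
toy <- data.frame(
  boro = c("MN", "MN", "MN", "BK", "BK", "QU", "SI"),
  type = c("Free", "Free", "Limited Free", "Free", "Limited Free", "Free", "Free")
)

# Hotspots per access type, most common first.
type_counts <- sort(table(toy$type), decreasing = TRUE)
type_counts

# Hotspots per borough and type, to compare boroughs directly.
boro_by_type <- table(toy$boro, toy$type)
boro_by_type
```

Raw counts favour larger boroughs, so dividing by area or population would give a fairer picture of density than the counts alone.
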

Citations

D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf