DATA607-Week07-Homework

Libraries used

library(knitr)
library(dplyr)
library(rvest)
library(stringr)
library(jsonlite)
library(tibble)

In this assignment, we will load data from three different file formats. The data files are downloaded from NYC Open Data Plan - Scheduled Releases 2016. This inventory includes all datasets scheduled for release between July 2016 and December 31, 2018. Additional information can be found at https://data.cityofnewyork.us/City-Government/NYC-Open-Data-Plan-Scheduled-Releases-2016/tyjc-nqc2.

File Type    Link
HTML         https://data.cityofnewyork.us/api/views/tyjc-nqc2/rows.html?accessType=DOWNLOAD
XML          https://data.cityofnewyork.us/api/views/tyjc-nqc2/rows.xml?accessType=DOWNLOAD
JSON         https://data.cityofnewyork.us/api/views/tyjc-nqc2/rows.json?accessType=DOWNLOAD
Meta Data    https://data.cityofnewyork.us/City-Government/NYC-Open-Data-Plan-Scheduled-Releases-2016/tyjc-nqc2

HTML Files

Processing HTML files. We will use the rvest package with the NYC Open Data Plan - Scheduled Releases 2016 dataset. Because HTML pages are designed to display data in a browser, the actual data is wrapped in formatting tags.
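As a minimal sketch of what "wrapped in formatting tags" means, the snippet below parses a small hand-written table and pulls the text out of each row's cells. The markup is a hypothetical stand-in for the downloaded page, not the real data, and the names sample.html, sample.rows and sample.cells are illustrative only.

```r
library(rvest)

# Hypothetical two-row table standing in for the downloaded page
sample.html <- read_html("<table>
  <tr><th>agency</th><th>dataset</th></tr>
  <tr><td>DOT</td><td>Bridge ratings</td></tr>
</table>")

# Each <tr> node is one table row
sample.rows <- html_nodes(sample.html, "tr")

# Within a row, the values sit inside <th> (header) or <td> (data) tags
sample.cells <- lapply(sample.rows, function(r) html_text(html_nodes(r, "th, td")))

# Cell text of the first data row
print(sample.cells[[2]])
```

Extracting cell by cell keeps each value separate from the start, so no delimiter has to be introduced and split on afterwards.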

#Read data from website
nyc.html.raw.data <- read_html("https://data.cityofnewyork.us/api/views/tyjc-nqc2/rows.html")

#If the file has been downloaded to the local drive, it can be read the same way it is read from the website.
#nyc.html.raw.data <- read_html(file.path(getwd(), "rows.html"))

#Extract nodes information
nyc.html.nodes <- nyc.html.raw.data %>% 
      html_nodes("tr")

#Extract data from each node
nyc.html.text <- nyc.html.nodes %>% 
  html_text()

#The data needs some tidying because it contains newline (\n) characters. Replace each (\n) with a pipe delimiter.
nyc.html.text <- gsub(pattern = "\\n", replacement = "|", nyc.html.text)

#First row is column heading. Remove the row
nyc.html.text <- nyc.html.text[2:length(nyc.html.text)]

#Convert dataset into data frame
nyc.html.data.frame <- data.frame(unlist(str_split_fixed(nyc.html.text, "\\|", 7)), stringsAsFactors = F)

#Remove unused columns
nyc.html.data.frame$X7 <- NULL

#Rename the columns accordingly
nyc.html.data.frame <- rename(nyc.html.data.frame, agency = X1, dataset = X2, dataset_description = X3, update_frequency = X4, planned_release_date = X5, urlLink = X6)

#Add a row number to serve as the record id (cid)
nyc.html.data.frame <- nyc.html.data.frame %>% mutate(cid = row_number())

#Display data
nyc.html.data.frame %>% filter(cid < 16) %>% select(cid, agency, dataset, dataset_description, update_frequency, planned_release_date, urlLink) %>% 
kable(format="pandoc")
cid agency dataset dataset_description update_frequency planned_release_date urlLink
1 Department of Transportation (DOT) Adopted Highway Service Ratings NYC highways that receive a cleanliness rating of good (%). To Be Determined 07/15/2016 https://data.cityofnewyork.us/Transportation/Adopted-Highway-Service-Ratings-Adopt-a-Highway-Hi/dte3-kvx7
2 Department of Transportation (DOT) Bicycle network connectivity index Bicycle network connectivity index. To Be Determined 07/15/2016 https://data.cityofnewyork.us/Transportation/Bicycle-Network-Connectivity-Index/d9fg-z42k
3 Department of Transportation (DOT) Bridge ratings Bridges rated good or very good (%) (calendar year). To Be Determined 07/15/2016 https://data.cityofnewyork.us/Transportation/Bridge-Ratings/9dux-uz3w
4 Department of Transportation (DOT) Pothole work orders Average time to close a pothole work order where repair was done (days). To Be Determined 07/15/2016 https://data.cityofnewyork.us/Transportation/Pothole-Work-Orders-Closed/psde-rqze
5 Department of Transportation (DOT) Future Protected Streets - Intersections List of Future Protected Streets for intersections. Daily 07/15/2016 https://data.cityofnewyork.us/Transportation/Future-Protected-Streets-Intersections/yupw-u2ax
6 Department of Transportation (DOT) Future Protected Streets - Segments List of Future Protected Streets for segments. Daily 07/15/2016 https://data.cityofnewyork.us/Transportation/Future-Protected-Streets-Segments/pnij-y7y6
7 Department of Transportation (DOT) Protected Streets - Intersections Current list of Protected Streets for intersections. Daily 07/15/2016 https://data.cityofnewyork.us/Transportation/Protected-Streets-Intersections/hfa3-euj3
8 Department of Transportation (DOT) Protected Streets - Segments Current list of Protected Streets for segments. Daily 07/15/2016 https://data.cityofnewyork.us/Transportation/Protected-Streets-Segments/9p9k-tusd
9 New York City Housing Authority (NYCHA) Electric Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 07/15/2016 https://data.cityofnewyork.us/Housing-Development/Electric-Consumption-And-Cost-2016/sd8e-3ugp
10 New York City Housing Authority (NYCHA) Water Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 07/15/2016 https://data.cityofnewyork.us/Housing-Development/Water-Consumption-And-Cost-2012-2016/66be-66yr
11 New York City Housing Authority (NYCHA) Heating Gas Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 07/15/2016 https://data.cityofnewyork.us/Housing-Development/Heating-Gas-Consumption-And-Cost-2010-2016/it56-eyq4
12 New York City Housing Authority (NYCHA) Cooking Gas Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 07/15/2016 https://data.cityofnewyork.us/Housing-Development/Cooking-Gas-Consumption-And-Cost-2016/b8vr-3ckz
13 New York City Housing Authority (NYCHA) Oil Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 07/15/2016 https://data.cityofnewyork.us/Housing-Development/Heating-Oil-Consumption-And-Cost-2010-2016/bhwu-wuzu
14 New York City Housing Authority (NYCHA) Steam Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 07/15/2016 https://data.cityofnewyork.us/Housing-Development/Steam-Consumption-And-Cost-2010-2016/smdw-73pj
15 Landmarks Preservation Commission (LPC) Requests for Evaluation Tabular dataset containing property information, request dates, determinations and determination dates, beginning in 2001. To Be Determined 07/29/2016
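
When a page holds a regular table, rvest's html_table() can build the data frame directly, which avoids splitting row text on newline characters. This is offered as an alternative sketch on hypothetical inline markup, not as the method used above; whether it handles the real page equally well has not been checked here. The names sample.page and sample.table are illustrative.

```r
library(rvest)

# Hypothetical markup with the same kind of layout as the downloaded page
sample.page <- read_html("<table>
  <tr><th>agency</th><th>dataset</th><th>planned_release_date</th></tr>
  <tr><td>DOT</td><td>Bridge ratings</td><td>07/15/2016</td></tr>
  <tr><td>NYCHA</td><td>Water Consumption and Cost</td><td>07/15/2016</td></tr>
</table>")

# html_table() uses the <th> row as column names and each <td> row as a record
sample.table <- sample.page %>%
  html_node("table") %>%
  html_table()

# Work with a plain data frame and add a row id, as in the steps above
sample.table <- as.data.frame(sample.table, stringsAsFactors = FALSE)
sample.table$cid <- seq_len(nrow(sample.table))

print(sample.table)
```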

The next exercise shows how to extract the attributes and text of a particular node and then combine them into a single row.

#Extract the node that contains the hyperlink. The h3 node contains the hyperlink.
nyc.hyperlink.raw.data <- nyc.html.raw.data %>% 
      html_nodes("h3")

#Get the child node. The <a> node holds the hyperlink as an attribute and the title as its text. In our case there is only one such node.
nyc.hyperlink.node.data <- html_nodes(nyc.hyperlink.raw.data[1], "a")

#Get attribute and text values using a function
attributes_text <- function(x){
  a <- html_attrs(x)
  t <- html_text(x)
  
  #Tidy attribute values  
  a <- gsub("\\n","",a)
  a <- gsub("\\t","",a)
  a <- gsub("href","",a)
  a <- unlist(a)
  #When a node contains more than one attribute, combine them into a single value
  a <- paste(a,collapse="|")
  
  #Tidy text values 
  t <- gsub("\\n","",t)
  t <- gsub("\\t","",t)
  t <- gsub("href","",t)
  t <- unlist(t)
  #When a node contains more than one text value, combine them into a single value
  t<-paste(t,collapse="|")
  
  final<-c(t,a)
  final<-unlist(final)
  #Combine attributes and text values
  final<-paste(final,collapse="|")
  return(final)
}

#Display information
nyc.hyperlink.node.data
## {xml_nodeset (1)}
## [1] <a href="https://data.cityofnewyork.us/dataset/tyjc-nqc2">NYC Open D ...
nyc.hyperlink.node.data %>% 
  html_text()
## [1] "NYC Open Data Plan - Scheduled Releases 2016"
nyc.hyperlink.node.data %>% 
  html_attrs()
## [[1]]
##                                              href 
## "https://data.cityofnewyork.us/dataset/tyjc-nqc2"
#Attribute and text combined (pipe delimited)
single_row<-attributes_text(nyc.hyperlink.node.data[1])
single_row
## [1] "NYC Open Data Plan - Scheduled Releases 2016|https://data.cityofnewyork.us/dataset/tyjc-nqc2"
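
For a single known attribute, the same pipe-delimited row can be sketched more directly with html_attr() and html_text(). The markup, the example.org URL, and the names sample.page, link.node and single.row below are hypothetical and only illustrate the idea; the function above remains the more general version because it collects every attribute of the node.

```r
library(rvest)

# Hypothetical heading with one hyperlink, mirroring the h3 > a layout above
sample.page <- read_html('<h3><a href="https://example.org/dataset">Sample dataset title</a></h3>')

# Select the <a> node nested inside the <h3> node
link.node <- html_nodes(sample.page, "h3 a")

# The text is the node's value, the URL is its href attribute
link.text <- html_text(link.node)
link.href <- html_attr(link.node, "href")

# Combine text and attribute into one pipe-delimited row
single.row <- paste(link.text, link.href, sep = "|")
print(single.row)
```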

XML files

Processing XML files. We will use the rvest package with the NYC Open Data Plan - Scheduled Releases 2016 dataset. XML files are data files in which values can be stored either as attributes or as text. If we look at lines 3 and 4 of the XML file, line 3 holds the attributes of a record inside its opening row tag (for example _id), whereas line 4 holds field data stored as text inside that row.

#Read data from website
nyc.xml.raw.data <- read_html("https://data.cityofnewyork.us/api/views/tyjc-nqc2/rows.xml")

#If the file has been downloaded to the local drive, it can be read the same way it is read from the website.
#nyc.xml.raw.data <- read_html(file.path(getwd(), "rows.xml"))

#Extract nodes information
nyc.xml.nodes <- nyc.xml.raw.data %>% 
      html_nodes(xpath = "//row//row")

nyc.xml.data.frame <- data.frame(
  agency = nyc.xml.nodes %>%  html_nodes(xpath = "//agency") %>% html_text(),
  dataset = nyc.xml.nodes %>%  html_nodes(xpath = "//dataset") %>% html_text(),
  dataset_description = nyc.xml.nodes %>%  html_nodes(xpath = "//dataset_description") %>% html_text(),
  update_frequency = nyc.xml.nodes %>%  html_nodes(xpath = "//update_frequency") %>% html_text(),
  planned_release_date = nyc.xml.nodes %>%  html_nodes(xpath = "//planned_release_date") %>% html_text(),
  cid = nyc.xml.nodes %>% html_attr("_id") %>% as.integer(),
  stringsAsFactors = F
  )

#Link for the dataset
#Not all datasets have links
#Initialize an empty data frame
nyc.xml.url.frame <- data.frame(cid = NA, urlLink = NA)
for (i in seq_along(nyc.xml.nodes)){
    
    # Get node Id info
    nodeId <-html_attr(nyc.xml.nodes[i], "_id") %>% as.integer()
    
    #Get Url Info
    allText <- nyc.xml.nodes[i] %>% html_text()
    url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
    dataSetUrl <- str_extract(allText, url_pattern)
    
    #Load data into data frame
    nyc.xml.url.frame <- rbind(nyc.xml.url.frame, c(nodeId,dataSetUrl))
}

nyc.xml.url.frame <- nyc.xml.url.frame %>% filter(!is.na(cid)) %>% select(cid,urlLink)
nyc.xml.url.frame$cid <- as.numeric(nyc.xml.url.frame$cid)

nyc.xml.complete.data <- nyc.xml.data.frame %>% 
  inner_join(nyc.xml.url.frame, by = "cid") %>% 
  arrange(cid)

#Display data
nyc.xml.complete.data %>% filter(cid < 16) %>% select(cid, agency, dataset, dataset_description, update_frequency, planned_release_date, urlLink) %>% 
kable(format="pandoc")
cid agency dataset dataset_description update_frequency planned_release_date urlLink
1 Department of Transportation (DOT) Adopted Highway Service Ratings NYC highways that receive a cleanliness rating of good (%). To Be Determined 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Adopted-Highway-Service-Ratings-Adopt-a-Highway-Hi/dte3-kvx7
2 Department of Transportation (DOT) Bicycle network connectivity index Bicycle network connectivity index. To Be Determined 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Bicycle-Network-Connectivity-Index/d9fg-z42k
3 Department of Transportation (DOT) Bridge ratings Bridges rated good or very good (%) (calendar year). To Be Determined 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Bridge-Ratings/9dux-uz3w
4 Department of Transportation (DOT) Pothole work orders Average time to close a pothole work order where repair was done (days). To Be Determined 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Pothole-Work-Orders-Closed/psde-rqze
5 Department of Transportation (DOT) Future Protected Streets - Intersections List of Future Protected Streets for intersections. Daily 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Future-Protected-Streets-Intersections/yupw-u2ax
6 Department of Transportation (DOT) Future Protected Streets - Segments List of Future Protected Streets for segments. Daily 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Future-Protected-Streets-Segments/pnij-y7y6
7 Department of Transportation (DOT) Protected Streets - Intersections Current list of Protected Streets for intersections. Daily 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Protected-Streets-Intersections/hfa3-euj3
8 Department of Transportation (DOT) Protected Streets - Segments Current list of Protected Streets for segments. Daily 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Protected-Streets-Segments/9p9k-tusd
9 New York City Housing Authority (NYCHA) Electric Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 2016-07-15T00:00:00 https://data.cityofnewyork.us/Housing-Development/Electric-Consumption-And-Cost-2016/sd8e-3ugp
10 New York City Housing Authority (NYCHA) Water Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 2016-07-15T00:00:00 https://data.cityofnewyork.us/Housing-Development/Water-Consumption-And-Cost-2012-2016/66be-66yr
11 New York City Housing Authority (NYCHA) Heating Gas Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 2016-07-15T00:00:00 https://data.cityofnewyork.us/Housing-Development/Heating-Gas-Consumption-And-Cost-2010-2016/it56-eyq4
12 New York City Housing Authority (NYCHA) Cooking Gas Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 2016-07-15T00:00:00 https://data.cityofnewyork.us/Housing-Development/Cooking-Gas-Consumption-And-Cost-2016/b8vr-3ckz
13 New York City Housing Authority (NYCHA) Oil Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 2016-07-15T00:00:00 https://data.cityofnewyork.us/Housing-Development/Heating-Oil-Consumption-And-Cost-2010-2016/bhwu-wuzu
14 New York City Housing Authority (NYCHA) Steam Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 2016-07-15T00:00:00 https://data.cityofnewyork.us/Housing-Development/Steam-Consumption-And-Cost-2010-2016/smdw-73pj
15 Landmarks Preservation Commission (LPC) Requests for Evaluation Tabular dataset containing property information, request dates, determinations and determination dates, beginning in 2001. To Be Determined 2016-07-29T00:00:00 NA
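
The attribute-versus-text distinction, and the case of a record with no link, can also be sketched with xml2 (the package rvest is built on) using XPath expressions relative to each row. This is an illustrative alternative on hypothetical inline XML, not the code used above: the element name dataset_url and the names sample.xml, sample.rows, field and sample.frame are assumptions made for the example.

```r
library(xml2)

# Hypothetical XML with the nested row layout; the second record has no link
sample.xml <- read_xml('<response><row>
  <row _id="1"><agency>DOT</agency><dataset>Bridge ratings</dataset><dataset_url>https://example.org/a</dataset_url></row>
  <row _id="2"><agency>LPC</agency><dataset>Requests for Evaluation</dataset></row>
</row></response>')

# Inner row elements are the records
sample.rows <- xml_find_all(sample.xml, "//row/row")

# Helper: text of one child element per row; xml_find_first() keeps one
# result per row, so a missing child yields NA instead of being dropped
field <- function(name) xml_text(xml_find_first(sample.rows, paste0("./", name)))

sample.frame <- data.frame(
  cid = as.integer(xml_attr(sample.rows, "_id")),  # stored as an attribute
  agency = field("agency"),                        # stored as element text
  dataset = field("dataset"),
  urlLink = field("dataset_url"),
  stringsAsFactors = FALSE
)

print(sample.frame)
```

Because every column has exactly one entry per row, the link column lines up with the other fields without a separate loop and join.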

JSON files

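Processing JSON files. We will use the jsonlite package with the NYC Open Data Plan - Scheduled Releases 2016 dataset. JSON files are data files in which values are stored as nested name/value pairs and arrays; in this file the records sit under data and the descriptive information under meta.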

#JSON data from website
nyc.json.raw.data <- fromJSON("https://data.cityofnewyork.us/api/views/tyjc-nqc2/rows.json", flatten = T)

#If the file has been downloaded to the local drive, it can be read the same way it is read from the website.
#nyc.json.raw.data <- fromJSON(file.path(getwd(), "rows.json"), flatten = T)

#Load data into data frame
nyc.json.data.frame <- data.frame(nyc.json.raw.data[['data']])

#Metadata for the JSON data 
nyc.json.meta.data <- nyc.json.raw.data[['meta']]

#Select the needed columns and rename them
nyc.json.data.frame <- nyc.json.data.frame %>% select(cid = X1, agency = X9, dataset = X10, dataset_description = X11, update_frequency = X12, planned_release_date = X13, urlLink = X14)

nyc.json.data.frame$cid <- as.numeric(as.character(nyc.json.data.frame$cid))

#Display data
nyc.json.data.frame %>% filter(cid < 16) %>% select(cid, agency, dataset, dataset_description, update_frequency, planned_release_date, urlLink) %>% 
kable(format="pandoc")
cid agency dataset dataset_description update_frequency planned_release_date urlLink
1 Department of Transportation (DOT) Adopted Highway Service Ratings NYC highways that receive a cleanliness rating of good (%). To Be Determined 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Adopted-Highway-Service-Ratings-Adopt-a-Highway-Hi/dte3-kvx7
2 Department of Transportation (DOT) Bicycle network connectivity index Bicycle network connectivity index. To Be Determined 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Bicycle-Network-Connectivity-Index/d9fg-z42k
3 Department of Transportation (DOT) Bridge ratings Bridges rated good or very good (%) (calendar year). To Be Determined 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Bridge-Ratings/9dux-uz3w
4 Department of Transportation (DOT) Pothole work orders Average time to close a pothole work order where repair was done (days). To Be Determined 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Pothole-Work-Orders-Closed/psde-rqze
5 Department of Transportation (DOT) Future Protected Streets - Intersections List of Future Protected Streets for intersections. Daily 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Future-Protected-Streets-Intersections/yupw-u2ax
6 Department of Transportation (DOT) Future Protected Streets - Segments List of Future Protected Streets for segments. Daily 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Future-Protected-Streets-Segments/pnij-y7y6
7 Department of Transportation (DOT) Protected Streets - Intersections Current list of Protected Streets for intersections. Daily 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Protected-Streets-Intersections/hfa3-euj3
8 Department of Transportation (DOT) Protected Streets - Segments Current list of Protected Streets for segments. Daily 2016-07-15T00:00:00 https://data.cityofnewyork.us/Transportation/Protected-Streets-Segments/9p9k-tusd
9 New York City Housing Authority (NYCHA) Electric Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 2016-07-15T00:00:00 https://data.cityofnewyork.us/Housing-Development/Electric-Consumption-And-Cost-2016/sd8e-3ugp
10 New York City Housing Authority (NYCHA) Water Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 2016-07-15T00:00:00 https://data.cityofnewyork.us/Housing-Development/Water-Consumption-And-Cost-2012-2016/66be-66yr
11 New York City Housing Authority (NYCHA) Heating Gas Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 2016-07-15T00:00:00 https://data.cityofnewyork.us/Housing-Development/Heating-Gas-Consumption-And-Cost-2010-2016/it56-eyq4
12 New York City Housing Authority (NYCHA) Cooking Gas Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 2016-07-15T00:00:00 https://data.cityofnewyork.us/Housing-Development/Cooking-Gas-Consumption-And-Cost-2016/b8vr-3ckz
13 New York City Housing Authority (NYCHA) Oil Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 2016-07-15T00:00:00 https://data.cityofnewyork.us/Housing-Development/Heating-Oil-Consumption-And-Cost-2010-2016/bhwu-wuzu
14 New York City Housing Authority (NYCHA) Steam Consumption and Cost Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. Quarterly 2016-07-15T00:00:00 https://data.cityofnewyork.us/Housing-Development/Steam-Consumption-And-Cost-2010-2016/smdw-73pj
15 Landmarks Preservation Commission (LPC) Requests for Evaluation Tabular dataset containing property information, request dates, determinations and determination dates, beginning in 2001. To Be Determined 2016-07-29T00:00:00 NA
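
The JSON steps above pick columns by position (X1, X9, ...). A possible refinement, sketched here on a small inline document, is to take the column names from the metadata instead. The meta > view > columns > name layout used below is an assumption about how the file is organized and should be confirmed on the real object (for example with str(nyc.json.meta.data, max.level = 2)) before relying on it; all sample.* names are illustrative.

```r
library(jsonlite)

# Hypothetical document with a data block and a metadata block
sample.json <- '{
  "meta": {"view": {"columns": [{"name": "cid"}, {"name": "agency"}, {"name": "dataset"}]}},
  "data": [["1", "DOT", "Bridge ratings"],
           ["2", "NYCHA", "Water Consumption and Cost"]]
}'

# fromJSON() accepts a JSON string as well as a URL or file path
sample.raw <- fromJSON(sample.json)

# Equal-length arrays of strings are simplified to a character matrix
sample.frame <- as.data.frame(sample.raw[["data"]], stringsAsFactors = FALSE)

# Assumed layout: one metadata entry per column, in column order
names(sample.frame) <- sample.raw[["meta"]][["view"]][["columns"]][["name"]]

# Values arrive as text, so convert the id to a number
sample.frame$cid <- as.numeric(sample.frame$cid)

print(sample.frame)
```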