Libraries used
library(knitr)
library(dplyr)
library(rvest)
library(stringr)
library(jsonlite)
library(tibble)
In this assignment, we will load data from three different data files (HTML, XML, and JSON). The data files are downloaded from NYC Open Data Plan - Scheduled Releases 2016. This inventory includes all datasets scheduled for release between July 2016 and December 31, 2018. Additional information can be found at https://data.cityofnewyork.us/City-Government/NYC-Open-Data-Plan-Scheduled-Releases-2016/tyjc-nqc2.
Processing HTML files. We will be using the rvest package and the NYC Open Data Plan - Scheduled Releases 2016 dataset. Because HTML pages are designed to display data in a browser, the actual data is wrapped in formatting tags.
#Read data from website
nyc.html.raw.data <- read_html("https://data.cityofnewyork.us/api/views/tyjc-nqc2/rows.html")
#If the file has been downloaded and saved to a local drive, it can be read the same way it is read from the website.
#nyc.html.raw.data <- read_html(file.path(getwd(), "rows.html"))
#Extract nodes information
nyc.html.nodes <- nyc.html.raw.data %>%
html_nodes("tr")
#Extract data from each node
nyc.html.text <- nyc.html.nodes %>%
html_text()
#Data needs some tidying as it contains newline (\n) characters. Replace each (\n) with a pipe delimiter.
nyc.html.text <- gsub(pattern = "\\n", replacement = "|", nyc.html.text)
#First row is the column heading. Remove that row
nyc.html.text <- nyc.html.text[2:length(nyc.html.text)]
#Convert dataset into data frame
nyc.html.data.frame <- data.frame(unlist(str_split_fixed(nyc.html.text, "\\|", 7)), stringsAsFactors = F)
#Remove unused columns
nyc.html.data.frame$X7 <- NULL
#Rename the columns names accordingly
nyc.html.data.frame <- rename(nyc.html.data.frame, agency = X1, dataset = X2, dataset_description = X3, update_frequency = X4, planned_release_date = X5, urlLink = X6)
nyc.html.data.frame <- nyc.html.data.frame %>% mutate(cid = row_number())
#Display data
nyc.html.data.frame %>% filter(cid < 16) %>% select(cid, agency, dataset, dataset_description, update_frequency, planned_release_date, urlLink) %>%
kable(format="pandoc")
| cid | agency | dataset | dataset_description | update_frequency | planned_release_date | urlLink |
|---|---|---|---|---|---|---|
| 1 | Department of Transportation (DOT) | Adopted Highway Service Ratings | NYC highways that receive a cleanliness rating of good (%). | To Be Determined | 07/15/2016 | https://data.cityofnewyork.us/Transportation/Adopted-Highway-Service-Ratings-Adopt-a-Highway-Hi/dte3-kvx7 |
| 2 | Department of Transportation (DOT) | Bicycle network connectivity index | Bicycle network connectivity index. | To Be Determined | 07/15/2016 | https://data.cityofnewyork.us/Transportation/Bicycle-Network-Connectivity-Index/d9fg-z42k |
| 3 | Department of Transportation (DOT) | Bridge ratings | Bridges rated good or very good (%) (calendar year). | To Be Determined | 07/15/2016 | https://data.cityofnewyork.us/Transportation/Bridge-Ratings/9dux-uz3w |
| 4 | Department of Transportation (DOT) | Pothole work orders | Average time to close a pothole work order where repair was done (days). | To Be Determined | 07/15/2016 | https://data.cityofnewyork.us/Transportation/Pothole-Work-Orders-Closed/psde-rqze |
| 5 | Department of Transportation (DOT) | Future Protected Streets - Intersections | List of Future Protected Streets for intersections. | Daily | 07/15/2016 | https://data.cityofnewyork.us/Transportation/Future-Protected-Streets-Intersections/yupw-u2ax |
| 6 | Department of Transportation (DOT) | Future Protected Streets - Segments | List of Future Protected Streets for segments. | Daily | 07/15/2016 | https://data.cityofnewyork.us/Transportation/Future-Protected-Streets-Segments/pnij-y7y6 |
| 7 | Department of Transportation (DOT) | Protected Streets - Intersections | Current list of Protected Streets for intersections. | Daily | 07/15/2016 | https://data.cityofnewyork.us/Transportation/Protected-Streets-Intersections/hfa3-euj3 |
| 8 | Department of Transportation (DOT) | Protected Streets - Segments | Current list of Protected Streets for segments. | Daily | 07/15/2016 | https://data.cityofnewyork.us/Transportation/Protected-Streets-Segments/9p9k-tusd |
| 9 | New York City Housing Authority (NYCHA) | Electric Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 07/15/2016 | https://data.cityofnewyork.us/Housing-Development/Electric-Consumption-And-Cost-2016/sd8e-3ugp |
| 10 | New York City Housing Authority (NYCHA) | Water Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 07/15/2016 | https://data.cityofnewyork.us/Housing-Development/Water-Consumption-And-Cost-2012-2016/66be-66yr |
| 11 | New York City Housing Authority (NYCHA) | Heating Gas Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 07/15/2016 | https://data.cityofnewyork.us/Housing-Development/Heating-Gas-Consumption-And-Cost-2010-2016/it56-eyq4 |
| 12 | New York City Housing Authority (NYCHA) | Cooking Gas Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 07/15/2016 | https://data.cityofnewyork.us/Housing-Development/Cooking-Gas-Consumption-And-Cost-2016/b8vr-3ckz |
| 13 | New York City Housing Authority (NYCHA) | Oil Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 07/15/2016 | https://data.cityofnewyork.us/Housing-Development/Heating-Oil-Consumption-And-Cost-2010-2016/bhwu-wuzu |
| 14 | New York City Housing Authority (NYCHA) | Steam Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 07/15/2016 | https://data.cityofnewyork.us/Housing-Development/Steam-Consumption-And-Cost-2010-2016/smdw-73pj |
| 15 | Landmarks Preservation Commission (LPC) | Requests for Evaluation | Tabular dataset containing property information, request dates, determinations and determination dates, beginning in 2001. | To Be Determined | 07/29/2016 | |
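Splitting the row text on newlines depends on how the cells happen to be laid out in the page. As a hedged alternative, the sketch below lets rvest's `html_table()` map cells to columns directly. The inline HTML and the `example.org` URL are made-up sample data standing in for `rows.html`, not the real file.

```r
library(rvest)

# Hypothetical sample standing in for rows.html (not the real NYC data).
sample_html <- '<table>
  <tr><th>agency</th><th>dataset</th><th>urlLink</th></tr>
  <tr><td>DOT</td><td>Bridge ratings</td><td>https://example.org/a</td></tr>
  <tr><td>LPC</td><td>Requests for Evaluation</td><td></td></tr>
</table>'

page <- read_html(sample_html)

# html_table() reads <th>/<td> cells into columns, so no delimiter
# splitting is needed and an empty cell stays in its own column.
sample_table <- page %>% html_node("table") %>% html_table()

# Add a running row id, as the section above does with row_number().
sample_table$cid <- seq_len(nrow(sample_table))
```

Whether this works unchanged on the real page depends on the page containing a regular table, which is an assumption here.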
The next exercise shows how to extract the attributes and text values from a particular node and then combine them into a single row.
#Extract the node that contains the hyperlink. Node h3 contains the hyperlink.
nyc.hyperlink.raw.data <- nyc.html.raw.data %>%
html_nodes("h3")
#Get the child node. Node <a> contains the hyperlink as an attribute and the text data as its value. In our case we only have one node.
nyc.hyperlink.node.data <- html_nodes(nyc.hyperlink.raw.data[1], "a")
#Get attribute values and text values using a function
attributes_text <- function(x){
a <- html_attrs(x)
t <- html_text(x)
#Tidy attribute values
a <- gsub("\\n","",a)
a <- gsub("\\t","",a)
a <- gsub("href","",a)
a <- unlist(a)
#When a node contains more than one attribute, combine them into a single value
a <- paste(a,collapse="|")
#Tidy text values
t <- gsub("\\n","",t)
t <- gsub("\\t","",t)
t <- gsub("href","",t)
t <- unlist(t)
#When a node contains more than one text value, combine them into a single value
t<-paste(t,collapse="|")
final<-c(t,a)
final<-unlist(final)
#Combine attributes and text values
final<-paste(final,collapse="|")
return(final)
}
#Display information
nyc.hyperlink.node.data
## {xml_nodeset (1)}
## [1] <a href="https://data.cityofnewyork.us/dataset/tyjc-nqc2">NYC Open D ...
nyc.hyperlink.node.data %>%
html_text()
## [1] "NYC Open Data Plan - Scheduled Releases 2016"
nyc.hyperlink.node.data %>%
html_attrs()
## [[1]]
## href
## "https://data.cityofnewyork.us/dataset/tyjc-nqc2"
#Attribute and text combined(pipe delimited)
single_row<-attributes_text(nyc.hyperlink.node.data[1])
single_row
## [1] "NYC Open Data Plan - Scheduled Releases 2016|https://data.cityofnewyork.us/dataset/tyjc-nqc2"
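When a page holds several links, the same text-plus-attribute pairing can be done for all of them at once. The sketch below is a minimal illustration on made-up HTML (the `example.org` links and the `title` attribute are hypothetical); it relies on `html_attr()` returning `NA` when a node lacks the requested attribute.

```r
library(rvest)

# Hypothetical sample with two links; only the first has a title attribute.
sample_html <- '<div>
  <h3><a href="https://example.org/one" title="First">Dataset one</a></h3>
  <h3><a href="https://example.org/two">Dataset two</a></h3>
</div>'

links <- read_html(sample_html) %>% html_nodes("h3 a")

# One row per <a> node: its text value and its href attribute.
link_table <- data.frame(
  text = html_text(links),
  href = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

# Pipe-delimited combination, mirroring the single_row value above.
link_table$combined <- paste(link_table$text, link_table$href, sep = "|")

# A missing attribute comes back as NA rather than raising an error.
titles <- html_attr(links, "title")
```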
Processing XML files. We will be using the rvest package and the NYC Open Data Plan - Scheduled Releases 2016 dataset. XML files are data files; data can be found as attributes or as text values. In lines 3 and 4 of the image below, the 3rd line contains attributes of the data wrapped between tags.
#Read data from website
nyc.xml.raw.data <- read_html("https://data.cityofnewyork.us/api/views/tyjc-nqc2/rows.xml")
#If the file has been downloaded and saved to a local drive, it can be read the same way it is read from the website.
#nyc.xml.raw.data <- read_html(file.path(getwd(), "rows.xml"))
#Extract nodes information
nyc.xml.nodes <- nyc.xml.raw.data %>%
html_nodes(xpath = "//row//row")
nyc.xml.data.frame <- data.frame(
agency = nyc.xml.nodes %>% html_nodes(xpath = "//agency") %>% html_text(),
dataset = nyc.xml.nodes %>% html_nodes(xpath = "//dataset") %>% html_text(),
dataset_description = nyc.xml.nodes %>% html_nodes(xpath = "//dataset_description") %>% html_text(),
update_frequency = nyc.xml.nodes %>% html_nodes(xpath = "//update_frequency") %>% html_text(),
planned_release_date = nyc.xml.nodes %>% html_nodes(xpath = "//planned_release_date") %>% html_text(),
cid = nyc.xml.nodes %>% html_attr("_id") %>% as.integer(),
stringsAsFactors = F
)
#Link for the dataset
#Not all datasets have links
#Initiate an empty data frame
nyc.xml.url.frame <- data.frame(cid = NA, urlLink = NA)
for (i in 1:length(nyc.xml.nodes)){
# Get node Id info
nodeId <-html_attr(nyc.xml.nodes[i], "_id") %>% as.integer()
#Get Url Info
allText <- nyc.xml.nodes[i] %>% html_text()
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
dataSetUrl <- str_extract(allText, url_pattern)
#Load data into data frame
nyc.xml.url.frame <- rbind(nyc.xml.url.frame, c(nodeId,dataSetUrl))
}
nyc.xml.url.frame <- nyc.xml.url.frame %>% filter(!is.na(cid)) %>% select(cid,urlLink)
nyc.xml.url.frame$cid <- as.numeric(nyc.xml.url.frame$cid)
nyc.xml.complete.data <- nyc.xml.data.frame %>%
inner_join(nyc.xml.url.frame, by = "cid") %>%
arrange(cid)
#Display data
nyc.xml.complete.data %>% filter(cid < 16) %>% select(cid, agency, dataset, dataset_description, update_frequency, planned_release_date, urlLink) %>%
kable(format="pandoc")
| cid | agency | dataset | dataset_description | update_frequency | planned_release_date | urlLink |
|---|---|---|---|---|---|---|
| 1 | Department of Transportation (DOT) | Adopted Highway Service Ratings | NYC highways that receive a cleanliness rating of good (%). | To Be Determined | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Adopted-Highway-Service-Ratings-Adopt-a-Highway-Hi/dte3-kvx7 |
| 2 | Department of Transportation (DOT) | Bicycle network connectivity index | Bicycle network connectivity index. | To Be Determined | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Bicycle-Network-Connectivity-Index/d9fg-z42k |
| 3 | Department of Transportation (DOT) | Bridge ratings | Bridges rated good or very good (%) (calendar year). | To Be Determined | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Bridge-Ratings/9dux-uz3w |
| 4 | Department of Transportation (DOT) | Pothole work orders | Average time to close a pothole work order where repair was done (days). | To Be Determined | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Pothole-Work-Orders-Closed/psde-rqze |
| 5 | Department of Transportation (DOT) | Future Protected Streets - Intersections | List of Future Protected Streets for intersections. | Daily | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Future-Protected-Streets-Intersections/yupw-u2ax |
| 6 | Department of Transportation (DOT) | Future Protected Streets - Segments | List of Future Protected Streets for segments. | Daily | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Future-Protected-Streets-Segments/pnij-y7y6 |
| 7 | Department of Transportation (DOT) | Protected Streets - Intersections | Current list of Protected Streets for intersections. | Daily | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Protected-Streets-Intersections/hfa3-euj3 |
| 8 | Department of Transportation (DOT) | Protected Streets - Segments | Current list of Protected Streets for segments. | Daily | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Protected-Streets-Segments/9p9k-tusd |
| 9 | New York City Housing Authority (NYCHA) | Electric Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Housing-Development/Electric-Consumption-And-Cost-2016/sd8e-3ugp |
| 10 | New York City Housing Authority (NYCHA) | Water Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Housing-Development/Water-Consumption-And-Cost-2012-2016/66be-66yr |
| 11 | New York City Housing Authority (NYCHA) | Heating Gas Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Housing-Development/Heating-Gas-Consumption-And-Cost-2010-2016/it56-eyq4 |
| 12 | New York City Housing Authority (NYCHA) | Cooking Gas Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Housing-Development/Cooking-Gas-Consumption-And-Cost-2016/b8vr-3ckz |
| 13 | New York City Housing Authority (NYCHA) | Oil Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Housing-Development/Heating-Oil-Consumption-And-Cost-2010-2016/bhwu-wuzu |
| 14 | New York City Housing Authority (NYCHA) | Steam Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Housing-Development/Steam-Consumption-And-Cost-2010-2016/smdw-73pj |
| 15 | Landmarks Preservation Commission (LPC) | Requests for Evaluation | Tabular dataset containing property information, request dates, determinations and determination dates, beginning in 2001. | To Be Determined | 2016-07-29T00:00:00 | NA |
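The row with `NA` in `urlLink` shows that not every row carries every field. A possible way to handle that without a loop is to look up each field relative to its own row with `xml_find_first()` from the xml2 package, which returns one result per input node and a missing node where nothing matches. The sketch below assumes a simplified layout: the inline XML, the `url` element name, and the `example.org` link are hypothetical stand-ins for `rows.xml`.

```r
library(xml2)

# Hypothetical sample; the second row has no <url> element.
sample_xml <- '<response><row>
  <row _id="1"><agency>DOT</agency><dataset>Bridge ratings</dataset><url>https://example.org/a</url></row>
  <row _id="2"><agency>LPC</agency><dataset>Requests for Evaluation</dataset></row>
</row></response>'

doc <- read_xml(sample_xml)
rows <- xml_find_all(doc, "//row/row")

# "./name" is evaluated relative to each row, so every column has
# exactly one entry per row and a missing field becomes NA.
sample_rows <- data.frame(
  cid = as.integer(xml_attr(rows, "_id")),
  agency = xml_text(xml_find_first(rows, "./agency")),
  dataset = xml_text(xml_find_first(rows, "./dataset")),
  urlLink = xml_text(xml_find_first(rows, "./url")),
  stringsAsFactors = FALSE
)
```

Because every column has the same length by construction, no separate join on `cid` is needed in this simplified case.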
Processing JSON files. We will be using the jsonlite package and the NYC Open Data Plan - Scheduled Releases 2016 dataset. JSON files are data files; data is stored as key-value pairs and arrays.
#JSON data from website
nyc.json.raw.data <- fromJSON("https://data.cityofnewyork.us/api/views/tyjc-nqc2/rows.json", flatten = T)
#If the file has been downloaded and saved to a local drive, it can be read the same way it is read from the website.
#nyc.json.raw.data <- fromJSON(file.path(getwd(), "rows.json"), flatten = T)
#Load data into data frame
nyc.json.data.frame <- data.frame(nyc.json.raw.data[['data']])
#Metadata for the JSON data
nyc.json.meta.data <- nyc.json.raw.data[['meta']]
#Rename column names
nyc.json.data.frame <- nyc.json.data.frame %>% select(cid = X1, agency = X9, dataset = X10, dataset_description = X11, update_frequency = X12, planned_release_date = X13, urlLink = X14)
nyc.json.data.frame$cid <- as.numeric(as.character(nyc.json.data.frame$cid))
#Display data
nyc.json.data.frame %>% filter(cid < 16) %>% select(cid, agency, dataset, dataset_description, update_frequency, planned_release_date, urlLink) %>%
kable(format="pandoc")
| cid | agency | dataset | dataset_description | update_frequency | planned_release_date | urlLink |
|---|---|---|---|---|---|---|
| 1 | Department of Transportation (DOT) | Adopted Highway Service Ratings | NYC highways that receive a cleanliness rating of good (%). | To Be Determined | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Adopted-Highway-Service-Ratings-Adopt-a-Highway-Hi/dte3-kvx7 |
| 2 | Department of Transportation (DOT) | Bicycle network connectivity index | Bicycle network connectivity index. | To Be Determined | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Bicycle-Network-Connectivity-Index/d9fg-z42k |
| 3 | Department of Transportation (DOT) | Bridge ratings | Bridges rated good or very good (%) (calendar year). | To Be Determined | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Bridge-Ratings/9dux-uz3w |
| 4 | Department of Transportation (DOT) | Pothole work orders | Average time to close a pothole work order where repair was done (days). | To Be Determined | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Pothole-Work-Orders-Closed/psde-rqze |
| 5 | Department of Transportation (DOT) | Future Protected Streets - Intersections | List of Future Protected Streets for intersections. | Daily | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Future-Protected-Streets-Intersections/yupw-u2ax |
| 6 | Department of Transportation (DOT) | Future Protected Streets - Segments | List of Future Protected Streets for segments. | Daily | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Future-Protected-Streets-Segments/pnij-y7y6 |
| 7 | Department of Transportation (DOT) | Protected Streets - Intersections | Current list of Protected Streets for intersections. | Daily | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Protected-Streets-Intersections/hfa3-euj3 |
| 8 | Department of Transportation (DOT) | Protected Streets - Segments | Current list of Protected Streets for segments. | Daily | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Transportation/Protected-Streets-Segments/9p9k-tusd |
| 9 | New York City Housing Authority (NYCHA) | Electric Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Housing-Development/Electric-Consumption-And-Cost-2016/sd8e-3ugp |
| 10 | New York City Housing Authority (NYCHA) | Water Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Housing-Development/Water-Consumption-And-Cost-2012-2016/66be-66yr |
| 11 | New York City Housing Authority (NYCHA) | Heating Gas Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Housing-Development/Heating-Gas-Consumption-And-Cost-2010-2016/it56-eyq4 |
| 12 | New York City Housing Authority (NYCHA) | Cooking Gas Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Housing-Development/Cooking-Gas-Consumption-And-Cost-2016/b8vr-3ckz |
| 13 | New York City Housing Authority (NYCHA) | Oil Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Housing-Development/Heating-Oil-Consumption-And-Cost-2010-2016/bhwu-wuzu |
| 14 | New York City Housing Authority (NYCHA) | Steam Consumption and Cost | Monthly consumption and cost data by borough and development. Data set includes utility vendor and meter information. | Quarterly | 2016-07-15T00:00:00 | https://data.cityofnewyork.us/Housing-Development/Steam-Consumption-And-Cost-2010-2016/smdw-73pj |
| 15 | Landmarks Preservation Commission (LPC) | Requests for Evaluation | Tabular dataset containing property information, request dates, determinations and determination dates, beginning in 2001. | To Be Determined | 2016-07-29T00:00:00 | NA |
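Selecting columns by position (`X9`, `X10`, and so on) depends on the column order in the file. As a rough sketch, the column names could instead be taken from the metadata block. The inline JSON below is made-up sample data, and the `meta$view$columns$fieldName` path is an assumption about how the metadata is laid out, so it should be checked against `nyc.json.meta.data` before use.

```r
library(jsonlite)

# Hypothetical sample mimicking a "meta" block plus a "data" array of rows.
sample_json <- '{
  "meta": {"view": {"columns": [
    {"fieldName": "sid"}, {"fieldName": "agency"}, {"fieldName": "dataset"}
  ]}},
  "data": [
    ["1", "DOT", "Bridge ratings"],
    ["2", "LPC", "Requests for Evaluation"]
  ]
}'

parsed <- fromJSON(sample_json)

# Equal-length string arrays are simplified to a character matrix,
# which converts to a data frame with one column per field.
sample_frame <- as.data.frame(parsed$data, stringsAsFactors = FALSE)

# Name the columns from the metadata instead of relying on X1, X2, ...
names(sample_frame) <- parsed$meta$view$columns$fieldName
sample_frame$sid <- as.integer(sample_frame$sid)
```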