CIS 4730 Unstructured Data Management

But NA creates problems for most numerical functions.

For example, adding NA to other numbers returns NA, and so does taking the maximum of a vector that contains NA.

sum(vec)

## [1] NA

max(vec)

## [1] NA

To apply these numerical functions to data that contains NAs, remove the NAs from the calculation by setting na.rm = TRUE:

sum(vec, na.rm = TRUE) # remove NAs before calculating the sum

## [1] 7

max(vec, na.rm = TRUE) # remove NAs before getting the max value

## [1] 4
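
When the positions of the missing values matter, is.na() can locate them before any arithmetic is done. The sketch below assumes a stand-in vector, vec <- c(1, 2, NA, 4), chosen only because it is consistent with the sum of 7 and the maximum of 4 shown above; the vec defined on the earlier slide may differ.

```r
# Stand-in vector (an assumption; consistent with the outputs above)
vec <- c(1, 2, NA, 4)

# Locate and count the missing values
na_flags <- is.na(vec)
na_count <- sum(na_flags)

# Drop the NAs by hand; this gives the same sum as na.rm = TRUE
vec_clean <- vec[!na_flags]
total <- sum(vec_clean)

# mean() accepts na.rm as well
average <- mean(vec, na.rm = TRUE)

print(na_flags)
print(na_count)
print(total)
print(average)
```

Filtering with !is.na() is useful when the cleaned vector is needed more than once; na.rm = TRUE is shorter for a single calculation.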

library(xml2)
library(XML)

# install.packages("tidyverse") # run once, only if not yet installed
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Read the xml file
menu_data <- read_xml('https://www.w3schools.com/xml/simple.xml')
# Check the class of menu_data
class(menu_data)

## [1] "xml_document" "xml_node"

menu_data

## {xml_document}
## <breakfast_menu>
## [1] <food>\n  <name>Belgian Waffles</name>\n  <price>$5.95</price>\n  <descri ...
## [2] <food>\n  <name>Strawberry Belgian Waffles</name>\n  <price>$7.95</price> ...
## [3] <food>\n  <name>Berry-Berry Belgian Waffles</name>\n  <price>$8.95</price ...
## [4] <food>\n  <name>French Toast</name>\n  <price>$4.50</price>\n  <descripti ...
## [5] <food>\n  <name>Homestyle Breakfast</name>\n  <price>$6.95</price>\n  <de ...
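
Before extracting anything, it can help to look at how the document is nested. The following is a minimal sketch using xml2 only; xml_string is a shortened, hypothetical stand-in for the remote file so that the example runs without a network connection.

```r
library(xml2)

# Inline stand-in for the remote file (two items only, an assumption)
xml_string <- "<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
  </food>
  <food>
    <name>French Toast</name>
    <price>$4.50</price>
  </food>
</breakfast_menu>"

doc <- read_xml(xml_string)

root_name <- xml_name(doc)      # name of the root element
foods <- xml_children(doc)      # the <food> nodes under the root
n_foods <- length(foods)

# Element names inside the first <food> node
field_names <- xml_name(xml_children(foods[[1]]))

print(root_name)
print(n_foods)
print(field_names)
```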

# Parse menu_data into an R structure representing the XML tree
menu_xml <- xmlParse(menu_data)
# Display the XML tree
menu_xml

## <?xml version="1.0" encoding="UTF-8"?>
## <breakfast_menu>
##   <food>
##     <name>Belgian Waffles</name>
##     <price>$5.95</price>
##     <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
##     <calories>650</calories>
##   </food>
##   <food>
##     <name>Strawberry Belgian Waffles</name>
##     <price>$7.95</price>
##     <description>Light Belgian waffles covered with strawberries and whipped cream</description>
##     <calories>900</calories>
##   </food>
##   <food>
##     <name>Berry-Berry Belgian Waffles</name>
##     <price>$8.95</price>
##     <description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
##     <calories>900</calories>
##   </food>
##   <food>
##     <name>French Toast</name>
##     <price>$4.50</price>
##     <description>Thick slices made from our homemade sourdough bread</description>
##     <calories>600</calories>
##   </food>
##   <food>
##     <name>Homestyle Breakfast</name>
##     <price>$6.95</price>
##     <description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
##     <calories>950</calories>
##   </food>
## </breakfast_menu>
##

# Convert the parsed XML to a dataframe
df_menu <- xmlToDataFrame(nodes=getNodeSet(menu_xml, "//food"))
class(df_menu)

## [1] "data.frame"

#View(df_menu)
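
Values read from XML are text, so numeric columns usually need converting before any arithmetic. The sketch below assumes a shortened, hypothetical inline document (xml_string) in place of the remote file and passes stringsAsFactors = FALSE explicitly so the columns are plain character vectors regardless of R version.

```r
library(XML)

# Inline stand-in for the remote file (two items only, an assumption)
xml_string <- "<breakfast_menu>
  <food><name>Belgian Waffles</name><calories>650</calories></food>
  <food><name>French Toast</name><calories>600</calories></food>
</breakfast_menu>"

doc <- xmlParse(xml_string, asText = TRUE)
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//food"),
                     stringsAsFactors = FALSE)

# The calories column arrives as character; convert it before using it
df$calories <- as.integer(df$calories)

str(df)
mean_calories <- mean(df$calories)
print(mean_calories)
```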

# Extract XML data using xpath
menu <- xml_find_all(menu_data, xpath="/breakfast_menu/food")
print(xml_text(menu))

## [1] "Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650"                              
## [2] "Strawberry Belgian Waffles$7.95Light Belgian waffles covered with strawberries and whipped cream900"                   
## [3] "Berry-Berry Belgian Waffles$8.95Light Belgian waffles covered with an assortment of fresh berries and whipped cream900"
## [4] "French Toast$4.50Thick slices made from our homemade sourdough bread600"                                               
## [5] "Homestyle Breakfast$6.95Two eggs, bacon or sausage, toast, and our ever-popular hash browns950"
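
As the output above shows, xml_text() on a whole food node pastes the text of all its children together. One possible way to keep the fields separate, sketched here with xml2 only, is to run a relative XPath against each food node. xml_string is again a shortened, hypothetical stand-in for the remote file.

```r
library(xml2)

# Inline stand-in for the remote file (two items only, an assumption)
xml_string <- "<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <calories>650</calories>
  </food>
  <food>
    <name>French Toast</name>
    <price>$4.50</price>
    <calories>600</calories>
  </food>
</breakfast_menu>"

doc <- read_xml(xml_string)
foods <- xml_find_all(doc, "/breakfast_menu/food")

# A relative XPath such as "./name" is evaluated once per <food> node
df_foods <- data.frame(
  name     = xml_text(xml_find_first(foods, "./name")),
  price    = xml_text(xml_find_first(foods, "./price")),
  calories = as.integer(xml_text(xml_find_first(foods, "./calories"))),
  stringsAsFactors = FALSE
)

print(df_foods)
```

Because each column is built from the same node set, the rows stay aligned even if a field is missing from one food item (the missing entry becomes NA).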

breakfast_name <- xml_find_all(menu_data, xpath="//name") %>% xml_text()
print(breakfast_name)

## [1] "Belgian Waffles"             "Strawberry Belgian Waffles" 
## [3] "Berry-Berry Belgian Waffles" "French Toast"               
## [5] "Homestyle Breakfast"

breakfast_price <- xml_find_all(menu_data, xpath="//price") %>% xml_text()
print(breakfast_price)

## [1] "$5.95" "$7.95" "$8.95" "$4.50" "$6.95"

# Recall: names(vec) <- c("name1", "name2", "name3", "name4")
names(breakfast_price) <- breakfast_name

breakfast_price

##             Belgian Waffles  Strawberry Belgian Waffles 
##                     "$5.95"                     "$7.95" 
## Berry-Berry Belgian Waffles                French Toast 
##                     "$8.95"                     "$4.50" 
##         Homestyle Breakfast 
##                     "$6.95"
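
The prices are still character strings, so they cannot be averaged or compared as numbers yet. A minimal sketch of one way to convert them follows; the two-item breakfast_price vector is a stand-in for the full vector built above.

```r
# Stand-in for the named price vector built above (two items only)
breakfast_price <- c("Belgian Waffles" = "$5.95",
                     "French Toast" = "$4.50")

# Strip the literal "$" and convert to numeric.
# as.numeric() drops the names, so restore them afterwards.
price_num <- as.numeric(sub("$", "", breakfast_price, fixed = TRUE))
names(price_num) <- names(breakfast_price)

toast_price <- price_num[["French Toast"]]
average_price <- mean(price_num)

print(price_num)
print(toast_price)
print(average_price)
```

fixed = TRUE makes sub() treat "$" as a plain character instead of the regular-expression end-of-string anchor.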

# Run once, only if the packages are not yet installed
# install.packages("tidyverse")
# install.packages("jsonlite")

library(tidyverse)
library(jsonlite)

url_json <- "https://mdn.github.io/learning-area/javascript/oojs/json/superheroes.json"
superheros <- jsonlite::fromJSON(url_json)
class(superheros)

## [1] "list"

#print(superheros)
#str(superheros)
View(superheros)

#df <- as.data.frame(superheros)
df <- jsonlite::fromJSON(url_json) %>% as.data.frame()
print(df)

##          squadName   homeTown formed  secretBase active    members.name
## 1 Super Hero Squad Metro City   2016 Super tower   TRUE    Molecule Man
## 2 Super Hero Squad Metro City   2016 Super tower   TRUE Madame Uppercut
## 3 Super Hero Squad Metro City   2016 Super tower   TRUE   Eternal Flame
##   members.age members.secretIdentity
## 1          29              Dan Jukes
## 2          39            Jane Wilson
## 3     1000000                Unknown
##                                                                members.powers
## 1                         Radiation resistance, Turning tiny, Radiation blast
## 2                 Million tonne punch, Damage resistance, Superhuman reflexes
## 3 Immortality, Heat Immunity, Inferno, Teleportation, Interdimensional travel
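
Flattening with as.data.frame() repeats the squad-level fields on every row. It is often simpler to work with the nested pieces of the list directly. The sketch below assumes json_string, a shortened, hypothetical stand-in for the file at url_json, so that it runs offline.

```r
library(jsonlite)

# Shortened stand-in for the remote JSON file (an assumption)
json_string <- '{
  "squadName": "Super Hero Squad",
  "members": [
    {"name": "Molecule Man", "age": 29,
     "powers": ["Radiation resistance", "Turning tiny"]},
    {"name": "Eternal Flame", "age": 1000000,
     "powers": ["Immortality", "Inferno", "Teleportation"]}
  ]
}'

squad <- fromJSON(json_string)

# The array of member objects is simplified to a data frame
members <- squad$members
member_names <- members$name

# powers holds one character vector per member
flame_powers <- members$powers[[2]]

# Rows matching a condition, as with any data frame
old_members <- members$name[members$age > 100]

print(class(members))
print(member_names)
print(flame_powers)
print(old_members)
```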

# install.packages("devtools")
# devtools::install_github("blmoore/rjsonpath")
# library(rjsonpath)

# df2 <- read_json(url_json)
# json_path(df2, "$.members[*].name")
# json_path(df2, "$..name")
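
If rjsonpath is not available, the same two queries can be approximated with jsonlite and base R. This is a sketch under assumptions, not the rjsonpath implementation: json_string is a shortened, hypothetical stand-in for the remote file, and the recursive lookup relies on the names that unlist() generates for nested lists.

```r
library(jsonlite)

# Shortened stand-in for the remote JSON file (an assumption)
json_string <- '{
  "squadName": "Super Hero Squad",
  "members": [
    {"name": "Molecule Man", "age": 29},
    {"name": "Eternal Flame", "age": 1000000}
  ]
}'

# parse_json() keeps the nested list structure (no simplification)
parsed <- parse_json(json_string)

# Rough equivalent of "$.members[*].name"
member_names <- vapply(parsed$members, function(m) m$name, character(1))

# Rough equivalent of "$..name": every element whose key is exactly "name"
flat <- unlist(parsed)
all_names <- unname(flat[grepl("(^|\\.)name$", names(flat))])

print(member_names)
print(all_names)
```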

This lab assignment involves two tasks (see the next two slides; scroll to the bottom for instructions).

Once you finish the following tasks, put all of your code in a single R file named assignment1.R (.R is the file extension) and upload it to iCollege (Lab Assignment 1).

Caution:

You will lose 50% of the points if you use a different file name or put your code in multiple files.
You will lose 10% of the points if your code cannot be run as a whole script (see lab 1 slide p.14).

In addition, lab assignments will be graded based on:

Accuracy: whether the R script achieves the objectives
Readability: whether the R script is clean, well-formatted, and easily readable
- You risk losing 10% of the points if your code is not properly indented or if any line has more than 80 characters.

[1. (10 points)] Create the list below

## $name
## [1] "Alex"   "Bob"    "Claire" "Denise"
## 
## $female
## [1] FALSE FALSE  TRUE  TRUE
## 
## $age
## [1] 20 25 30 35

[2. (10 points)] Get the name “Bob” from the list by accessing it through its position in the name vector
- Hint: Get the name vector from the list and then get the second element of the vector
- Note: You will not get any points if you type “Bob” directly instead of retrieving it from the list you created.

## [1] "Bob"

[3. (10 points)] Create the data frame below (don’t forget the column/row names!)

##         name female age
## row_1   Alex  FALSE  20
## row_2    Bob  FALSE  25
## row_3 Claire   TRUE  30
## row_4 Denise   TRUE  35

[4. (10 points)] Obtain the mean of the age column from the data frame
- Note: You will not get any points if you do not get the answer through the data frame.

## [1] 27.5

[5. (10 points)] Retrieve Claire’s age from the data frame
- Hint: Refer to slides on getting values from ‘rows matching a condition’.
- Note: You will not get any points if you do not get the answer through the data frame.

## [1] 30

Slide titles:
- Recap
- Agenda
- R Data Types
- Vector
- Name a vector
- Combining vectors
- Your turn
- Vector arithmetics
- Test if a vector has a specific value
- Missing values: NA
- Recycling
- Quiz
- Vector indexing
- Subsetting all but some
- List
- List indexing
- Your turn
- Matrix
- Data frame
- Useful functions for data frames
- Getting values from a column
- Getting values from rows
- Summary of data types
- Reading XML in R
- Reading JSON in R
- Lab assignment (50 points)
- Lab assignment 1/2
- Lab assignment 2/2