DATA607 Assignment 9

3. Data checks and transformations

The structure of the JSON data can be seen in a data viewer.

View(top_stories)

The top stories are in level 1, element 6 (“results”), which is already a dataframe, so I started the data transformations from there.

3.1. Convert “facet” columns

The “_facet” columns are list columns of character vectors, so I unlisted each element and collapsed it into a single comma-separated string.

top_stories <- top_stories[[6]] %>%
  rowwise() %>%
  mutate(
    des_facet = toString(unlist(des_facet)),
    org_facet = toString(unlist(org_facet)),
    per_facet = toString(unlist(per_facet)),
    geo_facet = toString(unlist(geo_facet))
  ) %>%
  ungroup()
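To see what this step does in isolation, here is a minimal sketch on a toy tibble. The object names toy and toy_flat and the values in them are made up for illustration; only the rowwise/toString/unlist pattern comes from the code above.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: one list column holding character vectors
toy <- tibble(
  title = c("Story A", "Story B"),
  geo_facet = list(c("Moscow (Russia)", "Russia"), character(0))
)

toy_flat <- toy %>%
  rowwise() %>%
  # Collapse each row's character vector into one comma-separated string
  mutate(geo_facet = toString(unlist(geo_facet))) %>%
  ungroup()

toy_flat$geo_facet
```

Note that an empty facet becomes an empty string rather than NA, which is why empty strings are handled separately in section 3.5.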

Now the dataframe looks like this:

glimpse(top_stories)

## Rows: 27
## Columns: 19
## $ section             <chr> "world", "world", "world", "us", "us", "world", "w…
## $ subsection          <chr> "europe", "europe", "europe", "politics", "", "eur…
## $ title               <chr> "In First Remarks on Attack, Putin Tries to Link A…
## $ abstract            <chr> "American officials, who have assessed that a bran…
## $ url                 <chr> "https://www.nytimes.com/2024/03/23/world/europe/m…
## $ uri                 <chr> "nyt://article/bd065841-6d43-573d-9890-07e054d0d35…
## $ byline              <chr> "By Anton Troianovski", "By Julian E. Barnes and E…
## $ item_type           <chr> "Article", "Article", "Article", "Article", "Artic…
## $ updated_date        <chr> "2024-03-23T12:25:20-04:00", "2024-03-23T12:28:04-…
## $ created_date        <chr> "2024-03-23T09:58:43-04:00", "2024-03-22T18:36:11-…
## $ published_date      <chr> "2024-03-23T09:58:43-04:00", "2024-03-22T18:36:11-…
## $ material_type_facet <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ kicker              <chr> "", "", "", "", "news analysis", "", "", "", "", "…
## $ des_facet           <chr> "Terrorism", "Moscow Concert Hall Shooting (March …
## $ org_facet           <chr> "", "Islamic State Khorasan", "Islamic State Khora…
## $ per_facet           <chr> "Putin, Vladimir V", "", "", "Johnson, Mike (1972-…
## $ geo_facet           <chr> "Moscow (Russia), Russia", "Moscow (Russia)", "Rus…
## $ multimedia          <list> [<data.frame[3 x 8]>], [<data.frame[3 x 8]>], [<d…
## $ short_url           <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""…

3.2. Convert “date” columns

The date-times in the “date” columns appear to be in ISO8601 format as described in R for Data Science, 2nd edition, chapter 17, such that the date and time components are separated by a “T”. I also noticed that all times end with “-04:00”, which is the difference between Eastern Daylight Time and UTC (Coordinated Universal Time).¹

So converting these columns to datetime format is straightforward with as_datetime, which reads the offset and returns the times in UTC by default:

top_stories <- top_stories %>%
  mutate(
    updated_date = as_datetime(updated_date),
    created_date = as_datetime(created_date),
    published_date = as_datetime(published_date)
  )
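As a quick check of how the offset is handled, the sketch below converts a single timestamp taken from the updated_date column. The object names x, utc_time, and ny_time are only for illustration, and the time zone name "America/New_York" is an assumption about which zone the -04:00 offset refers to.

```r
library(lubridate)

# Timestamp copied from the updated_date column shown earlier
x <- "2024-03-23T12:25:20-04:00"

# as_datetime() reads the -04:00 offset and returns the instant in UTC
utc_time <- as_datetime(x)
format(utc_time, "%Y-%m-%d %H:%M:%S")
# "2024-03-23 16:25:20"

# with_tz() shows the same instant on the Eastern clock (assumed zone name)
ny_time <- with_tz(utc_time, "America/New_York")
hour(ny_time)
# 12
```

The stored instant is unchanged by with_tz; only the displayed clock time differs, so keeping the columns in UTC loses no information.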

3.3. Rename columns

I renamed the following columns for clarity:

Date-time columns include time in the column name
Differentiate “url” (Uniform Resource Locator, i.e., the website address) from “uri” (Uniform Resource Identifier), which are hard to tell apart as lowercase acronyms
Use more descriptive names for the “facet” columns²

top_stories <- top_stories %>%
  rename(
    web_url = url,
    resource_identifier = uri,
    updated_datetime = updated_date,
    created_datetime = created_date,
    published_datetime = published_date,
    terms_descriptive_subject = des_facet,
    terms_organization = org_facet,
    terms_person = per_facet,
    terms_geographic_area = geo_facet
  )

3.4. Unnest and unpack “multimedia” column

I debated how to handle the “multimedia” column, which is a list of dataframes. According to tidy data principles, it should be unnested into longer form, but the resulting dataframe is less readable because each row (top story) is repeated once for each multimedia item. I wonder whether the multimedia column could be left nested, which is more compact, if it isn’t important for a data analysis.

However, for the purpose of this assignment, I went ahead and unnested it. This led me to discover that the unpack function is also needed, because unnest_longer leaves the unnested dataframes as a single 2D (dataframe) column.³ Setting keep_empty = TRUE prevents rows with null elements from being dropped.

top_stories <- top_stories %>%
  unnest_longer(multimedia, keep_empty = TRUE) %>%
  unpack(cols = multimedia, names_sep = "_")
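The two steps are easier to follow on a small example. The sketch below uses a made-up tibble (toy) with a list column of dataframes, one of which is NULL; the values are hypothetical and only the unnest_longer/unpack pattern comes from the code above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data: a list column of dataframes, with one NULL element
toy <- tibble(
  title = c("Story A", "Story B"),
  multimedia = list(
    tibble(format = c("Super Jumbo", "Large Thumbnail"),
           height = c(1463L, 150L)),
    NULL
  )
)

toy_long <- toy %>%
  # One row per multimedia item; keep_empty = TRUE keeps Story B as an NA row
  unnest_longer(multimedia, keep_empty = TRUE) %>%
  # Spread the dataframe column into regular columns prefixed "multimedia_"
  unpack(cols = multimedia, names_sep = "_")

toy_long
```

Story A appears twice (once per multimedia item) and Story B is kept with NA in the multimedia columns. tidyr's unnest function also accepts keep_empty and names_sep arguments and may be an alternative that combines both steps.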

3.5. Missing values

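Finally, I replaced empty strings with NAs. This was a little tricky until I realized that comparing against an empty string with na_if only works with character vectors, so the replacement has to be restricted to the character columns.⁴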

top_stories <- top_stories %>%
  mutate(
    across(
      where(is.character), ~ na_if(., "")
    )
  )
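A minimal sketch of the same pattern on a made-up tibble shows why the where(is.character) selector matters. The object names toy and toy_na and their values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data mixing character and integer columns
toy <- tibble(
  kicker = c("", "news analysis"),
  short_url = c("", ""),
  multimedia_height = c(1463L, 400L)
)

toy_na <- toy %>%
  # where(is.character) limits na_if() to columns that can be compared with ""
  mutate(across(where(is.character), ~ na_if(.x, "")))

toy_na
```

Only the character columns are touched, so the integer column passes through unchanged.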

After these transformations, the top_stories dataframe is cleaner and ready for analysis (which is not part of this assignment).

glimpse(top_stories)

## Rows: 81
## Columns: 26
## $ section                   <chr> "world", "world", "world", "world", "world",…
## $ subsection                <chr> "europe", "europe", "europe", "europe", "eur…
## $ title                     <chr> "In First Remarks on Attack, Putin Tries to …
## $ abstract                  <chr> "American officials, who have assessed that …
## $ web_url                   <chr> "https://www.nytimes.com/2024/03/23/world/eu…
## $ resource_identifier       <chr> "nyt://article/bd065841-6d43-573d-9890-07e05…
## $ byline                    <chr> "By Anton Troianovski", "By Anton Troianovsk…
## $ item_type                 <chr> "Article", "Article", "Article", "Article", …
## $ updated_datetime          <dttm> 2024-03-23 16:25:20, 2024-03-23 16:25:20, 2…
## $ created_datetime          <dttm> 2024-03-23 13:58:43, 2024-03-23 13:58:43, 2…
## $ published_datetime        <dttm> 2024-03-23 13:58:43, 2024-03-23 13:58:43, 2…
## $ material_type_facet       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ kicker                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ terms_descriptive_subject <chr> "Terrorism", "Terrorism", "Terrorism", "Mosc…
## $ terms_organization        <chr> NA, NA, NA, "Islamic State Khorasan", "Islam…
## $ terms_person              <chr> "Putin, Vladimir V", "Putin, Vladimir V", "P…
## $ terms_geographic_area     <chr> "Moscow (Russia), Russia", "Moscow (Russia),…
## $ multimedia_url            <chr> "https://static01.nyt.com/images/2024/03/23/…
## $ multimedia_format         <chr> "Super Jumbo", "threeByTwoSmallAt2X", "Large…
## $ multimedia_height         <int> 1463, 400, 150, 720, 400, 150, 1537, 400, 15…
## $ multimedia_width          <int> 2048, 600, 150, 1080, 600, 150, 2048, 600, 1…
## $ multimedia_type           <chr> "image", "image", "image", "image", "image",…
## $ multimedia_subtype        <chr> "photo", "photo", "photo", "photo", "photo",…
## $ multimedia_caption        <chr> "President Vladimir V. Putin of Russia at th…
## $ multimedia_copyright      <chr> "Nanna Heitmann for The New York Times", "Na…
## $ short_url                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Alexander Simon

2024-03-23
