Hi! How’s it going? I’m well too, thanks for asking! It’s been too long since my last post, why? Let’s just thank God for love and life.
This will be a short read (at least I hope it will), mostly because it’s a mirror post. Let me explain.
I came across an interesting medium post by Nazia Habib (huge shoutout) where he detailed how to create a simple ETL pipeline in python. I recommend quickly going through it as I shall try to avoid repeating anything he discussed there.
My goal after reading his post was to replicate it but in R, so the following details how I achieved that.
For some context (as I suspect you may not have checked out his post), the aim is to query movie data from the tmdb API, transform the data, extract three tables (movies, genres list and datetimes) and load them into a desired location/format (Excel Workbook for me and csvs for Nazia). Really, this could go into a database, data warehouse, google sheets or whatever destination you please but for simplicity, an Excel workbook works just fine for me now.
You can skip the explanation and view the full commented source code on my github.
Caution! No for loops were used in the making of this post.
Nazia already explains how to acquire an access key from the tmdb API so I’ll just skip ahead to loading the key and importing relevant libraries
# import libraries
pacman::p_load(jsonlite, tidyverse, glue, httr)
# load api_key from config.json file
key <- read_json("config.json")
Data is queried from our API by movie_id, an integer. To query a single data point, we just call the url with the desired movie_id and our API key as shown below
movie_id <- 500
url <- paste0('https://api.themoviedb.org/3/movie/',movie_id,'?api_key=',key$api_key,'&language=en-US')
response <- GET(url)
Now let’s get data for a range of movies, say movies with ids from 560 to 565.
To do this, we first declare our range of IDs. Next, using the glue function, we generate a vector (list) of urls with each ID and our api_key.
# declare target movie ids
movie_id <- 560:565
# create vector of urls with movie ids and api_key
url_list <- glue('https://api.themoviedb.org/3/movie/{movie_id}?api_key={key$api_key}&language=en-US')
Next, we write a function to query each url, extract the required data and return our desired response. This is done by customizing our GET function to further filter for the desired content.
# customize GET function to extract content, parse from raw to CHar and extract JSON
customGET <- function (url) {
response <- GET(url)
res <- response$content %>%
rawToChar() %>%
fromJSON()
return(res)
}
After verifying that our function works, we can vectorize it with the iterative argument being the url. We do this to enable R loop through the url_list efficently and return a list of responses to us ready to be bound into a dataframe.
# vectorize customGET function to easily call urls from url_list
vCustomGET <- Vectorize(customGET, vectorize.args = "url")
# call urls from url_list
response_list <- vCustomGET(url = url_list)
# convert response to dataframe
df <- as.data.frame(t(rbind(response_list)))
Next we want to select only the columns we’re interested in and unnnest them as they’re still named lists. We’ll address the ‘genres’ column in a bit.
# declare relevant headers
df_columns <- c('budget', 'id', 'imdb_id', 'original_title', 'release_date', 'revenue', 'runtime', 'genres')
# select desired headers and unnest all except genres
df <- df %>%
select(all_of(df_columns)) %>%
rename('movie_id' = 'id') %>%
unnest(cols = -c(genres))
# visualize our dataframe (finally)
df
## # A tibble: 6 x 8
## budget movie_id imdb_id original_title release_date revenue runtime genres
## <int> <int> <chr> <chr> <chr> <int> <int> <name>
## 1 0 560 tt00407~ Secret Beyond~ 1947-12-24 0 99 <df [~
## 2 100000000 561 tt03604~ Constantine 2005-02-08 2.31e8 121 <df [~
## 3 28000000 562 tt00950~ Die Hard 1988-07-15 1.41e8 131 <df [~
## 4 105000000 563 tt01202~ Starship Troo~ 1997-11-07 1.21e8 129 <df [~
## 5 80000000 564 tt01206~ The Mummy 1999-04-16 4.16e8 124 <df [~
## 6 48000000 565 tt02981~ The Ring 2002-10-18 2.49e8 115 <df [~
To create the genres table, we pull our ‘genres’ column and extract all unique genres into a table as shown below.
# create table of unique genre name and id
genres_df <- df %>% select(genres) %>% pull() %>% Reduce(rbind, .) %>% data.frame() %>% unique()
# visualize
genres_df
## id name
## 1 80 Crime
## 2 9648 Mystery
## 3 53 Thriller
## 4 10749 Romance
## 5 14 Fantasy
## 6 28 Action
## 7 27 Horror
## 10 12 Adventure
## 13 878 Science Fiction
For our main movies table, we want to be able to filter through the genres easily such that if a movie belongs to a genre, it has a row value of 1 and if not, the value is 0.
To do this, we unnest the ‘genres’ column and pivot_wider (spread) the column with column names as each unique genre and values as the id (genre id). We then run a simple custom function (serializer_fn) to replace all values in the genre columns with 1 or 0 depending on whether or not they exist (are not NA).
# create funtion to serialize genre columns
serializer_fn <- function(x) {
if_else(!is.na(x), 1, 0)
}
# unnest genre column and serialize
df <- df %>%
unnest(genres) %>%
pivot_wider(names_from = name, values_from = id, values_fn = serializer_fn, values_fill = 0)
# visualize\
df
## # A tibble: 6 x 16
## budget movie_id imdb_id original_title release_date revenue runtime Crime
## <int> <int> <chr> <chr> <chr> <int> <int> <dbl>
## 1 0 560 tt00407~ Secret Beyond ~ 1947-12-24 0 99 1
## 2 100000000 561 tt03604~ Constantine 2005-02-08 2.31e8 121 0
## 3 28000000 562 tt00950~ Die Hard 1988-07-15 1.41e8 131 0
## 4 105000000 563 tt01202~ Starship Troop~ 1997-11-07 1.21e8 129 0
## 5 80000000 564 tt01206~ The Mummy 1999-04-16 4.16e8 124 0
## 6 48000000 565 tt02981~ The Ring 2002-10-18 2.49e8 115 0
## # ... with 8 more variables: Mystery <dbl>, Thriller <dbl>, Romance <dbl>,
## # Fantasy <dbl>, Action <dbl>, Horror <dbl>, Adventure <dbl>,
## # Science Fiction <dbl>
For the last table, the datetime table, using the lubridate package, we can easily extract the day, month, year and day_of_week from the release_date column.
# create datetime table
datetime_df <- df %>%
select(movie_id, release_date) %>%
mutate(release_date= lubridate::ymd(release_date),
day = as.numeric(format(release_date, "%d")),
month = as.numeric(format(release_date, "%m")),
year = as.numeric(format(release_date, "%Y")),
day_of_week = format(release_date, "%A"),
)
# visualise
datetime_df
## # A tibble: 6 x 6
## movie_id release_date day month year day_of_week
## <int> <date> <dbl> <dbl> <dbl> <chr>
## 1 560 1947-12-24 24 12 1947 Wednesday
## 2 561 2005-02-08 8 2 2005 Tuesday
## 3 562 1988-07-15 15 7 1988 Friday
## 4 563 1997-11-07 7 11 1997 Friday
## 5 564 1999-04-16 16 4 1999 Friday
## 6 565 2002-10-18 18 10 2002 Friday
All three tables are created and ready to be loaded into our Excel Workbook destination. This file will be created in our working directory.
# load data to desired location (excel spreadsheet this time)
library(writexl)
## Warning: package 'writexl' was built under R version 4.1.2
# assign each table to a separate worksheet
export_list <- list(
'tmdb_movies' = df,
'genres' = genres_df,
'tmdb_datetimes' = datetime_df
)
# write to xlsx file
writexl::write_xlsx(export_list, 'tmdb_export.xlsx')
Well this has been fun! Thanks for reading this far! My argument for writing ETL scripts in R would be that translating my intention of data transformation is more intuitive in R than in python but hey! do what works for you!
Have a great one and feel free to leave comments and/or questions.
Also connect with me on LinkedIn to collaborate on projects and talk about data in all flavors!