GitHub - https://github.com/rickysoo/top_audiobooks
Contact - ricky [at] rickysoo.com
Would you like to know what the top 100 audiobooks are whenever you want? I’m a big fan of audiobooks, having been a paying subscriber to Audible.com for many years.
Here, we will retrieve the current 100 bestsellers on Audible.com and save them into a dataframe. Five publicly available pages from the Audible.com website are retrieved for further processing.
The following packages are required for web scraping in R.
library(xml2)
library(rvest)
We are going to retrieve 20 audiobooks from each of 5 pages.
pages <- 5
items <- 20
There are 12 data fields to retrieve. A character matrix, with one row per audiobook and one named column per field, stores the retrieved data.
cols <- c('Rank', 'Title', 'Subtitle', 'Author', 'Narrator', 'Length', 'Release', 'Language', 'Stars', 'Ratings', 'Price', 'URL')
data <- matrix('', nrow = pages * items, ncol = length(cols))
colnames(data) <- cols
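To see how a page number and an item position map onto a row of this matrix, here is a small self-contained sketch. The helper `row_index()` and the two-column `demo` matrix are hypothetical and not part of the original script; the formula is the same one the scraping loop uses to place each item.

```r
# Hypothetical helper (not in the original script): maps a page number and
# an item position on that page to a row of the results matrix.
row_index <- function(page, item_num, items = 20) {
  (page - 1) * items + item_num
}

# A small character matrix with the same layout idea, two columns only.
demo <- matrix('', nrow = 5 * 20, ncol = 2)
colnames(demo) <- c('Rank', 'Title')

# Item 3 on page 2 lands in row 23.
row <- row_index(page = 2, item_num = 3)
demo[row, 'Rank'] <- row            # the number is stored as the string "23"
demo[row, 'Title'] <- 'Example title'

print(row)
print(demo[row, ])
```

Because the matrix holds strings, every value written into it, including the numeric rank, is coerced to text. Numeric columns therefore need converting back later.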
Now we go to each page, extract the data, do some minimal cleaning, and put the results into the matrix.
# Loop through the pages
for (page in 1:pages) {
  url <- paste0('https://www.audible.com/adblbestsellers?page=', page)
  html <- read_html(url)

  # Loop through the items on a page
  for (item_num in 1:items) {
    row <- (page - 1) * items + item_num
    item_selector <- paste0('#product-list-a11y-skiplink-target > span > ul > div > li:nth-child(', item_num, ') > div > div.bc-col-responsive.bc-spacing-top-none.bc-col-8 > div > div.bc-col-responsive.bc-col-6 > div > div > span > ul')
    item_node <- html_node(html, item_selector)

    # The audiobook ranking
    data[row, 'Rank'] <- row

    # The audiobook title
    title_selector <- 'li:nth-child(1) > h3 > a'
    title_node <- html_node(item_node, title_selector)
    title <- html_text(title_node, trim = TRUE)
    data[row, 'Title'] <- title

    # The audiobook subtitle. It's empty for some items.
    subtitle_selector <- 'li.bc-list-item.subtitle > span'
    subtitle_node <- html_node(item_node, subtitle_selector)
    subtitle <- html_text(subtitle_node, trim = TRUE)
    data[row, 'Subtitle'] <- subtitle

    # The author. There might be more than one.
    author_selector <- 'li.bc-list-item.authorLabel > span > a'
    author_nodes <- html_nodes(item_node, author_selector)
    authors <- html_text(author_nodes, trim = TRUE)
    author <- paste(authors, collapse = ', ')
    data[row, 'Author'] <- author

    # The narrator. There might be more than one.
    narrator_selector <- 'li.bc-list-item.narratorLabel > span > a'
    narrator_nodes <- html_nodes(item_node, narrator_selector)
    narrators <- html_text(narrator_nodes, trim = TRUE)
    narrator <- paste(narrators, collapse = ', ')
    data[row, 'Narrator'] <- narrator

    # The audiobook length in hours and minutes
    length_selector <- 'li.bc-list-item.runtimeLabel > span'
    length_node <- html_node(item_node, length_selector)
    length <- html_text(length_node, trim = TRUE)
    length <- gsub('Length: ', '', length)
    data[row, 'Length'] <- length

    # The release date
    release_selector <- 'li.bc-list-item.releaseDateLabel > span'
    release_node <- html_node(item_node, release_selector)
    release <- html_text(release_node, trim = TRUE)
    release <- gsub('Release date:\n\\s+', '', release)
    data[row, 'Release'] <- release

    # The audiobook language
    language_selector <- 'li.bc-list-item.languageLabel > span'
    language_node <- html_node(item_node, language_selector)
    language <- html_text(language_node, trim = TRUE)
    language <- gsub('Language:\n\\s+', '', language)
    data[row, 'Language'] <- language

    # The number of stars received
    stars_selector <- 'li.bc-list-item.ratingsLabel > span.bc-text.bc-pub-offscreen'
    stars_node <- html_node(item_node, stars_selector)
    stars <- html_text(stars_node, trim = TRUE)
    data[row, 'Stars'] <- stars

    # The number of ratings received
    ratings_selector <- 'li.bc-list-item.ratingsLabel > span.bc-text.bc-size-small.bc-color-secondary'
    ratings_node <- html_node(item_node, ratings_selector)
    ratings <- html_text(ratings_node, trim = TRUE)
    data[row, 'Ratings'] <- ratings

    # The selling price
    price_selector <- paste0('#buybox-regular-price-', item_num - 1, ' > span:nth-child(2)')
    price_node <- html_node(html, price_selector)
    price <- html_text(price_node, trim = TRUE)
    data[row, 'Price'] <- price

    # Web page address
    url <- paste0('https://www.audible.com', html_attr(title_node, 'href'))
    data[row, 'URL'] <- url
  }
}
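The selector logic in the loop can be tried offline on a small HTML fragment. The markup below is invented for illustration and is far simpler than Audible's real page; only the class names are borrowed from the selectors in the loop. It is a sketch of three behaviours the loop relies on: `html_node()` gives `NA` text when nothing matches, `html_nodes()` returns every match so several authors can be joined, and `gsub()` strips the field label.

```r
library(xml2)
library(rvest)

# Invented fragment for illustration; not Audible's actual markup.
fragment <- read_html('
  <ul>
    <li><h3><a href="/pd/Example-Audiobook/123">Example Title</a></h3></li>
    <li class="bc-list-item authorLabel">
      <span>By: <a>First Author</a>, <a>Second Author</a></span>
    </li>
    <li class="bc-list-item runtimeLabel">
      <span>Length: 6 hrs and 42 mins</span>
    </li>
  </ul>')

item_node <- html_node(fragment, 'ul')

# Single match: the title link inside the first list item.
title_node <- html_node(item_node, 'li:nth-child(1) > h3 > a')
title <- html_text(title_node, trim = TRUE)

# No match: the fragment has no subtitle, so the text comes back as NA.
subtitle <- html_text(html_node(item_node, 'li.bc-list-item.subtitle > span'), trim = TRUE)

# Several matches: collect every author link and join the names.
author_nodes <- html_nodes(item_node, 'li.bc-list-item.authorLabel > span > a')
author <- paste(html_text(author_nodes, trim = TRUE), collapse = ', ')

# Label cleaning: drop the "Length: " prefix from the runtime text.
runtime <- html_text(html_node(item_node, 'li.bc-list-item.runtimeLabel > span'), trim = TRUE)
runtime <- gsub('Length: ', '', runtime)

# Relative link turned into a full address, as in the loop.
item_url <- paste0('https://www.audible.com', html_attr(title_node, 'href'))

print(c(title = title, subtitle = subtitle, author = author,
        runtime = runtime, url = item_url))
```

The missing subtitle is why `<NA>` appears in the Subtitle column of the results rather than an empty string.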
Convert the matrix into a dataframe and show the first 10 rows.
df <- as.data.frame(data)
head(df, 10)
## Rank Title
## 1 1 A Promised Land
## 2 2 Greenlights
## 3 3 The House Guest
## 4 4 The Law of Innocence
## 5 5 Ready Player Two
## 6 6 Rhythm of War
## 7 7 A Time for Mercy
## 8 8 Atomic Habits
## 9 9 The Ickabog
## 10 10 The New One
## Subtitle
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 A Novel
## 6 The Stormlight Archive, Book 4
## 7 A Jake Brigance Novel
## 8 An Easy & Proven Way to Build Good Habits & Break Bad Ones
## 9 <NA>
## 10 Painfully True Stories from a Reluctant Dad
##                           Author                      Narrator
## 1                   Barack Obama                  Barack Obama
## 2            Matthew McConaughey           Matthew McConaughey
## 3                   Mark Edwards   Will M. Watt, Stina Nielsen
## 4               Michael Connelly                   Peter Giles
## 5                   Ernest Cline                   Wil Wheaton
## 6              Brandon Sanderson  Kate Reading, Michael Kramer
## 7                   John Grisham                  Michael Beck
## 8                    James Clear                   James Clear
## 9                   J.K. Rowling                   Stephen Fry
## 10 Mike Birbiglia, J. Hope Stein Mike Birbiglia, J. Hope Stein
##                Length  Release Language              Stars        Ratings
## 1  29 hrs and 10 mins 11-17-20  English               <NA>  Not rated yet
## 2   6 hrs and 42 mins 10-20-20  English   5 out of 5 stars 11,290 ratings
## 3   8 hrs and 26 mins 06-03-20  English   4 out of 5 stars    326 ratings
## 4  12 hrs and 27 mins 11-10-20  English 4.5 out of 5 stars    682 ratings
## 5              16 hrs 11-24-20  English               <NA>  Not rated yet
## 6  57 hrs and 26 mins 11-17-20  English               <NA>  Not rated yet
## 7  19 hrs and 59 mins 10-13-20  English 4.5 out of 5 stars  4,679 ratings
## 8   5 hrs and 35 mins 10-16-18  English   5 out of 5 stars 47,484 ratings
## 9   7 hrs and 52 mins 11-10-20  English 4.5 out of 5 stars    110 ratings
## 10   5 hrs and 9 mins 05-05-20  English 4.5 out of 5 stars    507 ratings
## Price
## 1 $45.50
## 2 $28.00
## 3 $25.19
## 4 $30.79
## 5 $35.00
## 6 $66.49
## 7 $31.50
## 8 $28.00
## 9 $24.99
## 10 $29.65
## URL
## 1 https://www.audible.com/pd/A-Promised-Land-Audiobook/0525633723?ref=a_adblbests_c3_lProduct_1_1&pf_rd_p=4100380b-3e9d-4594-990a-9c93d1a8dac3&pf_rd_r=Z8ZN8ANMRVYJ93135C6D
## 2 https://www.audible.com/pd/Greenlights-Audiobook/0593294181?ref=a_adblbests_c3_lProduct_1_2&pf_rd_p=4100380b-3e9d-4594-990a-9c93d1a8dac3&pf_rd_r=Z8ZN8ANMRVYJ93135C6D
## 3 https://www.audible.com/pd/The-House-Guest-Audiobook/1799771539?ref=a_adblbests_c3_lProduct_1_3&pf_rd_p=4100380b-3e9d-4594-990a-9c93d1a8dac3&pf_rd_r=Z8ZN8ANMRVYJ93135C6D
## 4 https://www.audible.com/pd/The-Law-of-Innocence-Audiobook/1549129007?ref=a_adblbests_c3_lProduct_1_4&pf_rd_p=4100380b-3e9d-4594-990a-9c93d1a8dac3&pf_rd_r=Z8ZN8ANMRVYJ93135C6D
## 5 https://www.audible.com/pd/Ready-Player-Two-Audiobook/0593396960?ref=a_adblbests_c3_lProduct_1_5&pf_rd_p=4100380b-3e9d-4594-990a-9c93d1a8dac3&pf_rd_r=Z8ZN8ANMRVYJ93135C6D
## 6 https://www.audible.com/pd/Rhythm-of-War-Audiobook/1250759781?ref=a_adblbests_c3_lProduct_1_6&pf_rd_p=4100380b-3e9d-4594-990a-9c93d1a8dac3&pf_rd_r=Z8ZN8ANMRVYJ93135C6D
## 7 https://www.audible.com/pd/A-Time-for-Mercy-Audiobook/0593168550?ref=a_adblbests_c3_lProduct_1_7&pf_rd_p=4100380b-3e9d-4594-990a-9c93d1a8dac3&pf_rd_r=Z8ZN8ANMRVYJ93135C6D
## 8 https://www.audible.com/pd/Atomic-Habits-Audiobook/1524779261?ref=a_adblbests_c3_lProduct_1_8&pf_rd_p=4100380b-3e9d-4594-990a-9c93d1a8dac3&pf_rd_r=Z8ZN8ANMRVYJ93135C6D
## 9 https://www.audible.com/pd/The-Ickabog-Audiobook/B08D9T16JZ?ref=a_adblbests_c3_lProduct_1_9&pf_rd_p=4100380b-3e9d-4594-990a-9c93d1a8dac3&pf_rd_r=Z8ZN8ANMRVYJ93135C6D
## 10 https://www.audible.com/pd/The-New-One-Audiobook/1549126997?ref=a_adblbests_c3_lProduct_1_10&pf_rd_p=4100380b-3e9d-4594-990a-9c93d1a8dac3&pf_rd_r=Z8ZN8ANMRVYJ93135C6D
View the whole dataframe.
View(df)
View only the rank and audiobook title.
df[ , c('Rank', 'Title')]
## Rank Title
## 1 1 A Promised Land
## 2 2 Greenlights
## 3 3 The House Guest
## 4 4 The Law of Innocence
## 5 5 Ready Player Two
## 6 6 Rhythm of War
## 7 7 A Time for Mercy
## 8 8 Atomic Habits
## 9 9 The Ickabog
## 10 10 The New One
## 11 11 The Sentinel
## 12 12 Clanlands
## 13 13 Becoming
## 14 14 Can't Hurt Me
## 15 15 Caste (Oprah's Book Club)
## 16 16 Think Like a Monk
## 17 17 Harry Potter and the Sorcerer's Stone, Book 1
## 18 18 Untamed
## 19 19 The Guest List
## 20 20 Where the Crawdads Sing
## 21 21 Dr. Sebi Cure for Herpes
## 22 22 Midnight Sun
## 23 23 The Meaning of Mariah Carey
## 24 24 The Best of Me
## 25 25 Harry Potter and the Chamber of Secrets, Book 2
## 26 26 Harry Potter and the Goblet of Fire, Book 4
## 27 27 Harry Potter and the Order of the Phoenix, Book 5
## 28 28 Harry Potter and the Prisoner of Azkaban, Book 3
## 29 29 Harry Potter and the Deathly Hallows, Book 7
## 30 30 The Subtle Art of Not Giving a F*ck
## 31 31 Anxious People
## 32 32 Fortune and Glory
## 33 33 Harry Potter and the Half-Blood Prince, Book 6
## 34 34 Dune
## 35 35 Monster
## 36 36 The Invisible Life of Addie LaRue
## 37 37 Born a Crime
## 38 38 12 Rules for Life
## 39 39 Uncomfortable Conversations with a Black Man
## 40 40 Moonflower Murders
## 41 41 Heaven's River
## 42 42 The Vanishing Half
## 43 43 The Return
## 44 44 The Answer Is...
## 45 45 Pretty Things
## 46 46 The Searcher
## 47 47 Blackout
## 48 48 Talking to Strangers
## 49 49 Is This Anything?
## 50 50 American Dirt (Oprah's Book Club)
## 51 51 How to Win Friends & Influence People
## 52 52 From a Certain Point of View: The Empire Strikes Back (Star Wars)
## 53 53 The 7 Habits of Highly Effective People
## 54 54 Breath
## 55 55 Rich Dad Poor Dad
## 56 56 Extreme Ownership
## 57 57 The Four Agreements
## 58 58 The Five Love Languages: The Secret to Love That Lasts
## 59 59 Never Split the Difference
## 60 60 Invisible Girl
## 61 61 The Way of Kings
## 62 62 Too Much and Never Enough
## 63 63 The Truths We Hold
## 64 64 Unfu*k Yourself
## 65 65 The Evening and the Morning
## 66 66 Marauder
## 67 67 Brushfire
## 68 68 The Sandman
## 69 69 My Own Words
## 70 70 Hypnotic Gastric Band
## 71 71 Daylight
## 72 72 Ready Player One
## 73 73 The Chronicles of Narnia Adult Box Set
## 74 74 The Fellowship of the Ring
## 75 75 Dare to Lead
## 76 76 Sapiens
## 77 77 Educated
## 78 78 The Silent Patient
## 79 79 The Giver of Stars
## 80 80 The Body Keeps the Score
## 81 81 The Stand
## 82 82 The Dutch House
## 83 83 Rage
## 84 84 Troubled Blood
## 85 85 The Alchemist
## 86 86 The Power of Now
## 87 87 First Principles
## 88 88 Incomparable
## 89 89 The Audacity of Hope
## 90 90 Anxiety in Relationship
## 91 91 Disloyal: A Memoir
## 92 92 Didn't See That Coming
## 93 93 How to Be an Antiracist
## 94 94 The Total Money Makeover
## 95 95 The Gifts of Imperfection, 10th Anniversary Edition
## 96 96 Shuggie Bain
## 97 97 Extreme Rapid Weight Loss Hypnosis for Women
## 98 98 The Tales of Beedle the Bard
## 99 99 White Fragility
## 100 100 Burnout
The top 100 list changes from time to time, and the scraping itself might stop working one day, for example if the page layout changes. Time-stamping the output file helps to trace and save the changes.
filename <- paste0('TopAudiobooks-', format(Sys.time(), '%Y%m%d-%H%M%S'), '.csv')
write.csv(df, filename, row.names = FALSE)
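As a quick check of the time-stamped naming scheme, the sketch below writes a tiny stand-in data frame to a temporary directory and reads it back. The stand-in data, the `demo_df` and `check` names, and the use of `tempdir()` are assumptions made so the example runs without the scraped results.

```r
# Stand-in for the scraped data frame (two rows copied from the output above).
demo_df <- data.frame(
  Rank = c('1', '2'),
  Title = c('A Promised Land', 'Greenlights'),
  stringsAsFactors = FALSE
)

# Same naming scheme as above: date and time down to the second.
stamp <- format(Sys.time(), '%Y%m%d-%H%M%S')
filename <- file.path(tempdir(), paste0('TopAudiobooks-', stamp, '.csv'))

write.csv(demo_df, filename, row.names = FALSE)

# Read it back; colClasses keeps every column as text, matching the original.
check <- read.csv(filename, colClasses = 'character')
print(basename(filename))
print(check)
```

Because the stamp goes down to the second, runs more than a second apart write to separate files instead of overwriting an earlier snapshot.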
Now the data is ready for further cleaning and analysis!
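As one possible starting point for that cleaning, the sketch below converts a few of the text fields into typed values. The sample rows are copied from the output shown earlier. The new column names (`PriceNum`, `RatingsNum`, `StarsNum`, `ReleaseDate`) are hypothetical, and the date format assumes the month-day-year layout seen in the sample.

```r
# Three rows copied from the scraped output, kept as text like the original.
sample_df <- data.frame(
  Price   = c('$45.50', '$28.00', '$25.19'),
  Ratings = c('Not rated yet', '11,290 ratings', '326 ratings'),
  Stars   = c(NA, '5 out of 5 stars', '4 out of 5 stars'),
  Release = c('11-17-20', '10-20-20', '06-03-20'),
  stringsAsFactors = FALSE
)

# Strip the dollar sign and convert the price to a number.
sample_df$PriceNum <- as.numeric(gsub('\\$', '', sample_df$Price))

# Keep only digits from the ratings text; "Not rated yet" becomes NA.
ratings_digits <- gsub('[^0-9]', '', sample_df$Ratings)
ratings_digits[ratings_digits == ''] <- NA
sample_df$RatingsNum <- as.integer(ratings_digits)

# Take the leading number from text such as "4.5 out of 5 stars".
sample_df$StarsNum <- as.numeric(sub(' out of 5 stars', '', sample_df$Stars))

# Parse the release date, assuming month-day-year with a two-digit year.
sample_df$ReleaseDate <- as.Date(sample_df$Release, format = '%m-%d-%y')

str(sample_df)
```

The original text columns are left in place next to the converted ones, so any value that fails to parse can still be inspected.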