In this analysis, I explore the 20 books that have appeared most often on the NYT Bestseller list. I also look at which author has appeared on this much-coveted list the most. Spoiler alert: the book is The Da Vinci Code and the author is James Michener.
Additionally, I review the three authors behind the pseudonyms present in the dataset and the books penned under those pseudonyms.
Load libraries.
library(tidyverse) #Attaches ggplot2, dplyr, readr and stringr.
library(lubridate) #Provides mdy() for parsing the dates.
Load the dataset.
book <- read.csv('nyt-bestsellers.csv', na.strings=c("","NA"))
head(book)
## index year month day big_endian_date little_endian_date us_date
## 1 1 1931 10 12 1931-10-12 12/10/1931 10/12/1931
## 2 2 1931 10 19 1931-10-19 19/10/1931 10/19/1931
## 3 3 1931 10 26 1931-10-26 26/10/1931 10/26/1931
## 4 4 1931 11 2 1931-11-2 2/11/1931 11/2/1931
## 5 5 1931 11 9 1931-11-9 9/11/1931 11/9/1931
## 6 6 1931 11 16 1931-11-16 16/11/1931 11/16/1931
## title authors first_author second_author
## 1 The Ten Commandments Warwick Deeping Warwick Deeping <NA>
## 2 No List Published <NA> <NA> <NA>
## 3 No List Published <NA> <NA> <NA>
## 4 No List Published <NA> <NA> <NA>
## 5 No List Published <NA> <NA> <NA>
## 6 Maid in Waiting John Galsworthy John Galsworthy <NA>
## pseudonym_of
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
Choose relevant columns.
book <- book[,c(7:12)]
head(book)
## us_date title authors first_author second_author
## 1 10/12/1931 The Ten Commandments Warwick Deeping Warwick Deeping <NA>
## 2 10/19/1931 No List Published <NA> <NA> <NA>
## 3 10/26/1931 No List Published <NA> <NA> <NA>
## 4 11/2/1931 No List Published <NA> <NA> <NA>
## 5 11/9/1931 No List Published <NA> <NA> <NA>
## 6 11/16/1931 Maid in Waiting John Galsworthy John Galsworthy <NA>
## pseudonym_of
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
dim(book)
## [1] 4739 6
There are 4739 rows and 6 columns.
Check datatype for date column.
class(book$us_date)
## [1] "character"
The date column is stored as character. This isn’t what we want, so let’s convert it to a Date.
book$us_date <- mdy(book$us_date)
class(book$us_date)
## [1] "Date"
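The dates in this file are not zero-padded (for example 11/2/1931), so it is worth confirming that every value actually parsed. Here is a minimal sketch on two values copied from the output above, plus one deliberately invalid string that I added purely for illustration:

```r
library(lubridate)

# Two values from the us_date column, plus one invalid entry
# (the invalid entry is an assumption added for this example).
raw_dates <- c("10/12/1931", "11/2/1931", "not a date")

# quiet = TRUE suppresses the parse warning; failures come back as NA.
parsed <- mdy(raw_dates, quiet = TRUE)

# Collect the inputs that could not be parsed.
failed <- raw_dates[is.na(parsed)]

print(parsed)
print(failed)
```

On the real data, running sum(is.na(book$us_date)) after the conversion gives the number of dates that failed to parse.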
To see later which year had the most bestsellers, first create a column that holds only the year of each date.
year_only <- format(book$us_date, format="%Y")
book <- cbind(book, year_only)
#Remove rows that contain the phrase 'No List Published' in title column.
book <- book[!grepl('No List Published', book$title),]
head(book)
## us_date title authors first_author
## 1 1931-10-12 The Ten Commandments Warwick Deeping Warwick Deeping
## 6 1931-11-16 Maid in Waiting John Galsworthy John Galsworthy
## 8 1931-11-30 Maid in Waiting John Galsworthy John Galsworthy
## 9 1931-12-07 Maid in Waiting John Galsworthy John Galsworthy
## 10 1931-12-14 Maid in Waiting John Galsworthy John Galsworthy
## 11 1931-12-21 Maid in Waiting John Galsworthy John Galsworthy
## second_author pseudonym_of year_only
## 1 <NA> <NA> 1931
## 6 <NA> <NA> 1931
## 8 <NA> <NA> 1931
## 9 <NA> <NA> 1931
## 10 <NA> <NA> 1931
## 11 <NA> <NA> 1931
See which title appears the most on the bestseller list.
book_count <- book %>%
group_by(title) %>%
summarise(times_on_list=n()) %>%
arrange(-times_on_list)
book_listed_count <- head(book_count, 20) #Only show the top 20 books
book_listed_count
## # A tibble: 20 x 2
## title times_on_list
## <chr> <int>
## 1 The Da Vinci Code 59
## 2 Hawaii 49
## 3 The Caine Mutiny 48
## 4 The Robe 46
## 5 Love Story 41
## 6 The Source 40
## 7 Jonathan Livingston Seagull 38
## 8 The Bridges of Madison County 38
## 9 Trinity 36
## 10 The Spy Who Came in from the Cold 34
## 11 Where the Crawdads Sing 34
## 12 Anthony Adverse 33
## 13 Gone with the Wind 33
## 14 Désirée 32
## 15 Airport 30
## 16 Anatomy of a Murder 29
## 17 Fifty Shades of Grey 29
## 18 Herzog 29
## 19 Peyton Place 29
## 20 Advise and Consent 28
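One caveat: head() cuts the table at exactly 20 rows, so a title tied with the 20th entry would be dropped silently. As a rough sketch of an alternative, dplyr's slice_max() keeps ties by default. The toy data below is made up for illustration and is not from the dataset:

```r
library(dplyr)

# Toy data (made up): one row per weekly list appearance.
toy <- data.frame(
  title = c("A", "A", "A", "B", "B", "C", "C", "D")
)

# count() is shorthand for group_by() + summarise(n()).
toy_count <- toy %>%
  count(title, name = "times_on_list", sort = TRUE)

# Ask for the top 2: because B and C are tied for second place,
# slice_max() returns three rows rather than two.
top2 <- toy_count %>%
  slice_max(times_on_list, n = 2)

print(top2)
```

Passing with_ties = FALSE to slice_max() restores the strict cut-off if exactly 20 rows are wanted.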
See the graph.
ggplot(data=book_listed_count, aes(x=reorder(title, times_on_list), y=times_on_list)) +
geom_col(colour="#6495ED", fill="#6495ED", width=.65) + #geom_col() draws bars from the values as given.
coord_flip() + #Interchange x and y axis.
xlab("Book title") + ylab("No. of times on list") +
ggtitle("Top 20 books that stayed the longest on the bestseller list") +
theme(plot.title = element_text(hjust = 0.5, size = 15)) #Center the title.
The Da Vinci Code has featured the most on the bestseller list, which isn’t surprising considering how much buzz surrounded the book when it was published. I’m also not surprised that Fifty Shades of Grey made this list. I don’t recognize the other books.
Which author has been on the list the most?
popular_author <- book %>%
group_by(authors) %>%
summarise(count_on_list=n()) %>%
arrange(-count_on_list)
author_appears <- head(popular_author, 20)
#Plot the graph.
ggplot(data=author_appears, aes(x=reorder(authors, count_on_list), y=count_on_list)) +
geom_col(colour="#6495ED", fill="#6495ED", width=.65) + #geom_col() draws bars from the values as given.
coord_flip() + #Interchange x and y axis.
xlab("Author name") + ylab("No. of times on list") +
ggtitle("Top 20 popular authors") +
theme(plot.title = element_text(hjust = 0.5, size = 15)) #Center the title.
The most popular author on this list is James Michener. Surprisingly, Dan Brown, the author of The Da Vinci Code, isn’t even in the top 5. One reason is that most of the top 5 authors have written many books: James Michener wrote over 40 books, and Danielle Steel has written 190 books, including 141 novels. By contrast, Dan Brown has written only 8 books.
Unlike the list of book titles, this list is familiar to me: I recognize most of the names on it.
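Grouping by the authors column treats a co-written book as a separate "author" from either writer alone. Since the dataset also has first_author and second_author columns, one possible way to credit each writer individually is to stack those two columns before counting. This is only a sketch: the toy rows and author names below are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical names): "Book 2" is co-written.
toy <- data.frame(
  title         = c("Book 1", "Book 2", "Book 2"),
  first_author  = c("Author X", "Author X", "Author X"),
  second_author = c(NA, "Author Y", "Author Y")
)

per_author <- toy %>%
  # Stack the two author columns into a single "author" column.
  pivot_longer(c(first_author, second_author),
               names_to = "role", values_to = "author") %>%
  # Drop the empty second-author slots.
  filter(!is.na(author)) %>%
  count(author, name = "count_on_list", sort = TRUE)

print(per_author)
```

Here Author X is credited for all three weeks and Author Y for the two weeks of the co-written book.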
Which year had the most bestsellers?
year_count <- book %>%
group_by(year_only) %>%
summarise(num_of_books=n()) %>%
arrange(-num_of_books)
books_in_year <- head(year_count, 30)
books_in_year
## # A tibble: 30 x 2
## year_only num_of_books
## <chr> <int>
## 1 1941 54
## 2 1934 53
## 3 1940 53
## 4 1945 53
## 5 1950 53
## 6 1956 53
## 7 1961 53
## 8 1962 53
## 9 1967 53
## 10 1972 53
## # ... with 20 more rows
See the graph.
ggplot(data=books_in_year, aes(x=reorder(year_only, num_of_books), y=num_of_books)) +
geom_col(colour="#6495ED", fill="#6495ED", width=.65) + #geom_col() draws bars from the values as given.
coord_flip() + #Interchange x and y axis.
xlab("Year") + ylab("Number of list entries") +
ggtitle("Top 30 years by number of bestseller list entries") +
theme(plot.title = element_text(hjust = 0.5, size = 15)) #Center the title.
The year with the most entries is 1941, with 54. Note that n() counts rows, that is, weekly list appearances rather than distinct titles, which is why so many years tie at 53. The result was still strange to me given how easy publishing has become today: you’d expect more books to feature on the bestseller list as more and more authors publish their work. Then I discovered that self-published books, such as Kindle books, cannot be on the NYT bestseller list. To be on the list, the author must publish through a traditional publisher.
We can see that the only 21st-century years that made the list are 2017, 2012, 2006, and 2000. The rest are 20th-century years, with the forties and thirties dominating.
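Because n() counts rows, the table above really measures how many weekly lists each year contributed, which is why the top rows all sit at 53 or 54. To count distinct titles per year instead, n_distinct() can be used. A minimal sketch on made-up toy data:

```r
library(dplyr)

# Toy data (made up): one row per weekly list.
toy <- data.frame(
  year_only = c("1941", "1941", "1941", "2017", "2017", "2017"),
  title     = c("A", "A", "A", "B", "C", "D")
)

per_year <- toy %>%
  group_by(year_only) %>%
  summarise(
    weeks_listed    = n(),               # rows, i.e. weekly appearances
    distinct_titles = n_distinct(title)  # unique books in that year
  ) %>%
  arrange(desc(distinct_titles))

print(per_year)
```

Both toy years have three weekly entries, but only one distinct title in 1941 against three in 2017, so the two measures can rank years very differently.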
Let’s see how many pseudonyms are in the dataset.
pseudonyms <- book[complete.cases(book$pseudonym_of),]
n_distinct(pseudonyms$authors)
## [1] 3
There are 3 pseudonyms. Let’s see what they are.
unique(pseudonyms$authors)
## [1] "Richard Bachman" "Anonymous" "Robert Galbraith"
Let’s reveal the authors behind those pseudonyms and what book they penned under that pseudonym.
real_author <- pseudonyms %>%
distinct(pseudonym_of, .keep_all = TRUE)
real_author[, c('pseudonym_of', 'authors', 'title')] #Display only selected columns.
## pseudonym_of authors title
## 1 Stephen King Richard Bachman Thinner
## 2 Joe Klein Anonymous Primary Colors
## 3 J.K. Rowling Robert Galbraith The Cuckoo's Calling
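Note that distinct(pseudonym_of, .keep_all = TRUE) keeps only the first row for each real author, so an author with several bestsellers under a pen name would show just one title. A small sketch of the difference, using hypothetical names and titles:

```r
library(dplyr)

# Toy data (hypothetical): one author, one pen name, two titles.
toy <- data.frame(
  pseudonym_of = c("Real Name", "Real Name", "Real Name"),
  authors      = c("Pen Name", "Pen Name", "Pen Name"),
  title        = c("First Book", "First Book", "Second Book")
)

# Keeps the first row per real author only.
first_only <- toy %>%
  distinct(pseudonym_of, .keep_all = TRUE)

# Keeps every unique author / pen name / title combination.
all_titles <- toy %>%
  distinct(pseudonym_of, authors, title)

print(first_only)
print(all_titles)
```

The second form lists both toy titles, while the first shows only "First Book".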
Which author has been the most successful using a pseudonym?
pseudonyms %>%
group_by(pseudonym_of) %>%
summarise(appearance_count = n()) %>%
arrange(-appearance_count)
## # A tibble: 3 x 2
## pseudonym_of appearance_count
## <chr> <int>
## 1 Joe Klein 9
## 2 Stephen King 4
## 3 J.K. Rowling 1
Joe Klein has appeared on the bestseller list under a pseudonym the most times (9), followed by Stephen King (4). J.K. Rowling was the least successful by this measure, appearing only once on the list under a pseudonym.