Introduction

In this analysis, I explore the 20 books that have appeared most often on the NYT Bestseller list. I also look at the author who has appeared on this much-coveted list the most. Spoiler alert: the top book is The Da Vinci Code and the top author is James Michener.

Additionally, I review the three authors behind the pseudonyms present in the dataset and the books penned under those pseudonyms.

Load libraries.

library(tidyverse)
library(ggplot2)
library(readr)
library(stringr)
library(lubridate)
library(dplyr)
library(ggeasy)
library(scales)
library(httr)

Load the dataset.

book <- read.csv('nyt-bestsellers.csv', na.strings=c("","NA"))
head(book)
##   index year month day big_endian_date little_endian_date    us_date
## 1     1 1931    10  12      1931-10-12         12/10/1931 10/12/1931
## 2     2 1931    10  19      1931-10-19         19/10/1931 10/19/1931
## 3     3 1931    10  26      1931-10-26         26/10/1931 10/26/1931
## 4     4 1931    11   2       1931-11-2          2/11/1931  11/2/1931
## 5     5 1931    11   9       1931-11-9          9/11/1931  11/9/1931
## 6     6 1931    11  16      1931-11-16         16/11/1931 11/16/1931
##                  title         authors    first_author second_author
## 1 The Ten Commandments Warwick Deeping Warwick Deeping          <NA>
## 2    No List Published            <NA>            <NA>          <NA>
## 3    No List Published            <NA>            <NA>          <NA>
## 4    No List Published            <NA>            <NA>          <NA>
## 5    No List Published            <NA>            <NA>          <NA>
## 6      Maid in Waiting John Galsworthy John Galsworthy          <NA>
##   pseudonym_of
## 1         <NA>
## 2         <NA>
## 3         <NA>
## 4         <NA>
## 5         <NA>
## 6         <NA>

Choose relevant columns.

book <- book[,c(7:12)]
head(book)
##      us_date                title         authors    first_author second_author
## 1 10/12/1931 The Ten Commandments Warwick Deeping Warwick Deeping          <NA>
## 2 10/19/1931    No List Published            <NA>            <NA>          <NA>
## 3 10/26/1931    No List Published            <NA>            <NA>          <NA>
## 4  11/2/1931    No List Published            <NA>            <NA>          <NA>
## 5  11/9/1931    No List Published            <NA>            <NA>          <NA>
## 6 11/16/1931      Maid in Waiting John Galsworthy John Galsworthy          <NA>
##   pseudonym_of
## 1         <NA>
## 2         <NA>
## 3         <NA>
## 4         <NA>
## 5         <NA>
## 6         <NA>
dim(book)
## [1] 4739    6

There are 4739 rows and 6 columns.


Check datatype for date column.

class(book$us_date)
## [1] "character"

The date column is stored as character. This isn’t what we want. Let’s convert it to the Date class.

book$us_date <- mdy(book$us_date)
class(book$us_date)
## [1] "Date"
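
Before relying on the converted column, it can help to check whether any values failed to parse, since mdy() turns unparseable strings into NA. The sketch below is only an illustration on made-up strings, not the actual dataset; the names raw_dates, parsed and failed are hypothetical.

```r
library(lubridate)

# Toy dates in the same month/day/year layout as us_date.
# The last value is deliberately malformed.
raw_dates <- c("10/12/1931", "11/2/1931", "not a date")

# quiet = TRUE suppresses the parse-failure warning so the NAs can be inspected directly.
parsed <- mdy(raw_dates, quiet = TRUE)

# Inputs that were present but became NA after parsing.
failed <- raw_dates[is.na(parsed) & !is.na(raw_dates)]

print(parsed)
print(failed)
```

If failed is empty on the real column, every date was converted.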


To see later which year had the most bestsellers, first create a column that holds only the year of each date.

year_only <- format(book$us_date, format="%Y")
book <- cbind(book, year_only)

#Remove rows that contain the phrase 'No List Published' in title column.
book <- book[!grepl('No List Published', book$title),]
head(book)
##       us_date                title         authors    first_author
## 1  1931-10-12 The Ten Commandments Warwick Deeping Warwick Deeping
## 6  1931-11-16      Maid in Waiting John Galsworthy John Galsworthy
## 8  1931-11-30      Maid in Waiting John Galsworthy John Galsworthy
## 9  1931-12-07      Maid in Waiting John Galsworthy John Galsworthy
## 10 1931-12-14      Maid in Waiting John Galsworthy John Galsworthy
## 11 1931-12-21      Maid in Waiting John Galsworthy John Galsworthy
##    second_author pseudonym_of year_only
## 1           <NA>         <NA>      1931
## 6           <NA>         <NA>      1931
## 8           <NA>         <NA>      1931
## 9           <NA>         <NA>      1931
## 10          <NA>         <NA>      1931
## 11          <NA>         <NA>      1931
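
Note that format() returns the year as text, which is why year_only is a character column. If a numeric year is preferred (for example, to filter or sort by decade), lubridate's year() is one option. This is a minimal sketch on made-up dates; the names dates, year_chr and year_num are illustrative only.

```r
library(lubridate)

# Two made-up dates standing in for book$us_date.
dates <- as.Date(c("1931-10-12", "2017-03-05"))

year_chr <- format(dates, format = "%Y")  # character, like year_only above
year_num <- year(dates)                   # numeric

print(class(year_chr))
print(class(year_num))
```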

See which title appears the most on the bestseller list.

book_count <- book %>%
  group_by(title) %>%
  summarise(times_on_list=n()) %>%
  arrange(-times_on_list)
book_listed_count <- head(book_count, 20) #Only show the top 20 books
book_listed_count
## # A tibble: 20 x 2
##    title                             times_on_list
##    <chr>                                     <int>
##  1 The Da Vinci Code                            59
##  2 Hawaii                                       49
##  3 The Caine Mutiny                             48
##  4 The Robe                                     46
##  5 Love Story                                   41
##  6 The Source                                   40
##  7 Jonathan Livingston Seagull                  38
##  8 The Bridges of Madison County                38
##  9 Trinity                                      36
## 10 The Spy Who Came in from the Cold            34
## 11 Where the Crawdads Sing                      34
## 12 Anthony Adverse                              33
## 13 Gone with the Wind                           33
## 14 Désirée                                      32
## 15 Airport                                      30
## 16 Anatomy of a Murder                          29
## 17 Fifty Shades of Grey                         29
## 18 Herzog                                       29
## 19 Peyton Place                                 29
## 20 Advise and Consent                           28
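
The same tally can be written more compactly with dplyr's count() and slice_head(). The sketch below uses a small made-up data frame rather than the real dataset, and the names toy and top_titles are hypothetical.

```r
library(dplyr)

# Made-up weekly list entries: one row per week a title was listed.
toy <- data.frame(
  title = c("A", "B", "A", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# count() groups and tallies in one step; sort = TRUE orders by descending count.
top_titles <- toy %>%
  count(title, sort = TRUE, name = "times_on_list") %>%
  slice_head(n = 2)  # keep only the top 2 titles

print(top_titles)
```

On the real data, n = 20 would give the same top 20 as the group_by() and summarise() version.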

See the graph.

ggplot(data=book_listed_count, aes(reorder(title, times_on_list), times_on_list)) + 
  geom_bar(colour="#6495ED", fill="#6495ED", width=.65, stat="identity") + 
  coord_flip() + #Interchange x and y axis.
  xlab("Book title") + ylab("No. of times on list") +
  ggtitle("Top 20 books that stayed the longest on the bestseller list")+
  theme(plot.title = element_text(hjust = 0.5, size = 15)) #Center the title.

The Da Vinci Code has featured the most on the bestseller list, which isn’t surprising considering how much buzz surrounded the book when it was published. I’m also not surprised that Fifty Shades of Grey made this list. I don’t recognize the other books.


Which author has been on the list the most?

popular_author <- book %>%
  group_by(authors) %>%
  summarise(count_on_list=n()) %>%
  arrange(-count_on_list)
author_appears <- head(popular_author, 20)

#Plot the graph.
ggplot(data=author_appears, aes(reorder(authors, count_on_list), count_on_list)) + 
  geom_bar(colour="#6495ED", fill="#6495ED", width=.65, stat="identity") + 
  coord_flip() + #Interchange x and y axis.
  xlab("Author name") + ylab("No. of times on list") +
  ggtitle("Top 20 popular authors")+
  theme(plot.title = element_text(hjust = 0.5, size = 15)) #Center the title.

The most popular author on this list is James Michener. Surprisingly, Dan Brown, the author of The Da Vinci Code, isn’t even in the top 5. One reason is that most of the top 5 authors have written many books: James Michener wrote over 40 books, and Danielle Steel has written 190 books, including 141 novels. By contrast, Dan Brown has written only 8 books.

Unlike the list of book titles, this list is familiar to me. I recognize most of the names on it.
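
One way to check the "many books" explanation against the data is to count, for each author, both the total appearances and the number of distinct titles behind them. This is a sketch on made-up rows, not a result from the dataset; toy and author_summary are hypothetical names.

```r
library(dplyr)

# Made-up rows: Author A appears 3 times with 2 titles, Author B twice with 1 title.
toy <- data.frame(
  authors = c("Author A", "Author A", "Author A", "Author B", "Author B"),
  title   = c("Book 1", "Book 2", "Book 2", "Book 3", "Book 3"),
  stringsAsFactors = FALSE
)

author_summary <- toy %>%
  group_by(authors) %>%
  summarise(
    count_on_list   = n(),               # total weekly appearances
    distinct_titles = n_distinct(title)  # how many different books produced them
  ) %>%
  arrange(desc(count_on_list))

print(author_summary)
```

A high count_on_list paired with a high distinct_titles would support the idea that prolific authors reach the top through many books rather than one long-running title.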


Which year had the most bestsellers?

year_count <- book %>%
  group_by(year_only) %>%
  summarise(num_of_books=n()) %>%
  arrange(-num_of_books)
books_in_year <- head(year_count, 30)
books_in_year
## # A tibble: 30 x 2
##    year_only num_of_books
##    <chr>            <int>
##  1 1941                54
##  2 1934                53
##  3 1940                53
##  4 1945                53
##  5 1950                53
##  6 1956                53
##  7 1961                53
##  8 1962                53
##  9 1967                53
## 10 1972                53
## # ... with 20 more rows

See the graph.

ggplot(data=books_in_year, aes(reorder(year_only, num_of_books), num_of_books)) + 
  geom_bar(colour="#6495ED", fill="#6495ED", width=.65, stat="identity") + 
  coord_flip() + #Interchange x and y axis.
  xlab("Year") + ylab("Number of books") +
  ggtitle("Year with most bestsellers")+
  theme(plot.title = element_text(hjust = 0.5, size = 15)) #Center the title.

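The year with the most entries on the list is 1941, with 54. Keep in mind that each row in the dataset appears to be one weekly list, so these counts reflect how many weekly entries fall in a year rather than how many different books topped the list. Still, this was strange to me given how easy publishing has become today. You’d expect more books to feature on the bestseller list as more and more authors publish their work. Then I discovered that self-published books, such as Kindle books, cannot be on the NYT bestseller list. To be on the list, the author must publish through a traditional publisher.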

We can see that the only 21st-century years that made the top 30 are 2017, 2012, 2006, and 2000. The rest are 20th-century years, with the 1940s and 1930s dominating.
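
Because n() counts rows, a year in which one book held the list for months looks the same as a year with many different titles. To count unique books per year instead, n_distinct() can be used alongside n(). The sketch below runs on made-up rows only; toy, per_year, weeks_listed and distinct_titles are hypothetical names.

```r
library(dplyr)

# Made-up rows: in 1941 one book is listed for 3 weeks, in 1942 three books for 1 week each.
toy <- data.frame(
  year_only = c("1941", "1941", "1941", "1942", "1942", "1942"),
  title     = c("Book 1", "Book 1", "Book 1", "Book 2", "Book 3", "Book 4"),
  stringsAsFactors = FALSE
)

per_year <- toy %>%
  group_by(year_only) %>%
  summarise(
    weeks_listed    = n(),               # rows, i.e. weekly entries
    distinct_titles = n_distinct(title)  # unique books in that year
  ) %>%
  arrange(desc(distinct_titles))

print(per_year)
```

Both toy years have the same weeks_listed, but only distinct_titles separates them, which is the distinction the yearly ranking above does not capture.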


Let’s see how many pseudonyms are in the dataset.

pseudonyms <- book[complete.cases(book$pseudonym_of),]
n_distinct(pseudonyms$authors)
## [1] 3

There are 3 pseudonyms. Let’s see what they are.

unique(pseudonyms$authors)
## [1] "Richard Bachman"  "Anonymous"        "Robert Galbraith"

Let’s reveal the authors behind those pseudonyms and a book each of them penned under that name.

real_author <- pseudonyms %>% 
  distinct(pseudonym_of, .keep_all = TRUE)
real_author[, c('pseudonym_of', 'authors', 'title')] #Display only selected columns.
##   pseudonym_of          authors                title
## 1 Stephen King  Richard Bachman              Thinner
## 2    Joe Klein        Anonymous       Primary Colors
## 3 J.K. Rowling Robert Galbraith The Cuckoo's Calling
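
Note that distinct(pseudonym_of, .keep_all = TRUE) keeps only the first row per real author, so any further titles written under the same pseudonym would not be shown. A possible variant, sketched here on made-up rows, lists every pseudonymous title; toy and pseudonym_titles are hypothetical names.

```r
library(dplyr)

# Made-up rows: one pen name with two titles, plus one author writing under a real name.
toy <- data.frame(
  authors      = c("Pen Name X", "Pen Name X", "Pen Name X", "Real Name Z"),
  title        = c("Book 1", "Book 1", "Book 2", "Book 3"),
  pseudonym_of = c("Real Name Y", "Real Name Y", "Real Name Y", NA),
  stringsAsFactors = FALSE
)

pseudonym_titles <- toy %>%
  filter(!is.na(pseudonym_of)) %>%        # keep only rows written under a pseudonym
  distinct(pseudonym_of, authors, title)  # one row per real author, pen name and title

print(pseudonym_titles)
```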

Which author has been the most successful using a pseudonym?

pseudonyms %>%
  group_by(pseudonym_of) %>%
  summarise(appearance_count = n()) %>%
  arrange(-appearance_count)
## # A tibble: 3 x 2
##   pseudonym_of appearance_count
##   <chr>                   <int>
## 1 Joe Klein                   9
## 2 Stephen King                4
## 3 J.K. Rowling                1

Joe Klein has been on the bestseller list under a pseudonym the most times, followed by Stephen King. J.K. Rowling wasn’t as successful, as she appeared only once on the list under a pseudonym.