This dataset from Goodreads provides information about more than 10,000 books across multiple attributes, such as title, length, language, and average Goodreads rating.
Project Idea: We were interested in this dataset because it could be used by programming and cataloging librarians who schedule book clubs and discussions, or to understand who buys books throughout the year. Book publishers seeking to understand seasonal trends for marketing and sales purposes could also use it. Consumers looking for books rated highly by readers with habits similar to their own may be interested in this data as well, as could authors trying to identify the best time of year to release a book or pitch it to publishers and agents.
These audiences can learn the following about the books they are searching for: what time of year is most popular for reading; what time of year is most popular for specific genres; when most books are published; when specific genres are published most often; when Goodreads users most often find the time to rate books; and potential correlations between genres and the seasons or months in which they are most popular or most highly rated.
Project Questions: As shown in our presentation, we examined a variety of questions while exploring this dataset. The major questions the group sought to answer were:
RQ1: What publishers are most prolific in this dataset?
RQ2: Which publishers on the list produce the longest books?
RQ3: Are there certain words that appear more frequently in this dataset? Are these easier to understand in a word cloud or a chart?
RQ4: Does the number of pages in a book correlate with the number of reviews it receives?
RQ5: What is the average number of pages of books in this dataset? Does it change if we look at the Top 100 books versus the Bottom 100 books?
RQ6: Do books receive different ratings based on the time of year they’re read?
RQ7: Are more books published during certain times of the year?
RQ8: Does the average number of pages in a book vary depending on the time of year it is published?
About the Dataset:
Our group analyzed a Goodreads dataset collected via the Goodreads API and obtained from Kaggle user Soumik. It contains detailed information about individual books: 10,352 unique titles and 11,126 unique ISBNs across a variety of attributes and data types. The twelve columns contain strings, integers, and dates. (A short loading-and-inspection sketch follows the field list below.)
Information contained in the dataset includes:
title: The name under which the book was published.
authors: Names of the authors of the book.
average rating: The average rating the book has received overall.
isbn: The International Standard Book Number, a unique identifier for the book.
isbn13: A 13-digit ISBN identifying the book, used in place of the older 10-digit ISBN.
language code: The primary language of the book; for instance, eng is standard English.
number of pages: Number of pages the book contains.
ratings count: Total number of ratings the book received.
text reviews count: Total number of written text reviews the book received.
publication date: Date when the book was first published.
publisher: The name of the book's publisher.
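Before any analysis, it helps to confirm the shape described above. A minimal loading-and-inspection sketch, assuming the Kaggle export is saved locally as books.csv (the file name and cleaned column names are assumptions, not part of the original dataset description):

library(readr)
library(dplyr)

# Assumed local copy of Soumik's Kaggle export.
books <- read_csv("books.csv")

glimpse(books)               # column names and types
n_distinct(books$title)      # should show 10,352 unique titles
n_distinct(books$isbn13)     # should show 11,126 unique ISBNs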
Exploratory Data Analysis:
Before analyzing this data, we first had to clean the dataset, removing irrelevant columns and blank responses. We then made sure all the headers were abbreviated and easy to call within R. We realized early on that while some of our questions lent themselves well to the entire dataset, the number of data points made analysis extremely slow, and not all questions required every response. For those questions, we created two separate files that pulled (by rating) the top 100 and the bottom 100 books in the dataset; our reasoning was that if a question showed similar answers at both ends of the set, it wasn't necessary to use all 10,000+ responses.

Once we started analyzing the data, we experimented with different types of visualizations to see whether particular plots conveyed the data better. In some cases, multiple visualizations helped us understand the data in different ways: some people may not read a word cloud as easily as a simple chart, so it made sense to use both.
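A condensed sketch of that cleaning and subsetting step, reusing the books frame from the loading sketch above (the exact columns kept and the blank-row filter are assumptions; only the top-100/bottom-100 split by rating is described in our write-up):

library(dplyr)
library(readr)

# Drop blank responses and keep the columns used in later chunks.
clean <- books %>%
  filter(!is.na(average_rating), !is.na(num_pages)) %>%
  select(title, authors, average_rating, num_pages,
         ratings_count, text_reviews_count, publication_date, publisher)

# Pull the 100 highest- and 100 lowest-rated books into separate files.
top100    <- clean %>% arrange(desc(average_rating)) %>% slice(1:100)
bottom100 <- clean %>% arrange(average_rating) %>% slice(1:100)
write_csv(top100, "top100.csv")
write_csv(bottom100, "bottom100.csv")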
word        freq
penguin      364
vintage      305
classics     195
harper       150
berkley      133
bantam       127
house        126
simon        117
ballantine   108
university   103
library(tm)

filePath <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/books_publisher.csv"
text <- readLines(filePath)
docs <- Corpus(VectorSource(text))

# Replace separators with spaces before tokenizing.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

# Normalize case; strip numbers, stopwords, and publishing boilerplate.
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("books", "publishing", "company", "press", "paperbacks"))
docs <- tm_map(docs, removePunctuation)
# docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
library(wordcloud)
library(RColorBrewer)

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 5, max.words = 250,
          random.order = FALSE, rot.per = 0.35, scale = c(3, 0.50),
          colors = brewer.pal(8, "Dark2"))
# A tibble: 6 x 1
publisher
<chr>
1 Little Brown and Company
2 Houghton Mifflin
3 Back Bay Books
4 Pocket Books
5 Scholastic Inc.
6 Arthur A. Levine Books / Scholastic Inc.
# A tibble: 71 x 2
# Groups: publisher [71]
publisher n
<chr> <int>
1 Penguin Books 7
2 Penguin Classics 4
3 St. Martin's Press 4
4 Anchor Books 3
5 Back Bay Books 3
6 Bantam 3
7 Scholastic Inc. 3
8 Alfred A. Knopf 2
9 Ballantine Books 2
10 Dell Publishing Company 2
# ... with 61 more rows
# A tibble: 79 x 2
# Groups: publisher [79]
publisher n
<chr> <int>
1 Penguin Books 7
2 Vintage 5
3 VIZ Media LLC 4
4 Broadway Books 3
5 Modern Library 3
6 Verso 3
7 Bantam Books 2
8 Harper Perennial 2
9 Pocket Books 2
10 Abrams 1
# ... with 69 more rows
library(readr)
library(dplyr)

urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/books_publisher.csv"
books_publisher <- read_csv(url(urlfile))
head(books_publisher)

urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/top100.csv"
top100 <- read_csv(url(urlfile))
toppub <- select(top100, 1:12)
pub <- group_by(toppub, publisher)
pubs <- arrange(pub, publisher)
pubs2 <- pubs %>% count(publisher, sort = TRUE)
pubs2
wordcloud(words = pubs2$publisher, freq = pubs2$n, min.freq = 2,
          scale = c(3, 0.50), colors = brewer.pal(8, "Dark2"))
urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/bottom100.csv"
bottom100 <- read_csv(url(urlfile))
botpub <- select(bottom100, 1:12)
pub <- group_by(botpub, publisher)
pubs <- arrange(pub, publisher)
pubs2 <- pubs %>% count(publisher, sort = TRUE)
pubs2
wordcloud(words = pubs2$publisher, freq = pubs2$n, min.freq = 2,
          scale = c(3, 0.50), colors = brewer.pal(8, "Dark2"))
library(ggplot2)
library(plotly)

urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/LongestBooks25.csv"
LongestBooks25 <- read_csv(url(urlfile))

a <- ggplot(LongestBooks25, aes(x = num_pages, y = publisher)) +
  geom_point(color = "hotpink") +
  labs(title = "Publishers with the Longest Books",
       x = "Number of Pages", y = "Publisher Name")
ggplotly(a)
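The LongestBooks25 file is simply a slice of the full dataset by page count. A hedged reconstruction, reusing the books frame from the loading sketch (the exact tie-breaking we used isn't recorded):

library(dplyr)

# Keep the 25 books with the highest page counts, plus who published them.
LongestBooks25 <- books %>%
  arrange(desc(num_pages)) %>%
  slice(1:25) %>%
  select(title, publisher, num_pages)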
word      freq
life       105
world       79
love        73
death       63
new         62
time        62
trilogy     62
history     61
war         59
house       57
filePath <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/books_titles.csv"
text <- readLines(filePath)
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("part", "make", "made", "vol", "stories", "story"))
docs <- tm_map(docs, removePunctuation)
# docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 3, max.words = 100,
          random.order = TRUE, rot.per = 0.35, scale = c(3, 0.50),
          colors = brewer.pal(8, "Dark2"))
barplot(d[1:10, ]$freq, las = 2, names.arg = d[1:10, ]$word,
        col = "slateblue", main = "Most Frequent Words",
        ylab = "Word Frequencies")
urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/top100.csv"
top100 <- read_csv(url(urlfile))
b <- ggplot(top100, aes(x = num_pages, y = average_rating)) +
  geom_point(color = "plum2") +
  geom_smooth(se = TRUE) +
  labs(title = "Pages Compared to Average Rating for Top 100",
       subtitle = "No Correlation",
       x = "Number of Pages in Book", y = "Average Rating")
ggplotly(b)
urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/bottom100.csv"
bottom100 <- read_csv(url(urlfile))
c <- ggplot(bottom100, aes(x = num_pages, y = average_rating)) +
  geom_point(color = "springgreen1") +
  geom_smooth(se = TRUE) +
  labs(title = "Pages Compared to Average Rating for Bottom 100",
       subtitle = "No Correlation",
       x = "Number of Pages in Book", y = "Average Rating")
ggplotly(c)
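The "No Correlation" subtitles above are judged by eye from the smoothers. A quick way to attach a number to that judgment is a Pearson correlation test, which also addresses RQ4 directly, assuming the text_reviews_count column is retained in the subset files:

# Pages vs. rating in each subset.
cor.test(top100$num_pages, top100$average_rating)
cor.test(bottom100$num_pages, bottom100$average_rating)

# RQ4: pages vs. number of written reviews.
cor.test(top100$num_pages, top100$text_reviews_count)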
d <- ggplot(top100, aes(x = num_pages)) +
  geom_histogram(binwidth = 15, color = "black", fill = "darkorchid2") +
  ggtitle("Top 100 and Number of Pages")
ggplotly(d)
e <- ggplot(bottom100, aes(x = num_pages)) +
  geom_histogram(binwidth = 15, color = "black", fill = "darkturquoise") +
  ggtitle("Bottom 100 and Number of Pages")
ggplotly(e)
months <- rep(c("jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec"), 2)
Average <- c(3.968, 3.987, 3.949, 3.951, 3.942, 3.977,
             3.947, 3.958, 3.972, 3.966, 3.991, 3.958)
Median <- c(3.97, 3.99, 3.945, 3.98, 3.96, 3.99,
            3.94, 3.97, 3.98, 3.98, 4, 3.98)
values <- c(Average, Median)
type <- c(rep("Average", 12), rep("Median", 12))
# Keep the months in calendar order rather than alphabetical order.
mydata <- data.frame(months = factor(months, levels = unique(months)), values, type)
p <- ggplot(mydata, aes(months, values)) + coord_cartesian(ylim = c(3.9, 4))
qq <- p + geom_bar(stat = "identity", aes(fill = type), position = "dodge")
ggplotly(qq)
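The Average and Median vectors above were computed ahead of time. A sketch of how they can be derived from the full dataset, reusing the books frame from the loading sketch and assuming publication_date parses as month/day/year (note the dataset records publication dates, not reading dates, so "month" here means publication month):

library(dplyr)
library(lubridate)

books %>%
  mutate(month = month(mdy(publication_date), label = TRUE)) %>%
  filter(!is.na(month)) %>%
  group_by(month) %>%
  summarize(Average = mean(average_rating, na.rm = TRUE),
            Median  = median(average_rating, na.rm = TRUE))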
urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/month_numofbooks.csv"
month_numofbooks <- read_csv(url(urlfile))
g <- ggplot(month_numofbooks, aes(x = Month, y = `Number of Books`)) +
  geom_bar(stat = "identity", color = "springgreen", fill = "maroon1") +
  labs(title = "Number of Books Published Each Month")
ggplotly(g)
urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/month_numbooks_numpages.csv"
month_numbooks_numpages <- read_csv(url(urlfile))
# Plot book counts as columns and average pages as a line; the page values
# are multiplied by 3 to share the count axis, and the secondary axis
# divides by 3 to display them on their original scale.
h <- ggplot(month_numbooks_numpages) +
  geom_col(aes(x = Month, y = `Number of Books`),
           size = 1, color = "darkblue", fill = "white") +
  geom_line(aes(x = Month, y = 3 * `Average Number of Pages`),
            size = 1.5, color = "red", group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~ . / 3, name = "Average Number of Pages"))
ggplotly(h)
Conclusions:
This dataset provides a wealth of data for a variety of audiences, as we had initially hypothesized. While not all of the data may be relevant for a general user, or even a publisher (such as average number of pages in the book based on month published), much of the data could help to direct publisher decisions, as well as consumer purchases, if presented in an understandable way.