This dataset from Goodreads provides information about more than 10,000 books across multiple attributes, such as title, length, language, and average Goodreads rating.
Project Idea: We were interested in this dataset because it could be used by programming and cataloging librarians who schedule book clubs and discussions, or to understand who buys books throughout the year. Book publishers seeking to understand seasonal trends for marketing and sales purposes could also use it. Consumers looking for books rated highly by readers with habits similar to their own may be interested in this data as well, as could authors trying to identify the best time of year to release a book or pitch it to publishers and agents.
These audiences can learn the following about the books they are searching for: what time of year is most popular for reading; what time of year is most popular for specific genres; when most books are published; when specific genres are published most often; when Goodreads users most often find the time to rate books; and potential correlations between genres and the seasons or months in which they are most popular or most highly rated.
Project Questions: As shown in our presentation, we examined a variety of questions while exploring this dataset. The major questions the group sought to answer were:
RQ1: What publishers are most prolific in this dataset?
RQ2: Which publishers on the list produce the longest books?
RQ3: Are there certain words that appear more frequently in this dataset? Are these easier to understand in a word cloud or a chart?
RQ4: Does the number of pages in a book correlate with the number of reviews it receives?
RQ5: What is the average number of pages of books in this dataset? Does it change if we look at the Top 100 books versus the Bottom 100 books?
RQ6: Do books receive different ratings based on the time of year they’re read?
RQ7: Are more books published during certain times of the year?
RQ8: Does the average number of pages in a book vary depending on the time of year it is published?
About the Dataset:
Our group analyzed a Goodreads dataset collected via the Goodreads API and obtained from Kaggle user Soumik. It contains detailed information about individual books: 10,352 unique titles and 11,126 unique ISBNs across a variety of attributes and data types. The twelve columns contain strings, integers, and dates. (A short loading-and-inspection sketch follows the field list below.)
Information contained in the dataset includes:
title: The name under which the book was published.
authors: Names of the authors of the book.
average rating: The average rating the book has received overall.
isbn: The International Standard Book Number, a unique identifier for the book.
isbn13: A 13-digit ISBN identifying the book, used in place of the older 10-digit ISBN.
language code: The primary language of the book; for instance, eng is standard English.
number of pages: Number of pages the book contains.
ratings count: Total number of ratings the book received.
text reviews count: Total number of written text reviews the book received.
publication date: Date when the book was first published.
publisher: The name of the book's publisher.
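Before any analysis, it helps to confirm the shape described above. A minimal loading-and-inspection sketch, assuming the Kaggle export is saved locally as books.csv (the file name and cleaned column names are assumptions, not part of the original dataset description):

library(readr)
library(dplyr)

# Assumed local copy of Soumik's Kaggle export.
books <- read_csv("books.csv")

glimpse(books)               # column names and types
n_distinct(books$title)      # should show 10,352 unique titles
n_distinct(books$isbn13)     # should show 11,126 unique ISBNs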
Exploratory Data Analysis:
Before analyzing this data, we first had to clean the dataset, removing irrelevant columns and blank responses. We then made sure all the headers were abbreviated and easy to call within R. We realized early on that while some of our questions lent themselves well to the entire dataset, the number of data points made analysis extremely slow, and not all questions required every response. For those questions, we created two separate files that pulled (by rating) the top 100 and the bottom 100 books in the dataset; our reasoning was that if a question showed similar answers at both ends of the set, it wasn't necessary to use all 10,000+ responses.

Once we started analyzing the data, we experimented with different types of visualizations to see whether particular plots conveyed the data better. In some cases, multiple visualizations helped us understand the data in different ways: some people may not read a word cloud as easily as a simple chart, so it made sense to use both.
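A condensed sketch of that cleaning and subsetting step, reusing the books frame from the loading sketch above (the exact columns kept and the blank-row filter are assumptions; only the top-100/bottom-100 split by rating is described in our write-up):

library(dplyr)
library(readr)

# Drop blank responses and keep the columns used in later chunks.
clean <- books %>%
  filter(!is.na(average_rating), !is.na(num_pages)) %>%
  select(title, authors, average_rating, num_pages,
         ratings_count, text_reviews_count, publication_date, publisher)

# Pull the 100 highest- and 100 lowest-rated books into separate files.
top100    <- clean %>% arrange(desc(average_rating)) %>% slice(1:100)
bottom100 <- clean %>% arrange(average_rating) %>% slice(1:100)
write_csv(top100, "top100.csv")
write_csv(bottom100, "bottom100.csv")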
word        freq
penguin      364
vintage      305
classics     195
harper       150
berkley      133
bantam       127
house        126
simon        117
ballantine   108
university   103
library(tm)

filePath <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/books_publisher.csv"
text <- readLines(filePath)
docs <- Corpus(VectorSource(text))

# Replace separators with spaces before tokenizing.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

# Normalize case; strip numbers, stopwords, and publishing boilerplate.
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("books", "publishing", "company", "press", "paperbacks"))
docs <- tm_map(docs, removePunctuation)
# docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
library(wordcloud)
library(RColorBrewer)

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 5, max.words = 250,
          random.order = FALSE, rot.per = 0.35, scale = c(3, 0.50),
          colors = brewer.pal(8, "Dark2"))
# A tibble: 6 x 1
publisher
<chr>
1 Little Brown and Company
2 Houghton Mifflin
3 Back Bay Books
4 Pocket Books
5 Scholastic Inc.
6 Arthur A. Levine Books / Scholastic Inc.
# A tibble: 71 x 2
# Groups: publisher [71]
publisher n
<chr> <int>
1 Penguin Books 7
2 Penguin Classics 4
3 St. Martin's Press 4
4 Anchor Books 3
5 Back Bay Books 3
6 Bantam 3
7 Scholastic Inc. 3
8 Alfred A. Knopf 2
9 Ballantine Books 2
10 Dell Publishing Company 2
# ... with 61 more rows
# A tibble: 79 x 2
# Groups: publisher [79]
publisher n
<chr> <int>
1 Penguin Books 7
2 Vintage 5
3 VIZ Media LLC 4
4 Broadway Books 3
5 Modern Library 3
6 Verso 3
7 Bantam Books 2
8 Harper Perennial 2
9 Pocket Books 2
10 Abrams 1
# ... with 69 more rows
library(readr)
library(dplyr)

urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/books_publisher.csv"
books_publisher <- read_csv(url(urlfile))
head(books_publisher)

urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/top100.csv"
top100 <- read_csv(url(urlfile))
toppub <- select(top100, 1:12)
pub <- group_by(toppub, publisher)
pubs <- arrange(pub, publisher)
pubs2 <- pubs %>% count(publisher, sort = TRUE)
pubs2
wordcloud(words = pubs2$publisher, freq = pubs2$n, min.freq = 2,
          scale = c(3, 0.50), colors = brewer.pal(8, "Dark2"))
urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/bottom100.csv"
bottom100 <- read_csv(url(urlfile))
botpub <- select(bottom100, 1:12)
pub <- group_by(botpub, publisher)
pubs <- arrange(pub, publisher)
pubs2 <- pubs %>% count(publisher, sort = TRUE)
pubs2
wordcloud(words = pubs2$publisher, freq = pubs2$n, min.freq = 2,
          scale = c(3, 0.50), colors = brewer.pal(8, "Dark2"))
library(ggplot2)
library(plotly)

urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/LongestBooks25.csv"
LongestBooks25 <- read_csv(url(urlfile))

a <- ggplot(LongestBooks25, aes(x = num_pages, y = publisher)) +
  geom_point(color = "hotpink") +
  labs(title = "Publishers with the Longest Books",
       x = "Number of Pages", y = "Publisher Name")
ggplotly(a)
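The LongestBooks25 file is simply a slice of the full dataset by page count. A hedged reconstruction, reusing the books frame from the loading sketch (the exact tie-breaking we used isn't recorded):

library(dplyr)

# Keep the 25 books with the highest page counts, plus who published them.
LongestBooks25 <- books %>%
  arrange(desc(num_pages)) %>%
  slice(1:25) %>%
  select(title, publisher, num_pages)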
word      freq
life       105
world       79
love        73
death       63
new         62
time        62
trilogy     62
history     61
war         59
house       57
filePath <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/books_titles.csv"
text <- readLines(filePath)
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("part", "make", "made", "vol", "stories", "story"))
docs <- tm_map(docs, removePunctuation)
# docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 3, max.words = 100,
          random.order = TRUE, rot.per = 0.35, scale = c(3, 0.50),
          colors = brewer.pal(8, "Dark2"))
barplot(d[1:10, ]$freq, las = 2, names.arg = d[1:10, ]$word,
        col = "slateblue", main = "Most Frequent Words",
        ylab = "Word Frequencies")
urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/top100.csv"
top100 <- read_csv(url(urlfile))
b <- ggplot(top100, aes(x = num_pages, y = average_rating)) +
  geom_point(color = "plum2") +
  geom_smooth(se = TRUE) +
  labs(title = "Pages Compared to Average Rating for Top 100",
       subtitle = "No Correlation",
       x = "Number of Pages in Book", y = "Average Rating")
ggplotly(b)
urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/bottom100.csv"
bottom100 <- read_csv(url(urlfile))
c <- ggplot(bottom100, aes(x = num_pages, y = average_rating)) +
  geom_point(color = "springgreen1") +
  geom_smooth(se = TRUE) +
  labs(title = "Pages Compared to Average Rating for Bottom 100",
       subtitle = "No Correlation",
       x = "Number of Pages in Book", y = "Average Rating")
ggplotly(c)
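The "No Correlation" subtitles above are judged by eye from the smoothers. A quick way to attach a number to that judgment is a Pearson correlation test, which also addresses RQ4 directly, assuming the text_reviews_count column is retained in the subset files:

# Pages vs. rating in each subset.
cor.test(top100$num_pages, top100$average_rating)
cor.test(bottom100$num_pages, bottom100$average_rating)

# RQ4: pages vs. number of written reviews.
cor.test(top100$num_pages, top100$text_reviews_count)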
d <- ggplot(top100, aes(x = num_pages)) +
  geom_histogram(binwidth = 15, color = "black", fill = "darkorchid2") +
  ggtitle("Top 100 and Number of Pages")
ggplotly(d)
e <- ggplot(bottom100, aes(x = num_pages)) +
  geom_histogram(binwidth = 15, color = "black", fill = "darkturquoise") +
  ggtitle("Bottom 100 and Number of Pages")
ggplotly(e)
months <- rep(c("jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec"), 2)
Average <- c(3.968, 3.987, 3.949, 3.951, 3.942, 3.977,
             3.947, 3.958, 3.972, 3.966, 3.991, 3.958)
Median <- c(3.97, 3.99, 3.945, 3.98, 3.96, 3.99,
            3.94, 3.97, 3.98, 3.98, 4, 3.98)
values <- c(Average, Median)
type <- c(rep("Average", 12), rep("Median", 12))
# Keep the months in calendar order rather than alphabetical order.
mydata <- data.frame(months = factor(months, levels = unique(months)), values, type)
p <- ggplot(mydata, aes(months, values)) + coord_cartesian(ylim = c(3.9, 4))
qq <- p + geom_bar(stat = "identity", aes(fill = type), position = "dodge")
ggplotly(qq)
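The Average and Median vectors above were computed ahead of time. A sketch of how they can be derived from the full dataset, reusing the books frame from the loading sketch and assuming publication_date parses as month/day/year (note the dataset records publication dates, not reading dates, so "month" here means publication month):

library(dplyr)
library(lubridate)

books %>%
  mutate(month = month(mdy(publication_date), label = TRUE)) %>%
  filter(!is.na(month)) %>%
  group_by(month) %>%
  summarize(Average = mean(average_rating, na.rm = TRUE),
            Median  = median(average_rating, na.rm = TRUE))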
urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/month_numofbooks.csv"
month_numofbooks <- read_csv(url(urlfile))
g <- ggplot(month_numofbooks, aes(x = Month, y = `Number of Books`)) +
  geom_bar(stat = "identity", color = "springgreen", fill = "maroon1") +
  labs(title = "Number of Books Published Each Month")
ggplotly(g)
urlfile <- "https://raw.githubusercontent.com/ryanmar814/datavizfinal/master/month_numbooks_numpages.csv"
month_numbooks_numpages <- read_csv(url(urlfile))
# Plot book counts as columns and average pages as a line; the page values
# are multiplied by 3 to share the count axis, and the secondary axis
# divides by 3 to display them on their original scale.
h <- ggplot(month_numbooks_numpages) +
  geom_col(aes(x = Month, y = `Number of Books`),
           size = 1, color = "darkblue", fill = "white") +
  geom_line(aes(x = Month, y = 3 * `Average Number of Pages`),
            size = 1.5, color = "red", group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~ . / 3, name = "Average Number of Pages"))
ggplotly(h)
Conclusions:
This dataset provides a wealth of data for a variety of audiences, as we had initially hypothesized. While not all of the data may be relevant for a general user, or even a publisher (such as average number of pages in the book based on month published), much of the data could help to direct publisher decisions, as well as consumer purchases, if presented in an understandable way.