This dataset, books.csv, is a comprehensive list of the books listed on goodreads.com. It can be found at https://www.kaggle.com/jealousleopard/goodreadsbooks.

Variables and short description:

This document shows how I attempted the quiz, and the values I got as my answers.

Question 1: Create books_small

The code below creates books_small, a subset of the full books data that keeps only the columns needed and drops the unnecessary ones.

books_small = select(data, title, authors, average_rating, publication_date)
First six rows of the books_small table

title | authors | average_rating | publication_date
Harry Potter and the Half-Blood Prince (Harry Potter #6) | J.K. Rowling | 4.57 | 2006-09-16
Harry Potter and the Order of the Phoenix (Harry Potter #5) | J.K. Rowling | 4.49 | 2004-09-01
Harry Potter and the Chamber of Secrets (Harry Potter #2) | J.K. Rowling | 4.42 | 2003-11-01
Harry Potter and the Prisoner of Azkaban (Harry Potter #3) | J.K. Rowling | 4.56 | 2004-05-01
Harry Potter Boxed Set Books 1-5 (Harry Potter #1-5) | J.K. Rowling | 4.78 | 2004-09-13
Unauthorized Harry Potter Book Seven News: “Half-Blood Prince” Analysis and Speculation | W. Frederick Zimmerman | 3.74 | 2005-04-26
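
As a small illustration of what select() does here, the sketch below builds a toy table and keeps the same four columns. The table `toy` and its values are made up for this example and are not part of books.csv.

```r
library(dplyr)

# Toy stand-in for the books data; all values are invented for illustration.
toy <- tibble(
  bookID = c(1, 2),
  title = c("Book A", "Book B"),
  authors = c("Author X", "Author Y"),
  average_rating = c(4.1, 3.8),
  publication_date = as.Date(c("2001-05-01", "2010-11-30")),
  publisher = c("P1", "P2")
)

# Keep only the four columns of interest, in the order listed.
toy_small <- select(toy, title, authors, average_rating, publication_date)
names(toy_small)
```

The columns that are not named (bookID and publisher in the toy table) are dropped, and the row count is unchanged.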

Question 2: Add a column called tot_points to books

books = mutate(data, tot_points = average_rating * ratings_count)
First six rows of the books table with the additional tot_points column

bookID | title | authors | average_rating | isbn | isbn13 | language_code | num_pages | ratings_count | text_reviews_count | publication_date | publisher | tot_points
1 | Harry Potter and the Half-Blood Prince (Harry Potter #6) | J.K. Rowling | 4.57 | 0439785960 | 9.780440e+12 | eng | 652 | 2095690 | 27591 | 2006-09-16 | Scholastic Inc. | 9577303.30
2 | Harry Potter and the Order of the Phoenix (Harry Potter #5) | J.K. Rowling | 4.49 | 0439358078 | 9.780439e+12 | eng | 870 | 2153167 | 29221 | 2004-09-01 | Scholastic Inc. | 9667719.83
4 | Harry Potter and the Chamber of Secrets (Harry Potter #2) | J.K. Rowling | 4.42 | 0439554896 | 9.780440e+12 | eng | 352 | 6333 | 244 | 2003-11-01 | Scholastic | 27991.86
5 | Harry Potter and the Prisoner of Azkaban (Harry Potter #3) | J.K. Rowling | 4.56 | 043965548X | 9.780440e+12 | eng | 435 | 2339585 | 36325 | 2004-05-01 | Scholastic Inc. | 10668507.60
8 | Harry Potter Boxed Set Books 1-5 (Harry Potter #1-5) | J.K. Rowling | 4.78 | 0439682584 | 9.780440e+12 | eng | 2690 | 41428 | 164 | 2004-09-13 | Scholastic | 198025.84
9 | Unauthorized Harry Potter Book Seven News: “Half-Blood Prince” Analysis and Speculation | W. Frederick Zimmerman | 3.74 | 0976540606 | 9.780977e+12 | en-US | 152 | 19 | 1 | 2005-04-26 | Nimble Books | 71.06
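
To check the arithmetic behind tot_points, the sketch below recomputes it for two rows copied from the table above (bookID 1 and 9). The table name `check` is only used for this illustration.

```r
library(dplyr)

# Two rows copied from the table above, with only the columns needed here.
check <- tibble(
  bookID = c(1, 9),
  average_rating = c(4.57, 3.74),
  ratings_count = c(2095690, 19)
)

# Same formula as in the answer: rating multiplied by number of ratings.
check <- mutate(check, tot_points = average_rating * ratings_count)
check$tot_points
```

Up to floating-point rounding, the two values agree with the 9577303.30 and 71.06 shown in the table.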

Question 3: Count the books whose titles start with ‘The’

total = count(filter(data, str_detect(title, "^The")))

Total books:

[1] 2313
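
One thing worth noting about the pattern "^The": it is case-sensitive and matches any title that begins with those three letters, so titles such as "Theory ..." or "There ..." are counted too. If only the word "The" is wanted, a word boundary can be added. The sketch below uses made-up titles to show the difference; whether the quiz intended the stricter reading is an assumption on my part.

```r
library(stringr)

# Hypothetical titles, chosen to show the edge cases.
titles <- c("The Hobbit", "Theory of Games", "There and Back", "A Tale", "the lowercase")

# Prefix match: any title beginning with the letters "The".
prefix_match <- str_detect(titles, "^The")

# Whole-word match: "The" followed by a word boundary.
word_match <- str_detect(titles, "^The\\b")

sum(prefix_match)  # 3
sum(word_match)    # 1
```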

Question 4: Books published in 2005 with an average rating above 4.5 or more than 1,000 text reviews

datap = filter(data, publication_date <= "2005-12-31" & publication_date >= "2005-01-01",
            average_rating > 4.5 | text_reviews_count > 1000)

The result contains all the books that were published in 2005 and that had either an average rating above 4.5 or more than 1,000 text reviews.

First six rows of books published in 2005 with an average rating above 4.5 or more than 1,000 text reviews

bookID | title | authors | average_rating | isbn | isbn13 | language_code | num_pages | ratings_count | text_reviews_count | publication_date | publisher
10 | Harry Potter Collection (Harry Potter #1-6) | J.K. Rowling | 4.73 | 0439827604 | 9.780440e+12 | eng | 3342 | 28242 | 808 | 2005-09-12 | Scholastic
231 | I am Charlotte Simmons | Tom Wolfe | 3.42 | 0312424442 | 9.780312e+12 | eng | 738 | 20888 | 1688 | 2005-08-30 | Picador USA
428 | Play It As It Lays | Joan Didion | 3.87 | 0374529949 | 9.780375e+12 | eng | 231 | 23656 | 1706 | 2005-11-15 | Farrar Straus and Giroux
475 | Collapse: How Societies Choose to Fail or Succeed | Jared Diamond | 3.93 | 0143036556 | 9.780143e+12 | eng | 608 | 52522 | 2780 | 2005-12-27 | Penguin Books Ltd. (London)
703 | The Plot Against America | Philip Roth | 3.77 | 1400079497 | 9.781400e+12 | eng | 391 | 33321 | 2925 | 2005-09-27 | Vintage International
868 | Fullmetal Alchemist Vol. 3 (Fullmetal Alchemist #3) | Hiromu Arakawa | 4.56 | 1591169259 | 9.781591e+12 | eng | 192 | 16666 | 299 | 2005-09-13 | VIZ Media LLC
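
In filter(), conditions separated by a comma are combined with AND, so the year restriction applies to every row while the rating and review conditions are alternatives. The sketch below shows this on a made-up table and compares the date-range form with extracting the year through format(). It assumes publication_date is stored as a Date; `toy` and its values are invented.

```r
library(dplyr)

# Invented rows covering each case: in/out of 2005, high rating, many reviews.
toy <- tibble(
  title = c("A", "B", "C", "D"),
  publication_date = as.Date(c("2005-03-01", "2005-07-15", "2004-12-31", "2005-10-10")),
  average_rating = c(4.6, 3.9, 4.9, 4.0),
  text_reviews_count = c(10, 2500, 5000, 50)
)

# Date-range form, as in the answer above.
by_range <- filter(
  toy,
  publication_date >= as.Date("2005-01-01") & publication_date <= as.Date("2005-12-31"),
  average_rating > 4.5 | text_reviews_count > 1000
)

# Alternative: compare the year extracted from the date.
by_year <- filter(
  toy,
  format(publication_date, "%Y") == "2005",
  average_rating > 4.5 | text_reviews_count > 1000
)

by_range$title
identical(by_range$title, by_year$title)
```

Book C is excluded despite its high rating and review count because it was published in 2004, and book D is excluded because it meets neither of the two alternatives.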

Question 5: Top 5 Authors

Code

I first filtered the data to keep only books with more than 1,000,000 ratings, to reduce the bias that books with few ratings might introduce. For each author I then calculated the mean ratings count and the mean total points over the remaining books. Dividing the mean total points by the mean ratings count gives an average rating weighted by ratings count, so that a high rating on a book with few ratings (or the reverse) does not dominate the result. Finally, the authors were sorted by this weighted rating, the top 5 were kept, and the result was plotted as a bar chart to make it easier for readers to understand.

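# Keep only books with more than 1,000,000 ratings
data2 = filter(books, ratings_count > 1000000)
top = group_by(data2, authors)
# Per-author means of ratings count and total points
top = summarise(top, avg_count = mean(ratings_count, na.rm = TRUE), avg_tot_points = mean(tot_points, na.rm = TRUE))
# Ratings-weighted average rating per author
top = mutate(top, author_avg_rating = avg_tot_points / avg_count)
# Top 5 authors by weighted average rating
top = head(arrange(top, desc(author_avg_rating)), 5)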
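
The ratio of the two means is the same as a ratings-weighted mean of average_rating, because the group size cancels out of the division. The sketch below checks this on a made-up table using base R's weighted.mean(); the authors "X" and "Y" and all numbers are invented.

```r
library(dplyr)

# Invented data: author X has two books, author Y has one.
toy <- tibble(
  authors = c("X", "X", "Y"),
  average_rating = c(4.0, 5.0, 4.5),
  ratings_count = c(3000000, 1000000, 2000000)
)
toy <- mutate(toy, tot_points = average_rating * ratings_count)

# Compute the author rating both ways for comparison.
by_author <- summarise(
  group_by(toy, authors),
  ratio_of_means = mean(tot_points) / mean(ratings_count),
  weighted = weighted.mean(average_rating, ratings_count)
)
by_author
```

For author X both columns give 4.25 rather than the unweighted mean of 4.5, since the book with three times as many ratings counts three times as much.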

Bar Chart
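
The plotting code is not shown in this document, so the following is only a sketch of how such a bar chart could be drawn with ggplot2. The table `top_toy` is a hypothetical stand-in for `top`; its author names and ratings are invented and are not the quiz results.

```r
library(ggplot2)

# Hypothetical stand-in for the `top` table; names and values are invented.
top_toy <- data.frame(
  authors = c("Author A", "Author B", "Author C", "Author D", "Author E"),
  author_avg_rating = c(4.45, 4.40, 4.31, 4.28, 4.20)
)

# One bar per author, ordered by rating; flipped so long names stay readable.
p <- ggplot(top_toy, aes(x = reorder(authors, author_avg_rating), y = author_avg_rating)) +
  geom_col() +
  coord_flip() +
  labs(x = "Author", y = "Weighted average rating", title = "Top 5 authors")

print(p)
```

geom_col() is used rather than geom_bar() because the bar heights are already computed values, not counts of rows.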