Books Data

In this exercise, we work with the books.csv data set, a comprehensive list of books listed on goodreads.com. It can be found at https://www.kaggle.com/jealousleopard/goodreadsbooks
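The solutions below assume a data frame called books is already loaded, with publication_date stored as a date. The write-up does not show that step, so the following is only a minimal sketch of how it might look. The inline CSV text, the name books_demo, and the month/day/year date format are assumptions made for illustration; with the real file, read.csv("books.csv") would replace read.csv(text = ...).

```r
library(dplyr)

# Two made-up rows standing in for the real file (illustrative values only)
csv_text <- paste(
  "title,authors,average_rating,publication_date",
  "Book One,Author A,4.10,9/16/2006",
  "Book Two,Author B,3.90,11/1/2003",
  sep = "\n"
)

books_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE) %>%
  as_tibble() %>%
  # Assumption: dates are written as month/day/year in the raw file
  mutate(publication_date = as.Date(publication_date, format = "%m/%d/%Y"))

books_demo
```

Storing the column as a Date is what allows the date comparisons used later in the exercise.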

Variables and short description:

1. Create a new data frame called books_small that includes title, authors, average_rating, and publication_date.

Solution: Using the select function, we can create and store a data frame containing only the selected columns.

library(dplyr)  # provides select, mutate, filter, arrange, and count
books_small <- select(books, title, authors, average_rating, publication_date)
books_small
## # A tibble: 8,470 x 4
##    title                            authors      average_rating publication_date
##    <chr>                            <chr>                 <dbl> <date>          
##  1 "Harry Potter and the Half-Bloo~ J.K. Rowling           4.57 2006-09-16      
##  2 "Harry Potter and the Order of ~ J.K. Rowling           4.49 2004-09-01      
##  3 "Harry Potter and the Chamber o~ J.K. Rowling           4.42 2003-11-01      
##  4 "Harry Potter and the Prisoner ~ J.K. Rowling           4.56 2004-05-01      
##  5 "Harry Potter Boxed Set  Books ~ J.K. Rowling           4.78 2004-09-13      
##  6 "Unauthorized Harry Potter Book~ W. Frederic~           3.74 2005-04-26      
##  7 "Harry Potter Collection (Harry~ J.K. Rowling           4.73 2005-09-12      
##  8 "The Ultimate Hitchhiker's Guid~ Douglas Ada~           4.38 2005-11-01      
##  9 "The Ultimate Hitchhiker's Guid~ Douglas Ada~           4.38 2002-04-30      
## 10 "The Hitchhiker's Guide to the ~ Douglas Ada~           4.22 2004-08-03      
## # ... with 8,460 more rows

2. Create a new variable called tot_points that is the product of average_rating and ratings_count.

Solution: Creating new variables requires the mutate function. This appends a new column called tot_points to the books data frame, after its existing columns.

books <- mutate(books, tot_points = average_rating * ratings_count)
books
## # A tibble: 8,470 x 13
##    bookID title authors average_rating isbn  isbn13 language_code num_pages
##     <dbl> <chr> <chr>            <dbl> <chr> <chr>  <chr>             <dbl>
##  1      1 "Har~ J.K. R~           4.57 0439~ 97804~ eng                 652
##  2      2 "Har~ J.K. R~           4.49 0439~ 97804~ eng                 870
##  3      4 "Har~ J.K. R~           4.42 0439~ 97804~ eng                 352
##  4      5 "Har~ J.K. R~           4.56 0439~ 97804~ eng                 435
##  5      8 "Har~ J.K. R~           4.78 0439~ 97804~ eng                2690
##  6      9 "Una~ W. Fre~           3.74 0976~ 97809~ en-US               152
##  7     10 "Har~ J.K. R~           4.73 0439~ 97804~ eng                3342
##  8     12 "The~ Dougla~           4.38 0517~ 97805~ eng                 815
##  9     13 "The~ Dougla~           4.38 0345~ 97803~ eng                 815
## 10     14 "The~ Dougla~           4.22 1400~ 97814~ eng                 215
## # ... with 8,460 more rows, and 5 more variables: ratings_count <dbl>,
## #   text_reviews_count <dbl>, publication_date <date>, publisher <chr>,
## #   tot_points <dbl>

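3. How many book titles begin with the word “The”?

Solution: I filtered the data set to titles whose first three characters are “The” and counted the rows. Note that this also counts titles whose first word merely starts with those letters, such as “There” or “They”.

count(filter(books, substr(title, 1, 3) == "The"))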
## # A tibble: 1 x 1
##       n
##   <int>
## 1  2313
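A prefix match on three characters is looser than a match on the whole word. As a hedged sketch of the difference, the example below compares the substr approach with a regular expression that requires a word boundary after “The”. The titles in toy are made up for illustration, and the match is case-sensitive in both versions.

```r
library(dplyr)

# Made-up titles: two begin with the word "The", three only share the prefix
toy <- tibble(title = c(
  "The Hobbit", "There Will Be Blood", "Theory of Everything",
  "Then She Was Gone", "A Tale of Two Cities", "The"
))

# Prefix match: first three characters equal "The"
loose <- count(filter(toy, substr(title, 1, 3) == "The"))

# Word match: "The" at the start, followed by a word boundary
strict <- count(filter(toy, grepl("^The\\b", title, perl = TRUE)))

loose$n   # 5
strict$n  # 2
```

Applying the same grepl condition to books would give the stricter count for the real data.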

4. Find all the books that were published in 2005 and that either had an average rating above 4.5 or more than 1,000 text reviews.

Solution: To find this answer, I use the filter function with the comparison operators >=, <=, and > and the logical operators & and |. Conditions separated by a comma are combined with AND.

filter(books, publication_date >= as.Date("2005-01-01") & publication_date <= as.Date("2005-12-31"), average_rating > 4.5 | text_reviews_count > 1000)
## # A tibble: 154 x 13
##    bookID title authors average_rating isbn  isbn13 language_code num_pages
##     <dbl> <chr> <chr>            <dbl> <chr> <chr>  <chr>             <dbl>
##  1     10 Harr~ J.K. R~           4.73 0439~ 97804~ eng                3342
##  2    231 I am~ Tom Wo~           3.42 0312~ 97803~ eng                 738
##  3    428 Play~ Joan D~           3.87 0374~ 97803~ eng                 231
##  4    475 Coll~ Jared ~           3.93 0143~ 97801~ eng                 608
##  5    703 The ~ Philip~           3.77 1400~ 97814~ eng                 391
##  6    868 Full~ Hiromu~           4.56 1591~ 97815~ eng                 192
##  7    870 Full~ Hiromu~           4.5  1591~ 97815~ eng                 192
##  8    871 Full~ Hiromu~           4.55 1591~ 97815~ eng                 200
##  9    873 Full~ Hiromu~           4.52 1591~ 97815~ eng                 192
## 10    880 Pomp~ Robert~           3.82 0812~ 97808~ eng                 274
## # ... with 144 more rows, and 5 more variables: ratings_count <dbl>,
## #   text_reviews_count <dbl>, publication_date <date>, publisher <chr>,
## #   tot_points <dbl>
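To make the AND/OR logic of this filter easier to check, here is a small sketch on made-up data. It extracts the year with format() instead of comparing against two boundary dates, which is one possible alternative rather than the approach used above. The toy tibble and its values are purely illustrative.

```r
library(dplyr)

toy <- tibble(
  title = c("A", "B", "C", "D"),
  publication_date = as.Date(c("2005-03-01", "2005-07-15",
                               "2004-12-31", "2005-10-10")),
  average_rating = c(4.6, 3.9, 4.8, 4.0),
  text_reviews_count = c(10, 1500, 2000, 50)
)

hits <- filter(
  toy,
  format(publication_date, "%Y") == "2005",         # published in 2005 ...
  average_rating > 4.5 | text_reviews_count > 1000  # ... AND (high rating OR many reviews)
)

hits$title  # "A" "B"
```

Book C meets both rating conditions but is excluded by the year, and book D is from 2005 but meets neither condition.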

5. Who are the top 5 authors? There is definitely no right or wrong answer here, and there are many ways to think about this question. I am interested in your rationale, how you think about data, represent information, and justify your answer. You may answer this question using a visualization and/or data analysis. Be creative and think about how you can use the available data to answer this question. No need to do anything fancy (e.g., a regression or model). Add a short, worded explanation (as a comment in the script) accompanying your answer.

Solution

I define a top author as one with at least one book that has an average rating above 4.3, more than 100,000 ratings, and more than 10,000 text reviews. Authors with high ratings are usually publishing quality books that are well received by the public. Since reading is a customer-driven activity, the views of customers matter, which is why a book must also have more than 10,000 text reviews. Additionally, for a highly rated author to be considered at the top, the number of ratings needs to be high, so that the average is not based on only a few readers. This is why ratings_count must be more than 100,000.

# Keep only books that pass all three thresholds
top_5_authors <- filter(books, average_rating > 4.30 & ratings_count > 100000 & text_reviews_count > 10000)

top_5 <- select(top_5_authors, authors, average_rating, ratings_count, text_reviews_count)
top_5 <- arrange(top_5, desc(average_rating))
library(kableExtra)  # provides kable_minimal and collapse_rows
updated_table <- kable_minimal(
  knitr::kable(top_5, caption = "Top Authors List"),
  lightable_options = "basic",
  html_font = "\"Trebuchet MS\", verdana, sans-serif"
)
collapse_rows(updated_table)
Top Authors List
authors              average_rating  ratings_count  text_reviews_count
J.K. Rowling                   4.57        2095690               27591
                               4.56        2339585               36325
                               4.49        2153167               29221
                               4.42        2293963               34692
George R.R. Martin             4.41         638766               16535
J.R.R. Tolkien                 4.36        2128944               13670
Viktor E. Frankl                            282127               13449
Diana Gabaldon                 4.32         222140               11121
Roald Dahl                     4.31         541914               11576
(Blank cells were merged with the cell above by collapse_rows.)

Conclusion: The top 5 authors are J.K. Rowling, George R.R. Martin, J.R.R. Tolkien, Viktor E. Frankl, and Diana Gabaldon. Of this list, J.K. Rowling has the highest-rated book and the most books that pass the thresholds.
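Because the filter above works at the book level, an author can appear several times in the table. One alternative way to think about the question is to aggregate per author first. The sketch below does this on made-up data, weighting each book's average rating by its number of ratings (the same product that tot_points stores). The toy values, the 1,000-rating cutoff, and the column names n_books, total_ratings, and weighted_rating are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)

toy <- tibble(
  authors = c("Author A", "Author A", "Author B", "Author C"),
  average_rating = c(4.5, 4.0, 4.4, 4.9),
  ratings_count = c(1000, 3000, 2000, 10)
)

by_author <- toy %>%
  group_by(authors) %>%
  summarise(
    n_books = n(),
    total_ratings = sum(ratings_count),
    # Rating averaged over all ratings the author received
    weighted_rating = sum(average_rating * ratings_count) / sum(ratings_count),
    .groups = "drop"
  ) %>%
  filter(total_ratings >= 1000) %>%   # drop authors with too few ratings
  arrange(desc(weighted_rating))

by_author
```

Here Author C has the highest single rating but is dropped for having only 10 ratings, which mirrors the rationale for the ratings_count threshold. On the real data, head(by_author, 5) would give one row per author.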