Exercise 5

Books Data

In this exercise, we will work with the books.csv data set, a comprehensive list of the books listed on goodreads.com. It can be found at https://www.kaggle.com/jealousleopard/goodreadsbooks

Variables and short description:

bookID: unique identification number for each book
title: name under which book was published
authors: names of the authors of the book
average_rating: average of all ratings the book received
language_code: primary language of the book
num_pages: number of pages the book contains
ratings_count: total number of ratings the book received
text_reviews_count: total number of text reviews the book received
publication_date: date book was published
publisher: publishing company

1. Create a new data frame called books_small that includes title, authors, average_rating, and publication_date.

Solution: Using the select function, we can create and store a data frame that contains only the selected columns.

# select() comes from the dplyr package; it keeps only the named columns
books_small <- select(books, title, authors, average_rating, publication_date)
books_small

## # A tibble: 8,470 x 4
##    title                            authors      average_rating publication_date
##    <chr>                            <chr>                 <dbl> <date>          
##  1 "Harry Potter and the Half-Bloo~ J.K. Rowling           4.57 2006-09-16      
##  2 "Harry Potter and the Order of ~ J.K. Rowling           4.49 2004-09-01      
##  3 "Harry Potter and the Chamber o~ J.K. Rowling           4.42 2003-11-01      
##  4 "Harry Potter and the Prisoner ~ J.K. Rowling           4.56 2004-05-01      
##  5 "Harry Potter Boxed Set  Books ~ J.K. Rowling           4.78 2004-09-13      
##  6 "Unauthorized Harry Potter Book~ W. Frederic~           3.74 2005-04-26      
##  7 "Harry Potter Collection (Harry~ J.K. Rowling           4.73 2005-09-12      
##  8 "The Ultimate Hitchhiker's Guid~ Douglas Ada~           4.38 2005-11-01      
##  9 "The Ultimate Hitchhiker's Guid~ Douglas Ada~           4.38 2002-04-30      
## 10 "The Hitchhiker's Guide to the ~ Douglas Ada~           4.22 2004-08-03      
## # ... with 8,460 more rows
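As a minimal sketch of what select does, the same call can be tried on a tiny stand-in table. The name toy_books and all of its rows are made up for illustration and are not taken from books.csv; the pipe form shown here is only an alternative way of writing the same call.

```r
library(dplyr)

# Hypothetical stand-in for books (made-up rows, not from the real file)
toy_books <- tibble(
  bookID = c(1, 2, 3),
  title = c("A First Book", "The Second Book", "Then a Third"),
  authors = c("Author One", "Author Two", "Author Two"),
  average_rating = c(4.6, 3.9, 4.2),
  ratings_count = c(1200, 150, 90000),
  publication_date = as.Date(c("2005-03-01", "2004-07-15", "2005-11-30"))
)

# Same selection as in the solution, written with the pipe operator
toy_small <- toy_books %>%
  select(title, authors, average_rating, publication_date)

# Only the four requested columns remain, in the order they were listed
names(toy_small)
```

The original table is left unchanged; select returns a new data frame, which is why the result has to be stored under its own name.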

2. Create a new variable called tot_points that is the product of average_rating and ratings_count.

Solution: Creating a new variable requires the mutate function. Here it adds a new column called tot_points to the main books data frame, after its existing columns.

books <- mutate(books, tot_points = average_rating * ratings_count)
books

## # A tibble: 8,470 x 13
##    bookID title authors average_rating isbn  isbn13 language_code num_pages
##     <dbl> <chr> <chr>            <dbl> <chr> <chr>  <chr>             <dbl>
##  1      1 "Har~ J.K. R~           4.57 0439~ 97804~ eng                 652
##  2      2 "Har~ J.K. R~           4.49 0439~ 97804~ eng                 870
##  3      4 "Har~ J.K. R~           4.42 0439~ 97804~ eng                 352
##  4      5 "Har~ J.K. R~           4.56 0439~ 97804~ eng                 435
##  5      8 "Har~ J.K. R~           4.78 0439~ 97804~ eng                2690
##  6      9 "Una~ W. Fre~           3.74 0976~ 97809~ en-US               152
##  7     10 "Har~ J.K. R~           4.73 0439~ 97804~ eng                3342
##  8     12 "The~ Dougla~           4.38 0517~ 97805~ eng                 815
##  9     13 "The~ Dougla~           4.38 0345~ 97803~ eng                 815
## 10     14 "The~ Dougla~           4.22 1400~ 97814~ eng                 215
## # ... with 8,460 more rows, and 5 more variables: ratings_count <dbl>,
## #   text_reviews_count <dbl>, publication_date <date>, publisher <chr>,
## #   tot_points <dbl>
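To make the arithmetic concrete, here is a small sketch that applies the same mutate call to two rows. The average_rating and ratings_count values are copied from the Top Authors List table in question 5; the name toy_books is hypothetical.

```r
library(dplyr)

# Two rows with values taken from the Top Authors List table (question 5)
toy_books <- tibble(
  authors = c("J.K. Rowling", "Roald Dahl"),
  average_rating = c(4.57, 4.31),
  ratings_count = c(2095690, 541914)
)

# Same computation as in the solution: rating multiplied by number of ratings
toy_books <- mutate(toy_books, tot_points = average_rating * ratings_count)

# tot_points is appended as the last column
names(toy_books)

# First row: 4.57 * 2095690 = 9577303.3
toy_books$tot_points
```

Because tot_points scales with ratings_count, it rewards books that are both well rated and widely rated.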

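3. How many book titles begin with the word “The”?

Solution: I filtered the data set to the titles whose first three characters are “The” and counted the rows. Note that this check also matches titles that start with longer words such as “There” or “They”, so the count is an upper bound on the number of titles whose first word is exactly “The”.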

count(filter(books, substr(title, 1, 3) == "The"))

## # A tibble: 1 x 1
##       n
##   <int>
## 1  2313
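The difference between a three-character prefix check and a whole-word check can be sketched on a few example strings. The titles below are illustrative only, and the regular expression is one possible way to require a word boundary after “The”; it is not the method used for the count above.

```r
library(dplyr)

# Illustrative titles (not taken from books.csv)
toy_titles <- tibble(
  title = c("The Hobbit", "There and Back Again", "Theory of Games",
            "Of Mice and Men", "The")
)

# Prefix check, as in the solution: first three characters equal "The"
prefix_n <- nrow(filter(toy_titles, substr(title, 1, 3) == "The"))

# Whole-word check: "The" at the start, followed by a word boundary
word_n <- nrow(filter(toy_titles, grepl("^The\\b", title, perl = TRUE)))

prefix_n  # 4: also counts "There ..." and "Theory ..."
word_n    # 2: only "The Hobbit" and "The"
```

Both checks are case sensitive, so a title starting with “THE” or “the” would be missed by either one.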

4. Find all the books that were published in 2005 and that either had an average rating above 4.5 or more than 1,000 text reviews.

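Solution: I use the filter function with a combination of the &, |, >=, <= and > operators. The two date comparisons restrict the rows to 2005, and the conditions separated by a comma are combined with "and".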

filter(books, publication_date >= "2005-01-01" & publication_date <= "2005-12-31", average_rating > 4.5 | text_reviews_count > 1000)

## # A tibble: 154 x 13
##    bookID title authors average_rating isbn  isbn13 language_code num_pages
##     <dbl> <chr> <chr>            <dbl> <chr> <chr>  <chr>             <dbl>
##  1     10 Harr~ J.K. R~           4.73 0439~ 97804~ eng                3342
##  2    231 I am~ Tom Wo~           3.42 0312~ 97803~ eng                 738
##  3    428 Play~ Joan D~           3.87 0374~ 97803~ eng                 231
##  4    475 Coll~ Jared ~           3.93 0143~ 97801~ eng                 608
##  5    703 The ~ Philip~           3.77 1400~ 97814~ eng                 391
##  6    868 Full~ Hiromu~           4.56 1591~ 97815~ eng                 192
##  7    870 Full~ Hiromu~           4.5  1591~ 97815~ eng                 192
##  8    871 Full~ Hiromu~           4.55 1591~ 97815~ eng                 200
##  9    873 Full~ Hiromu~           4.52 1591~ 97815~ eng                 192
## 10    880 Pomp~ Robert~           3.82 0812~ 97808~ eng                 274
## # ... with 144 more rows, and 5 more variables: ratings_count <dbl>,
## #   text_reviews_count <dbl>, publication_date <date>, publisher <chr>,
## #   tot_points <dbl>
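An alternative way to express "published in 2005" is to extract the year from the date instead of comparing against two boundary dates. The sketch below uses made-up rows in a hypothetical toy_books table and assumes publication_date is a Date column, as the output above indicates.

```r
library(dplyr)

# Made-up rows for illustration
toy_books <- tibble(
  title = c("A", "B", "C", "D"),
  average_rating = c(4.7, 3.8, 4.9, 4.0),
  text_reviews_count = c(50, 2500, 10, 20),
  publication_date = as.Date(c("2005-06-01", "2005-12-31",
                               "2004-01-15", "2005-02-10"))
)

# Conditions separated by a comma are ANDed, so the | expression
# is evaluated as one unit: year is 2005 AND (rating OR reviews)
hits <- filter(
  toy_books,
  format(publication_date, "%Y") == "2005",
  average_rating > 4.5 | text_reviews_count > 1000
)

hits$title  # "A" (high rating) and "B" (many reviews)
```

Book C has the highest rating but is dropped because it was published in 2004, and book D is from 2005 but meets neither the rating nor the review condition.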

5. Who are the top 5 authors? There is no single right or wrong answer here, and there are many ways to think about this question. I am interested in your rationale: how you think about data, represent information, and justify your answer. You may answer this question using a visualization and/or data analysis. Be creative and think about how you can use the available data to answer this question. There is no need to do anything fancy (e.g., a regression or other model). Add a short written explanation (as a comment in the script) to accompany your answer.

Solution

I define a top author as one with at least one book that has an average rating above 4.3 and more than 10,000 text reviews. Authors with high ratings usually publish quality books that are well received by the public, and because reading is a customer-driven activity, the views of readers matter. In addition, a high average rating is only convincing if it is based on many ratings, so that it is not driven by a small number of readers. For this reason, ratings_count must also be greater than 100,000.

# Keep the books that clear all three thresholds
top_5_authors <- filter(books, average_rating > 4.30 & ratings_count > 100000 & text_reviews_count > 10000)

# Keep the relevant columns and sort from highest to lowest average rating
top_5 <- select(top_5_authors, authors, average_rating, ratings_count, text_reviews_count)
top_5 <- arrange(top_5, desc(average_rating))

# Build the table; kable_minimal() and collapse_rows() come from the kableExtra package
updated_table <- kable_minimal(
  knitr::kable(top_5, caption = "Top Authors List"),
  lightable_options = "basic",
  html_font = "\"Trebuchet MS\", verdana, sans-serif"
)

# Merge repeated values in adjacent rows into a single cell
collapse_rows(updated_table)

Top Authors List

authors             average_rating  ratings_count  text_reviews_count
J.K. Rowling                  4.57        2095690               27591
                              4.56        2339585               36325
                              4.49        2153167               29221
                              4.42        2293963               34692
George R.R. Martin            4.41         638766               16535
J.R.R. Tolkien                4.36        2128944               13670
Viktor E. Frankl              4.36         282127               13449
Diana Gabaldon                4.32         222140               11121
Roald Dahl                    4.31         541914               11576

Conclusion: The top 5 authors are J.K. Rowling, George R.R. Martin, J.R.R. Tolkien, Viktor E. Frankl and Diana Gabaldon. Of these, J.K. Rowling has the highest average rating and the most books that meet the criteria.
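Since the table lists books rather than authors, one way to arrive at five distinct names is to collapse it to one row per author. The sketch below is only one possible approach: it rebuilds the table from the values shown above, assumes the blank author cells belong to J.K. Rowling, and ranks authors by their best-rated book. The names best_rating and n_books are introduced here for illustration.

```r
library(dplyr)

# Values copied from the Top Authors List table; blank author cells
# are assumed to be J.K. Rowling (merged by collapse_rows)
top_5 <- tibble(
  authors = c(rep("J.K. Rowling", 4), "George R.R. Martin", "J.R.R. Tolkien",
              "Viktor E. Frankl", "Diana Gabaldon", "Roald Dahl"),
  average_rating = c(4.57, 4.56, 4.49, 4.42, 4.41, 4.36, 4.36, 4.32, 4.31)
)

# One row per author: best rating and number of qualifying books
top_authors <- top_5 %>%
  group_by(authors) %>%
  summarise(best_rating = max(average_rating), n_books = n()) %>%
  arrange(desc(best_rating)) %>%
  head(5)

top_authors
```

Under these assumptions Roald Dahl, with a best rating of 4.31, is the sixth author and falls just outside the top five.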

Jawaria Abbas

9/21/2020