In this exercise, we work with the books.csv dataset, a comprehensive list of the books listed on goodreads.com. It can be found at https://www.kaggle.com/jealousleopard/goodreadsbooks
Variables and short description:
1. Create a new data frame called books_small that includes title, authors, average_rating, and publication_date.
Solution
Using the select function, we can create and store a data frame that contains only the selected columns.
books_small <- select(books, title, authors, average_rating, publication_date)
books_small
## # A tibble: 8,470 x 4
## title authors average_rating publication_date
## <chr> <chr> <dbl> <date>
## 1 "Harry Potter and the Half-Bloo~ J.K. Rowling 4.57 2006-09-16
## 2 "Harry Potter and the Order of ~ J.K. Rowling 4.49 2004-09-01
## 3 "Harry Potter and the Chamber o~ J.K. Rowling 4.42 2003-11-01
## 4 "Harry Potter and the Prisoner ~ J.K. Rowling 4.56 2004-05-01
## 5 "Harry Potter Boxed Set Books ~ J.K. Rowling 4.78 2004-09-13
## 6 "Unauthorized Harry Potter Book~ W. Frederic~ 3.74 2005-04-26
## 7 "Harry Potter Collection (Harry~ J.K. Rowling 4.73 2005-09-12
## 8 "The Ultimate Hitchhiker's Guid~ Douglas Ada~ 4.38 2005-11-01
## 9 "The Ultimate Hitchhiker's Guid~ Douglas Ada~ 4.38 2002-04-30
## 10 "The Hitchhiker's Guide to the ~ Douglas Ada~ 4.22 2004-08-03
## # ... with 8,460 more rows
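The same column selection can also be written with the pipe operator. The block below is only a sketch on a hypothetical three-row stand-in for books (the name toy_books, its rows, and the extra num_pages column are invented for illustration). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for the books data frame (values are made up)
toy_books <- tibble(
  title = c("Book A", "Book B", "Book C"),
  authors = c("Author X", "Author Y", "Author X"),
  average_rating = c(4.1, 3.8, 4.6),
  num_pages = c(320, 150, 480),
  publication_date = as.Date(c("2004-09-01", "2005-04-26", "2006-09-16"))
)

# Keep only the four columns of interest, in the order they are listed
toy_small <- toy_books %>%
  select(title, authors, average_rating, publication_date)

names(toy_small)
```

select keeps every row and drops only the columns that are not named, so toy_small has the same number of rows as toy_books.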
2. Create a new variable called tot_points that is the product of average_rating and ratings_count.
Solution
Creating new variables requires the mutate function. The call below adds a new column called tot_points to the main books data frame, after its existing columns.
books <- mutate(books, tot_points = average_rating * ratings_count)
books
## # A tibble: 8,470 x 13
## bookID title authors average_rating isbn isbn13 language_code num_pages
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 "Har~ J.K. R~ 4.57 0439~ 97804~ eng 652
## 2 2 "Har~ J.K. R~ 4.49 0439~ 97804~ eng 870
## 3 4 "Har~ J.K. R~ 4.42 0439~ 97804~ eng 352
## 4 5 "Har~ J.K. R~ 4.56 0439~ 97804~ eng 435
## 5 8 "Har~ J.K. R~ 4.78 0439~ 97804~ eng 2690
## 6 9 "Una~ W. Fre~ 3.74 0976~ 97809~ en-US 152
## 7 10 "Har~ J.K. R~ 4.73 0439~ 97804~ eng 3342
## 8 12 "The~ Dougla~ 4.38 0517~ 97805~ eng 815
## 9 13 "The~ Dougla~ 4.38 0345~ 97803~ eng 815
## 10 14 "The~ Dougla~ 4.22 1400~ 97814~ eng 215
## # ... with 8,460 more rows, and 5 more variables: ratings_count <dbl>,
## # text_reviews_count <dbl>, publication_date <date>, publisher <chr>,
## # tot_points <dbl>
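To make the arithmetic behind tot_points easy to check by hand, here is a minimal sketch on a hypothetical two-row data frame (toy_books and its values are invented, not taken from books.csv). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical two-row example with easily checked numbers
toy_books <- tibble(
  title = c("Book A", "Book B"),
  average_rating = c(4.5, 3.0),
  ratings_count = c(200, 1000)
)

# mutate() appends tot_points as the last column
toy_books <- toy_books %>%
  mutate(tot_points = average_rating * ratings_count)

toy_books$tot_points
```

In this sketch a modestly rated book with many ratings ends up with more total points than a highly rated book with few ratings, which is worth keeping in mind when interpreting tot_points.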
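3. How many book titles begin with the word “The”?
Solution
I filtered the data set by comparing the first three characters of each title with “The” and counted the matching rows. Because this is a prefix match, titles that begin with longer words such as “There” or “They” are counted as well.
count(filter(books, substr(title, 1, 3) == "The"))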
## # A tibble: 1 x 1
## n
## <int>
## 1 2313
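If only titles whose first word is exactly “The” should be counted, a word-boundary pattern is one possible alternative to the prefix comparison. The sketch below compares the two approaches on a hypothetical set of titles (toy_titles is invented for illustration). It has not been run against books.csv, so the count above would likely change under the stricter rule.

```r
library(dplyr)

# Hypothetical titles chosen to show the difference between the two rules
toy_titles <- tibble(
  title = c("The Hobbit", "There and Back Again", "Theory of Everything",
            "Then We Came to the End", "A Tale of Two Cities")
)

# Prefix match: counts every title whose first three characters are "The"
n_prefix <- toy_titles %>%
  filter(substr(title, 1, 3) == "The") %>%
  nrow()

# Whole-word match: "The" at the start must be followed by a word boundary
n_word <- toy_titles %>%
  filter(grepl("^The\\b", title, perl = TRUE)) %>%
  nrow()

c(prefix = n_prefix, whole_word = n_word)
```

Which rule is appropriate depends on how the question is read. The whole-word version follows the wording “begin with the word” more literally.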
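4. Find all the books that were published in 2005 and that either had an average rating above 4.5 or more than 1,000 text reviews.
Solution
To find this answer, I use the filter function with a combination of the |, &, >, >= and <= operators. Conditions separated by a comma are combined with AND, so a book must fall within the 2005 date range and also satisfy at least one of the two rating or review conditions.
filter(books, publication_date >= "2005-01-01" & publication_date <= "2005-12-31", average_rating > 4.5 | text_reviews_count > 1000)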
## # A tibble: 154 x 13
## bookID title authors average_rating isbn isbn13 language_code num_pages
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 10 Harr~ J.K. R~ 4.73 0439~ 97804~ eng 3342
## 2 231 I am~ Tom Wo~ 3.42 0312~ 97803~ eng 738
## 3 428 Play~ Joan D~ 3.87 0374~ 97803~ eng 231
## 4 475 Coll~ Jared ~ 3.93 0143~ 97801~ eng 608
## 5 703 The ~ Philip~ 3.77 1400~ 97814~ eng 391
## 6 868 Full~ Hiromu~ 4.56 1591~ 97815~ eng 192
## 7 870 Full~ Hiromu~ 4.5 1591~ 97815~ eng 192
## 8 871 Full~ Hiromu~ 4.55 1591~ 97815~ eng 200
## 9 873 Full~ Hiromu~ 4.52 1591~ 97815~ eng 192
## 10 880 Pomp~ Robert~ 3.82 0812~ 97808~ eng 274
## # ... with 144 more rows, and 5 more variables: ratings_count <dbl>,
## # text_reviews_count <dbl>, publication_date <date>, publisher <chr>,
## # tot_points <dbl>
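An equivalent way to express “published in 2005” is to extract the year from the date instead of comparing against two endpoints. The sketch below uses base R's format on a hypothetical four-row data frame (toy_books and its values are invented), with the OR condition wrapped in parentheses to make the grouping explicit. It assumes publication_date is stored as a Date, as in the output above.

```r
library(dplyr)

# Hypothetical rows covering each combination of interest
toy_books <- tibble(
  title = c("A", "B", "C", "D"),
  average_rating = c(4.7, 3.9, 4.8, 3.5),
  text_reviews_count = c(50, 2500, 10, 20),
  publication_date = as.Date(c("2005-03-01", "2005-12-31",
                               "2004-06-15", "2005-07-04"))
)

hits <- toy_books %>%
  filter(
    format(publication_date, "%Y") == "2005",            # published in 2005
    (average_rating > 4.5 | text_reviews_count > 1000)   # rating OR reviews
  )

hits$title
```

Book C is excluded despite its high rating because it was published in 2004, and book D is excluded because it meets neither the rating nor the review condition.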
5. Who are the top 5 authors? There is definitely no right/wrong answer here and there are many ways to think about this question. I am interested in your rationale, how you think about data, represent information and justify your answer. You may answer this question using a visualization and/or data analysis. Be creative and think how you can use the available data to answer this question. No need to do anything fancy (e.g. regression/model). Add a short, worded explanation (as a comment in the script) accompanying your answer.
Solution
I define a top author as one who has a book with an average rating above 4.3, more than 100,000 ratings and more than 10,000 text reviews. Authors with high ratings are usually producing quality books that are well received by the public, and because reading is a customer-driven activity, the views of readers matter. A high average rating alone is not enough, however: it must be backed by a large number of ratings so that it does not rest on only a handful of readers. This is why ratings_count must be more than 100,000 and text_reviews_count more than 10,000.
top_5_authors <- filter(books, average_rating>4.30 & ratings_count > 100000 & text_reviews_count > 10000)
top_5 <- select(top_5_authors, authors, average_rating, ratings_count, text_reviews_count)
top_5 <- arrange(top_5, -average_rating)
updated_table <- kable_minimal(
  knitr::kable(top_5, caption = 'Top Authors List'),
  lightable_options = "basic",
  html_font = "\"Trebuchet MS\", verdana, sans-serif"
)
collapse_rows(updated_table)
| authors | average_rating | ratings_count | text_reviews_count |
|---|---|---|---|
| J.K. Rowling | 4.57 | 2095690 | 27591 |
| J.K. Rowling | 4.56 | 2339585 | 36325 |
| J.K. Rowling | 4.49 | 2153167 | 29221 |
| J.K. Rowling | 4.42 | 2293963 | 34692 |
| George R.R. Martin | 4.41 | 638766 | 16535 |
| J.R.R. Tolkien | 4.36 | 2128944 | 13670 |
| Viktor E. Frankl | 4.36 | 282127 | 13449 |
| Diana Gabaldon | 4.32 | 222140 | 11121 |
| Roald Dahl | 4.31 | 541914 | 11576 |
Conclusion
The top 5 authors are J.K. Rowling, George R.R. Martin, J.R.R. Tolkien, Viktor E. Frankl and Diana Gabaldon. Of this list, J.K. Rowling has the highest-rated book (4.57) and the most books that meet the criteria (four).
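The analysis above ranks individual books, so an author with several qualifying books appears more than once. One possible alternative, not the method used in this solution, is to aggregate to one row per author before ranking. The sketch below does this on a hypothetical data frame (toy_books, its values, and the 1,000-rating cutoff are invented for illustration), weighting each book's rating by its number of ratings. It assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical per-book data (values are made up)
toy_books <- tibble(
  authors = c("Author X", "Author X", "Author Y", "Author Z"),
  average_rating = c(4.6, 4.4, 4.5, 4.9),
  ratings_count = c(1000, 3000, 2000, 10)
)

author_summary <- toy_books %>%
  group_by(authors) %>%
  summarise(
    n_books = n(),
    total_ratings = sum(ratings_count),
    # Rating averaged over all ratings received, not over books
    weighted_rating = sum(average_rating * ratings_count) / sum(ratings_count),
    .groups = "drop"
  ) %>%
  filter(total_ratings >= 1000) %>%   # hypothetical popularity cutoff
  arrange(desc(weighted_rating))

author_summary
```

In this sketch Author Z is dropped despite a 4.9 rating because it rests on only 10 ratings, which is the same reasoning as the ratings_count threshold in the solution above.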