The goal for this part of the project is to tidy and transform the data from the New York Times Bestseller Fiction List from October 4, 2015 so that analysis can be performed, and questions can be answered either textually or using visualizations.
Data Source: [http://www.nytimes.com/best-sellers-books/2015-10-04/combined-print-and-e-book-fiction/list.html]
Below is a snapshot of a portion of the webpage, so you can see the format of the list
## load packages
library(tidyr)
library(magrittr)
library(XML)
library(rvest)
library(stringi)
library(stringr)
library(dplyr)
library(ggplot2)
Scrape the webpage into R - New York Times Best Sellers Fiction 10/4/15, and at the same time parse the data on the webpage (the data we want is the only table)
nyt_bs_100415_webdata <- readHTMLTable("http://www.nytimes.com/best-sellers-books/2015-10-04/combined-print-and-e-book-fiction/list.html", header = TRUE, stringsAsFactors = FALSE)
Change data type of the object for easier manipulation and get rid of the buy option selector, which conveniently occurs in every other row
nyt_bs_100415_df <- as.data.frame(nyt_bs_100415_webdata) ## Convert the list to a dataframe
## Remove the rows that only have the buy button
nyt_bs_100415_df_2 <- nyt_bs_100415_df[seq(1, NROW(nyt_bs_100415_df), by = 2),]
Keep only the title, author and publisher. Store them in separate variables. This uses regular expressions for the parsing, and the stringi function included as part of the tidyr package.
## Loop to extract and clean up publisher, author and title data for each book
i=1
for (i in 1:15){
## Title
nyt_bs_100415_df_2$Title[i] <- str_extract_all(nyt_bs_100415_df_2$NULL.Combined.Print...E.Book.Fiction[i], "[:print:]+, by")[[1]]
nyt_bs_100415_df_2$Title[i] <- stri_sub(nyt_bs_100415_df_2$Title[i], 1, -5)
## Author
nyt_bs_100415_df_2$Author[i] <- str_extract_all(nyt_bs_100415_df_2$NULL.Combined.Print...E.Book.Fiction[i], "by.*?\\(")[[1]]
nyt_bs_100415_df_2$Author[i] <- stri_sub(nyt_bs_100415_df_2$Author[i], 4, -4)
## Publisher
nyt_bs_100415_df_2$Publisher[i] <- str_extract_all(nyt_bs_100415_df_2$NULL.Combined.Print...E.Book.Fiction[i], "\\(.*?\\)")[[1]]
nyt_bs_100415_df_2$Publisher[i] <- stri_sub(nyt_bs_100415_df_2$Publisher[i], 2, -3)
i = i + 1
}
## Delete unnecessary columns (highest column number first!)
nyt_bs_100415_df_2 <- nyt_bs_100415_df_2[, -4] ## Get rid of combined information column
nyt_bs_100415_df_2 <- nyt_bs_100415_df_2[, -2] ## Get rid of empty column
I am looking for suggestions here, I could not get the dplyr rename function to work. I tried select to rename only the ones I wanted to rename, but it gave me an error - unknown variable.
## Rename columns for clarity (tried using dplyr::rename but couldn't make it work)
colnames(nyt_bs_100415_df_2) <- c("Rank_this_week", "Rank_last_week", "Weeks_on_list", "Title", "Author", "Publisher")
## Unlist the list type variables
nyt_bs_100415_df_2$Publisher <- unlist(nyt_bs_100415_df_2$Publisher)
nyt_bs_100415_df_2$Title <- unlist(nyt_bs_100415_df_2$Title)
## Convert character data to integers where appropriate (it fills in NA automatically where I wanted)
nyt_bs_100415_df_2$Rank_this_week <- as.integer(nyt_bs_100415_df_2$Rank_this_week)
nyt_bs_100415_df_2$Rank_last_week <- as.integer(nyt_bs_100415_df_2$Rank_last_week)
nyt_bs_100415_df_2$Weeks_on_list <- as.integer(nyt_bs_100415_df_2$Weeks_on_list)
nyt_bs_100415_df_2
## Rank_this_week Rank_last_week Weeks_on_list
## 1 1 3 17
## 3 2 NA 1
## 5 3 1 2
## 7 4 2 3
## 9 5 NA 1
## 11 6 NA 1
## 13 7 NA 1
## 15 8 4 36
## 17 9 NA 1
## 19 10 7 21
## 21 11 9 53
## 23 12 5 10
## 25 13 6 4
## 27 14 NA 1
## 29 15 8 2
## Title Author
## 1 THE MARTIAN Andy Weir
## 3 DEVOTED IN DEATH J. D. Robb
## 5 MAKE ME Lee Child
## 7 THE GIRL IN THE SPIDER'S WEB David Lagercrantz
## 9 THE END GAME Catherine Coulter and J.T. Ellison
## 11 HARD LOVE Meredith Wild
## 13 THE SCAM Janet Evanovich and Lee Goldberg
## 15 THE GIRL ON THE TRAIN Paula Hawkins
## 17 FATES AND FURIES Lauren Groff
## 19 GRAY MOUNTAIN John Grisham
## 21 ALL THE LIGHT WE CANNOT SEE Anthony Doerr
## 23 GO SET A WATCHMAN Harper Lee
## 25 X Sue Grafton
## 27 FATAL FRENZY Marie Force
## 29 HOLLYWOOD DIRT Alessandra Torre
## Publisher
## 1 Crown
## 3 Putnam
## 5 Delacorte
## 7 Knopf
## 9 Putnam
## 11 Forever
## 13 Bantam
## 15 Riverhead
## 17 Riverhead
## 19 Doubleday
## 21 Scribner
## 23 Harper
## 25 Marian Wood/Putnam
## 27 Carina
## 29 EverAfter Romance
## tbl for better printing
nyt_bs_100415_df_2 <- tbl_df(nyt_bs_100415_df_2)
nyt_bs_100415_df_2
## Source: local data frame [15 x 6]
##
## Rank_this_week Rank_last_week Weeks_on_list
## (int) (int) (int)
## 1 1 3 17
## 2 2 NA 1
## 3 3 1 2
## 4 4 2 3
## 5 5 NA 1
## 6 6 NA 1
## 7 7 NA 1
## 8 8 4 36
## 9 9 NA 1
## 10 10 7 21
## 11 11 9 53
## 12 12 5 10
## 13 13 6 4
## 14 14 NA 1
## 15 15 8 2
## Variables not shown: Title (chr), Author (chr), Publisher (chr)
Use dplyr to change the column order to show the data in each observation in a reasonably arranged format - by rank on the list, then the book title, associated author and publisher, and finally its rank last week and number of weeks on the list. The last column isn’t needed for any of the questions we will answer here, but I left it because I like it.
## Reorder columns
select(nyt_bs_100415_df_2, Rank_this_week, Title, Author, Publisher, everything())
## Source: local data frame [15 x 6]
##
## Rank_this_week Title
## (int) (chr)
## 1 1 THE MARTIAN
## 2 2 DEVOTED IN DEATH
## 3 3 MAKE ME
## 4 4 THE GIRL IN THE SPIDER'S WEB
## 5 5 THE END GAME
## 6 6 HARD LOVE
## 7 7 THE SCAM
## 8 8 THE GIRL ON THE TRAIN
## 9 9 FATES AND FURIES
## 10 10 GRAY MOUNTAIN
## 11 11 ALL THE LIGHT WE CANNOT SEE
## 12 12 GO SET A WATCHMAN
## 13 13 X
## 14 14 FATAL FRENZY
## 15 15 HOLLYWOOD DIRT
## Variables not shown: Author (chr), Publisher (chr), Rank_last_week (int),
## Weeks_on_list (int)
glimpse(nyt_bs_100415_df_2)
## Observations: 15
## Variables: 6
## $ Rank_this_week (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
## $ Rank_last_week (int) 3, NA, 1, 2, NA, NA, NA, 4, NA, 7, 9, 5, 6, NA, 8
## $ Weeks_on_list (int) 17, 1, 2, 3, 1, 1, 1, 36, 1, 21, 53, 10, 4, 1, 2
## $ Title (chr) "THE MARTIAN", "DEVOTED IN DEATH", "MAKE ME", "...
## $ Author (chr) "Andy Weir", "J. D. Robb", "Lee Child", "David ...
## $ Publisher (chr) "Crown", "Putnam", "Delacorte", "Knopf", "Putna...
Rank and Title of books new to the New York Times Bestseller List - Fiction
new_ranked <-
nyt_bs_100415_df_2 %>%
filter(is.na(Rank_last_week)) %>%
select(Rank_this_week, Title)
new_ranked
## Source: local data frame [6 x 2]
##
## Rank_this_week Title
## (int) (chr)
## 1 2 DEVOTED IN DEATH
## 2 5 THE END GAME
## 3 6 HARD LOVE
## 4 7 THE SCAM
## 5 9 FATES AND FURIES
## 6 14 FATAL FRENZY
new_ranked_barplot <- qplot(x=Title, y=factor(Rank_this_week),
data=new_ranked, geom="bar", stat="identity",
position="dodge")
new_ranked_barplot + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
changed_rank <-
nyt_bs_100415_df_2 %>%
filter(!is.na(Rank_last_week)) %>%
mutate(Change_in_Rank = Rank_last_week - Rank_this_week) %>%
select(Title, Change_in_Rank)
changed_rank
## Source: local data frame [9 x 2]
##
## Title Change_in_Rank
## (chr) (int)
## 1 THE MARTIAN 2
## 2 MAKE ME -2
## 3 THE GIRL IN THE SPIDER'S WEB -2
## 4 THE GIRL ON THE TRAIN -4
## 5 GRAY MOUNTAIN -3
## 6 ALL THE LIGHT WE CANNOT SEE -2
## 7 GO SET A WATCHMAN -7
## 8 X -7
## 9 HOLLYWOOD DIRT -7
changed_rank_barplot <- qplot(x=Title, y=Change_in_Rank,
data=changed_rank, geom="bar", stat="identity",
position="dodge")
changed_rank_barplot + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
scale_y_continuous(breaks=c(2, 0, -2, -4, -6, -8))
This type of list shows up on many websites, so I wanted to gain some experience scraping and tidying data presented in this format. Working with this data set taught me a lot about using dplyr. I like the pipe operator and the clarity and ease it provides. I think there is a lot more that I can learn by working with this data set, and others like it. One thing I would be interested in learning in the near future is how to present a visually pleasing table in R markdown.