As part of the course “Webscraping and Social Media Scraping” in my Master’s programme, I developed three scrapers using BeautifulSoup4, Scrapy and Selenium (the Scrapy project can be found here). I thought it would be fun to do some analysis on the scraped data, which come from the Goodreads “Best Books of 20th Century” list. Many questions can be asked of this data. First, let’s load it:
data <- read.csv("goodreads_20thCentury_scrapy.csv")
# The scraper stored the score and the review count as strings with
# thousands separators, so we strip the commas and convert them to numeric values
data$score <- as.numeric(gsub(",", "", data$score))
data$reviews <- as.numeric(gsub(",", "", data$reviews))
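If the scraped strings ever contain something other than digits and commas, `as.numeric()` silently returns `NA` (with a warning). A minimal sketch of a more defensive conversion is shown below; the helper name `parse_count` and the toy input are assumptions made for illustration, not part of the original project.

```r
# Hypothetical helper: convert strings such as "1,234" to numbers.
parse_count <- function(x) {
  cleaned <- trimws(gsub(",", "", x, fixed = TRUE))
  # Non-numeric strings become NA; we inspect them explicitly below
  suppressWarnings(as.numeric(cleaned))
}

# Toy input (made up for this example)
raw <- c("5,677", " 998,587 ", "n/a", "67")
parsed <- parse_count(raw)
print(parsed)

# Show the raw entries that could not be converted
print(raw[is.na(parsed)])
```

Checking which raw values failed to parse makes it easier to spot scraping problems before they leak into the summaries.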
Let’s explore and have some fun with the data we have:
summary(data[c("pages", "score", "reviews")])
## pages score reviews
## Min. : 26.0 Min. : 5677 Min. : 67
## 1st Qu.: 213.0 1st Qu.: 9637 1st Qu.: 4010
## Median : 314.0 Median : 18805 Median : 8024
## Mean : 378.5 Mean :215740 Mean : 14416
## 3rd Qu.: 467.0 3rd Qu.: 53824 3rd Qu.: 16316
## Max. :1796.0 Max. :998587 Max. :161230
# While we have 499 books, we only have 331 unique authors
unique(data$author)
## [1] "Harper Lee" "Ray Bradbury"
## [3] "J.R.R. Tolkien" "F. Scott Fitzgerald"
## [5] "J.K. Rowling" "Antoine de Saint-Exupéry"
## [7] "George Orwell" "C.S. Lewis"
## [9] "John Steinbeck" "Aldous Huxley"
## [11] "Patricia Highsmith" "Margaret Mitchell"
## [13] "Richard Brautigan" "Carl Bernstein"
## [15] "Neil Gaiman" "Thomas Mann"
## [17] "James Joyce" "Hermann Hesse"
## [19] "Paul Auster" "Stephen King"
## [21] "Orson Scott Card" "Robert A. Heinlein"
## [23] "Anne Tyler" "Carl Sagan"
## [25] "Mark Helprin" "Crockett Johnson"
## [27] "Cormac McCarthy" "Primo Levi"
## [29] "Lorraine Hansberry" "Herman Wouk"
## [31] "Mary Doria Russell" "Sebastian Faulks"
## [33] "Astrid Lindgren" "Thornton Wilder"
## [35] "Donna Tartt" "Anne Rice"
## [37] "Douglas R. Hofstadter" "Beatrix Potter"
## [39] "Ken Kesey" "John Fowles"
## [41] "John Wyndham" "Emmuska Orczy"
## [43] "Arthur Conan Doyle" "Avi"
## [45] "Isaac Asimov" "D.H. Lawrence"
## [47] "Caleb Carr" "L.M. Montgomery"
## [49] "Nicholas Sparks" "Chris Van Allsburg"
## [51] "Franz Kafka" "Thomas Pynchon"
## [53] "Don DeLillo" "Tom Robbins"
## [55] "Nevil Shute" "Michael Crichton"
## [57] "John Hersey" "Rainer Maria Rilke"
## [59] "Howard Zinn" "William Gibson"
## [61] "E.E. Cummings" "Gail Carson Levine"
## [63] "Fannie Flagg" "Norton Juster"
## [65] "Terry Pratchett" "Isak Dinesen"
## [67] "Robertson Davies" "Ursula K. Le Guin"
## [69] "Katherine Dunn" "Dr. Seuss"
## [71] "Helen Fielding" "Yevgeny Zamyatin"
## [73] "Wally Lamb" "Jung Chang"
## [75] "William L. Shirer" "Philip Pullman"
## [77] "Evelyn Waugh" "Tom Wolfe"
## [79] "William Peter Blatty" "P.G. Wodehouse"
## [81] "Nicolas Tredell" "Roald Dahl"
## [83] "Jean-Paul Sartre" "John Jakes"
## [85] "Jorge Luis Borges" "Pat Frank"
## [87] "Jaroslav Hašek" "Louis de Bernières"
## [89] "Alan Moore" "Salman Rushdie"
## [91] "Richard Dawkins" "J.M. Coetzee"
## [93] "Edgar Rice Burroughs" "George R.R. Martin"
## [95] "E.M. Forster" "Ernest Hemingway"
## [97] "James A. Michener" "Art Spiegelman"
## [99] "Gaston Leroux" "Philip K. Dick"
## [101] "Robert M. Pirsig" "Nikos Kazantzakis"
## [103] "Dodie Smith" "Toni Morrison"
## [105] "Michael Chabon" "Laurie Halse Anderson"
## [107] "T.S. Eliot" "Ira Levin"
## [109] "Virginia Woolf" "Kurt Vonnegut Jr."
## [111] "Judy Blume" "Dave Eggers"
## [113] "Italo Calvino" "Jerzy Kosiński"
## [115] "John Knowles" "Chaim Potok"
## [117] "Rudyard Kipling" "Beatrice Sparks"
## [119] "Michael Ondaatje" "Ivo Andrić"
## [121] "A.A. Milne" "Edith Wharton"
## [123] "Edith Hamilton" "Dashiell Hammett"
## [125] "Laura Esquivel" "Tony Kushner"
## [127] "José Saramago" "Dalton Trumbo"
## [129] "Agatha Christie" "Simone de Beauvoir"
## [131] "Annie Proulx" "Jhumpa Lahiri"
## [133] "Marcel Proust" "J.D. Salinger"
## [135] "Hunter S. Thompson" "Martin Heidegger"
## [137] "Dale Carnegie" "Peter S. Beagle"
## [139] "Carson McCullers" "Willa Cather"
## [141] "Graham Greene" "William Styron"
## [143] "Arthur C. Clarke" "Julio Cortázar"
## [145] "John Berendt" "Upton Sinclair"
## [147] "Henry Miller" "Philip Roth"
## [149] "Anita Diamant" "E.B. White"
## [151] "Lois Lowry" "Arthur Miller"
## [153] "Kenneth Grahame" "Jean M. Auel"
## [155] "Natalie Babbitt" "Tim O'Brien"
## [157] "Margaret Atwood" "Jon Krakauer"
## [159] "William Faulkner" "Dan Brown"
## [161] "Malcolm X" "Beryl Markham"
## [163] "Olive Ann Burns" "Larry McMurtry"
## [165] "Pat Conroy" "Barbara Kingsolver"
## [167] "Robert Jordan" "T.H. White"
## [169] "Haruki Murakami" "Samuel Beckett"
## [171] "Jared Diamond" "Thomas Keneally"
## [173] "Vikram Seth" "Joseph Conrad"
## [175] "Richard Bach" "Mitch Albom"
## [177] "Diana Gabaldon" "Daniel Keyes"
## [179] "Wilson Rawls" "Stella Gibbons"
## [181] "Keri Hulme" "Tom Clancy"
## [183] "Laura Ingalls Wilder" "Robert Frost"
## [185] "Ralph Ellison" "Jack London"
## [187] "Marion Zimmer Bradley" "John le Carré"
## [189] "Zora Neale Hurston" "Robert Musil"
## [191] "Michael Shaara" "W. Somerset Maugham"
## [193] "John Irving" "Anthony Burgess"
## [195] "Boris Pasternak" "David Guterson"
## [197] "Neal Stephenson" "Judi Barrett"
## [199] "M.M. Kaye" "Arundhati Roy"
## [201] "Ayn Rand" "Theodore Dreiser"
## [203] "Alan Paton" "Mario Puzo"
## [205] "Jack Kerouac" "Amy Tan"
## [207] "Flannery O'Connor" "Irving Stone"
## [209] "Richard Wright" "Isabel Allende"
## [211] "John Kennedy Toole" "Jacqueline Susann"
## [213] "Frank McCourt" "John Grisham"
## [215] "Sandra Cisneros" "Juan Rulfo"
## [217] "Pearl S. Buck" "William S. Burroughs"
## [219] "Rachel Carson" "Truman Capote"
## [221] "Tennessee Williams" "John Galsworthy"
## [223] "Viktor E. Frankl" "George Bernard Shaw"
## [225] "Stephen R. Covey" "Erich Maria Remarque"
## [227] "Raymond Chandler" "David Foster Wallace"
## [229] "Betty Friedan" "Bryce Courtenay"
## [231] "Daniel Quinn" "Lawrence Durrell"
## [233] "Tom Stoppard" "Albert Camus"
## [235] "Brian Jacques" "James Herriot"
## [237] "Aleksandr Solzhenitsyn" "Nick Hornby"
## [239] "Mervyn Peake" "Flann O'Brien"
## [241] "Ellen Raskin" "Richard Yates"
## [243] "Rohinton Mistry" "Henri Charrière"
## [245] "James Clavell" "Douglas Adams"
## [247] "S.E. Hinton" "Maya Angelou"
## [249] "Scott O'Dell" "Michael Cunningham"
## [251] "Eugene O'Neill" "Mikhail Bulgakov"
## [253] "Vladimir Nabokov" "Ken Follett"
## [255] "Charles Frazier" "Alex Haley"
## [257] "Pablo Neruda" "Fred Gipson"
## [259] "Louis Sachar" "Colleen McCullough"
## [261] "J.M. Barrie" "Norman Maclean"
## [263] "Maurice Sendak" "John Updike"
## [265] "Frank Herbert" "Robert Penn Warren"
## [267] "Margaret Wise Brown" "Bill Bryson"
## [269] "Jean Rhys" "Umberto Eco"
## [271] "Walter M. Miller Jr." "Robert Graves"
## [273] "Louis-Ferdinand Céline" "W.E.B. Du Bois"
## [275] "Kazuo Ishiguro" "Milan Kundera"
## [277] "E.L. Doctorow" "Dee Brown"
## [279] "Kahlil Gibran" "Sherwood Anderson"
## [281] "Daphne du Maurier" "Chinua Achebe"
## [283] "David James Duncan" "Joseph Campbell"
## [285] "Bernhard Schlink" "Michael Ende"
## [287] "Margery Williams Bianco" "Mary Stewart"
## [289] "Marguerite Yourcenar" "Frances Hodgson Burnett"
## [291] "Jeffrey Eugenides" "Stephen Hawking"
## [293] "Stephen Chbosky" "Janet Evanovich"
## [295] "Allen Ginsberg" "Gabriel García Márquez"
## [297] "Bret Easton Ellis" "Bill Watterson"
## [299] "Eric Carle" "O. Henry"
## [301] "Wallace Stegner" "Günter Grass"
## [303] "David Sedaris" "Tracy Chevalier"
## [305] "Corrie ten Boom" "Leon Uris"
## [307] "Marilynne Robinson" "Irvine Welsh"
## [309] "Joseph Heller" "Alice Walker"
## [311] "Patrick Süskind" "Chuck Palahniuk"
## [313] "John Howard Griffin" "Arthur Koestler"
## [315] "Mark Z. Danielewski" "Shel Silverstein"
## [317] "Mary Norton" "Paulo Coelho"
## [319] "Katherine Paterson" "Richard Adams"
## [321] "Susanna Kaysen" "Sylvia Plath"
## [323] "Betty Smith" "Helene Hanff"
## [325] "Ursula Hegi" "Elie Wiesel"
## [327] "Jostein Gaarder" "William Goldman"
## [329] "Madeleine L'Engle" "Arthur Golden"
## [331] "William Golding"
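Printing every author is useful for browsing, but the counts quoted in the comment can be obtained directly. A small sketch on a toy vector (the names below are only sample values):

```r
# Toy author column; in the post this would be data$author
authors <- c("Harper Lee", "George Orwell", "George Orwell", "J.R.R. Tolkien")

n_books   <- length(authors)          # one entry per book
n_authors <- length(unique(authors))  # distinct authors only

cat(n_books, "books by", n_authors, "unique authors\n")
```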
# Books with most pages
data[data$pages == max(data$pages),]$title
## [1] "The Complete Sherlock Holmes"
# Books with most reviews
data[data$reviews == max(data$reviews),]$title
## [1] "Harry Potter and the Sorcerer's Stone (Harry Potter, #1)"
# Books with best score
data[data$score == max(data$score),]$title
## [1] "To Kill a Mockingbird"
## [2] "Fahrenheit 451"
## [3] "The Hobbit (The Lord of the Rings, #0)"
## [4] "The Great Gatsby"
## [5] "Harry Potter and the Sorcerer's Stone (Harry Potter, #1)"
## [6] "The Little Prince"
## [7] "1984"
## [8] "Animal Farm"
## [9] "The Lion, the Witch and the Wardrobe (Chronicles of Narnia, #1)"
## [10] "The Grapes of Wrath"
## [11] "Brave New World"
## [12] "Gone with the Wind"
## [13] "The Fellowship of the Ring (The Lord of the Rings, #1)"
## [14] "Winnie-the-Pooh (Winnie-the-Pooh, #1)"
## [15] "Blindness"
## [16] "A Portrait of the Artist as a Young Man"
## [17] "And Then There Were None"
## [18] "Cat’s Cradle"
## [19] "Franny and Zooey"
## [20] "Fear and Loathing in Las Vegas"
## [21] "To the Lighthouse"
## [22] "Sophie’s Choice"
## [23] "The Red Tent"
## [24] "The Things They Carried"
## [25] "A Game of Thrones (A Song of Ice and Fire, #1)"
## [26] "Lonesome Dove (Lonesome Dove, #1)"
## [27] "Waiting for Godot"
## [28] "Flowers for Algernon"
## [29] "Where the Red Fern Grows"
## [30] "Invisible Man"
## [31] "A Farewell to Arms"
## [32] "Their Eyes Were Watching God"
## [33] "Little House on the Prairie (Little House, #3)"
## [34] "A Clockwork Orange"
## [35] "The World According to Garp"
## [36] "The Fountainhead"
## [37] "The Sound and the Fury"
## [38] "For Whom the Bell Tolls"
## [39] "The Joy Luck Club"
## [40] "A Confederacy of Dunces"
## [41] "Angela’s Ashes (Frank McCourt, #1)"
## [42] "The Good Earth (House of Earth, #1)"
## [43] "All Quiet on the Western Front"
## [44] "Beloved"
## [45] "The Sun Also Rises"
## [46] "Ulysses"
## [47] "On the Road"
## [48] "The Hitchhiker’s Guide to the Galaxy (The Hitchhiker's Guide to the Galaxy, #1)"
## [49] "The Outsiders"
## [50] "The Master and Margarita"
## [51] "The Pillars of the Earth (Kingsbridge, #1)"
## [52] "The Lorax"
## [53] "Charlie and the Chocolate Factory (Charlie Bucket, #1)"
## [54] "Where the Wild Things Are"
## [55] "Dune (Dune, #1)"
## [56] "A Prayer for Owen Meany"
## [57] "The Name of the Rose"
## [58] "The Unbearable Lightness of Being"
## [59] "The Stand"
## [60] "Rebecca"
## [61] "The Complete Sherlock Holmes"
## [62] "Siddhartha"
## [63] "The Secret Garden"
## [64] "Love in the Time of Cholera"
## [65] "The Poisonwood Bible"
## [66] "The Metamorphosis"
## [67] "Atlas Shrugged"
## [68] "The Color Purple"
## [69] "Green Eggs and Ham"
## [70] "In Cold Blood"
## [71] "Anne of Green Gables (Anne of Green Gables, #1)"
## [72] "The Alchemist"
## [73] "Watership Down (Watership Down, #1)"
## [74] "The Bell Jar"
## [75] "A Tree Grows in Brooklyn"
## [76] "Night (The Night Trilogy, #1)"
## [77] "The Two Towers (The Lord of the Rings, #2)"
## [78] "The Return of the King (The Lord of the Rings, #3)"
## [79] "The Giving Tree"
## [80] "Ender’s Game (Ender's Saga, #1)"
## [81] "The Old Man and the Sea"
## [82] "A Wrinkle in Time (Time Quintet, #1)"
## [83] "The Chronicles of Narnia (The Chronicles of Narnia, #1-7)"
## [84] "The Ultimate Hitchhiker’s Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1-5)"
## [85] "Charlotte’s Web"
## [86] "Harry Potter and the Chamber of Secrets (Harry Potter, #2)"
## [87] "Memoirs of a Geisha"
## [88] "One Flew Over the Cuckoo’s Nest"
## [89] "The Stranger"
## [90] "The Handmaid’s Tale (The Handmaid's Tale, #1)"
## [91] "East of Eden"
## [92] "Lolita"
## [93] "Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)"
## [94] "Of Mice and Men"
## [95] "The Giver (The Giver, #1)"
## [96] "Slaughterhouse-Five"
## [97] "Lord of the Flies"
## [98] "One Hundred Years of Solitude"
## [99] "The Catcher in the Rye"
# Authors of the books with the best score
unique(data[data$score == max(data$score),]$author)
## [1] "Harper Lee" "Ray Bradbury"
## [3] "J.R.R. Tolkien" "F. Scott Fitzgerald"
## [5] "J.K. Rowling" "Antoine de Saint-Exupéry"
## [7] "George Orwell" "C.S. Lewis"
## [9] "John Steinbeck" "Aldous Huxley"
## [11] "Margaret Mitchell" "A.A. Milne"
## [13] "José Saramago" "James Joyce"
## [15] "Agatha Christie" "Kurt Vonnegut Jr."
## [17] "J.D. Salinger" "Hunter S. Thompson"
## [19] "Virginia Woolf" "William Styron"
## [21] "Anita Diamant" "Tim O'Brien"
## [23] "George R.R. Martin" "Larry McMurtry"
## [25] "Samuel Beckett" "Daniel Keyes"
## [27] "Wilson Rawls" "Ralph Ellison"
## [29] "Ernest Hemingway" "Zora Neale Hurston"
## [31] "Laura Ingalls Wilder" "Anthony Burgess"
## [33] "John Irving" "Ayn Rand"
## [35] "William Faulkner" "Amy Tan"
## [37] "John Kennedy Toole" "Frank McCourt"
## [39] "Pearl S. Buck" "Erich Maria Remarque"
## [41] "Toni Morrison" "Jack Kerouac"
## [43] "Douglas Adams" "S.E. Hinton"
## [45] "Mikhail Bulgakov" "Ken Follett"
## [47] "Dr. Seuss" "Roald Dahl"
## [49] "Maurice Sendak" "Frank Herbert"
## [51] "Umberto Eco" "Milan Kundera"
## [53] "Stephen King" "Daphne du Maurier"
## [55] "Arthur Conan Doyle" "Hermann Hesse"
## [57] "Frances Hodgson Burnett" "Gabriel García Márquez"
## [59] "Barbara Kingsolver" "Franz Kafka"
## [61] "Alice Walker" "Truman Capote"
## [63] "L.M. Montgomery" "Paulo Coelho"
## [65] "Richard Adams" "Sylvia Plath"
## [67] "Betty Smith" "Elie Wiesel"
## [69] "Shel Silverstein" "Orson Scott Card"
## [71] "Madeleine L'Engle" "E.B. White"
## [73] "Arthur Golden" "Ken Kesey"
## [75] "Albert Camus" "Margaret Atwood"
## [77] "Vladimir Nabokov" "Lois Lowry"
## [79] "William Golding"
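Rather than counting the printed rows by hand, the number of books tied at the maximum can be computed directly. A minimal sketch on a toy data frame (column names mirror the post, values are made up):

```r
# Toy data: two books share the maximum score
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  score = c(10, 30, 30, 20),
  stringsAsFactors = FALSE
)

# Logical vector marking the rows tied at the maximum score
is_top <- toy$score == max(toy$score, na.rm = TRUE)

n_top      <- sum(is_top, na.rm = TRUE)  # how many books are tied
top_titles <- toy$title[which(is_top)]   # which() drops any NA comparisons

print(n_top)
print(top_titles)
```

Using `na.rm = TRUE` and `which()` keeps the result well defined even if some scores failed to parse.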
We have 99 books tied at the best score of 998587, while only Harry Potter and the Sorcerer’s Stone has the most reviews (well done, J.K.!). I also want to look for common themes in the descriptions of these best-score books, which might hint at which themes are popular:
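# tm provides Corpus(), tm_map() and TermDocumentMatrix();
# wordcloud and RColorBrewer are used for the plot further down
library(tm)
library(wordcloud)
library(RColorBrewer)
corpus <- Corpus(VectorSource(data[data$score == max(data$score),]$desc))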
# source: https://www.r-bloggers.com/2021/03/text-mining-term-frequency-analysis-and-word-cloud-creation-using-the-tm-package/
corpus_clean <- tm_map(corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, content_transformer(function(x) iconv(x, to="UTF-8")))
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("en"))
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
# Create a term-document matrix (terms in rows, documents in columns)
dtm <- TermDocumentMatrix(corpus_clean)
m <- as.matrix(dtm)
word_freqs <- sort(rowSums(m), decreasing = TRUE)
word_freqs_df <- data.frame(word = names(word_freqs), freq = word_freqs)
# Display the most common words
head(word_freqs_df, n = 20)
## word freq
## one one 24
## novel novel 20
## world world 18
## story story 18
## classic classic 16
## life life 15
## family family 12
## american american 12
## book book 11
## love love 11
## harry harry 11
## will will 11
## war war 11
## little little 10
## first first 9
## beautiful beautiful 9
## back back 9
## four four 9
## begins begins 8
## ever ever 8
wordcloud(words = word_freqs_df$word, freq = word_freqs_df$freq, min.freq = 1,
max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
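For intuition, the term-frequency step can be reproduced in base R on a couple of toy descriptions. This is only a sketch: the sentences and the tiny stop-word list are made up, and it does not replicate everything the `tm` pipeline does.

```r
# Toy descriptions (made up for this example)
descs <- c("A classic novel of one family.", "One world, one war.")

# Tiny illustrative stop list, NOT the list returned by tm's stopwords("en")
stop_words <- c("a", "of", "the")

# Lowercase, strip punctuation, split on whitespace
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", descs)), "\\s+"))

# Drop empty strings and stop words
words <- words[nzchar(words) & !(words %in% stop_words)]

# Count occurrences and sort from most to least frequent
freqs <- sort(table(words), decreasing = TRUE)
print(head(freqs, 3))
```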
No clear insight emerges here, apart perhaps from the words “american”, “world” and “family”. Let’s move on.
What about the average score per author, and how many books each author has on the list?
library(dplyr)  # provides %>%, group_by(), summarise() and arrange()
# Calculate the average score for each author
average_scores_by_author <- data %>%
group_by(author) %>%
summarise(average_score = mean(score, na.rm = TRUE)) %>%
arrange(desc(average_score))
print(average_scores_by_author)
## # A tibble: 331 × 2
## author average_score
## <chr> <dbl>
## 1 Aldous Huxley 998587
## 2 Alice Walker 998587
## 3 Amy Tan 998587
## 4 Anita Diamant 998587
## 5 Anthony Burgess 998587
## 6 Antoine de Saint-Exupéry 998587
## 7 Arthur Golden 998587
## 8 Betty Smith 998587
## 9 Daniel Keyes 998587
## 10 Daphne du Maurier 998587
## # ℹ 321 more rows
author_book_counts <- data %>%
group_by(author) %>%
summarise(number_of_books = n()) %>%
arrange(desc(number_of_books))
print(author_book_counts)
## # A tibble: 331 × 2
## author number_of_books
## <chr> <int>
## 1 C.S. Lewis 11
## 2 Stephen King 10
## 3 Kurt Vonnegut Jr. 6
## 4 Dr. Seuss 5
## 5 Ernest Hemingway 5
## 6 George Orwell 5
## 7 Hermann Hesse 5
## 8 J.R.R. Tolkien 5
## 9 John Steinbeck 5
## 10 Roald Dahl 5
## # ℹ 321 more rows
# Let's also see how many of the best-score books each author has,
# using the best-score subset we identified earlier
author_bestscore_book_counts <- data[data$score == max(data$score),] %>%
group_by(author) %>%
summarise(number_of_books = n()) %>%
arrange(desc(number_of_books))
print(author_bestscore_book_counts)
## # A tibble: 79 × 2
## author number_of_books
## <chr> <int>
## 1 Ernest Hemingway 4
## 2 J.R.R. Tolkien 4
## 3 J.K. Rowling 3
## 4 John Steinbeck 3
## 5 Ayn Rand 2
## 6 C.S. Lewis 2
## 7 Douglas Adams 2
## 8 Dr. Seuss 2
## 9 Gabriel García Márquez 2
## 10 George Orwell 2
## # ℹ 69 more rows
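The two counts (books on the list and books at the best score) are easier to compare side by side. Below is a base R sketch on toy data; the column names `n_books`, `n_top` and `share_top` are hypothetical and the values are made up.

```r
# Toy data: three authors, top score is 100
toy <- data.frame(
  author = c("X", "X", "Y", "Z", "Z", "Z"),
  score  = c(100, 50, 100, 100, 100, 20),
  stringsAsFactors = FALSE
)
top <- max(toy$score, na.rm = TRUE)

# One row per author
per_author <- data.frame(
  author = sort(unique(toy$author)),
  stringsAsFactors = FALSE
)

# Books on the list per author
per_author$n_books <- as.vector(table(toy$author)[per_author$author])

# Books at the top score per author
per_author$n_top <- as.vector(
  tapply(toy$score == top, toy$author, sum)[per_author$author]
)

# Share of each author's books that reach the top score
per_author$share_top <- per_author$n_top / per_author$n_books

print(per_author[order(-per_author$share_top), ])
```

A share puts a large catalogue in context: an author with many books on the list may still have only a few at the top score.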
It was no surprise to see J.R.R. Tolkien and J.K. Rowling there. C.S. Lewis is on the best books of the 20th century list with 11 books, which is a huge achievement!
We can also find how many books were published in each genre over the years:
library(tidyr)  # provides unnest()
genres_split <- data %>%
mutate(genre = strsplit(as.character(genre), ",")) %>%
unnest(genre)
genre_trends <- genres_split %>%
group_by(genre) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange( desc(count))
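Splitting on a bare comma keeps any whitespace that follows it, so `" Classics"` and `"Classics"` would be counted as different genres. Whether the scraped `genre` column contains such spaces is an assumption; the base R sketch below shows a split that is robust either way, on toy data.

```r
# Toy data: one row uses ", " and the other "," as the separator
toy <- data.frame(
  title = c("Book A", "Book B"),
  genre = c("Fantasy, Classics", "Science Fiction,Fantasy"),
  stringsAsFactors = FALSE
)

# Split on commas, then trim whitespace around each genre
genre_list <- lapply(strsplit(toy$genre, ","), trimws)

# One row per (title, genre) pair
long <- data.frame(
  title = rep(toy$title, lengths(genre_list)),
  genre = unlist(genre_list),
  stringsAsFactors = FALSE
)

print(table(long$genre))
```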
Since I am a huge fantasy and science fiction fan, I am going to look at how many books on the list were published each year in those two genres.
fantasy_scifi <- genres_split %>%
filter(genre == "Fantasy" | genre == "Science Fiction") %>%
group_by(year, genre) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(year,desc(count))
print(fantasy_scifi,n=101)
## # A tibble: 101 × 3
## year genre count
## <int> <chr> <int>
## 1 1902 Fantasy 2
## 2 1908 Fantasy 1
## 3 1911 Fantasy 2
## 4 1912 Fantasy 1
## 5 1912 Science Fiction 1
## 6 1915 Fantasy 1
## 7 1922 Fantasy 1
## 8 1924 Science Fiction 1
## 9 1926 Fantasy 1
## 10 1928 Fantasy 1
## 11 1932 Science Fiction 1
## 12 1937 Fantasy 1
## 13 1938 Science Fiction 1
## 14 1942 Fantasy 1
## 15 1943 Fantasy 1
## 16 1943 Science Fiction 1
## 17 1944 Fantasy 1
## 18 1945 Fantasy 3
## 19 1945 Science Fiction 1
## 20 1946 Fantasy 2
## 21 1949 Science Fiction 2
## 22 1949 Fantasy 1
## 23 1950 Fantasy 3
## 24 1950 Science Fiction 2
## 25 1951 Fantasy 4
## 26 1951 Science Fiction 3
## 27 1952 Fantasy 3
## 28 1953 Fantasy 3
## 29 1953 Science Fiction 3
## 30 1954 Fantasy 3
## 31 1955 Fantasy 3
## 32 1956 Fantasy 3
## 33 1957 Fantasy 2
## 34 1957 Science Fiction 2
## 35 1958 Fantasy 1
## 36 1959 Science Fiction 4
## 37 1959 Fantasy 3
## 38 1960 Fantasy 1
## 39 1961 Fantasy 3
## 40 1961 Science Fiction 1
## 41 1962 Fantasy 4
## 42 1962 Science Fiction 4
## 43 1963 Fantasy 1
## 44 1963 Science Fiction 1
## 45 1964 Fantasy 2
## 46 1965 Fantasy 2
## 47 1965 Science Fiction 1
## 48 1966 Science Fiction 2
## 49 1966 Fantasy 1
## 50 1967 Fantasy 4
## 51 1968 Fantasy 3
## 52 1968 Science Fiction 1
## 53 1969 Science Fiction 2
## 54 1969 Fantasy 1
## 55 1970 Fantasy 2
## 56 1971 Fantasy 2
## 57 1972 Fantasy 2
## 58 1973 Science Fiction 3
## 59 1973 Fantasy 1
## 60 1974 Fantasy 2
## 61 1974 Science Fiction 1
## 62 1975 Fantasy 2
## 63 1976 Fantasy 1
## 64 1977 Fantasy 3
## 65 1977 Science Fiction 1
## 66 1978 Fantasy 1
## 67 1979 Fantasy 2
## 68 1979 Science Fiction 1
## 69 1980 Fantasy 1
## 70 1981 Fantasy 3
## 71 1981 Science Fiction 1
## 72 1982 Fantasy 5
## 73 1982 Science Fiction 1
## 74 1983 Fantasy 2
## 75 1984 Fantasy 3
## 76 1984 Science Fiction 1
## 77 1985 Fantasy 6
## 78 1985 Science Fiction 5
## 79 1986 Fantasy 3
## 80 1986 Science Fiction 1
## 81 1987 Fantasy 1
## 82 1987 Science Fiction 1
## 83 1988 Fantasy 3
## 84 1989 Fantasy 4
## 85 1990 Fantasy 6
## 86 1990 Science Fiction 3
## 87 1991 Fantasy 2
## 88 1992 Fantasy 3
## 89 1992 Science Fiction 1
## 90 1993 Fantasy 2
## 91 1993 Science Fiction 1
## 92 1994 Fantasy 1
## 93 1995 Science Fiction 1
## 94 1996 Fantasy 5
## 95 1996 Science Fiction 3
## 96 1997 Fantasy 3
## 97 1997 Science Fiction 1
## 98 1998 Fantasy 3
## 99 1999 Fantasy 2
## 100 2000 Fantasy 5
## 101 2000 Science Fiction 2
library(ggplot2)
fantasy_scifi %>%
ggplot(aes(x = year, y = count, color = genre)) +
geom_line() +
geom_point() +
theme_minimal() +
labs(title = "Number of Books Published in Fantasy and Science Fiction by Year",
x = "Year",
y = "Number of Books",
color = "Genre")
It looks like fantasy outnumbered science fiction on this list for most of the 20th century.
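One caveat about the line chart: year and genre combinations with no books (for example, Science Fiction in 1952) are simply absent from the table, so the lines are drawn straight across those gaps. A possible refinement, sketched here in base R on toy counts, is to fill the missing combinations with zeros before plotting.

```r
# Toy counts: 1951 is missing entirely, 1952 has no Science Fiction row
counts <- data.frame(
  year  = c(1950, 1950, 1952),
  genre = c("Fantasy", "Science Fiction", "Fantasy"),
  count = c(3, 2, 3),
  stringsAsFactors = FALSE
)

# Every year in the observed range crossed with every genre
grid <- expand.grid(
  year  = seq(min(counts$year), max(counts$year)),
  genre = unique(counts$genre),
  stringsAsFactors = FALSE
)

# Left join: combinations without books get NA, which we turn into 0
filled <- merge(grid, counts, by = c("year", "genre"), all.x = TRUE)
filled$count[is.na(filled$count)] <- 0

print(filled[order(filled$year, filled$genre), ])
```

Passing the filled table to the same plotting code would make years without books show up as zero instead of being skipped.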
# Calculate the Pearson correlation coefficient
cor_score_pages <- cor(data$pages, data$score, use="complete.obs")
cor_reviews_pages <- cor(data$pages, data$reviews, use="complete.obs")
cor_reviews_score <- cor(data$score, data$reviews, use="complete.obs")
# Output the correlation coefficients
print(paste("Correlation between Book Length and Score:", cor_score_pages))
## [1] "Correlation between Book Length and Score: 0.00679097805518466"
print(paste("Correlation between Book Length and Number of Reviews:", cor_reviews_pages))
## [1] "Correlation between Book Length and Number of Reviews: -0.0702156801409243"
print(paste("Correlation between Score and Number of Reviews:", cor_reviews_score))
## [1] "Correlation between Score and Number of Reviews: 0.543389813157696"
library(corrplot)
cor_matrix <- cor(data[c("pages", "score", "reviews")], use="complete.obs")
corrplot(cor_matrix, method = "circle", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45,
addCoef.col = "black",
diag = FALSE)
Based on these results, a book’s length appears essentially unrelated to its score (r ≈ 0.01) or to the number of reviews it receives (r ≈ -0.07). The only relationship worth noting is between a book’s score and its number of reviews, which is positive and moderate (r ≈ 0.54).
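`cor()` only returns the coefficient. To attach a p-value, base R offers `cor.test()`, and because the score and review columns are heavily skewed (the means sit far above the medians in the summary), a rank-based Spearman correlation may be a useful cross-check. The sketch below uses made-up toy vectors, not the scraped data.

```r
# Toy vectors (made up; no ties, so the Spearman test runs without warnings)
pages   <- c(120, 200, 310, 450, 520, 610, 700, 880)
reviews <- c(900, 400, 1500, 700, 2500, 1200, 3000, 2000)

# Pearson: linear association, sensitive to outliers and skew
pearson <- cor.test(pages, reviews, method = "pearson")

# Spearman: rank-based, more robust for skewed variables
spearman <- cor.test(pages, reviews, method = "spearman")

cat("Pearson r:", pearson$estimate, " p-value:", pearson$p.value, "\n")
cat("Spearman rho:", spearman$estimate, " p-value:", spearman$p.value, "\n")
```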
Another question is whether a book’s score depends on its author. Since author is a categorical variable, ANOVA would be the natural choice, so I first check whether the scores are normally distributed:
shapiro.test(data$score)
##
## Shapiro-Wilk normality test
##
## data: data$score
## W = 0.52087, p-value < 2.2e-16
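Since the scores are far from normal (strictly speaking, ANOVA assumes normally distributed residuals, but a response this skewed makes that unlikely), I looked for another test. Asking ChatGPT and some googling both pointed to the Kruskal–Wallis test, a rank-based alternative to one-way ANOVA: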
kruskal.test(score ~ author, data = data)
##
## Kruskal-Wallis rank sum test
##
## data: score by author
## Kruskal-Wallis chi-squared = 360.82, df = 330, p-value = 0.117
Huh?! At the conventional alpha level of 0.05, the test finds no significant difference in score between authors (p = 0.117). Note that this may not reflect the truth: with 331 authors across 499 books, most authors contribute only one or two books, so the test has little power to detect differences. It would have been fun to see the author affecting the score but, oh well, Goodreads users seem to be objective readers.
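One way to probe this caveat is to look at how many books each author has and to rerun the test only on authors with several books. The sketch below uses toy data, and the threshold of three books is an arbitrary assumption chosen for illustration.

```r
# Toy data: authors A and B have three books each, C and D only one
toy <- data.frame(
  author = c("A", "A", "A", "B", "B", "B", "C", "D"),
  score  = c(10, 12, 11, 30, 28, 35, 20, 22),
  stringsAsFactors = FALSE
)

# Group sizes: how many books each author has
books_per_author <- table(toy$author)
n_single <- sum(books_per_author == 1)  # authors with a single book
print(n_single)

# Keep only authors with at least 3 books (arbitrary threshold)
keep <- names(books_per_author)[books_per_author >= 3]
subset_data <- toy[toy$author %in% keep, ]
subset_data$author <- factor(subset_data$author)

# Kruskal-Wallis test on the reduced set of authors
kw <- kruskal.test(score ~ author, data = subset_data)
print(kw)
```

Restricting the test changes the question being asked (it now covers only prolific authors), so the result would complement the full test rather than replace it.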
From the “Best Books of 20th Century” list on Goodreads, I explored a couple of questions I had. Family seems to be one common theme of 20th-century books, and it would be interesting to compare this with 21st-century books in the future. I was also expecting some relationship between a book’s author and its score, but this does not seem to be the case. However, there is a moderate positive correlation between a book’s score and its number of reviews.
We also saw J.R.R. Tolkien and J.K. Rowling shining on this list!
It was fun to scrape and analyse this list, even though we did not find many interesting relationships or insights. This might be because the list already contains only the best books of the 20th century. It would be even more interesting to investigate specific categories separately, regardless of whether they are “the best” or not.
Thanks for reading!