Introduction

In this report, I present an analysis of the “Best Books Ever” dataset from Kaggle, available at this link.

The focus of this analysis is to explore trends in book genres, and the distribution of those genres, over the period from 1916 to 2017. This involved extensive data cleaning and engineering to prepare the dataset for visualization.


Importing Libraries and Data

Before diving into the analysis, we need to load the necessary libraries and the dataset. The following libraries are used for data manipulation, visualization, and interactive plotting:

# Install necessary packages (if not already installed)

# install.packages("plotly")
# install.packages("dplyr")
# install.packages("tidyverse")
# install.packages("ggplot2")
# install.packages("lubridate")

# Load libraries
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))

# Load the dataset
books <- read.csv("C:\\Users\\kriku\\Documents\\Data Projets\\book_analysis_in_R\\books_1.Best_Books_Ever.csv")
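
The absolute path above only works on one machine. As a minimal sketch, the load step could be wrapped in a small helper that also checks for the three columns used later in this report. The helper name load_books and the "data" directory in the usage line are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: read the CSV and fail early if an expected column is missing
load_books <- function(path,
                       required_cols = c("bookId", "genres", "firstPublishDate")) {
  books <- read.csv(path, stringsAsFactors = FALSE)
  missing_cols <- setdiff(required_cols, names(books))
  if (length(missing_cols) > 0) {
    stop("Missing expected columns: ", paste(missing_cols, collapse = ", "))
  }
  books
}

# Usage (assumed relative location; adjust to where the CSV actually lives):
# books <- load_books(file.path("data", "books_1.Best_Books_Ever.csv"))
```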

Data Preparation

Step 1: Filtering and Cleaning Genre Data

To analyze genre trends, we first extract the relevant columns (bookId, genres, and firstPublishDate), split the genres list so that each row holds a single book-genre pair, and convert firstPublishDate to a four-digit year. Rows with missing or unparseable values are removed.

# as.numeric() and ymd() warn on values that fail to parse; those values
# become NA and are dropped by the final filter(), so the warnings are suppressed.
genre_data <- suppressWarnings(
  books %>%
    select(bookId, genres, firstPublishDate) %>%
    filter(!is.na(genres) & genres != "[]" & !is.na(firstPublishDate) & firstPublishDate != "") %>%
    separate_rows(genres, sep = ", ") %>%                                # one row per book-genre pair
    mutate(genres = gsub("[[:punct:]]", "", genres),                     # strip brackets, quotes, and other punctuation
           firstPublishDate = substr(firstPublishDate, 7, 8),            # characters 7-8 hold the two-digit year
           firstPublishDate = ifelse(as.numeric(firstPublishDate) > 15,  # 16-99 become 19xx, 00-15 become 20xx
                                     paste0("19", firstPublishDate),
                                     paste0("20", firstPublishDate)),
           firstPublishDate = paste0(firstPublishDate, "-01-01"),
           firstPublishDate = year(ymd(firstPublishDate))) %>%           # numeric year, NA if parsing failed
    filter(!is.na(firstPublishDate)) %>%
    arrange(desc(firstPublishDate))
)

# view(genre_data)
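
To make the cleaning step easier to follow, here is a small worked example on made-up rows. It assumes the genres column holds a bracketed, quoted list and that firstPublishDate uses an MM/DD/YY layout, which is what substr(firstPublishDate, 7, 8) implies. The toy data frame and the yy and year column names are illustrative only.

```r
library(dplyr)
library(tidyr)

# Made-up rows in the assumed raw layout
toy <- data.frame(
  bookId = c("book-a", "book-b"),
  genres = c("['Fantasy', 'Young Adult']", "['Classics']"),
  firstPublishDate = c("09/14/08", "07/11/60"),
  stringsAsFactors = FALSE
)

toy_long <- toy %>%
  separate_rows(genres, sep = ", ") %>%              # one row per book-genre pair
  mutate(genres = gsub("[[:punct:]]", "", genres),   # drop brackets and quotes
         yy = substr(firstPublishDate, 7, 8),        # two-digit year
         year = as.integer(ifelse(as.numeric(yy) > 15,
                                  paste0("19", yy),  # 16-99 are read as 19xx
                                  paste0("20", yy))))  # 00-15 are read as 20xx

print(toy_long)
```

With this rule, "08" is read as 2008 and "60" as 1960. Note that any two-digit year above 15 is assigned to the 1900s, so the cutoff determines the latest year the cleaned data can contain.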

Step 2: Selecting Top 5 Genres per Year

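To simplify the visualization, we keep only the five genres with the most books in each year. Because slice_max() keeps ties by default, a year can return more than five genres when counts are tied at the cutoff.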

df_top5 <- genre_data %>%
  group_by(firstPublishDate, genres) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(firstPublishDate) %>%
  slice_max(order_by = count, n = 5) %>% 
  ungroup() 

# view(df_top5)
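
The tie behavior of slice_max() can change the number of rows per year, so a quick sketch on made-up counts may help. The toy_counts data frame is illustrative only; with_ties is a documented argument of dplyr::slice_max().

```r
library(dplyr)

# Made-up counts for a single year; the fifth and sixth genres are tied
toy_counts <- data.frame(
  firstPublishDate = rep(2000, 6),
  genres = c("A", "B", "C", "D", "E", "F"),
  count = c(10, 8, 6, 4, 2, 2)
)

# Default behavior: ties at the cutoff are all kept
top_with_ties <- toy_counts %>%
  group_by(firstPublishDate) %>%
  slice_max(order_by = count, n = 5) %>%
  ungroup()

# Strict top 5: ties at the cutoff are broken by row order
top_strict <- toy_counts %>%
  group_by(firstPublishDate) %>%
  slice_max(order_by = count, n = 5, with_ties = FALSE) %>%
  ungroup()

nrow(top_with_ties)  # 6 rows, because E and F share the fifth-place count
nrow(top_strict)     # 5 rows
```

Whether to keep ties is a design choice: the default avoids dropping a genre arbitrarily, while with_ties = FALSE guarantees at most five genres per year in the plot.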

Step 3: Counting Total Books Published per Year

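We also create a separate dataframe to track the total number of books published each year. Because genre_data holds one row per book-genre pair, each book is first reduced to a single row by bookId so that it is counted only once.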

year_count <- genre_data %>% 
  select(bookId, firstPublishDate) %>% 
  distinct(bookId, .keep_all = TRUE) %>%   # keep one row per book
  count(firstPublishDate)                  # books per year, stored in column n

# firstPublishDate is already a numeric year at this point, so no further
# date conversion is needed.

# view(year_count)
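
A short sketch on made-up rows shows why the distinct() step matters when counting books per year. The toy_pairs data frame is illustrative only.

```r
library(dplyr)

# Made-up long-format rows: book "a" has two genres, book "c" has three
toy_pairs <- data.frame(
  bookId = c("a", "a", "b", "c", "c", "c"),
  firstPublishDate = c(2001, 2001, 2001, 2002, 2002, 2002),
  stringsAsFactors = FALSE
)

# Counting rows directly counts book-genre pairs, not books
pairs_per_year <- toy_pairs %>%
  count(firstPublishDate)

# Deduplicating on bookId first gives the number of books
books_per_year <- toy_pairs %>%
  distinct(bookId, .keep_all = TRUE) %>%
  count(firstPublishDate)

pairs_per_year$n  # 3 3
books_per_year$n  # 2 1
```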

Conclusion

This analysis reveals key trends in book genres and their distribution over the past century.

LinkedIn: https://www.linkedin.com/in/oleksandra-krykun-0b45552a1/