Final Project

Fomba Kassoh & Souleymane Doumbia

2023-12-10

Abstract

The analysis provides a comprehensive examination of audiobook data. It employs web scraping techniques, using Selenium and a Scrapy spider, to gather audiobook information. The data, initially saved as CSV and JSON, is processed in R, following the OSEMN framework and Hadley Wickham’s Grammar of Data Science. Key tasks include data tidying, parsing, cleaning, and transformation for thorough analysis.

Significant efforts are made to parse and clean fields like ‘authors’ and ‘narrators’, extract numeric data from ‘length’, ‘rating’, and ‘no_of_ratings’, and standardize ‘release_date’ and ‘language’. Price fields are also formatted, and a ‘sales_status’ column is added.

The analysis delves into audiobook length, revealing a user preference for books roughly 9 hours long, and examines the rating distribution, which is skewed towards high ratings. The relationship between audiobook length and ratings is explored and shows a minor negative correlation, suggesting that length has only a marginal impact on ratings. The distribution of the number of ratings highlights a concentration of popularity among specific titles.

Author and narrator popularity are analyzed, indicating clear hierarchies and influence on the market. Sentiment analysis of the reviews scraped with Scrapy reveals predominantly positive sentiment towards the audiobooks.

The study concludes that user behavior and preferences favor high-quality content, with specific authors and narrators significantly influencing the market. The relationship between audiobook length and ratings is minor, and user engagement varies over time. This multifaceted insight underscores the importance of content quality, author/narrator popularity, and user engagement in shaping the audiobook industry.

Introduction

This project delves into the analysis of audiobooks as follows:

Context of Data Collection

The data was acquired through web scraping with Selenium and Scrapy.

Preview of Scraping code:

https://raw.githubusercontent.com/hawa1983/DATA607_Final_project/main/audible.py

https://raw.githubusercontent.com/hawa1983/DATA607_Final_project/main/audible_selenium_v2.py

Post-acquisition

The project follows the OSEMN framework and Hadley Wickham’s Grammar of Data Science.

# Loading libraries
library(readr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

Overview of the data

Below are the variables in the data. The data scraped with Scrapy additionally contains a ‘reviews’ column, which is used later for sentiment analysis.

audible_books <- read_csv("https://raw.githubusercontent.com/hawa1983/DATA607/main/audible_books.csv")
## Rows: 500 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): title, subtitle, authors, narrators, series, length, release_date,...
## lgl  (2): sales_price, category
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(audible_books)
## Rows: 500
## Columns: 15
## $ title         <chr> "He's Not My Type", "Thesaurize", "Moral Stand", "LLC Be…
## $ subtitle      <chr> "Vancouver Agitators Series, Book 4", "The Completionist…
## $ authors       <chr> "['Meghan Quinn']", "['Dakota Krout']", "['Daniel Schinh…
## $ narrators     <chr> "['Connor Crais', 'Erin Mallon', 'Teddy Hamilton', 'Jaso…
## $ series        <chr> "The Vancouver Agitators", "The Completionist Chronicles…
## $ length        <chr> "Length: 11 hrs and 40 mins", "Length: 11 hrs and 40 min…
## $ release_date  <chr> "Release date: 11-28-23", "Release date: 11-06-23", "Rel…
## $ language      <chr> "Language: English", "Language: English", "Language: Eng…
## $ rating        <chr> "5 out of 5 stars", "5 out of 5 stars", "5 out of 5 star…
## $ no_of_ratings <chr> "362 ratings", "328 ratings", "164 ratings", "51 ratings…
## $ regular_price <chr> "$24.95", "$24.95", "$24.95", "$14.95", "$14.95", "$19.9…
## $ sales_price   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ category      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ genres        <chr> "[]", "[]", "[]", "[]", "[]", "[]", "[]", "[]", "[]", "[…
## $ url           <chr> "https://www.audible.com/pd/Hes-Not-My-Type-Audiobook/B0…

Data Tidying

Parsing and Cleaning ‘authors’, ‘narrators’

Parse and clean the ‘authors’ and ‘narrators’ fields using str_replace_all to remove unwanted characters and extract the necessary information.

# Cleaning 'authors' and 'narrators', relocating them, dropping the original, and renaming the cleaned columns
audible_books <- audible_books %>%
  mutate(
    authors_cleaned = str_replace_all(authors, "\\['|'\\]", ""),
    narrators_cleaned = str_replace_all(narrators, "\\['|'\\]", "")
  ) %>%
  relocate(authors_cleaned, .after = authors) %>%
  relocate(narrators_cleaned, .after = narrators) %>%
  select(-authors, -narrators) %>%
  rename(
    authors = authors_cleaned,
    narrators = narrators_cleaned
  )

#audible_books %>%
#  select(title, authors, narrators) %>%
#  head()
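As a quick check of what the pattern above does, the sketch below applies it to a hypothetical value shaped like the scraped ‘narrators’ column. The pattern removes only the leading `['` and the trailing `']`, so a multi-name value keeps its inner quote separators. The helper `clean_list_string()` is not part of the original analysis; it is an illustrative alternative that also normalises those separators.

```r
library(stringr)

# Hypothetical sample value in the same shape as the scraped 'narrators' column
raw <- "['Connor Crais', 'Erin Mallon', 'Teddy Hamilton']"

# Pattern used in the report: strips only the leading [' and the trailing ']
partly_cleaned <- str_replace_all(raw, "\\['|'\\]", "")

# Illustrative alternative (hypothetical helper, not from the original code):
# strip the outer brackets and quotes, then turn the quoted separators into plain commas
clean_list_string <- function(x) {
  inner <- str_remove_all(x, "^\\['?|'?\\]$")
  str_replace_all(inner, "', '", ", ")
}

fully_cleaned <- clean_list_string(raw)
single_name   <- clean_list_string("['Meghan Quinn']")
empty_list    <- clean_list_string("[]")
```

Single-name values such as `['Meghan Quinn']` come out the same under both approaches; the difference only appears when a field lists several people.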

Extracting Numeric Data from ‘length’, ‘rating’, and ‘no_of_ratings’

We will tidy these columns as follows:

library(dplyr)
library(stringr)

audible_books <- audible_books %>%
  mutate(
    # Extracting hours and convert to numeric, replacing NA with 0 if only minutes are present
    length_hours = as.numeric(str_extract(length, "\\b\\d+\\b(?=\\s*hr)")) %>% replace_na(0),
    # Extracting minutes and convert to numeric, replacing NA with 0 if only hours are present
    length_minutes = as.numeric(str_extract(length, "\\b\\d+\\b(?=\\s*min)")) %>% replace_na(0),
    # Calculating total length in minutes
    total_length_minutes = (length_hours * 60) + length_minutes,
    # Extracting numeric rating and number of ratings
    rating_numeric = as.numeric(str_extract(rating, "\\b\\d+(\\.\\d+)?")),
    no_of_ratings_numeric = as.numeric(str_extract(no_of_ratings, "\\b\\d+"))
  ) %>%
  select(-length_hours, -length_minutes) %>%
  relocate(total_length_minutes, .after = length) %>%
  relocate(rating_numeric, .after = rating) %>%
  relocate(no_of_ratings_numeric, .after = no_of_ratings)

#audible_books %>%
#  select(length, total_length_minutes, rating, rating_numeric, no_of_ratings, no_of_ratings_numeric) %>%
#  head()
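The extraction rules above can be checked on a few hand-written strings. The sketch below uses hypothetical inputs (only the first one is taken from the data preview) and covers the hours-and-minutes, minutes-only, and hours-only cases. It also shows, as an assumption about data not seen in this sample, how a count with a thousands separator would behave: `str_extract()` with `\\b\\d+` stops at the comma, whereas `readr::parse_number()` reads the whole number.

```r
library(stringr)
library(tidyr)
library(readr)

# Hypothetical 'length' strings covering the three shapes the regex must handle
length_raw <- c("Length: 11 hrs and 40 mins", "Length: 45 mins", "Length: 1 hr")

# Same patterns as in the report; missing parts become 0
hours   <- replace_na(as.numeric(str_extract(length_raw, "\\b\\d+\\b(?=\\s*hr)")), 0)
minutes <- replace_na(as.numeric(str_extract(length_raw, "\\b\\d+\\b(?=\\s*min)")), 0)
total_minutes <- hours * 60 + minutes

# Hypothetical rating counts; the second value has a thousands separator
count_raw <- c("362 ratings", "1,234 ratings")
count_regex  <- as.numeric(str_extract(count_raw, "\\b\\d+"))  # stops at the comma
count_parsed <- parse_number(count_raw)                         # handles the comma
```

In the 500-row sample the largest count is 973, so the comma case does not arise there; it would matter only for a larger scrape.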

Standardizing ‘release_date’ and ‘language’:

# Standardizing 'release_date' and extracting just the language name from 'language', rearranging them, dropping the original, and renaming them
audible_books <- audible_books %>%
  mutate(
    release_date_standardized = as.Date(str_extract(release_date, "\\d{2}-\\d{2}-\\d{2,4}"), format = "%m-%d-%y"),
    language_standardized = str_replace(language, "Language: ", "")
  ) %>%
  relocate(release_date_standardized, .after = release_date) %>%
  relocate(language_standardized, .after = language) %>%
  select(-release_date, -language) %>%
  rename(
    release_date = release_date_standardized,
    language = language_standardized
  )

#audible_books %>%
#  head()
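Because the release dates carry two-digit years, it helps to see how `%y` resolves the century. The sketch below uses base R only; the first string comes from the data preview and the other two are hypothetical values chosen to show the cut-off documented in `?strptime` (00 to 68 map to 20xx, 69 to 99 map to 19xx).

```r
# First value is from the data preview; the other two are hypothetical edge cases
raw_dates <- c("Release date: 11-28-23",
               "Release date: 01-15-68",
               "Release date: 01-15-69")

# Drop the label, then parse month-day-two-digit-year
date_part <- sub("^Release date: ", "", raw_dates)
parsed <- as.Date(date_part, format = "%m-%d-%y")

# ISO-formatted strings for inspection
parsed_iso <- format(parsed)
```

For a catalogue of recent releases this rule is harmless, but a title first released before 1969 would be placed in the 2000s.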

Formatting Price Fields

# Processing 'regular_price' and 'sales_price', create 'sales_status' column, rearrange, drop the original, and rename
audible_books <- audible_books %>%
  mutate(
    regular_price_numeric = as.numeric(str_extract(regular_price, "\\d+\\.\\d+")),
    sales_price_numeric = as.numeric(str_extract(sales_price, "\\d+\\.\\d+")),
    sales_status = ifelse(is.na(sales_price_numeric), "not on sale", "on sale")
  ) %>%
  relocate(regular_price_numeric, .after = regular_price) %>%
  relocate(sales_price_numeric, .after = sales_price) %>%
  relocate(sales_status, .after = sales_price_numeric) %>%
  select(-regular_price, -sales_price)

#audible_books %>%
#  select(regular_price_numeric, sales_price_numeric, sales_status) %>%
#  head()
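In the scraped sample every ‘sales_price’ is missing, so the "on sale" branch is never exercised. The sketch below runs the same logic on a small hypothetical table (the `prices` tibble and its values are invented for illustration) to show both outcomes.

```r
library(dplyr)
library(stringr)

# Hypothetical prices; the second row has a sale price, the others do not
prices <- tibble(
  regular_price = c("$24.95", "$14.95", "$19.95"),
  sales_price   = c(NA, "$7.49", NA)
)

# Same extraction and status rule as in the report
prices <- prices %>%
  mutate(
    regular_price_numeric = as.numeric(str_extract(regular_price, "\\d+\\.\\d+")),
    sales_price_numeric   = as.numeric(str_extract(sales_price, "\\d+\\.\\d+")),
    sales_status = ifelse(is.na(sales_price_numeric), "not on sale", "on sale")
  )
```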

Audiobook Length Analysis

library(ggplot2)
# Summary statistics for audiobook lengths
length_summary <- summary(audible_books$total_length_minutes)
length_summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    31.0   369.8   532.0   579.4   697.0  2762.0
# Histogram of audiobook lengths
ggplot(audible_books, aes(x = total_length_minutes)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(title = "Distribution of Audiobook Lengths", x = "Total Length (minutes)", y = "Frequency")

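The histogram of audiobook lengths shows a right-skewed distribution: most titles cluster around the median of 532 minutes (roughly 9 hours), the mean of 579.4 minutes sits above the median, and a long tail extends to the maximum of 2,762 minutes (about 46 hours).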
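To relate the minute-based statistics to the roughly 9-hour figure quoted in the abstract, the summary values can be converted to hours. The sketch below simply re-enters the six numbers printed in the summary above; the vector name `length_summary_minutes` is introduced here for illustration.

```r
# Values copied from the summary output of total_length_minutes
length_summary_minutes <- c(
  Min = 31.0, Q1 = 369.8, Median = 532.0, Mean = 579.4, Q3 = 697.0, Max = 2762.0
)

# Convert to hours, rounded to one decimal place
length_summary_hours <- round(length_summary_minutes / 60, 1)
length_summary_hours
```

The median works out to just under 9 hours and the mean to just under 10, which is consistent with a long right tail.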

Rating Distribution

# Summary statistics for audiobook ratings
rating_summary <- summary(audible_books$rating_numeric)
rating_summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.500   4.500   5.000   4.843   5.000   5.000
rating_frequency <- audible_books %>% select (rating_numeric) %>% group_by(rating_numeric) %>% count()
rating_frequency
## # A tibble: 2 × 2
## # Groups:   rating_numeric [2]
##   rating_numeric     n
##            <dbl> <int>
## 1            4.5   157
## 2            5     343
# Histogram or density plot of audiobook ratings
ggplot(audible_books, aes(x = rating_numeric)) +
 #geom_histogram(bins = 30, fill = "grey", color = "purple") +
  geom_density(fill = "green", alpha = 0.7) +
  labs(title = "Distribution of Audiobook Ratings", x = "Rating", y = "Density")

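The density plot of audiobook ratings shows that ratings in this sample take only two values, 4.5 (157 titles) and 5 (343 titles), so the distribution is concentrated at the top of the scale.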

Distribution of the Number of Ratings

# Summary statistics for number of ratings
number_ofRating_summary <- summary(audible_books$no_of_ratings_numeric)
number_ofRating_summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   11.00   15.00   28.31   25.00  973.00
frequency <- audible_books %>% select(no_of_ratings_numeric) %>% group_by(no_of_ratings_numeric) %>% count()
frequency
## # A tibble: 64 × 2
## # Groups:   no_of_ratings_numeric [64]
##    no_of_ratings_numeric     n
##                    <dbl> <int>
##  1                     6     5
##  2                     7    24
##  3                     8    38
##  4                     9    41
##  5                    10    16
##  6                    11    17
##  7                    12    29
##  8                    13    29
##  9                    14    29
## 10                    15    27
## # ℹ 54 more rows
# Histogram of number of ratings
ggplot(audible_books, aes(x = no_of_ratings_numeric)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(title = "Distribution of Number of Ratings", x = "Total Number of Ratings", y = "Frequency")

The “Distribution of Number of Ratings” histogram, together with the summary statistics from the R console, shows a strongly right-skewed distribution: half of the titles have 15 or fewer ratings, while the mean of 28.31 is pulled upward by a small number of titles with as many as 973 ratings.

Relationship Between Ratings and Length

# Correlation analysis between ratings and total length
correlation_length_rating <- cor(audible_books$rating_numeric, audible_books$total_length_minutes, use = "complete.obs")
correlation_length_rating
## [1] -0.1490941
# Scatter plot of ratings vs total length
ggplot(audible_books, aes(x = total_length_minutes, y = rating_numeric)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Audiobook Ratings vs. Total Length", x = "Total Length (minutes)", y = "Rating")
## `geom_smooth()` using formula = 'y ~ x'

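The scatter plot of audiobook ratings against total length, together with the correlation coefficient of about -0.15, shows a weak negative relationship: longer audiobooks tend to have slightly lower ratings, but length explains very little of the variation in ratings.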
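How weak this relationship is can be made concrete with a little arithmetic on the reported coefficient. The sketch below squares the correlation to get the share of variance explained and computes the usual t statistic for a Pearson correlation. The sample size `n = 500` is an assumption: it is the row count of the data, but rows with missing values would have been dropped by `use = "complete.obs"`.

```r
r <- -0.1490941  # correlation reported above
n <- 500         # assumption: all 500 rows were complete

# Share of variance in ratings associated with length
r_squared <- r^2

# t statistic and two-sided p-value for H0: correlation = 0
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

c(r_squared = r_squared, t_stat = t_stat, p_value = p_value)
```

Under that assumption the correlation is distinguishable from zero, yet it accounts for only about 2% of the variance, which fits the reading that length has a marginal impact on ratings. Since the ratings take only two values in this sample, the coefficient is best treated as a rough indicator.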

Relationship Between Ratings and Number of Ratings

# Correlation analysis between ratings and number of ratings
correlation_result <- cor(audible_books$rating_numeric, audible_books$no_of_ratings_numeric, use = "complete.obs")
correlation_result
## [1] -0.1603486
# Scatter plot of ratings vs number of ratings
ggplot(audible_books, aes(x = rating_numeric, y = no_of_ratings_numeric)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Audiobook Ratings vs. Number of Ratings", x = "Rating", y = "Number of Ratings")
## `geom_smooth()` using formula = 'y ~ x'

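The correlation between rating and number of ratings is about -0.16, a weak negative relationship: titles with more ratings tend to have slightly lower ratings, but the effect is small.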

Author and Narrator Popularity Analysis

# Calculate average ratings and number of ratings by author
author_popularity <- audible_books %>%
  group_by(authors) %>%
  summarise(
    average_rating = mean(rating_numeric, na.rm = TRUE),
    total_ratings = sum(no_of_ratings_numeric, na.rm = TRUE)
  ) %>%
  arrange(desc(total_ratings))

# Top 10 authors by total ratings
top_authors <- head(author_popularity, 10)

ggplot(top_authors, aes(x = reorder(authors, total_ratings), y = total_ratings)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() + # Flips the axes to make labels readable
  labs(x = "Authors", y = "Total Number of Ratings", title = "Top 10 Authors by Number of Ratings") +
  theme_minimal()

# Calculate average ratings and number of ratings by narrator
narrator_popularity <- audible_books %>%
  group_by(narrators) %>%
  summarise(
    average_rating = mean(rating_numeric, na.rm = TRUE),
    total_ratings = sum(no_of_ratings_numeric, na.rm = TRUE)
  ) %>%
  arrange(desc(total_ratings))

# Top 10 narrators by total ratings
top_narrators <- head(narrator_popularity, 10)

ggplot(top_narrators, aes(x = reorder(narrators, total_ratings), y = total_ratings)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  coord_flip() +
  labs(x = "Narrators", y = "Total Number of Ratings", title = "Top 10 Narrators by Number of Ratings") +
  theme_minimal()
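Grouping on the full ‘authors’ or ‘narrators’ string treats every distinct combination of names as its own entry, so a narrator who appears on several multi-cast titles is not credited in one place. The sketch below shows one possible alternative on a small hypothetical table: splitting the names into one row per person with `tidyr::separate_rows()` before summarising. The `toy` data and the comma-separated format are assumptions for illustration, not the project's actual values.

```r
library(dplyr)
library(tidyr)

# Hypothetical titles with comma-separated narrator names
toy <- tibble(
  title = c("Book A", "Book B", "Book C"),
  narrators = c("Ann Lee, Bo Cruz", "Ann Lee", "Bo Cruz, Cy Dunn"),
  no_of_ratings_numeric = c(100, 50, 30)
)

# One row per narrator, then total ratings per person
per_narrator <- toy %>%
  separate_rows(narrators, sep = ",\\s*") %>%
  group_by(narrators) %>%
  summarise(total_ratings = sum(no_of_ratings_numeric), .groups = "drop") %>%
  arrange(desc(total_ratings))

per_narrator
```

Note that this credits the full rating count of a title to each of its narrators, which is a modelling choice; dividing the count among them would be another.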

Sentiment Analysis

We used the reviews scraped with Scrapy for sentiment analysis.

Distribution of positive and negative Sentiments

The bar graph shows the distribution of sentiments across reviews for audiobooks. Here’s an analysis based on the interpretation of the graph:

Positive Sentiments: The graph shows a large number of positive sentiments, indicated by the tall blue bar. This suggests that the majority of the reviews are positive, indicating a favorable reception from the reviewers.

Negative Sentiments: There is a smaller red bar representing negative sentiments. The count of negative reviews is significantly lower than that of positive reviews, suggesting that there are some criticisms or negative experiences, but they are in the minority.

Neutral Sentiments: The neutral category, depicted in grey, is not visible in the graph. This suggests that there are either no neutral reviews or their count is negligible compared to the positive and negative reviews.

Overall Impression: The dominant number of positive reviews suggests that sentiment towards the audiobooks is overwhelmingly positive. The small number of negative reviews points to a few areas for improvement, but they do not reflect the general consensus.

Implications: For the providers of the audiobooks being reviewed, this distribution is very good news. It suggests customer satisfaction and could be used in marketing or product development to reinforce the positive aspects and address the negative feedback.

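# Load the necessary libraries
library(ggplot2)
library(dplyr)
library(readr)
library(syuzhet)

# Load the Scrapy review data (same file as in the sentimentr section) and keep rows with review text
reviews_sentiments <- read_csv("https://raw.githubusercontent.com/hawa1983/DATA607_Final_project/main/audible.csv", show_col_types = FALSE) %>%
  filter(!is.na(reviews))

# Assumption: 'sentiment' is the syuzhet score of each review; it is not defined elsewhere in this report
reviews_sentiments <- reviews_sentiments %>%
  mutate(sentiment = get_sentiment(reviews)) %>%
  mutate(sentiment_category = case_when(
    sentiment > 0 ~ "Positive",
    sentiment < 0 ~ "Negative",
    TRUE ~ "Neutral"
  ))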

# Plot the distribution of sentiment categories
ggplot(reviews_sentiments, aes(x = sentiment_category, fill = sentiment_category)) +
  geom_bar() +
  scale_fill_manual(values = c("Positive" = "blue", "Negative" = "red", "Neutral" = "grey")) +
  labs(title = "Sentiment Category Distribution",
       x = "Sentiment Category",
       y = "Count") +
  theme_minimal()
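A bar chart shows the ordering of the categories, but a small table of counts and shares makes the comparison exact. The sketch below applies the same zero-threshold rule to a handful of hypothetical scores (`toy_scores` is invented for illustration) and adds each category's share of the total.

```r
library(dplyr)

# Hypothetical sentiment scores, including one exactly at zero
toy_scores <- tibble(sentiment = c(1.2, 0.4, 0, -0.6, 2.1))

# Same classification rule as above, followed by counts and shares
toy_summary <- toy_scores %>%
  mutate(sentiment_category = case_when(
    sentiment > 0 ~ "Positive",
    sentiment < 0 ~ "Negative",
    TRUE ~ "Neutral"
  )) %>%
  count(sentiment_category) %>%
  mutate(share = n / sum(n))

toy_summary
```

A score of exactly zero falls into "Neutral", so the size of that category depends on how often the scoring method returns zero, for example for very short reviews.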


Distribution of Sentiments

# Load the necessary libraries
library(ggplot2)
library(dplyr)
library(readr)
library(syuzhet)


reviews_sentiments <- reviews_sentiments %>%
  mutate(sentiment_category = case_when(
    sentiment > 0 ~ "Positive",
    sentiment < 0 ~ "Negative",
    TRUE ~ "Neutral"
  ))


# plot the sentiment scores using a histogram
ggplot(reviews_sentiments, aes(x = sentiment)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Sentiment Scores",
       x = "Sentiment Score",
       y = "Frequency") +
  theme_minimal()

The histogram above depicts the distribution of sentiment scores for audiobook reviews. Here’s an analysis of the graph:

Another sentiment analysis (sentimentr)

library(sentimentr)
## 
## Attaching package: 'sentimentr'
## The following object is masked from 'package:syuzhet':
## 
##     get_sentences
library(readr)
library(jsonlite)

# Read the CSV file into a DataFrame
reviews_sentiments <- read_csv("https://raw.githubusercontent.com/hawa1983/DATA607_Final_project/main/audible.csv", show_col_types = FALSE)


# Filter out rows where 'reviews' is NA
reviews_sentiments <- reviews_sentiments %>%
  filter(!is.na(reviews))

# Calculate the mean sentiment for each review
reviews_sentiments <- reviews_sentiments %>%
  rowwise() %>%
  mutate(
    # Calculate sentiment scores for non-NA review texts
    sentiment_score = mean(sentiment(get_sentences(reviews))$sentiment)
  ) %>%
  ungroup() 

# Create a sentiment decision column based on sentiment_score
reviews_sentiments <- reviews_sentiments %>%
  mutate(
    sentiment_decision = case_when(
      sentiment_score > 0  ~ "positive",
      sentiment_score < 0  ~ "negative",
      TRUE ~ "neutral"
    )
  )

# Select the necessary columns
reviews_sentiments <- reviews_sentiments %>%
  select(title, sentiment_score, sentiment_decision)

head(reviews_sentiments, 10)
## # A tibble: 10 × 3
##    title                                    sentiment_score sentiment_decision
##    <chr>                                              <dbl> <chr>             
##  1 Blood Pact                                        0.188  positive          
##  2 Fall for You                                      0.251  positive          
##  3 Caught Sleeping                                   0.389  positive          
##  4 The False Hero, Volume 8                          0.387  positive          
##  5 Another Girl                                      0      neutral           
##  6 Finally Forever                                   0.343  positive          
##  7 Azarinth Healer, Book Three                       0      neutral           
##  8 Prophet Song                                     -0.0499 negative          
##  9 What Do You Know About Human Harvesting?          0      neutral           
## 10 Sylver Seeker 2                                   0      neutral
library(ggplot2)
library(readr)

# Plot the distribution of sentiment
ggplot(reviews_sentiments, aes(x = sentiment_decision)) +
  geom_bar(aes(fill = sentiment_decision)) +
  theme_minimal() +
  labs(title = "Sentiment Distribution",
       x = "Sentiment",
       y = "Count") +
  scale_fill_manual(values = c("positive" = "blue", "negative" = "red", "not rated" = "green", "neutral" = "grey"))


Based on the graph above, here is the analysis:

Conclusion

The project’s analysis of audiobook data revealed several key findings:

These insights can inform stakeholders in the audiobook industry, guiding decisions on production, marketing, and strategic positioning to enhance the overall user experience and market presence.