In this project, we analyze the dataset containing sales and rating information of programming books being sold by the company. We are given the task to find the most profitable book title.
The objective of this project is to find the most profitable book title being sold by the company. To determine profitability, we assess the book titles by two metrics, sales volume and revenue.
After the data analysis process is conducted, we found the two most profitable book titles being sold by the company, assessed by two different metrics:
Fundamental of R For Beginners is the most profitable book
title.Secrets of R For Advanced Students
is the most profitable book title.Note that these results are obtained after removing records where the reviews are missing. For further study, if there is more information and method on obtaining the missing reviews, then a new analysis report can be made that will improve upon the results of this one.
In this section, we conduct an analysis on the dataset containing the company’s book reviews.
In this section, we review some fundamental properties of the dataset.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
reviews = read_csv("book_reviews.csv")
## Rows: 2000 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): book, review, state
## dbl (1): price
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
reviews
## # A tibble: 2,000 × 4
## book review state price
## <chr> <chr> <chr> <dbl>
## 1 R Made Easy Excellent TX 20.0
## 2 R For Dummies Fair NY 16.0
## 3 R Made Easy Excellent NY 20.0
## 4 R Made Easy Poor FL 20.0
## 5 Secrets Of R For Advanced Students Great Texas 50
## 6 R Made Easy <NA> California 20.0
## 7 R Made Easy Great Florida 20.0
## 8 R Made Easy Poor CA 20.0
## 9 Top 10 Mistakes R Beginners Make Fair CA 30.0
## 10 Secrets Of R For Advanced Students Fair Texas 50
## # ℹ 1,990 more rows
The book reviews dataset, stored in reviews, has
2,000 records and 4 columns. These columns,
and their corresponding datatypes, are:
book - is a character/string column containing the
book’s titles,review - is a character/string column containing user
ratings for the corresponding book,state - is a character/string column containing the
users’ state locations, andprice - is a numeric/double column containing the
book’s price.Let us check each column to set the initial expectations of this dataset.
distinct(reviews, book, price)
## # A tibble: 5 × 2
## book price
## <chr> <dbl>
## 1 R Made Easy 20.0
## 2 R For Dummies 16.0
## 3 Secrets Of R For Advanced Students 50
## 4 Top 10 Mistakes R Beginners Make 30.0
## 5 Fundamentals of R For Beginners 40.0
For the book and price columns, there are
five unique values for both the book titles and their associated prices,
which serve as the programming books being sold by the company. These
are:
19.99,15.99,50.00,29.99,
and39.99.unique(reviews["review"])
## # A tibble: 6 × 1
## review
## <chr>
## 1 Excellent
## 2 Fair
## 3 Poor
## 4 Great
## 5 <NA>
## 6 Good
For the review column, there are five unique ordinal
values which serve as the user rating. These are:
Excellent,Great,Good,Fair, andPoor.Note that there is are also missing values in this column, indicated
by the presence of the NA value in the column. We will be
dealing with these missing values after we conduct the data review.
unique(reviews["state"])
## # A tibble: 8 × 1
## state
## <chr>
## 1 TX
## 2 NY
## 3 FL
## 4 Texas
## 5 California
## 6 Florida
## 7 CA
## 8 New York
For the state column, while it appears that there are 8
unique values, there are really only 4 state locations indicated in the
column. Note that the complete state name and its associated acronym are
included in the column. These unique state locations are:
Texas or TX,New York or NY,Florida or FL, andCalifornia or CA.In this section, we will accomplish the following tasks:
Missing Values: As mentioned in the previous
section, there are missing values in the review column. As
of now, there is no sufficient information for us to fill in these
missing values, and thus we have decided to remove the records where the
value in the review column is missing.
reviews_f1 <- reviews %>%
filter(
review != is.na(review)
)
reviews_f1
## # A tibble: 1,794 × 4
## book review state price
## <chr> <chr> <chr> <dbl>
## 1 R Made Easy Excellent TX 20.0
## 2 R For Dummies Fair NY 16.0
## 3 R Made Easy Excellent NY 20.0
## 4 R Made Easy Poor FL 20.0
## 5 Secrets Of R For Advanced Students Great Texas 50
## 6 R Made Easy Great Florida 20.0
## 7 R Made Easy Poor CA 20.0
## 8 Top 10 Mistakes R Beginners Make Fair CA 30.0
## 9 Secrets Of R For Advanced Students Fair Texas 50
## 10 R Made Easy Great NY 20.0
## # ℹ 1,784 more rows
unique(reviews_f1["review"])
## # A tibble: 5 × 1
## review
## <chr>
## 1 Excellent
## 2 Fair
## 3 Poor
## 4 Great
## 5 Good
This copy of the dataset, reviews_f1, now only contains
rows with no missing values. There are now only 1,794
records, which means that 206 records are filtered from the
new dataset.
Inconsistent state names: As mentioned in the
previous section, the values in the state column contains
multiple character values which pertain to the same state. For example,
California and CA are both present in the
column, but the dataset treats these as two distinct and separate
values. We need to normalize the dataset so that the number of unique
state names reflects the number of unique state locations.
reviews_f2 <- reviews_f1 %>%
mutate(
state = case_when(
state == "CA" ~ "California",
state == "NY" ~ "New York",
state == "TX" ~ "Texas",
state == "FL" ~ "Florida",
.default = state
)
)
reviews_f2
## # A tibble: 1,794 × 4
## book review state price
## <chr> <chr> <chr> <dbl>
## 1 R Made Easy Excellent Texas 20.0
## 2 R For Dummies Fair New York 16.0
## 3 R Made Easy Excellent New York 20.0
## 4 R Made Easy Poor Florida 20.0
## 5 Secrets Of R For Advanced Students Great Texas 50
## 6 R Made Easy Great Florida 20.0
## 7 R Made Easy Poor California 20.0
## 8 Top 10 Mistakes R Beginners Make Fair California 30.0
## 9 Secrets Of R For Advanced Students Fair Texas 50
## 10 R Made Easy Great New York 20.0
## # ℹ 1,784 more rows
unique(reviews_f2["state"])
## # A tibble: 4 × 1
## state
## <chr>
## 1 Texas
## 2 New York
## 3 Florida
## 4 California
This copy of the dataset, reviews_f2, now contains
normalized state location names in the state column.
Transforming ordinal values: As established in the
previous section, the review column contains the user
rating for the associated title. It is immediately apparent that there
is a rank amongst the unique values in this column. In order to perform
a more comprehensive analysis on the user ratings, we must convert these
text ratings into numerical ratings.
reviews_t1 <- reviews_f2 %>%
mutate(
review = case_when(
review == "Poor" ~ 1,
review == "Fair" ~ 2,
review == "Good" ~ 3,
review == "Great" ~ 4,
review == "Excellent" ~ 5
),
is_review_high = case_when(
review >= 4 ~ T,
.default = F
)
)
reviews_t1
## # A tibble: 1,794 × 5
## book review state price is_review_high
## <chr> <dbl> <chr> <dbl> <lgl>
## 1 R Made Easy 5 Texas 20.0 TRUE
## 2 R For Dummies 2 New York 16.0 FALSE
## 3 R Made Easy 5 New York 20.0 TRUE
## 4 R Made Easy 1 Florida 20.0 FALSE
## 5 Secrets Of R For Advanced Students 4 Texas 50 TRUE
## 6 R Made Easy 4 Florida 20.0 TRUE
## 7 R Made Easy 1 California 20.0 FALSE
## 8 Top 10 Mistakes R Beginners Make 2 California 30.0 FALSE
## 9 Secrets Of R For Advanced Students 2 Texas 50 FALSE
## 10 R Made Easy 4 New York 20.0 TRUE
## # ℹ 1,784 more rows
distinct(reviews_t1, review, is_review_high)
## # A tibble: 5 × 2
## review is_review_high
## <dbl> <lgl>
## 1 5 TRUE
## 2 2 FALSE
## 3 1 FALSE
## 4 4 TRUE
## 5 3 FALSE
This copy of the dataset, reviews_t1, now contains
numerical user ratings for the book titles, as well as the indication of
whether or not the given user rating is high or low.
In this section, we will determine the “most profitable” book title in terms of two metrics: the number of sales and the revenue.
Profitability by sales volume: In order to obtain the number of sales for each book title, we simply have to count the number of books sold for each title.
sales_volume <- reviews_t1 %>%
count(
book,
sort = T,
name = "Number of Sales"
)
sales_volume
## # A tibble: 5 × 2
## book `Number of Sales`
## <chr> <int>
## 1 Fundamentals of R For Beginners 366
## 2 R For Dummies 361
## 3 Secrets Of R For Advanced Students 360
## 4 Top 10 Mistakes R Beginners Make 355
## 5 R Made Easy 352
In terms of sales volume, Fundamental of R For Beginners
is the most profitable book title.
Profitability by revenue: In order to obtain the revenue for each book title, we simply have to sum the price sold for each book title.
revenue <- reviews_t1 %>%
group_by(book) %>%
summarise(Revenue = sum(price)) %>%
arrange(desc(Revenue))
revenue
## # A tibble: 5 × 2
## book Revenue
## <chr> <dbl>
## 1 Secrets Of R For Advanced Students 18000
## 2 Fundamentals of R For Beginners 14636.
## 3 Top 10 Mistakes R Beginners Make 10646.
## 4 R Made Easy 7036.
## 5 R For Dummies 5772.
In terms of revenue, Secrets of R For Advanced Students
is the most profitable book title.
In this project, we analyzed the book reviews dataset given by the company selling programming books. Before beginning the data analysis process, we established the main objective of finding the most profitable book title. In order to do so, we conducted a data preparation process which consists of reviewing the basic information about the dataset, taking care of missing values, and transforming the dataset by normalizing inconsistent values.
Once the dataset is prepared for analysis, we found the two most profitable book titles being sold by the company, assessed by two different metrics:
Fundamental of R For Beginners is the most profitable book
title.Secrets of R For Advanced Students
is the most profitable book title.Note that these results are obtained after removing records where the reviews are missing. For further study, if there is more information and method on obtaining the missing reviews, then a new analysis report can be made that will improve upon the results of this one.