Profitability of Programming Books

In this project, we analyze the dataset containing sales and rating information of programming books being sold by the company. We are given the task to find the most profitable book title.

Objective

The objective of this project is to find the most profitable book title being sold by the company. To determine profitability, we assess the book titles by two metrics, sales volume and revenue.

Summary of Results

After the data analysis process is conducted, we found the two most profitable book titles being sold by the company, assessed by two different metrics:

  • In terms of sales volume, Fundamental of R For Beginners is the most profitable book title.
  • In terms of revenue, Secrets of R For Advanced Students is the most profitable book title.

Note that these results are obtained after removing records where the reviews are missing. For further study, if there is more information and method on obtaining the missing reviews, then a new analysis report can be made that will improve upon the results of this one.

Analysis and Discussion

In this section, we conduct an analysis on the dataset containing the company’s book reviews.

Data Review

In this section, we review some fundamental properties of the dataset.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
reviews = read_csv("book_reviews.csv")
## Rows: 2000 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): book, review, state
## dbl (1): price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
reviews
## # A tibble: 2,000 × 4
##    book                               review    state      price
##    <chr>                              <chr>     <chr>      <dbl>
##  1 R Made Easy                        Excellent TX          20.0
##  2 R For Dummies                      Fair      NY          16.0
##  3 R Made Easy                        Excellent NY          20.0
##  4 R Made Easy                        Poor      FL          20.0
##  5 Secrets Of R For Advanced Students Great     Texas       50  
##  6 R Made Easy                        <NA>      California  20.0
##  7 R Made Easy                        Great     Florida     20.0
##  8 R Made Easy                        Poor      CA          20.0
##  9 Top 10 Mistakes R Beginners Make   Fair      CA          30.0
## 10 Secrets Of R For Advanced Students Fair      Texas       50  
## # ℹ 1,990 more rows

The book reviews dataset, stored in reviews, has 2,000 records and 4 columns. These columns, and their corresponding datatypes, are:

  • book - is a character/string column containing the book’s titles,
  • review - is a character/string column containing user ratings for the corresponding book,
  • state - is a character/string column containing the users’ state locations, and
  • price - is a numeric/double column containing the book’s price.

Let us check each column to set the initial expectations of this dataset.

distinct(reviews, book, price)
## # A tibble: 5 × 2
##   book                               price
##   <chr>                              <dbl>
## 1 R Made Easy                         20.0
## 2 R For Dummies                       16.0
## 3 Secrets Of R For Advanced Students  50  
## 4 Top 10 Mistakes R Beginners Make    30.0
## 5 Fundamentals of R For Beginners     40.0

For the book and price columns, there are five unique values for both the book titles and their associated prices, which serve as the programming books being sold by the company. These are:

  • R Made Easy priced at 19.99,
  • R For Dummies priced at 15.99,
  • Secrets of R For Advanced Students priced at 50.00,
  • Top 10 Mistakes R Beginners Make priced at 29.99, and
  • Fundamental of R For Beginners priced at 39.99.
unique(reviews["review"])
## # A tibble: 6 × 1
##   review   
##   <chr>    
## 1 Excellent
## 2 Fair     
## 3 Poor     
## 4 Great    
## 5 <NA>     
## 6 Good

For the review column, there are five unique ordinal values which serve as the user rating. These are:

  • Excellent,
  • Great,
  • Good,
  • Fair, and
  • Poor.

Note that there is are also missing values in this column, indicated by the presence of the NA value in the column. We will be dealing with these missing values after we conduct the data review.

unique(reviews["state"])
## # A tibble: 8 × 1
##   state     
##   <chr>     
## 1 TX        
## 2 NY        
## 3 FL        
## 4 Texas     
## 5 California
## 6 Florida   
## 7 CA        
## 8 New York

For the state column, while it appears that there are 8 unique values, there are really only 4 state locations indicated in the column. Note that the complete state name and its associated acronym are included in the column. These unique state locations are:

  • Texas or TX,
  • New York or NY,
  • Florida or FL, and
  • California or CA.

Data Cleaning

In this section, we will accomplish the following tasks:

  • take care of missing values,
  • normalize inconsistent state location names, and
  • transform rating values.

Missing Values: As mentioned in the previous section, there are missing values in the review column. As of now, there is no sufficient information for us to fill in these missing values, and thus we have decided to remove the records where the value in the review column is missing.

reviews_f1 <- reviews %>%
  filter(
    review != is.na(review)
  )
reviews_f1
## # A tibble: 1,794 × 4
##    book                               review    state   price
##    <chr>                              <chr>     <chr>   <dbl>
##  1 R Made Easy                        Excellent TX       20.0
##  2 R For Dummies                      Fair      NY       16.0
##  3 R Made Easy                        Excellent NY       20.0
##  4 R Made Easy                        Poor      FL       20.0
##  5 Secrets Of R For Advanced Students Great     Texas    50  
##  6 R Made Easy                        Great     Florida  20.0
##  7 R Made Easy                        Poor      CA       20.0
##  8 Top 10 Mistakes R Beginners Make   Fair      CA       30.0
##  9 Secrets Of R For Advanced Students Fair      Texas    50  
## 10 R Made Easy                        Great     NY       20.0
## # ℹ 1,784 more rows
unique(reviews_f1["review"])
## # A tibble: 5 × 1
##   review   
##   <chr>    
## 1 Excellent
## 2 Fair     
## 3 Poor     
## 4 Great    
## 5 Good

This copy of the dataset, reviews_f1, now only contains rows with no missing values. There are now only 1,794 records, which means that 206 records are filtered from the new dataset.

Inconsistent state names: As mentioned in the previous section, the values in the state column contains multiple character values which pertain to the same state. For example, California and CA are both present in the column, but the dataset treats these as two distinct and separate values. We need to normalize the dataset so that the number of unique state names reflects the number of unique state locations.

reviews_f2 <- reviews_f1 %>%
  mutate(
    state = case_when(
      state == "CA" ~ "California",
      state == "NY" ~ "New York",
      state == "TX" ~ "Texas",
      state == "FL" ~ "Florida",
      .default = state
    )
  )
reviews_f2
## # A tibble: 1,794 × 4
##    book                               review    state      price
##    <chr>                              <chr>     <chr>      <dbl>
##  1 R Made Easy                        Excellent Texas       20.0
##  2 R For Dummies                      Fair      New York    16.0
##  3 R Made Easy                        Excellent New York    20.0
##  4 R Made Easy                        Poor      Florida     20.0
##  5 Secrets Of R For Advanced Students Great     Texas       50  
##  6 R Made Easy                        Great     Florida     20.0
##  7 R Made Easy                        Poor      California  20.0
##  8 Top 10 Mistakes R Beginners Make   Fair      California  30.0
##  9 Secrets Of R For Advanced Students Fair      Texas       50  
## 10 R Made Easy                        Great     New York    20.0
## # ℹ 1,784 more rows
unique(reviews_f2["state"])
## # A tibble: 4 × 1
##   state     
##   <chr>     
## 1 Texas     
## 2 New York  
## 3 Florida   
## 4 California

This copy of the dataset, reviews_f2, now contains normalized state location names in the state column.

Transforming ordinal values: As established in the previous section, the review column contains the user rating for the associated title. It is immediately apparent that there is a rank amongst the unique values in this column. In order to perform a more comprehensive analysis on the user ratings, we must convert these text ratings into numerical ratings.

reviews_t1 <- reviews_f2 %>%
  mutate(
    review = case_when(
      review == "Poor" ~ 1,
      review == "Fair" ~ 2,
      review == "Good" ~ 3,
      review == "Great" ~ 4,
      review == "Excellent" ~ 5
    ),
    is_review_high = case_when(
      review >= 4 ~ T,
      .default = F
    )
  )
reviews_t1
## # A tibble: 1,794 × 5
##    book                               review state      price is_review_high
##    <chr>                               <dbl> <chr>      <dbl> <lgl>         
##  1 R Made Easy                             5 Texas       20.0 TRUE          
##  2 R For Dummies                           2 New York    16.0 FALSE         
##  3 R Made Easy                             5 New York    20.0 TRUE          
##  4 R Made Easy                             1 Florida     20.0 FALSE         
##  5 Secrets Of R For Advanced Students      4 Texas       50   TRUE          
##  6 R Made Easy                             4 Florida     20.0 TRUE          
##  7 R Made Easy                             1 California  20.0 FALSE         
##  8 Top 10 Mistakes R Beginners Make        2 California  30.0 FALSE         
##  9 Secrets Of R For Advanced Students      2 Texas       50   FALSE         
## 10 R Made Easy                             4 New York    20.0 TRUE          
## # ℹ 1,784 more rows
distinct(reviews_t1, review, is_review_high)
## # A tibble: 5 × 2
##   review is_review_high
##    <dbl> <lgl>         
## 1      5 TRUE          
## 2      2 FALSE         
## 3      1 FALSE         
## 4      4 TRUE          
## 5      3 FALSE

This copy of the dataset, reviews_t1, now contains numerical user ratings for the book titles, as well as the indication of whether or not the given user rating is high or low.

Ranking Profitability by Number of Sales and Revenue

In this section, we will determine the “most profitable” book title in terms of two metrics: the number of sales and the revenue.

Profitability by sales volume: In order to obtain the number of sales for each book title, we simply have to count the number of books sold for each title.

sales_volume <- reviews_t1 %>%
  count(
    book, 
    sort = T,
    name = "Number of Sales"
  )
sales_volume
## # A tibble: 5 × 2
##   book                               `Number of Sales`
##   <chr>                                          <int>
## 1 Fundamentals of R For Beginners                  366
## 2 R For Dummies                                    361
## 3 Secrets Of R For Advanced Students               360
## 4 Top 10 Mistakes R Beginners Make                 355
## 5 R Made Easy                                      352

In terms of sales volume, Fundamental of R For Beginners is the most profitable book title.

Profitability by revenue: In order to obtain the revenue for each book title, we simply have to sum the price sold for each book title.

revenue <- reviews_t1 %>%
  group_by(book) %>%
  summarise(Revenue = sum(price)) %>%
  arrange(desc(Revenue))
revenue
## # A tibble: 5 × 2
##   book                               Revenue
##   <chr>                                <dbl>
## 1 Secrets Of R For Advanced Students  18000 
## 2 Fundamentals of R For Beginners     14636.
## 3 Top 10 Mistakes R Beginners Make    10646.
## 4 R Made Easy                          7036.
## 5 R For Dummies                        5772.

In terms of revenue, Secrets of R For Advanced Students is the most profitable book title.

Conclusion

In this project, we analyzed the book reviews dataset given by the company selling programming books. Before beginning the data analysis process, we established the main objective of finding the most profitable book title. In order to do so, we conducted a data preparation process which consists of reviewing the basic information about the dataset, taking care of missing values, and transforming the dataset by normalizing inconsistent values.

Once the dataset is prepared for analysis, we found the two most profitable book titles being sold by the company, assessed by two different metrics:

  • In terms of sales volume, Fundamental of R For Beginners is the most profitable book title.
  • In terms of revenue, Secrets of R For Advanced Students is the most profitable book title.

Note that these results are obtained after removing records where the reviews are missing. For further study, if there is more information and method on obtaining the missing reviews, then a new analysis report can be made that will improve upon the results of this one.