DataQuest

Guided Project #2: Creating An Efficient Data Analysis Workflow

In this guided project, we will be acting as a data analyst for a company that sells books for learning programming. Your company has produced multiple books, and each has received many reviews. Your company wants we to check out the sales data and see if we can extract any useful information from it. We’ll walk through this process as we progress through the mission.

The dataset we are using can be found here.

It’s easy to lose context when we’re just talking about data analysis in general. The first thing we should do before we do any analysis is to get acquainted with our dataset. There are many, many things to check with a dataset before we dive into any analysis. How much data is there? What kind of data do we actually have on hand? Is there anything “weird” that might interfere with any analyses we might need to do? Is there missing data? Answering these questions now saves we time and effort later.

If we don’t check the data beforehand, it’s easy to make some false assumptions about the data that can hinder our progress later. Maybe we think that one column looks like a number, but it’s actually been read in as a string. Perhaps some things were misspelled. In any case, getting familiar with the data is our first step in the data analysis workflow.

To begin with, let’s get the data in and see how big the file is

setwd("C:/Users/garyd/Dropbox/Learning/DataQuest/R Guided Projects")
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

books <- read.csv("book_reviews.csv")

dim(books)

## [1] 2000    4

We have 4 columns and 2000 rows.

Next, we will review the column names:

(colnames(books))

## [1] "book"   "review" "state"  "price"

The four column name are “book”, “review,”state" and “price”.

Initial assumptions are that we have the book title, a review score or description, the state (condition) of the book or the area of purchase/review, and the price paid for the book.

Hmmm, it’s not entirely clear what data type is in each column and what each contains.

We’ll use the typeof function, within a for loop, to review each column:

book_types <- c()

for (title in colnames(books)) {
  type <- typeof(books[[title]])
  book_types <- c(book_types, type)
}

book_types

## [1] "character" "character" "character" "double"

All of the columns are of type ‘character’ (string), except for price which is ‘double’ (numeric).

A simpler of way of performing all of the above would have been to use the glimpse function:

glimpse(books)

## Rows: 2,000
## Columns: 4
## $ book   <chr> "R Made Easy", "R For Dummies", "R Made Easy", "R Made Easy"...
## $ review <chr> "Excellent", "Fair", "Excellent", "Poor", "Great", NA, "Grea...
## $ state  <chr> "TX", "NY", "NY", "FL", "Texas", "California", "Florida", "C...
## $ price  <dbl> 19.99, 15.99, 19.99, 19.99, 50.00, 19.99, 19.99, 19.99, 29.9...

This confirms some of the assumptions made earlier, “book” is the relevant title, “review” is a string rating, “state” is not the book condition but covers the states of America, and “price” is the value of the book.

We can also see that the book title “R Made Easy” appears multiple times in the book titles, so there must be multiple entries per book. So, with that in mind, we will run a for loop over the content and review the unique entries of each column:

for (col in colnames(books)){
  print("Unique values in the column: ")
  print(col)
  print(unique(books[[col]]))
  print("")
  }

## [1] "Unique values in the column: "
## [1] "book"
## [1] "R Made Easy"                        "R For Dummies"                     
## [3] "Secrets Of R For Advanced Students" "Top 10 Mistakes R Beginners Make"  
## [5] "Fundamentals of R For Beginners"   
## [1] ""
## [1] "Unique values in the column: "
## [1] "review"
## [1] "Excellent" "Fair"      "Poor"      "Great"     NA          "Good"     
## [1] ""
## [1] "Unique values in the column: "
## [1] "state"
## [1] "TX"         "NY"         "FL"         "Texas"      "California"
## [6] "Florida"    "CA"         "New York"  
## [1] ""
## [1] "Unique values in the column: "
## [1] "price"
## [1] 19.99 15.99 50.00 29.99 39.99
## [1] ""

unique(books[['book']])

## [1] "R Made Easy"                        "R For Dummies"                     
## [3] "Secrets Of R For Advanced Students" "Top 10 Mistakes R Beginners Make"  
## [5] "Fundamentals of R For Beginners"

We can see the book titles are indeed duplicated - across the 2000 rows there are only 5 unique book titles. The query also reveals the range of the review scores (including NA entries), the American states from which the reviews have been collated, some being abbreviated and others having the full title, and 5 different prices for the books.

Data Cleansing

From the previous exercise, we can see the the “review” column has missing values in the form of NA. Before we proceed we will create a new variable that filters out the NA rows:

complete_books <- books %>%
                    filter(!(is.na(review)))

nrow(complete_books)

## [1] 1794

The complete_books variable now has a total of 1,794 rows, therefore we have removed 206 rows that did not contain a review entry. This is approximately 10% and, whilst it’s not ideal, it’s not enough to stop progress.

Next, let’s tidy up the “state” column as there are postal abbreviations and long names in there. The plan will be to go with the abbreviations so therefore we will need to convert Texas entries to TX, New York entries to NY, Florida entries to FL and California entries to CA.

complete_books <- complete_books %>%
  mutate(
    state = case_when(
      state == "New York" ~ "NY",
      state == "Florida" ~ "FL",
      state == "Texas" ~ "TX",
      state == "California" ~ "CA",
      TRUE ~ state  # retain remaining entries that are already abbreviated
      )
  )

print(unique(complete_books$state))

## [1] "TX" "NY" "FL" "CA"

Great, we can see now that only the abbreviated entries are returned in the “state” column.

Next, we’ll turn our attention to the “review” column and look to metricise the string results we have, plus also categorise higher review scores:

complete_books <- complete_books %>%
  mutate(
    score = case_when (
      review == "Poor" ~ 1,
      review == "Fair" ~ 2,
      review == "Good" ~ 3,
      review == "Great" ~ 4,
      review == "Excellent" ~ 5
    ),
    is_high_review = if_else(score >= 4,TRUE,FALSE)
  )

glimpse(complete_books)

## Rows: 1,794
## Columns: 6
## $ book           <chr> "R Made Easy", "R For Dummies", "R Made Easy", "R Ma...
## $ review         <chr> "Excellent", "Fair", "Excellent", "Poor", "Great", "...
## $ state          <chr> "TX", "NY", "NY", "FL", "TX", "FL", "CA", "CA", "TX"...
## $ price          <dbl> 19.99, 15.99, 19.99, 19.99, 50.00, 19.99, 19.99, 29....
## $ score          <dbl> 5, 2, 5, 1, 4, 4, 1, 2, 2, 4, 2, 3, 5, 3, 2, 4, 1, 5...
## $ is_high_review <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, ...

Analysing the results

The goal was to review anything interesting in the sales data. Ideally this would equate to profitability, however without an indication of the purchase price per book then there is no clear approach to this.

Therefore we will look at three different results: number of books sold, amount of sales and average cost of sales:

# number of books sols
book_sales <- complete_books %>%
  group_by(book) %>%
  summarise(
    sales = n(),
    revenue = sum(price),
    avg_sales = mean(revenue/sales)
  ) %>%
  arrange(-avg_sales)

## `summarise()` ungrouping output (override with `.groups` argument)

head(book_sales)

## # A tibble: 5 x 4
##   book                               sales revenue avg_sales
##   <chr>                              <int>   <dbl>     <dbl>
## 1 Secrets Of R For Advanced Students   360  18000       50  
## 2 Fundamentals of R For Beginners      366  14636.      40.0
## 3 Top 10 Mistakes R Beginners Make     355  10646.      30.0
## 4 R Made Easy                          352   7036.      20.0
## 5 R For Dummies                        361   5772.      16.0

Conclusion

Reviewing the returned results we can see that the number of sales is similar for all books, with the minimum being 352 sales for “R Made Easy” and the maximum being 366 for “Fundamentals of R For Beginners”.

The revenue from each book is much more disparate, with “Secrets Of R For Advanced Students” bringing in the most revenue of $18,000.00 and also having the highest average sales price of $50.00

If one was to work through the book titles from lowest revenue through to highest, the lower revenue books appear to be the entry level books into R, with revenue increasing along with the difficulty level contained within the book. This is an observation and would require further data for analysis.

DataQuest_GP#2

Gary Dougherty

12/01/2021

Guided Project #2: Creating An Efficient Data Analysis Workflow

Data Cleansing

Analysing the results

Conclusion