Creating An Efficient Data Analysis Workflow

by Dr. Michael Ige

Introduction

This is my second guided project in the Dataquest data analyst in R training. The project centers on a company that sells books for learning programming. The company has produced multiple books, and each has received many reviews. My role as a data analyst for the company is to check the sales data and see whether any useful information can be extracted.

Data

Source:

The data for this project was downloaded from: https://dq-content.s3.amazonaws.com/498/book_reviews.csv

The data contains the titles, reviews, states, and prices for all the books.

Import Data

The CSV data will be imported and stored as “book_reviews”.

library(readr)
book_reviews <- read_csv("C:/Users/babao/Desktop/R_wd/Practice Data/book_reviews.csv")

Data Structures

We shall get acquainted with the data by checking its size, its column types, and any missing values, and ensure it is free of anything that might interfere with our analysis.

Dimensions of the data

dim(book_reviews)

## [1] 2000    4

The data has a total of 2000 rows and 4 columns.

Data column names

colnames(book_reviews)

## [1] "book"   "review" "state"  "price"

Data Column Descriptions

book: The title of the book
review: The review rating
state: The US state where the review was made
price: The selling price of the book

Data column types

typeof(book_reviews)

## [1] "list"
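The result is "list" because typeof() reports the internal storage type, and a data frame is stored as a list of columns. A minimal sketch of checking each column's type instead, using a small made-up data frame (the name toy_reviews and its values are illustrative and not taken from the real file):

```r
# Toy stand-in for book_reviews; the values are invented for illustration.
toy_reviews <- data.frame(
  book   = c("R Made Easy", "R For Dummies"),
  review = c("Excellent", NA),
  state  = c("TX", "New York"),
  price  = c(19.99, 15.99),
  stringsAsFactors = FALSE
)

# A data frame is stored internally as a list of equal-length columns.
print(typeof(toy_reviews))

# Ask each column for its own storage type instead.
column_types <- sapply(toy_reviews, typeof)
print(column_types)
```

The str() call that follows gives the same per-column information together with a preview of the values.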

str(book_reviews)

## spec_tbl_df [2,000 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ book  : chr [1:2000] "R Made Easy" "R For Dummies" "R Made Easy" "R Made Easy" ...
##  $ review: chr [1:2000] "Excellent" "Fair" "Excellent" "Poor" ...
##  $ state : chr [1:2000] "TX" "NY" "NY" "FL" ...
##  $ price : num [1:2000] 20 16 20 20 50 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   book = col_character(),
##   ..   review = col_character(),
##   ..   state = col_character(),
##   ..   price = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Data Cleaning

Find the unique values present in columns

# Print the distinct values found in each column
for (col_name in colnames(book_reviews)) {
  print(col_name)
  print(unique(book_reviews[[col_name]]))
  print("")
}

## [1] "book"
## [1] "R Made Easy"                        "R For Dummies"                     
## [3] "Secrets Of R For Advanced Students" "Top 10 Mistakes R Beginners Make"  
## [5] "Fundamentals of R For Beginners"   
## [1] ""
## [1] "review"
## [1] "Excellent" "Fair"      "Poor"      "Great"     NA          "Good"     
## [1] ""
## [1] "state"
## [1] "TX"         "NY"         "FL"         "Texas"      "California"
## [6] "Florida"    "CA"         "New York"  
## [1] ""
## [1] "price"
## [1] 19.99 15.99 50.00 29.99 39.99
## [1] ""
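The listing above shows NA only among the review values. One way to confirm where the missing values sit is to count them per column. The sketch below does this in base R on a small made-up data frame (toy_reviews is hypothetical, not the real data):

```r
# Toy stand-in for book_reviews; the values are invented for illustration.
toy_reviews <- data.frame(
  book   = c("R Made Easy", "R For Dummies", "R Made Easy", "R For Dummies"),
  review = c("Excellent", NA, "Poor", NA),
  state  = c("TX", "NY", "Florida", "CA"),
  price  = c(19.99, 15.99, 19.99, 15.99),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUEs in each column.
missing_per_column <- colSums(is.na(toy_reviews))
print(missing_per_column)

# Number of rows that have at least one missing value.
rows_with_missing <- sum(!complete.cases(toy_reviews))
print(rows_with_missing)
```

Run against the real data, the same two calls should indicate how many rows a filter on missing values will drop.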

Remove missing data

library(dplyr)

# Keep only the rows with no missing value in any column
new_book_reviews <- book_reviews %>%
  filter(
    !is.na(review),
    !is.na(book),
    !is.na(state),
    !is.na(price)
  )
dim(new_book_reviews)

## [1] 1794    4

All 206 rows with missing data have been removed, leaving 1794 rows.

Make the “state” labels uniform by converting full state names to postal codes and keeping values that are already postal codes

new_book_reviews <- new_book_reviews %>% 
  mutate(
    state = case_when(
      state == "California" ~ "CA",
      state == "New York" ~ "NY",
      state == "Texas" ~ "TX",
      state == "Florida" ~ "FL",
      TRUE ~ state 
    )
  )
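The same recoding can also be expressed with a named lookup vector, which keeps the mapping in one place if more states appear later. This is a sketch in base R, not the method used above; state_lookup and the sample states vector are illustrative names and values:

```r
# Lookup from full state name to postal code (the four states seen in the data).
state_lookup <- c(
  "California" = "CA",
  "New York"   = "NY",
  "Texas"      = "TX",
  "Florida"    = "FL"
)

# Illustrative input mixing full names and postal codes.
states <- c("TX", "Texas", "New York", "CA", "Florida", "NY")

# Replace full names via the lookup; values that are already codes are kept.
is_full_name <- states %in% names(state_lookup)
clean_states <- states
clean_states[is_full_name] <- unname(state_lookup[states[is_full_name]])
print(clean_states)
```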

Convert the “review” rating to numerical form

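# Note: this pipeline starts from book_reviews, not the cleaned new_book_reviews,
# so the result again contains all 2000 rows, including those with missing reviews.
new_book_reviews <- book_reviews %>%
  mutate(
    # Map each text rating to a 1-5 score; a missing review stays NA
    review_num = case_when(
      review == "Poor" ~ 1,
      review == "Fair" ~ 2,
      review == "Good" ~ 3,
      review == "Great" ~ 4,
      review == "Excellent" ~ 5
    ),
    # Flag ratings of "Great" (4) or "Excellent" (5)
    is_high_review = review_num >= 4
  )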
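Because the five ratings form an ordered scale, the numeric score is simply the position of each rating in that scale. A minimal base R sketch of that idea, using match() on a made-up reviews vector (rating_order and reviews are illustrative names):

```r
# Ratings listed from worst to best; the position in this vector is the score.
rating_order <- c("Poor", "Fair", "Good", "Great", "Excellent")

# Illustrative input, including one missing review.
reviews <- c("Excellent", "Fair", NA, "Great", "Poor")

# match() returns the index of each review in rating_order, or NA if absent.
review_num <- match(reviews, rating_order)
print(review_num)

# A comparison already yields TRUE/FALSE (and NA for a missing score).
is_high_review <- review_num >= 4
print(is_high_review)
```

An unexpected label, such as a misspelled rating, would also come out as NA here, which makes it easy to spot with is.na().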

Most Profitable Book

The data contains no cost information, so to find the most profitable book we shall use gross sales revenue (number of books sold × price) as a proxy for profit, treating each row as one book sold.

new_book_reviews %>%
  group_by(book) %>%
  summarize(
    number_sold = n(),               # one row per book sold
    sales_price = mean(price),
    gross_sales = n() * mean(price)  # units sold times price
  ) %>%
  arrange(desc(gross_sales))

## # A tibble: 5 × 4
##   book                               number_sold sales_price gross_sales
##   <chr>                                    <int>       <dbl>       <dbl>
## 1 Secrets Of R For Advanced Students         406        50        20300 
## 2 Fundamentals of R For Beginners            410        40.0      16396.
## 3 Top 10 Mistakes R Beginners Make           385        30.0      11546.
## 4 R Made Easy                                389        20.0       7776.
## 5 R For Dummies                              410        16.0       6556.

Under this assumption, the most profitable book is Secrets Of R For Advanced Students, with gross sales of 20,300.
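Multiplying the number of rows by the mean price gives the same figure as adding up the prices directly, since the mean is the sum divided by the count. A small base R check of that arithmetic on made-up data (toy_sales and the book names "A" and "B" are hypothetical):

```r
# Toy sales data: one row per book sold; the values are invented for illustration.
toy_sales <- data.frame(
  book  = c("A", "A", "B", "B", "B"),
  price = c(20, 20, 50, 50, 50)
)

# Revenue as the sum of prices per book.
revenue_sum <- tapply(toy_sales$price, toy_sales$book, sum)

# Revenue as count times mean price, the formula used in the summary above.
revenue_count_mean <- tapply(toy_sales$price, toy_sales$book, length) *
  tapply(toy_sales$price, toy_sales$book, mean)

print(revenue_sum)
print(all.equal(revenue_sum, revenue_count_mean))
```

Summing the prices is the more direct form and would stay correct even if a book's price varied between rows.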