Creating An Efficient Data Analysis Workflow

by Dr. Michael Ige

Introduction

This is my second guided project in the Dataquest training for data analysts in R. The project centers on a company that sells books for learning programming. The company has published several books, each with many reviews. My role as a data analyst for the company is to check the sales data and see whether any useful information can be extracted.

Data

Source:

The data for this project was downloaded from https://dq-content.s3.amazonaws.com/498/book_reviews.csv and contains the title, review, state, and price for each book record.

Import Data

library(readr)
book_reviews <- read_csv("C:/Users/babao/Desktop/R_wd/Practice Data/book_reviews.csv")

Data Structures

We start by getting acquainted with the data: its size, its column types, and any missing values, so that nothing interferes with the analysis.

dim(book_reviews)
## [1] 2000    4

The data has a total of 2000 rows and 4 columns.

colnames(book_reviews)
## [1] "book"   "review" "state"  "price"
  1. Book: The title of the book
  2. Review: The review rating
  3. State: The US state associated with the review
  4. Price: The selling price of the book
typeof(book_reviews)
## [1] "list"
str(book_reviews)
## spec_tbl_df [2,000 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ book  : chr [1:2000] "R Made Easy" "R For Dummies" "R Made Easy" "R Made Easy" ...
##  $ review: chr [1:2000] "Excellent" "Fair" "Excellent" "Poor" ...
##  $ state : chr [1:2000] "TX" "NY" "NY" "FL" ...
##  $ price : num [1:2000] 20 16 20 20 50 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   book = col_character(),
##   ..   review = col_character(),
##   ..   state = col_character(),
##   ..   price = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
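
A quick way to see which columns contain missing values, before deciding how to clean them, is to count the NAs per column. The sketch below uses a small made-up data frame (`toy_reviews` is hypothetical and its values are invented for illustration; it is not the project data), so treat it as one possible check rather than the method used in this project.

```r
# Hypothetical toy data standing in for book_reviews (values invented for illustration)
toy_reviews <- data.frame(
  book   = c("R Made Easy", "R For Dummies", "R Made Easy"),
  review = c("Excellent", NA, "Poor"),
  state  = c("TX", "NY", "Florida"),
  price  = c(19.99, 15.99, 19.99),
  stringsAsFactors = FALSE
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy_reviews))
print(na_counts)
```

On the real data, the same idea would be `colSums(is.na(book_reviews))`.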

Data Cleaning

# Print the unique values of every column to spot NAs and inconsistent labels
for (col in colnames(book_reviews)) {
  print(col)
  print(unique(book_reviews[[col]]))
  print("")
}
## [1] "book"
## [1] "R Made Easy"                        "R For Dummies"                     
## [3] "Secrets Of R For Advanced Students" "Top 10 Mistakes R Beginners Make"  
## [5] "Fundamentals of R For Beginners"   
## [1] ""
## [1] "review"
## [1] "Excellent" "Fair"      "Poor"      "Great"     NA          "Good"     
## [1] ""
## [1] "state"
## [1] "TX"         "NY"         "FL"         "Texas"      "California"
## [6] "Florida"    "CA"         "New York"  
## [1] ""
## [1] "price"
## [1] 19.99 15.99 50.00 29.99 39.99
## [1] ""
library(dplyr)
# Keep only the rows with no missing value in any column
new_book_reviews <- book_reviews %>%
  filter(
    !is.na(review),
    !is.na(book),
    !is.na(state),
    !is.na(price)
  )
dim(new_book_reviews)
## [1] 1794    4
new_book_reviews <- new_book_reviews %>% 
  mutate(
    state = case_when(
      state == "California" ~ "CA",
      state == "New York" ~ "NY",
      state == "Texas" ~ "TX",
      state == "Florida" ~ "FL",
      TRUE ~ state 
    )
  )
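# Note: this pipeline starts from the raw book_reviews, so new_book_reviews
# again holds all 2000 rows (NA reviews and unrecoded states included),
# which is what the summary below is computed on. Start from
# new_book_reviews instead to keep the cleaning steps above.
new_book_reviews <- book_reviews %>% 
  mutate(
    # Map the text ratings onto a 1-5 scale; NA reviews stay NA
    review_num = case_when(
      review == "Poor" ~ 1,
      review == "Fair" ~ 2,
      review == "Good" ~ 3,
      review == "Great" ~ 4,
      review == "Excellent" ~ 5
    ),
    # Flag ratings of "Great" or better
    is_high_review = review_num >= 4
  )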
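
The rating-to-number mapping above can also be written as a named lookup vector, which is a base R alternative to `case_when()` rather than the approach used in this project. In the sketch below, `rating_scale` and `toy_ratings` are hypothetical names and the sample ratings are invented for illustration.

```r
# Hypothetical lookup table: text rating -> numeric score
rating_scale <- c(Poor = 1, Fair = 2, Good = 3, Great = 4, Excellent = 5)

# Invented sample ratings, including a missing one
toy_ratings <- c("Excellent", "Fair", NA, "Great")

# Index the lookup vector by name; an NA rating yields an NA score
toy_review_num <- unname(rating_scale[toy_ratings])

# Flag scores of 4 or higher; NA scores stay NA
toy_is_high <- toy_review_num >= 4

print(data.frame(toy_ratings, toy_review_num, toy_is_high))
```

Keeping the scale in one vector makes it easy to reuse or adjust the scoring without editing a chain of conditions.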

Most Profitable Book

To find the most profitable book, we treat each row as one book sold and take the book with the highest gross sales revenue (number of books sold × price) as the most profitable.

new_book_reviews %>%
  group_by(book) %>%
  summarize(
    number_sold = n(),
    sales_price = mean(price),
    gross_sales = n() * mean(price)
  ) %>%
  arrange(desc(gross_sales))
## # A tibble: 5 × 4
##   book                               number_sold sales_price gross_sales
##   <chr>                                    <int>       <dbl>       <dbl>
## 1 Secrets Of R For Advanced Students         406        50        20300 
## 2 Fundamentals of R For Beginners            410        40.0      16396.
## 3 Top 10 Mistakes R Beginners Make           385        30.0      11546.
## 4 R Made Easy                                389        20.0       7776.
## 5 R For Dummies                              410        16.0       6556.
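
As a cross-check on the revenue formula, the number of rows times the mean price within a group is the same quantity as the sum of the prices in that group. The sketch below illustrates this on a small made-up table (`toy_sales` and its values are hypothetical, invented for illustration), so it is a sanity check under those assumptions and not a rerun of the project data.

```r
library(dplyr)

# Hypothetical toy sales data (invented for illustration; not the project data)
toy_sales <- tibble(
  book  = c("A", "A", "B", "B", "B"),
  price = c(50, 50, 20, 20, 20)
)

toy_revenue <- toy_sales %>%
  group_by(book) %>%
  summarize(
    via_mean = n() * mean(price),  # row count times average price, as above
    via_sum  = sum(price)          # direct total of the prices
  ) %>%
  arrange(desc(via_sum))

print(toy_revenue)
```

Both columns agree for every book, so `sum(price)` would be an equivalent and slightly more direct way to compute gross sales.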

By this measure, the most profitable book is Secrets Of R For Advanced Students, with gross sales of 20,300 from 406 copies at a price of 50.