We have a dataset of book reviews to analyze. The goal is to work out, from the reviews and prices, which book is the most profitable for the company.

Guided Project Solutions: Creating An Efficient Data Analysis Workflow

#Loading the dataset
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
book_reviews <- read.csv("book_reviews.csv")

Step 1: Getting Familiar With the Data

#Checking the dimensions (rows, columns) of the dataset
dim(book_reviews)
## [1] 2000    4
#Previewing the column names of the dataset
colnames(book_reviews)
## [1] "book"   "review" "state"  "price"

We can see that there are 4 columns in this dataset called book, review, state and price. There are 2000 rows of data.

#Determining the type of data of each column
for (col_name in colnames(book_reviews)) {
  # typeof() reports the storage type of the column
  print(typeof(book_reviews[[col_name]]))
}
## [1] "character"
## [1] "character"
## [1] "character"
## [1] "double"
#What are the unique values in each column?
#The loop variable is named col_name so that it does not shadow base R's c()
for (col_name in colnames(book_reviews)) {
  print("Unique values in the column:")
  print(col_name)
  print(unique(book_reviews[[col_name]]))
  print("")
}
## [1] "Unique values in the column:"
## [1] "book"
## [1] "R Made Easy"                        "R For Dummies"                     
## [3] "Secrets Of R For Advanced Students" "Top 10 Mistakes R Beginners Make"  
## [5] "Fundamentals of R For Beginners"   
## [1] ""
## [1] "Unique values in the column:"
## [1] "review"
## [1] "Excellent" "Fair"      "Poor"      "Great"     NA          "Good"     
## [1] ""
## [1] "Unique values in the column:"
## [1] "state"
## [1] "TX"         "NY"         "FL"         "Texas"      "California"
## [6] "Florida"    "CA"         "New York"  
## [1] ""
## [1] "Unique values in the column:"
## [1] "price"
## [1] 19.99 15.99 50.00 29.99 39.99
## [1] ""

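Listing the unique values shows the range of each column and exposes two problems: the review column contains missing values (NA), and the state column mixes abbreviations with full state names.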
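
The two loops above can also be written without explicit loops. The following is a minimal base R sketch, assuming a small made-up data frame (`toy_reviews` and its rows are hypothetical stand-ins, not the real book_reviews data):

```r
# Hypothetical toy data standing in for book_reviews
toy_reviews <- data.frame(
  book   = c("R Made Easy", "R For Dummies", "R Made Easy"),
  review = c("Excellent", NA, "Poor"),
  state  = c("TX", "New York", "NY"),
  price  = c(19.99, 15.99, 19.99),
  stringsAsFactors = FALSE
)

# Storage type of every column, returned as a named character vector
column_types <- sapply(toy_reviews, typeof)
print(column_types)

# Unique values of every column, returned as a named list
unique_values <- lapply(toy_reviews, unique)
print(unique_values$state)
```

Because the results are named by column, it is easy to tell which type or set of values belongs to which column.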

Step 2: Handling Missing Data

Some rows have a missing review, so we remove them before analyzing the data.

#Filtering out missing book reviews
filtered_book_reviews <- book_reviews %>%
  filter(!is.na(review))

dim(filtered_book_reviews)
## [1] 1794    4

We removed 206 rows (2000 - 1794), about 10% of the data, because their review was missing.
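
One way to confirm that the number of dropped rows matches the number of missing reviews is to count the NA values per column first. This is a minimal base R sketch on made-up data (`toy_reviews` is a hypothetical stand-in for the real dataset):

```r
# Hypothetical toy data with two missing reviews
toy_reviews <- data.frame(
  book   = c("R Made Easy", "R For Dummies", "R Made Easy", "R For Dummies"),
  review = c("Excellent", NA, "Poor", NA),
  price  = c(19.99, 15.99, 19.99, 15.99),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_reviews))
print(na_counts)

# Keep only rows with a non-missing review (base R counterpart of filter())
toy_filtered <- toy_reviews[!is.na(toy_reviews$review), ]

# The number of rows dropped should equal the number of missing reviews
rows_dropped <- nrow(toy_reviews) - nrow(toy_filtered)
print(rows_dropped)
```

Counting per column also shows whether any other column has missing values that the filter on review would not catch.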

Step 3: Dealing With Inconsistent Labels

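The state column labels the same states inconsistently: some rows use abbreviations (TX, NY, FL, CA) and others use full names (Texas, New York, Florida, California).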

#Converting all state abbreviations to full state names
full_state_book_reviews <- filtered_book_reviews %>%
  mutate(
    state = case_when(
      state == "TX" ~ "Texas",
      state == "NY" ~ "New York",
      state == "FL" ~ "Florida",
      state == "CA" ~ "California",
      TRUE ~ state
    )
  )
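
The same recoding can be expressed with a lookup table, which may be easier to extend if more states appear. This is a base R sketch of that alternative, not the approach used above; `state_lookup` and `toy_states` are hypothetical names introduced for illustration:

```r
# Named vector mapping each abbreviation to its full state name
state_lookup <- c(TX = "Texas", NY = "New York", FL = "Florida", CA = "California")

# Hypothetical toy input mixing abbreviations and full names
toy_states <- c("TX", "Texas", "NY", "California", "FL")

# Replace values found in the lookup; leave everything else unchanged
is_abbreviation <- toy_states %in% names(state_lookup)
full_states <- toy_states
full_states[is_abbreviation] <- state_lookup[toy_states[is_abbreviation]]
full_states <- unname(full_states)
print(full_states)
```

Values that are not in the lookup are kept as they are, which mirrors the `TRUE ~ state` fallback in the case_when() call.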

Step 4: Transforming the Review Data

The review column is stored as character strings, but a numerical score is easier to summarize, for example by averaging.

#Converting review character into a numerical value
review_num <- full_state_book_reviews %>%
  mutate(
    review = case_when(
      review == "Poor" ~ 1,
      review == "Fair" ~ 2,
      review == "Good" ~ 3,
      review == "Great" ~ 4,
      review == "Excellent" ~ 5
    )
  )
#Creating a new column to determine if review is high or not
#Note: the new logical column is named review_num, the same name as the data frame above
is_review_high <- review_num %>%
  mutate(
    review_num = case_when(
      review <= 3 ~ FALSE,
      review >= 4 ~ TRUE
    )
  )
head(is_review_high)
##                                 book review    state price review_num
## 1                        R Made Easy      5    Texas 19.99       TRUE
## 2                      R For Dummies      2 New York 15.99      FALSE
## 3                        R Made Easy      5 New York 19.99       TRUE
## 4                        R Made Easy      1  Florida 19.99      FALSE
## 5 Secrets Of R For Advanced Students      4    Texas 50.00       TRUE
## 6                        R Made Easy      4  Florida 19.99       TRUE

The resulting data frame is much cleaner and ready for analysis.
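
Because the five review labels have a natural order, the score can also be derived from each label's position in an ordered vector. This is a base R sketch of that alternative, assuming hypothetical names (`review_levels`, `toy_labels`) and made-up input:

```r
# Review labels ordered from worst to best
review_levels <- c("Poor", "Fair", "Good", "Great", "Excellent")

# Hypothetical toy input
toy_labels <- c("Excellent", "Fair", "Poor", "Great", "Good")

# match() returns each label's position in review_levels, i.e. a score from 1 to 5
review_score <- match(toy_labels, review_levels)
print(review_score)

# Logical flag: TRUE for scores of 4 or 5
is_high <- review_score >= 4
print(is_high)
```

Note that match() returns integers, whereas the case_when() call above produces doubles. A label that is not in `review_levels` becomes NA.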

Step 5: Analyzing the Data

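The dataset has no cost information, so we use total sales as a stand-in for profit. Treating each row as one purchase, the most profitable book is the one whose prices sum to the largest total.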

profit_books <- is_review_high %>%
  group_by(book) %>%
  summarise(
    profit = sum(price),
    avg_review = mean(review),
    price = mean(price)
  ) %>%
  arrange(desc(profit))
## `summarise()` ungrouping output (override with `.groups` argument)
print(profit_books)
## # A tibble: 5 x 4
##   book                               profit avg_review price
##   <chr>                               <dbl>      <dbl> <dbl>
## 1 Secrets Of R For Advanced Students 18000        2.96  50  
## 2 Fundamentals of R For Beginners    14636.       3.01  40.0
## 3 Top 10 Mistakes R Beginners Make   10646.       3.05  30.0
## 4 R Made Easy                         7036.       2.97  20.0
## 5 R For Dummies                       5772.       2.83  16.0
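
A total of prices depends on both the price and the number of purchases, so it can help to report the purchase count alongside it. This is a minimal base R sketch on made-up data; `toy_sales`, "Book A" and "Book B" are hypothetical and are not taken from the real dataset:

```r
# Hypothetical toy data: one row per purchase
toy_sales <- data.frame(
  book  = c("Book A", "Book A", "Book B", "Book B", "Book B"),
  price = c(50, 50, 20, 20, 20),
  stringsAsFactors = FALSE
)

# Sum of prices per book (what the summary above calls "profit")
revenue_by_book <- tapply(toy_sales$price, toy_sales$book, sum)

# Number of rows (purchases) per book
purchases_by_book <- tapply(toy_sales$price, toy_sales$book, length)

summary_table <- data.frame(
  book      = names(revenue_by_book),
  revenue   = as.vector(revenue_by_book),
  purchases = as.vector(purchases_by_book),
  stringsAsFactors = FALSE
)

# Sort from highest to lowest revenue
summary_table <- summary_table[order(summary_table$revenue, decreasing = TRUE), ]
print(summary_table)
```

In this toy example the book with fewer purchases still ranks first because of its higher price. In the real output above, a total of 18000 at a price of 50 corresponds to 360 rows for the top book.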

Conclusion

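The most profitable book is Secrets Of R For Advanced Students, with total sales of 18000. Its average review (2.96) is close to that of the other books, so its lead comes mainly from its higher price (50.00).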