Today we are analyzing a dataset of book reviews. The goal is to work out, based on reviews and prices, which book is the most profitable for the company.
#Loading the dataset
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
book_reviews <- read.csv("book_reviews.csv")
Step 1: Getting Familiar With the Data
#Checking the dimensions of the dataset (rows, columns)
dim(book_reviews)
## [1] 2000 4
#Previewing the column names of the dataset
colnames(book_reviews)
## [1] "book" "review" "state" "price"
We can see that there are 4 columns in this dataset called book, review, state and price. There are 2000 rows of data.
#Determining the type of data of each column
for (col_name in colnames(book_reviews)) {
  type_of_data <- typeof(book_reviews[[col_name]])
  print(type_of_data)
}
## [1] "character"
## [1] "character"
## [1] "character"
## [1] "double"
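The same check can be written without a loop. The sketch below is a minimal example using base R's sapply(); toy_reviews is a hypothetical two-row stand-in for book_reviews, not the real data.

```r
# Hypothetical toy data with the same column names as book_reviews
toy_reviews <- data.frame(
  book = c("R Made Easy", "R For Dummies"),
  review = c("Excellent", NA),
  state = c("TX", "New York"),
  price = c(19.99, 15.99),
  stringsAsFactors = FALSE
)

# Apply typeof() to every column; the result is a character vector
# named after the columns
column_types <- sapply(toy_reviews, typeof)
print(column_types)
```

Because the result is named, each type is printed next to the column it belongs to, which the loop version does not do.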
# What are the unique values in each column?
# (the loop variable is named col_name to avoid shadowing base::c)
for (col_name in colnames(book_reviews)) {
  print("Unique values in the column:")
  print(col_name)
  print(unique(book_reviews[[col_name]]))
  print("")
}
## [1] "Unique values in the column:"
## [1] "book"
## [1] "R Made Easy" "R For Dummies"
## [3] "Secrets Of R For Advanced Students" "Top 10 Mistakes R Beginners Make"
## [5] "Fundamentals of R For Beginners"
## [1] ""
## [1] "Unique values in the column:"
## [1] "review"
## [1] "Excellent" "Fair" "Poor" "Great" NA "Good"
## [1] ""
## [1] "Unique values in the column:"
## [1] "state"
## [1] "TX" "NY" "FL" "Texas" "California"
## [6] "Florida" "CA" "New York"
## [1] ""
## [1] "Unique values in the column:"
## [1] "price"
## [1] 19.99 15.99 50.00 29.99 39.99
## [1] ""
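This lets us see the range of values in each column. Two problems stand out: the review column contains missing values (NA), and the state column mixes abbreviations with full state names.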
Step 2: Handling Missing Data
Some rows have a missing review, so we remove them before going further.
#Filtering out missing book reviews
filtered_book_reviews <- book_reviews %>%
  filter(!is.na(review))
dim(filtered_book_reviews)
## [1] 1794 4
We removed 206 rows (2000 - 1794) that had no review.
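Before dropping rows, it can help to confirm which columns actually contain missing values. The sketch below is a minimal base R example; toy_reviews is a hypothetical three-row stand-in for book_reviews.

```r
# Hypothetical toy data; only the review column has a missing value
toy_reviews <- data.frame(
  book = c("R Made Easy", "R For Dummies", "R Made Easy"),
  review = c("Excellent", NA, "Poor"),
  price = c(19.99, 15.99, 19.99),
  stringsAsFactors = FALSE
)

# Count missing values per column
na_counts <- colSums(is.na(toy_reviews))
print(na_counts)

# Keep only rows with a review, then count how many rows were dropped
complete_reviews <- toy_reviews[!is.na(toy_reviews$review), ]
rows_removed <- nrow(toy_reviews) - nrow(complete_reviews)
print(rows_removed)
```

On the real data, the same subtraction is how the number of removed rows follows from the two dim() outputs.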
Step 3: Dealing With Inconsistent Labels
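The state column is inconsistent: some rows use the postal abbreviation (for example "TX") and others use the full name ("Texas"), so the same state appears under two labels.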
#Converting all state abbreviations to full state names
full_state_book_reviews <- filtered_book_reviews %>%
  mutate(
    state = case_when(
      state == "TX" ~ "Texas",
      state == "NY" ~ "New York",
      state == "FL" ~ "Florida",
      state == "CA" ~ "California",
      TRUE ~ state
    )
  )
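An alternative to one case_when() branch per state is a named lookup vector, which is easier to extend if more states appear. This is a minimal base R sketch, not the approach used above; state_lookup and toy_states are hypothetical names introduced for illustration.

```r
# Hypothetical lookup: abbreviation -> full state name
state_lookup <- c(TX = "Texas", NY = "New York", FL = "Florida", CA = "California")

# Hypothetical toy column mixing abbreviations and full names
toy_states <- c("TX", "Texas", "NY", "California", "FL", "CA")

# Replace only the entries that are known abbreviations
is_abbrev <- toy_states %in% names(state_lookup)
full_states <- toy_states
full_states[is_abbrev] <- state_lookup[toy_states[is_abbrev]]
full_states <- unname(full_states)

# After recoding, each state should appear under a single label
print(sort(unique(full_states)))
```

Checking unique() after the recode is a quick way to confirm that no abbreviation was missed.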
Step 4: Transforming the Review Data
The review column is stored as text, but a numerical score is easier to work with because it lets us compute averages and compare ratings.
#Converting the review text into a numerical score from 1 (Poor) to 5 (Excellent)
review_num <- full_state_book_reviews %>%
  mutate(
    review = case_when(
      review == "Poor" ~ 1,
      review == "Fair" ~ 2,
      review == "Good" ~ 3,
      review == "Great" ~ 4,
      review == "Excellent" ~ 5
    )
  )
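Since the five labels form an ordered scale, the same mapping can be sketched with match() against the ordered labels. This is a minimal base R illustration rather than the method used above; review_levels and toy_review are hypothetical names.

```r
# Labels in ascending order, so position 1 = "Poor" and position 5 = "Excellent"
review_levels <- c("Poor", "Fair", "Good", "Great", "Excellent")

# Hypothetical toy reviews
toy_review <- c("Excellent", "Fair", "Poor", "Great", "Good")

# match() returns the position of each review in review_levels
review_score <- match(toy_review, review_levels)
print(review_score)

# A review counts as high when the score is 4 or 5
is_high <- review_score >= 4
print(is_high)
```

A label that is not in review_levels would come back as NA, which makes unexpected values easy to spot.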
#Creating a new column to flag high reviews (a score of 4 or 5)
#The column is named is_high_review so it is not confused with the review_num data frame
is_review_high <- review_num %>%
  mutate(
    is_high_review = review >= 4
  )
head(is_review_high)
##                                 book review    state price is_high_review
## 1                        R Made Easy      5    Texas 19.99           TRUE
## 2                      R For Dummies      2 New York 15.99          FALSE
## 3                        R Made Easy      5 New York 19.99           TRUE
## 4                        R Made Easy      1  Florida 19.99          FALSE
## 5 Secrets Of R For Advanced Students      4    Texas 50.00           TRUE
## 6                        R Made Easy      4  Florida 19.99           TRUE
The new data frame is much cleaner and ready for analysis.
Step 5: Analyzing the Data
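The dataset has no cost information, so we treat the most profitable book as the one that brought in the most money: we count each row as one sale and sum the prices per book.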
profit_books <- is_review_high %>%
  group_by(book) %>%
  summarise(
    profit = sum(price),
    avg_review = mean(review),
    price = mean(price)
  ) %>%
  arrange(desc(profit))
## `summarise()` ungrouping output (override with `.groups` argument)
print(profit_books)
## # A tibble: 5 x 4
##   book                               profit avg_review price
##   <chr>                               <dbl>      <dbl> <dbl>
## 1 Secrets Of R For Advanced Students 18000        2.96  50
## 2 Fundamentals of R For Beginners    14636.       3.01  40.0
## 3 Top 10 Mistakes R Beginners Make   10646.       3.05  30.0
## 4 R Made Easy                         7036.       2.97  20.0
## 5 R For Dummies                       5772.       2.83  16.0
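The ranking above depends on both price and the number of rows per book. The minimal base R sketch below shows the same sum-per-group idea on hypothetical toy data (toy_sales, "Book A" and "Book B" are made up for illustration) and assumes, as the analysis does, that each row is one sale.

```r
# Hypothetical toy sales: Book A is pricier but has fewer rows
toy_sales <- data.frame(
  book = c("Book A", "Book B", "Book A", "Book B", "Book B"),
  price = c(50, 20, 50, 20, 20)
)

# Total money brought in per book (sum of prices within each group)
revenue <- tapply(toy_sales$price, toy_sales$book, sum)

# Number of rows (sales) per book
copies <- table(toy_sales$book)

# Rank books from highest to lowest total
revenue_sorted <- sort(revenue, decreasing = TRUE)
print(revenue_sorted)
print(copies)
```

In this toy case the pricier book ranks first despite having fewer sales, which is why it is worth looking at the total rather than the row count alone.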
Conclusion
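We can see that the most profitable book is Secrets Of R For Advanced Students, with total sales of 18000. The average review is close to 3 for every book (between 2.83 and 3.05), so the ranking is driven by price and sales volume rather than by review scores.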