by Dr. Michael Ige
Introduction
This is my second guided project in Dataquest's data analyst in R training. The project centers on a company that sells books for learning programming. The company has published several books, each with many reviews. My role as a data analyst for the company is to examine the sales data and see whether any useful information can be extracted.
Data
Source:
The data for this project was downloaded from https://dq-content.s3.amazonaws.com/498/book_reviews.csv. Each row contains a book title, a review, a state, and a price.
Import Data
library(readr)
book_reviews <- read_csv("C:/Users/babao/Desktop/R_wd/Practice Data/book_reviews.csv")
Data Structures
We start by getting acquainted with the data: its size, its column types, and whether it has missing values or anything else that might interfere with the analysis.
dim(book_reviews)
## [1] 2000 4
The data has a total of 2000 rows and 4 columns.
colnames(book_reviews)
## [1] "book" "review" "state" "price"
typeof(book_reviews)
## [1] "list"
str(book_reviews)
## spec_tbl_df [2,000 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ book : chr [1:2000] "R Made Easy" "R For Dummies" "R Made Easy" "R Made Easy" ...
## $ review: chr [1:2000] "Excellent" "Fair" "Excellent" "Poor" ...
## $ state : chr [1:2000] "TX" "NY" "NY" "FL" ...
## $ price : num [1:2000] 20 16 20 20 50 ...
## - attr(*, "spec")=
## .. cols(
## .. book = col_character(),
## .. review = col_character(),
## .. state = col_character(),
## .. price = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
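One quick way to check for missing data, mentioned above as a goal of this step, is to count the `NA` values in each column. The snippet below is a minimal base R sketch. `toy_reviews` is a small made-up data frame that only mimics the column layout of `book_reviews`; it is not the real data.

```r
# Toy data with the same columns as book_reviews (values are invented for illustration)
toy_reviews <- data.frame(
  book   = c("R Made Easy", "R For Dummies", "R Made Easy", NA),
  review = c("Excellent", NA, "Fair", NA),
  state  = c("TX", "NY", "Texas", "FL"),
  price  = c(19.99, 15.99, 19.99, 15.99),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
missing_per_column <- colSums(is.na(toy_reviews))
print(missing_per_column)
```

Running the same `colSums(is.na(...))` call on `book_reviews` would show which columns need attention before cleaning.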
Data Cleaning
# Print the unique values in each column to spot NAs and inconsistent labels
for (col_name in colnames(book_reviews)) {
  print(col_name)
  print(unique(book_reviews[[col_name]]))
  print("")
}
## [1] "book"
## [1] "R Made Easy" "R For Dummies"
## [3] "Secrets Of R For Advanced Students" "Top 10 Mistakes R Beginners Make"
## [5] "Fundamentals of R For Beginners"
## [1] ""
## [1] "review"
## [1] "Excellent" "Fair" "Poor" "Great" NA "Good"
## [1] ""
## [1] "state"
## [1] "TX" "NY" "FL" "Texas" "California"
## [6] "Florida" "CA" "New York"
## [1] ""
## [1] "price"
## [1] 19.99 15.99 50.00 29.99 39.99
## [1] ""
library(dplyr)
# Keep only the rows that have no missing value in any column
new_book_reviews <- book_reviews %>%
  filter(!is.na(review),
         !is.na(book),
         !is.na(state),
         !is.na(price))
dim(new_book_reviews)
## [1] 1794 4
All 206 rows with missing data have been removed, leaving 1,794 rows.
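The row-dropping step above can also be written in base R with `complete.cases()`, which flags rows that have no `NA` in any column. This is only an alternative sketch, not the method used in this project, and `toy_reviews` is a made-up data frame for illustration.

```r
# Toy data: rows 2 and 4 each contain at least one NA (values are invented)
toy_reviews <- data.frame(
  book   = c("R Made Easy", "R For Dummies", "R Made Easy", NA),
  review = c("Excellent", NA, "Fair", NA),
  state  = c("TX", "NY", "Texas", "FL"),
  price  = c(19.99, 15.99, 19.99, 15.99),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE for rows without any missing value
toy_clean <- toy_reviews[complete.cases(toy_reviews), ]

# Compare row counts before and after to see how many rows were dropped
rows_removed <- nrow(toy_reviews) - nrow(toy_clean)
print(rows_removed)
```

Comparing `nrow()` before and after is the same check that produced the 206 figure above.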
Next, make the “state” labels uniform: convert full state names to their postal codes and keep values that are already postal codes.
new_book_reviews <- new_book_reviews %>%
  mutate(
    state = case_when(
      state == "California" ~ "CA",
      state == "New York" ~ "NY",
      state == "Texas" ~ "TX",
      state == "Florida" ~ "FL",
      TRUE ~ state  # already a postal code, keep as is
    )
  )
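The same recoding can be expressed with a named lookup vector, which may be easier to extend if more states appear later. This is a base R sketch under that assumption. `state_lookup` and the sample `states` vector are illustrative names and values, not part of the original project.

```r
# Hypothetical lookup table: full state name -> postal code
state_lookup <- c(
  "California" = "CA",
  "New York"   = "NY",
  "Texas"      = "TX",
  "Florida"    = "FL"
)

# Sample labels mixing both styles, as seen in the state column
states <- c("TX", "Texas", "New York", "CA", "Florida", "FL")

# Replace labels found in the lookup; leave existing postal codes untouched
recoded <- unname(ifelse(states %in% names(state_lookup),
                         state_lookup[states],
                         states))
print(recoded)
```

The `case_when()` version above and this lookup version should give the same result for the four states in the data; the lookup simply keeps the mapping in one place.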
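# Convert the review labels to a numeric score (1 = Poor ... 5 = Excellent)
# and flag reviews scored 4 or higher.
# Note: this pipeline starts from book_reviews, so the NA filtering and state
# recoding above are not carried into the result; to keep them, start from
# new_book_reviews instead. Rows with a missing review get NA in both columns.
new_book_reviews <- book_reviews %>%
  mutate(
    review_num = case_when(
      review == "Poor" ~ 1,
      review == "Fair" ~ 2,
      review == "Good" ~ 3,
      review == "Great" ~ 4,
      review == "Excellent" ~ 5
    ),
    is_high_review = review_num >= 4
  )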
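Because the review labels form an ordered scale, the numeric score can also be derived from each label's position in an ordered vector. The sketch below uses base R `match()`. `review_levels` and the sample `reviews` vector are illustrative and not taken from the project code.

```r
# Labels ordered from worst to best; the position in this vector is the score
review_levels <- c("Poor", "Fair", "Good", "Great", "Excellent")

# Sample reviews, including a missing value
reviews <- c("Excellent", "Fair", NA, "Great", "Poor")

# match() returns each label's position in review_levels, or NA if absent
review_num <- match(reviews, review_levels)

# A review counts as "high" when its score is 4 or 5; NA stays NA
is_high_review <- review_num >= 4

print(review_num)
print(is_high_review)
```

This variant returns integer scores, whereas the `case_when()` version above produces doubles; for comparisons such as `>= 4` the difference does not matter.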
Most Profitable Book
To find the most profitable book, we use gross sales revenue (number of books sold × price) as a proxy for profit, since the data contains no cost information. Each row is treated as one sale.
new_book_reviews %>%
  group_by(book) %>%
  summarize(
    number_sold = n(),
    sales_price = mean(price),
    gross_sales = n() * mean(price)
  ) %>%
  arrange(desc(gross_sales))
## # A tibble: 5 × 4
## book number_sold sales_price gross_sales
## <chr> <int> <dbl> <dbl>
## 1 Secrets Of R For Advanced Students 406 50 20300
## 2 Fundamentals of R For Beginners 410 40.0 16396.
## 3 Top 10 Mistakes R Beginners Make 385 30.0 11546.
## 4 R Made Easy 389 20.0 7776.
## 5 R For Dummies 410 16.0 6556.
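The most profitable book is Secrets Of R For Advanced Students, with gross sales of 20,300 from 406 copies at a price of 50. The five counts add up to 2,000, so this table covers every row of the original data, including rows with a missing review.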
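The gross sales figures can be re-derived by hand from the counts and prices. The base R sketch below re-enters the counts from the table and the unrounded prices from the unique-value listing. It assumes each book sells at a single price, which the rounded gross figures in the table are consistent with. `summary_tbl` is an illustrative name.

```r
# Counts from the summary table and unrounded prices from the unique-value listing
summary_tbl <- data.frame(
  book = c("Secrets Of R For Advanced Students",
           "Fundamentals of R For Beginners",
           "Top 10 Mistakes R Beginners Make",
           "R Made Easy",
           "R For Dummies"),
  number_sold = c(406, 410, 385, 389, 410),
  sales_price = c(50.00, 39.99, 29.99, 19.99, 15.99),
  stringsAsFactors = FALSE
)

# Gross sales = number of books sold x price
summary_tbl$gross_sales <- summary_tbl$number_sold * summary_tbl$sales_price

# Book with the largest gross sales
top_book <- summary_tbl$book[which.max(summary_tbl$gross_sales)]
print(summary_tbl)
print(top_book)
```

For the top book this is 406 × 50 = 20,300. The other four books each sold a similar number of copies, so the ranking is driven almost entirely by price.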