This project analyzes the Amazon Top 50 Bestselling Books 2009–2019 dataset (https://www.kaggle.com/datasets). The dataset contains 550 books, each categorized as fiction or non-fiction using Goodreads. Analyzing this dataset gives us a deeper understanding of book-market trends over the past decade.
The dataset includes seven variables: the name of the book, the author, the Amazon user rating, the number of written reviews on Amazon, the price of the book, the year(s) it ranked on the bestseller list, and whether it is fiction or non-fiction.
I will use the following variables for the analysis. Categorical variables: Genre, Year. Numerical variables: User Rating, Reviews, Price.
The analysis uses tables, pie charts, dot charts, bar charts, and histograms to show the distributions graphically and directly; a helper function with return() to avoid double counting; as.numeric() and related coercion functions to change data types; sprintf() to generate sentences programmatically; a comparison of several sampling methods (Simple Random Sampling, Systematic Sampling, and Stratified Sampling); confidence intervals to estimate from the samples; and jiebaR to extract high-frequency terms from text, visualized with wordcloud2.
## Name Author User Rating Reviews
## Length:550 Length:550 Length:550 Length:550
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Price Year Genre
## Length:550 Length:550 Length:550
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
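The summary above shows that every column was read in as character. Below is a minimal sketch of the type conversion, assuming the data frame is named bestsellers_with_categories as in the later chunks (the report otherwise converts each column with as.numeric() where it is used):
# Convert the columns that should be numeric; the rest stay as character
num_cols <- c("User Rating", "Reviews", "Price", "Year")
bestsellers_with_categories[num_cols] <-
  lapply(bestsellers_with_categories[num_cols], as.numeric)
summary(bestsellers_with_categories[num_cols])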
*Which kind of book is more popular?
Non Fiction titles make up the larger share of the Amazon Top 50 bestsellers at 56%, 12 percentage points higher than Fiction (44%).
# Genre proportions across the 550 bestsellers
Genre <- bestsellers_with_categories$Genre
piepercent <- paste(round(100 * table(Genre) / sum(table(Genre))), "%")
cols <- c("#40DCA7", "#106E85")
pie(table(Genre), labels = paste(piepercent, table(Genre), sep = " , "),
    col = cols, main = "Genre")
legend("topright", c("Fiction", "Non Fiction"), cex = 0.8, fill = cols)
*1.Which price range do people prefer to buy?
Books priced between $0 and $20 sell best.
# Frequency of each price point among the bestsellers
Price <- as.numeric(bestsellers_with_categories$Price)
Price2 <- as_tibble(table(Price))
Price2$Price <- as.numeric(Price2$Price)  # table() names are character; restore numeric order on the x-axis
plot_ly(Price2, x = ~Price, y = ~n, type = 'scatter',
        mode = "markers", marker = list(color = c("#FFD31F")))
*2.Which book is the cheapest?
sprintf("The cheapest book is %s by %s, priced at $%s",
bestsellers_with_categories$Name[which.min(Price)],
bestsellers_with_categories$Author[which.min(Price)], min(Price))
## [1] "The cheapest book is Cabin Fever (Diary of a Wimpy Kid, Book 6) by Jeff Kinney, priced at $0"
*3.Which book is the most expensive?
sprintf("The most expensive book is %s by %s, priced at $%s",
bestsellers_with_categories$Name[which.max(Price)],
bestsellers_with_categories$Author[which.max(Price)], max(Price))
## [1] "The most expensive book is Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5 by American Psychiatric Association, priced at $105"
Generally, there is a positive correlation between Reviews and Year: the more recent the year, the more reviews. In addition, most books have fewer than 30,000 reviews.
*1.What’s the relationship between Reviews & Year?
# Reviews vs. Year (columns 6 and 4 of the dataset)
RY <- bestsellers_with_categories[ , c(6, 4)]
RY$Reviews <- as.numeric(RY$Reviews)
plot_ly(RY, x = ~Year, y = ~Reviews, type = 'scatter',
        mode = "markers", marker = list(color = c("#009db2")))
*2.What is the average number of reviews for each year?
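The figure for this question comes from a chunk that is not echoed above; below is a minimal sketch of the presumable computation with dplyr and plotly, converting Year to numeric first (it is still stored as character in RY):
library(dplyr)
library(plotly)

# Average number of reviews per bestseller year
avg_reviews <- RY %>%
  mutate(Year = as.numeric(Year)) %>%
  group_by(Year) %>%
  summarise(mean_reviews = mean(Reviews))

plot_ly(avg_reviews, x = ~Year, y = ~mean_reviews, type = 'scatter',
        mode = "markers", marker = list(color = c("#009db2")))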
*3.How many books with at least 4,000 reviews appeared in each year?
# Count the bestsellers with at least 4,000 reviews in a given year
sum_of_review <- function(n){
  return(sum(subset(RY, (Year %in% n))$Reviews >= 4000))
}
year <- 2009:2019
sumofreview_over4000 <- sapply(year, sum_of_review)  # one count per year
plot_ly(data.frame(year, sumofreview_over4000), x = ~year, y = ~sumofreview_over4000,
        type = 'scatter', mode = "markers", marker = list(color = c("#FFD31F")))
### Distribution of User Rating
*What’s the distribution of User Rating?
The distribution of User Rating is left-skewed, with a peak at 4.8, the rating most readers give.
User_Rating <- as.numeric(bestsellers_with_categories$`User Rating`)
# Proportion of books at each rating value
prop <- prop.table(table(User_Rating))
plot_ly(x = as.numeric(names(prop)), y = as.numeric(prop), type = 'bar',
        marker = list(color = c("#40DCA7")))
### Simple Random Sampling and Distribution
1. Sample size = 50
# Simple random sampling with replacement, n = 50
s <- srswr(50, nrow(bestsellers_with_categories))
rows <- (1:nrow(bestsellers_with_categories))[s != 0]
rows <- rep(rows, s[s != 0])          # repeat rows drawn more than once
sample.1 <- bestsellers_with_categories[rows, ]
sample.1$`User Rating` <- as.numeric(sample.1$`User Rating`)
prop <- prop.table(table(sample.1$`User Rating`))
barplot(prop, xlab = "User Rating", ylab = "prop", col = c("#FFD31F"))
2. Sample size = 100
s1 <- srswr(100, nrow(bestsellers_with_categories))
rows <- (1:nrow(bestsellers_with_categories))[s1 != 0]
rows <- rep(rows, s1[s1 != 0])
sample.11 <- bestsellers_with_categories[rows, ]
sample.11$`User Rating` <- as.numeric(sample.11$`User Rating`)
prop <- prop.table(table(sample.11$`User Rating`))
plot_ly(x = as.numeric(names(prop)), y = as.numeric(prop), type = 'bar',
        marker = list(color = c("#106E85")))
3. Sample size = 150
s2 <- srswr(150, nrow(bestsellers_with_categories))
rows <- (1:nrow(bestsellers_with_categories))[s2 != 0]
rows <- rep(rows, s2[s2 != 0])
sample.111 <- bestsellers_with_categories[rows, ]
sample.111$`User Rating` <- as.numeric(sample.111$`User Rating`)
prop <- prop.table(table(sample.111$`User Rating`))
plot_ly(x = as.numeric(names(prop)), y = as.numeric(prop), type = 'bar',
        marker = list(color = c("#106E85")))
### Systematic Sampling
# Systematic sampling: every k-th book starting from a random offset r
N <- nrow(bestsellers_with_categories)
n <- 50
k <- ceiling(N / n)
r <- sample(k, 1)
s <- seq(r, by = k, length = n)
sample.2 <- bestsellers_with_categories[s, ]
sample.2$`User Rating` <- as.numeric(sample.2$`User Rating`)
prop <- prop.table(table(sample.2$`User Rating`))
plot_ly(x = as.numeric(names(prop)), y = as.numeric(prop), type = 'bar',
        marker = list(color = c("#009db2")))
# Systematic sampling with inclusion probabilities proportional to User Rating
bestsellers_with_categories$`User Rating` <- as.numeric(bestsellers_with_categories$`User Rating`)
sampleprob <- inclusionprobabilities(bestsellers_with_categories$`User Rating`, 50)
s <- UPsystematic(sampleprob)
sample.3 <- bestsellers_with_categories[s != 0, ]
prop <- prop.table(table(sample.3$`User Rating`))
plot_ly(x = as.numeric(names(prop)), y = as.numeric(prop), type = 'bar',
        marker = list(color = c("#50c48f")))
### Stratified Sampling
# Stratified sampling: strata defined by Genre, sizes proportional to the stratum counts
# (strata() expects the sizes in the order the strata appear in the data)
library(dplyr)
f <- table(bestsellers_with_categories$Genre)
n <- 50 * f / sum(f)
s4 <- strata(bestsellers_with_categories, stratanames = "Genre", size = n, method = "srswor")
s4
## Genre ID_unit Prob Stratum
## 16 Non Fiction 16 0.07038123 1
## 18 Non Fiction 18 0.07038123 1
## 19 Non Fiction 19 0.07038123 1
## 51 Non Fiction 51 0.07038123 1
## 163 Non Fiction 163 0.07038123 1
## 165 Non Fiction 165 0.07038123 1
## 203 Non Fiction 203 0.07038123 1
## 214 Non Fiction 214 0.07038123 1
## 245 Non Fiction 245 0.07038123 1
## 262 Non Fiction 262 0.07038123 1
## 273 Non Fiction 273 0.07038123 1
## 283 Non Fiction 283 0.07038123 1
## 301 Non Fiction 301 0.07038123 1
## 306 Non Fiction 306 0.07038123 1
## 323 Non Fiction 323 0.07038123 1
## 330 Non Fiction 330 0.07038123 1
## 379 Non Fiction 379 0.07038123 1
## 468 Non Fiction 468 0.07038123 1
## 504 Non Fiction 504 0.07038123 1
## 531 Non Fiction 531 0.07038123 1
## 533 Non Fiction 533 0.07038123 1
## 2 Fiction 2 0.11742424 2
## 21 Fiction 21 0.11742424 2
## 39 Fiction 39 0.11742424 2
## 57 Fiction 57 0.11742424 2
## 64 Fiction 64 0.11742424 2
## 72 Fiction 72 0.11742424 2
## 89 Fiction 89 0.11742424 2
## 120 Fiction 120 0.11742424 2
## 125 Fiction 125 0.11742424 2
## 252 Fiction 252 0.11742424 2
## 255 Fiction 255 0.11742424 2
## 260 Fiction 260 0.11742424 2
## 282 Fiction 282 0.11742424 2
## 342 Fiction 342 0.11742424 2
## 362 Fiction 362 0.11742424 2
## 367 Fiction 367 0.11742424 2
## 382 Fiction 382 0.11742424 2
## 387 Fiction 387 0.11742424 2
## 390 Fiction 390 0.11742424 2
## 392 Fiction 392 0.11742424 2
## 398 Fiction 398 0.11742424 2
## 409 Fiction 409 0.11742424 2
## 452 Fiction 452 0.11742424 2
## 471 Fiction 471 0.11742424 2
## 492 Fiction 492 0.11742424 2
## 514 Fiction 514 0.11742424 2
## 515 Fiction 515 0.11742424 2
## 545 Fiction 545 0.11742424 2
##
## 4 4.2 4.4 4.5 4.6 4.7 4.8 4.9
## 1 1 6 6 10 10 13 2
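The frequency table above tabulates User Rating in the stratified sample; the step that extracts the sample from the design object is not echoed. A minimal sketch, assuming sampling::getdata() and the object name sample.4 used in the confidence-interval section below:
# Pull the selected rows out of the strata() design object and
# tabulate their User Rating values
sample.4 <- getdata(bestsellers_with_categories, s4)
table(sample.4$`User Rating`)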
### Compare the mean and standard deviation of User Rating for the four samples with the entire dataset
0.Initial Data
## [1] 4.618364
## [1] 0.2269804
1.Simple Random Sampling
## [1] 4.628
## [1] 0.1629323
2.Systematic Sampling
## [1] 4.594
## [1] 0.2758364
3.Systematic Sampling (Unequal Probabilities)
## [1] 4.616
## [1] 0.2131972
4.Stratified Sampling
## [1] 4.628571
## [1] 0.1814295
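Each pair of values above is the mean followed by the standard deviation of User Rating; the chunk computing them is not echoed. A minimal sketch, assuming the sample objects created above (the helper rating_stats() is introduced here only for illustration):
# Mean and standard deviation of User Rating: full data vs. each sample
rating_stats <- function(x) c(mean = mean(x), sd = sd(x))
rating_stats(bestsellers_with_categories$`User Rating`)  # 0. initial data
rating_stats(sample.1$`User Rating`)                     # 1. simple random sampling
rating_stats(sample.2$`User Rating`)                     # 2. systematic sampling
rating_stats(sample.3$`User Rating`)                     # 3. systematic (unequal probabilities)
rating_stats(sample.4$`User Rating`)                     # 4. stratified sampling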
### Confidence Intervals
1. Confidence interval for the mean Price of the full dataset
bestsellers_with_categories$Price <- as.numeric(bestsellers_with_categories$Price)
mean <- mean(bestsellers_with_categories$Price)
sd <- sd(bestsellers_with_categories$Price)
# Simulate 500 prices from a normal model of the data, then draw a sample of 50
books <- rnorm(500, mean = mean, sd = sd)
sample_price <- sample(books, 50)
tmp <- 1.96 * sd / sqrt(50)              # margin of error at the 95% level
lConf <- mean(sample_price) - tmp
hConf <- mean(sample_price) + tmp
sprintf("The 95 percent confidence interval is between $%s and $%s", lConf, hConf)
## [1] "The 95 percent confidence interval is between $9.11147198322841 and $15.1221153664097"
2. Confidence interval of Simple Random Sampling (User Rating)
sd <- sd(sample.1$`User Rating`)
books2 <- rnorm(50, mean = mean(sample.1$`User Rating`), sd = sd)
sample_rating <- sample(books2, 50)
tmp <- 1.96 * sd / sqrt(50)
lConf <- mean(sample_rating) - tmp
hConf <- mean(sample_rating) + tmp
sprintf("The 95 percent confidence interval is between %s and %s", lConf, hConf)
## [1] "The 95 percent confidence interval is between 4.56237943554461 and 4.65270450183299"
3. Confidence interval of Systematic Sampling (User Rating)
sd <- sd(sample.2$`User Rating`)
books3 <- rnorm(50, mean = mean(sample.2$`User Rating`), sd = sd)
sample_rating <- sample(books3, 50)
tmp <- 1.96 * sd / sqrt(50)
lConf <- mean(sample_rating) - tmp
hConf <- mean(sample_rating) + tmp
sprintf("The 95 percent confidence interval is between %s and %s", lConf, hConf)
## [1] "The 95 percent confidence interval is between 4.56587985578628 and 4.71879574900867"
4. Confidence interval of Stratified Sampling (User Rating)
sd <- sd(sample.4$`User Rating`)
books4 <- rnorm(50, mean = mean(sample.4$`User Rating`), sd = sd)
sample_rating <- sample(books4, 50)
tmp <- 1.96 * sd / sqrt(50)
lConf <- mean(sample_rating) - tmp
hConf <- mean(sample_rating) + tmp
sprintf("The 95 percent confidence interval is between %s and %s", lConf, hConf)
## [1] "The 95 percent confidence interval is between 4.53239673030019 and 4.68531262352257"
### High-frequency words
*Which book themes are the most popular?
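The word cloud itself comes from a chunk that is not echoed. Below is a minimal sketch of the term-frequency step with jiebaR and wordcloud2, assuming the book titles in the Name column as the text source and an arbitrary minimum token length:
library(jiebaR)     # word segmentation (loads the jiebaRD dictionaries)
library(wordcloud2)

# Segment all book titles into terms and count term frequencies
seg   <- worker()
terms <- segment(paste(bestsellers_with_categories$Name, collapse = " "), seg)
terms <- terms[nchar(terms) > 2]              # drop very short tokens
term_freq <- freq(terms)                      # data frame with columns char, freq
term_freq <- term_freq[order(-term_freq$freq), ]
names(term_freq) <- c("word", "freq")

# Display the high-frequency title words
wordcloud2(term_freq)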
All of the User Rating distributions from the different sample sizes (50, 100, 150) are left-skewed. Among the four sampling methods, Systematic Sampling with Unequal Probabilities gives the result closest to the original dataset. American readers pay the most attention to human-interest topics such as love, life, guide, man, kids, and dog.
1. Learned how to present data more efficiently, visually, and aesthetically.
2. Learned to use R to calculate probabilities, reliability, and correlations in the data while reducing duplicated effort.
3. Telling a good story is much more important than merely showing the data. Do not assume that everyone in the audience can or will read the data; learn to turn the data into meaningful recommendations.
4. In future market research, keep in mind that no sampling method is perfect; the choice needs to be analyzed case by case.
5. I need a better attitude and more resilience under pressure when fixing bugs.