Part1. Summary of project

This project analyzes the Amazon Top 50 Bestselling Books 2009 – 2019 dataset (https://www.kaggle.com/datasets). The dataset contains 550 books, each categorized as fiction or non-fiction using Goodreads. Analyzing this dataset gives us a deeper understanding of book market trends over the past decade.

Part2. Dataset description

The dataset includes seven variables: the name of the book, the author, the Amazon user rating, the number of written reviews on Amazon, the price, the year(s) the book ranked as a bestseller, and whether it is fiction or non-fiction.

I will select the following variables for data analysis. Categorical variables: Genre, Year. Numerical variables: User Rating, Reviews, Price.

Part3. Description of approaches that you have taken

Use table, pie, dotchart, barchart and histogram to demonstrate the data distribution graphically and directly; use a function with return() to avoid repeating the same computation; use as.numeric() to change the data type; use sprintf() to generate sentences from the results; use and compare different kinds of sampling, namely Simple Random Sampling, Systematic Sampling and Stratified Sampling; use confidence intervals to estimate the mean from the samples; use jiebaR to find high-frequency terms in the text and wordcloud2 to illustrate them.

Part4. Results

Preparing the data

## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
##      Name              Author          User Rating          Reviews         
##  Length:550         Length:550         Length:550         Length:550        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##     Price               Year              Genre          
##  Length:550         Length:550         Length:550        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
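
The summary above reports every column, including User Rating, Reviews, Price and Year, as character, so the numeric columns need converting before any arithmetic. A minimal sketch of one way to do that in base R, assuming a small stand-in data frame (`books_raw` and its values are hypothetical, not rows from the dataset):

```r
# Hypothetical stand-in for the raw import: every column arrives as character
books_raw <- data.frame(
  Name          = c("Book A", "Book B", "Book C"),
  `User Rating` = c("4.7", "4.6", "4.8"),
  Reviews       = c("17350", "2052", "18979"),
  Price         = c("8", "22", "15"),
  Year          = c("2016", "2011", "2018"),
  Genre         = c("Non Fiction", "Fiction", "Non Fiction"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert each column to the most specific type that fits its values
books_typed <- books_raw
books_typed[] <- lapply(books_raw, type.convert, as.is = TRUE)

str(books_typed)
```

With `as.is = TRUE`, text columns such as Name and Genre stay character instead of becoming factors.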

Analyzing the Genre

*Which kind of book is more popular?

Non Fiction books make up the larger share of the Amazon Top 50 Bestselling Books at 56%, 12 percentage points more than Fiction (44%).

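Genre <- bestsellers_with_categories$Genre
genre_counts <- table(Genre)
# Percentage share of each genre, rounded to whole percent
piepercent <- paste0(round(100 * prop.table(genre_counts)), "%")
cols <- c("#40DCA7", "#106E85")
pie(genre_counts, labels = paste(piepercent, genre_counts, sep = ", "), col = cols, main = "Genre")
# Take the legend labels from the table so they match the slice order
legend("topright", names(genre_counts), cex = 0.8, fill = cols)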

Analyzing the Price

*1.Which price range do people prefer?

Books priced between $0 and $20 account for most of the bestsellers.

## ── Attaching packages ──────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
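Price <- as.numeric(bestsellers_with_categories$Price)
# Count how many bestsellers share each price
Price2 <- as_tibble(table(Price))
# table() stores the prices as text labels, so convert them back to numbers
Price2$Price <- as.numeric(Price2$Price)
plot_ly(Price2, x = ~Price, y = ~n, type = 'scatter',
        mode = "markers", marker = list(color = "#FFD31F"))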
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
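
The statement about the $0 to $20 range can be checked by binning prices and counting the books in each bin. A minimal sketch with base R's `cut()`; the `price` vector and the $10-wide bins are illustrative assumptions, to be replaced with the numeric `Price` vector:

```r
# Hypothetical prices in dollars; replace with the numeric Price vector
price <- c(0, 4, 8, 11, 13, 15, 18, 22, 27, 46, 105)

# Bin prices into $10-wide ranges; right = FALSE puts $10 in [10,20)
breaks <- c(0, 10, 20, 30, 40, 50, Inf)
price_range <- cut(price, breaks = breaks, right = FALSE)

# Count and percentage share of books in each range
counts <- table(price_range)
round(100 * prop.table(counts), 1)
```

Adding the shares of the first two bins gives the fraction of books priced under $20.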

*2.Which book is the cheapest?

sprintf("The cheapest book is %s by %s, priced at $%s", 
        bestsellers_with_categories$Name[which.min(Price)], 
        bestsellers_with_categories$Author[which.min(Price)], min(Price))
## [1] "The cheapest book is Cabin Fever (Diary of a Wimpy Kid, Book 6) by Jeff Kinney, priced at $0"

*3.Which book is the most expensive?

sprintf("The most expensive book is %s by %s, priced at $%s", 
        bestsellers_with_categories$Name[which.max(Price)], 
        bestsellers_with_categories$Author[which.max(Price)], max(Price))
## [1] "The most expensive book is Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5 by American Psychiatric Association, priced at $105"

Analyzing the Reviews & Year

Generally, there is a positive correlation between reviews and year: the more recent the year, the more reviews a bestseller has. In addition, most books have no more than 30k reviews.

*1.What’s the relationship between Reviews & Year?

## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: HistData
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:plotly':
## 
##     subplot
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## 
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
## 
##     cancer
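# Keep only the Year and Reviews columns and make both numeric
RY <- bestsellers_with_categories[, c("Year", "Reviews")]
RY$Year <- as.numeric(RY$Year)
RY$Reviews <- as.numeric(RY$Reviews)
plot_ly(RY, x = ~Year, y = ~Reviews, type = 'scatter',
        mode = "markers", marker = list(color = "#009db2"))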
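
The scatter plot shows the trend visually; a correlation coefficient puts a number on it. A minimal sketch using Spearman's rank correlation, which only assumes that reviews tend to rise with year, not that the rise is linear. The `ry_demo` values are hypothetical stand-ins for the `RY` data frame:

```r
# Hypothetical Year/Reviews pairs standing in for the RY data frame
ry_demo <- data.frame(
  Year    = c(2009, 2011, 2013, 2015, 2017, 2019),
  Reviews = c(1200, 3500, 3100, 9800, 15000, 21000)
)

# Spearman's rank correlation: +1 means reviews always rise with year
rho <- cor(ry_demo$Year, ry_demo$Reviews, method = "spearman")
rho
```

A value close to +1 supports a positive relationship; a value near 0 would indicate none.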

*2.What’s the average number of reviews for each year?

## `summarise()` ungrouping output (override with `.groups` argument)
plot_ly(AVGRY, x = ~Year, y = ~Reviews, type = 'bar', marker = list(color = c("#40DCA7")))
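
The chunk that builds `AVGRY` is not displayed. A minimal sketch of one way such a per-year average could be computed, using base R's `aggregate()` rather than the `summarise()` call that the message above points to; `ry_demo` and its values are hypothetical:

```r
# Hypothetical stand-in for RY: one row per bestseller entry
ry_demo <- data.frame(
  Year    = c(2009, 2009, 2010, 2010, 2010),
  Reviews = c(1000, 3000, 2000, 4000, 6000)
)

# Mean number of reviews per year; the result has columns Year and Reviews
avg_demo <- aggregate(Reviews ~ Year, data = ry_demo, FUN = mean)
avg_demo
```

The result keeps the column names `Year` and `Reviews`, so it can be passed to a bar chart in the same way as `AVGRY`.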

*3.How many books per year have at least 4,000 reviews?

# Count the books in a given year that have at least 4,000 reviews
sum_of_review <- function(n) {
  sum(subset(RY, Year %in% n)$Reviews >= 4000)
}
year <- 2009:2019
# Apply the function to every year instead of calling it eleven times by hand
sumofreview_over4000 <- sapply(year, sum_of_review)
plot_ly(data.frame(year, sumofreview_over4000), x = ~year, y = ~sumofreview_over4000,
        type = 'scatter', mode = "markers", marker = list(color = "#FFD31F"))

###Distribution of User Rating

*What’s the distribution of User Rating?

The distribution of User Rating is left-skewed, with its peak at 4.8, the most common rating among the bestsellers.

User_Rating <- bestsellers_with_categories$`User Rating`
User_Rating <- as.numeric(User_Rating)
# Share of books at each rating value (prop.table() needs the frequency table, not the raw vector)
prop <- as.data.frame(prop.table(table(Rating = User_Rating)))
plot_ly(prop, x = ~Rating, y = ~Freq, type = 'bar', marker = list(color = "#40DCA7"))

###Simple Random Sampling and Distribution

1.sample size=50

## 
## Attaching package: 'sampling'
## The following objects are masked from 'package:survival':
## 
##     cluster, strata
# Simple random sampling with replacement: s[i] is how many times row i was drawn
s <- srswr(50, nrow(bestsellers_with_categories))
rows <- rep(which(s != 0), s[s != 0])

sample.1 <- bestsellers_with_categories[rows, ]
sample.1$`User Rating` <- as.numeric(sample.1$`User Rating`)
# Share of sampled books at each rating value
prop <- as.data.frame(prop.table(table(Rating = sample.1$`User Rating`)))

barplot(prop$Freq, names.arg = prop$Rating,
        xlab = "User Rating", ylab = "Proportion", col = "#FFD31F")

plot_ly(prop, x = ~Rating, y = ~Freq, type = 'bar', marker = list(color = "#106E85"))

2.sample size=100

# Draw 100 rows with replacement to match the stated sample size
s1 <- srswr(100, nrow(bestsellers_with_categories))
rows <- rep(which(s1 != 0), s1[s1 != 0])
sample.11 <- bestsellers_with_categories[rows, ]
sample.11$`User Rating` <- as.numeric(sample.11$`User Rating`)
# Share of sampled books at each rating value
prop <- as.data.frame(prop.table(table(Rating = sample.11$`User Rating`)))
plot_ly(prop, x = ~Rating, y = ~Freq, type = 'bar', marker = list(color = "#106E85"))

3.sample size=150

s2 <- srswr(150, nrow(bestsellers_with_categories))
rows <- rep(which(s2 != 0), s2[s2 != 0])
sample.111 <- bestsellers_with_categories[rows, ]
sample.111$`User Rating` <- as.numeric(sample.111$`User Rating`)

# Share of sampled books at each rating value
prop <- as.data.frame(prop.table(table(Rating = sample.111$`User Rating`)))
plot_ly(prop, x = ~Rating, y = ~Freq, type = 'bar', marker = list(color = "#106E85"))

###Systematic Sampling

N <- nrow(bestsellers_with_categories)
n <- 50

# Take every k-th row, starting from a random row r among the first k
k <- ceiling(N / n)
r <- sample(k, 1)

s <- seq(r, by = k, length.out = n)
sample.2 <- bestsellers_with_categories[s, ]
sample.2$`User Rating` <- as.numeric(sample.2$`User Rating`)

# Share of sampled books at each rating value
prop <- as.data.frame(prop.table(table(Rating = sample.2$`User Rating`)))
plot_ly(prop, x = ~Rating, y = ~Freq, type = 'bar', marker = list(color = "#009db2"))

Unequal Probabilities

bestsellers_with_categories$`User Rating` <- as.numeric(bestsellers_with_categories$`User Rating`)
# Inclusion probabilities proportional to User Rating, scaled to a sample of 50
sampleprob <- inclusionprobabilities(bestsellers_with_categories$`User Rating`, 50)

s <- UPsystematic(sampleprob)
sample.3 <- bestsellers_with_categories[s != 0, ]

# Share of sampled books at each rating value
prop <- as.data.frame(prop.table(table(Rating = sample.3$`User Rating`)))
plot_ly(prop, x = ~Rating, y = ~Freq, type = 'bar', marker = list(color = "#50c48f"))

###Stratified Sampling

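library(dplyr)
# Proportional allocation of 50 books across the two genres
f <- table(bestsellers_with_categories$Genre)
n <- 50 * f / sum(f)
# Caution: table() sorts genres alphabetically, while strata() reads the sizes in the
# order the genres first appear in the data, and these sizes are not whole numbers
s4 <- strata(bestsellers_with_categories, stratanames = "Genre", size = n, method = "srswor")
s4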
##           Genre ID_unit       Prob Stratum
## 16  Non Fiction      16 0.07038123       1
## 18  Non Fiction      18 0.07038123       1
## 19  Non Fiction      19 0.07038123       1
## 51  Non Fiction      51 0.07038123       1
## 163 Non Fiction     163 0.07038123       1
## 165 Non Fiction     165 0.07038123       1
## 203 Non Fiction     203 0.07038123       1
## 214 Non Fiction     214 0.07038123       1
## 245 Non Fiction     245 0.07038123       1
## 262 Non Fiction     262 0.07038123       1
## 273 Non Fiction     273 0.07038123       1
## 283 Non Fiction     283 0.07038123       1
## 301 Non Fiction     301 0.07038123       1
## 306 Non Fiction     306 0.07038123       1
## 323 Non Fiction     323 0.07038123       1
## 330 Non Fiction     330 0.07038123       1
## 379 Non Fiction     379 0.07038123       1
## 468 Non Fiction     468 0.07038123       1
## 504 Non Fiction     504 0.07038123       1
## 531 Non Fiction     531 0.07038123       1
## 533 Non Fiction     533 0.07038123       1
## 2       Fiction       2 0.11742424       2
## 21      Fiction      21 0.11742424       2
## 39      Fiction      39 0.11742424       2
## 57      Fiction      57 0.11742424       2
## 64      Fiction      64 0.11742424       2
## 72      Fiction      72 0.11742424       2
## 89      Fiction      89 0.11742424       2
## 120     Fiction     120 0.11742424       2
## 125     Fiction     125 0.11742424       2
## 252     Fiction     252 0.11742424       2
## 255     Fiction     255 0.11742424       2
## 260     Fiction     260 0.11742424       2
## 282     Fiction     282 0.11742424       2
## 342     Fiction     342 0.11742424       2
## 362     Fiction     362 0.11742424       2
## 367     Fiction     367 0.11742424       2
## 382     Fiction     382 0.11742424       2
## 387     Fiction     387 0.11742424       2
## 390     Fiction     390 0.11742424       2
## 392     Fiction     392 0.11742424       2
## 398     Fiction     398 0.11742424       2
## 409     Fiction     409 0.11742424       2
## 452     Fiction     452 0.11742424       2
## 471     Fiction     471 0.11742424       2
## 492     Fiction     492 0.11742424       2
## 514     Fiction     514 0.11742424       2
## 515     Fiction     515 0.11742424       2
## 545     Fiction     545 0.11742424       2
sample.4 <- getdata(bestsellers_with_categories, s4)
table(sample.4$`User Rating`)
## 
##   4 4.2 4.4 4.5 4.6 4.7 4.8 4.9 
##   1   1   6   6  10  10  13   2
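
In the output above, the stratum sizes come from `table()`, which sorts genres alphabetically, whereas `strata()` reads them in the order the genres appear in the data; the sizes are also fractional, and the sample ends up with 49 books instead of 50. A minimal sketch of one way to prepare whole-number sizes in data order, using base R only. The `genre` vector is a hypothetical stand-in that assumes the 310 / 240 split implied by the `Prob` column:

```r
# Hypothetical genre column with a 310 / 240 split,
# arranged so that "Non Fiction" appears first
genre <- c(rep("Non Fiction", 310), rep("Fiction", 240))

# Proportional allocation of a sample of 50, rounded to whole books
f <- table(genre)                      # sorted alphabetically: Fiction first
alloc <- round(50 * f / sum(f))

# Reorder the sizes to follow the first appearance of each genre in the data
alloc_in_data_order <- alloc[unique(genre)]
alloc_in_data_order
```

Rounding does not always preserve the total, so it is worth checking that the sizes still sum to the intended sample size before passing them on.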

###Compare the mean and standard deviation of User Rating for the four samples with the entire dataset

0.Initial Data

mean(User_Rating)
## [1] 4.618364
sd(User_Rating)
## [1] 0.2269804

1.Simple Random Sampling

mean(sample.1$`User Rating`)
## [1] 4.628
sd(sample.1$`User Rating`)
## [1] 0.1629323

2.Systematic Sampling

mean(sample.2$`User Rating`)
## [1] 4.594
sd(sample.2$`User Rating`)
## [1] 0.2758364

3.Systematic Sampling (Unequal Probabilities)

mean(sample.3$`User Rating`)
## [1] 4.616
sd(sample.3$`User Rating`)
## [1] 0.2131972

4.Stratified Sampling

mean(sample.4$`User Rating`)
## [1] 4.628571
sd(sample.4$`User Rating`)
## [1] 0.1814295
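
To see at a glance which sample mean lies closest to the dataset mean, the figures reported above can be collected and ranked. A short sketch; the vector names are illustrative, the numbers are the ones printed in this section:

```r
# Means reported above for the full dataset and the four samples
population_mean <- 4.618364
sample_means <- c(
  simple_random      = 4.628,
  systematic         = 4.594,
  systematic_unequal = 4.616,
  stratified         = 4.628571
)

# Absolute distance of each sample mean from the dataset mean, smallest first
sort(abs(sample_means - population_mean))
```

Because each sample is a single random draw, this ranking can change from run to run.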

###Confidence Intervals

1.Confidence Intervals of dataset

bestsellers_with_categories$Price <- as.numeric(bestsellers_with_categories$Price)
mean <- mean(bestsellers_with_categories$Price)
sd <- sd(bestsellers_with_categories$Price)
# Simulate 500 normally distributed prices, then draw a sample of 50
books <- rnorm(500, mean = mean, sd = sd)
sample_price <- sample(books, 50)
# Margin of error for a 95% interval (z = 1.96)
tmp <- 1.96 * sd / sqrt(50)
lConf <- mean(sample_price) - tmp
hConf <- mean(sample_price) + tmp
# Prices are in dollars, so report dollar amounts rather than percentages
sprintf("The 95 percent confidence interval is between $%.2f and $%.2f", lConf, hConf)
## [1] "The 95 percent confidence interval is between $9.11 and $15.12"

2.Confidence Intervals of Simple Random Sampling

sd <- sd(sample.1$`User Rating`)
# Simulate 50 normally distributed ratings from the sample's mean and sd
books2 <- rnorm(50, mean = mean(sample.1$`User Rating`), sd = sd)
sample_rating <- sample(books2, 50)
tmp <- 1.96 * sd / sqrt(50)
lConf <- mean(sample_rating) - tmp
hConf <- mean(sample_rating) + tmp
# Ratings are on a 5-point scale, so no percent sign
sprintf("The 95 percent confidence interval is between %.3f and %.3f", lConf, hConf)
## [1] "The 95 percent confidence interval is between 4.562 and 4.653"

3.Confidence Intervals of Systematic Sampling

sd <- sd(sample.2$`User Rating`)
# Simulate 50 normally distributed ratings from the sample's mean and sd
books3 <- rnorm(50, mean = mean(sample.2$`User Rating`), sd = sd)
sample_rating <- sample(books3, 50)
tmp <- 1.96 * sd / sqrt(50)
lConf <- mean(sample_rating) - tmp
hConf <- mean(sample_rating) + tmp
# Ratings are on a 5-point scale, so no percent sign
sprintf("The 95 percent confidence interval is between %.3f and %.3f", lConf, hConf)
## [1] "The 95 percent confidence interval is between 4.566 and 4.719"

4.Confidence Intervals of Stratified Sampling

# Recompute sd for this sample; otherwise the value from sample.2 would be reused
sd <- sd(sample.4$`User Rating`)
books4 <- rnorm(50, mean = mean(sample.4$`User Rating`), sd = sd)
sample_rating <- sample(books4, 50)
tmp <- 1.96 * sd / sqrt(50)
lConf <- mean(sample_rating) - tmp
hConf <- mean(sample_rating) + tmp
# Ratings are on a 5-point scale, so no percent sign
sprintf("The 95 percent confidence interval is between %.3f and %.3f", lConf, hConf)
## [1] "The 95 percent confidence interval is between 4.532 and 4.685"
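
The intervals above are built from normally simulated values. An alternative is to compute the interval directly from the observed sample. A minimal sketch, assuming a small hypothetical vector of ratings (`ratings` is illustrative, not drawn from the dataset) and using the t distribution because the standard deviation is estimated from the sample:

```r
# Hypothetical sample of 10 user ratings; replace with a real sample column
ratings <- c(4.7, 4.6, 4.8, 4.5, 4.9, 4.6, 4.7, 4.8, 4.4, 4.7)

n    <- length(ratings)
xbar <- mean(ratings)
se   <- sd(ratings) / sqrt(n)          # standard error of the mean

# 95% interval using the t distribution with n - 1 degrees of freedom
margin <- qt(0.975, df = n - 1) * se
ci <- c(lower = xbar - margin, upper = xbar + margin)
sprintf("95%% CI for the mean rating: %.3f to %.3f", ci["lower"], ci["upper"])
```

Base R's `t.test(ratings)$conf.int` returns the same interval and can serve as a cross-check.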

###High-frequency words

*Which book themes are more popular?

## Loading required package: jiebaRD
## 
## ── Column specification ─────────────────────────────────────
## cols(
##   `{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600` = col_character()
## )
## Warning: 109 parsing failures.
## row col  expected    actual                                     file
##  18  -- 1 columns 2 columns 'CS544_Final Project_Cheng_v1/books.txt'
##  26  -- 1 columns 3 columns 'CS544_Final Project_Cheng_v1/books.txt'
##  38  -- 1 columns 5 columns 'CS544_Final Project_Cheng_v1/books.txt'
##  40  -- 1 columns 4 columns 'CS544_Final Project_Cheng_v1/books.txt'
##  46  -- 1 columns 3 columns 'CS544_Final Project_Cheng_v1/books.txt'
## ... ... ......... ......... ........................................
## See problems(...) for more details.
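
The word-counting code itself is not displayed. As a rough illustration of the idea, the sketch below counts high-frequency words in titles with base R only; it swaps in a plain `strsplit()` tokenizer in place of jiebaR, and the titles and stop-word list are hypothetical:

```r
# Hypothetical titles standing in for the Name column
titles <- c(
  "A Guide to Life",
  "Life With My Dog",
  "Love, Life and the City",
  "The Dog Who Loved Kids"
)

# Lowercase, split on anything that is not a letter or apostrophe, drop blanks
words <- unlist(strsplit(tolower(titles), "[^a-z']+"))
words <- words[nchar(words) > 0]

# Remove a small, illustrative stop-word list, then count what is left
stop_words <- c("a", "to", "with", "my", "and", "the", "who")
word_freq <- sort(table(words[!words %in% stop_words]), decreasing = TRUE)
head(word_freq, 5)
```

A frequency table of this shape, converted to a data frame of words and counts, is the kind of input a word cloud is drawn from.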

Part5. Conclusions

All the distributions of user rating from the different sample sizes (50, 100, 150) are left-skewed. Of the four sampling methods, Systematic Sampling (Unequal Probabilities) gives the result closest to the original dataset (mean 4.616 versus 4.618). American readers pay the most attention to humanities topics such as love, life, guide, man, kids, and dog.

Part6. Lesson learned

1.Learned how to present data more efficiently, visually, and aesthetically.

2.Learned to use R to calculate the probability, reliability, and correlation of data while reducing duplication of effort.

3.Telling a good story is much more important than showing the data. Do not assume that everyone in the audience can or will read the data; learn to turn the data into meaningful suggestions.

4.In future market research work, keep in mind that no sampling method is perfect; the choice needs to be made case by case.

5.Need to improve my attitude and ability to handle pressure when fixing bugs.