Loading all the necessary libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# dplyr, ggplot2, and purrr are already attached by library(tidyverse),
# and stats is attached by default, so only the extra packages are loaded here.
library(ggthemes)
library(pwr)

Load the Dataset

books <- read.csv("bestsellers.csv")

Check the structure of the data

str(books)

## 'data.frame':    550 obs. of  7 variables:
##  $ Name       : chr  "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
##  $ Author     : chr  "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
##  $ User.Rating: num  4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
##  $ Reviews    : int  17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
##  $ Price      : int  8 22 15 6 12 11 30 15 3 8 ...
##  $ Year       : int  2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
##  $ Genre      : chr  "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...

Hypothesis 1: User Ratings by Genre

H0: The average user rating is the same for fiction and non-fiction books.
H1: The average user rating differs between fiction and non-fiction books.

Neyman-Pearson Framework:

Test Used: Independent samples t-test (Welch's version, the default in R's t.test(), which does not assume equal variances).
Alpha Level: 0.05, the conventional threshold for the Type I error rate.
Power Level: 0.80, the conventional target, which limits the Type II error rate to 20%.
Minimum Effect Size: Cohen’s d = 0.2, chosen so that small but meaningful differences can be detected.

# Filtering data by genre
fiction <- books %>% filter(Genre == "Fiction")
non_fiction <- books %>% filter(Genre == "Non Fiction")

# Calculate effect size based on a small effect (Cohen's d = 0.2)
d <- 0.2

# Power analysis
pwr_result <- pwr.t.test(d = d, power = 0.80, sig.level = 0.05, type = "two.sample", alternative = "two.sided")
required_n <- ceiling(pwr_result$n)

cat("The required sample size per group is", required_n, "\n")

## The required sample size per group is 394
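The 394 books per group reported above is more than the data can supply: the dataset has 550 rows in total, so the two genres cannot both reach 394. As a rough check, the sketch below uses base R's power.t.test() (from stats, used here in place of pwr so that it runs without extra packages) to recompute the required n and to estimate the power available under an assumed equal split of 275 books per genre. The equal split is an illustrative assumption, not the actual genre counts; under equal variances, an unequal split of the same total would give somewhat lower power.

```r
# Standardized effect size (Cohen's d) taken from the analysis above
d <- 0.2

# Required n per group for 80% power at alpha = 0.05
# (sd = 1 makes delta equal to the standardized effect size d)
req <- power.t.test(delta = d, sd = 1, sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")
required_n <- ceiling(req$n)

# ASSUMPTION: an equal split of the 550 rows (275 per genre), for illustration only
assumed_n <- 550 / 2
ach <- power.t.test(n = assumed_n, delta = d, sd = 1, sig.level = 0.05,
                    type = "two.sample", alternative = "two.sided")
achieved_power <- ach$power

cat("Required n per group:", required_n, "\n")
cat("Power with", assumed_n, "per group:", round(achieved_power, 2), "\n")
```

If the achieved power falls below 0.80, a non-significant result could not be read as evidence of no difference at d = 0.2; a significant result, as obtained below, is still interpretable.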

# Perform t-test
t_test_results <- t.test(User.Rating ~ Genre, data = books, alternative = "two.sided", conf.level = 0.95)

# Output the results of the t-test
t_test_results

## 
##  Welch Two Sample t-test
## 
## data:  User.Rating by Genre
## t = 2.6299, df = 415.29, p-value = 0.008859
## alternative hypothesis: true difference in means between group Fiction and group Non Fiction is not equal to 0
## 95 percent confidence interval:
##  0.01342894 0.09291515
## sample estimates:
##     mean in group Fiction mean in group Non Fiction 
##                  4.648333                  4.595161

# Boxplot for user ratings by genre
ggplot(books, aes(x = Genre, y = User.Rating, fill = Genre)) +
  geom_boxplot() +
  labs(title = "Box Plot of User Ratings by Genre", x = "Genre", y = "User Rating") +
  theme_minimal()

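The test yielded t = 2.6299 with 415.29 degrees of freedom and a p-value of 0.008859. Because this p-value is below the pre-specified alpha of 0.05, the null hypothesis of equal mean ratings is rejected: if the two genres had the same mean rating, a difference at least this large would be expected in fewer than 1% of samples. Note that the power analysis called for 394 books per group, while the dataset contains only 550 books in total, so the test has less than the planned 80% power to detect an effect of d = 0.2.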

The 95% confidence interval for the difference in means (fiction minus non-fiction) ranges from 0.01343 to 0.09292. The interval excludes zero, which is consistent with rejecting the null hypothesis of no difference in average user ratings between the two genres. In practical terms, fiction books have a slightly higher mean user rating than non-fiction books (4.648 versus 4.595, a difference of about 0.05 rating points).
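To judge whether the observed difference reaches the minimum effect size of d = 0.2 set in the framework above, the raw difference in means has to be standardized. A minimal sketch follows; cohens_d() is a hypothetical helper written for this illustration (it is not a function from pwr or any other package), and it is demonstrated on simulated ratings rather than the bestsellers data.

```r
# Hypothetical helper: Cohen's d using the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Simulated ratings for illustration only (NOT the bestsellers data)
set.seed(1)
sim_fiction <- rnorm(200, mean = 4.65, sd = 0.25)
sim_non_fiction <- rnorm(200, mean = 4.60, sd = 0.25)

d_sim <- cohens_d(sim_fiction, sim_non_fiction)
cat("Simulated Cohen's d:", round(d_sim, 2), "\n")
```

With the real data, the call would be cohens_d(fiction$User.Rating, non_fiction$User.Rating), using the two data frames created in the filtering step.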

Hypothesis 2: Price Stability Over Years

H0: The average book price is the same in every year.
H1: The average book price differs in at least one year.

Fisher’s Significance Testing Framework:

Test Used: One-way ANOVA, suitable for comparing more than two groups (in this case, the average prices across multiple years).
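Interpretation of P-value: In Fisher's framework the p-value is read as a continuous measure of evidence against the null hypothesis, without a pre-set alpha or power level. The smaller the p-value from the ANOVA, the stronger the evidence that at least one year's average price differs from the others.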

# Group data by year and calculate average price
yearly_prices <- books %>%
  group_by(Year) %>%
  summarise(Average_Price = mean(Price, na.rm = TRUE))

yearly_prices

## # A tibble: 11 × 2
##     Year Average_Price
##    <int>         <dbl>
##  1  2009          15.4
##  2  2010          13.5
##  3  2011          15.1
##  4  2012          15.3
##  5  2013          14.6
##  6  2014          14.6
##  7  2015          10.4
##  8  2016          13.2
##  9  2017          11.4
## 10  2018          10.5
## 11  2019          10.1

# Perform ANOVA test
anova_results <- aov(Price ~ as.factor(Year), data = books)
summary(anova_results)

##                  Df Sum Sq Mean Sq F value Pr(>F)  
## as.factor(Year)  10   2241   224.1   1.939 0.0379 *
## Residuals       539  62297   115.6                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Line plot of average prices over the years
ggplot(yearly_prices, aes(x = Year, y = Average_Price, group = 1)) +
  geom_line() +
  geom_point() +
  labs(title = "Average Book Prices Over the Years", x = "Year", y = "Average Price") +
  theme_minimal()

The ANOVA gives F(10, 539) = 1.939 with a p-value of 0.0379, which is moderate evidence against the null hypothesis that average prices are equal across years: variation this large among the yearly averages would be unusual if only random chance were at work. The group effect has 10 degrees of freedom (the 11 years minus one) and the residuals have 539. The F-value is a test statistic rather than an effect size, and it does not identify which years differ; the yearly averages above suggest lower prices from 2015 onward.

The breakdown shows that the between-year sum of squares (2241) is small relative to the residual sum of squares (62297): differences between years account for about 3.5% of the total variability in book prices (2241 / 64538), so the effect is statistically detectable but small.
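The share of variability attributable to year can be quantified as eta squared, and the F statistic and its p-value can be recomputed from the sums of squares as a consistency check. The sketch below uses only the rounded values printed in the ANOVA table, so small rounding differences from the table are expected.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_between <- 2241
ss_residual <- 62297
df_between <- 10
df_residual <- 539

# Eta squared: share of total variation in price attributable to year
eta_sq <- ss_between / (ss_between + ss_residual)

# F statistic as the ratio of mean squares, and its upper-tail p-value
f_value <- (ss_between / df_between) / (ss_residual / df_residual)
p_value <- pf(f_value, df_between, df_residual, lower.tail = FALSE)

cat("Eta squared:", round(eta_sq, 3), "\n")
cat("F =", round(f_value, 3), " p =", round(p_value, 4), "\n")
```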

week7

Shresta Reddy Nukala

2024-04-22