library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# dplyr, ggplot2 and purrr are already attached by tidyverse; stats is attached by default
library(ggthemes)
library(pwr)
books <- read.csv("bestsellers.csv")
str(books)
## 'data.frame': 550 obs. of 7 variables:
## $ Name : chr "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
## $ Author : chr "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
## $ User.Rating: num 4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
## $ Reviews : int 17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
## $ Price : int 8 22 15 6 12 11 30 15 3 8 ...
## $ Year : int 2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
## $ Genre : chr "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...
H0: The average user rating is the same for fiction and non-fiction books.
H1: The average user rating differs between fiction and non-fiction books.
Test Used: Independent samples t-test. R's t.test() defaults to the Welch version, which does not assume equal variances in the two groups.
Alpha Level: 0.05, a commonly accepted threshold for the Type I error rate.
Power Level: 0.80, a standard choice that limits the risk of a Type II error to 20%.
Minimum Effect Size: Cohen’s d = 0.2, chosen so that small but meaningful differences can be detected.
# Filtering data by genre
fiction <- books %>% filter(Genre == "Fiction")
non_fiction <- books %>% filter(Genre == "Non Fiction")
# Set the minimum effect size of interest (small effect, Cohen's d = 0.2)
d <- 0.2
# Power analysis
pwr_result <- pwr.t.test(d = d, power = 0.80, sig.level = 0.05, type = "two.sample", alternative = "two.sided")
required_n <- ceiling(pwr_result$n)
cat("The required sample size per group is", required_n, "\n")
## The required sample size per group is 394
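As a cross-check on the pwr result, the same sample size can be reproduced with base R's power.t.test(), using delta = 0.2 with sd = 1 so that the difference is expressed in Cohen's d units. The second call is only a sketch under a labeled assumption: n_hypothetical = 275 is an illustrative equal split of the 550 books, not the actual group sizes, which are not shown in this report.

```r
# Cross-check of the required sample size using base R (stats package).
# With sd = 1, delta is equivalent to Cohen's d.
check <- power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80,
                      type = "two.sample", alternative = "two.sided")
required_n_check <- ceiling(check$n)
cat("Required sample size per group (base R):", required_n_check, "\n")

# ASSUMPTION: a hypothetical equal split of the 550 books (275 per group).
# Leaving delta unspecified makes power.t.test() solve for it.
n_hypothetical <- 275
sens <- power.t.test(n = n_hypothetical, sd = 1, sig.level = 0.05, power = 0.80,
                     type = "two.sample", alternative = "two.sided")
min_detectable_d <- sens$delta
cat("Smallest d detectable with 80% power at n =", n_hypothetical, "per group:",
    round(min_detectable_d, 3), "\n")
```

Because 2 × 394 = 788 exceeds the 550 books in the dataset, the two genres cannot both reach the required size, so the test has less than 80% power for an effect as small as d = 0.2.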
# Perform t-test
t_test_results <- t.test(User.Rating ~ Genre, data = books, alternative = "two.sided", conf.level = 0.95)
# Output the results of the t-test
t_test_results
##
## Welch Two Sample t-test
##
## data: User.Rating by Genre
## t = 2.6299, df = 415.29, p-value = 0.008859
## alternative hypothesis: true difference in means between group Fiction and group Non Fiction is not equal to 0
## 95 percent confidence interval:
## 0.01342894 0.09291515
## sample estimates:
## mean in group Fiction mean in group Non Fiction
## 4.648333 4.595161
# Boxplot for user ratings by genre
ggplot(books, aes(x = Genre, y = User.Rating, fill = Genre)) +
  geom_boxplot() +
  labs(title = "Box Plot of User Ratings by Genre", x = "Genre", y = "User Rating") +
  theme_minimal()
The test yielded t = 2.6299 on 415.29 degrees of freedom, giving a p-value of 0.008859. This is below the 0.05 significance level chosen above, so the difference in mean user ratings between fiction and non-fiction books is statistically significant: a difference this large would be unlikely if the two genres had the same mean rating.
The 95% confidence interval for the difference in means (fiction minus non-fiction) ranges from 0.01343 to 0.09292. The interval excludes zero, which is consistent with rejecting the null hypothesis of equal average ratings. In practical terms, fiction books average about 0.05 rating points higher than non-fiction books (4.65 versus 4.60), a statistically detectable but small difference.
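The link between the reported t-value, p-value, and confidence interval can be verified from the printed summary alone. The sketch below copies the rounded values from the t-test output; the variable names are illustrative, and the standard error is back-calculated from the t-value rather than taken from the data.

```r
# Values copied from the Welch t-test output (rounded as printed).
t_stat <- 2.6299
df_welch <- 415.29
mean_fiction <- 4.648333
mean_non_fiction <- 4.595161

# Difference in means and the standard error implied by t = diff / se.
mean_diff <- mean_fiction - mean_non_fiction
se_diff <- mean_diff / t_stat

# Two-sided p-value from the t distribution.
p_two_sided <- 2 * pt(-abs(t_stat), df_welch)

# 95% confidence interval: difference +/- critical value * standard error.
ci_95 <- mean_diff + c(-1, 1) * qt(0.975, df_welch) * se_diff

cat("Difference in means:", round(mean_diff, 4), "\n")
cat("p-value:", signif(p_two_sided, 4), "\n")
cat("95% CI:", round(ci_95, 4), "\n")
```

Small discrepancies in the last decimal places are expected because the inputs are rounded.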
H0: The mean book price is the same in every year.
H1: The mean book price differs in at least one year.
Test Used: One-way ANOVA, suitable for comparing the means of more than two groups (here, prices across 11 years).
Interpretation of P-value: A p-value below the 0.05 level indicates that at least one year's mean price differs from the others. The test does not identify which years differ or whether prices move in a consistent direction over time.
# Group data by year and calculate average price
yearly_prices <- books %>%
  group_by(Year) %>%
  summarise(Average_Price = mean(Price, na.rm = TRUE))
yearly_prices
## # A tibble: 11 × 2
## Year Average_Price
## <int> <dbl>
## 1 2009 15.4
## 2 2010 13.5
## 3 2011 15.1
## 4 2012 15.3
## 5 2013 14.6
## 6 2014 14.6
## 7 2015 10.4
## 8 2016 13.2
## 9 2017 11.4
## 10 2018 10.5
## 11 2019 10.1
# Perform ANOVA test
anova_results <- aov(Price ~ as.factor(Year), data = books)
summary(anova_results)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Year) 10 2241 224.1 1.939 0.0379 *
## Residuals 539 62297 115.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Line plot of average prices over the years
ggplot(yearly_prices, aes(x = Year, y = Average_Price, group = 1)) +
  geom_line() +
  geom_point() +
  labs(title = "Average Book Prices Over the Years", x = "Year", y = "Average Price") +
  theme_minimal()
The ANOVA gives F = 1.939 on 10 and 539 degrees of freedom (11 years minus one, and 550 observations minus 11 groups), with p = 0.0379. Because this is below 0.05, the variation in mean price across years is larger than would be expected from chance alone, and the null hypothesis of equal yearly means is rejected. The F-value and its degrees of freedom measure neither power nor effect size, so the size of the effect has to be read from the sums of squares.
The between-year sum of squares (2241) is small relative to the residual sum of squares (62297): year accounts for about 3.5% of the total variability in book prices (2241 / 64538). The yearly differences are statistically significant but explain only a small share of the variation in price.
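The relationship between the sums of squares, the F-value, and the share of variability explained by year can be checked directly from the ANOVA table. The sketch below uses the rounded values printed above; eta squared is a standard effect-size measure that the original output does not report, and the variable names are illustrative.

```r
# Values copied from the ANOVA table (rounded as printed).
ss_between <- 2241
df_between <- 10
ss_residual <- 62297
df_residual <- 539

# Mean squares and the F statistic.
ms_between <- ss_between / df_between
ms_residual <- ss_residual / df_residual
f_value <- ms_between / ms_residual

# Upper-tail p-value from the F distribution.
p_anova <- pf(f_value, df_between, df_residual, lower.tail = FALSE)

# Eta squared: proportion of total variability attributable to year.
eta_squared <- ss_between / (ss_between + ss_residual)

cat("F =", round(f_value, 3), " p =", round(p_anova, 4), "\n")
cat("Eta squared =", round(eta_squared, 4), "\n")
```

An eta squared of roughly 0.035 means that about 96.5% of the variation in price lies within years rather than between them.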