Loading all the necessary libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2) 
library(ggthemes)
library(purrr)
library(pwr)
library(stats)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(broom)
library(tsibble)
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

Load the Dataset

books <- read.csv("bestsellers.csv")

Check the structure of the data

str(books)
## 'data.frame':    550 obs. of  7 variables:
##  $ Name       : chr  "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
##  $ Author     : chr  "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
##  $ User.Rating: num  4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
##  $ Reviews    : int  17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
##  $ Price      : int  8 22 15 6 12 11 30 15 3 8 ...
##  $ Year       : int  2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
##  $ Genre      : chr  "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...

Convert Year to Date in R

Create an ID column

books <- books %>%
  mutate(
    Date = as.Date(paste(Year, "01", "01", sep = "-")), # Creating a Date object
    id = row_number() # Adding an id column
  )
books_yearly <- books %>%
  group_by(Year) %>%
  summarise(Price = mean(Price, na.rm = TRUE)) %>%
  mutate(Date = as.Date(paste(Year, "-01-01", sep = "")))

books_yearly
## # A tibble: 11 × 3
##     Year Price Date      
##    <int> <dbl> <date>    
##  1  2009  15.4 2009-01-01
##  2  2010  13.5 2010-01-01
##  3  2011  15.1 2011-01-01
##  4  2012  15.3 2012-01-01
##  5  2013  14.6 2013-01-01
##  6  2014  14.6 2014-01-01
##  7  2015  10.4 2015-01-01
##  8  2016  13.2 2016-01-01
##  9  2017  11.4 2017-01-01
## 10  2018  10.5 2018-01-01
## 11  2019  10.1 2019-01-01

Create a tsibble Object

books_ts <- as_tsibble(books_yearly, index = Date)
books_ts
## # A tsibble: 11 x 3 [1D]
##     Year Price Date      
##    <int> <dbl> <date>    
##  1  2009  15.4 2009-01-01
##  2  2010  13.5 2010-01-01
##  3  2011  15.1 2011-01-01
##  4  2012  15.3 2012-01-01
##  5  2013  14.6 2013-01-01
##  6  2014  14.6 2014-01-01
##  7  2015  10.4 2015-01-01
##  8  2016  13.2 2016-01-01
##  9  2017  11.4 2017-01-01
## 10  2018  10.5 2018-01-01
## 11  2019  10.1 2019-01-01
books <- books %>% mutate(id = row_number())
books_ts <- as_tsibble(books, index = Date, key = id)

Plot the Data Over Time

Visualize the time series data for ‘Price’.

books_ts %>% 
  ggplot(aes(x = Date, y = Price)) +
  geom_line() +
  labs(title = "Book Prices Over Time", x = "Year", y = "Price")

It appears that there are some outliers or instances of high variation at certain points, but there doesn’t seem to be a clear upward or downward trend.

Linear Regression for Trend Detection

Use linear regression to detect any trends.

trend_model <- lm(Price ~ Date, data = books)
summary(trend_model)
## 
## Call:
## lm(formula = Price ~ Date, data = books)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.210  -5.990  -2.517   2.727  91.900 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.3092665  6.3783574   5.693 2.04e-08 ***
## Date        -0.0014442  0.0003959  -3.648 0.000289 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.72 on 548 degrees of freedom
## Multiple R-squared:  0.02371,    Adjusted R-squared:  0.02193 
## F-statistic: 13.31 on 1 and 548 DF,  p-value: 0.0002894

The output from the linear model indicates that there is a statistically significant relationship between time (Date) and Price. The negative coefficient for Date (-0.0014442) with a very low p-value indicates a slight downward trend in book prices over time.

Specifically, the model suggests that each passing year is associated with a decrease in price by approximately $0.0014, given the unit of time in the Date variable is in days. However, the effect is small, and the R-squared value suggests that this model explains only a small portion of the variance in book prices.

Smoothing and Seasonality Detection

Employ smoothing techniques to uncover potential seasonal patterns.

library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
# Use a loess smooth to explore seasonality
books_ts %>%
  ggplot(aes(x = Date, y = Price)) +
  geom_line() +
  geom_smooth(method = "loess") +
  labs(title = "Smoothed Book Prices Over Time", x = "Year", y = "Price")
## `geom_smooth()` using formula = 'y ~ x'

This gives us a better view of the overall trend, smoothing out the individual variations and highlighting a potential downward trend in prices over the years.

ACF and PACF for Seasonality Illustration

Illustrate the seasonality using ACF or PACF plots.

# Ensuring 'Date' is in the right format and 'Price' has no missing values
books <- na.omit(books)
books <- books[order(books$Date), ]  # Ordering by date just in case

# Now create a time series object starting from the minimum year and with frequency 1
start_year <- min(books$Year)
end_year <- max(books$Year)
frequency <- 1  # Assuming yearly data points for simplicity

books_ts_series <- ts(books$Price, start = c(start_year, 1), end = c(end_year, 1), frequency = frequency)

# Plot ACF and PACF
acf(books_ts_series)

pacf(books_ts_series)

The Autocorrelation Function (ACF) plot shows that there is a significant autocorrelation at lag 0, as expected, because each data point is perfectly correlated with itself. However, the lack of spikes at other lags indicates that there is no clear autocorrelation pattern at fixed intervals, which suggests that there is no strong seasonality in the data based on this plot alone.

The Partial Autocorrelation Function (PACF) plot is used to identify the extent of the relationship between the series and its lags after removing the effects of earlier lags. This plot does not show significant spikes beyond the immediate lag, which typically would suggest the potential order of an AR model if you were building one. Like the ACF, it does not suggest strong seasonality.