library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(ggthemes)
library(purrr)
library(pwr)
library(stats)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(broom)
library(tsibble)
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
books <- read.csv("bestsellers.csv")
str(books)
## 'data.frame': 550 obs. of 7 variables:
## $ Name : chr "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
## $ Author : chr "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
## $ User.Rating: num 4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
## $ Reviews : int 17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
## $ Price : int 8 22 15 6 12 11 30 15 3 8 ...
## $ Year : int 2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
## $ Genre : chr "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...
books <- books %>%
mutate(
Date = as.Date(paste(Year, "01", "01", sep = "-")), # Creating a Date object
id = row_number() # Adding an id column
)
books_yearly <- books %>%
group_by(Year) %>%
summarise(Price = mean(Price, na.rm = TRUE)) %>%
mutate(Date = as.Date(paste(Year, "-01-01", sep = "")))
books_yearly
## # A tibble: 11 × 3
## Year Price Date
## <int> <dbl> <date>
## 1 2009 15.4 2009-01-01
## 2 2010 13.5 2010-01-01
## 3 2011 15.1 2011-01-01
## 4 2012 15.3 2012-01-01
## 5 2013 14.6 2013-01-01
## 6 2014 14.6 2014-01-01
## 7 2015 10.4 2015-01-01
## 8 2016 13.2 2016-01-01
## 9 2017 11.4 2017-01-01
## 10 2018 10.5 2018-01-01
## 11 2019 10.1 2019-01-01
books_ts <- as_tsibble(books_yearly, index = Date)
books_ts
## # A tsibble: 11 x 3 [1D]
## Year Price Date
## <int> <dbl> <date>
## 1 2009 15.4 2009-01-01
## 2 2010 13.5 2010-01-01
## 3 2011 15.1 2011-01-01
## 4 2012 15.3 2012-01-01
## 5 2013 14.6 2013-01-01
## 6 2014 14.6 2014-01-01
## 7 2015 10.4 2015-01-01
## 8 2016 13.2 2016-01-01
## 9 2017 11.4 2017-01-01
## 10 2018 10.5 2018-01-01
## 11 2019 10.1 2019-01-01
books <- books %>% mutate(id = row_number())
books_ts <- as_tsibble(books, index = Date, key = id)
Visualize the time series data for ‘Price’.
books_ts %>%
ggplot(aes(x = Date, y = Price)) +
geom_line() +
labs(title = "Book Prices Over Time", x = "Year", y = "Price")
It appears that there are some outliers or instances of high variation at certain points, but there doesn’t seem to be a clear upward or downward trend.
Use linear regression to detect any trends.
trend_model <- lm(Price ~ Date, data = books)
summary(trend_model)
##
## Call:
## lm(formula = Price ~ Date, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.210 -5.990 -2.517 2.727 91.900
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.3092665 6.3783574 5.693 2.04e-08 ***
## Date -0.0014442 0.0003959 -3.648 0.000289 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.72 on 548 degrees of freedom
## Multiple R-squared: 0.02371, Adjusted R-squared: 0.02193
## F-statistic: 13.31 on 1 and 548 DF, p-value: 0.0002894
The output from the linear model indicates that there is a
statistically significant relationship between time
(Date) and
Price. The negative coefficient for
Date (-0.0014442) with a very low p-value
indicates a slight downward trend in book prices over time.
Specifically, the model suggests that each passing year is associated with a decrease in price by approximately $0.0014, given the unit of time in the Date variable is in days. However, the effect is small, and the R-squared value suggests that this model explains only a small portion of the variance in book prices.
Employ smoothing techniques to uncover potential seasonal patterns.
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
# Use a loess smooth to explore seasonality
books_ts %>%
ggplot(aes(x = Date, y = Price)) +
geom_line() +
geom_smooth(method = "loess") +
labs(title = "Smoothed Book Prices Over Time", x = "Year", y = "Price")
## `geom_smooth()` using formula = 'y ~ x'
This gives us a better view of the overall trend, smoothing out the individual variations and highlighting a potential downward trend in prices over the years.
Illustrate the seasonality using ACF or PACF plots.
# Ensuring 'Date' is in the right format and 'Price' has no missing values
books <- na.omit(books)
books <- books[order(books$Date), ] # Ordering by date just in case
# Now create a time series object starting from the minimum year and with frequency 1
start_year <- min(books$Year)
end_year <- max(books$Year)
frequency <- 1 # Assuming yearly data points for simplicity
books_ts_series <- ts(books$Price, start = c(start_year, 1), end = c(end_year, 1), frequency = frequency)
# Plot ACF and PACF
acf(books_ts_series)
pacf(books_ts_series)
The Autocorrelation Function (ACF) plot shows that there is a significant autocorrelation at lag 0, as expected, because each data point is perfectly correlated with itself. However, the lack of spikes at other lags indicates that there is no clear autocorrelation pattern at fixed intervals, which suggests that there is no strong seasonality in the data based on this plot alone.
The Partial Autocorrelation Function (PACF) plot is used to identify the extent of the relationship between the series and its lags after removing the effects of earlier lags. This plot does not show significant spikes beyond the immediate lag, which typically would suggest the potential order of an AR model if you were building one. Like the ACF, it does not suggest strong seasonality.