Title: Week_12_Data_Dive
Output: HTML document
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.3
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:zoo':
##
## index
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
theme_set(theme_minimal())
options(scipen = 6)
Loading the Diamond pageviews data from Wikipedia
data <- read.csv("C:/Users/rushy/Desktop/First_semester/Statistics_class/pageviews_alltime.csv")
head(data)
## Date Diamond
## 1 2015-07-01 3692
## 2 2015-07-02 3412
## 3 2015-07-03 3070
## 4 2015-07-04 2969
## 5 2015-07-05 3047
## 6 2015-07-06 3614
Renaming the Diamond column to Diamond_price for our analysis (the values themselves are daily Wikipedia pageview counts)
colnames(data)[colnames(data) == "Diamond"] <- "Diamond_price"
summary(data)
## Date Diamond_price
## Length:3204 Min. : 1611
## Class :character 1st Qu.: 2530
## Mode :character Median : 2978
## Mean : 3138
## 3rd Qu.: 3579
## Max. :17916
# Convert the Date column from character to Date format
data$Date <- as.Date(data$Date)
Creating a tsibble object for our time series analysis
# Create a tsibble object
ts_data <- tsibble(date = data$Date, response = data$Diamond_price)
## Using `date` as index variable.
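A daily time series is easiest to analyse when every date appears exactly once. The base-R sketch below shows one way to check this before building the tsibble. The `dates` vector is made up for illustration and is not taken from the pageviews file.

```r
# Illustrative dates only (assumption): one day, 2015-07-03, is deliberately missing.
dates <- as.Date(c("2015-07-01", "2015-07-02", "2015-07-04", "2015-07-05"))

# Any date that appears more than once?
has_duplicates <- anyDuplicated(dates) > 0

# Every calendar day between the first and last observation
full_range <- seq(min(dates), max(dates), by = "day")

# Days in the full range that are absent from the data
missing_days <- full_range[!full_range %in% dates]

print(has_duplicates)
print(missing_days)
```

The same check could be run on `data$Date` after the conversion above. If it reports missing days, they would need to be filled or handled before fitting models that assume regular spacing.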
Let us see how the diamond price varies over time
# Plot the data over time
ggplot(ts_data, aes(x = date, y = response)) +
  geom_line() +
  labs(title = "Response Over Time", x = "Date", y = "Diamond_price")
Interpretation:
There is no obvious trend or seasonality in the data; we
only see occasional spikes.
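Spikes like these can also be flagged numerically rather than by eye. The sketch below uses a median plus k times MAD rule in base R. The multiplier `k = 5` is an arbitrary assumption, and the `views` vector is made up for illustration (only 17916 matches the maximum reported by `summary(data)`).

```r
# Illustrative values (assumption), with one large spike at position 5
views <- c(3000, 3100, 2900, 3050, 17916, 2950)

center <- median(views)   # robust centre of the series
spread <- mad(views)      # median absolute deviation, scaled by 1.4826 by default
k <- 5                    # assumed multiplier; larger values flag fewer points

threshold <- center + k * spread
spike_idx <- which(views > threshold)

print(threshold)
print(spike_idx)
```

The median and MAD are used here instead of the mean and standard deviation because a single extreme value barely moves them.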
Let us look at the data by month:
ggplot(ts_data, aes(x = date, y = response)) +
  geom_line() +
  labs(title = "Response Over Time", x = "Date", y = "Response") +
  facet_wrap(~ month(date))
Interpretation:
There seems to be no trend or seasonality in the monthly graphs over time.
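A numeric summary can back up the visual impression from the facets. The sketch below averages a series by calendar month in base R. The data is simulated without any seasonal effect (an assumption for illustration), so the monthly means should all sit close to the overall level.

```r
set.seed(7)

# Simulated daily series (assumption): flat level of 3000 plus noise, two years long
dates <- seq(as.Date("2015-07-01"), by = "day", length.out = 730)
views <- round(3000 + rnorm(730, sd = 300))

# Mean of the series for each calendar month ("01" = January, ..., "12" = December)
monthly_mean <- tapply(views, format(dates, "%m"), mean)

print(round(monthly_mean))
```

Applied to the real data, large and repeatable differences between months would point to yearly seasonality, while roughly equal means would support the reading of the plots.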
Stationarity test
library(tseries)
## Warning: package 'tseries' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
response <- data$Diamond_price
adf.test(ts_data$response)
## Warning in adf.test(ts_data$response): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: ts_data$response
## Dickey-Fuller = -4.4575, Lag order = 14, p-value = 0.01
## alternative hypothesis: stationary
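Interpretation:
1. We performed the Augmented Dickey-Fuller (ADF) test to check whether our data is
stationary.
2. The null hypothesis of the ADF test is that the series has a unit root, i.e. is
non-stationary. The p-value is below 0.01 (the printed 0.01 is an upper bound, as the
warning notes), so we reject the null hypothesis in favour of the alternative
that our data is stationary.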
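To see what the test statistic measures, the sketch below runs a simplified Dickey-Fuller regression by hand in base R. It regresses the first difference on the lagged level plus a constant and a linear trend, and takes the t-statistic of the lagged level. This is not what `adf.test()` does in full: it leaves out the lagged differences (the "Lag order" in the output above), and `df_stat()` is a hypothetical helper written for this illustration. The value -3.41 is the approximate 5% large-sample critical value for the version with constant and trend.

```r
# Hypothetical helper: simplified Dickey-Fuller statistic (no lagged differences)
df_stat <- function(y) {
  n     <- length(y)
  dy    <- diff(y)          # y_t - y_{t-1}
  y_lag <- y[-n]            # y_{t-1}
  tt    <- seq_along(dy)    # linear time trend
  fit   <- lm(dy ~ y_lag + tt)
  # t-statistic on the lagged level; strongly negative values suggest stationarity
  summary(fit)$coefficients["y_lag", "t value"]
}

set.seed(1)
noise <- rnorm(500)           # stationary white noise
walk  <- cumsum(rnorm(500))   # random walk, which has a unit root

stat_noise <- df_stat(noise)
stat_walk  <- df_stat(walk)

print(c(noise = stat_noise, walk = stat_walk))
```

For the white noise series the statistic lies far below -3.41, so the unit-root null is rejected. For a random walk it usually stays above that value, although any single simulated path can behave differently.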
ACF
acf(ts_data$response, lag.max = 24)
PACF
pacf(ts_data$response, lag.max = 24)
Interpretation:
1. The ACF declines gradually, while the PACF drops off almost
immediately.
2. This pattern, together with the recurring spikes in the ACF at certain lags,
indicates that the time series likely has an autoregressive (AR) structure. This
means that the current value of the response variable is influenced by
its past values.
3. There seems to be no seasonality in the data.
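The link between this ACF/PACF shape and an AR process can be checked on simulated data. The sketch below generates an AR(1) series with base R's `arima.sim()` and computes both functions without plotting. The coefficient 0.7 and the series length are assumptions chosen for illustration.

```r
set.seed(2024)

# Simulated AR(1) series with coefficient 0.7 (assumed values, not fitted to the data)
x <- arima.sim(model = list(ar = 0.7), n = 5000)

# Sample ACF for lags 1..10 (the first element returned by acf() is lag 0, so drop it)
acf_vals  <- acf(x, lag.max = 10, plot = FALSE)$acf[-1]

# Sample PACF for lags 1..10 (pacf() starts at lag 1)
pacf_vals <- as.numeric(pacf(x, lag.max = 10, plot = FALSE)$acf)

print(round(acf_vals, 2))    # decays gradually
print(round(pacf_vals, 2))   # large at lag 1, near zero afterwards
```

For an AR(p) process the PACF is expected to cut off after lag p, which is why the PACF is a common starting point for choosing the AR order.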
Let us fit a linear regression trend line
ggplot(ts_data, aes(x = date, y = response)) +
  geom_line() +
  labs(title = "Response Over Time") +
  geom_smooth(method = 'lm', color = 'blue', se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Interpretation:
The fitted regression line slopes gently downward, so the series appears to decrease gradually over time.
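The slope of such a trend line can be read off directly by fitting the regression with `lm()`. The sketch below does this on simulated data with a known downward slope of 0.5 per day (an assumption for illustration, not an estimate from the pageviews series).

```r
set.seed(123)

# Simulated daily series (assumption): level 3000, slope -0.5 per day, plus noise
dates <- seq(as.Date("2015-07-01"), by = "day", length.out = 1000)
t_idx <- as.numeric(dates - min(dates))   # days since the first observation
views <- 3000 - 0.5 * t_idx + rnorm(1000, sd = 200)

fit <- lm(views ~ t_idx)

slope   <- unname(coef(fit)["t_idx"])                        # change per day
slope_p <- summary(fit)$coefficients["t_idx", "Pr(>|t|)"]    # p-value of the slope

print(slope)
print(slope_p)
```

On the real data, the equivalent fit would be `lm(response ~ as.numeric(date), data = ts_data)`. Note that the usual p-value assumes independent errors, which is optimistic for an autocorrelated series.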
Let us see how different ARIMA models perform.
# Fit different ARIMA models and compare their performance
arima_model_1 <- arima(ts_data$response, order = c(2, 0, 0))
arima_model_2 <- arima(ts_data$response, order = c(3, 0, 0))
arima_model_3 <- arima(ts_data$response, order = c(0, 0, 1))
arima_model_4 <- arima(ts_data$response, order = c(2, 0, 1))
arima_model_5 <- arima(ts_data$response, order = c(3, 0, 1))
AIC(arima_model_1, arima_model_2, arima_model_3, arima_model_4, arima_model_5)
## df AIC
## arima_model_1 4 49427.82
## arima_model_2 5 49340.25
## arima_model_3 3 50922.12
## arima_model_4 5 48879.59
## arima_model_5 6 48873.52
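Interpretation:
Model 5, ARIMA(3, 0, 1), has the lowest AIC (48873.52) and therefore the best fit among the candidates. Its advantage over model 4, ARIMA(2, 0, 1), is small (about 6 AIC points), while the pure MA(1) model 3 fits clearly worst.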
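The comparison can be written more compactly by looping over candidate orders, and a lowest-AIC model is usually followed by a residual check. The sketch below does both in base R on a simulated AR(2) series. The simulated data, the candidate list, and the Ljung-Box lag of 10 are all assumptions for illustration.

```r
set.seed(42)

# Simulated AR(2) series (assumed coefficients), standing in for the real data
x <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 2000)

# Candidate (p, d, q) orders to compare
orders <- list(ar1 = c(1, 0, 0), ar2 = c(2, 0, 0), ma1 = c(0, 0, 1))

# Fit each candidate and collect its AIC
aics <- sapply(orders, function(ord) AIC(arima(x, order = ord)))
best <- names(which.min(aics))

# Refit the best model and test its residuals for leftover autocorrelation
fit <- arima(x, order = orders[[best]])
lb  <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box",
                fitdf = sum(orders[[best]][c(1, 3)]))   # p + q estimated coefficients

print(round(aics, 1))
print(best)
print(lb$p.value)
```

A large Ljung-Box p-value means the residuals show no evidence of remaining autocorrelation. A small one would suggest that the chosen order is still missing some structure, even if its AIC is the lowest in the list.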