
Title: Week_12_Data_Dive
Output: HTML document

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(xts)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last

library(tsibble)

## Warning: package 'tsibble' was built under R version 4.3.3

## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:zoo':
## 
##     index
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

theme_set(theme_minimal())
options(scipen = 6)

Loading the Diamonds pageviews from Wikipedia

data <- read.csv("C:/Users/rushy/Desktop/First_semester/Statistics_class/pageviews_alltime.csv")
head(data)

##         Date Diamond
## 1 2015-07-01    3692
## 2 2015-07-02    3412
## 3 2015-07-03    3070
## 4 2015-07-04    2969
## 5 2015-07-05    3047
## 6 2015-07-06    3614

Renaming the column to Diamond_price for our analysis (note that the values are daily pageviews of the Wikipedia article, not market prices)

colnames(data)[colnames(data) == "Diamond"] <- "Diamond_price"

summary(data)

##      Date           Diamond_price  
##  Length:3204        Min.   : 1611  
##  Class :character   1st Qu.: 2530  
##  Mode  :character   Median : 2978  
##                     Mean   : 3138  
##                     3rd Qu.: 3579  
##                     Max.   :17916

# Convert the Date column from character to Date class
data$Date <- as.Date(data$Date)

Creating a tsibble object for our time series analysis

# Create a tsibble object
ts_data <- tsibble(date = data$Date, response = data$Diamond_price)

## Using `date` as index variable.
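Time series models such as ARIMA treat the observations as equally spaced, so it can help to confirm that a daily index has no duplicated or missing days. The following is a minimal base-R sketch on a small made-up date vector (`dates` is hypothetical and stands in for `data$Date`); it was not run on the original data.

```r
# Hypothetical daily dates with one missing day (2015-07-04), for illustration only
dates <- as.Date(c("2015-07-01", "2015-07-02", "2015-07-03", "2015-07-05"))

# Duplicated dates would violate the one-row-per-day assumption
n_duplicates <- sum(duplicated(dates))

# Build the full daily calendar and keep the days that are absent from the data
full_range <- seq(min(dates), max(dates), by = "day")
missing_days <- full_range[!(full_range %in% dates)]

n_duplicates
missing_days
```

On a tsibble object, `has_gaps()` and `count_gaps()` from the tsibble package should give a similar answer without building the calendar by hand.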

Let us see how the diamond price varies over time

# Plot the data over time
ggplot(ts_data, aes(x = date, y = response)) +
  geom_line() +
  labs(title = "Response Over Time", x = "Date", y = "Diamond_price")

Interpretation:
No clear trend or seasonality is visible in the raw series; the most prominent features are occasional spikes.

Let us look at the series by calendar month:

ggplot(ts_data, aes(x = date, y = response)) +
  geom_line() +
  labs(title = "Response Over Time", x = "Date", y = "Response") +
  facet_wrap(~ month(date))

Interpretation:

There seems to be no trend or seasonality in the monthly panels either.
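A numeric complement to the faceted plot is the average value per calendar month: if the twelve means are all similar, that is consistent with no monthly seasonality. The sketch below uses simulated data (`dates` and `values` are hypothetical stand-ins for the `date` and `response` columns), so the numbers it prints say nothing about the real series.

```r
set.seed(1)

# Hypothetical two years of daily values with no built-in seasonality
dates <- seq(as.Date("2015-07-01"), by = "day", length.out = 730)
values <- 3000 + rnorm(length(dates), sd = 300)

# Average value for each calendar month ("01" = January, ..., "12" = December)
month_means <- tapply(values, format(dates, "%m"), mean)
round(month_means)
```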

Stationarity test

library(tseries)

## Warning: package 'tseries' was built under R version 4.3.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

response <- data$Diamond_price
adf.test(ts_data$response)

## Warning in adf.test(ts_data$response): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_data$response
## Dickey-Fuller = -4.4575, Lag order = 14, p-value = 0.01
## alternative hypothesis: stationary

Interpretation:
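1. We performed the augmented Dickey-Fuller (ADF) test to check whether the series is stationary. The null hypothesis is that the series has a unit root, i.e. that it is non-stationary.
2. The reported p-value is 0.01 (the warning indicates that the true value is smaller), so we reject the null hypothesis at the 5% level. The test therefore supports treating the series as stationary.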
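To make the logic of the test concrete, here is a simplified sketch of the basic (non-augmented) Dickey-Fuller regression in base R, applied to two simulated series. The function `df_stat` and both series are hypothetical. As far as I know, `adf.test()` from tseries also includes a linear trend and lagged differences, so its statistic and critical values differ from this simplified version.

```r
set.seed(42)
n <- 500

# A stationary AR(1) series and a random walk (unit root), both simulated
stationary <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
random_walk <- cumsum(rnorm(n))

# Dickey-Fuller regression: diff(y)_t = a + g * y_{t-1} + e_t
# Returns the t-statistic on g; large negative values are evidence against a unit root
df_stat <- function(y) {
  dy <- diff(y)
  y_lag <- y[-length(y)]
  fit <- lm(dy ~ y_lag)
  summary(fit)$coefficients["y_lag", "t value"]
}

stat_stationary <- df_stat(stationary)
stat_random_walk <- df_stat(random_walk)

# Approximate asymptotic 5% critical value for the constant-only case
critical_5pct <- -2.86

c(stationary = stat_stationary, random_walk = stat_random_walk)
stat_stationary < critical_5pct
```

The statistic must be compared with Dickey-Fuller critical values rather than the usual t distribution, which is why the sketch hard-codes an approximate cutoff instead of reading a p-value from `summary()`.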

ACF

acf(ts_data$response, lag.max = 24)

PACF

pacf(ts_data$response, lag.max = 24)

Interpretation:

1. The ACF declines gradually, while the PACF drops off almost immediately.
2. Together with the recurring spikes in the ACF at certain lags, this pattern suggests that the series has an autoregressive (AR) structure, meaning that the current value of the response variable is influenced by its past values.
3. There seems to be no seasonality in the data.
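The pattern described above (slowly decaying ACF, PACF that cuts off quickly) can be reproduced on a series whose structure is known. The sketch below simulates an AR(2) process with made-up coefficients and tabulates both functions; it is an illustration only and does not use the pageviews data.

```r
set.seed(123)

# Simulated AR(2) process: x_t = 0.6 * x_{t-1} + 0.2 * x_{t-2} + e_t
# (coefficients are hypothetical, chosen only to give a stationary series)
x <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 2000)

# acf() reports lag 0 first, so drop it; pacf() starts at lag 1
acf_vals <- as.numeric(acf(x, lag.max = 10, plot = FALSE)$acf)[-1]
pacf_vals <- as.numeric(pacf(x, lag.max = 10, plot = FALSE)$acf)

# ACF decays gradually; PACF is close to zero beyond lag 2
round(data.frame(lag = 1:10, acf = acf_vals, pacf = pacf_vals), 2)
```

For a pure AR(p) process the PACF is theoretically zero after lag p, which is why the PACF is commonly used to suggest a starting value for the AR order.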

Let us fit a linear regression line to the series

ggplot(ts_data, aes(x = date, y = response)) +
  geom_line() +
  labs(title = "Response Over Time")+
  geom_smooth(method = 'lm', color = 'blue', se=FALSE)

## `geom_smooth()` using formula = 'y ~ x'

Interpretation:
The fitted line slopes gently downward, suggesting a slight decreasing trend that is hard to see in the raw plot.
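The slope of a fitted trend line can also be read off numerically with `lm()`. The sketch below does this on a simulated series with a known drift of -0.1 units per day (`dates`, `values`, and `day_index` are hypothetical names), so the printed slope is not the slope of the pageviews data.

```r
set.seed(7)

# Hypothetical daily series with a small built-in downward drift
dates <- seq(as.Date("2015-07-01"), by = "day", length.out = 1000)
values <- 3000 - 0.1 * seq_along(dates) + rnorm(length(dates), sd = 50)

# Days elapsed since the first observation, so the slope is in units per day
day_index <- as.numeric(dates - min(dates))

fit <- lm(values ~ day_index)
slope_per_day <- unname(coef(fit)["day_index"])
slope_p_value <- summary(fit)$coefficients["day_index", "Pr(>|t|)"]

slope_per_day
slope_p_value
```

With autocorrelated data such as a daily time series, the ordinary least-squares p-value tends to be too optimistic, so it is best read as a rough indication rather than a formal test.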

Let us see how different ARIMA models perform.

# Fit different ARIMA models and compare their performance
arima_model_1 <- arima(ts_data$response, order = c(2, 0, 0))
arima_model_2 <- arima(ts_data$response, order = c(3, 0, 0))
arima_model_3 <- arima(ts_data$response, order = c(0, 0, 1))
arima_model_4 <- arima(ts_data$response, order = c(2, 0, 1))
arima_model_5 <- arima(ts_data$response, order = c(3, 0, 1))
AIC(arima_model_1, arima_model_2, arima_model_3, arima_model_4, arima_model_5)

##               df      AIC
## arima_model_1  4 49427.82
## arima_model_2  5 49340.25
## arima_model_3  3 50922.12
## arima_model_4  5 48879.59
## arima_model_5  6 48873.52

Interpretation:

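ARIMA model 5, ARIMA(3, 0, 1), has the lowest AIC (48873.52) and therefore performs best among the five candidates. Model 4, ARIMA(2, 0, 1), is a close second (48879.59) with one fewer parameter.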
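One way to make the comparison easier to read is to express each AIC as a difference from the lowest one. The sketch below simply re-enters the AIC values printed in the table above (the vector `aic` is typed in by hand, not extracted from the fitted model objects).

```r
# AIC values copied from the comparison table above
aic <- c(arima_model_1 = 49427.82,
         arima_model_2 = 49340.25,
         arima_model_3 = 50922.12,
         arima_model_4 = 48879.59,
         arima_model_5 = 48873.52)

# Difference from the best (lowest) AIC; 0 marks the preferred model
delta_aic <- round(aic - min(aic), 2)
best_model <- names(which.min(aic))

sort(delta_aic)
best_model
```

Models 4 and 5 are separated by about 6 AIC units, while the pure AR and MA candidates trail by several hundred or more, so the main gain appears to come from combining AR and MA terms.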