Data Dive — Time-based Data

Loading the “Supermart” CSV file

library(readr)
data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")

## Rows: 9994 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (4): Sales, Discount, Profit, ProfitRange
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(data)

## # A tibble: 6 × 12
##   `Order ID` CustomerName Category      SubCategory City  OrderDate Region Sales
##   <chr>      <chr>        <chr>         <chr>       <chr> <chr>     <chr>  <dbl>
## 1 OD1        Harish       Oil & Masala  Masalas     Vell… 11/8/17   North   1254
## 2 OD2        Sudha        Beverages     Health Dri… Kris… 11/8/17   South    749
## 3 OD3        Hussain      Food Grains   Atta & Flo… Pera… 6/12/17   West    2360
## 4 OD4        Jackson      Fruits & Veg… Fresh Vege… Dhar… 10/11/16  South    896
## 5 OD5        Ridhesh      Food Grains   Organic St… Ooty  10/11/16  South   2355
## 6 OD6        Adavan       Food Grains   Organic St… Dhar… 6/9/15    West    2305
## # ℹ 4 more variables: Discount <dbl>, Profit <dbl>, State <chr>,
## #   ProfitRange <dbl>

Loading the required packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(xts)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last

library(dplyr)
library(tsibble)

## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:zoo':
## 
##     index
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

theme_set(theme_minimal())
options(scipen = 6)

Choosing the column ‘OrderDate’ and formatting it

data$OrderDate <- as.Date(data$OrderDate, format = "%m/%d/%y")

Analyzing Sales over time

# removing duplictaes
data1 <- distinct(data)

duplicated_dates <- duplicated(data1$OrderDate)
print(data1[duplicated_dates, ])

## # A tibble: 8,758 × 12
##    `Order ID` CustomerName Category    SubCategory City  OrderDate  Region Sales
##    <chr>      <chr>        <chr>       <chr>       <chr> <date>     <chr>  <dbl>
##  1 OD2        Sudha        Beverages   Health Dri… Kris… 2017-11-08 South    749
##  2 OD5        Ridhesh      Food Grains Organic St… Ooty  2016-10-11 South   2355
##  3 OD7        Jonas        Fruits & V… Fresh Vege… Tric… 2015-06-09 West     826
##  4 OD8        Hafiz        Fruits & V… Fresh Frui… Rama… 2015-06-09 West    1847
##  5 OD9        Hafiz        Bakery      Biscuits    Tiru… 2015-06-09 West     791
##  6 OD10       Krithika     Bakery      Cakes       Chen… 2015-06-09 West    1795
##  7 OD11       Ganesh       Snacks      Chocolates  Karur 2015-06-09 West    1903
##  8 OD12       Yadav        Eggs, Meat… Eggs        Nama… 2015-06-09 West     701
##  9 OD16       Ramesh       Oil & Masa… Edible Oil… Kris… 2016-11-22 Centr…  1440
## 10 OD20       Verma        Beverages   Soft Drinks Kany… 2015-08-27 West     692
## # ℹ 8,748 more rows
## # ℹ 4 more variables: Discount <dbl>, Profit <dbl>, State <chr>,
## #   ProfitRange <dbl>

data2 <- data1 %>% 
  group_by(OrderDate) %>% 
  summarize(Sales = sum(Sales))

Creating a tsibble object of OrderDate and Sales

ts_data <- tsibble(OrderDate = data2$OrderDate, Sales = data2$Sales)

## Using `OrderDate` as index variable.

# Plotting the sales over time
ggplot(ts_data, aes(x = OrderDate, y = Sales)) +
  geom_line() +
  labs(title = "Sales Over Time",
       x = "Date",
       y = "Sales") +
  theme_minimal()

The sales tend to increase as the year progresses within each window of time. This might suggest a positive trend in sales over time, indicating potential growth or seasonal patterns in sales behavior.
Moreover, the highest sales occurred towards the end of each year. This could indicate seasonal peaks or end-of-the-year trends where sales are typically higher, prorbably due to holiday shopping seasons.
Also, the least sales seemed to occur at the beginning of each year.

# fitting a linear regression model
model <- lm(Sales ~ OrderDate, data = ts_data)
summary(model)

## 
## Call:
## lm(formula = Sales ~ OrderDate, data = ts_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15725  -6461  -2175   5068  39896 
## 
## Coefficients:
##                 Estimate   Std. Error t value Pr(>|t|)    
## (Intercept) -105722.5937   10476.1869  -10.09   <2e-16 ***
## OrderDate         6.8558       0.6094   11.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9053 on 1234 degrees of freedom
## Multiple R-squared:  0.09303,    Adjusted R-squared:  0.09229 
## F-statistic: 126.6 on 1 and 1234 DF,  p-value: < 2.2e-16

The coefficient for the OrderDate variable is positive (6.8558) and statistically significant (p < 0.001), indicating that there is a significant upward trend in sales over time. This suggests that as time increases, sales also tend to increase.

The Adjusted R-squared value is 0.093, indicating that approximately 9.3% of the variance in sales can be explained by the linear trend over time. While this suggests that the linear trend explains a small proportion of the variability in sales, the trend is statistically significant (p < 0.001).

Also, there is no indication that the data needs to be subsetted for multiple trends. The single linear regression model fitted appears to capture the overall trend in sales over time.

# plotting observed sales and fitted regression line
plot(ts_data$OrderDate, ts_data$Sales, xlab = "OrderDate", ylab = "Sales", main = "Sales Trend", type = "l")
abline(model, col = "red")

The plot also suggests a gradual increase in sales over time, which aligns with the interpretation of the positive coefficient.

Therefore, the analysis indicates a significant upward trend in sales over time, as evidenced by the positive coefficient, statistically significant p-value, and the above plot.

Performing LOWESS smoothing

smoothed_sales_lowess <- stats::lowess(ts_data$Sales, f = 0.5)

# to get smoothed values
smoothed_values <- smoothed_sales_lowess$y

plot(ts_data$OrderDate, ts_data$Sales, type = "l", col = "black", xlab = "OrderDate", ylab = "Sales", main = "Sales with LOWESS Smoothing")
lines(ts_data$OrderDate, smoothed_values, col = "red")
legend("topright", legend = c("Original Sales", "Smoothed (LOWESS)"), col = c("black", "red"), lty = 1)

# Plotting ACF and PACF of residuals (after LOWESS smoothing)
acf <- acf(ts_data$Sales - smoothed_values, main = "ACF of Residuals")

pacf <- pacf(ts_data$Sales - smoothed_values, main = "PACF of Residuals")

The autocorrelation coefficients for all lags are within the confidence interval, meaning there is no statistically significant correlation between the residuals and their past values. This may suggest that the residuals are white noise, which is a desirable property for time series models.

The fact that the partial autocorrelation coefficients in the plot are all within the confidence bounds may suggest that there is no significant autocorrelation in the residuals. This is a good sign, as it indicates that the model is fitting the data well.