library(readr)
data <- read_csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")
## Rows: 9994 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Order ID, CustomerName, Category, SubCategory, City, OrderDate, Reg...
## dbl (4): Sales, Discount, Profit, ProfitRange
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 12
## `Order ID` CustomerName Category SubCategory City OrderDate Region Sales
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 OD1 Harish Oil & Masala Masalas Vell… 11/8/17 North 1254
## 2 OD2 Sudha Beverages Health Dri… Kris… 11/8/17 South 749
## 3 OD3 Hussain Food Grains Atta & Flo… Pera… 6/12/17 West 2360
## 4 OD4 Jackson Fruits & Veg… Fresh Vege… Dhar… 10/11/16 South 896
## 5 OD5 Ridhesh Food Grains Organic St… Ooty 10/11/16 South 2355
## 6 OD6 Adavan Food Grains Organic St… Dhar… 6/9/15 West 2305
## # ℹ 4 more variables: Discount <dbl>, Profit <dbl>, State <chr>,
## # ProfitRange <dbl>
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(dplyr)
library(tsibble)
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:zoo':
##
## index
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
theme_set(theme_minimal())
options(scipen = 6)
data$OrderDate <- as.Date(data$OrderDate, format = "%m/%d/%y")
# Remove fully duplicated rows
data1 <- distinct(data)
# Flag rows whose OrderDate already appeared earlier (several orders on the same day)
duplicated_dates <- duplicated(data1$OrderDate)
print(data1[duplicated_dates, ])
## # A tibble: 8,758 × 12
## `Order ID` CustomerName Category SubCategory City OrderDate Region Sales
## <chr> <chr> <chr> <chr> <chr> <date> <chr> <dbl>
## 1 OD2 Sudha Beverages Health Dri… Kris… 2017-11-08 South 749
## 2 OD5 Ridhesh Food Grains Organic St… Ooty 2016-10-11 South 2355
## 3 OD7 Jonas Fruits & V… Fresh Vege… Tric… 2015-06-09 West 826
## 4 OD8 Hafiz Fruits & V… Fresh Frui… Rama… 2015-06-09 West 1847
## 5 OD9 Hafiz Bakery Biscuits Tiru… 2015-06-09 West 791
## 6 OD10 Krithika Bakery Cakes Chen… 2015-06-09 West 1795
## 7 OD11 Ganesh Snacks Chocolates Karur 2015-06-09 West 1903
## 8 OD12 Yadav Eggs, Meat… Eggs Nama… 2015-06-09 West 701
## 9 OD16 Ramesh Oil & Masa… Edible Oil… Kris… 2016-11-22 Centr… 1440
## 10 OD20 Verma Beverages Soft Drinks Kany… 2015-08-27 West 692
## # ℹ 8,748 more rows
## # ℹ 4 more variables: Discount <dbl>, Profit <dbl>, State <chr>,
## # ProfitRange <dbl>
# Aggregate to one row per day by summing sales across orders
data2 <- data1 %>%
  group_by(OrderDate) %>%
  summarize(Sales = sum(Sales))
ts_data <- tsibble(OrderDate = data2$OrderDate, Sales = data2$Sales)
## Using `OrderDate` as index variable.
# Plotting the sales over time
ggplot(ts_data, aes(x = OrderDate, y = Sales)) +
  geom_line() +
  labs(title = "Sales Over Time",
       x = "Date",
       y = "Sales") +
  theme_minimal()
Within each calendar year, sales tend to rise as the year progresses. Because the same pattern repeats from year to year, it suggests seasonality in sales behavior, possibly on top of longer-term growth.
Moreover, the highest sales occur towards the end of each year. This points to seasonal, end-of-year peaks, probably driven by holiday shopping.
The lowest sales occur at the beginning of each year.
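One way to check this seasonal reading is to total sales by calendar month and compare the months across years. The sketch below uses a small made-up data frame (`toy`, with hypothetical values) in place of `ts_data`; only the column names `OrderDate` and `Sales` are taken from the analysis above.

```r
# Hypothetical daily sales standing in for ts_data (values are made up)
toy <- data.frame(
  OrderDate = as.Date(c("2015-01-10", "2015-01-20", "2015-12-05",
                        "2016-01-15", "2016-12-01", "2016-12-20")),
  Sales = c(100, 150, 400, 120, 300, 350)
)

# Extract year and month labels from the dates
toy$Year  <- format(toy$OrderDate, "%Y")
toy$Month <- format(toy$OrderDate, "%m")

# Total sales within each year-month cell
monthly <- aggregate(Sales ~ Year + Month, data = toy, FUN = sum)

# Wide layout: one row per year, one column per month
monthly_wide <- xtabs(Sales ~ Year + Month, data = monthly)
print(monthly_wide)
```

Applied to `ts_data`, consistently larger totals in the late-year columns than in the early-year columns, repeated in every row, would support the end-of-year peak described above.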
# fitting a linear regression model
model <- lm(Sales ~ OrderDate, data = ts_data)
summary(model)
##
## Call:
## lm(formula = Sales ~ OrderDate, data = ts_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15725 -6461 -2175 5068 39896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105722.5937 10476.1869 -10.09 <2e-16 ***
## OrderDate 6.8558 0.6094 11.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9053 on 1234 degrees of freedom
## Multiple R-squared: 0.09303, Adjusted R-squared: 0.09229
## F-statistic: 126.6 on 1 and 1234 DF, p-value: < 2.2e-16
The coefficient for the OrderDate variable is positive (6.8558) and statistically significant (p < 0.001), indicating a significant upward trend in sales over time. Because OrderDate is a date measured in days, the estimate means that expected daily sales rise by about 6.86 (in the units of Sales) with each additional day.
The multiple R-squared is 0.093 (adjusted R-squared: 0.092), so the linear trend explains only about 9% of the variance in daily sales. The trend is therefore statistically significant (p < 0.001) but accounts for a small share of the variability.
There is also no clear indication that the data need to be subset to fit multiple trends. The single linear regression model appears to capture the overall direction of sales over time, although it leaves most of the day-to-day variation unexplained.
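Because `OrderDate` is a `Date`, R stores it as the number of days since 1970-01-01, so the slope is a change in daily sales per calendar day. The following sketch illustrates this on made-up data with a known slope and then rescales the reported estimate to a per-year figure; the toy series and the 365-day year are simplifying assumptions, not part of the original analysis.

```r
# Made-up series with a known increase of 2 units per day
dates <- as.Date("2015-01-01") + 0:9
sales <- 100 + 2 * (0:9)
toy_fit <- lm(sales ~ dates)

# The slope is expressed per day because Dates are numeric day counts
daily_slope <- unname(coef(toy_fit)["dates"])
print(daily_slope)

# Rescale the slope reported in the summary above (6.8558 per day)
reported_slope <- 6.8558
annual_change <- reported_slope * 365  # assumes a 365-day year
print(annual_change)
```

The same reasoning explains the large negative intercept in the summary: it is the fitted value extrapolated back to 1970-01-01, so it has no practical interpretation.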
# plotting observed sales and fitted regression line
plot(ts_data$OrderDate, ts_data$Sales, xlab = "OrderDate", ylab = "Sales", main = "Sales Trend", type = "l")
abline(model, col = "red")
The plot also suggests a gradual increase in sales over time, which aligns with the interpretation of the positive coefficient.
Therefore, the analysis indicates a significant upward trend in sales over time, as evidenced by the positive coefficient, statistically significant p-value, and the above plot.
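# LOWESS smoothing with span f = 0.5. Only the sales vector is supplied, so the
# smoother runs against observation order (1, 2, ...) rather than calendar date.
smoothed_sales_lowess <- stats::lowess(ts_data$Sales, f = 0.5)
# Extract the smoothed values
smoothed_values <- smoothed_sales_lowess$y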
plot(ts_data$OrderDate, ts_data$Sales, type = "l", col = "black", xlab = "OrderDate", ylab = "Sales", main = "Sales with LOWESS Smoothing")
lines(ts_data$OrderDate, smoothed_values, col = "red")
legend("topright", legend = c("Original Sales", "Smoothed (LOWESS)"), col = c("black", "red"), lty = 1)
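`lowess()` called with a single vector smooths against the observation index, which treats the observations as equally spaced. If some calendar dates have no orders, passing the dates as `x` makes the smoother respect the actual gaps. A minimal sketch on made-up, irregularly spaced data (all values are hypothetical):

```r
# Made-up, irregularly spaced daily series (note the gaps between dates)
dates <- as.Date("2015-01-01") + c(0, 1, 2, 10, 11, 30, 31, 32, 60, 61)
sales <- c(10, 12, 11, 20, 22, 35, 33, 36, 50, 52)

# Single-vector call: x defaults to the index 1..n
fit_index <- stats::lowess(sales, f = 0.5)

# Two-vector call: x is the date converted to a day count
fit_date <- stats::lowess(as.numeric(dates), sales, f = 0.5)

# Both fits return one smoothed value per observation, ordered by x
smoothed <- data.frame(
  OrderDate = dates,
  by_index  = fit_index$y,
  by_date   = fit_date$y
)
print(smoothed)

# Check whether any calendar days are missing from the series
has_gaps <- any(as.numeric(diff(dates)) > 1)
print(has_gaps)
```

When the series has no gaps the two calls see the same spacing, so the distinction only matters for irregular data.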
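# Plotting ACF and PACF of the residuals (observed sales minus the LOWESS fit)
lowess_resid <- ts_data$Sales - smoothed_values
# Store the results under new names so the acf() and pacf() functions are not shadowed
resid_acf <- acf(lowess_resid, main = "ACF of Residuals")
resid_pacf <- pacf(lowess_resid, main = "PACF of Residuals")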
The autocorrelation coefficients at all nonzero lags fall within the confidence bounds, meaning there is no statistically significant correlation between the residuals and their past values. This is consistent with the residuals being white noise, which is a desirable property for time series models.
The partial autocorrelation coefficients also fall within the confidence bounds, which again suggests no significant autocorrelation in the residuals. This is a good sign, as it indicates that the LOWESS fit has captured most of the systematic structure in the series.
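The visual check can be complemented with the approximate 95% bounds that `acf()` draws, plus or minus 1.96 / sqrt(n), and with a Ljung-Box test, which tests jointly whether the first several autocorrelations are zero. The sketch below uses simulated white noise as a stand-in for the LOWESS residuals; the seed, the simulated series, and the choice of 10 lags are illustrative assumptions, while n = 1236 is the number of daily observations implied by the regression output above (1234 residual degrees of freedom plus 2 coefficients).

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated white noise standing in for the LOWESS residuals
n <- 1236
toy_resid <- rnorm(n)

# Approximate 95% bounds drawn on the acf() plot
bound <- qnorm(0.975) / sqrt(n)
print(bound)

# Sample autocorrelations at lags 1..10 (lag 0 is always 1, so drop it)
r <- acf(toy_resid, lag.max = 10, plot = FALSE)$acf[-1]
print(sum(abs(r) > bound))  # number of lags outside the bounds

# Ljung-Box test: the null hypothesis is no autocorrelation up to lag 10
lb <- Box.test(toy_resid, lag = 10, type = "Ljung-Box")
print(lb$p.value)
```

A large p-value is consistent with white-noise residuals, while a small one would point to autocorrelation that the smoother has left behind. Even for true white noise, roughly 1 lag in 20 is expected to fall outside the bounds by chance.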