Our data dive this week focused on time series modeling using the garment worker productivity dataset. The goal was to begin preparing the data for time series analysis by working with the date column and understanding how productivity changes over time. Since the dataset includes daily records, it provides a strong foundation for exploring trends, patterns, and seasonal effects. However, daily data can show short-term fluctuations that make it harder to see broader patterns. For this reason, more attention was given to weekly, monthly, and quarterly groupings, which smooth out noise and reveal consistent trends and seasonal behavior over time.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:zoo':
##
## index
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
theme_set(theme_minimal())
data <- read.csv("C:/Users/rbada/Downloads/productivity+prediction+of+garment+employees/garments_worker_productivity.csv")
# Convert date column safely
library(lubridate)
data <- data %>%
mutate(date = mdy(trimws(date))) %>%
filter(!is.na(date))
summary(data$date)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2015-01-01" "2015-01-18" "2015-02-03" "2015-02-04" "2015-02-23" "2015-03-11"
cleaned_data <- data %>%
group_by(date) %>%
summarise(avg_productivity = mean(actual_productivity, na.rm = TRUE)) %>%
complete(date = seq(min(date, na.rm = TRUE), max(date, na.rm = TRUE), by = "day"))
head(cleaned_data)
## # A tibble: 6 × 2
## date avg_productivity
## <date> <dbl>
## 1 2015-01-01 0.719
## 2 2015-01-02 NA
## 3 2015-01-03 0.803
## 4 2015-01-04 0.780
## 5 2015-01-05 0.796
## 6 2015-01-06 0.806
nrow(cleaned_data)
## [1] 70
summary(cleaned_data)
## date avg_productivity
## Min. :2015-01-01 Min. :0.6341
## 1st Qu.:2015-01-18 1st Qu.:0.6996
## Median :2015-02-04 Median :0.7316
## Mean :2015-02-04 Mean :0.7343
## 3rd Qu.:2015-02-21 3rd Qu.:0.7723
## Max. :2015-03-11 Max. :0.8550
## NA's :11
weekly_productivity <- cleaned_data %>%
mutate(week = floor_date(date, unit = "week")) %>%
group_by(week) %>%
summarise(avg_productivity = mean(avg_productivity, na.rm = TRUE))
head(weekly_productivity)
## # A tibble: 6 × 2
## week avg_productivity
## <date> <dbl>
## 1 2014-12-28 0.761
## 2 2015-01-04 0.780
## 3 2015-01-11 0.742
## 4 2015-01-18 0.705
## 5 2015-01-25 0.785
## 6 2015-02-01 0.771
nrow(weekly_productivity)
## [1] 11
summary(weekly_productivity)
## week avg_productivity
## Min. :2014-12-28 Min. :0.6701
## 1st Qu.:2015-01-14 1st Qu.:0.7057
## Median :2015-02-01 Median :0.7408
## Mean :2015-02-01 Mean :0.7354
## 3rd Qu.:2015-02-18 3rd Qu.:0.7657
## Max. :2015-03-08 Max. :0.7853
monthly_productivity <- cleaned_data %>%
mutate(month = floor_date(date, unit = "month")) %>%
group_by(month) %>%
summarise(avg_productivity = mean(avg_productivity, na.rm = TRUE))
head(monthly_productivity)
## # A tibble: 3 × 2
## month avg_productivity
## <date> <dbl>
## 1 2015-01-01 0.754
## 2 2015-02-01 0.719
## 3 2015-03-01 0.718
nrow(monthly_productivity)
## [1] 3
summary(monthly_productivity)
## month avg_productivity
## Min. :2015-01-01 Min. :0.7185
## 1st Qu.:2015-01-16 1st Qu.:0.7188
## Median :2015-02-01 Median :0.7192
## Mean :2015-01-31 Mean :0.7305
## 3rd Qu.:2015-02-15 3rd Qu.:0.7365
## Max. :2015-03-01 Max. :0.7538
quarterly_productivity <- cleaned_data %>%
mutate(quarter = floor_date(date, unit = "quarter")) %>%
group_by(quarter) %>%
summarise(avg_productivity = mean(avg_productivity, na.rm = TRUE))
head(quarterly_productivity)
## # A tibble: 1 × 2
## quarter avg_productivity
## <date> <dbl>
## 1 2015-01-01 0.734
nrow(quarterly_productivity)
## [1] 1
summary(quarterly_productivity)
## quarter avg_productivity
## Min. :2015-01-01 Min. :0.7343
## 1st Qu.:2015-01-01 1st Qu.:0.7343
## Median :2015-01-01 Median :0.7343
## Mean :2015-01-01 Mean :0.7343
## 3rd Qu.:2015-01-01 3rd Qu.:0.7343
## Max. :2015-01-01 Max. :0.7343
This week’s work focused on transforming the garment worker productivity dataset into a time series format to explore trends at multiple time scales. The process began by converting the original character-formatted date column (e.g., “1/1/2015”) into a proper Date object using lubridate::mdy(), which enabled accurate time-based grouping. Average productivity was then calculated at the daily, weekly, monthly, and quarterly levels using group_by() and summarise(). To create a continuous daily series, missing dates were filled with tidyr::complete(), so every calendar day in the dataset’s range is present, with NA marking days that have no productivity records; the weekly, monthly, and quarterly aggregations were then built from this completed daily series.
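As a side note, the same gap-filling can be done with the tsibble package once the daily data are indexed by date: fill_gaps() inserts any missing calendar days with NA measurements. A minimal sketch, assuming the same daily aggregation as above (tidyverse and tsibble are already loaded); it is not used in the analysis below.
# Alternative gap-filling with tsibble::fill_gaps() (sketch only)
daily_filled <- data %>%
  group_by(date) %>%
  summarise(avg_productivity = mean(actual_productivity, na.rm = TRUE)) %>%
  as_tsibble(index = date) %>%
  fill_gaps()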
productivity_ts <- cleaned_data %>%
as_tsibble(index = date)
ggplot(productivity_ts, aes(x = date, y = avg_productivity)) +
geom_line(color = "steelblue", linewidth = 1) +
labs(title = "Daily Productivity Over Time (TSibble)",
x = "Date", y = "Average Productivity") +
theme_minimal()
The plot shows daily changes in worker productivity across the time period. There are noticeable ups and downs, with a few sharp dips in mid-January and February. A short-term increase appears in early February before productivity drops again. The overall pattern is irregular, and some missing values (gaps) are visible, suggesting days without recorded data. This irregularity stands out immediately and shows the need to explore trends at weekly or monthly levels for a clearer view.
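To make those gaps concrete, a quick check can count how many days have no recorded productivity and which weekdays they fall on. A small sketch reusing cleaned_data (output not shown here):
# Count the missing days and tabulate them by weekday
cleaned_data %>%
  filter(is.na(avg_productivity)) %>%
  count(weekday = wday(date, label = TRUE))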
weekly_ts <- weekly_productivity %>%
as_tsibble(index = week)
ggplot(weekly_ts, aes(x = week, y = avg_productivity)) +
geom_line(color = "darkgreen", linewidth = 1) +
labs(
title = "Weekly Average Productivity Over Time",
x = "Week",
y = "Average Productivity"
) +
theme_minimal()
The weekly productivity plot shows clear ups and downs throughout the observed period. Productivity peaked around early February, dropped significantly in mid-February, and began to rise again in March. This smoothed view helps highlight broader trends that may be missed in daily data and can support better planning or scheduling decisions based on weekly patterns.
monthly_ts <- monthly_productivity %>%
as_tsibble(index = month)
ggplot(monthly_ts, aes(x = month, y = avg_productivity)) +
geom_line(color = "tomato", linewidth = 1) +
labs(
title = "Monthly Average Productivity Over Time",
x = "Month",
y = "Average Productivity"
) +
theme_minimal()
The monthly productivity plot shows a steady decline from January through March. Productivity was highest in January but dropped in February and remained low into early March. This trend suggests a possible slowdown in performance, which may reflect seasonal effects, workload shifts, or operational challenges. Monthly aggregation smooths short-term variation and is useful for observing overall performance trends across longer periods.
quarterly_ts <- quarterly_productivity %>%
as_tsibble(index = quarter)
ggplot(quarterly_ts, aes(x = quarter, y = avg_productivity)) +
geom_point(color = "blue", size = 3) +
labs(
title = "Quarterly Average Productivity (Q1 only)",
x = "Quarter",
y = "Average Productivity"
) +
theme_minimal()
The quarterly plot shows a single data point representing productivity for Q1 2015, with an average around 0.74. Since the dataset only covers January through early March, no other quarters are available. This view gives a general summary of overall productivity for that period, but more quarters would be needed to observe long-term changes or seasonal trends.
Daily, weekly, monthly, and quarterly productivity were then compared to understand how performance changes over time. Daily data captured detailed changes but was often noisy and harder to interpret. Weekly averages provided a clearer view of short-term trends while reducing daily fluctuations. Monthly summaries helped highlight longer-term patterns and are useful for reporting. Quarterly data included only one value in this case, so it was not helpful for deeper analysis. Overall, weekly aggregation offered the best balance between detail and clarity: it allowed for easier interpretation of productivity trends and supported better decision-making.
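One way to see that trade-off directly is to overlay the noisy daily series with the smoother weekly averages. A sketch using the objects created above:
# Daily values in gray, weekly averages in green, on one set of axes
ggplot() +
  geom_line(data = cleaned_data, aes(x = date, y = avg_productivity),
            color = "gray70", na.rm = TRUE) +
  geom_line(data = weekly_productivity, aes(x = week, y = avg_productivity),
            color = "darkgreen", linewidth = 1) +
  labs(title = "Daily vs. Weekly Average Productivity",
       x = "Date", y = "Average Productivity") +
  theme_minimal()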
productivity_ts <- productivity_ts %>%
mutate(date_num = as.numeric(date))
head(productivity_ts)
## # A tsibble: 6 x 3 [1D]
## date avg_productivity date_num
## <date> <dbl> <dbl>
## 1 2015-01-01 0.719 16436
## 2 2015-01-02 NA 16437
## 3 2015-01-03 0.803 16438
## 4 2015-01-04 0.780 16439
## 5 2015-01-05 0.796 16440
## 6 2015-01-06 0.806 16441
trend_model <- lm(avg_productivity ~ date_num, data = productivity_ts)
summary(trend_model)
##
## Call:
## lm(formula = avg_productivity ~ date_num, data = productivity_ts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.078019 -0.035446 0.002314 0.032397 0.115890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.7771385 4.5175925 3.935 0.000228 ***
## date_num -0.0010347 0.0002743 -3.773 0.000387 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04269 on 57 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.1998, Adjusted R-squared: 0.1858
## F-statistic: 14.23 on 1 and 57 DF, p-value: 0.0003866
A linear regression was used to detect trends in daily productivity over time. The model showed a downward trend, with a slope estimate of about -0.001. This indicates that average productivity decreased slightly each day during the time period. The trend was statistically significant (p < 0.001), suggesting it is unlikely to be due to random variation. The R-squared value was 0.20, meaning that about 20% of the variation in productivity can be explained by time. Based on this pattern, it may be useful to analyze separate time periods (such as January vs. February) to see whether different trends exist in different weeks or months.
jan_data <- productivity_ts %>%
filter(date >= as.Date("2015-01-01") & date <= as.Date("2015-01-31")) %>%
filter(is.finite(avg_productivity) & !is.na(date))
ggplot(jan_data, aes(x = date, y = avg_productivity)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Productivity Trend in January (Linear Regression)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
feb_data <- productivity_ts %>%
filter(date >= as.Date("2015-02-01") & date <= as.Date("2015-02-28")) %>%
filter(is.finite(avg_productivity) & !is.na(date))
ggplot(feb_data, aes(x = date, y = avg_productivity)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Productivity Trend in February (Linear Regression)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
In January, the trend line appears almost flat, suggesting that average productivity remained relatively stable throughout the month. There was no significant increase or decrease in productivity during that time. In contrast, the February plot shows a clear downward trend. The red regression line slopes noticeably down, indicating that productivity declined over the month. This drop could suggest worker fatigue, changing workloads, or other operational challenges that emerged after January. Comparing the two months shows how analyzing trends by smaller time windows can reveal important differences in performance that may not be visible in overall averages.
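The same comparison can be made numerically by fitting a separate regression for each month and extracting the slope on date_num. A sketch, assuming the broom package is available for tidy():
library(broom)  # assumed available; used only to tidy the per-month models
# One linear trend per calendar month; the date_num row gives the daily slope
monthly_slopes <- productivity_ts %>%
  as_tibble() %>%
  filter(!is.na(avg_productivity)) %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  group_modify(~ tidy(lm(avg_productivity ~ date_num, data = .x))) %>%
  filter(term == "date_num")
monthly_slopes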
clean_data <- productivity_ts %>%
filter(is.finite(avg_productivity) & !is.na(date))
ggplot(clean_data, aes(x = date, y = avg_productivity)) +
geom_point(size = 1) +
geom_smooth(method = "lm", se = FALSE, color = "green") +
labs(
title = "Trend in Daily Productivity (Linear Regression)",
x = "Date",
y = "Average Productivity"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
After looking at productivity by month and over the whole period, a clear pattern appears. In January, productivity stayed mostly steady, but in February it dropped quickly, which had a large effect on the overall trend. The full trend line shows a clear downward direction from January to March. This may be caused by worker fatigue, changes in tasks, or other time-related factors.
Looking at each month helped show when the drop started: it began in early February and became more noticeable after the first week, which matches what the monthly data showed. Subsetting the data by month made it easier to see where the changes began and how the trend differed from month to month; the drop was not visible from the full data alone.
The overall trend was moderate but important. The model showed a statistically significant downward trend, and roughly 20% of the variation in productivity (adjusted R-squared about 19%) could be explained by time. The drop in February looked even steeper, showing that shorter time periods can reveal larger changes.
clean_data <- productivity_ts %>%
filter(is.finite(avg_productivity) & !is.na(date))
ggplot(clean_data, aes(x = date, y = avg_productivity)) +
geom_point(size = 1, color = "gray") +
geom_smooth(method = "loess", span = 0.2, se = FALSE, color = "blue") +
labs(
title = "Smoothed Daily Productivity (LO(W)ESS)",
x = "Date",
y = "Average Productivity"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The LO(W)ESS smoothing line revealed a clear pattern of ups and downs in daily productivity. These repeating waves suggest a weekly seasonality, where productivity tends to rise and fall in a regular cycle. This kind of pattern could be related to work schedules, fatigue near the end of the week, or operational routines. LO(W)ESS helped smooth out the noise and made these patterns easier to see.
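If the waves really follow a weekly cycle, average productivity should differ by day of the week. A quick sketch reusing clean_data (a descriptive check, not a formal seasonality test):
# Average productivity by day of week, as a rough check for weekly seasonality
clean_data %>%
  as_tibble() %>%
  mutate(weekday = wday(date, label = TRUE, week_start = 1)) %>%
  group_by(weekday) %>%
  summarise(mean_productivity = mean(avg_productivity),
            n_days = n())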
library(xts)
productivity_xts <- xts(productivity_ts$avg_productivity, order.by = productivity_ts$date)
acf_result <- acf(productivity_ts$avg_productivity, plot = FALSE, na.action = na.exclude)
acf_data <- data.frame(
lag = acf_result$lag,
acf = acf_result$acf
)
head(acf_data)
## lag acf
## 1 0 1.0000000
## 2 1 0.5842230
## 3 2 0.4363752
## 4 3 0.4552327
## 5 4 0.5011326
## 6 5 0.4019822
ACF shows a strong positive correlation at lag 1 (0.58), meaning productivity on one day is closely related to the previous day. The values gradually decrease over time, suggesting a short-term pattern in productivity. This indicates that workers tend to follow a consistent routine from day to day, even though daily data may have some noise.
acf(productivity_xts, main = "Autocorrelation (ACF) of Daily Productivity", na.action = na.exclude)
pacf_result <- pacf(productivity_ts$avg_productivity, plot = FALSE, na.action = na.exclude)
pacf_data <- data.frame(
lag = pacf_result$lag,
pacf = pacf_result$acf
)
head(pacf_data)
## lag pacf
## 1 1 0.584223046
## 2 2 0.144316111
## 3 3 0.236867868
## 4 4 0.229034889
## 5 5 -0.006824074
## 6 6 -0.193924108
PACF also shows a strong spike at lag 1, but the values drop more quickly after that. This means that most of the relationship is explained by the immediate previous day. Later lags do not add much new information once the effect of lag 1 is considered. This supports the idea that recent productivity influences the current day more than longer gaps in time.
pacf(productivity_xts, main = "Partial Autocorrelation (PACF) of Daily Productivity", na.action = na.exclude)
Even though weekly data was more useful for seeing overall trends, daily data was still included to help explore short-term changes. Tools like ACF and PACF work best with daily data and can show whether productivity repeats in a regular pattern, such as weekly cycles. The ACF plot showed some correlation between days, which means productivity levels may be related from one day to the next. The PACF plot helped show which days had the strongest direct effect. Looking at daily data this way helps find patterns that might not be clear in weekly or monthly data. It gives a better idea of how productivity changes from day to day and whether there are repeated patterns in the work process.
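Since the PACF drops off sharply after lag 1, a simple AR(1) model is a natural next step. A sketch with base R’s arima(), which tolerates the NA days in the series:
# Fit an AR(1) model to the daily series; the ar1 coefficient measures how
# strongly today's productivity depends on yesterday's
ar1_model <- arima(productivity_ts$avg_productivity, order = c(1, 0, 0))
ar1_model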
This time series analysis aimed to understand worker productivity trends in the garment industry over time. Data were summarized at daily, weekly, monthly, and quarterly levels, with weekly summaries offering the best balance for identifying trends. A linear regression model indicated a slight downward trend in productivity, particularly during February. LO(W)ESS smoothing revealed repeating patterns that suggested weekly cycles in performance. These findings were supported by the autocorrelation function (ACF), which showed strong correlations at short lags. The partial autocorrelation function (PACF) confirmed that the strongest direct influence came from the previous day, indicating a short-term routine in productivity. Together, these tools provided a comprehensive view of both long-term trends and short-term patterns in worker performance.