data <- read.csv("C:\\Users\\Krishna\\Desktop\\Projects\\garments_worker_productivity.csv")
# Set a CRAN mirror
chooseCRANmirror(ind=1) # Choose a mirror from the list

# Install the forecast package
install.packages("forecast")
## Installing package into 'C:/Users/Krishna/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'forecast' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Krishna\AppData\Local\Temp\RtmpMdLAqY\downloaded_packages
# Load the forecast package
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(ggplot2)
library(tsibble)
## 
## Attaching package: 'tsibble'
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
install.packages("forecast")
## Warning: package 'forecast' is in use and will not be installed

s

# Standardize the date separators from "/" to "-"
data$date <- gsub("/", "-", data$date)


tail(data)
##           date  quarter department       day team targeted_productivity   smv
## 1192 3-11-2015 Quarter2     sweing Wednesday    7                  0.65 30.48
## 1193 3-11-2015 Quarter2  finishing Wednesday   10                  0.75  2.90
## 1194 3-11-2015 Quarter2  finishing Wednesday    8                  0.70  3.90
## 1195 3-11-2015 Quarter2  finishing Wednesday    7                  0.65  3.90
## 1196 3-11-2015 Quarter2  finishing Wednesday    9                  0.75  2.90
## 1197 3-11-2015 Quarter2  finishing Wednesday    6                  0.70  2.90
##      wip over_time incentive idle_time idle_men no_of_style_change
## 1192 935      6840        26         0        0                  1
## 1193  NA       960         0         0        0                  0
## 1194  NA       960         0         0        0                  0
## 1195  NA       960         0         0        0                  0
## 1196  NA      1800         0         0        0                  0
## 1197  NA       720         0         0        0                  0
##      no_of_workers actual_productivity
## 1192            57           0.6505965
## 1193             8           0.6283333
## 1194             8           0.6256250
## 1195             8           0.6256250
## 1196            15           0.5058889
## 1197             6           0.3947222
# Convert the date column to Date class (month-day-year format)
data$date <- as.Date(data$date, format = "%m-%d-%Y")
# actual_productivity is the response variable; flag rows with duplicated dates
duplicated_rows <- duplicated(data$date)

# Remove duplicated rows
cleaned_data <- data[!duplicated_rows, ]

# Convert to tsibble object
my_tsibble <- tsibble::tsibble(
  date = cleaned_data$date,
  actual_productivity = cleaned_data$actual_productivity
)
## Using `date` as index variable.
# Plot the tsibble
ggplot(my_tsibble, aes(x = date, y = actual_productivity)) +
  geom_line() +
  labs(title = "Actual Productivity Over Time",
       x = "Date",
       y = "Actual Productivity")

print(head(my_tsibble))
## # A tsibble: 6 x 2 [1D]
##   date       actual_productivity
##   <date>                   <dbl>
## 1 2015-01-01               0.941
## 2 2015-01-03               0.988
## 3 2015-01-04               0.991
## 4 2015-01-05               0.961
## 5 2015-01-06               0.967
## 6 2015-01-07               0.951
# Check the structure of the tsibble
str(my_tsibble)
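
Before reading trends off the plot, it is also worth checking whether the daily index has implicit gaps (for example, 2015-01-02 is missing above). A minimal sketch using tsibble's gap helpers:

# Report missing dates in the daily index and, if needed, make them explicit
# as NA rows so downstream time-series functions see a regular series
tsibble::has_gaps(my_tsibble)
my_tsibble_filled <- tsibble::fill_gaps(my_tsibble)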

What stands out immediately?

Upon visual inspection, we can observe whether there is a clear upward, downward, or stationary trend in actual productivity over time.

When we analyze the graph of actual productivity over time, several features stand out:

1) Periods of higher variability:

There are three noticeable periods with higher variability in productivity, characterized by fluctuations or spikes in productivity levels. Understanding the factors behind these fluctuations is important, as they may indicate changes in production processes, resource allocation, or external factors affecting productivity. (A rolling-variability sketch follows this list.)

2) Shift in productivity after mid-February:

Another significant observation is the increase in productivity after February 15. This rise suggests a potential shift or disruption in the production process during this period.

3) Long-term productivity:

Additionally, we can observe a long-term trend in productivity over the entire time period.
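
To make the high-variability periods in point 1 concrete, here is a minimal sketch using a rolling standard deviation. It assumes the zoo package is available (it is installed as a dependency of forecast, although it is not loaded above):

# Rolling 30-observation standard deviation of actual productivity; peaks in
# this curve line up with the periods of higher variability seen in the plot
roll_sd <- zoo::rollapply(my_tsibble$actual_productivity, width = 30,
                          FUN = sd, fill = NA, align = "right")
plot(my_tsibble$date, roll_sd, type = "l",
     xlab = "Date", ylab = "Rolling SD (30 observations)",
     main = "Local Variability of Actual Productivity")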

# performing linear regression

lm_model <- lm(actual_productivity ~ date, data = data)

# Check the summary of the model
summary(lm_model)
## 
## Call:
## lm(formula = actual_productivity ~ date, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52985 -0.09000  0.03662  0.10185  0.39628 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.8105198  3.9994429   4.453 9.25e-06 ***
## date        -0.0010367  0.0002428  -4.269 2.11e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1732 on 1195 degrees of freedom
## Multiple R-squared:  0.01502,    Adjusted R-squared:  0.0142 
## F-statistic: 18.23 on 1 and 1195 DF,  p-value: 2.115e-05
trend_strength <- summary(lm_model)$r.squared
print(paste("Trend Strength (R-squared):", trend_strength))
## [1] "Trend Strength (R-squared): 0.0150245816939444"

2) How strong are these trends?

The R-squared value of 0.0150 indicates that the linear regression model explains approximately 1.50% of the variability in the actual productivity data. In other words, the model accounts for a very small proportion of the total variance in productivity over time. This suggests that the linear trend captured by the model is not very strong or meaningful.
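
To put the weak slope in context, it can also help to look at an interval estimate rather than just the point estimate. A minimal sketch reusing lm_model from above:

# 95% confidence interval for the slope on date; an interval that excludes zero
# confirms the downward trend is detectable even though it explains little of
# the variance (consistent with the small R-squared)
confint(lm_model, "date", level = 0.95)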

Do you need to subset the data for multiple trends?

From my perspective, there is no need to subset the data to detect multiple trends, because:

1) Continuous time series:

The data represents a continuous time series of productivity measurements over time. Subsetting the data into multiple segments might disrupt the continuity of the time series and potentially overlook important patterns or trends that span different periods.

2) Statistical power:

Splitting the data into multiple subsets reduces the sample size of each subset, potentially reducing the statistical power to detect trends accurately. (A quick before/after check around a single candidate break point is sketched below.)
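
That said, if multiple trends were suspected, a quick check is to fit the same linear model on either side of a candidate break point. The break date below simply reuses the February 15 observation from the plot and is illustrative, not a detected change point:

# Compare the fitted slope before and after a candidate break point; similar
# slopes on both sides support treating the series as a single trend
break_point <- as.Date("2015-02-15")
fit_before <- lm(actual_productivity ~ date, data = subset(data, date < break_point))
fit_after  <- lm(actual_productivity ~ date, data = subset(data, date >= break_point))
coef(fit_before)["date"]
coef(fit_after)["date"]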

# Apply smoothing using Loess
smoothed <- loess(actual_productivity ~ as.numeric(date), data = data)

# Create a new data frame with smoothed values
smoothed_df <- data.frame(date = data$date[1:length(fitted(smoothed))], 
                          smoothed = fitted(smoothed))

# Plot the original data and the smoothed curve
ggplot() +
  geom_point(data = data, aes(x = date, y = actual_productivity)) +  # Add points for original data
  geom_line(data = smoothed_df, aes(x = date, y = smoothed), color = "blue", linewidth = 1) +  # Add smoothed curve
  labs(title = "Smoothing with Loess", x = "Date", y = "Actual Productivity") +
  theme_minimal()
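
The degree of smoothing is controlled by loess()'s span argument (the fit above uses the default of 0.75). A small sketch with a tighter span, purely for comparison; 0.3 is an arbitrary illustrative value:

# Refit with a smaller span so the curve follows local fluctuations more closely
smoothed_tight <- loess(actual_productivity ~ as.numeric(date), data = data, span = 0.3)
tight_df <- data.frame(date = data$date[1:length(fitted(smoothed_tight))],
                       smoothed = fitted(smoothed_tight))
ggplot(tight_df, aes(x = date, y = smoothed)) +
  geom_line(color = "red", linewidth = 1) +
  labs(title = "Smoothing with Loess (span = 0.3)", x = "Date", y = "Actual Productivity") +
  theme_minimal()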

acf(data$actual_productivity, main = "Autocorrelation Function (ACF) for Actual Productivity")

The ACF graph provides insight into the seasonality of actual productivity over time.

1) Autocorrelation function:

The ACF graph displays the correlation between the actual productivity time series and its lagged values at different time lags.

2) Interpretation:

Significant positive peaks in the ACF plot indicate strong positive correlations between actual productivity and its lagged values at those specific lags. Conversely, significant negative peaks indicate strong negative correlations, while peaks close to zero indicate little or no correlation.

3) Seasonality detection:

Significant peaks at regular intervals in the ACF plot suggest the presence of seasonality in actual productivity. These peaks represent periods where productivity levels exhibit repeating patterns or cycles over time.
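
For a complementary view, the forecast package loaded earlier offers a ggplot version of the ACF, and the partial ACF helps separate the direct correlation at each lag from correlation carried over from shorter lags. A minimal sketch:

# ggplot-style ACF plus the partial ACF; spikes at regular lags in either plot
# would support the seasonality reading above
ggAcf(data$actual_productivity) +
  labs(title = "ACF of Actual Productivity (ggAcf)")
pacf(data$actual_productivity, main = "Partial Autocorrelation Function (PACF) for Actual Productivity")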

INSIGHTS:

1) Choosing a response variable:

Insight: Selecting a specific column (actual_productivity) to analyze over time allows for the identification of trends and patterns in the dataset.

Significance: Understanding how the response variable changes over time is crucial for making informed decisions.

2) Creating a tsibble:

Insight: Visualizing the data over time provides a clear understanding of how the response variable changes over the specified period.

Significance: Identifying trends, seasonality, and irregular patterns in the data can help in making data-driven decisions and formulating strategies.

3) Use of linear regression:

Insight: Linear regression helps quantify the relationship between time and the response variable, indicating whether there is a significant upward or downward trend over time.

Significance: Understanding the presence and direction of trends provides insight into the underlying dynamics of the dataset and can inform forecasting and decision-making processes.