Student names, ID and percentage of contributions

Group information
Student name Student ID Percentage of contribution
Haoran Zhu 3767342 50%
Yu Ji 4045567 50%

Introduction

Candy production has been an essential part of the food industry for decades, reflecting not only consumer preferences but also broader economic trends.

By analyzing historical data on monthly candy production, we can uncover significant patterns and trends that provide insights into production cycles and market dynamics. This investigation will utilize statistical methods to explore these trends and make forecasts about future production levels. Understanding these trends is crucial for manufacturers who need to optimize production schedules, manage inventories, and respond effectively to market demands. Additionally, the analysis of candy production data can offer valuable indicators of consumer behavior and economic conditions. Our primary research question is: “What are the significant trends in monthly candy production over the past decades, and how accurately can we forecast future production levels based on historical data?”

This project aims to provide a comprehensive analysis of these trends and offer actionable insights for stakeholders in the candy production industry.

Problem Statement

To answer the research question stated in the Introduction, we apply a range of statistical methods to the historical data on candy production: descriptive statistics to summarize the data, time series analysis to identify patterns and trends, and forecasting models to predict future production levels. With these techniques we aim to produce findings that help manufacturers optimize their production schedules, manage inventories more effectively, and respond to market demands. The findings should give a clearer picture of the candy production industry and support practical recommendations for stakeholders who want to make data-driven decisions.

Data

The data used in this analysis is sourced from an open dataset on monthly candy production, which is publicly available for research and analysis purposes.

This dataset provides a record of candy production over several decades, allowing us to explore long-term trends and patterns. We obtained the data from FRED, the economic database of the Federal Reserve Bank of St. Louis, which publishes the Industrial Production Index for Sugar and Confectionery Products (NAICS = 3113). The dataset contains two variables: the observation date (observation_date), which identifies the month of production, and the monthly production value (IPG3113N). IPG3113N is a continuous numeric variable. Because it is an industrial production index, it expresses output relative to a base period rather than as a physical quantity such as tonnes.

We preprocess the data by checking for missing or inconsistent entries so that the dataset is complete before analysis; the checks themselves are reported in the section on missing values and outliers. This step matters because gaps or faulty entries would distort the time series results. The dataset source is listed in the References.
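One simple consistency check for a monthly series is to compare the observed dates with a full monthly calendar. The sketch below uses a few made-up dates (the vector `dates` is hypothetical and stands in for `data$observation_date`) and base R only.

```r
# Toy monthly dates with one month (April 2000) deliberately left out.
# These values are illustrative, not taken from the candy dataset.
dates <- as.Date(c("2000-01-01", "2000-02-01", "2000-03-01", "2000-05-01"))

# Full monthly calendar spanning the observed range.
expected <- seq(min(dates), max(dates), by = "month")

# Months present in the calendar but absent from the data.
missing_months <- expected[!expected %in% dates]
print(missing_months)

# anyDuplicated() returns 0 when no date occurs twice.
n_duplicates <- anyDuplicated(dates)
print(n_duplicates)
```

An empty `missing_months` and a zero from `anyDuplicated()` would indicate a gap-free, non-repeating monthly index.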

Descriptive Statistics and Visualisation

Initial processing

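# Load the packages used throughout the report
library(dplyr)      # %>%, summarise(), mutate(), group_by(), filter()
library(ggplot2)    # ggplot() and related plotting functions
library(lubridate)  # year(), month()

# Set the data path
data_path <- "C:/Users/user/Desktop/Applied Analytics/data/candy_production.csv"

# Load the data and parse the date column
data <- read.csv(data_path)
data$observation_date <- as.Date(data$observation_date)
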
# Calculate descriptive statistics
summary_stats <- data %>%
  summarise(
    mean_production = mean(IPG3113N, na.rm = TRUE),
    median_production = median(IPG3113N, na.rm = TRUE),
    sd_production = sd(IPG3113N, na.rm = TRUE),
    min_production = min(IPG3113N, na.rm = TRUE),
    max_production = max(IPG3113N, na.rm = TRUE)
  )

print(summary_stats)
##   mean_production median_production sd_production min_production max_production
## 1        100.6625          102.2785      18.05293        50.6689       139.9153

In this section, we summarize the important variables in our investigation and use visualization techniques to highlight interesting features of the data, telling the overall story. The primary variables in our dataset are the observation date and the monthly candy production values (IPG3113N). To provide a comprehensive overview, we calculate descriptive statistics such as the mean, median, standard deviation, and range of the production values.

The mean production is 100.6625, indicating the average monthly production of candy over the period covered by the data. The median production is 102.2785, which suggests that half of the monthly production values are below this amount and half are above. The standard deviation is 18.05293, showing the variability in the production values. The minimum production value is 50.6689, and the maximum production is 139.9153, reflecting the range of production levels observed.

We visualize the data using line plots to show trends over time and histograms to display the distribution of production values. Any data issues, such as missing values or outliers, are addressed by appropriate data imputation techniques or by removing incomplete records. For instance, missing values are handled using linear interpolation, and outliers are identified using the IQR method and verified for their validity before deciding on removal. These steps ensure that we effectively summarize and visualize the data while addressing any data quality issues to maintain the integrity of our analysis.

# Plot production over time
ggplot(data, aes(x = observation_date, y = IPG3113N)) +
  geom_line(color = "blue") +
  labs(title = "Monthly Candy Production Over Time",
       x = "Date",
       y = "Production (IPG3113N)") +
  theme_minimal()

The line plot shows the trend of monthly candy production over time, from the early 1970s to the 2010s. The production values, measured by the IPG3113N index, display a clear seasonal pattern with periodic peaks and troughs each year, reflecting the cyclical nature of candy production.

There is a noticeable upward trend in the overall production levels, particularly from the mid-1980s to the early 2000s, indicating growth in the candy industry during this period. However, there are also periods of fluctuation and decline, especially around the late 2000s, suggesting potential market or economic factors impacting production. The variability in production levels is quite evident, with the amplitude of seasonal fluctuations increasing over time, pointing to greater swings in production volumes.

This visualization effectively highlights the long-term trends, seasonal patterns, and periods of significant change in candy production, providing a comprehensive view of the data over several decades.
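The claim that the seasonal swings widen over time can be checked numerically by computing the within-year range (maximum minus minimum) for each year. The sketch below illustrates the idea on a synthetic series with made-up numbers; `years`, `months`, `amplitude`, and `value` are hypothetical names, and with the report's data the same `tapply()` call would use `data$IPG3113N` and the year of `data$observation_date`.

```r
# Synthetic monthly series (made-up numbers) whose seasonal swing grows each year.
years <- rep(2001:2003, each = 12)
months <- rep(1:12, times = 3)
amplitude <- rep(c(5, 10, 15), each = 12)
value <- 100 + amplitude * cos(2 * pi * months / 12)

# Within-year range (max minus min) as a simple measure of seasonal amplitude.
yearly_range <- tapply(value, years, function(v) diff(range(v)))
print(yearly_range)
```

A yearly range that rises steadily, as in this toy series, is what growing seasonal amplitude looks like in numbers.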

# Plot the distribution of production values
ggplot(data, aes(x = IPG3113N)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(title = "Distribution of Monthly Candy Production",
       x = "Production (IPG3113N)",
       y = "Frequency") +
  theme_minimal()

The histogram illustrates the distribution of monthly candy production values measured by the IPG3113N index.

The distribution is roughly bell-shaped and centered near the mean of approximately 100.66. Most production values fall between 75 and 125. The histogram peaks in the 100-110 range, which is consistent with the mean (100.66) and the median (102.28). The distribution tails off at both ends, with few values near the minimum of 50.67 or above 130; the mean lying slightly below the median hints at a mild left skew.

This visualization effectively highlights the central tendency and variability of monthly candy production, showing that while there is some fluctuation, production is generally consistent within a specific range. This consistency is important for manufacturers and planners in the candy industry as it allows for better forecasting and resource allocation based on historical production patterns.

Advanced Processing

data <- data %>%
  mutate(year = year(observation_date)) %>%
  mutate(month = month(observation_date))
data_avg_year <- data %>%
  group_by(year) %>%
  summarize(avg_IPG3113N = mean(IPG3113N, na.rm = TRUE))

ggplot(data_avg_year, aes(x = year, y = avg_IPG3113N)) +
  geom_line(color = "navyblue") +
  labs(x = "Year", y = "Average IPG3113N", title = "Average IPG3113N by Year") +
  theme_minimal()

The line plot illustrates the average monthly candy production (IPG3113N) by year, showing the overall trends in production from the early 1970s to the 2010s. The data is grouped by year, and the average production value for each year is calculated and plotted.

There is a noticeable increase in average production from the mid-1970s to the early 2000s, indicating a period of growth in the candy industry. Peaks are observed around the early 2000s, followed by a decline and subsequent fluctuations, reflecting periods of both increased production and downturns. The production levels stabilize and show a slight upward trend in the 2010s, suggesting some recovery or stabilization in the industry.

This visualization highlights the long-term changes and cyclical nature of candy production, providing valuable insights into how the industry has evolved over the decades. By examining these trends, stakeholders can better understand the historical performance and potential future directions of candy production.

data_avg_month <- data %>%
  group_by(month) %>%
  summarize(avg_IPG3113N = mean(IPG3113N, na.rm = TRUE))

ggplot(data_avg_month, aes(x = factor(month), y = avg_IPG3113N)) +
  geom_bar(stat = "identity", fill = "navyblue") +
  geom_text(aes(label = round(avg_IPG3113N, 2)), vjust = -0.5) +
  geom_text(aes(label = month), vjust = 1, size = 3) +
  labs(x = "Month", y = "", title = "Average IPG3113N by Month") +
  # Apply the complete theme first: a theme_classic() placed after theme()
  # would reset the axis.text.x setting
  theme_classic() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))

The bar chart displays the average monthly candy production (IPG3113N) for each month of the year. The data is grouped by month, and the average production value for each month is calculated and visualized.

From the chart, we observe that candy production varies significantly throughout the year, with distinct peaks and troughs. Notably, the production levels are highest in October, November, and December, with average values of 118.7, 120.83, and 120.09 respectively, likely due to increased demand for candy during the Halloween, Thanksgiving, and Christmas seasons. Conversely, the production levels are lowest during the spring and summer months, particularly in April, May, and June, with average values of 88.67, 88.84, and 90.75 respectively. This seasonal pattern highlights the influence of holidays and consumer behavior on candy production.

The chart effectively uses labels to indicate the exact average production values and the corresponding months, making it easy to interpret and compare the monthly variations. This visualization provides valuable insights into the cyclical nature of candy production, helping manufacturers plan and optimize their production schedules to meet seasonal demand.

Missing values and outliers

# Verify if there are any missing values
sum(is.na(data$IPG3113N))
## [1] 0
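# Fill any missing values by linear interpolation (no effect here, since none were found);
# na.rm = FALSE keeps the vector length unchanged even if the first or last value is NA
data$IPG3113N <- zoo::na.approx(data$IPG3113N, na.rm = FALSE)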

The code first counts the missing values in the IPG3113N column with sum(is.na(data$IPG3113N)). The output [1] 0 shows that the column contains no NA values. The linear interpolation step that follows therefore leaves the data unchanged; it is kept as a safeguard in case the source file is later updated with gaps.

Because the series is complete, the time series analysis is not affected by missing data or by imputed values. Had gaps been present, linear interpolation would have replaced each missing month with a value on the straight line between its nearest observed neighbours.
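To show what linear interpolation does to a gap, the sketch below fills a short made-up vector with base R's `approx()`. This is an illustration only: the report itself calls `zoo::na.approx()`, and the vector `y` is hypothetical.

```r
# Made-up series with gaps; not taken from the candy dataset.
y <- c(10, NA, 14, NA, NA, 20)
idx <- seq_along(y)
known <- !is.na(y)

# Interpolate linearly between the observed points at every position.
filled <- approx(x = idx[known], y = y[known], xout = idx)$y
print(filled)
```

Each missing value lands on the straight line between its observed neighbours, so the single gap between 10 and 14 becomes 12.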

# Identify outliers using the IQR method
Q1 <- quantile(data$IPG3113N, 0.25)
Q3 <- quantile(data$IPG3113N, 0.75)
iqr_value <- Q3 - Q1  # named iqr_value to avoid masking stats::IQR()

# Keep only the rows that fall outside the 1.5 * IQR fences
outliers_IQR <- data %>%
  filter(IPG3113N < (Q1 - 1.5 * iqr_value) | IPG3113N > (Q3 + 1.5 * iqr_value))

print(outliers_IQR)
## [1] observation_date IPG3113N         year             month           
## <0 rows> (or 0-length row.names)

In this case, the result shows 0 rows, indicating that there are no outliers in the IPG3113N column based on the IQR method. This outcome suggests that all the monthly candy production values fall within the expected range, with no significant anomalies or extreme values that deviate substantially from the overall distribution.
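For contrast, the sketch below applies the same 1.5 * IQR rule to a small made-up vector that does contain an extreme value. The vector `x` and the names `lower_fence` and `upper_fence` are hypothetical and chosen for illustration.

```r
# Made-up values with one obvious extreme (40); not taken from the candy dataset.
x <- c(10, 12, 11, 13, 12, 11, 40)

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr_value <- q3 - q1

# Fences: values beyond 1.5 * IQR from the quartiles are flagged.
lower_fence <- q1 - 1.5 * iqr_value
upper_fence <- q3 + 1.5 * iqr_value

flagged <- x[x < lower_fence | x > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(flagged)
```

Here only the value 40 lies outside the fences; in the candy data no observation does, which is why the filter returns zero rows.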

Hypothesis Testing

We test whether the mean November production equals 120. The null hypothesis is: \[H_0: \bar{\mu}_{11} = 120 \]

The alternative hypothesis is: \[H_A: \bar{\mu}_{11} \neq 120 \]

nov_data <- data[data$month == 11, "IPG3113N"]
t.test(nov_data, mu = 120)
## 
##  One Sample t-test
## 
## data:  nov_data
## t = 0.56123, df = 44, p-value = 0.5775
## alternative hypothesis: true mean is not equal to 120
## 95 percent confidence interval:
##  117.8559 123.7992
## sample estimates:
## mean of x 
##  120.8275

The hypothesis testing conducted aims to determine whether the mean candy production for November (denoted as \(\bar{\mu}_{11}\)) is equal to 120. The null hypothesis (\(H_0\)) states that the mean November production is 120, while the alternative hypothesis (\(H_A\)) suggests that the mean is not equal to 120.

Using a one-sample t-test, we test the sample data of November candy production values against the hypothesized mean of 120. The t-test result yields a t-value of 0.56123 with 44 degrees of freedom and a p-value of 0.5775. This high p-value indicates that there is no significant evidence to reject the null hypothesis, suggesting that the true mean November production is not significantly different from 120. The 95% confidence interval for the mean production ranges from 117.8559 to 123.7992, which includes the hypothesized mean of 120. The sample estimate of the mean is 120.8275, which is very close to the hypothesized value, further supporting the conclusion that the mean production in November does not differ significantly from 120.

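This test gives no evidence against a mean November production of 120. Two caveats apply. First, the hypothesized value of 120 is close to the November average computed from the same data (120.83), so the test is not an independent check. Second, the November observations come from a trending time series, so they are not independent draws from a single population as the one-sample t-test assumes. The result is therefore best read as descriptive rather than confirmatory.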
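The reported p-value can be recovered from the t statistic and the degrees of freedom alone. The sketch below copies both numbers from the t-test output; because the t statistic is rounded there, the result should match the reported 0.5775 only up to rounding.

```r
# Values copied from the t-test output above.
t_stat <- 0.56123
df <- 44

# Two-sided p-value: probability of a |t| at least this large when H0 is true.
p_value <- 2 * pt(-abs(t_stat), df)
print(round(p_value, 4))
```

The factor of 2 reflects the two-sided alternative, which counts deviations from 120 in either direction.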

t.test(nov_data, mu = 120)$p.value
## [1] 0.5774881

This snippet extracts the p-value from the same t-test. The result, 0.5774881, is the unrounded form of the 0.5775 reported above, so it adds no new evidence. Because the p-value is far above the common significance level of 0.05, we fail to reject the null hypothesis (\(H_0: \bar{\mu}_{11} = 120\)). Failing to reject does not prove that the mean equals 120; it shows only that the data are compatible with that value. For production planning and resource allocation, the practical reading is that the November index has averaged close to 120 over the sample period.

t.test(nov_data, mu = 120)$conf.int
## [1] 117.8559 123.7992
## attr(,"conf.level")
## [1] 0.95

The additional R code snippet calculates the 95% confidence interval for the mean November candy production. The output indicates that the 95% confidence interval ranges from 117.8559 to 123.7992. This interval means that we can be 95% confident that the true mean production for November lies within this range. The interval includes the hypothesized mean of 120, further reinforcing the conclusion that there is no significant difference between the actual mean and the hypothesized mean of 120. The confidence interval provides a range of plausible values for the mean production, confirming that the average production in November is consistent with expectations and does not exhibit significant deviation from the hypothesized value.
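The interval can also be rebuilt by hand as the sample mean plus or minus a t quantile times the standard error. In the sketch below, the standard error is not printed in the output, so it is backed out from the reported mean and t statistic; that step is an assumption of this illustration, and small rounding differences from the reported interval are expected.

```r
# Values copied from the t-test output above.
x_bar <- 120.8275   # sample mean
mu0 <- 120          # hypothesized mean
t_stat <- 0.56123
df <- 44

# Standard error implied by t = (x_bar - mu0) / se.
se <- (x_bar - mu0) / t_stat

# 95% interval: mean plus or minus the 97.5% t quantile times the standard error.
margin <- qt(0.975, df) * se
ci <- c(lower = x_bar - margin, upper = x_bar + margin)
print(round(ci, 2))
```

The interval is centered on the sample mean, and its width depends only on the standard error and the degrees of freedom.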

This result is consistent with a mean November production close to 120 and can inform production planning in the candy industry. We therefore fail to reject \(H_0: \bar{\mu}_{11} = 120\); the data provide no significant evidence for \(H_A: \bar{\mu}_{11} \neq 120\).

Regression analysis

# Month index: 1 for the first observation, 2 for the second, and so on
data <- data %>%
  mutate(time = row_number())

model <- lm(IPG3113N ~ time, data = data)

ggplot(data, aes(x = time, y = IPG3113N)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  labs(title = "IPG3113N Over Time", x = "Time", y = "IPG3113N") +
  theme_classic()

The scatter plot with the regression line illustrates the relationship between candy production (IPG3113N) and time. In this analysis, time is treated as a continuous variable, starting from the earliest observation to the most recent. Each point represents the production value at a specific time, and the red line shows the linear trend estimated by the regression model.

The linear regression model is fitted to the data, with IPG3113N as the dependent variable and time as the independent variable. The resulting regression line indicates a positive trend over time, suggesting that candy production has generally increased over the period covered by the data. The scatter plot shows a clear upward trend, with production values becoming more dispersed as time progresses, indicating increasing variability in production levels.

The regression line’s slope provides an estimate of the average increase in candy production over time. This positive slope implies that, on average, candy production has been rising, which might be attributed to factors such as increasing demand, improvements in production technology, or other market dynamics.
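The slope can be read directly from a fitted model with `coef()`. The sketch below shows this on a synthetic series with made-up values (the data frame `toy` and the model `toy_model` are hypothetical); with the report's data the corresponding call would be `coef(model)["time"]`.

```r
# Synthetic monthly series (made-up values): linear trend plus a seasonal wave.
toy <- data.frame(time = 1:120)
toy$value <- 80 + 0.05 * toy$time + 5 * sin(2 * pi * toy$time / 12)

# Same model form as in the report: response regressed on the month index.
toy_model <- lm(value ~ time, data = toy)

# The slope is the estimated change per month; multiply by 12 for a yearly figure.
slope_per_month <- unname(coef(toy_model)["time"])
slope_per_year <- 12 * slope_per_month

print(slope_per_month)
print(slope_per_year)
```

Because `time` counts months, the raw slope is a per-month change in index points. The seasonal wave pulls the estimate slightly away from the 0.05 used to generate the toy data, which is a reminder that a plain linear fit ignores seasonality.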

Overall, this regression analysis helps quantify the long-term trend in candy production, offering valuable insights into how production levels have evolved and potentially informing future projections and strategic planning for the industry.

Discussion

The comprehensive analysis of monthly candy production reveals several important trends and insights. From the descriptive statistics, we observe that the mean production is 100.6625 with a median of 102.2785, indicating a relatively consistent production level around this central tendency. The standard deviation of 18.05293 reflects moderate variability in monthly production, while the minimum and maximum values highlight the range within which production fluctuates.

Visualizations further elucidate these patterns. The line plot of monthly candy production over time shows clear seasonal trends, with peaks typically occurring in the months leading up to and including December, likely driven by holiday demand. The histogram of production values demonstrates a near-normal distribution, centered around the mean, confirming that most production values lie within a predictable range.

The check for missing values found none, so the linear interpolation step left the data unchanged. The IQR method flagged no outliers, which suggests that the production data contain no extreme values that could skew the results.

Hypothesis testing for November production indicates that the mean production is not significantly different from 120, as the p-value of 0.5775 is well above the common significance level of 0.05. This finding is supported by the 95% confidence interval (117.8559 to 123.7992) encompassing the hypothesized mean.

Finally, the regression analysis shows a positive trend in candy production over time, indicating an overall increase in production levels across the dataset’s timespan. This increase could be attributed to factors such as growing demand, advancements in production technology, or changes in market dynamics.

In conclusion, the analysis describes the main patterns in candy production: a long-term upward trend, a strong seasonal peak in the last quarter of the year, and a November mean close to 120. These findings can inform strategic planning and decision-making in the candy industry, helping manufacturers optimize production schedules and meet market demands. The conclusions rest on descriptive statistics, a one-sample t-test, and a simple linear trend; a forecasting model would still be needed to answer the forecasting part of the research question.

References

  1. Federal Reserve Bank of St. Louis (FRED). Industrial Production: Manufacturing: Non-Durable Goods: Sugar and Confectionery Product (NAICS = 3113) [IPG3113N]. Retrieved from FRED, Federal Reserve Bank of St. Louis, May 28, 2024.

  2. Kaggle. Candy Production: Time Series Analysis. Retrieved from Kaggle.

  3. Towards AI. Time Series Forecasting with ARIMA Models In Python [Part 1]. Retrieved from Towards AI.