library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(readr) 
# The dataset df has two variables: Month (an index 1:197) and Price
df <- read_csv("C:/Users/ajasa/Downloads/Copper Prices.csv")
## Rows: 197 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Month, Price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(df)
## Rows: 197
## Columns: 2
## $ Month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
## $ Price <dbl> 0.246, 0.627, 0.529, 0.528, 1.086, 1.001, 1.491, 0.293, 0.189, 0…
# Strip any "$" or "," characters defensively, then coerce back to numeric
df$Price <- as.numeric(gsub("[$,]", "", df$Price))
# Replace the 1:197 index with actual monthly dates starting January 2005
df$Month <- seq.Date(from = as.Date("2005-01-01"), by = "month", length.out = nrow(df))
# Compute mean and sd
price_mean <- mean(df$Price)
price_sd   <- sd(df$Price)

# Plot time series with mean, sd lines, and a trend line
ggplot(df, aes(x = Month, y = Price)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_hline(yintercept = price_mean, color = "purple", linetype = "dashed", linewidth = 1) +
  geom_hline(yintercept = price_mean + price_sd, color = "lightgreen", linetype = "dotted", linewidth = 1) +
  geom_hline(yintercept = price_mean - price_sd, color = "lightgreen", linetype = "dotted", linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "lightblue", linewidth = 1) +
  labs(title = "Copper Price Time Series",
       x = "Date", y = "Price")
## `geom_smooth()` using formula = 'y ~ x'

Prices fluctuate strongly, with sharp peaks above 4.0.

The dashed purple line marks the overall mean (≈ 1); the green dotted lines mark one standard deviation above and below it.

Prices often fall outside this band → a very volatile market.

The regression line (light blue) slopes slightly downward → a weak long-term downward trend.
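
To back these visual impressions with numbers, a minimal sketch (using only objects already defined above) quantifies how often prices leave the ±1 sd band and the slope of the fitted trend:

# Share of months outside the mean ± 1 sd band
outside <- mean(df$Price < price_mean - price_sd | df$Price > price_mean + price_sd)
outside

# Slope of the linear trend; Month is a Date, so the raw slope is per day
trend_fit <- lm(Price ~ as.numeric(Month), data = df)
coef(trend_fit)[2] * 365.25   # approximate price change per year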

# 0) Quick diagnostics
length(df$Price)                  
## [1] 197
ts_sales <- ts(df$Price, start = c(2005, 1), frequency = 12)
length(ts_sales)                     # should match the length above
## [1] 197
frequency(ts_sales)                  # should be 12
## [1] 12
floor(length(ts_sales) / frequency(ts_sales))  # number of complete years
## [1] 16
ts_price <- ts(df$Price)  # non-seasonal ts (frequency = 1), used for the ARIMA fits below

# Monthly time series starting January 2005
ts_sales <- ts(df$Price, start = c(2005, 1), frequency = 12)

# Additive decomposition
decomp_add <- decompose(ts_sales, type = "additive")

# Plot the decomposition components
plot(decomp_add)

Observed: The raw series shows repeated ups and downs.

Trend: Around 2005–2010 there is a clear upward cycle, then flattening and a slight decline.

Seasonal: The decomposition extracts a repeating yearly cycle, although (as discussed below) its amplitude is small relative to the overall volatility.

Random: Remainder captures shocks; spikes correspond to unexpected events.

Overall → a repeating seasonal pattern, medium-term trend shifts, and irregular volatility.
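
To check how strong that seasonal component actually is (this matters for the SARIMA discussion later), one option is the seasonal-strength measure from Hyndman & Athanasopoulos, computed as a sketch from the decomposition above:

# Seasonal strength: 1 - Var(remainder) / Var(seasonal + remainder)
# Values near 0 → weak seasonality; values near 1 → strong seasonality
ok   <- !is.na(decomp_add$random)
rem  <- decomp_add$random[ok]
seas <- decomp_add$seasonal[ok]
max(0, 1 - var(rem) / var(seas + rem))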

acf(df$Price, main = "Autocorrelation Function (ACF)")    # informs the MA order (q)

pacf(df$Price, main = "Partial Autocorrelation Function (PACF)")  # informs the AR order (p)

ACF: High autocorrelation at all lags (bars close to 1). This indicates strong persistence and seasonality. Suggests that ARIMA or SARIMA models are necessary because simple regression would leave correlated residuals.

PACF: Strong spike at lag 1, then a quick drop to near zero. This suggests an AR(1) structure is dominant: the current value depends strongly on the immediately previous one.

Later lags are insignificant.
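
Because the raw-series ACF is dominated by persistence, it is worth repeating the plots on the first-differenced series; assuming differencing removes most of the trend, these are the plots that actually guide the ARMA orders:

# ACF/PACF of the differenced series: short memory here supports low p and q
acf(diff(df$Price),  main = "ACF of differenced prices")
pacf(diff(df$Price), main = "PACF of differenced prices")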

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# Simple linear trend; Durbin-Watson tests for autocorrelation in its residuals
model_lm <- lm(Price ~ Month, data = df)
dwtest(model_lm)
## 
##  Durbin-Watson test
## 
## data:  model_lm
## DW = 0.72065, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
#4
# Augmented Dickey-Fuller test:
# p-value < 0.05 → reject H0 (unit root) → data is stationary.
# Equivalently, the series is stationary if beta < 1.
library(tseries)
adf_test <- adf.test(df$Price)
adf_test
## 
##  Augmented Dickey-Fuller Test
## 
## data:  df$Price
## Dickey-Fuller = -3.6524, Lag order = 5, p-value = 0.02986
## alternative hypothesis: stationary

There is strong positive autocorrelation (DW = 0.72, p < 2.2e-16) in the residuals of the simple regression.

The ADF test (p ≈ 0.03) rejects the unit root at the 5% level, though only marginally.
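
Because the ADF rejection is marginal, a complementary check helps. A minimal sketch using the KPSS test (whose null is stationarity, the reverse of ADF) and forecast::ndiffs, which picks a differencing order automatically:

# KPSS: H0 = level-stationary; a small p-value here would contradict the ADF result
kpss.test(df$Price, null = "Level")

# Automatic estimate of the differencing order d
ndiffs(ts_price)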

arma_model <- Arima(ts_price, order = c(2,1,2))
summary(arma_model)
## Series: ts_price 
## ARIMA(2,1,2) 
## 
## Coefficients:
##          ar1      ar2      ma1      ma2
##       0.5394  -0.0386  -0.7594  -0.2202
## s.e.  0.2449   0.1771   0.2361   0.2358
## 
## sigma^2 = 0.4951:  log likelihood = -208.33
## AIC=426.66   AICc=426.97   BIC=443.05
## 
## Training set error measures:
##                        ME      RMSE       MAE       MPE    MAPE      MASE
## Training set -0.003003515 0.6946641 0.4917185 -125.6836 152.687 0.9338839
##                     ACF1
## Training set 0.001625522
# p-values for the coefficients
coefs <- coef(arma_model)
se <- sqrt(diag(arma_model$var.coef))
t_values <- coefs / se     # compute t-values and p-values
p_values <- 2 * (1 - pnorm(abs(t_values)))
results <- data.frame(Coefficient = coefs, StdError = se, t_value = t_values, p_value = p_values)
print(results)   # combine everything into one table
##     Coefficient  StdError    t_value     p_value
## ar1  0.53936997 0.2448646  2.2027271 0.027613986
## ar2 -0.03863921 0.1770760 -0.2182069 0.827267922
## ma1 -0.75944098 0.2361100 -3.2164705 0.001297779
## ma2 -0.22020289 0.2357862 -0.9339093 0.350350666
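
The same significance table can be obtained in one call with lmtest::coeftest (already loaded above), which should reproduce the manual z-tests, assuming it dispatches on the Arima fit as it does for base arima objects:

# One-call alternative to the manual table above
coeftest(arma_model)
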
arma_model2 <- arima(ts_price, order = c(2,1,2))  # same model via base arima(), for comparison
summary(arma_model2)
## 
## Call:
## arima(x = ts_price, order = c(2, 1, 2))
## 
## Coefficients:
##          ar1      ar2      ma1      ma2
##       0.5394  -0.0386  -0.7594  -0.2202
## s.e.  0.2449   0.1771   0.2361   0.2358
## 
## sigma^2 estimated as 0.485:  log likelihood = -208.33,  aic = 426.66
## 
## Training set error measures:
##                        ME      RMSE       MAE       MPE    MAPE      MASE
## Training set -0.003003515 0.6946641 0.4917185 -125.6836 152.687 0.9338839
##                     ACF1
## Training set 0.001625522
checkresiduals(arma_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)
## Q* = 8.5736, df = 6, p-value = 0.199
## 
## Model df: 4.   Total lags used: 10
arima_model <- auto.arima(ts_price)
summary(arima_model)
## Series: ts_price 
## ARIMA(1,0,3) with non-zero mean 
## 
## Coefficients:
##          ar1      ma1      ma2      ma3    mean
##       0.8695  -0.0925  -0.2958  -0.1809  1.0607
## s.e.  0.0865   0.1129   0.0921   0.0751  0.1590
## 
## sigma^2 = 0.4809:  log likelihood = -205.24
## AIC=422.47   AICc=422.91   BIC=442.17
## 
## Training set error measures:
##                      ME      RMSE      MAE       MPE     MAPE      MASE
## Training set 0.00486821 0.6846369 0.490417 -124.3236 151.6009 0.9314122
##                     ACF1
## Training set -0.01305183

Training errors (RMSE ≈ 0.68) show a reasonable fit.

The Ljung–Box test p-value (0.199) is above 0.05, so the residuals look like white noise.

Coefficients: the significant AR term is positive and the significant MA term is negative; this mix captures short-term shocks effectively.

ar1 is significant (p < 0.05), confirming dependence on the last value; ma1 is significant (p < 0.01), so shocks from the previous period matter; ar2 and ma2 are not significant.

This suggests the simpler ARIMA(1,1,1) might perform just as well.

Residuals fluctuate around zero with constant variance. Histogram roughly normal, although with some extreme events. ACF of residuals mostly within confidence bands → residuals behave like white noise.

This validates the chosen ARIMA model.
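
As a complementary numeric check on the normality claim, a quick sketch with the Shapiro–Wilk test (note that with n = 197 the test is quite sensitive, so the heavy tails from extreme events may well reject):

# H0: residuals are normally distributed
shapiro.test(residuals(arma_model))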

arma_model5 <- Arima(ts_price, order = c(1,1,1))
summary(arma_model5)
## Series: ts_price 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1     ma1
##       0.6536  -1.000
## s.e.  0.0552   0.084
## 
## sigma^2 = 0.5025:  log likelihood = -211.53
## AIC=429.06   AICc=429.19   BIC=438.9
## 
## Training set error measures:
##                         ME      RMSE      MAE       MPE     MAPE      MASE
## Training set -0.0008335256 0.7034393 0.487828 -127.1974 152.9119 0.9264951
##                   ACF1
## Training set 0.1072183
# p-values for the coefficients
coefs <- coef(arma_model5)
se <- sqrt(diag(arma_model5$var.coef))
t_values <- coefs / se     # compute t-values and p-values
p_values <- 2 * (1 - pnorm(abs(t_values)))
results <- data.frame(Coefficient = coefs, StdError = se, t_value = t_values, p_value = p_values)
print(results)   # combine everything into one table
##     Coefficient   StdError   t_value p_value
## ar1   0.6535877 0.05515540  11.84993       0
## ma1  -0.9999887 0.08401767 -11.90212       0
arma_model3 <- arima(ts_price, order = c(1,1,1))  # same model via base arima(), for comparison
summary(arma_model3)
## 
## Call:
## arima(x = ts_price, order = c(1, 1, 1))
## 
## Coefficients:
##          ar1     ma1
##       0.6536  -1.000
## s.e.  0.0552   0.084
## 
## sigma^2 estimated as 0.4974:  log likelihood = -211.53,  aic = 429.06
## 
## Training set error measures:
##                         ME      RMSE      MAE       MPE     MAPE      MASE
## Training set -0.0008335256 0.7034393 0.487828 -127.1974 152.9119 0.9264951
##                   ACF1
## Training set 0.1072183
checkresiduals(arma_model5)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)
## Q* = 14.469, df = 8, p-value = 0.07033
## 
## Model df: 2.   Total lags used: 10
arima_model <- auto.arima(ts_price)
summary(arima_model)
## Series: ts_price 
## ARIMA(1,0,3) with non-zero mean 
## 
## Coefficients:
##          ar1      ma1      ma2      ma3    mean
##       0.8695  -0.0925  -0.2958  -0.1809  1.0607
## s.e.  0.0865   0.1129   0.0921   0.0751  0.1590
## 
## sigma^2 = 0.4809:  log likelihood = -205.24
## AIC=422.47   AICc=422.91   BIC=442.17
## 
## Training set error measures:
##                      ME      RMSE      MAE       MPE     MAPE      MASE
## Training set 0.00486821 0.6846369 0.490417 -124.3236 151.6009 0.9314122
##                     ACF1
## Training set -0.01305183

ARIMA(1,1,1) provides a reasonable fit; the very high MAPE is partly an artifact of prices close to zero (MAPE divides by the actual value), but it also reflects the strong variability that makes forecasting challenging. The series shows strong positive autocorrelation with the immediately previous value.

The Augmented Dickey–Fuller (ADF) test rejected the unit root only marginally (p ≈ 0.03), and the ACF showed strong persistence, so we used d = 1 differencing to stabilize the series.

ARIMA(1,1,1) is one of the simplest models that captures both autoregressive and moving average dynamics after differencing.

Both parameters (AR1 and MA1) came out highly significant, which confirms the structure was appropriate.
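
To make the comparison concrete, a small summary table can be assembled from the fitted objects above (AICc is stored on the forecast fits; RMSE comes from accuracy()):

# Side-by-side comparison of the candidate models
data.frame(
  model = c("ARIMA(2,1,2)", "ARIMA(1,1,1)", "auto.arima ARIMA(1,0,3)"),
  AICc  = c(arma_model$aicc, arma_model5$aicc, arima_model$aicc),
  RMSE  = c(accuracy(arma_model)[, "RMSE"],
            accuracy(arma_model5)[, "RMSE"],
            accuracy(arima_model)[, "RMSE"])
)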

# Note: ts_price has frequency = 1, so auto.arima cannot select seasonal terms here;
# a true SARIMA search would need the frequency-12 series ts_sales
sarima_model <- auto.arima(ts_price, seasonal = TRUE)
summary(sarima_model)
## Series: ts_price 
## ARIMA(1,0,3) with non-zero mean 
## 
## Coefficients:
##          ar1      ma1      ma2      ma3    mean
##       0.8695  -0.0925  -0.2958  -0.1809  1.0607
## s.e.  0.0865   0.1129   0.0921   0.0751  0.1590
## 
## sigma^2 = 0.4809:  log likelihood = -205.24
## AIC=422.47   AICc=422.91   BIC=442.17
## 
## Training set error measures:
##                      ME      RMSE      MAE       MPE     MAPE      MASE
## Training set 0.00486821 0.6846369 0.490417 -124.3236 151.6009 0.9314122
##                     ACF1
## Training set -0.01305183
# Plot forecast
autoplot(forecast(sarima_model, h = 12))

# Auto ARIMA
fit <- auto.arima(ts_price)
summary(fit)
## Series: ts_price 
## ARIMA(1,0,3) with non-zero mean 
## 
## Coefficients:
##          ar1      ma1      ma2      ma3    mean
##       0.8695  -0.0925  -0.2958  -0.1809  1.0607
## s.e.  0.0865   0.1129   0.0921   0.0751  0.1590
## 
## sigma^2 = 0.4809:  log likelihood = -205.24
## AIC=422.47   AICc=422.91   BIC=442.17
## 
## Training set error measures:
##                      ME      RMSE      MAE       MPE     MAPE      MASE
## Training set 0.00486821 0.6846369 0.490417 -124.3236 151.6009 0.9314122
##                     ACF1
## Training set -0.01305183

The forecast shows a continuation of the recent level, with uncertainty widening into the future. The blue cone reflects volatility; actual future prices may deviate widely. The model predicts stabilization rather than explosive growth or decline.
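
The numbers behind the plot can be inspected directly; printing the forecast object gives the 12-month point forecasts with their 80% and 95% prediction intervals:

fc <- forecast(fit, h = 12)
fc   # point forecasts and prediction intervals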

We use ARIMA instead of ARMA because the series is borderline stationary and benefits from differencing. We don’t use SARIMA because the data does not show strong seasonal autocorrelation; the seasonal effect is weak compared to the volatility and trend, so ARIMA already captures the main dynamics (see the sketch below).
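
That claim can be verified on the monthly-frequency series: nsdiffs() estimates the required seasonal differencing order, and running the seasonal search on ts_sales (frequency = 12) lets auto.arima actually consider seasonal terms. A sketch, assuming ts_sales as built above:

# 0 means no seasonal differencing is needed
nsdiffs(ts_sales)

# Seasonal search on the frequency-12 series; if no seasonal terms are selected,
# that supports dropping SARIMA
auto.arima(ts_sales, seasonal = TRUE)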

Based on the statistical tests and diagnostic analysis, ARIMA(1,1,1) was identified as the most appropriate model for the copper price series. The stationarity diagnostics (a marginal ADF rejection together with strong residual autocorrelation) supported first differencing (d = 1), which rules out a plain ARMA model. While ARIMA(2,1,2) showed a slightly lower RMSE and AIC, several of its parameters were not statistically significant, indicating possible overfitting. ARIMA(1,1,1) provided a parsimonious structure in which both coefficients were highly significant, the residuals behaved like white noise, and the model assumptions were satisfied. For these reasons, ARIMA(1,1,1) offers the best balance between accuracy, interpretability, and robustness for forecasting copper prices.
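
As a final robustness check (a sketch, not part of the analysis above), the last 12 months can be held out to compare the out-of-sample accuracy of ARIMA(1,1,1) against the auto.arima pick:

# Hold out the last 12 observations
n     <- length(ts_price)
train <- window(ts_price, end = n - 12)
test  <- window(ts_price, start = n - 11)

fit_111  <- Arima(train, order = c(1, 1, 1))
fit_auto <- auto.arima(train)

# Out-of-sample error measures for each model
accuracy(forecast(fit_111,  h = 12), test)["Test set", ]
accuracy(forecast(fit_auto, h = 12), test)["Test set", ]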