We aim to implement an online recursive least squares algorithm to build an autoregressive time series model that predicts daily Citibike trips in NYC. The model uses:
The high temperature of the day (TMAX).
The number of daily trips on each of the previous 7 days (seven lagged values).
The goal is to minimize the mean squared error of predictions over time and observe how the model coefficients and \(R^2\) evolve.
For each day \(t > 7\):
\[ N_{\text{trips}, t} = \sum_{i=1}^{7} \theta_i \, N_{\text{trips}, t-i} + \theta_8 \, \mathrm{TMAX}_t \]
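Rather than refitting from scratch each day, the code below maintains the normal-equation quantities recursively: with \(x_t\) denoting the feature vector (the seven lags plus TMAX) and \(y_t = N_{\text{trips},t}\), each new observation triggers
\[ G_t = G_{t-1} + x_t x_t^\top, \qquad h_t = h_{t-1} + y_t x_t, \qquad \hat{\theta}_t = G_t^{-1} h_t, \]
which is exactly the update G <- G + x_new %*% t(x_new); h <- h + y_new * x_new; theta <- solve(G, h) performed at every step of the loop.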
citibike_data <- read_csv("Daily_citibike_trips.csv")
## Rows: 3949 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): daily_trips, TMAX
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(citibike_data)
## # A tibble: 6 × 3
## date daily_trips TMAX
## <date> <dbl> <dbl>
## 1 2013-05-31 362 90
## 2 2013-06-01 8959 90
## 3 2013-06-02 15467 88
## 4 2013-06-03 7700 78
## 5 2013-06-04 15849 75
## 6 2013-06-05 15724 74
citibike_data <- citibike_data %>%
mutate(across(everything(), as.numeric)) %>%
arrange(date)
# Create lag features: trips on each of the previous 7 days
for (i in 1:7) {
  citibike_data <- citibike_data %>%
    mutate(!!paste0("lag_", i) := lag(daily_trips, i))
}
citibike_data <- na.omit(citibike_data)
X <- as.matrix(citibike_data %>% select(starts_with("lag_"), TMAX))
y <- citibike_data$daily_trips
n <- ncol(X)
t <- nrow(X)
# Initialize the normal equations G = X'X and h = X'y on the first n observations
G <- t(X[1:n, ]) %*% X[1:n, ]
h <- t(X[1:n, ]) %*% y[1:n]
theta <- solve(G, h)
theta_history <- matrix(NA, nrow = t - n, ncol = n)
R2_history <- numeric(t - n)
for (i in (n + 1):t) {
  x_new <- matrix(X[i, ], nrow = n, ncol = 1)
  y_new <- y[i]
  # Rank-one updates of the normal equations, then re-solve for theta
  G <- G + x_new %*% t(x_new)
  h <- h + y_new * x_new
  theta <- solve(G, h)
  theta_history[i - n, ] <- theta
  # In-sample R^2 over all observations seen so far, using the current theta
  R2_history[i - n] <- 1 - sum((y[1:i] - X[1:i, ] %*% theta)^2) / sum((y[1:i] - mean(y[1:i]))^2)
}
theta_history_df <- as_tibble(theta_history)
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
theta_history_df$time <- (n + 1):t
theta_history_df %>%
pivot_longer(-time, names_to = "Coefficient", values_to = "Value") %>%
ggplot(aes(x = time, y = Value, color = Coefficient)) +
geom_line() +
labs(title = "Evolution of Coefficients over Time",
x = "Time", y = "Coefficient Value")
tibble(time = (n + 1):t, R2 = R2_history) %>%
ggplot(aes(x = time, y = R2)) +
geom_line() +
labs(title = "R² over Time",
x = "Time", y = "R²")
Stability and trends in coefficients: Initially, the coefficients exhibit large fluctuations due to limited data and instability in the recursive updates. However, as more data accumulates over time, the coefficients stabilize and converge towards consistent values, indicating a stable autoregressive model.
R² over time: The \(R^2\) value starts relatively low with noticeable volatility during the early periods. As the model receives more observations, \(R^2\) improves and stabilizes around values between 0.75 and 0.9, demonstrating that the model’s explanatory power increases as the data grows.
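As an aside, the same coefficient path can be produced without re-solving an \(n \times n\) linear system at every step: tracking \(P_t = G_t^{-1}\) directly with the Sherman-Morrison rank-one update gives the classical recursive least squares form. A minimal sketch, reusing the X and y built above (rls_sherman_morrison is a hypothetical helper for illustration, not part of the analysis):
# Sketch: recursive least squares via the Sherman-Morrison update of P = G^{-1}.
# Hypothetical helper, for illustration only.
rls_sherman_morrison <- function(X, y, n_init = ncol(X)) {
  # Initialize P = (X'X)^{-1} and theta on the first n_init observations
  P     <- solve(t(X[1:n_init, , drop = FALSE]) %*% X[1:n_init, , drop = FALSE])
  theta <- P %*% (t(X[1:n_init, , drop = FALSE]) %*% y[1:n_init])
  for (i in (n_init + 1):nrow(X)) {
    x  <- matrix(X[i, ], ncol = 1)
    Px <- P %*% x
    # Rank-one inverse update: (G + x x')^{-1} = P - (P x)(P x)' / (1 + x' P x)
    P <- P - (Px %*% t(Px)) / as.numeric(1 + t(x) %*% Px)
    # Gain-form coefficient update: theta <- theta + P x (y_i - x' theta)
    theta <- theta + (P %*% x) * as.numeric(y[i] - t(x) %*% theta)
  }
  drop(theta)
}
# rls_sherman_morrison(X, y) should reproduce the final theta from the loop above.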
To account for a potential nonlinear effect of temperature on Citibike trips, we define a new feature that penalizes departures from a comfortable temperature: a concave transform peaking at 75°F, so that both unusually cold and unusually hot days lower the feature's value. For example:
citibike_data <- citibike_data %>%
mutate(TMAX_nonlin = -(TMAX - 75)^2)
X_nonlin <- as.matrix(citibike_data %>% select(starts_with("lag_"), TMAX_nonlin))
y <- citibike_data$daily_trips
n <- ncol(X_nonlin)
t <- nrow(X_nonlin)
G <- t(X_nonlin[1:n, ]) %*% X_nonlin[1:n, ]
h <- t(X_nonlin[1:n, ]) %*% y[1:n]
theta <- solve(G, h)
theta_history_nonlin <- matrix(NA, nrow = t - n, ncol = n)
R2_history_nonlin <- numeric(t - n)
# Same recursive updates as before, with TMAX_nonlin as the temperature feature
for (i in (n + 1):t) {
  x_new <- matrix(X_nonlin[i, ], nrow = n, ncol = 1)
  y_new <- y[i]
  G <- G + x_new %*% t(x_new)
  h <- h + y_new * x_new
  theta <- solve(G, h)
  theta_history_nonlin[i - n, ] <- theta
  R2_history_nonlin[i - n] <- 1 - sum((y[1:i] - X_nonlin[1:i, ] %*% theta)^2) / sum((y[1:i] - mean(y[1:i]))^2)
}
theta_history_nonlin_df <- as_tibble(theta_history_nonlin)
theta_history_nonlin_df$time <- (n + 1):t
theta_history_nonlin_df %>%
pivot_longer(-time, names_to = "Coefficient", values_to = "Value") %>%
ggplot(aes(x = time, y = Value, color = Coefficient)) +
geom_line() +
labs(title = "Evolution of Coefficients over Time (Nonlinear Temp)",
x = "Time", y = "Coefficient Value")
tibble(time = (n + 1):t, R2 = R2_history_nonlin) %>%
ggplot(aes(x = time, y = R2)) +
geom_line() +
labs(title = "R² over Time (Nonlinear Temp)",
x = "Time", y = "R²")
Stability of coefficients: Similar to the linear model, the coefficients in the nonlinear model show initial instability but quickly stabilize and converge near zero. This indicates that the model remains stable over time, even with the nonlinear temperature transformation.
\(R^2\) performance: The \(R^2\) values of the nonlinear model follow a similar trajectory to the linear model, with early fluctuations and gradual improvement. However, there is no significant improvement in \(R^2\) compared to the linear model, indicating that the nonlinear transformation does not greatly enhance model performance.
The model remains robust and stable, but the nonlinear feature provides minimal added value over the original linear model.
We aim to analyze the relationship between social mobility and city characteristics using Weighted Least Squares to account for variance differences across cities with varying populations.
The dataset social_mobility.csv includes:
Fraction of individuals moving from the bottom 20% to the top 20% income percentiles.
City population.
Socio-economic factors (commute time, student-teacher ratio).
social_mobility <- read_csv("social_mobility.csv")
head(social_mobility)
## # A tibble: 6 × 7
## ID Name Mobility State Population Student_teacher_ratio Commute
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 301 Middlesborough 0.0726 TN 66708 15.1 0.359
## 2 401 Winston-Salem 0.0448 NC 493180 15.4 0.292
## 3 500 Greensboro 0.0474 NC 1055133 16.7 0.305
## 4 601 North Wilkesboro 0.0517 NC 90016 16.2 0.289
## 5 602 Galax 0.0796 VA 64676 12.3 0.325
## 6 700 Spartanburg 0.0431 SC 354533 15.9 0.299
# Scatter plot
social_mobility %>%
ggplot(aes(x = Population, y = Mobility)) +
geom_point() +
scale_x_log10() +
labs(title = "Social Mobility vs. Population",
x = "Population (log scale)",
y = "Social Mobility")
We model each city's mobility rate as a binomial proportion: the observed mobility is the share of "successes" (individuals who move from the bottom 20% to the top 20%) among roughly population-many "trials". Under this assumption the variance of the observed proportion is
\[ \operatorname{Var}(\text{mobility}) = \frac{\text{mobility}\,(1 - \text{mobility})}{\text{population}}, \]
so larger cities yield more precise mobility estimates, and we weight each observation by the inverse of this variance.
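To make the weighting concrete, here is an illustration with made-up numbers (mobility fixed at 5% for two hypothetical city sizes; these are not rows from the data):
# Hypothetical cities with identical mobility but very different populations
var_small <- 0.05 * (1 - 0.05) / 5e4   # city of 50,000
var_large <- 0.05 * (1 - 0.05) / 5e6   # city of 5,000,000
c(var_small = var_small, var_large = var_large, weight_ratio = var_small / var_large)
# The larger city's estimate has 1/100th the variance, so it receives 100 times the weight.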
social_mobility <- social_mobility %>%
mutate(variance = (Mobility * (1 - Mobility)) / Population,
weight = 1 / variance)
# Weighted Least Squares Model
wls_model <- lm(Mobility ~ Commute + Student_teacher_ratio,
data = social_mobility,
weights = weight)
summary(wls_model)
##
## Call:
## lm(formula = Mobility ~ Commute + Student_teacher_ratio, data = social_mobility,
## weights = weight)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -289.28 -13.16 12.95 35.02 443.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0025339 0.0088872 -0.285 0.776
## Commute 0.0623962 0.0109843 5.680 1.97e-08 ***
## Student_teacher_ratio 0.0031117 0.0003937 7.904 1.06e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.62 on 696 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.09526, Adjusted R-squared: 0.09266
## F-statistic: 36.64 on 2 and 696 DF, p-value: 7.406e-16
# Ordinary Least Squares Model for Comparison
ols_model <- lm(Mobility ~ Commute + Student_teacher_ratio,
data = social_mobility)
summary(ols_model)
##
## Call:
## lm(formula = Mobility ~ Commute + Student_teacher_ratio, data = social_mobility)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.15904 -0.02260 -0.00537 0.01868 0.31943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0359572 0.0162414 2.214 0.0272 *
## Commute 0.2118423 0.0129576 16.349 <2e-16 ***
## Student_teacher_ratio -0.0018822 0.0007617 -2.471 0.0137 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04273 on 696 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.3562, Adjusted R-squared: 0.3544
## F-statistic: 192.6 on 2 and 696 DF, p-value: < 2.2e-16
Weighted Least Squares vs. Ordinary Least Squares coefficients:
In the Weighted Least Squares model, the coefficient for Commute is 0.0624 and for Student_teacher_ratio is 0.0031.
In the Ordinary Least Squares model, the coefficient for Commute is 0.2118 and for Student_teacher_ratio is -0.0019; the estimates differ substantially in magnitude, and the Student_teacher_ratio coefficient even flips sign.
Interpretation:
Weighted Least Squares downweights small cities, whose mobility estimates are noisy, which yields smaller and more stable coefficient estimates.
Ordinary Least Squares treats every city equally, so the noisier small cities exert relatively more influence on the fit, inflating the estimated effect sizes and their variance.
Model Fit:
Weighted Least Squares reports a lower R-squared (0.0953) than Ordinary Least Squares (0.3562). The two values are not directly comparable, since the weighted R-squared is computed from the weighted residuals; still, the gap suggests that the OLS fit leans heavily on the noisier small cities, whereas Weighted Least Squares favors stability by downweighting them.
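As a quick sanity check on the weighting (a sketch, not part of the assignment output), weighted least squares is equivalent to ordinary least squares after scaling the response and every regressor, including the intercept column, by \(\sqrt{\text{weight}}\); the coefficients below should agree with summary(wls_model). The names wls_check and manual_wls are illustrative:
# Equivalent OLS on sqrt(weight)-scaled variables (intercept handled by the `sw` column)
wls_check <- social_mobility %>%
  mutate(sw = sqrt(weight))
manual_wls <- lm(I(sw * Mobility) ~ 0 + sw + I(sw * Commute) + I(sw * Student_teacher_ratio),
                 data = wls_check)
coef(manual_wls)  # should match coef(wls_model), up to term names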
We aim to construct optimal portfolios using Markowitz Portfolio Optimization to minimize risk while targeting specific expected returns, based on asset returns data from 2020-2024.
The dataset stock_returns.csv includes:
Company: Ticker symbol.
date: Date of the observation.
adjusted: Adjusted closing price.
return: Daily return.
log_return: Logarithm of the gross daily return, log(1 + return).
We will divide the data into:
Training period: 2020-2022 (years up to and including 2022).
Testing period: 2023-2024.
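For reference, the quadratic program that the optimize_portfolio function below hands to solve.QP is, with \(\Gamma\) the covariance matrix of training log returns, \(\mu\) the vector of annualized training returns, and \(r\) the target return:
\[ \min_{w} \; w^\top \Gamma w \quad \text{subject to} \quad \mathbf{1}^\top w = 1, \quad \mu^\top w = r, \quad w \ge 0. \]
The non-negativity constraint rules out short positions, which is why the reported leverage equals 1 by construction.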
library(lubridate)
library(tidyr)
library(quadprog)
# Load the data
stock_data <- read_csv("stock_returns.csv")
# Split into training and testing sets
train_data <- stock_data %>% filter(year(date) <= 2022)
test_data <- stock_data %>% filter(year(date) > 2022)
# Calculate annualized returns
annualized_returns <- train_data %>%
group_by(Company) %>%
summarize(mu = exp(0.5 * sum(log_return)) - 1)
# Create wide format for covariance
returns_wide <- train_data %>%
select(date, Company, log_return) %>%
pivot_wider(names_from = Company, values_from = log_return)
# Covariance matrix
Gamma <- cov(returns_wide %>% select(-date), use = "pairwise.complete.obs")
# Function to solve Markowitz optimization
optimize_portfolio <- function(Gamma, mu, r_target) {
Dmat <- 2 * Gamma
dvec <- rep(0, length(mu))
Amat <- cbind(rep(1, length(mu)), mu, diag(1, length(mu)))
bvec <- c(1, r_target, rep(0, length(mu)))
result <- solve.QP(Dmat, dvec, Amat, bvec, meq = 2)
return(result$solution)
}
# Prepare inputs
mu_vector <- annualized_returns$mu
# Optimize for each target return
weights_r1 <- optimize_portfolio(Gamma, mu_vector, 1.05)
weights_r2 <- optimize_portfolio(Gamma, mu_vector, 1.10)
weights_r3 <- optimize_portfolio(Gamma, mu_vector, 1.20)
# Store the weights
portfolio_weights <- tibble(
Company = annualized_returns$Company,
r_1.05 = weights_r1,
r_1.10 = weights_r2,
r_1.20 = weights_r3
)
portfolio_weights
## # A tibble: 471 × 4
## Company r_1.05 r_1.10 r_1.20
## <chr> <dbl> <dbl> <dbl>
## 1 A -1.30e-17 6.20e-17 2.34e-17
## 2 AAPL 2.07e-16 2.13e-17 3.97e-16
## 3 ABBV 1.07e-16 -5.92e-17 2.11e-16
## 4 ABEV 2.25e-17 1.05e-17 -2.65e-16
## 5 ABT 4.41e-17 -2.64e-17 -9.69e-17
## 6 ACGL 2.47e-17 -3.52e-18 -5.10e-17
## 7 ACN 2.45e-17 -3.45e-18 7.27e-19
## 8 ADBE -7.46e-17 -1.19e-16 -1.15e-16
## 9 ADI -6.64e-16 3.86e-16 -3.66e-16
## 10 ADM 3.15e-17 -1.34e-17 -3.19e-17
## # ℹ 461 more rows
# Prepare test data in wide format
returns_test_wide <- test_data %>%
select(date, Company, log_return) %>%
pivot_wider(names_from = Company, values_from = log_return)
# Function to calculate cumulative returns
test_cum_return <- function(weights, returns_data) {
returns_matrix <- as.matrix(returns_data %>% select(-date))
daily_returns <- returns_matrix %*% weights
cum_returns <- exp(cumsum(daily_returns))
tibble(date = returns_data$date, cumulative_return = cum_returns)
}
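# Note: daily_returns above is the weighted average of asset log returns, which
# approximates the portfolio's true log return only for small daily moves; the
# approximation is close enough for these plots.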
# Cumulative returns for each portfolio
cum_r1 <- test_cum_return(weights_r1, returns_test_wide)
cum_r2 <- test_cum_return(weights_r2, returns_test_wide)
cum_r3 <- test_cum_return(weights_r3, returns_test_wide)
# Plot cumulative returns
ggplot() +
geom_line(data = cum_r1, aes(x = date, y = cumulative_return, color = "r = 1.05")) +
geom_line(data = cum_r2, aes(x = date, y = cumulative_return, color = "r = 1.10")) +
geom_line(data = cum_r3, aes(x = date, y = cumulative_return, color = "r = 1.20")) +
labs(title = "Cumulative Returns of Portfolios", x = "Date", y = "Cumulative Return") +
scale_color_manual(name = "Target Return", values = c("r = 1.05" = "blue", "r = 1.10" = "green", "r = 1.20" = "red"))
# Function to calculate portfolio statistics
portfolio_stats <- function(weights, returns_data) {
returns_matrix <- as.matrix(returns_data %>% select(-date))
daily_returns <- returns_matrix %*% weights
annual_return <- exp(mean(daily_returns) * 252) - 1
risk <- var(daily_returns) * 252
max_weight <- max(weights)
leverage <- sum(abs(weights))
list(annual_return = annual_return, risk = risk, max_weight = max_weight, leverage = leverage)
}
stats_r1 <- portfolio_stats(weights_r1, returns_test_wide)
stats_r2 <- portfolio_stats(weights_r2, returns_test_wide)
stats_r3 <- portfolio_stats(weights_r3, returns_test_wide)
stats_r1
## $annual_return
## [1] -0.01875454
##
## $risk
## [,1]
## [1,] 0.005464002
##
## $max_weight
## [1] 0.6622191
##
## $leverage
## [1] 1
stats_r2
## $annual_return
## [1] -0.02444645
##
## $risk
## [,1]
## [1,] 0.006526361
##
## $max_weight
## [1] 0.6282857
##
## $leverage
## [1] 1
stats_r3
## $annual_return
## [1] -0.03573142
##
## $risk
## [,1]
## [1,] 0.008961743
##
## $max_weight
## [1] 0.5604189
##
## $leverage
## [1] 1
The cumulative return plots show that portfolios with higher target returns (r = 1.20) experienced greater volatility and sharper drawdowns compared to lower target return portfolios (r = 1.05).
Despite aiming for higher returns, the actual realized annual returns on the test set were negative for all portfolios, with -1.88% for r = 1.05, -2.44% for r = 1.10, and -3.57% for r = 1.20.
Risk increased with higher target returns, with variances of 0.0055, 0.0065, and 0.0089 respectively, showing that aiming for higher returns introduced higher realized volatility.
The maximum allocation weight decreased slightly as target returns increased, suggesting more diversification in the higher target return portfolios.
All portfolios have a leverage of exactly 1; this holds by construction, since the non-negativity constraint in the optimization rules out short positions.
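The next step re-solves the r = 1.20 problem with a ridge penalty on the weights. Since the code sets Dmat = 2 * (Gamma + lambda * diag(ncol(Gamma))), the objective being minimized is
\[ w^\top \Gamma w + \lambda \lVert w \rVert_2^2 \quad \text{subject to} \quad \mathbf{1}^\top w = 1, \quad \mu^\top w = 1.20, \quad w \ge 0, \]
so larger \(\lambda\) shrinks the solution toward a more evenly spread allocation.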
lambda_values <- 10^seq(-1, 1, length.out = 10)
penalized_results <- list()
for (lambda in lambda_values) {
Dmat <- 2 * (Gamma + lambda * diag(ncol(Gamma)))
dvec <- rep(0, length(mu_vector))
Amat <- cbind(rep(1, length(mu_vector)), mu_vector, diag(1, length(mu_vector)))
bvec <- c(1, 1.20, rep(0, length(mu_vector)))
result <- solve.QP(Dmat, dvec, Amat, bvec, meq = 2)
penalized_results[[as.character(lambda)]] <- result$solution
}
# Collect statistics for each penalized portfolio
penalized_stats <- tibble(lambda = lambda_values, annual_return = NA, risk = NA, max_weight = NA, leverage = NA)
for (i in seq_along(lambda_values)) {
weights <- penalized_results[[as.character(lambda_values[i])]]
stats <- portfolio_stats(weights, returns_test_wide)
penalized_stats$annual_return[i] <- stats$annual_return
penalized_stats$risk[i] <- stats$risk
penalized_stats$max_weight[i] <- stats$max_weight
penalized_stats$leverage[i] <- stats$leverage
}
penalized_stats
## # A tibble: 10 × 5
## lambda annual_return risk max_weight leverage
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.1 0.116 0.0184 0.307 1.00
## 2 0.167 0.116 0.0186 0.307 1.00
## 3 0.278 0.116 0.0187 0.306 1.00
## 4 0.464 0.116 0.0187 0.306 1.00
## 5 0.774 0.116 0.0187 0.306 1.00
## 6 1.29 0.116 0.0188 0.306 1.00
## 7 2.15 0.116 0.0188 0.306 1.00
## 8 3.59 0.116 0.0188 0.306 1.00
## 9 5.99 0.116 0.0188 0.306 1.00
## 10 10 0.116 0.0188 0.306 1.00
As lambda increases, the out-of-sample annual return stays essentially flat at about 11.6%, a large improvement over the -3.6% of the unpenalized r = 1.20 portfolio, suggesting that even mild regularization avoids the extreme portfolios implied by the training-period estimates.
Risk barely changes across lambda values (from approximately 0.0184 to 0.0188), so within the penalized family the additional diversification costs almost nothing in variance.
The maximum portfolio weight is roughly constant at 0.31, well below the 0.56-0.66 of the unpenalized portfolios, showing that regularization spreads the investment more evenly across assets and reduces reliance on any single stock.
Leverage remains at 1 for every lambda, as required by the non-negativity constraint (no short positions).
Comparing out-of-sample performance with and without the penalty, the ridge term appears to reduce overfitting to the 2020-2022 estimates: the unpenalized r = 1.20 portfolio loses money on the test period, while every penalized portfolio earns a positive return.
Regularization therefore offers a practical way to trade a small amount of in-sample optimality for markedly better out-of-sample generalization.