During the peak of the COVID-19 pandemic, the financial world, like every other industry, experienced shockwaves. As a data science enthusiast and market observer, I wanted to answer a fundamental question: can stock prices during COVID-19 be predicted using historical and pandemic-related features?
Supervised Learning
Regression Problem
Feature Selection (Independent Variables)
These features capture the necessary signals for stock movement:
Open, High, Low, Volume, Adjusted Close
Date, transformed into components like day-of-week or month
Moving averages (MA5, MA30)
Daily price range (High - Low)
Percent change ((Close - Open) / Open * 100)
By comparing these models, I can evaluate not just predictive performance, but also interpretability and robustness. The trade-off between accuracy and explainability will help in selecting the best-fit model for future extensions.
The dataset used in this project was sourced from Kaggle, specifically the “FAANG Stocks COVID-19 (01/01/2020 - 04/01/2022)” dataset, created by Paris Rohan. This dataset contains historical stock price data for FAANG companies—Facebook (Meta), Amazon, Apple, Netflix, and Google (Alphabet)—over the specified period.
Before diving deeper, I made sure the dataset was clean. Fortunately,
there were no missing values—a relief for any analyst! However, I
noticed an unnecessary index column (Unnamed: 0), which I
removed.
Next, I converted the Date column to a proper date format and engineered several features I hypothesized would be valuable.
Here’s the preprocessing pipeline I used:
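library(tidyverse)  # assumed setup: dplyr verbs, tidyr::drop_na(), ggplot2

faang <- faang %>%
  mutate(
    Daily_Price_Range = High - Low,                  # intraday volatility
    Percent_Change = ((Close - Open) / Open) * 100,  # open-to-close return in percent
    # Trailing moving averages. Without a preceding group_by(Name), the
    # windows run over the table in row order, not per company.
    `5_day_MA` = zoo::rollmean(Close, 5, fill = NA, align = "right"),
    `30_day_MA` = zoo::rollmean(Close, 30, fill = NA, align = "right"),
    Pandemic_Volume = Volume * 1.5,                  # scaled copy of Volume
    Pandemic_Close = Close * 1.2                     # scaled copy of Close, used as the target
  ) %>%
  drop_na()  # removes the leading rows where the moving averages are NA

summary(faang)
## ...1 Date High Low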
## Min. : 683.0 Min. :2020-01-02 Min. : 57.12 Min. : 53.15
## 1st Qu.: 829.0 1st Qu.:2020-07-31 1st Qu.: 239.88 1st Qu.: 231.89
## Median : 969.0 Median :2021-02-22 Median : 512.35 Median : 498.65
## Mean : 969.3 Mean :2021-02-19 Mean :1219.40 Mean :1190.18
## 3rd Qu.:1110.0 3rd Qu.:2021-09-13 3rd Qu.:2389.02 3rd Qu.:2346.36
## Max. :1250.0 Max. :2022-04-01 Max. :3773.08 Max. :3696.79
## Open Close Volume Adj Close
## Min. : 57.02 Min. : 56.09 Min. : 465600 Min. : 55.33
## 1st Qu.: 235.88 1st Qu.: 235.66 1st Qu.: 2751750 1st Qu.: 235.66
## Median : 505.66 Median : 504.58 Median : 5487300 Median : 504.58
## Mean :1205.15 Mean :1204.98 Mean : 31149454 Mean :1204.84
## 3rd Qu.:2369.14 3rd Qu.:2363.18 3rd Qu.: 26846450 3rd Qu.:2363.18
## Max. :3744.00 Max. :3731.41 Max. :426510000 Max. :3731.41
## Name Daily_Price_Range Percent_Change 5_day_MA
## Length:2811 Min. : 0.67 Min. :-7.47913 Min. : 59.54
## Class :character 1st Qu.: 5.75 1st Qu.:-0.93136 1st Qu.: 236.75
## Mode :character Median : 14.05 Median : 0.04255 Median : 507.34
## Mean : 29.22 Mean : 0.04978 Mean :1203.14
## 3rd Qu.: 43.13 3rd Qu.: 1.10643 3rd Qu.:2358.04
## Max. :275.90 Max. :11.17361 Max. :3708.65
## 30_day_MA Pandemic_Volume Pandemic_Close
## Min. : 64.73 Min. : 698400 Min. : 67.31
## 1st Qu.: 246.09 1st Qu.: 4127625 1st Qu.: 282.80
## Median : 507.77 Median : 8230950 Median : 605.50
## Mean :1192.00 Mean : 46724182 Mean :1445.98
## 3rd Qu.:2290.35 3rd Qu.: 40269675 3rd Qu.:2835.82
## Max. :3574.35 Max. :639765000 Max. :4477.69
Before diving into modeling, I ensured the dataset was clean, structured, and enriched with relevant features to capture stock behavior during the pandemic.
Imported the Dataset:
I loaded the FAANG stock dataset using read_csv(), which
included columns such as Open, Close,
High, Low, Volume, and
Adj Close.
Checked for Missing Values:
A quick summary using summary() and anyNA()
confirmed that there were no missing values—always a relief! This
ensures a consistent and complete dataset for analysis.
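Dropped Irrelevant Columns:
The dataset included an unnamed index column left over from a prior
save ("Unnamed: 0" when read with pandas; read_csv() names it ...1).
Since it holds no analytical value, it is left out of the analysis:
dplyr's select() can drop it, and it never enters the model formula,
even though the printed outputs in this write-up still list it as ...1.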
Previewed the Structure:
I used head() and glimpse() to verify that
each variable was imported correctly and aligned with financial
conventions. It includes critical information like “Date,” “High,”
“Low,” “Open,” “Close,” “Volume,” and “Adj Close,” along with the
company name (“Name”) for stock identification.
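As a minimal sketch, the checks described above can be run on a small made-up data frame. The values are illustrative only, the column ...1 mimics the stray index column, and in the project the data comes from read_csv() rather than being typed in by hand.

```r
library(dplyr)

# Made-up stand-in for the imported CSV (values are illustrative only).
toy <- data.frame(
  idx   = 0:2,
  Date  = c("2020-01-02", "2020-01-03", "2020-01-06"),
  Close = c(300.35, 297.43, 299.80),
  Name  = "Apple"
)
names(toy)[1] <- "...1"  # mimic the name readr gives an unnamed column

anyNA(toy)           # FALSE when no value is missing
colSums(is.na(toy))  # missing-value count per column

# Drop the index column by name, then preview the structure.
toy_clean <- select(toy, -all_of("...1"))
glimpse(toy_clean)
```

Selecting by name with all_of() is a design choice here: it fails loudly if the column is absent instead of silently dropping the wrong one.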
Handling of missing values and outliers:
As noted earlier, there were no missing values in the dataset, so no
imputation or removal was necessary. Outliers in stock price data can
skew analysis, so they are identified in two ways: by plotting boxplots
for numerical columns like “High,” “Low,” “Open,” “Close,” and “Volume,”
and by using statistical methods such as the interquartile range (IQR)
to detect extreme values. If significant outliers are found, their
treatment (e.g., capping, removal) will depend on their potential impact
on the model’s performance.
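The IQR rule mentioned above can be sketched in base R. This is only an illustration on a made-up vector of volumes, not the project's data; the helper name iqr_outliers and the conventional multiplier of 1.5 are assumptions.

```r
# Hypothetical helper: flag values outside [Q1 - k * IQR, Q3 + k * IQR].
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  x < (q[1] - k * spread) | x > (q[2] + k * spread)
}

# Made-up trading volumes with one obvious spike.
volume <- c(100, 120, 110, 130, 125, 115, 5000)
flags <- iqr_outliers(volume)

volume[flags]  # the values flagged as outliers
```

The same logical vector could be used to cap or remove the flagged rows, depending on how much they affect the model.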
The "Date" column will be converted into a datetime format to facilitate time-based grouping, sorting, and feature engineering.
New Features:
Daily Price Range: Calculated as High - Low to assess daily volatility.
Percent Change: ((Close - Open) / Open) × 100 to measure daily returns.
Moving Averages: Short-term (e.g., 5-day) and long-term (e.g., 30-day) moving averages for trend analysis.
Cumulative Volume: Running total of the traded volume for each stock.
Company-Specific Filtering: Data will be filtered by the "Name" column to enable focused analysis for individual FAANG companies if necessary.
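The date conversion, cumulative volume, and company filter from the list above are not part of the pipeline shown earlier, so here is a hedged sketch of how they could look with dplyr. The data frame is made up, and the names Cumulative_Volume and apple_only are hypothetical.

```r
library(dplyr)

# Made-up rows standing in for the FAANG data (values are illustrative only).
toy <- data.frame(
  Date   = c("2020-01-02", "2020-01-03", "2020-01-02", "2020-01-03"),
  Name   = c("Apple", "Apple", "Netflix", "Netflix"),
  Volume = c(100, 150, 20, 30)
)

toy <- toy %>%
  mutate(Date = as.Date(Date)) %>%                # text -> Date class
  arrange(Name, Date) %>%                         # chronological order within company
  group_by(Name) %>%
  mutate(Cumulative_Volume = cumsum(Volume)) %>%  # running total per company
  ungroup()

# Company-specific filtering
apple_only <- filter(toy, Name == "Apple")
```

Grouping by Name before cumsum() keeps one company's running total from spilling into the next.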
To better understand the structure and dynamics of the FAANG stock data during the pandemic, I performed a series of exploratory visualizations and statistical summaries.
library(corrplot)  # correlation heatmap; ggplot2 and dplyr are assumed to be loaded already

# Distribution of closing prices across all five companies
ggplot(faang, aes(Close)) +
  geom_histogram(bins = 50, fill = "skyblue", color = "white") +
  labs(title = "Distribution of Closing Prices")

# Trading volume over time, one line per company
ggplot(faang, aes(x = Date, y = Volume, color = Name)) +
  geom_line() +
  labs(title = "Trading Volume Over Time")

# Correlation matrix of the raw price and volume columns
faang %>%
  select(Open, High, Low, Close, Volume) %>%
  cor(use = "complete.obs") %>%
  corrplot(method = "color", tl.col = "white")

The correlation plot shows strong correlations among Open, High, Low, and Close prices.
Close stays near Open, reflecting market stability in many sessions.

# Open vs. Close with a linear fit per company
ggplot(faang, aes(x = Open, y = Close, color = Name)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Open vs. Close Price Relationship")

# Closing price over time, one line per company
ggplot(faang, aes(x = Date, y = Close, color = Name)) +
  geom_line() +
  labs(title = "FAANG Closing Prices Over Time")
With these observations in hand, it was time to build predictive models.
The goal: predict the Close price using both raw and engineered features.
To begin the modeling process, I needed to split the dataset into
training and testing sets. This is a crucial step because I want to
train the model on one portion of the data and then test how well it
performs on unseen data. I set a random seed (427) to
ensure reproducibility of the split and used the
initial_split() function from the rsample
package. I chose to stratify the split based on the Name
column so each FAANG company (Facebook, Apple, Amazon, Netflix, Google)
is proportionally represented in both sets. This stratified sampling
guards against skewed model training or evaluation.
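library(tidymodels)  # one way to attach rsample, recipes, parsnip, workflows, tune, yardstick

set.seed(427)  # fixed seed so the split is reproducible
# 80/20 split, stratified so each company appears proportionally in both sets
data_split <- initial_split(faang, prop = 0.8, strata = Name)
train_data <- training(data_split)
test_data <- testing(data_split)

head(train_data)
## # A tibble: 6 × 15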
## ...1 Date High Low Open Close Volume `Adj Close` Name
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 683 2020-01-02 1898. 1864. 1875 1898. 4029000 1898. Amazon
## 2 685 2020-01-06 1904. 1860 1860 1903. 4061800 1903. Amazon
## 3 686 2020-01-07 1914. 1892. 1904. 1907. 4044900 1907. Amazon
## 4 687 2020-01-08 1911 1886. 1898. 1892. 3508000 1892. Amazon
## 5 688 2020-01-09 1918. 1896. 1910. 1901. 3167300 1901. Amazon
## 6 689 2020-01-10 1907. 1880 1905. 1883. 2853700 1883. Amazon
## # ℹ 6 more variables: Daily_Price_Range <dbl>, Percent_Change <dbl>,
## # `5_day_MA` <dbl>, `30_day_MA` <dbl>, Pandemic_Volume <dbl>,
## # Pandemic_Close <dbl>
## # A tibble: 6 × 15
## ...1 Date High Low Open Close Volume `Adj Close` Name
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 718 2020-02-24 204. 198. 202. 201. 23080100 201. Facebook
## 2 729 2020-03-10 178. 170. 175. 178. 24517800 178. Facebook
## 3 731 2020-03-12 167. 154. 160. 154. 43266300 154. Facebook
## 4 733 2020-03-16 159. 143. 152. 146. 39120400 146. Facebook
## 5 735 2020-03-18 148. 137. 140. 147. 37553100 147. Facebook
## 6 737 2020-03-20 159. 148 156. 150. 32568400 150. Facebook
## # ℹ 6 more variables: Daily_Price_Range <dbl>, Percent_Change <dbl>,
## # `5_day_MA` <dbl>, `30_day_MA` <dbl>, Pandemic_Volume <dbl>,
## # Pandemic_Close <dbl>
Next, I constructed a preprocessing pipeline using
recipes. This pipeline standardizes the numeric predictors
(centering each to mean 0 and scaling to standard deviation 1), which is
really important when using distance-based models like KNN, or models
that are sensitive to the scale of data like Ridge or Lasso. Without
normalization, variables like Volume (which can have huge values) could
dominate the learning process, unfairly outweighing variables like
Percent_Change. So, I use step_normalize() on all predictors to make
them comparable.
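To make the scaling concrete, here is a small base-R sketch of the same centering-and-scaling idea on made-up numbers (not the project's data). The point it illustrates is that the means and standard deviations are learned from the training rows and then reused on new rows; all object names below are hypothetical.

```r
# Made-up predictors on very different scales.
train <- data.frame(
  Volume         = c(1e6, 2e6, 3e6, 4e6),
  Percent_Change = c(-1.0, 0.5, 0.0, 1.5)
)
new_rows <- data.frame(Volume = 2.5e6, Percent_Change = 0.25)

# Learn the centering and scaling constants from the training rows only.
centers <- colMeans(train)
scales  <- apply(train, 2, sd)

# Apply the same constants to the training rows and to new rows.
train_z <- scale(train, center = centers, scale = scales)
new_z   <- scale(new_rows, center = centers, scale = scales)

round(colMeans(train_z), 10)      # each column is centered at 0
round(apply(train_z, 2, sd), 10)  # and has standard deviation 1
```

After this transformation a one-unit difference means "one standard deviation" for every predictor, so Volume no longer swamps Percent_Change in a distance calculation.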
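# Target: Pandemic_Close; every predictor is centered and scaled
rec <- recipe(Pandemic_Close ~ Open + High + Low + Volume + Percent_Change +
                Daily_Price_Range + `5_day_MA` + `30_day_MA`, data = train_data) %>%
  step_normalize(all_predictors())

In this section, I define four different regression models to compare: linear regression, Ridge, Lasso, and K-nearest neighbors (KNN).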
I use the parsnip package to declare each model. For
Ridge and Lasso, I include a penalty as a tunable parameter
and set the mixture to 0 (pure Ridge) and
1 (pure Lasso) respectively. For KNN, I set the number of
neighbors (neighbors) as a tuning parameter.
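For reference, the glmnet engine minimizes the elastic-net objective below for a continuous response. Here \lambda is what parsnip calls penalty and \alpha is what it calls mixture; N, x_i, y_i, \beta_0, and \beta are standard notation introduced for this illustration rather than symbols from the text.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2
+\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2+\alpha\,\lVert\beta\rVert_1\right]
```

With \alpha = 0 only the squared (Ridge) term remains, and with \alpha = 1 only the absolute-value (Lasso) term remains, which is why the two specifications differ only in mixture.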
linear_spec <- linear_reg() %>% set_engine("lm")  # ordinary least squares
ridge_spec <- linear_reg(penalty = tune(), mixture = 0) %>%  # mixture = 0: pure Ridge
  set_engine("glmnet")
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%  # mixture = 1: pure Lasso
  set_engine("glmnet")
knn_spec <- nearest_neighbor(mode = "regression", neighbors = tune()) %>%
  set_engine("kknn")

To evaluate each model reliably, I created a 10-fold cross-validation
object using vfold_cv(). This means my training data is
divided into 10 subsets: in each iteration, 9 are used for training and
1 for validation. This ensures the model is tested on different slices
of data, giving a more robust estimate of how it will perform on truly
unseen data. I stratify again by Name to keep company
representation balanced.
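The code that creates the folds is not shown in the text. A minimal sketch consistent with the description (10 folds, stratified by Name) follows, run on a made-up data frame so it is self-contained. In the project the input would presumably be train_data; the object name folds matches what the tuning calls later pass as resamples, and toy_train is hypothetical.

```r
library(rsample)

# Made-up stand-in for train_data: 5 companies x 40 rows each.
set.seed(427)
toy_train <- data.frame(
  Name  = rep(c("Facebook", "Amazon", "Apple", "Netflix", "Google"), each = 40),
  Close = runif(200, min = 50, max = 3500)
)

# 10-fold cross-validation, stratified by company.
folds <- vfold_cv(toy_train, v = 10, strata = Name)

# Each split trains on about 9/10 of the rows and validates on the rest.
first <- folds$splits[[1]]
nrow(analysis(first))
nrow(assessment(first))
table(assessment(first)$Name)
```

analysis() and assessment() are rsample's accessors for the training and validation portions of a single fold.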
Now I combine each model with the recipe to form a complete
workflow. The beauty of using workflows is
it keeps the preprocessing and modeling steps bundled together, reducing
room for error or inconsistency. I first create a base workflow with the
recipe and then create specific workflows for each model by adding the
respective model specification.
wf <- workflow() %>%
  add_recipe(rec)  # shared preprocessing for every model

ridge_wf <- wf %>% add_model(ridge_spec)
lasso_wf <- wf %>% add_model(lasso_spec)
linear_wf <- wf %>% add_model(linear_spec)
knn_wf <- wf %>% add_model(knn_spec)

Here comes the challenging part: tuning. Ridge, Lasso, and KNN have
parameters that affect model complexity and need to be tuned. I set up
grid search with 20 different parameter values. For linear regression,
there’s nothing to tune, so I just resample using
fit_resamples(). I use
control_grid(save_pred = TRUE) so I can access all
predictions later if needed.
ctrl <- control_grid(save_pred = TRUE)  # keep the out-of-fold predictions

# grid = 20 requests up to 20 candidate values for each tunable model
ridge_res <- tune_grid(ridge_wf, resamples = folds, grid = 20, control = ctrl)
lasso_res <- tune_grid(lasso_wf, resamples = folds, grid = 20, control = ctrl)
knn_res <- tune_grid(knn_wf, resamples = folds, grid = 20, control = ctrl)
# Linear regression has nothing to tune, so it is only resampled
linear_res <- fit_resamples(linear_wf, resamples = folds, control = ctrl)

Now I compare model performance across all four models using RMSE. I
use collect_metrics() and filter on
.metric == "rmse" for each result object. This helps me
determine which model has the best performance based on average RMSE
across the 10 folds.
## # A tibble: 20 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1.78e-10 rmse standard 66.2 10 4.81 Preprocessor1_Model01
## 2 5.35e-10 rmse standard 66.2 10 4.81 Preprocessor1_Model02
## 3 1.73e- 9 rmse standard 66.2 10 4.81 Preprocessor1_Model03
## 4 3.65e- 9 rmse standard 66.2 10 4.81 Preprocessor1_Model04
## 5 1.34e- 8 rmse standard 66.2 10 4.81 Preprocessor1_Model05
## 6 9.81e- 8 rmse standard 66.2 10 4.81 Preprocessor1_Model06
## 7 2.10e- 7 rmse standard 66.2 10 4.81 Preprocessor1_Model07
## 8 8.43e- 7 rmse standard 66.2 10 4.81 Preprocessor1_Model08
## 9 2.23e- 6 rmse standard 66.2 10 4.81 Preprocessor1_Model09
## 10 9.79e- 6 rmse standard 66.2 10 4.81 Preprocessor1_Model10
## 11 1.20e- 5 rmse standard 66.2 10 4.81 Preprocessor1_Model11
## 12 3.24e- 5 rmse standard 66.2 10 4.81 Preprocessor1_Model12
## 13 1.82e- 4 rmse standard 66.2 10 4.81 Preprocessor1_Model13
## 14 4.58e- 4 rmse standard 66.2 10 4.81 Preprocessor1_Model14
## 15 2.74e- 3 rmse standard 66.2 10 4.81 Preprocessor1_Model15
## 16 6.86e- 3 rmse standard 66.2 10 4.81 Preprocessor1_Model16
## 17 1.66e- 2 rmse standard 66.2 10 4.81 Preprocessor1_Model17
## 18 3.66e- 2 rmse standard 66.2 10 4.81 Preprocessor1_Model18
## 19 2.12e- 1 rmse standard 66.2 10 4.81 Preprocessor1_Model19
## 20 3.74e- 1 rmse standard 66.2 10 4.81 Preprocessor1_Model20
## # A tibble: 20 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1.23e-10 rmse standard 44.8 10 0.971 Preprocessor1_Model01
## 2 9.57e-10 rmse standard 44.8 10 0.971 Preprocessor1_Model02
## 3 1.08e- 9 rmse standard 44.8 10 0.971 Preprocessor1_Model03
## 4 8.87e- 9 rmse standard 44.8 10 0.971 Preprocessor1_Model04
## 5 1.63e- 8 rmse standard 44.8 10 0.971 Preprocessor1_Model05
## 6 3.81e- 8 rmse standard 44.8 10 0.971 Preprocessor1_Model06
## 7 1.14e- 7 rmse standard 44.8 10 0.971 Preprocessor1_Model07
## 8 3.43e- 7 rmse standard 44.8 10 0.971 Preprocessor1_Model08
## 9 1.44e- 6 rmse standard 44.8 10 0.971 Preprocessor1_Model09
## 10 5.04e- 6 rmse standard 44.8 10 0.971 Preprocessor1_Model10
## 11 1.44e- 5 rmse standard 44.8 10 0.971 Preprocessor1_Model11
## 12 7.79e- 5 rmse standard 44.8 10 0.971 Preprocessor1_Model12
## 13 2.52e- 4 rmse standard 44.8 10 0.971 Preprocessor1_Model13
## 14 5.18e- 4 rmse standard 44.8 10 0.971 Preprocessor1_Model14
## 15 1.23e- 3 rmse standard 44.8 10 0.971 Preprocessor1_Model15
## 16 4.89e- 3 rmse standard 44.8 10 0.971 Preprocessor1_Model16
## 17 3.07e- 2 rmse standard 44.8 10 0.971 Preprocessor1_Model17
## 18 5.93e- 2 rmse standard 44.8 10 0.971 Preprocessor1_Model18
## 19 2.90e- 1 rmse standard 44.8 10 0.971 Preprocessor1_Model19
## 20 7.95e- 1 rmse standard 44.8 10 0.971 Preprocessor1_Model20
## # A tibble: 14 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 rmse standard 64.6 10 3.64 Preprocessor1_Model01
## 2 3 rmse standard 59.7 10 3.13 Preprocessor1_Model02
## 3 4 rmse standard 57.3 10 2.85 Preprocessor1_Model03
## 4 5 rmse standard 55.8 10 2.77 Preprocessor1_Model04
## 5 6 rmse standard 54.9 10 2.67 Preprocessor1_Model05
## 6 7 rmse standard 54.4 10 2.54 Preprocessor1_Model06
## 7 8 rmse standard 54.4 10 2.40 Preprocessor1_Model07
## 8 9 rmse standard 54.7 10 2.28 Preprocessor1_Model08
## 9 10 rmse standard 55.1 10 2.21 Preprocessor1_Model09
## 10 11 rmse standard 55.7 10 2.20 Preprocessor1_Model10
## 11 12 rmse standard 56.4 10 2.22 Preprocessor1_Model11
## 12 13 rmse standard 57.2 10 2.26 Preprocessor1_Model12
## 13 14 rmse standard 57.9 10 2.31 Preprocessor1_Model13
## 14 15 rmse standard 58.7 10 2.34 Preprocessor1_Model14
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 13.5 10 0.403 Preprocessor1_Model1
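As a reminder of what the mean column in these tables reports, here is a small base-R sketch of cross-validated RMSE: fit on nine folds, compute RMSE on the held-out fold, then average across folds. The data is simulated and the helper name rmse is hypothetical; this illustrates the metric, not the project's tuning code.

```r
set.seed(427)

# Simulated data: y depends linearly on x plus noise.
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- 3 + 2 * sim$x + rnorm(n, sd = 0.5)

# Hypothetical helper: root mean squared error.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Assign each row to one of 10 folds at random.
fold_id <- sample(rep(1:10, length.out = n))

fold_rmse <- sapply(1:10, function(k) {
  fit <- lm(y ~ x, data = sim[fold_id != k, ])  # train on 9 folds
  held_out <- sim[fold_id == k, ]               # validate on the 10th
  rmse(held_out$y, predict(fit, newdata = held_out))
})

mean(fold_rmse)           # analogous to the "mean" column
sd(fold_rmse) / sqrt(10)  # analogous to the "std_err" column
```

Because RMSE is in the units of the target, the values in the tables above are in the same units as Pandemic_Close.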
Once I’ve identified the best-performing model based on RMSE, I
extract the best parameters (if needed), and finalize the workflow using
finalize_workflow(). I use a simple if block
to handle each model type and finalize accordingly. This lets me handle
both tunable and non-tunable models seamlessly.
# Stack the RMSE rows from all four models and label each with its model name
eval_metrics <- bind_rows(
  collect_metrics(ridge_res) %>% mutate(model = "Ridge"),
  collect_metrics(lasso_res) %>% mutate(model = "Lasso"),
  collect_metrics(knn_res) %>% mutate(model = "KNN"),
  collect_metrics(linear_res) %>% mutate(model = "Linear")
) %>% filter(.metric == "rmse")

# The lowest mean cross-validated RMSE wins
best_model_info <- eval_metrics %>% arrange(mean) %>% slice(1)
best_model_info
## # A tibble: 1 × 9
## penalty .metric .estimator mean n std_err .config model neighbors
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 NA rmse standard 13.5 10 0.403 Preprocessor1_… Line… NA
best_model <- best_model_info$model[[1]]

# Tunable models get their best parameters plugged in; linear regression has none.
if (best_model == "Ridge") {
  best_params <- select_best(ridge_res, metric = "rmse")
  final_wf <- finalize_workflow(ridge_wf, best_params)
} else if (best_model == "Lasso") {
  best_params <- select_best(lasso_res, metric = "rmse")
  final_wf <- finalize_workflow(lasso_wf, best_params)
} else if (best_model == "KNN") {
  best_params <- select_best(knn_res, metric = "rmse")
  final_wf <- finalize_workflow(knn_wf, best_params)
} else {
  final_wf <- linear_wf
}

print(final_wf)
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
Now comes the moment of truth: I use the predict()
function to generate predictions on the test set, then bind them to the
actual data using bind_cols(). I calculate final
performance metrics (rsq, rmse) to understand
how well this final model generalizes to unseen data.
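# Assumption: final_fit (not defined in the text) is the finalized workflow fit on train_data
final_fit <- fit(final_wf, data = train_data)

test_results <- predict(final_fit, test_data) %>%
  bind_cols(test_data)

yardstick::rsq(test_results, truth = Pandemic_Close, estimate = .pred)
## # A tibble: 1 × 3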
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 1.00
yardstick::rmse(test_results, truth = Pandemic_Close, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 12.9
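Out of all candidates, linear regression had the lowest RMSE on cross-validation (13.5, compared with 44.8 for Lasso, 54.4 for the best KNN setting, and 66.2 for Ridge), and it reached an RMSE of 12.9 on the test set.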
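To simulate how stock prices might behave in a future pandemic, I
take the test dataset and shift all dates forward by 5 years. I also
simulate an inflation-adjusted price bump by multiplying the
Pandemic_Close value by 1.25. This is purely hypothetical but
illustrates how the fitted model can be applied to a future-dated
scenario. Note that neither Date nor Pandemic_Close is a predictor in
the recipe, so the predictions depend only on the price, volume, and
engineered features carried over from the test set.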
library(lubridate)  # provides years() for date arithmetic

future_data <- test_data %>%
  mutate(Date = Date + years(5)) %>%              # shift the timeline forward 5 years
  mutate(Pandemic_Close = Pandemic_Close * 1.25)  # hypothetical price bump

future_pred <- predict(final_fit, future_data) %>%
  bind_cols(future_data)

Finally, I visualize how the model thinks FAANG prices could evolve
from 2025 to 2030 under pandemic-like conditions. I use
geom_line() with ggplot2 to show predicted
prices for each company, making it easy to compare trends across
firms.
ggplot(future_pred, aes(x = Date, y = .pred, color = Name)) +
  geom_line() +
  labs(
    title = "Predicted FAANG Prices (2025–2030 Pandemic Scenario)",
    x = "Year",
    y = "Predicted Close Price"
  ) +
  theme_minimal()

To predict the stock prices of FAANG companies during the COVID-19 pandemic, I identified Supervised Learning as the most appropriate approach. Since the goal is to forecast the closing price of a stock (a continuous value), this naturally translates into a regression problem. I used a variety of regression models (Linear, Ridge, Lasso, and KNN) to understand both linear and non-linear dynamics in the data.
Volatility showed up in the daily price range (High - Low), a direct consequence of panic selling, uncertain policy responses, and rapidly changing global conditions.
A strong relationship between Open and Close prices was observed, indicating that despite daily fluctuations, prices typically closed near their opening values.
These insights directly address the core research question: Can stock prices during COVID-19 be predicted using historical and pandemic-related features?
Percent_Change and Daily_Range.
Volume as a critical independent variable.
All these analytical pieces supported the development of interpretable and fairly accurate regression models. The linear model’s top performance indicated a data structure that was more predictable than expected, offering a clear path for forecasting during disruptive events like pandemics.
This project took me on a journey through real-world stock behavior during one of the most volatile periods in modern history: the COVID-19 pandemic. FAANG companies, despite the global disruption, demonstrated remarkable resilience — and their recovery was not just market-driven, but data-explainable.
By applying regression models to this context — from Linear to Ridge, Lasso, and KNN — I tested how well different algorithms could learn from both historical and engineered features. Through 10-fold cross-validation, I allowed the metrics to guide model selection instead of assumptions. And in the end, linear regression — the simplest model — outperformed the rest.
Feature engineering played a key role: daily volatility
(High - Low), momentum (Percent Change), and
moving averages (MA5, MA30) helped structure a
dataset that was not only clean but insight-rich. Stratified sampling
ensured fair representation across all companies, and proper
preprocessing safeguarded the model’s integrity.
Once the model was finalized, I projected forward to a 2025–2030 hypothetical pandemic scenario. The results showed continued upward trends, especially for digital-centric companies — validating the robustness of the model and its ability to generalize under simulated stress.
While markets are never fully predictable, this project reinforced a powerful lesson: Machine learning is not magic — it’s rigorous thinking with math and domain awareness. When paired with well-defined context and proper validation, even simple models can provide valuable foresight in chaotic times.
Paris, Rohan. “FAANG Stocks Covid19 (01/01/2020 - 04/01/2022).” Kaggle. https://www.kaggle.com/datasets/parisrohan/faang-stocks-covid190101202004012022/data
🔎 Note: While I utilized Perplexity.ai (https://www.perplexity.ai) during the initial research phase for general guidance and brainstorming ideas, all R code and analysis were fully understood, written, and interpreted based on my own comprehension and course material.