Integrative Application of Neural Networks for Predicting Global Stock Market Trends: A Data Science Investigation Using Historical Data

7CS041 MSc Project Data Science

A dissertation submitted in partial fulfillment of the requirements for the degree of M.Sc. Data Science (School of Engineering & Informatics)

Author: Rabin Thapa

Affiliation: University of Wolverhampton

Published: 02/18/2025 04:55:32 AM +0545

1 ABSTRACT

2 ACKNOWLEDGEMENT

3 INTRODUCTION

This report, created using Quarto in RStudio (Bauer and Landesvatter 2023), provides a visual analysis of a snapshot extracted from the 2021 household census data for England. Using ggplot2, we explore key demographic trends, focusing on factors such as age, income, marital status and ethnicity. The main objective is to process the data so that interesting patterns and linear relationships between the variables can be identified through clear visualizations (Hoffmann 2021). The report offers insights that could inform future policies and improve understanding of the correlation between the variables.

4 Literature Review

5 Methodology

6 Data Pre-processing

After installing and loading the necessary packages in RStudio, data pre-processing begins with understanding the variables. As Kandel et al. (2012) note, this is followed by cleaning and organizing the raw data to make it ready for analysis and visualization.

6.1 Data Exploration:

To start the data exploration, we load the necessary libraries and read the data using the read_csv() function from the specified file paths.

Code
reticulate::py_install("jupyter")
Using virtual environment "C:/Users/Dell/OneDrive/Documents/.virtualenvs/r-reticulate" ...
+ "C:/Users/Dell/OneDrive/Documents/.virtualenvs/r-reticulate/Scripts/python.exe" -m pip install --upgrade --no-user jupyter
Code
reticulate::py_config()
python:         C:/Users/Dell/OneDrive/Documents/.virtualenvs/r-reticulate/Scripts/python.exe
libpython:      C:/Users/Dell/AppData/Local/Programs/Python/Python312/python312.dll
pythonhome:     C:/Users/Dell/OneDrive/Documents/.virtualenvs/r-reticulate
version:        3.12.1 (tags/v3.12.1:2305ca5, Dec  7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/Users/Dell/OneDrive/Documents/.virtualenvs/r-reticulate/Lib/site-packages/numpy
numpy_version:  2.0.2

NOTE: Python version was forced by VIRTUAL_ENV
Code
import os
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"   
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"   
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Dense, Dropout, Input, Attention, LayerNormalization, GaussianNoise
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import LearningRateScheduler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from IPython.display import display
Code
file_directory = r"C:\Users\Dell\OneDrive\Desktop\University\Course\9. M.Sc. Project\3. Dataset\archive"
file_names = [
    "2014_Global_Markets_Data.csv",
    "2015_Global_Markets_Data.csv",
    "2016_Global_Markets_Data.csv",
    "2017_Global_Markets_Data.csv",
    "2018_Global_Markets_Data.csv",
    "2019_Global_Markets_Data.csv",
    "2020_Global_Markets_Data.csv",
    "2021_Global_Markets_Data.csv",
    "2022_Global_Markets_Data.csv",
    "2023_Global_Markets_Data.csv"
]
file_paths = [os.path.join(file_directory, file) for file in file_names]
dfs = [pd.read_csv(file) for file in file_paths]
df_stock = pd.concat(dfs, ignore_index=True)
Code
file_directory = r"C:\Users\Dell\OneDrive\Desktop\University\Course\9. M.Sc. Project\3. Dataset\archive\Macroeconomic_factors"
df_gdp = pd.read_csv(os.path.join(file_directory, "GDP_growth_uk.csv"))
df_inflation = pd.read_csv(os.path.join(file_directory, "Inflation_rate_uk.csv"))
df_interest = pd.read_csv(os.path.join(file_directory, "Interest_rate_uk.csv"))

6.2 Tidying the Data

Cleaning the data includes dealing with missing values, changing categorical data into factors and renaming columns for clarity. ID and Person_ID are variables with minimal feature importance, so they are removed from the data. The data was then filtered to remove irrelevant or unusual entries.

Code
df_stock['Date'] = pd.to_datetime(df_stock['Date'])
df_gdp['Date'] = pd.to_datetime(df_gdp['Date'])
df_inflation['Date'] = pd.to_datetime(df_inflation['Date'])
df_interest['Date'] = pd.to_datetime(df_interest['Date'])
Code
df_ftse = df_stock[df_stock['Ticker'] == '^FTSE'].copy()
Code
df_ftse.set_index('Date', inplace=True)
df_gdp.set_index('Date', inplace=True)
df_inflation.set_index('Date', inplace=True)
df_interest.set_index('Date', inplace=True)
Code
df_ftse = df_ftse[['Close']].dropna()
df_macro = df_gdp.join([df_inflation, df_interest], how='outer')
df_combined = df_ftse.join(df_macro, how='outer')
df_combined = df_combined.interpolate(method='linear')
df_combined.dropna(inplace=True)
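
The effect of the outer join followed by linear interpolation can be seen on a small toy example. This is only a sketch: the two frames and their values (`prices`, `macro`, `Rate`) are hypothetical and are not taken from the project data.

```python
import pandas as pd

# Toy daily price series (hypothetical values, not the FTSE data).
prices = pd.DataFrame(
    {"Close": [100.0, 101.0, 103.0, 102.0, 104.0]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]
    ),
)

# Toy low-frequency indicator observed on only two of those dates.
macro = pd.DataFrame(
    {"Rate": [1.0, 2.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-07"]),
)

# The outer join keeps every date; Rate is NaN where it was not observed.
combined = prices.join(macro, how="outer")

# method='linear' treats rows as equally spaced and ignores the gap
# between 2020-01-03 and 2020-01-06.
filled = combined.interpolate(method="linear")
print(filled)
```

Because `method='linear'` ignores the actual distance between dates, `method='time'` may be worth considering when the index has weekend or holiday gaps.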

6.3 Refining the Data

Of the three categorical variables, Mar_Stat and Highest Ed are converted to nominal numeric variables, while Eth is left unchanged, as illustrated below. This feature transformation is later used in regression analysis to understand the trends between the variables (Zeileis and Hothorn 2002).

7 Relation between the variables

Grouping the selected variables plays an important role in identifying the strength of the relationships in demographic data (Yusuf, Martins, and Swanson 2014). Therefore, age is grouped into two parts: up to 50 (Age <= 50) and above 50 (Age > 50). The following algorithm is applied to check its correlation coefficient with average income (INC).

Code
scaler = MinMaxScaler(feature_range=(0, 1))
df_scaled = scaler.fit_transform(df_combined)
Code
def create_sequences(data, time_steps=30):
    X, y = [], []
    for i in range(len(data) - time_steps):
        X.append(data[i:i+time_steps])
        y.append(data[i+time_steps][0])  # Predicting FTSE index (Close Price)
    return np.array(X), np.array(y)

time_steps = 30
X, y = create_sequences(df_scaled, time_steps)
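
To make the shapes produced by the sliding window concrete, the sketch below runs the same create_sequences function on a small toy array. The array `toy` and the shorter window of 3 steps are assumptions chosen for readability only.

```python
import numpy as np

def create_sequences(data, time_steps=30):
    X, y = [], []
    for i in range(len(data) - time_steps):
        X.append(data[i:i + time_steps])       # window of past rows, all features
        y.append(data[i + time_steps][0])      # next value of the first column
    return np.array(X), np.array(y)

# Toy array: 6 rows, 2 features; column 0 plays the role of the target.
toy = np.arange(12, dtype=float).reshape(6, 2)

X_toy, y_toy = create_sequences(toy, time_steps=3)
print(X_toy.shape)  # (samples, time_steps, features)
print(y_toy)        # first-column value that follows each window
```

With 6 rows and a window of 3, three samples are produced; each target is the first-column value of the row immediately after its window.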

8 Model Development:

These two opposite linear relations can be visualized with scatter plots and their best-fitting regression lines using the ggplot2 library, as illustrated in Fig 4.1.

Code
train_size = int(len(X) * 0.8)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
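
The scaler in this document is fitted on the full dataset before the chronological split, so the minimum and maximum of the test period influence the scaled training values. One possible alternative, sketched here under assumptions, is to fit the scaler on the training rows only. The array `data`, the name `scaler_train` and the row-level split are hypothetical and differ from the sequence-level split used above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for df_combined.values: 100 rows, 4 features.
data = rng.normal(size=(100, 4)).cumsum(axis=0)

# Chronological cut-off: the first 80% of rows form the training period.
split_row = int(len(data) * 0.8)

scaler_train = MinMaxScaler(feature_range=(0, 1))
# Minimum and maximum are learned from the training rows only.
train_scaled = scaler_train.fit_transform(data[:split_row])
# The same statistics are reused for the test rows.
test_scaled = scaler_train.transform(data[split_row:])

print(train_scaled.min(), train_scaled.max())
print(test_scaled.min(), test_scaled.max())
```

With this approach the scaled test values can fall outside the range 0 to 1, which is expected when the test period reaches levels not seen during training.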
Code
num_features = X.shape[2]
Code
def lr_schedule(epoch, lr):
    return lr * np.exp(-0.05)  # Exponential decay
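
The decay produced by this schedule can be traced without training a model. The sketch below assumes the initial rate of 0.001 and the 50 epochs used later in this document, and assumes that the scheduler callback applies the function once at the start of every epoch, including the first.

```python
import numpy as np

def lr_schedule(epoch, lr):
    return lr * np.exp(-0.05)  # multiply the current rate by exp(-0.05)

lr = 0.001   # initial learning rate given to the optimizer
rates = []
for epoch in range(50):
    lr = lr_schedule(epoch, lr)   # applied once per epoch
    rates.append(lr)

print(f"first epoch: {rates[0]:.6f}")
print(f"last epoch:  {rates[-1]:.6f}")
```

Under these assumptions the rate at epoch n (counting from 0) is 0.001 * exp(-0.05 * (n + 1)), so it falls by roughly a factor of exp(-2.5) over the whole run.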
Code
inputs = Input(shape=(time_steps, num_features))
Code
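# First recurrent layer returns the full sequence so attention can be applied across time steps
x = LSTM(128, activation='tanh', return_sequences=True)(inputs)
x = GaussianNoise(0.1)(x)  # regularization noise, active during training only
x = Dropout(0.2)(x)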
Code
attention = Attention()([x, x])  
x = LayerNormalization()(attention)
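
Passing the same tensor twice to the attention layer amounts to self-attention over the time steps. The NumPy sketch below illustrates the idea for a single sequence, assuming plain dot-product scores followed by a softmax, which is our reading of the layer's default behaviour. The function name `dot_product_self_attention` and the toy input are hypothetical.

```python
import numpy as np

def dot_product_self_attention(x):
    """Self-attention for one sequence x of shape (time_steps, units)."""
    scores = x @ x.T                                       # similarity of every step with every step
    scores = scores - scores.max(axis=-1, keepdims=True)   # stabilise the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # each row sums to 1
    return weights @ x, weights                            # weighted mix of the time steps

rng = np.random.default_rng(0)
x_toy = rng.normal(size=(5, 4))   # 5 time steps, 4 hidden units

out, w = dot_product_self_attention(x_toy)
print(out.shape, w.shape)
```

The output keeps the shape of the input sequence, which is why a second LSTM layer can consume it directly.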
Code
x = LSTM(64, activation='tanh', return_sequences=False)(x)
x = Dropout(0.2)(x)
Code
x = Dense(32, activation='swish')(x)
outputs = Dense(1)(x)
Code
model = Model(inputs, outputs)
model.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
Code
lr_callback = LearningRateScheduler(lr_schedule)
history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test), callbacks=[lr_callback], verbose=0)
Code
y_pred = model.predict(X_test, verbose=0) 
Code
y_test_inv = scaler.inverse_transform(np.hstack((y_test.reshape(-1, 1), np.zeros((y_test.shape[0], df_combined.shape[1] - 1)))))[:, 0]
y_pred_inv = scaler.inverse_transform(np.hstack((y_pred, np.zeros((y_pred.shape[0], df_combined.shape[1] - 1)))))[:, 0]
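
The zero-padding above exists only because inverse_transform expects as many columns as the scaler was fitted on. As a sketch of an equivalent route, the first column can be inverted directly from the fitted scaler's `min_` and `scale_` attributes. The helper `invert_first_column`, the name `scaler_demo` and the random data are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
data = rng.uniform(50, 150, size=(40, 4))   # stand-in for the combined features
scaler_demo = MinMaxScaler(feature_range=(0, 1)).fit(data)

def invert_first_column(scaled_values, fitted_scaler):
    """Undo min-max scaling for column 0 only (hypothetical helper)."""
    scaled_values = np.asarray(scaled_values).reshape(-1)
    return (scaled_values - fitted_scaler.min_[0]) / fitted_scaler.scale_[0]

scaled_col = scaler_demo.transform(data)[:, 0]

# Route used in this document: pad with zeros, invert, keep column 0.
padded = np.hstack((scaled_col.reshape(-1, 1),
                    np.zeros((len(scaled_col), data.shape[1] - 1))))
via_padding = scaler_demo.inverse_transform(padded)[:, 0]

# Direct route: invert column 0 from the scaler's stored parameters.
via_helper = invert_first_column(scaled_col, scaler_demo)

print(np.allclose(via_padding, via_helper))
```

Both routes recover the original first column; the direct form avoids building a padded array for every call.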
Code
future_steps = 60  
X_future = X_test[-1:].copy()  
future_predictions = []

for _ in range(future_steps):
    pred = model.predict(X_future, verbose=0)
    
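    # Add Gaussian noise so the forecast path shows random variation
    noise = np.random.normal(0, 0.02, size=pred.shape)  
    pred += noise  

    future_predictions.append(pred[0][0])
    
    # Slide the window one step; np.roll wraps the oldest row to the end,
    # so overwrite it with the latest known features before inserting the prediction
    last_step = X_future[0, -1, :].copy()
    X_future = np.roll(X_future, -1, axis=1)
    X_future[0, -1, :] = last_step
    X_future[0, -1, 0] = pred[0][0]  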
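
Because np.roll is circular, the row that lands in the last position after the shift is the oldest row of the window, not a new one. The toy example below illustrates this and shows one possible way of carrying the most recent feature values forward. The window values and `new_target` are hypothetical.

```python
import numpy as np

# Toy window: batch of 1, 4 time steps, 2 features (target, macro indicator).
window = np.array([[[0.1, 10.0], [0.2, 20.0], [0.3, 30.0], [0.4, 40.0]]])
new_target = 0.5   # stands in for the model's next prediction

rolled = np.roll(window, -1, axis=1)

# After the shift, the last row is the oldest row, wrapped around to the end.
wrapped_row = rolled[0, -1].copy()
print(wrapped_row)

# Carry the most recent features forward, then insert the new target value.
rolled[0, -1, :] = window[0, -1, :]
rolled[0, -1, 0] = new_target
print(rolled[0, -1])
```

Holding the non-target features at their last known values is only one assumption; forecasts or scenarios for those features could be inserted at the same place instead.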

9 Data Visualization:

Code
future_predictions_inv = scaler.inverse_transform(np.hstack((np.array(future_predictions).reshape(-1, 1),
                                                              np.zeros((future_steps, df_combined.shape[1] - 1)))))[:, 0]
Code
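# Sequences were built from df_combined, so the test dates are taken from its index
dates_actual = df_combined.index[-len(y_test_inv):]  
# Note: each forecast step follows the row spacing of df_combined, while freq='ME' places the steps at month ends
dates_future = pd.date_range(start=dates_actual[-1], periods=future_steps + 1, freq='ME')[1:] 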
Code
_ = plt.figure(figsize=(14, 7))
_ = plt.plot(df_ftse.index, df_ftse['Close'], color='lightblue', lw=2, label="Actual FTSE 100 (2014-2023)")
_ = plt.plot(dates_actual, y_pred_inv, color='darkblue', ls="dashed", label="Model Predicted Trend (2014-2023)")
_ = plt.plot(dates_future, future_predictions_inv, color='orange', lw=2, ls="dashed", label="Predicted FTSE 100 (2024-2028)")
_ = plt.axvline(dates_actual[-1], color='gray', ls="dotted", label="Prediction Start (2024)")
plt.xlabel("Year")
plt.ylabel("FTSE 100 Index")
plt.title("FTSE 100 Prediction with Attention & Noise (2014-2028)")
plt.legend()
plt.grid(True)
plt.show()

10 MODEL EVALUATION

Code
mae = mean_absolute_error(y_test_inv, y_pred_inv)
rmse = np.sqrt(mean_squared_error(y_test_inv, y_pred_inv))
r2 = r2_score(y_test_inv, y_pred_inv)
directional_accuracy = np.mean(
    (np.sign(y_test_inv[1:] - y_test_inv[:-1]) == np.sign(y_pred_inv[1:] - y_pred_inv[:-1])).astype(int)
)
model_accuracy = 1 - (mae / np.mean(y_test_inv))
metrics = {
    "Metric": ["Mean Absolute Error (MAE)", "Root Mean Squared Error (RMSE)", 
               "R-Squared Score (R²)", "Directional Accuracy", "Model Accuracy"],
    "Value": [f"{mae:.4f}", f"{rmse:.4f}", f"{r2:.4f}", 
              f"{directional_accuracy:.2%}", f"{model_accuracy:.2%}"]
}
df_metrics = pd.DataFrame(metrics)
display(df_metrics)
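
A small worked example may help in reading these metrics. The values in `actual` and `predicted` are made up for illustration and are not results from the model.

```python
import numpy as np

actual = np.array([100.0, 102.0, 101.0, 105.0, 104.0])
predicted = np.array([99.0, 101.0, 102.0, 103.0, 105.0])

# Mean absolute error: the absolute errors are 1, 1, 1, 2, 1.
mae_toy = np.mean(np.abs(actual - predicted))

# Direction of each day-to-day move: +1 for up, -1 for down.
actual_dir = np.sign(np.diff(actual))       # up, down, up, down
pred_dir = np.sign(np.diff(predicted))      # up, up, up, up
directional_accuracy_toy = np.mean(actual_dir == pred_dir)

# "Model accuracy" as defined above: 1 minus MAE relative to the mean level.
model_accuracy_toy = 1 - mae_toy / np.mean(actual)

print(mae_toy, directional_accuracy_toy, model_accuracy_toy)
```

In this toy case the relative measure is close to 99% while only half of the moves are predicted in the right direction, so the two figures answer different questions and are best reported together.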

11 Result

12 Limitation and Recommendation

This analysis shows clear patterns between age, marital status, education and income among British citizens, but it has limitations. The data lacks details on regional, industry and socio-economic factors that could affect income differences (Howe et al. 2012). Furthermore, the simplified categories for ethnicity and marital status may overlook complex social influences on income. Future research would benefit from including more socio-economic factors and regional details. Policies supporting the education of elderly people and relationship stability could help improve financial well-being across demographics.

13 CONCLUSION AND FUTURE WORKS

Up to the age of 50, income shows a strong positive link with age, but after 50, income tends to fall. This suggests that elderly people in the UK are at risk of low income. Marriage and stable relationships appear to support financial success, with married individuals generally earning more. There is also a clear income gap, with White individuals earning more than other ethnic groups, although women tend to earn more than men across all groups. These findings point to areas where future government policies could focus, such as supporting elderly education and launching social programmes to promote financial stability and equality across age, gender and ethnicity.

14 BIBLIOGRAPHY

Bauer, Paul C., and Camille Landesvatter. 2023. “Writing a Reproducible Paper with RStudio and Quarto.” http://dx.doi.org/10.31219/osf.io/ur4xn.
Hoffmann, John P. 2021. “Linear Regression Models,” July. https://doi.org/10.1201/9781003162230.
Howe, L. D., B. Galobardes, A. Matijasevich, D. Gordon, D. Johnston, O. Onwujekwe, R. Patel, E. A. Webb, D. A. Lawlor, and J. R. Hargreaves. 2012. “Measuring Socio-Economic Position for Epidemiological Studies in Low- and Middle-Income Countries: A Methods of Measurement in Epidemiology Paper.” International Journal of Epidemiology 41 (3): 871–86. https://doi.org/10.1093/ije/dys037.
Kandel, Sean, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. 2012. “Enterprise Data Analysis and Visualization: An Interview Study.” IEEE Transactions on Visualization and Computer Graphics 18 (12): 2917–26. https://doi.org/10.1109/tvcg.2012.219.
Yusuf, Farhat, Jo. M. Martins, and David A. Swanson. 2014. Methods of Demographic Analysis. Springer Netherlands. https://doi.org/10.1007/978-94-007-6784-3.
Zeileis, Achim, and Torsten Hothorn. 2002. “Diagnostic Checking in Regression Relationships” 2. https://CRAN.R-project.org/doc/Rnews/.

Citation

BibTeX citation:
@online{thapa2025,
  author = {Thapa, Rabin},
  title = {Integrative {Application} of {Neural} {Networks} for
    {Predicting} {Global} {Stock} {Market} {Trends:} {A} {Data}
    {Science} {Investigation} {Using} {Historical} {Data}},
  date = {2025-02-18},
  url = {https://www.researchgate.net/profile/Rabin-Thapa-8},
  langid = {en}
}
For attribution, please cite this work as:
Thapa, Rabin. 2025. “Integrative Application of Neural Networks for Predicting Global Stock Market Trends: A Data Science Investigation Using Historical Data.” February 18, 2025. https://www.researchgate.net/profile/Rabin-Thapa-8.