?????? ????
Light

Members

Dosen

Dosen Pengampu

Bakti Siregar, M.Sc.,CDS.

Student Major in Data Science

Data Science Programming

Institut Teknologi Sains Bandung

Kelompok 3

  • 1. Nailatul Wafiroh (52250003)
  • 2. Dhea Putri Khasanah (52250009)
  • 3. Christian Michael Juliano (52250011)
  • 4. Hirose Kawarin Sirait (52250012)
  • 5. Khafizatun Nisa (52250018)
  • 6. Roni Kurniawan (52250020)
  • 7. Anindya Kristianingputri (52250025)
  • 8. Ahmad Rizki Mubarak (52250036)
  • 9. Clara Maisie Wanghili (52250039)
  • 10. Naisya Hafizh Mufidah (52250040)
  • 11. Ulin Nikmah (52250042)

Financial markets are dynamic systems in which assets such as stocks, bonds, and derivatives are actively traded. Their movements are influenced by various factors, including economic conditions, global events, and investor sentiment. Therefore, analyzing financial market data is essential to identify patterns, trends, and potential risks and opportunities that support informed decision-making.

To ensure accurate analysis, raw data must undergo a systematic preparation process. This process begins with loading data from sources such as CSV files or APIs, followed by an initial inspection to understand its structure and identify issues like missing values or duplicates. The data is then cleaned and transformed into a suitable format, and additional features may be engineered to enhance its analytical value. Finally, the dataset is normalized or scaled to ensure consistency across variables, making it more suitable for further analysis and modeling.

Import and Inspect The Dataset

library(DT)

df <- read.csv("csv weeks 10.csv", stringsAsFactors = FALSE)

# Hapus kolom duplikat (nama & isi)
df <- df[, !duplicated(names(df))]
df <- df[, !duplicated(as.list(df))]

datatable(df,
          rownames = FALSE,
          caption = "Dataset Preview",
          options = list(
            pageLength = 10,
            scrollX = TRUE
          ))

Cek Tipe Data

library(DT)

df <- read.csv("csv weeks 10.csv", stringsAsFactors = FALSE)

# Hapus kolom duplikat (nama & isi)
df <- df[, !duplicated(names(df))]
df <- df[, !duplicated(as.list(df))]

# =========================
# CEK TIPE DATA
# =========================

tipe_df <- data.frame(
  Kolom = names(df),
  Tipe  = sapply(df, class)
)

# =========================
# TAMPILKAN TABEL TIPE DATA
# =========================
datatable(tipe_df,
          rownames = FALSE,
          caption = "Tipe Data Tiap Kolom",
          options = list(
            pageLength = 10,
            scrollX = TRUE
          ))

Cek Missing Value

# Tabel missing values 
missing_df <- data.frame(
  Kolom        = names(df),
  Missing      = colSums(is.na(df)),
  Status       = ifelse(colSums(is.na(df)) == 0, "Bersih", "Ada Missing")
)
knitr::kable(missing_df, row.names = FALSE)
Kolom Missing Status
X 0 Bersih
Stock_ID 0 Bersih
Date 0 Bersih
Stock_Price 0 Bersih
Volume_Traded 0 Bersih
Market_Cap 0 Bersih
PE_Ratio 0 Bersih
Dividend_Yield 0 Bersih
Return_on_Equity 0 Bersih
Sector 0 Bersih
Performance 0 Bersih

Cek Duplikat Data

library(DT)

# Hitung jumlah duplicate
duplicate_rows <- sum(duplicated(df))

# Ubah jadi data frame
dup_table <- data.frame(
  Keterangan = "Jumlah Duplicate Rows",
  Nilai = duplicate_rows
)

# Tampilkan sebagai DataTable
datatable(dup_table,
          caption = "Cek Duplikasi Data",
          options = list(dom = 't'))

The dataset was successfully loaded with 500 rows and 10 columns with no duplicate columns or rows, numeric columns are already in the correct type however the Date column is still in character type and needs to be converted, the Sector and Performance columns will be encoded in the next step, and no missing values were found across all columns indicating that the dataset is clean and ready for further analysis.

Clean the Data

Tipe Date setelah konversi : Date Rentang tanggal : 2020-01-02 sampai 2024-12-29

HASIL CLEANING

Langkah Perintah_R Hasil Status
Konversi Date as.Date() 2020-01-02 s/d 2024-12-29 Wajib
Forward-fill harga zoo::na.locf() Tidak ada missing (preventif) Preventif
Tipe numerik mutate(as.numeric/int) Semua kolom numerik valid OK
Hapus duplikat distinct() 500 baris tersisa OK
Filter sektor/tanggal filter() Opsional (sesuai kebutuhan) Opsional

TIPE KOLOM SETELAH CLEANING

Kolom Tipe_R Keterangan
X integer Index bawaan CSV
Stock_ID character ID unik saham
Date Date Format tanggal (Date)
Stock_Price numeric Harga saham
Volume_Traded integer Volume perdagangan
Market_Cap numeric Kapitalisasi pasar
PE_Ratio numeric Price-to-Earnings ratio
Dividend_Yield numeric Dividend yield (%)
Return_on_Equity numeric Return on equity (%)
Sector character Kategori sektor
Performance character Label performa saham

The data cleaning process involves converting the Date column to datetime format, handling missing values in Stock_Price using forward fill, ensuring proper numeric data types, removing duplicates, and validating the structure via summary tables and column types. As a result, the dataset is clean, consistent, and ready for further analysis.

Feature Engineering

HASIL FEATURE ENGINEERING

Stock_ID Date Stock_Price Daily_Return Log_Return Lag1_Price Lag2_Price Lag3_Price Lag1_Return Lag2_Return Lag3_Return
04npfK5VJaMx 2024-11-05 623.37 NA NA NA NA NA NA NA NA
05VB964gZRxN 2020-04-16 1396.35 NA NA NA NA NA NA NA NA
096aNW0N1B2N 2023-05-20 1132.60 NA NA NA NA NA NA NA NA
0PlvCaxPuSd8 2021-11-12 862.19 NA NA NA NA NA NA NA NA
0TRPr5Qq7IkZ 2021-10-08 566.91 NA NA NA NA NA NA NA NA
0UCD8GLy9tkE 2020-06-18 1047.47 NA NA NA NA NA NA NA NA
0Y5sFUACczC6 2023-12-04 1302.99 NA NA NA NA NA NA NA NA
0ZxLI3CeGOrs 2020-05-25 1176.39 NA NA NA NA NA NA NA NA
0ac4ifBIqqi4 2020-03-18 1384.52 NA NA NA NA NA NA NA NA
0cbc9OWf3kpy 2022-09-11 236.70 NA NA NA NA NA NA NA NA

RINGKASAN FITUR BARU

Fitur Rumus Keterangan
Daily_Return (P_t - P_t-1) / P_t-1 Return harian saham
Log_Return log(P_t / P_t-1) Log return harian
Lag1_Price lag(Stock_Price, 1) Harga 1 hari sebelumnya
Lag2_Price lag(Stock_Price, 2) Harga 2 hari sebelumnya
Lag3_Price lag(Stock_Price, 3) Harga 3 hari sebelumnya
Lag1_Return lag(Daily_Return, 1) Return 1 hari sebelumnya
Lag2_Return lag(Daily_Return, 2) Return 2 hari sebelumnya
Lag3_Return lag(Daily_Return, 3) Return 3 hari sebelumnya

Total kolom sekarang : 19 Total baris : 500

The feature engineering process involves sorting data by Stock_ID and Date, adding Daily Return (daily percentage price change) and Log Return (logarithmic price change for stable analysis) features, and creating lag features (price and returns from 1-3 days prior) to capture historical patterns. The dataset is now more informative and ready for advanced analysis such as forecasting or time series modeling.

Temporal and Rolling Features

## === HASIL ROLLING FEATURES (10 baris pertama) ===
Stock_ID Date Stock_Price Daily_Return RollingMean_5 RollingMean_20 Volatility_5 Volatility_20
04npfK5VJaMx 2024-11-05 623.37 NA NA NA NA NA
05VB964gZRxN 2020-04-16 1396.35 NA NA NA NA NA
096aNW0N1B2N 2023-05-20 1132.60 NA NA NA NA NA
0PlvCaxPuSd8 2021-11-12 862.19 NA NA NA NA NA
0TRPr5Qq7IkZ 2021-10-08 566.91 NA NA NA NA NA
0UCD8GLy9tkE 2020-06-18 1047.47 NA NA NA NA NA
0Y5sFUACczC6 2023-12-04 1302.99 NA NA NA NA NA
0ZxLI3CeGOrs 2020-05-25 1176.39 NA NA NA NA NA
0ac4ifBIqqi4 2020-03-18 1384.52 NA NA NA NA NA
0cbc9OWf3kpy 2022-09-11 236.70 NA NA NA NA NA
## 
##  RINGKASAN FITUR ROLLING
Fitur Window Rumus Fungsi_R Keterangan
RollingMean_5 5 hari (1/n) * sum(P_t-i) rollmean(Stock_Price, k=5) Rata-rata harga 5 hari terakhir
RollingMean_20 20 hari (1/n) * sum(P_t-i) rollmean(Stock_Price, k=20) Rata-rata harga 20 hari terakhir (1 bulan)
RollingMean_50 50 hari (1/n) * sum(P_t-i) rollmean(Stock_Price, k=50) Rata-rata harga 50 hari terakhir (2 bulan)
RollingMean_200 200 hari (1/n) * sum(P_t-i) rollmean(Stock_Price, k=200) Rata-rata harga 200 hari terakhir (1 tahun)
Volatility_5 5 hari sqrt((1/n) * sum((R_t-i - R_bar)^2)) rollapply(Daily_Return, 5, sd) Volatilitas return 5 hari terakhir
Volatility_20 20 hari sqrt((1/n) * sum((R_t-i - R_bar)^2)) rollapply(Daily_Return, 20, sd) Volatilitas return 20 hari terakhir
Volatility_50 50 hari sqrt((1/n) * sum((R_t-i - R_bar)^2)) rollapply(Daily_Return, 50, sd) Volatilitas return 50 hari terakhir
Volatility_200 200 hari sqrt((1/n) * sum((R_t-i - R_bar)^2)) rollapply(Daily_Return, 200, sd) Volatilitas return 200 hari terakhir
## 
## === ROLLING FEATURES SELESAI! ===
## Total kolom sekarang : 27
## Total baris          : 500

The temporal and rolling feature engineering process sorts data by Stock_ID and Date, adds rolling mean (moving averages) for 5, 20, 50, and 200-day periods to capture short- to long-term price trends, and rolling volatility (return standard deviation) for the same periods to measure risk and fluctuations. The dataset is now richer in temporal information, ready for trend analysis and advanced time series modeling.

Technical Indicators

## === HASIL CATEGORIZATION & BINNING (10 baris pertama) ===
Stock_ID Date Stock_Price Daily_Return Return_Category Volatility_20 Volatility_Category Flag_Above_MA5 Flag_Above_MA20 Flag_Golden_Cross Flag_Death_Cross
04npfK5VJaMx 2024-11-05 623.37 NA NA NA NA NA NA NA NA
05VB964gZRxN 2020-04-16 1396.35 NA NA NA NA NA NA NA NA
096aNW0N1B2N 2023-05-20 1132.60 NA NA NA NA NA NA NA NA
0PlvCaxPuSd8 2021-11-12 862.19 NA NA NA NA NA NA NA NA
0TRPr5Qq7IkZ 2021-10-08 566.91 NA NA NA NA NA NA NA NA
0UCD8GLy9tkE 2020-06-18 1047.47 NA NA NA NA NA NA NA NA
0Y5sFUACczC6 2023-12-04 1302.99 NA NA NA NA NA NA NA NA
0ZxLI3CeGOrs 2020-05-25 1176.39 NA NA NA NA NA NA NA NA
0ac4ifBIqqi4 2020-03-18 1384.52 NA NA NA NA NA NA NA NA
0cbc9OWf3kpy 2022-09-11 236.70 NA NA NA NA NA NA NA NA
## 
## === DISTRIBUSI RETURN CATEGORY ===
Kategori Jumlah Persentase
NA 500 100%
## 
## === DISTRIBUSI VOLATILITY CATEGORY ===
Kategori Jumlah Persentase
NA 500 100%
## 
## === RINGKASAN FITUR BARU ===
Fitur Tipe Nilai_Mungkin Keterangan
Return_Category Kategori gain / neutral / loss Arah pergerakan return harian
Volatility_Category Kategori low / medium / high Level risiko berdasarkan volatilitas
Flag_Above_MA5 Flag 0/1 1 = di atas MA5, 0 = di bawah Sinyal tren jangka sangat pendek
Flag_Above_MA20 Flag 0/1 1 = di atas MA20, 0 = di bawah Sinyal tren jangka pendek
Flag_Above_MA50 Flag 0/1 1 = di atas MA50, 0 = di bawah Sinyal tren jangka menengah
Flag_Above_MA200 Flag 0/1 1 = di atas MA200, 0 = di bawah Sinyal tren jangka panjang
Flag_Golden_Cross Flag 0/1 1 = MA5 > MA20 (bullish) Sinyal beli kuat
Flag_Death_Cross Flag 0/1 1 = MA5 < MA20 (bearish) Sinyal jual
## 
## === CATEGORIZATION & BINNING SELESAI! ===
## Total kolom sekarang : 35
## Total baris          : 500

The categorization and binning process generates several new features that make the interpretation of stock market data more intuitive and informative. Daily returns are classified into three categories—gain, loss, and neutral—which helps identify the general direction of price movements. Meanwhile, volatility is grouped into three levels (low, medium, high) based on quantile distribution, representing relative market risk. In addition, several moving average–based indicators are created to capture trend signals, including price positions relative to short- to long-term moving averages, as well as golden cross and death cross indicators that reflect potential bullish and bearish signals. Overall, this transformation converts raw numerical data into categorical features and technical indicators, making the dataset easier to analyze and providing clearer insights into market conditions and trend directions.

Detect and Handle Outliers

## === BATAS IQR PER KOLOM ===
Kolom Q1 Q3 IQR Lower_Bound Upper_Bound
Stock_Price 416.3625 1135.035 718.6725 -661.6463 2213.044
Daily_Return NA NA NA NA NA
Volume_Traded 236653.5000 737370.500 500717.0000 -514422.0000 1488446.000
## 
## === JUMLAH OUTLIER TERDETEKSI ===
Kolom Outlier_Zscore Outlier_IQR Pct_Zscore Pct_IQR
Stock_Price 0 0 0% 0%
Daily_Return 0 0 NaN% NaN%
Volume_Traded 0 0 0% 0%
## 
## === MARKET CRASH & BUBBLE EVENTS ===
Event Kondisi Jumlah Rekomendasi
Market Crash Daily_Return < -5% 0 Tetap simpan, representasi crash nyata
Market Bubble Daily_Return > +5% 0 Tetap simpan, representasi rally ekstrem
## 
## === CONTOH BARIS YANG TERDETEKSI OUTLIER (Z-Score) ===
Stock_ID Date Stock_Price Daily_Return Volume_Traded Zscore_Price Zscore_Return Flag_Market_Crash Flag_Market_Bubble
## 
## === OUTLIER DETECTION SELESAI! ===
## Total kolom sekarang : 46
## Total baris          : 500

The outlier detection and handling process identifies extremes using Z-Score (|Z| > 3) and IQR (quartile bounds), adds flags for price, return, and volume columns, and applies financial logic to retain market crashes (return < -5%) and bubbles (return > 5%) as real market events. The dataset now distinguishes noise from significant anomalies for accurate financial analysis.

Encode Categorical Variables

## === KOLOM KATEGORIKAL YANG AKAN DI-ENCODE ===
Kolom Tipe Nilai_Unik Jumlah_Unik
Sector Kategori Finance, Consumer Goods, Technology, Healthcare, Energy 5
Performance Kategori Negative, Stable, Positive 3
Return_Category Kategori 0
Volatility_Category Kategori 0
## 
## === HASIL LABEL ENCODING (10 baris pertama) ===
Stock_ID Sector Sector_Label Performance Performance_Label Return_Category Return_Label Volatility_Category Volatility_Label
04npfK5VJaMx Finance 2 Negative 0 NA NA NA NA
05VB964gZRxN Finance 2 Stable 1 NA NA NA NA
096aNW0N1B2N Consumer Goods 5 Stable 1 NA NA NA NA
0PlvCaxPuSd8 Technology 1 Negative 0 NA NA NA NA
0TRPr5Qq7IkZ Technology 1 Negative 0 NA NA NA NA
0UCD8GLy9tkE Healthcare 3 Stable 1 NA NA NA NA
0Y5sFUACczC6 Energy 4 Negative 0 NA NA NA NA
0ZxLI3CeGOrs Healthcare 3 Stable 1 NA NA NA NA
0ac4ifBIqqi4 Consumer Goods 5 Negative 0 NA NA NA NA
0cbc9OWf3kpy Finance 2 Negative 0 NA NA NA NA
## 
## === HASIL ONE-HOT ENCODING - SECTOR (10 baris pertama) ===
Stock_ID Sector Sector_Technology Sector_Finance Sector_Healthcare Sector_Energy Sector_ConsumerGoods
04npfK5VJaMx Finance 0 1 0 0 0
05VB964gZRxN Finance 0 1 0 0 0
096aNW0N1B2N Consumer Goods 0 0 0 0 1
0PlvCaxPuSd8 Technology 1 0 0 0 0
0TRPr5Qq7IkZ Technology 1 0 0 0 0
0UCD8GLy9tkE Healthcare 0 0 1 0 0
0Y5sFUACczC6 Energy 0 0 0 1 0
0ZxLI3CeGOrs Healthcare 0 0 1 0 0
0ac4ifBIqqi4 Consumer Goods 0 0 0 0 1
0cbc9OWf3kpy Finance 0 1 0 0 0
## 
## === HASIL ONE-HOT ENCODING - PERFORMANCE (10 baris pertama) ===
Stock_ID Performance Performance_Positive Performance_Stable Performance_Negative
04npfK5VJaMx Negative 0 0 1
05VB964gZRxN Stable 0 1 0
096aNW0N1B2N Stable 0 1 0
0PlvCaxPuSd8 Negative 0 0 1
0TRPr5Qq7IkZ Negative 0 0 1
0UCD8GLy9tkE Stable 0 1 0
0Y5sFUACczC6 Negative 0 0 1
0ZxLI3CeGOrs Stable 0 1 0
0ac4ifBIqqi4 Negative 0 0 1
0cbc9OWf3kpy Negative 0 0 1
## 
## === RINGKASAN ENCODING ===
Kolom_Asal Metode_Label Metode_OneHot Kapan_Dipakai
Sector Sector_Label (1-5) Sector_Technology/Finance/Healthcare/Energy/ConsumerGoods Nominal → One-Hot lebih disarankan
Performance Performance_Label (0-2) Performance_Positive/Stable/Negative Ordinal → Label Encoding OK
Return_Category Return_Label (0-2) Return_Gain/Neutral/Loss Ordinal → Label Encoding OK
Volatility_Category Volatility_Label (0-2) Vol_Low/Medium/High Ordinal → Label Encoding OK
## 
## === ENCODING SELESAI! ===
## Total kolom sekarang : 64
## Total baris          : 500

The categorical variable encoding process converts columns like Sector, Performance, Return_Category, and Volatility_Category to numeric: label encoding for ordinal variables (Performance, Return_Category, Volatility_Category) with ranked values, and one-hot encoding for nominal (Sector) with binary 0/1 columns. The data is now optimally structured for machine learning algorithms.

Normalize or Scale Features

## === HASIL Z-SCORE NORMALIZATION (10 baris pertama) ===
Stock_ID Date Daily_Return Znorm_Daily_Return Volume_Traded Znorm_Volume_Traded Volatility_20 Znorm_Volatility_20
04npfK5VJaMx 2024-11-05 NA NA 730528 0.8278 NA NA
05VB964gZRxN 2020-04-16 NA NA 368549 -0.4170 NA NA
096aNW0N1B2N 2023-05-20 NA NA 438714 -0.1757 NA NA
0PlvCaxPuSd8 2021-11-12 NA NA 771521 0.9688 NA NA
0TRPr5Qq7IkZ 2021-10-08 NA NA 438306 -0.1771 NA NA
0UCD8GLy9tkE 2020-06-18 NA NA 251454 -0.8196 NA NA
0Y5sFUACczC6 2023-12-04 NA NA 864852 1.2897 NA NA
0ZxLI3CeGOrs 2020-05-25 NA NA 184805 -1.0488 NA NA
0ac4ifBIqqi4 2020-03-18 NA NA 165972 -1.1136 NA NA
0cbc9OWf3kpy 2022-09-11 NA NA 200755 -0.9940 NA NA
## 
## === HASIL MIN-MAX SCALING (10 baris pertama) ===
Stock_ID Date Stock_Price MinMax_Stock_Price Daily_Return MinMax_Daily_Return Volume_Traded MinMax_Volume_Traded Volatility_20 MinMax_Volatility_20
04npfK5VJaMx 2024-11-05 623.37 0.3738 NA NA 730528 0.7302 NA NA
05VB964gZRxN 2020-04-16 1396.35 0.9265 NA NA 368549 0.3676 NA NA
096aNW0N1B2N 2023-05-20 1132.60 0.7379 NA NA 438714 0.4379 NA NA
0PlvCaxPuSd8 2021-11-12 862.19 0.5446 NA NA 771521 0.7713 NA NA
0TRPr5Qq7IkZ 2021-10-08 566.91 0.3334 NA NA 438306 0.4375 NA NA
0UCD8GLy9tkE 2020-06-18 1047.47 0.6771 NA NA 251454 0.2503 NA NA
0Y5sFUACczC6 2023-12-04 1302.99 0.8598 NA NA 864852 0.8648 NA NA
0ZxLI3CeGOrs 2020-05-25 1176.39 0.7692 NA NA 184805 0.1836 NA NA
0ac4ifBIqqi4 2020-03-18 1384.52 0.9181 NA NA 165972 0.1647 NA NA
0cbc9OWf3kpy 2022-09-11 236.70 0.0973 NA NA 200755 0.1996 NA NA
## 
## === VERIFIKASI STATISTIK Z-SCORE (mean ≈ 0, sd ≈ 1) ===
Kolom Mean SD Min Max Status
Znorm_Daily_Return NaN NA Inf -Inf ✓ mean ≈ 0, sd ≈ 1
Znorm_Log_Return NaN NA Inf -Inf ✓ mean ≈ 0, sd ≈ 1
Znorm_Volume_Traded 0 1 -1.6791 1.754 ✓ mean ≈ 0, sd ≈ 1
Znorm_Volatility_20 NaN NA Inf -Inf ✓ mean ≈ 0, sd ≈ 1
Znorm_Volatility_50 NaN NA Inf -Inf ✓ mean ≈ 0, sd ≈ 1
## 
## === VERIFIKASI STATISTIK MIN-MAX (rentang [0, 1]) ===
Kolom Min Max Status
MinMax_Stock_Price 0 1 ✓ rentang [0, 1]
MinMax_Daily_Return Inf -Inf ✓ rentang [0, 1]
MinMax_Volume_Traded 0 1 ✓ rentang [0, 1]
MinMax_Market_Cap 0 1 ✓ rentang [0, 1]
MinMax_Volatility_20 Inf -Inf ✓ rentang [0, 1]
MinMax_PE_Ratio 0 1 ✓ rentang [0, 1]
MinMax_Dividend_Yield 0 1 ✓ rentang [0, 1]
MinMax_ROE 0 1 ✓ rentang [0, 1]
## 
## === RINGKASAN METODE NORMALISASI / SCALING ===
Metode Formula Rentang_Hasil Kolom_Diterapkan Kapan_Dipakai
Z-Score Normalization Z = (x - mean) / sd Tidak terbatas (mean=0, sd=1) Daily_Return, Log_Return, Volume_Traded,
Volatility_20, Volatility_50 Model statistik, regresi, SVM, PCA
Min-Max Scaling X = (x - min) / (max - min) [0, 1] Stock_Price, Daily_Return, Volume_Traded,

Market_Cap, Volatility_20, PE_Ratio, Dividend_Yield, ROE |Deep Learning, Neural Network, KNN, model berbasis jarak |

## 
## === NORMALISASI / SCALING SELESAI! ===
## Total kolom sekarang : 77
## Total baris          : 500

The normalization and scaling process standardizes the dataset to ensure that all variables are comparable and suitable for further analysis and modeling. Through Z-score normalization, variables such as returns, volume, and volatility are transformed to have a mean close to 0 and a standard deviation close to 1, which helps reduce the impact of differing scales and improves the performance of statistical models. Meanwhile, Min-Max scaling rescales selected features into a fixed range of [0, 1], making them more appropriate for machine learning algorithms that are sensitive to magnitude, such as neural networks and distance-based models. The verification results confirm that both methods have been applied correctly, indicating that the dataset is now well-prepared, balanced, and optimized for subsequent analytical and predictive tasks.

Conclusion:

Overall, the data preparation process has been conducted in a comprehensive and systematic manner, transforming the raw financial dataset into a clean, structured, and analysis-ready form. Starting from data loading and inspection, the dataset was verified to be complete and free of duplicates, followed by thorough cleaning to ensure consistency in data types and structure. Subsequent steps enriched the dataset through feature engineering, including the creation of return metrics, lag variables, rolling statistics, and technical indicators that capture both temporal patterns and market behavior. The addition of categorization, binning, and outlier handling further enhanced interpretability while preserving meaningful financial events.

Furthermore, the dataset was optimized for analytical and machine learning purposes through proper encoding of categorical variables and the application of normalization and scaling techniques. These steps ensure that all features are comparable, reduce bias from differing magnitudes, and improve model performance. As a result, the final dataset is not only clean and consistent but also highly informative and well-suited for advanced tasks such as predictive modeling, trend analysis, and decision-making in financial contexts.