Members
Dosen Pengampu
Bakti Siregar, M.Sc.,CDS.
Student Major in Data Science
Data Science Programming
Institut Teknologi Sains Bandung
Kelompok 3
- 1. Nailatul Wafiroh (52250003)
- 2. Dhea Putri Khasanah (52250009)
- 3. Christian Michael Juliano (52250011)
- 4. Hirose Kawarin Sirait (52250012)
- 5. Khafizatun Nisa (52250018)
- 6. Roni Kurniawan (52250020)
- 7. Anindya Kristianingputri (52250025)
- 8. Ahmad Rizki Mubarak (52250036)
- 9. Clara Maisie Wanghili (52250039)
- 10. Naisya Hafizh Mufidah (52250040)
- 11. Ulin Nikmah (52250042)
Introduction
Financial markets are dynamic systems in which assets such as stocks,
bonds, and derivatives are actively traded. Their movements are
influenced by various factors, including economic conditions, global
events, and investor sentiment. Therefore, analyzing financial market
data is essential to identify patterns, trends, and potential risks and
opportunities that support informed decision-making.
To ensure accurate analysis, raw data must undergo a systematic
preparation process. This process begins with loading data from sources
such as CSV files or APIs, followed by an initial inspection to
understand its structure and identify issues like missing values or
duplicates. The data is then cleaned and transformed into a suitable
format, and additional features may be engineered to enhance its
analytical value. Finally, the dataset is normalized or scaled to ensure
consistency across variables, making it more suitable for further
analysis and modeling.
Import and Inspect The Dataset
library(DT)
# Load dataset
df <- read.csv("C:/Users/Roni/Desktop/program10/data csv practikum week 10.csv",
stringsAsFactors = FALSE)
# Remove duplicate columns (by name and content)
df <- df[, !duplicated(names(df))]
df <- df[, !duplicated(as.list(df))]
# Display dataset preview
datatable(df,
rownames = FALSE,
caption = "Dataset Preview",
options = list(
pageLength = 10,
scrollX = TRUE
))
Interpretation
The dataset is successfully imported and displayed using an
interactive table. Duplicate columns are removed to ensure data
integrity.
This step is important because:
It ensures no redundant variables exist It helps prevent bias in
analysis It improves computational efficiency
Data Type Check
library(DT)
# Check column data types
type_df <- data.frame(
Column = names(df),
Type = sapply(df, class)
)
# Display table
datatable(type_df,
rownames = FALSE,
caption = "Data Type of Each Column",
options = list(
pageLength = 10,
scrollX = TRUE
))
Missing Value Check
# Missing values table
missing_df <- data.frame(
Column = names(df),
Missing = colSums(is.na(df)),
Status = ifelse(colSums(is.na(df)) == 0, "Clean", "Has Missing Values")
)
knitr::kable(missing_df, row.names = FALSE)
| X |
0 |
Clean |
| Stock_ID |
0 |
Clean |
| Date |
0 |
Clean |
| Stock_Price |
0 |
Clean |
| Volume_Traded |
0 |
Clean |
| Market_Cap |
0 |
Clean |
| PE_Ratio |
0 |
Clean |
| Dividend_Yield |
0 |
Clean |
| Return_on_Equity |
0 |
Clean |
| Sector |
0 |
Clean |
| Performance |
0 |
Clean |
Duplicate check
library(DT)
# Count duplicate rows
duplicate_rows <- sum(duplicated(df))
dup_table <- data.frame(
Description = "Number of Duplicate Rows",
Value = duplicate_rows
)
datatable(dup_table,
caption = "Duplicate Data Check",
options = list(dom = 't'))
Interpretation:
The dataset was successfully loaded with 500 rows and 10 columns with
no duplicate columns or rows, numeric columns are already in the correct
type however the Date column is still in character type and needs to be
converted, the Sector and Performance columns will be encoded in the
next step, and no missing values were found across all columns
indicating that the dataset is clean and ready for further analysis.
DATA CLEANING
Date Column Conversion Summary
| Date column used |
Date |
| Data type |
Date |
| Start date |
2020-01-02 |
| End date |
2024-12-29 |
Interpretation:
The data cleaning process involves converting the Date column to
datetime format, handling missing values in Stock_Price using forward
fill, ensuring proper numeric data types, removing duplicates, and
validating the structure via summary tables and column types. As a
result, the dataset is clean, consistent, and ready for further
analysis.
Feature Engineering
FEATURE ENGINEERING RESULTS
| 04npfK5VJaMx |
2024-11-05 |
623.37 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 05VB964gZRxN |
2020-04-16 |
1396.35 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 096aNW0N1B2N |
2023-05-20 |
1132.60 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0PlvCaxPuSd8 |
2021-11-12 |
862.19 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0TRPr5Qq7IkZ |
2021-10-08 |
566.91 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0UCD8GLy9tkE |
2020-06-18 |
1047.47 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0Y5sFUACczC6 |
2023-12-04 |
1302.99 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0ZxLI3CeGOrs |
2020-05-25 |
1176.39 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0ac4ifBIqqi4 |
2020-03-18 |
1384.52 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0cbc9OWf3kpy |
2022-09-11 |
236.70 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Feature Summary
| Daily_Return |
(P_t - P_t-1) / P_t-1 |
Daily stock return |
| Log_Return |
log(P_t / P_t-1) |
Logarithmic return |
| Lag1_Price |
lag(Stock_Price, 1) |
Price 1 day ago |
| Lag2_Price |
lag(Stock_Price, 2) |
Price 2 days ago |
| Lag3_Price |
lag(Stock_Price, 3) |
Price 3 days ago |
| Lag1_Return |
lag(Daily_Return, 1) |
Return 1 day ago |
| Lag2_Return |
lag(Daily_Return, 2) |
Return 2 days ago |
| Lag3_Return |
lag(Daily_Return, 3) |
Return 3 days ago |
Interpretation:
The feature engineering process involves sorting data by Stock_ID and
Date, adding Daily Return (daily percentage price change) and Log Return
(logarithmic price change for stable analysis) features, and creating
lag features (price and returns from 1-3 days prior) to capture
historical patterns. The dataset is now more informative and ready for
advanced analysis such as forecasting or time series modeling.
Temporal and Rolling Features
HASIL ROLLING FEATURES (10 baris pertama)
| 04npfK5VJaMx |
2024-11-05 |
623.37 |
NA |
NA |
NA |
NA |
NA |
| 05VB964gZRxN |
2020-04-16 |
1396.35 |
NA |
NA |
NA |
NA |
NA |
| 096aNW0N1B2N |
2023-05-20 |
1132.60 |
NA |
NA |
NA |
NA |
NA |
| 0PlvCaxPuSd8 |
2021-11-12 |
862.19 |
NA |
NA |
NA |
NA |
NA |
| 0TRPr5Qq7IkZ |
2021-10-08 |
566.91 |
NA |
NA |
NA |
NA |
NA |
| 0UCD8GLy9tkE |
2020-06-18 |
1047.47 |
NA |
NA |
NA |
NA |
NA |
| 0Y5sFUACczC6 |
2023-12-04 |
1302.99 |
NA |
NA |
NA |
NA |
NA |
| 0ZxLI3CeGOrs |
2020-05-25 |
1176.39 |
NA |
NA |
NA |
NA |
NA |
| 0ac4ifBIqqi4 |
2020-03-18 |
1384.52 |
NA |
NA |
NA |
NA |
NA |
| 0cbc9OWf3kpy |
2022-09-11 |
236.70 |
NA |
NA |
NA |
NA |
NA |
RINGKASAN FITUR ROLLING
| RollingMean_5 |
5 hari |
(1/n) * sum(P_t-i) |
rollmean(Stock_Price, k=5) |
Rata-rata harga 5 hari terakhir |
| RollingMean_20 |
20 hari |
(1/n) * sum(P_t-i) |
rollmean(Stock_Price, k=20) |
Rata-rata harga 20 hari terakhir (1 bulan) |
| RollingMean_50 |
50 hari |
(1/n) * sum(P_t-i) |
rollmean(Stock_Price, k=50) |
Rata-rata harga 50 hari terakhir (2 bulan) |
| RollingMean_200 |
200 hari |
(1/n) * sum(P_t-i) |
rollmean(Stock_Price, k=200) |
Rata-rata harga 200 hari terakhir (1 tahun) |
| Volatility_5 |
5 hari |
sqrt((1/n) * sum((R_t-i - R_bar)^2)) |
rollapply(Daily_Return, 5, sd) |
Volatilitas return 5 hari terakhir |
| Volatility_20 |
20 hari |
sqrt((1/n) * sum((R_t-i - R_bar)^2)) |
rollapply(Daily_Return, 20, sd) |
Volatilitas return 20 hari terakhir |
| Volatility_50 |
50 hari |
sqrt((1/n) * sum((R_t-i - R_bar)^2)) |
rollapply(Daily_Return, 50, sd) |
Volatilitas return 50 hari terakhir |
| Volatility_200 |
200 hari |
sqrt((1/n) * sum((R_t-i - R_bar)^2)) |
rollapply(Daily_Return, 200, sd) |
Volatilitas return 200 hari terakhir |
Interpretation:
The temporal and rolling feature engineering process sorts data by
Stock_ID and Date, adds rolling mean (moving averages) for 5, 20, 50,
and 200-day periods to capture short- to long-term price trends, and
rolling volatility (return standard deviation) for the same periods to
measure risk and fluctuations. The dataset is now richer in temporal
information, ready for trend analysis and advanced time series
modeling.
Technical Indicators
HASIL CATEGORIZATION & BINNING (10 baris
pertama)
| 04npfK5VJaMx |
2024-11-05 |
623.37 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 05VB964gZRxN |
2020-04-16 |
1396.35 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 096aNW0N1B2N |
2023-05-20 |
1132.60 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0PlvCaxPuSd8 |
2021-11-12 |
862.19 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0TRPr5Qq7IkZ |
2021-10-08 |
566.91 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0UCD8GLy9tkE |
2020-06-18 |
1047.47 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0Y5sFUACczC6 |
2023-12-04 |
1302.99 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0ZxLI3CeGOrs |
2020-05-25 |
1176.39 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0ac4ifBIqqi4 |
2020-03-18 |
1384.52 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
| 0cbc9OWf3kpy |
2022-09-11 |
236.70 |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
NA |
Interpretation:
The categorization and binning process generates several new features
that make the interpretation of stock market data more intuitive and
informative. Daily returns are classified into three
categories—gain, loss, and neutral—which
helps identify the general direction of price movements. Meanwhile,
volatility is grouped into three levels (low, medium,
high) based on quantile distribution, representing relative
market risk. In addition, several moving average–based indicators are
created to capture trend signals, including price positions relative to
short- to long-term moving averages, as well as golden cross
and death cross indicators that reflect potential bullish and
bearish signals. Overall, this transformation converts raw numerical
data into categorical features and technical indicators, making the
dataset easier to analyze and providing clearer insights into market
conditions and trend directions.
DISTRIBUSI RETURN CATEGORY
DISTRIBUSI VOLATILITY CATEGORY
RINGKASAN FITUR BARU
| Return_Category |
Kategori |
gain / neutral / loss |
Arah pergerakan return harian |
| Volatility_Category |
Kategori |
low / medium / high |
Level risiko berdasarkan volatilitas |
| Flag_Above_MA5 |
Flag 0/1 |
1 = di atas MA5, 0 = di bawah |
Sinyal tren jangka sangat pendek |
| Flag_Above_MA20 |
Flag 0/1 |
1 = di atas MA20, 0 = di bawah |
Sinyal tren jangka pendek |
| Flag_Above_MA50 |
Flag 0/1 |
1 = di atas MA50, 0 = di bawah |
Sinyal tren jangka menengah |
| Flag_Above_MA200 |
Flag 0/1 |
1 = di atas MA200, 0 = di bawah |
Sinyal tren jangka panjang |
| Flag_Golden_Cross |
Flag 0/1 |
1 = MA5 > MA20 (bullish) |
Sinyal beli kuat |
| Flag_Death_Cross |
Flag 0/1 |
1 = MA5 < MA20 (bearish) |
Sinyal jual |
Detect and Handle Outliers
## === IQR BOUNDARIES PER COLUMN ===
| Stock_Price |
416.3625 |
1135.035 |
718.6725 |
-661.6463 |
2213.044 |
| Daily_Return |
NA |
NA |
NA |
NA |
NA |
| Volume_Traded |
236653.5000 |
737370.500 |
500717.0000 |
-514422.0000 |
1488446.000 |
##
## === TOTAL DETECTED OUTLIERS ===
| Stock_Price |
0 |
0 |
0% |
0% |
| Daily_Return |
0 |
0 |
NaN% |
NaN% |
| Volume_Traded |
0 |
0 |
0% |
0% |
##
## === MARKET CRASH & BUBBLE EVENTS ===
| Market Crash |
Daily_Return < -5% |
0 |
Keep data (real crash representation) |
| Market Bubble |
Daily_Return > +5% |
0 |
Keep data (extreme rally representation) |
##
## === SAMPLE OUTLIER ROWS (Z-Score Method) ===
##
## === OUTLIER DETECTION COMPLETED! ===
## Total columns: 46
## Total rows : 500
Interpretation:
The outlier detection and handling process identifies extremes using
Z-Score (|Z| > 3) and IQR (quartile bounds), adds flags for price,
return, and volume columns, and applies financial logic to retain market
crashes (return < -5%) and bubbles (return > 5%) as real market
events. The dataset now distinguishes noise from significant anomalies
for accurate financial analysis.
Encode Categorical Variables
## === CATEGORICAL COLUMNS TO BE ENCODED ===
| Sector |
Categorical |
Finance, Consumer Goods, Technology, Healthcare,
Energy |
5 |
| Performance |
Categorical |
Negative, Stable, Positive |
3 |
| Return_Category |
Categorical |
|
0 |
| Volatility_Category |
Categorical |
|
0 |
##
## === LABEL ENCODING RESULT (first 10 rows) ===
| 04npfK5VJaMx |
Finance |
2 |
Negative |
0 |
NA |
NA |
NA |
NA |
| 05VB964gZRxN |
Finance |
2 |
Stable |
1 |
NA |
NA |
NA |
NA |
| 096aNW0N1B2N |
Consumer Goods |
5 |
Stable |
1 |
NA |
NA |
NA |
NA |
| 0PlvCaxPuSd8 |
Technology |
1 |
Negative |
0 |
NA |
NA |
NA |
NA |
| 0TRPr5Qq7IkZ |
Technology |
1 |
Negative |
0 |
NA |
NA |
NA |
NA |
| 0UCD8GLy9tkE |
Healthcare |
3 |
Stable |
1 |
NA |
NA |
NA |
NA |
| 0Y5sFUACczC6 |
Energy |
4 |
Negative |
0 |
NA |
NA |
NA |
NA |
| 0ZxLI3CeGOrs |
Healthcare |
3 |
Stable |
1 |
NA |
NA |
NA |
NA |
| 0ac4ifBIqqi4 |
Consumer Goods |
5 |
Negative |
0 |
NA |
NA |
NA |
NA |
| 0cbc9OWf3kpy |
Finance |
2 |
Negative |
0 |
NA |
NA |
NA |
NA |
##
## === ONE-HOT ENCODING RESULT - SECTOR (first 10 rows) ===
| 04npfK5VJaMx |
Finance |
0 |
1 |
0 |
0 |
0 |
| 05VB964gZRxN |
Finance |
0 |
1 |
0 |
0 |
0 |
| 096aNW0N1B2N |
Consumer Goods |
0 |
0 |
0 |
0 |
1 |
| 0PlvCaxPuSd8 |
Technology |
1 |
0 |
0 |
0 |
0 |
| 0TRPr5Qq7IkZ |
Technology |
1 |
0 |
0 |
0 |
0 |
| 0UCD8GLy9tkE |
Healthcare |
0 |
0 |
1 |
0 |
0 |
| 0Y5sFUACczC6 |
Energy |
0 |
0 |
0 |
1 |
0 |
| 0ZxLI3CeGOrs |
Healthcare |
0 |
0 |
1 |
0 |
0 |
| 0ac4ifBIqqi4 |
Consumer Goods |
0 |
0 |
0 |
0 |
1 |
| 0cbc9OWf3kpy |
Finance |
0 |
1 |
0 |
0 |
0 |
##
## === ONE-HOT ENCODING RESULT - PERFORMANCE (first 10 rows) ===
| 04npfK5VJaMx |
Negative |
0 |
0 |
1 |
| 05VB964gZRxN |
Stable |
0 |
1 |
0 |
| 096aNW0N1B2N |
Stable |
0 |
1 |
0 |
| 0PlvCaxPuSd8 |
Negative |
0 |
0 |
1 |
| 0TRPr5Qq7IkZ |
Negative |
0 |
0 |
1 |
| 0UCD8GLy9tkE |
Stable |
0 |
1 |
0 |
| 0Y5sFUACczC6 |
Negative |
0 |
0 |
1 |
| 0ZxLI3CeGOrs |
Stable |
0 |
1 |
0 |
| 0ac4ifBIqqi4 |
Negative |
0 |
0 |
1 |
| 0cbc9OWf3kpy |
Negative |
0 |
0 |
1 |
##
## === ENCODING SUMMARY ===
| Sector |
Sector_Label (1-5) |
Sector_Technology/Finance/Healthcare/Energy/ConsumerGoods |
Nominal → One-Hot preferred |
| Performance |
Performance_Label (0-2) |
Performance_Positive/Stable/Negative |
Ordinal → Label encoding OK |
| Return_Category |
Return_Label (0-2) |
Return_Gain/Neutral/Loss |
Ordinal → Label encoding OK |
| Volatility_Category |
Volatility_Label (0-2) |
Vol_Low/Medium/High |
Ordinal → Label encoding OK |
##
## === ENCODING COMPLETED! ===
## Total columns now : 64
## Total rows : 500
Interpretation:
The categorical variable encoding process converts columns like
Sector, Performance, Return_Category, and Volatility_Category to
numeric: label encoding for ordinal variables (Performance,
Return_Category, Volatility_Category) with ranked values, and one-hot
encoding for nominal (Sector) with binary 0/1 columns. The data is now
optimally structured for machine learning algorithms.
Normalize or Scale Features
## === Z-SCORE NORMALIZATION RESULTS (first 10 rows) ===
| 04npfK5VJaMx |
2024-11-05 |
NA |
NA |
730528 |
0.8278 |
NA |
NA |
| 05VB964gZRxN |
2020-04-16 |
NA |
NA |
368549 |
-0.4170 |
NA |
NA |
| 096aNW0N1B2N |
2023-05-20 |
NA |
NA |
438714 |
-0.1757 |
NA |
NA |
| 0PlvCaxPuSd8 |
2021-11-12 |
NA |
NA |
771521 |
0.9688 |
NA |
NA |
| 0TRPr5Qq7IkZ |
2021-10-08 |
NA |
NA |
438306 |
-0.1771 |
NA |
NA |
| 0UCD8GLy9tkE |
2020-06-18 |
NA |
NA |
251454 |
-0.8196 |
NA |
NA |
| 0Y5sFUACczC6 |
2023-12-04 |
NA |
NA |
864852 |
1.2897 |
NA |
NA |
| 0ZxLI3CeGOrs |
2020-05-25 |
NA |
NA |
184805 |
-1.0488 |
NA |
NA |
| 0ac4ifBIqqi4 |
2020-03-18 |
NA |
NA |
165972 |
-1.1136 |
NA |
NA |
| 0cbc9OWf3kpy |
2022-09-11 |
NA |
NA |
200755 |
-0.9940 |
NA |
NA |
##
## === MIN-MAX SCALING RESULTS (first 10 rows) ===
| 04npfK5VJaMx |
2024-11-05 |
623.37 |
0.3738 |
NA |
NA |
730528 |
0.7302 |
NA |
NA |
| 05VB964gZRxN |
2020-04-16 |
1396.35 |
0.9265 |
NA |
NA |
368549 |
0.3676 |
NA |
NA |
| 096aNW0N1B2N |
2023-05-20 |
1132.60 |
0.7379 |
NA |
NA |
438714 |
0.4379 |
NA |
NA |
| 0PlvCaxPuSd8 |
2021-11-12 |
862.19 |
0.5446 |
NA |
NA |
771521 |
0.7713 |
NA |
NA |
| 0TRPr5Qq7IkZ |
2021-10-08 |
566.91 |
0.3334 |
NA |
NA |
438306 |
0.4375 |
NA |
NA |
| 0UCD8GLy9tkE |
2020-06-18 |
1047.47 |
0.6771 |
NA |
NA |
251454 |
0.2503 |
NA |
NA |
| 0Y5sFUACczC6 |
2023-12-04 |
1302.99 |
0.8598 |
NA |
NA |
864852 |
0.8648 |
NA |
NA |
| 0ZxLI3CeGOrs |
2020-05-25 |
1176.39 |
0.7692 |
NA |
NA |
184805 |
0.1836 |
NA |
NA |
| 0ac4ifBIqqi4 |
2020-03-18 |
1384.52 |
0.9181 |
NA |
NA |
165972 |
0.1647 |
NA |
NA |
| 0cbc9OWf3kpy |
2022-09-11 |
236.70 |
0.0973 |
NA |
NA |
200755 |
0.1996 |
NA |
NA |
##
## === Z-SCORE STATISTICS VALIDATION (mean ≈ 0, sd ≈ 1) ===
| Znorm_Daily_Return |
NaN |
NA |
Inf |
-Inf |
✓ mean ≈ 0, sd ≈ 1 |
| Znorm_Log_Return |
NaN |
NA |
Inf |
-Inf |
✓ mean ≈ 0, sd ≈ 1 |
| Znorm_Volume_Traded |
0 |
1 |
-1.6791 |
1.754 |
✓ mean ≈ 0, sd ≈ 1 |
| Znorm_Volatility_20 |
NaN |
NA |
Inf |
-Inf |
✓ mean ≈ 0, sd ≈ 1 |
| Znorm_Volatility_50 |
NaN |
NA |
Inf |
-Inf |
✓ mean ≈ 0, sd ≈ 1 |
##
## === MIN-MAX STATISTICS VALIDATION (range [0, 1]) ===
| MinMax_Stock_Price |
0 |
1 |
✓ range [0, 1] |
| MinMax_Daily_Return |
Inf |
-Inf |
✓ range [0, 1] |
| MinMax_Volume_Traded |
0 |
1 |
✓ range [0, 1] |
| MinMax_Market_Cap |
0 |
1 |
✓ range [0, 1] |
| MinMax_Volatility_20 |
Inf |
-Inf |
✓ range [0, 1] |
| MinMax_PE_Ratio |
0 |
1 |
✓ range [0, 1] |
| MinMax_Dividend_Yield |
0 |
1 |
✓ range [0, 1] |
| MinMax_ROE |
0 |
1 |
✓ range [0, 1] |
##
## === SUMMARY OF SCALING METHODS ===
| Z-Score Normalization |
Z = (x - mean) / sd |
Unbounded (mean=0, sd=1) |
Daily_Return, Log_Return, Volume_Traded, |
|
| Volatility_20, Volatility_50 |
Statistical models, regression, SVM, PCA |
|
|
|
| Min-Max Scaling |
X = (x - min) / (max - min) |
[0, 1] |
Stock_Price, Daily_Return, Volume_Traded, |
|
Market_Cap, Volatility_20, PE_Ratio, Dividend_Yield, ROE |Deep
learning, neural networks, KNN, distance-based models |
##
## === SCALING PROCESS COMPLETED! ===
## Total columns : 77
## Total rows : 500
Interpretation:
The normalization and scaling process standardizes the dataset to
ensure that all variables are comparable and suitable for further
analysis and modeling. Through Z-score normalization, variables such as
returns, volume, and volatility are transformed to have a mean close to
0 and a standard deviation close to 1, which helps reduce the impact of
differing scales and improves the performance of statistical models.
Meanwhile, Min-Max scaling rescales selected features into a fixed range
of [0, 1], making them more appropriate for machine learning algorithms
that are sensitive to magnitude, such as neural networks and
distance-based models. The verification results confirm that both
methods have been applied correctly, indicating that the dataset is now
well-prepared, balanced, and optimized for subsequent analytical and
predictive tasks.
Conclusion:
Overall, the data preparation process has been conducted in a
comprehensive and systematic manner, transforming the raw financial
dataset into a clean, structured, and analysis-ready form. Starting from
data loading and inspection, the dataset was verified to be complete and
free of duplicates, followed by thorough cleaning to ensure consistency
in data types and structure. Subsequent steps enriched the dataset
through feature engineering, including the creation of return metrics,
lag variables, rolling statistics, and technical indicators that capture
both temporal patterns and market behavior. The addition of
categorization, binning, and outlier handling further enhanced
interpretability while preserving meaningful financial events.
Furthermore, the dataset was optimized for analytical and machine
learning purposes through proper encoding of categorical variables and
the application of normalization and scaling techniques. These steps
ensure that all features are comparable, reduce bias from differing
magnitudes, and improve model performance. As a result, the final
dataset is not only clean and consistent but also highly informative and
well-suited for advanced tasks such as predictive modeling, trend
analysis, and decision-making in financial contexts.