Data Analytics & Data Science practitioner and academic, certified as a Data Scientist (BNSP) and as an international Certified Data Science Specialist (CDSS), with a focus on data exploration, applied analysis, and the use of data for decision-making.
Objective: Understand what the data represents before applying any transformation or model.
A dataset typically consists of:
Identifier variables (e.g., id, entity_id)
Time variables (e.g., date, timestamp)
Input variables (e.g., x1, x2, x3), which must fall within reasonable ranges
Output or target variable (e.g., y)
Categorical descriptors (e.g., category, group_type)
Understanding the role of each variable is the foundation of data preparation.
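One way to make these roles explicit is to record them next to the data. The snippet below is a minimal sketch using pandas. The sample values and the `roles` mapping are invented for illustration and are not part of the original material.

```python
import pandas as pd

# Small illustrative dataset; the values are made up for this sketch.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "x1": [0.5, 1.2, 0.9],
    "x2": [10.0, 12.5, 11.0],
    "category": ["a", "b", "a"],
    "y": [3.1, 4.0, 3.6],
})

# Hypothetical mapping from role to column names.
roles = {
    "identifier": ["id"],
    "time": ["date"],
    "input": ["x1", "x2"],
    "target": ["y"],
    "categorical": ["category"],
}

# Columns that have not been assigned any role yet.
assigned = [col for cols in roles.values() for col in cols]
unassigned = sorted(set(df.columns) - set(assigned))

# Show role and dtype side by side for a quick review.
role_of = {col: role for role, cols in roles.items() for col in cols}
summary = pd.DataFrame({
    "role": pd.Series(role_of),
    "dtype": df.dtypes.astype(str),
})
print(summary)
```

An empty `unassigned` list indicates that every column has a documented role before any preparation starts.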
Raw data often contains imperfections such as missing values, duplicate records, or invalid entries.
Examples:
Common actions include:
The objective is to reduce noise without removing meaningful information.
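The cleaning step can be sketched with pandas as follows. The sample data, the rule that x2 cannot be negative, and the choice of median imputation are all assumptions made for this example. Other actions may suit a real dataset better.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "x1": [0.5, 1.2, 1.2, np.nan, 0.8],
    "x2": [10.0, 12.5, 12.5, 11.0, -5.0],  # -5.0 is treated as invalid below
    "y": [3.1, 4.0, 4.0, 3.6, 2.9],
})

# 1. Drop exact duplicate records.
clean = raw.drop_duplicates().copy()

# 2. Mark invalid entries as missing (assumption: x2 cannot be negative).
clean.loc[clean["x2"] < 0, "x2"] = np.nan

# 3. Fill remaining gaps with the column median (one of several options).
for col in ["x1", "x2"]:
    clean[col] = clean[col].fillna(clean[col].median())

print(clean)
```

Marking invalid entries as missing first, rather than deleting the rows, keeps the valid values in the other columns of those rows.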
Data transformation adjusts variables into forms that are easier for models to process.
Examples:
Converting date into year and month
Scaling x1 into x1_scaled
Applying a logarithmic transformation to x2 → log_x2
Creating derived variables such as:
Transformations help improve stability, comparability, and model performance.
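The listed transformations might look like this in pandas. The text does not say which scaling method is meant, so this sketch assumes z-score standardization. The sample values are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-11-15", "2024-01-03", "2024-02-20"]),
    "x1": [10.0, 20.0, 30.0],
    "x2": [1.0, 10.0, 100.0],
})

# Split the date into year and month components.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# Standardize x1 to zero mean and unit variance (z-score, population std).
df["x1_scaled"] = (df["x1"] - df["x1"].mean()) / df["x1"].std(ddof=0)

# Natural log of x2; this requires strictly positive values.
df["log_x2"] = np.log(df["x2"])

print(df)
```

If x2 can contain zeros, `np.log1p` is a common alternative, at the cost of a slightly different interpretation.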
When multiple datasets are involved, they must be combined using common variables.
Example join keys:
The resulting dataset contains aligned and synchronized variables such as x1, x2, x3, and y.
The objective is to create a single, coherent dataset for analysis.
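A merge of this kind can be sketched with pandas. The two tables and the join keys `entity_id` and `date` are assumptions chosen for illustration. The actual keys depend on the datasets involved.

```python
import pandas as pd

inputs = pd.DataFrame({
    "entity_id": [1, 1, 2, 3],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-01"]),
    "x1": [0.5, 0.7, 1.2, 0.9],
    "x2": [10.0, 10.5, 12.5, 11.0],
})
targets = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
    "y": [3.1, 3.3, 4.0],
})

# Left join on the assumed keys. validate= raises an error if the keys are
# not unique on either side; indicator= records where each row came from.
merged = inputs.merge(
    targets,
    on=["entity_id", "date"],
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows without a match in `targets` keep y as NaN and show "left_only".
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
```

Inspecting `unmatched` before modeling helps decide whether those rows should be dropped or whether a key mismatch needs fixing upstream.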
Not all variables contribute equally to modeling.
Examples:
This step simplifies the dataset and improves interpretability.
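As an illustration of variable selection, the sketch below applies two simple rules: drop constant columns, then drop one column from each highly correlated pair. Both rules, the 0.95 threshold, and the sample data are assumptions for this example and not a method prescribed by the text.

```python
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9],   # almost a multiple of x1
    "x3": [5.0, 3.0, 6.0, 2.0, 4.0],
    "flag": [1, 1, 1, 1, 1],           # constant, carries no information
})

# Rule 1: drop columns with a single unique value.
constant = [c for c in df.columns if df[c].nunique() <= 1]
reduced = df.drop(columns=constant)

# Rule 2: for each pair with |correlation| above a threshold,
# keep the earlier column and drop the later one.
threshold = 0.95  # illustrative cutoff, not a recommendation
corr = reduced.corr().abs()
cols = list(corr.columns)
redundant = []
for i, a in enumerate(cols):
    if a in redundant:
        continue
    for b in cols[i + 1:]:
        if b not in redundant and corr.loc[a, b] > threshold:
            redundant.append(b)

selected = reduced.drop(columns=redundant)
print(list(selected.columns))
```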
Before modeling, the dataset is reviewed to confirm:
Only after these checks is the data considered ready for modeling.
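A final review can be automated as a small set of boolean checks. The three checks below (no missing values, no duplicate identifiers, inputs within range) and the `valid_ranges` bounds are assumptions for this sketch. A real checklist should reflect the dataset at hand.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "x1": [0.5, 1.2, 0.9, 0.8],
    "x2": [10.0, 12.5, 11.0, 11.5],
    "y": [3.1, 4.0, 3.6, 2.9],
})

# Hypothetical valid ranges for the input variables.
valid_ranges = {"x1": (0.0, 5.0), "x2": (0.0, 100.0)}

checks = {
    "no_missing_values": not df.isna().any().any(),
    "no_duplicate_ids": not df["id"].duplicated().any(),
    "inputs_within_range": all(
        df[col].between(low, high).all()
        for col, (low, high) in valid_ranges.items()
    ),
}

# The dataset is treated as ready only if every check passes.
ready = all(checks.values())
for name, passed in checks.items():
    print(f"{name}: {'ok' if passed else 'FAILED'}")
```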
Begin with basic statistical summaries to understand the scale and variability of the data.
Example variables:
Common checks:
Purpose: To understand the overall distribution and identify extreme values.
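A basic statistical summary can be produced with `DataFrame.describe` in pandas. The sample values are invented, and the extra `mean_minus_median` column is simply one convenient way to surface extreme values, not a standard statistic.

```python
import pandas as pd

df = pd.DataFrame({
    "x1": [0.5, 1.2, 0.9, 0.8, 1.1],
    "x2": [10.0, 12.5, 11.0, 11.5, 60.0],  # 60.0 is an extreme value
    "y": [3.1, 4.0, 3.6, 2.9, 3.8],
})

# Count, mean, std, min, quartiles, and max for each numeric column.
summary = df.describe().T

# A large gap between mean and median hints at skew or extreme values.
summary["mean_minus_median"] = summary["mean"] - summary["50%"]
print(summary[["mean", "50%", "min", "max", "mean_minus_median"]])
```

Here the gap for x2 is much larger than for x1 or y, which points to the single extreme observation.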
Analyze each variable individually.
Examples:
Distribution of y
Distribution of x1, x2
Common visualizations:
Histogram
Density plot
Boxplot
Purpose: To detect skewness, outliers, and the need for transformations.
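The numbers behind these plots can also be computed directly. The sketch below assumes a small invented sample of y, measures skewness, flags outliers with the 1.5 × IQR rule that boxplots conventionally use, and computes the bin counts a histogram would show.

```python
import numpy as np
import pandas as pd

y = pd.Series([2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 4.5, 15.0], name="y")

# Sample skewness: positive values indicate a long right tail.
skewness = y.skew()

# Flag points more than 1.5 * IQR beyond the quartiles (boxplot rule).
q1, q3 = y.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = y[(y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)]

# Bin counts that a histogram with five bins would display.
counts, edges = np.histogram(y, bins=5)

print(f"skewness={skewness:.2f}, outliers={outliers.tolist()}")
```

A strongly positive skewness together with flagged outliers is a typical signal that a transformation such as a logarithm is worth trying.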
Examine relationships between two variables.
Visualizations:
Purpose: To assess direction, strength, and form of relationships (linear or non-linear).
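Direction, strength, and form can be summarized numerically by comparing two correlation coefficients. In this sketch the data are synthetic and the column names `y_linear` and `y_curved` are hypothetical. Pearson correlation measures linear association, while Spearman correlation measures monotonic association, so a gap between the two suggests a non-linear relationship.

```python
import numpy as np
import pandas as pd

x1 = np.linspace(1.0, 10.0, 20)
df = pd.DataFrame({
    "x1": x1,
    "y_linear": 2.0 * x1 + 1.0,     # exactly linear in x1
    "y_curved": np.exp(x1 / 2.0),   # monotonic but non-linear in x1
})

# Correlation of every column with x1 under both methods.
pearson = df.corr(method="pearson")["x1"]
spearman = df.corr(method="spearman")["x1"]

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

The sign of each coefficient gives the direction and its magnitude gives the strength. For `y_curved`, Spearman stays at 1 while Pearson falls below it, which reflects the curved form.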