About Me

ITSB Corporation


Data Analytics Bootcamp

Create a Flexdashboard using R


Bakti Siregar, M.Sc., CDSS.


A practitioner and academic in Data Analytics and Data Science, certified as a Data Scientist (BNSP) and as an international Certified Data Science Specialist (CDSS), with a focus on data exploration, applied analysis, and the use of data for decision-making.

Dataset

Data Preparation

Data Understanding

Objective: Understand what the data represents before applying any transformation or model.

A dataset typically consists of:

  • Identifier variables (e.g., id, entity_id)
  • Time variables (e.g., date, timestamp)
  • Input variables (e.g., x1, x2, x3), whose values must fall within reasonable ranges
  • Output or target variable (e.g., y)
  • Categorical descriptors (e.g., category, group_type)

Understanding the role of each variable is the foundation of data preparation.
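As a minimal sketch of this step, the base R snippet below builds a small hypothetical data frame and lists each variable with its role and storage class. The data values and the `roles` lookup are assumptions made for illustration; only the column names come from the examples above.

```r
# Hypothetical toy dataset; column names follow the examples in the text.
df <- data.frame(
  id       = c(1L, 1L, 2L, 2L),
  date     = as.Date(c("2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02")),
  x1       = c(10.5, 11.0, 9.8, 10.1),
  x2       = c(200, 210, 190, 205),
  x3       = c(0.4, 0.5, 0.3, 0.6),
  y        = c(1.2, 1.4, 1.0, 1.3),
  category = factor(c("A", "A", "B", "B"))
)

# Record the intended role of each variable explicitly (an assumed mapping).
roles <- c(id = "identifier", date = "time",
           x1 = "input", x2 = "input", x3 = "input",
           y = "target", category = "categorical")

# One row per variable: its role and its storage class.
overview <- data.frame(
  variable = names(df),
  role     = unname(roles[names(df)]),
  class    = unname(vapply(df, function(col) class(col)[1], character(1)))
)
print(overview)

# Minimum and maximum of each input variable, to judge whether the
# values fall within a reasonable range.
inputs <- c("x1", "x2", "x3")
ranges <- sapply(df[inputs], range)
print(ranges)
```

Writing the roles down in a lookup like this makes later steps (dropping identifiers, selecting predictors) less error-prone than relying on column position.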

Data Cleaning

Raw data often contains imperfections such as missing values, duplicate records, or invalid entries.

Examples:

  • x1 = NA
  • x2 = -999 used as a placeholder
  • Duplicate rows for the same id and date

Common actions include:

  • Removing incomplete records
  • Replacing missing values with summary statistics
  • Correcting or removing obvious errors

The objective is to reduce noise without removing meaningful information.
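The cleaning actions above might look like the following in base R. This is a sketch on made-up data: the choice of the median for imputation and the order of the steps are assumptions, not a prescribed procedure.

```r
# Hypothetical raw data showing the three problems listed above.
raw <- data.frame(
  id   = c(1L, 1L, 2L, 2L, 3L),
  date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-01",
                   "2024-01-02", "2024-01-01")),
  x1   = c(10, 10, NA, 12, 14),
  x2   = c(5, 5, 7, -999, 9)
)

clean <- raw

# 1. Treat the -999 placeholder as a missing value.
clean$x2[clean$x2 %in% -999] <- NA

# 2. Drop duplicate rows for the same id and date (keeps the first one).
clean <- clean[!duplicated(clean[c("id", "date")]), ]

# 3. Replace missing x1 values with a summary statistic (median, assumed).
clean$x1[is.na(clean$x1)] <- median(clean$x1, na.rm = TRUE)

# 4. Remove any records that are still incomplete.
clean <- clean[complete.cases(clean), ]

print(clean)
```

Whether to impute or to drop a record depends on how much information the remaining columns carry; here x1 is imputed while the row with a missing x2 is removed, purely to show both options.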

Data Transformation

Data transformation adjusts variables into forms that are easier for models to process.

Examples:

  • Converting date into year and month

  • Scaling x1 into x1_scaled

  • Applying a logarithmic transformation to x2 → log_x2

  • Creating derived variables such as:

    • x1_lag_1
    • x2_avg_3

Transformations help improve stability, comparability, and model performance.
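The transformations listed above can be sketched in base R as follows. The data are hypothetical, the rows are assumed to be sorted by date for a single entity, and x2 is assumed to be strictly positive so that the logarithm is defined.

```r
# Hypothetical time-ordered data for a single entity.
df <- data.frame(
  date = as.Date("2024-01-29") + 0:4,
  x1   = c(10, 12, 11, 15, 14),
  x2   = c(1, 10, 100, 1000, 10000)
)

# Convert date into year and month.
df$year  <- as.integer(format(df$date, "%Y"))
df$month <- as.integer(format(df$date, "%m"))

# Scale x1 to mean 0 and standard deviation 1.
df$x1_scaled <- as.numeric(scale(df$x1))

# Logarithmic transformation of x2 (natural log).
df$log_x2 <- log(df$x2)

# Lag of one period: each row receives the previous row's x1.
df$x1_lag_1 <- c(NA, head(df$x1, -1))

# Trailing 3-period average of x2 (current row and the two before it).
df$x2_avg_3 <- as.numeric(stats::filter(df$x2, rep(1 / 3, 3), sides = 1))

print(df)
```

With several entities in one table, the lag and the rolling average would need to be computed within each id rather than over the whole column.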

Data Integration

When multiple datasets are involved, they must be combined using common variables.

Example join keys:

  • id
  • date

The resulting dataset contains aligned and synchronized variables such as x1, x2, x3, and y. The objective is to create a single, coherent dataset for analysis.
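A join on id and date can be sketched with base R's `merge()`. The two tables below (`inputs` and `targets`) are hypothetical; the example contrasts an inner join with a left join so that unmatched records are visible.

```r
# Hypothetical input table keyed by id and date.
inputs <- data.frame(
  id   = c(1L, 1L, 2L),
  date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-01")),
  x1   = c(10, 11, 9),
  x2   = c(5, 6, 4),
  x3   = c(0.1, 0.2, 0.3)
)

# Hypothetical target table with the same keys.
targets <- data.frame(
  id   = c(1L, 2L, 2L),
  date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02")),
  y    = c(1.2, 0.9, 1.0)
)

# Inner join: keeps only id/date pairs present in both tables.
merged <- merge(inputs, targets, by = c("id", "date"))

# Left join: keeps every row of inputs; unmatched rows get y = NA.
merged_left <- merge(inputs, targets, by = c("id", "date"), all.x = TRUE)

print(merged)
print(merged_left)
```

Comparing the row counts of the two results is a quick way to see how many records fail to match on the join keys.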

Feature Selection and Reduction

Not all variables contribute equally to modeling.

Examples:

  • Removing identifiers such as id
  • Dropping highly correlated variables (x2 vs. x3)
  • Retaining only relevant predictors (x1, x2) and the target (y)

This step simplifies the dataset and improves interpretability.
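One possible way to carry out these steps in base R is shown below. The data are simulated so that x3 is almost a rescaled copy of x2, and the 0.9 correlation threshold is an assumption chosen for illustration.

```r
set.seed(42)

# Simulated data: x3 is nearly a rescaled copy of x2.
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 2 * x2 + rnorm(n, sd = 0.05)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n, sd = 0.1)
df <- data.frame(id = seq_len(n), x1, x2, x3, y)

# Pairwise correlations among the candidate predictors.
cor_mat <- cor(df[c("x1", "x2", "x3")])
print(round(cor_mat, 3))

# Flag pairs whose absolute correlation exceeds the threshold and
# drop the later variable of each flagged pair.
threshold <- 0.9  # assumed cutoff
high    <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
to_drop <- unique(colnames(cor_mat)[high[, "col"]])

# Remove the identifier and the redundant predictors.
model_df <- df[setdiff(names(df), c("id", to_drop))]
print(names(model_df))
```

Which variable of a correlated pair to keep is a judgment call; dropping the later column is only a simple default.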

Final Validation

Before modeling, the dataset is reviewed to confirm:

  • No critical missing values
  • Variables are correctly typed
  • Data distributions are reasonable
  • Assumptions required for modeling are satisfied

Only after these checks is the data considered ready for modeling.
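The checks above can be collected into a named logical vector, as in the sketch below. The specific rules (for example, x1 must be positive) are hypothetical; model-specific assumptions would need their own tests and are not covered here.

```r
# Hypothetical prepared dataset.
df <- data.frame(
  date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03")),
  x1   = c(10, 11, 12),
  x2   = c(5, 6, 4),
  y    = c(1.2, 1.4, 1.1)
)

model_vars <- c("x1", "x2", "y")

# Each entry is TRUE when the corresponding check passes.
checks <- c(
  no_missing    = !any(is.na(df[model_vars])),
  types_ok      = inherits(df$date, "Date") &&
                  all(vapply(df[model_vars], is.numeric, logical(1))),
  ranges_ok     = all(df$x1 > 0) && all(is.finite(df$y)),  # assumed rules
  no_duplicates = !any(duplicated(df$date))
)
print(checks)

# The dataset is treated as ready only if every check passes.
ready <- all(checks)
print(ready)
```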

EDA

Descriptive Statistics

Begin with basic statistical summaries to understand the scale and variability of the data.

Example variables:

  • Target variable: y
  • Input variables: x1, x2, x3

Common checks:

  • Mean and median
  • Minimum and maximum
  • Standard deviation
  • Quartiles

Purpose: To understand the overall distribution and identify extreme values.
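A compact way to compute these summaries for every variable is sketched below in base R. The data are made up, and `describe()` is a hypothetical helper defined only for this example.

```r
# Hypothetical data; x2 contains one extreme value (40).
df <- data.frame(
  x1 = c(10, 12, 11, 15, 14, 13),
  x2 = c(5, 7, 6, 8, 9, 40),
  x3 = c(0.2, 0.4, 0.3, 0.5, 0.6, 0.4),
  y  = c(1.1, 1.3, 1.2, 1.6, 1.5, 1.4)
)

# Hypothetical helper: the summaries listed above for one numeric vector.
describe <- function(v) {
  c(mean   = mean(v),
    median = median(v),
    min    = min(v),
    max    = max(v),
    sd     = sd(v),
    q1     = unname(quantile(v, 0.25)),
    q3     = unname(quantile(v, 0.75)))
}

# One row per variable, one column per statistic.
stats_table <- t(sapply(df, describe))
print(round(stats_table, 2))
```

In this toy data x1 and x2 share the same mean, but the median of x2 is much lower, which is the kind of gap that points to an extreme value.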

Univariate Analysis

Analyze each variable individually.

Examples:

  • Distribution of y
  • Distribution of x1, x2

Common visualizations:

  • Histogram
  • Density plot
  • Boxplot

Purpose: To detect skewness, outliers, and the need for transformations.
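The numbers behind these plots can be inspected without drawing anything, as in this base R sketch. The data are simulated from a log-normal distribution to mimic a right-skewed y, and `skewness()` is a hypothetical helper using a simple moment-based formula.

```r
set.seed(1)

# Simulated right-skewed target variable.
y <- rlnorm(200, meanlog = 0, sdlog = 0.8)

# Histogram bins, density estimate, and boxplot statistics
# computed without opening a graphics device.
h <- hist(y, plot = FALSE)
d <- density(y)
b <- boxplot.stats(y)

# Hypothetical helper: moment-based sample skewness.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(y)
skew_log <- skewness(log(y))

cat("Bins in histogram:       ", length(h$counts), "\n")
cat("Points beyond whiskers:  ", length(b$out), "\n")
cat("Skewness of y:           ", round(skew_raw, 2), "\n")
cat("Skewness of log(y):      ", round(skew_log, 2), "\n")
```

In a dashboard panel the same objects would typically be drawn with `hist(y)`, `plot(density(y))`, and `boxplot(y)`; a much smaller skewness after the log suggests a logarithmic transformation is worth considering.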

Bivariate Analysis

Examine relationships between two variables.

Visualizations:

  • Scatter plots
  • Line plots (for time-based data)

Purpose: To assess direction, strength, and form of relationships (linear or non-linear).
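Alongside the plots, two correlation coefficients can help separate linear from non-linear relationships. The base R sketch below uses simulated data; `y_lin` and `y_exp` are hypothetical variables created for the comparison.

```r
set.seed(7)

# Simulated input and two targets: one linear in x1, one exponential.
x1    <- seq(1, 10, length.out = 50)
y_lin <- 2 + 0.5 * x1 + rnorm(50, sd = 0.2)
y_exp <- exp(0.5 * x1)

# Pearson measures linear association; Spearman measures monotonic
# association based on ranks.
pearson_lin  <- cor(x1, y_lin, method = "pearson")
pearson_exp  <- cor(x1, y_exp, method = "pearson")
spearman_exp <- cor(x1, y_exp, method = "spearman")

cat("Linear case, Pearson:       ", round(pearson_lin, 3), "\n")
cat("Exponential case, Pearson:  ", round(pearson_exp, 3), "\n")
cat("Exponential case, Spearman: ", round(spearman_exp, 3), "\n")
```

A positive sign indicates an increasing relationship, and a Spearman value clearly above the Pearson value hints that the relationship is monotonic but not linear. A scatter plot such as `plot(x1, y_exp)` would confirm the shape.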

Correlation Matrix

Multicollinearity (VIF)

Autocorrelation Function (ACF)

Modeling

Step 1

Step 2

Step 3


Evaluation

Part 1

Part 2


Prediction

Prediction 1

Prediction 2

Prediction 3

Conclusion and Discussion

Conclusion

Discussion