About Me

ITSB Corporation

Data Analytics Bootcamp

Create a Flexdashboard using R

Bakti Siregar, M.Sc., CDSS.

Data Analytics and Data Science practitioner and academic, certified as a Data Scientist (BNSP) and as an international Certified Data Science Specialist (CDSS), with a focus on data exploration, applied analysis, and the use of data for decision-making.

Dataset

Data Preparation

Data Understanding

Objective: Understand what the data represents before applying any transformation or model.

A dataset typically consists of:

Identifier variables (e.g., id, entity_id)
Time variables (e.g., date, timestamp)
Input variables (e.g., x1, x2, x3), whose values must fall within reasonable ranges
Output or target variable (e.g., y)
Categorical descriptors (e.g., category, group_type)

Understanding the role of each variable is the foundation of data preparation.
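
As an illustration of the variable roles listed above, the following minimal sketch builds a small toy data frame and inspects its structure with base R. The data values and the 0 to 100 range for x1 are assumptions made for this example only; the column names follow the examples in the text.

```r
# Hypothetical toy dataset; values are invented for illustration.
df <- data.frame(
  id       = c(1L, 1L, 2L, 2L),                        # identifier variable
  date     = as.Date(c("2024-01-01", "2024-01-02",
                       "2024-01-01", "2024-01-02")),   # time variable
  x1       = c(10.5, 11.2, 9.8, 10.1),                 # input variables
  x2       = c(100, 120, 95, 110),
  x3       = c(0.3, 0.4, 0.2, 0.5),
  y        = c(1.2, 1.5, 1.1, 1.4),                    # target variable
  category = factor(c("A", "A", "B", "B"))             # categorical descriptor
)

# Show the structure: one line per column with its type and first values
str(df)

# Collect the (first) class of every column in a named character vector
col_types <- sapply(df, function(col) class(col)[1])
print(col_types)

# Simple range check for an input variable (bounds are an assumption)
x1_in_range <- all(df$x1 >= 0 & df$x1 <= 100)
print(x1_in_range)
```

Checking types and ranges this early makes it easier to spot a date stored as text or an input variable with implausible values before any transformation is applied.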

Data Cleaning

Raw data often contains imperfections such as missing values, duplicate records, or invalid entries.

Examples:

x1 = NA
x2 = -999 used as a placeholder
Duplicate rows for the same id and date

Common actions include:

Removing incomplete records
Replacing missing values with summary statistics (e.g., the mean or median)
Correcting or removing obvious errors

The objective is to reduce noise without removing meaningful information.
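
The cleaning actions above can be sketched in base R as follows. The toy data are invented for illustration, and the choice of median imputation is an assumption; other strategies (dropping rows, mean imputation) may suit a given dataset better.

```r
# Hypothetical raw data with a duplicate row, an NA, and a -999 placeholder
raw <- data.frame(
  id   = c(1, 1, 2, 3, 3),
  date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-01",
                   "2024-01-01", "2024-01-02")),
  x1   = c(5, 5, NA, 7, 9),
  x2   = c(10, 10, 20, -999, 40)
)

# 1. Treat the -999 placeholder as a missing value
raw$x2[raw$x2 %in% -999] <- NA

# 2. Remove duplicate records for the same id and date (keep the first)
clean <- raw[!duplicated(raw[, c("id", "date")]), ]

# 3. Replace remaining missing values with the column median
for (col in c("x1", "x2")) {
  med <- median(clean[[col]], na.rm = TRUE)
  clean[[col]][is.na(clean[[col]])] <- med
}

print(clean)
```

Converting the placeholder to NA before imputing matters: otherwise -999 would be treated as a real value and distort the median.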

Data Transformation

Data transformation adjusts variables into forms that are easier for models to process.

Examples:

Converting date into year and month
Scaling x1 into x1_scaled
Applying a logarithmic transformation to x2 → log_x2
Creating derived variables such as:
- x1_lag_1
- x2_avg_3

Transformations help improve stability, comparability, and model performance.
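
A minimal base R sketch of the transformations listed above, using invented data. It assumes x2 is strictly positive (required for the logarithm), that rows are already sorted by date, and that x2_avg_3 means a trailing three-period average; these are assumptions for the example, not definitions from the source.

```r
# Hypothetical time-ordered data
df <- data.frame(
  date = as.Date(c("2024-01-31", "2024-02-29", "2024-03-31",
                   "2024-04-30", "2024-05-31")),
  x1   = c(2, 4, 6, 8, 10),
  x2   = c(1, 10, 100, 1000, 10000)
)

# Convert date into year and month
df$year  <- as.integer(format(df$date, "%Y"))
df$month <- as.integer(format(df$date, "%m"))

# Scale x1 to mean 0 and standard deviation 1 (z-score)
df$x1_scaled <- as.numeric(scale(df$x1))

# Logarithmic transformation of x2 (natural log; x2 must be > 0)
df$log_x2 <- log(df$x2)

# Lagged variable: value of x1 from the previous row
df$x1_lag_1 <- c(NA, head(df$x1, -1))

# Trailing average of x2 over the current and two previous rows
df$x2_avg_3 <- as.numeric(stats::filter(df$x2, rep(1 / 3, 3), sides = 1))

print(df)
```

The lag and the trailing average leave NA in the first rows because no earlier observations exist there; those rows usually need to be dropped or handled before modeling.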

Data Integration

When multiple datasets are involved, they must be combined using common variables.

Example join keys:

id
date

The resulting dataset contains aligned and synchronized variables such as x1, x2, x3, and y.

The objective is to create a single, coherent dataset for analysis.
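
Joining on id and date can be sketched with base R's merge(). The two input tables and their names (inputs, targets) are hypothetical and exist only for this example.

```r
# Two hypothetical sources sharing the join keys id and date
inputs <- data.frame(
  id   = c(1, 1, 2),
  date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-01")),
  x1   = c(0.5, 0.7, 0.9),
  x2   = c(10, 12, 14)
)
targets <- data.frame(
  id   = c(1, 2, 2),
  date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02")),
  x3   = c(3, 4, 5),
  y    = c(100, 200, 300)
)

# Inner join: keep only id/date pairs present in both tables
combined <- merge(inputs, targets, by = c("id", "date"))

# Left join: keep every row of inputs; unmatched rows get NA for x3 and y
combined_all <- merge(inputs, targets, by = c("id", "date"), all.x = TRUE)

print(combined)
print(combined_all)
```

Comparing row counts before and after the join is a quick way to notice keys that fail to match or that match more than once.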

Feature Selection and Reduction

Not all variables contribute equally to modeling.

Examples:

Removing identifiers such as id
Dropping one variable from each highly correlated pair (e.g., x2 vs. x3)
Retaining only relevant predictors (x1, x2) and the target (y)

This step simplifies the dataset and improves interpretability.
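
One possible way to carry out these steps in base R is sketched below. The simulated data, the 0.9 correlation threshold, and the rule of dropping the later variable of a correlated pair are all assumptions for illustration.

```r
# Simulated data in which x3 is almost a rescaled copy of x2
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 2 * x2 + rnorm(n, sd = 0.01)
df <- data.frame(id = seq_len(n), x1, x2, x3,
                 y = 1 + x1 - x2 + rnorm(n, sd = 0.1))

# Absolute pairwise correlations among the candidate predictors
predictors <- df[, c("x1", "x2", "x3")]
corr <- abs(cor(predictors))

# Keep only the lower triangle so each pair is examined once
corr[upper.tri(corr, diag = TRUE)] <- 0

# Flag a variable if it is highly correlated with an earlier one
threshold <- 0.9  # illustrative cutoff
to_drop <- rownames(corr)[apply(corr > threshold, 1, any)]

# Remove the identifier and the flagged variables; keep predictors and y
selected <- df[, setdiff(names(df), c("id", to_drop))]

print(to_drop)
print(names(selected))
```

Which member of a correlated pair to keep is a modeling decision; this sketch simply keeps the one that appears first.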

Final Validation

Before modeling, the dataset is reviewed to confirm:

No critical missing values
Variables are correctly typed
Data distributions are reasonable
Assumptions required for modeling are satisfied

Only after these checks is the data considered ready for modeling.
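
The checks above can be gathered into a single named vector, as in this minimal sketch. The data, the 0 to 10 range for x1, and the "no constant columns" check standing in for a modeling assumption are hypothetical choices for the example.

```r
# Hypothetical prepared dataset
df <- data.frame(
  x1 = c(1.2, 2.4, 3.1, 4.8),
  x2 = c(10, 12, 9, 11),
  y  = c(2.1, 4.0, 5.2, 8.1)
)

checks <- c(
  no_missing       = !anyNA(df),                          # no missing values
  all_numeric      = all(sapply(df, is.numeric)),         # correct types
  x1_in_range      = all(df$x1 >= 0 & df$x1 <= 10),       # plausible values
  no_constant_cols = all(sapply(df, function(v) sd(v) > 0))  # columns vary
)

print(checks)

# The dataset is treated as ready only if every check passes
ready <- all(checks)
print(ready)
```

Naming each check makes it clear which condition failed when ready is FALSE.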

EDA

Descriptive Statistics

Begin with basic statistical summaries to understand the scale and variability of the data.

Example variables:

Target variable: y
Input variables: x1, x2, x3

Common checks:

Mean and median
Minimum and maximum
Standard deviation
Quartiles

Purpose: To understand the overall distribution and identify extreme values.
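
The common checks can be computed with base R as sketched below. The values of y are invented, and flagging extremes with the 1.5 x IQR rule is one common convention chosen here as an assumption.

```r
# Hypothetical target variable with one extreme value
y <- c(2, 4, 4, 5, 7, 9, 40)

# Mean, median, minimum, maximum, and standard deviation
summary_stats <- c(
  mean   = mean(y),
  median = median(y),
  min    = min(y),
  max    = max(y),
  sd     = sd(y)
)
print(summary_stats)

# Quartiles (R's default quantile method, type 7)
q <- quantile(y, probs = c(0.25, 0.5, 0.75))
print(q)

# Flag extreme values with the 1.5 * IQR rule
iqr   <- q[["75%"]] - q[["25%"]]
lower <- q[["25%"]] - 1.5 * iqr
upper <- q[["75%"]] + 1.5 * iqr
outliers <- y[y < lower | y > upper]
print(outliers)
```

Here the mean sits well above the median because of the single large value, which is exactly the kind of pattern these summaries are meant to reveal.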

Univariate Analysis

Analyze each variable individually.

Examples:

Distribution of y
Distribution of x1, x2

Common visualizations:

Histogram
Density plot
Boxplot

Purpose: To detect skewness, outliers, and the need for transformations.
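
A minimal sketch of a univariate analysis in base R follows. The variable y is simulated from an exponential distribution purely to obtain a right-skewed example, and the moment-based skewness formula is one of several definitions in use.

```r
# Simulated right-skewed variable (assumption for illustration)
set.seed(1)
y <- rexp(200, rate = 1)

# Histogram bin counts without drawing
h <- hist(y, plot = FALSE)

# Moment-based sample skewness: positive values indicate a long right tail
skew <- mean((y - mean(y))^3) / (mean((y - mean(y))^2))^1.5
print(skew)

# Points beyond the boxplot whiskers (candidate outliers)
out <- boxplot.stats(y)$out
print(length(out))

# Draw the three common visualizations side by side
op <- par(mfrow = c(1, 3))
hist(y, main = "Histogram of y", xlab = "y")
plot(density(y), main = "Density of y")
boxplot(y, main = "Boxplot of y")
par(op)
```

A clearly positive skewness together with many points above the upper whisker suggests that a transformation may be worth considering.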

Bivariate Analysis

Examine relationships between two variables.

Visualizations:

Scatter plots
Line plots (for time-based data)

Purpose: To assess direction, strength, and form of relationships (linear or non-linear).
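
The difference between a linear and a non-linear relationship can be illustrated with a small sketch. The data are invented; comparing Pearson and Spearman correlations is one possible way to hint at non-linearity, used here as an assumption.

```r
# Hypothetical input and two targets: one linear, one curved
x1       <- c(1, 2, 3, 4, 5, 6)
y_linear <- 2 * x1 + 1
y_curved <- x1^2

# Pearson measures linear association; Spearman measures monotonic association
pearson_linear  <- cor(x1, y_linear, method = "pearson")
pearson_curved  <- cor(x1, y_curved, method = "pearson")
spearman_curved <- cor(x1, y_curved, method = "spearman")

print(c(pearson_linear  = pearson_linear,
        pearson_curved  = pearson_curved,
        spearman_curved = spearman_curved))

# Scatter plots to inspect the form of each relationship
op <- par(mfrow = c(1, 2))
plot(x1, y_linear, main = "Linear", xlab = "x1", ylab = "y")
plot(x1, y_curved, main = "Non-linear", xlab = "x1", ylab = "y")
par(op)
```

When Spearman is noticeably higher than Pearson, the relationship is monotonic but not a straight line, which the scatter plot should confirm.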

Correlation Matrix

Multicollinearity (VIF)

Autocorrelation Function (ACF)

Modeling

Step 1

Step 2

Step 3

Evaluation

Part 1

Part 2

Prediction

Prediction 1

Prediction 2

Prediction 3

Conclusion and Discussion

Conclusion

Discussion