A. PRE-PROCESSING DATA

       Data pre-processing is a crucial part of data analysis, since machine learning can only deliver optimal, unbiased results if the data are clean and easy to process beforehand. Pre-processing is the bridge to high-quality, unbiased analysis, and it is practiced whenever data are involved. It includes cleaning the data, handling outliers, transforming variables, and encoding features. These procedures minimize error in any analysis, with or without machine learning.

IMPORT PACKAGES

       Several Python packages will be imported to assist with data pre-processing, namely pandas, seaborn, matplotlib, and openpyxl. Each package is described briefly below.
       1. Pandas, a name that refers both to "panel data" and to "Python Data Analysis", is a useful package for cleaning data; it can remove empty cells, drop duplicates, and perform many other operations (see w3schools.com for further information),
       2. Matplotlib provides basic plots such as line charts, scatter plots, bars, histograms, and pie charts,
       3. Seaborn builds on matplotlib, adding a high-level interface and better aesthetics; it is commonly used for plotting distributions,
       4. Openpyxl allows Excel files to be opened from the computer; importing data from Excel into Python requires the file path of the workbook.
# ----- A. PRE-PROCESSING DATA ----- #
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import openpyxl

1. PRE-ANALYZING DATA CHARACTERISTICS

       The first step before pre-processing is importing the data into Python and observing the characteristics of the dataset. Descriptive statistics allow the raw data to be inspected now and compared with the cleaned data at the end of pre-processing to see what changed, while the dataset information shows the measurement type of each variable. The data used here is a stroke patient dataset that contains missing observations.
       Note: Please make sure the correct data has been imported by checking the first five rows (the data's head).
# 1. Importing Dataset
df = pd.read_excel("Stroke_Dataset.xlsx")
df
##          id  gender   age  ...   bmi   smoking_status stroke
## 0      9046    Male  67.0  ...  36.6  formerly smoked      1
## 1     51676  Female  61.0  ...   NaN     never smoked      1
## 2     31112    Male  80.0  ...  32.5     never smoked      1
## 3     60182  Female  49.0  ...  34.4           smokes      1
## 4      1665  Female  79.0  ...  24.0     never smoked      1
## ...     ...     ...   ...  ...   ...              ...    ...
## 5105  18234  Female  80.0  ...   NaN     never smoked      0
## 5106  44873  Female  81.0  ...  40.0     never smoked      0
## 5107  19723  Female  35.0  ...  30.6     never smoked      0
## 5108  37544    Male  51.0  ...  25.6  formerly smoked      0
## 5109  44679  Female  44.0  ...  26.2          Unknown      0
## 
## [5110 rows x 12 columns]
column_names = df.columns
print(column_names)
## Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
##        'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
##        'smoking_status', 'stroke'],
##       dtype='object')
# Data Structure
print("📌 The First Five Stroke Data:")
## 📌 The First Five Stroke Data:
print(df.head())
##       id  gender   age  ...   bmi   smoking_status stroke
## 0   9046    Male  67.0  ...  36.6  formerly smoked      1
## 1  51676  Female  61.0  ...   NaN     never smoked      1
## 2  31112    Male  80.0  ...  32.5     never smoked      1
## 3  60182  Female  49.0  ...  34.4           smokes      1
## 4   1665  Female  79.0  ...  24.0     never smoked      1
## 
## [5 rows x 12 columns]
print("\n📌 Dataset Information:")
## 
## 📌 Dataset Information:
print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 5110 entries, 0 to 5109
## Data columns (total 12 columns):
##  #   Column             Non-Null Count  Dtype  
## ---  ------             --------------  -----  
##  0   id                 5110 non-null   int64  
##  1   gender             5110 non-null   object 
##  2   age                5110 non-null   float64
##  3   hypertension       5110 non-null   int64  
##  4   heart_disease      5110 non-null   int64  
##  5   ever_married       5110 non-null   object 
##  6   work_type          5110 non-null   object 
##  7   Residence_type     5110 non-null   object 
##  8   avg_glucose_level  5110 non-null   float64
##  9   bmi                4909 non-null   float64
##  10  smoking_status     5110 non-null   object 
##  11  stroke             5110 non-null   int64  
## dtypes: float64(3), int64(4), object(5)
## memory usage: 479.2+ KB
## None
print("\n📌 Descriptive Statistics:")
## 
## 📌 Descriptive Statistics:
print(df.describe())
##                  id          age  ...          bmi       stroke
## count   5110.000000  5110.000000  ...  4909.000000  5110.000000
## mean   36517.829354    43.226614  ...    28.893237     0.048728
## std    21161.721625    22.612647  ...     7.854067     0.215320
## min       67.000000     0.080000  ...    10.300000     0.000000
## 25%    17741.250000    25.000000  ...    23.500000     0.000000
## 50%    36932.000000    45.000000  ...    28.100000     0.000000
## 75%    54682.000000    61.000000  ...    33.100000     0.000000
## max    72940.000000    82.000000  ...    97.600000     1.000000
## 
## [8 rows x 7 columns]
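       Before cleaning, it also helps to count how many observations are missing in each column; the dataset information above already hints that bmi is incomplete. A minimal check, using the df loaded above:
# Counting missing observations per column
print("\n📌 Missing Values per Column:")
print(df.isnull().sum())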

2. CLEANING SECTION

       One way to obtain clean data is to fill in its missing observations. Simply erasing rows with missing data points is always an option (a minimal sketch follows this paragraph), but keeping them adds information that can increase the power and reliability of a model or analysis. Moreover, many datasets with missing observations were difficult to collect in the first place, so it is often wiser to fill in the gaps. Missing values can be handled by substituting the mean, median, or mode, or by using a multivariate imputation strategy. The subsections below describe the scikit-learn functions that handle missing values, along with their definitions.
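       A minimal sketch of the dropping approach, assuming df is the DataFrame loaded earlier; it simply removes the rows where bmi is missing:
# Dropping rows with missing bmi values (alternative to imputation)
df_dropped = df.dropna(subset=['bmi'])
print(df_dropped.shape)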

Simple Imputer

       SimpleImputer is part of the scikit-learn (sklearn) package and can handle missing observations using the sample mean: the average of the N observed values of a variable is substituted for the missing ones. Besides the mean, it can also impute with other basic statistics such as the median or the mode.
       Note: Replace Body Mass Index (bmi) with whichever columns have missing values if you are using your own data. To use the median or mode as the substitute, change the strategy in SimpleImputer(strategy="mean") accordingly.
# 2. Data Cleaning
from sklearn.impute import SimpleImputer
import pandas as pd

df = pd.read_excel("Stroke_Dataset.xlsx")
df['bmi'] = pd.to_numeric(df['bmi'], errors='coerce')

imputer_mean = SimpleImputer(strategy="mean")

df_mean = pd.DataFrame(imputer_mean.fit_transform(df[['bmi']]),
    columns=['bmi'])

print("\nMean Imputation:")
## 
## Mean Imputation:
print(df_mean)
##             bmi
## 0     36.600000
## 1     28.893237
## 2     32.500000
## 3     34.400000
## 4     24.000000
## ...         ...
## 5105  28.893237
## 5106  40.000000
## 5107  30.600000
## 5108  25.600000
## 5109  26.200000
## 
## [5110 rows x 1 columns]
print("\n")

Iterative Imputer

       Like SimpleImputer, IterativeImputer is part of the scikit-learn package and offers another way to assign new values to missing observations. It fills each missing sample with an estimate from multivariate imputation, which depends on the other variables in the dataset. As described on Towards Data Science, missing observations are imputed with values predicted by models built from the other variables.
       Note: impute_cols can be changed to whichever variables should take part in the iterative imputation. You may also want to adjust max_iter and random_state.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer 
from sklearn.impute import IterativeImputer

df = pd.read_excel("Stroke_Dataset.xlsx")
impute_cols = [
    "bmi",
    "age",
    "avg_glucose_level"]

    
imp = IterativeImputer(
    max_iter=10,
    random_state=42)

df[impute_cols] = imp.fit_transform(df[impute_cols])

print("Iterative Imputation:")
## Iterative Imputation:
print(df[impute_cols])
##             bmi   age  avg_glucose_level
## 0     36.600000  67.0             228.69
## 1     32.598921  61.0             202.21
## 2     32.500000  80.0             105.92
## 3     34.400000  49.0             171.23
## 4     24.000000  79.0             174.12
## ...         ...   ...                ...
## 5105  32.484819  80.0              83.75
## 5106  40.000000  81.0             125.20
## 5107  30.600000  35.0              82.99
## 5108  25.600000  51.0             166.29
## 5109  26.200000  44.0              85.28
## 
## [5110 rows x 3 columns]
       The variable that had missing values can now be replaced with its imputed version and the whole dataset saved to a separate Excel file using the following code.
# Replace the original bmi column in the full dataframe with the imputed values

# df['bmi'] = df_mean['bmi'] <-- Use this for simpleimputer
df['bmi'] = df[impute_cols]['bmi']

# Save the entire dataframe (all columns) with imputed bmi values
df.to_excel("Brand_New_BMI_Values_II.xlsx", index=False)

3. FEATURE ENGINEERING

       As part of transforming the data, feature engineering adds brand new columns whose values can be the result of mathematical operations on existing ones. For instance, a health score variable can be calculated from age, BMI, and glucose level.
       Note: The health score formula is merely an example. Also, the data loaded below is the cleaned data produced in the previous pre-processing section.
# 3. Feature Engineering (Health Score)
import pandas as pd

merged_data = pd.read_excel("Brand_New_BMI_Values_II.xlsx")

merged_data['Health_Score'] = (merged_data['bmi'] / merged_data['age']) + (merged_data['avg_glucose_level'] / 100)

print(merged_data.dropna())
##          id  gender   age  ...   smoking_status  stroke Health_Score
## 0      9046    Male  67.0  ...  formerly smoked       1     2.833169
## 1     51676  Female  61.0  ...     never smoked       1     2.556509
## 2     31112    Male  80.0  ...     never smoked       1     1.465450
## 3     60182  Female  49.0  ...           smokes       1     2.414341
## 4      1665  Female  79.0  ...     never smoked       1     2.044997
## ...     ...     ...   ...  ...              ...     ...          ...
## 5105  18234  Female  80.0  ...     never smoked       0     1.243560
## 5106  44873  Female  81.0  ...     never smoked       0     1.745827
## 5107  19723  Female  35.0  ...     never smoked       0     1.704186
## 5108  37544    Male  51.0  ...  formerly smoked       0     2.164861
## 5109  44679  Female  44.0  ...          Unknown       0     1.448255
## 
## [5110 rows x 13 columns]
print("\n")
# Recalling existing columns within the data
column_names = merged_data.columns 
print(column_names)
## Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
##        'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
##        'smoking_status', 'stroke', 'Health_Score'],
##       dtype='object')
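       Feature engineering is not limited to arithmetic combinations; a new categorical column can also be derived by binning a numeric one. A minimal sketch using pd.cut, where the age bin edges and labels are illustrative only:
# Binning age into illustrative groups as an additional engineered feature
merged_data['age_group'] = pd.cut(
    merged_data['age'],
    bins=[0, 18, 40, 60, 120],
    labels=['child', 'young_adult', 'middle_aged', 'senior'])
print(merged_data[['age', 'age_group']].head())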

4. LABELING CATEGORICAL DATA

       Categorical variables are as important as numerical ones, but they need to be translated into numbers to make the analysis possible, especially for machine learning, which relies on statistical computation. Machine learning gives better results when the data is well arranged and easy to read. In Python, LabelEncoder can be used to convert string categories into integer codes running from 0 to n-1, where n is the number of categories in the variable, assigned in alphabetical order. For instance, the ever_married column with yes and no values is turned into 1 and 0.
     df_label_encoded['gender_numeric'] = label_encoder.fit_transform(merged_data['gender'])

Note: The left side of the statement is the new column created to hold the encoded values, while the right side is the variable that is turned into numbers.

# 4. Label for Categorical Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
 
# Label Encoding
label_encoder = LabelEncoder()
df_label_encoded = merged_data.copy()

df_label_encoded['gender_numeric'] = label_encoder.fit_transform(merged_data['gender'])

df_label_encoded['marital_status_numeric'] = label_encoder.fit_transform(merged_data['ever_married'])

df_label_encoded['employment_category_numeric'] = label_encoder.fit_transform(merged_data['work_type'])

print("\n--- Label Encoded Data ---")
## 
## --- Label Encoded Data ---
print(df_label_encoded[['gender_numeric', 'marital_status_numeric',
    'employment_category_numeric']])
##       gender_numeric  marital_status_numeric  employment_category_numeric
## 0                  1                       1                            2
## 1                  0                       1                            3
## 2                  1                       1                            2
## 3                  0                       1                            2
## 4                  0                       1                            3
## ...              ...                     ...                          ...
## 5105               0                       1                            2
## 5106               0                       1                            3
## 5107               0                       1                            3
## 5108               1                       1                            2
## 5109               0                       1                            0
## 
## [5110 rows x 3 columns]
print("\n")
print(df_label_encoded)
##          id  gender  ...  marital_status_numeric  employment_category_numeric
## 0      9046    Male  ...                       1                            2
## 1     51676  Female  ...                       1                            3
## 2     31112    Male  ...                       1                            2
## 3     60182  Female  ...                       1                            2
## 4      1665  Female  ...                       1                            3
## ...     ...     ...  ...                     ...                          ...
## 5105  18234  Female  ...                       1                            2
## 5106  44873  Female  ...                       1                            3
## 5107  19723  Female  ...                       1                            3
## 5108  37544    Male  ...                       1                            2
## 5109  44679  Female  ...                       1                            0
## 
## [5110 rows x 16 columns]
print("\n")
# Displaying available columns
column_names2 = df_label_encoded.columns
print(column_names2)
## Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
##        'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
##        'smoking_status', 'stroke', 'Health_Score', 'gender_numeric',
##        'marital_status_numeric', 'employment_category_numeric'],
##       dtype='object')
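       Because LabelEncoder assigns integers, it can suggest an order that the categories do not actually have. When that matters for the analysis, one-hot encoding is a common alternative; a minimal sketch using pandas' get_dummies on work_type (the choice of column is just an example):
# One-hot encoding as an alternative to label encoding
df_onehot = pd.get_dummies(merged_data, columns=['work_type'], prefix='work')
print(df_onehot.filter(like='work_').head())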

5. RE-SCALING NUMERICAL VARIABLES

       Besides encoding string categories into numbers, re-scaling the numerical variables is essential. Data will most likely need to be re-scaled when the variables come from different sources or are measured on very different scales. Normalization prevents some variables from dominating others simply because of their units. In Python, re-scaling can be accomplished with either MinMaxScaler or StandardScaler.

Min-Max Scaling

       MinMaxScaler is a scikit-learn function that re-scales numeric variables into a fixed range between 0 and 1. When the data does not follow a normal distribution and the shape of its distribution should be preserved, min-max scaling is a good fit, especially when the subsequent analysis requires non-negative values.
       Note: You can replace id, avg_glucose_level, age, and bmi with other numerical columns of your choice.
# 5. Re-Scaling Numerical Variables
# a. Min-Max Scaling 
from sklearn.preprocessing import MinMaxScaler

scaler_minmax = MinMaxScaler()

df_minmax_sklearn = df_label_encoded[['id','avg_glucose_level',
    'age', 'bmi']].copy()

df_minmax_sklearn[['avg_glucose_level', 'age', 'bmi']] = scaler_minmax.fit_transform(df_minmax_sklearn[
    ['avg_glucose_level', 'age', 'bmi']])

print("\n--- Min-Max Scaled Data ---")
## 
## --- Min-Max Scaled Data ---
print(df_minmax_sklearn)
##          id  avg_glucose_level       age       bmi
## 0      9046           0.801265  0.816895  0.301260
## 1     51676           0.679023  0.743652  0.255429
## 2     31112           0.234512  0.975586  0.254296
## 3     60182           0.536008  0.597168  0.276060
## 4      1665           0.549349  0.963379  0.156930
## ...     ...                ...       ...       ...
## 5105  18234           0.132167  0.975586  0.254122
## 5106  44873           0.323516  0.987793  0.340206
## 5107  19723           0.128658  0.426270  0.232532
## 5108  37544           0.513203  0.621582  0.175258
## 5109  44679           0.139230  0.536133  0.182131
## 
## [5110 rows x 4 columns]
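       Min-max scaling follows the formula x' = (x - min) / (max - min), so the result above can be reproduced by hand for one column as a quick check; a minimal sketch on bmi:
# Reproducing min-max scaling manually for the bmi column
bmi = df_label_encoded['bmi']
bmi_minmax_manual = (bmi - bmi.min()) / (bmi.max() - bmi.min())
print(bmi_minmax_manual.head())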

Z-Score Normalization

       Z-score normalization (StandardScaler) changes numerical variables so that they have a mean of 0 and a standard deviation of 1. Unlike min-max scaling, the values are expressed as distances from the mean in units of standard deviation; note that this shifts and re-scales the data but does not change the shape of its distribution. Z-score normalization is commonly used in statistical analysis because it makes values comparable by their distance from the mean, which also helps identify extreme values lying many standard deviations away.
       Note: As with min-max scaling, you can choose which numerical variables are standardized. You also need a primary key column that can act as an anchor for merging the original data with the re-scaled results, and you can change the name of the Excel file saved at the end.
# b. Z-Score Normalization 
from sklearn.preprocessing import StandardScaler
scaler_standard = StandardScaler()

df_standardized_sklearn = df_label_encoded[['id',
    'avg_glucose_level', 'age', 'bmi']].copy()

df_standardized_sklearn[['avg_glucose_level', 'age', 'bmi']] = scaler_standard.fit_transform(df_standardized_sklearn[
    ['avg_glucose_level', 'age', 'bmi']])

print("\n--- Standardized Data ---")
## 
## --- Standardized Data ---
print(df_standardized_sklearn)
##          id  avg_glucose_level       age       bmi
## 0      9046           2.706375  1.051434  0.991064
## 1     51676           2.121559  0.786070  0.472904
## 2     31112          -0.005028  1.626390  0.460094
## 3     60182           1.437358  0.255342  0.706153
## 4      1665           1.501184  1.582163 -0.640699
## ...     ...                ...       ...       ...
## 5105  18234          -0.494658  1.626390  0.458128
## 5106  44873           0.420775  1.670617  1.431381
## 5107  19723          -0.511443 -0.363842  0.214034
## 5108  37544           1.328257  0.343796 -0.433491
## 5109  44679          -0.460867  0.034205 -0.355788
## 
## [5110 rows x 4 columns]
# Merging the outcomes of standardized numeric values with previous calculations
final_merge_data = pd.merge(df_label_encoded, df_standardized_sklearn,
    on='id')
print(final_merge_data)
##          id  gender  age_x  ...  avg_glucose_level_y     age_y     bmi_y
## 0      9046    Male   67.0  ...             2.706375  1.051434  0.991064
## 1     51676  Female   61.0  ...             2.121559  0.786070  0.472904
## 2     31112    Male   80.0  ...            -0.005028  1.626390  0.460094
## 3     60182  Female   49.0  ...             1.437358  0.255342  0.706153
## 4      1665  Female   79.0  ...             1.501184  1.582163 -0.640699
## ...     ...     ...    ...  ...                  ...       ...       ...
## 5105  18234  Female   80.0  ...            -0.494658  1.626390  0.458128
## 5106  44873  Female   81.0  ...             0.420775  1.670617  1.431381
## 5107  19723  Female   35.0  ...            -0.511443 -0.363842  0.214034
## 5108  37544    Male   51.0  ...             1.328257  0.343796 -0.433491
## 5109  44679  Female   44.0  ...            -0.460867  0.034205 -0.355788
## 
## [5110 rows x 19 columns]

# Save the entire dataframe (all columns) 
final_merge_data.to_excel("Complete.xlsx", index=False)
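       Z-score normalization follows z = (x - mean) / standard deviation, so a quick sanity check is that each standardized column should have a mean very close to 0 and a standard deviation very close to 1; a minimal sketch:
# Checking that the standardized columns have mean ~0 and std ~1
print(df_standardized_sklearn[['avg_glucose_level', 'age', 'bmi']]
    .agg(['mean', 'std']).round(3))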

B. PROCESSING DATA

       The cleaned data produced by pre-processing can now be taken into basic or even sophisticated analysis. The sections below cover the usual first steps for getting to know the data further: visualization and descriptive statistics.

1. VISUALIZATION

Treemap

       As the name suggests, a treemap compares the magnitude of each smaller category nested inside larger ones, from the root of the data down through the chosen variables. Note that the plotly package is used here since it provides a wide range of plots, including treemaps.
# ----- B. PROCESSING DATA ----- #
# 1. Visualization
import pandas as pd
import plotly.express as px

brand_new_data = pd.read_excel("Complete.xlsx")

# a. Treemap
# Create a new DataFrame with counts for each category combination
treemap_data = brand_new_data.groupby(['Residence_type', 'gender',
    'ever_married']).size().reset_index(name='count')

fig = px.treemap(
    treemap_data,
    path=['Residence_type', 'gender', 'ever_married'], 
    values='count',
    color='count',   # Color based on values
    hover_data={'count': True},
    color_continuous_scale='Blues',
    title='Treemap Based on Residence Type, Gender, and Marital Status'
)
fig.show()

Distribution

       A distribution plot is one way to convey a comparison between categories. Beyond that, plotting the distribution may also help guess which type of distribution the data follows. To build the visual, the numpy package is added to work with arrays, and the gaussian_kde function from scipy provides the kernel density estimate underlying the curves.
       Note: The following lines can be duplicated if you have an additional category:
  1. stroke_1 = df[df['stroke'] == 1]['age_x'],
  2. kde_1 = gaussian_kde(stroke_1),
  3. plt.plot(x_range, kde_1(x_range), color='red', label='Stroke (1)', linewidth=2),
  4. plt.fill_between(x_range, kde_1(x_range), color='red', alpha=0.2).
Do not forget to change the data file used for the visual (e.g. Complete.xlsx) as well as the columns that will be compared (e.g. stroke, age_x).
# b. Stroke Patients Distribution
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import pandas as pd

df = pd.read_excel("Complete.xlsx")

# Separate data by stroke status
stroke_0 = df[df['stroke'] == 0]['age_x']
stroke_1 = df[df['stroke'] == 1]['age_x']

# Create Kernel Density Estimation (KDE) for each group
kde_0 = gaussian_kde(stroke_0)
kde_1 = gaussian_kde(stroke_1)

x_range = np.linspace(min(df['age_x'])-2, max(df['age_x'])+2, 500)
plt.figure(figsize=(10, 6))

plt.plot(x_range, kde_0(x_range), color='blue', label='No Stroke (0)',
    linewidth=2)
plt.plot(x_range, kde_1(x_range), color='red', label='Stroke (1)',
    linewidth=2)

plt.fill_between(x_range, kde_0(x_range), color='blue', alpha=0.2)
plt.fill_between(x_range, kde_1(x_range), color='red', alpha=0.2)

plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Distribution of Age by Stroke Status')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
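       The distribution plot also makes extreme values easier to spot. Outlier handling, mentioned in the introduction, can then be sketched with the interquartile range (IQR) rule; the 1.5 multiplier below is the conventional choice, not a requirement:
# Flagging bmi outliers with the IQR rule (values beyond 1.5 * IQR)
q1 = df['bmi_x'].quantile(0.25)
q3 = df['bmi_x'].quantile(0.75)
iqr = q3 - q1
outliers = df[(df['bmi_x'] < q1 - 1.5 * iqr) | (df['bmi_x'] > q3 + 1.5 * iqr)]
print("Number of bmi outliers:", len(outliers))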

Heatmap

       A heatmap simply displays the correlations between variables, using the Pearson correlation as the default.
       Note: Please fill selected_vars = [ ] with the variables designated for the correlation analysis.
# c. Correlation Heatmap
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_excel("Complete.xlsx")

selected_vars = [
    'avg_glucose_level_y',
    'hypertension',
    'heart_disease',
    'age_y', 
    'bmi_y',
    'stroke',
]

# Calculate correlation matrix for selected variables
correlation_matrix = data[selected_vars].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(
    correlation_matrix,
    annot=True,
    cmap='coolwarm',
    fmt=".2f", 
    linewidths=0.5,
    cbar_kws={'label': 'Correlation Coefficient'},
    square=True
)

plt.title('Pair Correlation of 6 Main Variables with Heatmap Visual', fontsize=14, pad=20)
plt.tight_layout()
plt.show()
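       As noted above, .corr() uses the Pearson correlation by default; if a rank-based measure is preferred, the method argument can be changed. A minimal sketch:
# Spearman correlation as an alternative to the default Pearson
spearman_matrix = data[selected_vars].corr(method='spearman')
print(spearman_matrix.round(2))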

2. EXPLORATORY DATA ANALYSIS (EDA)

       Before conducting further analysis, the characteristics of the cleaned data need to be identified, since they determine which analyses are appropriate or available. Basic descriptive statistics are a natural first step. The descriptive statistics of the pre-processed data can be compared with those of the original data, which contained missing values, to see whether there is any salient difference; the two are expected to differ only slightly, because the original information remains the foundation for any conclusion drawn from the analysis.
       Note: You may also want to save the descriptive statistics of the new data to an Excel file and rename it. Besides that, you can adjust the compared column (e.g. bmi_x) or add other columns for comparison.
# 2. Exploratory Data Analysis (EDA)
import pandas as pd

# a. Descriptive statistics
df = pd.read_excel("Complete.xlsx") #<-- Cleaned Data
print("New Descriptive Statistic:")
## New Descriptive Statistic:
print(df['bmi_x'].describe(), "\n")
## count    5110.000000
## mean       28.947291
## std         7.722465
## min        10.300000
## 25%        23.800000
## 50%        28.200000
## 75%        32.936000
## max        97.600000
## Name: bmi_x, dtype: float64
df1 = pd.read_excel("Stroke_Dataset.xlsx") #<-- Original Data
print("Previous Descriptive Statistic:")
## Previous Descriptive Statistic:
print(df1['bmi'].describe())
## count    4909.000000
## mean       28.893237
## std         7.854067
## min        10.300000
## 25%        23.500000
## 50%        28.100000
## 75%        33.100000
## max        97.600000
## Name: bmi, dtype: float64
# Get descriptive statistics
descriptives = df.describe()

# Reset index to make the descriptions ('count', 'mean', etc.) into a column
descriptives = descriptives.reset_index()
descriptives = descriptives.rename(columns={'index': 'description'})

print("Descriptive Statistics with Description Column:")
## Descriptive Statistics with Description Column:
print(descriptives)
##   description            id  ...         age_y         bmi_y
## 0       count   5110.000000  ...  5.110000e+03  5.110000e+03
## 1        mean  36517.829354  ...  3.337187e-17 -1.028966e-16
## 2         std  21161.721625  ...  1.000098e+00  1.000098e+00
## 3         min     67.000000  ... -1.908261e+00 -2.414917e+00
## 4         25%  17741.250000  ... -8.061152e-01 -6.665999e-01
## 5         50%  36932.000000  ...  7.843218e-02 -9.677795e-02
## 6         75%  54682.000000  ...  7.860701e-01  5.165578e-01
## 7         max  72940.000000  ...  1.714845e+00  8.890869e+00
## 
## [8 rows x 15 columns]

# Save to Excel
descriptives.to_excel("Descriptive Statistic.xlsx", index=False)

C. CONCLUSION

       Pre-processing is a crucial part of data analysis since it determines the quality of the analysis that follows. It includes cleaning, transformation, feature engineering, labeling, and standardization, all of which are commonly used to obtain clean data. Machine learning in particular needs well-arranged data to give optimal results. Pre-processing is therefore the first step toward high-quality, unbiased analysis and should be practiced every time data is analyzed.