Pre-processing is a crucial part of data analysis: machine learning can only deliver optimal, unbiased results if the data is clean and easy to process beforehand. In that sense, pre-processing is the bridge to high-quality, unbiased analysis, and it is practiced whenever data is involved. Pre-processing includes cleaning the data, handling outliers, transforming variables, and encoding features. These procedures minimize error when conducting an analysis, with or without machine learning.
Several Python packages will be imported to assist with pre-processing, namely pandas, seaborn, matplotlib, and openpyxl. Each package is briefly described below.
1. Pandas, as described on w3school.com, is a useful package for cleaning data. Its name refers to both "panel data" and "Python Data Analysis"; pandas can be used to remove empty cells, drop duplicates, and much more. For further information, please head to w3school.com,
2. Matplotlib provides basic graphs such as line plots, scatter plots, bar charts, histograms, and pie charts,
3. Seaborn builds on matplotlib graphs, adding a high-level interface and better aesthetics; it is commonly used for plotting distributions,
4. Openpyxl allows Excel files on the computer to be opened. Importing data from Excel into Python requires the file path of the workbook.
# ----- A. PRE-PROCESSING DATA ----- #
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import openpyxl
The first procedure before pre-processing is importing the data into Python and observing the characteristics of the dataset. Through descriptive statistics the data can be inspected now and compared with the cleaned data at the end of pre-processing to see what has changed, while the dataset information displays the measurement type of each variable. The data used here is a stroke patient dataset that contains missing observations.
Note: Please ensure you have imported the correct data by looking at the first five rows (the data's head).
# 1. Importing Dataset
df = pd.read_excel("Stroke_Dataset.xlsx")
df
## id gender age ... bmi smoking_status stroke
## 0 9046 Male 67.0 ... 36.6 formerly smoked 1
## 1 51676 Female 61.0 ... NaN never smoked 1
## 2 31112 Male 80.0 ... 32.5 never smoked 1
## 3 60182 Female 49.0 ... 34.4 smokes 1
## 4 1665 Female 79.0 ... 24.0 never smoked 1
## ... ... ... ... ... ... ... ...
## 5105 18234 Female 80.0 ... NaN never smoked 0
## 5106 44873 Female 81.0 ... 40.0 never smoked 0
## 5107 19723 Female 35.0 ... 30.6 never smoked 0
## 5108 37544 Male 51.0 ... 25.6 formerly smoked 0
## 5109 44679 Female 44.0 ... 26.2 Unknown 0
##
## [5110 rows x 12 columns]
column_names = df.columns
print(column_names)
## Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
## 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
## 'smoking_status', 'stroke'],
## dtype='object')
# Data Structure
print("📌 The First Five Stroke Data:")
## 📌 The First Five Stroke Data:
print(df.head())
## id gender age ... bmi smoking_status stroke
## 0 9046 Male 67.0 ... 36.6 formerly smoked 1
## 1 51676 Female 61.0 ... NaN never smoked 1
## 2 31112 Male 80.0 ... 32.5 never smoked 1
## 3 60182 Female 49.0 ... 34.4 smokes 1
## 4 1665 Female 79.0 ... 24.0 never smoked 1
##
## [5 rows x 12 columns]
print("\n📌 Dataset Information:")
##
## 📌 Dataset Information:
print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 5110 entries, 0 to 5109
## Data columns (total 12 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 id 5110 non-null int64
## 1 gender 5110 non-null object
## 2 age 5110 non-null float64
## 3 hypertension 5110 non-null int64
## 4 heart_disease 5110 non-null int64
## 5 ever_married 5110 non-null object
## 6 work_type 5110 non-null object
## 7 Residence_type 5110 non-null object
## 8 avg_glucose_level 5110 non-null float64
## 9 bmi 4909 non-null float64
## 10 smoking_status 5110 non-null object
## 11 stroke 5110 non-null int64
## dtypes: float64(3), int64(4), object(5)
## memory usage: 479.2+ KB
## None
print("\n📌 Descriptive Statistics:")
##
## 📌 Descriptive Statistics:
print(df.describe())
## id age ... bmi stroke
## count 5110.000000 5110.000000 ... 4909.000000 5110.000000
## mean 36517.829354 43.226614 ... 28.893237 0.048728
## std 21161.721625 22.612647 ... 7.854067 0.215320
## min 67.000000 0.080000 ... 10.300000 0.000000
## 25% 17741.250000 25.000000 ... 23.500000 0.000000
## 50% 36932.000000 45.000000 ... 28.100000 0.000000
## 75% 54682.000000 61.000000 ... 33.100000 0.000000
## max 72940.000000 82.000000 ... 97.600000 1.000000
##
## [8 rows x 7 columns]
One way to obtain clean data is to fill in the missing observations. Erasing the rows with missing data points is of course an option; however, the additional information may add power and reliability to a model or analysis. Moreover, a dataset with missing observations is often difficult to collect again, so it is usually wiser to fill in the few missing values. Missing values can be handled by substituting the absent observations with the mean, median, or mode, or by using a multivariate imputation strategy. The packages that can handle missing values are described below.
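Before turning to those packages, it can help to see how many values are actually missing in each column. The following is a minimal sketch; it simply re-reads the same Stroke_Dataset.xlsx used above, and in this dataset only bmi should report missing entries.
# Counting missing observations per column (quick check before imputing)
import pandas as pd
df = pd.read_excel("Stroke_Dataset.xlsx")
print(df.isna().sum())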
SimpleImputer is part of the sklearn package and can handle missing observations using the mean: the average of the N observed values of a variable is substituted for the missing ones. It is not restricted to the mean; the strategy can also be changed to other basic statistics such as the median and the mode.
Note: You can replace Body Mass Index (bmi) with other column names that contain missing values if you are using your own data. If you want to use the median or the mode as the substitute, change the strategy argument in SimpleImputer(strategy="mean") to "median" or "most_frequent".
# 2. Data Cleaning
import sklearn.impute
from sklearn.impute import SimpleImputer
import pandas as pd
df = pd.read_excel("Stroke_Dataset.xlsx")
df['bmi'] = pd.to_numeric(df['bmi'], errors='coerce')
imputer_mean = SimpleImputer(strategy="mean")
df_mean = pd.DataFrame(imputer_mean.fit_transform(df[['bmi']]),
columns=['bmi'])
print("\nMean Imputation:")
##
## Mean Imputation:
print(df_mean)
## bmi
## 0 36.600000
## 1 28.893237
## 2 32.500000
## 3 34.400000
## 4 24.000000
## ... ...
## 5105 28.893237
## 5106 40.000000
## 5107 30.600000
## 5108 25.600000
## 5109 26.200000
##
## [5110 rows x 1 columns]
print("\n")
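As mentioned in the note above, the same code can be adapted for median or mode imputation simply by changing the strategy argument. The following is a minimal sketch of both variants; the variable names df_median and df_mode are only illustrative.
# Median and mode (most frequent) imputation as alternatives to the mean
imputer_median = SimpleImputer(strategy="median")
imputer_mode = SimpleImputer(strategy="most_frequent")
df_median = pd.DataFrame(imputer_median.fit_transform(df[['bmi']]), columns=['bmi'])
df_mode = pd.DataFrame(imputer_mode.fit_transform(df[['bmi']]), columns=['bmi'])
print(df_median.head())
print(df_mode.head())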
Similar to SimpleImputer, IterativeImputer is also part of the sklearn package and provides another way to assign values to missing observations. IterativeImputer fills the gaps with estimates from a multivariate imputation, which depends on the other variables in the dataset. As described on Towards Data Science, with IterativeImputer the missing observations are imputed with values estimated from multivariate models of the other variables.
Note: impute_cols can be changed to the variables you want the iterative imputer to use. You may also want to change max_iter and random_state.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
df = pd.read_excel("Stroke_Dataset.xlsx")
impute_cols = [
"bmi",
"age",
"avg_glucose_level"]
imp = IterativeImputer(
max_iter=10,
random_state=42)
df[impute_cols] = imp.fit_transform(df[impute_cols])
print("Iterative Imputation:")
## Iterative Imputation:
print(df[impute_cols])
## bmi age avg_glucose_level
## 0 36.600000 67.0 228.69
## 1 32.598921 61.0 202.21
## 2 32.500000 80.0 105.92
## 3 34.400000 49.0 171.23
## 4 24.000000 79.0 174.12
## ... ... ... ...
## 5105 32.484819 80.0 83.75
## 5106 40.000000 81.0 125.20
## 5107 30.600000 35.0 82.99
## 5108 25.600000 51.0 166.29
## 5109 26.200000 44.0 85.28
##
## [5110 rows x 3 columns]
The variable containing missing values can then be replaced with the imputed version and saved to a separate Excel file using the following code.
# Replace the original bmi column in the full dataframe with the imputed values
# df['bmi'] = df_mean['bmi'] <-- Use this for simpleimputer
df['bmi'] = df[impute_cols]['bmi']
# Save the entire dataframe (all columns) with imputed bmi values
df.to_excel("Brand_New_BMI_Values_II.xlsx", index=False)
As part of transforming the data, feature engineering makes it possible to add a brand new column whose values are the result of a mathematical operation on existing columns. For instance, a health score variable can be derived from age, BMI, and glucose level.
Note: The health score formula is merely an example. Besides that, the data loaded below is the cleaned data, i.e. the result of the previous pre-processing steps.
# 3. Feature Engineering (Health Score)
import pandas as pd
merged_data = pd.read_excel("Brand_New_BMI_Values_II.xlsx")
merged_data['Health_Score'] = (merged_data['bmi'] / merged_data['age']) + (merged_data['avg_glucose_level'] / 100)
print(merged_data.dropna())
## id gender age ... smoking_status stroke Health_Score
## 0 9046 Male 67.0 ... formerly smoked 1 2.833169
## 1 51676 Female 61.0 ... never smoked 1 2.556509
## 2 31112 Male 80.0 ... never smoked 1 1.465450
## 3 60182 Female 49.0 ... smokes 1 2.414341
## 4 1665 Female 79.0 ... never smoked 1 2.044997
## ... ... ... ... ... ... ... ...
## 5105 18234 Female 80.0 ... never smoked 0 1.243560
## 5106 44873 Female 81.0 ... never smoked 0 1.745827
## 5107 19723 Female 35.0 ... never smoked 0 1.704186
## 5108 37544 Male 51.0 ... formerly smoked 0 2.164861
## 5109 44679 Female 44.0 ... Unknown 0 1.448255
##
## [5110 rows x 13 columns]
print("\n")
# Recalling existing columns within the data
column_names = merged_data.columns
print(column_names)
## Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
## 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
## 'smoking_status', 'stroke', 'Health_Score'],
## dtype='object')
Categorical variables are as important as numerical variables, but they need to be translated into numbers to make the analysis possible, especially for machine learning that involves statistical tests; machine learning gives optimal results when the data is well arranged and easy to read. In Python, LabelEncoder can be used to turn string information into dummy variables: each category is mapped to a number from 0 to n-1 (where n is the number of categories in the variable), assigned in alphabetical order. For instance, the ever_married column with yes and no values is turned into 1 and 0.
Note: In df_label_encoded['gender_numeric'] = label_encoder.fit_transform(merged_data['gender']), the left-hand side is the new dummy column being created, while the right-hand side is the variable being converted into numerical values.
# 4. Label for Categorical Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
# Label Encoding
label_encoder = LabelEncoder()
df_label_encoded = merged_data.copy()
df_label_encoded['gender_numeric'] = label_encoder.fit_transform(merged_data['gender'])
df_label_encoded['marital_status_numeric'] = label_encoder.fit_transform(merged_data['ever_married'])
df_label_encoded['employment_category_numeric'] = label_encoder.fit_transform(merged_data['work_type'])
print("\n--- Label Encoded Data ---")
##
## --- Label Encoded Data ---
print(df_label_encoded[['gender_numeric', 'marital_status_numeric',
'employment_category_numeric']])
## gender_numeric marital_status_numeric employment_category_numeric
## 0 1 1 2
## 1 0 1 3
## 2 1 1 2
## 3 0 1 2
## 4 0 1 3
## ... ... ... ...
## 5105 0 1 2
## 5106 0 1 3
## 5107 0 1 3
## 5108 1 1 2
## 5109 0 1 0
##
## [5110 rows x 3 columns]
print("\n")
print(df_label_encoded)
## id gender ... marital_status_numeric employment_category_numeric
## 0 9046 Male ... 1 2
## 1 51676 Female ... 1 3
## 2 31112 Male ... 1 2
## 3 60182 Female ... 1 2
## 4 1665 Female ... 1 3
## ... ... ... ... ... ...
## 5105 18234 Female ... 1 2
## 5106 44873 Female ... 1 3
## 5107 19723 Female ... 1 3
## 5108 37544 Male ... 1 2
## 5109 44679 Female ... 1 0
##
## [5110 rows x 16 columns]
print("\n")
# Displaying available columns
column_names2 = df_label_encoded.columns
print(column_names2)
## Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
## 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
## 'smoking_status', 'stroke', 'Health_Score', 'gender_numeric',
## 'marital_status_numeric', 'employment_category_numeric'],
## dtype='object')
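Because the labels are assigned in alphabetical order, it can be useful to confirm which number each category received. The following is a minimal sketch that fits a fresh encoder on the gender column and prints its mapping; the variable name gender_encoder is only illustrative.
# Inspecting the category-to-number mapping produced by LabelEncoder
gender_encoder = LabelEncoder()
gender_encoder.fit(merged_data['gender'])
print(dict(zip(gender_encoder.classes_,
               gender_encoder.transform(gender_encoder.classes_))))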
Besides encoding string information into dummy variables, standardizing the numerical variables is essential. Data will most likely need to be standardized when it comes from different sources or is measured on different scales; this normalization prevents some variables from exerting disproportionate influence over others. Standardization in Python can be accomplished with either MinMaxScaler or StandardScaler.
MinMaxScaler is a scikit-learn function that re-scales numeric variables into a fixed range between 0 and 1. If the data does not follow a normal distribution and you want to preserve the shape of its distribution, min-max scaling is a good fit, especially when the subsequent analysis requires non-negative values.
Note: You can replace id, avg_glucose_level, age, and bmi with other numerical columns of your choice.
# 5. Re-Scaling Numerical Variables
# a. Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
scaler_minmax = MinMaxScaler()
df_minmax_sklearn = df_label_encoded[['id','avg_glucose_level',
'age', 'bmi']].copy()
df_minmax_sklearn[['avg_glucose_level', 'age', 'bmi']] = scaler_minmax.fit_transform(df_minmax_sklearn[
['avg_glucose_level', 'age', 'bmi']])
print("\n--- Min-Max Scaled Data ---")
##
## --- Min-Max Scaled Data ---
print(df_minmax_sklearn)
## id avg_glucose_level age bmi
## 0 9046 0.801265 0.816895 0.301260
## 1 51676 0.679023 0.743652 0.255429
## 2 31112 0.234512 0.975586 0.254296
## 3 60182 0.536008 0.597168 0.276060
## 4 1665 0.549349 0.963379 0.156930
## ... ... ... ... ...
## 5105 18234 0.132167 0.975586 0.254122
## 5106 44873 0.323516 0.987793 0.340206
## 5107 19723 0.128658 0.426270 0.232532
## 5108 37544 0.513203 0.621582 0.175258
## 5109 44679 0.139230 0.536133 0.182131
##
## [5110 rows x 4 columns]
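As a quick sanity check on the min-max scaling (a minimal sketch reusing df_minmax_sklearn from above), each re-scaled column should now have a minimum of 0 and a maximum of 1.
# Verifying that the re-scaled columns fall inside the [0, 1] range
print(df_minmax_sklearn[['avg_glucose_level', 'age', 'bmi']].agg(['min', 'max']))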
StandardScaler transforms numerical variables so that they have a mean of 0 and a standard deviation of 1. In contrast to min-max scaling, z-score standardization does not confine values to a fixed range; it expresses each value as its distance from the mean in units of standard deviation while preserving the shape of the original distribution. Z-score standardization is commonly used in statistical analysis because it makes values comparable by how far they lie from the mean, which also makes extreme values easy to identify.
Note: As with min-max scaling, you can choose which numerical variables are standardized. You do, however, need a primary key column that can act as an anchor to merge the original data with the re-scaled results. You can also change the name of the Excel file that is saved.
# b. Z-Score Normalization
from sklearn.preprocessing import StandardScaler
scaler_standard = StandardScaler()
df_standardized_sklearn = df_label_encoded[['id',
'avg_glucose_level', 'age', 'bmi']].copy()
df_standardized_sklearn[['avg_glucose_level', 'age', 'bmi']] = scaler_standard.fit_transform(df_standardized_sklearn[
['avg_glucose_level', 'age', 'bmi']])
print("\n--- Standardized Data ---")
##
## --- Standardized Data ---
print(df_standardized_sklearn)
## id avg_glucose_level age bmi
## 0 9046 2.706375 1.051434 0.991064
## 1 51676 2.121559 0.786070 0.472904
## 2 31112 -0.005028 1.626390 0.460094
## 3 60182 1.437358 0.255342 0.706153
## 4 1665 1.501184 1.582163 -0.640699
## ... ... ... ... ...
## 5105 18234 -0.494658 1.626390 0.458128
## 5106 44873 0.420775 1.670617 1.431381
## 5107 19723 -0.511443 -0.363842 0.214034
## 5108 37544 1.328257 0.343796 -0.433491
## 5109 44679 -0.460867 0.034205 -0.355788
##
## [5110 rows x 4 columns]
# Merging the outcomes of standardized numeric values with previous calculations
final_merge_data = pd.merge(df_label_encoded, df_standardized_sklearn,
on='id')
print(final_merge_data)
## id gender age_x ... avg_glucose_level_y age_y bmi_y
## 0 9046 Male 67.0 ... 2.706375 1.051434 0.991064
## 1 51676 Female 61.0 ... 2.121559 0.786070 0.472904
## 2 31112 Male 80.0 ... -0.005028 1.626390 0.460094
## 3 60182 Female 49.0 ... 1.437358 0.255342 0.706153
## 4 1665 Female 79.0 ... 1.501184 1.582163 -0.640699
## ... ... ... ... ... ... ... ...
## 5105 18234 Female 80.0 ... -0.494658 1.626390 0.458128
## 5106 44873 Female 81.0 ... 0.420775 1.670617 1.431381
## 5107 19723 Female 35.0 ... -0.511443 -0.363842 0.214034
## 5108 37544 Male 51.0 ... 1.328257 0.343796 -0.433491
## 5109 44679 Female 44.0 ... -0.460867 0.034205 -0.355788
##
## [5110 rows x 19 columns]
# Save the entire dataframe (all columns)
final_merge_data.to_excel("Complete.xlsx", index=False)
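Similarly, the standardized columns can be checked against the expected properties (a minimal sketch reusing df_standardized_sklearn from above): their means should be close to 0 and their standard deviations close to 1.
# Verifying mean close to 0 and standard deviation close to 1 after z-score standardization
print(df_standardized_sklearn[['avg_glucose_level', 'age', 'bmi']].mean().round(4))
print(df_standardized_sklearn[['avg_glucose_level', 'age', 'bmi']].std().round(4))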
The cleaned data produced by pre-processing can now be taken into basic or even sophisticated analysis. The usual first step to get to know the data further is presented below and consists of visualization and descriptive statistics.
As the name suggests, a treemap compares the magnitude of each smaller category within a variable against the larger root of the data, i.e. across variables. Note that the plotly package is used here, since it provides a wide range of plots, including the treemap.
# ----- B. PROCESSING DATA ----- #
# 1. Visualization
import pandas as pd
import plotly.express as px
brand_new_data = pd.read_excel("Complete.xlsx")
# a. Treemap
# Create a new DataFrame with counts for each category combination
treemap_data = brand_new_data.groupby(['Residence_type', 'gender',
'ever_married']).size().reset_index(name='count')
fig = px.treemap(
treemap_data,
path=['Residence_type', 'gender', 'ever_married'],
values='count',
color='count', # Color based on values
hover_data={'count': True},
color_continuous_scale='Blues',
title='Treemap Based on Residence Type, Gender, and Marital Status'
)
fig.show()
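plotly express also offers a sunburst chart with the same path/values interface, which presents the identical hierarchy as concentric rings rather than nested rectangles. The following is a minimal sketch of this alternative, reusing treemap_data from above; it is not part of the original workflow.
# Optional alternative: the same hierarchy as a sunburst chart
fig_sunburst = px.sunburst(
    treemap_data,
    path=['Residence_type', 'gender', 'ever_married'],
    values='count',
    color='count',
    color_continuous_scale='Blues',
    title='Sunburst Based on Residence Type, Gender, and Marital Status'
)
fig_sunburst.show()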
A distribution plot is one way to convey a comparison between categories; beyond that, it can help you guess which type of distribution the data follows. In addition to the packages above, this visual uses numpy to work with arrays of data and the gaussian_kde function from scipy to estimate the density with Gaussian kernels.
Note: If you have an additional category, you can add more lines of the same form: stroke_1 = df[df['stroke'] == 1]['age_x'], kde_1 = gaussian_kde(stroke_1), plt.plot(x_range, kde_1(x_range), color='red', label='Stroke (1)', linewidth=2), and plt.fill_between(x_range, kde_1(x_range), color='red', alpha=0.2). Do not forget to change the data used to make the visual (e.g. Complete.xlsx) as well as the columns being compared (e.g. stroke, age_x).
# b. Stroke Patients Distribution
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import pandas as pd
df = pd.read_excel("Complete.xlsx")
# Separate data by stroke status
stroke_0 = df[df['stroke'] == 0]['age_x']
stroke_1 = df[df['stroke'] == 1]['age_x']
# Create Kernel Density Estimation (KDE) for each group
kde_0 = gaussian_kde(stroke_0)
kde_1 = gaussian_kde(stroke_1)
x_range = np.linspace(min(df['age_x'])-2, max(df['age_x'])+2, 500)
plt.figure(figsize=(10, 6))
plt.plot(x_range, kde_0(x_range), color='blue', label='No Stroke (0)',
linewidth=2)
plt.plot(x_range, kde_1(x_range), color='red', label='Stroke (1)',
linewidth=2)
plt.fill_between(x_range, kde_0(x_range), color='blue', alpha=0.2)
plt.fill_between(x_range, kde_1(x_range), color='red', alpha=0.2)
plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Distribution of Age by Stroke Status')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
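Since seaborn was introduced earlier for distribution plots, the same comparison can also be drawn more compactly with sns.kdeplot, which computes the kernel density estimates internally. The following is a minimal sketch of this alternative to the manual scipy approach above.
# Compact alternative: seaborn computes and shades the KDE curves directly
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df, x='age_x', hue='stroke', fill=True, common_norm=False)
plt.xlabel('Age')
plt.title('Distribution of Age by Stroke Status (seaborn)')
plt.show()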
A heatmap simply displays the correlation between pairs of variables, using the Pearson correlation coefficient by default.
Note: Please fill selected_vars = [ ] with the variables you want to include in the correlation analysis.
# c. Correlation Heatmap
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_excel("Complete.xlsx")
selected_vars = [
'avg_glucose_level_y',
'hypertension',
'heart_disease',
'age_y',
'bmi_y',
'stroke',
]
# Calculate correlation matrix for selected variables
correlation_matrix = data[selected_vars].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(
correlation_matrix,
annot=True,
cmap='coolwarm',
fmt=".2f",
linewidths=0.5,
cbar_kws={'label': 'Correlation Coefficient'},
square=True
)
plt.title('Pair Correlation of 6 Main Variables with Heatmap Visual', fontsize=14, pad=20)
plt.tight_layout()
plt.show()
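The Pearson coefficient measures linear association; if the relationships are expected to be monotonic rather than linear, the correlation method in pandas can be switched. The following is a minimal sketch using Spearman rank correlation, offered as an alternative rather than part of the original workflow.
# Spearman rank correlation as an alternative to the default Pearson method
spearman_matrix = data[selected_vars].corr(method='spearman')
print(spearman_matrix.round(2))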
To conduct further analysis, the characteristics of the cleaned data need to be identified, since they determine which analyses are appropriate or available. Basic descriptive statistics are the first step toward knowing the data. The descriptive statistics of the pre-processed data can be compared with those of the original data, which contained missing values, to see whether there is any salient difference; the two are expected to differ only slightly, since the original information remains the foundation for drawing conclusions from the analysis.
Note: You may also want to save the descriptive statistics of the new data to an Excel file and change its name. Besides that, you can adjust df['bmi_x'] or add other columns for the comparison.
# 2. Exploration Data Analysis (EDA)
import pandas as pd
# a. statistic descriptive
df = pd.read_excel("Complete.xlsx") #<-- Cleaned Data
print("New Descriptive Statistic:")
## New Descriptive Statistic:
print(df['bmi_x'].describe(), "\n")
## count 5110.000000
## mean 28.947291
## std 7.722465
## min 10.300000
## 25% 23.800000
## 50% 28.200000
## 75% 32.936000
## max 97.600000
## Name: bmi_x, dtype: float64
df1 = pd.read_excel("Stroke_Dataset.xlsx") #<-- Original Data
print("Previous Descriptive Statistic:")
## Previous Descriptive Statistic:
print(df1['bmi'].describe())
## count 4909.000000
## mean 28.893237
## std 7.854067
## min 10.300000
## 25% 23.500000
## 50% 28.100000
## 75% 33.100000
## max 97.600000
## Name: bmi, dtype: float64
# Get descriptive statistics
descriptives = df.describe()
# Reset index to make the descriptions ('count', 'mean', etc.) into a column
descriptives = descriptives.reset_index()
descriptives = descriptives.rename(columns={'index': 'description'})
print("Descriptive Statistics with Description Column:")
## Descriptive Statistics with Description Column:
print(descriptives)
## description id ... age_y bmi_y
## 0 count 5110.000000 ... 5.110000e+03 5.110000e+03
## 1 mean 36517.829354 ... 3.337187e-17 -1.028966e-16
## 2 std 21161.721625 ... 1.000098e+00 1.000098e+00
## 3 min 67.000000 ... -1.908261e+00 -2.414917e+00
## 4 25% 17741.250000 ... -8.061152e-01 -6.665999e-01
## 5 50% 36932.000000 ... 7.843218e-02 -9.677795e-02
## 6 75% 54682.000000 ... 7.860701e-01 5.165578e-01
## 7 max 72940.000000 ... 1.714845e+00 8.890869e+00
##
## [8 rows x 15 columns]
# Save to Excel
descriptives.to_excel("Descriptive Statistic.xlsx", index=False)
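To make the before-and-after comparison easier to read, both sets of descriptive statistics can be placed side by side. The following is a minimal sketch; the column labels imputed_bmi and original_bmi are only illustrative.
# Comparing descriptive statistics of the imputed and the original bmi column
comparison = pd.concat([df['bmi_x'].describe(), df1['bmi'].describe()],
                       axis=1, keys=['imputed_bmi', 'original_bmi'])
print(comparison)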
Pre-processing is a crucial part of data analysis, since it determines the quality of the analysis that follows. It includes cleaning, transformation, feature engineering, label encoding, and standardization, all of which are commonly used to obtain clean data. Machine learning needs well-arranged data to deliver optimal results from the analysis being conducted. Pre-processing is therefore the first step toward high-quality, unbiased analysis and will always be practiced in data analysis.