This markdown document presents common data transformation techniques used in data preprocessing, prepared as a requirement for PhD ITI 624: Application of Learning Analytics in Instructional Design.

The Educational Data Set

source: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams

##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78
## Rows: 1,000
## Columns: 8
## $ gender                      <chr> "female", "female", "female", "male", "mal…
## $ race.ethnicity              <chr> "group B", "group C", "group B", "group A"…
## $ parental.level.of.education <chr> "bachelor's degree", "some college", "mast…
## $ lunch                       <chr> "standard", "standard", "standard", "free/…
## $ test.preparation.course     <chr> "none", "completed", "none", "none", "none…
## $ math.score                  <int> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38, 58…
## $ reading.score               <int> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60, 54…
## $ writing.score               <int> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50, 52…

Techniques:

1. Normalization and Standardization

Adjusting the data to a common scale without distorting differences in the ranges of values. Normalization typically scales data to a [0, 1] range, while standardization adjusts data to have a mean of 0 and a standard deviation of 1.

The dataset contains various columns, including demographic information and test scores for math, reading, and writing. For normalization and standardization, we’ll focus on the numeric columns: math score, reading score, and writing score.

Normalization and standardization are both techniques used to rescale the values of numeric features:

  1. Normalization scales the values between 0 and 1.
  2. Standardization scales the values to have a mean of 0 and a standard deviation of 1.
# Select the numeric columns
edmdata_numeric <- edmdata |> select(`math.score`, `reading.score`, `writing.score`)

head(edmdata_numeric)
##   math.score reading.score writing.score
## 1         72            72            74
## 2         69            90            88
## 3         90            95            93
## 4         47            57            44
## 5         76            78            75
## 6         71            83            78
# Normalize the data (Min-Max Scaling)
edmdata_normalized <- edmdata_numeric |>
  mutate(across(everything(), ~ (. - min(.)) / (max(.) - min(.))))

# View the first few rows of the normalized data
head(edmdata_normalized)
##   math.score reading.score writing.score
## 1       0.72     0.6626506     0.7111111
## 2       0.69     0.8795181     0.8666667
## 3       0.90     0.9397590     0.9222222
## 4       0.47     0.4819277     0.3777778
## 5       0.76     0.7349398     0.7222222
## 6       0.71     0.7951807     0.7555556
# Standardize the data (Z-Score Scaling)
edmdata_standardized <- edmdata_numeric |>
  mutate(across(everything(), ~ (. - mean(.)) / sd(.)))

# View the first few rows of the standardized data
head(edmdata_standardized)
##   math.score reading.score writing.score
## 1  0.3898284     0.1939016     0.3912960
## 2  0.1919795     1.4267621     1.3126119
## 3  1.5769224     1.7692233     1.6416533
## 4 -1.2589131    -0.8334822    -1.5829523
## 5  0.6536271     0.6048551     0.4571043
## 6  0.3238788     0.9473163     0.6545291
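
The same two scalings can also be expressed with the recipes package used in the encoding examples below. This is a minimal sketch of an alternative to the mutate() approach above, assuming recipes is loaded: step_range() performs min-max scaling and step_normalize() performs z-score scaling.

# Min-max scaling with recipes (rescales each numeric column to [0, 1])
recipe_range <- recipe(~ ., data = edmdata_numeric) |>
  step_range(all_numeric())

# Z-score scaling with recipes (centers to mean 0, scales to sd 1)
recipe_zscore <- recipe(~ ., data = edmdata_numeric) |>
  step_normalize(all_numeric())

head(prep(recipe_range) |> juice())
head(prep(recipe_zscore) |> juice())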

2. Encoding

Converting categorical data into numerical formats, such as one-hot encoding, label encoding, or ordinal encoding, so that they can be used in machine learning algorithms.

In the dataset, several columns contain categorical data, such as gender, race/ethnicity, parental level of education, lunch, and test preparation course. To use these categorical variables in machine learning models, we need to encode them into numeric formats.

There are different encoding techniques, such as:

  1. One-Hot Encoding: Creates a new binary column for each category.
  2. Label Encoding: Assigns a unique integer to each category.
  3. Ordinal Encoding: Assigns integers based on a defined order of categories (a sketch of this appears after the encoded output below).
# One-Hot Encoding using step_dummy
recipe_one_hot <- recipe(~ ., data = edmdata) |>
  step_dummy(all_nominal(), -all_outcomes())

# Label Encoding using step_integer
recipe_label <- recipe(~ ., data = edmdata) |>
  step_integer(all_nominal(), -all_outcomes())

# Prepare the recipes
edmdata_one_hot <- prep(recipe_one_hot) |> juice()
edmdata_label <- prep(recipe_label) |> juice()

# View the first few rows of the encoded data
head(edmdata_one_hot)
## # A tibble: 6 × 15
##   math.score reading.score writing.score gender_male race.ethnicity_group.B
##        <int>         <int>         <int>       <dbl>                  <dbl>
## 1         72            72            74           0                      1
## 2         69            90            88           0                      0
## 3         90            95            93           0                      1
## 4         47            57            44           1                      0
## 5         76            78            75           1                      0
## 6         71            83            78           0                      1
## # ℹ 10 more variables: race.ethnicity_group.C <dbl>,
## #   race.ethnicity_group.D <dbl>, race.ethnicity_group.E <dbl>,
## #   parental.level.of.education_bachelor.s.degree <dbl>,
## #   parental.level.of.education_high.school <dbl>,
## #   parental.level.of.education_master.s.degree <dbl>,
## #   parental.level.of.education_some.college <dbl>,
## #   parental.level.of.education_some.high.school <dbl>, lunch_standard <dbl>, …
head(edmdata_label)
## # A tibble: 6 × 8
##   gender race.ethnicity parental.level.of.education lunch test.preparation.cou…¹
##    <int>          <int>                       <int> <int>                  <int>
## 1      1              2                           2     2                      2
## 2      1              3                           5     2                      1
## 3      1              2                           4     2                      2
## 4      2              1                           1     1                      2
## 5      2              3                           5     2                      2
## 6      1              2                           1     2                      2
## # ℹ abbreviated name: ¹​test.preparation.course
## # ℹ 3 more variables: math.score <int>, reading.score <int>,
## #   writing.score <int>
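
The recipe code above covers one-hot and label encoding. For ordinal encoding, a minimal sketch using base R factors is shown below; note that the ranking of parental education levels is an assumption made for illustration and is not defined in the dataset.

# Ordinal Encoding sketch: map parental level of education to ordered integers
# (the ordering below is assumed for illustration)
education_levels <- c("some high school", "high school", "some college",
                      "associate's degree", "bachelor's degree", "master's degree")

edmdata_ordinal <- edmdata |>
  mutate(parental.education.ordinal = as.integer(
    factor(`parental.level.of.education`, levels = education_levels, ordered = TRUE)))

# View the original column next to its ordinal codes
head(edmdata_ordinal |> select(`parental.level.of.education`, parental.education.ordinal))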

3. Aggregation

Summarizing data by combining values, such as calculating averages, sums, or counts for specific groups within the data.

Aggregation involves summarizing or combining many records to produce a smaller set of summary values. In the context of the provided dataset, we can perform aggregation to calculate summary statistics like the mean, median, sum, or count for particular groupings.

Suppose we want to calculate the average math, reading, and writing scores for each group based on gender and race/ethnicity.

# Aggregation: Calculate the mean of math, reading, and writing scores by gender and race/ethnicity
edmdata_aggregated <- edmdata |>
  group_by(gender, `race.ethnicity`) |>
  summarize(
    mean_math_score = mean(`math.score`, na.rm = TRUE),
    mean_reading_score = mean(`reading.score`, na.rm = TRUE),
    mean_writing_score = mean(`writing.score`, na.rm = TRUE),
    count = n()
  ) |>
  ungroup()
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
# View the aggregated data
print(edmdata_aggregated)
## # A tibble: 10 × 6
##    gender race.ethnicity mean_math_score mean_reading_score mean_writing_score
##    <chr>  <chr>                    <dbl>              <dbl>              <dbl>
##  1 female group A                   58.5               69                 67.9
##  2 female group B                   61.4               71.1               70.0
##  3 female group C                   62.0               71.9               71.8
##  4 female group D                   65.2               74.0               75.0
##  5 female group E                   70.8               75.8               75.5
##  6 male   group A                   63.7               61.7               59.2
##  7 male   group B                   65.9               62.8               60.2
##  8 male   group C                   67.6               65.4               62.7
##  9 male   group D                   69.4               66.1               65.4
## 10 male   group E                   76.7               70.3               67.4
## # ℹ 1 more variable: count <int>
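
The same pattern can be written more compactly with across(). The sketch below groups by parental level of education instead, purely to illustrate summarizing every score column at once.

# Compact aggregation sketch: summarize all *.score columns in one across() call
edmdata_by_education <- edmdata |>
  group_by(`parental.level.of.education`) |>
  summarize(across(ends_with("score"), ~ mean(.x, na.rm = TRUE)),
            count = n(),
            .groups = "drop")

print(edmdata_by_education)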

4. Binning

Grouping continuous data into discrete bins or intervals, often to reduce noise or to make the data more manageable.

Binning (or discretization) is the process of converting continuous data into discrete intervals or “bins.” This can be useful for simplifying models or visualizations, or when a categorical representation of numeric data is more appropriate.

Let’s say we want to create bins for the math score in the dataset, categorizing them into three bins: “Low”, “Medium”, and “High”.

# Binning: Create bins for math score
edmdata_binned <- edmdata |>
  mutate(math_score_bin = case_when(
    `math.score` < 60 ~ "Low",
    `math.score` >= 60 & `math.score` < 80 ~ "Medium",
    `math.score` >= 80 ~ "High"
  ))

# View the first few rows with the new binned column
head(edmdata_binned)
##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score math_score_bin
## 1                    none         72            72            74         Medium
## 2               completed         69            90            88         Medium
## 3                    none         90            95            93           High
## 4                    none         47            57            44            Low
## 5                    none         76            78            75         Medium
## 6                    none         71            83            78         Medium
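
The same three bins can also be produced with base R's cut(), sketched below. With right = FALSE each interval includes its lower bound, so the break points c(-Inf, 60, 80, Inf) reproduce the case_when() rules above.

# Binning sketch with cut(): intervals [-Inf, 60), [60, 80), [80, Inf)
edmdata_binned_cut <- edmdata |>
  mutate(math_score_bin = cut(`math.score`,
                              breaks = c(-Inf, 60, 80, Inf),
                              labels = c("Low", "Medium", "High"),
                              right = FALSE))

# Count how many students fall into each bin
table(edmdata_binned_cut$math_score_bin)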

5. Feature Transformation

Applying mathematical functions (e.g., logarithmic, square root, or exponential transformations) to features to change their scale or distribution.

Feature transformation involves modifying the features in a dataset to improve the performance of machine learning models. This can include mathematical transformations like logarithmic, square root, or polynomial transformations, as well as more domain-specific transformations.

In this example, we’ll apply a logarithmic transformation and a polynomial transformation to the math score, reading score, and writing score columns.

# Select the numeric columns to transform
edmdata_numeric <- edmdata |> select(`math.score`, `reading.score`, `writing.score`)

# Logarithmic Transformation
recipe_log <- recipe(~ ., data = edmdata_numeric) |>
  step_log(all_numeric(), offset = 1)

edmdata_log_transformed <- prep(recipe_log) |> juice()

# View the first few rows of the transformed data
head(edmdata_log_transformed)
## # A tibble: 6 × 3
##   math.score reading.score writing.score
##        <dbl>         <dbl>         <dbl>
## 1       4.29          4.29          4.32
## 2       4.25          4.51          4.49
## 3       4.51          4.56          4.54
## 4       3.87          4.06          3.81
## 5       4.34          4.37          4.33
## 6       4.28          4.43          4.37
# Polynomial Transformation (Square)
recipe_poly <- recipe(~ ., data = edmdata_numeric) |>
  step_poly(all_numeric(), degree = 2)

edmdata_poly_transformed <- prep(recipe_poly) |> juice()

# View the first few rows of the transformed data
head(edmdata_poly_transformed)
## # A tibble: 6 × 6
##   math.score_poly_1 math.score_poly_2 reading.score_poly_1 reading.score_poly_2
##               <dbl>             <dbl>                <dbl>                <dbl>
## 1           0.0123           -0.0158               0.00613             -0.0212 
## 2           0.00607          -0.0194               0.0451               0.0326 
## 3           0.0499            0.0412               0.0560               0.0601 
## 4          -0.0398            0.00504             -0.0264              -0.0121 
## 5           0.0207           -0.00834              0.0191              -0.0111 
## 6           0.0102           -0.0172               0.0300               0.00333
## # ℹ 2 more variables: writing.score_poly_1 <dbl>, writing.score_poly_2 <dbl>
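
A square-root transformation, also mentioned above, follows the same recipe pattern. This is a minimal sketch using step_sqrt(); no offset is needed because the scores are non-negative.

# Square Root Transformation
recipe_sqrt <- recipe(~ ., data = edmdata_numeric) |>
  step_sqrt(all_numeric())

edmdata_sqrt_transformed <- prep(recipe_sqrt) |> juice()

# View the first few rows of the transformed data
head(edmdata_sqrt_transformed)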

6. Filtering and Cleaning

Removing or imputing missing values, outliers, or irrelevant data that could negatively impact the analysis.

Filtering and cleaning data are essential steps in preparing a dataset for analysis or modeling. These steps may involve removing or imputing missing values, filtering out irrelevant data, or correcting data quality issues.

Let’s demonstrate how to filter the data based on specific criteria and clean it by handling missing values.

# Step 1: Filter the data
# Example: Filter out students who have a math score less than 50
edmdata_filtered <- edmdata |>
  filter(`math.score` >= 50)

# Step 2: Cleaning the data
# Example: Handle missing values by replacing them with the median of the respective columns
edmdata_cleaned <- edmdata_filtered |>
  mutate(across(c(`math.score`, `reading.score`, `writing.score`), 
                ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))

# Step 3: Additional cleaning (optional)
# Example: Remove any duplicate rows if they exist
edmdata_final <- edmdata_cleaned |>
  distinct()

# View the first few rows of the cleaned data
head(edmdata_final)
##   gender race.ethnicity parental.level.of.education    lunch
## 1 female        group B           bachelor's degree standard
## 2 female        group C                some college standard
## 3 female        group B             master's degree standard
## 4   male        group C                some college standard
## 5 female        group B          associate's degree standard
## 6 female        group B                some college standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         76            78            75
## 5                    none         71            83            78
## 6               completed         88            95            92
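
The definition above also mentions outliers, which the steps so far do not address. One common (but judgment-dependent) check is the 1.5 × IQR rule, sketched here for the math score only.

# Outlier filtering sketch: keep math scores within 1.5 * IQR of the quartiles
q1 <- quantile(edmdata_final$math.score, 0.25, na.rm = TRUE)
q3 <- quantile(edmdata_final$math.score, 0.75, na.rm = TRUE)
iqr_math <- q3 - q1

edmdata_no_outliers <- edmdata_final |>
  filter(`math.score` >= q1 - 1.5 * iqr_math,
         `math.score` <= q3 + 1.5 * iqr_math)

# Number of rows kept after the outlier filter
nrow(edmdata_no_outliers)
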
  • END OF PRESENTATION