This markdown document presents common data transformation techniques used in data preprocessing, as a requirement for the PhD course ITI 624: Application of Learning Analytics in Instructional Design.
Data source: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
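For reference, the inspection output below can be reproduced with a few lines of R; this is a minimal sketch in which the file name StudentsPerformance.csv is an assumption about how the Kaggle download was saved locally.
# Load packages and read the dataset (file name assumed)
library(dplyr)
edmdata <- read.csv("StudentsPerformance.csv")
head(edmdata)      # first six rows, shown below
glimpse(edmdata)   # column types and sample values, shown below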
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 47 57 44
## 5 none 76 78 75
## 6 none 71 83 78
## Rows: 1,000
## Columns: 8
## $ gender <chr> "female", "female", "female", "male", "mal…
## $ race.ethnicity <chr> "group B", "group C", "group B", "group A"…
## $ parental.level.of.education <chr> "bachelor's degree", "some college", "mast…
## $ lunch <chr> "standard", "standard", "standard", "free/…
## $ test.preparation.course <chr> "none", "completed", "none", "none", "none…
## $ math.score <int> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38, 58…
## $ reading.score <int> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60, 54…
## $ writing.score <int> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50, 52…
Normalization and standardization are both techniques for rescaling numeric features: they adjust the data to a common scale without distorting differences in the ranges of values. Normalization typically scales data to the [0, 1] range, while standardization rescales data to have a mean of 0 and a standard deviation of 1.
The dataset contains various columns, including demographic information and test scores for math, reading, and writing. For normalization and standardization, we focus on the three numeric columns: math score, reading score, and writing score.
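In formula form, these are the standard min-max and z-score rescalings, which the mutate() calls below implement directly:
$$x'_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}, \qquad x'_{\text{std}} = \frac{x - \bar{x}}{s}$$
where $\bar{x}$ is the column mean and $s$ its standard deviation.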
# Select the numeric columns
edmdata_numeric <- edmdata |> select(`math.score`, `reading.score`, `writing.score`)
head(edmdata_numeric)
## math.score reading.score writing.score
## 1 72 72 74
## 2 69 90 88
## 3 90 95 93
## 4 47 57 44
## 5 76 78 75
## 6 71 83 78
# Normalize the data (Min-Max Scaling)
edmdata_normalized <- edmdata_numeric |>
mutate(across(everything(), ~ (. - min(.)) / (max(.) - min(.))))
# View the first few rows of the normalized data
head(edmdata_normalized)
## math.score reading.score writing.score
## 1 0.72 0.6626506 0.7111111
## 2 0.69 0.8795181 0.8666667
## 3 0.90 0.9397590 0.9222222
## 4 0.47 0.4819277 0.3777778
## 5 0.76 0.7349398 0.7222222
## 6 0.71 0.7951807 0.7555556
# Standardize the data (Z-Score Scaling)
edmdata_standardized <- edmdata_numeric |>
  mutate(across(everything(), ~ (. - mean(.)) / sd(.)))
# View the first few rows of the standardized data
head(edmdata_standardized)
## math.score reading.score writing.score
## 1 0.3898284 0.1939016 0.3912960
## 2 0.1919795 1.4267621 1.3126119
## 3 1.5769224 1.7692233 1.6416533
## 4 -1.2589131 -0.8334822 -1.5829523
## 5 0.6536271 0.6048551 0.4571043
## 6 0.3238788 0.9473163 0.6545291
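As a quick sanity check (a minimal sketch reusing the edmdata_standardized object created above), each standardized column should now have a mean of approximately 0 and a standard deviation of 1:
# Verify the z-score scaling: means near 0, standard deviations near 1
edmdata_standardized |>
  summarize(across(everything(), list(mean = mean, sd = sd)))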
Converting categorical data into numerical form, using techniques such as one-hot encoding, label encoding, or ordinal encoding, so that it can be used in machine learning algorithms.
In the dataset, several columns contain categorical data, such as gender, race/ethnicity, parental level of education, lunch, and test preparation course. To use these categorical variables in machine learning models, we need to encode them into numeric formats.
There are different encoding techniques; below, the recipes package is used to apply one-hot encoding with step_dummy() and label (integer) encoding with step_integer().
# One-Hot Encoding using step_dummy
recipe_one_hot <- recipe(~ ., data = edmdata) |>
step_dummy(all_nominal(), -all_outcomes())
# Label Encoding using step_integer
recipe_label <- recipe(~ ., data = edmdata) |>
step_integer(all_nominal(), -all_outcomes())
# Prepare the recipes
edmdata_one_hot <- prep(recipe_one_hot) |> juice()
edmdata_label <- prep(recipe_label) |> juice()
# View the first few rows of the encoded data
head(edmdata_one_hot)
## # A tibble: 6 × 15
## math.score reading.score writing.score gender_male race.ethnicity_group.B
## <int> <int> <int> <dbl> <dbl>
## 1 72 72 74 0 1
## 2 69 90 88 0 0
## 3 90 95 93 0 1
## 4 47 57 44 1 0
## 5 76 78 75 1 0
## 6 71 83 78 0 1
## # ℹ 10 more variables: race.ethnicity_group.C <dbl>,
## # race.ethnicity_group.D <dbl>, race.ethnicity_group.E <dbl>,
## # parental.level.of.education_bachelor.s.degree <dbl>,
## # parental.level.of.education_high.school <dbl>,
## # parental.level.of.education_master.s.degree <dbl>,
## # parental.level.of.education_some.college <dbl>,
## # parental.level.of.education_some.high.school <dbl>, lunch_standard <dbl>, …
head(edmdata_label)
## # A tibble: 6 × 8
## gender race.ethnicity parental.level.of.education lunch test.preparation.cou…¹
## <int> <int> <int> <int> <int>
## 1 1 2 2 2 2
## 2 1 3 5 2 1
## 3 1 2 4 2 2
## 4 2 1 1 1 2
## 5 2 3 5 2 2
## 6 1 2 1 2 2
## # ℹ abbreviated name: ¹test.preparation.course
## # ℹ 3 more variables: math.score <int>, reading.score <int>,
## # writing.score <int>
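Ordinal encoding, mentioned above but not shown, maps ordered categories to integers that respect their order. Here is a minimal sketch for parental.level.of.education; the ordering of education_levels is an assumption, arranged from least to most formal education:
# Ordinal encoding: map parental education to ordered integers (1 = lowest level)
education_levels <- c("some high school", "high school", "some college",
                      "associate's degree", "bachelor's degree", "master's degree")
edmdata_ordinal <- edmdata |>
  mutate(parental_education_ordinal =
           as.integer(factor(`parental.level.of.education`, levels = education_levels)))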
Summarizing data by combining values, such as calculating averages, sums, or counts for specific groups within the data.
Aggregation involves summarizing or combining data to produce a single value per group. For this dataset, we can calculate summary statistics such as the mean, median, sum, or count for particular groupings.
Suppose we want to calculate the average math, reading, and writing scores for each group based on gender and race/ethnicity.
# Aggregation: Calculate the mean of math, reading, and writing scores by gender and race/ethnicity
edmdata_aggregated <- edmdata |>
  group_by(gender, `race.ethnicity`) |>
  summarize(
    mean_math_score = mean(`math.score`, na.rm = TRUE),
    mean_reading_score = mean(`reading.score`, na.rm = TRUE),
    mean_writing_score = mean(`writing.score`, na.rm = TRUE),
    count = n()
  ) |>
  ungroup()
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
# View the aggregated data
print(edmdata_aggregated)
## # A tibble: 10 × 6
## gender race.ethnicity mean_math_score mean_reading_score mean_writing_score
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 female group A 58.5 69 67.9
## 2 female group B 61.4 71.1 70.0
## 3 female group C 62.0 71.9 71.8
## 4 female group D 65.2 74.0 75.0
## 5 female group E 70.8 75.8 75.5
## 6 male group A 63.7 61.7 59.2
## 7 male group B 65.9 62.8 60.2
## 8 male group C 67.6 65.4 62.7
## 9 male group D 69.4 66.1 65.4
## 10 male group E 76.7 70.3 67.4
## # ℹ 1 more variable: count <int>
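The same idea extends to other groupings; as a minimal sketch, across() can summarize every score column at once, here grouped by test preparation course instead of gender and race/ethnicity:
# Aggregation with across(): mean of each *.score column per test preparation group
edmdata |>
  group_by(`test.preparation.course`) |>
  summarize(across(ends_with("score"), ~ mean(.x, na.rm = TRUE)), count = n(), .groups = "drop")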
Grouping continuous data into discrete bins or intervals, often to reduce noise or to make the data more manageable.
Binning (or discretization) is the process of converting continuous data into discrete intervals or “bins.” This can be useful for simplifying models or visualizations, or when a categorical representation of numeric data is more appropriate.
Let’s say we want to create bins for the math score in the dataset, categorizing them into three bins: “Low”, “Medium”, and “High”.
# Binning: Create bins for math score
edmdata_binned <- edmdata |>
mutate(math_score_bin = case_when(
`math.score` < 60 ~ "Low",
`math.score` >= 60 & `math.score` < 80 ~ "Medium",
`math.score` >= 80 ~ "High"
))
# View the first few rows with the new binned column
head(edmdata_binned)
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score math_score_bin
## 1 none 72 72 74 Medium
## 2 completed 69 90 88 Medium
## 3 none 90 95 93 High
## 4 none 47 57 44 Low
## 5 none 76 78 75 Medium
## 6 none 71 83 78 Medium
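Base R's cut() offers an equivalent way to bin with explicit break points and labels; this minimal sketch reproduces the same three intervals as the case_when() version above:
# Binning with cut(): < 60 = Low, 60-79 = Medium, >= 80 = High (right = FALSE makes intervals left-closed)
edmdata_binned_cut <- edmdata |>
  mutate(math_score_bin = cut(`math.score`,
                              breaks = c(-Inf, 60, 80, Inf),
                              labels = c("Low", "Medium", "High"),
                              right = FALSE))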
Applying mathematical functions (e.g., logarithmic, square root, or exponential transformations) to change the scale or distribution of features.
Feature transformation involves modifying the features in a dataset to improve the performance of machine learning models. This can include mathematical transformations like logarithmic, square root, or polynomial transformations, as well as more domain-specific transformations.
In this example, we’ll apply a logarithmic transformation and a polynomial transformation to the math score, reading score, and writing score columns.
# Select the numeric columns to transform
edmdata_numeric <- edmdata |> select(`math.score`, `reading.score`, `writing.score`)
# Logarithmic Transformation
recipe_log <- recipe(~ ., data = edmdata_numeric) |>
  step_log(all_numeric(), offset = 1) # offset = 1 takes log(x + 1), avoiding log(0)
edmdata_log_transformed <- prep(recipe_log) |> juice()
# View the first few rows of the transformed data
head(edmdata_log_transformed)
## # A tibble: 6 × 3
## math.score reading.score writing.score
## <dbl> <dbl> <dbl>
## 1 4.29 4.29 4.32
## 2 4.25 4.51 4.49
## 3 4.51 4.56 4.54
## 4 3.87 4.06 3.81
## 5 4.34 4.37 4.33
## 6 4.28 4.43 4.37
# Polynomial Transformation (Square)
recipe_poly <- recipe(~ ., data = edmdata_numeric) |>
step_poly(all_numeric(), degree = 2)
edmdata_poly_transformed <- prep(recipe_poly) |> juice()
# View the first few rows of the transformed data
head(edmdata_poly_transformed)
## # A tibble: 6 × 6
## math.score_poly_1 math.score_poly_2 reading.score_poly_1 reading.score_poly_2
## <dbl> <dbl> <dbl> <dbl>
## 1 0.0123 -0.0158 0.00613 -0.0212
## 2 0.00607 -0.0194 0.0451 0.0326
## 3 0.0499 0.0412 0.0560 0.0601
## 4 -0.0398 0.00504 -0.0264 -0.0121
## 5 0.0207 -0.00834 0.0191 -0.0111
## 6 0.0102 -0.0172 0.0300 0.00333
## # ℹ 2 more variables: writing.score_poly_1 <dbl>, writing.score_poly_2 <dbl>
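A square-root transformation, also mentioned above, can be applied in the same spirit; this minimal sketch uses mutate() directly on the numeric columns rather than a recipe step:
# Square-root transformation of the three score columns
edmdata_sqrt_transformed <- edmdata_numeric |>
  mutate(across(everything(), sqrt))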
Removing or imputing missing values, outliers, or irrelevant data that could negatively impact the analysis.
Filtering and cleaning data are essential steps in preparing a dataset for analysis or modeling. These steps may involve removing or imputing missing values, filtering out irrelevant data, or correcting data quality issues.
Let’s demonstrate how to filter the data based on specific criteria and clean it by handling missing values.
# Step 1: Filter the data
# Example: Filter out students who have a math score less than 50
edmdata_filtered <- edmdata |>
filter(`math.score` >= 50)
# Step 2: Cleaning the data
# Example: Handle missing values by replacing them with the median of the respective columns
edmdata_cleaned <- edmdata_filtered |>
mutate(across(c(`math.score`, `reading.score`, `writing.score`),
~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Step 3: Additional cleaning (optional)
# Example: Remove any duplicate rows if they exist
edmdata_final <- edmdata_cleaned |>
distinct()
# View the first few rows of the cleaned data
head(edmdata_final)
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group C some college standard
## 5 female group B associate's degree standard
## 6 female group B some college standard
## test.preparation.course math.score reading.score writing.score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 76 78 75
## 5 none 71 83 78
## 6 completed 88 95 92
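As a final check (a minimal sketch reusing edmdata_final from above), we can confirm that no missing values remain and illustrate the outlier handling mentioned earlier with the common 1.5 × IQR rule on math scores:
# Count remaining NAs per column
edmdata_final |>
  summarize(across(everything(), ~ sum(is.na(.))))
# Drop math-score outliers outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
math_bounds <- quantile(edmdata_final$math.score, c(0.25, 0.75)) +
  c(-1.5, 1.5) * IQR(edmdata_final$math.score)
edmdata_no_outliers <- edmdata_final |>
  filter(`math.score` >= math_bounds[1], `math.score` <= math_bounds[2])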