Data preprocessing is the transformation of raw data into a clean, understandable form that is suitable for machine learning and data science analysis. If the data contains a lot of irrelevant or redundant information, or is noisy and unreliable, knowledge discovery during the training phase becomes more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data preprocessing includes cleaning, instance selection, normalization, transformation, feature extraction and selection, among others. The product of data preprocessing is the final training set (Source: Wikipedia). Here, we cover some of the basic techniques for preprocessing data in R.
The data can be downloaded from my GitHub account:
Dataset: GitHub link
First, set your working directory to the location where the Data.csv file is present. Then, load the CSV file.
# Load the dataset into a data frame
dataset <- read.csv('Data.csv')
# Render the data frame as an HTML table
library(hwriter)
cat(hwrite(dataset, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age | Salary | Purchased |
|---------|-----|--------|-----------|
| France  | 44  | 72000  | No        |
| Spain   | 27  | 48000  | Yes       |
| Germany | 30  | 54000  | No        |
| Spain   | 38  | 61000  | No        |
| Germany | 40  |        | Yes       |
| France  | 35  | 58000  | Yes       |
| Spain   |     | 52000  | No        |
| France  | 48  | 79000  | Yes       |
| Germany | 50  | 83000  | No        |
| France  | 37  | 67000  | Yes       |
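Before imputing, it helps to confirm exactly which cells are missing. A quick check (my addition, not in the original post), assuming the blank cells in Data.csv are read in as NA:

# Count missing values in each column
colSums(is.na(dataset))           # Age: 1, Salary: 1
# Locate the rows with missing entries
which(is.na(dataset$Age))         # row 7
which(is.na(dataset$Salary))      # row 5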
We can replace the missing data with the mean of all the values present in that column (the missing age in the 7th row and the missing salary in the 5th row).
# Impute missing Age values with the column mean
dataset$Age <- ifelse(is.na(dataset$Age),
                      ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                      dataset$Age)
# Impute missing Salary values with the column mean
dataset$Salary <- ifelse(is.na(dataset$Salary),
                         ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$Salary)
cat(hwrite(dataset, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age      | Salary   | Purchased |
|---------|----------|----------|-----------|
| France  | 44.00000 | 72000.00 | No        |
| Spain   | 27.00000 | 48000.00 | Yes       |
| Germany | 30.00000 | 54000.00 | No        |
| Spain   | 38.00000 | 61000.00 | No        |
| Germany | 40.00000 | 63777.78 | Yes       |
| France  | 35.00000 | 58000.00 | Yes       |
| Spain   | 38.77778 | 52000.00 | No        |
| France  | 48.00000 | 79000.00 | Yes       |
| Germany | 50.00000 | 83000.00 | No        |
| France  | 37.00000 | 67000.00 | Yes       |
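As a quick sanity check (my addition), the imputed entries are exactly the column means over the nine observed values:

mean(c(44, 27, 30, 38, 40, 35, 48, 50, 37))
# [1] 38.77778
mean(c(72000, 48000, 54000, 61000, 58000, 52000, 79000, 83000, 67000))
# [1] 63777.78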
Next, we encode the text data as numbers. Unlike in Python, where categorical variables typically need one-hot encoding, in R we can simply use the factor() function (an explicit one-hot alternative is sketched after the table below).
# Encode Country as a factor with numeric labels
dataset$Country <- factor(dataset$Country,
                          levels = c('France', 'Germany', 'Spain'),
                          labels = c(1, 2, 3))
# Encode Purchased as a factor with numeric labels
dataset$Purchased <- factor(dataset$Purchased,
                            levels = c('No', 'Yes'),
                            labels = c(0, 1))
cat(hwrite(dataset, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age      | Salary   | Purchased |
|---------|----------|----------|-----------|
| 1       | 44.00000 | 72000.00 | 0         |
| 3       | 27.00000 | 48000.00 | 1         |
| 2       | 30.00000 | 54000.00 | 0         |
| 3       | 38.00000 | 61000.00 | 0         |
| 2       | 40.00000 | 63777.78 | 1         |
| 1       | 35.00000 | 58000.00 | 1         |
| 3       | 38.77778 | 52000.00 | 0         |
| 1       | 48.00000 | 79000.00 | 1         |
| 2       | 50.00000 | 83000.00 | 0         |
| 1       | 37.00000 | 67000.00 | 1         |
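Most R modeling functions expand factors into dummy variables internally, which is why factor() is usually enough. If a model does require explicit one-hot columns, base R's model.matrix() can build them; a minimal sketch (my addition, not in the original post):

# Expand Country into explicit dummy (one-hot) columns;
# "- 1" drops the intercept so each level gets its own column
one_hot <- model.matrix(~ Country - 1, data = dataset)
head(one_hot)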
Next, we split the dataset into a training set and a test set. sample.split() from the caTools package preserves the ratio of the Purchased classes in both sets.

library(caTools)
set.seed(144)  # fix the random seed for reproducibility
split <- sample.split(dataset$Purchased, SplitRatio = 0.8)  # 80% for training
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
cat(hwrite(training_set, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age      | Salary   | Purchased |
|---------|----------|----------|-----------|
| 1       | 44.00000 | 72000.00 | 0         |
| 3       | 27.00000 | 48000.00 | 1         |
| 2       | 30.00000 | 54000.00 | 0         |
| 2       | 40.00000 | 63777.78 | 1         |
| 1       | 35.00000 | 58000.00 | 1         |
| 3       | 38.77778 | 52000.00 | 0         |
| 1       | 48.00000 | 79000.00 | 1         |
| 2       | 50.00000 | 83000.00 | 0         |
cat(hwrite(test_set, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age | Salary | Purchased |
|---------|-----|--------|-----------|
| 3       | 38  | 61000  | 0         |
| 1       | 37  | 67000  | 1         |
Since the range of values in raw data varies widely, the objective functions of some machine learning algorithms do not work properly without normalization. For example, many classifiers compute the distance between two points using the Euclidean distance. If one feature has a broad range of values, the distance will be dominated by that feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
Another reason feature scaling is applied is that gradient descent converges much faster with it than without it (Source: Wikipedia, Feature scaling).
# Standardize Age and Salary (columns 2 and 3) to zero mean and unit variance
training_set[, 2:3] <- scale(training_set[, 2:3])
# Note: this scales the test set with its own statistics; see the note after the tables below
test_set[, 2:3] <- scale(test_set[, 2:3])
cat(hwrite(training_set, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age         | Salary       | Purchased |
|---------|-------------|--------------|-----------|
| 1       | 0.59898639  | 0.636098603  | 0         |
| 3       | -1.47795225 | -1.208160434 | 1         |
| 2       | -1.11143367 | -0.747095675 | 0         |
| 2       | 0.11029494  | 0.004269118  | 1         |
| 1       | -0.50056936 | -0.439719169 | 1         |
| 3       | -0.03902744 | -0.900783928 | 0         |
| 1       | 1.08767784  | 1.174007489  | 1         |
| 2       | 1.33202356  | 1.481383995  | 0         |
cat(hwrite(test_set, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age        | Salary     | Purchased |
|---------|------------|------------|-----------|
| 3       | 0.7071068  | -0.7071068 | 0         |
| 1       | -0.7071068 | 0.7071068  | 1         |
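One caveat worth noting: scale() above standardizes the test set with its own mean and standard deviation, which is why the two test rows come out as exactly +/-0.7071068 (for any two points, (x - mean) / sd is always +/-1/sqrt(2)). The more common practice is to reuse the center and scale learned from the training set, so both sets live on the same scale. A minimal sketch of that variant (my addition, starting again from the unscaled split):

# Scale the training set and keep the parameters it learned
train_scaled <- scale(training_set[, 2:3])
training_set[, 2:3] <- train_scaled
# Apply the training set's center and scale to the test set
test_set[, 2:3] <- scale(test_set[, 2:3],
                         center = attr(train_scaled, "scaled:center"),
                         scale  = attr(train_scaled, "scaled:scale"))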