Introduction

Data preprocessing is the transformation of raw data into a clean, understandable form that is suitable for machine learning and data science analysis. If the data contain a lot of irrelevant, redundant, noisy, or unreliable information, knowledge discovery during the training phase becomes more difficult. Data preparation and filtering can take a considerable amount of processing time. Data preprocessing includes cleaning, instance selection, normalization, transformation, feature extraction and selection, etc. The product of data preprocessing is the final training set (Source: Wikipedia). Here, we cover some of the basic techniques for preprocessing data in R.

The data can be downloaded from my GitHub account:

Dataset: GitHub link

Loading the data

First, set your working directory to the folder that contains the Data.csv file. Then, load the CSV file.

# Read the CSV file into a data frame
dataset <- read.csv('Data.csv')
# hwriter renders the data frame as an HTML table
library(hwriter)
cat(hwrite(dataset, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
Country Age Salary Purchased
France 44 72000 No
Spain 27 48000 Yes
Germany 30 54000 No
Spain 38 61000 No
Germany 40 NA Yes
France 35 58000 Yes
Spain NA 52000 No
France 48 79000 Yes
Germany 50 83000 No
France 37 67000 Yes
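
Before filling anything in, it can help to confirm where the gaps are. The following is a minimal sketch that recreates the table above inline (so it runs without Data.csv) and counts the missing values. It assumes that the missing cells are stored as empty fields in the CSV file, which read.csv turns into NA for numeric columns; the names csv_text, na_counts and rows_with_na are only illustrative.

```r
# Recreate the table above inline so the example runs without Data.csv.
# Assumption: missing cells are empty fields in the file.
csv_text <- "Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes"
dataset <- read.csv(text = csv_text)

# Column types as read.csv inferred them
str(dataset)

# Number of missing values per column
na_counts <- colSums(is.na(dataset))
print(na_counts)

# Row numbers that contain at least one missing value
rows_with_na <- which(!complete.cases(dataset))
print(rows_with_na)  # 5 7
```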

Missing Data?

In this dataset, the age is missing in the 7th row and the salary is missing in the 5th row. We can replace each missing value with the mean of the values that are present in the same column.

# Replace each NA with the column mean computed from the observed values
dataset$Age <- ifelse(is.na(dataset$Age),
                      ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                      dataset$Age)
dataset$Salary <- ifelse(is.na(dataset$Salary),
                         ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$Salary)
cat(hwrite(dataset, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
Country Age Salary Purchased
France 44.00000 72000.00 No
Spain 27.00000 48000.00 Yes
Germany 30.00000 54000.00 No
Spain 38.00000 61000.00 No
Germany 40.00000 63777.78 Yes
France 35.00000 58000.00 Yes
Spain 38.77778 52000.00 No
France 48.00000 79000.00 Yes
Germany 50.00000 83000.00 No
France 37.00000 67000.00 Yes
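
The two filled-in values can be checked by hand: the nine observed ages sum to 349 and the nine observed salaries sum to 574000. As a rough sketch, the same imputation can also be written as a small reusable function; impute_mean is a hypothetical helper name, not part of any package.

```r
# Observed values from the table above (NA marks the missing entry)
age    <- c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)
salary <- c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000)

# Hypothetical helper: replace NAs in a numeric vector with the mean
# of the values that are present
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

age_filled    <- impute_mean(age)
salary_filled <- impute_mean(salary)

round(age_filled[7], 5)     # 349 / 9    = 38.77778
round(salary_filled[5], 2)  # 574000 / 9 = 63777.78
```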

Encoding categorical data

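We now encode the text columns as categories. In Python with scikit-learn, categorical columns are usually one-hot encoded explicitly; in R, we can simply use the factor() function, and most modelling functions expand factors into indicator variables where needed. Note that the labels 1, 2 and 3 are category names, not numeric quantities.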

dataset$Country <- factor(dataset$Country,
                          levels = c('France', 'Germany', 'Spain'),
                          labels = c(1, 2, 3))
dataset$Purchased <- factor(dataset$Purchased,
                            levels = c('No', 'Yes'),
                            labels = c(0, 1))
cat(hwrite(dataset, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
Country Age Salary Purchased
1 44.00000 72000.00 0
3 27.00000 48000.00 1
2 30.00000 54000.00 0
3 38.00000 61000.00 0
2 40.00000 63777.78 1
1 35.00000 58000.00 1
3 38.77778 52000.00 0
1 48.00000 79000.00 1
2 50.00000 83000.00 0
1 37.00000 67000.00 1
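
To see what factor() actually stores, the sketch below rebuilds the Country column on its own and inspects it. It also uses model.matrix() from base R to show the indicator (dummy) columns that a modelling function would typically derive from such a factor. The variable names country and dummies are illustrative only.

```r
# The Country column from the table above, encoded as in the tutorial
country <- factor(c('France', 'Spain', 'Germany', 'Spain', 'Germany',
                    'France', 'Spain', 'France', 'Germany', 'France'),
                  levels = c('France', 'Germany', 'Spain'),
                  labels = c(1, 2, 3))

# The labels are stored as text, so the column is not numeric
levels(country)      # "1" "2" "3"
is.numeric(country)  # FALSE

# Underlying integer codes, one per row
as.integer(country)

# Indicator columns built from the factor; the first level acts as the
# baseline and gets no column of its own
dummies <- model.matrix(~ country)
head(dummies, 3)
```

Because the labels are only names, arithmetic on them is not meaningful; any ordering implied by 1, 2, 3 is an artifact of the labelling.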

Splitting the dataset into the Training set and Test set

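library(caTools)
# Fix the random seed so the split is reproducible
set.seed(144)
# sample.split returns a logical vector: TRUE marks the rows (80%) for training
split <- sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)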
cat(hwrite(training_set, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
Country Age Salary Purchased
1 44.00000 72000.00 0
3 27.00000 48000.00 1
2 30.00000 54000.00 0
2 40.00000 63777.78 1
1 35.00000 58000.00 1
3 38.77778 52000.00 0
1 48.00000 79000.00 1
2 50.00000 83000.00 0
cat(hwrite(test_set, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
Country Age Salary Purchased
3 38 61000 0
1 37 67000 1
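
sample.split keeps the ratio of the labels roughly the same in both sets, which is why the test set above holds one row of each class. The idea can be sketched in base R as follows. This is an illustrative approximation, not the caTools implementation: stratified_split is a hypothetical helper, and the rows it picks will generally differ from those chosen by sample.split.

```r
# The Purchased column from the encoded table above
purchased <- factor(c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1))

# Hypothetical helper: returns TRUE for rows assigned to the training set,
# sampling the requested ratio within each class separately
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in levels(y)) {
    idx <- which(y == cls)
    n_train <- round(length(idx) * ratio)
    # Sample positions within idx, then map back to row numbers
    picked <- idx[sample.int(length(idx), n_train)]
    in_train[picked] <- TRUE
  }
  in_train
}

set.seed(144)
split <- stratified_split(purchased, ratio = 0.8)

# Each class contributes 4 rows to training and 1 row to testing
table(purchased, split)
```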

Feature Scaling

Since the range of values of raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers (such as k-nearest neighbours) calculate the distance between two points as the Euclidean distance. If one of the features has a broad range of values, as Salary does here compared with Age, the distance will be governed by that feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.

Another reason to apply feature scaling is that gradient descent typically converges much faster with it than without it.

Source: Feature scaling
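
A small numeric illustration of the point about distances, using the first two rows of the dataset. The second half of the sketch checks that scale() with its default arguments performs standardization, z = (x - mean(x)) / sd(x). The variable names and the sample of ages are chosen only for illustration.

```r
# Two rows from the dataset: (Age, Salary)
a <- c(Age = 44, Salary = 72000)
b <- c(Age = 27, Salary = 48000)

# Raw Euclidean distance: the salary gap of 24000 swamps the age gap of 17
raw_dist <- sqrt(sum((a - b)^2))
raw_dist

# Standardization as performed by scale(): z = (x - mean(x)) / sd(x)
age <- c(44, 27, 30, 40, 35, 48, 50)
z_manual <- (age - mean(age)) / sd(age)
z_scale  <- as.numeric(scale(age))
all.equal(z_manual, z_scale)  # TRUE
```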

# Standardize the numeric columns (Age and Salary); the factor columns are left as they are
training_set[, 2:3] <- scale(training_set[, 2:3])
test_set[, 2:3] <- scale(test_set[, 2:3])
cat(hwrite(training_set, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
Country Age Salary Purchased
1 0.59898639 0.636098603 0
3 -1.47795225 -1.208160434 1
2 -1.11143367 -0.747095675 0
2 0.11029494 0.004269118 1
1 -0.50056936 -0.439719169 1
3 -0.03902744 -0.900783928 0
1 1.08767784 1.174007489 1
2 1.33202356 1.481383995 0
cat(hwrite(test_set, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
Country Age Salary Purchased
3 0.7071068 -0.7071068 0
1 -0.7071068 0.7071068 1
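
The code above scales the test set using its own mean and standard deviation. With only two test rows, that always produces the values ±0.7071068 in each column, whatever the original numbers were. A commonly used alternative is to reuse the center and spread computed on the training set, so that both sets are expressed on the same scale. The sketch below shows that variant; it is an alternative to the approach above, not a reproduction of it, and the names train, test, center and spread are illustrative.

```r
# Training and test rows as they stood before scaling (Age and Salary only)
train <- data.frame(
  Age    = c(44, 27, 30, 40, 35, 349 / 9, 48, 50),
  Salary = c(72000, 48000, 54000, 574000 / 9, 58000, 52000, 79000, 83000)
)
test <- data.frame(Age = c(38, 37), Salary = c(61000, 67000))

# Scale the training set and keep the center and spread that scale() computed
train_scaled <- scale(train)
center <- attr(train_scaled, "scaled:center")
spread <- attr(train_scaled, "scaled:scale")

# Apply the same shift and divisor to the test set
test_scaled <- scale(test, center = center, scale = spread)
test_scaled
```

Scaled this way, the test values reflect how far each test row lies from the training mean, which is the scale a model fitted on the training set expects.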