Data preprocessing is the transformation of raw data into a clean, understandable form that is suitable for machine learning and data science analysis. If the data contains a lot of irrelevant or redundant information, or is noisy and unreliable, knowledge discovery during the training phase becomes more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data preprocessing includes cleaning, instance selection, normalization, transformation, feature extraction and selection, among others. The product of data preprocessing is the final training set (Source: Wikipedia). Here, we cover some of the basic techniques for preprocessing data in R.
The data can be downloaded from my GitHub account:
Dataset: GitHub link
First, set your working directory to the location where the Data.csv file is present. Then, load the CSV file.
# Load the dataset into a data frame
dataset <- read.csv('Data.csv')
# Render the data frame as an HTML table
library(hwriter)
cat(hwrite(dataset, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age | Salary | Purchased |
|---------|-----|--------|-----------|
| France  | 44  | 72000  | No        |
| Spain   | 27  | 48000  | Yes       |
| Germany | 30  | 54000  | No        |
| Spain   | 38  | 61000  | No        |
| Germany | 40  |        | Yes       |
| France  | 35  | 58000  | Yes       |
| Spain   |     | 52000  | No        |
| France  | 48  | 79000  | Yes       |
| Germany | 50  | 83000  | No        |
| France  | 37  | 67000  | Yes       |
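Before imputing, it helps to confirm exactly which cells are missing. A quick check (my addition, not in the original post), assuming the blank cells in Data.csv are read in as NA:

# Count missing values in each column
colSums(is.na(dataset))           # Age: 1, Salary: 1
# Locate the rows with missing entries
which(is.na(dataset$Age))         # row 7
which(is.na(dataset$Salary))      # row 5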
We can replace the missing data with the mean of all the values present in that column (the missing age in the 7th row and the missing salary in the 5th row).
# Impute missing Age values with the column mean
dataset$Age <- ifelse(is.na(dataset$Age),
                      ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                      dataset$Age)
# Impute missing Salary values with the column mean
dataset$Salary <- ifelse(is.na(dataset$Salary),
                         ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$Salary)
cat(hwrite(dataset, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age      | Salary   | Purchased |
|---------|----------|----------|-----------|
| France  | 44.00000 | 72000.00 | No        |
| Spain   | 27.00000 | 48000.00 | Yes       |
| Germany | 30.00000 | 54000.00 | No        |
| Spain   | 38.00000 | 61000.00 | No        |
| Germany | 40.00000 | 63777.78 | Yes       |
| France  | 35.00000 | 58000.00 | Yes       |
| Spain   | 38.77778 | 52000.00 | No        |
| France  | 48.00000 | 79000.00 | Yes       |
| Germany | 50.00000 | 83000.00 | No        |
| France  | 37.00000 | 67000.00 | Yes       |
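As a quick sanity check (my addition), the imputed entries are exactly the column means over the nine observed values:

mean(c(44, 27, 30, 38, 40, 35, 48, 50, 37))
# [1] 38.77778
mean(c(72000, 48000, 54000, 61000, 58000, 52000, 79000, 83000, 67000))
# [1] 63777.78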
Next, we encode the text data as numbers. Unlike in Python, where categorical variables typically need one-hot encoding, in R we can simply use the factor() function (an explicit one-hot alternative is sketched after the table below).
# Encode Country as a factor with numeric labels
dataset$Country <- factor(dataset$Country,
                          levels = c('France', 'Germany', 'Spain'),
                          labels = c(1, 2, 3))
# Encode Purchased as a factor with numeric labels
dataset$Purchased <- factor(dataset$Purchased,
                            levels = c('No', 'Yes'),
                            labels = c(0, 1))
cat(hwrite(dataset, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age      | Salary   | Purchased |
|---------|----------|----------|-----------|
| 1       | 44.00000 | 72000.00 | 0         |
| 3       | 27.00000 | 48000.00 | 1         |
| 2       | 30.00000 | 54000.00 | 0         |
| 3       | 38.00000 | 61000.00 | 0         |
| 2       | 40.00000 | 63777.78 | 1         |
| 1       | 35.00000 | 58000.00 | 1         |
| 3       | 38.77778 | 52000.00 | 0         |
| 1       | 48.00000 | 79000.00 | 1         |
| 2       | 50.00000 | 83000.00 | 0         |
| 1       | 37.00000 | 67000.00 | 1         |
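Most R modeling functions expand factors into dummy variables internally, which is why factor() is usually enough. If a model does require explicit one-hot columns, base R's model.matrix() can build them; a minimal sketch (my addition, not in the original post):

# Expand Country into explicit dummy (one-hot) columns;
# "- 1" drops the intercept so each level gets its own column
one_hot <- model.matrix(~ Country - 1, data = dataset)
head(one_hot)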
Next, we split the dataset into a training set and a test set. sample.split() from the caTools package preserves the ratio of the Purchased classes in both sets.

library(caTools)
set.seed(144)  # fix the random seed for reproducibility
split <- sample.split(dataset$Purchased, SplitRatio = 0.8)  # 80% for training
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
cat(hwrite(training_set, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age      | Salary   | Purchased |
|---------|----------|----------|-----------|
| 1       | 44.00000 | 72000.00 | 0         |
| 3       | 27.00000 | 48000.00 | 1         |
| 2       | 30.00000 | 54000.00 | 0         |
| 2       | 40.00000 | 63777.78 | 1         |
| 1       | 35.00000 | 58000.00 | 1         |
| 3       | 38.77778 | 52000.00 | 0         |
| 1       | 48.00000 | 79000.00 | 1         |
| 2       | 50.00000 | 83000.00 | 0         |
cat(hwrite(test_set, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age | Salary | Purchased |
|---------|-----|--------|-----------|
| 3       | 38  | 61000  | 0         |
| 1       | 37  | 67000  | 1         |
Since the range of values in raw data varies widely, the objective functions of some machine learning algorithms do not work properly without normalization. For example, many classifiers compute the distance between two points using the Euclidean distance. If one feature has a broad range of values, the distance will be dominated by that feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
Another reason feature scaling is applied is that gradient descent converges much faster with it than without it (Source: Wikipedia, Feature scaling).
# Standardize Age and Salary (columns 2 and 3) to zero mean and unit variance
training_set[, 2:3] <- scale(training_set[, 2:3])
# Note: this scales the test set with its own statistics; see the note after the tables below
test_set[, 2:3] <- scale(test_set[, 2:3])
cat(hwrite(training_set, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age         | Salary       | Purchased |
|---------|-------------|--------------|-----------|
| 1       | 0.59898639  | 0.636098603  | 0         |
| 3       | -1.47795225 | -1.208160434 | 1         |
| 2       | -1.11143367 | -0.747095675 | 0         |
| 2       | 0.11029494  | 0.004269118  | 1         |
| 1       | -0.50056936 | -0.439719169 | 1         |
| 3       | -0.03902744 | -0.900783928 | 0         |
| 1       | 1.08767784  | 1.174007489  | 1         |
| 2       | 1.33202356  | 1.481383995  | 0         |
cat(hwrite(test_set, border = 1, table.frame='void', width='600px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Country | Age        | Salary     | Purchased |
|---------|------------|------------|-----------|
| 3       | 0.7071068  | -0.7071068 | 0         |
| 1       | -0.7071068 | 0.7071068  | 1         |
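One caveat worth noting: scale() above standardizes the test set with its own mean and standard deviation, which is why the two test rows come out as exactly +/-0.7071068 (for any two points, (x - mean) / sd is always +/-1/sqrt(2)). The more common practice is to reuse the center and scale learned from the training set, so both sets live on the same scale. A minimal sketch of that variant (my addition, starting again from the unscaled split):

# Scale the training set and keep the parameters it learned
train_scaled <- scale(training_set[, 2:3])
training_set[, 2:3] <- train_scaled
# Apply the training set's center and scale to the test set
test_set[, 2:3] <- scale(test_set[, 2:3],
                         center = attr(train_scaled, "scaled:center"),
                         scale  = attr(train_scaled, "scaled:scale"))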