Data preprocessing may significantly influence the statistical conclusions based on the data.
“Garbage In, Garbage Out” is a famous saying used to emphasise the importance of input data quality. By preprocessing the data, we minimise the garbage that goes into our analysis, and hence the amount of garbage that our analyses and models produce.
A data set can be stored on a computer or hosted online in different file formats. We need to get the data set into R by importing it from other data sources (e.g., .txt, .xls and .csv files, or databases) or by scraping it from the web. R provides many useful commands to import (and also export) data sets in different file formats.
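As a quick illustration, the sketch below imports a few common formats in base R; the file names are hypothetical placeholders, and the readxl package is just one of several options for Excel files.

```r
# Importing data in R; "survey.csv", "scores.txt" and "budget.xlsx"
# are hypothetical file names used for illustration
survey <- read.csv("survey.csv", header = TRUE, stringsAsFactors = FALSE)

# Tab-delimited text files can be read with read.table()
scores <- read.table("scores.txt", header = TRUE, sep = "\t")

# The readxl package handles .xls/.xlsx files
library(readxl)
budget <- read_excel("budget.xlsx", sheet = 1)

# Exporting works the same way in reverse
write.csv(survey, "survey_clean.csv", row.names = FALSE)
```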
We cannot perform any type of data preprocessing without understanding what we have in hand. In this step, we will check the data volume (i.e., the dimensions of the data) and its structure, examine the variables/attributes in the data set, and work out the meaning of each level/value of those variables.
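For example, a first inspection in R might look like the following; the snippet uses the built-in iris data so that it runs as-is.

```r
# Quick first look at a data set
data(iris)

dim(iris)             # data volume: number of rows and columns
str(iris)             # structure: variable names, types, first values
head(iris)            # first six observations
summary(iris)         # per-variable summaries (min, quartiles, mean, counts)
levels(iris$Species)  # the levels of a factor variable
```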
In this step, we will apply several important tasks to tidy up messy data sets, following Hadley Wickham’s “Tidy Data” principles (Wickham 2014):

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
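As an illustration (the toy data below is an assumption, not taken from the text), the tidyr sketch reshapes a wide table in which one variable, cases, is spread across year columns, violating the one-column-per-variable principle.

```r
library(tidyr)

# Hypothetical untidy data: "cases" is spread across two year columns
untidy <- data.frame(country = c("AF", "BR"),
                     `2019`  = c(745, 37737),
                     `2020`  = c(2666, 80488),
                     check.names = FALSE)

# pivot_longer() gathers the year columns into a year/cases pair,
# so each row becomes a single country-year observation
tidy <- pivot_longer(untidy, cols = c("2019", "2020"),
                     names_to = "year", values_to = "cases")
tidy
```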
We may also need to manipulate the data, e.g., filtering, arranging, selecting, and subsetting/splitting it, or generating new variables from existing ones.
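A minimal sketch of these verbs with the dplyr package, again using the built-in iris data:

```r
library(dplyr)

iris %>%
  filter(Species == "versicolor") %>%               # keep matching rows
  select(Sepal.Length, Sepal.Width, Species) %>%    # keep some columns
  arrange(desc(Sepal.Length)) %>%                   # sort rows
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width)  # generate a new variable

# split() divides the data into a list of subsets, one per species
by_species <- split(iris, iris$Species)
```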
This step will include checking the plausibility of values, cleaning the data of obvious errors, identifying and handling outliers, and dealing with missing values.
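As a rough illustration, the checks below use the built-in airquality data, which contains genuine missing values; the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```r
data(airquality)

colSums(is.na(airquality))   # count missing values per variable
summary(airquality$Ozone)    # range check for implausible values

# Flag values beyond 1.5 * IQR from the quartiles
q   <- quantile(airquality$Ozone, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
outliers <- airquality$Ozone < q[1] - 1.5 * iqr |
            airquality$Ozone > q[2] + 1.5 * iqr
which(outliers)              # positions of the flagged observations

# One simple handling strategy: drop incomplete rows
complete_only <- na.omit(airquality)
```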
Some statistical analysis methods are sensitive to the scale of the variables, so it may be necessary to apply transformations before using them. In this step we will introduce well-known data transformations along with scaling, centering, standardising and normalising methods.
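For instance, the sketch below applies a log transform, centering, standardisation and a hand-written min-max normalisation to the built-in trees data.

```r
data(trees)

log_vol <- log(trees$Volume)   # log transform for right-skewed data

# scale() handles centering and standardising (z-scores)
centered     <- scale(trees$Height, center = TRUE, scale = FALSE)
standardised <- scale(trees$Height)   # (x - mean) / sd

# Min-max normalisation to the [0, 1] range, written by hand
min_max <- (trees$Girth - min(trees$Girth)) /
           (max(trees$Girth) - min(trees$Girth))
```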