Data cleansing and EDA are important stages of data science. Before we start working with data, we have to go through several obvious but important steps. In our opinion, the most important (and in any case the first) step is to define a question - the question we are looking to answer. A clearly formulated question makes it easier to determine the appropriate dataset and guides data cleaning decisions - what to do with outliers, missing data, and so on. Once we have settled on the question and the ideal dataset, we need to understand what data we can actually get and whether it will help us answer the question. Having decided on all of the above and obtained the data, we move on to the next stage - data cleansing.
Data rarely comes to us in the right form - clearly structured, cleaned, and ready to use. Raw data is difficult to analyze directly: some values may be missing, erroneous, duplicated, and so on. Our goal at this stage is to remove these defects and prepare the data for further use.
A major part of the data cleansing process is looking for oddities, missing values, and other problems that need to be resolved before analysis. When cleaning data, the following principles should be observed: each variable has its own column, and each observation has its own row. A dataset may contain missing values (NA, NaN, a sentinel such as 9999 for a temperature, an empty field, etc.). We may also need to reformat dates, convert units (temperature in Celsius or Fahrenheit), and remove duplicated records. We should also look out for input errors. For example, if we have a column indicating the day of the week, it is worth making sure that the spelling is consistent across all records (upper- and lowercase letters, misspellings); otherwise it may turn out that we have more than seven days of the week. Another check is a sanity check: it is unlikely that a person is taller than three meters or older than 200 years. A few of these checks are sketched in the example below.
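As a minimal sketch of such checks (the small data frame below is invented purely for illustration):
df <- data.frame(
  day    = c("Monday", "monday", "Tuesdy", "Wednesday"),
  temp   = c(21.5, 9999, 23.1, 22.4),       # 9999 used as a missing-value code
  height = c(1.75, 1.82, 3.40, 1.68)        # 3.40 m fails a sanity check
)
colSums(is.na(df))                          # count missing values per column
df$temp[df$temp == 9999] <- NA              # recode the sentinel value as NA
df$day <- tolower(trimws(df$day))           # unify case and spacing
table(df$day)                               # more than seven "days" reveals typos
df <- df[!duplicated(df), ]                 # drop exact duplicate rows
df$height[df$height > 3] <- NA              # sanity check: taller than 3 m is implausible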
Perhaps there are variables (columns) we can get rid of. First, there are redundant variables - for example, a full date alongside a separate year column, or a birth date alongside an age. Second, there are highly correlated variables: if the correlation between two variables is close to one, we can safely drop one of them, since it adds little information, and we reduce the dimensionality.
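For example, on the built-in mtcars data we could flag near-duplicate variables like this (the 0.9 threshold is an arbitrary choice for illustration):
cors <- cor(mtcars)                          # pairwise correlations between all columns
cors[upper.tri(cors, diag = TRUE)] <- NA     # keep each pair only once
high <- which(abs(cors) > 0.9, arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           r    = round(cors[high], 2))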
It’s also worth taking the time to examine the data types present in the set, as well as their key properties. The data type determines what can be done with the data.
There are several statistical data types:
1. Nominal data - categories with no intrinsic order. For example, gender, language, political preference.
2. Ordinal data - data with ordered categories. For example, education levels.
3. Numerical data - amounts and quantities. Price, salary, pressure, etc.
Consider the following case: we have data on plant workers. Each employee has a sequential ID number in the table, a name, a position, a salary, and an academic degree. The ID number is nominal data, as is the job position, while salary is numerical and academic degree is ordinal. Computing the average ID number makes little sense, whereas the average salary is genuinely interesting information. What about the average academic degree? It is not clear that this is of particular interest; it is more informative to calculate the average salary by degree.
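A rough sketch of how these types might be represented in R (the worker data below is made up for illustration):
workers <- data.frame(
  id       = 1:6,                                           # nominal, despite being a number
  position = factor(c("engineer", "operator", "engineer",
                      "manager", "operator", "engineer")),  # nominal
  salary   = c(5200, 3100, 4800, 6100, 3300, 5000),         # numerical
  degree   = factor(c("BSc", "MSc", "BSc", "PhD", "BSc", "MSc"),
                    levels = c("BSc", "MSc", "PhD"), ordered = TRUE)  # ordinal
)
mean(workers$id)                              # computable, but meaningless
tapply(workers$salary, workers$degree, mean)  # average salary by degree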
Let’s turn to data properties:
1. Granularity - describes the level of detail: what does each record in the data represent? Granularity may be coarse or fine. We can always aggregate data from a finer level of detail into a coarser one; going in the opposite direction is much more difficult (see the sketch after this list).
2. Scope - suppose we want to conduct a study covering 2011 to 2019, but our dataset only has data for 2015-2018. Conversely, if the dataset covers more years than we need, we can still use it for our purposes.
3. Temporality - when was the data collected? It makes sense to pay attention to the date format, differences between time zones, and differences between default dates on different operating systems.
4. Faithfulness - can we rely on the data? Watch for non-existent geographic locations, implausible dates, birthday and age fields that do not agree, and odd repeating values.
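As a sketch of the granularity point (the daily measurements below are made up), fine-grained data can be rolled up to a coarser level, but not the other way around:
daily <- data.frame(
  date = as.Date("2019-01-01") + 0:59,              # made-up daily measurements
  temp = round(rnorm(60, mean = 5, sd = 3), 1)
)
daily$month <- format(daily$date, "%Y-%m")
aggregate(temp ~ month, data = daily, FUN = mean)   # coarser, monthly granularity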
We can look at a few rows from the table to get an idea of what data we have.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
We can look at the summary - this gives us an idea of the range of values. It can also reveal problems in the data - for example, a maximum air temperature far above 200 is not a plausible value in either Celsius or Fahrenheit.
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
Cleaning data can take a lot of time and effort, but it is critical: without thorough cleansing, we will come to wrong conclusions.
As we clean the data, it is necessary to document every manipulation, because these choices can significantly affect the further analysis.
Now we can start exploring the data more deeply. The purpose of this stage is to explore the data, so we need to be objective and watchful. At this stage, anomalies missed earlier are often found; in that case, we need to take a step back and fix them.
Information is perceived much more easily in graphical form, so graphical methods are the main tools for understanding the data and the interactions between variables. Visualization and transformation of data help to reveal patterns, problems, and more.
Exploratory graphs are created very quickly, and we usually build many of them for different variables of the dataset. The goal is to understand the dataset: get an idea of what it looks like, what properties it has, what problems it contains, and whether it can be useful to us. On such graphs we can see outliers and other anomalies in the set and get an idea of how we might deal with them.
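For example, a quick one-line histogram already shows the shape of a single variable's distribution:
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg", col = "lightblue")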
A boxplot is another quick exploratory graph: on it we can see the median, the quartiles, and the approximate minimum and maximum, as well as any outliers.
boxplot(mtcars$hp, col = "red")
We can also look at all the variables at once and at the relationships between pairs of variables; such a graph shows approximately which variables correlate with each other.
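One way to build such an overview is a scatterplot matrix, for example:
pairs(mtcars)
The pairwise correlations can also be quantified numerically: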
cor(mtcars$mpg, mtcars$drat)
## [1] 0.6811719
cor(mtcars$vs, mtcars$am)
## [1] 0.1683451
cor(mtcars$mpg, mtcars$disp)
## [1] -0.8475514
And here is an example of a graph showing the relationship between two variables:
plot(mtcars$wt, mtcars$mpg, xlab = "Car Weight", ylab = "Miles Per Gallon", pch = 19)
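To make the trend easier to see, one might overlay a fitted regression line on the same plot:
abline(lm(mpg ~ wt, data = mtcars), col = "red", lwd = 2)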
In practice, analysis is not limited to visualization techniques: tabulation, clustering, and other methods can also be part of exploratory analysis, and even building simple models can belong to this stage.
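For example, a contingency table and a simple linear model (shown here on mtcars) can already summarize relationships that are hard to spot by scanning raw rows:
table(mtcars$cyl, mtcars$gear)             # tabulation: cylinders vs. number of gears
summary(lm(mpg ~ wt + hp, data = mtcars))  # a simple model as part of exploration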
During the analysis it is important not to make premature assumptions about the data and to stay open to whatever might be discovered.
When we have examined the data and decided that it is at least suitable for our purposes, we can proceed to the next stage - modeling.
The main idea of exploratory analysis is that our data is probably not perfect, and the analysis gives us a general sense of the data: what information we can extract from it, which questions we can answer with it, and which we cannot. This rough first pass at the data leads us to the next step.