About this document

After loading a dataset, Exploratory Data Analysis (EDA) is often the next step of data analysis. The basic idea of EDA is to quickly analyse and investigate data sets and summarise their main characteristics, often employing data visualisation methods.

As presented by IBM (2021), EDA helps…

“…data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. EDA is primarily used to see what data can reveal beyond the formal modelling or hypothesis testing task and provides a provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate. The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, as well as better understand patterns within the data, detect outliers or anomalous events, find interesting relations among the variables”.

We prepared a really simple EDA example with R and RStudio using key World Development Indicators data. This full code is available in the public repository at https://github.com/movimentar/EDA.

Initial steps

The first step is to load the required packages. The tidyverse (see more at: https://www.tidyverse.org/packages/) and the DataExplorer (see more at: https://www.rdocumentation.org/packages/DataExplorer/versions/0.8.2) will be important for this example of basic EDA. We will also use the readxl package, that allows us to read Microsoft Excel files. We will also use the package DT to display better our data tables (see more at: https://rstudio.github.io/DT/).

If you do not have them installed yet, you will need to run the following code in the R console.

# Install packages (only required if you did not run this before)
install.packages("tidyverse")
install.packages("DataExplorer")
install.packages("readxl")
install.packages("DT")

If you already have the run the code above before, you can simply load the packages as follows:

# Load pre-installed packages.
library(readxl)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(DataExplorer)
library(DT)

Data sources

For the purpose of this example, we should use some raw dataset in any format you prefer such as CSV (comma-separated values) or XLSX (Microsoft Excel).

For example, you may download data any free dataset listed at:

You may also connect to a remote or local database such as MySQL or MongoDB, if you prefer. R is very flexible in terms of connections to data sources.

World Development Indicators

For this example we will use an Excel table from the World Bank Indicators (see WV.2 at http://wdi.worldbank.org/table).

For simplification, I first downloaded it manually in Excel (XLS), converted it to XLSX since it was saved as a webpage and R was unable to open it directly. NAs were formatted as “..” and had to be replaced as show in the code below. Additional clean was necessary since the rows 2 and 3 contained metadata, instead of the actual data. Here we are also transforming the country variable to factor and all character data to numeric, as we only have that type in the dataset.

# Load data
poverty <- read_excel(
  "WV.2_Global_goals_ending_poverty_and_improving_lives.xlsx", 
  # Specify format of NA observations
    na = "..") 
## New names:
## * `` -> ...1
## * `` -> ...11

Cleaning data

The next step is to clean the raw data and make it tidy (see more at: https://r4ds.had.co.nz/tidy-data.html). It is important to do it in a coded way, as it can affect the results.

# Clean dataset
poverty <-  poverty %>% 
  # Renaming columns without names
  rename(
    country = `...1` 
  ) %>% 
  # Renaming columns with merged headers
  rename(
    male_employment = `Vulnerable employment`,
    female_employment = `...11` 
  ) %>%
  # Remove observations in country with NA
  filter(
    !is.na(country)
  ) %>% 
  # Adjusting classes of variables 
  mutate(
    country = as.factor(country)
  ) %>% 
  mutate_if(
    is.character,
    as.numeric
  ) %>% 
  # Simplify variable names
  rename(
    income_share = `Percentage share of income or consumption`,
    child_malnutrition = `Prevalence of child malnutrition`,
    maternal_mortality = `Maternal mortality ratio`,
    u5_mortality = `Under-five mortality rate`,
    hiv_rate = `Incidence of HIV, ages 15-49 (per 1,000 uninfected population ages 15-49)`,
    tuberculosis_rate = `Incidence of tuberculosis`,
    road_mortality = `Mortality caused by road traffic injury`,
    primary_completion = `Primary completion rate`,
    labor_productivity = `Labor productivity`
  ) 

# Remove metadata and rows with all NAs
poverty <- 
  poverty[rowSums(is.na(poverty)) != ncol(poverty)-1, ]

Draw a reproducible simple sample

In order to make sure that we did not miss anything when cleaning data, let us draw a simple random sample of the nrow(poverty) rows of our clean dataset. The code below shows the steps to draw a reproducible simple random sample of 30 rows from the dataset. In order to make that random sample a reproducible one, we will draw the sample after calling the set.seed() function (please see more at: https://movimentar.eu/reproducible-random-sample-generator-in-shinyapps-for-improved-data-and-learning/ and https://r-coder.com/set-seed-r/).

From the table below, the quality of the sampled data looks good. However, we should look deeper to make sure that all is indeed fine.

# Make the sample a reproducible one
set.seed(1234)
# Draw a sample of 30 rows 
poverty_sample <- poverty %>% 
  sample_n(size = 30)
# View sample
  poverty_sample %>% 
  # Ease interactive visualisation
  DT::datatable()

Inspecting the full dataset

The full, clean dataset has 223 rows and 12 columns. Let us have a basic description of the dataset.

# View basic description of full data
plot_intro(poverty)

# View full dataset
  poverty %>% 
  # Ease interactive visualisation
  DT::datatable()

The code below helps us checking for missing data. The colours indicate the number of missing data in each variable. The bars and colours indicate the number of missing data (e.g. hiv_rate has the highest number of rows with missing data).

plot_missing(poverty)

Now, we can also check for the histograms and density plots of all continuous variables.

## View histogram of all continuous variables 
plot_histogram(poverty)

## View histogram of all continuous variables 
plot_density(poverty)

The plot below shows the correlation among the dataset variables after dropping NAs (required). Here we can see, for example, that female and male employment are highly and positively correlated with each other, while primary completing rates are negatively correlated with the under-5 child mortality. In any case, one always need to be very careful to infer causation from the correlation between variables. For example, under-5 child mortality shows a high positive correlation with maternal mortality, but also with male and female employment. That does not really mean that higher employment rates are causing an increase in under-5 child mortality, for example.

## View overall correlation heat map
poverty %>% 
  # Drops the variable "country" and observations with NAs (required)
  select(-country) %>% 
  na.omit() %>% 
  plot_correlation()

Depending on your dataset, you may also benefit to run already from the start the create_report() function from the DataExplorer package, as show below. This will create a nice HTML report which you can use as a guide for further EDA. We are not doing it here as it would generate a separate report.

# Generate automated report
DataExplorer::create_report(poverty)

Conclusions

This is just a very basic EDA which we could expand considerably more. The main purpose of this text was to do a quick demonstration of EDA and raise interest among those who are usually working with spreadsheet applications. Digitalisation requires the transition from a user to, at least, a bit of a programmer. EDA can contribute to ease and improve the process of dealing with the increasing amounts of data that is arising in the era of digitalisation. This can help to design and implement projects, programmes and policies for sustainable development which are oriented by evidence/data and results.

Visit our website at https://www.movimentar.eu to know more about how we are supporting this process!

References