Life cycle of data
Data wrangling is a very important part of any analysis project (often no less than 50% of the project's time is allocated to it). In this section we will use the tidyverse ecosystem to carry out this task.
The practical dataset used today is Ebola. The data can be downloaded at: https://data.world/brianray/ebola-cases
# Package installation
rm(list = ls())
my_packages <- c("tidyverse", "ggthemes", "rvest", "magrittr", "ggplot2")
install.packages(my_packages, repos = "http://cran.rstudio.com")
## package 'tidyverse' successfully unpacked and MD5 sums checked
## package 'ggthemes' successfully unpacked and MD5 sums checked
## package 'rvest' successfully unpacked and MD5 sums checked
## package 'magrittr' successfully unpacked and MD5 sums checked
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\mlcl.local\AppData\Local\Temp\RtmpqOGtM6\downloaded_packages
# Load some packages for scraping data and data manipulation:
p <- c("rvest", "tidyverse", "magrittr")
lapply(p, library, character.only = TRUE)
## [[1]]
## [1] "rvest" "xml2" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[2]]
## [1] "forcats" "stringr" "dplyr" "purrr" "readr"
## [6] "tidyr" "tibble" "ggplot2" "tidyverse" "rvest"
## [11] "xml2" "stats" "graphics" "grDevices" "utils"
## [16] "datasets" "methods" "base"
##
## [[3]]
## [1] "magrittr" "forcats" "stringr" "dplyr" "purrr"
## [6] "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [11] "rvest" "xml2" "stats" "graphics" "grDevices"
## [16] "utils" "datasets" "methods" "base"
# Loading data
ebola <- read.csv("C:/Users/mlcl.local/Downloads/ebola_data_db_format.csv", header = TRUE, sep = ',')
ebola %>% head(10)
## Indicator
## 1 Cumulative number of confirmed, probable and suspected Ebola cases
## 2 Cumulative number of confirmed Ebola cases
## 3 Cumulative number of probable Ebola cases
## 4 Cumulative number of suspected Ebola cases
## 5 Cumulative number of confirmed, probable and suspected Ebola deaths
## 6 Cumulative number of confirmed Ebola deaths
## 7 Cumulative number of probable Ebola deaths
## 8 Cumulative number of confirmed, probable and suspected Ebola cases
## 9 Cumulative number of confirmed Ebola cases
## 10 Cumulative number of probable Ebola cases
## Country Date value
## 1 Guinea 10/03/2015 3285
## 2 Guinea 10/03/2015 2871
## 3 Guinea 10/03/2015 392
## 4 Guinea 10/03/2015 22
## 5 Guinea 10/03/2015 2170
## 6 Guinea 10/03/2015 1778
## 7 Guinea 10/03/2015 392
## 8 Liberia 10/03/2015 9343
## 9 Liberia 10/03/2015 3150
## 10 Liberia 10/03/2015 1879
The filter() command is used to subset data by row. For example, to extract the rows whose Indicator column is "Cumulative number of confirmed Ebola deaths", we can write:
df <- ebola %>%
filter(Indicator == "Cumulative number of confirmed Ebola deaths")
head(df)
## Indicator Country
## 1 Cumulative number of confirmed Ebola deaths Guinea
## 2 Cumulative number of confirmed Ebola deaths Sierra Leone
## 3 Cumulative number of confirmed Ebola deaths Nigeria
## 4 Cumulative number of confirmed Ebola deaths Senegal
## 5 Cumulative number of confirmed Ebola deaths Spain
## 6 Cumulative number of confirmed Ebola deaths United States of America
## Date value
## 1 10/03/2015 1778
## 2 10/03/2015 3263
## 3 10/03/2015 7
## 4 10/03/2015 0
## 5 10/03/2015 0
## 6 10/03/2015 1
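filter() also accepts several conditions at once. As a small additional sketch (not part of the original analysis), we could restrict the rows to a single country as well:
# Keep only confirmed-death rows for Guinea (illustrative example):
ebola %>%
  filter(Indicator == "Cumulative number of confirmed Ebola deaths",
         Country == "Guinea") %>%
  head()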
Suppose we want to calculate the total number of confirmed Ebola deaths by country. This time, the group_by() command is extremely useful. Note that group_by() only becomes useful when combined with other commands, in this case summarise() with sum():
df %>%
group_by(Country) %>%
summarise(total = sum(value)) ->> ebola_deaths
ebola_deaths %>%
  knitr::kable(col.names = c("Country", "Total of deaths"))

| Country | Total of deaths |
|---|---|
| Guinea | 461211 |
| Guinea 2 | 1 |
| Italy | 0 |
| Liberia | 26721 |
| Liberia 2 | 215 |
| Mali | 21 |
| Nigeria | 1784 |
| Senegal | 0 |
| Sierra Leone | 797984 |
| Spain | 0 |
| United Kingdom | 0 |
| United States of America | 242 |
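group_by() combines with any summary function, not only sum(). As a hedged variation (my addition), and because the value column holds cumulative counts, the per-country maximum and the number of records are other natural summaries:
# Maximum cumulative value and record count per country (illustrative sketch):
df %>%
  group_by(Country) %>%
  summarise(max_value = max(value),
            n_records = n())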
# Or view the first 10 rows of the total number of deaths over time:
df %>%
group_by(Date) %>%
summarise(total = sum(value)) %>%
  head(10)
## # A tibble: 10 x 2
## Date total
## <fct> <dbl>
## 1 1/05/2015 5511
## 2 1/06/2015 5562
## 3 1/07/2015 5613
## 4 1/09/2015 5674
## 5 1/10/2014 2006
## 6 1/10/2015 5679
## 7 1/12/2014 2431
## 8 1/12/2015 5682
## 9 10/02/2015 8452
## 10 10/03/2015 5049
The arrange() command is used to reorder the rows of the data by the values of a column, for example the total column:
# Sort by descending:
ebola_deaths %>%
  arrange(-total)
## # A tibble: 12 x 2
## Country total
## <fct> <dbl>
## 1 Sierra Leone 797984
## 2 Guinea 461211
## 3 Liberia 26721
## 4 Nigeria 1784
## 5 United States of America 242
## 6 Liberia 2 215
## 7 Mali 21
## 8 Guinea 2 1
## 9 Italy 0
## 10 Senegal 0
## 11 Spain 0
## 12 United Kingdom 0
# Sort by ascending:
ebola_deaths %>%
  arrange(total)
## # A tibble: 12 x 2
## Country total
## <fct> <dbl>
## 1 Italy 0
## 2 Senegal 0
## 3 Spain 0
## 4 United Kingdom 0
## 5 Guinea 2 1
## 6 Mali 21
## 7 Liberia 2 215
## 8 United States of America 242
## 9 Nigeria 1784
## 10 Liberia 26721
## 11 Guinea 461211
## 12 Sierra Leone 797984
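As a small aside (not in the original), the desc() helper is an equivalent, arguably more readable way to sort in descending order:
# Equivalent descending sort using desc():
ebola_deaths %>%
  arrange(desc(total))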
The syntax of the apply() command is apply(x, MARGIN, FUN, ...), where:
x is a matrix or an array,
MARGIN specifies the dimension the function is applied over: MARGIN = 1 applies FUN to each row, MARGIN = 2 to each column, and MARGIN = c(1, 2) to both rows and columns,
FUN is the function to apply.
For example, we create a matrix:
# Create a matrix X:
set.seed(1)
X <- matrix(rnorm(30), nrow = 5, ncol = 6)
X
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.6264538 -0.8204684 1.5117812 -0.04493361 0.91897737 -0.05612874
## [2,] 0.1836433 0.4874291 0.3898432 -0.01619026 0.78213630 -0.15579551
## [3,] -0.8356286 0.7383247 -0.6212406 0.94383621 0.07456498 -1.47075238
## [4,] 1.5952808 0.5757814 -2.2146999 0.82122120 -1.98935170 -0.47815006
## [5,] 0.3295078 -0.3053884 1.1249309 0.59390132 0.61982575 0.41794156
# Calculate the average of the columns of this matrix:
apply(X, MARGIN = 2, FUN = mean) %>% head()
## [1] 0.12926990 0.13513567 0.03812297 0.45956697 0.08123054 -0.34857703
# Calculate the mean for the rows
apply(X, 1, mean)
## [1] 0.1471290 0.2785110 -0.1951493 -0.2816530 0.4634532
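For completeness, a short extra sketch (my addition): with MARGIN = c(1, 2) the function is applied element-wise over both rows and columns, here squaring every entry of the matrix:
# Apply a function over both margins, i.e. to each element of X:
apply(X, MARGIN = c(1, 2), FUN = function(v) v^2)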
The lapply() function takes a list, data frame, or vector as input. The result is always a list:
# Data input is a dataframe:
attach(trees)
lapply(trees, mean)
## $Girth
## [1] 13.24839
##
## $Height
## [1] 76
##
## $Volume
## [1] 30.17097
# If the input data is a list:
my_list <- list(trees, iris) # Create a simulation of a list
lapply(my_list, summary)
## [[1]]
## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
##
## [[2]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
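lapply() also works with anonymous functions. As a small additional sketch (not in the original), here we compute the range (maximum minus minimum) of each variable in trees:
# Anonymous function applied to every column of the trees data frame:
lapply(trees, function(x) max(x) - min(x))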
The sapply() command works similarly to lapply(). The difference is that the result is simplified to the simplest possible data structure:
sapply(trees, mean)
## Girth Height Volume
## 13.24839 76.00000 30.17097
# Check the result
is.vector(sapply(trees, mean))
## [1] TRUE
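A further hedged example (my addition): when every element of the result has the same length greater than one, sapply() simplifies the output to a matrix rather than a vector:
# range() returns two numbers per column, so the result is a 2 x 3 matrix:
sapply(trees, range)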
In this part we use a home-loan dataset (HMEQ). This dataset contains many missing values and empty strings…
# Load data from the internet
hmeq <- read.csv("http://www.creditriskanalytics.net/uploads/1/9/5/1/19511601/hmeq.csv")
# Data
head(hmeq)
## BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ
## 1 1 1100 25860 39025 HomeImp Other 10.5 0 0 94.36667 1
## 2 1 1300 70053 68400 HomeImp Other 7.0 0 2 121.83333 0
## 3 1 1500 13500 16700 HomeImp Other 4.0 0 0 149.46667 1
## 4 1 1500 NA NA NA NA NA NA NA
## 5 0 1700 97800 112000 HomeImp Office 3.0 0 0 93.33333 0
## 6 1 1700 30548 40320 HomeImp Other 9.0 0 0 101.46600 1
## CLNO DEBTINC
## 1 9 NA
## 2 14 NA
## 3 10 NA
## 4 NA NA
## 5 14 NA
## 6 8 37.11361
# Function computing the rate of NA observations in each variable:
na_rate <- function(x) {x %>% is.na() %>% sum() / length(x)}
sapply(hmeq, na_rate) %>% round(2)
## BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ
## 0.00 0.00 0.09 0.02 0.00 0.00 0.09 0.12 0.10
## CLAGE NINQ CLNO DEBTINC
## 0.05 0.09 0.04 0.21
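As a side note (an assumption on my part, not in the original text), is.na() applied to a data frame returns a logical matrix, so the same NA rates can be obtained in one line with colMeans():
# Equivalent one-liner for the NA rate of each column:
colMeans(is.na(hmeq)) %>% round(2)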
# Function replacing NA values with the mean:
replace_by_mean <- function(x) {
x[is.na(x)] <- mean(x, na.rm = TRUE)
return(x)
}
# A function that imputes NA observations for categorical variables:
replace_na_categorical <- function(x) {
x %>%
table() %>%
as.data.frame() %>%
arrange(-Freq) -> my_df
n_obs <- sum(my_df$Freq)
pop <- my_df$. %>% as.character()
set.seed(29)
x[is.na(x)] <- sample(pop, sum(is.na(x)), replace = TRUE, prob = my_df$Freq)
return(x)
}
# Relabel the "JOB" variable
name_job <- function(x){
x %<>% as.character()
ELSE <- TRUE
job_name <- c("Mgr", "Office", "Other", "ProfExe", "Sales", "Self")
case_when(!x %in% job_name ~ "Other",
ELSE ~ x)
}
# Relabel the "REASON" variable
name_reason <- function(x){
ELSE <- TRUE
x %<>% as.character()
case_when(!x %in% c("DebtCon", "HomeImp") ~ "Unknown",
ELSE ~ x)
}
# Relabel the "BAD" variable
label_rename <- function(x){
case_when(x==1 ~ "BAD",
x==0 ~ "GOOD")
}
# Use these functions:
df <- hmeq %>%
mutate_if(is.factor, as.character) %>%
mutate(REASON = case_when(REASON == "" ~ NA_character_, TRUE ~ REASON),
JOB = case_when(JOB == "" ~ NA_character_, TRUE ~ JOB)) %>%
mutate_if(is_character, as.factor) %>%
mutate_if(is.numeric, replace_by_mean) %>%
mutate_if(is.factor, replace_na_categorical) %>%
mutate_at("REASON", name_reason) %>%
mutate_at("JOB", name_job) %>%
mutate(BAD = label_rename(BAD))
# See the data after filling in
head(df, n = 20) %>% knitr::kable(col.names = c("BAD", "LOAN", "MORTDUE", "VALUE", "REASON", "JOB",
                                                "YOJ", "DEROG", "DELINQ", "CLAGE", "NINQ", "CLNO", "DEBTINC"))

| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BAD | 1100 | 25860.00 | 39025 | HomeImp | Other | 10.500000 | 0.0000000 | 0.0000000 | 94.36667 | 1.000000 | 9.0000 | 33.779915 |
| BAD | 1300 | 70053.00 | 68400 | HomeImp | Other | 7.000000 | 0.0000000 | 2.0000000 | 121.83333 | 0.000000 | 14.0000 | 33.779915 |
| BAD | 1500 | 13500.00 | 16700 | HomeImp | Other | 4.000000 | 0.0000000 | 0.0000000 | 149.46667 | 1.000000 | 10.0000 | 33.779915 |
| BAD | 1500 | 73760.82 | 101776 | DebtCon | Other | 8.922268 | 0.2545697 | 0.4494424 | 179.76628 | 1.186055 | 21.2961 | 33.779915 |
| GOOD | 1700 | 97800.00 | 112000 | HomeImp | Office | 3.000000 | 0.0000000 | 0.0000000 | 93.33333 | 0.000000 | 14.0000 | 33.779915 |
| BAD | 1700 | 30548.00 | 40320 | HomeImp | Other | 9.000000 | 0.0000000 | 0.0000000 | 101.46600 | 1.000000 | 8.0000 | 37.113614 |
| BAD | 1800 | 48649.00 | 57037 | HomeImp | Other | 5.000000 | 3.0000000 | 2.0000000 | 77.10000 | 1.000000 | 17.0000 | 33.779915 |
| BAD | 1800 | 28502.00 | 43034 | HomeImp | Other | 11.000000 | 0.0000000 | 0.0000000 | 88.76603 | 0.000000 | 8.0000 | 36.884894 |
| BAD | 2000 | 32700.00 | 46740 | HomeImp | Other | 3.000000 | 0.0000000 | 2.0000000 | 216.93333 | 1.000000 | 12.0000 | 33.779915 |
| BAD | 2000 | 73760.82 | 62250 | HomeImp | Sales | 16.000000 | 0.0000000 | 0.0000000 | 115.80000 | 0.000000 | 13.0000 | 33.779915 |
| BAD | 2000 | 22608.00 | 101776 | DebtCon | Other | 18.000000 | 0.2545697 | 0.4494424 | 179.76628 | 1.186055 | 21.2961 | 33.779915 |
| BAD | 2000 | 20627.00 | 29800 | HomeImp | Office | 11.000000 | 0.0000000 | 1.0000000 | 122.53333 | 1.000000 | 9.0000 | 33.779915 |
| BAD | 2000 | 45000.00 | 55000 | HomeImp | Other | 3.000000 | 0.0000000 | 0.0000000 | 86.06667 | 2.000000 | 25.0000 | 33.779915 |
| GOOD | 2000 | 64536.00 | 87400 | DebtCon | Mgr | 2.500000 | 0.0000000 | 0.0000000 | 147.13333 | 0.000000 | 24.0000 | 33.779915 |
| BAD | 2100 | 71000.00 | 83850 | HomeImp | Other | 8.000000 | 0.0000000 | 1.0000000 | 123.00000 | 0.000000 | 16.0000 | 33.779915 |
| BAD | 2200 | 24280.00 | 34687 | HomeImp | Other | 8.922268 | 0.0000000 | 1.0000000 | 300.86667 | 0.000000 | 8.0000 | 33.779915 |
| BAD | 2200 | 90957.00 | 102600 | HomeImp | Mgr | 7.000000 | 2.0000000 | 6.0000000 | 122.90000 | 1.000000 | 22.0000 | 33.779915 |
| BAD | 2200 | 23030.00 | 101776 | DebtCon | Other | 19.000000 | 0.2545697 | 0.4494424 | 179.76628 | 1.186055 | 21.2961 | 3.711312 |
| BAD | 2300 | 28192.00 | 40150 | HomeImp | Other | 4.500000 | 0.0000000 | 0.0000000 | 54.60000 | 1.000000 | 16.0000 | 33.779915 |
| GOOD | 2300 | 102370.00 | 120953 | HomeImp | Office | 2.000000 | 0.0000000 | 0.0000000 | 90.99253 | 0.000000 | 13.0000 | 31.588503 |
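As a quick sanity check (my addition, not in the original), we can confirm that no NA values remain after the imputation pipeline:
# Count remaining NA values in each column of the imputed data:
sapply(df, function(x) sum(is.na(x)))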