This is a quick tutorial on the most useful functions in the lares library for exploratory data analysis: freqs, distr, and corr_var. With these you can explore and understand the interactions between all or specific variables in a dataset.

First, we load the lares and tidyverse libraries (the latter for dplyr and ggplot2):

library(lares) # devtools::install_github("laresbernardo/lares")
library(tidyverse)

Then, we load the dft dataset within the lares library. It is a useful subset from the Titanic survival dataset.

data(dft)

Let’s quickly see its structure:

df_str(dft, return = "plot")

head(dft, 5)

FREQUENCIES (freqs):

This function lets us group, count, calculate percentages and cumulatives for further transformations or for visual analysis. For more details: ?freqs
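To make the idea concrete, here is a minimal dplyr sketch of the same kind of calculation (group, count, percentage, cumulative percentage). It is not how freqs is implemented; the toy data frame and the column names p and pcum are assumptions made only for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for dft (made-up values)
toy <- data.frame(
  Survived = c(0, 0, 0, 1, 1, 0, 1, 0),
  Pclass   = c(3, 3, 2, 1, 1, 3, 2, 3)
)

# Group and count, then derive the percentage and the cumulative percentage
toy_freqs <- toy %>%
  count(Survived, sort = TRUE) %>%
  mutate(
    p    = round(100 * n / sum(n), 2),
    pcum = cumsum(p)
  )

print(toy_freqs)
```

Passing more columns to count(), for example count(Pclass, Survived), gives the same kind of table per combination of values.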

# How many survived?
dft %>% freqs(Survived)
# How many survived? Now as a plot
dft %>% freqs(Survived, plot = T, results = F)

# How many survived per class?
dft %>% freqs(Survived, Pclass, plot = T, results = F)

# Per class, how many survived?
dft %>% freqs(Pclass, Survived, plot = T, results = F)

# Per sex and class, how many survived?
dft %>% freqs(Sex, Pclass, Survived, plot = T, results = F)

# Per number of siblings, sex, and class, how many survived?
dft %>% freqs(SibSp, Sex, Pclass, Survived, plot = T, results = F)
## Sorry, but trying to plot more than 3 features is as complex as it sounds. You should try another method to understand your analysis!

# Frequency of tickets
dft %>% freqs(Ticket, plot = T, results = F)
## Filtering the top 40 (out of 681) frequent values. Use the 'top' parameter if you want to overrule.

# Frequency of tickets: show me more
dft %>% freqs(Ticket, plot = T, top = 50, results = F)
## Filtering the top 50 (out of 681) frequent values. Use the 'top' parameter if you want to overrule.

# Let's customize the plot a bit....
dft %>% 
  mutate(Survived = ifelse(Survived == 1, "Did survive", "Did not survive")) %>%
  freqs(Pclass, Survived, plot = T, 
        title = "People who survived the Titanic by Class",
        subtitle = paste("Bernardo Lares:", Sys.Date()),
        results = F)

DISTRIBUTIONS (distr):

This function lets us compare the distribution of a target variable vs another variable. The variables can be categorical or continuous. For more details: ?distr
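When the second variable is continuous, the comparison only makes sense once the values are grouped into bins. The base R sketch below shows that idea on made-up data: it splits Fare into two quantile-based bins and computes the share of each Survived value per bin. The quantile binning is an assumption for illustration; distr may split the values differently.

```r
# Hypothetical toy data (made-up values, not taken from dft)
toy <- data.frame(
  Survived = c(0, 0, 1, 0, 1, 1, 0, 1),
  Fare     = c(7, 8, 10, 13, 26, 30, 52, 80)
)

# Split the continuous variable into 2 quantile-based bins
cuts <- quantile(toy$Fare, probs = seq(0, 1, length.out = 3))
toy$FareBin <- cut(toy$Fare, breaks = cuts, include.lowest = TRUE)

# Share of each Survived value within every Fare bin (each row sums to 1)
shares <- prop.table(table(toy$FareBin, toy$Survived), margin = 1)
print(round(shares, 2))
```

Raising length.out gives more, narrower bins, which is the same trade-off the breaks argument controls in the examples that follow.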

# Relation for survived vs sex
dft %>% distr(Survived, Sex)

# Relation for sex vs survived
dft %>% distr(Sex, Survived)

# Relation for survived vs embark gate (categorical)
dft %>% distr(Survived, Embarked)

# Relation for survived vs fare (continuous)
dft %>% distr(Survived, Fare)

# Relation for survived vs fare with custom colours for Survived
dft %>% distr(Survived, Fare, custom_colours = T)

# Relation for survived vs fare with ascending order
dft %>% distr(Survived, Fare, abc = T)

# Relation for survived vs fare with only 5 splits
dft %>% distr(Survived, Fare, abc = T, breaks = 5)

dft %>% distr(Survived, Fare, abc = T, top = 5)

# Relation for survived vs age (notice NA values)
dft %>% distr(Survived, Age)

dft %>% distr(Survived, Age, na.rm = T, abc = T)

# Distribution of fares paid
dft %>% distr(Fare)

dft %>% filter(Fare < 200) %>% distr(Fare)

# Distribution (frequency) of survivors
dft %>% distr(Survived, force = "char")

# Distribution of log(Fare) vs Fare
dft %>% mutate(logFare = log(Fare)) %>% distr(Fare, logFare)

dft %>% mutate(logFare = log(Fare)) %>% 
  filter(Fare < 100) %>% distr(Fare, logFare) + 
  geom_point(colour = "yellow")

CORRELATION VARIABLE (corr_var):

This function lets us correlate a whole dataframe with a single feature. It can automatically convert categorical columns into numerical values with one-hot encoding, keep only the most frequent values, generate date and time features, etc. For more details: ?corr_var
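As a rough picture of what "one-hot encode, then correlate" means, here is a base R sketch on made-up data. It is only an approximation of the idea, not the corr_var implementation, and the toy data frame is hypothetical.

```r
# Hypothetical toy data (made-up values, not taken from dft)
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0),
  Sex      = factor(c("male", "female", "female", "male", "female", "male")),
  Fare     = c(7, 30, 80, 8, 26, 13)
)

# One-hot encode the predictors; "- 1" drops the intercept so that
# every level of Sex gets its own 0/1 column
encoded <- model.matrix(~ . - 1, data = toy[, c("Sex", "Fare")])

# Correlate each encoded column with the target
cors <- cor(encoded, toy$Survived)[, 1]

# Sort by absolute correlation, strongest first
sorted <- cors[order(abs(cors), decreasing = TRUE)]
print(round(sorted, 2))
```

In this toy data Sex matches Survived exactly, so the two Sex columns come out at +1 and -1; real data will be far less clean.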

# Correlate Survived with everything else
dft %>% corr_var(Survived)
## Automatically reduced results to the top 30 variables. Use the 'top' parameter to override this limit.

# Filter out variables with more than 50% correlation (absolute value)
dft %>% corr_var(Survived, ceiling = 50)
## Automatically reduced results to the top 30 variables. Use the 'top' parameter to override this limit.
## Removing all correlations greater than 50% (absolute)

# Show only 10 values
dft %>% corr_var(Survived, top = 10)

# Also calculate log(values)
dft %>% corr_var(Survived, logs = T)
## Automatically reduced results to the top 30 variables. Use the 'top' parameter to override this limit.

ALSO: You can save and export your plot into a PNG file using save = TRUE, and choose the destination folder with subdir = "~/" (your working directory is the default).

Report generated by BL @ 2019-02-14 13:38:18