This is a quick tutorial on using the most useful functions in the lares library for quick data analysis: freqs, distr, and corr_var. With these you can explore and understand the interaction between all or specific variables within a dataset.
First, we load the lares and tidyverse libraries (for dplyr and ggplot2):
library(lares) # devtools::install_github("laresbernardo/lares")
library(tidyverse)
Then, we load the dft dataset within the lares library. It is a useful subset from the Titanic survival dataset.
data(dft)
Let’s quickly see its structure:
df_str(dft, return = "plot")
head(dft, 5)
The freqs function lets us group, count, and calculate percentages and cumulative totals, either for further transformations or for visual analysis. For more details: ?freqs
# How many survived?
dft %>% freqs(Survived)
# How many survived? Show the plot instead of the table
dft %>% freqs(Survived, plot = T, results = F)
# How many survived per class?
dft %>% freqs(Survived, Pclass, plot = T, results = F)
# Per class, how many survived?
dft %>% freqs(Pclass, Survived, plot = T, results = F)
# Per sex and class, how many survived?
dft %>% freqs(Sex, Pclass, Survived, plot = T, results = F)
# Per number of siblings, sex, and class, how many survived?
dft %>% freqs(SibSp, Sex, Pclass, Survived, plot = T, results = F)
## Sorry, but trying to plot more than 3 features is as complex as it sounds. You should try another method to understand your analysis!
# Frequency of tickets
dft %>% freqs(Ticket, plot = T, results = F)
## Filtering the top 40 (out of 681) frequent values. Use the 'top' parameter if you want to overrule.
# Frequency of tickets: show me more
dft %>% freqs(Ticket, plot = T, top = 50, results = F)
## Filtering the top 50 (out of 681) frequent values. Use the 'top' parameter if you want to overrule.
# Let's customize the plot a bit....
dft %>%
  mutate(Survived = ifelse(Survived == 1, "Did survive", "Did not survive")) %>%
  freqs(Pclass, Survived, plot = T,
        title = "People who survived the Titanic by Class",
        subtitle = paste("Bernardo Lares:", Sys.Date()),
        results = F)
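For intuition, the group, count, percentage, and cumulative steps described for freqs can be approximated in plain dplyr. This is only a minimal sketch, not the lares implementation: the toy data frame and the column names pct and cum_pct are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; in the tutorial this role is played by dft
toy <- data.frame(
  Pclass   = c(1, 1, 2, 3, 3, 3, 3, 2),
  Survived = c(1, 0, 1, 0, 0, 1, 0, 0)
)

freq_sketch <- toy %>%
  count(Pclass, Survived, name = "n") %>%  # group and count each combination
  arrange(desc(n)) %>%                      # most frequent combination first
  mutate(
    pct     = 100 * n / sum(n),             # share of all rows, in percent
    cum_pct = cumsum(pct)                   # running total of those shares
  )

print(freq_sketch)
```

Because the result is an ordinary data frame, it can be piped into further dplyr steps or into ggplot2, which is the same workflow freqs is meant to shorten.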
The distr function lets us compare the distribution of a target variable against another variable. The variables can be categorical or continuous. For more details: ?distr
# Relation for survived vs sex
dft %>% distr(Survived, Sex)
# Relation for sex vs survived
dft %>% distr(Sex, Survived)
# Relation for survived vs embark gate (categorical)
dft %>% distr(Survived, Embarked)
# Relation for survived vs fare (continuous)
dft %>% distr(Survived, Fare)
# Relation for survived vs fare with custom colours for Survived
dft %>% distr(Survived, Fare, custom_colours = T)
# Relation for survived vs fare with ascending order
dft %>% distr(Survived, Fare, abc = T)
# Relation for survived vs fare with only 5 splits
dft %>% distr(Survived, Fare, abc = T, breaks = 5)
dft %>% distr(Survived, Fare, abc = T, top = 5)
# Relation for survived vs age (notice NA values)
dft %>% distr(Survived, Age)
dft %>% distr(Survived, Age, na.rm = T, abc = T)
# Distribution of fares paid
dft %>% distr(Fare)
dft %>% filter(Fare < 200) %>% distr(Fare)
# Distribution (frequency) of survivors
dft %>% distr(Survived, force = "char")
# Distribution of log(Fare) vs Fare
dft %>% mutate(logFare = log(Fare)) %>% distr(Fare, logFare)
dft %>%
  mutate(logFare = log(Fare)) %>%
  filter(Fare < 100) %>%
  distr(Fare, logFare) +
  geom_point(colour = "yellow")
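To make the idea behind distr concrete for a continuous variable, the comparison can be sketched by hand: split the continuous variable into bins, then look at how the target is distributed inside each bin. The snippet below is an illustration under assumptions, not how distr is implemented. The toy data, the fare_bin and pct names, and the choice of two equal-sized bins via dplyr's ntile are all made up here, and distr may choose its breaks differently.

```r
library(dplyr)

# Hypothetical toy data standing in for dft's Fare and Survived columns
toy <- data.frame(
  Fare     = c(5, 7, 8, 10, 15, 20, 30, 50, 80, 100),
  Survived = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
)

distr_sketch <- toy %>%
  mutate(fare_bin = ntile(Fare, 2)) %>%      # two equal-sized bins: low and high fares
  count(fare_bin, Survived, name = "n") %>%  # count target values inside each bin
  group_by(fare_bin) %>%
  mutate(pct = 100 * n / sum(n)) %>%         # share of each target value within its bin
  ungroup()

print(distr_sketch)
```

Reading the pct column bin by bin gives the same kind of comparison that the stacked proportions in a distr plot are meant to show at a glance.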
The corr_var function lets us correlate a whole data frame with a single feature. It can automatically convert columns into numerical values with one-hot encoding, select the most frequent values, generate date and time features, etc. For more details: ?corr_var
# Correlate Survived with everything else
dft %>% corr_var(Survived)
## Automatically reduced results to the top 30 variables. Use the 'top' parameter to override this limit.
# Filter out variables with more than 50% correlation
dft %>% corr_var(Survived, ceiling = 50)
## Automatically reduced results to the top 30 variables. Use the 'top' parameter to override this limit.
## Removing all correlations greater than 50% (absolute)
# Show only 10 values
dft %>% corr_var(Survived, top = 10)
# Also calculate log(values)
dft %>% corr_var(Survived, logs = T)
## Automatically reduced results to the top 30 variables. Use the 'top' parameter to override this limit.
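The core idea behind corr_var, one-hot encoding the categorical columns and then correlating every resulting column with the target, can be sketched in base R. This is a simplified illustration under assumptions, not the lares implementation: the toy data is made up, model.matrix is used here as a stand-in encoder, and no top-value selection or date handling is attempted.

```r
# Hypothetical toy data standing in for dft
toy <- data.frame(
  Survived = c(1, 0, 1, 0, 0, 1, 0, 1),
  Sex      = c("female", "male", "female", "male", "male", "female", "male", "male"),
  Fare     = c(80, 7, 50, 8, 10, 30, 5, 20)
)

# One-hot encode the categorical column; "- 1" drops the intercept so that
# each level of Sex gets its own 0/1 column
encoded <- as.data.frame(model.matrix(~ . - 1, data = toy))

# Correlate every non-target column with the target
target <- encoded$Survived
others <- encoded[, setdiff(names(encoded), "Survived"), drop = FALSE]
corrs  <- sapply(others, function(x) cor(x, target))

# Strongest absolute correlations first
corrs <- corrs[order(abs(corrs), decreasing = TRUE)]
print(round(corrs, 3))
```

Since Sexfemale and Sexmale are exact complements, their correlations with the target are mirror images of each other, which is also why one-hot encoded results tend to come in opposite-signed pairs.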
Also: you can save and export your plot to a PNG file using save = TRUE, and to any folder using subdir = "~/" (your working directory is the default).
Report generated by BL @ 2019-02-14 13:38:18