1 Setup
library(pacman)
p_load(char = c(
  "MASS", # for Boston dataset
  "tidyverse",
  "DataExplorer",
  "here",
  "ggtext",
  "plotly",
  "tidyr",
  "visdat",
  "paletteer",
  "corrplot",
  "inspectdf",
  "ExPanDaR", # shiny based interactive EDA and vis.
  "SmartEDA",
  "GGally",
  "ggpcp",
  "dlookr"
))
my_theme <- theme_classic() +
  theme(
    plot.title = element_text(face = "bold"),
    plot.background = element_rect(fill = "gray93"),
    # linewidth replaces the size argument, which is deprecated for lines since ggplot2 3.4.0
    panel.grid.major = element_line(color = "gray95", linewidth = 0.2),
    strip.background = element_blank(),
    # element_textbox is from ggtext
    strip.text = ggtext::element_textbox(
      size = 11, face = "bold",
      color = "white", fill = "steelblue3", halign = 0.5,
      r = unit(5, "pt"), width = unit(1, "npc"),
      padding = margin(2, 0, 1, 0), margin = margin(3, 3, 3, 3)
    )
  )
theme_set(my_theme)
## Data
boston <-
  MASS::Boston %>%
  as_tibble() %>%
  transmute(
    crime_rate_percap = crim,
    prop_industrial = indus,
    charles_river = chas,
    NO_conc = nox,
    avg_rooms = rm,
    prop_b4_1940 = age,
    dist_to_emp = dis,
    road_access = rad,
    black,
    tax_percent = tax / 100,
    pupil_teacher_rt = ptratio,
    prop_low_status = lstat,
    median_value = medv
  )
titanic <-
  Titanic %>% # comes with R
  as_tibble() %>%
  uncount(weights = n)
2 The packages
Figure from Researchgate paper
Comparisons. From here: https://github.com/daya6489/SmartEDA
3 vis_dat and vis_guess from visdat
Visualize data types and missingness.
cars93 <-
  MASS::Cars93 %>%
  as_tibble()
cars93 %>%
  vis_dat() +
  scale_fill_paletteer_d(palette = "NineteenEightyR::sonny", drop = TRUE) +
  labs(
    title = "Column types and missingness for the Cars93 dataset.",
    subtitle = "Cars were randomly selected from 1993 Consumer Reports magazine"
  ) +
  my_theme +
  theme(axis.text.x = element_text(angle = 60, hjust = 0)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, NA))
4 makeCodebook report from dataMaid
Generates a PDF codebook with summary statistics for each variable.
Find output here: https://kendavidn.com/randomfiles/codebook_cars93.pdf
The makeDataReport function produces similar output, but also includes some diagnostics, e.g. “this value looks like an outlier”.
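The code that produced the codebook is not shown here. A minimal sketch of how the two reports might be generated, assuming the dataMaid package is installed (it is not part of the p_load call in Setup) and that a LaTeX installation is available for PDF output. The file names are hypothetical choices for illustration.

```r
library(dataMaid)

cars93_df <- MASS::Cars93

# Codebook: summary statistics for each variable.
# `file` names the intermediate .Rmd file; the rendered report sits next to it.
makeCodebook(
  cars93_df,
  file = "codebook_cars93.Rmd", # hypothetical file name
  output = "pdf",
  replace = TRUE,               # overwrite files from an earlier run
  openResult = FALSE            # do not open the report automatically
)

# Data report: similar summaries plus diagnostics such as possible outliers
makeDataReport(
  cars93_df,
  file = "report_cars93.Rmd",   # hypothetical file name
  output = "pdf",
  replace = TRUE,
  openResult = FALSE
)
```

Setting openResult = FALSE keeps the call usable in a non-interactive session.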
5 create_report from DataExplorer
Not very pretty, but a simple HTML report can be useful to have on hand to show.
Find it here: DataExplorer Report
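The call that built the report is not shown here. A minimal sketch, assuming DataExplorer is installed. The output file name, directory and title below are hypothetical choices.

```r
library(DataExplorer)

# Write the report to a temporary directory (hypothetical location)
out_dir <- tempdir()

create_report(
  MASS::Cars93,
  output_file = "cars93_report.html",
  output_dir = out_dir,
  report_title = "Cars93 data profiling report"
)

# Path of the generated HTML file
file.path(out_dir, "cars93_report.html")
```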
6 inspect_num and inspect_cat from inspectdf
# convert variables with fewer than 10 unique values to factors for observation
cars93 %>%
  mutate(across(.cols = where(~ length(unique(.x)) < 10),
                .fns = ~ as.factor(.x))) %>%
  # distribution of factor variables
  inspect_cat() %>%
  show_plot(col_palette = 1) +
  labs(subtitle = "Numeric vars. with 9 or fewer unique values were converted to factors")
# distribution of numerical variables
cars93 %>%
  inspect_num() %>%
  show_plot(col_palette = 2)
7 plot_scatterplot from DataExplorer
Creates scatterplots of one chosen variable (the by argument, typically the response) against every other variable in the data.
- Continuous variables:
boston %>%
  plot_scatterplot(
    by = "median_value",
    nrow = 4,
    geom_point_args = list("color" = alpha("dodgerblue4", 0.4)),
    ggtheme = my_theme,
    title = "Relationship between all variables and median house value (y axis, in $1000s)"
  )
- Discrete variables:
plottitanic <-
  titanic %>%
  plot_scatterplot(
    by = "Survived",
    geom_point_args = list("color" = alpha("dodgerblue4", 0.4),
                           position = position_jitter()),
    ggtheme = my_theme,
    title = "Survival among different groups on the titanic"
  )
Note that if any of your variables is a factor, plot_scatterplot will plot all the predictors as factors. This can cause problems, so split your data before passing it to this function: one part containing only the categorical variables, the other only the numeric variables.
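The split described in the note can be sketched as follows, using the Cars93 data because it mixes factor and numeric columns. The object names cars93_num and cars93_cat are hypothetical.

```r
library(dplyr)

cars93_df <- as_tibble(MASS::Cars93)

# Numeric columns only
cars93_num <- cars93_df %>%
  select(where(is.numeric))

# Categorical (factor or character) columns only
cars93_cat <- cars93_df %>%
  select(where(~ is.factor(.x) || is.character(.x)))

# Each part can then be passed to plot_scatterplot on its own, e.g.
# cars93_num %>% plot_scatterplot(by = "Price")
```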
We can make the scatterplots interactive:
- Future idea: Can we permit crosstalk on these plots?
To do this, we might need to abandon the faceting paradigm and plot separate plots instead. Then combine them, either with cowplot/patchwork or with plotly’s subplot function.
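A rough sketch of that idea, assuming the crosstalk package is installed (it is not part of the p_load call in Setup). Instead of facets, two separate plotly scatterplots share one SharedData object and are combined with subplot, so a selection in one panel highlights the same rows in the other. The choice of variables is only illustrative.

```r
library(plotly)
library(crosstalk)

# One shared data object links the panels
shared_boston <- SharedData$new(MASS::Boston)

# Two separate plots instead of facets
p_rooms <- plot_ly(shared_boston, x = ~rm, y = ~medv,
                   type = "scatter", mode = "markers")
p_lstat <- plot_ly(shared_boston, x = ~lstat, y = ~medv,
                   type = "scatter", mode = "markers")

# Combine the panels and enable linked selection by dragging a box or lasso
linked <- subplot(p_rooms, p_lstat, shareY = TRUE, titleX = TRUE) %>%
  highlight(on = "plotly_selected", off = "plotly_deselect")

linked
```

Whether this scales to as many panels as plot_scatterplot produces is untested here.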
8 imputate_na from dlookr
Used to impute missing values. It supports several methods, including mice. For now we keep it simple.
Recall that we have some missing values in the cars93 dataset:
We replace them with the mean. The available methods are:
- “mean” : arithmetic mean
- “median” : median
- “mode” : mode
- “knn” : K-nearest neighbors
- “rpart” : Recursive Partitioning and Regression Trees
- “mice” : Multivariate Imputation by Chained Equations
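The code that creates the imputed vectors used in the next step is not shown. One plausible way to produce them, assuming mean imputation through imputate_na as described above, is sketched here. The names Luggage.room.f and Rear.seat.room.f are taken from the code that follows, and cars93 is rebuilt only to keep the sketch self-contained.

```r
library(dlookr)

cars93 <- tibble::as_tibble(MASS::Cars93)

# Replace missing values in each column with the column mean
Luggage.room.f <- imputate_na(cars93, Luggage.room, method = "mean")
Rear.seat.room.f <- imputate_na(cars93, Rear.seat.room, method = "mean")

# Count the missing values remaining in the imputed vectors
sum(is.na(Luggage.room.f))
sum(is.na(Rear.seat.room.f))
```

For the "knn", "rpart" and "mice" methods, imputate_na also expects a yvar argument naming the variable to use as the dependent variable.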
cars93.f <- cars93
cars93.f[, "Luggage.room"] <- Luggage.room.f
cars93.f[, "Rear.seat.room"] <- Rear.seat.room.f
Check again for missingness.
All gone! Now we can use that to make a correlation plot that doesn’t have annoying question marks.
9 corrplot from corrplot
Correlation matrix represented with colored ellipses.
cars93 %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(method = "ellipse", type = "upper")
You can also show numbers. Let’s use the filled data frame, cars93.f
cars93.f %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(method = "number", type = "upper",
           number.cex = .5, tl.cex = 0.7)
And we can cluster variables that are highly correlated with each other.
cars93.f %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(method = "number",
           order = "hclust", addrect = 3,
           number.cex = .5, tl.cex = 0.7,
           tl.srt = 60
  )
10 plot_correlate from dlookr
Uses corrplot as a base and adds some features. For example, we can apply group_by to plot one correlogram for each factor level, and we can select which columns to plot.
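A small sketch of both features, assuming a dlookr version that still provides plot_correlate (recent releases may deprecate it in favour of correlate() followed by plot(), so check against the installed version). The columns chosen are only illustrative.

```r
library(dplyr)
library(dlookr)

# One correlogram per level of Origin, restricted to four selected columns
MASS::Cars93 %>%
  group_by(Origin) %>%
  plot_correlate(Price, MPG.city, Horsepower, Weight)
```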
11 ExPanDaR
Explore data interactively.
Functions to:
- make two-variable bar plots. One factor on the x axis, filled by another factor
- make two or three variable scatter plots. One var on x, one var on y, one var mapped to point size
- make by-group bar charts
- make by-group violin charts
And there are several others.
Expandar Example
12 Parallel coordinates plots
Needs a better implementation. The points don’t seem to align. Also, ggplotly cannot convert geom_pcp, so the plot cannot be made interactive. No good.
cars93 %>%
  ggplot(aes(
    colour = Manufacturer,
    vars = vars(Manufacturer, Model, Type, Min.Price, Max.Price, MPG.city, MPG.highway, AirBags, Origin, Manufacturer)
  )) +
  geom_pcp() +
  geom_pcp_box(boxwidth = 0.1) +
  geom_pcp_text(boxwidth = 0.1, size = 3) +
  theme(legend.position = "none")
13 Packages that seem useless or deprecated
- RtutoR
- xray
- exploreR