Useful data summary functions from different R packages

1 Setup

knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.align = "center")
library(pacman)
p_load(char = c("MASS", # for Boston dataset
                "tidyverse", 
                "DataExplorer",
                "here",
                "ggtext",
                "plotly",
                "tidyr",
                "visdat",
                "paletteer",
                "corrplot",
                "inspectdf", 
                "ExPanDaR", # shiny based interactive EDA and vis.
                "SmartEDA",
                "GGally", 
                "ggpcp",
                "dlookr"
                ))


my_theme <- theme_classic() + 
    theme(
    plot.title = element_text(face = "bold"),
    plot.background = element_rect(fill = "gray93"),
    # `linewidth` replaces the deprecated `size` for line elements (ggplot2 >= 3.4.0)
    panel.grid.major = element_line(color = "gray95", linewidth = 0.2),
    strip.background = element_blank(),
    # element textbox is from ggtext
    strip.text = ggtext::element_textbox(
      size = 11, face = "bold",
      color = "white", fill = "steelblue3", halign = 0.5, 
      r = unit(5, "pt"), width = unit(1, "npc"),
      padding = margin(2, 0, 1, 0), margin = margin(3, 3, 3, 3)
    )
  )

theme_set(my_theme)

## Data

boston <-
  MASS::Boston %>%
  as_tibble() %>%
  transmute(
    crime_rate_percap = crim,
    prop_industrial = indus,
    charles_river = chas,
    NO_conc = nox,
    avg_rooms = rm,
    prop_b4_1940 = age,
    dist_to_emp = dis,
    road_access = rad,
    black,
    tax_percent = tax / 100,
    pupil_teacher_rt = ptratio,
    prop_low_status = lstat,
    median_value = medv
  )


titanic <- 
  Titanic %>% # comes with R
  as_tibble() %>% 
  uncount(weights = n)

2 The packages

Figure from a ResearchGate paper.

Package comparisons, from here: https://github.com/daya6489/SmartEDA

3 vis_dat and vis_guess from visdat

Visualize data types and missingness.

cars93 <- 
  MASS::Cars93 %>% 
  as_tibble()

cars93 %>%
  vis_dat() +
  scale_fill_paletteer_d(palette = "NineteenEightyR::sonny", drop = T) +
  labs(title = "Column types and missingness for the Cars93 dataset.",
       subtitle = "Cars were randomly selected from 1993 Consumer Reports magazine") +
  my_theme +
  theme(axis.text.x = element_text(angle = 60, hjust = 0)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, NA))
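As a numeric cross-check on the plot, missing values can also be counted per column. This is a minimal base R sketch; it reads MASS::Cars93 directly rather than using the cars93 tibble, and the names cars and na_counts are only illustrative.

```r
# Count missing values per column of Cars93 (ships with the MASS package).
cars <- MASS::Cars93
na_counts <- colSums(is.na(cars))

# Keep only the columns that actually contain missing values.
na_counts[na_counts > 0]
```

Only the columns that show gaps in the vis_dat plot should appear in the result.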

4 makeCodebook report from dataMaid

Generates a PDF codebook with summary statistics for each variable.

# dataMaid is not loaded in Setup, so the calls are namespaced.
# dataMaid::makeCodebook(cars93)
# dataMaid::makeDataReport(cars93)

Find output here: https://kendavidn.com/randomfiles/codebook_cars93.pdf

The makeDataReport function produces similar output but also includes some diagnostics, e.g. “this value looks like an outlier”.

5 create_report from DataExplorer

Not very pretty, but useful when you just need a simple HTML report to show.

#create_report(cars93)

Find it here: DataExplorer Report

6 inspect_num and inspect_cat from inspectdf

# convert variables with fewer than 10 unique values to factors for observation
cars93 %>%
  mutate(across(.cols = where(~ length(unique(.x)) < 10),
                .fns =  ~ as.factor(.x))) %>%
  # distribution of factor variables
  inspect_cat() %>%
  show_plot(col_palette = 1) + 
  labs(subtitle = "Numeric vars. with 9 or fewer unique values were converted to factors")

# distribution of numerical variables
cars93 %>%
  inspect_num() %>%
  show_plot(col_palette = 2)
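The "fewer than 10 unique values" rule used in this section can be sketched in base R, which makes it easy to see which columns get converted. This is only an illustration on MASS::Cars93; the name few_unique is hypothetical, and note that unique() counts NA as a value.

```r
cars <- MASS::Cars93

# Flag columns with fewer than 10 distinct values (NA counts as a value).
few_unique <- vapply(cars, function(x) length(unique(x)) < 10, logical(1))

# Convert only the flagged columns to factors; the rest are left untouched.
cars[few_unique] <- lapply(cars[few_unique], as.factor)

# Names of the columns that were treated as categorical.
names(cars)[few_unique]
```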

7 plot_scatterplot from DataExplorer

Creates scatterplots of every other variable against one chosen variable (the by argument, typically the response).

  • Continuous variables:
boston %>%
  plot_scatterplot(
    by = "median_value",
    nrow = 4,
    geom_point_args = list("color" = alpha("dodgerblue4", 0.4)),
    ggtheme = my_theme,
    title = "Relationship between all variables and median house value (y axis, in $1000s)"
  )

  • Discrete variables
plottitanic <- 
  titanic %>% 
  plot_scatterplot(
    by = "Survived",     
    geom_point_args = list("color" = alpha("dodgerblue4", 0.4),
                           position = position_jitter()  ),
    ggtheme = my_theme, 
    title = "Survival among different groups on the titanic"
    )

Note that if any of your variables is a factor, plot_scatterplot will plot all the predictors as factors, which can cause problems. Split your data before passing it to this function: one part with only categorical variables, the other with only numeric variables.
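One way to do that split, sketched in base R on MASS::Cars93. The object names are hypothetical, and the commented plot_scatterplot call only indicates where each part would go.

```r
cars <- MASS::Cars93

# TRUE for numeric columns, FALSE for everything else (factors here).
is_num <- vapply(cars, is.numeric, logical(1))

cars_numeric     <- cars[, is_num, drop = FALSE]
cars_categorical <- cars[, !is_num, drop = FALSE]

# Each part could then be plotted on its own, for example:
# DataExplorer::plot_scatterplot(cars_numeric, by = "Price")
```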

We can make the scatterplots interactive:

# ggplotly(plottitanic$page_1)
  • Future idea: Can we permit crosstalk on these plots?

To do this, we might need to abandon the faceting paradigm and plot separate plots instead. Then combine them, either with cowplot/patchwork or with plotly’s subplot function.

8 imputate_na from dlookr

Imputes missing values. It looks very powerful and supports mice and other methods, but for now we keep it simple.

Recall that we have some missing values in the cars93 dataset:

cars93[, c("Rear.seat.room", "Luggage.room")] %>% 
  vis_dat()

We replace them with the mean. The available methods include:

  • “mean” : arithmetic mean
  • “median” : median
  • “mode” : mode
  • “knn” : K-nearest neighbors
  • “rpart” : Recursive Partitioning and Regression Trees
  • “mice” : Multivariate Imputation by Chained Equations

Luggage.room.f <- imputate_na(cars93, Luggage.room, method = "mean")

plot(Luggage.room.f)

Rear.seat.room.f <- imputate_na(cars93, Rear.seat.room, method = "mean" )

plot(Rear.seat.room.f)

cars93.f <- cars93

cars93.f[, "Luggage.room"] <- Luggage.room.f
cars93.f[, "Rear.seat.room"] <- Rear.seat.room.f

Check again for missingness.

cars93.f[, c("Rear.seat.room", "Luggage.room")] %>% 
  vis_dat()

All gone! Now we can use that to make a correlation plot that doesn’t have annoying question marks.
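For reference, mean imputation itself is simple enough to sketch in base R. This is not dlookr's implementation, just the idea behind method = "mean"; impute_mean is a hypothetical helper.

```r
cars <- MASS::Cars93

# Hypothetical helper: replace NAs in a numeric vector with the mean
# of the observed (non-missing) values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

observed_mean <- mean(cars$Luggage.room, na.rm = TRUE)
cars$Luggage.room <- impute_mean(cars$Luggage.room)

anyNA(cars$Luggage.room)  # FALSE: no missing values remain
```

Mean imputation leaves the column mean unchanged but shrinks its variance, which is one reason the model-based methods in the list can be preferable.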

9 corrplot from corrplot

Correlation matrix represented with colored ellipses.

cars93 %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(method = "ellipse", type = "upper")
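The question marks in this plot are how corrplot labels NA entries (its na.label argument defaults to "?"). They appear because cor(), by default, returns NA for any pair involving a column with missing values. Besides imputing, an alternative is to let cor() use the complete pairs only. A small base R sketch on MASS::Cars93, with illustrative object names:

```r
cars <- MASS::Cars93
num_cols <- cars[, vapply(cars, is.numeric, logical(1))]

# Default behaviour: a column with any NA yields NA correlations.
cor_default <- cor(num_cols)
anyNA(cor_default)   # TRUE

# Alternative: for each pair of columns, use the rows where both are observed.
cor_pairwise <- cor(num_cols, use = "pairwise.complete.obs")
anyNA(cor_pairwise)  # FALSE
```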

You can also show numbers. Let’s use the imputed data frame, cars93.f.

cars93.f %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(method = "number", type = "upper",
           number.cex = .5, tl.cex = 0.7)

And we can cluster variables that are highly correlated with each other.

cars93.f %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(method = "number",
           order = "hclust", addrect = 3,
           number.cex = .5, tl.cex = 0.7,
           tl.srt = 60)
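To give a feel for what hierarchical-clustering order means, here is a rough base R sketch. It assumes a distance of 1 minus the correlation and hclust's default complete linkage; corrplot's internals may differ in detail, so treat this as an illustration rather than a reproduction.

```r
cars <- MASS::Cars93
num_cols <- cars[, vapply(cars, is.numeric, logical(1))]
cor_mat <- cor(num_cols, use = "pairwise.complete.obs")

# Treat 1 - correlation as a distance: highly correlated variables are "close".
hc <- hclust(as.dist(1 - cor_mat))

# Reorder rows and columns so that clustered variables sit next to each other.
cor_ordered <- cor_mat[hc$order, hc$order]

# Cut the tree into 3 groups, mirroring addrect = 3 in the call above.
groups <- cutree(hc, k = 3)
groups
```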

10 plot_correlate from dlookr

Uses corrplot as a base and adds some features. For example, we can apply group_by to plot one correlogram per factor level, and we can select which columns to plot.

cars93 %>% 
  group_by(DriveTrain) %>% 
  plot_correlate(Min.Price, Price, Max.Price)
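The numbers behind a grouped correlogram can be sketched in base R by splitting the selected columns by group and correlating within each piece. This only shows the idea, not how dlookr computes it; the object names are illustrative.

```r
cars <- MASS::Cars93
price_cols <- c("Min.Price", "Price", "Max.Price")

# Split the three price columns by drive train, then correlate within each group.
cor_by_group <- lapply(split(cars[, price_cols], cars$DriveTrain), cor)

# One 3 x 3 correlation matrix per DriveTrain level.
names(cor_by_group)
```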

11 ExPanDaR

Explore data interactively.

Functions to:

  • make two-variable bar plots: one factor on the x axis, filled by another factor
  • make two- or three-variable scatter plots: one variable on x, one on y, one mapped to point size
  • make by-group bar charts
  • make by-group violin charts

And there are several others.

Expandar Example

# ExPanDaR::ExPanD(cars93, export_nb_option = T)

12 Parallel coordinates plots

This needs a better implementation: the points don’t seem to align, and ggplotly cannot convert geom_pcp.

cars93 %>%
  ggplot(aes(
    colour = Manufacturer,
    vars = vars(Manufacturer, Model, Type, Min.Price, Max.Price, MPG.city, MPG.highway, AirBags, Origin, Manufacturer))) + 
  geom_pcp() + 
  geom_pcp_box(boxwidth=0.1) + 
  geom_pcp_text(boxwidth=0.1, size = 3) +
  theme(legend.position = "none")
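For numeric columns only, one possible fallback is parcoord() from MASS, which is already loaded for the datasets. It is a base-graphics sketch, not a replacement for ggpcp's handling of categorical variables; the column selection and colours below are arbitrary choices for illustration.

```r
cars <- MASS::Cars93
num_vars <- c("Min.Price", "Max.Price", "MPG.city", "MPG.highway")

# One colour per level of Origin (a two-level factor in Cars93).
line_cols <- c("steelblue3", "tomato3")[as.integer(cars$Origin)]

# parcoord() rescales each column to a common range before drawing the lines.
MASS::parcoord(cars[, num_vars], col = line_cols, var.label = TRUE)
```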

13 Packages that seem useless or deprecated

  • RtutoR
  • xray
  • exploreR