1 Introduction

This notebook walks through a full exploratory data analysis (EDA) of the store_sales.csv dataset using base R graphics: no ggplot2, no dplyr, just the tools that ship with R itself. Two optional add-on packages, skimr and corrplot, appear only as conveniences.

The goal is twofold. First, to give you a comprehensive reference for the kinds of plots you should be producing whenever you start working with a new dataset. Second, to build the habit of pairing every visualization with a short written interpretation. A plot without commentary is just decoration; a plot with a clear takeaway is analysis.

The structure of this notebook follows the natural progression of EDA: we start with one variable at a time (univariate), move on to relationships between two variables (bivariate), and finish with summary panels that bring everything together.


2 Setting Up the Workspace

2.1 Creating Output Folders

Reproducibility matters. Rather than scattering plots across your working directory, we will save every figure to a structured set of folders. Setting this up once at the top of the script saves a lot of time later.

folders <- c(
  "plots/01_univariate_numeric",
  "plots/02_univariate_categorical",
  "plots/03_bivariate_numeric",
  "plots/04_bivariate_numeric_x_categorical",
  "plots/05_bivariate_categorical",
  "plots/06_correlation",
  "plots/07_student_dataset",
  "plots/08_overview"
)

for (f in folders) {
  dir.create(f, recursive = TRUE, showWarnings = FALSE)
}

cat("All output folders created under: plots/\n")
#> All output folders created under: plots/

2.2 A Helper Function for Saving Plots

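This small wrapper around png() keeps our plotting code clean. It only opens the graphics device, so every plot must end with dev.off() to write and close the file. You can adjust width and height (in pixels) and res (pixels per inch) depending on the plot.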

save_png <- function(path, width = 900, height = 650, res = 120) {
  png(filename = path, width = width, height = height, res = res)
}
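
As a minimal usage sketch, the pattern is always the same: open the device, draw, close the device. The file name demo_hist.png and the temporary directory are assumptions made only so the example can run anywhere; the helper is repeated so the block is self-contained.

```r
# Same helper as above, repeated so this sketch runs on its own.
save_png <- function(path, width = 900, height = 650, res = 120) {
  png(filename = path, width = width, height = height, res = res)
}

# Hypothetical demo file in a temporary directory (not one of the notebook's folders).
demo_path <- file.path(tempdir(), "demo_hist.png")

save_png(demo_path, width = 600, height = 400)   # 1. open the device
hist(c(2, 3, 3, 4, 4, 4, 5, 5, 6),               # 2. draw
     main = "Demo", xlab = "Value")
dev.off()                                        # 3. write and close the file

file.exists(demo_path)
```

If dev.off() is forgotten, later plots keep going to the still-open file instead of the screen, which is a common source of "empty" or missing figures.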

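Useful color references: colors() lists every color name built into R (such as "steelblue" and "tomato", both used throughout this notebook), and rgb(red, green, blue, alpha) builds a custom color from values between 0 and 1, where an alpha below 1 makes the color semi-transparent.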


3 Loading and Inspecting the Data

Before doing anything else, get a feel for the shape and content of your data.

store_data <- read.csv("store_sales.csv")

3.1 Quick Dimensionality Checks

dim(store_data)      # rows x columns (5000 x 11)
nrow(store_data)     # number of observations
head(store_data)     # first 6 rows
str(store_data)      # column names and types
summary(store_data)  # overview of every column

3.2 A Faster Overview with skimr

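The skimr package, an optional add-on that is not part of base R, gives a much richer summary than base summary(): missing-value counts, quantiles, and a small inline histogram for each numeric column. It is worth installing.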

# install.packages("skimr")
library(skimr)
skim(store_data)
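
If you would rather stay within base R, a rough per-column overview can be assembled with vapply(). This is only a sketch, not a replacement for skim(): the toy data frame and the helper name column_overview are made up for illustration.

```r
# Toy data standing in for store_data (values are made up).
toy <- data.frame(
  Amount = c(20.5, NA, 75.0, 42.3),
  Season = c("Spring", "Fall", "Fall", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per column with its type,
# number of missing values, and number of distinct non-missing values.
column_overview <- function(df) {
  data.frame(
    column    = names(df),
    type      = vapply(df, function(col) class(col)[1], character(1)),
    n_missing = vapply(df, function(col) sum(is.na(col)), integer(1)),
    n_unique  = vapply(df, function(col) length(unique(col[!is.na(col)])), integer(1)),
    row.names = NULL
  )
}

overview <- column_overview(toy)
print(overview)
```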

4 Subsetting and Quick Summaries

Before plotting, it is often useful to slice the data and compute simple statistics on the resulting subsets. These exercises help you build intuition for what is in the data.

4.1 Accessories with Heavy Discounts and Low Spend

accessories_low_amount <- store_data[
  store_data$Category       == "Accessories" &
  store_data$DiscountApplied  > 15           &
  store_data$Amount           < 90,
]

nrow(accessories_low_amount)
head(accessories_low_amount)

mean(accessories_low_amount$Amount)
median(accessories_low_amount$Amount)
sd(accessories_low_amount$Amount)
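
One caveat with logical subsetting is worth seeing on a small scale: when a condition evaluates to NA, R returns an all-NA row rather than dropping it, which inflates nrow(). The sketch below uses a made-up four-row data frame, and wrapping the condition in which() is one possible remedy, not the only one.

```r
# Toy data (made up); the second row has a missing Amount.
toy <- data.frame(
  Category = c("Accessories", "Accessories", "Clothing", "Accessories"),
  Amount   = c(50, NA, 40, 120)
)

keep <- toy$Category == "Accessories" & toy$Amount < 90
keep                                  # TRUE NA FALSE FALSE

with_na_row <- toy[keep, ]            # the NA condition yields an all-NA row
without_na  <- toy[which(keep), ]     # which() keeps only the TRUE positions

nrow(with_na_row)                     # 2
nrow(without_na)                      # 1
```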

4.2 High-Value Spring Customers

spring_high_value <- store_data[
  store_data$Season            == "Spring" &
  store_data$Amount             >  130     &
  store_data$PreviousPurchases >=  5,
]

nrow(spring_high_value)
mean(spring_high_value$Amount)
median(spring_high_value$Amount)

4.3 Spending Comparison by Gender

Here we compare the spending behavior of male and female customers using a simple summary table.

female_transactions <- store_data[store_data$Gender == "Female", ]
male_transactions   <- store_data[store_data$Gender == "Male",   ]

summary_table <- data.frame(
  Gender = c("Female", "Male"),
  Mean   = c(mean(female_transactions$Amount, na.rm = TRUE),
             mean(male_transactions$Amount,   na.rm = TRUE)),
  Median = c(median(female_transactions$Amount, na.rm = TRUE),
             median(male_transactions$Amount,   na.rm = TRUE)),
  SD     = c(sd(female_transactions$Amount, na.rm = TRUE),
             sd(male_transactions$Amount,   na.rm = TRUE))
)

summary_table
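
The same kind of table can be produced without creating one data frame per group. The sketch below uses aggregate() on a made-up data frame; the column names mirror the notebook, but the values are invented. Note that the formula interface of aggregate() drops rows with missing values by default.

```r
# Toy data (made up) with the same column names as store_data.
toy <- data.frame(
  Gender = c("Female", "Male", "Female", "Male", "Female"),
  Amount = c(40, 55, 60, 65, 80)
)

by_gender <- aggregate(
  Amount ~ Gender,
  data = toy,
  FUN  = function(x) c(Mean = mean(x), Median = median(x), SD = sd(x))
)

# aggregate() stores the three statistics in one matrix column;
# do.call(data.frame, ...) flattens it into Amount.Mean, Amount.Median, Amount.SD.
by_gender <- do.call(data.frame, by_gender)
print(by_gender)
```

This scales to any number of groups, whereas the hand-built table needs a new pair of lines for every additional level.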

5 Univariate Plots: Numeric Variables

This section covers plots for understanding a single numeric variable at a time. The standard toolkit includes histograms, density plots, boxplots, strip plots, ECDFs, and Q-Q plots.

Output folder: plots/01_univariate_numeric/

5.1 Histogram of Transaction Amount

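A histogram divides the range of the variable into bins and counts how many observations fall into each. The choice of breaks matters: too few bins hide structure, too many add noise. When breaks is a single number, hist() treats it as a suggestion and rounds the bin edges to convenient values, so the plot may not show exactly 20 bins.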

save_png("plots/01_univariate_numeric/hist_amount.png")

hist(store_data$Amount,
     main   = "Distribution of Transaction Amount",
     xlab   = "Amount ($)",
     col    = "steelblue",
     border = "white",
     breaks = 20)

abline(v   = mean(store_data$Amount),
       col = "red", lty = 2, lwd = 2)

legend("topright",
       legend = paste("Mean =", round(mean(store_data$Amount), 2)),
       col    = "red", lty = 2, lwd = 2)

dev.off()

The dashed red line marks the mean. Adding the mean (and optionally the median) helps you see at a glance whether the distribution is skewed: a mean well to the right of the median points to a long right tail.

abline() quick reference:

  • lty controls line type (2 = dashed, 3 = dotted)
  • lwd controls line thickness
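
To see how the number of bins changes what a histogram reports, it can help to inspect the bin counts without drawing anything. The sketch below uses simulated amounts (an assumption, not the store data) and hist(..., plot = FALSE), which returns the break points and counts as a list.

```r
set.seed(1)                            # simulated, right-skewed amounts
amounts <- rexp(500, rate = 1 / 60)

h_few  <- hist(amounts, breaks = 5,  plot = FALSE)
h_many <- hist(amounts, breaks = 50, plot = FALSE)

length(h_few$counts)     # number of bins actually used for breaks = 5
length(h_many$counts)    # number of bins actually used for breaks = 50
h_few$breaks             # bin edges are rounded to convenient values
sum(h_many$counts)       # every observation falls in exactly one bin
```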

5.2 Histogram of Customer Age

save_png("plots/01_univariate_numeric/hist_age.png")

hist(store_data$Age,
     main   = "Age Distribution of Customers",
     xlab   = "Age (years)",
     col    = "darkorange",
     border = "white",
     breaks = 20)

dev.off()

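Practice exercise: Modify the chunk above to add both the mean and median as vertical lines, and include a legend explaining each. The full solution is shown below. Note that it saves to the same file name, so it overwrites the earlier hist_age.png, and that it uses coarser bins (breaks = 10).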

save_png("plots/01_univariate_numeric/hist_age.png")

hist(store_data$Age,
     main   = "Age Distribution of Customers",
     xlab   = "Age (years)",
     col    = "darkorange",
     border = "white",
     breaks = 10)

mean_age   <- mean(store_data$Age,   na.rm = TRUE)
median_age <- median(store_data$Age, na.rm = TRUE)

abline(v = mean_age,   col = "red",  lty = 2, lwd = 2)
abline(v = median_age, col = "blue", lty = 2, lwd = 2)

legend("topright",
       legend = c(paste("Mean =",   round(mean_age,   2)),
                  paste("Median =", round(median_age, 2))),
       col    = c("red", "blue"),
       lty    = 2,
       lwd    = 2)

dev.off()

5.3 Histograms of Discount and Previous Purchases

# Discount Applied
save_png("plots/01_univariate_numeric/hist_discount.png")
hist(store_data$DiscountApplied,
     main   = "Distribution of Discount Applied",
     xlab   = "Discount Applied",
     col    = "mediumpurple",
     border = "white",
     breaks = 20)
dev.off()

# Previous Purchases
save_png("plots/01_univariate_numeric/hist_previous_purchases.png")
hist(store_data$PreviousPurchases,
     main   = "Distribution of Previous Purchases",
     xlab   = "Number of Previous Purchases",
     col    = "forestgreen",
     border = "white",
     breaks = 20)
dev.off()

5.4 Density Plot of Amount

A density plot is a smoothed version of a histogram. It often makes the shape of the distribution clearer, especially when comparing groups.

save_png("plots/01_univariate_numeric/density_amount.png")

d <- density(store_data$Amount)
plot(d,
     main = "Density of Transaction Amount",
     xlab = "Amount ($)",
     col  = "steelblue",
     lwd  = 2)

polygon(d, col = rgb(0.27, 0.51, 0.71, 0.25), border = NA)

abline(v = mean(store_data$Amount),   col = "red",       lty = 2, lwd = 2)
abline(v = median(store_data$Amount), col = "darkgreen", lty = 3, lwd = 2)

legend("topright",
       legend = c("Mean", "Median"),
       col    = c("red", "darkgreen"),
       lty    = c(2, 3), lwd = 2)

dev.off()
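
How smooth the curve looks depends on the kernel bandwidth, which density() chooses automatically. The sketch below, on simulated two-peaked data (an assumption, not the store data), uses the adjust argument to scale that default bandwidth up or down.

```r
set.seed(42)
# Simulated mixture of two groups of amounts.
amounts <- c(rnorm(300, mean = 40, sd = 8), rnorm(200, mean = 90, sd = 10))

d_default <- density(amounts)
d_narrow  <- density(amounts, adjust = 0.5)   # half the default bandwidth: wigglier curve
d_wide    <- density(amounts, adjust = 2)     # twice the default bandwidth: smoother curve

c(default = d_default$bw, narrow = d_narrow$bw, wide = d_wide$bw)
```

A very wide bandwidth can merge the two peaks into one, so it is worth trying more than one value before describing the shape of a distribution.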

5.5 Detecting Outliers Using the IQR Rule

A common rule of thumb: any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is flagged as a potential outlier. The function below returns the rows of the data frame that meet this criterion.

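get_outlier_rows <- function(df, column) {
  x <- df[[column]]

  q1  <- quantile(x, 0.25, na.rm = TRUE)
  q3  <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1               # lower-case name avoids masking stats::IQR()

  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr

  # which() ignores NA comparisons, so missing values do not come back as all-NA rows
  df[which(x < lower | x > upper), ]
}

out_amount <- get_outlier_rows(store_data, "Amount")
nrow(out_amount)               # number of flagged transactions

# Missing values per column; worth checking before trusting the count above
colSums(is.na(store_data))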
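
The rule is easy to verify by hand on a tiny vector. The sketch below uses ten made-up values and compares the manual fences with boxplot.stats(), which is what boxplot() uses to decide which points to draw individually. The two can differ slightly on other data because boxplot.stats() works with Tukey's hinges rather than quantile().

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)   # made-up values with one extreme point

q1  <- unname(quantile(x, 0.25))          # 3.25
q3  <- unname(quantile(x, 0.75))          # 7.75
iqr <- q3 - q1                            # 4.5

fences <- c(lower = q1 - 1.5 * iqr,       # -3.5
            upper = q3 + 1.5 * iqr)       # 14.5

manual_outliers  <- x[x < fences["lower"] | x > fences["upper"]]
boxplot_outliers <- boxplot.stats(x)$out

manual_outliers       # 100
boxplot_outliers      # 100
```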

5.6 Boxplots of Amount

The boxplot is one of the most efficient ways to see the median, the spread, and any extreme values in a single picture.

# Standard boxplot
save_png("plots/01_univariate_numeric/boxplot_amount.png")
boxplot(store_data$Amount,
        main   = "Boxplot of Transaction Amount",
        ylab   = "Amount ($)",
        col    = "lightblue",
        border = "steelblue")
dev.off()

# Log-transformed boxplot: useful when the distribution is heavily right-skewed
# (log() requires Amount > 0)
save_png("plots/01_univariate_numeric/boxplot_log_amount.png")
boxplot(log(store_data$Amount),
        main   = "Boxplot of Log Transaction Amount",
        ylab   = "log(Amount)",
        col    = "lightblue",
        border = "steelblue")
dev.off()

5.6.1 Boxplot with Annotated Five-Number Summary

save_png("plots/01_univariate_numeric/boxplot_amount_annotated.png")

# Widen the right margin so the labels written by mtext() are not clipped
par(mar = c(5, 4, 4, 7))

boxplot(store_data$Amount,
        main   = "Amount: 5-Number Summary",
        ylab   = "Amount ($)",
        col    = "lightblue",
        border = "steelblue")

five_num <- fivenum(store_data$Amount)
labels   <- c("Min", "Q1", "Median", "Q3", "Max")

mtext(paste(labels, "=", round(five_num, 1)),
      side = 4, at = five_num, las = 1, cex = 0.75, col = "steelblue")

dev.off()

cat("\n5-Number Summary of Amount:\n")
print(fivenum(store_data$Amount))

5.6.2 Boxplot of Age with Annotated Summary

save_png("plots/01_univariate_numeric/boxplot_age_annotated.png", width = 400)

# Widen the right margin so the labels written by mtext() are not clipped
par(mar = c(5, 4, 4, 7))

boxplot(store_data$Age,
        main   = "Age: 5-Number Summary",
        ylab   = "Age (years)",
        col    = "lightblue",
        border = "steelblue")

five_num <- fivenum(store_data$Age)
labels   <- c("Min", "Q1", "Median", "Q3", "Max")

mtext(paste(labels, "=", round(five_num, 1)),
      side = 4, at = five_num, las = 1, cex = 0.75, col = "steelblue")

dev.off()

5.7 Strip Plots

Strip plots show every individual observation. With jitter and transparency, they reveal density patterns that boxplots can hide.

save_png("plots/01_univariate_numeric/stripplot_amount.png")

stripchart(store_data$Amount,
           method = "jitter",
           pch    = 19,
           col    = rgb(0.27, 0.51, 0.71, 0.15),
           main   = "Strip Plot of Transaction Amount",
           xlab   = "Amount ($)")

abline(v = mean(store_data$Amount),   col = "red",       lty = 2, lwd = 2)
abline(v = median(store_data$Amount), col = "darkgreen", lty = 3, lwd = 2)

legend("topright",
       legend = c("Mean", "Median"),
       col    = c("red", "darkgreen"),
       lty    = c(2, 3), lwd = 2)

dev.off()

The semi-transparent points (alpha = 0.15, the fourth argument to rgb()) reveal density: darker regions indicate where many transactions cluster.

save_png("plots/01_univariate_numeric/stripplot_itemrating.png")

stripchart(store_data$ItemRating,
           method = "jitter",
           pch    = 19,
           col    = rgb(0.27, 0.51, 0.71, 0.3),
           main   = "Strip Plot of Item Ratings",
           xlab   = "Rating (1-5)")

dev.off()

5.8 Empirical Cumulative Distribution Function (ECDF)

The ECDF answers the question: “what fraction of observations are at or below this value?” It is a useful complement to histograms and boxplots.

save_png("plots/01_univariate_numeric/ecdf_amount.png")

plot(ecdf(store_data$Amount),
     main = "ECDF of Transaction Amount",
     xlab = "Amount ($)",
     ylab = "Cumulative Proportion",
     col  = "steelblue",
     lwd  = 2,
     pch  = NA)

abline(v = median(store_data$Amount), h = 0.5, col = "red", lty = 2)

legend("bottomright",
       legend = paste("Median =", round(median(store_data$Amount), 2)),
       col    = "red", lty = 2)

dev.off()
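
Because ecdf() returns a function, the curve can also be queried directly. The sketch below uses five made-up values to show that evaluating the ECDF at a point is the same as computing the share of observations at or below it.

```r
x  <- c(10, 20, 20, 30, 50)   # made-up amounts
Fn <- ecdf(x)                 # ecdf() returns a step function

Fn(20)                        # 0.6: three of the five values are <= 20
Fn(5)                         # 0:   below the minimum
Fn(50)                        # 1:   at the maximum
mean(x <= 20)                 # the same proportion computed directly
```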

5.9 Q-Q Plots for Checking Normality

Q-Q plots compare the quantiles of your data against the quantiles of a theoretical distribution (here, the normal). If the points follow the reference line closely, the data are approximately normal; points that bend away from the line at one or both ends indicate skew or heavy tails.

# Q-Q plot for Amount
save_png("plots/01_univariate_numeric/qqplot_amount.png")
qqnorm(store_data$Amount,
       main = "Normal Q-Q Plot of Transaction Amount",
       pch  = 19,
       col  = rgb(0.27, 0.51, 0.71, 0.4),
       cex  = 0.6)
qqline(store_data$Amount, col = "red", lwd = 2)
dev.off()

# Q-Q plot for Age
save_png("plots/01_univariate_numeric/qqplot_age.png")
qqnorm(store_data$Age,
       main = "Normal Q-Q Plot of Customer Age",
       pch  = 19,
       col  = rgb(0.85, 0.33, 0.10, 0.4),
       cex  = 0.6)
qqline(store_data$Age, col = "red", lwd = 2)
dev.off()
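
It can demystify the plot to reconstruct what qqnorm() puts on each axis. The sketch below, on a simulated skewed sample (an assumption, not the store data), checks that the x-axis holds normal quantiles at the probabilities given by ppoints() and the y-axis holds the sorted data.

```r
set.seed(7)
x <- rexp(200)                              # simulated right-skewed sample

qq <- qqnorm(x, plot.it = FALSE)            # returns the plotted coordinates
theoretical <- qnorm(ppoints(length(x)))    # normal quantiles at evenly spread probabilities

all.equal(sort(qq$x), theoretical)          # x-axis: theoretical quantiles
all.equal(sort(qq$y), sort(x))              # y-axis: the observed values
```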

6 Univariate Plots: Categorical Variables

For categorical variables, the standard tools are bar charts (raw counts), pie charts (proportions), and Pareto charts (sorted bars with a cumulative line).

Output folder: plots/02_univariate_categorical/

6.1 Bar Charts

save_png("plots/02_univariate_categorical/bar_category.png", width = 1000)

cat_counts <- sort(table(store_data$Category), decreasing = TRUE)

barplot(cat_counts,
        main   = "Transactions by Product Category",
        ylab   = "Count",
        col    = "steelblue",
        border = "white")

dev.off()

# Season
save_png("plots/02_univariate_categorical/bar_season.png")
barplot(table(store_data$Season),
        main   = "Transactions by Season",
        ylab   = "Count",
        xlab   = "Season",
        col    = c("darkorange", "steelblue", "gold", "skyblue"),
        border = "white")
dev.off()

# Gender
save_png("plots/02_univariate_categorical/bar_gender.png")
barplot(table(store_data$Gender),
        main   = "Transactions by Gender",
        ylab   = "Count",
        col    = c("tomato", "steelblue"),
        border = "white")
dev.off()

# Payment Method
save_png("plots/02_univariate_categorical/bar_payment_method.png")
barplot(table(store_data$PaymentMethod),
        main   = "Transactions by Payment Method",
        ylab   = "Count",
        col    = c("steelblue", "darkorange"),
        border = "white")
dev.off()

6.2 Pie Charts

Pie charts are best reserved for showing how a whole is divided among a small number of categories. With many categories, a sorted bar chart is almost always clearer.

# Category
save_png("plots/02_univariate_categorical/pie_category.png",
         width = 900, height = 900)
cat_prop <- prop.table(table(store_data$Category))
pie(cat_prop,
    main   = "Share of Transactions by Category",
    col    = rainbow(length(cat_prop)),
    labels = paste0(names(cat_prop), "\n",
                    round(cat_prop * 100, 1), "%"))
dev.off()

# Season
save_png("plots/02_univariate_categorical/pie_season.png",
         width = 800, height = 800)
season_prop <- prop.table(table(store_data$Season))
pie(season_prop,
    main   = "Share of Transactions by Season",
    col    = c("darkorange", "steelblue", "gold", "skyblue"),
    labels = paste0(names(season_prop), "\n",
                    round(season_prop * 100, 1), "%"))
dev.off()

6.3 Pareto Chart

A Pareto chart sorts bars from largest to smallest and overlays a cumulative percentage line. It is the standard tool for asking: “which categories drive the majority of activity?”

save_png("plots/02_univariate_categorical/pareto_category.png", width = 1000)

cat_sorted  <- sort(table(store_data$Category), decreasing = TRUE)
cat_cum_pct <- cumsum(cat_sorted) / sum(cat_sorted) * 100

par(mar = c(8, 4, 3, 4))
bp_pos <- barplot(cat_sorted,
                  main   = "Pareto Chart - Product Category",
                  ylab   = "Count",
                  col    = "steelblue",
                  border = "white",
                  las    = 2,
                  ylim   = c(0, max(cat_sorted) * 1.1))

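usr <- par("usr")   # user coordinates of the bar chart just drawn

par(new = TRUE)
plot(bp_pos, cat_cum_pct,
     type = "b", pch = 19, col = "red", lwd = 2,
     axes = FALSE, xlab = "", ylab = "",
     xlim = usr[1:2], xaxs = "i",    # reuse the bars' x-range so points sit on bar centers
     ylim = c(0, 110), yaxs = "i")   # 0% on the bar baseline; headroom matches the 1.1 factor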

axis(side = 4, at = seq(0, 100, 20),
     labels = paste0(seq(0, 100, 20), "%"), col.axis = "red")
mtext("Cumulative %", side = 4, line = 3, col = "red")
abline(h = 80, col = "red", lty = 2)

par(mar = c(5, 4, 4, 2))
dev.off()

The horizontal line at 80% marks the classic Pareto threshold: which categories together account for 80% of transactions?
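
The question posed by the 80% line can also be answered numerically. The sketch below uses made-up category counts (the names and numbers are assumptions, not values from store_sales.csv) and finds the smallest set of top categories whose cumulative share reaches the threshold.

```r
# Made-up counts for illustration only.
counts <- c(Clothing = 50, Footwear = 35, Accessories = 10, Outerwear = 5)

sorted  <- sort(counts, decreasing = TRUE)
cum_pct <- cumsum(sorted) / sum(sorted) * 100   # 50, 85, 95, 100

n_needed  <- which(cum_pct >= 80)[1]            # first position at or above 80%
top_group <- names(sorted)[seq_len(n_needed)]

top_group                                       # "Clothing" "Footwear"
```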


7 Bivariate Plots: Numeric vs Numeric

When comparing two numeric variables, the scatter plot is your starting point. Adding a regression line gives a quick read on linear association.

Output folder: plots/03_bivariate_numeric/

7.1 Scatter Plots with Regression Lines

save_png("plots/03_bivariate_numeric/scatter_amount_vs_discount.png")

plot(store_data$DiscountApplied, store_data$Amount,
     main = "Amount vs Discount Applied",
     xlab = "Discount Applied",
     ylab = "Amount ($)",
     pch  = 19,
     col  = rgb(0.27, 0.51, 0.71, 0.3),
     cex  = 0.7)

abline(lm(Amount ~ DiscountApplied, data = store_data),
       col = "red", lwd = 2)

dev.off()

# Amount vs Previous Purchases
save_png("plots/03_bivariate_numeric/scatter_amount_vs_prev_purchases.png")
plot(store_data$PreviousPurchases, store_data$Amount,
     main = "Amount vs Previous Purchases",
     xlab = "Previous Purchases",
     ylab = "Amount ($)",
     pch  = 19,
     col  = rgb(0.85, 0.33, 0.10, 0.3),
     cex  = 0.7)
abline(lm(Amount ~ PreviousPurchases, data = store_data),
       col = "red", lwd = 2)
dev.off()

# Amount vs Age
save_png("plots/03_bivariate_numeric/scatter_amount_vs_age.png")
plot(store_data$Age, store_data$Amount,
     main = "Amount vs Customer Age",
     xlab = "Age (years)",
     ylab = "Amount ($)",
     pch  = 19,
     col  = rgb(0.20, 0.63, 0.17, 0.3),
     cex  = 0.7)
abline(lm(Amount ~ Age, data = store_data),
       col = "red", lwd = 2)
dev.off()

# Amount vs Item Rating
save_png("plots/03_bivariate_numeric/scatter_amount_vs_itemrating.png")
plot(store_data$ItemRating, store_data$Amount,
     main = "Amount vs Item Rating",
     xlab = "Item Rating (1-5)",
     ylab = "Amount ($)",
     pch  = 19,
     col  = rgb(0.55, 0.20, 0.75, 0.25),
     cex  = 0.7)
abline(lm(Amount ~ ItemRating, data = store_data),
       col = "red", lwd = 2)
dev.off()
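
A fitted line is easier to interpret alongside its numbers. The sketch below fits the same kind of model on simulated data (the variables age and amount are generated here with a known slope, as an assumption) and extracts the slope and R-squared that the red line summarizes.

```r
set.seed(123)
age    <- round(runif(100, min = 18, max = 70))
amount <- 30 + 0.5 * age + rnorm(100, sd = 10)   # simulated: true slope is 0.5

fit <- lm(amount ~ age)

coef(fit)                      # intercept and slope of the fitted line
r2 <- summary(fit)$r.squared   # share of variance explained by the line
r2
cor(age, amount)^2             # equals R-squared when there is a single predictor
```

A slope near zero together with a small R-squared means the line is close to flat and the scatter plot shows little linear association, however confident the red line may look.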

7.2 Coloring Points by a Categorical Variable

Adding a third variable as color in a scatter plot is one of the most effective ways to reveal subgroup differences.

save_png("plots/03_bivariate_numeric/scatter_amount_vs_age_by_gender.png")

gender_cols <- c("Female" = "tomato", "Male" = "steelblue")

plot(store_data$Age, store_data$Amount,
     main = "Amount vs Age (by Gender)",
     xlab = "Age (years)",
     ylab = "Amount ($)",
     pch  = 19,
     col  = gender_cols[as.character(store_data$Gender)],  # look up by name, also if Gender is a factor
     cex  = 0.7)

legend("topright",
       legend = names(gender_cols),
       col    = gender_cols,
       pch    = 19)

dev.off()

7.3 Scatter Plot Matrix

When you have several numeric variables, the pairs() function produces a grid of all pairwise scatter plots in one go. This is one of the fastest ways to scan for relationships.

num_vars <- store_data[, c("Age", "Amount", "PreviousPurchases", "ItemRating")]

save_png("plots/03_bivariate_numeric/scatter_matrix_all_numeric.png",
         width = 1000, height = 1000)

pairs(num_vars,
      main  = "Scatter Plot Matrix - Numeric Variables",
      pch   = 19,
      col   = rgb(0.27, 0.51, 0.71, 0.2),
      cex   = 0.5,
      panel = panel.smooth)

dev.off()

8 Bivariate Plots: Numeric vs Categorical

When one variable is numeric and the other is categorical, side-by-side boxplots are the workhorse. They show how the distribution of the numeric variable differs across categories.

Output folder: plots/04_bivariate_numeric_x_categorical/

8.1 Boxplots by Group

# Amount by Season
save_png("plots/04_bivariate_numeric_x_categorical/boxplot_amount_by_season.png")
boxplot(Amount ~ Season,
        data   = store_data,
        main   = "Transaction Amount by Season",
        xlab   = "Season",
        ylab   = "Amount ($)",
        col    = c("darkorange", "steelblue", "gold", "skyblue"),
        border = "grey30")
dev.off()

# Amount by Gender
save_png("plots/04_bivariate_numeric_x_categorical/boxplot_amount_by_gender.png")
boxplot(Amount ~ Gender,
        data   = store_data,
        main   = "Transaction Amount by Gender",
        xlab   = "Gender",
        ylab   = "Amount ($)",
        col    = c("tomato", "steelblue"),
        border = "grey30")
dev.off()

# Amount by Category — wider plot, rotated labels
save_png("plots/04_bivariate_numeric_x_categorical/boxplot_amount_by_category.png",
         width = 1100)
par(mar = c(9, 4, 3, 1))
boxplot(Amount ~ Category,
        data   = store_data,
        main   = "Transaction Amount by Category",
        xlab   = "",
        ylab   = "Amount ($)",
        col    = rainbow(length(unique(store_data$Category)), alpha = 0.7),
        border = "grey30",
        las    = 2)
par(mar = c(5, 4, 4, 2))
dev.off()

# Amount by Payment Method
save_png("plots/04_bivariate_numeric_x_categorical/boxplot_amount_by_payment.png")
boxplot(Amount ~ PaymentMethod,
        data   = store_data,
        main   = "Transaction Amount by Payment Method",
        xlab   = "Payment Method",
        ylab   = "Amount ($)",
        col    = c("steelblue", "darkorange"),
        border = "grey30")
dev.off()

# Item Rating by Season
save_png("plots/04_bivariate_numeric_x_categorical/boxplot_itemrating_by_season.png")
boxplot(ItemRating ~ Season,
        data   = store_data,
        main   = "Item Rating by Season",
        xlab   = "Season",
        ylab   = "Item Rating (1-5)",
        col    = c("darkorange", "steelblue", "gold", "skyblue"),
        border = "grey30")
dev.off()

8.2 Boxplots with Raw Data Overlaid

Combining a boxplot with a strip chart gives the best of both worlds: summary statistics plus the raw data.

save_png("plots/04_bivariate_numeric_x_categorical/boxplot_strip_amount_by_gender.png")

boxplot(Amount ~ Gender,
        data   = store_data,
        main   = "Amount by Gender (with raw data points)",
        xlab   = "Gender",
        ylab   = "Amount ($)",
        col    = c("mistyrose", "lightblue"),
        border = "grey30")

stripchart(Amount ~ Gender,
           data     = store_data,
           add      = TRUE,
           vertical = TRUE,
           method   = "jitter",
           pch      = 19,
           col      = c(rgb(0.8, 0.1, 0.1, 0.15),
                        rgb(0.1, 0.3, 0.7, 0.15)),
           cex      = 0.5)

dev.off()

8.3 Strip Chart by Group

save_png("plots/04_bivariate_numeric_x_categorical/stripchart_amount_by_season.png")

stripchart(Amount ~ Season,
           data     = store_data,
           method   = "jitter",
           vertical = TRUE,
           pch      = 19,
           col      = c(rgb(0.85, 0.41, 0.0, 0.3),
                        rgb(0.27, 0.51, 0.71, 0.3),
                        rgb(0.85, 0.75, 0.0, 0.3),
                        rgb(0.53, 0.81, 0.98, 0.3)),
           main     = "Amount by Season (Strip Chart)",
           xlab     = "Season",
           ylab     = "Amount ($)")

dev.off()

9 Bivariate Plots: Categorical vs Categorical

When both variables are categorical, contingency tables and bar charts (side-by-side, stacked, or proportional) are your main tools. Mosaic plots add another perspective by encoding both counts and proportions in a single figure.

Output folder: plots/05_bivariate_categorical/

9.1 Contingency Tables

gs_table <- table(store_data$Gender,   store_data$Season)
gp_table <- table(store_data$Gender,   store_data$PaymentMethod)
cs_table <- table(store_data$Category, store_data$Season)
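
The tables above are stored but not displayed. As a sketch of how such a table is usually read, the block below builds one from six made-up rows and shows counts with totals as well as row and column proportions; the margin argument decides which direction sums to 1.

```r
# Toy data (made up) with the same column names as store_data.
toy <- data.frame(
  Gender = c("Female", "Male", "Female", "Male", "Female", "Male"),
  Season = c("Spring", "Spring", "Fall", "Fall", "Fall", "Spring")
)

tab <- table(toy$Gender, toy$Season)

addmargins(tab)                           # counts with row and column totals
row_prop <- prop.table(tab, margin = 1)   # each row sums to 1
col_prop <- prop.table(tab, margin = 2)   # each column sums to 1

row_prop
col_prop
```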

9.2 Side-by-Side, Stacked, and Proportional Bar Charts

# Side-by-side: Gender x Season
save_png("plots/05_bivariate_categorical/bar_sidebyside_gender_x_season.png")
barplot(gs_table,
        beside      = TRUE,
        main        = "Transactions: Gender x Season",
        xlab        = "Season",
        ylab        = "Count",
        col         = c("tomato", "steelblue"),
        border      = "white",
        legend.text = rownames(gs_table),
        args.legend = list(x = "topright", bty = "n"))
dev.off()

# Stacked: Gender x Season
save_png("plots/05_bivariate_categorical/bar_stacked_gender_x_season.png")
barplot(gs_table,
        beside      = FALSE,
        main        = "Stacked: Gender x Season",
        xlab        = "Season",
        ylab        = "Count",
        col         = c("tomato", "steelblue"),
        border      = "white",
        legend.text = rownames(gs_table),
        args.legend = list(x = "topright", bty = "n"))
dev.off()

# Proportional stacked: Gender x Season
save_png("plots/05_bivariate_categorical/bar_proportional_gender_x_season.png")
gs_prop <- prop.table(gs_table, margin = 2)
barplot(gs_prop,
        beside      = FALSE,
        main        = "Proportional: Gender x Season",
        xlab        = "Season",
        ylab        = "Proportion",
        col         = c("tomato", "steelblue"),
        border      = "white",
        ylim        = c(0, 1),
        legend.text = rownames(gs_table),
        args.legend = list(x = "topright", bty = "n"))
dev.off()

# Side-by-side: Gender x Payment Method
save_png("plots/05_bivariate_categorical/bar_sidebyside_gender_x_payment.png")
barplot(gp_table,
        beside      = TRUE,
        main        = "Transactions: Gender x Payment Method",
        xlab        = "Payment Method",
        ylab        = "Count",
        col         = c("tomato", "steelblue"),
        border      = "white",
        legend.text = rownames(gp_table),
        args.legend = list(x = "topright", bty = "n"))
dev.off()

# Side-by-side: Category x Season
save_png("plots/05_bivariate_categorical/bar_sidebyside_category_x_season.png",
         width = 1200)
par(mar = c(5, 4, 3, 12))
barplot(cs_table,
        beside      = TRUE,
        main        = "Transactions: Category x Season",
        xlab        = "Season",
        ylab        = "Count",
        col         = rainbow(nrow(cs_table), alpha = 0.8),
        border      = "white",
        legend.text = rownames(cs_table),
        args.legend = list(x = "topright", xpd = TRUE,
                           bty = "n", inset = c(-0.25, 0)))
par(mar = c(5, 4, 4, 2))
dev.off()

9.3 Mosaic Plots

# Gender x Payment Method
save_png("plots/05_bivariate_categorical/mosaic_gender_x_payment.png")
mosaicplot(gp_table,
           main   = "Gender x Payment Method",
           xlab   = "Gender",
           ylab   = "Payment Method",
           col    = c("steelblue", "tomato"),
           border = "white")
dev.off()

# Category x Season
save_png("plots/05_bivariate_categorical/mosaic_category_x_season.png",
         width = 1000)
par(mar = c(5, 9, 3, 1))
mosaicplot(cs_table,
           main   = "Category x Season",
           col    = rainbow(4, alpha = 0.7),
           border = "white",
           las    = 1)
par(mar = c(5, 4, 4, 2))
dev.off()

9.4 Chi-Square Test of Independence

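When you want to test whether two categorical variables are statistically independent, use chisq.test(). A small p-value is evidence against independence. For 2 x 2 tables, R applies Yates' continuity correction by default.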

cat("\n--- Contingency Table: Gender x PaymentMethod ---\n")
print(gp_table)

cat("\nRow proportions:\n")
print(round(prop.table(gp_table, margin = 1), 3))

cat("\nChi-square test:\n")
print(chisq.test(gp_table))
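
To see what the test actually compares, it helps to look at the counts expected under independence and to rebuild the statistic by hand. The sketch below uses a made-up 2 x 2 table (the labels Card and Cash are assumptions) and switches off the continuity correction so that the manual formula matches the reported value.

```r
# Made-up counts for illustration only.
tab <- matrix(c(30, 20, 20, 30), nrow = 2,
              dimnames = list(Gender  = c("Female", "Male"),
                              Payment = c("Card", "Cash")))

res <- chisq.test(tab, correct = FALSE)   # no Yates continuity correction

res$expected                              # 25 in every cell under independence

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
manual_stat <- sum((tab - res$expected)^2 / res$expected)
c(manual = manual_stat, reported = unname(res$statistic))   # both equal 4
```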

10 Covariance and Correlation

Correlation and covariance summarize linear relationships between numeric variables. Correlation is unitless, always lies between -1 and 1, and is easier to interpret; covariance is expressed in the product of the two variables' units, so its size depends on the scales of measurement.

Output folder: plots/06_correlation/

10.1 Computing the Matrices

num_vars   <- store_data[, c("Age", "Amount", "PreviousPurchases", "ItemRating")]
cor_matrix <- cor(num_vars, use = "complete.obs")  # rows with any missing value are dropped
cov_matrix <- cov(num_vars, use = "complete.obs")

cat("\n===== Correlation Matrix =====\n")
print(round(cor_matrix, 3))

cat("\n===== Covariance Matrix =====\n")
print(round(cov_matrix, 2))
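
The link between the two measures can be checked in a few lines. The sketch below uses two short made-up vectors: correlation is covariance divided by both standard deviations, and a change of units rescales the covariance while leaving the correlation untouched.

```r
hours <- c(2, 4, 3, 5, 6)          # made-up values
score <- c(55, 72, 63, 78, 84)

cov_hs <- cov(hours, score)
cor_hs <- cor(hours, score)

# Correlation is covariance rescaled by the two standard deviations.
all.equal(cor_hs, cov_hs / (sd(hours) * sd(score)))

# Converting hours to minutes multiplies the covariance by 60 ...
minutes <- hours * 60
cov(minutes, score) / cov_hs       # 60

# ... but the correlation does not change.
all.equal(cor(minutes, score), cor_hs)
```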

10.2 Correlation Heatmap (Base R)

This is a heatmap built entirely from base R — no extra packages required.

save_png("plots/06_correlation/heatmap_correlation.png",
         width = 800, height = 750)

cor_colors <- colorRampPalette(c("tomato", "white", "steelblue"))(200)

par(mar = c(6, 6, 3, 2))
image(1:ncol(cor_matrix), 1:nrow(cor_matrix),
      t(cor_matrix)[, nrow(cor_matrix):1],
      col  = cor_colors,
      xlab = "", ylab = "",
      axes = FALSE,
      main = "Correlation Heatmap - Numeric Variables",
      zlim = c(-1, 1))

axis(1, at = 1:ncol(cor_matrix),
     labels = colnames(cor_matrix), las = 2, cex.axis = 0.85)
axis(2, at = nrow(cor_matrix):1,
     labels = colnames(cor_matrix), las = 1, cex.axis = 0.85)

for (i in 1:nrow(cor_matrix)) {
  for (j in 1:ncol(cor_matrix)) {
    text(j, nrow(cor_matrix) + 1 - i,
         labels = round(cor_matrix[i, j], 2),
         cex    = 0.9, font = 2,
         col    = ifelse(abs(cor_matrix[i, j]) > 0.5, "white", "black"))
  }
}

par(mar = c(5, 4, 4, 2))
dev.off()

10.3 Correlation Plots Using corrplot

If you have the corrplot package installed, it produces nicer correlation visualizations with much less code.

if (requireNamespace("corrplot", quietly = TRUE)) {
  library(corrplot)
  
  # Full matrix
  save_png("plots/06_correlation/corrplot_full.png",
           width = 800, height = 750)
  corrplot(cor_matrix, method = "color", type = "full",
           addCoef.col = "black", tl.col = "black", tl.srt = 45,
           title = "Correlation Matrix", mar = c(0, 0, 1, 0))
  dev.off()
  
  # Upper triangle only
  save_png("plots/06_correlation/corrplot_upper.png",
           width = 800, height = 750)
  corrplot(cor_matrix, method = "color", type = "upper",
           addCoef.col = "black", tl.col = "black", tl.srt = 45,
           title = "Correlation Matrix (Upper Triangle)",
           mar = c(0, 0, 1, 0))
  dev.off()
}

10.4 Scatter Matrix with Smoothers

save_png("plots/06_correlation/scatter_matrix_with_smooth.png",
         width = 1000, height = 1000)

pairs(num_vars,
      main  = "Scatter Plot Matrix with Smooth Lines",
      pch   = 19,
      col   = rgb(0.27, 0.51, 0.71, 0.2),
      cex   = 0.5,
      panel = panel.smooth)

dev.off()

11 A Small Worked Example: Student Dataset

To consolidate what we have covered, here is a small, easy-to-read example using a hand-crafted student dataset. Working with a tiny dataset like this makes it easy to verify what each plot is showing.

Output folder: plots/07_student_dataset/

11.1 Creating and Inspecting the Data

student_data <- data.frame(
  StudentID  = 1:10,
  StudyHours = c(2, 4, 3, 5, 6, 7, 4, 8, 5, 6),
  Attendance = c(50, 70, 60, 80, 85, 90, 75, 95, 80, 85),
  Score      = c(55, 72, 63, 78, 84, 88, 74, 92, 79, 85)
)

cat("\n===== Student Dataset =====\n")
print(student_data)

s_cov <- cov(student_data[, c("StudyHours", "Attendance", "Score")])
s_cor <- cor(student_data[, c("StudyHours", "Attendance", "Score")])

cat("\n--- Covariance Matrix ---\n");  print(round(s_cov, 4))
cat("\n--- Correlation Matrix ---\n"); print(round(s_cor, 3))

11.2 Scatter Plot Matrix

save_png("plots/07_student_dataset/scatter_matrix_student.png",
         width = 800, height = 800)

pairs(student_data[, c("StudyHours", "Attendance", "Score")],
      main  = "Student Data - Scatter Plot Matrix",
      pch   = 19,
      col   = "steelblue",
      panel = panel.smooth)

dev.off()

11.3 Score vs Study Hours

save_png("plots/07_student_dataset/scatter_score_vs_studyhours.png")

plot(student_data$StudyHours, student_data$Score,
     main = "Score vs Study Hours",
     xlab = "Study Hours",
     ylab = "Score",
     pch  = 19, col = "steelblue", cex = 1.5)

abline(lm(Score ~ StudyHours, data = student_data),
       col = "red", lwd = 2)

text(student_data$StudyHours, student_data$Score,
     labels = student_data$StudentID,
     pos = 3, cex = 0.8, col = "grey40")

dev.off()

11.4 Score vs Attendance

save_png("plots/07_student_dataset/scatter_score_vs_attendance.png")

plot(student_data$Attendance, student_data$Score,
     main = "Score vs Attendance (%)",
     xlab = "Attendance (%)",
     ylab = "Score",
     pch  = 19, col = "darkorange", cex = 1.5)

abline(lm(Score ~ Attendance, data = student_data),
       col = "red", lwd = 2)

dev.off()

11.5 Correlation Heatmap

save_png("plots/07_student_dataset/heatmap_correlation_student.png",
         width = 700, height = 650)

cor_colors <- colorRampPalette(c("tomato", "white", "steelblue"))(200)

par(mar = c(5, 5, 3, 2))
image(1:3, 1:3,
      t(s_cor)[, 3:1],
      col  = cor_colors,
      xlab = "", ylab = "",
      axes = FALSE,
      main = "Correlation Heatmap - Student Dataset",
      zlim = c(-1, 1))

axis(1, at = 1:3, labels = colnames(s_cor), las = 2)
axis(2, at = 3:1, labels = colnames(s_cor), las = 1)

for (i in 1:3) {
  for (j in 1:3) {
    text(j, 4 - i, round(s_cor[i, j], 3),
         cex = 1.1, font = 2,
         col = ifelse(abs(s_cor[i, j]) > 0.5, "white", "black"))
  }
}

par(mar = c(5, 4, 4, 2) + 0.1)  # restore the default margins
dev.off()
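
The t(...)[, 3:1] step is easy to get wrong, so here is a small sketch of why it is needed. image() draws z[i, j] at position (x[i], y[j]) with y increasing upward, which means a matrix passed in directly appears rotated. The matrix m below is made up and deliberately non-symmetric so that an orientation mistake would be visible.

```r
# Toy matrix (not from the notebook's data): 2 rows, 3 columns
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# Transpose so matrix columns run along the x-axis, then reverse the
# columns of the transposed matrix so row 1 ends up at the top.
z <- t(m)[, nrow(m):1]

pdf(file = NULL)  # throwaway device: nothing is written to disk
image(seq_len(ncol(m)), seq_len(nrow(m)), z,
      axes = FALSE, xlab = "", ylab = "")
axis(1, at = seq_len(ncol(m)),      labels = colnames(m))
axis(2, at = rev(seq_len(nrow(m))), labels = rownames(m), las = 1)

# Cell m[i, j] sits at x = j, y = nrow(m) + 1 - i
for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    text(j, nrow(m) + 1 - i, m[i, j])
  }
}
invisible(dev.off())
```

For a symmetric correlation matrix a wrong orientation can go unnoticed, which is why testing the recipe on a non-symmetric matrix first is a reasonable habit.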

11.6 Mosaic Plot and Chi-Square Test

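# Illustrative 2x2 table (30 observations), entered by hand rather than
# derived from student_data.
# matrix() fills column by column: Pass = (15, 2), Fail = (5, 8).
ct_example <- matrix(
  c(15, 2, 5, 8),
  nrow = 2,
  dimnames = list(Attendance = c("High", "Low"),
                  Result     = c("Pass", "Fail"))
)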

cat("\n--- Contingency Table: Attendance x Result ---\n")
print(ct_example)

cat("\nRow %:\n");      print(round(prop.table(ct_example, 1) * 100, 1))
cat("\nColumn %:\n");   print(round(prop.table(ct_example, 2) * 100, 1))
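# One expected count in this table is below 5, so chisq.test() warns that the
# chi-squared approximation may be incorrect; treat the p-value as a rough guide.
cat("\nChi-square:\n"); print(chisq.test(ct_example))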

save_png("plots/07_student_dataset/mosaic_attendance_x_result.png")
mosaicplot(ct_example,
           main   = "Attendance x Pass/Fail",
           col    = c("steelblue", "tomato"),
           border = "white")
dev.off()
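
Because this table is small, it can help to look at the expected counts directly and to cross-check the result with Fisher's exact test, which does not rely on the large-sample approximation. The sketch below is one way to do that; the names chi, small_cells, and fisher are illustrative.

```r
ct_example <- matrix(
  c(15, 2, 5, 8),
  nrow = 2,
  dimnames = list(Attendance = c("High", "Low"),
                  Result     = c("Pass", "Fail"))
)

# suppressWarnings() only hides the small-count warning so we can inspect it ourselves
chi <- suppressWarnings(chisq.test(ct_example))

cat("Expected counts under independence:\n")
print(round(chi$expected, 2))

# A common rule of thumb: the approximation is doubtful if any expected count < 5
small_cells <- sum(chi$expected < 5)
cat(sprintf("Cells with expected count below 5: %d\n", small_cells))

# Exact alternative for 2x2 tables
fisher <- fisher.test(ct_example)
cat(sprintf("Chi-square p-value: %.4f | Fisher exact p-value: %.4f\n",
            chi$p.value, fisher$p.value))
```

If the two p-values lead to the same conclusion, the chi-square result is probably safe to report; if they disagree, the exact test is usually the more defensible choice for a table this size.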

12 Multi-Panel Overview Plots

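Once you have explored the data thoroughly, it is often useful to assemble a single summary figure that captures the key findings. Calling par(mfrow = c(rows, cols)) splits the plotting device into a grid that is filled row by row: each subsequent high-level plot call (hist(), boxplot(), plot(), and so on) draws into the next cell.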

Output folder: plots/08_overview/
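
A small idiom worth knowing before building panels: par() returns the previous settings when you change them, so you can store that return value and hand it back to par() afterwards instead of resetting each parameter by hand. The sketch below uses placeholder plots and a throwaway device; old_par and restored are illustrative names.

```r
pdf(file = NULL)  # throwaway device: nothing is written to disk

# par() returns the old values of whatever it changes
old_par <- par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))

for (k in 1:6) {
  plot(1:10, (1:10)^(k / 3),
       main = paste("Panel", k), xlab = "x", ylab = "y")
}

par(old_par)               # restore mfrow and mar in one call
restored <- par("mfrow")   # back to a single panel
print(restored)

invisible(dev.off())
```

This matters most in interactive sessions, where a forgotten mfrow setting silently carries over to the next plot. When a device is closed with dev.off(), its settings are discarded anyway.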

12.1 Six-Panel Summary

save_png("plots/08_overview/overview_6panel.png",
         width = 1400, height = 950, res = 130)

par(mfrow = c(2, 3))

# Panel 1: Distribution of Amount
hist(store_data$Amount,
     col = "steelblue", border = "white",
     main = "Distribution of Amount",
     xlab = "Amount ($)", breaks = 25)
abline(v = mean(store_data$Amount, na.rm = TRUE), col = "red", lty = 2, lwd = 2)

# Panel 2: Age distribution
hist(store_data$Age,
     col = "darkorange", border = "white",
     main = "Age Distribution", xlab = "Age (years)")

# Panel 3: Amount by Season
boxplot(Amount ~ Season, data = store_data,
        col = c("darkorange", "steelblue", "gold", "skyblue"),
        border = "grey30",
        main = "Amount by Season",
        xlab = "Season", ylab = "Amount ($)", cex.axis = 0.8)

# Panel 4: Amount vs Age
plot(store_data$Age, store_data$Amount,
     pch = 19, col = rgb(0.27, 0.51, 0.71, 0.3), cex = 0.6,
     main = "Amount vs Age",
     xlab = "Age", ylab = "Amount ($)")
abline(lm(Amount ~ Age, data = store_data), col = "red", lwd = 2)

# Panel 5: Category counts
barplot(sort(table(store_data$Category), decreasing = TRUE),
        col = "steelblue", border = "white",
        main = "Category Counts", ylab = "Count",
        las = 2, cex.names = 0.6)

# Panel 6: ECDF of Amount
plot(ecdf(store_data$Amount),
     col = "steelblue", lwd = 2, pch = NA,
     main = "ECDF - Amount",
     xlab = "Amount ($)", ylab = "Cumulative Proportion")
abline(v = median(store_data$Amount, na.rm = TRUE), h = 0.5, col = "red", lty = 2)

par(mfrow = c(1, 1))
dev.off()

12.2 Four-Panel Univariate Numeric Overview

save_png("plots/08_overview/overview_univariate_numeric.png",
         width = 1200, height = 900, res = 120)

par(mfrow = c(2, 2))

hist(store_data$Amount,
     col = "steelblue", border = "white",
     main = "Histogram: Amount", xlab = "Amount ($)", breaks = 25)

d <- density(store_data$Amount, na.rm = TRUE)  # density() stops on NA unless told to drop them
plot(d, col = "steelblue", lwd = 2,
     main = "Density: Amount", xlab = "Amount ($)")
polygon(d, col = rgb(0.27, 0.51, 0.71, 0.2), border = NA)

boxplot(store_data$Amount,
        col = "lightblue", border = "steelblue",
        main = "Boxplot: Amount", ylab = "Amount ($)")

qqnorm(store_data$Amount,
       main = "Q-Q Plot: Amount",
       pch = 19, col = rgb(0.27, 0.51, 0.71, 0.4), cex = 0.6)
qqline(store_data$Amount, col = "red", lwd = 2)

par(mfrow = c(1, 1))
dev.off()
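
If you want the same four panels for another numeric column, one option is to wrap them in a function. The sketch below is only a suggestion: plot_numeric_overview is a hypothetical helper, and toy_amount is simulated data standing in for a real column.

```r
# Hypothetical helper: histogram, density, boxplot, and Q-Q plot for one vector
plot_numeric_overview <- function(x, label) {
  x <- x[!is.na(x)]                 # drop missing values once, up front
  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))             # restore the layout even if a plot fails

  hist(x, col = "steelblue", border = "white",
       main = paste("Histogram:", label), xlab = label, breaks = 25)

  d <- density(x)
  plot(d, col = "steelblue", lwd = 2,
       main = paste("Density:", label), xlab = label)
  polygon(d, col = rgb(0.27, 0.51, 0.71, 0.2), border = NA)

  boxplot(x, col = "lightblue", border = "steelblue",
          main = paste("Boxplot:", label), ylab = label)

  qqnorm(x, main = paste("Q-Q Plot:", label),
         pch = 19, col = rgb(0.27, 0.51, 0.71, 0.4), cex = 0.6)
  qqline(x, col = "red", lwd = 2)

  invisible(length(x))              # number of non-missing values plotted
}

# Simulated, right-skewed stand-in for a real column (values are made up)
set.seed(1)
toy_amount <- c(rexp(200, rate = 1 / 50), NA)

pdf(file = NULL)  # throwaway device: nothing is written to disk
n_used <- plot_numeric_overview(toy_amount, "Amount ($)")
invisible(dev.off())
```

With the real data, a call such as plot_numeric_overview(store_data$Age, "Age (years)") between save_png() and dev.off() should produce the equivalent figure for Age.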

12.3 Four-Panel Bivariate Boxplot Overview

save_png("plots/08_overview/overview_bivariate_boxplots.png",
         width = 1200, height = 900, res = 120)

par(mfrow = c(2, 2))

boxplot(Amount ~ Season, data = store_data,
        col = c("darkorange", "steelblue", "gold", "skyblue"),
        border = "grey30",
        main = "Amount by Season",
        xlab = "Season", ylab = "Amount ($)")

boxplot(Amount ~ Gender, data = store_data,
        col = c("tomato", "steelblue"), border = "grey30",
        main = "Amount by Gender",
        xlab = "Gender", ylab = "Amount ($)")

boxplot(Amount ~ PaymentMethod, data = store_data,
        col = c("steelblue", "darkorange"), border = "grey30",
        main = "Amount by Payment Method",
        xlab = "Payment Method", ylab = "Amount ($)")

boxplot(ItemRating ~ Season, data = store_data,
        col = c("darkorange", "steelblue", "gold", "skyblue"),
        border = "grey30",
        main = "Item Rating by Season",
        xlab = "Season", ylab = "Item Rating (1-5)")

par(mfrow = c(1, 1))
dev.off()
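
When several panels share the same response variable, a loop over the grouping variables can cut down on repetition. This is a sketch on simulated data: toy_sales and its category labels are made up, and only the column names mirror those used above. reformulate() builds the formula Amount ~ <group> from strings.

```r
# Simulated stand-in for store_data (all values and labels are made up)
set.seed(42)
toy_sales <- data.frame(
  Amount        = round(runif(120, min = 5, max = 200), 2),
  Season        = sample(c("Fall", "Spring", "Summer", "Winter"), 120, replace = TRUE),
  Gender        = sample(c("Female", "Male"), 120, replace = TRUE),
  PaymentMethod = sample(c("Card", "Cash"), 120, replace = TRUE)
)

group_vars <- c("Season", "Gender", "PaymentMethod")

pdf(file = NULL)  # throwaway device: nothing is written to disk
old_par <- par(mfrow = c(2, 2))

for (g in group_vars) {
  # reformulate("Season", response = "Amount") gives Amount ~ Season
  boxplot(reformulate(g, response = "Amount"), data = toy_sales,
          col = "steelblue", border = "grey30",
          main = paste("Amount by", g),
          xlab = g, ylab = "Amount ($)")
}

par(old_par)
invisible(dev.off())
```

The trade-off is that a loop makes per-panel customization (such as the season-specific colors above) a little more awkward, so the explicit version remains a fine choice for a handful of panels.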

13 Listing All Saved Plots

A final convenience: print a tree-like summary of every figure saved during the session. Useful as a sanity check that nothing was missed.

cat("All plots saved. Folder structure:\n\n")

for (f in folders) {
  files <- list.files(f)
  cat(sprintf("  %s/  (%d files)\n", f, length(files)))
  for (fname in files) {  # 'fname' avoids masking base::file()
    cat(sprintf("    - %s\n", fname))
  }
}
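
An alternative sketch, assuming you would rather have the listing as a data frame you can filter: list.files() can search recursively, and file.size() can flag files that came out empty. The helper summarise_plots and the demo folder under tempdir() are hypothetical; point the function at "plots" to inspect the real output.

```r
# Hypothetical helper: one row per PNG found anywhere under 'root'
summarise_plots <- function(root) {
  paths <- list.files(root, pattern = "\\.png$",
                      recursive = TRUE, full.names = TRUE)
  data.frame(
    folder = dirname(paths),
    file   = basename(paths),
    bytes  = file.size(paths),
    stringsAsFactors = FALSE
  )
}

# Demo on a temporary folder with two placeholder files (not real plots)
demo_root <- file.path(tempdir(), "plots_demo")
dir.create(file.path(demo_root, "a"), recursive = TRUE, showWarnings = FALSE)
writeLines("placeholder", file.path(demo_root, "a", "ok.png"))
invisible(file.create(file.path(demo_root, "a", "empty.png")))

plot_index <- summarise_plots(demo_root)
print(plot_index[, c("file", "bytes")])

# Zero-byte files may indicate that something went wrong while writing
empty_plots <- plot_index$file[plot_index$bytes == 0]
cat("Empty files:", empty_plots, "\n")
```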

14 Summary and Next Steps

This notebook covered the standard EDA toolkit in base R: histograms, density plots, boxplots, strip plots, ECDFs, and Q-Q plots for univariate exploration; bar charts, pie charts, and Pareto charts for categorical data; scatter plots, correlation heatmaps, and pairs plots for relationships among numeric variables; and side-by-side boxplots, contingency tables, and mosaic plots for mixed and categorical pairings.

Two takeaways worth keeping in mind as you build on this material:

First, every plot should answer a question. Before producing a figure, ask yourself what you are trying to learn from it. After producing it, write down what you actually learned. If you cannot, the plot is probably not earning its place in your analysis.

Second, base R is more than enough for serious EDA. Specialized packages such as ggplot2 are powerful and worth learning, but most of what you need can be done with the tools demonstrated here. Mastering the basics first will make the advanced packages much easier to pick up later.

Practice exercises:

  1. Reproduce the histogram of Age with both mean and median lines and a legend.
  2. Create a side-by-side boxplot of DiscountApplied by Category with rotated x-axis labels.
  3. Build a 2x2 panel showing the four numeric variables, each with its own ECDF.
  4. Use chisq.test() to test independence between Gender and Category. Interpret the result in plain language.