1 Preface

This tutorial is designed as a complete book-style R Markdown resource for students, researchers, data analysts, business analysts, public health researchers, and early-career statisticians.

Exploratory Data Analysis, commonly called EDA, is the first serious conversation we have with data. Before applying regression, machine learning, forecasting, Bayesian modelling, artificial intelligence, or policy analysis, we must understand the data.

EDA helps us answer:

  • What type of data do we have?
  • What is the quality of the data?
  • Are there missing values?
  • Are there outliers?
  • What is the distribution of each variable?
  • Are variables related?
  • Are assumptions reasonable for modelling?
  • What story does the data suggest?

This book covers EDA and descriptive statistics from basic to advanced level using R. This version is corrected for RPubs knitting and avoids common package-conflict errors.

2 Introduction to Data and EDA

2.1 What is Data?

Data are recorded observations, measurements, facts, or responses collected for analysis, interpretation, decision-making, or prediction.

Examples of data include student marks, patient blood pressure, daily temperature, customer sales, website clicks, and survey responses.

example_data <- tibble(
  Student = c("A", "B", "C", "D", "E"),
  Age = c(21, 24, 22, 25, 23),
  Gender = c("Female", "Male", "Female", "Male", "Female"),
  Score = c(78, 85, 69, 92, 81),
  Attendance = c(90, 88, 76, 95, 84)
)

kable(example_data, caption = "Example of a Small Dataset") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Example of a Small Dataset
Student Age Gender Score Attendance
A 21 Female 78 90
B 24 Male 85 88
C 22 Female 69 76
D 25 Male 92 95
E 23 Female 81 84

2.2 What is Exploratory Data Analysis?

Exploratory Data Analysis is the systematic process of inspecting, summarising, visualising, and questioning data to discover patterns, anomalies, relationships, data quality problems, and modelling assumptions.

EDA is not only about producing graphs. It is about thinking critically.

Critical EDA questions:
1. What is the unit of observation?
2. What does each row represent?
3. What does each column measure?
4. Are the values reasonable?
5. Are there duplicated records?
6. Are there missing or impossible values?
7. Are the distributions symmetric, skewed, or multimodal?
8. Which variables may explain the outcome of interest?

2.3 Why EDA Matters

Skipping EDA is dangerous. A model can produce results even when the data contain errors, miscoded variables, missing values, outliers, or invalid assumptions.

EDA supports:

  • data cleaning;
  • assumption checking;
  • model selection;
  • hypothesis generation;
  • communication;
  • reproducibility.

2.4 EDA versus Confirmatory Analysis

eda_confirmatory <- tibble(
  Feature = c("Purpose", "Question Type", "Main Tools", "Output"),
  EDA = c(
    "Explore and understand data",
    "What patterns exist?",
    "Tables, graphs, summaries",
    "Insights, assumptions, possible hypotheses"
  ),
  Confirmatory_Analysis = c(
    "Test specific hypotheses",
    "Is there evidence for a specific claim?",
    "Statistical tests and models",
    "P-values, confidence intervals, model estimates"
  )
)

kable(eda_confirmatory, caption = "EDA versus Confirmatory Analysis") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
EDA versus Confirmatory Analysis
Feature EDA Confirmatory_Analysis
Purpose Explore and understand data Test specific hypotheses
Question Type What patterns exist? Is there evidence for a specific claim?
Main Tools Tables, graphs, summaries Statistical tests and models
Output Insights, assumptions, possible hypotheses P-values, confidence intervals, model estimates

3 Data Types and Measurement Scales

3.1 Qualitative and Quantitative Data

data_types <- tibble(
  Type = c("Qualitative", "Quantitative"),
  Meaning = c("Non-numerical categories", "Numerical measurements or counts"),
  Examples = c("Gender, blood group, region", "Age, income, weight, marks"),
  Common_Graphs = c("Bar chart, pie chart, mosaic plot", "Histogram, boxplot, density plot, scatterplot")
)

kable(data_types, caption = "Qualitative and Quantitative Data") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Qualitative and Quantitative Data
Type Meaning Examples Common_Graphs
Qualitative Non-numerical categories Gender, blood group, region Bar chart, pie chart, mosaic plot
Quantitative Numerical measurements or counts Age, income, weight, marks Histogram, boxplot, density plot, scatterplot

3.2 Discrete and Continuous Data

Discrete data are countable values. Continuous data are measurable values that can take many values within an interval.

Examples:

  • Discrete: number of children, number of hospital visits, number of defects.
  • Continuous: height, temperature, time, income, blood pressure.

3.3 Measurement Scales

scales_table <- tibble(
  Scale = c("Nominal", "Ordinal", "Interval", "Ratio"),
  Description = c(
    "Categories without natural order",
    "Categories with meaningful order",
    "Numerical scale with equal intervals but no true zero",
    "Numerical scale with equal intervals and true zero"
  ),
  Examples = c(
    "Gender, region, blood group",
    "Education level, satisfaction rating",
    "Temperature in Celsius",
    "Age, income, height, weight"
  ),
  Suitable_Summary = c(
    "Count, percentage, mode",
    "Median, percentile, rank",
    "Mean, SD, correlation",
    "All arithmetic summaries"
  )
)

kable(scales_table, caption = "Measurement Scales") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Measurement Scales
Scale Description Examples Suitable_Summary
Nominal Categories without natural order Gender, region, blood group Count, percentage, mode
Ordinal Categories with meaningful order Education level, satisfaction rating Median, percentile, rank
Interval Numerical scale with equal intervals but no true zero Temperature in Celsius Mean, SD, correlation
Ratio Numerical scale with equal intervals and true zero Age, income, height, weight All arithmetic summaries

The measurement scale determines the appropriate statistical method. A mean is meaningful for income but not meaningful for blood group.

3.4 Example: Correct and Incorrect Use

scale_example <- tibble(
  Variable = c("Blood Group", "Satisfaction Rating", "Temperature", "Income"),
  Scale = c("Nominal", "Ordinal", "Interval", "Ratio"),
  Poor_Summary = c("Mean blood group", "Mean may be weak", "Percentage only", "Mode only"),
  Better_Summary = c("Frequency table", "Median and distribution", "Mean and SD", "Mean, median, SD, IQR")
)

kable(scale_example, caption = "Choosing Summaries by Measurement Scale") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Choosing Summaries by Measurement Scale
Variable Scale Poor_Summary Better_Summary
Blood Group Nominal Mean blood group Frequency table
Satisfaction Rating Ordinal Mean may be weak Median and distribution
Temperature Interval Percentage only Mean and SD
Income Ratio Mode only Mean, median, SD, IQR

4 Loading, Viewing, and Understanding Data in R

4.1 Loading Built-in Datasets

This book uses built-in R datasets so all examples run immediately.

data(mtcars)
data(iris)
data(airquality)

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

4.2 Understanding Rows and Columns

dim(mtcars)
## [1] 32 11

The mtcars dataset has rows representing cars and columns representing car characteristics.

4.3 Structure of Data

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

4.4 Summary of Data

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

4.5 Creating a Cleaner Dataset

Some numerical columns in mtcars are actually categorical variables.

cars_data <- mtcars %>%
  rownames_to_column("car_name") %>%
  mutate(
    cyl = factor(cyl),
    am = factor(am, labels = c("Automatic", "Manual")),
    gear = factor(gear),
    carb = factor(carb)
  )

glimpse(cars_data)
## Rows: 32
## Columns: 12
## $ car_name <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive",…
## $ mpg      <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 1…
## $ cyl      <fct> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4…
## $ disp     <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8…
## $ hp       <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180,…
## $ drat     <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3…
## $ wt       <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150…
## $ qsec     <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90…
## $ vs       <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1…
## $ am       <fct> Manual, Manual, Manual, Automatic, Automatic, Automatic, Auto…
## $ gear     <fct> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3…
## $ carb     <fct> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1…

4.6 Checking Duplicate Rows

sum(duplicated(cars_data))
## [1] 0

4.7 Checking Variable Names

names(cars_data)
##  [1] "car_name" "mpg"      "cyl"      "disp"     "hp"       "drat"    
##  [7] "wt"       "qsec"     "vs"       "am"       "gear"     "carb"

A good EDA begins with understanding the data dictionary: variable names, meanings, units, and coding rules.

5 Descriptive Statistics

5.1 Meaning of Descriptive Statistics

Descriptive statistics are numerical and graphical techniques used to summarise the main features of a dataset.

They include:

  • central tendency;
  • dispersion;
  • shape;
  • position;
  • frequency;
  • relationship.

5.2 Numerical Summary in R

psych::describe(mtcars) %>%
  round(3) %>%
  kable(caption = "Descriptive Statistics for mtcars") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Descriptive Statistics for mtcars
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.091 6.027 19.200 19.696 5.411 10.400 33.900 23.500 0.611 -0.373 1.065
cyl 2 32 6.188 1.786 6.000 6.231 2.965 4.000 8.000 4.000 -0.175 -1.762 0.316
disp 3 32 230.722 123.939 196.300 222.523 140.476 71.100 472.000 400.900 0.382 -1.207 21.909
hp 4 32 146.688 68.563 123.000 141.192 77.095 52.000 335.000 283.000 0.726 -0.136 12.120
drat 5 32 3.597 0.535 3.695 3.579 0.704 2.760 4.930 2.170 0.266 -0.715 0.095
wt 6 32 3.217 0.978 3.325 3.153 0.767 1.513 5.424 3.911 0.423 -0.023 0.173
qsec 7 32 17.849 1.787 17.710 17.828 1.416 14.500 22.900 8.400 0.369 0.335 0.316
vs 8 32 0.438 0.504 0.000 0.423 0.000 0.000 1.000 1.000 0.240 -2.002 0.089
am 9 32 0.406 0.499 0.000 0.385 0.000 0.000 1.000 1.000 0.364 -1.925 0.088
gear 10 32 3.688 0.738 4.000 3.615 1.483 3.000 5.000 2.000 0.529 -1.070 0.130
carb 11 32 2.812 1.615 2.000 2.654 1.483 1.000 8.000 7.000 1.051 1.257 0.286

5.3 Custom Summary Table

custom_summary <- mtcars %>%
  summarise(
    mean_mpg = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg = sd(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg),
    iqr_mpg = IQR(mpg),
    skewness_mpg = skewness(mpg),
    kurtosis_mpg = kurtosis(mpg)
  )

kable(custom_summary, digits = 3, caption = "Custom Summary for MPG") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Custom Summary for MPG
mean_mpg median_mpg sd_mpg min_mpg max_mpg iqr_mpg skewness_mpg kurtosis_mpg
20.091 19.2 6.027 10.4 33.9 7.375 0.64 2.799

5.4 Advanced Example: Summarising Many Variables

summary_many <- mtcars %>%
  summarise(across(
    where(is.numeric),
    list(
      mean = mean,
      median = median,
      sd = sd,
      min = min,
      max = max,
      skew = skewness
    ),
    .names = "{.col}_{.fn}"
  ))

summary_many[, 1:12] %>%
  kable(digits = 3, caption = "Multiple Summary Measures") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Multiple Summary Measures
mpg_mean mpg_median mpg_sd mpg_min mpg_max mpg_skew cyl_mean cyl_median cyl_sd cyl_min cyl_max cyl_skew
20.091 19.2 6.027 10.4 33.9 0.64 6.188 6 1.786 4 8 -0.183

6 Measures of Central Tendency

6.1 Mean

The mean is the arithmetic average of all observations.

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \]

mean(mtcars$mpg)
## [1] 20.09062

6.2 Median

The median is the middle value after observations are arranged in ascending order.

median(mtcars$mpg)
## [1] 19.2

6.3 Mode

The mode is the most frequently occurring value.

get_mode <- function(x){
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

get_mode(mtcars$cyl)
## [1] 8

6.4 Comparing Mean, Median, and Mode

central_table <- tibble(
  Statistic = c("Mean", "Median", "Mode"),
  Best_For = c(
    "Symmetric numerical data",
    "Skewed numerical data",
    "Categorical or repeated values"
  ),
  Sensitive_to_Outliers = c("Yes", "No", "No"),
  Example = c("Average height", "Median income", "Most common blood group")
)

kable(central_table, caption = "Mean, Median and Mode") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Mean, Median and Mode
Statistic Best_For Sensitive_to_Outliers Example
Mean Symmetric numerical data Yes Average height
Median Skewed numerical data No Median income
Mode Categorical or repeated values No Most common blood group

6.5 Example: Outlier Effect

normal_values <- c(45, 48, 50, 51, 52, 53, 55)
outlier_values <- c(normal_values, 150)

outlier_effect <- tibble(
  Dataset = c("Without Outlier", "With Outlier"),
  Mean = c(mean(normal_values), mean(outlier_values)),
  Median = c(median(normal_values), median(outlier_values)),
  Difference = c(mean(normal_values) - median(normal_values),
                 mean(outlier_values) - median(outlier_values))
)

kable(outlier_effect, digits = 2, caption = "Effect of Outlier on Mean and Median") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Effect of Outlier on Mean and Median
Dataset Mean Median Difference
Without Outlier 50.57 51.0 -0.43
With Outlier 63.00 51.5 11.50
outlier_df <- tibble(
  value = c(normal_values, 150),
  group = c(rep("Normal values", length(normal_values)), "Extreme outlier")
)

ggplot(outlier_df, aes(x = value)) +
  geom_histogram(bins = 12, fill = "#5DADE2", color = "white") +
  geom_vline(aes(xintercept = mean(value)), color = "#C0392B", linewidth = 1.2) +
  geom_vline(aes(xintercept = median(value)), color = "#1E8449", linewidth = 1.2, linetype = "dashed") +
  labs(
    title = "Mean is Pulled Toward the Outlier",
    subtitle = "Red line = mean, green dashed line = median",
    x = "Value",
    y = "Frequency"
  ) +
  theme_minimal()

For skewed data, report both mean and median. If they differ strongly, explain why.

7 Measures of Dispersion

7.1 Range

\[ Range = Maximum - Minimum \]

max(mtcars$mpg) - min(mtcars$mpg)
## [1] 23.5

7.2 Variance

Variance measures the average squared deviation from the mean.

\[ s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} \]

var(mtcars$mpg)
## [1] 36.3241

7.3 Standard Deviation

\[ s = \sqrt{s^2} \]

sd(mtcars$mpg)
## [1] 6.026948

7.4 Interquartile Range

\[ IQR = Q_3 - Q_1 \]

IQR(mtcars$mpg)
## [1] 7.375

7.5 Coefficient of Variation

The coefficient of variation compares standard deviation relative to the mean.

\[ CV = \frac{s}{\bar{x}} \times 100 \]

cv_mpg <- sd(mtcars$mpg) / mean(mtcars$mpg) * 100
cv_hp <- sd(mtcars$hp) / mean(mtcars$hp) * 100

tibble(
  Variable = c("MPG", "Horsepower"),
  CV_Percentage = c(cv_mpg, cv_hp)
) %>%
  kable(digits = 2, caption = "Coefficient of Variation") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Coefficient of Variation
Variable CV_Percentage
MPG 30.00
Horsepower 46.74

7.6 Advanced Example: Spread by Group

cars_data %>%
  group_by(cyl) %>%
  summarise(
    n = n(),
    mean_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    iqr_mpg = IQR(mpg),
    cv_mpg = sd(mpg) / mean(mpg) * 100,
    .groups = "drop"
  ) %>%
  kable(digits = 2, caption = "MPG Dispersion by Cylinder Group") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
MPG Dispersion by Cylinder Group
cyl n mean_mpg sd_mpg iqr_mpg cv_mpg
4 11 26.66 4.51 7.60 16.91
6 7 19.74 1.45 2.35 7.36
8 14 15.10 2.56 1.85 16.95

8 Frequency Tables and Categorical EDA

8.1 Frequency Table

table(cars_data$cyl)
## 
##  4  6  8 
## 11  7 14

8.2 Percentage Table

round(prop.table(table(cars_data$cyl)) * 100, 2)
## 
##     4     6     8 
## 34.38 21.88 43.75

8.3 Cross-tabulation

cyl_am_table <- table(cars_data$cyl, cars_data$am)
cyl_am_table
##    
##     Automatic Manual
##   4         3      8
##   6         4      3
##   8        12      2

8.4 Row Percentage

round(prop.table(cyl_am_table, margin = 1) * 100, 2)
##    
##     Automatic Manual
##   4     27.27  72.73
##   6     57.14  42.86
##   8     85.71  14.29

8.5 Bar Chart

ggplot(cars_data, aes(x = cyl, fill = cyl)) +
  geom_bar(color = "black") +
  labs(
    title = "Number of Cars by Cylinder Category",
    x = "Cylinders",
    y = "Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

8.6 Stacked Percentage Bar Chart

ggplot(cars_data, aes(x = cyl, fill = am)) +
  geom_bar(position = "fill", color = "black") +
  scale_y_continuous(labels = percent) +
  labs(
    title = "Transmission Type within Cylinder Groups",
    x = "Cylinders",
    y = "Percentage",
    fill = "Transmission"
  ) +
  theme_minimal()

For categorical variables, counts alone may be misleading. Percentages are often easier to compare across groups.

9 Graphical EDA for Numerical Variables

9.1 Histogram

ggplot(cars_data, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "#2E86C1", color = "white") +
  labs(
    title = "Histogram of Miles per Gallon",
    x = "Miles per Gallon",
    y = "Frequency"
  ) +
  theme_minimal()

9.2 Density Plot

ggplot(cars_data, aes(x = mpg)) +
  geom_density(fill = "#58D68D", alpha = 0.7) +
  labs(
    title = "Density Plot of Miles per Gallon",
    x = "Miles per Gallon",
    y = "Density"
  ) +
  theme_minimal()

9.3 Boxplot

ggplot(cars_data, aes(y = mpg)) +
  geom_boxplot(fill = "#F5B041", color = "black") +
  labs(
    title = "Boxplot of Miles per Gallon",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

9.4 Violin Plot

ggplot(cars_data, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_violin(trim = FALSE, alpha = 0.7) +
  geom_boxplot(width = 0.15, fill = "white") +
  labs(
    title = "MPG Distribution by Cylinder Group",
    x = "Cylinders",
    y = "Miles per Gallon"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

9.5 Ridgeline-Style Example without Extra Packages

ggplot(cars_data, aes(x = mpg, fill = cyl)) +
  geom_density(alpha = 0.45) +
  facet_wrap(~cyl, ncol = 1) +
  labs(
    title = "Density of MPG by Cylinder Group",
    x = "MPG",
    y = "Density"
  ) +
  theme_minimal()

10 Skewness

10.1 Definition

Skewness measures the asymmetry of a distribution.

\[ Skewness = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{s^3} \]

10.2 Interpretation of Skewness

skew_table <- tibble(
  Skewness_Value = c("Approximately 0", "Positive", "Negative", "Greater than 1", "Less than -1"),
  Interpretation = c(
    "Approximately symmetric",
    "Right-skewed distribution",
    "Left-skewed distribution",
    "Strong positive skewness",
    "Strong negative skewness"
  ),
  Mean_Median_Relationship = c(
    "Mean approximately equals median",
    "Mean often greater than median",
    "Mean often less than median",
    "Mean can be strongly pulled right",
    "Mean can be strongly pulled left"
  )
)

kable(skew_table, caption = "Interpretation of Skewness") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Interpretation of Skewness
Skewness_Value Interpretation Mean_Median_Relationship
Approximately 0 Approximately symmetric Mean approximately equals median
Positive Right-skewed distribution Mean often greater than median
Negative Left-skewed distribution Mean often less than median
Greater than 1 Strong positive skewness Mean can be strongly pulled right
Less than -1 Strong negative skewness Mean can be strongly pulled left

10.3 Visual Examples

skew_example <- tibble(
  symmetric = rnorm(1000, mean = 50, sd = 10),
  positive_skew = rexp(1000, rate = 0.1),
  negative_skew = -rexp(1000, rate = 0.1)
) %>%
  pivot_longer(cols = everything(), names_to = "Distribution", values_to = "Value")

ggplot(skew_example, aes(x = Value, fill = Distribution)) +
  geom_histogram(bins = 30, color = "white", alpha = 0.8) +
  facet_wrap(~Distribution, scales = "free") +
  labs(
    title = "Symmetric, Positive Skew and Negative Skew Distributions",
    x = "Value",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

10.4 Skewness in Practice

skew_summary <- mtcars %>%
  summarise(across(where(is.numeric), skewness)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Skewness") %>%
  arrange(desc(abs(Skewness)))

kable(skew_summary, digits = 3, caption = "Skewness of mtcars Variables") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Skewness of mtcars Variables
Variable Skewness
carb 1.102
hp 0.761
mpg 0.640
gear 0.555
wt 0.444
disp 0.400
qsec 0.387
am 0.382
drat 0.279
vs 0.252
cyl -0.183

10.5 Transformation for Skewed Data

income <- rexp(1000, rate = 0.002)

income_df <- tibble(
  income = income,
  log_income = log(income)
)

p1 <- ggplot(income_df, aes(x = income)) +
  geom_histogram(bins = 30, fill = "#3498DB", color = "white") +
  labs(title = "Original Skewed Income", x = "Income", y = "Frequency") +
  theme_minimal()

p2 <- ggplot(income_df, aes(x = log_income)) +
  geom_histogram(bins = 30, fill = "#2ECC71", color = "white") +
  labs(title = "Log-Transformed Income", x = "Log Income", y = "Frequency") +
  theme_minimal()

gridExtra::grid.arrange(p1, p2, ncol = 2)

If data are strongly right-skewed, log transformation can make patterns clearer. However, transformation changes interpretation, so always explain it.

11 Kurtosis and Distribution Shape

11.1 Definition

Kurtosis describes the tail heaviness of a distribution.

kurtosis(mtcars$mpg)
## [1] 2.799467

11.2 Types of Kurtosis

kurtosis_table <- tibble(
  Type = c("Mesokurtic", "Leptokurtic", "Platykurtic"),
  Meaning = c(
    "Similar to normal distribution",
    "Heavy tails and sharper peak",
    "Light tails and flatter peak"
  ),
  Practical_Concern = c(
    "Standard assumption may be reasonable",
    "Extreme values are more likely",
    "Extreme values are less common"
  )
)

kable(kurtosis_table, caption = "Types of Kurtosis") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Types of Kurtosis
Type Meaning Practical_Concern
Mesokurtic Similar to normal distribution Standard assumption may be reasonable
Leptokurtic Heavy tails and sharper peak Extreme values are more likely
Platykurtic Light tails and flatter peak Extreme values are less common

11.3 Q-Q Plot

ggplot(cars_data, aes(sample = mpg)) +
  stat_qq(color = "#2E86C1") +
  stat_qq_line(color = "#D35400", linewidth = 1) +
  labs(
    title = "Normal Q-Q Plot for MPG",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal()

11.4 Advanced Example: Comparing Normal and Heavy-Tailed Data

tail_data <- tibble(
  Normal = rnorm(1000),
  Heavy_Tailed = rt(1000, df = 3)
) %>%
  pivot_longer(cols = everything(), names_to = "Distribution", values_to = "Value")

ggplot(tail_data, aes(x = Value, fill = Distribution)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Normal versus Heavy-Tailed Distribution",
    x = "Value",
    y = "Density"
  ) +
  theme_minimal()

12 Outlier Detection

12.1 What is an Outlier?

An outlier is an observation that is unusually far from the majority of the data.

Outliers may represent:

  • data entry errors;
  • measurement errors;
  • rare events;
  • genuine extreme cases;
  • mixture of populations.

12.2 IQR Rule

\[ Lower\ Fence = Q_1 - 1.5(IQR) \]

\[ Upper\ Fence = Q_3 + 1.5(IQR) \]

q1 <- quantile(mtcars$mpg, 0.25)
q3 <- quantile(mtcars$mpg, 0.75)
iqr_value <- IQR(mtcars$mpg)

lower_fence <- q1 - 1.5 * iqr_value
upper_fence <- q3 + 1.5 * iqr_value

c(lower_fence = lower_fence, upper_fence = upper_fence)
## lower_fence.25% upper_fence.75% 
##          4.3625         33.8625
mtcars %>%
  rownames_to_column("car") %>%
  filter(mpg < lower_fence | mpg > upper_fence)
##              car  mpg cyl disp hp drat    wt qsec vs am gear carb
## 1 Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.9  1  1    4    1

12.3 Z-score Method

\[ z_i = \frac{x_i-\bar{x}}{s} \]

z_scores <- scale(mtcars$mpg)
which(abs(z_scores) > 3)
## integer(0)

12.4 Multivariate Outliers

A point may not be an outlier in one variable but may be unusual when several variables are considered together.

cars_numeric <- mtcars %>% dplyr::select(mpg, hp, wt, qsec)
center <- colMeans(cars_numeric)
cov_matrix <- cov(cars_numeric)

mahal_values <- mahalanobis(cars_numeric, center, cov_matrix)

outlier_cutoff <- qchisq(0.975, df = ncol(cars_numeric))

tibble(
  car = rownames(mtcars),
  mahalanobis_distance = mahal_values,
  possible_outlier = mahal_values > outlier_cutoff
) %>%
  arrange(desc(mahalanobis_distance)) %>%
  head(10) %>%
  kable(digits = 3, caption = "Multivariate Outlier Check using Mahalanobis Distance") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Multivariate Outlier Check using Mahalanobis Distance
car mahalanobis_distance possible_outlier
Maserati Bora 13.804 TRUE
Merc 230 13.429 TRUE
Chrysler Imperial 10.036 FALSE
Toyota Corolla 7.895 FALSE
Fiat 128 6.877 FALSE
Lincoln Continental 5.786 FALSE
Ford Pantera L 5.458 FALSE
Toyota Corona 5.216 FALSE
Lotus Europa 5.131 FALSE
Cadillac Fleetwood 4.826 FALSE

Do not remove outliers automatically. First investigate whether they are errors or meaningful extreme values.

13 Correlation and Relationship Analysis

13.1 Covariance

Covariance measures whether two variables move together, but its scale depends on the units of measurement.

\[ Cov(X,Y)=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{n-1} \]

cov(mtcars$mpg, mtcars$wt)
## [1] -5.116685

13.2 Pearson Correlation

\[ r = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})} {\sqrt{\sum (x_i-\bar{x})^2}\sqrt{\sum (y_i-\bar{y})^2}} \]

cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594

13.3 Spearman Correlation

Spearman correlation is useful when the relationship is monotonic but not necessarily linear.

cor(mtcars$mpg, mtcars$wt, method = "spearman")
## [1] -0.886422

13.4 Correlation Matrix

numeric_data <- mtcars %>% dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_data)

corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  tl.cex = 0.8,
  addCoef.col = "black",
  number.cex = 0.7
)

13.5 Scatterplot

ggplot(cars_data, aes(x = wt, y = mpg)) +
  geom_point(size = 3, color = "#1F618D") +
  geom_smooth(method = "lm", se = TRUE, color = "#C0392B") +
  labs(
    title = "Relationship Between Car Weight and MPG",
    x = "Weight",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

13.6 Nonlinear Relationship Example

nonlinear_df <- tibble(
  x = seq(-5, 5, length.out = 200),
  y = x^2 + rnorm(200, 0, 3)
)

ggplot(nonlinear_df, aes(x = x, y = y)) +
  geom_point(alpha = 0.7, color = "#2E86C1") +
  geom_smooth(se = FALSE, color = "#C0392B") +
  labs(
    title = "Example of a Nonlinear Relationship",
    x = "X",
    y = "Y"
  ) +
  theme_minimal()

Correlation does not prove causation. A strong association does not automatically mean that one variable causes the other.

14 Missing Data Analysis

14.1 Why Missing Data Matter

Missing data occur when values are not recorded, lost, unavailable, or intentionally skipped.

Missing data can cause:

  • biased estimates;
  • reduced sample size;
  • incorrect conclusions;
  • model failure.

14.2 Types of Missing Data

missing_types <- tibble(
  Type = c("MCAR", "MAR", "MNAR"),
  Full_Name = c(
    "Missing Completely at Random",
    "Missing at Random",
    "Missing Not at Random"
  ),
  Meaning = c(
    "Missingness unrelated to observed or unobserved data",
    "Missingness related to observed variables",
    "Missingness related to the missing value itself"
  ),
  Example = c(
    "Random machine error",
    "Older people less likely to answer online survey",
    "High-income people refusing to report income"
  )
)

kable(missing_types, caption = "Types of Missing Data") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Types of Missing Data
Type Full_Name Meaning Example
MCAR Missing Completely at Random Missingness unrelated to observed or unobserved data Random machine error
MAR Missing at Random Missingness related to observed variables Older people less likely to answer online survey
MNAR Missing Not at Random Missingness related to the missing value itself High-income people refusing to report income

14.3 Creating Missing Values

missing_data <- cars_data
missing_data$mpg[c(2, 5, 8)] <- NA
missing_data$hp[c(4, 10)] <- NA

colSums(is.na(missing_data))
## car_name      mpg      cyl     disp       hp     drat       wt     qsec 
##        0        3        0        0        2        0        0        0 
##       vs       am     gear     carb 
##        0        0        0        0

14.4 Missing Percentage

missing_summary <- missing_data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  mutate(Missing_Percentage = Missing_Count / nrow(missing_data) * 100)

kable(missing_summary, caption = "Missing Data Summary") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Missing Data Summary
Variable Missing_Count Missing_Percentage
car_name 0 0.000
mpg 3 9.375
cyl 0 0.000
disp 0 0.000
hp 2 6.250
drat 0 0.000
wt 0 0.000
qsec 0 0.000
vs 0 0.000
am 0 0.000
gear 0 0.000
carb 0 0.000

14.5 Visualising Missingness

missing_summary %>%
  filter(Missing_Count > 0) %>%
  ggplot(aes(x = reorder(Variable, Missing_Count), y = Missing_Count, fill = Variable)) +
  geom_col(color = "black") +
  coord_flip() +
  labs(
    title = "Missing Values by Variable",
    x = "Variable",
    y = "Missing Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

14.6 Simple Imputation

imputed_data <- missing_data %>%
  mutate(
    mpg = ifelse(is.na(mpg), median(mpg, na.rm = TRUE), mpg),
    hp = ifelse(is.na(hp), median(hp, na.rm = TRUE), hp)
  )

colSums(is.na(imputed_data))
## car_name      mpg      cyl     disp       hp     drat       wt     qsec 
##        0        0        0        0        0        0        0        0 
##       vs       am     gear     carb 
##        0        0        0        0

Mean or median imputation is simple, but it can underestimate variability. For formal research, consider multiple imputation or model-based imputation.

15 Standardisation, Transformation, and Feature Understanding

15.1 Why Standardise?

Standardisation converts a variable into a scale with mean 0 and standard deviation 1.

\[ z = \frac{x-\bar{x}}{s} \]

standardized_data <- mtcars %>%
  mutate(
    mpg_z = as.numeric(scale(mpg)),
    hp_z = as.numeric(scale(hp)),
    wt_z = as.numeric(scale(wt))
  )

head(standardized_data[, c("mpg", "mpg_z", "hp", "hp_z", "wt", "wt_z")])
##                    mpg      mpg_z  hp       hp_z    wt         wt_z
## Mazda RX4         21.0  0.1508848 110 -0.5350928 2.620 -0.610399567
## Mazda RX4 Wag     21.0  0.1508848 110 -0.5350928 2.875 -0.349785269
## Datsun 710        22.8  0.4495434  93 -0.7830405 2.320 -0.917004624
## Hornet 4 Drive    21.4  0.2172534 110 -0.5350928 3.215 -0.002299538
## Hornet Sportabout 18.7 -0.2307345 175  0.4129422 3.440  0.227654255
## Valiant           18.1 -0.3302874 105 -0.6080186 3.460  0.248094592

15.2 Why Transform Data?

Transformations are useful when:

  • data are highly skewed;
  • variance increases with the mean;
  • relationships are nonlinear;
  • model assumptions are violated.

15.3 Log Transformation Example

log_example <- tibble(
  original = rexp(1000, rate = 0.1),
  log_transformed = log(original)
)

log_example_long <- log_example %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(log_example_long, aes(x = Value, fill = Variable)) +
  geom_histogram(bins = 30, color = "white") +
  facet_wrap(~Variable, scales = "free") +
  labs(
    title = "Original and Log-Transformed Data",
    x = "Value",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

16 Simple Linear Regression as EDA

16.1 Why Include Regression in EDA?

Regression can be used in EDA to understand direction, strength, and approximate functional form of relationships.

A simple linear regression model is:

\[ Y_i = \beta_0 + \beta_1X_i + \epsilon_i \]

where:

  • \(Y_i\) is the outcome variable;
  • \(X_i\) is the explanatory variable;
  • \(\beta_0\) is the intercept;
  • \(\beta_1\) is the slope;
  • \(\epsilon_i\) is random error.

16.2 Slope and Intercept

The intercept is the expected value of \(Y\) when \(X=0\). The slope is the expected change in \(Y\) for one-unit increase in \(X\).

16.3 Example: MPG and Weight

model_simple <- lm(mpg ~ wt, data = mtcars)
summary(model_simple)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

16.4 Extracting Regression Coefficients

broom::tidy(model_simple) %>%
  kable(digits = 4, caption = "Simple Linear Regression Coefficients") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Simple Linear Regression Coefficients
term estimate std.error statistic p.value
(Intercept) 37.2851 1.8776 19.8576 0
wt -5.3445 0.5591 -9.5590 0

16.5 Interpretation of Slope

If the slope for wt is negative, it means that as car weight increases, fuel efficiency decreases.

coef(model_simple)
## (Intercept)          wt 
##   37.285126   -5.344472

For example, if the slope is approximately -5.34, then a one-unit increase in car weight is associated with an average decrease of about 5.34 miles per gallon.

16.6 Visualising the Regression Line

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3, color = "#1F618D") +
  geom_smooth(method = "lm", se = TRUE, color = "#C0392B") +
  labs(
    title = "Simple Linear Regression: MPG versus Weight",
    subtitle = "Regression line shows expected MPG for each value of weight",
    x = "Weight",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

16.7 Model Fit

broom::glance(model_simple) %>%
  kable(digits = 4, caption = "Model Fit Statistics") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Model Fit Statistics
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.7528 0.7446 3.0459 91.3753 0 1 -80.0147 166.0294 170.4266 278.3219 30 32

16.8 Residuals

A residual is the difference between observed and fitted value.

\[ e_i = y_i - \hat{y}_i \]

reg_aug <- broom::augment(model_simple)

head(reg_aug)
## # A tibble: 6 × 9
##   .rownames           mpg    wt .fitted .resid   .hat .sigma  .cooksd .std.resid
##   <chr>             <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
## 1 Mazda RX4          21    2.62    23.3 -2.28  0.0433   3.07  1.33e-2    -0.766 
## 2 Mazda RX4 Wag      21    2.88    21.9 -0.920 0.0352   3.09  1.72e-3    -0.307 
## 3 Datsun 710         22.8  2.32    24.9 -2.09  0.0584   3.07  1.54e-2    -0.706 
## 4 Hornet 4 Drive     21.4  3.22    20.1  1.30  0.0313   3.09  3.02e-3     0.433 
## 5 Hornet Sportabout  18.7  3.44    18.9 -0.200 0.0329   3.10  7.60e-5    -0.0668
## 6 Valiant            18.1  3.46    18.8 -0.693 0.0332   3.10  9.21e-4    -0.231
ggplot(reg_aug, aes(x = .fitted, y = .resid)) +
  geom_point(size = 3, color = "#2E86C1") +
  geom_hline(yintercept = 0, color = "#C0392B", linewidth = 1) +
  labs(
    title = "Residual Plot",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal()

For a good linear model, residuals should not show a strong curve or funnel pattern.

17 Multiple Regression and Model Diagnostics

17.1 Multiple Regression Model

A multiple linear regression model is:

\[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \cdots + \beta_pX_{pi} + \epsilon_i \]

17.2 Example: Predicting MPG

model_multiple <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model_multiple)
## 
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## hp          -0.01804    0.01188  -1.519 0.140015    
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

17.3 Coefficient Table

broom::tidy(model_multiple) %>%
  kable(digits = 4, caption = "Multiple Regression Coefficients") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Multiple Regression Coefficients
term estimate std.error statistic p.value
(Intercept) 38.7518 1.7869 21.6870 0.0000
wt -3.1670 0.7406 -4.2764 0.0002
hp -0.0180 0.0119 -1.5188 0.1400
cyl -0.9416 0.5509 -1.7092 0.0985

17.4 Interpretation

In multiple regression, each coefficient represents the expected change in the outcome for one-unit increase in that predictor, while holding the other predictors constant.

17.5 Confidence Intervals

confint(model_multiple) %>%
  as.data.frame() %>%
  rownames_to_column("Term") %>%
  kable(digits = 4, caption = "Confidence Intervals for Regression Coefficients") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Confidence Intervals for Regression Coefficients
Term 2.5 % 97.5 %
(Intercept) 35.0916 42.4120
wt -4.6840 -1.6500
hp -0.0424 0.0063
cyl -2.0701 0.1869

17.6 Multicollinearity

Multicollinearity occurs when predictors are strongly correlated with each other.

car::vif(model_multiple)
##       wt       hp      cyl 
## 2.580486 3.258481 4.757456

17.7 Diagnostic Plots

par(mfrow = c(2, 2))
plot(model_multiple)

par(mfrow = c(1, 1))

17.8 Normality of Residuals

resid_df <- tibble(residuals = residuals(model_multiple))

ggplot(resid_df, aes(sample = residuals)) +
  stat_qq(color = "#2E86C1") +
  stat_qq_line(color = "#C0392B") +
  labs(
    title = "Q-Q Plot of Regression Residuals",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal()

17.9 Heteroscedasticity Check

lmtest::bptest(model_multiple)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_multiple
## BP = 2.9351, df = 3, p-value = 0.4017

Regression in EDA should be interpreted carefully. A statistically significant coefficient does not automatically imply causality.

18 Complete EDA Workflow and Case Study

18.1 Step-by-Step EDA Framework

A good EDA report should be systematic, reproducible, visual, and critical.

Recommended workflow:

  1. Define the purpose of analysis.
  2. Identify the unit of observation.
  3. Inspect structure and variable types.
  4. Check missing data.
  5. Check duplicates.
  6. Summarise numerical variables.
  7. Summarise categorical variables.
  8. Visualise distributions.
  9. Detect outliers.
  10. Study relationships.
  11. Check assumptions.
  12. Fit simple exploratory models.
  13. Write clear interpretation.
  14. State limitations.

18.2 Complete EDA Function

quick_eda <- function(data){
  numeric_part <- data %>% dplyr::select(where(is.numeric))

  list(
    dimensions = dim(data),
    variable_names = names(data),
    missing_values = colSums(is.na(data)),
    numeric_summary = psych::describe(numeric_part)
  )
}

quick_eda(mtcars)
## $dimensions
## [1] 32 11
## 
## $variable_names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
## 
## $missing_values
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0 
## 
## $numeric_summary
##      vars  n   mean     sd median trimmed    mad   min    max  range  skew
## mpg     1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61
## cyl     2 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17
## disp    3 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38
## hp      4 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73
## drat    5 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27
## wt      6 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42
## qsec    7 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37
## vs      8 32   0.44   0.50   0.00    0.42   0.00  0.00   1.00   1.00  0.24
## am      9 32   0.41   0.50   0.00    0.38   0.00  0.00   1.00   1.00  0.36
## gear   10 32   3.69   0.74   4.00    3.62   1.48  3.00   5.00   2.00  0.53
## carb   11 32   2.81   1.62   2.00    2.65   1.48  1.00   8.00   7.00  1.05
##      kurtosis    se
## mpg     -0.37  1.07
## cyl     -1.76  0.32
## disp    -1.21 21.91
## hp      -0.14 12.12
## drat    -0.71  0.09
## wt      -0.02  0.17
## qsec     0.34  0.32
## vs      -2.00  0.09
## am      -1.92  0.09
## gear    -1.07  0.13
## carb     1.26  0.29

18.3 Case Study Question

Research Question: How are car weight, horsepower, transmission type, and number of cylinders related to fuel efficiency?

18.4 Grouped Summary

cars_data %>%
  group_by(cyl, am) %>%
  summarise(
    n = n(),
    mean_mpg = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg = sd(mpg),
    mean_hp = mean(hp),
    mean_wt = mean(wt),
    .groups = "drop"
  ) %>%
  kable(digits = 2, caption = "Grouped Summary by Cylinder and Transmission") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Grouped Summary by Cylinder and Transmission
cyl am n mean_mpg median_mpg sd_mpg mean_hp mean_wt
4 Automatic 3 22.90 22.80 1.45 84.67 2.94
4 Manual 8 28.08 28.85 4.48 81.88 2.04
6 Automatic 4 19.12 18.65 1.63 115.25 3.39
6 Manual 3 20.57 21.00 0.75 131.67 2.76
8 Automatic 12 15.05 15.20 2.77 194.17 4.10
8 Manual 2 15.40 15.40 0.57 299.50 3.37

18.5 Visual Case Study

ggplot(cars_data, aes(x = wt, y = mpg, color = cyl, shape = am)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "MPG Declines as Car Weight Increases",
    subtitle = "Relationship shown by cylinder group and transmission type",
    x = "Car Weight",
    y = "Miles per Gallon",
    color = "Cylinders",
    shape = "Transmission"
  ) +
  theme_minimal()

18.6 Pairwise Relationship Plot

GGally::ggpairs(
  cars_data[, c("mpg", "hp", "wt", "qsec")]
)

18.7 Final Exploratory Model

final_eda_model <- lm(mpg ~ wt + hp + cyl + am, data = cars_data)
summary(final_eda_model)
## 
## Call:
## lm(formula = mpg ~ wt + hp + cyl + am, data = cars_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## amManual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

18.8 Model Interpretation

broom::tidy(final_eda_model) %>%
  kable(digits = 4, caption = "Final Exploratory Model Coefficients") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Final Exploratory Model Coefficients
term estimate std.error statistic p.value
(Intercept) 33.7083 2.6049 12.9404 0.0000
wt -2.4968 0.8856 -2.8194 0.0091
hp -0.0321 0.0137 -2.3450 0.0269
cyl6 -3.0313 1.4073 -2.1540 0.0407
cyl8 -2.1637 2.2843 -0.9472 0.3523
amManual 1.8092 1.3963 1.2957 0.2065

The EDA suggests that heavier cars tend to have lower fuel efficiency. Cars with more cylinders generally show lower MPG. Horsepower and weight are positively related, while MPG and weight are negatively related. Regression provides a structured way to quantify these patterns.

19 Review Questions

  1. What is the difference between descriptive and inferential statistics?
  2. Why is the median better than the mean for skewed data?
  3. What does positive skewness mean?
  4. What does a boxplot show?
  5. Why should missing data be checked before modelling?
  6. What is the difference between covariance and correlation?
  7. Why does correlation not imply causation?
  8. What is the difference between variance and standard deviation?
  9. When should we use a bar chart instead of a histogram?
  10. Why should outliers not be deleted automatically?
  11. What is the meaning of a regression slope?
  12. What is the meaning of a regression intercept?
  13. Why should residuals be checked?
  14. What is multicollinearity?
  15. Why should EDA include both graphs and numerical summaries?

20 Practical Exercises with Solutions

20.1 Exercise: Iris Dataset Summary

Use the iris dataset and calculate mean, median, standard deviation, skewness, and kurtosis for Sepal.Length.

data(iris)

tibble(
  Mean = mean(iris$Sepal.Length),
  Median = median(iris$Sepal.Length),
  SD = sd(iris$Sepal.Length),
  Skewness = skewness(iris$Sepal.Length),
  Kurtosis = kurtosis(iris$Sepal.Length)
) %>%
  kable(digits = 3, caption = "Summary of Sepal Length") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Summary of Sepal Length
Mean Median SD Skewness Kurtosis
5.843 5.8 0.828 0.312 2.426

20.2 Exercise: Histogram and Density Plot

ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = "#5DADE2", color = "white") +
  geom_density(color = "#C0392B", linewidth = 1.2) +
  labs(
    title = "Histogram and Density Plot of Sepal Length",
    x = "Sepal Length",
    y = "Density"
  ) +
  theme_minimal()

20.3 Exercise: Boxplot by Species

ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  labs(
    title = "Sepal Length by Iris Species",
    x = "Species",
    y = "Sepal Length"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

20.4 Exercise: Correlation Matrix

iris_numeric <- iris %>% dplyr::select(where(is.numeric))
corrplot(cor(iris_numeric), method = "color", addCoef.col = "black")

20.5 Exercise: Regression with Iris Data

iris_model <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Species, data = iris)
summary(iris_model)
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length + Petal.Width + Species, 
##     data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75238 -0.23089 -0.00211  0.23100  1.03108 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.682982   0.107403  34.291  < 2e-16 ***
## Petal.Length       0.905946   0.074311  12.191  < 2e-16 ***
## Petal.Width       -0.005995   0.156260  -0.038    0.969    
## Speciesversicolor -1.598362   0.205706  -7.770 1.32e-12 ***
## Speciesvirginica  -2.112647   0.304024  -6.949 1.16e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3392 on 145 degrees of freedom
## Multiple R-squared:  0.8367, Adjusted R-squared:  0.8322 
## F-statistic: 185.8 on 4 and 145 DF,  p-value: < 2.2e-16
broom::tidy(iris_model) %>%
  kable(digits = 4, caption = "Regression Model for Iris Data") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Regression Model for Iris Data
term estimate std.error statistic p.value
(Intercept) 3.6830 0.1074 34.2913 0.0000
Petal.Length 0.9059 0.0743 12.1913 0.0000
Petal.Width -0.0060 0.1563 -0.0384 0.9694
Speciesversicolor -1.5984 0.2057 -7.7701 0.0000
Speciesvirginica -2.1126 0.3040 -6.9489 0.0000

20.6 Exercise: Air Quality Dataset

data(airquality)

airquality_clean <- airquality %>%
  tidyr::drop_na()

summary(airquality_clean)
##      Ozone          Solar.R           Wind            Temp      
##  Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
##  1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
##  Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
##  Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
##  3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
##  Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
##      Month            Day       
##  Min.   :5.000   Min.   : 1.00  
##  1st Qu.:6.000   1st Qu.: 9.00  
##  Median :7.000   Median :16.00  
##  Mean   :7.216   Mean   :15.95  
##  3rd Qu.:9.000   3rd Qu.:22.50  
##  Max.   :9.000   Max.   :31.00
ggplot(airquality_clean, aes(x = Temp, y = Ozone)) +
  geom_point(color = "#2E86C1", size = 3) +
  geom_smooth(method = "lm", color = "#C0392B") +
  labs(
    title = "Ozone and Temperature",
    x = "Temperature",
    y = "Ozone"
  ) +
  theme_minimal()

air_model <- lm(Ozone ~ Temp + Wind + Solar.R, data = airquality_clean)
summary(air_model)
## 
## Call:
## lm(formula = Ozone ~ Temp + Wind + Solar.R, data = airquality_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

21 Final Summary

This book-style tutorial covered the full foundation of Exploratory Data Analysis and Descriptive Statistics using R, from basic summaries to regression-based exploratory modelling.

You learned how to:

  • identify data types and measurement scales;
  • summarise numerical variables;
  • summarise categorical variables;
  • calculate mean, median, mode, variance, SD, IQR, and CV;
  • interpret skewness and kurtosis;
  • detect outliers;
  • analyse missing data;
  • visualise distributions;
  • study relationships;
  • use correlation and regression;
  • interpret slope and intercept;
  • check residuals and model diagnostics;
  • prepare a complete EDA workflow.

The best data analysts do not simply run code. They ask critical questions, inspect assumptions, compare summaries with graphs, and explain results clearly.

22 Suggested References

  • Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
  • Wickham, H., & Grolemund, G. (2017). R for Data Science. O’Reilly.
  • Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Sage.
  • Dalgaard, P. (2008). Introductory Statistics with R. Springer.
  • Kabacoff, R. (2022). R in Action. Manning.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Springer.

23 Any Questions or feedback ?

Please email Dr Hom at . I hope you love numbers like me..