1 Preface

This tutorial is designed as a complete book-style R Markdown resource for students, researchers, data analysts, business analysts, public health researchers, and early-career statisticians.

Exploratory Data Analysis, commonly called EDA, is the first serious conversation we have with data. Before applying regression, machine learning, forecasting, Bayesian modelling, artificial intelligence, or policy analysis, we must understand the data.

EDA helps us answer:

What type of data do we have?
What is the quality of the data?
Are there missing values?
Are there outliers?
What is the distribution of each variable?
Are variables related?
Are assumptions reasonable for modelling?
What story does the data suggest?

This book covers EDA and descriptive statistics from basic to advanced level using R. This version is corrected for RPubs knitting and avoids common package-conflict errors.

2 Introduction to Data and EDA

2.1 What is Data?

Data are recorded observations, measurements, facts, or responses collected for analysis, interpretation, decision-making, or prediction.

Examples of data include student marks, patient blood pressure, daily temperature, customer sales, website clicks, and survey responses.

example_data <- tibble(
  Student = c("A", "B", "C", "D", "E"),
  Age = c(21, 24, 22, 25, 23),
  Gender = c("Female", "Male", "Female", "Male", "Female"),
  Score = c(78, 85, 69, 92, 81),
  Attendance = c(90, 88, 76, 95, 84)
)

kable(example_data, caption = "Example of a Small Dataset") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Example of a Small Dataset
Student	Age	Gender	Score	Attendance
A	21	Female	78	90
B	24	Male	85	88
C	22	Female	69	76
D	25	Male	92	95
E	23	Female	81	84

2.2 What is Exploratory Data Analysis?

Exploratory Data Analysis is the systematic process of inspecting, summarising, visualising, and questioning data to discover patterns, anomalies, relationships, data quality problems, and modelling assumptions.

EDA is not only about producing graphs. It is about thinking critically.

Critical EDA questions:
1. What is the unit of observation?
2. What does each row represent?
3. What does each column measure?
4. Are the values reasonable?
5. Are there duplicated records?
6. Are there missing or impossible values?
7. Are the distributions symmetric, skewed, or multimodal?
8. Which variables may explain the outcome of interest?

2.3 Why EDA Matters

Skipping EDA is dangerous. A model can produce results even when the data contain errors, miscoded variables, missing values, outliers, or invalid assumptions.

EDA supports:

data cleaning;
assumption checking;
model selection;
hypothesis generation;
communication;
reproducibility.

2.4 EDA versus Confirmatory Analysis

eda_confirmatory <- tibble(
  Feature = c("Purpose", "Question Type", "Main Tools", "Output"),
  EDA = c(
    "Explore and understand data",
    "What patterns exist?",
    "Tables, graphs, summaries",
    "Insights, assumptions, possible hypotheses"
  ),
  Confirmatory_Analysis = c(
    "Test specific hypotheses",
    "Is there evidence for a specific claim?",
    "Statistical tests and models",
    "P-values, confidence intervals, model estimates"
  )
)

kable(eda_confirmatory, caption = "EDA versus Confirmatory Analysis") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

EDA versus Confirmatory Analysis
Feature	EDA	Confirmatory_Analysis
Purpose	Explore and understand data	Test specific hypotheses
Question Type	What patterns exist?	Is there evidence for a specific claim?
Main Tools	Tables, graphs, summaries	Statistical tests and models
Output	Insights, assumptions, possible hypotheses	P-values, confidence intervals, model estimates

3 Data Types and Measurement Scales

3.1 Qualitative and Quantitative Data

data_types <- tibble(
  Type = c("Qualitative", "Quantitative"),
  Meaning = c("Non-numerical categories", "Numerical measurements or counts"),
  Examples = c("Gender, blood group, region", "Age, income, weight, marks"),
  Common_Graphs = c("Bar chart, pie chart, mosaic plot", "Histogram, boxplot, density plot, scatterplot")
)

kable(data_types, caption = "Qualitative and Quantitative Data") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Qualitative and Quantitative Data
Type	Meaning	Examples	Common_Graphs
Qualitative	Non-numerical categories	Gender, blood group, region	Bar chart, pie chart, mosaic plot
Quantitative	Numerical measurements or counts	Age, income, weight, marks	Histogram, boxplot, density plot, scatterplot

3.2 Discrete and Continuous Data

Discrete data are countable values. Continuous data are measurable values that can take many values within an interval.

Examples:

Discrete: number of children, number of hospital visits, number of defects.
Continuous: height, temperature, time, income, blood pressure.

3.3 Measurement Scales

scales_table <- tibble(
  Scale = c("Nominal", "Ordinal", "Interval", "Ratio"),
  Description = c(
    "Categories without natural order",
    "Categories with meaningful order",
    "Numerical scale with equal intervals but no true zero",
    "Numerical scale with equal intervals and true zero"
  ),
  Examples = c(
    "Gender, region, blood group",
    "Education level, satisfaction rating",
    "Temperature in Celsius",
    "Age, income, height, weight"
  ),
  Suitable_Summary = c(
    "Count, percentage, mode",
    "Median, percentile, rank",
    "Mean, SD, correlation",
    "All arithmetic summaries"
  )
)

kable(scales_table, caption = "Measurement Scales") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Measurement Scales
Scale	Description	Examples	Suitable_Summary
Nominal	Categories without natural order	Gender, region, blood group	Count, percentage, mode
Ordinal	Categories with meaningful order	Education level, satisfaction rating	Median, percentile, rank
Interval	Numerical scale with equal intervals but no true zero	Temperature in Celsius	Mean, SD, correlation
Ratio	Numerical scale with equal intervals and true zero	Age, income, height, weight	All arithmetic summaries

The measurement scale determines the appropriate statistical method. A mean is meaningful for income but not meaningful for blood group.

3.4 Example: Correct and Incorrect Use

scale_example <- tibble(
  Variable = c("Blood Group", "Satisfaction Rating", "Temperature", "Income"),
  Scale = c("Nominal", "Ordinal", "Interval", "Ratio"),
  Poor_Summary = c("Mean blood group", "Mean may be weak", "Percentage only", "Mode only"),
  Better_Summary = c("Frequency table", "Median and distribution", "Mean and SD", "Mean, median, SD, IQR")
)

kable(scale_example, caption = "Choosing Summaries by Measurement Scale") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Choosing Summaries by Measurement Scale
Variable	Scale	Poor_Summary	Better_Summary
Blood Group	Nominal	Mean blood group	Frequency table
Satisfaction Rating	Ordinal	Mean may be weak	Median and distribution
Temperature	Interval	Percentage only	Mean and SD
Income	Ratio	Mode only	Mean, median, SD, IQR

4 Loading, Viewing, and Understanding Data in R

4.1 Loading Built-in Datasets

This book uses built-in R datasets so all examples run immediately.

data(mtcars)
data(iris)
data(airquality)

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

4.2 Understanding Rows and Columns

dim(mtcars)

## [1] 32 11

The mtcars dataset has rows representing cars and columns representing car characteristics.

4.3 Structure of Data

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

4.4 Summary of Data

summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

4.5 Creating a Cleaner Dataset

Some numerical columns in mtcars are actually categorical variables.

cars_data <- mtcars %>%
  rownames_to_column("car_name") %>%
  mutate(
    cyl = factor(cyl),
    am = factor(am, labels = c("Automatic", "Manual")),
    gear = factor(gear),
    carb = factor(carb)
  )

glimpse(cars_data)

## Rows: 32
## Columns: 12
## $ car_name <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive",…
## $ mpg      <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 1…
## $ cyl      <fct> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4…
## $ disp     <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8…
## $ hp       <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180,…
## $ drat     <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3…
## $ wt       <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150…
## $ qsec     <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90…
## $ vs       <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1…
## $ am       <fct> Manual, Manual, Manual, Automatic, Automatic, Automatic, Auto…
## $ gear     <fct> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3…
## $ carb     <fct> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1…

4.6 Checking Duplicate Rows

sum(duplicated(cars_data))

## [1] 0

4.7 Checking Variable Names

names(cars_data)

##  [1] "car_name" "mpg"      "cyl"      "disp"     "hp"       "drat"    
##  [7] "wt"       "qsec"     "vs"       "am"       "gear"     "carb"

A good EDA begins with understanding the data dictionary: variable names, meanings, units, and coding rules.

5 Descriptive Statistics

5.1 Meaning of Descriptive Statistics

Descriptive statistics are numerical and graphical techniques used to summarise the main features of a dataset.

They include:

central tendency;
dispersion;
shape;
position;
frequency;
relationship.

5.2 Numerical Summary in R

psych::describe(mtcars) %>%
  round(3) %>%
  kable(caption = "Descriptive Statistics for mtcars") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Descriptive Statistics for mtcars
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
mpg	1	32	20.091	6.027	19.200	19.696	5.411	10.400	33.900	23.500	0.611	-0.373	1.065
cyl	2	32	6.188	1.786	6.000	6.231	2.965	4.000	8.000	4.000	-0.175	-1.762	0.316
disp	3	32	230.722	123.939	196.300	222.523	140.476	71.100	472.000	400.900	0.382	-1.207	21.909
hp	4	32	146.688	68.563	123.000	141.192	77.095	52.000	335.000	283.000	0.726	-0.136	12.120
drat	5	32	3.597	0.535	3.695	3.579	0.704	2.760	4.930	2.170	0.266	-0.715	0.095
wt	6	32	3.217	0.978	3.325	3.153	0.767	1.513	5.424	3.911	0.423	-0.023	0.173
qsec	7	32	17.849	1.787	17.710	17.828	1.416	14.500	22.900	8.400	0.369	0.335	0.316
vs	8	32	0.438	0.504	0.000	0.423	0.000	0.000	1.000	1.000	0.240	-2.002	0.089
am	9	32	0.406	0.499	0.000	0.385	0.000	0.000	1.000	1.000	0.364	-1.925	0.088
gear	10	32	3.688	0.738	4.000	3.615	1.483	3.000	5.000	2.000	0.529	-1.070	0.130
carb	11	32	2.812	1.615	2.000	2.654	1.483	1.000	8.000	7.000	1.051	1.257	0.286

5.3 Custom Summary Table

custom_summary <- mtcars %>%
  summarise(
    mean_mpg = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg = sd(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg),
    iqr_mpg = IQR(mpg),
    skewness_mpg = skewness(mpg),
    kurtosis_mpg = kurtosis(mpg)
  )

kable(custom_summary, digits = 3, caption = "Custom Summary for MPG") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Custom Summary for MPG
mean_mpg	median_mpg	sd_mpg	min_mpg	max_mpg	iqr_mpg	skewness_mpg	kurtosis_mpg
20.091	19.2	6.027	10.4	33.9	7.375	0.64	2.799

5.4 Advanced Example: Summarising Many Variables

summary_many <- mtcars %>%
  summarise(across(
    where(is.numeric),
    list(
      mean = mean,
      median = median,
      sd = sd,
      min = min,
      max = max,
      skew = skewness
    ),
    .names = "{.col}_{.fn}"
  ))

summary_many[, 1:12] %>%
  kable(digits = 3, caption = "Multiple Summary Measures") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Multiple Summary Measures
mpg_mean	mpg_median	mpg_sd	mpg_min	mpg_max	mpg_skew	cyl_mean	cyl_median	cyl_sd	cyl_min	cyl_max	cyl_skew
20.091	19.2	6.027	10.4	33.9	0.64	6.188	6	1.786	4	8	-0.183

6 Measures of Central Tendency

6.1 Mean

The mean is the arithmetic average of all observations.

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \]

mean(mtcars$mpg)

## [1] 20.09062

6.2 Median

The median is the middle value after observations are arranged in ascending order.

median(mtcars$mpg)

## [1] 19.2

6.3 Mode

The mode is the most frequently occurring value.

get_mode <- function(x){
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

get_mode(mtcars$cyl)

## [1] 8

6.4 Comparing Mean, Median, and Mode

central_table <- tibble(
  Statistic = c("Mean", "Median", "Mode"),
  Best_For = c(
    "Symmetric numerical data",
    "Skewed numerical data",
    "Categorical or repeated values"
  ),
  Sensitive_to_Outliers = c("Yes", "No", "No"),
  Example = c("Average height", "Median income", "Most common blood group")
)

kable(central_table, caption = "Mean, Median and Mode") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Mean, Median and Mode
Statistic	Best_For	Sensitive_to_Outliers	Example
Mean	Symmetric numerical data	Yes	Average height
Median	Skewed numerical data	No	Median income
Mode	Categorical or repeated values	No	Most common blood group

6.5 Example: Outlier Effect

normal_values <- c(45, 48, 50, 51, 52, 53, 55)
outlier_values <- c(normal_values, 150)

outlier_effect <- tibble(
  Dataset = c("Without Outlier", "With Outlier"),
  Mean = c(mean(normal_values), mean(outlier_values)),
  Median = c(median(normal_values), median(outlier_values)),
  Difference = c(mean(normal_values) - median(normal_values),
                 mean(outlier_values) - median(outlier_values))
)

kable(outlier_effect, digits = 2, caption = "Effect of Outlier on Mean and Median") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Effect of Outlier on Mean and Median
Dataset	Mean	Median	Difference
Without Outlier	50.57	51.0	-0.43
With Outlier	63.00	51.5	11.50

outlier_df <- tibble(
  value = c(normal_values, 150),
  group = c(rep("Normal values", length(normal_values)), "Extreme outlier")
)

ggplot(outlier_df, aes(x = value)) +
  geom_histogram(bins = 12, fill = "#5DADE2", color = "white") +
  geom_vline(aes(xintercept = mean(value)), color = "#C0392B", linewidth = 1.2) +
  geom_vline(aes(xintercept = median(value)), color = "#1E8449", linewidth = 1.2, linetype = "dashed") +
  labs(
    title = "Mean is Pulled Toward the Outlier",
    subtitle = "Red line = mean, green dashed line = median",
    x = "Value",
    y = "Frequency"
  ) +
  theme_minimal()

For skewed data, report both mean and median. If they differ strongly, explain why.

7 Measures of Dispersion

7.1 Range

\[ Range = Maximum - Minimum \]

max(mtcars$mpg) - min(mtcars$mpg)

## [1] 23.5

7.2 Variance

Variance measures the average squared deviation from the mean.

\[ s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} \]

var(mtcars$mpg)

## [1] 36.3241

7.3 Standard Deviation

\[ s = \sqrt{s^2} \]

sd(mtcars$mpg)

## [1] 6.026948

7.4 Interquartile Range

\[ IQR = Q_3 - Q_1 \]

IQR(mtcars$mpg)

## [1] 7.375

7.5 Coefficient of Variation

The coefficient of variation compares standard deviation relative to the mean.

\[ CV = \frac{s}{\bar{x}} \times 100 \]

cv_mpg <- sd(mtcars$mpg) / mean(mtcars$mpg) * 100
cv_hp <- sd(mtcars$hp) / mean(mtcars$hp) * 100

tibble(
  Variable = c("MPG", "Horsepower"),
  CV_Percentage = c(cv_mpg, cv_hp)
) %>%
  kable(digits = 2, caption = "Coefficient of Variation") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Coefficient of Variation
Variable	CV_Percentage
MPG	30.00
Horsepower	46.74

7.6 Advanced Example: Spread by Group

cars_data %>%
  group_by(cyl) %>%
  summarise(
    n = n(),
    mean_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    iqr_mpg = IQR(mpg),
    cv_mpg = sd(mpg) / mean(mpg) * 100,
    .groups = "drop"
  ) %>%
  kable(digits = 2, caption = "MPG Dispersion by Cylinder Group") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

MPG Dispersion by Cylinder Group
cyl	n	mean_mpg	sd_mpg	iqr_mpg	cv_mpg
4	11	26.66	4.51	7.60	16.91
6	7	19.74	1.45	2.35	7.36
8	14	15.10	2.56	1.85	16.95

8 Frequency Tables and Categorical EDA

8.1 Frequency Table

table(cars_data$cyl)

## 
##  4  6  8 
## 11  7 14

8.2 Percentage Table

round(prop.table(table(cars_data$cyl)) * 100, 2)

## 
##     4     6     8 
## 34.38 21.88 43.75

8.3 Cross-tabulation

cyl_am_table <- table(cars_data$cyl, cars_data$am)
cyl_am_table

##    
##     Automatic Manual
##   4         3      8
##   6         4      3
##   8        12      2

8.4 Row Percentage

round(prop.table(cyl_am_table, margin = 1) * 100, 2)

##    
##     Automatic Manual
##   4     27.27  72.73
##   6     57.14  42.86
##   8     85.71  14.29

8.5 Bar Chart

ggplot(cars_data, aes(x = cyl, fill = cyl)) +
  geom_bar(color = "black") +
  labs(
    title = "Number of Cars by Cylinder Category",
    x = "Cylinders",
    y = "Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

8.6 Stacked Percentage Bar Chart

ggplot(cars_data, aes(x = cyl, fill = am)) +
  geom_bar(position = "fill", color = "black") +
  scale_y_continuous(labels = percent) +
  labs(
    title = "Transmission Type within Cylinder Groups",
    x = "Cylinders",
    y = "Percentage",
    fill = "Transmission"
  ) +
  theme_minimal()

For categorical variables, counts alone may be misleading. Percentages are often easier to compare across groups.

9 Graphical EDA for Numerical Variables

9.1 Histogram

ggplot(cars_data, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "#2E86C1", color = "white") +
  labs(
    title = "Histogram of Miles per Gallon",
    x = "Miles per Gallon",
    y = "Frequency"
  ) +
  theme_minimal()

9.2 Density Plot

ggplot(cars_data, aes(x = mpg)) +
  geom_density(fill = "#58D68D", alpha = 0.7) +
  labs(
    title = "Density Plot of Miles per Gallon",
    x = "Miles per Gallon",
    y = "Density"
  ) +
  theme_minimal()

9.3 Boxplot

ggplot(cars_data, aes(y = mpg)) +
  geom_boxplot(fill = "#F5B041", color = "black") +
  labs(
    title = "Boxplot of Miles per Gallon",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

9.4 Violin Plot

ggplot(cars_data, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_violin(trim = FALSE, alpha = 0.7) +
  geom_boxplot(width = 0.15, fill = "white") +
  labs(
    title = "MPG Distribution by Cylinder Group",
    x = "Cylinders",
    y = "Miles per Gallon"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

9.5 Ridgeline-Style Example without Extra Packages

ggplot(cars_data, aes(x = mpg, fill = cyl)) +
  geom_density(alpha = 0.45) +
  facet_wrap(~cyl, ncol = 1) +
  labs(
    title = "Density of MPG by Cylinder Group",
    x = "MPG",
    y = "Density"
  ) +
  theme_minimal()

10 Skewness

10.1 Definition

Skewness measures the asymmetry of a distribution.

\[ Skewness = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{s^3} \]

10.2 Interpretation of Skewness

skew_table <- tibble(
  Skewness_Value = c("Approximately 0", "Positive", "Negative", "Greater than 1", "Less than -1"),
  Interpretation = c(
    "Approximately symmetric",
    "Right-skewed distribution",
    "Left-skewed distribution",
    "Strong positive skewness",
    "Strong negative skewness"
  ),
  Mean_Median_Relationship = c(
    "Mean approximately equals median",
    "Mean often greater than median",
    "Mean often less than median",
    "Mean can be strongly pulled right",
    "Mean can be strongly pulled left"
  )
)

kable(skew_table, caption = "Interpretation of Skewness") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Interpretation of Skewness
Skewness_Value	Interpretation	Mean_Median_Relationship
Approximately 0	Approximately symmetric	Mean approximately equals median
Positive	Right-skewed distribution	Mean often greater than median
Negative	Left-skewed distribution	Mean often less than median
Greater than 1	Strong positive skewness	Mean can be strongly pulled right
Less than -1	Strong negative skewness	Mean can be strongly pulled left

10.3 Visual Examples

skew_example <- tibble(
  symmetric = rnorm(1000, mean = 50, sd = 10),
  positive_skew = rexp(1000, rate = 0.1),
  negative_skew = -rexp(1000, rate = 0.1)
) %>%
  pivot_longer(cols = everything(), names_to = "Distribution", values_to = "Value")

ggplot(skew_example, aes(x = Value, fill = Distribution)) +
  geom_histogram(bins = 30, color = "white", alpha = 0.8) +
  facet_wrap(~Distribution, scales = "free") +
  labs(
    title = "Symmetric, Positive Skew and Negative Skew Distributions",
    x = "Value",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

10.4 Skewness in Practice

skew_summary <- mtcars %>%
  summarise(across(where(is.numeric), skewness)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Skewness") %>%
  arrange(desc(abs(Skewness)))

kable(skew_summary, digits = 3, caption = "Skewness of mtcars Variables") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Skewness of mtcars Variables
Variable	Skewness
carb	1.102
hp	0.761
mpg	0.640
gear	0.555
wt	0.444
disp	0.400
qsec	0.387
am	0.382
drat	0.279
vs	0.252
cyl	-0.183

10.5 Transformation for Skewed Data

income <- rexp(1000, rate = 0.002)

income_df <- tibble(
  income = income,
  log_income = log(income)
)

p1 <- ggplot(income_df, aes(x = income)) +
  geom_histogram(bins = 30, fill = "#3498DB", color = "white") +
  labs(title = "Original Skewed Income", x = "Income", y = "Frequency") +
  theme_minimal()

p2 <- ggplot(income_df, aes(x = log_income)) +
  geom_histogram(bins = 30, fill = "#2ECC71", color = "white") +
  labs(title = "Log-Transformed Income", x = "Log Income", y = "Frequency") +
  theme_minimal()

gridExtra::grid.arrange(p1, p2, ncol = 2)

If data are strongly right-skewed, log transformation can make patterns clearer. However, transformation changes interpretation, so always explain it.

11 Kurtosis and Distribution Shape

11.1 Definition

Kurtosis describes the tail heaviness of a distribution.

kurtosis(mtcars$mpg)

## [1] 2.799467

11.2 Types of Kurtosis

kurtosis_table <- tibble(
  Type = c("Mesokurtic", "Leptokurtic", "Platykurtic"),
  Meaning = c(
    "Similar to normal distribution",
    "Heavy tails and sharper peak",
    "Light tails and flatter peak"
  ),
  Practical_Concern = c(
    "Standard assumption may be reasonable",
    "Extreme values are more likely",
    "Extreme values are less common"
  )
)

kable(kurtosis_table, caption = "Types of Kurtosis") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Types of Kurtosis
Type	Meaning	Practical_Concern
Mesokurtic	Similar to normal distribution	Standard assumption may be reasonable
Leptokurtic	Heavy tails and sharper peak	Extreme values are more likely
Platykurtic	Light tails and flatter peak	Extreme values are less common

11.3 Q-Q Plot

ggplot(cars_data, aes(sample = mpg)) +
  stat_qq(color = "#2E86C1") +
  stat_qq_line(color = "#D35400", linewidth = 1) +
  labs(
    title = "Normal Q-Q Plot for MPG",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal()

11.4 Advanced Example: Comparing Normal and Heavy-Tailed Data

tail_data <- tibble(
  Normal = rnorm(1000),
  Heavy_Tailed = rt(1000, df = 3)
) %>%
  pivot_longer(cols = everything(), names_to = "Distribution", values_to = "Value")

ggplot(tail_data, aes(x = Value, fill = Distribution)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Normal versus Heavy-Tailed Distribution",
    x = "Value",
    y = "Density"
  ) +
  theme_minimal()

12 Outlier Detection

12.1 What is an Outlier?

An outlier is an observation that is unusually far from the majority of the data.

Outliers may represent:

data entry errors;
measurement errors;
rare events;
genuine extreme cases;
mixture of populations.

12.2 IQR Rule

\[ Lower\ Fence = Q_1 - 1.5(IQR) \]

\[ Upper\ Fence = Q_3 + 1.5(IQR) \]

q1 <- quantile(mtcars$mpg, 0.25)
q3 <- quantile(mtcars$mpg, 0.75)
iqr_value <- IQR(mtcars$mpg)

lower_fence <- q1 - 1.5 * iqr_value
upper_fence <- q3 + 1.5 * iqr_value

c(lower_fence = lower_fence, upper_fence = upper_fence)

## lower_fence.25% upper_fence.75% 
##          4.3625         33.8625

mtcars %>%
  rownames_to_column("car") %>%
  filter(mpg < lower_fence | mpg > upper_fence)

##              car  mpg cyl disp hp drat    wt qsec vs am gear carb
## 1 Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.9  1  1    4    1

12.3 Z-score Method

\[ z_i = \frac{x_i-\bar{x}}{s} \]

z_scores <- scale(mtcars$mpg)
which(abs(z_scores) > 3)

## integer(0)

12.4 Multivariate Outliers

A point may not be an outlier in one variable but may be unusual when several variables are considered together.

cars_numeric <- mtcars %>% dplyr::select(mpg, hp, wt, qsec)
center <- colMeans(cars_numeric)
cov_matrix <- cov(cars_numeric)

mahal_values <- mahalanobis(cars_numeric, center, cov_matrix)

outlier_cutoff <- qchisq(0.975, df = ncol(cars_numeric))

tibble(
  car = rownames(mtcars),
  mahalanobis_distance = mahal_values,
  possible_outlier = mahal_values > outlier_cutoff
) %>%
  arrange(desc(mahalanobis_distance)) %>%
  head(10) %>%
  kable(digits = 3, caption = "Multivariate Outlier Check using Mahalanobis Distance") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Multivariate Outlier Check using Mahalanobis Distance
car	mahalanobis_distance	possible_outlier
Maserati Bora	13.804	TRUE
Merc 230	13.429	TRUE
Chrysler Imperial	10.036	FALSE
Toyota Corolla	7.895	FALSE
Fiat 128	6.877	FALSE
Lincoln Continental	5.786	FALSE
Ford Pantera L	5.458	FALSE
Toyota Corona	5.216	FALSE
Lotus Europa	5.131	FALSE
Cadillac Fleetwood	4.826	FALSE

Do not remove outliers automatically. First investigate whether they are errors or meaningful extreme values.

13 Correlation and Relationship Analysis

13.1 Covariance

Covariance measures whether two variables move together, but its scale depends on the units of measurement.

\[ Cov(X,Y)=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{n-1} \]

cov(mtcars$mpg, mtcars$wt)

## [1] -5.116685

13.2 Pearson Correlation

\[ r = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})} {\sqrt{\sum (x_i-\bar{x})^2}\sqrt{\sum (y_i-\bar{y})^2}} \]

cor(mtcars$mpg, mtcars$wt)

## [1] -0.8676594

13.3 Spearman Correlation

Spearman correlation is useful when the relationship is monotonic but not necessarily linear.

cor(mtcars$mpg, mtcars$wt, method = "spearman")

## [1] -0.886422

13.4 Correlation Matrix

numeric_data <- mtcars %>% dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_data)

corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  tl.cex = 0.8,
  addCoef.col = "black",
  number.cex = 0.7
)

13.5 Scatterplot

ggplot(cars_data, aes(x = wt, y = mpg)) +
  geom_point(size = 3, color = "#1F618D") +
  geom_smooth(method = "lm", se = TRUE, color = "#C0392B") +
  labs(
    title = "Relationship Between Car Weight and MPG",
    x = "Weight",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

13.6 Nonlinear Relationship Example

nonlinear_df <- tibble(
  x = seq(-5, 5, length.out = 200),
  y = x^2 + rnorm(200, 0, 3)
)

ggplot(nonlinear_df, aes(x = x, y = y)) +
  geom_point(alpha = 0.7, color = "#2E86C1") +
  geom_smooth(se = FALSE, color = "#C0392B") +
  labs(
    title = "Example of a Nonlinear Relationship",
    x = "X",
    y = "Y"
  ) +
  theme_minimal()

Correlation does not prove causation. A strong association does not automatically mean that one variable causes the other.

14 Missing Data Analysis

14.1 Why Missing Data Matter

Missing data occur when values are not recorded, lost, unavailable, or intentionally skipped.

Missing data can cause:

biased estimates;
reduced sample size;
incorrect conclusions;
model failure.

14.2 Types of Missing Data

missing_types <- tibble(
  Type = c("MCAR", "MAR", "MNAR"),
  Full_Name = c(
    "Missing Completely at Random",
    "Missing at Random",
    "Missing Not at Random"
  ),
  Meaning = c(
    "Missingness unrelated to observed or unobserved data",
    "Missingness related to observed variables",
    "Missingness related to the missing value itself"
  ),
  Example = c(
    "Random machine error",
    "Older people less likely to answer online survey",
    "High-income people refusing to report income"
  )
)

kable(missing_types, caption = "Types of Missing Data") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Types of Missing Data
Type	Full_Name	Meaning	Example
MCAR	Missing Completely at Random	Missingness unrelated to observed or unobserved data	Random machine error
MAR	Missing at Random	Missingness related to observed variables	Older people less likely to answer online survey
MNAR	Missing Not at Random	Missingness related to the missing value itself	High-income people refusing to report income

14.3 Creating Missing Values

missing_data <- cars_data
missing_data$mpg[c(2, 5, 8)] <- NA
missing_data$hp[c(4, 10)] <- NA

colSums(is.na(missing_data))

## car_name      mpg      cyl     disp       hp     drat       wt     qsec 
##        0        3        0        0        2        0        0        0 
##       vs       am     gear     carb 
##        0        0        0        0

14.4 Missing Percentage

missing_summary <- missing_data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  mutate(Missing_Percentage = Missing_Count / nrow(missing_data) * 100)

kable(missing_summary, caption = "Missing Data Summary") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Missing Data Summary
Variable	Missing_Count	Missing_Percentage
car_name	0	0.000
mpg	3	9.375
cyl	0	0.000
disp	0	0.000
hp	2	6.250
drat	0	0.000
wt	0	0.000
qsec	0	0.000
vs	0	0.000
am	0	0.000
gear	0	0.000
carb	0	0.000

14.5 Visualising Missingness

missing_summary %>%
  filter(Missing_Count > 0) %>%
  ggplot(aes(x = reorder(Variable, Missing_Count), y = Missing_Count, fill = Variable)) +
  geom_col(color = "black") +
  coord_flip() +
  labs(
    title = "Missing Values by Variable",
    x = "Variable",
    y = "Missing Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

14.6 Simple Imputation

imputed_data <- missing_data %>%
  mutate(
    mpg = ifelse(is.na(mpg), median(mpg, na.rm = TRUE), mpg),
    hp = ifelse(is.na(hp), median(hp, na.rm = TRUE), hp)
  )

colSums(is.na(imputed_data))

## car_name      mpg      cyl     disp       hp     drat       wt     qsec 
##        0        0        0        0        0        0        0        0 
##       vs       am     gear     carb 
##        0        0        0        0

Mean or median imputation is simple, but it can underestimate variability. For formal research, consider multiple imputation or model-based imputation.

15 Standardisation, Transformation, and Feature Understanding

15.1 Why Standardise?

Standardisation converts a variable into a scale with mean 0 and standard deviation 1.

\[ z = \frac{x-\bar{x}}{s} \]

standardized_data <- mtcars %>%
  mutate(
    mpg_z = as.numeric(scale(mpg)),
    hp_z = as.numeric(scale(hp)),
    wt_z = as.numeric(scale(wt))
  )

head(standardized_data[, c("mpg", "mpg_z", "hp", "hp_z", "wt", "wt_z")])

##                    mpg      mpg_z  hp       hp_z    wt         wt_z
## Mazda RX4         21.0  0.1508848 110 -0.5350928 2.620 -0.610399567
## Mazda RX4 Wag     21.0  0.1508848 110 -0.5350928 2.875 -0.349785269
## Datsun 710        22.8  0.4495434  93 -0.7830405 2.320 -0.917004624
## Hornet 4 Drive    21.4  0.2172534 110 -0.5350928 3.215 -0.002299538
## Hornet Sportabout 18.7 -0.2307345 175  0.4129422 3.440  0.227654255
## Valiant           18.1 -0.3302874 105 -0.6080186 3.460  0.248094592

15.2 Why Transform Data?

Transformations are useful when:

data are highly skewed;
variance increases with the mean;
relationships are nonlinear;
model assumptions are violated.

15.3 Log Transformation Example

log_example <- tibble(
  original = rexp(1000, rate = 0.1),
  log_transformed = log(original)
)

log_example_long <- log_example %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(log_example_long, aes(x = Value, fill = Variable)) +
  geom_histogram(bins = 30, color = "white") +
  facet_wrap(~Variable, scales = "free") +
  labs(
    title = "Original and Log-Transformed Data",
    x = "Value",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

16 Simple Linear Regression as EDA

16.1 Why Include Regression in EDA?

Regression can be used in EDA to understand direction, strength, and approximate functional form of relationships.

A simple linear regression model is:

\[ Y_i = \beta_0 + \beta_1X_i + \epsilon_i \]

where:

\(Y_i\) is the outcome variable;
\(X_i\) is the explanatory variable;
\(\beta_0\) is the intercept;
\(\beta_1\) is the slope;
\(\epsilon_i\) is random error.

16.2 Slope and Intercept

The intercept is the expected value of \(Y\) when \(X=0\). The slope is the expected change in \(Y\) for one-unit increase in \(X\).

16.3 Example: MPG and Weight

model_simple <- lm(mpg ~ wt, data = mtcars)
summary(model_simple)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

16.4 Extracting Regression Coefficients

broom::tidy(model_simple) %>%
  kable(digits = 4, caption = "Simple Linear Regression Coefficients") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Simple Linear Regression Coefficients
term	estimate	std.error	statistic	p.value
(Intercept)	37.2851	1.8776	19.8576	0
wt	-5.3445	0.5591	-9.5590	0

16.5 Interpretation of Slope

If the slope for wt is negative, it means that as car weight increases, fuel efficiency decreases.

coef(model_simple)

## (Intercept)          wt 
##   37.285126   -5.344472

For example, if the slope is approximately -5.34, then a one-unit increase in car weight is associated with an average decrease of about 5.34 miles per gallon.

16.6 Visualising the Regression Line

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3, color = "#1F618D") +
  geom_smooth(method = "lm", se = TRUE, color = "#C0392B") +
  labs(
    title = "Simple Linear Regression: MPG versus Weight",
    subtitle = "Regression line shows expected MPG for each value of weight",
    x = "Weight",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

16.7 Model Fit

broom::glance(model_simple) %>%
  kable(digits = 4, caption = "Model Fit Statistics") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Model Fit Statistics
r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.7528	0.7446	3.0459	91.3753	0	1	-80.0147	166.0294	170.4266	278.3219	30	32

16.8 Residuals

A residual is the difference between observed and fitted value.

\[ e_i = y_i - \hat{y}_i \]

reg_aug <- broom::augment(model_simple)

head(reg_aug)

## # A tibble: 6 × 9
##   .rownames           mpg    wt .fitted .resid   .hat .sigma  .cooksd .std.resid
##   <chr>             <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
## 1 Mazda RX4          21    2.62    23.3 -2.28  0.0433   3.07  1.33e-2    -0.766 
## 2 Mazda RX4 Wag      21    2.88    21.9 -0.920 0.0352   3.09  1.72e-3    -0.307 
## 3 Datsun 710         22.8  2.32    24.9 -2.09  0.0584   3.07  1.54e-2    -0.706 
## 4 Hornet 4 Drive     21.4  3.22    20.1  1.30  0.0313   3.09  3.02e-3     0.433 
## 5 Hornet Sportabout  18.7  3.44    18.9 -0.200 0.0329   3.10  7.60e-5    -0.0668
## 6 Valiant            18.1  3.46    18.8 -0.693 0.0332   3.10  9.21e-4    -0.231

ggplot(reg_aug, aes(x = .fitted, y = .resid)) +
  geom_point(size = 3, color = "#2E86C1") +
  geom_hline(yintercept = 0, color = "#C0392B", linewidth = 1) +
  labs(
    title = "Residual Plot",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal()

For a good linear model, residuals should not show a strong curve or funnel pattern.

17 Multiple Regression and Model Diagnostics

17.1 Multiple Regression Model

A multiple linear regression model is:

\[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \cdots + \beta_pX_{pi} + \epsilon_i \]

17.2 Example: Predicting MPG

model_multiple <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model_multiple)

## 
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## hp          -0.01804    0.01188  -1.519 0.140015    
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

17.3 Coefficient Table

broom::tidy(model_multiple) %>%
  kable(digits = 4, caption = "Multiple Regression Coefficients") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Multiple Regression Coefficients
term	estimate	std.error	statistic	p.value
(Intercept)	38.7518	1.7869	21.6870	0.0000
wt	-3.1670	0.7406	-4.2764	0.0002
hp	-0.0180	0.0119	-1.5188	0.1400
cyl	-0.9416	0.5509	-1.7092	0.0985

17.4 Interpretation

In multiple regression, each coefficient represents the expected change in the outcome for one-unit increase in that predictor, while holding the other predictors constant.

17.5 Confidence Intervals

confint(model_multiple) %>%
  as.data.frame() %>%
  rownames_to_column("Term") %>%
  kable(digits = 4, caption = "Confidence Intervals for Regression Coefficients") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Confidence Intervals for Regression Coefficients
Term	2.5 %	97.5 %
(Intercept)	35.0916	42.4120
wt	-4.6840	-1.6500
hp	-0.0424	0.0063
cyl	-2.0701	0.1869

17.6 Multicollinearity

Multicollinearity occurs when predictors are strongly correlated with each other.

car::vif(model_multiple)

##       wt       hp      cyl 
## 2.580486 3.258481 4.757456

17.7 Diagnostic Plots

par(mfrow = c(2, 2))
plot(model_multiple)

par(mfrow = c(1, 1))

17.8 Normality of Residuals

resid_df <- tibble(residuals = residuals(model_multiple))

ggplot(resid_df, aes(sample = residuals)) +
  stat_qq(color = "#2E86C1") +
  stat_qq_line(color = "#C0392B") +
  labs(
    title = "Q-Q Plot of Regression Residuals",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal()

17.9 Heteroscedasticity Check

lmtest::bptest(model_multiple)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_multiple
## BP = 2.9351, df = 3, p-value = 0.4017

Regression in EDA should be interpreted carefully. A statistically significant coefficient does not automatically imply causality.

18 Complete EDA Workflow and Case Study

18.1 Step-by-Step EDA Framework

A good EDA report should be systematic, reproducible, visual, and critical.

Recommended workflow:

Define the purpose of analysis.
Identify the unit of observation.
Inspect structure and variable types.
Check missing data.
Check duplicates.
Summarise numerical variables.
Summarise categorical variables.
Visualise distributions.
Detect outliers.
Study relationships.
Check assumptions.
Fit simple exploratory models.
Write clear interpretation.
State limitations.

18.2 Complete EDA Function

quick_eda <- function(data){
  numeric_part <- data %>% dplyr::select(where(is.numeric))

  list(
    dimensions = dim(data),
    variable_names = names(data),
    missing_values = colSums(is.na(data)),
    numeric_summary = psych::describe(numeric_part)
  )
}

quick_eda(mtcars)

## $dimensions
## [1] 32 11
## 
## $variable_names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
## 
## $missing_values
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0 
## 
## $numeric_summary
##      vars  n   mean     sd median trimmed    mad   min    max  range  skew
## mpg     1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61
## cyl     2 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17
## disp    3 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38
## hp      4 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73
## drat    5 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27
## wt      6 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42
## qsec    7 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37
## vs      8 32   0.44   0.50   0.00    0.42   0.00  0.00   1.00   1.00  0.24
## am      9 32   0.41   0.50   0.00    0.38   0.00  0.00   1.00   1.00  0.36
## gear   10 32   3.69   0.74   4.00    3.62   1.48  3.00   5.00   2.00  0.53
## carb   11 32   2.81   1.62   2.00    2.65   1.48  1.00   8.00   7.00  1.05
##      kurtosis    se
## mpg     -0.37  1.07
## cyl     -1.76  0.32
## disp    -1.21 21.91
## hp      -0.14 12.12
## drat    -0.71  0.09
## wt      -0.02  0.17
## qsec     0.34  0.32
## vs      -2.00  0.09
## am      -1.92  0.09
## gear    -1.07  0.13
## carb     1.26  0.29

18.3 Case Study Question

Research Question: How are car weight, horsepower, transmission type, and number of cylinders related to fuel efficiency?

18.4 Grouped Summary

cars_data %>%
  group_by(cyl, am) %>%
  summarise(
    n = n(),
    mean_mpg = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg = sd(mpg),
    mean_hp = mean(hp),
    mean_wt = mean(wt),
    .groups = "drop"
  ) %>%
  kable(digits = 2, caption = "Grouped Summary by Cylinder and Transmission") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Grouped Summary by Cylinder and Transmission
cyl	am	n	mean_mpg	median_mpg	sd_mpg	mean_hp	mean_wt
4	Automatic	3	22.90	22.80	1.45	84.67	2.94
4	Manual	8	28.08	28.85	4.48	81.88	2.04
6	Automatic	4	19.12	18.65	1.63	115.25	3.39
6	Manual	3	20.57	21.00	0.75	131.67	2.76
8	Automatic	12	15.05	15.20	2.77	194.17	4.10
8	Manual	2	15.40	15.40	0.57	299.50	3.37

18.5 Visual Case Study

ggplot(cars_data, aes(x = wt, y = mpg, color = cyl, shape = am)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "MPG Declines as Car Weight Increases",
    subtitle = "Relationship shown by cylinder group and transmission type",
    x = "Car Weight",
    y = "Miles per Gallon",
    color = "Cylinders",
    shape = "Transmission"
  ) +
  theme_minimal()

18.6 Pairwise Relationship Plot

GGally::ggpairs(
  cars_data[, c("mpg", "hp", "wt", "qsec")]
)

18.7 Final Exploratory Model

final_eda_model <- lm(mpg ~ wt + hp + cyl + am, data = cars_data)
summary(final_eda_model)

## 
## Call:
## lm(formula = mpg ~ wt + hp + cyl + am, data = cars_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## amManual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

18.8 Model Interpretation

broom::tidy(final_eda_model) %>%
  kable(digits = 4, caption = "Final Exploratory Model Coefficients") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Final Exploratory Model Coefficients
term	estimate	std.error	statistic	p.value
(Intercept)	33.7083	2.6049	12.9404	0.0000
wt	-2.4968	0.8856	-2.8194	0.0091
hp	-0.0321	0.0137	-2.3450	0.0269
cyl6	-3.0313	1.4073	-2.1540	0.0407
cyl8	-2.1637	2.2843	-0.9472	0.3523
amManual	1.8092	1.3963	1.2957	0.2065

The EDA suggests that heavier cars tend to have lower fuel efficiency. Cars with more cylinders generally show lower MPG. Horsepower and weight are positively related, while MPG and weight are negatively related. Regression provides a structured way to quantify these patterns.

19 Review Questions

What is the difference between descriptive and inferential statistics?
Why is the median better than the mean for skewed data?
What does positive skewness mean?
What does a boxplot show?
Why should missing data be checked before modelling?
What is the difference between covariance and correlation?
Why does correlation not imply causation?
What is the difference between variance and standard deviation?
When should we use a bar chart instead of a histogram?
Why should outliers not be deleted automatically?
What is the meaning of a regression slope?
What is the meaning of a regression intercept?
Why should residuals be checked?
What is multicollinearity?
Why should EDA include both graphs and numerical summaries?

20 Practical Exercises with Solutions

20.1 Exercise: Iris Dataset Summary

Use the iris dataset and calculate mean, median, standard deviation, skewness, and kurtosis for Sepal.Length.

data(iris)

tibble(
  Mean = mean(iris$Sepal.Length),
  Median = median(iris$Sepal.Length),
  SD = sd(iris$Sepal.Length),
  Skewness = skewness(iris$Sepal.Length),
  Kurtosis = kurtosis(iris$Sepal.Length)
) %>%
  kable(digits = 3, caption = "Summary of Sepal Length") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Summary of Sepal Length
Mean	Median	SD	Skewness	Kurtosis
5.843	5.8	0.828	0.312	2.426

20.2 Exercise: Histogram and Density Plot

ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = "#5DADE2", color = "white") +
  geom_density(color = "#C0392B", linewidth = 1.2) +
  labs(
    title = "Histogram and Density Plot of Sepal Length",
    x = "Sepal Length",
    y = "Density"
  ) +
  theme_minimal()

20.3 Exercise: Boxplot by Species

ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  labs(
    title = "Sepal Length by Iris Species",
    x = "Species",
    y = "Sepal Length"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

20.4 Exercise: Correlation Matrix

iris_numeric <- iris %>% dplyr::select(where(is.numeric))
corrplot(cor(iris_numeric), method = "color", addCoef.col = "black")

20.5 Exercise: Regression with Iris Data

iris_model <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Species, data = iris)
summary(iris_model)

## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length + Petal.Width + Species, 
##     data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75238 -0.23089 -0.00211  0.23100  1.03108 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.682982   0.107403  34.291  < 2e-16 ***
## Petal.Length       0.905946   0.074311  12.191  < 2e-16 ***
## Petal.Width       -0.005995   0.156260  -0.038    0.969    
## Speciesversicolor -1.598362   0.205706  -7.770 1.32e-12 ***
## Speciesvirginica  -2.112647   0.304024  -6.949 1.16e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3392 on 145 degrees of freedom
## Multiple R-squared:  0.8367, Adjusted R-squared:  0.8322 
## F-statistic: 185.8 on 4 and 145 DF,  p-value: < 2.2e-16

broom::tidy(iris_model) %>%
  kable(digits = 4, caption = "Regression Model for Iris Data") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Regression Model for Iris Data
term	estimate	std.error	statistic	p.value
(Intercept)	3.6830	0.1074	34.2913	0.0000
Petal.Length	0.9059	0.0743	12.1913	0.0000
Petal.Width	-0.0060	0.1563	-0.0384	0.9694
Speciesversicolor	-1.5984	0.2057	-7.7701	0.0000
Speciesvirginica	-2.1126	0.3040	-6.9489	0.0000

20.6 Exercise: Air Quality Dataset

data(airquality)

airquality_clean <- airquality %>%
  tidyr::drop_na()

summary(airquality_clean)

##      Ozone          Solar.R           Wind            Temp      
##  Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
##  1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
##  Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
##  Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
##  3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
##  Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
##      Month            Day       
##  Min.   :5.000   Min.   : 1.00  
##  1st Qu.:6.000   1st Qu.: 9.00  
##  Median :7.000   Median :16.00  
##  Mean   :7.216   Mean   :15.95  
##  3rd Qu.:9.000   3rd Qu.:22.50  
##  Max.   :9.000   Max.   :31.00

ggplot(airquality_clean, aes(x = Temp, y = Ozone)) +
  geom_point(color = "#2E86C1", size = 3) +
  geom_smooth(method = "lm", color = "#C0392B") +
  labs(
    title = "Ozone and Temperature",
    x = "Temperature",
    y = "Ozone"
  ) +
  theme_minimal()

air_model <- lm(Ozone ~ Temp + Wind + Solar.R, data = airquality_clean)
summary(air_model)

## 
## Call:
## lm(formula = Ozone ~ Temp + Wind + Solar.R, data = airquality_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

21 Final Summary

This book-style tutorial covered the full foundation of Exploratory Data Analysis and Descriptive Statistics using R, from basic summaries to regression-based exploratory modelling.

You learned how to:

identify data types and measurement scales;
summarise numerical variables;
summarise categorical variables;
calculate mean, median, mode, variance, SD, IQR, and CV;
interpret skewness and kurtosis;
detect outliers;
analyse missing data;
visualise distributions;
study relationships;
use correlation and regression;
interpret slope and intercept;
check residuals and model diagnostics;
prepare a complete EDA workflow.

The best data analysts do not simply run code. They ask critical questions, inspect assumptions, compare summaries with graphs, and explain results clearly.

22 Suggested References

Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
Wickham, H., & Grolemund, G. (2017). R for Data Science. O’Reilly.
Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Sage.
Dalgaard, P. (2008). Introductory Statistics with R. Springer.
Kabacoff, R. (2022). R in Action. Manning.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Springer.

23 Any Questions or feedback ?

Please email Dr Hom at homnath1988@gmail.com. I hope you love numbers like me..

Exploratory Data Analysis and Descriptive Statistics Using R

A Complete Book-Style Tutorial from Basic to Advanced Level

Dr Hom Nath Dhungana, Dr Anju Kalluvelil Janardhanan, Dr Radhika Kuttala, Dr Eric Howard

2026-06-05