This R Markdown document walks through a simple exploratory data analysis (EDA) using the built-in mtcars dataset.
It mirrors the code in your 02_Simple_EDA.R script and adds brief explanations between steps.
Learning objectives
- Practice a reproducible workflow (seed, package loading).
- Inspect, clean, and transform data.
- Explore univariate and bivariate relationships.
- Compute correlations and fit a small linear model.
- Produce plots and tidy tabular outputs.
We set practical knitr defaults: show code by default, hide warnings/messages in the output, and center figures.
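The setup chunk itself is not shown in the rendered output. A minimal sketch of what it might contain, assuming the standard knitr chunk options echo, warning, message, and fig.align (the exact values used in the original document are an assumption):

```r
# Assumed setup chunk; option names are standard knitr chunk options,
# but the values here are inferred from the description above.
knitr::opts_chunk$set(
  echo = TRUE,          # show code by default
  warning = FALSE,      # hide warnings in the output
  message = FALSE,      # hide messages in the output
  fig.align = "center"  # center figures
)
```

Options set this way apply to every later chunk unless a chunk overrides them in its own header.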
Set a seed so any operations that rely on randomness are stable across re‑runs.
set.seed(123)
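Nothing later in this walkthrough draws random numbers, so the seed has no visible effect here. As a small illustration of what it guarantees, the sketch below (the names first_draw and second_draw are made up for this example) resets the seed before each draw and checks that the two draws match:

```r
# Resetting the seed before each draw makes the two samples identical.
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)
```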
This chunk ensures the required packages are available and loaded.
Your original script called install_if_missing(...) but did not define it; we define a minimal helper here to keep the document self-contained.
install_if_missing <- function(pkgs) {
  # Install only the packages that are not already present
  to_install <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(to_install)) install.packages(to_install, dependencies = TRUE)
}
install_if_missing(c("dplyr", "ggplot2", "readr", "tibble", "skimr", "reshape2"))
suppressPackageStartupMessages({
  library(dplyr)
  library(ggplot2)
  library(readr)
  library(tibble)
  library(skimr)
  library(reshape2)
})
We'll use the built-in mtcars dataset.
In practice, you would replace this with readr::read_csv() (or similar) to read from a file.
data <- mtcars
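As a sketch of the file-based alternative, the example below first writes mtcars to a temporary CSV so that there is something to read. The path csv_path and the name cars_from_file are placeholders for this illustration; with real data you would point read_csv() at your own file.

```r
library(readr)

# Hypothetical input file: write mtcars to a temporary CSV so the example runs.
csv_path <- tempfile(fileext = ".csv")
write_csv(tibble::rownames_to_column(mtcars, var = "model"), csv_path)

# Read it back; show_col_types = FALSE suppresses the column-specification message.
cars_from_file <- read_csv(csv_path, show_col_types = FALSE)
dim(cars_from_file)
```

Note that a CSV has no row names, so the car names travel as an ordinary model column here.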
We examine shape, variable names, structure, summaries, and missingness.
We also compute a small DIY describe table for numeric variables and (optionally) use skimr for a more detailed skim.
# Dimensions
dim(data); nrow(data); ncol(data)
## [1] 32 11
## [1] 32
## [1] 11
# Peek
head(data, 5)
# Names, structure
names(data)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
str(data)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# Summaries
summary(data)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
# Missingness by column
colSums(is.na(data))
## mpg cyl disp hp drat wt qsec vs am gear carb
## 0 0 0 0 0 0 0 0 0 0 0
# DIY quick describe for numeric columns
quick_describe <- function(df) {
  # Keep numeric columns only (select_if() is superseded by select(where()))
  num <- dplyr::select(df, where(is.numeric))
  tibble::tibble(
    var = names(num),
    n = sapply(num, function(x) sum(!is.na(x))),
    mean = sapply(num, mean, na.rm = TRUE),
    sd = sapply(num, sd, na.rm = TRUE),
    min = sapply(num, min, na.rm = TRUE),
    # names = FALSE drops the "25%"/"75%" labels that quantile() attaches
    p25 = sapply(num, quantile, probs = 0.25, na.rm = TRUE, names = FALSE),
    p50 = sapply(num, median, na.rm = TRUE),
    p75 = sapply(num, quantile, probs = 0.75, na.rm = TRUE, names = FALSE),
    max = sapply(num, max, na.rm = TRUE)
  ) |>
    dplyr::arrange(var)
}
desc_tbl <- quick_describe(data)
desc_tbl
If you have skimr installed, skim() provides a comprehensive profile:
skim(data)
Name | data |
Number of rows | 32 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
numeric | 11 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
cyl | 0 | 1 | 6.19 | 1.79 | 4.00 | 4.00 | 6.00 | 8.00 | 8.00 | ▆▁▃▁▇ |
disp | 0 | 1 | 230.72 | 123.94 | 71.10 | 120.83 | 196.30 | 326.00 | 472.00 | ▇▃▃▃▂ |
hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
drat | 0 | 1 | 3.60 | 0.53 | 2.76 | 3.08 | 3.70 | 3.92 | 4.93 | ▇▃▇▅▁ |
wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
qsec | 0 | 1 | 17.85 | 1.79 | 14.50 | 16.89 | 17.71 | 18.90 | 22.90 | ▃▇▇▂▁ |
vs | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
am | 0 | 1 | 0.41 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
gear | 0 | 1 | 3.69 | 0.74 | 3.00 | 3.00 | 4.00 | 4.00 | 5.00 | ▇▁▆▁▂ |
carb | 0 | 1 | 2.81 | 1.62 | 1.00 | 2.00 | 2.00 | 4.00 | 8.00 | ▇▂▅▁▁ |
We keep the original row names as a proper model column, and convert a few numerically coded categorical variables (cyl, gear, am) to factors for better plotting and summary behavior.
data <- data |>
  tibble::rownames_to_column(var = "model") |>
  mutate(
    cyl = factor(cyl, levels = sort(unique(cyl))),
    gear = factor(gear, levels = sort(unique(gear))),
    am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
  )
data
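A quick sanity check after a conversion like this is to confirm that the columns really are factors and that the labels landed on the right codes. The sketch below rebuilds the cleaned data from mtcars under the name cleaned (a name chosen for this example) so that it runs on its own:

```r
library(dplyr)

# Rebuild the cleaned data frame so this check is self-contained.
cleaned <- mtcars |>
  tibble::rownames_to_column(var = "model") |>
  mutate(
    cyl = factor(cyl, levels = sort(unique(cyl))),
    gear = factor(gear, levels = sort(unique(gear))),
    am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
  )

# Are the converted columns factors?
sapply(cleaned[c("cyl", "gear", "am")], is.factor)

# Cross-tabulate the original 0/1 codes against the new labels.
table(original = mtcars$am, relabeled = cleaned$am)
```

In the cross-tabulation, every 0 should fall under Automatic and every 1 under Manual, with no counts off the diagonal.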
Look at the distribution of MPG overall and across transmission types.
summary(data$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
p_hist_mpg <- ggplot(data, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(title = "Distribution of MPG", x = "Miles per gallon (mpg)", y = "Count")
p_hist_mpg
Figure: Distribution of MPG
p_box_am <- ggplot(data, aes(x = am, y = mpg)) +
  geom_boxplot() +
  labs(title = "MPG by Transmission Type", x = "Transmission", y = "MPG")
p_box_am
Figure: MPG by Transmission Type
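The boxplot can be backed up with numbers. A minimal sketch, assuming a per-group summary is what you want alongside the plot (the name mpg_by_am and the choice of statistics are illustrative, not part of the original script):

```r
library(dplyr)

# Numeric companion to the boxplot: count, center, and spread of mpg per transmission.
mpg_by_am <- mtcars |>
  mutate(am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))) |>
  group_by(am) |>
  summarise(
    n = n(),
    median_mpg = median(mpg),
    mean_mpg = mean(mpg),
    iqr_mpg = IQR(mpg),
    .groups = "drop"
  )
mpg_by_am
```

The median and IQR correspond to the middle line and box height in the boxplot, so the table and the figure can be read against each other.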
We explore the MPG–weight relationship, colored by cylinder count, and compute a grouped summary table.
p_scatter <- ggplot(data, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Heavier Cars Tend to Have Lower MPG",
       x = "Weight (1000 lbs)", y = "MPG", color = "Cylinders")
p_scatter
## `geom_smooth()` using formula = 'y ~ x'
Figure: Heavier Cars Tend to Have Lower MPG
avg_mpg_tbl <- data |>
  group_by(cyl, am) |>
  summarise(
    avg_mpg = mean(mpg, na.rm = TRUE),
    n = dplyr::n(),
    .groups = "drop"
  ) |>
  arrange(cyl, am)
avg_mpg_tbl
Compute a Pearson correlation matrix across numeric variables and show a simple heatmap using ggplot2.
num_data <- data |>
  select(where(is.numeric)) |>
  select(where(~ length(unique(.x)) > 1))
cor_mat <- cor(num_data, use = "pairwise.complete.obs", method = "pearson")
cor_mat[1:6, 1:6] # preview
## mpg disp hp drat wt qsec
## mpg 1.0000000 -0.8475514 -0.7761684 0.68117191 -0.8676594 0.41868403
## disp -0.8475514 1.0000000 0.7909486 -0.71021393 0.8879799 -0.43369788
## hp -0.7761684 0.7909486 1.0000000 -0.44875912 0.6587479 -0.70822339
## drat 0.6811719 -0.7102139 -0.4487591 1.00000000 -0.7124406 0.09120476
## wt -0.8676594 0.8879799 0.6587479 -0.71244065 1.0000000 -0.17471588
## qsec 0.4186840 -0.4336979 -0.7082234 0.09120476 -0.1747159 1.00000000
cor_melt <- reshape2::melt(cor_mat)
p_cor_heatmap <- ggplot(cor_melt, aes(Var2, Var1, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       limits = c(-1, 1), name = "Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Correlation Heatmap (ggplot2)", x = "", y = "") +
  coord_fixed()
p_cor_heatmap
Figure: Correlation Heatmap (ggplot2)
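A heatmap is good for spotting patterns but makes exact rankings hard to read. One possible follow-up, sketched here with illustrative names (num_vars, cor_with_mpg), is to pull out the mpg column of the correlation matrix and sort it by absolute strength:

```r
# Numeric columns that remain after cyl, gear, and am are converted to factors.
num_vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec", "vs", "carb")]

# Correlation of every other variable with mpg, strongest first.
cor_with_mpg <- cor(num_vars, use = "pairwise.complete.obs", method = "pearson")[, "mpg"]
cor_with_mpg <- cor_with_mpg[names(cor_with_mpg) != "mpg"]
cor_with_mpg[order(abs(cor_with_mpg), decreasing = TRUE)]
```

Sorting by absolute value keeps strong negative correlations (such as weight) at the top rather than at the bottom of the list.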
Fit a small model predicting MPG from weight and horsepower.
Interpretation tip: coefficients are conditional on the other variables in the model.
fit <- lm(mpg ~ wt + hp, data = data)
summary(fit)
##
## Call:
## lm(formula = mpg ~ wt + hp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.941 -1.600 -0.182 1.050 5.854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
## wt -3.87783 0.63273 -6.129 1.12e-06 ***
## hp -0.03177 0.00903 -3.519 0.00145 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
## F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
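To see what "conditional on the other variables" means in practice, one option is to compare the weight coefficient with and without horsepower in the model. This is a sketch for illustration; the object names fit_wt and fit_wt_hp are not from the original script.

```r
# Same outcome, two models: weight alone vs. weight adjusted for horsepower.
fit_wt <- lm(mpg ~ wt, data = mtcars)
fit_wt_hp <- lm(mpg ~ wt + hp, data = mtcars)

# The wt slope changes once hp is held fixed.
c(
  wt_only = unname(coef(fit_wt)["wt"]),
  wt_given_hp = unname(coef(fit_wt_hp)["wt"])
)
```

The slope from the weight-only model is larger in magnitude because heavier cars also tend to have more horsepower, so part of what looks like a weight effect is absorbed by hp once it enters the model.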
Construct a compact table with a few useful columns plus a simple derived feature.
report_tbl <- data |>
  transmute(
    model,
    mpg,
    wt,
    hp,
    cyl,
    am,
    power_to_weight = hp / wt
  )
report_tbl
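If the table needs to leave the R session, it can be sorted and written to disk. The sketch below assumes a CSV is the desired output and uses a temporary directory; the file name report_tbl.csv and the object name report are placeholders.

```r
library(dplyr)
library(readr)

# Rebuild a reduced version of the report table so the example is self-contained.
report <- mtcars |>
  tibble::rownames_to_column(var = "model") |>
  transmute(model, mpg, wt, hp, power_to_weight = hp / wt) |>
  arrange(desc(power_to_weight))  # highest power-to-weight first

# Hypothetical output location; replace with a path in your project.
out_path <- file.path(tempdir(), "report_tbl.csv")
write_csv(report, out_path)
file.exists(out_path)
```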