Overview

This R Markdown document walks through a simple exploratory data analysis (EDA) of the built‑in mtcars dataset.
It mirrors the code in your 02_Simple_EDA.R script and adds brief explanations between steps.

Learning objectives

- Practice a reproducible workflow (seed, package loading).
- Inspect, clean, and transform data.
- Explore univariate and bivariate relationships.
- Compute correlations and fit a small linear model.
- Produce plots and tidy tabular outputs.

0. Setup

We set practical knitr defaults: show code by default, hide warnings/messages in the output, and center figures.
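
The setup chunk itself does not appear in this rendering. A minimal sketch of what it could look like, assuming the standard knitr chunk options echo, warning, message, and fig.align (the exact values in the original chunk are an assumption based on the description above):

```r
# Assumed setup chunk; values mirror the prose description, not a confirmed original.
knitr::opts_chunk$set(
  echo      = TRUE,      # show code by default
  warning   = FALSE,     # hide warnings in the rendered output
  message   = FALSE,     # hide messages in the rendered output
  fig.align = "center"   # center figures
)
```

In an .Rmd file this usually sits in the first chunk with include=FALSE so that the chunk itself is not rendered.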

1. Reproducibility

Set a seed so any operations that rely on randomness are stable across re‑runs.

set.seed(123)
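
To see what the seed buys you, the short sketch below resets it before two identical random draws. The variable names are illustrative only; nothing else in this walkthrough depends on them.

```r
# Resetting the seed before each draw reproduces the same random permutation.
set.seed(123)
first_draw <- sample(10)

set.seed(123)
second_draw <- sample(10)

identical(first_draw, second_draw)  # TRUE
```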

2. Packages

This chunk ensures the required packages are available and loaded.
Your original script called install_if_missing(...) but did not define it; we define a minimal helper here to keep the document self‑contained.

install_if_missing <- function(pkgs) {
  to_install <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(to_install)) install.packages(to_install, dependencies = TRUE)
}

install_if_missing(c("dplyr", "ggplot2", "readr", "tibble", "skimr", "reshape2"))

suppressPackageStartupMessages({
  library(dplyr)
  library(ggplot2)
  library(readr)
  library(tibble)
  library(skimr)
  library(reshape2)
})
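
If you would rather check for missing packages without installing anything, one possible variant uses requireNamespace(). The helper name missing_packages is hypothetical and is not part of the original script.

```r
# Hypothetical helper: report which packages are missing, without installing them.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

# Both of these ship with R, so nothing should be reported as missing.
missing_packages(c("stats", "utils"))
```

This avoids scanning the whole library with installed.packages(), at the cost of loading each package's namespace while checking.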

3. Load data

We’ll use the built‑in mtcars dataset. In practice, you would replace this with readr::read_csv() (or similar) to read from a file.

data <- mtcars
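
As a rough sketch of the file-based alternative, the example below writes mtcars to a temporary CSV and reads it back with readr::read_csv(). The temporary file stands in for your real path, and data_from_file is an illustrative name chosen so the data object above is left untouched.

```r
library(readr)

# A temporary file stands in for a real path such as "data/cars.csv" (hypothetical).
csv_path <- tempfile(fileext = ".csv")
write_csv(tibble::rownames_to_column(mtcars, var = "model"), csv_path)

# show_col_types = FALSE silences the column-specification message.
data_from_file <- read_csv(csv_path, show_col_types = FALSE)
dim(data_from_file)
```

Note that CSV files have no row names, so the car names travel as an ordinary model column.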

4. Initial inspection

We examine shape, variable names, structure, summaries, and missingness.
We also compute a small DIY describe table for numeric variables and (optionally) use skimr for a more detailed skim.

# Dimensions
dim(data); nrow(data); ncol(data)
## [1] 32 11
## [1] 32
## [1] 11
# Peek
head(data, 5)
# Names, structure
names(data)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
str(data)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Summaries
summary(data)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
# Missingness by column
colSums(is.na(data))
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0
# DIY quick describe for numeric columns
quick_describe <- function(df) {
  num <- dplyr::select(df, where(is.numeric))  # select_if() is superseded
  tibble::tibble(
    var   = names(num),
    n     = sapply(num, function(x) sum(!is.na(x))),
    mean  = sapply(num, mean, na.rm = TRUE),
    sd    = sapply(num, sd, na.rm = TRUE),
    min   = sapply(num, min, na.rm = TRUE),
    p25   = sapply(num, quantile, probs = 0.25, na.rm = TRUE),
    p50   = sapply(num, median, na.rm = TRUE),
    p75   = sapply(num, quantile, probs = 0.75, na.rm = TRUE),
    max   = sapply(num, max, na.rm = TRUE)
  ) |>
  dplyr::arrange(var)
}

desc_tbl <- quick_describe(data)
desc_tbl

If you have skimr installed, skim() provides a comprehensive profile:

skim(data)
Data summary

Name                     data
Number of rows           32
Number of columns        11
Column type frequency:
  numeric                11
Group variables          None

Variable type: numeric

skim_variable  n_missing  complete_rate    mean      sd     p0     p25     p50     p75    p100  hist
mpg                    0              1   20.09    6.03  10.40   15.43   19.20   22.80   33.90  ▃▇▅▁▂
cyl                    0              1    6.19    1.79   4.00    4.00    6.00    8.00    8.00  ▆▁▃▁▇
disp                   0              1  230.72  123.94  71.10  120.83  196.30  326.00  472.00  ▇▃▃▃▂
hp                     0              1  146.69   68.56  52.00   96.50  123.00  180.00  335.00  ▇▇▆▃▁
drat                   0              1    3.60    0.53   2.76    3.08    3.70    3.92    4.93  ▇▃▇▅▁
wt                     0              1    3.22    0.98   1.51    2.58    3.33    3.61    5.42  ▃▃▇▁▂
qsec                   0              1   17.85    1.79  14.50   16.89   17.71   18.90   22.90  ▃▇▇▂▁
vs                     0              1    0.44    0.50   0.00    0.00    0.00    1.00    1.00  ▇▁▁▁▆
am                     0              1    0.41    0.50   0.00    0.00    0.00    1.00    1.00  ▇▁▁▁▆
gear                   0              1    3.69    0.74   3.00    3.00    4.00    4.00    5.00  ▇▁▆▁▂
carb                   0              1    2.81    1.62   1.00    2.00    2.00    4.00    8.00  ▇▂▅▁▁
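
If skimr might not be available on every machine that knits this document, one way to keep the step optional is to guard the call. This is a sketch only; the fallback to summary() and the name profile are assumptions, not part of the original script.

```r
# Use skimr when it is installed; otherwise fall back to base summary().
if (requireNamespace("skimr", quietly = TRUE)) {
  profile <- skimr::skim(mtcars)
} else {
  profile <- summary(mtcars)
}

profile
```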

5. Basic cleaning / feature prep

We move the original row names into a proper model column and convert three numerically coded categorical variables (cyl, gear, and am) to factors so that plots and summaries treat them as groups rather than as continuous numbers.

data <- data |>
  tibble::rownames_to_column(var = "model") |>
  mutate(
    cyl  = factor(cyl, levels = sort(unique(cyl))),
    gear = factor(gear, levels = sort(unique(gear))),
    am   = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
  )

data
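
A quick way to confirm the conversion behaved as intended is to look at factor levels and counts. The sketch below repeats the conversion in base R on a copy named prepped (an illustrative name) so that it runs on its own.

```r
# Rebuild two of the factor conversions on a copy of mtcars.
prepped <- mtcars
prepped$cyl <- factor(prepped$cyl, levels = sort(unique(prepped$cyl)))
prepped$am  <- factor(prepped$am, levels = c(0, 1),
                      labels = c("Automatic", "Manual"))

levels(prepped$cyl)    # "4" "6" "8"
table(prepped$am)      # Automatic: 19, Manual: 13
summary(prepped$cyl)   # counts per level instead of quantiles
```

Once cyl is a factor, summary() reports how many cars fall in each group, which is usually the more meaningful summary for a coded variable.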

6. Univariate analysis

Look at the distribution of MPG overall and across transmission types.

summary(data$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90
p_hist_mpg <- ggplot(data, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(title = "Distribution of MPG", x = "Miles per gallon (mpg)", y = "Count")

p_hist_mpg

Figure: Distribution of MPG

p_box_am <- ggplot(data, aes(x = am, y = mpg)) +
  geom_boxplot() +
  labs(title = "MPG by Transmission Type", x = "Transmission", y = "MPG")

p_box_am

Figure: MPG by Transmission Type
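
The boxplot can be backed up with numbers. The sketch below computes the group size, median, and interquartile range of MPG per transmission type in base R; am_label and mpg_by_am are illustrative names.

```r
# Numeric companion to the boxplot: n, median, and IQR of mpg by transmission.
am_label <- factor(mtcars$am, levels = c(0, 1),
                   labels = c("Automatic", "Manual"))

mpg_by_am <- data.frame(
  n      = as.vector(table(am_label)),
  median = as.vector(tapply(mtcars$mpg, am_label, median)),
  iqr    = as.vector(tapply(mtcars$mpg, am_label, IQR)),
  row.names = levels(am_label)
)

mpg_by_am
```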

7. Bivariate analysis

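We explore the relationship between MPG and weight, colored by cylinder count, and compute a grouped summary table. Because color is mapped to cyl, geom_smooth() fits a separate least‑squares line for each cylinder group rather than one overall line.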

p_scatter <- ggplot(data, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Heavier Cars Tend to Have Lower MPG",
       x = "Weight (1000 lbs)", y = "MPG", color = "Cylinders")

p_scatter
## `geom_smooth()` using formula = 'y ~ x'

Figure: Heavier Cars Tend to Have Lower MPG
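
To put numbers on the fitted lines in the scatter plot, you can fit the same simple regression within each cylinder group. This is a sketch under the assumption that a separate mpg ~ wt fit per group is what you want to inspect; by_cyl and slopes are illustrative names.

```r
# Fit mpg ~ wt separately within each cylinder group and collect the slopes.
by_cyl <- split(mtcars, mtcars$cyl)
slopes <- sapply(by_cyl, function(d) coef(lm(mpg ~ wt, data = d))[["wt"]])

round(slopes, 2)
```

Each value is the estimated change in MPG per additional 1000 lbs within that group.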

avg_mpg_tbl <- data |>
  group_by(cyl, am) |>
  summarise(
    avg_mpg = mean(mpg, na.rm = TRUE),
    n       = dplyr::n(),
    .groups = "drop"
  ) |>
  arrange(cyl, am)

avg_mpg_tbl

8. Correlations (numeric only)

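Compute a Pearson correlation matrix across the numeric variables and show it as a simple heatmap using ggplot2. After the conversions in section 5, cyl, gear, and am are factors (and model is character), so they are excluded; the second select() also drops any constant column, whose correlation would be undefined.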

num_data <- data |>
  select(where(is.numeric)) |>
  select(where(~ length(unique(.x)) > 1))

cor_mat <- cor(num_data, use = "pairwise.complete.obs", method = "pearson")
cor_mat[1:6, 1:6]  # preview
##             mpg       disp         hp        drat         wt        qsec
## mpg   1.0000000 -0.8475514 -0.7761684  0.68117191 -0.8676594  0.41868403
## disp -0.8475514  1.0000000  0.7909486 -0.71021393  0.8879799 -0.43369788
## hp   -0.7761684  0.7909486  1.0000000 -0.44875912  0.6587479 -0.70822339
## drat  0.6811719 -0.7102139 -0.4487591  1.00000000 -0.7124406  0.09120476
## wt   -0.8676594  0.8879799  0.6587479 -0.71244065  1.0000000 -0.17471588
## qsec  0.4186840 -0.4336979 -0.7082234  0.09120476 -0.1747159  1.00000000
cor_melt <- reshape2::melt(cor_mat)

p_cor_heatmap <- ggplot(cor_melt, aes(Var2, Var1, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Correlation Heatmap (ggplot2)", x = "", y = "") +
  coord_fixed()

p_cor_heatmap

Figure: Correlation Heatmap (ggplot2)
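
A heatmap is easy to scan but hard to rank by eye. One possible complement is to list the variable pairs with the largest absolute correlation. The sketch below rebuilds the matrix from the eight columns that remain numeric after section 5; pairs_tbl is an illustrative name.

```r
# Columns that remain numeric once cyl, gear, and am have been converted to factors.
num_vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec", "vs", "carb")]
cor_mat  <- cor(num_vars, use = "pairwise.complete.obs", method = "pearson")

# Keep each pair once by taking only the upper triangle of the matrix.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pairs_tbl <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Strongest relationships first, regardless of sign.
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$r)), ]
head(pairs_tbl, 5)
```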

9. Simple linear model (illustrative)

Fit a small model predicting MPG from weight and horsepower.
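Interpretation tip: each coefficient is conditional on the other variables in the model. In the output below, the wt estimate of about -3.88 means that each additional 1000 lbs is associated with roughly 3.9 fewer mpg at a fixed horsepower.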

fit <- lm(mpg ~ wt + hp, data = data)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ wt + hp, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
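
Two common follow-ups to summary() are confidence intervals for the coefficients and a prediction for a new observation. The sketch below refits the same model on mtcars so it runs on its own; new_car is a hypothetical car invented for illustration.

```r
# Refit the model from this section.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# 95% confidence intervals for the intercept and both slopes.
ci <- confint(fit, level = 0.95)
ci

# Hypothetical car: 3000 lbs and 150 hp.
new_car <- data.frame(wt = 3.0, hp = 150)
pred <- predict(fit, newdata = new_car, interval = "confidence")
pred
```

The returned matrix has columns fit, lwr, and upr, giving the predicted mean MPG and its confidence bounds.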

10. Report‑friendly table

Construct a compact table with a few useful columns plus one derived feature, power_to_weight (horsepower per 1000 lbs of weight).

report_tbl <- data |>
  transmute(
    model,
    mpg,
    wt,
    hp,
    cyl,
    am,
    power_to_weight = hp / wt
  )

report_tbl
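
For a report you will often want the table sorted, rounded, and saved. A minimal sketch, assuming a CSV export is the desired output; the file location under tempdir() and the name top5 are hypothetical choices.

```r
library(dplyr)

# Rebuild a slimmed-down report table so this example runs on its own.
report_tbl <- mtcars |>
  tibble::rownames_to_column(var = "model") |>
  transmute(model, mpg, wt, hp, power_to_weight = hp / wt)

# Five cars with the highest power-to-weight ratio, rounded for display.
top5 <- report_tbl |>
  arrange(desc(power_to_weight)) |>
  mutate(power_to_weight = round(power_to_weight, 1)) |>
  head(5)
top5

# Hypothetical output location; replace with a path inside your project.
out_path <- file.path(tempdir(), "report_tbl.csv")
readr::write_csv(report_tbl, out_path)
```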