This project demonstrates the transformation of the Insurance dataset from wide format to long (tidy) format using R’s tidyverse package. The Insurance dataset contains health insurance information for 1,338 individuals, including demographic characteristics and healthcare charges.
The primary objectives of this data transformation project are to:
Original Wide Format Structure:
1,338 rows (one per person) 7 columns: age, sex, bmi, children, smoker, region, charges 4 numeric measurements per person spread across columns 3 categorical identifiers: sex, smoker status, region
Benefits of Transformation - Enhanced Data Analysis:
Transforming to long format provides significant analytical advantages:
- Easier filtering: Query specific measurements (e.g., “show me all BMI
values”) with a single filter operation instead of selecting
multiple
columns - Simplified grouping: Group by measurement type to calculate
statistics across different metrics simultaneously - Streamlined
comparisons: Compare distributions of different measurements
side-by-side without complex column manipulation - Improved
Visualization Capabilities - Long format unlocks powerful visualization
options: - Feature engineering - More intuitive to create derived
features and transformations - Model input formatting - Many ML
algorithms expect data in this normalized structure - Cross-validation -
Easier to split and manipulate data for training and testing
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read the insurance dataset
df_insurance <- read_csv('insurance.csv')
## Rows: 1338 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sex, smoker, region
## dbl (4): age, bmi, children, charges
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(head(df_insurance, 10))
## # A tibble: 10 × 7
## age sex bmi children smoker region charges
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
## 6 31 female 25.7 0 no southeast 3757.
## 7 46 female 33.4 1 no southeast 8241.
## 8 37 female 27.7 3 no northwest 7282.
## 9 37 male 29.8 2 no northeast 6406.
## 10 60 female 25.8 0 no northwest 28923.
cat("Original Shape:", nrow(df_insurance), "rows x", ncol(df_insurance), "columns\n")
## Original Shape: 1338 rows x 7 columns
# Create long format using pivot_longer()
insurance_long_tidy <- df_insurance %>%
pivot_longer(
cols = c(age, bmi, children, charges), # Numeric measurements
names_to = "measurement_type",
values_to = "measurement_value"
) %>%
drop_na(measurement_value) %>% # Remove nulls
arrange(sex, region, measurement_type) # Sort logically
cat("Tidy Long Format:", nrow(insurance_long_tidy), "rows x", ncol(insurance_long_tidy), "columns\n")
## Tidy Long Format: 5352 rows x 5 columns
print(head(insurance_long_tidy, 16))
## # A tibble: 16 × 5
## sex smoker region measurement_type measurement_value
## <chr> <chr> <chr> <chr> <dbl>
## 1 female no northeast age 52
## 2 female no northeast age 60
## 3 female yes northeast age 34
## 4 female no northeast age 63
## 5 female no northeast age 18
## 6 female no northeast age 24
## 7 female no northeast age 18
## 8 female no northeast age 18
## 9 female no northeast age 58
## 10 female no northeast age 22
## 11 female no northeast age 45
## 12 female no northeast age 18
## 13 female no northeast age 26
## 14 female no northeast age 59
## 15 female no northeast age 61
## 16 female no northeast age 20
The insurance dataset originally had 1,338 rows (one per person) with
7 columns including demographic information (sex, smoker status, region)
and 4 numeric measurements (age, bmi, children, charges). Using
pivot_longer()
, we transformed these 4 measurement columns
into long format by creating two new columns: “measurement_type” (what
we’re measuring) and “measurement_value” (the actual number). Each
person that was 1 row became 4 rows - one per measurement - with
demographic identifiers (sex, smoker, region) repeated across rows. The
dataset grew from 1,338 to approximately 5,352 rows, creating a tidy
structure where each row represents a single measurement for a single
person, making it easier to filter, visualize, and analyze specific
measurements across different demographic groups.