Project Overview

This project demonstrates the transformation of the Insurance dataset from wide format to long (tidy) format using R’s tidyverse package. The Insurance dataset contains health insurance information for 1,338 individuals, including demographic characteristics and healthcare charges.

Project Goals

The primary objectives of this data transformation project are to:

  • Restructure the data from wide format (where multiple measurements exist as separate columns) to long format (where each measurement becomes its own row)
  • Apply tidy data principles to make the dataset more suitable for statistical analysis and visualization
  • Demonstrate best practices in data wrangling and preparation for data science workflows
  • Fulfill DATA 624 course requirements by showcasing proficiency in data transformation techniques essential for masters-level data analysis
Dataset Description

Original Wide Format Structure:

1,338 rows (one per person) 7 columns: age, sex, bmi, children, smoker, region, charges 4 numeric measurements per person spread across columns 3 categorical identifiers: sex, smoker status, region

Benefits of Transformation - Enhanced Data Analysis: Transforming to long format provides significant analytical advantages: - Easier filtering: Query specific measurements (e.g., “show me all BMI values”) with a single filter operation instead of selecting multiple
columns - Simplified grouping: Group by measurement type to calculate statistics across different metrics simultaneously - Streamlined comparisons: Compare distributions of different measurements side-by-side without complex column manipulation - Improved Visualization Capabilities - Long format unlocks powerful visualization options: - Feature engineering - More intuitive to create derived features and transformations - Model input formatting - Many ML algorithms expect data in this normalized structure - Cross-validation - Easier to split and manipulate data for training and testing

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read the insurance dataset
df_insurance <- read_csv('insurance.csv')
## Rows: 1338 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sex, smoker, region
## dbl (4): age, bmi, children, charges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(head(df_insurance, 10))
## # A tibble: 10 × 7
##      age sex      bmi children smoker region    charges
##    <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
##  1    19 female  27.9        0 yes    southwest  16885.
##  2    18 male    33.8        1 no     southeast   1726.
##  3    28 male    33          3 no     southeast   4449.
##  4    33 male    22.7        0 no     northwest  21984.
##  5    32 male    28.9        0 no     northwest   3867.
##  6    31 female  25.7        0 no     southeast   3757.
##  7    46 female  33.4        1 no     southeast   8241.
##  8    37 female  27.7        3 no     northwest   7282.
##  9    37 male    29.8        2 no     northeast   6406.
## 10    60 female  25.8        0 no     northwest  28923.
cat("Original Shape:", nrow(df_insurance), "rows x", ncol(df_insurance), "columns\n")
## Original Shape: 1338 rows x 7 columns
# Create long format using pivot_longer()
insurance_long_tidy <- df_insurance %>%
  pivot_longer(
    cols = c(age, bmi, children, charges),  # Numeric measurements
    names_to = "measurement_type",
    values_to = "measurement_value"
  ) %>%
  drop_na(measurement_value) %>%  # Remove nulls
  arrange(sex, region, measurement_type)  # Sort logically

cat("Tidy Long Format:", nrow(insurance_long_tidy), "rows x", ncol(insurance_long_tidy), "columns\n")
## Tidy Long Format: 5352 rows x 5 columns
print(head(insurance_long_tidy, 16))
## # A tibble: 16 × 5
##    sex    smoker region    measurement_type measurement_value
##    <chr>  <chr>  <chr>     <chr>                        <dbl>
##  1 female no     northeast age                             52
##  2 female no     northeast age                             60
##  3 female yes    northeast age                             34
##  4 female no     northeast age                             63
##  5 female no     northeast age                             18
##  6 female no     northeast age                             24
##  7 female no     northeast age                             18
##  8 female no     northeast age                             18
##  9 female no     northeast age                             58
## 10 female no     northeast age                             22
## 11 female no     northeast age                             45
## 12 female no     northeast age                             18
## 13 female no     northeast age                             26
## 14 female no     northeast age                             59
## 15 female no     northeast age                             61
## 16 female no     northeast age                             20
Insurance Data Transformation Summary: from Wide to Long Format

The insurance dataset originally had 1,338 rows (one per person) with 7 columns including demographic information (sex, smoker status, region) and 4 numeric measurements (age, bmi, children, charges). Using pivot_longer(), we transformed these 4 measurement columns into long format by creating two new columns: “measurement_type” (what we’re measuring) and “measurement_value” (the actual number). Each person that was 1 row became 4 rows - one per measurement - with demographic identifiers (sex, smoker, region) repeated across rows. The dataset grew from 1,338 to approximately 5,352 rows, creating a tidy structure where each row represents a single measurement for a single person, making it easier to filter, visualize, and analyze specific measurements across different demographic groups.