Cleaning and Preparing Data for Visualization

This document contains the completed version of the in-class tidyverse activity. It demonstrates one correct way to clean and summarize the data in preparation for visualization.


Load Required Packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the Dataset

data("mtcars")
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Inspecting the Data

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Discussion points (not code):

  • Each row represents a different car model.
  • Numeric variables include mpg, hp, wt, disp, and others.
  • Variables such as cyl and am are numeric but represent categories (number of cylinders and transmission type).

Cleaning the Data

Convert Variables to Factors

mtcars <- mtcars %>%
  mutate(
    cyl = factor(cyl),
    am = factor(am)
  )
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Selecting Relevant Variables

cars_clean <- mtcars %>%
  select(mpg, hp, wt, cyl, am)
head(cars_clean)
##                    mpg  hp    wt cyl am
## Mazda RX4         21.0 110 2.620   6  1
## Mazda RX4 Wag     21.0 110 2.875   6  1
## Datsun 710        22.8  93 2.320   4  1
## Hornet 4 Drive    21.4 110 3.215   6  0
## Hornet Sportabout 18.7 175 3.440   8  0
## Valiant           18.1 105 3.460   6  0

Filtering Observations

cars_hp <- cars_clean %>%
  filter(hp > 100)
nrow(cars_hp)
## [1] 23

Creating New Variables

cars_hp <- cars_hp %>%
  mutate(power_to_weight = hp / wt)
head(cars_hp)
##                    mpg  hp    wt cyl am power_to_weight
## Mazda RX4         21.0 110 2.620   6  1        41.98473
## Mazda RX4 Wag     21.0 110 2.875   6  1        38.26087
## Hornet 4 Drive    21.4 110 3.215   6  0        34.21462
## Hornet Sportabout 18.7 175 3.440   8  0        50.87209
## Valiant           18.1 105 3.460   6  0        30.34682
## Duster 360        14.3 245 3.570   8  0        68.62745

Grouping and Summarizing Data

Summary by Number of Cylinders

summary_cyl <- cars_hp %>%
  group_by(cyl) %>%
  summarize(
    mean_mpg = mean(mpg),
    mean_hp = mean(hp),
    n = n()
  )
summary_cyl
## # A tibble: 3 × 4
##   cyl   mean_mpg mean_hp     n
##   <fct>    <dbl>   <dbl> <int>
## 1 4         25.9    111      2
## 2 6         19.7    122.     7
## 3 8         15.1    209.    14

Summary by Transmission Type

summary_transmission <- cars_hp %>%
  group_by(am) %>%
  summarize(
    mean_mpg = mean(mpg),
    mean_power_to_weight = mean(power_to_weight)
  )
summary_transmission
## # A tibble: 2 × 3
##   am    mean_mpg mean_power_to_weight
##   <fct>    <dbl>                <dbl>
## 1 0         16.1                 44.6
## 2 1         20.6                 62.1

Interpreting the Summaries

Discussion points:

  • Cars with automatic vs manual transmission differ in average fuel efficiency.
  • The summary_cyl table would work well for a bar plot comparing means across cylinder groups.
  • The transmission summary could later be paired with boxplots if raw data were retained.

Reflection

Cleaning and summarizing data before plotting ensures that figures accurately represent the underlying patterns in the data. Separating data preparation from visualization helps reduce errors, improves clarity, and makes plotting code easier to interpret and reuse.


Final Notes for Students

This workflow — inspecting data, cleaning variables, filtering observations, creating new variables, and summarizing by group — is a standard approach used before creating figures in R. These steps will appear repeatedly throughout the course.