Cleaning and Preparing Data for Visualization

In this activity, you will practice using dplyr (part of the tidyverse) to clean a dataset and prepare summary tables that could later be used to make figures. The focus is on understanding how data structure affects visualization.

Work through each section in order. You may work quietly with classmates nearby, but everyone should write and submit their own work.

Load Required Packages

Load the tidyverse, installing it first if necessary.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
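The "install it first if necessary" step above could be handled with a guard like the one below. This is a minimal sketch: the CRAN mirror URL is one common choice and is an assumption, not something the activity specifies.

```r
# Install tidyverse only when it is not already available, then attach it.
# The mirror URL is an assumption for this sketch; any CRAN mirror works.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}
library(tidyverse)
```

Guarding the install this way keeps the document from reinstalling the package every time it is rendered.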

Load the Dataset

For today’s activity, we will use a built-in dataset so that everyone is working with the same data. Load the mtcars dataset.

data(mtcars)

Take a moment to look at the dataset.

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Inspecting the Data

Before cleaning data, it is important to understand its structure. View the structure of your variables.

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Answer the following questions in text below (not as code):

What does each row represent?

One car model (the row names hold the model names). Each column is a variable recorded for that car.

Name two variables that are numeric.

mpg and drat

Name one variable that represents a category, even if it is currently stored as a number.

gear
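One way to spot categories stored as numbers is to count how many distinct values each column holds. The sketch below assumes dplyr is available; `distinct_counts` is a name chosen only for this illustration.

```r
library(dplyr)

# Count the distinct values in every column of mtcars.
# Columns with only a handful of distinct values (for example cyl, am, gear)
# are likely categories stored as numbers.
distinct_counts <- mtcars %>%
  summarise(across(everything(), n_distinct))

distinct_counts
```

A low count is only a hint, so it is still worth checking the dataset documentation (`?mtcars`) before converting a column.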

Cleaning the Data

Some variables in this dataset are stored as numbers even though they represent categories.

Convert Variables to Factors

Convert the following variables to factors:

cyl (number of cylinders)
am (transmission type)

# Assign the result back to mtcars so the conversion is kept
mtcars <- mtcars %>%
  mutate(cyl = as.factor(cyl),
         am = as.factor(am))

Check that the conversion worked by looking at the structure again.

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
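For figures, it can help to give factor levels readable labels so that legends and axes do not show bare 0 and 1. The sketch below is one possible approach, not part of the assigned steps. `mtcars_labeled` is a hypothetical object name, and the labels follow the mtcars documentation, where am = 0 is automatic and am = 1 is manual.

```r
library(dplyr)

# mtcars_labeled is a hypothetical name used only in this sketch.
# In the mtcars documentation, am = 0 is automatic and am = 1 is manual.
mtcars_labeled <- mtcars %>%
  mutate(
    cyl = factor(cyl),
    am  = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
  )

# Confirm the labels and count the cars in each transmission group
levels(mtcars_labeled$am)
table(mtcars_labeled$am)
```

Writing to a separate object leaves the mtcars object used in the rest of the activity unchanged.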

Selecting Relevant Variables

For visualization, it is often helpful to work with only the variables you need.

Create a new object called cars_clean that contains only:

mpg
hp
wt
cyl
am

cars_clean <- mtcars%>%
  select(mpg, hp, wt, cyl, am)
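Note that the car names in mtcars are row names rather than a column, so select() cannot pick them by name. If the names are needed later, for example as plot labels, one option is to copy them into a column first. In this sketch, `cars_named` and the column name `model` are assumptions chosen for illustration.

```r
library(dplyr)
library(tibble)

# Car names are stored as row names in mtcars, not as a column.
# rownames_to_column() copies them into a regular column; "model" is a
# column name chosen for this sketch.
cars_named <- mtcars %>%
  rownames_to_column(var = "model") %>%
  select(model, mpg, hp, wt, cyl, am)

head(cars_named, 3)
```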

Filtering Observations

Now filter the dataset to include only cars with:

More than 100 horsepower

Save the result as a new object called cars_hp.

cars_hp <- cars_clean%>%
  filter(hp > 100)

Check how many rows remain.

nrow(cars_hp)

## [1] 23

Creating New Variables

Create a new variable called power_to_weight defined as:


horsepower / weight

Add this variable to cars_hp.

# Assign back to cars_hp so the new column is saved, then print the result
cars_hp <- cars_hp %>%
  mutate(power_to_weight = hp / wt)
cars_hp

##                      mpg  hp    wt cyl am power_to_weight
## Mazda RX4           21.0 110 2.620   6  1        41.98473
## Mazda RX4 Wag       21.0 110 2.875   6  1        38.26087
## Hornet 4 Drive      21.4 110 3.215   6  0        34.21462
## Hornet Sportabout   18.7 175 3.440   8  0        50.87209
## Valiant             18.1 105 3.460   6  0        30.34682
## Duster 360          14.3 245 3.570   8  0        68.62745
## Merc 280            19.2 123 3.440   6  0        35.75581
## Merc 280C           17.8 123 3.440   6  0        35.75581
## Merc 450SE          16.4 180 4.070   8  0        44.22604
## Merc 450SL          17.3 180 3.730   8  0        48.25737
## Merc 450SLC         15.2 180 3.780   8  0        47.61905
## Cadillac Fleetwood  10.4 205 5.250   8  0        39.04762
## Lincoln Continental 10.4 215 5.424   8  0        39.63864
## Chrysler Imperial   14.7 230 5.345   8  0        43.03087
## Dodge Challenger    15.5 150 3.520   8  0        42.61364
## AMC Javelin         15.2 150 3.435   8  0        43.66812
## Camaro Z28          13.3 245 3.840   8  0        63.80208
## Pontiac Firebird    19.2 175 3.845   8  0        45.51365
## Lotus Europa        30.4 113 1.513   4  1        74.68605
## Ford Pantera L      15.8 264 3.170   8  1        83.28076
## Ferrari Dino        19.7 175 2.770   6  1        63.17690
## Maserati Bora       15.0 335 3.570   8  1        93.83754
## Volvo 142E          21.4 109 2.780   4  1        39.20863
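A quick way to sanity-check a new variable is to sort by it and look at the extremes. The sketch below rebuilds the filtered data from mtcars so that it runs on its own. `ranked` and the `model` column are names chosen for this illustration.

```r
library(dplyr)
library(tibble)

# Rebuild the filtered data from scratch so this sketch is self-contained,
# then sort from the highest to the lowest power-to-weight ratio.
ranked <- mtcars %>%
  rownames_to_column(var = "model") %>%
  filter(hp > 100) %>%
  mutate(power_to_weight = hp / wt) %>%
  arrange(desc(power_to_weight)) %>%
  select(model, hp, wt, power_to_weight)

head(ranked, 3)
```

The top rows should match the largest values in the table above, with the Maserati Bora first.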

Grouping and Summarizing Data

To prepare data for figures, we often summarize values by group.

Summary by Number of Cylinders

Create a summary table that shows:

Mean miles per gallon (mpg)
Mean horsepower (hp)
Number of observations

Grouped by:

cyl

Save this as summary_cyl.

# Syntactic column names are easier to refer to when plotting later
summary_cyl <- cars_clean %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg),
            mean_hp = mean(hp),
            n_obs = n())

Display the table.

# Print the table so it appears in the rendered document
summary_cyl

Summary by Transmission Type

Now create a second summary table grouped by:

am

Include:

Mean miles per gallon
Mean power-to-weight ratio

Save this as summary_transmission.

# power-to-weight is computed inside summarise() because cars_clean has no such column
summary_transmission <- cars_clean %>%
  group_by(am) %>%
  summarise(mean_mpg = mean(mpg),
            mean_power_to_weight = mean(hp / wt))
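The "mean power-to-weight ratio" here is the mean of each car's hp / wt, which is not the same as dividing mean horsepower by mean weight. The sketch below computes both so the difference is visible. `ratio_check` and its column names are hypothetical and used only for this comparison.

```r
library(dplyr)

# Compare two quantities that are easy to confuse:
#   mean_of_ratios: average of each car's own hp / wt
#   ratio_of_means: average hp divided by average wt
ratio_check <- mtcars %>%
  group_by(am) %>%
  summarise(
    mean_of_ratios = mean(hp / wt),
    ratio_of_means = mean(hp) / mean(wt)
  )

ratio_check
```

The activity asks for the first quantity, so the ratio should be computed per car before averaging.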

Interpreting the Summaries

Answer the following questions in text:

Which group appears to have higher fuel efficiency?

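In summary_cyl, the 4-cylinder group has the highest mean mpg. In summary_transmission, manual cars (am = 1) have the higher mean mpg.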

Which summary table would be useful for making a bar plot?

summary_cyl

Which would work better for a boxplot later?

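Neither summary table. A boxplot needs every observation in each group, not just the group mean, so the unsummarized cars_clean would work better than summary_cyl or summary_transmission.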
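The difference between the two plot types can be sketched with ggplot2. This is an illustration under stated assumptions rather than part of the assignment: `mpg_by_cyl`, `bar_plot`, and `box_plot` are names chosen here, and the data are rebuilt from mtcars so the block runs on its own.

```r
library(dplyr)
library(ggplot2)

# A bar plot draws one bar per group, so it takes a summary table:
# one row per cylinder count with a single mean value.
mpg_by_cyl <- mtcars %>%
  mutate(cyl = factor(cyl)) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

bar_plot <- ggplot(mpg_by_cyl, aes(x = cyl, y = mean_mpg)) +
  geom_col()

# A boxplot shows the spread within each group, so it takes the
# unsummarized data: one row per car.
box_plot <- ggplot(mutate(mtcars, cyl = factor(cyl)), aes(x = cyl, y = mpg)) +
  geom_boxplot()
```

Printing `bar_plot` or `box_plot` draws the figure. The bar plot reads three rows, while the boxplot reads all 32.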

Reflection

In a short paragraph, describe:

One thing that was confusing
One thing that makes more sense now
Why cleaning and summarizing data before plotting is important

One thing that confused me was how to make summarise() compute different kinds of summaries in one call (a mean versus a count), but I figured it out. Something that makes more sense now is how to create new objects from an existing dataset: I did not understand how to make a new one instead of just changing the original, but now I do. Cleaning and summarizing the data lets you drop data that is not relevant to your plots and see the data more clearly, so you can tell whether a plot reflects it correctly.

Final Check

Before submitting, confirm that:

The document renders without errors
All code chunks run successfully
All written responses are complete

Practice: Cleaning and Summarizing Data with dplyr

Brooke Kopack Ware

2026-02-04