Cleaning and Preparing Data for Visualization

In this activity, you will practice using the tidyverse, especially dplyr, to clean a dataset and prepare summary tables that could later be used to make figures. The focus is on understanding how data structure affects visualization.

Work through each section in order. You may work quietly with classmates nearby, but everyone should write and submit their own work.

Load Required Packages

Load the tidyverse package. Install it first if necessary.

# YOUR CODE HERE
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
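
One way to handle the "install it if necessary" step is sketched below. It assumes a CRAN mirror is already configured; when knitting a document, it is often simpler to install the package once from the console beforehand rather than inside a chunk.

```r
# Install tidyverse only if it is not already available, then load it.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
library(tidyverse)
```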

Load the Dataset

For today’s activity, we will use a built-in dataset so that everyone is working with the same data. Load the mtcars dataset.

# YOUR CODE HERE
data(mtcars)

Take a moment to look at the dataset.

# YOUR CODE HERE
mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
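
Printing the whole data frame works here because mtcars is small. For larger datasets, a more compact first look may be preferable. The following is a minimal sketch using base R only:

```r
# Show only the first six rows instead of all 32
head(mtcars)

# Number of rows, then number of columns
dim(mtcars)

# Column names only
names(mtcars)
```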

Inspecting the Data

Before cleaning data, it is important to understand its structure. View the structure of your variables.

# YOUR CODE HERE
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Answer the following questions in text below (not as code):

What does each row represent? One car model.
Name two variables that are numeric. mpg and hp.
Name one variable that represents a category, even if it is currently stored as a number. cyl.
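
One rough way to spot numbers that act like categories is to count how many distinct values each column has. This is a heuristic sketch, not a rule: a column with few distinct values is only a candidate, and the name `n_distinct_values` is made up for this illustration.

```r
# Number of distinct values in each column of mtcars
n_distinct_values <- sapply(mtcars, function(x) length(unique(x)))

# Columns with very few distinct values (such as cyl and am) appear first
sort(n_distinct_values)
```

Whether a low-count column should really be treated as a category still depends on what it measures.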

Cleaning the Data

Some variables in this dataset are stored as numbers even though they represent categories.

Convert Variables to Factors

Convert the following variables to factors:

cyl (number of cylinders)
am (transmission type)

# YOUR CODE HERE
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)

Check that the conversion worked by looking at the structure again.

# YOUR CODE HERE
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
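
The factor levels "0" and "1" for am are correct but not very readable in a figure legend. As an optional sketch, the levels can be given descriptive labels. In mtcars, am = 0 is automatic and am = 1 is manual. The object name `cars_labeled` is hypothetical, and the code works on a fresh copy so the mtcars object used in this activity is left unchanged.

```r
# Work on a fresh copy of the built-in data
cars_labeled <- datasets::mtcars

# Convert am to a factor with descriptive labels (0 = automatic, 1 = manual)
cars_labeled$am <- factor(
  cars_labeled$am,
  levels = c(0, 1),
  labels = c("automatic", "manual")
)

levels(cars_labeled$am)
table(cars_labeled$am)
```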

Selecting Relevant Variables

For visualization, it is often helpful to work with only the variables you need.

Create a new object called cars_clean that contains only:

mpg
hp
wt
cyl
am

# YOUR CODE HERE
cars_clean <- mtcars %>%
  select(mpg, hp, wt, cyl, am)
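
If the list of variables is stored in a character vector, for example because it is reused in several places, `all_of()` can be used inside `select()`. This is a small sketch of that alternative; `keep_vars` and `cars_subset` are names invented for the example.

```r
library(dplyr)

# Hypothetical vector holding the column names to keep
keep_vars <- c("mpg", "hp", "wt", "cyl", "am")

# all_of() selects exactly these columns and errors if one is missing
cars_subset <- datasets::mtcars %>%
  select(all_of(keep_vars))

names(cars_subset)
```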

Filtering Observations

Now filter the dataset to include only cars with:

More than 100 horsepower

Save the result as a new object called cars_hp.

# YOUR CODE HERE
cars_hp <- cars_clean %>%
  filter(hp > 100)

Check how many rows remain.

# YOUR CODE HERE
nrow(cars_hp)

## [1] 23
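
`filter()` can also combine several conditions. The sketch below is an illustration only and is not part of the activity. It starts from a fresh copy of the data, where cyl is still numeric, and the object names are made up.

```r
library(dplyr)

cars_demo <- datasets::mtcars %>%
  select(mpg, hp, wt, cyl, am)

# Conditions separated by commas are combined with AND
strong_eight <- cars_demo %>%
  filter(hp > 100, cyl == 8)

# The | operator means OR: keep cars that are powerful or light (or both)
powerful_or_light <- cars_demo %>%
  filter(hp > 100 | wt < 2)

nrow(strong_eight)
nrow(powerful_or_light)
```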

Creating New Variables

Create a new variable called power_to_weight, defined as:

horsepower / weight (hp / wt)

Add this variable to cars_hp.

# YOUR CODE HERE
cars_hp <- cars_hp %>%
  mutate(power_to_weight = hp / wt)
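
To sanity-check a new variable, it can help to sort by it and look at the extremes. The sketch below assumes the car names should be kept for readability, so it first moves the row names into a column called `model` (a name chosen for this example). Because wt in mtcars is recorded in units of 1000 lbs, the ratio is horsepower per 1000 lbs.

```r
library(dplyr)
library(tibble)

ranked <- datasets::mtcars %>%
  rownames_to_column("model") %>%          # keep car names as a real column
  filter(hp > 100) %>%
  mutate(power_to_weight = hp / wt) %>%    # horsepower per 1000 lbs
  arrange(desc(power_to_weight)) %>%       # highest ratio first
  select(model, hp, wt, power_to_weight)

head(ranked, 3)
```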

Grouping and Summarizing Data

To prepare data for figures, we often summarize values by group.

Summary by Number of Cylinders

Create a summary table that shows:

Mean miles per gallon (mpg)
Mean horsepower (hp)
Number of observations

Grouped by:

cyl

Save this as summary_cyl.

# YOUR CODE HERE
summary_cyl <- cars_hp %>%
  group_by(cyl) %>%
  summarise(
    mean_mpg = mean(mpg),
    mean_hp = mean(hp),
    n = n()
  )

Display the table.

summary_cyl

## # A tibble: 3 × 4
##   cyl   mean_mpg mean_hp     n
##   <fct>    <dbl>   <dbl> <int>
## 1 4         25.9    111      2
## 2 6         19.7    122.     7
## 3 8         15.1    209.    14
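
The n column is worth reading alongside the means: the 4-cylinder average rests on only 2 cars after filtering. As a quick cross-check, sketched here with made-up object names, `count()` gives the same group sizes, and they should add up to the number of filtered rows.

```r
library(dplyr)

cars_check <- datasets::mtcars %>%
  filter(hp > 100) %>%
  mutate(cyl = as.factor(cyl))

# count() is shorthand for group_by() followed by summarise(n = n())
group_sizes <- cars_check %>%
  count(cyl)

group_sizes

# The group sizes should sum to the number of rows in the filtered data
sum(group_sizes$n) == nrow(cars_check)
```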

Summary by Transmission Type

Now create a second summary table grouped by:

am

Include:

Mean miles per gallon
Mean power-to-weight ratio

Save this as summary_transmission.

# YOUR CODE HERE
summary_transmission <- cars_hp %>%
  group_by(am) %>%
  summarise(
    mean_mpg = mean(mpg),
    mean_power_to_weight = mean(power_to_weight)
  )

summary_transmission

## # A tibble: 2 × 3
##   am    mean_mpg mean_power_to_weight
##   <fct>    <dbl>                <dbl>
## 1 0         16.1                 44.6
## 2 1         20.6                 62.1

Interpreting the Summaries

Answer the following questions in text:

Which group appears to have higher fuel efficiency? Cars with am = 1 (manual transmission): their mean mpg is 20.6, compared with 16.1 for am = 0 (automatic).
Which summary table would be useful for making a bar plot? summary_transmission
Which would work better for a boxplot later? cars_hp, because a boxplot needs the individual observations to show a distribution.
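
The difference between the two kinds of input can be sketched with ggplot2. This is only an illustration of the idea and not a required part of the activity: the object names are invented, and the data are rebuilt from a fresh copy of mtcars so the block runs on its own.

```r
library(dplyr)
library(ggplot2)

# One row per car, with readable transmission labels
cars_plot <- datasets::mtcars %>%
  filter(hp > 100) %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")))

# One row per group
summary_plot <- cars_plot %>%
  group_by(am) %>%
  summarise(mean_mpg = mean(mpg))

# Bar plot: each bar is one row of the summary table
p_bar <- ggplot(summary_plot, aes(x = am, y = mean_mpg)) +
  geom_col() +
  labs(x = "Transmission", y = "Mean mpg")

# Boxplot: needs every individual car to show the spread
p_box <- ggplot(cars_plot, aes(x = am, y = mpg)) +
  geom_boxplot() +
  labs(x = "Transmission", y = "mpg")

p_bar
p_box
```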

Reflection

In a short paragraph, describe:

One thing that was confusing
One thing that makes more sense now
Why cleaning and summarizing data before plotting is important

I didn’t really know how to filter, but now it makes much more sense. Cleaning and summarizing data first is important because raw data is often too messy or detailed to visualize directly.

Final Check

Before submitting, confirm that:

The document renders without errors
All code chunks run successfully
All written responses are complete

Save and submit both:

The rendered document
The .Rmd file