Cleaning and Preparing Data for Visualization

In this activity, you will practice using tidyverse and dplyr to clean a dataset and prepare summary tables that could later be used to make figures. The focus is on understanding how data structure affects visualization.

Work through each section in order. You may work quietly with classmates nearby, but everyone should write and submit their own work.

Load Required Packages

Load the library of tidyverse, install it if necessary.

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the Dataset

For today’s activity, we will use a built-in dataset so that everyone is working with the same data. Load the mtcars dataset

# This is a built-in dataset; therefore, I don't need any fancy tricks to call it, right?
# I'm just printing it to make sure that it loads properly. This chunk is set to print=FALSE so that I don't make the document unecessarily long.
knitr::kable(mtcars, caption="mtcars")

mtcars
	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1
Duster 360	14.3	8	360.0	245	3.21	3.570	15.84	0	0	3	4
Merc 240D	24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2
Merc 230	22.8	4	140.8	95	3.92	3.150	22.90	1	0	4	2
Merc 280	19.2	6	167.6	123	3.92	3.440	18.30	1	0	4	4
Merc 280C	17.8	6	167.6	123	3.92	3.440	18.90	1	0	4	4
Merc 450SE	16.4	8	275.8	180	3.07	4.070	17.40	0	0	3	3
Merc 450SL	17.3	8	275.8	180	3.07	3.730	17.60	0	0	3	3
Merc 450SLC	15.2	8	275.8	180	3.07	3.780	18.00	0	0	3	3
Cadillac Fleetwood	10.4	8	472.0	205	2.93	5.250	17.98	0	0	3	4
Lincoln Continental	10.4	8	460.0	215	3.00	5.424	17.82	0	0	3	4
Chrysler Imperial	14.7	8	440.0	230	3.23	5.345	17.42	0	0	3	4
Fiat 128	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1
Honda Civic	30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
Toyota Corolla	33.9	4	71.1	65	4.22	1.835	19.90	1	1	4	1
Toyota Corona	21.5	4	120.1	97	3.70	2.465	20.01	1	0	3	1
Dodge Challenger	15.5	8	318.0	150	2.76	3.520	16.87	0	0	3	2
AMC Javelin	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3	2
Camaro Z28	13.3	8	350.0	245	3.73	3.840	15.41	0	0	3	4
Pontiac Firebird	19.2	8	400.0	175	3.08	3.845	17.05	0	0	3	2
Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.90	1	1	4	1
Porsche 914-2	26.0	4	120.3	91	4.43	2.140	16.70	0	1	5	2
Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5	2
Ford Pantera L	15.8	8	351.0	264	4.22	3.170	14.50	0	1	5	4
Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5	6
Maserati Bora	15.0	8	301.0	335	3.54	3.570	14.60	0	1	5	8
Volvo 142E	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4	2

Take a moment to look at the dataset.

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Inspecting the Data

Before cleaning data, it is important to understand its structure. View the structure of your variables.

mtcars %>%
  sapply(class) %>%
  knitr::kable(caption="mtcars variables")

mtcars variables
	x
mpg	numeric
cyl	numeric
disp	numeric
hp	numeric
drat	numeric
wt	numeric
qsec	numeric
vs	numeric
am	numeric
gear	numeric
carb	numeric

  #So it looks like applying "sapply" to the table and then piping it forward to knitr::kable ONLY knits the variable names and classes. That's weird. Is it because knitr isn't part of dplyr and, therefore, isn't calibrated to play nice with dplyr commands?
  head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Answer the following questions in text below (not as code):

What does each row represent? Miles per gallon, cylinders, displacement (in cubic inches), horsepower, driveshaft ratio, weight (likely in tons), qsec (measurement of acceleration), engine shape, transmission type, number of gears, and number of carburetors. I had to look some of those meanings up, I used this GitHub page to figure that out.
Name two variables that are numeric. “wt” and “hp”.
Name one variable that represents a category, even if it is currently stored as a number. “am” indicates auto or manual transmission. That should be categorical; right now, it’s stored as a “0” or “1” binary, but R seems to automatically read that as a numerical variable. I’m not sure which value corresponds to 0 or 1 though.

Cleaning the Data

Some variables in this dataset are stored as numbers even though they represent categories.

Convert Variables to Factors

Convert the following variables to factors:

cyl (number of cylinders)
am (transmission type)

# I can convert those two variables to factors with mutate() or as.factor() ; the former is dplyr, the latter is forcats (which might be included in dplyr, not sure). 
#Because this is a prepackaged dataset, I'm not sure if any edits made within the program will be saved when I call "mtcars", so it might be a good idea to make a copy of mtcars and THEN make my edits to that. 
# If there are two datasets called "mtcars"- one packaged with dplyr and one that only exists in this program- will the edited version be prioritized over the base version? Especially if I call it outside of the chunk where I made the edits? I'm not sure.

#This line makes a copy of mtcars called "mtcarsEdit" for the sake of editing.
mtcarsEdit <- mtcars

#These lines pipe mtcarsEdit into two "mutate" functions to change the data types of "cyl" and "am" to factors.
#mtcarsEdit %>%
 # mutate(cyl,as.factor(cyl)) %>%
 # mutate(am, as.factor(cyl)) %>%
#  glimpse()
#After "mutate()" runs, the program prints the updated tables.

#So everything that I just did reverts back to a double in the next chunk. I'm commenting that out and doing this instead.

  mtcarsEdit$am = as.factor(mtcarsEdit$am)
  mtcarsEdit$cyl = as.factor(mtcarsEdit$cyl)
  head(mtcarsEdit)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

  #This seems to retain the changes across chunks. I'm not sure why lines 121-124 didn't stick into the next chunk.

Check that the conversion worked by looking at the structure again.

# I probably have to call mtcarsEdit again, since this is a new chunk. 
mtcarsEdit %>%
  sapply(class)

##       mpg       cyl      disp        hp      drat        wt      qsec        vs 
## "numeric"  "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##        am      gear      carb 
##  "factor" "numeric" "numeric"

#NOTE: piping mtcarsEdit into sapply and then piping that into "head" only returns the variable classes of mpg, cyl, disp, hp, drat, and wt in the chart (not the data). Weird. I guess it's taking the limited information from sapply instead of taking information from mtcarsEdit. 
#I assume that calling mtcarsEdit AGAIN and piping it into head() or glimpse() or str() will return the full table though.

mtcarsEdit %>%
  head()

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

#Why did the changes not stick when I used mutate() to change variable type, but they DID when I used data$variable = as.factor(data$variable) to change it? That's weird.

Selecting Relevant Variables

For visualization, it is often helpful to work with only the variables you need.

Create a new object called cars_clean that contains only:

mpg
hp
wt
cyl
am

#It looks like adding an apostrophe to "dont" caused the program to fail to knit. So chunk titles probably shouldn't have punctuation.

#I wonder if I could make cars_clean with the pipe-forward operator instead of the pipe-back operator.
# Nope, that reads the intended variable name as a function, not "copy the data into a variable named this".
#Also, I'm using programmer caps for my variable titles instead of underscores. Sorry if this makes my code hard to just... search through.
# This should remove certain columns, not rows.
# This line writes the selected variables of "mtcarsEdit" to "carsClean".
  carsClean = select(mtcarsEdit, c(mpg, hp, wt, cyl, am))

#this line prints the observation number, variables, and some of the values of carsClean.
str(carsClean)

## 'data.frame':    32 obs. of  5 variables:
##  $ mpg: num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ hp : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ wt : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ cyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...

Filtering Observations

Now filter the dataset to include only cars with:

More than 100 horsepower

Save the result as a new object called cars_hp.

# Now HERE'S where filter() comes in handy.
# select() is for excluding/including specific columns, filter() is for excluding/including specific rows. 
#carsHP = filter(carsClean, hp > 100)
#Here's a cleaner version of the previous line

carsHP <- carsClean %>%
  filter(hp > 100)

Check how many rows remain.

str(carsHP)

## 'data.frame':    23 obs. of  5 variables:
##  $ mpg: num  21 21 21.4 18.7 18.1 14.3 19.2 17.8 16.4 17.3 ...
##  $ hp : num  110 110 110 175 105 245 123 123 180 180 ...
##  $ wt : num  2.62 2.88 3.21 3.44 3.46 ...
##  $ cyl: Factor w/ 3 levels "4","6","8": 2 2 2 3 2 3 2 2 3 3 ...
##  $ am : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...

knitr::kable(carsHP, caption="carsHP")

carsHP
	mpg	hp	wt	cyl	am
Mazda RX4	21.0	110	2.620	6	1
Mazda RX4 Wag	21.0	110	2.875	6	1
Hornet 4 Drive	21.4	110	3.215	6	0
Hornet Sportabout	18.7	175	3.440	8	0
Valiant	18.1	105	3.460	6	0
Duster 360	14.3	245	3.570	8	0
Merc 280	19.2	123	3.440	6	0
Merc 280C	17.8	123	3.440	6	0
Merc 450SE	16.4	180	4.070	8	0
Merc 450SL	17.3	180	3.730	8	0
Merc 450SLC	15.2	180	3.780	8	0
Cadillac Fleetwood	10.4	205	5.250	8	0
Lincoln Continental	10.4	215	5.424	8	0
Chrysler Imperial	14.7	230	5.345	8	0
Dodge Challenger	15.5	150	3.520	8	0
AMC Javelin	15.2	150	3.435	8	0
Camaro Z28	13.3	245	3.840	8	0
Pontiac Firebird	19.2	175	3.845	8	0
Lotus Europa	30.4	113	1.513	4	1
Ford Pantera L	15.8	264	3.170	8	1
Ferrari Dino	19.7	175	2.770	6	1
Maserati Bora	15.0	335	3.570	8	1
Volvo 142E	21.4	109	2.780	4	1

# This version of the dataset only has 23 rows.

Creating New Variables

Create a new variable called power_to_weight defined as:


horsepower / weight

Add this variable to cars_hp.

carsHP <- carsHP %>%
  mutate(powerToWeight = hp/wt)
#So piping carsHP forward and making a new variable displays that variable after mutate(), but it disappears the next time I try to load the table. That's really weird. why?
#OKAY, so putting "carsHP <-" before the pipe-forward into the new variable creation MAKES THE CHANGES STICK. That's... interesting.

str(carsHP)

## 'data.frame':    23 obs. of  6 variables:
##  $ mpg          : num  21 21 21.4 18.7 18.1 14.3 19.2 17.8 16.4 17.3 ...
##  $ hp           : num  110 110 110 175 105 245 123 123 180 180 ...
##  $ wt           : num  2.62 2.88 3.21 3.44 3.46 ...
##  $ cyl          : Factor w/ 3 levels "4","6","8": 2 2 2 3 2 3 2 2 3 3 ...
##  $ am           : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
##  $ powerToWeight: num  42 38.3 34.2 50.9 30.3 ...

Grouping and Summarizing Data

To prepare data for figures, we often summarize values by group.

Summary by Number of Cylinders

Create a summary table that shows:

Mean miles per gallon (mpg)
Mean horsepower (hp)
Number of observations

Grouped by:

cyl

Save this as summary_cyl.

summaryCyl <- carsHP %>%
  group_by(cyl) %>%
  summarize('meanMPG' = mean(mpg), 'meanHP' = mean(hp))
  
  #It looks like you need to do extra work to get factors to work with count(), add_count(), or add_tally(). Ohhhhh, I'm going to bite someone
#This code tests how to get count() to work with factors. Ugh
testDf <- count(carsHP, cyl)
str(testDf)

## 'data.frame':    3 obs. of  2 variables:
##  $ cyl: Factor w/ 3 levels "4","6","8": 1 2 3
##  $ n  : int  2 7 14

# There is, without a doubt, a way to do this that is less annoying. However, when I tried to pass count(carsHP, cylCount) directly into summarize, it assigned THE ENTIRE DATA FRAME to each instance of cylCount. That's funny, but it's not what I want at all.

#These two lines:
##1. pass all code on the following two lines back to summaryCyl to make changes stick,
##2. pipes summaryCyl forward into the next line,
##3. Adds a new variable to summaryCyl, obsCount, and copies all instances of testDf$n into summaryCyl$obsCount in the exact order that they appear originally.
summaryCyl <- 
  summaryCyl %>%
  mutate(obsCount = testDf$n)

  
str(summaryCyl)

## tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
##  $ cyl     : Factor w/ 3 levels "4","6","8": 1 2 3
##  $ meanMPG : num [1:3] 25.9 19.7 15.1
##  $ meanHP  : num [1:3] 111 122 209
##  $ obsCount: int [1:3] 2 7 14

head(summaryCyl)

## # A tibble: 3 × 4
##   cyl   meanMPG meanHP obsCount
##   <fct>   <dbl>  <dbl>    <int>
## 1 4        25.9   111         2
## 2 6        19.7   122.        7
## 3 8        15.1   209.       14

Display the table.

summaryCyl

## # A tibble: 3 × 4
##   cyl   meanMPG meanHP obsCount
##   <fct>   <dbl>  <dbl>    <int>
## 1 4        25.9   111         2
## 2 6        19.7   122.        7
## 3 8        15.1   209.       14

#Oh, okay, just typing out the name of the table makes it display. You don't have to use head(), str(), or whatever? I've been wasting time typing stuff.

Summary by Transmission Type

Now create a second summary table grouped by:

am

Include:

Mean miles per gallon
Mean power-to-weight ratio

Save this as summary_transmission.

summaryTransmission<- carsHP %>%
  group_by(am) %>% 
  summarise('meanMPG' = mean(mpg), 'meanPowToWt' = mean(powerToWeight))

#This line isn't strictly necessary, but I wanted to get an idea of the ratio of automatic-to-manual cars in the dataset.
testDf2 <- count (carsHP, am)
testDf2

##   am  n
## 1  0 16
## 2  1  7

summaryTransmission

## # A tibble: 2 × 3
##   am    meanMPG meanPowToWt
##   <fct>   <dbl>       <dbl>
## 1 0        16.1        44.6
## 2 1        20.6        62.1

Interpreting the Summaries

Answer the following questions in text:

Which group appears to have higher fuel efficiency? Using the first dataset, cars with less cylinders appear to have higher fuel efficiency. Using the second dataset, assuming that “am” stands for “automatic” and the factors are true/false states, cars with manual transmission have a higher fuel efficiency (and higher horsepower compared to their weight).
Which summary table would be useful for making a bar plot? Uhm. I think the transmission summary table would be better for making a bar plot because the means are closer together AND the distribution of the two variable states (automatic and manual) is a bit more even than the cylinder summary table. Bar plots generally expose less data than box plots, so I think showing the mean and standard-error bars ONLY is more valuable when the sample sizes are kind of similar.
Which would work better for a boxplot later? I’d prefer that both of these be in boxplots, honestly, because of the different sample sizes, but I think that it works best for the cylinder count table. The cylinder count table has wildly disproportionate groups (2, 7, and 14 observations), and that’s going to impact the value of the quartiles, medians, and upper/lower bounds. If your sample size is weird and skewed, then it’s a good idea to be upfront about that. Furthermore, I think it depends on your audience. A box plot works better if your audience cares about the fine details, but a bar plot works if your audience just cares about the mean values.

Reflection

In a short paragraph, describe:

One thing that was confusing If you look back in my code, you’ll see me complaining about the pipe-forward operators not retaining modifications to the datasets. That was pretty confusing at first.
One thing that makes more sense now The above issue makes more sense now. It seems that <- is the “apply all following lines to this variable” operator, while %>% is the “carry this variable forward as the first argument in the following lines” operator.
Why cleaning and summarizing data before plotting is important It’s easier to work with a smaller, fine-tuned set of data with the information that you actually care about, rather than manually sifting through the data mid-program.

Final Check

Before submitting, confirm that:

The document renders without errors
All code chunks run successfully
All written responses are complete

Save and submit both:

The rendered document
The .Rmd file Submission Text Editor

Practice: Cleaning & Summarizing Data With dplyr

Griffin Carson

2026-02-07