Cleaning and Preparing Data for Visualization

In this activity, you will practice using tidyverse and dplyr to clean a dataset and prepare summary tables that could later be used to make figures. The focus is on understanding how data structure affects visualization.

Work through each section in order. You may work quietly with classmates nearby, but everyone should write and submit their own work.


Load Required Packages

Load the library of tidyverse, install it if necessary.

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the Dataset

For today’s activity, we will use a built-in dataset so that everyone is working with the same data. Load the mtcars dataset

# This is a built-in dataset; therefore, I don't need any fancy tricks to call it, right?
# I'm just printing it to make sure that it loads properly. This chunk is set to print=FALSE so that I don't make the document unecessarily long.
knitr::kable(mtcars, caption="mtcars")
mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Take a moment to look at the dataset.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Inspecting the Data

Before cleaning data, it is important to understand its structure. View the structure of your variables.

mtcars %>%
  sapply(class) %>%
  knitr::kable(caption="mtcars variables")
mtcars variables
x
mpg numeric
cyl numeric
disp numeric
hp numeric
drat numeric
wt numeric
qsec numeric
vs numeric
am numeric
gear numeric
carb numeric
  #So it looks like applying "sapply" to the table and then piping it forward to knitr::kable ONLY knits the variable names and classes. That's weird. Is it because knitr isn't part of dplyr and, therefore, isn't calibrated to play nice with dplyr commands?
  head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Answer the following questions in text below (not as code):

  • What does each row represent? Miles per gallon, cylinders, displacement (in cubic inches), horsepower, driveshaft ratio, weight (likely in tons), qsec (measurement of acceleration), engine shape, transmission type, number of gears, and number of carburetors. I had to look some of those meanings up, I used this GitHub page to figure that out.

  • Name two variables that are numeric. “wt” and “hp”.

  • Name one variable that represents a category, even if it is currently stored as a number. “am” indicates auto or manual transmission. That should be categorical; right now, it’s stored as a “0” or “1” binary, but R seems to automatically read that as a numerical variable. I’m not sure which value corresponds to 0 or 1 though.


Cleaning the Data

Some variables in this dataset are stored as numbers even though they represent categories.

Convert Variables to Factors

Convert the following variables to factors:

  • cyl (number of cylinders)

  • am (transmission type)

# I can convert those two variables to factors with mutate() or as.factor() ; the former is dplyr, the latter is forcats (which might be included in dplyr, not sure). 
#Because this is a prepackaged dataset, I'm not sure if any edits made within the program will be saved when I call "mtcars", so it might be a good idea to make a copy of mtcars and THEN make my edits to that. 
# If there are two datasets called "mtcars"- one packaged with dplyr and one that only exists in this program- will the edited version be prioritized over the base version? Especially if I call it outside of the chunk where I made the edits? I'm not sure.

#This line makes a copy of mtcars called "mtcarsEdit" for the sake of editing.
mtcarsEdit <- mtcars

#These lines pipe mtcarsEdit into two "mutate" functions to change the data types of "cyl" and "am" to factors.
#mtcarsEdit %>%
 # mutate(cyl,as.factor(cyl)) %>%
 # mutate(am, as.factor(cyl)) %>%
#  glimpse()
#After "mutate()" runs, the program prints the updated tables.

#So everything that I just did reverts back to a double in the next chunk. I'm commenting that out and doing this instead.

  mtcarsEdit$am = as.factor(mtcarsEdit$am)
  mtcarsEdit$cyl = as.factor(mtcarsEdit$cyl)
  head(mtcarsEdit)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
  #This seems to retain the changes across chunks. I'm not sure why lines 121-124 didn't stick into the next chunk.

Check that the conversion worked by looking at the structure again.

# I probably have to call mtcarsEdit again, since this is a new chunk. 
mtcarsEdit %>%
  sapply(class)
##       mpg       cyl      disp        hp      drat        wt      qsec        vs 
## "numeric"  "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##        am      gear      carb 
##  "factor" "numeric" "numeric"
#NOTE: piping mtcarsEdit into sapply and then piping that into "head" only returns the variable classes of mpg, cyl, disp, hp, drat, and wt in the chart (not the data). Weird. I guess it's taking the limited information from sapply instead of taking information from mtcarsEdit. 
#I assume that calling mtcarsEdit AGAIN and piping it into head() or glimpse() or str() will return the full table though.

mtcarsEdit %>%
  head()
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#Why did the changes not stick when I used mutate() to change variable type, but they DID when I used data$variable = as.factor(data$variable) to change it? That's weird.

Selecting Relevant Variables

For visualization, it is often helpful to work with only the variables you need.

Create a new object called cars_clean that contains only:

  • mpg

  • hp

  • wt

  • cyl

  • am

#It looks like adding an apostrophe to "dont" caused the program to fail to knit. So chunk titles probably shouldn't have punctuation.

#I wonder if I could make cars_clean with the pipe-forward operator instead of the pipe-back operator.
# Nope, that reads the intended variable name as a function, not "copy the data into a variable named this".
#Also, I'm using programmer caps for my variable titles instead of underscores. Sorry if this makes my code hard to just... search through.
# This should remove certain columns, not rows.
# This line writes the selected variables of "mtcarsEdit" to "carsClean".
  carsClean = select(mtcarsEdit, c(mpg, hp, wt, cyl, am))

#this line prints the observation number, variables, and some of the values of carsClean.
str(carsClean)
## 'data.frame':    32 obs. of  5 variables:
##  $ mpg: num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ hp : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ wt : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ cyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...

Filtering Observations

Now filter the dataset to include only cars with:

  • More than 100 horsepower

Save the result as a new object called cars_hp.

# Now HERE'S where filter() comes in handy.
# select() is for excluding/including specific columns, filter() is for excluding/including specific rows. 
#carsHP = filter(carsClean, hp > 100)
#Here's a cleaner version of the previous line

carsHP <- carsClean %>%
  filter(hp > 100)

Check how many rows remain.

str(carsHP)
## 'data.frame':    23 obs. of  5 variables:
##  $ mpg: num  21 21 21.4 18.7 18.1 14.3 19.2 17.8 16.4 17.3 ...
##  $ hp : num  110 110 110 175 105 245 123 123 180 180 ...
##  $ wt : num  2.62 2.88 3.21 3.44 3.46 ...
##  $ cyl: Factor w/ 3 levels "4","6","8": 2 2 2 3 2 3 2 2 3 3 ...
##  $ am : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
knitr::kable(carsHP, caption="carsHP")
carsHP
mpg hp wt cyl am
Mazda RX4 21.0 110 2.620 6 1
Mazda RX4 Wag 21.0 110 2.875 6 1
Hornet 4 Drive 21.4 110 3.215 6 0
Hornet Sportabout 18.7 175 3.440 8 0
Valiant 18.1 105 3.460 6 0
Duster 360 14.3 245 3.570 8 0
Merc 280 19.2 123 3.440 6 0
Merc 280C 17.8 123 3.440 6 0
Merc 450SE 16.4 180 4.070 8 0
Merc 450SL 17.3 180 3.730 8 0
Merc 450SLC 15.2 180 3.780 8 0
Cadillac Fleetwood 10.4 205 5.250 8 0
Lincoln Continental 10.4 215 5.424 8 0
Chrysler Imperial 14.7 230 5.345 8 0
Dodge Challenger 15.5 150 3.520 8 0
AMC Javelin 15.2 150 3.435 8 0
Camaro Z28 13.3 245 3.840 8 0
Pontiac Firebird 19.2 175 3.845 8 0
Lotus Europa 30.4 113 1.513 4 1
Ford Pantera L 15.8 264 3.170 8 1
Ferrari Dino 19.7 175 2.770 6 1
Maserati Bora 15.0 335 3.570 8 1
Volvo 142E 21.4 109 2.780 4 1
# This version of the dataset only has 23 rows.

Creating New Variables

Create a new variable called power_to_weight defined as:


horsepower / weight

Add this variable to cars_hp.

carsHP <- carsHP %>%
  mutate(powerToWeight = hp/wt)
#So piping carsHP forward and making a new variable displays that variable after mutate(), but it disappears the next time I try to load the table. That's really weird. why?
#OKAY, so putting "carsHP <-" before the pipe-forward into the new variable creation MAKES THE CHANGES STICK. That's... interesting.

str(carsHP)
## 'data.frame':    23 obs. of  6 variables:
##  $ mpg          : num  21 21 21.4 18.7 18.1 14.3 19.2 17.8 16.4 17.3 ...
##  $ hp           : num  110 110 110 175 105 245 123 123 180 180 ...
##  $ wt           : num  2.62 2.88 3.21 3.44 3.46 ...
##  $ cyl          : Factor w/ 3 levels "4","6","8": 2 2 2 3 2 3 2 2 3 3 ...
##  $ am           : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
##  $ powerToWeight: num  42 38.3 34.2 50.9 30.3 ...

Grouping and Summarizing Data

To prepare data for figures, we often summarize values by group.

Summary by Number of Cylinders

Create a summary table that shows:

  • Mean miles per gallon (mpg)

  • Mean horsepower (hp)

  • Number of observations

Grouped by:

  • cyl

Save this as summary_cyl.

summaryCyl <- carsHP %>%
  group_by(cyl) %>%
  summarize('meanMPG' = mean(mpg), 'meanHP' = mean(hp))
  
  #It looks like you need to do extra work to get factors to work with count(), add_count(), or add_tally(). Ohhhhh, I'm going to bite someone
#This code tests how to get count() to work with factors. Ugh
testDf <- count(carsHP, cyl)
str(testDf)
## 'data.frame':    3 obs. of  2 variables:
##  $ cyl: Factor w/ 3 levels "4","6","8": 1 2 3
##  $ n  : int  2 7 14
# There is, without a doubt, a way to do this that is less annoying. However, when I tried to pass count(carsHP, cylCount) directly into summarize, it assigned THE ENTIRE DATA FRAME to each instance of cylCount. That's funny, but it's not what I want at all.

#These two lines:
##1. pass all code on the following two lines back to summaryCyl to make changes stick,
##2. pipes summaryCyl forward into the next line,
##3. Adds a new variable to summaryCyl, obsCount, and copies all instances of testDf$n into summaryCyl$obsCount in the exact order that they appear originally.
summaryCyl <- 
  summaryCyl %>%
  mutate(obsCount = testDf$n)

  
str(summaryCyl)
## tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
##  $ cyl     : Factor w/ 3 levels "4","6","8": 1 2 3
##  $ meanMPG : num [1:3] 25.9 19.7 15.1
##  $ meanHP  : num [1:3] 111 122 209
##  $ obsCount: int [1:3] 2 7 14
head(summaryCyl)
## # A tibble: 3 × 4
##   cyl   meanMPG meanHP obsCount
##   <fct>   <dbl>  <dbl>    <int>
## 1 4        25.9   111         2
## 2 6        19.7   122.        7
## 3 8        15.1   209.       14

Display the table.

summaryCyl
## # A tibble: 3 × 4
##   cyl   meanMPG meanHP obsCount
##   <fct>   <dbl>  <dbl>    <int>
## 1 4        25.9   111         2
## 2 6        19.7   122.        7
## 3 8        15.1   209.       14
#Oh, okay, just typing out the name of the table makes it display. You don't have to use head(), str(), or whatever? I've been wasting time typing stuff.

Summary by Transmission Type

Now create a second summary table grouped by:

  • am

Include:

  • Mean miles per gallon

  • Mean power-to-weight ratio

Save this as summary_transmission.

summaryTransmission<- carsHP %>%
  group_by(am) %>% 
  summarise('meanMPG' = mean(mpg), 'meanPowToWt' = mean(powerToWeight))

#This line isn't strictly necessary, but I wanted to get an idea of the ratio of automatic-to-manual cars in the dataset.
testDf2 <- count (carsHP, am)
testDf2
##   am  n
## 1  0 16
## 2  1  7
summaryTransmission
## # A tibble: 2 × 3
##   am    meanMPG meanPowToWt
##   <fct>   <dbl>       <dbl>
## 1 0        16.1        44.6
## 2 1        20.6        62.1

Interpreting the Summaries

Answer the following questions in text:

  • Which group appears to have higher fuel efficiency? Using the first dataset, cars with less cylinders appear to have higher fuel efficiency. Using the second dataset, assuming that “am” stands for “automatic” and the factors are true/false states, cars with manual transmission have a higher fuel efficiency (and higher horsepower compared to their weight).

  • Which summary table would be useful for making a bar plot? Uhm. I think the transmission summary table would be better for making a bar plot because the means are closer together AND the distribution of the two variable states (automatic and manual) is a bit more even than the cylinder summary table. Bar plots generally expose less data than box plots, so I think showing the mean and standard-error bars ONLY is more valuable when the sample sizes are kind of similar.

  • Which would work better for a boxplot later? I’d prefer that both of these be in boxplots, honestly, because of the different sample sizes, but I think that it works best for the cylinder count table. The cylinder count table has wildly disproportionate groups (2, 7, and 14 observations), and that’s going to impact the value of the quartiles, medians, and upper/lower bounds. If your sample size is weird and skewed, then it’s a good idea to be upfront about that. Furthermore, I think it depends on your audience. A box plot works better if your audience cares about the fine details, but a bar plot works if your audience just cares about the mean values.


Reflection

In a short paragraph, describe:

  • One thing that was confusing If you look back in my code, you’ll see me complaining about the pipe-forward operators not retaining modifications to the datasets. That was pretty confusing at first.

  • One thing that makes more sense now The above issue makes more sense now. It seems that <- is the “apply all following lines to this variable” operator, while %>% is the “carry this variable forward as the first argument in the following lines” operator.

  • Why cleaning and summarizing data before plotting is important It’s easier to work with a smaller, fine-tuned set of data with the information that you actually care about, rather than manually sifting through the data mid-program.


Final Check

Before submitting, confirm that:

  • The document renders without errors

  • All code chunks run successfully

  • All written responses are complete

Save and submit both:

  • The rendered document

  • The .Rmd file Submission Text Editor