Practice: Cleaning and Summarizing Data with dplyr

Load Packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)   # optional: dplyr is already attached by library(tidyverse)

Load Dataset

data(mtcars)

Look at the data set

head(mtcars, 10)      # first ten rows

##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Inspecting Data

str(object = mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

What does each row represent? Each row is one car model; the columns hold the variables measured for that car.
Name two variables that are numeric. mpg and qsec
Name one variable that represents a category, even if it is currently stored as a number. vs (engine shape: 0 = V-shaped, 1 = straight)
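
One way to spot category-like columns is to count how many distinct values each column holds. The sketch below is only an illustration: the cutoff of 6 distinct values and the object names distinct_counts and category_candidates are assumptions made for this example, not part of the assignment.

```r
library(dplyr)

# Count the distinct values in every column of mtcars (one-row result).
distinct_counts <- mtcars %>%
  summarise(across(everything(), n_distinct))

# Columns with only a few distinct values are candidates for factors.
# The cutoff of 6 is an arbitrary choice for this illustration.
category_candidates <- names(distinct_counts)[unlist(distinct_counts) <= 6]
category_candidates
```

A low count of distinct values is only a hint. Whether a column such as gear or carb should be treated as a category still depends on the question being asked.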

Cleaning Data

Convert Variables to Factors

d1 <- mtcars %>%                                             # name as d1
  mutate(cyl = as.factor(cyl), am = as.factor(am))

head(d1, 10)

##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Check the structure again

str(d1)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
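
The factor levels "0" and "1" for am are hard to read in tables and plots. A possible variation, assuming the coding from the mtcars help page (0 = automatic, 1 = manual), attaches readable labels during the conversion. The name d1_labeled is hypothetical and used only for this sketch.

```r
library(dplyr)

# Convert cyl and am to factors; give am readable labels.
# Label coding assumed from ?mtcars: 0 = automatic, 1 = manual.
d1_labeled <- mtcars %>%
  mutate(
    cyl = factor(cyl),
    am  = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
  )

levels(d1_labeled$am)   # the two labels, in level order
table(d1_labeled$am)    # number of cars per transmission type
```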

Selecting Variables

# create object that only contains mpg, hp, wt, cyl, and am

cars_clean <- d1 %>%
  select(mpg, hp, wt, cyl, am) 
head(cars_clean)

##                    mpg  hp    wt cyl am
## Mazda RX4         21.0 110 2.620   6  1
## Mazda RX4 Wag     21.0 110 2.875   6  1
## Datsun 710        22.8  93 2.320   4  1
## Hornet 4 Drive    21.4 110 3.215   6  0
## Hornet Sportabout 18.7 175 3.440   8  0
## Valiant           18.1 105 3.460   6  0

Filtering Observations

# filter cars with >100 horsepower 
cars_hp <- d1 %>% 
  filter(hp > 100)

head(cars_hp, 10)

##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3

Check how many rows remain

nrow(cars_hp)

## [1] 23
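
filter() also accepts several conditions at once; conditions separated by commas are combined with a logical AND. As a small sketch, the example below keeps only manual cars with more than 100 horsepower. The object name cars_hp_manual is made up for this illustration.

```r
library(dplyr)

# Keep rows that satisfy BOTH conditions: hp above 100 and manual transmission.
cars_hp_manual <- mtcars %>%
  filter(hp > 100, am == 1)

nrow(cars_hp_manual)   # rows remaining after both conditions
```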

Creating New Variables

cars_hp %>%
  mutate(power_to_weight = hp/wt) %>%
  head(10)

##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
##                   power_to_weight
## Mazda RX4                41.98473
## Mazda RX4 Wag            38.26087
## Hornet 4 Drive           34.21462
## Hornet Sportabout        50.87209
## Valiant                  30.34682
## Duster 360               68.62745
## Merc 280                 35.75581
## Merc 280C                35.75581
## Merc 450SE               44.22604
## Merc 450SL               48.25737
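
The pipeline above only prints the new column; it does not store it. If power_to_weight is needed later, the result has to be assigned to an object. A minimal sketch, using the hypothetical name cars_power and sorting from highest to lowest ratio:

```r
library(dplyr)

# Save the result so the new column is available in later steps,
# then sort so the highest power-to-weight ratio comes first.
cars_power <- mtcars %>%
  filter(hp > 100) %>%
  mutate(power_to_weight = hp / wt) %>%
  arrange(desc(power_to_weight))

head(cars_power, 3)
```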

Grouping and Summarizing Data

Summarize by Number of Cylinders

summary_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), mean_horsepower = mean(hp), number_of_rows = n())   # n() counts rows per group
summary_cyl

## # A tibble: 3 × 4
##     cyl mean_mpg mean_horsepower number_of_rows
##   <dbl>    <dbl>           <dbl>          <int>
## 1     4     26.7            82.6             11
## 2     6     19.7           122.               7
## 3     8     15.1           209.              14
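
When counting rows inside summarise(), it matters which function is used: n() counts the rows of the current group, while nrow(mtcars) always returns the size of the whole data frame, whatever the grouping. The sketch below puts the two side by side; the column names group_rows and total_rows are chosen only for this comparison.

```r
library(dplyr)

# Compare a per-group count with the size of the full data frame.
cyl_counts <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    group_rows = n(),            # rows in the current cyl group
    total_rows = nrow(mtcars)    # always 32, regardless of the group
  )
cyl_counts

# count() is a shortcut for group_by() followed by summarise(n = n()).
mtcars %>% count(cyl)
```

The group counts should add up to the total number of rows, which makes a quick sanity check after any grouped summary.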

Summary By Transmission Type

summary_transmission <- mtcars %>%
  mutate(power_to_weight = hp/wt) %>%
  group_by(am) %>%
  summarise(mean_mpg = mean(mpg), mean_power_to_weight = mean(power_to_weight))   # name matches the variable averaged
summary_transmission

## # A tibble: 2 × 3
##      am mean_mpg mean_power_to_weight
##   <dbl>    <dbl>                <dbl>
## 1     0     17.1                 42.2
## 2     1     24.4                 49.9

Interpreting The Summaries

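Which group appears to have higher fuel efficiency? By transmission type, group 1 (am = 1) has the higher mean mpg (24.4 vs 17.1). By cylinder count, 4-cylinder cars have the highest mean mpg (26.7).
Which summary table would be useful for making a bar plot? Both would work, because each table holds one mean value per category (2 transmission types and 3 cylinder counts).
Which would work better for a boxplot later? Neither summary table on its own, because a boxplot shows the spread within each group and so needs the unsummarized rows. I would plot the raw mpg values grouped by cyl, because that compares three groups.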
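
The difference between the two plot types can be sketched with ggplot2, which is attached with the tidyverse. This is an illustration under assumptions, not part of the assignment: the object names bar_plot and box_plot and the axis labels are made up here. The bar plot is drawn from the summary table (one mean per group), and the boxplot from the raw mtcars rows.

```r
library(dplyr)
library(ggplot2)

# Summary table: one row (one mean) per cylinder group.
summary_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Bar plot: built from the summary, one bar per group.
bar_plot <- ggplot(summary_cyl, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean mpg")

# Boxplot: built from the raw data, so it can show the spread in each group.
box_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(x = "Cylinders", y = "mpg")

bar_plot
box_plot
```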

Reflection

One thing that was confusing was when to use a boxplot versus a bar plot. Both summaries are grouped by a categorical variable, but one has 3 rows and the other has 2. My first guess was that the 2-row summary would be better for a bar plot and the 3-row summary for a boxplot, but the number of groups is not what decides it: a bar plot shows one summary value per group, while a boxplot shows the spread of the raw observations within each group. Using the summarise() function is easier now, and I was faster with dplyr this time compared to the first assignment. Cleaning data before plotting helps the user check for errors in the data, makes sure all the columns are the correct type (numeric or categorical), and makes the raw data easier to read. If you are only working with a few columns, there is no reason to carry around a data set with 6 extra columns that you will never look at.