Exploring the Tidyverse and Data Manipulation in R

Introduction

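This R Markdown file demonstrates data exploration, manipulation, and summary techniques using the tidyverse, a collection of R packages that includes dplyr, tibble, and ggplot2. The examples use the mpg dataset, which ships with ggplot2 rather than with base R, so the tidyverse (or ggplot2) must be loaded before mpg is available.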

Loading and Viewing Data

# Load the tidyverse (attaches dplyr, tibble, ggplot2, and others)
library(tidyverse)

# List the datasets available in the currently attached packages
data()

# View the `mpg` dataset (a dataset about fuel economy of cars)
View(mpg)

# Display help documentation for `mpg`
?mpg

## starting httpd help server ... done

The mpg dataset contains information on fuel economy for different makes and models of cars. It’s useful for learning data manipulation and visualization in R.
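View() opens an interactive viewer, which may not be available when the file is knitted or run as a script. A minimal sketch of console-friendly alternatives, assuming ggplot2 (the package that provides mpg) is installed:

```r
library(ggplot2)  # provides the mpg dataset

dim(mpg)       # number of rows and columns
head(mpg, 3)   # first three rows
names(mpg)     # column names
```

These calls print to the console, so their output also appears in a knitted document.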

Working with Mean and Trimming

# Calculate the mean of a numeric vector
x <- c(0:10, 50)
xm <- mean(x)  # Regular mean
trimmed_mean <- mean(x, trim = 0.10)  # Drops the lowest and highest 10% of values (here, one from each end) before averaging

# Display the results
c(xm, trimmed_mean)

## [1] 8.75 5.50

# Also try head(), tail(), min(), max(), and range() on x

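The mean() function calculates the arithmetic mean. Its trim argument is the fraction (between 0 and 0.5) of observations dropped from each end of the sorted vector before averaging, which makes the result less sensitive to outliers such as the value 50 above.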
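To see what the trimming does, the calculation can be reproduced by hand. The sketch below assumes the same vector x as above; the names k, x_kept, and manual_trimmed are introduced here only for illustration. It also shows a few base R helpers for inspecting the ends and spread of a vector.

```r
x <- c(0:10, 50)
trim <- 0.10

# mean() drops floor(n * trim) observations from each end of the sorted vector
n <- length(x)
k <- floor(n * trim)                   # 12 * 0.10 = 1.2, so k = 1
x_sorted <- sort(x)
x_kept <- x_sorted[(k + 1):(n - k)]    # drops 0 and 50, keeps 1 to 10

manual_trimmed <- mean(x_kept)
all.equal(manual_trimmed, mean(x, trim = trim))  # TRUE

# Quick look at the ends and spread of the vector
head(x, 3)   # first three values
tail(x, 3)   # last three values
range(x)     # smallest and largest value
```

Because only one value is removed from each end, the outlier 50 no longer pulls the average upward.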

Exploring the Structure of Data

# Glimpse and structure of `mpg` dataset
glimpse(mpg)  # Provides a compact overview of the dataset

## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

str(mpg)      # Provides detailed structure information

## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

Both glimpse() and str() are useful for quickly understanding the dataset’s structure, including column types and sample values.
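When the column types are needed programmatically rather than printed, base R can extract them. A small sketch, assuming ggplot2 is installed; col_types is a name chosen here for illustration:

```r
library(ggplot2)  # provides mpg

# Class of each column, as a named character vector
col_types <- sapply(mpg, class)
col_types

# How many columns there are of each type
table(col_types)

# Names of the numeric columns (integer and double)
names(mpg)[sapply(mpg, is.numeric)]
```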

Calculating the Mean of a Variable

# Calculate the mean displacement (engine size in liters)
mean_displacement <- mean(mpg$displ)
mean_displacement

## [1] 3.471795

The mpg$displ column contains the engine displacement in liters. Calculating the mean gives us an idea of the average engine size.
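The mean is only one summary of a column. The sketch below, assuming ggplot2 is installed, compares it with the median and shows how missing values affect mean(); the vector y is a hypothetical example with one NA appended, since displ itself is used here without any missing values.

```r
library(ggplot2)  # provides mpg

mean(mpg$displ)      # arithmetic mean, as above
median(mpg$displ)    # middle value, less sensitive to a few large engines
summary(mpg$displ)   # min, quartiles, mean, and max in one call

# mean() returns NA if any value is missing unless na.rm = TRUE
y <- c(mpg$displ, NA)   # hypothetical vector with one missing value
mean(y)                 # NA
mean(y, na.rm = TRUE)   # same as mean(mpg$displ)
```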

Filtering Data

# Filter rows where city fuel economy (cty) is at least 20
test <- filter(mpg, cty >= 20)
View(test)

The filter() function is part of dplyr and keeps only the rows that satisfy a condition. Here, we select cars with a city fuel economy of 20 miles per gallon or higher.
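filter() also accepts several conditions at once. A short sketch, assuming dplyr and ggplot2 are installed; the object names efficient_compacts and small_or_thrifty are illustrative only:

```r
library(dplyr)
library(ggplot2)  # provides mpg

# Conditions separated by commas are combined with AND
efficient_compacts <- filter(mpg, cty >= 20, class == "compact")

# Use | for OR, and %in% to match any of several values
small_or_thrifty <- filter(mpg, class %in% c("compact", "subcompact") | hwy > 35)

nrow(efficient_compacts)
nrow(small_or_thrifty)
```

Checking nrow() is a quick way to confirm how many rows a filter keeps without opening the viewer.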

Creating a Tibble

# Create a tibble with custom columns
tb <- tibble(
  a = 1:5,                   # A sequence of integers
  b = letters[1:5],          # First five letters of the alphabet
  c = Sys.Date() + 1:5       # Five consecutive dates starting tomorrow
)

# Print the tibble
print(tb)

## # A tibble: 5 × 3
##       a b     c         
##   <int> <chr> <date>    
## 1     1 a     2025-02-22
## 2     2 b     2025-02-23
## 3     3 c     2025-02-24
## 4     4 d     2025-02-25
## 5     5 e     2025-02-26

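Tibbles are a modern version of data frames. When printed, they show only the first rows along with each column's type, and they are stricter than base data frames: they never convert strings to factors and do not partially match column names.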
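A few tibble behaviours are easiest to see in code. The sketch below assumes the tibble package is installed; tb2 and df are hypothetical objects created just for this illustration.

```r
library(tibble)

# Columns are created in order, so later columns can use earlier ones
tb2 <- tibble(
  x = 1:3,
  y = x * 2,                        # uses the column x defined just above
  label = c("low", "mid", "high")
)
tb2

# Convert an ordinary data frame to a tibble
df <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))
as_tibble(df)

# Selecting one column with [ keeps a tibble a tibble
class(tb2[, "x"])
class(df[, "id"])   # a base data frame drops to a plain vector here
```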

Grouping and Summarizing Data

# Group data by car class and calculate the mean city fuel economy
mpg %>% 
  group_by(class) %>%         # Group cars by their class
  summarise(mean_cty = mean(cty, na.rm = TRUE))  # Calculate mean city mileage for each group

## # A tibble: 7 × 2
##   class      mean_cty
##   <chr>         <dbl>
## 1 2seater        15.4
## 2 compact        20.1
## 3 midsize        18.8
## 4 minivan        15.8
## 5 pickup         13  
## 6 subcompact     20.4
## 7 suv            13.5

The group_by() function groups rows by a categorical variable, and summarise() then returns one row of summary statistics per group. Here, we calculate the average city fuel economy for each class of car; na.rm = TRUE tells mean() to ignore any missing values.
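summarise() can compute several statistics in one call, and the result can be sorted afterwards. A minimal sketch, assuming dplyr and ggplot2 are installed; class_summary and its column names are chosen here for illustration:

```r
library(dplyr)
library(ggplot2)  # provides mpg

class_summary <- mpg %>%
  group_by(class) %>%
  summarise(
    n = n(),                 # number of cars in each class
    mean_cty = mean(cty),    # average city fuel economy
    mean_hwy = mean(hwy)     # average highway fuel economy
  ) %>%
  arrange(desc(mean_cty))    # most economical classes first

class_summary
```

Including n() alongside the means shows how many cars each average is based on, which helps when a class has very few rows.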