This R Markdown file demonstrates data exploration, manipulation, and summary techniques using the tidyverse packages. The examples use the mpg dataset, which ships with ggplot2 and becomes available once the tidyverse is loaded.
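# Load the tidyverse (provides dplyr, tibble, and ggplot2, which ships the mpg dataset)
library(tidyverse)
# List the datasets available in the currently loaded packages
data()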
# View the `mpg` dataset (a dataset about fuel economy of cars)
View(mpg)
# Display help documentation for `mpg`
?mpg
## starting httpd help server ... done
The mpg dataset contains information on fuel economy for different makes and models of cars. It’s useful for learning data manipulation and visualization in R.
# Calculate the mean of a numeric vector
x <- c(0:10, 50)
xm <- mean(x) # Regular mean
trimmed_mean <- mean(x, trim = 0.10) # Trimmed mean: drops 10% of the observations from each end (here, one value per end)
# Display the results
c(xm, trimmed_mean)
## [1] 8.75 5.50
# Check head, tail, min, max, range functions
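The mean() function calculates the arithmetic mean. Its trim argument drops the given fraction of observations from each end of the sorted vector before averaging, which makes the result less sensitive to outliers such as the 50 above.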
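As a rough check of what trim does, the trimmed mean can be reproduced by hand: sort the vector, drop `floor(n * trim)` values from each end, and average the rest. The sketch below does that for the same `x`, and also runs the helper functions named in the comment above (`head()`, `tail()`, `min()`, `max()`, `range()`). The names `k`, `x_sorted`, and `manual_trimmed` are illustrative only.

```r
# Same vector as above
x <- c(0:10, 50)

# Reproduce mean(x, trim = 0.10) by hand
n <- length(x)
k <- floor(n * 0.10)                 # values dropped from each end (1 here)
x_sorted <- sort(x)
manual_trimmed <- mean(x_sorted[(k + 1):(n - k)])

# Should agree with the built-in trimmed mean
all.equal(manual_trimmed, mean(x, trim = 0.10))

# Quick look at the ends and extremes of the vector
head(x, 3)   # first three values
tail(x, 3)   # last three values
min(x)
max(x)
range(x)     # minimum and maximum together
```

With 12 values and `trim = 0.10`, exactly one value is removed from each end (the 0 and the 50), so the trimmed mean is the mean of 1 through 10.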
# Glimpse and structure of `mpg` dataset
glimpse(mpg) # Provides a compact overview of the dataset
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
str(mpg) # Provides detailed structure information
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
Both glimpse() and str() are useful for quickly understanding the dataset’s structure, including column types and sample values.
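When the column types are needed programmatically rather than just printed, a small base R sketch along these lines may help. It assumes ggplot2 is installed (for the mpg data); `col_types` is a name chosen for illustration.

```r
library(ggplot2)  # provides the mpg dataset

# Number of rows and columns
dim(mpg)

# First class of each column, as a named character vector
col_types <- vapply(mpg, function(col) class(col)[1], character(1))
col_types

# How many columns there are of each type
table(col_types)
```

This returns the same type information that glimpse() and str() display, but in a form that can be filtered or reused in later code.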
# Calculate the mean displacement (engine size in liters)
mean_displacement <- mean(mpg$displ)
mean_displacement
## [1] 3.471795
The mpg$displ column contains the engine displacement in liters. Calculating the mean gives us an idea of the average engine size.
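A single mean says little about spread or skew, so it can be worth looking at a few other summaries of the same column. A minimal sketch, assuming ggplot2 is installed for the mpg data:

```r
library(ggplot2)  # provides the mpg dataset

# Minimum, quartiles, median, mean, and maximum in one call
summary(mpg$displ)

# Median and standard deviation, for comparison with the mean
median(mpg$displ)
sd(mpg$displ)
```

If the mean sits noticeably above the median, that usually points to a few large engines pulling the average up.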
# Filter rows where city fuel economy (cty) is at least 20
test <- filter(mpg, cty >= 20)
View(test)
The filter() function is part of dplyr and keeps only the rows that satisfy a condition. Here, we select cars with a city mileage of 20 mpg or higher.
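filter() also accepts several conditions at once. The sketch below combines two conditions and compares the single-condition filter with base R subsetting. It assumes dplyr and ggplot2 are installed; `efficient_compacts` and `base_subset` are illustrative names, and the result is inspected with `nrow()` and `head()` instead of `View()` so that it also works outside an interactive session.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Comma-separated conditions are combined with AND
efficient_compacts <- mpg %>%
  filter(cty >= 20, class == "compact")

# Inspect the result in the console
nrow(efficient_compacts)
head(efficient_compacts)

# Base R equivalent of the single-condition filter
base_subset <- mpg[mpg$cty >= 20, ]
nrow(base_subset) == nrow(filter(mpg, cty >= 20))
```

The two approaches agree here because cty has no missing values. With NAs present, filter() drops the rows where the condition is NA, while bracket subsetting keeps them as rows of NA.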
# Create a tibble with custom columns
tb <- tibble(
  a = 1:5,               # A sequence of integers
  b = letters[1:5],      # First five letters of the alphabet
  c = Sys.Date() + 1:5   # Five consecutive dates, starting tomorrow
)
# Print the tibble
print(tb)
## # A tibble: 5 × 3
## a b c
## <int> <chr> <date>
## 1 1 a 2025-02-22
## 2 2 b 2025-02-23
## 3 3 c 2025-02-24
## 4 4 d 2025-02-25
## 5 5 e 2025-02-26
Tibbles are a modern take on data frames: they print only the first rows together with the type of each column, never convert strings to factors, and do not partially match column names.
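Two of the behavioural differences between tibbles and base data frames can be seen in a short comparison. This is a sketch assuming the tibble package is installed; `df_demo` and `tb_demo` are made-up names for illustration.

```r
library(tibble)

df_demo <- data.frame(abc = 1:3)
tb_demo <- tibble(abc = 1:3)

# Partial matching with $: the data frame silently matches "a" to "abc",
# while the tibble returns NULL and warns about an unknown column
df_demo$a
tb_demo$a

# Single-column subsetting with [ , ]: the data frame drops to a vector,
# while the tibble stays a tibble
class(df_demo[, 1])
class(tb_demo[, 1])
```

The stricter tibble behaviour tends to surface typos in column names early instead of letting them pass silently.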
# Group data by car class and calculate the mean city fuel economy
mpg %>%
  group_by(class) %>%                              # Group cars by their class
  summarise(mean_cty = mean(cty, na.rm = TRUE))    # Calculate mean city mileage for each group
## # A tibble: 7 × 2
## class mean_cty
## <chr> <dbl>
## 1 2seater 15.4
## 2 compact 20.1
## 3 midsize 18.8
## 4 minivan 15.8
## 5 pickup 13
## 6 subcompact 20.4
## 7 suv 13.5
The group_by() function is used for grouping data by a categorical variable, and summarise() computes summary statistics for each group. Here, we calculate the average city fuel economy for each class of cars.
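summarise() can compute several statistics per group in one call. The following sketch extends the example with a group size and the mean highway mileage, then sorts the classes by city mileage. It assumes dplyr and ggplot2 are installed; `class_summary`, `n`, and `mean_hwy` are names chosen for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

class_summary <- mpg %>%
  group_by(class) %>%
  summarise(
    n = n(),                # number of cars in each class
    mean_cty = mean(cty),   # mean city mileage
    mean_hwy = mean(hwy)    # mean highway mileage
  ) %>%
  arrange(desc(mean_cty))   # most economical class first

class_summary
```

Including `n()` alongside the means is a cheap safeguard, since a mean computed from only a handful of cars deserves less weight than one computed from dozens.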