1. Getting Started

1.1 Datasets Package

I’ll be using several built-in R datasets for this exercise. They live in the datasets package, which ships with R and is normally attached at startup; you can also attach it explicitly with library(datasets). Because datasets is part of the base R distribution, there is no need to install it with install.packages(). Note that the penguins dataset used later was added in R 4.5.0, so on an older version you will need to update R to follow along.

To explore and learn more about R, you may visit the official website: https://www.r-project.org.

Once the package is loaded, you can view all available datasets, along with their descriptions, by running the following command:

data(package = "datasets")

Note: The output style you see may differ from mine. Although the underlying content is identical, I use custom formatting to enhance readability.

Dataset Description
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales) Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different diets
DNase Elisa assay of DNase
EuStockMarkets Daily Closing Prices of Major European Stock Indices, 1991-1998
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
Indometh Pharmacokinetics of Indomethacin
InsectSprays Effectiveness of Insect Sprays
JohnsonJohnson Quarterly Earnings per Johnson & Johnson Share
LakeHuron Level of Lake Huron 1875-1972
LifeCycleSavings Intercountry Life-Cycle Savings Data
Loblolly Growth of Loblolly Pine Trees
Nile Flow of the River Nile
Orange Growth of Orange Trees
OrchardSprays Potency of Orchard Sprays
PlantGrowth Results from an Experiment on Plant Growth
Puromycin Reaction Velocity of an Enzymatic Reaction
Seatbelts Road Casualties in Great Britain 1969-84
Theoph Pharmacokinetics of Theophylline
Titanic Survival of passengers on the Titanic
ToothGrowth The Effect of Vitamin C on Tooth Growth in Guinea Pigs
UCBAdmissions Student Admissions at UC Berkeley
UKDriverDeaths Road Casualties in Great Britain 1969-84
UKgas UK Quarterly Gas Consumption
USAccDeaths Accidental Deaths in the US 1973-1978
USArrests Violent Crime Rates by US State
USJudgeRatings Lawyers’ Ratings of State Judges in the US Superior Court
USPersonalExpenditure Personal Expenditure Data
UScitiesD Distances Between European Cities and Between US Cities
VADeaths Death Rates in Virginia (1940)
WWWusage Internet Usage per Minute
WorldPhones The World’s Telephones
ability.cov Ability and Intelligence Tests
airmiles Passenger Miles on Commercial US Airlines, 1937-1960
airquality New York Air Quality Measurements
anscombe Anscombe’s Quartet of ‘Identical’ Simple Linear Regressions
attenu The Joyner-Boore Attenuation Data
attitude The Chatterjee-Price Attitude Data
austres Quarterly Time Series of the Number of Australian Residents
beaver1 (beavers) Body Temperature Series of Two Beavers
beaver2 (beavers) Body Temperature Series of Two Beavers
cars Speed and Stopping Distances of Cars
chickwts Chicken Weights by Feed Type
co2 Mauna Loa Atmospheric CO2 Concentration
crimtab Student’s 3000 Criminals Data
discoveries Yearly Numbers of Important Discoveries
esoph Smoking, Alcohol and (O)esophageal Cancer
euro Conversion Rates of Euro Currencies
euro.cross (euro) Conversion Rates of Euro Currencies
eurodist Distances Between European Cities and Between US Cities
faithful Old Faithful Geyser Data
fdeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK
freeny Freeny’s Revenue Data
freeny.x (freeny) Freeny’s Revenue Data
freeny.y (freeny) Freeny’s Revenue Data
gait Hip and Knee Angle while Walking
infert Infertility after Spontaneous and Induced Abortion
iris Edgar Anderson’s Iris Data
iris3 Edgar Anderson’s Iris Data
islands Areas of the World’s Major Landmasses
ldeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK
lh Luteinizing Hormone in Blood Samples
longley Longley’s Economic Regression Data
lynx Annual Canadian Lynx trappings 1821-1934
mdeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK
morley Michelson Speed of Light Data
mtcars Motor Trend Car Road Tests
nhtemp Average Yearly Temperatures in New Haven
nottem Average Monthly Temperatures at Nottingham, 1920-1939
npk Classical N, P, K Factorial Experiment
occupationalStatus Occupational Status of Fathers and their Sons
penguins Measurements of Penguins near Palmer Station, Antarctica
penguins_raw (penguins) Measurements of Penguins near Palmer Station, Antarctica
precip Annual Precipitation in Selected US Cities
presidents Quarterly Approval Ratings of US Presidents
pressure Vapor Pressure of Mercury as a Function of Temperature
quakes Locations of Earthquakes off Fiji
randu Random Numbers from Congruential Generator RANDU
rivers Lengths of Major North American Rivers
rock Measurements on Petroleum Rock Samples
sleep Student’s Sleep Data
stack.loss (stackloss) Brownlee’s Stack Loss Plant Data
stack.x (stackloss) Brownlee’s Stack Loss Plant Data
stackloss Brownlee’s Stack Loss Plant Data
state.abb (state) US State Facts and Figures
state.area (state) US State Facts and Figures
state.center (state) US State Facts and Figures
state.division (state) US State Facts and Figures
state.name (state) US State Facts and Figures
state.region (state) US State Facts and Figures
state.x77 (state) US State Facts and Figures
sunspot.m2014 (sunspot.month) Monthly Sunspot Data, from 1749 to “Present”
sunspot.month Monthly Sunspot Data, from 1749 to “Present”
sunspot.year Yearly Sunspot Data, 1700-1988
sunspots Monthly Sunspot Numbers, 1749-1983
swiss Swiss Fertility and Socioeconomic Indicators (1888) Data
treering Yearly Tree-Ring Data, -6000-1979
trees Diameter, Height and Volume for Black Cherry Trees
uspop Populations Recorded by the US Census
volcano Topographic Information on Auckland’s Maunga Whau Volcano
warpbreaks The Number of Breaks in Yarn during Weaving
women Average Heights and Weights for American Women
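If you would rather check for a dataset programmatically than scan the list, here is a minimal sketch. It relies on data() returning an object whose results matrix has an "Item" column. The helper has_dataset() is my own name for illustration and is not part of R.

```r
# List every dataset shipped with the 'datasets' package.
# data() returns an object whose $results matrix has an "Item" column.
items <- data(package = "datasets")$results[, "Item"]

# Some items look like "beaver1 (beavers)"; keep only the object name.
dataset_names <- sub(" .*$", "", items)

# Hypothetical helper (not part of R): is a dataset available?
has_dataset <- function(name) name %in% dataset_names

has_dataset("mtcars")    # TRUE on a standard R installation
has_dataset("penguins")  # TRUE only on R versions that ship penguins
length(dataset_names)    # number of datasets on your installation
```

If has_dataset("penguins") returns FALSE, your R version most likely predates the release that added the dataset.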

1.2 A Quick Glimpse Into

The penguins dataset will be used for this exercise. To view a glimpse of the dataset, run glimpse(penguins). This gives a transposed overview of the data frame: each column becomes a row showing its name, type, and first few values.

glimpse() comes from the dplyr package, so install and load dplyr first (install.packages("dplyr"), then library(dplyr)).

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
## $ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
## $ bill_len    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
## $ bill_dep    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
## $ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
## $ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
## $ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
## $ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Alternatively, use the str() function. It compactly displays the internal structure of an R object.

str(penguins)
## 'data.frame':    344 obs. of  8 variables:
##  $ species    : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island     : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_len   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_dep   : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_len: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass  : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex        : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year       : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Another function we could use is class(), which checks the class of the dataset.

class(penguins)
## [1] "data.frame"

To confirm once more, use is.data.frame(penguins). This returns a logical value.

is.data.frame(penguins)
## [1] TRUE

1.3 A Quick ‘Glimpse’ Into pt. 2

I will now carry out a few simple tasks that you might perform when presented with a dataset, in order to get a sense of the data and determine the most appropriate way to work with it. This step is important for several reasons:

  1. Quick analysis – With just a few straightforward functions or scripts, you can gain preliminary insights into the dataset. This allows you to understand what actions might be needed before committing to any deeper analysis.

  2. Preparation required – Different data types or classes may require different functions or handling. Basically, different strokes for different folks. For instance, using glimpse() can help identify various data types that may each need a tailored strategy.

  3. Early warning – Some data may contain missing or problematic values (such as NA, NULL, NaN, etc.). Recognising these early helps avoid issues later on. If you spot missing values, you might choose to remove or replace them. However, just because a value is missing doesn’t mean it should be removed—it could be intentional or originate from the data source.

Let’s begin the exercise.

With reference to the penguins dataset, the following features can help you understand the data more quickly and effectively:

  • ?penguins : The question mark calls R’s help system. Placing ? before a dataset or function name tells R to display relevant documentation. This works not only with datasets but also with most R functions. It’s good practice to use ? whenever you’re uncertain about something. If you’re using RStudio, the information will appear in the “Help” tab.

  • View(penguins) : Opens the dataset in a new tab in a readable, spreadsheet-like format. Note: R is case-sensitive, so make sure to use an uppercase “V” in View(), or R will not recognise it.

  • head(penguins): Displays the first six rows of the dataset. You can also specify a number of rows, e.g. head(penguins, 2) shows the top two rows.

  • tail(penguins) : Similar to head(), but returns the last six rows by default. You can specify a different number in the same way.

  • count(penguins) : Counts the total number of rows in the dataset. This function comes from the dplyr package and returns the number of observations.

  • filter(penguins, species == "Adelie") : Returns all rows where the condition is met. In this case, it filters rows where the species column is exactly equal to “Adelie”. The == operator is used to match values exactly.

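  • sort_by(x, y, decreasing = FALSE) : Sorts x by y. x is the object to be sorted, typically a vector or data frame, and y is the variable (or variables) to sort by. The optional decreasing argument sets the sort order: decreasing = TRUE for descending, decreasing = FALSE (the default) for ascending. sort_by() is part of base R from version 4.4.0.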

  • unique(penguins$island) : Returns the unique (non-duplicated) values from the island column.

  • summary(penguins) : Provides a statistical summary of the dataset, showing min., max., mean, and quartiles for numerical variables, and counts for categorical ones.

  • summary(penguins$species) : Since species is a categorical variable, this will return a frequency count of each category.

  • summary(penguins$body_mass) : For a quantitative variable like body_mass, this shows:

    • Min. : the smallest value

    • 1st Qu. : the first quartile (25% of data points fall below this)

    • Median : the middle value

    • Mean : the average

    • 3rd Qu. : the third quartile (75% of data points fall below this)

    • Max. : the largest value

    • NA's : the number of missing values in that variable

  • plot(penguins) : Draws a scatterplot matrix, i.e. a grid of small scatterplots with one panel for every pair of variables (categorical variables are shown as their integer codes). It offers a quick visual overview.
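To see several of these functions side by side without scrolling through 344 rows, here is a small sketch on a made-up data frame. The object mini and its values are hypothetical; only the column names are borrowed from penguins. I use order() for sorting here because it works on every R version.

```r
# A tiny, made-up data frame that borrows column names from penguins.
mini <- data.frame(
  species   = factor(c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo")),
  island    = factor(c("Torgersen", "Dream", "Biscoe", "Dream", "Biscoe")),
  body_mass = c(3750L, NA, 5700L, 3500L, 5000L)
)

head(mini, 2)              # first two rows
tail(mini, 2)              # last two rows
unique(mini$island)        # Torgersen, Dream, Biscoe
summary(mini$species)      # counts per level: Adelie 2, Chinstrap 1, Gentoo 2
summary(mini$body_mass)    # six statistics plus an NA's count of 1

# Sort by body mass, heaviest first; rows with NA go last by default.
mini[order(mini$body_mass, decreasing = TRUE), ]
```

The same calls work unchanged on penguins; only the size of the output differs.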

2. Calling the Data

Calling—or more technically, selecting—is the process of retrieving data, whether or not a condition is applied. Let’s look at the basics of how to access or “call” data.

Using square brackets [ ]: This is known as subsetting and is suitable when the data is in the form of a table, tibble, or data frame. The general format is [row, column].

Within the brackets, you can use a colon : to specify a range, e.g. 1:5 selects from row 1 to row 5. This indexing method selects data by row and/or column position. If you leave either side of the comma blank, R will return all rows or columns accordingly. You can also simply type the dataset name in the console—without square brackets—to view the entire dataset. Example:

  1. Call the first to third (1 to 3) rows of data and include all the columns.
penguins[1:3,]
##   species    island bill_len bill_dep flipper_len body_mass    sex year
## 1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
## 2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
## 3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
  2. Call the 10th and 190th rows, and include the first to third columns.
penguins[c(10, 190), 1:3]
##     species    island bill_len
## 10   Adelie Torgersen     42.0
## 190  Gentoo    Biscoe     44.4
  3. Call the 100th to 105th rows, and include only the first, second, and seventh columns.
penguins[100:105, c(1, 2, 7)]
##     species island    sex
## 100  Adelie  Dream   male
## 101  Adelie Biscoe female
## 102  Adelie Biscoe   male
## 103  Adelie Biscoe female
## 104  Adelie Biscoe   male
## 105  Adelie Biscoe female

You can also insert a function inside those brackets for more specific filtering.
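The [row, column] pattern also accepts column names, negative positions, and logical conditions. The sketch below uses mtcars rather than penguins purely so that it runs on any R version; the variable name efficient is my own.

```r
# mtcars ships with every R installation: 32 rows, 11 columns.
mtcars[1:3, ]                       # rows 1 to 3, all columns
mtcars[c(1, 3), c("mpg", "cyl")]    # rows 1 and 3, two columns selected by name
mtcars[-(1:30), 1:2]                # drop rows 1 to 30, keep columns 1 and 2

# A logical condition in the row position keeps only the matching rows.
efficient <- mtcars[mtcars$mpg > 30, c("mpg", "cyl")]
efficient
```

Selecting columns by name is usually safer than by position, because the code keeps working if the column order changes.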

2.1 Calling with Conditions pt. 1

Let’s see how many species there are in the penguins dataset. Use this code:

unique(penguins$species)
## [1] Adelie    Gentoo    Chinstrap
## Levels: Adelie Chinstrap Gentoo

Say we want to retrieve all rows where the penguins’ species is “Chinstrap”. There are several ways to do this, each producing slightly different results depending on the method used.

# 1. Using dplyr::filter()
filter(penguins, species == 'Chinstrap')

# 2. Using base R subsetting
penguins[penguins$species == 'Chinstrap', ]

# 3. Using subset()
subset(penguins, species == 'Chinstrap')

# 4. Using which() to get row indices
penguins[which(penguins$species == 'Chinstrap'), ]

# 5. Using table()
table(penguins$species == 'Chinstrap')

# 6. Using sum()
sum(penguins$species == 'Chinstrap')

Explanation:

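  1. Prints all matching rows with every column, but the row numbers are renumbered from 1 rather than showing each row's original position in the dataset.

  2. Same rows as number 1, but the row numbers are exactly as they appear in the original dataset. This is bracket subsetting (notice the single square brackets [ ]).

  3. Same result as number 2. subset() performs the same kind of subsetting without the brackets; one difference is that it drops rows where the condition evaluates to NA.

  4. Same as numbers 2 and 3, with an extra step: which() first converts the logical condition into row indices.

  5. Returns a table counting how many rows evaluate to FALSE and how many to TRUE.

  6. Counts how many are TRUE (i.e., how many penguins are of the Chinstrap species).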

Important: Depending on where you run this code, the output may look different. If you run #2 in the console, you get a plain, command-line-style printout. If you run it in an R Markdown document, the output is rendered as a nicely structured table.
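The "slightly different results" become more visible when the column being tested contains NA. Here is a minimal sketch on a made-up data frame (mini and its values are hypothetical) comparing the base R approaches.

```r
# Made-up data with one missing species value.
mini <- data.frame(
  species = c("Adelie", NA, "Chinstrap", "Chinstrap"),
  mass    = c(3700, 3800, 3500, 3900)
)

by_bracket <- mini[mini$species == "Chinstrap", ]         # an NA condition yields an all-NA row
by_subset  <- subset(mini, species == "Chinstrap")        # rows where the condition is NA are dropped
by_which   <- mini[which(mini$species == "Chinstrap"), ]  # which() skips NA as well

nrow(by_bracket)     # 3
nrow(by_subset)      # 2
rownames(by_subset)  # "3" "4": original row positions are kept

table(mini$species == "Chinstrap")              # FALSE 1, TRUE 2 (NA is not counted)
sum(mini$species == "Chinstrap", na.rm = TRUE)  # 2
```

When the tested column has no missing values, the bracket, subset(), and which() versions return the same rows.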

2.2 Calling with Conditions pt. 2: Missing Values

This part goes a step further and uses conditions to locate missing values.

# 1. Checking missing values in all rows
penguins[!complete.cases(penguins), ]

# 2. Checking the indices
which(!complete.cases(penguins))

# 3. Checking missing values in a column
any(is.na(penguins$sex))

# 4. Checking the indices
which(is.na(penguins$sex))

# 5. Checking the count
sum(is.na(penguins$sex))

# 6. Checking which rows have missing values in a column
penguins[is.na(penguins$flipper_len), ]

Explanation:

  1. Using base R and subsetting (note the single square brackets [ ]), we nest the complete.cases() function inside. The exclamation mark ! is a logical operator meaning NOT. So !complete.cases() tells R to return all rows in the penguins dataset that are incomplete, i.e. contain at least one missing value. The result is displayed as a table.

  2. Similar to point 1, but instead of showing the rows, it returns a vector of row indices where missing values are found.

  3. This is a quick check that returns TRUE if any values are missing, and FALSE otherwise.

  4. Like point 2, this uses the which() function to find missing values in a column and returns the indices of rows where the values are missing.

  5. Counts how many missing values exist in a column.

  6. Similar to point 1 — it uses subsetting to return all rows that have missing values in a specific column. The result is displayed as a table.
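A quick way to get the missing-value count for every column at once is to combine is.na() with colSums(). This is a sketch on a made-up data frame (mini is hypothetical; the column names are borrowed from penguins), and it also shows na.omit(), which drops incomplete rows.

```r
# Made-up data with missing values in both columns.
mini <- data.frame(
  flipper_len = c(181L, NA, 195L, 193L),
  sex         = factor(c("male", "female", NA, NA))
)

colSums(is.na(mini))          # flipper_len 1, sex 2
which(!complete.cases(mini))  # 2 3 4: rows with at least one NA
nrow(na.omit(mini))           # 1: only the first row is complete
```

Whether dropping rows with na.omit() is appropriate depends on why the values are missing in the first place.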

2.3 Pattern Matching & Replacement

In R, to search for text patterns in a column (e.g. values that contain or start with something), we use:

  • grepl() : from base R.
  • str_detect() : from the stringr package; a tidyverse-friendly way to detect patterns. It works well in pipes (%>%) and with dplyr::filter().

2.3.1 Using grepl() (Base R)

Example 1: Find rows where island contains “Dream”.

penguins[grepl('Dream', penguins$island), ]

Example 2: Case-insensitive search

penguins[grepl("dream", penguins$island, ignore.case = TRUE), ]

2.3.2 Using str_detect() (Tidyverse)

This is more readable, especially with dplyr::filter():

Example 1: Island contains “Dream”

penguins %>%
  filter(str_detect(island, "Dream"))

Example 2: Island starts with “B”

penguins %>%
  filter(str_detect(island, "^B"))

Example 3: Species ends with “ie”

penguins %>%
  filter(str_detect(species, "ie$"))

2.3.3 Pattern Syntax (Regex Quick Guide)

Pattern                              Meaning                                        Example
"abc"                                Contains the literal text "abc"                "abc", "xabcy"
"^abc"                               Starts with "abc"                              "abcde"
"abc$"                               Ends with "abc"                                "123abc"
"a.c"                                "a", then any one character, then "c"          "abc", "a-c"
"a.*c"                               "a", then anything (or nothing), then "c"      "abc", "axyzc"
"a|b"                                Contains "a" OR "b"                            "a", "b"
"[aeiou]"                            Contains any vowel                             "apple", "unit"
"[^aeiou]"                           Contains a character that is not a vowel       "b", "c"
"\\bword\\b"                         Matches "word" as a whole word                 "\\bDream\\b" matches "Dream" but not "Daydream"
regex("text", ignore_case = TRUE)    Case-insensitive match (stringr)               "text", "Text", "TEXT"
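A few of these patterns can be tried directly on a plain character vector with grepl(), which needs no extra packages. The vector x is made up for illustration; the last two values are not real island names.

```r
# Made-up strings to test patterns against.
x <- c("Dream", "Biscoe", "Torgersen", "daydream", "dream island")

grepl("^B", x)                         # FALSE  TRUE FALSE FALSE FALSE
grepl("en$", x)                        # FALSE FALSE  TRUE FALSE FALSE
grepl("dream", x, ignore.case = TRUE)  #  TRUE FALSE FALSE  TRUE  TRUE
grepl("\\bdream\\b", x)                # FALSE FALSE FALSE FALSE  TRUE
```

Note that the last pattern is case-sensitive, so "Dream" with a capital D does not match, and "daydream" fails because there is no word boundary before the "d".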

2.3.4 Negating a Pattern

Example: Island does not contain “Dream”

penguins %>%
  filter(!str_detect(island, "Dream"))

2.3.5 Combine Pattern Search with Other Conditions

Example: Penguins on islands containing “Dream”, and body mass > 4000g

penguins %>%
  filter(str_detect(island, "Dream") & body_mass > 4000)

2.3.6 Match Whole Word Only

Use \\b (word boundary) for full-word match (e.g., match “Dream”, not “Daydream”):

penguins %>%
  filter(str_detect(island, "\\bDream\\b"))

3. Logical Operators

3.1 Logical Operators in R pt. 1

# 1. == (Equal to)
penguins$species == "Adelie"

# 2. != (Not equal to)
penguins$species != "Chinstrap"

# 3. > (Greater than)
penguins$bill_len > 50

# 4. < (Less than)
penguins$body_mass < 3000

# 5. >= (Greater than or equal to)
penguins$flipper_len >= 200

# 6. <= (Less than or equal to)
penguins$bill_dep <= 18

Explanation:

  1. Returns TRUE for rows where the species is exactly “Adelie”.

  2. Returns TRUE for rows where the species is not “Chinstrap”.

  3. Returns TRUE for rows where bill_len is greater than 50.

  4. Returns TRUE for rows where body_mass is less than 3000.

  5. Returns TRUE for rows where flipper_len is greater than or equal to 200.

  6. Returns TRUE for rows where bill_dep is less than or equal to 18.

Note: If you run this code, all you get is a long run of TRUEs and FALSEs. That is because we have not wrapped the comparison in any function; it is the function that turns the logical vector into something readable. Refer to the Missing Values sub-chapter (2.2) for suitable functions such as which(), sum(), and any().
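Here is a small sketch of how a logical vector can be turned into something readable. The vector bill_len is made up for illustration and stands in for penguins$bill_len.

```r
# Made-up bill lengths, including one missing value.
bill_len <- c(39.1, 51.3, NA, 46.5, 50.2)

is_long <- bill_len > 50          # FALSE TRUE NA FALSE TRUE

sum(is_long, na.rm = TRUE)        # 2: how many are TRUE
which(is_long)                    # 2 5: positions of the TRUE values
any(is_long, na.rm = TRUE)        # TRUE: is there at least one?
mean(is_long, na.rm = TRUE)       # 0.5: share of TRUE among non-missing values
bill_len[which(is_long)]          # 51.3 50.2: the matching values themselves
```

The na.rm = TRUE argument matters here: without it, sum() and mean() return NA as soon as one comparison is NA.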

3.2 Logical Operators in R pt. 2

# 1.
penguins[penguins$species == "Adelie" & penguins$flipper_len > 190, ]

# 2.
penguins[penguins$species == "Gentoo" | penguins$body_mass < 3500, ]

# 3.
penguins[!complete.cases(penguins), ]

# 4.
penguins[(penguins$species == "Adelie" & penguins$flipper_len > 190) |
     (penguins$species == "Gentoo" & penguins$body_mass > 4000), ]

# 5.
subset(penguins, species == "Adelie" & flipper_len > 190 & body_mass < 4000)

Explanation:

  1. Show all Adelie penguins with flippers over 190 mm.

  2. Which penguins are either Gentoo or weigh less than 3500g?

  3. Find rows that are incomplete (have missing values).

  4. Adelie with flipper > 190 or Gentoo with body mass > 4000

  5. Adelie with flipper > 190 and body mass < 4000, written with subset(). It looks cleaner than bracket subsetting, especially with many conditions.

Tip: If mixing & and |, always use parentheses ( ), as shown in code number 4 above.
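The reason parentheses matter is operator precedence: R evaluates & before |. A minimal sketch with three made-up logical values shows how the grouping changes the answer.

```r
x <- TRUE
y <- FALSE
z <- FALSE

x | y & z     # TRUE:  R reads this as x | (y & z)
(x | y) & z   # FALSE: the parentheses force a different grouping
```

Writing the parentheses explicitly costs nothing and makes the intended grouping obvious to the next reader.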

3.3 Logical Operators in R pt. 3: Using dplyr::filter()

# 1.
penguins %>%
  filter(
    species == "Adelie",
    flipper_len > 190,
    body_mass < 4000
  )

# 2.
penguins %>%
  filter(
    (species == "Adelie" & flipper_len > 190) |
    (species == "Gentoo" & body_mass > 4000)
  )

# 3.
penguins %>%
  filter(complete.cases(.))

Explanation:

  1. Find penguins that meet all of the following:

  • Species is “Adelie”

  • Flipper length is greater than 190

  • Body mass is less than 4000g

Note: Each condition is separated by a comma, which is equivalent to using & between them.

  2. Adelie with flipper > 190 OR Gentoo with body mass > 4000.

  3. Extra: Keep only complete rows, i.e. remove rows with missing values.
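To convince yourself that commas and & behave the same inside filter(), you can compare the two results directly. This sketch assumes the dplyr package is installed (it skips the comparison otherwise) and uses mtcars so that it runs on any R version; the variable names are my own.

```r
# Compare comma-separated conditions with an explicit & inside dplyr::filter().
same <- NA  # stays NA if dplyr is not installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  with_comma <- dplyr::filter(mtcars, cyl == 4, mpg > 30)
  with_and   <- dplyr::filter(mtcars, cyl == 4 & mpg > 30)
  same <- identical(with_comma, with_and)
}
same
```

There is no comma equivalent for |, so OR conditions always have to be written out inside a single argument.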

Which to Use?

  • Use grepl() for simple, quick checks in base R.

  • Use str_detect() if you work with dplyr, prefer cleaner syntax, or use it in data pipelines.

4. Plot Operation pt. 1

Now, let’s get creative! With plotting, data can speak beautifully.

R offers a wide variety of plotting types, thanks to its base plotting system, the grid graphics system, and powerful libraries like ggplot2, lattice, and more.

Here is an example of plotting using the ggplot() function from the ggplot2 package. The resulting plot shows a scatterplot with points overlaid by a smooth trend line. It also includes a legend displaying each car class, with each class represented by a distinct colour.

mpg %>% 
  ggplot(mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(colour = class)) +
  geom_smooth(se = FALSE) +
  labs(title = "Fuel efficiency generally decreases with engine size",
       x = "Displacement",
       y  = "Miles per gallon on highway",
       color = "Class: ",
       caption = "Source: Fuel economy data from 1999-2008")

4.1 Basic Plotting

In R’s base package, there is a function called plot(). This function requires some arguments, such as x and y. These are not simply axis labels, but the actual inputs or data you ‘feed’ into the function. For example:

# Categorical variable
plot(penguins$species)

Can you interpret the plot above? Since the chosen variable is the species column—a categorical variable—the function automatically selects the most appropriate plot type. The x-axis displays the different species, while the y-axis shows the count of penguins within each species.

Earlier, I mentioned that plot() requires arguments like x and y. But in the code above, only one argument (x) is provided. So how is that possible? It’s because the plotting behaviour depends on the type of variable supplied. The plot() function is quite flexible and can generate a suitable plot even with just a single input. In fact, it will attempt to plot almost anything you give it; that’s just the nature of this function.
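One way to see this behaviour is to check the class of the input. In the sketch below, the factor island is made up for illustration; when plot() receives a single factor, it draws a bar chart of the counts, which is the same picture you get from barplot(table(...)).

```r
# A made-up categorical variable.
island <- factor(c("Dream", "Biscoe", "Dream", "Torgersen", "Biscoe", "Biscoe"))

class(island)             # "factor"
counts <- table(island)   # Biscoe 3, Dream 2, Torgersen 1
counts

plot(island)              # a single factor is drawn as a bar chart of counts
barplot(counts)           # the same picture, built explicitly from the counts
```

A numeric vector passed on its own is handled differently: the values are plotted against their position (index).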

# Quantitative variable
plot(penguins$body_mass, penguins$flipper_len)

The plot above shows a clear positive correlation. The data points follow a roughly linear pattern, which makes sense in this context: the heavier the penguin, the longer its flippers, and vice versa.
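If you want a number to go with the visual impression, cor() gives the correlation coefficient. This is only a sketch and assumes R 4.5.0 or later, where penguins is part of the datasets package; use = "complete.obs" tells cor() to skip rows with missing values.

```r
# Assumes R 4.5.0 or later, where penguins ships with the datasets package.
r <- cor(penguins$body_mass, penguins$flipper_len, use = "complete.obs")
r   # values close to +1 indicate a strong positive linear relationship

# A simple linear fit of flipper length on body mass;
# lm() drops incomplete rows by default.
fit <- lm(flipper_len ~ body_mass, data = penguins)
coef(fit)
```

Correlation only describes how closely the two variables move together; it says nothing about one causing the other.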

Also, note that the plot style differs from previous examples. The plot() function automatically chooses the most appropriate plot type based on the data provided.

# Quantitative variable
plot(penguins$flipper_len)

Can you interpret the plot above? I used the penguins’ flipper length as the main variable to be plotted. Here’s how to read it: the x-axis represents the row indices from the dataset, ranging from the first row to the last. The penguins dataset contains 344 rows, so why does the x-axis extend to 350? That’s because the axis scale is rounded for a cleaner presentation.

If you look closely, there is a penguin with a flipper length of exactly 230 mm positioned just after index 150. But what is the exact row number for that penguin? We can find out using the subset() function:

subset(x = penguins[150:155, ],      # Data
       subset = flipper_len == 230)  # Condition
##     species island bill_len bill_dep flipper_len body_mass  sex year
## 154  Gentoo Biscoe       50     16.3         230      5700 male 2007

There you go. We found it! :)
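The subset() call searches only rows 150 to 155, so it depends on reading the right range off the plot. As an alternative sketch, which() returns every matching position without guessing a range. The vector flipper_len below is made up and stands in for penguins$flipper_len.

```r
# Made-up flipper lengths, including one missing value.
flipper_len <- c(181L, 230L, NA, 215L, 230L)

which(flipper_len == 230)   # 2 5: every position where the value is exactly 230
which.max(flipper_len)      # 2: first position of the largest value (NA is ignored)
```

On the real data, the equivalent call would be which(penguins$flipper_len == 230).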

Bonus. Here’s a more aesthetically pleasing version of the same plot.

It can be achieved with this code:

flip.len.by.species <- 
ggplot(penguins, aes(x = 1:nrow(penguins), y = flipper_len,
                     fill = species, shape = species)) +
  geom_point(size = 3, stroke = 0.8) +
  scale_fill_manual(values = c('Adelie' = 'steelblue',
                              'Chinstrap' = 'darkorange',
                              'Gentoo' = 'forestgreen')) +
  scale_shape_manual(values = c(21, 22, 24)) +
  scale_x_continuous(breaks = seq(from = 0, to = 350, by = 50)) +
  labs(title = 'Penguin Flipper Length by Species',
       x = 'Index',
       y = 'Flipper Length (mm)',
       caption = 'Source: Palmer Station Antarctica LTER and K. Gorman (2020)',
       fill = 'Species:',
       shape = 'Species:') +
  theme_bw()

print(flip.len.by.species)

Explanation of each code:

  • flip.len.by.species <- : Assigns (saves) the plot to a variable so it can be printed or saved later.

  • ggplot() : Used to initialise the plot by specifying the data before adding layers for plotting or styling.

  • aes() : The aesthetic mapping function. Many aesthetics can be mapped here, but the most essential are x and/or y, which define the axes.

    • x = 1:nrow(penguins) : Since geom_point() requires both an x and a y value, but we only have one actual variable to plot (flipper_len), we “trick” the function into creating a second axis. The nrow() function returns the number of rows in the dataset. So 1:nrow(penguins) generates a sequence from 1 to 344, representing each row, which becomes the x-axis values.

    • y = flipper_len : This sets flipper_len as the variable on the y-axis.

    • fill = species : Specifies how each point should be filled—based on the species variable. Used together with scale_fill_manual() to customise colours.

    • shape = species : Assigns different shapes to each species. Since species is a categorical variable, ggplot() automatically assigns distinct shapes to its levels.

  • geom_point() : This is the plotting function that draws the points. This plot style is commonly known as a scatterplot.

    • size = 3 : Controls the size of the points. The higher the number, the larger the points.

    • stroke = 0.8 : Sets the thickness of the borders around the points. A higher number results in a thicker border.

  • scale_fill_manual() : Manually assigns fill colours to different levels of a variable. Important: You must map fill inside aes(); otherwise, ggplot2 does not know which variable you are trying to apply the colours to.

    • values = c(...) : The values argument specifies the colours to assign to each level of the variable (e.g. species). You provide a vector of colours, and ggplot matches them to the categories. If the condition matches, the colour is applied accordingly.
  • scale_shape_manual() : Manually overrides the default shapes assigned by ggplot2. Important: You must map shape inside aes(); otherwise, ggplot2 will not know which variable you want to assign shapes to.

    • values = c(21, 22, 24) : Assigns specific point shapes to the levels of the variable mapped to shape. ggplot2 uses shape codes from 0 to 25 to represent different point symbols. Useful reference: ref_1, ref_2, ref_3.
  • scale_x_continuous() : Customises the appearance of a continuous x-axis, including tick marks, labels, axis limits, and transformations. You do not map the variable inside scale_x_continuous()—that’s done inside aes(). This function simply modifies how the mapped x-axis is displayed.

    • breaks = ... : Controls the location of tick marks on the axis. Use this to manually define which values appear as ticks, instead of relying on automatic spacing.

    • seq(from = ..., to = ..., by = ...): Generates a sequence of numbers, often used within the breaks argument for evenly spaced tick marks. For example, seq(0, 350, 50) generates a sequence of numbers starting at 0 and ending at 350, in steps of 50. Without this, ggplot might default to wider intervals (e.g. 0 to 300 in steps of 100).

  • labs(): Sets or customises labels on the plot. Examples:

    • title = ... for the main plot title

    • x = ... for the x-axis label

    • y = ... for the y-axis label

    • caption = ... for source notes or additional info (appears in the bottom-right)

  • fill = 'Species:' & shape = 'Species:' (inside labs()): These set the legend titles for the fill and shape aesthetics. Because both aesthetics are mapped to the same variable (species) and are given the same title, ggplot2 merges them into a single legend titled “Species:”. If you set the title for only one of them (e.g. just fill or just shape), the titles no longer match and you may end up with two separate legends.

  • theme_bw(): Applies a clean, black-and-white theme to the plot, replacing the default grey theme (theme_grey()) used by ggplot(). It produces a more minimal and professional look, suitable for reports or publications.

  • print() : Displays the plot stored in the variable.

4.2 Save Your Plot

The easiest method is to save the plot as an image using ggsave(). It’s simple, works in any R setup, and doesn’t require any extra packages beyond ggplot2.

# Save as a PNG
ggsave(filename = "flip_len_by_species.png", plot = flip.len.by.species, width = 10, height = 5, dpi = 300)

This will save the file flip_len_by_species.png in your working directory. To find out where your working directory is, run getwd().
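ggsave() is meant for ggplot2 plots. For base R plots, a common alternative is to open a graphics device, draw, and close the device again. The sketch below writes to a temporary folder so it does not clutter your working directory; the file name is made up, and it assumes your R build supports PNG output (you can check with capabilities("png")).

```r
# Save a base R plot with a graphics device instead of ggsave().
out_file <- file.path(tempdir(), "cosine.png")   # made-up file name in a temp folder

png(filename = out_file, width = 800, height = 400)  # open a PNG device
plot(cos, 0, 2 * pi, main = "Cosine from 0 to 2*pi") # draw onto the device
dev.off()                                            # close the device to write the file

file.exists(out_file)   # TRUE once the file has been written
```

Forgetting dev.off() is the usual reason a saved image turns out empty or unreadable.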

4.3 Plotting Formulas

  • plot(cos, 0, 2*pi) : The cosine function from 0 to 2 times pi.

  • plot(exp, 1, 5) : The exponential function from 1 to 5.

  • plot(dnorm, -3, +3) : Density of a normal distribution from -3 to +3 on the X axis
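The same idea works with functions you write yourself. In this sketch, parabola is a made-up example function; curve() is a related base R function that takes an expression in x instead of a function name.

```r
# A made-up function of x.
parabola <- function(x) x^2 - 3 * x

plot(parabola, -2, 5)                                  # plot it from -2 to 5
curve(dnorm(x, mean = 0, sd = 1), from = -3, to = 3)   # same idea, written as an expression in x

parabola(c(0, 3))   # 0 0: the two points where the curve crosses the x-axis
```

curve() is handy when you want to pass extra arguments, such as mean and sd here, without defining a separate function first.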

5. Plot Operation pt. 2: Similar but Different

There are many kinds of plot, and the most suitable one depends on your data.

5.1 Bar Chart

Take a look at the plot below:

PLOT 1:

barplot(
  table(penguins$island),
  col = c("salmon", "lightgreen", "skyblue"),
  main = "Penguin Count by Island",
  ylab = "Number of Penguins")

PLOT 2:

penguins %>% 
  ggplot(aes(island, fill = island)) + 
  geom_bar() + 
  theme_bw() +
  theme(legend.position = 'none') +
  labs(title = 'Penguin Count by Island',
       x = '')

The plots above show that there are three islands, with the y-axis indicating the number of penguins from each one. Penguins from Biscoe Island exceed 150 in number, the highest count of them all!

By the way, the two code chunks above are there to show you how different approaches can land you on the same plot. One approach may suit you better than the other.

5.2 Histogram

penguins %>% 
  ggplot(aes(x = flipper_len)) +
  geom_histogram(binwidth = 5,
                 fill = 'maroon',
                 color = 'white') +
  theme_minimal()

You can also do this using a base R function:

hist(penguins$flipper_len, col = 'maroon')

5.3 Bar Chart vs Histogram — Side-by-Side

Important: For this to work, the patchwork package needs to be installed and loaded.

# Load the package
library(patchwork)

# Bar Chart: Count of penguins by species (categorical)
bar_plot <- ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "steelblue", # Color inside the bar
           color = "white") +  # Border color
  labs(title = "Bar Chart: Count by Species",
       x = "Species",
       y = "Count") +
  theme_bw()

# Histogram: Distribution of flipper length (continuous)
hist_plot <- ggplot(penguins, aes(x = flipper_len)) +
  geom_histogram(binwidth = 5, # The width of each bar
                 fill = "tomato",
                 color = "white") +
  labs(title = "Histogram: Flipper Length",
       x = "Flipper Length (mm)",
       y = "Count") +
  theme_bw()

# Combine side-by-side
bar_plot + hist_plot

5.4 Bar Chart vs Histogram — Key Differences

Here’s the quick version:

  • Use a bar chart for categories.

  • Use a histogram for numerical ranges (distributions).

And the more detailed version:

Feature       | Bar Chart                                                      | Histogram
--------------|----------------------------------------------------------------|-------------------------------------------------------
Data type     | Categorical data (e.g. species, country, gender)               | Continuous data (e.g. age, height, flipper length)
X-axis        | Shows categories                                               | Shows numeric bins (ranges)
Y-axis        | Frequency or count (or proportion)                             | Frequency or count (or density)
Width of bars | Arbitrary, carries no meaning (often equal, and can have gaps) | Represents a range of data values (no gaps by default)
Used for      | Comparing values across groups                                 | Showing the distribution of a continuous variable
Bars          | Usually separated                                              | Usually touch (to show continuity)

So:

Histogram bars touch to show that the data is continuous and connected.

Bar chart bars don’t, because they represent separate, unrelated categories.

5.5 Variables That Require Altering

In R (especially with ggplot2 and data manipulation), there are common situations where tweaking or altering a variable ensures the data is treated as intended, typically improving how it’s displayed, analysed, or grouped.

5.5.1 Converting Numeric to Factor for Discrete Categories

Scenario: Using the mtcars dataset, you want to see how a car’s weight affects its fuel efficiency, and whether this relationship differs by the number of cylinders (cyl) in the engine.

Take a look at these plots:

( 
  # First code
  mtcars %>% ggplot(aes(wt, mpg, color = cyl)) + 
  geom_point() + labs(color = 'Cyl') + 
  theme_bw()
) + (
  # Second code
  mtcars %>% ggplot(aes(wt, mpg, color = as.factor(cyl))) + 
  geom_point() + labs(color = 'Cyl') + 
  theme_bw()
)

What differences do you notice in both the plot and the code?

Yes, the legend differs between the two plots. Yes, the code is also slightly different. In the second code, you’ll notice that the cyl variable inside the color argument is first converted to a factor using the as.factor() function.

Why?

You see, the cyl variable (number of cylinders) in the mtcars dataset is numeric by default. If you map a numeric variable to a visual aesthetic like color, ggplot2 will treat it as continuous, meaning it will apply a gradient of colors (e.g., from light to dark).

However, in this case, the cylinder counts (4, 6, 8) are categorical values: they represent groups, not a continuous scale. By converting cyl to a factor, you tell ggplot2 to treat it as discrete. This results in:

  • Distinct colors for each cylinder category (4, 6, 8)
  • A legend that labels these categories clearly
  • Better visual grouping and interpretation

In short: because you want separate colours for the 4-, 6-, and 8-cylinder groups rather than a colour gradient, you convert cyl to a factor so that ggplot2 treats it as a categorical variable.
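You can check how R sees the variable before and after the conversion without drawing anything. A minimal sketch in base R (the name cyl_factor is only for illustration):

```r
# cyl is stored as a number, so ggplot2 would map it to a continuous colour scale
class(mtcars$cyl)
sort(unique(mtcars$cyl))

# as.factor() turns each distinct value into a level, i.e. a discrete category
cyl_factor <- as.factor(mtcars$cyl)
levels(cyl_factor)

# Number of cars in each cylinder group
table(cyl_factor)
```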

5.5.2 Reordering Factor Levels for Better Plot Order

Scenario: You have survey responses stored as a factor, but by default the levels are ordered alphabetically.

responses <- factor(c("Neutral", "Agree", "Disagree", "Agree", "Strongly Agree"),
                    levels = c("Strongly Agree", "Agree", "Neutral", "Disagree"))

ggplot(data.frame(responses), aes(x = responses)) + geom_bar()

Why do this? Because you want the bars to appear in a logical order that matches the Likert scale, rather than in the default alphabetical order.
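To see the effect of the levels argument on its own, compare the level order with and without it. This is a small sketch; the names answers, default_order, and likert_order are made up for illustration.

```r
answers <- c("Neutral", "Agree", "Disagree", "Agree", "Strongly Agree")

# Default: levels are sorted alphabetically
default_order <- factor(answers)
levels(default_order)

# Explicit: levels follow the order you supply
likert_order <- factor(answers,
                       levels = c("Strongly Agree", "Agree", "Neutral", "Disagree"))
levels(likert_order)

# table() follows the level order, and so do the bars in a plot
table(likert_order)
```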

5.5.3 Create a New Grouping Variable

Scenario: Using the mtcars dataset, you want to group cars into “light” and “heavy” based on weight, but no such column exists.

mtcars %>%
  mutate(weight_group = ifelse(wt > 3, "Heavy", "Light")) %>%
  ggplot(aes(x = weight_group, y = mpg)) +
  geom_boxplot()

Why tweak it? You needed a new variable to group cars by weight.

5.5.4 Extracting Parts of a Variable

Scenario: From the airquality dataset, we want to group the data by month and calculate the average temperature for each. But the dataset has Month as a number (5, 6, 7…), and we want to label it with actual month names like “May”, “June”, etc.

airquality %>%
  mutate(month_name = factor(month.abb[Month], levels = month.abb)) %>%  # tweak: numeric month to abbreviated name, kept in calendar order
  group_by(month_name) %>%
  summarise(avg_temp = mean(Temp, 
                            na.rm = TRUE)) %>% # Remove NA's in Temp column
  ggplot(aes(x = month_name, y = avg_temp)) +
  geom_col(fill = "steelblue") +
  labs(title = "Average Temperature by Month",
       x = "Month", y = "Avg Temp (F)")

The expression month.abb[Month] converts the numeric month (5 to 9) into “May”, “Jun”, and so on.
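The lookup itself is ordinary vector indexing, which you can try in isolation. The sketch below also shows one thing to watch: month names held as plain text sort alphabetically, whereas a factor with explicit levels keeps calendar order. The vector months_num is a made-up example.

```r
# month.abb is a built-in character vector of the 12 abbreviated month names
month.abb[5:9]

# Indexing with a numeric vector looks up one name per element
months_num <- c(5, 5, 6, 9)
month.abb[months_num]

# Plain text sorts alphabetically; a factor with explicit levels keeps calendar order
sort(unique(month.abb[5:9]))
levels(factor(month.abb[5:9], levels = month.abb[5:9]))
```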

5.5.5 Bin a Numeric Variable into Categories

Scenario: You want to see how many cars fall into low, medium, or high MPG groups, but there’s no such categorisation in mtcars.

mtcars$mpg_group <- cut(mtcars$mpg,
                        breaks = c(0, 15, 25, Inf),
                        labels = c("Low", "Medium", "High"))

ggplot(mtcars, aes(x = mpg_group)) +
  geom_bar(fill = "darkorange") +
  labs(title = "Car Count by MPG Group",
       x = "MPG Group", y = "Count")

Use cut() to convert a continuous variable (mpg) into meaningful groups.
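It helps to know which group a value lands in when it sits exactly on a break point. A minimal sketch with a handful of made-up MPG values (mpg_values is hypothetical, not taken from mtcars):

```r
# Hypothetical MPG values chosen to sit on and around the break points
mpg_values <- c(10, 15, 15.1, 25, 30)

groups <- cut(mpg_values,
              breaks = c(0, 15, 25, Inf),
              labels = c("Low", "Medium", "High"))

# By default intervals are closed on the right: (0, 15], (15, 25], (25, Inf]
data.frame(mpg_values, groups)
```

So a value of exactly 15 counts as “Low” and exactly 25 as “Medium”. Pass right = FALSE if boundary values should fall into the higher group instead.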

5.5.6 Combine or Simplify Factor Levels

Scenario: In the Titanic dataset, you want to simplify the passenger classes into fewer groups for a clearer survival comparison.

titanic_df <- as.data.frame(Titanic)
titanic_df$Class <- as.character(titanic_df$Class)
titanic_df$Class[titanic_df$Class %in% c("Crew", "3rd")] <- "Lower"
titanic_df$Class <- factor(titanic_df$Class)

ggplot(titanic_df, aes(x = Class, weight = Freq, fill = Survived)) +
  geom_bar(position = "fill") +
  labs(title = "Survival by Class Group", y = "Proportion")

Combine levels (Crew and 3rd) into a new group (Lower) to simplify the analysis.
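If you want the numbers behind the stacked bars, the same recoding can be tabulated in base R. This is a sketch under one assumption: the grouped class is stored in a new column, here called ClassGroup (a name chosen for illustration), so the original Class column stays untouched.

```r
titanic_df <- as.data.frame(Titanic)

# Same recoding as above, stored in a new column so the original Class is kept
titanic_df$ClassGroup <- as.character(titanic_df$Class)
titanic_df$ClassGroup[titanic_df$ClassGroup %in% c("Crew", "3rd")] <- "Lower"

# Sum the passenger counts (Freq) for every ClassGroup / Survived combination
counts <- xtabs(Freq ~ ClassGroup + Survived, data = titanic_df)

# Row-wise proportions: the values that position = "fill" displays
props <- prop.table(counts, margin = 1)
round(props, 2)
```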

6. Plot Operation pt. 3: Variants

R offers a wide variety of plotting types, thanks to its base plotting system, the grid graphics system, and powerful libraries like ggplot2, lattice, and more.

Here’s an overview of some of the plot types you can create in R, each shown with two approaches: base R and ggplot2.

6.1 Scatter Plot

6.1.1 Base R

par(mfrow = c(1, 2))  # 1 row, 2 columns (side-by-side view)

plot(penguins$flipper_len, penguins$body_mass,
     main = "Base R Scatter:\nFlipper vs Body Mass",
     xlab = "Flipper Length",
     ylab = "Body Mass",
     pch = 19, col = "blue")

plot(penguins$bill_len, penguins$bill_dep,
     main = "Base R Scatter:\nBill Length vs Depth",
     xlab = "Bill Length",
     ylab = "Bill Depth",
     pch = 21, bg = "orange")
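Keep in mind that par(mfrow = ...) stays in effect for later plots on the same graphics device. A minimal sketch of saving and restoring the previous settings, using mtcars as a stand-in dataset (old_par is just an illustrative name):

```r
# par() returns the old values of the settings it changes
old_par <- par(mfrow = c(1, 2))  # 1 row, 2 columns

plot(mtcars$wt, mtcars$mpg, pch = 19, col = "blue",
     main = "Weight vs MPG", xlab = "Weight", ylab = "MPG")
plot(mtcars$hp, mtcars$mpg, pch = 21, bg = "orange",
     main = "Horsepower vs MPG", xlab = "Horsepower", ylab = "MPG")

# Restore the previous layout so the next plot fills the whole device again
par(old_par)
```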

6.1.2 ggplot2

(
  penguins %>% ggplot(aes(bill_len, bill_dep)) + geom_point(color = "red") +
  ggtitle("ggplot2 Scatter:\nBill Length vs Depth")
+
  ggplot(penguins, aes(flipper_len, body_mass)) + geom_point() +
  labs(title = "ggplot2 Scatter:\nFlipper vs Body Mass")
)

6.2 Line Plot

6.2.1 Base R

par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)

plot(penguins$flipper_len[1:30], type = "l",  main = "Base R Line Plot 1", ylab = "Flipper Length")

plot(penguins$body_mass[1:30], type = "l", col = "green", main = "Base R Line Plot 2", ylab = "Body Mass")

6.2.2 ggplot2

(
  ggplot(data = penguins[1:30, ], mapping = aes(x = seq_along(flipper_len), y = flipper_len)) + geom_line() + ggtitle("ggplot2 Line Plot 1")
+
  ggplot(data = penguins[1:30, ], aes(x = seq_along(body_mass), y = body_mass)) + geom_line(color = "purple") + labs(title = "ggplot2 Line Plot 2")
)

6.3 Bar Plot

6.3.1 Base R

par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)

barplot(table(penguins$species), main = "Base R Barplot 1")

barplot(table(penguins$island), col = "cyan", main = "Base R Barplot 2")

6.3.2 ggplot2

(
  ggplot(penguins, aes(x = species)) + geom_bar() + labs(title = "ggplot2 Barplot 1")
+
  ggplot(penguins, aes(x = island)) + geom_bar(fill = "steelblue") + labs(title = "ggplot2 Barplot 2")
)

6.4 Histogram

6.4.1 Base R

par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)

hist(penguins$body_mass, main = "Base R Histogram 1", col = "lightblue")

hist(penguins$bill_len, main = "Base R Histogram 2", col = "gray")

6.4.2 ggplot2

(
  ggplot(penguins, aes(body_mass)) + geom_histogram(binwidth = 200, fill = "blue") +
  ggtitle("ggplot2 Histogram 1")
+
  ggplot(penguins, aes(bill_len)) + geom_histogram(binwidth = 2, fill = "darkred") +
  ggtitle("ggplot2 Histogram 2")
)

6.5 Box Plot

6.5.1 Base R

par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)

boxplot(body_mass ~ species, data = penguins, main = "Base R Boxplot 1", col = c("red", "green", "blue"))

boxplot(flipper_len ~ island, data = penguins, main = "Base R Boxplot 2", col = "purple")

6.5.2 ggplot2

(
  ggplot(data = penguins, mapping = aes(x = species, y = body_mass)) +
  geom_boxplot() +  ggtitle("ggplot2 Boxplot 1")
+
  ggplot(penguins, aes(x = island, y = flipper_len)) +
  geom_boxplot(fill = "orange") + ggtitle("ggplot2 Boxplot 2")
)

6.6 Pie Chart (ggplot2 requires workaround)

6.6.1 Base R

par(mfrow = c(1, 2))  # 1 row, 2 columns (side-by-side view)

pie(table(penguins$species), main = "Base R Pie Chart 1", col = rainbow(3))

pie(table(penguins$island), main = "Base R Pie Chart 2", col = c("red", "yellow", "blue"))

6.6.2 ggplot2 (simulate pie using coord_polar())

(
  ggplot(penguins, aes(x = "", fill = species)) + geom_bar(width = 1) +
  coord_polar("y") + labs(title = "ggplot2 Pie Chart 1")
+
  ggplot(penguins, aes(x = "", fill = island)) + geom_bar(width = 1) +
  coord_polar("y") + ggtitle("ggplot2 Pie Chart 2")
)

6.7 Dot Chart / Dotplot

6.7.1 Base R

par(mfrow = c(1, 2))  # 1 row, 2 columns (side-by-side view)

dotchart(penguins$body_mass[1:20],  labels = penguins$species[1:20], main = "Base R Dotchart 1")

dotchart(penguins$flipper_len[1:20], main = "Base R Dotchart 2", color = "darkblue")

6.7.2 ggplot2

(
  ggplot(penguins[1:20, ], aes(x = body_mass, y = species)) +
  geom_point() + ggtitle("ggplot2 Dotplot 1")
+
  ggplot(penguins[1:20, ], aes(x = flipper_len, y = island)) +
  geom_point(color = "darkgreen") + ggtitle("ggplot2 Dotplot 2")
)

6.8 Strip Chart (Jitter)

6.8.1 Base R

par(mfrow = c(1, 2))  # 1 row, 2 columns (side-by-side view)

stripchart(body_mass ~ species, data = penguins, method = "jitter", main = "Base R Stripchart 1")

stripchart(flipper_len ~ island, data = penguins, method = "jitter", col = "darkred", main = "Base R Stripchart 2")

6.8.2 ggplot2

(
  ggplot(penguins, aes(x = species, y = body_mass)) +
  geom_jitter(width = 0.2) + ggtitle("ggplot2 Stripchart 1")
+
  ggplot(penguins, aes(x = island, y = flipper_len)) +
  geom_jitter(width = 0.2, color = "blue") + ggtitle("ggplot2 Stripchart 2")
)

6.9 Density Plot

6.9.1 Base R

par(mfrow = c(1, 2))  # 1 row, 2 columns (side-by-side view)

plot(density(na.omit(penguins$body_mass)), main = "Base R Density Plot 1", col = "blue")

plot(density(na.omit(penguins$bill_len)), main = "Base R Density Plot 2", col = "green")

6.9.2 ggplot2

(
  ggplot(penguins, aes(body_mass)) + geom_density(fill = "skyblue") + 
  ggtitle("ggplot2 Density Plot 1")
+
  ggplot(penguins, aes(bill_len, fill = species)) + geom_density(alpha = 0.5) + 
  ggtitle("ggplot2 Density Plot 2")
)

6.10 Pairs Plot / Correlation Matrix

6.10.1 Base R

pairs(penguins[, c("bill_len", "bill_dep", "flipper_len", "body_mass")], main = "Base R Pairs Plot 1")

pairs(penguins[, 3:6], main = "Base R Pairs Plot 2", col = "blue")
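A pairs plot shows the relationships visually; the matching numeric summary is a correlation matrix from cor(). Below is a minimal sketch on the built-in iris measurements, used here only as a stand-in that is available in every R version (the names measurements and cor_matrix are illustrative).

```r
# Stand-in data: the four numeric columns of iris
measurements <- iris[, 1:4]

# use = "complete.obs" drops rows containing NA before computing correlations
cor_matrix <- cor(measurements, use = "complete.obs")
round(cor_matrix, 2)
```

For the penguins columns, use = "complete.obs" matters because some measurements are missing.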

6.10.2 ggplot2 (via the GGally package)

# Optional: install.packages("GGally")
library(GGally)

ggpairs(penguins[, c("bill_len", "bill_dep", "flipper_len", "body_mass")]) + ggtitle("ggplot2 Pairs Plot 1")

ggpairs(penguins, columns = 3:6, mapping = ggplot2::aes(color = species)) + ggtitle("ggplot2 Pairs Plot 2")

6.11 Heat Map

penguins_scaled <- 
  scale(na.omit(penguins[, c("bill_len", "bill_dep", "flipper_len", "body_mass")]))

heatmap(penguins_scaled, main = "Heatmap of Scaled Penguin Measurements")
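The scale() step is there because the measurements are in very different units (millimetres versus grams), so the largest-valued column would otherwise dominate the colours. A minimal sketch of what scale() does, using three mtcars columns as a stand-in (raw and scaled are illustrative names):

```r
# Stand-in data: three mtcars columns measured in very different units
raw <- as.matrix(mtcars[, c("mpg", "hp", "wt")])

# scale() centres each column to mean 0 and rescales it to standard deviation 1
scaled <- scale(raw)

# Column means are (numerically) zero and standard deviations are one
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

Note that heatmap() additionally standardises each row by default (scale = "row"); pass scale = "none" if you want the colours to reflect the column-scaled values as they are.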