I’ll be using several built-in R datasets for this exercise. They live
in the datasets package, which is bundled with R and attached
automatically when R starts. Running library(datasets) is harmless but
not required, and install.packages("datasets") is unnecessary:
datasets is a base package that comes with R itself rather than a
separate download from the Comprehensive R Archive Network (CRAN).
To explore and learn more about R, you may visit the official website: https://www.r-project.org.
Once the package is loaded, you can view all available datasets, along with their descriptions, by running the following command:
data(package = "datasets")
Note: The output style you see may differ from mine. Although the underlying content is identical, I use custom formatting to enhance readability.
| Dataset | Description |
|---|---|
| AirPassengers | Monthly Airline Passenger Numbers 1949-1960 |
| BJsales | Sales Data with Leading Indicator |
| BJsales.lead (BJsales) | Sales Data with Leading Indicator |
| BOD | Biochemical Oxygen Demand |
| CO2 | Carbon Dioxide Uptake in Grass Plants |
| ChickWeight | Weight versus age of chicks on different diets |
| DNase | Elisa assay of DNase |
| EuStockMarkets | Daily Closing Prices of Major European Stock Indices, 1991-1998 |
| Formaldehyde | Determination of Formaldehyde |
| HairEyeColor | Hair and Eye Color of Statistics Students |
| Harman23.cor | Harman Example 2.3 |
| Harman74.cor | Harman Example 7.4 |
| Indometh | Pharmacokinetics of Indomethacin |
| InsectSprays | Effectiveness of Insect Sprays |
| JohnsonJohnson | Quarterly Earnings per Johnson & Johnson Share |
| LakeHuron | Level of Lake Huron 1875-1972 |
| LifeCycleSavings | Intercountry Life-Cycle Savings Data |
| Loblolly | Growth of Loblolly Pine Trees |
| Nile | Flow of the River Nile |
| Orange | Growth of Orange Trees |
| OrchardSprays | Potency of Orchard Sprays |
| PlantGrowth | Results from an Experiment on Plant Growth |
| Puromycin | Reaction Velocity of an Enzymatic Reaction |
| Seatbelts | Road Casualties in Great Britain 1969-84 |
| Theoph | Pharmacokinetics of Theophylline |
| Titanic | Survival of passengers on the Titanic |
| ToothGrowth | The Effect of Vitamin C on Tooth Growth in Guinea Pigs |
| UCBAdmissions | Student Admissions at UC Berkeley |
| UKDriverDeaths | Road Casualties in Great Britain 1969-84 |
| UKgas | UK Quarterly Gas Consumption |
| USAccDeaths | Accidental Deaths in the US 1973-1978 |
| USArrests | Violent Crime Rates by US State |
| USJudgeRatings | Lawyers’ Ratings of State Judges in the US Superior Court |
| USPersonalExpenditure | Personal Expenditure Data |
| UScitiesD | Distances Between European Cities and Between US Cities |
| VADeaths | Death Rates in Virginia (1940) |
| WWWusage | Internet Usage per Minute |
| WorldPhones | The World’s Telephones |
| ability.cov | Ability and Intelligence Tests |
| airmiles | Passenger Miles on Commercial US Airlines, 1937-1960 |
| airquality | New York Air Quality Measurements |
| anscombe | Anscombe’s Quartet of ‘Identical’ Simple Linear Regressions |
| attenu | The Joyner-Boore Attenuation Data |
| attitude | The Chatterjee-Price Attitude Data |
| austres | Quarterly Time Series of the Number of Australian Residents |
| beaver1 (beavers) | Body Temperature Series of Two Beavers |
| beaver2 (beavers) | Body Temperature Series of Two Beavers |
| cars | Speed and Stopping Distances of Cars |
| chickwts | Chicken Weights by Feed Type |
| co2 | Mauna Loa Atmospheric CO2 Concentration |
| crimtab | Student’s 3000 Criminals Data |
| discoveries | Yearly Numbers of Important Discoveries |
| esoph | Smoking, Alcohol and (O)esophageal Cancer |
| euro | Conversion Rates of Euro Currencies |
| euro.cross (euro) | Conversion Rates of Euro Currencies |
| eurodist | Distances Between European Cities and Between US Cities |
| faithful | Old Faithful Geyser Data |
| fdeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
| freeny | Freeny’s Revenue Data |
| freeny.x (freeny) | Freeny’s Revenue Data |
| freeny.y (freeny) | Freeny’s Revenue Data |
| gait | Hip and Knee Angle while Walking |
| infert | Infertility after Spontaneous and Induced Abortion |
| iris | Edgar Anderson’s Iris Data |
| iris3 | Edgar Anderson’s Iris Data |
| islands | Areas of the World’s Major Landmasses |
| ldeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
| lh | Luteinizing Hormone in Blood Samples |
| longley | Longley’s Economic Regression Data |
| lynx | Annual Canadian Lynx trappings 1821-1934 |
| mdeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
| morley | Michelson Speed of Light Data |
| mtcars | Motor Trend Car Road Tests |
| nhtemp | Average Yearly Temperatures in New Haven |
| nottem | Average Monthly Temperatures at Nottingham, 1920-1939 |
| npk | Classical N, P, K Factorial Experiment |
| occupationalStatus | Occupational Status of Fathers and their Sons |
| penguins | Measurements of Penguins near Palmer Station, Antarctica |
| penguins_raw (penguins) | Measurements of Penguins near Palmer Station, Antarctica |
| precip | Annual Precipitation in Selected US Cities |
| presidents | Quarterly Approval Ratings of US Presidents |
| pressure | Vapor Pressure of Mercury as a Function of Temperature |
| quakes | Locations of Earthquakes off Fiji |
| randu | Random Numbers from Congruential Generator RANDU |
| rivers | Lengths of Major North American Rivers |
| rock | Measurements on Petroleum Rock Samples |
| sleep | Student’s Sleep Data |
| stack.loss (stackloss) | Brownlee’s Stack Loss Plant Data |
| stack.x (stackloss) | Brownlee’s Stack Loss Plant Data |
| stackloss | Brownlee’s Stack Loss Plant Data |
| state.abb (state) | US State Facts and Figures |
| state.area (state) | US State Facts and Figures |
| state.center (state) | US State Facts and Figures |
| state.division (state) | US State Facts and Figures |
| state.name (state) | US State Facts and Figures |
| state.region (state) | US State Facts and Figures |
| state.x77 (state) | US State Facts and Figures |
| sunspot.m2014 (sunspot.month) | Monthly Sunspot Data, from 1749 to “Present” |
| sunspot.month | Monthly Sunspot Data, from 1749 to “Present” |
| sunspot.year | Yearly Sunspot Data, 1700-1988 |
| sunspots | Monthly Sunspot Numbers, 1749-1983 |
| swiss | Swiss Fertility and Socioeconomic Indicators (1888) Data |
| treering | Yearly Tree-Ring Data, -6000-1979 |
| trees | Diameter, Height and Volume for Black Cherry Trees |
| uspop | Populations Recorded by the US Census |
| volcano | Topographic Information on Auckland’s Maunga Whau Volcano |
| warpbreaks | The Number of Breaks in Yarn during Weaving |
| women | Average Heights and Weights for American Women |
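If you would rather work with this listing programmatically than read it off the console, the object returned by data() can be turned into a data frame. This is only a minimal sketch: catalog is an illustrative name of my own, and the rows you get depend on your R version (penguins, for instance, needs R 4.5.0 or later).

```r
# data() returns an object whose `results` element is a character matrix
# with the columns "Package", "LibPath", "Item" and "Title".
info <- data(package = "datasets")$results

# Keep only the dataset name and its description.
catalog <- as.data.frame(info[, c("Item", "Title")])

head(catalog)   # first few datasets
nrow(catalog)   # how many entries your installation lists
```

From here, something like catalog[grepl("Temperature", catalog$Title), ] narrows the list down by keyword.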
As mentioned earlier, the penguins dataset will be used for this
exercise. To view a glimpse of the dataset, run the function
glimpse(penguins). This provides a transposed
overview of the data frame—a compact display of its structure.
Before using the glimpse() function, it is important to
install and load the dplyr package.
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
## $ bill_len <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
## $ bill_dep <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
## $ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
## $ body_mass <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
## $ sex <fct> male, female, female, NA, female, male, female, male, NA, …
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Alternatively, use the str() function, which compactly displays the
internal structure of an R object.
str(penguins)
## 'data.frame': 344 obs. of 8 variables:
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_len : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_dep : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_len: int 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Another function we could use is class(), which reports the class of
the dataset.
class(penguins)
## [1] "data.frame"
To confirm once more, use is.data.frame(penguins). This
returns a logical value.
is.data.frame(penguins)
## [1] TRUE
I will now carry out a few simple tasks that you might perform when presented with a dataset, in order to get a sense of the data and determine the most appropriate way to work with it. This step is important for several reasons:
Quick analysis – With just a few straightforward functions or scripts, you can gain preliminary insights into the dataset. This allows you to understand what actions might be needed before committing to any deeper analysis.
Preparation required – Different data types or
classes may require different functions or handling. Basically,
different strokes for different folks. For instance, using
glimpse() can help identify various data types that may
each need a tailored strategy.
Early warning – Some data may contain missing or
problematic values (such as NA, NULL,
NaN, etc.). Recognising these early helps avoid issues
later on. If you spot missing values, you might choose to remove or
replace them. However, just because a value is missing doesn’t mean it
should be removed—it could be intentional or originate from the data
source.
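The "remove or replace" choice mentioned in the last point can be sketched in base R as follows. The toy data frame and its values are made up for illustration, and replacing with the median is just one possible strategy, not a recommendation for every dataset.

```r
# Toy data with two missing body_mass values (made-up numbers).
toy <- data.frame(
  species   = factor(c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo")),
  body_mass = c(3750, NA, 5700, 3650, NA)
)

# Option 1: remove every row that contains a missing value.
removed <- na.omit(toy)

# Option 2: keep all rows and replace missing body_mass values
# with the median of the observed ones.
replaced <- toy
med <- median(toy$body_mass, na.rm = TRUE)
replaced$body_mass[is.na(replaced$body_mass)] <- med

nrow(removed)    # 3 rows survive
nrow(replaced)   # still 5 rows
med              # 3750
```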
Let’s begin the exercise.
With reference to the penguins dataset, the following
features can help you understand the data more quickly and
effectively:
?penguins : The question mark calls R’s help system.
Placing ? before a dataset or function name tells R to
display relevant documentation. This works not only with datasets but
also with most R functions. It’s good practice to use ?
whenever you’re uncertain about something. If you’re using RStudio, the
information will appear in the “Help” tab.
View(penguins) : Opens the dataset in a new tab in a
readable, spreadsheet-like format. Note:
R is case-sensitive, so make sure to write View() with an uppercase
“V”, or R will not recognise it.
head(penguins): Displays the first six rows of the
dataset. You can also specify a number of rows,
e.g. head(penguins, 2) shows the top two rows.
tail(penguins) : Similar to head(), but
returns the last six rows by default. You can specify a different number
in the same way.
count(penguins) : Counts the total number of rows in
the dataset. This function comes from the dplyr package and
returns the number of observations.
filter(penguins, species == "Adelie") : Returns all
rows where the condition is met. In this case, it filters rows where the
species column is exactly equal to “Adelie”. The ==
operator is used to match values exactly.
sort_by(x, y, decreasing) : Takes three arguments.
x is the object to be sorted, typically a vector or data
frame. y is the variable (or variables) to sort by. decreasing
sets the sort order: decreasing (decreasing = TRUE) or increasing
(decreasing = FALSE, the default).
unique(penguins$island) : Returns the unique
(non-duplicated) values from the island column.
summary(penguins) : Provides a statistical summary
of the dataset, showing min., max., mean, and quartiles for numerical
variables, and counts for categorical ones.
summary(penguins$species) : Since species is a
categorical variable, this will return a frequency count of each
category.
summary(penguins$body_mass) : For a quantitative
variable like body_mass, this shows:
Min. : the smallest value
1st Qu. : the first quartile (25% of data points
fall below this)
Median : the middle value
Mean : the average
3rd Qu. : the third quartile (75% of data points
fall below this)
Max. : the largest value
NA's : the number of missing values in the
variable
plot(penguins) : Draws a scatterplot matrix, in which every variable
is plotted against every other variable in a grid of small panels
(factors are shown by their integer codes). It offers a quick visual overview.
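As far as I know, sort_by() was only added in R 4.4.0, so on older installations the long-standing base R route is order() inside square brackets. The sketch below uses a made-up toy data frame rather than penguins so that it runs on any R version.

```r
# Toy data (made-up values) standing in for penguins.
toy <- data.frame(
  species   = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  body_mass = c(3750, 5700, 3650, 3250)
)

# order() returns the row positions that would sort the column.
asc  <- toy[order(toy$body_mass), ]                     # lightest first
desc <- toy[order(toy$body_mass, decreasing = TRUE), ]  # heaviest first

asc$body_mass   # 3250 3650 3750 5700
desc$species    # "Gentoo" "Adelie" "Chinstrap" "Adelie"
```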
Calling—or more technically, selecting—is the process of retrieving data, whether or not a condition is applied. Let’s look at the basics of how to access or “call” data.
Using single square brackets [ ]: This is known as
subsetting and is suitable when the data is in the form
of a table, tibble, or data frame. The general format is [row,
column].
Within the brackets, you can use a colon : to specify a
range, e.g. 1:5 selects from row 1 to row 5. This indexing
method selects data by row and/or column position. If you leave either
side of the comma blank, R will return all rows or columns accordingly.
You can also simply type the dataset name in the console—without square
brackets—to view the entire dataset. Example:
penguins[1:3,]
## species island bill_len bill_dep flipper_len body_mass sex year
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
## 3 Adelie Torgersen 40.3 18.0 195 3250 female 2007
penguins[c(10, 190), 1:3]
## species island bill_len
## 10 Adelie Torgersen 42.0
## 190 Gentoo Biscoe 44.4
penguins[100:105, c(1, 2, 7)]
## species island sex
## 100 Adelie Dream male
## 101 Adelie Biscoe female
## 102 Adelie Biscoe male
## 103 Adelie Biscoe female
## 104 Adelie Biscoe male
## 105 Adelie Biscoe female
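Columns can also be selected by name instead of position, and it is worth seeing how single brackets [ ] differ from double brackets [[ ]]. The following is a small sketch on a made-up data frame; the object names are illustrative only.

```r
# Toy data (made-up rows) with the same kind of columns as penguins.
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island  = c("Torgersen", "Biscoe", "Dream"),
  year    = c(2007L, 2008L, 2009L)
)

by_name   <- toy[1:2, c("species", "year")]  # rows 1-2, two named columns
one_col   <- toy["species"]                  # single brackets: still a data frame
as_vector <- toy[["species"]]                # double brackets: the column itself
same      <- toy$species                     # $ is shorthand for [[ ]]

class(one_col)     # "data.frame"
class(as_vector)   # "character"
```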
You can also insert a function inside those brackets for more specific filtering.
Let’s see how many species there are in the penguins
dataset. Use this code:
unique(penguins$species)
## [1] Adelie Gentoo Chinstrap
## Levels: Adelie Chinstrap Gentoo
Say we want to retrieve all rows where the penguins’ species is “Chinstrap”. There are several ways to do this, each producing slightly different results depending on the method used.
# 1. Using dplyr::filter()
filter(penguins, species == 'Chinstrap')
# 2. Using base R subsetting
penguins[penguins$species == 'Chinstrap', ]
# 3. Using subset()
subset(penguins, species == 'Chinstrap')
# 4. Using which() to get row indices
penguins[which(penguins$species == 'Chinstrap'), ]
# 5. Using table()
table(penguins$species == 'Chinstrap')
# 6. Using sum()
sum(penguins$species == 'Chinstrap')
Explanation:
1. Prints all matching rows and columns, but the row indices are renumbered from 1 rather than showing each row’s position in the original dataset.
2. Same rows as number 1, but the indices are exactly as they appear in the original dataset.
3. Returns the same rows as number 2, with the original indices, using the subset() function instead of square brackets. The condition is written without the penguins$ prefix.
4. Same result as numbers 2 and 3, with an extra step: which() first converts the condition into row positions.
5. Returns a frequency table of the condition: how many rows are FALSE and how many are TRUE.
6. Counts how many are TRUE (i.e., how many penguins are of the Chinstrap species).
Important: Depending on where you run this code, the output may look different. If you run #2 in the console, you get a plain, command-line style view. If you run it in R Markdown, the output is rendered as a nicely structured table.
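The methods above agree in the species example, which appears to be because the species column has no missing values. When the column being tested does contain NA, plain square brackets behave differently from which() and subset(). Here is a small sketch on made-up data showing the difference.

```r
# Toy data (made up) with one missing value in the tested column.
toy <- data.frame(
  id  = 1:4,
  sex = c("male", NA, "female", "male")
)

# The condition evaluates to TRUE, NA, FALSE, TRUE.
bracket     <- toy[toy$sex == "male", ]         # the NA yields an all-NA row
with_which  <- toy[which(toy$sex == "male"), ]  # which() drops the NA
with_subset <- subset(toy, sex == "male")       # subset() drops it too

nrow(bracket)       # 3 (rows 1 and 4, plus one row of NAs)
nrow(with_which)    # 2
nrow(with_subset)   # 2
```

If the extra NA row is unwanted, wrapping the condition in which() is the smallest change to the bracket syntax.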
The next examples are more extensive and use more conditions, this time to look for missing values.
# 1. Checking missing values in all rows
penguins[!complete.cases(penguins), ]
# 2. Checking the indices
which(!complete.cases(penguins))
# 3. Checking missing values in a column
any(is.na(penguins$sex))
# 4. Checking the indices
which(is.na(penguins$sex))
# 5. Checking the count
sum(is.na(penguins$sex))
# 6. Checking which rows have missing values in a column
penguins[is.na(penguins$flipper_len), ]
Explanation:
1. Using base R and subsetting (note the single square brackets [ ]), we nest the complete.cases() function inside. The exclamation mark ! is a logical operator meaning NOT. So !complete.cases() tells R to return all rows in the penguins dataset that are incomplete, i.e., contain at least one missing value. The result is displayed as a table.
2. Similar to point 1, but instead of showing the rows, it returns a vector of row indices where missing values are found.
3. This is a quick check that returns TRUE if any values are missing, and FALSE otherwise.
4. Like point 2, this uses the which() function to find missing values in a column and returns the indices of rows where the values are missing.
5. Counts how many missing values exist in a column.
6. Similar to point 1: it uses subsetting to return all rows that have missing values in a specific column. The result is displayed as a table.
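The column-by-column checks above can be extended to every column at once by combining is.na() with colSums(). This is a sketch on a made-up data frame; na_per_column is an illustrative name.

```r
# Toy data (made up) with missing values in two columns.
toy <- data.frame(
  species     = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  flipper_len = c(181L, NA, 230L, 195L),
  sex         = c("male", NA, NA, "female")
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column.
na_per_column <- colSums(is.na(toy))
na_per_column
# species: 0, flipper_len: 1, sex: 2

# Names of the columns that have at least one missing value.
names(na_per_column)[na_per_column > 0]
```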
In R, to search for text patterns in a column (e.g. values that contain or start with something), we use:
grepl() : from base R.
str_detect() : a tidyverse-friendly way to detect patterns. It works well in pipes
(%>%) and with dplyr::filter().
grepl() (Base R)
Example 1: Find rows where island contains “Dream”.
penguins[grepl('Dream', penguins$island), ]
Example 2: Case-insensitive search
penguins[grepl("dream", penguins$island, ignore.case = TRUE), ]
str_detect() (Tidyverse)
This is more readable, especially with
dplyr::filter():
Example 1: Island contains “Dream”
penguins %>%
filter(str_detect(island, "Dream"))
Example 2: Island starts with “B”
penguins %>%
filter(str_detect(island, "^B"))
Example 3: Species ends with “ie”
penguins %>%
filter(str_detect(species, "ie$"))
| Pattern | Meaning | Example |
|---|---|---|
| "abc" | Contains the literal text "abc" | "abc" |
| "^abc" | Starts with "abc" | "abcde" |
| "abc$" | Ends with "abc" | "123abc" |
| "a.c" | "a" followed by any one character, then "c" | "abc", "a-c" |
| "a.*c" | "a" followed by anything, then "c" | "abc", "axyzc" |
| "a\|b" | Match "a" OR "b" | "a", "b" |
| "[aeiou]" | Contains any vowel | "apple", "unit" |
| "[^aeiou]" | Contains a character that is NOT a vowel | "b", "c" |
| "\\bword\\b" | Match exact word | "Dream" but not "Daydream" |
| regex("text", ignore_case=TRUE) | Case-insensitive | "text", "Text", "TEXT", etc. are all included |
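A quick way to get a feel for these patterns is to try them on a small character vector with grepl(), which needs no extra packages. The vector below is made up for illustration; note that grepl() takes ignore.case = TRUE rather than the regex() helper shown in the table.

```r
# Made-up test strings.
words <- c("abc", "abcde", "123abc", "a-c", "Dream", "DreamIsland")

starts_abc <- grepl("^abc", words)           # starts with "abc"
ends_abc   <- grepl("abc$", words)           # ends with "abc"
any_middle <- grepl("a.c", words)            # "a", any one character, "c"
whole_word <- grepl("\\bDream\\b", words)    # "Dream" as a complete word
no_case    <- grepl("dream", words, ignore.case = TRUE)

words[starts_abc]   # "abc" "abcde"
words[whole_word]   # "Dream"
words[no_case]      # "Dream" "DreamIsland"
```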
Example: Island does not contain “Dream”
penguins %>%
filter(!str_detect(island, "Dream"))
Example: Penguins on islands containing “Dream”, and body mass > 4000g
penguins %>%
filter(str_detect(island, "Dream") & body_mass > 4000)
Use \\b (word boundary) for full-word match (e.g., match
“Dream”, not “Daydream”):
penguins %>%
filter(str_detect(island, "\\bDream\\b"))
# 1. == (Equal to)
penguins$species == "Adelie"
# 2. != (Not equal to)
penguins$species != "Chinstrap"
# 3. > (Greater than)
penguins$bill_len > 50
# 4. < (Less than)
penguins$body_mass < 3000
# 5. >= (Greater than or equal to)
penguins$flipper_len >= 200
# 6. <= (Less than or equal to)
penguins$bill_dep <= 18
Explanation:
1. Returns TRUE for rows where the species is exactly “Adelie”.
2. Returns TRUE for rows where the species is not “Chinstrap”.
3. Returns TRUE for rows where the value in the bill_len column is greater than 50.
4. Returns TRUE for rows where the value in the body_mass column is less than 3000.
5. Returns TRUE for rows where the value in the flipper_len column is greater than or equal to 200.
6. Returns TRUE for rows where the value in the bill_dep column is less than or equal to 18.
Note: If you run this code as it is, all you get is a long run of TRUEs and FALSEs. That is because we have not yet wrapped the comparison in any function or square brackets, and it is that wrapper which makes the result readable. Refer to the Missing values sub-chapter for suitable functions.
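To make that concrete, here is a sketch of a few base R functions that turn a TRUE/FALSE vector into something readable. The body_mass values are made up, and light is just an illustrative name.

```r
# Made-up body masses, including one missing value.
body_mass <- c(3750, 3800, 2900, NA, 2850)

light <- body_mass < 3000        # FALSE FALSE TRUE NA TRUE

sum(light, na.rm = TRUE)         # 2   -> how many are TRUE
which(light)                     # 3 5 -> positions of the TRUEs
mean(light, na.rm = TRUE)        # 0.5 -> share of TRUEs among known values
body_mass[which(light)]          # 2900 2850 -> the matching values
```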
# 1.
penguins[penguins$species == "Adelie" & penguins$flipper_len > 190, ]
# 2.
penguins[penguins$species == "Gentoo" | penguins$body_mass < 3500, ]
# 3.
penguins[!complete.cases(penguins), ]
# 4.
penguins[(penguins$species == "Adelie" & penguins$flipper_len > 190) |
(penguins$species == "Gentoo" & penguins$body_mass > 4000), ]
# 5.
subset(penguins, species == "Adelie" & flipper_len > 190 & body_mass < 4000)
Explanation:
1. Show all Adelie penguins with flippers over 190 mm.
2. Which penguins are either Gentoo or weigh less than 3500 g?
3. Find rows that are incomplete (have missing values).
4. Adelie with flipper > 190, or Gentoo with body mass > 4000.
5. Adelie with flipper > 190 and body mass < 4000, written with subset(). The logic works the same way as in the earlier points, but the code looks cleaner, especially with many conditions.
Tips: If mixing & and |, always use parentheses ( ), as
shown in code number 4 above.
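The reason for the parentheses is operator precedence: in R, & is evaluated before |. A tiny sketch with made-up logical values shows how the same three conditions give different answers depending on grouping.

```r
# Made-up conditions for a single penguin.
is_adelie <- TRUE
is_gentoo <- FALSE
is_heavy  <- FALSE

# Without parentheses, & binds first: is_adelie | (is_gentoo & is_heavy)
no_parens <- is_adelie | is_gentoo & is_heavy    # TRUE

# With parentheses, the grouping is explicit and the result changes.
grouped   <- (is_adelie | is_gentoo) & is_heavy  # FALSE
```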
# 1.
penguins %>%
filter(
species == "Adelie",
flipper_len > 190,
body_mass < 4000
)
# 2.
penguins %>%
filter(
(species == "Adelie" & flipper_len > 190) |
(species == "Gentoo" & body_mass > 4000)
)
# 3.
penguins %>%
filter(complete.cases(.))
Explanation:
Species is “Adelie”
Flipper length is greater than 190
Body mass is less than 4000g
Note: Each condition is separated by a comma —
equivalent to using & between them.
Which to Use?
Use grepl() for simple, quick checks in base
R.
Use str_detect() if you work with
dplyr, prefer cleaner syntax, or use it in data
pipelines.
Now, let’s get creative! With plotting, data can speak beautifully.
R offers a wide variety of plotting types, thanks to its base
plotting system, the grid graphics system, and powerful libraries like
ggplot2, lattice, and more.
Here is an example of plotting using the ggplot()
function from the ggplot2 package. The resulting plot shows
a scatterplot with points overlaid by a smooth trend line. It also
includes a legend displaying each car class, with each class represented
by a distinct colour.
mpg %>%
ggplot(mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(colour = class)) +
geom_smooth(se = FALSE) +
labs(title = "Fuel efficiency generally decreases with engine size",
x = "Displacement",
y = "Miles per gallon on highway",
color = "Class: ",
caption = "Source: Fuel economy data from 1999-2008")
In R’s base package, there is a function called
plot(). This function requires some arguments, such as
x and y. These are not simply axis labels, but
the actual inputs or data you ‘feed’ into the function. For example:
# Categorical variable
plot(penguins$species)
Can you interpret the plot above? Since the chosen variable is the species column—a categorical variable—the function automatically selects the most appropriate plot type. The x-axis displays the different species, while the y-axis shows the count of penguins within each species.
Earlier, I mentioned that plot() requires arguments like
x and y. But in the code above, only one
argument—x—is provided. So how is that possible? It’s because the
plotting behaviour depends on the type of variable supplied. The
plot() function is quite flexible and can generate a
suitable plot even with just a single input. In fact, it will attempt to
plot anything, even if the dataset is empty, that’s just the nature of
this function.
# Quantitative variable
plot(penguins$body_mass, penguins$flipper_len)
The plot above shows a clear correlation. There is a consistent linear pattern in the data points, which makes sense in this context: the longer the flipper, the greater the body mass, and vice versa.
Also, note that the plot style differs from previous examples. The
plot() function automatically chooses the most appropriate
plot type based on the data provided.
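A visual impression of correlation can be backed up with a number using cor(). The sketch below uses a handful of made-up values rather than the real dataset; with penguins loaded you would pass penguins$body_mass and penguins$flipper_len instead. The use = "complete.obs" argument tells cor() to skip pairs that contain NA.

```r
# Made-up measurements, including one incomplete row.
toy <- data.frame(
  flipper_len = c(181, 186, 195, NA, 210, 230),
  body_mass   = c(3750, 3800, 3250, NA, 4500, 5700)
)

# Pearson correlation, ignoring rows with missing values.
r <- cor(toy$body_mass, toy$flipper_len, use = "complete.obs")
r   # a value between -1 and 1; closer to 1 means a stronger positive link
```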
# Quantitative variable
plot(penguins$flipper_len)
Can you interpret the plot above? I used the penguins’ flipper length as the main variable to be plotted. Here’s how to read it: the x-axis represents the row indices from the dataset, ranging from the first row to the last. The penguins dataset contains 344 rows, so why does the x-axis extend to 350? That’s because the axis scale is rounded for a cleaner presentation.
If you look closely, there is one penguin with a flipper length of
exactly 230 mm, positioned just after index 150, so, somewhere around
row 150. But what is the exact row number for that penguin? We can find
out using the subset() function:
subset(x = penguins[150:155, ], # Data
subset = flipper_len == 230) # Condition
## species island bill_len bill_dep flipper_len body_mass sex year
## 154 Gentoo Biscoe 50 16.3 230 5700 male 2007
There you go. We found it! :)
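An alternative that does not require guessing a row range first is which(), which reports the positions where a condition holds. This is a sketch on a short made-up vector; with the real data you would use penguins$flipper_len.

```r
# Made-up flipper lengths, including a missing value.
flipper_len <- c(181L, 186L, NA, 230L, 217L)

which(flipper_len == 230)   # 4 -> position of the value 230
which.max(flipper_len)      # 4 -> position of the largest value (NA ignored)
```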
Bonus. Here is a more aesthetically pleasing version of the plot above.
It can be produced with this code:
flip.len.by.species <-
ggplot(penguins, aes(x = 1:nrow(penguins), y = flipper_len,
fill = species, shape = species)) +
geom_point(size = 3, stroke = 0.8) +
scale_fill_manual(values = c('Adelie' = 'steelblue',
'Chinstrap' = 'darkorange',
'Gentoo' = 'forestgreen')) +
scale_shape_manual(values = c(21, 22, 24)) +
scale_x_continuous(breaks = seq(from = 0, to = 350, by = 50)) +
labs(title = 'Penguin Flipper Length by Species',
x = 'Index',
y = 'Flipper Length (mm)',
caption = 'Source: Palmer Station Antarctica LTER and K. Gorman (2020)',
fill = 'Species:',
shape = 'Species:') +
theme_bw()
print(flip.len.by.species)
Explanation of each code:
flip.len.by.species <- : Assigns (saves) the plot
to a variable so that it can be reused later.
ggplot() : Used to initialise the plot by specifying
the data before adding layers for plotting or styling.
aes() : The aesthetic mapping function. Many
aesthetics can be mapped here, but the most essential are x
and/or y, which define the axes.
x = 1:nrow(penguins) : Since
geom_point() requires both an x and a
y value, but we only have one actual variable to plot
(flipper_len), we “trick” the function into creating a
second axis. The nrow() function returns the number of rows
in the dataset. So 1:nrow(penguins) generates a sequence
from 1 to 344, representing each row, which becomes the x-axis
values.
y = flipper_len : This sets flipper_len
as the variable on the y-axis.
fill = species : Specifies how each point should be
filled—based on the species variable. Used together with
scale_fill_manual() to customise colours.
shape = species : Assigns different shapes to each
species. Since species is a categorical variable, ggplot()
automatically assigns distinct shapes to its levels.
geom_point() : This is the plotting function that
draws the points. This plot style is commonly known as a scatterplot.
size = 3 : Controls the size of the points. The
higher the number, the larger the points.
stroke = 0.8 : Sets the thickness of the borders
around the points. A higher number results in a thicker border.
scale_fill_manual() : Manually assigns fill colours
to different levels of a variable. Important: You must
map fill inside aes(); otherwise,
ggplot2 does not know which variable you are trying to
apply the colours to.
values = c(...) : The values argument specifies the
colours to assign to each level of the variable (e.g. species). You
provide a vector of colours, and ggplot matches them to the
categories. With a named vector, as used here, each colour is applied
to the level whose name matches.
scale_shape_manual() : Manually overrides the
default shapes assigned by ggplot2.
Important: You must map shape inside
aes(); otherwise, ggplot2 will not know which
variable you want to assign shapes to.
scale_x_continuous() : Customises the appearance of
a continuous x-axis, including tick marks, labels, axis limits, and
transformations. You do not map the variable inside
scale_x_continuous()—that’s done inside aes().
This function simply modifies how the mapped x-axis is displayed.
breaks = ... : Controls the location of tick marks
on the axis. Use this to manually define which values appear as ticks,
instead of relying on automatic spacing.
seq(from = ..., to = ..., by = ...): Generates a
sequence of numbers, often used within the breaks argument for evenly
spaced tick marks. For example, seq(0, 350, 50) is
generating a sequence of numbers starting at 0, ending at 350, with
steps (interval) of 50 between each number. Without this,
ggplot might default to wider intervals (e.g. 0 to 300 with
100 steps).
labs(): Sets or customises labels on the plot.
Examples:
title = ... for the main plot title
x = ... for the x-axis label
y = ... for the y-axis label
caption = ... for source notes or additional info
(appears in the bottom-right)
fill = 'Species:' & shape = 'Species:' :
These set the titles of the legends on the side of the plot, indicating what
the fill colour and shape represent. If only one of them is given a title
(e.g. just fill or shape), you may end up with
two separate legends, because ggplot2 only merges legends whose
titles and labels are identical. To unify the legends into one, you need to map both
fill and shape to the same variable, with the
same title. This results in a single, shared legend titled
“Species:”.
theme_bw(): Applies a clean, black-and-white theme
to the plot, replacing the default grey theme
(theme_grey()) used by ggplot(). It produces a
more minimal and professional look, suitable for reports or
publications.
print() : To print the result.
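The breaks handed to scale_x_continuous() are an ordinary numeric vector, so seq() can be tried on its own in the console before it goes into a plot. A quick sketch (the variable names are illustrative):

```r
# Evenly spaced tick positions from 0 to 350 in steps of 50.
ticks <- seq(from = 0, to = 350, by = 50)
ticks   # 0 50 100 150 200 250 300 350

# seq() can also be asked for a number of values instead of a step size.
five_points <- seq(from = 0, to = 1, length.out = 5)
five_points   # 0.00 0.25 0.50 0.75 1.00
```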
The easiest method is to save the plot as an image using
ggsave(). It’s simple, works in any R setup, and doesn’t
require any extra packages beyond ggplot2.
# Save as a PNG
ggsave(filename = "flip_len_by_species.png", plot = flip.len.by.species, width = 10, height = 5, dpi = 300)
This will save the file flip_len_by_species.png in your
working directory. To find out where your working directory is, use this:
getwd().
plot(cos, 0, 2*pi) : Cosine on the interval from 0 to 2
times pi.
plot(exp, 1, 5) : The exponential function from 1 to 5.
plot(dnorm, -3, +3) : Density of a normal distribution
from -3 to +3 on the x-axis.
There are many plot styles, and some suit your data better than others.
Take a look at the plot below:
PLOT 1:
barplot(
table(penguins$island),
col = c("salmon", "lightgreen", "skyblue"),
main = "Penguin Count by Island",
ylab = "Number of Penguins")
PLOT 2:
penguins %>%
ggplot(aes(island, fill = island)) +
geom_bar() +
theme_bw() +
theme(legend.position = 'none') +
labs(title = 'Penguin Count by Island',
x = '')
The plots above show that there are three islands, with the y-axis indicating the number of penguins from each one. Penguins from Biscoe Island exceed 150 in number, the highest count of them all!
By the way, the two code snippets above simply show how different approaches can land you on the same plot. Each function has its own advantages.
penguins %>%
ggplot(aes(x = flipper_len)) +
geom_histogram(binwidth = 5,
fill = 'maroon',
color = 'white') +
theme_minimal()
You can also do this using a base R function:
hist(penguins$flipper_len, col = 'maroon')
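If you want the numbers behind a histogram rather than the picture, hist() can be called with plot = FALSE, in which case it returns the bin edges and counts without drawing anything. This is a sketch on made-up flipper lengths, with the breaks chosen by hand for illustration.

```r
# Made-up flipper lengths, including one missing value.
flipper_len <- c(181, 186, 195, NA, 193, 190, 210, 217, 230)

# Compute the bins only; nothing is drawn.
h <- hist(flipper_len, breaks = seq(180, 230, by = 10), plot = FALSE)

h$breaks   # 180 190 200 210 220 230
h$counts   # 3 2 1 1 1 (the NA is dropped)
```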
Important: For this to work, a package called “patchwork” needs to be installed and loaded.
# Load the package
library(patchwork)
# Bar Chart: Count of penguins by species (categorical)
bar_plot <- ggplot(penguins, aes(x = species)) +
geom_bar(fill = "steelblue", # Color inside the bar
color = "white") + # Border color
labs(title = "Bar Chart: Count by Species",
x = "Species",
y = "Count") +
theme_bw()
# Histogram: Distribution of flipper length (continuous)
hist_plot <- ggplot(penguins, aes(x = flipper_len)) +
geom_histogram(binwidth = 5, # The width of each bar
fill = "tomato",
color = "white") +
labs(title = "Histogram: Flipper Length",
x = "Flipper Length (mm)",
y = "Count") +
theme_bw()
# Combine side-by-side
bar_plot + hist_plot
Here’s a quick one:
Use a bar chart for categories.
Use a histogram for numerical ranges (distribution)
More detailed one:
| Feature | Bar Chart | Histogram |
|---|---|---|
| Data type | Categorical data (e.g. species, country, gender) | Continuous data (e.g. age, height, flipper length) |
| X-axis | Shows categories | Shows numeric bins (ranges) |
| Y-axis | Frequency or count (or proportion) | Frequency or count (or density) |
| Width of bars | Arbitrary/random (often equal and can have gaps) | Represents range of data values (no gaps by default) |
| Used for | Comparing values across groups | Showing distribution of a continuous variable |
| Bars | Usually separated | Usually touch (to show continuity) |
So,
Histogram bars touch to represent that the data is continuous and connected.
Bar chart bars don’t: they represent separate, unrelated categories.
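The same distinction shows up in the counting that sits underneath each chart. A bar chart counts categories as they are, while a histogram first cuts a numeric variable into ranges. Here is a sketch in base R with made-up values; the break points are arbitrary choices for illustration.

```r
# Made-up data: one categorical and one continuous variable.
species <- c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo", "Adelie")
flipper <- c(181, 217, 186, 195, 230, 193)

# Bar chart logic: count each category directly.
species_counts <- table(species)
species_counts   # Adelie 3, Chinstrap 1, Gentoo 2

# Histogram logic: cut the numbers into bins, then count per bin.
bins <- cut(flipper, breaks = c(180, 200, 220, 240))
bin_counts <- table(bins)
bin_counts       # (180,200] 4, (200,220] 1, (220,240] 1
```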
There are some common examples in R (especially with ggplot2 and data manipulation), where tweaking or altering variables ensures the data is treated as intended—typically improving how it’s displayed, analysed, or grouped.
Scenario: Using mtcars dataset, you want to see how a
car’s weight affects its fuel efficiency, and whether this relationship
differs by the number of cylinders (cyl) in the engine.
Take a look at these plots:
(
# First code
mtcars %>% ggplot(aes(wt, mpg, color = cyl)) +
geom_point() + labs(color = 'Cyl') +
theme_bw()
) + (
# Second code
mtcars %>% ggplot(aes(wt, mpg, color = as.factor(cyl))) +
geom_point() + labs(color = 'Cyl') +
theme_bw()
)
What differences do you notice in both the plot and the code?
Yes, the legend differs between the two plots. Yes, the code is also
slightly different. In the second code, you’ll notice that the
cyl variable inside the color argument is
first converted to a factor using the as.factor()
function.
Why?
You see, the cyl variable (number of cylinders) in the
mtcars dataset is numeric by default. If you map a numeric
variable to a visual aesthetic like color,
ggplot2 will treat it as continuous, meaning it will apply
a gradient of colors (e.g., from light to dark).
However, in this case, cylinder counts (4, 6, 8) are categorical
values: they represent groups, not a continuous scale. By converting
cyl to a factor, you tell ggplot2 to treat it
as discrete. This results in a legend with one distinct colour per
cylinder count instead of a gradient colour bar.
So, whenever a numeric column really encodes categories like 4, 6, 8
cylinders, convert it to a factor before mapping it to
color.
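You can confirm how R stores the column before plotting. This is a minimal base R check; cyl_factor is a name introduced here only for illustration.

```r
# cyl is stored as a plain number, so ggplot2 would pick a continuous scale
class(mtcars$cyl)

# After conversion, the three cylinder counts become discrete levels
cyl_factor <- as.factor(mtcars$cyl)
class(cyl_factor)
levels(cyl_factor)
```

The levels are what ggplot2 turns into separate legend entries.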
Scenario: You have survey responses to plot as a factor, but by default the levels are ordered alphabetically.
responses <- factor(c("Neutral", "Agree", "Disagree", "Agree", "Strongly Agree"),
levels = c("Strongly Agree", "Agree", "Neutral", "Disagree"))
ggplot(data.frame(responses), aes(x = responses)) + geom_bar()
Why do this? Because you want the bars to follow a logical order, such as a Likert scale, rather than the default alphabetical order.
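To see the effect of the levels argument without drawing the plot, compare the level order with and without it. The variable names below (answers, likert_order) are illustrative only.

```r
answers <- c("Neutral", "Agree", "Disagree", "Agree", "Strongly Agree")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(answers))
print(default_levels)

# Explicit levels: the order you give is the order the bars are drawn in
likert_order <- c("Strongly Agree", "Agree", "Neutral", "Disagree")
ordered_levels <- levels(factor(answers, levels = likert_order))
print(ordered_levels)

# The counts follow the level order too
table(factor(answers, levels = likert_order))
```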
Scenario: Using the mtcars dataset, you want to group cars
into “light” and “heavy” based on weight, but no
such column exists.
mtcars %>%
mutate(weight_group = ifelse(wt > 3, "Heavy", "Light")) %>%
ggplot(aes(x = weight_group, y = mpg)) +
geom_boxplot()
Why tweak it? You need a new variable to group cars by weight. Since wt is recorded in units of 1,000 lbs, wt > 3 marks cars heavier than 3,000 lbs.
Scenario: From the airquality dataset, we want to group the
data by month and calculate the average temperature for each. But the
dataset stores Month as a number (5, 6, 7…), and we want to label it with
actual month names like “May”, “Jun”, etc.
airquality %>%
mutate(month_name = factor(month.abb[Month], levels = month.abb)) %>% # tweak: numeric month to abbreviated name, kept in calendar order
group_by(month_name) %>%
summarise(avg_temp = mean(Temp,
na.rm = TRUE)) %>% # Ignore any NA's in the Temp column
ggplot(aes(x = month_name, y = avg_temp)) +
geom_col(fill = "steelblue") +
labs(title = "Average Temperature by Month",
x = "Month", y = "Avg Temp (F)")
month.abb[Month] converts the numeric month
(5–9) into “May”, “Jun”, etc. Wrapping it in factor() with
levels = month.abb keeps the bars in calendar order; a plain
character column would be sorted alphabetically (Aug, Jul, Jun, ...).
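The lookup itself is easy to inspect on its own. The sketch below also compares how the resulting labels are ordered as plain text versus as a factor with levels = month.abb, which is one way to keep calendar order on an axis; the names alphabetical and calendar are illustrative.

```r
# month.abb is a built-in character vector: "Jan", "Feb", ..., "Dec"
month.abb[c(5, 6, 9)]

# Plain character labels sort alphabetically ...
alphabetical <- sort(unique(month.abb[airquality$Month]))
print(alphabetical)

# ... whereas a factor with levels = month.abb remembers calendar order
month_factor <- factor(month.abb[airquality$Month], levels = month.abb)
calendar <- levels(droplevels(month_factor))
print(calendar)
```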
Scenario: You want to see how many cars fall into low, medium, or
high MPG groups, but there’s no such categorisation in
mtcars.
mtcars$mpg_group <- cut(mtcars$mpg,
breaks = c(0, 15, 25, Inf),
labels = c("Low", "Medium", "High"))
ggplot(mtcars, aes(x = mpg_group)) +
geom_bar(fill = "darkorange") +
labs(title = "Car Count by MPG Group",
x = "MPG Group", y = "Count")
Use cut() to convert a continuous variable
(mpg) into meaningful groups.
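It helps to know which side of each boundary a value falls on. By default cut() builds right-closed intervals (right = TRUE), so with these breaks a car at exactly 15 mpg is "Low" and one at exactly 25 mpg is "Medium". The probe values below are made up purely to test the boundaries.

```r
mpg_breaks <- c(0, 15, 25, Inf)
mpg_labels <- c("Low", "Medium", "High")

# Hypothetical values chosen to sit on and just above the boundaries
probe <- c(15, 15.1, 25, 25.1)
probe_groups <- cut(probe, breaks = mpg_breaks, labels = mpg_labels)
print(data.frame(mpg = probe, group = probe_groups))

# Count the real cars per group
group_counts <- table(cut(mtcars$mpg, breaks = mpg_breaks, labels = mpg_labels))
print(group_counts)
```

If you would rather have 15 count as "Medium", passing right = FALSE to cut() switches to left-closed intervals.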
Scenario: In the Titanic dataset, you want to merge some
passenger classes into fewer groups for a clearer survival comparison.
titanic_df <- as.data.frame(Titanic)
titanic_df$Class <- as.character(titanic_df$Class)
titanic_df$Class[titanic_df$Class %in% c("Crew", "3rd")] <- "Lower"
titanic_df$Class <- factor(titanic_df$Class)
ggplot(titanic_df, aes(x = Class, weight = Freq, fill = Survived)) +
geom_bar(position = "fill") +
labs(title = "Survival by Class Group", y = "Proportion")
Combine levels (Crew and 3rd) into a new
group (Lower) to simplify the analysis.
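The proportions that position = "fill" draws can also be computed directly, which is a handy sanity check on the collapsed groups. This is a base R sketch; ClassGroup, counts and survival_share are names introduced here for illustration.

```r
titanic_df <- as.data.frame(Titanic)

# Collapse "3rd" and "Crew" into one label, leave the other classes unchanged
titanic_df$ClassGroup <- ifelse(titanic_df$Class %in% c("Crew", "3rd"),
                                "Lower", as.character(titanic_df$Class))

# People per group and survival status, weighted by Freq
counts <- xtabs(Freq ~ ClassGroup + Survived, data = titanic_df)
print(counts)

# Row-wise proportions: these are the segment heights of the filled bars
survival_share <- prop.table(counts, margin = 1)
print(round(survival_share, 2))
```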
R offers a wide variety of plotting types, thanks to its base
plotting system, the grid graphics system, and powerful libraries like
ggplot2, lattice, and more.
Here’s an overview of some of the plot types you can create in R,
each shown with two approaches: base R and
ggplot2.
par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)
plot(penguins$flipper_len, penguins$body_mass,
main = "Base R Scatter:\nFlipper vs Body Mass",
xlab = "Flipper Length",
ylab = "Body Mass",
pch = 19, col = "blue")
plot(penguins$bill_len, penguins$bill_dep,
main = "Base R Scatter:\nBill Length vs Depth",
xlab = "Bill Length",
ylab = "Bill Depth",
pch = 21, bg = "orange")
(
penguins %>% ggplot(aes(bill_len, bill_dep)) + geom_point(color = "red") +
ggtitle("ggplot2 Scatter:\nBill Length vs Depth")
+
ggplot(penguins, aes(flipper_len, body_mass)) + geom_point() +
labs(title = "ggplot2 Scatter:\nFlipper vs Body Mass")
)
par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)
plot(penguins$flipper_len[1:30], type = "l", main = "Base R Line Plot 1", ylab = "Flipper Length")
plot(penguins$body_mass[1:30], type = "l", col = "green", main = "Base R Line Plot 2", ylab = "Body Mass")
(
ggplot(data = penguins[1:30, ], mapping = aes(x = seq_along(flipper_len), y = flipper_len)) + geom_line() + ggtitle("ggplot2 Line Plot 1")
+
ggplot(data = penguins[1:30, ], aes(x = seq_along(body_mass), y = body_mass)) + geom_line(color = "purple") + labs(title = "ggplot2 Line Plot 2")
)
par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)
barplot(table(penguins$species), main = "Base R Barplot 1")
barplot(table(penguins$island), col = "cyan", main = "Base R Barplot 2")
(
ggplot(penguins, aes(x = species)) + geom_bar() + labs(title = "ggplot2 Barplot 1")
+
ggplot(penguins, aes(x = island)) + geom_bar(fill = "steelblue") + labs(title = "ggplot2 Barplot 2")
)
par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)
hist(penguins$body_mass, main = "Base R Histogram 1", col = "lightblue")
hist(penguins$bill_len, main = "Base R Histogram 2", col = "gray")
(
ggplot(penguins, aes(body_mass)) + geom_histogram(binwidth = 200, fill = "blue") +
ggtitle("ggplot2 Histogram 1")
+
ggplot(penguins, aes(bill_len)) + geom_histogram(binwidth = 2, fill = "darkred") +
ggtitle("ggplot2 Histogram 2")
)
par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)
boxplot(body_mass ~ species, data = penguins, main = "Base R Boxplot 1", col = c("red", "green", "blue"))
boxplot(flipper_len ~ island, data = penguins, main = "Base R Boxplot 2", col = "purple")
(
ggplot(data = penguins, mapping = aes(x = species, y = body_mass)) +
geom_boxplot() + ggtitle("ggplot2 Boxplot 1")
+
ggplot(penguins, aes(x = island, y = flipper_len)) +
geom_boxplot(fill = "orange") + ggtitle("ggplot2 Boxplot 2")
)
par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)
pie(table(penguins$species), main = "Base R Pie Chart 1", col = rainbow(3))
pie(table(penguins$island), main = "Base R Pie Chart 2", col = c("red", "yellow", "blue"))
# ggplot2 has no pie geom: a stacked bar is bent into a circle with coord_polar()
(
ggplot(penguins, aes(x = "", fill = species)) + geom_bar(width = 1) +
coord_polar("y") + labs(title = "ggplot2 Pie Chart 1")
+
ggplot(penguins, aes(x = "", fill = island)) + geom_bar(width = 1) +
coord_polar("y") + ggtitle("ggplot2 Pie Chart 2")
)
par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)
dotchart(penguins$body_mass[1:20], labels = penguins$species[1:20], main = "Base R Dotchart 1")
dotchart(penguins$flipper_len[1:20], main = "Base R Dotchart 2", color = "darkblue")
(
ggplot(penguins[1:20, ], aes(x = body_mass, y = species)) +
geom_point() + ggtitle("ggplot2 Dotplot 1")
+
ggplot(penguins[1:20, ], aes(x = flipper_len, y = island)) +
geom_point(color = "darkgreen") + ggtitle("ggplot2 Dotplot 2")
)
par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)
stripchart(body_mass ~ species, data = penguins, method = "jitter", main = "Base R Stripchart 1")
stripchart(flipper_len ~ island, data = penguins, method = "jitter", col = "darkred", main = "Base R Stripchart 2")
(
ggplot(penguins, aes(x = species, y = body_mass)) +
geom_jitter(width = 0.2) + ggtitle("ggplot2 Stripchart 1")
+
ggplot(penguins, aes(x = island, y = flipper_len)) +
geom_jitter(width = 0.2, color = "blue") + ggtitle("ggplot2 Stripchart 2")
)
par(mfrow = c(1, 2)) # 1 row, 2 columns (side-by-side view)
plot(density(na.omit(penguins$body_mass)), main = "Base R Density Plot 1", col = "blue")
plot(density(na.omit(penguins$bill_len)), main = "Base R Density Plot 2", col = "green")
(
ggplot(penguins, aes(body_mass)) + geom_density(fill = "skyblue") +
ggtitle("ggplot2 Density Plot 1")
+
ggplot(penguins, aes(bill_len, fill = species)) + geom_density(alpha = 0.5) +
ggtitle("ggplot2 Density Plot 2")
)
pairs(penguins[, c("bill_len", "bill_dep", "flipper_len", "body_mass")], main = "Base R Pairs Plot 1")
pairs(penguins[, 3:6], main = "Base R Pairs Plot 2", col = "blue")
# Optional: install.packages("GGally")
library(GGally)
ggpairs(penguins[, c("bill_len", "bill_dep", "flipper_len", "body_mass")]) + ggtitle("ggplot2 Pairs Plot 1")
# Pass the full data frame and select columns, so species can be mapped by name
ggpairs(penguins, columns = 3:6, mapping = ggplot2::aes(color = species)) + ggtitle("ggplot2 Pairs Plot 2")
penguins_scaled <-
scale(na.omit(penguins[, c("bill_len", "bill_dep", "flipper_len", "body_mass")]))
heatmap(penguins_scaled, main = "Heatmap of Scaled Penguin Measurements")