Mean in R
1.Introduction
Finding the center of a dataset is one of the most common ways to
summarize statistical findings. People often communicate the center of
data with phrases like "on average," "usually," or "often."
In this lesson, you will learn how to calculate the mean of a
dataset, a common measure of a dataset’s center. We will use the mean to
help us answer the question,
When are adults their most creative and productive?
You could define “creative” and “productive” in a lot of ways, making
this question impossible to fully answer by the end of this lesson.
However, you will form an informed opinion on the question using data of
the one hundred greatest novels of all time.
We collected the dataset from a survey administered by the French
literary magazine, Le Monde. From the dataset, you will calculate the
average age of the authors when their books were published.
Instructions
The histogram below displays the ages of 100 authors from the
Le Monde survey. Where do you think the data is centered?
# load libraries
library(readr)
library(dplyr)
library(ggplot2)
# load data frame
greatest_books <- read_csv('top-hundred-books.csv')
#plot data
hist <- qplot(greatest_books$Ages,
geom='histogram',
binwidth = 3,
main = 'Age of Top 100 Novel Authors at Publication',
xlab = 'Publication Age',
ylab = 'Count',
fill=I("blue"),
col=I("red"),
alpha=I(.2))
hist

2.Calculating Mean
The mean, often referred to as the average, is a way to measure the
center of a dataset.
The average of a set is calculated using a two-step process:
1.Add all of the observations in your dataset.
2.Divide the total sum from step one by the number of points in your
dataset.
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Example
Imagine that we wanted to calculate the average of a dataset with the
following four observations:
data <- c(4, 6, 2, 8)
Step One: Calculate the total
4+6+2+8=20
Step Two: Divide by the number of observations
The total is equal to 20, and the number of observations is equal to
4.
20 / 4 = 5
The average of this dataset is equal to 5.
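The two steps above can also be written out in R. This is a minimal sketch using only base R; the names n and manual_mean are chosen here for illustration and are not part of the lesson's exercises.

```r
# Manual mean of the four observations from the example above
data <- c(4, 6, 2, 8)

# Step one: add all of the observations
total <- sum(data)

# Step two: divide the total by the number of observations
n <- length(data)
manual_mean <- total / n

print(total)        # 20
print(manual_mean)  # 5
```

Using length(data) instead of typing 4 keeps the calculation correct if observations are added or removed later.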
Instructions
1.In this exercise, you will use R to find the average age of the
first four authors in Le Monde’s top 100 books.
29,49,42,43
Add the values together, and set total equal to the answer. Print
total.
total <- sum(29, 49, 42, 43)
print(total)
[1] 163
2.Divide total by the number of values in the dataset, and set
mean_value to the answer.
Print mean_value. Keep that number in your head as you progress
through the lesson.
mean_value <- total / 4
print(mean_value)
[1] 40.75
3.Mean in R
While you’ve shown that you can calculate the average yourself, it
becomes time-consuming as the size of your dataset increases — imagine
adding all of the numbers in a dataset with 10,000 observations.
The R mean() function can do the work of adding and dividing for you.
In the example below, we use mean() to calculate the average of a
dataset with ten values:
example_data <- c(24, 16, 30, 10, 12, 28, 38, 2, 4, 36)
example_average <- mean(example_data)
print(example_average)
[1] 20
The code above calculates the average of example_data and saves the
value to example_average. The resulting average of this vector is 20.
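As a quick check, the sketch below compares mean() with the manual two-step calculation on the same ten values. It also shows the na.rm argument of mean(), which matters when a vector contains missing values. The vector with_missing is a made-up variant for illustration; the lesson's data is not assumed to contain missing values.

```r
example_data <- c(24, 16, 30, 10, 12, 28, 38, 2, 4, 36)

# mean() should agree with the manual two-step calculation
manual_average <- sum(example_data) / length(example_data)
builtin_average <- mean(example_data)
print(isTRUE(all.equal(manual_average, builtin_average)))  # TRUE

# Hypothetical variant with one missing value appended
with_missing <- c(example_data, NA)
print(mean(with_missing))                # NA
print(mean(with_missing, na.rm = TRUE))  # 20
```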
Instructions
1.Use R to calculate the average value of the author_ages vector. Save
the result to average_age and print it.
Does the average age of the authors surprise you? If so, how? Is it
older, or younger than you expected?
# Set author ages to a vector
author_ages <- greatest_books$Ages
# Use R to calculate mean
average_age <- mean(author_ages)
average_age
[1] 41.83
4.Review and Discussion
In this lesson, you learned how to calculate the average of a dataset
using the formula:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

and the R function:
mean(my_data)
Circling back to the original question, do you feel like the average
of our dataset, 41.83, provides us enough information to claim when
someone is their most creative and productive?
Take a look at the histogram and mean (in red) below as you
consider this question.
We would say no. We could argue against its use for several reasons;
two are highlighted below:
1.The date of publication is not necessarily an author’s most
creative year. When did they start authoring the book? What factors
impacted their writing during those years?
2.The average age at publication for 100 authors may not
accurately measure peak creativity in other professions. The average age
of painters or sculptors may be very different.
So, what kind of information does the average provide us, and why
would we use the average to describe something when we could display a
histogram?
The most important outcome is that we’re able to use a single number
as a measure of centrality. Although histograms provide more
information, they are not a concise or precise measure of centrality;
readers must interpret them for themselves.
Instructions
Take a look at the histogram. Are you older or younger than the
average age at publication?
This doesn’t tell you much about when someone will have their most
creative year. However, this type of data could be used as an example in
a broader study on aging.
#plot data
hist <- qplot(greatest_books$Ages,
geom='histogram',
binwidth = 3,
main = 'Age of Top 100 Novel Authors at Publication',
xlab = 'Publication Age',
ylab = 'Count',
fill=I("blue"),
col=I("black"),
alpha=I(.2)) +
geom_vline(aes(xintercept=mean(greatest_books$Ages),
color="mean"), linetype="solid",
size=1) +
scale_color_manual(name = "statistics", values = c(mean = "red"))
hist

Median in R
1.Introduction
In this lesson, you will learn how to find the median of a dataset —
a common measure of a dataset’s center. Each of the next three exercises
will cover the following topics:
1.Manually finding the median of a dataset
2.Using R’s median function to find the median of a dataset
3.Interpreting what it means for a dataset to have similar and
different median and mean values
In the lesson, we will use a dataset of the 100 greatest novels,
determined by a French literary magazine, Le Monde. From the dataset,
you will use the median to answer the question:
When are great authors most likely to publish their best work?
If you are not familiar with mean, also known as average, we
recommend that you learn about it in our lesson on average.
Instructions
The histogram below displays the age of authors, at
publication, for the top 100 novels. The red line represents the average
value of this dataset.
You can think of the median as being the observation in your dataset
that falls right in the middle.
Using this informal definition of the median and the graph
below, see if you can determine whether the median of this dataset falls
to the right or the left of the mean. We will show you the correct
answer in the last exercise.
#plot data
hist <- qplot(greatest_books$Ages,
geom='histogram',
binwidth = 3,
main = 'Age of Top 100 Novel Authors at Publication',
xlab = 'Publication Age',
ylab = 'Count',
fill=I("blue"),
col=I("black"),
alpha=I(.2)) +
geom_vline( aes(xintercept=mean(greatest_books$Ages),color="mean"), linetype="solid",size=1) +
scale_color_manual(name = "statistics", values = c(mean = "red"))
hist

2.Median
The formal definition for the median of a dataset is:
The value that, assuming the dataset is ordered from smallest to
largest, falls in the middle. If there are an even number of values in a
dataset, you either report both of the middle two values or their
average.
There are always two steps to finding the median of a dataset:
1.Order the values in the dataset from smallest to largest
2.Identify the number(s) that fall(s) in the middle
Example One: Even Number of Values
Say we have a dataset with the following ten numbers:
24, 16, 30, 10, 12, 28, 38, 2, 4, 36
The first step is to order these numbers from smallest to
largest:
2, 4, 10, 12, [16, 24], 28, 30, 36, 38
Because this dataset has an even number of values, there are two
middle values: 16 and 24. The value 16 has four datapoints to its left,
and 24 has four datapoints to its right.
Although you can report both values as the median, people often
average them. If you averaged 16 and 24, you could report the median as
20.
Example Two: Odd Number of Values
If we added another value (say, 24) to the dataset and sorted it, we
would have:
2, 4, 10, 12, 16, [24], 24, 28, 30, 36, 38
The new median is equal to 24, because there are 5 values to the left
of it, and 5 values to the right of it.
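The two examples above can be reproduced in R. The sketch below uses a hypothetical helper, manual_median(), written for illustration only. It sorts the values and picks the middle one, or averages the two middle ones when the count is even. R's built-in median() is used only as a cross-check.

```r
# Hypothetical helper: median by sorting and picking the middle value(s)
manual_median <- function(x) {
  sorted <- sort(x)
  n <- length(sorted)
  if (n %% 2 == 1) {
    # Odd count: the single middle value
    sorted[(n + 1) / 2]
  } else {
    # Even count: average the two middle values
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

even_data <- c(24, 16, 30, 10, 12, 28, 38, 2, 4, 36)
odd_data <- c(even_data, 24)

print(manual_median(even_data))  # 20
print(manual_median(odd_data))   # 24

# Cross-check against the built-in function
print(manual_median(even_data) == median(even_data))  # TRUE
```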
Instructions
1.In the next two steps, you will manually sort a vector, and then
determine which value in the vector is the median.
In notebook.Rmd, we have a vector with the ages of the first five
authors from Le Monde’s survey:
29,49,42,43,32
Under five_author_ages there is a variable called sorted_author_ages.
Change the 0s in sorted_author_ages to the values in ascending order
from five_author_ages.
# Vector of the first five author ages
five_author_ages <- c(29, 49, 42, 43, 32)
# Store the values sorted in ascending order
sorted_author_ages <- sort(five_author_ages)
sorted_author_ages
[1] 29 32 42 43 49
2.Set median_value equal to the median of the array.
# Save the median value to median_value
median_value <- 42
# Print the sorted array and median value
cat("The sorted array is:", sorted_author_ages, '\n')
The sorted array is: 29 32 42 43 49
cat(paste("The median of the array is:", median_value), "\n")
The median of the array is: 42
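*About cat():
cat() stands for "concatenate and print." It writes strings and
variables out as a single line of text.
Unlike print(), it does not add a line break automatically, so you
should add "\n" yourself.
If you call cat(paste(...)), paste() first builds the complete
sentence, and cat() then prints it.
cat() does not return a usable value, so its output cannot be assigned
back to an R object; it is only for printing text.
*About paste():
In R, paste() is a function that joins multiple strings into a single
string. It is commonly used to build dynamic text and report output.
name <- "Annabel"
age <- 18
paste("My name is", name, "and I am", age, "years old.")
[1] "My name is Annabel and I am 18 years old."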
4.Review and Discussion
In this lesson, you learned how to find the median of a dataset in
two steps:
1.Sort the dataset
2.Identify the one or two numbers that fall in the middle of the
sorted dataset
You also learned how to calculate the median using R:
median(my_data)
Discussion
Take a look at the histogram. It displays the author age distribution
with vertical lines for the mean (red) and median (blue).
Do you feel like the median of our dataset, 40.5, provides us enough
information to claim when authors publish their greatest work?
We argue it does not.
Although the median is a good measure of the dataset’s center, we
cannot make a definitive claim about when authors publish their greatest
work — the youngest author published at 18 and the oldest at 76. It
would be irresponsible to say anything but, “it seems to be possible at
almost any age.”
Notice that the mean and the median are nearly equal. This is not a
surprising result, as both statistics are a measure of the dataset’s
center. However, it’s worth noting that these results will not always be
so close.
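To see how the mean and median can drift apart, consider the small sketch below. The ages are made up for illustration and are not taken from the Le Monde data. Replacing the largest value with a much larger one pulls the mean upward while the median stays put.

```r
# Made-up ages for illustration only, not taken from the Le Monde data
balanced_ages <- c(30, 35, 40, 45, 50)
skewed_ages <- c(30, 35, 40, 45, 90)

print(mean(balanced_ages))    # 40
print(median(balanced_ages))  # 40

print(mean(skewed_ages))      # 48
print(median(skewed_ages))    # 40
```

A large gap between the two statistics is a hint that the data is skewed or contains extreme values.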
In the instructions below, we’ve written a brief explanation that
puts median in the context of our problem.
#plot data
hist <- qplot(greatest_books$Ages,
geom='histogram',
binwidth = 3,
main = 'Age of Top 100 Novel Authors at Publication',
xlab = 'Publication Age',
ylab = 'Count',
fill=I("blue"),
col=I("black"),
alpha=I(.2)) +
geom_vline(aes(xintercept=median(greatest_books$Ages),
color="median"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mean(greatest_books$Ages),
color="mean"), linetype="solid",
size=1) +
scale_color_manual(name = "statistics", values = c(median = "blue", mean = "red"))
hist

Instructions
The median age of authors, when they publish their best work, from Le
Monde’s 100 greatest books is about 41.
While this does not tell us much about which year is an author’s
greatest year, it does indicate that roughly half of the authors from the
survey find their greatest success before the age of 41 and half find
their greatest success after the age of 41.
Mode in R
1.Introduction
In this lesson, you will learn how to find the mode of a dataset.
Each of the next three exercises will cover the following:
1.Manually finding the mode of a dataset
2.Using R’s functions to find the mode
3.Comparing mode to mean and median values
In the lesson, we will use a dataset of the 100 greatest novels,
determined by a French literary magazine, Le Monde. From the dataset,
you will use the mode to answer the question:
What is the most common age for a great author to publish their best
work?
If you are not familiar with mean, also known as average, or median,
we recommend that you learn about it in our lessons on average and
median.
Instructions
The histogram below displays the age of authors, at
publication, for the top 100 novels from Le Monde’s survey. The red line
is the mean age, and the blue line is the median age.
Use the definition of mode below and the histogram to
guess where the mode falls. You will calculate the correct answer in the
last exercise.
The mode is the most common observation in a dataset.
You will not be able to find the exact mode, because the histogram
displays bins with a range of values. However, you can guess a range of
values where you are most likely to see the mode.
#plot data
hist <- qplot(greatest_books$Ages,
geom='histogram',
binwidth = 3,
main = 'Age of Top 100 Novel Authors at Publication',
xlab = 'Publication Age',
ylab = 'Count',
fill=I("blue"),
col=I("black"),
alpha=I(.2)) +
geom_vline(aes(xintercept=median(greatest_books$Ages),
color="median"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mean(greatest_books$Ages),
color="mean"), linetype="solid",
size=1) +
scale_color_manual(name = "statistics", values = c(median = "blue", mean = "red"))
hist

2.Mode
The formal definition for the mode of a dataset is:
The most frequently occurring observation in the dataset. A dataset
can have multiple modes if there is more than one value with the same
maximum frequency.
While you may be able to find the mode of a small dataset by simply
looking through it, if you have trouble, we recommend you follow these
two steps:
1.Find the frequency of every unique number in the dataset
2.Determine which number has the highest frequency
Example
Say we have a dataset with the following ten numbers:
24, 16, 12, 10, 12, 28, 38, 12, 28, 24
Let’s find the frequency of each number:
Value:      24  16  12  10  28  38
Frequency:   2   1   3   1   2   1
From the table, we can see that our mode is 12, the most frequent
number in our dataset.
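The frequency table above can be built in base R with table(). This is a minimal sketch of the two steps; the names frequencies, mode_value, and mode_count are chosen here for illustration.

```r
example_data <- c(24, 16, 12, 10, 12, 28, 38, 12, 28, 24)

# Step one: frequency of every unique number
frequencies <- table(example_data)
print(frequencies)

# Step two: the value with the highest frequency
# names() returns character labels, so convert back to numeric
mode_value <- as.numeric(names(frequencies)[which.max(frequencies)])
mode_count <- max(frequencies)

print(mode_value)  # 12
print(mode_count)  # 3
```

Note that which.max() returns only the first maximum, so this sketch reports a single mode even when several values tie for the highest frequency.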
Instructions
1.Determine the mode of the ages for the first ten authors in the Le
Monde survey:
29,49,42,43,32,38,37,41,27,27
Save the value to mode_age.
mode_age <- 27
mode_age
[1] 27
2.Determine the number of authors who were the age of the mode. Save
the number to mode_count.
mode_count <- 2
mode_count
[1] 2
3.Mode with DescTools
Finding the mode of a dataset becomes increasingly time-consuming as
the size of your dataset increases — imagine finding the mode of a
dataset with 10,000 observations.
The R package DescTools includes a handy Mode() function which can do
the work of finding the mode for us. In the example below, we use Mode()
to calculate the mode of a dataset with ten values:
Example: One Mode
library(DescTools)
example_data <- c(24, 16, 12, 10, 12, 28, 38, 12, 28, 24)
example_mode <- Mode(example_data)
The code above calculates the mode of the values in example_data and
saves it to example_mode.
The result of Mode() is a vector with the mode value, and its frequency
is stored in the freq attribute:
example_mode
[1] 12
attr(,"freq")
[1] 3
Example: Two Modes
If there are multiple modes, the Mode() function will return them as
a vector.
Let’s look at a vector with two modes, 12 and 24:
example_data <- c(24, 16, 12, 10, 12, 24, 38, 12, 28, 24)
example_mode <- Mode(example_data)
The result is:
example_mode
[1] 12 24
attr(,"freq")
[1] 3
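If DescTools is not available, the same idea can be sketched in base R. The helper all_modes() below is hypothetical and written for illustration only. It returns every value tied for the highest frequency, but unlike Mode() it does not attach a freq attribute.

```r
# Hypothetical base-R helper: return every value tied for the highest frequency
all_modes <- function(x) {
  frequencies <- table(x)
  as.numeric(names(frequencies)[frequencies == max(frequencies)])
}

one_mode_data <- c(24, 16, 12, 10, 12, 28, 38, 12, 28, 24)
two_mode_data <- c(24, 16, 12, 10, 12, 24, 38, 12, 28, 24)

print(all_modes(one_mode_data))  # 12
print(all_modes(two_mode_data))  # 12 24
```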
Instructions
1.We have already imported the DescTools library for you.
Delete the current value set to mode_age.
Find the mode of the observations in the author_ages vector. Save the
result to mode_age.
# Set author ages to a vector
author_ages <- greatest_books$Ages
mode_age <- Mode(author_ages)
print(paste("The mode age of authors from Le Monde's 100 greatest books is: ", mode_age[1]))
[1] "The mode age of authors from Le Monde's 100 greatest books is: 38"
4.Review and Discussion
In this lesson, you learned how to find the mode of a dataset in two
steps:
1.Find the frequency of every unique number in the dataset
2.Determine which number has the highest frequency
You also learned how to calculate the mode using DescTools:
Mode(my_array)
Discussion
In this lesson, you found that 38 was the most common age, at
publication, for an author from the Le Monde survey. How does this
number compare to your guess from the beginning of the lesson?
The mode is close to the median and mean of the dataset, but it is
not in the tallest bucket. This should not be surprising, as the
histogram indicates the data is centered between the ages of 30 and 50 —
there is a higher chance of a mode in that range than outside of it.
The mode is not always this close to the median and mean, and often
will not be in the tallest bucket.
Look at the 25-30 year-old bin. There are nine observations in it. If
all the values in that bin happened to be 27, then the dataset’s mode
would be 27. Although unlikely, it is possible. Below, we show what this
would look like:
[Figure: histogram in which all nine observations in the 25-30 bin equal 27, making 27 the mode]

Based on this graph, it is fair to say the mode may not always be a
great measure of where the data is centered. Simply put, mode is a
measure of the most frequent observation in the dataset, and is not an
indication of the tallest bin in a histogram.
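The point can be checked numerically with a small sketch. The ages below are made up for illustration and are not taken from the Le Monde data. The most frequent single value is 27, yet the bin holding the most observations is a different one.

```r
# Made-up ages for illustration only, not taken from the Le Monde data
ages <- c(27, 27, 27, 40, 41, 42, 43, 44)

# Most frequent single value
frequencies <- table(ages)
mode_value <- as.numeric(names(frequencies)[which.max(frequencies)])

# Count observations per 5-year bin: [25,30), [30,35), [35,40), [40,45)
bins <- cut(ages, breaks = seq(25, 45, by = 5), right = FALSE)
bin_counts <- table(bins)
tallest_bin <- names(bin_counts)[which.max(bin_counts)]

print(mode_value)   # 27
print(tallest_bin)  # "[40,45)"
```

Here the [40,45) bin contains five different ages that each occur once, so it is the tallest bin without containing the mode.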
In the instructions below, we’ve written a brief explanation that
puts mode in the context of our problem.
Instructions
#plot data
hist <- qplot(greatest_books$Ages,
geom='histogram',
binwidth = 3,
main = 'Age of Top 100 Novel Authors at Publication',
xlab = 'Publication Age',
ylab = 'Count',
fill=I("blue"),
col=I("black"),
alpha=I(.2)) +
geom_vline(aes(xintercept=median(greatest_books$Ages),
color="median"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mean(greatest_books$Ages),
color="mean"), linetype="solid",
size=1) +
geom_vline(aes(xintercept=38,
color="mode"), linetype="solid",
size=1) +
scale_color_manual(name = "statistics", values = c(median = "blue", mean = "red", mode="green"))
hist

