Mean in R

1.Introduction

Finding the center of a dataset is one of the most common ways to summarize statistical findings. People often communicate the center of data with phrases such as “on average,” “usually,” or “often.”

In this lesson, you will learn how to calculate the mean of a dataset, a common measure of a dataset’s center. We will use the mean to help us answer the question:

When are adults their most creative and productive?

You could define “creative” and “productive” in a lot of ways, making this question impossible to fully answer by the end of this lesson. However, you will form an informed opinion on the question using data on the one hundred greatest novels of all time.

We collected the dataset from a survey administered by the French literary magazine, Le Monde. From the dataset, you will calculate the average age of the authors when their books were published.

Instructions

The histogram to the right displays the ages of 100 authors from the Le Monde survey. Where do you think the data is centered?

# load libraries
library(readr)
library(dplyr)
library(ggplot2)

# load data frame
greatest_books <- read_csv('top-hundred-books.csv')



#plot data
hist <- qplot(greatest_books$Ages,
      geom='histogram',
      binwidth = 3,  
      main = 'Age of Top 100 Novel Authors at Publication', 
      xlab = 'Publication Age',
      ylab = 'Count',
      fill=I("blue"), 
      col=I("red"), 
      alpha=I(.2)) 

hist

2.Calculating Mean

The mean, often referred to as the average, is a way to measure the center of a dataset.

The average of a set is calculated using a two-step process:

1.Add all of the observations in your dataset.

2.Divide the total sum from step one by the number of points in your dataset.

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Mean1.png")
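
Written as a formula, the two steps above amount to the expression below. The symbols are notation chosen here for illustration (the image may use different ones): x_1 through x_n are the observations and n is the number of observations.

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
```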

Example

Imagine that we wanted to calculate the average of a dataset with the following four observations:

data <- c(4, 6, 2, 8)

Step One: Calculate the total

4+6+2+8=20

Step Two: Divide by the number of observations

The total is equal to 20, and the number of observations is equal to 4.

20 / 4 = 5

The average of this dataset is equal to 5.
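
The same two steps can be sketched in R with sum() and length(). This is a minimal illustration; the names total, n, and manual_mean are chosen here and are not part of the lesson's workspace.

```r
# The four observations from the worked example
data <- c(4, 6, 2, 8)

# Step one: add all of the observations
total <- sum(data)

# Step two: divide the total by the number of observations
n <- length(data)
manual_mean <- total / n

print(manual_mean)  # 5
```

Using length(data) instead of typing 4 keeps the calculation correct if observations are added or removed later.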

Instructions

1.In this exercise, you will use R to find the average age of the first four authors in Le Monde’s top 100 books.

29,49,42,43

Add the values together, and set total equal to the answer. Print total.

total <- sum(29, 49, 42, 43)
print(total)
[1] 163

2.Divide total by the number of values in the dataset, and set mean_value to the answer.

Print mean_value. Keep that number in your head as you progress through the lesson.

mean_value <- total / 4
print(mean_value)
[1] 40.75

3.Mean in R

While you’ve shown that you can calculate the average yourself, it becomes time-consuming as the size of your dataset increases — imagine adding all of the numbers in a dataset with 10,000 observations.

The R mean() function can do the work of adding and dividing for you. In the example below, we use mean() to calculate the average of a dataset with ten values:

example_data <- c(24, 16, 30, 10, 12, 28, 38, 2, 4, 36)

example_average <- mean(example_data)

print(example_average)
[1] 20

The code above calculates the average of example_data and saves the value to example_average. The resulting average of this vector is 20.
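
As a quick check, the sketch below compares mean() with the manual calculation and shows how mean() behaves when a value is missing. The vector ages_with_gap is hypothetical and only there to illustrate the na.rm argument; whether the lesson's dataset contains missing values is not stated.

```r
example_data <- c(24, 16, 30, 10, 12, 28, 38, 2, 4, 36)

# mean() matches the manual two-step calculation
manual_average <- sum(example_data) / length(example_data)
print(all.equal(manual_average, mean(example_data)))

# Hypothetical vector with one missing value, for illustration only
ages_with_gap <- c(29, 49, NA, 43)

# A missing value propagates, so the result is NA
print(mean(ages_with_gap))

# na.rm = TRUE drops missing values before averaging
print(mean(ages_with_gap, na.rm = TRUE))
```

If mean() ever returns NA on real data, checking for missing values is a reasonable first step.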

Instructions

1.Use R to calculate the average value of the author_ages array. Save the result to average_age and print it.

Does the average age of the authors surprise you? If so, how? Is it older, or younger than you expected?

# Set author ages to a vector
author_ages <- greatest_books$Ages

# Use R to calculate mean
average_age <- mean(author_ages)
average_age
[1] 41.83

4.Review and Discussion

In this lesson, you learned how to calculate the average of a dataset using the formula:

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Mean1.png")

and the R function:

mean(my_data)

Circling back to the original question, do you feel that the average of our dataset, 41.83, gives us enough information to claim when someone is at their most creative and productive?

Take a look at the histogram and mean (in red) to the right as you consider this question.

We would say no. We could argue against its use for several reasons; two are highlighted below:

1.The date of publication is not necessarily an author’s most creative year. When did they start authoring the book? What factors impacted their writing during those years?

2.The average age of the publishing dates for 100 authors may not accurately measure peak creativity in other professions. The average age of painters or sculptors may be very different.

So, what kind of information does the average provide us, and why would we use the average to describe something when we could display a histogram?

The most important outcome is that we’re able to use a single number as a measure of centrality. Although histograms provide more information, they are not a concise or precise measure of centrality: readers must interpret them for themselves.

Instructions

Take a look at the histogram. Are you older or younger than the average age at publication?

This doesn’t tell you much about when someone will have their most creative year. However, this type of data could be used as an example in a broader study on aging.

#plot data
hist <- qplot(greatest_books$Ages,
      geom='histogram',
      binwidth = 3,  
      main = 'Age of Top 100 Novel Authors at Publication', 
      xlab = 'Publication Age',
      ylab = 'Count',
      fill=I("blue"), 
      col=I("black"), 
      alpha=I(.2)) +
  geom_vline(aes(xintercept=mean(greatest_books$Ages),
                 color="mean"), linetype="solid",
             size=1) +
  scale_color_manual(name = "statistics", values = c(mean = "red"))

hist

Median in R

1.Introduction

In this lesson, you will learn how to find the median of a dataset — a common measure of a dataset’s center. Each of the next three exercises will cover the following topics:

1.Manually finding the median of a dataset

2.Using R’s median function to find the median of a dataset

3.Interpreting what it means for a dataset to have similar and different median and mean values

In the lesson, we will use a dataset of the 100 greatest novels, determined by a French literary magazine, Le Monde. From the dataset, you will use the median to answer the question:

When are great authors most likely to publish their best work?

If you are not familiar with mean, also known as average, we recommend that you learn about it in our lesson on average.

Instructions

The histogram to the right displays the age of authors, at publication, for the top 100 novels. The red line represents the average value of this dataset.

You can think of the median as being the observation in your dataset that falls right in the middle.

Using this informal definition of the median and the graph to the right, see if you can determine whether the median of this dataset falls to the right or the left of the mean. We will show you the correct answer in the last exercise.

#plot data
hist <- qplot(greatest_books$Ages,
      geom='histogram',
      binwidth = 3,  
      main = 'Age of Top 100 Novel Authors at Publication', 
      xlab = 'Publication Age',
      ylab = 'Count',
      fill=I("blue"), 
      col=I("black"), 
      alpha=I(.2)) +
  geom_vline( aes(xintercept=mean(greatest_books$Ages),color="mean"), linetype="solid",size=1) +
  scale_color_manual(name = "statistics", values = c(mean = "red"))

hist

2.Median

The formal definition for the median of a dataset is:

The value that, assuming the dataset is ordered from smallest to largest, falls in the middle. If there is an even number of values in a dataset, you either report both of the middle two values or their average.

There are always two steps to finding the median of a dataset:

1.Order the values in the dataset from smallest to largest

2.Identify the number(s) that fall(s) in the middle

Example One: Even Number of Values

Say we have a dataset with the following ten numbers:

24, 16, 30, 10, 12, 28, 38, 2, 4, 36

The first step is to order these numbers from smallest to largest:

2, 4, 10, 12, [16, 24], 28, 30, 36, 38

Because this dataset has an even number of values, there are two middle values: 16 and 24. Four datapoints fall to the left of 16, and four fall to the right of 24.

Although you can report both values as the median, people often average them. If you averaged 16 and 24, you could report the median as 20.

Example Two: Odd Number of Values

If we added another value (say, 24) to the dataset and sorted it, we would have:

2, 4, 10, 12, 16, [24], 24, 28, 30, 36, 38

The new median is equal to 24, because there are 5 values to the left of it, and 5 values to the right of it.
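
Both examples can be reproduced with a short R sketch that follows the two steps literally. The function manual_median() is a hypothetical helper written for illustration; it averages the two middle values when the count is even.

```r
# Hypothetical helper, for illustration only
manual_median <- function(x) {
  sorted <- sort(x)  # step one: order from smallest to largest
  n <- length(sorted)
  if (n %% 2 == 1) {
    # odd count: a single middle value
    sorted[(n + 1) / 2]
  } else {
    # even count: average the two middle values
    mean(sorted[c(n / 2, n / 2 + 1)])
  }
}

even_data <- c(24, 16, 30, 10, 12, 28, 38, 2, 4, 36)
odd_data <- c(even_data, 24)

print(manual_median(even_data))  # 20
print(manual_median(odd_data))   # 24
```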

Instructions

1.In the next two steps, you will manually sort an array, and then determine which value in the array is the median.

In notebook.Rmd, we have a vector with the ages of the first five authors from Le Monde’s survey:

29,49,42,43,32

Under five_author_ages there is a variable called sorted_author_ages. Change the 0s in sorted_author_ages to the values in ascending order from five_author_ages.

# Array of the first five author ages
five_author_ages <- c(29, 49, 42, 43, 32)

# Fill in the empty array with the values sorted
sorted_author_ages <- sort(five_author_ages)

sorted_author_ages
[1] 29 32 42 43 49

2.Set median_value equal to the median of the array.

# Save the median value to median_value
median_value <- 42

# Print the sorted array and median value
cat("The sorted array is:", sorted_author_ages, '\n')
The sorted array is: 29 32 42 43 49 
cat(paste("The median of the array is: ", median_value))
The median of the array is:  42

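*About cat():

cat() stands for “concatenate and print”; it writes strings and variables out as a single line of text.

Unlike print(), it does not add a line break automatically, so it is a good idea to add "\n" yourself.

If you use cat(paste(…)), paste() first assembles the full sentence, and cat() then prints it.

cat() returns no usable value (it returns NULL invisibly), so its output cannot be stored directly in an R object; it is only for printing text.

*In R, paste() is a function that combines multiple strings into a single string. It is often used to build dynamic text, report output, and similar.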

name <- "Annabel"
age <- 18

paste("My name is", name, "and I am", age, "years old.")
[1] "My name is Annabel and I am 18 years old."
# Output: "My name is Annabel and I am 18 years old."
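
A small sketch of how the separator affects the printed text, which explains the doubled space in the median output shown earlier. The variable names with_sep and no_sep are chosen here for illustration.

```r
median_value <- 42

# paste() inserts sep = " " between its arguments, so a label that
# already ends in a space produces two spaces in the result
with_sep <- paste("The median is: ", median_value)
print(with_sep)

# paste0() joins its arguments with no separator
no_sep <- paste0("The median is: ", median_value)
print(no_sep)

# cat() does not end the line on its own, so finish with "\n"
cat("The median is:", median_value, "\n")
```

Dropping the trailing space from the label, or switching to paste0(), avoids the doubled space.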

3.Median in R

Finding the median of a dataset becomes increasingly time-consuming as the size of your dataset increases — imagine finding the median of an unsorted dataset with 10,000 observations.

The R median() function can do the work of sorting, then finding the median for you. In the example below, we use median() to calculate the median of a dataset with eleven values:

example_data <- c(24, 16, 30, 10, 12, 28, 38, 2, 4, 36, 42)

example_median <- median(example_data)

print(example_median)
[1] 24

The code above prints the median of the dataset, 24. The mean of this dataset is 22. It’s worth noting these two values are close to one another, but not equal.

Instructions

1.Use R to find the median of the author_ages array. Save the result to median_age.

Does the median age of the authors surprise you? If so, how? Is it older, or younger than you expected?

# Save author ages to author_ages
author_ages <- greatest_books$Ages

# Use R to calculate the median age of the top 100 authors
median_age <- median(author_ages)

print(paste("The median age of the 100 greatest authors, according to a survey by Le Monde is: " , median_age))
[1] "The median age of the 100 greatest authors, according to a survey by Le Monde is:  41"

4.Review and Discussion

In this lesson, you learned how to find the median of a dataset in two steps:

1.Sort the dataset

2.Identify the one or two numbers that fall in the middle of the sorted dataset

You also learned how to calculate the median using R:

median(my_data)

Discussion

Take a look at the histogram. It displays the author age distribution with vertical lines for the mean (red) and median (blue).

Do you feel that the median of our dataset, 41, gives us enough information to claim when authors publish their greatest work?

We argue it does not.

Although the median is a good measure of the dataset’s center, we cannot make a definitive claim about when authors publish their greatest work — the youngest author published at 18 and the oldest at 76. It would be irresponsible to say anything but, “it seems to be possible at almost any age.”

Notice that the mean and the median are nearly equal. This is not a surprising result, as both statistics measure the dataset’s center. However, it’s worth noting that these results will not always be so close.
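
To see how far apart the two statistics can drift, consider the sketch below. Both vectors are hypothetical ages invented for illustration and are not taken from the Le Monde data.

```r
# Hypothetical ages, invented for illustration
symmetric_ages <- c(30, 35, 40, 45, 50)
skewed_ages <- c(30, 35, 40, 45, 95)  # one unusually large value

print(mean(symmetric_ages))    # 40
print(median(symmetric_ages))  # 40

print(mean(skewed_ages))       # 49
print(median(skewed_ages))     # 40
```

A single extreme value pulls the mean upward, while the median stays at the middle observation.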

In the instructions below, we’ve written a brief explanation that puts median in the context of our problem.

#plot data
hist <- qplot(greatest_books$Ages,
      geom='histogram',
      binwidth = 3,  
      main = 'Age of Top 100 Novel Authors at Publication', 
      xlab = 'Publication Age',
      ylab = 'Count',
      fill=I("blue"), 
      col=I("black"), 
      alpha=I(.2)) +
  geom_vline(aes(xintercept=median(greatest_books$Ages),
                 color="median"), linetype="dashed",
             size=1) +
  geom_vline(aes(xintercept=mean(greatest_books$Ages),
                 color="mean"), linetype="solid",
             size=1) +
  scale_color_manual(name = "statistics", values = c(median = "blue", mean = "red"))

hist

Instructions

The median age of authors, when they publish their best work, from Le Monde’s 100 greatest books is 41.

While this does not tell us much about which year is an author’s greatest year, it does indicate that roughly half of the authors from the survey published their greatest work at or before age 41 and roughly half at or after age 41.

Mode in R

1.Introduction

In this lesson, you will learn how to find the mode of a dataset. Each of the next three exercises will cover the following:

1.Manually finding the mode of a dataset

2.Using R’s functions to find the mode

3.Comparing mode to mean and median values

In the lesson, we will use a dataset of the 100 greatest novels, determined by a French literary magazine, Le Monde. From the dataset, you will use the mode to answer the question:

What is the most common age for a great author to publish their best work?

If you are not familiar with mean, also known as average, or median, we recommend that you learn about it in our lessons on average and median.

Instructions

The histogram to the right displays the age of authors, at publication, for the top 100 novels from Le Monde’s survey. The red line is the mean age, and the blue line is the median age.

Use the definition of mode below and the histogram to the right to guess where the mode falls. You will calculate the correct answer in the last exercise.

The mode is the most common observation in a dataset.

You will not be able to find the exact mode, because the histogram displays bins with a range of values. However, you can guess a range of values where you are most likely to see the mode.

#plot data
hist <- qplot(greatest_books$Ages,
      geom='histogram',
      binwidth = 3,  
      main = 'Age of Top 100 Novel Authors at Publication', 
      xlab = 'Publication Age',
      ylab = 'Count',
      fill=I("blue"), 
      col=I("black"), 
      alpha=I(.2)) +
  geom_vline(aes(xintercept=median(greatest_books$Ages),
                 color="median"), linetype="dashed",
             size=1) +
  geom_vline(aes(xintercept=mean(greatest_books$Ages),
                 color="mean"), linetype="solid",
             size=1) +
  scale_color_manual(name = "statistics", values = c(median = "blue", mean = "red"))

hist

2.Mode

The formal definition for the mode of a dataset is:

The most frequently occurring observation in the dataset. A dataset can have multiple modes if there is more than one value with the same maximum frequency.

While you may be able to find the mode of a small dataset by simply looking through it, if you have trouble, we recommend you follow these two steps:

1.Find the frequency of every unique number in the dataset

2.Determine which number has the highest frequency

Example

Say we have a dataset with the following ten numbers:

24, 16, 12, 10, 12, 28, 38, 12, 28, 24

Let’s find the frequency of each number:

Value:      24  16  12  10  28  38

Frequency:   2   1   3   1   2   1

From the table, we can see that our mode is 12, the most frequent number in our dataset.
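
The two steps can be sketched in base R with table(), which counts how often each unique value occurs. The names frequencies, highest_count, and mode_values are chosen here for illustration.

```r
example_data <- c(24, 16, 12, 10, 12, 28, 38, 12, 28, 24)

# Step one: find the frequency of every unique number
frequencies <- table(example_data)
print(frequencies)

# Step two: keep the value(s) whose count equals the highest count
highest_count <- max(frequencies)
mode_values <- as.numeric(names(frequencies)[frequencies == highest_count])

print(mode_values)    # 12
print(highest_count)  # 3
```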

Instructions

1.Determine the mode of the ages for the first ten authors in the Le Monde survey:

29,49,42,43,32,38,37,41,27,27

Save the value to mode_age.

mode_age <- 27
mode_age
[1] 27

2.Determine the number of authors who were the age of the mode. Save the number to mode_count.

mode_count <- 2
mode_count
[1] 2

3.Mode with DescTools

Finding the mode of a dataset becomes increasingly time-consuming as the size of your dataset increases — imagine finding the mode of a dataset with 10,000 observations.

The R package DescTools includes a handy Mode() function which can do the work of finding the mode for us. In the example below, we use Mode() to calculate the mode of a dataset with ten values:

Example: One Mode

library(DescTools)

example_data <- c(24, 16, 12, 10, 12, 28, 38, 12, 28, 24)

example_mode <- Mode(example_data)

The code above calculates the mode of the values in example_data and saves it to example_mode.

The result of Mode() is a vector containing the mode value, with its frequency attached as the freq attribute:

example_mode
[1] 12
attr(,"freq")
[1] 3

Example: Two Modes

If there are multiple modes, the Mode() function will return them as a vector.

Let’s look at a vector with two modes, 12 and 24:

example_data <- c(24, 16, 12, 10, 12, 24, 38, 12, 28, 24)

example_mode <- Mode(example_data)

The result is:

example_mode
[1] 12 24
attr(,"freq")
[1] 3
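
If DescTools is not available, ties can also be handled with a small base-R function. The function find_modes() below is a hypothetical helper written for illustration; it is not how DescTools implements Mode(), and it returns only the values, without a freq attribute.

```r
# Hypothetical helper, not part of DescTools: returns every value
# tied for the highest frequency, sorted in ascending order
find_modes <- function(x) {
  unique_values <- unique(x)
  counts <- tabulate(match(x, unique_values))
  sort(unique_values[counts == max(counts)])
}

one_mode <- c(24, 16, 12, 10, 12, 28, 38, 12, 28, 24)
two_modes <- c(24, 16, 12, 10, 12, 24, 38, 12, 28, 24)

print(find_modes(one_mode))   # 12
print(find_modes(two_modes))  # 12 24
```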

Instructions

1.We have already imported the DescTools library for you.

Delete the current value set to mode_age.

Find the mode of the observations in the author_ages array. Save the result to mode_age.

# Set author ages to a vector
author_ages <- greatest_books$Ages

mode_age <- Mode(author_ages)

print(paste("The mode age of authors from Le Monde's 100 greatest books is: ", mode_age[1]))
[1] "The mode age of authors from Le Monde's 100 greatest books is:  38"

4.Review and Discussion

In this lesson, you learned how to find the mode of a dataset in two steps:

1.Find the frequency of every unique number in the dataset

2.Determine which number has the highest frequency

You also learned how to calculate the mode using DescTools:

Mode(my_array)

Discussion

In this lesson, you found that 38 was the most common age, at publication, for an author from the Le Monde survey. How does this number compare to your guess from the beginning of the lesson?

The mode is close to the median and mean of the dataset, but it is not in the tallest bucket. This should not be surprising, as the histogram indicates the data is centered between the ages of 30 and 50 — there is a higher chance of a mode in that range than outside of it.

The mode is not always this close to the median and mean, and often will not be in the tallest bucket.

Look at the 25-30 year-old bin. There are nine observations in it. If all the values in that bin happened to be 27, then the dataset’s mode would be 27. Although unlikely, it is possible. Below, we show what this would look like:

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Mean2.png")

Based on this graph, it is fair to say the mode may not always be a great measure of where the data is centered. Simply put, mode is a measure of the most frequent observation in the dataset, and is not an indication of the tallest bin in a histogram.
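
The difference between the mode and the tallest bin can be checked numerically. The sketch below uses hypothetical ages and five-year bins, both invented for illustration; the lesson's histogram uses a different bin width.

```r
# Hypothetical ages, invented for illustration
ages <- c(27, 27, 27, 40, 41, 42, 43)

# Most frequent single value
value_counts <- table(ages)
mode_value <- as.numeric(names(value_counts)[value_counts == max(value_counts)])

# Counts per five-year bin: [25,30), [30,35), [35,40), [40,45)
bins <- cut(ages, breaks = seq(25, 45, by = 5), right = FALSE)
bin_counts <- table(bins)
tallest_bin <- names(bin_counts)[bin_counts == max(bin_counts)]

print(mode_value)   # 27
print(tallest_bin)  # "[40,45)"
```

Here the mode is 27, yet the bin holding the most observations covers ages 40 to 44, because that bin collects four different values.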

In the instructions below, we’ve written a brief explanation that puts mode in the context of our problem.

Instructions

#plot data
hist <- qplot(greatest_books$Ages,
      geom='histogram',
      binwidth = 3,  
      main = 'Age of Top 100 Novel Authors at Publication', 
      xlab = 'Publication Age',
      ylab = 'Count',
      fill=I("blue"), 
      col=I("black"), 
      alpha=I(.2)) +
  geom_vline(aes(xintercept=median(greatest_books$Ages),
                 color="median"), linetype="dashed",
             size=1) +
  geom_vline(aes(xintercept=mean(greatest_books$Ages),
                 color="mean"), linetype="solid",
             size=1) +
  geom_vline(aes(xintercept=38,
                 color="mode"), linetype="solid",
             size=1) +
  scale_color_manual(name = "statistics", values = c(median = "blue", mean = "red", mode="green"))

hist

