1.Quartiles

A common way to communicate a high-level overview of a dataset is to find the values that split the data into four groups of equal size.

By doing this, we can then say whether a new datapoint falls in the first, second, third, or fourth quarter of the data.

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quartiles.png")

Those values are called the first quartile (Q1), the second quartile (Q2), and the third quartile (Q3).

In the image above, Q1 is 10, Q2 is 13, and Q3 is 22. Those three values split the data into four groups that each contain five datapoints.

In this lesson, you will learn to calculate the quartiles by hand and by using base R functions.

Instructions

In this lesson we’ll be looking at a dataset about music. We’ve plotted a histogram of song lengths (measured in seconds) of 9,975 random songs.

Look up the length of a favorite song of yours. Do you think that song falls in the first, second, third or fourth quarter of the data?

For example, we’ve picked one of our favorite songs, Chicago by Sufjan Stevens. Chicago is 364 seconds long — we’ve plotted it as a blue vertical line. It looks like Chicago is in either the third or fourth quarter of the data, but it’s hard to say for sure. Let’s find the quartiles of the dataset!

# load song data
load("songs.Rda")
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection

2.The Second Quartile

We’ll come back to the music dataset in a bit, but let’s first practice on a small dataset.

Let’s begin by finding the second quartile (Q2). Q2 happens to be exactly the median. Half of the data falls below Q2 and half of the data falls above Q2.

The first step in finding the quartiles of a dataset is to sort the data from smallest to largest. For example, below is an unsorted dataset:

c(8, 15, 4, -108, 16, 23, 42)

After sorting the dataset, it looks like this:

c(-108, 4, 8, 15, 16, 23, 42)

Now that the list is sorted, we can find Q2. In the example dataset above, Q2 (and the median) is 15 — there are three points below 15 and three points above 15.
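
The sort-then-pick-the-middle steps above can be sketched in R. This is only an illustration: the names demo, sorted_demo, and middle are our own and are not part of the lesson workspace.

```r
# Demo dataset from the text (unsorted)
demo <- c(8, 15, 4, -108, 16, 23, 42)

# Step 1: sort from smallest to largest
sorted_demo <- sort(demo)

# Step 2: with an odd number of points, Q2 is the middle element
middle <- (length(sorted_demo) + 1) / 2   # position 4 of 7
q2 <- sorted_demo[middle]
q2
# [1] 15

# Base R's median() returns the same value without sorting by hand
median(demo)
# [1] 15
```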

Even Number of Datapoints

You might be wondering what happens if there is an even number of points in the dataset. For example, if we remove the -108 from our dataset, it will now look like this:

c(4,8,15,16,23,42)

Q2 now falls somewhere between 15 and 16. There are a couple of different strategies that you can use to calculate Q2 in this situation. One of the more common ways is to take the average of those two numbers. In this case, that would be 15.5.

Recall that you can find the average of two numbers by adding them together and dividing by two.
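
As a small sketch of the even-length case, the two middle values can be picked out by position and averaged. The variable names here (demo_even, lower_mid, upper_mid) are made up for this illustration.

```r
# Demo dataset with -108 removed, leaving an even number of points
demo_even <- c(4, 8, 15, 16, 23, 42)
sorted_even <- sort(demo_even)

n <- length(sorted_even)

# The two middle values sit at positions n/2 and n/2 + 1
lower_mid <- sorted_even[n / 2]       # 15
upper_mid <- sorted_even[n / 2 + 1]   # 16

# Q2 is their average
q2_even <- (lower_mid + upper_mid) / 2
q2_even
# [1] 15.5
```

Base R's median() applies the same averaging rule to an even number of points, so median(demo_even) should agree with this result.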

Instructions

1.We’ve included two small unsorted datasets named dataset_one and dataset_two.

We’ve also included, as a comment, the sorted version of the first dataset.

By looking at the sorted version of dataset_one, find the second quartile of the dataset and store it in a variable named dataset_one_q2.

2.Find the second quartile of dataset_two and store it in a variable named dataset_two_q2.

Remember to sort the dataset. It might help to write out the sorted dataset as a comment!

Since there is an even number of datapoints in this dataset, the second quartile will fall between two points. The second quartile will be the average of those two points.

dataset_two <- sort(dataset_two)
dataset_two
[1] -15   1  20  24  40  45
dataset_two_q2 <- 22
dataset_two_q2
[1] 22

3.Q1 and Q3

Now that we’ve found Q2, we can use that value to help us find Q1 and Q3. Recall our demo dataset:

c(-108, 4, 8, 15, 16, 23, 42)

In this example, Q2 is 15. To find Q1, we take all of the data points that sit below Q2 in the sorted list and find the median of those points. In this case, the points below Q2 are:

c(-108, 4, 8)

The median of that smaller dataset is 4. That’s Q1!

To find Q3, repeat the process using the points that sit above Q2 in the sorted list. We have the following points:

c(16,23,42)

The median of those points is 23. That’s Q3! We now have three points that split the original dataset into four groups of equal size.
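
The procedure described in this section can be written as a short function. This is a sketch under our own assumptions: quartiles_exclusive is a hypothetical helper name, not a base R function, and it expects at least two datapoints.

```r
# Hypothetical helper (not part of base R): quartiles by the method above,
# where the median is left out of both halves when the length is odd.
quartiles_exclusive <- function(x) {
  stopifnot(length(x) >= 2)        # assumption: at least two datapoints
  x <- sort(x)
  n <- length(x)
  half <- floor(n / 2)             # size of each half, median excluded
  lower <- x[seq_len(half)]        # the `half` smallest values
  upper <- x[(n - half + 1):n]     # the `half` largest values
  c(Q1 = median(lower), Q2 = median(x), Q3 = median(upper))
}

quartiles_exclusive(c(8, 15, 4, -108, 16, 23, 42))
# Q1 = 4, Q2 = 15, Q3 = 23
```

Using floor(n / 2) means that for an even number of points the two halves simply split the sorted data down the middle.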

Instructions

1.Find the first quartile of dataset_one and store it in a variable named dataset_one_q1.

dataset_one <- c(50, 10, 4, -3, 4, -20, 2)
# sorted dataset_one: [-20, -3, 2, 4, 4, 10, 50]

dataset_two <- c(24, 20, 1, 45, -15, 40)

dataset_one_q2 <- 4
dataset_two_q2 <- 22

# define the first and third quartile of both datasets here:
dataset_one_q1 <- -3
dataset_one_q1
[1] -3

2.Find the third quartile of dataset_one and store it in a variable named dataset_one_q3.

dataset_one_q3 <- 10
dataset_one_q3
[1] 10

3.Find Q1 and Q3 of dataset_two. Store the values in variables named dataset_two_q1 and dataset_two_q3.

dataset_two <- sort(dataset_two)
dataset_two
[1] -15   1  20  24  40  45
dataset_two_q1 <- 1
dataset_two_q3 <- 40
dataset_two_q1
[1] 1
dataset_two_q3
[1] 40

4.Method Two: Including Q2

You just learned a commonly used method to calculate the quartiles of a dataset. However, there is another method that is equally accepted that results in different values!

Note that there is no universally agreed upon method of calculating quartiles, and as a result, two different tools might report different results.

The second method includes Q2 when trying to calculate Q1 and Q3. Let’s take a look at an example:

c(-108, 4, 8, 15, 16, 23, 42)

Using the first method, we found Q1 to be 4. When looking at all of the points below Q2, we excluded Q2. Using this second method, we include Q2 in each half.

For example, when calculating Q1 using this new method, we would now find the median of this dataset:

c(-108, 4, 8, 15)

Using this method, Q1 is 6.
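
A sketch of this second method as a function is shown below. The name quartiles_inclusive is hypothetical (it is not a base R function), and the function assumes at least two datapoints.

```r
# Hypothetical helper (not part of base R): quartiles by the second method,
# where the median is included in both halves when the length is odd.
quartiles_inclusive <- function(x) {
  stopifnot(length(x) >= 2)        # assumption: at least two datapoints
  x <- sort(x)
  n <- length(x)
  half <- ceiling(n / 2)           # size of each half, median included
  lower <- x[seq_len(half)]        # smallest values, up to and including Q2
  upper <- x[(n - half + 1):n]     # largest values, from Q2 upward
  c(Q1 = median(lower), Q2 = median(x), Q3 = median(upper))
}

quartiles_inclusive(c(8, 15, 4, -108, 16, 23, 42))
# Q1 = 6, Q2 = 15, Q3 = 19.5
```

With an even number of points, ceiling(n / 2) equals n / 2, so both halves are the same as in the first method and the two methods agree.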

Instructions

1.Create a variable named dataset_one_q1 and set it equal to the first quartile of dataset one. This time, use the second method of finding quartiles.

dataset_one <- c(50, 10, 4, -3, 4, -20, 2)
# sorted dataset_one: [-20, -3, 2, 4, 4, 10, 50]

dataset_two <- c(24, 20, 1, 45, -15, 40)

dataset_one_q2 <- 4
dataset_two_q2 <- 22
# define the first and third quartile of both datasets here:
dataset_one_q1 <- -0.5
dataset_one_q1
[1] -0.5

2.Create a variable named dataset_one_q3 and set it equal to the third quartile of dataset one. Again, use the second method of finding quartiles.

dataset_one_q3 <- 7
dataset_one_q3
[1] 7

3.Create two variables named dataset_two_q1 and dataset_two_q3 and set them equal to the first and third quartile of dataset two.

Use the second method of calculating quartiles. Since Q2 fell between two data points, this method is no different from the first method!

dataset_two <- sort(dataset_two)
dataset_two
[1] -15   1  20  24  40  45
dataset_two_q1 <- 1
dataset_two_q1
[1] 1
dataset_two_q3 <- 40
dataset_two_q3
[1] 40

5.Quartiles in R

We were able to find quartiles manually by looking at the dataset and finding the correct division points. But that gets much harder when the dataset starts to get bigger. Luckily, there is a function in base R that will find the quartiles for you.

The base R function that we’ll be using is named quantile(). You can learn more about quantiles in our quantiles lesson, but for right now all you need to know is that a quartile is a specific kind of quantile.

The code below calculates the third quartile of the given dataset:

dataset <- c(50, 10, 4, -3, 4, -20, 2)
third_quartile <- quantile(dataset, 0.75)
third_quartile
75% 
  7 

Here we pass the quantile() function two arguments. The first is the dataset you’re interested in. The second is a number between 0 and 1. Since we calculated the third quartile, we used 0.75: we want the point that splits the first 75% of the data from the rest.

For the second quartile, we’d use 0.5. This gives the point that has 50% of the data below it and 50% above it.

Notice that the dataset doesn’t need to be sorted for R’s function to work!
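
As a brief sketch, quantile() also accepts a vector of probabilities, so all three quartiles can be requested in one call. The comparison against a sorted copy is just our own check of the point about sorting.

```r
dataset <- c(50, 10, 4, -3, 4, -20, 2)

# Request Q1, Q2, and Q3 in a single call
quartiles <- quantile(dataset, c(0.25, 0.5, 0.75))
quartiles
# 25% = -0.5, 50% = 4, 75% = 7

# Sorting the data first gives the same quartiles
all.equal(quartiles, quantile(sort(dataset), c(0.25, 0.5, 0.75)))
# [1] TRUE
```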

Instructions

1.We’ve brought back our music dataset. The lengths of 9,975 songs (in seconds) are stored in a variable named songs. Use the quantile() function to find the first quartile. Store the result in a variable named songs_q1.

25% 175.9342

2.Find the second and third quartile of the dataset and store the values in two variables named songs_q2 and songs_q3.

50% 222.824

75% 275.4738

3.Look up the length of your favorite song in seconds. Store that value in a variable named favorite_song.

Does that song fall in the first, second, third, or fourth quarter of the data? Create a variable named quarter. Set quarter equal to 1 if your favorite song falls in the first quarter of the data. Set it equal to 2 if your song falls in the second quarter. Set it equal to 3 if your song falls in the third quarter. And set it to 4 if your song falls in the final quarter of the data.

# create the variables favorite_song and quarter here:
favorite_song <- 287
favorite_song
[1] 287
quarter <- 4
quarter
[1] 4
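
Instead of comparing by eye, the quarter can also be computed. The sketch below hard-codes the three quartile values printed earlier in this section so that it runs without the songs data. It uses base R's findInterval(), and it assumes that a song exactly equal to a quartile counts as belonging to the higher quarter.

```r
# Quartiles reported in the output above for the songs dataset
songs_q1 <- 175.9342
songs_q2 <- 222.824
songs_q3 <- 275.4738

favorite_song <- 287

# findInterval() counts how many cut points are <= the value,
# so adding 1 turns that count into a quarter number from 1 to 4
quarter <- findInterval(favorite_song, c(songs_q1, songs_q2, songs_q3)) + 1
quarter
# [1] 4
```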

6.Quartiles Review

Great work! You now know how to calculate the quartiles of any dataset by hand and with R.
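
To tie the hand methods back to R: quantile() has a type argument that selects among nine quantile definitions, with type = 7 as the default. The sketch below compares two of them on dataset_one. For this particular seven-point dataset, the default happens to reproduce the values from the second hand method and type = 6 reproduces the first. That agreement is specific to this example and should not be read as a general equivalence.

```r
dataset_one <- c(50, 10, 4, -3, 4, -20, 2)

# Default definition (type = 7)
q_default <- quantile(dataset_one, c(0.25, 0.75))
unname(q_default)
# [1] -0.5  7.0

# An alternative definition (type = 6)
q_type6 <- quantile(dataset_one, c(0.25, 0.75), type = 6)
unname(q_type6)
# [1] -3 10
```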

Quartiles are some of the most commonly used descriptive statistics. For example, you might see schools or universities think about quartiles when considering which students to accept. Businesses might compare their revenue to other companies by looking at quartiles.

In fact, quartiles are so commonly used that the three quartiles, along with the minimum and the maximum values of a dataset, are called the five-number summary of the dataset. These five numbers help you quickly get a sense of the range, centrality, and spread of the dataset.
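
Base R can produce a five-number summary directly. The sketch below uses fivenum() on dataset_one and, for comparison, quantile() with the probabilities 0, 0.25, 0.5, 0.75, and 1. Note that fivenum() reports Tukey's hinges rather than quantile()-style quartiles, so the two calls can disagree on other datasets even though they match here.

```r
dataset_one <- c(50, 10, 4, -3, 4, -20, 2)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(dataset_one)
five
# [1] -20.0  -0.5   4.0   7.0  50.0

# The same five positions requested from quantile()
five_q <- quantile(dataset_one, c(0, 0.25, 0.5, 0.75, 1))
unname(five_q)
# [1] -20.0  -0.5   4.0   7.0  50.0
```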

Instructions

We’ve plotted the first, second, and third quartiles on the histogram for our music dataset. Are they where you expected to see them?

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quartiles3.png")

