1.Quantiles in R

Quantiles are points that split a dataset into groups of equal size. For example, let’s say you just took a test and wanted to know whether you’re in the top 10% of the class. One way to determine this would be to split the data into ten groups with an equal number of datapoints in each group and see which group you fall into.

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quantiles1.png")

There are nine values that split the dataset into ten groups of equal size — each group has 3 different test scores in it.

Those nine values that split the data are quantiles! Specifically, they are the 10-quantiles, or deciles.
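As a rough illustration of the idea above, the sketch below builds a made-up set of 30 distinct test scores (an assumption, not the data in the figure), finds the nine deciles with base R's quantile() function, and counts how many scores land in each of the ten groups.

```r
# Hypothetical test scores: 30 distinct values (not the data from the figure)
scores <- seq(41, 99, by = 2)

# The nine cut points that split the scores into ten groups of equal size
deciles <- quantile(scores, (1:9) / 10)

# Assign each score to the group it falls in, then count the scores per group
groups <- cut(scores, breaks = c(-Inf, deciles, Inf), labels = 1:10)
group_sizes <- table(groups)
group_sizes
```

With these scores, every one of the ten groups should contain three values.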

You can find any number of quantiles. For example, if you split the dataset into 100 groups of equal size, the 99 values that split the data are the 100-quantiles, or percentiles.

The quartiles are some of the most commonly used quantiles. The quartiles split the data into four groups of equal size.

In this lesson, we’ll show you how to calculate quantiles using R and discuss some of the most commonly used quantiles.

Instructions

We’ve imported a dataset of song lengths (measured in seconds). We’ve drawn a few histograms showing different quantiles.

What do you think a histogram that shows the 100-quantiles would look like?

# load libraries
library(ggplot2)
# load song data
load("songs.Rda")
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quantiles2.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quantiles3.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quantiles4.png")

2.Quantiles in R

Base R has a function named quantile() that will quickly calculate the quantiles of a dataset for you.

In its simplest form, quantile() takes two arguments. The first is the dataset, given as a numeric vector. The second, named probs, is a single number or a vector of numbers between 0 and 1. These numbers represent the places in the data where you want to split.

For example, if you only wanted the value that split the first 10% of the data apart from the remaining 90%, you could use this code:

dataset <- c(5, 10, -20, 42, -9, 10)
ten_percent <- quantile(dataset, 0.10)
ten_percent
  10% 
-14.5 

ten_percent now holds the value -14.5. This result technically isn’t a quantile, because it isn’t splitting the dataset into groups of equal sizes — this value splits the data into one group with 10% of the data and another with 90%.
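Note that -14.5 does not appear anywhere in dataset: quantile() interpolates between neighboring sorted values. The sketch below reproduces the result by hand, assuming R's default method (type = 7). The names sorted, position, weight, and by_hand are illustrative only.

```r
dataset <- c(5, 10, -20, 42, -9, 10)
sorted <- sort(dataset)   # -20 -9 5 10 10 42
p <- 0.10

# Fractional position of the split point within the sorted data
position <- 1 + (length(sorted) - 1) * p   # 1.5, halfway between items 1 and 2
lower <- floor(position)
upper <- ceiling(position)
weight <- position - lower

# Linear interpolation between the two neighboring values
by_hand <- (1 - weight) * sorted[lower] + weight * sorted[upper]
by_hand
```

Halfway between -20 and -9 is -14.5, which matches the value returned by quantile(dataset, 0.10).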

However, it would still be useful if you were curious about whether a data point was in the bottom 10% of the dataset.
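One way to run that check is to compare a data point against the split value directly. In this minimal sketch, new_point is a hypothetical value chosen for illustration.

```r
dataset <- c(5, 10, -20, 42, -9, 10)

# unname() drops the "10%" label so the result is a plain number
ten_percent <- unname(quantile(dataset, 0.10))

# Hypothetical data point to test
new_point <- -16

# TRUE if new_point is at or below the value that cuts off the bottom 10%
in_bottom_tenth <- new_point <= ten_percent
in_bottom_tenth
```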

Instructions

[1] "The value that splits 23% of the data is 171.7812924"

3.Many Quantiles

In the last exercise, we found a single “quantile” — we split the first 23% of the data away from the remaining 77%.

However, quantiles are usually a set of values that split the data into groups of equal size. For example, if you wanted to get the 5-quantiles, or the four values that split the data into five groups of equal size, you could use this code:

dataset <- c(5, 10, -20, 42, -9, 10)
quintiles <- quantile(dataset, c(0.2, 0.4, 0.6, 0.8))
quintiles
20% 40% 60% 80% 
 -9   5  10  10 

Note that we had to do a little math in our head to make sure that the values c(0.2, 0.4, 0.6, 0.8) split the data into groups of equal size. Each group has 20% of the data.

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quantile5.png")
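Rather than working out the split points by hand, you can let R generate them. The sketch below is one possible approach; n_quantiles() is a hypothetical helper written for this example, not a base R function.

```r
# Hypothetical helper: return the n-quantiles of x,
# i.e. the n - 1 values that split x into n groups of equal size
n_quantiles <- function(x, n) {
  probs <- seq_len(n - 1) / n   # for n = 5: 0.2, 0.4, 0.6, 0.8
  quantile(x, probs)
}

dataset <- c(5, 10, -20, 42, -9, 10)
five_quantiles <- n_quantiles(dataset, 5)
five_quantiles
```

For n = 5 this should return the same four values as passing c(0.2, 0.4, 0.6, 0.8) directly.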

If we used the values c(0.2, 0.4, 0.7, 0.8), the function would return the four values at those split points. However, those values wouldn’t split the data into five equally sized groups. One group would only have 10% of the data and another group would have 30% of the data!

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quantiles6.png")
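A quick way to check whether a set of split points produces equal groups is to look at the gaps between them. This is a small sketch; group_shares is an illustrative name.

```r
# Share of the data in each group = distance between neighboring split points,
# with 0 and 1 added as the two ends of the data
equal_probs <- c(0.2, 0.4, 0.6, 0.8)
unequal_probs <- c(0.2, 0.4, 0.7, 0.8)

group_shares <- function(probs) diff(c(0, probs, 1))

group_shares(equal_probs)     # every group holds 20% of the data
group_shares(unequal_probs)   # the third group holds 30%, the fourth only 10%
```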

Instructions

1.Create a variable named quartiles that contains the quartiles of the songs dataset.

The quartiles of a dataset split the data into four groups of equal size. Each group should have 25% of the data, so you’ll want to use c(0.25, 0.5, 0.75) as the second parameter to the quantile() function.

     25%      50%      75% 
175.9342 222.8240 275.4738 
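A call along the following lines would produce the quartiles. Because songs.Rda is not available here, this sketch uses simulated song lengths as a stand-in (an assumption), so its numbers will differ from the printout above.

```r
# Stand-in for the lesson's songs data: simulated lengths in seconds (assumption)
set.seed(1)
songs <- runif(500, min = 60, max = 400)

# The three values that split the data into four groups of equal size
quartiles <- quantile(songs, c(0.25, 0.5, 0.75))
quartiles
```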

2.Create a variable named deciles. deciles should store the values that split the dataset into ten groups of equal size. Each group should have 10% of the data.

The first value should be at 10% of the data. The next value should be at 20% of the data. The final value should be at 90% of the data.

# define deciles here:
deciles <- quantile(songs, c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9))
deciles
     10%      20%      30%      40%      50%      60%      70%      80% 
135.1519 165.3755 185.9914 204.8152 222.8240 240.2216 262.4779 290.9514 
     90% 
348.4939 

3.Look at the printout of the deciles. If you had a song that was 170 seconds long, what tenth of the dataset would it fall in?

Create a variable named tenth and set it equal to 1 if you think the 170-second song would fall in the first tenth of the data. Set it equal to 2 if you think the song would fall in the second tenth of the data. If you think the song would fall in the final tenth of the data, set tenth equal to 10.
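If you would rather let R do the lookup, one possible approach uses base R's findInterval(). The sketch below types in the decile values from the printout above (rounded to four decimals) instead of loading the songs data.

```r
# Decile values copied from the printout above (rounded)
deciles <- c(135.1519, 165.3755, 185.9914, 204.8152, 222.8240,
             240.2216, 262.4779, 290.9514, 348.4939)

song_length <- 170

# findInterval() counts how many deciles are at or below the song length;
# adding 1 converts that count into the tenth the song falls in
tenth <- findInterval(song_length, deciles) + 1
tenth
```

Here 170 sits between the 20% value (165.3755) and the 30% value (185.9914), so the song lands in the third tenth.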

# ignore the code below here:

tryCatch(print(paste(c("The quartiles are",quartiles,collapse=" "))), error=function(e) {print("You haven't defined quartiles.")})

tryCatch(print(paste(c("The deciles are",deciles,collapse=" "))), error=function(e) {print("You haven't defined deciles.")})

[1] "The quartiles are" "175.93424" "222.82404"
[4] "275.47383" " "

[1] "The deciles are" "135.151876" "165.375546" "185.99138"
[5] "204.815222" "222.82404" "240.22159" "262.47791"

4.Common Quantiles

One of the most common quantiles is the 2-quantile. This value splits the data into two groups of equal size. Half the data will be above this value, and half the data will be below it. This is also known as the median!

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quantiles7.png")
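To see that the 2-quantile and the median are the same value, you can compare quantile() at 0.5 with base R's median(). This sketch reuses the small dataset from earlier in the lesson.

```r
dataset <- c(5, 10, -20, 42, -9, 10)

# The single value that splits the data into two groups of equal size
two_quantile <- unname(quantile(dataset, 0.5))

# Base R's median() gives the same value
middle <- median(dataset)

two_quantile
middle
```

Both should come out to 7.5, the midpoint of the two middle values 5 and 10.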

The 4-quantiles, or the quartiles, split the data into four groups of equal size. We found the quartiles in the previous exercise.

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quantiles8.png")

Finally, the percentiles, or the values that split the data into 100 groups, are commonly used to compare new data points to the dataset. You might hear statements like “You are above the 80th percentile in height”. This means that your height is above whatever value splits the first 80% of the data from the remaining 20%.
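A statement like that can be checked with a single comparison. In this minimal sketch, heights and my_height are made-up values for illustration.

```r
# Hypothetical heights in centimeters (illustrative values only)
heights <- c(150, 155, 160, 162, 165, 168, 170, 172, 175, 180, 185)
my_height <- 178

# Value that separates the first 80% of the heights from the remaining 20%
p80 <- unname(quantile(heights, 0.80))

# TRUE means my_height is above the 80th percentile
above_80th <- my_height > p80
above_80th
```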

Instructions

1.We won’t make you calculate all 99 percentiles, but let’s take a look at one. Find the value that separates the first 32% of the data from the rest.

Store that value in a variable named percentile.

# define percentile and answer here:
percentile <- quantile(songs, 0.32)
Error: object 'songs' not found
     32% 
189.9359 

2.Look at the printout. If you had a song that was exactly three minutes long, is that song above or below the 32nd percentile?

Create a variable named answer and set it equal to either “above” or “below”. Don’t forget to include the quotes!
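The comparison comes down to converting minutes to seconds and checking the result against the percentile. This sketch types in the value from the printout above (rounded) instead of loading the songs data.

```r
# 32nd percentile copied from the printout above (rounded)
percentile <- 189.9359

# Three minutes expressed in seconds, to match the units of the data
three_minutes <- 3 * 60

answer <- if (three_minutes > percentile) "above" else "below"
answer
```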

[1] "Your percentile is 189.93587"

5.Quantiles Review

Nice work! Here are some of the major takeaways about quantiles:

1.Quantiles are values that split a dataset into groups of equal size.

2.If you have n quantiles, the dataset will be split into n+1 groups of equal size.

3.The median is a quantile. It is the only 2-quantile. Half the data falls below the median and half falls above the median.

4.Quartiles and percentiles are other common quantiles. Quartiles split the data into 4 groups while percentiles split the data into 100 groups.
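The takeaways above can be checked on a small example. The sketch below uses the numbers 1 through 100 as a stand-in dataset; group_sizes() is a hypothetical helper written for this illustration.

```r
x <- 1:100

# Hypothetical helper: split x at its quantiles and count the values per group
group_sizes <- function(x, n_groups) {
  cuts <- quantile(x, seq_len(n_groups - 1) / n_groups)   # n_groups - 1 quantiles
  as.vector(table(cut(x, breaks = c(-Inf, cuts, Inf))))
}

group_sizes(x, 2)    # the median: 2 groups
group_sizes(x, 4)    # the quartiles: 4 groups
group_sizes(x, 10)   # the deciles: 10 groups
```

In each case the number of groups is one more than the number of quantiles, and every group holds the same number of values.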

Instructions

Below, we’ve shown three different histograms along with their deciles. Each histogram shows the SAT scores of the students that a fake university has accepted.

If you had an SAT score of 1350, which tenth of the data would you be in for each school? Which schools should you apply to? Would any of the schools be unrealistic options?

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quantiles9.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quantiles10.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Quantiles11.png")

