---
title: "Interquartile Range"
author: "Annabel Kuo"
date: "`r format(Sys.time(), '%Y-%m-%d %H:%M')`"
output: html_notebook
---

# 1. Range Review

One of the most common statistics to describe a dataset is the range. The range of a dataset is the difference between the maximum and minimum values. While this descriptive statistic is a good start, it is important to consider the impact outliers have on the results:

```{r Interquartile1, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Interquartile1.png")
```

In this image, most of the data is between 0 and 15. However, there is one large negative outlier (-20) and one large positive outlier (40). This makes the range of the dataset 60 (the difference between 40 and -20). That’s not very representative of the spread of the majority of the data!
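
To make the effect concrete, here is a small sketch with made-up values chosen to match the description above (most values between 0 and 15, plus outliers at -20 and 40). The vector `example_data` is hypothetical and is not the data behind the image.

```{r}
# hypothetical values: mostly between 0 and 15, plus two outliers
example_data <- c(-20, 0, 2, 5, 7, 8, 10, 12, 15, 40)

# range of the full dataset
full_range <- max(example_data) - min(example_data)
full_range

# range after dropping the two outliers
trimmed <- example_data[example_data > -20 & example_data < 40]
trimmed_range <- max(trimmed) - min(trimmed)
trimmed_range
```

In this sketch, two values out of ten are enough to make the range four times larger (60 versus 15).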

The interquartile range (IQR) is a descriptive statistic that tries to solve this problem. The IQR ignores the tails of the dataset, so it tells you how spread out the middle half of your data is.

In this lesson, we’ll teach you how to calculate the interquartile range and interpret it.

## Instructions

1. We’ve imported a dataset of song lengths (measured in seconds) and plotted a histogram.

It looks like there are some outliers — this might be a good opportunity to use the IQR.

Before we do that, let’s calculate the range. We’ve found the maximum and minimum values of the dataset and stored them in variables named maximum and minimum.

Create a variable named song_range and set it equal to the difference between the maximum and the minimum.

```{r message=FALSE, warning=FALSE}
# load libraries
library(ggplot2)
```

```{r}
# load song data
load("songs.Rda")
```

```{r}
# find maximum and minimum song lengths
maximum <- max(songs)
minimum <- min(songs)
```

```{r}
# create variable song_range here:
song_range <- maximum - minimum
song_range
```

[1] 983.5102

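```{r message=FALSE, echo=FALSE}
# plot histogram
# qplot() is deprecated as of ggplot2 3.4.0, so the plot is built with ggplot();
# this assumes songs is a numeric vector, as the quantile() calls below imply
song_hist <- ggplot(data.frame(song_length = songs), aes(x = song_length)) +
  geom_histogram(fill = "blue", colour = "red", alpha = 0.2) +
  labs(title = "Histogram of Song Lengths",
       x = "Song Length (Seconds)",
       y = "Count")
song_hist
```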

```{r Interquartile2, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Interquartile2.png")
```

```{r}
# ignore the code below here:

tryCatch(print(paste("The range of the dataset is",song_range)), error=function(e) {print("You haven't defined the variable song_range yet")})
```

[1] "The range of the dataset is 983.51021"

# 2. Quartiles

The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1). If you need a refresher on quartiles, you can take a look at our lesson on quartiles.

For now, all you need to know is that the first quartile is the value that separates the lowest 25% of the data from the remaining 75%.

The third quartile is the opposite: it separates the lowest 75% of the data from the highest 25%.

```{r Interquartile3, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Interquartile3.png")
```

The interquartile range is the difference between these two values.
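
As a rough illustration of these definitions, the sketch below uses a hypothetical vector, the whole numbers from 0 to 100, so the quartiles are easy to check by eye. The names `example_data` and `quartiles` are illustrative and are not part of the lesson's dataset.

```{r}
# hypothetical data: the whole numbers from 0 to 100
example_data <- 0:100

# quantile() accepts several probabilities at once
quartiles <- quantile(example_data, c(0.25, 0.75))
quartiles

# share of values at or below each quartile (close to 0.25 and 0.75)
mean(example_data <= quartiles[1])
mean(example_data <= quartiles[2])

# difference between the two quartiles, with the "75%" label dropped
unname(quartiles[2] - quartiles[1])
```

The shares are only approximately 25% and 75% because the dataset is finite; with 101 values, the first quartile itself counts toward the lower group.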

## Instructions

1. We’ve calculated the first quartile of songs and stored it in the variable q1.

Calculate the third quartile and store it in a variable named q3.

To calculate the third quartile, call the same function, but change the second argument to 0.75.

```{r}
# load song data
load("songs.Rda")
```

```{r}
# find the first quartile
q1 <- quantile(songs,0.25)
q1
```

     25% 
175.9342 

```{r}
# calculate the third quartile here:
q3 <- quantile(songs,0.75)
q3
```

     75% 
275.4738 

2. Now that we have both the first quartile and the third quartile, let’s calculate the IQR.

Create a variable named interquartile_range and set it equal to the difference between q3 and q1.

```{r}
# calculate the interquartile range here:
interquartile_range <- q3 - q1
interquartile_range
```

     75% 
99.53959 

```{r}
# ignore the code below here:

tryCatch(print(paste("The first quartile of the dataset is",q1)), error=function(e) {print("You haven't defined q1 yet")})

tryCatch(print(paste("The third quartile of the dataset is",q3)), error=function(e) {print("You haven't defined q3 yet")})

tryCatch(print(paste("The IQR of the dataset is",interquartile_range)), error=function(e) {print("You haven't defined interquartile_range yet")})
```

[1] "The first quartile of the dataset is 175.93424"

[1] "The third quartile of the dataset is 275.47383"

[1] "The IQR of the dataset is 99.53959"

# 3. IQR in R

In the last exercise, we calculated the IQR by finding the quartiles using R and finding the difference ourselves. The stats package, which R loads by default, has a function that can calculate the IQR in one step.

The IQR() function takes a numeric vector as an argument and returns its interquartile range.

```{r}
dataset <- c(4, 10, 38, 85, 193)
interquartile_range <- IQR(dataset)
interquartile_range
```
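
As a quick check, the sketch below compares IQR() with the manual calculation from the previous exercise on the same five values. The names `manual_iqr` and `with_missing` are illustrative. It also shows the `na.rm` argument, which tells IQR() to drop missing values before computing.

```{r}
dataset <- c(4, 10, 38, 85, 193)

# manual calculation: Q3 minus Q1, with the "75%" label removed
manual_iqr <- unname(quantile(dataset, 0.75) - quantile(dataset, 0.25))

# IQR() gives the same result in one step
all.equal(manual_iqr, IQR(dataset))

# with a missing value, IQR() needs na.rm = TRUE to return a number
with_missing <- c(dataset, NA)
IQR(with_missing, na.rm = TRUE)
```

Unlike the manual subtraction, IQR() returns a plain number without the "75%" label.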

## Instructions

1. Let’s calculate the IQR again, but this time, use the IQR() function.

Create a variable named interquartile_range and set it equal to the result of calling IQR() using songs as an argument.

```{r}
# create the variable interquartile_range here
interquartile_range <- IQR(songs)
interquartile_range
```

[1] 99.53959

```{r}
# ignore the code below here:

tryCatch(print(paste("The IQR of the dataset is",interquartile_range)), error=function(e) {print("You haven't defined interquartile_range yet")})
```

[1] "The IQR of the dataset is 99.53959"

# 4. Review

Nice work! You can now calculate the interquartile range of a dataset using R. The main takeaway is that the IQR, like the range, is a statistic that describes spread. Specifically, it describes the spread of the middle half of the data.

However, unlike the range, the IQR is robust. A statistic is robust when outliers have little impact on it. For example, the IQRs of the two datasets below are identical, even though one has a massive outlier.

dataset_one = c(6, 9, 10, 45, 190, 200) # IQR is 144.5

dataset_two = c(6, 9, 10, 45, 190, 20000000) # IQR is 144.5
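
Running the two datasets above through R is one way to see this for yourself. The sketch below prints the range and the IQR of each, using only max(), min(), and IQR().

```{r}
dataset_one <- c(6, 9, 10, 45, 190, 200)
dataset_two <- c(6, 9, 10, 45, 190, 20000000)

# the range changes enormously when the maximum becomes an outlier
max(dataset_one) - min(dataset_one)
max(dataset_two) - min(dataset_two)

# the IQR is the same for both datasets
IQR(dataset_one)
IQR(dataset_two)
```

The IQR does not move because neither quartile is computed from the largest value: with six values, Q1 falls between the second and third values and Q3 between the fourth and fifth.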

By looking at the IQR instead of the range, you can get a better sense of the spread of the middle of the data.

The interquartile range is displayed in a commonly used graph: the box plot.

```{r Interquartile4, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Interquartile4.png")
```

In a box plot, the ends of the box are Q1 and Q3. So the length of the box is the IQR.
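
One way to check this numerically, sketched here with base R rather than a plotting package, is to ask boxplot.stats() for the values a box plot would draw. Its `stats` element holds five numbers: the lower whisker, the lower edge of the box, the median, the upper edge of the box, and the upper whisker. The variable names are illustrative. Note that boxplot.stats() computes the box edges as "hinges", which match quantile() for this dataset but can differ slightly for some sample sizes.

```{r}
example_data <- c(-500000, -50, -24, -13, -2, 0, 12, 15, 18, 73, 90, 100, 100000)

# five numbers used to draw a box plot
box_values <- boxplot.stats(example_data)$stats
box_values

# length of the box: upper edge minus lower edge
box_length <- box_values[4] - box_values[2]
box_length

# compare with the IQR
IQR(example_data)
```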

## Instructions

We’ve set up a small dataset and are printing its range and IQR.

Try changing the maximum number in the dataset to different values.

What happens to the range when you make the maximum value 100000? What happens to the IQR?

Try changing the minimum value to be more of an outlier as well.

```{r}
# small dataset
dataset <- c(-500000, -50, -24, -13, -2, 0, 12, 15, 18, 73, 90, 100, 100000)

# calculate range and IQR
dataset_range <- max(dataset) - min(dataset)
dataset_iqr <- IQR(dataset)

# print range and IQR
# format() keeps large ranges out of scientific notation (600000, not 6e+05)
print(paste("The range of the dataset is", format(dataset_range, scientific = FALSE)))
print(paste("The IQR of the dataset is", dataset_iqr))
```

