1.Variance Recap

When beginning to work with a dataset, one of the first things you might want to investigate is the spread: is the data close together or far apart? One tool in our statistics toolbelt for measuring spread is the descriptive statistic variance:

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Standard1.png")

By finding the variance of a dataset, we can get a numeric representation of the spread of the data.
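As a minimal sketch of that calculation, the snippet below computes a variance by hand on a small made-up vector (the values are illustrative, not the lesson data). It uses the divide-by-n form that matches the variance() helper appearing later in this lesson. R's built-in var() divides by n - 1 instead, so the two results differ.

```r
# Illustrative heights in inches (made-up values, not the lesson data)
heights <- c(70, 72, 75, 79, 84)

# Population variance: the average squared distance from the mean (divides by n)
population_variance <- mean((heights - mean(heights))^2)

# Built-in sample variance (divides by n - 1)
sample_variance <- var(heights)

population_variance  # 25.2
sample_variance      # 31.5
```

Both numbers are in inches squared, which is what makes them awkward to compare with the heights themselves.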

But what does that number really mean? How can we use this number to interpret the spread?

It turns out that variance isn’t necessarily the best statistic for describing spread. Luckily, there is another statistic, standard deviation, that can be used instead.

In this lesson, we’ll be working with two datasets. The first dataset contains the heights (in inches) of a random selection of players from the NBA. The second dataset contains the heights (in inches) of a random selection of users on the dating platform OkCupid.

Instructions

1.Try to answer the following questions:

What does it mean for the OkCupid dataset to have a larger variance than the NBA dataset?

What are the units of the mean? Is someone who is 80 inches tall taller than the average of either group? Which group(s)?

In this example, the units of variance are inches squared. Can you interpret what it means for the variance of the NBA dataset to be 13.32 inches squared?

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Standard2.png")

[1] “The mean of the NBA dataset is 77.984”

[1] “The mean of the OkCupid dataset is 68.414”

[1] “The variance of the NBA dataset is 13.323744”

[1] “The variance of the OkCupid dataset is 15.400604”

Because the OkCupid dataset has a larger variance than the NBA dataset, NBA players are more similar in height to each other. The heights of OkCupid users vary more.

Somebody who is 80 inches tall is above the average height of both datasets. We can directly compare 80 inches to the mean because they are in the same units (inches).

It’s a bit hard to wrap your head around how variance and mean are related, since they are in different units: the mean is in inches, while the variance is in inches squared.

2.Standard Deviation

Variance is a tricky statistic to use because its units are different from both the mean and the data itself. For example, the mean of our NBA dataset is 77.98 inches. Because the mean is in the same units as the data, we can say that someone who is 80 inches tall is about two inches taller than the average NBA player.

However, because the formula for variance includes squaring the difference between the data and the mean, the variance is measured in units squared. This means that the variance for our NBA dataset is 13.32 inches squared.

This result is hard to interpret in context with the mean or the data because their units are different. This is where the statistic standard deviation is useful.

Standard deviation is computed by taking the square root of the variance. The symbol σ (sigma) is commonly used for standard deviation. Conveniently, σ² (sigma squared) is the symbol commonly used for variance:

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Standard3.png")
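Written out, and assuming the divide-by-N (population) form used by the variance() helper later in this lesson, the relationship between the two statistics is:

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2,
\qquad
\sigma = \sqrt{\sigma^2}
```

Here x_i are the data points, μ is their mean, and N is the number of points. The square root undoes the squaring, so σ is in the same units as the data (inches in this lesson).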

In R, you can take the square root of a number using either ^ 0.5 or sqrt(); which one you use is up to you:

num <- 25
num_square_root <- num ^ 0.5

num_square_root
[1] 5
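For comparison, the same calculation with sqrt() gives the same answer. This is a small sketch reusing the value above; the variable names are introduced only for illustration.

```r
num <- 25

# Both expressions compute the square root of num
root_with_power <- num ^ 0.5
root_with_sqrt <- sqrt(num)

root_with_power  # 5
root_with_sqrt   # 5

# all.equal() compares numbers with a small tolerance for floating-point error
all.equal(root_with_power, root_with_sqrt)
```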

Instructions

1.We’ve written some code that calculates the variance of the NBA dataset and the OkCupid dataset.

The variances are stored in variables named nba_variance and okcupid_variance.

Calculate the standard deviation by taking the square root of nba_variance and store it in the variable nba_standard_deviation. Do the same for the variable okcupid_standard_deviation.

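# Importing data and calculating variance
load("lesson_data.Rda")
# Knit-time message (translated): Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
# This suggests lesson_data.Rda was not found in the working directory when the page was rendered.
# Population variance, matching the variance() helper defined later in this lesson
variance <- function(x) mean((x - mean(x))^2)
nba_variance <- variance(nba_data)
okcupid_variance <- variance(okcupid_data)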
# Change these variables to be the standard deviation of each dataset.
nba_standard_deviation <- sqrt(nba_variance)
okcupid_standard_deviation <- sqrt(okcupid_variance)

#IGNORE CODE BELOW HERE
print(paste("The standard deviation of the NBA dataset is", nba_standard_deviation))
print(paste("The standard deviation of the OkCupid dataset is", okcupid_standard_deviation))

[1] “The standard deviation of the NBA dataset is 3.65017040698102”

[1] “The standard deviation of the OkCupid dataset is 3.92436033004106”

3.Standard Deviation in R

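There is an R function dedicated to finding the standard deviation of a dataset, so we can cut out the step of first finding the variance. Note that sd() computes the sample standard deviation (it divides by n - 1 rather than n), so its result can differ slightly from the square root of a variance computed by dividing by n. The function takes a numeric vector as an argument and returns its standard deviation: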

dataset <- c(4, 8, 15, 16, 23, 42)
standard_deviation <- sd(dataset)

standard_deviation 
[1] 13.49074
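One detail worth checking: sd() divides by n - 1 (the sample standard deviation), whereas a variance computed as mean((x - mean(x))^2) divides by n. The sketch below reuses the vector above to show the two results and how they relate; sample_sd and population_sd are names introduced here for illustration.

```r
dataset <- c(4, 8, 15, 16, 23, 42)
n <- length(dataset)

# Sample standard deviation (divides by n - 1); this is what sd() returns
sample_sd <- sd(dataset)

# Population standard deviation (divides by n)
population_sd <- sqrt(mean((dataset - mean(dataset))^2))

sample_sd      # about 13.49
population_sd  # about 12.32

# The two differ only by a factor of sqrt((n - 1) / n)
all.equal(population_sd, sample_sd * sqrt((n - 1) / n))
```

With six points the gap is noticeable. With a large dataset the factor is close to 1, which would explain why the standard deviations printed in this exercise differ from those in the previous one only around the third decimal place.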

Instructions

1.We’ve removed the code that calculated the variance of each dataset. By using sd() we don’t need to take that middle step anymore.

Call sd() using nba_data as a parameter, and store the result in nba_standard_deviation.

Make a similar function call using okcupid_data and store the result in okcupid_standard_deviation.

# Change these variables to be the standard deviation of each dataset.
nba_standard_deviation <- sd(nba_data)
okcupid_standard_deviation <- sd(okcupid_data)

#IGNORE CODE BELOW HERE
print(paste("The standard deviation of the NBA dataset is", nba_standard_deviation))
print(paste("The standard deviation of the OkCupid dataset is", okcupid_standard_deviation))

[1] “The standard deviation of the NBA dataset is 3.65199686214009”

[1] “The standard deviation of the OkCupid dataset is 3.92632398306864”

4.Using Standard Deviation

Now that we’re able to compute the standard deviation of a dataset, what can we do with it?

Because the standard deviation is in the same units as the data, it is easier to interpret than the variance. By finding the number of standard deviations a data point is away from the mean, we can begin to investigate how unusual that data point truly is. In fact, for data that is roughly bell-shaped (normally distributed), you can expect around 68% of your data to fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Standard4.png")

If you have a data point that is over three standard deviations away from the mean, that’s an incredibly unusual piece of data!
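Those percentages come from the normal (bell-shaped) distribution, so they are approximations for real data. As a quick check under that assumption, base R's pnorm() gives the fraction of values falling within k standard deviations of the mean; within_k_sd is a helper name made up for this sketch.

```r
# Fraction of a normal distribution lying within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

round(within_k_sd(1:3), 4)
# [1] 0.6827 0.9545 0.9973
```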

Instructions

1.Let’s find out how many standard deviations away from the mean NBA great LeBron James is. To begin, let’s find the difference between LeBron’s height (80 inches) and the mean of each dataset.

Set nba_difference equal to 80 minus nba_mean.

Find the difference between LeBron’s height and the OkCupid mean and store it in okcupid_difference. The OkCupid dataset’s mean is stored in okcupid_mean.

# Importing data and calculating variance
load("lesson_data.Rda")
variance <- function(x) mean((x-mean(x))^2)

nba_mean <- mean(nba_data)
okcupid_mean <- mean(okcupid_data)
nba_standard_deviation <- sd(nba_data)
okcupid_standard_deviation <- sd(okcupid_data)

#Step 1: Calculate the difference between the player's height and the means
nba_difference <- 80 - nba_mean
okcupid_difference <- 80 - okcupid_mean

[1] 2.016

[1] 11.586

2.We now want to find out how many times the standard deviation goes into those differences.

Set num_nba_deviations equal to nba_difference divided by nba_standard_deviation.

Do a similar calculation for num_okcupid_deviations.

What does that first number tell you about how unusual LeBron James is in the NBA? What does the second number tell you about how unusual LeBron James is in the dating pool?


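#Step 2: Use the difference between the point and the mean to find how many standard deviations the player is away from the mean.
# Knit-time message (translated): Error: object 'nba_difference' not found
# (this chunk needs the Step 1 variables to be defined first)
num_nba_deviations <- nba_difference / nba_standard_deviation
num_okcupid_deviations <- okcupid_difference / okcupid_standard_deviation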

[1] “Your basketball player is 0.552026761276738 standard deviations away from the mean of NBA player heights,”

[1] “Your basketball player is 2.95085175089013 standard deviations away from the mean of OkCupid profile heights”
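The two steps can be combined into a single helper. This is a sketch, not part of the lesson code: deviations_from_mean is a hypothetical name, and the mean and standard deviation values are copied from the outputs printed earlier in this lesson rather than loaded from lesson_data.Rda.

```r
# Number of standard deviations a value lies from the mean
# (positive = above the mean, negative = below)
deviations_from_mean <- function(value, data_mean, data_sd) {
  (value - data_mean) / data_sd
}

# Summary statistics as printed earlier in the lesson
nba_mean <- 77.984
nba_standard_deviation <- 3.65199686214009
okcupid_mean <- 68.414
okcupid_standard_deviation <- 3.92632398306864

deviations_from_mean(80, nba_mean, nba_standard_deviation)          # about 0.55
deviations_from_mean(80, okcupid_mean, okcupid_standard_deviation)  # about 2.95
```

Because R arithmetic is vectorized, passing c(80, 65) as value would return the results for both heights at once.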

3.Let’s check another NBA player. Earl Boykins is one of the shortest players in NBA history at 5’5” (65 inches). Replace LeBron James’ 80 inches with Earl Boykins’ 65.

What can you say about how unusual Earl Boykins is with respect to the two different datasets?

We were surprised that Boykins wasn’t further from the mean of the OkCupid dataset, as measured in standard deviations. Think about why he isn’t more of an outlier in this dataset.

#Step 1: Calculate the difference between the player's height and the means
nba_difference <- 65 - nba_mean
okcupid_difference <- 65 - okcupid_mean

nba_difference
okcupid_difference

#Step 2: Use the difference between the point and the mean to find how many standard deviations the player is away from the mean.
num_nba_deviations <- nba_difference / nba_standard_deviation
num_okcupid_deviations <- okcupid_difference / okcupid_standard_deviation

num_nba_deviations
num_okcupid_deviations

#IGNORE CODE BELOW HERE
print(paste("Your basketball player is", num_nba_deviations, "standard deviations away from the mean of NBA player heights,"))
print(paste("Your basketball player is", num_okcupid_deviations, "standard deviations away from the mean of OkCupid profile heights"))

[1]nba_difference : -12.984

[1]okcupid_difference : -3.414

[1]num_nba_deviations : -3.555315

[1]num_okcupid_deviations : -0.8695156

[1] “Your basketball player is -3.5553152125085 standard deviations away from the mean of NBA player heights,”

[1] “Your basketball player is -0.869515611733031 standard deviations away from the mean of OkCupid profile heights”

5.Review

In the last exercise you saw that LeBron James was 0.55 standard deviations above the mean of NBA player heights. He’s taller than average, but compared to the other NBA players, he’s not absurdly tall.

However, compared to the OkCupid dating pool, he is extremely rare! He’s almost three full standard deviations above the mean. If heights are roughly normally distributed, you’d expect only about 0.15% of people on OkCupid to be more than 3 standard deviations above the mean.

This is the power of standard deviation. By taking the square root of the variance, the standard deviation gives you a statistic about spread that can be easily interpreted and compared to the mean.
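To put a rough number on how rare a height is, one option is a normal approximation. This is an assumption: real heights are only approximately normal, so treat the results as ballpark figures. pnorm() with lower.tail = FALSE returns the fraction of a normal distribution lying above a given number of standard deviations. The variable names below are introduced for illustration.

```r
# Fraction of a normal distribution more than 3 standard deviations ABOVE the mean
above_three_sd <- pnorm(3, lower.tail = FALSE)

# Fraction more than 3 standard deviations away in EITHER direction
beyond_three_sd <- 2 * above_three_sd

# LeBron's distance from the OkCupid mean, as printed in the previous exercise
lebron_okcupid_deviations <- 2.95085175089013
above_lebron <- pnorm(lebron_okcupid_deviations, lower.tail = FALSE)

round(100 * above_three_sd, 3)   # percent above +3 standard deviations
round(100 * beyond_three_sd, 3)  # percent beyond 3 standard deviations on either side
round(100 * above_lebron, 3)     # percent above LeBron's height under this approximation
```

Under this approximation, roughly 0.13% of values lie more than three standard deviations above the mean, and about 0.27% lie that far from the mean in either direction.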

Instructions

We’ve created a visualization that shows the mean and the first, second, and third standard deviations of each dataset.

Compare your height (in inches) to the heights of NBA players and OkCupid daters.

You’re likely more than one standard deviation below the NBA mean. Are you more than two standard deviations below it? How unusual is your height compared to the OkCupid members?

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Standard5.png")
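The same comparison can also be sketched numerically. The helper below (sd_bands, a hypothetical name) lists the heights at one, two, and three standard deviations on either side of each mean, using the summary values printed earlier in this lesson. my_height is a placeholder to replace with your own height in inches.

```r
# Heights at the mean and at 1, 2, and 3 standard deviations on either side of it
sd_bands <- function(data_mean, data_sd) {
  k <- -3:3
  bands <- data_mean + k * data_sd
  names(bands) <- paste0(ifelse(k >= 0, "+", ""), k, " sd")
  round(bands, 1)
}

# Summary statistics as printed earlier in the lesson
nba_bands <- sd_bands(77.984, 3.65199686214009)
okcupid_bands <- sd_bands(68.414, 3.92632398306864)

nba_bands
okcupid_bands

# Placeholder value: replace with your own height in inches
my_height <- 68
(my_height - 77.984) / 3.65199686214009  # standard deviations from the NBA mean
(my_height - 68.414) / 3.92632398306864  # standard deviations from the OkCupid mean
```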

