library(ggplot2)
Lesson 3 - Mean, Median, and Mode: Measures of Central Tendency
These notes have been taken while studying the Crash Course: Statistics course on YouTube.
Mean
What is the average, or mean, number of feet people have? It turns out to be a little less than 2, because the average takes into account the small number of people with fewer than 2 feet. So, if you have 2 feet, you have more than the average number of feet.
The mean (also called the average or expectation) takes the sum of all the numbers in a data set and divides it by the number of data points.
x <- 1:50
mean(x)
[1] 25.5
x <- c(you = 10, friend = 20)
mean(x)
[1] 15
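As a quick check of the definition, the mean can be computed by hand as the sum divided by the count. The sketch below also revisits the feet example; the population counts in `feet` are hypothetical, made up only for illustration.

```r
# Mean by hand: sum of the values divided by how many there are
x <- 1:50
manual_mean <- sum(x) / length(x)
manual_mean  # 25.5, the same value mean(x) gives

# Hypothetical population (made-up counts, for illustration only):
# 998 people with 2 feet, 1 person with 1 foot, 1 person with 0 feet
feet <- c(rep(2, 998), 1, 0)
mean(feet)  # 1.997, slightly below 2
```

Anyone with 2 feet in this made-up population is above the mean, which is the point of the example.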
Normal distribution: a distribution that has roughly the same amount of data on either side of the middle and has its most common values around the middle of the data.
A distribution shows us how often each value occurs in our data set, which is known as its frequency.
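To make "frequency" concrete, here is a minimal sketch using base R's `table()`. The values in `rolls` are hypothetical and exist only for illustration.

```r
# Hypothetical data: ten rolls of a die (made-up values)
rolls <- c(1, 3, 3, 4, 4, 4, 5, 5, 6, 6)

# table() counts how often each distinct value occurs (its frequency)
freq <- table(rolls)
freq

# Relative frequencies: the share of the data taken by each value
prop.table(freq)
```

A histogram is essentially a picture of these counts, with nearby values grouped into bins.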
set.seed(123)
# Draw 41 values from a normal distribution with mean 10 and sd 0.5
x <- rnorm(41, mean = 10, sd = 0.5)
x_df <- data.frame(x)
ggplot(x_df, aes(x)) +
  geom_histogram(
    binwidth = 0.5,
    fill = "blue"
  ) +
  theme_bw()
Median
To give unusually large or small values, also called outliers, less influence on our measure of where the center of our data is, we can use the median.
Unlike the mean, the median doesn’t use the value of every data point in its calculation.
The median is the middle number if we lined up our data from smallest to largest.
Median of a vector with an even number of elements
- The median is equal to the mean of the two middle numbers:
x <- 1:10
x
 [1]  1  2  3  4  5  6  7  8  9 10
length(x)
[1] 10
median(x)
[1] 5.5
Median of a vector with an odd number of elements
- The median is the single middle number:
y <- 1:11
y
 [1]  1  2  3  4  5  6  7  8  9 10 11
length(y)
[1] 11
median(y)
[1] 6
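The two cases above can be sketched as one small function. This is an illustrative re-implementation, not how R's built-in `median()` is written; the name `manual_median` is made up for this example.

```r
# A hand-rolled median, for illustration only
manual_median <- function(v) {
  s <- sort(v)  # line the values up from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                # odd length: the single middle value
  } else {
    mean(s[c(n / 2, n / 2 + 1)])  # even length: mean of the two middle values
  }
}

manual_median(1:10)  # 5.5
manual_median(1:11)  # 6
```

Only the middle one or two sorted values enter the calculation, which is why extreme values at either end have so little influence.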
artist_salary <- rep(20000, 10)
mean(artist_salary)
[1] 20000
median(artist_salary)
[1] 20000
elon_musk_salary <- 100000000
total_salary <- c(
  artist_salary,
  elon_musk_salary
)
mean(total_salary)
[1] 9109091
median(total_salary)
[1] 20000
- The median is still the same, while the mean is distorted by the outlier.
The Mode
Total 400 reviews:
200 five-star reviews
200 one-star reviews
book_review <- rep(c(5, 1), each = 200)  # 200 five-star and 200 one-star reviews
- The mean number of stars given was 3, but no one in our sample actually gave the book 3 stars, just like no one could actually have the median of 2.5 cats.
mean(book_review)
[1] 3
median(book_review)
[1] 3
In both of these situations, it can be useful to look at the mode.
The word mode comes from the Latin word modus, which means “manner, fashion, or style” and gives us the French expression à la mode, meaning fashionable.
Just like the most popular and fashionable trends, the mode is the most popular value.
Mode is the value that appears most in our data set.
For the book reviews, both 5 and 1 are modes. Data like this is called “bimodal” because there are two values that are most common.
Bimodal data is an example of “multimodal” data which has many values that are similarly common. Usually multimodal data results from two or more underlying groups all being measured together.
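A minimal sketch of how two underlying groups can produce two peaks when measured together. The groups and their parameters are hypothetical (made-up heights in cm), and the seed is arbitrary.

```r
set.seed(42)

# Two hypothetical groups measured together (made-up heights in cm)
group_a <- rnorm(500, mean = 160, sd = 5)
group_b <- rnorm(500, mean = 180, sd = 5)
heights <- c(group_a, group_b)

# The overall mean lands between the groups, where relatively few values lie
mean(heights)

# A histogram of the combined data shows two peaks, one per group
hist(heights, breaks = 30, main = "Two groups measured together", xlab = "height")
```

Here the mean describes neither group well, while the two peaks hint that the sample mixes different populations.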
The mode is always a value that actually appears in the data, unlike the median and mean, which can give us numbers that never occur in the data and don’t describe it very well.
The mode is most useful when you have a relatively large sample so that you have a large number of the popular values.
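find_mode <- function(x) {
  u <- unique(x)                # distinct values, in order of first appearance
  tab <- tabulate(match(x, u))  # how many times each distinct value occurs
  u[tab == max(tab)]            # keep every value tied for the highest count
}
find_mode(book_review)
[1] 5 1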
- One other benefit of the mode is that it can be used with data that isn’t numeric.
charv <- c("o", "it", "the", "it", "it")
find_mode(charv)
[1] "it"
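An alternative sketch of the same idea built on base R's `table()`. The name `table_mode` is made up for this example. One assumption to keep in mind: because the result comes from the table's names, it is always a character vector, even for numeric input.

```r
# Alternative sketch: find the mode(s) with table()
table_mode <- function(x) {
  counts <- table(x)  # frequency of each distinct value
  names(counts)[as.vector(counts) == max(counts)]  # every value tied for the top count
}

table_mode(c("o", "it", "the", "it", "it"))  # "it"
table_mode(c(5, 5, 1, 1))                    # "1" "5" (as character, in sorted order)
```

Unlike this version, `find_mode()` keeps the original data type, which may matter when the mode of numeric data is used in further calculations.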
- In the normal distribution mentioned earlier, the mean, median, and mode are all the same.
ggplot(x_df, aes(x)) +
  geom_histogram(
    binwidth = 0.5,
    fill = "blue"
  ) +
  theme_bw() +
  labs(title = "Normal distribution")
The middle value of the data (the median) is also the most common value (the mode) and is the peak of the distribution.
The fact that the median and mean are the same tells us that the distribution is symmetric:
- There are equal amounts of data on either side of the median, and equal amounts on either side of the mean.
Statisticians say the normal distribution has zero skew, since the mean and median are the same.
When the median and mean are different, a distribution is skewed, which means there are some unusually extreme values, either large or small, on one side of the distribution.
With a skewed distribution, the mode will still be the highest point on the distribution, and the median will stay in the middle, but the mean will be pulled towards the unusual values.
So, if the mean is a lot higher than the median and mode, that tells you there are one or more relatively large values in your data set.
set.seed(123)
x_right <- rgamma(
  1000,
  shape = 2,
  scale = 5
)
right_df <- data.frame(x_right)
mean(x_right)
[1] 9.399825
median(x_right)
[1] 8.125601
ggplot(right_df, aes(x_right)) +
  geom_histogram(
    bins = 20,
    fill = "blue"
  ) +
  theme_bw() +
  labs(
    title = "Right-skewed distribution",
    x = "x"
  )
- A mean that’s a lot lower than your median and mode tells you there are one or more relatively small values in your data set.
- Statistics can be simultaneously true and deceptive. An important part of statistics is understanding which questions you are trying to answer and whether the information you have actually answers those questions.
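For the left-skewed case, where the mean falls below the median, a minimal sketch is to mirror a right-skewed gamma sample. The constant 50 is an arbitrary choice made for this illustration, and `x_left` is a made-up name.

```r
set.seed(123)

# Mirror a right-skewed gamma sample to get a long tail on the left
# (the constant 50 is arbitrary, chosen only for illustration)
x_left <- 50 - rgamma(1000, shape = 2, scale = 5)

mean(x_left)    # pulled toward the small values in the left tail
median(x_left)  # stays closer to the bulk of the data

mean(x_left) < median(x_left)  # TRUE
```

Comparing the mean with the median this way gives a quick hint about the direction of the skew before plotting anything.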