STM1001 Topic 2 Lecture

class: middle
background-image: url(data:image/png;base64,#LTU_logo.jpg)
background-position: top left
background-size: 30%

# STM1001 [Topic 2](https://bookdown.org/a_shaker/STM1001_Topic_2/) Lecture
## Descriptive Statistics
### La Trobe University
This lecture complements the [Topic 2 readings](https://bookdown.org/a_shaker/STM1001_Topic_2/)

---

# Topic 2: Related Links

.pull-left[
## Readings
[Topic 2 Readings](https://bookdown.org/a_shaker/STM1001_Topic_2/)

## Notation

[General notation used throughout STM1001](https://bookdown.org/a_shaker/STM1001_Topic_0/notation-summary.html#general)

]

.pull-right[
## Maths Background

* [Order of operations](https://bookdown.org/a_shaker/STM1001_Topic_0/maths-background.html#orderofoperations)
* [Negative numbers](https://bookdown.org/a_shaker/STM1001_Topic_0/maths-background.html#negative-numbers)
* [Summation](https://bookdown.org/a_shaker/STM1001_Topic_0/maths-background.html#summation)
* [Calculating means](https://bookdown.org/a_shaker/STM1001_Topic_0/maths-background.html#calculating-means)
* [Calculating variance (Optional extension)](https://bookdown.org/a_shaker/STM1001_Topic_0/maths-background.html#calculating-variance)
* [Squares, square roots and powers](https://bookdown.org/a_shaker/STM1001_Topic_0/maths-background.html#squares-square-roots-and-powers)
* [Absolute value](https://bookdown.org/a_shaker/STM1001_Topic_0/maths-background.html#absolute-value)
* [Graphs](https://bookdown.org/a_shaker/STM1001_Topic_0/maths-background.html#graphs)
]

---

name: stat
class: middle
background-image: url(data:image/png;base64,#slide_1.png)
background-size: 110%

---

name: stat
class: middle
background-image: url(data:image/png;base64,#slide_4.png)
background-size: 100%

---

# Topic 2: Descriptive Statistics

**Overview**

---

# Measures of location: Mean and median

* ***Measures of location***, or ***measures of central tendency***, are designed to tell us what a 'typical' value is in a given set of data

* Three common measures of location:
  * Mean
  * Median
  * Mode

* Some other useful summary statistics
  * Quantile
  * Percentile
  * Minimum
  * Maximum

---

# Mean

* The first measure we will look at today is the ***mean***, often referred to as the ***average***

* To calculate the mean, add up all of the given values, and then divide that sum by the number of values

* For example, suppose the following five numbers represent the number of visits to a physical Kmart store in the past year for a sample of `$n = 5$` STM1001 students:

`$$2, 50, 300, 10, 25$$`
--

* We can then calculate the mean by adding up the values and then dividing by `$n = 5$`:

`\begin{align}
\text{mean number of Kmart visits} &= (2 + 50 + 300 + 10 + 25) \div 5 \\ 
&= 387 \div 5 \\ 
&= 77.4
\end{align}`
---

# Mean: Some notation

* If we denote any one of the `$n = 5$` values by `$x_i$`, where `$i$` can take any value from `$1$` to `$n = 5$`, we can write each value as:

`$$x_1 = 2; \; x_2 = 50; \; x_3 = 300; \; x_4 = 10; \; x_5 = 25.$$`

* The mean of a sample of numbers is called the ***sample mean***, and is usually denoted `$\overline{x}$`, pronounced "x bar"

* The formula for the sample mean can then be given as

`\begin{align}
\overline{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i,
\end{align}`

* where `$\sum$` is a summation sign. If we read out  `$\displaystyle \sum_{i=1}^{n}$` in words, we would say, "the sum from `$i=1$` to `$n$`".

* In other words, this formula is telling us to add up the values `$x_1$` up to `$x_n$`, and then divide that sum by `$n$`: exactly what we have done when calculating the sample mean of Kmart visits
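
As a minimal sketch, the sample mean formula can be reproduced in a few lines of Python using the Kmart visit numbers from earlier. The variable names are chosen for illustration only.

```python
# Number of Kmart visits for the sample of n = 5 students
visits = [2, 50, 300, 10, 25]

n = len(visits)        # number of values, n
total = sum(visits)    # the sum of x_1 up to x_n
x_bar = total / n      # sample mean: divide the sum by n

print(total)  # 387
print(x_bar)  # 77.4
```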

---

# Mean: Some notation

* As we had a ***sample*** of STM1001 students, we calculated the ***sample mean***, `$\overline{x}$`

* The ***population mean*** is usually denoted `$\mu$`.

* Usually, we do not know what the true value of `$\mu$` is, but we can use the ***sample mean***, `$\overline{x}$`, to estimate it

---

# Median

* The ***median*** is simply the 'middle' value, meaning that 50% of the values are higher, and 50% lower, than the median

* To calculate the median:
  1. List the values in order from lowest to highest
  
--

2. Then, if there is an odd number of values, the median will be the middle value. If there is an even number of values, the median will be the mean of the middle two values.

* For example, we can list our Kmart visit numbers in order as:

`$$2, 10, 25, 50, 300,$$`

so that the median is the middle value, 25.

---

# Median

* Now suppose that our sample includes one additional value of 30 visits, so that `$n = 6$`. The numbers of visits, listed in order, are:

`$$2, 10, 25, 30, 50, 300.$$`
--

* Since `$n = 6$` is an even number, to find the median, we need to find the mean of the middle two values (25 and 30), so that the median is:

`$$(25 + 30) \div 2 = 27.5.$$`
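
The two steps for finding the median can be sketched in Python as follows. The function name `median` is our own choice for this illustration, and it is applied to both the `$n = 5$` and `$n = 6$` Kmart samples.

```python
def median(values):
    """Return the median using the two steps for calculating the median."""
    ordered = sorted(values)   # Step 1: list the values from lowest to highest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Step 2 (odd n): the median is the middle value
        return ordered[middle]
    # Step 2 (even n): the median is the mean of the middle two values
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([2, 50, 300, 10, 25]))      # 25
print(median([2, 50, 300, 10, 25, 30]))  # 27.5
```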

---

# Mode

* The ***mode*** is the most commonly occurring value in a given set of values.

* For example, suppose `$n=10$` randomly selected students were asked the question, *how many siblings do you have?*, with responses as follows:

`$$2, 0, 1, 1, 3, 1, 2, 3, 4, 0$$`

* Arranging these responses into a frequency table allows us to more easily see what the mode is:

|Number of Siblings | Frequency|
|:------------------|---------:|
|0                  |         2|
|1                  |         3|
|2                  |         2|
|3                  |         2|
|4                  |         1|

* Since the most commonly occurring response was `$1$` *sibling*, with `$3$` responses, we can say that the ***mode*** is `$1$`, with a frequency of `$3$`.
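
As a small illustration, the frequency table and the mode for these responses can be obtained with Python's built-in `Counter`. Note that `most_common(1)` reports only one value, so this sketch assumes there is a single mode.

```python
from collections import Counter

# Responses to "how many siblings do you have?" (n = 10)
siblings = [2, 0, 1, 1, 3, 1, 2, 3, 4, 0]

frequencies = Counter(siblings)                  # frequency table
mode, frequency = frequencies.most_common(1)[0]  # most common value and its count

print(sorted(frequencies.items()))  # [(0, 2), (1, 3), (2, 2), (3, 2), (4, 1)]
print(mode, frequency)              # 1 3
```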

---

# Mode

We can also determine the mode by viewing a histogram. For example, suppose `$n=100$` students were asked the question, *how many siblings do you have?*, with responses represented in the histogram below:

As we can see, the mode is now `$2$` siblings, with a frequency of `$33$`.

---

# Mode

* Where there is one mode, we have a ***unimodal*** distribution

* Sometimes, there is more than one mode, which can lead to either a ***bi-modal*** or ***multi-modal*** distribution:

---

# Quantiles and Percentiles

* A ***quantile*** is the value below which a given proportion (or percentage) of the data falls.

* For example, recall the `Height` variable from the `survey` data set which contains the responses of Statistics students to a set of questions (Venables and Ripley, 1999).

* Suppose we wanted to know what height a student would need to be to be in the shortest 10% of the sample

* It turns out that students whose height is less than 160cm are among the shortest 10% of the sample (see next slide)

* We can therefore say that 160cm is the _**0.1 quantile**_, or equivalently, the _**10% quantile**_, or the _**10th percentile**_.

* The _**50% quantile**_, or the _**50th percentile**_, is in fact the _**median**_.
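
As a rough check that the 50% quantile and the median coincide, the sketch below uses Python's standard `statistics.quantiles` function on the Kmart visit numbers from earlier (the `Height` data are not reproduced here). Software packages use slightly different conventions for calculating quantiles, so the `method="inclusive"` setting is an assumption made for this illustration.

```python
import statistics

visits = [2, 50, 300, 10, 25]

# Splitting the data into 2 equal parts gives one cut point: the 50% quantile
q50 = statistics.quantiles(visits, n=2, method="inclusive")[0]

print(q50)                        # 25.0
print(statistics.median(visits))  # 25
```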

---

# Quantiles and Percentiles

---

# Minimum and Maximum

* The minimum and maximum values are also useful pieces of information for us to know

* Like the median, they are most easily determined by listing the data in order from smallest to largest

* Consider again the `$n=5$` Kmart visit numbers we considered earlier, listed in order as
`$$2, 10, 25, 50, 300.$$`

The minimum and maximum values are then 2 and 300 respectively.

---

name: menti
class: middle
background-image: url(data:image/png;base64,#menti.jpg)
background-size: 115%

# Kahoot!

## Quick recap of [Topic 1](https://bookdown.org/a_shaker/STM1001_Topic_1/) 
## and today's lecture so far.

## Go to [www.kahoot.it](https://www.kahoot.it) and use

## the code provided

---

# Measures of spread

* ***Measures of spread***, or ***measures of variability***, tell us how spread out a distribution is

* For example, consider the two histograms below, where the distribution of data in Histogram A is more spread out than that in Histogram B:

---

# Measures of spread

* Measures of spread can help us to quantify these types of differences in variability.

* The measures of spread we will consider in this section are the ***variance***, ***standard deviation***, ***interquartile range***, and ***range***. We will also consider the concept of ***quartiles***

---

# Variance

* The ***variance*** tells us how spread out a given distribution of data is

* The **sample variance** is normally denoted by `$s^2$` and defined as:

`$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\overline{x})^2,$$`

where `$n$`, `$i$`, `$x_i$` and `$\overline{x}$` are as defined earlier.

* That is, we are looking at the difference between each value and the sample mean

* We then square each of these differences, add them up, and divide by `$n-1$`, which gives (approximately) the average of the squared differences

* This means that if lots of values are far away from the mean, the variance will be high. On the other hand, if most of the values are very close to the mean, the variance will be low.

* Comparing the data in **Histogram A** and **Histogram B** from before, it turns out that their variances are 907.947 and 106.5411 respectively
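
To see the sample variance formula in action, here is a minimal sketch that applies it step by step to the `$n = 5$` Kmart visit numbers from earlier (the histogram data are not reproduced here). The variable names are illustrative only.

```python
visits = [2, 50, 300, 10, 25]

n = len(visits)
x_bar = sum(visits) / n                             # sample mean, 77.4

squared_diffs = [(x - x_bar) ** 2 for x in visits]  # (x_i - x_bar)^2 for each value
s2 = sum(squared_diffs) / (n - 1)                   # add them up, divide by n - 1

print(round(s2, 1))  # 15818.8
```

The single value of 300 sits far from the mean, which is why the variance is so large here.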

---

# Standard Deviation

* The **standard deviation** is simply the square root of the variance, and the **sample standard deviation** is usually denoted `$s$`.

* We therefore have that

`$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\overline{x})^2}.$$`
--

* With this information, we can calculate the standard deviation of the data in **Histogram A** as `$\sqrt{907.947} = 30.132$`

* See if you can calculate the standard deviation of the data in **Histogram B** for yourself (you will find the answer in this topic's readings)

---

# Variance and Standard Deviation

* The standard deviation is usually easier to interpret because it is expressed in the same units as the data at hand, whereas the variance is expressed in the units squared.

* For example, if the data in Histogram A represented height in cm, then the associated standard deviation and variance could be expressed as `$30.132\text{cm}$` and `$907.947\text{cm}^2$` respectively, where 30.132cm is of course much easier to interpret

* The **population variance** is usually denoted `$\sigma^2$` and the **population standard deviation** is usually denoted `$\sigma$`

* Usually, we do not know the true values of `$\sigma^2$` and `$\sigma$`, but we can use the sample variance and sample standard deviation, `$s^2$` and `$s$` respectively, to estimate them

---

# Variance and Standard Deviation

* Calculating these measures of spread is more complicated than calculating, for example, the mean or median

* Thankfully, we will be using statistical software packages to calculate measures of spread in this subject

* However, you will be expected to know how to convert a given variance into a standard deviation, and vice-versa. In summary:
    * If you know the variance, take the square root of the variance to obtain the standard deviation
    * If you know the standard deviation, square the standard deviation to obtain the variance
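
These two conversions can be sketched in Python as below, using the variance of Histogram A from earlier. The function names are hypothetical and chosen purely for illustration.

```python
import math

def variance_to_sd(variance):
    """Take the square root of the variance to obtain the standard deviation."""
    return math.sqrt(variance)

def sd_to_variance(sd):
    """Square the standard deviation to obtain the variance."""
    return sd ** 2

sd_a = variance_to_sd(907.947)
print(round(sd_a, 3))                  # 30.132
print(round(sd_to_variance(sd_a), 3))  # 907.947
```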

---

# Quartiles and Inter-quartile Range

* There are some special ***quantiles***, called ***quartiles***

* If we divide our data into four quarters, so that each quarter contains 25% of the observations, Quartiles 1, 2, and 3 (Q1, Q2, and Q3) can be defined as follows:
    * Q1 is the 25% quantile
    * Q2 is the 50% quantile (this is in fact also the median)
    * Q3 is the 75% quantile

* For example, considering the `Height` variable from the `survey` data set, we have that Q1 = 165, Q2 = 171, and Q3 = 180

* We could make the following interpretations:
    * 25% of students are shorter than 165cm and 75% of students are taller than 165cm
    * 50% of students are shorter than 171cm and 50% of students are taller than 171cm
    * 75% of students are shorter than 180cm and 25% of students are taller than 180cm.

---

# Quartiles and Inter-quartile Range

* The ***inter-quartile range (IQR)*** is the distance spanned by the middle 50% of the data

* That is, it is the distance between Q1 and Q3, and can be calculated as

`$$IQR = Q3 - Q1.$$`

* In our `Height` example, we have that

`$$IQR = Q3 - Q1 = 180 - 165 = 15.$$`
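
As a sketch of the same calculation in code, the example below finds the quartiles and IQR of the Kmart visit numbers from earlier using Python's `statistics.quantiles`. The `Height` data are not reproduced here, and the `method="inclusive"` setting is an assumption: different software packages use slightly different quantile conventions, so quartiles for small samples may not match exactly.

```python
import statistics

visits = [2, 50, 300, 10, 25]

# n=4 splits the data into four quarters, giving the cut points Q1, Q2 and Q3
q1, q2, q3 = statistics.quantiles(visits, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 10.0 25.0 50.0
print(iqr)         # 40.0
```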

---

# Range

* The range is simply the distance spanned across the whole data set

* That is, the difference between the maximum and minimum values

* Considering the `Height` variable again, the minimum and maximum heights are 150cm and 200cm respectively

* The range is therefore 200cm - 150cm = 50cm

* We could interpret this by saying that the maximum difference in height between any two students in the sample is 50cm.
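
The same calculation can be sketched in Python. Since the `Height` data are not reproduced here, this example uses the `$n = 5$` Kmart visit numbers from earlier instead.

```python
visits = [2, 50, 300, 10, 25]

minimum = min(visits)
maximum = max(visits)
data_range = maximum - minimum   # range = maximum - minimum

print(minimum, maximum)  # 2 300
print(data_range)        # 298
```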

---

# Measure of Shape: Skewness

* A ***measure of shape*** can tell us how ***symmetrical*** or ***asymmetrical*** a set of data is

* ***Skewness*** is a measure of shape, and can be: 
    * Negative (skewed to the left, or negatively skewed) 
    * 0 (symmetrical), or
    * Positive (skewed to the right, or positively skewed) 
    
* This can be demonstrated via the histograms on the following slide

---

---

# Measures of association between variables

* ***Covariance*** and ***correlation*** are two measures that can be used to measure the relationship ***between*** two variables.

* If the ***covariance*** and ***correlation*** values are positive, this indicates the two variables are positively related. In other words, when one increases, the other one generally also increases.

* On the other hand, if the ***covariance*** and ***correlation*** values are negative, this indicates the two variables are negatively related. That is, when one increases, the other will typically decrease.

* ***Covariance*** and ***correlation*** values of 0 indicate that the two variables are unrelated (at least in a linear sense).

* Apart from its sign (positive or negative), the ***covariance*** value can be hard to interpret, especially if the two variables are on different scales

* However, ***correlation*** is a standardised measure, meaning it is much easier to interpret
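
To illustrate why correlation is easier to interpret, the sketch below computes both measures for a small made-up data set, and then again after changing the scale of one variable. The data values are hypothetical, and the formulas used are the standard sample covariance and Pearson correlation, which are not given in these slides.

```python
import math

def covariance(xs, ys):
    """Sample covariance: average product of deviations, dividing by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Sample correlation: covariance divided by both standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))  # covariance of x with itself is its variance
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]                      # hypothetical values
y = [2, 4, 5, 4, 5]                      # hypothetical values
x_scaled = [100 * value for value in x]  # same variable on a different scale

print(covariance(x, y), covariance(x_scaled, y))    # covariance changes with scale
print(correlation(x, y), correlation(x_scaled, y))  # correlation does not
```

Here the covariance grows by a factor of 100 when `x` is rescaled, while the correlation stays at roughly 0.77.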

---

# Correlation

* The ***correlation coefficient*** is often denoted `$r$`. (Note that `$r$` is the sample correlation coefficient, whereas the population correlation coefficient is usually denoted `$\rho$`).

* ***Correlation*** is a measure between -1 and 1 which tells us about the relationship between two variables

* The sign of the number tells us about the **direction** of the relationship (positive or negative - we will see some examples shortly)

* The size of the number tells us about the **strength** of the relationship:
    * The closer `$|r|$` (the absolute value of `$r$`) is to 1, the stronger the linear relationship between the two variables

* Below is a guide to interpreting the strength of a correlation coefficient:

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Range of |r| </th>
   <th style="text-align:left;"> Strength of correlation </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 0 to 0.3 </td>
   <td style="text-align:left;"> None or very weak </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 0.3 to 0.5 </td>
   <td style="text-align:left;"> Weak </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 0.5 to 0.8 </td>
   <td style="text-align:left;"> Moderate </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 0.8 to 1 </td>
   <td style="text-align:left;"> Strong </td>
  </tr>
</tbody>
</table>
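
The guide above can be sketched as a small Python function. The function name is hypothetical, and because the ranges in the table share their endpoints (for example, 0.3 appears in two rows), this sketch assumes each boundary value belongs to the stronger category.

```python
def correlation_strength(r):
    """Describe the strength of a correlation coefficient r using the guide above."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)   # strength depends on |r|, not on the sign
    if size < 0.3:
        return "None or very weak"
    if size < 0.5:
        return "Weak"
    if size < 0.8:
        return "Moderate"
    return "Strong"

print(correlation_strength(0.1))   # None or very weak
print(correlation_strength(-0.4))  # Weak
print(correlation_strength(0.77))  # Moderate
print(correlation_strength(-0.9))  # Strong
```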

---

# Scatter plots and correlation

* A scatter plot is a convenient way to visually compare two numeric variables
* For example, consider the following scatter plot, which shows the [Happiness Index](https://docs.google.com/spreadsheets/d/1s4ryFpYS1AQAhI2E1XaQyZ0ikj6oF8YlwISrKkyNj6c/edit#gid=501532268) versus income per person for a number of countries (Gapminder.org, 2021):

---

# Scatter plots and correlation

* As we can see, each axis represents one variable

* Each point represents one observation (country in this case), indicating its value for average income per person on the `$x$`-axis, and average happiness score on the `$y$`-axis

* By considering a scatter plot, we can observe how the two variables relate to each other

---

# Scatter plots and correlation

* Scatter plots are a helpful way for us to understand correlation. For example:

---

# Boxplots

* A boxplot is a visual way to present much of the key information about a numerical variable, including the: 
    * Minimum 
    * 25% quantile (Q1) 
    * Median (50% quantile, Q2)
    * 75% quantile (Q3) 
    * Maximum 
    * Outliers, if any

---

# Boxplots

* In the first plot below, the box represents the IQR, while the vertical line in the middle of the box represents the median
* The two vertical lines that make up the edges of the box represent Q1 and Q3 
* The second boxplot can be interpreted the same way and additionally has two outliers present, each represented by a dot towards the right-hand side of the plot.

---

# Boxplots

Boxplots can also indicate how skewed the data is. For example:

---

# Violin plots

* Violin plots combine some of the benefits of both histograms and boxplots into one chart

* They include most of the information included in a boxplot as well as the ***density***

* The density can be thought of as a smoothed version of the shape we see represented in a histogram

* The figure on the next slide contains the same data as we saw on the previous slide, except the boxplots have been replaced with violin plots

---

---

# When should we use which measure?

* The mean, variance, and standard deviation include every value within a given data set

* Therefore, they are comprehensive measures, but also easily affected by skewed data or extreme values (outliers)

* On the other hand, the median and IQR only consider the ranks of the values in a given data set

* They are therefore less comprehensive, but more robust when data are skewed or in the presence of outliers

---

# When should we use which measure?

* To demonstrate, consider the following set of numbers: `$2, 3, 4, 5, 6$`. We have:
    * Mean = 4
    * Median = 4

* Now suppose the number 6 above was incorrect and should have been 26. Considering `$2, 3, 4, 5, 26$`, we now have:
    * Mean = 8
    * Median = 4

As we can see, the median is unaffected in this example, whereas the mean has doubled after changing just one value! So, a basic guideline is:

.content-box-blue[
.center[
**Which measure should I use?**
]
* Symmetric data: Use mean, standard deviation, variance

* Skewed data, or data with extreme values: Use median, IQR
]
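
The effect of the single changed value can be checked with a short Python sketch using the standard `statistics` module. The variable names are illustrative only.

```python
import statistics

original = [2, 3, 4, 5, 6]
corrected = [2, 3, 4, 5, 26]   # the value 6 replaced by 26

for data in (original, corrected):
    # The mean uses every value; the median depends only on the middle value
    print(statistics.mean(data), statistics.median(data))

# Prints:
# 4 4
# 8 4
```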

---

# References

Gapminder.org (2021). _Free data from World Bank via Gapminder.org,
CC-BY license_. URL:
[https://www.gapminder.org/data/](https://www.gapminder.org/data/)
(visited on Jul. 30, 2021).

Venables, W. N. and B. D. Ripley (1999). _Modern Applied Statistics
with S-Plus_. 3rd. ISBN 978-1-4757-3121-7. New York: Springer-Verlag.

---

background-image: url(data:image/png;base64,#computerlab.jpg)
background-position: bottom
background-size: 75%
class: center

# See you in the computer labs!

---
class: middle

<font color = "grey">
These notes have been prepared by Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>
</font>