BIO2POS Lecture Topic 1A

class: middle
background-image: url(data:image/png;base64,#LTU_logo_clear.jpg)
background-position: top left
background-size: 25%

# BIO2POS 
# Variables and Describing Data
## Data Analysis Topic 1A
### La Trobe University

---

# Welcome!

### This lecture is an introduction to data, focusing on techniques for defining, describing and presenting data.

Over the following slides, we will cover:

* .orangered_style[Variables]

* .orangered_style[Describing Data]

* Measures of Central Tendency
--

* Measures of Spread
--

* Shape of Data
    
--

* Common Data Visualisations

If you have completed previous statistics subjects (e.g. .seagreen_style[STM1001]), then this content should be familiar already, and the lecture will act as a refresher.

---

# Intended Learning Objectives

### By the end of this lecture you will:

* be able to classify different types of variables

* be able to describe variables using appropriate quantitative measures

The foundational content you learn in Topic 1A and 1B will help you tackle the statistical material in all the future DA Topics.

<br>

We will practise content from this topic in this week's DA computer lab.

Each lab consists of **core** questions (with the 🌱 symbol) and **extension** questions (with the 🌳 symbol), if you would like to extend your knowledge.

<br>

*There is also the .seagreen_style[Online Learning Activity] in this week's LMS tile - it is important to go through this to consolidate your understanding of this topic's content.*

---

# Introduction to Variables

.bold_style[Formal Definition]: A .orangered_style[variable] is any characteristic, number or quantity that can be measured or counted.

.bold_style[Informal Definition]: A .orangered_style[variable] is a piece of information you record for each of your participants or test subjects.

We can classify variables as follows:

* .orangered_style[Categorical]: A variable that describes categories

  * Nominal

--

  * Ordinal

* .orangered_style[Numerical]: A variable that describes a measurable quantity using a number

--

  * Discrete

--

  * Continuous
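
As a rough illustration of this classification, the Python sketch below tags a few variables with one of the four types. Python is used purely for illustration and is not the software used in BIO2POS. The variable names are hypothetical, and the one-line descriptions of each type follow standard usage rather than wording from this slide.

```python
from enum import Enum


class VariableType(Enum):
    """The four variable types named on this slide (descriptions are standard usage)."""
    NOMINAL = "categorical: categories with no natural order"
    ORDINAL = "categorical: categories with a natural order"
    DISCRETE = "numerical: countable values such as whole numbers"
    CONTINUOUS = "numerical: any value within a range"


def is_categorical(variable_type):
    """Return True for nominal and ordinal variables, False for numerical ones."""
    return variable_type in (VariableType.NOMINAL, VariableType.ORDINAL)


# Hypothetical variables recorded for each crab in a study (not from the source).
crab_variables = {
    "colour_morph": VariableType.NOMINAL,     # e.g. red, orange, purple
    "size_class": VariableType.ORDINAL,       # e.g. small < medium < large
    "missing_legs": VariableType.DISCRETE,    # e.g. 0, 1, 2
    "weight_grams": VariableType.CONTINUOUS,  # e.g. 7.5
}

for name, variable_type in crab_variables.items():
    family = "categorical" if is_categorical(variable_type) else "numerical"
    print(f"{name}: {family} ({variable_type.name.lower()})")
```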

---

# Introduction to Descriptors

We can describe data using different types of descriptors. In BIO2POS, we will focus on:

1. .orangered_style[Measures of Central Tendency] (aka Measures of Location)
  
--

2. .orangered_style[Measures of Spread] (aka Measures of Variability)
  
--

3. .orangered_style[Measures of Shape] (aka Measures of Skewness)

If we know how to use these measures, we can describe the key characteristics of our data.

<br>

For all these measures, we will focus on the .bold_style[conceptual understanding and interpretation].

<br>

While some formulae will be presented, generally the software we will use in BIO2POS (.seagreen_style[jamovi]) will take care of the calculation details.

---

# 1. Measures of Central Tendency

These are summary statistics that describe the central location, or central point, of the data.

We can use these to help describe what a 'typical' value will be in our sample.

These measures include:

* .orangered_style[The Mean]: The average value of all data in a set of data
  
--

* .orangered_style[The Median]: The middle value in an ordered set of data
  
--

* .orangered_style[The Mode]: The most frequently occurring value/category in a set of data

We will look at an example use of these measures shortly.

---

# 2. Measures of Spread

The central point tells us a lot about our data, but it doesn't tell us everything. We also need to look at the *variability* of the data using measures of spread.

We can use these to help understand the amount of variation in our sample.

These measures include:

* .orangered_style[The Variance] (Var): A measure of how widely the data is dispersed around the mean
  
--

* .orangered_style[The Standard Deviation] (SD): The square root of the variance
  
--

* .orangered_style[The Interquartile Range] (IQR): The range that covers the middle 50% of the data

*Note that we often prefer the SD to the Var as SD is reported in the same units of measurement as our data.*

---

# 3. Measures of Shape

The shape of our data plays an important role in our decision as to which measure(s) of central tendency and spread to use when describing our data.

In BIO2POS, we will focus on distinguishing between:

* .orangered_style[Symmetric data]
  
--

* .orangered_style[Asymmetric data] (aka skewed data)

We can use the .orangered_style[skewness] measure to determine symmetry properties.

---

# 3. Measures of Shape: Visualising Symmetry

Often, the easiest way to observe the shape or symmetry properties of our data is to visualise it. Typically, we will use .seagreen_style[histograms] and/or .seagreen_style[box plots] for this purpose.

Consider the histogram below:

1. Identify the central point, and then imagine drawing a vertical line through that point, effectively splitting the histogram into two parts.

2. If these two parts look like mirror images, we can say the data is symmetric.

---

# 3. Measures of Shape: Symmetric data

For the purposes of BIO2POS, we will use the following definition for skewness.

*Please note however that several formulae and interpretations exist for skewness, so this is an informal definition which does not apply in all contexts.*

.bold_style[Informal Definition]: .orangered_style[Skewness] measures the level of asymmetry of a probability distribution around its mean.

* If the skewness value of our data is 0 or close to 0, we say that the distribution of our data is .orangered_style[symmetric]. 
  
--

* Otherwise we say the distribution is .orangered_style[asymmetric].

If the spread of data to either side of the central point in a data set is equal or approximately equal, with the .seagreen_style[tails] of the distribution covering a similar spread of values, we typically say that the data is .orangered_style[symmetric].

* This will be reflected by the mean and median of the data set being equal or approximately equal.
  
---

# 3. Measures of Shape: Asymmetric data

A skewed distribution can be .orangered_style[left-skewed] or .orangered_style[right-skewed]. The direction refers to the side of the distribution which has a more spread out tail.

* We say that the distribution of the data is .orangered_style[right-skewed or positively skewed] when the right tail is more spread out than the left tail. The few observations which are much larger in value compared to the majority of observations result in the right tail extending further.

--

  * The skewness value will be positive

--

  * Generally the mean is greater than the median

--

* We say that the distribution of the data is .orangered_style[left-skewed or negatively skewed] when the left tail is more spread out than the right tail. The few observations which are much smaller in value compared to the majority of observations result in the left tail extending further.

  * The skewness value will be negative

--

  * Generally the mean is less than the median
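
To make the sign convention concrete, here is a minimal Python sketch that computes one common skewness formula, the moment coefficient (the average cubed deviation divided by the average squared deviation raised to the power 1.5). This is only one of several formulae: statistical software may apply a sample-size adjustment, so the magnitude it reports can differ from the values here. The three samples are hypothetical and were chosen only to show a zero, a positive and a negative result.

```python
def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (one of several formulae)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # average squared deviation
    m3 = sum((x - mean) ** 3 for x in values) / n  # average cubed deviation
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


# Hypothetical samples, chosen only to illustrate the sign of the result.
samples = {
    "symmetric": [2, 4, 6, 8, 10],
    "right-skewed": [2, 3, 4, 5, 30],   # one unusually large value
    "left-skewed": [1, 26, 27, 28, 29],  # one unusually small value
}

for label, sample in samples.items():
    print(f"{label}: skewness = {moment_skewness(sample):+.2f}")
```

The symmetric sample gives a skewness of zero, the sample with one unusually large value gives a positive skewness, and the sample with one unusually small value gives a negative skewness.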
    
---

# 3. Measures of Shape: Symmetric and Asymmetric Examples

If we want to describe the key features of these examples, which measures of central tendency and spread do you think would be most appropriate?

---

# Outliers

Sometimes, one or more observations in a data set we are assessing will appear strange. If an observation does not fit well with the rest of the data, we may refer to this observation as an .orangered_style[outlier] or extreme value.

.pull-left[

.caption_style[
Note. From File:Black Land Crabs.jpg, by [Sudzie](https://commons.wikimedia.org/wiki/User:Sudzie), 2005, Wikimedia Commons ([https://commons.wikimedia.org/](https://commons.wikimedia.org/)). [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
]

]

.pull-right[
Outliers may for example be:
{{content}}
]

* Unusually low values
{{content}}

* Unusually high values
{{content}}

* Due to a recording error
{{content}}

*It is important to note that an outlier may still be a legitimate observation.*

When selecting appropriate descriptors to use, we need to take into account whether our data contains outliers.

---

# Example - Mean vs Median for no outliers

Suppose we are assessing a small sample of data on red crabs from Christmas Island (Green, 1997), and have the following .seagreen_style[crab weight (in grams)] measurements:

`$$x_1 = 6.5,\, x_2 = 6,\, x_3 = 8.5,\, x_4 = 9,\, x_5 = 7.5$$`
--

To compute the .orangered_style[sample mean] (typically denoted `$\overline{X}$` or `$M$`), we can sum all the data points then divide by the sample size `$n$`.

Thus, we have:

`$$\overline{X} = \dfrac{\sum_{i=1}^n X_i}{n} = \dfrac{x_1 + x_2 + x_3 + x_4 + x_5}{n} = \dfrac{6.5 + 6 + 8.5 + 9 + 7.5}{5} = 7.5 \text{ grams}$$`
--

To compute the .orangered_style[sample median], we order our data from smallest to largest and pick the middle value:

`$$x_{(1)} = 6,\, x_{(2)} = 6.5,\, x_{(3)} = 7.5,\, x_{(4)} = 8.5,\, x_{(5)} = 9$$`

The middle value is `$x_{(3)}$`, so the sample median is `$7.5$` grams.

Here, there are no outliers, and the mean and median are equal, suggesting the distribution of the data is symmetric.
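
The two calculations on this slide can be sketched in a few lines of Python, following the same steps: sum and divide for the mean, order and pick the middle value for the median. The function names are illustrative, and averaging the two middle values for an even sample size is a common convention that this slide does not cover.

```python
def sample_mean(values):
    """Sum all data points, then divide by the sample size n."""
    return sum(values) / len(values)


def sample_median(values):
    """Order the data from smallest to largest and pick the middle value.

    With an even sample size there is no single middle value, so this
    sketch averages the two middle values (a common convention).
    """
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


crab_weights = [6.5, 6, 8.5, 9, 7.5]  # grams, as listed on this slide

print(sample_mean(crab_weights))    # 7.5
print(sample_median(crab_weights))  # 7.5
```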

---

# Example - Mean vs Median with an outlier

Now suppose that we have a different sample with an outlier:

`$$x_1 = 6.5,\, x_2 = 6,\, x_3 = 25,\, x_4 = 9,\, x_5 = 7.5$$`
--

The median is robust to outliers and will remain `$7.5$` grams.

However, .bold_style[the mean is easily affected by outliers], and so now, we observe:

`$$\overline{X} = \dfrac{6.5 + 6 + 25 + 9 + 7.5}{5} = 10.8 \text{ grams}$$`
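
As a quick check, the comparison between the two samples can be reproduced with Python's standard `statistics` module. This is an illustrative sketch only; the variable names are our own.

```python
import statistics

no_outlier = [6.5, 6, 8.5, 9, 7.5]    # crab weights in grams
with_outlier = [6.5, 6, 25, 9, 7.5]   # third value replaced by an outlier

for label, sample in [("no outlier", no_outlier), ("with outlier", with_outlier)]:
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    print(f"{label}: mean = {mean:.1f} g, median = {median:.1f} g")
# no outlier: mean = 7.5 g, median = 7.5 g
# with outlier: mean = 10.8 g, median = 7.5 g
```

A single extreme value moves the mean by more than 3 grams, while the median does not move at all.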

---

# Example - Measures of Spread for no outliers

We could also compute measures of spread for our red crab examples.

If our data appears to have a symmetric distribution, we can use the .orangered_style[sample variance (Var) or sample standard deviation (SD)] to describe the spread of data.

To compute the sample variance, we calculate the squared distances between each point and the mean, add the results and then divide by the sample size minus 1.

Thus, for the example without outliers, we have:

`$$Var(X) = \dfrac{\sum_{i=1}^n (X_i - \overline{X})^2}{n-1} = \dfrac{(6.5 - 7.5)^2 + \cdots + (7.5 - 7.5)^2}{4} = 1.625 \text{ grams}^2$$`
and therefore `$SD(X) = \sqrt{Var(X)} \approx 1.274755$` grams.

It is appropriate to use the Var and SD here as the data is spread evenly around the mean.

*For the outlier example, the SD is much larger, at `$8.020287$` grams.*
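
The variance recipe above (squared distances from the mean, summed, then divided by the sample size minus 1) can be sketched in Python as follows. The function name is illustrative; the data are the two crab samples from the earlier examples.

```python
import math


def sample_variance(values):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_distances = [(x - mean) ** 2 for x in values]
    return sum(squared_distances) / (n - 1)


no_outlier = [6.5, 6, 8.5, 9, 7.5]    # crab weights in grams
with_outlier = [6.5, 6, 25, 9, 7.5]

var_no = sample_variance(no_outlier)
sd_no = math.sqrt(var_no)
sd_out = math.sqrt(sample_variance(with_outlier))

print(f"Var (no outlier) = {var_no:.3f} grams^2")   # 1.625
print(f"SD (no outlier)  = {sd_no:.6f} grams")      # 1.274755
print(f"SD (with outlier) = {sd_out:.6f} grams")    # 8.020287
```

Python's standard library functions `statistics.variance` and `statistics.stdev` use the same `$n-1$` divisor.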

---

# Example - Measures of Spread with outliers

When our data is skewed, it is more appropriate to use the .orangered_style[Interquartile Range (IQR)] than Var or SD to describe the spread of the data.

* The IQR is robust to skewness and outliers.

The IQR is a range that covers the middle half (i.e. middle 50%) of the data points.

It is computed as the `$75^{th}$` percentile `$(Q3)$` minus the `$25^{th}$` percentile `$(Q1)$`.

For our red crab examples, the IQR values are `$2$` grams (no outlier) and `$2.5$` grams (with outlier).

`$Q1$` and `$Q3$` are often presented as part of a .seagreen_style[box plot], as box plots help show the amount of skew and the location of outliers.
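
The IQR values quoted above can be reproduced with Python's standard `statistics.quantiles` function using `method="inclusive"`. Quartile conventions differ between software packages, so the choice of method here is an assumption made to match the values on this slide, not a statement about how jamovi computes quartiles.

```python
import statistics


def interquartile_range(values):
    """Return Q3 minus Q1 using the 'inclusive' quartile convention."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


no_outlier = [6.5, 6, 8.5, 9, 7.5]    # crab weights in grams
with_outlier = [6.5, 6, 25, 9, 7.5]

print(interquartile_range(no_outlier))    # 2.0
print(interquartile_range(with_outlier))  # 2.5
```

Compare this with the SD, which jumps from about 1.27 grams to about 8.02 grams when the outlier is introduced, while the IQR only moves from 2 to 2.5 grams.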

---

# IQR Details

---

# Summary

In descriptive statistics, when assessing a data set we can use descriptors to summarise the key characteristics of the data.

The appropriate measures of central tendency and spread to report depend on the shape of the data.

.bold_style[If our data is symmetric], we should report the mean and standard deviation.

.bold_style[If our data is asymmetric or has outliers], we should report the median and IQR.

---

# End

That concludes our Introduction to Variables and Describing Data lecture.

### What to do next:

* .seagreen_style[Quick Kahoot revision quiz]: Please go to [kahoot.it](https://kahoot.it) and type in the code shown

* Make sure to attend the next DA Lecture on Topic 1B

* If you have any questions, check the LMS, email us or ask in the computer labs

---

# References

* Green, P. T. (1997). Red crabs in rain forest on Christmas Island, Indian Ocean: activity patterns, density and biomass. *Journal of Tropical Ecology*, 13(1), 17-38.

* The jamovi project. (2022). *Jamovi [Computer Software]*. [https://www.jamovi.org](https://www.jamovi.org).

---
class: middle

<font color = "grey">
These notes have been prepared by Rupert Kuveke, Amanda Shaker, and other members of the Department of Mathematical and Physical Sciences. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with the Department of Environment and Genetics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a>
</font>