What To Expect

The Seminars

Course Dates & Outline I

- Theory and Basics of R

Course Dates & Outline II

- Basic Statistics in R

Learning Goals

Learning Methods

Course Resources and Reading

Let Me Introduce Myself

Useful Reading

You are NOT required to read these!

But these books are seriously good.

The Importance Of Proper Statistics

The Consequences Of Bad Statistics

When Mistakes Happen

Even the rigorous peer-review system might miss some minor flaws.

\(\rightarrow\) No big deal so long as you offer corrections to your flawed work.

Fraudulent Practices - The Case Of Andrew Wakefield

\(\rightarrow\) Knowingly fraudulent practices can cost you your career.

Fraudulent Practices - The Case Of Diederik Stapel

\(\rightarrow\) Knowingly fraudulent practices can cost you your career, discredit your institution and your field of research, and even seriously impede the careers of unknowing co-workers.

What Are Bad Statistics?

Wrong/Misinformed Use

Lack of statistical knowledge

  • Applying statistical methods to data for which they aren’t meant

\(\rightarrow\) Methods can “break”

  • Flawed understanding of the methodology

\(\rightarrow\) Incorrect conclusions

Uninformed Use

Lack of biological knowledge

  • Delineation of nonsensical but statistically significant relationships

\(\rightarrow\) p-hacking

  • No sense of how to establish testable, feasible hypotheses

\(\rightarrow\) Waste of time

Caveat

Statistical Concern On The Rise

The Recent Debate

Why Keep Up With It?

Advancing In Statistics

Further benefits of a statistical background

The Lack Of Biostatisticians

  • Biological studies without rigorous statistical analyses are almost unpublishable
  • Biostatisticians are rare
  • Almost every biological research group requires at least one capable statistician

Statistics As An Aphrodisiac

Terminology

Classifying Statistics

Frequently Used Classifications

Unsupervised Approaches

Unsupervised methods are often the for .

\(\rightarrow\) and

Supervised Approaches

Supervised methods are often and used to about the data.

\(\rightarrow\)

Basic Vocabulary

Population vs. Sample

Population describes the sum total of all values of a variable given a certain research question. This includes non-measured data.

Sample describes the sum total of all values of a variable for any given analysis. This can only include measured data.

In an experimental set-up, you rear an ant colony of exactly 10,000 individuals. You are interested in the average mandible strength of ants within the colony.

You cannot possibly take measurements of all 10,000 individuals.

Take measurements on a sample (e.g. 1,000 individuals) from within the population (10,000 individuals).
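The ant colony example can be sketched in R. This is a minimal illustration only: the mandible strengths are simulated with rnorm(), and the mean of 50 and standard deviation of 5 (arbitrary units) are assumptions made up for the sketch, not measurements.

```r
# Hypothetical colony: mandible strengths are simulated, not measured
set.seed(42)
population <- rnorm(10000, mean = 50, sd = 5)  # all 10,000 ants (assumed values)

# Measure only 1,000 randomly chosen individuals
ant_sample <- sample(population, size = 1000, replace = FALSE)

# In a real study the population mean is unknown; here we can compare both
pop_mean <- mean(population)
sample_mean <- mean(ant_sample)
c(population = pop_mean, sample = sample_mean)
```

The sample mean serves as an estimate of the population mean, which in practice cannot be measured directly.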

Training Data vs. Test Data

This differentiation is only applicable when concerned with , which we won’t cover in these seminars.

Training data describes the subset of the total data which is used to build the model.

Test data describes the subset of the total data which is used to assess the performance of the model.

You have identified a way to model how mandible strength and ant size are interconnected but don’t know how to assess the quality of your model (a model will always fit the data it was built on extremely well).

Split the available data into two non-overlapping subsets (training data and test data) and use these separately to build your model and assess its performance.
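A split of this kind might look as follows. This is a sketch under stated assumptions: the data are simulated, the linear relationship between size and strength is invented for illustration, and the 70/30 split and the use of lm() with a root-mean-square error are choices made here, not prescribed by the seminars.

```r
# Hypothetical data: ant size and mandible strength are simulated
set.seed(42)
n <- 200
ant_size <- runif(n, min = 2, max = 10)
strength <- 3 + 1.5 * ant_size + rnorm(n, sd = 1)  # assumed relationship
ants <- data.frame(ant_size, strength)

# Randomly assign 70% of the rows to the training data, the rest to the test data
train_rows <- sample(seq_len(n), size = round(0.7 * n), replace = FALSE)
train <- ants[train_rows, ]
test <- ants[-train_rows, ]

# Build the model on the training data only
model <- lm(strength ~ ant_size, data = train)

# Assess its performance on data the model has not seen
pred <- predict(model, newdata = test)
rmse_test <- sqrt(mean((test$strength - pred)^2))
rmse_test
```

Because the test data played no part in fitting, the error computed on it is a fairer measure of model quality than the fit to the training data.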

What Makes Data Truly Random?

A sampling procedure is random when any member of the population has an equal chance of being selected into the sample.

Training data and test data are established from the population with the same sense of randomness, although there may be exceptions depending on the modelling procedure at hand.

Number all units contained within the set-up and sample those units corresponding to random numbers. Use the sample() function to create random (strictly speaking, pseudo-random) subsets. Remember to use set.seed() to make this step reproducible!

Random Sampling in R

# Making it reproducible
set.seed(42)
# Establishing a population
pop <- c(1:15)
pop
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
# Establishing a random sample
sam <- sample(pop, 5, replace = FALSE)
sam
## [1]  1  5 15  9 10
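What set.seed() buys you can be checked directly. The sketch below reuses pop from above; the names first_draw, second_draw and third_draw are introduced here for illustration only. Note that the exact numbers drawn for a given seed may differ between R versions, since the default sampling algorithm changed in R 3.6.0.

```r
pop <- 1:15

# Same seed, same call: the two samples are identical
set.seed(42)
first_draw <- sample(pop, 5, replace = FALSE)
set.seed(42)
second_draw <- sample(pop, 5, replace = FALSE)
identical(first_draw, second_draw)

# Without resetting the seed, the next draw continues the random number stream
third_draw <- sample(pop, 5, replace = FALSE)
third_draw
```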

Introduction To R

Why Use R?

The Power Of R

The R landscape

Obtaining R

R is a free statistical environment that is used by many researchers all around the globe.

Multiple dedicated forums online:

Layouts

Layouts - The Console

Running R through the console …

But you will have access to it anyway as it comes with R (we will use version 3.4.2).

Layouts - The Editor

Running R through an editor…

I recommend RStudio. If you use it a lot, I also recommend changing the appearance to ‘Vibrant Ink’ (setting located in the ‘Global Options’ window nested within the ‘Tools’ tab).

Layouts - The Editor Explained

The Source pane is where you load scripts and write most of your code.

The Environment, History, Connections pane is where you can quickly access all objects of your current R session.

The Files, Plots, Packages, Help and Viewer tabs are especially useful for document navigation, data visualisation and getting information on certain functions in R.

The Console is where you execute short commands and where warning and error messages are displayed.
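The difference between a warning and an error can be tried out with two short commands. This is a small illustrative sketch; warn_msg and err_msg are names chosen here, and tryCatch() is used only to capture the messages that the Console would otherwise print.

```r
# sqrt(-1) still returns a result (NaN) but issues a warning
warn_msg <- tryCatch(sqrt(-1), warning = function(w) conditionMessage(w))

# log("a") cannot be computed at all and stops with an error
err_msg <- tryCatch(log("a"), error = function(e) conditionMessage(e))

warn_msg
err_msg
```

A warning means the command ran but something deserves a second look; an error means the command did not complete.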

Coding

The Evolution Of Code