MT5762 Lecture 1

C. Donovan

Programme information

General info, what you should learn, what to do to pass

Contact information

Who am I?

  • Dr. C. Donovan
  • MSc coordinator for Statistics, module coordinator for MT5762, MT5763 & (sometimes) ID5059

  • I am part-time with the University. I try to maintain office hours Tuesday, Wednesday and Friday when not teaching.

  • I am frequently working elsewhere or overseas - email is the best way to contact me.

Research interests (anything involving data):

  • risk mitigation for military/civilian sonar, seismic surveying, marine construction and installations
  • predictive modelling/data-mining/machine-learning/(whatever new title is the flavour)
  • target tracking and identification (SONAR), image-processing
  • analysis of remote sensing data and satellite tags
  • gambling

Commercial interests (anything involving data):

  • cosmeceutical trials
  • customer value and brand equity modelling
  • environmental monitoring, Environmental Impact Assessment (EIA)
  • statistical programming and simulations
  • gambling

Problems?

  • Academic matters relating to this module - see me.
  • MSc matters relating to statistics - see your adviser, and beyond that Dr Popov as MSc coordinator.
  • Non-academic matters - similarly, or see the student support services.

You

  • Yourselves… very diverse backgrounds and directions
  • This module slots into several MScs (DIA, ASaDM, Conservation)

Course & module overview

Genesis - ASaDM & DIA

  • ASaDM started in 2006.
  • Intended to be small.
  • Intent is to take graduates without an applied statistics background and train them in data analysis (your backgrounds are very varied).
  • Strongly applied rather than theoretical, favouring breadth over depth.
  • Modern methods for tricky (real) data.
  • Demand for competent data analysts is high.
  • The MSc DIA leans more towards computer science.

Google

“I keep saying the sexy job in the next ten years will be statisticians. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s?”

Hal Varian (Chief Economist at Google), The McKinsey Quarterly, January 2009

Then this happened…

Some non-statisticians weren't having that though.

Data Scientist: The Sexiest Job of the 21st Century - Davenport & Patil, Harvard Business Review, October 2012.

Very irritating. We got but 3 years in the sun. Regardless, you're moving in a useful direction.

Computing

  • CREEM computing classroom
  • Maths Institute microlab
  • Eduroam network
  • Dropbox/Google Drive/OneDrive, TeamViewer, Hitask…

Evening study space

  • 24 hour access to the Maths Institute microlab.

Mailing lists and electronic networking

  • I suggest you subscribe to relevant mailing lists e.g. to AllStat (UK - others exist like ANZStat), datasciencejobs, …
  • LinkedIn: there is a group for the stats-related MScs

MSc representation

  • You need to put forward an advocate for the Masters students (one per school I expect).
  • In maths, the representative will sit on the staff-student board meetings.
  • You can decide how the person is selected.
  • Social organisation e.g. movie evenings, course hoodies.

MT5762

Principally this module serves:

  • to ensure we all have a common baseline of statistical knowledge that will be essential for following modules;
  • as a refresher for those with greater statistical background.

This is a course with a strong applied emphasis, but theory is included throughout.

That which you ought to take away with you:

  • the graphical/numerical display and summary of various data types
  • the fitting and interpretation of statistical models
  • underpinning theory and ideas for statistical analysis
  • practical data analysis in the exemplar language R (MT5763 covers others)

MT5762 - contents

This is a 20-lecture (30 hours), 5-week course, broadly covering:

  • Week 1: notation, jargon, sampling, experiments, types of data and their treatment
  • Week 2: probability, probability functions, random variables, specific distributions, inference
  • Week 3: tests, parameter estimation and uncertainty, simple models, simple matrix algebra
  • Week 4: likelihood, linear models (encompassing previous models), increasing complexity, model selection
  • Week 5: more linear models, computer intensive inference, sundry (power, multiple comparisons, tables of counts, non-parametric)

Examples will be based on: drugs & death, gambling, environmental impacts and finance.

MT5762 - assessment

This course is 100% internally assessed. So no exam (yay), but arguably you have to work harder for each mark.

Assessment consists of

  1. Individual project (30%) - circulated end of week 1, due end of week 3.
  2. Group project (40%) - circulated end of week 3, due end of week 6. Has individual components.
  3. Class test (30%) - date TBD, likely week 7.

MT5762 - Resources

Software

  • The exemplar language in this course is R (MT5763 covers others).
  • R is free and open-source. I suggest RStudio as the IDE for working with it (also free).
  • Office is pretty much mandatory to interact with the world.

Course materials

  • There is a Moodle page for MT5762: online access to notes, data etc.
  • This is a relatively new course! Notes will appear progressively.
  • Library and online journals, Google Scholar etc.
  • Books by John Fox, for the linear models part and technical details.
  • Wild & Seber, Chance Encounters, or other introductory applied statistics texts are relevant.

Getting started

Statistics is logical and often intuitive

Preliminary musings

What is (the practice of) statistics?

  • Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data, Moore and McCabe (2003).
  • The subject matter of statistics is the process of finding out more about the real world by collecting and then making sense of data, Wild and Seber (2000).
  • Statistics is a way of reasoning, along with a collection of tools and methods, designed to help us understand the world, De Veaux, Velleman, and Bock (2006).

Group musing

  • Your client has asked “How many people use drugs?”. Narrow the problem and propose a means to answer it. You have a large budget.
  • Olympic run times. Read the distributed paper and discuss within your group. Do you agree with their conclusion? Justify your position.

Some philosophy on modelling

  • What is a model and why create one?
  • What constitutes a good model?
  • What role does data play in model construction?
  • Mechanistic versus non-mechanistic models (NB statistical models are frequently not mechanistic).
  • Stochasticity.
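
A minimal sketch of stochasticity in a non-mechanistic model, assuming a straight line plus Gaussian noise. It is written in Python purely for illustration (the module itself uses R); the function names, intercept, slope and noise level are hypothetical and not taken from the course materials.

```python
import random


def deterministic_part(x, intercept=2.0, slope=0.5):
    """The systematic part of the model: a straight line (hypothetical values)."""
    return intercept + slope * x


def simulate(xs, noise_sd, seed):
    """Add independent Gaussian noise (the stochastic part) to the line."""
    rng = random.Random(seed)
    return [deterministic_part(x) + rng.gauss(0, noise_sd) for x in xs]


xs = list(range(10))

# Same model, two different random draws: the outputs differ.
run_a = simulate(xs, noise_sd=1.0, seed=1)
run_b = simulate(xs, noise_sd=1.0, seed=2)

# With the noise switched off, only the deterministic line remains.
no_noise = simulate(xs, noise_sd=0.0, seed=3)

for x, a, b, line in zip(xs, run_a, run_b, no_noise):
    print(f"x={x}  run A={a:6.2f}  run B={b:6.2f}  line={line:5.2f}")
```

The two runs share the same systematic part and differ only through the random term. The line itself makes no claim about the mechanism generating the data; it only describes the pattern.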

Sampling

One of the foundations of statistics and usually our source of data.

Sampling:

  1. is generally a necessity
  2. introduces types of uncertainty to our problems
  3. needs careful treatment to allow sensible & useful answers
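
Point 2 can be illustrated with a small simulation: repeated samples from the same population give different answers, and the spread of those answers depends on the sample size. This is a sketch in Python for illustration only (the module itself uses R); the population values, sample sizes and function name are made up.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw repeated simple random samples (without replacement) and
    return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population of 10,000 measurements.
pop_rng = random.Random(1)
population = [pop_rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)

# 500 repeated surveys at each of two sample sizes.
small = sample_means(population, sample_size=10, n_samples=500)
large = sample_means(population, sample_size=400, n_samples=500)

# The standard deviation of the sample means measures sampling uncertainty.
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)

print(f"population mean:            {true_mean:.2f}")
print(f"spread of means, n = 10:    {spread_small:.2f}")
print(f"spread of means, n = 400:   {spread_large:.2f}")
```

Every individual sample mean is "wrong" by some amount. The larger samples are simply wrong by less, on average.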

Thinking about it logically leads us all to a similar place, for example…

Abundance of prized sturgeon

“Experts can't agree how many beluga sturgeon are left in the sea. At stake is the future of one of the world's most sought-after fish and its coveted 'black gold'.” (New Scientist, 20 Sept 2003).

CITES says 11.6 million in 2002; Wildlife Conservation Society says maybe less than 0.5 million.

  • Why would estimates from different sources vary so much?
  • How would you approach such a problem?
  • What if you had very limited resources?
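
One possible contributor to such disagreement (not necessarily what happened in the sturgeon case) is scaling up a small survey of a patchy population. The sketch below is in Python for illustration only (the module itself uses R). It assumes an entirely hypothetical sea divided into 1,000 cells, most of them empty, with only 20 cells surveyed each time; none of the numbers come from CITES or the Wildlife Conservation Society.

```python
import random


def survey_estimate(cell_counts, n_cells_surveyed, rng):
    """Survey a random subset of cells and scale the mean count per cell
    up to the whole region."""
    surveyed = rng.sample(cell_counts, n_cells_surveyed)
    mean_per_cell = sum(surveyed) / n_cells_surveyed
    return mean_per_cell * len(cell_counts)


rng = random.Random(42)

# Hypothetical patchy population: about 1 cell in 10 holds a shoal of
# 1,000 fish, the rest hold none.
cell_counts = [rng.choice([0] * 9 + [1000]) for _ in range(1000)]
true_total = sum(cell_counts)

# 200 independent low-budget surveys of 20 cells each.
estimates = [survey_estimate(cell_counts, 20, rng) for _ in range(200)]

print(f"true total:        {true_total}")
print(f"smallest estimate: {min(estimates):.0f}")
print(f"largest estimate:  {max(estimates):.0f}")
```

Each survey here uses the same sound method on the same population, yet the estimates can differ many-fold simply because so few cells are sampled. Differing methods and assumptions between organisations would add further disagreement on top.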

Recap and look-forwards

We've covered:

  • Course requirements, content & structure
  • The real world throws uncertainty at us - statistics is the path to rational answers

Next:

  • Sampling: ideas, approaches, jargon and biases

Reading (moodle):

  • “Estimating excess deaths in Iraq since the US-British-led invasion”

Coffee 'may reverse Alzheimer's'

“Drinking five cups of coffee a day could reverse memory problems seen in Alzheimer's disease, US scientists say…. The 55 mice used in the University of South Florida study had been bred to develop symptoms of Alzheimer's disease…. When the mice were tested again after two months, those who were given the caffeine performed much better on tests measuring their memory and thinking skills and performed as well as mice of the same age without dementia….”

BBC 5th July 2009

So let's get some.