Lies, damned lies, and statistics
The above well-known quotation is attributed to Benjamin Disraeli (UK Prime Minister)
How to Lie with Statistics
A lesser-known book by Darrell Huff (142 pages, A5 format)
Bill Gates recommended it in 2015
BTW: this photo (taken in 2015), coupled with the fact that Gates funded epidemiology research at Johns Hopkins University, has become “evidence” for various morons (of which there are plenty in the USA) that Gates was behind the COVID-19 pandemic
A book written by Darrell Huff in 1954 presenting an introduction to statistics for the general reader. Not a statistician, Huff was a journalist […]
In the 1960/1970s, it became a standard textbook introduction to the subject of statistics for many college students […] one of the best-selling statistics books in history.
https://en.wikipedia.org/wiki/How_to_Lie_with_Statistics
The book consists of 10 chapters and is written in a provocative, unscientific way. The individual chapters are so well known that entering a chapter title into Google returns hundreds of thousands of references
ch1: The Sample with the Built-in Bias (i.e. it is very difficult to draw an unbiased/perfectly random sample)
ch2: The Well-Chosen Average. The average can be manipulated in various ways: using different kinds of average, different definitions of the averaged units, or different ways of measuring
ch3: The Little Figures That Are Not There (figures = details). In short: reporting results without context or important information
ch4: continues ch3; insignificant results = the difference has no practical meaning
ch5: The Gee-Whiz Graphs (statistical graphs in Cartesian coordinates with the OY axis not starting from zero) https://en.wikipedia.org/wiki/Gee_Whiz → https://en.wikipedia.org/wiki/Misleading_graph
ch6: The One-Dimensional Picture (comparing 1D quantities using 2D or pseudo-3D) https://thejeshgn.com/2017/11/17/how-to-lie-with-graphs/
ch7: The Semiattached Figure. Using one thing as a way to claim proof of something else, even though there’s no correlation between the two (not attached) https://www.secjuice.com/the-semi-attached-figure/
ch8: Post Hoc Rides Again (Correlation is not causation)
ch9: Misinforming people by the use of statistical material might be called statistical manipulation, in a word, Statisticulation. (summary of ch1–ch8)
ch10: How to Talk Back to a Statistic (How not to be deceived)
Who Says So? (interested parties can be unreliable; car seller reputation is poor);
How Does He Know? (measurement is often unreliable);
What’s Missing? (incomplete analysis signals bias);
Many figures lose meaning because a comparison is missing. In Poland there was a public discussion about falling fertility: women in Poland do not give birth to children; the average age of a mother at the birth of her first child is 27 years. [This is the norm in the whole of Europe]
Did Somebody Change The Subject? (beware of the Semiattached Figure)
Does It Make Sense? (forget about statistics and think about common sense)
Despite its mathematical base, statistics is as much an art as it is a science (Huff p. 120)
Is it better now?
Misleading statistical analyses are doing as well as, if not better than, in Huff’s times, which is probably due to the following factors:
the number of statisticians, often amateurs, has increased (everyone can easily compute something today)
the amount of readily available data has increased
fake news hype: numbers disguised as the result of a statistical analysis are often used to persuade somebody of something; statistical charts are ubiquitous now and many of them are fake…
Why do people believe numbers uncritically? They believe because they are innumerate. Why are people innumerate?
People are illiterate because they failed to learn how to read and write. Simple…
Perhaps people are innumerate not because of genetic load or some other disaster, but simply because they are not educated?
Descriptive: describing (i.e. summarising) a large set of data using tables, graphs or parameters. Also known as EDA (Exploratory Data Analysis)
Inferential (mathematical): inferring about a large set of data using some subset of this set (called a sample). CDA (Confirmatory Data Analysis, as opposed to EDA) follows the `general framework of scientific discovery’ (GFSD): hypothesis → data (usually a sample) → test → reject/accept
Econometrics: the application of statistical methods (CDA in particular) to economic data. The central problem is causality, which is BTW the `typical scientific law’ (X causes Y). Example: a higher price results in lower demand.
GFSD: hypothesis: a higher price results in lower demand; data: prices and demand of some good; test aka model: e.g. the relation between P and D is linear; reject/accept
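As an illustration, a minimal R sketch of this GFSD loop for the price/demand hypothesis; the data are simulated (not real prices), so all numbers are purely hypothetical.

```r
# Simulated illustration of the GFSD loop (no real data)
set.seed(1)
price  <- runif(100, min = 1, max = 10)        # data: prices of some good
demand <- 50 - 3 * price + rnorm(100, sd = 4)  # simulated demand with an assumed negative effect

model <- lm(demand ~ price)    # test aka model: linear relation between P and D
summary(model)$coefficients    # reject/accept: sign and significance of the slope
```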
Element: entity (unit) on which data is collected. Examples: a student, a tourist (are you students and/or tourists?), a country
Population: collection (ie. set) of elements. Example: all students in Poland in 2022, all countries in the world.
Sample: subset of population
Variable: characteristic(s) of an element under study. A student’s age, sex and shoe size are all variables.
Observation: the set of variable measurements obtained for a particular element. If we study the age/sex/shoe size of students, then the observation for a particular student is his/her age/sex/shoe size.
Elements/the population should be defined: what? where? when? Example: students (what, i.e. who counts as a student); all in Poland (where); in 2022 (when)
General advice: reuse standard classifications, do not invent your own. Examples:
what: NACE (the statistical classification of economic activities in the European Community)
where: NUTS
when: fortunately there is a well-established (alas complicated) standard :-)
Variables are measured; how they are measured depends on the type of variable.
nominal (qualitative/non-numeric): symbols not numbers; sex
numerical (quantitative): discrete vs continuous; ‘number of’/‘how many’ (discrete) vs ‘how much’ (continuous)
Examples: you name it …
General advice: Whatever type of scale (described below) you apply, use standard measures and do not invent your own
Types of scales (S.S. Stevens, On the Theory of Scales of Measurement, 1946):
nominal scale (classification): sex
ordinal scale (order or rank is meaningful but the ‘distance’ between ranks is not)
Example:
How often do you use X?
never–rarely–sometimes–frequently–almost always
one can’t assume that frequently - sometimes = almost always - frequently, or any other relation
interval scale: the interval (difference) between scale values is meaningful
ratio scale: all of the above + a meaningful zero (meaningful = zero amount of the measured value); only with a ratio scale are relative comparisons meaningful (‘twice as big as’)
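A small R sketch (toy values only) of how these scale types are typically represented; the variable names are made up for illustration.

```r
# Toy values: Stevens' scale types expressed as R data types
sex <- factor(c("F", "M", "F"))                      # nominal: labels, no order
usage <- factor(c("rarely", "frequently", "sometimes"),
                levels = c("never", "rarely", "sometimes",
                           "frequently", "almost always"),
                ordered = TRUE)                      # ordinal: order meaningful, distances are not
temperature_C <- c(10, 20, 30)                       # interval: differences meaningful, zero arbitrary
weight_kg <- c(50, 100, 150)                         # ratio: meaningful zero, so 100 is twice 50
usage[2] > usage[1]                                  # TRUE: ranks of an ordered factor can be compared
```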
Measuring U-boot submersion depth (Second World War): for some unknown reason the depth of the ship was registered as A + number, where A = 80 (achtzig) metres below the surface; so A+20 = -100 m, while A+40 = -120 m.
Forget about A and treat the scale as if -80 m were ZERO. A40 is 20 metres deeper than A20 (and 40 is twice 20), BUT is A40 100% deeper than A20?
The U-boot submersion scale with zero at -80 m is an interval scale, while the `normal scale’ with zero at the surface (zero depth) gives 120/100 = 1.2, which means that A40 is only 20% deeper than A20…
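The same arithmetic as a quick R check (the two readings are just the values from the example above):

```r
# A-scale readings vs true depth below the surface
a_reading  <- c(A20 = 20, A40 = 40)
true_depth <- 80 + a_reading                   # move the zero from -80 m to the surface
diff(true_depth)                               # A40 is 20 m deeper than A20
unname(true_depth["A40"] / true_depth["A20"])  # 1.2, i.e. 20% deeper, not 100%
```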
A popular scale used to survey opinions/beliefs/intentions etc.:
5-point Likert scale, example for satisfaction with X:
Highly dissatisfied–Dissatisfied–Neither dissatisfied nor satisfied–Satisfied–Highly satisfied
Typically 5 or 7 values. The midpoint value (‘Neither dissatisfied nor satisfied’) is usually interpreted as ‘no value’ (i.e. ‘have no opinion’, ‘do not care’, ‘do not know what X is’, etc.)
The Likert scale is definitely ordinal, but it can be regarded as interval.
Methods of analysis: % Highly Satisfied (always valid); mean of scores (sometimes valid, i.e. only if the scale is treated as interval)
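Both methods in a minimal R sketch; the 50 responses below are randomly generated, purely for illustration.

```r
# Made-up Likert responses, levels as in the example above
levs <- c("Highly dissatisfied", "Dissatisfied",
          "Neither dissatisfied nor satisfied", "Satisfied", "Highly satisfied")
set.seed(123)
answers <- factor(sample(levs, 50, replace = TRUE), levels = levs, ordered = TRUE)

mean(answers == "Highly satisfied") * 100  # % Highly Satisfied (always valid)
mean(as.integer(answers))                  # mean score on 1..5 (valid only if treated as interval)
```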
WVS survey (https://www.worldvaluessurvey.org/wvs.jsp):
Generally speaking, would you say that most people can be trusted or that you can’t be too careful in dealing with people?
Most people can be trusted
Can’t be too careful
Type of scale?
typical EDA analysis: mean, spread (how much the values differ), untypical values, distribution of values (basically a value → frequency function); see the R sketch after this list
typical EDA analysis: change over time
many elements measured many times = panel data set
spatial data set = geocoordinates added (data + geocoordinate stamp = where)
typical EDA analysis: spatial distribution (a place → value function)
Analysing spatial data is `advanced’
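A sketch of the basic EDA summaries in R, on a simulated numeric variable (e.g. heights of 200 hypothetical students):

```r
# Simulated data, for illustration only
set.seed(42)
x <- rnorm(200, mean = 170, sd = 10)

mean(x); sd(x)  # centre and spread
summary(x)      # quartiles help to spot untypical values
hist(x)         # distribution of values (value -> frequency)
boxplot(x)      # another quick look at untypical values (outliers)
```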
observational study = all elements in a population are measured the same way.
experimental study = the elements are assigned to groups; some treatment is applied to one of the groups, while the other group does not receive the treatment. This is called a controlled experiment.
random controlled experiment (RCE): assignment to groups is random
Example (medical statistics):
Does fluorine in the water cause cancer? Burke and Yiamouyannis considered 10 fluoridated and 10 non-fluoridated US towns. In the fluoridated towns the cancer mortality rate increased by 20% (between 1950 and 1970), while in the non-fluoridated towns the increase was only 10%. Does this confirm that fluoridation causes cancer? Unfortunately not…
Exposure/Treatment = cause
The death rate depends on age (old people are more likely to die), sex (males die earlier than females) and race (non-white people die earlier in the US). Oldham and Newell analysed the age-sex-race structure in the 20 US cities studied by Burke and Yiamouyannis and found that the increase in mortality (including the increase in mortality due to cancer) can be attributed to changes in the age-sex-race structure.
Drinking coffee causes higher exam scores
A study was conducted in which students were asked how much coffee they drank during the exam session. Coffee consumption was compared with exam results. The mean exam scores in the heavy-coffee-drinkers group were higher than in the light-drinkers group. Does this prove that drinking a lot of coffee improves exam scores?
Unfortunately no…
The general design of an RCE:
It can be assumed that apart from coffee, the exam result is influenced by, for example, an innate intellectual predisposition. To control this variable (to control means to keep its value fixed), the group of students can be divided randomly. As a result the average level of predisposition will be similar in both groups. Students from the experimental group are asked to drink 1 litre of coffee a day while students in the control group are given 1 litre of water. Suppose the average results in the group of students drinking 1 litre of coffee were higher than in the group drinking water. Does the above result confirm the relationship between coffee consumption and exam scores?
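A sketch of how such an RCE could be analysed in R; the group sizes, effect sizes and scores are all made up for illustration.

```r
# Simulated RCE: random assignment balances the unobserved predisposition
set.seed(7)
n <- 100
group   <- sample(rep(c("coffee", "water"), each = n / 2))  # random assignment to groups
ability <- rnorm(n)                                         # innate predisposition (unobserved)
score   <- 60 + 5 * (group == "coffee") + 10 * ability + rnorm(n, sd = 5)  # assumed +5 point effect

t.test(score ~ group)  # compare mean exam scores between the two groups
```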
Confirming causality with observational data is difficult :-)
In economics 99% of the data is observational. RCE data is impossible to obtain or is deemed artificial (not real)… That is why we will not deal with RCEs any more
Population = data set
A data set is an n × m matrix where each row is an observation (a set of measurements of each variable) and each column is a variable
If m = 1 univariate data set
if m = 2 bivariate data set
if m > 2 multivariate data set
aggregate data into tables (simple)
draw graphs (charts)
compute some parameters (see the R sketch below)
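A toy example of such a data set in R (all values made up), together with the three operations listed above:

```r
# Rows are observations (students), columns are variables: a multivariate data set (m > 2)
students <- data.frame(
  age  = c(20, 21, 23, 20, 22),
  sex  = c("F", "M", "F", "F", "M"),
  shoe = c(38, 43, 37, 39, 42)
)
dim(students)                         # n x m: 5 observations, 3 variables
table(students$sex)                   # aggregate data into a table
barplot(table(students$sex))          # draw a graph
mean(students$age); sd(students$age)  # compute some parameters
```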
Census (rare)
Register (data gathered for some other purposes; demographic data for example)
Sample (cheaper than census)
Usually we do not gather economic data on our own but obtain it from the databases of statistical offices or similar institutions
[more in separate document]
Five stages of statistical data analysis:
Usually students’ attention is concentrated almost exclusively on stages 2 and 3. As a result statistics is regarded as part of mathematics, and thus 100% reliable, while in reality it is not. Stages 1 and 5 are often more of an art than a science, and if one does not know the rules of these stages one can easily put excessive trust in the final outcome.
Less theory, more practice, and common sense.
In 3 words: data + procedures (theory of statistics) + tools
described above
some will be explained
spreadsheets (Excel)
store data + transform data + apply procedures + copy/paste results
Actually not a complete statistical program. What is missing:
lack of a built-in missing value (for contrast, see the R sketch after this list)
many procedures are unavailable (ANOVA for example) or cumbersome to use (chi-squared test of independence for example)
poor accuracy/unreliable results
Poor automation. Usually one has to do a lot of manual copy-pasting and/or mouse clicking/moving
Rule of thumb: sufficient for economic statistics; insufficient for other domains. IMO: the sooner someone learns something else, the better
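For contrast with the first point above, a tiny R illustration of a built-in missing value (toy numbers):

```r
# R has a built-in missing value (NA); the analyst decides explicitly how to treat it
x <- c(2, 4, NA, 8)
mean(x)                # NA: missing values are not silently ignored
mean(x, na.rm = TRUE)  # explicit decision to drop the missing value
```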
SPSS/JASP
SPSS is commercial and expensive/JASP is free. Psychology/sociology oriented
Gretl
Econometrics (open software)
R is both a programming language for statistical computing and graphics and a piece of software (i.e. an application) that executes programs written in R. R was developed in the mid-90s at the University of Auckland (New Zealand).
Since then R has become one of the dominant software environments for data analysis and is used by a variety of scientific disciplines.
BTW, why such a strange name (R)? A long time ago it was popular to use short names for computer languages (C for example). At AT&T Bell Labs (John Chambers), in the mid-70s, a language oriented towards statistical computing was developed and called S (from Statistics). R is the letter before S in the alphabet.
RStudio is an environment through which to use R. In RStudio one can simultaneously write code, execute it, manage data, get help and view plots. RStudio is a commercial product distributed under a dual-licence system by RStudio, Inc. A key developer at RStudio is Hadley Wickham, another brilliant New Zealander (cf Hadley Wickham).
Microsoft has recently invested heavily in R development. It bought Revolution Analytics, a key developer of R and provider of commercial versions of the system. With MS support the system is expected to gain more popularity (for example through integration with popular MS products)
Abandoning the habit of secrecy in favor of process transparency and peer review was the crucial step by which alchemy became chemistry. (Eric S. Raymond, The Art of UNIX Programming, Addison-Wesley)
Replicability vs Reproducibility
Hot topic: googling ‘reproducible research’ returns about 158000 results
Replicability: an independent experiment targeting the same question will produce a result consistent with the original study.
Reproducibility: ability to repeat the experiment with exactly the same outcome as originally reported [description of method/code/data is needed to do so].
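In the computational setting, reproducibility starts with small habits; one minimal R illustration is fixing the random seed so that a ‘random’ result is exactly repeatable:

```r
# Without set.seed() each run gives different random numbers; with it the result repeats exactly
set.seed(2024)
mean(rnorm(100))  # anyone running these two lines gets exactly the same value
```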
Computational science is facing a credibility crisis: it’s impossible to verify most of the computational results presented at conferences and in papers today. (Donoho D. et al 2009)
Use Excel for data cleaning & descriptive statistics (Excel handles missing data inconsistently and sometimes incorrectly; many common functions are poor or missing in Excel).
Use SPSS/SAS/Stata in point-and-click mode to run serious statistical analyses.
Prepare report/paper: copy and paste output to Word/OpenOffice, add description.
Send to publisher (repeat 1–4 if returned for revision).
Problems
Tedious/time-wasting/costly.
Even small data/method change requires extensive recomputation effort/careful report/paper revision and update.
Error-prone: difficult to record/remember a ‘click history’.
Famous example: the Reinhart and Rogoff controversy. Their claim: countries with a very high debt-to-GDP ratio suffer from low growth. However, the study suffers from serious but easily identifiable flaws, which were discovered when RR published the dataset they used in their analysis (cf Growth_in_a_Time_of_Debt)
Abandon spreadsheets.
Abandon point-and-click mode. Use statistical scripting languages and run program/scripts.
Benefits
Improved: reliability, transparency, automation, maintainability. Lower costs (in the long run).
Solves the problems of steps 1–2 but not of steps 3–4.
Problems: steeper learning curve. Perhaps higher costs in the short run. Duplication of effort (or a mess if scripts/programs are poorly documented).
Literate programming concept: Code and description in one document. Create software as works of literature, by embedding source code inside descriptive text, rather than the reverse (as in most programming languages), in an order that is convenient for human readers.
A program is like a WEB, tangled (turned into compilable code) and weaved (turned into a document), with relations and connections between the program parts. We express a program as a web of ideas. WEB is a combination of – a document formatting language and – a programming language.
The general idea of literate statistical programming mimics Knuth’s WEB system.
Statistical computing code is embedded inside the descriptive text. A literate statistical program is weaved (turned) into a report/paper by executing the code and inserting the results obtained; after data/method changes the report only needs to be re-weaved.
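A minimal, hypothetical R Markdown fragment illustrating the idea (the file name scores.csv and the variable names are made up): the Markdown text carries the narrative, the R chunk carries the computation, and knitting (weaving) the file executes the chunk and inserts its output into the report.

````
We compare the mean exam scores in the two groups.

```{r}
scores <- read.csv("scores.csv")      # hypothetical data file
t.test(score ~ group, data = scores)  # the output is inserted here when the document is knitted
```
````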
Solves the problems of steps 1–4.
Reliability: Easier to find/fix bugs. The results produced will not change when recomputed (in theory at least).
Efficiency: reuse allows one to avoid duplication of effort (payoff in the long run).
Transparency: increased citation rate, broader impact, improved institutional memory
Institutional memory is a collective set of facts, concepts, experiences and know-how held by a group of people.
Flexibility: When you don’t ‘point-and-click’ you gain many new analytic options.
Problems of LSP: many, including costs and the learning curve
Tools:
Document formatting language: LaTeX (not recommended) or Markdown (or many others, e.g. org-mode). LaTeX is a document preparation system/document markup language. Markdown: a lightweight document markup language based on e-mail text formatting conventions. Easy to write, read and publish as-is.
Program language: R
Learning resources
bookdown: Authoring Books and Technical Documents with R Markdown
Supplementary resources to my lecture (slides/data/R scripts etc) are available at: https://github.com/hrpunio/Z-MISC/tree/master/Erasmus/2019/Batumi
Data banks