Lies, damned lies, and statistics
The above well-known quotation is attributed to Benjamin Disraeli (UK Prime Minister)
How to Lie with Statistics
A lesser-known book by Darrell Huff (142 pages, A5 format)
Bill Gates recommended it in 2015
BTW: this photo (taken in 2015), coupled with the fact that Gates funded epidemiology research at Johns Hopkins University, has become “evidence” for various morons (of which there are plenty in the USA) that Gates was behind the COVID-19 pandemic
A book written by Darrell Huff in 1954 presenting an introduction to statistics for the general reader. Not a statistician, Huff was a journalist […]
In the 1960/1970s, it became a standard textbook introduction to the subject of statistics for many college students […] one of the best-selling statistics books in history.
https://en.wikipedia.org/wiki/How_to_Lie_with_Statistics
The book consists of 10 chapters and is written in a provocative, unscientific way. The individual chapters are so well known that entering a chapter title into Google returns hundreds of thousands of references
ch1: The Sample with the Built-in Bias (i.e. it is very difficult to draw an unbiased/perfectly random sample)
ch2: The Well-Chosen Average. The average can be manipulated in various ways: using different kinds of average, different definitions of the averaged units, or different ways of measuring
ch3: The Little Figures That Are Not There (figures = details). In short: reporting results without context or important information
ch4: continues ch3; insignificant results = the difference has no practical meaning
ch5: The Gee-Whiz Graphs (statistical graphs in Cartesian coordinates with the OY axis not starting from zero) https://en.wikipedia.org/wiki/Gee_Whiz → https://en.wikipedia.org/wiki/Misleading_graph
ch6: The One-Dimensional Picture (comparing 1D quantities using 2D or pseudo-3D) https://thejeshgn.com/2017/11/17/how-to-lie-with-graphs/
ch7: The Semiattached Figure. Using one thing as a way to claim proof of something else, even though there’s no correlation between the two (not attached) https://www.secjuice.com/the-semi-attached-figure/
ch8: Post Hoc Rides Again (Correlation is not causation)
ch9: Misinforming people by the use of statistical material might be called statistical manipulation, in a word, Statisticulation. (summary of ch1–ch8)
ch10: How to Talk Back to a Statistic (How not to be deceived)
Who Says So? (interested parties can be unreliable; car seller reputation is poor);
How Does He Know? (measurement is often unreliable);
What’s Missing? (incomplete analysis signals bias);
Many figures lose meaning because a comparison is missing. In Poland there was a public discussion about falling fertility: women in Poland do not give birth to children; the average age of a mother at the birth of her first child is 27 years. [This is the norm in the whole of Europe]
Did Somebody Change The Subject? (beware of the Semiattached Figure)
Does It Make Sense? (forget about statistics and think about common sense)
Despite its mathematical base, statistics is as much an art as it is a science (Huff p. 120)
Is it better now?
Misleading statistical analyses are doing as well as, if not better than, in Huff’s times, which is probably due to the following factors:
the number of statisticians, often amateurs, has increased (everyone can easily compute something today)
the amount of readily available data has increased
fake news hype: numbers disguised as the result of a statistical analysis are often used to persuade somebody of something; statistical charts are ubiquitous now and many of them are fake…
Why do people believe numbers uncritically? They believe because they are innumerate. Why are people innumerate?
People are illiterate because they failed to learn how to read and write. Simple…
Perhaps people are innumerate not because of genetic load or some other disaster, but simply because they are not educated?
Descriptive: describing (i.e. summarising) a large set of data using tables, graphs or parameters. Also known as EDA (Exploratory Data Analysis)
Inferential (mathematical): inferring about a large set of data using some subset of this set (called a sample). CDA (Confirmatory Data Analysis, as opposed to EDA) follows the `general framework of scientific discovery’ (GFSD): hypothesis → data (usually a sample) → test → reject/accept
Econometrics: the application of statistical methods (CDA in particular) to economic data. The central problem is causality, which is BTW the `typical scientific law’ (X causes Y). Example: a higher price results in lower demand.
GFSD: hypothesis: a higher price results in lower demand; data: prices and demand of some good; test aka model: e.g. the relation between P and D is linear; reject/accept
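As an illustration, a minimal R sketch of this GFSD loop for the price/demand hypothesis; the data are simulated (not real prices), so all numbers are purely hypothetical.

```r
# Simulated illustration of the GFSD loop (no real data)
set.seed(1)
price  <- runif(100, min = 1, max = 10)        # data: prices of some good
demand <- 50 - 3 * price + rnorm(100, sd = 4)  # simulated demand with an assumed negative effect

model <- lm(demand ~ price)    # test aka model: linear relation between P and D
summary(model)$coefficients    # reject/accept: sign and significance of the slope
```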
Element: entity (unit) on which data is collected. Examples: a student, a tourist (are you students and/or tourists?), a country
Population: collection (ie. set) of elements. Example: all students in Poland in 2022, all countries in the world.
Sample: subset of population
Variable: characteristic(s) of an element under study. A student’s age, sex and shoe size are all variables.
Observation: the set of variable measurements obtained for a particular element. If we study the age/sex/shoe size of students, then the observation for a particular student is his/her age/sex/shoe size.
Elements/the population should be defined: what? where? when? Example: students (what, i.e. who counts as a student); all in Poland (where); in 2022 (when)
General advice: reuse standard classifications, do not invent your own. Examples:
what: NACE (the statistical classification of economic activities in the European Community)
where: NUTS
when: fortunately there is a well-established (alas complicated) standard :-)
Variables are measured; how they are measured depends on the type of variable.
nominal (qualitative/non-numeric): symbols not numbers; sex
numerical (quantitative): discrete vs continuous; ‘number of’/‘how many’ (discrete) vs ‘how much’ (continuous)
Examples: you name it …
General advice: Whatever type of scale (described below) you apply, use standard measures and do not invent your own
Types of scales (S.S. Stevens, On the Theory of Scales of Measurement, 1946):
nominal scale (classification): sex
ordinal scale (order or rank is meaningful but the ‘distance’ between ranks is not)
Example:
How often do you use X?
never–rarely–sometimes–frequently–almost always
one can’t assume that frequently - sometimes = almost always - frequently, or any other relation
interval scale: the interval (difference) between scale values is meaningful
ratio scale: all of the above + a meaningful zero (meaningful = zero amount of the measured value); only with a ratio scale are relative comparisons meaningful (‘twice as big as’)
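A small R sketch (toy values only) of how these scale types are typically represented; the variable names are made up for illustration.

```r
# Toy values: Stevens' scale types expressed as R data types
sex <- factor(c("F", "M", "F"))                      # nominal: labels, no order
usage <- factor(c("rarely", "frequently", "sometimes"),
                levels = c("never", "rarely", "sometimes",
                           "frequently", "almost always"),
                ordered = TRUE)                      # ordinal: order meaningful, distances are not
temperature_C <- c(10, 20, 30)                       # interval: differences meaningful, zero arbitrary
weight_kg <- c(50, 100, 150)                         # ratio: meaningful zero, so 100 is twice 50
usage[2] > usage[1]                                  # TRUE: ranks of an ordered factor can be compared
```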
Measuring U-boot submersion depth (Second World War): for some unknown reason the depth of the ship was registered as A + number, where A = 80 (achtzig) metres below the surface; so A+20 = -100 m, while A+40 = -120 m.
Forget about A and treat the scale as if -80 m were ZERO. A40 is 20 metres deeper than A20 (and 40 is twice 20), BUT is A40 100% deeper than A20?
The U-boot submersion scale with zero at -80 m is an interval scale, while the `normal scale’ with zero at the surface (zero depth) gives 120/100 = 1.2, which means that A40 is only 20% deeper than A20…
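The same arithmetic as a quick R check (the two readings are just the values from the example above):

```r
# A-scale readings vs true depth below the surface
a_reading  <- c(A20 = 20, A40 = 40)
true_depth <- 80 + a_reading                   # move the zero from -80 m to the surface
diff(true_depth)                               # A40 is 20 m deeper than A20
unname(true_depth["A40"] / true_depth["A20"])  # 1.2, i.e. 20% deeper, not 100%
```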
A popular scale used to survey opinions/beliefs/intentions etc.:
5-point Likert scale, example for satisfaction with X:
Highly dissatisfied–Dissatisfied–Neither dissatisfied nor satisfied–Satisfied–Highly satisfied
Typically 5 or 7 values. The midpoint value (‘Neither dissatisfied nor satisfied’) is usually interpreted as ‘no value’ (i.e. ‘have no opinion’, ‘do not care’, ‘do not know what X is’, etc.)
The Likert scale is definitely ordinal, but it can be regarded as interval.
Methods of analysis: % Highly Satisfied (always valid); mean of scores (sometimes valid, i.e. only if the scale is treated as interval)
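Both methods in a minimal R sketch; the 50 responses below are randomly generated, purely for illustration.

```r
# Made-up Likert responses, levels as in the example above
levs <- c("Highly dissatisfied", "Dissatisfied",
          "Neither dissatisfied nor satisfied", "Satisfied", "Highly satisfied")
set.seed(123)
answers <- factor(sample(levs, 50, replace = TRUE), levels = levs, ordered = TRUE)

mean(answers == "Highly satisfied") * 100  # % Highly Satisfied (always valid)
mean(as.integer(answers))                  # mean score on 1..5 (valid only if treated as interval)
```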
WVS survey (https://www.worldvaluessurvey.org/wvs.jsp):
Generally speaking, would you say that most people can be trusted or that you can’t be too careful in dealing with people?
Most people can be trusted
Can’t be too careful
Type of scale?
typical EDA analysis: mean, spread (how much the values differ), untypical values, distribution of values (basically a value → frequency function); see the R sketch after this list
typical EDA analysis: change over time
many elements measured many times = panel data set
spatial data set = geocoordinates added (data + geocoordinate stamp = where)
typical EDA analysis: spatial distribution (a place → value function)
Analysing spatial data is `advanced’
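A sketch of the basic EDA summaries in R, on a simulated numeric variable (e.g. heights of 200 hypothetical students):

```r
# Simulated data, for illustration only
set.seed(42)
x <- rnorm(200, mean = 170, sd = 10)

mean(x); sd(x)  # centre and spread
summary(x)      # quartiles help to spot untypical values
hist(x)         # distribution of values (value -> frequency)
boxplot(x)      # another quick look at untypical values (outliers)
```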
observational study = all elements in a population are measured the same way.
experimental study = the elements are assigned to groups; some treatment is applied to one of the groups, while the other group does not receive the treatment. This is called a controlled experiment.
random controlled experiment (RCE): assignment to groups is random
Example (medical statistics):
Does fluorine in the water cause cancer? Burke and Yiamouyannis considered 10 fluoridated and 10 non-fluoridated US towns. In the fluoridated towns the cancer mortality rate increased by 20% (between 1950 and 1970), while in the non-fluoridated towns the increase was only 10%. Does this confirm that fluoridation causes cancer? Unfortunately not…
Exposure/Treatment = cause
The death rate depends on age (old people are more likely to die), sex (males die earlier than females) and race (non-white people die earlier in the US). Oldham and Newell analysed the age-sex-race structure in the 20 US cities studied by Burke and Yiamouyannis and found that the increase in mortality (including the increase in mortality due to cancer) can be attributed to changes in the age-sex-race structure.
Drinking coffee causes higher exam scores
A study was conducted in which students were asked how much coffee they drank during the exam session. Coffee consumption was compared with exam results. The mean exam scores in the heavy-coffee-drinkers group were higher than in the light-drinkers group. Does this prove that drinking a lot of coffee improves exam scores?
Unfortunately no…
The general design of an RCE:
It can be assumed that apart from coffee, the exam result is influenced by, for example, an innate intellectual predisposition. To control this variable (to control means to keep its value fixed), the group of students can be divided randomly. As a result the average level of predisposition will be similar in both groups. Students from the experimental group are asked to drink 1 litre of coffee a day while students in the control group are given 1 litre of water. Suppose the average results in the group of students drinking 1 litre of coffee were higher than in the group drinking water. Does the above result confirm the relationship between coffee consumption and exam scores?
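A sketch of how such an RCE could be analysed in R; the group sizes, effect sizes and scores are all made up for illustration.

```r
# Simulated RCE: random assignment balances the unobserved predisposition
set.seed(7)
n <- 100
group   <- sample(rep(c("coffee", "water"), each = n / 2))  # random assignment to groups
ability <- rnorm(n)                                         # innate predisposition (unobserved)
score   <- 60 + 5 * (group == "coffee") + 10 * ability + rnorm(n, sd = 5)  # assumed +5 point effect

t.test(score ~ group)  # compare mean exam scores between the two groups
```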
Confirming causality with observational data is difficult :-)
In economics 99% of the data is observational. RCE data is impossible to obtain or is deemed artificial (not real)… That is why we will not deal with RCEs any more
Population = data set
A data set is an n × m matrix where each row is an observation (a set of measurements of each variable) and each column is a variable
If m = 1 univariate data set
if m = 2 bivariate data set
if m > 2 multivariate data set
aggregate data into tables (simple)
draw graphs (charts)
compute some parameters (see the R sketch below)
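A toy example of such a data set in R (all values made up), together with the three operations listed above:

```r
# Rows are observations (students), columns are variables: a multivariate data set (m > 2)
students <- data.frame(
  age  = c(20, 21, 23, 20, 22),
  sex  = c("F", "M", "F", "F", "M"),
  shoe = c(38, 43, 37, 39, 42)
)
dim(students)                         # n x m: 5 observations, 3 variables
table(students$sex)                   # aggregate data into a table
barplot(table(students$sex))          # draw a graph
mean(students$age); sd(students$age)  # compute some parameters
```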
Census (rare)
Register (data gathered for some other purposes; demographic data for example)
Sample (cheaper than census)
Usually we do not gather economic data on our own but obtain it from the databases of statistical offices or similar institutions
[more in separate document]
Five stages of statistical data analysis:
Usually students’ attention is concentrated almost exclusively on stages 2 and 3. As a result statistics is regarded as part of mathematics, and thus 100% reliable, while in reality it is not. Stages 1 and 5 are often more of an art than a science, and if one does not know the rules of these stages one can easily put excessive trust in the final outcome.
Less theory, more practice, and common sense.
In 3 words: data + procedures (theory of statistics) + tools
described above
some will be explained
spreadsheets (Excel)
store data + transform data + apply procedures + copy/paste results
Actually not a complete statistical program. What is missing:
lack of a built-in missing value (for contrast, see the R sketch after this list)
many procedures are unavailable (ANOVA for example) or cumbersome to use (chi-squared test of independence for example)
poor accuracy/unreliable results
Poor automation. Usually one has to do a lot of manual copy-pasting and/or mouse clicking/moving
Rule of thumb: sufficient for economic statistics; insufficient for other domains. IMO: the sooner someone learns something else, the better
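For contrast with the first point above, a tiny R illustration of a built-in missing value (toy numbers):

```r
# R has a built-in missing value (NA); the analyst decides explicitly how to treat it
x <- c(2, 4, NA, 8)
mean(x)                # NA: missing values are not silently ignored
mean(x, na.rm = TRUE)  # explicit decision to drop the missing value
```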
SPSS/JASP
SPSS is commercial and expensive/JASP is free. Psychology/sociology oriented
Gretl
Econometrics (open software)
R is both a programming language for statistical computing and graphics and a piece of software (i.e. an application) that executes programs written in R. R was developed in the mid-90s at the University of Auckland (New Zealand).
Since then R has become one of the dominant software environments for data analysis and is used by a variety of scientific disciplines.
BTW, why such a strange name (R)? A long time ago it was popular to use short names for computer languages (C for example). At AT&T Bell Labs (John Chambers), in the mid-70s, a language oriented towards statistical computing was developed and called S (from Statistics). R is the letter before S in the alphabet.
RStudio is an environment through which to use R. In RStudio one can simultaneously write code, execute it, manage data, get help and view plots. RStudio is a commercial product distributed under a dual-licence system by RStudio, Inc. A key developer at RStudio is Hadley Wickham, another brilliant New Zealander (cf Hadley Wickham).
Microsoft has recently invested heavily in R development. It bought Revolution Analytics, a key developer of R and provider of commercial versions of the system. With MS support the system is expected to gain more popularity (for example through integration with popular MS products)
Abandoning the habit of secrecy in favor of process transparency and peer review was the crucial step by which alchemy became chemistry. (Eric S. Raymond, The Art of UNIX Programming, Addison-Wesley)
Replicability vs Reproducibility
Hot topic: googling ‘reproducible research’ returns about 158000 results
Replicability: an independent experiment targeting the same question will produce a result consistent with the original study.
Reproducibility: ability to repeat the experiment with exactly the same outcome as originally reported [description of method/code/data is needed to do so].
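In the computational setting, reproducibility starts with small habits; one minimal R illustration is fixing the random seed so that a ‘random’ result is exactly repeatable:

```r
# Without set.seed() each run gives different random numbers; with it the result repeats exactly
set.seed(2024)
mean(rnorm(100))  # anyone running these two lines gets exactly the same value
```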
Computational science is facing a credibility crisis: it’s impossible to verify most of the computational results presented at conferences and in papers today. (Donoho D. et al 2009)
Use Excel for data cleaning & descriptive statistics (Excel handles missing data inconsistently and sometimes incorrectly; many common functions are poor or missing in Excel).
Use SPSS/SAS/Stata in point-and-click mode to run serious statistical analyses.
Prepare report/paper: copy and paste output to Word/OpenOffice, add description.
Send to publisher (repeat 1–4 if returned for revision).
Problems
Tedious/time-wasting/costly.
Even small data/method change requires extensive recomputation effort/careful report/paper revision and update.
Error-prone: difficult to record/remember a ‘click history’.
Famous example: the Reinhart and Rogoff controversy. Their claim: countries with a very high debt-to-GDP ratio suffer from low growth. However, the study suffers from serious but easily identifiable flaws, which were discovered when RR published the dataset they used in their analysis (cf Growth_in_a_Time_of_Debt)
Abandon spreadsheets.
Abandon point-and-click mode. Use statistical scripting languages and run program/scripts.
Benefits
Improved: reliability, transparency, automation, maintainability. Lower costs (in the long run).
Solves the problems of steps 1–2 but not of steps 3–4.
Problems: steeper learning curve. Perhaps higher costs in the short run. Duplication of effort (or a mess if scripts/programs are poorly documented).
Literate programming concept: Code and description in one document. Create software as works of literature, by embedding source code inside descriptive text, rather than the reverse (as in most programming languages), in an order that is convenient for human readers.
A program is like a WEB, tangled (turned into compilable code) and weaved (turned into a document), with relations and connections between the program parts. We express a program as a web of ideas. WEB is a combination of – a document formatting language and – a programming language.
The general idea of literate statistical programming mimics Knuth’s WEB system.
Statistical computing code is embedded inside the descriptive text. A literate statistical program is weaved (turned) into a report/paper by executing the code and inserting the results obtained; after data/method changes the report only needs to be re-weaved.
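A minimal, hypothetical R Markdown fragment illustrating the idea (the file name scores.csv and the variable names are made up): the Markdown text carries the narrative, the R chunk carries the computation, and knitting (weaving) the file executes the chunk and inserts its output into the report.

````
We compare the mean exam scores in the two groups.

```{r}
scores <- read.csv("scores.csv")      # hypothetical data file
t.test(score ~ group, data = scores)  # the output is inserted here when the document is knitted
```
````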
Solves the problems of steps 1–4.
Reliability: Easier to find/fix bugs. The results produced will not change when recomputed (in theory at least).
Efficiency: reuse allows one to avoid duplication of effort (payoff in the long run).
Transparency: increased citation rate, broader impact, improved institutional memory
Institutional memory is a collective set of facts, concepts, experiences and know-how held by a group of people.
Flexibility: When you don’t ‘point-and-click’ you gain many new analytic options.
Problems of LSP: many, including costs and the learning curve
Tools:
Document formatting language: LaTeX (not recommended) or Markdown (or many others, e.g. org-mode). LaTeX is a document preparation system/document markup language. Markdown: a lightweight document markup language based on e-mail text formatting conventions. Easy to write, read and publish as-is.
Program language: R
Learning resources
bookdown: Authoring Books and Technical Documents with R Markdown
Supplementary resources to my lecture (slides/data/R scripts etc) are available at: https://github.com/hrpunio/Z-MISC/tree/master/Erasmus/2019/Batumi
Data banks