Getting Started With Data Analysis and R

Why Study Statistics? Why Use R to do it?

Two Key Ideas

Statistical Analysis
The software to do it
The two ideas are separate but connected:

[Images: cake mix + oven = cake]

The oven bakes the cake mixture
You need both to get a result, but
- The cake mix can’t cook things
- You can’t eat the oven!
They are distinct things, but connected

Software \(\leftrightarrow\) Statistics

Cake mix \(\leftrightarrow\) Statistical method
Oven \(\leftrightarrow\) Statistical Software
Cake \(\leftrightarrow\) Result of analysis
… and beyond the pictures above
Decoration on cake \(\leftrightarrow\) Your interpretation

Why Statistical Analysis

Why bother with it?

Why do you do statistics?
- Why don’t researchers just use common sense?

BUT

Is it really plausible to think that a “common sense” approach is very trustworthy?

The Belief Bias Effect Situation A

A valid argument where the conclusion is believable:
- No cigarettes are inexpensive (Premise 1)
- Some addictive things are inexpensive (Premise 2)
- Therefore, some addictive things are not cigarettes (Conclusion A)

The Belief Bias Effect Situation B

A valid argument where the conclusion is less believable:
- Cataracts are more prevalent in elderly people (Premise 1)
- Cigarette smoking reduces life expectancy (Premise 2)
- Therefore at a population level higher smoking rates correlate with lower incidence of cataracts (Conclusion B)

Commentary

Both arguments are valid in terms of consistency. In the second argument, though, there are good reasons to think that the conclusion is incorrect (smoking is bad for you, right?). Even so, the conclusion is a logical consequence of the premises.

The Belief Bias Effect Situation C

An invalid argument that has a believable conclusion:
- No addictive things are inexpensive (Premise 1)
- Some cigarettes are inexpensive (Premise 2)
- Therefore, some addictive things are not cigarettes (Conclusion C)
Conclusion is true, but doesn’t follow from premises 1 and 2 alone.

The Belief Bias Effect Situation D

An invalid argument with an unbelievable conclusion:
- No cigarettes are inexpensive (Premise 1)
- Some addictive things are inexpensive (Premise 2)
- Therefore, some cigarettes are not addictive (Conclusion D)
Conclusion isn’t true, and also does not follow from premises 1 and 2 alone.

In an Ideal World

If common sense was a reliable guide

	Conclusion ‘feels’ true (A/C)	Conclusion ‘feels’ false (B/D)
Argument valid (A/B)	100% of people say ‘valid’	100% of people say ‘valid’
Argument invalid (C/D)	0% of people say ‘valid’	0% of people say ‘valid’

But

An actual study of this by Evans, Barston, and Pollard¹ gave this result

	Conclusion ‘feels’ true (A/C)	Conclusion ‘feels’ false (B/D)
Argument valid (A/B)	92% of people say ‘valid’	46% of people say ‘valid’
Argument invalid (C/D)	92% of people say ‘valid’	8% of people say ‘valid’

What does this show?

People presented with a correct argument that contradicts their pre-existing beliefs find it hard even to perceive it as valid (only 46% did so).
People presented with a wrong argument that agrees with their pre-existing biases rarely see that the argument is not valid (people in the study got that wrong 92% of the time!)
TL;DR² People have a tendency to ‘believe what they want to believe’, regardless of the underlying logic.
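
As a rough sketch, the percentages in the table above can be entered in R and compared with the ‘ideal world’ table. The object names (said_valid, ideal, error_rate) are made up for illustration, and the calculation assumes everyone answered either ‘valid’ or ‘invalid’.

```r
# Percentage of participants who said 'valid', taken from the table above.
said_valid <- matrix(
  c(92, 46,
    92,  8),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    argument   = c("valid (A/B)", "invalid (C/D)"),
    conclusion = c("feels true (A/C)", "feels false (B/D)")
  )
)

# The ideal-world table: 100% for valid arguments, 0% for invalid ones.
ideal <- matrix(
  c(100, 100,
      0,   0),
  nrow = 2, byrow = TRUE,
  dimnames = dimnames(said_valid)
)

# Percentage of people who judged each kind of argument wrongly
# (assumes every participant answered either 'valid' or 'invalid').
error_rate <- abs(ideal - said_valid)
print(error_rate)
```

The two large entries in error_rate are the cells where belief and logic pull in opposite directions.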

Commentary

It’s just too easy for us to “believe what we want to believe”; so if we want to believe in the research data instead, we’re going to need a bit of help to keep our personal biases under control. That’s what statistics does: it helps keep us consistent.

NB - Thanks to Danielle Navarro and Emily Kothe for some material here - see https://learningstatisticswithr.com/book/index.html (in particular chapter 1)

Other Reasons

The argument above isn’t the only one, but I personally think it is a compelling one.
Also, we have a tendency to see patterns in random data
Use tests to check whether the patterns could have occurred at random
Visualising data
- See trends
- Identify unusual observations
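
A minimal sketch of the ‘patterns in random data’ point, using simulated values rather than real data. The seed and sample size are arbitrary choices, and cor.test() is just one possible test, not necessarily the one used later in the course.

```r
set.seed(1)  # arbitrary seed, so the example can be repeated

# Two variables that are unrelated by construction.
x <- rnorm(20)
y <- rnorm(20)

# A scatterplot of pure noise can still look as if it has a trend.
plot(x, y, main = "Two unrelated random variables")

# cor.test() asks whether a correlation of this size could plausibly
# have arisen by chance alone.
result <- cor.test(x, y)
print(result$estimate)  # the sample correlation
print(result$p.value)   # a large p-value means chance is a plausible explanation
```

Changing the seed and re-running shows how much the apparent pattern varies from one random sample to the next.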

Why R?

Broad set of reasons (NB: this section is less philosophical!)

IT’S FREE

Works on a variety of operating systems (Mac/Windows/Linux)
Good graphics
Extensible
Programmable

IT’S FREE

Problems with Spreadsheets

Doing statistics in a spreadsheet (e.g. MS Excel) is generally a bad idea in the long run.
OK for entering data but…
- Very limited in terms of the analyses they allow you to do
- Graphics not good for (social) scientific work
- More business-oriented graphics than statistical plots
- e.g. no density plots
- Harder to reproduce than a scripted language
- Autoformatting (e.g. the issue with turning non-date things into dates) - https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989
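
For comparison, a density plot takes only a few lines of base R. This is a sketch on simulated data; the variable name scores and its mean and standard deviation are invented for illustration.

```r
set.seed(42)  # arbitrary seed

# Hypothetical data: 200 simulated measurements.
scores <- rnorm(200, mean = 50, sd = 10)

# density() computes a kernel density estimate; plot() draws it.
d <- density(scores)
plot(d, main = "Density plot of simulated scores", xlab = "Score")

# rug() marks the individual observations along the horizontal axis.
rug(scores)
```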

Proprietary Stats Software v. R
(i.e. paid-for licence in some form)

Avoiding proprietary software is a very good idea!
Some of it is good, but very expensive
Open source alternatives exist
- Mainly R and Python
- R was specifically designed for stats
- Although Python is also good (I sometimes teach it)
Open source also makes ‘under the bonnet’ code used open to scrutiny

Extensibility

R can load packages which extend its functionality
These are often also written in R. Examples include:
- sf - adds geographical data handling
- tmap - interactive map drawing
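
A minimal sketch of how packages are typically installed and loaded. The install line is commented out because it needs an internet connection and only has to be run once per machine.

```r
# Install once per machine (needs an internet connection):
# install.packages(c("sf", "tmap"))

# Check which of the packages are installed, without attaching them.
pkgs <- c("sf", "tmap")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Attach a package for use in the current session if it is available.
if (installed[["sf"]]) {
  library(sf)
}
```

library() has to be called in every new R session, whereas install.packages() does not.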

Reproducibility

R is a programming language
Thus, analyses are files of scripted procedures
This is useful, as you have a precise record of what you have done
- Your analysis is then open to scrutiny and checking
- You can refer back to it yourself
- Easy to modify if you need to do a similar analysis later
- Easy to share with others
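
As an illustration, a scripted analysis might look like the sketch below. The file name analysis.R, the variable names, and the simulated data are all hypothetical; a real script would read in a data file instead.

```r
# analysis.R - a hypothetical script file

set.seed(2021)  # fixing the seed makes the simulated data repeatable

# Simulated data standing in for a real data file.
heights <- rnorm(30, mean = 170, sd = 8)

# Each step of the analysis is written down, so it can be re-run and checked.
height_summary <- summary(heights)
print(height_summary)
```

Anyone with the script can re-run it, for example with source("analysis.R"), and should get the same numbers.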

Practicalities

Getting R Set Up

Again, the Navarro text is helpful here:
- https://learningstatisticswithr.com/book/introR.html#gettingR
In particular, note that the version of R referred to in the links is now out of date, but looking for R-4.1.1.pkg (Mac) or the link “Download R 4.1.1 for Windows” (Windows) will get the current one.
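
Once R is installed, one way to check which version is running is sketched below. The comparison against 4.1.1 simply reuses the version number mentioned above.

```r
# Print the version of R that is currently running.
print(R.version.string)

# getRversion() returns the version in a form that can be compared.
is_recent <- getRversion() >= "4.1.1"
print(is_recent)
```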

RStudio

A useful tool - I recommend using R via RStudio
Download and install (it’s free) - https://www.rstudio.com/products/rstudio/download/
More info here: https://learningstatisticswithr.com/book/introR.html#installing-r-on-a-linux-computer

Actually Doing Some R

Two useful starting points
https://learningstatisticswithr.com/book/introR.html#firstcommand - to end of chapter 3
https://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf - also to end of Chapter 3
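
To give a flavour of what those chapters cover, here are a few first commands to type at the R prompt. The variable names and numbers are illustrative only, not taken from either text.

```r
# R can be used as a calculator.
10 + 20

# Store values in variables with the assignment operator <-
sales <- 350
royalty <- 7
revenue <- sales * royalty
revenue  # 2450

# A vector holds several values; functions operate on the whole vector.
monthly_sales <- c(0, 100, 200, 50)
mean(monthly_sales)  # 87.5
```

Typing the name of a variable on its own, as with revenue above, prints its value.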

Conclusion

💡 New ideas

New general ideas
- why do statistics at all?
New techniques
- why use R to do statistics?
Practical issues
- Installing R and RStudio
Next lecture - Exploring data with graphics
