R and Statistics Crash Course

Day One Notes, 20/03/2023 (Spring Equinox!)

Notes by Jody Holland for classes by Prof. Fernando Tuya. There may be typos, it may not always make sense, but hopefully it will be a useful reference resource in the future.

Why Stats, Why R?

  • Stats are painful, but good R skills make them easier
  • You don’t have to know much about how the statistical models work, just how to use them
  • Let us focus on stats models that are intuitive, but have a lot of conceptual weight:
    • You don’t have to be a mechanic to drive a car; we need to find that “Goldilocks Zone” between the maths and operationalisation
  • Important that your work is repeatable:
    • The value of open data (most papers these days want to see your code and your data)
  • The community for R is amazing! Lots of packages and community driven innovation!
    • Millions of packages, maybe too many
  • Learning R is tough. It will feel weird at first, but you will hit the speedy upwards curve.
    • Though you gotta work bitch. R is hard and requires self-teaching and discipline

    • Use Youtube (and TikTok) xxx

What are we doing?

  • We are going to look at several families of stats models:

    • General Linear Model:

      • ANOVA

      • ANCOVA

      • Simple (bivariate) regression

      • Multiple regression (OLS)

    • Generalised Linear Model:

      • Poisson

      • Logistic

      • Multivariate GLM

    • Mixed Models and Model Selection:

      • AIC

      • BIC

      • Tests of Fit/Robustness

Process of Quantitative Analysis

  1. Go outside, do some research, make some inferences.

  2. Build some thesis, looking at how some forces influence another

  3. Write this as a hypothesis, wherein x leads to y. Each hypothesis has its antithesis, the null hypothesis (H0), which is the statement that the predicted relationship does not occur

  4. Collect data through quantitative measurements

  5. Store data in database

  6. Apply statistical models aiming to test the hypothesis

  7. Based on the result, reject or fail to reject the null hypothesis

When you start a new study you have to ask yourself many questions with regards to the collection of data, thinking about the space, time-span, and manner in which you make your observations. This is all included and justified in the “experimental design section”.

Basic Concepts

Below are some basic concepts of stats.

Statistical Inference

We need stats because we can’t sample every member of a population; we can only sample a subset and extrapolate our results

Random Sampling

In theory, every data point should be independent and selected at random, free of researcher bias. In practice this is very hard.
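
As a minimal sketch (my own illustration, not from the class), here is one way to draw a simple random sample in R, using the built-in iris data:

# sketch: simple random sampling of 20 rows from the built-in iris data
set.seed(42) # make the random draw reproducible
random_rows = sample(nrow(iris), 20) # 20 row indices chosen at random
random_sample = iris[random_rows, ]
head(random_sample)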

Replicate

The lowest level in your design. I would call this the “unit of analysis”

Variables

Any type of measurement that you use in your data:

  • Dependent Variable: Response Variable: Y

  • Independent Variable: Predictor Variable: Explanatory Variable: X

  • These variables can be categories; there are ways to make these pseudo-numeric, such as using dummy variables (assigning each data point either a 1 or a 0 to signify whether it is a member of a given group; see the short sketch after this list)

  • Categorical variables can be configured into “treatments”, which denote particular groupings or levels of a categorical variable
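
As a quick illustration (a hypothetical toy data frame, not from the class), model.matrix() shows the 0/1 dummy coding that R builds from a categorical variable:

# sketch: dummy coding of a toy categorical variable
toy = data.frame(id = 1:3,
                 category = factor(c("Tree", "Tree", "Lamp")),
                 height = c(10, 6, 8))
# one 0/1 column per non-reference level of the factor
model.matrix(~ category, data = toy)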

Error

The degree to which our predictors are unable to determine the dependent variable.

Experimental Unit

Different areas or time-spans wherein your data are collected. Having multiple areas wherein data are gathered demonstrates replicability in your experiment

Confounding relationships

It may sometimes appear that a relationship is present in the results. However, it may be that both your dependent and independent variables are actually caused by a third variable that is not included in the model. Confounding variables are hard to find; you have to do a lot of research and accept criticism wholeheartedly
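
Here is a minimal simulation sketch (my own illustration, with made-up variables x, y, and z) of how a hidden third variable can fake a relationship:

# sketch: a confounder z drives both x and y
set.seed(7)
z = rnorm(200) # unobserved third variable
x = 2 * z + rnorm(200) # x is caused by z
y = 3 * z + rnorm(200) # y is caused by z, not by x
cor(x, y) # strong correlation despite no direct x -> y effect
summary(lm(y ~ x))$coefficients # naive model: x looks "significant"
summary(lm(y ~ x + z))$coefficients # controlling for z, the x effect vanishes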

Sampling

Sampling is working out what proportion of the population you have to sample in order to achieve a given certainty in your results’ ability to be extrapolated to the wider population

In general, the smaller your sample size, the greater the standard error.

However, the relationship between sample size and standard error is not linear; there are diminishing returns when increasing the sample size to improve your accuracy
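
A quick sketch of this (assuming a made-up population standard deviation of 10): the standard error of the mean shrinks with the square root of n, so doubling the sample size does not halve the error.

# sketch: standard error of the mean falls with sqrt(n), not with n
sigma = 10 # assumed population standard deviation
n = c(10, 20, 40, 80, 160, 320)
se = sigma / sqrt(n) # diminishing returns as n grows
round(data.frame(n, se), 2)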

We must also consider the scale of the differences in our findings. Say our sample is small but the differences we find between groups are large; then maybe having a smaller sample is OK. However, if the differences in the results are more nuanced, then to truly conclude that the slight differences that do exist can be extrapolated to the population, we need a large sample.

Confidence

Sampling plays an important role in determining the confidence of our results and subsequent inferences. In many circles, confidence is expressed through the significance level alpha (α): α of 0.05 corresponds to 95% confidence, α of 0.01 to 99% confidence, and so on

P Values

Be careful about overusing the p-value (the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true). Use it in the context of the effect size. High significance (low p) plus a high effect size… then you’re cooking. A small effect size, however, could signify p-hacking/overestimating results based on a very large sample (see the sketch below)
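
A sketch of this (simulated data, my own illustration): with a huge sample, a trivially small difference between two groups can still return a “significant” p-value, which is why the effect size matters.

# sketch: tiny effect, big sample, small p-value
set.seed(1)
group_a = rnorm(10000, mean = 50, sd = 10)
group_b = rnorm(10000, mean = 50.5, sd = 10)
t.test(group_a, group_b)$p.value # likely "significant" despite the tiny difference
# Cohen's d: mean difference scaled by the pooled standard deviation
(mean(group_a) - mean(group_b)) / sqrt((var(group_a) + var(group_b)) / 2)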

General Linear Model

The General Linear Model breaks down some of the differences between continuous vs categorical independent variables in stats tests.

Traditionally, there are clear differences between ANOVA and OLS. In OLS both X and Y are continuous, but in ANOVA X is categorical. However, if you code the categories in ANOVA as dummy variables and then run an OLS, this produces results equivalent to the ANOVA

Before transformation into dummy variables:

  ID  Category  Height
   1  Tree          10
   2  Tree           6
   3  Lamp           8

After transformation into dummy variables:

  ID  Tree  Lamp  Height
   1     1     0      10
   2     1     0       6
   3     0     1       8

Then you can smash this into a standard multiple regression like so:

\[ height = c + b_1 \cdot tree + b_2 \cdot lamp + e \]

Here’s an example using the continent categorical variable in the gapminder data and its effect on life expectancy, using the lm() function

# load tidyverse
library(tidyverse)
# load gapminder for 2007
data_gm = gapminder::gapminder %>%
  filter(year == 2007)
# lets have a look at the data
data_gm
## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # … with 132 more rows

Plotting this data looks as follows

# assign data and variable
ggplot(data = data_gm,
       aes(y = lifeExp, x = continent)) +
  geom_point() +
  labs(title = "Life Expentancy and Continents",
       x = "Continent",
       y = "Life Expectancy")

We need to break down the continent category into columns of dummy variables

# using pivot_wider() to make dummies
data_gm_dummies = data_gm %>%
  pivot_wider(names_from = continent, # get the names
              values_from = continent,
              values_fn = length,
              values_fill = 0) %>%
  select(c("country", 
           "lifeExp",
           "Asia",
           "Europe",
           "Africa",
           "Americas",
           "Oceania")) # select the continent cols
# lets now have a look
data_gm_dummies
## # A tibble: 142 × 7
##    country     lifeExp  Asia Europe Africa Americas Oceania
##    <fct>         <dbl> <int>  <int>  <int>    <int>   <int>
##  1 Afghanistan    43.8     1      0      0        0       0
##  2 Albania        76.4     0      1      0        0       0
##  3 Algeria        72.3     0      0      1        0       0
##  4 Angola         42.7     0      0      1        0       0
##  5 Argentina      75.3     0      0      0        1       0
##  6 Australia      81.2     0      0      0        0       1
##  7 Austria        79.8     0      1      0        0       0
##  8 Bahrain        75.6     1      0      0        0       0
##  9 Bangladesh     64.1     1      0      0        0       0
## 10 Belgium        79.4     0      1      0        0       0
## # … with 132 more rows

Now let us throw together a linear model with this info

# use the lm function
# include all five dummy columns (one will be dropped as redundant)
model = lm(data = data_gm_dummies,
            formula = lifeExp ~ 
             Asia + 
             Europe + 
             Africa + 
             Americas + 
             Oceania)
# export summary of regression
summary(model)
## 
## Call:
## lm(formula = lifeExp ~ Asia + Europe + Africa + Americas + Oceania, 
##     data = data_gm_dummies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.9005  -4.0399   0.2565   3.3840  21.6360 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   80.720      5.229  15.438  < 2e-16 ***
## Asia          -9.991      5.385  -1.855   0.0657 .  
## Europe        -3.071      5.400  -0.569   0.5705    
## Africa       -25.913      5.328  -4.863 3.12e-06 ***
## Americas      -7.111      5.434  -1.309   0.1928    
## Oceania           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.395 on 137 degrees of freedom
## Multiple R-squared:  0.6355, Adjusted R-squared:  0.6249 
## F-statistic: 59.71 on 4 and 137 DF,  p-value: < 2.2e-16

Looking at these results, what is interesting is that the intercept represents Oceania (its mean life expectancy), and each continent coefficient below represents the effect on life expectancy of being in that continent relative to Oceania. The Oceania dummy itself is reported as NA because, together with the other four dummies, it would be perfectly collinear with the intercept, so R drops it.
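
As a side note (not shown in class), lm() dummy-codes factors automatically, so the manual pivot_wider() step is optional. With the factor used directly, Africa (the first factor level) becomes the reference category; relevel() to Oceania reproduces the coefficients above. A minimal sketch:

# sketch: same model using the continent factor directly
model_factor = lm(lifeExp ~ continent, data = data_gm)
# relevel so that Oceania is the reference, matching the dummy model above
model_releveled = lm(lifeExp ~ relevel(continent, ref = "Oceania"),
                     data = data_gm)
summary(model_releveled)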

We can also extract a conventional ANOVA table from the model

# generate anova table
anova(model)
## Analysis of Variance Table
## 
## Response: lifeExp
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## Asia        1  595.3   595.3  10.8863  0.001235 ** 
## Europe      1 5732.0  5732.0 104.8280 < 2.2e-16 ***
## Africa      1 6639.8  6639.8 121.4290 < 2.2e-16 ***
## Americas    1   93.7    93.7   1.7127  0.192825    
## Residuals 137 7491.2    54.7                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It appears that the difference in life expectancy between the Americas and Oceania is not statistically significant!
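
To compare specific pairs of continents (e.g. the Americas vs Oceania) directly, one option (not covered in class, just a pointer) is a post-hoc Tukey test on the factor-based model:

# sketch: pairwise continent comparisons via Tukey's HSD
model_aov = aov(lifeExp ~ continent, data = data_gm)
TukeyHSD(model_aov)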