Day One Notes, 20/03/2023 (Spring Equinox!)
Notes by Jody Holland for classes by Prof. Fernando Tuya. There may be typos, it may not always make sense, but hopefully it will be a useful reference resource in the future.
Though you gotta work bitch. R is hard and requires self-teaching and discipline
Use Youtube (and TikTok) xxx
We are going to look at several families of stats models:
General Linear Model:
ANOVA
ANCOVA
Simple (bivariate)
Multiple regression (OLS)
Generalised Linear Model:
Poisson
Logistic
Multivariate GLM
Mixed Models! Model Selection:
AIC
BIC
Tests of Fit/Robustness
Go outside, do some research, make some inferences.
Build some thesis, looking at how some forces influence another
Write this as a hypothesis, wherein x leads to y. Hypotheses lead to their antithesis the null hypothesis (Ho), which is the statement wherein the predicted relationship doesn’t occur
Collect data through quantitative measurements
Store data in database
Apply statistical models aiming to test the hypothesis
Based on result, accept or reject the null hypothesis
When you start a new study you have to ask yourself many questions with regards to the collection of data, thinking about the space, time-span, and manner in which you make your observations. This is all included and justified in the “experimental design section”.
Below are some basic concepts of stats.
We need stats because we can’t do a universal sample of every possible data set, we can only make a subsection an extrapolate our results
In theory every data point should be independent, it should be random in the manner they are selected, independent of researcher bias. In practice this is very hard.
The lowest level in your design. I would call this the “unit of analysis”
Any type of measurement that you use in your data:
Dependent Variable: Response Variable: Y
Independent Variable: Predictor Variable: Explanatory Variable: X
These variables can be categories, there are ways to make these pseudo-numeric, such as using dummy variables (assigning each data point either a 1 or 0 to signify if its a member of a given group)
Categorical Variables can be configured into “treatments” which denotes various groupings of categorical variables
The degree to which our predictors are unable to determine the dependent variable.
Different areas or time-spans wherein your data is collected. Having multiple areas wherein data is gathered shows replicatability in your experiment
It may appear sometimes that there is a relationship is present in the results. However what it might be is that both your dependent and independent variables are actually mutually caused by another, not included third variable. Confounding variables are hard to find, you have to do a lot of research and accept criticism wholeheartedly
Sampling is working out what proportion of the population you have to sample in order to achieve a given certainity in your results ability to be extrapolated to the wider population
In general, the smaller your sample size, the greater the standard error.
However, the relationship between sample size and standard error is not linear, there are diminishing returns in increasing the sample size and improving your accuracy
We must also consider the scale of the differences in our findings. Say our sample is small but the differences we find between groups is large, then maybe having a smaller sample is OK. However, if the differences in the results are more nuance, to truly conclude that the slight differences that do exist can be extrapolated to the population, we need a large sample.
Sampling plays an important role in determining the confidence of our results and subsequent inferences. In many circles, confidence is denoted using alpha/⍺. ⍺ of 0.05 means 95% confidence, ⍺ of 0.01 means 99% and so on
Be careful about over using the P value (the probability you can reject the null hypothesis). Use it in the context of the effect size. High P + H effect size… then your cooking. Small effect size however could signify P-hacking/overestimating results based on large sampling
The General Linear Model breaks down some of the differences between continuous vs categorical independent variables in stats tests.
Traditionally, there is clear differences between ANOVA and OLS. In OLS both X and Y are continuous, but in ANOVA X is categorical. However if you code the categories in ANOVA as dummy variables then do an OLS. This produces results very similar to OLS
| ID | Category | Height |
|---|---|---|
| 1 | Tree | 10 |
| 2 | Tree | 6 |
| 3 | Lamp | 8 |
| ID | Tree | Lamp | Height |
|---|---|---|---|
| 1 | 1 | 0 | 10 |
| 2 | 1 | 0 | 6 |
| 3 | 0 | 1 | 8 |
Then you can smash this into a standard multivariate regression like so:
\[ height = c + tree + lamp + e \]
Here’s an example using the continent categorical variable in gapminder data and it’s effect on life expectancy using the lm() function
# load tidyverse
library(tidyverse)
# load gapminder for 2007
data_gm = gapminder::gapminder %>%
filter(year == 2007)
# lets have a look at the data
data_gm## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
## 7 Austria Europe 2007 79.8 8199783 36126.
## 8 Bahrain Asia 2007 75.6 708573 29796.
## 9 Bangladesh Asia 2007 64.1 150448339 1391.
## 10 Belgium Europe 2007 79.4 10392226 33693.
## # … with 132 more rows
Plotting this data looks as follows
# assign data and variable
ggplot(data = data_gm,
aes(y = lifeExp, x = continent)) +
geom_point() +
labs(title = "Life Expentancy and Continents",
x = "Continent",
y = "Life Expectancy")We need to break down the continent category into columns of dummy variables
# using pivot_wider() to make dummies
data_gm_dummies = data_gm %>%
pivot_wider(names_from = continent, # get the names
values_from = continent,
values_fn = length,
values_fill = 0) %>%
select(c("country",
"lifeExp",
"Asia",
"Europe",
"Africa",
"Americas",
"Oceania")) # select the continent cols
# lets now have a look
data_gm_dummies## # A tibble: 142 × 7
## country lifeExp Asia Europe Africa Americas Oceania
## <fct> <dbl> <int> <int> <int> <int> <int>
## 1 Afghanistan 43.8 1 0 0 0 0
## 2 Albania 76.4 0 1 0 0 0
## 3 Algeria 72.3 0 0 1 0 0
## 4 Angola 42.7 0 0 1 0 0
## 5 Argentina 75.3 0 0 0 1 0
## 6 Australia 81.2 0 0 0 0 1
## 7 Austria 79.8 0 1 0 0 0
## 8 Bahrain 75.6 1 0 0 0 0
## 9 Bangladesh 64.1 1 0 0 0 0
## 10 Belgium 79.4 0 1 0 0 0
## # … with 132 more rows
Now let us throw together a linear model with this info
# use the lm function
# use -1 to convert continents to dummies
model = lm(data = data_gm_dummies,
formula = lifeExp ~
Asia +
Europe +
Africa +
Americas +
Oceania)
# export summary of regression
summary(model)##
## Call:
## lm(formula = lifeExp ~ Asia + Europe + Africa + Americas + Oceania,
## data = data_gm_dummies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.9005 -4.0399 0.2565 3.3840 21.6360
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.720 5.229 15.438 < 2e-16 ***
## Asia -9.991 5.385 -1.855 0.0657 .
## Europe -3.071 5.400 -0.569 0.5705
## Africa -25.913 5.328 -4.863 3.12e-06 ***
## Americas -7.111 5.434 -1.309 0.1928
## Oceania NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.395 on 137 degrees of freedom
## Multiple R-squared: 0.6355, Adjusted R-squared: 0.6249
## F-statistic: 59.71 on 4 and 137 DF, p-value: < 2.2e-16
Looking at these results, what is interesting is the intercept represents Oceania, and each continent coefficient below represents the effect on life expectancy each being in each continent has relative to Oceania.
We can also extract a conventional ANOVA table from the model
# generate anova table
anova(model)## Analysis of Variance Table
##
## Response: lifeExp
## Df Sum Sq Mean Sq F value Pr(>F)
## Asia 1 595.3 595.3 10.8863 0.001235 **
## Europe 1 5732.0 5732.0 104.8280 < 2.2e-16 ***
## Africa 1 6639.8 6639.8 121.4290 < 2.2e-16 ***
## Americas 1 93.7 93.7 1.7127 0.192825
## Residuals 137 7491.2 54.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It appears that the difference between the Americas and Oceania is not that significant!