A workbook for this textbook.
library(tidyverse)  # tibble(), %>%, gather(), ggplot()
# Draw N values whose sample mean and sd are *exactly* mu and sd
rnorm_fixed = function(N, mu = 0, sd = 1) { scale(rnorm(N)) * sd + mu }
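As a quick sanity check (not in the original workbook), rnorm_fixed() should return draws whose sample mean and sd match the request exactly:
z <- rnorm_fixed(10, mu = 0.3, sd = 2)
c(mean(z), sd(z))  # exactly 0.3 and 2, up to floating point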
set.seed(40)
# Create the samples (use the same order as original book - y, x, then y2)
y  = rnorm_fixed(N = 50, mu = 0.3, sd = 2)
x  = rnorm_fixed(N = 50, mu = 0,   sd = 1)
y2 = rnorm_fixed(N = 50, mu = 0.5, sd = 1.5)
mydata_wide <- tibble(x=x, y=y, y2=y2)
mydata_long <- mydata_wide %>%
gather(group, value, x:y2)
## Warning: attributes are not identical across measure variables; they will be
## dropped
ggplot(mydata_long, aes(group, value, color=group))+
geom_jitter(width=0.2)
# Signed rank: rank the absolute values, then put the signs back
signed_rank = function(x) sign(x) * rank(abs(x))
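To see what signed_rank() computes, here is a small illustrative call (not in the original workbook):
signed_rank(c(-0.5, 2, -3, 1))
## [1] -1  3 -4  2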
sixx <- x[1:6]
sixy <- y[1:6]
plot(sixx, sixy, ylim = c(-3, 3))
abline(v = 0)
With one covariate you can view the response variable in a two-dimensional graph, but with any more covariates the relationship cannot be shown in 2D and you have to move to three (or more) dimensions.
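For the one-covariate case the fitted line can be drawn directly on the 2D scatterplot; a minimal sketch using the mydata_wide tibble created above:
ggplot(mydata_wide, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # straight-line fit in two dimensions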
To estimate linear models we will use the lm() function in R. It can be written like this:
lm(y ~ 1 + x)
##
## Call:
## lm(formula = y ~ 1 + x)
##
## Coefficients:
## (Intercept) x
## 0.3000 -0.4636
This fits a straight line between our sample data of x and y, applying the model \(y = \beta_0 + \beta_1 x\).
A key output of interest is the p-value.
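One way to pull that p-value out programmatically is from the coefficient table of the model summary (a sketch reusing the model above):
fit <- lm(y ~ 1 + x)
summary(fit)$coefficients["x", "Pr(>|t|)"]  # p-value for the slope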
Correlation measures the strength and direction of the association between two variables. The correlation coefficient (\(r\)) ranges from -1 to 1, where -1 and 1 indicate the strongest (perfectly negative and perfectly positive) linear relationships and 0 means there is no linear association between the variables.
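To make the definition concrete, \(r\) can be computed by hand as the sample covariance divided by the product of the standard deviations; a sketch using the x and y simulated above:
# Pearson's r by hand; should match cor(x, y)
sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
cor(x, y)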
Start with the Pearson correlation coefficient (R's built-in cor.test() function) and the equivalent linear model:
\(y = \beta_0 + \beta_1 x\)
\(H_0: \beta_1 = 0\)
The two tests are written using the following R code:
# Pearson's correlation
cor.test(y, x, method="pearson")
##
## Pearson's product-moment correlation
##
## data: y and x
## t = -1.6507, df = 48, p-value = 0.1053
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.47920849 0.04978276
## sample estimates:
## cor
## -0.2317767
# Equivalent linear model
lm(y ~ 1 + x) %>%
summary() %>%
print(digits = 8)
##
## Call:
## lm(formula = y ~ 1 + x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.33931981 -1.65931459 0.33492062 1.36293243 3.52139194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.30000000 0.27799190 1.07917 0.28591
## x -0.46355333 0.28081423 -1.65075 0.10532
##
## Residual standard error: 1.9657 on 48 degrees of freedom
## Multiple R-squared: 0.053720423, Adjusted R-squared: 0.034006266
## F-statistic: 2.7249667 on 1 and 48 DF, p-value: 0.10531916
The output shows that the correlation coefficient (r) has a p-value of 0.1053, which is exactly the same as the p-value for the slope of the linear model. In this case we would not reject the null hypothesis that there is no correlation between the two variables (at the 0.05 level of significance).
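A quick check (reusing the objects above) confirms the two p-values agree to machine precision:
p_cor <- cor.test(y, x, method = "pearson")$p.value
p_lm <- summary(lm(y ~ 1 + x))$coefficients["x", "Pr(>|t|)"]
all.equal(p_cor, p_lm)  # TRUE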
The main difference is that the linear model returns the slope of the relationship, \(\beta_1\) (which in this case is -0.4636), rather than the correlation coefficient, \(r\). The slope is usually much more interpretable and informative than the correlation coefficient.
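The two are directly linked: \(\beta_1 = r \, s_y / s_x\), and since the samples were fixed to \(s_x = 1\) and \(s_y = 2\), the slope here is simply twice \(r\). A minimal sketch verifying this with the simulated data:
cor(y, x) * sd(y) / sd(x)  # reproduces the fitted slope, about -0.4636
coef(lm(y ~ 1 + x))["x"]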