A workbook for this textbook.
library(tidyverse)  # tibble(), %>%, gather(), ggplot()
# Draw N values whose sample mean and sd are *exactly* mu and sd
rnorm_fixed = function(N, mu = 0, sd = 1) { scale(rnorm(N)) * sd + mu }
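As a quick sanity check (not in the original workbook), rnorm_fixed() should return draws whose sample mean and sd match the request exactly:
z <- rnorm_fixed(10, mu = 0.3, sd = 2)
c(mean(z), sd(z))  # exactly 0.3 and 2, up to floating point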
set.seed(40)
# Create the samples (use the same order as original book - y, x, then y2)
y  = rnorm_fixed(N = 50, mu = 0.3, sd = 2)
x  = rnorm_fixed(N = 50, mu = 0,   sd = 1)
y2 = rnorm_fixed(N = 50, mu = 0.5, sd = 1.5)
mydata_wide <- tibble(x=x, y=y, y2=y2)
mydata_long <- mydata_wide %>%
gather(group, value, x:y2)
## Warning: attributes are not identical across measure variables; they will be
## dropped
ggplot(mydata_long, aes(group, value, color=group))+
geom_jitter(width=0.2)
# Signed rank: rank the absolute values, then put the signs back
signed_rank = function(x) sign(x) * rank(abs(x))
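To see what signed_rank() computes, here is a small illustrative call (not in the original workbook):
signed_rank(c(-0.5, 2, -3, 1))
## [1] -1  3 -4  2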
sixx <- x[1:6]
sixy <- y[1:6]
plot(sixx, sixy, ylim = c(-3, 3))
abline(v = 0)
With one covariate you can view the response variable in a two-dimensional graph, but with any more covariates the relationship cannot be shown in 2D and you have to move to three (or more) dimensions.
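For the one-covariate case the fitted line can be drawn directly on the 2D scatterplot; a minimal sketch using the mydata_wide tibble created above:
ggplot(mydata_wide, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # straight-line fit in two dimensions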
To estimate linear models we will use the lm() function in R. It can be written like this:
lm(y ~ 1 + x)
##
## Call:
## lm(formula = y ~ 1 + x)
##
## Coefficients:
## (Intercept) x
## 0.3000 -0.4636
This fits a straight line between our sample data of x and y, applying the model \(y = \beta_0 + \beta_1 x\).
A key output of interest is the p-value.
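One way to pull that p-value out programmatically is from the coefficient table of the model summary (a sketch reusing the model above):
fit <- lm(y ~ 1 + x)
summary(fit)$coefficients["x", "Pr(>|t|)"]  # p-value for the slope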
Correlation measures the strength and direction of the association between two variables. The correlation coefficient (\(r\)) ranges from -1 to 1, where -1 and 1 indicate the strongest (perfectly negative and perfectly positive) linear relationships and 0 means there is no linear association between the variables.
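To make the definition concrete, \(r\) can be computed by hand as the sample covariance divided by the product of the standard deviations; a sketch using the x and y simulated above:
# Pearson's r by hand; should match cor(x, y)
sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
cor(x, y)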
Start with the Pearson correlation coefficient (R's built-in cor.test() function) and the equivalent linear model:
\(y = \beta_0 + \beta_1 x\)
\(H_0: \beta_1 = 0\)
The two tests are written using the following R code:
# Pearson's correlation
cor.test(y, x, method="pearson")
##
## Pearson's product-moment correlation
##
## data: y and x
## t = -1.6507, df = 48, p-value = 0.1053
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.47920849 0.04978276
## sample estimates:
## cor
## -0.2317767
# Equivalent linear model
lm(y ~ 1 + x) %>%
summary() %>%
print(digits = 8)
##
## Call:
## lm(formula = y ~ 1 + x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.33931981 -1.65931459 0.33492062 1.36293243 3.52139194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.30000000 0.27799190 1.07917 0.28591
## x -0.46355333 0.28081423 -1.65075 0.10532
##
## Residual standard error: 1.9657 on 48 degrees of freedom
## Multiple R-squared: 0.053720423, Adjusted R-squared: 0.034006266
## F-statistic: 2.7249667 on 1 and 48 DF, p-value: 0.10531916
The output shows that the correlation coefficient (r) has a p-value of 0.1053, which is exactly the same as the p-value for the slope of the linear model. In this case we would not reject the null hypothesis that there is no correlation between the two variables (at the 0.05 level of significance).
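A quick check (reusing the objects above) confirms the two p-values agree to machine precision:
p_cor <- cor.test(y, x, method = "pearson")$p.value
p_lm <- summary(lm(y ~ 1 + x))$coefficients["x", "Pr(>|t|)"]
all.equal(p_cor, p_lm)  # TRUE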
The main difference is that the linear model returns the slope of the relationship, \(\beta_1\) (which in this case is -0.4636), rather than the correlation coefficient, \(r\). The slope is usually much more interpretable and informative than the correlation coefficient.
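The two are directly linked: \(\beta_1 = r \, s_y / s_x\), and since the samples were fixed to \(s_x = 1\) and \(s_y = 2\), the slope here is simply twice \(r\). A minimal sketch verifying this with the simulated data:
cor(y, x) * sd(y) / sd(x)  # reproduces the fitted slope, about -0.4636
coef(lm(y ~ 1 + x))["x"]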