Okay, let us simulate the data first and load it.
set.seed(123)
# Note: this fidelity vector is not used below; the fidelity scores in the
# data come from the correlated draw with mvrnonnorm
fidelity = rnorm(100, 50, 2)
# Satisfaction ratings sampled from 1 through 5
satSample = c(1:5)
satisfaction = sample(satSample, 100, replace = TRUE)
# Correlation matrix (r = -.5) for the two correlated outcomes
randomEffectsCorr = matrix(c(1,-.5,-.5, 1), ncol = 2)
library(semTools)
## Loading required package: lavaan
## This is lavaan 0.5-23.1097
## lavaan is BETA software! Please report any bugs.
##
## ###############################################################################
## This is semTools 0.4-14
## All users of R (or SEM) are invited to submit functions or ideas for functions.
## ###############################################################################
n = 100
# Draw two correlated variables (r = -.5) with means of 30; empirical = TRUE
# makes the sample means and correlation match exactly
randomEffects = mvrnonnorm(n, mu = c(30,30), Sigma = randomEffectsCorr, empirical = TRUE)
colnames(randomEffects) = c("PHQ9Post", "fidelity")
genderSamp = c(1,0)
gender = as.factor(sample(genderSamp, 100, prob = c(.5, .5), replace = TRUE))
# treatment is generated the same way but is not used in the data frame below
treatment = as.factor(sample(genderSamp, 100, prob = c(.5, .5), replace = TRUE))
datWeek8 = data.frame(satisfaction, gender, randomEffects)
# Writing and re-reading the csv converts the gender factor back to numeric 0/1
write.csv(datWeek8, "datWeek8.csv", row.names = FALSE)
datWeek8 = read.csv("datWeek8.csv", header = TRUE)
head(datWeek8)
## satisfaction gender PHQ9Post fidelity
## 1 2 1 29.66624 29.35575
## 2 5 0 29.79384 29.17227
## 3 4 0 30.02202 29.47081
## 4 3 1 30.86352 30.15171
## 5 3 0 29.22355 31.50228
## 6 5 1 30.37231 29.83899
Let us run a multiple regression with several predictors of PHQ9Post scores:
resultsWeek8 = lm(PHQ9Post ~ fidelity + satisfaction + gender, data = datWeek8)
summary(resultsWeek8)
##
## Call:
## lm(formula = PHQ9Post ~ fidelity + satisfaction + gender, data = datWeek8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0475 -0.4364 0.1587 0.6177 1.4205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.039050 2.684709 16.776 < 2e-16 ***
## fidelity -0.499628 0.089673 -5.572 2.3e-07 ***
## satisfaction -0.016282 0.061355 -0.265 0.791
## gender -0.005571 0.178611 -0.031 0.975
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8791 on 96 degrees of freedom
## Multiple R-squared: 0.2506, Adjusted R-squared: 0.2271
## F-statistic: 10.7 on 3 and 96 DF, p-value: 3.945e-06
We can see that fidelity is negatively related to PHQ9Post scores (the coefficient and its t-value are negative): for a one-unit increase in fidelity there is about a .5-point decrease in PHQ9Post scores, holding all other variables constant. With multiple predictors, we want to interpret the adjusted R-squared, because it accounts for the degrees of freedom lost to the additional variables.
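To make the holding-constant interpretation concrete, here is a small sketch (the particular values chosen for fidelity, satisfaction, and gender are arbitrary; output not shown):
# Predicted PHQ9Post at two fidelity values one unit apart,
# with satisfaction and gender held fixed
newdat = data.frame(fidelity = c(30, 31), satisfaction = 3, gender = 1)
diff(predict(resultsWeek8, newdata = newdat)) # equals the fidelity coefficient, about -.50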
Now we are going to learn a little more about multiple linear regression. Here are some of the assumptions:
Linear relationship = covariates are additive
Multivariate normality = normality among all the variables together
Minimal multicollinearity = variables are not highly correlated with each other
No autocorrelation = observations are independent of each other
Homoscedasticity = residuals have equal variance across the regression line
For a linear relationship we assume that the model is additive:
y = b0 + b1*x1 + b2*x2 + e
Because the model is additive, we get the nice interpretation that for a one-unit change in a predictor we see a change of the size of its coefficient in the dependent variable, holding all other variables constant.
This model assumes no interaction effects. If we expect an interaction, we need to account for it in the model with a product term (x1*x2). It also assumes no squared terms (x1^2) or other nonlinear transformations.
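As a sketch of how such terms are specified in R (reusing our simulated data purely for illustration; output not shown):
# Interaction between fidelity and satisfaction; * also includes the main effects
lm(PHQ9Post ~ fidelity * satisfaction + gender, data = datWeek8)
# Quadratic term for fidelity, wrapped in I() so ^ is treated as arithmetic
lm(PHQ9Post ~ fidelity + I(fidelity^2) + satisfaction + gender, data = datWeek8)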
Multivariate normality
Below are some tests to evaluate multivariate normality, which means the joint distribution of all the variables together is normal. However, if you have variables that are not continuous, these tests won't be very helpful. They are also sensitive to large sample sizes. At least among reviewers, this assumption is not usually one people worry about as much.
The Q-Q plot can also help us evaluate multivariate normality and spot outliers.
#install.packages("kableExtra")
#install.packages("knitr")
#install.packages("MVN")
library(knitr)
library(kableExtra)
library(MVN)
## sROC 0.1-2 loaded
# Run Mardia's tests of multivariate normality with a chi-square Q-Q plot
mvn(datWeek8, multivariatePlot = "qq")
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 21.8431727529919 0.349080815045321 YES
## 2 Mardia Kurtosis -2.05937163973792 0.0394586504168044 NO
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk satisfaction 0.8813 <0.001 NO
## 2 Shapiro-Wilk gender 0.6358 <0.001 NO
## 3 Shapiro-Wilk PHQ9Post 0.9910 0.7452 YES
## 4 Shapiro-Wilk fidelity 0.9894 0.6169 YES
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th
## satisfaction 100 2.92 1.4402581 3.00000 1.00000 5.00000 2.00000
## gender 100 0.48 0.5021167 0.00000 0.00000 1.00000 0.00000
## PHQ9Post 100 30.00 1.0000000 30.01837 26.63783 32.40832 29.41202
## fidelity 100 30.00 1.0000000 30.02384 27.76977 32.25354 29.45024
## 75th Skew Kurtosis
## satisfaction 4.00000 0.07811514 -1.3788189
## gender 1.00000 0.07886612 -2.0136173
## PHQ9Post 30.69503 -0.30873618 0.4527758
## fidelity 30.65676 -0.11863445 -0.4249991
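Because satisfaction and gender are not continuous, it can be more informative to rerun the test on just the continuous variables; a quick sketch (output not shown):
mvn(datWeek8[, c("PHQ9Post", "fidelity")])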
Minimal multicollinearity
In linear regression, we can evaluate the VIFs, the variance inflation factors. For each predictor, a regression is run with that predictor as the dependent variable and the remaining predictors as the independent variables, and the VIF is computed as 1/(1-R^2), so a higher R^2 (more overlap with the other predictors) means a higher VIF. A VIF of five is a general rule of thumb for concern (corresponding to an R^2 of .8). One way to reduce multicollinearity is to drop one of the related variables and see whether the model's R^2 drops much.
#install.packages("car")
library(car)
## Loading required package: carData
vif(resultsWeek8)
## fidelity satisfaction gender
## 1.030033 1.000266 1.030283
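As a check on where these values come from, here is the VIF for fidelity computed by hand using the formula above (output not shown):
# Regress fidelity on the other predictors, then apply 1/(1 - R^2)
r2 = summary(lm(fidelity ~ satisfaction + gender, data = datWeek8))$r.squared
1/(1 - r2) # matches the vif() value for fidelity above, about 1.03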
No autocorrelation
No autocorrelation basically means that observations are independent of each other. The most common case of autocorrelation is when you have repeated measures. You can test for autocorrelation, but if you don't have repeated measures, it probably isn't necessary.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Durbin-Watson test for autocorrelation in the residuals
dwtest(resultsWeek8, alternative = "two.sided")
##
## Durbin-Watson test
##
## data: resultsWeek8
## DW = 2.2147, p-value = 0.294
## alternative hypothesis: true autocorrelation is not 0
The Durbin-Watson statistic is close to 2 and the p-value is well above .05, so there is no evidence of autocorrelation here.
Homoscedasticity
Homoscedasticity is one assumption you might actually want to worry about. We can review the residuals to better understand it.
Example of heteroscedasticity: https://en.wikipedia.org/wiki/Heteroscedasticity
# Diagnostic plots: residuals vs. fitted, normal Q-Q, scale-location,
# and residuals vs. leverage
plot(resultsWeek8)
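Beyond eyeballing the residual plots, the lmtest package loaded above also offers the Breusch-Pagan test as a formal check of this assumption; a quick sketch (output not shown):
bptest(resultsWeek8) # null hypothesis: the residuals are homoscedastic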