First, let's simulate some data and load it:

set.seed(123)
# Simulate fidelity scores and 1-5 satisfaction ratings
fideility = rnorm(100, 50, 2)
satSample = c(1:5)
satisfaction = sample(satSample, 100, replace = TRUE)
# Correlation matrix for the two correlated variables generated below
randomEffectsCorr = matrix(c(1,-.5,-.5, 1), ncol = 2)
library(semTools)
## Loading required package: lavaan
## This is lavaan 0.5-23.1097
## lavaan is BETA software! Please report any bugs.
## 
## ###############################################################################
## This is semTools 0.4-14
## All users of R (or SEM) are invited to submit functions or ideas for functions.
## ###############################################################################
n = 100
# Draw correlated PHQ9Post and fideility scores, each with a mean of 30
randomEffects = mvrnonnorm(n, mu = c(30,30), Sigma = randomEffectsCorr, empirical = TRUE)
colnames(randomEffects) = c("PHQ9Post", "fideility")
genderSamp = c(1,0)
gender = as.factor(sample(genderSamp, 100, prob = c(.5, .5), replace = TRUE))
treatment = as.factor(sample(genderSamp, 100, prob = c(.5, .5), replace = TRUE))
# Combine into a data frame, write it to disk, and read it back in
datWeek8 = data.frame(satisfaction, gender, randomEffects)
write.csv(datWeek8, "datWeek8.csv", row.names = FALSE)
datWeek8 = read.csv("datWeek8.csv", header = TRUE)
head(datWeek8)
##   satisfaction gender PHQ9Post fideility
## 1            2      1 29.66624  29.35575
## 2            5      0 29.79384  29.17227
## 3            4      0 30.02202  29.47081
## 4            3      1 30.86352  30.15171
## 5            3      0 29.22355  31.50228
## 6            5      1 30.37231  29.83899

Now let's run a multivariate regression:

resultsWeek8 = lm(PHQ9Post ~ fideility + satisfaction + gender, data = datWeek8)
summary(resultsWeek8)
## 
## Call:
## lm(formula = PHQ9Post ~ fideility + satisfaction + gender, data = datWeek8)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0475 -0.4364  0.1587  0.6177  1.4205 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.039050   2.684709  16.776  < 2e-16 ***
## fideility    -0.499628   0.089673  -5.572  2.3e-07 ***
## satisfaction -0.016282   0.061355  -0.265    0.791    
## gender       -0.005571   0.178611  -0.031    0.975    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8791 on 96 degrees of freedom
## Multiple R-squared:  0.2506, Adjusted R-squared:  0.2271 
## F-statistic:  10.7 on 3 and 96 DF,  p-value: 3.945e-06

We can see that fidelity is negatively related to PHQ9Post scores (negative coefficient and t-value). With multiple variables, we want to interpret the adjusted R-squared, because it accounts for the degrees of freedom lost to the additional variables. For a one-unit increase in fidelity there is about a .5-point decrease in PHQ9Post scores, holding all other variables constant.
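If we want to report the uncertainty around those estimates, one option is to pull confidence intervals for the coefficients; here is a quick sketch using the model fit above.

# 95% confidence intervals for the regression coefficients
confint(resultsWeek8)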

Now we are going to learn a little more about multivariate linear regression. Here are some of its assumptions:

Linear relationship = the covariates are additive
Multivariate normality = normality among all of the variables together
Minimal multicollinearity = the variables are not highly correlated with each other
No autocorrelation = observations are independent of each other
Homoscedasticity = residuals have equal variance across the regression line

For a linear relationship we assume that the model is additive:

y = b0 + b1*x1 + b2*x2 + e

Because the model is additive, we get the nice interpretation that for a one-unit change in a covariate we see a change of that covariate's coefficient in the dependent variable, holding all other variables constant.
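To see this concretely, here is a small sketch using the model fit above (the covariate values are just illustrative): two observations that differ by one unit on fidelity, with everything else held fixed, have predictions that differ by exactly the fidelity coefficient.

# Two hypothetical observations that differ only by one unit of fidelity
newdat = data.frame(fideility = c(30, 31), satisfaction = 3, gender = 1)
# The difference in predictions equals the fideility coefficient (about -.5)
diff(predict(resultsWeek8, newdata = newdat))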

This model also assumes no interaction effects. If we think an interaction is present, we need to include it in the model (e.g., a b3*x1*x2 term). It likewise assumes no polynomial terms, such as a squared covariate (x1^2).
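For example, here is a sketch of how we could add an interaction between fidelity and satisfaction, or a squared fidelity term, to the model above (these models are only illustrations, not part of the original analysis):

# Interaction between fideility and satisfaction (the * also includes both main effects)
resultsInteract = lm(PHQ9Post ~ fideility * satisfaction + gender, data = datWeek8)
summary(resultsInteract)
# Squared fideility term, wrapped in I() so the square is computed literally
resultsSquared = lm(PHQ9Post ~ fideility + I(fideility^2) + satisfaction + gender, data = datWeek8)
summary(resultsSquared)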

Multivariate normality

Below are some tests to evaluate multivariate normality, which means the relationship among all of the variables together is normal. However, if you have variables that are not continuous, these tests won't be very helpful. They are also sensitive to large sample sizes. At least for reviewers, this assumption is usually not one people worry about as much.

The Q-Q plot can also help us understand multivariate normality and spot outliers.

#install.packages("kableExtra")
#install.packages("knitr")
#install.packages("MVN")
library(knitr)
library(kableExtra)
library(MVN)
## sROC 0.1-2 loaded
mvn(datWeek8, multivariatePlot = "qq")

## $multivariateNormality
##              Test         Statistic            p value Result
## 1 Mardia Skewness  21.8431727529919  0.349080815045321    YES
## 2 Mardia Kurtosis -2.05937163973792 0.0394586504168044     NO
## 3             MVN              <NA>               <NA>     NO
## 
## $univariateNormality
##           Test     Variable Statistic   p value Normality
## 1 Shapiro-Wilk satisfaction    0.8813  <0.001      NO    
## 2 Shapiro-Wilk    gender       0.6358  <0.001      NO    
## 3 Shapiro-Wilk   PHQ9Post      0.9910  0.7452      YES   
## 4 Shapiro-Wilk  fideility      0.9894  0.6169      YES   
## 
## $Descriptives
##                n  Mean   Std.Dev   Median      Min      Max     25th
## satisfaction 100  2.92 1.4402581  3.00000  1.00000  5.00000  2.00000
## gender       100  0.48 0.5021167  0.00000  0.00000  1.00000  0.00000
## PHQ9Post     100 30.00 1.0000000 30.01837 26.63783 32.40832 29.41202
## fideility    100 30.00 1.0000000 30.02384 27.76977 32.25354 29.45024
##                  75th        Skew   Kurtosis
## satisfaction  4.00000  0.07811514 -1.3788189
## gender        1.00000  0.07886612 -2.0136173
## PHQ9Post     30.69503 -0.30873618  0.4527758
## fideility    30.65676 -0.11863445 -0.4249991

Minimal multicollinearity

In linear regression, we can evaluate the VIFs, which are the variance inflation factors. For each predictor, a regression is run with that predictor as the dependent variable and the remaining predictors as covariates, and the VIF is computed as 1/(1-R^2), so a higher R^2 means a higher VIF. A VIF of five (i.e., an R^2 of .8) is a common rule-of-thumb cutoff. One way to reduce multicollinearity is to drop one of the correlated variables and see whether the model's R^2 drops.

#install.packages("car")
library(car)
## Loading required package: carData
vif(resultsWeek8)
##    fideility satisfaction       gender 
##     1.030033     1.000266     1.030283
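
To make the formula concrete, here is a sketch that computes the VIF for fidelity by hand, regressing fidelity on the other predictors and plugging its R-squared into 1/(1-R^2); the result should match the vif() output above.

# R-squared from regressing fideility on the other predictors
r2Fideility = summary(lm(fideility ~ satisfaction + gender, data = datWeek8))$r.squared
# VIF by hand: 1 / (1 - R^2)
1 / (1 - r2Fideility)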

No autocorrelation basically means that observations are independent of each other. The most common case of autocorrelation is when you have repeated measures. You can test for autocorrelation, but if you don’t have repeated measures, it probably isn’t necessary.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
dwtest(resultsWeek8, alternative = "two.sided")
## 
##  Durbin-Watson test
## 
## data:  resultsWeek8
## DW = 2.2147, p-value = 0.294
## alternative hypothesis: true autocorrelation is not 0

Homoscedasticity is one assumption you might actually want to worry about. We can review the residuals to better understand whether it holds.

Example of heteroscedasticity: https://en.wikipedia.org/wiki/Heteroscedasticity

# Diagnostic plots, including residuals vs. fitted values for checking homoscedasticity
plot(resultsWeek8)
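
As a complement to the residual plots, here is a sketch of a more formal check using the Breusch-Pagan test from the lmtest package loaded above; it tests whether the residual variance is constant, so a significant p-value suggests heteroscedasticity.

# Breusch-Pagan test for non-constant residual variance
bptest(resultsWeek8)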