Now we are moving on to correlation. Pearson's correlation measures the linear relationship between two continuous variables. Correlation is measured on a negative one to positive one scale. A negative correlation means that as one variable increases the other decreases; a positive correlation means that as one variable increases so does the other. Larger absolute values, on either the negative or positive side, indicate a stronger relationship. We can also get a p-value; however, beware that large samples make almost everything look statistically significant.

FYI: Correlation is the standardized form of covariance, so covariance is not on a negative one to positive one scale.
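To see this concretely, here is a minimal sketch (using two made-up vectors, x and y, just for illustration) showing that dividing the covariance by both standard deviations recovers the correlation:

x = rnorm(20)
y = rnorm(20)
# Covariance rescaled by the two standard deviations...
cov(x, y) / (sd(x) * sd(y))
# ...matches the correlation directly
cor(x, y)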

Let’s simulate the data

set.seed(123)
# Simulated treatment fidelity scores: mean 50, sd 2
fidelity = rnorm(100, 50, 2)
# Satisfaction ratings drawn from a 1-5 scale
satSample = c(1:5)
satisfaction = sample(satSample, 100, replace = TRUE)
# Simulated PHQ9 and GAD7 pre/post scores, rounded to whole numbers
PHQ9Pre = round(rnorm(100, 50, 4), 0)
PHQ9Post = round(rnorm(100, 40, 4), 0)
GAD7Pre = round(rnorm(100, 50, 4), 0)
GAD7Post = round(rnorm(100, 40, 4), 0)
# Gender and treatment assignment as 0/1 factors, each with 50/50 probability
genderSamp = c(1,0)
gender = as.factor(sample(genderSamp, 100, prob = c(.5, .5), replace = TRUE))
treatment = as.factor(sample(genderSamp, 100, prob = c(.5, .5), replace = TRUE))
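A quick sanity check on the simulated variables (just inspecting the first few rows) might look like:

# Peek at the first few rows of the simulated data
head(data.frame(fidelity, satisfaction, PHQ9Pre, GAD7Pre))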

Now let us look at the correlation between the PHQ9 pre scores and the GAD7 pre scores. (I apologize, I was lazy and didn’t make these correlated.)

cor.test adds the p-value (and a confidence interval) for a single pair of variables.

cor, along with corr.test from the psych package, is useful for multiple correlations; however, you need to put the variables into a data frame first.

Remember that in a correlation matrix the values above and below the diagonal of ones are the same. The diagonal is all ones because a variable is always perfectly correlated with itself.

library(psych)
PHQ9_GAD7Week7 = data.frame(PHQ9Pre, GAD7Pre, PHQ9Post)

corr.test(PHQ9_GAD7Week7)
## Call:corr.test(x = PHQ9_GAD7Week7)
## Correlation matrix 
##          PHQ9Pre GAD7Pre PHQ9Post
## PHQ9Pre     1.00    0.05     0.03
## GAD7Pre     0.05    1.00    -0.05
## PHQ9Post    0.03   -0.05     1.00
## Sample Size 
## [1] 100
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##          PHQ9Pre GAD7Pre PHQ9Post
## PHQ9Pre     0.00    1.00        1
## GAD7Pre     0.62    0.00        1
## PHQ9Post    0.74    0.62        0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
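Following the hint in the output above, a minimal way to see those confidence intervals (reusing the same corr.test call) is:

# Print the full corr.test results, including confidence intervals
print(corr.test(PHQ9_GAD7Week7), short = FALSE)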
cor.test(PHQ9Pre, GAD7Pre)
## 
##  Pearson's product-moment correlation
## 
## data:  PHQ9Pre and GAD7Pre
## t = 0.4933, df = 98, p-value = 0.6229
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1480968  0.2438039
## sample estimates:
##        cor 
## 0.04976907
dat = data.frame(PHQ9Pre, GAD7Pre, PHQ9Post, GAD7Post)
cor(dat)
##             PHQ9Pre      GAD7Pre    PHQ9Post     GAD7Post
## PHQ9Pre  1.00000000  0.049769068  0.03397685  0.023277650
## GAD7Pre  0.04976907  1.000000000 -0.04996179 -0.002709857
## PHQ9Post 0.03397685 -0.049961790  1.00000000 -0.066852357
## GAD7Post 0.02327765 -0.002709857 -0.06685236  1.000000000

For two categorical variables, we can use a chi-square test to evaluate whether being in one category is related to, or dependent upon, being in another category. For example, if we randomized people into intervention and control groups, we would expect no relationship between treatment assignment and gender. So let us test this assumption.

chisq.test(gender, treatment)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  gender and treatment
## X-squared = 1.4115, df = 1, p-value = 0.2348
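Under the hood, the chi-square test compares the observed counts in a contingency table against the counts expected if the two variables were independent. A minimal sketch for inspecting both, using the same gender and treatment variables:

# Observed counts in the 2x2 contingency table
table(gender, treatment)
# Expected counts under independence, stored in the test object
chisq.test(gender, treatment)$expected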

To evaluate if there is a relationship between a continuous variable and a categorical variable, you could use an ANOVA. Be careful with ANOVA, as it makes several assumptions (e.g., equal variances among groups, equal group sizes). Standard ANOVA also assumes independent observations, so it does not handle repeated measures without modification, and unequal group sizes, which are almost always the case in practice, can complicate it further.

Below we are testing whether GAD7 difference scores (post minus pre) differ by treatment assignment. You want to save the results as an object and then run the summary function to get the F-value and p-value.

anovaDat = data.frame(GAD7Diff = GAD7Post-GAD7Pre, treatment) 
anovaResults= aov(GAD7Diff ~ treatment, data =  anovaDat)
summary(anovaResults)
##             Df Sum Sq Mean Sq F value Pr(>F)
## treatment    1   12.7   12.73   0.401  0.528
## Residuals   98 3111.8   31.75
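One of the assumptions above, equal variances across groups, can be checked directly. A minimal sketch using base R’s bartlett.test on the same data (note that Bartlett’s test itself assumes normality):

# Test homogeneity of variance in GAD7Diff across treatment groups
bartlett.test(GAD7Diff ~ treatment, data = anovaDat)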

For the example above, we could also run a regression.

Intercept: Generally, it is the predicted or expected mean of y when all other covariates (e.g., regressors, independent variables) are zero. In this example, it is the predicted mean GAD7Diff score for the control group (those with treatment = 0). Here the expected difference score for those not in the treatment is about -9. We usually don’t interpret it; it just needs to be in there for the math to work.

y = b0 + b1*(treatment) + e

y is the predicted value of GAD7Diff, which is based on the intercept (the expected mean of GAD7Diff when treatment is zero). b1 is the slope coefficient for the treatment variable. The interpretation is that, relative to those in the control group, those in the treatment group have GAD7 difference scores about 0.72 points higher, although as the output below shows, this difference is not statistically significant.

regressionDat = data.frame(GAD7PDiff = GAD7Post-GAD7Pre, treatment)
regressionResults = lm(GAD7PDiff ~ treatment, data = regressionDat)
summary(regressionResults)
## 
## Call:
## lm(formula = GAD7PDiff ~ treatment, data = regressionDat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2593  -3.5435  -0.2593   3.4565  16.7407 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.4565     0.8308 -11.382   <2e-16 ***
## treatment1    0.7158     1.1306   0.633    0.528    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.635 on 98 degrees of freedom
## Multiple R-squared:  0.004073,   Adjusted R-squared:  -0.006089 
## F-statistic: 0.4008 on 1 and 98 DF,  p-value: 0.5282
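With a single 0/1 predictor, the regression coefficients are just the group means in disguise: the intercept is the control group’s mean difference score, and the intercept plus the slope is the treatment group’s. A minimal sketch verifying this on the same objects:

# Mean GAD7 difference score within each treatment group
tapply(regressionDat$GAD7PDiff, regressionDat$treatment, mean)
# Intercept = control group mean; intercept + slope = treatment group mean
coef(regressionResults)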

Next week we will talk about the assumptions of linear regression, more interpretations, R^2, and other diagnostics for evaluating model fit.