This tutorial will show you how to evaluate the relation between two or more variables using simple correlation, simple linear regression, and multiple regression.

Simple Correlation

First, we will compute the association between two variables using simple correlation. This exercise is adapted from Workshop Statistics (Rossman & Chance, 2008).

Data Description

These data consist of a random sample of 30 textbooks from a campus bookstore. They include price, number of pages, and year of publication for each book.

First, we’ll access the data. The libraries mosaic and psych are required.

# Load the required packages
library(mosaic)
library(psych)
options(width=200)  # widen the console so wide tables print on one line
# Download the textbook data from a published Google Sheet (CSV)
textbook_data<-fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Ampua78j1HupdE5OaDFOZjlNbWxwdXRYeHJJWUR0SkE&output=csv")
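
If fetchGoogle is not available in your version of mosaic, a minimal alternative is to read the published CSV directly with base R (a sketch, assuming the published link still resolves):

# Read the published Google Sheet directly as a CSV file
textbook_data<-read.csv("https://docs.google.com/spreadsheet/pub?key=0Ampua78j1HupdE5OaDFOZjlNbWxwdXRYeHJJWUR0SkE&output=csv")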

The str command tells us about the structure of our data frame, and the describe command gives us basic descriptive statistics for our variables.

str(textbook_data)
## 'data.frame':    30 obs. of  3 variables:
##  $ Price: num  4.25 5.95 7 6.5 7 ...
##  $ Pages: int  57 194 51 104 294 140 336 150 600 91 ...
##  $ Year : int  2006 1972 2004 2005 2002 1991 1973 2003 2004 1997 ...
describe(textbook_data)
##       vars  n    mean     sd  median trimmed    mad     min    max  range  skew kurtosis    se
## Price    1 30   65.06  51.42   55.25   60.86  59.67    4.25  169.8  165.5  0.44    -1.14  9.39
## Pages    2 30  464.53 287.08  456.50  451.96 367.68   51.00 1060.0 1009.0  0.27    -1.10 52.41
## Year     3 30 2000.77   8.81 2004.00 2002.79   2.97 1972.00 2008.0   36.0 -2.19     4.22  1.61

Description of Problem

Two undergraduate college students were curious as to whether longer books cost more money. What is the appropriate statistic to assess this question?

First, we’ll visualize the data using a scatterplot (xyplot is the command). We’re going to color the points so that books of different ages appear in different colors: light blue for the oldest books, pink for medium-aged books, and light green for the newest books.

# ntiles() bins Year into equal-sized groups so points can be colored by book age
xyplot(Price~Pages,data=textbook_data,pch=20,groups=ntiles(textbook_data$Year),cex=2,auto.key=TRUE)

[Scatterplot of Price against Pages, with points colored by year group]

It sure looks like Price and Pages are related, but how can we quantify this relationship?

We calculate the simple correlation between price and pages.

cor(Price~Pages,data=textbook_data)
## [1] 0.8219

How do we interpret this correlation?
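
One way to sharpen the interpretation is to ask whether this correlation is statistically distinguishable from zero. As a quick sketch (this step is not part of the original exercise), base R’s cor.test provides a significance test and a confidence interval:

# Test whether the Price-Pages correlation differs from zero;
# the output also includes a 95% confidence interval for r
cor.test(~Price+Pages,data=textbook_data)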

What about the year of publication? Is that related to either of these variables?

If we wanted, we could get the correlations between all three of our variables with one command.

cor(textbook_data)
##        Price  Pages   Year
## Price 1.0000 0.8219 0.4316
## Pages 0.8219 1.0000 0.3194
## Year  0.4316 0.3194 1.0000

So looking at this table, what do you notice? How would you interpret these coefficients?
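
If the full-precision matrix is hard to scan, one small convenience (again, not part of the original exercise) is to round it first:

# Round the correlation matrix to two decimals for easier reading
round(cor(textbook_data),2)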

To understand the relations among multiple variables simultaneously, we need a more advanced technique. For this, we will now introduce two types of linear regression.

Simple and Multiple Regression

Data Description

These data consist of two subscales of a narcissism measure, the NPI (Raskin & Hall, 1979), based on the Ackerman et al. (2011) factor analysis. These two subscales are called Leadership Authority and Entitlement Exploitativeness. Also included are 10 other scales that may correlate with Leadership Authority and Entitlement Exploitativeness. We’ll focus on two of the Big Five personality factors, Extraversion and Agreeableness, which we assessed with the Ten-Item Personality Inventory (TIPI; Gosling, Rentfrow, & Swann, 2003). Finally, the dataset includes some demographic variables (gender, age, and college-student status).

Now we’ll access the data. The libraries mosaic and RCurl are required (psych, loaded above, is used again for describe).

library(mosaic)
library(RCurl)
# Download the narcissism/personality data from a published Google Sheet (CSV)
psy_data<-fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Ampua78j1HupdHJkenZITjN4N1QyUV96TlBkWVRaWXc&output=csv")
str(psy_data)
## 'data.frame':    51 obs. of  16 variables:
##  $ Subj_ID        : int  101 102 103 104 105 106 107 108 109 110 ...
##  $ LA             : int  7 0 4 5 1 1 1 3 1 4 ...
##  $ EE             : int  4 0 3 0 0 0 0 0 1 0 ...
##  $ ies_avg        : num  4 1 4.17 2 1 ...
##  $ pes_avg        : num  3.56 1 3.89 2.33 2.78 ...
##  $ E              : num  5.5 6 3.5 2.5 3 2.5 6 3 1 3.5 ...
##  $ A              : num  4 6.5 3.5 5 4 2.5 5 3 1.5 6 ...
##  $ C              : num  4.5 7 4.5 6.5 6 4 6 3 1 6 ...
##  $ N              : num  4 6 4 5 5 5 1 4.5 7 2 ...
##  $ O              : num  2.5 6.5 2.5 6.5 6.5 5 5 4.5 6 5.5 ...
##  $ epq_avg        : num  3.8 6.3 4.2 3.7 4 5.3 3.7 3 4.2 4.9 ...
##  $ prim_psy_avg   : num  3.81 1 4.27 2 1.25 ...
##  $ sdo_avg        : num  3.46 1 3.92 1.85 2.46 ...
##  $ gender         : int  2 2 1 2 2 2 1 1 1 2 ...
##  $ age            : int  25 31 33 28 41 32 31 24 56 38 ...
##  $ college_student: Factor w/ 2 levels "college","not college": 1 2 2 2 2 1 2 2 2 2 ...

Description of Problem

Let’s get some descriptive statistics for this new dataset.

describe(psy_data)
##                  vars  n   mean    sd median trimmed   mad   min    max range  skew kurtosis   se
## Subj_ID             1 51 126.00 14.87 126.00  126.00 19.27 101.0 151.00 50.00  0.00    -1.27 2.08
## LA                  2 51   3.43  2.97   3.00    3.10  2.97   0.0  11.00 11.00  0.81    -0.31 0.42
## EE                  3 51   0.84  1.07   0.00    0.68  0.00   0.0   4.00  4.00  1.18     0.83 0.15
## ies_avg             4 51   2.50  1.52   2.00    2.29  1.48   1.0   6.67  5.67  0.88    -0.23 0.21
## pes_avg             5 51   3.32  1.17   3.33    3.29  1.32   1.0   6.33  5.33  0.27    -0.07 0.16
## E                   6 51   3.72  1.67   3.50    3.71  2.22   1.0   7.00  6.00  0.11    -0.91 0.23
## A                   7 51   4.67  1.55   5.00    4.77  1.48   1.5   7.00  5.50 -0.51    -0.80 0.22
## C                   8 51   5.29  1.35   6.00    5.41  1.48   1.0   7.00  6.00 -0.79     0.30 0.19
## N                   9 51   3.54  1.86   3.00    3.45  2.22   1.0   7.00  6.00  0.30    -1.13 0.26
## O                  10 51   5.00  1.37   5.00    5.09  1.48   2.0   7.00  5.00 -0.41    -0.52 0.19
## epq_avg            11 51   4.71  1.04   4.70    4.73  1.04   2.2   6.60  4.40 -0.17    -0.57 0.15
## prim_psy_avg       12 51   2.81  1.12   2.75    2.77  1.30   1.0   6.12  5.12  0.42    -0.18 0.16
## sdo_avg            13 51   2.55  1.24   2.46    2.43  1.37   1.0   6.23  5.23  0.83     0.32 0.17
## gender             14 51   1.47  0.58   1.00    1.41  0.00   1.0   3.00  2.00  0.72    -0.55 0.08
## age                15 50  35.32 10.48  33.00   34.38 10.38  21.0  60.00 39.00  0.68    -0.46 1.48
## college_student*   16 51   1.75  0.44   2.00    1.80  0.00   1.0   2.00  1.00 -1.09    -0.82 0.06

You should take a look at the mean, standard deviation, min, and max for each of the variables to help you understand what values should be expected in the forthcoming analyses.
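
If you want to focus on just those columns, describe() returns a data frame that we can subset in the usual way (a quick sketch):

# Keep only the descriptive columns of interest
describe(psy_data)[,c("mean","sd","min","max")]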

We can examine a correlation between any of these scales in the same way as in the Textbook Data. For instance, to ask for the correlation between Extraversion (E) and Agreeableness (A), we run the following:

cor(E~A,data=psy_data)
## [1] 0.15

Extraversion assesses (among other things) how interpersonally dominant a person is. Agreeableness assesses (again, among other things) how willing a person is to get along with others. Our narcissism measure also taps these two dimensions. Entitlement Exploitativeness narcissism has to do with being willing to use others for one’s own ends. Leadership Authority narcissism describes one’s ability to command the attention and respect of others.

So we might want to see how Agreeableness correlates with Entitlement Exploitativeness. Because these dimensions both involve getting along with others, we expect them to be associated. However, the correlation should be negative, because the more agreeable people are, the less willing they should be to exploit others.

cor(EE~A,data=psy_data)
## [1] -0.2979

What does the correlation mean? Can you interpret it?

We can also run a simple (one-predictor) regression instead of a correlation. If we standardize our predictor and outcome, then our regression coefficient (b) becomes a standardized coefficient (beta). Beta is interpretable as the number of standard deviations the outcome increases for a one standard deviation increase in the predictor. The following code standardizes the predictor and outcome.

# scale() with center=TRUE and scale=TRUE converts a variable to
# z-scores: z = (x - mean(x)) / sd(x)
psy_data$EEz<-scale(psy_data$EE,center=TRUE,scale=TRUE)
psy_data$Az<-scale(psy_data$A,center=TRUE,scale=TRUE)
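
As a quick sanity check (not part of the original exercise), the standardized variables should have means of essentially 0 and standard deviations of exactly 1:

# Means should be (numerically) 0 and SDs 1 after standardizing
mean(psy_data$EEz); sd(psy_data$EEz)
mean(psy_data$Az); sd(psy_data$Az)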

Now we’ll run the simple regression with the standardized variables. A simple regression with a standardized predictor and outcome should produce a coefficient that’s equal to the correlation coefficient.

model<-summary(lm(EEz~Az,data=psy_data))
model
## 
## Call:
## lm(formula = EEz ~ Az, data = psy_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.207 -0.680 -0.440  0.585  2.836 
## 
## Coefficients:
##                           Estimate             Std. Error t value Pr(>|t|)  
## (Intercept) -0.0000000000000000718  0.1350279548898648541    0.00    1.000  
## Az          -0.2978879334331375195  0.1363715497227122320   -2.18    0.034 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.964 on 49 degrees of freedom
## Multiple R-squared:  0.0887, Adjusted R-squared:  0.0701 
## F-statistic: 4.77 on 1 and 49 DF,  p-value: 0.0337

Let’s round the coefficients to four decimal places.

round(model$coef,digits=4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.0000     0.1350   0.000   1.0000
## Az           -0.2979     0.1364  -2.184   0.0337

Can you find and interpret the standardized regression coefficient?

Instead of the standardized linear regression, we could also run a simple (one-predictor) unstandardized linear regression. The unstandardized coefficient tells us how much increase in Y (our outcome, Entitlement Exploitativeness) we get for a 1-unit increase in X (our predictor, Agreeableness). The unstandardized coefficient is equal to the slope of the regression line (and the intercept coefficient is the point where that line crosses the Y-axis).

model<-summary(lm(EE~A,data=psy_data))
model
## 
## Call:
## lm(formula = EE ~ A, data = psy_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.285 -0.724 -0.469  0.623  3.021 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.7958     0.4592    3.91  0.00028 ***
## A            -0.2041     0.0935   -2.18  0.03375 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.03 on 49 degrees of freedom
## Multiple R-squared:  0.0887, Adjusted R-squared:  0.0701 
## F-statistic: 4.77 on 1 and 49 DF,  p-value: 0.0337
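
To make the interpretation concrete, we can plug a value into the fitted equation, predicted EE = 1.7958 - 0.2041*A. A quick sketch (the value A = 4 is arbitrary, chosen only for illustration):

# Predicted Entitlement Exploitativeness at Agreeableness = 4,
# computed straight from the fitted equation
1.7958-0.2041*4
# The same prediction via predict() on the fitted model
fit<-lm(EE~A,data=psy_data)
predict(fit,newdata=data.frame(A=4))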

What would this look like as a scatterplot?

# Semi-transparent points (alpha = .3 in rgb) reveal overlapping cases
plot(psy_data$A,psy_data$EE,xlab='Agreeableness',ylab='Entitlement Exploitativeness',pch=20,col=rgb(0,0,0,.3),cex=2)

[Scatterplot of Entitlement Exploitativeness against Agreeableness]

Let’s add the regression line to the plot. Can you see how the regression coefficient from the analysis above is equal to the slope of this line?

plot(psy_data$A,psy_data$EE,xlab='Agreeableness',ylab='Entitlement Exploitativeness',pch=20,col=rgb(0,0,0,.3),cex=2)
# abline() treats a length-2 vector as c(intercept, slope), so we pass
# the first column (the estimates) of the coefficient table
abline(model$coef[1:2,1])

[The same scatterplot with the fitted regression line added]

Now let’s see how Extraversion relates to Entitlement Exploitativeness. The correlation should be smaller in magnitude, because interpersonal dominance is not as directly related to exploiting others as Agreeableness is.

cor(EE~E,data=psy_data)
## [1] 0.245

Now let’s standardize Entitlement Exploitativeness and Extraversion.

psy_data$EEz<-scale(psy_data$EE,center=TRUE,scale=TRUE)
psy_data$Ez<-scale(psy_data$E,center=TRUE,scale=TRUE)

Then we can run the standardized regression.

model<-summary(lm(EEz~Ez,data=psy_data))
model
## 
## Call:
## lm(formula = EEz ~ Ez, data = psy_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.275 -0.723 -0.392  0.685  2.701 
## 
## Coefficients:
##                          Estimate            Std. Error t value Pr(>|t|)  
## (Intercept) 0.0000000000000000305 0.1371404187506063133    0.00    1.000  
## Ez          0.2449518873871339431 0.1385050336420791950    1.77    0.083 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.979 on 49 degrees of freedom
## Multiple R-squared:  0.06,   Adjusted R-squared:  0.0408 
## F-statistic: 3.13 on 1 and 49 DF,  p-value: 0.0832

Let’s round the coefficients to four decimal places.

round(model$coef,digits=4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.000     0.1371   0.000   1.0000
## Ez             0.245     0.1385   1.768   0.0832

Now here’s the unstandardized regression. Can you interpret the coefficients?

summary(lm(EE~E,data=psy_data))
## 
## Call:
## lm(formula = EE ~ E, data = psy_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.358 -0.770 -0.417  0.730  2.877 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.2608     0.3602    0.72    0.472  
## E             0.1567     0.0886    1.77    0.083 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.04 on 49 degrees of freedom
## Multiple R-squared:  0.06,   Adjusted R-squared:  0.0408 
## F-statistic: 3.13 on 1 and 49 DF,  p-value: 0.0832

A multiple regression will allow us to examine the relation between Agreeableness and Entitlement Exploitativeness while controlling for Extraversion, and vice versa. Because Extraversion and Agreeableness correlate with one another, we may be able to see each predictor’s relation to Entitlement Exploitativeness more clearly after controlling for the other.

Compare the coefficients from this model to the two previous unstandardized models. Did the coefficients change? How about the statistical significance? What’s the R-squared value for the model? What does this value mean?

summary(lm(EE~E+A,data=psy_data))
## 
## Call:
## lm(formula = EE ~ E + A, data = psy_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.857 -0.855 -0.133  0.556  2.662 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.2335     0.5083    2.43    0.019 *
## E             0.1896     0.0849    2.23    0.030 *
## A            -0.2346     0.0909   -2.58    0.013 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.988 on 48 degrees of freedom
## Multiple R-squared:  0.175,  Adjusted R-squared:  0.14 
## F-statistic: 5.08 on 2 and 48 DF,  p-value: 0.01
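
If we wanted standardized coefficients for the multiple regression as well, we could reuse the z-scored variables created earlier (a quick sketch, following the same approach as the simple regressions above):

# Standardized multiple regression; EEz, Ez, and Az were created above
summary(lm(EEz~Ez+Az,data=psy_data))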

Extra Material - Plotting

This requires the packages manipulate and shiny.

library(manipulate)  # provides manipulate() and picker()

# Plot EE against whichever variable name is passed in
fancyMe=function(var){ xyplot(EE~psy_data[[var]],data=psy_data,xlab=var) }

# Build a named list of variable names for the picker control
choices = c(names(psy_data))
names(choices) = names(psy_data)
choices = as.list(choices)

manipulate(fancyMe(var),var=picker(choices))
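
Note that manipulate() only runs inside RStudio; the var argument is supplied interactively by the picker control, and the plot redraws each time you select a new variable.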

References

Ackerman, R. A., Witt, E. A., Donnellan, M. B., Trzesniewski, K. H., Robins, R. W., & Kashy, D. A. (2011). What does the Narcissistic Personality Inventory really measure? Assessment, 18, 67-87.

Gosling, S. D., Rentfrow, P. J., & Swann, W. B., Jr. (2003). A very brief measure of the Big Five personality domains. Journal of Research in Personality, 37, 504-528.

Raskin, R. N., & Hall, C. S. (1979). A narcissistic personality inventory. Psychological Reports, 45, 590.

Rossman, A. J., & Chance, B. L. (2008). Workshop statistics: Discovery with data (3rd Ed.). Hoboken, NJ: Wiley.