This tutorial will show you evaluate the relation between two or more variables using simple correlation, single linear regression, and multiple regression.
First, we will compute the association between two variables using simple correlation. This exercise is adapted from Workshop Statistics (Rossman & Chance, 2008).
These data consist of a random sample of 30 textbooks from a campus bookstore. They include price, number of pages, and year of publication for each book.
First, we’ll access the data. The libraries mosaic and psych are required.
library(mosaic)
library(psych)
options(width=200)
textbook_data<-fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Ampua78j1HupdE5OaDFOZjlNbWxwdXRYeHJJWUR0SkE&output=csv")
The str command tells us about the structure of our dataframe, and the describe command gives us basic descriptive statistics for our variables.
str(textbook_data)
## 'data.frame': 30 obs. of 3 variables:
## $ Price: num 4.25 5.95 7 6.5 7 ...
## $ Pages: int 57 194 51 104 294 140 336 150 600 91 ...
## $ Year : int 2006 1972 2004 2005 2002 1991 1973 2003 2004 1997 ...
describe(textbook_data)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Price 1 30 65.06 51.42 55.25 60.86 59.67 4.25 169.8 165.5 0.44 -1.14 9.39
## Pages 2 30 464.53 287.08 456.50 451.96 367.68 51.00 1060.0 1009.0 0.27 -1.10 52.41
## Year 3 30 2000.77 8.81 2004.00 2002.79 2.97 1972.00 2008.0 36.0 -2.19 4.22 1.61
Two undergraduate college students were curious as to whether longer books cost more money. What is the appropriate statistic to assess this question?
First, we’ll visualize the analysis using a scatterplot (xyplot is the command). We’re going to color the points so that books of different ages are different colors. Light blue represents the oldest books, light green is the newest books, and pink is medium-aged books.
xyplot(Price~Pages,data=textbook_data,pch=20,groups=ntiles(textbook_data$Year),cex=2,auto.key=TRUE)
It sure looks like Price and Pages are related, but how can we quantify this relationship?
We calculate the simple correlation between price and pages.
cor(Price~Pages,data=textbook_data)
## [1] 0.8219
How do we interpret this correlation?
What about the year of publication? Is that related to either of these variables?
If we wanted, we could get the correlations between all three of our variables with one command.
cor(textbook_data)
## Price Pages Year
## Price 1.0000 0.8219 0.4316
## Pages 0.8219 1.0000 0.3194
## Year 0.4316 0.3194 1.0000
So looking at this table, what do you notice? How would you interpret these coefficients?
In order to understand the relations between multiple variables simultaneously, we need a more advanced technique. For this we will now introduce two types of linear regression.
These data consist of two sub-scales of a narcissism measure - the NPI (Raskin & Hall, 1979) based on the Ackerman et al. (2011) factor analysis. These two subscales are called Leadership Authority and Entitlement Exploitativeness. Also included are 10 other scales that may correlate with Leadership Authority and Entitlement Exploitativeness. We’ll focus on two of the Big Five personality factors, Extraversion and Agreeableness. We used the Ten-Item Personality Inventory (TIPI; Gosling, Renfrow, & Swan, 2003) to assess Extraversion and Agreeableness. Finally, the dataset includes some demographic variables (gender, age, and college student).
Now we’ll access the data. The libraries mosaic and psych are required.
library(mosaic)
library(RCurl)
psy_data<-fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Ampua78j1HupdHJkenZITjN4N1QyUV96TlBkWVRaWXc&output=csv")
str(psy_data)
## 'data.frame': 51 obs. of 16 variables:
## $ Subj_ID : int 101 102 103 104 105 106 107 108 109 110 ...
## $ LA : int 7 0 4 5 1 1 1 3 1 4 ...
## $ EE : int 4 0 3 0 0 0 0 0 1 0 ...
## $ ies_avg : num 4 1 4.17 2 1 ...
## $ pes_avg : num 3.56 1 3.89 2.33 2.78 ...
## $ E : num 5.5 6 3.5 2.5 3 2.5 6 3 1 3.5 ...
## $ A : num 4 6.5 3.5 5 4 2.5 5 3 1.5 6 ...
## $ C : num 4.5 7 4.5 6.5 6 4 6 3 1 6 ...
## $ N : num 4 6 4 5 5 5 1 4.5 7 2 ...
## $ O : num 2.5 6.5 2.5 6.5 6.5 5 5 4.5 6 5.5 ...
## $ epq_avg : num 3.8 6.3 4.2 3.7 4 5.3 3.7 3 4.2 4.9 ...
## $ prim_psy_avg : num 3.81 1 4.27 2 1.25 ...
## $ sdo_avg : num 3.46 1 3.92 1.85 2.46 ...
## $ gender : int 2 2 1 2 2 2 1 1 1 2 ...
## $ age : int 25 31 33 28 41 32 31 24 56 38 ...
## $ college_student: Factor w/ 2 levels "college","not college": 1 2 2 2 2 1 2 2 2 2 ...
Let’s get some descriptive statistics for this new dataset.
describe(psy_data)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Subj_ID 1 51 126.00 14.87 126.00 126.00 19.27 101.0 151.00 50.00 0.00 -1.27 2.08
## LA 2 51 3.43 2.97 3.00 3.10 2.97 0.0 11.00 11.00 0.81 -0.31 0.42
## EE 3 51 0.84 1.07 0.00 0.68 0.00 0.0 4.00 4.00 1.18 0.83 0.15
## ies_avg 4 51 2.50 1.52 2.00 2.29 1.48 1.0 6.67 5.67 0.88 -0.23 0.21
## pes_avg 5 51 3.32 1.17 3.33 3.29 1.32 1.0 6.33 5.33 0.27 -0.07 0.16
## E 6 51 3.72 1.67 3.50 3.71 2.22 1.0 7.00 6.00 0.11 -0.91 0.23
## A 7 51 4.67 1.55 5.00 4.77 1.48 1.5 7.00 5.50 -0.51 -0.80 0.22
## C 8 51 5.29 1.35 6.00 5.41 1.48 1.0 7.00 6.00 -0.79 0.30 0.19
## N 9 51 3.54 1.86 3.00 3.45 2.22 1.0 7.00 6.00 0.30 -1.13 0.26
## O 10 51 5.00 1.37 5.00 5.09 1.48 2.0 7.00 5.00 -0.41 -0.52 0.19
## epq_avg 11 51 4.71 1.04 4.70 4.73 1.04 2.2 6.60 4.40 -0.17 -0.57 0.15
## prim_psy_avg 12 51 2.81 1.12 2.75 2.77 1.30 1.0 6.12 5.12 0.42 -0.18 0.16
## sdo_avg 13 51 2.55 1.24 2.46 2.43 1.37 1.0 6.23 5.23 0.83 0.32 0.17
## gender 14 51 1.47 0.58 1.00 1.41 0.00 1.0 3.00 2.00 0.72 -0.55 0.08
## age 15 50 35.32 10.48 33.00 34.38 10.38 21.0 60.00 39.00 0.68 -0.46 1.48
## college_student* 16 51 1.75 0.44 2.00 1.80 0.00 1.0 2.00 1.00 -1.09 -0.82 0.06
You should take a look at the mean, standard deviation, min, and max for each of the variables to help you understand what values should be expected in the forthcoming analyses.
We can examine a correlation between any of these scales in the same way as in the Textbook Data. For instance, to ask for the correlation between Extraversion (E) and Agreeableness (A), we run the following:
cor(E~A,data=psy_data)
## [1] 0.15
Extraversion assesses (among other things) how interpersonally dominant a person is. Agreeableness assesses (again, among other things) how willing a person is to get along with others. Our narcissism measure also taps these two dimensions. Entitlement Exploitativeness narcissism has to do with being willing to use others for one’s own ends. Leadership Authority narcissism describes one’s ability to command the attention and respect of others.
So we might want to see how Agreeableness correlates with Entitlement Exploitativeness. Because these dimensions both measure getting along with others, we expect them to be associated. However, they should show a negative correlation, because the more agreeable one is, the less willing they should be to exploit others.
cor(EE~A,data=psy_data)
## [1] -0.2979
What does the correlation mean? Can you interpret it?
We can also run a simple (one-predictor) regression, instead of a correlation. If we standardize our predictor and outcome, then our regression coefficient (b) becomes a standardized coeffient (beta). Beta is interpretable as the standard deviation increase in the outcome for a 1 standard deviation unit increase in the predictor. The following code standardizes the predictor and outcome.
psy_data$EEz<-scale(psy_data$EE,center=TRUE,scale=TRUE)
psy_data$Az<-scale(psy_data$A,center=TRUE,scale=TRUE)
Now we’ll run the simple regression with the standardized variables. A simple regression with standardized predictors should produce a coefficient that’s equal to the correlation coefficient.
model<-summary(lm(EEz~Az,data=psy_data))
model
##
## Call:
## lm(formula = EEz ~ Az, data = psy_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.207 -0.680 -0.440 0.585 2.836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0000000000000000718 0.1350279548898648541 0.00 1.000
## Az -0.2978879334331375195 0.1363715497227122320 -2.18 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.964 on 49 degrees of freedom
## Multiple R-squared: 0.0887, Adjusted R-squared: 0.0701
## F-statistic: 4.77 on 1 and 49 DF, p-value: 0.0337
Let’s round down to three decimals.
round(model$coef,digits=4)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0000 0.1350 0.000 1.0000
## Az -0.2979 0.1364 -2.184 0.0337
Can you find and interpret the standardized regression coefficient?
Instead of the standardized linear regression, we could also run a simple (one-predictor) unstandardized linear regression. The unstandardized coefficient tells us how much increase in Y (our outcome - Entitlement Exploitativeness) we get for a 1 unit increse in X (our predictor - Agreeableness). The unstandardized coefficient is equal to the slope of a line (and the intercept coefficient is the point where that line crosses the Y-axis).
model<-summary(lm(EE~A,data=psy_data))
model
##
## Call:
## lm(formula = EE ~ A, data = psy_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.285 -0.724 -0.469 0.623 3.021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7958 0.4592 3.91 0.00028 ***
## A -0.2041 0.0935 -2.18 0.03375 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.03 on 49 degrees of freedom
## Multiple R-squared: 0.0887, Adjusted R-squared: 0.0701
## F-statistic: 4.77 on 1 and 49 DF, p-value: 0.0337
What would this look like as a scatterplot?
plot(psy_data$A,psy_data$EE,xlab='Agreeableness',ylab='Entitlement Exploitativeness',pch=20,col=rgb(0,0,0,.3),cex=2)
Let’s add the regression line to the plot. Can you see how the regression coefficient from the analysis above is equal to the slope of this line?
plot(psy_data$A,psy_data$EE,xlab='Agreeableness',ylab='Entitlement Exploitativeness',pch=20,col=rgb(0,0,0,.3),cex=2)
abline(model$coef[1:2,1])
Now let’s see how Extraversion relates to Entitlement Exploitativeness. The correlation should be smaller, because interpersonal dominance is not as directly related to exploiting others as Agreeableness is.
cor(EE~E,data=psy_data)
## [1] 0.245
Now let’s standardize Entitlement Exploitativeness and Extraversion.
psy_data$EEz<-scale(psy_data$EE,center=TRUE,scale=TRUE)
psy_data$Ez<-scale(psy_data$E,center=TRUE,scale=TRUE)
Then we can run the standardized regression.
model<-summary(lm(EEz~Ez,data=psy_data))
model
##
## Call:
## lm(formula = EEz ~ Ez, data = psy_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.275 -0.723 -0.392 0.685 2.701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0000000000000000305 0.1371404187506063133 0.00 1.000
## Ez 0.2449518873871339431 0.1385050336420791950 1.77 0.083 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.979 on 49 degrees of freedom
## Multiple R-squared: 0.06, Adjusted R-squared: 0.0408
## F-statistic: 3.13 on 1 and 49 DF, p-value: 0.0832
Let’s round the coeffients to three decimals.
round(model$coef,digits=4)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.000 0.1371 0.000 1.0000
## Ez 0.245 0.1385 1.768 0.0832
Now here’s the unstandardized regression. Can you interpret the coefficients?
summary(lm(EE~E,data=psy_data))
##
## Call:
## lm(formula = EE ~ E, data = psy_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.358 -0.770 -0.417 0.730 2.877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2608 0.3602 0.72 0.472
## E 0.1567 0.0886 1.77 0.083 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.04 on 49 degrees of freedom
## Multiple R-squared: 0.06, Adjusted R-squared: 0.0408
## F-statistic: 3.13 on 1 and 49 DF, p-value: 0.0832
A multiple regression will allow us to examine the relation between Agreeableness and Entitlement Exploitativeness controlling for the association between Agreeableness and Extraversion. Because Extraversion and Agreeableness correlate with one another, we may be able to see the relation between Extraversion and Entitlement Exploitativeness more clearly after controlling for Agreeableness.
Compare the coefficients from this model to the two previous unstandardized models. Did the coefficients change? How about the statistical significance? What’s the R-squared value for the model? What does this value mean?
summary(lm(EE~E+A,data=psy_data))
##
## Call:
## lm(formula = EE ~ E + A, data = psy_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.857 -0.855 -0.133 0.556 2.662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2335 0.5083 2.43 0.019 *
## E 0.1896 0.0849 2.23 0.030 *
## A -0.2346 0.0909 -2.58 0.013 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.988 on 48 degrees of freedom
## Multiple R-squared: 0.175, Adjusted R-squared: 0.14
## F-statistic: 5.08 on 2 and 48 DF, p-value: 0.01
This requires the packages manipulate and shiny.
fancyMe=function(var){ xyplot(EE~psy_data[[var]],data=psy_data,xlab=var) }
choices = c(names(psy_data)) names(choices) = names(psy_data) choices = as.list(choices)
manipulate(fancyMe(var),var=picker(choices))
Ackerman, R. A., Witt, E. A., Donnellan, M. B., Trzesniewski, K. H., Robins, R. W., & Kashy, D. A. (2011). What does the Narcissistic Personality Inventory really measure? Assessment, 18, 67-87.
Gosling, S. D., Rentfrow, P. J., & Swann, W. B., Jr. (2003). A very brief measure of the Big Five personality domains. Journal of Research in Personality, 37, 504-528.
Raskin, R. N., & Hall, C. S. (1979). A narcissistc personality inventory. Psychological reports, 45, 590.
Rossman, A. J., & Chance, B. L. (2008). Workshop statistics: Discovery with data (3rd Ed.). Hoboken, NJ: Wiley.