Research Scenario

Example: Researchers have theorized that conscientiousness, a personality trait assessing how organized, planful, and mindful of social rules an individual is, may predict important health outcomes. Specifically, theorists propose that people higher on conscientiousness may engage in less risky health-related, and more beneficial health-related, behaviors than people lower on conscientuousness.

For the current example, we’re going to examine whether self-reported conscientiousness scores (higher scores indicate higher conscientiousness) predict people’s self-reported health scores (higher scores indicate people feel that they are in better overall physical health). The data set we’re using today includes the following variables:

  • pid = Participant ID number
  • gender = Participant’s self-reported gender
  • consc = Participant’s average conscientiousness
  • sr_health = Participant’s self-reported health

Let’s examine whether these empirical observations support, or refute, psychologists’ theories about the relationship between conscientiousness and health outcomes.

Import Data

data <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-1/data/consc_health.csv")

Model Comparison

Model A (the model corresponding to the alternative hypothesis) includes conscientiousness as a predictor of health outcomes:

\[ModelA: Health_i = \beta_0 + \beta_1*Conscientiousness + \epsilon_i\]

Model C (the model corresponding to the null hypothesis) excludes conscientiousness as a predictor of health outcomes:

\[Model C: Health_i = \beta_0 + \epsilon_i\]

The Null Hypothesis

The null hypothesis being tested is whether \(\beta_1\), the slope corresponding to the relationship between conscientiousness and health, is equal to zero:

\[H_0: \beta_1\ = 0\]

A priori power analysis

Let’s conduct an a priori power analysis to examine the sample size we would need to have an 80% chance of detecting a significant relationship between conscientiousness and health outcomes, if there is one, if we assume that the size of the effect is medium. Remember the conventions for effect sizes:

Cohen’s (1988) Conventions for \(R^2\) Effect Sizes:

  • \(R^2\) = .02 is a small effect size
  • \(R^2\) = .13 is a medium effect size
  • \(R^2\) = .26 is a large effect size

We can perform this power analysis using the pwrss.f.reg function from the pwrss package. Recall the arguments that this function takes:

  • r2 = the estimated r-squared effect size
  • m = the number of predictors being tested in Model A
  • k = the total number of predictors in Model A
  • alpha = 0.05 (typically, in psychology we use an alpha of .05)
  • n = sample size
    • Set to NULL when you want to solve for the sample size needed to achieve a desired level of power (called an a priori power analysis).
  • power = The desired power level
    • Set to NULL when you want to solve for the power you achieved with a given sample size (called a post-hoc power analysis).

Go ahead and use the pwrss.f.reg function below to determine the sample size that would be needed to have an 80% chance of detecting a significant relationship between conscientiousness and health outcomes using (assuming we are 5% willing to make a Type I error):

pwrss.f.reg(r2 = 0.13, 
            m = 1, 
            k = 1, 
            alpha = 0.05,
            n = NULL,
            power = 0.80)
##  Linear Regression (F test) 
##  R-squared Deviation from 0 (zero) 
##  H0: r2 = 0 
##  HA: r2 > 0 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n = 55 
##  ------------------------------ 
##  Numerator degrees of freedom = 1 
##  Denominator degrees of freedom = 52.516 
##  Non-centrality parameter = 8.146 
##  Type I error rate = 0.05 
##  Type II error rate = 0.2

Question: Was the current study adequately powered for detecting a medium sized effect?

Data Cleaning & Wrangling

First, let’s check the measure type that each variable was imported as.

str(data)
## 'data.frame':    60 obs. of  4 variables:
##  $ pid      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender   : chr  "male" "male" "male" "male" ...
##  $ consc    : num  2.66 5.02 3.85 1.73 3.45 ...
##  $ sr_health: num  1.91 4.49 3.4 2.05 4.21 ...

We would like pid and gender to be treated as factors and consc and sr_health to be treated as numeric variables. Convert variables as needed below:

data <- data %>%
  mutate(pid = as.factor(pid),
         gender = as.factor(gender))

Let’s examine the data descriptively for any evidence of data entry errors that need to be cleaned up before we move on with the analysis.

# General descriptive statistics
describe(data)
##           vars  n  mean    sd median trimmed   mad  min   max range skew
## pid*         1 60 30.50 17.46  30.50   30.50 22.24 1.00 60.00 59.00 0.00
## gender*      2 60  1.50  0.50   1.50    1.50  0.74 1.00  2.00  1.00 0.00
## consc        3 60  2.85  0.97   2.84    2.83  0.92 0.57  5.02  4.45 0.08
## sr_health    4 60  3.05  1.00   3.04    3.02  1.07 1.14  5.23  4.09 0.18
##           kurtosis   se
## pid*         -1.26 2.25
## gender*      -2.03 0.07
## consc        -0.42 0.13
## sr_health    -0.80 0.13
# Frequency tables
tab1(data$pid)

## data$pid : 
##         Frequency Percent Cum. percent
## 1               1     1.7          1.7
## 2               1     1.7          3.3
## 3               1     1.7          5.0
## 4               1     1.7          6.7
## 5               1     1.7          8.3
## 6               1     1.7         10.0
## 7               1     1.7         11.7
## 8               1     1.7         13.3
## 9               1     1.7         15.0
## 10              1     1.7         16.7
## 11              1     1.7         18.3
## 12              1     1.7         20.0
## 13              1     1.7         21.7
## 14              1     1.7         23.3
## 15              1     1.7         25.0
## 16              1     1.7         26.7
## 17              1     1.7         28.3
## 18              1     1.7         30.0
## 19              1     1.7         31.7
## 20              1     1.7         33.3
## 21              1     1.7         35.0
## 22              1     1.7         36.7
## 23              1     1.7         38.3
## 24              1     1.7         40.0
## 25              1     1.7         41.7
## 26              1     1.7         43.3
## 27              1     1.7         45.0
## 28              1     1.7         46.7
## 29              1     1.7         48.3
## 30              1     1.7         50.0
## 31              1     1.7         51.7
## 32              1     1.7         53.3
## 33              1     1.7         55.0
## 34              1     1.7         56.7
## 35              1     1.7         58.3
## 36              1     1.7         60.0
## 37              1     1.7         61.7
## 38              1     1.7         63.3
## 39              1     1.7         65.0
## 40              1     1.7         66.7
## 41              1     1.7         68.3
## 42              1     1.7         70.0
## 43              1     1.7         71.7
## 44              1     1.7         73.3
## 45              1     1.7         75.0
## 46              1     1.7         76.7
## 47              1     1.7         78.3
## 48              1     1.7         80.0
## 49              1     1.7         81.7
## 50              1     1.7         83.3
## 51              1     1.7         85.0
## 52              1     1.7         86.7
## 53              1     1.7         88.3
## 54              1     1.7         90.0
## 55              1     1.7         91.7
## 56              1     1.7         93.3
## 57              1     1.7         95.0
## 58              1     1.7         96.7
## 59              1     1.7         98.3
## 60              1     1.7        100.0
##   Total        60   100.0        100.0
tab1(data$gender)

## data$gender : 
##         Frequency Percent Cum. percent
## female         30      50           50
## male           30      50          100
##   Total        60     100          100
# Visualizations
hist(data$consc)

hist(data$sr_health)

ggplot(data) +
  geom_bar(aes(x = gender))

Looks good!

Descriptive Statistics

Let’s get, at a minimum, the mean and standard deviation for the two key variables in our study: self-reported conscientiousness and health.

descriptives_table <- data %>%
  summarise(M_Consc = mean(consc),
            SD_Consc = sd(consc),
            M_Health = mean(sr_health),
            SD_Health = sd(sr_health))

# raw table with no formatting
descriptives_table
##    M_Consc  SD_Consc M_Health SD_Health
## 1 2.847374 0.9702901 3.053342 0.9985299
# adding table formatting
descriptives_table %>%
  knitr::kable(digits = 2, col.names = c("Mean Conscientiousness", "SD Conscientiousness", "Mean Health", "SD Health"), caption = "Descriptive Statistics for Conscientiousness and Health", format.args = list(nsmall = 2))
Descriptive Statistics for Conscientiousness and Health
Mean Conscientiousness SD Conscientiousness Mean Health SD Health
2.85 0.97 3.05 1.00

Data Visualization

As mentioned in class, prior to fitting a linear model, it’s important to visualize the relationships between the variables in the study to make sure that the pattern observed fits the assumption of linearity.

Let’s make a scatterplot to visualize the nature of the relationship between conscientiousness and health scores.

ggplot(data = data, aes(x = consc, y = sr_health)) +
  geom_point() + 
  geom_smooth(method = "lm") + # add se = FALSE if you don't want the SE shading on the graph
  labs(x     = "Conscientiousness", 
       y     = "Self-reported health", 
       title = "The Relationship Between Conscientiousness and Health Scores")
## `geom_smooth()` using formula = 'y ~ x'

It looks like the general relationship does fit the pattern of a straight line. Let’s move forward with performing the linear regression analysis.

Centering the Predictor

It is typical to mean center continuous predictors in linear regression models. This only affects the meaning of zero on the predictor variable. After mean centering, a score of zero on the predictor variable corresponds to a participant scoring equal to the mean of the predictor.

Let’s add a mean-centered version of the consc variable to our data set and call it consc_c.

data$consc_c <- scale(data$consc, center = TRUE, scale = FALSE) # set scale = TRUE if you want to standardize the scores on the predictor variable (aka, transform the raw scores into z-scores)

When do you center continuous predictors?

  • When a score of zero is not meaningful on your predictor variable and you want the model to have a meaningful y-intercept value (b0)
  • When there are multiple continuous predictors in the model and you are constructing an interaction term between them (to reduce redundancy - we’ll take more about this later in the course!)

If there’s only a single continuous predictor in the model, it’s up to the researcher whether they’d like to center the predictor variable or not.

Fit the Model

Fit a linear model predicting sr_health from consc_c using the lm function. Call the object model.

model <- lm(sr_health ~ consc_c, data = data)

Before we interpret the model output, let’s check to make sure the model assumptions were met.

Check whether model assumptions were met

Recall that we are making three assumptions about the model’s errors:

  1. Errors are normally distributed
  2. Independence of errors
  3. Homogeneity of variance

Let’s check whether or not the model’s residuals meet each of these assumptions.

Checking whether errors are normally distributed

Distribution of the Residuals

There are two ways to check this assumption. First, we could look at a distribution of the model’s residuals. We can do this by extracting the model’s residuals using the augment function from the broom package. This results in a table with columns corresponding to our model, like the value the model predicts for each participant (.fitted) and the residual for each participant (.resid).

We’ll be graphing the values in the .resid column.

# storing table containing residuals
augment_model <- augment(model)
augment_model
## # A tibble: 60 × 8
##    sr_health consc_c[,1] .fitted  .resid   .hat .sigma   .cooksd .std.resid
##        <dbl>       <dbl>   <dbl>   <dbl>  <dbl>  <dbl>     <dbl>      <dbl>
##  1      1.91     -0.188     2.96 -1.05   0.0173  0.882 0.0126       -1.20  
##  2      4.49      2.17      4.12  0.374  0.102   0.892 0.0112        0.445 
##  3      3.40      1.00      3.54 -0.140  0.0347  0.893 0.000462     -0.160 
##  4      2.05     -1.11      2.51 -0.462  0.0390  0.891 0.00575      -0.533 
##  5      4.21      0.606     3.35  0.860  0.0233  0.886 0.0115        0.982 
##  6      5.11      1.16      3.62  1.49   0.0408  0.870 0.0629        1.72  
##  7      3.09     -0.0301    3.04  0.0510 0.0167  0.893 0.0000286     0.0581
##  8      4.58      1.06      3.57  1.00   0.0367  0.883 0.0255        1.16  
##  9      3.90      0.255     3.18  0.724  0.0178  0.888 0.00619       0.826 
## 10      2.64      0.192     3.15 -0.510  0.0173  0.891 0.00298      -0.581 
## # ℹ 50 more rows
# plotting histogram of residuals
ggplot(data = augment_model, aes(x = .resid)) + 
  geom_density(fill = "purple") + # histogram of the residuals
  stat_function(linetype = 2, # a normal distribution overlaid on top for reference
                fun = dnorm, 
                args = list(mean = mean(augment_model$.resid), 
                                sd   =   sd(augment_model$.resid))) 

Slightly positively skewed, but the normality assumption is robust to some degree of violation.

Q-Q Plot

Another way to check whether the errors are normally distributed is by looking at a Q-Q plot:

ggplot(model) +
  geom_abline() + 
  stat_qq(aes(sample = .stdresid))

To meet the assumption, the points should be close to lying on the diagonal reference line. Looks like they approximately do!

Checking independence of errors

Ensuring that each of the participants in the sample (and thus their errors) are independent of each other should occur during data collection.

To test for non-independence when the research design does not inform you, though, you can look at a plot of the residuals with a variable that the errors should not correspond to (e.g., ID number, time).

Let’s check whether we have violated the assumption of independence of errors by plotting residuals against ID numbers for the model.

# create an ID column (let's assume the participants are listed in the order in which they participated in the study)
data$ID <- c(1:nrow(data))

# add the ID column to the table containing the residuals
augment_model$ID <- data$ID

# plot the ID numbers against the residuals
ggplot(data = augment_model, aes(x = ID, y = .resid)) + 
  geom_point() +  
  geom_smooth(se = F) +
  geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We want there to not be a systematic pattern in the relationship between the residuals and a variable the residuals should not be related to, like ID. It looks like there isn’t a strong pattern here - good!

Checking homogeneity of variance

Homogeneity of variance is the assumption that the variance of an outcome is approximately the same across all values of the predictor(s). We can use a residuals plot to examine whether our model has met this assumption. To obtain a residuals plot, pass the model to the plot() function and request the firs plot.

plot(model, 1)

We are looking to see that the amount of spread in the residuals across the range of fitted values is approximately the same. Looks pretty good - just a little sparse at the edges.

Interpreting the Model Output

We can examine the model output using two different functions:

  • summary()
  • anova()

They each provide slightly different pieces of useful information, but both will help us come to the same conclusion.

The summary() function will provide:

  • The values of the parameter estimates (in this case, \(b_0\) and \(b_1\))
  • The significance of each of the parameter estimates based on a t-statistic
  • The variance accounted for by the overall model, \(R^2\), as well as its adjusted value, \(R^2_adj\)
  • The significance of the overall model based on an F-statistic

Let’s pair the summary() function with also running the confint() function to get 95% confidence intervals around each of our parameter estimates.

summary(model)
## 
## Call:
## lm(formula = sr_health ~ consc_c, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5450 -0.6414 -0.1623  0.6817  2.0154 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.0533     0.1143  26.712  < 2e-16 ***
## consc_c       0.4904     0.1188   4.128 0.000119 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8854 on 58 degrees of freedom
## Multiple R-squared:  0.2271, Adjusted R-squared:  0.2138 
## F-statistic: 17.04 on 1 and 58 DF,  p-value: 0.0001186
confint(model)
##                2.5 %    97.5 %
## (Intercept) 2.824537 3.2821475
## consc_c     0.252605 0.7282069

Question: What is the value of b0 and what is its meaning? What is its 95%CI and what is its meaning?

b0 = 1.66, this is the health score predicted by our model for someone scoring zero on conscientious (when consc = 0)

95%CI[0.94, 2.37], the range of values within which we’re 95% confident the true value of \(beta_0\) lies within.

Question: What is the value of b1 and what is its meaning? What is its 95%CI and what is its meaning?

b1 = 0.49, this is the predicted change in health scores per 1-unit change in conscientiousness.

Specifically, our model predicts that, for every 1-unit increase in conscientiousness, people’s health scores increase by 0.49 units

95%CI[0.25, 0.73], the range of values within which we’re 95% confident the true value of \(beta_1\) lies within.

Is conscientiousness a significant predictor of health outcomes?

Yes, b1 = 0.49, t(58) = 4.13, p < .001, 95%CI[0.25, 0.73].

Is the variance accounted for by the overall model in health scores significant?

Yes, R^2 = 0.23, F(1, 58) = 17.04, p < .001.

The anova() function will provide:

  • SSR = the additional sum of squares accounted for by Model A compared to Model C
  • SSE(A) = the remaining sum of squares left unaccounted for by Model A
  • df_Reduced = PA - PC (also called df_Error)
  • df_ModelA = n - PA
  • MSR = SSR/(PA-PC)
  • MSE = SSE(A)/df_ModelA
  • F = MSR/MSE
  • p

The anova() function is helpful for us checking that we did our by-hand calculations correctly.

anova(model)
## Analysis of Variance Table
## 
## Response: sr_health
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## consc_c    1 13.359 13.3588  17.041 0.0001186 ***
## Residuals 58 45.468  0.7839                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question: Does Model A, which includes conscientiousness as a predictor, make a significant improvement to the error accounted for in health scores compared to Model C?

Yes, F(1, 58) = 17.04, p < .001.

Effect Size

So far, we have only examined the significance of conscientiousness as a predictor of health outcomes. However, this does not inform us about the size of the relationship between conscientiousness and health outcomes.

PRE is a useful measure of effect size for calculating the proportion of variability in our outcome that is explained by its relationship to a predictor. In R, we can obtain PRE using the etaSquared() function from the lsr package.

etaSquared(model)
##            eta.sq eta.sq.part
## consc_c 0.2270871   0.2270871

Question: How would you interpret the size of the relationship?

23% of the variation in people’s health scores is accounted for by their relationship to people’s conscientiousness scores. This would be considered a medium sized effect.

When there is only a single predictor in the model, \(eta^2\) will be equal to \(eta^2_{partial}\). When we get to multiple regression, we’ll see that these values are not always equal to each other and we’ll discuss why.

APA-Style Summary

In this study, we were interested in examining whether people’s levels of conscientiousness (M = 2.85, SD = 0.97) predicts people’s self-reported health outcomes (M = 3.05, SD = 1.00).

Using a linear regression analysis, we found that conscientiousness was a significant, positive predictor of people’s self-reported health outcomes, b1 = 0.49, t(58) = 4.13, p < .001, 95%CI[0.25, 0.73].

Additionally, the model including conscientiousness as a predictor accounted for a significant amount of the variability in people’s self-reported health scores, R^2 = 0.23, F(1, 58) = 17.04, p < .001. This was a medium sized effect.