Example: Researchers have theorized that conscientiousness, a personality trait assessing how organized, planful, and mindful of social rules an individual is, may predict important health outcomes. Specifically, theorists propose that people higher on conscientiousness may engage in less risky, and more beneficial, health-related behaviors than people lower on conscientiousness.
For the current example, we’re going to examine whether self-reported conscientiousness scores (higher scores indicate higher conscientiousness) predict people’s self-reported health scores (higher scores indicate people feel that they are in better overall physical health). The data set we’re using today includes the following variables:

- pid: participant ID number
- gender: participant gender
- consc: self-reported conscientiousness scores
- sr_health: self-reported health scores
Let’s examine whether these empirical observations support, or refute, psychologists’ theories about the relationship between conscientiousness and health outcomes.
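Before importing the data, make sure the packages behind the functions used in this lab are installed and loaded. A minimal setup sketch - the package-to-function mapping below is an assumption based on the functions that appear later in the lab:

# load the packages used in this lab
library(rio)        # import(): reading in the data
library(dplyr)      # mutate(), summarise(), and the pipe
library(ggplot2)    # plots
library(psych)      # describe(): descriptive statistics
library(epiDisplay) # tab1(): frequency tables
library(pwrss)      # pwrss.f.reg(): power analysis
library(broom)      # augment(): extracting residuals
library(lsr)        # etaSquared(): effect size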
data <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-1/data/consc_health.csv")
Model A (the model corresponding to the alternative hypothesis) includes conscientiousness as a predictor of health outcomes:
\[Model A: Health_i = \beta_0 + \beta_1*Conscientiousness_i + \epsilon_i\]
Model C (the model corresponding to the null hypothesis) excludes conscientiousness as a predictor of health outcomes:
\[Model C: Health_i = \beta_0 + \epsilon_i\]
The null hypothesis being tested is whether \(\beta_1\), the slope corresponding to the relationship between conscientiousness and health, is equal to zero:
\[H_0: \beta_1 = 0\]
Let’s conduct an a priori power analysis to determine the sample size we would need to have an 80% chance of detecting a significant relationship between conscientiousness and health outcomes (if one exists), assuming the effect is medium in size. Remember the conventions for effect sizes:
Cohen’s (1988) Conventions for \(R^2\) Effect Sizes:

- Small: \(R^2\) = .02
- Medium: \(R^2\) = .13
- Large: \(R^2\) = .26
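For reference, these \(R^2\) benchmarks correspond to Cohen's \(f^2\) values of roughly .02, .15, and .35, via the standard conversion \(f^2 = R^2 / (1 - R^2)\). A quick arithmetic check for the medium benchmark:

# Cohen's f^2 corresponding to a given R^2: f^2 = R^2 / (1 - R^2)
r2_medium <- 0.13
r2_medium / (1 - r2_medium) # approximately 0.15, Cohen's "medium" f^2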
We can perform this power analysis using the pwrss.f.reg function from the pwrss package. Recall the arguments that this function takes:

- n = NULL when you want to solve for the sample size needed to achieve a desired level of power (called an a priori power analysis).
- power = NULL when you want to solve for the power you achieved with a given sample size (called a post-hoc power analysis).

Go ahead and use the pwrss.f.reg function below to determine the sample size that would be needed to have an 80% chance of detecting a significant relationship between conscientiousness and health outcomes, using alpha = .05 (i.e., assuming we are 5% willing to make a Type I error):
pwrss.f.reg(r2 = 0.13,
m = 1,
k = 1,
alpha = 0.05,
n = NULL,
power = 0.80)
## Linear Regression (F test)
## R-squared Deviation from 0 (zero)
## H0: r2 = 0
## HA: r2 > 0
## ------------------------------
## Statistical power = 0.8
## n = 55
## ------------------------------
## Numerator degrees of freedom = 1
## Denominator degrees of freedom = 52.516
## Non-centrality parameter = 8.146
## Type I error rate = 0.05
## Type II error rate = 0.2
Question: Was the current study adequately powered for detecting a medium-sized effect?
Yes - the power analysis indicates that 55 participants are needed to achieve 80% power for a medium-sized effect, and the current study collected data from 60 participants.
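If you want to verify this directly, a post-hoc power analysis is a small sketch away: keep the same arguments, but supply the study's actual sample size of 60 and solve for power instead.

# post-hoc power analysis: solve for the power achieved with n = 60
pwrss.f.reg(r2 = 0.13,
            m = 1,
            k = 1,
            alpha = 0.05,
            n = 60,
            power = NULL)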
First, let’s check the data type that each variable was imported as.
str(data)
## 'data.frame': 60 obs. of 4 variables:
## $ pid : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "male" "male" "male" "male" ...
## $ consc : num 2.66 5.02 3.85 1.73 3.45 ...
## $ sr_health: num 1.91 4.49 3.4 2.05 4.21 ...
We would like pid and gender to be treated as factors and consc and sr_health to be treated as numeric variables. Convert variables as needed below:
data <- data %>%
mutate(pid = as.factor(pid),
gender = as.factor(gender))
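If you'd like, you can confirm the conversions took effect with a quick optional check of each column's class:

sapply(data, class) # pid and gender should now be "factor"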
Let’s examine the data descriptively for any evidence of data entry errors that need to be cleaned up before we move on with the analysis.
# General descriptive statistics
describe(data)
## vars n mean sd median trimmed mad min max range skew
## pid* 1 60 30.50 17.46 30.50 30.50 22.24 1.00 60.00 59.00 0.00
## gender* 2 60 1.50 0.50 1.50 1.50 0.74 1.00 2.00 1.00 0.00
## consc 3 60 2.85 0.97 2.84 2.83 0.92 0.57 5.02 4.45 0.08
## sr_health 4 60 3.05 1.00 3.04 3.02 1.07 1.14 5.23 4.09 0.18
## kurtosis se
## pid* -1.26 2.25
## gender* -2.03 0.07
## consc -0.42 0.13
## sr_health -0.80 0.13
# Frequency tables
tab1(data$pid)
## data$pid :
## Frequency Percent Cum. percent
## 1 1 1.7 1.7
## 2 1 1.7 3.3
## 3 1 1.7 5.0
## (rows for pid 4 through 59 omitted; each pid appears exactly once)
## 60 1 1.7 100.0
## Total 60 100.0 100.0
tab1(data$gender)
## data$gender :
## Frequency Percent Cum. percent
## female 30 50 50
## male 30 50 100
## Total 60 100 100
# Visualizations
hist(data$consc)
hist(data$sr_health)
ggplot(data) +
geom_bar(aes(x = gender))
Looks good!
Let’s get, at a minimum, the mean and standard deviation for the two key variables in our study: self-reported conscientiousness and health.
descriptives_table <- data %>%
summarise(M_Consc = mean(consc),
SD_Consc = sd(consc),
M_Health = mean(sr_health),
SD_Health = sd(sr_health))
# raw table with no formatting
descriptives_table
## M_Consc SD_Consc M_Health SD_Health
## 1 2.847374 0.9702901 3.053342 0.9985299
# adding table formatting
descriptives_table %>%
knitr::kable(digits = 2,
             col.names = c("Mean Conscientiousness", "SD Conscientiousness",
                           "Mean Health", "SD Health"),
             caption = "Descriptive Statistics for Conscientiousness and Health",
             format.args = list(nsmall = 2))
| Mean Conscientiousness | SD Conscientiousness | Mean Health | SD Health |
|---|---|---|---|
| 2.85 | 0.97 | 3.05 | 1.00 |
As mentioned in class, prior to fitting a linear model, it’s important to visualize the relationships between the variables in the study to make sure that the pattern observed fits the assumption of linearity.
Let’s make a scatterplot to visualize the nature of the relationship between conscientiousness and health scores.
ggplot(data = data, aes(x = consc, y = sr_health)) +
geom_point() +
geom_smooth(method = "lm") + # add se = FALSE if you don't want the SE shading on the graph
labs(x = "Conscientiousness",
y = "Self-reported health",
title = "The Relationship Between Conscientiousness and Health Scores")
## `geom_smooth()` using formula = 'y ~ x'
It looks like the general relationship does fit the pattern of a straight line. Let’s move forward with performing the linear regression analysis.
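As an optional supplementary check (not part of the original analysis), the Pearson correlation quantifies the strength of this linear association in a single number:

cor(data$consc, data$sr_health) # correlation between conscientiousness and health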
It is typical to mean center continuous predictors in linear regression models. This only affects the meaning of zero on the predictor variable. After mean centering, a score of zero on the predictor variable corresponds to a participant scoring equal to the mean of the predictor.
Let’s add a mean-centered version of the consc variable to our data set and call it consc_c.
data$consc_c <- scale(data$consc, center = TRUE, scale = FALSE) # set scale = TRUE if you want to standardize the scores on the predictor variable (i.e., transform the raw scores into z-scores)
Note that scale() returns a one-column matrix, which is why consc_c appears as consc_c[,1] in some of the output below.
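Equivalently, mean centering is just subtracting the mean. A quick sketch, using the illustrative name consc_c_check, that produces the same centered scores as a plain numeric vector:

consc_c_check <- data$consc - mean(data$consc)     # centering by hand
all.equal(as.numeric(data$consc_c), consc_c_check) # should return TRUE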
When do you center continuous predictors?
If there’s only a single continuous predictor in the model, it’s up to the researcher whether they’d like to center the predictor variable or not.
Fit a linear model predicting sr_health from consc_c using the lm function. Call the object model.
model <- lm(sr_health ~ consc_c, data = data)
Before we interpret the model output, let’s check to make sure the model assumptions were met.
Check whether model assumptions were met
Recall that we are making three assumptions about the model’s errors:

- The errors are normally distributed.
- The errors are independent of one another.
- The errors have the same variance across all levels of the predictor (homogeneity of variance).
Let’s check whether or not the model’s residuals meet each of these assumptions.
There are two ways to check this assumption. First, we could look at a distribution of the model’s residuals. We can do this by extracting the model’s residuals using the augment function from the broom package. This results in a table with columns corresponding to our model, like the value the model predicts for each participant (.fitted) and the residual for each participant (.resid). We’ll be graphing the values in the .resid column.
# storing table containing residuals
augment_model <- augment(model)
augment_model
## # A tibble: 60 × 8
## sr_health consc_c[,1] .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.91 -0.188 2.96 -1.05 0.0173 0.882 0.0126 -1.20
## 2 4.49 2.17 4.12 0.374 0.102 0.892 0.0112 0.445
## 3 3.40 1.00 3.54 -0.140 0.0347 0.893 0.000462 -0.160
## 4 2.05 -1.11 2.51 -0.462 0.0390 0.891 0.00575 -0.533
## 5 4.21 0.606 3.35 0.860 0.0233 0.886 0.0115 0.982
## 6 5.11 1.16 3.62 1.49 0.0408 0.870 0.0629 1.72
## 7 3.09 -0.0301 3.04 0.0510 0.0167 0.893 0.0000286 0.0581
## 8 4.58 1.06 3.57 1.00 0.0367 0.883 0.0255 1.16
## 9 3.90 0.255 3.18 0.724 0.0178 0.888 0.00619 0.826
## 10 2.64 0.192 3.15 -0.510 0.0173 0.891 0.00298 -0.581
## # ℹ 50 more rows
# plotting the distribution of the residuals
ggplot(data = augment_model, aes(x = .resid)) +
  geom_density(fill = "purple") + # density plot of the residuals
  stat_function(linetype = 2, # a normal distribution overlaid on top for reference
                fun = dnorm,
                args = list(mean = mean(augment_model$.resid),
                            sd = sd(augment_model$.resid)))
Slightly positively skewed, but the normality assumption is robust to some degree of violation.
Another way to check whether the errors are normally distributed is by looking at a Q-Q plot:
ggplot(model) +
geom_abline() +
stat_qq(aes(sample = .stdresid))
To meet the assumption, the points should be close to lying on the diagonal reference line. Looks like they approximately do!
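If you want a formal test to supplement these visual checks, one option (a supplementary sketch, not part of the original lab) is a Shapiro-Wilk test on the residuals:

shapiro.test(residuals(model)) # H0: the residuals are normally distributed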
Ensuring that the participants in the sample (and thus their errors) are independent of each other should occur during data collection.
When the research design itself does not guarantee independence, though, you can check for non-independence by plotting the residuals against a variable the errors should not correspond to (e.g., ID number, time).
Let’s check whether we have violated the assumption of independence of errors by plotting residuals against ID numbers for the model.
# create an ID column (let's assume the participants are listed in the order in which they participated in the study)
data$ID <- c(1:nrow(data))
# add the ID column to the table containing the residuals
augment_model$ID <- data$ID
# plot the ID numbers against the residuals
ggplot(data = augment_model, aes(x = ID, y = .resid)) +
geom_point() +
geom_smooth(se = F) +
geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We want to see no systematic pattern in the relationship between the residuals and a variable they should not be related to, like ID. It looks like there isn’t a strong pattern here - good!
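A formal alternative to eyeballing the plot (a supplementary sketch, assuming the lmtest package is installed) is the Durbin-Watson test, which tests for serial correlation among the residuals in the order the observations appear:

library(lmtest)
dwtest(model) # H0: no first-order autocorrelation among the residuals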
Homogeneity of variance is the assumption that the variance of an outcome is approximately the same across all values of the predictor(s). We can use a residuals plot to examine whether our model has met this assumption. To obtain a residuals plot, pass the model to the plot() function and request the first plot.
plot(model, 1)
We are looking to see that the amount of spread in the residuals across the range of fitted values is approximately the same. Looks pretty good - just a little sparse at the edges.
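As with the other assumptions, a formal test can supplement the plot. One option (again a supplementary sketch assuming the lmtest package) is the Breusch-Pagan test:

library(lmtest)
bptest(model) # H0: the error variance is constant across fitted values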
We can examine the model output using two different functions:

- summary()
- anova()
They each provide slightly different pieces of useful information, but both will help us come to the same conclusion.
The summary() function will provide:

- the parameter estimates (intercept and slope), along with their standard errors, t-values, and p-values
- the residual standard error
- R-squared and adjusted R-squared
- the F-test for the overall model

Let’s pair the summary() function with the confint() function to get 95% confidence intervals around each of our parameter estimates.
summary(model)
##
## Call:
## lm(formula = sr_health ~ consc_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5450 -0.6414 -0.1623 0.6817 2.0154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0533 0.1143 26.712 < 2e-16 ***
## consc_c 0.4904 0.1188 4.128 0.000119 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8854 on 58 degrees of freedom
## Multiple R-squared: 0.2271, Adjusted R-squared: 0.2138
## F-statistic: 17.04 on 1 and 58 DF, p-value: 0.0001186
confint(model)
## 2.5 % 97.5 %
## (Intercept) 2.824537 3.2821475
## consc_c 0.252605 0.7282069
Question: What is the value of b0 and what is its meaning? What is its 95% CI and what is its meaning?
b0 = 3.05; this is the health score predicted by our model for someone scoring at the mean on conscientiousness (when consc_c = 0, since the predictor is mean-centered).
95% CI [2.82, 3.28]: the range of values within which we’re 95% confident the true value of \(\beta_0\) lies.
Question: What is the value of b1 and what is its meaning? What is its 95% CI and what is its meaning?
b1 = 0.49; this is the predicted change in health scores per 1-unit change in conscientiousness.
Specifically, our model predicts that, for every 1-unit increase in conscientiousness, people’s health scores increase by 0.49 units.
95% CI [0.25, 0.73]: the range of values within which we’re 95% confident the true value of \(\beta_1\) lies.
Is conscientiousness a significant predictor of health outcomes?
Yes, b1 = 0.49, t(58) = 4.13, p < .001, 95% CI [0.25, 0.73].
Is the variance accounted for by the overall model in health scores significant?
Yes, R^2 = 0.23, F(1, 58) = 17.04, p < .001.
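To see what these estimates imply in practice, we can generate a prediction by hand from the fitted coefficients. For example, for someone 1 SD (0.97 units) above the mean on conscientiousness:

# predicted health score = intercept + slope * consc_c
3.0533 + 0.4904 * 0.97 # approximately 3.53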
The anova() function will provide:

- the sums of squares for the predictor and the residuals
- the degrees of freedom and mean squares
- the F-statistic and its p-value

The anova() function is helpful for checking that we did our by-hand calculations correctly (see the sketch after the question below).
anova(model)
## Analysis of Variance Table
##
## Response: sr_health
## Df Sum Sq Mean Sq F value Pr(>F)
## consc_c 1 13.359 13.3588 17.041 0.0001186 ***
## Residuals 58 45.468 0.7839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question: Does Model A, which includes conscientiousness as a predictor, make a significant improvement to the error accounted for in health scores compared to Model C?
Yes, F(1, 58) = 17.04, p < .001.
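Since the anova() output is meant to support by-hand checks, here is a quick sketch recomputing R^2 and F from the sums of squares in the table above:

ss_model    <- 13.359 # SS accounted for by consc_c (Model A vs. Model C)
ss_residual <- 45.468 # SS left unexplained by Model A
ss_model / (ss_model + ss_residual) # R^2 (PRE), approximately 0.227
(ss_model / 1) / (ss_residual / 58) # F(1, 58), approximately 17.04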
So far, we have only examined the significance of conscientiousness as a predictor of health outcomes. However, this does not inform us about the size of the relationship between conscientiousness and health outcomes.
PRE (the Proportional Reduction in Error) is a useful measure of effect size for calculating the proportion of variability in our outcome that is explained by its relationship to a predictor. In R, we can obtain PRE using the etaSquared() function from the lsr package.
etaSquared(model)
## eta.sq eta.sq.part
## consc_c 0.2270871 0.2270871
Question: How would you interpret the size of the relationship?
23% of the variation in people’s health scores is accounted for by their conscientiousness scores. This would be considered a medium-sized effect.
When there is only a single predictor in the model, \(\eta^2\) will be equal to \(\eta^2_{partial}\). When we get to multiple regression, we’ll see that these values are not always equal to each other and we’ll discuss why.
In this study, we were interested in examining whether people’s levels of conscientiousness (M = 2.85, SD = 0.97) predict people’s self-reported health outcomes (M = 3.05, SD = 1.00).
Using a linear regression analysis, we found that conscientiousness was a significant, positive predictor of people’s self-reported health outcomes, b1 = 0.49, t(58) = 4.13, p < .001, 95% CI [0.25, 0.73].
Additionally, the model including conscientiousness as a predictor accounted for a significant amount of the variability in people’s self-reported health scores, R^2 = 0.23, F(1, 58) = 17.04, p < .001. This was a medium-sized effect.