Correlation & Linear Regression

This document was composed from Dr. Snopkowski’s ANTH 504 Week 8 lecture and Danielle Navarro’s (2021) Learning Statistics with R, Chapter 5.7 (Correlations) and Chapter 15 (Linear Regression).

Load Packages & data

library(tidyverse)
load("parenthood-1.Rdata")

Aims

  1. Correlation
  • Measuring Relationships

    • Scatterplots

    • Covariance

    • Pearson’s Correlation Coefficient

    • Nonparametric measures - Spearman’s Rho

    • Interpreting Correlations

    • Causality

  2. Linear Regression
  • Understand linear regression with one predictor

  • Understand how we assess the fit of a regression model

    • R2
  • Running regression model in R

  • Interpret a regression model

Correlation

WHAT IS A CORRELATION?

Small Relationship

Strong Positive Correlation

Strong Negative Correlation

Measuring Relationships

  • We need to see whether as one variable increases, the other increases, decreases or stays the same.

  • This can be done by calculating the Covariance.

    • We look at how much each score deviates from the mean.

    • If both variables deviate from the mean by the same amount, they are likely to be related.
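For reference, the sample covariance (the quantity R’s cov() function computes) can be written as \(\mathrm{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})\): each pair of deviations from the means is multiplied together and then averaged (using n − 1).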

Problems with Covariance

  • It depends upon the units of measurement.

    • E.g., the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to kilometres, the covariance is about 11.
  • One solution: standardize it!

    • Divide by the standard deviations of both variables.
  • The standardized version of Covariance is known as the Correlation coefficient.

    • The correlation coefficient can be represented by

      • the Greek letter rho \(\rho\) when referring to the population correlation coefficient

      • the letter “r” when referring to the sample correlation coefficient

    • Both “\(\rho\)” and “r” represent the strength and direction of a linear relationship between two variables.

      • The value of the correlation coefficient ranges from -1 to 1, with -1 indicating a perfect negative linear relationship, 1 indicating a perfect positive linear relationship, and 0 indicating no linear relationship.

Correlation Coefficient in R

cor(x=var1, y=var2)

#enter the data
adverts_watched <- c(5, 4, 4, 6, 8)
packets_bought <- c(8, 9, 10, 13, 15)
#use cor function to calculate correlation coefficient
cor(adverts_watched, packets_bought)
## [1] 0.8711651

The output I got, 0.8711651, is the correlation coefficient (r) between the number of adverts watched and the number of packets bought. This value is positive and close to 1, indicating a strong positive linear relationship between the two variables. In other words, as the number of adverts watched increases, the number of packets bought also tends to increase.

Remember: correlation does not imply causation. While the correlation coefficient suggests a strong linear relationship between the two variables, it does not necessarily mean that watching more advertisements causes people to buy more packets. There might be other factors involved or it could be a mere coincidence. Further research and analysis would be needed to establish a causal relationship.

Note: the order of adverts_watched and packets_bought doesn’t matter; you’ll get the same value if you switch the order.

#switch order example
cor(packets_bought, adverts_watched)
## [1] 0.8711651
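Since the correlation coefficient was introduced above as a standardized covariance, a quick sketch (using the same two vectors) confirms that dividing cov() by the product of the two standard deviations reproduces the value cor() returned:

#covariance divided by the product of the standard deviations
cov(adverts_watched, packets_bought) / (sd(adverts_watched) * sd(packets_bought))
#this equals the value cor() returned above, 0.8711651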

The cor() function also works with data frames.

as_df <- data.frame(adverts_watched, packets_bought)

It will give you the correlation across all variables as long as there are no missing values and the variables are numeric.

cor(as_df)
##                 adverts_watched packets_bought
## adverts_watched       1.0000000      0.8711651
## packets_bought        0.8711651      1.0000000

Limitations of the Correlation Coefficient

  1. Linearity: The correlation coefficient only measures linear relationships. If the relationship between two variables is non-linear (e.g., quadratic, exponential), the correlation coefficient may not adequately capture the strength of the relationship. By plotting the data, you can visually inspect the relationship and determine if it is linear or not.

  2. Outliers: The correlation coefficient is sensitive to outliers. A few extreme data points can greatly affect the value of the correlation coefficient, leading to a misleading interpretation of the relationship between the variables. By plotting the data, you can identify and address outliers before calculating the correlation coefficient.

  3. Normal distribution: The Pearson correlation coefficient assumes that both variables are normally distributed. If the data is not normally distributed, the correlation coefficient may not be an appropriate measure of the relationship between the variables. Visualizing the data using histograms or Q-Q plots can help you assess the normality of the data.

  4. Homoscedasticity: The Pearson correlation coefficient assumes that the variance of one variable is constant across different values of the other variable (homoscedasticity). If the data exhibits heteroscedasticity (i.e., the variance of one variable changes across different values of the other variable), the correlation coefficient may not be an accurate measure of the relationship. By plotting the data, you can visually inspect for homoscedasticity.

How do we avoid these problems?

  1. Always plot your data to see what it looks like!

  2. Check the assumptions of correlation:

    • The relationship between the variables is linear.

    • The data are normal-ish; this matters particularly for small samples.
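To see why plotting matters, here is a minimal sketch with simulated data (not the course data): a perfectly deterministic but non-linear relationship can still produce a Pearson correlation near zero.

#hypothetical simulated example: y is completely determined by x, but non-linearly
x <- seq(-3, 3, length.out = 100)
y <- x^2
cor(x, y)   #approximately 0, despite the perfect (quadratic) relationship
plot(x, y)  #the scatterplot reveals the pattern immediately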


Correlation Example

  • How does an infant’s sleeping habits affect a parent’s mood?

  • Data:

    • dan.sleep – data on number of hours of sleep of dad

    • baby.sleep – data on number of hours of sleep of baby

    • dan.grump – data on grumpiness of dad (0-100; 100 = maximally grumpy)

    • day – integer (measured from day 1 – 100)

  • Data in file: parenthood.Rdata

  • Download the data file, then from RStudio go to: Open -> choose parenthood.Rdata and it should load.

  • Some preliminary descriptive statistics

library(psych)
## Warning: package 'psych' was built under R version 4.2.3
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(parenthood)
##            vars   n  mean    sd median trimmed   mad   min    max range  skew
## dan.sleep     1 100  6.97  1.02   7.03    7.00  1.09  4.84   9.00  4.16 -0.29
## baby.sleep    2 100  8.05  2.07   7.95    8.05  2.33  3.25  12.07  8.82 -0.02
## dan.grump     3 100 63.71 10.05  62.00   63.16  9.64 41.00  91.00 50.00  0.43
## day           4 100 50.50 29.01  50.50   50.50 37.06  1.00 100.00 99.00  0.00
##            kurtosis   se
## dan.sleep     -0.72 0.10
## baby.sleep    -0.69 0.21
## dan.grump     -0.16 1.00
## day           -1.24 2.90

Flow Chart for Correlation

Assumptions

Make a scatterplot of dan.sleep vs. dan.grump

plot(parenthood$dan.sleep, parenthood$dan.grump, xlab = "Dad's Sleep", ylab="Dad's grumpiness")

plot(parenthood$baby.sleep, parenthood$dan.grump, xlab = "Baby's Sleep", ylab="Dad's grumpiness")

Does the relationship appear linear? Yes

  1. Normal data
hist(parenthood$dan.sleep)

hist(parenthood$dan.grump)

shapiro.test(parenthood$dan.sleep)
## 
##  Shapiro-Wilk normality test
## 
## data:  parenthood$dan.sleep
## W = 0.97635, p-value = 0.06878
shapiro.test(parenthood$dan.grump)
## 
##  Shapiro-Wilk normality test
## 
## data:  parenthood$dan.grump
## W = 0.97952, p-value = 0.122
sapply(parenthood, length)
##  dan.sleep baby.sleep  dan.grump        day 
##        100        100        100        100

Given that our sample size is relatively large, it is probably better to rely on the histograms than on shapiro.test (which becomes overly sensitive to small deviations from normality in large samples).

The data are normal enough.

Pearson’s Correlation Coefficients

R code to run Pearson’s correlation coefficient (if assumptions are met; which they were in this case)

cor(parenthood$baby.sleep, parenthood$dan.grump)
## [1] -0.5659637

The Pearson correlation coefficient between baby sleep and Dan’s grumpiness is approximately -0.5659637. This indicates a moderate negative linear relationship, meaning that as the baby’s sleep duration increases, Dan’s grumpiness tends to decrease (and vice versa).

cor(parenthood$dan.sleep, parenthood$dan.grump)
## [1] -0.903384

The Pearson correlation coefficient between Dan’s sleep and Dan’s grumpiness is approximately -0.903384. This indicates a strong negative linear relationship, suggesting that as Dan’s sleep duration increases, his grumpiness tends to decrease significantly (and vice versa).

cor(parenthood)
##              dan.sleep  baby.sleep   dan.grump         day
## dan.sleep   1.00000000  0.62794934 -0.90338404 -0.09840768
## baby.sleep  0.62794934  1.00000000 -0.56596373 -0.01043394
## dan.grump  -0.90338404 -0.56596373  1.00000000  0.07647926
## day        -0.09840768 -0.01043394  0.07647926  1.00000000

Assumptions not met?

R code to run non-parametric Spearman’s rank correlation (if assumptions are NOT met)

cor(parenthood$baby.sleep, parenthood$dan.grump, method="spearman")
## [1] -0.5871037
cor(parenthood$dan.sleep, parenthood$dan.grump, method="spearman")
## [1] -0.8870816
cor(parenthood, method="spearman")
##              dan.sleep   baby.sleep   dan.grump          day
## dan.sleep   1.00000000  0.644518990 -0.88708165 -0.088967566
## baby.sleep  0.64451899  1.000000000 -0.58710366  0.002916397
## dan.grump  -0.88708165 -0.587103665  1.00000000  0.055654351
## day        -0.08896757  0.002916397  0.05565435  1.000000000

Significance

If you want to know whether the correlation is significant, you can use cor.test() which will give you a p-value. Notice the change from cor to cor.test.

What is the null hypothesis of this test?

Ho: \(\rho\) = 0 Ha: \(\rho\) ≠ 0

cor.test(parenthood$dan.sleep, parenthood$dan.grump)
## 
##  Pearson's product-moment correlation
## 
## data:  parenthood$dan.sleep and parenthood$dan.grump
## t = -20.854, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9340614 -0.8594714
## sample estimates:
##       cor 
## -0.903384
cor.test(parenthood$dan.sleep, parenthood$dan.grump, method="spearman", exact=FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  parenthood$dan.sleep and parenthood$dan.grump
## S = 314482, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.8870816

exact=FALSE helps get around ties: if any values are tied with each other, R can’t compute an exact p-value for Spearman’s test, so it uses an approximation instead.

cor.test(parenthood$baby.sleep, parenthood$dan.grump, method="spearman", exact=FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  parenthood$baby.sleep and parenthood$dan.grump
## S = 264491, p-value = 1.363e-10
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5871037

H0: \(\rho\) = 0 HA: \(\rho\) \(\neq\) 0

Reporting the Results

Dad’s hours of sleep were significantly correlated with his grumpiness, r(98) = -.903, p<0.001.

r(degrees of freedom) = correlation coefficient, p=p-value
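If you want to pull these numbers out of R rather than copying them from the printed output, the object returned by cor.test() stores them (these component names are standard for R’s htest objects); a minimal sketch:

#store the test result and extract the pieces used in the write-up
ct <- cor.test(parenthood$dan.sleep, parenthood$dan.grump)
ct$estimate   #the correlation coefficient, r
ct$parameter  #the degrees of freedom (n - 2; here 98)
ct$p.value    #the p-value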

Dealing with Missing Values

If you have missing values, which is relatively common when working with real data, the cor() function will not work without the use argument.

use="pairwise.complete.obs": Uses all the data it can to calculate pairwise correlations.

# Introduce a missing value
parenthood$dan.sleep[10] <- NA

# Calculate correlation using pairwise.complete.obs
cor(parenthood$dan.sleep, parenthood$dan.grump, use = "pairwise.complete.obs")
## [1] -0.903668

By using use = "pairwise.complete.obs", the cor() function will calculate the correlation coefficient for Dan’s sleep and his grumpiness using all available pairs of observations with complete data, ignoring the missing value introduced in the dataset.

Things to Remember about Correlation

  • It varies between -1 and +1

    • 0 = no relationship
  • It is an effect size

    • ±.1 = small effect

    • ±.3 = medium effect

    • ±.5 = large effect

  • Coefficient of determination, r2

    • By squaring the value of r you get the proportion of variance in one variable shared by the other.
  • Correlation and Causality

    • The third-variable problem: In any correlation, causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting the results.

    • Direction of causality: Correlation coefficients say nothing about which variable causes the other to change

Linear Regression

Questions you might ask:

What is Linear Regression? A way of predicting the value of one variable from another.

Assumption of linearity

Describing a Straight Line (y=mx+b)

b1

  • Regression coefficient for the predictor

  • Gradient (slope) of the regression line

  • Direction/Strength of Relationship

b0

  • Intercept (value of Y when X = 0)

  • Point at which the regression line crosses the Y-axis

\(\epsilon\) represents the residuals: the error between each observed point and the regression line.
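Putting these pieces together, the straight-line model for observation \(i\) can be written (in the notation used here) as \(Y_i = b_0 + b_1 X_i + \epsilon_i\): the outcome equals the intercept, plus the slope times the predictor, plus a residual.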

Intercepts and Gradients

Ordinary Least Squares (OLS)

Determine the “Best Fit” Line

How good is the model?

The regression line is only a model based on the data.

  • This model might not reflect reality.

    • We need some way of testing how well the model fits the observed data.

How?

  • R2 Pearson Correlation Coefficient Squared

  • F statistic

Sums of Squares

  • SST: Total variability (variability between scores and the mean).

  • SSR: Residual/error variability (variability between the regression model and the actual data).

  • SSM: Model variability (difference in variability between the model and the mean)
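For an ordinary least squares model with an intercept, these three quantities fit together as \(SS_T = SS_M + SS_R\): the total variability splits into the part the model explains and the part left over as error.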

Diagram showing where the regression sums of squares come from.

Testing the Model: ANOVA

If the model results in better prediction than using the mean, then we expect SSM to be much greater than SSR

Mean Squared Error

  • Sums of Squares are total values.

  • They can be expressed as averages (dividing by the df).

  • These are called Mean Squares, MS
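Concretely, for a model with \(k\) predictors fit to \(n\) observations, \(MS_M = SS_M / k\) and \(MS_R = SS_R / (n - k - 1)\), and the F statistic from the ANOVA test above is their ratio, \(F = MS_M / MS_R\).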

Testing The Model: R2

  • The proportion of variance accounted for by the regression model.

  • The Pearson Correlation Coefficient Squared
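In terms of the sums of squares defined earlier, \(R^2 = SS_M / SS_T = 1 - SS_R / SS_T\); with a single predictor this is exactly the squared Pearson correlation, \(r^2\).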

Calculation of Regression

Example in R

Step 1: Assumption

Still using the parenthood data. How does an infant’s sleeping habits affect a parent’s mood?

  • Data:

    • dan.sleep – data on number of hours of sleep of dad

    • baby.sleep – data on number of hours of sleep of baby

    • dan.grump – data on grumpiness of dad (0-100; 100 = maximally grumpy)

    • day – integer (measured from day 1 – 100)

plot(parenthood$dan.sleep, parenthood$dan.grump, xlab = "Dad's sleep", ylab = "Dad's grumpiness")

abline(lm(parenthood$dan.grump ~ parenthood$dan.sleep), col = "red")

Do it with ggplot:

# Create a scatter plot with a linear regression line
ggplot(data = parenthood, aes(x = dan.sleep, y = dan.grump)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Dad's sleep", y = "Dad's grumpiness")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).

geom_smooth(method = "lm", se = FALSE, color = "red") adds a linear regression line to the plot.

The method = "lm" argument specifies that we want to fit a linear model,

se = FALSE means we don’t want to display the confidence interval around the regression line,

and color = "red" sets the color of the line to red.

Ask: Is the assumption of linearity met? Yes

Ask: Are there any outliers / obvious unusual cases? No

Regression

Step 2: Run Regression

#linear regression model
reg_model <- lm(parenthood$dan.grump ~ parenthood$dan.sleep)
#print Outputs the intercept and slope
print(reg_model)
## 
## Call:
## lm(formula = parenthood$dan.grump ~ parenthood$dan.sleep)
## 
## Coefficients:
##          (Intercept)  parenthood$dan.sleep  
##              125.815                -8.922

You should store the model in a variable. If you just call lm() without storing the result, it won’t print out all the items that you need; we use summary() on the stored model to see the full output.

#alternative (but equal) way to write this code
reg_model2 <- lm(dan.grump ~ dan.sleep, data = parenthood)
print(reg_model2)
## 
## Call:
## lm(formula = dan.grump ~ dan.sleep, data = parenthood)
## 
## Coefficients:
## (Intercept)    dan.sleep  
##     125.815       -8.922

b0 is the number under “(Intercept)” and b1 is the number under parenthood$dan.sleep

summary(reg_model)
## 
## Call:
## lm(formula = parenthood$dan.grump ~ parenthood$dan.sleep)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9851  -2.1674  -0.3719   2.6265  11.7771 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          125.8152     3.0233   41.62   <2e-16 ***
## parenthood$dan.sleep  -8.9221     0.4293  -20.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.336 on 97 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8166, Adjusted R-squared:  0.8147 
## F-statistic: 431.9 on 1 and 97 DF,  p-value: < 2.2e-16

Let’s go through the output of the linear regression model step by step:

  1. Call: This shows the function call used to create the model, with the dependent variable parenthood$dan.grump and the independent variable parenthood$dan.sleep.

  2. Residuals: The summary statistics of the residuals (differences between the observed and predicted values). The minimum, 1st quartile (1Q), median, 3rd quartile (3Q), and maximum values of the residuals are displayed.

  3. Coefficients: This section provides information about the estimated coefficients for the linear regression model.

  4. Estimate: The estimated coefficients for the intercept and the independent variable (parenthood$dan.sleep). In this case, the intercept is 125.815 and the slope (coefficient) for Dan’s sleep is -8.922.

  • \(\text{Grumpiness}_i = 125.82 - 8.92 \times \text{Sleep}_i + \epsilon_i\)

  • If you know the amount of sleep, you can plug that in and predict grumpiness. For every additional hour of sleep, Dan’s grumpiness goes down by about 8.92 points.

  5. Std. Error: The standard errors of the estimated coefficients.

  6. t value: The t-statistic for each coefficient, calculated as the Estimate divided by the Std. Error.

  7. Pr(>|t|): The p-value associated with the t-statistic, used for hypothesis testing. In this case, both p-values are very small (< 2e-16), indicating that the coefficients are significantly different from zero at a 0.05 significance level.

  8. Residual standard error: A measure of the model’s overall accuracy, calculated as the square root of the mean squared error (MSE). In this case, it’s 4.336.

  9. Multiple R-squared: A measure of the proportion of the variance in the dependent variable that is explained by the model. In this case, it’s 0.8166, indicating that the model explains 81.66% of the variance in Dan’s grumpiness.

  10. Adjusted R-squared: A modified version of the R-squared that adjusts for the number of predictors in the model. In this case, it’s 0.8147, which is very close to the Multiple R-squared value, as there is only one predictor.

  11. F-statistic: A statistic that tests the overall significance of the model. The F-statistic is 431.9 in this case, with 1 and 97 degrees of freedom (DF); the degrees of freedom are 97 rather than 98 because R dropped the observation with the missing dan.sleep value we introduced earlier. This summarizes how well the model fits the data.

  12. p-value: The p-value associated with the F-statistic. In this case, it’s < 2.2e-16, indicating that the model is statistically significant at a 0.05 significance level.

In summary, the linear regression model suggests a strong negative relationship between Dan’s sleep and his grumpiness, with each additional hour of sleep associated with roughly an 8.92-point decrease in grumpiness. The model explains 81.66% of the variance in Dan’s grumpiness, and it is statistically significant.

Output: Model Summary

Tips: The details after Coefficients will be most helpful to look at. There is no need to report the t-value. We care about the parenthood$dan.sleep row most.

F-Test

  • Looks at whether the variance explained by the model (SSM) is significantly greater than the error within the model (SSR).

  • It tells us whether using the regression model is significantly better at predicting values of the outcome than using the mean.

How to Interpret BETA values

  • Beta values: the change in the outcome associated with a unit change in the predictor.

  • Standardized beta values: tell us the same but expressed as standard deviations.

    • Can get these using the standardCoefs() function (see details below)

BETA VALUES

b1 = -8.922.

As hours of sleep increase by 1 hour, grumpiness declines by about 8.92 units.

STANDARDIZED BETA VALUES

library(lsr)
standardCoefs(reg_model)
##                              b      beta
## parenthood$dan.sleep -8.922082 -0.903668
  • The output shows us the original beta value (-8.922) and the standardized value: -0.904.

  • We interpret the standardized beta as: with a 1 standard deviation increase in the independent variable (Dan’s sleep), we see a corresponding decrease of about 0.904 standard deviations in grumpiness.

This allows us to compare variables in our model that may be on different scales; it tells us which one has a larger effect.

Interpreting the Model

How much grumpiness would you expect Dan to have after sleeping for 4 hours?
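A minimal sketch of this prediction, using the one-predictor model stored earlier as reg_model2 (whose coefficients were roughly 125.8 and -8.92):

#predicted grumpiness after 4 hours of sleep
predict(reg_model2, newdata = data.frame(dan.sleep = 4))
#by hand: 125.815 + (-8.922 * 4) is approximately 90.1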

R AND R2

R: The correlation between the observed values of the outcome, and the values predicted by the model.

R2: The proportion of variance “accounted for” by the model. Note: this does not say anything about causality – just association/correlation.

Multiple Regression

Example

How does an infant’s sleeping habits affect a parent’s mood?

  • Data: parenthood

    • dan.sleep – data on number of hours of sleep of dad

    • baby.sleep – data on number of hours of sleep of baby

    • dan.grump – data on grumpiness of dad (0-100; 100 = maximally grumpy)

    • day – integer (measured from day 1 – 100)

  • We will add the following predictor variables to our previous example.

    • baby.sleep [the baby’s amount of sleep]

Scatterplot Matrix

Look for linear and non-linear relationships, outliers, clustering or grouping, and variable distributions. You can use this to help determine which variables to include in the analysis.

  • When we have multiple variables, we may want to be able to see all the possible scatterplots. One way to do this is with the pairs() function from the graphics package:
pairs(parenthood)

Multiple Regression Equation

With multiple regression the relationship is described using a variation of the equation of a straight line.
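Written out, the model for observation \(i\) with \(n\) predictors is \(Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_n X_{ni} + \epsilon_i\), where: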

  • b0 is the intercept. The intercept is the value of the Y variable when all Xs = 0. This is the point at which the regression plane crosses the Y-axis (vertical).

  • b1 is the regression coefficient (slope) for variable 1.

  • b2 is the regression coefficient (slope) for variable 2.

  • bn is the regression coefficient (slope) for nth variable.

NOTE: These are the slopes when the other variables are controlled for.

Two Predictors Example

#install.packages("plotly")
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(data = parenthood,
        x = ~dan.sleep,
        y = ~dan.grump,
        z = ~baby.sleep,
        type = "scatter3d",
        mode = "markers",
        marker = list(size = 5, color = "blue", opacity = 0.8),
        text = paste("Day:", parenthood$day))
## Warning: Ignoring 1 observations

Once you create this plot, you can click on it and move it around

library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(parenthood, use="complete.obs"), method="number")

Running Multiple Regression

regression.2 <- lm(dan.grump ~ dan.sleep + baby.sleep, 
data=parenthood)
print(regression.2)
## 
## Call:
## lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood)
## 
## Coefficients:
## (Intercept)    dan.sleep   baby.sleep  
##   125.83034     -8.94456      0.01756
summary(regression.2)
## 
## Call:
## lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0017  -2.1779  -0.3551   2.6247  11.7757 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 125.83034    3.04802  41.283   <2e-16 ***
## dan.sleep    -8.94456    0.55409 -16.143   <2e-16 ***
## baby.sleep    0.01756    0.27147   0.065    0.949    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.358 on 96 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8166, Adjusted R-squared:  0.8128 
## F-statistic: 213.8 on 2 and 96 DF,  p-value: < 2.2e-16

Based on the output from the multiple regression analysis, we can make the following interpretations:

  1. The intercept (125.83) represents the expected grumpiness of Dan when both Dan’s sleep and the baby’s sleep are 0 hours. This value is not meaningful in a real-life context, but it is a necessary part of the regression equation.

  2. The coefficient for Dan’s sleep (-8.94456) suggests that for every additional hour of sleep Dan gets, his grumpiness is expected to decrease by approximately 8.94 units, holding the baby’s sleep constant. This relationship is highly significant (p-value < 2e-16), indicating that Dan’s sleep has a strong and negative association with his grumpiness.

  3. The coefficient for baby’s sleep (0.01756) indicates that for every additional hour of sleep the baby gets, Dan’s grumpiness is expected to increase by approximately 0.018 units, holding Dan’s sleep constant. However, this relationship is not statistically significant (p-value = 0.949), suggesting that the baby’s sleep has no significant effect on Dan’s grumpiness when accounting for Dan’s sleep.

The results of the multiple regression analysis show that Dan’s sleep has a significant negative effect on his grumpiness, with each additional hour of sleep reducing his grumpiness by approximately 8.94 units (p-value < 2e-16). In contrast, the baby’s sleep does not appear to have a significant effect on Dan’s grumpiness (p-value = 0.949) when accounting for Dan’s sleep. The overall model explains approximately 81.66% of the variance in Dan’s grumpiness (Multiple R-squared = 0.8166, Adjusted R-squared = 0.8128), and it is highly significant (F-statistic = 213.8, p-value < 2.2e-16), indicating that the combination of Dan’s sleep and baby’s sleep provides a meaningful explanation of the variation in Dan’s grumpiness.

Non-Testable Assumptions

  1. Variable Type: The outcome must be continuous. Predictors can be continuous or categorical (including dichotomous).

  2. Non-Zero Variance: Predictors must not have zero variance. If this assumption is not met, the corresponding beta value cannot be estimated (the model won’t run if a predictor never varies).

  3. Additivity: When there are several predictors, their combined effect on the outcome is best described by adding their effects together.

  4. Independence: All values of the outcome variable should come from different observations (e.g., different individuals or independent units).

Testable Assumptions

Fit linear regression model: model <- lm(column_y ~ column_x1 + column_x2, data = df)

Please replace df, column_y, column_x1, and column_x2 with the appropriate data frame and column names for your specific analysis.

  1. Normality: The residuals need to be normally distributed (note: the independent and dependent variables themselves don’t need to be normal).

    • Q-Q plot of the residuals: plot(model, which = 2)

  2. Linearity: The relationship between the predictors and the outcome variable needs to be linear.

    • Check your scatterplot matrix: pairs(df)

    • Check the residuals vs. fitted values plot: plot(model, which = 1)

  3. Homoscedasticity: The standard deviation of the residuals should be the same for all values of the predictors.

    • Check the residuals versus fitted values plot (see the generic code example above).

  4. Uncorrelated predictors (no multicollinearity):

    • Check your scatterplot matrix: pairs(df)

    • Check the correlation matrix or variance inflation factors (VIFs) for multicollinearity: library(car) then vif(model)

  5. Independence of residuals: Residuals should be independent of each other.

    • Check to make sure there aren’t any patterns or trends in the residuals plots (see item 2 above).

    • Ensure that there are no extreme outliers or influential points.

    • Check the Durbin-Watson test for independence of residuals: library(car) then durbinWatsonTest(model)

Residuals

Many of our assumptions revolve around residuals.

  1. Ordinary residuals – the actual, raw residuals: the difference between the fitted value \(\hat{Y}_i\) and the observed value \(Y_i\). A downside of ordinary residuals is that they will always have a different standard deviation for each regression model.

  2. Standardized residuals – ordinary residuals that have been standardized to have a standard deviation of 1.

  3. Studentized residuals (or jackknifed residuals) – a standardized version of the residuals that you would have gotten if you had deleted the ith observation from the dataset before fitting the model.
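A minimal sketch of how to get each type of residual in R for the two-predictor model fit above (these are all base R functions for lm objects):

residuals(regression.2)   #ordinary (raw) residuals
rstandard(regression.2)   #standardized residuals
rstudent(regression.2)    #studentized (jackknifed) residuals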

Checking Normality of residuals

Step 1: Plot the residuals

hist(x = residuals(regression.2), xlab="Value of residuals", breaks = 20)

QQ Plot (Quantile-Quantile plot)

This plot compares the quantiles of two data sets. When we compare against a normal distribution, we can see deviations from normality. If the data are perfectly normal, the points will fall exactly on the diagonal line.

qqnorm(residuals(regression.2))
qqline(residuals(regression.2))

Linearity

We need to check that the residuals are centered around 0 across the whole range of fitted values; this shows us that our model is in fact linear. Any curvature in the plot suggests that there are regions where the regression line tends to fall above or below the observed points. The easiest way to see this plot is:

plot(regression.2, which = 1)

Homogeneity of Variance (for residuals)

The variance of the residuals is assumed to be constant. We will use the default plot that R provides to examine this assumption:

plot(regression.2, which =3)

This plot shows the fitted values (model predictions) against the square root of the absolute value of the standardized residuals. It is used to diagnose violations of homogeneity of variance: if the variance really is constant, then the line through the middle should be horizontal and flat.

ASSUMPTIONS TO TEST

  1. Normality – the residuals need to be normally distributed (note: the independent and dependent variables themselves don’t need to be normal).

  2. Linearity – the relationship between the X’s and Y needs to be linear: 1) check your scatterplot matrix, 2) check the residuals.

  3. Homogeneity of variance – the standard deviation of the residuals is the same for all values of \(\hat{Y}\) and all values of X.

You can check all of these assumptions with one R command:

plot(regression.2)

Uncorrelated Predictors: NO MULTICOLLINEARITY

Multicollinearity exists if predictors are highly correlated.

  • Your standard errors can get very large.

  • Your betas may be unreliable.

  • You are not able to determine which predictor is important.

We will use variance inflation factors (VIF) to determine whether our predictors are too highly correlated. The VIF for a predictor is essentially calculated by regressing that predictor on all the other predictors in the model, which indicates how ‘correlated’ the predictors are. The square root of the VIF tells you how much wider the confidence interval for the corresponding beta is, relative to what it would be if that predictor were uncorrelated with the others.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif(regression.2)
##  dan.sleep baby.sleep 
##   1.648835   1.648835

Values greater than 10 are very problematic. Some conservative estimates state that you should look for less than 2.5.

Outliers or Influential Points

  1. Outliers – check plot 3, which shows your residuals. If there’s one (or more) that seems outside the range of the others, you might want to look into it more.

  2. Influential points – a point that significantly impacts the regression line. You can see this in plot 4, which examines leverage: simply the amount of influence a particular point has over the regression line.

We will examine influential points through a measure called Cook’s distance. If you’d like to see the calculation, check your book (section 15.9.2). Weisberg (1982) suggests that absolute values greater than one may be cause for concern; others suggest 4/N. To get Cook’s distance:

cooks.distance(regression.2) #this will output a value for each data point
##            1            2            3            4            5            6 
## 1.702185e-03 1.742778e-02 4.948815e-03 1.271730e-03 2.845249e-04 6.868045e-05 
##            7            8            9           11           12           13 
## 2.976950e-05 5.261590e-04 4.503120e-05 2.748718e-02 8.184516e-03 7.167866e-04 
##           14           15           16           17           18           19 
## 1.553015e-04 2.454317e-03 4.034285e-05 3.610709e-03 2.115542e-03 1.545613e-04 
##           20           21           22           23           24           25 
## 1.756713e-02 7.617503e-05 1.690731e-03 3.022426e-02 3.129870e-05 2.416161e-04 
##           26           27           28           29           30           31 
## 8.476893e-04 5.741160e-02 7.689597e-04 5.440287e-04 2.541346e-03 5.176332e-02 
##           32           33           34           35           36           37 
## 6.133860e-03 2.776889e-02 5.423430e-02 2.151538e-02 4.585454e-02 1.745798e-02 
##           38           39           40           41           42           43 
## 4.833800e-02 2.211607e-03 6.305783e-05 4.556122e-05 1.081720e-02 2.389440e-02 
##           44           45           46           47           48           49 
## 3.362380e-02 1.397640e-02 1.769445e-02 1.593165e-02 4.380215e-03 8.207379e-03 
##           50           51           52           53           54           55 
## 2.594572e-04 3.685649e-02 8.219271e-04 1.905647e-03 2.327857e-03 4.377664e-03 
##           56           57           58           59           60           61 
## 1.348292e-03 1.784234e-03 4.620543e-04 1.212277e-03 3.121308e-03 1.566109e-03 
##           62           63           64           65           66           67 
## 3.895826e-04 1.334338e-04 1.122386e-01 2.839275e-02 1.151948e-02 1.216359e-03 
##           68           69           70           71           72           73 
## 2.307132e-03 2.086943e-04 1.041284e-02 2.795991e-03 1.462321e-03 4.417124e-04 
##           74           75           76           77           78           79 
## 2.898103e-03 2.657275e-03 4.768815e-04 1.429810e-02 4.252532e-02 4.282003e-03 
##           80           81           82           83           84           85 
## 3.903289e-03 2.957952e-02 1.627924e-04 2.228364e-03 1.510660e-04 9.314894e-02 
##           86           87           88           89           90           91 
## 3.034187e-04 3.451490e-04 3.933008e-05 3.575125e-04 7.341481e-04 4.995594e-03 
##           92           93           94           95           96           97 
## 4.384140e-03 1.175413e-02 4.613718e-03 1.708832e-02 2.215978e-04 1.003870e-02 
##           98           99          100 
## 1.263797e-02 2.381551e-02 8.188032e-03
max(cooks.distance(regression.2)) # will give you the max value 
## [1] 0.1122386

This gives you the maximum Cook’s distance across all of your data points; you want it to be less than 1. A value greater than one is a potential problem.
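You can also plot Cook’s distance directly rather than scanning the full vector; the fourth default diagnostic plot for an lm object shows it for each observation:

plot(regression.2, which = 4)  #Cook's distance for each observation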

Durbin-Watson test

Tests independence of errors by checking for autocorrelation in the residuals: basically, are residuals that are “close” to each other (adjacent observations) correlated with each other?

We are likely to see this if we had a dataset that has a time component – like stock price; Yesterday’s stock price is correlated with today’s stock price.

The values from this test range from 0 to 4; 2 means there is no autocorrelation, values between 0 and 2 indicate positive autocorrelation, and values between 2 and 4 indicate negative autocorrelation.

Durbin-Watson test is sensitive to the order of the data, so it’s essential to ensure that the data is sorted correctly before running the test.

Values <1 or >3 may be problematic; Closer to 2 is best!

Use the durbinWatsonTest(model) function from the car package (load it with library(car)); the shorter alias dwt() also works.

durbinWatsonTest(regression.2)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.07972016      2.139017   0.496
##  Alternative hypothesis: rho != 0

Here’s how to interpret the Durbin-Watson test output:

Test statistic close to 2: This indicates no autocorrelation in the residuals. The residuals are independent, which is a desirable property in a linear regression model.

Test statistic significantly less than 2: This suggests positive autocorrelation. Positive autocorrelation occurs when a positive error (residual) in one observation increases the likelihood of a positive error in the next observation.

Test statistic significantly greater than 2: This indicates negative autocorrelation. Negative autocorrelation occurs when a positive error (residual) in one observation increases the likelihood of a negative error in the next observation.

In this Durbin-Watson test output:

  • lag is the order of autocorrelation tested, which is 1 in this case. It tests the correlation between the residuals at time t and time t-1.

  • Autocorrelation is the estimated autocorrelation at the specified lag (in this case, -0.07972016 for lag 1). It’s a relatively small value, close to 0, indicating little to no autocorrelation.

  • D-W Statistic is the Durbin-Watson test statistic, which is 2.139017. Since it is close to 2, it suggests that there is no significant autocorrelation in the residuals.

  • p-value is the probability of observing the test statistic under the null hypothesis of no autocorrelation. In this case, the p-value is 0.496, which is relatively large. A large p-value indicates that there is insufficient evidence to reject the null hypothesis of no autocorrelation.

Based on this output, I can conclude that there is no significant autocorrelation in the residuals at lag 1, and the assumption of independent residuals appears to be met for the linear regression model.

INTERPRETING OUR MODEL

If Dan got 4 hours of sleep last night and baby slept for 5 hours, how much grumpiness do we expect from Dan today?

Using the coefficients from regression.2:

Grumpiness = 125.83 - 8.94(4) + 0.018(5)

Grumpiness = 125.83 - 35.78 + 0.09

Grumpiness ≈ 90.14
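The same prediction can be obtained with predict(); a minimal sketch using the fitted two-predictor model:

predict(regression.2, newdata = data.frame(dan.sleep = 4, baby.sleep = 5))
#approximately 90.1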

Compute standardized regression coefficients

RECALL: STANDARDIZED BETA VALUES? The standardCoefs() function from the lsr package computes standardized regression coefficients, also known as beta coefficients. These coefficients represent the strength and direction of the relationship between the predictor variables and the outcome variable, but they are standardized to be on the same scale, allowing for direct comparison between predictors.

library(lsr)
standardCoefs(regression.2)
##                     b         beta
## dan.sleep  -8.9445644 -0.905945167
## baby.sleep  0.0175595  0.003630047

How do we interpret these results? Which variable has a larger effect?

  • dan.sleep: The beta coefficient is -0.905945167, which means that for every 1 standard deviation increase in “dan.sleep,” there is a 0.906 standard deviation decrease in “dan.grump.” The negative sign indicates an inverse relationship.

  • baby.sleep: The beta coefficient is 0.003630047, which means that for every 1 standard deviation increase in “baby.sleep,” there is a 0.0036 standard deviation increase in “dan.grump.” The positive sign indicates a direct relationship.

As Dan’s sleep increased by 1 standard deviation, Dan’s grumpiness decreased by about 0.91 of a standard deviation. In contrast, as the baby’s sleep increased by 1 standard deviation, Dan’s grumpiness increased by only about 0.004 of a standard deviation.

Comparing the two beta coefficients, the absolute value of -0.905945167 (for “dan.sleep”) is much larger than the absolute value of 0.003630047 (for “baby.sleep”). This suggests that “dan.sleep” has a much larger effect on “dan.grump” compared to “baby.sleep.” In other words, the number of hours that the dad sleeps has a more substantial impact on his grumpiness than the baby’s sleep duration.

Model Selection

A common problem is trying to figure out which variables to include in a model. There are a variety of strategies to try to solve this:

  • Hierarchical: The experimenter decides the order in which variables are entered into the model and checks whether each addition improves the model fit (e.g., by comparing nested models, or with AIC / BIC).

  • Forced Entry: All predictors are entered simultaneously.

  • Stepwise: Predictors are added or removed based on AIC values automatically.

AIC stands for Akaike information criterion. The AIC value takes the residual sum of squares (which will go down as you add more predictors) and then adds a penalty of 2k, where k is the number of predictors. Lower AIC scores are better. So AIC takes into account that adding predictors always reduces error, by penalizing models with more predictors: a predictor will only be added (if you use this method) if it improves the model by a “substantial” amount.
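A minimal sketch of comparing models by AIC directly in R. Note that the built-in AIC() function uses a likelihood-based formula, so its absolute values differ from the ones step() prints, but for models fit to the same data the ranking is the same; the model names below are just for this sketch:

M_simple <- lm(dan.grump ~ dan.sleep, data = parenthood)
M_both   <- lm(dan.grump ~ dan.sleep + baby.sleep, data = parenthood)
AIC(M_simple, M_both)  #lower AIC indicates the better fit/complexity trade-off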

Hierarchical regression

Hierarchical regression is a technique in which predictors are entered into the regression model in a specific order based on prior research or theoretical considerations. Known predictors are entered first, followed by new predictors in separate steps or blocks. This approach allows the experimenter to assess the unique predictive influence of a new variable on the outcome, as known predictors are held constant in the model.

Advantages of hierarchical regression:

  1. Theory testing: It is useful for testing theoretical predictions about the relative importance of predictor variables.

  2. Unique predictive influence: You can evaluate the unique contribution of a new variable on the outcome, while controlling for the effects of known predictors.

Possible con: The technique relies on the experimenter having a good understanding of the existing research and theory.

#include models in R – note, one must be a subset of the other
M0 <- lm(dan.grump ~ dan.sleep, parenthood)
M1 <- lm(dan.grump ~ dan.sleep + baby.sleep, parenthood) 
#compares two models based on the same dataset
anova(M0, M1)
## Analysis of Variance Table
## 
## Model 1: dan.grump ~ dan.sleep
## Model 2: dan.grump ~ dan.sleep + baby.sleep
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     97 1823.7                           
## 2     96 1823.7  1  0.079478 0.0042 0.9486

Here’s a breakdown of the output:

Res.Df: Residual degrees of freedom. Model 1 has 97, and Model 2 has 96.

RSS: Residual sum of squares. Model 1 has an RSS of 1823.7, and Model 2 also has an RSS of 1823.7.

Df: Difference in degrees of freedom between the two models. In this case, it’s 1 (97-96).

Sum of Sq: Difference in the residual sum of squares between the two models. In this case, it’s 0.079478.

F: F-statistic, which indicates the improvement in the model’s fit when adding the new predictor. In this case, it’s 0.0042.

Pr(>F): p-value associated with the F-statistic. In this case, it’s 0.9486.

The p-value (0.9486) is greater than the commonly used significance level of 0.05, which means that adding the new predictor “baby.sleep” does not significantly improve the model’s ability to predict the outcome variable “dan.grump.” Thus, the addition of “baby.sleep” to the model is not justified based on this analysis.

Forced Entry Regression

Forced entry regression, also known as simultaneous regression or multiple regression, is a method of regression analysis where all predictor variables are entered into the model simultaneously. This approach assumes that all the predictors are equally important and should be considered at the same time.

Advantages of forced entry regression:

  1. It’s straightforward and easy to implement since all variables are included at once.

  2. The results are easily comparable when the same set of predictors is used across different studies or datasets.

  3. If you have strong theoretical reasons for including all the variables, this method can provide a comprehensive view of their combined effect on the outcome variable.

Disadvantages of forced entry regression:

  1. If the predictors are highly correlated (multicollinearity), the results may be unstable and difficult to interpret.

  2. The method doesn’t account for the importance of variables, so it may not be the best choice if some predictors are more important than others.

  3. It relies on the researcher’s knowledge and judgment for selecting the appropriate variables, which can introduce bias.

STEPWISE REGRESSION I

Variables are entered into the model based on mathematical criteria. Computer selects variables in steps.

fullmodel <- lm(dan.grump ~ dan.sleep + 
baby.sleep + day, parenthood)
step(fullmodel, direction="backward")
## Start:  AIC=296.41
## dan.grump ~ dan.sleep + baby.sleep + day
## 
##              Df Sum of Sq    RSS    AIC
## - baby.sleep  1       0.1 1823.3 294.41
## - day         1       0.5 1823.7 294.43
## <none>                    1823.2 296.41
## - dan.sleep   1    4889.9 6713.1 423.45
## 
## Step:  AIC=294.41
## dan.grump ~ dan.sleep + day
## 
##             Df Sum of Sq    RSS    AIC
## - day        1       0.5 1823.7 292.44
## <none>                   1823.3 294.41
## - dan.sleep  1    8044.8 9868.0 459.59
## 
## Step:  AIC=292.44
## dan.grump ~ dan.sleep
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   1823.7 292.44
## - dan.sleep  1    8121.2 9944.9 458.36
## 
## Call:
## lm(formula = dan.grump ~ dan.sleep, data = parenthood)
## 
## Coefficients:
## (Intercept)    dan.sleep  
##     125.815       -8.922

STEPWISE METHODS

Forward:

  • Add the predictor that improves the model most.

  • Then add the predictor that improves the model 2nd most (after the first predictor is already included).

  • May miss an effect that only occurs when another predictor is in the model.

Backward:

  • Include all variables.

  • Exclude the variable whose removal improves AIC most.

Both:

  • Excludes the variable whose removal improves AIC the most.

  • In each step, checks to see whether removing or adding variables improves AIC.

Stepwise methods should be avoided except for exploratory model building. Use backward rather than forward selection to minimize suppressor effects (when a predictor has a significant effect, but only when another variable is held constant).

REPORTING THE MODEL

MULTIPLE REGRESSION STEPS 1 & 2

Step 1: Look at the relationships between all of your variables in a series of scatterplots. This will help you identify any serious deviations from linearity at the beginning. At this point, make a note of variables that look like they might have a problem, but do not transform anything unless it is blatantly nonlinear. Another thing to look for is high levels of multicollinearity; nothing needs to be done at this point, but make a note of it for the future.

Step 2: Create your best model. Don’t check assumptions until you have a model that seems to “fit” well.

MULTIPLE REGRESSION STEP 3

Check assumptions:

  • Look at your residuals plots: check for linearity, normality, and homogeneity of variance of the residuals.

  • Multicollinearity: no VIF greater than 2.5.

  • Independent errors: run the Durbin-Watson test to make sure values are not greater than 3 or less than 1.

  • Check for outliers / influential cases. Influential cases: check Cook’s distance; you shouldn’t have any greater than 1.

WHAT ABOUT CATEGORICAL PREDICTORS?

Frequently, when you want to predict a continuous outcome, you may want to include categorical predictors as well. If you ONLY have a categorical variable predicting a continuous outcome, this is just ANOVA (and you can run it as regression if you want; you get the same results). If you have BOTH continuous and categorical variables predicting a continuous outcome, then you can run this with multiple regression.
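A minimal sketch, assuming a hypothetical categorical column (say, weekend, coded "no"/"yes") had been recorded alongside the continuous predictors; the variable below is invented for illustration and is not part of the parenthood data:

#hypothetical: 'weekend' is NOT in the parenthood data; created here only for illustration
parenthood$weekend <- factor(rep(c("no", "yes"), length.out = 100))
mixed_model <- lm(dan.grump ~ dan.sleep + weekend, data = parenthood)
summary(mixed_model)  #R dummy-codes the factor; its coefficient is the shift in grumpiness for "yes" vs. "no"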