This document was composed from Dr. Snopkowski’s ANTH 504 Week 8 lecture and Danielle Navarro’s 2021 Learning statistics with R, Chapter 5.7 on correlations and Chapter 15 on linear regression.
library(tidyverse)
load("parenthood-1.Rdata")
Measuring Relationships
Scatterplots
Covariance
Pearson’s Correlation Coefficient
Nonparametric measures - Spearman’s Rho
Interpreting Correlations
Causality
Understand linear regression with one predictor
Understand how we assess the fit of a regression model
Running regression model in R
Interpret a regression model
WHAT IS A CORRELATION?
It is a way of measuring the extent to which two variables are related.
It measures the pattern of responses across variables.
Small Relationship
Strong Positive Correlation
Strong Negative Correlation
We need to see whether as one variable increases, the other increases, decreases or stays the same.
This can be done by calculating the Covariance.
We look at how much each score deviates from the mean.
If both variables deviate from the mean by the same amount, they are likely to be related.
Problem: covariance depends upon the units of measurement.
One solution: standardize it!
The standardized version of Covariance is known as the Correlation coefficient.
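In symbols (a standard formulation, where \(s_X\) and \(s_Y\) are the sample standard deviations and \(\bar{X}\), \(\bar{Y}\) the sample means):
\[
\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}), \qquad r = \frac{\mathrm{Cov}(X, Y)}{s_X\, s_Y}
\]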
The correlation coefficient can be represented by
the Greek letter rho \(\rho\) when referring to the population correlation coefficient
the letter “r” when referring to the sample correlation coefficient
Both “\(\rho\)” and “r” represent the strength and direction of a linear relationship between two variables.
cor(x=var1, y=var2)
#enter the data
adverts_watched <- c(5, 4, 4, 6, 8)
packets_bought <- c(8, 9, 10, 13, 15)
#use cor function to calculate correlation coefficient
cor(adverts_watched, packets_bought)
## [1] 0.8711651
The output I got, 0.8711651, is the correlation coefficient (r) between the number of adverts watched and the number of packets bought. This value is positive and close to 1, indicating a strong positive linear relationship between the two variables. In other words, as the number of adverts watched increases, the number of packets bought also tends to increase.
Remember: correlation does not imply causation. While the correlation coefficient suggests a strong linear relationship between the two variables, it does not necessarily mean that watching more advertisements causes people to buy more packets. There might be other factors involved or it could be a mere coincidence. Further research and analysis would be needed to establish a causal relationship.
Note: the order of adverts_watched and
packets_bought doesn’t matter; you’ll get the same value if
you switch the order.
#switch order example
cor(packets_bought, adverts_watched)
## [1] 0.8711651
This function also works with data frames.
as_df <- data.frame(adverts_watched, packets_bought)
It will give you the correlation across all variables as long as there are no missing values and the variables are numeric.
cor(as_df)
## adverts_watched packets_bought
## adverts_watched 1.0000000 0.8711651
## packets_bought 0.8711651 1.0000000
Linearity: The correlation coefficient only measures linear relationships. If the relationship between two variables is non-linear (e.g., quadratic, exponential), the correlation coefficient may not adequately capture the strength of the relationship. By plotting the data, you can visually inspect the relationship and determine if it is linear or not.
Outliers: The correlation coefficient is sensitive to outliers. A few extreme data points can greatly affect the value of the correlation coefficient, leading to a misleading interpretation of the relationship between the variables. By plotting the data, you can identify and address outliers before calculating the correlation coefficient.
Normal distribution: The Pearson correlation coefficient assumes that both variables are normally distributed. If the data is not normally distributed, the correlation coefficient may not be an appropriate measure of the relationship between the variables. Visualizing the data using histograms or Q-Q plots can help you assess the normality of the data.
Homoscedasticity: The Pearson correlation coefficient assumes that the variance of one variable is constant across different values of the other variable (homoscedasticity). If the data exhibits heteroscedasticity (i.e., the variance of one variable changes across different values of the other variable), the correlation coefficient may not be an accurate measure of the relationship. By plotting the data, you can visually inspect for homoscedasticity.
How do we avoid these problems?
Always plot your data to see what it looks like!
The assumptions of correlation include:
The relationship between the variables is linear
The data are normal-ish; this matters particularly for small samples
Limitations of the Correlation Coefficient
How does an infant’s sleeping habits affect a parent’s mood?
Data:
dan.sleep – data on number of hours of sleep of dad
baby.sleep – data on number of hours of sleep of baby
dan.grump – data on grumpiness of dad (0-100; 10 = maximally grumpy)
day – integer (measured from day 1 – 100)
Data in file: parenthood.Rdata
Download data file and then from Rstudio go to:
Open -> choose parenthood.Rdata and it should
load
Some preliminary descriptive statistics
library(psych)
## Warning: package 'psych' was built under R version 4.2.3
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(parenthood)
## vars n mean sd median trimmed mad min max range skew
## dan.sleep 1 100 6.97 1.02 7.03 7.00 1.09 4.84 9.00 4.16 -0.29
## baby.sleep 2 100 8.05 2.07 7.95 8.05 2.33 3.25 12.07 8.82 -0.02
## dan.grump 3 100 63.71 10.05 62.00 63.16 9.64 41.00 91.00 50.00 0.43
## day 4 100 50.50 29.01 50.50 50.50 37.06 1.00 100.00 99.00 0.00
## kurtosis se
## dan.sleep -0.72 0.10
## baby.sleep -0.69 0.21
## dan.grump -0.16 1.00
## day -1.24 2.90
Make a scatterplot of dan.sleep vs. dan.grump
plot(parenthood$dan.sleep, parenthood$dan.grump, xlab = "Dad's Sleep", ylab="Dad's grumpiness")
plot(parenthood$baby.sleep, parenthood$dan.grump, xlab = "Baby's Sleep", ylab="Dad's grumpiness")
Does the relationship appear linear? Yes
hist(parenthood$dan.sleep)
hist(parenthood$dan.grump)
shapiro.test(parenthood$dan.sleep)
##
## Shapiro-Wilk normality test
##
## data: parenthood$dan.sleep
## W = 0.97635, p-value = 0.06878
shapiro.test(parenthood$dan.grump)
##
## Shapiro-Wilk normality test
##
## data: parenthood$dan.grump
## W = 0.97952, p-value = 0.122
sapply(parenthood, length)
## dan.sleep baby.sleep dan.grump day
## 100 100 100 100
Given that our sample size is relatively large, it is probably better to rely on histograms than on shapiro.test
The data are normal enough.
R code to run Pearson’s correlation coefficient (if assumptions are met; which they were in this case)
cor(parenthood$baby.sleep, parenthood$dan.grump)
## [1] -0.5659637
The Pearson correlation coefficient between baby sleep and Dan’s grumpiness is approximately -0.5659637. This indicates a moderate negative linear relationship, meaning that as the baby’s sleep duration increases, Dan’s grumpiness tends to decrease (and vice versa).
cor(parenthood$dan.sleep, parenthood$dan.grump)
## [1] -0.903384
The Pearson correlation coefficient between Dan’s sleep and Dan’s grumpiness is approximately -0.903384. This indicates a strong negative linear relationship, suggesting that as Dan’s sleep duration increases, his grumpiness tends to decrease significantly (and vice versa).
cor(parenthood)
## dan.sleep baby.sleep dan.grump day
## dan.sleep 1.00000000 0.62794934 -0.90338404 -0.09840768
## baby.sleep 0.62794934 1.00000000 -0.56596373 -0.01043394
## dan.grump -0.90338404 -0.56596373 1.00000000 0.07647926
## day -0.09840768 -0.01043394 0.07647926 1.00000000
R code to run non-parametric Spearman’s rank correlation (if assumptions are NOT met)
cor(parenthood$baby.sleep, parenthood$dan.grump, method="spearman")
## [1] -0.5871037
cor(parenthood$dan.sleep, parenthood$dan.grump, method="spearman")
## [1] -0.8870816
cor(parenthood, method="spearman")
## dan.sleep baby.sleep dan.grump day
## dan.sleep 1.00000000 0.644518990 -0.88708165 -0.088967566
## baby.sleep 0.64451899 1.000000000 -0.58710366 0.002916397
## dan.grump -0.88708165 -0.587103665 1.00000000 0.055654351
## day -0.08896757 0.002916397 0.05565435 1.000000000
If you want to know whether the correlation is significant, you can
use cor.test() which will give you a p-value. Notice
the change from cor to cor.test.
What is the null hypothesis of this test?
Ho: \(\rho\) = 0 Ha: \(\rho\) ≠ 0
cor.test(parenthood$dan.sleep, parenthood$dan.grump)
##
## Pearson's product-moment correlation
##
## data: parenthood$dan.sleep and parenthood$dan.grump
## t = -20.854, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9340614 -0.8594714
## sample estimates:
## cor
## -0.903384
cor.test(parenthood$dan.sleep, parenthood$dan.grump, method="spearman", exact=FALSE)
##
## Spearman's rank correlation rho
##
## data: parenthood$dan.sleep and parenthood$dan.grump
## S = 314482, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.8870816
exact=FALSE helps get around ties: if any data points are tied with each other, R cannot compute an exact p-value for the Spearman test, so we ask for the approximate one instead.
cor.test(parenthood$baby.sleep, parenthood$dan.grump, method="spearman", exact=FALSE)
##
## Spearman's rank correlation rho
##
## data: parenthood$baby.sleep and parenthood$dan.grump
## S = 264491, p-value = 1.363e-10
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5871037
H0: \(\rho\) = 0 HA: \(\rho\) \(\neq\) 0
Dad’s hours of sleep were significantly correlated with his
grumpiness, r(98) = -.903, p<0.001.
r(degrees of freedom) = correlation coefficient, p=p-value
If you have missing values, which is relatively common when working
with real data, the cor() function will not work without
the use argument.
use="pairwise.complete.obs": Uses all the data it can to
calculate pairwise correlations.
# Introduce a missing value
parenthood$dan.sleep[10] <- NA
# Calculate correlation using pairwise.complete.obs
cor(parenthood$dan.sleep, parenthood$dan.grump, use = "pairwise.complete.obs")
## [1] -0.903668
By using use = "pairwise.complete.obs", the
cor() function will calculate the correlation coefficient
for Dan’s sleep and his grumpiness using all available pairs of
observations with complete data, ignoring the missing value introduced
in the dataset.
It varies between -1 and +1
It is an effect size
±.1 = small effect
±.3 = medium effect
±.5 = large effect
Coefficient of determination, r2
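As a quick sketch, reusing the adverts example from earlier, the coefficient of determination is simply the squared output of cor():
#coefficient of determination = squared correlation
cor(adverts_watched, packets_bought)^2
#roughly 0.76, i.e. about 76% of the variance in packets bought is shared with adverts watched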
Correlation and Causality
The third-variable problem:In any correlation, causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting the results.
Direction of causality: Correlation coefficients say nothing about which variable causes the other to change
Questions you might ask:
What is the relationship between rain and grass growth?
How does cooking temperature influence the simple sugar energy of balsamroot?
How do surface topography of clavicle bones relate to age of individual?
How does amount of advertising influence sales of music albums?
What is Linear Regression? A way of predicting the value of one variable from another.
It is a hypothetical model of the relationship between two variables.
The model used is a linear one. Therefore, we describe the relationship using the equation of a straight line.
Describing a Straight Line (y=mx+b)
b1
Regression coefficient for the predictor
Gradient (slope) of the regression line
Direction/Strength of Relationship
b0
Intercept (value of Y when X = 0)
Point at which the regression line crosses the Y-axis
\(\epsilon\) is the residual term, the error between each data point and the regression line
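Putting these pieces together, the model for observation \(i\) can be written in the usual simple-regression form implied by the description above:
\[
Y_i = b_0 + b_1 X_i + \epsilon_i
\]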
Intercepts and Gradients
Determine the “Best Fit” Line
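“Best fit” here means the ordinary least squares criterion (the standard approach used by lm()): choose \(b_0\) and \(b_1\) to minimize the sum of squared residuals,
\[
\min_{b_0,\, b_1} \sum_{i=1}^{n}\left(Y_i - b_0 - b_1 X_i\right)^2
\]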
How good is the model?
The regression line is only a model based on the data.
This model might not reflect reality.
How?
R2 Pearson Correlation Coefficient Squared
F statistic
SST: Total variability (variability between scores and the mean).
SSR: Residual/error variability (variability between the regression model and the actual data).
SSM: Model variability (difference in variability between the model and the mean)
Diagram showing from where the regression sums of squares derive
If the model results in better prediction than using the mean, then we expect SSM to be much greater than SSR
Sums of Squares are total values.
They can be expressed as averages (dividing by the df).
These are called Mean Squares, MS
The proportion of variance accounted for by the regression model.
The Pearson Correlation Coefficient Squared
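In symbols, consistent with the sums of squares defined above (with \(df_M\) and \(df_R\) the model and residual degrees of freedom):
\[
R^2 = \frac{SS_M}{SS_T} = 1 - \frac{SS_R}{SS_T}, \qquad F = \frac{MS_M}{MS_R} = \frac{SS_M / df_M}{SS_R / df_R}
\]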
Step 1: Check Assumptions
Still using the parenthood data. How does an infant’s
sleeping habits affect a parent’s mood?
Data:
dan.sleep – data on number of hours of sleep of dad
baby.sleep – data on number of hours of sleep of baby
dan.grump – data on grumpiness of dad (0-100; 10 = maximally grumpy)
day – integer (measured from day 1 – 100)
plot(parenthood$dan.sleep, parenthood$dan.grump, xlab = "Dad's sleep", ylab = "Dad's grumpiness")
abline(lm(parenthood$dan.grump ~ parenthood$dan.sleep), col = "red")
Do it with ggplot:
# Create a scatter plot with a linear regression line
ggplot(data = parenthood, aes(x = dan.sleep, y = dan.grump)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(x = "Dad's sleep", y = "Dad's grumpiness")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
geom_smooth(method = "lm", se = FALSE, color = "red")
adds a linear regression line to the plot.
The method = "lm" argument specifies that we want to fit
a linear model,
se = FALSE means we don’t want to display the confidence
interval around the regression line,
and color = "red" sets the color of the line to red.
Ask: Is the assumption of linearity met? Yes
Ask: Are there any outliers / obvious unusual cases? No
Step 2: Run Regression
#linear regression model
reg_model <- lm(parenthood$dan.grump ~ parenthood$dan.sleep)
#print Outputs the intercept and slope
print(reg_model)
##
## Call:
## lm(formula = parenthood$dan.grump ~ parenthood$dan.sleep)
##
## Coefficients:
## (Intercept) parenthood$dan.sleep
## 125.815 -8.922
You must store the model in a variable. If you just print it without storing it, you won’t get all the items that you need. We use summary() to see the full output.
#alternative (but equal) way to write this code
reg_model2 <- lm(dan.grump ~ dan.sleep, data = parenthood)
print(reg_model2)
##
## Call:
## lm(formula = dan.grump ~ dan.sleep, data = parenthood)
##
## Coefficients:
## (Intercept) dan.sleep
## 125.815 -8.922
b0 is the number under “(Intercept)” and b1 is
the number under parenthood$dan.sleep
summary(reg_model)
##
## Call:
## lm(formula = parenthood$dan.grump ~ parenthood$dan.sleep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9851 -2.1674 -0.3719 2.6265 11.7771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 125.8152 3.0233 41.62 <2e-16 ***
## parenthood$dan.sleep -8.9221 0.4293 -20.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.336 on 97 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8166, Adjusted R-squared: 0.8147
## F-statistic: 431.9 on 1 and 97 DF, p-value: < 2.2e-16
Let’s go through the output of the linear regression model step by step:
Call: This shows the function call used to
create the model, with the dependent variable
parenthood$dan.grump and the independent variable
parenthood$dan.sleep.
Residuals: The summary statistics of the residuals (differences between the observed and predicted values). The minimum, 1st quartile (1Q), median, 3rd quartile (3Q), and maximum values of the residuals are displayed.
Coefficients: This section provides information about the estimated coefficients for the linear regression model.
Estimate: The estimated coefficients for the intercept and the independent variable (parenthood$dan.sleep). In this case, the intercept is 125.8152, and the slope (coefficient) for Dan’s sleep is -8.9221.
\(\text{Grumpiness}_i = 125.82 - 8.92\,\text{Sleep}_i + \epsilon_i\)
If you know the amount of sleep, you can plug that in and predict grumpiness. For every additional hour of sleep, Dan’s grumpiness goes down by about 8.92 points.
Std. Error: The standard errors of the estimated coefficients.
t value: The t-statistic for each coefficient, calculated as the Estimate divided by the Std. Error.
Pr(>|t|): The p-value associated with the t-statistic, used for hypothesis testing. In this case, both p-values are very small (< 2e-16), indicating that the coefficients are significantly different from zero at a 0.05 significance level.
Residual standard error: A measure of the model’s overall accuracy, calculated as the square root of the mean squared error (MSE). In this case, it’s 4.336.
Multiple R-squared: A measure of the proportion of the variance in the dependent variable that is explained by the model. In this case, it’s 0.8166, indicating that the model explains 81.66% of the variance in Dan’s grumpiness.
Adjusted R-squared: A modified version of the R-squared that adjusts for the number of predictors in the model. In this case, it’s 0.8147, which is very close to the Multiple R-squared value, as there is only one predictor.
F-statistic: A statistic that tests the overall significance of the model. The F-statistic is 431.9 in this case, with 1 and 97 degrees of freedom (DF). This is how well the model fits the data.
p-value: The p-value associated with the F-statistic. In this case, it’s < 2.2e-16, indicating that the model is statistically significant at a 0.05 significance level.
In summary, the linear regression model suggests a strong negative relationship between Dan’s sleep and his grumpiness, with each additional hour of sleep associated with an 8.92-point decrease in grumpiness. The model explains 81.66% of the variance in Dan’s grumpiness, and it is statistically significant.
Output: Model Summary
Tips: The details after Coefficients will be most helpful to look at.
There is no need to report the t-value. We care about the
parenthood$dan.sleep row most.
Looks at whether the variance explained by the model (SSM) is significantly greater than the error within the model (SSR).
It tells us whether using the regression model is significantly better at predicting values of the outcome than using the mean.
How to Interpret BETA values
Beta values: the change in the outcome associated with a unit change in the predictor.
Standardized beta values: tell us the same but expressed as standard deviations.
Standardized betas can be computed with the standardCoefs() function (see details below).
b1 = -8.9221: as hours of sleep increase by 1 hour, grumpiness declines by about 8.92 units.
STANDARDIZED BETA VALUES
library(lsr)
standardCoefs(reg_model)
## b beta
## parenthood$dan.sleep -8.922082 -0.903668
The output shows us the original (unstandardized) b value (-8.9221) and the standardized value: -0.903668.
We interpret the standardized beta as: with a 1 standard deviation increase in the independent variable (Dan’s sleep) we see a corresponding decrease of about 0.904 standard deviations in grumpiness.
*This allows us to compare variables in our model that may be on different scales. It tells us which one has a larger effect.
How much grumpiness would you expect Dan to have after sleeping for 4 hours?
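One way to check your answer is with predict() (a minimal sketch, using the reg_model2 object fitted above with the data = parenthood form):
#predicted grumpiness after 4 hours of sleep
predict(reg_model2, newdata = data.frame(dan.sleep = 4))
#roughly 125.8 - 8.92*4, i.e. about 90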
R AND R2
R: The correlation between the observed values of the outcome, and the values predicted by the model.
R2: The proportion of variance “accounted for” by the model. Note: this does not say anything about causality – just association/correlation.
Linear Regression is a model to predict the value of one variable from another.
Multiple Regression is a natural extension of this model:
We use it to predict values of an outcome from several predictors.
It is a hypothetical model of the relationship between several variables.
How does an infant’s sleeping habits affect a parent’s mood?
Data: parenthood
dan.sleep – data on number of hours of sleep of dad
baby.sleep – data on number of hours of sleep of baby
dan.grump – data on grumpiness of dad (0-100; 10 = maximally grumpy)
day – integer (measured from day 1 – 100)
We will add the following predictor variables to our previous example.
Look for linear and non-linear relationships, outliers, clustering or grouping, and variable distributions. Use this to help you determine which variables to include in the analysis.
pairs() in the graphics package: pairs(parenthood)
With multiple regression the relationship is described using a variation of the equation of a straight line.
b0 is the intercept. The intercept is the value of the Y variable when all Xs = 0. This is the point at which the regression plane crosses the Y-axis (vertical).
b1 is the regression coefficient (slope) for variable 1.
b2 is the regression coefficient (slope) for variable 2.
bn is the regression coefficient (slope) for nth variable.
NOTE: These are the slopes when the other variables are controlled for.
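Written out for observation \(i\), the general form implied by the description above is:
\[
Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \epsilon_i
\]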
#install.packages("plotly")
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(data = parenthood,
x = ~dan.sleep,
y = ~dan.grump,
z = ~baby.sleep,
type = "scatter3d",
mode = "markers",
marker = list(size = 5, color = "blue", opacity = 0.8),
text = paste("Day:", parenthood$day))
## Warning: Ignoring 1 observations
Once you create this plot, you can click on it and move it around
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(parenthood, use="complete.obs"), method="number")
regression.2 <- lm(dan.grump ~ dan.sleep + baby.sleep,
data=parenthood)
print(regression.2)
##
## Call:
## lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood)
##
## Coefficients:
## (Intercept) dan.sleep baby.sleep
## 125.83034 -8.94456 0.01756
summary(regression.2)
##
## Call:
## lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0017 -2.1779 -0.3551 2.6247 11.7757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 125.83034 3.04802 41.283 <2e-16 ***
## dan.sleep -8.94456 0.55409 -16.143 <2e-16 ***
## baby.sleep 0.01756 0.27147 0.065 0.949
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.358 on 96 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8166, Adjusted R-squared: 0.8128
## F-statistic: 213.8 on 2 and 96 DF, p-value: < 2.2e-16
Based on the output from the multiple regression analysis, we can make the following interpretations:
The intercept (125.83) represents the expected grumpiness of Dan when both Dan’s sleep and the baby’s sleep are 0 hours. This value is not meaningful in real-life context, but it is a necessary part of the regression equation.
The coefficient for Dan’s sleep (-8.94456) suggests that for every additional hour of sleep Dan gets, his grumpiness is expected to decrease by approximately 8.94 units, holding the baby’s sleep constant. This relationship is highly significant (p-value < 2e-16), indicating that Dan’s sleep has a strong and negative association with his grumpiness.
The coefficient for baby’s sleep (0.01756) indicates that for every additional hour of sleep the baby gets, Dan’s grumpiness is expected to increase by approximately 0.018 units, holding Dan’s sleep constant. However, this relationship is not statistically significant (p-value = 0.949), suggesting that the baby’s sleep has no significant effect on Dan’s grumpiness when accounting for Dan’s sleep.
The results of the multiple regression analysis show that Dan’s sleep has a significant negative effect on his grumpiness, with each additional hour of sleep reducing his grumpiness by approximately 8.94 units (p-value < 2e-16). In contrast, the baby’s sleep does not appear to have a significant effect on Dan’s grumpiness (p-value = 0.949) when accounting for Dan’s sleep. The overall model explains approximately 81.66% of the variance in Dan’s grumpiness (Multiple R-squared = 0.8166, Adjusted R-squared = 0.8128), and it is highly significant (F-statistic = 213.8 on 2 and 96 DF, p-value < 2.2e-16), indicating that the combination of Dan’s sleep and baby’s sleep provides a meaningful explanation of the variation in Dan’s grumpiness.
Variable Type: Outcome must be continuous. Predictors can be continuous or categorical (including dichotomous).
Non-Zero Variance: Predictors must not have zero variance. If this assumption is not met, the corresponding beta value cannot be estimated (the model won’t run).
Additivity: When there are several predictors, their combined effect on the outcome is best described by adding their effects together.
Independence: All values of the outcome variable should come from different observations (e.g., different individuals or independent units).
Fit linear regression model:
model <- lm(column_y ~ column_x1 + column_x2, data = df)
Please replace df, column_y,
column_x1, and column_x2 with the appropriate
data frame and column names for your specific analysis.
Check the Q-Q plot of the residuals (normality): plot(model, which = 2)
Check your scatterplot matrix: pairs(df)
Check residuals plots.
Residuals vs. fitted values plot: `plot(model, which = 1)`
Check correlation matrix or variance inflation factors (VIFs) for multicollinearity.
`library(car)` & `vif(model)`
Check to make sure there aren’t any patterns or trends in the residuals plots (see the residuals vs. fitted plot above).
Ensure that there are no extreme outliers or influential points.
Check the Durbin-Watson test for residuals independence.
`library(car)` & `durbinWatsonTest(model)`
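As a concrete sketch, here is the VIF check from the list above applied to the two-predictor model fitted earlier (assuming regression.2 from above; values near 1 indicate little multicollinearity):
library(car)
#variance inflation factors for the two-predictor model
vif(regression.2)
#with two predictors, both VIFs equal 1/(1 - r^2) for the correlation between them – roughly 1.7 here, well below common cut-offs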
Many of our assumptions revolve around residuals.
Ordinary residuals – actual, raw residuals: the difference between the observed value \(Y_i\) and the fitted value \(\hat{Y}_i\). A downside of ordinary residuals is that they will always have a different standard deviation for each regression model.
Standardized residuals – ordinary residuals that have been standardized to have a standard deviation = 1
Studentized residuals (or jackknifed residuals) – a standardized version of the residuals that you would have obtained if you had deleted the ith observation from the dataset.
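A brief sketch of how to extract each type from a fitted model in base R (shown here on the regression.2 model from above):
#ordinary (raw) residuals
head(residuals(regression.2))
#standardized residuals (rescaled to have standard deviation 1)
head(rstandard(regression.2))
#studentized (jackknifed) residuals
head(rstudent(regression.2))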
Checking Normality of residuals
Step 1: Plot the residuals
hist(x = residuals(regression.2), xlab="Value of residuals", breaks = 20)
This plot compares the quantiles of two data sets. When we use it to compare against a normal distribution, we can see deviations from normality. If the data are perfectly normal, the points will fall exactly on the diagonal line.
qqnorm(residuals(regression.2))
qqline(residuals(regression.2))
We need to check that the residuals are approximately 0 for all values of fitted values – this shows us that our model is in fact linear. Any curvature in our plot suggests that there are points of our regression line that tend to fall above or below our points. The easiest way to see this plot is:
plot(regression.2, which = 1)
The variance of the residuals is assumed to be constant. We will use the default plot that R provides to examine this assumption:
plot(regression.2, which =3)
Shows the fitted values (model predictions) against the square root of the absolute value of standardized residuals. This plot is used to diagnose violations of homogeneity of variance. If the variance is really constant, then the line through the middle should be horizontal and flat.
Normality – the residuals need to be normally distributed (note: the independent and dependent variables don’t need to be normal)
Linearity – the relationship between x’s and y need to be linear – 1) check your scatterplot matrix, 2) check residuals
Homogeneity of variance – the standard deviation of the residuals is the same for all values of \(\hat{Y}\) and all values of X. You can check all of these assumptions with one R command:
plot(regression.2)
Outlier – check your plot 3 which shows your residuals. If there’s one (or more) that seem outside the range of the others, you might want to look into it more.
Influential points – a point that significantly impacts the regression line. You can see this in plot 4, which examines leverage – simply the amount of influence a particular point has over the regression line.
We will examine influential points through a measure of Cook’s distance. If you’d like to see the calculation, check your book (section 15.9.2). Weisberg (1982) suggests that absolute values greater than one may be cause for concern. Others suggest 4/N. To get Cook’s distance, do:
cooks.distance(regression.2) #this will output a value for each data point
## 1 2 3 4 5 6
## 1.702185e-03 1.742778e-02 4.948815e-03 1.271730e-03 2.845249e-04 6.868045e-05
## 7 8 9 11 12 13
## 2.976950e-05 5.261590e-04 4.503120e-05 2.748718e-02 8.184516e-03 7.167866e-04
## 14 15 16 17 18 19
## 1.553015e-04 2.454317e-03 4.034285e-05 3.610709e-03 2.115542e-03 1.545613e-04
## 20 21 22 23 24 25
## 1.756713e-02 7.617503e-05 1.690731e-03 3.022426e-02 3.129870e-05 2.416161e-04
## 26 27 28 29 30 31
## 8.476893e-04 5.741160e-02 7.689597e-04 5.440287e-04 2.541346e-03 5.176332e-02
## 32 33 34 35 36 37
## 6.133860e-03 2.776889e-02 5.423430e-02 2.151538e-02 4.585454e-02 1.745798e-02
## 38 39 40 41 42 43
## 4.833800e-02 2.211607e-03 6.305783e-05 4.556122e-05 1.081720e-02 2.389440e-02
## 44 45 46 47 48 49
## 3.362380e-02 1.397640e-02 1.769445e-02 1.593165e-02 4.380215e-03 8.207379e-03
## 50 51 52 53 54 55
## 2.594572e-04 3.685649e-02 8.219271e-04 1.905647e-03 2.327857e-03 4.377664e-03
## 56 57 58 59 60 61
## 1.348292e-03 1.784234e-03 4.620543e-04 1.212277e-03 3.121308e-03 1.566109e-03
## 62 63 64 65 66 67
## 3.895826e-04 1.334338e-04 1.122386e-01 2.839275e-02 1.151948e-02 1.216359e-03
## 68 69 70 71 72 73
## 2.307132e-03 2.086943e-04 1.041284e-02 2.795991e-03 1.462321e-03 4.417124e-04
## 74 75 76 77 78 79
## 2.898103e-03 2.657275e-03 4.768815e-04 1.429810e-02 4.252532e-02 4.282003e-03
## 80 81 82 83 84 85
## 3.903289e-03 2.957952e-02 1.627924e-04 2.228364e-03 1.510660e-04 9.314894e-02
## 86 87 88 89 90 91
## 3.034187e-04 3.451490e-04 3.933008e-05 3.575125e-04 7.341481e-04 4.995594e-03
## 92 93 94 95 96 97
## 4.384140e-03 1.175413e-02 4.613718e-03 1.708832e-02 2.215978e-04 1.003870e-02
## 98 99 100
## 1.263797e-02 2.381551e-02 8.188032e-03
max(cooks.distance(regression.2)) # will give you the max value
## [1] 0.1122386
This just gives you the maximum of all your data points and you just want it less than 1. It is a problem if it is greater than one.
Tests independence of errors. It tests for autocorrelation in the residuals – basically, are residuals that are “close” to each other correlated with one another?
We are likely to see this if we had a dataset that has a time component – like stock price; Yesterday’s stock price is correlated with today’s stock price.
The values from this test range from 0 to 4; 2 means there is no autocorrelation. Values between 0 and 2 indicate positive autocorrelation; values between 2 and 4 indicate negative autocorrelation.
Durbin-Watson test is sensitive to the order of the data, so it’s essential to ensure that the data is sorted correctly before running the test.
Values <1 or >3 may be problematic; Closer to 2 is best!
Use the function durbinWatsonTest(model) from the
car package. (library(car)) dwt()
also works.
durbinWatsonTest(regression.2)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.07972016 2.139017 0.496
## Alternative hypothesis: rho != 0
Here’s how to interpret the Durbin-Watson test output:
Test statistic close to 2: This indicates no autocorrelation in the residuals. The residuals are independent, which is a desirable property in a linear regression model.
Test statistic significantly less than 2: This suggests positive autocorrelation. Positive autocorrelation occurs when a positive error (residual) in one observation increases the likelihood of a positive error in the next observation.
Test statistic significantly greater than 2: This indicates negative autocorrelation. Negative autocorrelation occurs when a positive error (residual) in one observation increases the likelihood of a negative error in the next observation.
In this Durbin-Watson test output:
lag is the order of autocorrelation tested, which is
1 in this case. It tests the correlation between the residuals at time t
and time t-1.
Autocorrelation is the estimated autocorrelation at
the specified lag (in this case, -0.07972016 for lag 1). It’s a
relatively small value, close to 0, indicating little to no
autocorrelation.
D-W Statistic is the Durbin-Watson test statistic,
which is 2.139017. Since it is close to 2, it suggests that there is no
significant autocorrelation in the residuals.
p-value is the probability of observing the test
statistic under the null hypothesis of no autocorrelation. In this case,
the p-value is 0.496, which is relatively large. A large p-value
indicates that there is insufficient evidence to reject the null
hypothesis of no autocorrelation.
Based on this output, I can conclude that there is no significant autocorrelation in the residuals at lag 1, and the assumption of independent residuals appears to be met for the linear regression model.
INTERPRETING OUR MODEL
If Dan got 4 hours of sleep last night and baby slept for 5 hours, how much grumpiness do we expect from Dan today?
Grumpiness = 125.83 – 8.945(4) + 0.018(5)
Grumpiness = 125.83 – 35.78 + 0.09
Grumpiness ≈ 90.14
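The same prediction can be obtained directly with predict() (a quick check using the regression.2 model fitted above):
#predicted grumpiness when Dan sleeps 4 hours and the baby sleeps 5 hours
predict(regression.2, newdata = data.frame(dan.sleep = 4, baby.sleep = 5))
#about 90.1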
RECALL: STANDARDIZED BETA VALUES? The standardCoefs()
function from the lsr package computes standardized
regression coefficients, also known as beta coefficients. These
coefficients represent the strength and direction of the relationship
between the predictor variables and the outcome variable, but they are
standardized to be on the same scale, allowing for direct comparison
between predictors.
library(lsr)
standardCoefs(regression.2)
## b beta
## dan.sleep -8.9445644 -0.905945167
## baby.sleep 0.0175595 0.003630047
How do we interpret these results? Which variable has a larger effect?
dan.sleep: The beta coefficient is -0.905945167,
which means that for every 1 standard deviation increase in “dan.sleep,”
there is a 0.906 standard deviation decrease in “dan.grump.” The
negative sign indicates an inverse relationship.
baby.sleep: The beta coefficient is 0.003630047,
which means that for every 1 standard deviation increase in
“baby.sleep,” there is a 0.0036 standard deviation increase in
“dan.grump.” The positive sign indicates a direct relationship.
As Dan’s sleep increased by 1 standard deviation, Dan’s grumpiness decreased by about 0.9 of a standard deviation. In contrast, as baby’s sleep increased by 1 standard deviation, Dan’s grumpiness increased by only about 0.004 of a standard deviation.
Comparing the two beta coefficients, the absolute value of -0.905945167 (for “dan.sleep”) is much larger than the absolute value of 0.003630047 (for “baby.sleep”). This suggests that “dan.sleep” has a much larger effect on “dan.grump” compared to “baby.sleep.” In other words, the number of hours that the dad sleeps has a more substantial impact on his grumpiness than the baby’s sleep duration.
A common problem is trying to figure out which variables to include in a model. There are a variety of strategies to try to solve this:
Hierarchical: Experimenter decides the order in which variables are entered into the model and checks the model fit using AIC / BIC
Forced Entry: All predictors are entered simultaneously.
Stepwise: Predictors are added or removed based on AIC values automatically.
AIC stands for Akaike information criterion. The AIC
value takes the residual sum of squares (which will go down as you add
more predictors) and then adds 2k – where k is the number
of predictors. Lower AIC scores are better. So, AIC takes into account
that as you add predictors, you get a model with less error by
penalizing models with more predictors. So, a predictor will only be
added (if you use this method) if it improves the model by a
“substantial” amount.
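For the Gaussian linear models that step() compares, the criterion has the form below (up to an additive constant that does not affect comparisons), where \(SS_{res}\) is the residual sum of squares and \(K\) counts the estimated coefficients, including the intercept:
\[
AIC = n \ln\!\left(\frac{SS_{res}}{n}\right) + 2K
\]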
Hierarchical regression is a technique in which predictors are entered into the regression model in a specific order based on prior research or theoretical considerations. Known predictors are entered first, followed by new predictors in separate steps or blocks. This approach allows the experimenter to assess the unique predictive influence of a new variable on the outcome, as known predictors are held constant in the model.
Advantages of hierarchical regression:
Theory testing: It is useful for testing theoretical predictions about the relative importance of predictor variables.
Unique predictive influence: You can evaluate the unique contribution of a new variable on the outcome, while controlling for the effects of known predictors.
Possible con: The technique relies on the experimenter having a good understanding of the existing research and theory.
#include models in R – note, one must be a subset of the other
M0 <- lm(dan.grump ~ dan.sleep, parenthood)
M1 <- lm(dan.grump ~ dan.sleep + baby.sleep, parenthood)
#compares two models based on the same dataset
anova(M0, M1)
## Analysis of Variance Table
##
## Model 1: dan.grump ~ dan.sleep
## Model 2: dan.grump ~ dan.sleep + baby.sleep
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 97 1823.7
## 2 96 1823.7 1 0.079478 0.0042 0.9486
Here’s a breakdown of the output:
Res.Df: Residual degrees of freedom. Model 1 has 97, and Model 2 has 96.
RSS: Residual sum of squares. Model 1 has an RSS of 1823.7, and Model 2 also has an RSS of 1823.7.
Df: Difference in degrees of freedom between the two models. In this case, it’s 1 (97-96).
Sum of Sq: Difference in the residual sum of squares between the two models. In this case, it’s 0.079478.
F: F-statistic, which indicates the improvement in the model’s fit when adding the new predictor. In this case, it’s 0.0042.
Pr(>F): p-value associated with the F-statistic. In this case, it’s 0.9486.
The p-value (0.9486) is greater than the commonly used significance level of 0.05, which means that adding the new predictor “baby.sleep” does not significantly improve the model’s ability to predict the outcome variable “dan.grump.” Thus, the addition of “baby.sleep” to the model is not justified based on this analysis.
Forced entry regression, also known as simultaneous regression or multiple regression, is a method of regression analysis where all predictor variables are entered into the model simultaneously. This approach assumes that all the predictors are equally important and should be considered at the same time.
Advantages of forced entry regression:
It’s straightforward and easy to implement since all variables are included at once.
The results are easily comparable when the same set of predictors is used across different studies or datasets.
If you have strong theoretical reasons for including all the variables, this method can provide a comprehensive view of their combined effect on the outcome variable.
Disadvantages of forced entry regression:
If the predictors are highly correlated (multicollinearity), the results may be unstable and difficult to interpret.
The method doesn’t account for the importance of variables, so it may not be the best choice if some predictors are more important than others.
It relies on the researcher’s knowledge and judgment for selecting the appropriate variables, which can introduce bias.
Variables are entered into the model based on mathematical criteria. Computer selects variables in steps.
fullmodel <- lm(dan.grump ~ dan.sleep +
baby.sleep + day, parenthood)
step(fullmodel, direction="backward")
## Start: AIC=296.41
## dan.grump ~ dan.sleep + baby.sleep + day
##
## Df Sum of Sq RSS AIC
## - baby.sleep 1 0.1 1823.3 294.41
## - day 1 0.5 1823.7 294.43
## <none> 1823.2 296.41
## - dan.sleep 1 4889.9 6713.1 423.45
##
## Step: AIC=294.41
## dan.grump ~ dan.sleep + day
##
## Df Sum of Sq RSS AIC
## - day 1 0.5 1823.7 292.44
## <none> 1823.3 294.41
## - dan.sleep 1 8044.8 9868.0 459.59
##
## Step: AIC=292.44
## dan.grump ~ dan.sleep
##
## Df Sum of Sq RSS AIC
## <none> 1823.7 292.44
## - dan.sleep 1 8121.2 9944.9 458.36
##
## Call:
## lm(formula = dan.grump ~ dan.sleep, data = parenthood)
##
## Coefficients:
## (Intercept) dan.sleep
## 125.815 -8.922
Forward:
Add the predictor that improves the model most.
Then add the predictor that improves the model 2nd most (after the first predictor is already included).
May miss an effect that only occurs when another predictor is in the model.
Backward:
Include all variables.
Exclude the variable whose removal improves the AIC most.
Both:
Excludes the variable whose removal improves the AIC the most.
In each step, checks to see whether removing or adding variables improves the AIC.
Stepwise should be avoided except for exploratory model building. Use backward rather than forward to minimize suppressor effects (when a predictor has a significant effect but only when another variable is held constant).
Look at the relationship between all of your variables in a series of scatter plots. This will help you identify any serious deviations from linearity at the beginning. At this point, make a note of variables that look like they might have a problem, but do not transform things unless they are blatantly nonlinear. Another thing to look for is high levels of multicollinearity. Nothing needs to be done at this point, but make a note of it for the future. Create your best model. Don’t check assumptions until you have a model that seems to “fit” well.
Check assumptions:
Look at your residuals plots. Check for linearity, normality, and homogeneity of variance of residuals.
Multicollinearity – no VIF greater than 2.5.
Independent errors – run the D-W test to make sure values are not greater than 3 or less than 1.
Check for outliers / influential cases. Influential cases: check Cook’s distance – you shouldn’t have any value greater than 1.
Frequently, when you want to predict a continuous outcome, you may want to include categorical variables as well. If you ONLY have a categorical variable predicting a continuous outcome, this is just ANOVA (and you can run it as regression, if you want – you get the same results). If you have BOTH continuous and categorical variables predicting a continuous outcome, then you can run this with multiple regression.
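As a hypothetical sketch (the weekend variable below is invented for illustration and is not part of the parenthood data), lm() handles a categorical predictor automatically by turning the factor into dummy (indicator) variables:
#hypothetical categorical predictor: flag days 6, 7, 13, 14, ... as weekend days
parenthood$weekend <- factor(ifelse(parenthood$day %% 7 %in% c(0, 6), "weekend", "weekday"))
#mix a continuous and a categorical predictor in one multiple regression
summary(lm(dan.grump ~ dan.sleep + weekend, data = parenthood))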