Correlation and Linear Regression

RPubs Link

https://rpubs.com/zaidyousif/1374917

Learning Objectives

Use paired data to find the value of the correlation coefficient.
Determine whether there is sufficient evidence to support a conclusion that there is a correlation between two variables.
Use paired sample data to find the equation of the regression line.
Find the best predicted value of a variable given some value of the other variable.

Sources

Rosner, Bernard. Fundamentals of Biostatistics. Cengage Learning, 2017
Triola, Mario F., and Laura Iossi. Elementary Statistics. Pearson, 2018

Load Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
library(cowplot)
## Warning: package 'cowplot' was built under R version 4.5.2
## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp

Load data

setwd("E:/Biostat and Study Design/204/Lectures/Data")
NHANES_df <- openxlsx::read.xlsx('NHEFS.xlsx')

Correlation

Correlation exists between two variables when the values of one variables are somehow associated with the values of the other variable.

Linear correlation exists between two variables when there is a correlation and the plotted points of paired data results in a pattern that can be approximated by a straight line.

image title

Figure (a): Distinct straight-line, or linear, pattern. We say that there is a positive linear correlation between x and y, since as the x values increase, the corresponding y values also increase.
Figure (b): Distinct straight-line, or linear pattern. We say that there is a negative linear correlation between x and y, since as the x values increase, the corresponding y values decrease.
Figure (c): No distinct pattern, which suggests that there is no correlation between x and y.
Figure (d): Distinct pattern suggesting a correlation between x and y, but the pattern is not that of a straight line.

The linear correlation coefficient \(r\) (Pearson’s correlatoin) measures the strength of the linear correlation between the paired quantitative x values and y values in a sample. The linear correlation coefficient r.

\[r={\frac{\mathrm{cov}(X, Y)}{s_X s_Y}}={\frac{\sum{z_xz_y}}{n-1}}\] where \(\mathrm{cov}(X, Y)\) denotes the covariance of \(X\) and \(Y\), \(s\) denotes standard deviation, \(z_x\) denotes the \(z\) score for an individual sample value \(x\) and \(z_y\) is the \(z\) score for the corresponding sample value \(y\).

The following assumptions must be met for correlation coefficient (Pearson’s correlation) to return a valid result:

The sample of paired (x, y) data are continuous and have normal distribution.
Visual examination of the scatter plot must confirm that the points approximate a straight-line pattern.
Because results can be strongly affected by the presence of outliers, any outliers must be removed if they are known to be errors.

Properties of the linear correlation coefficient \(r\):

The value of \(r\) is always between -1 and 1.
A positive \(r\) value indicates that the two variables have a direct relationship. A negative r value indicates that the two variables have an inverse relationship.

image title

If all values of either variable are converted to a different scale, the value of \(r\) does not change.
The value of \(r\) is not affected by the choice of x or y. Interchange all x values and y values, and the value of \(r\) will not change.
\(r\) measures the strength of a linear relationship. It is not designed to measure the strength of a relationship that is not linear.
\(r\) is very sensitive to outliers in the sense that a single outlier could dramatically affect its value.

Use the P-value and significance level \(\alpha\) as follows:

P-Value ≤ \(\alpha\): Supports the claim of a linear correlation.
P-Value > \(\alpha\): Does not support the claim of a linear correlation

Example: Using data from the NHANES study, conduct a formal hypothesis test of the claim that there is a linear correlation between weight in 1971 and weight in 1982.

\({H_0}: \rho = 0\) (No correlation)

\({H_1}: \rho \neq\ 0\) (Correlation)

We start with a scatterplot to confirm that the points approximate a straight-line pattern.

NHANES_df %>% ggplot(aes(x=wt71, y=wt82)) + 
    geom_point()+
  theme_light()

## Warning: Removed 63 rows containing missing values or values outside the scale range
## (`geom_point()`).

The scatterplot reveals a straight-line pattern. Next, let’s evaluate the distribution of the data.

wt_71_hist <- NHANES_df %>% ggplot(aes(x=wt71)) +
    geom_histogram( bins=40, fill="deepskyblue", color="black") + 
    theme_light()
wt_82_hist <- NHANES_df %>% ggplot(aes(x=wt82)) +
    geom_histogram( bins=40, fill="deepskyblue", color="black") + 
    theme_light()

wt_71_qq <- NHANES_df %>% ggplot(aes(sample = wt71)) +
  stat_qq_line(size=2,aes(color='red'))+
  stat_qq(size=2) +
  theme_light()

wt_82_qq <- NHANES_df %>% ggplot(aes(sample = wt82)) +
  stat_qq_line(size=2,aes(color='red'))+
  stat_qq(size=2) +
  theme_light()

plot_grid(wt_71_hist,wt_82_hist,wt_71_qq,wt_82_qq)

Perform Shapiro-Wilk test

shapiro.test(NHANES_df$wt71)

## 
##  Shapiro-Wilk normality test
## 
## data:  NHANES_df$wt71
## W = 0.95888, p-value < 2.2e-16

shapiro.test(NHANES_df$wt82)

## 
##  Shapiro-Wilk normality test
## 
## data:  NHANES_df$wt82
## W = 0.96927, p-value < 2.2e-16

The data appears to be normally distributed. We proceed with calculating the correlation coefficient (Pearson’s correlation).

cor.test(NHANES_df$wt71,NHANES_df$wt82,method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  NHANES_df$wt71 and NHANES_df$wt82
## t = 71.819, df = 1564, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8639271 0.8870210
## sample estimates:
##       cor 
## 0.8759751

Interpretation: Since the P-Value ≤ 0.05, we reject the null hypothesis and conclude that there is sufficient evidence to support the claim of a linear correlation in weight between 1971 and 1982. The correlation appears to be strong at \(r\) = 0.83.

Spearman’s rank correlation test

Spearman’s rank correlation test is a nonparametric test that uses ranks of sample data consisting of matched pairs. It is used to test for an association between two variables. Unlike Pearson’s correlation, Spearman’s rank correlation test does not require a normal distribution for any population. Spearman’s rank correlation test can also be used to detect monotonic relationships. In a monotonic relationship, the variables tend to move in the same relative direction, but not necessarily at a constant rate.

image title

Example: Using observations from the NHANES study, conduct a formal hypothesis test of the claim that there is a linear correlation between weight in 1971 and weight in 1982.

\({H_0}: \rho_{s} = 0\) (No correlation)

\({H_1}: \rho_{s} \neq\ 0\) (Correlation)

NHANES_df %>% ggplot(aes(x=wt71, y=wt82)) + 
    geom_point()+
  theme_light()

## Warning: Removed 63 rows containing missing values or values outside the scale range
## (`geom_point()`).

The scatterplot reveals a straight-line pattern. Next, let’s evaluate the distribution of the data.

wt_hist <- NHANES_df %>% ggplot(aes(x=wt71)) +
    geom_histogram( bins=40, fill="deepskyblue", color="black") + 
    theme_light()
wt_hist <- NHANES_df %>% ggplot(aes(x=wt82)) +
    geom_histogram( bins=40, fill="deepskyblue", color="black") + 
    theme_light()

wt_qq <- NHANES_df %>% ggplot(aes(sample = wt71)) +
  stat_qq_line(size=2,aes(color='red'))+
  stat_qq(size=2) +
  theme_light()

wt_qq <- NHANES_df %>% ggplot(aes(sample = wt82)) +
  stat_qq_line(size=2,aes(color='red'))+
  stat_qq(size=2) +
  theme_light()

plot_grid(wt_hist,wt_hist,wt_qq,wt_qq)

Perform Shapiro-Wilk test to assess normality

shapiro.test(NHANES_df$wt71)

## 
##  Shapiro-Wilk normality test
## 
## data:  NHANES_df$wt71
## W = 0.95888, p-value < 2.2e-16

shapiro.test(NHANES_df$wt82)

## 
##  Shapiro-Wilk normality test
## 
## data:  NHANES_df$wt82
## W = 0.96927, p-value < 2.2e-16

Since both variables are right-skewed and the Shapiro-Wilk test P-Value is < 0.05 for both variables, it is not appropriate to use Pearson’s correlation test and we should use instead Spearman’s rank correlation test.

cor.test(NHANES_df$wt71,NHANES_df$wt82,method = 'spearman')

## 
##  Spearman's rank correlation rho
## 
## data:  NHANES_df$wt71 and NHANES_df$wt82
## S = 81118509, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8732651

Common Mistakes in Interpreting Correlation

Assuming that correlation implies causality. Correlation and causation are often confused because humans fabricate these patterns when two variables appear to be so closely associated that one is dependent on the other. Causation is confirmed by running robust studies and clinical trials.

image title

Using data based on averages. Averages suppress individual variation and may inflate the correlation coefficient.
If there is no linear correlation, there might be some other correlation that is not linear.

Linear Regression

The regression line (or line of best fit) is the straight line that “best” fits the scatter plot of the data.

\[\hat{y} = {b}_0 + b_1 x\]

where \(\hat{y}\) is the dependent variable, \(b_0\) is the y-intercept, \(b_1\) is the slope/coefficients, and \(x\) is the independent variable. The following are the requirements for simple linear regression:

The sample of paired (x, y) data is a random sample of quantitative data.
Visual examination of the scatter plot shows that the points approximate a straight-line pattern.
No significant outliers. Outliers can have a strong effect on the regression equation, so remove any outliers if they are known to be errors.

A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible.

image title
Source : https://dustinstansbury.github.io/theclevermachine/cutting-your-losses

Example: Using NHANES data, predict males weight in 1982 using weight in 1971

We start by filtering the NHANES data for males.

NHANES_df_males <- NHANES_df %>% filter(sex=='Male')

Next, we evaluate the scatter plot of weight 1971 and weight 1982.

NHANES_df_males %>% ggplot(aes(x=wt71, y=wt82)) + 
    geom_point()+
    geom_smooth(method=lm, se=FALSE)+
    theme_light()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

We can see that the regression line fits the points well, but the points are not very close to the line. We proceed with constructing the linear regression model.

\({H_0}: \beta_1 = 0\)

\({H_1}: \beta_1 \neq\ 0\)

linearMod <- lm(wt82 ~ wt71 , data=NHANES_df_males)  # construct a simple linear regression model

To evaluate the model, we use the summary command.

summary(linearMod)

## 
## Call:
## lm(formula = wt82 ~ wt71, data = NHANES_df_males)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.833  -4.093  -0.232   3.775  45.638 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.06163    1.54488   5.866 6.68e-09 ***
## wt71         0.91925    0.01968  46.714  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.409 on 760 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.7417, Adjusted R-squared:  0.7413 
## F-statistic:  2182 on 1 and 760 DF,  p-value: < 2.2e-16

Interpretation: Since the P-Value ≤ 0.05, we reject the null hypothesis and conclude that weight in 1971 is a predictor of weight in 1982.

We can use the model to predict weight in 1982 using weight in 1971.

predict(linearMod,newdata = data.frame(wt71 = c(70)),interval = 'prediction')

##        fit      lwr     upr
## 1 73.40915 58.85251 87.9658

The predicted weight is 73.4 kg. A prediction interval is a range that likely contains the value of the dependent variable for a single new observation given specific values of the independent variable(s). We can be 95% confident that the weight of the next subject will be within 58.9 kg to 87.9 kg.

predict(linearMod,newdata = data.frame(wt71 = c(70)),interval = 'confidence')

##        fit      lwr      upr
## 1 73.40915 72.81132 74.00699

The average weight 73.4 kg with a 95% confidence interval ranging between 72.8 and 74 kg.

\(R^2\), or coefficient of determination, is a value between 0 and 1 that measures how well our regression line fits our data. \(R^2\) can be interpreted as the percent of variance in our dependent variable that can be explained by our model. The \(R^2\) for our model is 0.742. This means that 74.2% of the variation in 1982 weight can be explained by the 1971 weight.

Correlation and Linear Regression

Zaid Yousif, PharmD, MAS

11/27/2023

RPubs Link

Learning Objectives

Sources

Load Libraries

Load data

Correlation

Spearman’s rank correlation test

Common Mistakes in Interpreting Correlation

Linear Regression