library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
library(cowplot)
## Warning: package 'cowplot' was built under R version 4.5.2
##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:lubridate':
##
## stamp
setwd("E:/Biostat and Study Design/204/Lectures/Data")
NHANES_df <- openxlsx::read.xlsx('NHEFS.xlsx')
Correlation exists between two variables when the values of one variables are somehow associated with the values of the other variable.
Linear correlation exists between two variables when there is a correlation and the plotted points of paired data results in a pattern that can be approximated by a straight line.
The linear correlation coefficient \(r\) (Pearson’s correlatoin) measures the strength of the linear correlation between the paired quantitative x values and y values in a sample. The linear correlation coefficient r.
\[r={\frac{\mathrm{cov}(X, Y)}{s_X s_Y}}={\frac{\sum{z_xz_y}}{n-1}}\] where \(\mathrm{cov}(X, Y)\) denotes the covariance of \(X\) and \(Y\), \(s\) denotes standard deviation, \(z_x\) denotes the \(z\) score for an individual sample value \(x\) and \(z_y\) is the \(z\) score for the corresponding sample value \(y\).
The following assumptions must be met for correlation coefficient (Pearson’s correlation) to return a valid result:
Properties of the linear correlation coefficient \(r\):
Use the P-value and significance level \(\alpha\) as follows:
Example: Using data from the NHANES study, conduct a formal hypothesis test of the claim that there is a linear correlation between weight in 1971 and weight in 1982.
\({H_0}: \rho = 0\) (No correlation)
\({H_1}: \rho \neq\ 0\) (Correlation)
We start with a scatterplot to confirm that the points approximate a straight-line pattern.
NHANES_df %>% ggplot(aes(x=wt71, y=wt82)) +
geom_point()+
theme_light()
## Warning: Removed 63 rows containing missing values or values outside the scale range
## (`geom_point()`).
The scatterplot reveals a straight-line pattern. Next, let’s evaluate the distribution of the data.
wt_71_hist <- NHANES_df %>% ggplot(aes(x=wt71)) +
geom_histogram( bins=40, fill="deepskyblue", color="black") +
theme_light()
wt_82_hist <- NHANES_df %>% ggplot(aes(x=wt82)) +
geom_histogram( bins=40, fill="deepskyblue", color="black") +
theme_light()
wt_71_qq <- NHANES_df %>% ggplot(aes(sample = wt71)) +
stat_qq_line(size=2,aes(color='red'))+
stat_qq(size=2) +
theme_light()
wt_82_qq <- NHANES_df %>% ggplot(aes(sample = wt82)) +
stat_qq_line(size=2,aes(color='red'))+
stat_qq(size=2) +
theme_light()
plot_grid(wt_71_hist,wt_82_hist,wt_71_qq,wt_82_qq)
Perform Shapiro-Wilk test
shapiro.test(NHANES_df$wt71)
##
## Shapiro-Wilk normality test
##
## data: NHANES_df$wt71
## W = 0.95888, p-value < 2.2e-16
shapiro.test(NHANES_df$wt82)
##
## Shapiro-Wilk normality test
##
## data: NHANES_df$wt82
## W = 0.96927, p-value < 2.2e-16
The data appears to be normally distributed. We proceed with calculating the correlation coefficient (Pearson’s correlation).
cor.test(NHANES_df$wt71,NHANES_df$wt82,method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: NHANES_df$wt71 and NHANES_df$wt82
## t = 71.819, df = 1564, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8639271 0.8870210
## sample estimates:
## cor
## 0.8759751
Interpretation: Since the P-Value ≤ 0.05, we reject the null hypothesis and conclude that there is sufficient evidence to support the claim of a linear correlation in weight between 1971 and 1982. The correlation appears to be strong at \(r\) = 0.83.
Spearman’s rank correlation test is a nonparametric test that uses ranks of sample data consisting of matched pairs. It is used to test for an association between two variables. Unlike Pearson’s correlation, Spearman’s rank correlation test does not require a normal distribution for any population. Spearman’s rank correlation test can also be used to detect monotonic relationships. In a monotonic relationship, the variables tend to move in the same relative direction, but not necessarily at a constant rate.
Example: Using observations from the NHANES study, conduct a formal hypothesis test of the claim that there is a linear correlation between weight in 1971 and weight in 1982.
\({H_0}: \rho_{s} = 0\) (No correlation)
\({H_1}: \rho_{s} \neq\ 0\) (Correlation)
NHANES_df %>% ggplot(aes(x=wt71, y=wt82)) +
geom_point()+
theme_light()
## Warning: Removed 63 rows containing missing values or values outside the scale range
## (`geom_point()`).
The scatterplot reveals a straight-line pattern. Next, let’s evaluate the distribution of the data.
wt_hist <- NHANES_df %>% ggplot(aes(x=wt71)) +
geom_histogram( bins=40, fill="deepskyblue", color="black") +
theme_light()
wt_hist <- NHANES_df %>% ggplot(aes(x=wt82)) +
geom_histogram( bins=40, fill="deepskyblue", color="black") +
theme_light()
wt_qq <- NHANES_df %>% ggplot(aes(sample = wt71)) +
stat_qq_line(size=2,aes(color='red'))+
stat_qq(size=2) +
theme_light()
wt_qq <- NHANES_df %>% ggplot(aes(sample = wt82)) +
stat_qq_line(size=2,aes(color='red'))+
stat_qq(size=2) +
theme_light()
plot_grid(wt_hist,wt_hist,wt_qq,wt_qq)
Perform Shapiro-Wilk test to assess normality
shapiro.test(NHANES_df$wt71)
##
## Shapiro-Wilk normality test
##
## data: NHANES_df$wt71
## W = 0.95888, p-value < 2.2e-16
shapiro.test(NHANES_df$wt82)
##
## Shapiro-Wilk normality test
##
## data: NHANES_df$wt82
## W = 0.96927, p-value < 2.2e-16
Since both variables are right-skewed and the Shapiro-Wilk test P-Value is < 0.05 for both variables, it is not appropriate to use Pearson’s correlation test and we should use instead Spearman’s rank correlation test.
cor.test(NHANES_df$wt71,NHANES_df$wt82,method = 'spearman')
##
## Spearman's rank correlation rho
##
## data: NHANES_df$wt71 and NHANES_df$wt82
## S = 81118509, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8732651
Interpretation: Since the P-Value ≤ 0.05, we reject the null hypothesis and conclude that there is sufficient evidence to support the claim of a linear correlation in weight between 1971 and 1982. The correlation appears to be strong at \(r_s\) = 0.87.
Using data based on averages. Averages suppress individual variation and may inflate the correlation coefficient.
If there is no linear correlation, there might be some other correlation that is not linear.
The regression line (or line of best fit) is the straight line that “best” fits the scatter plot of the data.
\[\hat{y} = {b}_0 + b_1 x\]
where \(\hat{y}\) is the dependent variable, \(b_0\) is the y-intercept, \(b_1\) is the slope/coefficients, and \(x\) is the independent variable. The following are the requirements for simple linear regression:
A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible.
Example: Using NHANES data, predict males weight in 1982 using weight in 1971
We start by filtering the NHANES data for males.
NHANES_df_males <- NHANES_df %>% filter(sex=='Male')
Next, we evaluate the scatter plot of weight 1971 and weight 1982.
NHANES_df_males %>% ggplot(aes(x=wt71, y=wt82)) +
geom_point()+
geom_smooth(method=lm, se=FALSE)+
theme_light()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
We can see that the regression line fits the points well, but the points are not very close to the line. We proceed with constructing the linear regression model.
\({H_0}: \beta_1 = 0\)
\({H_1}: \beta_1 \neq\ 0\)
linearMod <- lm(wt82 ~ wt71 , data=NHANES_df_males) # construct a simple linear regression model
To evaluate the model, we use the summary command.
summary(linearMod)
##
## Call:
## lm(formula = wt82 ~ wt71, data = NHANES_df_males)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.833 -4.093 -0.232 3.775 45.638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.06163 1.54488 5.866 6.68e-09 ***
## wt71 0.91925 0.01968 46.714 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.409 on 760 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.7417, Adjusted R-squared: 0.7413
## F-statistic: 2182 on 1 and 760 DF, p-value: < 2.2e-16
Interpretation: Since the P-Value ≤ 0.05, we reject the null hypothesis and conclude that weight in 1971 is a predictor of weight in 1982.
We can use the model to predict weight in 1982 using weight in 1971.
predict(linearMod,newdata = data.frame(wt71 = c(70)),interval = 'prediction')
## fit lwr upr
## 1 73.40915 58.85251 87.9658
The predicted weight is 73.4 kg. A prediction interval is a range that likely contains the value of the dependent variable for a single new observation given specific values of the independent variable(s). We can be 95% confident that the weight of the next subject will be within 58.9 kg to 87.9 kg.
predict(linearMod,newdata = data.frame(wt71 = c(70)),interval = 'confidence')
## fit lwr upr
## 1 73.40915 72.81132 74.00699
The average weight 73.4 kg with a 95% confidence interval ranging between 72.8 and 74 kg.
\(R^2\), or coefficient of determination, is a value between 0 and 1 that measures how well our regression line fits our data. \(R^2\) can be interpreted as the percent of variance in our dependent variable that can be explained by our model. The \(R^2\) for our model is 0.742. This means that 74.2% of the variation in 1982 weight can be explained by the 1971 weight.