Correlation Analysis

Statistical Modelling with R

StatsResource

Correlation with R

  • The Pearson product-moment correlation coefficient measures the strength of the linear relationship between two variables.
  • It is often referred to as Pearson’s correlation or simply the correlation coefficient.
  • If the relationship between the variables is not linear, the correlation coefficient does not adequately represent the strength of the relationship.

Computing the Correlation

  • To compute the Pearson correlation coefficient (“r”) between two numeric vectors, use the cor() command (a sketch for creating example x and y follows this list):

    > cor(x, y)
    [1] 0.9581898
  • The coefficient always lies between \(-1\) and \(1\).

  • The higher the absolute value of the correlation coefficient, the stronger the linear relationship.

  • A positive correlation coefficient indicates a positive relationship.

  • A negative correlation coefficient indicates a negative (inverse) relationship.
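
  • The examples on these slides use vectors x and y without showing how they were created. Below is a minimal sketch of comparable data (the seed and values here are assumptions of this sketch, so its outputs will not exactly reproduce the numbers shown on these slides):

    set.seed(1)                              # assumed seed, for reproducibility only
    x <- 1:10                                # assumed predictor values
    y <- 1 + 0.9 * x + rnorm(10, sd = 0.5)   # assumed linear response with noise
    cor(x, y)                                # Pearson correlation, as above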

Correlation and Covariance

  • Other types of correlation coefficients are available, such as Spearman’s rank correlation coefficient and Kendall’s tau coefficient.

  • To specify one of these methods, add the method argument to the command:

    > cor(x, y, method="kendall")
    [1] 0.7878788
    > cor(x, y, method="spearman")
    [1] 0.9090909
  • To compute the covariance, use the cov() command (the sketch after this block shows how covariance relates to correlation):

    > cov(x, y)
    [1] 1.824429
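
  • Correlation is covariance rescaled by the product of the two standard deviations, \( r = \operatorname{cov}(x, y) / (s_{x} s_{y}) \). A quick consistency check, reusing the same x and y:

    # Pearson correlation = covariance / (sd(x) * sd(y))
    cov(x, y) / (sd(x) * sd(y))   # should match cor(x, y)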

Hypothesis Testing

  • The null hypothesis is that the correlation coefficient is zero.

  • The alternative hypothesis here is one-sided: that the correlation coefficient is greater than zero.

  • The simulation below builds up the sampling distribution of \(r\) under the null hypothesis, by repeatedly correlating two independent standard normal samples of size 10.

    M <- 1000                # number of simulated samples
    CorrData <- numeric(M)   # storage for the simulated correlations
    for (i in 1:M) {
      # correlation of two independent N(0, 1) samples of size 10
      CorrData[i] <- cor(rnorm(10), rnorm(10))
    }
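
  • One way to use this simulated null distribution (an assumed follow-up, not part of the original slides) is to take its upper quantile as a critical value; R’s built-in cor.test() performs the corresponding test directly:

    # critical value: 99th percentile of r under the null
    # (matching the 1% threshold used later in these slides)
    quantile(CorrData, 0.99)

    # equivalent built-in test with the one-sided alternative r > 0
    cor.test(x, y, alternative = "greater")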

Slope and Intercept Estimates

  • These tests are given in the “Two Tailed” format.
  • The one-tailed format instead compares a null hypothesis in which the parameter of interest has a true value less than or equal to zero versus an alternative hypothesis stating that it has a value greater than zero.

Simple Linear Regression

  • Basic regression model: \[ y = \beta_{0} + \beta_{1}x + \epsilon \]
  • The intercept \(\beta_{0}\) describes the point at which the line intersects the y-axis.
  • The slope \(\beta_{1}\) describes the change in \(y\) for every unit increase in \(x\).
  • From the data set, we determine the regression coefficients, i.e., estimates for the slope and intercept (computed by hand in the sketch after this list):
    • \(\hat{\beta}_{0}\): the intercept estimate.
    • \(\hat{\beta}_{1}\): the slope estimate.
  • Fitted model: \[ \hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1}x \]
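
  • For simple linear regression these estimates have closed forms: \( \hat{\beta}_{1} = \operatorname{cov}(x, y) / \operatorname{var}(x) \) and \( \hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\bar{x} \). Below is a sketch computing them by hand (the object names b0 and b1 are my own; given the same x and y, the values should match the lm() output on the next slide):

    b1 <- cov(x, y) / var(x)       # slope estimate
    b0 <- mean(y) - b1 * mean(x)   # intercept estimate
    c(intercept = b0, slope = b1)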

The lm() Command

  • The command lm() is used to fit linear models.

  • First, specify the response variable, then the predictor variable.

  • The tilde sign is used to denote the dependent relationship (i.e., \(y\) depends on \(x\)).

  • The regression coefficients are then determined.

    > lm(y ~ x)
    
    Call:
    lm(formula = y ~ x)
    
    Coefficients:
    (Intercept)            x
         0.7812       0.8581
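
  • A fitted model is usually worth a quick visual check. A minimal sketch (the plot title is an arbitrary choice) that overlays the fitted line on a scatterplot of the data:

    plot(x, y, main = "Fitted regression line")   # scatterplot of the data
    abline(lm(y ~ x), col = "red")                # add the fitted line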

Detailed Model

  • A more detailed model (i.e., more than just the coefficients) is returned in the form of a data object (a list of class "lm").

  • We can give a name to the model and view all the results of the calculation, including:

    • The regression coefficients.
    • The fitted \(\hat{y}\) values (i.e., the estimated \(y\) values for the \(x\) data set).
    • The residuals (i.e., the differences between the estimated \(y\) values and the observed \(y\) values).
  • As with other list-based data structures, we can use the names() function and the $ operator to access components.

    > fit1 = lm(y ~ x)
    > names(fit1)
     [1] "coefficients"  "residuals"
     [3] "effects"       "rank"
     [5] "fitted.values" "assign"
     [7] "qr"            "df.residual"
     [9] "xlevels"       "call"
    [11] "terms"         "model"
    
    > summary(fit1)

Accessing Model Components

  • We can access components using the $ symbol.

    > fit1$coefficients
    (Intercept)           x
      0.7812216   0.8580521
    
    > fit1$coefficients[1]  # intercept
    (Intercept)
      0.7812216
    
    > fit1$coefficients[2]  # slope
        x
    0.8580521

Alternative Method

  • An alternative method is to use the following commands:
    • coef() - returns the regression coefficients of the model.
    • fitted() - returns the fitted values of the model.
    • resid() - returns the residuals of the model.

    > coef(fit1)
    (Intercept)           x
      0.7812216   0.8580521
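
  • Together, these accessors make it easy to confirm the basic decomposition of the data: each observed value is its fitted value plus its residual. A quick check (assumed, not from the original slides; as.numeric() strips the names so only the values are compared):

    # observed = fitted + residual, so this should print TRUE
    all.equal(as.numeric(y), as.numeric(fitted(fit1) + resid(fit1)))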

Coefficient of Determination

  • The coefficient of determination \(R^2\) is the proportion of variability in a data set that is accounted for by the linear model.

  • \(R^2\) provides a measure of how well future outcomes are likely to be predicted by the model.

  • For simple linear regression, it can also be computed by squaring the correlation coefficient.

    > summary(fit1)$r.squared
    [1] 0.9181277
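
  • A quick check of this relationship, reusing the correlation computed earlier (\(0.9581898^2 \approx 0.9181277\)):

    cor(x, y)^2   # equals summary(fit1)$r.squared for simple linear regression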

p-values

  • We will begin to use hypothesis testing in our analyses.
  • We will mostly be using “p-values”.
  • If the p-value falls below a chosen threshold, we reject the null hypothesis.
  • If it is above that (arbitrary) threshold, we “fail to reject” the null hypothesis.
  • We will use 0.01 (1%) as our arbitrary threshold.
  • The relevant hypotheses will be discussed for each methodology.

Inference for Regression

  • We can use the summary() command to determine test statistics and p-values for both regression coefficients.

  • In both cases, the null hypothesis is that the true value is zero.

  • Consequently, in both cases the alternative hypothesis is that the true value is not zero.

  • Stating that the slope is zero is equivalent to saying that there is no relationship between \(x\) and \(y\).

    > summary(fit1)
    
    Call:
    lm(formula = y ~ x)
    
    Residuals:
         Min       1Q   Median       3Q      Max
    -0.56320 -0.24413  0.06588  0.19946  0.67913
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.78122    0.58121   1.344    0.209
    x            0.85805    0.08103  10.590 9.38e-07 ***
    .....

Inference for Regression

  • The p-value for the intercept is 0.209. This means we fail to reject the null hypothesis that the true intercept is zero.
  • The p-value for the slope is extremely small. This means we reject the null hypothesis that it is zero.
  • Consequently, we reject the hypothesis that there is no relationship between \(x\) and \(y\).
  • Notice the stars beside the p-value: R prints more stars for smaller p-values (e.g., *** indicates a p-value below 0.001).
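
  • The test statistics and p-values can also be extracted programmatically rather than read off the printout, since summary() stores the coefficient table as a matrix (a sketch; the column name "Pr(>|t|)" is exactly as R labels it):

    coefs <- summary(fit1)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
    coefs[, "Pr(>|t|)"]                   # p-values for the intercept and slope
    coefs["x", "Pr(>|t|)"] < 0.01         # reject H0 for the slope at the 1% level?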