Correlation with R
- The Pearson product-moment correlation coefficient measures the strength of the linear relationship between two variables.
- It is often referred to as Pearson’s correlation or simply the correlation coefficient.
- If the relationship between the variables is not linear, the correlation coefficient does not adequately represent the strength of the relationship.
Computing the Correlation
To compute the Pearson correlation coefficient (“r”), use the cor() command.
The coefficient always lies between \(-1\) and \(1\).
The higher the absolute value of the correlation coefficient, the stronger the linear relationship.
A positive correlation coefficient indicates a positive relationship.
A negative correlation coefficient indicates a negative (inverse) relationship.
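A minimal sketch of the cor() call follows. The example data from the original slides are not shown in this section, so the built-in cars data set (speed and stopping distance) is used here as a stand-in; the variable names x, y and r are illustrative.

```r
# Stand-in data: the built-in `cars` data set (an assumption, not the
# data used in the original slides).
x <- cars$speed  # speed
y <- cars$dist   # stopping distance

# Pearson's correlation is the default method of cor().
r <- cor(x, y)
r
```

The order of the arguments does not matter: cor(x, y) and cor(y, x) give the same value.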
Correlation and Covariance
Other types of correlation coefficients are possible, such as the Spearman coefficient and the Kendall Tau coefficient.
To specify one of these methods, add the method argument to the cor() command.
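As a sketch, the method argument accepts "pearson" (the default), "spearman" or "kendall". The built-in cars data set is again used as stand-in data, and the variable names are illustrative.

```r
# Stand-in data (an assumption): the built-in `cars` data set.
x <- cars$speed
y <- cars$dist

# Rank-based alternatives to Pearson's coefficient.
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")

r_spearman
r_kendall
```

Spearman's coefficient is Pearson's coefficient computed on the ranks of the data, which is why it is less sensitive to non-linear but monotonic relationships.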
To compute the covariance, use the cov() command.
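The following sketch computes the covariance and shows how it relates to the correlation coefficient: dividing the covariance by the product of the two standard deviations recovers Pearson's r. The cars data set and the variable names are assumptions made for illustration.

```r
# Stand-in data (an assumption): the built-in `cars` data set.
x <- cars$speed
y <- cars$dist

# Sample covariance of x and y.
s_xy <- cov(x, y)
s_xy

# Rescaling the covariance by both standard deviations gives the
# Pearson correlation coefficient.
r_from_cov <- s_xy / (sd(x) * sd(y))
all.equal(r_from_cov, cor(x, y))
```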
Hypothesis Testing
The null hypothesis is that the correlation coefficient is zero.
The alternative hypothesis is that the correlation coefficient is greater than zero.
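One way to carry out this test in R is the cor.test() function. The sketch below assumes the one-sided alternative stated above, which is requested with alternative = "greater"; note that the default of cor.test() is the two-sided alternative ("two.sided"). The cars data set is stand-in data.

```r
# Stand-in data (an assumption): the built-in `cars` data set.
x <- cars$speed
y <- cars$dist

# H0: correlation is zero; H1: correlation is greater than zero.
test <- cor.test(x, y, alternative = "greater")

test$estimate  # sample correlation coefficient
test$p.value   # p-value of the one-sided test
```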
Slope and Intercept Estimates
- These tests are given in the “Two Tailed” format.
- The one-tailed format compares a null hypothesis that the parameter of interest is less than or equal to zero against an alternative hypothesis that it is greater than zero.
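To make the difference between the two formats concrete, the sketch below recomputes both p-values for the slope from its t statistic. It uses lm() and summary(), which are described later in these notes, with the built-in cars data set as stand-in data; all variable names are illustrative.

```r
# Stand-in model (an assumption): stopping distance regressed on speed.
fit <- lm(dist ~ speed, data = cars)
coef_table <- coef(summary(fit))

t_slope  <- coef_table["speed", "t value"]  # t statistic for the slope
df_resid <- fit$df.residual                 # residual degrees of freedom

# Two-tailed: H1 is "slope is not zero" (probability in both tails).
p_two_tailed <- 2 * pt(-abs(t_slope), df_resid)

# One-tailed: H1 is "slope is greater than zero" (upper tail only).
p_one_tailed <- pt(t_slope, df_resid, lower.tail = FALSE)

p_two_tailed
p_one_tailed
```

The two-tailed value matches the "Pr(>|t|)" column reported by summary(). When the t statistic is positive, the one-tailed p-value is half the two-tailed one.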
Simple Linear Regression
- Basic regression model: \[ y = \beta_{0} + \beta_{1}x + \epsilon \]
- The intercept \(\beta_{0}\) describes the point at which the line intersects the y-axis.
- The slope \(\beta_{1}\) describes the change in \(y\) for every unit increase in \(x\).
- From the data set, we determine the regression coefficients, i.e., estimates for the slope and intercept:
- \(\hat{\beta}_{0}\): the intercept estimate.
- \(\hat{\beta}_{1}\): the slope estimate.
- Fitted model: \[ \hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1}x \]
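The estimates can be computed by hand. The sketch below assumes ordinary least squares (the method used by R's lm() function), for which the slope estimate is the covariance of x and y divided by the variance of x, and the intercept estimate follows from the sample means. The cars data set and the names b0, b1 and y_hat are assumptions for illustration.

```r
# Stand-in data (an assumption): the built-in `cars` data set.
x <- cars$speed
y <- cars$dist

# Least-squares estimates for simple linear regression.
b1 <- cov(x, y) / var(x)      # slope estimate
b0 <- mean(y) - b1 * mean(x)  # intercept estimate

# Fitted values from the fitted model: y_hat = b0 + b1 * x.
y_hat <- b0 + b1 * x

c(intercept = b0, slope = b1)
```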
The lm() Command
The lm() command is used to fit linear models.
First, specify the response variable, then the predictor variable.
The tilde sign is used to denote the dependent relationship (i.e., \(y\) depends on \(x\)).
The regression coefficients are then determined.
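A minimal sketch of the call, using the built-in cars data set as stand-in data (an assumption; the original example is not shown here). The response dist goes on the left of the tilde and the predictor speed on the right.

```r
# response ~ predictor: stopping distance depends on speed.
# The data argument tells lm() where to find the variables.
lm(dist ~ speed, data = cars)

# Equivalent call with the variables supplied as plain vectors.
x <- cars$speed
y <- cars$dist
lm(y ~ x)
```

Printing the result shows the call and the two regression coefficients (intercept and slope).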
Detailed Model
A more detailed model (i.e., more than just the coefficients) is generated in the form of a data object.
We can give a name to the model and view all the results of the calculation, including:
- The regression coefficients.
- The fitted \(\hat{y}\) values (i.e., the estimated \(y\) values for the \(x\) data set).
- The residuals (i.e., the observed \(y\) values minus the fitted \(\hat{y}\) values).
As with other list-like R objects, we can use the names() function to list the components and the $ operator to access them.
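As a sketch, the model can be stored under a name and then inspected. The name fit and the use of the cars data set are assumptions for illustration.

```r
# Store the fitted model as a named object (stand-in data: `cars`).
fit <- lm(dist ~ speed, data = cars)

# Printing the object shows only the call and the coefficients.
fit

# names() lists every component stored in the model object.
names(fit)
```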
Accessing Model Components
We can access components using the $ symbol.
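For example, assuming a model stored as fit (an illustrative name, fitted here to the built-in cars data set), the three components listed earlier can be extracted as follows.

```r
# Stand-in model (an assumption): stopping distance regressed on speed.
fit <- lm(dist ~ speed, data = cars)

fit$coefficients         # intercept and slope estimates
head(fit$fitted.values)  # first few fitted values
head(fit$residuals)      # first few residuals

# Residuals are the observed values minus the fitted values.
all.equal(unname(fit$residuals), cars$dist - unname(fit$fitted.values))
```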
Alternative Method
- An alternative method is to use the following commands:
- coef() returns the regression coefficients of the model.
- fitted() returns the fitted values of the model.
- resid() returns the residuals of the model.
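A short sketch of the three extractor functions, again assuming a model named fit on the built-in cars data set (both are illustrative choices).

```r
# Stand-in model (an assumption): stopping distance regressed on speed.
fit <- lm(dist ~ speed, data = cars)

b     <- coef(fit)    # regression coefficients
y_hat <- fitted(fit)  # fitted values
e     <- resid(fit)   # residuals

b
head(y_hat)
head(e)
```

These functions return the same values as the corresponding $ components, but they do not depend on how the model object is stored internally.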
Coefficient of Determination
The coefficient of determination \(R^2\) is the proportion of variability in a data set that is accounted for by the linear model.
\(R^2\) provides a measure of how well future outcomes are likely to be predicted by the model.
For simple linear regression, it can also be computed by squaring the correlation coefficient.
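The sketch below obtains \(R^2\) in three ways: from summary(), by squaring the correlation coefficient, and directly from the sums of squares. The cars data set and the variable names are assumptions for illustration.

```r
# Stand-in model (an assumption): stopping distance regressed on speed.
fit <- lm(dist ~ speed, data = cars)

# R-squared as reported by summary().
r_squared <- summary(fit)$r.squared

# Squared Pearson correlation between predictor and response.
r <- cor(cars$speed, cars$dist)

# Proportion of variability explained: 1 - SS_residual / SS_total.
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)
r_squared_manual <- 1 - ss_res / ss_tot

c(r_squared, r^2, r_squared_manual)
```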
p-values
- We will begin to use hypothesis testing in our analyses.
- We will mostly be using “p-values”.
- If the p-value is below a chosen threshold (the significance level), we reject the null hypothesis.
- If it is at or above that threshold, we “fail to reject” the null hypothesis.
- We will use 0.01 (1%) as our arbitrary threshold.
- The relevant hypotheses will be discussed for each methodology.
Inference for Regression
We can use the summary() command to determine test statistics and p-values for both regression coefficients.
In both cases, the null hypothesis is that the true value is zero.
Consequently, the alternative hypothesis in both cases is that the true value is not zero.
Stating that the slope is zero is equivalent to saying that there is no relationship between \(x\) and \(y\).
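A minimal sketch of the summary() call. The model below is fitted to the built-in cars data set as a stand-in (an assumption), so its output will differ from the output discussed in the original slides.

```r
# Stand-in model (an assumption): stopping distance regressed on speed.
fit <- lm(dist ~ speed, data = cars)

# Full summary: coefficient table, residual standard error, R-squared.
fit_summary <- summary(fit)
fit_summary

# The coefficient table as a matrix: one row per coefficient, with the
# estimate, standard error, t statistic and two-tailed p-value.
coef_table <- coef(fit_summary)
coef_table[, c("t value", "Pr(>|t|)")]
```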
Inference for Regression
- The p-value for the intercept is 0.209. This means we fail to reject the null hypothesis that the true intercept is zero.
- The p-value for the slope is extremely small. This means we reject the null hypothesis that it is zero.
- Consequently, we reject the hypothesis that there is no relationship between \(x\) and \(y\).
- Notice the stars beside the p-value. The more stars, the lower the p-value.
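The decision rule can also be applied in code. The sketch below compares each p-value with the 0.01 threshold used in these notes. It is fitted to the built-in cars data set as a stand-in (an assumption), so the p-values are not those quoted above; the names alpha, p_values and reject_null are illustrative.

```r
# Stand-in model (an assumption): stopping distance regressed on speed.
fit <- lm(dist ~ speed, data = cars)

alpha <- 0.01  # significance threshold used in these notes

# Two-tailed p-values for the intercept and the slope.
p_values <- coef(summary(fit))[, "Pr(>|t|)"]

# TRUE means the null hypothesis (true value is zero) is rejected.
reject_null <- p_values < alpha

p_values
reject_null
```

The stars printed by summary() follow the "Signif. codes" legend shown underneath the coefficient table.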