Correlation and Regression

This week we will look at methods to understand the relationship between two numerical variables, using correlation and regression.

To demonstrate the new R commands this week, we will use the data set from Figure 2.3-3 in Whitlock and Schluter. These data describe the relationship between how ornamented a father guppy is (fatherOrnamentation) and how attractive his sons are to females (sonAttractiveness). Load the data from the .csv file:

guppyData <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\chap02e3bGuppyFatherSonAttractiveness.csv", stringsAsFactors = TRUE)

Before we go further, it is wise to plot a scatterplot to view the relationship of the two variables. (See Lab 3.)

library(ggplot2)
ggplot(guppyData, aes(x=fatherOrnamentation, 
   y=sonAttractiveness)) +
   geom_point() +
   theme_minimal() +
   xlab("Father's ornamentation") +
   ylab("Son's attractiveness")

Correlation

The Correlation Function

Calculating a correlation coefficient in R is straightforward. The function cor() calculates the correlation between the two variables given as input:

cor(guppyData$fatherOrnamentation, guppyData$sonAttractiveness)
## [1] 0.6141043

As we predicted from the graph, the correlation coefficient of these data is positive and fairly strong.

R has no explicit function for calculating the coefficient of determination. However, the coefficient of determination is simply the square of the correlation coefficient, so we can calculate it by squaring the output of cor(). Remember from Lab 1 that the caret symbol (^) is used to denote exponents, as in the following example:

cor(guppyData$fatherOrnamentation, guppyData$sonAttractiveness)^2
## [1] 0.377124

Testing a Hypothesis about the Correlation Coefficient

To test a hypothesis about the correlation coefficient or to calculate its confidence interval, use cor.test(). It takes as input the names of vectors containing the variables of interest.

cor.test(guppyData$fatherOrnamentation, guppyData$sonAttractiveness)
## 
##  Pearson's product-moment correlation
## 
## data:  guppyData$fatherOrnamentation and guppyData$sonAttractiveness
## t = 4.5371, df = 34, p-value = 6.784e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3577455 0.7843860
## sample estimates:
##       cor 
## 0.6141043

The output gives many of the pieces of information we might want. The title, "Pearson's product-moment correlation," is the technical name for the classic correlation coefficient. After restating the names of the variables being used, the output gives us the test statistic t, the degrees of freedom, and the P-value for a test of the null hypothesis that the population correlation coefficient is zero. In this case the P-value is quite small, P = 6.8 × 10⁻⁵. After that we have the 95% confidence interval for the correlation coefficient and, finally, the estimate of the correlation coefficient itself.
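
If you want to use any of these numbers in later calculations, you can save the output of cor.test() to a named object and pull out individual pieces by name. Here is a minimal sketch; the object name guppyCorTest is just our own choice:

guppyCorTest <- cor.test(guppyData$fatherOrnamentation, guppyData$sonAttractiveness)
guppyCorTest$estimate   # the correlation coefficient r
guppyCorTest$conf.int   # the 95% confidence interval
guppyCorTest$p.value    # the P-value of the test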

Spearman’s Correlation

The function cor.test() can also calculate Spearman's rank correlation if we add the option method = "spearman" to the command.

cor.test(guppyData$fatherOrnamentation, guppyData$sonAttractiveness, method = "spearman")
## Warning in cor.test.default(guppyData$fatherOrnamentation,
## guppyData$sonAttractiveness, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  guppyData$fatherOrnamentation and guppyData$sonAttractiveness
## S = 3269.4, p-value = 0.0002144
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5792287

The output here is similar to what we described above for Pearson's correlation. Spearman's ρ in this case is about 0.579.
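
The warning about ties is not an error; it simply means R could not compute an exact P-value because some observations share the same rank, so it used an approximation instead. If you would like to silence the warning, one option (sketched below) is to request the approximate P-value explicitly with the exact = FALSE argument:

cor.test(guppyData$fatherOrnamentation, guppyData$sonAttractiveness,
   method = "spearman", exact = FALSE)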

Linear Regression

Regression in R is a two-step process, similar to the steps used for ANOVA last week. In fact, we again start by using lm() to fit a linear model to the data. (Both ANOVA and regression are special cases of linear models, which can also be used for much more complicated analyses than these.) We then pass the output of lm() to the function summary() to see many useful results of the analysis.

Using the lm() function to calculate regression is similar to the steps used for ANOVA. The first argument is a formula, in the form response_variable ~ explanatory_variable. In this case we want to predict son’s attractiveness from father’s ornamentation, so our formula will be sonAttractiveness ~ fatherOrnamentation. The second input argument is the name of the data frame with the data. We will want to assign the results to a new object with a name (we chose “guppyRegression”), so that we can use the results in later calculations with summary().

guppyRegression <- lm(sonAttractiveness ~ fatherOrnamentation, data = guppyData)

 Let’s look at the output of lm() in this case.

guppyRegression
## 
## Call:
## lm(formula = sonAttractiveness ~ fatherOrnamentation, data = guppyData)
## 
## Coefficients:
##         (Intercept)  fatherOrnamentation  
##            0.005084             0.982285

This tells us that the estimate of the slope of the regression line is 0.982285, and the y-intercept is estimated to be 0.005084. Therefore the line that is estimated to be the best fit to these data is

sonAttractiveness = 0.005084 + (0.982285 x fatherOrnamentation).
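
If you want R to plug a value into this equation for you, the predict() function will do so from the fitted model. A quick sketch, using a made-up ornamentation value of 0.5:

# Predicted son's attractiveness for a father with ornamentation 0.5 (a hypothetical value)
predict(guppyRegression, newdata = data.frame(fatherOrnamentation = 0.5))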

We can find other useful information by looking at the summary() of the lm() result:

summary(guppyRegression)
## 
## Call:
## lm(formula = sonAttractiveness ~ fatherOrnamentation, data = guppyData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66888 -0.14647 -0.02119  0.27727  0.51324 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.005084   0.118988   0.043    0.966    
## fatherOrnamentation 0.982285   0.216499   4.537 6.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3212 on 34 degrees of freedom
## Multiple R-squared:  0.3771, Adjusted R-squared:  0.3588 
## F-statistic: 20.59 on 1 and 34 DF,  p-value: 6.784e-05

We see the estimates of the slope and intercept repeated here, in the "Coefficients" table under "Estimate". We are also given the standard error, t value, and P-value for each of these estimates in the same table. For these data, the P-value for the null hypothesis that the true slope is zero is 6.78 × 10⁻⁵.
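
The summary() output does not include confidence intervals for the slope and intercept, but the function confint() will calculate them from the same fitted model. A minimal sketch:

# 95% confidence intervals for the intercept and slope
confint(guppyRegression)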

Plotting this line on the scatterplot is fairly straightforward with ggplot(). We can use the same plot command as above, adding a new layer with + geom_smooth(method = lm) in the last line below:

ggplot(guppyData, aes(x=fatherOrnamentation, 
   y=sonAttractiveness)) +
   geom_point() +
   theme_minimal() +
   xlab("Father's ornamentation") +
   ylab("Son's attractiveness") +    
   geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

This layer adds both the best-fitting regression line and the 95% confidence interval for the line, shown in grey shading. The outer edges of the shaded area are the confidence bands, indicating the 95% confidence intervals for the mean of the Y-variable (son's attractiveness) at each value of the X-variable (father's ornamentation). If you want a plot without this confidence interval, add the argument se = FALSE to the geom_smooth() function, as in geom_smooth(method = lm, se = FALSE).
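
For reference, the full command for the plot without the confidence band would look like this:

ggplot(guppyData, aes(x=fatherOrnamentation, 
   y=sonAttractiveness)) +
   geom_point() +
   theme_minimal() +
   xlab("Father's ornamentation") +
   ylab("Son's attractiveness") +
   geom_smooth(method = lm, se = FALSE)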

Residual Plots

To check that the assumptions of regression apply to your data set, it can be very helpful to look at a residual plot. A residual is the difference between the actual value of the y variable and the value predicted by the regression line.
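
In other words, for each data point the residual is the observed y minus the fitted y. As a sketch, we could compute the residuals by hand from the fitted model; this should match the output of the residuals() function introduced next:

# Observed son's attractiveness minus the value predicted by the regression line
guppyData$sonAttractiveness - fitted(guppyRegression)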

The Residual Function

R can calculate the residuals from a model with the residuals() function. Simply give this function the results from the lm() function, such as the guppyRegression that we calculated above. A vector of all the residuals for this regression line would be calculated by

residuals(guppyRegression)
##            1            2            3            4            5            6 
## -0.668883727 -0.064552472 -0.032603841  0.176687566  0.088813345 -0.051009507 
##            7            8            9           10           11           12 
## -0.009769469 -0.049415172 -0.119592320 -0.189592320 -0.348175134 -0.409238024 
##           13           14           15           16           17           18 
## -0.665163613 -0.513214982 -0.394809317 -0.447191940 -0.108609127 -0.134100724 
##           19           20           21           22           23           24 
## -0.183569279 -0.105872206 -0.076580800  0.022356311  0.043064904  0.153596349 
##           25           26           27           28           29           30 
##  0.261116273  0.391647718  0.382887756  0.352887756  0.352887756  0.513242052 
##           31           32           33           34           35           36 
##  0.434482090  0.355722128  0.325722128  0.369619391  0.222453763  0.124756691

With a residual plot, we plot the residuals of each data point as a function of the explanatory variable.

plot(residuals(guppyRegression) ~ fatherOrnamentation, data = guppyData)

abline(h=0)

In this case we used the built-in function plot() because it is fast and easy, and this residual plot may be only for our own use. The second command, abline(h = 0), adds a horizontal line at 0 to the plot so that the baseline is easier to see.

This residual plot shows no major deviation from the assumptions of linear regression. There is no strong tendency for the variance of the residuals (indicated by the amount of scatter in the vertical dimension) to increase or decrease with increasing x. The residuals show no outliers or other evidence of not being normally distributed for each value of x.
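
If you would rather make the same residual plot with ggplot(), one way (a sketch, using a small data frame that we build ourselves) is:

# Build a data frame of x-values and residuals, then plot with ggplot2
residualData <- data.frame(fatherOrnamentation = guppyData$fatherOrnamentation,
   residual = residuals(guppyRegression))
ggplot(residualData, aes(x = fatherOrnamentation, y = residual)) +
   geom_point() +
   geom_hline(yintercept = 0) +
   theme_minimal() +
   xlab("Father's ornamentation") +
   ylab("Residual")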