FOR3012 Statistics Module 4

Author

Patrick James & Jack Goldman

Published

October 13, 2023

Refresher

Good Practices in R code

Name

  • Use short, explicit names for your variables. Naming a variable var is not very informative.

  • Use an underscore (_), or a dot (.) to separate words within a name and try to be consistent!

  • Avoid using names of existing functions and variables (e.g., c, table, T, etc.)

Space

  1. Add spaces around all operators (=, +, -, <-, etc.) to make the code more readable.

  2. Always put a space after a comma, and never before (like in regular English).
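
A tiny example that follows both the naming and spacing guidelines above (the variable names are purely illustrative):

site_heights <- c(1.7, 2.3, 3.1)   # space after each comma, never before
mean_height  <- mean(site_heights) # short but explicit name; spaces around <-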

Core data types in R

Data types define how the values are stored in R. We can obtain the type of an object using the function typeof(). The core data types are:

  • Numeric-type with integer and double values
x <- 1.1

typeof(x)
[1] "double"
y <- 2L

typeof(y)
[1] "integer"
  • Character-type (always between " ")
d <- "R rocks!"

typeof(d)
[1] "character"

Correlation Summary

  • The correlation coefficient (r) measures the strength and direction of the association between two numerical variables.

  • Correlation implies association not causation.

  • The correlation coefficient ranges from -1 to 1. A value of zero indicates no correlation, -1 indicates a perfect negative correlation, and +1 indicates a perfect positive correlation.

  • Measures of correlation such as Pearson’s r assume that the two numerical variables have a bivariate normal distribution and that observations are randomly sampled.

  • With a bivariate normal distribution, the relationship between x and y is linear, and the frequency distributions of x and y, taken separately, are normal.

  • A scatter plot is a useful tool for examining the assumption of bivariate normality. Histograms of x and y should both appear normal.

  • Spearman’s rank correlation test is a non-parametric technique that measures the linear correlation between the ranks of two variables, where each variable is ranked separately from low to high.

  • Spearman’s rank correlation test is appropriate when the data are not bivariate normal, are expressed as semi-quantitative ranks, or sample sizes are very small.
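
As a quick, purely illustrative comparison of the two measures (the simulated x and y below are not part of the lab data):

set.seed(42)
x <- rnorm(25)
y <- x^3 + rnorm(25, sd = 0.5)   # monotonic but non-linear relationship
cor(x, y, method = "pearson")    # linear association between the raw values
cor(x, y, method = "spearman")   # association between the ranks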

Regression Summary

  • Regression is a method used to predict the value of a numerical response variable y from the value of a numerical explanatory variable x

  • Linear regression fits a straight line through a scatter of points. The equation for the regression line is \[y = b_0 + b_1 x\] where b1 is the slope of the line and b0 is the intercept.

  • This regression line is found by minimizing the sum of the squared differences between the observed y values and the values predicted by the line

  • The differences between the observed values of y and the values predicted by the least squares regression line (ŷ) are called “residuals”.

  • r2 is a measure of the fit of a regression line to the data; it measures the proportion of the variation in y that is explained by x.

  • Regression assumes that the distribution of residuals is normal, that the observations are independent, and that the variance of the y values is the same at all values of x (homoscedasticity).

  • The significance of the slope of a single predictor in a regression model can be tested using the t distribution.

  • The null hypothesis of this test is that there is no difference between the slope and a horizontal line, that is, a slope equal to zero.

  • The significance of the slope can also be tested using a confidence interval, based on the standard error of the estimate of the slope. Confidence intervals that contain 0 indicate a slope that is not significantly different from zero.
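
As a small illustration of that last point (a sketch using simulated data; x, y, and fit are made up for demonstration):

set.seed(1)
x   <- rnorm(30)
y   <- 2 + 0.8 * x + rnorm(30)
fit <- lm(y ~ x)
confint(fit)   # if the interval for x excludes 0, the slope differs significantly from zero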

Exercise

This week, given that we are developing a familiarity with statistics, data, and interacting with R, the instructions for this exercise will be less explicit than in the previous two weeks.

We will first look at how to calculate correlations. The example we are using is a bivariate (two-variable) dataset that captures the relationship between inbreeding and the number of wolf pups per litter in Sweden. By the 1970s, wolves had been eliminated from Norway and Sweden. Then, around 1980, two wolves were reintroduced and founded a new population. The population grew, and by 2002 there were >100 individuals. There were concerns, however, about the long-term viability of this population given the risk of inbreeding depression. Briefly, inbreeding depression is the consequence of the accumulation of deleterious alleles that can have negative consequences on individual fitness. One of the ways that inbreeding can affect populations is by influencing individual reproductive rates and offspring survival. The data for this exercise have been posted to Quercus and labelled “InbreedingData.xlsx”. The first column shows the inbreeding coefficient for each wolf, and the second shows the number of pups from the litter surviving their first winter.

Import these data into R using the read.table() command on a tab-delimited .txt file, or alternatively, use the read.csv() function to import a .csv file.

Without some changes to the original data, however, you will find that you have trouble importing it. When you pass the header=T argument to the read.table() function (for example), R expects the column labels to be separated in the same way as the data values. Because the column names (headers) in the raw .xlsx file have spaces (e.g., “Number of pups”), R will be confused, as there are only two columns of data. To overcome this issue, rename the columns in Excel so they have no spaces. You can use underscores, periods, or simply eliminate the space.

Remember, last week we discussed that with R, or any other programming language, you should avoid column, row, or variable names that contain spaces.
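
For example, if you export the renamed sheet as a .csv (or tab-delimited .txt) file, the import might look like the sketch below. The file name and the column names IC and NumPups are assumptions; they simply match the names used in the code later in this document.

lab4data <- read.csv("InbreedingData.csv", header = TRUE)                   # .csv export
# lab4data <- read.table("InbreedingData.txt", header = TRUE, sep = "\t")   # tab-delimited export
str(lab4data)   # confirm both columns imported as numeric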

Now that we have successfully imported the data, the first step should be to visualize it. In fact, the first step for any data analysis should be to visualize the data. We can look at the scatterplot of inbreeding coefficient and number of pups using the plot function.

plot(lab4data$IC, lab4data$NumPups, pch=19, col="darkgreen", cex=3, xlab="Inbreeding Coefficient", ylab="Number of pups", main="Inbreeding and wolf pup survival")

This command produces a simple scatter plot, although there are several arguments, some of which are new. Of note are the pch argument, which refers to the shape of the plotting character (here, a filled-in circle), and the cex argument, which refers to the size of the plotted character. Further details on these arguments can be found easily online, or by using the ? function in R (e.g., ?pch). Try varying pch and cex and observe their effect on your plots.
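
For instance, try a triangle symbol at a smaller size (purely illustrative):

plot(lab4data$IC, lab4data$NumPups, pch = 17, cex = 1.5, xlab = "Inbreeding Coefficient", ylab = "Number of pups")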

Given the output graph, you should be able to see that there is a negative linear and monotonic trend here. Before calculating the correlation, let’s check the assumption of bivariate normality next using a couple of histograms.

par(mfrow =c(2,1)) # plot with two rows and 1 column 

hist(lab4data$IC, col="firebrick") 

hist(lab4data$NumPups, col="royalblue")

Although there is a bit of a skew in the number of pups data, the data look normal enough. This can be confirmed as well by the elliptical shape of the points in the scatter plot. In running a Shapiro-Wilk normality test you will see that neither variable is significantly different from normal (p=0.1087 for the inbreeding coefficient, and p=0.0504 for the number of pups). Good news for us.
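
To run those tests yourself (the p-values in the comments are the ones quoted above):

shapiro.test(lab4data$IC)       # p = 0.1087
shapiro.test(lab4data$NumPups)  # p = 0.0504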

If the data had turned out to be significantly non-normal, in one variable or the other, we would have a few options: 1) we could transform our data, or 2) use a non-parametric approach (e.g., Spearman’s Rho) if transformations are not successful in resolving the issue.

Common transformations include the log transform (all purpose), the square root transform (often applied to count data), and the arcsine transform (often applied to proportions). To get a feel for transformation, let’s try them out and see how they affect the distribution of our “Number of Pups” variable.

par(mfrow =c(2,2)) # plot with two rows and 2 columns

hist(lab4data$NumPups, col="royalblue", main="Original")

hist(log(lab4data$NumPups+1), col="firebrick", main="Log") # log transform (+1 for zeroes)

hist(sqrt(lab4data$NumPups), col="darkgreen", main="SQRT") # square root

hist(asin(1/sqrt(lab4data$NumPups)), col="bisque", main="ArcSine") # arcsine – note the sqrt is required too; we use 1/sqrt(x) so the value is < 1 (a proportion)

Pearson’s correlation

As you will recall from lecture, Pearson’s correlation (r) can be calculated as the ratio of the covariance of two variables to the square root of the product of their individual variances (i.e., the product of their standard deviations). Covariance can be calculated using the cov function, and we already know about variances and how they are calculated. There is, of course, a function in R to do this automatically: cor. Let’s first calculate it ourselves and then compare it to the canned function.

cov1 <- cov(lab4data$NumPups, lab4data$IC) # covariance between the two variables

var1 <- var(lab4data$NumPups) # variance of the number of pups
var2 <- var(lab4data$IC) # variance of the inbreeding coefficient

r1 <- cov1/(sqrt(var1*var2)) # our own calculation

cor(lab4data$NumPups, lab4data$IC) # cor function

We can see that both techniques agree and report the correlation as -0.6077184. However, the cor function does not return any estimate of the significance of this relationship. To see this, we must use the cor.test function, as follows:

cor.test(lab4data$NumPups, lab4data$IC)

Here, we can see the details of the test, including the actual correlation (r = -0.608), the associated t-statistic, the degrees of freedom, and the p-value. The alternative hypothesis (H1) indicated is that the true (population-level) correlation is not equal to zero, and by association our null hypothesis (H0) is that the true correlation IS equal to zero. As the p-value returned here is far less than our α=0.05 threshold, we reject our null hypothesis and embrace our alternative hypothesis that there is indeed a significant correlation between these two variables. This conclusion can also be reached by an examination of the 95% confidence interval of the correlation coefficient.

This range of values indicates, with 95% certainty, where the true population-level correlation between the two variables lies. The confidence interval runs from the estimated value of r minus the critical t-value (for a given α and degrees of freedom) times the standard error of the correlation estimate, to the estimated value of r plus the critical t-value times the standard error of the correlation estimate.

Mathematically: 95% CI = \[\left[\, r - t_{\alpha,df} \cdot SE_r \;,\; r + t_{\alpha,df} \cdot SE_r \,\right]\]

where \[SE_r = \sqrt{\frac{1 - r^2}{n - 2}}\] and tα,df is the critical t-value for the chosen α and degrees of freedom.

In this case, R reports the confidence interval as [-0.812, -0.271]. Because this range does not include zero, the correlation is deemed significant. Were the interval to include zero, we would conclude that the correlation is not significantly different from zero, and we would NOT reject our null hypothesis. Confidence intervals and p-values are different, but strongly related, methods for assessing statistical significance.
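
If you are curious, the t-based formula above can be coded directly. This is a minimal sketch; note that cor.test() builds its interval with a Fisher z-transformation, so the two intervals will agree only approximately.

r_est  <- cor(lab4data$NumPups, lab4data$IC)
n      <- nrow(lab4data)
SE_r   <- sqrt((1 - r_est^2) / (n - 2))   # standard error of r
t_crit <- qt(0.975, df = n - 2)           # critical t for alpha = 0.05, two-sided
c(lower = r_est - t_crit * SE_r, upper = r_est + t_crit * SE_r)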

Regression

As before, we will import our Excel sheet into R. The file we will work with here is entitled “TreeGrowthData.xlsx”. Have a look at the data by clicking on the data frame name in the top right window within RStudio or by typing the object’s name into your script or console.

These data represent measured tree growth in response to different levels of N (nitrogen), P (phosphorus), K (potassium), and Ca (calcium) in the soil around the tree. Note that these data contain both positive and negative values; assume that these values represent deviations from some baseline rate.

The first step of any analysis, as always, is to plot your data. Let’s make a scatter plot of growth as a function of N.

plot(growth~n, data=myData, pch=19, cex=2, col="red", main="Lab 4") # assuming your data object is called “myData”

We can also look at a plot of all the variables simultaneously:

plot(myData, pch=19, cex=1.5, col="purple")

Take a minute to digest this plot and what it illustrates. You will note that growth increases with N and P, but there does not appear to be much going on with K or Ca. You can quantify the strength of these relationships by calculating Pearson’s r for each pair of variables.
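
A quick way to see all of the pairwise correlations at once (this assumes every column in myData is numeric, as it is here):

round(cor(myData), 2)   # matrix of pairwise Pearson correlations, rounded for readability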

Our next step is to fit a linear regression model to our data. Although there are multiple predictors in the data table, we will start with a simple regression model using a single predictor. We will model growth as a function of N. In this case, growth is the response variable and N is the predictor.

Recall the formula for a linear regression model: \[y = b_0 + b_1 x + \epsilon\]

To fit this model, we will use the lm function in R, and we will use the functions anova and summary to summarize the model and view the results.

mod1 <- lm(growth~n, myData) 

anova(mod1) # we can also look at the model results using the summary command 

summary(mod1)

From both the anova and summary outputs, we see that the slope of the regression is significant (p = 2.44e-08). This notation means 2.44 × 10⁻⁸, or 0.0000000244, which is clearly less than our α=0.05 threshold. Both the anova and summary outputs indicate the same significance for the predictor n. The summary output, however, also reports the r2 of the entire model (r2 = 0.57), the degrees of freedom used in the test, and the F statistic used to test the significance of the entire model.

Given these results (most specifically the significance of the predictor n), we must reject the null hypothesis that b1 = 0. We can then summarize this finding by stating something along the lines of: “… the relationship between growth and N is significant (F=49.69; p < 0.001), and 57% of the variation in growth was captured / described by the amount of nitrogen in the soil”.

We can also summarize the final model in terms of the intercept and slope, using the estimated parameters. These estimates can be found in the “Estimate” column of the coefficient table in the summary output.

In our case: growth = 0.6611 + 1.6123n

Be sure to understand where these coefficient estimates came from and how they fit together to create this equation.
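
If you want to pull those numbers straight from the fitted model object rather than reading them off the summary table:

coef(mod1)   # returns the intercept and the slope for n, matching the "Estimate" column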

Finally, we want to check that our residuals meet our expectation of normality. The first way we can do that is by plotting the residuals of our model against the fitted values. If the residuals are indeed normal, we would expect there to be no meaningful pattern (e.g., a cone or wave, see Figure 1) in the residual plot. We can also assess the normality of the residuals using a histogram and a Shapiro test. Finally, there is another tool called a normal probability plot, also called a qqplot. Here, the q’s refer to quantiles. Try out the following code and assess the normality of the residuals and, by extension, the appropriateness of a linear model for these data.

Figure 1. Non-linear and heteroscedastic residuals vs. fitted plots

myResids <- resid(mod1) # extract residuals.

Now we will make a scatter plot of residuals vs. fitted values

plot(myResids ~ fitted(mod1), col=rainbow(100), pch=19, cex=2)

abline(h=0, lty=3, col="darkgrey") # add dashed horizontal line

Histogram of residuals

hist(myResids, col="bisque")
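
The Shapiro test mentioned above can be run directly on the residuals as well:

shapiro.test(myResids)   # a non-significant p-value is consistent with normally distributed residuals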

Normal probability plot (QQ plot)

par(mfrow=c(1,1), pty="s") # parameter setting to make a square plot

qqnorm(myResids, ylim=c(0,2.5), xlim=c(0,2), pch=19, col="darkgreen", cex=2)

qqline(myResids, col="brown", lwd=2) # expected line under true normality

Overall, the residuals look great: the spread of the residuals on the y axis is even across the x axis (the fitted values) in the scatter plot. Likewise, the histogram shows a reasonable distribution, as does the qq plot; our points (green) conform to what we would expect under normality (brown line).

Instead of producing the diagnostic plots individually, you can also generate them all simultaneously by simply calling plot(mod1):

par(mfrow = c(2, 2))

plot(mod1)

If you have the time and are curious, you can try some multiple regressions. Multiple regression uses several predictors at once to fit the model. We will talk about multiple regression in more detail next week.

Additive multiple regression models:

  • Does the R2 change between models?

  • What about the AdjR2?

  • What is the difference between R2 and AdjR2?

mod2 <- lm(growth ~ n + p, myData)
mod3 <- lm(growth ~ n + p + Ca, myData)
mod4 <- lm(growth ~ n + p + Ca + k, myData)
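
One way to compare the fits and answer the questions above is to extract R2 and adjusted R2 from each model's summary (a sketch, assuming mod1 from earlier is still in your workspace):

fits <- list(mod1 = mod1, mod2 = mod2, mod3 = mod3, mod4 = mod4)
sapply(fits, function(m) c(R2 = summary(m)$r.squared, AdjR2 = summary(m)$adj.r.squared))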

Lab 4 Assignment

50 points – due Thursday Oct 19, 11:59 pm

Two researchers (Bob and Marge) conducted a study to examine how tree seedling growth is affected by different soil characteristics such as nutrient concentrations and depth of different soil horizons. In addition to wanting to understand how each soil characteristic affects growth individually, they were curious about the correlations among all soil attributes. Your job is to address these questions using the data in the file “FOR3012_Lab4_Assignment.xlsx”.

  1. Prepare a summary table of the correlations between all predictors and the response variable, including p-values. Write a few sentences summarizing and interpreting these results. (10 points)

  2. Analyse these data using linear regression. Prepare a simple linear regression model for each predictor (n=6) and compare the models. Summarize your results in a succinct paragraph supported by a table or tables containing pertinent information that allows us to assess the quality and strength of each model, as well as their differences. (20 points)

  3. Through visualization of your model residuals, comment on whether you think the assumption of normality has been met. We will not concern ourselves with homoscedasticity here. Include your residual plots. (10 points)

  4. Based on your findings above, if you had to pick only one predictor with which to model growth, which one would you choose and why? (10 points)