R language

Install the ISwR R package. Write the built-in dataset thuesen to a tab-separated text file. View it with a text editor. Change the NA to . (period), and read the changed file back into R.

#install.packages("ISwR")
#library("ISwR")
#write.csv(thuesen,"Files/thuesen.csv", row.names = TRUE)
#Change the NA to . in text editor
the <- read.csv("Files/thuesen2.csv", header = TRUE)
the2<- the[1:24,]
knitr::kable(the2)

X	blood.glucose	short.velocity
1	15.3	1.76
2	10.8	1.34
3	8.1	1.27
4	19.5	1.47
5	7.2	1.27
6	5.3	1.49
7	9.3	1.31
8	11.1	1.09
9	7.5	1.18
10	12.2	1.22
11	6.7	1.25
12	5.2	1.19
13	19.0	1.95
14	15.1	1.28
15	6.7	1.52
16	8.6	.
17	4.2	1.12
18	10.3	1.37
19	12.5	1.19
20	16.1	1.05
21	13.3	1.32
22	4.9	1.03
23	8.8	1.12
24	9.5	1.7

The exponential growth of a population is described by this mathematical function:

\[ N_t=N_0 e^rt \]

Where N_t is the population size at time t, N₀ is the initial population size, and r is the rate of growth or reproductive rate. Write an exponential growth function in R that also generates a plot with time on the x axis and N_t on the y axis. Using that function create plots for 20 days assuming an initial population size of 10 individuals under three growth rate scenarios (0.5, 0.8, -0.1).

r <- c(0.5, 0.8, -0.1) #growth rate scenarios
time <- 1:20 #days
n <- 10 #number of individuals staring 

#Creates 3 empty vectors for storage
growth1 <- vector(mode="numeric", length=length(time)) 
growth2 <- vector(mode="numeric", length=length(time))
growth3 <- vector(mode="numeric", length=length(time))


#For loop to calculate the growth 
for (a in seq_along(time)){
  g1 <- round(n*exp(r[1])*time[a])
  g2 <- round(n*exp(r[2])*time[a])
  g3 <- round(n*exp(r[3])*time[a])
  growth1[a] <-g1 
  growth2[a] <-g2
  growth3[a] <-g3
}


#Creates the plots
plot(time, growth1,
     main="Expeted Growth over 20 days",
     ylab="Indivduals (N)",
     xlab= "Time (days)",
     type="l",
     col="blue")
lines(time,growth2, col="pink")
lines(time,growth3,col='purple')
legend("topleft",
       c("Growth (0,5)","Growth (0,8)","Growth (-0,1)"),
       fill=c("blue","pink","purple")
)

Under highly favorable conditions, populations grow exponentially. However resources will eventually limit population growth and exponential growth cannot continue indefinitely. This phenomenon is described by the logistic growth function:

\[ N_t=(KN_0)/(N_0+(K-N_0)e^-rt) \]

Where K is the carrying capacity. Write a logistic growth function in R that also generates a plot with time on the x axis and N_t on the y axis. Using that function create plots for 20 days assuming an initial population size of 10 individuals and a carrying capacity of 1,000 individuals under three growth rate scenarios (0.5, 0.8, 0.4).

logG<-function (N, r, t, K) {
  Nt <- K*N/(N+(K-N) * exp(1)^(-r*t))
  plot(t, Nt, xlab="Time", ylab="Population size", type="l", lty=2, lwd=2.5, col="pink")
}

logG(N=10, r=0.5, t=0:20, K=1000)

logG(N=10, r=0.8, t=0:20, K=1000)

logG(N=10, r=0.4, t=0:20, K=1000)

Write a function (sum_n) that for any given value, say n, computes the sum of the integers from 1 to n (inclusive). Use the function to determine the sum of integers from 1 to 5,000.

sumn <- function(n){
  sum(1:n)
}

sumn(5000)

## [1] 12502500

Write a function (sqrt_round) that takes a number x as input, takes the square root, rounds it to the nearest whole number and then returns the result.

sqrt_round <- function(x){
  round(sqrt(x))
}
sqrt_round(24)

## [1] 5

Install the R package ‘nycflights13’, and load the ‘weather’ data.

Explore the columns names and the top part of the dataset to get a sense of the data
Make a subset of the data from just the first month (1) and then save that subset as ‘weather1’
Using ggplot2, make a beautiful histogram of the variable ‘temp’
Using ggplot2, make a beautiful line plot of ‘temp’ as a function of ‘time_hour’
Using ggplot2, make a beautiful boxplot of ‘temp’ as a function of ‘origin’

#install.packages("nycflights13") #installs the data

library(nycflights13) #calls the data to be used
library(tidyverse) #calls tidyvers, (for using filter)
#view(weather) #view the weather and the data. 

weather1 <- filter(weather, month == 1) #filters the first month 

#histogram for humidity in january NYC
ggplot(weather1, aes(temp))+
  geom_histogram(color="grey", fill="red")+ggtitle("Temperature in NYC")+ylab("Obersvations(n)")+xlab("Temperature (F)")

#Line plot for Temperature and Time of hour
ggplot(weather1, aes(x=hour,label=hour,y=temp))+
  geom_smooth(fill="pink", color="red")+ggtitle("Hourly temperature in NYC")+ylab("Temperature (F)")+xlab("Time of the day (hour)")

#boxblot of Temperature and Origin

ggplot(weather1, aes(y=temp,x=origin, fill=origin))+geom_boxplot()+xlab("Airports")+ylab("Temperature (F)")+ggtitle("NYC Airports and temperature")

Write content in markdown syntax to replicate, as much as possible, the format of the following sample text.

Big Title

Some text using Markdown sytanx

Some text in code format: computing with data

The main data vectors in R are:

-Vectors - arrays - list

The R webside is: https://www.r-project.org

Fundamentals of statistics

A researcher videotaped the glides of 8 tree snakes leaping from a 10-m tower. Ondulation rates of the snakes measured in hertz (cycles per second) were as follows: 0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6.

Draw a histogram of the undulation rate
Calculate the sample mean
Calculate the range
Calculate the standard deviation
Write a function to express the standard deviation as a percentage of the mean (that is, the coefficient of variation) and calculate it.

#Write the data as a matrix
s <- c(0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6)

#Draws the histogram from data
hist(s, xlab="Undulation rates (Hz)", ylab="Occurrence",main="Histogram of undulation rates",col="light blue")

#Calculates mean
mean(s)

## [1] 1.375

#calculates range
range(s)

## [1] 0.9 2.0

#Calculates Standart Deviation 
sd(s)

## [1] 0.324037

#Function that express the SD as a percentage of the mean
SM <- function(x){
  sd(x)/mean(x)*100
}
#Returns the function 
SM(s)

## [1] 23.56633

Blood pressure was measured (in units of mm Hg). Here are the measurements: 112, 128, 108, 129, 125, 153, 155, 132, 137.

How many individuals are in the sample?
What is the mean of this sample?
What is the variance?
What is the standard deviation?
What is the coefficient of variation?

#Creates dataframe blood presure
b <- c(112, 128, 108, 129, 125, 153, 155, 132, 137)

#Calculates n of individuals
length(b)

## [1] 9

#Calculates the mean of sample
mean(b)

## [1] 131

#calculates variance of sample
var(b)

## [1] 254.5

#Calculates the standard deviation
sd(b)

## [1] 15.95306

#Calculates the coefficient of variation 
cv <- sd(b) / mean(b) 
cv

## [1] 0.1217791

The data in the file DesertBirdAbundance.csv are from a survey of the breeding birds of Organ Pipe Cactus National Monument in southern Arizona.

Draw a histogram of the abundance data.
Calculate the median and the mean of the bird abundance data.
In this particular case, which do you think is the best measure of center, the mean or the median?
Calculate the range, standard deviation, variance and coefficient of variation of the bird abundance data.

#Write the CSV in to Bird
Bird <- read.csv("Files/DesertBirdAbundance.csv",header = TRUE)

#Draw histogram 
hist(Bird$abundance, main="Bird Observations", xlab = "Bird Observations (n)", ylab="Quantity of Species")

#Calculate Mean and median
mean(Bird$abundance)

## [1] 74.76744

median(Bird$abundance)

## [1] 18

#Best measure of center is the mean because the data is skew to the left.  

#Calculate Range
range(Bird$abundance)

## [1]   1 625

#Calculate Standard Deviation
sd(Bird$abundance)

## [1] 121.9473

#Calculate Variance
var(Bird$abundance)

## [1] 14871.14

#Calculate coefficient of variation 
cvB <- sd(Bird$abundance) / mean(Bird$abundance) 
cvB #high variability in the samples

## [1] 1.631021

Calculate the probability of each of the following events:

A standard normally distributed variable is larger than 3
A normally distributed variable with mean 35 and standard deviation 6 is larger than 42
Getting 10 out of 10 successes in a binomial distribution with probability 0.8
X > 6.5 in a Chi-squared distribution with 2 degrees of freedom

#Distributed Variable larger than 3
n3 <- dnorm(3)
1-n3*100 #probability of

## [1] 0.5568152

#Creates a normal distribution of 45, mean 35 and sd of 6.
n4 <- pnorm(45,mean=35, sd=6)
1-n4*100 #probability of

## [1] -94.22096

#Getting a 10 out 10 succes
n5 <- dbinom(10, size=10, prob=0.8) 
n5*100 #probability of

## [1] 10.73742

#Chi-square distribution
n6 <- pchisq(6.5, df=2)
n6*100

## [1] 96.12258

Demonstrate graphically the central limit theorem (by sampling and calculating the mean) using a Binomial distribution with 10 trials (size=10) and 0.9 probability of success (prob=0.9) and a sample size of 5.

#Creates a a df with the 5 units, probability of 0.9 and 10 trials
p1 <- rbinom(5, size=10, prob=0.9)
#hist(runif(p1)) #random uniform distribution

#Loops the mean of the sample for 100 times
m0 <-numeric(10)
for (i in 1:100) {
  m0[i]<-mean(runif(p1))
}

#Loops the mean of the sample for 100 times
m1 <-numeric(100)
for (i in 1:100) {
  m1[i]<-mean(runif(p1))
}

#Loops the mean of the sample for 1000 times
m2 <-numeric(1000)
for (i in 1:1000) {
  m2[i]<-mean(runif(p1))
}


#Puts all the histograms in one plot 
layout(matrix(c(1,2,3,4),2,2), widths=c(4,4,4,4), heights=c(8,8,8,8))

#creates a histogram with the first distribution
hist(p1,col="light blue", main="Histogram of Bionmial distribution")
abline(v = mean(p1), lty=3,lwd = 2)

hist(m0,col="purple", main="After 10 samples")
abline(v = mean(m0),lty=3,lwd = 2)

hist(m1,col="pink", main="After 100 samples")
abline(v = mean(m1),lty=3,lwd = 2)

#after a 1000 samples, the distribution is normal 
hist(m2,col="yellow", main="After 1000 samples")
abline(v = mean(m2), lty=3,lwd = 2)

Imagine height is genetically determined by the combined (that is, the sum) effect of several genes (polygenic trait). Assume that each gene has an effect on height as a uniform distribution with min=1 and max=3. Simulate stochastic model of height for 1,000 random people based on 1 gene, 2 genes, and 5 genes. As we increase the number of genes, what is the resulting height distribution?

#Puts all the histograms in one plot 
layout(matrix(c(1,2,3),3,3, byrow = TRUE), widths=c(3,3,3), heights=c(5,5,5))

#1 gene uniform distribution
hist(runif(1000,1,3), main="One Gene Distribution",xlab=NULL, col="#9999CC")

#2 gene uniform distribution
hist(runif(1000, 1, 3)+runif(1000, 1, 3),main="Two Gene Distribution",xlab="Genes", col="#9933FF")

#5 gene uniform distribution
hist(runif(1000, 1, 3)+runif(1000, 1, 3)+runif(1000, 1, 3)+runif(1000, 1, 3)+runif(1000, 1, 3),main="Five Gene Distribution",xlab=NULL, col="#6600cc")

#As the genes (variables) in the distribution add up, the distribution goes from uniform to normal.

Do the same as in the previous problem but now assuming that the combined effect of the genes is multiplicative (not the sum). As we increase the number of genes, what is the resulting height distribution?

#Puts all the histograms in one plot 
layout(matrix(c(1,2,3),3,3, byrow = TRUE), widths=c(3,3,3), heights=c(5,5,5))

#1 gene uniform distribution
hist(runif(1000,1,3), main="One Gene Distribution",xlab=NULL, col="#CC9933")

#2 gene uniform distribution 
hist(runif(1000, 1, 3)*runif(1000, 1, 3),main="Two Gene Distribution",xlab="Genes", col="#cc6633")

#5 gene uniform distribution
hist(runif(1000, 1, 3)*runif(1000, 1, 3)*runif(1000, 1, 3)*runif(1000, 1, 3)*runif(1000, 1, 3),main="Five Gene Distribution",xlab=NULL, col="#330000")

#As the genes (variables) in the distribution get multiplied, the distribution goes from uniform to lognormal distribution (skew on the left)

Statistical tests

Normal human body temperature is 98.6 F. Researchers obtained body-temperature measurements on randomly chosen healthy people: 98.4, 98.6, 97.8, 98.8, 97.9, 99.0, 98.2, 98.8, 98.8, 99.0, 98.0, 99.2, 99.5, 99.4, 98.4, 99.1, 98.4, 97.6, 97.4, 97.5, 97.5, 98.8, 98.6, 100.0, 98.4.

Make a histogram of the data
Make a normal quantile plot
Perform a Shapiro-Wilk test to test for normality
Are the data normally distributed?
Are these measurements consistent with a population mean of 98.6 F?

#creates temperature df
T <- c(98.4, 98.6, 97.8, 98.8, 97.9, 99.0, 98.2, 98.8, 98.8, 99.0, 98.0, 99.2, 99.5, 99.4, 98.4, 99.1, 98.4, 97.6, 97.4, 97.5, 97.5, 98.8, 98.6, 100.0, 98.4)

#Puts all the histograms in one plot 
layout(matrix(c(1,2),1,2), widths=c(4,4), heights=c(8,8))

#histogram of body temp.
hist(T, main="Body temperature measurements", xlab="Temperature (F)", col="#3399FF")
abline(v = mean(T),lty=3,lwd = 2) #adds mean to the histogram

#quantile plot of temp
qqnorm(T, main="Body temperature measurements", ylab="Temperature (F)")

#Shapoiro-Wil test to know if the data is normally distributed 
shapiro.test(T)#if Ho is true, W=1, W<1  might be different from normal

## 
##  Shapiro-Wilk normality test
## 
## data:  T
## W = 0.97216, p-value = 0.7001

#Data is normal distributed

#T test to show average temperature is 98.6F
t.test(T, mu=98.6)$p.value #true mean is not equal to 98.6

## [1] 0.5802363

The brown recluse spider often lives in houses throughout central North America. A diet-preference study gave each of 41 spiders a choice between two crickets, one live and one dead. 31 of the 41 spiders chose the dead cricket over the live one. Does this represent evidence for a diet preference?

#there are two outcomes, life or dead crickets, binomial test
#Ho = There is no preference
binom.test(x=31, n=41, p=0.5)$p.value

## [1] 0.001450491

#P value < 0.05, reject the null hypothesis, there is a preference

Ten epileptic patients participated in a study of a new anticonvulsant drug. During the first 8-week period, half the patients received a placebo and half were given the drug, and the number of seizures were recorded. Following this, the same patients were given the opposite treatment and the number of seizures were recorded. Assuming that the distribution of the difference between the placebo and drug meets the assumption of normality, perform an appropriate test to determine whether there were differences in the number of epileptic seizures with and without the drug.

Placebo	Drugs
37	5
52	23
68	40
4	3
29	38
32	19
19	9
52	24
19	17
12	14

Make a boxplot of the data
Test the difference.

#The samples are independent, (Two sample T test)
#Ho = The average Seizures is the same in the placebo and drug 
#Ha = The average Seizures is different in the placebo and drug

#creates a Boxplot
boxplot(S,col=c("#FF00CC","#FF666C"),main="Number of Seizures")

#two sample t test (is there a difference in mean?)
t.test(P,D)

## 
##  Welch Two Sample t-test
## 
## data:  P and D
## t = 1.758, df = 15.093, p-value = 0.099
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.795429 29.195429
## sample estimates:
## mean of x mean of y 
##      32.4      19.2

#P value is greater than 0.05, there is no statistical difference in the mean

A bee biologist is analyzing whether there was an association between bee colony number and type of forest habitat. We expect that there is no habitat preference for bee colony number. Is this true based on this data?

Habitat	Bee.colonies
Oak	33
Hickory	30
Maple	29
Red cedar	4
Poplar	4

Make a barplot of the data
Test the hypothesis of no-association

#Ho = no habitad preference
#Ha = habitad preference 
barplot(hab$Bee.colonies, names.arg=hab$Habitat, col="tan4")

#Chi test to see how much it differs by the expected distribution of all the same 
chisq.test(hab$Bee.colonies)

## 
##  Chi-squared test for given probabilities
## 
## data:  hab$Bee.colonies
## X-squared = 43.1, df = 4, p-value = 9.865e-09

#there is a habitad preference (reject null Hypothesis)

Perform 10 one-sample t-tests for mu=0 on simulated standard normally distributed data of 25 observations each and get the P-value. Repeat the experiment, but instead simulate samples of 25 observations from a t-distribution with 2 degrees of freedom. Find a way to automate the first experiment to do it a 1,000 times and applying a false discovery rate correction to the P-values (check ?replicate).

set.seed(250)

#Creates empty df to save p-values
Pval <- length(10) 
Pval2 <- length(10)
Pval3 <- length(1000)

#Loops the p-value 10 times, with a normal distribution 
for ( i in 1:10 ){
  n <- rnorm(n=25, mean=0)
  t_test <- t.test (n, mu = 0)
  Pval[i] <- t_test$p.value
}

#Loops the p-value 10 times, with a t distribution
for ( i in 1:10 ){
  m <- rt(25, df=2)
  t_test <- t.test (m, mu = 0)
  Pval2[i] <- t_test$p.value
}


#loop p-values 1000 times
for ( i in 1:1000 ){
  o <- rnorm(n=25, mean=0)
  t_test <- t.test (o, mu = 0)
  Pval3[i] <- t_test$p.value
}

#Puts all the histograms in one plot 
layout(matrix(c(1,2,3),3,3, byrow = TRUE), widths=c(3,3,3), heights=c(5,5,5))
hist(Pval, main="Normal Distribution, 10xt.test", xlab = NULL, col="#FFFFCC")
hist(Pval2,main="T Distribution, 10xt.test",xlab = "P Value",col="#CCCC99")
hist(Pval3, main="Normal Distribution, 1000xt.test",xlab = NULL,col="#FFCC99")

ANOVA

You are hired by the US Department of Agriculture to develop effective control practices for invasive plants (data_ANOVA.xlsx). In particular, you are asked to investigate pesticide controls for kudzu which is an invasive vine that grows in thick mats that smother underlying plants. Two of the most widely used pesticides for kudzu are glyphosate and triclopyr. To determine the effectiveness of a single application of pesticide, you conduct an experiment in 18 plots that each had 50% kudzu cover. In mid-summer, you applied equal amounts of 2% glyphosate to 6 plots, 2% triclopyr to 6 plots, and water without pesticides to 6 plots. Then you returned in autumn and measured the percent cover of kudzu in the plots.

Transform the data from wide to long format (Google how to do it - Hint: function melt from reshape2 package)
Make a boxplot of the data
Perform an ANOVA
What is the variation explained (R2)?
Perform a post hoc test

#Pestice control for Kudzu (?)
#18 plots, 50% of Kudzu cover
#Glyphosate to 6 plots, triclopyr to 6 plots, water to 6 plots
#What % of Kuzdo is left on the plot? 

#Read data from files
DW <- read.csv("Files/data_ANOVA.csv")

# Transform the data from wide to long
keycol <- "Treatment" #Creates empty field for colums
valuecol <-"Kudzu" #Creates empty field for colums 
colums <- c("Glyphosate", "Triclopyr", "Control") #puts all the data together in one
DL <- gather_(DW,keycol,valuecol,colums) #puts data together

## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

#Creates the boxplots
boxplot(DW, col=c("thistle2","tomato1","slateblue"), ylab="Kudzu (% of Plot)", main="Pesticide Control for Kuzdo")

#ANOVA 
k <- aov(Kudzu~Treatment, data=DL)

#find the Rsquare 
summary(lm(Kudzu~Treatment, data=DL))

## 
## Call:
## lm(formula = Kudzu ~ Treatment, data = DL)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.500 -1.875  0.000  1.875  4.500 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           30.500      1.080  28.238 2.03e-14 ***
## TreatmentGlyphosate  -11.000      1.528  -7.201 3.07e-06 ***
## TreatmentTriclopyr   -15.500      1.528 -10.147 4.12e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.646 on 15 degrees of freedom
## Multiple R-squared:  0.879,  Adjusted R-squared:  0.8629 
## F-statistic:  54.5 on 2 and 15 DF,  p-value: 1.318e-07

#Post-HOC test
PostHocTest(k, method ="bonferroni")

## 
##   Posthoc multiple comparisons of means : Bonferroni 
##     95% family-wise confidence level
## 
## $Treatment
##                       diff     lwr.ci      upr.ci    pval    
## Glyphosate-Control   -11.0 -15.114755  -6.8852452 9.2e-06 ***
## Triclopyr-Control    -15.5 -19.614755 -11.3852452 1.2e-07 ***
## Triclopyr-Glyphosate  -4.5  -8.614755  -0.3852452  0.0300 *  
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#T test of each combination of treatments. 
#Control has always a lower p value, suggesting there is an effect in the treatments.

Data in growth.txt come from a farm-scale trial of animal diets. There are two factors: diet and supplement. Diet is a factor with three levels: barley, oats and wheat. Supplement is a factor with four levels: agrimore, control, supergain and supersupp. The response variable is weight gain after 6 weeks.

Inspect the data using boxplots
Perform a two-way ANOVA
Assess graphically if there exists an interaction between both factors
Given what you learnt about the interaction, what would be a better model?
What is the variation explained (R2) of this new model?
Perform a Tukey HSD test of this new model

#reads achive as table
grow<-read.table("Files/growth.txt", sep="\t", header=T)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

#Boxplot using the columns diet and gain
boxplot(gain~diet, data=grow, ylab="Gain (g)", col="olivedrab")

#Boxplot using the colums gain and supplement 
boxplot(gain~supplement, data=grow, ylab="Gain (g)", col="orange")

#Boxplots using combining gain diet + suplement 
boxplot(gain~diet+supplement, data=grow, col="salmon")

#two way annova (evaluate the effect of two grouping variable)
gw <- lm(gain~diet*supplement, data=grow) #model fitting (linear model)
#first argument is the formula, second argument the data 

#Perfom annova
anova(gw)

## Analysis of Variance Table
## 
## Response: gain
##                 Df  Sum Sq Mean Sq F value    Pr(>F)    
## diet             2 287.171 143.586 83.5201 2.999e-14 ***
## supplement       3  91.881  30.627 17.8150 2.952e-07 ***
## diet:supplement  6   3.406   0.568  0.3302    0.9166    
## Residuals       36  61.890   1.719                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Grafic control of interction 
interaction.plot(grow$diet, grow$supplement, grow$gain)

interaction.plot(grow$supplement, grow$diet, grow$gain)

#new model 
anova(lm(gain~diet+supplement, data=grow))

## Analysis of Variance Table
## 
## Response: gain
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## diet        2 287.171 143.586  92.358 4.199e-16 ***
## supplement  3  91.881  30.627  19.700 3.980e-08 ***
## Residuals  42  65.296   1.555                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm(gain~diet+supplement, data=grow))

## 
## Call:
## lm(formula = gain ~ diet + supplement, data = grow)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.30792 -0.85929 -0.07713  0.92052  2.90615 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          26.1230     0.4408  59.258  < 2e-16 ***
## dietoats             -3.0928     0.4408  -7.016 1.38e-08 ***
## dietwheat            -5.9903     0.4408 -13.589  < 2e-16 ***
## supplementcontrol    -2.6967     0.5090  -5.298 4.03e-06 ***
## supplementsupergain  -3.3815     0.5090  -6.643 4.72e-08 ***
## supplementsupersupp  -0.7274     0.5090  -1.429     0.16    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.247 on 42 degrees of freedom
## Multiple R-squared:  0.8531, Adjusted R-squared:  0.8356 
## F-statistic: 48.76 on 5 and 42 DF,  p-value: < 2.2e-16

#tukey test 
TukeyHSD(aov(gain~diet+supplement, data=grow))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = gain ~ diet + supplement, data = grow)
## 
## $diet
##                   diff       lwr       upr p adj
## oats-barley  -3.092817 -4.163817 -2.021817 0e+00
## wheat-barley -5.990298 -7.061298 -4.919297 0e+00
## wheat-oats   -2.897481 -3.968481 -1.826481 2e-07
## 
## $supplement
##                           diff        lwr        upr     p adj
## control-agrimore    -2.6967005 -4.0583332 -1.3350677 0.0000234
## supergain-agrimore  -3.3814586 -4.7430914 -2.0198259 0.0000003
## supersupp-agrimore  -0.7273521 -2.0889849  0.6342806 0.4888738
## supergain-control   -0.6847581 -2.0463909  0.6768746 0.5400389
## supersupp-control    1.9693484  0.6077156  3.3309811 0.0020484
## supersupp-supergain  2.6541065  1.2924737  4.0157392 0.0000307

Correlation

In a study of hyena laughter, a researcher investigated whether sound spectral properties of hyenas’ giggles are associated with age. Age(years): 2,2,2,6,9,10,13,10,14,14,12,7,11,11,14,20 Frequency (Hz): 840,670,580,470,540,660,510,520,500,480,400,650,460,500,580,500

Inspect the data using a scatterplot
Test the linear association between both variables
Assume that the data are not normally distributed, test the linear association using a non-parametric correlation coefficient

#add data
a<-c(2,2,2,6,9,10,13,10,14,14,12,7,11,11,14,20)
f<-c(840,670,580,470,540,660,510,520,500,480,400,650,460,500,580,500)

#plot data
plot(a, f, type="p", xlab="Age (years)",ylab="Frequency (Hz)", main="Frecuency of hyenas' laughter")

#Perform pearson test
cor.test(a, f, method="pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  a and f
## t = -2.8194, df = 14, p-value = 0.01365
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8453293 -0.1511968
## sample estimates:
##       cor 
## -0.601798

#Perform spearman test 
cor.test(a, f, method="spearman")

## Warning in cor.test.default(a, f, method = "spearman"): Cannot compute exact p-
## value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  a and f
## S = 1047.1, p-value = 0.03092
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5397807

Regresion

Testosterone is known to predict aggressive behavior and is involved in face shape during puberty. Does face shape predict aggression? Researchers measured the face width-to-height ratio of 21 university hockey players with the average number of penalty minutes awarded per game for aggressive infractions. The data are available in face.txt.

How steeply does the number of penalty minutes increase per unit increase in face ratio? Calculate the estimate of the intercept. Write the result in the form of an equation for the line.
How many degrees of freedom does this analysis have?
What is the t-statistic?
What is your final conclusion about the slope?
What is the variation explained, R2?
Make a scatterplot of the data and include a linear regression line
Check the model assumptions

t <-read.table("files/face.txt", sep="\t", header=T)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

head(t)

##   Face Penalty
## 1 1.59    0.44
## 2 1.67    1.43
## 3 1.65    1.57
## 4 1.72    0.14
## 5 1.79    0.27
## 6 1.77    0.35

t %>% ggplot(aes(x=Face, y=Penalty)) + geom_point()

#writs a linear model and the summary stats
model_t <- lm(Penalty ~ Face, data=t)
summary(model_t)

## 
## Call:
## lm(formula = Penalty ~ Face, data = t)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9338 -0.3883 -0.1260  0.3381  1.0711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -4.505      2.089  -2.157   0.0440 *
## Face           3.189      1.150   2.774   0.0121 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6129 on 19 degrees of freedom
## Multiple R-squared:  0.2882, Adjusted R-squared:  0.2508 
## F-statistic: 7.694 on 1 and 19 DF,  p-value: 0.0121

##plots graft and lm
 t %>% ggplot(aes(x=Face, y=Penalty)) + geom_point() +        stat_smooth(method = "lm", col = "red",se=F)

## `geom_smooth()` using formula 'y ~ x'

check_model(model_t)

Answers
a) y = 3.189x - 4.505 b)19 egress of freedom c) t value = 2.774 d) The slope is positive e) 0.2288, is not a strong relationship g) The assumptions says the model is correct.

Respiratory rate (Y) is expected to depend on body mass (X) by the power law, Y=aXb, where B is the scaling exponent. The data are available in respiratoryrate_bodymass.txt.

Make a scatterplot of the raw data
Make a scatterplot of the linearized relationship. Which transformation did you use?
Use linear regression to estimate 
Carry out a formal test of the null hypothesis of =0
What is the variation explained, R2?
Check the model assumptions

##Read Data
r <-read.table("files/respiratoryrate_bodymass.txt", sep="\t", header=T)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

head(r)

##   BodyMass Respiration
## 1      453         666
## 2     1283         643
## 3      695        1512
## 4     1640        2198
## 5     1207        2535
## 6     2096        4176

##Make Scatter plot
r %>% ggplot(aes(x=BodyMass, y=Respiration)) + geom_point()

##Make Scatter plot of linearize relationship
r %>% ggplot(aes(x=BodyMass, y=Respiration)) + geom_point() +        stat_smooth(method = "lm", col = "red",se=F)

## `geom_smooth()` using formula 'y ~ x'

model_r <- lm(BodyMass ~ Respiration, data=r)

#Checking Regresion Diagnostics
check_model(model_r)

summary(model_r)

## 
## Call:
## lm(formula = BodyMass ~ Respiration, data = r)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1386.9  -468.7   107.5   565.4  1075.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -388.50567  405.38465  -0.958    0.366    
## Respiration    0.92707    0.08718  10.634 5.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 805.6 on 8 degrees of freedom
## Multiple R-squared:  0.9339, Adjusted R-squared:  0.9257 
## F-statistic: 113.1 on 1 and 8 DF,  p-value: 5.356e-06

pp_check(model_r) #posterior predictive checks

#the p value is >0.05 and the R2 is 0.9 so we have a good model

Advance Regresion

Use the ozone.txt data to model the ozone concentration as a linear function of wind speed, air temperature and the intensity of solar radiation. Assume that the requirements to perform a linear regression are met.

Make a multiple panel bivariate scatterplot
Perform a multiple linear regression
What is the variation explained, R2?
Assess the collinearity of the explanatory variables using the variance inflation factor
Check the model assumptions

ozone <-read.table("Files/ozone.txt", sep="\t", header=T)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

head(ozone)

##   rad temp wind ozone
## 1 190   67  7.4    41
## 2 118   72  8.0    36
## 3 149   74 12.6    12
## 4 313   62 11.5    18
## 5 299   65  8.6    23
## 6  99   59 13.8    19

pairs(ozone)

#multiple linear regression
model_ozone <- lm(ozone ~ rad+temp+wind, data=ozone)
summary(model_ozone)

## 
## Call:
## lm(formula = ozone ~ rad + temp + wind, data = ozone)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.210  -3.556  10.124  95.600 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.23208   23.04204  -2.788  0.00628 ** 
## rad           0.05980    0.02318   2.580  0.01124 *  
## temp          1.65121    0.25341   6.516 2.43e-09 ***
## wind         -3.33760    0.65384  -5.105 1.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.17 on 107 degrees of freedom
## Multiple R-squared:  0.6062, Adjusted R-squared:  0.5952 
## F-statistic: 54.91 on 3 and 107 DF,  p-value: < 2.2e-16

#asses the co-linearity
vif(model_ozone)

##      rad     temp     wind 
## 1.095241 1.431201 1.328979

#check assumptions
check_model(model_ozone)

Answers c) R2 = 0.6062 d) There is low colineatiry in the varibles e) the model meets the assptions

Use the diminish.txt data (xv is explanatory, yv is response variable) to:

Perform a simple linear regression
Perform a polynomial (second-degree) regression
Compare both models with Akaike’s Information Criterion (AIC). Which model is better?
Make a scatterplot of the data and include both regression lines

diminish <-read.table("files/diminish.txt", sep="\t", header=T)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

head(diminish)

##   xv yv
## 1  5 26
## 2 10 29
## 3 15 31
## 4 20 30
## 5 25 35
## 6 30 37

diminish_lm <- lm(xv~yv, data=diminish)
summary(diminish_lm)

## 
## Call:
## lm(formula = xv ~ yv, data = diminish)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.390  -6.102  -2.158   5.726  15.917 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -100.1645    14.2996  -7.005 2.97e-06 ***
## yv             3.8465     0.3676  10.465 1.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.824 on 16 degrees of freedom
## Multiple R-squared:  0.8725, Adjusted R-squared:  0.8646 
## F-statistic: 109.5 on 1 and 16 DF,  p-value: 1.455e-08

##Make Scatter plot of linearize relationship
diminish %>% ggplot(aes(x=xv, y=yv)) + geom_point() +        stat_smooth(method = "lm", col = "red",se=F)

## `geom_smooth()` using formula 'y ~ x'

##Polinomial regression

diminish_po <- lm(yv ~ poly(xv,2), data=diminish)
summary(diminish_po)

## 
## Call:
## lm(formula = yv ~ poly(xv, 2), data = diminish)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0490 -1.2890  0.6018  1.1829  2.9610 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   38.3889     0.5024  76.415  < 2e-16 ***
## poly(xv, 2)1  24.9644     2.1314  11.713 6.02e-09 ***
## poly(xv, 2)2  -4.7869     2.1314  -2.246   0.0402 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.131 on 15 degrees of freedom
## Multiple R-squared:  0.9046, Adjusted R-squared:  0.8919 
## F-statistic: 71.12 on 2 and 15 DF,  p-value: 2.222e-08

ggplot(diminish, aes(x=xv, y=yv)) + 
          geom_point() +
          stat_smooth(method='lm', formula = y ~ poly(x,2), size = 1)

#Compare both models
AIC(diminish_po,diminish_lm) #the lower AIC, the better

##             df       AIC
## diminish_po  4  83.04403
## diminish_lm  3 137.21491

#Add the to types of data. 
ggplot(diminish, aes(x=xv, y=yv)) + 
          geom_point() +
          stat_smooth(method='lm', formula = y ~ poly(x,2), size = 1) +
          stat_smooth(method='lm', formula = y ~ x, size = 1, color="red")

The data in stork.txt display the stress-induced corticosterone levels circulating in the blood of European white storks and their survival over the subsequent five years of study.

Make a scatterplot of the data
Which type of regression model is suitable for these data?
Perform an appropriate regression to predict survival from corticosterone
What is the pseudo-R2 of the model?
What is the p-value of the model?
Include the predicted curve in the scatterplot
Check the model assumptions

stork <-read.table("files/stork.txt", sep="\t", header=T)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

head(stork)

##   Corticosterone Survival
## 1           26.0        1
## 2           28.2        1
## 3           29.8        1
## 4           34.9        1
## 5           34.9        1
## 6           35.9        1

#Scatterplot of the data
ggplot(stork, aes(x=Corticosterone, y=Survival)) + 
          geom_point()

#it needs a bionial regression 
strork_bi <- glm(stork$Survival ~stork$Corticosterone ,  family = binomial)
summary(strork_bi)

## 
## Call:
## glm(formula = stork$Survival ~ stork$Corticosterone, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4316  -0.8724  -0.6703   1.2125   1.8211  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)  
## (Intercept)           2.70304    1.74725   1.547   0.1219  
## stork$Corticosterone -0.07980    0.04368  -1.827   0.0677 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 45.234  on 33  degrees of freedom
## Residual deviance: 41.396  on 32  degrees of freedom
## AIC: 45.396
## 
## Number of Fisher Scoring iterations: 4

PseudoR2(strork_bi)

## McFadden 
## 0.084852

#P value
chi2 <- chisq.test(table(stork))

## Warning in chisq.test(table(stork)): Chi-squared approximation may be incorrect

chi2$p.value

## [1] 0.4094979

#Include the predicted curve in the scatterplot
ggplot(stork, aes(x=Corticosterone, y=Survival))  +
  geom_point() +
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

#Check the model assumptions
check_model(strork_bi)

## Warning: Using `$` in model formulas can produce unexpected results. Specify your model
##   using the `data` argument instead.
##   Try: Survival ~ Corticosterone, data =
##   stork

The clusters.txt dataset contains the response variable Cancers (the number of reported cancer cases per year per clinic) and the explanatory variable Distance (the distance from a nuclear plant to the clinic in kilometers).

Make a scatterplot of the data
Which regression is the more appropriate for these data? (Don’t take overdispersion into account for now)
Given your choice, is the trend significant?
What is the pseudo-R2 of the model?
What is the p-value of the model?
Include the predicted relationship from the model in the scatterplot
Do you think there might be some evidence of overdispersion?
Perform a new generalized linear model with a distribution that better accounts for overdispersion
Check this last model assumptions

#reads txt data to table 
clus <- read.table(file="Files/clusters.txt", header = TRUE, sep = "",quote = "\"'",
                   dec = ".")

view(clus)
#plots the data clus and labels axis
plot(Cancers~Distance, xlab = "Distance(km)", pch=20, ylab="Yearly Cancer Observations (n)", col="blue", data = clus)

hist=(clus$Cancers)

#use lm to make a linear model 
Lin <- lm(Cancers~Distance, data = clus)
check_model(Lin, panel = FALSE)

## $NCV

## 
## $HOMOGENEITY

## 
## $OUTLIERS

## 
## $QQ

## 
## $NORM

summary(Lin)

## 
## Call:
## lm(formula = Cancers ~ Distance, data = clus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1766 -0.9252 -0.6448  0.3642  4.8329 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.179289   0.236777   4.981 2.95e-06 ***
## Distance    -0.005549   0.004210  -1.318    0.191    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.219 on 92 degrees of freedom
## Multiple R-squared:  0.01853,    Adjusted R-squared:  0.007859 
## F-statistic: 1.737 on 1 and 92 DF,  p-value: 0.1908

#The residuals are not normal distributed, I can not use a linear model
#Try the Generalized linear model (GLM)
#It was not Binomial, because the results are not binary
#so I choose poisson. 

Glm1 <-glm(Cancers~Distance, data = clus, family=poisson)
check_model(Glm1)

summary(Glm1)

## 
## Call:
## glm(formula = Cancers ~ Distance, family = poisson, data = clus)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5504  -1.3491  -1.1553   0.3877   3.1304  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.186865   0.188728   0.990   0.3221  
## Distance    -0.006138   0.003667  -1.674   0.0941 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 149.48  on 93  degrees of freedom
## Residual deviance: 146.64  on 92  degrees of freedom
## AIC: 262.41
## 
## Number of Fisher Scoring iterations: 5

r2(Glm1)

## # R2 for Generalized Linear Regression
##   Nagelkerke's R2: 0.037

#Draws a fited model
#relationship between a response variable and the predictor
plot(Cancers~Distance, xlab = "Distance(km)", pch=20, ylab="Yearly Cancer Observations (n)", col="red", data = clus)
dis <- seq(0,100) 
lo2 <- predict(Glm1,list(Distance=dis)) #predictions of the models are in log, need to be antilog
lines(dis,exp(lo2),col="#FF33FF")

#overdispersion?
#Residual Deviance is larger than the residual df
check_overdispersion(Glm1)

## # Overdispersion test
## 
##        dispersion ratio =   1.555
##   Pearson's Chi-Squared = 143.071
##                 p-value =   0.001

## Overdispersion detected.

#There is evidence of overdispersion


#The Poisson model assumes that the variance is equal to the mean
#ratio of residual deviance to degrees of freedom should be 1.More than 1 indicates overdispersion.
#When the count data exhibit overdispersion, the negative binomial distribution is a better fit.
Glm2 <-glm.nb(Cancers~Distance, data = clus) #from the MASS library 
summary(Glm2)

## 
## Call:
## glm.nb(formula = Cancers ~ Distance, data = clus, init.theta = 1.359466981, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3103  -1.1805  -1.0442   0.3065   1.9582  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.182490   0.252434   0.723    0.470
## Distance    -0.006041   0.004727  -1.278    0.201
## 
## (Dispersion parameter for Negative Binomial(1.3595) family taken to be 1)
## 
##     Null deviance: 96.647  on 93  degrees of freedom
## Residual deviance: 94.973  on 92  degrees of freedom
## AIC: 253.19
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.359 
##           Std. Err.:  0.612 
## 
##  2 x log-likelihood:  -247.191

r2(Glm2)

## # R2 for Generalized Linear Regression
##   Nagelkerke's R2: 0.027

check_model(Glm2)

#p value is 0.201, there is no evidence of Ha being correct.

Use the jaws.txt data to:

Make a scatterplot of the data (age explanatory, bone response)
Perform a non-linear regression assuming an asymptotic exponential relationship: \[ y=a(1-e^-cx) \]
Perform a non-linear regression assuming a Michaelis-Menten model: \[ y=ax/(1+bx) \]
Estimate the percentage of variation explained by both models (comparing them with a null model with only a constant)
Compare both models with Akaike’s Information Criterion (AIC). Which model is better?
Make a scatterplot of the data and include both regression lines

jaws <- read.table(file="Files/jaws.txt", header = TRUE, sep = "",quote = "\"'",
                   dec = ".")

jaws %>% plot(pch=10)

#non asymptotic 
jaws_a <-nls(data = jaws, bone~a*(1-exp(-c*age)),start=list(a=120,c=0.064))
xn <- 0:50 
yn <- predict(jaws_a,list(age=xn))
summary(jaws_a)

## 
## Formula: bone ~ a * (1 - exp(-c * age))
## 
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)    
## a 115.58056    2.84365  40.645  < 2e-16 ***
## c   0.11882    0.01233   9.635 3.69e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.1 on 52 degrees of freedom
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 1.369e-06

#Plotted
{jaws %>% plot(pch=16, main="Asymptotic")
lines(xn,yn,col="red")}

par(mfrow=c(2,1))

# Michaelis-Menten
jaws_b <- nls(data = jaws, bone~a*age/(1+b*age),start=list(a=8,b=0.08))
xv <- 0:50 
yv <- predict(jaws_b,list(age=xv))
summary(jaws_b)

## 
## Formula: bone ~ a * age/(1 + b * age)
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a 18.72539    2.52587   7.413 1.09e-09 ***
## b  0.13596    0.02339   5.814 3.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.77 on 52 degrees of freedom
## 
## Number of iterations to convergence: 7 
## Achieved convergence tolerance: 1.553e-06

#Plotted
{jaws %>% plot(pch=16, main="Michaelis-Menten")
lines(xv,yv,col="red")}

#comparing models
anova(jaws_a,jaws_b) #model b is better

## Analysis of Variance Table
## 
## Model 1: bone ~ a * (1 - exp(-c * age))
## Model 2: bone ~ a * age/(1 + b * age)
##   Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
## 1     52     8929.1                         
## 2     52     9854.4  0      0

In a recent paper we read: “Linear mixed effects modelling fit by restricted maximum likelihood was used to explain the variations in growth. The linear mixed effects model was generated using the lmer function in the R package lme4, with turbidity, temperature, tide, and wave action set as fixed factors and site and date set as random effects”. Write down the R code to recreate their model.
Researchers at the University of Arizona want to assess the germination rate of saguaros using a factorial design, with 3 levels of soil type (remnant, cultivated and restored) and 2 levels of sterilization (yes or no). The same experimental design was deployed in 4 different greenhouses. Each of the unique treatments was replicated in 5 pots. 6 seeds were planted in each pot.

How many fixed treatments (unique combinations) exist?
What is the total number of pots?
What is the total number of plants measured?
Write the R code that you would use to analyze these data

#a) 3*2 = 6, 3 levels of soil, 2 levels of sterilization and 6 unique combinations
#b) We will need (6*5*4) 120 pots
#c) they will be (6*120) 720 plants measured
#d) glmer(germ ~ soil * ster + (1|GH) + (1|GH:Pot), family=binomial, data = problem)

Check these hypothetical data from 4 subjects relating number of previous performances to negative affect.

What does the thick dashed black line represent?
What is depicting the solid black line?
What do the 4 thin dashed lines represent?

#a) Simple linear Regression 
#b) Random Variables, average of 4 trendiness
#c) Random intercept levels

Load dragons.RData

Perform a simple linear regression with testScore as response and bodyLength as explanatory
Plot the data with ggplot2 and add a linear regression line with confidence intervals. Hint: geom_smooth()
We collected multiple samples from 8 mountain ranges. Generate a boxplot using (or not) ggplot2 to explore this new explanatory variable
Now repeat the scatterplot in b) but coloring by mountain range and without linear regression line
Instead of coloring, use facet_wrap() to separate by mountain range
Perform a new linear model adding mountain range and assuming that it is fixed effects
Perform the same linear model as before but now assuming mountain range is random effects
How much of the variation in test scores is explained by the random effect (mountain range)
Estimate the R2 (conditional and marginal) of the mixed-effects model
Check this last model assumptions
Include the fitted lines for each mountain range (plot from e). [Hint: predict function within geom_line layer]

load("files/dragons.RData")
dragons_a <- lm(testScore ~ bodyLength, data = dragons)
summary(dragons_a)

## 
## Call:
## lm(formula = testScore ~ bodyLength, data = dragons)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.962 -16.411  -0.783  15.193  55.200 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -61.31783   12.06694  -5.081 5.38e-07 ***
## bodyLength    0.55487    0.05975   9.287  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.2 on 478 degrees of freedom
## Multiple R-squared:  0.1529, Adjusted R-squared:  0.1511 
## F-statistic: 86.25 on 1 and 478 DF,  p-value: < 2.2e-16

#Plot linar regression
ggplot(aes(y=testScore, x=bodyLength), data=dragons)+geom_point()+  
  geom_smooth(method="lm", se=T, col="#FF66CC") + labs(title="Dragons score and body lenght")

## `geom_smooth()` using formula 'y ~ x'

## Warning in if (se) "confidence" else "none": the condition has length > 1 and
## only the first element will be used

## Warning in if (se.fit) list(fit = predictor, se.fit = se, df = df,
## residual.scale = sqrt(res.var)) else predictor: the condition has length > 1 and
## only the first element will be used

## Warning in if (se) {: the condition has length > 1 and only the first element
## will be used

#boxplot for new explanatory variable
ggplot(data = dragons, mapping = aes(y = testScore, x = mountainRange, fill = mountainRange))+
  geom_boxplot() + labs(title = "Dragon Test Score and Mountain Range")

#d) scatter plot in b with the samples from each mountain range 
ggplot(aes(y=testScore, x=bodyLength, color = mountainRange), data=dragons)+
  geom_point()+
  labs(title = "Dragon Test Score and Body Length by Mountain Range",
       color = "Mountain Range")

#e) scatterplots by each mountainrange unsing facet_wrap()
ggplot(aes(y=testScore, x=bodyLength, color = mountainRange), data=dragons) +
  geom_point() +
  facet_wrap(.~mountainRange) + 
  labs(title = "Dragon Test Score and Body Length by Mountain Range",
       color = "Mountain Range")

#f Perform a new linear model adding mountain range and assuming that it is fixed effects
dragons_b <- lm(data = dragons, testScore ~ bodyLength + mountainRange)
summary(dragons_b)

## 
## Call:
## lm(formula = testScore ~ bodyLength + mountainRange, data = dragons)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.263  -9.926   0.361   9.994  44.488 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           20.83051   14.47218   1.439  0.15072    
## bodyLength             0.01267    0.07974   0.159  0.87379    
## mountainRangeCentral  36.58277    3.59929  10.164  < 2e-16 ***
## mountainRangeEmmental 16.20923    3.69665   4.385 1.43e-05 ***
## mountainRangeJulian   45.11469    4.19012  10.767  < 2e-16 ***
## mountainRangeLigurian 17.74779    3.67363   4.831 1.84e-06 ***
## mountainRangeMaritime 49.88133    3.13924  15.890  < 2e-16 ***
## mountainRangeSarntal  41.97841    3.19717  13.130  < 2e-16 ***
## mountainRangeSouthern  8.51961    2.73128   3.119  0.00192 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.96 on 471 degrees of freedom
## Multiple R-squared:  0.5843, Adjusted R-squared:  0.5773 
## F-statistic: 82.76 on 8 and 471 DF,  p-value: < 2.2e-16

#g Perform the same linear model as before but now assuming mountain range is random effects
dragons_c <- lme4::lmer(testScore ~ bodyLength + (1|mountainRange), data = dragons)
summary(dragons_c)

## Linear mixed model fit by REML ['lmerMod']
## Formula: testScore ~ bodyLength + (1 | mountainRange)
##    Data: dragons
## 
## REML criterion at convergence: 3991.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.4815 -0.6513  0.0066  0.6685  2.9583 
## 
## Random effects:
##  Groups        Name        Variance Std.Dev.
##  mountainRange (Intercept) 339.7    18.43   
##  Residual                  223.8    14.96   
## Number of obs: 480, groups:  mountainRange, 8
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 43.70938   17.13489   2.551
## bodyLength   0.03316    0.07865   0.422
## 
## Correlation of Fixed Effects:
##            (Intr)
## bodyLength -0.924

#h) How much of the variation in test scores is explained by the random effect (mountain range)
#100*(339/(339+223)) = 60% of variation explained 

#Estimate the R2 (conditional and marginal) of the mixed-effects model
r2(dragons_c)

## # R2 for Mixed Models
## 
##   Conditional R2: 0.603
##      Marginal R2: 0.001

#Check assumptions
check_model(dragons_c)

Data in Estuaries.csv correspond to counts of invertebrates at 3-4 sites in each of 7 (randomly chosen) estuaries.

Fit a linear mixed model with Total as response and Modification as explanatory, controlling for Estuary
Estimate the R2 (conditional and marginal) of this model
Plot the data with ggplot2 in a way that helps you understand the different effects
Include the variable Site as a random effect. Do you think this corresponds to a crossed or a nested design?
What are the R2 (conditional and marginal) of the model including Site
Check the model assumptions
Plot the data trying to include Site
Transform the variable Hydroid to presence/absence data
Fit a generalized linear mixed model (GLMM) with this transformed variable as Response and the same fixed and random effects as in d). [Hint: function glmer]
Check the model assumptions

estuaries <- read.csv("Files/Estuaries.csv", header = T)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

#a)  Linear model with total as a response and modification as explanatory variable
estu.lmer<-lmer(Total ~ Modification + (1|Estuary), data=estuaries)

#b) Estimate R2
summary(estu.lmer)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Total ~ Modification + (1 | Estuary)
##    Data: estuaries
## 
## REML criterion at convergence: 394.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3859 -0.7142  0.2766  0.5240  2.0456 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Estuary  (Intercept) 55.12    7.424   
##  Residual             86.07    9.277   
## Number of obs: 54, groups:  Estuary, 7
## 
## Fixed effects:
##                      Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)            40.973      4.727   5.090   8.668 0.000309 ***
## ModificationPristine  -14.473      6.230   5.018  -2.323 0.067611 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## MdfctnPrstn -0.759

r2(estu.lmer)

## # R2 for Mixed Models
## 
##   Conditional R2: 0.553
##      Marginal R2: 0.267

#C) Plotting the data
ggplot(estuaries, aes(x = Modification, y = Total)) +
  geom_boxplot()+
  geom_jitter(aes(color=Estuary))

ggplot(estuaries, aes(x = Modification, y = Total)) +
  geom_boxplot()+
  geom_jitter()+
  facet_wrap(~Estuary)

#D) Include varable Site
estu.lmer.site<-lmer(Total ~ Modification + (1|Estuary/Site), data=estuaries)
#implicit nesting

#summary(lmer(Total ~ Modification + (1|Estuary) + (1|Estuary:Site), data=estuaries)) #explicit nesting

#R2 for the model 
summary(estu.lmer.site)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Total ~ Modification + (1 | Estuary/Site)
##    Data: estuaries
## 
## REML criterion at convergence: 386.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8686 -0.6687  0.1504  0.6505  1.9816 
## 
## Random effects:
##  Groups       Name        Variance Std.Dev.
##  Site:Estuary (Intercept) 49.85    7.061   
##  Estuary      (Intercept) 47.59    6.899   
##  Residual                 43.65    6.607   
## Number of obs: 54, groups:  Site:Estuary, 27; Estuary, 7
## 
## Fixed effects:
##                      Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)            41.053      4.739   5.143   8.662 0.000294 ***
## ModificationPristine  -14.553      6.232   5.028  -2.335 0.066497 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## MdfctnPrstn -0.760

r2(estu.lmer.site)

## # R2 for Mixed Models
## 
##   Conditional R2: 0.774
##      Marginal R2: 0.270

#Check Model
check_model(estu.lmer.site)

#g) plot the data 
ggplot(estuaries, aes(x = Modification, y = Total)) +
  geom_boxplot()+
  geom_jitter(aes(color=as.factor(Site)))+
  facet_wrap(~Estuary)

ggplot(estuaries, aes(x = Modification, y = Total)) +
  geom_violin()+
  geom_jitter(aes(color=Estuary, shape=as.factor(Site)))

#transform the variable Hydroid (1/0) binary
estuaries$HydroidPres <- estuaries$Hydroid > 0

#Fit model 
hydro.glmm.site<-glmer(HydroidPres ~ Modification + (1|Estuary/Site), family=binomial, data=estuaries) #implicit nesting

#See R2 and summary
summary(hydro.glmm.site)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: HydroidPres ~ Modification + (1 | Estuary/Site)
##    Data: estuaries
## 
##      AIC      BIC   logLik deviance df.resid 
##     57.1     65.0    -24.5     49.1       50 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.06063 -0.25028 -0.05443  0.25475  0.98719 
## 
## Random effects:
##  Groups       Name        Variance Std.Dev.
##  Site:Estuary (Intercept) 14.45    3.802   
##  Estuary      (Intercept)  1.13    1.063   
## Number of obs: 54, groups:  Site:Estuary, 27; Estuary, 7
## 
## Fixed effects:
##                      Estimate Std. Error z value Pr(>|z|)  
## (Intercept)            -5.710      2.419  -2.361   0.0182 *
## ModificationPristine    6.534      3.140   2.081   0.0375 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## MdfctnPrstn -0.888

r2(hydro.glmm.site)

## # R2 for Mixed Models
## 
##   Conditional R2: 0.888
##      Marginal R2: 0.358

check_model(hydro.glmm.site)

#Check residuals
simulateResiduals(hydro.glmm.site, plot = T)

## Warning in if (plot == TRUE) plot(out): the condition has length > 1 and only
## the first element will be used

## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
##  
## Scaled residual values: 0.6382768 0.3853122 0.8086934 0.01626812 0.9424474 0.9426142 0.5324421 0.3949867 0.82084 0.4799594 0.8009645 0.03378411 0.3224195 0.3157432 0.602992 0.6115054 0.276876 0.8050706 0.7838545 0.5106198 ...

Time series and spatial analysis

Write a function to calculate the 3-point moving average for an input vector. The formula is: \[ y_i^'= (y_(i-1)+y_i+y_(i+1))/3 \]

moving_avg <- function (x) {
  
  y  <-  numeric(length(x)-2)

  for (i in 2:(length(x)-1)) {

   y[i]<-(x[i-1]+x[i]+x[i+1])/3
}
 
  y

}

#test with random numbers
moving_avg(c(rnorm(40)))

##  [1]  0.00000000  1.02032924  1.18610202  1.42464709  0.74952955  0.38474044
##  [7] -0.23191881  0.64558512  0.33832756  0.85327527  0.03993191  0.04090928
## [13] -0.42557406 -0.26784985  0.03194553  0.05041529  0.16330468  0.46015231
## [19]  0.22947965 -0.06850260 -0.47269192 -0.31669317 -0.48720699 -0.51235536
## [25] -0.48080283 -0.06688149  0.53849030  0.76897853  1.20209244  0.46072665
## [31] -0.23291403 -0.48627337 -0.52215194  0.24949867  0.01703505  0.07929347
## [37]  0.41731482  0.04541708 -0.43191837

Read the temp.txt file. The data correspond to monthly average temperatures.

Plot the time series data. [Hint: first you need to create a monthly time series object]
Calculate the 5-point moving average and plot it together with the time series
Decompose the time series into seasonal, trend and residual error components
Generate a temporal correlogram to assess the autocorrelation of the time series
Generate a new correlogram but removing the trend and seasonal variation
Generate a partial correlogram plot
Find the best ARIMA model using the forecast package
Estimate future values using the previous ARIMA model and plot the results

#read Temp file
temp<-read.table("Files/temp.txt", header=T)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

#Creates a monthly time series 
temp.ts<-ts(temp$temps, frequency=12)
plot(temp.ts)

#Calculartes 5-point moving average
library(TTR)
temp.ma<-SMA(temp.ts, n = 5)
temp.decomp<-decompose(temp.ts)
plot(temp.decomp)

#Decompose time series
acf(temp.ts)

acf(temp.decomp$random[!is.na(temp.decomp$random)])

#Partial Corellogram
pacf(temp.ts)

#find arima model
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## 
## Attaching package: 'forecast'

## The following object is masked from 'package:DescTools':
## 
##     BoxCox

temp.fit<-auto.arima(temp.ts)
summary(temp.fit)

## Series: temp.ts 
## ARIMA(0,0,1)(2,1,0)[12] 
## 
## Coefficients:
##          ma1     sar1     sar2
##       0.2456  -0.5212  -0.3233
## s.e.  0.0774   0.0823   0.0827
## 
## sigma^2 = 3.772:  log likelihood = -300.76
## AIC=609.52   AICc=609.81   BIC=621.4
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.2384692 1.846384 1.498262 0.2682136 11.90101 0.8327783
##                     ACF1
## Training set 0.005413047

checkresiduals(temp.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(2,1,0)[12]
## Q* = 38.688, df = 21, p-value = 0.01069
## 
## Model df: 3.   Total lags used: 24

#Estimate future values 
temp.fcast<-forecast(temp.fit)
plot(temp.fcast)

Read the bac_dust.txt dataset. The data corresponds to proteobacterial abundance and bacterial species in dust samples collected around the globe.

Make a map of the sampling points using the R libraries maps and ggplot2. Color the points based on the continent they were collected.
Make a map of the sampling points using the R libraries maps and ggplot2. Size the points based on bacterial richness.
Calculate Moran’s I autocorrelation for the variable Richness and Proteobacteria
Make a simple linear regression of Richness as a function of Proteobacteria. Are the residuals spatially autocorrelated?
Perform a generalized linear mixed model (GLMM) using the coordinates as random effect and using the Matérn covariance function (R package spaMM).
Report the nu and the rho values for this spatial regression model.
What can you say about the estimated correlation of this particular spatial regression model across increasing distances between pairs of locations.

#Read 
bac<-read.table("Files/bac_dust.txt", header = T, sep = "\t")

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

#Make a map of the world
library(maps)

## 
## Attaching package: 'maps'

## The following object is masked _by_ '.GlobalEnv':
## 
##     ozone

## The following object is masked from 'package:purrr':
## 
##     map

library(ggplot2)

world_map <- map_data("world")
ggplot(world_map)+
  geom_polygon(aes(x = long, y = lat, group = group), fill = "lightgray", colour = "black", size = 0.1)+
  geom_point(data = bac, aes(x = Longitude, y = Latitude, color = Continent),  size = 2)+
  scale_color_manual(values = c("SouthAmerica" = 'chartreuse4', "NorthAmerica" = 'chartreuse1', "Africa" = 'goldenrod1', "Oceania" = 'darkorchid3', "Europe" = 'firebrick4', "Asia" = 'cornflowerblue'))+
  theme_bw(base_size = 15)+
  theme(legend.title = element_blank(), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank())

#Map of the sampling points
ggplot(world_map)+
  geom_polygon(aes(x = long, y = lat, group = group), fill = "lightgray", colour = "black", size = 0.1)+
  geom_point(data = bac, aes(x = Longitude, y = Latitude, color = Continent, size=Richness), alpha=0.6)+
  scale_color_manual(values = c("SouthAmerica" = 'chartreuse4', "NorthAmerica" = 'chartreuse1', "Africa" = 'goldenrod1', "Oceania" = 'darkorchid3', "Europe" = 'firebrick4', "Asia" = 'cornflowerblue'))+
  theme_bw(base_size = 15)+
  theme(legend.title = element_blank(), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank())

#Calculate Morans autocrrelation
bac.dists <- as.matrix(dist(cbind(bac$Latitude, bac$Longitude)))
bac.dists.inv <- 1/bac.dists
diag(bac.dists.inv) <- 0
library(ape)

## Warning: package 'ape' was built under R version 4.1.3

Moran.I(bac$Richness, bac.dists.inv)

## $observed
## [1] 0.1159867
## 
## $expected
## [1] -0.003225806
## 
## $sd
## [1] 0.03860909
## 
## $p.value
## [1] 0.002017263

Moran.I(bac$Proteobacteria, bac.dists.inv)

## $observed
## [1] 0.216501
## 
## $expected
## [1] -0.003225806
## 
## $sd
## [1] 0.03858986
## 
## $p.value
## [1] 1.241692e-08

#Simple lineal 
anova(lm(Richness~Proteobacteria, data=bac)) #Not significant

## Analysis of Variance Table
## 
## Response: Richness
##                 Df   Sum Sq Mean Sq F value Pr(>F)
## Proteobacteria   1    38901   38901  0.5349 0.4651
## Residuals      309 22471204   72722

bac$resid <-resid(lm(Richness~Proteobacteria, data=bac))

ggplot(world_map)+
  geom_polygon(aes(x = long, y = lat, group = group), fill = "lightgray", colour = "black", size = 0.1)+
  geom_point(data = bac, aes(x = Longitude, y = Latitude, color = Continent, size=resid), alpha=0.6)+
  scale_color_manual(values = c("SouthAmerica" = 'chartreuse4', "NorthAmerica" = 'chartreuse1', "Africa" = 'goldenrod1', "Oceania" = 'darkorchid3', "Europe" = 'firebrick4', "Asia" = 'cornflowerblue'))+
  theme_bw(base_size = 15)+
  theme(legend.title = element_blank(), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank())

Moran.I(bac$resid, bac.dists.inv)

## $observed
## [1] 0.1098749
## 
## $expected
## [1] -0.003225806
## 
## $sd
## [1] 0.03861023
## 
## $p.value
## [1] 0.003397342

library(spaMM)

## Warning: package 'spaMM' was built under R version 4.1.3

## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy

## spaMM (Rousset & Ferdy, 2014, version 3.11.3) is loaded.
## Type 'help(spaMM)' for a short introduction,
## 'news(package='spaMM')' for news,
## and 'citation('spaMM')' for proper citation.
## Further infos, slides, etc. at https://gitlab.mbb.univ-montp2.fr/francois/spamm-ref.

fitme(Richness~Proteobacteria+Matern(1|Latitude+Longitude), data = bac)

## formula: Richness ~ Proteobacteria + Matern(1 | Latitude + Longitude)
## ML: Estimation of corrPars, lambda and phi by ML.
##     Estimation of fixed effects by ML.
## Estimation of lambda and phi by 'outer' ML, maximizing p_v.
## family: gaussian( link = identity ) 
##  ------------ Fixed effects (beta) ------------
##                Estimate Cond. SE t-value
## (Intercept)      411.58    48.47  8.4910
## Proteobacteria    37.66   112.90  0.3336
##  --------------- Random effects ---------------
## Family: gaussian( link = identity ) 
##                    --- Correlation parameters:
##      1.nu     1.rho 
## 0.2041722 0.3145336 
##            --- Variance parameters ('lambda'):
## lambda = var(u) for u ~ Gaussian; 
##    Latitude .  :  19190  
## # of obs: 311; # of groups: Latitude ., 311 
##  -------------- Residual variance  ------------
## phi estimate was 54713 
##  ------------- Likelihood values  -------------
##                         logLik
## p_v(h) (marginal L): -2170.262

#nu     rho 
#0.2041723 0.3145336
dd<-dist(bac[,c("Longitude", "Latitude")])
mm<-MaternCorr(dd, nu=0.2041723, rho=0.3145336)
plot(as.numeric(dd), as.numeric(mm), xlab="Distance between pairs of locations", ylab="Estimated correlation")

#the spatial correlation is bigger when the locations are closer.

Download the map of Spain.

Make a basic map of Spain
Get information on the population of Spanish cities and make a map of Spain including the locations and population of Spanish cities.
Visit Spain and enjoy

library(maps)
esp <-map_data("world", region="spain") 

#basic map of spain
ggplot(esp)+
  geom_polygon(aes(x = long, y = lat, group=group), fill='yellow', color=alpha("red",0.4), size=2) + 
  labs(title="España") + theme_void()

#population of spanish cities
cit <-get('world.cities')
cit_esp <-cit[cit$country.etc== 'Spain',]

ggplot(esp) +
  geom_polygon(aes(x = long, y = lat, group=group), fill='yellow', color=alpha("red",0.4), size=2) + 
  geom_point(data=cit_esp, aes(x=long, y=lat, size=pop), alpha=0.4) +
  labs(title=" Poblacion Española por ciudad") + 
  theme_void()

Download GBIF georeferenced occurrence data of the Pyrenean desman (Galemys pyrenaicus) and make a map of its geographical distribution (Hint: it is only present in Spain, Andorra, Portugal and France).

#Find the species
#library(rgbif)
#Galemys <- rgbif::occ_search(scientificName = "Galemys pyrenaicus", hasCoordinate = T)
#I had some problems with knighting so I had to save it as a csv an call it from files
Galemys <- read.csv("files/galemys.csv")
Galemys_df <- as.data.frame(Galemys[,c("decimalLatitude", "decimalLongitude")])

#look for the countries
IbeF <-map_data("world", region=c("Spain", "Portugal", "France", "Andorra"))

#Plot the map
ggplot(IbeF, aes(x=long, y=lat)) +
  geom_polygon(aes(group=group, fill=region)) +
 geom_point(data=Galemys_df, aes(x=decimalLongitude, y=decimalLatitude), color="black", alpha=0.2, size=2) +
  theme_void() +
  labs(title= "Distribution of Galemys pyrenaicus")

Diversity analysis

Load the dune_bio.txt dataset. Species should be in columns and sites in rows.

Calculate the total number of individuals of all species
Calculate the total number of individuals for each species
Calculate the average number of individuals for each species
Calculate the total number of individuals for each site
Calculate the average number of individuals for each site
Write a function to report the median number of individuals for each species and the median number of individuals for each site
Use the vegan function decostand() to transform the dataset to relative abundances. How would you confirm the transformation worked?
Use the vegan function decostand() to standardize the dataset into the range 0 to 1
Use the vegan function decostand() to standardize the dataset to mean=0 and variance=1

#Charge the file
dune.bio<-read.table("files/dune_bio.txt", sep="\t", header=T, row.names=1)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

#all indiduals
sum(dune.bio)

## [1] 685

#in every species 
colSums(dune.bio)

## Belper Empnig Junbuf Junart Airpra Elepal Rumace Viclat Brarut Ranfla Cirarv 
##     13      2     13     18      5     25     18      4     49     14      2 
## Hyprad Leoaut Potpal Poapra Calcus Tripra Trirep Antodo Salrep Achmil Poatri 
##      9     54      4     48     10      9     47     21     11     16     63 
## Chealb Elyrep Sagpro Plalan Agrsto Lolper Alogen Brohor 
##      1     26     20     26     48     58     36     15

#Calculate average
colMeans(dune.bio)

## Belper Empnig Junbuf Junart Airpra Elepal Rumace Viclat Brarut Ranfla Cirarv 
##   0.65   0.10   0.65   0.90   0.25   1.25   0.90   0.20   2.45   0.70   0.10 
## Hyprad Leoaut Potpal Poapra Calcus Tripra Trirep Antodo Salrep Achmil Poatri 
##   0.45   2.70   0.20   2.40   0.50   0.45   2.35   1.05   0.55   0.80   3.15 
## Chealb Elyrep Sagpro Plalan Agrsto Lolper Alogen Brohor 
##   0.05   1.30   1.00   1.30   2.40   2.90   1.80   0.75

#in each site sum 
rowSums(dune.bio)

##  2 13  4 16  6  1  8  5 17 15 10 11  9 18  3 20 14 19 12  7 
## 42 33 45 33 48 18 40 43 15 23 43 32 42 27 40 31 24 31 35 40

#in each site average 
rowMeans(dune.bio)

##         2        13         4        16         6         1         8         5 
## 1.4000000 1.1000000 1.5000000 1.1000000 1.6000000 0.6000000 1.3333333 1.4333333 
##        17        15        10        11         9        18         3        20 
## 0.5000000 0.7666667 1.4333333 1.0666667 1.4000000 0.9000000 1.3333333 1.0333333 
##        14        19        12         7 
## 0.8000000 1.0333333 1.1666667 1.3333333

#Report a median number for each species 
median_sps_sites<-function(x){
  cat("The median number of individuals per sp is", apply(x, 2, median), "and for sites is", apply(x, 1, median))
}
median_sps_sites(dune.bio)

## The median number of individuals per sp is 0 0 0 0 0 0 0 0 2 0 0 0 2 0 3 0 0 2 0 0 0 4 0 0 0 0 1.5 2 0 0 and for sites is 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

#Relative abudance trasnforamtion.
#install.packages("vegan")
library(vegan)

## Warning: package 'vegan' was built under R version 4.1.3

## Loading required package: permute

## Warning: package 'permute' was built under R version 4.1.3

## 
## Attaching package: 'permute'

## The following object is masked from 'package:spaMM':
## 
##     how

## Loading required package: lattice

## This is vegan 2.6-2

dune.bio.relabu<-decostand(dune.bio, method = "total")

#Confirm the transformation 
#making the sums of the row and the result shoud be one
rowSums(dune.bio.relabu)

##  2 13  4 16  6  1  8  5 17 15 10 11  9 18  3 20 14 19 12  7 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1

#standarize the data set
dune.bio.range<-decostand(dune.bio, method = "range")
dune.bio.stand<-decostand(dune.bio, method = "standardize")

Write a function to calculate the observed richness of a vector representing the abundances of different species in a community.

rich <- function(x){
  return(specnumber(decostand(x, method = "total")))
}

Write a function to calculate the Shannon-Wiener diversity index of a vector representing the abundances of different species in a community.

richSW_fun <- function(x){
  return(diversity(decostand(x, method = "total"), index="shannon"))
}

Write a function that calculates both the observed richness and the Shannon-Wiener diversity index for each community in a matrix, and writes an output table with the results. You can use the vegan built-in functions. Test your function with the dune_bio.txt dataset.

richSW <- function(x){

    return(diversity(specnumber(decostand(x, method = "total"))
                  , index="shannon"))
}

Make a rank-abundance curve using the second site (13) of the dune_bio.txt dataset. Fit a lognormal model to the data.

dune.bio<-read.table("files/dune_bio.txt", sep="\t", header=T, row.names=1)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

plot(rad.lognormal(dune.bio), lty=2, lwd=2)

rad.lognormal(dune.bio)

## 
## RAD model: Log-Normal 
## Family: poisson 
## No. of species:  197 
## Total abundance: 685 
## 
##      log.mu   log.sigma    Deviance         AIC         BIC 
##   1.1511247   0.4378804   8.9679084 611.3184208 617.8848283

# fit lognormal 
plot(radfit(dune.bio))

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

Create a Euclidean distance matrix after standardization using the first (A1), second (Moisture) and fifth (Manure) variables of dune_env.txt. Then create a Bray-Curtis distance matrix using dune_bio.txt. Perform a Mantel test on both distance matrices and plot the relationship.

#Call files
dune.bio<-read.table("files/dune_bio.txt", sep="\t", header=T, row.names=1)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

dune.env<-read.table("files/dune_env.txt", sep="\t", header=T, row.names=1)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

#Encludian Distance
dune.env.euc<-vegdist(decostand(dune.env[,c(1,2,5)], method="standardize"), method="euclidean")

#Bray-Curtis distance matrix
dune.bio.bray<-vegdist(dune.bio, method="bray")

#Perform mantel test
mantel(dune.env.euc, dune.bio.bray)

## 
## Mantel statistic based on Pearson's product-moment correlation 
## 
## Call:
## mantel(xdis = dune.env.euc, ydis = dune.bio.bray) 
## 
## Mantel statistic r: 0.5496 
##       Significance: 0.001 
## 
## Upper quantiles of permutations (null model):
##   90%   95% 97.5%   99% 
## 0.150 0.185 0.217 0.250 
## Permutation: free
## Number of permutations: 999

#plot the relanshipship 
plot(dune.env.euc, dune.bio.bray)

Cluster analysis

Make dendrograms of the results of hierarchical clustering using the single, average and complete methods. Use the dune_bio.txt dataset after creating a Bray-Curtis distance matrix.

dune.bio<-read.table("files/dune_bio.txt", sep="\t", header=T, row.names=1)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

# Data for the clustering 
dune.bio_veg <-vegdist(dune.bio, method = "bray") 

#Single method
single_dune.bio_veg <- hclust(dune.bio_veg, method="single")
plot(single_dune.bio_veg, main=" hierarchical clustering using the single method")

#Average method
avg_dune.bio_veg <-hclust(dune.bio_veg, method="average")
plot(avg_dune.bio_veg, main=" hierarchical clustering using the average method")

#complete method
com_dune.bio_veg <-hclust(dune.bio_veg, method="complete")
plot(com_dune.bio_veg, main=" hierarchical clustering using the complete method")

Perform k-means partitions from 2 to 6 on the dune_bio.txt dataset and select the best partition using the Calinski criterion. Visualize the results.

# Non-hierarchical clustering
dunebio.nhc <- cascadeKM(dune.bio, inf.gr=2, sup.gr=6) 

dunebio.nhc$results

##             2 groups   3 groups   4 groups   5 groups   6 groups
## SSE      1218.500000 950.516667 777.333333 676.500000 593.750000
## calinski    5.611243   5.793253   5.633047   5.110033   4.737482

dunebio.nhc <- cascadeKM(dune.bio, inf.gr=2, sup.gr=8)

plot(dunebio.nhc, sortg=T)

## Warning in if (sortg) {: the condition has length > 1 and only the first element
## will be used

Ordination

Load the varechem dataset within the R package vegan. This data frame collects soil characteristics. Given the nature of this dataset perform a PCA or a CA ordination analysis.

How much variation is explained by the two first axes?
Make a screeplot of the results
Plot the ordination results of the sites
Plot both the sites scores and the soil characteristics scores focusing on the soil variables

#Call files
dune.bio<-read.table("files/dune_bio.txt", sep="\t", header=T, row.names=1)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (header) {: the condition has length > 1 and only the first
## element will be used

dune.env<-read.table("files/dune_env.txt", sep="\t", header=T, row.names=1)

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

## Warning in if (!header) rlabp <- FALSE: the condition has length > 1 and only
## the first element will be used

dune.bio.nmds<-metaMDS(dune.bio, distance="bray", k=2)

## Run 0 stress 0.1192678 
## Run 1 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 0.02027054  max resid 0.06496414 
## Run 2 stress 0.2045511 
## Run 3 stress 0.1922241 
## Run 4 stress 0.1192678 
## Run 5 stress 0.1808911 
## Run 6 stress 0.2075713 
## Run 7 stress 0.1183186 
## ... Procrustes: rmse 9.28724e-06  max resid 2.767461e-05 
## ... Similar to previous best
## Run 8 stress 0.1183186 
## ... Procrustes: rmse 8.528153e-06  max resid 2.500458e-05 
## ... Similar to previous best
## Run 9 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 8.148522e-06  max resid 2.492115e-05 
## ... Similar to previous best
## Run 10 stress 0.1192678 
## Run 11 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 1.999142e-06  max resid 6.878084e-06 
## ... Similar to previous best
## Run 12 stress 0.1183186 
## ... Procrustes: rmse 5.527196e-06  max resid 1.760242e-05 
## ... Similar to previous best
## Run 13 stress 0.1183186 
## ... Procrustes: rmse 8.666665e-06  max resid 2.591284e-05 
## ... Similar to previous best
## Run 14 stress 0.1809577 
## Run 15 stress 0.1183186 
## ... Procrustes: rmse 1.404746e-05  max resid 3.491167e-05 
## ... Similar to previous best
## Run 16 stress 0.1192679 
## Run 17 stress 0.1889645 
## Run 18 stress 0.1192678 
## Run 19 stress 0.1183186 
## ... Procrustes: rmse 7.065505e-06  max resid 2.201767e-05 
## ... Similar to previous best
## Run 20 stress 0.1192678 
## *** Solution reached

dune.bio.nmds$stress

## [1] 0.1183186

#stressplots
stressplot(dune.bio.nmds)

#Plot the ordination
plot(dune.bio.nmds, type="t", display="sites")

rownames(dune.env)==rownames(dune.bio) #Check row names in both files

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE

#take data from dunebio and adds to dune end
dune.env$Axis01<-dune.bio.nmds$points[,1] 
dune.env$Axis02<-dune.bio.nmds$points[,2]
dune.env$Richness<-specnumber(dune.bio)

#Plot site scores 
ggplot(dune.env, aes(x=Axis01, y=Axis02))+
  geom_point(aes(fill=Management, size=Richness), alpha=0.6, pch=21)

The dune_bio.txt dataset corresponds to species abundances. Given the nature of this dataset perform a PCA or a CA ordination analysis.

How much variation is explained by the two first axes?
Make a screeplot of the results
Plot the ordination results of the sites

dune.bio.cca <-vegan::cca(dune.bio)
dune.bio.cca

## Call: cca(X = dune.bio)
## 
##               Inertia Rank
## Total           2.115     
## Unconstrained   2.115   19
## Inertia is scaled Chi-square 
## 
## Eigenvalues for unconstrained axes:
##    CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8 
## 0.5360 0.4001 0.2598 0.1760 0.1448 0.1079 0.0925 0.0809 
## (Showing 8 of 19 unconstrained eigenvalues)

summary(dune.bio.cca)

## 
## Call:
## cca(X = dune.bio) 
## 
## Partitioning of scaled Chi-square:
##               Inertia Proportion
## Total           2.115          1
## Unconstrained   2.115          1
## 
## Eigenvalues, and their contribution to the scaled Chi-square 
## 
## Importance of components:
##                          CA1    CA2    CA3     CA4     CA5     CA6     CA7
## Eigenvalue            0.5360 0.4001 0.2598 0.17598 0.14476 0.10791 0.09247
## Proportion Explained  0.2534 0.1892 0.1228 0.08319 0.06844 0.05102 0.04372
## Cumulative Proportion 0.2534 0.4426 0.5654 0.64858 0.71702 0.76804 0.81175
##                           CA8     CA9    CA10    CA11    CA12    CA13     CA14
## Eigenvalue            0.08091 0.07332 0.05630 0.04826 0.04125 0.03523 0.020529
## Proportion Explained  0.03825 0.03466 0.02661 0.02282 0.01950 0.01665 0.009705
## Cumulative Proportion 0.85000 0.88467 0.91128 0.93410 0.95360 0.97025 0.979955
##                           CA15     CA16     CA17     CA18     CA19
## Eigenvalue            0.014911 0.009074 0.007938 0.007002 0.003477
## Proportion Explained  0.007049 0.004290 0.003753 0.003310 0.001644
## Cumulative Proportion 0.987004 0.991293 0.995046 0.998356 1.000000
## 
## Scaling 2 for species and site scores
## * Species are scaled proportional to eigenvalues
## * Sites are unscaled: weighted dispersion equal on all dimensions
## 
## 
## Species scores
## 
##             CA1      CA2      CA3       CA4       CA5      CA6
## Belper -0.50018 -0.35503 -0.15239 -0.704153 -0.058546 -0.07308
## Empnig -0.69027  3.26420  1.95716 -0.176936 -0.073518  0.16083
## Junbuf  0.08157 -0.68074  1.00545  1.078390  0.268360 -0.24168
## Junart  1.27580  0.09963 -0.09320  0.005536  0.289410  0.78247
## Airpra -1.00434  3.06749  1.33773  0.194305 -1.081813  0.53699
## Elepal  1.76383  0.34562 -0.57336 -0.002976 -0.332396  0.14688
## Rumace -0.65289 -0.25525 -0.59728  1.160164  0.255849  0.32730
## Viclat -0.61893  0.37140 -0.46057 -1.000375  1.162652 -1.44971
## Brarut  0.18222  0.26477 -0.16606  0.064009  0.576334 -0.07741
## Ranfla  1.55886  0.30700 -0.29765  0.046974 -0.008747  0.14744
## Cirarv -0.05647 -0.76398  0.91793 -1.175919 -0.384024  0.13985
## Hyprad -0.85408  2.52821  1.13951 -0.175115 -0.311874 -0.11177
## Leoaut -0.19566  0.38884  0.03975 -0.130392  0.141232 -0.23717
## Potpal  1.91690  0.52150 -1.18215 -0.021738 -1.359988 -1.31207
## Poapra -0.38919 -0.32999 -0.02015 -0.358371  0.079296  0.05165
## Calcus  1.95199  0.56743 -0.85948 -0.098969 -0.556737 -0.23282
## Tripra -0.88116 -0.09792 -1.18172  1.282429  0.325706  0.33388
## Trirep -0.07666 -0.02032 -0.20594  0.026462 -0.186748 -0.53957
## Antodo -0.96676  1.08361 -0.17188  0.459788 -0.607533  0.30425
## Salrep  0.61035  1.54868  0.04970 -0.607136  1.429729  0.55183
## Achmil -0.90859  0.08461 -0.58636 -0.008919 -0.660183  0.18877
## Poatri -0.18185 -0.53997  0.23388  0.178834 -0.155342  0.07584
## Chealb  0.42445 -0.84402  1.59029  1.248755 -0.207480 -0.87566
## Elyrep -0.37074 -0.74148  0.26238 -0.566308 -0.270122  0.72624
## Sagpro  0.00364  0.01719  1.11570  0.066981  0.186654 -0.32463
## Plalan -0.84058  0.24886 -0.78066  0.371149  0.271377 -0.11989
## Agrsto  0.93378 -0.20651  0.28165  0.024293 -0.139326  0.02256
## Lolper -0.50272 -0.35955 -0.21821 -0.474727  0.101494  0.01594
## Alogen  0.40088 -0.61839  0.85013  0.346740  0.016509 -0.10169
## Brohor -0.65762 -0.40634 -0.30685 -0.496751 -0.561358 -0.07004
## 
## 
## Site scores (weighted averages of species scores)
## 
##         CA1        CA2      CA3      CA4      CA5      CA6
## 2  -0.63268 -0.6958357 -0.09708 -1.18695 -0.97686 -0.06575
## 13  0.42445 -0.8440195  1.59029  1.24876 -0.20748 -0.87566
## 4  -0.05647 -0.7639784  0.91793 -1.17592 -0.38402  0.13985
## 16  2.00229  0.1090627 -0.33414  0.33760 -0.50097  0.76159
## 6  -0.85633 -0.0005408 -1.39735  1.59909  0.65494  0.19386
## 1  -0.81167 -1.0826714 -0.14479 -2.10665 -0.39287  1.83462
## 8   0.76268 -0.2968459  0.35648 -0.10772  0.17507  0.36444
## 5  -0.95293 -0.1846015 -0.95609  0.86853 -0.34552  0.98333
## 17 -1.47545  2.7724102  0.40859  0.75117 -2.59425  1.10122
## 15  1.91384  0.5079036 -0.96567  0.04227 -0.50681 -0.19370
## 10 -0.87885 -0.0353136 -0.82987 -0.68053 -0.75438 -0.81070
## 11 -0.64223  0.4440332 -0.17371 -1.09684  1.37462 -2.00626
## 9   0.09693 -0.7864314  0.86492  0.40090  0.28704  1.02783
## 18 -0.31241  0.6328355 -0.66501 -1.12728  2.65575 -0.97565
## 3  -0.10148 -0.9128732  0.68815 -0.68137 -0.08709  0.28678
## 20  1.94438  1.0688809 -0.66595 -0.55317  1.59606  1.70292
## 14  1.91996  0.5351062 -1.39863 -0.08575 -2.21317 -2.43044
## 19 -0.69027  3.2642026  1.95716 -0.17694 -0.07352  0.16083
## 12  0.28557 -0.6656161  1.64423  1.71496  0.65381 -1.17376
## 7  -0.87149 -0.2547040 -0.86830  0.90468  0.17385  0.03446

screeplot(dune.bio.cca)

plot(dune.bio.cca, display="sites", xlab="CA1", ylab="CA2")

Perform a nonmetric multidimensional scaling (NMDS) on the dune_bio.txt dataset after calculating Bray-Curtis distances among sites.

What is the stress?
Make a Shepard plot of the NMDS results
Plot the ordination results of the sites
[Advanced – ggplot2] Plot the ordination results using ggplot2. Use the dune_env.txt dataset to make the size of the points proportional to the site richness (number of species) and the color to represent the variable Management.

 #Bray-Curtis and dimension amount
dune.bioNDMS <- metaMDS(dune.bio, distance="bray", k=2)

## Run 0 stress 0.1192678 
## Run 1 stress 0.1192679 
## ... Procrustes: rmse 0.0001469508  max resid 0.0004501215 
## ... Similar to previous best
## Run 2 stress 0.1808911 
## Run 3 stress 0.1192679 
## ... Procrustes: rmse 0.0001101807  max resid 0.0003374409 
## ... Similar to previous best
## Run 4 stress 0.1809577 
## Run 5 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 0.02027067  max resid 0.0649635 
## Run 6 stress 0.1808911 
## Run 7 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 4.575452e-06  max resid 1.388487e-05 
## ... Similar to previous best
## Run 8 stress 0.1183186 
## ... Procrustes: rmse 8.575115e-06  max resid 2.490875e-05 
## ... Similar to previous best
## Run 9 stress 0.1812932 
## Run 10 stress 0.1808911 
## Run 11 stress 0.1183186 
## ... Procrustes: rmse 1.289457e-05  max resid 4.235228e-05 
## ... Similar to previous best
## Run 12 stress 0.1183186 
## ... Procrustes: rmse 1.687299e-06  max resid 4.13915e-06 
## ... Similar to previous best
## Run 13 stress 0.1183186 
## ... Procrustes: rmse 3.241419e-06  max resid 1.029676e-05 
## ... Similar to previous best
## Run 14 stress 0.1889647 
## Run 15 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 1.357556e-06  max resid 3.647557e-06 
## ... Similar to previous best
## Run 16 stress 0.1192678 
## Run 17 stress 0.1812933 
## Run 18 stress 0.1886532 
## Run 19 stress 0.1192678 
## Run 20 stress 0.1183186 
## ... Procrustes: rmse 6.461758e-06  max resid 2.036932e-05 
## ... Similar to previous best
## *** Solution reached

dune.bioNDMS$stress

## [1] 0.1183186

stressplot(dune.bioNDMS)

plot(dune.bioNDMS, type="t", display="sites")

#especies ritchness
dune.env$Richness <- specnumber(x=dune.bio) 

#bio nmds points as df 
p.data <- as.data.frame(dune.bioNDMS$points) 

#Attaching MDS1 of dune_bio, and proportionalize 
dune.env$MDS1 <- p.data$MDS1 
dune.env$MDS2 <- p.data$MDS2 

ggplot(data = dune.env, aes(x=MDS1, y=MDS2))+
  geom_point(aes(fill=`Management`, size=Richness), shape = 21)

ggplot(data=dune.bio, mapping=aes(x=MDS1, y=MDS2))+
  geom_point(data=dune.env, mapping=aes(fill=Management, size=Richness), shape=24)

Perform a constrained correspondence analysis (CCA) on the dune_bio.txt dataset using the first (A1), second (Moisture) and fifth (Manure) variables of dune_env.txt as explanatory.

What is the variation explained by the two first constrained axes?
What is the adjusted R2 of the model?
Use a permutation test to assess the overall statistical significance of the model
Use a permutation test to assess the marginal statistical significance of each explanatory variable of the model. Which variables are significant?
Make a triplot of the ordination results focusing on the sites

d.cca<- cca(dune.bio ~ A1 + Moisture + Manure, data = dune.env)
summary(d.cca)

## 
## Call:
## cca(formula = dune.bio ~ A1 + Moisture + Manure, data = dune.env) 
## 
## Partitioning of scaled Chi-square:
##               Inertia Proportion
## Total          2.1153     1.0000
## Constrained    0.7692     0.3637
## Unconstrained  1.3460     0.6363
## 
## Eigenvalues, and their contribution to the scaled Chi-square 
## 
## Importance of components:
##                         CCA1   CCA2    CCA3    CA1    CA2     CA3     CA4
## Eigenvalue            0.4264 0.2347 0.10812 0.3124 0.2101 0.17518 0.13748
## Proportion Explained  0.2016 0.1110 0.05112 0.1477 0.0993 0.08282 0.06499
## Cumulative Proportion 0.2016 0.3125 0.36366 0.5114 0.6107 0.69348 0.75847
##                           CA5     CA6     CA7     CA8     CA9    CA10     CA11
## Eigenvalue            0.09442 0.09295 0.07380 0.06360 0.05541 0.04663 0.021117
## Proportion Explained  0.04464 0.04394 0.03489 0.03007 0.02620 0.02204 0.009983
## Cumulative Proportion 0.80311 0.84705 0.88194 0.91201 0.93820 0.96025 0.970230
##                           CA12     CA13     CA14     CA15     CA16
## Eigenvalue            0.019730 0.016183 0.012531 0.009960 0.004566
## Proportion Explained  0.009327 0.007651 0.005924 0.004709 0.002159
## Cumulative Proportion 0.979558 0.987208 0.993132 0.997841 1.000000
## 
## Accumulated constrained eigenvalues
## Importance of components:
##                         CCA1   CCA2   CCA3
## Eigenvalue            0.4264 0.2347 0.1081
## Proportion Explained  0.5543 0.3051 0.1406
## Cumulative Proportion 0.5543 0.8594 1.0000
## 
## Scaling 2 for species and site scores
## * Species are scaled proportional to eigenvalues
## * Sites are unscaled: weighted dispersion equal on all dimensions
## 
## 
## Species scores
## 
##            CCA1     CCA2     CCA3      CA1      CA2      CA3
## Belper -0.72508  0.09015  0.28678 -0.14997 -0.08941  0.64602
## Empnig  0.92386 -1.49359 -1.29239  2.87128  1.31651  0.74936
## Junbuf  0.50547  0.17496 -0.35707  0.33578 -1.58298 -0.40514
## Junart  1.08445 -0.22093 -0.33048 -0.84568  0.15861 -0.23372
## Airpra  0.32923 -1.52131 -0.75777  2.80560  1.49278  0.23227
## Elepal  1.39255  0.01967  0.39847 -0.90296  0.89564 -0.18234
## Rumace -0.56966  0.07756  0.44317  0.23147 -0.33959 -1.25112
## Viclat -0.95631 -1.05874  0.13224 -0.42277 -0.16218  0.84521
## Brarut  0.05861 -0.29093  0.14295 -0.17325  0.10170 -0.08514
## Ranfla  1.30695 -0.17793  0.06749 -0.84091  0.45935 -0.18497
## Cirarv -0.40316  1.49565 -0.09657  0.17279  0.57950  1.51909
## Hyprad  0.14257 -1.37899 -0.68907  2.13762  1.11990  0.54595
## Leoaut -0.10805 -0.39424  0.03343  0.21126  0.05153  0.18568
## Potpal  1.82080 -0.59463  2.50587 -0.53778  0.46936  0.51885
## Poapra -0.47298  0.19818 -0.15806 -0.08959 -0.09919  0.21588
## Calcus  1.32589 -0.43845  0.22643 -1.29121  0.99188 -0.18948
## Tripra -0.94282  0.14003  0.52488  0.16097  0.18159 -1.70499
## Trirep -0.03691 -0.24820  0.22610 -0.03138 -0.23464  0.04178
## Antodo -0.42848 -0.66783  0.01590  1.13407  0.58011 -0.56583
## Salrep  0.38937 -1.51269 -0.78060 -0.47447  0.69625  0.26139
## Achmil -0.84528 -0.28199  0.08034  0.23663  0.15351 -0.33457
## Poatri -0.09591  0.48672 -0.08195  0.10052 -0.39496 -0.08346
## Chealb  1.33134  1.08879 -0.17910  0.92504 -1.53277 -0.26877
## Elyrep -0.45995  0.49110 -0.07018 -0.09270 -0.32202  0.53794
## Sagpro  0.36673  0.22891 -0.41200  0.63169 -0.40176  0.48274
## Plalan -0.89463 -0.37036  0.37142  0.16434  0.15598 -0.65298
## Agrsto  0.80663  0.42851  0.03513 -0.31737  0.09257  0.16535
## Lolper -0.68266  0.25866 -0.13366 -0.17559  0.06653  0.20307
## Alogen  0.52884  0.74865 -0.28721  0.11872 -0.57814  0.07754
## Brohor -0.77674  0.11771  0.03195 -0.07993 -0.06235  0.29678
## 
## 
## Site scores (weighted averages of species scores)
## 
##       CCA1     CCA2     CCA3      CA1       CA2      CA3
## 2  -0.8544  0.57195 -0.04460 -0.44645 -0.473383  1.10419
## 13  0.7656  1.43234 -1.04660  0.92504 -1.532773 -0.26877
## 4  -0.1565  1.36916 -0.67602  0.17279  0.579503  1.51909
## 16  2.0459  0.46834  0.70504 -0.99503  1.669766 -0.75493
## 6  -1.0571 -0.29557  1.50936  0.06506  0.216688 -2.09457
## 1  -1.2439  1.24492 -0.99278 -0.48730  0.708935  1.13256
## 8   0.8113  0.66762 -0.54772 -0.28201  0.298269 -0.31177
## 5  -1.1247 -0.04619  1.17587  0.61920  0.008872 -0.95018
## 17 -0.7722 -2.94478 -1.24411  2.70708  1.757175 -0.54337
## 15  2.0065 -0.48090  2.87633 -0.26585  0.401900  0.57133
## 10 -1.0958 -0.42412  0.49762 -0.26374 -0.332785  0.08742
## 11 -0.7816 -0.90608 -0.28150 -0.26599 -0.008900  1.12676
## 9   0.1817  0.86595 -0.91242 -0.34851 -2.059004 -0.10377
## 18 -0.6053 -1.51952  0.07324 -0.89534 -0.298138  1.03991
## 3  -0.2093  1.35237 -0.66038  0.06196  0.171454  0.84659
## 20  1.8996 -1.40266 -0.55715 -2.22940  0.920734 -0.49851
## 14  1.9462 -0.67177  3.54931 -0.80971  0.536813  0.46637
## 19  0.2690 -3.39538 -3.20303  2.87128  1.316513  0.74936
## 12  0.6252  1.06204 -0.88731  0.77476 -2.069361 -0.26842
## 7  -1.0498 -0.01449  0.64118 -0.05748  0.266550 -1.48584
## 
## 
## Site constraints (linear combinations of constraining variables)
## 
##       CCA1     CCA2     CCA3      CA1       CA2      CA3
## 2  -1.0722 -0.15065  0.02248 -0.44645 -0.473383  1.10419
## 13  1.3313  1.08879 -0.17910  0.92504 -1.532773 -0.26877
## 4  -0.4032  1.49565 -0.09657  0.17279  0.579503  1.51909
## 16  1.2912  1.04854 -0.34917 -0.99503  1.669766 -0.75493
## 6  -0.9651 -0.04331  0.47601  0.06506  0.216688 -2.09457
## 1  -1.0995  1.27129 -0.50141 -0.48730  0.708935  1.13256
## 8   1.0904  0.84728 -1.19953 -0.28201  0.298269 -0.31177
## 5  -0.6973  0.22504  1.60982  0.61920  0.008872 -0.95018
## 17 -0.5627 -1.56290  0.04417  2.70708  1.757175 -0.54337
## 15  1.9681 -0.44704  3.12947 -0.26585  0.401900  0.57133
## 10 -0.6232 -0.89889 -0.41620 -0.26374 -0.332785  0.08742
## 11 -1.1054 -0.90857  0.08601 -0.26599 -0.008900  1.12676
## 9   0.4481 -0.77218 -0.96709 -0.34851 -2.059004 -0.10377
## 18 -0.9913 -1.51891  0.77314 -0.89534 -0.298138  1.03991
## 3  -0.3898  1.50907 -0.03988  0.06196  0.171454  0.84659
## 20  0.8971 -1.52042 -1.40577 -2.22940  0.920734 -0.49851
## 14  1.6735 -0.74222  1.88228 -0.80971  0.536813  0.46637
## 19  0.9239 -1.49359 -1.29239  2.87128  1.316513  0.74936
## 12  0.7625  0.26751  0.15988  0.77476 -2.069361 -0.26842
## 7  -1.1327  0.51336 -0.43788 -0.05748  0.266550 -1.48584
## 
## 
## Biplot scores for constraining variables
## 
##             CCA1     CCA2    CCA3 CA1 CA2 CA3
## A1        0.6048  0.04018  0.7954   0   0   0
## Moisture  0.9746 -0.06076 -0.2158   0   0   0
## Manure   -0.2059  0.96205 -0.1791   0   0   0

#two first constrain axess 
#0.2016  + 0.1110  = 0.3126

#Find r2
RsquareAdj(d.cca)$adj.r.squared

## [1] 0.2446353

anova(d.cca)

## Permutation test for cca under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: cca(formula = dune.bio ~ A1 + Moisture + Manure, data = dune.env)
##          Df ChiSquare     F Pr(>F)    
## Model     3   0.76925 3.048  0.001 ***
## Residual 16   1.34602                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(d.cca, by="margin")

## Permutation test for cca under reduced model
## Marginal effects of terms
## Permutation: free
## Number of permutations: 999
## 
## Model: cca(formula = dune.bio ~ A1 + Moisture + Manure, data = dune.env)
##          Df ChiSquare      F Pr(>F)    
## A1        1   0.13047 1.5508  0.119    
## Moisture  1   0.30886 3.6714  0.001 ***
## Manure    1   0.23418 2.7837  0.003 ** 
## Residual 16   1.34602                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(d.cca, scaling=2, xlab="CCA1", ylab="CCA2")

Perform a permutational multivariate analysis of variance (PERMANOVA) with 9,999 permutations using the dune_bio.txt dataset after calculating Bray-Curtis distances and the first (A1), second (Moisture) and fifth (Manure) variables of dune_env.txt as explanatory.

Which explanatory variables are significant?
What is the explanatory power (R2) of the significant variables?
Perform a new PERMANOVA analysis with 9,999 permutations with A1 as the explanatory variable but constrain the permutations within the Use variable.

#using Data from dune.bio and dune.env

adonis2(vegdist(dune.bio, method="bray") ~ A1 + Moisture + Manure, data=dune.env, permutations=9999)

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 9999
## 
## adonis2(formula = vegdist(dune.bio, method = "bray") ~ A1 + Moisture + Manure, data = dune.env, permutations = 9999)
##          Df SumOfSqs      R2      F Pr(>F)    
## A1        1   0.7230 0.16817 5.8146  2e-04 ***
## Moisture  1   0.8671 0.20169 6.9736  1e-04 ***
## Manure    1   0.7197 0.16741 5.7885  2e-04 ***
## Residual 16   1.9893 0.46274                  
## Total    19   4.2990 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#A) Moisture and Manure are significant 

adonis2(vegdist(dune.bio, method="bray") ~ Use/A1, strata=dune.env$Use, data=dune.env, permutations=9999)

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Blocks:  strata 
## Permutation: free
## Number of permutations: 9999
## 
## adonis2(formula = vegdist(dune.bio, method = "bray") ~ Use/A1, data = dune.env, permutations = 9999, strata = dune.env$Use)
##          Df SumOfSqs      R2      F Pr(>F)   
## Use       2   0.5532 0.12867 1.5368 0.0042 **
## Use:A1    3   1.2262 0.28524 2.2711 0.0042 **
## Residual 14   2.5196 0.58609                 
## Total    19   4.2990 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

dune.env$Axis01<-metaMDS(vegdist(dune.bio, method = "bray"))$points[,1]

## Run 0 stress 0.1192678 
## Run 1 stress 0.1808911 
## Run 2 stress 0.1192678 
## ... Procrustes: rmse 7.306918e-06  max resid 2.01041e-05 
## ... Similar to previous best
## Run 3 stress 0.1192679 
## ... Procrustes: rmse 0.0001123017  max resid 0.0003385215 
## ... Similar to previous best
## Run 4 stress 0.1192678 
## ... Procrustes: rmse 3.808766e-05  max resid 0.0001157921 
## ... Similar to previous best
## Run 5 stress 0.1192679 
## ... Procrustes: rmse 0.0001104649  max resid 0.0003384593 
## ... Similar to previous best
## Run 6 stress 0.1192678 
## ... Procrustes: rmse 2.719946e-05  max resid 8.131587e-05 
## ... Similar to previous best
## Run 7 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 0.02027034  max resid 0.0649619 
## Run 8 stress 0.1192678 
## Run 9 stress 0.1183186 
## ... Procrustes: rmse 7.76381e-06  max resid 2.469014e-05 
## ... Similar to previous best
## Run 10 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 1.136906e-06  max resid 3.696662e-06 
## ... Similar to previous best
## Run 11 stress 0.1192679 
## Run 12 stress 0.1192678 
## Run 13 stress 0.1192679 
## Run 14 stress 0.1183186 
## ... Procrustes: rmse 1.243488e-06  max resid 3.36028e-06 
## ... Similar to previous best
## Run 15 stress 0.1889649 
## Run 16 stress 0.1192679 
## Run 17 stress 0.1192678 
## Run 18 stress 0.1192678 
## Run 19 stress 0.1183186 
## ... Procrustes: rmse 4.632156e-06  max resid 1.149392e-05 
## ... Similar to previous best
## Run 20 stress 0.1808911 
## *** Solution reached

dune.env$Axis02<-metaMDS(vegdist(dune.bio, method = "bray"))$points[,2]

## Run 0 stress 0.1192678 
## Run 1 stress 0.1192678 
## ... Procrustes: rmse 2.888317e-05  max resid 8.284675e-05 
## ... Similar to previous best
## Run 2 stress 0.1192678 
## ... Procrustes: rmse 5.190709e-06  max resid 1.12841e-05 
## ... Similar to previous best
## Run 3 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 0.02026985  max resid 0.06496007 
## Run 4 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 2.418064e-06  max resid 7.129482e-06 
## ... Similar to previous best
## Run 5 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 3.772612e-06  max resid 1.099565e-05 
## ... Similar to previous best
## Run 6 stress 0.1183186 
## ... Procrustes: rmse 1.584898e-06  max resid 4.892159e-06 
## ... Similar to previous best
## Run 7 stress 0.1192678 
## Run 8 stress 0.1183186 
## ... Procrustes: rmse 5.875465e-06  max resid 2.062322e-05 
## ... Similar to previous best
## Run 9 stress 0.1192678 
## Run 10 stress 0.1886532 
## Run 11 stress 0.1192678 
## Run 12 stress 0.1192679 
## Run 13 stress 0.1192678 
## Run 14 stress 0.1183186 
## ... Procrustes: rmse 6.25659e-06  max resid 1.693995e-05 
## ... Similar to previous best
## Run 15 stress 0.1192678 
## Run 16 stress 0.1192678 
## Run 17 stress 0.1192678 
## Run 18 stress 0.1183186 
## ... Procrustes: rmse 3.384824e-06  max resid 1.164355e-05 
## ... Similar to previous best
## Run 19 stress 0.1192678 
## Run 20 stress 0.1183186 
## ... Procrustes: rmse 5.318761e-06  max resid 1.697698e-05 
## ... Similar to previous best
## *** Solution reached

ggplot(dune.env, aes(x=Axis01, y=Axis02))+
  geom_point(aes(fill=Use, size=A1), alpha=0.6, pch=21)

Practice Problems for ENVS 567

Emilio Aguilar Cubilla

2022-05-03