Question 1

The following code loads the data plankton.csv into R:

# read the data into R
data <- read.csv('Datasets/plankton.csv', header = T)
head(data)
##     treat         p     plank
## 1 Control 18.661338 311.06066
## 2 Control  9.737305 211.22552
## 3 Control 10.832724 269.49007
## 4 Control  1.290529  58.30815
## 5 Control 16.422558 264.43144
## 6 Control 19.358486 306.86994

Question 1-a

As the data suggests, this data set involves two numerical variables (p and plank) and one categorical variable (treat). So, in order to check the linear relationship between the two numerical variables in the presence of the categorical variable, we can use linear regression as follows:

# a
fit <- lm(plank ~ p + treat, data=data)
summary(fit)
## 
## Call:
## lm(formula = plank ~ p + treat, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -126.55  -30.94  -11.06   42.81  131.32 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                179.200     21.824   8.211 7.38e-10 ***
## p                            4.383      1.249   3.509  0.00120 ** 
## treatRemoval_of_some_fish  -62.166     19.526  -3.184  0.00294 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.6 on 37 degrees of freedom
## Multiple R-squared:  0.3377, Adjusted R-squared:  0.3019 
## F-statistic: 9.431 on 2 and 37 DF,  p-value: 0.0004898

The null hypothesis for each row of the Coefficients table is that the corresponding coefficient equals zero. Since the p-values for all three coefficients are well below 0.05, all of them are significant. Therefore, the treatment levels also play a significant role in the linear relationship between phosphorus concentration and plankton density. The same conclusion can be illustrated with the following visualization:

# scatter plot of all observations
plot(data$p, data$plank, xlab='phosphorus concentration', ylab='plankton density')
# split the data by treatment level and fit a separate regression per group
data_treat1 <- dplyr::filter(data, treat=='Control')
data_treat2 <- dplyr::filter(data, treat=='Removal_of_some_fish')
fit1 <- lm(plank~p, data=data_treat1)
fit2 <- lm(plank~p, data=data_treat2)
# highlight each group and overlay its fitted line
points(data_treat1$p, data_treat1$plank, col='red', lwd=2)
points(data_treat2$p, data_treat2$plank, col='blue', lwd=2)
abline(fit1, col='red', lty=2)
abline(fit2, col='blue', lty=2)
legend('topleft', c('Control','Removal_of_some_fish'),
       lty=2, col=c('red','blue'), cex=0.8)

As the plot above suggests, the relationship between phosphorus concentration and plankton density is clearly affected by the treatment.
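
Note that the additive model in part a assumes a common slope for both treatments, whereas the two dashed lines above were fitted separately and are free to have different slopes. As a quick sketch beyond what the question asks, an interaction model would test formally whether the slope of p differs between treatments:

# sketch: does the effect of p depend on the treatment?
fit_int <- lm(plank ~ p * treat, data=data)
anova(fit, fit_int)   # F-test comparing the additive and interaction models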

Question 1-b

When categorical variables are present in a study, boxplots are one of the best options for visualizing the relationship between the variables:

# b
boxplot(p~treat, data=data, col=2, ylab='phosphorus concentration')

boxplot(plank~treat, data=data, col=2, ylab='plankton density')

Both boxplots suggest that phosphorus concentration and plankton density differ between the two levels of treatment.
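
To back up this visual impression with a formal test, Welch two-sample t-tests can be run for each numerical variable (a quick sketch, not strictly required here):

# sketch: formal two-sample tests behind the boxplots
t.test(p ~ treat, data=data)
t.test(plank ~ treat, data=data)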

Question 1-c

The fitted equation for the model we analysed in part a is written below:

\[ \widehat{plank}_i = 179.200 + 4.383 \, p_i - 62.166 \, treat_i \]

Here \(treat_i\) is an indicator variable that equals 1 for Removal_of_some_fish and 0 for Control. From this equation, we expect a 4.383 increase in plank for every one-unit increase in p, so the relationship between \(p\) and \(plank\) is positive. The relationship between treat and plank is negative: moving from Control to Removal_of_some_fish decreases the expected plank by 62.166. With the model parameters interpreted, we can also calculate a 95% confidence interval for each of them:

# c
summary(fit)
## 
## Call:
## lm(formula = plank ~ p + treat, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -126.55  -30.94  -11.06   42.81  131.32 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                179.200     21.824   8.211 7.38e-10 ***
## p                            4.383      1.249   3.509  0.00120 ** 
## treatRemoval_of_some_fish  -62.166     19.526  -3.184  0.00294 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.6 on 37 degrees of freedom
## Multiple R-squared:  0.3377, Adjusted R-squared:  0.3019 
## F-statistic: 9.431 on 2 and 37 DF,  p-value: 0.0004898
confint(fit)
##                                 2.5 %     97.5 %
## (Intercept)                134.980449 223.418830
## p                            1.851783   6.914174
## treatRemoval_of_some_fish -101.728593 -22.603712

For example, the confidence interval \((1.852, 6.914)\) for the \(p\) coefficient means that, under repeated sampling, 95% of intervals constructed this way would contain the true coefficient; loosely speaking, we are 95% confident that the true effect of p lies in this interval.
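
As a sanity check on the fitted equation, we can also compute the predicted plankton density for each treatment at a hypothetical phosphorus concentration (p = 10 is an arbitrary illustrative value):

# sketch: predictions from the fitted model at p = 10
newdat <- data.frame(p=c(10, 10), treat=c('Control','Removal_of_some_fish'))
predict(fit, newdata=newdat, interval='confidence')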

Question 1-d

To see the effect of ignoring phosphorus on the model for plankton density, we can simply fit the reduced model:

# d
fit2 <- lm(plank ~ treat, data=data)
summary(fit2)
## 
## Call:
## lm(formula = plank ~ treat, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -180.910  -44.607   -5.247   37.842  184.166 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 239.22      15.44  15.496   <2e-16 ***
## treatRemoval_of_some_fish   -49.06      21.83  -2.247   0.0305 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.04 on 38 degrees of freedom
## Multiple R-squared:  0.1173, Adjusted R-squared:  0.09407 
## F-statistic:  5.05 on 1 and 38 DF,  p-value: 0.03051

To see what removing the p variable did to the model, we can compare the \(R^2\) statistic of this model with that of the complete model. The \(R^2\) for the complete model in part a was 0.3377, while this value for the reduced model is only 0.1173. So removing a variable like p, which has a significant effect on the response, degrades the model's performance. As a result, this variable should be kept in the model.
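
Since the two models are nested, the same conclusion can be verified with a partial F-test; with a single dropped term this reproduces the t-test for p from part a:

# partial F-test for dropping p from the model
anova(fit2, fit)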

Question 2

The following code loads the pine.csv data set into R:

# read the data into R
data <- read.csv('Datasets/pine.csv', header = T)
head(data)
##     treat site       len
## 1 Control    1 24.075840
## 2 Control    2 27.629815
## 3 Control    3 22.766191
## 4 Control    4 18.340246
## 5 Control    5 30.671858
## 6 Control    6  9.216332

Question 2-a

To test the significance of the treatment effects, a linear regression model can be used:

# a
fit3 <- lm(len ~ site + treat, data=data)
summary(fit3)
## 
## Call:
## lm(formula = len ~ site + treat, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4131  -6.7230   0.7546   3.1561  20.5052 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      22.9174     4.2927   5.339 1.38e-05 ***
## site             -0.9266     0.5788  -1.601 0.121507    
## treatFertilizer  37.3070     4.0724   9.161 1.27e-09 ***
## treatRunoff      17.5428     4.0724   4.308 0.000209 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.106 on 26 degrees of freedom
## Multiple R-squared:  0.7691, Adjusted R-squared:  0.7424 
## F-statistic: 28.86 on 3 and 26 DF,  p-value: 1.971e-08

The p-values for both levels of the treatment variable are lower than 0.05, so the treatment clearly has a significant effect on the response variable (\(len\)). For the second question in this part we can refer to the same output: the coefficient for the Runoff level is positive (17.54) with a p-value of 0.000209, well below 0.05, so yes, captured agricultural runoff water does seem to be an effective fertilizer for pine trees relative to the Control.
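
Note that the t-tests above compare each treatment level against the Control baseline. The treatment factor can also be tested as a whole with an F-test:

# F-test for each model term as a whole
anova(fit3)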

Question 2-b

As mentioned in the previous question, boxplots are a powerful tool when dealing with data sets that contain categorical variables:

# b
boxplot(len~treat, data=data, col=2, ylab='lengths of the seedlings')

boxplot(len~site, data=data, col=2, ylab='lengths of the seedlings')

The first plot shows that the lengths of the seedlings are highest under the Fertilizer treatment and lowest under the Control. These results are in line with the coefficient significance tests, since lengths clearly differ between treatment levels.

The second boxplot shows no clear difference in lengths across sites. This result is also in line with the significance tests, where the site coefficient was not significant.

Question 2-c

The basis of our analysis is normality. The response variable in this question is the lengths of the seedlings, so we can use the Shapiro-Wilk normality test to check whether it is normally distributed. More importantly, linear regression assumes that the residuals of the model are normally distributed, which can be checked with the same test.

# c
shapiro.test(data$len)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$len
## W = 0.96818, p-value = 0.4906
shapiro.test(fit3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit3$residuals
## W = 0.93851, p-value = 0.08294

The null hypothesis in the Shapiro-Wilk test is that the data are normally distributed. Since the p-value of the first test is 0.49, we fail to reject the null hypothesis, so there is no evidence against normality of the response variable and linear regression is a suitable option for this question.

Likewise, the p-value for the normality test of the residuals is 0.0829; since it is above 0.05, we again fail to reject normality of the residuals.
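
As a visual complement to the Shapiro-Wilk tests, R's standard diagnostic plots for the fitted model can also be inspected (a quick sketch):

# sketch: residuals vs fitted, normal Q-Q, scale-location and leverage plots
par(mfrow=c(2,2))
plot(fit3)
par(mfrow=c(1,1))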

Question 2-d

The null hypotheses for part a (significance of the coefficients) are that each regression coefficient equals zero, tested against a two-sided alternative:

\[ H_0: \beta_{site} = 0, \qquad H_0: \beta_{Fertilizer} = 0, \qquad H_0: \beta_{Runoff} = 0 \]

Based on the summary in part a, the null hypothesis is rejected for both treatment coefficients (p-values 1.27e-09 and 0.000209) but not for site (p-value 0.1215).

Question 3

The data set under study in this question is moth.csv, and it can be loaded into R using the following code:

# read the data into R
data <- read.csv('Datasets/moth.csv', header = T)
head(data)
##    hab subs      host
## 1 Cold    1 15.686995
## 2 Cold    0  8.788220
## 3 Cold    0  1.958161
## 4 Cold    0  3.821172
## 5 Cold    0  1.656325
## 6 Cold    0  5.850901

Question 3-a

To test whether the occurrence of the two subspecies differs between habitats, we need to look at the frequencies of the levels of both variables, because both are categorical and cannot be analysed with tests designed for numerical data. For this purpose, we can use the table() function to build a cross-table of the variables and a chi-squared test to obtain the result:

# a
table(data$hab, data$subs)
##       
##         0  1
##   Cold 12  6
##   Warm  3  9
chisq.test(data$hab, data$subs, correct = F)
## 
##  Pearson's Chi-squared test
## 
## data:  data$hab and data$subs
## X-squared = 5, df = 1, p-value = 0.02535

Since we get a p-value below the significance level of 0.05 (p-value = 0.02535), we reject the null hypothesis of independence and conclude that the two variables are in fact dependent. So the occurrence of the two subspecies does differ between habitats.
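
Since one cell of the table contains only 3 observations, Fisher's exact test is a common cross-check when counts are small (a quick sketch):

# sketch: exact test as a cross-check for small cell counts
fisher.test(table(data$hab, data$subs))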

Question 3-b

In this question, two categorical variables (subs and hab) and one numerical variable (host) are involved. So, to test whether the occurrence of the subspecies is determined by the availability of the potential host plant, as well as by habitat, a linear regression can be used as below:

# b
fit4 <- lm(host ~ subs + hab, data=data)
summary(fit4)
## 
## Call:
## lm(formula = host ~ subs + hab, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9761 -2.5382  0.2142  2.7345  5.0378 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.5967     0.9030   5.090 2.39e-05 ***
## subs         11.0503     1.3339   8.284 6.82e-09 ***
## habWarm      -0.6207     1.3614  -0.456    0.652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.335 on 27 degrees of freedom
## Multiple R-squared:  0.7451, Adjusted R-squared:  0.7262 
## F-statistic: 39.45 on 2 and 27 DF,  p-value: 9.703e-09

The only significant coefficients in this model are the intercept and subs. That means the availability of the potential host plant differs strongly between the two subspecies, so the occurrence of the subspecies does appear to be tied to host-plant availability. The hab coefficient, however, is not significant, so habitat adds no explanatory power once the subspecies is accounted for.
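
Because subs is a binary occurrence indicator, a natural alternative framing (a sketch, not a replacement for the model above) is a logistic regression with occurrence as the response:

# sketch: model the occurrence of the subspecies directly
fit_glm <- glm(subs ~ host + hab, family=binomial, data=data)
summary(fit_glm)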

Question 3-c

In part a we tested the dependence of the two categorical variables (subs and hab) and saw that they are significantly associated. In part b, the purpose was to test whether each of the categorical variables (subs and hab) has a significant effect on the response variable (host), and we observed that the only significant predictor was the subs variable.

Question 4

Again, to load the morphs.csv data into R and print the first six rows of the data set, the code below is used:

# read the data into R
data <- read.csv('Datasets/morphs.csv', header = T)
head(data)
##         morph length abdomen thorax   s4  wing
## 1 Androchrome  32.36   26.09   2.50 0.88 20.34
## 2 Androchrome  34.97   27.72   2.57 0.91 20.15
## 3 Androchrome  36.88   29.39   2.37 0.75 21.41
## 4 Androchrome  34.91   27.79   2.49 0.97 21.41
## 5 Androchrome  33.95   26.78   2.59 1.04 21.26
## 6 Androchrome  36.91   24.41   2.41 0.84 19.76

Question 4-a

The old and tedious method is to calculate the correlation coefficient between each pair of morphological traits and then test its significance, as in the code written below:

# a
cor(data[,-1])
##            length   abdomen    thorax        s4      wing
## length  1.0000000 0.8105638 0.4661393 0.2534774 0.6241142
## abdomen 0.8105638 1.0000000 0.4481525 0.2304304 0.6401569
## thorax  0.4661393 0.4481525 1.0000000 0.6035722 0.5237553
## s4      0.2534774 0.2304304 0.6035722 1.0000000 0.4510957
## wing    0.6241142 0.6401569 0.5237553 0.4510957 1.0000000

As the question suggests, there are strong correlations between pairs of variables. To test the significance of these correlation values, we can use the cor.test() function in R:

cor.test(data$length, data$abdomen)
## 
##  Pearson's product-moment correlation
## 
## data:  data$length and data$abdomen
## t = 11.244, df = 66, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7091974 0.8790959
## sample estimates:
##       cor 
## 0.8105638
cor.test(data$length, data$thorax)
## 
##  Pearson's product-moment correlation
## 
## data:  data$length and data$thorax
## t = 4.2804, df = 66, p-value = 6.17e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2561866 0.6340918
## sample estimates:
##       cor 
## 0.4661393
cor.test(data$length, data$s4)
## 
##  Pearson's product-moment correlation
## 
## data:  data$length and data$s4
## t = 2.1288, df = 66, p-value = 0.03701
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.01602057 0.46386847
## sample estimates:
##       cor 
## 0.2534774
cor.test(data$length, data$wing)
## 
##  Pearson's product-moment correlation
## 
## data:  data$length and data$wing
## t = 6.4893, df = 66, p-value = 1.297e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4531146 0.7508147
## sample estimates:
##       cor 
## 0.6241142
cor.test(data$thorax, data$s4)
## 
##  Pearson's product-moment correlation
## 
## data:  data$thorax and data$s4
## t = 6.15, df = 66, p-value = 5.089e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4265272 0.7360716
## sample estimates:
##       cor 
## 0.6035722
cor.test(data$thorax, data$wing)
## 
##  Pearson's product-moment correlation
## 
## data:  data$thorax and data$wing
## t = 4.9949, df = 66, p-value = 4.576e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3260456 0.6775681
## sample estimates:
##       cor 
## 0.5237553
cor.test(data$s4, data$wing)
## 
##  Pearson's product-moment correlation
## 
## data:  data$s4 and data$wing
## t = 4.1062, df = 66, p-value = 0.0001131
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2383004 0.6225626
## sample estimates:
##       cor 
## 0.4510957

The null hypothesis in each of these tests is that the correlation between the two variables is zero. As we can see, all the p-values are below the significance level (0.05), which means all the correlations are statistically significant.
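
As a more compact alternative (a sketch that assumes the Hmisc package is installed), the whole matrix of correlations and their p-values can be obtained in a single call:

# sketch: correlation matrix together with p-values (requires Hmisc)
library(Hmisc)
rcorr(as.matrix(data[,-1]))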

To obtain a new variable that combines the information from all five traits into a single measure of overall size, we can stack the five trait columns into one value column using the long data format:

# create long format data
library(data.table)
long_data <- melt(setDT(data), id.vars = c("morph"), variable.name = c("new_variable"))
str(long_data)
## Classes 'data.table' and 'data.frame':   340 obs. of  3 variables:
##  $ morph       : chr  "Androchrome" "Androchrome" "Androchrome" "Androchrome" ...
##  $ new_variable: Factor w/ 5 levels "length","abdomen",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ value       : num  32.4 35 36.9 34.9 34 ...
##  - attr(*, ".internal.selfref")=<externalptr>

The structure of the new data shows that the stacked value variable has been created.
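
As a side note, a more conventional single measure of overall size is the first principal component of the five traits; the sketch below shows that alternative (data was converted to a data.table above, so it is coerced back to a data frame first):

# sketch: PC1 of the scaled traits as a combined size score
pca <- prcomp(as.data.frame(data)[,-1], scale.=TRUE)
summary(pca)        # proportion of variance captured by each component
size <- pca$x[,1]   # overall-size score per individual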

Question 4-b

Before testing whether the female morphs differ in overall size, we can check the number of observations per morph with the table() function:

# b
table(long_data$morph)
## 
## Androchrome   Infuscans    Obsoleta 
##         100         110         130

The groups have similar (though not identical) numbers of observations. To actually test whether the female morphs differ in overall size, we can use a linear regression and an ANOVA test, as below:

fit5 <- lm(value ~ morph+new_variable, data=long_data)
summary(fit5)
## 
## Call:
## lm(formula = value ~ morph + new_variable, data = long_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4135 -0.2983  0.0237  0.3503  3.0324 
## 
## Coefficients:
##                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)          33.87761    0.12543  270.099  < 2e-16 ***
## morphInfuscans       -0.09413    0.11747   -0.801  0.42354    
## morphObsoleta        -0.35909    0.11309   -3.175  0.00164 ** 
## new_variableabdomen  -7.16309    0.14581  -49.128  < 2e-16 ***
## new_variablethorax  -31.30926    0.14581 -214.733  < 2e-16 ***
## new_variables4      -32.82882    0.14581 -225.155  < 2e-16 ***
## new_variablewing    -13.32985    0.14581  -91.422  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8502 on 333 degrees of freedom
## Multiple R-squared:  0.9959, Adjusted R-squared:  0.9958 
## F-statistic: 1.34e+04 on 6 and 333 DF,  p-value: < 2.2e-16

The coefficients show that the Infuscans morph does not differ significantly from the baseline (Androchrome), while Obsoleta does (p = 0.0016), so the morphs do differ at least partly in overall size. To double-check these results, we can use anova() as below:

anova(fit5)
## Analysis of Variance Table
## 
## Response: value
##               Df Sum Sq Mean Sq    F value    Pr(>F)    
## morph          2      8     4.1     5.6514  0.003858 ** 
## new_variable   4  58107 14526.7 20097.4889 < 2.2e-16 ***
## Residuals    333    241     0.7                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In these results, the morph variable is significant at the 0.05 level (p = 0.0039, in fact even at the 0.01 level), confirming that overall size differs between the morphs. The visualization below makes the behaviour of the stacked variable clearer.

boxplot(value~new_variable, data=long_data)
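
As a follow-up sketch, Tukey's HSD on the equivalent aov() fit shows which pairs of morphs actually differ, with multiplicity-adjusted p-values:

# sketch: pairwise morph comparisons, adjusted for multiple testing
TukeyHSD(aov(value ~ morph + new_variable, data=long_data), 'morph')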

Question 4-c

To improve the design of a modelling study, we can eliminate terms that carry no information and refit the model. Here the Infuscans level of the morph variable does not differ significantly from Androchrome, but the ANOVA shows that morph as a whole is significant, so rather than dropping morph from further investigation, a better refinement would be to merge or re-examine the two levels that do not differ (Androchrome and Infuscans) while keeping the factor in the model.

