Q1 a) EDA

# Read in data
experiment <- read.table("experiment.txt", header = TRUE)

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.6
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Processing char's as factors
experiment$x1 <- factor(experiment$x1)
experiment$x2 <- factor(experiment$x2)

# Box and Density Plot of a,b factors to response
g1 <- ggplot(experiment, aes(Response, col = x1)) +
        geom_boxplot() +
        geom_line(stat = "density", size = 0.5, col = "black") +
        ggtitle("Boxplot response vs x1(a,b)") +
        theme(legend.position = "none") +
        facet_grid(~x1) #Separate plots by variable x1 into factors a, b

# Scatter plot for Response vs x3, colored by x1 factors a and b
g3  <- ggplot(experiment, aes(x3, Response, col = x1)) +
        geom_point() +
        geom_smooth(method = "lm", se = F) + 
        ggtitle("Scatterplot Response vs X3")

# Summary Statistics and Plots
glimpse(experiment)

## Rows: 8
## Columns: 5
## $ Response <int> 17, 14, 21, 10, 16, 35, 15, 29
## $ x1       <fct> a, b, a, b, a, b, a, b
## $ x2       <fct> c, c, c, c, d, d, d, d
## $ x3       <int> 7, 8, 6, 5, 7, 4, 12, 10
## $ x4       <int> 5, 9, 7, 7, 1, 8, 4, 7

summary(experiment)

##     Response     x1    x2          x3               x4      
##  Min.   :10.00   a:4   c:4   Min.   : 4.000   Min.   :1.00  
##  1st Qu.:14.75   b:4   d:4   1st Qu.: 5.750   1st Qu.:4.75  
##  Median :16.50               Median : 7.000   Median :7.00  
##  Mean   :19.62               Mean   : 7.375   Mean   :6.00  
##  3rd Qu.:23.00               3rd Qu.: 8.500   3rd Qu.:7.25  
##  Max.   :35.00               Max.   :12.000   Max.   :9.00

# Standard deviation and variance
sd_x3 <- sd(experiment$x3)
var_x3 <- sd_x3^2
# Paste function for sd and var
paste("Standard Deviation for x3 =", list(sd = sd_x3))

## [1] "Standard Deviation for x3 = 2.61520280557469"

paste("Variance of x3 = ", list(var = var_x3))

## [1] "Variance of x3 =  6.83928571428571"

The dataframe experiment contains 8 observations, 5 predictor variables and 1 response variable. Categorical variable x1 has been converted into factor form with 2 levels a and b, x2 factor form with 2 levels c and d. The summary statistics output gives the mean and median for x3 valued at 7.375 and 7.00, respectively. The average response value is 19.62 and median 16.52. The standard deviation for x3 is 2.62 and variance for x3 is 6.84.

# GG objects
g1; g3

## `geom_smooth()` using formula 'y ~ x'

The density boxplot for x1 and the response, shows factor a to have a unimodal distribution with a relative higher kurtosis then factor b. Factor b’s distribution looks to be somewhat uniform with no peaks in the distribution and a much larger response spread compared to factor a. Factor b has a larger median of response to factor a. The scatterplot shows a possible weak negative linear relationship for factor a to the response,

# Correlation tests
cor.test(experiment$x3, experiment$Response)

## 
##  Pearson's product-moment correlation
## 
## data:  experiment$x3 and experiment$Response
## t = -0.46929, df = 6, p-value = 0.6554
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7883121  0.5954628
## sample estimates:
##        cor 
## -0.1881656

cor.test(as.numeric(experiment$x1), experiment$Response)

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(experiment$x1) and experiment$Response
## t = 0.77849, df = 6, p-value = 0.4659
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5108113  0.8303362
## sample estimates:
##       cor 
## 0.3028874

A pearson correlation test for the x1 and response has a high p-value and confidence interval that passes through 0, therefore a relationship is undetermined.

b) Dummy Variables

# Dummy Variables for x1 and x2
experiment$x5 <- ifelse(experiment$x1 == 'a', 0, 1)
experiment$x6 <- ifelse(experiment$x2 == 'c', 0, 1)

Using a simple if else statement to combine new variables x5, x6 to the experiment dataframe.

c) Matrix Algebra

# Combine experiment vectors into a matrix for x
x <- cbind(rep(1, 8), as.numeric(experiment$x3), as.numeric(experiment$x4),
           as.numeric(experiment$x5), as.numeric(experiment$x6))

# Response vector
y <- experiment$Response

# Solve function for matrix algebra, on the matrix regression equation
co_ef <- solve(t(x)%*%x, t(x)%*%y)
beta <- c("b0", "b3", "b4", "b5", "b6")
list(cbind(beta, co_ef))

## [[1]]
##      beta                    
## [1,] "b0" "4.47044514716607" 
## [2,] "b3" "-1.4446606665045" 
## [3,] "b4" "3.60909754317685" 
## [4,] "b5" "-9.68766723424961"
## [5,] "b6" "17.9963512527366"

\[\hat{Y} = 4.470445 -1.444661(X_3)+3.609098(X_4) -9.687667(X_5)+17.996351(X_6) + \epsilon\] Note: x5, x6 represent the dummy variables for categorical variables x1 and x2.

d) Linear Model

Let \(i={1,2,3,4}\) for variables \(X\)

\[H_O: \beta_{i} = 0\] All coefficients equal zero, implying no relationship between predictors and response.

\[H_O: \beta_{i} \neq 0\] At least one slope coefficient is not equal to zero, suggesting a linear relationship between predictor variables to the Response.

# Linear model for experiment data, no interactions
lm_experiment <- lm(Response~.-x5-x6, experiment)
summary(lm_experiment)

## 
## Call:
## lm(formula = Response ~ . - x5 - x6, data = experiment)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
##  4.59669 -1.70737 -0.06616 -2.82316  0.03673 -0.87327 -4.56726  5.40379 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   4.4704     9.6612   0.463   0.6751  
## x1b          -9.6877     6.4545  -1.501   0.2304  
## x2d          17.9964     5.0365   3.573   0.0375 *
## x3           -1.4447     0.8517  -1.696   0.1884  
## x4            3.6091     1.4407   2.505   0.0873 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.255 on 3 degrees of freedom
## Multiple R-squared:  0.8316, Adjusted R-squared:  0.607 
## F-statistic: 3.703 on 4 and 3 DF,  p-value: 0.1553

# Graphical parameters for diagnostic plots
# Note: Assumption tests failed
par(mfrow= c(2,2))
plot(lm_experiment)

The equation for the line is such:
\[\hat{Response} = 4.470445 -1.444661(X_3)+3.609098(X_4) -9.687667(X_1)+17.996351(X_2) + \epsilon\] The slope coefficient’s match that of part c). With the values for \(X_5\),\(X_6\) are equal to the coefficients for x1 and x2, respectively. There is evidence to suggest a linear relationship between \(X_2\) and the response variable, given by the low p-value 0.0375.

e)

library(effects)

## Loading required package: carData

## lattice theme set by effectsTheme()
## See ?effectsTheme for details.

# Linear model with interactions of x1 and x2
lm_experiment2 <- lm(Response~x1*x2+x3+x4, experiment)

# Effects of categorical variables a and b
lm_effects <- allEffects(lm_experiment2)
plot(lm_effects, multiline = T)

We see an effect between variables x1 and x2, with a main effect from variable x2. This can be seen by factors a and b are non-parallel, x2 factors c and d differ widely in response.

f)

summary(lm_experiment2)

## 
## Call:
## lm(formula = Response ~ x1 * x2 + x3 + x4, data = experiment)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.01758  1.53427  0.01758 -1.53427  1.00902 -0.07665 -1.00902  0.07665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  14.3097     3.9645   3.609   0.0689 .
## x1b         -10.2065     2.2602  -4.516   0.0457 *
## x2d           4.3864     3.3656   1.303   0.3223  
## x3           -0.7583     0.3312  -2.290   0.1492  
## x4            1.6032     0.6577   2.438   0.1350  
## x1b:x2d      16.7944     3.5388   4.746   0.0416 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.838 on 2 degrees of freedom
## Multiple R-squared:  0.9863, Adjusted R-squared:  0.9519 
## F-statistic: 28.72 on 5 and 2 DF,  p-value: 0.03399

Yes, a statistically significant interaction between variables x1 and x2. Given by the low p-value of 0.0416. There is an improvement in the model’s R-squared, \(R^2 = 0.9863\). Meaning roughly 98% of the variation in the model can be explained by the predictor variables and interaction.

Q2 - Multiple Regression For Amphipods

# Read amphipods data
amphipods <- read.csv("amphipods.csv", header = T)

Introduction

The dataset amphipods contains data on small crustations called amphipods, which range in size from 1 to 340mm. Currently there are 9,500 species of amphipods discovered and live in aquatic environments ranging in various salinaty contents of these environments, seagrasses and algae harbor ideal growing conditions for the certain amphipod species to grow. These crustations are crucial in marine food-chains, as they feed on algae and are prey for primary consumers like fish. This study will aim to predict the number of amphipods that live in different sea-grass, algae, epiphytes (small seaweeds on the leaves of seagrasses) contents from a sample of 10 different locations. The aim of this study, is to create a predictive model for amphipods ideal living environment by building a final model based on top significant predictor variables, through model selection. Helping us understand better about biomes, marine ecosystems and interactions contained in the ecosystems.

Variables	Definition
Amphipods	Small crustaceans
Drift Algae	algae that are not attached to the substratum (sea bottom), but float among seagrass plants
Seagrasses	True plants that live in sheltered bays and lagoons, attach to the soft sediments with their roots and rhizoids. Seagrasses form so-called beds or meadows, which house a very diverse animal assemblage
Epiphytes	Small seaweeds growing on the leaf blades of seagrasses

EDA

# Data frame information and structure
glimpse(amphipods)

## Rows: 16
## Columns: 6
## $ sample       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ amphipods    <int> 12, 30, 9, 110, 58, 66, 2, 78, 49, 41, 43, 19, 7, 8, 20, 6
## $ liveseagrass <dbl> 1.77, 5.50, 3.30, 6.37, 2.08, 3.22, 1.52, 5.80, 1.58, 3.3~
## $ deadseagrass <dbl> 1.17, 2.83, 1.24, 1.07, 1.08, 0.58, 0.31, 1.04, 0.80, 0.6~
## $ driftalgae   <dbl> 0.00, 0.00, 0.00, 0.25, 0.54, 0.13, 0.05, 0.27, 0.38, 0.3~
## $ epiphyte     <dbl> 0.01, 0.07, 0.12, 0.05, 0.05, 0.13, 0.01, 0.28, 0.00, 0.0~

The amphipods data contains 16 samples containing 5 variables of interest, excluding sample as it is an identification variable. The response variable amphipods is of integer structure, and predictor variables all continuous.

# Show top 5 samples of amphipods
amphipods %>%
  arrange(desc(amphipods)) %>%
  top_n(5, amphipods)

##   sample amphipods liveseagrass deadseagrass driftalgae epiphyte
## 1      4       110         6.37         1.07       0.25     0.05
## 2      8        78         5.80         1.04       0.27     0.28
## 3      6        66         3.22         0.58       0.13     0.13
## 4      5        58         2.08         1.08       0.54     0.05
## 5      9        49         1.58         0.80       0.38     0.00

The top 5 output, in descending order, both the value of live seagrass and number of amphipods are decreasing, indicating a possible significant predictor variable in the number of amphipods from sample locations.

# Create plot objects
g_amph1 <- ggplot(amphipods, aes(liveseagrass, amphipods)) +
              geom_point(col = "aquamarine3") +
              geom_smooth(method = "lm", se = F) +
              ggtitle("Amphipods vs Live Seagrass")
  
g_amph2 <- ggplot(amphipods, aes(deadseagrass, amphipods)) +
              geom_point(col = "coral4") +
              geom_smooth(method = "lm", se = F) +
              ggtitle("Amphipods vs Dead Seagrass")
   
g_amph3 <- ggplot(amphipods, aes(driftalgae, amphipods)) +
              geom_point(col = "aquamarine4") +
              geom_smooth(method = "lm", se = F) +
              ggtitle("Amphipods vs Drift Algae")
  
g_amph4 <- ggplot(amphipods, aes(epiphyte, amphipods)) +
              geom_point(col = "coral2") +
              geom_smooth(method = "lm", se = F) +
              ggtitle("Amphipods vs Epiphyte")

# Grid ggplots together to compare
gridExtra::grid.arrange(g_amph1, g_amph2, g_amph3, g_amph4)

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

# Correlation Matrix from Hmisc package
Hmisc::rcorr(as.matrix(amphipods[,-1]))

##              amphipods liveseagrass deadseagrass driftalgae epiphyte
## amphipods         1.00         0.62         0.08       0.65     0.39
## liveseagrass      0.62         1.00         0.43       0.13     0.49
## deadseagrass      0.08         0.43         1.00      -0.15     0.11
## driftalgae        0.65         0.13        -0.15       1.00     0.07
## epiphyte          0.39         0.49         0.11       0.07     1.00
## 
## n= 16 
## 
## 
## P
##              amphipods liveseagrass deadseagrass driftalgae epiphyte
## amphipods              0.0104       0.7807       0.0067     0.1404  
## liveseagrass 0.0104                 0.1007       0.6326     0.0516  
## deadseagrass 0.7807    0.1007                    0.5737     0.6775  
## driftalgae   0.0067    0.6326       0.5737                  0.7854  
## epiphyte     0.1404    0.0516       0.6775       0.7854

The scatterplot’s and correlation matrix show significant correlations between the number of amphipods and liveseagrass \(R = 0.62, p = 0.0104\), indicating a moderate/strong positive correlation. Amphipods to drift algae also have a moderate/strong positive correlation with \(R = 0.65, p = 0.0067\). May be some colinearity happening between predictor variables live seagrass and epiphyte, from \(R =0.49, p =0.0516\), as the p-value is approaching 0.005 significance. This is expected as epiphyte live on alive seagrass’s.

# Summary statistics without sample using psych library, excluding sample
psych::describe(amphipods[,-1])

##              vars  n  mean    sd median trimmed   mad  min    max  range skew
## amphipods       1 16 34.88 31.00  25.00   31.86 26.69 2.00 110.00 108.00 0.87
## liveseagrass    2 16  2.97  1.72   2.84    2.94  1.36 0.00   6.37   6.37 0.49
## deadseagrass    3 16  0.99  0.55   0.94    0.90  0.27 0.31   2.83   2.52 2.17
## driftalgae      4 16  0.15  0.16   0.07    0.13  0.10 0.00   0.54   0.54 0.90
## epiphyte        5 16  0.06  0.07   0.05    0.05  0.06 0.00   0.28   0.28 1.68
##              kurtosis   se
## amphipods       -0.27 7.75
## liveseagrass    -0.63 0.43
## deadseagrass     5.07 0.14
## driftalgae      -0.39 0.04
## epiphyte         2.48 0.02

The response “amphipods” mean, median and sd; 34.88, 25.00, 31.00, respectively. For predictor liveseagrass’s mean, median and sd values are; 2.97, 2.84, 1.72. Deadseagrass mean, median and sd; 0.99, 0.94, 0.55. For drift algae, mean, median and sd; 0.15, 0.07, 0.16. And epiphyte’s mean, median and sd; 0.06, 0.05, 0.07.

library(car)

## Warning: package 'car' was built under R version 4.0.5

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

#VIF for collinearity from car library
vif(lm(amphipods~., data = amphipods))

##       sample liveseagrass deadseagrass   driftalgae     epiphyte 
##     1.449733     1.793542     1.518828     1.099640     1.345243

All VIF values are less than 5, the guideline stated in the ?vif documentation. So collinearity between predictor’s should not be an issue.

Model Building

This section will create seperate models for individual predictor variables and the number of amphipods as response. This can help determine the highest contributors in the model for predictive accuracy. Models will be constructed in hypothesis testing and assesed using p-values, with diagnostic plots to determine if model assumptions are met.

Model 1: Amphipods and Live Seagrass:

Hypothesis

\[H_O: \beta_{LS} = 0\] \[H_A: \beta_{LS} \neq 0\]

Null Distribution

If \(H_O\) is true, X will follow an \(F_{1,14}\)

P-value

\(p-value(8.7502 \geq F_{1,14}) = 0.01038\)

#model 1, noting significance
anova(lm_ls <- lm(amphipods~liveseagrass, amphipods))

## Analysis of Variance Table
## 
## Response: amphipods
##              Df Sum Sq Mean Sq F value  Pr(>F)  
## liveseagrass  1 5543.8  5543.8  8.7502 0.01038 *
## Residuals    14 8869.9   633.6                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion

Using a 5% significance level. There is evidence to reject \(H_O\) from the low p-value, implying a statistically significant fixed effect or relationship between the number of amphipods and amount of live seagrass.

Diagnostic Plots

# Diagnostic Plots, noting failures
par(mfrow=c(2,2))
plot(lm_ls)

The residuals vs fitted plot shows a possible violating of equal variance with a slight convex line of best fit. The data on the normal Q-Q plot mostly fits the Q line, with some observations deviating on the tails. Scale location plot seems to be randomly scattered. Residuals vs Leverage show an influential outlier appearing outside of the conservative cooks distance (0.5). To summarize the model could be influenced by non-equal variance and an extreme outlier. It is likely that the small sample size has an influence.

Model 2: Amphipods and Dead Seagrass

Hypothesis

\[H_O: \beta_{DS} = 0\] \[H_A: \beta_{DS} \neq 0\]

Null Distribution

If \(H_O\) is true, X will follow an \(F_{1,14}\)

P-value

\(p-value(0.0806\geq F_{1,14}) = 0.7807\)

# model 2
anova(lm_ds <- lm(amphipods~deadseagrass, amphipods))

## Analysis of Variance Table
## 
## Response: amphipods
##              Df  Sum Sq Mean Sq F value Pr(>F)
## deadseagrass  1    82.5   82.48  0.0806 0.7807
## Residuals    14 14331.3 1023.66

Conclusion

The p-value is much higher then the significance level of 0.05. Providing no evidence to reject \(H_O\). No relationship or fixed effect of Dead Seagrass and number of Amphipods.

Model 3: Amphipods and Drift Algae

Hypothesis

\[H_O: \beta_{DA} = 0\] \[H_A: \beta_{DA} \neq 0\]

Null Distribution

If \(H_O\) is true, X will follow an \(F_{1,14}\)

P-value

\(p-value(10.1\geq F_{1,14}) = 0.006706\)

# Model 3
anova(lm_da <- lm(amphipods~driftalgae, amphipods))

## Analysis of Variance Table
## 
## Response: amphipods
##            Df Sum Sq Mean Sq F value   Pr(>F)   
## driftalgae  1 6040.7  6040.7    10.1 0.006706 **
## Residuals  14 8373.0   598.1                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion

From the low p-value 0.006706, there is evidence against \(H_O\), indicating a significant linear relationship between the number of amphipods measured and the drift algae content in the samples.

Diagnostic Plots

# Diagnostic plots, noting failures
par(mfrow = c(2,2))
plot(lm_da)

The diagnostic plots show unequal variance from the residual vs fitted and scale location plots. The right tail of the Q-Q plot is showing some convection with data deviating away from the Q line. An influential outlier identified in the Residuals vs Leverage plot outside of the conservative cookes distance of 0.5 which has leverage over other observations in the sample. The variable of drift algae has not satisfied the equal variance, normal distribution nor no influential outliers assumptions. The sample size is fairly low with only 16 observation which could be contributing to the failures of the assumption tests.

Model 4: Amphipods and Epiphyte

Hypothesis

\[H_O: \beta_{E} = 0\] \[H_A: \beta_{E} \neq 0\]

Null Distribution

If \(H_O\) is true, X will follow an \(F_{1,14}\)

P-value

\(p-value(2.4428\geq F_{1,14}) = 0.1404\)

# model 4
anova(lm_e <- lm(amphipods~epiphyte, amphipods))

## Analysis of Variance Table
## 
## Response: amphipods
##           Df  Sum Sq Mean Sq F value Pr(>F)
## epiphyte   1  2141.4  2141.4  2.4428 0.1404
## Residuals 14 12272.4   876.6

Conclusion

The high p-value does not give any evidence against \(H_O\), The null hypothesis holds, and no significant linear relationship between epiphyte and number of amphipods.

Multiple Linear Regression

Based on the section above, there are two significant predictor variables, Live Seagrass and Drift Algae. These two predictor variables will be used in constructing a mixed linear model to predict the number of amphipods.

Model 5: Amphipods to Live Seagrass and Drift

Hypothesis

\[H_O: \beta_{LS} = \beta_{DA} = 0\] \[H_O: \beta_{LS} \neq \beta_{DA} \neq 0\]

# model 5
summary(lm_ls_da <- lm(amphipods~liveseagrass+driftalgae, amphipods))

## 
## Call:
## lm(formula = amphipods ~ liveseagrass + driftalgae, data = amphipods)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.078 -11.481  -4.692   2.976  30.492 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -10.127      9.634  -1.051  0.31229   
## liveseagrass    9.812      2.702   3.632  0.00304 **
## driftalgae    108.537     28.265   3.840  0.00205 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.88 on 13 degrees of freedom
## Multiple R-squared:  0.7117, Adjusted R-squared:  0.6673 
## F-statistic: 16.04 on 2 and 13 DF,  p-value: 0.0003086

anova(lm_ls_da)

## Analysis of Variance Table
## 
## Response: amphipods
##              Df Sum Sq Mean Sq F value   Pr(>F)   
## liveseagrass  1 5543.8  5543.8  17.341 0.001111 **
## driftalgae    1 4713.9  4713.9  14.745 0.002046 **
## Residuals    13 4156.0   319.7                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The equation for the line is such: \[\hat{amphipods}= -10.127+9.812(Live Seagrass)+108.537(DriftAlgae)+\epsilon\]

Conclusion

The number of amphipods is expected to increase by 9.812 for every unit increase of Live Seagrass, and increase by 108.537 for every unit increase of Drift Algae. Both predictor variables are have a statistically signifcant relationship to the number of amphipods given by their p-values, 0.00304 for Live Seagrass and 0.00205 for drift algae. There is sufficent evidence from the p-values to reject \(H_O\) and imply relationships from both variables to the number of amphipods. The multiple \(R^2 = 0.7117\), telling us a significant portion of the models variation, roughly 71.1%, can be explained by the two variables Live Seagrass and Drift Algae.

Diagnostic Plots

#Diagnostic plots for model 5, noting failures
par(mfrow = c(2,2))
plot(lm_ls_da)

There is a non-linear pattern in the residuals vs fitted plot, with the residuals showing a convex pattern. Standardized residuals are deviating away from the line of best fit in the tail end of the Q line. Scale Location looks to non-equal variance the spreading past fitted values = 20. Two outliers in the Residuals Vs Leverage plot have leverage over other observations in the data by being located outside cookes conservative distance. This model has not passed assumption tests of normality, equal variance and no extreme outliers. Again, the low sample size could be contributing to this.

Interactions Model

Model 6: Amphipods and Interactions

Hypothesis

\[H_O: \beta_{DS*E}= 0\] \[H_A: \beta_{DS*E}\neq 0\] At least one slope coefficient is not equal to zero.

# model 6
lm_interactions2 <- lm(amphipods~liveseagrass:driftalgae, amphipods)

summary(lm_interactions2)

## 
## Call:
## lm(formula = amphipods ~ liveseagrass:driftalgae, data = amphipods)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.067  -9.660  -1.939   7.625  33.521 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               12.439      4.963   2.507   0.0251 *  
## liveseagrass:driftalgae   47.875      6.848   6.991 6.34e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.14 on 14 degrees of freedom
## Multiple R-squared:  0.7773, Adjusted R-squared:  0.7614 
## F-statistic: 48.88 on 1 and 14 DF,  p-value: 6.336e-06

Conclusion

By only including the models interaction between live seagrass and drift algae, the number of amphipods will increase by 47.875 for every unit of the interaction, starting at a value of 12.439 amphipods. There is strong significance for a relationship given by the low p-value \(p=6.34e-06\). Enough evidence to reject \(H_O\) and imply a relationship exists. The models multiple \(R^2 = 0.7773\), so 77.73% of the variation of the data is captured by the interaction. This gives a final model with only the interaction as predicting amphipods, no single predictor variables, with relatively high accuracy. \[\hat{amphipods} = 12.439 + 47.875(liveseagrass:driftalgae)+\epsilon\]

Diagnostic Plots

# Diagnostic plots for model 6, noting failures
par(mfrow = c(2,2))
plot(lm_interactions2)

From the diagnostic plots, most of the data fits the line of best fit in the Q-Q plot, observations 4 and 6 are a little bit questionable. Consequently, the variance does show non-equal variance and one outlier having leverage on the data. With the limited number of observations, its hard to tell if the model passes assumption tests of equal varianc, data is normally distributed and no extreme outliers. Collecting more data or running a non-parametric test like a boosted regression tree could help build a better prediction.

Interaction & Effects Plot

# Interactions and effects plots
plot(allEffects(lm_interactions2))

The interaction plot shows the relationship between live seagrass, drift algae to the count of amphipods. As live seagrass and drift algae increase, the amphipods increase, from the linear model line in the plots. There is quiet a strong increase at driftalgae = 0.5. This matches the predictive model above, as we can see the amphipods increasing with both the drift algae and live seagrass.

Discussion

# ANOVA for final model predictors
anova(lm_interactions2, lm_ls_da)

## Analysis of Variance Table
## 
## Model 1: amphipods ~ liveseagrass:driftalgae
## Model 2: amphipods ~ liveseagrass + driftalgae
##   Res.Df    RSS Df Sum of Sq F Pr(>F)
## 1     14 3209.4                      
## 2     13 4156.0  1   -946.65

The interaction model of live seagrass and drift algae is the most adequete model. The anova table shows the mixed linear model with no interactions to have a higher \(RSS = 4156.0\) to that of the interaction \(RSS = 3209.4\). As well as the multiple \(R^2\) increasing from \(0.7117\) for no interaction, to \(0.7773\), meaning the interaction has captured more of the models variance. By only including the interaction between the two predictors, its lowering the models complexity and noise in the data by not having non-significant variables regressing to the number of amphipods. The assumption tests for both models failed, althought model 1, the interaction, is closer to passing then no interaction.

Multiple Linear Regression and Model Building

Timor Atkins c3332818

07/03/2022

Q1 a) EDA

b) Dummy Variables

c) Matrix Algebra

d) Linear Model

e)

f)

Q2 - Multiple Regression For Amphipods

Introduction

EDA

Model Building

Model 1: Amphipods and Live Seagrass:

Hypothesis

Null Distribution

P-value

Conclusion

Diagnostic Plots

Model 2: Amphipods and Dead Seagrass

Hypothesis

Null Distribution

P-value

Conclusion

Model 3: Amphipods and Drift Algae

Hypothesis

Null Distribution

P-value

Conclusion

Diagnostic Plots

Model 4: Amphipods and Epiphyte

Hypothesis

Null Distribution

P-value

Conclusion

Multiple Linear Regression

Model 5: Amphipods to Live Seagrass and Drift

Hypothesis

Conclusion

Diagnostic Plots

Interactions Model

Model 6: Amphipods and Interactions

Hypothesis

Conclusion

Diagnostic Plots

Interaction & Effects Plot

Discussion