p-value
A p-value is the probability of getting a test statistic at least as extreme as the one you observed, assuming the null hypothesis is true. You compare it to a cutoff chosen in advance, the significance level (alpha, e.g. 0.05 or 0.01), to decide whether the means of your samples are different enough from one another to be statistically significant. Generally, as you lower alpha you decrease your Type I error rate but increase your Type II error rate: the number of times you fail to reject your null when it is actually false goes up, and you lose power. When Dr. Bingham lectured he mentioned that Type II error is beta, and I now realize this beta (the Type II error rate, where power = 1 − beta) is a different beta from the coefficients in a multiple linear model. Maybe as you increase your explanatory variables, your power decreases and the rate of Type II error goes up.
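A quick sketch (made-up numbers, not from the lecture) of the alpha/power trade-off using base R's power.t.test(): the same design has less power once you demand a stricter significance level.
power.t.test(n = 20, delta = 1, sd = 2, sig.level = 0.05)$power  # power at alpha = 0.05
power.t.test(n = 20, delta = 1, sd = 2, sig.level = 0.01)$power  # smaller: beta went up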
Gaussian Distributions
A Gaussian (normal) distribution is bell-shaped with a single central tendency; the special case with a mean of 0 and a standard deviation of 1 is the standard normal distribution (SND), as defined in Quinn and Keough, 2002. The reason we are always looking for normality in our data is that we want to be able to relate our samples to the standard normal curve, which lets us estimate the population mean and find differences between one sample and another, so that we can truly understand the population. I know Dr. Sobocinski likes to mention that we rarely know our true pop\(^n\) mean – and last year in Dr. Kastner’s class I learned that one example of knowing the population mean is the Human Genome Project (also an example in The Analysis of Biological Data by Whitlock and Schluter, one of my all-time favorite textbooks).
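A small illustration in base R (not from the textbook) of what the standard normal curve looks like and the familiar 95% rule:
pnorm(1.96) - pnorm(-1.96)   # about 0.95 of the area lies within 1.96 SD of the mean
curve(dnorm(x), from = -4, to = 4, ylab = "Density",
      main = "Standard normal: mean 0, sd 1")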
Collinearity
This is when two explanatory variables are strongly correlated with one another. It becomes a problem when you are trying to build a model. Linear models let you predict the change in y for each 1-unit change in x, as long as you hold the remaining explanatory variables constant. When (multi)collinearity occurs, you cannot change one explanatory variable while holding the other constant, so the coefficient estimates become unstable and their standard errors inflate, leading to an unreliable model.
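A simulated sketch of the problem (made-up data; the vif() call assumes the car package is installed): x2 is nearly a copy of x1, so the model cannot separate their effects.
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)        # nearly collinear with x1
y  <- 2 * x1 + rnorm(100)
summary(lm(y ~ x1 + x2))$coefficients  # note the inflated standard errors
car::vif(lm(y ~ x1 + x2))              # variance inflation factors well above 10 flag trouble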
Simple Additive Model
\(Y_{ij} = \mu + \tau_i + \epsilon_{ij}\)
The simple additive model above is the model for a one-way ANOVA. Slide 41 of lecture 6 describes this equation as each point’s outcome (\(Y_{ij}\)) in a model whose explanatory variables are factors. In this model, j indexes a single observation within factor level i. Mu (\(\mu\)) is the grand mean over all groups, tau (\(\tau_i\)) is the effect of group i (its deviation from the grand mean), and epsilon (\(\epsilon_{ij}\)) is the independent random error (all the unmeasured influences) for each individual. The quantity \(\mu + \tau_i\) gives you the mean response for the \(i^{th}\) treatment.
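A toy one-way layout (hypothetical groups A/B/C, simulated values) showing where \(\mu\) and the \(\tau_i\) live in R’s output:
set.seed(2)
dat <- data.frame(group = factor(rep(c("A", "B", "C"), each = 5)),
                  y = rnorm(15, mean = rep(c(10, 12, 15), each = 5)))
fit <- aov(y ~ group, data = dat)
model.tables(fit, type = "means")    # the grand mean (mu) and the group means (mu + tau_i)
model.tables(fit, type = "effects")  # the tau_i: each group's deviation from the grand mean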
Multiple Linear Regression Model
\(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\)
This is a multiple linear regression model where Y is calculated as an intercept plus the slope of one predictor variable times that predictor, plus the slope of another predictor variable times that predictor, plus random error. Note that you can keep adding predictor variables (either numerical or categorical) to this equation, but each additional predictor uses up degrees of freedom, so the model’s power drops and it becomes easier to overfit; a minimal sketch follows below.
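A minimal sketch with simulated data: each coefficient from coef(lm()) is the change in y per one-unit change in that predictor, holding the other predictor constant.
set.seed(6)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(50)
coef(lm(y ~ x1 + x2, data = d))  # beta_0 (intercept), beta_1, beta_2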
The multiple linear regression (MLR) model is different from the simple additive model (SAM) in the following ways:
MLR can be used for a mix of data types (i.e. both categories and continuous data)
SAM usually has categorical factors as predictors
Their equations are different as explained above, and SAM is used for comparing group means while MLR is used for determining the relationship between variables
Their assumptions are somewhat different. SAM assumes homogeneity of variance and independent observations, while MLR assumes linearity, independence of the errors (residuals), normally distributed errors (residuals), and homogeneity of variance.
The Gaussian Distribution Equation
\(\epsilon \sim N(0, \sigma^2)\)
In mathematical words this equation says that the errors/residuals (\(\epsilon\)) are distributed (~) normally (N) with a mean of 0 and a variance of \(\sigma^2\). This is an assumption of linear modeling about the errors, not the raw data: the response only needs to be normal around the fitted values, which is why we check the residuals.
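A simulated example (made-up numbers) of what that assumption means in practice: build data whose errors really are N(0, \(\sigma^2\)) and check the residuals of the fitted model, not the raw y values.
set.seed(3)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50, mean = 0, sd = 0.5)  # epsilon ~ N(0, 0.5^2)
r <- resid(lm(y ~ x))
qqnorm(r); qqline(r)  # the residuals, not the raw y, should hug the line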
Okay. This is an interesting question, because both are linear models (just ask Dr. Sobocinski or Dr. Bunn) and they both help find the mathematical relationship between the predictors and the response variable.
An ANOVA generally has factors as explanatory variables to explain 1 continuous response variable (R actually converts the categories into numerical variables behind the scenes), while regression typically has numerical predictors to explain 1 continuous response.
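A peek at what R does behind the scenes with a factor (toy data, hypothetical treatment labels): model.matrix() shows the 0/1 dummy (treatment-contrast) columns it actually fits.
df <- data.frame(trt = factor(c("Low", "Med", "High"), levels = c("Low", "Med", "High")),
                 y = c(1, 2, 3))
model.matrix(y ~ trt, data = df)  # intercept plus one 0/1 column per non-reference level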
Replicated regression designs (taking multiple observations at each level of a continuous predictor) can be optimal in experimental design because they are very efficient.
Statistical inference
This is what we are doing when we run ANOVAs and multiple linear regressions; we take a sample and use it to draw conclusions about the population. We make inferences in our analyses in two ways: point estimates (estimates of population parameters based on sample data) and hypothesis testing (an inference about our inference). This response is starting to sound like the 2010 film, Inception!
Tidy data
This is a broad term that essentially means to “clean” your data for analyses; in the tidyverse sense it also means every column is a variable and every row is an observation. The structure of your data (columns, rows) can influence what you can do in R. The labels on your data can influence what you can do in R. Whether your data are a tibble or a data.frame can influence what you can do in R. There are SO MANY ways that your data can be hard to work with - so tidying it up, and making it easy to analyze, is a must before doing any data analysis.
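One common tidying step, sketched with a hypothetical wide table (not the frog data): tidyr::pivot_longer() turns one-column-per-treatment into one row per observation.
library(tidyr)
wide <- data.frame(frogID = 1:3, Low = c(10, 12, 15), High = c(30, 35, 38))
pivot_longer(wide, cols = c(Low, High),
             names_to = "treatment", values_to = "colonies")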
Linear model selection
Model selection is the process you go through during a regression analysis to see which linear model describes your data best. There are a few ways to assess this; in this document I used the Akaike information criterion (AIC), and more options can be found in Zuur et al., 2009, Appendix A. You have to determine which variables best predict your response variable, especially when you have a lot of variables to select from.
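A sketch with simulated data of the idea AIC() captures (and that step() automates later in this document): compare candidate models directly, smaller AIC is preferred.
set.seed(4)
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
d$y <- 1 + 0.8 * d$x1 + rnorm(40)    # only x1 actually matters here
AIC(lm(y ~ x1, data = d), lm(y ~ x1 + x2, data = d))  # the simpler model should win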
Sums of squares as error
I will answer this using linear regression analysis. The residual sum of squares in regression is the sum of the squared differences between the observed response values and the values the model predicts; each of those differences is a residual, i.e. the error in the model. The model you are looking for minimizes this error, and that is why you go through the process of model selection. You are trying to find the line that best fits your observed data, leaving the smallest residuals.
Understanding variance is important in statistical testing because you want to know if the model you are creating fits well. You can draw a line through the points in your plot (just see all the times I did it in lab 05 with ggplot() using the lm statement), but that does not mean the line minimizes the residual variance. The steps you take in your regression analysis help you find the line that leaves the least unexplained variance.
Variance is standard deviation squared; and in regression analyses, each plotted point contributes one squared deviation (its residual) to the total.
When a regression slope is significantly different from 0, it suggests that there is a linear relationship between the independent and dependent variables. To assess this we look at the t statistic for the slope from summary(lm()); the F test from anova(lm()) gives the equivalent answer when there is a single predictor.
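A self-contained sketch with simulated data tying the last few points together: the residual sum of squares computed by hand, the slope’s t-test from summary(lm()), and the matching F from anova(lm()).
set.seed(5)
x <- rnorm(30)
y <- 2 + 1.5 * x + rnorm(30)
fit <- lm(y ~ x)
sum(resid(fit)^2)             # residual sum of squares, computed directly
summary(fit)$coefficients     # slope estimate with its t value and Pr(>|t|)
anova(fit)                    # the same test as an F ratio (F = t^2 for one predictor)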
Scientific Question: What is the effect of pesticides on fungal infection (\(zoospores/mL\)) in white spotted 3-toed frogs (WTF)?
frogData <- read.csv("frog_fungus.csv")
Variables
Sex (categorical nominal), 2 levels
Pesticide treatment (categorical ordinal), 3 levels
Fungal Colonies (numerical continuous)
Does frog sex play a role in fungal infection severity?
Is there an interaction between sex and treatment?
Analyze the data to determine whether fungal infection rates were influenced by pesticide treatment, by sex, or by the interaction between the two.
Planned Analysis:
2-factor ANOVA
Testing the equation:
Y = \(\mu\) + A + B + A \(\times\) B + \(\epsilon\)
FUNGAL SPORES = \(\mu\) + TREATMENT + SEX + TREATMENT \(\times\) SEX + \(\epsilon\)
All possible hypotheses are:
H\(_0\): \(\mu_1\) = \(\mu_2\) = \(\mu_3\) ; H\(_A\): not all treatment means are equal
H\(_0\): \(\mu_m\) = \(\mu_f\) ; H\(_A\): \(\mu_m\) \(\neq\) \(\mu_f\)
H\(_0\): There is no interaction between factors Pesticide Treatment and Sex ; H\(_A\): There is an interaction between factors Pesticide Treatment and Sex
Explore the data
a <- data.frame(xtabs(~ PesticideTreatment, frogData))              #sample size per treatment
b <- aggregate(FungalColonies ~ PesticideTreatment, frogData, mean) #treatment means - something is going on here!
c <- aggregate(FungalColonies ~ Sex, frogData, mean)                #means by sex
frogData$treatment <- fct_relevel(frogData$PesticideTreatment,      #fct_relevel() is from forcats
                                  "Low", "Med", "High")             #put levels in dose order
d <- ggplot(frogData, aes(treatment, FungalColonies, color = Sex)) +
  geom_jitter(position = position_dodge(0.3)) +
  #black points are the treatment means computed in `b` above
  geom_point(aes(x = "High", y = 34.26), color = "black", shape = 20) +
  geom_point(aes(x = "Med",  y = 23.29), color = "black", shape = 20) +
  geom_point(aes(x = "Low",  y = 13.5),  color = "black", shape = 20) +
  labs(title = "Data Exploration",
       x = "Treatment",
       y = "Fungal Colonies (zoospores/mL)")
Evaluate the model assumptions
Normality of residuals
Homogeneity in variance in residuals
model <- lm(FungalColonies ~ PesticideTreatment + Sex + PesticideTreatment*Sex, data=frogData) #the A*B term already contains the main effects, so this is equivalent to PesticideTreatment * Sex
dp <- ggplot(model, aes(x=model$residuals)) +
geom_density(fill="blue", colour=NA, alpha=.2) +
geom_line(stat = "density") +
expand_limits(y = 0) +
labs(title="Density plot residuals") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Residual") +
ylab("Density")
qqp <- ggplot(model, aes(sample = model$residuals))+
stat_qq() + stat_qq_line(col="steelblue")+
labs(title="QQ residuals",
x="Theoretical Qualtiles",
y="Sample Quantiles")
fitres.p <- ggplot(model, aes(model$fitted.values, model$residuals))+
geom_point()+
geom_hline(yintercept = 0, linetype="dashed", col="steelblue")+
labs(title="Residuals vs. Fitted",
x="Fitted Values", y="Residuals")
ggarrange(dp, qqp, fitres.p, labels = c("A", "B", "C"))
Fig. 2: A. Density plot of residuals shows that the residuals are approximately normal; the bump on the right would likely shrink with a larger sample size. B. QQ plot of the residuals shows that most of the points lie close to the blue line, although the tails suggest the residuals are not exactly normal. C. Residuals vs. fitted values show that the points on the left cover less range than those on the right, which suggests the data may violate the assumption of equal variance. I tried sqrt, log10, and square transformations and none improved the variance, so I will leave the data as they are and assume the sample size is too small to expect perfectly equal variance of residuals.
#looking deeper into normality
e <- frogData %>%                       #data
  group_by(PesticideTreatment) %>%      #break it out by treatment group
  summarise(statistic = shapiro.test(FungalColonies)$statistic,
            p.value = shapiro.test(FungalColonies)$p.value)
## with this test we fail to reject H0, so there is no evidence against normality and our data are normal enough
#looking deeper into homoscedasticity
f <- tapply(frogData$FungalColonies, frogData$PesticideTreatment, sd) #the standard deviations are not exactly equal: High is almost 3 times greater than Low. I tried some transformations and they did not help, so I will assume our sample size is too small and move on.
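An optional extra check that was not part of the original analysis (assumes the car package is installed): Levene's test gives a formal p-value for homogeneity of variance to back up the eyeballed SDs.
car::leveneTest(FungalColonies ~ factor(PesticideTreatment), data = frogData) #H0: group variances are equal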
Test the Model
See the code for testing the model (I did not want to overcrowd the document).
anova(model) #the interaction term is not significant.
## Analysis of Variance Table
##
## Response: FungalColonies
## Df Sum Sq Mean Sq F value Pr(>F)
## PesticideTreatment 2 2157.21 1078.60 35.8156 6.242e-08 ***
## Sex 1 27.07 27.07 0.8990 0.3525
## PesticideTreatment:Sex 2 84.25 42.12 1.3987 0.2663
## Residuals 24 722.77 30.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#determining which means are different (even though it is pretty obvious)
forTukey <- aov(FungalColonies ~ PesticideTreatment, data=frogData) #treatment only, since sex and the interaction were not significant
TukeyHSD(forTukey)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = FungalColonies ~ PesticideTreatment, data = frogData)
##
## $PesticideTreatment
## diff lwr upr p adj
## Low-High -20.76 -26.92297 -14.59703 0.0000000
## Med-High -10.97 -17.13297 -4.80703 0.0004200
## Med-Low 9.79 3.62703 15.95297 0.0014638
f <- describeBy(frogData, list(frogData$PesticideTreatment)) #describeBy() is from the psych package; note this overwrites the earlier `f`
ggplot(frogData, aes(treatment, FungalColonies, color=treatment))+
geom_boxplot(alpha=1)+
scale_color_manual(values = c("#00a9ff", "#7cae00", "#ed68ed"))+
labs(title = "Pesticide treatment on fungal spores of frogs",
x= "Treatment",
y = "Fungal Colonies (zoospores/mL)")
Fig. 3: The output for the interaction term (not included here but available in the code), PesticideTreatment:Sex, is not significant. This shows that the impact of pesticide treatment on fungal colony abundance does not depend on the sex of the frog, and therefore sex does not play a role in fungal infection severity. There is a statistically significant effect of pesticide treatment on the number of fungal colonies (F\(_{2,24}\) = 35.8, p < 0.001), but no statistically significant effect of sex. The Tukey HSD test provided evidence that mean fungal colonies (zoospores/mL) differed among all three treatments, with a low concentration of pesticides having the fewest fungal zoospores (M = 13.5; SE = 0.83) and a high concentration having the most (M = 34.26; SE = 2.70) (Tukey HSD, p < 0.001). It seems that an increase in pesticides is harmful to frogs and beneficial for fungi.
Scientific Question: Do geoducks (Panopea generosa) grow better (\(mm/yr\)) in years of high primary productivity (\(g/m^3\))?
geoData <- read.csv("geoducks3.csv")
Variables
Geoduck Growth (continuous)
Productivity (continuous)
Current Velocity (continuous)
Temperature (continuous)
Is primary productivity a predictor of geoduck growth?
Is primary productivity confounded by current velocity and temperature near the site?
Planned Analysis:
Multiple Linear Regression
Testing the hypotheses:
For the t-test part of the regression: H\(_0\): \(\beta = 0\); H\(_A\): \(\beta \neq 0\)
For the ANOVA part of the regression: H\(_0\): \(Y = \alpha\) (intercept-only model); H\(_A\): \(Y = \alpha + \beta X\) (at least one slope is non-zero)
Basic linear model
simple <- lm(geo.growth ~ chl, data=geoData) #simple one-predictor model
geoData1 <- data.frame(cbind(geoData$geo.growth,
                             geoData$chl,
                             geoData$current,
                             geoData$temp))
names(geoData1)[1:4] <- c("GeoGrowth", "Chlorophyll", "Current", "Temp")
geoData.noNA <- na.omit(geoData1) #drop rows with missing values - tighten up!
tidydata <- head(geoData.noNA)
gt(tidydata) |>
tab_caption(caption = "Table 1: Tidy data 63x4")
| GeoGrowth | Chlorophyll | Current | Temp |
|---|---|---|---|
| 0.02214067 | 1.772549 | 1.023868 | 12.15499 |
| 0.13766518 | 1.817188 | 2.230128 | 10.41907 |
| 0.07149112 | 1.811118 | 1.883350 | 13.05106 |
| 0.19003504 | 1.819651 | 2.433210 | 12.66483 |
| 0.17337719 | 1.824780 | 2.330131 | 13.35653 |
| 0.27014589 | 1.809783 | 3.495209 | 12.16485 |
summarystats <- data.frame(sapply(geoData.noNA,
function(x) c(mean = mean(x),
sd = sd(x))))
gt(summarystats, rownames_to_stub = TRUE)|>
tab_caption(caption = "Table 2: Summary Statistics")
| GeoGrowth | Chlorophyll | Current | Temp | |
|---|---|---|---|---|
| mean | 0.1595455 | 1.82262796 | 1.524272 | 12.110255 |
| sd | 0.1083646 | 0.04980685 | 1.920357 | 2.384713 |
#reorder Chlorophyll (x) data from small to large for plotting; note this replaces geoData with the tidy, NA-free version
geoData <- geoData.noNA[order(geoData.noNA$Chlorophyll),]
#testing assumption of normality for geoduck data
one <- ggplot(geoData, aes(Chlorophyll))+
geom_histogram(fill="pink", bins=20)+
labs(title="Chlorophyll Distribution", x="Chlorophyll")
two <- ggplot(geoData, aes(GeoGrowth))+
geom_histogram(fill="plum", bins=15)+
labs(title="Geoduck Growth Distribution", x=("Geoduck growth (mm/yr)"))
three <- ggplot(geoData, aes(Chlorophyll, GeoGrowth))+
geom_point(color="steelblue")+
labs(title="Geoduck growth vs. Chlorophyll", x="Chlorophyll (g/m^2)", y="Geoduck Growth (mm/yr)")
ggarrange(one, two, three, labels = c("A", "B", "C"))
Fig. 4: A–B. Histograms checking the distributions of chlorophyll and geoduck growth (the assumption of normality), and C. geoduck growth plotted against chlorophyll. Geoduck growth looks a little strange, but who is to say it is too strange? It seems that chlorophyll is not the greatest predictor of geoduck growth. In the remainder of the document I leave chlorophyll out, as the Akaike information criterion points to a better model, called the optimal model in this document.
Fit the model and tidy the data, then fit models using model selection
corr <- ggpairs(geoData)
corr
Fig. 5: Pairs plot (correlogram) showing collinearity between chlorophyll and current as well as between temperature and current. This collinearity can inflate the standard errors of the coefficients in the model, but the value, 0.56, is not large enough to worry about.
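As a numeric follow-up to the pairs plot (not part of the original analysis; assumes the car package is installed), variance inflation factors put a number on the collinearity: values near 1 mean it is mild, values above roughly 5–10 signal trouble.
car::vif(lm(GeoGrowth ~ Chlorophyll + Temp + Current, data = geoData)) #VIF for each predictor in the full model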
model1 <- lm(GeoGrowth ~ Chlorophyll + Temp + Current, data=geoData)
model3 <- lm(GeoGrowth ~ Chlorophyll + Temp, data=geoData)
model4 <- lm(GeoGrowth ~ Temp + Current + Temp*Current, data=geoData) #equivalent to Temp*Current
model5 <- lm(GeoGrowth ~ Temp*Current, data=geoData)
model6 <- lm(GeoGrowth ~ Temp + Current, data=geoData)
#model6 has the highest adjusted R^2 = 0.7184
summary(model6)
##
## Call:
## lm(formula = GeoGrowth ~ Temp + Current, data = geoData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.179455 -0.033428 0.001353 0.032344 0.131494
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.277229 0.038918 -7.123 1.54e-09 ***
## Temp 0.034766 0.003322 10.464 3.73e-15 ***
## Current 0.010330 0.004126 2.504 0.015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0575 on 60 degrees of freedom
## Multiple R-squared: 0.7275, Adjusted R-squared: 0.7184
## F-statistic: 80.1 on 2 and 60 DF, p-value: < 2.2e-16
anova(model6) #note, I ran through summary and anova for each of the models
## Analysis of Variance Table
##
## Response: GeoGrowth
## Df Sum Sq Mean Sq F value Pr(>F)
## Temp 1 0.50895 0.50895 153.9282 < 2e-16 ***
## Current 1 0.02072 0.02072 6.2681 0.01503 *
## Residuals 60 0.19838 0.00331
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#the comparisons above found that the best model is model6
step(model6) #backward selection by AIC (Akaike information criterion)
## Start: AIC=-356.92
## GeoGrowth ~ Temp + Current
##
## Df Sum of Sq RSS AIC
## <none> 0.19838 -356.92
## - Current 1 0.02072 0.21911 -352.66
## - Temp 1 0.36204 0.56042 -293.50
##
## Call:
## lm(formula = GeoGrowth ~ Temp + Current, data = geoData)
##
## Coefficients:
## (Intercept) Temp Current
## -0.27723 0.03477 0.01033
optimalModel <- model6 #step() retained Temp + Current, so model6 is the optimal model
Assess the fit of the model
dp <- ggplot(optimalModel, aes(x=model6$residuals)) +
geom_density(fill="blue", colour=NA, alpha=.2) +
geom_line(stat = "density") +
expand_limits(y = 0) +
labs(title="Density plot residuals") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Residual") +
ylab("Density")
qqp <- ggplot(optimalModel, aes(sample = model6$residuals))+
stat_qq() + stat_qq_line(col="steelblue")+
labs(title="QQ residuals",
x="Theoretical Qualtiles",
y="Sample Quantiles")
fitres.p <- ggplot(optimalModel, aes(model6$fitted.values, model6$residuals))+
geom_point()+
geom_hline(yintercept = 0, linetype="dashed", col="steelblue")+
labs(title="Residuals vs. Fitted",
x="Fitted Values", y="Residuals")
std.resid.p<-ggplot(optimalModel, aes(.fitted, sqrt(abs(.stdresid))))+
geom_point(na.rm=TRUE)+
stat_smooth(formula = y ~ x, method = "loess", na.rm = TRUE)+
xlab("Fitted Value")+
ylab(expression(sqrt("|Std. residuals|")))+
ggtitle("Scale-Location")
ggarrange(dp, qqp, fitres.p, std.resid.p, labels = c("A", "B", "C", "D"))
Fig. 6: A. Density plot of residuals shows that the residuals are approximately normal and the tail on the right is small enough not to cause concern. B. QQ plot of the residuals shows that most of the points lie close to the blue line, although a few points on the lower end sit a bit off the line; it is not concerning though. C. Residuals vs. fitted values show the points are approximately balanced above and below the line. D. Scale-location plot, another kind of residuals vs. fitted plot, shows no clear trend in residual variance across the fitted values.
Create a summary statement
#turn on for html
plot_ly(geoData,
x = ~GeoGrowth,
y = ~Temp,
z = ~Current,
color = ~Temp) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'GeoGrowth'),
yaxis = list(title = 'Temp'),
zaxis = list(title = 'Current')))
Fig. 7: Geoduck model. The optimal fitted regression model chosen by the Akaike information criterion was lm(GeoGrowth ~ Temp + Current). The overall regression was statistically significant (adjusted \(R^2\) = 0.718, F\(_{2,60}\) = 80.1, p < 0.001). Both temperature and current significantly predicted geoduck growth, while chlorophyll was not a good predictor. Geoduck growth increases 0.035 mm/yr for each one-unit increase in temperature, and 0.010 mm/yr for each one-unit increase in current.
#3-d plot for pdf output (cloud() is from the lattice package)
#cloud(GeoGrowth ~ Temp*Current,
# data = geoData,
# screen = list(z = 105, x = -70), panel.aspect = 0.75,
# main="Geoduck growth (mm/yr) model",
# auto.key = F)
5/5 because that fab Linear Regression key!!!
Pretty prepared!
Here are a few lingering things:
Is the model with the highest \(R^2\) always the best model? Or is statistical significance of the model just part I of deciding on the best model (Akaike seems to think it is)?
Why create a simple linear model/histograms/etc. at the beginning when the same information shows up in the pairs plot (correlogram)? Can I do one instead of the other?