Problem 1

Set up summarySE function to summarize data.
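summarySE is not part of base R; it is available in the Rmisc package and as a standalone function in the Cookbook for R. As a rough sketch of what it computes, the base-R function below builds the same columns (N, mean, sd, se, ci). The name summarise_groups and the toy data are made up for illustration, and the ci column is assumed to be se times the t quantile on N - 1 degrees of freedom, which is consistent with the ci values printed in the summary table.

```r
# Hypothetical stand-in for summarySE(): N, mean, sd, se and CI half-width per group
summarise_groups <- function(data, measurevar, groupvars, conf = 0.95) {
  # Split the rows by every combination of the grouping variables
  pieces <- split(data, data[groupvars], drop = TRUE)
  rows <- lapply(pieces, function(d) {
    x <- d[[measurevar]]
    x <- x[!is.na(x)]
    n <- length(x)
    se <- sd(x) / sqrt(n)
    out <- d[1, groupvars, drop = FALSE]   # keep the group labels
    out$N <- n
    out[[measurevar]] <- mean(x)
    out$sd <- sd(x)
    out$se <- se
    out$ci <- se * qt(1 - (1 - conf) / 2, df = n - 1)
    out
  })
  result <- do.call(rbind, rows)
  rownames(result) <- NULL
  result
}

# Toy data (made up) with the same column names as LabCultures
set.seed(1)
toy <- data.frame(
  Region = rep(c("North", "South"), each = 20),
  Sex    = rep(c("F", "M"), times = 20),
  Length = rnorm(40, mean = 16, sd = 2)
)
toy_summary <- summarise_groups(toy, measurevar = "Length",
                                groupvars = c("Region", "Sex"))
toy_summary
```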

Summarize and plot mean length (and error bars) by Sex and Region.

LabCultureSummary<-summarySE(LabCultures,measurevar = "Length", groupvars=c("Region","Sex"))
LabCultureSummary
##   Region Sex  N   Length       sd        se        ci
## 1  North   F 52 15.63462 1.645300 0.2281621 0.4580546
## 2  North   M 51 18.80392 3.612587 0.5058634 1.0160564
## 3  South   F 57 14.09649 1.596421 0.2114511 0.4235874
## 4  South   M 90 17.01111 2.722129 0.2869376 0.5701389
# Plots
library(ggplot2)
# Point plot of mean length with standard-error bars
# group=Sex lets geom_line() connect the two Region means within each Sex
ggplot(LabCultureSummary, aes(x=Region, y=Length, color=Sex, group=Sex)) + 
  geom_errorbar(aes(ymin=Length-se, ymax=Length+se), width=.1) +
  geom_line() +
  geom_point()

Problem 2

Look for a significant difference between Sexes in mean length.

model1=lm(Length~Sex, data=LabCultures)
summary(model1)
## 
## Call:
## lm(formula = Length ~ Sex, data = LabCultures)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6596 -1.8303 -0.1596  1.3404 10.3404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.8303     0.2553  58.092  < 2e-16 ***
## SexM          2.8293     0.3399   8.323 5.79e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.665 on 248 degrees of freedom
## Multiple R-squared:  0.2183, Adjusted R-squared:  0.2152 
## F-statistic: 69.27 on 1 and 248 DF,  p-value: 5.794e-15
library(car)
Anova(model1)
## Anova Table (Type II tests)
## 
## Response: Length
##            Sum Sq  Df F value    Pr(>F)    
## Sex        492.11   1  69.273 5.794e-15 ***
## Residuals 1761.77 248                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F = 69.27 on 1 and 248 DF, p = 5.79e-15. The very small p-value indicates a significant difference between Sexes in mean length: males are on average 2.83 length units longer than females (the SexM coefficient).

Look for a significant difference between Regions in mean length.

model2=lm(Length~Region, data=LabCultures)
summary(model2)
## 
## Call:
## lm(formula = Length ~ Region, data = LabCultures)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.881 -2.204 -0.881  1.796 10.796 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.2039     0.2900  59.329  < 2e-16 ***
## RegionSouth  -1.3229     0.3782  -3.498 0.000555 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.943 on 248 degrees of freedom
## Multiple R-squared:  0.04703,    Adjusted R-squared:  0.04319 
## F-statistic: 12.24 on 1 and 248 DF,  p-value: 0.0005547
library(car)
Anova(model2)
## Anova Table (Type II tests)
## 
## Response: Length
##           Sum Sq  Df F value    Pr(>F)    
## Region     106.0   1  12.239 0.0005547 ***
## Residuals 2147.9 248                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

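F = 12.24 on 1 and 248 DF, p = 0.00055. The F value is lower than the one for Sex, but the p-value is still well below 0.05, so there is a significant difference between Regions in mean length: South individuals are on average 1.32 length units shorter than North individuals (the RegionSouth coefficient).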
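Significance is read from the p-value, which depends on F together with its degrees of freedom, rather than from the size of F alone. As a small check, the sketch below recomputes the tail probabilities from the F statistics printed in the two Anova tables above; the numbers are copied from that output and the variable names are made up.

```r
# F statistics copied from the Anova tables for model1 (Sex) and model2 (Region)
f_values <- c(Sex = 69.273, Region = 12.239)

# Upper-tail probability of the F distribution with 1 and 248 degrees of freedom
p_values <- pf(f_values, df1 = 1, df2 = 248, lower.tail = FALSE)
p_values

# With a single numerator df, F is the square of the coefficient's t value
t_values <- c(Sex = 8.323, Region = -3.498)
round(t_values^2, 2)
```

Both p-values should agree with the Pr(>F) column, up to rounding of the printed F statistics.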

Does the effect of Region differ between sexes?

model3=lm(Length~ Region*Sex, data=LabCultures)
library(car)
Anova(model3)
## Anova Table (Type II tests)
## 
## Response: Length
##             Sum Sq  Df F value    Pr(>F)    
## Region      168.00   1 25.9472 6.988e-07 ***
## Sex         554.12   1 85.5806 < 2.2e-16 ***
## Region:Sex    0.96   1  0.1484    0.7004    
## Residuals  1592.81 246                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F = 0.148 on 1 and 246 DF, p = 0.70. The Region:Sex interaction is not significant, so the difference between males and females in the North is not significantly different from the difference between males and females in the South.
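The interaction can also be seen as a difference of differences. A minimal sketch, using the four group means from the LabCultureSummary table in Problem 1 (the variable names are made up):

```r
# Group means of Length copied from the summarySE output
cell_means <- matrix(
  c(15.63462, 18.80392,
    14.09649, 17.01111),
  nrow = 2, byrow = TRUE,
  dimnames = list(Region = c("North", "South"), Sex = c("F", "M"))
)

# Male minus female difference within each Region
sex_gap <- cell_means[, "M"] - cell_means[, "F"]
sex_gap

# Difference between the two sex gaps (South minus North)
interaction_contrast <- unname(sex_gap["South"] - sex_gap["North"])
interaction_contrast
```

The two gaps are about 3.17 in the North and 2.91 in the South, so the contrast is roughly a quarter of a length unit, which is small compared with the standard errors in the summary table.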

Problem 3

Plot model-fitted group means and standard errors.

library(effects)
#Sex
plot(allEffects(model1))

#Region
plot(allEffects(model2))

#Region, sex
plot(allEffects(model3))

Problem 4

Plot Number of Eggs vs. Length and color code points by Region.

library(readxl)
EggData<-read_excel("ManyakBellSotka_AmNat_AllData.xlsx", sheet=3)
names(EggData)[4] <- "Length"
names(EggData)[6] <- "NumberOfEggs"
EggData
## # A tibble: 200 x 7
##    Region Population Number Length `Width (mm)` NumberOfEggs `Dry Mass (mg)`
##    <chr>  <chr>       <dbl>  <dbl>        <dbl>        <dbl>           <dbl>
##  1 North  Nahant          1   14.6         4.19          172             7.2
##  2 North  Nahant          2   12.6         3.42          110             9  
##  3 North  Nahant          3   11.4         3.65           80             3.8
##  4 North  Nahant          4   12.9         3.87          104             4.5
##  5 North  Nahant          5   13.9         4.10          148            10.4
##  6 North  Nahant          6   12.0         3.85           83             5.3
##  7 North  Nahant          7   14.4         4.68          184            10.6
##  8 North  Nahant          8   13.7         4.47           79            10.9
##  9 North  Nahant          9   13.0         4.2            86             9.9
## 10 North  Nahant         10   11.9         3.85           58             8  
## # ... with 190 more rows
ggplot(EggData, aes(x=Length, y=NumberOfEggs, color=Region)) +
  geom_point() +
  ggtitle("Number of Eggs vs Length, by Region") +
  xlab("Length (mm)") +
  ylab("Number of Eggs")

Problem 5

Test whether the relationship between Number of Eggs and Length differs between Regions.

modelEgg=lm(NumberOfEggs~Length*Region, data=EggData)
summary(modelEgg)
## 
## Call:
## lm(formula = NumberOfEggs ~ Length * Region, data = EggData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -102.922  -13.479    0.379   15.391   64.127 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -151.316     21.354  -7.086 2.43e-11 ***
## Length               18.825      1.526  12.333  < 2e-16 ***
## RegionSouth          73.091     29.456   2.481   0.0139 *  
## Length:RegionSouth   -7.454      2.414  -3.088   0.0023 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.44 on 196 degrees of freedom
## Multiple R-squared:  0.7388, Adjusted R-squared:  0.7348 
## F-statistic: 184.8 on 3 and 196 DF,  p-value: < 2.2e-16
Anova(modelEgg)
## Anova Table (Type II tests)
## 
## Response: NumberOfEggs
##               Sum Sq  Df  F value    Pr(>F)    
## Length        107208   1 179.5393 < 2.2e-16 ***
## Region          6371   1  10.6689  0.001286 ** 
## Length:Region   5696   1   9.5382  0.002304 ** 
## Residuals     117037 196                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

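The significant Length:Region interaction (F = 9.54 on 1 and 196 DF, p = 0.0023) indicates that the relationship between Number of Eggs and Length differs between Regions: the fitted slope is 18.83 eggs per mm in the North and 18.825 - 7.454 = 11.37 eggs per mm in the South.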

Problem 6

Plot the two linear regression lines, one for each Region, on the previous plot.

##### Add the two linear Regression lines for each region
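# Intercepts and slopes are typed in from summary(modelEgg):
# North = (Intercept), Length; South = North values plus RegionSouth, Length:RegionSouth
ggplot(EggData, aes(x=Length, y=NumberOfEggs, color=Region)) +
  geom_point() +
  geom_abline(intercept=-151.316, slope=18.825, color="red") +
  geom_abline(intercept=(-151.316+73.091), slope=(18.825-7.454), color="blue") +
  ggtitle("Number of Eggs vs Length, by Region") +
  xlab("Length (mm)") +
  ylab("Number of Eggs")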

#The first geom_abline is for North and the second geom_abline is for South. I could not work out how to pull the coefficients directly from the model object, so the values are typed in by hand from the summary(modelEgg) output.

North: red line, South: blue line
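One way to avoid typing the numbers in by hand is to read them from the fitted model with coef(). The sketch below uses simulated data as a stand-in for EggData (all values are made up) and assumes North is the reference level of Region, as it is in the summary(modelEgg) output. The plotting step only runs if ggplot2 is installed.

```r
# Simulated stand-in for EggData (values are made up)
set.seed(42)
n <- 100
sim <- data.frame(
  Region = factor(rep(c("North", "South"), each = n / 2)),
  Length = runif(n, min = 9, max = 16)
)
sim$NumberOfEggs <- ifelse(sim$Region == "North",
                           -150 + 19 * sim$Length,
                           -80 + 11 * sim$Length) + rnorm(n, sd = 20)

fit <- lm(NumberOfEggs ~ Length * Region, data = sim)
b <- coef(fit)

# One row per Region: the reference level (North) uses the base terms,
# South adds the RegionSouth and Length:RegionSouth terms
region_lines <- data.frame(
  Region    = c("North", "South"),
  intercept = unname(c(b["(Intercept)"], b["(Intercept)"] + b["RegionSouth"])),
  slope     = unname(c(b["Length"], b["Length"] + b["Length:RegionSouth"]))
)
region_lines

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(sim, aes(x = Length, y = NumberOfEggs, color = Region)) +
    geom_point() +
    # Mapping color to Region makes each line match the color of its points
    geom_abline(data = region_lines,
                aes(intercept = intercept, slope = slope, color = Region))
  print(p)
}
```

For a model with a full Length * Region interaction, adding geom_smooth(method = "lm", se = FALSE) to the scatter plot should draw the same two lines, because with color mapped to Region it fits a separate regression for each Region.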

Problem 7

For the Egg Data model, make plots that explore whether the residuals of the model are normally distributed.

EggResid=resid(modelEgg)
qqnorm(EggResid)
qqline(EggResid)

The Q-Q plot is approximately linear, so the residuals look roughly normally distributed. They may be heavy-tailed, since some points near both ends diverge from the line.
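A histogram and a Shapiro-Wilk test can back up the visual reading of the Q-Q plot. This is only a sketch on simulated heavy-tailed values (made up, drawn from a t distribution); in the actual analysis the same calls would be applied to EggResid.

```r
# Made-up heavy-tailed "residuals" for illustration only
set.seed(7)
sim_resid <- rt(200, df = 3)

# Shapiro-Wilk test: a small p-value suggests departure from normality
sw <- shapiro.test(sim_resid)
sw

# Histogram as a second visual check of the tails
hist(sim_resid, breaks = 30, main = "Simulated residuals", xlab = "Residual")
```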

Look at whether the variance of the residuals increases as Length increases.

ggplot(EggData,aes(x=Length,y=EggResid))+geom_point()+geom_hline(yintercept=0)

Variance appears to increase as length increases.
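The visual impression can be checked roughly by regressing the absolute residuals on Length: a positive slope suggests that the spread grows with Length. The sketch below uses simulated data in which the noise is built to increase with length (all values made up); it is an informal check, not a formal test. The car package loaded earlier also provides ncvTest() for a score test of non-constant variance.

```r
# Simulated data where the noise standard deviation grows with length
set.seed(3)
len  <- runif(200, min = 9, max = 16)
eggs <- -150 + 19 * len + rnorm(200, sd = 2 * (len - 8))

sim_fit <- lm(eggs ~ len)

# Regress the size of the residuals on length
spread_fit <- lm(abs(resid(sim_fit)) ~ len)
summary(spread_fit)$coefficients
```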

Look at whether the variance of the residuals varies between regions.

plot(modelEgg)

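plot(modelEgg) draws the standard lm diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage). None of them separates the residuals by Region, so I am genuinely unsure how to read the role of Region from these graphs.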
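A more direct way to look at this question is to group the residuals by Region and compare their spread. The sketch below uses simulated data in which the North noise is deliberately larger (all values made up); in the actual analysis, resid(modelEgg) and EggData$Region would take the place of the simulated columns. fligner.test() is one possible test for equal variances and is used here only as an example.

```r
# Simulated stand-in for EggData with unequal noise between Regions
set.seed(11)
sim <- data.frame(
  Region = factor(rep(c("North", "South"), each = 100)),
  Length = runif(200, min = 9, max = 16)
)
sim$NumberOfEggs <- -100 + 15 * sim$Length +
  rnorm(200, sd = ifelse(sim$Region == "North", 30, 15))

sim_fit <- lm(NumberOfEggs ~ Length * Region, data = sim)
sim$resid <- resid(sim_fit)

# Residual variance within each Region
resid_var <- tapply(sim$resid, sim$Region, var)
resid_var

# Side-by-side boxplots show the spread of residuals per Region
boxplot(resid ~ Region, data = sim, ylab = "Residual")

# Fligner-Killeen test of equal variances across Regions
ft <- fligner.test(resid ~ Region, data = sim)
ft
```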