## Loading required package: s20x
We are interested in whether the bird density at a particular site can be explained by the total number of foliage at that site.
Bird.df=read.table("birds.txt",header=T)
plot(Total.density~Profile.Area,data=Bird.df)
Bird.lm=lm(Total.density~Profile.Area,data=Bird.df)
plot(Bird.lm,which=1)
normcheck(Bird.lm)
cooks20x(Bird.lm)
summary(Bird.lm)
##
## Call:
## lm(formula = Total.density ~ Profile.Area, data = Bird.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0570 -1.7189 0.6919 1.7879 3.2987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4250 2.4864 1.780 0.0954 .
## Profile.Area 0.1519 0.0749 2.028 0.0607 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.658 on 15 degrees of freedom
## Multiple R-squared: 0.2152, Adjusted R-squared: 0.1629
## F-statistic: 4.113 on 1 and 15 DF, p-value: 0.0607
Bird.lm1=lm(Total.density~Profile.Area,data=Bird.df[-4,])
plot(Bird.lm1,which=1)
normcheck(Bird.lm1)
cooks20x(Bird.lm1)
summary(Bird.lm1)
##
## Call:
## lm(formula = Total.density ~ Profile.Area, data = Bird.df[-4,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1529 -1.3670 0.2921 1.2116 2.9146
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.57677 1.72311 0.915 0.375646
## Profile.Area 0.25750 0.05356 4.808 0.000279 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.722 on 14 degrees of freedom
## Multiple R-squared: 0.6228, Adjusted R-squared: 0.5958
## F-statistic: 23.11 on 1 and 14 DF, p-value: 0.0002786
confint(Bird.lm1)
## 2.5 % 97.5 %
## (Intercept) -2.1189353 5.2724744
## Profile.Area 0.1426246 0.3723821
AreaDensity.fit = lm(Total.density ~ Profile.Area, data= Bird.df)
densityNull.fit = lm(Total.density ~ 1, data = Bird.df)
trendscatter(Total.density ~ Profile.Area, data= Bird.df)
abline(Bird.lm1, col = "red")
preddensity.df = data.frame(Profile.Area = c(30,100))
predict(Bird.lm1, preddensity.df, interval = "prediction")
## fit lwr upr
## 1 9.301871 5.492086 13.11166
## 2 27.327108 18.548973 36.10524
The 95% prediction interval of the bird density for the foilage area of 30 f.p units is between 5.492 and 13.112 bird pairs. The 95% prediction interval of the bird density for the foilage area of 100 f.p units is between 18.549 and 36.105 bird pairs.
Overall I think that the model is useful for prediction due to the notable increase in bird density for each respective foilage data point (30 f.p, and 100 f.p) coinciding with the linear model that we fitted earlier. The predicted value also falls within the expected range of values given from the scatter plot, falling within the range of one standard error.
As the initial plot of the data shows a linear relationship, we have fitted a linear regression model to our data. The residual plot shows randomness around zero with constant variance. Consistent with the initial plot, the observation No. 4 has large profile area but relatively small bird density. It is an influential point (we can see that the coefficent for the slope in the first model changed from 0.15 to 0.26 - more than one standard error, as well as the strength of evidence for a non-zero slope dramatically increasing after observation 4 was deleted), and thus was removed from further analyses. We refit the linear regression model with observation No. 4 removed. There are no overly influential points and all assumptions are satisfied. The effect of profile area is statistically significant.
Our preferred model is:
\(TotalDensity_i=\beta_0+\beta_1 \times ProfileArea_i+ \epsilon_i\), where \(\epsilon_i \sim iid N(0,\sigma^2)\).
Our model explains 62% of the total variation in the data so it is not good for prediction purpose.
We were interested in building a model to estimate bird density, per pair of birds, with a profile area’s foilage (in f.p units)
There was a linear relationship between bird density and profile area foilage (\[P-value \approx 0\]). We estimate that for each one unit increase in f.p units for the profile area, there is to be an estimated 0.143 to 0.372 increase in the bird pair density.
This model would be adequate for prediction. Further analysis using a wider data pool would consildate findings in this study
Limitations of this study were the relatively small data collection pool (17 California Oak woodland sites) restricted seasonally (Most observations were taken in Spring, during the bird breeding season.)
We are interested in seeing if there is a difference in the measures of head sizes between using metal and cardboard callipers.
Measure.df=read.table("measure.txt", header=T)
Measure.diff=Measure.df$Cardboard-Measure.df$Metal
boxplot(Measure.diff,horizontal = TRUE,main="Head diameter, Cardboard-Metal")
summary(Measure.diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.000 0.250 1.000 1.278 2.000 5.000
There seems to be a wide range of variance for the differences between the recorded values of the cardboard and metal calliper: from the lowest at -2cm to the max at 5cm. The differences seem to be more positive than negative, with only 25% of the values being lower then 0.25cm. The mean difference between the cardboard and metal callipers is 1.28cm.
The data pool required for sampling is small (18 participants), as well as the data not being independent (participants head size will be the same for both callipers, affecting both results depending on calliper accuracy). The sample size was not smaller than the population, instead it was the whole population (all N.Z air force pilots)
Formulas: \(T = \frac{\bar{y}-\mu_0}{se(\bar{y})}\) and 95% confidence interval \(\bar{y} \pm t_{df, 0.975} \times se(\bar{y})\)
NOTES: The R code mean(y) calculates \(\bar{y}\). The standard error is \(se(\bar{y}) = \frac{s}{\sqrt{n}}\) where \(s\) is the standard deviation of \(y\) and is calculated by sd(y), and \(n\) is the number of data-points calculated by length(y). The degrees of freedom is \(df = n-1\). The \(t_{df,0.975}\) multiplier is given by the R code qt(0.975, df).
mean_diff = mean(Measure.df$Cardboard - Measure.df$Metal)
std_dev = sd(Measure.df$Cardboard - Measure.df$Metal)
n_diff = length(Measure.df$ID)
tmul_diff = qt(1 - 0.05/2, df = n_diff - 1)
se_diff = std_dev/sqrt(n_diff)
# t-statistic for H0: mu=0:
t_stat_null = (mean_diff - 0)/(se_diff)
# 95% confidence interval for the mean:
CI_Diff = mean_diff + tmul_diff * c(-1,1) * std_dev/sqrt(n_diff)
t.test(Measure.diff)
##
## One Sample t-test
##
## data: Measure.diff
## t = 2.9448, df = 17, p-value = 0.009058
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.362323 2.193233
## sample estimates:
## mean of x
## 1.277778
Note: You should get exactly the same results from the manual calculations and using the \(t.test\) function. Doing this was to give you practice using some R code. The \(t.test\) function also delivers the p-value that we did not calculate above.
Measure.lm=lm(Measure.diff~1)
normcheck(Measure.lm)
cooks20x(Measure.lm)
t_stat_diffm = (coef(summary(Measure.lm))[0])/ (coef(summary(Measure.lm))[1])
confint(Measure.lm)
## 2.5 % 97.5 %
## (Intercept) 0.362323 2.193233
As the data has two measurements on each pilot, we have applied a paired sample t-test (i.e., a one-sample t-test applied to the differences within each pilot). There is no reason to suspect lack of independence between pilots and no problem with residuals or any overly influential observations.
The estimated coefficient for the true measure difference was significantly different from zero (p-value 0.009). The head diameters measured from cardboard callipers are larger than the measurements from metal callipers. Our preferred model is:
Our model is: \(MeasureDiff_i = \mu_{diff} + \epsilon_i\) where \(\epsilon_i \sim iid ~ N(0,\sigma^2)\)
1.3 Comment on the plot
WRITE COMMENT HERE Data points seem to have a positive linear relationship, meaning that it seems that as the amount of foilage in an area increases, so to, linearly, does the bird population density. There is a notable outlier at the bottom right which needs to be checked for undue influence of our fitted model. There seems to be a moderate strength in the relationship, however there isn’t many data points so this needs to be verified.