## Loading required package: s20x

1 Question 1

1.1 Question of interest/goal of the study

We are interested in whether the bird density at a particular site can be explained by the total number of foliage at that site.

1.2 Exploratory analysis

Bird.df=read.table("birds.txt",header=T)
plot(Total.density~Profile.Area,data=Bird.df)

1.3 Comment on the plot

WRITE COMMENT HERE Data points seem to have a positive linear relationship, meaning that it seems that as the amount of foilage in an area increases, so to, linearly, does the bird population density. There is a notable outlier at the bottom right which needs to be checked for undue influence of our fitted model. There seems to be a moderate strength in the relationship, however there isn’t many data points so this needs to be verified.

1.4 Fit an appropriate linear model, including model checks.

Bird.lm=lm(Total.density~Profile.Area,data=Bird.df)
plot(Bird.lm,which=1)

normcheck(Bird.lm)

cooks20x(Bird.lm)

summary(Bird.lm)

## 
## Call:
## lm(formula = Total.density ~ Profile.Area, data = Bird.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0570 -1.7189  0.6919  1.7879  3.2987 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    4.4250     2.4864   1.780   0.0954 .
## Profile.Area   0.1519     0.0749   2.028   0.0607 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.658 on 15 degrees of freedom
## Multiple R-squared:  0.2152, Adjusted R-squared:  0.1629 
## F-statistic: 4.113 on 1 and 15 DF,  p-value: 0.0607

Bird.lm1=lm(Total.density~Profile.Area,data=Bird.df[-4,])
plot(Bird.lm1,which=1)

normcheck(Bird.lm1)

cooks20x(Bird.lm1)

summary(Bird.lm1)

## 
## Call:
## lm(formula = Total.density ~ Profile.Area, data = Bird.df[-4, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1529 -1.3670  0.2921  1.2116  2.9146 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.57677    1.72311   0.915 0.375646    
## Profile.Area  0.25750    0.05356   4.808 0.000279 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.722 on 14 degrees of freedom
## Multiple R-squared:  0.6228, Adjusted R-squared:  0.5958 
## F-statistic: 23.11 on 1 and 14 DF,  p-value: 0.0002786

confint(Bird.lm1)

##                   2.5 %    97.5 %
## (Intercept)  -2.1189353 5.2724744
## Profile.Area  0.1426246 0.3723821

1.5 Create a scatter plot with the fitted line from your final model superimposed over it.

AreaDensity.fit = lm(Total.density ~ Profile.Area, data= Bird.df)
densityNull.fit = lm(Total.density ~ 1, data = Bird.df)
trendscatter(Total.density ~ Profile.Area, data= Bird.df)
abline(Bird.lm1, col = "red")

1.6 Estimate the predicted bird densities when the total amount of foliage at the sites are 30 f.p. units and 100 f.p. units. Overall, comment on how useful you think this model is for prediction. Justify your answer with two reasons.

preddensity.df = data.frame(Profile.Area = c(30,100))
predict(Bird.lm1, preddensity.df, interval = "prediction")

##         fit       lwr      upr
## 1  9.301871  5.492086 13.11166
## 2 27.327108 18.548973 36.10524

The 95% prediction interval of the bird density for the foilage area of 30 f.p units is between 5.492 and 13.112 bird pairs. The 95% prediction interval of the bird density for the foilage area of 100 f.p units is between 18.549 and 36.105 bird pairs.

Overall I think that the model is useful for prediction due to the notable increase in bird density for each respective foilage data point (30 f.p, and 100 f.p) coinciding with the linear model that we fitted earlier. The predicted value also falls within the expected range of values given from the scatter plot, falling within the range of one standard error.

1.7 Method and Assumption Checks**

As the initial plot of the data shows a linear relationship, we have fitted a linear regression model to our data. The residual plot shows randomness around zero with constant variance. Consistent with the initial plot, the observation No. 4 has large profile area but relatively small bird density. It is an influential point (we can see that the coefficent for the slope in the first model changed from 0.15 to 0.26 - more than one standard error, as well as the strength of evidence for a non-zero slope dramatically increasing after observation 4 was deleted), and thus was removed from further analyses. We refit the linear regression model with observation No. 4 removed. There are no overly influential points and all assumptions are satisfied. The effect of profile area is statistically significant.

Our preferred model is:

\(TotalDensity_i=\beta_0+\beta_1 \times ProfileArea_i+ \epsilon_i\), where \(\epsilon_i \sim iid N(0,\sigma^2)\).

Our model explains 62% of the total variation in the data so it is not good for prediction purpose.

1.8 Executive Summary

We were interested in building a model to estimate bird density, per pair of birds, with a profile area’s foilage (in f.p units)

There was a linear relationship between bird density and profile area foilage (\[P-value \approx 0\]). We estimate that for each one unit increase in f.p units for the profile area, there is to be an estimated 0.143 to 0.372 increase in the bird pair density.

This model would be adequate for prediction. Further analysis using a wider data pool would consildate findings in this study

Limitations of this study were the relatively small data collection pool (17 California Oak woodland sites) restricted seasonally (Most observations were taken in Spring, during the bird breeding season.)

2 Question 2

2.1 Question of interest/goal of the study

We are interested in seeing if there is a difference in the measures of head sizes between using metal and cardboard callipers.

2.2 Read in and inspect the data:

Measure.df=read.table("measure.txt", header=T)
Measure.diff=Measure.df$Cardboard-Measure.df$Metal
boxplot(Measure.diff,horizontal = TRUE,main="Head diameter, Cardboard-Metal")

summary(Measure.diff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.000   0.250   1.000   1.278   2.000   5.000

2.3 Comment on the plot and summary statistics:

There seems to be a wide range of variance for the differences between the recorded values of the cardboard and metal calliper: from the lowest at -2cm to the max at 5cm. The differences seem to be more positive than negative, with only 25% of the values being lower then 0.25cm. The mean difference between the cardboard and metal callipers is 1.28cm.

2.4 Why is the 2-sample t-test not appropriate?

The data pool required for sampling is small (18 participants), as well as the data not being independent (participants head size will be the same for both callipers, affecting both results depending on calliper accuracy). The sample size was not smaller than the population, instead it was the whole population (all N.Z air force pilots)

2.5 Manually calculate the t-statistic for testing if the underlying mean is 0, and the 95% confidence interval for the mean.

Formulas: \(T = \frac{\bar{y}-\mu_0}{se(\bar{y})}\) and 95% confidence interval \(\bar{y} \pm t_{df, 0.975} \times se(\bar{y})\)

NOTES: The R code mean(y) calculates \(\bar{y}\). The standard error is \(se(\bar{y}) = \frac{s}{\sqrt{n}}\) where \(s\) is the standard deviation of \(y\) and is calculated by sd(y), and \(n\) is the number of data-points calculated by length(y). The degrees of freedom is \(df = n-1\). The \(t_{df,0.975}\) multiplier is given by the R code qt(0.975, df).

mean_diff = mean(Measure.df$Cardboard - Measure.df$Metal)
std_dev = sd(Measure.df$Cardboard - Measure.df$Metal)
n_diff = length(Measure.df$ID)
tmul_diff = qt(1 - 0.05/2, df = n_diff - 1)
se_diff = std_dev/sqrt(n_diff)





# t-statistic for H0: mu=0:

t_stat_null = (mean_diff - 0)/(se_diff)

# 95% confidence interval for the mean:

CI_Diff = mean_diff + tmul_diff * c(-1,1) * std_dev/sqrt(n_diff)

2.6 Repeat the same calculation using the t.test function (done for you):

t.test(Measure.diff)

## 
##  One Sample t-test
## 
## data:  Measure.diff
## t = 2.9448, df = 17, p-value = 0.009058
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.362323 2.193233
## sample estimates:
## mean of x 
##  1.277778

Note: You should get exactly the same results from the manual calculations and using the \(t.test\) function. Doing this was to give you practice using some R code. The \(t.test\) function also delivers the p-value that we did not calculate above.

2.7 The linear model has been fitted for you. Provide the appropriate model checking output for checking the linear model. Obtain the t-statistic and 95% confidence interval using the linear model.

Measure.lm=lm(Measure.diff~1)
normcheck(Measure.lm)

cooks20x(Measure.lm)

t_stat_diffm = (coef(summary(Measure.lm))[0])/ (coef(summary(Measure.lm))[1])

confint(Measure.lm)

##                2.5 %   97.5 %
## (Intercept) 0.362323 2.193233

2.8 Did the linear model give the same results as the t-test? Are these doing the same thing?

2.9 Method and Assumption Checks

As the data has two measurements on each pilot, we have applied a paired sample t-test (i.e., a one-sample t-test applied to the differences within each pilot). There is no reason to suspect lack of independence between pilots and no problem with residuals or any overly influential observations.

The estimated coefficient for the true measure difference was significantly different from zero (p-value 0.009). The head diameters measured from cardboard callipers are larger than the measurements from metal callipers. Our preferred model is:

Our model is: \(MeasureDiff_i = \mu_{diff} + \epsilon_i\) where \(\epsilon_i \sim iid ~ N(0,\sigma^2)\)

STATS 201/8 Assignment 1

James Lee - 660212545

Due Date: 3pm Thursday 2nd April