Part A: A Bit on t Tests

Load the “Presurgical Stress Data, F23 Subset” data set into R and conduct the required paired t-test to test if people do indeed produce higher levels of beta-endorphin in the blood just before surgery.

Blood <- read.csv("Presurgical Stress Data, F23 Subset.csv")
head(Blood)
##   Patient    A    B
## 1       1 10.0  6.5
## 2       2  6.5 14.0
## 3       3  8.0 13.5
## 4       4 12.0 18.0
## 5       5  5.0 14.5
## 6       6 11.5  9.0

Now run the paired t-test analysis in R. Obtain the value of the t test statistic and the p-value, exactly as they are reported on the R output. Be careful in your analysis; note that this is a one-sided test. What do you conclude at the 5% level of significance? Also obtain a (two-sided) 95% confidence interval for the (true) mean difference.

# Extract the columns A and B from the Blood data frame
A <- Blood$A
B <- Blood$B

# Paired, one-sided t-test; Ha: the true mean of (A - B) is less than zero,
# i.e., the B measurements are larger on average
test <- t.test(A, B, paired = TRUE, alternative = "less")
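Equivalently, a paired t-test is a one-sample t-test on the within-patient differences; as a quick sanity check, the call below should give the same t statistic, degrees of freedom, and p-value:

# Same test, expressed as a one-sample t-test on the differences
t.test(A - B, alternative = "less")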

# Print the t test statistic and p-value
cat("t test statistic:", test$statistic, "\n")
## t test statistic: -2.573672
cat("p-value:", test$p.value, "\n")
## p-value: 0.01020096
# Conclusion at the 5% level of significance
if (test$p.value < 0.05) {
  cat("Reject the null hypothesis at the 5% level of significance.\n")
} else {
  cat("Fail to reject the null hypothesis at the 5% level of significance.\n")
}
## Reject the null hypothesis at the 5% level of significance.
# Confidence interval: because alternative = "less" was specified,
# test$conf.int is a ONE-SIDED interval, not the two-sided interval requested
conf_interval <- test$conf.int
cat("One-sided 95% confidence interval:", conf_interval[1], "to", conf_interval[2], "\n")
## One-sided 95% confidence interval: -Inf to -2.803919
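Because the question asks for a two-sided interval, rerun t.test() with its default two-sided alternative and extract the interval:

# Two-sided 95% confidence interval for the true mean difference A - B
t.test(A, B, paired = TRUE)$conf.int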

The t test statistic is -2.573672. Its negative sign reflects the fact that the sample mean of the differences A - B is negative, i.e., the B measurements are higher on average.

The p-value is 0.01020096, which is less than the significance level of 0.05: if the true mean difference were zero, the probability of observing a t statistic of -2.573672 or smaller would be only about 1%. This is strong evidence against the null hypothesis, so we reject it at the 5% level and conclude that beta-endorphin levels are indeed higher just before surgery.

Because the test was run with alternative = "less", the confidence interval R reports is one-sided: (-Inf, -2.803919). It says that, with 95% confidence, the true mean difference A - B is at most -2.803919; the lower bound of -Inf is simply a feature of one-sided intervals, not evidence that the difference is enormous. Since the entire interval lies below zero, it agrees with the rejection of the null hypothesis. The two-sided 95% interval that the question asks for comes from the default two-sided call shown above.

Part B: One-Way ANOVA

The purpose of a one-way ANOVA is to test the null hypothesis that the population means of two or more groups are equal (groups are often called “treatments” when they are defined by the application of different treatments). The test statistic has an F distribution if the null hypothesis is true, and its value tends to be inflated when the alternative hypothesis is true. The F test statistic is calculated via an ANOVA table. If the null hypothesis is rejected at the specified significance level and there are three or more groups being compared, we typically want to know where the differences are. Tukey’s HSD (Honestly Significant Difference) procedure is one such follow-up procedure, and the easiest one to run in R.

The data sets “Cows - mature, 4 breeds, AGHJ (stacked)” and “Cows - mature, 4 breeds, AGHJ (unstacked)” come from Canadian records of milk production for five breeds of cattle: Ayrshire, Canadian, Guernsey, Holstein-Friesian, and Jersey. To simplify your analysis we will look at only four breeds (the Canadian breed has been dropped). Butterfat percentages were recorded for milk produced by 10 cows of each breed. You require the “stacked” data for analysis in R; the “unstacked” data set simply makes it easier to look at the data.

Cows <- read.csv("Cows - mature, 4 breeds, AGHJ (stacked).csv")
head(Cows)
##   Butterfat    Breed BreedCode
## 1      3.74 Ayrshire         A
## 2      4.01 Ayrshire         A
## 3      3.77 Ayrshire         A
## 4      3.78 Ayrshire         A
## 5      4.10 Ayrshire         A
## 6      4.06 Ayrshire         A
tail(Cows)
##    Butterfat  Breed BreedCode
## 35      5.24 Jersey         J
## 36      5.70 Jersey         J
## 37      5.41 Jersey         J
## 38      4.77 Jersey         J
## 39      5.18 Jersey         J
## 40      5.23 Jersey         J

Use R to obtain the ANOVA table for testing the null hypothesis that the populations of the four breeds have equal mean butterfat percentages.

Butterfat <- Cows$Butterfat
BreedCode <- Cows$BreedCode

CowsMod1 <- aov(Butterfat ~ BreedCode)
anova(CowsMod1)
## Analysis of Variance Table
## 
## Response: Butterfat
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## BreedCode  3 15.2161  5.0720  29.107 9.849e-10 ***
## Residuals 36  6.2731  0.1743                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the value of the Total SS?

The value of the Total SS is the sum of the BreedCode SS and the Residuals SS, which is 15.2161 + 6.2731 = 21.4892.
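As a check, the Total SS can be computed in R by summing the Sum Sq column of the ANOVA table:

# Total SS = treatment SS + residual SS
sum(anova(CowsMod1)[, "Sum Sq"])  # 15.2161 + 6.2731 = 21.4892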

What does the p-value for the F test statistic allow you to conclude at the 5% level of significance?

The p-value for the F test statistic is 9.849e-10, far below the 0.05 significance level: if the null hypothesis were true, the probability of observing an F value of 29.107 or larger would be essentially zero. This is strong evidence against the null hypothesis that the mean butterfat percentage is the same for all breeds. We therefore conclude, at the 5% level of significance, that the population mean butterfat percentages differ among the four breeds.

If appropriate (and it will be here), use R to conduct Tukey’s HSD test at the 5% level of significance. What conclusions can you make based on this analysis? As part of your conclusions, summarize your results with an underscore diagram.

TukeyHSD(CowsMod1, ordered = TRUE)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##     factor levels have been ordered
## 
## Fit: aov(formula = Butterfat ~ BreedCode)
## 
## $BreedCode
##      diff        lwr       upr     p adj
## A-H 0.282 -0.2207802 0.7847802 0.4418903
## G-H 1.127  0.6242198 1.6297802 0.0000036
## J-H 1.524  1.0212198 2.0267802 0.0000000
## G-A 0.845  0.3422198 1.3477802 0.0003548
## J-A 1.242  0.7392198 1.7447802 0.0000006
## J-G 0.397 -0.1057802 0.8997802 0.1641030
tapply(Butterfat, BreedCode, mean)
##     A     G     H     J 
## 4.003 4.848 3.721 5.245
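
At the 5% family-wise level, the A-H and J-G comparisons are not significant (adjusted p-values 0.4419 and 0.1641), while the other four comparisons all have adjusted p-values well below 0.05. We conclude that Holstein-Friesian (H) and Ayrshire (A) cannot be distinguished from each other, nor can Guernsey (G) and Jersey (J), but G and J both have significantly higher mean butterfat percentages than both H and A. An underscore diagram, with the sample means in increasing order:

H        A        G        J
3.721    4.003    4.848    5.245
------------      ------------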

Would any of the above conclusions have changed if Tukey’s HSD procedure had used a 10% familywise error rate? If so, what would have changed?

No conclusions would change. Moving from a 5% to a 10% familywise error rate can only affect comparisons whose adjusted p-values fall between 0.05 and 0.10, and there are none here. The two non-significant comparisons, A-H (adjusted p = 0.4419) and J-G (adjusted p = 0.1641), have adjusted p-values above 0.10, so they remain non-significant; the other four comparisons all have adjusted p-values far below 0.05, so they remain significant.
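This can be checked directly, since TukeyHSD() accepts a conf.level argument:

# Tukey's HSD at a 10% familywise error rate (90% family-wise confidence)
TukeyHSD(CowsMod1, ordered = TRUE, conf.level = 0.90)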

boxplot(Butterfat ~ BreedCode,
        main = "Boxplots of Butterfat by Breed Code",
        xlab = "Breed Code", ylab = "Butterfat",
        col = c("red", "green", "blue", "yellow"),
        varwidth = TRUE, notch = FALSE, outline = FALSE)

What possible violations of ANOVA assumptions are suggested by the boxplots?

One possible violation suggested by the boxplots is non-normality of the response variable within groups. This assumption states that the distribution of butterfat percentage should be approximately normal within each breed, but some of the boxplots show asymmetry (unequal whisker lengths, or a median off-centre within its box), suggesting mild skewness in some breeds. This departure appears slight and, with a balanced design like this one, is unlikely to have a serious impact on the ANOVA results.

A more substantial concern is the homogeneity-of-variances assumption, which requires the variance of butterfat percentage to be the same for every breed. The boxplots show clearly different spreads across breeds, a pattern confirmed by the sample variances computed below, which range from about 0.037 (Ayrshire) to 0.299 (Jersey). Unequal variances can affect the validity of the ANOVA F-test, although the impact is mitigated here by the equal sample sizes (10 cows per breed).
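A formal check of the equal-variance assumption can be run directly; Bartlett's test is one option (though it is itself sensitive to non-normality):

# Bartlett's test of homogeneity of variances across breeds
bartlett.test(Butterfat ~ BreedCode)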

Obtain the sample variances for the four breeds using the tapply() function and show that pooling these sample variances gives the MSE in the ANOVA table.

# Obtain the sample variances for the four breeds
variances <- tapply(Butterfat, BreedCode, var)
variances
##          A          G          H          J 
## 0.03706778 0.25306222 0.10814333 0.29873889
# Pool the variances to get the MSE: each breed contributes (n - 1) = 9
# degrees of freedom, for a total of 4 * 9 = 36 residual degrees of freedom
MSE <- sum(9 * variances) / 36
MSE
## [1] 0.1742531
# Obtain the ANOVA table
anova(CowsMod1)
## Analysis of Variance Table
## 
## Response: Butterfat
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## BreedCode  3 15.2161  5.0720  29.107 9.849e-10 ***
## Residuals 36  6.2731  0.1743                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can see that the MSE in the ANOVA table is 0.1743, which matches the value of 0.1742531 calculated by pooling the sample variances, apart from rounding in the printed table. This shows that pooling these four sample variances gives the MSE in the ANOVA table.

Part C: Simple Linear Regression Analysis

Simple linear regression analysis obtains a “best fitting” straight line for a set of bivariate data, where one variable is treated as the predictor (sometimes called the “independent” variable), the other is treated as the response (sometimes called the “dependent” variable), and the criterion of least squares defines “best fit”. The predictor variable is commonly designated X and the response variable Y, reflecting the fact that the (x, y) data are plotted with the x values on the horizontal (X) axis and the y values on the vertical (Y) axis. A graph of the (x, y) observations in a data set is called a scatterplot, and it is common to superimpose the fitted regression model on it. We restrict our fitted models in this course to straight lines, but you should be aware that many nonlinear models can also be fit.

Fish <- read.csv("Brunt et al 2016 Jumping fish dataset, jump variables.csv")
head(Fish,3)
##   Fish.ID Treatment No.jumps Distance.travelled.cm Duration.jumping.s
## 1       1   Control       18                  50.0                 25
## 2       2   Control       15                  60.5                 22
## 3       3   Control       14                  63.5                 18
##   Longest.jump.cm
## 1             8.0
## 2             9.5
## 3             8.0
FishAir <- subset(Fish, Treatment == "Air")
tail(FishAir,3)
##    Fish.ID Treatment No.jumps Distance.travelled.cm Duration.jumping.s
## 50      50       Air       25                 162.0                 36
## 51      51       Air       16                  58.5                 21
## 52      52       Air       37                 363.0                 57
##    Longest.jump.cm
## 50              20
## 51               8
## 52              26

We can make a scatterplot of Duration.jump.s on No.jumps at this point. (Note: The phrase “Duration.jump.s on No.jumps” means that Duration.jump.s is on the Y-axis and No.jumps is on the X-axis.)

No.jumps <- Fish$No.jumps
Duration.jump.s <- Fish$Duration.jumping.s
plot(No.jumps, Duration.jump.s)

To fit the least squares regression line, we need to run a “linear model” analysis. To regress Duration.jump.s on No.jumps:

Mod1 <- lm(Duration.jump.s ~ No.jumps)

plot(No.jumps, Duration.jump.s, main = "Duration of Jump vs Number of Jumps",
     xlab = "Number of Jumps", ylab = "Duration of Jump (s)", col = "blue", pch = 16)

# Now superimpose the least squares regression line
abline(Mod1, col = "red", lty = 2)

A summary of the lm() fit:

summary(Mod1)
## 
## Call:
## lm(formula = Duration.jump.s ~ No.jumps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4901  -3.3922  -0.1607   4.0589  12.7236 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.04702    1.71337  -1.195    0.236    
## No.jumps     1.55490    0.08076  19.253   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.233 on 67 degrees of freedom
## Multiple R-squared:  0.8469, Adjusted R-squared:  0.8446 
## F-statistic: 370.7 on 1 and 67 DF,  p-value: < 2.2e-16

From this summary, you can determine the equation of the least squares regression line that was superimposed on the scatter plot. You also get information for conducting hypothesis tests and calculating confidence intervals.

The ANOVA table for the regression analysis:

anova(Mod1)
## Analysis of Variance Table
## 
## Response: Duration.jump.s
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## No.jumps   1  14401 14401.4  370.68 < 2.2e-16 ***
## Residuals 67   2603    38.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Obtain (individual) 95% confidence intervals for the true slope and true Y-intercept, using the SE values given in the summary(). Show your work.

The general formula is

95% CI = estimate ± t* * SE

where estimate is the coefficient estimate, SE is its standard error, and t* is the critical value of the t distribution with the residual degrees of freedom. Here t* = qt(0.975, 67) ≈ 1.996; with 67 degrees of freedom this is close to, but not the same as, the normal critical value 1.96. Plugging in the values from the summary() output, for the slope:

95% CI = 1.55490 ± 1.996 * 0.08076 = 1.55490 ± 0.16120 = (1.39370, 1.71610)

This means that we are 95% confident that the true slope of the relationship between Duration.jump.s and No.jumps lies between about 1.394 and 1.716. Similarly, for the Y-intercept:

95% CI = -2.04702 ± 1.996 * 1.71337 = -2.04702 ± 3.41989 = (-5.46691, 1.37287)

This means that we are 95% confident that the true Y-intercept lies between about -5.467 and 1.373. Note that this interval contains zero, consistent with the non-significant t-test for the intercept in C7.
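These hand calculations can be checked against R's built-in confint() function, which uses the exact t critical value:

# Individual 95% confidence intervals for the intercept and slope
confint(Mod1, level = 0.95)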

C3. What is the predicted value of Duration.jump.s if No.jumps = 33?

To find the predicted value of Duration.jump.s if No.jumps = 33, you can use the equation of the regression line that you obtained from the summary() output. The equation is:

Duration.jump.s = -2.04702 + 1.55490 * No.jumps

You can plug in the value of No.jumps = 33 into the equation and calculate the value of Duration.jump.s, like this:

Duration.jump.s = -2.04702 + 1.55490 * 33 = -2.04702 + 51.3117 = 49.26468

This means that the predicted value of Duration.jump.s is about 49.26 seconds when No.jumps is 33 jumps.
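The same prediction can be obtained with predict(); the result may differ slightly in the last decimals because the hand calculation uses rounded coefficients:

# Predicted jump duration for a fish making 33 jumps
predict(Mod1, newdata = data.frame(No.jumps = 33))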

C4. What is the value of the residual for the second last observation in the data set?

residual = observed value - predicted value

predicted value = -2.04702 + 1.55490 * 16 = -2.04702 + 24.8784 = 22.83138

residual = 21 - 22.83138 = -1.83138

This means that the predicted value of Duration.jump.s is 22.83138 seconds when No.jumps is 16 jumps, so the regression model overestimates the observed duration of 21 seconds by about 1.83 seconds. (Up to rounding of the coefficients, this agrees with the value -1.83133175 in the residuals(Mod1) listing in C6.)
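If, as the residuals(Mod1) listing in C6 suggests, this observation sits in row 51 of the data frame, the exact residual can be pulled straight from the fitted model:

# Exact residual for the observation with No.jumps = 16 and Duration = 21
residuals(Mod1)[51]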

C5. What is the value of the correlation coefficient?

cor(No.jumps,Duration.jump.s)
## [1] 0.9202822
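
Note that squaring the correlation coefficient recovers the Multiple R-squared reported by summary(Mod1):

# r^2 equals the coefficient of determination R^2
cor(No.jumps, Duration.jump.s)^2  # approximately 0.8469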

C6. If each residual was squared and then summed, what entry in the ANOVA table would be obtained?

residuals(Mod1)
##            1            2            3            4            5            6 
##  -0.94112523   0.72356499  -1.72153827   4.05887477   7.05887477  -2.61174478 
##            7            8            9           10           11           12 
##  -2.94112523   4.27846173   8.05887477   4.38825522   1.60784218   3.17459757 
##           13           14           15           16           17           18 
##  -5.27643501   4.05887477  -0.72153827   6.06480409  -8.94112523  -0.94112523 
##           19           20           21           22           23           24 
##  -5.72153827   2.06480409  -5.94705456  -5.16664153   4.71763567   0.27846173 
##           25           26           27           28           29           30 
##  -4.94705456  -0.05684804   6.51583667  11.39418454  11.83928780  -0.16071220 
##           31           32           33           34           35           36 
##   2.50397802  12.72356499   2.50397802  11.16866825   8.41197252   5.62563016 
##           37           38           39           40           41           42 
##   4.74135297   1.49804870  10.05294544   2.85114645   4.17459757   2.38825522 
##           43           44           45           46           47           48 
##  -7.49602198  -8.14885355  -8.70967962  -9.16071220  -0.38029917   6.94315196 
##           49           50           51           52           53           54 
##  -3.27643501  -0.82540243  -1.83133175   1.51583667 -13.49009265  -0.72153827 
##           55           56           57           58           59           60 
##   4.05887477  -1.16071220  -2.60581546   3.38232589  -4.71560894  -9.16664153 
##           61           62           63           64           65           66 
## -14.27050568 -15.49009265  -8.71560894  -0.72746759  -3.50195130   1.38825522 
##           67           68           69 
##  -3.39215782  -0.39215782  -1.94112523

Now sum the squared residuals

res <- residuals(Mod1)
sum(res^2)
## [1] 2603.049

What entry in the ANOVA table does this correspond to? It corresponds to the Residuals entry in the Sum Sq column: the residual sum of squares (SSE) for the regression of Duration.jump.s on No.jumps, reported as 2603.

C7. The “summary(Mod1)” output produces a table with two t-statistic values and their associated p-values. What are the sets of hypotheses these t values are testing? What can be concluded at the 5% level of significance?

The sets of hypotheses these t values are testing are:

For the slope: H0: beta1 = 0 vs Ha: beta1 != 0
For the Y-intercept: H0: beta0 = 0 vs Ha: beta0 != 0

where beta0 and beta1 are the true values of the Y-intercept and the slope, respectively. Each p-value gives the probability of observing a t value as extreme as, or more extreme than, the one calculated, if the corresponding null hypothesis were true; the smaller the p-value, the stronger the evidence against that null hypothesis.

At the 5% level of significance, we can conclude the following:

For the slope: the p-value is < 2e-16, far below 0.05, so we reject the null hypothesis that the slope is zero. There is a significant linear relationship between Duration.jump.s and No.jumps: the duration of jumping increases as the number of jumps increases.

For the Y-intercept: the p-value is 0.236, which exceeds 0.05, so we fail to reject the null hypothesis that the Y-intercept is zero. Note that this does not prove the intercept is zero; it only means the data provide no evidence that the true regression line fails to pass through the origin.

C8. The ANOVA table produces a value of an F test statistic. It tests one set of hypotheses already tested in C7. Briefly explain how one of the t-tests in C7 is statistically equivalent to the F test.

The ANOVA table produces an F test statistic of 370.68. It tests the same set of hypotheses as the t-test for the slope in C7, namely:

H0: beta1 = 0 vs Ha: beta1 != 0

where beta1 is the true slope of the relationship between Duration.jump.s and No.jumps. The F statistic is calculated by dividing the Mean Square for No.jumps (14401.4) by the Mean Square for Residuals (38.9, before rounding), while the t statistic for the slope is calculated by dividing the estimate (1.55490) by its standard error (0.08076). The two tests are mathematically equivalent because t² = F: 19.253² ≈ 370.68. They test the same hypotheses and produce the same p-value (< 2e-16), so either way we reject the null hypothesis and conclude that there is a significant linear relationship between Duration.jump.s and No.jumps at the 5% level of significance.
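The equivalence is easy to verify numerically in R:

# The squared t statistic for the slope reproduces the F statistic
tval <- summary(Mod1)$coefficients["No.jumps", "t value"]
tval^2                     # approximately 370.7
anova(Mod1)[1, "F value"]  # the same value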