Markdown Author: Jessie Bell, 2023
Libraries Used: ggplot2
Answers: blue
#load the snow depth and flower growth data used in the problems below
depthData <- read.csv("Depth.csv")
flowerData <- read.csv("Growth.csv")
A professor hypothesized that there was no difference in coffee consumption between students majoring in Environmental Science (Esci) and students majoring in Biology. She randomly sampled 25 Esci and 25 Biology students. The mean coffee consumption for Esci students was 40 oz per day with a standard deviation of 5; for Biology students it was 30 oz per day, also with a standard deviation of 5. If she used a two-tailed t test to compare the mean coffee consumption between Esci and Biology students:
Statistical Hypotheses
H\(_0\): \(\mu_{esci} = \mu_{bio}\)
H\(_A\): \(\mu_{esci} \neq \mu_{bio}\)
# Set the seed for reproducibility
set.seed(123)
# Generate 25 samples per group: Esci with mean 40, Bio with mean 30, both with sd 5
esci.data <- rnorm(25, mean = 40, sd = 5)
bio.data <- rnorm(25, mean = 30, sd = 5)
#just perform a t-test (save it under a name that doesn't mask the t.test function)
coffee.ttest <- t.test(esci.data, bio.data, var.equal = TRUE)
#look at the output for the t-value
coffee.ttest
##
## Two Sample t-test
##
## data: esci.data and bio.data
## t = 7.0662, df = 48, p-value = 5.821e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 6.669972 11.975351
## sample estimates:
## mean of x mean of y
## 39.83335 30.51069
#t-stat = 7.0662, df = 48, p-value < 0.001
# Degrees of freedom
df <- length(esci.data) + length(bio.data) - 2
#should be 48! Just like the t-test output. You can check your work here.
# Significance level
alpha <- 0.05
# Calculate t-critical value for a two-tailed test
t_critical <- qt(1 - alpha/2, df)
t_critical
## [1] 2.010635
#2.010635 is t-critical
#t.calculated (7.07) > t.critical (2.01), so we can reject our null hypothesis in favor of the alternative.
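#optional sanity check (my addition, using the coffee.ttest object saved above,
#which stores the t statistic): make R do the comparison for us
abs(coffee.ttest$statistic) > t_critical #TRUE, so we reject the null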
#this is all in the above t-test output, but below I will show you the long way to calculate it
# Calculate mean and standard deviation for each group
mean_esci <- mean(esci.data)
mean_bio <- mean(bio.data)
sd_esci <- sd(esci.data)
sd_bio <- sd(bio.data)
# Sample sizes
n_esci <- length(esci.data)
n_bio <- length(bio.data)
# Degrees of freedom
df <- n_esci + n_bio - 2
# Calculate the standard error of the difference in means
se_diff <- sqrt((sd_esci^2 / n_esci) + (sd_bio^2 / n_bio))
# Calculate the t-statistic
t_stat <- (mean_esci - mean_bio) / se_diff
# Calculate the p-value for a two-tailed test
p_value <- 2 * pt(-abs(t_stat), df)
p_value #VOILA! 5.82e-09 matches the p-value from the t.test output. Things are looking good.
## [1] 5.821082e-09
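One note on the standard error formula above: because the two groups have equal sample sizes, it happens to equal the pooled-variance standard error that the equal-variance t-test actually uses. With unequal sample sizes you would pool the variances explicitly; here is a sketch of that calculation using the objects already defined above:
# Pooled variance weights each group's variance by its degrees of freedom
sp2 <- ((n_esci - 1) * sd_esci^2 + (n_bio - 1) * sd_bio^2) / df
se_pooled <- sqrt(sp2 * (1 / n_esci + 1 / n_bio))
t_pooled <- (mean_esci - mean_bio) / se_pooled #matches t_stat here because the group sizes are equal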
Based on the results of the two-sample t-test comparing mean coffee consumption between Esci and Biology students, we found a statistically significant difference (t ≈ 7.07, df = 48, p < 0.001). The mean coffee consumption for Esci students (sample mean ≈ 39.8 oz per day) was significantly higher than that for Biology students (≈ 30.5 oz per day).
Therefore, we reject the null hypothesis of no difference in coffee consumption between the two groups. This is evidence that Environmental Science and Biology students differ in their coffee consumption habits. It is worth considering these findings in the context of lifestyle or academic factors that might contribute to the observed difference.
Students in the WWU Ski and Snowboard Club were interested in assessing the past few years of snowpack at the Mt. Baker ski area in November. They want to compare this year to previous years to see if they can predict how good the ski season will be this year. They were particularly interested in whether the average snowfall for any given day in November 2017 was different from the average for November 2018, because snow conditions are so variable in the fall that it is hard to tell whether the averages could be considered different. The data can be found in the snowdepth tab of the accompanying spreadsheet.
Continuous snow-depth data measured in two categorical years (Nov. 2017 and Nov. 2018)
#there are huge outliers in the data. We should be careful about removing these data, but if we have detailed notes and we know that a value was recorded by mistake, we can remove it. I will leave them in for now and just zoom in on the boxplots below using the ylim argument.
boxplot(depthData$Nov.17, ylim = c(0, 75), col = "tomato") #ylim sets the range shown on the y axis
boxplot(depthData$Nov.18, ylim = c(0, 20), col = "plum") #dang, look how wild these data are.
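If you want to see exactly which points the boxplots flag as outliers before deciding whether any were recording errors, base R's boxplot.stats() lists them; a quick sketch:
boxplot.stats(depthData$Nov.17)$out #values beyond the 1.5*IQR whiskers for Nov. 17
boxplot.stats(depthData$Nov.18)$out #and the same for Nov. 18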
Because there are two categories (Nov. 17 and Nov. 18) and numerical data (snow depth), we should compare the means using a t-test.
H\(_0\): The mean snow pack for Nov. 17 is equal to the mean snow pack for Nov. 18.
H\(_A\): The mean snow pack for Nov. 17 is NOT equal to the mean snow pack for Nov. 18.
#check out some summary stats and plot for normality
mean_2018 <- mean(depthData$Nov.18) #13.38
mean_2017 <- mean(depthData$Nov.17) #40.33
hist(depthData$Nov.17) #not exactly normal!
hist(depthData$Nov.18) #oh boy. This strongly violates our assumption of normality. At this point we probably should not proceed, but for the sake of showing the steps, I will continue.
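The histograms are the informal check; if you want a formal one to back them up, a Shapiro-Wilk test is one option (assuming each column has between 3 and 5000 non-missing values, which shapiro.test requires):
shapiro.test(depthData$Nov.17) #a small p-value is evidence against normality
shapiro.test(depthData$Nov.18)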
#Assessing equal variance (look at the sd of both)
nov18.sd <- sd(depthData$Nov.18) #39
nov17.sd <- sd(depthData$Nov.17) #33
#those are approximately similar. We can continue assuming equal variance
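A formal alternative to eyeballing the standard deviations is an F test of equal variances, though note that var.test itself assumes normality, which is already questionable here:
var.test(depthData$Nov.17, depthData$Nov.18) #tests H0 that the ratio of the variances is 1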
t.test(depthData$Nov.17, depthData$Nov.18, var.equal = TRUE)
##
## Two Sample t-test
##
## data: depthData$Nov.17 and depthData$Nov.18
## t = 13.905, df = 1438, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 23.14712 30.75069
## sample estimates:
## mean of x mean of y
## 40.33098 13.38208
The difference in snowpack between 2017 and 2018 was statistically significant (t ≈ 13.9, df = 1438, p < 0.001). The average snowpack for 2017 was much larger (mean = 40.33) than the average snowpack for 2018 (mean = 13.38).
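Given how badly the histograms violated normality, it is worth sketching the nonparametric alternative: a Wilcoxon rank-sum test compares the two Novembers without assuming normal data.
wilcox.test(depthData$Nov.17, depthData$Nov.18) #nonparametric analogue of the two-sample t-test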
We are going to talk about snow again… but this time in terms of biological effects. You are interested in testing whether the amount of snow affects the growth of an early spring annual flower. You perform an experiment where you manipulate snow levels. The experimental variable is the amount of snow with 3 levels: low (1 inch), medium (3 inches), and high (6 inches). The response variable is plant height. The experiment is properly designed so that the data are independent and samples are randomly collected. You can find the data in the snowgrowth tab of the accompanying spreadsheet.
#the way the data are set up will make ggplot useful
library(ggplot2) #the columns are even named x and y! so you can just use those names below
ggplot(flowerData, aes(x, y, color = x)) +
  geom_jitter(width = 0.2) +
  labs(title = "Checking out the data",
       y = "Flower growth",
       x = "Treatment") #BTW you should have labelled all of your charts and graphs for full credit
Here we have categorical treatment data (with 3 levels: low, med, and high) and numerical data for each of those. We will need to use an ANOVA because we have 3 levels instead of just two (i.e., there were only two categories in the depth data from the last question, 2017 and 2018, so we could just use a t-test).
There is 1 factor (treatment) with 3 levels (low/med/high)
H\(_0\): \(\mu_1 = \mu_2 = \mu_3\)
H\(_A\): At least one of those means is different, and we will run a Tukey HSD to find out which one it is.
ggplot(flowerData, aes(y, fill = x)) +
  geom_histogram(alpha = 0.7, bins = 30) #medium does not look very normal
#we will proceed assuming that if we sampled more, our data would look more bell-shaped
tapply(flowerData$y, flowerData$x, mean) #if any means are statistically diff it will be between low and high
## High Low Med
## 45.86812 40.58250 43.74562
tapply(flowerData$y, flowerData$x, sd) #the sds are similar enough that we will proceed assuming equal variance
## High Low Med
## 3.457865 5.387842 7.132181
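The ANOVA call itself is not shown in this write-up, but the F statistic quoted below would come from a one-way ANOVA on these columns; a minimal sketch of the calls that produce it and the post-hoc comparisons:
flower.aov <- aov(y ~ x, data = flowerData) #one factor (x) with 3 levels
summary(flower.aov) #gives the F statistic and p-value quoted below
TukeyHSD(flower.aov) #post-hoc pairwise comparisons among low/med/high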
The test statistic for an ANOVA is F! So our test statistic and p-value are: F = 3.697; p = 0.03262
We can reject our null hypothesis in favor of our alternative. After conducting a post-hoc Tukey HSD test, we found that high levels of snowpack produced higher flower growth than low levels (F\(_{2,45}\) = 3.697; p = 0.033).
Now we want to work with the same data as in #8, but we want to consider how each of the three depths may be different.
low <- subset(flowerData, x =="Low")
med <- subset(flowerData, x =="Med")
high <- subset(flowerData, x == "High")
low.t <- t.test(low$y) #note: with a single argument, these are one-sample t-tests of each group mean against mu = 0
med.t <- t.test(med$y)
high.t <- t.test(high$y)
p_values <- c(low.t$p.value, med.t$p.value, high.t$p.value)
p.adjust(p_values, method = "bonferroni")
## [1] 2.341333e-14 4.813041e-13 5.208736e-18
These p-values do not match the ANOVA results, and it is not just rounding error: each t.test call above is a one-sample test of whether a group's mean differs from zero, not a comparison between the groups. The anova and aov functions (with a post-hoc test) are the right tools for comparing the depths to one another.
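For the pairwise question those one-sample tests were reaching for, base R's pairwise.t.test runs every two-group comparison with a multiple-comparison correction built in; a minimal sketch:
pairwise.t.test(flowerData$y, flowerData$x, p.adjust.method = "bonferroni") #all pairwise group comparisons, Bonferroni-adjusted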
Tukey HSD tells you which pairs of means are different. This will be important when the groups have no obvious ordering and you are trying to decide which of several group means differ from one another. In this example, though, the answer was obvious from the group means.