Feel free to view the published version here: https://rpubs.com/PontSatyre11119/EEB313_Assignment-3
To submit this assignment, upload the full document to Quercus,
including the original questions, your code, and the output. Submit your
assignment as a knitted .pdf (preferred),
.docx, or .html file.
For this assignment, we will be using the same beaver1
dataset that we used in last week’s assignment. Run the code below to
create a categorical version of the activ column, as we
did for the last assignment. This will make dplyr and ggplot2 recognize that there
are only two levels of activity (0 and 1), rather than a continuous
range from 0 to 1, which will facilitate plotting.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)   # already attached by tidyverse; loaded explicitly for clarity
library(ggplot2) # already attached by tidyverse; loaded explicitly for clarity
beaver1_f <- beaver1 %>%
  mutate(factor_activ = factor(activ)) # convert the 0/1 activity codes into a two-level factor
beaver1_f %>%
  ggplot(aes(x = temp, fill = factor_activ)) + # the fill aesthetic splits the counts by activity level
  geom_histogram(binwidth = 0.05) + # I have selected a bin width of 0.05, so each bar spans 0.05 degrees
  labs(title = "Temperature by Activity", x = "Temperature", y = "Count", fill = "Activity")
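(Optional) The stacked histogram above can hide the shape of the smaller group, so as a purely illustrative sketch, not part of the assignment output, the same plot can be split into one panel per activity level with facet_wrap():
# Optional sketch: one histogram panel per activity level
beaver1_f %>%
  ggplot(aes(x = temp, fill = factor_activ)) +
  geom_histogram(binwidth = 0.05) +
  facet_wrap(~ factor_activ, ncol = 1) +
  labs(title = "Temperature by Activity", x = "Temperature", y = "Count", fill = "Activity")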
beaver1_f %>%
group_by(factor_activ) %>%
summarize(mean_temp = mean(temp), # here is the mean temperature
median_temp = median(temp), # here is the median temperature
min_temp = min(temp), # here is the minimum temperature
max_temp = max(temp), # here is the maximum temperature
IQR_temp = IQR(temp), # here is the interquartile range
range_temp = (max(temp)-min(temp))) # here is the range
## # A tibble: 2 × 7
## factor_activ mean_temp median_temp min_temp max_temp IQR_temp range_temp
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 36.8 36.9 36.3 37.2 0.190 0.900
## 2 1 37.2 37.2 37.1 37.5 0.135 0.460
Hint: Remember to refer to the hypothesis of the statistical test in your explanation.
zero <- beaver1_f %>%
  filter(factor_activ == 0) # new data frame containing only the activity-0 temperature observations
one <- beaver1_f %>%
  filter(factor_activ == 1) # new data frame containing only the activity-1 temperature observations
shapiro.test(zero$temp)
##
## Shapiro-Wilk normality test
##
## data: zero$temp
## W = 0.96123, p-value = 0.003115
# The null hypothesis of the Shapiro-Wilk test is that the data are normally distributed. If the p-value is greater than 0.05, we fail to reject the null and have no evidence against normality. Since our p-value is 0.003115, we reject the null hypothesis and conclude that the beaver temperature data for activity 0 are not normally distributed.
shapiro.test(one$temp)
##
## Shapiro-Wilk normality test
##
## data: one$temp
## W = 0.85953, p-value = 0.1876
# For activity 1, we fail to reject the null hypothesis (p = 0.1876): there is no evidence that the temperature data for activity 1 deviate from normality.
shapiro.test(beaver1_f$temp)
##
## Shapiro-Wilk normality test
##
## data: beaver1_f$temp
## W = 0.97031, p-value = 0.01226
# For the combined data, we reject the null hypothesis (p = 0.01226): the pooled temperature data are not normally distributed.
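(Optional) The two per-group normality checks above can also be collected in a single pipeline; a minimal sketch using the same beaver1_f object:
# Optional sketch: extract the Shapiro-Wilk p-value for each activity level in one step
beaver1_f %>%
  group_by(factor_activ) %>%
  summarize(shapiro_p = shapiro.test(temp)$p.value)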
beaver1_f %>%
  ggplot(aes(y = temp, color = factor_activ)) + # one boxplot per activity level
  geom_boxplot() + # plots temperature on the y-axis
  labs(title = "Temperature Boxplot", y = "Temperature", color = "Activity") +
  theme_classic()
# There are 5 points beyond the boxplot whiskers (more than 1.5 * IQR from the quartiles). These points can be considered statistical outliers.
# Incorrect thermometer placement (human error) may be an example of random error.
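(Optional) For reference, a boxplot flags a point as an outlier when it lies more than 1.5 * IQR beyond the quartiles. A minimal sketch of that rule applied to the pooled temperatures (the per-group boxplots above apply the same rule within each group; q and iqr are illustrative helper objects):
# Optional sketch: the 1.5 * IQR rule that the boxplot uses to flag outliers
q <- quantile(beaver1_f$temp, c(0.25, 0.75))
iqr <- IQR(beaver1_f$temp)
beaver1_f %>%
  filter(temp < q[1] - 1.5 * iqr | temp > q[2] + 1.5 * iqr)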
c-i. Perform t-tests to examine whether beavers’ body temperature differs by activity level. Repeat this test after removing the outliers from your data. (1 mark)
t.test(temp ~ factor_activ, data = beaver1_f) # unpaired two-sample (Welch) t-test on the original data
##
## Welch Two Sample t-test
##
## data: temp by factor_activ
## t = -5.4346, df = 5.6263, p-value = 0.001978
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.5556401 -0.2067673
## sample estimates:
## mean in group 0 mean in group 1
## 36.84213 37.22333
outliers <- boxplot(beaver1_f$temp, plot = FALSE)$out # saves the outlier temperature values into a vector
beaver1_f_out.rem <- beaver1_f # copies the original data frame
beaver1_f_out.rem <- beaver1_f_out.rem[-which(beaver1_f_out.rem$temp %in% outliers), ] # drops the rows whose temperature matches an outlier value
t.test(temp ~ factor_activ, data = beaver1_f_out.rem) # Welch t-test with the outliers removed
##
## Welch Two Sample t-test
##
## data: temp by factor_activ
## t = -7.7087, df = 5.4031, p-value = 0.0004112
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.3995951 -0.2030587
## sample estimates:
## mean in group 0 mean in group 1
## 36.86067 37.16200
c-ii. Explain and contrast the results of your t-tests in plain English. Remember to refer to the hypotheses of the statistical test in your explanation. (1 mark)
# In the unpaired two-sample t-tests above, we compared mean body temperature between the two activity levels; the null hypothesis is that the two group means do not differ. The p-value for the original data is p = 0.0019, and for the outlier-removed data it is p = 0.00041. In both cases we reject the null hypothesis and conclude that mean temperature differs between activity levels. The outlier-removed dataset gives an even smaller p-value, i.e., stronger evidence against the null hypothesis.
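(Optional) To make the contrast concrete, the two p-values and mean differences can be placed side by side; a minimal sketch (t_orig and t_clean are hypothetical helper objects):
# Optional sketch: compare the two t-tests side by side
t_orig  <- t.test(temp ~ factor_activ, data = beaver1_f)
t_clean <- t.test(temp ~ factor_activ, data = beaver1_f_out.rem)
data.frame(dataset   = c("original", "outliers removed"),
           p_value   = c(t_orig$p.value, t_clean$p.value),
           mean_diff = c(diff(t_orig$estimate), diff(t_clean$estimate)))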
c-iii. State whether you would remove these data points. Why or why not? (0.5 marks)
# I would remove the outlier data points. Since the t-test assumes normality within groups, removing these outliers brings the data closer to meeting that assumption. I can check normality again by re-running the Shapiro-Wilk test: as shown below, the p-value is p = 0.1227, so we fail to reject the null hypothesis and the outlier-removed temperature data are consistent with a normal distribution.
shapiro.test(beaver1_f_out.rem$temp)
##
## Shapiro-Wilk normality test
##
## data: beaver1_f_out.rem$temp
## W = 0.98103, p-value = 0.1227
c-iv. The t-test makes two assumptions. We already tested for normality in the distribution of body temperature by activity level, so we have a sense of whether we violated the first assumption. What is the second assumption the t-test makes, and how would you validate that? (0.5 marks)
# The second assumption of the t-test is equal variance between the two groups (homogeneity of variance). We can check it with var.test(), an F-test comparing the two group variances. Below I test both the original and the outlier-removed datasets. The outlier-removed dataset has a larger variance ratio (about 3.4), whereas the original dataset's ratio is close to 1 (about 1.1); for that reason, I would opt to keep the outliers.
var.test(temp ~ factor_activ, data = beaver1_f)
##
## F test to compare two variances
##
## data: temp by factor_activ
## F = 1.0957, num df = 107, denom df = 5, p-value = 0.9516
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1803392 2.9445672
## sample estimates:
## ratio of variances
## 1.095705
var.test(temp ~ factor_activ, data = beaver1_f_out.rem)
##
## F test to compare two variances
##
## data: temp by factor_activ
## F = 3.3868, num df = 103, denom df = 4, p-value = 0.2391
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4071762 9.8645263
## sample estimates:
## ratio of variances
## 3.386758
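(Optional) Since the F-test on the original data gives a variance ratio close to 1 and is far from significant, a pooled-variance Student's t-test would also be defensible for that dataset; a minimal sketch:
# Optional sketch: Student's t-test with pooled variance (var.equal = TRUE)
t.test(temp ~ factor_activ, data = beaver1_f, var.equal = TRUE)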
Run this code chunk:
install.packages("praise")
And then this:
library(praise)
replicate(100, praise())