Assignment 3: Data exploration and basic stats

Feel free to view here: https://rpubs.com/PontSatyre11119/EEB313_Assignment-3

To submit this assignment, upload the full document to Quercus, including the original questions, your code, and the output. Submit your assignment as a knitted .pdf (preferred), .docx, or .html file.

For this assignment, we will be using the same beaver1 dataset that we used in last week’s assignment. Run the code below to create a categorical variable of the activ column, as we did for the last assignment. This will make dplyr recognize that there are only two levels of activity (0 and 1), rather than a continuous range 0-1, which will facilitate plotting.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(dplyr)
library(ggplot2)

beaver1_f <- beaver1 %>%
  mutate(factor_activ = factor(activ))

Beaver body temperature (2 marks)

Create a histogram to visualize the distribution of the beavers’ body temperatures, separating the temperature data based on the beaver’s activity level. Choose an appropriate bin width for your histogram. (0.5 marks)

beaver1_f %>% 
  group_by(factor_activ) %>% # this groups all data by activity
  ggplot(aes(fill = factor_activ)) +
  geom_histogram(aes(x = temp), binwidth = 0.05) + # I have selected a bin width of 0.05. Meaning each bar represents 0.05 degrees.
  labs(title = "Temperature by Activity", x = "Temperature", y = "Count", fill = "Activity")

Quantitatively describe the properties of the distribution (mean, median, range, and skewness). (0.5 marks)

beaver1_f %>% 
  group_by(factor_activ) %>% 
  summarize(mean_temp = mean(temp), # here is the mean temperature
            median_temp = median(temp), # here is the median temperature
            min_temp = min(temp), # here is the minimum temperature
            max_temp = max(temp), # here is the maximum temperature
            IQR_temp = IQR(temp), # here is the skewness
            range_temp = (max(temp)-min(temp))) # here is the range

## # A tibble: 2 × 7
##   factor_activ mean_temp median_temp min_temp max_temp IQR_temp range_temp
##   <fct>            <dbl>       <dbl>    <dbl>    <dbl>    <dbl>      <dbl>
## 1 0                 36.8        36.9     36.3     37.2    0.190      0.900
## 2 1                 37.2        37.2     37.1     37.5    0.135      0.460

Conduct a statistical test to determine whether beaver body temperatures across the two activity levels are normally distributed. Show your calculations (code) and explain your conclusions in plain English. (1 mark)

Hint: Remember to refer to the hypothesis of the statistical test in your explanation.

zero <- beaver1_f %>% 
  filter(factor_activ == 0) # I am assigning a new dataframe to only have temperature datapoints with activity 0.

one <- beaver1_f %>% # New df with temp only activity 1.
  filter(factor_activ == 1)

shapiro.test(zero$temp)

## 
##  Shapiro-Wilk normality test
## 
## data:  zero$temp
## W = 0.96123, p-value = 0.003115

# The Shapiro null hypothesis indicates that the sample distribution is normally distributed. In this case, if the produced p-value is greater than 0.05, than the data is normally distributed. Since our p-value is 0.003115, we reject the null hypothesis and state that beaver temperature data for activity 0 is not normally distributed. 

shapiro.test(one$temp)

## 
##  Shapiro-Wilk normality test
## 
## data:  one$temp
## W = 0.85953, p-value = 0.1876

# For activity 1, the null hypothesis can be accepted. (p = 0.1876) Temperature data for activity 1 is normally distributed.

shapiro.test(beaver1_f$temp)

## 
##  Shapiro-Wilk normality test
## 
## data:  beaver1_f$temp
## W = 0.97031, p-value = 0.01226

# Combined, the null is rejected. (p = 0.01226) The combined temperature data is not normally distributed.

Outliers (4 marks)

In the beaver1 dataset, there are some particularly high/low body temperature measurements. How many data points laid out of the “normal” range, as defined by ggplot’s boxplot? (0.5 marks)

beaver1_f %>% 
  group_by(factor_activ) %>% # groups by activity
ggplot(aes(color = factor_activ)) + # uses activity as a comparator
  geom_boxplot(aes(y = temp)) + # plots temperature on y-axis
  labs(title = "Temperature Boxplot", y = "Temperature", color = "Activity") +
  theme_classic()

# There are 5 points outside of the normal range. These points can be considered statistical outliers.

Give an example of a systematic or random error that could have influenced these values. (0.5 marks)

# Incorrect thermometer placement (human error) may be an example of random error.

Investigate whether these “outlying” values exert a large influence on your statistical analyses.

c-i. Perform t-tests to examine whether beaver’s body temperature differ by activity level. Repeat this test after removing the outliers for your data. (1 mark)

t.test(temp ~ factor_activ, data = beaver1_f) # here I performed an un-paired, two-sample t-test with original data.

## 
##  Welch Two Sample t-test
## 
## data:  temp by factor_activ
## t = -5.4346, df = 5.6263, p-value = 0.001978
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.5556401 -0.2067673
## sample estimates:
## mean in group 0 mean in group 1 
##        36.84213        37.22333

outliers <- boxplot(beaver1_f$temp, plot=FALSE)$out # saves outliers into a new df.
beaver1_f_out.rem <- beaver1_f # creates a new df with original data.
beaver1_f_out.rem <- beaver1_f_out.rem[-which(beaver1_f_out.rem$temp %in% outliers),] # removes specific values in outlier df from the original df.

t.test(temp ~ factor_activ, data = beaver1_f_out.rem) # here I performed a t-test with outliers removed.

## 
##  Welch Two Sample t-test
## 
## data:  temp by factor_activ
## t = -7.7087, df = 5.4031, p-value = 0.0004112
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.3995951 -0.2030587
## sample estimates:
## mean in group 0 mean in group 1 
##        36.86067        37.16200

c-ii. Explain and contrast the results of your t-tests in plain English. Remember to refer to the hypotheses of the statistical test in your explanation. (1 mark)

# In the above un-paired, two-sample t-test, we compared the temperature means to each other based on activity. The p-value for original data is p=0.0019. The p-value for outlier-removed data is p=0.00041. In both cases, we reject the null hypothesis and say that there is a statistical difference between the means of each activity for both datasets. However, for the dataset with removed outliers, there is higher confidence that the means differ statistically significant.

c-iii. State whether you would remove these data points. Why or why not? (0.5 marks)

# I would remove the outlier data points. Since a t-test assumes normality, removing these outliers would produce a testable dataset. I can confirm normal distribution of temperature by running the Shapiro Test again. Seen in the result below, the Shapiro p-value is p=0.1227 meaning the null is accepted and the dataset is normally distributed. 
shapiro.test(beaver1_f_out.rem$temp)

## 
##  Shapiro-Wilk normality test
## 
## data:  beaver1_f_out.rem$temp
## W = 0.98103, p-value = 0.1227

c-iv. The t-test makes two assumptions. We already tested for normality in the distribution of body temperature by activity level, so we have a sense of whether we violated the first assumption. What is the second assumption the t-test makes, and how would you validate that? (0.5 marks)

# The assumptions for this t-test also require equal variance. We can test for variance by using var.test. Here, I tested for both original and outlier-removed datasets. We see that the outlier-removed dataset has a high variance ratio; whereas, the original dataset has around equal ratio. In this case, I would opt to keep the outliers.

var.test(temp ~ factor_activ, data = beaver1_f)

## 
##  F test to compare two variances
## 
## data:  temp by factor_activ
## F = 1.0957, num df = 107, denom df = 5, p-value = 0.9516
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1803392 2.9445672
## sample estimates:
## ratio of variances 
##           1.095705

var.test(temp ~ factor_activ, data = beaver1_f_out.rem)

## 
##  F test to compare two variances
## 
## data:  temp by factor_activ
## F = 3.3868, num df = 103, denom df = 4, p-value = 0.2391
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4071762 9.8645263
## sample estimates:
## ratio of variances 
##           3.386758

Thanksgiving! (2 marks)

Run this code chunk:

install.packages("praise")

And then this:

library(praise)
replicate(100, praise())

Assignment 3: Data exploration and basic stats

Matthew Ho 1006092416