Lab_5_Assignment

Author

Kathleen Strachota

Load packages

#insert packages here :)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.1 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(car)
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some
library(patchwork)
library(ggsci)
library(ggridges)
library(performance)
library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula

Attaching package: 'Hmisc'

The following objects are masked from 'package:dplyr':

    src, summarize

The following objects are masked from 'package:base':

    format.pval, units
library(corrplot)
corrplot 0.92 loaded

Essentials

Load data from Old Faithful. Assess the correlation between waiting time and eruption time using a Spearman’s test and a scatter plot. Report Spearman’s Rho and interpret the results

faithful <- faithful
#eruptions = length of eruption time in minutes
#waiting = time between eruptions in minutes

ggplot(data = faithful, aes(y = eruptions, x = waiting)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

cor.test(faithful$eruptions,faithful$waiting, method = "spearman")
Warning in cor.test.default(faithful$eruptions, faithful$waiting, method =
"spearman"): Cannot compute exact p-value with ties

    Spearman's rank correlation rho

data:  faithful$eruptions and faithful$waiting
S = 744659, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.7779721 

\(H_0\): The correlation between length of eruption time and length of wait time = 0 \(H_A\): The correlation between length of eruption time and length of wait time \(\neq 0\)

Spearman’s \(\rho = 0.7779721\) which indicates that there is a strong correlation between length of eruption time and the length of the wait time. The p-value is less than 2.2e-16 which means we can reject the null hypothesis that the correlation between these two variables is 0.

DO A CHI-SQUARE. First we need some categorical data! So, I will make it for you :)

#distribution of class of patients admitted to self poisoning (A) and gastro units (B)
#socioclass
socioclass<-c("I","II","III","IV","V")
selfpoison<-c(17,25,39,42,32)
gastro<-c(22,46,73,91,57)

patients<-data.frame(socioclass,selfpoison,gastro)
head(patients)
  socioclass selfpoison gastro
1          I         17     22
2         II         25     46
3        III         39     73
4         IV         42     91
5          V         32     57
str(patients)
'data.frame':   5 obs. of  3 variables:
 $ socioclass: chr  "I" "II" "III" "IV" ...
 $ selfpoison: num  17 25 39 42 32
 $ gastro    : num  22 46 73 91 57
#remove the categories so the data are clean for chi-square
pat2<-patients[,-1]

\(H_0:\) There is no relationship between social class and which unit the patient is admitted to (self-poisoning or gastro). The variables are independent.

\(H_A:\) There is a relationship between social class and which unit the patient is admitted to (self-poisoning or gastro). The variables are dependent.

\(\alpha = 0.05\)

chisq.test(pat2)

    Pearson's Chi-squared test

data:  pat2
X-squared = 1.9885, df = 4, p-value = 0.7379

Since the p-value is less than \(\alpha = 0.05\) (p = 0.7379), we fail to reject the null that there is no relationship between social class and which unit the patient is admitted to. There does not seem to be a statistically siginficant relationship between social class and whether a patient is admitted to the self-poisoning unit or the gastro unit.

Perform a simple linear regression on the faithful data and check all assumptions. Report the summary your regression, interpret it, and provide a graph to go along with it–with a regression line. Next, test your assumptions and report the results. Is the relationship between your two variables linear? Don’t forget a hypothesis!

head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
model_faithful <- lm(eruptions ~ waiting, faithful)
summary(model_faithful)

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.29917 -0.37689  0.03508  0.34909  1.19329 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
waiting      0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared:  0.8115,    Adjusted R-squared:  0.8108 
F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
ggplot(data = faithful, aes(y = eruptions, x = waiting)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

\(H_0: \beta_{waiting} = 0\)/There is no linear relationship between wait time length and length of eruption.

\(H_A: \beta_{waiting} \neq 0\)/There is a linear relationship between wait time length and length of eruption.

\(\alpha = 0.05\)

The adjusted \(R^2\) value is 0.8108 which indicates that there is a strong positive linear relationship between length of wait time and length of eruption. The p-value is less than 2.2e-16 which is less than our significance threshold (\(\alpha = 0.05\)). We have extremely strong evidence against the null hypothesis that \(\beta_{waiting} \neq 0\). We could conclude that there is a statistically significant relationship of wait time length on length of eruption. This is consistent with the \(R^2\) value that we got as well.

Checking Assumptions

check_model(model_faithful)

ggplot(data = faithful, aes(x=model_faithful$residuals)) +
  geom_histogram()+
  geom_density(color = "red") +
  theme_classic()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To check the assumptions, I ran the check_model function on the linear model. In the graph for normality of residuals, the points fall along the line for almost all of the data, so I would say that it is safe to assume that the residuals are normally distributed. In the graph for checking homoscedasticity/homogeneity of variance, the line is slightly curved but I think is still safe to assume that there is homoscedasticity. To check linearity, I looked at the linearity graph and although there is some curvature, but we have a really large sample size so we will continue anyways. We don’t know enough about the experimental design to check our assumption about the independence of the residuals but we will just acknowledge that and still run the test. If we were to do more than this lab with this data, we might want to examine it more, but for this use it is probably okay.

Load palmerpenguins. Using the penguins data, assess whether there is a difference in body mass between Adelie and Gentoo penguins. If there is a difference, assess the direction of the difference (one tends to be bigger than the other, etc). You should: a.) Tell me what statistical test you are using, b.) filter the data appropriately, c.) run your test, d.) make a corresponding visual to show the appropriate differences (means + error bars!). Write a hypothesis, run your test, interpret the test-statistics and use the stats + graph to tell me the result.

library(palmerpenguins)

a)

We will run a t-test because we are comparing the means of two groups, Adelie and Gentoo penguins.

b)

penguin_a <- penguins %>%
  filter(species == "Adelie")
penguin_g <- penguins %>%
  filter(species == "Gentoo")

c)

\(H_0:\) There is no difference in body mass between Adelie and Gentoo penguins \(H_A\): There is a difference in body mass between Adelie and Gentoo penguins \(\alpha = 0.05\)

t.test(penguin_a$body_mass_g, penguin_g$body_mass_g)

    Welch Two Sample t-test

data:  penguin_a$body_mass_g and penguin_g$body_mass_g
t = -23.386, df = 249.64, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1491.183 -1259.525
sample estimates:
mean of x mean of y 
 3700.662  5076.016 

d)

penguin_ag <- penguins %>%
  filter(species == "Adelie" | species == "Gentoo") %>%
  group_by(species) %>%
  drop_na(body_mass_g) %>%
  summarise(mean = mean(body_mass_g), sd = sd(body_mass_g), n = n(), se = sd/sqrt(n))

penguin_ga <- penguins %>%
  filter(species != "Chinstrap")

ggplot(penguin_ag, aes(x = species, y = mean, color = species)) +
  geom_point() +
  geom_jitter(data = penguin_ga, aes(x = species, y = body_mass_g, color = species), alpha = 0.6, size =0.3) +
  geom_errorbar(data = penguin_ag, aes(x = species, ymin = mean-se, ymax = mean+se), width = 0.2) +
  scale_color_lancet() +
  theme_classic() +
  labs(y = "Body Mass", x = "Species", color = "Species") +
  scale_color_npg()
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Warning: Removed 2 rows containing missing values (`geom_point()`).

The p-value is less than 2.2e-16 which is less than our significance threshold (\(\alpha = 0.05\)) means we reject the null hypothesis and conclude that there is a statistically significant difference in body mass between Adelie and Gentoo penguins. After looking at the graph, it seems that Gentoo have a higher mean body mass than Adelie.


Depth

Make your essentials and depth components into tabsets using panel::tabset. Note that this is a visual change you can make to Quarto documents - this is testing your ability to modify Quarto docs and to google/find answers to common problems in R

Make your t-test graph from Essentials better – add the raw data in behind your means + error bars

ggplot(penguin_ag, aes(x = species, y = mean, color = species)) +
  geom_point() +
  geom_jitter(data = penguin_ga, aes(x = species, y = body_mass_g, color = species), alpha = 0.6, size =0.3) +
  geom_errorbar(data = penguin_ag, aes(x = species, ymin = mean-se, ymax = mean+se), width = 0.2) +
  scale_color_lancet() +
  theme_classic() +
  labs(y = "Body Mass", x = "Species", color = "Species") +
  scale_color_npg()
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Warning: Removed 2 rows containing missing values (`geom_point()`).

What are the effects of body_mass_g and sex on bill_length_mm in Chinstrap penguins? Setup a more complex linear regression to test this! You have 2 variables of interest. You can choose for this to be interactive or additive. Write a hypothesis, filter data as needed, run the test, make a graph, interpret the results.

penguin_c <- penguins %>%
  filter(species == "Chinstrap")

ggplot(data = penguin_c, aes(y = bill_length_mm, x = body_mass_g, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() +
  scale_color_npg()
`geom_smooth()` using formula = 'y ~ x'

model_penguinc <- lm(bill_length_mm ~ body_mass_g*sex, penguin_c)

summary(model_penguinc)

Call:
lm(formula = bill_length_mm ~ body_mass_g * sex, data = penguin_c)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6911 -1.1108 -0.2264  0.7923 10.9076 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         35.982910   5.196611   6.924 2.53e-09 ***
body_mass_g          0.003003   0.001469   2.044    0.045 *  
sexmale             11.056140   6.924654   1.597    0.115    
body_mass_g:sexmale -0.001973   0.001870  -1.055    0.295    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.407 on 64 degrees of freedom
Multiple R-squared:  0.5036,    Adjusted R-squared:  0.4803 
F-statistic: 21.64 on 3 and 64 DF,  p-value: 8.606e-10
anova(model_penguinc)
Analysis of Variance Table

Response: bill_length_mm
                Df Sum Sq Mean Sq F value    Pr(>F)    
body_mass_g      1 197.10 197.101 34.0126 1.959e-07 ***
sex              1 172.66 172.661 29.7951 8.337e-07 ***
body_mass_g:sex  1   6.45   6.453  1.1136    0.2953    
Residuals       64 370.88   5.795                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(H_0:\) There is no effect of body mass and/or sex on bill length and there is no interactive effect of body mass and sex on bill length.

\(H_A:\) The interaction between body mass and sex has a significant effect on bill length.

\(\alpha = 0.05\)

There is a statistically significant effect of body mass on bill length (p-value = 1.959e-07). From the graph we can see that it is a positive effect (i.e. bigger penguins have longer bills). There is also a statistically significant effect of sex on bill length (p-value = 8.337e-07). In the graph we can see that males have a longer bill. Finally, there is no statistically significant interactive effect of sex and body mass on bill length (p-value = 0.2953). There is no difference in the relationship between body mass and bill length based on sex.

Using the dataset below, please assess whether there is a difference in number of bigfoot sightings per year between Ohio and Washington. You will have to read the data in, filter by state, make a column for year (modifying date using lubridate), group_by state & year, then count rows (each row= a bigfoot sighting). This data frame will allow you to do your t-test! Next, calculate a mean across years– you can use this to make a graph. In short, I want you to run a t-test to assess the H0 that there is no difference in annual bigfoot sightings between Ohio and Washington. Then I want to see a graph showing means + error. Finally, interpret this graph + stats. This requires use of a skill from every lab so far!

Cleaning up the data

library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
bigfoot <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-09-13/bigfoot.csv')
Rows: 5021 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (10): observed, location_details, county, state, season, title, classif...
dbl  (17): latitude, longitude, number, temperature_high, temperature_mid, t...
date  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bigfoot_f <- bigfoot %>%
  filter(state == "Ohio" | state == "Washington") %>%
  mutate(year = year(date)) %>%
  group_by(state, year) %>%
  count() %>%
  drop_na()
colnames(bigfoot_f)[colnames(bigfoot_f) == "n"] = "count"

bigfoot_o <-bigfoot_f %>%
  filter(state == "Ohio")

bigfoot_w <-bigfoot_f %>%
  filter(state == "Washington")

T-test

t.test(bigfoot_o$count, bigfoot_w$count)

    Welch Two Sample t-test

data:  bigfoot_o$count and bigfoot_w$count
t = -3.1298, df = 98.619, p-value = 0.002301
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.292127 -1.185345
sample estimates:
mean of x mean of y 
 5.053571  8.292308 

\(H_0: \mu_{\text{Washington}} = \mu_{\text{Ohio}}\) (The mean annual number of bigfoot sightings in Washington is equal to the mean annual number of bigfoot sightings in Ohio)

\(H_A: \mu_{\text{Washington}} \neq \mu_{\text{Ohio}}\) (The mean annual number of bigfoot sightings in Washington is different than the mean annula number of bigfoot sightings in Ohio)

\(\alpha = 0.05\)

Graphs

bigfoot_means <- bigfoot_f %>%
  group_by(state) %>%
  summarise(mean = mean(count), sd = sd(count), n = n(), se = sd/sqrt(n))

ggplot(data = bigfoot_means, aes(x = state, y = mean, color = state)) +
  geom_point() +
  geom_jitter(data = bigfoot_f, aes(x = state, y = count, color = state), alpha = 0.6, size =0.3) +
  geom_errorbar(data = bigfoot_means, aes(x = state, ymin = mean-se, ymax = mean+se), width = 0.2) +
  scale_color_lancet() +
  theme_classic() +
  labs(y = "Count", x = "State", color = "State") +
  scale_color_aaas()
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.

Interpretation

The p-value is 0.002301 which is less than our significance threshold (\(\alpha = 0.05\)). Thus, we have very strong evidence against the null hypothesis that the mean annual number of bigfoot sightings in Washington is equal to the mean annual number of bigfoot sightings in Ohio. There is a statistically significant difference between the mean annual number of bigfoot sightings in Washington and Ohio. This is consistent with the graph where it seems that Washington has a higher average annual number of bigfoot sightings than Ohio.