Lab_5_Assignment

Author

Maya Frey

1. Load packages

#insert packages here :)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(ggplot2)
library(car)

Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

library(patchwork)
library(ggsci)
library(ggridges)
library(performance)
library(Hmisc)


Attaching package: 'Hmisc'

The following objects are masked from 'package:dplyr':

    src, summarize

The following objects are masked from 'package:base':

    format.pval, units

library(corrplot)

corrplot 0.92 loaded

library(lubridate)

2. Load data from Old Faithful. Assess the correlation between waiting time and eruption time using a Spearman’s test and a scatter plot. Report Spearman’s Rho and interpret the results

faithful<-faithful
#eruptions = length of eruption time in minutes
#waiting = time between eruptions in minutes
ggplot(data = faithful, aes(x = eruptions, y = waiting)) + 
  geom_point() +
  geom_smooth(method = lm) +
  theme_bw()

`geom_smooth()` using formula = 'y ~ x'

cor.test(faithful$eruptions, faithful$waiting, method = "spearman")

Warning in cor.test.default(faithful$eruptions, faithful$waiting, method =
"spearman"): Cannot compute exact p-value with ties


    Spearman's rank correlation rho

data:  faithful$eruptions and faithful$waiting
S = 744659, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.7779721

Spearman’s Rho is 0.778, which indicates that there is a strong correlation between length of eruption time and the waiting time between eruptions. The p-value is less than 0.05, which means that the correlation is significant.

3. DO A CHI-SQUARE. First we need some categorical data! So, I will make it for you :)

#distribution of class of patients admitted to self poisoning (A) and gastro units (B)
#socioclass
socioclass<-c("I","II","III","IV","V")
selfpoison<-c(17,25,39,42,32)
gastro<-c(22,46,73,91,57)

patients<-data.frame(socioclass,selfpoison,gastro)
head(patients)

  socioclass selfpoison gastro
1          I         17     22
2         II         25     46
3        III         39     73
4         IV         42     91
5          V         32     57

str(patients)

'data.frame':   5 obs. of  3 variables:
 $ socioclass: chr  "I" "II" "III" "IV" ...
 $ selfpoison: num  17 25 39 42 32
 $ gastro    : num  22 46 73 91 57

#remove the categories so the data are clean for chi-square
pat2<-patients[,-1]

#perform a chisq test -- write a hypothesis and then perform the test!
pat2_chi <- chisq.test(pat2)
pat2_chi


    Pearson's Chi-squared test

data:  pat2
X-squared = 1.9885, df = 4, p-value = 0.7379

Null hypothesis: The is no significant correlation between social class and which unit patients were admitted to, aka they are independent of each other. Alternative hypothesis: There is a significant correlation between social class and which unit patients were admitted to, aka they are dependent on each other. Results: The p-value is 0.7379, which is greater than 0.05. Therefore, we fail to reject the null hypothesis that there is no significant correlation between social class and which unit patients were admitted to.

4. Perform a simple linear regression on the faithful data and check all assumptions. Report the summary your regression, interpret it, and provide a graph to go along with it–with a regression line. Next, test your assumptions and report the results. Is the relationship between your two variables linear? Don’t forget a hypothesis!

head(faithful)

  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

#Perform the linear regression
lm_faithful <- lm(eruptions ~ waiting, data = faithful)
#Report the summary of your regression
summary(lm_faithful)


Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.29917 -0.37689  0.03508  0.34909  1.19329 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
waiting      0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared:  0.8115,    Adjusted R-squared:  0.8108 
F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

#Provide a graph
ggplot(data = faithful, aes(x = waiting, y = eruptions)) + 
  geom_point() +
  geom_smooth(method = lm) +
  theme_bw()

`geom_smooth()` using formula = 'y ~ x'

#Test assumptions
check_model(lm_faithful)

Null hypothesis: There is no relationship between the waiting time between eruptions and length of eruption time; Waiting time between eruptions has no effect on length of eruption time. Alternative hypothesis: There is a relationship between the waiting time between eruptions and length of eruption time; Waiting time has an effect on length of eruption time. Interpretation of results: Adjusted R-squared is 0.8108, meaning that there is a strong positive linear correlation between waiting time between eruptions and length of eruption time. The p-value is less than 0.05, meaning that the relationship between the variables is significant and we can reject our null hypothesis: Waiting time between eruptions has an effect on length of eruption time. Interpretation of assumptions test: I tested my assumptions using check model, and it shows that the data has linearity, homogeneity of variance, does not have influential observations, and has normal residuals. We do not know enough about the experimental design to know if there is independence of residual error terms, but will assume that there is independence in order to proceed with the analysis.

5 Load palmerpenguins. Using the penguins data, assess whether there is a difference in body mass between Adelie and Gentoo penguins. If there is a difference, assess the direction of the difference (one tends to be bigger than the other, etc). You should: a.) Tell me what statistical test you are using, b.) filter the data appropriately, c.) run your test, d.) make a corresponding visual to show the appropriate differences (means + error bars!). Write a hypothesis, run your test, interpret the test-statistics and use the stats + graph to tell me the result.

#Load data
library(palmerpenguins)
penguins <- penguins
# b) Filtering data
penguins_ade <- penguins %>%
  filter(species == "Adelie")
penguins_gent <- penguins %>%
  filter(species == "Gentoo")
penguins_ttest <- penguins %>%
  filter(species == "Adelie" | species == "Gentoo")
# c) Run your test
t.test(penguins_ade$body_mass_g, penguins_gent$body_mass_g)


    Welch Two Sample t-test

data:  penguins_ade$body_mass_g and penguins_gent$body_mass_g
t = -23.386, df = 249.64, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1491.183 -1259.525
sample estimates:
mean of x mean of y 
 3700.662  5076.016

# d) Graph
penguins_mean <- penguins %>%
  filter(species == "Adelie" |  species == "Gentoo") %>%
  group_by(species) %>%
  drop_na() %>%
  summarise(mean = mean(body_mass_g), sd = sd(body_mass_g), n = n(), se = sd/sqrt(n))
ggplot(data = penguins_mean, aes(x = species, y = mean)) +
  geom_point() +
  geom_errorbar(data = penguins_mean, aes(x = species, ymin = mean - se, max = mean +se)) + 
  theme_bw() +
  labs(x = 'Species', y = 'Mean body mass (g)')

Null hypothesis: There is no relationship between body mass and whether a penguin is Adelie or Gentoo; penguin body mass does not vary with species. Alternative hypothesis: There is a relationship between body mass and whether a penguin is Adelie or Gentoo; penguin body mass varies significantly by species. a) Statistical test: I will be using a t-test because it is used to determine whether the means of two groups are different from each other. We are looking at whether the mean body mass is different between two groups (species) of penguins. Interpretation: The p-value is less than 0.05, therefore the difference in mean body mass between Adelie and Gentoo penguins is significant and we reject the null hypothesis. The graph shows that Gentoo penguins have a significantly higher body mass than Adelie penguins. ————————————————————————

1 Make your essentials and depth components into tabsets using panel::tabset. Note that this is a visual change you can make to Quarto documents - this is testing your ability to modify Quarto docs and to google/find answers to common problems in R

2 Make your t-test graph from Essentials better – add the raw data in behind your means + error bars

ggplot(data = penguins_mean, aes(x = species, y = mean)) +
  geom_point(data = penguins_ttest, aes(x = species, y = body_mass_g, color = species)) +
  geom_point() +
  geom_errorbar(data = penguins_mean, aes(x = species, ymin = mean - se, max = mean +se)) + 
  theme_bw() +
  labs(x = 'Species', y = 'Mean body mass (g)')

Warning: Removed 2 rows containing missing values (`geom_point()`).

3 What are the effects of body_mass_g and sex on bill_length_mm in Chinstrap penguins? Setup a more complex linear regression to test this! You have 2 variables of interest. You can choose for this to be interactive or additive. Write a hypothesis, filter data as needed, run the test, make a graph, intepret the results.

# Filter data
penguins_chin <- penguins %>%
  filter(species == 'Chinstrap')
# Run linear regression test
lm_chin <- lm(bill_length_mm ~ body_mass_g * sex, data = penguins_chin)
summary(lm_chin)


Call:
lm(formula = bill_length_mm ~ body_mass_g * sex, data = penguins_chin)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6911 -1.1108 -0.2264  0.7923 10.9076 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         35.982910   5.196611   6.924 2.53e-09 ***
body_mass_g          0.003003   0.001469   2.044    0.045 *  
sexmale             11.056140   6.924654   1.597    0.115    
body_mass_g:sexmale -0.001973   0.001870  -1.055    0.295    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.407 on 64 degrees of freedom
Multiple R-squared:  0.5036,    Adjusted R-squared:  0.4803 
F-statistic: 21.64 on 3 and 64 DF,  p-value: 8.606e-10

anova(lm_chin)

Analysis of Variance Table

Response: bill_length_mm
                Df Sum Sq Mean Sq F value    Pr(>F)    
body_mass_g      1 197.10 197.101 34.0126 1.959e-07 ***
sex              1 172.66 172.661 29.7951 8.337e-07 ***
body_mass_g:sex  1   6.45   6.453  1.1136    0.2953    
Residuals       64 370.88   5.795                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Make a graph
ggplot(data = penguins_chin, aes(x = body_mass_g, y = bill_length_mm, color = sex)) +
  geom_point() +
  geom_smooth(method = 'lm')

`geom_smooth()` using formula = 'y ~ x'

Null hypothesis: There is no relationship between body mass and bill length, and/or there is no effect of sex on bill length, and/or the relationship between body mass and bill length does not vary based on sex. Alternative hypothesis: The relationship between body mass and bill length in Chinstrap penguins varies based on sex. Interpretation: According to the ANOV table, the p-value for the relationship between bill length and body mass is less than 0.05. Therefore, body mass has a significant effect on bill length and we reject the null hypothesis. A higher body mass means a greater bill length. The p-value for the relationship between bill length and sex is also less than 0.05, so sex has a significant effect on bill length and we reject the null hypothesis. Bill length is greater in male penguins compared to female penguins. The p-value for the interactive effect of body mass and sex on bill length is greater than 0.05 (it is 0.2953). Therefore, there is not a significant effect and we fail to reject the null hypothesis.

4 Using the dataset below, please assess whether there is a difference in number of bigfoot sightings per year between Ohio and Washington. You will have to read the data in, filter by state, make a column for year (modifying date using lubridate), group_by state & year, then count rows (each row= a bigfoot sighting). This data frame will allow you to do your t-test! Next, calculate a mean across years– you can use this to make a graph. In short, I want you to run a t-test to assess the H0 that there is no difference in annual bigfoot sightings between Ohio and Washington. Then I want to see a graph showing means + error. Finally, interpret this graph + stats. This requires use of a skill from every lab so far!

bigfoot <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-09-13/bigfoot.csv')

Rows: 5021 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (10): observed, location_details, county, state, season, title, classif...
dbl  (17): latitude, longitude, number, temperature_high, temperature_mid, t...
date  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Format data
bigfoot_f <- bigfoot %>%
  filter(state == 'Ohio' | state == 'Washington') %>%
  mutate(year = year(date)) %>%
  group_by(state, year) %>%
  count() %>%
  drop_na()
bigfoot_o <- bigfoot %>%
  filter(state == 'Ohio') %>%
  mutate(year = year(date)) %>%
  group_by(state, year) %>%
  count() %>%
  drop_na()
bigfoot_w <- bigfoot %>%
  filter(state == 'Washington') %>%
  mutate(year = year(date)) %>%
  group_by(state, year) %>%
  count() %>%
  drop_na()
# T-test
t.test(bigfoot_o$n, bigfoot_w$n)


    Welch Two Sample t-test

data:  bigfoot_o$n and bigfoot_w$n
t = -3.1298, df = 98.619, p-value = 0.002301
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.292127 -1.185345
sample estimates:
mean of x mean of y 
 5.053571  8.292308

# Calculate a mean across years
colnames(bigfoot_f)[3] <- "count"
bigfoot_mean <- bigfoot_f %>%
  group_by(state) %>%
  summarise(mean = mean(count), sd = sd(count), n = n(), se = sd/sqrt(n))
ggplot(data = bigfoot_mean, aes(x = state, y = mean)) +
  geom_point() +
  geom_errorbar(data = bigfoot_mean, aes(x = state, ymin = mean - se, max = mean + se)) + 
  theme_bw() +
  labs(x = "State", y = "Mean count")

Null hypothesis: Mean count of bigfoot sightings does not vary by state. Alternative hypothesis: Mean count of bigfoot sightings varies significantly by state. Interpretation: The p-value is 0.002, which is less than 0.05. The graph shows us that the mean sightings in Washington are higher than the mean sightings in Ohio. Therefore, mean bigfoot sightings by year are significantly higher in Washington compared to Ohio.