Lab_5_Assignment

Author

Micah Lohr

Essentials

Load packages

#insert packages here :)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(car)
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some
library(patchwork)
library(ggsci)
library(ggridges)
library(performance)
library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula

Attaching package: 'Hmisc'

The following objects are masked from 'package:dplyr':

    src, summarize

The following objects are masked from 'package:base':

    format.pval, units
library(corrplot)
corrplot 0.92 loaded

Load data from Old Faithful. Assess the correlation between waiting time and eruption time using a Spearman’s test and a scatter plot. Report Spearman’s Rho and interpret the results

faithful<-faithful
#eruptions = length of eruption time in minutes
#waiting = time between eruptions in minutes
cor.test(faithful$eruptions,faithful$waiting, method="spearman")
Warning in cor.test.default(faithful$eruptions, faithful$waiting, method =
"spearman"): Cannot compute exact p-value with ties

    Spearman's rank correlation rho

data:  faithful$eruptions and faithful$waiting
S = 744659, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.7779721 
ggplot(data=faithful, aes(x=eruptions, y=waiting))+
  geom_point()+
  geom_smooth(method="lm")+
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

The Spearman’s rho is 0.7779721, which we can round to 0.78. As indicated by the plot, there appears to be a strong, positive correlation between waiting time and eruption time. According to the correlation test, p = 2.2 x 10^-16, which is statistically significant, meaning we could reject H0 if we were doing a hypothesis test.

DO A CHI-SQUARE. First we need some categorical data! So, I will make it for you :)

#distribution of class of patients admitted to self poisoning (A) and gastro units (B)
#socioclass
socioclass<-c("I","II","III","IV","V")
selfpoison<-c(17,25,39,42,32)
gastro<-c(22,46,73,91,57)

patients<-data.frame(socioclass,selfpoison,gastro)
head(patients)
  socioclass selfpoison gastro
1          I         17     22
2         II         25     46
3        III         39     73
4         IV         42     91
5          V         32     57
str(patients)
'data.frame':   5 obs. of  3 variables:
 $ socioclass: chr  "I" "II" "III" "IV" ...
 $ selfpoison: num  17 25 39 42 32
 $ gastro    : num  22 46 73 91 57
#remove the categories so the data are clean for chi-square
pat2<-patients[,-1]

H0: there is no relationship between self-poisoning and gastro units, they are independent of each other HA: there is a relationship between self-poisoning and gastro units, they are not independent of each other.

#perform a chisq test -- write a hypothesis and then perform the test!
pat2chi<- chisq.test(pat2)
pat2chi

    Pearson's Chi-squared test

data:  pat2
X-squared = 1.9885, df = 4, p-value = 0.7379

The p value is 0.7379, which is not less than 0.05, so we fail to reject H0.

Perform a simple linear regression on the faithful data and check all assumptions. Report the summary your regression, interpret it, and provide a graph to go along with it–with a regression line. Next, test your assumptions and report the results. Is the relationship between your two variables linear? Don’t forget a hypothesis!

H0: There is no linear relationship between waiting time and eruption time

HA: there is a linear relationship between waiting time and eruption time

head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
lmfaithful<- lm(eruptions~waiting, data=faithful)
summary(lmfaithful)

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.29917 -0.37689  0.03508  0.34909  1.19329 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
waiting      0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared:  0.8115,    Adjusted R-squared:  0.8108 
F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

The results of the linear regression indicate that there is a strong, positive linear relationship between waiting time and eruption time. The summary tells us that around 81% of the variation in eruption time can be explained by waiting time. We can conclude that this relationship is statistically significant with the p-value of 2.2x10^-16, which is less than 0.05. Here is a graph to visualize this relationship:

ggplot(data=faithful, aes(x=waiting, y=eruptions))+
  geom_point()+
  geom_smooth(method="lm")+
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

library(see)

Attaching package: 'see'
The following objects are masked from 'package:ggsci':

    scale_color_material, scale_colour_material, scale_fill_material
check_model(lmfaithful)

I ran check_model, which checks 3/4 of the assumptions for linear regression. The first panel (top left) shows that the model and the data match pretty well, the blue and green lines mostly align, but not fully. This is not technically one of our assumptions but I’m adding it anyway. The top right panel checks for linearity, which looks pretty good here but is still a little wavy, so we are ok to assume that the relationship between waiting time and eruption is approximately linear. The next panel checks for our assumption of homogeneity of variance, which is also looking ok as the green reference line is horizontal and mostly flat, just a bit wavy. The check of our assumption of normality of residuals seems really good, the data all falls along the reference line. This assumptions check does not account for one of our assumptions, the assumption of independence. This is a limitation of this method of checking, and given the data that we have, we cannot know for sure if it is independent.

Load palmerpenguins. Using the penguins data, assess whether there is a difference in body mass between Adelie and Gentoo penguins. If there is a difference, assess the direction of the difference (one tends to be bigger than the other, etc). You should: a.) Tell me what statistical test you are using, b.) filter the data appropriately, c.) run your test, d.) make a corresponding visual to show the appropriate differences (means + error bars!). Write a hypothesis, run your test, interpret the test-statistics and use the stats + graph to tell me the result.

library(palmerpenguins)

The statistical test that I want to use here t-test.

H0: there is no difference between the average body mass (g) of Adelie and Gentoo penguins

HA: there is a difference in the average body mass (g) of Adelie and Gentoo penguins

penguins_G <- penguins %>% 
  filter(species=="Gentoo")
head(penguins_G)
# A tibble: 6 × 8
  species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex    year
  <fct>   <fct>           <dbl>         <dbl>          <int>   <int> <fct> <int>
1 Gentoo  Biscoe           46.1          13.2            211    4500 fema…  2007
2 Gentoo  Biscoe           50            16.3            230    5700 male   2007
3 Gentoo  Biscoe           48.7          14.1            210    4450 fema…  2007
4 Gentoo  Biscoe           50            15.2            218    5700 male   2007
5 Gentoo  Biscoe           47.6          14.5            215    5400 male   2007
6 Gentoo  Biscoe           46.5          13.5            210    4550 fema…  2007
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
penguins_A <- penguins %>% 
  filter(species=="Adelie")
head(penguins_A)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
t.test(penguins_G$body_mass_g,penguins_A$body_mass_g)

    Welch Two Sample t-test

data:  penguins_G$body_mass_g and penguins_A$body_mass_g
t = 23.386, df = 249.64, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1259.525 1491.183
sample estimates:
mean of x mean of y 
 5076.016  3700.662 
penguinsAG <- penguins %>%
filter(species!= "Chinstrap") 

bodymass <- penguinsAG %>%
drop_na(body_mass_g) %>%
group_by(species) %>% 
dplyr::summarize(avg_bm = mean(body_mass_g), sd=sd(body_mass_g),n=n(),se=sd/sqrt(n)) 
bodymass
# A tibble: 2 × 5
  species avg_bm    sd     n    se
  <fct>    <dbl> <dbl> <int> <dbl>
1 Adelie   3701.  459.   151  37.3
2 Gentoo   5076.  504.   123  45.5
ggplot(data=bodymass, aes(x=species, y=avg_bm, color=species))+
  geom_point()+
  geom_jitter(data=penguinsAG, aes(x=species,y=body_mass_g, color=species),alpha=0.3,size=0.3)+
  geom_errorbar(data=bodymass,aes(x=species,ymin=avg_bm-se,ymax=avg_bm+se, width=0.2))+
  theme_classic()+
  theme(axis.text=element_text(size=10))+
  labs(x='Species',y='Average Body Mass in Grams',title='Silly Penguins')+
  theme(plot.title=element_text(hjust=0.5))
Warning: Removed 2 rows containing missing values (`geom_point()`).

From looking at the graph alone, we can already tell that there is a difference in The graph visually shows us that there is a difference in the average body mass between the two species of penguin, the means are very far away from each other, and there is no overlap of the error bars. This result is further supported by the t-test, which tells us that we have a p-value of 2.2x10^-16, which is less than 0.05. Based on these results, we can reject our null hypothesis that there was no difference between the two means.

Depth

Make your essentials and depth components into tabsets using panel::tabset. Note that this is a visual change you can make to Quarto documents - this is testing your ability to modify Quarto docs and to google/find answers to common problems in R

Done (eye twitching)

Make your t-test graph from Essentials better – add the raw data in behind your means + error bars

I already did this in the Essentials :)

ggplot(data=bodymass, aes(x=species, y=avg_bm, color=species))+
  geom_point()+
  geom_jitter(data=penguinsAG, aes(x=species,y=body_mass_g, color=species),alpha=0.3,size=0.3)+
  geom_errorbar(data=bodymass,aes(x=species,ymin=avg_bm-se,ymax=avg_bm+se, width=0.2))+
  theme_classic()+
  theme(axis.text=element_text(size=10))+
  labs(x='Species',y='Average Body Mass in Grams',title='Silly Penguins')+
  theme(plot.title=element_text(hjust=0.5))
Warning: Removed 2 rows containing missing values (`geom_point()`).

What are the effects of body_mass_g and sex on bill_length_mm in Chinstrap penguins? Setup a more complex linear regression to test this! You have 2 variables of interest. You can choose for this to be interactive of additive. Write a hypothesis, filter data as needed, run the test, make a graph, interpret the results.

H0: there is no effect of body mass on the bill length of Chinstrap penguins, no effect of sex on bill length, and no interactive effect of sex and body mass on bill length

HA: there is an interactive effect of body mass and sex on the bill length of Chinstrap penguins

penguinsC <- penguins %>%
filter(species== "Chinstrap")

ggplot(data=penguinsC, aes(x=body_mass_g, y=bill_length_mm, color=sex))+
  geom_point()+
  geom_smooth(method="lm")+
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

lmchin<- lm(bill_length_mm~body_mass_g*sex, data=penguinsC)
summary(lmchin)

Call:
lm(formula = bill_length_mm ~ body_mass_g * sex, data = penguinsC)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6911 -1.1108 -0.2264  0.7923 10.9076 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         35.982910   5.196611   6.924 2.53e-09 ***
body_mass_g          0.003003   0.001469   2.044    0.045 *  
sexmale             11.056140   6.924654   1.597    0.115    
body_mass_g:sexmale -0.001973   0.001870  -1.055    0.295    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.407 on 64 degrees of freedom
Multiple R-squared:  0.5036,    Adjusted R-squared:  0.4803 
F-statistic: 21.64 on 3 and 64 DF,  p-value: 8.606e-10
anova(lmchin)
Analysis of Variance Table

Response: bill_length_mm
                Df Sum Sq Mean Sq F value    Pr(>F)    
body_mass_g      1 197.10 197.101 34.0126 1.959e-07 ***
sex              1 172.66 172.661 29.7951 8.337e-07 ***
body_mass_g:sex  1   6.45   6.453  1.1136    0.2953    
Residuals       64 370.88   5.795                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the graph and the linear regression and anova, we fail to reject the null hypothesis. The p-value for the interaction between body mass and sex is 0.295, which is greater than 0.05, so we cannot determine that there is an interactive effect. This means there is no difference in the relationship between bill length and body mass based on sex. It is still interesting to note that sex and body mass have individual statistically significant effects on bill length. There is a positive effect of body mass on bill length, so bigger penguins have bigger bills. There is also an effect of sex, in this case male penguins have a longer bill length than females.

Using the dataset below, please assess whether there is a difference in number of bigfoot sightings per year between Ohio and Washington. You will have to read the data in, filter by state, make a column for year (modifying date using lubridate), group_by state & year, then count rows (each row= a bigfoot sighting). This data frame will allow you to do your t-test! Next, calculate a mean across years– you can use this to make a graph. In short, I want you to run a t-test to assess the H0 that there is no difference in annual bigfoot sightings between Ohio and Washington. Then I want to see a graph showing means + error. Finally, interpret this graph + stats. This requires use of a skill from every lab so far!

bigfoot <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-09-13/bigfoot.csv')
Rows: 5021 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (10): observed, location_details, county, state, season, title, classif...
dbl  (17): latitude, longitude, number, temperature_high, temperature_mid, t...
date  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
bigfootOW<- bigfoot %>%
  filter(state=="Ohio"|state=="Washington") 
bigfootOW$date<- year(bigfootOW$date)
bigfeet<- bigfootOW %>%
  group_by(state,date) %>%
  count()

bigfeet
# A tibble: 123 × 3
# Groups:   state, date [123]
   state  date     n
   <chr> <dbl> <int>
 1 Ohio   1956     1
 2 Ohio   1958     2
 3 Ohio   1962     1
 4 Ohio   1963     1
 5 Ohio   1967     1
 6 Ohio   1968     1
 7 Ohio   1970     1
 8 Ohio   1971     2
 9 Ohio   1972     2
10 Ohio   1973     1
# … with 113 more rows
footfoot <- bigfeet %>%
group_by(state) %>% 
dplyr::summarize(avg_count = mean(n), sd=sd(n),sampsize=n(),se=sd/sqrt(sampsize)) 
footfoot
# A tibble: 2 × 5
  state      avg_count    sd sampsize    se
  <chr>          <dbl> <dbl>    <int> <dbl>
1 Ohio            5.32  4.22       57 0.559
2 Washington      9.11  9.80       66 1.21 
ggplot(data=footfoot, aes(x=state, y=avg_count))+
  geom_point()+
  geom_errorbar(data=footfoot,aes(x=state,ymin=avg_count-se,ymax=avg_count+se, width=0.2))+
  theme_classic()+
  theme(axis.text=element_text(size=10))+
  labs(x='State',y='Average Bigfoot Sighting Count',title='It is Done')+
  theme(plot.title=element_text(hjust=0.5))

bigfoot_OH <- bigfeet %>%
  filter(state=="Ohio")

bigfoot_WA<- bigfeet %>%
  filter(state=="Washington")

t.test(bigfoot_OH$n,bigfoot_WA$n)

    Welch Two Sample t-test

data:  bigfoot_OH$n and bigfoot_WA$n
t = -2.8504, df = 91.002, p-value = 0.0054
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.431579 -1.148963
sample estimates:
mean of x mean of y 
 5.315789  9.106061 

According to the graph and t-test above, there is a statistically significant difference in the average bigfoot encounters in Washington and Ohio. The p value is 0.0054, which is less than 0.05. This result is corroborated by the graph, which shows a visible difference between the means and no overlap of the error bars.