Lab 3 Assignment

Author

Dinah Marion Abeja

#load packages
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library (lubridate)

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union
library(palmerpenguins)
library(patchwork)

Essentials:

1.) Read in https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-04/water.csv check the format of the date column and change it using lubridate so it is correct

#read in the water data
water<-read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-04/water.csv')
Rows: 473293 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): report_date, status_id, water_source, water_tech, facility_type, co...
dbl (4): row_id, lat_deg, lon_deg, install_year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
water_1 <- water
head(water_1)
# A tibble: 6 × 13
  row_id lat_deg lon_deg repor…¹ statu…² water…³ water…⁴ facil…⁵ count…⁶ insta…⁷
   <dbl>   <dbl>   <dbl> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>     <dbl>
1   3957   8.07     38.6 04/06/… y       <NA>    <NA>    <NA>    Ethiop…      NA
2  33512   7.37     40.5 08/04/… y       Protec… <NA>    Improv… Ethiop…    2019
3  35125   0.773    34.9 03/18/… y       Protec… <NA>    Improv… Kenya        NA
4  37760   0.781    35.0 03/18/… y       Boreho… <NA>    Improv… Kenya        NA
5  38118   0.779    35.0 03/18/… y       Protec… <NA>    Improv… Kenya        NA
6  38501   0.308    34.1 03/19/… y       Boreho… <NA>    Improv… Kenya        NA
# … with 3 more variables: installer <chr>, pay <chr>, status <chr>, and
#   abbreviated variable names ¹​report_date, ²​status_id, ³​water_source,
#   ⁴​water_tech, ⁵​facility_type, ⁶​country_name, ⁷​install_year
water_1$report_date<-mdy(water_1$report_date) #converts our date column into a date/time object based on the format (order) of our date and time 

str(water_1)
spc_tbl_ [473,293 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ row_id       : num [1:473293] 3957 33512 35125 37760 38118 ...
 $ lat_deg      : num [1:473293] 8.073 7.374 0.773 0.781 0.779 ...
 $ lon_deg      : num [1:473293] 38.6 40.5 34.9 35 35 ...
 $ report_date  : Date[1:473293], format: "2017-04-06" "2020-08-04" ...
 $ status_id    : chr [1:473293] "y" "y" "y" "y" ...
 $ water_source : chr [1:473293] NA "Protected Spring" "Protected Shallow Well" "Borehole" ...
 $ water_tech   : chr [1:473293] NA NA NA NA ...
 $ facility_type: chr [1:473293] NA "Improved" "Improved" "Improved" ...
 $ country_name : chr [1:473293] "Ethiopia" "Ethiopia" "Kenya" "Kenya" ...
 $ install_year : num [1:473293] NA 2019 NA NA NA ...
 $ installer    : chr [1:473293] "Private-CRS" "WaterAid" NA NA ...
 $ pay          : chr [1:473293] NA NA NA NA ...
 $ status       : chr [1:473293] NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   row_id = col_double(),
  ..   lat_deg = col_double(),
  ..   lon_deg = col_double(),
  ..   report_date = col_character(),
  ..   status_id = col_character(),
  ..   water_source = col_character(),
  ..   water_tech = col_character(),
  ..   facility_type = col_character(),
  ..   country_name = col_character(),
  ..   install_year = col_double(),
  ..   installer = col_character(),
  ..   pay = col_character(),
  ..   status = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
# date which was initially a character is now date object: it is read as date and thus easy to work with. 

2 Read in some data from Justin’s PhD!

bzcoral<-read.csv('https://raw.githubusercontent.com/jbaumann3/BIOL234_Biostats_MHC/main/coralcover.csv')
head(bzcoral)
  site       type lat transect diver cc_percent
1    1  Back Reef   3        1     4  5.8435758
2    1  Back Reef   3        2     4  0.9505263
3    1  Back Reef   3        3     4  5.2423389
4    1  Back Reef   3        4     5  5.0040475
5    1  Back Reef   3        5     5  5.8954916
6    2 Patch Reef   3        1     4  5.2826190
coral_1 <- bzcoral
head(coral_1)
  site       type lat transect diver cc_percent
1    1  Back Reef   3        1     4  5.8435758
2    1  Back Reef   3        2     4  0.9505263
3    1  Back Reef   3        3     4  5.2423389
4    1  Back Reef   3        4     5  5.0040475
5    1  Back Reef   3        5     5  5.8954916
6    2 Patch Reef   3        1     4  5.2826190
tail(coral_1)
   site       type lat transect diver cc_percent
72   14 Patch Reef   1        1     4  2.4702869
73   14 Patch Reef   1        2     4  1.6663234
74   14 Patch Reef   1        3     4  1.5463694
75   14 Patch Reef   1        4     5  0.4389017
76   14 Patch Reef   1        5     5  1.4988514
77   14 Patch Reef   1        6     5  0.3722362
dim(coral_1)
[1] 77  6
ncol(coral_1)
[1] 6
nrow(coral_1)
[1] 77
str(coral_1)
'data.frame':   77 obs. of  6 variables:
 $ site      : int  1 1 1 1 1 2 2 2 2 2 ...
 $ type      : chr  "Back Reef" "Back Reef" "Back Reef" "Back Reef" ...
 $ lat       : int  3 3 3 3 3 3 3 3 3 3 ...
 $ transect  : int  1 2 3 4 5 1 2 3 4 5 ...
 $ diver     : int  4 4 4 5 5 4 4 4 5 5 ...
 $ cc_percent: num  5.844 0.951 5.242 5.004 5.895 ...

3 Look at the data and generate a hypothesis to test (X, Y, and/or Z has no effect on coral cover (cc_percent), for example). cc_percent is the only numerical variable we care about here! Everything else is categorical. Write your hypothesis in BOLD below.

Hypothesis: the type of reef has no effect on the percent distribution of coral cover

\(H_0:\) the type of reef has no effect on the percent coral cover distribution

\(H_A:\) the type of reef has an effect on the percent coral cover distribution

4 Filter our the columns that you are not using for your hypothesis test

coral_fil<- coral_1[,c(2,6)]
head(coral_fil)
        type cc_percent
1  Back Reef  5.8435758
2  Back Reef  0.9505263
3  Back Reef  5.2423389
4  Back Reef  5.0040475
5  Back Reef  5.8954916
6 Patch Reef  5.2826190

5 Using the pipe, %>%, group your data and calcualte the mean(s) you need for your visual hypothesis test

coral_fil2 <- coral_fil %>%
  group_by(type) %>% #group = type of reef 
  summarize(mean_cover = mean(cc_percent), sd_cover = sd(cc_percent), n=n(), se = sd_cover/sqrt(n)) 
head(coral_fil2)
# A tibble: 3 × 5
  type       mean_cover sd_cover     n    se
  <chr>           <dbl>    <dbl> <int> <dbl>
1 Back Reef       11.9      9.52    29 1.77 
2 Nearshore        3.16     2.25    18 0.531
3 Patch Reef      12.7     11.5     30 2.11 

6 Graph your results! Means + errorbars required :) Make a nice, easy to see graph with clear labels and text

p <- ggplot(data = coral_fil2,aes(y = mean_cover, x = type, color ="type")) +
  geom_point()+
  labs(title = "% distribution of coral cover per reef", x= "type", y ="cc_percent")+
  theme_bw()+
  theme(axis.text.x=element_text(angle = 0, size = 15, vjust = 1), axis.title.y = element_text(size = 16),axis.title.x = element_text(size = 16))

p+geom_errorbar(aes(ymin=mean_cover-se, ymax=mean_cover+se), width=.2,
                 position=position_dodge(.9))

7 Assess your hypothesis! What does your graph show (Note: We did not do stats, so please do not say ‘significant’)

The percent distribution of coral cover is very different especially at Near shore (lower) as compared to Patch Reef and Back Reef. Even though the Patch Reef had seemingly the highest % coral cover, it also contains the highest error bars ( which shows that there is very high variation/spread of the data from the mean). Therefore, the type of Reef has an effect on the cc_percent distribution.

Depth

1: Read in intertidal transect data. View it, identify the columns that contain species/cover items and pivot from wide to long format!

op<-read.csv('https://raw.githubusercontent.com/jbaumann3/BIOL234_Biostats_MHC/main/Ocean_Point_Data%202018.csv')
head(op)
  Number  trans_description year tidal_ht_m tidal_ht_cat_num bare_rock calothr
1      0 a_cobble_protected 2017      -0.13                1         0       0
2      0 a_cobble_protected 2017      -0.13                1         5       0
3      0 a_cobble_protected 2017      -0.13                1         5       0
4      0 a_cobble_protected 2018       0.38                2        10      12
5      0 a_cobble_protected 2018       0.38                2         0      93
6      0 a_cobble_protected 2018       0.78                2         0      65
  semibal mytilus Halichondria Spongomorpha chaetomo ulva_lac ascophyl lamin_di
1       0       0            0            0        0        0        0        0
2       0       0            0            0        0        5        0        0
3       0       0            0            0        0        0        0        0
4       8       0            0            0        0        0       10        0
5       7       0            0            0        0        0        0        0
6      30       0            0            0        0        0        0        0
  lamin_sa fucus_spir fucus_vesic fucus_dist fucus_sp chondrus mastocar
1        0          0           0          0        0        0       40
2        0          0           5          0        0        0       20
3        0          0           5          0        0        0       20
4        0          0          60          0        0        0        0
5        0          0           0          0        0        0        0
6        0          0           5          0        0        0        0
  petrocel corralina ceramium lithothamn Enteromorpha tectura litt_sax litt_lit
1        0         2        0         58            0       0        0       64
2        0         5        0         60            0       0        0        0
3        0        15        0         55            0       0        0        0
4        0         0        0          0            0       0        0      112
5        0         0        0          0            0       0        0      112
6        0         0        0          0            0       0        0        0
  litt_obt nucella carcinus Cancer.crabs
1        0       0        0            0
2        0       0        0            0
3       16       0        0            0
4       80       0        0            0
5       16       0        0            0
6        0       0        0            0
long_op<- op %>%
  pivot_longer(bare_rock:Cancer.crabs, names_to = 'species', values_to = 'counts')
head(long_op)
# A tibble: 6 × 7
  Number trans_description   year tidal_ht_m tidal_ht_cat_num species     counts
   <int> <chr>              <int>      <dbl>            <int> <chr>        <dbl>
1      0 a_cobble_protected  2017      -0.13                1 bare_rock        0
2      0 a_cobble_protected  2017      -0.13                1 calothr          0
3      0 a_cobble_protected  2017      -0.13                1 semibal          0
4      0 a_cobble_protected  2017      -0.13                1 mytilus          0
5      0 a_cobble_protected  2017      -0.13                1 Halichondr…      0
6      0 a_cobble_protected  2017      -0.13                1 Spongomorp…      0
# pivoting species that we want.
long_op2<-long_op

2 Filter data such that we retain only ‘species’ that are animals. These include: Carinbus, Cancer Crabs, nucella, litt_obt, litt_lit, litt_sax. Everything else is either rock or algae. Please do this filter step AFTER you pivot to long format. Note: It is easier to do this when you are in wide format, but this is good practice! Ask questions if you need help :). Hint: The ‘|’ key is how you say ‘or’ in code :)

long_op2<- filter(long_op2, species =="carcinus" |species =="Cancer.crabs" |species =="nucella" |species =="litt_obt" |species == "litt_lit" |species =="litt_sax" |species =="semibal") 

head(long_op2)
# A tibble: 6 × 7
  Number trans_description   year tidal_ht_m tidal_ht_cat_num species  counts
   <int> <chr>              <int>      <dbl>            <int> <chr>     <dbl>
1      0 a_cobble_protected  2017      -0.13                1 semibal       0
2      0 a_cobble_protected  2017      -0.13                1 litt_sax      0
3      0 a_cobble_protected  2017      -0.13                1 litt_lit     64
4      0 a_cobble_protected  2017      -0.13                1 litt_obt      0
5      0 a_cobble_protected  2017      -0.13                1 nucella       0
6      0 a_cobble_protected  2017      -0.13                1 carcinus      0
# vertical line equals to OR
#tidyverse version: df <- long_op %>%

2 Using the same data–> rename the factors in the trans_description column based on wave exposure. a_cobble_protected is low, c_flat_profile and d_semi_expos are both moderate, e_exposed is high. Hint: use ifelse(). Justin can help!

long_op2$wave_exposure <- ifelse(long_op2$trans_description == "a_cobble_protected","low",
                               ifelse(long_op2$trans_description == "e_exposed", "low", "moderate"))
head(long_op2)
# A tibble: 6 × 8
  Number trans_description   year tidal_ht_m tidal_ht_c…¹ species counts wave_…²
   <int> <chr>              <int>      <dbl>        <int> <chr>    <dbl> <chr>  
1      0 a_cobble_protected  2017      -0.13            1 semibal      0 low    
2      0 a_cobble_protected  2017      -0.13            1 litt_s…      0 low    
3      0 a_cobble_protected  2017      -0.13            1 litt_l…     64 low    
4      0 a_cobble_protected  2017      -0.13            1 litt_o…      0 low    
5      0 a_cobble_protected  2017      -0.13            1 nucella      0 low    
6      0 a_cobble_protected  2017      -0.13            1 carcin…      0 low    
# … with abbreviated variable names ¹​tidal_ht_cat_num, ²​wave_exposure
# makes a new column(input the dataframe$name of column) 

3 Using the same data, make a simplified tidal height column. We want tidal height cat <4 to be ‘low’, >7 to be ‘high’, and in between to be ‘intermediate’. Hint: You can use if_else for this as well!

long_op2$tidal_ht_cat_num <- ifelse(long_op2$tidal_ht_cat_num  <4,"low",
                               ifelse(long_op2$tidal_ht_cat_num  >7, "high", "moderate"))
head(long_op2)
# A tibble: 6 × 8
  Number trans_description   year tidal_ht_m tidal_ht_c…¹ species counts wave_…²
   <int> <chr>              <int>      <dbl> <chr>        <chr>    <dbl> <chr>  
1      0 a_cobble_protected  2017      -0.13 low          semibal      0 low    
2      0 a_cobble_protected  2017      -0.13 low          litt_s…      0 low    
3      0 a_cobble_protected  2017      -0.13 low          litt_l…     64 low    
4      0 a_cobble_protected  2017      -0.13 low          litt_o…      0 low    
5      0 a_cobble_protected  2017      -0.13 low          nucella      0 low    
6      0 a_cobble_protected  2017      -0.13 low          carcin…      0 low    
# … with abbreviated variable names ¹​tidal_ht_cat_num, ²​wave_exposure

4 Using the same dataframe, build 2 new dataframes, one calculating mean and error (standard error) for semibal (barnacle) abundance and one doing the same for nucella (whelk) abundance by tidal height and wave exposure (trans_description)

long_op3 <- long_op2 %>%
  group_by(species, tidal_ht_cat_num,wave_exposure) %>%
  filter(species =="semibal") %>%
summarize(mean_semibal = mean(counts), sd_semibal = sd(counts),n=n(),se_semibal = sd_semibal/sqrt(n))
`summarise()` has grouped output by 'species', 'tidal_ht_cat_num'. You can
override using the `.groups` argument.
head(long_op3)
# A tibble: 6 × 7
# Groups:   species, tidal_ht_cat_num [3]
  species tidal_ht_cat_num wave_exposure mean_semibal sd_semibal     n se_semi…¹
  <chr>   <chr>            <chr>                <dbl>      <dbl> <int>     <dbl>
1 semibal high             low                  0           0       21     0    
2 semibal high             moderate             0           0       18     0    
3 semibal low              low                  6.11        8.46    18     1.99 
4 semibal low              moderate             0.593       1.45    27     0.279
5 semibal moderate         low                 17.7        26.6     35     4.50 
6 semibal moderate         moderate            14.4        20.8     36     3.47 
# … with abbreviated variable name ¹​se_semibal
long_op4 <- long_op2 %>%
group_by(species, tidal_ht_cat_num,wave_exposure) %>%
  filter(species =="nucella") %>%
summarize(mean_nucella = mean(counts), sd_nucella = sd(counts),n=n(),se_nucella = sd_nucella/sqrt(n))
`summarise()` has grouped output by 'species', 'tidal_ht_cat_num'. You can
override using the `.groups` argument.
head(long_op4)
# A tibble: 6 × 7
# Groups:   species, tidal_ht_cat_num [3]
  species tidal_ht_cat_num wave_exposure mean_nucella sd_nucella     n se_nuce…¹
  <chr>   <chr>            <chr>                <dbl>      <dbl> <int>     <dbl>
1 nucella high             low                  0           0       21     0    
2 nucella high             moderate             0           0       18     0    
3 nucella low              low                  0.889       3.77    18     0.889
4 nucella low              moderate             2.96        8.92    27     1.72 
5 nucella moderate         low                  0           0       35     0    
6 nucella moderate         moderate             4.44       13.0     36     2.17 
# … with abbreviated variable name ¹​se_nucella

5 Plot barnacle and snail abundance + error across tidal heights & wave exposures. You can plot them both on the same graphs (merging your dataframes could help) or you can make multiple graphs and patchwork them together.

long_op5 <- long_op2 %>%
group_by(species, tidal_ht_cat_num,wave_exposure) %>%
  filter(species =="nucella")
long_op6 <- long_op2 %>%
group_by(species, tidal_ht_cat_num,wave_exposure) %>%
  filter(species =="semibal")
p1 <- ggplot(data =long_op4,aes(y = mean_nucella, x = wave_exposure)) +
  geom_point()+
  labs(title = "Abundance of snail by wave exposure", y= "mean_nucella abundance", x ="wave exposure")+
  theme_bw()+
  theme(axis.text.x=element_text(angle = 0, size = 15, vjust = 1), axis.title.y = element_text(size = 16),axis.title.x = element_text(size = 16))
  
p1+geom_errorbar(aes(ymin=mean_nucella, ymax=se_nucella), width=.2,
                 position=position_dodge(.9)) +
    geom_jitter(data =long_op5,aes(y = counts, x = wave_exposure))

p2 <- ggplot(data =long_op3,aes(y = mean_semibal, x = wave_exposure)) +
  geom_point()+
  labs(title = "Abundance of whelk by wave exposure", y= "mean_semibal abundance", x ="wave exposure")+
  theme_bw()+
  theme(axis.text.x=element_text(angle = 0, size = 15, vjust = 1), axis.title.y = element_text(size = 16),axis.title.x = element_text(size = 16))
  
p2+geom_errorbar(aes(ymin=mean_semibal, ymax=se_semibal), width=.2,
                 position=position_dodge(.9)) +
    geom_jitter(data =long_op6,aes(y = counts, x = wave_exposure))

The mean abundance of snail is uniform and relatively similar at low and moderate wave exposure. However there is high variation at moderate wave exposure than at low wave exposure. The mean abundance of whelk is higher at lower wave exposure. There is high spread around the mean in both cases.

For both the snail and the whelk, there is zero mean abundance at high wave exposure.

Overall, the mean abundance of snails is generally higher than whelk distribution for (low and moderate) wave exposures, however, there is also very high spread of the data from the mean (high variation) for the whelk population

p3 <- ggplot(data =long_op4,aes(y = mean_nucella, x = tidal_ht_cat_num)) +
  geom_point()+
  labs(title = "Abundance of snail by tidal height", y= "mean_nucella abundance", x ="tidal_ht_cat_num")+
  theme_bw()+
  theme(axis.text.x=element_text(angle = 0, size = 15, vjust = 1), axis.title.y = element_text(size = 16),axis.title.x = element_text(size = 16))
  
p3+geom_errorbar(aes(ymin=mean_nucella, ymax=se_nucella), width=.2,
                 position=position_dodge(.9)) +
    geom_jitter(data =long_op5,aes(y = counts, x = tidal_ht_cat_num))

p4 <- ggplot(data =long_op3,aes(y = mean_semibal, x = tidal_ht_cat_num)) +
  geom_point()+
  labs(title = "Abundance of whelk by tidal height", y= "mean_semibal abundance", x ="tidal_ht_cat_num")+
  theme_bw()+
  theme(axis.text.x=element_text(angle = 0, size = 15, vjust = 1), axis.title.y = element_text(size = 16),axis.title.x = element_text(size = 16))
  
p4+geom_errorbar(aes(ymin=mean_semibal, ymax=se_semibal), width=.2,
                 position=position_dodge(.9)) +
    geom_jitter(data =long_op6,aes(y = counts, x = tidal_ht_cat_num))

There is higher distribution of snails at high wave exposure and no whelk at high wave exposure. on the other hand, there is high mean abundance of whelk at moderate tidal height than snail abundance. In addition, there is high spread around the mean for both the snail and whelk population at moderate tidal height.

6 Write a short interpretative statement. We didn’t run any stats, so avoid the word ‘significant.’ How do snail and whelk abundances appear to vary by tidal height and wave exposure? I do not need you to tell me anything about the ecology of the system, but feel free to do so if you’d like :) I just need you to interpret your graphs.

see explanation under the graphs above

I tried to patch my graphs, but when i did, the error bars and some data points were not visible anymore - it seemed like it the data became so condensed and i also failed at merging the data frames to one - so was not really able have them side by side as recommended.