Introduction

Dyje otters, visiting rate. Is there an effect of angling?

“here is the excel file. the collumn VR is visiting rate by otters for individual 600 m stretches, collumn angling - yes or not - is the stretch in or out of angling revier. the question: is there statistic difference in VR between stretches in and out side of angling?”

Data

raw <- read_excel("navstevnost upr 2020 10 02.xlsx", sheet=2)
View(raw)

tidy_visit <- raw %>% 
  dplyr::select(1:4) %>% 
  clean_names() %>%
  # dplyr::mutate(angling = recode(angling, 
  #                                 "yes"= "1",
  #                                 "no" = "0")
  #                                    ) %>% 
  dplyr::mutate(stretch = as.integer(stretch)) %>% 
  dplyr::mutate(vr = as.numeric(vr))
View(tidy_visit)

Summary statistics

Compute some summary statistics by groups: mean and sd (standard deviation)

tidy_visit %>%
  group_by(angling) %>%
  get_summary_stats(vr, type = "mean_sd")

## # A tibble: 2 x 5
##   angling variable     n  mean    sd
##   <chr>   <chr>    <dbl> <dbl> <dbl>
## 1 no      vr          14  19.6  31.5
## 2 yes     vr          18  20.2  29.8

Outliers?

Identify outliers by groups

We have to decide if these “outliers” make biological sense.

tidy_visit %>%
  group_by(angling) %>%
  identify_outliers(vr)

## # A tibble: 5 x 6
##   angling stretch    vr revier  is.outlier is.extreme
##   <chr>     <int> <dbl> <chr>   <lgl>      <lgl>     
## 1 no           28  90.9 no      TRUE       TRUE      
## 2 no           29  81.8 no      TRUE       TRUE      
## 3 no           30  54.5 no      TRUE       TRUE      
## 4 yes          23 100   Dyje 13 TRUE       TRUE      
## 5 yes          24  91.8 Dyje 13 TRUE       TRUE

Boxplot with all the data

# ggplot geom_boxplot
ggplot(data = tidy_visit, aes(x = 1, y = vr)) +
  stat_boxplot(geom = 'errorbar', width = 0.3) + 
  geom_boxplot() +
  labs(x = "", y = "Visiting Rate ") +
  theme_bw() +
  theme(axis.text.x = element_blank())

Conditional boxplot

# Generating a conditional boxplot (visiting rate by angling) 
ggplot(data = tidy_visit, aes(x = angling, y = vr)) +
  stat_boxplot(geom = 'errorbar', width = 0.3) + 
  geom_boxplot() +
  labs(x = "", y = "Visiting rate", title = expression(italic("Otter visits depending on angling"))) +
  theme_bw()

Cleveland dotplot

Another tool to recognize outliers is the Cleveland dotplot, proposed by Cleveland in 1993. The idea behind this plot is to visualize all the observations in a ordered fashion, that way, any points far from the bulk of observations can be easily recognized. In this visualization even there are two axis just only one variable is being analysed, thus to produce this plot (mostly times) the row numbers (or the ID of each observation) are plotted along the vertical axis and the observed values are plotted in the horizontal axis. By this, the y-axis shows the order of the data and the x-axis shows the values.

# Cleveland's dotplot
ggplot(tidy_visit, aes(x = vr, y = reorder(vr, stretch))) +
  geom_point(size = 1) + 
  theme_bw() + 
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(), 
        panel.grid.major.y = element_line(colour = "grey90", linetype = "dashed")) + 
  labs(x = "Range of Data", y = "Order of data", title = "Cleveland's dotplot")

Even though the graph below is the same as the produced by the dotchart function (even more customizable) there is a second way to reorder our data, in this case we will sort the observations accordingly to their vr value, this way we should see a continuous line between observations. This visualization is another way to represent the bulk of the data.

# Cleveland's dotplot version 2
ggplot(tidy_visit, aes(x = vr, y = reorder(stretch, vr))) +
  geom_point(size = 1) + 
  theme_bw() + 
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(), 
        panel.grid.major.y = element_line(colour = "grey90", linetype = "dashed")) + 
  labs(x = "Range of Data", y = "Order of data", title = "Cleveland's dotplot")

Multi-panel Cleveland dotplot

In a similar way than previous plots, the multi panel allow us to make comparisons of the presence of outliers (or rare observations) in more than one group. Thus, we will compare the VR re-ordered values for stretches with (yes) and without angling (no).

# Conditional Cleveland dotplot
ggplot(tidy_visit, aes(x = vr, y = reorder(stretch, vr))) +
  geom_point(size = 1) + 
  theme_bw() + 
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(), 
        panel.grid.major.y = element_line(colour = "grey90", linetype = "dashed")) + 
  labs(x = "Range of Data", y = "Order of data", title = "Conditional Cleveland's dotplot") + 
  facet_wrap(~angling)

Normality

Histograms

The histograms came from a frequency table and they are supposed to show the centre and the spread of the data. The shape of this spread gives the information about the probability distribution. Even though, all the bell-shaped distribution do not correspond to only normal distributions (i.e. t-distribution) this distribution is the one we need to look for.

We will explore the distribution of the of the visiting rate (vr). The y-axis will represents the frequency of observations per class unit.

# Histogram of the Wt variable fro just the SSTS species
tidy_visit %>% 
  ggplot(., aes(vr)) +
  geom_histogram() +
  labs(x = "Visiting rate", ylab = "", title = expression(italic("Visiting rate of otters"))) +
  theme_bw()

Check normality by groups

Shapiro wilk test by goups

Not normal (try without outliers?)

tidy_visit %>%
  group_by(angling) %>%
  shapiro_test(vr)

## # A tibble: 2 x 4
##   angling variable statistic         p
##   <chr>   <chr>        <dbl>     <dbl>
## 1 no      vr           0.634 0.0000834
## 2 yes     vr           0.660 0.0000281

QQ plot by group

ggqqplot(tidy_visit, x = "vr", facet.by = "angling")

Check the equality of variances

This can be done using the Levene’s test. If the variances of groups are equal, the p-value should be greater than 0.05.

tidy_visit %>% 
  levene_test(vr ~ angling)

## # A tibble: 1 x 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1     1    30  0.000908 0.976

Analysis

Relationships among variables

We can visually inspect the relationships among the variables by the use of the ggpairs() function inside the GGally package.

# library(GGally)

#Visualizing the relationship among variables. 
ggpairs(tidy_visit, mapping = ggplot2::aes(colour = angling),
        lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.1)),
        diag = list(continuous = wrap("densityDiag", alpha = 0.3), discrete = "barDiag", na = "naDiag")
)

T-test

We want to know whether the average visit rates are different between areas with or without angling

By default, R computes the Weltch t-test, which is the safer one:

stat.test <- tidy_visit %>% 
  rstatix::t_test(vr ~ angling) %>%
  add_significance()
stat.test

## # A tibble: 1 x 9
##   .y.   group1 group2    n1    n2 statistic    df     p p.signif
##   <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl> <chr>   
## 1 vr    no     yes       14    18   -0.0472  27.3 0.963 ns

Effect size

Cohen’s d for Welch t-test The Welch test is a variant of t-test used when the equality of variance can’t be assumed. The effect size can be computed by dividing the mean difference between the groups by the “averaged” standard deviation.

Cohen’s d formula:

d = (mean1 - mean2)/sqrt((var1 + var2)/2), where:

mean1 and mean2 are the means of each group, respectively var1 and var2 are the variance of the two groups.

tidy_visit %>% cohens_d(vr ~ angling, var.equal = FALSE)

## # A tibble: 1 x 7
##   .y.   group1 group2 effsize    n1    n2 magnitude 
## * <chr> <chr>  <chr>    <dbl> <int> <int> <ord>     
## 1 vr    no     yes    -0.0169    14    18 negligible

Report

We have to decide on those outliers and the normality problem, but for now the conclusion would be that:

The mean visiting rate in the area free of angling was 19.65 (SD = 31.46), whereas the mean visiting rate in the area where angling is allowed was 20.17 (SD = 29.79). A Welch two-samples t-test showed that the difference was not statistically significant, t(27.3) = -0.05, p = 0.96, d = 0.017; where, t(27.3) is shorthand notation for a Welch t-statistic that has 27.3 degrees of freedom.

stat.test <- stat.test %>% add_xy_position(x = "angling")
bxp <- ggboxplot(
  tidy_visit, x = "angling", y = "vr", 
  ylab = "Visiting rate", xlab = "Angling", add = "jitter"
  )
bxp + 
  stat_pvalue_manual(stat.test, tip.length = 0) +
  labs(subtitle = get_test_label(stat.test, detailed = TRUE))

Dyje Visiting Rate: Effect of angling

Fernando Mateos-González

Last update “02 octubre, 2020”