Dyje otters, visiting rate. Is there an effect of angling?
“here is the excel file. the collumn VR is visiting rate by otters for individual 600 m stretches, collumn angling - yes or not - is the stretch in or out of angling revier. the question: is there statistic difference in VR between stretches in and out side of angling?”
raw <- read_excel("navstevnost upr 2020 10 02.xlsx", sheet=2)
View(raw)
tidy_visit <- raw %>%
dplyr::select(1:4) %>%
clean_names() %>%
# dplyr::mutate(angling = recode(angling,
# "yes"= "1",
# "no" = "0")
# ) %>%
dplyr::mutate(stretch = as.integer(stretch)) %>%
dplyr::mutate(vr = as.numeric(vr))
View(tidy_visit)
Compute some summary statistics by groups: mean and sd (standard deviation)
tidy_visit %>%
group_by(angling) %>%
get_summary_stats(vr, type = "mean_sd")
## # A tibble: 2 x 5
## angling variable n mean sd
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 no vr 14 19.6 31.5
## 2 yes vr 18 20.2 29.8
We have to decide if these “outliers” make biological sense.
tidy_visit %>%
group_by(angling) %>%
identify_outliers(vr)
## # A tibble: 5 x 6
## angling stretch vr revier is.outlier is.extreme
## <chr> <int> <dbl> <chr> <lgl> <lgl>
## 1 no 28 90.9 no TRUE TRUE
## 2 no 29 81.8 no TRUE TRUE
## 3 no 30 54.5 no TRUE TRUE
## 4 yes 23 100 Dyje 13 TRUE TRUE
## 5 yes 24 91.8 Dyje 13 TRUE TRUE
# ggplot geom_boxplot
ggplot(data = tidy_visit, aes(x = 1, y = vr)) +
stat_boxplot(geom = 'errorbar', width = 0.3) +
geom_boxplot() +
labs(x = "", y = "Visiting Rate ") +
theme_bw() +
theme(axis.text.x = element_blank())
# Generating a conditional boxplot (visiting rate by angling)
ggplot(data = tidy_visit, aes(x = angling, y = vr)) +
stat_boxplot(geom = 'errorbar', width = 0.3) +
geom_boxplot() +
labs(x = "", y = "Visiting rate", title = expression(italic("Otter visits depending on angling"))) +
theme_bw()
Another tool to recognize outliers is the Cleveland dotplot, proposed by Cleveland in 1993. The idea behind this plot is to visualize all the observations in a ordered fashion, that way, any points far from the bulk of observations can be easily recognized. In this visualization even there are two axis just only one variable is being analysed, thus to produce this plot (mostly times) the row numbers (or the ID of each observation) are plotted along the vertical axis and the observed values are plotted in the horizontal axis. By this, the y-axis shows the order of the data and the x-axis shows the values.
# Cleveland's dotplot
ggplot(tidy_visit, aes(x = vr, y = reorder(vr, stretch))) +
geom_point(size = 1) +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(colour = "grey90", linetype = "dashed")) +
labs(x = "Range of Data", y = "Order of data", title = "Cleveland's dotplot")
Even though the graph below is the same as the produced by the dotchart function (even more customizable) there is a second way to reorder our data, in this case we will sort the observations accordingly to their vr value, this way we should see a continuous line between observations. This visualization is another way to represent the bulk of the data.
# Cleveland's dotplot version 2
ggplot(tidy_visit, aes(x = vr, y = reorder(stretch, vr))) +
geom_point(size = 1) +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(colour = "grey90", linetype = "dashed")) +
labs(x = "Range of Data", y = "Order of data", title = "Cleveland's dotplot")
In a similar way than previous plots, the multi panel allow us to make comparisons of the presence of outliers (or rare observations) in more than one group. Thus, we will compare the VR re-ordered values for stretches with (yes) and without angling (no).
# Conditional Cleveland dotplot
ggplot(tidy_visit, aes(x = vr, y = reorder(stretch, vr))) +
geom_point(size = 1) +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(colour = "grey90", linetype = "dashed")) +
labs(x = "Range of Data", y = "Order of data", title = "Conditional Cleveland's dotplot") +
facet_wrap(~angling)
The histograms came from a frequency table and they are supposed to show the centre and the spread of the data. The shape of this spread gives the information about the probability distribution. Even though, all the bell-shaped distribution do not correspond to only normal distributions (i.e. t-distribution) this distribution is the one we need to look for.
We will explore the distribution of the of the visiting rate (vr). The y-axis will represents the frequency of observations per class unit.
# Histogram of the Wt variable fro just the SSTS species
tidy_visit %>%
ggplot(., aes(vr)) +
geom_histogram() +
labs(x = "Visiting rate", ylab = "", title = expression(italic("Visiting rate of otters"))) +
theme_bw()
Not normal (try without outliers?)
tidy_visit %>%
group_by(angling) %>%
shapiro_test(vr)
## # A tibble: 2 x 4
## angling variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 no vr 0.634 0.0000834
## 2 yes vr 0.660 0.0000281
ggqqplot(tidy_visit, x = "vr", facet.by = "angling")
This can be done using the Levene’s test. If the variances of groups are equal, the p-value should be greater than 0.05.
tidy_visit %>%
levene_test(vr ~ angling)
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 1 30 0.000908 0.976
We can visually inspect the relationships among the variables by the use of the ggpairs() function inside the GGally package.
# library(GGally)
#Visualizing the relationship among variables.
ggpairs(tidy_visit, mapping = ggplot2::aes(colour = angling),
lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.1)),
diag = list(continuous = wrap("densityDiag", alpha = 0.3), discrete = "barDiag", na = "naDiag")
)
We want to know whether the average visit rates are different between areas with or without angling
By default, R computes the Weltch t-test, which is the safer one:
stat.test <- tidy_visit %>%
rstatix::t_test(vr ~ angling) %>%
add_significance()
stat.test
## # A tibble: 1 x 9
## .y. group1 group2 n1 n2 statistic df p p.signif
## <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 vr no yes 14 18 -0.0472 27.3 0.963 ns
Cohen’s d for Welch t-test The Welch test is a variant of t-test used when the equality of variance can’t be assumed. The effect size can be computed by dividing the mean difference between the groups by the “averaged” standard deviation.
Cohen’s d formula:
d = (mean1 - mean2)/sqrt((var1 + var2)/2), where:
mean1 and mean2 are the means of each group, respectively var1 and var2 are the variance of the two groups.
tidy_visit %>% cohens_d(vr ~ angling, var.equal = FALSE)
## # A tibble: 1 x 7
## .y. group1 group2 effsize n1 n2 magnitude
## * <chr> <chr> <chr> <dbl> <int> <int> <ord>
## 1 vr no yes -0.0169 14 18 negligible
We have to decide on those outliers and the normality problem, but for now the conclusion would be that:
The mean visiting rate in the area free of angling was 19.65 (SD = 31.46), whereas the mean visiting rate in the area where angling is allowed was 20.17 (SD = 29.79). A Welch two-samples t-test showed that the difference was not statistically significant, t(27.3) = -0.05, p = 0.96, d = 0.017; where, t(27.3) is shorthand notation for a Welch t-statistic that has 27.3 degrees of freedom.
stat.test <- stat.test %>% add_xy_position(x = "angling")
bxp <- ggboxplot(
tidy_visit, x = "angling", y = "vr",
ylab = "Visiting rate", xlab = "Angling", add = "jitter"
)
bxp +
stat_pvalue_manual(stat.test, tip.length = 0) +
labs(subtitle = get_test_label(stat.test, detailed = TRUE))