Thursday Birds: Initial Graphs

Author

Kathleen Strachota

Loading in packages and data

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.1 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(tidyr)
library(ggsci)

eBird_Data <- read.csv("/Users/kathleenstrachota/Desktop/Mount Holyoke/BIOSTATISTICS/Final/eBird_Data.csv")
longbirds <- eBird_Data %>%
  pivot_longer(Nov_Oven:Mar_mins, names_to = "name", values_to = "value") %>%
  separate(name, into=c('month', 'var'), sep = '_') %>%
  pivot_wider(names_from = var, values_from = value) %>%
  drop_na('Oven')

NDVI by Longitude and Latitude

ggplot(data = longbirds, aes(y = NDVI_diff, x = LONGITUDE)) +
  geom_point() +
  geom_smooth() +
  labs(y = "NDVI difference", x = "Longitude") +
  theme_bw()

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(data = longbirds, aes(y = NDVI_diff, x = LATITUDE)) +
  geom_point() +
  geom_smooth() +
  labs(y = "NDVI difference", x = "Latitude") +
  theme_bw()

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

These graphs show the difference in normalized difference vegetation index (NDVI) between November and March, plotted against longitude and latitude. According to the metadata sheet for the data set, positive NDVI_diff values indicate greening while negative ones indicate browning. I also looked at the mean NDVI_diff by longitude, but the graph looked the same as the one of raw NDVI_diff by longitude: longitude is recorded to so many decimal places that there is often only one data point per longitude value. I did not graph the mean NDVI_diff by latitude because the same thing would occur.

Looking at the first graph, NDVI difference by longitude, it seems that there could be more greening between November and March at longitudes between -85° and -80° (80°W to 85°W), because that is where we see positive NDVI_diff values. However, the relationship between longitude and NDVI difference is not linear, which is not surprising. I cannot say with any certainty that there is a significant relationship, but it could be interesting to look at if we are doing something other than simple linear regression. I wonder if we could test something like: is the mean NDVI difference between -85° and -80° different from the mean in other locations? Or we could test whether longitude can predict NDVI difference at all.

In the second graph, NDVI difference by latitude, it seems that more positive values of NDVI_diff are found at more northern latitudes. Whether that is true could be interesting to test statistically; however, I’m not sure what test to run.
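As a rough sketch of how the two questions above could be tested, the longitude-band comparison can be framed as a Welch two-sample t-test and the latitude question as a simple linear regression. The data frame `sim` below is simulated purely for illustration; only the column names LONGITUDE, LATITUDE and NDVI_diff come from the data set, and the -85 to -80 band is the one read off the first graph. Whether these tests suit the real data (for example, whether sites are independent) is an open assumption.

```r
# Simulated stand-in for longbirds (hypothetical values, real column names).
set.seed(1)
sim <- data.frame(
  LONGITUDE = runif(200, min = -100, max = -60),
  LATITUDE  = runif(200, min = 5, max = 35)
)
sim$NDVI_diff <- 0.002 * sim$LATITUDE + rnorm(200, mean = -0.05, sd = 0.05)

# Flag the sites that fall inside the longitude band of interest.
sim$in_band <- sim$LONGITUDE >= -85 & sim$LONGITUDE <= -80

# Welch two-sample t-test: is the mean NDVI_diff inside the band
# different from the mean outside it?
band_test <- t.test(NDVI_diff ~ in_band, data = sim)
print(band_test)

# Simple linear regression: does latitude predict NDVI_diff?
lat_fit <- lm(NDVI_diff ~ LATITUDE, data = sim)
print(summary(lat_fit)$coefficients)
```

With the real data, `sim` would be replaced by `longbirds`. The slope row of the coefficient table gives the estimated change in NDVI_diff per degree of latitude and its p-value.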

Mean NDVI by Month

bird_mm <- longbirds %>%
  group_by(month) %>%
  summarise(mean_ndvi = mean(NDVI_diff), sd = sd(NDVI_diff), n = n(), se = sd/sqrt(n))

ggplot(data = bird_mm, aes(y = mean_ndvi, x = month)) +
  geom_point() +
  geom_errorbar(data = bird_mm, aes(x = month, ymin = mean_ndvi-se, ymax = mean_ndvi+se), width = 0.2) +
  labs(y = "Mean NDVI difference", x = "Month") +
  theme_bw()

This is a graph of the mean NDVI difference (November to March) by month. All of the means are negative, which I think indicates that the average NDVI difference in each month leans towards browning. This makes sense given that much of this data was collected during the Northern Hemisphere’s winter (November-March). Visually, the mean NDVI difference in March looks different from the other months, so we could test the null hypothesis that all the means are the same (ANOVA?) and, should we reject the null, follow up by determining which mean is different.
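A minimal sketch of that plan, assuming a one-way ANOVA followed by Tukey's HSD as the follow-up test, is shown here on simulated data. The month labels other than Nov and Mar, the sample sizes and the effect size are all made up for illustration; only the column names month and NDVI_diff match the data set.

```r
# Simulated stand-in for longbirds with five hypothetical month labels.
set.seed(2)
months <- c("Nov", "Dec", "Jan", "Feb", "Mar")
sim <- data.frame(
  month = factor(rep(months, each = 30), levels = months),
  NDVI_diff = rnorm(150, mean = -0.05, sd = 0.03)
)
# Shift March downward so that one group mean really differs.
is_mar <- sim$month == "Mar"
sim$NDVI_diff[is_mar] <- sim$NDVI_diff[is_mar] - 0.03

# One-way ANOVA: null hypothesis that all month means are equal.
fit <- aov(NDVI_diff ~ month, data = sim)
print(summary(fit))

# Tukey's HSD: pairwise comparisons showing which months differ.
tukey <- TukeyHSD(fit)
print(tukey)
```

Each row of the Tukey table is one pair of months, with the difference in means, a confidence interval and an adjusted p-value. The follow-up is usually only interpreted if the ANOVA itself rejects the null.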

Mean count by NDVI

b_mcountndvi <- longbirds %>%
  group_by(NDVI_diff, month) %>%
  summarise(mean_count = mean(Oven), sd = sd(Oven), n = n(), se = sd/sqrt(n))

`summarise()` has grouped output by 'NDVI_diff'. You can override using the
`.groups` argument.

ggplot(data = b_mcountndvi, aes(y = mean_count, x = NDVI_diff, color = month)) +
  geom_point() +
  labs(y = "Mean count", x = "NDVI difference", color = "Month") +
  theme_bw()

This graph shows average count by NDVI difference. I made it to see whether NDVI_diff could be used to predict the mean count. However, the graph is a lot to process visually because so many of the data points land between 0 and 0.5. I don’t think I can come up with any hypotheses from looking at it, and I don’t think we should use it, as it does not do a very good job of conveying any information.

Mean count by Month

b_mcountm <- longbirds %>%
  group_by(month) %>%
  summarise(mean_count = mean(Oven), sd = sd(Oven), n = n(), se = sd/sqrt(n))

ggplot(data = b_mcountm, aes(y = mean_count, x = month)) +
  geom_point() +
  # geom_jitter(data = longbirds, aes(x = month, y = Oven), alpha = 0.3, size = 0.3) +
  geom_errorbar(data = b_mcountm, aes(x = month, ymin = mean_count-se, ymax = mean_count+se), width = 0.2) +
  labs(y = "Mean count", x = "Month") +
  theme_bw()

This graph shows average bird count by month. I tried to put the raw data behind the error bars with geom_jitter, but it made the graph unreadable, so I commented it out. All of the mean counts are less than 1, which makes sense because of the number of zeros in this data set. We could potentially test the null hypothesis that the means for each month are the same and, if we reject it, follow up by testing which one is different. I don’t think we will reject it, because there looks to be a lot of overlap in this graph, but since these error bars show standard error and not 95% confidence intervals, I’m not sure.
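To see what the error bars would look like as 95% confidence intervals, the standard error can be scaled by a t critical value. The sketch below uses a small made-up summary table with the same columns as `b_mcountm`; the numbers are hypothetical and not taken from the data. It also assumes the usual t-based interval, which may be a poor approximation for counts with many zeros.

```r
# Hypothetical per-month summary with the same columns as b_mcountm.
summ <- data.frame(
  month = c("Nov", "Mar"),
  mean_count = c(0.40, 0.55),
  sd = c(1.2, 1.5),
  n = c(400, 380)
)
summ$se <- summ$sd / sqrt(summ$n)

# Two-sided 95% critical value from the t distribution with n - 1 df.
summ$t_crit <- qt(0.975, df = summ$n - 1)

# 95% confidence interval: mean plus or minus t_crit * se.
summ$ci_low  <- summ$mean_count - summ$t_crit * summ$se
summ$ci_high <- summ$mean_count + summ$t_crit * summ$se
print(summ)
```

In the plot, `ymin = ci_low` and `ymax = ci_high` would then replace `mean_count-se` and `mean_count+se` in `geom_errorbar()`. For samples this large the interval is roughly twice as wide as the standard-error bar.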

Count by Longitude and Latitude

ggplot(data = longbirds, aes(y = Oven, x = LONGITUDE)) +
  geom_point() +
  labs(y = "Count", x = "Longitude") +
  theme_bw()

ggplot(data = longbirds, aes(y = Oven, x = LATITUDE)) +
  geom_point() +
  labs(y = "Count", x= "Latitude") +
  theme_bw()

These are graphs of bird count by longitude and latitude. More birds may have been spotted in the areas between longitudes of -97° and -65° (65°W to 97°W), but many other factors could have contributed to that, some of which might be easier to quantify, such as location/country. It seems that more birds were counted between 10°N and 30°N, but I’m not sure that I would report that with much confidence. Overall, neither of the graphs is particularly readable, and I don’t think they would be my first pick to use.

Through making these graphs, I have learned that longitude and latitude are tricky to have on the x-axis: the change in degrees between points can be very slight, so there are many, many distinct x-values that are very similar.
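One possible workaround, not used in the graphs above, is to group the coordinates into bins before summarising, so that each x-value has more than one point behind it. The sketch below bins simulated latitudes into 5-degree bands with `cut()`. The data, the 5 to 35 degree range and the bin width are arbitrary choices for illustration.

```r
# Simulated latitudes and NDVI differences (hypothetical values).
set.seed(3)
sim <- data.frame(LATITUDE = runif(300, min = 5, max = 35))
sim$NDVI_diff <- rnorm(300, mean = -0.05, sd = 0.05)

# Assign each site to a 5-degree latitude band.
sim$lat_bin <- cut(sim$LATITUDE,
                   breaks = seq(5, 35, by = 5),
                   include.lowest = TRUE)

# Mean NDVI_diff per band: a few x-values instead of hundreds.
bin_means <- aggregate(NDVI_diff ~ lat_bin, data = sim, FUN = mean)
print(bin_means)
```

The same binned column could be passed to `group_by()` in place of the raw coordinate. The choice of bin width affects how the plot reads.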

Count by Month and Latitude

ggplot(data = longbirds, aes(y = Oven, x = month, color = LATITUDE)) +
  geom_point() +
  labs(y = "Count", x = "Month", color = "Latitude") +
  theme_bw()

This is a figure showing bird count by month, with the color of each dot corresponding to its latitude. The gradient is a little tricky to see and thus to read. I don’t see any obvious trends in which latitude, during which month, one might be more likely to see more birds, but it was cool to see all of these variables on one graph, since previously I had only been graphing two variables at a time. It could be interesting to make a model with count as the response variable and month and latitude (and maybe their interaction) as the predictors. But that also might not work since month isn’t numeric; I’m not quite sure how to do the stats with a non-numeric predictor variable.
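On the question of a non-numeric predictor: R's model functions accept factors directly and convert them to indicator (dummy) variables behind the scenes. Below is a minimal sketch with simulated data, assuming an ordinary linear model. The month labels, sample size and Poisson-simulated counts are all made up; only the column names month, LATITUDE and Oven come from the data set.

```r
# Simulated stand-in for longbirds (hypothetical values).
set.seed(4)
months <- c("Nov", "Dec", "Jan", "Feb", "Mar")
sim <- data.frame(
  month = factor(sample(months, 250, replace = TRUE), levels = months),
  LATITUDE = runif(250, min = 5, max = 35)
)
sim$Oven <- rpois(250, lambda = 0.5)

# month * LATITUDE expands to both main effects plus their interaction.
# Because month is a factor, lm() codes it as indicator variables, with
# the first level (Nov) as the reference category.
fit <- lm(Oven ~ month * LATITUDE, data = sim)
print(coef(fit))

# Sequential F-tests for month, latitude and the interaction.
print(anova(fit))
```

Each month coefficient is a difference from the reference month, and each interaction coefficient is a change in the latitude slope for that month. Because the response is a count with many zeros, a Poisson model fitted with `glm(..., family = poisson)` takes the same formula and may be a better fit than `lm()`.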