The Data

This week’s Tidy Tuesday data is about the Oregon Spotted Frog. The data comes from the USGS and focuses on Oregon Spotted Frog demographics at the Crane Prairie Reservoir in Oregon.

My Focus

This week was a bit of a switch-up from previous weeks. I had done some shapefile/map graphing via ggplot2 in the last few Tidy Tuesdays, so I paired some simpler map plots with a logistic regression to predict whether there was a visual of a frog (i.e. Detection is “Visual” or “Captured”).

Given that I know next to nothing about Oregon Spotted Frogs, and that the categorical variables in the data frame have many levels, I opted to use the Lasso to select the predictor variables for my logistic regression. This was my first time doing logistic regression in R, so I still have a lot to learn, but it was great to try it out.

Reading in the Data Set

frogs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-08-02/frogs.csv')

Data Wrangling

I began by converting the UTM coordinates to longitude and latitude so I could plot over OpenStreetMap data pulled with osmdata. The structure of this code is largely adapted from Jonathon Kitt. The link to the code I implemented is here.

library(tidyverse)
library(sf)
library(osmdata)

# convert the UTM coordinates (zone 10) to longitude/latitude
utm_coords <- frogs %>%
  select(UTME_83, UTMN_83)

longlat_coords <- st_as_sf(x = utm_coords,
                           coords = c("UTME_83", "UTMN_83"),
                           crs = "+proj=utm +zone=10") %>%
  st_transform(crs = "+proj=longlat +datum=WGS84") %>%
  as_tibble()

# extract the longitude and latitude from each point geometry
list_long <- list()
list_lat <- list()

for (i in 1:nrow(longlat_coords)){
  list_long[[i]] <- longlat_coords$geometry[[i]][1]
  list_lat[[i]] <- longlat_coords$geometry[[i]][2]
}

frogs <- frogs %>%
  mutate(longitude = unlist(list_long),
         latitude = unlist(list_lat)) %>%
  select(-UTME_83, -UTMN_83)

# pull the water polygons around Crane Prairie Reservoir from OpenStreetMap
crane_prairie_water <- opq(bbox = c(-121.7920729, 43.7938767, -121.76501, 43.81433)) %>%
  add_osm_feature(key = 'natural', value = 'water') %>%
  osmdata_sf()

# recode the Female indicator into labels for plotting
frogs <- frogs %>%
  mutate(Female = case_when(
    Female == 1 ~ "Female",
    TRUE ~ "Male"))
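As an aside, the coordinate-extraction loop can be collapsed into a single call. This is a minimal sketch of the same step using sf’s st_coordinates(), which returns a matrix with X (longitude) and Y (latitude) columns – frogs_alt is a hypothetical name so as not to clobber the data frame built above.

# alternative to the loop: pull both coordinates at once with st_coordinates()
longlat_sf <- st_as_sf(x = utm_coords,
                       coords = c("UTME_83", "UTMN_83"),
                       crs = "+proj=utm +zone=10") %>%
  st_transform(crs = "+proj=longlat +datum=WGS84")

xy <- st_coordinates(longlat_sf) # matrix with columns "X" and "Y"

frogs_alt <- frogs %>%
  mutate(longitude = xy[, "X"],
         latitude = xy[, "Y"])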

First Round of Plots

p <- ggplot() +
  geom_sf(data = crane_prairie_water$osm_polygons, inherit.aes = FALSE, fill = "lightcyan2") +
  cowplot::theme_map(11) +
  geom_point(data = frogs, aes(x = longitude, y = latitude, color = Subsite), alpha = .75, size = 1) +
  theme(panel.background = element_rect(fill = 'seashell2'),
        legend.position = "left",
        title = element_text(size = 14),
        legend.title = element_text(size = 12, face = "bold")) +
  guides(color = guide_legend(title.position = "top", title = "Subsite")) +
  labs(subtitle = "Visualizing Oregon Spotted Frog Demographics based on Location")

q <- ggplot() +
  geom_sf(data = crane_prairie_water$osm_polygons, inherit.aes = FALSE, fill = "lightcyan2") +
  cowplot::theme_map(11) +
  geom_point(data = frogs, aes(x = longitude, y = latitude, color = as_factor(Female)), alpha = .75, size = 1) +
  theme(panel.background = element_rect(fill = 'seashell2'),
        legend.position = "right",
        legend.title = element_text(size = 12, face = "bold")) +
  guides(color = guide_legend(title.position = "top", title = "Sex"))

library(patchwork)
p + ggtitle("Oregon Spotted Frog Demographics") + q + labs(caption = "Tidy Tuesday 07-12-2022 | GitHub: @scolando")

It is important to note that not every dot on the map represents a unique frog: individual frogs are identified by their radio frequency, and many observations in this data frame share the same frequency (i.e. come from the same frog). See the data frame below:

frogs %>%
  group_by(Frequency,Subsite, Female) %>%
  count() %>%
  arrange(desc(n))
## # A tibble: 34 × 4
## # Groups:   Frequency, Subsite, Female [34]
##    Frequency Subsite        Female     n
##        <dbl> <chr>          <chr>  <int>
##  1      164. W Res          Female    12
##  2      164. Cow Camp River Male      12
##  3      165. W Res          Female    12
##  4      165. Cow Camp River Male      12
##  5      165. W Res          Female    12
##  6      165. Cow Camp River Female    12
##  7      164. N Res          Female    11
##  8      164. N Res          Female    11
##  9      165. SE Pond        Female    11
## 10      165. N Res          Female    11
## # … with 24 more rows
## # ℹ Use `print(n = ...)` to see more rows
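As a quick sanity check (assuming each radio frequency tags a single frog), we can count the unique frequencies to see how many distinct frogs are in the data:

frogs %>%
  summarize(unique_frogs = n_distinct(Frequency))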

Logistic Regression

I next moved on to logistic regression! I used the “Detection” variable to create a binary outcome – seen or not seen.

frogs_new <- frogs %>%
  mutate(Detection = case_when(
    Detection == "Captured" ~ "Seen",
    Detection == "Visual" ~ "Seen",
    TRUE ~ "Not Seen")) %>%
  mutate(Detection = as_factor(Detection))

I then created the training and test split, and removed predictors that were not meaningful for the model from the data set.

library(rsample)

set.seed(47)

frogs_new2 <- frogs_new %>%
  select(-Site, -SurveyDate, -Frequency, -Interval)

frogs_split <- initial_split(frogs_new2)
frogs_train <- training(frogs_split)
frogs_test <- testing(frogs_split)
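A variant worth knowing about (sketched below, not what I ran): stratifying the split on the outcome keeps the proportion of “Seen” vs. “Not Seen” similar in the training and test sets, which can matter with imbalanced classes.

# stratified alternative: preserves the Detection class balance in both sets
frogs_split_strat <- initial_split(frogs_new2, strata = Detection)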

As I mentioned earlier, I wanted to use the Lasso to select the variables. I used a workflow to implement the Lasso – so my first step was to create a recipe.

library(recipes)

frogs_rec <- recipe(Detection ~ ., data = frogs_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())

library(parsnip)

lasso_spec_tune <- logistic_reg(mixture = 1, penalty = tune()) %>% # mixture = 1 denotes Lasso
  set_mode("classification") %>% # need classification mode to do logistic regression
  set_engine("glmnet")
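To sanity-check what the recipe does to the predictors, you can prep it on the training data and bake it – a minimal sketch, not part of the original workflow:

# peek at the processed design matrix: dummy-coded, zero-variance columns
# dropped, and numeric predictors normalized
frogs_rec %>%
  prep(training = frogs_train) %>%
  bake(new_data = NULL) %>%
  glimpse()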

To select the lambda (or penalty) value for Lasso, I did cross-validation on the training set.

library(dials)
library(workflows)
library(tune)

set.seed(1234)
frogs_fold <- vfold_cv(frogs_train)

lasso_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 50)

lasso_wf <- workflow() %>%
  add_recipe(frogs_rec)

lasso_fit <- lasso_wf %>%
  add_model(lasso_spec_tune) %>%
  fit(data = frogs_train)

# this is the line that tunes the model using cross validation
set.seed(2020)
lasso_cv <- tune_grid(
  lasso_wf %>% add_model(lasso_spec_tune),
  resamples = frogs_fold,
  grid = lasso_grid
)

Determining the Optimal Penalty

I then used the two prediction metrics that the tuning outputs – accuracy and roc_auc – to choose the best lambda value.

collect_metrics(lasso_cv) %>%
  filter(.metric == "accuracy") %>%
  arrange(desc(mean))
## # A tibble: 50 × 7
##      penalty .metric  .estimator  mean     n std_err .config              
##        <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
##  1 0.000268  accuracy binary     0.867    10  0.0240 Preprocessor1_Model08
##  2 0.000687  accuracy binary     0.867    10  0.0240 Preprocessor1_Model10
##  3 0.00176   accuracy binary     0.867    10  0.0240 Preprocessor1_Model12
##  4 0.0184    accuracy binary     0.867    10  0.0226 Preprocessor1_Model17
##  5 0.00001   accuracy binary     0.863    10  0.0260 Preprocessor1_Model01
##  6 0.0000160 accuracy binary     0.863    10  0.0260 Preprocessor1_Model02
##  7 0.0000256 accuracy binary     0.863    10  0.0260 Preprocessor1_Model03
##  8 0.0000409 accuracy binary     0.863    10  0.0260 Preprocessor1_Model04
##  9 0.0000655 accuracy binary     0.863    10  0.0260 Preprocessor1_Model05
## 10 0.000105  accuracy binary     0.863    10  0.0260 Preprocessor1_Model06
## # … with 40 more rows
## # ℹ Use `print(n = ...)` to see more rows
collect_metrics(lasso_cv) %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
## # A tibble: 50 × 7
##    penalty .metric .estimator  mean     n std_err .config              
##      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1 0.0754  roc_auc binary     0.835    10  0.0433 Preprocessor1_Model20
##  2 0.0295  roc_auc binary     0.835    10  0.0418 Preprocessor1_Model18
##  3 0.0471  roc_auc binary     0.833    10  0.0436 Preprocessor1_Model19
##  4 0.0184  roc_auc binary     0.829    10  0.0424 Preprocessor1_Model17
##  5 0.00720 roc_auc binary     0.825    10  0.0385 Preprocessor1_Model15
##  6 0.121   roc_auc binary     0.824    10  0.0405 Preprocessor1_Model21
##  7 0.0115  roc_auc binary     0.823    10  0.0388 Preprocessor1_Model16
##  8 0.00450 roc_auc binary     0.822    10  0.0369 Preprocessor1_Model14
##  9 0.00176 roc_auc binary     0.821    10  0.0361 Preprocessor1_Model12
## 10 0.00281 roc_auc binary     0.821    10  0.0362 Preprocessor1_Model13
## # … with 40 more rows
## # ℹ Use `print(n = ...)` to see more rows
lasso_cv %>%
  collect_metrics() %>%
  ggplot(aes(x = penalty, y = mean, color = .metric)) + 
  geom_errorbar(aes(
    ymin = mean - std_err,
    ymax = mean + std_err),
    alpha = 0.5) + 
  geom_line(size = 1.5) + 
  scale_x_log10()+
  theme_minimal()

In an effort to optimize both the accuracy and the roc_auc, I chose \(\lambda = 0.01842069969\). This lambda corresponded to a high accuracy and a relatively high roc_auc.

lasso_cv %>%
  collect_metrics() %>%
  ggplot(aes(x = penalty, y = mean, color = .metric)) + 
  geom_errorbar(aes(
    ymin = mean - std_err,
    ymax = mean + std_err),
    alpha = 0.5) + 
  geom_line(size = 1.5) + 
  scale_x_log10()+
  ylim(c(.75,1))+
  geom_vline(xintercept = 0.01842069969)+
  theme_minimal()

best_lasso <- tibble(penalty = 0.01842069969,
                         .config = "Preprocessor1_Model17")
best_lasso
## # A tibble: 1 × 2
##   penalty .config              
##     <dbl> <chr>                
## 1  0.0184 Preprocessor1_Model17
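Here I built best_lasso by hand so I could balance the two metrics. For completeness, tune also provides select_best(), which pulls out the top row for a single metric directly – note it would pick a different penalty here, since it optimizes accuracy alone:

# shortcut when one metric is enough: let tune pick the best penalty
best_lasso_auto <- select_best(lasso_cv, metric = "accuracy")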

The workflow was then finalized with the best Lasso penalty to generate the predictor variable coefficients – see how most of them are shrunk to 0, leaving us with only 14 nonzero terms!

Variables/Coefficients in the Model

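# note: the fit below uses frogs_test so the printed coefficients match what I
# originally ran; fitting on frogs_train (or using tune::last_fit()) would be
# the more standard choice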
finalize_workflow(lasso_wf %>% add_model(lasso_spec_tune), best_lasso) %>%
  fit(data = frogs_test) %>% tidy() %>%
  filter(estimate != 0)
## # A tibble: 14 × 3
##    term                        estimate penalty
##    <chr>                          <dbl>   <dbl>
##  1 (Intercept)                   1.55    0.0184
##  2 Ordinal                       1.06    0.0184
##  3 Subsite_N.Res                 0.0568  0.0184
##  4 Subsite_NE.Res               -0.389   0.0184
##  5 HabType_Reservoir             0.0416  0.0184
##  6 Female_Male                   0.410   0.0184
##  7 Water_No.water               -0.0894  0.0184
##  8 Water_Shallow.water          -0.307   0.0184
##  9 Water_Unknown.water           0.709   0.0184
## 10 Type_Non.aquatic             -0.0340  0.0184
## 11 Structure_Open               -0.0830  0.0184
## 12 Substrate_Organic.soil       -1.56    0.0184
## 13 Substrate_Unknown.substrate  -0.929   0.0184
## 14 Beaver_No.beaver             -0.248   0.0184
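A quick way to see the shrinkage visually (a sketch I did not run for the post, fit on the training data this time) is to plot the surviving coefficients:

# bar chart of the nonzero Lasso coefficients, sorted by size
finalize_workflow(lasso_wf %>% add_model(lasso_spec_tune), best_lasso) %>%
  fit(data = frogs_train) %>%
  tidy() %>%
  filter(estimate != 0, term != "(Intercept)") %>%
  ggplot(aes(x = estimate, y = reorder(term, estimate))) +
  geom_col() +
  labs(x = "Coefficient estimate", y = NULL)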

Visualizing the Logistic Regression

The next step was to predict on the test data. Honestly, I felt kind of rocky about this part and am not sure I did it right, but I learned a lot from going through the process and trying to visualize the logistic regression fit.

lasso_winner_spec <- logistic_reg(mixture = 1, penalty = 0.0184207) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

lasso_winner_wf <- workflow() %>%
  add_recipe(frogs_rec)

lasso_winner <- lasso_winner_wf %>%
  add_model(lasso_winner_spec) %>%
  fit(data = frogs_train)

lasso_new <- lasso_winner %>%
  tidy()

lasso_aug <- lasso_winner %>%
  augment(new_data = frogs_test) # this is what generates the predictions
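Since augment() also attaches a .pred_class column alongside the class probabilities, one way to summarize test performance is a confusion matrix and accuracy via yardstick – a sketch of a step I did not include in the original post:

library(yardstick)

# confusion matrix of true vs. predicted Detection on the test set
lasso_aug %>%
  conf_mat(truth = Detection, estimate = .pred_class)

# overall test-set accuracy
lasso_aug %>%
  accuracy(truth = Detection, estimate = .pred_class)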

The dotted red line is the threshold I set (0.5) for the class: anything above 0.5 corresponds to “Seen” Detection (i.e. captured or visual) and anything below 0.5 corresponds to “Not Seen” (i.e. no visual). The red points are “Seen” frogs that our logistic regression model misclassified as “Not Seen”.

library(gghighlight)

lasso_aug %>%
  mutate(Detection = case_when(
    Detection == "Seen" ~ 1,
    TRUE ~ 0
  )) %>%
  ggplot(aes(y=Detection, x = .pred_Seen))+
  geom_point(aes(color=ifelse(.pred_Seen < 0.5 & Detection == 1, 'red', 'black')))+
   stat_smooth(method="glm", color="skyblue", se=FALSE,
                method.args = list(family=binomial))+
  geom_vline(xintercept = 0.5, linetype = "dotted", color = "red")+
  scale_color_identity()+
  theme_minimal()+
  xlab("Predicted Detection Type")+
  ylab("True Detection Type")

Faceting by Sex of the Detected Frog

We can then see that the overall logistic regression fit does not tell the full story. In fact, our model looks worse when we facet by the sex of the detected frog. In general, it doesn’t seem like this logistic regression model has high predictive power – especially for seen female frogs.

lasso_aug %>%
  mutate(Detection = case_when(
    Detection == "Seen" ~ 1,
    TRUE ~ 0
  )) %>%
  ggplot(aes(y=Detection, x = .pred_Seen))+
  geom_point(aes(color = ifelse(.pred_Seen < 0.5 & Detection == 1, 'red', 'black')))+
   stat_smooth(method="glm", color="skyblue", se=FALSE,
                method.args = list(family=binomial))+
  geom_vline(xintercept = 0.5, linetype = "dotted", color = "red")+
    scale_color_identity()+
  theme_minimal()+
  facet_wrap(~Female)

One hypothesis for why this model is so bad with seen female frogs is that there were not many of them in the training set. I would need to look into this more, but as a stand-in, I looked at the counts of seen female frogs in the test set predictions. There is definitely a skew toward not-seen female frogs, but the count of seen female frogs seems comparable to the other groups below.

lasso_aug %>%
  group_by(Detection, Female) %>%
  count()
## # A tibble: 4 × 3
## # Groups:   Detection, Female [4]
##   Detection Female     n
##   <fct>     <chr>  <int>
## 1 Seen      Female    14
## 2 Seen      Male       5
## 3 Not Seen  Female    45
## 4 Not Seen  Male      14
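To test the training-set hypothesis directly, the analogous count on frogs_train (the data the model actually learned from) would be the thing to check – a quick sketch:

# how many seen/not-seen frogs of each sex did the model train on?
frogs_train %>%
  count(Detection, Female)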

Finally, it would not be a Tidy Tuesday without some praise. Slowly but surely, the praise package is becoming my favorite (very user friendly lol).

library(praise)
praise()
## [1] "You are amazing!"