A binary “Active” squirrel column using the “Running,” “Chasing,” “Climbing,” “Eating,” and “Foraging” columns.
Convert the “Above Ground Sighter Measurement”, “x”, “y” columns to numeric (INT/FLOAT) values only.
The motivation for using this dataset is simple: I chose the first interesting, popular dataset I found on NYC OpenData. This encourages an exploratory approach, which can be useful when learning new skills.
Code-base
Setup
read the data
check data types
# library() attaches one package per call; tidyverse already attaches ggplot2
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Original Data
url <- "https://raw.githubusercontent.com/Siganz/data_607_week_1/refs/heads/main/data/2018_Central_Park_Squirrel_Census_Squirrel_Data_20260126.csv"
# Read Data
df <- read.csv(url, stringsAsFactors = FALSE)
# Optional, view first row (there's a lot of fields so it looks)
df[1, ]
Interestingly, the columns use character values instead of int/bool. I would like to change that, while also creating vectors for the columns with different names.
# cols vector for field selection
# removed Lat.Long because you can get point file from x/y alone.
cols <- c(
  "Unique.Squirrel.ID",
  "Primary.Fur.Color",
  "Highlight.Fur.Color",
  "Combination.of.Primary.and.Highlight.Color",
  "Running",
  "Chasing",
  "Climbing",
  "Eating",
  "Foraging",
  "Above.Ground.Sighter.Measurement",
  "X",
  "Y"
)
# Copy
df2 <- df[, cols]
# Check matrix
str(df2)
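The rename step that produces names such as primary_color, and the bool_cols and color_cols vectors used in the following chunks, is not shown in this section. Here is a minimal sketch of how that step might look, run on a toy data frame. The lower-case behavior names and the rename_map vector are assumptions for illustration, not the original code.

```r
# Toy stand-in for df2 (values are made up)
toy <- data.frame(
  Primary.Fur.Color = c("Gray", "", "Black"),
  Highlight.Fur.Color = c("", "White", ""),
  Combination.of.Primary.and.Highlight.Color = c("Gray+", "+White", "Black+"),
  Running = c("true", "false", "true"),
  Eating = c("false", "false", "true"),
  stringsAsFactors = FALSE
)

# Hypothetical old -> new name mapping (the new names are assumptions)
rename_map <- c(
  Primary.Fur.Color = "primary_color",
  Highlight.Fur.Color = "highlight_color",
  Combination.of.Primary.and.Highlight.Color = "combination_color",
  Running = "running",
  Eating = "eating"
)

# Rename one column at a time with a for loop
for (old in names(rename_map)) {
  names(toy)[names(toy) == old] <- rename_map[[old]]
}

# Vectors of column names grouped by intended type, for later loops
color_cols <- c("primary_color", "highlight_color", "combination_color")
bool_cols <- c("running", "eating")

names(toy)
```

Keeping the names in vectors means later steps can loop over bool_cols or color_cols instead of repeating each column by hand.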
I am using a for loop; this type of iteration is similar in Python, which makes it simpler for me to remember. Now I would like to convert bool_cols into logical data types, using another loop.
for (col in bool_cols) {
  df2[[col]] <- as.logical(df2[[col]])
}
str(df2)
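To see what as.logical() does with character input, here is a small sketch on made-up values (the flags vector is illustrative, not taken from the dataset). It shows that only a fixed set of spellings is recognised and everything else becomes NA, which is worth checking after a conversion like the one above.

```r
# Toy character vector holding boolean-like strings
flags <- c("true", "false", "TRUE", "F", "yes", "")

converted <- as.logical(flags)
converted

# "yes" and "" are not recognised spellings, so they come back as NA;
# counting NAs is a quick check for values that failed to convert
n_failed <- sum(is.na(converted))
n_failed
```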
I don’t like the highlight_color and combination_color columns. I am going to drop them, while also checking the unique values in the remaining color column, primary_color.
df2$highlight_color <- NULL
df2$combination_color <- NULL
# should return three lines, last two should just be NULL
for (col in color_cols) {
  print(unique(df2[[col]]))
}
[1] "" "Gray" "Cinnamon" "Black"
NULL
NULL
Now, I want to stop the empty string ("") from showing up as a unique value.
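The code for this step is not shown here. One common approach, sketched below on a toy vector, is to replace empty strings with NA so they can be filtered out later; this is an assumption about the approach, not the original chunk.

```r
# Toy version of the primary_color column (values mirror the output above)
primary_color <- c("", "Gray", "Cinnamon", "Black", "Gray")

# Replace empty strings with NA using logical indexing
primary_color[primary_color == ""] <- NA

# unique() still lists NA, so drop it explicitly when only real colors matter
colors_seen <- unique(primary_color[!is.na(primary_color)])
colors_seen
```

On a data frame the same idea would be applied to the column directly, for example with df2$primary_color on the left-hand side of the assignment.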
The values look like integers, except for FALSE and "". So, we will look for any values %in% those and change them to "0" before converting to numeric. If this were a pipeline, I would build a function to normalize the column and then find any non-numeric values.
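A minimal sketch of the normalize-then-check function described above. The name normalize_numeric, its zero_values argument, and the toy input are all hypothetical; the real column would be passed in instead of raw.

```r
# Hypothetical helper: map known placeholder strings to "0", convert to
# numeric, and report any values that still could not be converted.
normalize_numeric <- function(x, zero_values = c("FALSE", "")) {
  x <- trimws(as.character(x))
  x[x %in% zero_values] <- "0"
  # as.numeric() warns when it coerces a non-number to NA; silence that here
  # because the leftovers are collected explicitly below
  num <- suppressWarnings(as.numeric(x))
  # Values that are NA after conversion but were not NA on input
  leftovers <- unique(x[is.na(num) & !is.na(x)])
  list(values = num, leftovers = leftovers)
}

# Toy input mixing numbers, placeholders, and one unexpected string
raw <- c("10", "FALSE", "", "25", "high", NA)
result <- normalize_numeric(raw)
result$values
result$leftovers
```

Returning the leftovers alongside the converted values makes it easy to decide whether further cleanup rules are needed before trusting the numeric column.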
# Calculate activity rates for all behaviors
activity_rates <- do.call(rbind, lapply(bool_cols, function(col) {
  df2 %>%
    filter(!is.na(primary_color)) %>% # Remove NA colors
    group_by(primary_color) %>%
    summarise(
      behavior = col,
      n_true = sum(get(col) == TRUE, na.rm = TRUE),
      total = n(),
      activity_rate = n_true / total,
      .groups = 'drop'
    )
}))
# Create the bar chart
ggplot(activity_rates, aes(x = primary_color, y = activity_rate, fill = primary_color)) +
  geom_col() +
  facet_wrap(~ behavior, ncol = 3) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Behavior Rates by Primary Fur Color",
    x = "Primary Fur Color",
    y = "Percent Observed"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
Coming from Python, I like f-strings, so here are some I made in R:
[1] "Squirrel climbing stats: 658 / 3023"
[1] "Squirrel eating stats 760 / 3023"
[1] "Squirrel foraging stats 1435 / 3023"
[1] "Squirrel running stats 730 / 3023"
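The code behind these lines is not shown. A minimal sketch of one way to build such a string in base R uses sprintf(), here on a toy logical vector rather than the real column:

```r
# Toy logical column standing in for a behavior column such as climbing
climbing <- c(TRUE, FALSE, NA, TRUE, FALSE)

# sprintf() fills the %d placeholders in order, similar to str.format() in Python
msg <- sprintf(
  "Squirrel climbing stats: %d / %d",
  sum(climbing, na.rm = TRUE), # count of TRUE values, ignoring NA
  length(climbing)             # total number of observations
)
print(msg)
```

For syntax closer to Python f-strings, the glue package interpolates expressions written inside braces; paste() is the other common base R option.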
Average Activity Rate by Primary Fur Color
This one is pretty funny: I actually spent a good 30 minutes trying to figure out if my data was wrong, like if FALSE was somehow being aggregated, but it just turns out all the squirrels were very active!
rates <- aggregate(
  activity ~ primary_color,
  data = df2,
  FUN = mean,
  na.action = na.omit
)
ggplot(rates, aes(x = primary_color, y = activity, fill = primary_color)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Average Activity Rate by Primary Fur Color",
    x = "Primary Fur Color",
    y = "Percent of Observations with Activity"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
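The activity column used here is not built anywhere in this section. Below is a sketch of one way it might be derived from the behavior columns, using toy data; the rowSums() approach and the column values are assumptions. It also shows why mean() works as a rate: the mean of a logical vector is the share of TRUE values.

```r
# Toy data frame with two behavior flags (values are made up)
toy <- data.frame(
  primary_color = c("Gray", "Gray", "Black", "Black"),
  running = c(TRUE, FALSE, FALSE, NA),
  eating = c(FALSE, FALSE, TRUE, FALSE)
)
bool_cols <- c("running", "eating")

# TRUE when at least one behavior flag is TRUE; NA flags are treated as not observed
toy$activity <- rowSums(toy[bool_cols], na.rm = TRUE) > 0

# mean() of a logical vector gives the proportion of TRUE values per group
rates <- aggregate(activity ~ primary_color, data = toy, FUN = mean)
rates
```

With five behavior columns, a row counts as active if any one of them is TRUE, so a high overall rate is plausible even when each individual behavior is much less common.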
Code Base Conclusion
I went a little overboard, but I found it really helpful for learning more about R. I enjoy using the qmd format since the “{r}” is easy to type and I can always check the output via HTML or using the console in RStudio. There is obviously a lot that I need to work on (ggplot2, aggregate, tidyverse); however, this was a great introduction. I actually enjoy that base R is very intuitive, and I feel like I can pick this up rather quickly.