Assignment 1 – Loading Data into a Data Frame

Author

Shawn Ganz

Introduction

For this assignment I chose a CSV from NYC OpenData, “2018 Central Park Squirrel Census - Squirrel Data”, provided by The Squirrel Census.

Approach

Since the assignment is to transform this data frame, I want to drop the columns I don’t need. Below is a list of all the columns:

df <- read.csv("https://raw.githubusercontent.com/Siganz/data_607_week_1/refs/heads/main/data/2018_Central_Park_Squirrel_Census_Squirrel_Data_20260126.csv")
colnames(df)
 [1] "X"                                         
 [2] "Y"                                         
 [3] "Unique.Squirrel.ID"                        
 [4] "Hectare"                                   
 [5] "Shift"                                     
 [6] "Date"                                      
 [7] "Hectare.Squirrel.Number"                   
 [8] "Age"                                       
 [9] "Primary.Fur.Color"                         
[10] "Highlight.Fur.Color"                       
[11] "Combination.of.Primary.and.Highlight.Color"
[12] "Color.notes"                               
[13] "Location"                                  
[14] "Above.Ground.Sighter.Measurement"          
[15] "Specific.Location"                         
[16] "Running"                                   
[17] "Chasing"                                   
[18] "Climbing"                                  
[19] "Eating"                                    
[20] "Foraging"                                  
[21] "Other.Activities"                          
[22] "Kuks"                                      
[23] "Quaas"                                     
[24] "Moans"                                     
[25] "Tail.flags"                                
[26] "Tail.twitches"                             
[27] "Approaches"                                
[28] "Indifferent"                               
[29] "Runs.from"                                 
[30] "Other.Interactions"                        
[31] "Lat.Long"                                  

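My initial plan was to create a data frame with only these columns (in the code further down I ended up dropping Lat.Long and keeping Above.Ground.Sighter.Measurement instead):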

colnames(df)[c(3,9:11,16:20,1,2,31)]
 [1] "Unique.Squirrel.ID"                        
 [2] "Primary.Fur.Color"                         
 [3] "Highlight.Fur.Color"                       
 [4] "Combination.of.Primary.and.Highlight.Color"
 [5] "Running"                                   
 [6] "Chasing"                                   
 [7] "Climbing"                                  
 [8] "Eating"                                    
 [9] "Foraging"                                  
[10] "X"                                         
[11] "Y"                                         
[12] "Lat.Long"                                  

Afterwards I want to create the following:

  • A binary “Active” squirrel column using the “Running,” “Chasing,” “Climbing,” “Eating,” and “Foraging” columns.

  • Convert the “Above Ground Sighter Measurement”, “x”, “y” columns to numeric (INT/FLOAT) values only.

The motivation for using this dataset is simple: I chose the first interesting, popular dataset I found on NYC OpenData. This encourages an exploratory approach, which is useful when learning new skills.

Code-base

The sections below contain the code, interleaved with text that outlines my workflow and thought process as I was coding.

Body

library(tidyverse)  # ggplot2 is attached as part of the core tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Original Data
url <- "https://raw.githubusercontent.com/Siganz/data_607_week_1/refs/heads/main/data/2018_Central_Park_Squirrel_Census_Squirrel_Data_20260126.csv"

# Read Data
df <- read.csv(url, stringsAsFactors = FALSE)

# Optional: view the first row (there are a lot of fields, so the output wraps)
df[1,]
          X        Y Unique.Squirrel.ID Hectare Shift     Date
1 -73.95613 40.79408     37F-PM-1014-03     37F    PM 10142018
  Hectare.Squirrel.Number Age Primary.Fur.Color Highlight.Fur.Color
1                       3                                          
  Combination.of.Primary.and.Highlight.Color Color.notes Location
1                                          +                     
  Above.Ground.Sighter.Measurement Specific.Location Running Chasing Climbing
1                                                      false   false    false
  Eating Foraging Other.Activities  Kuks Quaas Moans Tail.flags Tail.twitches
1  false    false                  false false false      false         false
  Approaches Indifferent Runs.from Other.Interactions
1      false       false     false                   
                                    Lat.Long
1 POINT (-73.9561344937861 40.7940823884086)
# Optional, view the field types
str(df)
'data.frame':   3023 obs. of  31 variables:
 $ X                                         : num  -74 -74 -74 -74 -74 ...
 $ Y                                         : num  40.8 40.8 40.8 40.8 40.8 ...
 $ Unique.Squirrel.ID                        : chr  "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
 $ Hectare                                   : chr  "37F" "21B" "11B" "32E" ...
 $ Shift                                     : chr  "PM" "AM" "PM" "PM" ...
 $ Date                                      : int  10142018 10192018 10142018 10172018 10172018 10102018 10102018 10082018 10062018 10102018 ...
 $ Hectare.Squirrel.Number                   : int  3 4 8 14 5 3 2 2 1 3 ...
 $ Age                                       : chr  "" "" "" "Adult" ...
 $ Primary.Fur.Color                         : chr  "" "" "Gray" "Gray" ...
 $ Highlight.Fur.Color                       : chr  "" "" "" "" ...
 $ Combination.of.Primary.and.Highlight.Color: chr  "+" "+" "Gray+" "Gray+" ...
 $ Color.notes                               : chr  "" "" "" "Nothing selected as Primary. Gray selected as Highlights. Made executive adjustments." ...
 $ Location                                  : chr  "" "" "Above Ground" "" ...
 $ Above.Ground.Sighter.Measurement          : chr  "" "" "10" "" ...
 $ Specific.Location                         : chr  "" "" "" "" ...
 $ Running                                   : chr  "false" "false" "false" "false" ...
 $ Chasing                                   : chr  "false" "false" "true" "false" ...
 $ Climbing                                  : chr  "false" "false" "false" "false" ...
 $ Eating                                    : chr  "false" "false" "false" "true" ...
 $ Foraging                                  : chr  "false" "false" "false" "true" ...
 $ Other.Activities                          : chr  "" "" "" "" ...
 $ Kuks                                      : chr  "false" "false" "false" "false" ...
 $ Quaas                                     : chr  "false" "false" "false" "false" ...
 $ Moans                                     : chr  "false" "false" "false" "false" ...
 $ Tail.flags                                : chr  "false" "false" "false" "false" ...
 $ Tail.twitches                             : chr  "false" "false" "false" "false" ...
 $ Approaches                                : chr  "false" "false" "false" "false" ...
 $ Indifferent                               : chr  "false" "false" "false" "false" ...
 $ Runs.from                                 : chr  "false" "false" "false" "true" ...
 $ Other.Interactions                        : chr  "" "" "" "" ...
 $ Lat.Long                                  : chr  "POINT (-73.9561344937861 40.7940823884086)" "POINT (-73.9688574691102 40.7837825208444)" "POINT (-73.97428114848522 40.775533619083)" "POINT (-73.9596413903948 40.7903128889029)" ...

Interestingly, the true/false columns were read in as character rather than logical. I would like to change that, and also create vectors of column names so I can select the fields I want and give them simpler names.

# cols vector for field selection
# removed Lat.Long because you can get point file from x/y alone.

cols <- c(
  "Unique.Squirrel.ID",
  "Primary.Fur.Color",
  "Highlight.Fur.Color",
  "Combination.of.Primary.and.Highlight.Color",
  "Running",
  "Chasing",
  "Climbing",
  "Eating",
  "Foraging",
  "Above.Ground.Sighter.Measurement",
  "X",
  "Y"
)

# Copy
df2 <- df[ , cols]

# Check structure
str(df2)
'data.frame':   3023 obs. of  12 variables:
 $ Unique.Squirrel.ID                        : chr  "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
 $ Primary.Fur.Color                         : chr  "" "" "Gray" "Gray" ...
 $ Highlight.Fur.Color                       : chr  "" "" "" "" ...
 $ Combination.of.Primary.and.Highlight.Color: chr  "+" "+" "Gray+" "Gray+" ...
 $ Running                                   : chr  "false" "false" "false" "false" ...
 $ Chasing                                   : chr  "false" "false" "true" "false" ...
 $ Climbing                                  : chr  "false" "false" "false" "false" ...
 $ Eating                                    : chr  "false" "false" "false" "true" ...
 $ Foraging                                  : chr  "false" "false" "false" "true" ...
 $ Above.Ground.Sighter.Measurement          : chr  "" "" "10" "" ...
 $ X                                         : num  -74 -74 -74 -74 -74 ...
 $ Y                                         : num  40.8 40.8 40.8 40.8 40.8 ...
df2[1:5,]
  Unique.Squirrel.ID Primary.Fur.Color Highlight.Fur.Color
1     37F-PM-1014-03                                      
2     21B-AM-1019-04                                      
3     11B-PM-1014-08              Gray                    
4     32E-PM-1017-14              Gray                    
5     13E-AM-1017-05              Gray            Cinnamon
  Combination.of.Primary.and.Highlight.Color Running Chasing Climbing Eating
1                                          +   false   false    false  false
2                                          +   false   false    false  false
3                                      Gray+   false    true    false  false
4                                      Gray+   false   false    false   true
5                              Gray+Cinnamon   false   false    false  false
  Foraging Above.Ground.Sighter.Measurement         X        Y
1    false                                  -73.95613 40.79408
2    false                                  -73.96886 40.78378
3    false                               10 -73.97428 40.77553
4     true                                  -73.95964 40.79031
5     true                                  -73.97027 40.77621
# Change column names
colnames(df2) <- c(
  "unique_id",
  "primary_color",
  "highlight_color",
  "combination_color",
  "running",
  "chasing",
  "climbing",
  "eating",
  "foraging",
  "above_ground_measurement",
  "x",
  "y"
)

bool_cols <- c(
  "running",
  "chasing",
  "climbing",
  "eating",
  "foraging"
)

color_cols <- c(
  "primary_color",
  "highlight_color",
  "combination_color"
)
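
The select-then-rename steps above can also be driven by a single named vector, so that each new name sits next to the column it replaces. This is only a minimal sketch on a toy data frame; `toy`, `toy2`, and `rename_map` are hypothetical names used for illustration and are not part of the workflow above.

```r
# Toy stand-in for the squirrel data (values are made up for illustration)
toy <- data.frame(
  Unique.Squirrel.ID = c("A-1", "B-2"),
  Primary.Fur.Color  = c("Gray", ""),
  Running            = c("false", "true"),
  Hectare            = c("37F", "21B"),
  stringsAsFactors   = FALSE
)

# new_name = "Old.Name"; columns that are not listed (Hectare) are dropped
rename_map <- c(
  unique_id     = "Unique.Squirrel.ID",
  primary_color = "Primary.Fur.Color",
  running       = "Running"
)

# Select by the old names, then relabel with the new ones
toy2 <- setNames(toy[, unname(rename_map)], names(rename_map))
colnames(toy2)
```

Keeping the mapping in one place makes it harder for the selection vector and the new names to drift out of order.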

I would now like to view the data and check the unique values of certain columns to see what I can do with them.

# check names
str(df2)
'data.frame':   3023 obs. of  12 variables:
 $ unique_id               : chr  "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
 $ primary_color           : chr  "" "" "Gray" "Gray" ...
 $ highlight_color         : chr  "" "" "" "" ...
 $ combination_color       : chr  "+" "+" "Gray+" "Gray+" ...
 $ running                 : chr  "false" "false" "false" "false" ...
 $ chasing                 : chr  "false" "false" "true" "false" ...
 $ climbing                : chr  "false" "false" "false" "false" ...
 $ eating                  : chr  "false" "false" "false" "true" ...
 $ foraging                : chr  "false" "false" "false" "true" ...
 $ above_ground_measurement: chr  "" "" "10" "" ...
 $ x                       : num  -74 -74 -74 -74 -74 ...
 $ y                       : num  40.8 40.8 40.8 40.8 40.8 ...
# check unique
for (col in bool_cols) {
  print(col)
  print(unique(df2[[col]]))
}
[1] "running"
[1] "false" "true" 
[1] "chasing"
[1] "false" "true" 
[1] "climbing"
[1] "false" "true" 
[1] "eating"
[1] "false" "true" 
[1] "foraging"
[1] "false" "true" 

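The columns are already clean: each one contains only "false" and "true". I was worried the lowercase values might be an issue, so I’m going to uppercase them first. (as.logical() also accepts lowercase "true" and "false", so this step is a harmless precaution.)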

# the columns are clean, will use toupper() on the values 
for (col in bool_cols) {
  df2[[col]] <- toupper(df2[[col]])
}

str(df2)
'data.frame':   3023 obs. of  12 variables:
 $ unique_id               : chr  "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
 $ primary_color           : chr  "" "" "Gray" "Gray" ...
 $ highlight_color         : chr  "" "" "" "" ...
 $ combination_color       : chr  "+" "+" "Gray+" "Gray+" ...
 $ running                 : chr  "FALSE" "FALSE" "FALSE" "FALSE" ...
 $ chasing                 : chr  "FALSE" "FALSE" "TRUE" "FALSE" ...
 $ climbing                : chr  "FALSE" "FALSE" "FALSE" "FALSE" ...
 $ eating                  : chr  "FALSE" "FALSE" "FALSE" "TRUE" ...
 $ foraging                : chr  "FALSE" "FALSE" "FALSE" "TRUE" ...
 $ above_ground_measurement: chr  "" "" "10" "" ...
 $ x                       : num  -74 -74 -74 -74 -74 ...
 $ y                       : num  40.8 40.8 40.8 40.8 40.8 ...

I am using a for loop because this type of iteration is similar to Python’s, which makes it easier for me to remember. Now I would like to convert the bool_cols columns to the logical data type, using another loop.

for (col in bool_cols) {
  df2[[col]] <- as.logical(df2[[col]])
}

str(df2)
'data.frame':   3023 obs. of  12 variables:
 $ unique_id               : chr  "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
 $ primary_color           : chr  "" "" "Gray" "Gray" ...
 $ highlight_color         : chr  "" "" "" "" ...
 $ combination_color       : chr  "+" "+" "Gray+" "Gray+" ...
 $ running                 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ chasing                 : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ climbing                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ eating                  : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
 $ foraging                : logi  FALSE FALSE FALSE TRUE TRUE TRUE ...
 $ above_ground_measurement: chr  "" "" "10" "" ...
 $ x                       : num  -74 -74 -74 -74 -74 ...
 $ y                       : num  40.8 40.8 40.8 40.8 40.8 ...
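
As a compact alternative to the two loops above, the conversion can be done in one step, since as.logical() understands "true"/"false" in lower case as well as upper case. This is a minimal sketch on made-up data; `toy` and `flag_cols` are hypothetical names.

```r
# Toy character columns shaped like the behavior flags (made-up values)
toy <- data.frame(
  running = c("false", "true", "false"),
  chasing = c("true", "false", "false"),
  stringsAsFactors = FALSE
)
flag_cols <- c("running", "chasing")

# lapply() applies as.logical() to each selected column; assigning back
# into toy[flag_cols] keeps the result as a data frame
toy[flag_cols] <- lapply(toy[flag_cols], as.logical)
str(toy)
```

Any value that as.logical() does not recognise becomes NA, so checking the unique values first, as done above, is still worthwhile.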

Now I will create a new column called activity.

df2$activity <- as.logical(rowSums(df2[bool_cols]) > 0)
str(df2)
'data.frame':   3023 obs. of  13 variables:
 $ unique_id               : chr  "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
 $ primary_color           : chr  "" "" "Gray" "Gray" ...
 $ highlight_color         : chr  "" "" "" "" ...
 $ combination_color       : chr  "+" "+" "Gray+" "Gray+" ...
 $ running                 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ chasing                 : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ climbing                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ eating                  : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
 $ foraging                : logi  FALSE FALSE FALSE TRUE TRUE TRUE ...
 $ above_ground_measurement: chr  "" "" "10" "" ...
 $ x                       : num  -74 -74 -74 -74 -74 ...
 $ y                       : num  40.8 40.8 40.8 40.8 40.8 ...
 $ activity                : logi  FALSE FALSE TRUE TRUE TRUE TRUE ...
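
The rowSums() approach counts the TRUE values in each row and checks whether the count is above zero. An equivalent way to express "any of these is TRUE" is an element-wise OR across the columns. The sketch below uses made-up data; `toy` and `flag_cols` are hypothetical names.

```r
# Toy logical columns (made-up values)
toy <- data.frame(
  running  = c(FALSE, TRUE,  FALSE),
  chasing  = c(FALSE, FALSE, FALSE),
  foraging = c(FALSE, TRUE,  TRUE)
)
flag_cols <- c("running", "chasing", "foraging")

# Reduce() folds `|` over the columns: running | chasing | foraging
toy$activity <- Reduce(`|`, toy[flag_cols])

# Same answer as the rowSums() version used above
same <- identical(toy$activity, unname(rowSums(toy[flag_cols]) > 0))
toy$activity  # FALSE TRUE TRUE
```

One difference to keep in mind: with missing values, `|` returns TRUE if any flag is TRUE and NA otherwise, while rowSums() returns NA for the whole row unless na.rm = TRUE is set.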

I don’t like that the activity column is at the tail, so I will move it, and also remove the highlight_color and combination_color columns since I don’t believe I’ll use them anymore.

df2 <- df2[, c(
  "unique_id",
  "primary_color",
  "running", "chasing", "climbing", "eating", "foraging",
  "activity", "above_ground_measurement",
  "x", "y"
)]

str(df2)
'data.frame':   3023 obs. of  11 variables:
 $ unique_id               : chr  "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
 $ primary_color           : chr  "" "" "Gray" "Gray" ...
 $ running                 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ chasing                 : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ climbing                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ eating                  : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
 $ foraging                : logi  FALSE FALSE FALSE TRUE TRUE TRUE ...
 $ activity                : logi  FALSE FALSE TRUE TRUE TRUE TRUE ...
 $ above_ground_measurement: chr  "" "" "10" "" ...
 $ x                       : num  -74 -74 -74 -74 -74 ...
 $ y                       : num  40.8 40.8 40.8 40.8 40.8 ...

I would like to check the unique values of primary_color. I could call unique(df2$primary_color) directly, but I’ll just loop over color_cols since I created it earlier. The other two color columns were dropped above, so they return NULL.

# should return three lines, last two should just be NULL
for (col in color_cols){
  print(unique(df2[[col]]))
}
[1] ""         "Gray"     "Cinnamon" "Black"   
NULL
NULL

Now I want to replace the empty string ("") with NA. I learned that NULL is not a missing-value marker in R: assigning NULL to a data frame column removes that column, so NA is the right choice here.

df2$primary_color[df2$primary_color == ""] <- NA

unique(df2$primary_color)
[1] NA         "Gray"     "Cinnamon" "Black"   
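
Blank strings can also be turned into NA while the file is being read, using the na.strings argument of read.csv(). The sketch below reads a small inline CSV instead of the real file; `csv_text` and `toy` are hypothetical names. Applied to the full dataset, this would affect every blank field (for example in Age or Location), not just primary_color.

```r
# Small inline CSV with one blank colour (made-up rows)
csv_text <- "id,primary_color\n1,Gray\n2,\n3,Black"

# Treat empty fields (and the literal string NA) as missing on import
toy <- read.csv(
  text = csv_text,
  na.strings = c("", "NA"),
  stringsAsFactors = FALSE
)
toy$primary_color  # "Gray" NA "Black"
```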

Checking the unique values of above_ground_measurement:

unique(df2$above_ground_measurement)
 [1] ""      "10"    "FALSE" "30"    "6"     "24"    "8"     "25"    "5"    
[10] "50"    "4"     "3"     "70"    "12"    "2"     "20"    "7"     "13"   
[19] "15"    "28"    "35"    "100"   "1"     "80"    "65"    "40"    "18"   
[28] "17"    "55"    "60"    "180"   "9"     "45"    "0"     "43"    "16"   
[37] "33"    "11"    "23"    "31"    "14"    "19"   

The values look like integers, except for "FALSE" and "". So we will find any values %in% those two and change them to "0" before converting to numeric. If this were a pipeline, I would build a function to normalize the column and then flag any non-numeric values.

df2$above_ground_measurement[df2$above_ground_measurement %in% c("", "FALSE")] <- "0"

df2$above_ground_measurement <- as.numeric(df2$above_ground_measurement)

str(df2$above_ground_measurement)
 num [1:3023] 0 0 10 0 0 0 0 0 0 30 ...
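
The pipeline-style function mentioned above could look roughly like the sketch below. It is an assumption about how such a helper might be written, not code from this assignment: `normalize_numeric` and its `missing_tokens` argument are hypothetical names. Unlike the replacement with "0", it maps the placeholder values to NA, which keeps missing readings separate from the genuine "0" measurements that also appear in the column.

```r
# Hypothetical helper: convert a character column to numeric and report
# any values that could not be parsed
normalize_numeric <- function(x, missing_tokens = c("", "FALSE")) {
  x <- trimws(as.character(x))
  x[x %in% missing_tokens] <- NA_character_      # known placeholders -> NA
  parsed <- suppressWarnings(as.numeric(x))      # unparseable values -> NA
  unparsed <- unique(x[!is.na(x) & is.na(parsed)])
  list(values = parsed, unparsed = unparsed)
}

# Made-up sample that includes one genuinely bad value ("abc")
res <- normalize_numeric(c("", "10", "FALSE", "30", "abc", "0"))
res$values    # NA 10 NA 30 NA 0
res$unparsed  # "abc"
```

Whether blanks should become 0 or NA depends on what a blank means in the census; the sketch only shows how to keep the two cases apart.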

Checking x and y to see if they contain any non-numeric symbols:

bad_rows <- which(grepl("^$|[^0-9.-]", df2$x))
bad_rows
integer(0)
bad_rows <- which(grepl("^$|[^0-9.-]", df2$y))
bad_rows
integer(0)
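
To show what the pattern used above actually flags: `^$` matches an empty string, and `[^0-9.-]` matches any character that is not a digit, a dot, or a minus sign. A small sketch on made-up values (`pattern`, `samples`, and `flagged` are hypothetical names):

```r
pattern <- "^$|[^0-9.-]"

# Made-up samples: two clean coordinates, a blank, a stray letter,
# and scientific notation
samples <- c("-73.95613", "40.79408", "", "40.79N", "1e5")

flagged <- grepl(pattern, samples)
data.frame(value = samples, flagged = flagged)
# flagged: FALSE FALSE TRUE TRUE TRUE
```

The pattern is a character-level check only. A string such as "1.2.3" passes it even though it is not a number, and "1e5" is flagged even though as.numeric() can parse it, so testing is.na(as.numeric(x)) is the stricter option.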

There are none, and str() already reports x and y as num, so a simple as.numeric() conversion is all that is needed:

df2$x <- as.numeric(df2$x)
df2$y <- as.numeric(df2$y)
str(df2)
'data.frame':   3023 obs. of  11 variables:
 $ unique_id               : chr  "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
 $ primary_color           : chr  NA NA "Gray" "Gray" ...
 $ running                 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ chasing                 : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ climbing                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ eating                  : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
 $ foraging                : logi  FALSE FALSE FALSE TRUE TRUE TRUE ...
 $ activity                : logi  FALSE FALSE TRUE TRUE TRUE TRUE ...
 $ above_ground_measurement: num  0 0 10 0 0 0 0 0 0 30 ...
 $ x                       : num  -74 -74 -74 -74 -74 ...
 $ y                       : num  40.8 40.8 40.8 40.8 40.8 ...

This data looks good to me!

Conclusion

This section contains some visuals and findings from the data. I used ggplot2 for the visuals and had an AI generate the plotting code, since I’m not yet familiar with that package.

ggplot(df2, aes(x = primary_color, fill = primary_color)) +
  geom_bar() +
  labs(
    title = "Squirrel Count by Primary Fur Color",
    x = "Primary Fur Color",
    y = "Number of Observations"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This is a count of all observations by primary fur color. Gray squirrels are far more numerous than any other primary fur color, at least in this census data.

Squirrel Behaviors

rates <- colSums(df2[bool_cols]) / nrow(df2)

plot_df <- data.frame(
  behavior = names(rates),
  proportion = as.numeric(rates)
)

ggplot(plot_df, aes(x = behavior, y = proportion, fill = behavior)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Observed Squirrel Behaviors",
    x = "Behavior",
    y = "Percent of Observations"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

We can see that foraging was the most frequently observed activity, by a large margin. Climbing, eating, and running are nearly identical, within 5 percentage points of one another, and chasing is the lowest at less than 10%.
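
The proportions behind a chart like this are simply the means of the logical columns, because TRUE counts as 1 and FALSE as 0. A minimal sketch on made-up data (`toy`, `toy_rates`, and `diff_pp` are hypothetical names) that also shows what a difference in percentage points means:

```r
# Toy logical columns (made-up values)
toy <- data.frame(
  running  = c(TRUE, FALSE, FALSE, FALSE),
  foraging = c(TRUE, TRUE,  FALSE, FALSE)
)

# Mean of a logical vector = share of TRUE values
toy_rates <- sapply(toy, mean)
toy_rates  # running 0.25, foraging 0.50

# Difference in percentage points: 50% - 25% = 25 points
diff_pp <- unname(toy_rates["foraging"] - toy_rates["running"]) * 100
diff_pp
```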

Summary Table for All Behaviors

# Create a summary table for all behaviors
do.call(rbind, lapply(bool_cols, function(col) {
  df2 %>%
    group_by(primary_color) %>%
    summarise(
      behavior = col,
      n_false = sum(get(col) == FALSE, na.rm = TRUE),
      n_true = sum(get(col) == TRUE, na.rm = TRUE),
      total = n(),
      .groups = 'drop'
    )
}))
# A tibble: 20 × 5
   primary_color behavior n_false n_true total
   <chr>         <chr>      <int>  <int> <int>
 1 Black         running       77     26   103
 2 Cinnamon      running      289    103   392
 3 Gray          running     1876    597  2473
 4 <NA>          running       51      4    55
 5 Black         chasing       96      7   103
 6 Cinnamon      chasing      362     30   392
 7 Gray          chasing     2235    238  2473
 8 <NA>          chasing       51      4    55
 9 Black         climbing      78     25   103
10 Cinnamon      climbing     310     82   392
11 Gray          climbing    1940    533  2473
12 <NA>          climbing      37     18    55
13 Black         eating        79     24   103
14 Cinnamon      eating       281    111   392
15 Gray          eating      1854    619  2473
16 <NA>          eating        49      6    55
17 Black         foraging      60     43   103
18 Cinnamon      foraging     189    203   392
19 Gray          foraging    1290   1183  2473
20 <NA>          foraging      49      6    55

This simple table shows, for each primary fur color and behavior, how many records are FALSE, how many are TRUE, and the total number of records.
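
A similar per-color count can be produced in base R with aggregate(), which gives one row per fur color and one column per behavior. This is a sketch on made-up data, not the table above; `toy`, `flag_cols`, and `n_true` are hypothetical names. Note that aggregate() omits rows whose grouping value is NA, so the NA group shown above would not appear.

```r
# Toy data (made-up values)
toy <- data.frame(
  primary_color = c("Gray", "Gray", "Black", "Black", "Gray"),
  running       = c(TRUE,  FALSE, TRUE,  FALSE, FALSE),
  foraging      = c(TRUE,  TRUE,  FALSE, FALSE, TRUE)
)
flag_cols <- c("running", "foraging")

# Sum the TRUE values of every behavior column within each fur color
n_true <- aggregate(
  toy[flag_cols],
  by  = list(primary_color = toy$primary_color),
  FUN = sum
)

# Add the number of records per color, looked up by name
n_true$total <- as.vector(table(toy$primary_color)[n_true$primary_color])
n_true
```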

Share of Total Activity by Fur Color

counts <- aggregate(
  activity ~ primary_color,
  data = df2,
  FUN = sum,
  na.action = na.omit
)

counts$percent <- counts$activity / sum(counts$activity)

ggplot(counts, aes(x = primary_color, y = percent, fill = primary_color)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Share of Total Activity by Fur Color",
    x = "Primary Fur Color",
    y = "Percent of Total Activity"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This graph shows each primary fur color’s share of all recorded activity. Gray absolutely dominates, but mainly because gray squirrels account for most of the records. I left this chart in as a caution, because it could easily be misread as gray squirrels being more active than the other colors.
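
The imbalance is easy to see from the record totals in the summary table above (103 Black, 392 Cinnamon, 2473 Gray, with the NA group left out). A short worked calculation, where `totals` and `share` are hypothetical names:

```r
# Records per primary fur color, taken from the summary table above
totals <- c(Black = 103, Cinnamon = 392, Gray = 2473)

# Share of all colored records contributed by each fur color
share <- round(totals / sum(totals), 3)
share  # Black 0.035, Cinnamon 0.132, Gray 0.833
```

With roughly 83% of the records, gray squirrels would dominate any chart of raw totals even if every color behaved identically, which is why the per-color rates are the fairer comparison.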

Behavior Rates by Primary Fur Color

# Calculate activity rates for all behaviors
activity_rates <- do.call(rbind, lapply(bool_cols, function(col) {
  df2 %>%
    filter(!is.na(primary_color)) %>%  # Remove NA colors
    group_by(primary_color) %>%
    summarise(
      behavior = col,
      n_true = sum(get(col) == TRUE, na.rm = TRUE),
      total = n(),
      activity_rate = n_true / total,
      .groups = 'drop'
    )
}))

# Create the bar chart
ggplot(activity_rates, aes(x = primary_color, y = activity_rate, fill = primary_color)) +
  geom_col() +
  facet_wrap(~ behavior, ncol = 1) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Behavior Rates by Primary Fur Color",
    x = "Primary Fur Color",
    y = "Percent Observed"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This graph is interesting because it visually shows that squirrels, regardless of their fur color, seem to have very similar activity rates. The exception is foraging, so to quantify the spread for each behavior we can do this:

behavior_dispersion <- aggregate(
  activity_rate ~ behavior,
  data = activity_rates,
  FUN = sd
)

behavior_dispersion
  behavior activity_rate
1  chasing    0.01450017
2 climbing    0.01781462
3   eating    0.02547622
4 foraging    0.05056946
5  running    0.01067582

behavior_dispersion shows, for each behavior, the standard deviation of activity_rate across the three primary fur colors. Foraging has the largest SD, about 0.05 (5 percentage points), so it is the behavior whose rate varies the most with fur color. With only three groups this is a rough indicator of spread, not a formal test of association.
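
One way to check whether that spread is more than noise would be a chi-squared test of independence on the foraging counts from the summary table above. This is a sketch of a possible follow-up under that assumption, not an analysis carried out in this assignment; `foraging_counts` and `foraging_test` are hypothetical names.

```r
# Foraging counts by primary fur color, from the summary table above
# (columns: not foraging, foraging; the NA color group is excluded)
foraging_counts <- matrix(
  c(  60,   43,
     189,  203,
    1290, 1183),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    primary_color = c("Black", "Cinnamon", "Gray"),
    foraging      = c("FALSE", "TRUE")
  )
)

# Pearson chi-squared test of independence (3 x 2 table, so 2 df)
foraging_test <- chisq.test(foraging_counts)
foraging_test$statistic
foraging_test$p.value
```

A small p-value would support an association between fur color and foraging; a large one would suggest the differences in rate are compatible with chance.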

Average Activity Rate by Primary Fur Color

rates <- aggregate(
  activity ~ primary_color,
  data = df2,
  FUN = mean,
  na.action = na.omit
)

ggplot(rates, aes(x = primary_color, y = activity, fill = primary_color)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Average Activity Rate by Primary Fur Color",
    x = "Primary Fur Color",
    y = "Percent of Observations with Activity"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This one is pretty funny: I spent a good 30 minutes trying to figure out whether my data was wrong (for example, whether FALSE was somehow being aggregated), but it just turns out the squirrels were very active!

Final Thoughts

My recommendation would be to explore the data further, for example with more statistical analysis. It would be interesting to use the x/y coordinates to see where in the park squirrels of each fur color were recorded, and which parts of the park had the most activity. A GIS map would be a great tool to visualize this.
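
As a first step before a full GIS workflow, the x/y columns can be drawn as a simple scatter plot with ggplot2. The sketch below uses a handful of made-up points (coordinates in the Central Park range, with colors assigned arbitrarily); `toy_pts` and `p` are hypothetical names, and the real df2 would take the place of the toy data.

```r
library(ggplot2)

# Made-up points for illustration; x is longitude, y is latitude
toy_pts <- data.frame(
  x = c(-73.9561, -73.9689, -73.9743, -73.9596),
  y = c(40.7941, 40.7838, 40.7755, 40.7903),
  primary_color = c("Gray", "Gray", "Cinnamon", "Black")
)

p <- ggplot(toy_pts, aes(x = x, y = y, color = primary_color)) +
  geom_point(size = 2) +
  coord_quickmap() +  # approximate map aspect ratio for lon/lat data
  labs(
    title = "Squirrel Sightings by Primary Fur Color",
    x = "Longitude",
    y = "Latitude",
    color = "Primary Fur Color"
  ) +
  theme_minimal()

print(p)
```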

Video

TODO: Video