Project2choieno

Author

E Choi

Emperor Penguin

My topic is emperor penguins, specifically data from aerial counts of emperor penguin colonies. For categorical variables, I focused on the colony location and on whether the count was of adults or chicks. For quantitative variables, I chose the penguin count and the year of observation. I chose this topic because I would love to visit Antarctica and I love penguins, as they are very interesting animals! I also may have been watching the movie “Surf’s Up” recently. The data comes from the University of Minnesota System. I cleaned the data similarly to the Japan earthquakes tutorial, using the dplyr select command to keep only the variables I mentioned earlier, along with latitude and longitude so that I could place dots on the map.

library(tidyverse) #load libraries
library(ggplot2)
library(tidyr)
library(RColorBrewer)
library(leaflet)
setwd("C:/Users/enomc/OneDrive - montgomerycollege.edu/Documents/Data Science") # set working directory
penguins <- read_csv("emperial.csv") #name dataset
empenguins <- penguins |> #select from dplyr in order to only select the variables necessary
  select(-reference, -vantage, -accuracy, -season_starting, -month, -day, -cammlr_region, -common_name, -site_id) #removes unnecessary variables that I don't use
head(empenguins)
# A tibble: 6 × 6
  site_name      longitude_epsg_4326 latitude_epsg_4326  year penguin_count
  <chr>                        <dbl>              <dbl> <dbl>         <dbl>
1 Emperor Island               -68.7              -67.9  1948            70
2 Emperor Island               -68.7              -67.9  1949           141
3 Emperor Island               -68.7              -67.9  1949           150
4 Emperor Island               -68.7              -67.9  1957            90
5 Emperor Island               -68.7              -67.9  1957            35
6 Emperor Island               -68.7              -67.9  1958           165
# ℹ 1 more variable: count_type <chr>
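The cleaning step above drops the unused columns by name. As a minimal sketch of the alternative, the same result can be reached by listing only the columns to keep. The toy tibble `penguins_toy`, its `reference` values, and the `count_type` entries are placeholders for illustration and are not taken from the real file.

```r
library(dplyr)

# Toy stand-in for the raw data (assumption: placeholder values, one extra
# column called "reference" that should be dropped)
penguins_toy <- tibble(
  site_name = c("Emperor Island", "Coulman Island"),
  longitude_epsg_4326 = c(-68.7, 170),
  latitude_epsg_4326 = c(-67.9, -73.4),
  year = c(1948, 2006),
  penguin_count = c(70, 31432),
  count_type = c("adults", "adults"), # placeholder labels
  reference = c("ref A", "ref B")     # placeholder values
)

# Columns used in the chart and the map
keep_cols <- c(
  "site_name", "longitude_epsg_4326", "latitude_epsg_4326",
  "year", "penguin_count", "count_type"
)

# all_of() raises an error if any listed column is missing from the data
empenguins_toy <- penguins_toy |>
  select(all_of(keep_cols))

print(names(empenguins_toy))
```

Listing the kept columns makes the result independent of any extra columns the source file may contain, at the cost of a longer call.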
#find highest penguin count in dataset in order to annotate
new_df <- empenguins |> 
  arrange(desc(penguin_count))
head(new_df) # displays the highest penguin count along with other variables
# A tibble: 6 × 6
  site_name           longitude_epsg_4326 latitude_epsg_4326  year penguin_count
  <chr>                             <dbl>              <dbl> <dbl>         <dbl>
1 Coulman Island                     170.              -73.4  2006         31432
2 Coulman Island                     170.              -73.4  2011         26959
3 Cape Colbeck  Edwa…               -158.              -77.1  2011         26266
4 Coulman Island                     170.              -73.4  2008         25966
5 Coulman Island                     170.              -73.4  2012         25428
6 Coulman Island                     170.              -73.4  2005         25406
# ℹ 1 more variable: count_type <chr>
ggplot(empenguins, aes(x = year, y = penguin_count, fill = count_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Penguin Counts by Year and Data Type", #detailed title
    x = "Year Observed", #meaningful labels
    y = "Penguin Count", #meaningful labels
    fill= "Data Type", #legend to determine color meaning
    caption = "Data source: University of Minnesota System" #caption for data source
  ) +
  theme_bw() +                        # non-default theme
  scale_fill_brewer(palette = "Set2") + # non-default ColorBrewer palette, one color per count type
  # annotate the highest penguin count (year 2006, count 31432) found in the sorted data above
  annotate(
    "text",
    x = 2006,
    y = 31432,
    label = "Most penguins"
  ) # annotation syntax from https://ggplot2.tidyverse.org/reference/annotate.html

My non-map data visualization shows the penguin counts across the years observed and whether each aerial count was of adults, chicks, or nests. For the color set, I used a ColorBrewer palette through scale_fill_brewer, with one color each for adults, chicks, and nests. To analyze the data more efficiently, I set the x-axis to the year. This lets me see the yearly trends and which years had higher penguin observations. For the y-axis, I set it to the penguin count in order to see which years had the highest counts as well as what type of count was recorded. From the chart, adults appear most frequently, followed by chicks, and nests appear only once near 1980, indicating that nests were not observed very often. This was not very shocking because adults are most likely easier to spot as they are bigger, and at the very least the chicks are still moving targets. Nests may take more work to observe than adults and chicks. Additionally, to place my annotation correctly, I had to find the exact year and penguin count so that the text would appear at the correct coordinates. I was also unsure how to annotate, so I took the annotation code from this link: https://ggplot2.tidyverse.org/reference/annotate.html
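The annotation coordinates can also be looked up in code instead of being read off the sorted table. The following is a minimal sketch, assuming dplyr's `slice_max()`; the tibble `counts_toy` holds a few rows copied from the outputs shown earlier, and the names `counts_toy` and `top` are hypothetical.

```r
library(dplyr)

# A few rows copied from the printed outputs earlier in this report
# (count_type is left out because it is not visible in those outputs)
counts_toy <- tibble(
  site_name = c("Coulman Island", "Coulman Island", "Emperor Island"),
  year = c(2006, 2011, 1958),
  penguin_count = c(31432, 26959, 165)
)

# Keep only the single row with the largest count
top <- counts_toy |>
  slice_max(penguin_count, n = 1, with_ties = FALSE)

# top$year and top$penguin_count could then be passed as the x and y
# arguments of annotate("text", ...)
print(top)
```

This way the label position would follow the data if the file were updated, rather than staying fixed at typed-in numbers.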

# create the popup, following the Japan tutorial and the GIS assignment
popup_penpineapple <- paste0(
  "<b>Site: </b>", empenguins$site_name, "<br>",
  "<b>Number of Penguins: </b>", empenguins$penguin_count, "<br>",
  "<b>Year of Observation: </b>", empenguins$year, "<br>",
  "<b>Data Type: </b>", empenguins$count_type
)
pal <- colorFactor(
  palette = c("#7538a1", "#29b0d9", "#2f9c7b"), # colors assigned as in the Japan tutorial
  levels = c("adults", "chicks", "nests"),      # one color per count type, as in the Japan tutorial
  domain = empenguins$count_type
)
leaflet(empenguins) |>
  setView(lng = -0.07, lat = -75.25, zoom = 1.49) |>  # Antarctica location
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    lng = ~longitude_epsg_4326, #set longitude values to longitude in empenguins
    lat = ~latitude_epsg_4326, # set latitude values to latitude in empenguins
    fillColor = ~pal(count_type), #assign color of circles to pal 
    color = ~pal(count_type), #assigns color of circles to pal
    fillOpacity = 0.7, #makes color visible
    radius = ~penguin_count * 4,  # scale circle size by penguin count
    popup = popup_penpineapple #popup made earlier to display data when circles are clicked on
  )

My map shows where the penguins were sighted, how many were observed, what year they were observed, and whether the count was of adults, chicks, or nests. One trend I noticed was that the farther right, or east, I clicked, the more the data seemed to be clumped together. If you zoom into Coulman Island, or the large cluster of data on the right, there are a bunch of circles. If you click on the outermost circle, you will see the highest penguin count, and as you move toward the center of the circles, the counts begin to decrease. So what may seem like a bunch of points on top of each other is actually the set of separate observations recorded at that site. I also assigned a color factor exactly how we learned in the Japan earthquakes tutorial. This gives each circle a different color depending on whether it represents adults, chicks, or nests. Additionally, the data on the left and right sides seem to have been observed more recently, while the data toward the middle of Antarctica seem to have been observed earlier. I wanted to zoom in a bit more, but because the points are spread so widely, a higher starting zoom would push the farthest points out of view, and you would have to zoom out from the originally loaded map to see them.
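Because every observation at a site shares the same coordinates, the circles are concentric and a large one can hide the smaller ones. One possible way to keep the smaller circles clickable is to sort the rows so the largest counts come first. This is only a sketch, and it rests on the assumption that the map draws circles in row order with later rows on top. The tibble `sites_toy` reuses three Coulman Island rows from the output shown earlier, and `draw_order` is a hypothetical name.

```r
library(dplyr)

# Three Coulman Island observations copied from the sorted output earlier
sites_toy <- tibble(
  site_name = rep("Coulman Island", 3),
  year = c(2005, 2006, 2011),
  penguin_count = c(25406, 31432, 26959)
)

# Largest counts first, so that (under the row-order assumption above)
# the biggest circle would be drawn first and end up underneath
draw_order <- sites_toy |>
  arrange(desc(penguin_count))

print(draw_order)
```

The sorted table would then be the one passed to the map, and the popup text would need to be built from that same sorted table so each popup still matches its circle.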