library(tidyverse) #load libraries
library(ggplot2)
library(tidyr)
library(RColorBrewer)
library(leaflet)
setwd("C:/Users/enomc/OneDrive - montgomerycollege.edu/Documents/Data Science") # set working directory
penguins <- read_csv("emperial.csv") # name dataset
Project2choieno
My topic is penguins, specifically data from aerial counts of emperor penguins. For categorical variables, I chose to focus on the colony location and whether the penguins counted were adults or chicks. For quantitative variables, I chose the penguin count and the year of observation. I chose this topic because I would love to visit Antarctica and I love penguins, as they are very interesting animals! I also may have been watching the movie “Surf’s Up” recently. The data comes from the University of Minnesota System. I cleaned the data in a similar way to the Japan tutorial, using dplyr's select() to keep only the variables mentioned above, plus latitude and longitude, which are needed to place the dots on the map.
empenguins <- penguins |> # use dplyr's select() to keep only the necessary variables
  select(-reference, -vantage, -accuracy, -season_starting, -month, -day, -cammlr_region, -common_name, -site_id) # drop the variables I don't use
head(empenguins)
# A tibble: 6 × 6
site_name longitude_epsg_4326 latitude_epsg_4326 year penguin_count
<chr> <dbl> <dbl> <dbl> <dbl>
1 Emperor Island -68.7 -67.9 1948 70
2 Emperor Island -68.7 -67.9 1949 141
3 Emperor Island -68.7 -67.9 1949 150
4 Emperor Island -68.7 -67.9 1957 90
5 Emperor Island -68.7 -67.9 1957 35
6 Emperor Island -68.7 -67.9 1958 165
# ℹ 1 more variable: count_type <chr>
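The same cleaning step can also be written by naming the columns to keep rather than the ones to drop. This is only a minimal sketch: `toy_penguins` is made-up data that mimics a few columns of the real file, and its `reference` column stands in for all of the dropped variables.

```r
library(dplyr)
library(tibble)

# toy_penguins is made-up illustration data, not the real dataset
toy_penguins <- tibble(
  site_name = c("Emperor Island", "Coulman Island"),
  longitude_epsg_4326 = c(-68.7, 170),
  latitude_epsg_4326 = c(-67.9, -73.4),
  year = c(1948, 2006),
  penguin_count = c(70, 31432),
  count_type = c("adults", "adults"), # assumed values for illustration
  reference = c("a", "b")             # stands in for the dropped columns
)

# keep the six variables used in the chart and the map, by name
toy_clean <- toy_penguins |>
  select(site_name, longitude_epsg_4326, latitude_epsg_4326,
         year, penguin_count, count_type)

names(toy_clean)
```

Listing the kept columns makes it easier to see at a glance which six variables the rest of the project relies on, while the negative form used above is shorter when only a few columns need to go.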
# find the highest penguin count in the dataset so it can be annotated
new_df <- empenguins |>
  arrange(desc(penguin_count))
head(new_df) # displays the highest penguin counts along with the other variables
# A tibble: 6 × 6
site_name longitude_epsg_4326 latitude_epsg_4326 year penguin_count
<chr> <dbl> <dbl> <dbl> <dbl>
1 Coulman Island 170. -73.4 2006 31432
2 Coulman Island 170. -73.4 2011 26959
3 Cape Colbeck Edwa… -158. -77.1 2011 26266
4 Coulman Island 170. -73.4 2008 25966
5 Coulman Island 170. -73.4 2012 25428
6 Coulman Island 170. -73.4 2005 25406
# ℹ 1 more variable: count_type <chr>
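Instead of reading the top row off the sorted table, the peak year and count could also be pulled out directly and stored for later use. The sketch below assumes a small made-up tibble (`toy_counts`) and uses dplyr's `slice_max()`; the names `top_row`, `peak_year`, and `peak_count` are hypothetical.

```r
library(dplyr)
library(tibble)

# toy_counts is made-up illustration data
toy_counts <- tibble(
  site_name = c("Site A", "Site B", "Site C"),
  year = c(2005, 2006, 2011),
  penguin_count = c(120, 900, 450)
)

# slice_max() keeps the row with the largest penguin_count
top_row <- toy_counts |>
  slice_max(penguin_count, n = 1, with_ties = FALSE)

peak_year <- top_row$year           # x position for an annotation
peak_count <- top_row$penguin_count # y position for an annotation
```

Values stored this way could be passed to `annotate("text", x = peak_year, y = peak_count, ...)`, so the label would still land in the right place if the data changed.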
ggplot(empenguins, aes(x = year, y = penguin_count, fill = count_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Penguin Counts by Year and Data Type", # detailed title
    x = "Year Observed", # meaningful labels
    y = "Penguin Count", # meaningful labels
    fill = "Data Type", # legend to determine color meaning
    caption = "Data source: University of Minnesota System" # caption for data source
  ) +
  theme_bw() + # non-default theme
  scale_fill_brewer(palette = "Set2") + # non-default color palette; the qualitative Brewer sets are Set1, Set2, and Set3
  # annotate the highest penguin count (year 2006, count 31432) found above
  annotate(
    "text",
    x = 2006,
    y = 31432,
    label = "Most penguins"
  ) # got this information from https://ggplot2.tidyverse.org/reference/annotate.html
My non-map data visualization shows the penguin counts for each year observed and whether the aerial counts were of adults, chicks, or nests. For the colors, I used scale_fill_brewer() with a ColorBrewer palette so that each of the three count types (adults, chicks, or nests) gets its own color. To analyze the data more efficiently, I set the x-axis to the year. This lets me see the yearly trends and which years had higher penguin observations. For the y-axis, I set it to the penguin count in order to see which years had the highest counts as well as what type of data each count was recorded as. In the chart, adults appear most frequently, followed by chicks, and nests appear only once, near 1980, indicating that nests were not observed very often. This was not very shocking, because adults are most likely easier to spot as they are bigger, and the chicks are at least still moving targets. Nests may take more work to observe than adults and chicks. Additionally, to place my annotation correctly, I had to find the exact year and penguin count so that the text would be displayed at the correct coordinates. I was also unsure how to annotate, so I based my annotation code on this link: https://ggplot2.tidyverse.org/reference/annotate.html
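The reading of the chart above (adults most frequent, then chicks, then nests) could also be checked numerically by counting records per type. This is a sketch on made-up data: `toy_types` and its values are assumptions for illustration only, and the real check would use `empenguins` in its place.

```r
library(dplyr)
library(tibble)

# toy_types is made-up illustration data
toy_types <- tibble(
  count_type = c("adults", "adults", "chicks", "adults", "nests", "chicks"),
  penguin_count = c(70, 141, 35, 150, 20, 90)
)

# number of records and total penguins for each count type
type_summary <- toy_types |>
  group_by(count_type) |>
  summarise(
    n_records = n(),
    total_penguins = sum(penguin_count),
    .groups = "drop"
  ) |>
  arrange(desc(n_records)) # most frequently recorded type first

type_summary
```

A table like this separates how often a type was recorded from how many penguins were counted, which the dodged bars show together.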
# create popup; I followed the Japan tutorial and GIS assignment
popup_penpineapple <- paste0(
  "<b>Site: </b>", empenguins$site_name, "<br>",
  "<b>Number of Penguins: </b>", empenguins$penguin_count, "<br>",
  "<b>Year of Observation: </b>", empenguins$year, "<br>",
  "<b>Data Type: </b>", empenguins$count_type
)
pal <- colorFactor(
  palette = c("#7538a1", "#29b0d9", "#2f9c7b"), # from Japan tutorial to assign colors
  domain = empenguins$count_type,
  levels = c("adults", "chicks", "nests") # from Japan tutorial
)
leaflet(empenguins) |>
  setView(lng = -0.07, lat = -75.25, zoom = 1.49) |> # Antarctica location
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    lng = ~longitude_epsg_4326, # set longitude values to longitude in empenguins
    lat = ~latitude_epsg_4326, # set latitude values to latitude in empenguins
    fillColor = ~pal(count_type), # assigns fill color of circles from pal
    color = ~pal(count_type), # assigns outline color of circles from pal
    fillOpacity = 0.7, # makes color visible
    radius = ~penguin_count * 4, # scale circle size by penguin count
    popup = popup_penpineapple # popup made earlier to display data when circles are clicked on
  )
My map shows where the penguins were sighted, how many were observed, what year they were observed, and whether the counts were of adults, chicks, or nests. One trend I noticed was that the farther right, or east, I clicked, the more the data seemed to clump together. If you zoom into Coulman Island, or the large cluster of data on the right, there are a bunch of circles. If you click on the outermost circle, the one with the largest radius, you will see the highest penguin count, and as you click closer to the center, the numbers begin to decrease. So what may look like a bunch of points on top of each other is actually just the amount of data recorded at that site. I also assigned a color factor exactly how we learned in the Japan earthquakes tutorial. This gives each circle a different color depending on whether the count was of adults, chicks, or nests. Additionally, the data on the left and right sides seem to have been observed more recently, while the data toward the middle of Antarctica seem to have been observed earlier. I wanted to zoom in a bit more, but because the data are so widely spread, the more I increased the zoom, the more of the outer points fell out of view, and you would have to zoom out from the initially loaded map to see them.
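Because the circles at one site are stacked, one way to keep the smaller ones clickable would be to sort the rows so the largest counts are drawn first, and to scale the radius by the square root of the count so that circle area, not radius, tracks the count. The sketch below only prepares the data. The three rows reuse the Coulman Island years and counts printed in the sorted table earlier, `toy_site`, `map_ready`, and `radius_m` are hypothetical names, the factor 500 is an arbitrary choice, and the idea that later rows are drawn on top is my assumption about how leaflet layers the circles.

```r
library(dplyr)
library(tibble)

# three Coulman Island records taken from the sorted table shown earlier
toy_site <- tibble(
  year = c(2005, 2006, 2008),
  penguin_count = c(25406, 31432, 25966)
)

map_ready <- toy_site |>
  arrange(desc(penguin_count)) |>              # largest circles first (assumed to be drawn underneath)
  mutate(radius_m = sqrt(penguin_count) * 500) # radius in meters; 500 is an arbitrary scale factor

map_ready
```

A data frame prepared like this could be passed to `leaflet()` with `radius = ~radius_m` in `addCircles()`, in place of `~penguin_count * 4`.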