Using data from the palmerpenguins data set, we will look for correlations among flipper length, body mass, bill length, and bill depth across three penguin species. The data were collected by Dr. Kristen Gorman.
Notes: Set up the initial environment by loading the tidyverse and palmerpenguins packages.
library(tidyverse)
library(palmerpenguins)
To get an idea of the data set we will be working with, we use the head() function to see the first 6 rows of observations along with the column titles. Next, we use the colnames() function to list all of the column names. Together, these give us a better understanding of the data set used in this analysis.
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
colnames(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
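Row 4 of the head() output contains NA values, so it may be worth checking how much data is missing before plotting. The following is a minimal sketch using base R; the name `na_counts` is introduced here for illustration only.

```r
library(palmerpenguins)

# Dimensions of the data set: rows (observations) and columns (variables)
dim(penguins)

# Number of missing values in each column
na_counts <- colSums(is.na(penguins))
na_counts
```

ggplot2 drops rows with a missing x or y value and prints a warning, so knowing these counts helps explain any "removed rows" messages from the plots that follow.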
Now we want to create a scatter plot that compares flipper length and body mass for the penguins in the data set. To do this, we use the ggplot() function from the ggplot2 package together with the geom_point() function, building the plot up in layers.
To create the first layer of the plot we must choose which data set to build the plot from.
ggplot(data=penguins)
The second layer of the plot creates the x,y plane with flipper length on the x-axis and body mass on the y-axis.
ggplot(data=penguins)+aes(x=flipper_length_mm,y=body_mass_g)
The third and final layer, for now, will plot each data point on the plane using the geom_point() function.
ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g))
This scatter plot shows a clear positive correlation between flipper length and body mass among the penguins in this data set.
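The visual impression can be backed up with a number. As a minimal sketch, the base R cor() function gives the Pearson correlation coefficient; the variable name `r_flipper_mass` is an illustrative choice, and use = "complete.obs" is needed because some rows contain NA.

```r
library(palmerpenguins)

# Pearson correlation between flipper length and body mass,
# using only rows where both measurements are present
r_flipper_mass <- cor(penguins$flipper_length_mm, penguins$body_mass_g,
                      use = "complete.obs")
r_flipper_mass
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.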
Now we can use the same steps as above to create another scatter plot comparing the length and depth of bills for penguins in the data set.
ggplot(data=penguins)+geom_point(mapping=aes(x=bill_length_mm,y=bill_depth_mm))
Unlike the previous scatter plot, this scatter plot does not immediately show a strong correlation between the two variables. However, further analysis could provide better insight into how bill length and depth could be connected.
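One possible form of that further analysis, sketched here as an assumption rather than as part of the original work, is to compare the correlation across all penguins with the correlation inside each species. The names `r_overall` and `r_by_species` are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Correlation across all penguins, ignoring species
r_overall <- cor(penguins$bill_length_mm, penguins$bill_depth_mm,
                 use = "complete.obs")

# Correlation computed separately within each species
r_by_species <- penguins %>%
  group_by(species) %>%
  summarise(r = cor(bill_length_mm, bill_depth_mm, use = "complete.obs"))

r_overall
r_by_species
```

If the within-species values differ in sign or strength from the overall value, then pooling the species is hiding part of the relationship, which would explain why the combined scatter plot looks unstructured.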
Changing the aesthetics of our data visualization can accomplish multiple tasks at the same time: it can make the visualization more appealing and make important information easier for stakeholders to find. We can accomplish this by adding colors, shapes, labels, and annotations, or by splitting the visualization into multiple facets, depending on what information we want to emphasize.
As an extension of the aes() function, we can change the color and shape of data points on our plot. First, we will change the color of the points to be grouped by species on the Flipper Length vs Body Mass scatter plot.
ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))
By doing this, stakeholders can more easily see the difference in flipper length and mass by species. It becomes clear that the Gentoo species is the largest in the data set, while the Adelie and Chinstrap species are more similar to each other in flipper length and mass.
We can also change the shape of the data points mapped onto the scatter plot by a given variable. Next we will map the points and give them a different shape depending on species.
ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,shape=species))
This can help further differentiate the data points from each other. It also makes the visualization more accessible, because stakeholders who have trouble distinguishing one encoding can rely on another. Furthermore, we can combine the two, shape and color by species, on a single scatter plot so that each species is identified in two ways at once.
ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,shape=species,color=species))
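If accessibility is the goal, the default colors can also be swapped for a palette designed to stay legible under color-vision deficiency. The sketch below uses ggplot2's scale_color_viridis_d(); choosing viridis is an assumption made for illustration and is not part of the original analysis, and `p_accessible` is an illustrative name.

```r
library(ggplot2)
library(palmerpenguins)

# Shape and color both mapped to species, with a viridis palette
# (an assumed choice) replacing the default colors
p_accessible <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g,
                           shape = species, color = species)) +
  scale_color_viridis_d()
p_accessible
```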
Finally, if the data points are still difficult to read, there are a few tricks we can use to increase readability. One is to change the size of the shapes based on species.
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,shape=species,color=species,size=species))
Alternatively, when plotting a dense graph, we can change the alpha (transparency) of the data points to make overlapping points easier to tell apart.
ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,alpha=species))
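ggplot2 warns that mapping size or alpha to a discrete variable such as species is not advised. A common alternative, sketched here, is to set one fixed transparency for every point by placing alpha outside aes(). The value 0.5 and the name `p_alpha` are illustrative choices.

```r
library(ggplot2)
library(palmerpenguins)

# alpha = 0.5 is set outside aes(), so every point gets the same
# transparency instead of one level per species
p_alpha <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g,
                           color = species),
             alpha = 0.5)
p_alpha
```

With a fixed alpha, regions where many points overlap appear darker, which shows density without tying transparency to a category.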
Another way to visualize our data on Flipper Length vs Body Mass is a smoothed line chart, which fits a trend line to the two variables. This is useful for seeing the general trend and checking for positive, negative, or no correlation between them. To do this we use the geom_smooth() function instead of the geom_point() function.
ggplot(data=penguins)+geom_smooth(mapping=aes(x=flipper_length_mm,y=body_mass_g))
This line graph also shows that there is a clear positive correlation between flipper length and body mass for the penguins included in this data set.
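When no method is given, geom_smooth() chooses a smoother automatically (loess for a data set of this size), which produces a curve. If a straight trend line is preferred, a linear fit can be requested instead. The sketch below also fits the same model with lm() so the slope can be read off; `p_lm` and `fit` are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

# Straight-line fit drawn by ggplot2
p_lm <- ggplot(data = penguins) +
  geom_smooth(mapping = aes(x = flipper_length_mm, y = body_mass_g),
              method = "lm", formula = y ~ x)
p_lm

# The same linear model fitted directly, to inspect its coefficients
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
coef(fit)
```

The flipper_length_mm coefficient is the estimated change in body mass, in grams, for each additional millimeter of flipper length.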
Now that we have both a scatter plot and line graph showing the correlation between flipper length and body mass, we can layer the two plots on top of each other to give a holistic representation of the analysis. By fitting the trend line to the scatter plot we clearly show the positive correlation between flipper length and body mass among all species of penguin within the data set.
ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
geom_smooth(mapping=aes(x=flipper_length_mm,y=body_mass_g))
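Because both layers use the same x and y variables, the mapping can be declared once in ggplot() and inherited by each geom. This is a sketch of an equivalent plot; `p_combined` is an illustrative name.

```r
library(ggplot2)
library(palmerpenguins)

# x and y are declared once in ggplot() and inherited by both layers;
# color is mapped only in geom_point(), so a single trend line is drawn
p_combined <- ggplot(data = penguins,
                     mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth()
p_combined
```

Moving color = species into the ggplot() call as well would make geom_smooth() inherit it and draw one trend line per species.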
To further differentiate between the species, we can use the facet_wrap() function to create a separate panel for each one. This lets us compare the Flipper Length vs Body Mass relationship species by species.
ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
facet_wrap(~species)
Now, if we want to analyze the differences between the species further, we can include the sex data from the penguins data set to see how that affects our analysis.
ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
facet_grid(sex~species)
Here we can begin to see that male penguins tend to have longer flippers across all species, while body mass tends to be similar for the Adelie and Chinstrap penguins. However, male Gentoo penguins appear to have longer flippers and greater body mass on average.
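These visual impressions can be checked against group averages. The following sketch uses dplyr to compute the mean flipper length and body mass for each species and sex; dropping penguins with an unrecorded sex is an assumption made here, and the column names in `size_summary` are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Mean flipper length and body mass for each species/sex combination,
# leaving out penguins whose sex was not recorded
size_summary <- penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(mean_flipper_mm = mean(flipper_length_mm, na.rm = TRUE),
            mean_mass_g = mean(body_mass_g, na.rm = TRUE),
            n = n(),
            .groups = "drop")
size_summary
```

Comparing the male and female rows within each species shows whether the pattern seen in the facets holds on average.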
To complete our visualization we can add titles, subtitles, captions, and annotations that emphasize important information from our analysis. First, we add a title so that stakeholders can quickly see what the visualization shows.
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,shape=species,color=species))+
labs(title="Palmer Penguins: Body Mass vs Flipper Length")
Next, we will add a subtitle to the visualization to provide additional information on the data set.
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,shape=species,color=species))+
labs(title="Palmer Penguins: Body Mass vs Flipper Length",subtitle="Sample of Three Penguin Species")
Finally, we will add a caption to cite where the data came from.
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,shape=species,color=species))+
labs(title="Palmer Penguins: Body Mass vs Flipper Length",subtitle="Sample of Three Penguin Species",
caption="Data collected by Dr. Kristen Gorman")
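The labs() function also accepts axis and legend titles, which lets us replace raw column names such as flipper_length_mm with readable text. A sketch, assuming the label wording below (which is not from the original analysis) and the illustrative name `p_labeled`:

```r
library(ggplot2)
library(palmerpenguins)

p_labeled <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g,
                           shape = species, color = species)) +
  labs(title = "Palmer Penguins: Body Mass vs Flipper Length",
       subtitle = "Sample of Three Penguin Species",
       caption = "Data collected by Dr. Kristen Gorman",
       x = "Flipper length (mm)",  # replaces the column name on the x-axis
       y = "Body mass (g)",        # replaces the column name on the y-axis
       color = "Species",          # legend title for color
       shape = "Species")          # same title so the legends stay merged
p_labeled
```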
We can use annotation to point out valuable information directly on the plot. This can be used in many different ways, but here it will be used to emphasize the largest species of penguin in the data set. We use the annotate() function to accomplish this; its arguments specify the type of annotation, where it should be located, and what it should contain.
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
labs(title="Palmer Penguins: Body Mass vs Flipper Length",subtitle="Sample of Three Penguin Species",
caption="Data collected by Dr. Kristen Gorman")+
annotate("text",x=220,y=3500,label="The Gentoos are the largest")
In order to make the annotation more visually appealing we can change the color, size and angle of the text annotation.
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
labs(title="Palmer Penguins: Body Mass vs Flipper Length",subtitle="Sample of Three Penguin Species",
caption="Data collected by Dr. Kristen Gorman")+
annotate("text",x=220,y=3500,label="The Gentoos are the largest",color="purple",fontface="bold",
size=4.5, angle=25)
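Since each step repeats the same plot code, one option is to store the plot in a variable and add further layers to it. The sketch below rebuilds the annotated plot that way; `base_plot` and `annotated_plot` are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

# Base plot stored once: points colored by species plus the labels
base_plot <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g,
                           color = species)) +
  labs(title = "Palmer Penguins: Body Mass vs Flipper Length",
       subtitle = "Sample of Three Penguin Species",
       caption = "Data collected by Dr. Kristen Gorman")

# The annotation is added as a further layer on top of the stored plot
annotated_plot <- base_plot +
  annotate("text", x = 220, y = 3500, label = "The Gentoos are the largest",
           color = "purple", fontface = "bold", size = 4.5, angle = 25)
annotated_plot
```

Adding to a stored plot does not modify it, so `base_plot` can be reused with different annotations or facets.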
Through this analysis we have concluded that there is a strong positive correlation between flipper length and body mass for the penguins in the given data set. Furthermore, we have confirmed that this correlation is found across all three species in the data set. We created multiple visualizations to support these findings and explained how these visualizations were created and used to support the analysis.