The palmerpenguins R package contains two datasets: penguins_raw and penguins.
Today we'll focus on penguins, a curated subset of the raw data in the package. As the names suggest, penguins is the pre-processed version of penguins_raw.
Let's go through each stage of the data science project first!
After installing the packages, you can load them with the library() function.
Two R packages are used to transform and visualise data: dplyr and ggplot2. Both are bundled, along with various other packages, in the tidyverse.
The dataset from the R package can be loaded using data().
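As a minimal sketch of the loading step, the code below assumes that the tidyverse and palmerpenguins packages have already been installed with install.packages(). The package argument to data() is an addition here; it makes explicit which package the dataset should come from.

```r
# Load the packages (assumes they were installed beforehand)
library(tidyverse)       # attaches dplyr, ggplot2 and other tidyverse packages
library(palmerpenguins)  # provides penguins and penguins_raw

# Load the curated dataset into the current environment
data("penguins", package = "palmerpenguins")

# Quick check: look at the first few rows
head(penguins)
```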
The data we have is already pre-processed, so it is in tidy format.
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex <fct> male, female, female, NA, female, male, female, male~
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
The data has 344 rows (observations) and 8 columns (variables).
Difference between bill length and bill depth
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2                                 
# install.packages("visdat")
visdat::vis_dat(penguins)
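If you prefer numbers to a plot, a rough base R equivalent is to count the missing values in each column. This is only a sketch; na_counts is a name introduced here for illustration.

```r
library(palmerpenguins)

# is.na() marks every missing cell; colSums() adds them up per column
na_counts <- colSums(is.na(penguins))
na_counts
```

The counts should agree with the NA's rows in the summary() output above: 2 for each of the four measurement columns and 11 for sex.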
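# Keep the columns of interest, drop rows with missing measurements,
# and add the bill depth-to-length ratio
data_subset <- penguins %>%
  select(species, bill_depth_mm, flipper_length_mm, bill_length_mm) %>%
  filter(!is.na(flipper_length_mm), !is.na(bill_depth_mm), !is.na(bill_length_mm)) %>%
  mutate(ratio = bill_depth_mm / bill_length_mm)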
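The three is.na() checks can also be written more compactly. The sketch below uses drop_na() from tidyr (part of the tidyverse) as an alternative, not as the method used in this report; data_subset_alt is a hypothetical name. Because species has no missing values, dropping rows with an NA in any selected column should give the same rows as the filter() version.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

data_subset_alt <- penguins %>%
  select(species, bill_depth_mm, flipper_length_mm, bill_length_mm) %>%
  drop_na() %>%   # removes rows with an NA in any of the selected columns
  mutate(ratio = bill_depth_mm / bill_length_mm)

# Number of complete rows that remain
nrow(data_subset_alt)
```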
You can use functions from the ggplot2 package to create various visualisations.
ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph.
ggplot2 comes with many geom functions that each add a different type of layer to a plot.
Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties.
The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes.
You can also tweak the visualisation with other aesthetics, such as the size, shape, or color of the points.
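To make the role of aes() concrete, here is a small sketch contrasting an aesthetic that is mapped to a variable with aesthetics that are set to fixed values. The choice of body_mass_g for the y axis, the object names, and the colour "steelblue" are illustrative assumptions.

```r
library(ggplot2)
library(palmerpenguins)

# colour is inside aes(): it is mapped to species, so each species
# gets its own colour and a legend is drawn
p_mapped <- ggplot(penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, colour = species),
             na.rm = TRUE)

# colour, size and shape are outside aes(): they are fixed values,
# so every point looks the same and no legend is drawn
p_fixed <- ggplot(penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g),
             colour = "steelblue", size = 2, shape = 17, na.rm = TRUE)

p_mapped
p_fixed
```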
ggplot(data_subset) +
  geom_density(mapping = aes(x = flipper_length_mm, fill = species), alpha = 0.5) +
  ggtitle("Flipper Length of different species")
You can do the same thing with a histogram!
data_subset %>%
  ggplot() +
  geom_histogram(mapping = aes(x = flipper_length_mm, fill = species), alpha = 0.8) +
  ggtitle("Flipper Length of different species")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
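The message above is a hint to choose the bin size yourself. As a sketch, the histogram can be redrawn with an explicit binwidth; the value of 5 (each bar spans 5 mm of flipper length) is an illustrative choice, not a recommendation from the data authors.

```r
library(ggplot2)
library(palmerpenguins)

p_hist <- ggplot(penguins) +
  geom_histogram(mapping = aes(x = flipper_length_mm, fill = species),
                 binwidth = 5,   # width of each bin, in mm
                 alpha = 0.8,
                 na.rm = TRUE) + # silently drop the rows with missing flipper length
  ggtitle("Flipper Length of different species")

p_hist
```

Try a few different values: bins that are too narrow make the plot noisy, while bins that are too wide hide the differences between species.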
data_subset%>%
ggplot()+
geom_density(mapping = aes(x=.....,fill=species),alpha=0.5)+
ggtitle("Bill Length of different species")
Set the chunk option eval = TRUE when you are ready to include the output of this code in your report.
Answer here!
data_subset %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = species, y = flipper_length_mm), fill = "blue", alpha = 0.5) +
  ggtitle("Comparison of flipper Length of different species")
penguins %>%
  ggplot() +
  geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm, color = species), shape = 16) +
  ggtitle("Comparing different species wrt bill Length & depth")
## Warning: Removed 2 rows containing missing values (geom_point).
- The shape argument assigns a shape to these points. If you don't want a dot, you can use the codes shown below to pick the particular shape you want.
Code for choosing shape in ggplot
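For instance, the scatter plot can be redrawn with a different shape code. This is a sketch only: code 17 draws a filled triangle, and the second plot shows the alternative of mapping shape to species inside aes() so that ggplot2 picks one shape per species.

```r
library(ggplot2)
library(palmerpenguins)

# Fixed shape for every point: 17 is a filled triangle
p_triangle <- ggplot(penguins) +
  geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm, color = species),
             shape = 17, na.rm = TRUE)

# Shape mapped to a variable: one shape per species, chosen by ggplot2
p_by_species <- ggplot(penguins) +
  geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm,
                           color = species, shape = species),
             na.rm = TRUE)

p_triangle
p_by_species
```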
Feel free to explore further and conclude important findings here.
It is important to credit the original authors of the R packages and any other sources you have used to create this report.
To cite an R package, you can use the function citation("packagename"), with the package name in quotes.
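As a short sketch, the calls below print the citation for R itself and for one of the packages used in this report, then convert the latter to BibTeX. It assumes palmerpenguins is installed; cit_r and cit_pp are names introduced here for illustration.

```r
# Citation for R itself (no package name needed)
cit_r <- citation()
print(cit_r)

# Citation for a package; the name must be a quoted string
cit_pp <- citation("palmerpenguins")
print(cit_pp)

# The same entry in BibTeX form, handy for a .bib file
toBibtex(cit_pp)
```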
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/
Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686