The palmerpenguins R package contains two datasets: penguins_raw and penguins. Today we'll focus on penguins, a curated subset of the raw data. As the name suggests, penguins is the pre-processed version.

Let's go through each stage of a data science project first!

Import: Load the dataset and libraries into the workspace.
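The import step is not shown in the text, so here is a minimal sketch of what it might look like. It assumes the data comes from palmerpenguins and that dplyr and ggplot2 supply the functions used later (%>%, glimpse, select, filter, mutate, ggplot); the original report may well have loaded tidyverse instead.

```r
# Assumption: these three packages are already installed.
# install.packages(c("palmerpenguins", "dplyr", "ggplot2"))  # run once if needed
library(palmerpenguins)  # provides penguins and penguins_raw
library(dplyr)           # provides %>%, glimpse(), select(), filter(), mutate()
library(ggplot2)         # provides ggplot() and the geom_*() layers

# Quick check that the data is available
dim(penguins)
```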

Tidy

The data we have is already pre-processed, so it is already in tidy format.

Summary of Dataset

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex               <fct> male, female, female, NA, female, male, female, male~
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~

The data has 344 rows (observations) and 8 columns (variables).

Difference between bill length and bill depth

summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

What do you gather from here?

  • The data was gathered from 2007 to 2009
  • Three species: Adelie, Chinstrap, Gentoo
  • Three islands: Biscoe, Dream, Torgersen
  • Some values are missing (NA)
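To see exactly where the missing values sit, you can count the NAs in each column. This is a small sketch; the name na_counts is only illustrative.

```r
library(palmerpenguins)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(penguins))
na_counts
```

The counts should agree with the NA's rows in the summary(penguins) output above: 2 for each of the four measurements and 11 for sex.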

You can also visualise the data as a whole.

# install.packages("visdat")
visdat::vis_dat(penguins)

Transform

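# Keep species and the three measurements, drop rows with missing
# measurements, and add the bill depth-to-length ratio.
data_subset <- penguins %>%
  select(species, bill_depth_mm, flipper_length_mm, bill_length_mm) %>%
  filter(!is.na(flipper_length_mm), !is.na(bill_depth_mm), !is.na(bill_length_mm)) %>%
  mutate(ratio = bill_depth_mm / bill_length_mm)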

Visualise

ggplot(data_subset) +
  geom_density(mapping = aes(x = flipper_length_mm, fill = species), alpha = 0.5) +
  ggtitle("Flipper Length of different species")

You can do the same thing with a histogram!

data_subset%>%
  ggplot()+
  geom_histogram(mapping = aes(x=flipper_length_mm,fill=species),alpha=0.8)+
  ggtitle("Flipper Length of different species")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
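The message above is ggplot2 pointing out that it fell back to 30 bins. Here is a sketch of the same histogram with an explicit binwidth. The value of 5 mm is an arbitrary choice for illustration, not a recommendation from the original report, and the block starts from penguins so that it runs on its own.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

p_hist <- penguins %>%
  filter(!is.na(flipper_length_mm)) %>%
  ggplot() +
  # binwidth = 5 is an assumed value; try a few and compare
  geom_histogram(mapping = aes(x = flipper_length_mm, fill = species),
                 binwidth = 5, alpha = 0.8) +
  ggtitle("Flipper Length of different species")

p_hist
```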

What do you gather from this?

  • Flipper length separates Gentoo from the others. Adelie and Chinstrap are more similar in that regard.

Your turn: Make a density plot of bill length for the different species. What do you gather from it?

data_subset%>%
  ggplot()+
  geom_density(mapping = aes(x=.....,fill=species),alpha=0.5)+
  ggtitle("Bill Length of different species")

Set the chunk option eval = TRUE when you are ready to include the output of this code in your report.

What do you gather from this?

Answer here!

Let's take a closer look at flipper length

data_subset%>%
  ggplot()+
  geom_boxplot(mapping = aes(x=species,y=flipper_length_mm),fill="blue",alpha=0.5)+
  ggtitle("Comparison of flipper Length of different species")

How do you interpret a boxplot?
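One way to read the boxplot above: geom_boxplot draws the box from the first to the third quartile, with a line at the median. The whiskers reach to the most extreme point within 1.5 times the interquartile range (IQR) of the box, and anything beyond that is drawn as an individual point. The sketch below computes those numbers per species. It works from penguins directly so that it runs on its own, and the names box_stats, q1, med, q3 and iqr are only illustrative.

```r
library(palmerpenguins)
library(dplyr)

box_stats <- penguins %>%
  filter(!is.na(flipper_length_mm)) %>%
  group_by(species) %>%
  summarise(
    q1  = quantile(flipper_length_mm, 0.25, names = FALSE),  # bottom of the box
    med = median(flipper_length_mm),                         # line inside the box
    q3  = quantile(flipper_length_mm, 0.75, names = FALSE),  # top of the box
    iqr = IQR(flipper_length_mm)                             # height of the box
  )

box_stats
```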

How does the bill vary between species?

  • You can colour the points by species to see the underlying patterns.

penguins%>%
  ggplot()+
  geom_point(mapping = aes(x=bill_length_mm,y=bill_depth_mm,color=species),shape=16)+
  ggtitle("Comparing different species wrt bill Length & depth")
## Warning: Removed 2 rows containing missing values (geom_point).

  • The shape argument sets the shape of the points. If you don't want a dot, you can use the codes shown below to pick any particular shape you want.

Code for choosing shape in ggplot
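As a small illustration of changing the shape, the sketch below swaps the filled circle (shape = 16) for a filled triangle (shape = 17). The choice of 17 is arbitrary, and the name p_shape is only illustrative.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

p_shape <- penguins %>%
  # dropping the two incomplete rows avoids the missing-values warning
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) %>%
  ggplot() +
  geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm, color = species),
             shape = 17) +  # 17 = filled triangle
  ggtitle("Comparing different species wrt bill Length & depth")

p_shape
```

Because shape is set outside aes(), every point gets the same shape. Moving it inside aes() as shape = species would give each species its own shape instead.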

What do you gather from this?

  • Adelie has a shorter bill length but a deeper bill. The opposite applies for Gentoo, and Chinstrap is somewhere in the middle.

Conclude

Feel free to explore further and conclude important findings here.

References

It is important that you credit the original authors of the R packages or any sources you have used to create this report.

For citing R packages, you can use the function citation("packagename"), with the package name given in quotes.
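For example, a sketch of pulling the citations for the packages used in this report might look like this. It assumes the packages are installed, and the names cite_data and cite_plot are only illustrative.

```r
# citation() takes the package name as a character string
cite_data <- citation("palmerpenguins")
cite_plot <- citation("ggplot2")

cite_data
# toBibtex() converts a citation into a BibTeX entry
toBibtex(cite_plot)
```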