Introduction

While taking a course on Coursera I came across this dataset, which was provided for hands-on practice with the R packages tidyverse and ggplot2. The objective of this project is to perform exploratory data analysis (EDA) on this data and to address the following points:

  1. What kinds of plots can be created with ggplot2?

  2. How many species are there, and what are their details?

  3. Which is the largest penguin species?

  4. Which species has the greatest body mass?

Data Set & Metadata

While researching the dataset, I learned that it was collected and made available to the public by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. The dataset consists of 2 data frames, penguins and penguins_raw; both contain data for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.

ROCCC analysis

  1. Reliable

  2. Original

  3. Comprehensive

  4. Current

  5. Cited

The data set meets all five criteria of the ROCCC analysis.

All the data used in this EDA ship with the palmerpenguins package (not with ggplot2) and can be loaded using the command library(palmerpenguins); library(ggplot2) is needed only for plotting.

Installing Required Packages

The packages used in this data analysis are: tidyverse (which includes ggplot2), palmerpenguins, and janitor (which provides clean_names())

and they can be installed using: install.packages("package_name")

Loading Required Packages

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.0      v purrr   0.3.5 
## v tibble  3.1.8      v dplyr   1.0.10
## v tidyr   1.2.1      v stringr 1.5.0 
## v readr   2.1.3      v forcats 0.5.2 
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)

Loading Data set in R Studio

library(palmerpenguins)

Checking Column names & Dataframe

str(penguins)
## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
head(penguins)
## # A tibble: 6 x 8
##   species island    bill_length_mm bill_depth_mm flipper_l~1 body_~2 sex    year
##   <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
## 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
## 2 Adelie  Torgersen           39.5          17.4         186    3800 fema~  2007
## 3 Adelie  Torgersen           40.3          18           195    3250 fema~  2007
## 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
## 5 Adelie  Torgersen           36.7          19.3         193    3450 fema~  2007
## 6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
## # ... with abbreviated variable names 1: flipper_length_mm, 2: body_mass_g
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex               <fct> male, female, female, NA, female, male, female, male~
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~

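Keeping Consistency in Column Names

library(janitor) # provides clean_names()
penguins <- clean_names(penguins) # the column names are already snake_case, so this leaves them unchanged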
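Several of the rows printed above contain NA, so it can help to count the missing values per column before any rows are dropped. The following is a minimal sketch using base R only; the names na_per_column and complete_rows are illustrative and not part of the original analysis.

```r
library(palmerpenguins)

# Number of missing values in each column of the penguins data frame
na_per_column <- colSums(is.na(penguins))
print(na_per_column)

# Number of rows with no missing value in any column;
# these are the rows that drop_na() keeps when called without arguments
complete_rows <- sum(complete.cases(penguins))
print(complete_rows)
```

With the packaged data this should report 333 complete rows, which is why summaries that call drop_na() describe fewer than 344 penguins.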

Calculating Maximum, Minimum, Mean Bill Length Grouped by Island

penguins %>%  group_by(island) %>% drop_na() %>% summarise(mean_bill_length_mm = mean(bill_length_mm), mamximum_bill_length = max(bill_length_mm), minimum_bill_length = min(bill_length_mm))
## # A tibble: 3 x 4
##   island    mean_bill_length_mm mamximum_bill_length minimum_bill_length
##   <fct>                   <dbl>                <dbl>               <dbl>
## 1 Biscoe                   45.2                 59.6                34.5
## 2 Dream                    44.2                 58                  32.1
## 3 Torgersen                39.0                 46                  33.5

Calculating Maximum, Minimum, Mean Bill Length Grouped by Species

penguins %>%  group_by(species) %>% drop_na() %>% summarise(mean_bill_length_mm = mean(bill_length_mm), mamximum_bill_length = max(bill_length_mm), minimum_bill_length = min(bill_length_mm))
## # A tibble: 3 x 4
##   species   mean_bill_length_mm mamximum_bill_length minimum_bill_length
##   <fct>                   <dbl>                <dbl>               <dbl>
## 1 Adelie                   38.8                 46                  32.1
## 2 Chinstrap                48.8                 58                  40.9
## 3 Gentoo                   47.6                 59.6                40.9

Calculating Maximum, Minimum, Mean Bill Length Grouped by Species & Island

penguins %>%  group_by(island, species) %>% drop_na() %>% summarise(mean_bill_length_mm = mean(bill_length_mm), mamximum_bill_length = max(bill_length_mm), minimum_bill_length = min(bill_length_mm))
## `summarise()` has grouped output by 'island'. You can override using the
## `.groups` argument.
## # A tibble: 5 x 5
## # Groups:   island [3]
##   island    species   mean_bill_length_mm mamximum_bill_length minimum_bill_le~1
##   <fct>     <fct>                   <dbl>                <dbl>             <dbl>
## 1 Biscoe    Adelie                   39.0                 45.6              34.5
## 2 Biscoe    Gentoo                   47.6                 59.6              40.9
## 3 Dream     Adelie                   38.5                 44.1              32.1
## 4 Dream     Chinstrap                48.8                 58                40.9
## 5 Torgersen Adelie                   39.0                 46                33.5
## # ... with abbreviated variable name 1: minimum_bill_length

Calculating Maximum, Minimum, Mean Bill Depth Grouped by Island

penguins %>%  group_by(island) %>% drop_na() %>% summarise(mean_bill_depth_mm = mean(bill_depth_mm), mamximum_bill_depth = max(bill_depth_mm), minimum_bill_depth = min(bill_depth_mm))
## # A tibble: 3 x 4
##   island    mean_bill_depth_mm mamximum_bill_depth minimum_bill_depth
##   <fct>                  <dbl>               <dbl>              <dbl>
## 1 Biscoe                  15.9                21.1               13.1
## 2 Dream                   18.3                21.2               15.5
## 3 Torgersen               18.5                21.5               15.9

Calculating Maximum, Minimum, Mean Bill Depth Grouped by Species

penguins %>%  group_by(species) %>% drop_na() %>% summarise(mean_bill_depth_mm = mean(bill_depth_mm), mamximum_bill_depth = max(bill_depth_mm), minimum_bill_depth = min(bill_depth_mm))
## # A tibble: 3 x 4
##   species   mean_bill_depth_mm mamximum_bill_depth minimum_bill_depth
##   <fct>                  <dbl>               <dbl>              <dbl>
## 1 Adelie                  18.3                21.5               15.5
## 2 Chinstrap               18.4                20.8               16.4
## 3 Gentoo                  15.0                17.3               13.1

Calculating Maximum, Minimum, Mean Bill Depth Grouped by Species & Island

penguins %>%  group_by(island, species) %>% drop_na() %>% summarise(mean_bill_depth_mm = mean(bill_depth_mm), mamximum_bill_depth = max(bill_depth_mm), minimum_bill_depth = min(bill_depth_mm))
## `summarise()` has grouped output by 'island'. You can override using the
## `.groups` argument.
## # A tibble: 5 x 5
## # Groups:   island [3]
##   island    species   mean_bill_depth_mm mamximum_bill_depth minimum_bill_depth
##   <fct>     <fct>                  <dbl>               <dbl>              <dbl>
## 1 Biscoe    Adelie                  18.4                21.1               16  
## 2 Biscoe    Gentoo                  15.0                17.3               13.1
## 3 Dream     Adelie                  18.2                21.2               15.5
## 4 Dream     Chinstrap               18.4                20.8               16.4
## 5 Torgersen Adelie                  18.5                21.5               15.9

Calculating Maximum, Minimum, Mean Body Mass Grouped by Island

penguins %>%  group_by(island) %>% drop_na() %>% summarise(mean_body_mass = mean(body_mass_g), mamximum_body_mass = max(body_mass_g), minimum_body_mass = min(body_mass_g))
## # A tibble: 3 x 4
##   island    mean_body_mass mamximum_body_mass minimum_body_mass
##   <fct>              <dbl>              <int>             <int>
## 1 Biscoe             4719.               6300              2850
## 2 Dream              3719.               4800              2700
## 3 Torgersen          3709.               4700              2900

Calculating Maximum, Minimum, Mean Body Mass Grouped by Species

penguins %>%  group_by(species) %>% drop_na() %>% summarise(mean_body_mass = mean(body_mass_g), mamximum_body_mass = max(body_mass_g), minimum_body_mass = min(body_mass_g))
## # A tibble: 3 x 4
##   species   mean_body_mass mamximum_body_mass minimum_body_mass
##   <fct>              <dbl>              <int>             <int>
## 1 Adelie             3706.               4775              2850
## 2 Chinstrap          3733.               4800              2700
## 3 Gentoo             5092.               6300              3950

Calculating Maximum, Minimum, Mean Body Mass Grouped by Island & Species

penguins %>%  group_by(island, species) %>% drop_na() %>% summarise(mean_body_mass = mean(body_mass_g), mamximum_body_mass = max(body_mass_g), minimum_body_mass = min(body_mass_g))
## `summarise()` has grouped output by 'island'. You can override using the
## `.groups` argument.
## # A tibble: 5 x 5
## # Groups:   island [3]
##   island    species   mean_body_mass mamximum_body_mass minimum_body_mass
##   <fct>     <fct>              <dbl>              <int>             <int>
## 1 Biscoe    Adelie             3710.               4775              2850
## 2 Biscoe    Gentoo             5092.               6300              3950
## 3 Dream     Adelie             3701.               4650              2900
## 4 Dream     Chinstrap          3733.               4800              2700
## 5 Torgersen Adelie             3709.               4700              2900
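The nine summaries above repeat the same mean/max/min pattern for each measurement. As a possible shortcut, the sketch below computes all three statistics for several columns in one call with dplyr's across(). It is an illustration rather than the method used above: the name species_summary and the "{.fn}_{.col}" naming scheme are assumptions, and it uses na.rm = TRUE per column instead of drop_na(), so means may differ slightly from the tables above (rows with a missing sex value are kept here).

```r
library(dplyr)
library(palmerpenguins)

# Mean, maximum and minimum of three measurements, grouped by species.
# na.rm = TRUE ignores missing values column by column.
species_summary <- penguins %>%
  group_by(species) %>%
  summarise(
    across(
      c(bill_length_mm, bill_depth_mm, body_mass_g),
      list(
        mean = ~ mean(.x, na.rm = TRUE),
        max  = ~ max(.x, na.rm = TRUE),
        min  = ~ min(.x, na.rm = TRUE)
      ),
      .names = "{.fn}_{.col}"  # e.g. mean_body_mass_g
    ),
    .groups = "drop"
  )

print(species_summary)
```

Replacing group_by(species) with group_by(island) or group_by(island, species) gives the other groupings.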

Plotting Number of Penguins per Species

Num_of_species <- penguins %>%
  count(species) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  labs(title = "Number of Penguins per Species")

Num_of_species #Viewing Plot

Plotting Number of Penguins per Species Grouped by Sex

Num_of_species_Gender <- penguins %>%
  drop_na() %>%
  count(sex, species) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  facet_wrap(~sex) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Number of Penguins per Species by Sex")

Num_of_species_Gender # Viewing plot

Plotting Number of Penguins per Species Grouped by Island

Num_of_species_island <- penguins %>%
  drop_na() %>%
  count(island, species) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  facet_wrap(~island) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Number of Penguins per Species on Each Island")

Num_of_species_island # Viewing Plot

Plotting Body Mass vs. Flipper Length

Flipper_body_mass <- ggplot(data = penguins) +
  geom_smooth(mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(mapping = aes(x = body_mass_g, y = flipper_length_mm, color = species)) +
  labs(title = "Body Mass vs. Flipper Length")

Flipper_body_mass #Viewing plot
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Plotting Bill Length (mm) vs. Flipper Length (mm)

Flipper_Bill_len <- ggplot(data = penguins) +
  geom_smooth(mapping = aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point(mapping = aes(x = bill_length_mm, y = flipper_length_mm, color = species)) +
  labs(title = "Bill Length (mm) vs. Flipper Length (mm)")

Flipper_Bill_len #Viewing plot
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Plotting Bill Length (mm) vs. Bill Depth (mm)

Bill_len_depth <- ggplot(data = penguins) +
  geom_smooth(mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  labs(title = "Bill Length (mm) vs. Bill Depth (mm)")

Bill_len_depth #Viewing plot
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
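The plot above fits one loess curve through all penguins, and a single curve across species can blur trends that hold within each species. As a hedged alternative, the sketch below maps color to species for both layers so that geom_smooth() fits one straight line per species. The choice of method = "lm" and the name bill_by_species are assumptions for illustration, not part of the original analysis; na.rm = TRUE drops the missing rows without printing the warnings shown above.

```r
library(ggplot2)
library(palmerpenguins)

# Putting color = species in the top-level aes() groups every layer by species,
# so the smoother is fitted separately for each species.
bill_by_species <- ggplot(
  penguins,
  aes(x = bill_length_mm, y = bill_depth_mm, color = species)
) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE) +
  labs(
    title = "Bill Length (mm) vs. Bill Depth (mm), Linear Fit per Species",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  )

bill_by_species # Viewing plot
```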

Plotting Bill Length (mm) vs. Body Mass (g)

Bill_len_body_mass <- ggplot(data = penguins) +
  geom_smooth(mapping = aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  labs(title = "Bill Length (mm) vs. Body Mass (g)")

Bill_len_body_mass #Viewing plot
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Plotting Flipper Length Frequency Grouped by Species

Flipper_len_frequency <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = flipper_length_mm, fill = species)) +
  labs(title = "Flipper Length Frequency Grouped by Species")

Flipper_len_frequency #Viewing plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

Plotting Body Mass Frequency Grouped by Species

Body_mass_frequency <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = body_mass_g, fill = species)) +
  labs(title = "Body Mass Frequency Grouped by Species")

Body_mass_frequency #Viewing plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
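Both histograms print the message asking for a better value of `binwidth`. One way to respond is to set the bin width explicitly, as in the sketch below. The width of 5 mm, the overlaid (rather than stacked) bars, and the name flipper_hist are assumptions chosen for illustration and should be adjusted to taste.

```r
library(ggplot2)
library(palmerpenguins)

# binwidth = 5 gives each bar a width of 5 mm instead of the default 30 bins.
# position = "identity" overlays the species instead of stacking them,
# and alpha makes the overlapping bars see-through.
flipper_hist <- ggplot(penguins, aes(x = flipper_length_mm, fill = species)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.6, na.rm = TRUE) +
  labs(
    title = "Flipper Length Frequency Grouped by Species",
    x = "Flipper length (mm)",
    y = "Count"
  )

flipper_hist # Viewing plot
```

The same pattern should work for body mass with a larger width, for example binwidth = 250 (grams).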

Creating A Multiplot to view different plots together

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  plots <- c(list(...), plotlist)

  numPlots = length(plots)
  if (is.null(layout)) {
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
} # Helper that arranges several ggplot objects in a grid on one page
multiplot(Num_of_species, Num_of_species_Gender, Num_of_species_island, Flipper_body_mass , cols = 2) #Plot 1
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

multiplot(Flipper_len_frequency, Body_mass_frequency, Flipper_Bill_len, Bill_len_depth, cols = 2) #Plot 2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Removed 2 rows containing missing values (`geom_point()`).
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Removed 2 rows containing missing values (`geom_point()`).
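The hand-written multiplot() helper works, but a layout package can do the same job with less code. The sketch below assumes the patchwork package is installed; it is not used in the original analysis, and the plot names p_mass, p_flipper and combined are illustrative only.

```r
library(ggplot2)
library(patchwork)       # assumption: installed separately, not part of tidyverse
library(palmerpenguins)

# Two small example plots to combine
p_mass <- ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(na.rm = TRUE) +
  labs(title = "Body Mass by Species")

p_flipper <- ggplot(penguins, aes(x = species, y = flipper_length_mm, fill = species)) +
  geom_boxplot(na.rm = TRUE) +
  labs(title = "Flipper Length by Species")

# wrap_plots() arranges any number of ggplot objects in a grid
combined <- wrap_plots(p_mass, p_flipper, ncol = 2)

combined # Viewing plot
```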

Conclusion

While analyzing the data set, I found the following:

After rows with missing values are dropped (333 penguins remain), Biscoe, Dream, and Torgersen islands have 163, 123, and 47 penguins respectively. Across all 344 rows, the three species, Adelie, Chinstrap, and Gentoo, have 152, 68, and 124 penguins respectively.
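The counts quoted above come from two slightly different bases, which the following sketch makes explicit: the island counts are taken after drop_na(), and the species counts from all rows. The names island_counts and species_counts are illustrative.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Penguins per island, complete rows only (as in the island plot)
island_counts <- penguins %>%
  drop_na() %>%
  count(island)

# Penguins per species, all rows (as in the species plot)
species_counts <- penguins %>%
  count(species)

print(island_counts)
print(species_counts)
```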

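One of the key findings of this analysis was the positive relationship between body mass and flipper length. Gentoo penguins had the largest body mass of the three species (mean about 5,092 g, compared with about 3,706 g for Adelie and 3,733 g for Chinstrap) and also the longest flippers. Island itself did not appear to be a determining factor in body mass: Adelie penguins, the only species found on all three islands, averaged about 3,700 g on each, and the higher mean on Biscoe reflects the Gentoo penguins, which were recorded only there.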
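The relationship between body mass and flipper length was judged from the plots. As a rough numerical check, the sketch below computes the Pearson correlation overall and within each species. The names measured, overall_r and by_species are illustrative, and Pearson correlation is an assumption about how one might quantify the relationship, not a step from the analysis above.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Keep only rows where both measurements are present
measured <- penguins %>%
  drop_na(body_mass_g, flipper_length_mm)

# Correlation across all penguins
overall_r <- cor(measured$body_mass_g, measured$flipper_length_mm)
print(overall_r)

# Correlation within each species
by_species <- measured %>%
  group_by(species) %>%
  summarise(
    r = cor(body_mass_g, flipper_length_mm),
    n = n()
  )
print(by_species)
```

Comparing the overall value with the per-species values shows how much of the relationship is driven by differences between species rather than within them.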